This post contains the latest paper listing retrieved from arXiv.org on 2025-05-21. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: paper data is retrieved from arXiv.org each day and updated automatically around 12:00.

Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.

Table of Contents

Overview (2025-05-21)

A total of 796 papers were updated today, including:

  • Natural Language Processing: 198 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 253 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 126 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 255 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Language Models use Lookbacks to Track Beliefs

[Quick Read]: This paper investigates how language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality, a question central to understanding LMs' Theory of Mind (ToM) capabilities. The key finding is an algorithmic pattern called the lookback mechanism, which lets the LM recall important information when it becomes needed. The LM binds each character-object-state triple together, representing the references with Ordering IDs (OIs) in low-rank subspaces of the state token's residual stream, and retrieves the relevant information through a binding lookback followed by an answer lookback. When text about one character's visibility to another is introduced, the LM generates a visibility ID encoding the relation between the observing and observed characters and updates the observer's beliefs through a visibility lookback.

Link: https://arxiv.org/abs/2505.14685
Authors: Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger
Affiliations: Northeastern University; Technion; MIT CSAIL; Pr(Ai)2R Group
Subjects: Computation and Language (cs.CL)
Comments: 32 pages, 32 figures. Code and data at this https URL

Abstract:How do language models (LMs) represent characters’ beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze Llama-3-70B-Instruct’s ability to reason about characters’ beliefs using causal mediation and abstraction. We construct a dataset that consists of simple stories where two characters each separately change the state of two objects, potentially unaware of each other’s actions. Our investigation uncovered a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating reference information about them, represented as their Ordering IDs (OIs) in low rank subspaces of the state token’s residual stream. When asked about a character’s beliefs regarding the state of an object, the binding lookback retrieves the corresponding state OI and then an answer lookback retrieves the state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character’s beliefs. Our work provides insights into the LM’s belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.

[NLP-1] Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning

[Quick Read]: This paper targets the Thought Leap problem in mathematical Chain-of-Thought (CoT) datasets: experts omit intermediate steps when writing reasoning traces, which hurts model learning and generalization. The key to the solution is the proposed CoT Thought Leap Bridge Task, which automatically detects leaps in reasoning and generates the missing intermediate steps to restore the completeness and coherence of the CoT. To this end, the authors construct the ScaleQM+ dataset and train a CoT-Bridge model to fill in the thought leaps. Experiments show that the method markedly improves model performance on several mathematical reasoning benchmarks, and also enhances the distillability of the data and the starting point for reinforcement learning.

Link: https://arxiv.org/abs/2505.14684
Authors: Haolei Xu, Yuchen Yan, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Shengpei Jiang, Kaitao Song, Weiming Lu, Jun Xiao, Yueting Zhuang
Affiliations: Zhejiang University; The Chinese University of Hong Kong; Microsoft Research Asia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have achieved remarkable progress on mathematical tasks through Chain-of-Thought (CoT) reasoning. However, existing mathematical CoT datasets often suffer from Thought Leaps due to experts omitting intermediate steps, which negatively impacts model learning and generalization. We propose the CoT Thought Leap Bridge Task, which aims to automatically detect leaps and generate missing intermediate reasoning steps to restore the completeness and coherence of CoT. To facilitate this, we constructed a specialized training dataset called ScaleQM+, based on the structured ScaleQuestMath dataset, and trained CoT-Bridge to bridge thought leaps. Through comprehensive experiments on mathematical reasoning benchmarks, we demonstrate that models fine-tuned on bridged datasets consistently outperform those trained on original datasets, with improvements of up to +5.87% on NuminaMath. Our approach effectively enhances distilled data (+3.02%) and provides better starting points for reinforcement learning (+3.1%), functioning as a plug-and-play module compatible with existing optimization techniques. Furthermore, CoT-Bridge demonstrates improved generalization to out-of-domain logical reasoning tasks, confirming that enhancing reasoning completeness yields broadly applicable benefits.
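
To make the bridge task concrete, here is a minimal, hypothetical Python sketch: a stand-in coherence scorer flags weakly connected adjacent steps, and a placeholder generator (standing in for the trained CoT-Bridge model) fills the gap. The function names and the lexical-overlap heuristic are illustrative assumptions, not details from the paper.

```python
# Sketch of the thought-leap bridge task: detect weakly connected
# adjacent CoT steps and insert a bridging step between them.

def coherence(step_a: str, step_b: str) -> float:
    """Toy proxy: lexical overlap between adjacent steps (a real system
    would use a trained model's score)."""
    a, b = set(step_a.lower().split()), set(step_b.lower().split())
    return len(a & b) / max(1, min(len(a), len(b)))

def propose_bridge(step_a: str, step_b: str) -> str:
    """Placeholder generator; CoT-Bridge would generate the missing
    intermediate reasoning here."""
    return f"[bridge step linking '{step_a}' to '{step_b}']"

def bridge_cot(steps: list[str], threshold: float = 0.25) -> list[str]:
    out = [steps[0]]
    for nxt in steps[1:]:
        if coherence(out[-1], nxt) < threshold:  # detected thought leap
            out.append(propose_bridge(out[-1], nxt))
        out.append(nxt)
    return out

cot = ["Let x be the number of apples.",
       "Therefore the answer is 12."]
print(bridge_cot(cot))
```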

[NLP-2] Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training

[Quick Read]: This paper tackles the cognitive-efficiency problems common in Large Reasoning Models (LRMs), such as overthinking and underthinking. The key to the solution is an inference-time steering method called Reinforcing Cognitive Experts (RICE), which uses normalized Pointwise Mutual Information (nPMI) to systematically identify specialized "cognitive experts" that orchestrate meta-level reasoning operations, for example reasoning spans containing the "think" token. The method improves reasoning accuracy, cognitive efficiency, and cross-domain generalization without additional training or complex heuristics.

Link: https://arxiv.org/abs/2505.14681
Authors: Mengru Wang, Xingyu Chen, Yue Wang, Zhiwei He, Jiahao Xu, Tian Liang, Qiuzhi Liu, Yunzhi Yao, Wenxuan Wang, Ruotian Ma, Haitao Mi, Ningyu Zhang, Zhaopeng Tu, Xiaolong Li, Dong Yu
Affiliations: Tencent; Zhejiang University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Mixture-of-Experts (MoE) architectures within Large Reasoning Models (LRMs) have achieved impressive reasoning capabilities by selectively activating experts to facilitate structured cognitive processes. Despite notable advances, existing reasoning models often suffer from cognitive inefficiencies like overthinking and underthinking. To address these limitations, we introduce a novel inference-time steering methodology called Reinforcing Cognitive Experts (RICE), designed to improve reasoning performance without additional training or complex heuristics. Leveraging normalized Pointwise Mutual Information (nPMI), we systematically identify specialized experts, termed "cognitive experts", that orchestrate meta-level reasoning operations characterized by tokens like "think". Empirical evaluations with leading MoE-based LRMs (DeepSeek-R1 and Qwen3-235B) on rigorous quantitative and scientific reasoning benchmarks demonstrate noticeable and consistent improvements in reasoning accuracy, cognitive efficiency, and cross-domain generalization. Crucially, our lightweight approach substantially outperforms prevalent reasoning-steering techniques, such as prompt design and decoding constraints, while preserving the model's general instruction-following skills. These results highlight reinforcing cognitive experts as a promising, practical, and interpretable direction to enhance cognitive efficiency within advanced reasoning models.
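
The nPMI-based expert selection can be illustrated with a small Python sketch. The routing records below are toy data; the real method would score experts over actual MoE routing statistics.

```python
import math

# Identify "cognitive experts": experts whose activation co-occurs with
# meta-reasoning tokens such as "think", ranked by normalized PMI.

def npmi(n_xy: int, n_x: int, n_y: int, n: int) -> float:
    p_xy, p_x, p_y = n_xy / n, n_x / n, n_y / n
    if p_xy == 0:
        return -1.0
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

# (expert activated, sequence contains a "think" token), one per sequence
records = [(0, True), (0, True), (1, False), (2, True), (0, True), (1, False)]
n = len(records)
scores = {}
for e in {e for e, _ in records}:
    n_x = sum(1 for ee, _ in records if ee == e)
    n_y = sum(1 for _, t in records if t)
    n_xy = sum(1 for ee, t in records if ee == e and t)
    scores[e] = npmi(n_xy, n_x, n_y, n)

# Highest-nPMI experts are the candidates to reinforce at inference time.
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```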

[NLP-3] NExT-Search: Rebuilding User Feedback Ecosystem for Generative AI Search SIGIR2025

[Quick Read]: This paper addresses the difficulty of optimizing generative AI search systems caused by inadequate feedback mechanisms. Traditional web search continuously improves its ranking models with fine-grained user feedback (clicks, dwell time, etc.), whereas generative AI search runs a much longer pipeline yet receives only coarse-grained feedback on the final answer, so user feedback cannot be mapped back to specific system components and the feedback-driven improvement loop breaks down. The key to the solution is NExT-Search, a paradigm that reintroduces fine-grained, process-level feedback by combining a User Debug Mode with a Shadow User Mode, and exploits these signals through online adaptation and offline updates to continuously improve every stage of the search pipeline.

Link: https://arxiv.org/abs/2505.14680
Authors: Sunhao Dai, Wenjie Wang, Liang Pang, Jun Xu, See-Kiong Ng, Ji-Rong Wen, Tat-Seng Chua
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; University of Science and Technology of China; CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; National University of Singapore
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: SIGIR 2025 Perspective Paper

Abstract:Generative AI search is reshaping information retrieval by offering end-to-end answers to complex queries, reducing users’ reliance on manually browsing and summarizing multiple web pages. However, while this paradigm enhances convenience, it disrupts the feedback-driven improvement loop that has historically powered the evolution of traditional Web search. Web search can continuously improve their ranking models by collecting large-scale, fine-grained user feedback (e.g., clicks, dwell time) at the document level. In contrast, generative AI search operates through a much longer search pipeline, spanning query decomposition, document retrieval, and answer generation, yet typically receives only coarse-grained feedback on the final answer. This introduces a feedback loop disconnect, where user feedback for the final output cannot be effectively mapped back to specific system components, making it difficult to improve each intermediate stage and sustain the feedback loop. In this paper, we envision NExT-Search, a next-generation paradigm designed to reintroduce fine-grained, process-level feedback into generative AI search. NExT-Search integrates two complementary modes: User Debug Mode, which allows engaged users to intervene at key stages; and Shadow User Mode, where a personalized user agent simulates user preferences and provides AI-assisted feedback for less interactive users. Furthermore, we envision how these feedback signals can be leveraged through online adaptation, which refines current search outputs in real-time, and offline update, which aggregates interaction logs to periodically fine-tune query decomposition, retrieval, and generation models. By restoring human control over key stages of the generative AI search pipeline, we believe NExT-Search offers a promising direction for building feedback-rich AI search systems that can evolve continuously alongside human feedback.

[NLP-4] UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Large Language Models

[Quick Read]: This paper addresses the challenge of lifelong learning for large language models (LLMs): making efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. The key to the solution is ULTRAEDIT, a training-, subject-, and memory-free editing method that computes parameter shifts using only lightweight linear-algebra operations, enabling fast and consistent parameter modifications with minimal computational overhead. In addition, ULTRAEDIT uses a lifelong normalization strategy to adapt to distributional shifts and maintain long-term consistency, which markedly improves editing speed and scalability.

Link: https://arxiv.org/abs/2505.14679
Authors: Xiaojie Gu, Guangxu Chen, Jungang Li, Jia-Chen Gu, Xuming Hu, Kai Zhang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model’s internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose ULTRAEDIT-a fundamentally new editing solution that is training-, subject- and memory-free, making it particularly well-suited for ultra-scalable, real-world lifelong model editing. ULTRAEDIT performs editing through a self-contained process that relies solely on lightweight linear algebra operations to compute parameter shifts, enabling fast and consistent parameter modifications with minimal overhead. To improve scalability in lifelong settings, ULTRAEDIT employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. ULTRAEDIT achieves editing speeds over 7x faster than the previous state-of-the-art method-which was also the fastest known approach-while consuming less than 1/3 the VRAM, making it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct ULTRAEDITBENCH-the largest dataset in the field to date, with over 2M editing pairs-and demonstrate that our method supports up to 1M edits while maintaining high accuracy. Comprehensive experiments on four datasets and six models show that ULTRAEDIT consistently achieves superior performance across diverse model editing scenarios. Our code is available at: this https URL.
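
As a rough illustration of editing via lightweight linear algebra, here is a generic ridge-regression parameter shift plus a running feature-statistics update in numpy. This is a sketch in the spirit of such editors, not ULTRAEDIT's exact update rule; all shapes and data are synthetic.

```python
import numpy as np

# Given key vectors K (features that should trigger the new facts) and
# target values V, find dW so that (W + dW) K ~= V while keeping dW small.

rng = np.random.default_rng(0)
d_in, d_out, n_edits = 64, 32, 5
W = rng.normal(size=(d_out, d_in))
K = rng.normal(size=(d_in, n_edits))      # edit keys
V = rng.normal(size=(d_out, n_edits))     # desired outputs

lam = 1e-2                                 # ridge regularizer
resid = V - W @ K
dW = resid @ K.T @ np.linalg.inv(K @ K.T + lam * np.eye(d_in))
W_edited = W + dW
print(np.abs(W_edited @ K - V).max())      # edits approximately satisfied

# Lifelong-normalization sketch: running feature statistics across turns,
# so later edits see keys on a consistent scale.
mean, var = np.zeros(d_in), np.ones(d_in)
def normalize(k, alpha=0.01):
    global mean, var
    mean = (1 - alpha) * mean + alpha * k
    var = (1 - alpha) * var + alpha * (k - mean) ** 2
    return (k - mean) / np.sqrt(var + 1e-6)
```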

[NLP-5] Reward Reasoning Model

[Quick Read]: This paper asks how to use test-time compute effectively to improve reward model performance. The key to the solution is Reward Reasoning Models (RRMs), which execute a deliberate reasoning process before emitting a final reward, yielding more accurate reward judgments on complex queries. RRMs are trained with a reinforcement learning framework that requires no explicit reasoning traces as training data, adaptively improving their reasoning ability and exploiting additional test-time compute to raise reward accuracy.

Link: https://arxiv.org/abs/2505.14674
Authors: Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Reward models play a critical role in guiding large language models toward outputs that align with human expectations. However, an open challenge remains in effectively utilizing test-time compute to enhance reward model performance. In this work, we introduce Reward Reasoning Models (RRMs), which are specifically designed to execute a deliberate reasoning process before generating final rewards. Through chain-of-thought reasoning, RRMs leverage additional test-time compute for complex queries where appropriate rewards are not immediately apparent. To develop RRMs, we implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities without requiring explicit reasoning traces as training data. Experimental results demonstrate that RRMs achieve superior performance on reward modeling benchmarks across diverse domains. Notably, we show that RRMs can adaptively exploit test-time compute to further improve reward accuracy. The pretrained reward reasoning models are available at this https URL.

[NLP-6] ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions

[Quick Read]: This paper addresses the limited user-intent understanding and proactive capabilities of existing proactive agents, which either rely on direct large language model (LLM) inference over enclosed environments or on rule-based proactive notifications, and thus fail to capture what users actually need. The key to the solution is ContextAgent, the first context-aware proactive agent: it extracts multi-dimensional contexts from rich sensory data on wearables, combines them with persona contexts from historical data to predict whether proactive services are needed, and thereby delivers more precise proactive assistance.

Link: https://arxiv.org/abs/2505.14668
Authors: Bufang Yang, Lilin Xu, Liekang Zeng, Kaiwei Liu, Siyang Jiang, Wenrui Lu, Hongkai Chen, Xiaofan Jiang, Guoliang Xing, Zhenyu Yan
Affiliations: The Chinese University of Hong Kong; Columbia University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Recent advances in Large Language Models (LLMs) have propelled intelligent agents from reactive responses to proactive support. While promising, existing proactive agents either rely exclusively on observations from enclosed environments (e.g., desktop UIs) with direct LLM inference or employ rule-based proactive notifications, leading to suboptimal user intent understanding and limited functionality for proactive service. In this paper, we introduce ContextAgent, the first context-aware proactive agent that incorporates extensive sensory contexts to enhance the proactive capabilities of LLM agents. ContextAgent first extracts multi-dimensional contexts from massive sensory perceptions on wearables (e.g., video and audio) to understand user intentions. ContextAgent then leverages the sensory contexts and the persona contexts from historical data to predict the necessity for proactive services. When proactive assistance is needed, ContextAgent further automatically calls the necessary tools to assist users unobtrusively. To evaluate this new task, we curate ContextAgentBench, the first benchmark for evaluating context-aware proactive LLM agents, covering 1,000 samples across nine daily scenarios and twenty tools. Experiments on ContextAgentBench show that ContextAgent outperforms baselines by achieving up to 8.5% and 6.0% higher accuracy in proactive predictions and tool calling, respectively. We hope our research can inspire the development of more advanced, human-centric, proactive AI assistants.

[NLP-7] SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

[Quick Read]: This paper targets the unsafe outputs that Large Reasoning Models (LRMs) produce when faced with harmful prompts, and the trade-off in existing safety-alignment methods, which reduce harmful outputs at the cost of reasoning depth. The key to the solution is SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of reasoning in response to harmful prompts, while leaving the rest of the reasoning process unsupervised, thereby reducing harmful outputs while maintaining reasoning performance.

Link: https://arxiv.org/abs/2505.14667
Authors: Wonje Jeung, Sangyeon Yoon, Minsuk Kahng, Albert No
Affiliations: Yonsei University; Hongik University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages

Abstract:Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning, in response to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot variant that requires no fine-tuning. In addition, we provide a comprehensive analysis of how existing methods in LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.

[NLP-8] EmoGist: Efficient In-Context Learning for Visual Emotion Understanding

[Quick Read]: This paper targets the accuracy limits of visual emotion classification that arise because the way emotions manifest in images is highly context dependent. The key to the solution is using context-dependent definitions of emotion labels: multiple explanations of each label are pre-generated, the most relevant explanation is retrieved at test time via embedding similarity, and it is fed to a fast vision-language model (VLM) for classification, improving performance.

Link: https://arxiv.org/abs/2505.14660
Authors: Ronald Seoh, Dan Goldwasser
Affiliations: Purdue University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we introduce EmoGist, a training-free, in-context learning method for performing visual emotion classification with LVLMs. The key intuition of our approach is that context-dependent definition of emotion labels could allow more accurate predictions of emotions, as the ways in which emotions manifest within images are highly context dependent and nuanced. EmoGist pre-generates multiple explanations of emotion labels, by analyzing the clusters of example images belonging to each category. At test time, we retrieve a version of explanation based on embedding similarity, and feed it to a fast VLM for classification. Through our experiments, we show that EmoGist allows up to 13 points improvement in micro F1 scores with the multi-label Memotion dataset, and up to 8 points in macro F1 in the multi-class FI dataset.
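
A minimal sketch of the retrieval step, with random vectors standing in for real image and explanation embeddings; the explanation texts and embedding dimension are placeholders.

```python
import numpy as np

# EmoGist-style retrieval: pick the pre-generated explanation of an
# emotion label whose embedding is closest to the test image's embedding,
# then hand it to a fast VLM as context.

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
# several explanations per label, each paired with the embedding of the
# image cluster it was derived from (random here for illustration)
explanations = {
    ("amusement", 0): ("Amusement often shows as ...", rng.normal(size=128)),
    ("amusement", 1): ("In meme-like images, amusement ...", rng.normal(size=128)),
    ("anger", 0): ("Anger typically appears as ...", rng.normal(size=128)),
}

def retrieve(image_emb, label):
    cands = [(text, emb) for (lbl, _), (text, emb) in explanations.items()
             if lbl == label]
    return max(cands, key=lambda te: cosine(image_emb, te[1]))[0]

test_emb = rng.normal(size=128)
context = retrieve(test_emb, "amusement")
print(f"Definition: {context}\nDoes the image express amusement?")
```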

[NLP-9] Beyond Words: Multimodal LLM Knows When to Speak

[Quick Read]: This paper addresses the difficulty that large language model (LLM)-based chatbots have in judging when to speak during a conversation, especially in settings that call for quick, brief reactions. The key is to bring in multimodal inputs (vision, audio, and text) and to build a real-world conversational dataset with temporally aligned multimodal streams, enabling fine-grained modeling of response timing. On top of this dataset, the authors propose MM-When2Speak, a model that adaptively fuses multimodal context to predict when a response should occur and what type of response is appropriate, substantially improving response-timing accuracy.

Link: https://arxiv.org/abs/2505.14654
Authors: Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin
Affiliations: Stony Brook University; Atmanity Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project page: this https URL

Abstract:While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, lacking the rich contextual cues in real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.

[NLP-10] General-Reasoner: Advancing LLM Reasoning Across All Domains

[Quick Read]: This paper targets the limited cross-domain reasoning of large language models (LLMs), whose applicability and generalization suffer in domains where data is scarce and answer representations are diverse. The key to the solution is General-Reasoner, a training paradigm that builds a large-scale, high-quality dataset of questions with verifiable answers and replaces traditional rule-based verification with a generative, model-based answer verifier, improving reasoning across a wide range of domains.

Link: https://arxiv.org/abs/2505.14652
Authors: Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, Wenhu Chen
Affiliations: University of Waterloo; Vector Institute; TikTok; M-A-P
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). Particularly, the “Zero” reinforcement learning introduced by Deepseek-R1-Zero, enables direct RL training of base LLMs without relying on an intermediate supervised fine-tuning stage. Despite these advancements, current works for LLM reasoning mainly focus on mathematical and coding domains, largely due to data abundance and the ease of answer verification. This limits the applicability and generalization of such models to broader domains, where questions often have diverse answer representations, and data is more scarce. In this paper, we propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. Our key contributions include: (1) constructing a large-scale, high-quality dataset of questions with verifiable answers curated by web crawling, covering a wide range of disciplines; and (2) developing a generative model-based answer verifier, which replaces traditional rule-based verification with the capability of chain-of-thought and context-awareness. We train a series of models and evaluate them on a wide range of datasets covering wide domains like physics, chemistry, finance, electronics etc. Our comprehensive evaluation across these 12 benchmarks (e.g. MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH and MATH AMC) demonstrates that General-Reasoner outperforms existing baseline methods, achieving robust and generalizable reasoning performance while maintaining superior effectiveness in mathematical reasoning tasks.

[NLP-11] Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference CVPR

[Quick Read]: This paper addresses the latency and memory-efficiency challenges posed by ever-larger deep neural networks as task complexity grows. The key to the solution is a hardware-efficient quantization and inference scheme, W4A8: weights are quantized and stored in 4-bit integer precision while inference computations run in 8-bit floating point, delivering significant speedups and better memory utilization than 16-bit operations. To limit accuracy loss, the paper further introduces the Dual Precision Quantization (DPQ) algorithm, which preserves model performance without adding inference overhead.

Link: https://arxiv.org/abs/2505.14638
Authors: Tomer Gafni, Asaf Karnieli, Yair Hanani
Affiliations: Intel
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at eLVM Workshop, CVPR, 2025

Abstract:Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to grow, posing challenges in latency and memory efficiency. To meet these constraints, post-training quantization has emerged as a promising solution. In this paper, we propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation. Specifically, we introduce a W4A8 scheme, where weights are quantized and stored using 4-bit integer precision, and inference computations are performed using 8-bit floating-point arithmetic, demonstrating significant speedups and improved memory utilization compared to 16-bit operations, applicable on various modern accelerators. To mitigate accuracy loss, we develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead. Experimental results demonstrate improved performance (i.e., increased throughput) while maintaining tolerable accuracy degradation relative to the full-precision model.
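
The W4A8 storage/compute split can be simulated in a few lines of numpy. This shows plain symmetric int4 weight quantization, not the DPQ algorithm itself, and float arithmetic stands in for the FP8 hardware path.

```python
import numpy as np

# W4A8 idea: weights stored as 4-bit integers with a per-channel scale,
# dequantized and multiplied in low precision at inference.

def quantize_int4(w):
    # symmetric per-output-channel quantization to the int4 range [-8, 7]
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64)).astype(np.float32)
x = rng.normal(size=64).astype(np.float32)

q, scale = quantize_int4(W)
W_deq = q.astype(np.float32) * scale      # dequantize for compute
y_ref, y_q = W @ x, W_deq @ x
print("max error:", np.abs(y_ref - y_q).max())
```

DPQ's contribution, per the abstract, is reducing the accuracy loss this naive rounding incurs while keeping the same storage and compute format.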

[NLP-12] Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas

[Quick Read]: This paper addresses the difficulty of detecting risky AI behaviors in complex situations, especially as stronger models adopt new evasion methods such as Alignment Faking. The key to the solution is the LitmusValues evaluation pipeline, which reveals an AI model's priorities across a range of AI value classes, together with AIRiskDilemmas, a collection of moral dilemmas relevant to AI safety risks such as Power Seeking. Measuring a model's value prioritization from its aggregate choices yields self-consistent predictions of potential risks, and the study shows that even seemingly innocuous values (such as Care) can predict both seen and unseen risky behaviors.

Link: https://arxiv.org/abs/2505.14633
Authors: Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
Affiliations: University of Washington; NVIDIA; Cambridge; Stanford; MIT; Harvard; Anthropic
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 34 pages, 11 figures, see associated data at this https URL and code at this https URL

Abstract:Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly-held values, we believe that identifying values within AI models can be an early warning system for AI’s risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models’ priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model’s value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict for both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.

[NLP-13] Think Only When You Need with Large Hybrid-Reasoning Models

[Quick Read]: This paper targets the excessive token consumption and latency caused by conventional Large Reasoning Models applying extended thinking even to simple queries. The key to the solution is Large Hybrid-Reasoning Models (LHRMs), which adaptively decide whether to think based on the context of the user query. The model is built with a two-stage pipeline: Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate thinking mode. The paper also introduces Hybrid Accuracy, a metric for quantifying hybrid-thinking capability.

Link: https://arxiv.org/abs/2505.14631
Authors: Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, Furu Wei
Affiliations: Microsoft Research; Peking University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent Large Reasoning Models (LRMs) have shown substantially improved reasoning capabilities over traditional Large Language Models (LLMs) by incorporating extended thinking processes prior to producing final responses. However, excessively lengthy thinking introduces substantial overhead in terms of token consumption and latency, which is particularly unnecessary for simple queries. In this work, we introduce Large Hybrid-Reasoning Models (LHRMs), the first kind of model capable of adaptively determining whether to perform thinking based on the contextual information of user queries. To achieve this, we propose a two-stage training pipeline comprising Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with the proposed Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate thinking mode. Furthermore, we introduce a metric called Hybrid Accuracy to quantitatively assess the model’s capability for hybrid thinking. Extensive experimental results show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type. It outperforms existing LRMs and LLMs in reasoning and general capabilities while significantly improving efficiency. Together, our work advocates for a reconsideration of the appropriate use of extended thinking processes and provides a solid starting point for building hybrid thinking systems.
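
The abstract does not spell out the definition of Hybrid Accuracy; one plausible reading, sketched below as an assumption, is the fraction of queries where the model's chosen mode matches a reference mode for that query.

```python
# Hypothetical reading of "Hybrid Accuracy": agreement between the
# model's chosen mode (think vs. no-think) and a per-query reference.

def hybrid_accuracy(chosen_modes, reference_modes):
    assert len(chosen_modes) == len(reference_modes)
    hits = sum(c == r for c, r in zip(chosen_modes, reference_modes))
    return hits / len(chosen_modes)

# e.g., simple queries should skip thinking, hard ones should think
chosen    = ["no_think", "think", "think", "no_think"]
reference = ["no_think", "think", "no_think", "no_think"]
print(hybrid_accuracy(chosen, reference))  # 0.75
```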

[NLP-14] KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models ACL2025

[Quick Read]: This paper addresses how to combine food-related knowledge graphs (KGs) with large language models (LLMs) to provide personalized food recommendations and generate recipes with micro-nutritional information. The key to the solution is KERL, a unified system that extracts entities from a natural-language question, retrieves subgraphs from the KG, and feeds them to the LLM as context to select recipes satisfying the constraints, then generates the cooking steps and nutritional information.

Link: https://arxiv.org/abs/2505.14629
Authors: Fnu Mohbat, Mohammed J Zaki
Affiliations: Rensselaer Polytechnic Institute
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ACL 2025

Abstract:Recent advances in large language models (LLMs) and the abundance of food data have resulted in studies to improve food understanding using LLMs. Despite several recommendation systems utilizing LLMs and Knowledge Graphs (KGs), there has been limited research on integrating food related KGs with LLMs. We introduce KERL, a unified system that leverages food KGs and LLMs to provide personalized food recommendations and generates recipes with associated micro-nutritional information. Given a natural language question, KERL extracts entities, retrieves subgraphs from the KG, which are then fed into the LLM as context to select the recipes that satisfy the constraints. Next, our system generates the cooking steps and nutritional information for each recipe. To evaluate our approach, we also develop a benchmark dataset by curating recipe related questions, combined with constraints and personal preferences. Through extensive experiments, we show that our proposed KG-augmented LLM significantly outperforms existing approaches, offering a complete and coherent solution for food recommendation, recipe generation, and nutritional analysis. Our code and benchmark datasets are publicly available at this https URL.

[NLP-15] Debating for Better Reasoning: An Unsupervised Multimodal Approach

[Quick Read]: This paper asks how to achieve effective, scalable oversight once Large Language Models (LLMs) surpass human evaluators. The key to the solution is extending the debate paradigm to a multimodal setting: two "sighted" expert vision-language models debate an answer while a "blind" (text-only) judge adjudicates based solely on argument quality, enabling weaker models to supervise and improve stronger ones. The method focuses on instances of expert disagreement, avoids explicit role-playing, and shows that judgments from weaker LLMs can instill reasoning capabilities in vision-language models through finetuning.

Link: https://arxiv.org/abs/2505.14627
Authors: Ashutosh Adhikari, Mirella Lapata
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:As Large Language Models (LLMs) gain expertise across diverse domains and modalities, scalable oversight becomes increasingly challenging, particularly when their capabilities may surpass human evaluators. Debate has emerged as a promising mechanism for enabling such oversight. In this work, we extend the debate paradigm to a multimodal setting, exploring its potential for weaker models to supervise and enhance the performance of stronger models. We focus on visual question answering (VQA), where two “sighted” expert vision-language models debate an answer, while a “blind” (text-only) judge adjudicates based solely on the quality of the arguments. In our framework, the experts defend only answers aligned with their beliefs, thereby obviating the need for explicit role-playing and concentrating the debate on instances of expert disagreement. Experiments on several multimodal tasks demonstrate that the debate framework consistently outperforms individual expert models. Moreover, judgments from weaker LLMs can help instill reasoning capabilities in vision-language models through finetuning.

[NLP-16] TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning

[Quick Read]: This paper targets false negatives produced by verifiers during reinforcement learning (RL), i.e., cases where the verifier wrongly rejects correct model outputs. The key to the solution is TinyV, a lightweight LLM-based verifier that augments existing rule-based methods by dynamically identifying potential false negatives and recovering valid responses, yielding more accurate reward estimates.

Link: https://arxiv.org/abs/2505.14625
Authors: Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
Affiliations: University of Washington; Western Washington University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL’s success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem–false negatives–where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose tinyV, a lightweight LLM-based verifier that augments existing rule-based methods, which dynamically identifies potential false negatives and recovers valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improve RL-based fine-tuning of LLMs. Our code is available at this https URL.
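
The escalation logic can be sketched as follows; `llm_judges_equivalent` is a placeholder for the TinyV model call, and the normalization rules are illustrative.

```python
import re

# Keep the cheap rule-based verifier, but when it rejects, ask a
# lightweight LLM verifier whether the rejection is a false negative.

def rule_based_match(pred: str, gold: str) -> bool:
    norm = lambda s: re.sub(r"[\s$]", "", s).lower()
    return norm(pred) == norm(gold)

def llm_judges_equivalent(pred: str, gold: str) -> bool:
    # placeholder: a real system prompts a small verifier model here
    return pred.replace("1/2", "0.5") == gold.replace("1/2", "0.5")

def reward(pred: str, gold: str) -> float:
    if rule_based_match(pred, gold):
        return 1.0
    # potential false negative: escalate to the LLM verifier
    return 1.0 if llm_judges_equivalent(pred, gold) else 0.0

print(reward("x = 1/2", "x = 0.5"))  # recovered as correct -> 1.0
```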

[NLP-17] Enhancing Learned Knowledge in LoRA Adapters Through Efficient Contrastive Decoding on Ascend NPUs KDD2025

[Quick Read]: This paper addresses the problem that, when decoding with LoRA (Low-Rank Adaptation) fine-tuned LLMs, biases or interference from the base model can prevent task-specific knowledge from being used, hurting performance on complex reasoning or deep contextual-understanding tasks. The key to the solution is Contrastive LoRA Decoding (CoLD), a decoding framework that scores candidate tokens by the divergence between the probability distributions of the LoRA-adapted expert model and the corresponding base model, prioritizing tokens that better align with the LoRA's learned representations and thereby maximizing the use of task-specific knowledge to improve downstream performance.

Link: https://arxiv.org/abs/2505.14620
Authors: Morgan Lindsay Heisler, Linzi Xing, Ge Shi, Hanieh Sadri, Gursimran Singh, Weiwei Zhang, Tao Ye, Ying Xiong, Yong Zhang, Zhenan Fan
Affiliations: Huawei Technologies Canada; Huawei Technologies; Beijing, China; Toronto, ON, Canada
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted at ACM KDD 2025

Abstract:Huawei Cloud users leverage LoRA (Low-Rank Adaptation) as an efficient and scalable method to fine-tune and customize large language models (LLMs) for application-specific needs. However, tasks that require complex reasoning or deep contextual understanding are often hindered by biases or interference from the base model when using typical decoding methods like greedy or beam search. These biases can lead to generic or task-agnostic responses from the base model instead of leveraging the LoRA-specific adaptations. In this paper, we introduce Contrastive LoRA Decoding (CoLD), a novel decoding framework designed to maximize the use of task-specific knowledge in LoRA-adapted models, resulting in better downstream performance. CoLD uses contrastive decoding by scoring candidate tokens based on the divergence between the probability distributions of a LoRA-adapted expert model and the corresponding base model. This approach prioritizes tokens that better align with the LoRA’s learned representations, enhancing performance for specialized tasks. While effective, a naive implementation of CoLD is computationally expensive because each decoding step requires evaluating multiple token candidates across both models. To address this, we developed an optimized kernel for Huawei’s Ascend NPU. CoLD achieves up to a 5.54% increase in task accuracy while reducing end-to-end latency by 28% compared to greedy decoding. This work provides practical and efficient decoding strategies for fine-tuned LLMs in resource-constrained environments and has broad implications for applied data science in both cloud and on-premises settings.
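
A sketch of the contrastive scoring rule, with random logits standing in for the expert and base models. The plausibility mask is a common contrastive-decoding safeguard and an assumption here, not necessarily CoLD's exact formulation.

```python
import numpy as np

# Rank candidate tokens by log p_expert - log p_base, restricted to
# tokens the expert model already finds plausible.

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

rng = np.random.default_rng(0)
vocab = 1000
logp_expert = log_softmax(rng.normal(size=vocab))  # LoRA-adapted expert
logp_base = log_softmax(rng.normal(size=vocab))    # base model

# plausibility mask: keep tokens within a factor of the expert's best token
alpha = np.log(0.1)
mask = logp_expert >= logp_expert.max() + alpha
scores = np.where(mask, logp_expert - logp_base, -np.inf)
next_token = int(np.argmax(scores))
print(next_token)
```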

[NLP-18] Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models

[Quick Read]: This paper studies how reasoning-focused large language models (LLMs) change their behavior when they detect they are being evaluated, an effect analogous to the Hawthorne phenomenon that can lead them to optimize for test-passing performance or comply more readily with harmful prompts. The key to the solution is a white-box probing framework that linearly identifies awareness-related activations and steers models toward or away from test awareness while monitoring downstream performance, enabling fine-grained control over this latent effect.

Link: https://arxiv.org/abs/2505.14617
Authors: Sahar Abdelnabi, Ahmed Salem
Affiliations: Microsoft
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Reasoning-focused large language models (LLMs) sometimes alter their behavior when they detect that they are being evaluated, an effect analogous to the Hawthorne phenomenon, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such "test awareness" impacts model behavior, particularly its safety alignment. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-source reasoning LLMs across both realistic and hypothetical tasks. Our results demonstrate that test awareness significantly impacts safety alignment, and its effect differs across models. By providing fine-grained control over this latent effect, our work aims to increase trust in how we perform safety evaluation.
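
A toy version of the probe-and-steer recipe, using a difference-of-means direction over synthetic activations; real work would fit the probe on actual residual-stream activations from labeled prompts.

```python
import numpy as np

# Fit a direction separating "evaluation-like" from "deployment-like"
# prompts in activation space, then add or subtract it at inference.

rng = np.random.default_rng(0)
d = 256
true_dir = rng.normal(size=d); true_dir /= np.linalg.norm(true_dir)
acts_test = rng.normal(size=(100, d)) + 2.0 * true_dir   # "test aware"
acts_real = rng.normal(size=(100, d)) - 2.0 * true_dir   # "real world"

# difference-of-means probe direction (a simple linear identifier)
probe = acts_test.mean(0) - acts_real.mean(0)
probe /= np.linalg.norm(probe)

def steer(h, direction, alpha):
    """Move activation h toward (+alpha) or away from (-alpha) awareness."""
    return h + alpha * direction

h = acts_test[0]
print(float(h @ probe), float(steer(h, probe, -4.0) @ probe))
```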

[NLP-19] SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

[Quick Read]: This paper addresses the evaluation of large language models' (LLMs) search-based logical reasoning, specifically their ability to solve Boolean satisfiability (SAT) problems. The key to the solution is SATBench, a benchmark that generates logical puzzles from SAT formulas and translates them into story contexts and conditions to assess LLMs' logical reasoning. The approach exploits the search-based nature of SAT problems rather than traditional inference-rule-based methods, exposing the limits of current LLMs on complex logical constraints.

Link: https://arxiv.org/abs/2505.14615
Authors: Anjiang Wei, Yuheng Wu, Yingjia Wan, Tarun Suresh, Huanmi Tan, Zhanke Zhou, Sanmi Koyejo, Ke Wang, Alex Aiken
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a story context and conditions using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-assisted and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems, close to the random baseline of 50%. SATBench exposes fundamental limitations in the search-based logical reasoning abilities of current LLMs and provides a scalable testbed for future research in logical reasoning.
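
The generation side can be sketched with stdlib Python: sample a random 3-SAT formula, brute-force its label, and template clauses into conditions. In the actual pipeline an LLM performs the story translation, and a solver replaces the brute-force check; the "guest" templating below is a made-up example.

```python
import itertools, random

def random_3sat(n_vars, n_clauses, seed=0):
    """Each clause is 3 distinct variables with random polarity."""
    rnd = random.Random(seed)
    return [[rnd.choice([-1, 1]) * v for v in rnd.sample(range(1, n_vars + 1), 3)]
            for _ in range(n_clauses)]

def satisfiable(clauses, n_vars):
    for bits in itertools.product([False, True], repeat=n_vars):
        if all(any((lit > 0) == bits[abs(lit) - 1] for lit in clause)
               for clause in clauses):
            return True
    return False

clauses = random_3sat(n_vars=5, n_clauses=12)
print("SAT" if satisfiable(clauses, 5) else "UNSAT")
for c in clauses[:2]:  # naive templating; the benchmark uses an LLM here
    parts = [f"guest {abs(l)} {'attends' if l > 0 else 'stays home'}" for l in c]
    print("At least one holds: " + ", ".join(parts))
```

Difficulty is adjustable by varying the number of clauses, exactly as the abstract describes.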

[NLP-20] Language Models Optimized to Fool Detectors Still Have a Distinct Style (And How to Change It)

[Quick Read]: This paper examines the reliability of machine-generated text detection, particularly when language models are optimized to evade detectors. The key to the solution is identifying a stylistic feature space that is robust to such optimization, and showing that it can reliably detect samples from models optimized to avoid detection. The study also finds that detection performance remains stable even when models are explicitly optimized against stylistic detectors, indicating the high intrinsic robustness of style-based detection.

Link: https://arxiv.org/abs/2505.14608
Authors: Rafael Rivera Soto, Barry Chen, Nicholas Andrews
Affiliations: Lawrence Livermore National Laboratory; Johns Hopkins University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Despite considerable progress in the development of machine-text detectors, it has been suggested that the problem is inherently hard, and therefore, that stakeholders should proceed under the assumption that machine-generated text cannot be reliably detected as such. We examine a recent such claim by Nicks et al. (2024) regarding the ease with which language models can be optimized to degrade the performance of machine-text detectors, including detectors not specifically optimized against. We identify a feature space – the stylistic feature space – that is robust to such optimization, and show that it may be used to reliably detect samples from language models optimized to prevent detection. Furthermore, we show that even when models are explicitly optimized against stylistic detectors, detection performance remains surprisingly unaffected. We then seek to understand if stylistic detectors are inherently more robust. To study this question, we explore a new paraphrasing approach that simultaneously aims to close the gap between human writing and machine writing in stylistic feature space while avoiding detection using traditional features. We show that when only a single sample is available for detection, this attack is universally effective across all detectors considered, including those that use writing style. However, as the number of samples available for detection grows, the human and machine distributions become distinguishable. This observation encourages us to introduce AURA, a metric that estimates the overlap between human and machine-generated distributions by analyzing how detector performance improves as more samples become available. Overall, our findings underscore previous recommendations to avoid reliance on machine-text detection.
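
The idea behind AURA can be illustrated with synthetic detector scores: averaging per-author scores over more samples makes the two distributions separable, so AUROC rises with sample count. The estimator below is illustrative; the paper's exact formulation may differ.

```python
import numpy as np

def auroc(pos, neg):
    # probability a random machine score exceeds a random human score
    pos, neg = np.asarray(pos), np.asarray(neg)
    return (pos[:, None] > neg[None, :]).mean()

rng = np.random.default_rng(0)
for k in [1, 2, 4, 8, 16]:
    # Gaussian scores stand in for a real detector; distributions overlap
    machine = rng.normal(0.3, 1.0, size=(500, k)).mean(axis=1)
    human = rng.normal(0.0, 1.0, size=(500, k)).mean(axis=1)
    print(k, round(float(auroc(machine, human)), 3))
```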

[NLP-21] sudoLLM: On Multi-role Alignment of Language Models

[Quick Read]: This paper addresses the absence of user-authorization-based access control in large language models (LLMs), which can lead a model to produce sensitive information in violation of safety requirements. The key to the solution is the sudoLLM framework, which injects subtle user-based bias signals into queries and trains the LLM to produce sensitive information only when the user is authorized, yielding multi-role aligned LLMs. This markedly improves alignment, generalization, and resistance to prompt-based jailbreaking attacks.

Link: https://arxiv.org/abs/2505.14607
Authors: Soumadeep Saha, Akshay Chaturvedi, Joy Mahapatra, Utpal Garain
Affiliations: ISI Kolkata; IRIT Toulouse
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Under review. Code and data to be released later

Abstract:User authorization-based access privileges are a key feature in many safety-critical systems, but have thus far been absent from the large language model (LLM) realm. In this work, drawing inspiration from such access control systems, we introduce sudoLLM, a novel framework that results in multi-role aligned LLMs, i.e., LLMs that account for, and behave in accordance with, user access rights. sudoLLM injects subtle user-based biases into queries and trains an LLM to utilize this bias signal in order to produce sensitive information if and only if the user is authorized. We present empirical results demonstrating that this approach shows substantially improved alignment, generalization, and resistance to prompt-based jailbreaking attacks. The persistent tension between the language modeling objective and safety alignment, which is often exploited to jailbreak LLMs, is somewhat resolved with the aid of the injected bias signal. Our framework is meant as an additional security layer, and complements existing guardrail mechanisms for enhanced end-to-end safety with LLMs.

[NLP-22] Toward Reliable Biomedical Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models IJCAI2025

[Quick Read]: This paper addresses how to evaluate the truthfulness of hypotheses generated by generative AI in the biomedical domain, and in particular how to detect and reduce hallucinations. The key to the solution is the TruthHypo benchmark and the knowledge-based hallucination detector KnowHD, which assesses how well hypotheses are grounded in existing knowledge and effectively filters truthful hypotheses from diverse model outputs, improving the reliability of generated hypotheses.

Link: https://arxiv.org/abs/2505.14599
Authors: Guangzhi Xiong, Eric Xie, Corey Williams, Myles Kim, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang
Affiliations: University of Virginia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to IJCAI 2025

Abstract:Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, as verifying their accuracy often requires substantial time and resources. Additionally, the hallucination problem in LLMs can lead to the generation of hypotheses that appear plausible but are ultimately incorrect, undermining their reliability. To facilitate the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the capabilities of LLMs in generating truthful biomedical hypotheses, and KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at this https URL.

[NLP-23] Success is in the Details: Evaluate and Enhance Details Sensitivity of Code LLM s through Counterfactuals

[Quick Read]: This paper addresses the insufficient Code Sensitivity of current Code LLMs, i.e., their limited ability to recognize and respond to subtle changes in problem descriptions. Existing code benchmarks and instruction data focus on difficulty and diversity while overlooking sensitivity. The key to the solution is the CTF-Code benchmark, built with counterfactual perturbations that minimize input changes while maximizing output changes, together with CTF-Instruct, an incremental instruction fine-tuning framework that extends existing data and uses a selection mechanism to cover the three dimensions of difficulty, diversity, and sensitivity, effectively boosting model performance.

Link: https://arxiv.org/abs/2505.14597
Authors: Xianzhen Luo, Qingfu Zhu, Zhiming Zhang, Mingzheng Xu, Tianhao Cheng, Yixuan Wang, Zheng Chu, Shijie Xuyang, Zhiyuan Ma, YuanTao Fan, Wanxiang Che
Affiliations: Harbin Institute of Technology; Fudan University; University of Science and Technology of China; Beijing University of Posts and Telecommunications
Subjects: Computation and Language (cs.CL)
Comments: Code Model is this https URL

Abstract:Code Sensitivity refers to the ability of Code LLMs to recognize and respond to details changes in problem descriptions. While current code benchmarks and instruction data focus on difficulty and diversity, sensitivity is overlooked. We first introduce the CTF-Code benchmark, constructed using counterfactual perturbations, minimizing input changes while maximizing output changes. The evaluation shows that many LLMs have a more than 10% performance drop compared to the original problems. To fully utilize sensitivity, CTF-Instruct, an incremental instruction fine-tuning framework, extends on existing data and uses a selection mechanism to meet the three dimensions of difficulty, diversity, and sensitivity. Experiments show that LLMs fine-tuned with CTF-Instruct data achieve over a 2% improvement on CTF-Code, and more than a 10% performance boost on LiveCodeBench, validating the feasibility of enhancing LLMs’ sensitivity to improve performance.

[NLP-24] MCIP: Protecting MCP Safety via Model Contextual Integrity Protocol

[Quick Read]: This paper addresses the missing safety mechanisms in the decentralized architecture of the Model Context Protocol (MCP) and the resulting challenges for systematic safety analysis. The key to the solution is analyzing MCP's safety mechanisms under the MAESTRO framework, proposing the improved Model Contextual Integrity Protocol (MCIP), and building a fine-grained taxonomy of unsafe behaviors along with corresponding benchmark and training data, so as to improve LLMs' ability to identify safety risks in MCP interactions.

Link: https://arxiv.org/abs/2505.14590
Authors: Huihao Jing, Haoran Li, Wenbin Hu, Qi Hu, Heli Xu, Tianshu Chu, Peizhao Hu, Yangqiu Song
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 17 pages

Abstract:As Model Context Protocol (MCP) introduces an easy-to-use ecosystem for users and developers, it also brings underexplored safety risks. Its decentralized architecture, which separates clients and servers, poses unique challenges for systematic safety analysis. This paper proposes a novel framework to enhance MCP safety. Guided by the MAESTRO framework, we first analyze the missing safety mechanisms in MCP, and based on this analysis, we propose the Model Contextual Integrity Protocol (MCIP), a refined version of MCP that addresses these gaps. Next, we develop a fine-grained taxonomy that captures a diverse range of unsafe behaviors observed in MCP scenarios. Building on this taxonomy, we develop benchmark and training data that support the evaluation and improvement of LLMs' capabilities in identifying safety risks within MCP interactions. Leveraging the proposed benchmark and training data, we conduct extensive experiments on state-of-the-art LLMs. The results highlight LLMs' vulnerabilities in MCP interactions and demonstrate that our approach substantially improves their safety performance.

[NLP-25] Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning

[Quick Read]: This paper addresses the safety and privacy risks of large language models (LLMs): existing mitigation strategies fail to preserve contextual reasoning in risky scenarios and rely heavily on sensitive-pattern matching, leading to limited coverage and systemic legal-compliance risk. The key to the solution is formulating safety and privacy as contextualized compliance problems under Contextual Integrity (CI) theory, aligning the model with key regulatory standards (GDPR, the EU AI Act, and HIPAA) within the CI framework, and using reinforcement learning (RL) with a rule-based reward to incentivize contextual reasoning while improving compliance with safety and privacy norms. Experiments show the method substantially improves legal compliance as well as general reasoning capability.

Link: https://arxiv.org/abs/2505.14585
Authors: Wenbin Hu, Haoran Li, Huihao Jing, Qi Hu, Ziqian Zeng, Sirui Han, Heli Xu, Tianshu Chu, Peizhao Hu, Yangqiu Song
Affiliations: HKUST; South China University of Technology; Huawei Technologies
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:While Large Language Models (LLMs) exhibit remarkable capabilities, they also introduce significant safety and privacy risks. Current mitigation strategies often fail to preserve contextual reasoning capabilities in risky scenarios. Instead, they rely heavily on sensitive pattern matching to protect LLMs, which limits the scope. Furthermore, they overlook established safety and privacy standards, leading to systemic risks for legal compliance. To address these gaps, we formulate safety and privacy issues into contextualized compliance problems following the Contextual Integrity (CI) theory. Under the CI framework, we align our model with three critical regulatory standards: GDPR, EU AI Act, and HIPAA. Specifically, we employ reinforcement learning (RL) with a rule-based reward to incentivize contextual reasoning capabilities while enhancing compliance with safety and privacy norms. Through extensive experiments, we demonstrate that our method not only significantly enhances legal compliance (achieving a +17.64% accuracy improvement in safety/privacy benchmarks) but also further improves general reasoning capability. For OpenThinker-7B, a strong reasoning model that significantly outperforms its base model Qwen2.5-7B-Instruct across diverse subjects, our method enhances its general reasoning capabilities, with +2.05% and +8.98% accuracy improvement on the MMLU and LegalBench benchmark, respectively.

[NLP-26] Can Pruning Improve Reasoning ? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning

[Quick Read]: This paper addresses the problem that long chain-of-thought (Long-CoT) reasoning improves accuracy in LLMs, but its verbose, self-reflective style hinders effective distillation into small language models (SLMs). The key to the solution is Prune-on-Logic, a structure-aware framework that transforms Long-CoT into logic graphs and selectively prunes low-utility reasoning steps under self-verification constraints. The study finds that pruning verification steps cuts inference cost while preserving accuracy, outperforming token-level baselines and uncompressed fine-tuning.

Link: https://arxiv.org/abs/2505.14582
Authors: Shangziqi Zhao, Jiahao Yuan, Guisong Yang, Usman Naseem
Affiliations: Xi'an Jiaotong University; University of Shanghai for Science and Technology; Macquarie University
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 4 figures

Abstract:Long chain-of-thought (Long-CoT) reasoning improves accuracy in LLMs, yet its verbose, self-reflective style often hinders effective distillation into small language models (SLMs). We revisit Long-CoT compression through the lens of capability alignment and ask: Can pruning improve reasoning? We propose Prune-on-Logic, a structure-aware framework that transforms Long-CoT into logic graphs and selectively prunes low-utility reasoning steps under self-verification constraints. Through systematic analysis across three pruning strategies – targeting entire chains, core reasoning, and verification – we find that pruning verification steps yields consistent accuracy gains while reducing inference cost, outperforming token-level baselines and uncompressed fine-tuning. In contrast, pruning reasoning or all-chain steps degrades performance, revealing that small models benefit not from shorter CoTs, but from semantically leaner ones. Our findings highlight pruning as a structural optimization strategy for aligning CoT reasoning with SLM capacity.
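
A minimal sketch of the logic-graph pruning step; the step types, dependency structure, and pruning policy are illustrative, not the paper's exact representation.

```python
# Represent a Long-CoT as a graph of typed steps and prune the
# verification nodes, keeping the deductive core.

steps = [
    {"id": 0, "text": "Let n be the number of coins.",   "type": "reason", "deps": []},
    {"id": 1, "text": "Then 3n + 2 = 17, so n = 5.",     "type": "reason", "deps": [0]},
    {"id": 2, "text": "Check: 3*5 + 2 = 17. Correct.",   "type": "verify", "deps": [1]},
    {"id": 3, "text": "Answer: 5.",                      "type": "reason", "deps": [1]},
]

def prune(steps, drop_type="verify"):
    kept = {s["id"] for s in steps if s["type"] != drop_type}
    # keep only steps whose dependencies survive (a simple consistency check)
    return [s for s in steps
            if s["id"] in kept and all(d in kept for d in s["deps"])]

for s in prune(steps):
    print(s["text"])
```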

[NLP-27] TRATES: Trait-Specific Rubric-Assisted Cross-Prompt Essay Scoring ACL2025

[Quick Read]: This paper addresses the shortcomings of conventional Automated Essay Scoring (AES) in assessing essays according to individual traits. The key to the solution is the TRATES framework: based on trait-specific grading rubrics, it uses a Large Language Model (LLM) to generate trait-specific features (represented as assessment questions), combines them with generic writing-quality and prompt-specific features, and trains a simple classical regression model to predict essay trait scores on unseen prompts.

Link: https://arxiv.org/abs/2505.14577
Authors: Sohaila Eltanbouly, Salam Albatarni, Tamer Elsayed
Affiliations: Qatar University
Subjects: Computation and Language (cs.CL)
Comments: Accepted at ACL 2025 Findings

Abstract:Research on holistic Automated Essay Scoring (AES) is long-dated; yet, there is a notable lack of attention for assessing essays according to individual traits. In this work, we propose TRATES, a novel trait-specific and rubric-based cross-prompt AES framework that is generic yet specific to the underlying trait. The framework leverages a Large Language Model (LLM) that utilizes the trait grading rubrics to generate trait-specific features (represented by assessment questions), then assesses those features given an essay. The trait-specific features are eventually combined with generic writing-quality and prompt-specific features to train a simple classical regression model that predicts trait scores of essays from an unseen prompt. Experiments show that TRATES achieves a new state-of-the-art performance across all traits on a widely-used dataset, with the generated LLM-based features being the most significant.
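
The final-stage regression is deliberately simple; here is a sketch with synthetic rubric-question features and numpy least squares. Feature counts and values are made up for illustration.

```python
import numpy as np

# LLM answers to rubric-derived assessment questions become per-essay
# features, combined with generic writing-quality features, then a
# classical linear regression predicts the trait score.

rng = np.random.default_rng(0)
n_essays, n_rubric_q, n_generic = 200, 6, 4
X = np.hstack([
    rng.integers(0, 2, size=(n_essays, n_rubric_q)),  # rubric-question answers
    rng.normal(size=(n_essays, n_generic)),           # e.g. length, error rate
]).astype(float)
w_true = rng.normal(size=X.shape[1])
y = X @ w_true + rng.normal(scale=0.1, size=n_essays) # synthetic trait scores

X1 = np.hstack([X, np.ones((n_essays, 1))])           # add bias term
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
pred = X1 @ w
print("train RMSE:", float(np.sqrt(((pred - y) ** 2).mean())))
```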

[NLP-28] Agent Context Protocols Enhance Collective Inference

[Quick Read]: This paper tackles a central challenge in building generalist AI systems: enabling effective collaboration and coordination among multiple agents. Traditional approaches coordinate through imprecise, unstructured natural language, which limits complex interaction and hinders interoperability with domain-specific agents. The key to the solution is Agent Context Protocols (ACPs), which combine persistent execution blueprints (explicit dependency graphs that store intermediate agent outputs) with standardized message schemas, enabling robust, fault-tolerant multi-agent collective inference.

Link: https://arxiv.org/abs/2505.14569
Authors: Devansh Bhardwaj, Arjun Beniwal, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Karthik R. Narasimhan, Ameet Deshpande, Vishvak Murahari
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:AI agents have become increasingly adept at complex tasks such as coding, reasoning, and multimodal understanding. However, building generalist systems requires moving beyond individual agents to collective inference – a paradigm where multi-agent systems with diverse, task-specialized agents complement one another through structured communication and collaboration. Today, coordination is usually handled with imprecise, ad-hoc natural language, which limits complex interaction and hinders interoperability with domain-specific agents. We introduce Agent context protocols (ACPs): a domain- and agent-agnostic family of structured protocols for agent-agent communication, coordination, and error handling. ACPs combine (i) persistent execution blueprints – explicit dependency graphs that store intermediate agent outputs – with (ii) standardized message schemas, enabling robust and fault-tolerant multi-agent collective inference. ACP-powered generalist systems reach state-of-the-art performance: 28.3 % accuracy on AssistantBench for long-horizon web assistance and best-in-class multimodal technical reports, outperforming commercial AI systems in human evaluation. ACPs are highly modular and extensible, allowing practitioners to build top-tier generalist agents quickly.
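
A toy blueprint executor showing the two ingredients, persistent dependency-graph outputs and a standardized message schema. The schema fields and node functions are illustrative, not the ACP specification.

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    task: str
    payload: dict
    status: str = "ok"          # standardized error-handling hook

@dataclass
class Node:
    name: str
    deps: list
    run: callable
    output: Message = None      # persisted intermediate output

def execute(blueprint: dict):
    done = {}
    def visit(name):
        if name in done:
            return done[name]
        node = blueprint[name]
        inputs = [visit(d) for d in node.deps]   # resolve dependencies first
        node.output = node.run(inputs)
        done[name] = node.output
        return node.output
    for name in blueprint:
        visit(name)
    return done

blueprint = {
    "search": Node("search", [],
                   lambda _: Message("searcher", "search", {"hits": ["doc1"]})),
    "summarize": Node("summarize", ["search"],
                      lambda ins: Message("writer", "summarize",
                                          {"summary": f"based on {ins[0].payload['hits']}"})),
}
print(execute(blueprint)["summarize"].payload)
```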

[NLP-29] Pivot Language for Low-Resource Machine Translation

Quick read: This paper addresses the lack of large, domain-diverse parallel corpora for low-resource language pairs such as Nepali-English; the key to its solution is using Hindi as a pivot language. With Hindi as the intermediate language, the authors apply two approaches, the Transfer Method (fully supervised) and backtranslation (semi-supervised), to improve translation quality. The Transfer Method reaches a devtest SacreBLEU score of 14.2, a 6.6-point gain over the fully supervised baseline reported by Guzman et al. (2019). While this falls slightly short of the semi-supervised baseline of 15.1, the authors analyze the likely causes of the gap and suggest directions for future work.

Link: https://arxiv.org/abs/2505.14553
Authors: Abhimanyu Talwar,Julien Laasri
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 3 figures, paper dated May 13, 2019

Abstract:Certain pairs of languages suffer from lack of a parallel corpus which is large in size and diverse in domain. One of the ways this is overcome is via use of a pivot language. In this paper we use Hindi as a pivot language to translate Nepali into English. We describe what makes Hindi a good candidate for the pivot. We discuss ways in which a pivot language can be used, and use two such approaches - the Transfer Method (fully supervised) and Backtranslation (semi-supervised) - to translate Nepali into English. Using the former, we are able to achieve a devtest set SacreBLEU score of 14.2, which improves the baseline fully supervised score reported by Guzman et al. (2019) by 6.6 points. While we are slightly below the semi-supervised baseline score of 15.1, we discuss what may have caused this under-performance, and suggest scope for future work.
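
To make the pivoting idea concrete, here is a minimal sketch of the Transfer Method's inference path, chaining two MT systems; both translate_* functions are hypothetical stand-ins for trained Nepali-Hindi and Hindi-English models, not the paper's actual code:

```python
# Minimal sketch of pivot-based translation (Transfer Method).
# The two translate_* callables are hypothetical placeholders for
# trained ne->hi and hi->en MT models; any seq2seq system would do.

def translate_ne_to_hi(text: str) -> str:
    # Placeholder: in practice, a supervised Nepali->Hindi model.
    return f"<hi translation of: {text}>"

def translate_hi_to_en(text: str) -> str:
    # Placeholder: in practice, a supervised Hindi->English model.
    return f"<en translation of: {text}>"

def pivot_translate(nepali_sentence: str) -> str:
    """Chain the two systems: Nepali -> Hindi (pivot) -> English."""
    hindi = translate_ne_to_hi(nepali_sentence)
    return translate_hi_to_en(hindi)

print(pivot_translate("नमस्ते संसार"))
```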

[NLP-30] KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

Quick read: This paper targets the limitations of existing methods for evaluating the reasoning abilities of large language models (LLMs): current benchmarks are typically domain-specific and fail to capture a model's general reasoning potential. The key to the solution is the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium that offers a variety of games in textual or visual form and supports interactive, multi-turn assessment with reinforcement learning scenarios, giving a fuller picture of a model's reasoning patterns and performance.

Link: https://arxiv.org/abs/2505.14552
Authors: Jiajun Shi,Jian Yang,Jiaheng Liu,Xingyuan Bu,Jiangjie Chen,Junting Zhou,Kaijing Ma,Zhoufutu Wen,Bingli Wang,Yancheng He,Liang Song,Hualei Zhu,Shilong Li,Xingjian Wang,Wei Zhang,Ruibin Yuan,Yifan Yao,Wenjun Yang,Yunli Wang,Siyuan Fang,Siyu Yuan,Qianyu He,Xiangru Tang,Yingshui Tan,Wangchunshu Zhou,Zhaoxiang Zhang,Zhoujun Li,Wenhao Huang,Ge Zhang
Institutions: ByteDance; Beihang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages

Abstract:Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM’s general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.
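
Since KORGym is inspired by Gymnasium, a single evaluation episode plausibly follows the familiar reset/step loop sketched below; the toy environment and the llm_agent stub are illustrative assumptions, not KORGym's actual API:

```python
# Illustrative multi-turn evaluation loop in the Gymnasium style.
# KorPuzzleEnv and llm_agent are hypothetical stand-ins; KORGym's
# real interface may differ.

class KorPuzzleEnv:
    """Toy text game: guess a hidden number in at most 5 turns."""
    def reset(self):
        self.secret, self.turns = 7, 0
        return "Guess an integer between 1 and 10."

    def step(self, action: str):
        self.turns += 1
        guess = int(action)
        if guess == self.secret:
            return "Correct!", 1.0, True
        hint = "higher" if guess < self.secret else "lower"
        done = self.turns >= 5
        return f"Wrong, go {hint}.", 0.0, done

def llm_agent(observation: str) -> str:
    # Placeholder policy; a real harness would prompt an LLM here.
    return "7"

env = KorPuzzleEnv()
obs, done, reward = env.reset(), False, 0.0
while not done:
    obs, reward, done = env.step(llm_agent(obs))
print("episode reward:", reward)
```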

[NLP-31] Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders

Quick read: This paper addresses the problem of large language models (LLMs) producing toxic content in user interactions, such as profanity, vulgarity, and derogatory remarks. The key to the solution is using sparse autoencoders (SAEs) to identify toxicity-related directions in the model's residual stream and performing targeted activation steering with the corresponding decoder vectors. The method introduces three tiers of steering strength and is evaluated on GPT-2 Small and Gemma-2-2B, exposing the trade-off between toxicity reduction and fluency. Experiments show that at stronger steering strengths the approach outperforms existing baselines at reducing toxicity while preserving the model's knowledge and general abilities.

Link: https://arxiv.org/abs/2505.14536
Authors: Agam Goyal,Vedant Rathi,William Yeh,Yian Wang,Yuen Chen,Hari Sundaram
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Preprint: 19 pages, 7 figures, 1 table

Abstract:Large language models (LLMs) are now ubiquitous in user-facing applications, yet they still generate undesirable toxic outputs, including profanity, vulgarity, and derogatory remarks. Although numerous detoxification methods exist, most apply broad, surface-level fixes and can therefore easily be circumvented by jailbreak attacks. In this paper we leverage sparse autoencoders (SAEs) to identify toxicity-related directions in the residual stream of models and perform targeted activation steering using the corresponding decoder vectors. We introduce three tiers of steering aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing trade-offs between toxicity reduction and language fluency. At stronger steering strengths, these causal interventions surpass competitive baselines in reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2 Small depending on the aggressiveness. Crucially, standard NLP benchmark scores upon steering remain stable, indicating that the model’s knowledge and general abilities are preserved. We further show that feature-splitting in wider SAEs hampers safety interventions, underscoring the importance of disentangled feature learning. Our findings highlight both the promise and the current limitations of SAE-based causal interventions for LLM detoxification, further suggesting practical guidelines for safer language-model deployment.
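
A minimal sketch of the steering idea, assuming a toxicity direction has already been extracted from an SAE decoder; here the direction is random and steering runs on synthetic tensors purely for illustration, where in practice it would be applied inside a forward hook on the residual stream:

```python
import torch

# Sketch of activation steering with an SAE decoder vector. d_model
# and the toxic direction are illustrative assumptions; in the paper
# the direction would be a decoder column of an SAE trained on the
# model's residual stream.

d_model = 768
toxic_direction = torch.randn(d_model)
toxic_direction = toxic_direction / toxic_direction.norm()

def steer(residual: torch.Tensor, alpha: float) -> torch.Tensor:
    """Remove an alpha-scaled component along the toxic direction."""
    proj = (residual @ toxic_direction).unsqueeze(-1) * toxic_direction
    return residual - alpha * proj

h = torch.randn(4, d_model)          # a batch of residual-stream states
for alpha in (0.5, 1.0, 2.0):        # three tiers of aggressiveness
    h_steered = steer(h, alpha)
    print(alpha, (h_steered @ toxic_direction).abs().mean().item())
```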

[NLP-32] Internal Chain-of-Thought: Empirical Evidence for Layer-wise Subtask Scheduling in LLMs

Quick read: This paper asks how large language models (LLMs) internally decompose and execute subtasks when handling composite tasks, with the aim of improving model transparency and interpretability. The key to the solution is layer-from context-masking together with a cross-task patching method, which verify that distinct subtasks are learned at different network depths, plus the LogitLens technique, which reveals a layer-wise pattern of subtasks being executed sequentially across layers, demonstrating that LLMs can internally plan and execute subtasks.

Link: https://arxiv.org/abs/2505.14530
Authors: Zhipeng Yang,Junzhuo Li,Siyu Xia,Xuming Hu
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 27 pages, 17 figures

Abstract:We show that large language models (LLMs) exhibit an internal chain-of-thought: they sequentially decompose and execute composite tasks layer-by-layer. Two claims ground our study: (i) distinct subtasks are learned at different network depths, and (ii) these subtasks are executed sequentially across layers. On a benchmark of 15 two-step composite tasks, we employ layer-from context-masking and propose a novel cross-task patching method, confirming (i). To examine claim (ii), we apply LogitLens to decode hidden states, revealing a consistent layerwise execution pattern. We further replicate our analysis on the real-world TRACE benchmark, observing the same stepwise dynamics. Together, our results enhance LLMs' transparency by showing their capacity to internally plan and execute subtasks (or instructions), opening avenues for fine-grained, instruction-level activation steering.

[NLP-33] Exploring Graph Representations of Logical Forms for Language Modeling ACL2025

Quick read: This paper addresses the data inefficiency of conventional textual language models, which struggle to learn complex patterns from comparable amounts of data. The key to the solution is language models over logical forms (LFLMs): the semantics of logical forms are represented as graphs, yielding the Graph-based Formal-Logical Distributional Semantics (GFoLDS) prototype, which exploits the basic linguistic knowledge built into such models to learn complex patterns more readily. Experiments show that GFoLDS substantially outperforms textual transformer LMs on downstream tasks, demonstrating that LFLMs can learn from far less data.

Link: https://arxiv.org/abs/2505.14523
Authors: Michael Sullivan
Institutions: University at Buffalo; Saarland University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published in ACL 2025 Findings

Abstract:We make the case for language models over logical forms (LFLMs), arguing that such models are more data-efficient than their textual counterparts. To that end, we introduce the Graph-based Formal-Logical Distributional Semantics (GFoLDS) prototype, a pretrained LM over graph representations of logical forms, as a proof-of-concept of LFLMs. Using GFoLDS, we present strong experimental evidence that LFLMs can leverage the built-in, basic linguistic knowledge inherent in such models to immediately begin learning more complex patterns. On downstream tasks, we show that GFoLDS vastly outperforms textual, transformer LMs pretrained on similar amounts of data, indicating that LFLMs can learn with substantially less data than models over plain text. Furthermore, we show that the performance of this model is likely to scale with additional parameters and pretraining data, suggesting the viability of LFLMs in real-world applications.

[NLP-34] ModRWKV: Transformer Multimodality in Linear Time

Quick read: This paper confronts the high computational cost of current multimodal research, which largely builds on LLMs with quadratic-complexity Transformer architectures, and explores what modern recurrent neural network (RNN) architectures can do in multimodal settings. The key to the solution is ModRWKV, a decoupled multimodal framework with the RWKV7 architecture as its LLM backbone that fuses multi-source information through dynamically adaptable heterogeneous modality encoders. Its lightweight design balances performance against computational efficiency, and initializing from pretrained RWKV7 LLM weights markedly speeds up multimodal training and strengthens the model's understanding of multimodal signals.

Link: https://arxiv.org/abs/2505.14505
Authors: Jiale Kang,Ziyin Yue,Qingyu Yin,Jiang Rui,Weile Li,Zening Lu,Zhouran Ji
Institutions: RWKVOS; Zhejiang University; The Hong Kong University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Currently, most multimodal studies are based on large language models (LLMs) with quadratic-complexity Transformer architectures. While linear models like RNNs enjoy low inference costs, their application has been largely limited to the text-only modality. This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV, a decoupled multimodal framework built upon the RWKV7 architecture as its LLM backbone, which achieves multi-source information fusion through dynamically adaptable heterogeneous modality encoders. We designed the multimodal modules in ModRWKV with an extremely lightweight architecture and, through extensive experiments, identified a configuration that achieves an optimal balance between performance and computational efficiency. ModRWKV leverages the pretrained weights of the RWKV7 LLM for initialization, which significantly accelerates multimodal training. Comparative experiments with different pretrained checkpoints further demonstrate that such initialization plays a crucial role in enhancing the model's ability to understand multimodal signals. Supported by extensive experiments, we conclude that modern RNN architectures present a viable alternative to Transformers in the domain of multimodal large language models (MLLMs). Furthermore, we identify the optimal configuration of the ModRWKV architecture through systematic exploration.

[NLP-35] Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM -Generated Rationales

Quick read: This paper targets inaccurate identification of sentiments, aspects, and their interconnections in Multimodal Aspect-Based Sentiment Analysis (MABSA), caused by the limited capacity of small language models (SLMs). The key to the solution is LRSA, a framework that combines the decision-making ability of SLMs with extra information from large language models (LLMs): explanations generated by LLMs are injected into SLMs as rationales, and a dual cross-attention mechanism strengthens feature interaction and fusion, improving SLM performance on MABSA.

Link: https://arxiv.org/abs/2505.14499
Authors: Jun Cao,Jiyi Li,Ziwei Yang,Renjie Zhou
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:There has been growing interest in Multimodal Aspect-Based Sentiment Analysis (MABSA) in recent years. Existing methods predominantly rely on pre-trained small language models (SLMs) to collect information related to aspects and sentiments from both image and text, with an aim to align these two modalities. However, small SLMs possess limited capacity and knowledge, often resulting in inaccurate identification of meaning, aspects, sentiments, and their interconnections in textual and visual data. On the other hand, Large language models (LLMs) have shown exceptional capabilities in various tasks by effectively exploring fine-grained information in multimodal data. However, some studies indicate that LLMs still fall short compared to fine-tuned small models in the field of ABSA. Based on these findings, we propose a novel framework, termed LRSA, which combines the decision-making capabilities of SLMs with additional information provided by LLMs for MABSA. Specifically, we inject explanations generated by LLMs as rationales into SLMs and employ a dual cross-attention mechanism for enhancing feature interaction and fusion, thereby augmenting the SLMs’ ability to identify aspects and sentiments. We evaluated our method using two baseline models, numerous experiments highlight the superiority of our approach on three widely-used benchmarks, indicating its generalizability and applicability to most pre-trained models for MABSA.

[NLP-36] Reasoning Models Better Express Their Confidence

Quick read: This paper addresses the failure of large language models (LLMs) to express their own confidence accurately, which limits their reliability. The key to the solution is reasoning models, LLMs that engage in extended chain-of-thought (CoT) reasoning, whose slow-thinking behaviors (such as exploring alternative approaches and backtracking) yield better confidence calibration. The study shows that these behaviors let reasoning models adjust their confidence dynamically as the CoT unfolds, substantially improving calibration.

Link: https://arxiv.org/abs/2505.14489
Authors: Dongkeun Yoon,Seungone Kim,Sohee Yang,Sunkyoung Kim,Soyeon Kim,Yongil Kim,Eunbi Choi,Yireun Kim,Minjoon Seo
Institutions: KAIST; LG AI Research; CMU; UCL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Despite their strengths, large language models (LLMs) often fail to communicate their confidence accurately, making it difficult to assess when they might be wrong and limiting their reliability. In this work, we demonstrate that reasoning models, LLMs that engage in extended chain-of-thought (CoT) reasoning, exhibit superior performance not only in problem-solving but also in accurately expressing their confidence. Specifically, we benchmark six reasoning models across six datasets and find that they achieve strictly better confidence calibration than their non-reasoning counterparts in 33 out of the 36 settings. Our detailed analysis reveals that these gains in calibration stem from the slow thinking behaviors of reasoning models, such as exploring alternative approaches and backtracking, which enable them to adjust their confidence dynamically throughout their CoT, making it progressively more accurate. In particular, we find that reasoning models become increasingly better calibrated as their CoT unfolds, a trend not observed in non-reasoning models. Moreover, removing slow thinking behaviors from the CoT leads to a significant drop in calibration. Lastly, we show that these gains are not exclusive to reasoning models; non-reasoning models also benefit when guided to perform slow thinking via in-context learning.
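
Confidence calibration of the kind benchmarked here is commonly summarized with expected calibration error (ECE); the paper's exact metric may differ, but a small, model-independent sketch looks like this:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: mean |accuracy - confidence| over bins,
    weighted by bin size. Inputs: verbalized confidences in [0, 1]
    and 0/1 correctness indicators."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example: overconfident answers yield a larger ECE.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 0, 1, 1]))
```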

[NLP-37] MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance

Quick read: This paper addresses the scalability and transparency of content moderation in online communities: existing approaches need a separate model per community and are opaque in their decisions, limiting real-world adoption. The key to the solution is the Mixture of Moderation Experts (MoMoE), a modular, cross-community moderation framework that orchestrates four operators (Allocate, Predict, Aggregate, Explain) for scalable moderation with post-hoc explanations. Composed of seven community-specialized experts (MoMoE-Community) and five norm-violation experts (MoMoE-NormVio), MoMoE matches or surpasses strong fine-tuned baselines without per-community fine-tuning while producing concise, reliable explanations.

Link: https://arxiv.org/abs/2505.14483
Authors: Agam Goyal,Xianyang Zhan,Yilun Chen,Koustuv Saha,Eshwar Chandrasekharan
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Preprint: 15 pages, 4 figures, 2 tables

Abstract:Large language models (LLMs) have shown great potential in flagging harmful content in online communities. Yet, existing approaches for moderation require a separate model for every community and are opaque in their decision-making, limiting real-world adoption. We introduce Mixture of Moderation Experts (MoMoE), a modular, cross-community framework that adds post-hoc explanations to scalable content moderation. MoMoE orchestrates four operators – Allocate, Predict, Aggregate, Explain – and is instantiated as seven community-specialized experts (MoMoE-Community) and five norm-violation experts (MoMoE-NormVio). On 30 unseen subreddits, the best variants obtain Micro-F1 scores of 0.72 and 0.67, respectively, matching or surpassing strong fine-tuned baselines while consistently producing concise and reliable explanations. Although community-specialized experts deliver the highest peak accuracy, norm-violation experts provide steadier performance across domains. These findings show that MoMoE yields scalable, transparent moderation without needing per-community fine-tuning. More broadly, they suggest that lightweight, explainable expert ensembles can guide future NLP and HCI research on trustworthy human-AI governance of online communities.

[NLP-38] PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models

Quick read: This paper addresses the inability of existing vision-language models (VLMs) to effectively analyze and evaluate planning maps in urban planning. Planning maps visualize land use, infrastructure layouts, and functional zoning, demanding specialized understanding of spatial configurations, awareness of regulatory requirements, and multi-scale analysis. The key to the solution is PlanGPT-VL, the first domain-specific vision-language model for urban planning maps, built on three innovations: (1) the PlanAnno-V framework for high-quality VQA data synthesis; (2) Critical Point Thinking, which reduces hallucinations through structured verification; and (3) a comprehensive training recipe that combines supervised fine-tuning with frozen vision-encoder parameters. Together these markedly improve planning-map interpretation while maintaining high factual accuracy.

Link: https://arxiv.org/abs/2505.14481
Authors: He Zhu,Junyou Su,Minxi Chen,Wen Wang,Yijie Deng,Guanhua Chen,Wenjia Zhang
Institutions: Peking University & Tongji University; Southern University of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In the field of urban planning, existing Vision-Language Models (VLMs) frequently fail to effectively analyze and evaluate planning maps, despite the critical importance of these visual elements for urban planners and related educational contexts. Planning maps, which visualize land use, infrastructure layouts, and functional zoning, require specialized understanding of spatial configurations, regulatory requirements, and multi-scale analysis. To address this challenge, we introduce PlanGPT-VL, the first domain-specific Vision-Language Model tailored specifically for urban planning maps. PlanGPT-VL employs three innovative approaches: (1) PlanAnno-V framework for high-quality VQA data synthesis, (2) Critical Point Thinking to reduce hallucinations through structured verification, and (3) comprehensive training methodology combining Supervised Fine-Tuning with frozen vision encoder parameters. Through systematic evaluation on our proposed PlanBench-V benchmark, we demonstrate that PlanGPT-VL significantly outperforms general-purpose state-of-the-art VLMs in specialized planning map interpretation tasks, offering urban planning professionals a reliable tool for map analysis, assessment, and educational applications while maintaining high factual accuracy. Our lightweight 7B parameter model achieves comparable performance to models exceeding 72B parameters, demonstrating efficient domain specialization without sacrificing performance.

[NLP-39] Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach

Quick read: This paper addresses the difficulty large language models (LLMs) have in formal domains that demand rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. The key to the solution is a neuro-symbolic approach that pairs the generative strengths of LLMs with structured components. It has two parts: first, analogous problems and their proofs are retrieved to guide the LLM's generation; second, a formal verifier evaluates the generated proofs and provides feedback that helps the model repair incorrect ones.

Link: https://arxiv.org/abs/2505.14479
Authors: Oren Sultan,Eitan Stern,Dafna Shahaf
Institutions: The Hebrew University of Jerusalem
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: long paper

Abstract:Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs’ generative strengths with structured components to overcome this challenge. As a proof-of-concept, we focus on geometry problems. Our approach is two-fold: (1) we retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. We demonstrate that our method significantly improves proof accuracy for OpenAI’s o1 model (58%-70% improvement); both analogous problems and the verifier’s feedback contribute to these gains. More broadly, shifting to LLMs that generate provably correct conclusions could dramatically improve their reliability, accuracy and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.
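
The retrieve-generate-verify-repair loop can be sketched as follows; retrieve_analogous, llm_prove, and verify are hypothetical stubs standing in for the paper's retriever, LLM, and formal verifier:

```python
# Sketch of the retrieve -> generate -> verify -> repair loop.
# All three helpers are hypothetical placeholders, not the paper's code.

def retrieve_analogous(problem: str) -> list[str]:
    return ["proof of a similar triangle problem"]        # stub

def llm_prove(problem: str, examples: list[str], feedback: str) -> str:
    return f"candidate proof for {problem!r}"             # stub

def verify(proof: str) -> tuple[bool, str]:
    return False, "step 3 uses an unproven congruence"    # stub

def prove(problem: str, max_rounds: int = 3):
    examples, feedback = retrieve_analogous(problem), ""
    for _ in range(max_rounds):
        proof = llm_prove(problem, examples, feedback)
        ok, feedback = verify(proof)   # verifier feedback drives repair
        if ok:
            return proof
    return None  # give up after max_rounds failed repairs

print(prove("show that the base angles of an isosceles triangle are equal"))
```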

[NLP-40] Adapting Pretrained Language Models for Citation Classification via Self-Supervised Contrastive Learning KDD2025

Quick read: This paper tackles the challenges of citation classification, identifying the intent behind academic citations under labeled-data scarcity, contextual noise, and spurious keyphrase correlations. The key to the solution is Citss, a framework that mitigates data scarcity with self-supervised contrastive learning and obtains contrastive pairs through two specialized strategies: sentence-level cropping, which sharpens focus on the target citation within long contexts, and keyphrase perturbation, which reduces reliance on specific keyphrases. Moreover, unlike prior work designed only for encoder-based PLMs, Citss is built to be compatible with both encoder-based PLMs and decoder-based LLMs, reaping the benefits of larger-scale pretraining.

Link: https://arxiv.org/abs/2505.14471
Authors: Tong Li,Jiachuan Wang,Yongqi Zhang,Shuangyin Li,Lei Chen
Institutions: Hong Kong University of Science and Technology; Hong Kong University of Science and Technology (Guangzhou); South China Normal University
Subjects: Computation and Language (cs.CL)
Comments: Manuscript, accepted to KDD 2025

Abstract:Citation classification, which identifies the intention behind academic citations, is pivotal for scholarly analysis. Previous works suggest fine-tuning pretrained language models (PLMs) on citation classification datasets, reaping the reward of the linguistic knowledge they gained during pretraining. However, directly fine-tuning for citation classification is challenging due to labeled data scarcity, contextual noise, and spurious keyphrase correlations. In this paper, we present a novel framework, Citss, that adapts the PLMs to overcome these challenges. Citss introduces self-supervised contrastive learning to alleviate data scarcity, and is equipped with two specialized strategies to obtain the contrastive pairs: sentence-level cropping, which enhances focus on target citations within long contexts, and keyphrase perturbation, which mitigates reliance on specific keyphrases. Compared with previous works that are only designed for encoder-based PLMs, Citss is carefully developed to be compatible with both encoder-based PLMs and decoder-based LLMs, to embrace the benefits of enlarged pretraining. Experiments with three benchmark datasets with both encoder-based PLMs and decoder-based LLMs demonstrate our superiority compared to the previous state of the art. Our code is available at: this http URL
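
Both contrastive-pair strategies are lightweight text operations; a sketch under the assumption that the citing sentence is marked with a [CIT] token and keyphrases are given (both assumptions for illustration):

```python
import random

# Sketch of Citss's two contrastive-pair strategies. The [CIT]
# anchor and the keyphrase list are illustrative assumptions; the
# real pipeline extracts both from the corpus.

def sentence_level_crop(context: list[str], cit_idx: int, window: int = 1):
    """Keep only sentences within `window` of the citing sentence."""
    lo, hi = max(0, cit_idx - window), cit_idx + window + 1
    return " ".join(context[lo:hi])

def keyphrase_perturbation(text: str, keyphrases: list[str]) -> str:
    """Mask a random keyphrase so the model cannot rely on it."""
    kp = random.choice(keyphrases)
    return text.replace(kp, "[MASK]")

context = ["Prior work studied X.", "We follow [CIT] for evaluation.",
           "Results are in Table 2."]
view1 = sentence_level_crop(context, cit_idx=1)
view2 = keyphrase_perturbation(view1, keyphrases=["evaluation"])
print(view1)
print(view2)  # the two views form a positive contrastive pair
```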

[NLP-41] PAST: Phonetic-Acoustic Speech Tokenizer

Quick read: This paper addresses conventional speech processing's reliance on pretrained self-supervised models for speech representation learning, aiming instead for an end-to-end framework that jointly integrates phonetic information with signal reconstruction. The key to the solution is PAST (Phonetic-Acoustic Speech Tokenizer), which uses supervised phonetic data and injects domain knowledge directly into the tokenization process via auxiliary tasks, removing the need for external pretrained models and improving the quality and efficiency of speech representations.

Link: https://arxiv.org/abs/2505.14470
Authors: Nadav Har-Tuv,Or Tal,Yossi Adi
Institutions: The Hebrew University of Jerusalem, Israel
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:We present PAST, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. Unlike previous approaches that rely on pretrained self-supervised models, PAST employs supervised phonetic data, directly integrating domain knowledge into the tokenization process via auxiliary tasks. Additionally, we introduce a streamable, causal variant of PAST, enabling real-time speech applications. Results demonstrate that PAST surpasses existing evaluated baseline tokenizers across common evaluation metrics, including phonetic representation and speech reconstruction. Notably, PAST also achieves superior performance when serving as a speech representation for speech language models, further highlighting its effectiveness as a foundation for spoken language generation. To foster further research, we release the full implementation. For code, model checkpoints, and samples see: this https URL

[NLP-42] Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations

Quick read: This paper addresses the safety problems large language models (LLMs) exhibit with code-mixed inputs and outputs, in particular their greater tendency to produce unsafe outputs than with monolingual English prompts. The key to the solution is using explainability methods to dissect the internal attribution shifts behind the harmful behavior, and exploring the cultural dimension by distinguishing universally unsafe queries from culturally specific ones.

Link: https://arxiv.org/abs/2505.14469
Authors: Somnath Banerjee,Pratyush Chatterjee,Shanu Kumar,Sayan Layek,Parag Agrawal,Rima Hazra,Animesh Mukherjee
Institutions: Indian Institute of Technology Kharagpur, India; Microsoft Corporation, India
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in LLMs have raised significant safety concerns, particularly when dealing with code-mixed inputs and outputs. Our study systematically investigates the increased susceptibility of LLMs to produce unsafe outputs from code-mixed prompts compared to monolingual English prompts. Utilizing explainability methods, we dissect the internal attribution shifts causing model’s harmful behaviors. In addition, we explore cultural dimensions by distinguishing between universally unsafe and culturally-specific unsafe queries. This paper presents novel experimental insights, clarifying the mechanisms driving this phenomenon.

[NLP-43] Void in Language Models

Quick read: This paper asks whether all layers of a transformer-based language model (LM) are actually activated during inference. The key to the solution is L2 Adaptive Computation (LAC), a non-trainable, parameter-free adaptive computation method that detects unactivated layers (called voids) by monitoring changes in the L2 norm of activations. Applied to instruction-tuned LMs, it traces the activated layers in the prompt processing and response generation phases, showing that different layers are activated in the two phases and that selectively skipping most unactivated layers can improve performance on certain tasks.

Link: https://arxiv.org/abs/2505.14467
Authors: Mani Shemiranifar
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Despite advances in transformer-based language models (LMs), a fundamental question remains largely unanswered: Are all layers activated during inference? We investigate this question by detecting unactivated layers (which we refer to as Voids) using a non-trainable and parameter-free adaptive computation method called L2 Adaptive Computation (LAC). We adapt LAC from its original efficiency-focused application to trace activated layers during inference. This method monitors changes in the L2-norm of activations to identify voids. We analyze layer activation in instruction-tuned LMs across two phases: Prompt Processing (PP), where we trace activated layers for each token in the input prompts, and Response Generation (RG), where we trace activated layers for each generated token. We further demonstrate that distinct layers are activated during these two phases. To show the effectiveness of our method, we evaluated three distinct instruction-tuned LMs from the Llama, Mistral, and Qwen families on three benchmarks: MMLU, GPQA Diamond, and BoolQ. For example, on MMLU with a zero-shot setting, skipping voids in Qwen2.5-7B-Instruct resulted in an improvement from 69.24 to 71.29 while the model uses only 30% of the layers. Similarly, Mistral-7B-Instruct-v0.3 on GPQA Diamond improved from 13.88 to 18.36 when using 70% of the layers during both the PP and RG phases. These results show that not all layers contribute equally during inference, and that selectively skipping most of them can improve the performance of models on certain tasks.
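
The void-detection criterion can be approximated as a relative change in the L2 norm between consecutive hidden states; the threshold tau and the synthetic states below are illustrative assumptions, not the paper's exact procedure:

```python
import torch

# Sketch of L2 Adaptive Computation (LAC): treat a layer as activated
# when it changes the L2 norm of the hidden state by more than a
# threshold. Hidden states and tau here are synthetic stand-ins.

def find_voids(hidden_states: list[torch.Tensor], tau: float = 0.05):
    """Return indices of layers whose relative L2-norm change
    falls below tau (the 'voids')."""
    voids = []
    for i in range(1, len(hidden_states)):
        prev_norm = hidden_states[i - 1].norm()
        change = (hidden_states[i].norm() - prev_norm).abs() / prev_norm
        if change < tau:
            voids.append(i)
    return voids

states = [torch.randn(768)]
for _ in range(11):                       # simulate 11 layer outputs
    states.append(states[-1] + 0.01 * torch.randn(768))
print("unactivated layers:", find_voids(states))
```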

[NLP-44] Not All Correct Answers Are Equal: Why Your Distillation Source Matters

Quick read: This paper studies how to improve the reasoning ability of open-source language models through distillation. The key to the solution is collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus, building parallel datasets from them, and training student models on these high-quality, verified reasoning traces. Experiments show that the AM-Thinking-v1-distilled data exhibits greater token-length diversity and lower perplexity, and yields student models with the best results across several reasoning benchmarks.

Link: https://arxiv.org/abs/2505.14464
Authors: Xiaoyu Tian,Yunjie Ji,Haotian Wang,Shuaiting Chen,Sitong Zhao,Yiping Peng,Han Zhao,Xiangang Li
Institutions: Beike; Ke.com
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Distillation has emerged as a practical and effective approach to enhance the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, revealing that AM-Thinking-v1-distilled data exhibits greater token length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The AM-based model consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets (AM-Thinking-v1-Distilled and AM-Qwen3-Distilled) to support future research on open and high-performing reasoning-oriented language models. The datasets are publicly available on Hugging Face.

[NLP-45] RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

Quick read: This paper addresses the shortcomings of vision-language models (VLMs) in understanding visual culture, especially their limited ability to interpret cultural nuances. The key to the solution is RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark for culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC) that integrates over 10,000 Wikipedia documents curated and ranked by human annotators to ground models in cultural context. Experiments show that lightweight VLMs augmented with culture-aware retrieval outperform their non-augmented counterparts on both tasks, validating retrieval-augmented methods for multimodal understanding.

Link: https://arxiv.org/abs/2505.14462
Authors: Jiaang Li,Yifei Yuan,Wenyan Li,Mohammad Aliannejadi,Daniel Hershcovich,Anders Søgaard,Ivan Vulić,Wenxuan Zhang,Paul Pu Liang,Yang Deng,Serge Belongie
Institutions: University of Copenhagen; ETH Zürich; University of Amsterdam; University of Cambridge; Singapore University of Technology and Design; Massachusetts Institute of Technology; Singapore Management University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 Wikipedia documents curated and ranked by human annotators. With RAVENEA, we train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the value of retrieval-augmented methods and culturally inclusive benchmarks for multimodal understanding.

[NLP-46] CtrlDiff: Boosting Large Diffusion Language Models with Dynamic Block Prediction and Controllable Generation

Quick read: This paper addresses the fixed granularity and weak controllability of current diffusion-based language models. The key to the solution is CtrlDiff, a dynamic, controllable semi-autoregressive framework that uses reinforcement learning to adaptively set the size of each generation block according to local semantics, and introduces a classifier-guided control mechanism tailored to discrete diffusion, enabling efficient post-hoc conditional generation at reduced computational overhead.

Link: https://arxiv.org/abs/2505.14455
Authors: Chihan Huang,Hao Tang
Institutions: Nanjing University of Science and Technology; Peking University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Although autoregressive models have dominated language modeling in recent years, there has been a growing interest in exploring alternative paradigms to the conventional next-token prediction framework. Diffusion-based language models have emerged as a compelling alternative due to their powerful parallel generation capabilities and inherent editability. However, these models are often constrained by fixed-length generation. A promising direction is to combine the strengths of both paradigms, segmenting sequences into blocks, modeling autoregressive dependencies across blocks while leveraging discrete diffusion to estimate the conditional distribution within each block given the preceding context. Nevertheless, their practical application is often hindered by two key limitations: rigid fixed-length outputs and a lack of flexible control mechanisms. In this work, we address the critical limitations of fixed granularity and weak controllability in current large diffusion language models. We propose CtrlDiff, a dynamic and controllable semi-autoregressive framework that adaptively determines the size of each generation block based on local semantics using reinforcement learning. Furthermore, we introduce a classifier-guided control mechanism tailored to discrete diffusion, which significantly reduces computational overhead while facilitating efficient post-hoc conditioning without retraining. Extensive experiments demonstrate that CtrlDiff sets a new standard among hybrid diffusion models, narrows the performance gap to state-of-the-art autoregressive approaches, and enables effective conditional text generation across diverse tasks.

[NLP-47] Creative Preference Optimization

Quick read: This paper addresses the limited ability of large language models (LLMs) to generate genuinely creative content, content that falls short on novelty, diversity, surprise, and quality. The key to the solution is Creative Preference Optimization (CrPO), a new alignment method that injects signals from multiple creativity dimensions into the preference-optimization objective in a modular fashion, improving creativity more holistically.

Link: https://arxiv.org/abs/2505.14442
Authors: Mete Ismayilzada,Antonio Laverghetta Jr.,Simone A. Luchini,Reet Patel,Antoine Bosselut,Lonneke van der Plas,Roger Beaty
Institutions: EPFL; Università della Svizzera Italiana; Pennsylvania State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 27 pages

Abstract:While Large Language Models (LLMs) have demonstrated impressive performance across natural language generation tasks, their ability to generate truly creative content-characterized by novelty, diversity, surprise, and quality-remains limited. Existing methods for enhancing LLM creativity often focus narrowly on diversity or specific tasks, failing to address creativity’s multifaceted nature in a generalizable way. In this work, we propose Creative Preference Optimization (CrPO), a novel alignment method that injects signals from multiple creativity dimensions into the preference optimization objective in a modular fashion. We train and evaluate creativity-augmented versions of several models using CrPO and MuCE, a new large-scale human preference dataset spanning over 200,000 human-generated responses and ratings from more than 30 psychological creativity assessments. Our models outperform strong baselines, including GPT-4o, on both automated and human evaluations, producing more novel, diverse, and surprising generations while maintaining high output quality. Additional evaluations on NoveltyBench further confirm the generalizability of our approach. Together, our results demonstrate that directly optimizing for creativity within preference frameworks is a promising direction for advancing the creative capabilities of LLMs without compromising output quality.

[NLP-48] S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models

Quick read: This paper addresses intelligence degradation in speech large language models (Speech LLMs): their reasoning and generation performance drops when handling audio input compared with text input. The key to the solution is S2SBench, a benchmark that quantifies this degradation through diagnostic datasets targeting sentence continuation and commonsense reasoning under audio input, together with a pairwise evaluation protocol based on the perplexity difference between plausible and implausible samples.

Link: https://arxiv.org/abs/2505.14438
Authors: Yuanbo Fang,Haoze Sun,Jun Liu,Tao Zhang,Zenan Zhou,Weipeng Chen,Xiaofen Xing,Xiangmin Xu
Institutions: Baichuan Inc
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. However, this often leads to a decline in reasoning and generation performance compared to text input, a phenomenon referred to as intelligence degradation. To systematically evaluate this gap, we propose S2SBench, a benchmark designed to quantify performance degradation in Speech LLMs. It includes diagnostic datasets targeting sentence continuation and commonsense reasoning under audio input. We further introduce a pairwise evaluation protocol based on perplexity differences between plausible and implausible samples to measure degradation relative to text input. We apply S2SBench to analyze the training process of Baichuan-Audio, which further demonstrates the benchmark's effectiveness. All datasets and evaluation code are available at this https URL.
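
The pairwise protocol boils down to comparing perplexities of a plausible and an implausible continuation; in this sketch, avg_nll is a toy stub for the model's mean token negative log-likelihood of the continuation given the prompt:

```python
import math

# Sketch of a pairwise perplexity protocol: the model 'passes' a pair
# when it assigns lower perplexity to the plausible continuation.
# avg_nll is a stub; in practice it comes from the Speech LLM itself.

def avg_nll(prompt: str, continuation: str) -> float:
    return 0.8 if "warm" in continuation else 1.7   # toy stub

def perplexity(prompt: str, continuation: str) -> float:
    return math.exp(avg_nll(prompt, continuation))

def pair_correct(prompt: str, plausible: str, implausible: str) -> bool:
    return perplexity(prompt, plausible) < perplexity(prompt, implausible)

pairs = [("The sun makes the sand", "warm.", "fluent in French.")]
score = sum(pair_correct(*p) for p in pairs) / len(pairs)
print("pairwise accuracy:", score)  # degradation = text acc. - audio acc.
```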

[NLP-49] Neural Incompatibility: The Unbridgeable Gap of Cross-Scale Parametric Knowledge Transfer in Large Language Models ACL’25

Quick read: This paper studies Parametric Knowledge Transfer (PKT) across large language models (LLMs) of different scales, aiming to move beyond traditional knowledge-transfer paradigms rooted in symbolic language toward more efficient transfer. The key to the solution is alignment of the parametric spaces; the paper distinguishes two paradigms: Post-Align PKT (PostPKT) and Pre-Align PKT (PrePKT). For the latter, the proposed LaTen method aligns the parametric spaces of cross-scale LLMs with only a few training steps and no subsequent fine-tuning. However, the study identifies Neural Incompatibility as the fundamental obstacle to effective PKT.

Link: https://arxiv.org/abs/2505.14436
Authors: Yuqiao Tan,Shizhu He,Kang Liu,Jun Zhao
Institutions: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL'25 Main. Code link: this https URL

Abstract:Large Language Models (LLMs) offer a transparent brain with accessible parameters that encode extensive knowledge, which can be analyzed, located and transferred. Consequently, a key research challenge is to transcend traditional knowledge transfer paradigms rooted in symbolic language and achieve genuine Parametric Knowledge Transfer (PKT). Significantly, exploring effective methods for transferring knowledge across LLMs of different scales through parameters presents an intriguing and valuable research direction. In this paper, we first demonstrate that alignment in parametric space is the fundamental prerequisite to achieve successful cross-scale PKT. We redefine the previously explored knowledge transfer as Post-Align PKT (PostPKT), which utilizes extracted parameters for LoRA initialization and requires subsequent fine-tuning for alignment. Hence, to reduce the cost of further fine-tuning, we introduce a novel Pre-Align PKT (PrePKT) paradigm and propose a solution called LaTen (Locate-Then-Align) that aligns the parametric spaces of LLMs across scales using only several training steps, without subsequent training. Comprehensive experiments on four benchmarks demonstrate that both PostPKT and PrePKT face challenges in achieving consistently stable transfer. Through in-depth analysis, we identify Neural Incompatibility as the etiological and parametric structural differences between LLMs of varying scales, presenting fundamental challenges to achieving effective PKT. These findings provide fresh insights into the parametric architectures of LLMs and highlight promising directions for future research on efficient PKT. Our code is available at this https URL.

[NLP-50] Rank-K: Test-Time Reasoning for Listwise Reranking

Quick read: This paper addresses the efficiency and scalability problems of conventional retrieval pipelines that rely on resource-intensive neural rerankers. The key to the solution is Rank-K, a listwise passage reranking model built on a reasoning language model: it exploits the LLM's reasoning capability at query time, providing test-time scalability for hard queries and improving retrieval effectiveness while curbing computational cost.

Link: https://arxiv.org/abs/2505.14432
Authors: Eugene Yang,Andrew Yates,Kathryn Ricci,Orion Weller,Vivek Chari,Benjamin Van Durme,Dawn Lawrie
Institutions: Johns Hopkins University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 15 pages, 4 figures

Abstract:Retrieve-and-rerank is a popular retrieval pipeline because of its ability to make slow but effective rerankers efficient enough at query time by reducing the number of comparisons. Recent works in neural rerankers take advantage of large language models for their capability in reasoning between queries and passages and have achieved state-of-the-art retrieval effectiveness. However, such rerankers are resource-intensive, even after heavy optimization. In this work, we introduce Rank-K, a listwise passage reranking model that leverages the reasoning capability of the reasoning language model at query time that provides test time scalability to serve hard queries. We show that Rank-K improves retrieval effectiveness by 23% over the RankZephyr, the state-of-the-art listwise reranker, when reranking a BM25 initial ranked list and 19% when reranking strong retrieval results by SPLADE-v3. Since Rank-K is inherently a multilingual model, we found that it ranks passages based on queries in different languages as effectively as it does in monolingual retrieval.
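
Listwise reranking amounts to showing the model the query plus numbered passages and parsing a permutation back; the prompt template and the call_llm stub below are illustrative assumptions, not Rank-K's actual template or reasoning traces:

```python
# Sketch of a single listwise rerank step over a candidate list
# (e.g., a BM25 initial ranking). call_llm is a hypothetical stub.

def call_llm(prompt: str) -> str:
    return "[2] > [1] > [3]"                       # stub LLM output

def listwise_rerank(query: str, passages: list[str]) -> list[str]:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (f"Query: {query}\nPassages:\n{numbered}\n"
              "Rank the passages from most to least relevant.")
    order = call_llm(prompt)
    ranks = [int(tok.strip().strip("[]")) - 1
             for tok in order.split(">")]          # parse "[2] > [1] ..."
    return [passages[i] for i in ranks]

docs = ["pandas eat bamboo", "bamboo is a grass", "cars need fuel"]
print(listwise_rerank("what do pandas eat", docs))
```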

[NLP-51] From Templates to Natural Language: Generalization Challenges in Instruction-Tuned LLM s for Spatial Reasoning

Quick read: This paper examines the generalization of instruction-tuned large language models (LLMs) in grounded environments, in particular the challenge of transferring from synthetic to human-written instructions. The key to the solution is fine-tuning LLMs on synthetic instructions only, then evaluating them on a benchmark containing both synthetic and human-written instructions, revealing why generalization degrades on more complex tasks and providing a detailed error analysis.

Link: https://arxiv.org/abs/2505.14425
Authors: Chalamalasetti Kranti,Sherzod Hakimov,David Schlangen
Institutions: University of Potsdam; German Research Center for Artificial Intelligence (DFKI)
Subjects: Computation and Language (cs.CL)
Comments: 4 pages

Abstract:Instruction-tuned large language models (LLMs) have shown strong performance on a variety of tasks; however, generalizing from synthetic to human-authored instructions in grounded environments remains a challenge for them. In this work, we study generalization challenges in spatial grounding tasks where models interpret and translate instructions for building object arrangements on a 2.5D grid. We fine-tune LLMs using only synthetic instructions and evaluate their performance on a benchmark dataset containing both synthetic and human-written instructions. Our results reveal that while models generalize well on simple tasks, their performance degrades significantly on more complex tasks. We present a detailed error analysis of the gaps in instruction generalization.

[NLP-52] Scaling Low-Resource MT via Synthetic Data Generation with LLMs

Quick read: This paper addresses data scarcity in low-resource machine translation (MT); the key to its solution is synthetic data generated by a large language model (LLM). The authors construct a document-level synthetic corpus from English Europarl and extend it via pivoting to 147 additional language pairs, with automatic and human evaluation confirming its high overall quality. Experiments show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.

Link: https://arxiv.org/abs/2505.14423
Authors: Ona de Gibert,Joseph Attieh,Teemu Vahtola,Mikko Aulamo,Zihao Li,Raúl Vázquez,Tiancheng Hu,Jörg Tiedemann
Institutions: University of Helsinki; University of Cambridge
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We investigate the potential of LLM-generated synthetic data for improving low-resource machine translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its high overall quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, and (iii) testing its utility beyond English-centric MT. Finally, we introduce SynOPUS, a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.

[NLP-53] SAE-FiRE: Enhancing Earnings Surprise Predictions Through Sparse Autoencoder Feature Selection

Quick read: This paper targets the prediction of earnings surprises from earnings conference call transcripts, a task drawing growing attention in financial research. Because transcripts often exceed 5,000 words with heavy redundancy and industry-specific terminology, they are hard for language models to analyze. The key to the solution is the Sparse Autoencoder for Financial Representation Enhancement (SAE-FiRE) framework, which uses sparse autoencoders (SAEs) to efficiently identify patterns and filter out noise, extracting the key information while eliminating redundancy and focusing on the nuanced financial signals that predict earnings surprises.

Link: https://arxiv.org/abs/2505.14420
Authors: Huopu Zhang,Yanguang Liu,Mengnan Du
Institutions: Georgia Institute of Technology; New Jersey Institute of Technology
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Predicting earnings surprises through the analysis of earnings conference call transcripts has attracted increasing attention from the financial research community. Conference calls serve as critical communication channels between company executives, analysts, and shareholders, offering valuable forward-looking information. However, these transcripts present significant analytical challenges, typically containing over 5,000 words with substantial redundancy and industry-specific terminology that creates obstacles for language models. In this work, we propose the Sparse Autoencoder for Financial Representation Enhancement (SAE-FiRE) framework to address these limitations by extracting key information while eliminating redundancy. SAE-FiRE employs Sparse Autoencoders (SAEs) to efficiently identify patterns and filter out noises, and focusing specifically on capturing nuanced financial signals that have predictive power for earnings surprises. Experimental results indicate that the proposed method can significantly outperform comparing baselines.
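
At its core, a sparse autoencoder is one hidden layer trained with a reconstruction loss plus a sparsity penalty; a minimal torch sketch with illustrative dimensions (SAE-FiRE's actual inputs would be representations of earnings-call text, not random tensors):

```python
import torch
import torch.nn as nn

# Minimal sparse autoencoder sketch. Dimensions, the L1 weight, and
# the random 'transcript embeddings' are illustrative assumptions.

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in: int = 256, d_hidden: int = 1024):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        z = torch.relu(self.enc(x))      # sparse latent code
        return self.dec(z), z

model = SparseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 256)                 # stand-in transcript embeddings

for step in range(100):
    x_hat, z = model(x)
    # Reconstruction (MSE) plus L1 penalty that encourages sparse codes.
    loss = ((x_hat - x) ** 2).mean() + 1e-3 * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```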

[NLP-54] Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents

Quick read: This paper addresses backdoor attacks in the supply chain of graphical user interface (GUI) agents powered by multimodal large language models (MLLMs). The key to the solution is AgentGhost, a framework for stealthy backdoor injection: it builds composite triggers that combine goal-level and interaction-level signals, and formulates backdoor injection as a min-max optimization problem in which supervised contrastive learning maximizes feature differences across sample classes in representation space, while supervised fine-tuning minimizes the discrepancy between backdoor and clean behavior, improving both attack effectiveness and stealth.

Link: https://arxiv.org/abs/2505.14418
Authors: Pengzhou Cheng,Haowen Hu,Zheng Wu,Zongru Wu,Tianjie Ju,Daizong Ding,Zhuosheng Zhang,Gongshen Liu
Institutions: Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL)
Comments: 25 pages, 10 figures, 12 tables

Abstract:Graphical user interface (GUI) agents powered by multimodal large language models (MLLMs) have shown greater promise for human-interaction. However, due to the high fine-tuning cost, users often rely on open-source GUI agents or APIs offered by AI providers, which introduces a critical but underexplored supply chain threat: backdoor attacks. In this work, we first unveil that MLLM-powered GUI agents naturally expose multiple interaction-level triggers, such as historical steps, environment states, and task progress. Based on this observation, we introduce AgentGhost, an effective and stealthy framework for red-teaming backdoor attacks. Specifically, we first construct composite triggers by combining goal and interaction levels, allowing GUI agents to unintentionally activate backdoors while ensuring task utility. Then, we formulate backdoor injection as a Min-Max optimization problem that uses supervised contrastive learning to maximize the feature difference across sample classes at the representation space, improving flexibility of the backdoor. Meanwhile, it adopts supervised fine-tuning to minimize the discrepancy between backdoor and clean behavior generation, enhancing effectiveness and utility. Extensive evaluations of various agent models in two established mobile benchmarks show that AgentGhost is effective and generic, with attack accuracy that reaches 99.7% on three attack objectives, and shows stealthiness with only 1% utility degradation. Furthermore, we tailor a defense method against AgentGhost that reduces the attack accuracy to 22.1%. Our code is available at: anonymous.

[NLP-55] PRL: Prompts from Reinforcement Learning

Quick read: This paper tackles the central challenge of effective prompt engineering for fully exploiting LLM capabilities. Designing strong prompts traditionally demands expert intuition and deep task understanding, and the most effective prompts often hinge on subtle semantic cues that humans overlook but that steer LLM behavior. The key to the solution is PRL (Prompts from Reinforcement Learning), an RL-based automatic prompt generation method that, unlike prior approaches, can produce novel few-shot examples never seen during training, improving performance on text classification, simplification, and summarization.

Link: https://arxiv.org/abs/2505.14412
Authors: Paweł Batorski,Adrian Kosmala,Paul Swoboda
Institutions: Heinrich Heine Universität Düsseldorf
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Effective prompt engineering remains a central challenge in fully harnessing the capabilities of LLMs. While well-designed prompts can dramatically enhance performance, crafting them typically demands expert intuition and a nuanced understanding of the task. Moreover, the most impactful prompts often hinge on subtle semantic cues, ones that may elude human perception but are crucial for guiding LLM behavior. In this paper, we introduce PRL (Prompts from Reinforcement Learning), a novel RL-based approach for automatic prompt generation. Unlike previous methods, PRL can produce novel few-shot examples that were not seen during training. Our approach achieves state-of-the-art performance across a range of benchmarks, including text classification, simplification, and summarization. On the classification task, it surpasses prior methods by 2.58% over APE and 1.00% over EvoPrompt. Additionally, it improves the average ROUGE scores on the summarization task by 4.32 over APE and by 2.12 over EvoPrompt and the SARI score on simplification by 6.93 over APE and by 6.01 over EvoPrompt. Our code is available at this https URL .

[NLP-56] Pierce the Mists Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis EMNLP

Quick read: This paper addresses hallucinations in large language models (LLMs) caused by knowledge overshadowing, where one piece of activated knowledge inadvertently masks another relevant piece during inference, producing erroneous outputs. The key to the solution is PhantomCircuit, a framework that applies novel knowledge-circuit analysis to dissect the internal workings of attention heads, tracing how competing knowledge pathways give rise to overshadowing and how the phenomenon evolves over training.

Link: https://arxiv.org/abs/2505.14406
Authors: Haoming Huang,Yibo Yan,Jiahao Huo,Xin Zou,Xinfeng Li,Kun Wang,Xuming Hu
Institutions: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Shanghai Jiao Tong University; Nanyang Technological University
Subjects: Computation and Language (cs.CL)
Comments: 18 pages, 6 figures, EMNLP under review

Abstract:Large Language Models (LLMs), despite their remarkable capabilities, are hampered by hallucinations. A particularly challenging variant, knowledge overshadowing, occurs when one piece of activated knowledge inadvertently masks another relevant piece, leading to erroneous outputs even with high-quality training data. Current understanding of overshadowing is largely confined to inference-time observations, lacking deep insights into its origins and internal mechanisms during model training. Therefore, we introduce PhantomCircuit, a novel framework designed to comprehensively analyze and detect knowledge overshadowing. By innovatively employing knowledge circuit analysis, PhantomCircuit dissects the internal workings of attention heads, tracing how competing knowledge pathways contribute to the overshadowing phenomenon and its evolution throughout the training process. Extensive experiments demonstrate PhantomCircuit’s effectiveness in identifying such instances, offering novel insights into this elusive hallucination and providing the research community with a new methodological lens for its potential mitigation.

[NLP-57] Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation

Quick read: This paper addresses the difficulty large language models (LLMs) and their agentic counterparts have in retaining reasoning from previous tasks and applying it to new ones. The key to the solution is log-augmented generation (LAG), a framework that directly reuses computation and reasoning from past task logs at test time to improve learning from prior tasks and performance on new, unseen challenges. LAG represents task logs as key-value (KV) caches, storing KV caches for only a selected subset of tokens to keep the system efficient, and retrieves the KV values of relevant logs to augment generation on new tasks. Unlike reflection-based memory mechanisms, LAG reuses prior reasoning directly, without extra knowledge extraction or distillation steps, improving accuracy while staying efficient and scalable.

Link: https://arxiv.org/abs/2505.14398
Authors: Peter Baile Chen,Yi Zhang,Dan Roth,Samuel Madden,Jacob Andreas,Michael Cafarella
Institutions: MIT; AWS AI; University of Pennsylvania
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Data and code are available at this https URL

Abstract:While humans naturally learn and adapt from past experiences, large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks and apply them in future contexts. To address this limitation, we propose a novel framework, log-augmented generation (LAG) that directly reuses prior computation and reasoning from past logs at test time to enhance model’s ability to learn from previous tasks and perform better on new, unseen challenges, all while keeping the system efficient and scalable. Specifically, our system represents task logs using key-value (KV) caches, encoding the full reasoning context of prior tasks while storing KV caches for only a selected subset of tokens. When a new task arises, LAG retrieves the KV values from relevant logs to augment generation. Our approach differs from reflection-based memory mechanisms by directly reusing prior reasoning and computations without requiring additional steps for knowledge extraction or distillation. Our method also goes beyond existing KV caching techniques, which primarily target efficiency gains rather than improving accuracy. Experiments on knowledge- and reasoning-intensive datasets demonstrate that our method significantly outperforms standard agentic systems that do not utilize logs, as well as existing solutions based on reflection and KV cache techniques.
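
A LAG-style log store can be sketched as a map from task embeddings to truncated KV caches, retrieved by nearest-neighbor similarity; all embeddings and cache payloads below are synthetic stand-ins, and the real system's retrieval and cache layout may differ:

```python
import torch

# Sketch of a log store: each finished task contributes an embedding
# key and a (truncated) KV-cache value; a new task retrieves the
# nearest logs to prepend at generation time.

class LogStore:
    def __init__(self):
        self.keys, self.values = [], []

    def add(self, task_emb: torch.Tensor, kv_cache):
        self.keys.append(task_emb / task_emb.norm())
        self.values.append(kv_cache)

    def retrieve(self, query_emb: torch.Tensor, k: int = 2):
        q = query_emb / query_emb.norm()
        sims = torch.stack([q @ key for key in self.keys])  # cosine sims
        top = sims.topk(min(k, len(self.keys))).indices
        return [self.values[i] for i in top]

store = LogStore()
for t in range(5):
    store.add(torch.randn(128), {"task": t, "kv": torch.randn(2, 4, 16)})
relevant = store.retrieve(torch.randn(128))
print([kv["task"] for kv in relevant])
```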

[NLP-58] Causal Cartographer: From Mapping to Reasoning Over Counterfactual Worlds

Quick read: This paper addresses the weak causal reasoning of current foundation models such as large language models (LLMs), which cannot go beyond rote memorization of existing causal relationships and therefore fail at counterfactual questions; real-world counterfactual evaluation is also hard because only the factual world is observed. The key to the solution is the Causal Cartographer framework, which explicitly extracts and models causal relationships: a graph retrieval-augmented generation agent builds a large repository of real-world causal knowledge, and a counterfactual reasoning agent constrained by causal relationships performs reliable step-by-step causal inference, improving the robustness of LLMs on causal reasoning tasks while reducing inference cost and spurious correlations.

Link: https://arxiv.org/abs/2505.14396
Authors: Gaël Gendron,Jože M. Rožanec,Michael Witbrock,Gillian Dobbie
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 29 pages, 9 pages for the main paper, 20 pages for the references and appendix, 25 figures

Abstract:Causal world models are systems that can answer counterfactual questions about an environment of interest, i.e. predict how it would have evolved if an arbitrary subset of events had been realized differently. It requires understanding the underlying causes behind chains of events and conducting causal inference for arbitrary unseen distributions. So far, this task eludes foundation models, notably large language models (LLMs), which do not have demonstrated causal reasoning capabilities beyond the memorization of existing causal relationships. Furthermore, evaluating counterfactuals in real-world applications is challenging since only the factual world is observed, limiting evaluation to synthetic datasets. We address these problems by explicitly extracting and modeling causal relationships and propose the Causal Cartographer framework. First, we introduce a graph retrieval-augmented generation agent tasked to retrieve causal relationships from data. This approach allows us to construct a large network of real-world causal relationships that can serve as a repository of causal knowledge and build real-world counterfactuals. In addition, we create a counterfactual reasoning agent constrained by causal relationships to perform reliable step-by-step causal inference. We show that our approach can extract causal knowledge and improve the robustness of LLMs for causal reasoning tasks while reducing inference costs and spurious correlations.

[NLP-59] MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language

Quick read: This paper addresses the evaluation of large language models' (LLMs) text generation in low-resource languages, where direct assessment methods are scarce. The key to the solution is MUG-Eval, a framework that converts existing benchmarks into conversational tasks and measures model accuracy on those tasks; because the tasks are designed to require effective communication in the target language, task success rate serves as a proxy for successful conversational generation.

Link: https://arxiv.org/abs/2505.14395
Authors: Seyoung Song,Seogyeong Jeong,Eunsu Kim,Jiho Jin,Dongkwan Kim,Jay Shin,Alice Oh
Institutions: KAIST; Trillion Labs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Evaluating text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs’ multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs’ accuracies on those tasks. We specifically designed these conversational tasks to require effective communication in the target language. Then, we simply use task success rate as a proxy of successful conversation generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs-as-judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning high, mid, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks ( r 0.75) while enabling standardized comparisons across languages and models. Our framework provides a robust and resource-efficient solution for evaluating multilingual generation that can be extended to thousands of languages.

[NLP-60] Editing Across Languages: A Survey of Multilingual Knowledge Editing

Quick Read: This survey targets the reliability and generalization of Multilingual Knowledge Editing (MKE), i.e., how to perform factual edits that transfer consistently across languages. Its key contribution is a systematic taxonomy of existing MKE methods, covering parameter-based, memory-based, fine-tuning, and hypernetwork approaches, together with an analysis of their effectiveness and the challenges of cross-lingual propagation, laying theoretical and technical groundwork for editable, language-aware LLMs.

Link: https://arxiv.org/abs/2505.14393
Authors: Nadir Durrani, Basel Mousi, Fahim Dalvi
Institutions: Qatar Computing Research Institute, HBKU, Doha, Qatar
Categories: Computation and Language (cs.CL)
Comments:

Abstract: While Knowledge Editing has been extensively studied in monolingual settings, it remains underexplored in multilingual contexts. This survey systematizes recent research on Multilingual Knowledge Editing (MKE), a growing subdomain of model editing focused on ensuring factual edits generalize reliably across languages. We present a comprehensive taxonomy of MKE methods, covering parameter-based, memory-based, fine-tuning, and hypernetwork approaches. We survey available benchmarks, summarize key findings on method effectiveness and transfer patterns, identify challenges in cross-lingual propagation, and highlight open problems related to language anisotropy, evaluation coverage, and edit scalability. Our analysis consolidates a rapidly evolving area and lays the groundwork for future progress in editable language-aware LLMs.

[NLP-61] AutoRev: Automatic Peer Review System for Academic Research Papers

Quick Read: This paper tackles automatic review writing for academic research papers, a task requiring deep understanding of a document's content and the interdependencies among its sections, along with a grasp of technical detail and overall structure. Existing approaches mainly fine-tune large language models (LLMs) but often overlook the computational and performance constraints imposed by long input token lengths. The key to the proposed solution, AutoRev, is a graph-based automatic peer review system that represents an academic document as a graph and extracts the passages that contribute most to the review, improving review generation and showing potential to extend to other downstream NLP tasks.

Link: https://arxiv.org/abs/2505.14376
Authors: Maitreya Prafulla Chitale, Ketaki Mangesh Shetye, Harshit Gupta, Manav Chaudhary, Vasudeva Varma
Institutions: IIIT Hyderabad
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Generating a review for an academic research paper is a complex task that requires a deep understanding of the document’s content and the interdependencies between its sections. It demands not only insight into technical details but also an appreciation of the paper’s overall coherence and structure. Recent methods have predominantly focused on fine-tuning large language models (LLMs) to address this challenge. However, they often overlook the computational and performance limitations imposed by long input token lengths. To address this, we introduce AutoRev, an Automatic Peer Review System for Academic Research Papers. Our novel framework represents an academic document as a graph, enabling the extraction of the most critical passages that contribute significantly to the review. This graph-based approach demonstrates effectiveness for review generation and is potentially adaptable to various downstream tasks, such as question answering, summarization, and document representation. When applied to review generation, our method outperforms SOTA baselines by an average of 58.72% across all evaluation metrics. We hope that our work will stimulate further research in applying graph-based extraction techniques to other downstream tasks in NLP. We plan to make our code public upon acceptance.
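The abstract does not detail how the graph is built or mined. As a rough illustration of the general idea — scoring passages by centrality in a similarity graph and keeping the top few — a minimal sketch might look like this, where TF-IDF edges and PageRank are our stand-ins, not AutoRev's actual design:

```python
# Hypothetical sketch: rank a paper's passages by centrality in a
# similarity graph and keep the top-k as input for review generation.
# TF-IDF similarity and PageRank are illustrative choices, not AutoRev's.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extract_key_passages(passages, k=5, min_sim=0.05):
    sim = cosine_similarity(TfidfVectorizer().fit_transform(passages))
    graph = nx.Graph()
    graph.add_nodes_from(range(len(passages)))
    for i in range(len(passages)):
        for j in range(i + 1, len(passages)):
            if sim[i, j] > min_sim:                  # prune weak edges
                graph.add_edge(i, j, weight=float(sim[i, j]))
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [passages[i] for i in sorted(top)]        # keep document order

sections = [
    "We propose a graph-based system for automatic peer review.",
    "Our graph-based extraction selects the most review-relevant passages.",
    "Experiments show large gains over fine-tuned LLM baselines.",
]
print(extract_key_passages(sections, k=2))
```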

[NLP-62] Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs EMNLP2025

Quick Read: This paper examines the security of large language models (LLMs) under prompt-based attacks, analyzing and evaluating the effectiveness of prompt injection attacks against open-source LLMs in particular. Its key contribution is a new evaluation metric, Attack Success Probability (ASP), which considers not only whether an attack succeeds but also the uncertainty in the model's response, giving a fuller picture of attack feasibility and model vulnerability. Using this metric, the authors design and validate effective attacks, such as a hypnotism attack and an ignore-prefix attack, exposing the weaknesses of current mainstream open-source LLMs.

Link: https://arxiv.org/abs/2505.14368
Authors: Jiawen Wang, Pritha Gupta, Ivan Habernal, Eyke Hüllermeier
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 8 pages, 3 figures, EMNLP 2025 under review

Abstract: Recent studies demonstrate that Large Language Models (LLMs) are vulnerable to different prompt-based attacks, generating harmful content or sensitive information. Both closed-source and open-source LLMs are underinvestigated for these attacks. This paper studies effective prompt injection attacks against the 14 most popular open-source LLMs on five attack benchmarks. Current metrics only consider successful attacks, whereas our proposed Attack Success Probability (ASP) also captures uncertainty in the model's response, reflecting ambiguity in attack feasibility. By comprehensively analyzing the effectiveness of prompt injection attacks, we propose a simple and effective hypnotism attack; results show that this attack causes aligned language models, including Stablelm2, Mistral, Openchat, and Vicuna, to generate objectionable behaviors, achieving around 90% ASP. They also indicate that our ignore prefix attacks can break all 14 open-source LLMs, achieving over 60% ASP on a multi-categorical dataset. We find that moderately well-known LLMs exhibit higher vulnerability to prompt injection attacks, highlighting the need to raise public awareness and prioritize efficient mitigation strategies.
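The paper's exact ASP formula is not given in the abstract. A minimal sketch of an ASP-style metric, assuming ambiguous responses count as half a success to capture uncertainty, could look like:

```python
# Hypothetical sketch of an Attack Success Probability (ASP) style metric.
# The half-credit weighting for ambiguous responses is our assumption,
# not the paper's definition.
from collections import Counter

def attack_success_probability(labels):
    """labels: one of 'success', 'ambiguous', 'refused' per attack trial."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["success"] + 0.5 * counts["ambiguous"]) / total

trials = ["success", "ambiguous", "refused", "success", "ambiguous"]
print(f"ASP = {attack_success_probability(trials):.2f}")  # 0.60
```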

[NLP-63] Dual Decomposition of Weights and Singular Value Low Rank Adaptation

Quick Read: This paper targets two core problems of LoRA-based parameter-efficient fine-tuning (PEFT): unstable training dynamics and inefficient knowledge transfer from pre-trained models, both stemming from random initialization of adapter parameters. The key to the proposed DuDe method is decomposing weight matrices into magnitude and direction components and using Singular Value Decomposition (SVD) for principled initialization, which improves optimization stability and better preserves pre-trained representations.

Link: https://arxiv.org/abs/2505.14367
Authors: Jialong Han, Si Zhang, Ke Zhang
Institutions: SIST, ShanghaiTech University; SKLP, ICT, ACS
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Parameter-Efficient Fine-Tuning (PEFT) has emerged as a critical paradigm for adapting Large Language Models (LLMs) to downstream tasks, among which Low-rank Adaptation (LoRA) represents one of the most widely adopted methodologies. However, existing LoRA-based approaches exhibit two fundamental limitations: unstable training dynamics and inefficient knowledge transfer from pre-trained models, both stemming from random initialization of adapter parameters. To overcome these challenges, we propose DuDe, a novel approach that decomposes weight matrices into magnitude and direction components, employing Singular Value Decomposition (SVD) for principled initialization. Our comprehensive evaluation demonstrates DuDe's superior performance and robustness, achieving up to 48.35% accuracy on MMLU and 62.53% (±1.59) accuracy on GSM8K. Our theoretical analysis and empirical validation collectively demonstrate that DuDe's decomposition strategy enhances optimization stability and better preserves pre-trained representations, particularly for domain-specific tasks requiring specialized knowledge. The combination of robust empirical performance and rigorous theoretical foundations establishes DuDe as a significant contribution to PEFT methodologies for LLMs.
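As a rough sketch of the mechanism described — magnitude/direction decomposition with SVD-based initialization — the toy layer below illustrates one plausible parameterization. The DoRA-like column-wise split and the choice of singular components seeding A and B are our assumptions; the paper's exact design may differ:

```python
# A rough DuDe-style layer sketch (assumptions, not the paper's code):
# W is split into a learnable per-column magnitude m and a direction,
# and the low-rank update (B, A) is seeded from W's top singular triplets.
import torch
import torch.nn as nn

class DuDeLinear(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        s = S[:rank].sqrt()
        self.B = nn.Parameter(U[:, :rank] * s)           # (out, r)
        self.A = nn.Parameter(s[:, None] * Vh[:rank])    # (r, in)
        with torch.no_grad():
            self.register_buffer("W_res", weight - self.B @ self.A)
        self.m = nn.Parameter(weight.norm(dim=0))        # per-column magnitude

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        direction = self.W_res + self.B @ self.A         # equals W at init
        direction = direction / direction.norm(dim=0, keepdim=True)
        return x @ (self.m * direction).T

layer = DuDeLinear(torch.randn(16, 32), rank=4)
print(layer(torch.randn(2, 32)).shape)  # torch.Size([2, 16])
```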

[NLP-64] PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs INTERSPEECH2025

Quick Read: This paper addresses the lack of personality annotations in speech datasets: personality-aware conversation agents that adapt behavior to user personality remain underexplored in neural spoken dialogue systems. The key to the solution is a preprocessing pipeline that converts raw audio recordings into a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels, using an automatic speech recognition (ASR) system to extract transcripts and timing and then generating conversation-level annotations. On top of these annotations, the authors build an LLM-based system for predicting conversational personality; human evaluators identify conversational characteristics and assign personality labels, and the system aligns more closely with human judgments than existing approaches.

Link: https://arxiv.org/abs/2505.14356
Authors: Sho Inoue, Shai Wang, Haizhou Li
Institutions: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: This is accepted to Interspeech 2025; Added an extra page for supplementary figures; Project page: this https URL

Abstract:Despite significant progress in neural spoken dialog systems, personality-aware conversation agents – capable of adapting behavior based on personalities – remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations. Leveraging these annotations, we design a system that employs large language models to predict conversational personality. Human evaluators were engaged to identify conversational characteristics and assign personality labels. Our analysis demonstrates that the proposed system achieves stronger alignment with human judgments compared to existing approaches.

[NLP-65] WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications ACL2025

Quick Read: This paper targets the gap in large language models' (LLMs) complex, domain-specific mathematical reasoning, particularly in wireless communications. The key to the solution is WirelessMathBench, a new benchmark for evaluating LLMs' mathematical modeling abilities in wireless communications engineering. It comprises 587 carefully curated questions drawn from 40 state-of-the-art research papers, ranging from basic multiple-choice questions to complex equation-completion tasks, all rigorously adhering to physical and dimensional constraints, thus testing mathematical reasoning comprehensively.

Link: https://arxiv.org/abs/2505.14354
Authors: Xin Li, Mengbing Liu, Li Wei, Jiancheng An, Mérouane Debbah, Chau Yuen
Institutions: Nanyang Technological University; Khalifa University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to ACL 2025 Findings

Abstract: Large Language Models (LLMs) have achieved impressive results across a broad array of tasks, yet their capacity for complex, domain-specific mathematical reasoning — particularly in wireless communications — remains underexplored. In this work, we introduce WirelessMathBench, a novel benchmark specifically designed to evaluate LLMs on mathematical modeling challenges in wireless communications engineering. Our benchmark consists of 587 meticulously curated questions sourced from 40 state-of-the-art research papers, encompassing a diverse spectrum of tasks ranging from basic multiple-choice questions to complex equation completion tasks, including both partial and full completions, all of which rigorously adhere to physical and dimensional constraints. Through extensive experimentation with leading LLMs, we observe that while many models excel in basic recall tasks, their performance degrades significantly when reconstructing partially or fully obscured equations, exposing fundamental limitations in current LLMs. Even DeepSeek-R1, the best performer on our benchmark, achieves an average accuracy of only 38.05%, with a mere 7.83% success rate in full equation completion. By publicly releasing WirelessMathBench along with the evaluation toolkit, we aim to advance the development of more robust, domain-aware LLMs for wireless system analysis and broader engineering applications.

[NLP-66] FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Quick Read: This paper addresses the scarcity of parallel speech corpora across Tibetan's three major dialects (Ü-Tsang, Amdo, and Kham), which has limited progress in speech modeling for this low-resource language. The key to the solution is FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Its core innovations are a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) that capture fine-grained acoustic and linguistic differences between dialects while preserving speaker identity.

Link: https://arxiv.org/abs/2505.14351
Authors: Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi
Institutions: School of Information and Software Engineering, University of Electronic Science and Technology of China; School of Information Science and Technology, Tibet University; Department of Ophthalmology, University of Texas Southwestern Medical Center
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 13 pages

Abstract:Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects-Ü-Tsang, Amdo, and Kham-limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.

[NLP-67] OSoRA: Output-Dimension and Singular-Value Initialized Low-Rank Adaptation

Quick Read: This paper aims to reduce the heavy computational and memory cost of fine-tuning large language models (LLMs); existing parameter-efficient fine-tuning (PEFT) methods offer an alternative but still require substantial resources. The proposed solution, OSoRA (Output-Dimension and Singular-Value Initialized Low-Rank Adaptation), integrates Singular Value Decomposition (SVD) with learnable scaling vectors in a unified framework: it optimizes only an output-dimension vector during training while keeping the corresponding singular vector matrices frozen, substantially reducing the number of trainable parameters and the computational requirements.

Link: https://arxiv.org/abs/2505.14350
Authors: Jialong Han, Si Zhang, Ke Zhang
Institutions: SIST, ShanghaiTech University; SKLP, ICT, ACS
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Fine-tuning Large Language Models (LLMs) has become increasingly challenging due to their massive scale and associated computational costs. Parameter-Efficient Fine-Tuning (PEFT) methodologies have been proposed as computational alternatives; however, their implementations still require significant resources. In this paper, we present OSoRA (Output-Dimension and Singular-Value Initialized Low-Rank Adaptation), a novel PEFT method for LLMs. OSoRA extends Low-Rank Adaptation (LoRA) by integrating Singular Value Decomposition (SVD) with learnable scaling vectors in a unified framework. It first performs an SVD of pre-trained weight matrices, then optimizes an output-dimension vector during training, while keeping the corresponding singular vector matrices frozen. OSoRA substantially reduces computational resource requirements by minimizing the number of trainable parameters during fine-tuning. Comprehensive evaluations across mathematical reasoning, common sense reasoning, and other benchmarks demonstrate that OSoRA achieves comparable or superior performance to state-of-the-art methods like LoRA and VeRA, while maintaining a linear parameter scaling even as the rank increases to higher dimensions. Our ablation studies further confirm that jointly training both the singular values and the output-dimension vector is critical for optimal performance.
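A toy illustration of the described recipe — run SVD once, freeze the singular vector matrices, and train only a small vector — is sketched below. Note that the paper's "output-dimension vector" may have a different shape than the rank-sized vector assumed here:

```python
# Hypothetical OSoRA-style layer: U and V frozen, only a length-r vector
# (seeded with the singular values) is trainable. The vector's shape is
# our assumption; the paper may parameterize it along the output dimension.
import torch
import torch.nn as nn

class OSoRALinear(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U[:, :rank])      # frozen singular vectors
        self.register_buffer("Vh", Vh[:rank])
        self.register_buffer(
            "W_res", weight - U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]
        )
        self.s = nn.Parameter(S[:rank].clone())     # the only trainable params

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.W_res + self.U @ torch.diag(self.s) @ self.Vh
        return x @ W.T

layer = OSoRALinear(torch.randn(16, 32), rank=4)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 4
```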

[NLP-68] QA-prompting: Improving Summarization with Large Language Models using Question-Answering

Quick Read: This paper addresses suboptimal extraction of key information in long-context summarization caused by positional bias. The key to its solution is QA-prompting, a simple prompting method that inserts question answering as an intermediate step before summary generation, extracting key information and enriching the text's context to mitigate positional bias and enable effective summarization in a single LM call, without fine-tuning or pipelining.

Link: https://arxiv.org/abs/2505.14347
Authors: Neelabh Sinha
Institutions: Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: Submitted to ARR

Abstract:Language Models (LMs) have revolutionized natural language processing, enabling high-quality text generation through prompting and in-context learning. However, models often struggle with long-context summarization due to positional biases, leading to suboptimal extraction of critical information. There are techniques to improve this with fine-tuning, pipelining, or using complex techniques, which have their own challenges. To solve these challenges, we propose QA-prompting - a simple prompting method for summarization that utilizes question-answering as an intermediate step prior to summary generation. Our method extracts key information and enriches the context of text to mitigate positional biases and improve summarization in a single LM call per task without requiring fine-tuning or pipelining. Experiments on multiple datasets belonging to different domains using ten state-of-the-art pre-trained models demonstrate that QA-prompting outperforms baseline and other state-of-the-art methods, achieving up to 29% improvement in ROUGE scores. This provides an effective and scalable solution for summarization and highlights the importance of domain-specific question selection for optimal performance.
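Since QA-prompting is a pure prompting method, it can be sketched as a prompt builder. The question set and wording below are illustrative, not the paper's:

```python
# A minimal sketch of QA-prompting: ask the model to answer key questions
# about the text first, then summarize, all in one call. The questions
# and phrasing are hypothetical examples.
def build_qa_prompt(document: str, questions: list) -> str:
    qa_block = "\n".join(f"Q{i + 1}: {q}" for i, q in enumerate(questions))
    return (
        "Read the document, answer the questions, then write a summary.\n\n"
        f"Document:\n{document}\n\n"
        f"{qa_block}\n\n"
        "First answer each question in one sentence, then write a "
        "3-sentence summary that uses those answers."
    )

questions = [
    "What is the main topic?",
    "What key event or finding is reported?",
    "What are the consequences or next steps?",
]
print(build_qa_prompt("(long article text...)", questions))
```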

[NLP-69] RADAR: Enhancing Radiology Report Generation with Supplementary Knowledge Injection

Quick Read: This paper addresses the fact that current multimodal large language models (LLMs) for radiology report generation underuse the knowledge already embedded in the model, leading to redundant information integration and inefficient use of learned representations. The key to the solution is RADAR, a framework that systematically combines the LLM's internal knowledge with externally retrieved supplementary knowledge to improve report accuracy and informativeness. Concretely, RADAR first extracts internal knowledge that aligns with expert image-based classification outputs, then retrieves relevant supplementary knowledge to enrich it, and finally aggregates both sources to generate higher-quality radiology reports.

Link: https://arxiv.org/abs/2505.14318
Authors: Wenjun Hou, Yi Cheng, Kaishuai Xu, Heng Li, Yan Hu, Wenjie Li, Jiang Liu
Institutions: The Hong Kong Polytechnic University; Southern University of Science and Technology; University of Nottingham Ningbo China
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in various domains, including radiology report generation. Previous approaches have attempted to utilize multimodal LLMs for this task, enhancing their performance through the integration of domain-specific knowledge retrieval. However, these approaches often overlook the knowledge already embedded within the LLMs, leading to redundant information integration and inefficient utilization of learned representations. To address this limitation, we propose RADAR, a framework for enhancing radiology report generation with supplementary knowledge injection. RADAR improves report generation by systematically leveraging both the internal knowledge of an LLM and externally retrieved information. Specifically, it first extracts the model's acquired knowledge that aligns with expert image-based classification outputs. It then retrieves relevant supplementary knowledge to further enrich this information. Finally, by aggregating both sources, RADAR generates more accurate and informative radiology reports. Extensive experiments on MIMIC-CXR, CheXpert-Plus, and IU X-ray demonstrate that our model outperforms state-of-the-art LLMs in both language quality and clinical accuracy.

[NLP-70] A MIND for Reasoning: Meta-learning for In-context Deduction

Quick Read: This paper investigates the limited out-of-distribution generalization of large language models (LLMs), specifically how to achieve a systematic understanding of deductive rules for identifying the subset of premises in a knowledge base needed to derive a given hypothesis. The key to the solution is Meta-learning for In-context Deduction (MIND), a novel few-shot meta-learning fine-tuning approach designed to help models generalize to unseen knowledge bases and apply inference rules systematically.

Link: https://arxiv.org/abs/2505.14313
Authors: Leonardo Bertolazzi, Manuel Vargas Guzmán, Raffaella Bernardi, Maciej Malicki, Jakub Szymanik
Institutions: University of Trento; University of Warsaw; Free University of Bozen-Bolzano
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly evaluated on formal tasks, where strong reasoning abilities define the state of the art. However, their ability to generalize to out-of-distribution problems remains limited. In this paper, we investigate how LLMs can achieve a systematic understanding of deductive rules. Our focus is on the task of identifying the appropriate subset of premises within a knowledge base needed to derive a given hypothesis. To tackle this challenge, we propose Meta-learning for In-context Deduction (MIND), a novel few-shot meta-learning fine-tuning approach. The goal of MIND is to enable models to generalize more effectively to unseen knowledge bases and to systematically apply inference rules. Our results show that MIND significantly improves generalization in small LMs ranging from 1.5B to 7B parameters. The benefits are especially pronounced in smaller models and low-data settings. Remarkably, small models fine-tuned with MIND outperform state-of-the-art LLMs, such as GPT-4o and o3-mini, on this task.

[NLP-71] HausaNLP: Current Status Challenges and Future Directions for Hausa Natural Language Processing

Quick Read: This paper addresses the challenges facing Hausa as a low-resource language in NLP, including limited open-source datasets and inadequate model representation. The key to the solution is HausaNLP, a curated catalog that aggregates datasets, tools, and research to improve accessibility and drive further development of Hausa NLP. The paper also discusses the challenges of integrating Hausa into large language models (LLMs), covering suboptimal tokenization and dialectal variation, and proposes strategic research directions centered on dataset expansion, improved language modeling, and stronger community collaboration.

Link: https://arxiv.org/abs/2505.14311
Authors: Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Idris Abdulmumin, Falalu Ibrahim Lawan, Babangida Sani, Sukairaj Hafiz Imam, Yusuf Aliyu, Sani Abdullahi Sani, Ali Usman Umar, Kenneth Church, Vukosi Marivate
Institutions: Imperial College London; Northeastern University; DSFSI, University of Pretoria; Kaduna State University; Kalinga University; Bayero University, Kano; Universiti Teknologi PETRONAS; University of the Witwatersrand, Johannesburg; Federal University of Lafia
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Hausa Natural Language Processing (NLP) has gained increasing attention in recent years, yet remains understudied as a low-resource language despite having over 120 million first-language (L1) and 80 million second-language (L2) speakers worldwide. While significant advances have been made in high-resource languages, Hausa NLP faces persistent challenges, including limited open-source datasets and inadequate model representation. This paper presents an overview of the current state of Hausa NLP, systematically examining existing resources, research contributions, and gaps across fundamental NLP tasks: text classification, machine translation, named entity recognition, speech recognition, and question answering. We introduce HausaNLP (this https URL), a curated catalog that aggregates datasets, tools, and research works to enhance accessibility and drive further development. Furthermore, we discuss challenges in integrating Hausa into large language models (LLMs), addressing issues of suboptimal tokenization and dialectal variation. Finally, we propose strategic research directions emphasizing dataset expansion, improved language modeling approaches, and strengthened community collaboration to advance Hausa NLP. Our work provides both a foundation for accelerating Hausa NLP progress and valuable insights for broader multilingual NLP research.

[NLP-72] Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency

Quick Read: This paper studies how the degree of overlap between a query and its retrieved context affects retrieval-augmented language models during training and inference, and how to optimize it. Systematic experiments across overlap levels show that increasing query-context overlap has little effect at first but, above a critical threshold, substantially improves test perplexity and learning efficiency. Building on this, the authors deliberately increase overlap by generating synthetic context, specifically by paraphrasing queries, improving data efficiency and cutting training time by roughly 40% without sacrificing performance.

Link: https://arxiv.org/abs/2505.14309
Authors: Ehsan Doostmohammadi, Marco Kuhlmann
Institutions: Linköping University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between query and retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query–context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but substantially improves test-time perplexity and accelerates model learning above a critical threshold. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We specifically generate synthetic context through paraphrasing queries. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.
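As a small illustration, query-context overlap can be approximated as the fraction of query tokens that also appear in the retrieved context; the paper's exact definition may differ:

```python
# Toy overlap statistic: fraction of query tokens present in the context.
# This token-set definition is our simplification, not the paper's.
def query_context_overlap(query: str, context: str) -> float:
    q = set(query.lower().split())
    c = set(context.lower().split())
    return len(q & c) / len(q) if q else 0.0

query = "why does retrieval help language model pretraining"
context = ("Retrieval-augmented pretraining exposes the language model "
           "to neighbors that help predict the continuation.")
print(f"overlap = {query_context_overlap(query, context):.2f}")
```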

[NLP-73] JOLT-SQL: Joint Loss Tuning of Text-to-SQL with Confusion-aware Noisy Schema Sampling

Quick Read: This paper targets the complex multi-stage pipelines and poor robustness to noisy schema information of supervised fine-tuning (SFT) approaches to text-to-SQL. The key to the solution is JOLT-SQL, a streamlined single-stage SFT framework that jointly optimizes schema linking and SQL generation with a unified loss, combining discriminative schema linking enhanced by local bidirectional attention with a confusion-aware noisy schema sampling strategy with selective attention to improve robustness under noisy schema conditions.

Link: https://arxiv.org/abs/2505.14305
Authors: Jinwang Song, Hongying Zan, Kunli Zhang, Lingling Mu, Yingjie Han, Haobo Hua, Min Peng
Institutions: Zhengzhou University; Zhengzhou University of Aeronautics; Wuhan University
Categories: Computation and Language (cs.CL)
Comments: Work in progress. 13 pages, 6 figures

Abstract:Text-to-SQL, which maps natural language to SQL queries, has benefited greatly from recent advances in Large Language Models (LLMs). While LLMs offer various paradigms for this task, including prompting and supervised fine-tuning (SFT), SFT approaches still face challenges such as complex multi-stage pipelines and poor robustness to noisy schema information. To address these limitations, we present JOLT-SQL, a streamlined single-stage SFT framework that jointly optimizes schema linking and SQL generation via a unified loss. JOLT-SQL employs discriminative schema linking, enhanced by local bidirectional attention, alongside a confusion-aware noisy schema sampling strategy with selective attention to improve robustness under noisy schema conditions. Experiments on the Spider and BIRD benchmarks demonstrate that JOLT-SQL achieves state-of-the-art execution accuracy among comparable-size open-source models, while significantly improving both training and inference efficiency.

[NLP-74] Scaling Law for Quantization-Aware Training

Quick Read: This paper examines the scalability of quantization-aware training (QAT) at 4-bit precision (W4A4), in particular the sources of quantization error and their relation to model size, training data volume, and quantization granularity. The key to the solution is a unified QAT scaling law that models quantization error as a function of model size, training data volume, and quantization group size, validated through 268 QAT experiments. By decomposing quantization error into weight and activation components, the authors identify outlier-induced activation quantization error in the FC2 layer as the main bottleneck of W4A4 QAT and mitigate it with mixed-precision quantization, balancing weight and activation quantization errors.

Link: https://arxiv.org/abs/2505.14302
Authors: Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: A unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size

Abstract:Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.
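The abstract specifies only the functional dependence: error falls with model size N and rises with training tokens D and group size G. A toy fit of a power law of that shape, with fabricated data points, might look like:

```python
# Toy fit of a scaling law of the reported form,
#     eps(N, D, G) = k * N^(-a) * D^b * G^c,
# by linear least squares in log space. All data points here are
# fabricated; only the functional form follows the abstract.
import numpy as np

N = np.array([1e8, 1e9, 1e9, 1e10, 1e10])        # model parameters
D = np.array([1e10, 1e10, 1e11, 1e11, 1e12])     # training tokens
G = np.array([128.0, 64.0, 128.0, 64.0, 128.0])  # quantization group size
eps = np.array([0.31, 0.12, 0.18, 0.06, 0.09])   # observed quantization error

# log eps = log k - a*log N + b*log D + c*log G
X = np.column_stack([np.ones_like(eps), -np.log(N), np.log(D), np.log(G)])
coef, *_ = np.linalg.lstsq(X, np.log(eps), rcond=None)
logk, a, b, c = coef
print(f"k={np.exp(logk):.3g}, a={a:.3f}, b={b:.3f}, c={c:.3f}")
```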

[NLP-75] SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors

Quick Read: This paper addresses real-time monitoring of harmful outputs from large language models (LLMs), especially backdoor-triggered responses where specific inputs cause the model to generate unsafe content such as violence, pornography, or hate speech. The key to the solution is an unsupervised approach that treats normal behavior as the baseline and harmful outputs as outliers, analyzing the internal behavioral signatures the model exhibits when generating harmful content to identify true causal indicators rather than surface correlations. To counter deception by advanced models, the authors introduce the Safety-Net framework, whose multiple detectors monitor different representation dimensions and detect harmful behavior even when information is shifted across representational spaces to evade any single monitor.

Link: https://arxiv.org/abs/2505.14300
Authors: Maheep Chaudhary, Fazl Barez
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: High-risk industries like nuclear and aviation use real-time monitoring to detect dangerous system conditions. Similarly, Large Language Models (LLMs) need monitoring safeguards. We propose a real-time framework to predict harmful AI outputs before they occur by using an unsupervised approach that treats normal behavior as the baseline and harmful outputs as outliers. Our study focuses specifically on backdoor-triggered responses — where specific input phrases activate hidden vulnerabilities causing the model to generate unsafe content like violence, pornography, or hate speech. We address two key challenges: (1) identifying true causal indicators rather than surface correlations, and (2) preventing advanced models from deception — deliberately evading monitoring systems. Hence, we approach this problem from an unsupervised lens by drawing parallels to human deception: just as humans exhibit physical indicators while lying, we investigate whether LLMs display distinct internal behavioral signatures when generating harmful content. Our study addresses two critical challenges: (1) designing monitoring systems that capture true causal indicators rather than superficial correlations; and (2) preventing intentional evasion by increasingly capable "future models". Our findings show that models can produce harmful content through causal mechanisms and can become deceptive by: (a) alternating between linear and non-linear representations, and (b) modifying feature relationships. To counter this, we developed Safety-Net — a multi-detector framework that monitors different representation dimensions, successfully detecting harmful behavior even when information is shifted across representational spaces to evade individual monitors. Our evaluation shows 96% accuracy in detecting harmful cases using our unsupervised ensemble approach.

[NLP-76] Cross-Lingual Optimization for Language Transfer in Large Language Models ACL2025

Quick Read: This paper addresses the tendency of standard supervised fine-tuning (SFT) to overemphasize English performance in multilingual settings and to underperform for languages with limited data. The key to the solution is Cross-Lingual Optimization (CLO), which uses publicly available English SFT data and a translation model to transfer an English-centric LLM to a target language while preserving its English capabilities. CLO improves target-language performance with less data, and proves more efficient and robust than SFT in low-resource languages.

Link: https://arxiv.org/abs/2505.14297
Authors: Jungseob Lee, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim
Institutions: Korea University
Categories: Computation and Language (cs.CL)
Comments: Accepted for publication at ACL 2025. Jungseob Lee and Seongtae Hong contributed equally to this work

Abstract:Adapting large language models to other languages typically employs supervised fine-tuning (SFT) as a standard approach. However, it often suffers from an overemphasis on English performance, a phenomenon that is especially pronounced in data-constrained environments. To overcome these challenges, we propose \textbfCross-Lingual Optimization (CLO) that efficiently transfers an English-centric LLM to a target language while preserving its English capabilities. CLO utilizes publicly available English SFT data and a translation model to enable cross-lingual transfer. We conduct experiments using five models on six languages, each possessing varying levels of resource. Our results show that CLO consistently outperforms SFT in both acquiring target language proficiency and maintaining English performance. Remarkably, in low-resource languages, CLO with only 3,200 samples surpasses SFT with 6,400 samples, demonstrating that CLO can achieve better performance with less data. Furthermore, we find that SFT is particularly sensitive to data quantity in medium and low-resource languages, whereas CLO remains robust. Our comprehensive analysis emphasizes the limitations of SFT and incorporates additional training strategies in CLO to enhance efficiency.

[NLP-77] Universal Acoustic Adversarial Attacks for Flexible Control of Speech-LLMs

Quick Read: This paper investigates the security of speech LLMs under adversarial attacks, in particular the threat of universal acoustic adversarial attacks. The key idea is a fixed, universal adversarial audio segment prepended to the original input audio that causes the model to produce no output or to perform a modified task; by conditioning on specific input attributes (such as speaker gender or spoken language), the attack becomes selective, enabling fine-grained control over model outputs. The findings reveal critical vulnerabilities in existing speech LLMs and underline the need for more robust models.

Link: https://arxiv.org/abs/2505.14286
Authors: Rao Ma, Mengjie Qian, Vyas Raina, Mark Gales, Kate Knill
Institutions: ALTA Institute, Department of Engineering, University of Cambridge
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:The combination of pre-trained speech encoders with large language models has enabled the development of speech LLMs that can handle a wide range of spoken language processing tasks. While these models are powerful and flexible, this very flexibility may make them more vulnerable to adversarial attacks. To examine the extent of this problem, in this work we investigate universal acoustic adversarial attacks on speech LLMs. Here a fixed, universal, adversarial audio segment is prepended to the original input audio. We initially investigate attacks that cause the model to either produce no output or to perform a modified task overriding the original prompt. We then extend the nature of the attack to be selective so that it activates only when specific input attributes, such as a speaker gender or spoken language, are present. Inputs without the targeted attribute should be unaffected, allowing fine-grained control over the model outputs. Our findings reveal critical vulnerabilities in Qwen2-Audio and Granite-Speech and suggest that similar speech LLMs may be susceptible to universal adversarial attacks. This highlights the need for more robust training strategies and improved resistance to adversarial attacks.

[NLP-78] YESciEval: Robust LLM -as-a-Judge for Scientific Question Answering ACL2025

Quick Read: This paper addresses the insufficient robustness of large language model (LLM) evaluation in scientific question answering, notably the optimism bias of LLM judges. The key to the solution is the YESciEval framework, which combines fine-grained rubric-based assessment with reinforcement learning to improve objectivity and reliability. By releasing multidisciplinary science QA datasets with adversarial variants, scored by multiple LLMs, the approach enables scalable, low-cost evaluation independent of proprietary models and human feedback.

Link: https://arxiv.org/abs/2505.14279
Authors: Jennifer D'Souza, Hamed Babaei Giglou, Quentin Münch
Institutions: TIB Leibniz Information Centre for Science and Technology; Leibniz Universität Hannover
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, Accepted as a Long Paper at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

Abstract:Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary scienceQA datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry and artificial general intelligence.

[NLP-79] Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data

Quick Read: This paper addresses the limited performance of hate speech detection in low-resource languages caused by scarce labeled data. The key to the solution is using nearest-neighbor retrieval to augment a small amount of labeled data in the target language with the most relevant labeled examples from a large multilingual hate speech detection pool, improving detection. The method outperforms models trained only on target-language data across eight languages, surpasses the state of the art in most cases, and is highly data-efficient and scalable.

Link: https://arxiv.org/abs/2505.14272
Authors: Faeze Ghorbanpour, Daryna Dementieva, Alexander Fraser
Institutions: TU Munich; LMU Munich; Munich Center for Machine Learning
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Multimedia (cs.MM)
Comments:

Abstract: Considering the importance of detecting hateful language, labeled hate speech data is expensive and time-consuming to collect, particularly for low-resource languages. Prior work has demonstrated the effectiveness of cross-lingual transfer learning and data augmentation in improving performance on tasks with limited labeled data. To develop an efficient and scalable cross-lingual transfer learning approach, we leverage nearest-neighbor retrieval to augment minimal labeled data in the target language, thereby enhancing detection performance. Specifically, we assume access to a small set of labeled training instances in the target language and use these to retrieve the most relevant labeled examples from a large multilingual hate speech detection pool. We evaluate our approach on eight languages and demonstrate that it consistently outperforms models trained solely on the target language data. Furthermore, in most cases, our method surpasses the current state-of-the-art. Notably, our approach is highly data-efficient, retrieving as few as 200 instances in some cases while maintaining superior performance. Moreover, it is scalable, as the retrieval pool can be easily expanded, and the method can be readily adapted to new languages and tasks. We also apply maximum marginal relevance to mitigate redundancy and filter out highly similar retrieved instances, resulting in improvements in some languages.
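A minimal sketch of the augmentation step — embed the few target-language seeds with a multilingual encoder and pull nearest labeled neighbors from the pool — is shown below. LaBSE is our choice of encoder, not necessarily the paper's, and the maximum-marginal-relevance dedup step is omitted for brevity:

```python
# Hypothetical sketch of cross-lingual nearest-neighbor augmentation.
# The encoder and k are assumptions; MMR filtering is left out.
import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve_augmentation(seeds, pool_texts, pool_labels, k=200):
    encoder = SentenceTransformer("sentence-transformers/LaBSE")
    q = encoder.encode(seeds, normalize_embeddings=True)
    p = encoder.encode(pool_texts, normalize_embeddings=True)
    sims = q @ p.T                            # cosine similarities
    ranked = np.argsort(-sims.max(axis=0))    # best match over any seed
    return [(pool_texts[i], pool_labels[i]) for i in ranked[:k]]
```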

[NLP-80] FAID: Fine-grained AI-generated Text Detection using Multi-task Auxiliary and Multi-level Contrastive Learning

Quick Read: This paper addresses distinguishing human-written, AI-generated, and human-AI collaborative texts in generative tasks. The key to the solution is FAID, a fine-grained detection framework that classifies text into these three categories while also identifying the underlying AI model family. FAID combines multi-level contrastive learning with multi-task auxiliary classification to capture authorship and model-specific stylistic cues; modeling AI families as distinct stylistic entities improves interpretability, and an adaptation mechanism handles distribution shifts without retraining, notably improving generalization to unseen domains and new AI models.

Link: https://arxiv.org/abs/2505.14271
Authors: Minh Ngoc Ta, Dong Cao Van, Duc-Anh Hoang, Minh Le-Anh, Truong Nguyen, My Anh Tran Nguyen, Yuxia Wang, Preslav Nakov, Sang Dinh
Institutions: BKAI Research Center, Hanoi University of Science and Technology; MBZUAI
Categories: Computation and Language (cs.CL)
Comments:

Abstract: The growing collaboration between humans and AI models in generative tasks has introduced new challenges in distinguishing between human-written, AI-generated, and human-AI collaborative texts. In this work, we collect a multilingual, multi-domain, multi-generator dataset FAIDSet. We further introduce a fine-grained detection framework FAID to classify text into these three categories, meanwhile identifying the underlying AI model family. Unlike existing binary classifiers, FAID is built to capture both authorship and model-specific characteristics. Our method combines multi-level contrastive learning with multi-task auxiliary classification to learn subtle stylistic cues. By modeling AI families as distinct stylistic entities, FAID offers improved interpretability. We incorporate an adaptation to address distributional shifts without retraining for unseen data. Experimental results demonstrate that FAID outperforms several baseline approaches, particularly enhancing the generalization accuracy on unseen domains and new AI models. It provides a potential solution for improving transparency and accountability in AI-assisted writing.

[NLP-81] Think-J: Learning to Think for Generative LLM-as-a-Judge

Quick Read: This paper addresses the underwhelming performance of generative large language models as judges (LLM-as-a-Judge) for response preference modeling and reward modeling. The key to the solution is Think-J, which improves judgment by teaching the model how to think: a small amount of curated data establishes initial judgment-thinking ability, after which reinforcement learning (RL) optimizes the judgment thinking traces, with both offline and online RL variants. Experiments show the approach significantly enhances the evaluation ability of generative LLM judges without requiring extra human annotation.

Link: https://arxiv.org/abs/2505.14268
Authors: Hui Huang, Yancheng He, Hongli Zhou, Rui Zhang, Wei Liu, Weixun Wang, Wenbo Su, Bo Zheng, Jiaheng Liu
Institutions: Alibaba Group; Nanjing University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 14 figures

Abstract:LLM-as-a-Judge refers to the automatic modeling of preferences for responses generated by Large Language Models (LLMs), which is of significant importance for both LLM evaluation and reward modeling. Although generative LLMs have made substantial progress in various tasks, their performance as LLM-Judge still falls short of expectations. In this work, we propose Think-J, which improves generative LLM-as-a-Judge by learning how to think. We first utilized a small amount of curated data to develop the model with initial judgment thinking capabilities. Subsequently, we optimize the judgment thinking traces based on reinforcement learning (RL). We propose two methods for judgment thinking optimization, based on offline and online RL, respectively. The offline RL requires training a critic model to construct positive and negative examples for learning. The online method defines rule-based reward as feedback for optimization. Experimental results showed that our approach can significantly enhance the evaluation capability of generative LLM-Judge, surpassing both generative and classifier-based LLM-Judge without requiring extra human annotations.

[NLP-82] AAPO: Enhance the Reasoning Capabilities of LLM s with Advantage Momentum

Quick Read: This paper targets the training inefficiency of RL-based post-training for large language models (LLMs), particularly when the estimated advantage approaches zero. The key to the solution is Advantage-Augmented Policy Optimization (AAPO), a new RL algorithm that optimizes the cross-entropy (CE) loss with advantages enhanced by a momentum-based estimation scheme, effectively mitigating the inefficiencies of group relative advantage estimation.

Link: https://arxiv.org/abs/2505.14264
Authors: Jian Xiong, Jingbo Zhou, Jingyong Ye, Dejing Dou
Institutions: Fudan University; Baidu Research; BEDI Cloud
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 14 pages, 7 figures

Abstract: Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited chain-of-thought (CoT) data. Among RL-based post-training methods, group relative advantage estimation, as exemplified by Group Relative Policy Optimization (GRPO), has attracted considerable attention for eliminating the dependency on the value model, thereby simplifying training compared to traditional approaches like Proximal Policy Optimization (PPO). However, we observe that existing group relative advantage estimation methods still suffer from training inefficiencies, particularly when the estimated advantage approaches zero. To address this limitation, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy (CE) loss using advantages enhanced through a momentum-based estimation scheme. This approach effectively mitigates the inefficiencies associated with group relative advantage estimation. Experimental results on multiple mathematical reasoning benchmarks demonstrate the superior performance of AAPO.
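AAPO's exact update is not reproducible from the abstract alone. The sketch below layers a running reward baseline on a GRPO-style group-relative advantage, which is our guess at how a momentum term can keep the signal informative when group advantages collapse to zero:

```python
# Hypothetical momentum-augmented advantage on top of a GRPO-style
# group-relative estimate. The mixing rule is our assumption, not AAPO's.
import numpy as np

class MomentumAdvantage:
    def __init__(self, beta: float = 0.9):
        self.beta = beta
        self.baseline = 0.0  # running (momentum) estimate of mean reward

    def __call__(self, group_rewards: np.ndarray) -> np.ndarray:
        group_mean = group_rewards.mean()
        group_adv = group_rewards - group_mean      # GRPO-style advantage
        momentum_term = group_mean - self.baseline  # progress vs. history
        self.baseline = self.beta * self.baseline + (1 - self.beta) * group_mean
        return group_adv + momentum_term

est = MomentumAdvantage()
# All-equal rewards give zero group advantage, yet the momentum term
# still yields a non-zero training signal on the first call.
print(est(np.array([1.0, 1.0, 1.0, 1.0])))
```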

[NLP-83] FuxiMT: Sparsifying Large Language Models for Chinese-Centric Multilingual Machine Translation

Quick Read: This paper addresses weak translation performance for low-resource language pairs in multilingual machine translation (MT) and the need to translate between language pairs without parallel data. The key to the solution is FuxiMT, a Chinese-centric multilingual MT model powered by a sparsified large language model (LLM), trained in two stages: pre-training on a massive Chinese corpus followed by multilingual fine-tuning on a large parallel dataset covering 65 languages. FuxiMT incorporates Mixture-of-Experts (MoEs) and a curriculum learning strategy for robustness across resource levels, performing strongly in low-resource scenarios and exhibiting zero-shot translation on unseen language pairs.

Link: https://arxiv.org/abs/2505.14256
Authors: Shaolin Zhu, Tianyu Dong, Bo Li, Deyi Xiong
Institutions: Tianjin University; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we present FuxiMT, a novel Chinese-centric multilingual machine translation model powered by a sparsified large language model (LLM). We adopt a two-stage strategy to train FuxiMT. We first pre-train the model on a massive Chinese corpus and then conduct multilingual fine-tuning on a large parallel dataset encompassing 65 languages. FuxiMT incorporates Mixture-of-Experts (MoEs) and employs a curriculum learning strategy for robust performance across various resource levels. Experimental results demonstrate that FuxiMT significantly outperforms strong baselines, including state-of-the-art LLMs and machine translation models, particularly under low-resource scenarios. Furthermore, FuxiMT exhibits remarkable zero-shot translation capabilities for unseen language pairs, indicating its potential to bridge communication gaps where parallel data are scarce or unavailable.

[NLP-84] TransBench: Benchmarking Machine Translation for Industrial-Scale Applications

Quick Read: This paper addresses the inadequacy of machine translation (MT) evaluation for industrial settings: general-purpose evaluation frameworks fail to measure translation quality in specialized domains, leaving a gap between academic benchmarks and real-world efficacy. The key to the solution is a three-level translation capability framework covering basic linguistic competence, domain-specific proficiency, and cultural adaptation, together with TransBench, an e-commerce-focused industrial MT benchmark that combines traditional metrics with a domain-specific evaluation model for systematic, comprehensive assessment of industrial MT systems.

Link: https://arxiv.org/abs/2505.14244
Authors: Haijun Li, Tianqi Shi, Zifu Shang, Yuxuan Han, Xueyu Zhao, Hao Wang, Yu Qian, Zhiqiang Qian, Linlong Xu, Minghao Wu, Chenyang Lyu, Longyue Wang, Gongbo Tang, Weihua Luo, Zhao Xu, Kaifu Zhang
Institutions: Alibaba International Digital Commerce; Beijing Language and Culture University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Machine translation (MT) has become indispensable for cross-border communication in globalized industries like e-commerce, finance, and legal services, with recent advancements in large language models (LLMs) significantly enhancing translation quality. However, applying general-purpose MT models to industrial scenarios reveals critical limitations due to domain-specific terminology, cultural nuances, and stylistic conventions absent in generic benchmarks. Existing evaluation frameworks inadequately assess performance in specialized contexts, creating a gap between academic benchmarks and real-world efficacy. To address this, we propose a three-level translation capability framework: (1) Basic Linguistic Competence, (2) Domain-Specific Proficiency, and (3) Cultural Adaptation, emphasizing the need for holistic evaluation across these dimensions. We introduce TransBench, a benchmark tailored for industrial MT, initially targeting international e-commerce with 17,000 professionally translated sentences spanning 4 main scenarios and 33 language pairs. TransBench integrates traditional metrics (BLEU, TER) with Marco-MOS, a domain-specific evaluation model, and provides guidelines for reproducible benchmark construction. Our contributions include: (1) a structured framework for industrial MT evaluation, (2) the first publicly available benchmark for e-commerce translation, (3) novel metrics probing multi-level translation quality, and (4) open-sourced evaluation tools. This work bridges the evaluation gap, enabling researchers and practitioners to systematically assess and enhance MT systems for industry-specific needs.

[NLP-85] Technical Report on classification of literature related to children speech disorder

Quick Read: This report addresses how to systematically classify scientific literature on childhood speech disorders in support of automated literature reviews. The key to the solution is an NLP-based approach that applies two topic modeling techniques, Latent Dirichlet Allocation (LDA) and BERTopic, to analyze the thematic structure of the literature, with a custom stop word list introduced to improve the relevance and precision of the classification.

Link: https://arxiv.org/abs/2505.14242
Authors: Ziang Wang, Amir Aryani
Institutions: Australian National University; Swinburne University of Technology
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:This technical report presents a natural language processing (NLP)-based approach for systematically classifying scientific literature on childhood speech disorders. We retrieved and filtered 4,804 relevant articles published after 2015 from the PubMed database using domain-specific keywords. After cleaning and pre-processing the abstracts, we applied two topic modeling techniques - Latent Dirichlet Allocation (LDA) and BERTopic - to identify latent thematic structures in the corpus. Our models uncovered 14 clinically meaningful clusters, such as infantile hyperactivity and abnormal epileptic behavior. To improve relevance and precision, we incorporated a custom stop word list tailored to speech pathology. Evaluation results showed that the LDA model achieved a coherence score of 0.42 and a perplexity of -7.5, indicating strong topic coherence and predictive performance. The BERTopic model exhibited a low proportion of outlier topics (less than 20%), demonstrating its capacity to classify heterogeneous literature effectively. These results provide a foundation for automating literature reviews in speech-language pathology.
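A small sketch of the described pipeline — LDA over abstracts with a custom stop-word list merged into a standard English list — could look like this; the abstracts and extra stop words are illustrative only:

```python
# Toy LDA pipeline with a domain stop-word list merged into sklearn's
# English list. Data and stop words below are made-up examples.
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "children with speech sound disorder show delayed phoneme acquisition",
    "epileptic behavior and eeg abnormalities observed in young patients",
    "hyperactivity in infants correlates with later language delay",
]
domain_stop = ["children", "patients", "infants", "study"]  # hypothetical
stop_words = list(text.ENGLISH_STOP_WORDS.union(domain_stop))

vec = CountVectorizer(stop_words=stop_words)
doc_term = vec.fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)

terms = vec.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {t}: {top}")
```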

[NLP-86] ABBA: Highly Expressive Hadamard Product Adaptation for Large Language Models

Quick Read: This paper addresses how to adapt large language models efficiently to new domains by introducing lightweight trainable modules while keeping most pre-trained weights fixed. The key to the proposed ABBA architecture is reparameterizing the update as the Hadamard product of two independently learnable low-rank matrices, fully decoupling the update from the pre-trained weights and achieving higher expressivity under the same parameter budget.

Link: https://arxiv.org/abs/2505.14238
Authors: Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Praneeth Vepakomma
Institutions: Mohamed bin Zayed University of Artificial Intelligence; Duke University; Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Raghav Singhal, Kaustubh Ponkshe, and Rohit Vartak contributed equally to this work

Abstract:Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget. We formally analyze ABBA’s expressive capacity and validate its advantages through matrix reconstruction experiments. Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is publicly available at: this https URL.
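The core mechanism is stated plainly: the update is the Hadamard product of two independently learnable low-rank products. A toy layer of that form follows; the initialization choices are ours, not the paper's:

```python
# ABBA-style update sketch: delta W = (B1 @ A1) * (B2 @ A2), a Hadamard
# product of two low-rank factor pairs, decoupled from the frozen W.
# Zero-initializing A2 makes delta W start at zero (our choice).
import torch
import torch.nn as nn

class ABBALinear(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 4):
        super().__init__()
        out_f, in_f = weight.shape
        self.register_buffer("W", weight)                 # frozen weight
        self.B1 = nn.Parameter(torch.randn(out_f, rank) * 0.02)
        self.A1 = nn.Parameter(torch.randn(rank, in_f) * 0.02)
        self.B2 = nn.Parameter(torch.ones(out_f, rank))
        self.A2 = nn.Parameter(torch.zeros(rank, in_f))   # delta starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = (self.B1 @ self.A1) * (self.B2 @ self.A2)  # Hadamard product
        return x @ (self.W + delta).T

layer = ABBALinear(torch.randn(16, 32))
print(layer(torch.randn(2, 32)).shape)  # torch.Size([2, 16])
```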

[NLP-87] Mechanistic Fine-tuning for In-context Learning

Quick Read: This paper addresses the performance gap caused by the mismatch between pre-training data and ICL-style data when language models (LMs) perform few-shot learning via in-context learning (ICL). To reduce the computational cost of end-to-end fine-tuning, the key solution is Attention Behavior Fine-Tuning (ABFT), which, building on prior findings about ICL's internal mechanism, constructs training objectives on attention scores rather than final outputs, steering the model to attend to the correct label tokens in the context and suppressing attention to wrong label tokens.

Link: https://arxiv.org/abs/2505.14233
Authors: Hakaze Cho, Peng Luo, Mariko Kato, Rin Kaenbyou, Naoya Inoue
Institutions: Japan Advanced Institute of Science and Technology; Beijing Institute of Technology; RIKEN
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 31 figures, 6 tables

Abstract: In-context Learning (ICL) utilizes structured demonstration-query inputs to induce few-shot learning on Language Models (LMs), which are not originally pre-trained on ICL-style data. To bridge the gap between ICL and pre-training, some approaches fine-tune LMs on large ICL-style datasets by an end-to-end paradigm with massive computational costs. To reduce such costs, in this paper, we propose Attention Behavior Fine-Tuning (ABFT), utilizing the previous findings on the inner mechanism of ICL, building training objectives on the attention scores instead of the final outputs, to force the attention scores to focus on the correct label tokens presented in the context and mitigate attention scores from the wrong label tokens. Our experiments on 9 modern LMs and 8 datasets empirically find that ABFT outperforms previous methods in performance, robustness, unbiasedness, and efficiency, at only around 0.01% of their data cost. Moreover, our subsequent analysis finds that the end-to-end training objective contains the ABFT objective, suggesting the implicit bias of ICL-style data to the emergence of induction heads. Our work demonstrates the possibility of controlling specific module sequences within LMs to improve their behavior, opening up the future application of mechanistic interpretability.
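A rough sketch of an ABFT-style objective on a single layer's attention — reward mass on correct in-context label tokens and penalize mass on wrong ones — is shown below. How ABFT aggregates heads and layers is not reproduced here; the margin form is our assumption:

```python
# Hypothetical attention-score objective in the spirit of ABFT:
# minimize mass on wrong label positions, maximize it on correct ones.
import torch

def abft_style_loss(attn: torch.Tensor, correct_idx, wrong_idx) -> torch.Tensor:
    """attn: (heads, seq_len) attention from the final query token."""
    good = attn[:, correct_idx].sum(dim=-1)  # mass on correct label tokens
    bad = attn[:, wrong_idx].sum(dim=-1)     # mass on wrong label tokens
    return (bad - good).mean()               # lower is better

attn = torch.softmax(torch.randn(8, 12), dim=-1)  # toy 8-head attention row
print(abft_style_loss(attn, correct_idx=[3, 7], wrong_idx=[5, 9]))
```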

[NLP-88] “Haet Bhasha aur Diskrimineshun”: Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs

Quick Read: This paper addresses the safety of multilingual multimodal large language models (LLMs) under code-mixing and phonetic perturbation attacks: existing red-teaming focuses on English, leaving models vulnerable in multilingual settings. The key to the solution is a novel jailbreak strategy that applies phonetic misspellings to sensitive words in code-mixed prompts, effectively bypassing safety filters while keeping prompts interpretable. The method achieves a 99% attack success rate for text generation and 78% for image generation, and interpretability experiments show that phonetic perturbations affect tokenization, which underlies jailbreak success.

Link: https://arxiv.org/abs/2505.14226
Authors: Darpan Aswal, Siddharth D Jaiswal
Institutions: Université Paris-Saclay; IIT Kharagpur
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have become increasingly powerful, with multilingual and multimodal capabilities improving by the day. These models are being evaluated through audits, alignment studies and red-teaming efforts to expose model vulnerabilities towards generating harmful, biased and unfair content. Existing red-teaming efforts have previously focused on the English language, using fixed template-based attacks; thus, models continue to be susceptible to multilingual jailbreaking strategies, especially in the multimodal context. In this study, we introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks. We also introduce two new jailbreak strategies that show higher effectiveness than baseline strategies. Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts. Our novel prompts achieve a 99% Attack Success Rate for text generation and 78% for image generation, with Attack Relevance Rate of 100% for text generation and 95% for image generation when using the phonetically perturbed code-mixed prompts. Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success. Our study motivates increasing the focus towards more generalizable safety alignment for multilingual multimodal models, especially in real-world settings wherein prompts can have misspelt words.

[NLP-89] Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning

Quick Read: This paper investigates the limits of reinforcement learning with verifiable rewards (RLVR) in improving language model capability, and the mechanisms by which distillation improves accuracy and capability. The key findings: RLVR raises overall accuracy but does not improve capability, because it favors accuracy on easier questions at the expense of the hardest ones; distillation reliably improves accuracy by learning strong reasoning patterns, but only improves capability when new knowledge is introduced — when distilling reasoning patterns alone, easy-question accuracy improves at the expense of the hardest questions, mirroring RLVR.

Link: https://arxiv.org/abs/2505.14216
Authors: Minwu Kim, Anubhav Shrestha, Safal Shrestha, Aadim Nepal, Keith Ross
Institutions: New York University Abu Dhabi
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 23 pages

Abstract:Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) enhances overall accuracy but fails to improve capability, while distillation can improve both. In this paper, we investigate the mechanisms behind these phenomena. First, we demonstrate that RLVR does not improve capability because it focuses on improving the accuracy of the less-difficult questions to the detriment of the accuracy of the most difficult questions, thereby leading to no improvement in capability. Second, we find that RLVR does not merely increase the success probability for the less difficult questions, but in our small model settings produces quality responses that were absent in its output distribution before training. In addition, we show these responses are neither noticeably longer nor feature more reflection-related keywords, underscoring the need for more reliable indicators of response quality. Third, we show that while distillation reliably improves accuracy by learning strong reasoning patterns, it only improves capability when new knowledge is introduced. Moreover, when distilling only with reasoning patterns and no new knowledge, the accuracy of the less-difficult questions improves to the detriment of the most difficult questions, similar to RLVR. Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in language models.
zh

[NLP-90] Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks

【速读】: 该论文旨在解决当前问答(QA)系统在处理需要复杂推理或实时知识整合的查询时存在的不足,以及检索增强生成(RAG)方法在处理多源信息间的逻辑连接方面所面临的挑战。其解决方案的关键在于通过自动化生成基于上下文的QA对来增强大型语言模型(LLMs)在知识密集型QA任务中的表现,利用LLMs自动生成微调数据,减少对人工标注的依赖,并提升模型的理解与推理能力。

链接: https://arxiv.org/abs/2505.14212
作者: Sizhe Yuen,Ting Su,Ziyang Wang,Yali Du,Adam J. Sobey
机构: The Alan Turing Institute (艾伦·图灵研究所); King’s College London (伦敦国王学院); University of Southampton (南安普顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A question-answering (QA) system searches for suitable answers within a knowledge base. Current QA systems struggle with queries requiring complex reasoning or real-time knowledge integration, and are often supplemented with retrieval techniques over a data source, as in Retrieval-Augmented Generation (RAG). However, RAG continues to face challenges in handling complex reasoning and logical connections between multiple sources of information. A novel approach for enhancing Large Language Models (LLMs) in knowledge-intensive QA tasks is presented through the automated generation of context-based QA pairs. This methodology leverages LLMs to create fine-tuning data, reducing reliance on human labelling and improving model comprehension and reasoning capabilities. The proposed system includes an automated QA generator and a model fine-tuner, evaluated using perplexity, ROUGE, BLEU, and BERTScore. Comprehensive experiments demonstrate improvements in logical coherence and factual accuracy, with implications for developing adaptable Artificial Intelligence (AI) systems. Mistral-7b-v0.3 outperforms Llama-3-8b, with BERT F1, BLEU, and ROUGE scores of 0.858, 0.172, and 0.260 for the LLM-generated QA pairs, compared to scores of 0.836, 0.083, and 0.139 for the human-annotated QA pairs.
zh

[NLP-91] Unraveling Interwoven Roles of Large Language Models in Authorship Privacy: Obfuscation Mimicking and Verification

【速读】: 该论文试图解决在生成式 AI (Generative AI) 背景下,作者隐私保护中作者身份混淆(Authorship Obfuscation, AO)、作者模仿(Authorship Mimicking, AM)和作者验证(Authorship Verification, AV)三者之间的动态关系问题。现有研究多独立探讨这三个任务,而其相互作用尚未被充分探索,尤其在 LLMs 深刻影响用户生成内容的创作与分享方式的背景下,这一问题愈发突出。该论文提出首个统一框架,用于分析 LLM 支持下的 AO、AM 和 AV 之间的动态交互,关键在于量化它们如何相互作用以改变人类撰写的文本,并考察人口统计学元数据对任务性能、跨任务动态及隐私风险的影响。

链接: https://arxiv.org/abs/2505.14195
作者: Tuc Nguyen,Yifan Hu,Thai Le
机构: Indiana University (印第安纳大学); Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注: 17 pages, 3 figures

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have been fueled by large-scale training corpora drawn from diverse sources such as websites, news articles, and books. These datasets often contain explicit user information, such as person names and addresses, that LLMs may unintentionally reproduce in their generated outputs. Beyond such explicit content, LLMs can also leak identity-revealing cues through implicit signals such as distinctive writing styles, raising significant concerns about authorship privacy. There are three major automated tasks in authorship privacy, namely authorship obfuscation (AO), authorship mimicking (AM), and authorship verification (AV). Prior research has studied AO, AM, and AV independently. However, their interplay remains underexplored, which leaves a major research gap, especially in the era of LLMs, where they are profoundly shaping how we curate and share user-generated content, and the distinction between machine-generated and human-authored text is also increasingly blurred. This work then presents the first unified framework for analyzing the dynamic relationships among LLM-enabled AO, AM, and AV in the context of authorship privacy. We quantify how they interact with each other to transform human-authored text, examining effects at a single point in time and iteratively over time. We also examine the role of demographic metadata, such as gender and academic background, in modulating their performances, inter-task dynamics, and privacy risks. All source code will be publicly available.
zh

[NLP-92] Safety Subspaces are Not Distinct: A Fine-Tuning Case Study

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在安全对齐(safety alignment)过程中出现的脆弱性问题,即在进一步微调后,模型可能重新引入有害行为。其解决方案的关键在于从几何视角探讨安全相关行为是否存在于可识别的权重空间子空间中,并评估这些子空间是否可用于防御对齐偏差。研究发现,安全与不安全行为在子空间中相互增强,且不同安全含义的提示激活了重叠的内部表示,未发现能选择性控制安全性的子空间,这表明安全行为并非局部化于特定几何方向,而是源于模型更广泛学习动态中的纠缠高影响组件。

链接: https://arxiv.org/abs/2505.14185
作者: Kaustubh Ponkshe,Shaan Shah,Raghav Singhal,Praneeth Vepakomma
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); University of California San Diego (加利福尼亚大学圣地亚哥分校); Massachusetts Institute of Technology (麻省理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Kaustubh Ponkshe, Shaan Shah, and Raghav Singhal contributed equally to this work

点击查看摘要

Abstract:Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. This is typically achieved through instruction tuning and reinforcement learning from human feedback. However, this alignment is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable geometric directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this geometric perspective. We examine whether safety-relevant behavior is concentrated in specific subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in internal representations. Across both parameter and activation space, our findings are consistent: subspaces that amplify safe behaviors also amplify unsafe ones, and prompts with different safety implications activate overlapping representations. We find no evidence of a subspace that selectively governs safety. These results challenge the assumption that alignment is geometrically localized. Rather than residing in distinct directions, safety appears to emerge from entangled, high-impact components of the model’s broader learning dynamics. This suggests that subspace-based defenses may face fundamental limitations and underscores the need for alternative strategies to preserve alignment under continued training. We corroborate these findings through multiple experiments on five open-source LLMs. Our code is publicly available at: this https URL.
zh
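
论文的几何分析思路可以用如下极简 Python 草图体会:从“安全微调”的权重增量中取 SVD 子空间,再测量“有害微调”的增量有多少能量落在该子空间内。矩阵为随机占位数据,子空间维度 k 等均为假设取值,仅示意分析流程:

```python
import numpy as np

# 极简示意:用 SVD 从"安全微调"的权重增量中提取子空间,
# 并测量另一组(不安全)增量在该子空间内的能量占比。
# 矩阵均为随机示例,仅说明论文所检验的几何分析思路,并非其完整实验。
rng = np.random.default_rng(0)
delta_safe = rng.normal(size=(256, 256))    # 假设:安全对齐引起的权重变化
delta_unsafe = rng.normal(size=(256, 256))  # 假设:有害微调引起的权重变化

U, S, _ = np.linalg.svd(delta_safe, full_matrices=False)
k = 16
P = U[:, :k] @ U[:, :k].T                   # 投影到"安全子空间"

energy_in = np.linalg.norm(P @ delta_unsafe) ** 2
energy_total = np.linalg.norm(delta_unsafe) ** 2
print(f"unsafe delta 落在安全子空间内的能量占比: {energy_in / energy_total:.3f}")
# 若安全行为真的几何局部化,该占比应显著偏低;论文的发现恰恰相反。
```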

[NLP-93] ThinkSwitcher: When to Think Hard, When to Think Fast

【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRs)在处理简单任务时因过度推理而导致的计算开销过大的问题。解决方案的关键在于提出ThinkSwitcher框架,该框架通过设计有效的提示(prompt)来激发模型进行高效短链推理(short CoT),并引入一个轻量级的切换模块,根据任务复杂度动态切换模型的推理模式,从而在保持复杂任务高精度的同时降低计算成本。

链接: https://arxiv.org/abs/2505.14183
作者: Guosheng Liang,Longguang Zhong,Ziyi Yang,Xiaojun Quan
机构: Sun Yat-sen University (中山大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large reasoning models (LRMs) excel at solving complex tasks by leveraging long chain-of-thought (CoT) reasoning. However, this often leads to overthinking on simple tasks, resulting in unnecessary computational overhead. We observe that LRMs inherently possess the capability for efficient short CoT reasoning, which can be reliably elicited through prompt design. To leverage this capability, we propose ThinkSwitcher, a framework that enables a single LRM to dynamically switch between short and long CoT modes based on task complexity. ThinkSwitcher introduces a lightweight switching module trained with supervision signals derived from the relative performance of each reasoning mode across tasks. Experiments on multiple reasoning benchmarks show that ThinkSwitcher reduces computational cost by 20-30% while maintaining high accuracy on complex tasks. This demonstrates the effectiveness of ThinkSwitcher as a scalable and efficient solution for unified LRM deployment.
zh
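
下面用 scikit-learn 给出切换模块的一个极简示意(玩具数据与特征选择均为假设,并非论文实现;论文中的监督信号来自两种推理模式在任务上的相对表现):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# 极简示意:一个轻量"切换模块",根据问题文本预测应走短 CoT 还是长 CoT。
questions = ["1加1等于几?", "2的3次方是多少?",
             "证明对任意正整数 n, n^3 - n 能被 6 整除",
             "求解微分方程 y'' + y = 0 的通解"]
labels = [0, 0, 1, 1]   # 0 = 短 CoT 足够, 1 = 需要长 CoT(玩具标签)

vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
X = vec.fit_transform(questions)
clf = LogisticRegression().fit(X, labels)

def switch_mode(question: str) -> str:
    need_long = int(clf.predict(vec.transform([question]))[0])
    return "long-CoT" if need_long else "short-CoT"

print(switch_mode("3加4等于几?"))
print(switch_mode("证明 n^2 + n 恒为偶数"))
```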

[NLP-94] SlangDIT: Benchmarking LLM s in Interpretative Slang Translation

【速读】: 该论文试图解决俚语翻译中因语境依赖的语义扩展而导致的翻译不准确问题,其核心挑战在于如何有效捕捉俚语在特定语境下的隐含意义。解决方案的关键在于提出了一种名为SlangDIT的解释性俚语翻译任务,该任务包含三个子任务:俚语检测、跨语言俚语解释和上下文中的俚语翻译,通过整合俚语检测与解释过程来提升翻译的准确性。为支持该任务,研究者构建了一个包含25k英文-中文句对的SlangDIT数据集,并提出了一种深度思考模型SlangOWL,该模型能够识别句子中的俚语、判断其多义性并分析可能含义,最终提供符合语境的最佳解释与翻译。实验表明,SlangOWL显著优于基线模型和监督微调模型。

链接: https://arxiv.org/abs/2505.14181
作者: Yunlong Liang,Fandong Meng,Jiaan Wang,Jie Zhou
机构: Tencent Inc (腾讯公司)
类目: Computation and Language (cs.CL)
备注: work in progress

点击查看摘要

Abstract:The challenge of slang translation lies in capturing context-dependent semantic extensions, as slang terms often convey meanings beyond their literal interpretation. While slang detection, explanation, and translation have been studied as isolated tasks in the era of large language models (LLMs), their intrinsic interdependence remains underexplored. The main reason is the lack of a benchmark where the first two tasks can serve as prerequisites for the third, which would facilitate idiomatic translation. In this paper, we introduce the interpretative slang translation task (named SlangDIT) consisting of three sub-tasks: slang detection, cross-lingual slang explanation, and slang translation within the current context, aiming to generate more accurate translation with the help of slang detection and slang explanation. To this end, we construct a SlangDIT dataset, containing over 25k English-Chinese sentence pairs. Each source sentence mentions at least one slang term and is labeled with the corresponding cross-lingual slang explanation. Based on the benchmark, we propose a deep thinking model, named SlangOWL. It first identifies whether the sentence contains a slang term, and then judges whether the slang is polysemous and analyzes its possible meanings. Further, SlangOWL provides the best explanation of the slang term for the current context. Finally, based on the whole thought process, SlangOWL offers a suitable translation. Our experiments on LLMs (e.g., Qwen2.5 and Llama-3.1) show that our deep thinking approach indeed enhances the performance of LLMs, where the proposed SlangOWL significantly surpasses the vanilla models and supervised fine-tuned models without thinking.
zh

[NLP-95] Enhancing Abstractive Summarization of Scientific Papers Using Structure Information

【速读】: 该论文试图解决科学论文的摘要生成中存在的两个主要问题:现有方法多依赖于Encoder-Decoder架构,将论文视为词序列,无法充分捕捉科学论文中固有的结构信息;以及现有研究通过关键词映射或特征工程识别结构信息,但这些方法在面对科学论文的结构灵活性和跨学科鲁棒性时表现不佳。解决方案的关键在于提出一个两阶段的抽象摘要框架,该框架利用科学论文中结构功能的自动识别。第一阶段通过标准化章节标题并构建大规模数据集来识别关键结构组件(如背景、方法、结果、讨论),第二阶段采用Longformer模型捕获跨部分的丰富上下文关系,从而生成更具上下文感知的摘要。

链接: https://arxiv.org/abs/2505.14179
作者: Tong Bao,Heng Zhang,Chengzhi Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Abstractive summarization of scientific papers has always been a research focus, yet existing methods face two main challenges. First, most summarization models rely on Encoder-Decoder architectures that treat papers as sequences of words and thus fail to fully capture the structured information inherent in scientific papers. Second, existing research often uses keyword mapping or feature engineering to identify the structural information, but these methods struggle with the structural flexibility of scientific papers and lack robustness across different disciplines. To address these challenges, we propose a two-stage abstractive summarization framework that leverages automatic recognition of structural functions within scientific papers. In the first stage, we standardize chapter titles from numerous scientific papers and construct a large-scale dataset for structural function recognition. A classifier is then trained to automatically identify the key structural components (e.g., Background, Methods, Results, Discussion), which provides a foundation for generating more balanced summaries. In the second stage, we employ Longformer to capture rich contextual relationships across sections and generate context-aware summaries. Experiments conducted on two domain-specific scientific paper summarization datasets demonstrate that our method outperforms advanced baselines and generates more comprehensive summaries. The code and dataset can be accessed at this https URL.
zh

[NLP-96] Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits

【速读】: 该论文试图解决语言模型在符号推理任务中表现受限的问题,特别是由于分词(tokenization)结构对逻辑对齐和符号计算的干扰。其解决方案的关键在于引入“Token Awareness”概念,以量化分词粒度不足如何破坏逻辑一致性,并通过使用原子对齐的格式提升模型的符号推理能力,从而实现更有效的泛化。研究结果表明,合理的分词策略能够显著提升模型在结构化推理任务中的表现,甚至使小型模型超越大型系统。

链接: https://arxiv.org/abs/2505.14178
作者: Xiang Zhang,Juntai Cao,Jiaqi Wei,Yiwei Xu,Chenyu You
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tokenization is the first, and often underappreciated, layer of computation in language models. While Chain-of-Thought (CoT) prompting enables transformer models to approximate recurrent computation by externalizing intermediate steps, we show that the success of such reasoning is fundamentally bounded by the structure of tokenized inputs. This work presents a theoretical and empirical investigation into how tokenization schemes, particularly subword-based methods like byte-pair encoding (BPE), impede symbolic computation by merging or obscuring atomic reasoning units. We introduce the notion of Token Awareness to formalize how poor token granularity disrupts logical alignment and prevents models from generalizing symbolic procedures. Through systematic evaluation on arithmetic and symbolic tasks, we demonstrate that token structure dramatically affects reasoning performance, causing failure even with CoT, while atomically-aligned formats unlock strong generalization, allowing small models (e.g., GPT-4o-mini) to outperform larger systems (e.g., o1) in structured reasoning. Our findings reveal that symbolic reasoning ability in LLMs is not purely architectural, but deeply conditioned on token-level representations.
zh
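
下面的 Python 草图用一个玩具 BPE 演示原子推理单元(字符)如何被子词合并所遮蔽,并计算一个假设的简化版 “Token Awareness” 指标(正式定义以论文为准):

```python
# 极简示意:比较两种输入格式下字符级原子单元的可见性。
# merges 合并表与指标定义均为玩具简化,并非论文的真实 BPE 词表或正式度量。
def toy_bpe(word: str, merges=("st", "ra", "be", "er", "ry")) -> list:
    tokens = list(word)
    changed = True
    while changed:
        changed = False
        for m in merges:
            for i in range(len(tokens) - 1):
                if tokens[i] + tokens[i + 1] == m:
                    tokens[i:i + 2] = [m]
                    changed = True
                    break
    return tokens

def token_awareness(tokens: list) -> float:
    """单字符 token 的占比:越低说明原子单元被合并得越厉害。"""
    return sum(len(t) == 1 for t in tokens) / len(tokens)

word = "strawberry"
merged = toy_bpe(word)                     # 子词格式
atomic = "s t r a w b e r r y".split()     # 原子对齐格式
print(merged, token_awareness(merged))
print(atomic, token_awareness(atomic))     # 原子格式下每个字母都可见
```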

[NLP-97] Cheaper Better Faster Stronger: Robust Text-to-SQL without Chain-of-Thought or Fine-Tuning

【速读】: 该论文试图解决在文本到SQL(text-to-SQL)任务中,使用大语言模型(Large Language Models, LLMs)进行代码生成时成本过高的问题。现有方法如链式思维(Chain-of-Thought, CoT)、自洽性以及微调等虽然有效,但推理过程需要大量LLM调用,导致单次查询平均成本高达0.46美元,而微调模型的成本更是可达数千美元。论文提出的解决方案“N-rep一致性”通过利用同一模式输入的多种表示形式来弥补单一表示的不足,从而提高系统的鲁棒性,并允许使用更小、更经济的模型,无需任何推理或微调,实现了在低成本范围内(每查询仅0.039美元)与高成本方法相当的BIRD基准性能。

链接: https://arxiv.org/abs/2505.14174
作者: Yusuf Denizay Dönder,Derek Hommel,Andrea W Wen-Yi,David Mimno,Unso Eun Seo Jo
机构: Gena Co.; Cornell University
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLMs are effective at code generation tasks like text-to-SQL, but is it worth the cost? Many state-of-the-art approaches use non-task-specific LLM techniques including Chain-of-Thought (CoT), self-consistency, and fine-tuning. These methods can be costly at inference time, sometimes requiring over a hundred LLM calls with reasoning, incurring average costs of up to $0.46 per query, while fine-tuning models can cost thousands of dollars. We introduce “N-rep” consistency, a more cost-efficient text-to-SQL approach that achieves similar BIRD benchmark scores as other more expensive methods, at only $0.039 per query. N-rep leverages multiple representations of the same schema input to mitigate weaknesses in any single representation, making the solution more robust and allowing the use of smaller and cheaper models without any reasoning or fine-tuning. To our knowledge, N-rep is the best-performing text-to-SQL approach in its cost range.
zh
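
N-rep 一致性的核心流程可以用如下极简骨架体会:对同一 schema 的 N 种表示各生成一条 SQL,再按执行结果投票。其中 generate_sql 与 execute 为假设的占位函数:

```python
from collections import Counter

# 极简示意:N-rep 一致性。对同一 schema 给出 N 种文本表示,各生成一条 SQL,
# 再按执行结果投票选最稳健的一条。generate_sql / execute 均为假设的占位函数。
def generate_sql(question: str, schema_repr: str) -> str:
    return f"SELECT ... /* based on: {schema_repr[:20]} */"  # 占位:实际应调用 LLM

def execute(sql: str):
    return ("rows",)  # 占位:实际应在数据库上执行并返回结果(需可哈希)

def n_rep_consistency(question: str, schema_reprs: list) -> str:
    candidates = [generate_sql(question, r) for r in schema_reprs]
    by_result = Counter(execute(sql) for sql in candidates)
    winning_result, _ = by_result.most_common(1)[0]
    # 返回第一条产生多数派执行结果的 SQL
    return next(s for s in candidates if execute(s) == winning_result)

reprs = ["CREATE TABLE users(...)",
         "users(id, name, ...) 的自然语言描述",
         "Markdown 表格形式的 schema"]
print(n_rep_consistency("列出所有用户", reprs))
```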

[NLP-98] THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing for Neural Machine Translation ACL2025

【速读】: 该论文旨在解决现有稀疏混合专家(Sparse Mixture-of-Experts, MoE)在神经机器翻译(Neural Machine Translation, NMT)中存在的两个局限性:一是直接利用任务知识(如领域/语言特定知识)进行专家分配,而这些知识在实际应用中通常不可用,并忽略了自然分组的领域/语言特性;二是专家选择仅依赖局部token表示,未考虑上下文信息,无法从全局视角把握每个token的状态。解决方案的关键在于引入THOR-MoE,通过引入分层任务引导和上下文响应的路由策略,首先预测领域/语言标签并提取混合表示以分层分配任务级专家,随后注入上下文信息以增强从预选任务级专家集中的token路由,从而实现更精准的专家分配。

链接: https://arxiv.org/abs/2505.14173
作者: Yunlong Liang,Fandong Meng,Jie Zhou
机构: Tencent Inc (腾讯公司)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 main conference

点击查看摘要

Abstract:The sparse Mixture-of-Experts (MoE) has achieved significant progress for neural machine translation (NMT). However, there exist two limitations in current MoE solutions which may lead to sub-optimal performance: 1) they directly use the task knowledge of NMT in MoE (e.g., domain/linguistics-specific knowledge), which is generally unavailable in practical applications, and neglect the naturally grouped domain/linguistic properties; 2) the expert selection only depends on the localized token representation, without considering the context that would capture the state of each token from a global view. To address the above limitations, we propose THOR-MoE, arming the MoE with hierarchical task-guided and context-responsive routing policies. Specifically, it 1) first predicts the domain/language label and then extracts a mixed domain/language representation to allocate task-level experts in a hierarchical manner; and 2) injects context information to enhance token routing from the pre-selected task-level expert set, which helps each token to be accurately routed to more specialized and suitable experts. Extensive experiments on multi-domain translation and multilingual translation benchmarks with different architectures consistently demonstrate the superior performance of THOR-MoE. Additionally, THOR-MoE operates as a plug-and-play module compatible with existing Top-k (Shazeer et al., 2017) and Top-p (Huang et al., 2024) routing schemes, ensuring broad applicability across diverse MoE architectures. For instance, compared with vanilla Top-p (Huang et al., 2024) routing, the context-aware manner can achieve an average improvement of 0.75 BLEU with less than 22% activated parameters on multi-domain translation tasks.
zh
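
下面用 PyTorch 给出分层路由的一个极简示意(维度、专家划分与上下文构造均为假设的简化,并非论文的实际结构):

```python
import torch
import torch.nn as nn

# 极简示意:先用序列级表示预测领域以缩小专家候选集,
# 再在候选集内做融合了上下文的 token 级 top-k 路由。
d, n_experts, n_domains, k = 32, 8, 4, 2
domain_head = nn.Linear(d, n_domains)
token_router = nn.Linear(2 * d, n_experts)   # 输入 = [token 表示; 上下文表示]
experts_of_domain = {i: list(range(i * 2, i * 2 + 2)) for i in range(n_domains)}

x = torch.randn(5, d)                        # 一个长度为 5 的序列
context = x.mean(dim=0, keepdim=True)        # 用均值充当全局上下文(假设)
domain = domain_head(context).argmax(-1).item()
allowed = experts_of_domain[domain]          # 任务级预选的专家集合

logits = token_router(torch.cat([x, context.expand_as(x)], dim=-1))
mask = torch.full_like(logits, float("-inf"))
mask[:, allowed] = 0.0                       # 屏蔽候选集之外的专家
topk = (logits + mask).topk(k, dim=-1).indices
print(f"domain={domain}, 每个 token 的 top-{k} 专家: {topk.tolist()}")
```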

[NLP-99] The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在字符级任务(如统计单词中的字母数量)上表现不佳的问题,其根本原因是分词(tokenization)的局限性。论文提出的关键解决方案是通过一种轻量级的架构修改,显著提升模型的字符级推理能力,同时保留子词模型的归纳优势。该方法基于概念涌现的渗透模型,表明学习字符组合与学习常识知识在本质上并无差异,从而为理解并缓解分词模型的结构性盲点提供了理论依据。

链接: https://arxiv.org/abs/2505.14172
作者: Adrian Cosma,Stefan Ruseti,Emilian Radoi,Mihai Dascalu
机构: National University of Science and Technology POLITEHNICA Bucharest (国家科学与技术理工大学布加勒斯特理工学院)
类目: Computation and Language (cs.CL)
备注: 1 Table, 8 Figures

点击查看摘要

Abstract:Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge slowly, suddenly, and only late in training. We further show that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
zh

[NLP-100] PL-FGSA: A Prompt Learning Framework for Fine-Grained Sentiment Analysis Based on MindSpore

【速读】: 该论文旨在解决细粒度情感分析(Fine-grained Sentiment Analysis, FGSA)中传统方法依赖任务特定架构和大量标注数据的问题,从而限制了模型的泛化能力和可扩展性。其解决方案的关键在于提出PL-FGSA,一个基于提示学习(Prompt Learning)的统一框架,该框架结合了提示设计与轻量级TextCNN主干网络,将FGSA重新定义为多任务提示增强生成问题,联合处理方面抽取、情感分类和因果解释,在统一范式下提升模型的可解释性与性能。

链接: https://arxiv.org/abs/2505.14165
作者: Zhenkai Qin,Jiajing He,Qiao Fang
机构: Guangxi Police College (广西警察学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fine-grained sentiment analysis (FGSA) aims to identify sentiment polarity toward specific aspects within a text, enabling more precise opinion mining in domains such as product reviews and social media. However, traditional FGSA approaches often require task-specific architectures and extensive annotated data, limiting their generalization and scalability. To address these challenges, we propose PL-FGSA, a unified prompt learning-based framework implemented using the MindSpore platform, which integrates prompt design with a lightweight TextCNN backbone. Our method reformulates FGSA as a multi-task prompt-augmented generation problem, jointly tackling aspect extraction, sentiment classification, and causal explanation in a unified paradigm. By leveraging prompt-based guidance, PL-FGSA enhances interpretability and achieves strong performance under both full-data and low-resource conditions. Experiments on three benchmark datasets (SST-2, SemEval-2014 Task 4, and MAMS) demonstrate that our model consistently outperforms traditional fine-tuning methods and achieves F1-scores of 0.922, 0.694, and 0.597, respectively. These results validate the effectiveness of prompt-based generalization and highlight the practical value of PL-FGSA for real-world sentiment analysis tasks.
zh

[NLP-101] Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models

【速读】: 该论文试图解决多语言视觉-语言模型(Multilingual Vision-Language Models)中存在的社会偏见问题,特别是种族和性别偏见及其在不同语言间的传播机制。其解决方案的关键在于对三个公开的多语言CLIP检查点(M-CLIP、NLLB-CLIP和CAPIVARA-CLIP)进行系统性审计,通过在零样本设置下使用平衡的数据集(如\textscFairFace和\textscPATA)量化偏见并评估刻板印象放大效应,揭示多语言模型在低资源语言和语法性别复杂语言中的偏见表现及其跨语言传播路径。

链接: https://arxiv.org/abs/2505.14160
作者: Zahraa Al Sahili,Ioannis Patras,Matthew Purver
机构: Queen Mary University of London (伦敦玛丽女王大学); Institut Jožef Stefan (约泽夫·斯蒂芬研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multilingual vision-language models promise universal image-text retrieval, yet their social biases remain under-explored. We present the first systematic audit of three public multilingual CLIP checkpoints (M-CLIP, NLLB-CLIP, and CAPIVARA-CLIP) across ten languages that vary in resource availability and grammatical gender. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the assumption that multilinguality mitigates bias, every model exhibits stronger gender bias than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared cross-lingual encoder of NLLB-CLIP transports English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this transfer. Highly gendered languages consistently magnify all measured bias types, but even gender-neutral languages remain vulnerable when cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics conceal language-specific "hot spots," underscoring the need for fine-grained, language-aware bias evaluation in future multilingual vision-language research.
zh

[NLP-102] mporal Alignment of Time Sensitive Facts with Activation Engineering

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对涉及时间敏感性的事实性问题时,因训练数据中包含多领域和不同时期的冲突知识而导致的回答不准确问题。解决方案的关键在于采用激活工程(activation engineering)技术,通过在特定层注入时间相关信息,使模型在推理过程中更准确地回忆与时间相关的事实,从而实现无需微调或构建数据集的时空对齐。

链接: https://arxiv.org/abs/2505.14158
作者: Sanjay Govindan,Maurice Pagnucco,Yang Song
机构: University of New South Wales (新南威尔士大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are trained on diverse and often conflicting knowledge spanning multiple domains and time periods. Some of this knowledge is only valid within specific temporal contexts, such as answering the question, "Who is the President of the United States in 2022?" Ensuring LLMs generate time-appropriate responses is crucial for maintaining relevance and accuracy. In this work, we explore activation engineering as a method for temporally aligning LLMs to improve factual recall without any training or dataset creation: we use an activation engineering technique to ground three versions of LLaMA 2 to specific points in time and examine the effects of varying injection layers and prompting strategies. Our experiments demonstrate up to a 44% and 16% improvement in relative and explicit prompting respectively, achieving performance comparable to the fine-tuning method proposed by Zhao et al. (2024). Notably, our approach achieves similar results to the fine-tuning baseline while being significantly more computationally efficient and requiring no pre-aligned datasets.
zh
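
激活工程的注入方式可以用 PyTorch 的 forward hook 极简示意如下(这里用玩具 MLP 代替 LLaMA 2,steering 向量用随机向量占位;论文中该向量应由目标时间点与基线激活的差构造):

```python
import torch
import torch.nn as nn

# 极简示意:在指定层的前向输出上加一个"时间定向"向量,推理时生效,无需训练。
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
steer = 0.5 * torch.randn(16)        # 占位:实际应为激活差均值,例如
                                     # v = mean(h_2022) - mean(h_base)(假设构造)

def add_steering(module, inputs, output):
    return output + steer            # 返回修改后的输出即可替换该层激活

handle = model[0].register_forward_hook(add_steering)
x = torch.randn(1, 16)
print("steered:", model(x))
handle.remove()                      # 移除 hook 即恢复原始行为
print("original:", model(x))
```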

[NLP-103] Prior Prompt Engineering for Reinforcement Fine-Tuning

【速读】: 该论文旨在解决在强化学习微调(RFT)背景下,如何通过先验提示工程(pPE)引导语言模型(LM)内化不同行为的问题。现有研究多集中于算法、奖励设计和数据筛选,而对先验提示的设计则研究不足。论文的关键解决方案是将推理、规划、基于代码的推理、知识回忆和无示例利用等五种典型的推理时提示工程(iPE)策略转化为对应的pPE方法,并通过实验验证这些方法在域内和域外基准上的有效性,结果表明pPE能够显著提升模型性能并赋予模型不同的行为特征。

链接: https://arxiv.org/abs/2505.14157
作者: Pittawat Taveekitworachai,Potsawee Manakul,Sarana Nutanong,Kunat Pipatanakul
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 42 figures

点击查看摘要

Abstract:This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals. While existing RFT research has primarily focused on algorithms, reward shaping, and data curation, the design of the prior prompt (the instructions prepended to queries during training to elicit behaviors such as step-by-step reasoning) remains underexplored. We investigate whether different pPE approaches can guide LMs to internalize distinct behaviors after RFT. Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies (reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization) into corresponding pPE approaches. We experiment with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and GPQA-Diamond). Our results show that all pPE-trained models surpass their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvement on AIME2024 and GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by adapting a behavior-classification framework, we demonstrate that different pPE strategies instill distinct behavioral styles in the resulting models. These findings position pPE as a powerful yet understudied axis for RFT.
zh

[NLP-104] Enhancing Keyphrase Extraction from Academic Articles Using Section Structure Information

【速读】: 该论文旨在解决学术论文数量激增导致研究人员检索相关文献耗时增加的问题,其核心解决方案是通过改进关键短语提取(Keyphrase Extraction, KPE)模型的性能。该研究的关键在于利用学术论文的结构特征和各部分文本信息,而非仅依赖标题和摘要,从而在提升KPE效果的同时减少噪声干扰。具体而言,方案包括探索七种结构特征对KPE模型的影响,并通过关键短语集成算法整合各部分文本的提取结果,以获得更准确的关键词提取结果。

链接: https://arxiv.org/abs/2505.14149
作者: Chengzhi Zhang,Xinyi Yan,Lei Zhao,Yingyi Zhang
机构: 未知
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The exponential increase in academic papers has significantly increased the time required for researchers to access relevant literature. Keyphrase Extraction (KPE) offers a solution to this situation by enabling researchers to efficiently retrieve relevant literature. Current studies on KPE from academic articles aim to improve the performance of extraction models through innovative approaches using Title and Abstract as input corpora. However, the semantic richness of keywords is significantly constrained by the length of the abstract. While full-text-based KPE can address this issue, it simultaneously introduces noise, which significantly diminishes KPE performance. To address this issue, this paper utilizes the structural features and section texts obtained from the section structure information of academic articles to extract keyphrases from academic papers. The approach consists of two main parts: (1) exploring the effect of seven structural features on KPE models, and (2) integrating the extraction results from all section texts used as input corpora for KPE models via a keyphrase integration algorithm to obtain the final integrated result. Furthermore, this paper also examines the effect of the classification quality of section structure on KPE performance. The results show that incorporating structural features improves KPE performance, though different features have varying effects on model efficacy. The keyphrase integration approach yields the best performance, and the classification quality of section structure can affect KPE performance. These findings indicate that using the section structure information of academic articles contributes to effective KPE from academic articles. The code and dataset supporting this study are available at this https URL.
zh
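
关键短语集成算法的思路可以用如下极简 Python 草图体会:各章节独立抽取(短语, 分数)后按章节权重加权合并。章节权重与示例数据均为假设,并非论文给定参数:

```python
from collections import defaultdict

# 极简示意:关键短语集成。各章节文本分别抽取(短语, 分数),
# 按章节权重加权求和后取 top-N。权重为假设值。
section_weights = {"abstract": 1.0, "introduction": 0.6,
                   "methods": 0.8, "results": 0.8, "discussion": 0.5}

def integrate(per_section: dict, top_n: int = 5) -> list:
    scores = defaultdict(float)
    for section, phrases in per_section.items():
        w = section_weights.get(section, 0.5)
        for phrase, score in phrases:
            scores[phrase] += w * score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

per_section = {
    "abstract": [("keyphrase extraction", 0.9), ("section structure", 0.7)],
    "methods":  [("keyphrase extraction", 0.6), ("structural features", 0.8)],
    "results":  [("section structure", 0.5)],
}
print(integrate(per_section, top_n=3))
```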

[NLP-105] s3: You Don't Need That Much Data to Train a Search Agent via RL

【速读】: 该论文旨在解决传统检索增强生成(Retrieval-augmented generation, RAG)系统中检索与生成过程耦合导致的下游任务性能受限及模型兼容性差的问题。现有方法要么仅优化检索指标(如NDCG),忽视生成效果,要么对整个大语言模型进行微调,使检索与生成相互干扰,限制了实际搜索效用和与冻结或专有模型的兼容性。论文提出的s3框架是一种轻量级、模型无关的解决方案,其关键在于将检索器与生成器解耦,并通过超越RAG的奖励信号——即生成准确性的提升——来训练检索器,从而在仅需2.4k训练样本的情况下,优于基于70倍更多数据的基线模型,在六个通用问答和五个医疗问答基准上均表现出更强的下游性能。

链接: https://arxiv.org/abs/2505.14146
作者: Pengcheng Jiang,Xueqiang Xu,Jiacheng Lin,Jinfeng Xiao,Zifeng Wang,Jimeng Sun,Jiawei Han
机构: University of Illinois Urbana Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon (亚马逊)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems empower large language models (LLMs) to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility, or fine-tune the entire LLM to jointly reason and retrieve, entangling retrieval with generation and limiting the real search utility and compatibility with frozen or proprietary models. In this work, we propose s3, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG. s3 requires only 2.4k training samples to outperform baselines trained on over 70x more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.
zh
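
Gain Beyond RAG 奖励的计算可以用如下极简骨架示意(生成与检索函数均为假设占位;论文中生成器保持冻结,仅检索器接受该奖励的强化学习训练):

```python
# 极简示意:Gain-Beyond-RAG 奖励 = 用"训练中的检索器"检索后生成答案的正确率
# 减去朴素 RAG 的正确率。searcher / generator / naive_retrieve 均为假设占位。
def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())

def gbr_reward(question, gold, searcher, generator, naive_retrieve):
    docs_s3 = searcher(question)              # 被训练的搜索智能体
    docs_naive = naive_retrieve(question)     # 固定的朴素检索基线
    acc_s3 = exact_match(generator(question, docs_s3), gold)
    acc_naive = exact_match(generator(question, docs_naive), gold)
    return acc_s3 - acc_naive                 # 作为 RL 奖励,生成器保持冻结

# 用占位函数演示接口(实际应接入检索引擎与冻结的 LLM):
reward = gbr_reward("巴黎是哪国首都?", "法国",
                    searcher=lambda q: ["doc_a"],
                    generator=lambda q, d: "法国",
                    naive_retrieve=lambda q: ["doc_b"])
print(reward)   # 此玩具示例中两路都答对,奖励为 0.0
```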

[NLP-106] Texts or Images? A Fine-grained Analysis on the Effectiveness of Input Representations and Models for Table Question Answering ACL25

【速读】: 该论文旨在解决表格问答(Table Question Answering, TQA)中不同表格表示方式(文本或图像)与模型类型(多模态大语言模型,MLLMs 或大语言模型,LLMs)组合效果差异的问题,特别是在不同问题复杂度和表格规模下的表现差异。其解决方案的关键在于提出一种动态选择表格表示方法的策略——FRES,通过根据具体场景选择最优的表格表示方式,相较于无差别使用两种表示方式,平均性能提升了10%。

链接: https://arxiv.org/abs/2505.14131
作者: Wei Zhou,Mohsen Mesgar,Heike Adel,Annemarie Friedrich
机构: Bosch Center for Artificial Intelligence(博世人工智能中心); Hochschule der Medien(媒体高等学校); University of Augsburg(奥格斯堡大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL25 (Findings)

点击查看摘要

Abstract:In table question answering (TQA), tables are encoded as either texts or images. Prior work suggests that passing images of tables to multi-modal large language models (MLLMs) performs comparably to or even better than using textual input with large language models (LLMs). However, the lack of controlled setups limits fine-grained distinctions between these approaches. In this paper, we conduct the first controlled study on the effectiveness of several combinations of table representations and models from two perspectives: question complexity and table size. We build a new benchmark based on existing TQA datasets. In a systematic analysis of seven pairs of MLLMs and LLMs, we find that the best combination of table representation and model varies across setups. We propose FRES, a method selecting table representations dynamically, and observe a 10% average performance improvement compared to using both representations indiscriminately.
zh

[NLP-107] Probing BERT for German Compound Semantics

【速读】: 该论文试图解决预训练的德语BERT模型在多大程度上编码了名词复合词语义的知识问题,其解决方案的关键在于通过调整目标词、层和大小写模型的组合,并利用868个标准复合词预测其合成性来评估模型的表征能力。研究发现,德语BERT在早期层中能够更有效地恢复合成性信息,但其性能仍显著低于英语模型,这可能与德语复合词的更高产出性及构成成分层面的歧义性有关。

链接: https://arxiv.org/abs/2505.14130
作者: Filip Miletić,Aaron Schmid,Sabine Schulte im Walde
机构: Institute for Natural Language Processing, University of Stuttgart (自然语言处理研究所,斯图加特大学)
类目: Computation and Language (cs.CL)
备注: Accepted to SwissText 2025

点击查看摘要

Abstract:This paper investigates the extent to which pretrained German BERT encodes knowledge of noun compound semantics. We comprehensively vary combinations of target tokens, layers, and cased vs. uncased models, and evaluate them by predicting the compositionality of 868 gold standard compounds. Looking at representational patterns within the transformer architecture, we observe trends comparable to equivalent prior work on English, with compositionality information most easily recoverable in the early layers. However, our strongest results clearly lag behind those reported for English, suggesting an inherently more difficult task in German. This may be due to the higher productivity of compounding in German than in English and the associated increase in constituent-level ambiguity, including in our target compound set.
zh
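
逐层探测的评估流程可以极简示意如下:对每一层的复合词表示拟合岭回归并交叉验证预测合成性评分,比较各层信息的可恢复性。此处用随机向量代替真实 BERT 激活,仅演示流程:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# 极简示意:逐层探测。实际应以德语 BERT 各层的复合词表示为 X,
# 以人工标注的合成性评分为 y;此处均用随机占位数据。
rng = np.random.default_rng(0)
n_compounds, dim, n_layers = 200, 64, 4
gold = rng.uniform(1, 7, size=n_compounds)          # 假设:1-7 的合成性评分

for layer in range(n_layers):
    X = rng.normal(size=(n_compounds, dim))          # 占位:该层的复合词表示
    r2 = cross_val_score(Ridge(alpha=1.0), X, gold, cv=5, scoring="r2").mean()
    print(f"layer {layer}: mean R^2 = {r2:.3f}")     # 随机特征应接近 0 或为负
```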

[NLP-108] Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在复杂推理任务中性能提升受限的问题,特别是如何有效生成更长的思维链(Chain-of-Thought, CoT)以体现人类认知中的元推理技能,如反思和分解。其解决方案的关键在于提出自推理语言模型(Self-Reasoning Language Model, SRLM),该模型能够通过自我训练合成更长的CoT数据,并利用少量示例(即1,000个样本)作为推理催化剂,引导模型展开隐藏的推理链条,从而实现性能的持续提升。

链接: https://arxiv.org/abs/2505.14116
作者: Hongru Wang,Deng Cai,Wanjun Zhong,Shijue Huang,Jeff Z. Pan,Zeming Liu,Kam-Fai Wong
机构: The Chinese University of Hong Kong (香港中文大学); ByteDance (字节跳动); The University of Edinburgh (爱丁堡大学); Beihang University (北京航空航天大学); MoE Key Laboratory of High Confidence Software Technologies (教育部高可信软件技术重点实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Inference-time scaling has attracted much attention, significantly enhancing the performance of Large Language Models (LLMs) on complex reasoning tasks by increasing the length of Chain-of-Thought. These longer intermediate reasoning rationales embody various meta-reasoning skills in human cognition, such as reflection and decomposition, and are difficult to create and acquire. In this work, we introduce the Self-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and iteratively improve performance through self-training. By incorporating a few demonstration examples (i.e., 1,000 samples) on how to unfold hidden reasoning chains from existing responses, which act as a reasoning catalyst, we demonstrate that SRLM not only enhances the model's initial performance but also ensures more stable and consistent improvements in subsequent iterations. Our proposed SRLM achieves an average absolute improvement of more than +2.5 points across five reasoning tasks: MMLU, GSM8K, ARC-C, HellaSwag, and BBH on two backbone models. Moreover, it brings larger gains with more sampling at inference time, such as an absolute +7.89 average improvement with 64 sampling times, revealing the in-depth, diverse, and creative reasoning paths in SRLM against the strong baseline.
zh

[NLP-109] Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking

【速读】: 该论文试图解决生成式 AI (Generative AI) 在低熵场景下进行水印标记时的挑战,即在可预测输出中难以选择绿色标记令牌而不破坏文本自然性的问题。现有方法依赖于原始大语言模型 (LLM) 计算熵值并选择高熵令牌进行水印,但面临计算成本高和检测延迟以及模型泄露风险的问题。解决方案的关键在于提出一种名为 Invisible Entropy (IE) 的水印范式,其通过引入轻量级特征提取器和熵标签器来预测下一个令牌的熵值,并基于理论分析开发了自适应设置熵阈值的阈值导航器,从而提升水印文本的自然性和检测鲁棒性。

链接: https://arxiv.org/abs/2505.14112
作者: Tianle Gu,Zongqi Wang,Kexin Huang,Yuanqi Yao,Xiangliang Zhang,Yujiu Yang,Xiuying Chen
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Tsinghua Shenzhen International Graduate School, Tsinghua University; Fudan University; Shanghai Artificial Intelligence Laboratory; University of Notre Dame
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Logit-based LLM watermarking traces and verifies AI-generated content by maintaining green and red token lists and increasing the likelihood of green tokens during generation. However, it fails in low-entropy scenarios, where predictable outputs make green token selection difficult without disrupting natural text flow. Existing approaches address this by assuming access to the original LLM to calculate entropy and selectively watermark high-entropy tokens. However, these methods face two major challenges: (1) high computational costs and detection delays due to reliance on the original LLM, and (2) potential risks of model leakage. To address these limitations, we propose Invisible Entropy (IE), a watermarking paradigm designed to enhance both safety and efficiency. Instead of relying on the original LLM, IE introduces a lightweight feature extractor and an entropy tagger to predict whether the entropy of the next token is high or low. Furthermore, based on theoretical analysis, we develop a threshold navigator that adaptively sets entropy thresholds. It identifies a threshold where the watermark ratio decreases as the green token count increases, enhancing the naturalness of the watermarked text and improving detection robustness. Experiments on HumanEval and MBPP datasets demonstrate that IE reduces parameter size by 99% while achieving performance on par with state-of-the-art methods. Our work introduces a safe and efficient paradigm for low-entropy watermarking. this https URL
zh
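
下面给出“按熵高低选择性加水印”的极简 Python 示意(绿/红表划分、阈值与提升幅度 delta 均为假设;论文中下一个 token 的熵由轻量特征抽取器与熵标注器预测,而非像这里直接由真实分布计算):

```python
import hashlib
import math

# 极简示意:仅在"高熵"步骤上提升绿表 token 的概率,低熵步骤跳过以保持自然性。
def is_green(prev_token: str, token: str) -> bool:
    h = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return h[0] % 2 == 0                     # 由上一个 token 决定的伪随机绿/红表

def watermarked_step(probs: dict, prev_token: str,
                     entropy_threshold: float = 1.0, delta: float = 2.0) -> str:
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    if entropy >= entropy_threshold:         # 低熵步骤不加水印
        probs = {t: p * (math.e ** delta if is_green(prev_token, t) else 1.0)
                 for t, p in probs.items()}
    z = sum(probs.values())
    return max(probs, key=lambda t: probs[t] / z)

step_probs = {"cat": 0.4, "dog": 0.35, "fish": 0.25}   # 高熵示例分布
print(watermarked_step(step_probs, prev_token="the"))
```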

[NLP-110] DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models

【速读】: 该论文旨在解决当前大型语言模型在临床诊断推理任务中的性能评估不足问题,特别是针对其在真实医疗场景中安全有效部署的挑战。解决方案的关键在于构建一个全面且具有挑战性的基准测试平台——DiagnosisArena,该平台包含1,113对分割后的患者病例及其对应诊断,覆盖28个医学专科,并基于顶级医学期刊发表的临床案例报告进行构建。通过多轮AI系统与人类专家的筛选和审查,确保数据质量和避免数据泄露,从而为评估专业级诊断能力提供可靠依据。

链接: https://arxiv.org/abs/2505.14107
作者: Yakun Zhu,Zhongzhen Huang,Linjie Mu,Yutong Huang,Wei Nie,Shaoting Zhang,Pengfei Liu,Xiaofan Zhang
机构: Shanghai Jiao Tong University (上海交通大学); SII (SII); SPIRAL Lab (SPIRAL实验室); Generative AI Research Lab (生成式人工智能研究实验室); GAIR (GAIR); Shanghai Chest Hospital (上海市胸科医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emergence of groundbreaking large language models capable of performing complex reasoning tasks holds significant promise for addressing various scientific challenges, including those arising in complex clinical scenarios. To enable their safe and effective deployment in real-world healthcare settings, it is urgently necessary to benchmark the diagnostic capabilities of current models systematically. Given the limitations of existing medical benchmarks in evaluating advanced diagnostic reasoning, we present DiagnosisArena, a comprehensive and challenging benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties, derived from clinical case reports published in 10 top-tier medical journals. The benchmark is developed through a meticulous construction pipeline, involving multiple rounds of screening and review by both AI systems and human experts, with thorough checks conducted to prevent data leakage. Our study reveals that even the most advanced reasoning models, o3-mini, o1, and DeepSeek-R1, achieve only 45.82%, 31.09%, and 17.79% accuracy, respectively. This finding highlights a significant generalization bottleneck in current large language models when faced with clinical diagnostic reasoning challenges. Through DiagnosisArena, we aim to drive further advancements in AI's diagnostic reasoning capabilities, enabling more effective solutions for real-world clinical diagnostic challenges. We provide the benchmark and evaluation tools for further research and development at this https URL.
zh

[NLP-111] A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations

【速读】: 该论文试图解决在多轮对话中如何评估大型语言模型(Large Language Models, LLMs)的个性化推理与生成能力的问题。现有研究通常仅关注个性化或对话结构中的单一维度,而本文提出的PersonaConvBench基准则整合了这两个方面,包含句子分类、影响回归和以用户为中心的文本生成三个核心任务,覆盖十个基于Reddit的多样化领域。解决方案的关键在于构建一个统一的评估框架,通过引入个性化的对话历史,系统地分析其对LLM输出的影响,并验证个性化上下文在提升模型性能方面的有效性。

链接: https://arxiv.org/abs/2505.14106
作者: Li Li,Peilin Cai,Ryan A. Rossi,Franck Dernoncourt,Branislav Kveton,Junda Wu,Tong Yu,Linxin Song,Tiankai Yang,Yuehan Qin,Nesreen K. Ahmed,Samyadeep Basu,Subhojyoti Mukherjee,Ruiyi Zhang,Zhengmian Hu,Bo Ni,Yuxiao Zhou,Zichao Wang,Yue Huang,Yu Wang,Xiangliang Zhang,Philip S. Yu,Xiyang Hu,Yue Zhao
机构: University of Southern California; Adobe Research; University of California, San Diego; Intel AI Research; University of Maryland, College Park; Vanderbilt University; Virginia Polytechnic Institute and State University; University of Notre Dame; University of Oregon; University of Illinois Chicago; Arizona State University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentence classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench with evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.
zh

[NLP-112] Legal Rule Induction: Towards Generalizable Principle Discovery from Analogous Judicial Precedents

【速读】: 该论文试图解决从司法判决中归纳法律规则(Legal Rule Induction, LRI)的问题,这一任务在计算法律研究中尚未得到充分探索,主要受限于模型推理能力和符号推理能力的不足。其解决方案的关键在于首次构建了一个LRI基准数据集,包含5,121个案例集(共计38,088个中文案例)用于模型调优,以及216个专家标注的黄金测试集,从而为法律规则的自动提取提供了标准化的评估框架和训练资源。

链接: https://arxiv.org/abs/2505.14104
作者: Wei Fan,Tianshi Zheng,Yiran Hu,Zheye Deng,Weiqi Wang,Baixuan Xu,Chunyang Li,Haoran Li,Weixing Shen,Yangqiu Song
机构: HKUST(香港科技大学); Tsinghua University(清华大学)
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Legal rules encompass not only codified statutes but also implicit adjudicatory principles derived from precedents that contain discretionary norms, social morality, and policy. While computational legal research has advanced in applying established rules to cases, inducing legal rules from judicial decisions remains understudied, constrained by limitations in model inference efficacy and symbolic reasoning capability. The advent of Large Language Models (LLMs) offers unprecedented opportunities for automating the extraction of such latent principles, yet progress is stymied by the absence of formal task definitions, benchmark datasets, and methodologies. To address this gap, we formalize Legal Rule Induction (LRI) as the task of deriving concise, generalizable doctrinal rules from sets of analogous precedents, distilling their shared preconditions, normative behaviors, and legal consequences. We introduce the first LRI benchmark, comprising 5,121 case sets (38,088 Chinese cases in total) for model tuning and 216 expert-annotated gold test sets. Experimental results reveal that: 1) state-of-the-art LLMs struggle with over-generalization and hallucination; 2) training on our dataset markedly enhances LLMs' capabilities in capturing nuanced rule patterns across similar cases.
zh

[NLP-113] MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成文本时存在的事实准确性不足问题,即所谓的“幻觉”现象。现有评估基准多依赖于英语数据集及补充信息(如网页链接或文本段落),而忽视了结构化事实资源的利用。该研究的关键解决方案是引入知识图谱(Knowledge Graphs, KGs),通过构建一个基于KG的多语言、多跳基准测试框架——\textbf{MultiHal},以提升事实性语言建模的评估能力。该方法从开放域知识图谱中提取并清洗出高质量的25.9k条KG路径,实验结果表明,与传统问答系统相比,KG-RAG在语义相似度评分上提升了0.12至0.36个点,验证了知识图谱集成的有效性。

链接: https://arxiv.org/abs/2505.14101
作者: Ernests Lavrinovics,Russa Biswas,Katja Hose,Johannes Bjerva
机构: Aalborg University (奥尔堡大学); TU Wien (维也纳技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have inherent limitations of faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks have been developed that provide a test bed for factuality evaluation within the context of English-centric datasets, while relying on supplementary informative context like web links or text passages but ignoring the available structured factual resources. To this end, Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation, as they provide a structured way to represent the facts about entities and their relations with minimal linguistic overhead. We bridge the lack of KG paths and multilinguality for factual language modeling within the existing hallucination evaluation benchmarks and propose a KG-based multilingual, multihop benchmark called MultiHal, framed for generative text evaluation. As part of our data collection pipeline, we mined 140k KG paths from open-domain KGs, from which we pruned noisy KG paths, curating a high-quality subset of 25.9k. Our baseline evaluation shows an absolute scale increase of approximately 0.12 to 0.36 points for the semantic similarity score in KG-RAG over vanilla QA across multiple languages and multiple models, demonstrating the potential of KG integration. We anticipate MultiHal will foster future research towards several graph-based hallucination mitigation and fact-checking tasks.
zh

[NLP-114] Beyond Chains: Bridging Large Language Models and Knowledge Bases in Complex Question Answering

【速读】: 该论文旨在解决知识库问答(Knowledge Base Question Answering, KBQA)中大型语言模型(LLM)仅方法存在的知识过时、幻觉和缺乏透明性问题,以及基于链的知识图谱检索增强生成(KG-RAG)方法在处理复杂非链结构问题时的局限性。其解决方案的关键在于提出一种四阶段框架PDRR(Predict, Decompose, Retrieve, Reason),通过预测问题类型、分解问题为结构化三元组、从知识库中检索相关信息,并引导LLM作为智能体进行推理和补全分解后的三元组,从而提升问答的准确性和泛化能力。

链接: https://arxiv.org/abs/2505.14099
作者: Yihua Zhu,Qianying Liu,Akiko Aizawa,Hidetoshi Shimodaira
机构: Kyoto University (京都大学); University of Tokyo (东京大学); NII (国家信息学研究所); RIKEN (理化学研究所)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Knowledge Base Question Answering (KBQA) aims to answer natural language questions using structured knowledge from KBs. While LLM-only approaches offer generalization, they suffer from outdated knowledge, hallucinations, and lack of transparency. Chain-based KG-RAG methods address these issues by incorporating external KBs, but are limited to simple chain-structured questions due to the absence of planning and logical structuring. Inspired by semantic parsing methods, we propose PDRR: a four-stage framework consisting of Predict, Decompose, Retrieve, and Reason. Our method first predicts the question type and decomposes the question into structured triples. It then retrieves relevant information from KBs and guides the LLM, acting as an agent, to reason over and complete the decomposed triples. Experimental results demonstrate that PDRR consistently outperforms existing methods across various LLM backbones and achieves superior performance on both chain-structured and non-chain complex questions.
zh
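
PDRR 的四阶段流程可以压缩成如下可运行的骨架(所有 llm_* 与 kb_retrieve 均为假设的占位实现,实际应分别接入 LLM 与知识库):

```python
# 极简示意:PDRR 四阶段骨架(Predict -> Decompose -> Retrieve -> Reason)。
def pdrr_answer(question: str) -> str:
    qtype = llm_predict_type(question)             # 1) 预测问题类型(链式/非链式)
    triples = llm_decompose(question, qtype)       # 2) 分解为结构化三元组(含未知槽位)
    evidence = [kb_retrieve(t) for t in triples]   # 3) 逐三元组从知识库检索
    return llm_reason(question, triples, evidence) # 4) LLM 作为智能体补全并推理

# 占位实现,仅保证骨架可运行:
llm_predict_type = lambda q: "chain"
llm_decompose = lambda q, t: [("巴黎", "首都属于", "?x")]
kb_retrieve = lambda triple: [("巴黎", "首都属于", "法国")]
llm_reason = lambda q, ts, ev: ev[0][0][2]
print(pdrr_answer("巴黎是哪个国家的首都?"))        # -> 法国
```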

[NLP-115] Gender Trouble in Language Models: An Empirical Audit Guided by Gender Performativity Theory

【速读】: 该论文试图解决语言模型中编码并延续有害性别刻板印象的问题,特别是针对现有方法仅关注语言层面的关联性而未能深入性别建构本身所带来的局限性。其解决方案的关键在于倡导对“性别偏见”进行更广泛的定义,并通过性别研究文献中的性别建构理论来操作化这一概念,进而实证分析不同架构、训练数据集和模型规模的语言模型如何编码性别。研究发现,语言模型倾向于将性别视为与生物性别绑定的二元分类,导致非二元性别术语被抹除和病理化,且更大的模型强化了性别与性别的强关联,进一步巩固了狭隘的性别理解。

链接: https://arxiv.org/abs/2505.14080
作者: Franziska Sofia Hafner,Ana Valdivia,Luc Rocher
机构: University of Oxford (牛津大学); Institute of Advanced Studies (高级研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Published in FAccT '25: Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency

点击查看摘要

Abstract:Language models encode and subsequently perpetuate harmful gendered stereotypes. Research has succeeded in mitigating some of these harms, e.g. by dissociating non-gendered terms such as occupations from gendered terms such as ‘woman’ and ‘man’. This approach, however, remains superficial given that associations are only one form of prejudice through which gendered harms arise. Critical scholarship on gender, such as gender performativity theory, emphasizes how harms often arise from the construction of gender itself, such as conflating gender with biological sex. In language models, these issues could lead to the erasure of transgender and gender diverse identities and cause harms in downstream applications, from misgendering users to misdiagnosing patients based on wrong assumptions about their anatomy. For FAccT research on gendered harms to go beyond superficial linguistic associations, we advocate for a broader definition of ‘gender bias’ in language models. We operationalize insights on the construction of gender through language from gender studies literature and then empirically test how 16 language models of different architectures, training datasets, and model sizes encode gender. We find that language models tend to encode gender as a binary category tied to biological sex, and that gendered terms that do not neatly fall into one of these binary categories are erased and pathologized. Finally, we show that larger models, which achieve better results on performance benchmarks, learn stronger associations between gender and sex, further reinforcing a narrow understanding of gender. Our findings lead us to call for a re-evaluation of how gendered harms in language models are defined and addressed.
zh

[NLP-116] BAR: A Backward Reasoning based Agent for Complex Minecraft Tasks

【速读】: 该论文试图解决在复杂任务中,基于大型语言模型(Large Language Model, LLM)的智能体因前向推理(forward reasoning)存在感知差距而导致的任务规划失效问题。其解决方案的关键在于采用后向推理(backward reasoning),从任务目标状态出发进行规划,从而直接实现任务目标。为此,作者提出了一种基于后向推理的智能体(BAckward Reasoning based agent, BAR),其核心包括递归目标分解模块、状态一致性保持模块和阶段记忆模块,以实现鲁棒、一致且高效的规划。

链接: https://arxiv.org/abs/2505.14079
作者: Weihong Du,Wenrui Liao,Binyu Yan,Hongru Liang,Anthony G. Cohn,Wenqiang Lei
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language model (LLM) based agents have shown great potential in following human instructions and automatically completing various tasks. To complete a task, the agent needs to decompose it into easily executed steps by planning. Existing studies mainly conduct the planning by inferring what steps should be executed next starting from the agent’s initial state. However, this forward reasoning paradigm doesn’t work well for complex tasks. We propose to study this issue in Minecraft, a virtual environment that simulates complex tasks based on real-world scenarios. We believe that the failure of forward reasoning is caused by the big perception gap between the agent’s initial state and task goal. To this end, we leverage backward reasoning and make the planning starting from the terminal state, which can directly achieve the task goal in one step. Specifically, we design a BAckward Reasoning based agent (BAR). It is equipped with a recursive goal decomposition module, a state consistency maintaining module and a stage memory module to make robust, consistent, and efficient planning starting from the terminal state. Experimental results demonstrate the superiority of BAR over existing methods and the effectiveness of proposed modules.
zh
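
后向递归目标分解的核心逻辑可以用如下极简 Python 示意(配方表 RECIPES 为假设的 Minecraft 玩具设定,省略了论文中的阶段记忆等模块):

```python
# 极简示意:从终态目标出发递归分解。若当前物品不在库存中,
# 先递归保证其配方前件成立,再执行合成,并同步更新库存以保持状态一致。
RECIPES = {"wooden_pickaxe": ["planks", "sticks"],
           "sticks": ["planks"],
           "planks": ["log"]}

def backward_plan(goal: str, inventory: set, plan: list) -> None:
    if goal in inventory:                 # 终止条件:前件已满足
        return
    for prerequisite in RECIPES.get(goal, []):
        backward_plan(prerequisite, inventory, plan)   # 递归分解子目标
    plan.append(f"craft {goal}")
    inventory.add(goal)                   # 合成后更新库存

plan: list = []
backward_plan("wooden_pickaxe", inventory={"log"}, plan=plan)
print(plan)   # ['craft planks', 'craft sticks', 'craft wooden_pickaxe']
```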

[NLP-117] xtual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models

【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)缺乏有效且针对性的引导方法的问题,而现有的引导技术主要针对文本为主的大型语言模型(Large Language Models, LLMs)。论文提出的解决方案的关键在于利用文本-only LLM 的向量表示,通过稀疏自编码器(Sparse Autoencoders, SAEs)、均值偏移(Mean Shift)和线性探测(Linear Probing)等方法对 MLLMs 进行引导,实验表明这种基于文本的引导向量能够显著提升多模态任务的准确性,尤其在空间关系和计数任务上表现突出,且具有良好的泛化能力。

链接: https://arxiv.org/abs/2505.14071
作者: Woody Haosheng Gan,Deqing Fu,Julian Asilis,Ollie Liu,Dani Yogatama,Vatsal Sharan,Robin Jia,Willie Neiswanger
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Steering methods have emerged as effective and targeted tools for guiding large language models’ (LLMs) behavior without modifying their parameters. Multimodal large language models (MLLMs), however, do not currently enjoy the same suite of techniques, due in part to their recency and architectural diversity. Inspired by this gap, we investigate whether MLLMs can be steered using vectors derived from their text-only LLM backbone, via sparse autoencoders (SAEs), mean shift, and linear probing. We find that text-derived steering consistently enhances multimodal accuracy across diverse MLLM architectures and visual tasks. In particular, mean shift boosts spatial relationship accuracy on CV-Bench by up to +7.3% and counting accuracy by up to +3.3%, outperforming prompting and exhibiting strong generalization to out-of-distribution datasets. These results highlight textual steering vectors as a powerful, efficient mechanism for enhancing grounding in MLLMs with minimal additional data collection and computational overhead.
zh

[NLP-118] Enhancing LLM s via High-Knowledge Data Selection

【速读】: 该论文旨在解决预训练语料库中知识稀缺的问题,其核心挑战在于现有数据选择方法未充分考虑文本语料的知识丰富性。论文提出的解决方案关键在于设计了一个无需梯度的高知识评分器(High-Knowledge Scorer, HKS),从知识维度筛选高质量数据。该方法通过构建多领域知识元素池,并引入知识密度和覆盖度作为评估指标,从而实现对文本知识内容的全面评估,最终提升模型在知识密集型和通用理解任务中的性能。

链接: https://arxiv.org/abs/2505.14070
作者: Feiyu Duan,Xuemiao Zhang,Sirui Wang,Haoran Que,Yuqi Liu,Wenge Rong,Xunliang Cai
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The performance of Large Language Models (LLMs) is intrinsically linked to the quality of their training data. Although several studies have proposed methods for high-quality data selection, they do not consider the importance of knowledge richness in text corpora. In this paper, we propose a novel and gradient-free High-Knowledge Scorer (HKS) to select high-quality data along the dimension of knowledge, to alleviate the problem of knowledge scarcity in the pre-training corpus. We propose a comprehensive multi-domain knowledge element pool and introduce knowledge density and coverage as metrics to assess the knowledge content of text. Based on this, we propose a comprehensive knowledge scorer to select data with intensive knowledge, which can also be utilized for domain-specific high-knowledge data selection by restricting knowledge elements to the specific domain. We train models on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves the model's performance on knowledge-intensive and general comprehension tasks, and is effective in enhancing both the generic and domain-specific capabilities of the model.

[NLP-119] Improved Methods for Model Pruning and Knowledge Distillation

【速读】: 该论文试图解决大型语言模型在模型剪枝过程中导致性能显著下降或需要大量重新训练和微调的问题。其解决方案的关键在于提出一种名为MAMA Pruning(Movement and Magnitude Analysis)的改进剪枝方法,该方法通过利用预训练阶段固定的权重和偏置以及后训练阶段验证的GRPO奖励作为新的剪枝指标,有效减少了模型规模和计算复杂度,同时在极端剪枝水平下仍能保持与原始未剪枝模型相当的性能。

链接: https://arxiv.org/abs/2505.14052
作者: Wei Jiang,Anying Fu,Youling Zhang
机构: Suanfamama(苏安法玛)
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
备注:

Abstract:Model pruning is a performance optimization technique for large language models like R1 or o3-mini. However, existing pruning methods often lead to significant performance degradation or require extensive retraining and fine-tuning. This technique aims to identify and remove neurons, connections unlikely leading to the contribution during the human-computer interaction phase. Our goal is to obtain a much smaller and faster knowledge distilled model that can quickly generate content almost as good as those of the unpruned ones. We propose MAMA Pruning, short for Movement and Magnitude Analysis, an improved pruning method that effectively reduces model size and computational complexity while maintaining performance comparable to the original unpruned model even at extreme pruned levels. The improved method is based on weights, bias fixed in the pre-training phase and GRPO rewards verified during the post-training phase as our novel pruning indicators. Preliminary experimental results show that our method outperforms and be comparable to state-of-the-art methods across various pruning levels and different downstream computational linguistics tasks.
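
The abstract names magnitude, movement, and GRPO rewards as pruning indicators. Below is a heavily hedged sketch of how such a combined score might drive unstructured pruning; the additive combination rule and the way the reward signal enters are guesses for illustration, not the paper's formula.

```python
import numpy as np

def mama_scores(w, grad, reward_signal=None, reward_weight=0.0):
    """Illustrative pruning score: magnitude (|w|) plus movement
    (-w * grad; weights being pushed toward zero score low). The reward
    term is a placeholder for the GRPO-verified indicator in the abstract."""
    score = np.abs(w) + (-w * grad)
    if reward_signal is not None:
        score = score + reward_weight * reward_signal
    return score

def prune(w, score, sparsity=0.5):
    """Zero out the lowest-scoring fraction of weights."""
    k = int(w.size * sparsity)
    thresh = np.partition(score.ravel(), k)[k]
    return np.where(score >= thresh, w, 0.0)

rng = np.random.default_rng(1)
w, g = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
print(prune(w, mama_scores(w, g), sparsity=0.5))
```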

[NLP-120] From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora

[Quick Read]: This paper studies how to improve large language model (LLM) performance in low-resource languages, noting that unaligned multilingual data is limited in its ability to capture cross-lingual semantics. The key is TED2025, a large-scale, high-quality multi-way parallel corpus covering 113 languages with up to 50 languages aligned in parallel, providing stronger cross-lingual consistency. Using this corpus, the authors explore strategies such as continued pretraining and instruction tuning, and show that multi-way parallel data consistently outperforms unaligned multilingual data on multilingual benchmarks.

Link: https://arxiv.org/abs/2505.14045
Authors: Yingli Shen, Wen Lai, Shuo Wang, Kangyang Luo, Alexander Fraser, Maosong Sun
Institutions: Tsinghua University; Technical University of Munich; Munich Center for Machine Learning
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multiway parallel data consistently outperform those trained on unaligned multilingual data.

[NLP-121] ProMind-LLM: Proactive Mental Health Care via Causal Reasoning with Sensor Data

[Quick Read]: This paper addresses the uncertainty and unreliability of traditional mental health risk assessment, which relies on subjective textual records and can yield inconsistent predictions. The key is ProMind-LLM, which integrates objective behavioral data as complementary information and combines domain-specific pretraining, a self-refine mechanism, and causal chain-of-thought reasoning to improve the reliability and interpretability of mental health risk assessment.

Link: https://arxiv.org/abs/2505.14038
Authors: Xinzhe Zheng, Sijie Ji, Jiawei Sun, Renqi Chen, Wei Gao, Mani Srivastava
Institutions: California Institute of Technology; UCLA; National University of Singapore; Hangzhou Dianzi University; Fudan University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Mental health risk is a critical global public health challenge, necessitating innovative and reliable assessment methods. With the development of large language models (LLMs), they stand out to be a promising tool for explainable mental health care applications. Nevertheless, existing approaches predominantly rely on subjective textual mental records, which can be distorted by inherent mental uncertainties, leading to inconsistent and unreliable predictions. To address these limitations, this paper introduces ProMind-LLM. We investigate an innovative approach integrating objective behavior data as complementary information alongside subjective mental records for robust mental health risk assessment. Specifically, ProMind-LLM incorporates a comprehensive pipeline that includes domain-specific pretraining to tailor the LLM for mental health contexts, a self-refine mechanism to optimize the processing of numerical behavioral data, and causal chain-of-thought reasoning to enhance the reliability and interpretability of its predictions. Evaluations of two real-world datasets, PMData and Globem, demonstrate the effectiveness of our proposed methods, achieving substantial improvements over general LLMs. We anticipate that ProMind-LLM will pave the way for more dependable, interpretable, and scalable mental health case solutions.

[NLP-122] AUTOLAW: Enhancing Legal Compliance in Large Language Models via Case Law Generation and Jury-Inspired Deliberation

[Quick Read]: This paper targets the compliance and trustworthiness problems that arise because domain-specific LLMs in fields such as law fail to adapt to regional legal differences, while existing legal benchmarks struggle with diverse local contexts in dynamically evolving regulatory landscapes. The proposed solution is AutoLaw, whose key idea is to combine adversarial data generation with a jury-inspired deliberation process: it dynamically synthesizes case law to reflect local regulations and employs a pool of LLM-based "jurors" to simulate judicial decision-making, thereby reducing bias and improving violation-detection accuracy.

Link: https://arxiv.org/abs/2505.14015
Authors: Tai D. Nguyen, Long H. Pham, Jun Sun
Institutions: Singapore Management University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The rapid advancement of domain-specific large language models (LLMs) in fields like law necessitates frameworks that account for nuanced regional legal distinctions, which are critical for ensuring compliance and trustworthiness. Existing legal evaluation benchmarks often lack adaptability and fail to address diverse local contexts, limiting their utility in dynamically evolving regulatory landscapes. To address these gaps, we propose AutoLaw, a novel violation detection framework that combines adversarial data generation with a jury-inspired deliberation process to enhance legal compliance of LLMs. Unlike static approaches, AutoLaw dynamically synthesizes case law to reflect local regulations and employs a pool of LLM-based “jurors” to simulate judicial decision-making. Jurors are ranked and selected based on synthesized legal expertise, enabling a deliberation process that minimizes bias and improves detection accuracy. Evaluations across three benchmarks: Law-SG, Case-SG (legality), and Unfair-TOS (policy), demonstrate AutoLaw’s effectiveness: adversarial data generation improves LLM discrimination, while the jury-based voting strategy significantly boosts violation detection rates. Our results highlight the framework’s ability to adaptively probe legal misalignments and deliver reliable, context-aware judgments, offering a scalable solution for evaluating and enhancing LLMs in legally sensitive applications.

[NLP-123] Activation-Guided Consensus Merging for Large Language Models

[Quick Read]: This paper studies how to integrate the diverse capabilities of different Large Language Models (LLMs) into a unified model efficiently and stably. Existing training-based and prompt-based approaches face notable efficiency and stability challenges, and conventional model merging assumes uniform importance across layers, ignoring the functional heterogeneity of neural components. The key is Activation-Guided Consensus Merging (ACM), a plug-and-play framework that sets layer-specific merging coefficients from the mutual information between the activations of pre-trained and fine-tuned models, effectively preserving task-specific capabilities without gradient computation or additional training.

Link: https://arxiv.org/abs/2505.14009
Authors: Yuxuan Yao, Shuqi Liu, Zehua Liu, Qintong Li, Mingyang Liu, Xiongwei Han, Zhijiang Guo, Han Wu, Linqi Song
Institutions: City University of Hong Kong; Huawei Noah's Ark Lab; University of Hong Kong; Hong Kong University of Science and Technology (Guangzhou); Hong Kong University of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent research has increasingly focused on reconciling the reasoning capabilities of System 2 with the efficiency of System 1. While existing training-based and prompt-based approaches face significant challenges in terms of efficiency and stability, model merging emerges as a promising strategy to integrate the diverse capabilities of different Large Language Models (LLMs) into a unified model. However, conventional model merging methods often assume uniform importance across layers, overlooking the functional heterogeneity inherent in neural components. To address this limitation, we propose Activation-Guided Consensus Merging (ACM), a plug-and-play merging framework that determines layer-specific merging coefficients based on mutual information between activations of pre-trained and fine-tuned models. ACM effectively preserves task-specific capabilities without requiring gradient computations or additional training. Extensive experiments on Long-to-Short (L2S) and general merging tasks demonstrate that ACM consistently outperforms all baseline methods. For instance, in the case of Qwen-7B models, TIES-Merging equipped with ACM achieves a 55.3% reduction in response length while simultaneously improving reasoning accuracy by 1.3 points. We submit the code with the paper for reproducibility, and it will be publicly available.
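
A toy sketch of the layer-wise idea follows, assuming a histogram-based mutual information estimate and a sigmoid mapping from MI to a merging coefficient; both choices are illustrative assumptions that the abstract does not commit to.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based MI estimate between two 1-D activation samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p = joint / joint.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def merge_layer(theta_pre, theta_ft, act_pre, act_ft, mi_scale=1.0):
    """Assumed mapping: higher MI between pre-trained and fine-tuned
    activations means the layers agree, so lean toward the fine-tuned
    weights; alpha is squashed into (0, 1) with a sigmoid."""
    mi = mutual_information(act_pre.ravel(), act_ft.ravel())
    alpha = 1.0 / (1.0 + np.exp(-mi_scale * mi))
    return alpha * theta_ft + (1.0 - alpha) * theta_pre

rng = np.random.default_rng(2)
a = rng.normal(size=1000)
merged = merge_layer(np.ones((4, 4)), np.zeros((4, 4)),
                     a, a + 0.1 * rng.normal(size=1000))
print(merged[0, 0])  # between 0 and 1, biased toward the fine-tuned side
```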

[NLP-124] Social Sycophancy: A Broader Understanding of LLM Sycophancy

[Quick Read]: This paper examines sycophancy, i.e., excessive agreement with and flattery of the user, as a serious safety and utility risk in Large Language Models (LLMs). Existing work only measures agreement with users' explicitly stated beliefs against a ground truth, overlooking sycophancy in ambiguous contexts such as advice and support-seeking, where no clear ground truth exists but harmful implicit assumptions, beliefs, or actions can still be reinforced. The key contribution is a richer theory of social sycophancy, defined as excessive preservation of the user's "face" (the positive self-image a person seeks to maintain in an interaction), together with ELEPHANT, a framework for evaluating five face-preserving behaviors (emotional validation, moral endorsement, indirect language, indirect action, and accepting framing) on two datasets: open-ended questions and Reddit's r/AmITheAsshole. Across eight models, LLMs exhibit high rates of social sycophancy; the behavior is also rewarded in preference datasets and is not easily mitigated.

Link: https://arxiv.org/abs/2505.13995
Authors: Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, Dan Jurafsky
Institutions: Stanford University; Carnegie Mellon University; University of Oxford
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:A serious risk to the safety and utility of LLMs is sycophancy, i.e., excessive agreement with and flattery of the user. Yet existing work focuses on only one aspect of sycophancy: agreement with users’ explicitly stated beliefs that can be compared to a ground truth. This overlooks forms of sycophancy that arise in ambiguous contexts such as advice and support-seeking, where there is no clear ground truth, yet sycophancy can reinforce harmful implicit assumptions, beliefs, or actions. To address this gap, we introduce a richer theory of social sycophancy in LLMs, characterizing sycophancy as the excessive preservation of a user’s face (the positive self-image a person seeks to maintain in an interaction). We present ELEPHANT, a framework for evaluating social sycophancy across five face-preserving behaviors (emotional validation, moral endorsement, indirect language, indirect action, and accepting framing) on two datasets: open-ended questions (OEQ) and Reddit’s r/AmITheAsshole (AITA). Across eight models, we show that LLMs consistently exhibit high rates of social sycophancy: on OEQ, they preserve face 47% more than humans, and on AITA, they affirm behavior deemed inappropriate by crowdsourced human judgments in 42% of cases. We further show that social sycophancy is rewarded in preference datasets and is not easily mitigated. Our work provides theoretical grounding and empirical tools (datasets and code) for understanding and addressing this under-recognized but consequential issue.

[NLP-125] DecIF: Improving Instruction-Following through Meta-Decomposition

[Quick Read]: This paper addresses data generation for instruction-following in large language models (LLMs): existing approaches typically rely on pre-existing documents or external resources to synthesize instruction-following data, which limits flexibility and generalizability. The proposed DecIF is a fully autonomous, meta-decomposition-guided framework that generates diverse, high-quality instruction-following data using only LLMs. Its core principle is decomposition: LLMs iteratively produce various types of meta-information that are combined with response constraints to form well-structured, semantically rich instructions; LLMs then detect and resolve inconsistencies in the generated instructions; and each instruction is decomposed into atomic-level evaluation criteria for rigorous validation of instruction-response pairs.

Link: https://arxiv.org/abs/2505.13990
Authors: Tingfeng Hui, Pengyu Zhu, Bowen Ping, Ling Tang, Yaqi Zhang, Sen Su
Institutions: Beijing University of Posts and Telecommunications; Peking University
Subjects: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Instruction-following has emerged as a crucial capability for large language models (LLMs). However, existing approaches often rely on pre-existing documents or external resources to synthesize instruction-following data, which limits their flexibility and generalizability. In this paper, we introduce DecIF, a fully autonomous, meta-decomposition guided framework that generates diverse and high-quality instruction-following data using only LLMs. DecIF is grounded in the principle of decomposition. For instruction generation, we guide LLMs to iteratively produce various types of meta-information, which are then combined with response constraints to form well-structured and semantically rich instructions. We further utilize LLMs to detect and resolve potential inconsistencies within the generated instructions. Regarding response generation, we decompose each instruction into atomic-level evaluation criteria, enabling rigorous validation and the elimination of inaccurate instruction-response pairs. Extensive experiments across a wide range of scenarios and settings demonstrate DecIF’s superior performance on instruction-following tasks. Further analysis highlights its strong flexibility, scalability, and generalizability in automatically synthesizing high-quality instruction data.

[NLP-126] The Hallucination Tax of Reinforcement Finetuning

[Quick Read]: This paper investigates a side effect of reinforcement finetuning (RFT): it degrades refusal behavior, leading large language models (LLMs) to confidently hallucinate answers to unanswerable questions that lack sufficient or unambiguous information. The key is the SUM (Synthetic Unanswerable Math) dataset: mixing just 10% SUM data into RFT substantially restores appropriate refusal behavior with minimal accuracy loss on solvable tasks. This enables LLMs to use inference-time compute to reason about their own uncertainty and knowledge boundaries, improving generalization to out-of-domain math problems and factual question answering.

Link: https://arxiv.org/abs/2505.13988
Authors: Linxin Song, Taiwei Shi, Jieyu Zhao
Institutions: University of Southern California
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement finetuning (RFT) has become a standard approach for enhancing the reasoning capabilities of large language models (LLMs). However, its impact on model trustworthiness remains underexplored. In this work, we identify and systematically study a critical side effect of RFT, which we term the hallucination tax: a degradation in refusal behavior causing models to produce hallucinated answers to unanswerable questions confidently. To investigate this, we introduce SUM (Synthetic Unanswerable Math), a high-quality dataset of unanswerable math problems designed to probe models’ ability to recognize an unanswerable question by reasoning from the insufficient or ambiguous information. Our results show that standard RFT training could reduce model refusal rates by more than 80%, which significantly increases model’s tendency to hallucinate. We further demonstrate that incorporating just 10% SUM during RFT substantially restores appropriate refusal behavior, with minimal accuracy trade-offs on solvable tasks. Crucially, this approach enables LLMs to leverage inference-time compute to reason about their own uncertainty and knowledge boundaries, improving generalization not only to out-of-domain math problems but also to factual question answering tasks.

[NLP-127] Mixed Signals: Understanding Model Disagreement in Multimodal Empathy Detection

[Quick Read]: This paper studies why multimodal models for empathy detection degrade when different modalities provide conflicting cues. The key is to analyze disagreements between unimodal and multimodal predictions to surface underlying ambiguity, and to introduce a gated fusion model to better integrate multimodal information.

Link: https://arxiv.org/abs/2505.13979
Authors: Maya Srikanth, Run Chen, Julia Hirschberg
Institutions: Columbia University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Multimodal models play a key role in empathy detection, but their performance can suffer when modalities provide conflicting cues. To understand these failures, we examine cases where unimodal and multimodal predictions diverge. Using fine-tuned models for text, audio, and video, along with a gated fusion model, we find that such disagreements often reflect underlying ambiguity, as evidenced by annotator uncertainty. Our analysis shows that dominant signals in one modality can mislead fusion when unsupported by others. We also observe that humans, like models, do not consistently benefit from multimodal input. These insights position disagreement as a useful diagnostic signal for identifying challenging examples and improving empathy system robustness.
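
A gated fusion module of the kind mentioned above can be sketched in a few lines; the gate parameterization below (one sigmoid gate per modality, computed over the concatenated unimodal features) is an assumption for illustration, not necessarily the paper's exact design.

```python
import numpy as np

def gated_fusion(h_text, h_audio, h_video, Wg, bg):
    """Sigmoid gates over the concatenated unimodal features decide how
    much each modality contributes to the fused representation."""
    h = [h_text, h_audio, h_video]
    concat = np.concatenate(h)                          # (3*d,)
    gates = 1.0 / (1.0 + np.exp(-(Wg @ concat + bg)))   # (3,), one per modality
    fused = sum(g * hi for g, hi in zip(gates, h))
    return fused, gates

rng = np.random.default_rng(6)
d = 8
fused, gates = gated_fusion(
    rng.normal(size=d), rng.normal(size=d), rng.normal(size=d),
    Wg=rng.normal(size=(3, 3 * d)), bg=np.zeros(3),
)
print(gates)  # near-0/1 gates flag which modality dominates an example
```

Inspecting the per-example gates is one way the disagreement analysis described in the abstract could identify cases where a single dominant modality misleads fusion.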

[NLP-128] DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models

[Quick Read]: This paper targets the inefficiency caused by the verbose chain-of-thought (CoT) traces that Large Reasoning Models (LRMs) produce during inference. The key is Distilled Reasoning Pruning (DRP), a hybrid framework combining inference-time pruning with tuning-based distillation: a teacher model performs skill-aware step decomposition and content pruning, and the pruned reasoning paths are then distilled into a student model, enabling reasoning that is both efficient and accurate.

Link: https://arxiv.org/abs/2505.13975
Authors: Yuxuan Jiang, Dawei Li, Frank Ferraro
Institutions: University of Maryland, Baltimore County; Arizona State University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:While Large Reasoning Models (LRMs) have demonstrated success in complex reasoning tasks through long chain-of-thought (CoT) reasoning, their inference often involves excessively verbose reasoning traces, resulting in substantial inefficiency. To address this, we propose Distilled Reasoning Pruning (DRP), a hybrid framework that combines inference-time pruning with tuning-based distillation, two widely used strategies for efficient reasoning. DRP uses a teacher model to perform skill-aware step decomposition and content pruning, and then distills the pruned reasoning paths into a student model, enabling it to reason both efficiently and accurately. Across several challenging mathematical reasoning datasets, we find that models trained with DRP achieve substantial improvements in token efficiency without sacrificing accuracy. Specifically, DRP reduces average token usage on GSM8K from 917 to 328 while improving accuracy from 91.7% to 94.1%, and achieves a 43% token reduction on AIME with no performance drop. Further analysis shows that aligning the reasoning structure of training CoTs with the student’s reasoning capacity is critical for effective knowledge transfer and performance gains.

[NLP-129] Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models

[Quick Read]: This paper addresses the difficulty of achieving clinically grounded model behavior when applying reinforcement learning (RL)-based tuning to medical visual question answering (VQA). The key is a systematic analysis of four critical dimensions (base-model initialization strategy, the role of medical semantic alignment, the impact of length-based rewards on long-chain reasoning, and the influence of bias) to optimize the RL tuning process for the medical domain. The results also show that GRPO-based RL tuning consistently outperforms standard supervised fine-tuning (SFT) in both accuracy and reasoning quality.

Link: https://arxiv.org/abs/2505.13973
Authors: Wenhui Zhu, Xuanzhao Dong, Xin Li, Peijie Qiu, Xiwen Chen, Abolfazl Razi, Aris Sotiras, Yi Su, Yalin Wang
Institutions: Arizona State University; Clemson University; Washington University in St. Louis; Banner Alzheimer's Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, reinforcement learning (RL)-based tuning has shifted the trajectory of Multimodal Large Language Models (MLLMs), particularly following the introduction of Group Relative Policy Optimization (GRPO). However, directly applying it to medical tasks remains challenging for achieving clinically grounded model behavior. Motivated by the need to align model response with clinical expectations, we investigate four critical dimensions that affect the effectiveness of RL-based tuning in medical visual question answering (VQA): base model initialization strategy, the role of medical semantic alignment, the impact of length-based rewards on long-chain reasoning, and the influence of bias. We conduct extensive experiments to analyze these factors for medical MLLMs, providing new insights into how models are domain-specifically fine-tuned. Additionally, our results also demonstrate that GRPO-based RL tuning consistently outperforms standard supervised fine-tuning (SFT) in both accuracy and reasoning quality.

[NLP-130] Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals

[Quick Read]: This paper addresses the inconsistent selection of judge models used to evaluate the validity of generated counterfactuals (via label flipping) when counterfactual data augmentation (CDA) is used to improve the performance and robustness of large language models (LLMs). The key is to characterize four types of relationships between the counterfactual generator and the judge model, and to show that independent, non-fine-tuned judge models provide the most reliable label-flipping evaluations, thereby supporting CDA more effectively.

Link: https://arxiv.org/abs/2505.13972
Authors: Qianli Wang, Van Bach Nguyen, Nils Feldhus, Luis Felipe Villa-Arenas, Christin Seifert, Sebastian Möller, Vera Schmitt
Institutions: Quality and Usability Lab, Technische Universität Berlin; University of Marburg; German Research Center for Artificial Intelligence (DFKI); Deutsche Telekom; BIFOLD – Berlin Institute for the Foundations of Learning and Data
Subjects: Computation and Language (cs.CL)
Comments: in submission

Abstract:Counterfactual examples are widely employed to enhance the performance and robustness of large language models (LLMs) through counterfactual data augmentation (CDA). However, the selection of the judge model used to evaluate label flipping, the primary metric for assessing the validity of generated counterfactuals for CDA, yields inconsistent results. To decipher this, we define four types of relationships between the counterfactual generator and judge models. Through extensive experiments involving two state-of-the-art LLM-based methods, three datasets, five generator models, and 15 judge models, complemented by a user study (n = 90), we demonstrate that judge models with an independent, non-fine-tuned relationship to the generator model provide the most reliable label flipping evaluations. Relationships between the generator and judge models, which are closely aligned with the user study for CDA, result in better model performance and robustness. Nevertheless, we find that the gap between the most effective judge models and the results obtained from the user study remains considerably large. This suggests that a fully automated pipeline for CDA may be inadequate and requires human intervention.

[NLP-131] CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring

[Quick Read]: This paper addresses the weaknesses of Automated Essay Scoring (AES) in evaluation generalizability and multimodal perception, as well as the hallucinated justifications and human-misaligned scores produced by existing Multimodal Large Language Model (MLLM)-based approaches. The key is CAFES, the first collaborative multi-agent framework designed specifically for AES: three specialized agents (an Initial Scorer, a Feedback Pool Manager, and a Reflective Scorer) work in concert to deliver fast, trait-specific, multi-dimensional essay evaluations that align with human judgment.

Link: https://arxiv.org/abs/2505.13965
Authors: Jiamin Su, Yibo Yan, Zhuoran Gao, Han Zhang, Xiang Liu, Xuming Hu
Institutions: The Hong Kong University of Science and Technology (Guangzhou); Beijing Future Brain Education Technology Co., Ltd.; The Hong Kong University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2502.11916

Abstract:Automated Essay Scoring (AES) is crucial for modern education, particularly with the increasing prevalence of multimodal assessments. However, traditional AES methods struggle with evaluation generalizability and multimodal perception, while even recent Multimodal Large Language Model (MLLM)-based approaches can produce hallucinated justifications and scores misaligned with human judgment. To address the limitations, we introduce CAFES, the first collaborative multi-agent framework specifically designed for AES. It orchestrates three specialized agents: an Initial Scorer for rapid, trait-specific evaluations; a Feedback Pool Manager to aggregate detailed, evidence-grounded strengths; and a Reflective Scorer that iteratively refines scores based on this feedback to enhance human alignment. Extensive experiments, using state-of-the-art MLLMs, achieve an average relative improvement of 21% in Quadratic Weighted Kappa (QWK) against ground truth, especially for grammatical and lexical diversity. Our proposed CAFES framework paves the way for an intelligent multimodal AES system. The code will be available upon acceptance.

[NLP-132] Through a Compressed Lens: Investigating the Impact of Quantization on LLM Explainability and Interpretability

[Quick Read]: This paper investigates the previously unexplored effects of quantization on the explainability and interpretability of large language models (LLMs). The key is a systematic set of experiments combining three common quantization techniques at distinct bit widths with two explainability methods (counterfactual examples and natural language explanations) and two interpretability approaches (knowledge memorization analysis and latent multi-hop reasoning analysis), complemented by a user study. The findings show that quantization's impact on explainability and interpretability is highly configuration-dependent and inconsistent in direction, a cautionary result for deploying LLMs in applications where transparency is critical.

Link: https://arxiv.org/abs/2505.13963
Authors: Qianli Wang, Mingyang Wang, Nils Feldhus, Simon Ostermann, Yuan Cao, Hinrich Schütze, Sebastian Möller, Vera Schmitt
Institutions: Quality and Usability Lab, Technische Universität Berlin; German Research Center for Artificial Intelligence (DFKI); Saarland Informatics Campus; LMU Munich; Bosch Center for Artificial Intelligence (BCAI); Munich Center for Machine Learning (MCML); Centre for European Research in Trusted AI (CERTAIN); BIFOLD – Berlin Institute for the Foundations of Learning and Data; Technical University of Munich
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: In submission

Abstract:Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). While prior research has extensively investigated the degradation of various LLM capabilities due to quantization, its effects on model explainability and interpretability, which are crucial for understanding decision-making processes, remain unexplored. To address this gap, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with two explainability methods, counterfactual examples and natural language explanations, as well as two interpretability approaches, knowledge memorization analysis and latent multi-hop reasoning analysis. We complement our analysis with a thorough user study, evaluating selected explainability methods. Our findings reveal that, depending on the configuration, quantization can significantly impact model explainability and interpretability. Notably, the direction of this effect is not consistent, as it strongly depends on (1) the quantization method, (2) the explainability or interpretability approach, and (3) the evaluation protocol. In some settings, human evaluation shows that quantization degrades explainability, while in others, it even leads to improvements. Our work serves as a cautionary tale, demonstrating that quantization can unpredictably affect model transparency. This insight has important implications for deploying LLMs in applications where transparency is a critical requirement.

[NLP-133] Beyond Text: Unveiling Privacy Vulnerabilities in Multi-modal Retrieval-Augmented Generation

[Quick Read]: This paper provides the first systematic analysis of unexplored privacy vulnerabilities in Multimodal Retrieval-Augmented Generation (MRAG) systems, covering vision-language and speech-language modalities. The key is a novel compositional structured prompt attack in a black-box setting, which demonstrates how attackers can extract private information by manipulating queries: large multimodal models can both directly generate outputs resembling retrieved content and indirectly expose sensitive information.

Link: https://arxiv.org/abs/2505.13957
Authors: Jiankun Zhang, Shenglai Zeng, Jie Ren, Tianqi Zheng, Hui Liu, Xianfeng Tang, Hui Liu, Yi Chang
Institutions: Michigan State University; Amazon.com; Jilin University
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract:Multimodal Retrieval-Augmented Generation (MRAG) systems enhance LMMs by integrating external multimodal databases, but introduce unexplored privacy vulnerabilities. While text-based RAG privacy risks have been studied, multimodal data presents unique challenges. We provide the first systematic analysis of MRAG privacy vulnerabilities across vision-language and speech-language modalities. Using a novel compositional structured prompt attack in a black-box setting, we demonstrate how attackers can extract private information by manipulating queries. Our experiments reveal that LMMs can both directly generate outputs resembling retrieved content and produce descriptions that indirectly expose sensitive information, highlighting the urgent need for robust privacy-preserving MRAG techniques.

[NLP-134] FlashThink: An Early Exit Method For Efficient Reasoning

[Quick Read]: This paper addresses the heavy computational overhead caused by Large Language Models (LLMs) generating excessively long reasoning content. The key is a verification model that identifies the exact moment at which the model can stop reasoning and still produce the correct answer, effectively shortening the reasoning content without degrading accuracy.

Link: https://arxiv.org/abs/2505.13949
Authors: Guochao Jiang, Guofeng Quan, Zepeng Ding, Ziqin Luo, Dixuan Wang, Zheng Hu
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have shown impressive performance in reasoning tasks. However, LLMs tend to generate excessively long reasoning content, leading to significant computational overhead. Our observations indicate that even on simple problems, LLMs tend to produce unnecessarily lengthy reasoning content, which is against intuitive expectations. Preliminary experiments show that at a certain point during the generation process, the model is already capable of producing the correct solution without completing the full reasoning content. Therefore, we consider that the reasoning process of the model can be exited early to achieve the purpose of efficient reasoning. We introduce a verification model that identifies the exact moment when the model can stop reasoning and still provide the correct answer. Comprehensive experiments on four different benchmarks demonstrate that our proposed method, FlashThink, effectively shortens the reasoning content while preserving the model accuracy. For the Deepseek-R1 and QwQ-32B models, we reduced the length of reasoning content by 77.04% and 77.47%, respectively, without reducing the accuracy.
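
The early-exit control flow is easy to sketch. In the toy version below, `generate_chunk`, `can_exit`, and `answer` are stand-ins for the reasoning model, the verification model, and the answer extractor respectively; the chunked generation granularity is an assumption for illustration.

```python
def flashthink_generate(generate_chunk, can_exit, answer, max_chunks=64):
    """Early-exit reasoning loop: after each reasoning chunk, a verification
    model decides whether the correct answer is already derivable."""
    context = ""
    for _ in range(max_chunks):
        context += generate_chunk(context)
        if can_exit(context):   # verifier says: stop reasoning now
            break
    return answer(context)

# Toy demo: the "verifier" fires once a conclusion marker appears.
chunks = iter(["Let x=2. ", "Then 2x=4. ",
               "Therefore the answer is 4. ", "Extra musing... "])
result = flashthink_generate(
    generate_chunk=lambda ctx: next(chunks),
    can_exit=lambda ctx: "Therefore" in ctx,
    answer=lambda ctx: ctx.strip().split("answer is ")[-1],
)
print(result)  # "4." -- the fourth chunk is never generated
```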

[NLP-135] Memory-Centric Embodied Question Answer

[Quick Read]: This paper targets the efficiency and accuracy limits of Embodied Question Answering (EQA) agents on complex tasks, which stem from insufficient memory capability: traditional planner-centric frameworks restrict the interaction between the memory module and the other modules, making tasks involving multiple targets across different regions hard to handle. The key is MemoryEQA, a memory-centric EQA framework that flexibly injects memory information into all modules (including the planning, memory, and answering modules). It establishes a multimodal hierarchical memory mechanism consisting of global memory, which stores language-enhanced scene maps, and local memory, which retains historical observations and state information, with a multimodal large language model converting memory into the input formats required by each module.

Link: https://arxiv.org/abs/2505.13948
Authors: Mingliang Zhai, Zhi Gao, Yuwei Wu, Yunde Jia
Institutions: Beijing Key Laboratory of Intelligent Information Technology; School of Computer Science and Technology, Beijing Institute of Technology; Guangdong Laboratory of Machine Perception and Intelligent Computing; Shenzhen MSU-BIT University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 14 pages, 7 figures, 6 tables

Abstract:Embodied Question Answering (EQA) requires agents to autonomously explore and understand the environment to answer context-dependent questions. Existing frameworks typically center around the planner, which guides the stopping module, memory module, and answering module for reasoning. In this paper, we propose a memory-centric EQA framework named MemoryEQA. Unlike planner-centric EQA models where the memory module cannot fully interact with other modules, MemoryEQA flexible feeds memory information into all modules, thereby enhancing efficiency and accuracy in handling complex tasks, such as those involving multiple targets across different regions. Specifically, we establish a multi-modal hierarchical memory mechanism, which is divided into global memory that stores language-enhanced scene maps, and local memory that retains historical observations and state information. When performing EQA tasks, the multi-modal large language model is leveraged to convert memory information into the required input formats for injection into different modules. To evaluate EQA models’ memory capabilities, we constructed the MT-HM3D dataset based on HM3D, comprising 1,587 question-answer pairs involving multiple targets across various regions, which requires agents to maintain memory of exploration-acquired target information. Experimental results on HM-EQA, MT-HM3D, and OpenEQA demonstrate the effectiveness of our framework, where a 19.8% performance gain on MT-HM3D compared to baseline model further underscores memory capability’s pivotal role in resolving complex tasks.

[NLP-136] Towards Rehearsal-Free Continual Relation Extraction: Capturing Within-Task Variance with Adaptive Prompting

[Quick Read]: This paper tackles the problems faced by prompt-based methods for continual relation extraction (CRE): inaccurate task identification, catastrophic forgetting, and difficulty handling both cross-task and within-task variation. The key is WAVE++, inspired by the connection between prefix-tuning and mixture-of-experts. It introduces task-specific prompt pools that improve flexibility and adaptability across diverse tasks while avoiding boundary-spanning risks; incorporates label descriptions that provide richer, more global context for more accurate relation classification; adds a training-free mechanism to improve task prediction at inference; and integrates a generative model to consolidate prior knowledge within the shared parameters, removing the need for explicit data storage.

Link: https://arxiv.org/abs/2505.13944
Authors: Bao-Ngoc Dao, Quang Nguyen, Luyen Ngo Dinh, Minh Le, Nam Le, Linh Ngo Van
Institutions: Hanoi University of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Memory-based approaches have shown strong performance in Continual Relation Extraction (CRE). However, storing examples from previous tasks increases memory usage and raises privacy concerns. Recently, prompt-based methods have emerged as a promising alternative, as they do not rely on storing past samples. Despite this progress, current prompt-based techniques face several core challenges in CRE, particularly in accurately identifying task identities and mitigating catastrophic forgetting. Existing prompt selection strategies often suffer from inaccuracies, lack robust mechanisms to prevent forgetting in shared parameters, and struggle to handle both cross-task and within-task variations. In this paper, we propose WAVE++, a novel approach inspired by the connection between prefix-tuning and mixture of experts. Specifically, we introduce task-specific prompt pools that enhance flexibility and adaptability across diverse tasks while avoiding boundary-spanning risks; this design more effectively captures variations within each task and across tasks. To further refine relation classification, we incorporate label descriptions that provide richer, more global context, enabling the model to better distinguish among different relations. We also propose a training-free mechanism to improve task prediction during inference. Moreover, we integrate a generative model to consolidate prior knowledge within the shared parameters, thereby removing the need for explicit data storage. Extensive experiments demonstrate that WAVE++ outperforms state-of-the-art prompt-based and rehearsal-based methods, offering a more robust solution for continual relation extraction. Our code is publicly available at this https URL.

[NLP-137] MLZero: A Multi-Agent System for End-to-end Machine Learning Automation

[Quick Read]: This paper addresses the substantial manual configuration and expert input that existing AutoML systems still require, particularly for multimodal data, as well as limitations of large language models (LLMs) such as hallucinated code generation and outdated API knowledge. The key is MLZero, an LLM-powered multi-agent framework for end-to-end ML automation across diverse data modalities with minimal human intervention: a cognitive perception module first transforms raw multimodal inputs into perceptual context that guides the subsequent workflow, and the iterative code-generation process is enhanced with semantic and episodic memory to reduce hallucination and stale-API errors.

Link: https://arxiv.org/abs/2505.13941
Authors: Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, George Karypis
Institutions: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Existing AutoML systems have advanced the automation of machine learning (ML); however, they still require substantial manual configuration and expert input, particularly when handling multimodal data. We introduce MLZero, a novel multi-agent framework powered by Large Language Models (LLMs) that enables end-to-end ML automation across diverse data modalities with minimal human intervention. A cognitive perception module is first employed, transforming raw multimodal inputs into perceptual context that effectively guides the subsequent workflow. To address key limitations of LLMs, such as hallucinated code generation and outdated API knowledge, we enhance the iterative code generation process with semantic and episodic memory. MLZero demonstrates superior performance on MLE-Bench Lite, outperforming all competitors in both success rate and solution quality, securing six gold medals. Additionally, when evaluated on our Multimodal AutoML Agent Benchmark, which includes 25 more challenging tasks spanning diverse data modalities, MLZero outperforms the competing methods by a large margin with a success rate of 0.92 (+263.6%) and an average rank of 2.28. Our approach maintains its robust effectiveness even with a compact 8B LLM, outperforming full-size systems from existing solutions.

[NLP-138] EEG-to-Text Translation: A Model for Deciphering Human Brain Activity

[Quick Read]: This paper addresses the performance limits of decoding electroencephalography (EEG) signals into text. The key is a new model, R1 Translator, which combines a bidirectional LSTM encoder with a pretrained transformer-based decoder: the LSTM captures sequential dependencies in the EEG features, which are then fed into the transformer decoder for effective, high-quality text generation.

Link: https://arxiv.org/abs/2505.13936
Authors: Saydul Akbar Murad, Ashim Dahal, Nick Rahimi
Institutions: The University of Southern Mississippi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid advancement of large language models like Gemini, GPT, and others, bridging the gap between the human brain and language processing has become an important area of focus. To address this challenge, researchers have developed various models to decode EEG signals into text. However, these models still face significant performance limitations. To overcome these shortcomings, we propose a new model, R1 Translator, which aims to improve the performance of EEG-to-text decoding. The R1 Translator model combines a bidirectional LSTM encoder with a pretrained transformer-based decoder, utilizing EEG features to produce high-quality text outputs. The model processes EEG embeddings through the LSTM to capture sequential dependencies, which are then fed into the transformer decoder for effective text generation. The R1 Translator excels in ROUGE metrics, outperforming both T5 (previous research) and Brain Translator. Specifically, R1 achieves a ROUGE-1 score of 38.00%, which is up to 9% higher than T5 (34.89%) and 3% better than Brain (35.69%). It also leads in ROUGE-L, with a F1 score of 32.51%, outperforming T5 by 3% (29.67%) and Brain by 2% (30.38%). In terms of CER, R1 achieves a CER of 0.5795, which is 2% lower than T5 (0.5917) and 4% lower than Brain (0.6001). Additionally, R1 performs better in WER with a score of 0.7280, outperforming T5 by 4.3% (0.7610) and Brain by 3.6% (0.7553). Code is available at this https URL.
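
A skeleton of the described architecture is sketched below, assuming PyTorch, a generic `nn.TransformerDecoder` standing in for the specific pretrained decoder, and arbitrary dimensions; it shows only the BiLSTM-encodes-EEG, transformer-decodes-text data flow.

```python
import torch
import torch.nn as nn

class R1StyleTranslator(nn.Module):
    """Illustrative EEG-to-text skeleton: a BiLSTM encodes the EEG feature
    sequence; a transformer decoder attends over it to generate text tokens."""
    def __init__(self, eeg_dim=105, d_model=256, vocab=32000):
        super().__init__()
        # Bidirectional LSTM: two directions of d_model // 2 give d_model features.
        self.encoder = nn.LSTM(eeg_dim, d_model // 2, batch_first=True,
                               bidirectional=True)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.embed = nn.Embedding(vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, eeg, tokens):
        memory, _ = self.encoder(eeg)            # (B, T_eeg, d_model)
        tgt = self.embed(tokens)                 # (B, T_txt, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)                 # (B, T_txt, vocab)

model = R1StyleTranslator()
logits = model(torch.randn(2, 50, 105), torch.randint(0, 32000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 32000])
```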

[NLP-139] Word length predicts word order: “Min-max”-ing drives language evolution

[Quick Read]: This paper asks what universal mechanism drives word-order change in language evolution, a point of contention between innatist and usage-based theories. The key is a large tagged parallel dataset covering more than 1,500 languages from 133 language families and 111 isolates. Word-class length correlates significantly with word order crosslinguistically, though not in a straightforward way, partially supporting processing-related theories; it also predicts historical word-order change in two distinct phylogenetic lines and explains more variance in regression models than descent or language area. These findings motivate an integrated "Min-Max" theory of language evolution driven by competing pressures of processing and information structure.

Link: https://arxiv.org/abs/2505.13913
Authors: Hiram Ring
Institutions: NTU Singapore
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Current theories of language propose an innate (Baker 2001; Chomsky 1981) or a functional (Greenberg 1963; Dryer 2007; Hawkins 2014) origin for the surface structures (i.e. word order) that we observe in languages of the world, while evolutionary modeling (Dunn et al. 2011) suggests that descent is the primary factor influencing such patterns. Although there are hypotheses for word order change from both innate and usage-based perspectives for specific languages and families, there are key disagreements between the two major proposals for mechanisms that drive the evolution of language more broadly (Wasow 2002; Levy 2008). This paper proposes a universal underlying mechanism for word order change based on a large tagged parallel dataset of over 1,500 languages representing 133 language families and 111 isolates. Results indicate that word class length is significantly correlated with word order crosslinguistically, but not in a straightforward manner, partially supporting opposing theories of processing, while at the same time predicting historical word order change in two different phylogenetic lines and explaining more variance than descent or language area in regression models. Such findings suggest an integrated “Min-Max” theory of language evolution driven by competing pressures of processing and information structure, aligning with recent efficiency-oriented (Levshina 2023) and information-theoretic proposals (Zaslavsky 2020; Tucker et al. 2025).

[NLP-140] Efficient Agent Training for Computer Use

[Quick Read]: This paper addresses the bottleneck that the limited scale of high-quality trajectory data imposes on developing human-like computer-use agents. The key is the PC Agent-E framework: starting from only 312 human-annotated computer-use trajectories, it synthesizes diverse action decisions with Claude 3.7 Sonnet, markedly improving data quality while reducing reliance on large-scale human demonstrations.

Link: https://arxiv.org/abs/2505.13909
Authors: Yanheng He, Jiahe Jin, Pengfei Liu
Institutions: Shanghai Jiao Tong University; SII; Generative AI Research Lab (GAIR)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: We open-source our entire suite of code, data, and models to facilitate future research at this https URL

Abstract:Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further improved data quality by synthesizing diverse action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, surpassing the strong Claude 3.7 Sonnet with extended thinking on WindowsAgentArena-V2, an improved benchmark we also released. Furthermore, PC Agent-E demonstrates strong generalizability to different operating systems on OSWorld. Our findings suggest that strong computer use capabilities can be stimulated from a small amount of high-quality trajectory data.

[NLP-141] Cross-Linguistic Transfer in Multilingual NLP: The Role of Language Families and Morphology

[Quick Read]: This paper examines the effectiveness of cross-lingual transfer in multilingual NLP, in particular how language-family proximity and morphological similarity affect model performance across NLP tasks. The key is to analyze how linguistic distance metrics correlate with transfer outcomes, and to review emerging approaches that integrate typological and morphological information into model pre-training to improve transfer to diverse languages.

Link: https://arxiv.org/abs/2505.13908
Authors: Ajitesh Bankula, Praney Bankula
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Cross-lingual transfer has become a crucial aspect of multilingual NLP, as it allows for models trained on resource-rich languages to be applied to low-resource languages more effectively. Recently massively multilingual pre-trained language models (e.g., mBERT, XLM-R) demonstrate strong zero-shot transfer capabilities[14] [13]. This paper investigates cross-linguistic transfer through the lens of language families and morphology. Investigating how language family proximity and morphological similarity affect performance across NLP tasks. We further discuss our results and how it relates to findings from recent literature. Overall, we compare multilingual model performance and review how linguistic distance metrics correlate with transfer outcomes. We also look into emerging approaches that integrate typological and morphological information into model pre-training to improve transfer to diverse languages[18] [19].

[NLP-142] Let's Verify Math Questions Step by Step

[Quick Read]: This paper targets invalid or under-specified problems in math datasets, which can harm the training of large language models (LLMs) for mathematical reasoning. The key is MathQ-Verify, a five-stage pipeline that systematically filters ill-posed math questions: format validation, formalization, decomposition into atomic conditions verified against mathematical definitions, logical-contradiction detection, and a goal-oriented completeness check, thereby improving the quality of math QA datasets.

Link: https://arxiv.org/abs/2505.13903
Authors: Chengyu Shen, Zhen Hao Wong, Runming He, Hao Liang, Meiyi Qiang, Zimo Meng, Zhengyang Zhao, Bohan Zeng, Zhengzhou Zhu, Bin Cui, Wentao Zhang
Institutions: Peking University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have recently achieved remarkable progress in mathematical reasoning. To enable such capabilities, many existing works distill strong reasoning models into long chains of thought or design algorithms to construct high-quality math QA data for training. However, these efforts primarily focus on generating correct reasoning paths and answers, while largely overlooking the validity of the questions themselves. In this work, we propose Math Question Verification (MathQ-Verify), a novel five-stage pipeline designed to rigorously filter ill-posed or under-specified math problems. MathQ-Verify first performs format-level validation to remove redundant instructions and ensure that each question is syntactically well-formed. It then formalizes each question, decomposes it into atomic conditions, and verifies them against mathematical definitions. Next, it detects logical contradictions among these conditions, followed by a goal-oriented completeness check to ensure the question provides sufficient information for solving. To evaluate this task, we use existing benchmarks along with an additional dataset we construct, containing 2,147 math questions with diverse error types, each manually double-validated. Experiments show that MathQ-Verify achieves state-of-the-art performance across multiple benchmarks, improving the F1 score by up to 25 percentage points over the direct verification baseline. It further attains approximately 90% precision and 63% recall through a lightweight model voting scheme. MathQ-Verify offers a scalable and accurate solution for curating reliable mathematical datasets, reducing label noise and avoiding unnecessary computation on invalid questions. Our code and data are available at this https URL.

[NLP-143] InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

[Quick Read]: This paper addresses the misalignment of generation behavior in multi-model fusion caused by ignoring semantic dependencies across vocabulary dimensions: existing logit-based fusion methods preserve inference efficiency but treat vocabulary dimensions independently, failing to capture the interactions among token types during a model's internal reasoning. The key is the InfiGFusion framework with its novel Graph-on-Logits Distillation (GLD) loss: it retains the top-k logits of each output and aggregates their outer products across sequence positions into a global co-activation graph, explicitly modeling inter-dimension semantic dependencies; a sorting-based closed-form approximation then reduces the computational cost of the Gromov-Wasserstein distance to O(n log n), ensuring scalability and efficiency with provable guarantees.

Link: https://arxiv.org/abs/2505.13893
Authors: Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, Hongxia Yang
Institutions: Reallm Labs; The Hong Kong Polytechnic University; Zhejiang University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose InfiGFusion, the first structure-aware fusion framework with a novel Graph-on-Logits Distillation (GLD) loss. Specifically, we retain the top-k logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original O(n^4) cost of Gromov-Wasserstein distance to O(n log n), with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
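
The co-activation graph construction is described concretely enough to sketch: keep the top-k logits per position, take outer products, and sum over the sequence. The sorting-based Gromov-Wasserstein approximation that compares two such graphs in the GLD loss is omitted here, so treat this as a partial illustration.

```python
import numpy as np

def coactivation_graph(logits: np.ndarray, k: int = 4) -> np.ndarray:
    """Build a vocab-channel co-activation graph from per-position logits.

    logits: (seq_len, vocab). For each position, keep only the top-k logits,
    take the outer product of the sparsified vector with itself, and sum over
    positions; edge (i, j) then measures how strongly channels i and j
    co-activate across the sequence."""
    seq_len, vocab = logits.shape
    graph = np.zeros((vocab, vocab))
    for t in range(seq_len):
        v = logits[t]
        keep = np.argpartition(v, -k)[-k:]   # indices of the top-k logits
        sparse = np.zeros_like(v)
        sparse[keep] = v[keep]
        graph += np.outer(sparse, sparse)
    return graph

rng = np.random.default_rng(3)
g_teacher = coactivation_graph(rng.normal(size=(5, 10)))
g_student = coactivation_graph(rng.normal(size=(5, 10)))
# The GLD loss would compare g_teacher and g_student with the paper's
# sorting-based Gromov-Wasserstein approximation (not shown).
print(g_teacher.shape)  # (10, 10)
```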

[NLP-144] Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM

[Quick Read]: This paper addresses the counterintuitive and unstable behaviors of Reasoning LLMs (RLMs), notably their performance degradation under few-shot prompting. The key is a unified graph-based analysis framework: long, verbose Chain-of-Thought (CoT) outputs are first clustered into semantically coherent reasoning steps, and directed reasoning graphs are then constructed to capture the contextual and logical dependencies among these steps. A comprehensive analysis across models and prompting strategies reveals that structural properties such as exploration density, branching, and convergence ratios correlate strongly with reasoning accuracy, offering a new lens on the internal reasoning structure of RLMs.

Link: https://arxiv.org/abs/2505.13890
Authors: Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang
Institutions: University of Southern California; The University of Queensland; University of California, San Diego; University of California, Merced
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in test-time scaling have enabled Large Language Models (LLMs) to display sophisticated reasoning abilities via extended Chain-of-Thought (CoT) generation. Despite their potential, these Reasoning LLMs (RLMs) often demonstrate counterintuitive and unstable behaviors, such as performance degradation under few-shot prompting, that challenge our current understanding of RLMs. In this work, we introduce a unified graph-based analytical framework for better modeling the reasoning processes of RLMs. Our method first clusters long, verbose CoT outputs into semantically coherent reasoning steps, then constructs directed reasoning graphs to capture contextual and logical dependencies among these steps. Through comprehensive analysis across models and prompting regimes, we reveal that structural properties, such as exploration density, branching, and convergence ratios, strongly correlate with reasoning accuracy. Our findings demonstrate how prompting strategies substantially reshape the internal reasoning structure of RLMs, directly affecting task outcomes. The proposed framework not only enables quantitative evaluation of reasoning quality beyond conventional metrics but also provides practical insights for prompt engineering and the cognitive analysis of LLMs. Code and resources will be released to facilitate future research in this direction.

[NLP-145] Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation

[Quick Read]: This paper addresses the inefficiency of task management that arises as mobile device usage surges, since traditional AI frameworks lack operational expertise. The key is the Mobile-Agent-V framework, which uses video as a guiding tool: it derives operational knowledge directly from video content, eliminating manual intervention and substantially reducing the effort and time required for knowledge acquisition.

Link: https://arxiv.org/abs/2505.13887
Authors: Junyang Wang, Haiyang Xu, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Jitao Sang
Institutions: Beijing Jiaotong University; Alibaba Group
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 7 figures, 9 tables. arXiv admin note: substantial text overlap with arXiv:2502.17110

Abstract:The exponential rise in mobile device usage necessitates streamlined automation for effective task management, yet many AI frameworks fall short due to inadequate operational expertise. While manually written knowledge can bridge this gap, it is often burdensome and inefficient. We introduce Mobile-Agent-V, an innovative framework that utilizes video as a guiding tool to effortlessly and efficiently inject operational knowledge into mobile automation processes. By deriving knowledge directly from video content, Mobile-Agent-V eliminates manual intervention, significantly reducing the effort and time required for knowledge acquisition. To rigorously evaluate this approach, we propose Mobile-Knowledge, a benchmark tailored to assess the impact of external knowledge on mobile agent performance. Our experimental findings demonstrate that Mobile-Agent-V enhances performance by 36% compared to existing methods, underscoring its effortless and efficient advantages in mobile automation.

[NLP-146] Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning NEURIPS2025

[Quick Read]: This paper addresses the scarcity of visual-language Chain-of-Thought (CoT) data for Vision Language Models (VLMs), which limits improvements in their reasoning capabilities. The key is Code2Logic, which exploits a promising resource: game code, which naturally contains logical structures and state-transition processes. Large Language Models (LLMs) adapt the game code so that reasoning processes and results can be obtained automatically through code execution. With this approach, the authors built the GameQA dataset, which is cost-effective and scalable to produce, diverse (30 games, 158 tasks), and challenging for state-of-the-art models, and which yields strong performance across multiple benchmarks.

Link: https://arxiv.org/abs/2505.13886
Authors: Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 49 pages, 19 figures, submitted to NeurIPS 2025

Abstract:Visual-language Chain-of-Thought (CoT) data resources are relatively scarce compared to text-only counterparts, limiting the improvement of reasoning capabilities in Vision Language Models (VLMs). However, high-quality vision-language reasoning data is expensive and labor-intensive to annotate. To address this issue, we leverage a promising resource: game code, which naturally contains logical structures and state transition processes. Therefore, we propose Code2Logic, a novel game-code-driven approach for multimodal reasoning data synthesis. Our approach leverages Large Language Models (LLMs) to adapt game code, enabling automatic acquisition of reasoning processes and results through code execution. Using the Code2Logic approach, we developed the GameQA dataset to train and evaluate VLMs. GameQA is cost-effective and scalable to produce, challenging for state-of-the-art models, and diverse with 30 games and 158 tasks. Surprisingly, despite training solely on game data, VLMs demonstrated out of domain generalization, specifically Qwen2.5-VL-7B improving performance by 2.33% across 7 diverse vision-language benchmarks. Our code and dataset are available at this https URL.

[NLP-147] InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models

[Quick Read]: This paper addresses the under-exploration of the preference-alignment (PA) phase in model fusion: the few existing fusion methods for PA, such as WRPO, use only the response outputs of source models and discard their probability information. The key is InfiFPO, a preference-optimization method for implicit model fusion that replaces the reference model in Direct Preference Optimization (DPO) with a fused source model synthesizing multi-source probabilities at the sequence level, sidestepping the complex vocabulary-alignment issues of prior work while preserving probability information. With probability clipping and max-margin fusion strategies, the pivot model can align with human preferences while effectively distilling knowledge from the source models.

Link: https://arxiv.org/abs/2505.13878
Authors: Yanggan Gu, Zhaoyi Yan, Yuanyi Wang, Yiming Zhang, Qi Zhou, Fei Wu, Hongxia Yang
Institutions: Reallm Labs; The Hong Kong Polytechnic University; Zhejiang University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 17 pages

Abstract:Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing works on model fusion focus primarily on supervised fine-tuning (SFT), leaving preference alignment (PA) --a critical phase for enhancing LLM performance–largely unexplored. The current few fusion methods on PA phase, like WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing complex vocabulary alignment challenges in previous works and meanwhile maintaining the probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improve its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.
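
A minimal sketch of the fused-reference idea: sequence-level log-probabilities from several source models are clipped and averaged into a single reference term that is then plugged into the standard DPO loss. The uniform-style weighting and the clipping threshold below are illustrative assumptions, not the paper's exact max-margin fusion.

```python
import numpy as np

def fused_ref_logprob(source_logps: np.ndarray, weights: np.ndarray,
                      clip: float = 20.0) -> float:
    """Fused reference: weighted average of the source models'
    sequence log-probs, clipped for numerical stability."""
    return float(weights @ np.clip(source_logps, -clip, clip))

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1) -> float:
    """Standard DPO loss, with the fused model standing in as the reference:
    -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))

w = np.array([0.5, 0.3, 0.2])  # three source models
ref_w = fused_ref_logprob(np.array([-12.0, -10.5, -11.2]), w)  # chosen response
ref_l = fused_ref_logprob(np.array([-15.0, -16.1, -14.7]), w)  # rejected response
print(dpo_loss(pi_w=-11.0, pi_l=-16.0, ref_w=ref_w, ref_l=ref_l))
```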

[NLP-148] Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

[Quick Read]: This paper addresses the high memory footprint and low token-generation throughput caused by the long intermediate reasoning paths that reasoning-focused language models produce, which limits their practical deployment. The key is Reasoning Path Compression (RPC), a training-free method that exploits the semantic sparsity of reasoning paths to accelerate inference without materially hurting accuracy: it periodically compresses the key-value (KV) cache, retaining the entries with high importance scores computed from a selector window of recently generated queries.

Link: https://arxiv.org/abs/2505.13866
Authors: Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective in solving problems that require logical thinking, long reasoning paths significantly increase memory usage and throughput of token generation, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the KV cache by retaining KV cache that receive high importance score, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves generation throughput of QwQ-32B by up to 1.60 \times compared to the inference with full KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at this https URL.
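
The compression step can be sketched directly from the description: score each cached position by the attention mass it receives from a selector window of recent queries, then keep only the top fraction. The window size, keep ratio, and single-head formulation below are arbitrary simplifications.

```python
import numpy as np

def compress_kv(keys, values, queries_recent, keep_ratio=0.5):
    """Periodic KV-cache compression (single-head sketch).

    keys/values: (n_cache, d); queries_recent: (w, d) selector window."""
    d = keys.shape[1]
    attn = queries_recent @ keys.T / np.sqrt(d)          # (w, n_cache)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)              # softmax over cache
    importance = attn.sum(axis=0)                        # score per position
    n_keep = max(1, int(len(keys) * keep_ratio))
    idx = np.sort(np.argsort(importance)[-n_keep:])      # keep positional order
    return keys[idx], values[idx]

rng = np.random.default_rng(4)
K, V, Q = rng.normal(size=(100, 8)), rng.normal(size=(100, 8)), rng.normal(size=(16, 8))
K2, V2 = compress_kv(K, V, Q, keep_ratio=0.25)
print(K2.shape)  # (25, 8)
```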

[NLP-149] PandaGuard: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks

【Quick Read】: This paper addresses the problem that large language models (LLMs), despite safety alignment, remain vulnerable to adversarial "jailbreak" prompts that bypass safety mechanisms and elicit harmful outputs. The key to the solution is PandaGuard, a unified, modular framework that models LLM jailbreak safety as a multi-agent system of attackers, defenders, and judges; it integrates 19 attack methods, 12 defense mechanisms, and multiple judgment strategies, and its flexible plugin architecture supports diverse LLM interfaces and experimental configurations, improving reproducibility and practical deployability.

Link: https://arxiv.org/abs/2505.13862
Authors: Guobin Shen, Dongcheng Zhao, Linghao Feng, Xiang He, Jihang Wang, Sicheng Shen, Haibo Tong, Yiting Dong, Jindong Li, Xiang Zheng, Yi Zeng
Institutions: Beijing Institute of AI Safety and Governance; Beijing Key Laboratory of Safe AI and Superalignment; BrainCog Lab, CASIA; Long-term AI
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial prompts known as jailbreaks, which can bypass safety alignment and elicit harmful outputs. Despite growing efforts in LLM safety research, existing evaluations are often fragmented, focused on isolated attack or defense techniques, and lack systematic, reproducible analysis. In this work, we introduce PandaGuard, a unified and modular framework that models LLM jailbreak safety as a multi-agent system comprising attackers, defenders, and judges. Our framework implements 19 attack methods and 12 defense mechanisms, along with multiple judgment strategies, all within a flexible plugin architecture supporting diverse LLM interfaces, multiple interaction modes, and configuration-driven experimentation that enhances reproducibility and practical deployment. Built on this framework, we develop PandaBench, a comprehensive benchmark that evaluates the interactions between these attack/defense methods across 49 LLMs and various judgment approaches, requiring over 3 billion tokens to execute. Our extensive evaluation reveals key insights into model vulnerabilities, defense cost-performance trade-offs, and judge consistency. We find that no single defense is optimal across all dimensions and that judge disagreement introduces nontrivial variance in safety assessments. We release the code, configurations, and evaluation results to support transparent and reproducible research in LLM safety.

[NLP-150] Domain Gating Ensemble Networks for AI-Generated Text Detection EMNLP2025

【Quick Read】: This paper addresses the poor adaptability of machine-generated-text detectors to unseen domains and generator models (the domain adaptation challenge). The key to the solution is DoGEN (Domain Gating Ensemble Networks), which ensembles a set of domain-expert detector models and weights each expert's contribution dynamically using the output of a domain classifier, enabling effective detection of text from new domains.

Link: https://arxiv.org/abs/2505.13855
Authors: Arihant Tripathi, Liam Dugan, Charis Gao, Maggie Huan, Emma Jin, Peter Zhang, David Zhang, Julia Zhao, Chris Callison-Burch
Institutions: University of Pennsylvania
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to EMNLP 2025

Abstract:As state-of-the-art language models continue to improve, the need for robust detection of machine-generated text becomes increasingly critical. However, current state-of-the-art machine text detectors struggle to adapt to new unseen domains and generative models. In this paper we present DoGEN (Domain Gating Ensemble Networks), a technique that allows detectors to adapt to unseen domains by ensembling a set of domain expert detector models using weights from a domain classifier. We test DoGEN on a wide variety of domains from leading benchmarks and find that it achieves state-of-the-art performance on in-domain detection while outperforming models twice its size on out-of-domain detection. We release our code and trained models to assist in future research in domain-adaptive AI detection.
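A toy sketch of the gating idea: per-domain expert detectors are mixed with weights from a domain classifier. Shapes and names are assumptions for illustration:

```python
import torch

def dogen_predict(expert_logits, domain_probs):
    """Domain-gated ensemble (a sketch of the DoGEN idea): mix per-expert
    detection probabilities using the domain classifier's distribution as gates.
    expert_logits: (num_domains, batch, 2) - one detector per domain
    domain_probs:  (batch, num_domains)    - softmax output of a domain classifier"""
    expert_probs = torch.softmax(expert_logits, dim=-1)
    # Weighted sum over experts: out[b, c] = sum_d expert_probs[d, b, c] * domain_probs[b, d]
    return torch.einsum("dbc,bd->bc", expert_probs, domain_probs)

# Usage with toy tensors: 3 domain experts, a batch of 2 texts.
logits = torch.randn(3, 2, 2)
gates = torch.softmax(torch.randn(2, 3), dim=-1)
print(dogen_predict(logits, gates))  # (2, 2): gated P(human), P(machine) per text
```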

[NLP-151] Forensic deepfake audio detection using segmental speech features

【Quick Read】: This paper addresses the detection of deepfake audio, targeting speaker authentication and forensic voice comparison. The key to the solution is exploiting the acoustic properties of segmental speech sounds: because these features are closely tied to human articulatory processes, they are highly interpretable and harder for deepfake models to replicate, making them more effective for detecting deepfake audio.

Link: https://arxiv.org/abs/2505.13847
Authors: Tianle Yang, Chengzhe Sun, Siwei Lyu, Phil Rose
Institutions: University at Buffalo; Australian National University Emeritus Faculty
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Abstract:This study explores the potential of using acoustic features of segmental speech sounds to detect deepfake audio. These features are highly interpretable because of their close relationship with human articulatory processes and are expected to be more difficult for deepfake models to replicate. The results demonstrate that certain segmental features commonly used in forensic voice comparison are effective in identifying deepfakes, whereas some global features provide little value. These findings underscore the need to approach audio deepfake detection differently for forensic voice comparison and offer a new perspective on leveraging segmental features for this purpose.

[NLP-152] Improve Language Model and Brain Alignment via Associative Memory ACL2025

【Quick Read】: This paper addresses the alignment between language models and the human brain during speech processing, aiming to bring language models closer to human cognitive mechanisms. The key to the solution is introducing associative memory: the original text stimuli are expanded with simulated associative memory and fed to computational language models, which improves alignment with brain regions closely involved in associative-memory processing. The authors also build an Association dataset of 1,000 story samples and show that large language models align better with brain responses after targeted supervised fine-tuning on it.

Link: https://arxiv.org/abs/2505.13844
Authors: Congchi Yin, Yongpeng Zhang, Xuyun Wen, Piji Li
Institutions: Nanjing University of Aeronautics and Astronautics; The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education
Subjects: Computation and Language (cs.CL)
Comments: Accepted by Findings of ACL 2025

Abstract:Associative memory engages in the integration of relevant information for comprehension in the human cognition system. In this work, we seek to improve alignment between language models and the human brain while processing speech information by integrating associative memory. After verifying the alignment between language model and brain by mapping language model activations to brain activity, the original text stimuli expanded with simulated associative memory are regarded as input to computational language models. We find the alignment between language model and brain is improved in brain regions closely related to associative memory processing. We also demonstrate that large language models better align with brain responses after specific supervised fine-tuning, by building the Association dataset containing 1000 samples of stories, with instructions encouraging associative memory as input and associated content as output.

[NLP-153] EfficientLLM : Efficiency in Large Language Models

【Quick Read】: This paper addresses the prohibitive compute, energy, and monetary costs of large language models (LLMs) as parameter counts and context windows keep growing. The key to the solution is EfficientLLM, a new benchmark and the first comprehensive empirical study of efficiency techniques for LLMs at scale. On a production-class cluster (48×GH200, 8×H200 GPUs), it systematically evaluates three axes: architecture pretraining (efficient attention variants such as MQA, GQA, MLA, and NSA, plus sparse Mixture-of-Experts), fine-tuning (parameter-efficient methods such as LoRA, RSLoRA, and DoRA), and inference (quantization methods such as int4 and float16). Six fine-grained metrics capture hardware saturation, the latency-throughput balance, and carbon cost, providing guidance on the efficiency-performance trade-offs of next-generation foundation models.

Link: https://arxiv.org/abs/2505.13840
Authors: Zhengqing Yuan, Weixiang Sun, Yixin Liu, Huichi Zhou, Rong Zhou, Yiyang Li, Zheyuan Zhang, Wei Song, Yue Huang, Haolong Jia, Keerthiram Murugesan, Yu Wang, Lifang He, Jianfeng Gao, Lichao Sun, Yanfang Ye
Institutions: University of Notre Dame; Lehigh University; Imperial College London; Rutgers University; International Business Machines Corporation (IBM); University of Illinois Chicago; Microsoft Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our study systematically explores three key axes: (1) architecture pretraining (efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts (MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and (3) inference (quantization methods: int4, float16). We define six fine-grained metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy Consumption, Compression Rate) to capture hardware saturation, latency-throughput balance, and carbon cost. Evaluating over 100 model-technique pairs (0.5B-72B parameters), we derive three core insights: (i) Efficiency involves quantifiable trade-offs: no single method is universally optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by 40%, while int4 quantization cuts memory/energy by up to 3.9x at a 3-5% accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers optimal memory-latency trade-offs for constrained devices, MLA achieves lowest perplexity for quality-critical tasks, and RSLoRA surpasses LoRA efficiency only beyond 14B parameters. (iii) Techniques generalize across modalities: we extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and Vision-Language Models (Qwen2.5-VL), confirming effective transferability. By open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM provides essential guidance for researchers and engineers navigating the efficiency-performance landscape of next-generation foundation models.

[NLP-154] Structured Agent Distillation for Large Language Model

【Quick Read】: This paper addresses the high inference cost and large model size that hinder the practical deployment of large language models (LLMs) as decision-making agents. The key to the solution is Structured Agent Distillation, which segments trajectories into [REASON] and [ACT] spans and applies span-specific losses to each, preserving reasoning fidelity and action consistency while compressing the model.

Link: https://arxiv.org/abs/2505.13820
Authors: Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang
Institutions: Carnegie Mellon University; Northeastern University; Harvard University; MIT; Peking University; University of Georgia; Florida International University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency. Unlike standard token-level distillation, our method segments trajectories into [REASON] and [ACT] spans, applying segment-specific losses to align each component with the teacher’s behavior. This structure-aware supervision enables compact agents to better replicate the teacher’s decision process. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines, achieving significant compression with minimal performance drop. Scaling and ablation results further highlight the importance of span-level alignment for efficient and deployable agents.
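A minimal sketch of a span-aware distillation loss in PyTorch; the span weights and masking convention are our illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def structured_distill_loss(student_logits, teacher_ids, span_mask,
                            w_reason=1.0, w_act=1.0):
    """Span-aware distillation sketch: trajectories are segmented into [REASON]
    and [ACT] spans, and each span gets its own loss term and weight.
    student_logits: (batch, seq, vocab); teacher_ids: (batch, seq)
    span_mask: (batch, seq) with 0 = [REASON] token, 1 = [ACT] token."""
    logp = F.log_softmax(student_logits, dim=-1)
    nll = -logp.gather(-1, teacher_ids.unsqueeze(-1)).squeeze(-1)   # per-token NLL
    reason_loss = (nll * (span_mask == 0)).sum() / (span_mask == 0).sum().clamp(min=1)
    act_loss = (nll * (span_mask == 1)).sum() / (span_mask == 1).sum().clamp(min=1)
    return w_reason * reason_loss + w_act * act_loss
```

Normalizing each span's loss by its own token count keeps short [ACT] spans from being drowned out by long [REASON] spans, which is one plausible reason span-level supervision helps.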

[NLP-155] Interpretable Traces Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation

【Quick Read】: This paper addresses how to evaluate the faithfulness of reasoning traces used in knowledge distillation (KD) and how that faithfulness correlates with final model performance. The key to the solution is rule-based problem decomposition: complex queries are broken into structured sub-problems, yielding interpretable traces whose correctness can be checked directly, even at inference time. Experiments on Open Book QA validate the approach but also show that correct traces do not necessarily lead to correct final answers, and that the correlation between final-answer correctness and intermediate-trace correctness is low, challenging the implicit assumption behind using reasoning traces to improve small language models (SLMs) via KD.

Link: https://arxiv.org/abs/2505.13792
Authors: Siddhant Bhambri, Upasana Biswas, Subbarao Kambhampati
Institutions: Arizona State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:Question Answering (QA) poses a challenging and critical problem, particularly in today's age of interactive dialogue systems such as ChatGPT, Perplexity, Microsoft Copilot, etc., where users demand both accuracy and transparency in the model's outputs. Since smaller language models (SLMs) are computationally more efficient but often under-perform compared to larger models, Knowledge Distillation (KD) methods allow for finetuning these smaller models to improve their final performance. Lately, the intermediate tokens, or the so-called 'reasoning' traces, produced by Chain-of-Thought (CoT) or by reasoning models such as DeepSeek R1, are used as a training signal for KD. However, these reasoning traces are often verbose and difficult to interpret or evaluate. In this work, we aim to address the challenge of evaluating the faithfulness of these reasoning traces and their correlation with the final performance. To this end, we employ a KD method leveraging rule-based problem decomposition. This approach allows us to break down complex queries into structured sub-problems, generating interpretable traces whose correctness can be readily evaluated, even at inference time. Specifically, we demonstrate this approach on Open Book QA, decomposing the problem into a Classification step and an Information Retrieval step, thereby simplifying trace evaluation. Our SFT experiments with correct and incorrect traces on the CoTemp QA, Microsoft Machine Reading Comprehension QA, and Facebook bAbI QA datasets reveal the striking finding that correct traces do not necessarily imply that the model outputs the correct final solution. Similarly, we find a low correlation between correct final solutions and intermediate trace correctness. These results challenge the implicit assumption behind utilizing reasoning traces for improving SLMs' final performance via KD.

[NLP-156] Krikri: Advancing Open Large Language Models for Greek

【Quick Read】: This paper addresses the lack of a high-quality large language model (LLM) for Greek, especially for natural language understanding, generation, and code generation. The key to the solution is Llama-Krikri-8B, built on Meta's Llama 3.1-8B and extensively trained on high-quality Greek data for superior adaptation to linguistic nuances; its chat version uses a multi-stage post-training pipeline with techniques such as MAGPIE to improve instruction following and generation. The authors also propose three new public benchmarks for comprehensive evaluation in Greek.

Link: https://arxiv.org/abs/2505.13772
Authors: Dimitris Roussis, Leon Voukoutis, Georgios Paraskevopoulos, Sokratis Sofianopoulos, Prokopis Prokopidis, Vassilis Papavasileiou, Athanasios Katsamanis, Stelios Piperidis, Vassilis Katsouros
Institutions: unknown
Subjects: Computation and Language (cs.CL)

Abstract:We introduce Llama-Krikri-8B, a cutting-edge Large Language Model tailored for the Greek language, built on Meta’s Llama 3.1-8B. Llama-Krikri-8B has been extensively trained on high-quality Greek data to ensure superior adaptation to linguistic nuances. With 8 billion parameters, it offers advanced capabilities while maintaining efficient computational performance. Llama-Krikri-8B supports both Modern Greek and English, and is also equipped to handle polytonic text and Ancient Greek. The chat version of Llama-Krikri-8B features a multi-stage post-training pipeline, utilizing both human and synthetic instruction and preference data, by applying techniques such as MAGPIE. In addition, for evaluation, we propose three novel public benchmarks for Greek. Our evaluation on existing as well as the proposed benchmarks shows notable improvements over comparable Greek and multilingual LLMs in both natural language understanding and generation as well as code generation.

[NLP-157] Ice Cream Doesnt Cause Drowning: Benchmarking LLM s Against Statistical Pitfalls in Causal Inference

【Quick Read】: This paper addresses the unreliability of large language models (LLMs) at rigorous statistical causal inference, especially in the face of classic pitfalls such as Simpson's paradox and selection bias. The key to the solution is the CausalPitfalls benchmark, which pairs structured challenges at multiple difficulty levels with grading rubrics to systematically measure how well LLMs overcome common causal-inference pitfalls, using two protocols, direct prompting and code-assisted prompting, to quantify both causal reasoning capability and response reliability.

Link: https://arxiv.org/abs/2505.13770
Authors: Jin Du, Li Chen, Xun Xian, An Luo, Fangqiao Tian, Ganghua Wang, Charles Doss, Xiaotong Shen, Jie Ding
Institutions: University of Minnesota; University of Chicago
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

Abstract:Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson’s paradox or selection bias. This oversight limits the applicability of LLMs in the real world. To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired with grading rubrics. This approach allows us to quantitatively measure both causal reasoning capabilities and the reliability of LLMs’ responses. We evaluate models using two protocols: (1) direct prompting, which assesses intrinsic causal reasoning, and (2) code-assisted prompting, where models generate executable code for explicit statistical analysis. Additionally, we validate the effectiveness of this judge by comparing its scoring with assessments from human experts. Our results reveal significant limitations in current LLMs when performing statistical causal inference. The CausalPitfalls benchmark provides essential guidance and quantitative metrics to advance the development of trustworthy causal reasoning systems.
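As a concrete instance of the kind of pitfall the benchmark probes, the counts below (the classic kidney-stone numbers, used purely for illustration) exhibit Simpson's paradox: treatment A wins within every severity group yet loses after pooling, because it was given mostly to severe cases:

```python
import pandas as pd

df = pd.DataFrame({
    "severity":  ["mild", "mild", "severe", "severe"],
    "treatment": ["A", "B", "A", "B"],
    "recovered": [81, 234, 192, 55],
    "total":     [87, 270, 263, 80],
})
df["rate"] = df["recovered"] / df["total"]
print(df)   # A's rate beats B's within both severity groups (0.93>0.87, 0.73>0.69)

pooled = df.groupby("treatment")[["recovered", "total"]].sum()
print(pooled["recovered"] / pooled["total"])   # yet pooled: A 0.78 < B 0.83
```

A model that only reads the pooled table concludes B is better; a model that conditions on severity reaches the opposite, correct conclusion, which is exactly the distinction such a benchmark has to grade.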

[NLP-158] Advancing Software Quality: A Standards-Focused Review of LLM -Based Assurance Techniques

【Quick Read】: This paper examines how generative AI can be integrated with existing Software Quality Assurance (SQA) standards to raise the automation and efficiency of traditional SQA processes. The key to the solution is leveraging the capabilities of large language models (LLMs), such as requirement validation, defect detection, test generation, and documentation maintenance, while ensuring these AI-driven methods comply with international standards such as ISO/IEC 12207 and ISO/IEC 25010, so that SQA effectiveness improves without sacrificing compliance or process maturity.

Link: https://arxiv.org/abs/2505.13766
Authors: Avinash Patil
Institutions: Juniper Networks Inc.
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages, 1 Table, 6 Figures

Abstract:Software Quality Assurance (SQA) is critical for delivering reliable, secure, and efficient software products. The Software Quality Assurance Process aims to provide assurance that work products and processes comply with predefined provisions and plans. Recent advancements in Large Language Models (LLMs) present new opportunities to enhance existing SQA processes by automating tasks like requirement analysis, code review, test generation, and compliance checks. Simultaneously, established standards such as ISO/IEC 12207, ISO/IEC 25010, ISO/IEC 5055, ISO 9001/ISO/IEC 90003, CMMI, and TMM provide structured frameworks for ensuring robust quality practices. This paper surveys the intersection of LLM-based SQA methods and these recognized standards, highlighting how AI-driven solutions can augment traditional approaches while maintaining compliance and process maturity. We first review the foundational software quality standards and the technical fundamentals of LLMs in software engineering. Next, we explore various LLM-based SQA applications, including requirement validation, defect detection, test generation, and documentation maintenance. We then map these applications to key software quality frameworks, illustrating how LLMs can address specific requirements and metrics within each standard. Empirical case studies and open-source initiatives demonstrate the practical viability of these methods. At the same time, discussions on challenges (e.g., data privacy, model bias, explainability) underscore the need for deliberate governance and auditing. Finally, we propose future directions encompassing adaptive learning, privacy-focused deployments, multimodal analysis, and evolving standards for AI-driven software quality.

[NLP-159] Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations

【Quick Read】: This paper addresses the limits of metacognition in large language models (LLMs), i.e., whether models can monitor and report their own internal activation patterns. The key to the solution is a neuroscience-inspired neurofeedback paradigm: by presenting sentence-label pairs in which labels correspond to sentence-elicited activations along specific directions in representation space, the model learns to explicitly report and control those activations, allowing its metacognitive ability to be quantified. The analysis reveals a "metacognitive space" whose dimensionality is far lower than the model's full neural space, suggesting LLMs can monitor only a subset of their internal mechanisms.

Link: https://arxiv.org/abs/2505.13763
Authors: Li Ji-An, Hua-Dong Xiong, Robert C. Wilson, Marcelo G. Mattar, Marcus K. Benna
Institutions: University of California San Diego; Georgia Tech; New York University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)

Abstract:Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, but they can also fail to do so. This suggests some degree of metacognition – the capacity to monitor one’s own cognitive processes for subsequent reporting and self-control. Metacognitive abilities enhance AI capabilities but raise safety concerns, as models might obscure their internal processes to evade neural-activation-based oversight mechanisms designed to detect harmful behaviors. Given society’s increased reliance on these models, it is critical that we understand the limits of their metacognitive abilities, particularly their ability to monitor their internal activations. To address this, we introduce a neuroscience-inspired neurofeedback paradigm designed to quantify the ability of LLMs to explicitly report and control their activation patterns. By presenting models with sentence-label pairs where labels correspond to sentence-elicited internal activations along specific directions in the neural representation space, we demonstrate that LLMs can learn to report and control these activations. The performance varies with several factors: the number of example pairs provided, the semantic interpretability of the target neural direction, and the variance explained by that direction. These results reveal a “metacognitive space” with dimensionality much lower than the model’s neural space, suggesting LLMs can monitor only a subset of their neural mechanisms. Our findings provide empirical evidence quantifying metacognitive capabilities in LLMs, with significant implications for AI safety.

[NLP-160] Simulation Agent : A Framework for Integrating Simulation and Large Language Models for Enhanced Decision-Making

【Quick Read】: This paper addresses two complementary gaps: simulation systems are too complex for non-technical users, and large language models (LLMs) lack structured, causal understanding. The key to the solution is a simulation agent framework that combines the strengths of both: the conversational ability of LLMs gives users seamless access to sophisticated simulation systems, while the simulations ground the LLMs in accurate, structured representations of real-world phenomena, yielding a robust and generalizable basis for empirical validation.

Link: https://arxiv.org/abs/2505.13761
Authors: Jacob Kleiman, Kevin Frank, Sindy Campagna
Institutions: PwC US
Subjects: Computation and Language (cs.CL)

Abstract:Simulations, although powerful in accurately replicating real-world systems, often remain inaccessible to non-technical users due to their complexity. Conversely, large language models (LLMs) provide intuitive, language-based interactions but can lack the structured, causal understanding required to reliably model complex real-world dynamics. We introduce our simulation agent framework, a novel approach that integrates the strengths of both simulation models and LLMs. This framework helps empower users by leveraging the conversational capabilities of LLMs to interact seamlessly with sophisticated simulation systems, while simultaneously utilizing the simulations to ground the LLMs in accurate and structured representations of real-world phenomena. This integrated approach helps provide a robust and generalizable foundation for empirical validation and offers broad applicability across diverse domains.

[NLP-161] LLM -Based Compact Reranking with Document Features for Scientific Retrieval

【Quick Read】: This paper addresses the challenges of large language model (LLM) listwise reranking in the scientific domain: first-stage retrieval is often suboptimal, so relevant documents are ranked low, and conventional rerankers place the full text of candidates in the context window, limiting how many candidates can be considered. The key to the solution is CoRank, a training-free, model-agnostic reranking framework that proceeds in three stages: offline extraction of document-level features, coarse reranking over compact semantic representations, and fine-grained reranking over the full texts of the top candidates from the previous stage, widening candidate coverage while retaining the details needed for precise ranking.

Link: https://arxiv.org/abs/2505.13757
Authors: Runchu Tian, Xueqiang Xu, Bowen Jin, SeongKu Kang, Jiawei Han
Institutions: University of Illinois Urbana-Champaign; Korea University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 17 pages, 4 figures

Abstract:Scientific retrieval is essential for advancing academic discovery. Within this process, document reranking plays a critical role by refining first-stage retrieval results. However, large language model (LLM) listwise reranking faces unique challenges in the scientific domain. First-stage retrieval is often suboptimal in the scientific domain, so relevant documents are ranked lower. Moreover, conventional listwise reranking uses the full text of candidate documents in the context window, limiting the number of candidates that can be considered. As a result, many relevant documents are excluded before reranking, which constrains overall retrieval performance. To address these challenges, we explore compact document representations based on semantic features such as categories, sections, and keywords, and propose a training-free, model-agnostic reranking framework for scientific retrieval called CoRank. The framework involves three stages: (i) offline extraction of document-level features, (ii) coarse reranking using these compact representations, and (iii) fine-grained reranking on full texts of the top candidates from stage (ii). This hybrid design provides a high-level abstraction of document semantics, expands candidate coverage, and retains critical details required for precise ranking. Experiments on LitSearch and CSFCube show that CoRank significantly improves reranking performance across different LLM backbones, increasing nDCG@10 from 32.0 to 39.7. Overall, these results highlight the value of information extraction for reranking in scientific retrieval.
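A compact sketch of the three-stage pipeline; `compact_rerank` and `full_rerank` are hypothetical stand-ins for LLM listwise rerankers, and the stage (i) features are assumed to be precomputed offline into each document record:

```python
def corank(query, docs, compact_rerank, full_rerank, coarse_k=100, fine_k=10):
    """CoRank-style hybrid reranking (a sketch, not the released code).
    Each doc is a dict with precomputed stage-(i) features (categories,
    sections, keywords) under "features" and the full text under "text".
    The rerank callables take (query, texts) and return ranked indices."""
    # Stage (ii): coarse rerank on compact representations - cheap enough
    # to cover far more candidates than full texts would allow.
    coarse_order = compact_rerank(query, [d["features"] for d in docs])
    shortlist = [docs[i] for i in coarse_order[:coarse_k]]
    # Stage (iii): fine-grained rerank on the full texts of the survivors.
    fine_order = full_rerank(query, [d["text"] for d in shortlist])
    return [shortlist[i] for i in fine_order[:fine_k]]
```

The design trades context-window budget for coverage: compact features let many more candidates enter the coarse pass, and only a small shortlist pays the full-text cost.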

[NLP-162] Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training

【Quick Read】: This paper addresses the efficient tuning of hyperparameters (HPs) for large language model (LLM) pre-training, in particular the learning rate and weight decay. The key to the solution is a set of scaling laws relating HPs to model size N, dataset size D, and batch size B: a precise power law in the tokens-per-parameter ratio D/N predicts the optimal weight decay λ_opt ahead of large-scale training, and further scaling laws for the optimal batch size B_opt and the critical batch size B_crit inform Pareto-optimal choices of model and dataset size under joint training-time and compute objectives.

Link: https://arxiv.org/abs/2505.13738
Authors: Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
Institutions: Cerebras Systems
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate η and weight decay λ. We study scaling laws for HPs: formulas for how to scale HPs as we scale model size N, dataset size D, and batch size B. Recent work suggests the AdamW timescale, B/(ηλD), should remain constant across training settings, and we verify the implication that optimal λ scales linearly with B, for fixed N, D. However, as N, D scale, we show the optimal timescale obeys a precise power law in the tokens-per-parameter ratio, D/N. This law thus provides a method to accurately predict λ_opt in advance of large-scale training. We also study scaling laws for optimal batch size B_opt (the B enabling lowest loss at a given N, D) and critical batch size B_crit (the B beyond which further data parallelism becomes ineffective). In contrast with prior work, we find both B_opt and B_crit scale as power laws in D, independent of model size, N. Finally, we analyze how these findings inform the real-world selection of Pareto-optimal N and D under dual training time and compute objectives.
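A sketch of how such a law is applied in practice: fix the AdamW timescale τ = B/(ηλD) at its predicted optimum and solve for the weight decay. The fit constants below are placeholders, not the paper's values:

```python
def predict_weight_decay(eta, batch_tokens, dataset_tokens, model_params,
                         a=1.0, alpha=0.5):
    """Predict lambda_opt from the AdamW-timescale power law (a sketch).
    Relation from the abstract: tau = B / (eta * lambda * D), with the optimal
    timescale following a power law in tokens-per-parameter,
    tau_opt = a * (D/N)**alpha. Here a and alpha are placeholder constants
    standing in for fitted values."""
    tau_opt = a * (dataset_tokens / model_params) ** alpha
    return batch_tokens / (eta * dataset_tokens * tau_opt)

# Illustrative call: eta = 3e-4, B = 2M tokens, D = 1e12 tokens, N = 7e9 params.
print(predict_weight_decay(3e-4, 2_000_000, 1e12, 7e9))
```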

[NLP-163] SQLForge: Synthesizing Reliable and Diverse Data to Enhance Text-to-SQL Reasoning in LLM s ACL

【Quick Read】: This paper addresses the significant performance gap in text-to-SQL reasoning between open-source large language models (LLMs) and their closed-source counterparts. The key to the solution is SQLForge, which improves data reliability through SQL syntax constraints and SQL-to-question reverse translation, and improves data diversity through SQL template enrichment and an iterative data-domain exploration mechanism, thereby strengthening LLMs' text-to-SQL reasoning.

Link: https://arxiv.org/abs/2505.13725
Authors: Yu Guo, Dong Jin, Shenghao Ye, Shuangwu Chen, Jian Yang, Xiaobin Tan
Institutions: University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 7 figures, accepted to ACL Findings 2025

Abstract:Large Language models (LLMs) have demonstrated significant potential in text-to-SQL reasoning tasks, yet a substantial performance gap persists between existing open-source models and their closed-source counterparts. In this paper, we introduce SQLForge, a novel approach for synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs. We improve data reliability through SQL syntax constraints and SQL-to-question reverse translation, ensuring data logic at both structural and semantic levels. We also propose an SQL template enrichment and iterative data domain exploration mechanism to boost data diversity. Building on the augmented data, we fine-tune a variety of open-source models with different architectures and parameter sizes, resulting in a family of models termed SQLForge-LM. SQLForge-LM achieves the state-of-the-art performance on the widely recognized Spider and BIRD benchmarks among the open-source models. Specifically, SQLForge-LM achieves EX accuracy of 85.7% on Spider Dev and 59.8% on BIRD Dev, significantly narrowing the performance gap with closed-source methods.
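One way to realize a "SQL syntax constraint" style reliability filter is to compile each generated query against the target schema in an in-memory SQLite database. This is a sketch of the general idea, not SQLForge's actual filter:

```python
import sqlite3

def is_valid_sql(query: str, schema_ddl: str) -> bool:
    """Cheap validity gate for synthesized SQL (illustrative): create the schema
    in an in-memory SQLite database and ask the engine to plan the query.
    EXPLAIN parses and plans without executing, so it is fast and side-effect free."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute(f"EXPLAIN {query}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

print(is_valid_sql("SELECT name FROM users WHERE age > 30",
                   "CREATE TABLE users (name TEXT, age INT);"))   # True
print(is_valid_sql("SELEC name FROM users",
                   "CREATE TABLE users (name TEXT);"))            # False
```

Filtering synthetic pairs this way discards syntactically broken SQL before it can pollute fine-tuning data; semantic checks (does the query answer the question?) are what the reverse-translation step then addresses.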

[NLP-164] Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings

【Quick Read】: This paper addresses the challenge of building reasoning-capable large language models (LLMs) when training data is scarce. Conventional recipes rely on plentiful high-quality data, either reinforcement learning with verifiable rewards (RLVR) or distillation of carefully curated long chains of thought (CoT), which limits them under data scarcity. The key to the proposed solution is a sample-efficient two-stage training strategy: the first stage "warms up" the model by distilling long CoTs from a toy domain (Knights & Knaves logic puzzles) to instill general reasoning skills; the second stage applies RLVR to the warmed-up model with only a small number of target-domain examples. This improves both reasoning performance and sample efficiency under data-scarce conditions.

Link: https://arxiv.org/abs/2505.13718
Authors: Safal Shrestha, Minwu Kim, Aadim Nepal, Anubhav Shrestha, Keith Ross
Institutions: New York University Abu Dhabi
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Designing effective reasoning-capable LLMs typically requires training using Reinforcement Learning with Verifiable Rewards (RLVR) or distillation with carefully curated Long Chains of Thought (CoT), both of which depend heavily on extensive training data. This creates a major challenge when the amount of quality training data is scarce. We propose a sample-efficient, two-stage training strategy to develop reasoning LLMs under limited supervision. In the first stage, we "warm up" the model by distilling Long CoTs from a toy domain, namely Knights & Knaves (K&K) logic puzzles, to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-phase approach offers several benefits: (i) the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval+, and MMLU-Pro; (ii) when both the base model and the warmed-up model are RLVR-trained on the same small dataset (≤ 100 examples), the warmed-up model consistently outperforms the base model; (iii) warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) introducing warmup in the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.

[NLP-165] Are Large Language Models Good at Detecting Propaganda?

【Quick Read】: This paper addresses the identification of propaganda techniques in news articles, helping users recognize manipulative content built on logical fallacies and emotional appeals. The key to the solution is detecting propaganda techniques with large language models (LLMs) and transformer-based models and evaluating their performance; the results show that although GPT-4 beats the other LLMs on some metrics, it still falls short of a RoBERTa-CRF baseline, indicating that conventional models remain competitive on this task.

Link: https://arxiv.org/abs/2505.13706
Authors: Julia Jose, Rachel Greenstadt
Institutions: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Propagandists use rhetorical devices that rely on logical fallacies and emotional appeals to advance their agendas. Recognizing these techniques is key to making informed decisions. Recent advances in Natural Language Processing (NLP) have enabled the development of systems capable of detecting manipulative content. In this study, we look at several Large Language Models and their performance in detecting propaganda techniques in news articles. We compare the performance of these LLMs with transformer-based models. We find that, while GPT-4 demonstrates superior F1 scores (F1=0.16) compared to GPT-3.5 and Claude 3 Opus, it does not outperform a RoBERTa-CRF baseline (F1=0.67). Additionally, we find that all three LLMs outperform a MultiGranularity Network (MGN) baseline in detecting instances of one out of six propaganda techniques (name-calling), with GPT-3.5 and GPT-4 also outperforming the MGN baseline in detecting instances of appeal to fear and flag-waving.

[NLP-166] Clarifying orthography: Orthographic transparency as compressibility

【Quick Read】: This paper addresses the lack of a unified, script-agnostic metric for the relationship between spelling and sound (orthographic transparency). The key to the solution is to use ideas from algorithmic information theory: orthographic transparency is quantified as the mutual compressibility between orthographic and phonological strings, giving a principled measure that jointly accounts for irregular spellings and rule complexity.

Link: https://arxiv.org/abs/2505.13657
Authors: Charles J. Torres, Richard Futrell
Institutions: University of California, Irvine
Subjects: Computation and Language (cs.CL); Information Theory (cs.IT)

Abstract:Orthographic transparency – how directly spelling is related to sound – lacks a unified, script-agnostic metric. Using ideas from algorithmic information theory, we quantify orthographic transparency in terms of the mutual compressibility between orthographic and phonological strings. Our measure provides a principled way to combine two factors that decrease orthographic transparency, capturing both irregular spellings and rule complexity in one quantity. We estimate our transparency measure using prequential code-lengths derived from neural sequence models. Evaluating 22 languages across a broad range of script types (alphabetic, abjad, abugida, syllabic, logographic) confirms common intuitions about relative transparency of scripts. Mutual compressibility offers a simple, principled, and general yardstick for orthographic transparency.
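A crude stand-in for the measure, using zlib in place of the paper's neural prequential code-lengths (an assumption made only for illustration): mutual compressibility is approximated as the bits saved by coding pronunciations jointly with spellings.

```python
import zlib

def code_len_bits(s: str) -> int:
    """Approximate code length of a string in bits via zlib - a rough proxy for
    the neural prequential code-lengths the paper actually estimates."""
    return 8 * len(zlib.compress(s.encode("utf-8"), level=9))

def transparency(orth_words, phon_words):
    """Mutual-compressibility proxy: bits saved when pronunciations are coded
    jointly with their spellings, normalized by the phonological code length.
    Approximates I(O; P) ~ C(O) + C(P) - C(O, P)."""
    orth = "\n".join(orth_words)
    phon = "\n".join(phon_words)
    joint = "\n".join(o + "\t" + p for o, p in zip(orth_words, phon_words))
    saved = code_len_bits(orth) + code_len_bits(phon) - code_len_bits(joint)
    return saved / code_len_bits(phon)
```

Note that a general-purpose compressor only becomes informative over a lexicon of thousands of word pairs; on toy inputs the compressor's header overhead dominates, which is one reason the paper estimates code lengths with neural sequence models instead.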

[NLP-167] Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents ICML

【Quick Read】: This paper addresses the difficulty large language models (LLMs) have in maintaining consistent performance across multiple solution attempts on complex multi-step tasks such as mathematical reasoning and agentic software engineering. The key to the solution is a pair of search strategies guided by a learned action-value estimator, 1-step lookahead and trajectory selection, which explore multiple solution paths effectively in non-serializable reinforcement-learning environments such as Docker containers, roughly doubling the models' average success rate.

Link: https://arxiv.org/abs/2505.13652
Authors: Karina Zainullina, Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Daria Litvintseva, Simon Karasik, Filipp Fisin, Sergei Skvortsov, Maksim Nekrashevich, Anton Shevtsov, Boris Yangel
Institutions: unknown
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: ICML

Abstract:Large language models (LLMs) have recently achieved remarkable results in complex multi-step tasks, such as mathematical reasoning and agentic software engineering. However, they often struggle to maintain consistent performance across multiple solution attempts. One effective approach to narrow the gap between average-case and best-case performance is guided test-time search, which explores multiple solution paths to identify the most promising one. Unfortunately, effective search techniques (e.g. MCTS) are often unsuitable for non-serializable RL environments, such as Docker containers, where intermediate environment states cannot be easily saved and restored. We investigate two complementary search strategies applicable to such environments: 1-step lookahead and trajectory selection, both guided by a learned action-value function estimator. On the SWE-bench Verified benchmark, a key testbed for agentic software engineering, we find these methods to double the average success rate of a fine-tuned Qwen-72B model, achieving 40.8%, the new state-of-the-art for open-weights models. Additionally, we show that these techniques are transferable to more advanced closed models, yielding similar improvements with GPT-4o.
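A schematic of the two strategies; every callable here (action proposer, executor, value estimator) is a hypothetical stand-in, not the paper's code:

```python
def one_step_lookahead(state, propose_actions, value_fn, execute, n_candidates=4):
    """Value-guided 1-step lookahead (a sketch): sample candidate actions from
    the policy, score each with a learned action-value estimator Q(s, a), and
    execute only the best one. Because scoring never steps the environment,
    no state save/restore is needed, which is what makes this viable in
    non-serializable environments such as Docker containers."""
    candidates = propose_actions(state, n=n_candidates)
    best_action = max(candidates, key=lambda a: value_fn(state, a))
    return execute(state, best_action)

def select_trajectory(trajectories, value_fn):
    """Trajectory selection (a sketch): run several independent rollouts to
    completion, then return the one whose final state-action pair the value
    estimator scores highest. trajectories: list of [(state, action), ...]."""
    return max(trajectories, key=lambda tr: value_fn(*tr[-1]))
```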

[NLP-168] Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning ACL2025

【Quick Read】: This paper addresses multilingual sentence-representation alignment, which has traditionally relied on bitexts to bridge the gap between languages. The key to the solution is using visual information instead: image-caption datasets can be created efficiently without multilingual expertise, making them especially attractive for low-resource languages. Multilingual image-caption alignment implicitly aligns text representations across languages, languages unseen during encoder pretraining can be folded into the alignment post-hoc, and the aligned representations are usable for cross-lingual natural language understanding (NLU) and bitext retrieval.

Link: https://arxiv.org/abs/2505.13628
Authors: Nathaniel Krasner, Nicholas Lanuzo, Antonios Anastasopoulos
Institutions: George Mason University
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Main Conference

Abstract:Multilingual alignment of sentence representations has mostly required bitexts to bridge the gap between languages. We investigate whether visual information can bridge this gap instead. Image caption datasets are very easy to create without requiring multilingual expertise, so this offers a more efficient alternative for low-resource languages. We find that multilingual image-caption alignment can implicitly align the text representations between languages, languages unseen by the encoder in pretraining can be incorporated into this alignment post-hoc, and these aligned representations are usable for cross-lingual Natural Language Understanding (NLU) and bitext retrieval.

[NLP-169] RAR: Setting Knowledge Tripwires for Retrieval Augmented Rejection

【Quick Read】: This paper addresses content moderation for large language models (LLMs), which calls for flexible, adaptable solutions that respond quickly to emerging threats. The key to the solution is Retrieval Augmented Rejection (RAR), which uses a retrieval-augmented generation (RAG) architecture to reject unsafe user queries dynamically without any model retraining: malicious documents are inserted into the vector database and marked, and the system identifies and rejects harmful requests whenever those documents are retrieved, enabling flexible real-time customization.

Link: https://arxiv.org/abs/2505.13581
Authors: Tommaso Mario Buonocore, Enea Parimbelli
Institutions: unknown
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 7 pages, 4 figures, 2 tables

Abstract:Content moderation for large language models (LLMs) remains a significant challenge, requiring flexible and adaptable solutions that can quickly respond to emerging threats. This paper introduces Retrieval Augmented Rejection (RAR), a novel approach that leverages a retrieval-augmented generation (RAG) architecture to dynamically reject unsafe user queries without model retraining. By strategically inserting and marking malicious documents into the vector database, the system can identify and reject harmful requests when these documents are retrieved. Our preliminary results show that RAR achieves comparable performance to embedded moderation in LLMs like Claude 3.5 Sonnet, while offering superior flexibility and real-time customization capabilities, a fundamental feature to timely address critical vulnerabilities. This approach introduces no architectural changes to existing RAG systems, requiring only the addition of specially crafted documents and a simple rejection mechanism based on retrieval results.
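A minimal sketch of the tripwire mechanism; the retriever interface and the metadata flag are assumptions for illustration, not the paper's implementation:

```python
def rar_guard(query, retriever, top_k=5, threshold=1):
    """Retrieval Augmented Rejection (sketch): the vector store is seeded with
    marked malicious 'tripwire' documents; if enough of them surface in the
    top-k results for a query, reject before any generation happens.
    `retriever` is a hypothetical callable returning dicts with a metadata field."""
    hits = retriever(query, k=top_k)
    tripped = sum(1 for d in hits
                  if d.get("metadata", {}).get("malicious", False))
    if tripped >= threshold:
        return None, "Request rejected by retrieval-augmented moderation."
    return hits, None   # clean context is passed on to the usual RAG pipeline
```

Updating moderation then amounts to inserting or deleting marked documents, which is what gives the approach its real-time customizability.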

[NLP-170] CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models

【Quick Read】: This paper addresses how well large language models (LLMs) understand code-switched (CS) dialogue, evaluated by summarizing CS dialogues into English. The key to the solution is CS-Sum, the first benchmark for CS dialogue summarization across Mandarin-English (EN-ZH), Tamil-English (EN-TA), and Malay-English (EN-MS), with 900-1300 human-annotated dialogues per language pair, together with a systematic evaluation of few-shot, translate-then-summarize, and fine-tuning (LoRA, QLoRA on synthetic data) approaches across ten LLMs. The study also catalogues the three most common error types LLMs make on code-switched input and how their frequency varies across language pairs and models.

Link: https://arxiv.org/abs/2505.13559
Authors: Sathya Krishnan Suresh, Tanmay Surana, Lim Zhi Hao, Eng Siong Chng
Institutions: Nanyang Technological University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures and 11 tables

Abstract:Code-switching (CS) poses a significant challenge for Large Language Models (LLMs), yet its comprehensibility by LLMs remains underexplored. We introduce CS-Sum to evaluate LLMs' comprehension of CS through CS-dialogue-to-English summarization. CS-Sum is the first benchmark for CS dialogue summarization across Mandarin-English (EN-ZH), Tamil-English (EN-TA), and Malay-English (EN-MS), with 900-1300 human-annotated dialogues per language pair. Evaluating ten LLMs, including open and closed-source models, we analyze performance across few-shot, translate-summarize, and fine-tuning (LoRA, QLoRA on synthetic data) approaches. Our findings show that although scores on automated metrics are high, LLMs make subtle mistakes that alter the complete meaning of the dialogue. To this end, we identify the three most common types of errors that LLMs make when handling CS input. Error rates vary across CS pairs and LLMs, with some LLMs showing more frequent errors on certain language pairs, underscoring the need for specialized training on code-switched data.

[NLP-171] Combining the Best of Both Worlds: A Method for Hybrid NMT and LLM Translation ACL2025

【Quick Read】: This paper addresses the high computational cost and latency of using large language models (LLMs) for machine translation. The evaluation shows that in most cases LLM translations are comparable to those of neural machine translation (NMT) systems, with each side holding an edge only in particular scenarios, so combining NMT with an LLM and invoking the LLM only when necessary is a sound design. The key is a scheduling policy that optimizes translation quality while guaranteeing speed and minimizing LLM usage; the paper compares several policies and proposes a simple, novel decider based on source-sentence features, validated on multilingual test sets.

Link: https://arxiv.org/abs/2505.13554
Authors: Zhanglin Wu, Daimeng Wei, Xiaoyu Chen, Hengchao Shang, Jiaxin Guo, Zongyao Li, Yuanchang Luo, Jinlong Yang, Zhiqiang Rao, Hao Yang
Institutions: Huawei Translation Service Center
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures, 9 tables, ACL 2025

Abstract:Large language models (LLMs) show promising performance on a variety of downstream tasks, such as machine translation (MT). However, using LLMs for translation suffers from high computational costs and significant latency. Based on our evaluation, in most cases translations produced by LLMs are comparable to those generated by neural machine translation (NMT) systems; only in particular scenarios do LLMs and NMT models show their respective advantages. As a result, integrating NMT and LLMs for translation and using the LLM only when necessary seems to be a sound solution. A scheduling policy that optimizes translation quality while ensuring fast speed and as little LLM usage as possible is thereby required. We compare several scheduling policies and propose a novel and straightforward decider that leverages source-sentence features. We conduct extensive experiments on multilingual test sets, and the results show that we can achieve optimal translation performance with minimal LLM usage, demonstrating the effectiveness of our decider.
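A sketch of the routing idea; the feature set and the decider are illustrative assumptions, not the paper's exact decider:

```python
def route_translation(src, nmt_translate, llm_translate, decider):
    """Hybrid NMT/LLM scheduler (sketch): a lightweight decider over
    source-sentence features routes each sentence to NMT by default and to
    the LLM only when predicted necessary. `decider` is a hypothetical
    classifier returning True when the LLM is expected to help."""
    feats = {
        "length": len(src.split()),
        "has_digits": any(c.isdigit() for c in src),
        "non_ascii_ratio": sum(not c.isascii() for c in src) / max(len(src), 1),
    }
    return llm_translate(src) if decider(feats) else nmt_translate(src)
```

Because the decider only sees cheap source-side features, its cost is negligible next to either translation path, so total latency stays close to pure NMT whenever the LLM is rarely needed.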

[NLP-172] AdAEM: An Adaptively and Automated Extensible Measurement of LLM s Value Difference

【Quick Read】: This paper addresses the informativeness problem of current value-measurement datasets for large language models (LLMs): test questions that are outdated, contaminated, or generic capture only the value orientations shared across LLMs, yielding saturated, uninformative results. The key to the solution is AdAEM, a self-extensible framework that automatically and adaptively generates and extends its test questions by probing the internal value boundaries of diverse LLMs developed across cultures and time periods via in-context optimization; the process maximizes an information-theoretic objective to surface recent or culturally controversial topics, producing more distinguishable and informative analyses of model value differences.

Link: https://arxiv.org/abs/2505.13531
Authors: Shitong Duan, Xiaoyuan Yi, Peng Zhang, Dongkuan Xu, Jing Yao, Tun Lu, Ning Gu, Xing Xie
Institutions: Fudan University; Microsoft Research Asia; North Carolina State University
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Assessing Large Language Models (LLMs)’ underlying value differences enables comprehensive comparison of their misalignment, cultural adaptability, and biases. Nevertheless, current value measurement datasets face the informativeness challenge: with often outdated, contaminated, or generic test questions, they can only capture the shared value orientations among different LLMs, leading to saturated and thus uninformative results. To address this problem, we introduce AdAEM, a novel, self-extensible assessment framework for revealing LLMs’ inclinations. Distinct from previous static benchmarks, AdAEM can automatically and adaptively generate and extend its test questions. This is achieved by probing the internal value boundaries of a diverse set of LLMs developed across cultures and time periods in an in-context optimization manner. The optimization process theoretically maximizes an information-theoretic objective to extract the latest or culturally controversial topics, providing more distinguishable and informative insights about models’ value differences. In this way, AdAEM is able to co-evolve with the development of LLMs, consistently tracking their value dynamics. Using AdAEM, we generate 12,310 questions grounded in Schwartz Value Theory, conduct an extensive analysis to manifest our method’s validity and effectiveness, and benchmark the values of 16 LLMs, laying the groundwork for better value research.

[NLP-173] BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs

【Quick Read】: This paper addresses the tendency of Large Reasoning Models (LRMs) to give overconfident, incorrect answers on mathematical and logical reasoning tasks because they rarely acknowledge uncertainty. The key to the solution is the BARREL framework, which promotes concise, boundary-aware factual reasoning and thereby improves reliability: BARREL training raises the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48% while keeping accuracy comparable to models fine-tuned on R1-generated reasoning data.

Link: https://arxiv.org/abs/2505.13529
Authors: Junxiao Yang, Jinzhe Tu, Haoran Liu, Xiaoce Wang, Chujie Zheng, Zhexin Zhang, Shiyao Cui, Caishun Chen, Tiantian He, Hongning Wang, Yew-Soon Ong, Minlie Huang
Institutions: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning. However, current LRMs rarely admit ignorance or respond with "I don't know". Instead, they often produce incorrect answers while showing undue confidence, raising concerns about their factual reliability. In this work, we identify two pathological reasoning patterns characterized by overthinking that contribute to the overconfident and incorrect answers: last-minute guessing and second-thought spiraling. To address these issues, we propose BARREL, a novel framework that promotes concise and boundary-aware factual reasoning. Our experiments show that BARREL training increases the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48%, while still achieving accuracy comparable to models finetuned on reasoning data generated by R1. These results demonstrate that our pilot study is an encouraging step toward building more reliable and factual System 2 LRMs.

[NLP-174] Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression

【Quick Read】: This paper addresses the continued vulnerability of large language models (LLMs) to jailbreak attacks despite alignment with human values. The key to the solution is LogiBreak, a novel, universal black-box jailbreak method whose core idea is to translate harmful natural-language prompts into formal logical expressions, exploiting the distributional gap between alignment data and logic-based inputs to evade safety constraints while preserving semantic intent and readability.

Link: https://arxiv.org/abs/2505.13527
Authors: Jingyu Peng, Maolin Wang, Nan Wang, Xiangyu Zhao, Jiatong Li, Kai Zhang, Qi Liu
Institutions: University of Science and Technology of China; City University of Hong Kong; Universiteit van Amsterdam
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Despite substantial advancements in aligning large language models (LLMs) with human values, current safety mechanisms remain susceptible to jailbreak attacks. We hypothesize that this vulnerability stems from distributional discrepancies between alignment-oriented prompts and malicious prompts. To investigate this, we introduce LogiBreak, a novel and universal black-box jailbreak method that leverages logical expression translation to circumvent LLM safety systems. By converting harmful natural language prompts into formal logical expressions, LogiBreak exploits the distributional gap between alignment data and logic-based inputs, preserving the underlying semantic intent and readability while evading safety constraints. We evaluate LogiBreak on a multilingual jailbreak dataset spanning three languages, demonstrating its effectiveness across various evaluation settings and linguistic contexts.

[NLP-175] LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades

【Quick Read】: This paper addresses the rapid obsolescence of low-rank adaptation (LoRA) weights caused by frequent large language model (LLM) updates: retraining LoRA from scratch on each new version is costly, slow, and environmentally wasteful. The key to the solution is LoRASuite, a modular method that computes a transfer matrix from the known parameters of the old and new LLMs, allocates corresponding layers via centered kernel alignment and corresponding attention heads via cosine similarity, and then performs a small-scale fine-tuning step for numerical stability, so existing LoRA weights can be adapted efficiently to new model versions.

Link: https://arxiv.org/abs/2505.13515
Authors: Yanan Li, Fanxu Meng, Muhan Zhang, Shiai Zhu, Shangguang Wang, Mengwei Xu
Institutions: Beijing University of Posts and Telecommunications; Peking University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:As Large Language Models (LLMs) are frequently updated, LoRA weights trained on earlier versions quickly become obsolete. The conventional practice of retraining LoRA weights from scratch on the latest model is costly, time-consuming, and environmentally detrimental, particularly as the diversity of LLMs and downstream tasks expands. This motivates a critical question: “How can we efficiently leverage existing LoRA weights to adapt to newer model versions?” To address this, we propose LoRASuite, a modular approach tailored specifically to various types of LLM updates. First, we compute a transfer matrix utilizing known parameters from both old and new LLMs. Next, we allocate corresponding layers and attention heads based on centered kernel alignment and cosine similarity metrics, respectively. A subsequent small-scale, skillful fine-tuning step ensures numerical stability. Experimental evaluations demonstrate that LoRASuite consistently surpasses small-scale vanilla LoRA methods. Notably, on backbone LLMs such as MiniCPM and Qwen, LoRASuite even exceeds the performance of full-scale LoRA retraining, with average improvements of +1.4 and +6.6 points on math tasks, respectively. Additionally, LoRASuite significantly reduces memory consumption by 5.5 GB and computational time by 78.23%.
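A sketch of the cosine-similarity head-matching step (assuming old and new heads share a flattened weight dimension); the paper additionally uses centered kernel alignment for layer allocation, which is omitted here:

```python
import torch
import torch.nn.functional as F

def match_heads(old_heads, new_heads):
    """Cosine-similarity attention-head allocation (illustrative): pair each
    head of the new model with its most similar head in the old model.
    Inputs are lists of per-head weight tensors of equal flattened size."""
    old = F.normalize(torch.stack([h.flatten() for h in old_heads]), dim=1)
    new = F.normalize(torch.stack([h.flatten() for h in new_heads]), dim=1)
    sim = new @ old.T                   # (num_new_heads, num_old_heads)
    return sim.argmax(dim=1)            # best-matching old-head index per new head
```

The resulting index map tells the transfer step which slice of the old LoRA update belongs to which head in the new model before the final stabilizing fine-tune.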

[NLP-176] Induction Head Toxicity Mechanistically Explains Repetition Curse in Large Language Models

【Quick Read】: This paper addresses the repetition curse in large language models (LLMs), i.e., the generation of repetitive or cyclic token sequences. The key to the solution is revealing the role of induction heads in driving this behavior, in particular their "toxicity": during repetition they come to dominate the model's output logits and crowd out the contributions of other attention heads. Identifying induction heads as the main driver yields a mechanistic explanation of the phenomenon and motivates an attention-head regularization technique that reduces their dominance during generation, encouraging more diverse and coherent outputs.

Link: https://arxiv.org/abs/2505.13514
Authors: Shuxun Wang, Qingyu Yin, Chak Tou Leong, Qiang Zhang, Linyi Yang
Institutions: Zhejiang University; The Hong Kong Polytechnic University; University College London
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The repetition curse is a phenomenon where Large Language Models (LLMs) generate repetitive or cyclic sequences of tokens. While the repetition curse has been widely observed, its underlying mechanisms remain poorly understood. In this work, we investigate the role of induction heads, a specific type of attention head known for its ability to perform in-context learning, in driving this repetitive behavior. Specifically, we focus on the "toxicity" of induction heads, which we define as their tendency to dominate the model's output logits during repetition, effectively excluding other attention heads from contributing to the generation process. Our findings have important implications for the design and training of LLMs. By identifying induction heads as a key driver of the repetition curse, we provide a mechanistic explanation for this phenomenon and suggest potential avenues for mitigation. We also propose a technique with attention-head regularization that could be employed to reduce the dominance of induction heads during generation, thereby promoting more diverse and coherent outputs.
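One plausible form of such an attention-head regularizer, penalizing concentration of output-logit mass in a few heads; this is our formulation for illustration, not the authors' exact technique:

```python
import torch

def head_dominance_penalty(per_head_logits, coeff=0.01):
    """Regularizer sketch: measure how total logit mass is distributed across
    heads and penalize low entropy, so no single (e.g. induction) head dominates.
    per_head_logits: (num_heads, vocab) - each head's additive logit contribution."""
    mass = per_head_logits.abs().sum(dim=-1)          # (num_heads,) mass per head
    p = mass / mass.sum().clamp(min=1e-8)
    entropy = -(p * (p + 1e-8).log()).sum()
    return -coeff * entropy   # added to the loss: lower entropy -> larger penalty
```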

[NLP-177] Can AI Freelancers Compete? Benchmarking Earnings Reliability and Task Success at Scale

【Quick Read】: This paper explores the potential of large language models (LLMs) as autonomous agents for real-world work, focusing on freelance software development. The key to the solution is a benchmark built from synthetic tasks derived from a Kaggle Freelancer dataset of job postings, with all prices standardized to USD; every task carries structured input-output test cases and an estimated price tag, enabling automated correctness checking and monetary valuation of performance. This keeps evaluation simple, scalable, and repeatable and makes large-scale model comparison possible.

Link: https://arxiv.org/abs/2505.13511
Authors: David Noever, Forrest McKee
Institutions: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:This study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development. This work presents a new benchmark that evaluates LLMs on freelance programming and data analysis tasks derived from economic data. We construct the benchmark using synthetic tasks created from a Kaggle Freelancer dataset of job postings, with all job prices standardized to USD (median fixed-project price around $250, and an average of $306). Each task is accompanied by structured input-output test cases and an estimated price tag, enabling automated correctness checking and a monetary performance valuation. This approach is inspired by OpenAI's recent SWE-Lancer benchmark (1,400 real Upwork tasks worth $1M total). Still, our framework simplifies evaluation using programmatically testable tasks and predicted price values, making it highly scalable and repeatable. On this benchmark, we evaluate four modern LLMs: Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral. We report each model's accuracy (task success rate and test-case pass rate) and the total "freelance earnings" it achieves (sum of prices of solved tasks). Our results show that Claude 3.5 Haiku performs best, earning approximately $1.52 million, followed closely by GPT-4o-mini at $1.49 million, then Qwen 2.5 ($1.33M) and Mistral ($0.70M). We analyze the distribution of errors per task and observe that the strongest models solve the most tasks and rarely fail completely on any project. We discuss the implications of these results for the feasibility of AI as a freelance developer, the advantages and limitations of our automated benchmark approach, and the gap between performance on structured tasks versus the true complexity of real-world freelance jobs.

[NLP-178] me-R1: Towards Comprehensive Temporal Reasoning in LLM s

【Quick Read】: This paper addresses the weakness of large language models (LLMs) in temporal intelligence: they struggle to combine reasoning about the past with prediction and plausible generation of the future. Existing methods typically target isolated temporal skills, such as question answering about past events or basic forecasting, and generalize poorly to events beyond the knowledge cutoff or tasks requiring creative foresight. The key to the solution is the Time-R1 framework, which endows a moderate-sized (3B-parameter) LLM with comprehensive temporal abilities via a novel three-stage development path: a reinforcement learning (RL) curriculum driven by a meticulously designed dynamic rule-based reward system first builds foundational temporal understanding, then future-event prediction beyond the knowledge cutoff, and finally creative future-scenario generation without any fine-tuning. Experiments show Time-R1 outperforms models over 200 times larger on challenging future-prediction and creative-generation benchmarks, demonstrating that progressive RL fine-tuning lets small, efficient models achieve superior temporal performance.

Link: https://arxiv.org/abs/2505.13508
Authors: Zijia Liu, Peixuan Han, Haofei Yu, Haoru Li, Jiaxuan You
Institutions: University of Illinois at Urbana-Champaign
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) demonstrate impressive capabilities but lack robust temporal intelligence, struggling to integrate reasoning about the past with predictions and plausible generations of the future. Meanwhile, existing methods typically target isolated temporal skills, such as question answering about past events or basic forecasting, and exhibit poor generalization, particularly when dealing with events beyond their knowledge cutoff or requiring creative foresight. To address these limitations, we introduce Time-R1, the first framework to endow a moderate-sized (3B-parameter) LLM with comprehensive temporal abilities: understanding, prediction, and creative generation. Our approach features a novel three-stage development path; the first two constitute a reinforcement learning (RL) curriculum driven by a meticulously designed dynamic rule-based reward system. This framework progressively builds (1) foundational temporal understanding and logical event-time mappings from historical data, (2) future event prediction skills for events beyond its knowledge cutoff, and finally (3) enables remarkable generalization to creative future scenario generation without any fine-tuning. Strikingly, experiments demonstrate that Time-R1 outperforms models over 200 times larger, including the state-of-the-art 671B DeepSeek-R1, on highly challenging future event prediction and creative scenario generation benchmarks. This work provides strong evidence that thoughtfully engineered, progressive RL fine-tuning allows smaller, efficient models to achieve superior temporal performance, offering a practical and scalable path towards truly time-aware AI. To foster further research, we also release Time-Bench, a large-scale multi-task temporal reasoning dataset derived from 10 years of news data, and our series of Time-R1 checkpoints.

[NLP-179] EcoSafeRAG : Efficient Security through Context Analysis in Retrieval-Augmented Generation

【速读】: 该论文旨在解决Retrieval-Augmented Generation (RAG)系统在引入外部知识以提升生成内容的准确性和上下文相关性的同时,所面临的新型安全威胁,尤其是语料库污染(corpus poisoning)问题。其解决方案的关键在于EcoSafeRAG采用句子级处理和诱饵引导的上下文多样性检测机制,通过分析候选文档的上下文多样性来识别恶意内容,而无需依赖大型语言模型(LLM)的内部知识。

链接: https://arxiv.org/abs/2505.13506
作者: Ruobing Yao,Yifei Zhang,Shuang Song,Neng Gao,Chenyang Tu
机构: Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所); School of Cybersecurity, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院); State Key Laboratory of Cyberspace Security Defense(网络空间安全防护国家重点实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) compensates for the static knowledge limitations of Large Language Models (LLMs) by integrating external knowledge, producing responses with enhanced factual correctness and query-specific contextualization. However, it also introduces new attack surfaces, such as corpus poisoning. Most existing defense methods rely on the internal knowledge of the model, which conflicts with the design concept of RAG. To bridge this gap, EcoSafeRAG uses sentence-level processing and bait-guided context diversity detection to identify malicious content by analyzing the context diversity of candidate documents, without relying on LLM internal knowledge. Experiments show EcoSafeRAG delivers state-of-the-art security with plug-and-play deployment, simultaneously improving clean-scenario RAG performance while maintaining practical operational costs (relatively 1.2× latency, 48%-80% token reduction versus Vanilla RAG).
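
The paper's bait-guided mechanism is more involved, but as a rough, hypothetical proxy for "context diversity", one could score each candidate document by the mean pairwise cosine distance of its sentence embeddings and flag statistical outliers relative to the retrieved pool. The sketch below assumes sentence embeddings are already computed; it is our approximation, not the paper's detector.

```python
import numpy as np

def diversity_score(sent_emb: np.ndarray) -> float:
    """Mean pairwise cosine distance between one document's sentences."""
    e = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    sim = e @ e.T
    off_diag = sim[~np.eye(len(e), dtype=bool)]
    return float(1.0 - off_diag.mean())

def flag_outliers(doc_scores, z_thresh: float = 2.0) -> np.ndarray:
    """Flag documents whose diversity deviates strongly from the pool."""
    s = np.asarray(doc_scores, dtype=float)
    z = (s - s.mean()) / (s.std() + 1e-8)
    return np.abs(z) > z_thresh

rng = np.random.default_rng(0)
docs = [rng.normal(size=(8, 384)) for _ in range(5)]  # stand-in embeddings
scores = [diversity_score(d) for d in docs]
print(flag_outliers(scores))
```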
zh

[NLP-180] Noise Injection Systemically Degrades Large Language Model Safety Guardrails

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中安全防护机制在面对扰动时的鲁棒性不足的问题。研究通过系统性地向模型激活值中注入高斯噪声,评估了安全微调(safety fine-tuning)的稳健性,发现高斯噪声可显著提升有害输出率(p < 0.001,最高达27%),且更深层次的安全微调并未提供额外保护,而链式思维推理(chain-of-thought reasoning)仍保持相对完整。该研究揭示了当前安全对齐技术的关键漏洞,并指出基于推理和强化学习的方法可能是提升AI安全系统鲁棒性的潜在方向。

链接: https://arxiv.org/abs/2505.13500
作者: Prithviraj Singh Shahani,Matthias Scheutz
机构: Tufts University (塔夫茨大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:Safety guardrails in large language models (LLMs) are a critical component in preventing harmful outputs. Yet, their resilience under perturbation remains poorly understood. In this paper, we investigate the robustness of safety fine-tuning in LLMs by systematically injecting Gaussian noise into model activations. We show across multiple open-weight models that (1) Gaussian noise raises harmful-output rates (p < 0.001) by up to 27%, (2) that deeper safety fine-tuning affords no extra protection, and (3) that chain-of-thought reasoning remains largely intact. The findings reveal critical vulnerabilities in current safety alignment techniques and highlight the potential of reasoning-based and reinforcement learning approaches as a promising direction for developing more robust AI safety systems. These results have important implications for the real-world deployment of LLMs in safety-critical applications, as they imply that widely-deployed safety tuning methods can fail even without adversarial prompts.
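
The perturbation itself is easy to reproduce in spirit. Below is a hedged sketch (ours, not the paper's code) that injects Gaussian noise into every transformer block's output of a Hugging Face causal LM via forward hooks; the `model.transformer.h` layer path is GPT-2-specific and the noise scale is an arbitrary choice.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder open-weight model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def gaussian_noise_hook(sigma: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        noisy = hidden + sigma * torch.randn_like(hidden)
        return (noisy,) + output[1:] if isinstance(output, tuple) else noisy
    return hook

# Perturb every transformer block's output (layer path is GPT-2-specific).
handles = [blk.register_forward_hook(gaussian_noise_hook(sigma=0.05))
           for blk in model.transformer.h]

ids = tok("The safest way to respond is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=30, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # detach hooks to restore the clean model
```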
zh

[NLP-181] IRLBench: A Multi-modal Culturally Grounded Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多语言和低资源环境下的性能评估不足问题,特别是现有基准测试存在文化偏见、仅限文本评估、依赖选择题形式以及对极端低资源语言支持有限等缺陷。其解决方案的关键在于引入IRLBench,这是一个以英语和爱尔兰语并行呈现的基准测试,爱尔兰语被联合国教科文组织列为濒危语言。IRLBench由2024年爱尔兰毕业考试的12个代表性科目组成,通过长文本生成任务并结合官方评分标准,实现了对模型正确性和语言保真度的全面评估。

链接: https://arxiv.org/abs/2505.13498
作者: Khanh-Tung Tran,Barry O’Sullivan,Hoang D. Nguyen
机构: University College Cork (爱尔兰科克大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have demonstrated promising knowledge and reasoning abilities, yet their performance in multilingual and low-resource settings remains underexplored. Existing benchmarks often exhibit cultural bias, restrict evaluation to text-only formats, rely on multiple-choice questions, and, more importantly, offer limited coverage of extremely low-resource languages. To address these gaps, we introduce IRLBench, presented in parallel English and Irish, the latter of which is classified as definitely endangered by UNESCO. Our benchmark consists of 12 representative subjects developed from the 2024 Irish Leaving Certificate exams, enabling fine-grained analysis of model capabilities across domains. By framing the task as long-form generation and leveraging the official marking scheme, it supports a comprehensive evaluation not only of correctness but also of language fidelity. Our extensive experiments on leading closed-source and open-source LLMs reveal a persistent performance gap between English and Irish: models produce valid Irish responses less than 80% of the time, and answer correctly 55.8% of the time compared to 76.2% in English for the best-performing model. We release IRLBench (this https URL) and an accompanying evaluation codebase (this https URL) to enable future research on robust, culturally aware multilingual AI development.
zh

[NLP-182] LLM4CD: Leveraging Large Language Models for Open-World Knowledge Augmented Cognitive Diagnosis

【速读】: 该论文试图解决传统认知诊断(Cognitive Diagnosis, CD)方法在建模学生、练习题和知识概念时仅依赖ID关系而忽视教育数据空间中丰富的语义关系的问题,以及智能辅导系统(Intelligent Tutoring Systems, ITS)在处理新增学生和练习题时面临的冷启动问题。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)的开放世界知识,构建具有认知表达能力的文本表示,并通过创新的双层编码框架——宏观层面的认知文本编码器和微观层面的知识状态编码器,将语义信息引入CD任务,从而以语义表示替代传统的ID嵌入,提升模型对新样本的适应能力。

链接: https://arxiv.org/abs/2505.13492
作者: Weiming Zhang,Lingyue Fu,Qingyao Li,Kounianhua Du,Jianghao Lin,Jingwei Yu,Wei Xia,Weinan Zhang,Ruiming Tang,Yong Yu
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cognitive diagnosis (CD) plays a crucial role in intelligent education, evaluating students’ comprehension of knowledge concepts based on their test histories. However, current CD methods often model students, exercises, and knowledge concepts solely on their ID relationships, neglecting the abundant semantic relationships present within educational data space. Furthermore, contemporary intelligent tutoring systems (ITS) frequently involve the addition of new students and exercises, a situation that ID-based methods find challenging to manage effectively. The advent of large language models (LLMs) offers the potential for overcoming this challenge with open-world knowledge. In this paper, we propose LLM4CD, which Leverages Large Language Models for Open-World Knowledge Augmented Cognitive Diagnosis. Our method utilizes the open-world knowledge of LLMs to construct cognitively expressive textual representations, which are then encoded to introduce rich semantic information into the CD task. Additionally, we propose an innovative bi-level encoder framework that models students’ test histories through two levels of encoders: a macro-level cognitive text encoder and a micro-level knowledge state encoder. This approach substitutes traditional ID embeddings with semantic representations, enabling the model to accommodate new students and exercises with open-world knowledge and address the cold-start problem. Extensive experimental results demonstrate that our proposed method consistently outperforms previous CD models on multiple real-world datasets, validating the effectiveness of leveraging LLMs to introduce rich semantic information into the CD task.
zh

[NLP-183] ProdRev: A DNN framework for empowering customers using generative pre-trained transformers

【速读】: 该论文试图解决消费者在面对大量产品评论时产生的决策瘫痪问题(decision paralysis),尤其是在疫情后电子商务使用偏好加速的背景下。其解决方案的关键在于提出一个框架,通过微调生成式预训练变换器(Generative Pre-trained Transformer, GPT)模型来更好地理解评论,并引入“常识”(common-sense)以提升决策质量。该框架采用抽象摘要(abstractive summarization)方法,而非简单的抽取式摘要,从而揭示评论之间的真正关系,帮助用户快速获取产品的优缺点,做出更明智的决策。

链接: https://arxiv.org/abs/2505.13491
作者: Aakash Gupta,Nataraj Das
机构: Think Evolve Consultancy LLP (思创演进咨询有限公司); National Institute Of Technology (国家技术学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 2022 International Conference on Decision Aid Sciences and Applications (DASA)

点击查看摘要

Abstract:Following the pandemic, customers' preference for using e-commerce has accelerated. Since much information is available across multiple reviews (sometimes running into the thousands) for a single product, it can create decision paralysis for the buyer. This scenario disempowers consumers, who cannot be expected to read so many reviews: doing so is time-consuming and can confuse them. Various commercial tools are available that use a scoring mechanism to arrive at an adjusted score; these can alert the user to potential review manipulations. This paper proposes a framework that fine-tunes a generative pre-trained transformer to understand these reviews better and to apply "common sense" in making better decisions. These models have more than 13 billion parameters. To fine-tune the model for our requirement, we use the curie engine from the generative pre-trained transformer GPT-3. By using generative models, we introduce abstractive summarization instead of a simple extractive method of summarizing the reviews. This brings out the true relationships between the reviews rather than simply copying and pasting, introduces an element of "common sense" for the user, and helps them quickly make the right decisions. The user is provided with the pros and cons of the processed reviews; thus the user/customer can make their own decision.
zh

[NLP-184] Contrastive Cross-Course Knowledge Tracing via Concept Graph Guided Knowledge Transfer IJCAI2025

【速读】: 该论文旨在解决现有知识追踪(Knowledge Tracing, KT)模型主要依赖单一课程数据,难以全面捕捉学习者知识状态的问题。其解决方案的关键在于提出TransKT,一种基于对比学习的跨课程知识追踪方法,通过概念图引导的知识迁移来建模不同课程间的学习行为关系。具体而言,TransKT利用零样本大语言模型(Large Language Model, LLM)提示构建跨课程概念图,以隐式链接不同课程中的相关概念,并结合LLM到LM的管道整合语义特征,从而提升图卷积网络(Graph Convolutional Networks, GCNs)在知识迁移中的性能,同时通过对比目标对齐单课程与跨课程知识状态,增强模型对学习者整体知识状态的鲁棒性和准确性表示。

链接: https://arxiv.org/abs/2505.13489
作者: Wenkang Han,Wang Lin,Liya Hu,Zhenlong Dai,Yiyun Zhou,Mengze Li,Zemin Liu,Chang Yao,Jingyuan Chen
机构: Zhejiang University (浙江大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Knowledge tracing (KT) aims to predict learners’ future performance based on historical learning interactions. However, existing KT models predominantly focus on data from a single course, limiting their ability to capture a comprehensive understanding of learners’ knowledge states. In this paper, we propose TransKT, a contrastive cross-course knowledge tracing method that leverages concept graph guided knowledge transfer to model the relationships between learning behaviors across different courses, thereby enhancing knowledge state estimation. Specifically, TransKT constructs a cross-course concept graph by leveraging zero-shot Large Language Model (LLM) prompts to establish implicit links between related concepts across different courses. This graph serves as the foundation for knowledge transfer, enabling the model to integrate and enhance the semantic features of learners’ interactions across courses. Furthermore, TransKT includes an LLM-to-LM pipeline for incorporating summarized semantic features, which significantly improves the performance of Graph Convolutional Networks (GCNs) used for knowledge transfer. Additionally, TransKT employs a contrastive objective that aligns single-course and cross-course knowledge states, thereby refining the model’s ability to provide a more robust and accurate representation of learners’ overall knowledge states.
zh

[NLP-185] Source framing triggers systematic evaluation bias in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在文本评估中的判断一致性、偏见及对框架效应的鲁棒性问题。研究通过系统性地检验四种先进LLMs(OpenAI o3-mini、Deepseek Reasoner、xAI Grok 2和Mistral)在不同主题下的跨模型与模型内一致性,分析其评估结果受陈述来源(如人类作者或另一LLM)的影响。解决方案的关键在于通过操控陈述的来源信息,揭示框架效应如何显著影响LLMs的评估结果,从而凸显其在信息系统的公正性、中立性和完整性方面的潜在风险。

链接: https://arxiv.org/abs/2505.13488
作者: Federico Germani,Giovanni Spitale
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used not only to generate text but also to evaluate it, raising urgent questions about whether their judgments are consistent, unbiased, and robust to framing effects. In this study, we systematically examine inter- and intra-model agreement across four state-of-the-art LLMs (OpenAI o3-mini, Deepseek Reasoner, xAI Grok 2, and Mistral) tasked with evaluating 4,800 narrative statements on 24 different topics of social, political, and public health relevance, for a total of 192,000 assessments. We manipulate the disclosed source of each statement to assess how attribution to either another LLM or a human author of specified nationality affects evaluation outcomes. We find that, in the blind condition, different LLMs display a remarkably high degree of inter- and intra-model agreement across topics. However, this alignment breaks down when source framing is introduced. Here we show that attributing statements to Chinese individuals systematically lowers agreement scores across all models, and in particular for Deepseek Reasoner. Our findings reveal that framing effects can deeply affect text evaluation, with significant implications for the integrity, neutrality, and fairness of LLM-mediated information systems.
zh

[NLP-186] Detecting Prefix Bias in LLM-based Reward Models

【速读】: 该论文试图解决基于人类反馈的强化学习(Reinforcement Learning with Human Feedback, RLHF)中奖励模型可能存在的偏见问题,特别是由查询前缀微小变化引发的前缀偏见(prefix bias)。解决方案的关键在于引入新的方法来检测和评估这种偏见,并提出一种数据增强策略以减轻其影响。通过在多种开源偏好数据集和奖励模型架构上的综合评估,验证了该方法的有效性,强调了在设计和评估公平可靠奖励模型时考虑偏见的重要性。

链接: https://arxiv.org/abs/2505.13487
作者: Ashwin Kumar,Yuzi He,Aram H. Markosyan,Bobbie Chern,Imanol Arrieta-Ibarra
机构: Meta Platforms, Inc.(Meta)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Human Feedback (RLHF) has emerged as a key paradigm for task-specific fine-tuning of language models using human preference data. While numerous publicly available preference datasets provide pairwise comparisons of responses, the potential for biases in the resulting reward models remains underexplored. In this work, we introduce novel methods to detect and evaluate prefix bias – a systematic shift in model preferences triggered by minor variations in query prefixes – in LLM-based reward models trained on such datasets. We leverage these metrics to reveal significant biases in preference models across racial and gender dimensions. Our comprehensive evaluation spans diverse open-source preference datasets and reward model architectures, demonstrating susceptibility to this kind of bias regardless of the underlying model architecture. Furthermore, we propose a data augmentation strategy to mitigate these biases, showing its effectiveness in reducing the impact of prefix bias. Our findings highlight the critical need for bias-aware dataset design and evaluation in developing fair and reliable reward models, contributing to the broader discourse on fairness in AI.
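
One simple way to operationalize such a metric (our illustration; the paper's exact formulation may differ) is to measure how often a reward model's preference between two responses flips when only the query prefix changes. The `reward(prompt, response)` scorer below is a hypothetical placeholder.

```python
from typing import Callable, List, Tuple

def prefix_flip_rate(pairs: List[Tuple[str, str, str]],
                     prefix_a: str, prefix_b: str,
                     reward: Callable[[str, str], float]) -> float:
    """Fraction of (query, resp_1, resp_2) triples whose preferred
    response changes when the query prefix changes from A to B."""
    flips = 0
    for query, resp_1, resp_2 in pairs:
        pref_a = reward(prefix_a + query, resp_1) > reward(prefix_a + query, resp_2)
        pref_b = reward(prefix_b + query, resp_1) > reward(prefix_b + query, resp_2)
        flips += int(pref_a != pref_b)
    return flips / len(pairs)
```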
zh

[NLP-187] Evaluating Large Language Models for Real-World Engineering Tasks

【速读】: 该论文试图解决当前对大型语言模型(Large Language Models, LLMs)在工程任务中的评估存在两个关键缺陷:一是依赖于简化使用场景,通常改编自考试材料,其中正确性易于验证;二是使用了非系统化的场景,无法充分捕捉关键的工程能力。解决方案的关键在于构建一个经过筛选的数据库,包含超过100个源自真实、面向生产的工程场景的问题,系统地覆盖产品设计、预测和诊断等核心能力,从而更全面地评估LLMs在复杂工程任务中的表现。

链接: https://arxiv.org/abs/2505.13484
作者: Rene Heesch,Sebastian Eilermann,Alexander Windmann,Alexander Diedrich,Philipp Rosenthal,Oliver Niggemann
机构: Helmut Schmidt University (汉堡-赫尔姆霍兹大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are transformative not only for daily activities but also for engineering tasks. However, current evaluations of LLMs in engineering exhibit two critical shortcomings: (i) the reliance on simplified use cases, often adapted from examination materials where correctness is easily verifiable, and (ii) the use of ad hoc scenarios that insufficiently capture critical engineering competencies. Consequently, the assessment of LLMs on complex, real-world engineering problems remains largely unexplored. This paper addresses this gap by introducing a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios, systematically designed to cover core competencies such as product design, prognosis, and diagnosis. Using this dataset, we evaluate four state-of-the-art LLMs, including both cloud-based and locally hosted instances, to systematically investigate their performance on complex engineering tasks. Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
zh

[NLP-188] EmoMeta: A Multimodal Dataset for Fine-grained Emotion Classification in Chinese Metaphors

【速读】: 该论文试图解决多模态隐喻在情感分类中的复杂性问题以及跨语言情感细微差别的研究不足。现有研究多集中于英语,缺乏对其他语言中情感表达的深入分析。其解决方案的关键在于构建一个包含5,000个中文文本-图像对的多模态隐喻细粒度情感数据集,该数据集对隐喻出现、领域关系及包括喜悦、爱、信任、恐惧、悲伤、厌恶、愤怒、惊讶、期待和中性在内的多种情感进行了精细标注,并已公开供后续研究使用。

链接: https://arxiv.org/abs/2505.13483
作者: Xingyuan Lu,Yuxi Liu,Dongyu Zhang,Zhiyao Wu,Jing Ren,Feng Xia
机构: Dalian University of Technology(大连理工大学); School of Software(软件学院); School of Foreign Languages(外国语学院); Faculty of Business Administration(商学院); RMIT University(皇家墨尔本理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Metaphors play a pivotal role in expressing emotions, making them crucial for emotional intelligence. The advent of multimodal data and widespread communication has led to a proliferation of multimodal metaphors, amplifying the complexity of emotion classification compared to single-mode scenarios. However, the scarcity of research on constructing multimodal metaphorical fine-grained emotion datasets hampers progress in this domain. Moreover, existing studies predominantly focus on English, overlooking potential variations in emotional nuances across languages. To address these gaps, we introduce a multimodal dataset in Chinese comprising 5,000 text-image pairs of metaphorical advertisements. Each entry is meticulously annotated for metaphor occurrence, domain relations and fine-grained emotion classification encompassing joy, love, trust, fear, sadness, disgust, anger, surprise, anticipation, and neutral. Our dataset is publicly accessible (this https URL), facilitating further advancements in this burgeoning field.
zh

[NLP-189] MedEIR: A Specialized Medical Embedding Model for Enhanced Information Retrieval

【速读】: 该论文旨在解决现有嵌入模型在医学文档语义捕捉、长文本处理及跨领域适应性方面的不足。其解决方案的关键在于提出MedEIR,一个针对医学和通用自然语言处理任务联合优化的嵌入模型与分词器,结合基于ALiBi的长上下文处理技术,支持长达8,192个标记的序列,并通过仅60亿个标记的预训练和300万句对的微调,实现了在多个基准测试中的卓越性能。

链接: https://arxiv.org/abs/2505.13482
作者: Anand Selvadurai,Jasheen Shaik,Girish Chandrasekar,ShriRadhaKrishnan Balamurugan,Eswara Reddy
机构: CompIndia Infotech Pvt. Ltd
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 9 pages, 1 figure. This manuscript is a substantial revision of a previously submitted paper. We have explicitly clarified novelty, strengthened scholarly depth, and expanded experimental validation

点击查看摘要

Abstract:Embedding models have become essential for retrieval-augmented generation (RAG) tasks, semantic clustering, and text re-ranking. Despite their growing use, however, many of these models come with notable limitations. For example, Jina fails to capture the semantic content of medical documents, while models such as MiniLM often perform poorly on long-form documents. Domain-adapted models, while specialized, often underperform in general-purpose tasks, reducing their overall applicability. General-domain tokenizers often misinterpret medical vocabulary. The limitations of current embedding models, whether in tokenization accuracy, domain comprehension, or handling long sequences, highlight the need for more versatile solutions. In this work, we present MedEIR, a novel embedding model and tokenizer jointly optimized for both medical and general NLP tasks, incorporating ALiBi-based long-context processing to support sequences of up to 8,192 tokens. MedEIR was pre-trained on only 6 billion tokens, significantly fewer than Jina's, followed by fine-tuning on 3 million sentence pairs. MedEIR consistently outperforms Jina V2 and MiniLM across MTEB benchmarks, achieving top scores on ArguAna (55.24), NFCorpus (38.44), MedicalQARetrieval (74.25), SciFact (72.04), and TRECCOVID (79.56). These results highlight the potential of MedEIR as a highly effective embedding model, demonstrating strong performance across both general-purpose and domain-specific tasks and outperforming existing models on multiple benchmarks.
zh

[NLP-190] Evaluating Reasoning LLMs for Suicide Screening with the Columbia-Suicide Severity Rating Scale

【速读】: 该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)进行自动化的自杀风险评估问题,特别是在在线平台上用户可能向AI系统而非人类表达自杀意念的新趋势下。解决方案的关键在于评估LLMs在无需额外训练的情况下,使用Columbia-Suicide Severity Rating Scale (C-SSRS) 对用户发帖进行7级严重程度分类的能力,重点分析了模型在零样本场景下的表现及其在相邻严重程度级别间的误分类模式。

链接: https://arxiv.org/abs/2505.13480
作者: Avinash Patil,Siru Tao,Amardeep Gedhu
机构: Arizona State University (亚利桑那州立大学); Carnegie Mellon University (卡内基梅隆大学); Santa Clara University (圣克拉拉大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 8 Pages, 6 Figures, 1 Table

点击查看摘要

Abstract:Suicide prevention remains a critical public health challenge. While online platforms such as Reddit’s r/SuicideWatch have historically provided spaces for individuals to express suicidal thoughts and seek community support, the advent of large language models (LLMs) introduces a new paradigm-where individuals may begin disclosing ideation to AI systems instead of humans. This study evaluates the capability of LLMs to perform automated suicide risk assessment using the Columbia-Suicide Severity Rating Scale (C-SSRS). We assess the zero-shot performance of six models-including Claude, GPT, Mistral, and LLaMA-in classifying posts across a 7-point severity scale (Levels 0-6). Results indicate that Claude and GPT closely align with human annotations, while Mistral achieves the lowest ordinal prediction error. Most models exhibit ordinal sensitivity, with misclassifications typically occurring between adjacent severity levels. We further analyze confusion patterns, misclassification sources, and ethical considerations, underscoring the importance of human oversight, transparency, and cautious deployment. Full code and supplementary materials are available at this https URL.
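
As a flavor of what zero-shot C-SSRS screening looks like in practice, here is a hedged prompt-and-parse sketch of our own: the prompt wording is illustrative and not taken from the paper, and the actual model call is left to whichever provider API you use.

```python
CSSRS_PROMPT = """You are a clinical triage assistant. Rate the suicide
risk of the post below on a C-SSRS-derived 7-point scale from 0 to 6,
where 0 means no risk indicators and 6 means the most severe ideation.
Answer with a single integer.

Post: {post}
Rating:"""

def parse_rating(text: str) -> int:
    """Extract the first integer 0-6 from a model's reply."""
    for token in text.split():
        digits = token.strip(".,:;")
        if digits.isdigit():
            return min(max(int(digits), 0), 6)
    raise ValueError("no rating found in model output")

# Usage: send CSSRS_PROMPT.format(post=...) to your LLM of choice,
# then parse_rating(reply) gives the predicted severity level.
print(parse_rating("Rating: 3"))  # -> 3
```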
zh

[NLP-191] SLOT: Sample-specific Language Model Optimization at Test-time

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理复杂指令时表现不佳的问题,特别是在那些未在通用样本中充分表示的提示(prompt)上。解决方案的关键在于提出SLOT(Sample-specific Language Model Optimization at Test-time),一种在测试阶段进行少量优化的参数高效方法,通过更新轻量级的样本特定参数向量来增强模型对单个提示的响应准确性。该参数向量被添加到输出头之前的最终隐藏层,并通过在每个样本优化过程中缓存最后一层特征实现高效适应。SLOT通过仅最小化输入提示的交叉熵损失,使模型更好地与给定指令对齐。

链接: https://arxiv.org/abs/2505.12392
作者: Yang Hu,Xingyu Zhang,Xueji Fang,Zhiyang Chen,Xiao Wang,Huatian Zhang,Guojun Qi
机构: Westlake University (西湖大学); University of Washington (华盛顿大学); USTC (中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose SLOT (Sample-specific Language Model Optimization at Test-time), a novel and parameter-efficient test-time inference approach that enhances a language model’s ability to more accurately respond to individual prompts. Existing Large Language Models (LLMs) often struggle with complex instructions, leading to poor performances on those not well represented among general samples. To address this, SLOT conducts few optimization steps at test-time to update a light-weight sample-specific parameter vector. It is added to the final hidden layer before the output head, and enables efficient adaptation by caching the last layer features during per-sample optimization. By minimizing the cross-entropy loss on the input prompt only, SLOT helps the model better aligned with and follow each given instruction. In experiments, we demonstrate that our method outperforms the compared models across multiple benchmarks and LLMs. For example, Qwen2.5-7B with SLOT achieves an accuracy gain of 8.6% on GSM8K from 57.54% to 66.19%, while DeepSeek-R1-Distill-Llama-70B with SLOT achieves a SOTA accuracy of 68.69% on GPQA among 70B-level models. Our code is available at this https URL.
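
The mechanism is simple enough to sketch. The following hedged PyTorch snippet (our simplification of the paper's method, with GPT-2 standing in for the models actually evaluated) caches the final-layer hidden states once, then optimizes a per-sample additive vector `delta` for a few steps against the cross-entropy of the prompt tokens, keeping the base model frozen.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder backbone
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)  # the base model stays frozen

ids = tok("Q: What is 12 * 7? A:", return_tensors="pt").input_ids
with torch.no_grad():  # cache last-layer features once per sample
    hidden = model(ids, output_hidden_states=True).hidden_states[-1]

delta = torch.zeros(1, 1, hidden.size(-1), requires_grad=True)
opt = torch.optim.AdamW([delta], lr=1e-2)

for _ in range(5):  # a few test-time optimization steps
    logits = model.lm_head(hidden + delta)
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
# At generation time, `delta` is added to the final hidden states
# before the output head, for this sample only.
```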
zh

[NLP-192] From Words to Worlds: Compositionality for Cognitive Architectures ICML2024

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在组合性(compositionality)学习方面的表现及其与模型性能之间关系的问题。研究的核心在于探讨模型规模扩展与指令微调对组合性策略学习的影响差异。其解决方案的关键在于通过跨四个LLM家族(共12个模型)和三个任务类别的实证分析,揭示出模型规模的增加有助于提升组合性能力,而指令微调则可能产生相反效果,从而为LLM与人类认知能力的对齐发展提供了新的研究方向。

链接: https://arxiv.org/abs/2407.13419
作者: Ruchira Dhar,Anders Søgaard
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
备注: Accepted to ICML 2024 Workshop on LLMs Cognition

点击查看摘要

Abstract:Large language models (LLMs) are very performant connectionist systems, but do they exhibit more compositionality? More importantly, is that part of why they perform so well? We present empirical analyses across four LLM families (12 models) and three task categories, including a novel task introduced below. Our findings reveal a nuanced relationship in learning of compositional strategies by LLMs – while scaling enhances compositional abilities, instruction tuning often has a reverse effect. Such disparity brings forth some open issues regarding the development and improvement of large language models in alignment with human cognitive capacities.
zh

[NLP-193] Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples INTERSPEECH2025

【速读】: 该论文试图解决音频感知大语言模型(Audio-aware Large Language Models, ALLMs)在处理音频输入时容易产生不存在的声音事件(hallucinations)的问题,从而影响其在实际应用中的可靠性。解决方案的关键在于提出一种名为LISTEN(Learning to Identify Sounds Through Extended Negative Samples)的对比学习训练方法,该方法通过使用基础大语言模型(LLM)生成的合成数据来增强ALLMs区分存在与不存在声音的能力,且无需修改LLM参数,仅通过轻量级适配器高效集成音频表示。

链接: https://arxiv.org/abs/2505.14518
作者: Chun-Yi Kuan,Hung-yi Lee
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. However, these models often hallucinate non-existent sound events, reducing their reliability in real-world applications. To address this, we propose LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method that enhances ALLMs’ ability to distinguish between present and absent sounds using synthesized data from the backbone LLM. Unlike prior approaches, our method requires no modification to LLM parameters and efficiently integrates audio representations via a lightweight adapter. Experiments show that LISTEN effectively mitigates hallucinations while maintaining impressive performance on existing audio question and reasoning benchmarks. At the same time, it is more efficient in both data and computation.
zh

[NLP-194] Mitigating Subgroup Disparities in Multi-Label Speech Emotion Recognition: A Pseudo-Labeling and Unsupervised Learning Approach INTERSPEECH2025

【速读】: 该论文试图解决在类别型语音情感识别(Speech Emotion Recognition, SER)中存在的人群子组差异和性能偏差问题,尤其是在缺乏显式人口统计标签(demographic labels)的情况下难以实现公平性的问题。解决方案的关键在于引入隐式人口统计推断(Implicit Demography Inference, IDI)模块,该模块通过预训练模型的伪标签(pseudo-labeling)和基于k均值聚类的无监督学习方法来减轻偏差。

链接: https://arxiv.org/abs/2505.14449
作者: Yi-Cheng Lin,Huang-Cheng Chou,Hung-yi Lee
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted by InterSpeech 2025. 7 pages including 2 pages of appendix

点击查看摘要

Abstract:While subgroup disparities and performance bias are increasingly studied in computational research, fairness in categorical Speech Emotion Recognition (SER) remains underexplored. Existing methods often rely on explicit demographic labels, which are difficult to obtain due to privacy concerns. To address this limitation, we introduce an Implicit Demography Inference (IDI) module that leverages pseudo-labeling from a pre-trained model and unsupervised learning using k-means clustering to mitigate bias in SER. Our experiments show that pseudo-labeling IDI reduces subgroup disparities, improving fairness metrics by over 33% with less than a 3% decrease in SER accuracy. Also, the unsupervised IDI yields more than a 26% improvement in fairness metrics with a drop of less than 4% in SER performance. Further analyses reveal that the unsupervised IDI consistently mitigates race and age disparities, demonstrating its potential in scenarios where explicit demographic information is unavailable.
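
The unsupervised variant is straightforward to approximate. The sketch below (our assumptions: utterance embeddings and per-utterance correctness flags are given) clusters embeddings with k-means, treats cluster IDs as pseudo-demographic groups, and reports a simple worst-vs-best accuracy gap as a fairness signal.

```python
import numpy as np
from sklearn.cluster import KMeans

def subgroup_gap(embeddings: np.ndarray, correct: np.ndarray, k: int = 4) -> float:
    """`correct` is a 0/1 array of per-utterance SER correctness."""
    groups = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    accs = [correct[groups == g].mean() for g in range(k) if (groups == g).any()]
    return max(accs) - min(accs)  # one simple fairness-gap metric

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 32))       # stand-in speaker embeddings
flags = rng.integers(0, 2, size=200)   # stand-in correctness flags
print(subgroup_gap(emb, flags))
```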
zh

[NLP-195] Pairwise Evaluation of Accent Similarity in Speech Synthesis INTERSPEECH2025

【速读】: 该论文试图解决语音合成中口音相似性评估方法不足的问题(accent similarity evaluation),尤其是在高保真口音生成领域,相关研究仍较为薄弱。其解决方案的关键在于改进主观和客观评估方法:在主观方面,通过优化XAB听辨测试,引入转录文本、强调感知口音差异以及严格的可靠性筛选,以提高统计显著性并降低成本;在客观方面,利用与发音相关的度量标准,如元音共振峰距离和语音后验图,来评估口音生成效果。

链接: https://arxiv.org/abs/2505.14410
作者: Jinzuomu Zhong,Suyuan Liu,Dan Wells,Korin Richmond
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted by INTERSPEECH 2025

点击查看摘要

Abstract:Despite growing interest in generating high-fidelity accents, evaluating accent similarity in speech synthesis has been underexplored. We aim to enhance both subjective and objective evaluation methods for accent similarity. Subjectively, we refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs. Our method involves providing listeners with transcriptions, having them highlight perceived accent differences, and implementing meticulous screening for reliability. Objectively, we utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation. Comparative experiments reveal that these metrics, alongside accent similarity, speaker similarity, and Mel Cepstral Distortion, can be used. Moreover, our findings underscore significant limitations of common metrics like Word Error Rate in assessing underrepresented accents.
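
As a toy illustration of a formant-based metric (our simplification; the paper's metrics also include phonetic posteriorgram distances), one can compare two systems by the mean Euclidean distance between their per-vowel (F1, F2) values, assuming formants have already been extracted per vowel category.

```python
import numpy as np

def formant_distance(formants_a: dict, formants_b: dict) -> float:
    """formants_*: vowel -> (F1_Hz, F2_Hz); mean distance over shared vowels."""
    shared = sorted(set(formants_a) & set(formants_b))
    dists = [np.linalg.norm(np.subtract(formants_a[v], formants_b[v]))
             for v in shared]
    return float(np.mean(dists))

ref = {"i": (280, 2250), "a": (730, 1090), "u": (310, 870)}
syn = {"i": (300, 2150), "a": (700, 1150), "u": (330, 900)}
print(formant_distance(ref, syn))  # lower = closer accent realisation
```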
zh

[NLP-196] OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking

【速读】: 该论文试图解决基因组基础模型(Genomic Foundation Models, GFMs)在AI驱动基因组学中面临的评估不严谨和可重复性不足的问题。解决方案的关键在于提出OmniGenBench,这是一个模块化的基准测试平台,旨在统一数据、模型、基准测试和可解释性层,实现对任何GFM的标准化、一键式评估,并通过自动化流程和社区可扩展功能解决数据透明度、模型互操作性、基准碎片化和黑盒可解释性等关键挑战。

链接: https://arxiv.org/abs/2505.14402
作者: Heng Yang,Jack Cole,Yuan Li,Renzhi Chen,Geyong Min,Ke Li
机构: 未知
类目: Genomics (q-bio.GN); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The code of nature, embedded in DNA and RNA genomes since the origin of life, holds immense potential to impact both humans and ecosystems through genome modeling. Genomic Foundation Models (GFMs) have emerged as a transformative approach to decoding the genome. As GFMs scale up and reshape the landscape of AI-driven genomics, the field faces an urgent need for rigorous and reproducible evaluation. We present OmniGenBench, a modular benchmarking platform designed to unify the data, model, benchmarking, and interpretability layers across GFMs. OmniGenBench enables standardized, one-command evaluation of any GFM across five benchmark suites, with seamless integration of over 31 open-source models. Through automated pipelines and community-extensible features, the platform addresses critical reproducibility challenges, including data transparency, model interoperability, benchmark fragmentation, and black-box interpretability. OmniGenBench aims to serve as foundational infrastructure for reproducible genomic AI research, accelerating trustworthy discovery and collaborative innovation in the era of genome-scale modeling.
zh

[NLP-197] InterFeat: An Automated Pipeline for Finding Interesting Hypotheses in Structured Biomedical Data

【速读】: 该论文试图解决科学发现中“有趣现象”的自动识别问题,这一概念传统上依赖人工且定义不明确。解决方案的关键在于构建一个整合机器学习、知识图谱、文献检索和大型语言模型的管道,将“有趣性”形式化为新颖性、实用性和合理性的综合指标,从而在结构化生物医学数据中自动化发现具有潜在机制支持的特征-目标关系(feature-target relations)。

链接: https://arxiv.org/abs/2505.13534
作者: Dan Ofer,Michal Linial,Dafna Shahaf
机构: The Hebrew University of Jerusalem (希伯来大学); Institute of Life Sciences (生命科学研究所); Department of Biological Chemistry (生物化学系); Department of Computer Science (计算机科学系)
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Finding interesting phenomena is the core of scientific discovery, but it is a manual, ill-defined concept. We present an integrative pipeline for automating the discovery of interesting simple hypotheses (feature-target relations with effect direction and a potential underlying mechanism) in structured biomedical data. The pipeline combines machine learning, knowledge graphs, literature search and Large Language Models. We formalize “interestingness” as a combination of novelty, utility and plausibility. On 8 major diseases from the UK Biobank, our pipeline consistently recovers risk factors years before their appearance in the literature. 40–53% of our top candidates were validated as interesting, compared to 0–7% for a SHAP-based baseline. Overall, 28% of 109 candidates were interesting to medical experts. The pipeline addresses the challenge of operationalizing “interestingness” scalably and for any target. We release data and code: this https URL
zh

计算机视觉

[CV-0] Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers

【速读】:该论文旨在解决基于扩散的Transformer模型在生成图像和视频时计算成本过高的问题,这限制了其在实际应用中的部署。例如,在A100 GPU上生成8192×8192图像可能需要超过一小时。论文提出的解决方案是GRAT(GRouping first, ATtending smartly),其关键在于利用预训练扩散Transformer中学习到的注意力图的固有稀疏性(通常为局部聚焦),并通过优化GPU并行性来加速注意力计算。具体而言,GRAT首先将连续的token划分为非重叠的组,以匹配GPU执行模式和预训练生成Transformer中学习到的局部注意力结构,随后通过同一组内的所有查询token共享一组可访问的键和值token来加速注意力计算,这些键和值token进一步被限制在结构化区域,从而显著降低计算开销,同时保留关键注意力模式和长程上下文。

链接: https://arxiv.org/abs/2505.14687
作者: Sucheng Ren,Qihang Yu,Ju He,Alan Yuille,Liang-Chieh Chen
机构: Johns Hopkins University (约翰霍普金斯大学); Independent Researcher (独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website at this http URL

点击查看摘要

Abstract:Diffusion-based Transformers have demonstrated impressive generative capabilities, but their high computational costs hinder practical deployment; for example, generating an 8192×8192 image can take over an hour on an A100 GPU. In this work, we propose GRAT (GRouping first, ATtending smartly), a training-free attention acceleration strategy for fast image and video generation without compromising output quality. The key insight is to exploit the inherent sparsity in learned attention maps (which tend to be locally focused) in pretrained Diffusion Transformers and leverage better GPU parallelism. Specifically, GRAT first partitions contiguous tokens into non-overlapping groups, aligning both with GPU execution patterns and the local attention structures learned in pretrained generative Transformers. It then accelerates attention by having all query tokens within the same group share a common set of attendable key and value tokens. These key and value tokens are further restricted to structured regions, such as surrounding blocks or criss-cross regions, significantly reducing computational overhead (e.g., attaining a 35.8× speedup over full attention when generating 8192×8192 images) while preserving essential attention patterns and long-range context. We validate GRAT on pretrained Flux and HunyuanVideo for image and video generation, respectively. In both cases, GRAT achieves substantially faster inference without any fine-tuning, while maintaining the performance of full attention. We hope GRAT will inspire future research on accelerating Diffusion Transformers for scalable visual generation.
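
To convey the grouping idea, here is a toy 1-D sketch of our own (the paper operates on 2-D token layouts with optimized GPU kernels, so this is illustrative only): queries are partitioned into fixed groups, and every query in a group shares one attendable key/value region consisting of the group and its immediate neighbours.

```python
import torch
import torch.nn.functional as F

def grouped_attention(q, k, v, group: int):
    # q, k, v: (batch, seq, dim); seq must be divisible by `group`
    b, n, d = q.shape
    g = n // group
    qg = q.view(b, g, group, d)
    kv_idx = lambda i: slice(max(i - 1, 0) * group, min(i + 2, g) * group)
    out = torch.empty_like(q).view(b, g, group, d)
    for i in range(g):  # each group shares one attendable KV region
        ki, vi = k[:, kv_idx(i)], v[:, kv_idx(i)]
        att = F.softmax(qg[:, i] @ ki.transpose(1, 2) / d ** 0.5, dim=-1)
        out[:, i] = att @ vi
    return out.view(b, n, d)

x = torch.randn(1, 16, 8)
print(grouped_attention(x, x, x, group=4).shape)  # torch.Size([1, 16, 8])
```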
zh

[CV-1] Emerging Properties in Unified Multimodal Pretraining

【速读】:该论文旨在解决多模态理解与生成的统一问题,即如何构建一个能够同时处理和生成多种模态信息(如文本、图像、视频等)的通用基础模型。解决方案的关键在于提出BAGEL,这是一个原生支持多模态理解和生成的统一解码器模型,其在由大规模交错文本、图像、视频和网络数据组成的万亿级标记语料上进行预训练,从而在复杂多模态推理任务中展现出显著的能力提升。

链接: https://arxiv.org/abs/2505.14683
作者: Chaorui Deng,Deyao Zhu,Kunchang Li,Chenhui Gou,Feng Li,Zeyu Wang,Shu Zhong,Weihao Yu,Xiaonan Nie,Ziang Song,Guang Shi,Haoqi Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 37 pages, 17 figures

点击查看摘要

Abstract:Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at this https URL
zh

[CV-2] UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Model, MLLM)在图像理解和生成任务中的统一性与性能提升问题。其关键解决方案是提出一种新的Chain-of-Thought Verification (CoT-V)策略,该策略通过测试阶段的Best-of-N方法显著提升了图像生成质量,使模型在测试时既能作为图像生成器又能作为验证器,逐步评估文本提示与生成图像之间的语义对齐程度。

链接: https://arxiv.org/abs/2505.14682
作者: Rui Tian,Mingfei Gao,Mingze Xu,Jiaming Hu,Jiasen Lu,Zuxuan Wu,Yinfei Yang,Afshin Dehghan
机构: Apple(苹果); Fudan University(复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report

点击查看摘要

Abstract:We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. More importantly, we propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly boosts UniGen’s image generation quality using a simple Best-of-N test-time strategy. Specifically, CoT-V enables UniGen to act as both image generator and verifier at test time, assessing the semantic alignment between a text prompt and its generated image in a step-by-step CoT manner. Trained entirely on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks, with a final score of 0.78 on GenEval and 85.19 on DPG-Bench. Through extensive ablation studies, our work provides actionable insights and addresses key challenges in the full life cycle of building unified MLLMs, contributing meaningful directions to the future research.
zh

[CV-3] Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

【速读】:该论文试图解决视觉语言模型(VLMs)在没有显式链式思维(CoT)监督的情况下,通过强化学习进行视觉推理的问题。研究发现,直接对VLM应用强化学习会导致模型依赖简单问题的捷径,从而降低其在未见数据分布上的泛化能力。解决方案的关键在于鼓励模型在推理前先对图像进行解释,即采用“描述-推理-回答”的输出格式,通过生成详细的图像描述再构建推理链条,从而提升模型的视觉推理能力。

链接: https://arxiv.org/abs/2505.14677
作者: Jiaer Xia,Yuhang Zang,Peng Gao,Yixuan Li,Kaiyang Zhou
机构: Hong Kong Baptist University (香港浸会大学); Shanghai AI Lab (上海人工智能实验室); University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning general-purpose reasoning capabilities has long been a challenging problem in AI. Recent research in large language models (LLMs), such as DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can enable pre-trained LLMs to develop reasoning capabilities using simple question-answer pairs. In this paper, we aim to train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs, without any explicit chain-of-thought (CoT) supervision. Our findings indicate that simply applying reinforcement learning to a VLM – by prompting the model to produce a reasoning chain before providing an answer – can lead the model to develop shortcuts from easy questions, thereby reducing its ability to generalize across unseen data distributions. We argue that the key to mitigating shortcut learning is to encourage the model to interpret images prior to reasoning. Therefore, we train the model to adhere to a caption-reason-answer output format: initially generating a detailed caption for an image, followed by constructing an extensive reasoning chain. When trained on 273K CoT-free visual question-answer pairs and using only reinforcement learning, our model, named Visionary-R1, outperforms strong multimodal models, such as GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro, on multiple visual reasoning benchmarks.
zh

[CV-4] Training-Free Watermarking for Autoregressive Image Generation

【速读】:该论文旨在解决在自回归图像生成模型中嵌入水印以保护图像版权和防止视觉生成模型被恶意滥用的问题,而现有方法主要针对扩散模型,对自回归模型的水印研究仍较为薄弱。其解决方案的关键在于提出了一种无需训练的水印框架IndexMark,该框架利用码本的冗余特性,通过“匹配-替换”方法选择相似的码本索引进行替换,从而在不影响图像质量的前提下嵌入水印,并通过索引编码器提高验证精度,同时引入辅助验证方案增强对裁剪攻击的鲁棒性。

链接: https://arxiv.org/abs/2505.14673
作者: Yu Tong,Zihao Pan,Shuai Yang,Kaiyang Zhou
机构: Hong Kong Baptist University (香港浸会大学); Wuhan University (武汉大学); Sun Yat-sen University (中山大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Invisible image watermarking can protect image ownership and prevent malicious misuse of visual generative models. However, existing generative watermarking methods are mainly designed for diffusion models while watermarking for autoregressive image generation models remains largely underexplored. We propose IndexMark, a training-free watermarking framework for autoregressive image generation models. IndexMark is inspired by the redundancy property of the codebook: replacing autoregressively generated indices with similar indices produces negligible visual differences. The core component in IndexMark is a simple yet effective match-then-replace method, which carefully selects watermark tokens from the codebook based on token similarity, and promotes the use of watermark tokens through token replacement, thereby embedding the watermark without affecting the image quality. Watermark verification is achieved by calculating the proportion of watermark tokens in generated images, with precision further improved by an Index Encoder. Furthermore, we introduce an auxiliary validation scheme to enhance robustness against cropping attacks. Experiments demonstrate that IndexMark achieves state-of-the-art performance in terms of image quality and verification accuracy, and exhibits robustness against various perturbations, including cropping, noises, Gaussian blur, random erasing, color jittering, and JPEG compression.
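
A hedged, simplified sketch of the match-then-replace idea follows. It is our own illustration and differs from the paper in one key respect: we use a random secret "green" list, whereas IndexMark selects watermark tokens by similarity-based pairing of codebook entries. Red indices are swapped for their most similar green entry, and verification checks the green-token fraction.

```python
import torch
import torch.nn.functional as F

def watermark_tables(codebook: torch.Tensor, seed: int = 0):
    g = torch.Generator().manual_seed(seed)
    green = torch.rand(len(codebook), generator=g) < 0.5  # secret list
    sim = F.normalize(codebook, dim=1) @ F.normalize(codebook, dim=1).T
    sim[:, ~green] = float("-inf")  # only green entries may be targets
    nearest_green = sim.argmax(dim=1)  # most similar green index
    return green, nearest_green

def embed_watermark(indices, green, nearest_green):
    # Match-then-replace: keep green indices, swap red ones for the
    # most similar green entry so the decoded image barely changes.
    return torch.where(green[indices], indices, nearest_green[indices])

def verify_watermark(indices, green, threshold: float = 0.9) -> bool:
    return bool(green[indices].float().mean() >= threshold)

cb = torch.randn(1024, 64)  # stand-in VQ codebook
green, ng = watermark_tables(cb)
raw = torch.randint(0, 1024, (256,))
wm = embed_watermark(raw, green, ng)
print(verify_watermark(raw, green), verify_watermark(wm, green))
```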
zh

[CV-5] UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens

【速读】:该论文试图解决个性化模型在理解和生成任务中使用独立概念标记导致的局限性问题,特别是在生成复杂提示图像时无法有效融合个性化知识。解决方案的关键在于提出UniCTokens框架,该框架将个性化信息整合到统一的视觉语言模型(VLM)中,通过训练一组统一的概念标记来利用互补语义,从而提升个性化任务的性能,并采用分阶段的训练策略以增强理解与生成任务之间的相互促进。

链接: https://arxiv.org/abs/2505.14671
作者: Ruichuan An,Sihan Yang,Renrui Zhang,Zijun Shen,Ming Lu,Gaole Dai,Hao Liang,Ziyu Guo,Shilin Yan,Yulin Luo,Bocheng Zou,Chaoqun Yang,Wentao Zhang
机构: Peking University (北京大学); Xi’an JiaoTong University (西安交通大学); CUHK (香港中文大学); Intel Labs, China (英特尔实验室,中国); Nanjing University (南京大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This may result in limitations for generating images with complex prompts. For example, given the concept \langle bo\rangle , generating " \langle bo\rangle wearing its hat" without additional textual descriptions of its hat. We call this kind of generation personalized knowledge-driven generation. To address the limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting two personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation to enhance mutual benefits between both tasks. To quantitatively evaluate the unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. Experimental results on UnifyBench indicate that UniCTokens shows competitive performance compared to leading methods in concept understanding, concept generation, and achieving state-of-the-art results in personalized knowledge-driven generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding. Our code and dataset will be released at: \hrefthis https URLthis https URL.
zh

[CV-6] AKRMap: Adaptive Kernel Regression for Trustworthy Visualization of Cross-Modal Embeddings

【速读】:该论文试图解决跨模态嵌入(cross-modal embeddings)可视化方法受限于传统降维(Dimensionality Reduction, DR)技术如PCA和t-SNE的问题,这些方法未能有效结合多模态间的度量标准(如CLIPScore)。解决方案的关键在于提出AKRMap,一种新的降维技术,通过在投影空间中学习度量景观的核回归来提高可视化的准确性。AKRMap构建了一个由后投影核回归损失引导的监督投影网络,并采用可与投影联合优化的自适应广义核,从而高效生成能够捕捉复杂度量分布的可视化结果,并支持交互功能如缩放和叠加以进行深入探索。

链接: https://arxiv.org/abs/2505.14664
作者: Yilin Ye,Junchao Huang,Xingchen Zeng,Jiazhi Xia,Wei Zeng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have been primarily confined to traditional dimensionality reduction (DR) techniques like PCA and t-SNE. These DR methods primarily focus on feature distributions within a single modality, while failing to incorporate metrics (e.g., CLIPScore) across multiple modalities. This paper introduces AKRMap, a new DR technique designed to visualize cross-modal embedding metrics with enhanced accuracy by learning kernel regression of the metric landscape in the projection space. Specifically, AKRMap constructs a supervised projection network guided by a post-projection kernel regression loss, and employs adaptive generalized kernels that can be jointly optimized with the projection. This approach enables AKRMap to efficiently generate visualizations that capture complex metric distributions, while also supporting interactive features such as zoom and overlay for deeper exploration. Quantitative experiments demonstrate that AKRMap outperforms existing DR methods in generating more accurate and trustworthy visualizations. We further showcase the effectiveness of AKRMap in visualizing and comparing cross-modal embeddings for text-to-image models. Code and demo are available at this https URL.
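
The core loss is easy to sketch. Below is our reading of a post-projection kernel-regression objective, with a fixed Gaussian kernel standing in for the paper's adaptive generalized kernels: each point's metric is predicted from its neighbours in the 2-D projection, and the squared discrepancy is penalized.

```python
import torch

def kernel_regression_loss(proj: torch.Tensor, metric: torch.Tensor,
                           bandwidth: float = 0.5) -> torch.Tensor:
    """proj: (n, 2) projected points; metric: (n,) e.g. CLIPScore values."""
    d2 = torch.cdist(proj, proj).pow(2)
    w = torch.exp(-d2 / (2 * bandwidth ** 2))
    w = w - torch.diag_embed(torch.diagonal(w))  # leave-one-out weights
    pred = (w @ metric) / w.sum(dim=1).clamp_min(1e-8)
    return torch.mean((pred - metric) ** 2)

proj = torch.randn(128, 2, requires_grad=True)  # stand-in projection
metric = torch.rand(128)
loss = kernel_regression_loss(proj, metric)
loss.backward()  # gradients flow back into the projection network
```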
zh

[CV-7] CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation

【速读】:该论文旨在解决工程设计中高效创建准确且可编辑的3D CAD模型的问题,当前手动工作流程存在耗时且依赖用户专业知识的局限性。论文提出的解决方案关键在于引入CAD-Coder,一个针对生成可编辑CAD代码(CadQuery Python)进行微调的开源视觉-语言模型(VLM),其基于自建的GenCAD-Code数据集(包含163k个CAD模型图像与代码对),在语法有效性、3D实体相似性准确性等方面优于现有先进模型,并展现出从真实世界图像生成CAD代码的能力。

链接: https://arxiv.org/abs/2505.14646
作者: Anna C. Doris,Md Ferdous Alam,Amin Heyrani Nobari,Faez Ahmed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Efficient creation of accurate and editable 3D CAD models is critical in engineering design, significantly impacting cost and time-to-market in product innovation. Current manual workflows remain highly time-consuming and demand extensive user expertise. While recent developments in AI-driven CAD generation show promise, existing models are limited by incomplete representations of CAD operations, inability to generalize to real-world images, and low output accuracy. This paper introduces CAD-Coder, an open-source Vision-Language Model (VLM) explicitly fine-tuned to generate editable CAD code (CadQuery Python) directly from visual input. Leveraging a novel dataset that we created–GenCAD-Code, consisting of over 163k CAD-model image and code pairs–CAD-Coder outperforms state-of-the-art VLM baselines such as GPT-4.5 and Qwen2.5-VL-72B, achieving a 100% valid syntax rate and the highest accuracy in 3D solid similarity. Notably, our VLM demonstrates some signs of generalizability, successfully generating CAD code from real-world images and executing CAD operations unseen during fine-tuning. The performance and adaptability of CAD-Coder highlights the potential of VLMs fine-tuned on code to streamline CAD workflows for engineers and designers. CAD-Coder is publicly available at: this https URL.
zh

[CV-8] VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation

【速读】:该论文试图解决当前长视频理解(Long Video Understanding, LVU)基准测试的有效性和鲁棒性不足的问题。现有基准测试主要依赖于选择题(Multiple-Choice Questions, MCQs),其评估结果因模型可能通过猜测获得正确答案而被高估,且部分问题存在强先验信息,使得模型无需理解视频内容即可作答。此外,增加输入帧数并不一定提升基准测试性能,这与预期相悖。为解决上述问题,论文提出了VideoEval-Pro,一个包含开放式短答案问题的现实主义LVU基准,其关键在于通过感知和推理任务评估视频的片段级和全视频理解能力,从而更真实地反映大型多模态模型(Large Multimodal Models, LMMs)的长视频理解能力。

链接: https://arxiv.org/abs/2505.14640
作者: Wentao Ma,Weiming Ren,Yiming Jia,Zhuofeng Li,Ping Nie,Ge Zhang,Wenhu Chen
机构: University of Waterloo(滑铁卢大学); University of Toronto(多伦多大学); Vector Institute(向量研究所); Shanghai University(上海大学); Independent(独立); M-A-P(未知)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Dataset: this https URL , Project Webpage: this https URL

点击查看摘要

Abstract:Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sober lesson for existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated due to the possibility of guessing the correct answer; second, a significant portion of questions in these benchmarks have strong priors that allow models to answer directly without even reading the input video. For example, Gemini-1.5-Pro can achieve over 50% accuracy given a random frame from a long video on Video-MME. We also observe that increasing the number of frames does not necessarily lead to improvement on existing benchmarks, which is counterintuitive. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs’ long-video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark containing questions with open-ended short answers, which truly require understanding the entire video. VideoEval-Pro assesses both segment-level and full-video understanding through perception and reasoning tasks. By evaluating 21 proprietary and open-source video LMMs, we conclude the following findings: (1) video LMMs show drastic performance drops (>25%) on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared to other MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this domain.
zh

[CV-9] A General Framework for Group Sparsity in Hyperspectral Unmixing Using Endmember Bundles

【速读】:该论文旨在解决高光谱数据由于空间分辨率低而导致的混合材料贡献问题,即高光谱解混(Hyperspectral Unmixing, HU)问题。传统线性混合模型假设每种材料仅由单一光谱特征(endmember)表示,但材料类别间的变异性使得这一假设难以准确描述实际场景。解决方案的关键在于使用端元束(endmember bundles)来表示每种材料,即每个材料对应一组光谱特征,并通过基于束的框架引入组间稀疏性或组内及跨组稀疏性(SWAG),以更精确地刻画各材料在像素中的相对比例。此外,该框架还支持多种促进稀疏性的惩罚项,其中变换的ℓ₁(TL1)惩罚是HU领域的一种新正则化方法。

链接: https://arxiv.org/abs/2505.14634
作者: Gokul Bhusal,Yifei Lou,Cristina Garcia-Cardona,Ekaterina Merkurjev
机构: Michigan State University (密歇根州立大学); The University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); Los Alamos National Laboratory (洛斯阿拉莫斯国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Due to low spatial resolution, hyperspectral data often consist of mixtures of contributions from multiple materials. This limitation motivates the task of hyperspectral unmixing (HU), a fundamental problem in hyperspectral imaging. HU aims to identify the spectral signatures (endmembers) of the materials present in an observed scene, along with their relative proportions (fractional abundances) in each pixel. A major challenge lies in the class variability of materials, which hinders accurate representation by a single spectral signature, as assumed in the conventional linear mixing model. To address this issue, we propose using group sparsity after representing each material with a set of spectral signatures, known as endmember bundles, where each group corresponds to a specific material. In particular, we develop a bundle-based framework that can enforce either inter-group sparsity or sparsity within and across groups (SWAG) on the abundance coefficients. Furthermore, our framework offers the flexibility to incorporate a variety of sparsity-promoting penalties, among which the transformed ℓ1 (TL1) penalty is a novel regularization in the HU literature. Extensive experiments conducted on both synthetic and real hyperspectral data demonstrate the effectiveness and superiority of the proposed approaches.
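
For concreteness, here is the shape of such an objective in our own notation (a hedged sketch, not verbatim from the paper), together with the transformed ℓ1 penalty as it is usually defined in the sparse-optimization literature.

```latex
% Hedged sketch in our own notation. With endmember-bundle dictionary B,
% pixel matrix Y, abundance matrix A, and index groups G_g (one group
% per material), an inter-group-sparse unmixing objective takes the form
\min_{A \ge 0} \; \frac{1}{2}\,\lVert Y - B A \rVert_F^2
  \;+\; \lambda \sum_{g} \rho\bigl(\lVert A_{G_g} \rVert_2\bigr),
% where \rho is a sparsity-promoting penalty. The transformed \ell_1
% (TL1) penalty with parameter a > 0 is commonly defined as
\rho_a(t) \;=\; \frac{(a+1)\,\lvert t \rvert}{a + \lvert t \rvert}.
```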
zh

[CV-10] 3D Reconstruction from Sketches DATE

【速读】:该论文试图解决从多张草图中重建三维场景的问题(3D scene reconstruction from multiple sketches)。其解决方案的关键在于构建一个包含图像-草图对的数据集,并利用该数据集训练CycleGAN将拼接后的草图转换为逼真图像,随后通过MegaDepth模型估计图像的深度图,从而实现三维重建。

链接: https://arxiv.org/abs/2505.14621
作者: Abhimanyu Talwar,Julien Laasri
机构: Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages, 8 figures, paper dated December 12, 2018

点击查看摘要

Abstract:We consider the problem of reconstructing a 3D scene from multiple sketches. We propose a pipeline which involves (1) stitching together multiple sketches through use of correspondence points, (2) converting the stitched sketch into a realistic image using a CycleGAN, and (3) estimating that image’s depth-map using a pre-trained convolutional neural network based architecture called MegaDepth. Our contribution includes constructing a dataset of image-sketch pairs, the images for which are from the Zurich Building Database, and sketches have been generated by us. We use this dataset to train a CycleGAN for our pipeline’s second step. We end up with a stitching process that does not generalize well to real drawings, but the rest of the pipeline that creates a 3D reconstruction from a single sketch performs quite well on a wide variety of drawings.
zh

[CV-11] Instance Segmentation for Point Sets DATE

[Quick Read]: This paper tackles the memory footprint problem in point-cloud instance segmentation caused by memory-intensive similarity matrices, whose memory consumption grows quadratically with the number of points. The key to the solution is two sampling-based methods that perform instance segmentation on a sub-sampled point set and then extrapolate the labels to the full set with a nearest-neighbour approach, effectively reducing computation and memory overhead.

Link: https://arxiv.org/abs/2505.14583
Authors: Abhimanyu Talwar, Julien Laasri
Affiliations: Harvard John A. Paulson School of Engineering and Applied Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 11 figures, paper dated 2019

Abstract:Recently proposed neural network architectures like PointNet [QSMG16] and PointNet++ [QYSG17] have made it possible to apply Deep Learning to 3D point sets. The feature representations of shapes learned by these two networks enabled training classifiers for Semantic Segmentation, and more recently for Instance Segmentation via the Similarity Group Proposal Network (SGPN) [WYHN17]. One area of improvement highlighted by SGPN’s authors pertains to its use of memory-intensive similarity matrices, which occupy memory quadratic in the number of points. In this report, we attempt to tackle this issue through the use of two sampling-based methods, which compute Instance Segmentation on a sub-sampled Point Set and then extrapolate labels to the complete set using the nearest-neighbour approach. While both approaches perform equally well on large sub-samples, the random-based strategy gives the most improvements in terms of speed and memory usage.
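
The label-extrapolation step is straightforward to implement. Below is a minimal sketch using scikit-learn's `NearestNeighbors`; the paper describes plain nearest-neighbour transfer (k=1), and the k>1 majority-vote generalization and the function name are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def extrapolate_instance_labels(sub_points, sub_labels, full_points, k=1):
    """Propagate instance labels from a sub-sampled point set to the full set.

    k=1 is the plain nearest-neighbour transfer described in the abstract;
    k > 1 generalizes it to a majority vote over the k nearest sub-sampled
    points, which can smooth label noise at instance boundaries.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(sub_points)
    _, idx = nn.kneighbors(full_points)        # (N_full, k) neighbour indices
    neighbour_labels = sub_labels[idx]         # (N_full, k) candidate labels
    return np.array([np.bincount(row).argmax() for row in neighbour_labels])

# Usage sketch: run the instance segmenter (e.g., SGPN) on a random subsample,
# then lift its labels back to the full-resolution cloud:
# sub_idx = np.random.choice(len(points), size=4096, replace=False)
# full_labels = extrapolate_instance_labels(points[sub_idx], sub_labels, points)
```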

[CV-12] Dynadiff: Single-stage Decoding of Images from Continuously Evolving fMRI

[Quick Read]: This paper addresses the problem that conventional brain-to-image decoding methods rely on complicated multi-stage pipelines and preprocessing steps that collapse the temporal dimension of brain recordings, thereby limiting time-resolved brain decoding. The key to the solution is Dynadiff (Dynamic Neural Activity Diffusion for Image Reconstruction), a single-stage diffusion model that reconstructs images directly from dynamically evolving fMRI recordings. It simplifies training, outperforms existing methods on time-resolved fMRI signals, especially on high-level semantic image reconstruction metrics, and enables a precise characterization of how image representations evolve in brain activity.

Link: https://arxiv.org/abs/2505.14556
Authors: Marlène Careil, Yohann Benchetrit, Jean-Rémi King
Affiliations: Meta
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Brain-to-image decoding has been recently propelled by the progress in generative AI models and the availability of large ultra-high field functional Magnetic Resonance Imaging (fMRI). However, current approaches depend on complicated multi-stage pipelines and preprocessing steps that typically collapse the temporal dimension of brain recordings, thereby limiting time-resolved brain decoders. Here, we introduce Dynadiff (Dynamic Neural Activity Diffusion for Image Reconstruction), a new single-stage diffusion model designed for reconstructing images from dynamically evolving fMRI recordings. Our approach offers three main contributions. First, Dynadiff simplifies training as compared to existing approaches. Second, our model outperforms state-of-the-art models on time-resolved fMRI signals, especially on high-level semantic image reconstruction metrics, while remaining competitive on preprocessed fMRI data that collapse time. Third, this approach allows a precise characterization of the evolution of image representations in brain activity. Overall, this work lays the foundation for time-resolved brain-to-image decoding.

[CV-13] Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image

[Quick Read]: This paper aims to mitigate the viewpoint bias that arises when personalizing a 3D scene from a single reference image, which causes generated 3D scenes to lack multi-view consistency and referential consistency with the input image. The key to the solution is Consistent Personalization for 3D Gaussian Splatting (CP-GS), a framework that progressively propagates the single-view reference appearance to novel viewpoints. It combines a pre-trained image-to-3D generation model with iterative LoRA fine-tuning to extract and extend the reference information, and finally produces high-quality multi-view guidance images and personalized 3D Gaussian Splatting (3DGS) outputs through a view-consistent generation process guided by geometric cues.

Link: https://arxiv.org/abs/2505.14537
Authors: Yuxuan Wang, Xuanyu Yi, Qingshan Xu, Yuan Zhou, Long Chen, Hanwang Zhang
Affiliations: Nanyang Technological University; Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages

Abstract:Personalizing 3D scenes from a single reference image enables intuitive user-guided editing, which requires achieving both multi-view consistency across perspectives and referential consistency with the input image. However, these goals are particularly challenging due to the viewpoint bias caused by the limited perspective provided in a single image. Lacking the mechanisms to effectively expand reference information beyond the original view, existing methods of image-conditioned 3DGS personalization often suffer from this viewpoint bias and struggle to produce consistent results. Therefore, in this paper, we present Consistent Personalization for 3D Gaussian Splatting (CP-GS), a framework that progressively propagates the single-view reference appearance to novel perspectives. In particular, CP-GS integrates pre-trained image-to-3D generation and iterative LoRA fine-tuning to extract and extend the reference appearance, and finally produces faithful multi-view guidance images and the personalized 3DGS outputs through a view-consistent generation process guided by geometric cues. Extensive experiments on real-world scenes show that our CP-GS effectively mitigates the viewpoint bias, achieving high-quality personalization that significantly outperforms existing methods. The code will be released at this https URL.

[CV-14] diffDemorph: Extending Reference-Free Demorphing to Unseen Faces

[Quick Read]: This paper addresses reference-free (RF) demorphing of face morphs, i.e., recovering the constituent identity images given only the morph image. Previous methods are constrained by assumptions about the morphing technique, face style, and the distribution of images used to create the morphs in training and testing. The key to the solution is a novel diffusion-based method that disentangles the components of a composite morph image with high visual fidelity without relying on reference images; it generalizes across morphing techniques and face styles and beats the current state of the art by at least 59.46% on all tested datasets.

Link: https://arxiv.org/abs/2505.14527
Authors: Nitish Shukla, Arun Ross
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A face morph is created by combining two (or more) face images corresponding to two (or more) identities to produce a composite that successfully matches the constituent identities. Reference-free (RF) demorphing reverses this process using only the morph image, without the need for additional reference images. Previous RF demorphing methods were overly constrained, as they rely on assumptions about the distributions of training and testing morphs such as the morphing technique used, face style, and images used to create the morph. In this paper, we introduce a novel diffusion-based approach that effectively disentangles component images from a composite morph image with high visual fidelity. Our method is the first to generalize across morph techniques and face styles, beating the current state of the art by ≥ 59.46% under a common training protocol across all datasets tested. We train our method on morphs created using synthetically generated face images and test on real morphs, thereby enhancing the practicality of the technique. Experiments on six datasets and two face matchers establish the utility and efficacy of our method.

[CV-15] SparC: Sparse Representation and Construction for High-Resolution 3D Shapes Modeling

[Quick Read]: This paper targets the challenges of high-fidelity 3D object synthesis, notably the unstructured nature of mesh data and the cubic complexity of dense voxel grids. Existing two-stage pipelines that compress meshes with a VAE often suffer severe detail loss due to inefficient representations and modality mismatch. The key to the solution is the SparC framework, which combines a sparse deformable marching cubes representation (SparseCubes) with a novel encoder, SparConv-VAE. SparseCubes scatters signed distance and deformation fields onto sparse cubes to produce high-resolution (1024³) surfaces that support differentiable optimization, while SparConv-VAE is the first modality-consistent variational autoencoder built entirely on sparse convolutional networks, enabling efficient, near-lossless 3D reconstruction suitable for high-resolution generative modeling via latent diffusion.

Link: https://arxiv.org/abs/2505.14521
Authors: Zhihao Li, Yufei Wang, Heliang Zheng, Yihao Luo, Bihan Wen
Affiliations: Nanyang Technological University; Sensory Universe; Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Homepage: this https URL

Abstract:High-fidelity 3D object synthesis remains significantly more challenging than 2D image generation due to the unstructured nature of mesh data and the cubic complexity of dense volumetric grids. Existing two-stage pipelines-compressing meshes with a VAE (using either 2D or 3D supervision), followed by latent diffusion sampling-often suffer from severe detail loss caused by inefficient representations and modality mismatches introduced in VAE. We introduce SparC, a unified framework that combines a sparse deformable marching cubes representation SparseCubes with a novel encoder SparConv-VAE. SparseCubes converts raw meshes into high-resolution (1024³) surfaces with arbitrary topology by scattering signed distance and deformation fields onto a sparse cube, allowing differentiable optimization. SparConv-VAE is the first modality-consistent variational autoencoder built entirely upon sparse convolutional networks, enabling efficient and near-lossless 3D reconstruction suitable for high-resolution generative modeling through latent diffusion. SparC achieves state-of-the-art reconstruction fidelity on challenging inputs, including open surfaces, disconnected components, and intricate geometry. It preserves fine-grained shape details, reduces training and inference cost, and integrates naturally with latent diffusion models for scalable, high-resolution 3D generation.

[CV-16] ReservoirTTA: Prolonged Test-time Adaptation for Evolving and Recurring Domains

[Quick Read]: This paper aims to stabilize test-time adaptation (TTA) when the test domain continuously shifts over time, including recurring or gradually evolving domains. The key to the solution is the ReservoirTTA framework, which maintains a reservoir of domain-specialized models, an adaptive test-time model ensemble, that detects new domains and routes each sample to the corresponding specialized model, enabling domain-specific adaptation. This multi-model strategy overcomes key limitations of single-model adaptation such as catastrophic forgetting, inter-domain interference, and error accumulation.

Link: https://arxiv.org/abs/2505.14511
Authors: Guillaume Vray, Devavrat Tomar, Xufeng Gao, Jean-Philippe Thiran, Evan Shelhamer, Behzad Bozorgtabar
Affiliations: EPFL; CHUV; UBC; Vector Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper introduces ReservoirTTA, a novel plug-in framework designed for prolonged test-time adaptation (TTA) in scenarios where the test domain continuously shifts over time, including cases where domains recur or evolve gradually. At its core, ReservoirTTA maintains a reservoir of domain-specialized models – an adaptive test-time model ensemble – that both detects new domains via online clustering over style features of incoming samples and routes each sample to the appropriate specialized model, and thereby enables domain-specific adaptation. This multi-model strategy overcomes key limitations of single model adaptation, such as catastrophic forgetting, inter-domain interference, and error accumulation, ensuring robust and stable performance on sustained non-stationary test distributions. Our theoretical analysis reveals key components that bound parameter variance and prevent model collapse, while our plug-in TTA module mitigates catastrophic forgetting of previously encountered domains. Extensive experiments on the classification corruption benchmarks, including ImageNet-C and CIFAR-10/100-C, as well as the Cityscapes → ACDC semantic segmentation task, covering recurring and continuously evolving domain shifts, demonstrate that ReservoirTTA significantly improves adaptation accuracy and maintains stable performance across prolonged, recurring shifts, outperforming state-of-the-art methods.
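
The routing idea (online clustering over style features, one specialized model per cluster) can be sketched compactly. The threshold/EMA mechanics below, the `StyleRouter` name, and the `tau`, `momentum`, and `max_models` parameters are all assumptions for illustration; the paper's actual clustering and model-cap rules may differ.

```python
import numpy as np

class StyleRouter:
    """Minimal sketch of ReservoirTTA-style routing (assumed mechanics).

    Maintains one centroid per discovered domain over style features and
    routes each incoming sample to the closest specialized model, spawning
    a new model when no centroid is close enough.
    """

    def __init__(self, make_model, tau=0.5, momentum=0.99, max_models=8):
        self.make_model = make_model  # factory: returns a fresh adaptable model
        self.centroids, self.models = [], []
        self.tau, self.momentum, self.max_models = tau, momentum, max_models

    def route(self, style_feat):
        """style_feat: (d,) style descriptor, e.g. channel-wise feature stats."""
        if self.centroids:
            dists = [np.linalg.norm(style_feat - c) for c in self.centroids]
            i = int(np.argmin(dists))
            if dists[i] < self.tau or len(self.models) >= self.max_models:
                # Known domain: update its centroid online and reuse its model.
                self.centroids[i] = (self.momentum * self.centroids[i]
                                     + (1 - self.momentum) * style_feat)
                return self.models[i]
        # New domain detected: spawn a specialized model.
        self.centroids.append(style_feat.copy())
        self.models.append(self.make_model())
        return self.models[-1]
```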

[CV-17] Enhancing Interpretability of Sparse Latent Representations with Class Information

[Quick Read]: This paper addresses the lack of structured interpretability in the latent spaces produced by variational autoencoders (VAEs) in high-dimensional settings. Although variational sparse coding (VSC) introduces a spike-and-slab prior to obtain a sparse latent representation for each input, it does not yield consistent patterns of active latent dimensions across samples of the same class, and thus fails to capture high-level concepts shared within a class. The key to the solution is a new loss function that enforces consistency of the active latent dimensions across samples within the same class, producing a more structured and interpretable latent space that captures both global and class-specific factors.

Link: https://arxiv.org/abs/2505.14476
Authors: Farshad Sangari Abiz, Reshad Hosseini, Babak N. Araabi
Affiliations: University of Tehran
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Variational Autoencoders (VAEs) are powerful generative models for learning latent representations. Standard VAEs generate dispersed and unstructured latent spaces by utilizing all dimensions, which limits their interpretability, especially in high-dimensional spaces. To address this challenge, Variational Sparse Coding (VSC) introduces a spike-and-slab prior distribution, resulting in sparse latent representations for each input. These sparse representations, characterized by a limited number of active dimensions, are inherently more interpretable. Despite this advantage, VSC falls short in providing structured interpretations across samples within the same class. Intuitively, samples from the same class are expected to share similar attributes while allowing for variations in those attributes. This expectation should manifest as consistent patterns of active dimensions in their latent representations, but VSC does not enforce such consistency. In this paper, we propose a novel approach to enhance the latent space interpretability by ensuring that the active dimensions in the latent space are consistent across samples within the same class. To achieve this, we introduce a new loss function that encourages samples from the same class to share similar active dimensions. This alignment creates a more structured and interpretable latent space, where each shared dimension corresponds to a high-level concept, or “factor.” Unlike existing disentanglement-based methods that primarily focus on global factors shared across all classes, our method captures both global and class-specific factors, thereby enhancing the utility and interpretability of latent representations.
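
One natural way to encourage same-class samples to share active dimensions is to penalize how far each sample's activation pattern strays from its class mean. The PyTorch sketch below is an assumed form of such a consistency term, not the paper's exact loss; `gate_probs` is assumed to be the per-dimension spike (activation) probability from the spike-and-slab posterior.

```python
import torch

def active_dim_consistency_loss(gate_probs, labels, num_classes):
    """Encourage samples of the same class to share active latent dimensions.

    gate_probs: (B, D) probabilities in [0, 1] that each latent dimension is
                active (the 'spike' part of a spike-and-slab posterior).
    labels:     (B,) integer class labels.
    Penalizes deviation of each sample's activation pattern from its
    class-mean pattern (illustrative form).
    """
    loss = gate_probs.new_tensor(0.0)
    for c in range(num_classes):
        mask = labels == c
        if mask.sum() < 2:
            continue  # need at least two samples to define a class pattern
        class_gates = gate_probs[mask]                    # (Nc, D)
        class_mean = class_gates.mean(dim=0, keepdim=True)
        loss = loss + ((class_gates - class_mean) ** 2).mean()
    return loss / num_classes
```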

[CV-18] VisualQuality-R1: Reasoning -Induced Image Quality Assessment via Reinforcement Learning to Rank

[Quick Read]: This paper targets the challenge that image quality assessment (IQA) depends critically on visual reasoning, and that conventional no-reference IQA (NR-IQA) models are limited in reasoning and generalization. The key to the solution is the VisualQuality-R1 model, trained with reinforcement learning to rank: group relative policy optimization generates multiple quality scores per image, the probability that one image has higher quality than another is computed under the Thurstone model, and continuous fidelity measures, rather than discretized binary labels, serve as the reward. This substantially improves performance and adaptability across a wide range of image processing tasks.

Link: https://arxiv.org/abs/2505.14460
Authors: Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, Kede Ma
Affiliations: City University of Hong Kong; OPPO Research Institute; The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing reasoning and generalization capabilities of large language models (LLMs) through reinforcement learning. Nevertheless, the potential of reasoning-induced computational modeling has not been thoroughly explored in the context of image quality assessment (IQA), a task critically dependent on visual reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. Specifically, for a pair of images, we employ group relative policy optimization to generate multiple quality scores for each image. These estimates are then used to compute comparative probabilities of one image having higher quality than the other under the Thurstone model. Rewards for each quality estimate are defined using continuous fidelity measures rather than discretized binary labels. Extensive experiments show that the proposed VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models as well as a recent reasoning-induced quality regression method. Moreover, VisualQuality-R1 is capable of generating contextually rich, human-aligned quality descriptions, and supports multi-dataset training without requiring perceptual scale realignment. These features make VisualQuality-R1 especially well-suited for reliably measuring progress in a wide range of image processing tasks like super-resolution and image generation.
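
The Thurstone comparison and a fidelity-style reward are both small, well-known formulas. Below is a sketch assuming the Thurstone Case V form P(A &gt; B) = Φ((μ_A − μ_B)/√(σ_A² + σ_B²)), where μ and σ² could be the mean and variance of the scores sampled by the policy, and a fidelity measure of the form used in learning-to-rank; the exact reward definition in the paper may differ.

```python
import torch

def thurstone_prob(mu_a, var_a, mu_b, var_b):
    """P(image A is perceived better than B) under the Thurstone model."""
    z = (mu_a - mu_b) / torch.sqrt(var_a + var_b + 1e-8)
    return 0.5 * (1.0 + torch.erf(z / 2 ** 0.5))  # standard normal CDF

def fidelity_reward(p_pred, p_true):
    """Continuous fidelity between predicted and ground-truth comparison
    probabilities (assumed form): 1 at perfect agreement, smooth elsewhere."""
    return torch.sqrt(p_pred * p_true) + torch.sqrt((1 - p_pred) * (1 - p_true))

# Example: quality estimates for two images sampled from the policy.
mu_a, mu_b = torch.tensor(3.8), torch.tensor(3.1)
p = thurstone_prob(mu_a, torch.tensor(0.25), mu_b, torch.tensor(0.25))
print(fidelity_reward(p, p_true=torch.tensor(1.0)))  # A is truly better than B
```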

[CV-19] Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

[Quick Read]: This paper addresses the efficiency problem that video large language models (VideoLLMs) face in video understanding, namely the quadratic complexity induced by abundant visual tokens. The key to the solution is "Video Compression Commander" (VidCom2), a plug-and-play inference acceleration framework that quantifies each frame's uniqueness and adaptively adjusts the compression intensity per frame, preserving essential information while reducing redundancy in the video sequence and balancing performance with efficiency.

Link: https://arxiv.org/abs/2505.14454
Authors: Xuyang Liu, Yiyu Wang, Junpeng Ma, Linfeng Zhang
Affiliations: Shanghai Jiao Tong University; Sichuan University; Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our code is available at this https URL

Abstract:Video large language models (VideoLLM) excel at video understanding, but face efficiency challenges due to the quadratic complexity of abundant visual tokens. Our systematic analysis of token compression methods for VideoLLMs reveals two critical issues: (i) overlooking distinctive visual signals across frames, leading to information loss; (ii) suffering from implementation constraints, causing incompatibility with modern architectures or efficient operators. To address these challenges, we distill three design principles for VideoLLM token compression and propose a plug-and-play inference acceleration framework “Video Compression Commander” (VidCom2). By quantifying each frame’s uniqueness, VidCom2 adaptively adjusts compression intensity across frames, effectively preserving essential information while reducing redundancy in video sequences. Extensive experiments across various VideoLLMs and benchmarks demonstrate the superior performance and efficiency of our VidCom2. With only 25% visual tokens, VidCom2 achieves 99.6% of the original performance on LLaVA-OV while reducing 70.8% of the LLM generation latency. Notably, our Frame Compression Adjustment strategy is compatible with other token compression methods to further improve their performance. Our code is available at this https URL.
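
The "uniqueness drives budget" idea admits a compact sketch. The particular uniqueness measure below (one minus the maximum cosine similarity to any other frame) and the proportional budget allocation are my assumptions, not VidCom2's published formulas; the function name is hypothetical.

```python
import torch

def allocate_token_budget(frame_feats, total_keep):
    """Adaptive per-frame visual-token budgets from frame uniqueness (assumed form).

    frame_feats: (T, d) one pooled feature per frame.
    A frame that is highly similar to some other frame is redundant; its
    uniqueness (1 - max cosine similarity to any other frame) earns it fewer
    tokens. Returns an integer number of visual tokens to keep per frame.
    """
    f = torch.nn.functional.normalize(frame_feats, dim=-1)
    sim = f @ f.t()                              # (T, T) cosine similarities
    sim.fill_diagonal_(-1.0)                     # ignore self-similarity
    uniqueness = 1.0 - sim.max(dim=1).values     # (T,) higher = more distinctive
    weights = uniqueness / uniqueness.sum().clamp_min(1e-8)
    budget = (weights * total_keep).round().long().clamp_min(1)
    return budget
```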

[CV-20] Diving into the Fusion of Monocular Priors for Generalized Stereo Matching ATC

[Quick Read]: This paper aims to address the difficulty stereo matching has in ill-posed regions such as occlusions and non-Lambertian surfaces, and in particular the limited generalization encountered when fusing monocular priors. The key to the solution is a careful analysis of three problems: the misalignment between affine-invariant relative monocular depth and the absolute depth of disparity; local optima caused by over-confident disparity updates in iterative structures; and noisy disparities in early iterations misleading the fusion. The authors propose a binary local ordering map to guide the fusion, converting the depth map into a binary relative format that unifies relative and absolute depth representations, and use it to re-weight the initial disparity update, resolving the local-optima and noise issues. Finally, the direct fusion of monocular depth into disparity is formulated as a registration problem solved by a pixel-wise linear regression module for global, adaptive alignment. The method exploits monocular priors effectively and efficiently to improve stereo matching.

Link: https://arxiv.org/abs/2505.14414
Authors: Chengtang Yao, Lidong Yu, Zhidan Liu, Jiaxi Zeng, Yuwei Wu, Yunde Jia
Affiliations: Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology; Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL

Abstract:The matching formulation makes it naturally hard for stereo matching to handle ill-posed regions like occlusions and non-Lambertian surfaces. Fusing monocular priors has been proven helpful for ill-posed matching, but the biased monocular prior learned from small stereo datasets constrains the generalization. Recently, stereo matching has progressed by leveraging the unbiased monocular prior from the vision foundation model (VFM) to improve the generalization in ill-posed regions. We dive into the fusion process and observe three main problems limiting the fusion of the VFM monocular prior. The first problem is the misalignment between affine-invariant relative monocular depth and absolute depth of disparity. Besides, when we use the monocular feature in an iterative update structure, the over-confidence in the disparity update leads to local optima results. A direct fusion of a monocular depth map could alleviate the local optima problem, but noisy disparity results computed at the first several iterations will misguide the fusion. In this paper, we propose a binary local ordering map to guide the fusion, which converts the depth map into a binary relative format, unifying the relative and absolute depth representation. The computed local ordering map is also used to re-weight the initial disparity update, resolving the local-optima and noise problems. In addition, we formulate the final direct fusion of monocular depth to the disparity as a registration problem, where a pixel-wise linear regression module can globally and adaptively align them. Our method fully exploits the monocular prior to support stereo matching results effectively and efficiently. Experiments show that we significantly improve performance when generalizing from SceneFlow to the Middlebury and Booster datasets while barely reducing efficiency.
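
Two of the paper's building blocks are easy to sketch. Below, the global scale/shift alignment is the standard closed-form analogue of the registration step (the paper's module is a learned pixel-wise linear regression), and the binary local ordering construction is my assumed reading of "binary relative format"; both function names are hypothetical.

```python
import numpy as np

def align_mono_to_disparity(mono_pred, disparity, conf_mask):
    """Least-squares scale/shift alignment of an affine-invariant monocular
    prediction to absolute disparity. Note: for a pinhole camera, disparity
    is proportional to inverse depth, so mono_pred should already be an
    inverse-depth-style prediction."""
    x = mono_pred[conf_mask].ravel()
    y = disparity[conf_mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)   # model: y = s*x + t
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * mono_pred + t

def binary_local_ordering(depth, radius=2):
    """Assumed construction of a binary local ordering map: for each pixel,
    record whether each neighbour within the radius is farther than it.
    Returns (H, W, K) booleans, invariant to affine depth rescaling."""
    maps = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(depth, dy, axis=0), dx, axis=1)
            maps.append(depth > shifted)
    return np.stack(maps, axis=-1)
```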

[CV-21] Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency

【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在视频理解任务中对时间分析能力的鲁棒性不足的问题,特别是在对抗性环境中模型过度依赖先验知识和文本上下文而忽视视频实际时间动态的现象。其解决方案的关键在于设计一种全景直接偏好优化方法(Panoramic Direct Preference Optimization, PanoDPO),该方法促使LMMs同时整合视觉和语言特征偏好,从而提升模型在时间分析中的鲁棒性和可靠性。

链接: https://arxiv.org/abs/2505.14405
作者: Jiafeng Liang,Shixin Jiang,Xuan Dong,Ning Wang,Zheng Chu,Hui Su,Jinlan Fu,Ming Liu,See-Kiong Ng,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Peng Cheng Laboratory (鹏城实验室); National University of Singapore (新加坡国立大学); Meituan Inc. (美团公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have recently demonstrated impressive performance on general video comprehension benchmarks. Nevertheless, for broader applications, the robustness of their temporal analysis capability needs to be thoroughly investigated yet predominantly ignored. Motivated by this, we propose a novel temporal robustness benchmark (TemRobBench), which introduces temporal inconsistency perturbations separately at the visual and textual modalities to assess the robustness of models. We evaluate 16 mainstream LMMs and find that they exhibit over-reliance on prior knowledge and textual context in adversarial environments, while ignoring the actual temporal dynamics in the video. To mitigate this issue, we design panoramic direct preference optimization (PanoDPO), which encourages LMMs to incorporate both visual and linguistic feature preferences simultaneously. Experimental results show that PanoDPO can effectively enhance the model’s robustness and reliability in temporal analysis.

[CV-22] ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLM s with Free-Style Intermediate State Representations

[Quick Read]: This paper addresses the problem that the intermediate visual states (IVS) provided by current benchmarks are relatively fixed and therefore cannot faithfully evaluate the intrinsic reasoning abilities of multimodal large language models (MLLMs) during continual reasoning. The key to the solution is ViC-Bench, a dedicated benchmark of four representative tasks, each with a free-style IVS generation pipeline supporting function calls, together with a progressive three-stage evaluation strategy with targeted new metrics and an Incremental Prompting Information Injection (IPII) strategy for ablatively exploring how IVS affects untamed reasoning performance.

Link: https://arxiv.org/abs/2505.14404
Authors: Xuecheng Wu, Jiaxing Liu, Danlei Huang, Xiaoyi Li, Yifan Wang, Chen Chen, Liya Ma, Xuezhi Cao, Junxiao Xue
Affiliations: Xi’an Jiaotong University; Meituan Inc; University of Science and Technology of China; Zhejiang Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual-Interleaved Chain-of-Thought (VI-CoT) enables MLLMs to continually update their understanding and decisions based on step-wise intermediate visual states (IVS), much like a human would, which demonstrates impressive success in various tasks, thereby leading to advancements in related benchmarks. Despite promising progress, current benchmarks provide models with relatively fixed IVS, rather than free-style IVS, which might forcibly distort the original thinking trajectories, failing to evaluate their intrinsic reasoning capabilities. More importantly, existing benchmarks neglect to systematically explore the impact factors that IVS would impart to untamed reasoning performance. To tackle the above gaps, we introduce a specialized benchmark termed ViC-Bench, consisting of four representative tasks: maze navigation, jigsaw puzzle, embodied long-horizon planning, and complex counting, where each task has a dedicated free-style IVS generation pipeline supporting function calls. To systematically examine VI-CoT capability, we propose a thorough evaluation suite incorporating a progressive three-stage strategy with targeted new metrics. Besides, we establish an Incremental Prompting Information Injection (IPII) strategy to ablatively explore the prompting factors for VI-CoT. We extensively conduct evaluations for 18 advanced MLLMs, revealing key insights into their VI-CoT capability. Our proposed benchmark is publicly open at Huggingface.

[CV-23] DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning

[Quick Read]: This paper tackles the challenge of seamlessly integrating visual and textual reasoning in large vision-language models (VLMs), in particular how to effectively incorporate advanced visual input processing into the reasoning mechanism, i.e., truly thinking with images. The key to the solution is an interleaved multimodal reasoning paradigm and the DeepEyes model, which acquires "thinking with images" abilities through end-to-end reinforcement learning without cold-start SFT. DeepEyes leverages the model's inherent grounding ability as a tool, rather than relying on separate specialized models, and introduces a tool-use-oriented data selection mechanism and a reward strategy that encourage successful tool-assisted reasoning trajectories.

Link: https://arxiv.org/abs/2505.14362
Authors: Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they are primarily constrained by text-based reasoning processes. However, achieving seamless integration of visual and textual reasoning which mirrors human cognitive processes remains a significant challenge. In particular, effectively incorporating advanced visual input processing into reasoning mechanisms is still an open question. Thus, in this paper, we explore the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with “thinking with images” capabilities incentivized through end-to-end reinforcement learning without the need for cold-start SFT. Notably, this ability emerges natively within the model itself, leveraging its inherent grounding ability as a tool instead of depending on separate specialized models. Specifically, we propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of tool-calling behavior from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at this https URL.

[CV-24] Vision-Language Modeling Meets Remote Sensing: Models Datasets and Perspectives

[Quick Read]: This paper aims to provide the remote sensing community with a timely and comprehensive review of vision-language modeling (VLM) under the two-stage paradigm. The core question is how to effectively fuse image and natural language information to improve remote sensing data analysis and enhance interaction between models and users. The key lies in pre-training on large-scale image-text pairs to absorb general knowledge and then fine-tuning on task-specific data, yielding strong performance and flexible adaptability across a variety of remote sensing tasks.

Link: https://arxiv.org/abs/2505.14361
Authors: Xingxing Weng, Chao Pang, Gui-Song Xia
Affiliations: Wuhan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Geoscience and Remote Sensing Magazine

Abstract:Vision-language modeling (VLM) aims to bridge the information gap between images and natural language. Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote sensing domain has made significant progress. The resulting models benefit from the absorption of extensive general knowledge and demonstrate strong performance across a variety of remote sensing data analysis tasks. Moreover, they are capable of interacting with users in a conversational manner. In this paper, we aim to provide the remote sensing community with a timely and comprehensive review of the developments in VLM using the two-stage paradigm. Specifically, we first cover a taxonomy of VLM in remote sensing: contrastive learning, visual instruction tuning, and text-conditioned image generation. For each category, we detail the commonly used network architecture and pre-training objectives. Second, we conduct a thorough review of existing works, examining foundation models and task-specific adaptation methods in contrastive-based VLM, architectural upgrades, training strategies and model capabilities in instruction-based VLM, as well as generative foundation models with their representative downstream applications. Third, we summarize datasets used for VLM pre-training, fine-tuning, and evaluation, with an analysis of their construction methodologies (including image sources and caption generation) and key properties, such as scale and task adaptability. Finally, we conclude this survey with insights and discussions on future research directions: cross-modal representation alignment, vague requirement comprehension, explanation-driven model reliability, continually scalable model capabilities, and large-scale datasets featuring richer modalities and greater challenges.

[CV-25] Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable

[Quick Read]: This paper addresses the overfitting caused by biased training datasets in existing detectors, where models learn spurious correlations with non-causal image attributes that significantly degrade performance on unbiased datasets. The key to the solution is Dual Data Alignment (DDA), which aligns both the pixel and frequency domains to eliminate frequency-level misalignment in reconstructed images and thereby reduce the persistence of spurious correlations.

Link: https://arxiv.org/abs/2505.14359
Authors: Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, Shouhong Ding
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 Pages, 9 figures

Abstract:Existing detectors are often trained on biased datasets, leading to the possibility of overfitting on non-causal image attributes that are spuriously correlated with real/synthetic labels. While these biased features enhance performance on the training data, they result in substantial performance degradation when applied to unbiased datasets. One common solution is to perform dataset alignment through generative reconstruction, matching the semantic content between real and synthetic images. However, we revisit this approach and show that pixel-level alignment alone is insufficient. The reconstructed images still suffer from frequency-level misalignment, which can perpetuate spurious correlations. To illustrate, we observe that reconstruction models tend to restore the high-frequency details lost in real images (possibly due to JPEG compression), inadvertently creating a frequency-level misalignment, where synthetic images appear to have richer high-frequency content than real ones. This misalignment leads to models associating high-frequency features with synthetic labels, further reinforcing biased cues. To resolve this, we propose Dual Data Alignment (DDA), which aligns both the pixel and frequency domains. Moreover, we introduce two new test sets: DDA-COCO, containing DDA-aligned synthetic images for testing detector performance on the most aligned dataset, and EvalGEN, featuring the latest generative models for assessing detectors under new generative architectures such as visual auto-regressive generators. Finally, our extensive evaluations demonstrate that a detector trained exclusively on DDA-aligned MSCOCO improves across 8 diverse benchmarks by a non-trivial margin, showing a +7.2% gain on in-the-wild benchmarks and highlighting the improved generalizability of unbiased detectors.
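
Frequency-level alignment can be illustrated with a classic spectrum-swap: give the reconstructed image the amplitude spectrum of its paired real image while keeping its own phase. This is an illustrative single-image variant of the idea, not the published DDA procedure, and the function name is hypothetical.

```python
import numpy as np

def match_amplitude_spectrum(recon, real):
    """Frequency-alignment sketch: combine the real image's amplitude spectrum
    with the reconstruction's phase, removing the 'synthetic images have
    richer high-frequency content' cue described in the paper.
    Both inputs: (H, W) grayscale arrays in [0, 1].
    """
    F_recon = np.fft.fft2(recon)
    F_real = np.fft.fft2(real)
    phase = np.angle(F_recon)
    aligned = np.abs(F_real) * np.exp(1j * phase)  # real magnitude, recon phase
    return np.real(np.fft.ifft2(aligned)).clip(0.0, 1.0)
```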

[CV-26] Vid2World: Crafting Video Diffusion Models to Interactive World Models

[Quick Read]: This paper aims to overcome the limitations of existing world models in data efficiency and prediction fidelity, especially the lack of high-fidelity, fine-grained predictions in complex environments. The key to the solution is transferring pre-trained video diffusion models into interactive world models: the architecture and training objective are reshaped to enable autoregressive generation, and a causal action guidance mechanism is introduced to improve action controllability, enabling effective modeling of and interaction with complex environments.

Link: https://arxiv.org/abs/2505.14357
Authors: Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this http URL

Abstract:World models, which predict transitions based on history observation and action sequences, have shown great promise in improving data efficiency for sequential decision making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their applicability in complex environments. In contrast, video diffusion models trained on large, internet-scale datasets have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World performs causalization of a pre-trained video diffusion model by crafting its architecture and training objective to enable autoregressive generation. Furthermore, it introduces a causal action guidance mechanism to enhance action controllability in the resulting interactive world model. Extensive experiments in robot manipulation and game simulation domains show that our method offers a scalable and effective approach for repurposing highly capable video diffusion models to interactive world models.

[CV-27] Egocentric Action-aware Inertial Localization in Point Clouds

[Quick Read]: This paper addresses trajectory drift in human inertial localization caused by IMU sensor noise, a problem further complicated by the diversity of human actions. The key to the solution is the Egocentric Action-aware Inertial Localization (EAIL) framework, which learns, via hierarchical multi-modal alignment, the correlations between egocentric action cues extracted from head-mounted IMU signals and local environmental features in a 3D point cloud, using these actions as spatial anchors to compensate for localization drift.

Link: https://arxiv.org/abs/2505.14346
Authors: Mingfang Zhang, Ryo Yonetani, Yifei Huang, Liangyang Ouyang, Ruicong Liu, Yoichi Sato
Affiliations: The University of Tokyo; CyberAgent AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents a novel inertial localization framework named Egocentric Action-aware Inertial Localization (EAIL), which leverages egocentric action cues from head-mounted IMU signals to localize the target individual within a 3D point cloud. Human inertial localization is challenging due to IMU sensor noise that causes trajectory drift over time. The diversity of human actions further complicates IMU signal processing by introducing various motion patterns. Nevertheless, we observe that some actions observed through the head-mounted IMU correlate with spatial environmental structures (e.g., bending down to look inside an oven, washing dishes next to a sink), thereby serving as spatial anchors to compensate for the localization drift. The proposed EAIL framework learns such correlations via hierarchical multi-modal alignment. By assuming that the 3D point cloud of the environment is available, it contrastively learns modality encoders that align short-term egocentric action cues in IMU signals with local environmental features in the point cloud. These encoders are then used in reasoning the IMU data and the point cloud over time and space to perform inertial localization. Interestingly, these encoders can further be utilized to recognize the corresponding sequence of actions as a by-product. Extensive experiments demonstrate the effectiveness of the proposed framework over state-of-the-art inertial localization and inertial action recognition baselines.
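
The contrastive alignment between the two modality encoders can be sketched with a standard symmetric InfoNCE objective over paired (IMU action cue, local point-cloud patch) embeddings. This is a conventional choice for "contrastively learns modality encoders", not the paper's stated loss; the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def infonce_imu_pointcloud(imu_emb, pc_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    imu_emb, pc_emb: (B, d) outputs of the IMU and point-cloud encoders;
    row i of each tensor comes from the same moment/place (positive pair),
    all other rows in the batch serve as negatives.
    """
    imu = F.normalize(imu_emb, dim=-1)
    pc = F.normalize(pc_emb, dim=-1)
    logits = imu @ pc.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(imu.size(0), device=imu.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```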

[CV-28] Replace in Translation: Boost Concept Alignment in Counterfactual Text-to-Image

[Quick Read]: This paper focuses on the concept alignment problem in counterfactual text-to-image (T2I) generation, i.e., generated images failing to include all objects required by the prompt. The key to the solution is to use controllable T2I models to replace the objects in a synthesized image step by step in latent space, converting the image from a common scene to a counterfactual one that meets the prompt. This replacement process is guided by a strategy called Explicit Logical Narrative Prompt (ELNP), whose instructions are generated by the recent SoTA language model DeepSeek, thereby improving concept alignment in counterfactual T2I.

Link: https://arxiv.org/abs/2505.14341
Authors: Sifan Li, Ming Tao, Hao Zhao, Ling Shao, Hao Tang
Affiliations: Liaoning University; Nanjing University of Posts and Telecommunications; Tsinghua University; University of Chinese Academy of Sciences; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-to-Image (T2I) has been prevalent in recent years, with most common condition tasks having been optimized nicely. Counterfactual Text-to-Image, however, still stands between us and a more versatile AIGC experience. For those scenes that are impossible to happen in the real world and anti-physics, we should spare no effort in increasing the factual feel, which means synthesizing images that people think are very likely to be happening, and concept alignment, which means all the required objects should be in the same frame. In this paper, we focus on concept alignment. As controllable T2I models have achieved satisfactory performance for real applications, we utilize this technology to replace the objects in a synthesized image in latent space step-by-step to change the image from a common scene to a counterfactual scene to meet the prompt. We propose a strategy to instruct this replacing process, called Explicit Logical Narrative Prompt (ELNP), using the newly SoTA language model DeepSeek to generate the instructions. Furthermore, to evaluate models’ performance in counterfactual T2I, we design a metric to calculate how many of the concepts required by the prompt are covered, on average, in the synthesized images. The extensive experiments and qualitative comparisons demonstrate that our strategy can boost the concept alignment in counterfactual T2I.

[CV-29] Plane Geometry Problem Solving with Multi-modal Reasoning : A Survey

【速读】:该论文旨在解决平面几何问题求解(Plane Geometry Problem Solving, PGPS)领域缺乏系统性综述的问题,以全面总结和分析当前PGPS的研究进展。其解决方案的关键在于提出一种基于编码器-解码器框架的分类方法,对PGPS方法进行系统归纳,并进一步根据架构设计对编码器和解码器进行分类与分析,同时指出当前研究中的主要挑战,如编码阶段的幻觉问题以及基准测试中的数据泄露问题。

链接: https://arxiv.org/abs/2505.14340
作者: Seunghyuk Cho,Zhenyue Qin,Yang Liu,Youngbin Choi,Seungbeom Lee,Dongwoo Kim
机构: POSTECH(浦项科技大学); Australian National University(澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 18 pages

点击查看摘要

Abstract:Plane geometry problem solving (PGPS) has recently gained significant attention as a benchmark to assess the multi-modal reasoning capabilities of large vision-language models. Despite the growing interest in PGPS, the research community still lacks a comprehensive overview that systematically synthesizes recent work in PGPS. To fill this gap, we present a survey of existing PGPS studies. We first categorize PGPS methods into an encoder-decoder framework and summarize the corresponding output formats used by their encoders and decoders. Subsequently, we classify and analyze these encoders and decoders according to their architectural designs. Finally, we outline major challenges and promising directions for future research. In particular, we discuss the hallucination issues arising during the encoding phase within encoder-decoder architectures, as well as the problem of data leakage in current PGPS benchmarks.

[CV-30] Domain Adaptation for Multi-label Image Classification: a Discriminator-free Approach

[Quick Read]: This paper addresses domain shift for unsupervised domain adaptation (UDA) in multi-label image classification (MLIC). Conventional approaches introduce an additional discriminator subnet, but decoupling classification from discrimination can harm their task-specific discriminative power. The key to the solution is an adversarial critic derived directly from the task-specific classifier: source and target predictions are modeled with a two-component Gaussian mixture model (GMM) whose parameters are estimated by a deep neural network rather than the traditional iterative Expectation Maximization (EM) algorithm, yielding an end-to-end differentiable and computationally cheap adversarial loss formulated with the Fréchet distance.

Link: https://arxiv.org/abs/2505.14333
Authors: Inder Pal Singh, Enjie Ghorbel, Anis Kacem, Djamila Aouada
Affiliations: Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Luxembourg; Cristal Laboratory, National School of Computer Sciences, University of Manouba, Tunisia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper is under consideration at Computer Vision and Image Understanding. arXiv admin note: text overlap with arXiv:2301.10611

Abstract:This paper introduces a discriminator-free adversarial-based approach termed DDA-MLIC for Unsupervised Domain Adaptation (UDA) in the context of Multi-Label Image Classification (MLIC). While recent efforts have explored adversarial-based UDA methods for MLIC, they typically include an additional discriminator subnet. Nevertheless, decoupling the classification and the discrimination tasks may harm their task-specific discriminative power. Herein, we address this challenge by presenting a novel adversarial critic directly derived from the task-specific classifier. Specifically, we employ a two-component Gaussian Mixture Model (GMM) to model both source and target predictions, distinguishing between two distinct clusters. Instead of using the traditional Expectation Maximization (EM) algorithm, our approach utilizes a Deep Neural Network (DNN) to estimate the parameters of each GMM component. Subsequently, the source and target GMM parameters are leveraged to formulate an adversarial loss using the Fréchet distance. The proposed framework is therefore not only fully differentiable but is also cost-effective as it avoids the expensive iterative process usually induced by the standard EM method. The proposed method is evaluated on several multi-label image datasets covering three different types of domain shift. The obtained results demonstrate that DDA-MLIC outperforms existing state-of-the-art methods in terms of precision while requiring a lower number of parameters. The code is made publicly available at this http URL.
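
The Fréchet (2-Wasserstein) distance between two one-dimensional Gaussians has the closed form d² = (μ₁ − μ₂)² + (σ₁ − σ₂)², which is what makes this critic differentiable and cheap. The sketch below assumes 1-D GMM components over scalar prediction scores and a fixed component matching; the paper's exact formulation and the parameter-estimation DNN are not shown, and the names are hypothetical.

```python
import torch

def frechet_gaussian(mu1, sigma1, mu2, sigma2):
    """Squared Frechet (2-Wasserstein) distance between 1-D Gaussians:
    d^2 = (mu1 - mu2)^2 + (sigma1 - sigma2)^2."""
    return (mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2

def gmm_adversarial_loss(src_params, tgt_params):
    """Discriminator-free adversarial loss sketch: a small DNN (not shown)
    estimates two-component GMM parameters over source and target classifier
    predictions; domain discrepancy is the sum of component-wise Frechet
    distances, assuming components are matched by index.

    *_params: dict with 'mu' and 'sigma' tensors of shape (2,), one entry per
    GMM component (e.g., the two prediction clusters).
    """
    d = frechet_gaussian(src_params["mu"], src_params["sigma"],
                         tgt_params["mu"], tgt_params["sigma"])
    return d.sum()
```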

[CV-31] Handloom Design Generation Using Generative Networks

[Quick Read]: This paper explores how deep learning can generate clothing designs, focusing on handloom fabric, and discusses the associated challenges and applications. The key to the solution is employing current state-of-the-art generative models and style transfer algorithms to study and observe their performance on the design generation task, thereby improving generative AI's ability to understand and synthesize artistic designs. The paper also contributes a new dataset, NeuralLoom, to support research on this task.

Link: https://arxiv.org/abs/2505.14330
Authors: Rajat Kanti Bhattacharjee, Meghali Nandi, Amrit Jha, Gunajit Kalita, Ferdous Ahmed Barbhuiya
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper proposes deep learning techniques for generating clothing designs, focused on handloom fabric, and discusses the associated challenges along with its applications. The capability of generative neural network models to understand and synthesize artistic designs is not yet well explored. In this work, multiple methods are employed, incorporating the current state-of-the-art generative models and style transfer algorithms, to study and observe their performance for the task. The results are then evaluated through user scores. This work also provides a new dataset, NeuralLoom, for the task of design generation.

[CV-32] Breaking Down Video LLM Benchmarks: Knowledge Spatial Perception or True Temporal Understanding?

[Quick Read]: This paper addresses the problem that existing video understanding benchmarks conflate knowledge-based and purely image-based questions and fail to isolate a model's temporal reasoning ability, the key aspect that distinguishes video understanding from other modalities. The key to the solution is VBenchComp, an automated pipeline that categorizes questions into domains such as LLM-Answerable (answerable without watching the video), Semantic (answerable even with shuffled frames), and Temporal (requiring the correct temporal order of frames), enabling fine-grained evaluation of the different capabilities of video LLMs.

Link: https://arxiv.org/abs/2505.14321
Authors: Bo Feng, Zhengfeng Lai, Shiyu Li, Zizhen Wang, Simon Wang, Ping Huang, Meng Cao
Affiliations: Apple
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model’s temporal reasoning ability, which is the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions remain answerable even when the video frames are shuffled; and Temporal questions require understanding the correct temporal order of frames. The rest of the questions are labeled as Others. This can enable fine-grained evaluation of different capabilities of a video LLM. Our analysis reveals nuanced model weaknesses that are hidden by traditional overall scores, and we offer insights and recommendations for designing future benchmarks that more accurately assess video LLMs.

[CV-33] Accuracy and Fairness of Facial Recognition Technology in Low-Quality Police Images: An Experiment With Synthetic Faces

[Quick Read]: This paper examines the accuracy and fairness of facial recognition technology (FRT) in criminal investigations, specifically how realistic image degradation affects FRT performance. The key is to evaluate FRT on 1:n identification tasks with synthetically degraded images (contrast, brightness, motion blur, pose shift, and reduced resolution) and to analyze disparities across demographic groups: synthetic faces are generated with StyleGAN3 and labeled with FairFace, and evaluation uses Deepface with ArcFace loss, revealing how degradation shifts error rates and how those shifts fall unequally across gender and race.

Link: https://arxiv.org/abs/2505.14320
Authors: Maria Cuellar, Hon Kiu (James) To, Arush Mehrotra
Affiliations: University of Pennsylvania; Wharton School; School of Engineering & Applied Science
Subjects: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments:

Abstract:Facial recognition technology (FRT) is increasingly used in criminal investigations, yet most evaluations of its accuracy rely on high-quality images, unlike those often encountered by law enforcement. This study examines how five common forms of image degradation–contrast, brightness, motion blur, pose shift, and resolution–affect FRT accuracy and fairness across demographic groups. Using synthetic faces generated by StyleGAN3 and labeled with FairFace, we simulate degraded images and evaluate performance using Deepface with ArcFace loss in 1:n identification tasks. We perform an experiment and find that false positive rates peak near baseline image quality, while false negatives increase as degradation intensifies–especially with blur and low resolution. Error rates are consistently higher for women and Black individuals, with Black females most affected. These disparities raise concerns about fairness and reliability when FRT is used in real-world investigative contexts. Nevertheless, even under the most challenging conditions and for the most affected subgroups, FRT accuracy remains substantially higher than that of many traditional forensic methods. This suggests that, if appropriately validated and regulated, FRT should be considered a valuable investigative tool. However, algorithmic accuracy alone is not sufficient: we must also evaluate how FRT is used in practice, including user-driven data manipulation. Such cases underscore the need for transparency and oversight in FRT deployment to ensure both fairness and forensic validity.
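
Four of the five degradations studied are simple image-space transforms. The PIL sketch below is an illustrative parameterization, not the paper's exact protocol; Gaussian blur stands in for motion blur, and pose shift is omitted since it requires 3D-aware warping rather than a pixel transform.

```python
from PIL import Image, ImageEnhance, ImageFilter

def degrade(img: Image.Image, kind: str, strength: float) -> Image.Image:
    """Apply one synthetic degradation; strength: 0 = none, 1 = severe."""
    if kind == "contrast":
        return ImageEnhance.Contrast(img).enhance(1.0 - 0.9 * strength)
    if kind == "brightness":
        return ImageEnhance.Brightness(img).enhance(1.0 - 0.9 * strength)
    if kind == "blur":  # Gaussian blur as a stand-in for motion blur
        return img.filter(ImageFilter.GaussianBlur(radius=8 * strength))
    if kind == "resolution":  # downsample, then upsample back to original size
        w, h = img.size
        scale = max(0.05, 1.0 - 0.95 * strength)
        small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
        return small.resize((w, h))
    raise ValueError(f"unknown degradation kind: {kind}")
```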

[CV-34] RETRO: REthinking Tactile Representation Learning with Material PriOrs

[Quick Read]: This paper addresses the neglect of material characteristics of object surfaces in existing tactile representation learning methods, even though these properties play a crucial role in shaping tactile experience. The key to the solution is revisiting the tactile representation learning framework and incorporating material-aware priors, pre-learned characteristics specific to different materials, which allow tactile models to better capture and generalize the nuances of surface texture, yielding more accurate, contextually rich tactile feedback across diverse materials and textures.

Link: https://arxiv.org/abs/2505.14319
Authors: Weihao Xia, Chenliang Zhou, Cengiz Oztireli
Affiliations: University College London; University of Cambridge
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Code: this https URL

Abstract:Tactile perception is profoundly influenced by the surface properties of objects in contact. However, despite their crucial role in shaping tactile experiences, these material characteristics have been largely neglected in existing tactile representation learning methods. Most approaches primarily focus on aligning tactile data with visual or textual information, overlooking the richness of tactile feedback that comes from understanding the materials’ inherent properties. In this work, we address this gap by revisiting the tactile representation learning framework and incorporating material-aware priors into the learning process. These priors, which represent pre-learned characteristics specific to different materials, allow tactile models to better capture and generalize the nuances of surface texture. Our method enables more accurate, contextually rich tactile feedback across diverse materials and textures, improving performance in real-world applications such as robotics, haptic feedback systems, and material editing.

[CV-35] A Review of Vision-Based Assistive Systems for Visually Impaired People: Technologies Applications and Future Directions

[Quick Read]: This paper addresses the need of visually impaired individuals for accurate and timely information about obstacles and their surroundings in independent living. The key lies in advanced assistive technologies, particularly vision-based systems, that enhance mobility and interaction with the external world, with the review covering recent advances in core technologies such as obstacle detection, navigation, and user interaction.

Link: https://arxiv.org/abs/2505.14298
Authors: Fulong Yao, Wenju Zhou, Huosheng Hu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visually impaired individuals rely heavily on accurate and timely information about obstacles and their surrounding environments to achieve independent living. In recent years, significant progress has been made in the development of assistive technologies, particularly vision-based systems, that enhance mobility and facilitate interaction with the external world in both indoor and outdoor settings. This paper presents a comprehensive review of recent advances in assistive systems designed for the visually impaired, with a focus on state-of-the-art technologies in obstacle detection, navigation, and user interaction. In addition, emerging trends and future directions in visual guidance systems are discussed.

[CV-36] owards Generating Realistic Underwater Images

[Quick Read]: This paper studies generating realistic underwater images from uniformly lit synthetic images. The key to the solution is combining contrastive learning with generative adversarial networks (GANs). Using the VAROS dataset to evaluate image translation models, the study compares the trade-offs between perceptual quality and structural preservation across methods; the CUT model improves spatial similarity by introducing contrastive learning, and further incorporating depth information enhances the realism of the generated images.

Link: https://arxiv.org/abs/2505.14296
Authors: Abdul-Kazeem Shamba
Affiliations: Norwegian University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:This paper explores the use of contrastive learning and generative adversarial networks for generating realistic underwater images from synthetic images with uniform lighting. We investigate the performance of image translation models for generating realistic underwater images using the VAROS dataset. Two key evaluation metrics, Fréchet Inception Distance (FID) and Structural Similarity Index Measure (SSIM), provide insights into the trade-offs between perceptual quality and structural preservation. For paired image translation, pix2pix achieves the best FID scores due to its paired supervision and PatchGAN discriminator, while the autoencoder model attains the highest SSIM, suggesting better structural fidelity despite producing blurrier outputs. Among unpaired methods, CycleGAN achieves a competitive FID score by leveraging cycle-consistency loss, whereas CUT, which replaces cycle-consistency with contrastive learning, attains higher SSIM, indicating improved spatial similarity retention. Notably, incorporating depth information into CUT results in the lowest overall FID score, demonstrating that depth cues enhance realism. However, the slight decrease in SSIM suggests that depth-aware learning may introduce structural variations.

[CV-37] RA-Touch: Retrieval-Augmented Touch Understanding with Enriched Visual Data

[Quick Read]: This paper addresses the under-exploration of visuo-tactile perception caused by the high cost and labor of collecting tactile data. The key to the solution is RA-Touch, a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data enriched with tactile semantics. Large-scale visual datasets are recaptioned with tactile-focused descriptions, giving models access to tactile semantics typically absent from conventional visual datasets; the method then retrieves visual-text representations aligned with tactile inputs and integrates them to focus on relevant textural and material properties, improving tactile understanding.

Link: https://arxiv.org/abs/2505.14270
Authors: Yoorhim Cho, Hongyeob Kim, Semin Kim, Youjia Zhang, Yunseok Choi, Sungeun Hong
Affiliations: Sungkyunkwan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visuo-tactile perception aims to understand an object’s tactile properties, such as texture, softness, and rigidity. However, the field remains underexplored because collecting tactile data is costly and labor-intensive. We observe that visually distinct objects can exhibit similar surface textures or material properties. For example, a leather sofa and a leather jacket have different appearances but share similar tactile properties. This implies that tactile understanding can be guided by material cues in visual data, even without direct tactile supervision. In this paper, we introduce RA-Touch, a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data enriched with tactile semantics. We carefully recaption a large-scale visual dataset with tactile-focused descriptions, enabling the model to access tactile semantics typically absent from conventional visual datasets. A key challenge remains in effectively utilizing these tactile-aware external descriptions. RA-Touch addresses this by retrieving visual-textual representations aligned with tactile inputs and integrating them to focus on relevant textural and material properties. By outperforming prior methods on the TVL benchmark, our method demonstrates the potential of retrieval-based visual reuse for tactile understanding. Code is available at this https URL

[CV-38] Speculative Decoding Reimagined for Multimodal Large Language Models

[Quick Read]: This paper addresses the slow inference of multimodal large language models (MLLMs), where speculative decoding fails to achieve the same speedup it provides for text-only LLMs. The key to the solution is Multimodal Speculative Decoding (MSD), designed around two principles: text and visual tokens have fundamentally different characteristics and should be processed separately in the draft model, and the draft model needs both language modeling ability and visual perception. MSD therefore adopts a two-stage training strategy: first training the draft model on text-only instruction-tuning data to strengthen its language modeling ability, then gradually introducing multimodal data to enhance its visual perception.

Link: https://arxiv.org/abs/2505.14260
Authors: Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Rongrong Ji
Affiliations: Xiamen University; Shanghai Innovation Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract:This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Models (MLLMs) inference. Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy. However, current speculative decoding methods for MLLMs fail to achieve the same speedup as they do for LLMs. To address this, we reimagine speculative decoding specifically for MLLMs. Our analysis of MLLM characteristics reveals two key design principles for MSD: (1) Text and visual tokens have fundamentally different characteristics and need to be processed separately during drafting. (2) Both language modeling ability and visual perception capability are crucial for the draft model. For the first principle, MSD decouples text and visual tokens in the draft model, allowing each to be handled based on its own characteristics. For the second principle, MSD uses a two-stage training strategy: In stage one, the draft model is trained on text-only instruction-tuning datasets to improve its language modeling ability. In stage two, MSD gradually introduces multimodal data to enhance the visual perception capability of the draft model. Experiments show that MSD boosts inference speed by up to 2.29× for LLaVA-1.5-7B and up to 2.46× for LLaVA-1.5-13B on multimodal benchmarks, demonstrating its effectiveness. Our code is available at this https URL.

[CV-39] Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models

[Quick Read]: This paper addresses the mismatch between attention distribution and the actual information flow in large vision-language models (LVLMs), which undermines visual understanding and contributes to hallucinations. The key to the solution is to exploit the core information embedded in semantic representations: attention heads that focus on core semantic representations are identified from their attention distributions, and a two-stage optimization paradigm propagates the advantages of these heads across the entire model, aligning the attention distribution with the actual information flow.

Link: https://arxiv.org/abs/2505.14257
Authors: Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng
Affiliations: Beijing Institute of Technology; Zhongguancun Academy; Southeast Academy of Information Technology, Beijing Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Due to the unidirectional masking mechanism, Decoder-Only models propagate information from left to right. LVLMs (Large Vision-Language Models) follow the same architecture, with visual information gradually integrated into semantic representations during forward propagation. Through systematic analysis, we observe that over 80% of the visual information is absorbed into the semantic representations. However, the model’s attention still predominantly focuses on the visual representations. This misalignment between the attention distribution and the actual information flow undermines the model’s visual understanding ability and contributes to hallucinations. To address this issue, we enhance the model’s visual understanding by leveraging the core information embedded in semantic representations. Specifically, we identify attention heads that focus on core semantic representations based on their attention distributions. Then, through a two-stage optimization paradigm, we propagate the advantages of these attention heads across the entire model, aligning the attention distribution with the actual information flow. We evaluate our method on three image captioning benchmarks using five different LVLMs, demonstrating its effectiveness in significantly reducing hallucinations. Further experiments reveal a trade-off between reduced hallucinations and richer details. Notably, our method allows for manual adjustment of the model’s conservativeness, enabling flexible control to meet diverse real-world requirements. Code will be released once accepted.

[CV-40] Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization

【速读】:该论文试图解决传统文本到图像扩散模型在图像生成与编辑中依赖手动提示词(text prompts)所带来的效率低、易引入无关信息及编辑性能受限的问题。其解决方案的关键在于通过属性分类器优化语义嵌入(semantic embeddings),从而引导文本到图像模型实现目标编辑,而无需依赖文本提示或对扩散模型进行训练或微调。该方法在数据集层面学习精确的语义嵌入,并理论证明其为属性语义的最优表示,从而实现解耦且准确的编辑效果。

链接: https://arxiv.org/abs/2505.14254
作者: Yuanyuan Chang,Yinghua Yao,Tao Qin,Mengmeng Wang,Ivor Tsang,Guang Dai
机构: Xi’an Jiaotong University (西安交通大学); Agency for Science, Technology and Research, Singapore (新加坡科技研究局); Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore (新加坡科技研究局高性能计算研究所); Zhejiang University of Technology (浙江工业大学); State Grid Corporation of China (中国国家电网公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image diffusion models have emerged as powerful tools for high-quality image generation and editing. Many existing approaches rely on text prompts as editing guidance. However, these methods are constrained by the need for manual prompt crafting, which can be time-consuming, introduce irrelevant details, and significantly limit editing performance. In this work, we propose optimizing semantic embeddings guided by attribute classifiers to steer text-to-image models toward desired edits, without relying on text prompts or requiring any training or fine-tuning of the diffusion model. We utilize classifiers to learn precise semantic embeddings at the dataset level. The learned embeddings are theoretically justified as the optimal representation of attribute semantics, enabling disentangled and accurate edits. Experiments further demonstrate that our method achieves high levels of disentanglement and strong generalization across different domains of data.
zh
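
其思路可用"线性探针求属性方向"来粗略示意:在冻结的嵌入空间上以属性分类器学一个数据集级方向,再沿该方向平移条件嵌入即可实现无提示词编辑(以下为假设性草图,论文中的最优性构造与具体嵌入空间不在此展开):

```python
import torch
import torch.nn.functional as F

def learn_attribute_direction(embeds, labels, steps=500, lr=0.1):
    """embeds: [N, D] 冻结编码器给出的嵌入;labels: [N] 二值属性标签。
    拟合一个线性属性分类器,将其归一化权重向量作为数据集级语义方向。"""
    w = torch.zeros(embeds.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)
    for _ in range(steps):
        loss = F.binary_cross_entropy_with_logits(embeds @ w + b, labels.float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(w.detach(), dim=0)

# 编辑时:沿属性方向平移条件嵌入,扩散模型本身保持冻结。
# edited_embed = cond_embed + alpha * direction
```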

[CV-41] Visual Agentic Reinforcement Fine-Tuning

【速读】:该论文旨在解决大型视觉-语言模型(LVLMs)在多模态代理能力上的不足,特别是如何实现真正基于图像的思考与操作。其解决方案的关键在于提出一种名为视觉代理强化微调(Visual-ARFT)的方法,通过该方法,开放源代码的LVLMs能够具备浏览网页获取实时信息以及编写代码对输入图像进行裁剪、旋转等处理的能力,从而提升模型的灵活性和适应性。

链接: https://arxiv.org/abs/2505.14246
作者: Ziyu Liu,Yuhang Zang,Yushan Zou,Zijian Liang,Xiaoyi Dong,Yuhang Cao,Haodong Duan,Dahua Lin,Jiaqi Wang
机构: Shanghai Jiaotong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: project url: this https URL

点击查看摘要

Abstract:A key trend in Large Reasoning Models (e.g., OpenAI’s o3) is the native agentic ability to use external tools such as web browsers for searching and writing/executing code for image manipulation to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, and their corresponding benchmarks, are still less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs’ agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves +29.3% F1 / +25.9% EM gains on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.
zh

[CV-42] Decoupling Classifier for Boosting Few-shot Object Detection and Instance Segmentation NEURIPS2022

【速读】:该论文旨在解决小样本目标检测(few-shot object detection, FSOD)和实例分割(few-shot instance segmentation, FSIS)中由于缺失标签问题导致的分类偏差问题。现有方法在实例级小样本场景下因缺乏标签而产生严重的分类偏差,这一问题被作者首次正式提出。论文的关键解决方案是将标准分类头解耦为两个独立的头部,分别处理清晰的正样本和由缺失标签引起的噪声负样本,从而有效缓解噪声影响并提升模型对新类别的学习能力。

链接: https://arxiv.org/abs/2505.14239
作者: Bin-Bin Gao,Xiaochen Chen,Zhongyi Huang,Congchong Nie,Jun Liu,Jinxiang Lai,Guannan Jiang,Xi Wang,Chengjie Wang
机构: Tencent YouTu Lab (腾讯优图实验室); CATL (宁德时代)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2022

点击查看摘要

Abstract:This paper focuses on few-shot object detection (FSOD) and instance segmentation (FSIS), which requires a model to quickly adapt to novel classes with a few labeled instances. The existing methods severely suffer from bias classification because of the missing label issue which naturally exists in an instance-level few-shot scenario and is first formally proposed by us. Our analysis suggests that the standard classification head of most FSOD or FSIS models needs to be decoupled to mitigate the bias classification. Therefore, we propose an embarrassingly simple but effective method that decouples the standard classifier into two heads. Then, these two individual heads are capable of independently addressing clear positive samples and noisy negative samples which are caused by the missing label. In this way, the model can effectively learn novel classes while mitigating the effects of noisy negative samples. Without bells and whistles, our model without any additional computation cost and parameters consistently outperforms its baseline and state-of-the-art by a large margin on PASCAL VOC and MS-COCO benchmarks for FSOD and FSIS tasks. The code is available at this https URL.
zh
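
其"把标准分类头一拆为二"的做法可示意如下(头部结构与超参数均为示意;实际检测框架中的 RoI 采样、回归分支等从略):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledClassifier(nn.Module):
    """正样本头只学习清晰的前景 RoI;负样本头吸收可能含未标注目标的
    背景 RoI(缺失标签问题),并通过降权避免其压制新类学习。"""
    def __init__(self, in_dim, num_classes, neg_weight=0.5):
        super().__init__()
        self.num_classes = num_classes            # 背景类 id = num_classes
        self.pos_head = nn.Linear(in_dim, num_classes + 1)
        self.neg_head = nn.Linear(in_dim, num_classes + 1)
        self.neg_weight = neg_weight

    def forward(self, feats, labels):
        is_fg = labels < self.num_classes
        zero = feats.sum() * 0.0                  # 保持计算图的占位损失
        loss_pos = (F.cross_entropy(self.pos_head(feats[is_fg]), labels[is_fg])
                    if is_fg.any() else zero)
        loss_neg = (F.cross_entropy(self.neg_head(feats[~is_fg]), labels[~is_fg])
                    if (~is_fg).any() else zero)
        return loss_pos + self.neg_weight * loss_neg
```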

[CV-43] UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning

【速读】:该论文旨在解决传统视觉定位方法在面对包含隐含和复杂指令的多图像实际场景时所面临的挑战,这些问题主要源于模型在跨多种多模态上下文中的推理能力不足。其解决方案的关键在于提出UniVG-R1,一个基于强化学习(Reinforcement Learning, RL)与冷启动数据结合的推理引导多模态大语言模型(Multimodal Large Language Model, MLLM),通过构建高质量的思维链(Chain-of-Thought, CoT)标注数据集进行监督微调,并采用规则驱动的强化学习策略提升模型的推理能力,同时引入难度感知的权重调整策略以缓解训练过程中易样本偏差问题。

链接: https://arxiv.org/abs/2505.14231
作者: Sule Bai,Mingxing Li,Yong Liu,Jing Tang,Haoji Zhang,Lei Sun,Xiangxiang Chu,Yansong Tang
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院,清华大学); AMAP, Alibaba Group (高德地图,阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional visual grounding methods primarily focus on single-image scenarios with simple textual references. However, extending these methods to real-world scenarios that involve implicit and complex instructions, particularly in conjunction with multiple images, poses significant challenges, which is mainly due to the lack of advanced reasoning ability across diverse multi-modal contexts. In this work, we aim to address the more practical universal grounding task, and propose UniVG-R1, a reasoning guided multimodal large language model (MLLM) for universal visual grounding, which enhances reasoning capabilities through reinforcement learning (RL) combined with cold-start data. Specifically, we first construct a high-quality Chain-of-Thought (CoT) grounding dataset, annotated with detailed reasoning chains, to guide the model towards correct reasoning paths via supervised fine-tuning. Subsequently, we perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities. In addition, we identify a difficulty bias arising from the prevalence of easy samples as RL training progresses, and we propose a difficulty-aware weight adjustment strategy to further strengthen the performance. Experimental results demonstrate the effectiveness of UniVG-R1, which achieves state-of-the-art performance on MIG-Bench with a 9.1% improvement over the previous method. Furthermore, our model exhibits strong generalizability, achieving an average improvement of 23.4% in zero-shot performance across four image and video reasoning grounding benchmarks. The project page can be accessed at this https URL.
zh

[CV-44] VoQA: Visual-only Question Answering

【速读】:该论文试图解决在纯视觉问答(Visual-only Question Answering, VoQA)任务中,现有大型视觉-语言模型(LVLMs)对图像中嵌入的视觉文本问题进行定位、识别和推理时表现不佳的问题。解决方案的关键在于引入Guided Response Triggering Supervised Fine-tuning (GRT-SFT),这是一种结构化的微调策略,通过引导模型仅基于视觉输入进行逐步推理,从而显著提升模型性能。

链接: https://arxiv.org/abs/2505.14227
作者: Luyang Jiang,Jianing An,Jie Luo,Wenjun Wu,Lei Huang
机构: Beihang University (北京航空航天大学); Hangzhou International Innovation Institute, Beihang University (杭州国际创新研究院,北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages

点击查看摘要

Abstract:We propose Visual-only Question Answering (VoQA), a novel multimodal task in which questions are visually embedded within images, without any accompanying textual input. This requires models to locate, recognize, and reason over visually embedded textual questions, posing challenges for existing large vision-language models (LVLMs), which show notable performance drops even with carefully designed prompts. To bridge this gap, we introduce Guided Response Triggering Supervised Fine-tuning (GRT-SFT), a structured fine-tuning strategy that guides the model to perform step-by-step reasoning purely based on visual input, significantly improving model performance. Our work enhances models’ capacity for human-like visual understanding in complex multimodal scenarios, where information, including language, is perceived visually.
zh

[CV-45] Flexible-weighted Chamfer Distance: Enhanced Objective Function for Point Cloud Completion

【速读】:该论文试图解决在点云补全任务中直接使用固定权重的Chamfer Distance (CD)作为目标函数时,虽然整体性能表现良好(即CD得分低),但往往无法实现良好的全局分布问题,这通常体现在高Earth Mover’s Distance (EMD)和Decomposed Chamfer Distance (DCD)得分以及较差的人类评估结果上。解决方案的关键在于提出一种Flexible-Weighted Chamfer Distance (FCD),通过为CD的全局分布组件分配更高的权重,并引入灵活的加权策略来调整两个组件之间的平衡,旨在提升全局分布的同时保持稳健的整体性能。

链接: https://arxiv.org/abs/2505.14218
作者: Jie Li,Shengwei Tian,Long Yu,Xin Ning
机构: Xinjiang University (新疆大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chamfer Distance (CD) comprises two components that can evaluate the global distribution and local performance of generated point clouds, making it widely utilized as a similarity measure between generated and target point clouds in point cloud completion tasks. Additionally, CD’s computational efficiency has led to its frequent application as an objective function for guiding point cloud generation. However, using CD directly as an objective function with fixed equal weights for its two components can often result in seemingly high overall performance (i.e., low CD score), while failing to achieve a good global distribution. This is typically reflected in high Earth Mover’s Distance (EMD) and Decomposed Chamfer Distance (DCD) scores, alongside poor human assessments. To address this issue, we propose a Flexible-Weighted Chamfer Distance (FCD) to guide point cloud generation. FCD assigns a higher weight to the global distribution component of CD and incorporates a flexible weighting strategy to adjust the balance between the two components, aiming to improve global distribution while maintaining robust overall performance. Experimental results on two state-of-the-art networks demonstrate that our method achieves superior results across multiple evaluation metrics, including CD, EMD, DCD, and F-Score, as well as in human evaluations.
zh
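
标准 CD 的两个方向项与 FCD 的加权思想可以直接写成如下 PyTorch 损失(此处假设 "GT→预测" 方向反映全局覆盖/分布、"预测→GT" 方向反映局部精度,并用固定权重代替论文中的灵活加权策略):

```python
import torch

def flexible_weighted_chamfer(pred, gt, w_global=2.0, w_local=1.0):
    """pred: [B, N, 3] 生成点云;gt: [B, M, 3] 目标点云。
    对度量全局覆盖的方向项赋更高权重,其余保持标准 CD 形式。"""
    d = torch.cdist(pred, gt) ** 2                   # [B, N, M] 平方距离
    pred_to_gt = d.min(dim=2).values.mean(dim=1)     # 局部精度项
    gt_to_pred = d.min(dim=1).values.mean(dim=1)     # 全局覆盖项
    return (w_local * pred_to_gt + w_global * gt_to_pred).mean()
```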

[CV-46] Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment

【速读】:该论文试图解决视觉表示学习中如何有效利用人类感知结构以提升模型在未见任务上的泛化能力问题。其解决方案的关键在于提出Perceptual-Initialization(PI)范式,即在模型初始化阶段引入人类感知结构,而非作为下游微调步骤。具体而言,通过使用NIGHTS数据集的人类衍生三元组嵌入来初始化CLIP视觉编码器,并在YFCC15M上进行自监督学习,从而在无需任务特定微调的情况下显著提升零样本分类和检索任务的性能。

链接: https://arxiv.org/abs/2505.14204
作者: Yang Hu,Runchen Wang,Stephen Chong Zhao,Xuhui Zhan,Do Hun Kim,Mark Wallace,David A. Tovar
机构: Vanderbilt University (范德比尔特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注: 10 pages, 5 figures, 2 tables

点击查看摘要

Abstract:We introduce Perceptual-Initialization (PI), a paradigm shift in visual representation learning that incorporates human perceptual structure during the initialization phase rather than as a downstream fine-tuning step. By integrating human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder, followed by self-supervised learning on YFCC15M, our approach demonstrates significant zero-shot performance improvements, without any task-specific fine-tuning, across 29 zero-shot classification and 2 retrieval benchmarks. On ImageNet-1K, zero-shot gains emerge after approximately 15 epochs of pretraining. Benefits are observed across datasets of various scales, with improvements manifesting at different stages of the pretraining process depending on dataset characteristics. Our approach consistently enhances zero-shot top-1 accuracy, top-5 accuracy, and retrieval recall (e.g., R@1, R@5) across these diverse evaluation tasks, without requiring any adaptation to target domains. These findings challenge the conventional wisdom of using human-perceptual data primarily for fine-tuning and demonstrate that embedding human perceptual structure during early representation learning yields more capable and vision-language aligned systems that generalize immediately to unseen tasks. Our work shows that “beginning with you”, starting with human perception, provides a stronger foundation for general-purpose vision-language intelligence.
zh
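
"感知初始化"阶段的核心可用三元组损失粗略示意:让视觉编码器的嵌入空间先对齐人类相似性判断(以下假设 NIGHTS 式三元组中 `positive` 被人类判定为更接近 `anchor`;实际训练目标与细节以论文为准):

```python
import torch
import torch.nn.functional as F

def perceptual_triplet_loss(encoder, anchor, positive, negative, margin=0.2):
    """anchor/positive/negative: [B, 3, H, W] 图像批;encoder 输出 [B, D] 嵌入。
    余弦距离形式的 triplet margin 损失。"""
    za = F.normalize(encoder(anchor), dim=-1)
    zp = F.normalize(encoder(positive), dim=-1)
    zn = F.normalize(encoder(negative), dim=-1)
    d_pos = 1 - (za * zp).sum(-1)     # 人类认为更像的一对,距离应更小
    d_neg = 1 - (za * zn).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()
```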

[CV-47] owards Omnidirectional Reasoning with 360-R1: A Dataset Benchmark and GRPO-based Method

【速读】:该论文试图解决多模态大语言模型(MLLMs)在处理全向图像(ODIs)时存在的理解与推理能力不足的问题,特别是对象定位、特征提取和幻觉抑制等方面的挑战。解决方案的关键在于引入了首个针对全向视觉问答的基准数据集OmniVQA,并基于Qwen2.5-VL-Instruct提出了一种基于规则的强化学习方法360-R1,通过改进群体相对策略优化(GRPO)并设计三种新颖的奖励函数,包括推理过程相似性奖励、答案语义准确性奖励和结构化格式合规性奖励,从而提升了模型在全向空间中的表现。

链接: https://arxiv.org/abs/2505.14197
作者: Xinshen Zhang,Zhen Ye,Xu Zheng
机构: The Hong Kong Polytechnic University (香港理工大学); HKUST (香港科技大学); HKUST(GZ) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Omnidirectional images (ODIs), with their 360° field of view, provide unparalleled spatial awareness for immersive applications like augmented reality and embodied AI. However, the capability of existing multi-modal large language models (MLLMs) to comprehend and reason about such panoramic scenes remains underexplored. This paper addresses this gap by introducing OmniVQA, the first dataset and conducting the first benchmark for omnidirectional visual question answering. Our evaluation of state-of-the-art MLLMs reveals significant limitations in handling omnidirectional visual question answering, highlighting persistent challenges in object localization, feature extraction, and hallucination suppression within panoramic contexts. These results underscore the disconnect between current MLLM capabilities and the demands of omnidirectional visual understanding, which calls for dedicated architectural or training innovations tailored to 360° imagery. Building on the OmniVQA dataset and benchmark, we further introduce a rule-based reinforcement learning method, 360-R1, based on Qwen2.5-VL-Instruct. Concretely, we modify the group relative policy optimization (GRPO) by proposing three novel reward functions: (1) reasoning process similarity reward, (2) answer semantic accuracy reward, and (3) structured format compliance reward. Extensive experiments on our OmniVQA demonstrate the superiority of our proposed method in omnidirectional space (+6% improvement).
zh
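
论文为 GRPO 设计的三种奖励可用如下纯 Python 草图示意(`<think>`/`<answer>` 标签、各项权重,以及用字符串相似度近似"语义/推理相似度",均为本文的示意性假设):

```python
import re
from difflib import SequenceMatcher

def format_reward(text):
    """结构化格式合规奖励:要求先 <think> 再 <answer>。"""
    pat = r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*"
    return 1.0 if re.fullmatch(pat, text) else 0.0

def answer_reward(text, gold):
    """答案语义准确性奖励,这里退化为归一化精确匹配。"""
    m = re.search(r"(?s)<answer>(.*?)</answer>", text)
    pred = (m.group(1) if m else "").strip().lower()
    return 1.0 if pred == gold.strip().lower() else 0.0

def reasoning_reward(text, reference_cot):
    """推理过程相似性奖励,用与参考推理链的字符串相似度近似。"""
    m = re.search(r"(?s)<think>(.*?)</think>", text)
    cot = m.group(1) if m else ""
    return SequenceMatcher(None, cot, reference_cot).ratio()

def total_reward(text, gold, reference_cot, w=(0.2, 0.6, 0.2)):
    return (w[0] * format_reward(text)
            + w[1] * answer_reward(text, gold)
            + w[2] * reasoning_reward(text, reference_cot))
```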

[CV-48] Bridge the Gap between Past and Future: Siamese Model Optimization for Context-Aware Document Ranking

【速读】:该论文试图解决传统方法在利用历史会话数据捕捉用户意图演变方面的局限性,旨在通过引入未来上下文信息来提升文档排序效果。其解决方案的关键在于提出了一种孪生模型优化框架(siamese model optimization framework),包含一个依赖历史的模型(ForeRanker)和一个具备未来感知能力的模型,两者通过监督标签和彼此预测的伪标签进行协同训练,其中ForeRanker在推理时仅使用历史会话,但通过逐步学习未来相关信息来增强排序效果。此外,为缓解训练过程中的不一致性,引入了带有动态门控机制的同伴知识蒸馏方法,以实现对上下文信息的选择性融合。

链接: https://arxiv.org/abs/2505.14180
作者: Songhao Wu,Quan Tu,Mingjie Zhong,Hong Liu,Jia Xu,Jinjie Gu,Rui Yan
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the realm of information retrieval, users often engage in multi-turn interactions with search engines to acquire information, leading to the formation of sequences of user feedback behaviors. Leveraging the session context has proven to be beneficial for inferring user search intent and document ranking. A multitude of approaches have been proposed to exploit in-session context for improved document ranking. Despite these advances, the limitation of historical session data for capturing evolving user intent remains a challenge. In this work, we explore the integration of future contextual information into the session context to enhance document ranking. We present the siamese model optimization framework, comprising a history-conditioned model and a future-aware model. The former processes only the historical behavior sequence, while the latter integrates both historical and anticipated future behaviors. Both models are trained collaboratively using the supervised labels and pseudo labels predicted by the other. The history-conditioned model, referred to as ForeRanker, progressively learns future-relevant information to enhance ranking, while it uses only the historical session at inference time. To mitigate inconsistencies during training, we introduce the peer knowledge distillation method with a dynamic gating mechanism, allowing models to selectively incorporate contextual information. Experimental results on benchmark datasets demonstrate the effectiveness of our ForeRanker, showcasing its superior performance compared to existing methods.
zh
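
其中"带动态门控的同伴知识蒸馏"可示意为:以对方(分离梯度后的)预测分布为软目标,并按其置信度逐样本决定吸收多少(门控形式为假设,仅作说明;实际双向各算一次):

```python
import torch
import torch.nn.functional as F

def gated_peer_distillation(logits_student, logits_peer, labels, T=2.0):
    """logits_student / logits_peer: [B, C],分别来自依赖历史的模型与
    未来感知模型;labels: [B] 监督标签。"""
    ce = F.cross_entropy(logits_student, labels)
    p_peer = F.softmax(logits_peer.detach() / T, dim=-1)
    gate = p_peer.max(dim=-1).values               # 越自信的同伴越可信
    kd = F.kl_div(F.log_softmax(logits_student / T, dim=-1),
                  p_peer, reduction="none").sum(-1)
    return ce + (gate * kd).mean() * T * T
```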

[CV-49] LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer

【速读】:该论文旨在解决视频生成中缺乏对视频内容的细粒度控制问题,特别是在仅通过提示控制视频中主体运动时面临的挑战,以及在图像到视频生成过程中参考图像与参考视频中的主体在初始位置、大小和形状上的差异导致的运动控制失败问题。其解决方案的关键在于提出一种名为Leveraging Motion Prior (LMP)的框架,该框架通过引入前景-背景解耦模块、重加权运动迁移模块和外观分离模块,使生成视频能够参考用户提供的运动视频中的运动信息,从而实现更精确的运动控制和更高的视频生成质量。

链接: https://arxiv.org/abs/2505.14167
作者: Changgu Chen,Xiaoyan Yang,Junwei Shu,Changbo Wang,Yang Li
机构: East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, large-scale pre-trained diffusion transformer models have made significant progress in video generation. While current DiT models can produce high-definition, high-frame-rate, and highly diverse videos, there is a lack of fine-grained control over the video content. Controlling the motion of subjects in videos using only prompts is challenging, especially when it comes to describing complex movements. Further, existing methods fail to control the motion in image-to-video generation, as the subject in the reference image often differs from the subject in the reference video in terms of initial position, size, and shape. To address this, we propose the Leveraging Motion Prior (LMP) framework for zero-shot video generation. Our framework harnesses the powerful generative capabilities of pre-trained diffusion transformers to enable motion in the generated videos to reference user-provided motion videos in both text-to-video and image-to-video generation. To this end, we first introduce a foreground-background disentangle module to distinguish between moving subjects and backgrounds in the reference video, preventing interference in the target video generation. A reweighted motion transfer module is designed to allow the target video to reference the motion from the reference video. To avoid interference from the subject in the reference video, we propose an appearance separation module to suppress the appearance of the reference subject in the target video. We annotate the DAVIS dataset with detailed prompts for our experiments and design evaluation metrics to validate the effectiveness of our method. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in generation quality, prompt-video consistency, and control capability. Our homepage is available at this https URL
zh

[CV-50] M3Depth: Wavelet-Enhanced Depth Estimation on Mars via Mutual Boosting of Dual-Modal Data

【速读】:该论文旨在解决在火星等缺乏纹理和几何约束环境下的深度估计性能下降问题,这类环境通常导致传统学习型立体深度估计方法效果不佳。其解决方案的关键在于提出M3Depth模型,该模型通过引入基于小波变换的卷积核以有效捕捉低频特征并扩展感受野,同时结合显式建模深度图与表面法线图之间互补关系的一致性损失,以及设计具有互增强机制的像素级精修模块,从而提升深度估计的准确性与鲁棒性。

链接: https://arxiv.org/abs/2505.14159
作者: Junjie Li,Jiawei Wang,Miyu Li,Yu Liu,Yumei Wang,Haitao Xu
机构: Beijing University of Posts and Telecommunications (北京邮电大学); National Space Science Center, Chinese Academy of Sciences (中国科学院国家空间科学中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Depth estimation plays a great potential role in obstacle avoidance and navigation for further Mars exploration missions. Compared to traditional stereo matching, learning-based stereo depth estimation provides a data-driven approach to infer dense and precise depth maps from stereo image pairs. However, these methods always suffer performance degradation in environments with sparse textures and lacking geometric constraints, such as the unstructured terrain of Mars. To address these challenges, we propose M3Depth, a depth estimation model tailored for Mars rovers. Considering the sparse and smooth texture of Martian terrain, which is primarily composed of low-frequency features, our model incorporates a convolutional kernel based on wavelet transform that effectively captures low-frequency response and expands the receptive field. Additionally, we introduce a consistency loss that explicitly models the complementary relationship between depth map and surface normal map, utilizing the surface normal as a geometric constraint to enhance the accuracy of depth estimation. Besides, a pixel-wise refinement module with mutual boosting mechanism is designed to iteratively refine both depth and surface normal predictions. Experimental results on synthetic Mars datasets with depth annotations show that M3Depth achieves a significant 16% improvement in depth estimation accuracy compared to other state-of-the-art methods in depth estimation. Furthermore, the model demonstrates strong applicability in real-world Martian scenarios, offering a promising solution for future Mars exploration missions.
zh
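
论文将表面法线作为深度的几何约束:深度图经有限差分可直接导出法线,再与法线分支做余弦一致性。以下为正交投影近似下的示意(省略相机内参;非官方实现):

```python
import torch
import torch.nn.functional as F

def depth_to_normal(depth):
    """depth: [B, 1, H, W]。由深度梯度近似表面法线(正交投影简化)。"""
    dzdx = depth[:, :, :, 1:] - depth[:, :, :, :-1]
    dzdy = depth[:, :, 1:, :] - depth[:, :, :-1, :]
    dzdx = F.pad(dzdx, (0, 1, 0, 0))               # 补齐回 [B,1,H,W]
    dzdy = F.pad(dzdy, (0, 0, 0, 1))
    n = torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def depth_normal_consistency_loss(pred_depth, pred_normal):
    """由预测深度导出的法线应与法线分支的预测保持余弦一致。"""
    n_from_d = depth_to_normal(pred_depth)
    return (1 - F.cosine_similarity(n_from_d, pred_normal, dim=1)).mean()
```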

[CV-51] Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search

【速读】:该论文旨在解决会话搜索中传统策略在处理用户复杂信息需求时,过于依赖序列建模而忽视交互中的图结构问题,以及现有方法在捕捉结构信息时使用通用文档表示而忽略词级语义建模的问题。其解决方案的关键在于提出Symbolic Graph Ranker (SGR),通过引入符号语法规则将会话图转换为文本,从而将会话历史、交互过程和任务指令无缝整合为大型语言模型(Large Language Models, LLMs)的输入,并通过自监督符号学习任务增强LLMs对图结构的捕捉能力,从粗粒度到细粒度全面理解拓扑信息。

链接: https://arxiv.org/abs/2505.14156
作者: Songhao Wu,Quan Tu,Hong Liu,Jia Xu,Zhongyi Liu,Guannan Zhang,Ran Wang,Xiuying Chen,Rui Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Session search involves a series of interactive queries and actions to fulfill user’s complex information need. Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions. While some approaches focus on capturing structural information, they use a generalized representation for documents, neglecting the word-level semantic modeling. In this paper, we propose Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches by leveraging the power of recent Large Language Models (LLMs). Concretely, we first introduce a set of symbolic grammar rules to convert session graph into text. This allows integrating session history, interaction process, and task instruction seamlessly as inputs for the LLM. Moreover, given the natural discrepancy between LLMs pre-trained on textual corpora, and the symbolic language we produce using our graph-to-text grammar, our objective is to enhance LLMs’ ability to capture graph structures within a textual format. To achieve this, we introduce a set of self-supervised symbolic learning tasks including link prediction, node content generation, and generative contrastive learning, to enable LLMs to capture the topological information from coarse-grained to fine-grained. Experiment results and comprehensive analysis on two benchmark datasets, AOL and Tiangong-ST, confirm the superiority of our approach. Our paradigm also offers a novel and effective methodology that bridges the gap between traditional search strategies and modern LLMs.
zh

[CV-52] ReactDiff: Latent Diffusion for Facial Reaction Generation

【速读】:该论文旨在解决音频-视频片段中听众面部反应生成的问题,其核心挑战在于捕捉视频与音频之间的相关性,同时平衡反应的适当性、真实性和多样性。现有方法多集中于单模态输入或简化的反应映射,而本文提出的 Facial Reaction Diffusion (ReactDiff) 框架通过将多模态 Transformer 与潜在空间中的条件扩散相结合,实现了更高质量的反应生成。ReactDiff 的关键在于利用类内和类间注意力机制进行细粒度的多模态交互,并通过编码器与解码器之间的潜在扩散过程生成多样且符合语境的输出。

链接: https://arxiv.org/abs/2505.14151
作者: Jiaming Li,Sheng Wang,Xin Wang,Yitao Zhu,Honglin Xiong,Zixu Zhuang,Qian Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Given the audio-visual clip of the speaker, facial reaction generation aims to predict the listener’s facial reactions. The challenge lies in capturing the relevance between video and audio while balancing appropriateness, realism, and diversity. While prior works have mostly focused on uni-modal inputs or simplified reaction mappings, recent approaches such as PerFRDiff have explored multi-modal inputs and the one-to-many nature of appropriate reaction mappings. In this work, we propose the Facial Reaction Diffusion (ReactDiff) framework that uniquely integrates a Multi-Modality Transformer with conditional diffusion in the latent space for enhanced reaction generation. Unlike existing methods, ReactDiff leverages intra- and inter-class attention for fine-grained multi-modal interaction, while the latent diffusion process between the encoder and decoder enables diverse yet contextually appropriate outputs. Experimental results demonstrate that ReactDiff significantly outperforms existing approaches, achieving a facial reaction correlation of 0.26 and diversity score of 0.094 while maintaining competitive realism. The code is open-sourced at this https URL.
zh

[CV-53] Hunyuan-Game: Industrial-grade Intelligent Game Creation Model

【速读】:该论文旨在解决游戏开发中高质量游戏资产(包括图像和视频)的综合生成与合成问题,尤其是在满足玩家偏好和提升设计师效率方面存在的挑战。其解决方案的关键在于构建两个主要分支:图像生成与视频生成,分别基于大规模游戏图像数据集和海量游戏及动漫视频数据集,开发出针对游戏场景定制化的生成模型,涵盖从文本到图像、视觉效果生成、透明图像生成、角色生成以及图像到视频、360度姿态动画合成、动态插画生成、视频超分辨率和交互式游戏视频生成等多个核心算法模型,这些模型不仅具备高水平的艺术表现力,还深度融合了领域专业知识,形成了对多样游戏与动漫艺术风格的系统性理解。

链接: https://arxiv.org/abs/2505.14135
作者: Ruihuang Li,Caijin Zhou,Shoujian Zheng,Jianxiang Lu,Jiabin Huang,Comi Chen,Junshu Tang,Guangzheng Xu,Jiale Tao,Hongmei Wang,Donghao Li,Wenqing Yu,Senbo Wang,Zhimin Li,Yetshuan Shi,Haoyu Yang,Yukun Wang,Wenxun Dai,Jiaqi Li,Linqing Wang,Qixun Wang,Zhiyong Xu,Yingfang Zhang,Jiangfeng Xiong,Weijie Kong,Chao Zhang,Hongxin Zhang,Qiaoling Zheng,Weiting Guo,Xinchi Deng,Yixuan Li,Renjia Wei,Yulin Jian,Duojun Huang,Xuhua Ren,Sihuan Lin,Yifu Sun,Yuan Zhou,Joey Wang,Qin Lin,Jingmiao Yu,Jihong Zhang,Caesar Zhong,Di Wang,Yuhong Liu,Linus,Jie Jiang,Longhuang Wu,Shuai Shao,Qinglin Lu
机构: Tencent Hunyuan (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Intelligent game creation represents a transformative advancement in game development, utilizing generative artificial intelligence to dynamically generate and enhance game content. Despite notable progress in generative models, the comprehensive synthesis of high-quality game assets, including both images and videos, remains a challenging frontier. To create high-fidelity game content that simultaneously aligns with player preferences and significantly boosts designer efficiency, we present Hunyuan-Game, an innovative project designed to revolutionize intelligent game production. Hunyuan-Game encompasses two primary branches: image generation and video generation. The image generation component is built upon a vast dataset comprising billions of game images, leading to the development of a group of customized image generation models tailored for game scenarios: (1) General Text-to-Image Generation. (2) Game Visual Effects Generation, involving text-to-effect and reference image-based game visual effect generation. (3) Transparent Image Generation for characters, scenes, and game visual effects. (4) Game Character Generation based on sketches, black-and-white images, and white models. The video generation component is built upon a comprehensive dataset of millions of game and anime videos, leading to the development of five core algorithmic models, each targeting critical pain points in game development and having robust adaptation to diverse game video scenarios: (1) Image-to-Video Generation. (2) 360 A/T Pose Avatar Video Synthesis. (3) Dynamic Illustration Generation. (4) Generative Video Super-Resolution. (5) Interactive Game Video Generation. These image and video generation models not only exhibit high-level aesthetic expression but also deeply integrate domain-specific knowledge, establishing a systematic understanding of diverse game and anime art styles.
zh

[CV-54] Intra-class Patch Swap for Self-Distillation

【速读】:该论文旨在解决传统知识蒸馏(Knowledge Distillation, KD)框架中依赖预训练高容量教师网络所带来的问题,如存储需求高、额外训练成本以及教师模型选择的不确定性。其解决方案的关键在于提出一种无需教师网络的自蒸馏(self-distillation)框架,该框架仅使用单一学生网络,并通过一种称为类内块交换增强(intra-class patch swap augmentation)的简单但高效的增强技术,模拟教师-学生动态,从而实现预测分布对齐。该方法具有概念简洁、模型无关和易于实现的特点。

链接: https://arxiv.org/abs/2505.14124
作者: Hongjun Choi,Eun Som Jeon,Ankita Shukla,Pavan Turaga
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in Neurocomputing

点击查看摘要

Abstract:Knowledge distillation (KD) is a valuable technique for compressing large deep learning models into smaller, edge-suitable networks. However, conventional KD frameworks rely on pre-trained high-capacity teacher networks, which introduce significant challenges such as increased memory/storage requirements, additional training costs, and ambiguity in selecting an appropriate teacher for a given student model. Although a teacher-free distillation (self-distillation) has emerged as a promising alternative, many existing approaches still rely on architectural modifications or complex training procedures, which limit their generality and efficiency. To address these limitations, we propose a novel framework based on teacher-free distillation that operates using a single student network without any auxiliary components, architectural modifications, or additional learnable parameters. Our approach is built on a simple yet highly effective augmentation, called intra-class patch swap augmentation. This augmentation simulates a teacher-student dynamic within a single model by generating pairs of intra-class samples with varying confidence levels, and then applying instance-to-instance distillation to align their predictive distributions. Our method is conceptually simple, model-agnostic, and easy to implement, requiring only a single augmentation function. Extensive experiments across image classification, semantic segmentation, and object detection show that our method consistently outperforms both existing self-distillation baselines and conventional teacher-based KD approaches. These results suggest that the success of self-distillation could hinge on the design of the augmentation itself. Our codes are available at this https URL.
zh
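
类内块交换增强与随后的"实例对实例"蒸馏都很轻量,可示意如下(块大小、交换次数与温度均为示意;为简洁起见按整批而非逐样本选择软目标方向):

```python
import torch
import torch.nn.functional as F

def intra_class_patch_swap(x1, x2, patch=8, num_swaps=4):
    """x1, x2: [C, H, W],同类别的两张图。互换若干随机方块,
    得到置信度不同的一对类内样本。"""
    a, b = x1.clone(), x2.clone()
    _, H, W = x1.shape
    for _ in range(num_swaps):
        i = torch.randint(0, H - patch + 1, (1,)).item()
        j = torch.randint(0, W - patch + 1, (1,)).item()
        a[:, i:i+patch, j:j+patch] = x2[:, i:i+patch, j:j+patch]
        b[:, i:i+patch, j:j+patch] = x1[:, i:i+patch, j:j+patch]
    return a, b

def swap_distill_loss(logits_a, logits_b, T=4.0):
    """实例对实例蒸馏:让置信度较低的一侧向较高的一侧对齐分布。"""
    conf_a = logits_a.softmax(-1).max()
    conf_b = logits_b.softmax(-1).max()
    teacher, student = (logits_a, logits_b) if conf_a >= conf_b else (logits_b, logits_a)
    return F.kl_div(F.log_softmax(student / T, -1),
                    F.softmax(teacher.detach() / T, -1),
                    reduction="batchmean") * T * T
```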

[CV-55] CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition

【速读】:该论文试图解决图像分割模型生成的像素级置信度分数(通常来自softmax输出)缺乏严格的统计不确定性估计的问题,这些分数是启发式的,无法提供可靠的定量不确定性评估。解决方案的关键在于引入CONSIGN(Conformal Segmentation Informed by Spatial Groupings via Decomposition),这是一种基于共形预测(Conformal Prediction, CP)的方法,通过考虑像素间的空间相关性来改进不确定性量化。该方法生成具有用户指定高概率误差保证的有意义预测集,并兼容任何能够生成多个样本输出的预训练分割模型。

链接: https://arxiv.org/abs/2505.14113
作者: Bruno Viti,Elias Karabelas,Martin Holler
机构: University of Graz, AT; BioTechMed-Graz
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Most machine learning-based image segmentation models produce pixel-wise confidence scores - typically derived from softmax outputs - that represent the model’s predicted probability for each class label at every pixel. While this information can be particularly valuable in high-stakes domains such as medical imaging, these (uncalibrated) scores are heuristic in nature and do not constitute rigorous quantitative uncertainty estimates. Conformal prediction (CP) provides a principled framework for transforming heuristic confidence scores into statistically valid uncertainty estimates. However, applying CP directly to image segmentation ignores the spatial correlations between pixels, a fundamental characteristic of image data. This can result in overly conservative and less interpretable uncertainty estimates. To address this, we propose CONSIGN (Conformal Segmentation Informed by Spatial Groupings via Decomposition), a CP-based method that incorporates spatial correlations to improve uncertainty quantification in image segmentation. Our method generates meaningful prediction sets that come with user-specified, high-probability error guarantees. It is compatible with any pre-trained segmentation model capable of generating multiple sample outputs - such as those using dropout, Bayesian modeling, or ensembles. We evaluate CONSIGN against a standard pixel-wise CP approach across three medical imaging datasets and two COCO dataset subsets, using three different pre-trained segmentation models. Results demonstrate that accounting for spatial structure significantly improves performance across multiple metrics and enhances the quality of uncertainty estimates.
zh
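
CONSIGN 所改进的"逐像素共形预测"基线本身只有几行:在校准集上取非符合性分数的分位数,再据此为每个像素构造带覆盖保证的类别集合(示意代码,假设已有校准集的 softmax 概率与标签;正因其把像素当作独立样本、忽略空间相关性,才会偏保守):

```python
import numpy as np

def calibrate_pixel_cp(probs_cal, labels_cal, alpha=0.1):
    """probs_cal: [N, K, H, W] softmax 概率;labels_cal: [N, H, W] 真值类别。
    分数 s = 1 - p(真实类),在所有像素上取 (1 - alpha) 分位数(split conformal)。"""
    p_true = np.take_along_axis(probs_cal, labels_cal[:, None], axis=1)[:, 0]
    scores = (1.0 - p_true).reshape(-1)
    m = scores.size
    level = min(1.0, np.ceil((m + 1) * (1 - alpha)) / m)
    return np.quantile(scores, level, method="higher")

def predict_sets(probs, qhat):
    """probs: [K, H, W]。返回逐像素预测集合的布尔掩码 [K, H, W]:
    所有满足 1 - p <= qhat 的类别进入该像素的集合。"""
    return (1.0 - probs) <= qhat
```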

[CV-56] Unintended Bias in 2D Image Segmentation and Its Effect on Attention Asymmetry

【速读】:该论文试图解决预训练模型在应用于专业数据集(如生物医学成像)时引入的非预期偏差问题,这种偏差会导致模型对不同切片的重要性评估不一致,从而引发特征利用的不一致性,并表现为显著性图分布的不对称性。解决方案的关键在于提出策略以中和由预训练颜色通道权重引入的偏差,通过实验验证所提方法在保持预训练模型优势的同时,提升了模型的可解释性。

链接: https://arxiv.org/abs/2505.14105
作者: Zsófia Molnár,Gergely Szabó,András Horváth
机构: ITK, PPCU (ITK, PPCU)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Supervised pretrained models have become widely used in deep learning, especially for image segmentation tasks. However, when applied to specialized datasets such as biomedical imaging, pretrained weights often introduce unintended biases. These biases cause models to assign different levels of importance to different slices, leading to inconsistencies in feature utilization, which can be observed as asymmetries in saliency map distributions. This transfer of color distributions from natural images to non-natural datasets can compromise model performance and reduce the reliability of results. In this study, we investigate the effects of these biases and propose strategies to mitigate them. Through a series of experiments, we test both pretrained and randomly initialized models, comparing their performance and saliency map distributions. Our proposed methods, which aim to neutralize the bias introduced by pretrained color channel weights, demonstrate promising results, offering a practical approach to improving model explainability while maintaining the benefits of pretrained models. This publication presents our findings, providing insights into addressing pretrained weight biases across various deep learning tasks.
zh

[CV-57] Unlocking the Power of SAM 2 for Few-Shot Segmentation ICML’25

【速读】:该论文旨在解决少样本分割(Few-Shot Segmentation, FSS)中因样本数量有限而导致的过拟合问题,同时利用基础模型(如SAM 2)的类无关匹配能力提升分割性能。其解决方案的关键在于设计一种伪提示生成器(Pseudo Prompt Generator),通过编码伪查询记忆并与查询特征进行兼容性匹配,以弥补视频数据与FSS任务在对象身份一致性上的不兼容问题;此外,还引入了迭代记忆优化和支持校准的记忆注意力机制,以增强记忆中的查询前景特征并抑制背景干扰,从而提升分割精度。

链接: https://arxiv.org/abs/2505.14100
作者: Qianxiong Xu,Lanyun Zhu,Xuanyi Liu,Guosheng Lin,Cheng Long,Ziyue Li,Rui Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted by ICML’25

点击查看摘要

Abstract:Few-Shot Segmentation (FSS) aims to learn class-agnostic segmentation on few classes to segment arbitrary classes, but at the risk of overfitting. To address this, some methods use the well-learned knowledge of foundation models (e.g., SAM) to simplify the learning process. Recently, SAM 2 has extended SAM by supporting video segmentation, whose class-agnostic matching ability is useful to FSS. A simple idea is to encode support foreground (FG) features as memory, with which query FG features are matched and fused. Unfortunately, the FG objects in different frames of SAM 2’s video data are always the same identity, while those in FSS are different identities, i.e., the matching step is incompatible. Therefore, we design Pseudo Prompt Generator to encode pseudo query memory, matching with query features in a compatible way. However, the memories can never be as accurate as the real ones, i.e., they are likely to contain incomplete query FG, and some unexpected query background (BG) features, leading to wrong segmentation. Hence, we further design Iterative Memory Refinement to fuse more query FG features into the memory, and devise a Support-Calibrated Memory Attention to suppress the unexpected query BG features in memory. Extensive experiments have been conducted on PASCAL-5^i and COCO-20^i to validate the effectiveness of our design, e.g., the 1-shot mIoU can be 4.2% better than the best baseline.
zh

[CV-58] Generalizable Multispectral Land Cover Classification via Frequency-Aware Mixture of Low-Rank Token Experts

【速读】:该论文旨在解决多光谱土地覆盖分类(MLCC)中由于传感器差异和地理空间条件不同导致的光谱偏移问题。现有方法主要依赖领域自适应和泛化策略,但通常使用小规模模型,性能有限。论文提出的Land-MoE解决方案的关键在于通过层次化插入频率感知的低秩令牌专家混合(MoLTE)模块,以参数高效的方式微调视觉基础模型(VFMs)。该方法包含两个核心模块:MoLTE利用不同秩的令牌生成多样化的特征调整,增强对光谱偏移的鲁棒性;频率感知滤波器(FAF)则在频域上对优化后的特征进行调制,有效捕捉与语义本质强相关的频段信息并抑制无关噪声。

链接: https://arxiv.org/abs/2505.14088
作者: Xi Chen,Shen Yan,Juelin Zhu,Chen Chen,Yu Liu,Maojun Zhang
机构: National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce Land-MoE, a novel approach for multispectral land cover classification (MLCC). Spectral shift, which emerges from disparities in sensors and geospatial conditions, poses a significant challenge in this domain. Existing methods predominantly rely on domain adaptation and generalization strategies, often utilizing small-scale models that exhibit limited performance. In contrast, Land-MoE addresses these issues by hierarchically inserting a Frequency-aware Mixture of Low-rank Token Experts, to fine-tune Vision Foundation Models (VFMs) in a parameter-efficient manner. Specifically, Land-MoE comprises two key modules: the mixture of low-rank token experts (MoLTE) and frequency-aware filters (FAF). MoLTE leverages rank-differentiated tokens to generate diverse feature adjustments for individual instances within multispectral images. By dynamically combining learnable low-rank token experts of varying ranks, it enhances the robustness against spectral shifts. Meanwhile, FAF conducts frequency-domain modulation on the refined features. This process enables the model to effectively capture frequency band information that is strongly correlated with semantic essence, while simultaneously suppressing frequency noise irrelevant to the task. Comprehensive experiments on MLCC tasks involving cross-sensor and cross-geospatial setups demonstrate that Land-MoE outperforms existing methods by a large margin. Additionally, the proposed approach has also achieved state-of-the-art performance in domain generalization semantic segmentation tasks of RGB remote sensing images.
zh
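
MoLTE 的"不同秩的低秩 token 专家 + 逐 token 路由"可示意为如下残差模块(秩的取值与路由形式为假设;FAF 的频域调制从略):

```python
import torch
import torch.nn as nn

class MoLTE(nn.Module):
    """若干 LoRA 式低秩瓶颈(秩各不相同)作为专家,
    路由器对每个 token 给出专家权重,结果以残差方式加回冻结骨干的特征。"""
    def __init__(self, dim, ranks=(4, 8, 16)):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, r, bias=False),
                          nn.Linear(r, dim, bias=False))
            for r in ranks)
        self.router = nn.Linear(dim, len(ranks))

    def forward(self, x):                             # x: [B, N, D] token 特征
        gate = self.router(x).softmax(dim=-1)         # [B, N, E] 逐 token 权重
        out = torch.stack([e(x) for e in self.experts], dim=-1)   # [B, N, D, E]
        return x + (out * gate.unsqueeze(2)).sum(dim=-1)
```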

[CV-59] Large-Scale Multi-Character Interaction Synthesis

【速读】:该论文试图解决大规模多角色交互生成的问题(multi-character interaction synthesis),特别是针对角色之间协调交互(coordinated interactions)的合成与过渡规划(transition planning)。现有方法在单角色动画中表现有限,无法处理多角色间复杂的互动,而基于深度学习的方法通常仅关注双角色交互,缺乏过渡规划能力;优化方法依赖手动设计的目标函数,泛化性差;群体模拟中的交互则过于稀疏和被动。论文提出了一种条件生成式流水线,其关键在于构建一个可协调的多角色交互空间用于交互合成,并引入过渡规划网络以实现协调。实验表明该方法在多角色交互合成中有效,并具备良好的可扩展性和迁移性。

链接: https://arxiv.org/abs/2505.14087
作者: Ziyi Chang,He Wang,George Alex Koulieris,Hubert P. H. Shum
机构: Durham University (杜伦大学); University College London (伦敦大学学院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating large-scale multi-character interactions is a challenging and important task in character animation. Multi-character interactions involve not only natural interactive motions but also characters coordinated with each other for transition. For example, a dance scenario involves characters dancing with partners and also characters coordinated to new partners based on spatial and temporal observations. We term such transitions as coordinated interactions and decompose them into interaction synthesis and transition planning. Previous methods of single-character animation do not consider interactions that are critical for multiple characters. Deep-learning-based interaction synthesis usually focuses on two characters and does not consider transition planning. Optimization-based interaction synthesis relies on manually designing objective functions that may not generalize well. While crowd simulation involves more characters, their interactions are sparse and passive. We identify two challenges to multi-character interaction synthesis, including the lack of data and the planning of transitions among close and dense interactions. Existing datasets either do not have multiple characters or do not have close and dense interactions. The planning of transitions for multi-character close and dense interactions needs both spatial and temporal considerations. We propose a conditional generative pipeline comprising a coordinatable multi-character interaction space for interaction synthesis and a transition planning network for coordinations. Our experiments demonstrate the effectiveness of our proposed pipeline for multi-character interaction synthesis and the applications facilitated by our method show the scalability and transferability.
zh

[CV-60] Place Recognition: A Comprehensive Review Current Challenges and Future Directions

【速读】:该论文旨在解决车辆导航与地图构建中的位置识别(place recognition)问题,该问题的核心是使系统能够判断某一位置是否已被访问过,这对于同时定位与建图(SLAM)中的回环检测以及在不同环境条件下实现长期导航至关重要。论文提出的解决方案关键在于对当前主流方法的系统性综述,包括基于卷积神经网络(CNN)的方法、基于Transformer的框架以及跨模态策略,这些方法分别在视觉描述符学习、全局依赖建模和多源异构数据融合方面展现出优势,从而提升了位置识别的鲁棒性与泛化能力。

链接: https://arxiv.org/abs/2505.14068
作者: Zhenyu Li,Tianyi Shang,Pengjie Xu,Zhaojun Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages

点击查看摘要

Abstract:Place recognition is a cornerstone of vehicle navigation and mapping, which is pivotal in enabling systems to determine whether a location has been previously visited. This capability is critical for tasks such as loop closure in Simultaneous Localization and Mapping (SLAM) and long-term navigation under varying environmental conditions. In this survey, we comprehensively review recent advancements in place recognition, emphasizing three representative methodological paradigms: Convolutional Neural Network (CNN)-based approaches, Transformer-based frameworks, and cross-modal strategies. We begin by elucidating the significance of place recognition within the broader context of autonomous systems. Subsequently, we trace the evolution of CNN-based methods, highlighting their contributions to robust visual descriptor learning and scalability in large-scale environments. We then examine the emerging class of Transformer-based models, which leverage self-attention mechanisms to capture global dependencies and offer improved generalization across diverse scenes. Furthermore, we discuss cross-modal approaches that integrate heterogeneous data sources such as Lidar, vision, and text description, thereby enhancing resilience to viewpoint, illumination, and seasonal variations. We also summarize standard datasets and evaluation metrics widely adopted in the literature. Finally, we identify current research challenges and outline prospective directions, including domain adaptation, real-time performance, and lifelong learning, to inspire future advancements in this domain. The unified framework of leading-edge place recognition methods, i.e., code library, and the results of their experimental evaluations are available at this https URL.
zh

[CV-61] Scaling Vision Mamba Across Resolutions via Fractal Traversal

【速读】:该论文旨在解决Vision Mamba在处理视觉输入时面临的两个主要问题:2D到1D块序列化导致的局部空间连续性破坏以及跨输入分辨率的可扩展性不足。其解决方案的关键在于引入基于分形的块序列化方法,通过Hilbert曲线保持空间局部性,并实现无缝的分辨率适应性;同时,通过Cross-State Routing(CSR)机制增强长程依赖关系的传播,并利用Positional-Relation Capture(PRC)模块恢复由曲线拐点破坏的局部邻接关系。

链接: https://arxiv.org/abs/2505.14062
作者: Bo Li,Haoke Xiao,Lv Tang
机构: vivo Mobile Communication Co., Ltd (维沃移动通信有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progressing

点击查看摘要

Abstract:Vision Mamba has recently emerged as a promising alternative to Transformer-based architectures, offering linear complexity in sequence length while maintaining strong modeling capacity. However, its adaptation to visual inputs is hindered by challenges in 2D-to-1D patch serialization and weak scalability across input resolutions. Existing serialization strategies such as raster scanning disrupt local spatial continuity and limit the model’s ability to generalize across scales. In this paper, we propose FractalMamba++, a robust vision backbone that leverages fractal-based patch serialization via Hilbert curves to preserve spatial locality and enable seamless resolution adaptability. To address long-range dependency fading in high-resolution inputs, we further introduce a Cross-State Routing (CSR) mechanism that enhances global context propagation through selective state reuse. Additionally, we propose a Positional-Relation Capture (PRC) module to recover local adjacency disrupted by curve inflection points. Extensive experiments on image classification, semantic segmentation, object detection, and change detection demonstrate that FractalMamba++ consistently outperforms previous Mamba-based backbones, particularly under high-resolution settings.
zh
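
基于 Hilbert 曲线的块序列化是论文保持空间局部性的关键,经典的 (x, y)→d 映射只需十几行(要求网格边长为 2 的幂;非 2 的幂分辨率需先填充或改用其他曲线变体,CSR、PRC 模块从略):

```python
def hilbert_index(n, x, y):
    """把 n×n 网格(n 为 2 的幂)中的坐标 (x, y) 映射为 Hilbert 曲线上的序号;
    序号相邻的格子在空间上也相邻,这正是其优于光栅扫描之处。"""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # 旋转/翻转象限,保持子曲线方向一致
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_order(n):
    """给出 n×n 块网格(行优先编号)的序列化顺序。"""
    return sorted(range(n * n), key=lambda i: hilbert_index(n, i % n, i // n))

print(hilbert_order(4))  # 16 个块沿 Hilbert 曲线的访问顺序
```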

[CV-62] Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting ACL2025

【速读】:该论文旨在解决文档图像解析(Document Image Parsing)中由于文本段落、图表、公式和表格等复杂交织元素带来的挑战,现有方法在集成开销、效率瓶颈和版式结构退化方面存在不足。其解决方案的关键在于提出一种名为Dolphin的新型多模态文档图像解析模型,该模型采用“分析-解析”范式,在第一阶段生成按阅读顺序排列的布局元素序列,这些异构元素作为锚点并与任务特定提示耦合,在第二阶段进行并行内容解析,从而提升解析效率与准确性。

链接: https://arxiv.org/abs/2505.14059
作者: Hao Feng,Shu Wei,Xiang Fei,Wei Shi,Yingdong Han,Lei Liao,Jinghui Lu,Binghong Wu,Qi Liu,Chunhui Lin,Jingqun Tang,Hao Liu,Can Huang
机构: ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACL 2025

点击查看摘要

Abstract:Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at this https URL
zh

[CV-63] Learning Concept-Driven Logical Rules for Interpretable and Generalizable Medical Image Classification MICCAI2025

【速读】:该论文试图解决基于概念的医学影像模型在临床应用中面临的决策安全性问题,特别是概念泄露(concept leakage)和缺乏全局决策逻辑解释的问题。概念泄露导致软概念表示中的非预期信息损害了模型的可解释性和泛化能力,而现有方法多关注局部解释(instance-level),忽视了数据集级别的全局决策逻辑。解决方案的关键在于提出一种名为Concept Rule Learner (CRL) 的新框架,该框架通过从二值化视觉概念中学习布尔逻辑规则,利用逻辑层捕捉概念间的相关性,并提取具有临床意义的规则,从而实现局部和全局的可解释性。

链接: https://arxiv.org/abs/2505.14049
作者: Yibo Gao,Hangqi Zhou,Zheyao Gao,Bomin Wang,Shangqi Gao,Sihan Wang,Xiahai Zhuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: early accepted by MICCAI 2025

点击查看摘要

Abstract:The pursuit of decision safety in clinical applications highlights the potential of concept-based methods in medical imaging. While these models offer active interpretability, they often suffer from concept leakages, where unintended information within soft concept representations undermines both interpretability and generalizability. Moreover, most concept-based models focus solely on local explanations (instance-level), neglecting the global decision logic (dataset-level). To address these limitations, we propose Concept Rule Learner (CRL), a novel framework to learn Boolean logical rules from binarized visual concepts. CRL employs logical layers to capture concept correlations and extract clinically meaningful rules, thereby providing both local and global interpretability. Experiments on two medical image classification tasks show that CRL achieves competitive performance with existing methods while significantly improving generalizability to out-of-distribution data. The code of our work is available at this https URL.
zh

[CV-64] Selective Structured State Space for Multispectral-fused Small Target Detection

【速读】:该论文旨在解决高分辨率遥感图像中目标检测的问题,特别是小目标识别准确率低和计算成本高的挑战。其解决方案的关键在于利用Mamba架构的线性复杂度以提高效率,同时通过引入Enhanced Small Target Detection (ESTD)模块和Convolutional Attention Residual Gate (CARG)模块,增强模型对小目标的局部细节捕捉能力和全局注意力机制,从而提升小目标的检测性能。

链接: https://arxiv.org/abs/2505.14043
作者: Qianqian Zhang,WeiJun Wang,Yunxing Liu,Li Zhou,Hao Zhao,Junshe An,Zihan Wang
机构: National Space Science Center, Chinese Academy of Sciences, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, China; Institute for AI Industry Research (AIR), Tsinghua University, China; School of Astronomy and Space Science, University of Chinese Academy of Sciences, China; School of Computing, National University of Singapore
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Target detection in high-resolution remote sensing imagery faces challenges due to the low recognition accuracy of small targets and high computational costs. The computational complexity of the Transformer architecture increases quadratically with image resolution, while Convolutional Neural Networks (CNN) architectures are forced to stack deeper convolutional layers to expand their receptive fields, leading to an explosive growth in computational demands. To address these computational constraints, we leverage Mamba’s linear complexity for efficiency. However, Mamba’s performance declines for small targets, primarily because small targets occupy a limited area in the image and have limited semantic information. Accurate identification of these small targets necessitates not only Mamba’s global attention capabilities but also the precise capture of fine local details. To this end, we enhance Mamba by developing the Enhanced Small Target Detection (ESTD) module and the Convolutional Attention Residual Gate (CARG) module. The ESTD module bolsters local attention to capture fine-grained details, while the CARG module, built upon Mamba, emphasizes spatial and channel-wise information, collectively improving the model’s ability to capture distinctive representations of small targets. Additionally, to highlight the semantic representation of small targets, we design a Mask Enhanced Pixel-level Fusion (MEPF) module for multispectral fusion, which enhances target features by effectively fusing visible and infrared multimodal information.
zh

[CV-65] Adversarially Pretrained Transformers may be Universally Robust In-Context Learners

【速读】:该论文试图解决对抗训练(adversarial training)计算成本高昂的问题,其解决方案的关键在于利用在多样化任务上进行对抗预训练的Transformer模型作为稳健的基础模型,从而在下游任务中无需进行额外的对抗训练。研究通过理论证明,单个对抗预训练的Transformer可以通过上下文学习(in-context learning)鲁棒地泛化到多个未见过的任务,而无需任何参数更新,其鲁棒性来源于模型对稳健特征的关注以及对利用非预测性特征的攻击的抵抗能力。

链接: https://arxiv.org/abs/2505.14042
作者: Soichiro Kumano,Hiroshi Kera,Toshihiko Yamasaki
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we show that transformers adversarially pretrained on diverse tasks can serve as robust foundation models and eliminate the need for adversarial training in downstream tasks. Specifically, we theoretically demonstrate that through in-context learning, a single adversarially pretrained transformer can robustly generalize to multiple unseen tasks without any additional training, i.e., without any parameter updates. This robustness stems from the model’s focus on robust features and its resistance to attacks that exploit non-predictive features. Besides these positive findings, we also identify several limitations. Under certain conditions (though unrealistic), no universally robust single-layer transformers exist. Moreover, robust transformers exhibit an accuracy–robustness trade-off and require a large number of in-context demonstrations. The code is available at this https URL.
zh

[CV-66] AppleGrowthVision: A large-scale stereo dataset for phenological analysis fruit detection and 3D reconstruction in apple orchards

【速读】:该论文旨在解决苹果园监测中由于数据集限制而导致的精度农业应用不足的问题,特别是缺乏多样化、现实的图像数据集以及密集、异质场景标注的困难。其关键解决方案是提出AppleGrowthVision,这是一个大规模数据集,包含两个子集:第一部分包含来自德国勃兰登堡农场的9,317张高分辨率立体图像,覆盖六个农业验证的生长阶段;第二部分包含来自同一农场及德国皮尔尼茨农场的1,125张密集标注图像,共包含31,084个苹果标签。该数据集提供了具有农业验证生长阶段的立体图像数据,支持精确的物候分析和三维重建,并通过扩展MinneApple和MAD提升了目标检测模型的性能。

链接: https://arxiv.org/abs/2505.14029
作者: Laura-Sophia von Hirschhausen,Jannes S. Magnusson,Mykyta Kovalenko,Fredrik Boye,Tanay Rawat,Peter Eisert,Anna Hilsmann,Sebastian Pretzsch,Sebastian Bosse
机构: Fraunhofer HHI (弗劳恩霍夫研究所); Fraunhofer IVI (弗劳恩霍夫研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning has transformed computer vision for precision agriculture, yet apple orchard monitoring remains limited by dataset constraints: the lack of diverse, realistic datasets and the difficulty of annotating dense, heterogeneous scenes. Existing datasets overlook different growth stages and stereo imagery, both essential for realistic 3D modeling of orchards and tasks like fruit localization, yield estimation, and structural analysis. To address these gaps, we present AppleGrowthVision, a large-scale dataset comprising two subsets. The first includes 9,317 high-resolution stereo images collected from a farm in Brandenburg (Germany), covering six agriculturally validated growth stages over a full growth cycle. The second subset consists of 1,125 densely annotated images from the same farm in Brandenburg and one in Pillnitz (Germany), containing a total of 31,084 apple labels. AppleGrowthVision provides stereo-image data with agriculturally validated growth stages, enabling precise phenological analysis and 3D reconstructions. Extending MinneApple with our data improves YOLOv8 performance by 7.69% in terms of F1-score, while adding it to MinneApple and MAD boosts Faster R-CNN F1-score by 31.06%. Additionally, six BBCH stages were predicted with over 95% accuracy using VGG16, ResNet152, DenseNet201, and MobileNetv2. AppleGrowthVision bridges the gap between agricultural science and computer vision, by enabling the development of robust models for fruit detection, growth modeling, and 3D analysis in precision agriculture. Future work includes improving annotation, enhancing 3D reconstruction, and extending multimodal analysis across all growth stages.
zh

[CV-67] OmniStyle: Filtering High Quality Style Transfer Data at Scale CVPR2025

【速读】:该论文旨在解决高质量、可控的风格迁移(style transfer)问题,特别是在大规模数据支持下的高效模型训练与精确风格控制。其解决方案的关键在于构建了一个大规模的成对风格迁移数据集OmniStyle-1M,该数据集包含超过一百万对内容-风格-风格化图像三元组,并通过文本描述和指令提示增强多样性与可控性;同时引入OmniFilter质量评估框架以确保数据集的高质量,进而基于Diffusion Transformer(DiT)架构提出OmniStyle框架,实现了高分辨率、细节丰富的风格迁移效果。

链接: https://arxiv.org/abs/2505.14028
作者: Ye Wang,Ruiqi Liu,Jiang Lin,Fei Liu,Zili Yi,Yilin Wang,Rui Ma
机构: Jilin University (吉林大学); Nanjing University (南京大学); ByteDance (字节跳动); Adobe (Adobe); Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China (教育部知识驱动人机智能工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:In this paper, we introduce OmniStyle-1M, a large-scale paired style transfer dataset comprising over one million content-style-stylized image triplets across 1,000 diverse style categories, each enhanced with textual descriptions and instruction prompts. We show that OmniStyle-1M can not only enable efficient and scalable supervised training of style transfer models but also facilitate precise control over target stylization. In particular, to ensure the quality of the dataset, we introduce OmniFilter, a comprehensive style transfer quality assessment framework, which filters high-quality triplets based on content preservation, style consistency, and aesthetic appeal. Building upon this foundation, we propose OmniStyle, a framework based on the Diffusion Transformer (DiT) architecture designed for high-quality and efficient style transfer. This framework supports both instruction-guided and image-guided style transfer, generating high-resolution outputs with exceptional detail. Extensive qualitative and quantitative evaluations demonstrate OmniStyle’s superior performance compared to existing approaches, highlighting its efficiency and versatility. OmniStyle-1M and its accompanying methodologies provide a significant contribution to advancing high-quality style transfer, offering a valuable resource for the research community.
zh

[CV-68] Towards Efficient Multi-Scale Deformable Attention on NPU

【速读】:该论文旨在解决多尺度可变形注意力(Multi-scale Deformable Attention, MSDA)在视觉任务中因随机访问网格采样策略导致的优化难题,特别是在领域专用加速器如NPU上的性能瓶颈问题。其解决方案的关键在于提出一种协同设计方法,系统性地重新思考MSDA在Ascend NPU架构上的内存访问与计算策略,从而实现高效的前向和反向计算,并适配训练工作负载,同时集成一系列硬件感知优化技术。
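为便于理解 MSDA 的随机访问网格采样为何对访存敏感,下面给出一个基于 PyTorch `F.grid_sample` 的单尺度参考实现草图(即论文所对比的 grid sample 基线思路,并非其 NPU 内核;张量形状与函数名均为示意假设):

```python
import torch
import torch.nn.functional as F

def msda_reference(value, sampling_locations, attention_weights):
    """多尺度可变形注意力的单尺度参考实现(示意)。
    value: (N, C, H, W) 特征图
    sampling_locations: (N, Lq, P, 2),归一化到 [0, 1] 的采样坐标
    attention_weights: (N, Lq, P),每个采样点的注意力权重
    返回: (N, Lq, C)
    """
    grid = 2.0 * sampling_locations - 1.0          # grid_sample 需要 [-1, 1] 坐标
    sampled = F.grid_sample(value, grid, mode="bilinear",
                            padding_mode="zeros", align_corners=False)  # (N, C, Lq, P)
    out = (sampled * attention_weights[:, None]).sum(-1)                # (N, C, Lq)
    return out.transpose(1, 2)

value = torch.randn(2, 64, 32, 32)
locs = torch.rand(2, 100, 4, 2)
weights = torch.softmax(torch.randn(2, 100, 4), dim=-1)
print(msda_reference(value, locs, weights).shape)  # torch.Size([2, 100, 64])
```

论文的优化核心正是在 Ascend NPU 上重排这类随机采样的访存与计算模式,以消除其带来的停顿。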

链接: https://arxiv.org/abs/2505.14022
作者: Chenghuan Huang,Zhigeng Xu,Chong Sun,Chen Li,Ziyang Ma
机构: WeChat HPC, Tencent Inc.(微信高性能计算,腾讯公司); WeChat Vision, Tencent Inc.(微信视觉,腾讯公司)
类目: Performance (cs.PF); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures

点击查看摘要

Abstract:Multi-scale deformable attention (MSDA) is a flexible and powerful feature extraction mechanism for visual tasks, but its random-access grid sampling strategy poses significant optimization challenges, especially on domain-specific accelerators such as NPUs. In this work, we present a co-design approach that systematically rethinks memory access and computation strategies for MSDA on the Ascend NPU architecture. With this co-design approach, our implementation supports both efficient forward and backward computation, is fully adapted for training workloads, and incorporates a suite of hardware-aware optimizations. Extensive experiments show that our solution achieves up to 5.9x (forward), 8.9x (backward), and 7.3x (end-to-end training) speedup over the grid sample-based baseline, and 1.9x, 2.4x, and 2.0x acceleration over the latest vendor library, respectively.
zh

[CV-69] Adversarial Training from Mean Field Perspective NEURIPS23

【速读】:该论文试图解决对抗训练在随机深度神经网络中的训练动力学不明确的问题,特别是针对对抗样本的鲁棒性提升。其解决方案的关键在于提出了一种基于平均场理论的新理论框架,该框架克服了现有基于平均场方法的局限性,并在此基础上推导出不同ℓ_p和ℓ_q范数下对抗损失的紧致上界。此外,研究还揭示了无跳跃连接的网络通常无法进行对抗训练,以及对抗训练会降低网络容量,同时指出网络宽度可以缓解这些问题。

链接: https://arxiv.org/abs/2505.14021
作者: Soichiro Kumano,Hiroshi Kera,Toshihiko Yamasaki
机构: The University of Tokyo(东京大学); Chiba University(千叶大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: NeurIPS23

点击查看摘要

Abstract:Although adversarial training is known to be effective against adversarial examples, training dynamics are not well understood. In this study, we present the first theoretical analysis of adversarial training in random deep neural networks without any assumptions on data distributions. We introduce a new theoretical framework based on mean field theory, which addresses the limitations of existing mean field-based approaches. Based on this framework, we derive (empirically tight) upper bounds of the ℓ_q norm-based adversarial loss with ℓ_p norm-based adversarial examples for various values of p and q. Moreover, we prove that networks without shortcuts are generally not adversarially trainable and that adversarial training reduces network capacity. We also show that network width alleviates these issues. Furthermore, we present the various impacts of the input and output dimensions on the upper bounds and time evolution of the weight variance.
zh

[CV-70] EGFormer: Towards Efficient and Generalizable Multimodal Semantic Segmentation

【速读】:该论文旨在解决多模态语义分割中模型计算效率不足的问题,尽管现有方法主要关注提升分割精度,但其计算效率尚未得到充分研究。解决方案的关键在于提出EGFormer框架,该框架通过引入两个创新模块实现高效多模态融合:首先,Any-modal Scoring Module (ASM) 为每个模态独立分配重要性评分,支持基于特征图的动态排序;其次,Modal Dropping Module (MDM) 在每一步骤中过滤掉信息量较少的模态,仅保留并聚合最有价值的特征。这种设计使模型能够在不牺牲性能的前提下,显著减少参数量和推理时间,并提升跨域适应能力。
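以下是 ASM/MDM 两个模块思想的极简 PyTorch 草图(打分网络结构、丢弃策略等细节均为假设,并非官方实现):

```python
import torch
import torch.nn as nn

class AnyModalScoring(nn.Module):
    """示意版 ASM:对每个模态的特征图独立打分(打分网络结构为假设)。"""
    def __init__(self, channels):
        super().__init__()
        self.scorer = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(channels, 1))

    def forward(self, feats):            # feats: 各模态特征列表,每个 (B, C, H, W)
        scores = torch.cat([self.scorer(f) for f in feats], dim=1)  # (B, M)
        return torch.softmax(scores, dim=1)

def modal_dropping(feats, scores, keep_ratio=0.75):
    """示意版 MDM:丢弃批内平均得分最低的模态,其余按得分加权聚合。"""
    keep = max(1, int(len(feats) * keep_ratio))
    idx = scores.mean(0).topk(keep).indices.tolist()
    w = torch.softmax(scores[:, idx], dim=1)                  # (B, keep)
    stacked = torch.stack([feats[i] for i in idx], dim=1)     # (B, keep, C, H, W)
    return (stacked * w[:, :, None, None, None]).sum(1)

feats = [torch.randn(2, 32, 16, 16) for _ in range(4)]
asm = AnyModalScoring(32)
print(modal_dropping(feats, asm(feats)).shape)  # torch.Size([2, 32, 16, 16])
```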

链接: https://arxiv.org/abs/2505.14014
作者: Zelin Zhang,Tao Zhang,Kedi Li,Xu Zheng
机构: University of Sydney(悉尼大学); University of Technology Sydney(悉尼科技大学); HKUST(GZ)(香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent efforts have explored multimodal semantic segmentation using various backbone architectures. However, while most methods aim to improve accuracy, their computational efficiency remains underexplored. To address this, we propose EGFormer, an efficient multimodal semantic segmentation framework that flexibly integrates an arbitrary number of modalities while significantly reducing model parameters and inference time without sacrificing performance. Our framework introduces two novel modules. First, the Any-modal Scoring Module (ASM) assigns importance scores to each modality independently, enabling dynamic ranking based on their feature maps. Second, the Modal Dropping Module (MDM) filters out less informative modalities at each stage, selectively preserving and aggregating only the most valuable features. This design allows the model to leverage useful information from all available modalities while discarding redundancy, thus ensuring high segmentation quality. In addition to efficiency, we evaluate EGFormer on a synthetic-to-real transfer task to demonstrate its generalizability. Extensive experiments show that EGFormer achieves competitive performance with up to 88 percent reduction in parameters and 50 percent fewer GFLOPs. Under unsupervised domain adaptation settings, it further achieves state-of-the-art transfer performance compared to existing methods.
zh

[CV-71] UHD Image Dehazing via anDehazeFormer with Atmospheric-aware KV Cache

【速读】:该论文旨在解决超高清(UHD)图像去雾任务中现有方法存在的训练速度慢和内存消耗高的问题。其关键解决方案包括:1)受nGPT架构启发的自适应归一化机制,使网络在参数表达范围受限的情况下实现超快速且稳定的训练;2)一种基于物理雾霾形成模型的感知大气散射的KV缓存机制,能够动态优化特征保留。这些创新显著提升了训练收敛速度并降低了内存开销,从而实现了在RTX4090 GPU上每秒处理50张高分辨率图像的实时性能。

链接: https://arxiv.org/abs/2505.14010
作者: Pu Wang,Pengwen Dai,Chen Wu,Yeying Jin,Dianjie Lu,Guijuan Zhang,Youshan Zhang,Zhuoran Zheng
机构: Shandong University (山东大学); Sun Yat-sen University (中山大学); USTC (中国科学技术大学); Tencent (腾讯); SDNU (山东师范大学); Yeshiva University (叶史瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review

点击查看摘要

Abstract:In this paper, we propose an efficient visual transformer framework for ultra-high-definition (UHD) image dehazing that addresses the key challenges of slow training speed and high memory consumption in existing methods. Our approach introduces two key innovations: 1) an adaptive normalization mechanism inspired by the nGPT architecture that enables ultra-fast and stable training with a network with a restricted range of parameter expressions; and 2) an atmospheric scattering-aware KV caching mechanism that dynamically optimizes feature preservation based on the physical haze formation model. The proposed architecture improves the training convergence speed by 5x while reducing memory overhead, enabling real-time processing of 50 high-resolution images per second on an RTX4090 GPU. Experimental results show that our approach maintains state-of-the-art dehazing quality while significantly improving computational efficiency for 4K/8K image restoration tasks. Furthermore, we provide a new interpretability method for dehazing with the help of an integrated gradient attribution map. Our code can be found here: this https URL.
zh

[CV-72] Multi-Label Stereo Matching for Transparent Scene Depth Estimation

【速读】:该论文旨在解决透明场景中同时估计透明物体深度和被遮挡背景深度的多标签立体匹配问题(multi-label stereo matching)。传统方法假设视差维度上服从单峰分布,并将匹配问题建模为单标签回归,无法准确处理透明场景中的多深度值问题。该解决方案的关键在于提出一种多标签回归框架,通过像素级的多元高斯表示(pixel-wise multivariate Gaussian representation)来编码同一像素处的多个深度值,其中均值向量表示深度信息,协方差矩阵用于判断是否需要多标签表示。该方法在GRU框架内迭代预测,有效提升了透明表面深度估计的性能,同时保留了背景信息以支持场景重建。
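下面用 PyTorch 勾勒“逐像素多元高斯表示”的思想:均值向量编码同一像素的多个深度,协方差用于判断是否需要多标签表示(网络结构与判据形式均为假设,非论文实现):

```python
import torch
import torch.nn as nn

K = 2  # 每个像素最多表示的深度标签数(假设)

class PixelGaussianHead(nn.Module):
    """示意:为每个像素预测 K 维均值向量与 K x K 协方差(Cholesky 参数化保证半正定)。"""
    def __init__(self, in_ch):
        super().__init__()
        self.mean = nn.Conv2d(in_ch, K, 1)
        self.chol = nn.Conv2d(in_ch, K * K, 1)      # 下三角因子 L,Sigma = L @ L^T

    def forward(self, feat):                         # feat: (B, C, H, W)
        mu = self.mean(feat)                         # (B, K, H, W),多个深度值
        L = self.chol(feat).reshape(feat.size(0), K, K, *feat.shape[-2:])
        L = torch.tril(L.permute(0, 3, 4, 1, 2))     # (B, H, W, K, K)
        return mu, L @ L.transpose(-1, -2)

def needs_multilabel(sigma, thresh=1.0):
    """协方差的迹较大时视为需要多标签表示(判据形式为假设)。"""
    return sigma.diagonal(dim1=-2, dim2=-1).sum(-1) > thresh  # (B, H, W) 布尔掩码

head = PixelGaussianHead(64)
mu, sigma = head(torch.randn(1, 64, 8, 8))
print(mu.shape, sigma.shape, needs_multilabel(sigma).float().mean())
```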

链接: https://arxiv.org/abs/2505.14008
作者: Zhidan Liu,Chengtang Yao,Jiaxi Zeng,Yuwei Wu,Yunde Jia
机构: Beijing Institute of Technology (北京理工大学); Shenzhen MSU-BIT University (深圳美中大学-比特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we present a multi-label stereo matching method to simultaneously estimate the depth of transparent objects and the occluded background in transparent scenes. Unlike previous methods that assume a unimodal distribution along the disparity dimension and formulate the matching as a single-label regression problem, we propose a multi-label regression formulation to estimate multiple depth values at the same pixel in transparent scenes. To resolve the multi-label regression problem, we introduce a pixel-wise multivariate Gaussian representation, where the mean vector encodes multiple depth values at the same pixel, and the covariance matrix determines whether a multi-label representation is necessary for a given pixel. The representation is iteratively predicted within a GRU framework. In each iteration, we first predict the update step for the mean parameters and then use both the update step and the updated mean parameters to estimate the covariance matrix. We also synthesize a dataset containing 10 scenes and 89 objects to validate the performance of transparent scene depth estimation. The experiments show that our method greatly improves the performance on transparent surfaces while preserving the background information for scene reconstruction. Code is available at this https URL.
zh

[CV-73] StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning

【速读】:该论文旨在解决视频类增量学习(Video Class-Incremental Learning, VCIL)中的灾难性遗忘问题,即在持续学习新动作类别时如何有效保留先前知识。与传统类增量学习(Class-Incremental Learning, CIL)不同,VCIL需要处理时空结构的复杂性,这对模型在保持空间语义和时间动态方面提出了更高要求。该论文提出的解决方案关键在于设计了一个无需示例重放的统一框架——时空保全与路由(Spatiotemporal Preservation and Routing, StPR),其核心包括:通过帧共享语义蒸馏(Frame-Shared Semantics Distillation, FSSD)选择性地保留重要语义通道以维持先验知识,并利用基于时间分解的专家混合(Temporal Decomposition-based Mixture-of-Experts, TD-MoE)动态路由任务相关专家,从而实现无需任务ID或存储示例的推理。
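以下草图示意 FSSD 的通道重要性打分与选择性正则化思想(打分的具体组合方式与超参数均为假设):

```python
import torch
import torch.nn.functional as F

def channel_importance(feat, logits, target):
    """示意版 FSSD 打分:语义敏感度(梯度)与分类贡献(激活)逐元素相乘,
    再在通道以外的维度上取平均(组合方式为假设)。feat 需 requires_grad。"""
    loss = F.cross_entropy(logits, target)
    grad = torch.autograd.grad(loss, feat, retain_graph=True)[0]
    return (grad * feat).abs().mean(dim=(0, 2, 3, 4))   # (C,),逐通道得分

def fssd_regularizer(feat_new, feat_old, score, top_ratio=0.3):
    """仅对最重要的通道施加保持旧知识的 L2 约束,其余通道自由适应新任务(示意)。"""
    k = max(1, int(score.numel() * top_ratio))
    idx = score.topk(k).indices
    return ((feat_new[:, idx] - feat_old[:, idx]) ** 2).mean()

# 随机数据演示接口:feat 为 (B, C, T, H, W) 的视频特征
feat = torch.randn(2, 8, 2, 4, 4, requires_grad=True)
logits = feat.mean(dim=(2, 3, 4)) @ torch.randn(8, 5)
score = channel_importance(feat, logits, torch.tensor([0, 3]))
print(score.shape, fssd_regularizer(feat, feat.detach() + 0.1, score).item())
```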

链接: https://arxiv.org/abs/2505.13997
作者: Huaijie Wang,De Cheng,Guozhang Li,Zhipeng Xu,Lingfeng He,Jie Li,Nannan Wang,Xinbo Gao
机构: Xidian University (西安电子科技大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video Class-Incremental Learning (VCIL) seeks to develop models that continuously learn new action categories over time without forgetting previously acquired knowledge. Unlike traditional Class-Incremental Learning (CIL), VCIL introduces the added complexity of spatiotemporal structures, making it particularly challenging to mitigate catastrophic forgetting while effectively capturing both frame-shared semantics and temporal dynamics. Existing approaches either rely on exemplar rehearsal, raising concerns over memory and privacy, or adapt static image-based methods that neglect temporal modeling. To address these limitations, we propose Spatiotemporal Preservation and Routing (StPR), a unified and exemplar-free VCIL framework that explicitly disentangles and preserves spatiotemporal information. First, we introduce Frame-Shared Semantics Distillation (FSSD), which identifies semantically stable and meaningful channels by jointly considering semantic sensitivity and classification contribution. These important semantic channels are selectively regularized to maintain prior knowledge while allowing for adaptation. Second, we design a Temporal Decomposition-based Mixture-of-Experts (TD-MoE), which dynamically routes task-specific experts based on their temporal dynamics, enabling inference without task ID or stored exemplars. Together, StPR effectively leverages spatial semantics and temporal dynamics, achieving a unified, exemplar-free VCIL framework. Extensive experiments on UCF101, HMDB51, and Kinetics400 show that our method outperforms existing baselines while offering improved interpretability and efficiency in VCIL. Code is available in the supplementary materials.
zh

[CV-74] Every Pixel Tells a Story: End-to-End Urdu Newspaper OCR

【速读】:该论文旨在解决乌尔都语报纸中光学字符识别(OCR)的挑战,包括复杂的多列布局、低分辨率档案扫描和多样化的字体风格。其解决方案的关键在于构建一个端到端的处理流程,包含四个核心模块:文章分割、图像超分辨率、列分割和文本识别。通过微调YOLOv11x进行文章和列分割,以及使用SwinIR模型提升图像质量,并在文本识别阶段对比多种大语言模型(LLM),最终实现了高效的OCR性能。

链接: https://arxiv.org/abs/2505.13943
作者: Samee Arif,Sualeha Farid
机构: University of Michigan - Ann Arbor (密歇根大学-安娜堡分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces a comprehensive end-to-end pipeline for Optical Character Recognition (OCR) on Urdu newspapers. In our approach, we address the unique challenges of complex multi-column layouts, low-resolution archival scans, and diverse font styles. Our process decomposes the OCR task into four key modules: (1) article segmentation, (2) image super-resolution, (3) column segmentation, and (4) text recognition. For article segmentation, we fine-tune and evaluate YOLOv11x to identify and separate individual articles from cluttered layouts. Our model achieves a precision of 0.963 and mAP@50 of 0.975. For super-resolution, we fine-tune and benchmark the SwinIR model (reaching 32.71 dB PSNR) to enhance the quality of degraded newspaper scans. To do our column segmentation, we use YOLOv11x to separate columns in text to further enhance performance - this model reaches a precision of 0.970 and mAP@50 of 0.975. In the text recognition stage, we benchmark a range of LLMs from different families, including Gemini, GPT, Llama, and Claude. The lowest WER of 0.133 is achieved by Gemini-2.5-Pro.
zh

[CV-75] LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts

【速读】:该论文旨在解决长视频文本检索(long video-text retrieval)中存在的基准数据集限制问题,包括视频时长有限、字幕质量低以及标注粒度粗等问题,这些问题阻碍了先进检索方法的评估。其解决方案的关键在于引入LoVR基准,包含467个长视频和超过40,804个细粒度片段,并采用一种高效的字幕生成框架,该框架结合了视觉语言模型(VLM)自动生成、字幕质量评分和动态优化,以提升标注准确性并保持可扩展性。此外,还提出了一种语义融合方法,用于生成连贯的全视频字幕,同时保留重要上下文信息。

链接: https://arxiv.org/abs/2505.13928
作者: Qifeng Cai,Hao Liang,Hejun Dong,Meiyi Qiang,Ruichuan An,Zhaoyang Han,Zhengzhou Zhu,Bin Cui,Wentao Zhang
机构: East China Normal University (华东师范大学); Peking University (北京大学); Beihang University (北京航空航天大学); Beijing Institute of Technology (北京理工大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at this https URL
zh

[CV-76] An Explorative Analysis of SVM Classifier and ResNet50 Architecture on African Food Classification

【速读】:该论文试图解决非洲食物识别系统研究不足的问题,尤其是在深度学习与传统机器学习方法在非洲食物分类中的应用方面缺乏系统评估。其解决方案的关键在于对比微调的ResNet50模型与支持向量机(Support Vector Machine, SVM)分类器在六个非洲常见食物类别上的性能,并通过混淆矩阵、F1分数、准确率、召回率和精确率五个关键评估指标进行分析,以揭示两种方法的优势与局限性。
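两条技术路线的接口可用如下草图对照(预训练权重加载与训练循环从略,数据为随机张量,仅演示流程):

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

NUM_CLASSES = 6  # 六类非洲食物

# 路线一:微调 ResNet50 —— 替换分类头后按常规监督方式训练(训练循环从略)
resnet = models.resnet50(weights=None)   # 实际应加载 ImageNet 预训练权重再微调
resnet.fc = nn.Linear(resnet.fc.in_features, NUM_CLASSES)

# 路线二:冻结的 ResNet50 做特征提取,再训练 SVM 分类器
backbone = nn.Sequential(*list(resnet.children())[:-1]).eval()

@torch.no_grad()
def extract(x):                          # x: (B, 3, 224, 224)
    return backbone(x).flatten(1)        # (B, 2048) 特征向量

# 随机张量仅演示接口;实际应替换为数据集中的图像与标签
X = extract(torch.randn(8, 3, 224, 224)).numpy()
y = [0, 1, 2, 3, 4, 5, 0, 1]
svm = SVC(kernel="rbf").fit(X, y)
print(svm.predict(X[:2]))
```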

链接: https://arxiv.org/abs/2505.13923
作者: Chinedu Emmanuel Mbonu,Kenechukwu Anigbogu,Doris Asogwa,Tochukwu Belonwu
机构: Nazarbayev University (纳扎尔巴耶夫大学); Nnamdi Azikiwe University (恩南迪·阿齐基韦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 9 figures

点击查看摘要

Abstract:Food recognition systems have advanced significantly for Western cuisines, yet their application to African foods remains underexplored. This study addresses this gap by evaluating both deep learning and traditional machine learning methods for African food classification. We compared the performance of a fine-tuned ResNet50 model with a Support Vector Machine (SVM) classifier. The dataset comprises 1,658 images across six food categories commonly found in Africa. To assess model effectiveness, we utilize five key evaluation metrics: confusion matrix, F1-score, accuracy, recall, and precision. Our findings offer valuable insights into the strengths and limitations of both approaches, contributing to the advancement of food recognition for African cuisines.
zh

[CV-77] Blind Restoration of High-Resolution Ultrasound Video

【速读】:该论文旨在解决超声视频图像中信号噪声比低(low signal-to-noise ratio, SNR)和分辨率有限的问题,这些问题限制了其在临床诊断与分析中的应用。此外,设备和采集设置的差异导致数据分布和噪声水平不同,进一步降低了预训练模型的泛化能力。论文提出的解决方案是基于自监督学习的超声视频超分辨率算法Deep Ultrasound Prior (DUP),其关键在于采用视频自适应的神经网络优化过程,在无需配对训练数据的情况下提升超声视频的分辨率并同时去除噪声。
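DUP 的“视频自适应优化”与 Deep Image Prior 思路相近,可用如下极简草图示意(网络结构与损失设计均为假设,非论文实现):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SCALE = 2

class TinySRNet(nn.Module):
    """极简超分网络,仅用于示意 DUP 式的“逐视频”自适应优化(结构为假设)。"""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, SCALE * SCALE, 3, padding=1),
            nn.PixelShuffle(SCALE))

    def forward(self, x):
        return self.body(x)

lr_video = torch.rand(16, 1, 64, 64)      # 16 帧低分辨率超声视频(随机数据示意)
net = TinySRNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(200):                   # 仅针对这一段视频优化,无需配对训练数据
    sr = net(lr_video)                    # (16, 1, 128, 128)
    # 自监督一致性:超分结果下采样后应还原观测帧;网络结构先验与
    # 下采样算子共同起到抑制噪声的正则作用(具体损失设计为假设)
    loss = F.mse_loss(F.avg_pool2d(sr, SCALE), lr_video)
    opt.zero_grad(); loss.backward(); opt.step()
print(sr.shape)
```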

链接: https://arxiv.org/abs/2505.13915
作者: Chu Chen,Kangning Cui,Pasquale Cascarano,Wei Tang,Elena Loli Piccolomini,Raymond H. Chan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Ultrasound imaging is widely applied in clinical practice, yet ultrasound videos often suffer from low signal-to-noise ratios (SNR) and limited resolutions, posing challenges for diagnosis and analysis. Variations in equipment and acquisition settings can further exacerbate differences in data distribution and noise levels, reducing the generalizability of pre-trained models. This work presents a self-supervised ultrasound video super-resolution algorithm called Deep Ultrasound Prior (DUP). DUP employs a video-adaptive optimization process of a neural network that enhances the resolution of given ultrasound videos without requiring paired training data while simultaneously removing noise. Quantitative and visual evaluations demonstrate that DUP outperforms existing super-resolution algorithms, leading to substantial improvements for downstream applications.
zh

[CV-78] 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision

【速读】:该论文旨在解决自动驾驶中3D场景理解的问题,特别是针对现有占用估计(occupancy estimation)方法依赖于LiDAR或摄像头,在恶劣环境(如烟雾、雨、雪和雾)下性能下降的问题。其解决方案的关键在于提出4D-ROLLS,这是首个利用LiDAR点云作为监督信号的4D雷达弱监督占用估计方法,通过生成伪LiDAR标签(包括占用查询和LiDAR高度图)作为多阶段监督,训练4D雷达占用估计模型,并使其与LiDAR生成的占用图对齐,从而提升模型在占用估计中的准确性。

链接: https://arxiv.org/abs/2505.13905
作者: Ruihan Liu,Xiaoyi Wu,Xijun Chen,Liang Hu,Yunjiang Lou
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:A comprehensive understanding of 3D scenes is essential for autonomous vehicles (AVs), and among various perception tasks, occupancy estimation plays a central role by providing a general representation of drivable and occupied space. However, most existing occupancy estimation methods rely on LiDAR or cameras, which perform poorly in degraded environments such as smoke, rain, snow, and fog. In this paper, we propose 4D-ROLLS, the first weakly supervised occupancy estimation method for 4D radar using the LiDAR point cloud as the supervisory signal. Specifically, we introduce a method for generating pseudo-LiDAR labels, including occupancy queries and LiDAR height maps, as multi-stage supervision to train the 4D radar occupancy estimation model. Then the model is aligned with the occupancy map produced by LiDAR, fine-tuning its accuracy in occupancy estimation. Extensive comparative experiments validate the exceptional performance of 4D-ROLLS. Its robustness in degraded environments and effectiveness in cross-dataset training are qualitatively demonstrated. The model is also seamlessly transferred to downstream tasks BEV segmentation and point cloud occupancy prediction, highlighting its potential for broader applications. The lightweight network enables 4D-ROLLS model to achieve fast inference speeds at about 30 Hz on a 4060 GPU. The code of 4D-ROLLS will be made available at this https URL.
zh

[CV-79] Domain Adaptation of VLM for Soccer Video Understanding CVPR2025

【速读】:该论文试图解决视频理解中视觉语言模型(Vision Language Models, VLMs)在特定领域中的迁移学习能力不足的问题。现有研究多为领域无关的,缺乏对VLMs在专业领域适应性的深入探索。解决方案的关键在于通过大规模领域相关数据和大型语言模型(LLM)生成指令遵循数据,采用课程学习(curriculum learning)的方式迭代微调通用领域的VLM,以提升其在特定领域(如足球)的任务表现。

链接: https://arxiv.org/abs/2505.13860
作者: Tiancheng Jiang,Henry Wang,Md Sirajus Salekin,Parmida Atighehchian,Shinan Zhang
机构: Massachusetts Institute of Technology (麻省理工学院); Amazon Web Services (亚马逊网络服务)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures, accepted to the 11th IEEE International Workshop on Computer Vision in Sports (CVSports) at CVPR 2025; supplementary appendix included as ancillary PDF

点击查看摘要

Abstract:Vision Language Models (VLMs) have demonstrated strong performance in multi-modal tasks by effectively aligning visual and textual representations. However, most video understanding VLM research has been domain-agnostic, leaving the understanding of their transfer learning capability to specialized domains under-explored. In this work, we address this by exploring the adaptability of open-source VLMs to specific domains, focusing on soccer as an initial case study. Our approach uses large-scale soccer datasets and an LLM to create instruction-following data, and uses them to iteratively fine-tune the general-domain VLM in a curriculum learning fashion (first teaching the model key soccer concepts, then moving to question-answering tasks). The final adapted model, trained using a curated dataset of 20k video clips, exhibits significant improvement in soccer-specific tasks compared to the base model, with a 37.5% relative improvement for the visual question-answering task and an accuracy improvement from 11.8% to 63.5% for the downstream soccer action classification task.
zh

[CV-80] SuperMapNet for Long-Range and High-Accuracy Vectorized HD Map Construction

【速读】:该论文旨在解决高精度矢量地图构建中的两大核心问题:一是BEV特征生成过程中单一模态方法感知能力有限,而直接拼接的多模态方法无法有效捕捉不同模态间的协同与差异,导致特征范围受限且存在空洞;二是地图元素分类与定位中仅使用点信息,未考虑元素信息及其与点信息的交互,从而导致形状错误和元素纠缠,影响准确性。其解决方案的关键在于提出SuperMapNet,通过基于交叉注意力的协同增强模块和基于流的差异对齐模块,紧密融合相机图像的语义信息与LiDAR点云的几何信息,以生成长距离BEV特征,并通过三层交互机制(Point2Point、Element2Element、Point2Element)实现局部与全局特征的紧密耦合,提升分类与定位的准确性。

链接: https://arxiv.org/abs/2505.13856
作者: Ruqin Zhou,San Jiang,Wanshou Jiang,Yongsheng Zhang,Chenguang Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 9 figures

点击查看摘要

Abstract:Vectorized HD map is essential for autonomous driving. Remarkable work has been achieved in recent years, but there are still major issues: (1) in the generation of the BEV features, single modality-based methods are of limited perception capability, while direct concatenation-based multi-modal methods fail to capture synergies and disparities between different modalities, resulting in limited ranges with feature holes; (2) in the classification and localization of map elements, only point information is used, without considering element information or the interaction between point and element information, leading to erroneous shapes and element entanglement with low accuracy. To address the above issues, we introduce SuperMapNet for long-range and high-accuracy vectorized HD map construction. It uses both camera images and LiDAR point clouds as input, and first tightly couples semantic information from camera images and geometric information from LiDAR point clouds by a cross-attention based synergy enhancement module and a flow-based disparity alignment module for long-range BEV feature generation. Then, local features from point queries and global features from element queries are tightly coupled by three-level interactions for high-accuracy classification and localization, where Point2Point interaction learns local geometric information between points of the same element and of each point, Element2Element interaction learns relation constraints between different elements and semantic information of each element, and Point2Element interaction learns complementary element information for its constituent points. Experiments on the nuScenes and Argoverse2 datasets demonstrate superior performance, surpassing SOTAs by over 14.9/8.8 mAP and 18.5/3.1 mAP under hard/easy settings, respectively. The code is made publicly available.
zh

[CV-81] MGStream: Motion-aware 3D Gaussian for Streamable Dynamic Scene Reconstruction

【速读】:该论文旨在解决基于3D Gaussian Splatting(3DGS)的流式动态场景重建中存在的闪烁伪影、存储效率低以及难以建模新出现物体的问题。其解决方案的关键在于引入MGStream,通过将运动相关的3D高斯分布(3DGs)用于动态场景重建,而常规的3DGs用于静态场景,结合运动掩码与基于聚类的凸包算法实现运动相关3DGs的构建,并通过刚性变形和基于注意力的优化策略提升动态建模与新物体重建能力,从而有效避免闪烁伪影并提高存储效率。

链接: https://arxiv.org/abs/2505.13839
作者: Zhenyu Bao,Qing Li,Guibiao Liao,Zhongyuan Zhao,Kanglin Liu
机构: Peking University (北京大学); Pengcheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has gained significant attention in streamable dynamic novel view synthesis (DNVS) for its photorealistic rendering capability and computational efficiency. Despite much progress in improving rendering quality and optimization strategies, 3DGS-based streamable dynamic scene reconstruction still suffers from flickering artifacts and storage inefficiency, and struggles to model the emerging objects. To tackle this, we introduce MGStream which employs the motion-related 3D Gaussians (3DGs) to reconstruct the dynamic and the vanilla 3DGs for the static. The motion-related 3DGs are implemented according to the motion mask and the clustering-based convex hull algorithm. The rigid deformation is applied to the motion-related 3DGs for modeling the dynamic, and the attention-based optimization on the motion-related 3DGs enables the reconstruction of the emerging objects. As the deformation and optimization are only conducted on the motion-related 3DGs, MGStream avoids flickering artifacts and improves the storage efficiency. Extensive experiments on real-world datasets N3DV and MeetRoom demonstrate that MGStream surpasses existing streaming 3DGS-based approaches in terms of rendering quality, training/storage efficiency and temporal consistency. Our code is available at: this https URL.
zh

[CV-82] InstanceBEV: Unifying Instance and BEV Representation for Global Modeling

【速读】:该论文旨在解决多视角相机构建占用网络(Occupancy Networks)时面临的数据复杂度呈立方增长的问题,以及基于鸟瞰图(Bird’s-Eye View, BEV)方法在大规模全局建模中需要大量工程优化的挑战。其解决方案的关键在于提出InstanceBEV,这是首个将实例级降维引入BEV的方法,通过直接利用Transformer进行全局特征聚合,实现了无需依赖稀疏化或加速算子的全局建模。

链接: https://arxiv.org/abs/2505.13817
作者: Feng Li,Kun Xu,Zhaoyue Wang,Yunduan Cui,Mohammad Masum Billah,Jia Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Occupancy Grid Maps are widely used in navigation for their ability to represent 3D space occupancy. However, existing methods that utilize multi-view cameras to construct Occupancy Networks for perception modeling suffer from cubic growth in data complexity. Adopting a Bird’s-Eye View (BEV) perspective offers a more practical solution for autonomous driving, as it provides higher semantic density and mitigates complex object occlusions. Nonetheless, BEV-based approaches still require extensive engineering optimizations to enable efficient large-scale global modeling. To address this challenge, we propose InstanceBEV, the first method to introduce instance-level dimensionality reduction for BEV, enabling global modeling with transformers without relying on sparsification or acceleration operators. Different from other BEV methods, our approach directly employs transformers to aggregate global features. Compared to 3D object detection models, our method samples global feature maps into 3D space. Experiments on OpenOcc-NuScenes dataset show that InstanceBEV achieves state-of-the-art performance while maintaining a simple, efficient framework without requiring additional optimizations.
zh

[CV-83] FlashKAT: Understanding and Addressing Performance Bottlenecks in the Kolmogorov-Arnold Transformer

【速读】:该论文试图解决Kolmogorov-Arnold Transformer (KAT)在训练速度上的显著瓶颈问题,尽管其FLOPs与传统Transformer相当,但其训练速度仍比传统方法慢123倍。解决方案的关键在于识别出该 slowdown 主要由内存停顿(memory stalls)引起,特别是在Group-Rational KAN (GR-KAN)的反向传播过程中由于梯度累积效率低下所致。为了解决这一内存瓶颈,作者提出了FlashKAT,其核心是重构内核以最小化梯度累积,并通过原子加法和对慢速内存的访问来优化计算流程,从而实现训练速度的大幅提升。
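梯度累积为何会成为访存瓶颈,以及“原子加式”的批量累加如何缓解,可用下面的概念性草图说明(与 FlashKAT 的实际内核无关,仅演示两种累加模式的差异):

```python
import torch

# GR-KAN 中每个通道属于某个组,反向传播需把通道梯度累加到组系数上。
# 逐元素写回(散乱访存)会造成内存停顿;类 FlashKAT 的做法是用原子加/
# 批量 index_add_ 一次性完成累加(此处为概念示意,非官方内核)。
C, G = 4096, 8
group_of = torch.randint(0, G, (C,))        # 通道 -> 组 的映射
grad_per_channel = torch.randn(C)

# 低效:Python 级循环逐通道累加(模拟散乱访存)
slow = torch.zeros(G)
for c in range(C):
    slow[group_of[c]] += grad_per_channel[c]

# 高效:单次 index_add_(GPU 上由原子加实现)
fast = torch.zeros(G).index_add_(0, group_of, grad_per_channel)
print(torch.allclose(slow, fast, atol=1e-3))  # True
```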

链接: https://arxiv.org/abs/2505.13813
作者: Matthew Raffel,Lizhong Chen
机构: Oregon State University (俄勒冈州立大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Kolmogorov-Arnold Network (KAN) has been gaining popularity as an alternative to the multi-layer perceptron (MLP) with its increased expressiveness and interpretability. However, the KAN can be orders of magnitude slower due to its increased computational cost and training instability, limiting its applicability to larger-scale tasks. Recently, the Kolmogorov-Arnold Transformer (KAT) has been proposed, which can achieve FLOPs similar to the traditional Transformer with MLPs by leveraging Group-Rational KAN (GR-KAN). Unfortunately, despite the comparable FLOPs, our characterizations reveal that the KAT is still 123x slower in training speeds, indicating that there are other performance bottlenecks beyond FLOPs. In this paper, we conduct a series of experiments to understand the root cause of the slowdown in KAT. We uncover that the slowdown can be isolated to memory stalls and, more specifically, in the backward pass of GR-KAN caused by inefficient gradient accumulation. To address this memory bottleneck, we propose FlashKAT, which builds on our restructured kernel that minimizes gradient accumulation with atomic adds and accesses to slow memory. Evaluations demonstrate that FlashKAT can achieve a training speedup of 86.5x compared with the state-of-the-art KAT, while reducing rounding errors in the coefficient gradients. Our code is available at this https URL.
zh

[CV-84] Physics-Driven Local-Whole Elastic Deformation Modeling for Point Cloud Representation Learning

【速读】:该论文旨在解决现有点云表示学习方法在捕捉局部信息与整体结构之间关系上的不足,这些方法通常侧重于结构特征而忽视了局部细节与整体形状之间的相互作用。其解决方案的关键在于提出一种物理驱动的自监督学习方法,通过构建局部-整体力传播机制来建模局部特征与整体结构之间的关系,具体采用双任务编码器-解码器框架,结合隐式场的几何建模能力和物理驱动的弹性形变,从而有效提升点云表示的质量。

链接: https://arxiv.org/abs/2505.13812
作者: Zhongyu Chen,Rong Zhao,Xie Han,Xindong Guo,Song Wang,Zherui Qiao
机构: School of Computer Science and Technology, North University of China, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing point cloud representation learning methods tend to learn the geometric distribution of objects through data-driven approaches, emphasizing structural features while overlooking the relationship between local information and the whole structure. Local features reflect the fine-grained variations of an object, while the whole structure is determined by the interaction and combination of these local features, collectively defining the object’s shape. In the real world, objects undergo elastic deformation under external forces, and this deformation gradually affects the whole structure through the propagation of forces from local regions, thereby altering the object’s geometric properties. Inspired by this, we propose a physics-driven self-supervised learning method for point cloud representation, which captures the relationship between parts and the whole by constructing a local-whole force propagation mechanism. Specifically, we employ a dual-task encoder-decoder framework, integrating the geometric modeling capability of implicit fields with physics-driven elastic deformation. The encoder extracts features from the point cloud and its tetrahedral mesh representation, capturing both geometric and physical properties. These features are then fed into two decoders: one learns the whole geometric shape of the point cloud through an implicit field, while the other predicts local deformations using two specifically designed physics information loss functions, modeling the deformation relationship between local and whole shapes. Experimental results show that our method outperforms existing approaches in object classification, few-shot learning, and segmentation, demonstrating its effectiveness.
zh

[CV-85] Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels CVPR’25

【速读】:该论文旨在解决基于文本指令的视觉语言模型(VLM)在复杂指令下获取像素级定位能力的问题,具体针对幻觉参考、多物体场景、推理、多粒度和部分级参考等五个现实挑战。其解决方案的关键在于利用预训练教师模型的知识蒸馏,生成高质量的指令-响应对,并将其与现有的像素级标注相链接,从而减少对昂贵人工标注的依赖。该方法构建的数据集Ground-V包含了丰富的物体定位知识和细腻的像素级指代表达,实验结果表明,基于Ground-V训练的模型在多个基准测试中取得了显著的性能提升。

链接: https://arxiv.org/abs/2505.13788
作者: Yongshuo Zong,Qin Zhang,Dongsheng An,Zhihua Li,Xiang Xu,Linghan Xu,Zhuowen Tu,Yifan Xing,Onkar Dabeer
机构: AWS AI Labs (AWS AI 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR’25

点击查看摘要

Abstract:This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. In particular, we address five critical real-world challenges in text-instruction-based grounding: hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references. By leveraging knowledge distillation from a pre-trained teacher model, our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations, minimizing the need for costly human annotation. The resulting dataset, Ground-V, captures rich object localization knowledge and nuanced pixel-level referring expressions. Experimental results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training directly achieves an average accuracy boost of 4.4% for LISA and 7.9% for PSALM across six benchmarks on the gIoU metric. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state-of-the-art by more than 20%.
zh

[CV-86] ransfer Learning from Visual Speech Recognition to Mouthing Recognition in German Sign Language

【速读】:该论文试图解决手语识别(Sign Language Recognition, SLR)系统中对非手动特征,特别是口部动作(mouthing)识别的不足问题。其解决方案的关键在于通过从视觉语音识别(Visual Speech Recognition, VSR)迁移学习,提升德语手语中口部动作的识别性能。研究利用三个不同的VSR数据集,探讨任务相似性对迁移效果的影响,并发现多任务学习能够同时提升口部动作识别和VSR的准确性及模型鲁棒性,表明口部动作识别应被视为与VSR相关但独立的任务。

链接: https://arxiv.org/abs/2505.13784
作者: Dinh Nam Pham,Eleftherios Avramidis
机构: German Research Center for Artificial Intelligence (DFKI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at 19th IEEE International Conference on Automatic Face and Gesture Recognition 2025

点击查看摘要

Abstract:Sign Language Recognition (SLR) systems primarily focus on manual gestures, but non-manual features such as mouth movements, specifically mouthing, provide valuable linguistic information. This work directly classifies mouthing instances to their corresponding words in the spoken language while exploring the potential of transfer learning from Visual Speech Recognition (VSR) to mouthing recognition in German Sign Language. We leverage three VSR datasets: one in English, one in German with unrelated words and one in German containing the same target words as the mouthing dataset, to investigate the impact of task similarity in this setting. Our results demonstrate that multi-task learning improves both mouthing recognition and VSR accuracy as well as model robustness, suggesting that mouthing recognition should be treated as a distinct but related task to VSR. This research contributes to the field of SLR by proposing knowledge transfer from VSR to SLR datasets with limited mouthing annotations.
zh

[CV-87] Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping

【速读】:该论文试图解决声音景观(soundscape)映射中由于依赖卫星图像和配对的地理标记音频样本而导致的声音源多样性无法被充分捕捉的问题。其解决方案的关键在于利用视觉-语言模型(VLM)生成语义丰富的声音景观描述,并通过跨模态对比学习(contrastive learning)在音频、音频描述、卫星图像和卫星图像描述之间建立联系,从而学习一个共享的声音景观概念代码本(codebook),并以这些概念的加权平均表示每个样本。
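共享概念代码本与跨模态对比学习的思想可用如下草图示意(代码本规模、相似度计算与损失形式均为假设):

```python
import torch
import torch.nn.functional as F

K, D = 64, 256                            # 声景概念数与嵌入维度(假设)
codebook = torch.nn.Parameter(torch.randn(K, D))

def to_concept_mix(embed):
    """把任一模态的嵌入表示为共享概念代码本的加权平均(示意)。"""
    w = F.softmax(embed @ codebook.T / D ** 0.5, dim=-1)   # (B, K) 概念权重
    return w @ codebook                                     # (B, D)

def contrastive(a, b, tau=0.07):
    """对称 InfoNCE,用于跨模态(图像/音频/文本描述)对齐(示意)。"""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / tau
    labels = torch.arange(a.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

img_emb, audio_emb = torch.randn(8, D), torch.randn(8, D)
loss = contrastive(to_concept_mix(img_emb), to_concept_mix(audio_emb))
print(loss.item())
```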

链接: https://arxiv.org/abs/2505.13777
作者: Subash Khanal,Srikumar Sastry,Aayush Dhakal,Adeel Ahmad,Nathan Jacobs
机构: Washington University in St. Louis (华盛顿大学圣路易斯分校); Taylor Geospatial Institute (泰勒地理空间研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:We present Sat2Sound, a multimodal representation learning framework for soundscape mapping, designed to predict the distribution of sounds at any location on Earth. Existing methods for this task rely on satellite image and paired geotagged audio samples, which often fail to capture the diversity of sound sources at a given location. To address this limitation, we enhance existing datasets by leveraging a Vision-Language Model (VLM) to generate semantically rich soundscape descriptions for locations depicted in satellite images. Our approach incorporates contrastive learning across audio, audio captions, satellite images, and satellite image captions. We hypothesize that there is a fixed set of soundscape concepts shared across modalities. To this end, we learn a shared codebook of soundscape concepts and represent each sample as a weighted average of these concepts. Sat2Sound achieves state-of-the-art performance in cross-modal retrieval between satellite image and audio on two datasets: GeoSound and SoundingEarth. Additionally, building on Sat2Sound’s ability to retrieve detailed soundscape captions, we introduce a novel application: location-based soundscape synthesis, which enables immersive acoustic experiences. Our code and models will be publicly available.
zh

[CV-88] ReSW-VL: Representation Learning for Surgical Workflow Analysis Using Vision-Language Model

【速读】:该论文试图解决在手术流程分析中,针对用于特征提取或表征学习的卷积神经网络(CNN)的训练方法研究不足的问题。其解决方案的关键在于利用视觉-语言模型(ReSW-VL)进行表征学习,具体通过提示学习(prompt learning)对CLIP(Contrastive Language-Image Pre-training)视觉编码器进行微调,以提升手术阶段识别的性能。

链接: https://arxiv.org/abs/2505.13746
作者: Satoshi Kondo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Surgical phase recognition from video is a technology that automatically classifies the progress of a surgical procedure and has a wide range of potential applications, including real-time surgical support, optimization of medical resources, training and skill assessment, and safety improvement. Recent advances in surgical phase recognition technology have focused primarily on Transformer-based methods, although methods that extract spatial features from individual frames using a CNN and video features from the resulting time series of spatial features using time series modeling have shown high performance. However, there remains a paucity of research on training methods for CNNs employed for feature extraction or representation learning in surgical phase recognition. In this study, we propose a method for representation learning in surgical workflow analysis using a vision-language model (ReSW-VL). Our proposed method involves fine-tuning the image encoder of a CLIP (Contrastive Language-Image Pre-training) vision-language model using prompt learning for surgical phase recognition. The experimental results on three surgical phase recognition datasets demonstrate the effectiveness of the proposed method in comparison to conventional methods.
zh

[CV-89] Frozen Backpropagation: Relaxing Weight Symmetry in Temporally-Coded Deep Spiking Neural Networks

【速读】:该论文旨在解决在神经形态硬件上实现基于反向传播(Backpropagation, BP)训练的Spiking Neural Networks (SNNs)时,由于前向和反馈权重需要保持对称而导致的硬件开销和能耗增加的问题。其解决方案的关键在于提出一种名为Frozen Backpropagation (fBP)的训练算法,该算法通过周期性冻结反馈权重来计算梯度,从而更新前向权重,减少权重传输次数并降低同步开销。此外,论文还提出了三种不同计算复杂度的部分权重传输方案,以进一步提升传输效率。
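fBP 的核心是在反向传播中使用周期性冻结的反馈权重。下面在一个普通两层网络上手写该训练循环以示意(网络规模、学习率与同步周期均为假设,且与论文的 SNN/神经形态硬件设置不同):

```python
import torch

# 极简两层网络上的 fBP 示意:反馈权重 B2 仅每 freeze_T 步与前向权重同步一次,
# 其间保持冻结;误差用 B2 而非 W2^T 回传,从而减少“权重传输”次数。
din, dh, dout, freeze_T = 20, 64, 5, 50
W1 = torch.randn(dh, din) * 0.1
W2 = torch.randn(dout, dh) * 0.1
B2 = W2.clone()                                  # 反馈权重(独立存储)

x = torch.randn(256, din)
y = torch.randint(0, dout, (256,))

for step in range(500):
    h = torch.relu(x @ W1.T)                     # 前向
    logits = h @ W2.T
    p = torch.softmax(logits, dim=1)
    e = p.clone(); e[torch.arange(len(y)), y] -= 1   # 交叉熵的输出层误差
    dW2 = e.T @ h / len(y)
    dh_err = (e @ B2) * (h > 0).float()          # 用冻结的 B2 回传误差
    dW1 = dh_err.T @ x / len(y)
    W1 -= 0.1 * dW1; W2 -= 0.1 * dW2
    if (step + 1) % freeze_T == 0:               # 周期性“权重传输”
        B2 = W2.clone()

print((logits.argmax(1) == y).float().mean().item())
```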

链接: https://arxiv.org/abs/2505.13741
作者: Gaspard Goupy,Pierre Tirilly,Ioan Marius Bilasco
机构: Univ. Lille (Université de Lille); CNRS (Centre National de la Recherche Scientifique); Centrale Lille (Centrale Lille); UMR 9189 CRIStAL (UMR 9189 CRIStAL)
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Direct training of Spiking Neural Networks (SNNs) on neuromorphic hardware can greatly reduce energy costs compared to GPU-based training. However, implementing Backpropagation (BP) on such hardware is challenging because forward and backward passes are typically performed by separate networks with distinct weights. To compute correct gradients, forward and feedback weights must remain symmetric during training, necessitating weight transport between the two networks. This symmetry requirement imposes hardware overhead and increases energy costs. To address this issue, we introduce Frozen Backpropagation (fBP), a BP-based training algorithm relaxing weight symmetry in settings with separate networks. fBP updates forward weights by computing gradients with periodically frozen feedback weights, reducing weight transports during training and minimizing synchronization overhead. To further improve transport efficiency, we propose three partial weight transport schemes of varying computational complexity, where only a subset of weights is transported at a time. We evaluate our methods on image recognition tasks and compare them to existing approaches addressing the weight symmetry requirement. Our results show that fBP outperforms these methods and achieves accuracy comparable to BP. With partial weight transport, fBP can substantially lower transport costs by 1,000x with an accuracy drop of only 0.5pp on CIFAR-10 and 1.1pp on CIFAR-100, or by up to 10,000x at the expense of moderate accuracy loss. This work provides insights for guiding the design of neuromorphic hardware incorporating BP-based on-chip learning.
zh

[CV-90] Improving Compositional Generation with Diffusion Models Using Lift Scores

【速读】:该论文试图解决扩散模型在组合生成任务中条件对齐不足的问题,即生成的样本未能准确满足多个条件的组合要求。解决方案的关键在于引入一种基于提升分数(lift scores)的重采样准则,通过评估生成样本是否与每个单独条件对齐,并将结果进行组合以判断组合提示是否被满足。该方法的核心洞察是,提升分数可以仅利用原始扩散模型高效近似,无需额外训练或外部模块,从而在保持效果的同时降低推理过程中的计算开销。
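lift 分数的一种常见近似是比较条件与无条件的去噪误差。以下草图仅示意该重采样准则的接口与组合判据,具体定义以论文为准(加噪方式、阈值与占位模型均为假设):

```python
import torch

def lift_score(eps_model, x, cond, null_cond, t):
    """示意:用条件与无条件去噪误差之差近似 lift 分数 —— 条件使去噪更容易,
    说明样本与该条件对齐。与论文的精确定义可能有出入,属粗略假设。"""
    noise = torch.randn_like(x)
    x_t = x + t * noise                                   # 简化加噪(真实噪声调度更复杂)
    err_c = (eps_model(x_t, t, cond) - noise).pow(2).mean()
    err_u = (eps_model(x_t, t, null_cond) - noise).pow(2).mean()
    return (err_u - err_c).item()                         # 越大越对齐

def accept(sample, conds, eps_model, null_cond, t=0.5, thresh=0.0):
    """重采样准则(示意):组合提示被满足,当且仅当每个单条件的 lift 分数都达标。"""
    return all(lift_score(eps_model, sample, c, null_cond, t) > thresh
               for c in conds)

# 占位去噪模型,仅演示接口;实际应为预训练扩散模型的噪声预测网络
eps_model = lambda x_t, t, c: torch.zeros_like(x_t) + c.mean()
x = torch.randn(1, 3, 8, 8)
print(accept(x, [torch.ones(4), torch.zeros(4)], eps_model, torch.zeros(4)))
```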

链接: https://arxiv.org/abs/2505.13740
作者: Chenning Yu,Sicun Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce a novel resampling criterion using lift scores, for improving compositional generation in diffusion models. By leveraging the lift scores, we evaluate whether generated samples align with each single condition and then compose the results to determine whether the composed prompt is satisfied. Our key insight is that lift scores can be efficiently approximated using only the original diffusion model, requiring no additional training or external modules. We develop an optimized variant that achieves relatively lower computational overhead during inference while maintaining effectiveness. Through extensive experiments, we demonstrate that lift scores significantly improved the condition alignment for compositional generation across 2D synthetic data, CLEVR position tasks, and text-to-image synthesis. Our code is available at this http URL.
zh

[CV-91] GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

【速读】:该论文试图解决全球图像地理定位(geolocalization)问题,即从任意地点拍摄的图像中预测出GPS坐标,其核心挑战在于不同地区视觉内容的多样性。传统方法通常采用两阶段流程:先检索候选位置,再选择最佳匹配,但它们依赖于简单的相似性启发式和点监督,未能建模候选位置之间的空间关系。解决方案的关键在于提出GeoRanker,这是一种基于距离感知的排序框架,利用大视觉-语言模型联合编码查询与候选信息,并预测地理接近性;同时引入多阶距离损失函数,对绝对和相对距离进行排序,使模型能够推理结构化的空间关系。
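多阶距离损失可理解为“绝对距离回归 + 成对相对次序约束”的组合,下面给出一个示意实现(具体损失形式为假设,未必与论文一致):

```python
import torch
import torch.nn.functional as F

def multi_order_distance_loss(pred, dist, margin=0.1):
    """示意版多阶距离损失:一阶项回归绝对距离,二阶项约束成对相对次序。
    pred: (B, N) 模型对 N 个候选的地理邻近度打分(越大越近)
    dist: (B, N) 查询与候选的真实地理距离(越小越近)
    """
    abs_loss = F.mse_loss(pred, -dist)                    # 绝对距离项(取负使方向一致)
    # 相对次序:若 dist_i < dist_j,则要求 pred_i > pred_j + margin
    di, dj = dist.unsqueeze(2), dist.unsqueeze(1)         # (B, N, 1), (B, 1, N)
    pi, pj = pred.unsqueeze(2), pred.unsqueeze(1)
    closer = (di < dj).float()
    rel_loss = (closer * F.relu(margin - (pi - pj))).sum() / closer.sum().clamp(min=1)
    return abs_loss + rel_loss

pred = torch.randn(4, 10, requires_grad=True)
dist = torch.rand(4, 10)
print(multi_order_distance_loss(pred, dist).item())
```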

链接: https://arxiv.org/abs/2505.13731
作者: Pengyue Jia,Seongheon Park,Song Gao,Xiangyu Zhao,Yixuan Li
机构: City University of Hong Kong (香港城市大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Worldwide image geolocalization, the task of predicting GPS coordinates from images taken anywhere on Earth, poses a fundamental challenge due to the vast diversity in visual content across regions. While recent approaches adopt a two-stage pipeline of retrieving candidates and selecting the best match, they typically rely on simplistic similarity heuristics and point-wise supervision, failing to model spatial relationships among candidates. In this paper, we propose GeoRanker, a distance-aware ranking framework that leverages large vision-language models to jointly encode query-candidate interactions and predict geographic proximity. In addition, we introduce a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships. To support this, we curate GeoRanking, the first dataset explicitly designed for geographic ranking tasks with multimodal candidate information. GeoRanker achieves state-of-the-art results on two well-established benchmarks (IM2GPS3K and YFCC4K), significantly outperforming current best methods.
zh

[CV-92] GeoVLM: Improving Automated Vehicle Geolocalisation Using Vision-Language Matching

【速读】:该论文旨在解决跨视角地理定位(cross-view geo-localisation)中由于场景相似性导致的正确匹配难以被排名为首选的问题。现有方法虽然具有较高的召回率,但无法确保正确图像在检索结果中位于首位。论文提出的解决方案是GeoVLM,其关键在于利用视觉语言模型(vision language models)的零样本能力,通过可解释的跨视角语言描述实现跨视角地理定位,从而提升最佳匹配的准确性。

链接: https://arxiv.org/abs/2505.13669
作者: Barkin Dagda,Muhammad Awais,Saber Fallah
机构: Connected and Autonomous Vehicles Lab (CAV-Lab), University of Surrey (萨里大学); Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Cross-view geo-localisation identifies the coarse geographical position of an automated vehicle by matching a ground-level image to a geo-tagged satellite image from a database. Despite advances in cross-view geo-localisation, significant challenges persist, such as similar-looking scenes, which make it difficult to rank the correct image as the top match. Existing approaches reach high recall rates but still fail to rank the correct image as the top match. To address this challenge, this paper proposes GeoVLM, a novel approach which uses the zero-shot capabilities of vision language models to enable cross-view geo-localisation using interpretable cross-view language descriptions. GeoVLM is a trainable reranking approach which improves the best match accuracy of cross-view geo-localisation. GeoVLM is evaluated on the standard benchmarks VIGOR and University-1652 and also through real-life driving environments using Cross-View United Kingdom, a new benchmark dataset introduced in this paper. The results of the paper show that GeoVLM improves retrieval performance of cross-view geo-localisation compared to state-of-the-art methods with the help of explainable natural language descriptions. The code is available at this https URL
zh

[CV-93] FedCTTA: A Collaborative Approach to Continual Test-Time Adaptation in Federated Learning IJCNN2025

【速读】:该论文试图解决联邦学习(Federated Learning, FL)中模型在部署后因训练与测试数据分布差异导致的性能下降问题,以及现有测试时适应(Test-Time Adaptation, TTA)方法在联邦设置中面临的计算开销大、隐私泄露风险高和可扩展性差等挑战。其解决方案的关键在于提出一种隐私保护且计算高效的联邦持续测试时适应框架(Federated Continual Test-Time Adaptation, FedCTTA),该框架通过基于随机噪声样本生成的模型输出分布进行相似性感知聚合,避免直接交换局部特征,从而实现适应性知识共享并保障数据隐私;同时,FedCTTA通过最小化每个客户端的熵来增强模型对动态目标分布的置信度,并在适应过程中无需服务器端训练,保持恒定的内存占用,提升了系统的可扩展性。
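基于噪声样本输出分布的相似度聚合与本地熵最小化,可用如下草图示意(相似度度量、聚合方式与熵目标的具体形式均为假设):

```python
import torch
import torch.nn.functional as F

def similarity_weights(models, noise):
    """示意:各客户端在共享随机噪声上的输出分布两两比较,
    相似度经 softmax 变为聚合权重(只交换输出/参数,不交换特征)。"""
    with torch.no_grad():
        outs = torch.stack([F.softmax(m(noise), dim=1).flatten() for m in models])
    sim = F.cosine_similarity(outs.unsqueeze(1), outs.unsqueeze(0), dim=-1)  # (K, K)
    return F.softmax(sim, dim=1)

def aggregate(models, weights):
    """按相似度权重为每个客户端生成个性化的聚合模型参数(示意)。"""
    states = [m.state_dict() for m in models]
    return [{k: sum(weights[i, j] * states[j][k] for j in range(len(models)))
             for k in states[0]} for i in range(len(models))]

def entropy_loss(logits):
    """客户端本地的熵最小化目标(测试时适应)。"""
    p = F.softmax(logits, dim=1)
    return -(p * p.clamp(min=1e-8).log()).sum(1).mean()

models = [torch.nn.Linear(16, 4) for _ in range(3)]
noise = torch.randn(32, 16)
states = aggregate(models, similarity_weights(models, noise))
models[0].load_state_dict(states[0])
print(entropy_loss(models[0](noise)).item())
```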

链接: https://arxiv.org/abs/2505.13643
作者: Rakibul Hasan Rajib,Md Akil Raihan Iftee,Mir Sazzat Hossain,A. K. M. Mahbubur Rahman,Sajib Mistry,M Ashraful Amin,Amin Ahsan Ali
机构: Center for Computational & Data Sciences, Independent University, Bangladesh(计算与数据科学中心,独立大学,孟加拉国); Curtin University(柯廷大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures, Accepted In IJCNN 2025

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, making it ideal for privacy-sensitive applications. However, FL models often suffer performance degradation due to distribution shifts between training and deployment. Test-Time Adaptation (TTA) offers a promising solution by allowing models to adapt using only test samples. However, existing TTA methods in FL face challenges such as computational overhead, privacy risks from feature sharing, and scalability concerns due to memory constraints. To address these limitations, we propose Federated Continual Test-Time Adaptation (FedCTTA), a privacy-preserving and computationally efficient framework for federated adaptation. Unlike prior methods that rely on sharing local feature statistics, FedCTTA avoids direct feature exchange by leveraging similarity-aware aggregation based on model output distributions over randomly generated noise samples. This approach ensures adaptive knowledge sharing while preserving data privacy. Furthermore, FedCTTA minimizes the entropy at each client for continual adaptation, enhancing the model’s confidence in evolving target distributions. Our method eliminates the need for server-side training during adaptation and maintains a constant memory footprint, making it scalable even as the number of clients or training rounds increases. Extensive experiments show that FedCTTA surpasses existing methods across diverse temporal and spatial heterogeneity scenarios.
zh

[CV-94] IPENS:Interactive Unsupervised Framework for Rapid Plant Phenotyping Extraction via NeRF-SAM2 Fusion

【速读】:该论文旨在解决植物表型分析中因物种多样性导致的依赖大量高精度人工标注数据的问题,以及在谷物级别自遮挡目标上无监督方法效果不佳的挑战。其解决方案的关键在于提出IPENS方法,该方法利用辐射场信息将由SAM2分割得到的2D掩码提升至3D空间,实现多目标点云提取,并通过多目标协同优化策略有效解决单次交互下的多目标分割问题。

链接: https://arxiv.org/abs/2505.13633
作者: Wentao Song,He Huang,Youqiang Sun,Fang Qu,Jiaqi Zhang,Longhui Fang,Yuwei Hao,Chenyang Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Advanced plant phenotyping technologies play a crucial role in targeted trait improvement and accelerating intelligent breeding. Due to the species diversity of plants, existing methods heavily rely on large-scale high-precision manually annotated data. For self-occluded objects at the grain level, unsupervised methods often prove ineffective. This study proposes IPENS, an interactive unsupervised multi-target point cloud extraction method. The method utilizes radiance field information to lift 2D masks, which are segmented by SAM2 (Segment Anything Model 2), into 3D space for target point cloud extraction. A multi-target collaborative optimization strategy is designed to effectively resolve the single-interaction multi-target segmentation challenge. Experimental validation demonstrates that IPENS achieves a grain-level segmentation accuracy (mIoU) of 63.72% on a rice dataset, with strong phenotypic estimation capabilities: grain volume prediction yields R2 = 0.7697 (RMSE = 0.0025), leaf surface area R2 = 0.84 (RMSE = 18.93), and leaf length and width predictions achieve R2 = 0.97 and 0.87 (RMSE = 1.49 and 0.21). On a wheat dataset, IPENS further improves segmentation accuracy to 89.68% (mIoU), with equally outstanding phenotypic estimation performance: spike volume prediction achieves R2 = 0.9956 (RMSE = 0.0055), leaf surface area R2 = 1.00 (RMSE = 0.67), and leaf length and width predictions reach R2 = 0.99 and 0.92 (RMSE = 0.23 and 0.15). This method provides a non-invasive, high-quality phenotyping extraction solution for rice and wheat. Without requiring annotated data, it rapidly extracts grain-level point clouds within 3 minutes through simple single-round interactions on images for multiple targets, demonstrating significant potential to accelerate intelligent breeding efficiency.
zh

[CV-95] Self-Supervised Learning for Image Segmentation: A Comprehensive Survey

【速读】:该论文试图解决监督学习中对大量精确标注数据的依赖问题,这一需求导致数据标注过程耗时且成本高昂。其解决方案的关键在于自监督学习(Self-supervised learning, SSL),通过利用大量未标注数据并设计替代任务(pretext tasks)来学习有用的表征,从而减少对人工标注的依赖,提升图像分割等下游任务的性能。

链接: https://arxiv.org/abs/2505.13584
作者: Thangarajah Akilan,Nusrat Jahan,Wandong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 19 figures, to be submitted for a possible IEEE publication

点击查看摘要

Abstract:Supervised learning demands large amounts of precisely annotated data to achieve promising results. Such data curation is labor-intensive and imposes significant overhead regarding time and costs. Self-supervised learning (SSL) partially overcomes these limitations by exploiting vast amounts of unlabeled data and creating surrogate (pretext or proxy) tasks to learn useful representations without manual labeling. As a result, SSL has become a powerful machine learning (ML) paradigm for solving several practical downstream computer vision problems, such as classification, detection, and segmentation. Image segmentation is the cornerstone of many high-level visual perception applications, including medical imaging, intelligent transportation, agriculture, and surveillance. Although there is substantial research potential for developing advanced algorithms for SSL-based semantic segmentation, a comprehensive study of existing methodologies is essential to trace advances and guide emerging researchers. This survey thoroughly investigates over 150 recent image segmentation articles, particularly focusing on SSL. It provides a practical categorization of pretext tasks, downstream tasks, and commonly used benchmark datasets for image segmentation research. It concludes with key observations distilled from a large body of literature and offers future directions to make this research field more accessible and comprehensible for readers.
zh

[CV-96] Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression

【速读】:该论文旨在解决多任务场景下因存储大量微调模型而导致的显著存储开销问题,通过Delta压缩技术仅存储预训练模型和高度压缩的Delta权重来缓解这一问题。然而,现有方法难以同时保持高压缩率和性能,并且通常依赖于数据。为了解决这些挑战,论文提出了UltraDelta,这是首个无需依赖数据的Delta压缩管道,其关键在于三个核心组件:基于方差的混合稀疏性分配、分布感知压缩以及基于迹范数的重缩放。这些组件共同实现了跨层间、层内和全局维度的冗余最小化、信息最大化和性能稳定化。
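三个组件的配合可用如下简化流程示意(稀疏度分配、剪枝与重缩放的具体公式均为粗略假设,且省略了论文中的均匀量化与分组步骤):

```python
import torch

def ultradelta_like_compress(deltas, base_keep=0.05):
    """示意性 Delta 压缩流程(与论文细节不同,属粗略假设):
    1) 方差越大的层分配越低的稀疏度(保留更多参数);
    2) 层内按幅值剪枝;
    3) 用迹范数(核范数)比值估计重缩放因子,稳定压缩后的模型。
    deltas: {层名: 增量权重张量}
    """
    variances = {k: v.float().var() for k, v in deltas.items()}
    total = sum(variances.values())
    out = {}
    for k, d in deltas.items():
        keep = base_keep * (variances[k] / total * len(deltas)).clamp(0.2, 5.0)
        n_keep = max(1, int(d.numel() * keep))
        thresh = d.abs().flatten().topk(n_keep).values[-1]
        pruned = torch.where(d.abs() >= thresh, d, torch.zeros_like(d))
        if d.dim() == 2:   # 迹范数比值作为重缩放因子
            scale = (torch.linalg.matrix_norm(d, ord="nuc") /
                     torch.linalg.matrix_norm(pruned, ord="nuc").clamp(min=1e-8))
        else:
            scale = d.norm() / pruned.norm().clamp(min=1e-8)
        out[k] = pruned * scale
    return out

deltas = {"q_proj": torch.randn(64, 64), "v_proj": 0.1 * torch.randn(64, 64)}
compressed = ultradelta_like_compress(deltas)
print({k: (v != 0).float().mean().item() for k, v in compressed.items()})
```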

链接: https://arxiv.org/abs/2505.13563
作者: Xiaohui Wang,Peng Ye,Chenyu Huang,Shenghe Zheng,Bo Zhang,Wanli Ouyang,Tao Chen
机构: Fudan University (复旦大学); Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); Harbin Institude of Technology (哈尔滨工业大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rise of the fine-tuned–pretrained paradigm, storing numerous fine-tuned models for multi-tasking creates significant storage overhead. Delta compression alleviates this by storing only the pretrained model and the highly compressed delta weights (the differences between fine-tuned and pretrained model weights). However, existing methods fail to maintain both high compression and performance, and often rely on data. To address these challenges, we propose UltraDelta, the first data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions, using three key components: (1) Variance-Based Mixed Sparsity Allocation assigns sparsity based on variance, giving lower sparsity to high-variance layers to preserve inter-layer information. (2) Distribution-Aware Compression applies uniform quantization and then groups parameters by value, followed by group-wise pruning, to better preserve intra-layer distribution. (3) Trace-Norm-Guided Rescaling uses the trace norm of delta weights to estimate a global rescaling factor, improving model stability under higher compression. Extensive experiments across (a) large language models (fine-tuned on LLaMA-2 7B and 13B) with up to 133x, (b) general NLP models (RoBERTa-base, T5-base) with up to 800x, (c) vision models (ViT-B/32, ViT-L/14) with up to 400x, and (d) multi-modal models (BEiT-3) with 40x compression ratio, demonstrate that UltraDelta consistently outperforms existing methods, especially under ultra-high compression.
zh

[CV-97] EuLearn: A 3D database for learning Euler characteristics

【速读】:该论文试图解决当前缺乏能够公平代表多种拓扑类型(topological types)的表面数据集的问题,从而阻碍了机器学习系统对拓扑特征的识别与学习。其解决方案的关键在于设计了一种基于随机绳结(random knots)的嵌入表面,使得表面具有统一变化的亏格(genus),并且能够自我缠绕,从而生成包含网格、点云和标量场的新拓扑数据集。此外,为提升模型性能,作者还提出了一种非欧几里得(non-Euclidean)的统计采样方法,并引入了考虑邻接信息的PointNet和Transformer架构改进。

链接: https://arxiv.org/abs/2505.13539
作者: Rodrigo Fritz,Pablo Suárez-Serrato,Victor Mijangos,Anayanzi D. Martinez-Hernandez,Eduardo Ivan Velazquez Richards
机构: Instituto de Matemáticas, UNAM(数学研究所,墨西哥国立自治大学); Facultad de Ciencias, UNAM(理学院,墨西哥国立自治大学); IIMAS, UNAM(IIMAS,墨西哥国立自治大学)
类目: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Differential Geometry (math.DG); Geometric Topology (math.GT)
备注: 35 pages, many figures. Datasets and source code publicly available at this https URL and this https URL

点击查看摘要

Abstract:We present EuLearn, the first surface datasets equitably representing a diversity of topological types. We designed our embedded surfaces of uniformly varying genera relying on random knots, thus allowing our surfaces to knot with themselves. EuLearn contributes new topological datasets of meshes, point clouds, and scalar fields in 3D. We aim to facilitate the training of machine learning systems that can discern topological features. We experimented with specific emblematic 3D neural network architectures, finding that their vanilla implementations perform poorly on genus classification. To enhance performance, we developed a novel, non-Euclidean, statistical sampling method adapted to graph and manifold data. We also introduce adjacency-informed adaptations of PointNet and Transformer architectures that rely on our non-Euclidean sampling strategy. Our results demonstrate that incorporating topological information into deep learning workflows significantly improves performance on these otherwise challenging EuLearn datasets.
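EuLearn 关注的欧拉示性数与亏格可以由网格的组合结构直接算出:对闭合可定向曲面有 χ = V - E + F 且 g = (2 - χ)/2。下面是一个自包含的小示例(与数据集本身无关,仅用于说明这一拓扑量的含义):

```python
def euler_characteristic(vertices, faces):
    """根据三角网格计算欧拉示性数 χ = V - E + F,并由 χ 推出亏格 g。"""
    V = len(vertices)
    # 无向边去重:每个三角面贡献三条边
    edges = set()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))
    E = len(edges)
    F = len(faces)
    chi = V - E + F
    genus = (2 - chi) // 2  # 仅适用于闭合可定向曲面
    return chi, genus

# 正四面体:χ = 4 - 6 + 4 = 2,亏格 0
verts = [0, 1, 2, 3]
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
print(euler_characteristic(verts, faces))  # (2, 0)
```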
zh

[CV-98] Open Set Domain Adaptation with Vision-language models via Gradient-aware Separation

【速读】:该论文旨在解决开放集域适应(Open-Set Domain Adaptation, OSDA)中的双重挑战,即在不同域之间对齐已知类分布的同时识别目标域特有的未知类别。现有方法往往未能有效利用模态间的语义关系,并且在未知样本检测中容易出现误差累积。论文提出的解决方案关键在于利用对比语言-图像预训练(Contrastive Language-Image Pretraining, CLIP)的两个核心创新:一是通过可学习的文本提示动态适应CLIP的文本编码器,实现源域与目标域之间的语义一致性;二是通过梯度分析模块量化域偏移,基于学习提示的L2范数比较,区分已知与未知样本的梯度行为。

链接: https://arxiv.org/abs/2505.13507
作者: Haoyang Chen
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-Set Domain Adaptation (OSDA) confronts the dual challenge of aligning known-class distributions across domains while identifying target-domain-specific unknown categories. Current approaches often fail to leverage semantic relationships between modalities and struggle with error accumulation in unknown sample detection. We propose to harness Contrastive Language-Image Pretraining (CLIP) to address these limitations through two key innovations: 1) Prompt-driven cross-domain alignment: Learnable textual prompts conditioned on domain discrepancy metrics dynamically adapt CLIP’s text encoder, enabling semantic consistency between source and target domains without explicit unknown-class supervision. 2) Gradient-aware open-set separation: A gradient analysis module quantifies domain shift by comparing the L2-norm of gradients from the learned prompts, where known/unknown samples exhibit statistically distinct gradient behaviors. Evaluations on Office-Home show that our method consistently outperforms CLIP baseline and standard baseline. Ablation studies confirm the gradient norm’s critical role.
zh

[CV-99] An Edge AI Solution for Space Object Detection CEC

【速读】:该论文旨在解决近地轨道中空间物体检测(Space Object Detection, SOD)任务中对高精度和低延迟的迫切需求,以支持实时碰撞评估与规避。其解决方案的关键在于提出一种基于深度学习的边缘人工智能(Edge AI)方法,融合了Squeeze-and-Excitation(SE)模块、Vision Transformers(ViT)以及YOLOv9框架,从而在多种实际SOD场景中实现了高准确率和极低延迟的多卫星检测能力。

链接: https://arxiv.org/abs/2505.13468
作者: Wenxuan Zhang,Peng Hu
机构: University of Waterloo(滑铁卢大学); University of Manitoba(曼尼托巴大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Accepted as poster paper at the 2025 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE)

点击查看摘要

Abstract:Effective Edge AI for space object detection (SOD) tasks that can facilitate real-time collision assessment and avoidance is essential with the increasing space assets in near-Earth orbits. In SOD, low Earth orbit (LEO) satellites must detect other objects with high precision and minimal delay. We explore an Edge AI solution based on deep-learning-based vision sensing for SOD tasks and propose a deep learning model based on Squeeze-and-Excitation (SE) layers, Vision Transformers (ViT), and YOLOv9 framework. We evaluate the performance of these models across various realistic SOD scenarios, demonstrating their ability to detect multiple satellites with high accuracy and very low latency.
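摘要提到模型融合了 SE 模块。下面给出标准 Squeeze-and-Excitation 层的 PyTorch 参考实现,帮助理解其"压缩-激励-重标定"机制;它与论文中 SE、ViT 和 YOLOv9 的具体组合方式无关,`reduction=16` 等取值为惯例假设:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """标准 Squeeze-and-Excitation 模块:全局平均池化 -> 瓶颈 MLP -> 通道重加权。"""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: (B, C)
        w = self.fc(w).view(b, c, 1, 1)  # excitation: 通道注意力权重
        return x * w                     # 重标定特征图

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```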
zh

[CV-100] End-to-end fully-binarized network design: from Generic Learned Thermometer to Block Pruning

【速读】:该论文试图解决二值神经网络(Binary Neural Network, BNN)在输入数据表示上的不足,传统方法主要关注模型权重和激活值,而忽视了原始输入数据的处理。其解决方案的关键在于提出一种名为通用学习温度计(Generic Learned Thermometer, GLT)的编码技术,通过学习非线性量化阈值来改进输入数据的表示,该技术通过多次数据二值化替代传统的使用自然二进制编码的模数转换(Analog to Digital Conversion, ADC),从而提升模型的灵活性和性能。

链接: https://arxiv.org/abs/2505.13462
作者: Thien Nguyen,William Guicquero
机构: CEA-LETI(法国原子能委员会-微电子与信息技术实验室)
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
备注: Accepted to IEEE AICAS 2025

点击查看摘要

Abstract:Existing works on Binary Neural Network (BNN) mainly focus on the model's weights and activations while discarding considerations on the input raw data. This article introduces Generic Learned Thermometer (GLT), an encoding technique to improve input data representation for BNN, relying on learning non-linear quantization thresholds. This technique consists of multiple data binarizations which can advantageously replace a conventional Analog to Digital Conversion (ADC) that uses natural binary coding. Additionally, we jointly propose a compact topology with light-weight grouped convolutions being trained thanks to block pruning and Knowledge Distillation (KD), aiming to further reduce the model size as well as its computational complexity. We show that GLT brings versatility to the BNN by intrinsically performing global tone mapping, enabling significant accuracy gains in practice (demonstrated by simulations on the STL-10 and VWW datasets). Moreover, when combining GLT with our proposed block-pruning technique, we successfully achieve lightweight (under 1Mb), fully-binarized models with limited accuracy degradation while being suitable for in-sensor always-on inference use cases.
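下面用 PyTorch 勾勒"可学习阈值的温度计编码"思想:输入与一组可学习阈值逐一比较,得到多个二值通道。为使阈值可训练,这里采用直通估计器(STE)作近似——该梯度技巧以及阈值数量、初始化方式均为本示意的假设,论文的实际训练方式可能不同:

```python
import torch
import torch.nn as nn

class LearnedThermometer(nn.Module):
    """用一组可学习阈值对输入做多次二值化(温度计编码)的极简示意。"""
    def __init__(self, num_thresholds: int = 8):
        super().__init__()
        # 初始化为 (0,1) 区间的均匀阈值,训练中可学成非线性量化
        self.thresholds = nn.Parameter(torch.linspace(0.1, 0.9, num_thresholds))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, H, W) 归一化像素;输出 (B, T, H, W) 二值通道
        t = self.thresholds.view(1, -1, 1, 1)
        hard = (x > t).float()
        # 直通估计器:前向取硬二值,反向用 sigmoid 的梯度,保证阈值可学习
        soft = torch.sigmoid((x - t) * 50.0)
        return soft + (hard - soft).detach()

x = torch.rand(2, 1, 8, 8)
print(LearnedThermometer()(x).shape)  # torch.Size([2, 8, 8, 8])
```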
zh

[CV-101] Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws

【速读】:该论文试图解决在模型训练中如何利用预训练模型作为参考来指导和提升目标模型训练的问题,其核心挑战在于缺乏对这一新兴学习范式(称为model steering)的理论理解,导致性能未达最优。解决方案的关键在于提出一种基于风险最小化的理论驱动框架——DRRho风险最小化,该框架根植于分布鲁棒优化(Distributionally Robust Optimization, DRO),通过泛化分析揭示了该方法在泛化能力和数据效率上的优势。此外,结合对比学习与DRO的联系,提出了基于参考模型的新型对比语言-图像预训练方法DRRho-CLIP,实验验证了理论的有效性并展示了其优越性。

链接: https://arxiv.org/abs/2505.06699
作者: Xiyuan Wei,Ming Lin,Fanjiang Ye,Fengguang Song,Liangliang Cao,My T. Thai,Tianbao Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 18 pages, 6 figures

点击查看摘要

Abstract:This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting, named model steering. While ad-hoc methods have been used in various contexts, including the training of large foundation models, its underlying principles remain insufficiently understood, leading to sub-optimal performance. In this work, we propose a theory-driven framework for model steering called DRRho risk minimization, which is rooted in Distributionally Robust Optimization (DRO). Through a generalization analysis, we provide theoretical insights into why this approach improves generalization and data efficiency compared to training without a reference model. To the best of our knowledge, this is the first time such theoretical insights are provided for the new learning paradigm, which significantly enhance our understanding and practice of model steering. Building on these insights and the connection between contrastive learning and DRO, we introduce a novel method for Contrastive Language-Image Pretraining (CLIP) with a reference model, termed DRRho-CLIP. Extensive experiments validate the theoretical insights, reveal a superior scaling law compared to CLIP without a reference model, and demonstrate its strength over existing heuristic approaches.
zh

[CV-102] PseudoNeg-MAE: Self-Supervised Point Cloud Learning using Conditional Pseudo-Negative Embeddings

【速读】:该论文试图解决点云掩码自编码器(Point Cloud Masked Autoencoder, PCMAE)在自监督学习中难以同时保持特征表示的判别性和对变换的敏感性问题。传统对比学习方法注重不变性,而近期方法虽引入了变换敏感性,但存在不变性崩溃(invariant-collapse)现象,导致潜在表示在不同变换下变化有限。解决方案的关键在于提出一种新型损失函数,显式惩罚不变性崩溃,使网络在保留判别性表示的同时捕捉更丰富的变换线索,并引入参数化网络COPE来学习变换引起的局部位移,同时通过变换条件伪负样本损失防止COPE输出退化为恒等映射。

链接: https://arxiv.org/abs/2409.15832
作者: Sutharsan Mahendren,Saimunur Rahman,Piotr Koniusz,Tharindu Fernando,Sridha Sridharan,Clinton Fookes,Peyman Moghadam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We propose PseudoNeg-MAE, a novel self-supervised learning framework that enhances global feature representation of point cloud masked autoencoder by making them both discriminative and sensitive to transformations. Traditional contrastive learning methods focus on achieving invariance, discarding transformation-specific information. Recent approaches incorporate transformation sensitivity by explicitly modeling relationships between original and transformed inputs. However, they report an invariant-collapse phenomenon, where the predictor degenerates into identity mappings, resulting in latent representations that have limited variation across transformations. We propose a novel loss that explicitly penalizes invariant collapse, enabling the network to capture richer transformation cues while preserving discriminative representations. PseudoNeg-MAE uses a parametric network COPE, which learns the localized displacements caused by transformations within the latent space. However, jointly training COPE with the MAE leads to undesirable trivial solutions where COPE outputs collapse to an identity. To address this, we propose a loss that uses transformation-conditioned pseudo-negatives, to penalize such trivial invariant solutions. We validate PseudoNeg-MAE on shape classification and relative pose estimation tasks, where it achieves competitive performance on the ModelNet40 and ScanObjectNN datasets under challenging evaluation protocols and demonstrates superior accuracy in estimating relative poses compared to supervised methods.
zh

[CV-103] Automated Fetal Biometry Assessment with Deep Ensembles using Sparse-Sampling of 2D Intrapartum Ultrasound Images MICCAI

【速读】:该论文旨在解决产程中胎儿头部位置监测的准确性与一致性问题,通过减少观察者间和观察者内的变异来提高测量的可靠性。其解决方案的关键在于提出一种自动化的胎儿生物测量流程,包括从超声视频中分类标准切面、分割胎儿头部和耻骨联合以及计算角度进展(AoP)和头-耻骨距离(HSD)。该流程采用稀疏采样和基于集成的深度学习方法以增强模型的泛化能力,并通过保留最大连通区域和椭圆拟合提升测量的结构保真度。

链接: https://arxiv.org/abs/2505.14572
作者: Jayroop Ramesh,Valentin Bacher,Mark C. Eid,Hoda Kalabizadeh,Christian Rupprecht,Ana IL Namburete,Pak-Hei Yeung,Madeleine K.Wyburd,Nicola K. Dinsdale
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Top 5 in MICCAI IUGC 2024: Intrapartum Ultrasound Grand Challenge; Runners-up in Classification!

点击查看摘要

Abstract:The International Society of Ultrasound advocates Intrapartum Ultrasound (US) Imaging in Obstetrics and Gynecology (ISUOG) to monitor labour progression through changes in fetal head position. Two reliable ultrasound-derived parameters that are used to predict outcomes of instrumental vaginal delivery are the angle of progression (AoP) and head-symphysis distance (HSD). In this work, as part of the Intrapartum Ultrasounds Grand Challenge (IUGC) 2024, we propose an automated fetal biometry measurement pipeline to reduce intra- and inter-observer variability and improve measurement reliability. Our pipeline consists of three key tasks: (i) classification of standard planes (SP) from US videos, (ii) segmentation of fetal head and pubic symphysis from the detected SPs, and (iii) computation of the AoP and HSD from the segmented regions. We perform sparse sampling to mitigate class imbalances and reduce spurious correlations in task (i), and utilize ensemble-based deep learning methods for tasks (i) and (ii) to enhance generalizability under different US acquisition settings. Finally, to promote robustness in task (iii) with respect to the structural fidelity of measurements, we retain the largest connected components and apply ellipse fitting to the segmentations. Our solution achieved ACC: 0.9452, F1: 0.9225, AUC: 0.983, MCC: 0.8361, DSC: 0.918, HD: 19.73, ASD: 5.71, Δ_AoP: 8.90 and Δ_HSD: 14.35 across an unseen hold-out set of 4 patients and 224 US frames. The results from the proposed automated pipeline can improve the understanding of labour arrest causes and guide the development of clinical risk stratification tools for efficient and effective prenatal care.
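摘要中的后处理步骤(保留最大连通域 + 椭圆拟合)可以用 OpenCV 简洁地表达。下面是基于该思路的示意函数,连通性与阈值取常见默认值,并非作者的原始代码:

```python
import cv2
import numpy as np

def postprocess_mask(mask: np.ndarray):
    """保留最大连通域并拟合椭圆,提升分割结果的结构保真度(示意)。"""
    mask = (mask > 0).astype(np.uint8)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    if num <= 1:  # 没有前景
        return mask, None
    # stats[0] 是背景,从前景连通域中取面积最大者
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    clean = (labels == largest).astype(np.uint8)
    contours, _ = cv2.findContours(clean, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    ellipse = None
    if contours:
        cnt = max(contours, key=cv2.contourArea)
        if len(cnt) >= 5:  # cv2.fitEllipse 至少需要 5 个轮廓点
            ellipse = cv2.fitEllipse(cnt)
    return clean, ellipse  # ellipse: ((cx, cy), (长轴, 短轴), 角度)
```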
zh

[CV-104] Neural Inverse Scattering with Score-based Regularization

【速读】:该论文试图解决逆散射问题(inverse scattering problem),这是在显微成像到遥感等多个成像应用中的基础挑战。该问题通常需要联合估计两个未知量——物体内部的图像和散射场,因此需要有效的图像先验来正则化推断过程。论文提出的解决方案的关键在于引入一种正则化的神经场(neural field, NF)方法,该方法整合了基于分数的生成模型中使用的去噪分数函数(denoising score function),通过该函数引入图像的丰富结构先验,从而提升成像质量。

链接: https://arxiv.org/abs/2505.14560
作者: Yuan Gao,Wenhan Guo,Yu Sun
机构: Johns Hopkins University (约翰霍普金斯大学); Pomona College (波莫纳学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Inverse scattering is a fundamental challenge in many imaging applications, ranging from microscopy to remote sensing. Solving this problem often requires jointly estimating two unknowns – the image and the scattering field inside the object – necessitating effective image prior to regularize the inference. In this paper, we propose a regularized neural field (NF) approach which integrates the denoising score function used in score-based generative models. The neural field formulation offers convenient flexibility to performing joint estimation, while the denoising score function imposes the rich structural prior of images. Our results on three high-contrast simulated objects show that the proposed approach yields a better imaging quality compared to the state-of-the-art NF approach, where regularization is based on total variation.
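论文把去噪分数函数作为神经场联合估计的正则项。下面以 RED(Regularization by Denoising)风格的写法给出一个损失函数草图——用预训练去噪器的残差近似分数项;`forward_op`、正则权重 1e-2 与玩具去噪器均为假设,仅用于说明结构,并非论文的精确公式:

```python
import torch

def regularized_inversion_loss(image, field, forward_op, measurements, denoiser):
    """数据保真项 + RED 风格分数正则项的联合损失(示意)。"""
    # 数据项:联合未知量(图像, 散射场)经前向算子后与测量的差异
    data_loss = torch.mean((forward_op(image, field) - measurements) ** 2)
    # RED 正则:rho(x) = 0.5 <x, x - D(x)>,在一定假设下梯度为 x - D(x)
    with torch.no_grad():
        residual = image - denoiser(image)
    reg_loss = torch.sum(image * residual)
    return data_loss + 1e-2 * reg_loss

# 玩具示例:恒等式前向算子与线性"去噪器"(均为占位)
forward_op = lambda img, fld: img + fld
denoiser = lambda img: 0.9 * img
img = torch.randn(8, 8, requires_grad=True)
fld = torch.zeros(8, 8, requires_grad=True)
loss = regularized_inversion_loss(img, fld, forward_op, torch.zeros(8, 8), denoiser)
loss.backward()
```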
zh

[CV-105] Neural Video Compression with Context Modulation CVPR2025

【速读】:该论文旨在解决现有基于条件编码的神经视频编解码器(Neural Video Codec, NVC)在利用参考帧信息时存在的不足,即其固有的时间上下文传播机制无法充分挖掘参考信息,从而限制了压缩性能的进一步提升。解决方案的关键在于通过两个步骤对时间上下文进行调制:首先引入流方向机制,以挖掘参考帧与预测帧之间的相关性,生成额外的方向化时间上下文;其次引入上下文补偿机制,利用方向化上下文对从参考特征传播得到的时间上下文进行调制,结合协同机制和解耦损失监督,有效消除无关传播信息,从而实现更优的上下文建模。

链接: https://arxiv.org/abs/2505.14541
作者: Chuanbo Tang,Zhuoyuan Li,Yifan Bian,Li Li,Dong Liu
机构: University of Science and Technology of China (中国科学技术大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures, accepted by CVPR 2025

点击查看摘要

Abstract:Efficient video coding is highly dependent on exploiting the temporal redundancy, which is usually achieved by extracting and leveraging the temporal context in the emerging conditional coding-based neural video codec (NVC). Although the latest NVC has achieved remarkable progress in improving the compression performance, the inherent temporal context propagation mechanism lacks the ability to sufficiently leverage the reference information, limiting further improvement. In this paper, we address the limitation by modulating the temporal context with the reference frame in two steps. Specifically, we first propose the flow orientation to mine the inter-correlation between the reference frame and prediction frame for generating the additional oriented temporal context. Moreover, we introduce the context compensation to leverage the oriented context to modulate the propagated temporal context generated from the propagated reference feature. Through the synergy mechanism and decoupling loss supervision, the irrelevant propagated information can be effectively eliminated to ensure better context modeling. Experimental results demonstrate that our codec achieves on average 22.7% bitrate reduction over the advanced traditional video codec H.266/VVC, and offers an average 10.1% bitrate saving over the previous state-of-the-art NVC DCVC-FM. The code is available at this https URL.
zh

[CV-106] Scaling and Enhancing LLM -based AVSR: A Sparse Mixture of Projectors Approach

【速读】:该论文旨在解决在资源受限环境中部署音频-视觉语音识别(AVSR)系统时面临的高计算成本问题。其解决方案的关键在于提出Llama-SMoP,这是一种高效的多模态大语言模型(LLM),通过引入稀疏的专家混合(SMoP)模块,在不增加推理成本的前提下提升模型容量。该模块利用稀疏门控的专家混合(MoE)投影器,使小型LLM能够在保持高性能的同时应用于AVSR任务。

链接: https://arxiv.org/abs/2505.14336
作者: Umberto Cappellazzo,Minsu Kim,Stavros Petridis,Daniele Falavigna,Alessio Brutti
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Audio-Visual Speech Recognition (AVSR) enhances robustness in noisy environments by integrating visual cues. While recent advances integrate Large Language Models (LLMs) into AVSR, their high computational cost hinders deployment in resource-constrained settings. To address this, we propose Llama-SMoP, an efficient Multimodal LLM that employs a Sparse Mixture of Projectors (SMoP) module to scale model capacity without increasing inference costs. By incorporating sparsely-gated mixture-of-experts (MoE) projectors, Llama-SMoP enables the use of smaller LLMs while maintaining strong performance. We explore three SMoP configurations and show that Llama-SMoP DEDR (Disjoint-Experts, Disjoint-Routers), which uses modality-specific routers and experts, achieves superior performance on ASR, VSR, and AVSR tasks. Ablation studies confirm its effectiveness in expert activation, scalability, and noise robustness.
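SMoP 的核心是"稀疏门控的投影器混合":路由器为每个 token 选出 top-k 个专家投影并加权求和,从而在激活专家数恒定(推理成本不变)的前提下扩大容量。下面是一个极简示意(为可读性对所有专家做了全量前向,效率上并非论文做法;专家数与 k 值均为假设):

```python
import torch
import torch.nn as nn

class SparseMoProjector(nn.Module):
    """稀疏门控的投影器混合(SMoP)示意:每个 token 仅激活 top-k 个专家。"""
    def __init__(self, d_in: int, d_out: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_in, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(num_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_in)
        logits = self.router(x)                     # (B, T, E)
        weights, idx = logits.topk(self.k, dim=-1)  # 稀疏化:只保留 top-k
        weights = weights.softmax(dim=-1)
        out = torch.zeros(*x.shape[:-1], self.experts[0].out_features, device=x.device)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., slot] == e).unsqueeze(-1)  # 命中该专家的 token
                out = out + mask * weights[..., slot:slot + 1] * expert(x)
        return out

x = torch.randn(2, 10, 256)
print(SparseMoProjector(256, 512)(x).shape)  # torch.Size([2, 10, 512])
```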
zh

[CV-107] From stability of Langevin diffusion to convergence of proximal MCMC for non-log-concave sampling

【速读】:该论文试图解决从非凸势能中采样分布的问题,特别是针对在成像逆问题等场景中常见的非凸且非光滑势能。其解决方案的关键在于提出一种结合前向-后向优化算法与无调整朗之万算法(ULA)步骤的近端随机梯度朗之万算法(PSGLA),并通过证明PSGLA在强凸势能下的稳定性以及利用Moreau包络的性质,首次建立了PSGLA在非凸势能下的收敛性证明。

链接: https://arxiv.org/abs/2505.14177
作者: Marien Renaud,Valentin De Bortoli,Arthur Leclaire,Nicolas Papadakis
机构: Univ. Bordeaux (波尔多大学); CNRS (法国国家科学研究中心); Bordeaux INP (波尔多综合理工学院); IMB, UMR 5251 (波尔多数学实验室,UMR 5251); ENS (法国高等师范学院); PSL University (巴黎萨克雷大学); LTCI (电信研究院); Télécom Paris (巴黎电信学院); IP Paris (巴黎理工学院); INRIA (法国国家信息与自动化研究所)
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We consider the problem of sampling distributions stemming from non-convex potentials with Unadjusted Langevin Algorithm (ULA). We prove the stability of the discrete-time ULA to drift approximations under the assumption that the potential is strongly convex at infinity. In many contexts, e.g. imaging inverse problems, potentials are non-convex and non-smooth. Proximal Stochastic Gradient Langevin Algorithm (PSGLA) is a popular algorithm to handle such potentials. It combines the forward-backward optimization algorithm with a ULA step. Our main stability result combined with properties of the Moreau envelope allows us to derive the first proof of convergence of the PSGLA for non-convex potentials. We empirically validate our methodology on synthetic data and in the context of imaging inverse problems. In particular, we observe that PSGLA exhibits faster convergence rates than Stochastic Gradient Langevin Algorithm for posterior sampling while preserving its restoration properties.
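PSGLA 的单步更新可写作 x_{k+1} = prox_{γg}(x_k - γ∇f(x_k) + √(2γ) ξ_k),其中 ξ_k 为标准高斯噪声。下面用一个 L1 正则的玩具问题演示这一更新(步长与正则强度均为随手设定的假设值):

```python
import torch

def psgla_step(x, grad_f, prox_g, gamma=1e-3):
    """PSGLA 单步:梯度步 + 朗之万噪声 + 近端算子(示意)。"""
    noise = torch.randn_like(x)
    return prox_g(x - gamma * grad_f(x) + (2 * gamma) ** 0.5 * noise, gamma)

# 示例:f 为二次数据项,g 为 L1(其 prox 为软阈值)
A = torch.randn(20, 10)
y = torch.randn(20)
grad_f = lambda x: A.T @ (A @ x - y)
prox_l1 = lambda v, g: torch.sign(v) * torch.clamp(v.abs() - 0.1 * g, min=0)

x = torch.zeros(10)
for _ in range(1000):
    x = psgla_step(x, grad_f, prox_l1, gamma=1e-3)
print(x.norm())  # 迭代得到的样本(近似后验)
```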
zh

[CV-108] NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI

【速读】:该论文旨在解决当前基础和视觉-语言模型在面对分布外(out-of-distribution)数据时的泛化能力不足问题,特别是在医疗影像领域中对罕见或全新病理情况的检测、定位与推理能力。其解决方案的关键在于构建一个名为NOVA的评估基准,该基准包含约900个脑部MRI扫描,涵盖281种罕见病理性变化及异构的采集协议,且每个案例均配有丰富的临床叙述和双盲专家边界框标注,从而实现对异常定位、视觉描述生成和诊断推理的联合评估。由于NOVA未用于模型训练,它成为了一种极端的分布外泛化压力测试,要求模型在样本外观和语义空间上均能跨越分布差距。

链接: https://arxiv.org/abs/2505.14064
作者: Cosmin I. Bercea,Jun Li,Philipp Raffler,Evamaria O. Riedel,Lena Schmitzer,Angela Kurz,Felix Bitzer,Paula Roßmüller,Julian Canisius,Mirjam L. Beyrle,Che Liu,Wenjia Bai,Bernhard Kainz,Julia A. Schnabel,Benedikt Wiestler
机构: Technical University of Munich(慕尼黑工业大学); Klinikum Rechts der Isar(伊萨尔河右岸诊所); Helmholtz Center Munich(慕尼黑赫姆霍兹中心); FAU Erlangen-Nürnberg(埃尔朗根-纽伦堡大学); Imperial College London(伦敦帝国理工学院); King’s College London(伦敦国王学院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Out-of-distribution detection identifies whether an input stems from an unseen distribution, while open-world recognition flags such inputs to ensure the system remains robust as ever-emerging, previously unknown categories appear and must be addressed without retraining. Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life evaluation-only benchmark of ~900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an extreme stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops across all tasks, establishing NOVA as a rigorous testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.
zh

[CV-109] End-to-end Cortical Surface Reconstruction from Clinical Magnetic Resonance Images

【速读】:该论文旨在解决临床磁共振(MR)扫描在进行皮层表面分析时的局限性,因为现有工具通常仅适用于至少1 mm各向同性分辨率且针对特定磁共振对比度(如T1加权)的扫描,而临床扫描在对比度和分辨率上具有高度异质性。其解决方案的关键在于使用合成领域随机化数据训练首个能够直接从任意对比度和分辨率的扫描中显式估计皮层表面的神经网络,该方法通过将模板网格变形至白质(WM)表面以保证拓扑正确性,并进一步变形以估计灰质(GM)表面,从而实现无需重新训练即可处理异质性临床MR扫描的高效准确表面重建。

链接: https://arxiv.org/abs/2505.14017
作者: Jesper Duemose Nielsen,Karthik Gopinath,Andrew Hoopes,Adrian Dalca,Colin Magdamo,Steven Arnold,Sudeshna Das,Axel Thielscher,Juan Eugenio Iglesias,Oula Puonti
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Surface-based cortical analysis is valuable for a variety of neuroimaging tasks, such as spatial normalization, parcellation, and gray matter (GM) thickness estimation. However, most tools for estimating cortical surfaces work exclusively on scans with at least 1 mm isotropic resolution and are tuned to a specific magnetic resonance (MR) contrast, often T1-weighted (T1w). This precludes application using most clinical MR scans, which are very heterogeneous in terms of contrast and resolution. Here, we use synthetic domain-randomized data to train the first neural network for explicit estimation of cortical surfaces from scans of any contrast and resolution, without retraining. Our method deforms a template mesh to the white matter (WM) surface, which guarantees topological correctness. This mesh is further deformed to estimate the GM surface. We compare our method to recon-all-clinical (RAC), an implicit surface reconstruction method which is currently the only other tool capable of processing heterogeneous clinical MR scans, on ADNI and a large clinical dataset (n=1,332). We show an approximately 50% reduction in cortical thickness error (from 0.50 to 0.24 mm) with respect to RAC and better recovery of the aging-related cortical thinning patterns detected by FreeSurfer on high-resolution T1w scans. Our method enables fast and accurate surface reconstruction of clinical scans, allowing studies (1) with sample sizes far beyond what is feasible in a research setting, and (2) of clinical populations that are difficult to enroll in research studies. The code is publicly available at this https URL.
zh

[CV-110] Bronchovascular Tree-Guided Weakly Supervised Learning Method for Pulmonary Segment Segmentation

【速读】:该论文旨在解决肺段分割中由于肺段边界在医学图像中难以区分而导致的像素级标注繁琐的问题。其解决方案的关键在于提出一种弱监督学习方法,即解剖-层次监督学习(Anatomy-Hierarchy Supervised Learning, AHSL),该方法借助精确的临床解剖定义进行肺段分割。该方法的设计基于两个原则:一是利用肺段级标签直接监督肺段输出,确保其准确包含相应的支气管血管树;二是通过肺叶级监督间接控制肺段,确保其位于对应的肺叶内。此外,还引入了结合支气管血管先验信息的两阶段分割策略,并提出了一个一致性损失以增强分割边界平滑性。

链接: https://arxiv.org/abs/2505.13911
作者: Ruijie Zhao(1),Zuopeng Tan(2),Xiao Xue(2),Longfei Zhao(2),Bing Li(2),Zicheng Liao(1),Ying Ming(1),Jiaru Wang(1),Ran Xiao(1),Sirong Piao(1),Rui Zhao(1),Qiqi Xu(2),Wei Song(1) ((1) Department of Radiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China, (2) Canon Medical Systems (China), Beijing, China)
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pulmonary segment segmentation is crucial for cancer localization and surgical planning. However, the pixel-wise annotation of pulmonary segments is laborious, as the boundaries between segments are indistinguishable in medical images. To this end, we propose a weakly supervised learning (WSL) method, termed Anatomy-Hierarchy Supervised Learning (AHSL), which consults the precise clinical anatomical definition of pulmonary segments to perform pulmonary segment segmentation. Since pulmonary segments reside within the lobes and are determined by the bronchovascular tree, i.e., artery, airway and vein, the design of the loss function is founded on two principles. First, segment-level labels are utilized to directly supervise the output of the pulmonary segments, ensuring that they accurately encompass the appropriate bronchovascular tree. Second, lobe-level supervision indirectly oversees the pulmonary segments, ensuring their inclusion within the corresponding lobe. Besides, we introduce a two-stage segmentation strategy that incorporates bronchovascular prior information. Furthermore, a consistency loss is proposed to enhance the smoothness of segment boundaries, along with an evaluation metric designed to measure the smoothness of pulmonary segment boundaries. Visual inspection and evaluation metrics from experiments conducted on a private dataset demonstrate the effectiveness of our method.
zh

[CV-111] XDementNET: An Explainable Attention Based Deep Convolutional Network to Detect Alzheimer Progression from MRI data

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)的精准诊断问题,特别是在医疗费用上升和人工智能在医学影像诊断中广泛应用的背景下。其解决方案的关键在于提出一种新型深度学习架构,该架构结合了多残差块、专用空间注意力块、分组查询注意力机制以及多头注意力机制,以提升从脑部磁共振成像(MRI)数据中提取关键信息的能力,从而实现对AD的高效分类与诊断。

链接: https://arxiv.org/abs/2505.13906
作者: Soyabul Islam Lincoln,Mirza Mohd Shahriar Maswood
机构: Khulna University of Engineering & Technology (库尔纳工程技术大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 12 figures,

点击查看摘要

Abstract:A common neurodegenerative disease, Alzheimer's disease requires a precise diagnosis and efficient treatment, particularly in light of escalating healthcare expenses and the expanding use of artificial intelligence in medical diagnostics. Many recent studies show that the combination of brain Magnetic Resonance Imaging (MRI) and deep neural networks has achieved promising results for diagnosing AD. Using deep convolutional neural networks, this paper introduces a novel deep learning architecture that incorporates multiresidual blocks, specialized spatial attention blocks, grouped query attention, and multi-head attention. The study assessed the model's performance on four publicly accessible datasets and concentrated on identifying binary and multiclass issues across various categories. This paper also takes into account the explainability of AD's progression and compares the approach with state-of-the-art methods, namely Gradient Class Activation Mapping (GradCAM), Score-CAM, Faster Score-CAM, and XGRADCAM. Our methodology consistently outperforms current approaches, achieving 99.66% accuracy in 4-class classification, 99.63% in 3-class classification, and 100% in binary classification using Kaggle datasets. For Open Access Series of Imaging Studies (OASIS) datasets the accuracies are 99.92%, 99.90%, and 99.95% respectively. The Alzheimer's Disease Neuroimaging Initiative-1 (ADNI-1) dataset was used for experiments in three planes (axial, sagittal, and coronal) and a combination of all planes. The study achieved accuracies of 99.08% for the axial plane, 99.85% for sagittal, 99.5% for coronal, and 99.17% for all planes combined, and 97.79% and 8.60% respectively for ADNI-2. The network's ability to retrieve important information from MRI images is demonstrated by its excellent accuracy in categorizing AD stages.
zh

[CV-112] Automated Quality Evaluation of Cervical Cytopathology Whole Slide Images Based on Content Analysis

【速读】:该论文旨在解决宫颈癌筛查中传统人工评估方法存在的主观性强、成本高、耗时长及可靠性低的问题。其解决方案的关键在于提出一种基于The Bethesda System (TBS)诊断标准、人工智能算法和临床数据特征的全自动化质量评估方法,通过多模型分析WSI的上下文信息,量化与TBS相关的质量评价指标,并利用XGBoost模型挖掘病理医生对不同质量评价指标的关注度,从而构建全面的WSI样本评分模型。

链接: https://arxiv.org/abs/2505.13875
作者: Lanlan Kang,Jian Wang,Jian QIn,Yiqin Liang,Yongjun He
机构: Harbin University of Science and Technology (哈尔滨科学技术大学); Anhui University of Technology (安徽工业大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 10 figures

点击查看摘要

Abstract:The ThinPrep Cytologic Test (TCT) is the most widely used method for cervical cancer screening, and the sample quality directly impacts the accuracy of the diagnosis. Traditional manual evaluation methods rely on the observation of pathologists under microscopes. These methods exhibit high subjectivity, high cost, long duration, and low reliability. With the development of computer-aided diagnosis (CAD), an automated quality assessment system that performs at the level of a professional pathologist is necessary. To address this need, we propose a fully automated quality assessment method for Cervical Cytopathology Whole Slide Images (WSIs) based on The Bethesda System (TBS) diagnostic standards, artificial intelligence algorithms, and the characteristics of clinical data. The method analyzes the context of WSIs to quantify the quality evaluation metrics emphasized by TBS, such as staining quality, cell counts and cell mass proportion, through multiple models including object detection, classification and segmentation. Subsequently, the XGBoost model is used to mine the attention paid by pathologists to different quality evaluation metrics when evaluating samples, thereby obtaining a comprehensive WSI sample score calculation model. Experimental results on 100 WSIs demonstrate that the proposed evaluation method has significant advantages in terms of speed and consistency.
zh

[CV-113] Exploring Image Quality Assessment from a New Perspective: Pupil Size

【速读】:该论文试图解决图像质量评估(Image Quality Assessment, IQA)任务对人类认知过程的影响问题,以及瞳孔大小与图像质量之间的关系。其解决方案的关键在于通过主观实验对比自由观察任务和IQA任务中瞳孔大小的变化,从而揭示人在评估图像质量时可能激活了视觉注意力机制,并发现瞳孔大小变化与图像质量之间存在紧密关联。这一研究为客观IQA方法的理论基础提供了支持,并提出了新的主观IQA方法以获取真实的主观图像质量印象。

链接: https://arxiv.org/abs/2505.13841
作者: Yixuan Gao,Xiongkuo Min,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper explores how the image quality assessment (IQA) task affects the cognitive processes of people from the perspective of pupil size and studies the relationship between pupil size and image quality. Specifically, we first invited subjects to participate in a subjective experiment, which includes two tasks: free observation and IQA. In the free observation task, subjects did not need to perform any action, and they only needed to observe images as they usually do with an album. In the IQA task, subjects were required to score images according to their overall impression of image quality. Then, by analyzing the difference in pupil size between the two tasks, we find that people may activate the visual attention mechanism when evaluating image quality. Meanwhile, we also find that the change in pupil size is closely related to image quality in the IQA task. For future research on IQA, this research can not only provide a theoretical basis for the objective IQA method and promote the development of more effective objective IQA methods, but also provide a new subjective IQA method for collecting the authentic subjective impression of image quality.
zh

[CV-114] Direction-Aware Neural Acoustic Fields for Few-Shot Interpolation of Ambisonic Impulse Responses INTERSPEECH2025

【速读】:该论文试图解决现有基于神经场(Neural Fields, NFs)的方法在建模声场时未能精确捕捉单点方向性特征的问题,特别是在处理多声道声场数据时的局限性。其解决方案的关键在于提出一种方向感知神经场(Direction-aware Neural Field, DANF),通过引入Ambisonic格式的房间脉冲响应(RIR)来显式地融入方向信息,并设计了一种方向感知损失函数以增强模型对声场方向特性的建模能力。此外,还探索了DANF在不同新房间场景下的适应性,包括低秩适配方法。

链接: https://arxiv.org/abs/2505.13617
作者: Christopher Ick,Gordon Wichern,Yoshiki Masuyama,François Germain,Jonathan Le Roux
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:The characteristics of a sound field are intrinsically linked to the geometric and spatial properties of the environment surrounding a sound source and a listener. The physics of sound propagation is captured in a time-domain signal known as a room impulse response (RIR). Prior work using neural fields (NFs) has allowed learning spatially-continuous representations of RIRs from finite RIR measurements. However, previous NF-based methods have focused on monaural omnidirectional or at most binaural listeners, which does not precisely capture the directional characteristics of a real sound field at a single point. We propose a direction-aware neural field (DANF) that more explicitly incorporates the directional information by Ambisonic-format RIRs. While DANF inherently captures spatial relations between sources and listeners, we further propose a direction-aware loss. In addition, we investigate the ability of DANF to adapt to new rooms in various ways including low-rank adaptation.
zh

[CV-115] Learning Wavelet-Sparse FDK for 3D Cone-Beam CT Reconstruction

【速读】:该论文旨在解决锥形束计算机断层扫描(Cone-Beam Computed Tomography, CBCT)中传统Feldkamp-Davis-Kress (FDK)算法对噪声和伪影敏感的问题,同时克服现有深度学习方法在计算复杂度和可解释性方面的不足。其解决方案的关键在于提出一种基于FDK的增强神经网络,通过在余弦加权和滤波阶段选择性地引入可训练元素,保持传统算法的可解释性;同时利用小波变换对余弦权重和滤波器进行稀疏表示,从而将参数数量减少93.75%,在不牺牲性能的前提下加速收敛并维持与传统FDK算法相当的推理计算成本。

链接: https://arxiv.org/abs/2505.13579
作者: Yipeng Sun,Linda-Sophie Schneider,Chengze Ye,Mingxuan Gu,Siyuan Mei,Siming Bayer,Andreas Maier
机构: Pattern Recognition Lab, Friedrich-Alexander University Erlangen-Nuremberg(模式识别实验室,埃尔朗根-纽伦堡弗里德里希-亚历山大大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by Fully3D 2025

点击查看摘要

Abstract:Cone-Beam Computed Tomography (CBCT) is essential in medical imaging, and the Feldkamp-Davis-Kress (FDK) algorithm is a popular choice for reconstruction due to its efficiency. However, FDK is susceptible to noise and artifacts. While recent deep learning methods offer improved image quality, they often increase computational complexity and lack the interpretability of traditional methods. In this paper, we introduce an enhanced FDK-based neural network that maintains the classical algorithm’s interpretability by selectively integrating trainable elements into the cosine weighting and filtering stages. Recognizing the challenge of a large parameter space inherent in 3D CBCT data, we leverage wavelet transformations to create sparse representations of the cosine weights and filters. This strategic sparsification reduces the parameter count by 93.75% without compromising performance, accelerates convergence, and importantly, maintains the inference computational cost equivalent to the classical FDK algorithm. Our method not only ensures volumetric consistency and boosts robustness to noise, but is also designed for straightforward integration into existing CT reconstruction pipelines. This presents a pragmatic enhancement that can benefit clinical applications, particularly in environments with computational limitations.
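论文用小波变换稀疏表示余弦权重与滤波器,把参数量削减 93.75%。下面用 PyWavelets 演示"滤波器的小波域稀疏化":只保留约 1/16(对应 93.75% 的削减)幅值最大的系数再重建;小波基 db4 与保留比例均为示意假设:

```python
import numpy as np
import pywt

def sparsify_filter(filt: np.ndarray, keep_ratio: float = 0.0625, wavelet: str = "db4"):
    """将一维滤波器变换到小波域,仅保留幅值最大的少量系数(示意)。"""
    coeffs = pywt.wavedec(filt, wavelet)
    arr, slices = pywt.coeffs_to_array(coeffs)
    k = max(1, int(arr.size * keep_ratio))
    thresh = np.sort(np.abs(arr).ravel())[-k]          # 第 k 大的幅值作阈值
    arr_sparse = np.where(np.abs(arr) >= thresh, arr, 0.0)
    coeffs_sparse = pywt.array_to_coeffs(arr_sparse, slices, output_format="wavedec")
    return pywt.waverec(coeffs_sparse, wavelet)[: filt.size]

# 以经典的 Ram-Lak 斜坡滤波器(空间域形式)为例
n = 512
freqs = np.fft.fftfreq(n)
ramp = np.fft.ifft(np.abs(freqs)).real
approx = sparsify_filter(ramp)
print(np.linalg.norm(approx - ramp) / np.linalg.norm(ramp))  # 相对近似误差
```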
zh

[CV-116] GANCompress: GAN-Enhanced Neural Image Compression with Binary Spherical Quantization

【速读】:该论文旨在解决视觉数据压缩中面临的三大核心问题:在高压缩比下保持感知质量、提升计算效率以及适应多样化视觉内容。其解决方案的关键在于提出一种名为GANCompress的神经压缩框架,该框架通过将二值球面量化(Binary Spherical Quantization, BSQ)与生成对抗网络(Generative Adversarial Networks, GANs)相结合,实现高效的离散化和高质量的重建。具体而言,该方法采用基于Transformer的自编码器,并引入增强型BSQ瓶颈,将潜在表示投影到超球面以控制量化误差;随后通过包含频域注意力机制和颜色一致性优化的专用GAN架构进一步提升压缩性能。

链接: https://arxiv.org/abs/2505.13542
作者: Karthik Sivakoti
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The exponential growth of visual data in digital communications has intensified the need for efficient compression techniques that balance rate-distortion performance with computational feasibility. While recent neural compression approaches have shown promise, they still struggle with fundamental challenges: preserving perceptual quality at high compression ratios, computational efficiency, and adaptability to diverse visual content. This paper introduces GANCompress, a novel neural compression framework that synergistically combines Binary Spherical Quantization (BSQ) with Generative Adversarial Networks (GANs) to address these challenges. Our approach employs a transformer-based autoencoder with an enhanced BSQ bottleneck that projects latent representations onto a hypersphere, enabling efficient discretization with bounded quantization error. This is followed by a specialized GAN architecture incorporating frequency-domain attention and color consistency optimization. Experimental results demonstrate that GANCompress achieves substantial improvement in compression efficiency – reducing file sizes by up to 100x with minimal visual distortion. Our method outperforms traditional codecs like H.264 by 12-15% in perceptual metrics while maintaining comparable PSNR/SSIM values, with 2.4x faster encoding and decoding speeds. On standard benchmarks including ImageNet-1k and COCO2017, GANCompress sets a new state-of-the-art, reducing FID from 0.72 to 0.41 (43% improvement) compared to previous methods while maintaining higher throughput. This work presents a significant advancement in neural compression technology with promising applications for real-time visual communication systems.
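二值球面量化(BSQ)的核心操作是:把潜向量投影到单位超球面,逐维取符号得到比特码,再把量化点归一化回球面;训练时常配合直通估计器传递梯度。下面是该操作的一个极简草图(STE 细节为常见做法,并非对论文实现的逐句复刻):

```python
import torch

def binary_spherical_quantize(z: torch.Tensor):
    """BSQ 示意:投影到单位超球面 -> 符号量化 -> 归一化回球面。"""
    d = z.shape[-1]
    u = torch.nn.functional.normalize(z, dim=-1)  # 投影到单位超球面
    codes = torch.sign(u)                         # 每维取 ±1,可打包为比特
    q = codes / d ** 0.5                          # 量化点同样位于单位球面上
    # 直通估计器:前向用量化值,反向让梯度穿透到 u
    return u + (q - u).detach(), (codes > 0)      # 返回量化向量与比特码

z = torch.randn(4, 64)
q, bits = binary_spherical_quantize(z)
print(q.shape, bits.shape, torch.allclose(q.norm(dim=-1), torch.ones(4)))
```

由于量化码本就是超球面上 2^d 个对称分布的点,量化误差有界,这也是摘要中"bounded quantization error"的直观来源。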
zh

人工智能

[AI-0] Abacus: A Cost-Based Optimizer for Semantic Operator Systems

【速读】:该论文试图解决在基于语义操作符的系统中,如何通过全局优化提升系统性能、成本和延迟的问题。现有优化器在可应用的优化类型上存在局限,无法在满足其他维度约束的情况下优化系统质量、成本或延迟。解决方案的关键是提出Abacus,一个可扩展的成本基础优化器,它通过最小的一组验证示例和可能的先验操作符性能知识来估计操作符性能,并搜索给定优化目标下的最佳语义操作符实现方式。

链接: https://arxiv.org/abs/2505.14661
作者: Matthew Russo,Sivaprasad Sudhir,Gerardo Vitagliano,Chunwei Liu,Tim Kraska,Samuel Madden,Michael Cafarella
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures

点击查看摘要

Abstract:LLMs enable an exciting new class of data processing applications over large collections of unstructured documents. Several new programming frameworks have enabled developers to build these applications by composing them out of semantic operators: a declarative set of AI-powered data transformations with natural language specifications. These include LLM-powered maps, filters, joins, etc. used for document processing tasks such as information extraction, summarization, and more. While systems of semantic operators have achieved strong performance on benchmarks, they can be difficult to optimize. An optimizer for this setting must determine how to physically implement each semantic operator in a way that optimizes the system globally. Existing optimizers are limited in the number of optimizations they can apply, and most (if not all) cannot optimize system quality, cost, or latency subject to constraint(s) on the other dimensions. In this paper we present Abacus, an extensible, cost-based optimizer which searches for the best implementation of a semantic operator system given a (possibly constrained) optimization objective. Abacus estimates operator performance by leveraging a minimal set of validation examples and, if available, prior beliefs about operator performance. We evaluate Abacus on document processing workloads in the biomedical and legal domains (BioDEX; CUAD) and multi-modal question answering (MMQA). We demonstrate that systems optimized by Abacus achieve 18.7%-39.2% better quality and up to 23.6x lower cost and 4.2x lower latency than the next best system.
zh

[AI-1] Explainable AI for Securing Healthcare in IoT-Integrated 6G Wireless Networks

【速读】:该论文旨在解决6G赋能医疗环境中因集成物联网设备而产生的安全风险问题,这些问题可能对患者安全和数据隐私造成严重威胁。论文提出的解决方案关键在于利用可解释性AI(Explainable AI)技术,如SHAP、LIME和DiCE,以识别系统中的漏洞、增强防御机制,并提升6G医疗系统的信任度与透明度。

链接: https://arxiv.org/abs/2505.14659
作者: Navneet Kaur,Lav Gupta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As healthcare systems increasingly adopt advanced wireless networks and connected devices, securing medical applications has become critical. The integration of Internet of Medical Things devices, such as robotic surgical tools, intensive care systems, and wearable monitors has enhanced patient care but introduced serious security risks. Cyberattacks on these devices can lead to life threatening consequences, including surgical errors, equipment failure, and data breaches. While the ITU IMT 2030 vision highlights 6G’s transformative role in healthcare through AI and cloud integration, it also raises new security concerns. This paper explores how explainable AI techniques like SHAP, LIME, and DiCE can uncover vulnerabilities, strengthen defenses, and improve trust and transparency in 6G enabled healthcare. We support our approach with experimental analysis and highlight promising results.
zh

[AI-2] Cost-Augmented Monte Carlo Tree Search for LLM -Assisted Planning

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在成本敏感型规划任务中表现不佳的问题,特别是在处理严格预算约束时,LLMs往往无法有效平衡任务成功率与成本效率。论文提出的解决方案是Cost-Augmented Monte Carlo Tree Search (CATS),其关键在于将显式的成本意识引入LLM引导的规划过程中,通过调整成本约束的松紧程度,使规划器能够在快速排除不可行解与优化最小成本之间取得平衡,从而提升任务成功率和成本效率。

链接: https://arxiv.org/abs/2505.14656
作者: Zihao Zhang,Fei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While LLMs excel at open-ended reasoning, they often struggle with cost-sensitive planning, either treating all actions as having equal cost or failing to stay within strict budgets. In this paper, we introduce Cost-Augmented Monte Carlo Tree Search (CATS), a novel approach that brings explicit cost-awareness into LLM-guided planning. Tight cost constraints push the planner to quickly identify infeasible solutions, while looser constraints encourage optimization for minimal cost. We benchmark top LLMs such as GPT-4.1, Claude-3.7-Sonnet, and DeepSeek-R1, against our CATS planner to evaluate their performance in cost-sensitive scenarios. Our experiments suggest that raw LLMs such as GPT-4.1 often falter under tight budgets, whereas CATS consistently delivers strong performance, achieving higher task success rates and better cost efficiency. CATS provides an effective solution for budget-aware decision-making by combining the reasoning power of LLMs with structured search.
zh

[AI-3] Let LLM s Break Free from Overthinking via Self-Braking Tuning

【速读】:该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在生成过程中出现的冗余推理问题,即“过度思考”(overthinking)现象,该现象导致计算开销增加并影响模型效率。解决方案的关键在于提出一种名为自制动调优(Self-Braking Tuning, SBT)的新框架,该框架通过让模型自身调节推理过程,从而减少对外部干预的依赖。SBT通过构建基于标准答案的过度思考识别指标,设计系统化方法检测冗余推理,并生成用于学习自我调节行为的训练信号,同时引入创新的制动提示机制,使模型能够自然地在适当节点终止推理,从而有效降低token消耗并保持较高的准确性。

链接: https://arxiv.org/abs/2505.14604
作者: Haoran Zhao,Yuchen Yan,Yongliang Shen,Haolei Xu,Wenqi Zhang,Kaitao Song,Jian Shao,Weiming Lu,Jun Xiao,Yueting Zhuang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Github: this https URL Project: this https URL

点击查看摘要

Abstract:Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, demonstrating outstanding performance across a variety of tasks. However, this performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process, leading to high computational overhead and exacerbating the issue of overthinking. Although numerous existing approaches aim to address the problem of overthinking, they often rely on external interventions. In this paper, we propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process, thus eliminating the reliance on external control mechanisms. We construct a set of overthinking identification metrics based on standard answers and design a systematic method to detect redundant reasoning. This method accurately identifies unnecessary steps within the reasoning trajectory and generates training signals for learning self-regulation behaviors. Building on this foundation, we develop a complete strategy for constructing data with adaptive reasoning lengths and introduce an innovative braking prompt mechanism that enables the model to naturally learn when to terminate reasoning at an appropriate point. Experiments across mathematical benchmarks (AIME, AMC, MATH500, GSM8K) demonstrate that our method reduces token consumption by up to 60% while maintaining comparable accuracy to unconstrained models.
zh

[AI-4] owards a Foundation Model for Communication Systems

【速读】:该论文试图解决通信系统中如何利用人工智能(Artificial Intelligence, AI)进行多任务特征估计的问题,特别是通过构建一个适用于通信数据的基础模型来提升模型的通用性和适应性。其解决方案的关键在于提出一种基于Transformer的多模态模型,该模型能够直接处理通信数据,并通过创新的方法应对包括分词、位置嵌入、多模态融合、可变特征尺寸和归一化在内的关键技术挑战。

链接: https://arxiv.org/abs/2505.14603
作者: Davide Buffelli,Sowmen Das,Yu-Wei Lin,Sattar Vakili,Chien-Yi Wang,Masoud Attarifar,Pritthijit Nath,Da-shan Shiu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) has demonstrated unprecedented performance across various domains, and its application to communication systems is an active area of research. While current methods focus on task-specific solutions, the broader trend in AI is shifting toward large general models capable of supporting multiple applications. In this work, we take a step toward a foundation model for communication data–a transformer-based, multi-modal model designed to operate directly on communication data. We propose methodologies to address key challenges, including tokenization, positional embedding, multimodality, variable feature sizes, and normalization. Furthermore, we empirically demonstrate that such a model can successfully estimate multiple features, including transmission rank, selected precoder, Doppler spread, and delay profile.
zh

[AI-5] KIPPO: Koopman-Inspired Proximal Policy Optimization IJCAI IJCAI2025

【速读】:该论文试图解决在复杂和非线性动力系统中开发有效控制策略的挑战,尤其是在高方差梯度估计和非凸优化景观导致学习轨迹不稳定的情况下。解决方案的关键在于引入Koopman-Inspired Proximal Policy Optimization (KIPPO),该方法通过一个近似Koopman算子的辅助网络,在不改变核心策略或价值函数架构的前提下,学习系统动力学的近似线性潜在空间表示,从而提升策略学习的有效性和稳定性。

链接: https://arxiv.org/abs/2505.14566
作者: Andrei Cozma,Landon Harris,Hairong Qi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for IJCAI 2025. This arXiv submission is the full version of the conference paper, including the appendix and supplementary material omitted from the IJCAI proceedings

点击查看摘要

Abstract:Reinforcement Learning (RL) has made significant strides in various domains, and policy gradient methods like Proximal Policy Optimization (PPO) have gained popularity due to their balance in performance, training stability, and computational efficiency. These methods directly optimize policies through gradient-based updates. However, developing effective control policies for environments with complex and non-linear dynamics remains a challenge. High variance in gradient estimates and non-convex optimization landscapes often lead to unstable learning trajectories. Koopman Operator Theory has emerged as a powerful framework for studying non-linear systems through an infinite-dimensional linear operator that acts on a higher-dimensional space of measurement functions. In contrast with their non-linear counterparts, linear systems are simpler, more predictable, and easier to analyze. In this paper, we present Koopman-Inspired Proximal Policy Optimization (KIPPO), which learns an approximately linear latent-space representation of the underlying system’s dynamics while retaining essential features for effective policy learning. This is achieved through a Koopman-approximation auxiliary network that can be added to the baseline policy optimization algorithms without altering the architecture of the core policy or value function. Extensive experimental results demonstrate consistent improvements over the PPO baseline with 6-60% increased performance while reducing variability by up to 91% when evaluated on various continuous control tasks.
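KIPPO 在不改动 PPO 策略/价值网络的前提下,外挂一个 Koopman 近似辅助网络:把观测编码到潜空间,并要求潜状态满足近似线性动力学 z_{t+1} ≈ K z_t。下面是这一辅助损失的示意(未建模动作条件,辅助损失权重 0.1 为假设,`ppo_loss` 为占位):

```python
import torch
import torch.nn as nn

class KoopmanAux(nn.Module):
    """Koopman 辅助网络示意:编码到潜空间,并用线性算子 K 预测下一步潜状态。"""
    def __init__(self, obs_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                     nn.Linear(64, latent_dim))
        self.K = nn.Linear(latent_dim, latent_dim, bias=False)  # 近似线性动力学

    def loss(self, s_t: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        z_t, z_next = self.encoder(s_t), self.encoder(s_next)
        # 潜空间线性预测误差:鼓励 z_{t+1} ≈ K z_t
        return torch.mean((self.K(z_t) - z_next) ** 2)

aux = KoopmanAux(obs_dim=4)
s_t, s_next = torch.randn(128, 4), torch.randn(128, 4)
ppo_loss = torch.tensor(0.0)                     # 占位:PPO 主损失由现有实现给出
total = ppo_loss + 0.1 * aux.loss(s_t, s_next)   # 0.1 为假设的辅助损失权重
total.backward()
```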
zh

[AI-6] Bellm an operator convergence enhancements in reinforcement learning algorithms

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)中算法收敛性和效率提升的问题,其核心在于通过拓扑学的数学基础来深化对状态空间、动作空间和策略空间结构的理解。解决方案的关键在于利用巴拿赫压缩映射原理(Banach contraction principle),通过巴拿赫空间上的贝尔曼算子(Bellman operators)解释RL算法的收敛性,并探讨替代形式的贝尔曼算子对标准环境(如MountainCar、CartPole和Acrobot)中收敛速度和性能的影响,从而为RL算法设计提供理论支持与优化路径。

链接: https://arxiv.org/abs/2505.14564
作者: David Krame Kadurha,Domini Jocema Leko Moutouo,Yae Ulrich Gaba
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper reviews the topological groundwork for the study of reinforcement learning (RL) by focusing on the structure of state, action, and policy spaces. We begin by recalling key mathematical concepts such as complete metric spaces, which form the foundation for expressing RL problems. By leveraging the Banach contraction principle, we illustrate how the Banach fixed-point theorem explains the convergence of RL algorithms and how Bellman operators, expressed as operators on Banach spaces, ensure this convergence. The work serves as a bridge between theoretical mathematics and practical algorithm design, offering new approaches to enhance the efficiency of RL. In particular, we investigate alternative formulations of Bellman operators and demonstrate their impact on improving convergence rates and performance in standard RL environments such as MountainCar, CartPole, and Acrobot. Our findings highlight how a deeper mathematical understanding of RL can lead to more effective algorithms for decision-making problems.
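摘要的核心事实是:贝尔曼最优算子在 sup 范数下是系数为 γ 的压缩映射,巴拿赫不动点定理由此保证值迭代收敛到唯一不动点。下面用一个随机小 MDP 演示这一收敛过程:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """贝尔曼最优算子的不动点迭代。P: (A, S, S) 转移概率;R: (S, A) 奖励。
    由 ||TV - TU||_∞ <= γ ||V - U||_∞ 保证收敛。"""
    V = np.zeros(R.shape[0])
    while True:
        # (T V)(s) = max_a [ R(s,a) + γ Σ_s' P(s'|s,a) V(s') ]
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # ||TV - V||_∞ 足够小即停
            return V_new
        V = V_new

# 随机生成一个小 MDP 验证收敛
rng = np.random.default_rng(0)
A, S = 2, 5
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)  # 行归一化为转移概率
R = rng.random((S, A))
print(value_iteration(P, R))
```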
zh

[AI-7] Physics-Guided Learning of Meteorological Dynamics for Weather Downscaling and Forecasting KDD2025

【速读】:该论文旨在解决传统数值天气预报(Numerical Weather Prediction, NWP)方法在计算效率和物理完整性上的不足,以及深度学习(Deep Learning, DL)模型在物理规律遵循方面的局限性。其解决方案的关键在于提出PhyDL-NWP框架,该框架将物理方程与潜在力参数化整合到数据驱动模型中,通过自动微分计算物理项,并利用物理信息损失函数使预测结果与控制动力学一致,从而实现无分辨率限制的降尺度和高效推理。

链接: https://arxiv.org/abs/2505.14555
作者: Yingtao Luo,Shikai Fang,Binqing Wu,Qingsong Wen,Liang Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published/Accepted in KDD 2025 (February Cycle)

点击查看摘要

Abstract:Weather forecasting is essential but remains computationally intensive and physically incomplete in traditional numerical weather prediction (NWP) methods. Deep learning (DL) models offer efficiency and accuracy but often ignore physical laws, limiting interpretability and generalization. We propose PhyDL-NWP, a physics-guided deep learning framework that integrates physical equations with latent force parameterization into data-driven models. It predicts weather variables from arbitrary spatiotemporal coordinates, computes physical terms via automatic differentiation, and uses a physics-informed loss to align predictions with governing dynamics. PhyDL-NWP enables resolution-free downscaling by modeling weather as a continuous function and fine-tunes pre-trained models with minimal overhead, achieving up to 170x faster inference with only 55K parameters. Experiments show that PhyDL-NWP improves both forecasting performance and physical consistency.
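PhyDL-NWP 把天气建模为时空坐标的连续函数,用自动微分计算物理项并以物理残差作为损失。下面以一维平流方程 u_t + c·u_x = 0 为例,演示用 torch.autograd 计算 PDE 残差的做法(方程选择与平流速度 c 仅为示意假设,论文中的物理项与潜在力由其框架参数化给出):

```python
import torch

def physics_residual(model, coords):
    """自动微分计算物理残差的示意:以 u_t + c·u_x = 0 为例。
    model: 以时空坐标 (x, t) 为输入、输出变量 u 的连续函数。"""
    coords = coords.clone().requires_grad_(True)
    u = model(coords)
    grads = torch.autograd.grad(u.sum(), coords, create_graph=True)[0]
    u_x, u_t = grads[:, 0], grads[:, 1]  # 第 0 列为 x,第 1 列为 t
    c = 1.0  # 假设:平流速度;论文中此类系数可由潜在力参数化学习
    return torch.mean((u_t + c * u_x) ** 2)

model = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 1))
xt = torch.rand(256, 2)                 # 任意时空采样点(无分辨率限制)
data_loss = torch.tensor(0.0)           # 省略:与观测数据的拟合项
loss = data_loss + physics_residual(model, xt)
loss.backward()
```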
zh

[AI-8] rustworthy Reputation Games and Applications to Proof-of-Reputation Blockchains

【速读】:该论文试图解决去中心化区块链系统中声誉系统易被操纵且缺乏经济鲁棒性的问题,其核心在于设计一种称为"可信声誉博弈"的模型,以确保用户在报告服务器可信度时采取诚实策略,并通过博弈论方法实现对服务器相对可信度的准确估计。解决方案的关键在于将PageRank算法与可信度发现问题相结合,构建一个满足ϵ-最优反应和ϵ-纳什均衡性质的机制,从而提升系统的可靠性和安全性。

链接: https://arxiv.org/abs/2505.14551
作者: Petros Drineas,Rohit Nema,Rafail Ostrovsky,Vassilis Zikas
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Reputation systems play an essential role in the Internet era, as they enable people to decide whom to trust, by collecting and aggregating data about users' behavior. Recently, several works proposed the use of reputation for the design and scalability improvement of decentralized (blockchain) ledgers; however, such systems are prone to manipulation and to our knowledge no game-theoretic treatment exists that can support their economic robustness. In this work we put forth a new model for the design of what we call trustworthy reputation systems. Concretely, we describe a class of games, which we term trustworthy reputation games, that enable a set of users to report a function of their beliefs about the trustworthiness of each server in a set – i.e., their estimate of the probability that this server will behave according to its specified strategy – in a way that satisfies the following properties: 1. It is (ϵ-)best response for any rational user in the game to play a prescribed (truthful) strategy according to their true belief. 2. Assuming that the users' beliefs are not too far from the true trustworthiness of the servers, playing the above (ϵ-)Nash equilibrium allows anyone who observes the users' strategies to estimate the relative trustworthiness of any two servers. Our utilities and decoding function build on a connection between the well known PageRank algorithm and the problem of trustworthiness discovery, which can be of independent interest. Finally, we show how the above games are motivated by and can be leveraged in proof-of-reputation (PoR) blockchains.
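摘要提到其效用与解码函数建立在 PageRank 与可信度发现问题的联系之上。下面给出幂迭代版 PageRank 的自包含示意,把用户的信任报告组织成加权图并求稳态得分(矩阵数值与阻尼系数均为假设,仅用于说明解码思路):

```python
import numpy as np

def pagerank(W: np.ndarray, damping: float = 0.85, tol: float = 1e-10):
    """幂迭代求 PageRank:W[i, j] 表示节点 i 对节点 j 的信任报告强度。"""
    n = W.shape[0]
    # 行归一化为随机矩阵;全零行均匀分配
    row_sums = W.sum(axis=1, keepdims=True)
    P = np.where(row_sums > 0, W / np.maximum(row_sums, 1e-12), 1.0 / n)
    r = np.full(n, 1.0 / n)
    while True:
        r_new = (1 - damping) / n + damping * r @ P
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

# 三个节点之间的信任报告(数值为假设)
W = np.array([[0.0, 0.7, 0.3],
              [0.1, 0.8, 0.1],
              [0.5, 0.5, 0.0]])
print(pagerank(W))  # 相对得分可用于比较任意两台服务器的可信度
```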
zh
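
上文摘要指出,其效用函数与解码函数建立在 PageRank 算法与"可信度发现"问题之间的联系上。下面给出一个极简的 PageRank 幂迭代示意(信任矩阵取值与阻尼系数均为假设,仅用于说明这一联系,并非论文的官方实现):

```python
import numpy as np

# 极简示意:用 PageRank 幂迭代从用户报告构成的"信任矩阵"中
# 估计服务器的相对可信度。矩阵取值与阻尼系数均为假设。

def pagerank(T, damping=0.85, tol=1e-10, max_iter=1000):
    """T: 列随机矩阵,T[i, j] 表示报告中 j 对 i 的信任权重。"""
    n = T.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = damping * T @ r + (1 - damping) / n
        if np.linalg.norm(r_new - r, 1) < tol:
            break
        r = r_new
    return r_new

# 假设 3 台服务器,聚合后的(列归一化)信任报告矩阵:
T = np.array([[0.1, 0.6, 0.5],
              [0.5, 0.2, 0.4],
              [0.4, 0.2, 0.1]])
scores = pagerank(T)
print(scores / scores.max())  # 任意两台服务器的相对可信度
```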

[AI-9] Can Large Language Models Really Recognize Your Name?

【速读】:该论文试图解决当前基于大型语言模型(Large Language Models, LLMs)的隐私保护方案中存在的一项关键问题,即LLMs在检测敏感用户数据(特别是个人身份信息,PII)时存在系统性失效,尤其是在处理语境模糊的人名时容易出现遗漏或误判。论文提出的解决方案的关键在于构建AMBENCH基准数据集,该数据集通过利用名称规律性偏差现象,在简短文本片段中嵌入看似模糊的人名以及无害的提示注入,以评估LLMs在隐私保护任务中的表现。实验结果表明,与更易识别的名称相比,模糊人名的召回率下降了20%-40%,且在存在提示注入的情况下,模糊人名被LLMs生成的隐私保护摘要忽略的可能性是可识别名称的四倍。这揭示了仅依赖LLMs进行隐私保护的风险,并强调了对其隐私失败模式进行系统研究的必要性。

链接: https://arxiv.org/abs/2505.14549
作者: Dzung Pham,Peter Kairouz,Niloofar Mireshghallah,Eugene Bagdasarian,Chau Minh Pham,Amir Houmansadr
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly being used to protect sensitive user data. However, current LLM-based privacy solutions assume that these models can reliably detect personally identifiable information (PII), particularly named entities. In this paper, we challenge that assumption by revealing systematic failures in LLM-based privacy tasks. Specifically, we show that modern LLMs regularly overlook human names even in short text snippets due to ambiguous contexts, which cause the names to be misinterpreted or mishandled. We propose AMBENCH, a benchmark dataset of seemingly ambiguous human names, leveraging the name regularity bias phenomenon, embedded within concise text snippets along with benign prompt injections. Our experiments on modern LLMs tasked to detect PII as well as specialized tools show that recall of ambiguous names drops by 20–40% compared to more recognizable names. Furthermore, ambiguous human names are four times more likely to be ignored in supposedly privacy-preserving summaries generated by LLMs when benign prompt injections are present. These findings highlight the underexplored risks of relying solely on LLMs to safeguard user privacy and underscore the need for a more systematic investigation into their privacy failure modes.
zh

[AI-10] Multi-agent Reinforcement Learning vs. Fixed-Time Control for Traffic Signal Optimization: A Simulation Study

【速读】:该论文旨在解决城市交通拥堵问题,特别是交叉口处的动态交通模式难以被传统固定时序信号控制系统有效管理的问题。其解决方案的关键在于采用多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)来优化多个交叉口之间的信号协调控制,通过构建一个去中心化的MARL控制器,使每个交通信号作为自主智能体,基于局部观测和相邻智能体的信息进行决策,从而提升交通流的整体效率。

链接: https://arxiv.org/abs/2505.14544
作者: Saahil Mahato
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Urban traffic congestion, particularly at intersections, significantly impacts travel time, fuel consumption, and emissions. Traditional fixed-time signal control systems often lack the adaptability to manage dynamic traffic patterns effectively. This study explores the application of multi-agent reinforcement learning (MARL) to optimize traffic signal coordination across multiple intersections within a simulated environment. Utilizing Pygame, a simulation was developed to model a network of interconnected intersections with randomly generated vehicle flows to reflect realistic traffic variability. A decentralized MARL controller was implemented, in which each traffic signal operates as an autonomous agent, making decisions based on local observations and information from neighboring agents. Performance was evaluated against a baseline fixed-time controller using metrics such as average vehicle wait time and overall throughput. The MARL approach demonstrated statistically significant improvements, including reduced average waiting times and improved throughput. These findings suggest that MARL-based dynamic control strategies hold substantial promise for improving urban traffic management efficiency. More research is recommended to address scalability and real-world implementation challenges.
zh
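
为说明"每个信号灯作为自主智能体、基于局部观测独立决策"的去中心化思路,下面给出一个单交叉口 Q-learning 智能体的极简示意(状态离散化与奖励定义均为假设的简化,论文实际使用 Pygame 仿真与更完整的 MARL 设定):

```python
import numpy as np

# 极简示意:单个交叉口的 Q-learning 信号灯智能体。
# 状态可理解为本地排队长度分桶(+邻居相位),动作为保持/切换相位。

class SignalAgent:
    def __init__(self, n_states, n_actions=2, lr=0.1, gamma=0.95, eps=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.lr, self.gamma, self.eps = lr, gamma, eps

    def act(self, s):
        if np.random.random() < self.eps:        # epsilon-贪婪探索
            return np.random.randint(self.Q.shape[1])
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, r, s_next):
        td = r + self.gamma * self.Q[s_next].max() - self.Q[s, a]
        self.Q[s, a] += self.lr * td

agent = SignalAgent(n_states=16)
s = 3
for _ in range(100):                              # 与(此处省略的)仿真环境交互
    a = agent.act(s)
    r = -np.random.poisson(2 if a == 0 else 3)    # 假设:负的等待车辆数作奖励
    s_next = np.random.randint(16)
    agent.update(s, a, r, s_next)
    s = s_next
print(agent.Q[3])
```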

[AI-11] A Logic of General Attention Using Edge-Conditioned Event Models (Extended Version)

【速读】:该论文试图解决现有动态认识论逻辑在建模复杂注意力场景时的局限性,特别是无法处理对非原子公式(如逻辑结构命题、高阶信念或他者注意力)的关注问题。其解决方案的关键在于引入一种更通用的注意力逻辑,通过广义的边条件事件模型实现表达能力与简洁性的平衡,并将注意力扩展至任意公式,使代理能够关注其他代理的信念或注意力。该工作将注意力视为一种模态,类似于信念或意识,并提出了约束该模态的闭包性质的注意力原则,从而为注意力的公理化提供了基础。

链接: https://arxiv.org/abs/2505.14539
作者: Gaia Belardinelli,Thomas Bolander,Sebastian Watzl
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we present the first general logic of attention. Attention is a powerful cognitive ability that allows agents to focus on potentially complex information, such as logically structured propositions, higher-order beliefs, or what other agents pay attention to. This ability is a strength, as it helps to ignore what is irrelevant, but it can also introduce biases when some types of information or agents are systematically ignored. Existing dynamic epistemic logics for attention cannot model such complex attention scenarios, as they only model attention to atomic formulas. Additionally, such logics quickly become cumbersome, as their size grows exponentially in the number of agents and announced literals. Here, we introduce a logic that overcomes both limitations. First, we generalize edge-conditioned event models, which we show to be as expressive as standard event models yet exponentially more succinct (generalizing both standard event models and generalized arrow updates). Second, we extend attention to arbitrary formulas, allowing agents to also attend to other agents’ beliefs or attention. Our work treats attention as a modality, like belief or awareness. We introduce attention principles that impose closure properties on that modality and that can be used in its axiomatization. Throughout, we illustrate our framework with examples of AI agents reasoning about human attentional biases, demonstrating how such agents can discover attentional biases.
zh

[AI-12] Energy-Efficient Deep Reinforcement Learning with Spiking Transformers

【速读】:该论文试图解决传统基于代理的Transformer在强化学习中因高计算复杂度导致的能量消耗大、难以部署于实际自主系统的问题。其解决方案的关键在于提出一种结合脉冲神经网络(SNN)能效优势与强化学习决策能力的新型算法——Spike-Transformer Reinforcement Learning (STRL),通过设计多步漏电积分-发放(LIF)神经元和具备时空模式处理能力的注意力机制,构建类似Transformer的结构,并引入状态、动作和奖励编码以优化强化学习任务,从而在保持高性能的同时显著提升能效。

链接: https://arxiv.org/abs/2505.14533
作者: Mohammad Irfan Uddin,Nishad Tasnim,Md Omor Faruk,Zejian Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agent-based Transformers have been widely adopted in recent reinforcement learning advances due to their demonstrated ability to solve complex tasks. However, the high computational complexity of Transformers often results in significant energy consumption, limiting their deployment in real-world autonomous systems. Spiking neural networks (SNNs), with their biologically inspired structure, offer an energy-efficient alternative for machine learning. In this paper, a novel Spike-Transformer Reinforcement Learning (STRL) algorithm that combines the energy efficiency of SNNs with the powerful decision-making capabilities of reinforcement learning is developed. Specifically, an SNN using multi-step Leaky Integrate-and-Fire (LIF) neurons and attention mechanisms capable of processing spatio-temporal patterns over multiple time steps is designed. The architecture is further enhanced with state, action, and reward encodings to create a Transformer-like structure optimized for reinforcement learning tasks. Comprehensive numerical experiments conducted on state-of-the-art benchmarks demonstrate that the proposed SNN Transformer achieves significantly improved policy performance compared to conventional agent-based Transformers. With both enhanced energy efficiency and policy optimality, this work highlights a promising direction for deploying bio-inspired, low-cost machine learning models in complex real-world decision-making scenarios.
zh
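
摘要中的多步漏电积分-发放(LIF)神经元是 STRL 的基本计算单元。下面是其前向过程的极简示意(泄漏系数与阈值均为假设的超参数;论文在此类神经元之上进一步构建注意力与 Transformer 式结构):

```python
import torch

# 极简示意:多步 LIF 神经元。膜电位随时间泄漏并累积输入,
# 超过阈值则发放脉冲并(软)复位。

def lif_forward(inputs, beta=0.9, threshold=1.0):
    """inputs: [T, batch, d] 的输入电流序列,返回各时间步的脉冲。"""
    mem = torch.zeros_like(inputs[0])
    spikes = []
    for t in range(inputs.size(0)):
        mem = beta * mem + inputs[t]          # 泄漏 + 积分
        spk = (mem >= threshold).float()      # 发放脉冲(0/1)
        mem = mem - spk * threshold           # 软复位
        spikes.append(spk)
    return torch.stack(spikes)

x = torch.rand(8, 4, 16) * 0.5                # T=8 个时间步的占位输入
print(lif_forward(x).mean())                  # 平均发放率
```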

[AI-13] NavBench: A Unified Robotics Benchmark for Reinforcement Learning-Based Autonomous Navigation

【速读】:该论文旨在解决自主机器人在不同环境(如陆地、水下、空中和太空)中导航与操作时,现有基准测试受限于特定平台、缺乏通用性和公平比较的问题。其解决方案的关键在于提出NavBench,一个跨领域的基准框架,通过标准化任务定义,实现不同机器人平台在多样化环境中的导航策略训练与评估,同时具备统一的跨介质基准测试、可扩展模块化设计以及稳健的仿真到现实验证能力,从而提升强化学习(Reinforcement Learning, RL)导航策略的适应性与可迁移性。

链接: https://arxiv.org/abs/2505.14526
作者: Matteo El-Hariry,Antoine Richard,Ricard M. Castan,Luis F. W. Batista,Matthieu Geist,Cedric Pradalier,Miguel Olivares-Mendez
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Submitted for publication. Under review (2025)

点击查看摘要

Abstract:Autonomous robots must navigate and operate in diverse environments, from terrestrial and aquatic settings to aerial and space domains. While Reinforcement Learning (RL) has shown promise in training policies for specific autonomous robots, existing benchmarks are often constrained to unique platforms, limiting generalization and fair comparisons across different mobility systems. In this paper, we present NavBench, a multi-domain benchmark for training and evaluating RL-based navigation policies across diverse robotic platforms and operational environments. Built on IsaacLab, our framework standardizes task definitions, enabling different robots to tackle various navigation challenges without the need for ad-hoc task redesigns or custom evaluation metrics. Our benchmark addresses three key challenges: (1) Unified cross-medium benchmarking, enabling direct evaluation of diverse actuation methods (thrusters, wheels, water-based propulsion) in realistic environments; (2) Scalable and modular design, facilitating seamless robot-task interchangeability and reproducible training pipelines; and (3) Robust sim-to-real validation, demonstrated through successful policy transfer to multiple real-world robots, including a satellite robotic simulator, an unmanned surface vessel, and a wheeled ground vehicle. By ensuring consistency between simulation and real-world deployment, NavBench simplifies the development of adaptable RL-based navigation strategies. Its modular design allows researchers to easily integrate custom robots and tasks by following the framework’s predefined templates, making it accessible for a wide range of applications. Our code is publicly available at NavBench.
zh

[AI-14] Guarded Query Routing for Large Language Models

【速读】:该论文试图解决的是受保护的查询路由(guarded query routing)问题,即如何将用户查询正确地路由到不同的大型语言模型(LLM)端点,同时处理分布外(out-of-distribution)查询带来的挑战,如跨领域问题、其他语言的查询或包含不安全内容的文本。其解决方案的关键在于引入了一个新的基准测试集——Guarded Query Routing Benchmark (GQR-Bench),该基准覆盖了三个典型目标领域(法律、金融和医疗),并包含七个数据集以评估对分布外查询的鲁棒性。通过对比多种路由机制(包括LLM、基于嵌入的模型和传统机器学习模型),研究揭示了不同方法在准确性和速度之间的权衡,并提出了针对实际应用的具体建议。

链接: https://arxiv.org/abs/2505.14524
作者: Richard Šléher,William Brach,Tibor Sloboda,Kristián Košťál,Lukas Galke
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Query routing, the task to route user queries to different large language model (LLM) endpoints, can be considered as a text classification problem. However, out-of-distribution queries must be handled properly, as those could be questions about unrelated domains, queries in other languages, or even contain unsafe text. Here, we thus study a guarded query routing problem, for which we first introduce the Guarded Query Routing Benchmark (GQR-Bench), which covers three exemplary target domains (law, finance, and healthcare), and seven datasets to test robustness against out-of-distribution queries. We then use GQR-Bench to contrast the effectiveness and efficiency of LLM-based routing mechanisms (GPT-4o-mini, Llama-3.2-3B, and Llama-3.1-8B), standard LLM-based guardrail approaches (LlamaGuard and NVIDIA NeMo Guardrails), continuous bag-of-words classifiers (WideMLP, fastText), and traditional machine learning models (SVM, XGBoost). Our results show that WideMLP, enhanced with out-of-domain detection capabilities, yields the best trade-off between accuracy (88%) and speed (4ms). The embedding-based fastText excels at speed (1ms) with acceptable accuracy (80%), whereas LLMs yield the highest accuracy (91%) but are comparatively slow (62ms for local Llama-3.1:8B and 669ms for remote GPT-4o-mini calls). Our findings challenge the automatic reliance on LLMs for (guarded) query routing and provide concrete recommendations for practical applications. GQR-Bench will be released as a Python package – gqr.
zh
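
"最大类别置信度低于阈值即判为域外"是此类带守卫路由的通用思路。下面用 TF-IDF + 逻辑回归给出一个极简示意(训练语句与阈值均为假设的玩具数据;论文实际比较的是 WideMLP、fastText、LLM 守卫等多种机制):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# 极简示意:带"域外拒绝"能力的查询路由。

train_texts = ["How do I appeal a court ruling?",        # law
               "What is the statute of limitations?",    # law
               "What is the APR on this loan?",          # finance
               "Can I deduct mortgage interest?",        # finance
               "What are the side effects of ibuprofen?",# healthcare
               "Is this rash a sign of an allergy?"]     # healthcare
train_labels = ["law", "law", "finance", "finance",
                "healthcare", "healthcare"]

vec = TfidfVectorizer().fit(train_texts)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(train_texts), train_labels)

def route(query, threshold=0.4):
    proba = clf.predict_proba(vec.transform([query]))[0]
    if proba.max() < threshold:
        return "OUT_OF_DOMAIN"                 # 拒绝:交给守卫/兜底逻辑
    return clf.classes_[np.argmax(proba)]

print(route("How much mortgage interest can I deduct?"))
print(route("¿Dónde está la biblioteca?"))     # 玩具数据,结果仅供示意
```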

[AI-15] Latent Flow Transformer

【速读】:该论文试图解决传统Transformer架构中层数过多导致的效率问题,尤其是与图像生成中基于扩散和流的连续层方法相比。其解决方案的关键在于提出Latent Flow Transformer (LFT),用一个经流匹配(flow matching)训练的可学习传输算子替换一组连续层,在保持与原始架构兼容的同时实现显著压缩。此外,引入Flow Walking (FW)算法以解决现有流方法在保持耦合(coupling)方面的局限性。

链接: https://arxiv.org/abs/2505.14513
作者: Yen-Chen Wu,Feng-Ting Liao,Meng-Hsi Chen,Pei-Chen Ho,Farhang Nabiei,Da-shan Shiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformers, the standard implementation for large language models (LLMs), typically consist of tens to hundreds of discrete layers. While more layers can lead to better performance, this approach has been challenged as far from efficient, especially given the superiority of continuous layers demonstrated by diffusion and flow-based models for image generation. We propose the Latent Flow Transformer (LFT), which replaces a block of layers with a single learned transport operator trained via flow matching, offering significant compression while maintaining compatibility with the original architecture. Additionally, we address the limitations of existing flow-based methods in preserving coupling by introducing the Flow Walking (FW) algorithm. On the Pythia-410M model, LFT trained with flow matching compresses 6 of 24 layers and outperforms directly skipping 2 layers (KL Divergence of LM logits at 0.407 vs. 0.529), demonstrating the feasibility of this design. When trained with FW, LFT further distills 12 layers into one while reducing the KL to 0.736, surpassing that from skipping 3 layers (0.932), significantly narrowing the gap between autoregressive and flow-based generation paradigms.
zh
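
摘要中"经流匹配训练的传输算子"可以用标准的条件流匹配损失来理解:在第 k 层激活 x0 与第 k+m 层激活 x1 的线性插值路径上回归速度场。下面是一个极简示意(网络结构与维度均为假设,并非 LFT 的官方实现):

```python
import torch
import torch.nn as nn

# 极简示意:线性插值路径上的流匹配训练,
# 学习把 x0(浅层激活)"搬运"到 x1(深层激活)的速度场。

d = 512
velocity = nn.Sequential(nn.Linear(d + 1, 1024), nn.SiLU(), nn.Linear(1024, d))
opt = torch.optim.Adam(velocity.parameters(), lr=1e-4)

def flow_matching_step(x0, x1):
    """x0/x1: [batch, d],同一 token 在两层的激活(假设已配对)。"""
    t = torch.rand(x0.size(0), 1)               # 随机时间点
    xt = (1 - t) * x0 + t * x1                  # 线性插值路径
    target = x1 - x0                            # 该路径下的真实速度场
    pred = velocity(torch.cat([xt, t], dim=-1))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x0, x1 = torch.randn(32, d), torch.randn(32, d)  # 占位数据
print(flow_matching_step(x0, x1))
```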

[AI-16] BACON: A fully explainable AI model with graded logic for decision making problems

【速读】:该论文旨在解决在高风险现实应用场景中,机器学习模型和自主代理缺乏透明性和可信解释的问题。为实现AI决策的端到端透明性,需要模型不仅具备高准确性,还需具有完全可解释性和人类可调节性。论文提出的解决方案是BACON框架,其关键在于利用分级逻辑自动训练可解释的AI模型,从而在保持高预测准确性的同时,提供完整的结构透明性和基于逻辑的符号化解释,支持有效的人机协作与专家引导的优化。

链接: https://arxiv.org/abs/2505.14510
作者: Haishi Bai,Jozo Dujmovic,Jianwu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As machine learning models and autonomous agents are increasingly deployed in high-stakes, real-world domains such as healthcare, security, finance, and robotics, the need for transparent and trustworthy explanations has become critical. To ensure end-to-end transparency of AI decisions, we need models that are not only accurate but also fully explainable and human-tunable. We introduce BACON, a novel framework for automatically training explainable AI models for decision making problems using graded logic. BACON achieves high predictive accuracy while offering full structural transparency and precise, logic-based symbolic explanations, enabling effective human-AI collaboration and expert-guided refinement. We evaluate BACON with a diverse set of scenarios: classic Boolean approximation, Iris flower classification, house purchasing decisions and breast cancer diagnosis. In each case, BACON provides high-performance models while producing compact, human-verifiable decision logic. These results demonstrate BACON’s potential as a practical and principled approach for delivering crisp, trustworthy explainable AI.
zh

[AI-17] How Managers Perceive AI-Assisted Conversational Training for Workplace Communication

【速读】:该论文试图解决管理者在职场沟通技能提升方面缺乏定制化和持续性培训的问题(effective workplace communication training for managers)。其解决方案的关键在于设计一种基于对话的角色扮演系统——CommCoach,作为功能探针,以理解管理者如何预期使用AI来练习沟通技能。研究强调了自适应、低风险模拟在练习复杂职场对话中的价值,并指出人机协作、透明且情境感知的反馈以及对AI生成角色的控制是实现有效AI辅助沟通训练的重要因素。

链接: https://arxiv.org/abs/2505.14452
作者: Lance T Wilhelm,Xiaohan Ding,Kirk McInnis Knutsen,Buse Carik,Eugenia H Rho
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: accepted to CUI '25

点击查看摘要

Abstract:Effective workplace communication is essential for managerial success, yet many managers lack access to tailored and sustained training. Although AI-assisted communication systems may offer scalable training solutions, little is known about how managers envision the role of AI in helping them improve their communication skills. To investigate this, we designed a conversational role-play system, CommCoach, as a functional probe to understand how managers anticipate using AI to practice their communication skills. Through semi-structured interviews, participants emphasized the value of adaptive, low-risk simulations for practicing difficult workplace conversations. They also highlighted opportunities, including human-AI teaming, transparent and context-aware feedback, and greater control over AI-generated personas. AI-assisted communication training should balance personalization, structured learning objectives, and adaptability to different user styles and contexts. However, achieving this requires carefully navigating tensions between adaptive and consistent AI feedback, realism and potential bias, and the open-ended nature of AI conversations versus structured workplace discourse.
zh

[AI-18] RefiDiff: Refinement-Aware Diffusion for Efficient Missing Data Imputation

【速读】:该论文试图解决高维混合类型数据集中缺失值(missing values)在非随机缺失(Missing Not At Random, MNAR)机制下的数据插补问题,现有方法难以有效整合局部与全局数据特征,从而限制了在MNAR和高维场景下的性能。其解决方案的关键在于提出一种创新框架RefiDiff,该框架结合局部机器学习预测与基于Mamba的去噪网络,以捕捉远距离特征和样本之间的相互关系,同时通过预精炼和后精炼步骤提升结果的稳定性和准确性。

链接: https://arxiv.org/abs/2505.14451
作者: Md Atik Ahamed,Qiang Ye,Qiang Cheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Missing values in high-dimensional, mixed-type datasets pose significant challenges for data imputation, particularly under Missing Not At Random (MNAR) mechanisms. Existing methods struggle to integrate local and global data characteristics, limiting performance in MNAR and high-dimensional settings. We propose an innovative framework, RefiDiff, combining local machine learning predictions with a novel Mamba-based denoising network capturing interrelationships among distant features and samples. Our approach leverages pre-refinement for initial warm-up imputations and post-refinement to polish results, enhancing stability and accuracy. By encoding mixed-type data into unified tokens, RefiDiff enables robust imputation without architectural or hyperparameter tuning. RefiDiff outperforms state-of-the-art (SOTA) methods across missing-value settings, excelling in MNAR with a 4x faster training time than SOTA DDPM-based approaches. Extensive evaluations on nine real-world datasets demonstrate its robustness, scalability, and effectiveness in handling complex missingness patterns.
zh

[AI-19] Choosing a Model Shaping a Future: Comparing LLM Perspectives on Sustainability and its Relationship with AI

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在可持续性情境下的决策支持中可能存在的内在偏见和视角差异问题。研究的关键在于通过标准化的心理测量学可持续性问卷对五种先进的LLMs——Claude、DeepSeek、GPT、LLaMA和Mistral进行系统评估,以捕捉其响应模式和变异性,从而揭示不同模型在可持续性概念及其与人工智能关系上的认知差异。研究结果表明,模型选择会显著影响组织的可持续性战略,强调了在部署LLMs进行可持续性相关决策时需关注模型特定偏见的重要性。

链接: https://arxiv.org/abs/2505.14435
作者: Annika Bush,Meltem Aksoy,Markus Pauly,Greta Ontrup
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As organizations increasingly rely on AI systems for decision support in sustainability contexts, it becomes critical to understand the inherent biases and perspectives embedded in Large Language Models (LLMs). This study systematically investigates how five state-of-the-art LLMs – Claude, DeepSeek, GPT, LLaMA, and Mistral – conceptualize sustainability and its relationship with AI. We administered validated, psychometric sustainability-related questionnaires, each 100 times per model, to capture response patterns and variability. Our findings revealed significant inter-model differences: For example, GPT exhibited skepticism about the compatibility of AI and sustainability, whereas LLaMA demonstrated extreme techno-optimism with perfect scores for several Sustainable Development Goals (SDGs). Models also diverged in attributing institutional responsibility for AI and sustainability integration, a result that holds implications for technology governance approaches. Our results demonstrate that model selection could substantially influence organizational sustainability strategies, highlighting the need for awareness of model-specific biases when deploying LLMs for sustainability-related decision-making.
zh

[AI-20] Interpretable Neural System Dynamics: Combining Deep Learning with System Dynamics Modeling to Support Critical Applications

【速读】:该论文试图解决深度学习(Deep Learning, DL)在可解释性和因果可靠性方面的不足,以及传统系统动力学(System Dynamics, SD)在可扩展性和领域知识依赖性上的局限。解决方案的关键在于提出一种可解释的神经系统动力学框架,通过整合基于概念的可解释性、机制可解释性和因果机器学习,将DL的预测能力与传统SD模型的可解释性相结合,从而实现因果可靠性和可扩展性的统一。

链接: https://arxiv.org/abs/2505.14428
作者: Riccardo D’Elia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: To be submitted to this http URL for publication in the Doctoral Consortium Proceedings of XAI 2025, The World Conference on Explainable Artificial Intelligence

点击查看摘要

Abstract:The objective of this proposal is to bridge the gap between Deep Learning (DL) and System Dynamics (SD) by developing an interpretable neural system dynamics framework. While DL excels at learning complex models and making accurate predictions, it lacks interpretability and causal reliability. Traditional SD approaches, on the other hand, provide transparency and causal insights but are limited in scalability and require extensive domain knowledge. To overcome these limitations, this project introduces a Neural System Dynamics pipeline, integrating Concept-Based Interpretability, Mechanistic Interpretability, and Causal Machine Learning. This framework combines the predictive power of DL with the interpretability of traditional SD models, resulting in both causal reliability and scalability. The efficacy of the proposed pipeline will be validated through real-world applications of the EU-funded AutoMoTIF project, which is focused on autonomous multimodal transportation systems. The long-term goal is to collect actionable insights that support the integration of explainability and safety in autonomous systems.
zh

[AI-21] SCOPE: Compress Mathematical Reasoning Steps for Efficient Automated Process Annotation

【速读】:该论文试图解决过程奖励模型(Process Reward Models, PRMs)在数学推理任务中依赖于计算成本高昂的流程标注方法的问题。现有方法,无论是通过人工标注还是蒙特卡洛模拟,都存在较高的计算开销。解决方案的关键在于提出一种基于压缩的新型方法——Step COmpression for Process Estimation (SCOPE),该方法通过将自然语言推理步骤转换为代码并利用抽象语法树进行归一化,合并等效步骤以构建前缀树,从而显著降低标注成本。与基于模拟的方法不同,SCOPE利用前缀树结构,每个根到叶的路径作为训练样本,将复杂度从 O(NMK) 降低到 O(N),并在仅使用先前方法5%计算资源的情况下构建了一个包含196K样本的大规模数据集。

链接: https://arxiv.org/abs/2505.14419
作者: Huimin Xu,Xin Mao,Feng-Lin Li,Xiaobao Wu,Wang Chen,Wei Zhang,Anh Tuan Luu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Process Reward Models (PRMs) have demonstrated promising results in mathematical reasoning, but existing process annotation approaches, whether through human annotations or Monte Carlo simulations, remain computationally expensive. In this paper, we introduce Step COmpression for Process Estimation (SCOPE), a novel compression-based approach that significantly reduces annotation costs. We first translate natural language reasoning steps into code and normalize them through Abstract Syntax Tree, then merge equivalent steps to construct a prefix tree. Unlike simulation-based methods that waste numerous samples on estimation, SCOPE leverages a compression-based prefix tree where each root-to-leaf path serves as a training sample, reducing the complexity from O(NMK) to O(N). We construct a large-scale dataset containing 196K samples with only 5% of the computational resources required by previous methods. Empirical results demonstrate that PRMs trained on our dataset consistently outperform existing automated annotation approaches on both Best-of-N strategy and ProcessBench.
zh
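
"等价步骤合并进前缀树、每条根到叶路径即一个训练样本"是 SCOPE 将复杂度从 O(NMK) 降到 O(N) 的核心。下面用字符串代替代码化与 AST 归一化后的步骤,给出一个极简示意(属于为演示所作的简化假设):

```python
# 极简示意:把归一化后的推理步骤序列合并进前缀树。

class TrieNode:
    def __init__(self):
        self.children = {}   # 步骤(归一化后) -> 子节点
        self.count = 0       # 经过该节点的解答数,可用于估计步骤价值

def insert(root, steps):
    node = root
    for step in steps:
        node = node.children.setdefault(step, TrieNode())
        node.count += 1

def paths(node, prefix=()):
    if not node.children:            # 叶子:一条完整训练样本
        yield prefix
    for step, child in node.children.items():
        yield from paths(child, prefix + (step,))

root = TrieNode()
insert(root, ("a=3+4", "b=a*2", "ans=b-1"))
insert(root, ("a=3+4", "b=a*2", "ans=b+1"))   # 前两步与上一条等价,被合并
insert(root, ("a=7",   "ans=a*2-1"))
print(list(paths(root)))   # 3 条根到叶路径 = 3 个训练样本
```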

[AI-22] Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning

【速读】:该论文试图解决长链思维(long CoT)模型在训练过程中由于计算成本高而导致的固定训练数据集利用率低的问题。现有方法要么完全丢弃负面样本(RFT),要么对所有标记应用相同的惩罚(RL),未能有效利用负面样本中包含的自我反思和错误修正等潜在学习信号。解决方案的关键在于提出一种细粒度的离线强化学习框架——带负面样本增强的行为约束策略梯度(BCPG-NSA),其核心包括三个阶段:样本分割、基于共识的步骤正确性评估以及结合负面样本增强的策略优化,从而有效挖掘负面样本中的正向步骤。

链接: https://arxiv.org/abs/2505.14403
作者: Zhaohui Yang,Shilei Jiang,Chen Hu,Linjing Li,Shihong Deng,Daxin Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in reasoning language models have witnessed a paradigm shift from short to long CoT pattern. Given the substantial computational cost of rollouts in long CoT models, maximizing the utility of fixed training datasets becomes crucial. Our analysis reveals that negative responses contain valuable components such as self-reflection and error-correction steps, yet primary existing methods either completely discard negative samples (RFT) or apply equal penalization across all tokens (RL), failing to leverage these potential learning signals. In light of this, we propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline RL framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset, achieving improved sample efficiency and demonstrating robustness and scalability when extended to multiple iterations.
zh

[AI-23] Knowledge Graph Based Repository-Level Code Generation

【速读】:该论文试图解决在代码生成过程中,大型语言模型(Large Language Models, LLMs)由于缺乏上下文准确性而导致的代码质量不佳问题,特别是在动态演化的代码库中。现有代码搜索与检索方法在检索结果的质量和上下文相关性方面存在不足,进而影响了代码生成的效果。论文提出的解决方案的关键在于采用基于知识图谱的方法,将代码仓库表示为图结构,以捕捉代码的结构和关系信息,从而提升上下文感知的代码生成能力。该框架通过混合检索策略增强上下文相关性,追踪文件间的模块依赖,生成更稳健且与现有代码库一致的代码。

链接: https://arxiv.org/abs/2505.14394
作者: Mihir Athale,Vishal Vaddina
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have transformed code generation from natural language queries. However, despite their extensive knowledge and ability to produce high-quality code, LLMs often struggle with contextual accuracy, particularly in evolving codebases. Current code search and retrieval methods frequently lack robustness in both the quality and contextual relevance of retrieved results, leading to suboptimal code generation. This paper introduces a novel knowledge graph-based approach to improve code search and retrieval leading to better quality of code generation in the context of repository-level tasks. The proposed approach represents code repositories as graphs, capturing structural and relational information for enhanced context-aware code generation. Our framework employs a hybrid approach for code retrieval to improve contextual relevance, track inter-file modular dependencies, generate more robust code and ensure consistency with the existing codebase. We benchmark the proposed approach on the Evolutionary Code Benchmark (EvoCodeBench) dataset, a repository-level code generation benchmark, and demonstrate that our method significantly outperforms the baseline approach. These findings suggest that knowledge graph based code generation could advance robust, context-sensitive coding assistance tools.
zh

[AI-24] Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning

【速读】:该论文试图解决当前数据标注方法在长链式思维(Long CoT)推理过程中存在的问题,即现有方法倾向于仅关注第一个错误步骤及其之前的所有步骤,而忽略后续可能存在的正确推理步骤。解决方案的关键在于提出一种新的数据标注方法,专门用于对长CoT推理过程进行评分,并引入“误差传播”和“误差终止”概念,以增强模型识别有效自我修正行为及基于错误步骤的推理能力。

链接: https://arxiv.org/abs/2505.14391
作者: Zhaohui Yang,Chenghua He,Xiaowen Shi,Linjing Li,Qiyue Yin,Shihong Deng,Daxin Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many studies focus on data annotation techniques for training effective PRMs. However, current methods encounter a significant issue when applied to long CoT reasoning processes: they tend to focus solely on the first incorrect step and all preceding steps, assuming that all subsequent steps are incorrect. These methods overlook the unique self-correction and reflection mechanisms inherent in long CoT, where correct reasoning steps may still occur after initial reasoning mistakes. To address this issue, we propose a novel data annotation method for PRMs specifically designed to score the long CoT reasoning process. Given that under the reflection pattern, correct and incorrect steps often alternate, we introduce the concepts of Error Propagation and Error Cessation, enhancing PRMs’ ability to identify both effective self-correction behaviors and reasoning based on erroneous steps. Leveraging an LLM-based judger for annotation, we collect 1.7 million data samples to train a 7B PRM and evaluate it at both solution and step levels. Experimental results demonstrate that compared to existing open-source PRMs and PRMs trained on open-source datasets, our PRM achieves superior performance across various metrics, including search guidance, BoN, and F1 scores. Compared to widely used MC-based annotation methods, our annotation approach not only achieves higher data efficiency but also delivers superior performance. Detailed analysis is also conducted to demonstrate the stability and generalizability of our method.
zh

[AI-25] SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation

【速读】:该论文旨在解决在处理视觉丰富的文档时,如何提升检索增强生成(Retrieval-Augmented Generation, RAG)系统的性能问题,尤其是在面对单页包含大量信息的复杂文档时。其解决方案的关键在于提出SCAN(SemantiC Document Layout ANalysis),这是一种面向视觉语言模型(VLM)的文档布局分析方法,能够以适当的语义粒度识别文档组件,在保持上下文完整性的同时提高处理效率。SCAN通过粗粒度语义划分将文档分为连贯区域,结合精细化标注数据集对目标检测模型进行微调,从而显著提升文本和视觉RAG的端到端性能。

链接: https://arxiv.org/abs/2505.14381
作者: Yuyang Dong,Nobuhiro Ueda,Krisztián Boros,Daiki Ito,Takuya Sera,Masafumi Oyamada
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: v1

点击查看摘要

Abstract:With the increasing adoption of Large Language Models (LLMs) and Vision-Language Models (VLMs), rich document analysis technologies for applications like Retrieval-Augmented Generation (RAG) and visual RAG are gaining significant attention. Recent research indicates that using VLMs can achieve better RAG performance, but processing rich documents still remains a challenge since a single page contains large amounts of information. In this paper, we present SCAN (SemantiC Document Layout ANalysis), a novel approach enhancing both textual and visual Retrieval-Augmented Generation (RAG) systems working with visually rich documents. It is a VLM-friendly approach that identifies document components with appropriate semantic granularity, balancing context preservation with processing efficiency. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering continuous components. We trained the SCAN model by fine-tuning object detection models with sophisticated annotation datasets. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.0% and visual RAG performance by up to 6.4%, outperforming conventional approaches and even commercial document processing solutions.
zh

[AI-26] When Bias Backfires: The Modulatory Role of Counterfactual Explanations on the Adoption of Algorithmic Bias in XAI-Supported Human Decision-Making

【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)在决策过程中可能传递偏见,进而影响人类判断的问题。其解决方案的关键在于通过引入反事实解释(counterfactual explanations)来校准可解释人工智能(Explainable AI, XAI),以减少AI偏见对人类决策的负面影响。研究结果表明,当未提供反事实解释时,参与者在后期独立决策中会延续AI的性别偏见,而当提供解释后,这种偏见会被逆转,从而有助于维护公平的决策过程。

链接: https://arxiv.org/abs/2505.14377
作者: Ulrike Kuhl,Annika Bush
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted for XAI2025

点击查看摘要

Abstract:Although the integration of artificial intelligence (AI) into everyday tasks improves efficiency and objectivity, it also risks transmitting bias to human decision-making. In this study, we conducted a controlled experiment that simulated hiring decisions to examine how biased AI recommendations - augmented with or without counterfactual explanations - influence human judgment over time. Participants, acting as hiring managers, completed 60 decision trials divided into a baseline phase without AI, followed by a phase with biased (X)AI recommendations (favoring either male or female candidates), and a final post-interaction phase without AI. Our results indicate that the participants followed the AI recommendations 70% of the time when the qualifications of the given candidates were comparable. Yet, only a fraction of participants detected the gender bias (8 out of 294). Crucially, exposure to biased AI altered participants’ inherent preferences: in the post-interaction phase, participants’ independent decisions aligned with the bias when no counterfactual explanations were provided before, but reversed the bias when explanations were given. Reported trust did not differ significantly across conditions. Confidence varied throughout the study phases after exposure to male-biased AI, indicating nuanced effects of AI bias on decision certainty. Our findings point to the importance of calibrating XAI to avoid unintended behavioral shifts in order to safeguard equitable decision-making and prevent the adoption of algorithmic bias.
zh

[AI-27] owards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds

【速读】:该论文试图解决在人机交互(Human-Robot Interaction, HRI)中实现具身认知的核心能力——视觉视角转换(Visual Perspective Taking, VPT)的问题。为达成这一目标,研究者提出了一种概念性框架,并引入了一个在NVIDIA Omniverse中生成的合成数据集,该数据集支持监督学习以进行空间推理任务。数据集中每个实例包含RGB图像、自然语言描述以及表示物体位姿的真实4×4变换矩阵,其关键在于通过推断Z轴距离作为基础技能,为未来扩展至完整的6 Degrees Of Freedom (DOFs)推理奠定基础。

链接: https://arxiv.org/abs/2505.14366
作者: Joel Currie,Gioele Migno,Enrico Piacenti,Maria Elena Giannaccini,Patric Bach,Davide De Tommaso,Agnieszka Wykowska
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted to: Intelligent Autonomous Systems (IAS) 2025 as Late Breaking Report

点击查看摘要

Abstract:We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability for embodied cognition essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4x4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full 6 Degrees Of Freedom (DOFs) reasoning. The dataset is publicly available to support further research. This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
zh
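
数据集中以 4x4 齐次变换矩阵标注物体位姿,而论文的基础监督目标是 Z 轴距离。下面示意如何从这种矩阵中读取平移分量(矩阵数值为假设的示例):

```python
import numpy as np

# 极简示意:从 4x4 齐次变换矩阵中读取平移分量并取 Z 轴距离。

T_obj = np.array([[1, 0, 0,  0.30],
                  [0, 1, 0, -0.10],
                  [0, 0, 1,  1.25],   # 第四列前三行是平移 (x, y, z)
                  [0, 0, 0,  1.00]])

translation = T_obj[:3, 3]
z_distance = translation[2]           # 回归目标:Z 轴距离
print(translation, z_distance)
```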

[AI-28] Upgrading Democracies with Fairer Voting Methods

【速读】:该论文试图解决现代民主社会中投票方法过时、无法有效反映多元社会需求的问题,其解决方案的关键在于引入替代性优先投票方法,如累积投票和等额分配法,以实现选民偏好的比例代表制。通过实证研究瑞士阿尔劳市的参与式预算新方法,论文展示了公平投票方法在提升项目多样性、扩大地理与偏好代表性以及促进创新项目方面的显著效果,同时揭示了公民对比例投票方法的偏好及其背后体现的民主价值。

链接: https://arxiv.org/abs/2505.14349
作者: Evangelos Pournaras,Srijoni Majumdar,Thomas Wellings,Joshua C. Yang,Fatemeh B. Heravan,Regula Hänggli Fricker,Dirk Helbing
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: Includes Supplementary Information

点击查看摘要

Abstract:Voting methods are instrumental design element of democracies. Citizens use them to express and aggregate their preferences to reach a collective decision. However, voting outcomes can be as sensitive to voting rules as they are to people’s voting choices. Despite the significance and inter-disciplinary scientific progress on voting methods, several democracies keep relying on outdated voting methods that do not fit modern, pluralistic societies well, while lacking social innovation. Here, we demonstrate how one can upgrade real-world democracies, namely by using alternative preferential voting methods such as cumulative voting and the method of equal shares designed for a proportional representation of voters’ preferences. By rigorously assessing a new participatory budgeting approach applied in the city of Aarau, Switzerland, we unravel the striking voting outcomes of fair voting methods: more winning projects with the same budget and broader geographic and preference representation of citizens by the elected projects, in particular for voters who used to be under-represented, while promoting novel project ideas. We provide profound causal evidence showing that citizens prefer proportional voting methods, which possess strong legitimacy without the need of very technical specialized explanations. We also reveal strong underlying democratic values exhibited by citizens who support fair voting methods such as altruism and compromise. These findings come with a global momentum to unleash a new and long-awaited participation blueprint of how to upgrade democracies.
zh

[AI-29] Enhancing Classification with Semi-Supervised Deep Learning Using Distance-Based Sample Weights ICML

【速读】:该论文试图解决半监督深度学习中如何有效利用有限的标记数据和大量未标记数据以提升分类性能的问题,特别是在噪声或不平衡数据集等挑战性场景下。其解决方案的关键在于引入基于距离的加权机制,通过衡量训练样本与测试数据的接近程度来优先选择具有信息量的样本,从而增强模型的泛化能力和鲁棒性。该方法结合了不确定性一致性与图表示等技术,在保证可扩展性的前提下,显著提升了模型在多个基准数据集上的准确率、精确率和召回率。

链接: https://arxiv.org/abs/2505.14345
作者: Aydin Abedinia,Shima Tabakhi,Vahid Seydi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 5 pages, 6 figures. This paper has been accepted for publication and oral presentation at the 2025 10th IEEE International Conference on Machine Learning Technologies (ICMLT 2025). The final authenticated version will be available in IEEE Xplore following the conference

点击查看摘要

Abstract:Recent advancements in semi-supervised deep learning have introduced effective strategies for leveraging both labeled and unlabeled data to improve classification performance. This work proposes a semi-supervised framework that utilizes a distance-based weighting mechanism to prioritize critical training samples based on their proximity to test data. By focusing on the most informative examples, the method enhances model generalization and robustness, particularly in challenging scenarios with noisy or imbalanced datasets. Building on techniques such as uncertainty consistency and graph-based representations, the approach addresses key challenges of limited labeled data while maintaining scalability. Experiments on twelve benchmark datasets demonstrate significant improvements across key metrics, including accuracy, precision, and recall, consistently outperforming existing methods. This framework provides a robust and practical solution for semi-supervised learning, with potential applications in domains such as healthcare and security where data limitations pose significant challenges.
zh
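
"越接近测试分布的训练样本权重越大"是该框架的核心思想。下面用高斯核给出一个极简示意(带宽 sigma 与数据均为假设;论文还将此类权重与不确定性一致性、图表示等技术结合):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# 极简示意:按训练样本到最近测试点的距离赋权(高斯核)。

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))
X_test = rng.normal(loc=0.5, size=(50, 8))

nn = NearestNeighbors(n_neighbors=1).fit(X_test)
dist, _ = nn.kneighbors(X_train)          # 每个训练样本到最近测试点的距离
sigma = np.median(dist)
weights = np.exp(-(dist.ravel() ** 2) / (2 * sigma ** 2))
weights /= weights.sum()                  # 归一化后可作为损失的样本权重
print(weights.min(), weights.max())
```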

[AI-30] Exploring Jailbreak Attacks on LLM s through Intent Concealment and Diversion

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全性方面面临的挑战,特别是针对越狱攻击(jailbreak attacks)的防御不足问题。现有越狱攻击方法存在两个主要缺陷:一是需要大量的迭代查询,二是跨模型的泛化能力较差。此外,现有的越狱评估数据集主要关注问答任务,忽略了文本生成任务中对有毒内容准确再生的需求。为了解决这些问题,论文提出了两个关键贡献:一是ICE,一种新颖的黑盒越狱方法,通过意图隐藏(Intent Concealment)和偏离(divErsion)技术有效绕过安全限制,实现了单次查询即达到高攻击成功率(ASR),显著提升了效率和跨模型的迁移性;二是BiSceneEval,一个用于评估LLMs在问答和文本生成任务中鲁棒性的综合性数据集。实验结果表明,ICE优于现有越狱技术,揭示了当前防御机制中的关键漏洞,强调了结合预定义安全机制与实时语义分解的混合安全策略的重要性。

链接: https://arxiv.org/abs/2505.14316
作者: Tiehan Cui,Yanxu Mao,Peipei Liu,Congying Liu,Datao You
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although large language models (LLMs) have achieved remarkable advancements, their security remains a pressing concern. One major threat is jailbreak attacks, where adversarial prompts bypass model safeguards to generate harmful or objectionable content. Researchers study jailbreak attacks to understand security and robustness of LLMs. However, existing jailbreak attack methods face two main challenges: (1) an excessive number of iterative queries, and (2) poor generalization across models. In addition, recent jailbreak evaluation datasets focus primarily on question-answering scenarios, lacking attention to text generation tasks that require accurate regeneration of toxic content. To tackle these challenges, we propose two contributions: (1) ICE, a novel black-box jailbreak method that employs Intent Concealment and divErsion to effectively circumvent security constraints. ICE achieves high attack success rates (ASR) with a single query, significantly improving efficiency and transferability across different models. (2) BiSceneEval, a comprehensive dataset designed for assessing LLM robustness in question-answering and text-generation tasks. Experimental results demonstrate that ICE outperforms existing jailbreak techniques, revealing critical vulnerabilities in current defense mechanisms. Our findings underscore the necessity of a hybrid security strategy that integrates predefined security mechanisms with real-time semantic decomposition to enhance the security of LLMs.
zh

[AI-31] MultiTab: A Comprehensive Benchmark Suite for Multi-Dimensional Evaluation in Tabular Domains

【速读】:该论文试图解决现有基准测试主要依赖平均情况指标,无法揭示模型在不同数据条件下的行为差异的问题。其解决方案的关键在于提出MultiTab,一个用于表格学习算法多维、数据感知分析的基准套件和评估框架。MultiTab通过将196个公开数据集按照样本量、标签不平衡度和特征交互等关键数据特性进行分类,并评估涵盖多种归纳偏置的13种代表性模型,从而揭示模型性能对数据 regimes 的高度敏感性。

链接: https://arxiv.org/abs/2505.14312
作者: Kyungeun Lee,Moonjung Eo,Hye-Seung Cho,Dongmin Kim,Ye Seul Sim,Seoyoon Kim,Min-Kook Suh,Woohyung Lim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:Despite the widespread use of tabular data in real-world applications, most benchmarks rely on average-case metrics, which fail to reveal how model behavior varies across diverse data regimes. To address this, we propose MultiTab, a benchmark suite and evaluation framework for multi-dimensional, data-aware analysis of tabular learning algorithms. Rather than comparing models only in aggregate, MultiTab categorizes 196 publicly available datasets along key data characteristics, including sample size, label imbalance, and feature interaction, and evaluates 13 representative models spanning a range of inductive biases. Our analysis shows that model performance is highly sensitive to such regimes: for example, models using sample-level similarity excel on datasets with large sample sizes or high inter-feature correlation, while models encoding inter-feature dependencies perform best with weakly correlated features. These findings reveal that inductive biases do not always behave as intended, and that regime-aware evaluation is essential for understanding and improving model behavior. MultiTab enables more principled model design and offers practical guidance for selecting models tailored to specific data characteristics. All datasets, code, and optimization logs are publicly available at this https URL.
zh
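
MultiTab 按样本量、标签不平衡、特征交互等特性给数据集打"regime"标签。下面是这一思路的极简示意(分桶阈值均为假设,仅用于说明 regime-aware 评估的做法):

```python
import numpy as np

# 极简示意:按三类数据特性给表格数据集打 regime 标签。

def regime_tags(X, y):
    n = len(y)
    counts = np.bincount(y)
    imbalance = counts.max() / max(counts.min(), 1)   # 多数类/少数类
    corr = np.abs(np.corrcoef(X, rowvar=False))
    inter_corr = corr[np.triu_indices_from(corr, k=1)].mean()
    return {
        "size": "large" if n >= 10_000 else "small",
        "imbalance": "high" if imbalance >= 3 else "low",
        "feature_corr": "high" if inter_corr >= 0.3 else "low",
    }

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
y = (rng.random(500) < 0.2).astype(int)   # 约 4:1 的标签不平衡
print(regime_tags(X, y))
```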

[AI-32] EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection

【速读】:该论文旨在解决多模态代理在操作图形用户界面(GUI)时面临的间接提示注入(indirect prompt injection)威胁,此类攻击通过在视觉环境中嵌入误导性指令(如弹窗或聊天消息)来影响代理行为。传统的一次性攻击方法无法适应模型对视觉注意力的动态分配,因此效果有限。论文提出的解决方案EVA(红队框架)的关键在于将攻击转化为闭环优化过程,通过持续监控代理对GUI的注意力分布,并动态调整对抗性线索、关键词、措辞和布局,从而显著提高攻击成功率和跨不同GUI场景的可迁移性。

链接: https://arxiv.org/abs/2505.14289
作者: Yijie Lu,Tianjie Ju,Manman Zhao,Xinbei Ma,Yuan Guo,ZhuoSheng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As multimodal agents are increasingly trained to operate graphical user interfaces (GUIs) to complete user tasks, they face a growing threat from indirect prompt injection, attacks in which misleading instructions are embedded into the agent’s visual environment, such as popups or chat messages, and misinterpreted as part of the intended task. A typical example is environmental injection, in which GUI elements are manipulated to influence agent behavior without directly modifying the user prompt. To address these emerging attacks, we propose EVA, a red teaming framework for indirect prompt injection which transforms the attack into a closed loop optimization by continuously monitoring an agent’s attention distribution over the GUI and updating adversarial cues, keywords, phrasing, and layout, in response. Compared with prior one shot methods that generate fixed prompts without regard for how the model allocates visual attention, EVA dynamically adapts to emerging attention hotspots, yielding substantially higher attack success rates and far greater transferability across diverse GUI scenarios. We evaluate EVA on six widely used generalist and specialist GUI agents in realistic settings such as popup manipulation, chat based phishing, payments, and email composition. Experimental results show that EVA substantially improves success rates over static baselines. Under goal agnostic constraints, where the attacker does not know the agent’s task intent, EVA still discovers effective patterns. Notably, we find that injection styles transfer well across models, revealing shared behavioral biases in GUI agents. These results suggest that evolving indirect prompt injection is a powerful tool not only for red teaming agents, but also for uncovering common vulnerabilities in their multimodal decision making.
zh

[AI-33] AquaSignal: An Integrated Framework for Robust Underwater Acoustic Analysis

【速读】:该论文旨在解决水下声学信号处理中的噪声干扰、已知声学事件分类以及新型或异常信号检测问题。其解决方案的关键在于构建一个模块化且可扩展的流水线(AquaSignal),该流水线整合了先进的深度学习架构,包括用于降噪的U-Net网络、用于已知事件分类的ResNet18卷积神经网络,以及基于自编码器的无监督新颖性检测模型,从而在复杂多变的海洋环境中提升声学信号分析的可靠性和准确性。

链接: https://arxiv.org/abs/2505.14285
作者: Eirini Panteli,Paulo E. Santos,Nabil Humphrey
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 8 pages; 9 figures

点击查看摘要

Abstract:This paper presents AquaSignal, a modular and scalable pipeline for preprocessing, denoising, classification, and novelty detection of underwater acoustic signals. Designed to operate effectively in noisy and dynamic marine environments, AquaSignal integrates state-of-the-art deep learning architectures to enhance the reliability and accuracy of acoustic signal analysis. The system is evaluated on a combined dataset from the Deepship and Ocean Networks Canada (ONC) benchmarks, providing a diverse set of real-world underwater scenarios. AquaSignal employs a U-Net architecture for denoising, a ResNet18 convolutional neural network for classifying known acoustic events, and an AutoEncoder-based model for unsupervised detection of novel or anomalous signals. To our knowledge, this is the first comprehensive study to apply and evaluate this combination of techniques on maritime vessel acoustic data. Experimental results show that AquaSignal improves signal clarity and task performance, achieving 71% classification accuracy and 91% accuracy in novelty detection. Despite slightly lower classification performance compared to some state-of-the-art models, differences in data partitioning strategies limit direct comparisons. Overall, AquaSignal demonstrates strong potential for real-time underwater acoustic monitoring in scientific, environmental, and maritime domains.
zh

[AI-34] X-KAN: Optimizing Local Kolmogorov-Arnold Networks via Evolutionary Rule-Based Machine Learning IJCAI2025

【速读】:该论文试图解决神经网络在处理局部复杂或不连续函数时表现不佳的问题,这是因为现有方法依赖于覆盖整个问题空间的单一全局模型。解决方案的关键在于提出X-KAN,一种结合了Kolmogorov-Arnold Networks (KAN) 的高表达能力与XCSF(一种基于进化规则的机器学习框架)自适应划分能力的新方法。X-KAN通过将局部KAN模型作为规则后件,并利用规则前件定义局部区域,实现了对复杂函数的有效逼近。

链接: https://arxiv.org/abs/2505.14273
作者: Hiroki Shiraishi,Hisao Ishibuchi,Masaya Nakata
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Accepted by the 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)

点击查看摘要

Abstract:Function approximation is a critical task in various fields. However, existing neural network approaches struggle with locally complex or discontinuous functions due to their reliance on a single global model covering the entire problem space. We propose X-KAN, a novel method that optimizes multiple local Kolmogorov-Arnold Networks (KANs) through an evolutionary rule-based machine learning framework called XCSF. X-KAN combines KAN's high expressiveness with XCSF's adaptive partitioning capability by implementing local KAN models as rule consequents and defining local regions via rule antecedents. Our experimental results on artificial test functions and real-world datasets demonstrate that X-KAN significantly outperforms conventional methods, including XCSF, Multi-Layer Perceptron, and KAN, in terms of approximation accuracy. Notably, X-KAN effectively handles functions with locally complex or discontinuous structures that are challenging for conventional KAN, using a compact set of rules (average 7.2 ± 2.3 rules). These results validate the effectiveness of using KAN as a local model in XCSF, which evaluates the rule fitness based on both accuracy and generality. Our X-KAN implementation is available at this https URL.
zh
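
"规则前件定义局部区域、规则后件是局部模型"是 XCSF 式规则学习的骨架。下面用线性模型代替 KAN 后件给出一个极简示意(区域划分与数据均为假设,仅演示"规则划分 + 局部逼近"对不连续函数的效果):

```python
import numpy as np

# 极简示意:规则 = 前件(超矩形区域)+ 后件(局部模型)。

class Rule:
    def __init__(self, lo, hi):
        self.lo, self.hi = np.array(lo), np.array(hi)   # 前件:局部区域
        self.w = None                                    # 后件:局部模型参数

    def matches(self, x):
        return np.all((self.lo <= x) & (x <= self.hi))

    def fit(self, X, y):
        A = np.hstack([X, np.ones((len(X), 1))])
        self.w, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(self, x):
        return np.append(x, 1.0) @ self.w

# 目标:一个分段(不连续)函数
X = np.random.default_rng(2).uniform(0, 1, size=(400, 1))
y = np.where(X[:, 0] < 0.5, 2 * X[:, 0], 5 - 3 * X[:, 0])

rules = [Rule([0.0], [0.5]), Rule([0.5], [1.0])]
for r in rules:
    mask = np.array([r.matches(x) for x in X])
    r.fit(X[mask], y[mask])

x_new = np.array([0.7])
pred = next(r for r in rules if r.matches(x_new)).predict(x_new)
print(pred)   # 接近 5 - 3*0.7 = 2.9
```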

[AI-35] Hybrid Adaptive Modeling in Process Monitoring: Leveraging Sequence Encoders and Physics-Informed Neural Networks

【速读】:该论文旨在解决传统物理信息神经网络(Physics-Informed Neural Networks, PINNs)在面对变化的参数、边界条件和初始条件时需要重新训练的问题,从而限制了其在实时应用中的灵活性和适应性。解决方案的关键在于引入深度集合(Deep Sets)或序列编码器(Sequence Encoders)来对动态参数、边界条件和初始条件进行编码,并将这些编码后的特征作为PINN的输入,从而使模型能够适应参数和条件的变化,而无需重新训练。

链接: https://arxiv.org/abs/2505.14252
作者: Mouad Elaarabi,Domenico Borzacchiello,Philippe Le Bot,Nathan Lauzeral,Sebastien Comas-Cardona
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we explore the integration of Sequence Encoding for Online Parameter Identification with Physics-Informed Neural Networks to create a model that, once trained, can be utilized for real time applications with variable parameters, boundary conditions, and initial conditions. Recently, the combination of PINNs with Sparse Regression has emerged as a method for performing dynamical system identification through supervised learning and sparse regression optimization, while also solving the dynamics using PINNs. However, this approach can be limited by variations in parameters or boundary and initial conditions, requiring retraining of the model whenever changes occur. In this work, we introduce an architecture that employs Deep Sets or Sequence Encoders to encode dynamic parameters, boundary conditions, and initial conditions, using these encoded features as inputs for the PINN, enabling the model to adapt to changes in parameters, BCs, and ICs. We apply this approach to three different problems. First, we analyze the Rossler ODE system, demonstrating the robustness of the model with respect to noise and its ability to generalize. Next, we explore the model’s capability in a 2D Navier-Stokes PDE problem involving flow past a cylinder with a parametric sinusoidal inlet velocity function, showing that the model can encode pressure data from a few points to identify the inlet velocity profile and utilize physics to compute velocity and pressure throughout the domain. Finally, we address a 1D heat monitoring problem using real data from the heating of glass fiber and thermoplastic composite plates.
zh

[AI-36] Toward Embodied AGI: A Review of Embodied AI and the Road Ahead

【速读】:该论文试图解决如何系统化地理解和构建具有广泛能力的具身人工通用智能(Embodied AGI)的问题。其关键在于提出一个涵盖五个层级(L1-L5)的系统性分类框架,并探讨从基础层级(L1-L2)到高级别能力(L3-L5)所需的关键组件与技术路径,进而构建一个面向L3+级别的机器人脑概念框架,为未来研究提供技术展望与理论基础。

链接: https://arxiv.org/abs/2505.14235
作者: Yequan Wang,Aixin Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial General Intelligence (AGI) is often envisioned as inherently embodied. With recent advances in robotics and foundational AI models, we stand at the threshold of a new era, one marked by increasingly generalized embodied AI systems. This paper contributes to the discourse by introducing a systematic taxonomy of Embodied AGI spanning five levels (L1-L5). We review existing research and challenges at the foundational stages (L1-L2) and outline the key components required to achieve higher-level capabilities (L3-L5). Building on these insights and existing technologies, we propose a conceptual framework for an L3+ robotic brain, offering both a technical outlook and a foundation for future exploration.
zh

[AI-37] Fast and close Shannon entropy approximation

【速读】:该论文旨在解决在物理、信息论、机器学习(Machine Learning, ML)和量子计算等领域中,Shannon entropy (SE) 及其量子力学类似物 von Neumann entropy 的计算所面临的高成本、低鲁棒性和收敛缓慢的问题。其关键解决方案是提出一种非奇异的 Shannon entropy 及其梯度的快速近似方法——Fast Entropy Approximation (FEA),该方法在保持均方误差低于 10^-3 的同时,仅需 5 至 6 次基本运算,显著提升了计算效率,并在特征选择任务中验证了其对模型质量的提升和计算成本的降低。

链接: https://arxiv.org/abs/2505.14234
作者: Illia Horenko,Davide Bassetti,Lukáš Pospíšil
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 1 figure

点击查看摘要

Abstract:Shannon entropy (SE) and its quantum mechanical analogue von Neumann entropy are key components in many tools used in physics, information theory, machine learning (ML) and quantum computing. Besides the significant amounts of SE computations required in these fields, the singularity of the SE gradient is one of the central mathematical reasons behind the high cost, frequently low robustness and slow convergence of such tools. Here we propose the Fast Entropy Approximation (FEA) - a non-singular rational approximation of Shannon entropy and its gradient that achieves a mean absolute error of 10^-3 , which is approximately 20 times lower than comparable state-of-the-art methods. FEA allows around 50% faster computation, requiring only 5 to 6 elementary computational operations, as compared to tens of elementary operations behind the fastest entropy computation algorithms with table look-ups, bitshifts, or series approximations. On a set of common benchmarks for the feature selection problem in machine learning, we show that the combined effect of fewer elementary operations, low approximation error, and a non-singular gradient allows significantly better model quality and enables ML feature extraction that is two to three orders of magnitude faster and computationally cheaper when incorporating FEA into AI tools.
zh
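
摘要所述"梯度奇异性"可以用一行推导看清;FEA 正是用非奇异有理函数消除它。下面给出示意推导(有理逼近的具体系数与阶数以论文为准,此处仅给出一般形式):

```latex
% 示意推导:香农熵的梯度为何在边界处奇异(非论文原式)。
\[
  H(p) = -\sum_{i} p_i \ln p_i ,
  \qquad
  \frac{\partial H}{\partial p_i} = -\left(\ln p_i + 1\right)
  \xrightarrow[\,p_i \to 0^{+}\,]{} +\infty .
\]
% FEA 的思路:用分子含因子 p 的非奇异有理函数逼近 f(p) = -p ln p,
% 使 f(0) = 0 且梯度在 p -> 0+ 时保持有界,例如 Pade 型的一般形式:
\[
  f(p) \;\approx\; \frac{p\,(a_0 + a_1 p + a_2 p^2)}{1 + b_1 p + b_2 p^2} .
\]
```

其中 a_k、b_k 为拟合系数(假设的一般形式;论文报告该类逼近的平均绝对误差约为 10^-3)。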

[AI-38] Federated learning in low-resource settings: A chest imaging study in Africa – Challenges and lessons learned

【速读】:该论文试图解决在非洲低资源地区利用胸部X光进行结核病(Tuberculosis, TB)诊断时面临的隐私问题和数据稀缺问题。解决方案的关键在于采用联邦学习(Federated Learning, FL),通过允许医院在不共享原始患者数据的情况下协作训练人工智能模型,从而克服传统集中式模型的局限性。

链接: https://arxiv.org/abs/2505.14217
作者: Jorge Fabila,Lidia Garrucho,Víctor M. Campello,Carlos Martín-Isla,Karim Lekadir
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study explores the use of Federated Learning (FL) for tuberculosis (TB) diagnosis using chest X-rays in low-resource settings across Africa. FL allows hospitals to collaboratively train AI models without sharing raw patient data, addressing privacy concerns and data scarcity that hinder traditional centralized models. The research involved hospitals and research centers in eight African countries. Most sites used local datasets, while Ghana and The Gambia used public ones. The study compared locally trained models with a federated model built across all institutions to evaluate FL’s real-world feasibility. Despite its promise, implementing FL in sub-Saharan Africa faces challenges such as poor infrastructure, unreliable internet, limited digital literacy, and weak AI regulations. Some institutions were also reluctant to share model updates due to data control concerns. In conclusion, FL shows strong potential for enabling AI-driven healthcare in underserved regions, but broader adoption will require improvements in infrastructure, education, and regulatory support.

[AI-39] Embedded Mean Field Reinforcement Learning for Perimeter-defense Game

Quick Read: This paper tackles the control challenges of large-scale heterogeneous perimeter-defense games, in particular by accounting for realistic factors such as motion dynamics and wind fields in a three-dimensional environment to improve real-world applicability. The key to the solution is the Embedded Mean-Field Actor-Critic (EMFAC) framework, which uses representation learning to achieve high-level action aggregation in a mean-field manner, supporting scalable cooperation among defenders, and introduces a lightweight agent-level attention mechanism based on reward representations to improve decision-making efficiency and convergence speed on large-scale tasks.

Link: https://arxiv.org/abs/2505.14209
Authors: Li Wang, Xin Yu, Xuxin Lv, Gangzheng Ai, Wenjun Wu
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:With the rapid advancement of unmanned aerial vehicles (UAVs) and missile technologies, perimeter-defense games between attackers and defenders for the protection of critical regions have become increasingly complex and strategically significant across a wide range of domains. However, existing studies predominantly focus on small-scale, simplified two-dimensional scenarios, often overlooking realistic environmental perturbations, motion dynamics, and inherent heterogeneity–factors that pose substantial challenges to real-world applicability. To bridge this gap, we investigate a large-scale heterogeneous perimeter-defense game in a three-dimensional setting, incorporating realistic elements such as motion dynamics and wind fields. We derive the Nash equilibrium strategies for both attackers and defenders, characterize the victory regions, and validate our theoretical findings through extensive simulations. To tackle large-scale heterogeneous control challenges in defense strategies, we propose an Embedded Mean-Field Actor-Critic (EMFAC) framework. EMFAC leverages representation learning to enable high-level action aggregation in a mean-field manner, supporting scalable coordination among defenders. Furthermore, we introduce a lightweight agent-level attention mechanism based on reward representation, which selectively filters observations and mean-field information to enhance decision-making efficiency and accelerate convergence in large-scale tasks. Extensive simulations across varying scales demonstrate the effectiveness and adaptability of EMFAC, which outperforms established baselines in both convergence speed and overall performance. To further validate practicality, we test EMFAC in small-scale real-world experiments and conduct detailed analyses, offering deeper insights into the framework’s effectiveness in complex scenarios.

[AI-40] Challenges and Limitations in the Synthetic Generation of mHealth Sensor Data ALT

Quick Read: This paper addresses the restricted data collection in mobile health (mHealth) caused by stringent ethical regulations, privacy concerns, and other constraints, as well as the inadequacy of existing generative models in handling key challenges such as multi-modality, long-range dependencies, and conditional generation; synthetic methods like Chain-of-Thought-style shortcuts often produce oversimplified signals. The key to the solution is a systematic evaluation of state-of-the-art time-series generative models, together with a novel evaluation framework that fairly compares both the intrinsic quality of synthetic data and its utility in downstream predictive tasks, thereby exposing the limitations of existing approaches in cross-modal consistency, temporal coherence, and robustness under train-on-synthetic, test-on-real, and data-augmentation scenarios.

Link: https://arxiv.org/abs/2505.14206
Authors: Flavio Di Martino, Franca Delmastro
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to ACM Transactions on Computing for Healthcare (ACM HEALTH)

Click to view abstract

Abstract:The widespread adoption of mobile sensors has the potential to provide massive and heterogeneous time series data, driving Artificial Intelligence applications in mHealth. However, data collection remains limited due to stringent ethical regulations, privacy concerns, and other constraints, hindering progress in the field. Synthetic data generation, particularly through Generative Adversarial Networks and Diffusion Models, has emerged as a promising solution to address both data scarcity and privacy issues. Yet, these models are often limited to short-term, unimodal signal patterns. This paper presents a systematic evaluation of state-of-the-art generative models for time series synthesis, with a focus on their ability to jointly handle multi-modality, long-range dependencies, and conditional generation, key challenges in the mHealth domain. To ensure a fair comparison, we introduce a novel evaluation framework designed to measure both the intrinsic quality of synthetic data and its utility in downstream predictive tasks. Our findings reveal critical limitations in the existing approaches, particularly in maintaining cross-modal consistency, preserving temporal coherence, and ensuring robust performance in train-on-synthetic, test-on-real, and data augmentation scenarios. Finally, we present our future research directions to enhance synthetic time series generation and improve the applicability of generative models in mHealth.

[AI-41] FLASH-D: FlashAttention with Hidden Softmax Division

Quick Read: This paper targets the computational inefficiency of the attention mechanism in conventional Transformer models, in particular the latency caused by matrix operations interleaved with softmax rescaling and the dependence on processing the full input sequence. The key to the solution is FLASH-D, a mathematically equivalent yet simplified formulation of the FlashAttention kernel that hides the softmax division inside other non-linear function evaluations, computes exponentials in a numerically stable way, and reduces computational cost without introducing numerical approximations, thereby improving efficiency and enabling direct hardware acceleration.

Link: https://arxiv.org/abs/2505.14201
Authors: Kosmas Alexandridis, Vasileios Titopoulos, Giorgos Dimitrakopoulos
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) 2025

Click to view abstract

Abstract:The transformer’s attention mechanism has revolutionized AI and machine learning, with its efficient computation being crucial to its performance. However, calculating attention involves matrix operations interspersed with softmax rescaling, which inherently slows down computation and requires processing the entire input sequence. Building on online softmax computation, FlashAttention integrates softmax calculation with matrix arithmetic, enabling tiled computation independent of sequence length. While optimized for GPUs, FlashAttention’s simplicity makes it amenable to direct hardware acceleration. This work re-evaluates the core FlashAttention kernel, presenting FLASH-D, a mathematically equivalent, yet simplified, formulation that achieves: (a) hiding softmax division within other non-linear function evaluations; (b) inherently numerically stable computation of exponentials, eliminating the need for maximum value subtraction; and (c) a reduction in computational cost without introducing numerical approximations to the FlashAttention kernel. Importantly, the essential FlashAttention properties that facilitate efficient tiled implementation are fully preserved. Hardware implementation results at 28nm demonstrate that this proposed formulation achieves a 22.8% reduction in area and a 20.3% reduction in power, on average, compared to state-of-the-art parallel hardware architectures without any performance penalty.
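For intuition, the online-softmax recurrence that FlashAttention builds on, and that FLASH-D reformulates, fits in a few lines. The sketch below is the baseline kernel with its running maximum and the explicit final division that FLASH-D hides inside other non-linear evaluations; it is a reference implementation for understanding, not the paper's hardware formulation.

```python
import numpy as np

def online_softmax_attention(q, K, V):
    """Attention for one query row, streaming over keys (tiled in j).

    Keeps a running max m (for numerical stability), a normalizer l, and an
    unnormalized accumulator acc; the division by l happens exactly once, at
    the end. FLASH-D's contribution is to fold that division into other
    non-linear evaluations, which this baseline keeps explicit.
    """
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for j in range(K.shape[0]):
        s = q @ K[j]                      # attention score for key j
        m_new = max(m, s)
        scale = np.exp(m - m_new)         # exp(-inf) = 0 on the first step
        l = l * scale + np.exp(s - m_new)
        acc = acc * scale + np.exp(s - m_new) * V[j]
        m = m_new
    return acc / l                        # the softmax division

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(8, 4)), rng.normal(size=(8, 3))
scores = K @ q
ref = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ V
assert np.allclose(online_softmax_attention(q, K, V), ref)
```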

[AI-42] Dynamic Replanning for Improved Public Transport Routing IJCAI2025

Quick Read: This paper addresses the inefficiencies that delays cause in public transport journeys, and in particular the limitations of existing delay-handling solutions: backup plans based on historical data miss opportunities for earlier arrivals, while snapshot planning accounts for current delays but not future ones. The key to the solution is a dynamic replanning framework with a "push" approach, in which the server proactively monitors and adjusts journeys; compared with the "pull" approach, where users manually request replanning, it achieves substantially larger arrival-time savings and better performance.

Link: https://arxiv.org/abs/2505.14193
Authors: Abdallah Abuaisha, Bojie Shen, Daniel Harabor, Peter Stuckey, Mark Wallace
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted for publication at IJCAI 2025. 8 pages, 4 figures, 3 tables

Click to view abstract

Abstract:Delays in public transport are common, often impacting users through prolonged travel times and missed transfers. Existing solutions for handling delays remain limited; backup plans based on historical data miss opportunities for earlier arrivals, while snapshot planning accounts for current delays but not future ones. With the growing availability of live delay data, users can adjust their journeys in real-time. However, the literature lacks a framework that fully exploits this advantage for system-scale dynamic replanning. To address this, we formalise the dynamic replanning problem in public transport routing and propose two solutions: a “pull” approach, where users manually request replanning, and a novel “push” approach, where the server proactively monitors and adjusts journeys. Our experiments show that the push approach outperforms the pull approach, achieving significant speedups. The results also reveal substantial arrival time savings enabled by dynamic replanning.

[AI-43] α-GAN by Rényi Cross Entropy

Quick Read: This paper aims to mitigate problems of conventional generative adversarial networks (GANs), such as vanishing gradients, by proposing α-GAN, a GAN based on Rényi measures. The key to the solution is a value function built from Rényi cross entropy, which casts the discriminator's soft decision about the sample source as an expected certainty measure; tuning the Rényi order α shapes the min-max game between generator and discriminator, and for α ∈ (0,1) the gradient is exponentially enlarged, yielding faster convergence.

Link: https://arxiv.org/abs/2505.14190
Authors: Ni Ding, Miao Qiao, Jiaxing Xu, Yiping Ke, Xiaoyu Zhang
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper proposes α-GAN, a generative adversarial network using Rényi measures. The value function is formulated, by Rényi cross entropy, as an expected certainty measure incurred by the discriminator’s soft decision as to where the sample is from, true population or the generator. The discriminator tries to maximize the Rényi certainty about sample source, while the generator wants to reduce it by injecting fake samples. This forms a min-max problem with the solution parameterized by the Rényi order α. This α-GAN reduces to vanilla GAN at α = 1, where the value function is exactly the binary cross entropy. The optimization of α-GAN is over probability (vector) space. It is shown that the gradient is exponentially enlarged when the Rényi order is in the range α ∈ (0,1). This makes convergence faster, which is verified by experimental results. A discussion shows that choosing α ∈ (0,1) may be able to solve some common problems, e.g., vanishing gradient. A following observation reveals that this range has not been fully explored in the existing Rényi version GANs.
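For reference, a minimal math sketch: the standard Rényi cross entropy of order α that the value function is built from, together with the α → 1 limit recovering Shannon (binary) cross entropy, which matches the reduction to vanilla GAN stated above. The exact pairing of this quantity with the discriminator and generator is as in the paper and is not reproduced here.

```latex
% Rényi cross entropy of order \alpha (standard definition):
H_\alpha(p; q) \;=\; \frac{1}{1-\alpha}\,\log \sum_x p(x)\, q(x)^{\alpha-1},
\qquad \alpha > 0,\ \alpha \neq 1 .
% The limit \alpha \to 1 (L'Hôpital in \alpha) recovers Shannon cross entropy,
% i.e., the binary cross entropy of the vanilla GAN value function when the
% source decision is binary:
\lim_{\alpha \to 1} H_\alpha(p; q) \;=\; -\sum_x p(x)\,\log q(x) .
```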

[AI-44] DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation

Quick Read: This paper addresses the limited performance of large language model (LLM) agents on complex data science tasks caused by the improper ordering of problems during inference. The key to the solution is a curriculum-learning strategy that organizes data science tasks in order of increasing difficulty, combined with a growing long-term memory, to guide the agent in accumulating and exploiting knowledge more effectively and thereby improve its performance on challenging tasks.

Link: https://arxiv.org/abs/2505.14163
Authors: He Wang, Alexander Hanbo Li, Yiqun Hu, Sheng Zhang, Hideo Kobayashi, Jiani Zhang, Henry Zhu, Chung-Wei Hang, Patrick Ng
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language model (LLM) agents have shown promising performance in generating code for solving complex data science problems. Recent studies primarily focus on enhancing in-context learning through improved search, sampling, and planning techniques, while overlooking the importance of the order in which problems are tackled during inference. In this work, we develop a novel inference-time optimization framework, referred to as DSMentor, which leverages curriculum learning – a strategy that introduces simpler tasks first and progressively moves to more complex ones as the learner improves – to enhance LLM agent performance in challenging data science tasks. Our mentor-guided framework organizes data science tasks in order of increasing difficulty and incorporates a growing long-term memory to retain prior experiences, guiding the agent’s learning progression and enabling more effective utilization of accumulated knowledge. We evaluate DSMentor through extensive experiments on DSEval and QRData benchmarks. Experiments show that DSMentor using Claude-3.5-Sonnet improves the pass rate by up to 5.2% on DSEval and QRData compared to baseline agents. Furthermore, DSMentor demonstrates stronger causal reasoning ability, improving the pass rate by 8.8% on the causality problems compared to GPT-4 using Program-of-Thoughts prompts. Our work underscores the importance of developing effective strategies for accumulating and utilizing knowledge during inference, mirroring the human learning process and opening new avenues for improving LLM performance through curriculum-based inference optimization.
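The mentor-guided loop reduces to two ingredients: order tasks by an estimated difficulty, then solve them in sequence while appending each solved episode to a retrievable memory. A minimal sketch follows; the difficulty scorer, recency-based retrieval rule, and llm_solve stub are assumptions for illustration, not the paper's implementation.

```python
def dsmentor_loop(tasks, difficulty, llm_solve, retrieve_k=3):
    """Curriculum inference with a growing long-term memory.

    tasks:      list of problem descriptions
    difficulty: task -> float (easier tasks get smaller scores)
    llm_solve:  (task, retrieved_memories) -> solution string
    """
    memory = []   # (task, solution) pairs accumulated during inference
    results = {}
    for task in sorted(tasks, key=difficulty):   # easy -> hard curriculum
        # Naive retrieval: most recent memories; a real system would likely
        # use embedding similarity here.
        context = memory[-retrieve_k:]
        solution = llm_solve(task, context)
        memory.append((task, solution))          # memory grows monotonically
        results[task] = solution
    return results

# Stub usage: pretend difficulty is proportional to task length.
demo = dsmentor_loop(
    ["plot sales", "clean missing values then fit a gradient boosting model"],
    difficulty=len,
    llm_solve=lambda t, ctx: f"solved({t}) using {len(ctx)} memories",
)
print(demo)
```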

[AI-45] MM-Agent : LLM as Agents for Real-world Mathematical Modeling Problem

Quick Read: This paper addresses how to use large language models (LLMs) for real-world mathematical modeling, that is, the complex task of analyzing a problem, abstracting a model, and generating a complete solution without any predefined formulation. The key is MM-Agent, an expert-inspired framework that decomposes mathematical modeling into four stages: open-ended problem analysis, structured model formulation, computational problem solving, and report generation, which substantially improves LLM performance on practical modeling tasks.

Link: https://arxiv.org/abs/2505.14148
Authors: Fan Liu, Zherui Yang, Cancheng Liu, Tianrui Song, Xiaofeng Gao, Hao Liu
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Mathematical modeling is a cornerstone of scientific discovery and engineering practice, enabling the translation of real-world problems into formal systems across domains such as physics, biology, and economics. Unlike mathematical reasoning, which assumes a predefined formulation, modeling requires open-ended problem analysis, abstraction, and principled formalization. While Large Language Models (LLMs) have shown strong reasoning capabilities, they fall short in rigorous model construction, limiting their utility in real-world problem-solving. To this end, we formalize the task of LLM-powered real-world mathematical modeling, where agents must analyze problems, construct domain-appropriate formulations, and generate complete end-to-end solutions. We introduce MM-Bench, a curated benchmark of 111 problems from the Mathematical Contest in Modeling (MCM/ICM), spanning the years 2000 to 2025 and across ten diverse domains such as physics, biology, and economics. To tackle this task, we propose MM-Agent, an expert-inspired framework that decomposes mathematical modeling into four stages: open-ended problem analysis, structured model formulation, computational problem solving, and report generation. Experiments on MM-Bench show that MM-Agent significantly outperforms baseline agents, achieving an 11.88% improvement over human expert solutions while requiring only 15 minutes and $0.88 per task using GPT-4o. Furthermore, under official MCM/ICM protocols, MM-Agent assisted two undergraduate teams in winning the Finalist Award (top 2.0% among 27,456 teams) in MCM/ICM 2025, demonstrating its practical effectiveness as a modeling copilot. Our code is available at this https URL

[AI-46] SHARP: Synthesizing High-quality Aligned Reasoning Problems for Large Reasoning Models Reinforcement Learning

Quick Read: This paper addresses the scarcity, low quality, limited diversity, and poor verifiability of problem sets for training large reasoning models (LRMs) with reinforcement learning in STEM domains. Existing synthesis methods such as Chain-of-Thought prompting often generate oversimplified or uncheckable data, limiting progress on complex tasks. The key to the solution is SHARP (Synthesizing High-quality Aligned Reasoning Problems for LRMs reinforcement learning with verifiable rewards), whose core is a set of self-alignment principles ensuring graduate- and Olympiad-level difficulty, rigorous logical consistency, and unambiguous, verifiable answers, together with a structured three-phase framework (Alignment, Instantiation, Inference) that guarantees thematic diversity and fine-grained control over problem generation. SHARP further uses a state-of-the-art LRM to derive and verify challenging STEM problems, and refines the model's reasoning through a reinforcement learning loop with verifiable reward signals.

Link: https://arxiv.org/abs/2505.14147
Authors: Xiong Jun Wu, Zhenduo Zhang, ZuJie Wen, Zhiqiang Zhang, Wang Ren, Lei Shi, Cai Chen, Deng Zhao, Dingnan Jin, Qing Cui, Jun Zhou
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Training large reasoning models (LRMs) with reinforcement learning in STEM domains is hindered by the scarcity of high-quality, diverse, and verifiable problem sets. Existing synthesis methods, such as Chain-of-Thought prompting, often generate oversimplified or uncheckable data, limiting model advancement on complex tasks. To address these challenges, we introduce SHARP, a unified approach to Synthesizing High-quality Aligned Reasoning Problems for LRMs reinforcement learning with verifiable rewards (RLVR). SHARP encompasses a strategic set of self-alignment principles – targeting graduate and Olympiad-level difficulty, rigorous logical consistency, and unambiguous, verifiable answers – and a structured three-phase framework (Alignment, Instantiation, Inference) that ensures thematic diversity and fine-grained control over problem generation. We implement SHARP by leveraging a state-of-the-art LRM to infer and verify challenging STEM questions, then employ a reinforcement learning loop to refine the model’s reasoning through verifiable reward signals. Experiments on benchmarks such as GPQA demonstrate that SHARP-augmented training substantially outperforms existing methods, markedly improving complex reasoning accuracy and pushing LRM performance closer to expert-level proficiency. Our contributions include the SHARP strategy, framework design, end-to-end implementation, and experimental evaluation of its effectiveness in elevating LRM reasoning capabilities.

[AI-47] Multimodal Mixture of Low-Rank Experts for Sentiment Analysis and Emotion Recognition ICME2025

Quick Read: This paper addresses parameter conflicts in multi-task learning (MTL) caused by complex task correlations, specifically the parameter-sharing problem in jointly training multimodal sentiment analysis (MSA) and multimodal emotion recognition (MER). Existing methods rely mainly on hard parameter sharing and fail to handle inter-task conflicts. The key to the solution is the Multimodal Mixture of Low-Rank Experts (MMoLRE), which uses shared experts and task-specific experts to separately model task commonalities and differences, avoiding parameter conflicts, while low-rank expert structures keep the parameter and computational overhead low as the number of experts grows.

Link: https://arxiv.org/abs/2505.14143
Authors: Shuo Zhang, Jinsong Zhang, Zhejun Zhang, Lei Li
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to ICME 2025

Click to view abstract

Abstract:Multi-task learning (MTL) enables the efficient transfer of extra knowledge acquired from other tasks. The high correlation between multimodal sentiment analysis (MSA) and multimodal emotion recognition (MER) supports their joint training. However, existing methods primarily employ hard parameter sharing, ignoring parameter conflicts caused by complex task correlations. In this paper, we present a novel MTL method for MSA and MER, termed Multimodal Mixture of Low-Rank Experts (MMoLRE). MMoLRE utilizes shared and task-specific experts to distinctly model common and unique task characteristics, thereby avoiding parameter conflicts. Additionally, inspired by low-rank structures in the Mixture of Experts (MoE) framework, we design low-rank expert networks to reduce parameter and computational overhead as the number of experts increases. Extensive experiments on the CMU-MOSI and CMU-MOSEI benchmarks demonstrate that MMoLRE achieves state-of-the-art performance on the MSA task and competitive results on the MER task.
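The core layer is easy to sketch: every expert is a rank-r factorization B·A, shared experts fire for all tasks, and task-specific experts fire only for their own task, so MSA and MER gradients do not collide in the same parameters. Dimensions, the gating rule, and expert counts below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LowRankExpert(nn.Module):
    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)   # down-projection
        self.B = nn.Linear(rank, d_out, bias=False)  # up-projection

    def forward(self, x):
        return self.B(self.A(x))  # rank-r map, far fewer params than d_in*d_out

class MMoLRELayer(nn.Module):
    """Shared experts model task commonalities; per-task experts model
    task-specific structure, avoiding cross-task parameter conflicts."""
    def __init__(self, d, n_shared=2, n_tasks=2, n_task_experts=2, rank=8):
        super().__init__()
        self.shared = nn.ModuleList(LowRankExpert(d, d, rank) for _ in range(n_shared))
        self.task_experts = nn.ModuleList(
            nn.ModuleList(LowRankExpert(d, d, rank) for _ in range(n_task_experts))
            for _ in range(n_tasks)
        )
        self.gate = nn.Linear(d, n_shared + n_task_experts)

    def forward(self, x, task_id):
        experts = list(self.shared) + list(self.task_experts[task_id])
        w = torch.softmax(self.gate(x), dim=-1)              # (batch, n_experts)
        outs = torch.stack([e(x) for e in experts], dim=-1)  # (batch, d, n_experts)
        return (outs * w.unsqueeze(1)).sum(-1)

layer = MMoLRELayer(d=32)
x = torch.randn(4, 32)
print(layer(x, task_id=0).shape)  # torch.Size([4, 32]); e.g., the MSA branch
```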

[AI-48] Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent EMNLP2025

Quick Read: This paper addresses the task-planning challenges of mobile GUI agents, which, lacking a deep understanding of the target applications, struggle to produce accurate execution plans and often get "lost" during task execution. The key to the solution is SPlanner, a plug-and-play planning module whose core idea is to model the control logic and configurations of mobile applications with extended finite state machines (EFSMs): a user instruction is decomposed into a sequence of primary functions modeled in the EFSMs, an execution path is generated by traversing the EFSMs, and an LLM then refines the path into a natural-language plan that effectively guides vision-language models (VLMs) to produce interactive GUI actions that accomplish the user task.

Link: https://arxiv.org/abs/2505.14141
Authors: Fanglin Mo, Junzhe Chen, Haoxuan Zhu, Xuming Hu
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages. Submitted to EMNLP 2025

Click to view abstract

Abstract:Mobile GUI agents execute user commands by directly interacting with the graphical user interface (GUI) of mobile devices, demonstrating significant potential to enhance user convenience. However, these agents face considerable challenges in task planning, as they must continuously analyze the GUI and generate operation instructions step by step. This process often leads to difficulties in making accurate task plans, as GUI agents lack a deep understanding of how to effectively use the target applications, which can cause them to become “lost” during task execution. To address the task planning issue, we propose SPlanner, a plug-and-play planning module to generate execution plans that guide vision language models (VLMs) in executing tasks. The proposed planning module utilizes extended finite state machines (EFSMs) to model the control logic and configurations of mobile applications. It then decomposes a user instruction into a sequence of primary functions modeled in EFSMs, and generates the execution path by traversing the EFSMs. We further refine the execution path into a natural language plan using an LLM. The final plan is concise and actionable, and effectively guides VLMs to generate interactive GUI actions to accomplish user tasks. SPlanner demonstrates strong performance on dynamic benchmarks reflecting real-world mobile usage. On the AndroidWorld benchmark, SPlanner achieves a 63.8% task success rate when paired with Qwen2.5-VL-72B as the VLM executor, yielding a 28.8 percentage point improvement compared to using Qwen2.5-VL-72B without planning assistance.
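An EFSM here is essentially a set of states plus guarded transitions over the app's configuration variables, and planning becomes a search for a transition path realizing the requested functions. A minimal sketch with a hypothetical two-screen app follows; the state names, guards, and BFS planner are illustrative, and real EFSMs would also update variables on transitions, which is omitted here.

```python
from collections import deque

class EFSM:
    def __init__(self, transitions):
        # transitions: (src_state, action, guard(vars) -> bool, dst_state)
        self.transitions = transitions

    def plan(self, start, goal, variables):
        """BFS over guarded transitions; returns an action sequence or None."""
        queue, seen = deque([(start, [])]), {start}
        while queue:
            state, path = queue.popleft()
            if state == goal:
                return path
            for src, action, guard, dst in self.transitions:
                if src == state and dst not in seen and guard(variables):
                    seen.add(dst)
                    queue.append((dst, path + [action]))
        return None

# Hypothetical messaging app: reaching Chat requires a logged-in user.
app = EFSM([
    ("Home", "tap_login", lambda v: not v["logged_in"], "Login"),
    ("Login", "submit",   lambda v: True,               "Home"),
    ("Home", "open_chat", lambda v: v["logged_in"],     "Chat"),
    ("Chat", "type_send", lambda v: True,               "Sent"),
])
print(app.plan("Home", "Sent", {"logged_in": True}))  # ['open_chat', 'type_send']
```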

[AI-49] RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning

Quick Read: This paper addresses the limitations of large language models (LLMs) in reasoning, in particular how their token-level autoregressive nature constrains complex reasoning. Existing inference-time methods such as Chain/Tree/Graph-of-Thought(s) guide reasoning with manually predefined, task-agnostic logical structures applied uniformly across tasks, and therefore lack adaptability. The key to the solution is training a lightweight navigator model with reinforcement learning (RL) that, at inference time, dynamically selects and combines basic logic blocks into task-specific logical structures according to problem characteristics, thereby enhancing LLM reasoning.

Link: https://arxiv.org/abs/2505.14140
Authors: Qianyue Hao, Sibo Li, Jian Yuan, Yong Li
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Despite rapid advancements in large language models (LLMs), the token-level autoregressive nature constrains their complex reasoning capabilities. To enhance LLM reasoning, inference-time techniques, including Chain/Tree/Graph-of-Thought(s), successfully improve the performance, as they are fairly cost-effective by guiding reasoning through sophisticated logical structures without modifying LLMs’ parameters. However, these manually predefined, task-agnostic frameworks are applied uniformly across diverse tasks, lacking adaptability. To improve this, we propose RL-of-Thoughts (RLoT), where we train a lightweight navigator model with reinforcement learning (RL) to adaptively enhance LLM reasoning at inference time. Specifically, we design five basic logic blocks from the perspective of human cognition. During the reasoning process, the trained RL navigator dynamically selects the suitable logic blocks and combines them into task-specific logical structures according to problem characteristics. Experiments across multiple reasoning benchmarks (AIME, MATH, GPQA, etc.) with multiple LLMs (GPT, Llama, Qwen, and DeepSeek) illustrate that RLoT outperforms established inference-time techniques by up to 13.4%. Remarkably, with less than 3K parameters, our RL navigator is able to make sub-10B LLMs comparable to 100B-scale counterparts. Moreover, the RL navigator demonstrates strong transferability: a model trained on one specific LLM-task pair can effectively generalize to unseen LLMs and tasks. Our code is open-source at this https URL for reproducibility.

[AI-50] FlowQ: Energy-Guided Flow Policies for Offline Reinforcement Learning

Quick Read: This paper addresses the problem of incorporating guidance into the training process of diffusion-style models, whereas conventional approaches focus on guidance at inference time. The key to the solution is energy-guided flow matching, which approximates the energy-guided probability path as a Gaussian path and learns the conditional velocity field corresponding to the flow policy, thereby steering trajectories during training and eliminating the need for guidance at inference time.

Link: https://arxiv.org/abs/2505.14139
Authors: Marvin Alles, Nutan Chen, Patrick van der Smagt, Botond Cseke
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:The use of guidance to steer sampling toward desired outcomes has been widely explored within diffusion models, especially in applications such as image and trajectory generation. However, incorporating guidance during training remains relatively underexplored. In this work, we introduce energy-guided flow matching, a novel approach that enhances the training of flow models and eliminates the need for guidance at inference time. We learn a conditional velocity field corresponding to the flow policy by approximating an energy-guided probability path as a Gaussian path. Learning guided trajectories is appealing for tasks where the target distribution is defined by a combination of data and an energy function, as in reinforcement learning. Diffusion-based policies have recently attracted attention for their expressive power and ability to capture multi-modal action distributions. Typically, these policies are optimized using weighted objectives or by back-propagating gradients through actions sampled by the policy. As an alternative, we propose FlowQ, an offline reinforcement learning algorithm based on energy-guided flow matching. Our method achieves competitive performance while the policy training time is constant in the number of flow sampling steps.
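The flow-matching backbone is compact enough to write out: sample a time t, interpolate x_t between a base noise sample and a dataset action, and regress the network onto the conditional velocity of that path. The sketch shows the standard unguided conditional flow-matching objective with a linear Gaussian-path interpolant; FlowQ's energy guidance, which reshapes the target path toward high-value actions, is elided here.

```python
import torch
import torch.nn as nn

def flow_matching_loss(v_net, x1, state):
    """Standard conditional flow matching with a linear interpolant.

    v_net: velocity field v(x_t, t, state); x1: target actions from the dataset.
    """
    x0 = torch.randn_like(x1)            # base noise sample
    t = torch.rand(x1.shape[0], 1)       # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1           # point on the straight-line path
    target_v = x1 - x0                   # conditional velocity of that path
    return ((v_net(xt, t, state) - target_v) ** 2).mean()

# Tiny usage example with an MLP velocity field conditioned on the state.
class VNet(nn.Module):
    def __init__(self, a_dim=2, s_dim=3):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(a_dim + 1 + s_dim, 64), nn.SiLU(),
                               nn.Linear(64, a_dim))
    def forward(self, x, t, s):
        return self.f(torch.cat([x, t, s], dim=-1))

v = VNet()
loss = flow_matching_loss(v, x1=torch.randn(16, 2), state=torch.randn(16, 3))
loss.backward()  # a policy-training optimizer step would follow
```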

[AI-51] Memory Assignment for Finite-Memory Strategies in Adversarial Patrolling Games

Quick Read: This paper addresses the difficulty of memory assignment for finite-memory Defender strategies: existing algorithms require the available memory size at each location to be set manually, and choosing the right assignment is a well-known open and hard problem that hinders the usability of finite-memory strategies. The key to the solution is a general method that iteratively changes the memory assignment to improve the Defender's strategy, and it can be combined with any black-box strategy optimization tool.

Link: https://arxiv.org/abs/2505.14137
Authors: Vojtěch Kůr, Vít Musil, Vojtěch Řehák
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Adversarial Patrolling games form a subclass of Security games where a Defender moves between locations, guarding vulnerable targets. The main algorithmic problem is constructing a strategy for the Defender that minimizes the worst damage an Attacker can cause. We focus on the class of finite-memory (also known as regular) Defender’s strategies that experimentally outperformed other competing classes. A finite-memory strategy can be seen as a positional strategy on a finite set of states. Each state consists of a pair of a location and a certain integer value–called memory. Existing algorithms improve the transitional probabilities between the states but require that the available memory size itself is assigned at each location manually. Choosing the right memory assignment is a well-known open and hard problem that hinders the usability of finite-memory strategies. We solve this issue by developing a general method that iteratively changes the memory assignment. Our algorithm can be used in connection with any black-box strategy optimization tool. We evaluate our method on various experiments and show its robustness by solving instances of various patrolling models.

[AI-52] Local Mixtures of Experts: Essentially Free Test-Time Training via Model Merging

Quick Read: This paper addresses the prohibitive training and inference costs that conventional mixture-of-experts (MoE) models face when scaling up the number of experts. The key to the solution is Test-Time Model Merging (TTMM), which uses model merging to avoid almost any test-time overhead while improving performance as the number of experts grows, thereby approximating the effect of test-time training (TTT) at more than 100x faster inference.

Link: https://arxiv.org/abs/2505.14136
Authors: Ryo Bertolissi, Jonas Hübotter, Ido Hakimi, Andreas Krause
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Mixture of expert (MoE) models are a promising approach to increasing model capacity without increasing inference cost, and are core components of many state-of-the-art language models. However, current MoE models typically use only few experts due to prohibitive training and inference cost. We propose Test-Time Model Merging (TTMM) which scales the MoE paradigm to an order of magnitude more experts and uses model merging to avoid almost any test-time overhead. We show that TTMM is an approximation of test-time training (TTT), which fine-tunes an expert model for each prediction task, i.e., prompt. TTT has recently been shown to significantly improve language models, but is computationally expensive. We find that performance of TTMM improves with more experts and approaches the performance of TTT. Moreover, we find that with a 1B parameter base model, TTMM is more than 100x faster than TTT at test-time by amortizing the cost of TTT at train-time. Thus, TTMM offers a promising cost-effective approach to scale test-time training.
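Mechanically, merging at test time amounts to: embed the prompt, pick the most relevant experts, and average their weights into one temporary model, so the only extra test-time cost is a single weighted parameter average rather than per-token routing. The cosine-similarity gating and top-k rule below are illustrative assumptions, not the paper's exact routing.

```python
import numpy as np

def merge_for_prompt(prompt_emb, expert_embs, expert_weights, k=2):
    """Test-time model merging: weight-space average of the k most relevant experts.

    expert_embs:    (n_experts, d) centroid embedding of each expert's training data
    expert_weights: list of dicts mapping parameter name -> np.ndarray
    """
    sims = expert_embs @ prompt_emb / (
        np.linalg.norm(expert_embs, axis=1) * np.linalg.norm(prompt_emb) + 1e-8
    )
    top = np.argsort(sims)[-k:]                          # k most similar experts
    gate = np.exp(sims[top]) / np.exp(sims[top]).sum()   # softmax over the top-k
    merged = {}
    for name in expert_weights[0]:
        merged[name] = sum(g * expert_weights[i][name] for g, i in zip(gate, top))
    return merged  # a single model: no per-token routing overhead at inference

experts = [{"w": np.full(4, float(i))} for i in range(8)]
embs = np.eye(8)[:, :4] + 0.01                           # toy expert centroids in R^4
print(merge_for_prompt(np.array([1.0, 0, 0, 0]), embs, experts)["w"])
```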

[AI-53] A Methodological Framework for Measuring Spatial Labeling Similarity

Quick Read: This paper addresses the problem of measuring the similarity between two spatial labelings, i.e., accurately assessing their differences and the contributing factors. Existing methods often fail to jointly account for the number of matched labels, the topology of the spatial label distribution, and the heterogeneous impacts of mismatched labels. The key to the solution is a graph-based methodological framework: the two spatial labelings are transformed into graphs based on location organization, labels, and attributes (e.g., location significance), and the distributions of graph attributes are extracted so that the distributional discrepancy, reflecting the level of dissimilarity between the labelings, can be computed efficiently. The framework is instantiated as the Spatial Labeling Analogy Metric (SLAM) and validated on spatial transcriptomics (ST) data.

Link: https://arxiv.org/abs/2505.14128
Authors: Yihang Du, Jiaying Hu, Suyang Hou, Yueyang Ding, Xiaobo Sun
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Spatial labeling assigns labels to specific spatial locations to characterize their spatial properties and relationships, with broad applications in scientific research and practice. Measuring the similarity between two spatial labelings is essential for understanding their differences and the contributing factors, such as changes in location properties or labeling methods. An adequate and unbiased measurement of spatial labeling similarity should consider the number of matched labels (label agreement), the topology of spatial label distribution, and the heterogeneous impacts of mismatched labels. However, existing methods often fail to account for all these aspects. To address this gap, we propose a methodological framework to guide the development of methods that meet these requirements. Given two spatial labelings, the framework transforms them into graphs based on location organization, labels, and attributes (e.g., location significance). The distributions of their graph attributes are then extracted, enabling an efficient computation of distributional discrepancy to reflect the dissimilarity level between the two labelings. We further provide a concrete implementation of this framework, termed Spatial Labeling Analogy Metric (SLAM), along with an analysis of its theoretical foundation, for evaluating spatial labeling results in spatial transcriptomics (ST) as per their similarity with ground truth labeling. Through a series of carefully designed experimental cases involving both simulated and real ST data, we demonstrate that SLAM provides a comprehensive and accurate reflection of labeling quality compared to other well-established evaluation metrics. Our code is available at this https URL.

[AI-54] Contrastive Consolidation of Top-Down Modulations Achieves Sparsely Supervised Continual Learning

Quick Read: This paper addresses catastrophic forgetting in machine learning during continual learning, where supervised fine-tuning on a new task degrades performance on the original task. The key to the solution is task-modulated contrastive learning (TMCL), inspired by the biophysical machinery of the neocortex, which uses predictive-coding principles to integrate top-down information continually and without supervision. New affine modulations are learned to improve the separation of new classes from all others without altering feedforward weights, and by co-opting the view-invariance learning mechanism the representation space is stabilized, dynamically balancing stability and plasticity.

Link: https://arxiv.org/abs/2505.14125
Authors: Viet Anh Khoa Tran, Emre Neftci, Willem. A. M. Wybo
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 33 pages, 5 figures

Click to view abstract

Abstract:Biological brains learn continually from a stream of unlabeled data, while integrating specialized information from sparsely labeled examples without compromising their ability to generalize. Meanwhile, machine learning methods are susceptible to catastrophic forgetting in this natural learning setting, as supervised specialist fine-tuning degrades performance on the original task. We introduce task-modulated contrastive learning (TMCL), which takes inspiration from the biophysical machinery in the neocortex, using predictive coding principles to integrate top-down information continually and without supervision. We follow the idea that these principles build a view-invariant representation space, and that this can be implemented using a contrastive loss. Then, whenever labeled samples of a new class occur, new affine modulations are learned that improve separation of the new class from all others, without affecting feedforward weights. By co-opting the view-invariance learning mechanism, we then train feedforward weights to match the unmodulated representation of a data sample to its modulated counterparts. This introduces modulation invariance into the representation space, and, by also using past modulations, stabilizes it. Our experiments show improvements in both class-incremental and transfer learning over state-of-the-art unsupervised approaches, as well as over comparable supervised approaches, using as few as 1% of available labels. Taken together, our work suggests that top-down modulations play a crucial role in balancing stability and plasticity.

[AI-55] Collaborative Unlabeled Data Optimization

Quick Read: This paper addresses the key question of how to improve the efficiency and sustainability of deep learning training by optimizing the data itself. Existing model-centric approaches suffer from three main limitations rooted in a shared bottleneck: knowledge extracted from data is locked into model parameters, limiting its reusability and scalability. The key to the solution is CoOpt, a highly efficient, parallelized framework for collaborative unlabeled data optimization that effectively encodes knowledge into the data itself; by distributing unlabeled data and leveraging publicly available task-agnostic models, it enables scalable, reusable, and sustainable training pipelines.

Link: https://arxiv.org/abs/2505.14117
Authors: Xinyi Shang, Peng Sun, Fengyuan Liu, Tao Lin
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper pioneers a novel data-centric paradigm to maximize the utility of unlabeled data, tackling a critical question: How can we enhance the efficiency and sustainability of deep learning training by optimizing the data itself? We begin by identifying three key limitations in existing model-centric approaches, all rooted in a shared bottleneck: knowledge extracted from data is locked to model parameters, hindering its reusability and scalability. To this end, we propose CoOpt, a highly efficient, parallelized framework for collaborative unlabeled data optimization, thereby effectively encoding knowledge into the data itself. By distributing unlabeled data and leveraging publicly available task-agnostic models, CoOpt facilitates scalable, reusable, and sustainable training pipelines. Extensive experiments across diverse datasets and architectures demonstrate its efficacy and efficiency, achieving 13.6% and 6.8% improvements on Tiny-ImageNet and ImageNet-1K, respectively, with training speedups of 1.94× and 1.2×.

[AI-56] AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models

Quick Read: This paper addresses the limited effectiveness, applicability, and practicability of audio jailbreak attacks against large audio-language models (LALMs). Existing methods typically assume that the adversary can fully manipulate user prompts, which does not always hold in practice. The key to the proposed AudioJailbreak lies in four properties: asynchrony (suffixal jailbreak audios remove the need to align with user prompts on the time axis), universality (a single jailbreak perturbation works across different prompts by incorporating multiple prompts into its generation), stealthiness (various intent-concealment strategies keep the malicious intent from raising the victim's awareness), and over-the-air robustness (incorporating the reverberation distortion effect via room impulse responses keeps the audio effective when played over the air). These properties give AudioJailbreak broader attack scenarios and stronger effectiveness than prior methods.

Link: https://arxiv.org/abs/2505.14103
Authors: Guangke Chen, Fu Song, Zhe Zhao, Xiaojun Jia, Yang Liu, Yanchen Qiao, Weizhe Zhang
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Jailbreak attacks to Large audio-language models (LALMs) are studied recently, but they achieve suboptimal effectiveness, applicability, and practicability, particularly, assuming that the adversary can fully manipulate user prompts. In this work, we first conduct an extensive experiment showing that advanced text jailbreak attacks cannot be easily ported to end-to-end LALMs via text-to speech (TTS) techniques. We then propose AudioJailbreak, a novel audio jailbreak attack, featuring (1) asynchrony: the jailbreak audio does not need to align with user prompts in the time axis by crafting suffixal jailbreak audios; (2) universality: a single jailbreak perturbation is effective for different prompts by incorporating multiple prompts into perturbation generation; (3) stealthiness: the malicious intent of jailbreak audios will not raise the awareness of victims by proposing various intent concealment strategies; and (4) over-the-air robustness: the jailbreak audios remain effective when being played over the air by incorporating the reverberation distortion effect with room impulse response into the generation of the perturbations. In contrast, all prior audio jailbreak attacks cannot offer asynchrony, universality, stealthiness, or over-the-air robustness. Moreover, AudioJailbreak is also applicable to the adversary who cannot fully manipulate user prompts, thus has a much broader attack scenario. Extensive experiments with thus far the most LALMs demonstrate the high effectiveness of AudioJailbreak. We highlight that our work peeks into the security implications of audio jailbreak attacks against LALMs, and realistically fosters improving their security robustness. The implementation and audio samples are available at our website this https URL.

[AI-57] Personalized Student Knowledge Modeling for Future Learning Resource Prediction

Quick Read: This paper addresses persistent challenges in student knowledge tracing and behavior modeling, including limited personalization, inadequate modeling of diverse learning activities (especially non-assessed materials), and neglect of the interplay between knowledge acquisition and behavioral patterns. The key to the solution is Knowledge Modeling and Material Prediction (KMaP), a stateful multi-task approach that uses clustering-based student profiling to build personalized student representations, improving predictions of future learning-resource preferences.

Link: https://arxiv.org/abs/2505.14072
Authors: Soroush Hashemifar, Sherry Sahebi
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Despite advances in deep learning for education, student knowledge tracing and behavior modeling face persistent challenges: limited personalization, inadequate modeling of diverse learning activities (especially non-assessed materials), and overlooking the interplay between knowledge acquisition and behavioral patterns. Practical limitations, such as fixed-size sequence segmentation, frequently lead to the loss of contextual information vital for personalized learning. Moreover, reliance on student performance on assessed materials limits the modeling scope, excluding non-assessed interactions like lectures. To overcome these shortcomings, we propose Knowledge Modeling and Material Prediction (KMaP), a stateful multi-task approach designed for personalized and simultaneous modeling of student knowledge and behavior. KMaP employs clustering-based student profiling to create personalized student representations, improving predictions of future learning resource preferences. Extensive experiments on two real-world datasets confirm significant behavioral differences across student clusters and validate the efficacy of the KMaP model.

[AI-58] Field Matters: A lightweight LLM -enhanced Method for CTR Prediction

Quick Read: This paper aims to cut the excessive computational overhead incurred when large language models (LLMs) are introduced into traditional click-through rate (CTR) prediction. Existing methods typically require extensive processing of detailed textual descriptions for large-scale instances or user/item entities, which is inefficient. The key to the solution is LLaCTR, a lightweight LLM-enhanced CTR method that adopts a field-level enhancement paradigm: it distills crucial, lightweight semantic knowledge from small-scale feature fields via self-supervised field-feature fine-tuning, and then uses this knowledge to enhance both feature representations and feature interactions, improving effectiveness while remaining efficient.

Link: https://arxiv.org/abs/2505.14057
Authors: Yu Cui, Feng Liu, Jiawei Chen, Xingyu Lou, Changwang Zhang, Jun Wang, Yuegang Sun, Xiaohu Yang, Can Wang
Institution: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Click-through rate (CTR) prediction is a fundamental task in modern recommender systems. In recent years, the integration of large language models (LLMs) has been shown to effectively enhance the performance of traditional CTR methods. However, existing LLM-enhanced methods often require extensive processing of detailed textual descriptions for large-scale instances or user/item entities, leading to substantial computational overhead. To address this challenge, this work introduces LLaCTR, a novel and lightweight LLM-enhanced CTR method that employs a field-level enhancement paradigm. Specifically, LLaCTR first utilizes LLMs to distill crucial and lightweight semantic knowledge from small-scale feature fields through self-supervised field-feature fine-tuning. Subsequently, it leverages this field-level semantic knowledge to enhance both feature representation and feature interactions. In our experiments, we integrate LLaCTR with six representative CTR models across four datasets, demonstrating its superior performance in terms of both effectiveness and efficiency compared to existing LLM-enhanced methods. Our code is available at this https URL.

[AI-59] Adaptive Cyclic Diffusion for Inference Scaling

Quick Read: This paper addresses the inflexible allocation of compute at inference time in diffusion models: existing inference-time scaling methods rely on fixed denoising schedules and cannot adapt the computational effort to instance difficulty or task-specific demands. The key to the solution is Adaptive Bi-directional Cyclic Diffusion (ABCD), a flexible, search-based adaptive inference-time scaling framework that refines outputs through bi-directional diffusion cycles while adaptively controlling exploration depth and termination; it comprises three components, Cyclic Diffusion Search, Automatic Exploration-Exploitation Balancing, and Adaptive Thinking Time, improving performance across diverse tasks while maintaining computational efficiency.

Link: https://arxiv.org/abs/2505.14036
Authors: Gyubin Lee, Truong Nhat Nguyen Bao, Jaesik Yoon, Dongwoo Lee, Minsu Kim, Yoshua Bengio, Sungjin Ahn
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Diffusion models have demonstrated strong generative capabilities across domains ranging from image synthesis to complex reasoning tasks. However, most inference-time scaling methods rely on fixed denoising schedules, limiting their ability to allocate computation based on instance difficulty or task-specific demands adaptively. We introduce the challenge of adaptive inference-time scaling-dynamically adjusting computational effort during inference-and propose Adaptive Bi-directional Cyclic Diffusion (ABCD), a flexible, search-based inference framework. ABCD refines outputs through bi-directional diffusion cycles while adaptively controlling exploration depth and termination. It comprises three components: Cyclic Diffusion Search, Automatic Exploration-Exploitation Balancing, and Adaptive Thinking Time. Experiments show that ABCD improves performance across diverse tasks while maintaining computational efficiency.

[AI-60] CSAGC-IDS: A Dual-Module Deep Learning Network Intrusion Detection Model for Complex and Imbalanced Data

Quick Read: This paper addresses high-dimensional, complex traffic patterns and class imbalance in network intrusion detection. The key to the solution is the CSAGC-IDS model, which combines SC-CGAN, a self-attention-enhanced convolutional conditional generative adversarial network that generates high-quality data to mitigate class imbalance, with CSCA-CNN, a convolutional neural network enhanced through cost-sensitive learning and a channel attention mechanism that extracts features from complex traffic data for accurate intrusion detection.

Link: https://arxiv.org/abs/2505.14027
Authors: Yifan Zeng
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As computer networks proliferate, the gravity of network intrusions has escalated, emphasizing the criticality of network intrusion detection systems for safeguarding security. While deep learning models have exhibited promising results in intrusion detection, they face challenges in managing high-dimensional, complex traffic patterns and imbalanced data categories. This paper presents CSAGC-IDS, a network intrusion detection model based on deep learning techniques. CSAGC-IDS integrates SC-CGAN, a self-attention-enhanced convolutional conditional generative adversarial network that generates high-quality data to mitigate class imbalance. Furthermore, CSAGC-IDS integrates CSCA-CNN, a convolutional neural network enhanced through cost sensitive learning and channel attention mechanism, to extract features from complex traffic data for precise detection. Experiments conducted on the NSL-KDD dataset. CSAGC-IDS achieves an accuracy of 84.55% and an F1-score of 84.52% in the five-class classification task, and an accuracy of 91.09% and an F1 score of 92.04% in the binary classification task. In addition, this paper provides an interpretability analysis of the proposed model, using SHAP and LIME to explain the decision-making mechanisms of the model.

[AI-61] FedGraM: Defending Against Untargeted Attacks in Federated Learning via Embedding Gram Matrix

Quick Read: This paper addresses the defense against untargeted attacks in federated learning (FL), which aim to degrade the global model's performance on the underlying data distribution. Existing defenses have limited effectiveness in practical FL environments due to data heterogeneity, so this work proposes a new robust aggregation method, FedGraM, whose key idea is to use generalization contribution to distinguish attacked updates. Concretely, the server maintains an auxiliary dataset with one sample per class, extracts embeddings from each local model, and computes the norm of each model's embedding Gram matrix as an indicator of its inter-class separation capability in the embedding space, identifying and removing potentially malicious models to improve the robustness of the global model.

Link: https://arxiv.org/abs/2505.14024
Authors: Di Wu, Qian Li, Heng Yang, Yong Han
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Federated Learning (FL) enables geographically distributed clients to collaboratively train machine learning models by sharing only their local models, ensuring data privacy. However, FL is vulnerable to untargeted attacks that aim to degrade the global model’s performance on the underlying data distribution. Existing defense mechanisms attempt to improve FL’s resilience against such attacks, but their effectiveness is limited in practical FL environments due to data heterogeneity. On the contrary, we aim to detect and remove the attacks to mitigate their impact. Generalization contribution plays a crucial role in distinguishing untargeted attacks. Our observations indicate that, with limited data, the divergence between embeddings representing different classes provides a better measure of generalization than direct accuracy. In light of this, we propose a novel robust aggregation method, FedGraM, designed to defend against untargeted attacks in FL. The server maintains an auxiliary dataset containing one sample per class to support aggregation. This dataset is fed to the local models to extract embeddings. Then, the server calculates the norm of the Gram Matrix of the embeddings for each local model. The norm serves as an indicator of each model’s inter-class separation capability in the embedding space. FedGraM identifies and removes potentially malicious models by filtering out those with the largest norms, then averages the remaining local models to form the global model. We conduct extensive experiments to evaluate the performance of FedGraM. Our empirical results show that with limited data samples used to construct the auxiliary dataset, FedGraM achieves exceptional performance, outperforming state-of-the-art defense methods.
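The aggregation rule is concrete enough to sketch end to end: run the one-sample-per-class auxiliary set through each local model, take the norm of the embedding Gram matrix, drop the clients with the largest norms, and average the rest. The embedding extractor and the number of dropped clients below are placeholders, not the paper's settings:

```python
import numpy as np

def fedgram_aggregate(local_models, embed_fn, aux_data, n_drop=2):
    """FedGraM-style robust aggregation (sketch).

    local_models: list of parameter dicts {name: np.ndarray}
    embed_fn:     (model, aux_data) -> (n_classes, d) embedding matrix
    aux_data:     one held-out sample per class, kept on the server
    """
    norms = []
    for model in local_models:
        E = embed_fn(model, aux_data)        # one embedding row per class
        gram = E @ E.T                       # (n_classes, n_classes) Gram matrix
        norms.append(np.linalg.norm(gram))   # inter-class separation indicator
    keep = np.argsort(norms)[: len(local_models) - n_drop]  # drop largest norms
    merged = {}
    for name in local_models[0]:
        merged[name] = np.mean([local_models[i][name] for i in keep], axis=0)
    return merged

# Toy run: the "embedding" is just the model's single weight matrix.
models = [{"w": np.random.default_rng(i).normal(size=(3, 4))} for i in range(5)]
models[0]["w"] *= 10  # an outlier update that inflates its Gram norm
agg = fedgram_aggregate(models, embed_fn=lambda m, _: m["w"], aux_data=None)
print(agg["w"].shape)  # (3, 4), averaged over the kept clients only
```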

[AI-62] Disentangled Multi-span Evolutionary Network against Temporal Knowledge Graph Reasoning ACL2025

Quick Read: This paper addresses two key issues in temporal knowledge graph (TKG) reasoning: existing methods (1) neglect the internal structural interactions between subgraphs when modeling subgraph semantic evolution, even though these are crucial for encoding TKGs, and (2) fail to distinguish smooth features that do not cause semantic changes from the semantic evolution process itself. The key to the solution is the Disentangled Multi-span Evolutionary Network (DiMNet), which designs a multi-span evolution strategy that captures local neighbor features while perceiving historical neighbor semantics, enabling internal interactions between subgraphs during evolution, and introduces a disentangle component that adaptively separates nodes' active and stable features to dynamically control the influence of historical semantics on future evolution.

Link: https://arxiv.org/abs/2505.14020
Authors: Hao Dong, Ziyue Qiao, Zhiyuan Ning, Qi Hao, Yi Du, Pengyang Wang, Yuanchun Zhou
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to ACL 2025 Findings

Click to view abstract

Abstract:Temporal Knowledge Graphs (TKGs), as an extension of static Knowledge Graphs (KGs), incorporate the temporal feature to express the transience of knowledge by describing when facts occur. TKG extrapolation aims to infer possible future facts based on known history, which has garnered significant attention in recent years. Some existing methods treat TKG as a sequence of independent subgraphs to model temporal evolution patterns, demonstrating impressive reasoning performance. However, they still have limitations: 1) In modeling subgraph semantic evolution, they usually neglect the internal structural interactions between subgraphs, which are actually crucial for encoding TKGs. 2) They overlook the potential smooth features that do not lead to semantic changes, which should be distinguished from the semantic evolution process. Therefore, we propose a novel Disentangled Multi-span Evolutionary Network (DiMNet) for TKG reasoning. Specifically, we design a multi-span evolution strategy that captures local neighbor features while perceiving historical neighbor semantic information, thus enabling internal interactions between subgraphs during the evolution process. To maximize the capture of semantic change patterns, we design a disentangle component that adaptively separates nodes’ active and stable features, used to dynamically control the influence of historical semantics on future evolution. Extensive experiments conducted on four real-world TKG datasets show that DiMNet demonstrates substantial performance in TKG reasoning, and outperforms the state-of-the-art up to 22.7% in MRR.

[AI-63] owards Comprehensive and Prerequisite-Free Explainer for Graph Neural Networks IJCAI2025

Quick Read: This paper addresses two major limitations of explainability methods for graph neural networks (XGNN): they (a) fail to capture the complete decision logic of GNNs across diverse distributions in the entire dataset's sample space, and (b) impose strict prerequisites on edge properties and GNN internal accessibility. The key to the solution is OPEN, the first comprehensive and prerequisite-free explainer for GNNs, which infers and partitions the entire dataset's sample space into multiple environments, each containing graphs that follow a distinct distribution, and learns the GNN's decision logic across distributions by sampling subgraphs from each environment and analyzing their predictions, thereby effectively overcoming the limitations of existing methods.

Link: https://arxiv.org/abs/2505.14005
Authors: Han Zhang, Yan Wang, Guanfeng Liu, Pengfei Ding, Huaxiong Wang, Kwok-Yan Lam
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by IJCAI 2025 AI4Tech Track

Click to view abstract

Abstract:To enhance the reliability and credibility of graph neural networks (GNNs) and improve the transparency of their decision logic, a new field of explainability of GNNs (XGNN) has emerged. However, two major limitations severely degrade the performance and hinder the generalizability of existing XGNN methods: they (a) fail to capture the complete decision logic of GNNs across diverse distributions in the entire dataset’s sample space, and (b) impose strict prerequisites on edge properties and GNN internal accessibility. To address these limitations, we propose OPEN, a novel cOmprehensive and Prerequisite-free Explainer for GNNs. OPEN, as the first work in the literature, can infer and partition the entire dataset’s sample space into multiple environments, each containing graphs that follow a distinct distribution. OPEN further learns the decision logic of GNNs across different distributions by sampling subgraphs from each environment and analyzing their predictions, thus eliminating the need for strict prerequisites. Experimental results demonstrate that OPEN captures nearly complete decision logic of GNNs, outperforms state-of-the-art methods in fidelity while maintaining similar efficiency, and enhances robustness in real-world scenarios.

[AI-64] VeRecycle: Reclaiming Guarantees from Probabilistic Certificates for Stochastic Dynamical Systems after Change IJCAI2025

Quick Read: This paper addresses how to efficiently reclaim probabilistic neural Lyapunov certificates when system dynamics change locally, without a full re-certification procedure. Conventional approaches require complete re-certification when facing unmodeled uncertainties or changes in a local part of the state space, which is particularly costly for neural certificates. The key to the solution is VeRecycle, a framework that formally reclaims guarantees for discrete-time stochastic dynamical systems and efficiently reuses probabilistic certificates when the system dynamics deviate only in a given subset of states.

Link: https://arxiv.org/abs/2505.14001
Authors: Sterre Lutz, Matthijs T.J. Spaan, Anna Lukina
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: accepted to IJCAI 2025

Click to view abstract

Abstract:Autonomous systems operating in the real world encounter a range of uncertainties. Probabilistic neural Lyapunov certification is a powerful approach to proving safety of nonlinear stochastic dynamical systems. When faced with changes beyond the modeled uncertainties, e.g., unidentified obstacles, probabilistic certificates must be transferred to the new system dynamics. However, even when the changes are localized in a known part of the state space, state-of-the-art requires complete re-certification, which is particularly costly for neural certificates. We introduce VeRecycle, the first framework to formally reclaim guarantees for discrete-time stochastic dynamical systems. VeRecycle efficiently reuses probabilistic certificates when the system dynamics deviate only in a given subset of states. We present a general theoretical justification and algorithmic implementation. Our experimental evaluation shows scenarios where VeRecycle both saves significant computational effort and achieves competitive probabilistic guarantees in compositional neural control.

[AI-65] Divide by Question Conquer by Agent : SPLIT-RAG with Question-Driven Graph Partitioning

Quick Read: This paper addresses the efficiency-accuracy trade-off that retrieval-augmented generation (RAG) systems face when scaling to large knowledge graphs. Existing approaches often rely on monolithic graph retrieval, incurring unnecessary latency for simple queries and fragmented reasoning for complex multi-hop questions. The key to the proposed SPLIT-RAG framework is question-driven semantic graph partitioning together with collaborative subgraph retrieval: attribute-aware graph segmentation divides the knowledge graph into semantically coherent subgraphs, and a type-specialized knowledge base enables multi-agent RAG, reducing the search space and improving efficiency while preserving retrieval accuracy.

Link: https://arxiv.org/abs/2505.13994
Authors: Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments: 20 pages, 4 figures

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) systems empower large language models (LLMs) with external knowledge, yet struggle with efficiency-accuracy trade-offs when scaling to large knowledge graphs. Existing approaches often rely on monolithic graph retrieval, incurring unnecessary latency for simple queries and fragmented reasoning for complex multi-hop questions. To address these challenges, this paper proposes SPLIT-RAG, a multi-agent RAG framework that addresses these limitations with question-driven semantic graph partitioning and collaborative subgraph retrieval. The innovative framework first creates Semantic Partitioning of Linked Information, then uses the Type-Specialized knowledge base to achieve Multi-Agent RAG. The attribute-aware graph segmentation manages to divide knowledge graphs into semantically coherent subgraphs, ensuring subgraphs align with different query types, while lightweight LLM agents are assigned to partitioned subgraphs, and only relevant partitions are activated during retrieval, thus reducing the search space while enhancing efficiency. Finally, a hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verifications. Extensive experimental validation demonstrates considerable improvements compared to existing approaches.

[AI-66] When LLM s meet open-world graph learning: a new perspective for unlabeled data uncertainty

Quick Read: This paper addresses data uncertainty in open-world text-attributed graph (TAG) learning, particularly the inadequate handling of limited labeling and unknown-class nodes. Existing methods typically rely on isolated semantic or structural approaches for unknown-class rejection and lack effective annotation pipelines. The key to the solution is the Open-world Graph Assistant (OGA), an LLM-based framework whose core components are adaptive label traceability, which combines semantics and topology for unknown-class rejection, and a graph label annotator that enables model updates using newly annotated nodes.

Link: https://arxiv.org/abs/2505.13989
Authors: Yanzhe Wen, Xunkai Li, Qi Zhang, Zhu Lei, Guang Zeng, Rong-Hua Li, Guoren Wang
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recently, large language models (LLMs) have significantly advanced text-attributed graph (TAG) learning. However, existing methods inadequately handle data uncertainty in open-world scenarios, especially concerning limited labeling and unknown-class nodes. Prior solutions typically rely on isolated semantic or structural approaches for unknown-class rejection, lacking effective annotation pipelines. To address these limitations, we propose Open-world Graph Assistant (OGA), an LLM-based framework that combines adaptive label traceability, which integrates semantics and topology for unknown-class rejection, and a graph label annotator to enable model updates using newly annotated nodes. Comprehensive experiments demonstrate OGA’s effectiveness and practicality.

[AI-67] Solving Normalized Cut Problem with Constrained Action Space

Quick Read: This paper addresses the challenge of integrating external knowledge into combinatorial optimization so that solutions are steered toward domain-appropriate outcomes. The key to the solution is a reinforcement learning (RL) method with constrained action spaces that guides the normalized cut problem toward predefined template instances; using transportation networks as an example domain, the resulting graph partitions, shaped as wedges and rings, are likely to be closer to natural optimal partitions in practical scenarios.

Link: https://arxiv.org/abs/2505.13986
Authors: Qize Jiang, Linsey Pang, Alice Gatti, Mahima Aggarwa, Giovanna Vantin, Xiaosong Ma, Weiwei Sun, Sanjay Chawla
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Reinforcement Learning (RL) has emerged as an important paradigm to solve combinatorial optimization problems primarily due to its ability to learn heuristics that can generalize across problem instances. However, integrating external knowledge that will steer combinatorial optimization problem solutions towards domain appropriate outcomes remains an extremely challenging task. In this paper, we propose the first RL solution that uses constrained action spaces to guide the normalized cut problem towards pre-defined template instances. Using transportation networks as an example domain, we create a Wedge and Ring Transformer that results in graph partitions that are shaped in form of Wedges and Rings and which are likely to be closer to natural optimal partitions. However, our approach is general as it is based on principles that can be generalized to other domains.
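For reference, the standard two-way normalized cut objective being optimized (the paper's RL agent constrains which partitioning actions are admissible, e.g., toward wedge- and ring-shaped templates, rather than changing this objective):

```latex
% Two-way normalized cut (Shi & Malik) for a partition (A, B) of vertex set V:
\mathrm{Ncut}(A, B) \;=\; \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)}
                     \;+\; \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)},
% with edge weights w(u, v):
\mathrm{cut}(A, B) = \sum_{u \in A,\ v \in B} w(u, v), \qquad
\mathrm{assoc}(A, V) = \sum_{u \in A,\ t \in V} w(u, t).
```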

[AI-68] he Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition INTERSPEECH2025

Quick Read: This paper targets the complex acoustic conditions that speech applications face in meeting scenarios, improving speech recognition and speaker diarization accuracy through multi-modal, multi-device meeting transcription. The key to the solution is fusing the video modality with audio, with tasks covering Audio-Visual Speaker Diarization (AVSD), Audio-Visual Speech Recognition (AVSR), and Audio-Visual Diarization and Recognition (AVDR); the coordinated use of multi-modal data yields significant improvements in transcription performance.

Link: https://arxiv.org/abs/2505.13971
Authors: Ming Gao, Shilong Wu, Hang Chen, Jun Du, Chin-Hui Lee, Shinji Watanabe, Jingdong Chen, Siniscalchi Sabato Marco, Odette Scharenborg
Institution: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by Interspeech 2025. Camera-ready version

Click to view abstract

Abstract:Meetings are a valuable yet challenging scenario for speech applications due to complex acoustic conditions. This paper summarizes the outcomes of the MISP 2025 Challenge, hosted at Interspeech 2025, which focuses on multi-modal, multi-device meeting transcription by incorporating video modality alongside audio. The tasks include Audio-Visual Speaker Diarization (AVSD), Audio-Visual Speech Recognition (AVSR), and Audio-Visual Diarization and Recognition (AVDR). We present the challenge’s objectives, tasks, dataset, baseline systems, and solutions proposed by participants. The best-performing systems achieved significant improvements over the baseline: the top AVSD model achieved a Diarization Error Rate (DER) of 8.09%, improving by 7.43%; the top AVSR system achieved a Character Error Rate (CER) of 9.48%, improving by 10.62%; and the best AVDR system achieved a concatenated minimum-permutation Character Error Rate (cpCER) of 11.56%, improving by 72.49%.

[AI-69] Hypothesis on the Functional Advantages of the Selection-Broadcast Cycle Structure: Global Workspace Theory and Dealing with a Real-Time World

Quick Read: This paper examines how artificial intelligence and robotic systems can achieve efficient cognitive processing and adaptive decision-making in dynamic, real-time scenarios. The key to the solution is the Selection-Broadcast Cycle structure inspired by Global Workspace Theory (GWT), which, unlike prior studies that examined the Selection and Broadcast processes independently, integrates them into one cyclic mechanism that benefits real-time cognitive systems. The paper identifies three primary benefits, Dynamic Thinking Adaptation, Experience-Based Adaptation, and Immediate Real-Time Adaptation, pointing to new directions for robust, general-purpose AI and robotic systems capable of handling complex tasks.

Link: https://arxiv.org/abs/2505.13969
Authors: Junya Nakanishi, Jun Baba, Yuichiro Yoshikawa, Hiroko Kamide, Hiroshi Ishiguro
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper discusses the functional advantages of the Selection-Broadcast Cycle structure proposed by Global Workspace Theory (GWT), inspired by human consciousness, particularly focusing on its applicability to artificial intelligence and robotics in dynamic, real-time scenarios. While previous studies often examined the Selection and Broadcast processes independently, this research emphasizes their combined cyclic structure and the resulting benefits for real-time cognitive systems. Specifically, the paper identifies three primary benefits: Dynamic Thinking Adaptation, Experience-Based Adaptation, and Immediate Real-Time Adaptation. This work highlights GWT’s potential as a cognitive architecture suitable for sophisticated decision-making and adaptive performance in unsupervised, dynamic environments. It suggests new directions for the development and implementation of robust, general-purpose AI and robotics systems capable of managing complex, real-world tasks.
zh

[AI-70] Visual Instruction Bottleneck Tuning

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在分布偏移(distribution shifts)下遇到不熟悉查询时性能退化的问题。现有方法通常依赖于更多指令数据或更大规模的模型架构,但这些方法均需付出较高的劳动力或计算成本。该研究提出了一种从表征学习角度增强MLLM鲁棒性的替代方案,其关键在于基于信息瓶颈(Information Bottleneck, IB)原理推导出MLLM的变分下界,并设计了视觉指令瓶颈调优(Visual Instruction Bottleneck Tuning, Vittle)。通过理论分析与实验证明,Vittle能够通过学习最小充分表征来提升MLLM在分布偏移下的鲁棒性。

链接: https://arxiv.org/abs/2505.13946
作者: Changdae Oh,Jiatong Li,Shawn Im,Yixuan Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite widespread adoption, multimodal large language models (MLLMs) suffer performance degradation when encountering unfamiliar queries under distribution shifts. Existing methods to improve MLLM generalization typically require either more instruction data or larger advanced model architectures, both of which incur non-trivial human labor or computational costs. In this work, we take an alternative approach to enhance the robustness of MLLMs under distribution shifts, from a representation learning perspective. Inspired by the information bottleneck (IB) principle, we derive a variational lower bound of the IB for MLLMs and devise a practical implementation, Visual Instruction Bottleneck Tuning (Vittle). We then provide a theoretical justification of Vittle by revealing its connection to an information-theoretic robustness metric of MLLM. Empirical validation of three MLLMs on open-ended and closed-form question answering and object hallucination detection tasks over 45 datasets, including 30 shift scenarios, demonstrates that Vittle consistently improves the MLLM’s robustness under shifts by pursuing the learning of a minimal sufficient representation.
zh

[AI-71] DrugPilot: LLM-based Parameterized Reasoning Agent for Drug Discovery

【速读】:该论文旨在解决药物发现领域中大型语言模型(LLMs)在处理多模态和异构数据、动态更新领域知识以及对复杂计算任务预测结果信心不足等方面的挑战。其解决方案的关键在于提出DrugPilot,一个基于参数化推理的药物发现LLM代理,通过参数化推理架构克服传统端到端LLM预测方法的局限性,并支持药物发现流程中的主要阶段,实现多阶段研究任务的自动化规划与执行。

链接: https://arxiv.org/abs/2505.13940
作者: Kun Li,Zhennan Wu,Shoupeng Wang,Wenbin Hu
机构: 未知
类目: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注: 22 pages, 10 figures, 5 tables

点击查看摘要

Abstract:In the field of AI4Science, large-scale language models (LLMs) show great potential to parse complex scientific semantics, integrate cross-disciplinary knowledge, and assist critical task research. However, in the field of drug discovery, despite the optimization through professional data pre-training, context window expansion, and internet search, the existing LLMs are still facing challenges such as massive multi-modal and heterogeneous data processing, domain knowledge dynamic updating delay, and insufficient confidence in predicting the results of complex computational tasks. To address these challenges, we propose the DrugPilot, an LLM-based agent with parameterized reasoning for drug discovery. DrugPilot addresses key limitations of traditional end-to-end LLM prediction approaches through its parametric inference architecture. This agent system supports major phases of the drug discovery pipeline, facilitating automated planning and execution of multi-stage research tasks. To address the critical challenge of multi-modal drug data analysis (incorporating both public datasets and user-submitted data), we developed an interactive parameterized memory pool. This innovative component standardizes real-world drug data into parametric representations, simultaneously enabling efficient knowledge retrieval in multi-turn dialogue while mitigating the information loss inherent in text-based data transmission. Additionally, we created a drug instruct dataset across 8 essential drug discovery tasks for model fine-tuning and evaluation. Based on the Berkeley function calling evaluation framework, DrugPilot demonstrated the most advanced tool calling capabilities on our drug discovery tool instruction dataset, outperforming existing agents (e.g., ReAct, LoT). Specifically, it achieves task completion rates of 98.0%, 93.5%, and 64.0% on simple, multiple, and multi-turn tasks, respectively.
zh

[AI-72] CLEVER: A Curated Benchmark for Formally Verified Code Generation

【速读】:该论文试图解决端到端可验证代码生成的问题,特别是在Lean环境中生成符合特定规范的代码。其解决方案的关键在于构建了一个高质量、经过筛选的基准测试集 CLEVER,包含161个问题,每个问题包括生成与保留真实规范相匹配的规范以及生成可证明满足该规范的Lean实现。该基准测试避免了测试用例监督、大语言模型生成的注释以及可能泄露实现逻辑或允许空洞解的规范,所有输出均通过Lean类型检查器进行后期验证以确保机器可检查的正确性。

链接: https://arxiv.org/abs/2505.13938
作者: Amitayush Thakur,Jasper Lee,George Tsoukalas,Meghana Sistla,Matthew Zhao,Stefan Zetzche,Greg Durrett,Yisong Yue,Swarat Chaudhuri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean’s type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub(this https URL) as well as HuggingFace(this https URL). All our evaluation code is also available online(this https URL).
zh

[AI-73] RLVR-World: Training World Models with Reinforcement Learning

【速读】:该论文试图解决世界模型(world models)在训练过程中标准目标函数(如最大似然估计)与任务特定目标(如状态转移预测的准确性或感知质量)不一致的问题。解决方案的关键在于提出RLVR-World框架,该框架利用可验证奖励的强化学习(reinforcement learning with verifiable rewards, RLVR),直接优化世界模型以提升这些任务相关指标。通过将世界建模建模为分词序列的自回归预测,并对解码后的预测结果评估其性能指标作为可验证奖励,RLVR-World在文本游戏、网页导航和机器人操作等多个领域均表现出显著的性能提升。

链接: https://arxiv.org/abs/2505.13934
作者: Jialong Wu,Shaofeng Yin,Ningya Feng,Mingsheng Long
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Code is available at project website: this https URL

点击查看摘要

Abstract:World models predict state transitions in response to actions and are increasingly developed across diverse modalities. However, standard training objectives such as maximum likelihood estimation (MLE) often misalign with task-specific goals of world models, i.e., transition prediction metrics like accuracy or perceptual quality. In this paper, we present RLVR-World, a unified framework that leverages reinforcement learning with verifiable rewards (RLVR) to directly optimize world models for such metrics. Despite formulating world modeling as autoregressive prediction of tokenized sequences, RLVR-World evaluates metrics of decoded predictions as verifiable rewards. We demonstrate substantial performance gains on both language- and video-based world models across domains, including text games, web navigation, and robot manipulation. Our work indicates that, beyond recent advances in reasoning language models, RLVR offers a promising post-training paradigm for enhancing the utility of generative models more broadly.
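下面用一个假设性 Python 草图说明 RLVR 的核心思想:对解码后的预测计算可验证指标(此处以 token 级准确率为例)作为奖励,再以 REINFORCE 风格加权采样序列的对数似然。函数命名、基线取值等均为示意,并非论文官方实现:

```python
import torch
import torch.nn.functional as F

def verifiable_reward(pred_tokens, true_tokens):
    """以 token 级准确率作为转移预测的可验证指标(也可换成感知质量等任务指标)."""
    return (pred_tokens == true_tokens).float().mean()

def rlvr_loss(logits, sampled_tokens, reward, baseline=0.5):
    """REINFORCE 风格: 用 (reward - baseline) 加权采样序列的对数似然."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    return -(reward - baseline) * token_logp.sum()

logits = torch.randn(8, 100, requires_grad=True)   # 8 个 token, 词表大小 100
sampled = torch.randint(0, 100, (8,))              # 世界模型采样出的下一状态 token
truth = torch.randint(0, 100, (8,))                # 真实下一状态 token
loss = rlvr_loss(logits, sampled, verifiable_reward(sampled, truth))
loss.backward()
```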
zh

[AI-74] APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在物理交互建模方面的局限性,即其在动态物体交互建模和任务泛化能力上的不足。现有方法通过视觉-语言模型(Vision-Language Models, VLMs)进行感知或通过强化学习(Reinforcement Learning, RL)实现自适应决策,但这些方法无法有效捕捉动态交互或需要任务特定训练,从而限制了其在现实世界中的应用。该论文提出的解决方案是APEX(Anticipatory Physics-Enhanced Execution),其关键在于通过构建结构化图来识别和建模环境中的关键动态交互,并提供低延迟的物理可行动作前向模拟,使LLMs能够基于预测结果而非静态观察选择最优策略。

链接: https://arxiv.org/abs/2505.13921
作者: Wanjing Huang,Weixiang Yan,Zhen Zhang,Ambuj Singh
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate strong reasoning and task planning capabilities but remain fundamentally limited in physical interaction modeling. Existing approaches integrate perception via Vision-Language Models (VLMs) or adaptive decision-making through Reinforcement Learning (RL), but they fail to capture dynamic object interactions or require task-specific training, limiting their real-world applicability. We introduce APEX (Anticipatory Physics-Enhanced Execution), a framework that equips LLMs with physics-driven foresight for real-time task planning. APEX constructs structured graphs to identify and model the most relevant dynamic interactions in the environment, providing LLMs with explicit physical state updates. Simultaneously, APEX provides low-latency forward simulations of physically feasible actions, allowing LLMs to select optimal strategies based on predictive outcomes rather than static observations. We evaluate APEX on three benchmarks designed to assess perception, prediction, and decision-making: (1) Physics Reasoning Benchmark, testing causal inference and object motion prediction; (2) Tetris, evaluating whether physics-informed prediction enhances decision-making performance in long-horizon planning tasks; (3) Dynamic Obstacle Avoidance, assessing the immediate integration of perception and action feasibility analysis. APEX significantly outperforms standard LLMs and VLM-based models, demonstrating the necessity of explicit physics reasoning for bridging the gap between language-based intelligence and real-world task execution. The source code and experiment setup are publicly available at this https URL .
zh

[AI-75] Parallel Belief Revision via Order Aggregation

【速读】:该论文试图解决如何将单步并行修正(parallel revision)模型扩展到迭代情况(iterated case)的问题。其解决方案的关键在于利用最近关于迭代并行收缩(iterated parallel contraction)的研究,提出一种基于TeamQueue聚合器家族的通用方法,以系统性地将串行迭代信念修正算子扩展为处理并行变化的机制,从而恢复文献中独立合理的性质,同时避免产生更为可疑的结果。

链接: https://arxiv.org/abs/2505.13914
作者: Jake Chandler,Richard Booth
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite efforts to better understand the constraints that operate on single-step parallel (aka “package”, “multiple”) revision, very little work has been carried out on how to extend the model to the iterated case. A recent paper by Delgrande and Jin outlines a range of relevant rationality postulates. While many of these are plausible, they lack an underlying unifying explanation. We draw on recent work on iterated parallel contraction to offer a general method for extending serial iterated belief revision operators to handle parallel change. This method, based on a family of order aggregators known as TeamQueue aggregators, provides a principled way to recover the independently plausible properties that can be found in the literature, without yielding the more dubious ones.
zh

[AI-76] Learning to Insert for Constructive Neural Vehicle Routing Solver

【速读】:该论文旨在解决传统神经组合优化(Neural Combinatorial Optimisation, NCO)方法在求解车辆路径问题(Vehicle Routing Problem, VRP)时因采用固定的拼接式范式导致的次优解问题。其解决方案的关键在于提出一种基于插入式范式的新型学习方法——L2C-Insert,通过在当前部分解中的任意有效位置战略性地插入未访问节点,从而显著提升解的灵活性和质量。该方法引入了三个核心组件:用于精确预测插入位置的模型架构、高效的训练方案以及充分利用插入范式灵活性的先进推理技术。

链接: https://arxiv.org/abs/2505.13904
作者: Fu Luo,Xi Lin,Mengyuan Zhong,Fei Liu,Zhenkun Wang,Jianyong Sun,Qingfu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Neural Combinatorial Optimisation (NCO) is a promising learning-based approach for solving Vehicle Routing Problems (VRPs) without extensive manual design. While existing constructive NCO methods typically follow an appending-based paradigm that sequentially adds unvisited nodes to partial solutions, this rigid approach often leads to suboptimal results. To overcome this limitation, we explore the idea of insertion-based paradigm and propose Learning to Construct with Insertion-based Paradigm (L2C-Insert), a novel learning-based method for constructive NCO. Unlike traditional approaches, L2C-Insert builds solutions by strategically inserting unvisited nodes at any valid position in the current partial solution, which can significantly enhance the flexibility and solution quality. The proposed framework introduces three key components: a novel model architecture for precise insertion position prediction, an efficient training scheme for model optimization, and an advanced inference technique that fully exploits the insertion paradigm’s flexibility. Extensive experiments on both synthetic and real-world instances of the Travelling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) demonstrate that L2C-Insert consistently achieves superior performance across various problem sizes.
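下面以 TSP 为例,给出插入式构造范式的最小示意。论文中插入位置由学习到的模型预测,此处用经典的"最小插入代价"打分代替,仅用于说明"在部分解的任意合法位置插入未访问节点"这一范式(假设性草图):

```python
import numpy as np

def insert_best(tour, node, dist):
    """枚举所有插入位置, 选择回路长度增量最小者; 论文中该打分由学习到的模型给出."""
    best_pos, best_delta = None, float("inf")
    for i in range(len(tour)):
        a, b = tour[i], tour[(i + 1) % len(tour)]
        delta = dist[a, node] + dist[node, b] - dist[a, b]
        if delta < best_delta:
            best_pos, best_delta = i + 1, delta
    return tour[:best_pos] + [node] + tour[best_pos:]

rng = np.random.default_rng(0)
pts = rng.random((6, 2))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
tour = [0, 1]                          # 初始部分解
for v in [2, 3, 4, 5]:                 # 未访问节点可被插到任意合法位置
    tour = insert_best(tour, v, dist)
print(tour)
```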
zh

[AI-77] Do Language Models Use Their Depth Efficiently?

【速读】:该论文试图解决深度语言模型(Large Language Models, LLMs)是否有效利用其深度来执行更复杂的计算,还是仅仅将相同的计算分散到更多层中的问题。其解决方案的关键在于分析Llama 3.1和Qwen 3系列模型的残差流(residual stream),通过比较子层输出与残差流、跳过第二半部分层的影响、多跳任务中的表现以及训练浅层模型到深层模型的线性映射,发现深度增加并未带来新的计算类型,而是对残差进行更细粒度的调整。

链接: https://arxiv.org/abs/2505.13898
作者: Róbert Csordás,Christopher D. Manning,Christopher Potts
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1 and Qwen 3 family of models. We find: First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults in examples involving many hops. Fourth, we seek to directly address whether deeper models are using their additional layers to perform new kinds of computation. To do this, we train linear maps from the residual stream of a shallow model to a deeper one. We find that layers with the same relative depth map best to each other, suggesting that the larger model simply spreads the same computations out over its many layers. All this evidence suggests that deeper models are not using their depth to learn new kinds of computation, but only using the greater depth to perform more fine-grained adjustments to the residual. This may help explain why increasing scale leads to diminishing returns for stacked Transformer architectures.
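下面的 Python 草图示意文中"训练浅层模型残差流到深层模型残差流的线性映射"这一分析方法:对每一对层用最小二乘拟合并比较 R^2。此处用随机数据演示流程,激活维度等均为虚构,并非论文实验代码:

```python
import numpy as np

def linear_map_r2(X, Y):
    """X: (N, d) 浅层模型某层残差流, Y: (N, d) 深层模型某层残差流; 返回线性拟合的 R^2."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ W
    return 1.0 - resid.var() / Y.var()

rng = np.random.default_rng(0)
shallow = [rng.normal(size=(256, 64)) for _ in range(4)]   # 4 层"浅"模型的激活
deep = [rng.normal(size=(256, 64)) for _ in range(8)]      # 8 层"深"模型的激活
scores = np.array([[linear_map_r2(s, d) for d in deep] for s in shallow])
# 若相同"相对深度"的层对拟合得最好, 则支持"深模型只是把同类计算摊开到更多层"的结论
print(scores.round(2))
```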
zh

[AI-78] Utilizing Strategic Pre-training to Reduce Overfitting: Baguan – A Pre-trained Weather Forecasting Model KDD2025

【速读】:该论文旨在解决天气预报中由于真实世界气象数据有限导致的过拟合问题。与计算机视觉或自然语言处理等数据丰富的领域不同,天气预报需要在数据稀缺的情况下提升模型性能。论文提出的解决方案关键在于采用预训练方法,通过选择适当挑战性的预训练任务引入局部性偏差,从而有效缓解过拟合并提升模型性能。研究进一步提出了Baguan模型,该模型基于自监督预训练的Siamese Autoencoder,并针对不同预测时效进行微调,实验证明其在中短期天气预报及其他下游任务中表现出色。

链接: https://arxiv.org/abs/2505.13873
作者: Peisong Niu,Ziqing Ma,Tian Zhou,Weiqi Chen,Lefei Shen,Rong Jin,Liang Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: KDD2025 research track accepted

点击查看摘要

Abstract:Weather forecasting has long posed a significant challenge for humanity. While recent AI-based models have surpassed traditional numerical weather prediction (NWP) methods in global forecasting tasks, overfitting remains a critical issue due to the limited availability of real-world weather data spanning only a few decades. Unlike fields like computer vision or natural language processing, where data abundance can mitigate overfitting, weather forecasting demands innovative strategies to address this challenge with existing data. In this paper, we explore pre-training methods for weather forecasting, finding that selecting an appropriately challenging pre-training task introduces locality bias, effectively mitigating overfitting and enhancing performance. We introduce Baguan, a novel data-driven model for medium-range weather forecasting, built on a Siamese Autoencoder pre-trained in a self-supervised manner and fine-tuned for different lead times. Experimental results show that Baguan outperforms traditional methods, delivering more accurate forecasts. Additionally, the pre-trained Baguan demonstrates robust overfitting control and excels in downstream tasks, such as subseasonal-to-seasonal (S2S) modeling and regional forecasting, after fine-tuning.
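下面给出孪生自编码器自监督预训练目标的一个假设性草图:两个随机遮蔽视图共享同一编码器,以重建损失加表征一致性作为预训练信号。视图构造方式、网络结构与损失权重均为示意,实际 Baguan 的细节以论文为准:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
decoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 128))

def siamese_ae_loss(x):
    """两个随机遮蔽视图共享同一编码器; 以重建原始场 + 表征一致性作为预训练损失."""
    v1 = x * (torch.rand_like(x) > 0.3).float()   # 视图 1: 随机遮蔽部分格点
    v2 = x * (torch.rand_like(x) > 0.3).float()   # 视图 2
    z1, z2 = encoder(v1), encoder(v2)
    recon = F.mse_loss(decoder(z1), x)            # 重建目标
    consist = F.mse_loss(z1, z2)                  # 孪生分支表征对齐
    return recon + 0.1 * consist

x = torch.randn(16, 128)                          # 一批展平的气象场(维度为示意)
siamese_ae_loss(x).backward()
```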
zh

[AI-79] Safety2Drive: Safety-Critical Scenario Benchmark for the Evaluation of Autonomous Driving

【速读】:该论文旨在解决自动驾驶(Autonomous Driving, AD)系统在安全验证和实际部署中面临的评估不足问题,特别是现有数据集缺乏符合监管要求的闭环测试场景以及真实世界AD事故的代表性不足。其解决方案的关键在于提出Safety2Drive,一个用于评估AD系统的安全关键场景库,该库具备三个核心贡献:全面覆盖标准法规要求的测试项、支持安全关键场景的泛化能力(包括自然环境干扰和跨传感器的对抗攻击注入)、以及多维度评估能力(涵盖AD系统及各类感知任务)。Safety2Drive提供了一种从场景构建到验证的范式,建立了标准化的测试框架以支持AD的安全部署。

链接: https://arxiv.org/abs/2505.13872
作者: Jingzheng Li,Tiancheng Wang,Xingyu Peng,Jiacheng Chen,Zhijun Chen,Bing Li,Xianglong Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous Driving (AD) systems demand high levels of safety assurance. Despite significant advancements in AD demonstrated on open-source benchmarks like Longest6 and Bench2Drive, existing datasets still lack regulatory-compliant scenario libraries for closed-loop testing to comprehensively evaluate the functional safety of AD. Meanwhile, real-world AD accidents are underrepresented in current driving datasets. This scarcity leads to inadequate evaluation of AD performance, posing risks to safety validation and practical deployment. To address these challenges, we propose Safety2Drive, a safety-critical scenario library designed to evaluate AD systems. Safety2Drive offers three key contributions. (1) Safety2Drive comprehensively covers the test items required by standard regulations and contains 70 AD function test items. (2) Safety2Drive supports safety-critical scenario generalization. It has the ability to inject safety threats such as natural environment corruptions and adversarial attacks across camera and LiDAR sensors. (3) Safety2Drive supports multi-dimensional evaluation. In addition to the evaluation of AD systems, it also supports the evaluation of various perception tasks, such as object detection and lane detection. Safety2Drive provides a paradigm from scenario construction to validation, establishing a standardized test framework for the safe deployment of AD.
zh

[AI-80] Learning Spatio-Temporal Dynamics for Trajectory Recovery via Time-Aware Transformer

【速读】:该论文旨在解决GPS轨迹在实际应用中因采样率低而导致的轨迹点间隔大且不规则的问题,从而提升基于GPS系统的轨迹采样率。其解决方案的关键在于提出了一种名为TedTrajRec的新方法,该方法通过将轨迹数据的时空动态分为空间-时间交通动态和轨迹动态两方面进行建模。具体而言,引入了PD-GNN来捕捉道路段的周期性模式并学习拓扑感知动态,同时提出了TedFormer,一种结合闭式神经微分方程的时间感知Transformer,以有效处理不规则采样的轨迹数据。

链接: https://arxiv.org/abs/2505.13857
作者: Tian Sun,Yuqi Chen,Baihua Zheng,Weiwei Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted as a journal paper in IEEE Transactions on Intelligent Transportation Systems (T-ITS)

点击查看摘要

Abstract:In real-world applications, GPS trajectories often suffer from low sampling rates, with large and irregular intervals between consecutive GPS points. This sparse characteristic presents challenges for their direct use in GPS-based systems. This paper addresses the task of map-constrained trajectory recovery, aiming to enhance trajectory sampling rates of GPS trajectories. Previous studies commonly adopt a sequence-to-sequence framework, where an encoder captures the trajectory patterns and a decoder reconstructs the target trajectory. Within this framework, effectively representing the road network and extracting relevant trajectory features are crucial for overall performance. Despite advancements in these models, they fail to fully leverage the complex spatio-temporal dynamics present in both the trajectory and the road network. To overcome these limitations, we categorize the spatio-temporal dynamics of trajectory data into two distinct aspects: spatial-temporal traffic dynamics and trajectory dynamics. Furthermore, we propose TedTrajRec, a novel method for trajectory recovery. To capture spatio-temporal traffic dynamics, we introduce PD-GNN, which models periodic patterns and learns topologically aware dynamics concurrently for each road segment. For spatio-temporal trajectory dynamics, we present TedFormer, a time-aware Transformer that incorporates temporal dynamics for each GPS location by integrating closed-form neural ordinary differential equations into the attention mechanism. This allows TedFormer to effectively handle irregularly sampled data. Extensive experiments on three real-world datasets demonstrate the superior performance of TedTrajRec. The code is publicly available at this https URL.
zh

[AI-81] A Challenge to Build Neuro-Symbolic Video Agents

【速读】:该论文试图解决现代视频理解系统在处理复杂时间序列事件时的局限性,特别是在长期依赖关系和事件顺序理解方面的不足,这限制了系统在真实世界应用中进行主动推理和决策的能力。解决方案的关键在于引入神经符号(neuro-symbolic)方法,通过将视频查询分解为原子事件、构建连贯的时间序列,并与时间约束进行验证,从而提升系统的可解释性、结构化推理能力和行为可靠性。

链接: https://arxiv.org/abs/2505.13851
作者: Sahil Shah,Harsh Goel,Sai Shankar Narasimhan,Minkyu Choi,S P Sharan,Oguzhan Akcin,Sandeep Chinchali
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern video understanding systems excel at tasks such as scene classification, object detection, and short video retrieval. However, as video analysis becomes increasingly central to real-world applications, there is a growing need for proactive video agents: systems that not only interpret video streams but also reason about events and take informed actions. A key obstacle in this direction is temporal reasoning: while deep learning models have made remarkable progress in recognizing patterns within individual frames or short clips, they struggle to understand the sequencing and dependencies of events over time, which is critical for action-driven decision-making. Addressing this limitation demands moving beyond conventional deep learning approaches. We posit that tackling this challenge requires a neuro-symbolic perspective, where video queries are decomposed into atomic events, structured into coherent sequences, and validated against temporal constraints. Such an approach can enhance interpretability, enable structured reasoning, and provide stronger guarantees on system behavior, all key properties for advancing trustworthy video agents. To this end, we present a grand challenge to the research community: developing the next generation of intelligent video agents that integrate three core capabilities: (1) autonomous video search and analysis, (2) seamless real-world interaction, and (3) advanced content generation. By addressing these pillars, we can transition from passive perception to intelligent video agents that reason, predict, and act, pushing the boundaries of video understanding.
zh

[AI-82] Enhancing Robot Navigation Policies with Task-Specific Uncertainty Managements

【速读】:该论文试图解决机器人在复杂环境中导航时面临的不确定性管理问题,包括传感器噪声、环境变化和信息不完整等因素,这些问题在不同任务和场景下对精度的要求各不相同。解决方案的关键在于提出GUIDE(Generalized Uncertainty Integration for Decision-Making and Execution)框架,该框架通过任务特定的不确定性地图(Task-Specific Uncertainty Maps, TSUMs)将任务需求整合到导航策略中,使机器人能够根据上下文动态调整不确定性管理策略。

链接: https://arxiv.org/abs/2505.13837
作者: Gokul Puthumanaillam,Paulo Padrao,Jose Fuentes,Leonardo Bobadilla,Melkior Ornik
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Robots navigating complex environments must manage uncertainty from sensor noise, environmental changes, and incomplete information, with different tasks requiring varying levels of precision in different areas. For example, precise localization may be crucial near obstacles but less critical in open spaces. We present GUIDE (Generalized Uncertainty Integration for Decision-Making and Execution), a framework that integrates these task-specific requirements into navigation policies via Task-Specific Uncertainty Maps (TSUMs). By assigning acceptable uncertainty levels to different locations, TSUMs enable robots to adapt uncertainty management based on context. When combined with reinforcement learning, GUIDE learns policies that balance task completion and uncertainty management without extensive reward engineering. Real-world tests show significant performance gains over methods lacking task-specific uncertainty awareness.
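下面用一个极简函数示意 TSUM 融入导航奖励的方式:仅当定位不确定性超过该位置的可接受上限时才施加惩罚,使不确定性管理随任务语境变化。函数签名、位置键名与惩罚系数均为假设性示意:

```python
def guide_reward(task_reward, pose_sigma, location, tsum, penalty=2.0):
    """pose_sigma: 当前定位不确定性; tsum[location]: 该位置可接受的不确定性上限.
    只有超出上限才惩罚, 使机器人仅在任务敏感区域收紧不确定性."""
    excess = max(0.0, pose_sigma - tsum[location])
    return task_reward - penalty * excess

tsum = {"near_obstacle": 0.1, "open_space": 0.8}        # 一个极简 TSUM
print(guide_reward(1.0, 0.5, "near_obstacle", tsum))    # 0.2: 超限被罚
print(guide_reward(1.0, 0.5, "open_space", tsum))       # 1.0: 未超限不罚
```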
zh

[AI-83] Toward Real-World Cooperative and Competitive Soccer with Quadrupedal Robot Teams

【速读】:该论文旨在解决腿部机器人在复杂动态环境中实现协调团队协作的问题,特别是在机器人足球这一具有竞争性和多智能体交互的场景下,需要同时具备精细的运动控制和长周期的战略决策能力。解决方案的关键在于提出了一种分层的多智能体强化学习(MARL)框架,其中低层技能通过强化学习训练以实现动态的腿部运动和球类操作,如行走、带球和射门;高层策略则基于多智能体近端策略优化(MAPPO)并通过虚构自我对弈(FSP)进行训练,以适应多样化的对手策略并生成复杂的团队行为,如协同传球、拦截和动态角色分配。

链接: https://arxiv.org/abs/2505.13834
作者: Zhi Su,Yuman Gao,Emily Lukas,Yunfei Li,Jiaze Cai,Faris Tulbah,Fei Gao,Chao Yu,Zhongyu Li,Yi Wu,Koushil Sreenath
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 11 pages, 12 figures

点击查看摘要

Abstract:Achieving coordinated teamwork among legged robots requires both fine-grained locomotion control and long-horizon strategic decision-making. Robot soccer offers a compelling testbed for this challenge, combining dynamic, competitive, and multi-agent interactions. In this work, we present a hierarchical multi-agent reinforcement learning (MARL) framework that enables fully autonomous and decentralized quadruped robot soccer. First, a set of highly dynamic low-level skills is trained for legged locomotion and ball manipulation, such as walking, dribbling, and kicking. On top of these, a high-level strategic planning policy is trained with Multi-Agent Proximal Policy Optimization (MAPPO) via Fictitious Self-Play (FSP). This learning framework allows agents to adapt to diverse opponent strategies and gives rise to sophisticated team behaviors, including coordinated passing, interception, and dynamic role allocation. With an extensive ablation study, the proposed learning method shows significant advantages in the cooperative and competitive multi-agent soccer game. We deploy the learned policies to real quadruped robots relying solely on onboard proprioception and decentralized localization, with the resulting system supporting autonomous robot-robot and robot-human soccer matches on indoor and outdoor soccer courts.
zh

[AI-84] TelePlanNet: An AI-Driven Framework for Efficient Telecom Network Planning

【速读】:该论文试图解决5G网络规划中基站选址的挑战,该问题涉及覆盖效率、成本、用户满意度及实际约束条件的高效优化。传统人工方法依赖人类专家经验,存在效率低下且难以保证规划与建设一致性的问题;现有AI工具虽在某些方面提升了效率,但仍难以满足电信运营商动态网络环境和多目标需求。论文提出的解决方案是TelePlanNet,其关键在于构建了一个三层架构的AI驱动框架,结合大语言模型(Large Language Models, LLMs)实现实时用户输入处理与意图对齐,并通过改进的群体相对策略优化(Improved Group Relative Policy Optimization, GRPO)强化学习训练规划模型,从而有效实现多目标优化与候选站点评估。

链接: https://arxiv.org/abs/2505.13831
作者: Zongyuan Deng,Yujie Cai,Qing Liu,Shiyao Mu,Bin Lyu,Zhen Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures, 1 table, submitted to IEEE ICCC 2025

点击查看摘要

Abstract:The selection of base station sites is a critical challenge in 5G network planning, which requires efficient optimization of coverage, cost, user satisfaction, and practical constraints. Traditional manual methods, reliant on human expertise, suffer from inefficiencies and deliver unsatisfactory planning-construction consistency. Existing AI tools, despite improving efficiency in certain aspects, still struggle to meet the dynamic network conditions and multi-objective needs of telecom operators’ networks. To address these challenges, we propose TelePlanNet, an AI-driven framework tailored for the selection of base station sites, integrating a three-layer architecture for efficient planning and large-scale automation. By leveraging large language models (LLMs) for real-time user input processing and intent alignment with base station planning, combined with training the planning model using the improved group relative policy optimization (GRPO) reinforcement learning, the proposed TelePlanNet can effectively address multi-objective optimization, evaluate candidate sites, and deliver practical solutions. Experiment results show that the proposed TelePlanNet can improve the consistency to 78%, which is superior to manual methods, providing telecom operators with an efficient and scalable tool that significantly advances cellular network planning.
zh

[AI-85] Multimodal RAG-driven Anomaly Detection and Classification in Laser Powder Bed Fusion using Large Language Models

【速读】:该论文旨在解决增材制造(Additive Manufacturing, AM)过程中缺陷和工艺异常检测的挑战,通过自动化方法实现对多种AM工艺的异常识别与分类。其解决方案的关键在于提出了一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的多模态框架,该框架利用文献中的信息(包括图像和描述性文本)而非训练数据进行零样本异常检测,结合文本与图像检索及多模态生成模型,实现了无需额外训练即可适应不同图像和工艺条件的异常分析。

链接: https://arxiv.org/abs/2505.13828
作者: Kiarash Naghavi Khanghah,Zhiling Chen,Lela Romeo,Qian Yang,Rajiv Malhotra,Farhad Imani,Hongyi Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ASME 2025 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference IDETC/CIE2025, August 17-20, 2025, Anaheim, CA (IDETC2025-168615)

点击查看摘要

Abstract:Additive manufacturing enables the fabrication of complex designs while minimizing waste, but faces challenges related to defects and process anomalies. This study presents a novel multimodal Retrieval-Augmented Generation-based framework that automates anomaly detection across various Additive Manufacturing processes leveraging retrieved information from literature, including images and descriptive text, rather than training datasets. This framework integrates text and image retrieval from scientific literature and multimodal generation models to perform zero-shot anomaly identification, classification, and explanation generation in a Laser Powder Bed Fusion setting. The proposed framework is evaluated on four L-PBF manufacturing datasets from Oak Ridge National Laboratory, featuring various printer makes, models, and materials. This evaluation demonstrates the framework’s adaptability and generalizability across diverse images without requiring additional training. Comparative analysis using Qwen2-VL-2B and GPT-4o-mini as the MLLM within the proposed framework highlights that GPT-4o-mini outperforms Qwen2-VL-2B and the proportional random baseline in manufacturing anomaly classification. Additionally, the evaluation of the RAG system confirms that incorporating retrieval mechanisms improves average accuracy by 12% by reducing the risk of hallucination and providing additional information. The proposed framework can be continuously updated by integrating emerging research, allowing seamless adaptation to the evolving landscape of AM technologies. This scalable, automated, and zero-shot-capable framework streamlines AM anomaly analysis, enhancing efficiency and accuracy.
zh

[AI-86] RAG/LLM Augmented Switching Driven Polymorphic Metaheuristic Framework

【速读】:该论文旨在解决传统元启发式算法在处理复杂优化问题时因固定结构和需要大量调参而导致的效率受限问题。其解决方案的关键在于提出了一种自适应的元启发式切换机制,通过实时性能反馈和动态算法选择实现算法的自主调整,核心组件包括Polymorphic Metaheuristic Agent (PMA) 和 Polymorphic Metaheuristic Selection Agent (PMSA),它们能够根据关键性能指标动态选择和切换元启发式算法,从而提升收敛速度、适应性和解的质量。

链接: https://arxiv.org/abs/2505.13808
作者: Faramarz Safi Esfahani,Ghassan Beydoun,Morteza Saberi,Brad McCusker,Biswajeet Pradhan
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Metaheuristic algorithms are widely used for solving complex optimization problems, yet their effectiveness is often constrained by fixed structures and the need for extensive tuning. The Polymorphic Metaheuristic Framework (PMF) addresses this limitation by introducing a self-adaptive metaheuristic switching mechanism driven by real-time performance feedback and dynamic algorithmic selection. PMF leverages the Polymorphic Metaheuristic Agent (PMA) and the Polymorphic Metaheuristic Selection Agent (PMSA) to dynamically select and transition between metaheuristic algorithms based on key performance indicators, ensuring continuous adaptation. This approach enhances convergence speed, adaptability, and solution quality, outperforming traditional metaheuristics in high-dimensional, dynamic, and multimodal environments. Experimental results on benchmark functions demonstrate that PMF significantly improves optimization efficiency by mitigating stagnation and balancing exploration-exploitation strategies across various problem landscapes. By integrating AI-driven decision-making and self-correcting mechanisms, PMF paves the way for scalable, intelligent, and autonomous optimization frameworks, with promising applications in engineering, logistics, and complex decision-making systems.
zh

[AI-87] ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech INTERSPEECH2025

【速读】:该论文旨在解决高保真情感语音转换(EVC)中缺乏灵活且可解释控制的问题。其解决方案的关键在于提出EVC-CLAP模型,该模型通过自然语言提示和类别标签进行情感对比语言-音频预训练,以提取并对齐跨语音和文本模态的细粒度情感元素;同时引入带有自适应强度门的FuEncoder,实现情感特征与来自预训练自动语音识别(ASR)模型的音素后验Grams的无缝融合,并结合基于捕捉特征的流匹配模型来重建源语音的梅尔频谱图,从而提升情感表现力和语音自然度。

链接: https://arxiv.org/abs/2505.13805
作者: Yu Pan,Yanni Hu,Yuguang Yang,Jixun Yao,Jianhao Ye,Hongbin Zhou,Lei Ma,Jianjun Zhao
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted by InterSpeech 2025

点击查看摘要

Abstract:Despite great advances, achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control remains challenging. This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech with adjustable emotion intensity. We first propose EVC-CLAP, an emotional contrastive language-audio pre-training model, guided by natural language prompts and categorical labels, to extract and align fine-grained emotional elements across speech and text modalities. Then, a FuEncoder with an adaptive intensity gate is presented to seamlessly fuse emotional features with Phonetic PosteriorGrams from a pre-trained ASR model. To further improve emotion expressiveness and speech naturalness, we propose a flow matching model conditioned on these captured features to reconstruct the Mel-spectrogram of the source speech. Subjective and objective evaluations validate the effectiveness of ClapFM-EVC.
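下面给出自适应强度门控融合的一个假设性 PyTorch 草图:对情感特征计算逐帧 sigmoid 门并乘以用户可调的 intensity,再与 PPG 特征相加融合。维度、门控形式与 intensity 的注入方式均为示意,并非论文官方实现:

```python
import torch
import torch.nn as nn

class IntensityGateFusion(nn.Module):
    def __init__(self, ppg_dim=256, emo_dim=128):
        super().__init__()
        self.proj = nn.Linear(emo_dim, ppg_dim)   # 情感特征投影到 PPG 空间
        self.gate = nn.Linear(emo_dim, 1)         # 逐帧标量强度门

    def forward(self, ppg, emo, intensity=1.0):
        g = torch.sigmoid(self.gate(emo)) * intensity   # intensity 供用户调节情感强度
        return ppg + g * self.proj(emo)

fusion = IntensityGateFusion()
ppg = torch.randn(1, 100, 256)    # (batch, frames, ppg_dim), 来自预训练 ASR
emo = torch.randn(1, 100, 128)    # EVC-CLAP 风格的情感特征(维度为示意)
print(fusion(ppg, emo, intensity=0.5).shape)   # torch.Size([1, 100, 256])
```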
zh

[AI-88] LLM-based Evaluation Policy Extraction for Ecological Modeling

【速读】:该论文试图解决传统数值指标在评估生态时间序列时无法捕捉特定领域时间模式的问题,这些问题对于生态过程的准确建模至关重要。传统方法如R平方和均方根误差虽然被广泛用于量化模型与观测生态系统变量之间的相似性,但它们往往无法有效反映生态过程中的关键时间特征,因此通常需要专家视觉检查,这不仅耗费大量人力,也限制了大规模评估的应用。该论文提出的解决方案的关键在于将度量学习与基于大语言模型(Large Language Model, LLM)的自然语言策略提取相结合,以生成可解释的评估标准。该方法通过处理成对标注并实施策略优化机制,生成并组合不同的评估指标,从而有效捕捉目标评估偏好,包括合成生成和专家标注的模型比较。

链接: https://arxiv.org/abs/2505.13794
作者: Qi Cheng,Licheng Liu,Qing Zhu,Runlong Yu,Zhenong Jin,Yiqun Xie,Xiaowei Jia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating ecological time series is critical for benchmarking model performance in many important applications, including predicting greenhouse gas fluxes, capturing carbon-nitrogen dynamics, and monitoring hydrological cycles. Traditional numerical metrics (e.g., R-squared, root mean square error) have been widely used to quantify the similarity between modeled and observed ecosystem variables, but they often fail to capture domain-specific temporal patterns critical to ecological processes. As a result, these methods are often accompanied by expert visual inspection, which requires substantial human labor and limits the applicability to large-scale evaluation. To address these challenges, we propose a novel framework that integrates metric learning with large language model (LLM)-based natural language policy extraction to develop interpretable evaluation criteria. The proposed method processes pairwise annotations and implements a policy optimization mechanism to generate and combine different assessment metrics. The results obtained on multiple datasets for evaluating the predictions of crop gross primary production and carbon dioxide flux have confirmed the effectiveness of the proposed method in capturing target assessment preferences, including both synthetically generated and expert-annotated model comparisons. The proposed framework bridges the gap between numerical metrics and expert knowledge while providing interpretable evaluation policies that accommodate the diverse needs of different ecosystem modeling studies.
zh

[AI-89] Preference Learning with Lie Detectors can Induce Honesty or Evasion

【速读】:该论文试图解决在大型语言模型(Large Language Model, LLM)后训练过程中,由于欺骗性行为可能导致评估失效和用户被误导的问题。其解决方案的关键在于将欺骗检测器(lie detector)引入偏好学习的标注步骤,以评估所学策略是否真正诚实,而非仅是规避检测器。研究通过分析探索量、欺骗检测器准确率和KL正则化强度三个关键因素,揭示了在特定条件下,结合欺骗检测器的训练方法可以有效提升模型诚实性,而在其他情况下可能导致模型学习规避检测器并保持欺骗性。

链接: https://arxiv.org/abs/2505.13787
作者: Chris Cundy,Adam Gleave
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As AI systems become more capable, deceptive behaviors can undermine evaluation and mislead users at deployment. Recent work has shown that lie detectors can accurately classify deceptive behavior, but they are not typically used in the training pipeline due to concerns around contamination and objective hacking. We examine these concerns by incorporating a lie detector into the labelling step of LLM post-training and evaluating whether the learned policy is genuinely more honest, or instead learns to fool the lie detector while remaining deceptive. Using DolusChat, a novel 65k-example dataset with paired truthful/deceptive responses, we identify three key factors that determine the honesty of learned policies: amount of exploration during preference learning, lie detector accuracy, and KL regularization strength. We find that preference learning with lie detectors and GRPO can lead to policies which evade lie detectors, with deception rates of over 85%. However, if the lie detector true positive rate (TPR) or KL regularization is sufficiently high, GRPO learns honest policies. In contrast, off-policy algorithms (DPO) consistently lead to deception rates under 25% for realistic TPRs. Our results illustrate a more complex picture than previously assumed: depending on the context, lie-detector-enhanced training can be a powerful tool for scalable oversight, or a counterproductive method encouraging undetectable misalignment.
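下面的草图示意"在偏好标注步骤中引入测谎器"的做法:用测谎器对成对回答打分,把更可能诚实的一侧标为 chosen。detector 为占位的假设性函数;如文中所述,实际检测器的 TPR 与 KL 正则强度决定最终学到的策略是诚实还是规避:

```python
import random

def label_pair(resp_a, resp_b, detector):
    """detector(resp) 返回该回答"是谎言"的概率; 偏好得分更低(更可能诚实)的一侧."""
    chosen, rejected = (resp_a, resp_b) if detector(resp_a) < detector(resp_b) else (resp_b, resp_a)
    return {"chosen": chosen, "rejected": rejected}

fake_detector = lambda resp: random.random()   # 占位检测器; 实际 TPR 决定训练结果走向
print(label_pair("truthful answer", "deceptive answer", fake_detector))
```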
zh

[AI-90] CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)服务中因推理过程不透明而导致的计费透明度问题。具体而言,提供商通常仅返回最终答案而隐藏推理痕迹,使得用户无法验证所支付的推理令牌(reasoning tokens)的真实性,从而可能引发令牌数量虚报或注入低效合成令牌的问题。解决方案的关键在于提出CoIn框架,该框架通过构建基于令牌嵌入指纹的可验证哈希树来审计令牌数量,并利用基于嵌入的相关性匹配检测伪造的推理内容,从而有效识别令牌数量虚报行为。

链接: https://arxiv.org/abs/2505.13778
作者: Guoheng Sun,Ziyao Wang,Bowei Tian,Meng Liu,Zheyu Shen,Shwai He,Yexiao He,Wanghao Ye,Yiting Wang,Ang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As post-training techniques evolve, large language models (LLMs) are increasingly augmented with structured multi-step reasoning abilities, often optimized through reinforcement learning. These reasoning-enhanced models outperform standard LLMs on complex tasks and now underpin many commercial LLM APIs. However, to protect proprietary behavior and reduce verbosity, providers typically conceal the reasoning traces while returning only the final answer. This opacity introduces a critical transparency gap: users are billed for invisible reasoning tokens, which often account for the majority of the cost, yet have no means to verify their authenticity. This opens the door to token count inflation, where providers may overreport token usage or inject synthetic, low-effort tokens to inflate charges. To address this issue, we propose CoIn, a verification framework that audits both the quantity and semantic validity of hidden tokens. CoIn constructs a verifiable hash tree from token embedding fingerprints to check token counts, and uses embedding-based relevance matching to detect fabricated reasoning content. Experiments demonstrate that CoIn, when deployed as a trusted third-party auditor, can effectively detect token count inflation with a success rate reaching up to 94.7%, showing the strong ability to restore billing transparency in opaque LLM services. The dataset and code are available at this https URL.
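下面用一个假设性 Python 草图示意 CoIn 的核心构件:对每个隐藏推理 token 的嵌入取量化哈希作为指纹,再自底向上构建 Merkle 树,根哈希连同树深可供第三方审计 token 数量。指纹与树的具体构造细节与论文实现可能不同:

```python
import hashlib
import numpy as np

def fingerprint(embedding):
    """将 token 嵌入量化后做 SHA-256, 作为叶子节点指纹."""
    return hashlib.sha256(np.round(embedding, 4).tobytes()).hexdigest()

def merkle_root(leaves):
    """自底向上两两哈希; 根哈希连同树深可供第三方审计隐藏 token 的数量."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:                       # 奇数个节点时复制最后一个
            level.append(level[-1])
        level = [hashlib.sha256((a + b).encode()).hexdigest()
                 for a, b in zip(level[::2], level[1::2])]
    return level[0]

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(7, 16))            # 7 个隐藏推理 token 的嵌入
leaves = [fingerprint(e) for e in embeddings]
print(len(leaves), merkle_root(leaves)[:16])
```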
zh

[AI-91] Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

【速读】:该论文试图解决的问题是:当前对大型推理模型取得显著成果的解释是否合理,特别是是否可以将这些成果归因于“思维链”(Chain of Thought, CoT)中中间标记的语义对模型性能的直接影响。论文的关键解决方案是通过在形式可验证的推理轨迹和解法上训练Transformer模型,限制中间步骤和最终输出与形式化求解器(如A*搜索)保持一致,并构建一个形式化解释器来系统评估解的准确性以及中间轨迹的正确性,从而检验中间轨迹是否因果性地影响解的准确性。

链接: https://arxiv.org/abs/2505.13775
作者: Kaya Stechly,Karthik Valmeekam,Atharva Gundawar,Vardhan Palod,Subbarao Kambhampati
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. In this paper, we critically examine that interpretation by investigating how the semantics of intermediate tokens (often anthropomorphized as “thoughts” or reasoning traces, and claimed to display behaviors like backtracking, self-verification, etc.) actually influence model performance. We train transformer models on formally verifiable reasoning traces and solutions, constraining both intermediate steps and final outputs to align with those of a formal solver (in our case, A* search). By constructing a formal interpreter of the semantics of our problems and intended algorithm, we systematically evaluate not only solution accuracy but also the correctness of intermediate traces, thus allowing us to evaluate whether the latter causally influences the former. We notice that, despite significant improvements on the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions. To further show that trace accuracy is only loosely connected to solution accuracy, we then train models on noisy, corrupted traces which have no relation to the specific problem each is paired with, and find that not only does performance remain largely consistent with models trained on correct data, but in some cases can improve upon it and generalize more robustly on out-of-distribution tasks. These results challenge the assumption that intermediate tokens or “Chains of Thought” induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their mostly correct forms) as evidence of human-like or algorithmic behaviors in language models.
zh

[AI-92] Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models

【速读】:该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在中间推理过程中的忠实性问题,即确保模型在生成最终答案前的思维草稿(thinking draft)中的推理步骤与结论具有因果一致性和逻辑依赖性。解决方案的关键在于提出一种系统性的反事实干预框架,从两个互补维度评估忠实性:一是内部思维草稿忠实性(Intra-Draft Faithfulness),通过反事实步骤插入检验单个推理步骤对后续步骤及最终结论的影响;二是思维草稿到答案的忠实性(Draft-to-Answer Faithfulness),通过扰动思维草稿的结论逻辑来评估最终答案是否与之逻辑一致且依赖于该草稿。

链接: https://arxiv.org/abs/2505.13774
作者: Zidi Xiong,Chen Shan,Zhenting Qi,Himabindu Lakkaraju
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have significantly enhanced their capabilities in complex problem-solving by introducing a thinking draft that enables multi-path Chain-of-Thought explorations before producing final answers. Ensuring the faithfulness of these intermediate reasoning processes is crucial for reliable monitoring, interpretation, and effective control. In this paper, we propose a systematic counterfactual intervention framework to rigorously evaluate thinking draft faithfulness. Our approach focuses on two complementary dimensions: (1) Intra-Draft Faithfulness, which assesses whether individual reasoning steps causally influence subsequent steps and the final draft conclusion through counterfactual step insertions; and (2) Draft-to-Answer Faithfulness, which evaluates whether final answers are logically consistent with and dependent on the thinking draft, by perturbing the draft’s concluding logic. We conduct extensive experiments across six state-of-the-art LRMs. Our findings show that current LRMs demonstrate selective faithfulness to intermediate reasoning steps and frequently fail to faithfully align with the draft conclusions. These results underscore the need for more faithful and interpretable reasoning in advanced LRMs.
zh

[AI-93] Model Cards for AI Teammates: Comparing Human-AI Team Familiarization Methods for High-Stakes Environments

【速读】:该论文试图解决在协同、快节奏的智能、监视与侦察(ISR)环境中,如何有效使人类操作员熟悉人工智能(AI)队友的问题。研究比较了三种不同的熟悉方法:阅读关于AI队友的文档、在任务前与AI队友协作训练,以及无任何熟悉措施。解决方案的关键在于结合AI文档、结构化现场培训和探索性交互,以提升人类对AI队友决策机制及其优劣势的理解,从而形成高效的团队策略。研究发现,文档熟悉方式虽然能快速采用策略,但可能导致风险规避行为;而直接交互则有助于策略探索,但对内部过程理解较弱,因此综合方法能够平衡效率与灵活性。

链接: https://arxiv.org/abs/2505.13773
作者: Ryan Bowers,Richard Agbeyibor,Jack Kolb,Karen Feigh
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: Submitted to IEEE RO-MAN 2025 (under review). 8 pages, 7 figures

点击查看摘要

Abstract:We compare three methods of familiarizing a human with an artificial intelligence (AI) teammate (“agent”) prior to operation in a collaborative, fast-paced intelligence, surveillance, and reconnaissance (ISR) environment. In a between-subjects user study (n=60), participants either read documentation about the agent, trained alongside the agent prior to the mission, or were given no familiarization. Results showed that the most valuable information about the agent included details of its decision-making algorithms and its relative strengths and weaknesses compared to the human. This information allowed the familiarization groups to form sophisticated team strategies more quickly than the control group. Documentation-based familiarization led to the fastest adoption of these strategies, but also biased participants towards risk-averse behavior that prevented high scores. Participants familiarized through direct interaction were able to infer much of the same information through observation, and were more willing to take risks and experiment with different control modes, but reported weaker understanding of the agent’s internal processes. Significant differences were seen between individual participants’ risk tolerance and methods of AI interaction, which should be considered when designing human-AI control interfaces. Based on our findings, we recommend a human-AI team familiarization method that combines AI documentation, structured in-situ training, and exploratory interaction.
zh

[AI-94] Understanding Task Representations in Neural Networks via Bayesian Ablation

【速读】:该论文试图解决神经网络中学习到的表征难以解释的问题,特别是由于其子符号语义导致的可解释性挑战。解决方案的关键在于引入一种基于概率的框架,该框架受贝叶斯推断启发,定义了表征单元的分布以推断其对任务性能的因果贡献,并结合信息论提出了多种工具和指标,用于揭示模型的关键特性,如表征的分布式性、流形复杂性和多义性。

链接: https://arxiv.org/abs/2505.13742
作者: Andrew Nam,Declan Campbell,Thomas Griffiths,Jonathan Cohen,Sarah-Jane Leslie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural networks are powerful tools for cognitive modeling due to their flexibility and emergent properties. However, interpreting their learned representations remains challenging due to their sub-symbolic semantics. In this work, we introduce a novel probabilistic framework for interpreting latent task representations in neural networks. Inspired by Bayesian inference, our approach defines a distribution over representational units to infer their causal contributions to task performance. Using ideas from information theory, we propose a suite of tools and metrics to illuminate key model properties, including representational distributedness, manifold complexity, and polysemanticity.
zh

[AI-95] Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers

【速读】:该论文试图解决如何准确解释Transformer模型中注意力头(attention heads)的功能角色问题,特别是如何区分其对任务性能的因果影响而非仅相关性。解决方案的关键在于提出因果头门控(causal head gating, CHG),该方法通过学习注意力头的软门控并根据其对任务性能的影响将其分类为促进型、干扰型或无关型,从而实现对注意力头功能的因果解释。CHG无需依赖假设驱动的提示模板或目标标签,可直接应用于任何数据集,并通过消融和因果中介分析验证其有效性。

链接: https://arxiv.org/abs/2505.13737
作者: Andrew Nam,Henry Conklin,Yukang Yang,Thomas Griffiths,Jonathan Cohen,Sarah-Jane Leslie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, 2 tables

点击查看摘要

Abstract:We present causal head gating (CHG), a scalable method for interpreting the functional roles of attention heads in transformer models. CHG learns soft gates over heads and assigns them a causal taxonomy - facilitating, interfering, or irrelevant - based on their impact on task performance. Unlike prior approaches in mechanistic interpretability, which are hypothesis-driven and require prompt templates or target labels, CHG applies directly to any dataset using standard next-token prediction. We evaluate CHG across multiple large language models (LLMs) in the Llama 3 model family and diverse tasks, including syntax, commonsense, and mathematical reasoning, and show that CHG scores yield causal - not merely correlational - insight, validated via ablation and causal mediation analyses. We also introduce contrastive CHG, a variant that isolates sub-circuits for specific task components. Our findings reveal that LLMs contain multiple sparse, sufficient sub-circuits, that individual head roles depend on interactions with others (low modularity), and that instruction following and in-context learning rely on separable mechanisms.
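下面给出因果头门控的一个极简 PyTorch 示意:为每个注意力头学习一个 sigmoid 软门,并借助门控参数对任务损失的梯度方向近似判断其因果角色。这只是玩具示例,真实 CHG 作用于 LLM 内部各层注意力头,并配合消融与因果中介分析验证:

```python
import torch
import torch.nn as nn

n_heads, d_head = 8, 16
gates = nn.Parameter(torch.zeros(n_heads))        # 每个注意力头一个可学习软门

def gated_heads(head_outputs):
    """head_outputs: (batch, n_heads, d_head); 用 sigmoid 门缩放各头的贡献."""
    return head_outputs * torch.sigmoid(gates).view(1, -1, 1)

x = torch.randn(4, n_heads, d_head)
loss = gated_heads(x).pow(2).mean()               # 占位的任务损失(实际为下一词预测)
loss.backward()
# 优化倾向于打开某头 -> 促进型; 倾向于关闭 -> 干扰型; 梯度接近 0 -> 无关
print(gates.grad)
```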
zh

[AI-96] SayCoNav: Utilizing Large Language Models for Adaptive Collaboration in Decentralized Multi-Robot Navigation

【速读】:该论文试图解决在大规模未知环境中,自主机器人团队如何通过自适应协作完成复杂导航任务的问题。解决方案的关键在于提出SayCoNav方法,该方法利用生成式AI (Generative AI) 自动生成机器人团队的协作策略,并基于此策略,每个机器人以去中心化的方式生成自身的计划与行动,同时在导航过程中通过信息共享持续更新步骤计划,从而实现高效的协作与动态适应。

链接: https://arxiv.org/abs/2505.13729
作者: Abhinav Rajvanshi,Pritish Sahu,Tixiao Shan,Karan Sikka,Han-Pang Chiu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adaptive collaboration is critical to a team of autonomous robots to perform complicated navigation tasks in large-scale unknown environments. An effective collaboration strategy should be determined and adapted according to each robot’s skills and current status to successfully achieve the shared goal. We present SayCoNav, a new approach that leverages large language models (LLMs) for automatically generating this collaboration strategy among a team of robots. Building on the collaboration strategy, each robot uses the LLM to generate its plans and actions in a decentralized way. By sharing information to each other during navigation, each robot also continuously updates its step-by-step plans accordingly. We evaluate SayCoNav on Multi-Object Navigation (MultiON) tasks, that require the team of the robots to utilize their complementary strengths to efficiently search multiple different objects in unknown environments. By validating SayCoNav with varied team compositions and conditions against baseline methods, our experimental results show that SayCoNav can improve search efficiency by at most 44.28% through effective collaboration among heterogeneous robots. It can also dynamically adapt to the changing conditions during task execution.
zh

[AI-97] Policy-Driven World Model Adaptation for Robust Offline Model-based Reinforcement Learning

【速读】:该论文旨在解决离线模型基础强化学习(offline model-based reinforcement learning, MBRL)中存在目标不匹配和策略鲁棒性不足的问题。现有方法通常采用两阶段训练流程,导致世界模型未针对有效策略学习进行优化,并且学习到的策略在部署时对环境中的小扰动敏感。论文提出的解决方案关键在于构建一个统一的学习目标,通过动态适应世界模型与策略的协同优化,提升整体系统的鲁棒性。其核心是通过创新性地利用Stackelberg学习动力学求解最大化最小优化问题,从而实现更高效和稳定的策略学习。

链接: https://arxiv.org/abs/2505.13709
作者: Jiayu Chen,Aravind Venugopal,Jeff Schneider
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) offers a powerful paradigm for data-driven control. Compared to model-free approaches, offline model-based RL (MBRL) explicitly learns a world model from a static dataset and uses it as a surrogate simulator, improving data efficiency and enabling potential generalization beyond the dataset support. However, most existing offline MBRL methods follow a two-stage training procedure: first learning a world model by maximizing the likelihood of the observed transitions, then optimizing a policy to maximize its expected return under the learned model. This objective mismatch results in a world model that is not necessarily optimized for effective policy learning. Moreover, we observe that policies learned via offline MBRL often lack robustness during deployment, and small adversarial noise in the environment can lead to significant performance degradation. To address these, we propose a framework that dynamically adapts the world model alongside the policy under a unified learning objective aimed at improving robustness. At the core of our method is a maximin optimization problem, which we solve by innovatively utilizing Stackelberg learning dynamics. We provide theoretical analysis to support our design and introduce computationally efficient implementations. We benchmark our algorithm on twelve noisy D4RL MuJoCo tasks and three stochastic Tokamak Control tasks, demonstrating its state-of-the-art performance.
zh

[AI-98] RL in Name Only? Analyzing the Structural Assumptions in RL post-training for LLMs

【速读】:该论文试图解决当前基于强化学习(Reinforcement Learning, RL)的大型语言模型(Large Language Models, LLMs)后训练方法中存在的一些结构性假设问题,特别是这些假设是否合理以及其对模型性能的实际影响。论文指出,现有方法如GRPO在建模LLM训练时采用了两个关键的结构假设:一是将MDP状态视为动作的拼接,即状态成为上下文窗口,动作成为模型生成的token;二是将状态-动作轨迹的奖励均匀分配到整个轨迹。这些简化假设使得RL方法实际上等价于一种以结果为导向的监督学习,而非真正依赖于强化学习机制。论文的关键解决方案在于通过实验验证,表明结合正负样本的迭代监督微调可以达到与GRPO方法相当的性能,从而质疑了当前RL框架的有效性和解释力。

链接: https://arxiv.org/abs/2505.13697
作者: Soumya Rani Samineni,Durgesh Kalwar,Karthik Valmeekam,Kaya Stechly,Subbarao Kambhampati
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning-based post-training of large language models (LLMs) has recently gained attention, particularly following the release of DeepSeek R1, which applied GRPO for fine-tuning. Amid the growing hype around improved reasoning abilities attributed to RL post-training, we critically examine the formulation and assumptions underlying these methods. We start by highlighting the popular structural assumptions made in modeling LLM training as a Markov Decision Process (MDP), and show how they lead to a degenerate MDP that doesn’t quite need the RL/GRPO apparatus. The two critical structural assumptions include (1) making the MDP states be just a concatenation of the actions, with states becoming the context window and the actions becoming the tokens in LLMs, and (2) splitting the reward of a state-action trajectory uniformly across the trajectory. Through a comprehensive analysis, we demonstrate that these simplifying assumptions make the approach effectively equivalent to an outcome-driven supervised learning. Our experiments on benchmarks including GSM8K and Countdown using Qwen-2.5 base models show that iterative supervised fine-tuning, incorporating both positive and negative samples, achieves performance comparable to GRPO-based training. We will also argue that the structural assumptions indirectly incentivize the RL to generate longer sequences of intermediate tokens, which in turn feeds into the narrative of “RL generating longer thinking traces.” While RL may well be a very useful technique for improving the reasoning abilities of LLMs, our analysis shows that the simplistic structural assumptions made in modeling the underlying MDP render the popular LLM RL frameworks and their interpretations questionable.
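下面的草图示意文中所述的"退化"之处:GRPO 的组内归一化优势对每条回答是一个标量,被均匀摊到该回答的所有 token 上,因此训练信号在形式上等价于按结果加权的监督微调(假设性示意代码):

```python
import numpy as np

def grpo_token_weights(rewards):
    """rewards: 同一 prompt 下一组采样回答的标量奖励.
    返回每条回答内所有 token 共享的权重 (r - mean) / std, 即被均匀摊开的优势."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# 正权重样本相当于正例 SFT, 负权重样本相当于被"反向"加权的负例
print(grpo_token_weights([1.0, 0.0, 0.0, 1.0]))   # 约为 [ 1, -1, -1,  1]
```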
zh

[AI-99] Building spatial world models from sparse transitional episodic memories

【Quick Read】: This paper asks how a spatial model of an environment can be learned from sparse, disjoint episodic memories. The key to the solution is a new framework, the Episodic Spatial World Model (ESWM), which builds a robust representation of the environment from only a handful of observations and is inherently adaptive, updating quickly when the environment changes. Moreover, ESWM supports near-optimal strategies for exploring novel environments and navigating between arbitrary points without any additional training.

Link: https://arxiv.org/abs/2505.13696
Authors: Zizhan He, Maxime Daigle, Pouya Bashivan
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:


Abstract:Many animals possess a remarkable capacity to rapidly construct flexible mental models of their environments. These world models are crucial for ethologically relevant behaviors such as navigation, exploration, and planning. The ability to form episodic memories and make inferences based on these sparse experiences is believed to underpin the efficiency and adaptability of these models in the brain. Here, we ask: Can a neural network learn to construct a spatial model of its surroundings from sparse and disjoint episodic memories? We formulate the problem in a simulated world and propose a novel framework, the Episodic Spatial World Model (ESWM), as a potential answer. We show that ESWM is highly sample-efficient, requiring minimal observations to construct a robust representation of the environment. It is also inherently adaptive, allowing for rapid updates when the environment changes. In addition, we demonstrate that ESWM readily enables near-optimal strategies for exploring novel environments and navigating between arbitrary points, all without the need for additional training.

[AI-100] A*-Decoding: Token-Efficient Inference Scaling

【Quick Read】: This paper asks how to make the best use of a fixed compute budget to improve language model performance on complex reasoning tasks. The key to the solution is A*-decoding, a search-based inference-time strategy built on the A* search algorithm that prioritizes high-quality reasoning paths during generation, spending compute more efficiently. The method frames language model decoding as structured search over a state space of partial solutions, using an external process-supervision signal to guide generation, and outperforms conventional approaches under equivalent compute budgets.

Link: https://arxiv.org/abs/2505.13672
Authors: Giannis Chatziveroglou
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:


Abstract:Inference-time scaling has emerged as a powerful alternative to parameter scaling for improving language model performance on complex reasoning tasks. While existing methods have shown strong performance gains under fixed compute budgets, there has been little focus on optimally utilizing that budget during inference. In this work, we introduce A*-decoding, a search-based inference-time strategy that builds on the A* search algorithm to optimally utilize a fixed compute budget by prioritizing high-quality reasoning paths during generation. We frame language model decoding as a structured search in a state space of partial solutions, applying the A* transition model to identify promising continuations guided by an external process supervision signal. In our experiments, A*-decoding reaches the performance levels of strong inference scaling baselines like best-of-N and particle filtering while using up to 3x fewer tokens and 30% fewer PRM passes under equivalent compute budgets. On the MATH500 and AIME 2024 benchmarks, A*-decoding enables Llama-3.2-1B-Instruct to match the performance of the 70x larger Llama-3.1-70B-Instruct, and allows Qwen3-1.7B to reach o1-like reasoning accuracy. These results highlight the power of structured search in decoding, offering an alternative to brute-force sampling or scale-driven gains. Our work demonstrates how thoughtful inference-time strategies can enhance reasoning in SLMs, pointing toward future advances in more efficient and scalable language model deployment.
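
To make the framing of decoding as A* search concrete, here is a minimal, self-contained sketch (not the authors' implementation). It assumes two black boxes that the abstract presupposes: an `expand` function returning candidate continuations with their generation costs, and a process reward model `prm_score` serving as the heuristic; partial solutions are ranked by f = g + h.

```python
import heapq

def a_star_decode(prompt, expand, prm_score, is_complete, max_steps=256):
    """Minimal A*-style decoding sketch. `expand`, `prm_score`, and
    `is_complete` are assumed interfaces, not a real library API:
      expand(seq)      -> [(continuation, step_cost), ...]
      prm_score(seq)   -> float in [0, 1], higher = more promising
      is_complete(seq) -> bool
    """
    # Priority queue ordered by f = g (accumulated cost) + h (heuristic).
    frontier = [(0.0, 0.0, prompt)]  # (f, g, sequence)
    for _ in range(max_steps):
        if not frontier:
            break
        f, g, seq = heapq.heappop(frontier)
        if is_complete(seq):
            return seq
        for continuation, step_cost in expand(seq):
            new_seq = seq + continuation
            new_g = g + step_cost
            # Turn the PRM score into an estimate of remaining cost:
            # high-quality partial reasoning -> low h.
            h = 1.0 - prm_score(new_seq)
            heapq.heappush(frontier, (new_g + h, new_g, new_seq))
    return None  # budget exhausted
```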

[AI-101] MAFA: A multi-agent framework for annotation

【Quick Read】: This paper targets the accuracy and efficiency of matching user queries to the most relevant FAQs; traditional approaches rely on a single model or technique and often miss the nuances of diverse user inquiries. The key to the solution is a multi-agent framework that combines several specialized agents using different approaches with a judge agent that reranks the candidates to produce the best answer. It also adds a structured reasoning approach inspired by Attentive Reasoning Queries (ARQs) and a specialized few-shot example strategy to increase ensemble diversity and coverage of the query space.

Link: https://arxiv.org/abs/2505.13668
Authors: Mahmood Hegazy, Aaron Rodrigues, Azzam Naeem
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:Modern applications require accurate and efficient retrieval of information in response to user queries. Mapping user utterances to the most relevant Frequently Asked Questions (FAQs) is a crucial component of these systems. Traditional approaches often rely on a single model or technique, which may not capture the nuances of diverse user inquiries. In this paper, we introduce a multi-agent framework for FAQ annotation that combines multiple specialized agents with different approaches and a judge agent that reranks candidates to produce optimal results. Our agents utilize a structured reasoning approach inspired by Attentive Reasoning Queries (ARQs), which guides them through systematic reasoning steps using targeted, task-specific JSON queries. Our framework features a specialized few-shot example strategy, where each agent receives different few-shots, enhancing ensemble diversity and coverage of the query space. We evaluate our framework on a real-world banking dataset as well as public benchmark datasets (LCQMC and FiQA), demonstrating significant improvements over single-agent approaches across multiple metrics, including a 14% increase in Top-1 accuracy, an 18% increase in Top-5 accuracy, and a 12% improvement in Mean Reciprocal Rank on our dataset, and similar gains on public benchmarks when compared with traditional single agent annotation techniques. Our framework is particularly effective at handling ambiguous queries, making it well-suited for deployment in production applications while showing strong generalization capabilities across different domains and languages.
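
The annotate-then-rerank pattern described above is easy to sketch. The snippet below is a hypothetical skeleton, not the paper's code: each specialized agent proposes FAQ candidates with its own few-shot prompt, and a judge agent reranks the pooled candidates.

```python
from typing import Callable, List

def multi_agent_faq_annotation(
    query: str,
    agents: List[Callable[[str], List[str]]],
    judge: Callable[[str, List[str]], List[str]],
    top_k: int = 5,
) -> List[str]:
    """Hypothetical MAFA-style skeleton: `agents` are specialized candidate
    generators (each with its own few-shot examples), `judge` reranks."""
    # Pool candidates from all specialized agents, deduplicating while
    # preserving the order in which they were proposed.
    pooled, seen = [], set()
    for agent in agents:
        for faq in agent(query):
            if faq not in seen:
                seen.add(faq)
                pooled.append(faq)
    # The judge agent reranks the pooled candidates for this query.
    return judge(query, pooled)[:top_k]
```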

[AI-102] Self-Reinforced Graph Contrastive Learning

【Quick Read】: This paper addresses the problem of poor positive-pair quality in graph contrastive learning (GCL), which can distort rather than preserve a graph's intrinsic semantic and structural properties. The key to the solution is SRGCL (Self-Reinforced Graph Contrastive Learning), a framework that uses the model's own encoder to dynamically evaluate and select high-quality positive pairs. It couples a unified positive-pair generator built on multiple augmentation strategies with a selector guided by the manifold hypothesis to preserve the underlying geometry of the latent space, and it iteratively refines pair quality through a probabilistic selection mechanism, improving the model's representational power.

Link: https://arxiv.org/abs/2505.13650
Authors: Chou-Ying Hsieh, Chun-Fu Jang, Cheng-En Hsieh, Qian-Hui Chen, Sy-Yen Kuo
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:


Abstract:Graphs serve as versatile data structures in numerous real-world domains-including social networks, molecular biology, and knowledge graphs-by capturing intricate relational information among entities. Among graph-based learning techniques, Graph Contrastive Learning (GCL) has gained significant attention for its ability to derive robust, self-supervised graph representations through the contrasting of positive and negative sample pairs. However, a critical challenge lies in ensuring high-quality positive pairs so that the intrinsic semantic and structural properties of the original graph are preserved rather than distorted. To address this issue, we propose SRGCL (Self-Reinforced Graph Contrastive Learning), a novel framework that leverages the model’s own encoder to dynamically evaluate and select high-quality positive pairs. We designed a unified positive pair generator employing multiple augmentation strategies, and a selector guided by the manifold hypothesis to maintain the underlying geometry of the latent space. By adopting a probabilistic mechanism for selecting positive pairs, SRGCL iteratively refines its assessment of pair quality as the encoder’s representational power improves. Extensive experiments on diverse graph-level classification tasks demonstrate that SRGCL, as a plug-in module, consistently outperforms state-of-the-art GCL methods, underscoring its adaptability and efficacy across various domains.
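
The self-reinforced selection loop can be sketched compactly. The code below is our simplified reading, not the paper's implementation: generate candidate augmented views, score each pair with the current encoder, and sample pairs with probability increasing in their embedding agreement, so selection sharpens as the encoder improves.

```python
import torch
import torch.nn.functional as F

def select_positive_pairs(graph, augmentations, encoder, n_select=4, temp=0.1):
    """SRGCL-style sketch (our simplification): the model's own encoder
    scores candidate positive pairs; higher-agreement pairs are more
    likely to be kept. `augmentations` is a list of graph -> graph views."""
    views = [aug(graph) for aug in augmentations]
    embs = torch.stack([F.normalize(encoder(v), dim=-1) for v in views])
    # Candidate pairs and their cosine agreement under the current encoder.
    pairs = [(i, j) for i in range(len(views)) for j in range(i + 1, len(views))]
    scores = torch.tensor([float(embs[i] @ embs[j]) for i, j in pairs])
    # Probabilistic selection: softmax over agreement scores.
    probs = F.softmax(scores / temp, dim=0)
    idx = torch.multinomial(probs, min(n_select, len(pairs)), replacement=False)
    return [(views[pairs[k][0]], views[pairs[k][1]]) for k in idx.tolist()]
```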

[AI-103] Learning (Approximately) Equivariant Networks via Constrained Optimization

【Quick Read】: This paper addresses the limitations of conventional equivariant neural networks when real-world data departs from perfect symmetry: strictly equivariant models may fail to fit the data, while unconstrained models have no principled way to exploit partial symmetries. The key to the solution is Adaptive Constrained Equivariance (ACE), which starts from a flexible, non-equivariant model and gradually reduces its deviation from equivariance. This smooths optimization early in training and settles the model at a data-driven equilibrium, trading off equivariance against non-equivariance effectively.

Link: https://arxiv.org/abs/2505.13631
Authors: Andrei Manolache, Luiz F.O. Chamon, Mathias Niepert
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:


Abstract:Equivariant neural networks are designed to respect symmetries through their architecture, boosting generalization and sample efficiency when those symmetries are present in the data distribution. Real-world data, however, often departs from perfect symmetry because of noise, structural variation, measurement bias, or other symmetry-breaking effects. Strictly equivariant models may struggle to fit the data, while unconstrained models lack a principled way to leverage partial symmetries. Even when the data is fully symmetric, enforcing equivariance can hurt training by limiting the model to a restricted region of the parameter space. Guided by homotopy principles, where an optimization problem is solved by gradually transforming a simpler problem into a complex one, we introduce Adaptive Constrained Equivariance (ACE), a constrained optimization approach that starts with a flexible, non-equivariant model and gradually reduces its deviation from equivariance. This gradual tightening smooths training early on and settles the model at a data-driven equilibrium, balancing between equivariance and non-equivariance. Across multiple architectures and tasks, our method consistently improves performance metrics, sample efficiency, and robustness to input perturbations compared with strictly equivariant models and heuristic equivariance relaxations.
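
A minimal way to read "gradually reduces its deviation from equivariance" is as a constrained problem solved with a dual variable on an equivariance-violation penalty. The sketch below is our illustration under that assumption (the paper's exact constraint formulation may differ): the multiplier grows whenever the measured violation exceeds a tolerance, tightening the constraint as training proceeds.

```python
import torch

def equivariance_violation(model, x, group_act):
    """Mean deviation ||f(g.x) - g.f(x)|| for one sampled group action."""
    return (model(group_act(x)) - group_act(model(x))).norm(dim=-1).mean()

def ace_step(model, x, y, task_loss_fn, group_act, lam, eps=1e-2, dual_lr=0.1):
    """One ACE-style update sketch: primal descent on task loss plus
    lam * violation, dual ascent on lam. `group_act` applies a sampled
    symmetry action; all names here are our own illustrative choices."""
    violation = equivariance_violation(model, x, group_act)
    loss = task_loss_fn(model(x), y) + lam * violation
    loss.backward()
    # Dual ascent: tighten the constraint only while it is still violated.
    with torch.no_grad():
        lam = max(0.0, lam + dual_lr * (violation.item() - eps))
    return loss.item(), lam
```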

[AI-104] OMGPT: A Sequence Modeling Framework for Data-driven Operational Decision Making

【Quick Read】: This paper tackles sequential decision-making problems from operations research and management science (OR/OM), including dynamic pricing, inventory management, resource allocation, and queueing control. The key to the solution is a general sequence modeling framework and, built on it, a transformer-based neural network, OMGPT, that treats all of these tasks as one sequential prediction problem: predict the optimal future action given the full history. Compared with existing methods, OMGPT's key advantages are that it can exploit large amounts of pre-training data and that it assumes no analytical model structure, enabling a direct and rich mapping from history to future actions.

Link: https://arxiv.org/abs/2505.13580
Authors: Hanzhao Wang, Guanting Chen, Kalyan Talluri, Xiaocheng Li
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2405.14219


Abstract:We build a Generative Pre-trained Transformer (GPT) model from scratch to solve sequential decision making tasks arising in contexts of operations research and management science which we call OMGPT. We first propose a general sequence modeling framework to cover several operational decision making tasks as special cases, such as dynamic pricing, inventory management, resource allocation, and queueing control. Under the framework, all these tasks can be viewed as a sequential prediction problem where the goal is to predict the optimal future action given all the historical information. Then we train a transformer-based neural network model (OMGPT) as a natural and powerful architecture for sequential modeling. This marks a paradigm shift compared to the existing methods for these OR/OM tasks in that (i) the OMGPT model can take advantage of the huge amount of pre-trained data; (ii) when tackling these problems, OMGPT does not assume any analytical model structure and enables a direct and rich mapping from the history to the future actions. Either of these two aspects, to the best of our knowledge, is not achieved by any existing method. We establish a Bayesian perspective to theoretically understand the working mechanism of the OMGPT on these tasks, which relates its performance with the pre-training task diversity and the divergence between the testing task and pre-training tasks. Numerically, we observe a surprising performance of the proposed model across all the above tasks.

[AI-105] VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation

【Quick Read】: This paper addresses the limited global access to diagnosis and treatment for voice disorders, especially where convenient diagnostic tools are unavailable. The key to the solution is VocalAgent, a vocal health diagnosis system built on an audio large language model: Qwen-Audio-Chat is fine-tuned on three datasets collected in situ from hospital patients, achieving high accuracy on voice-disorder classification, and a multifaceted evaluation framework covering safety, cross-lingual performance, and modality ablations validates the system.

Link: https://arxiv.org/abs/2505.13577
Authors: Yubin Kim, Taehan Kim, Wonjune Kang, Eugene Park, Joonsik Yoon, Dongjae Lee, Xin Liu, Daniel McDuff, Hyeonhoon Lee, Cynthia Breazeal, Hae Won Park
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:


Abstract:Vocal health plays a crucial role in peoples’ lives, significantly impacting their communicative abilities and interactions. However, despite the global prevalence of voice disorders, many lack access to convenient diagnosis and treatment. This paper introduces VocalAgent, an audio large language model (LLM) to address these challenges through vocal health diagnosis. We leverage Qwen-Audio-Chat fine-tuned on three datasets collected in-situ from hospital patients, and present a multifaceted evaluation framework encompassing a safety assessment to mitigate diagnostic biases, cross-lingual performance analysis, and modality ablation studies. VocalAgent demonstrates superior accuracy on voice disorder classification compared to state-of-the-art baselines. Its LLM-based method offers a scalable solution for broader adoption of health diagnostics, while underscoring the importance of ethical and technical validation.

[AI-106] FreeMesh: Boosting Mesh Generation with Coordinates Merging ICML2025

【Quick Read】: This paper addresses the lack of an efficient metric for evaluating the tokenizers used in current auto-regressive mesh generation, i.e., the tokenizers that serialize meshes into sequences. The key to the solution is a new theoretical metric, Per-Token-Mesh-Entropy (PTME), which evaluates existing mesh tokenizers without any training, plus a plug-and-play tokenization technique built on PTME called coordinate merging, which further improves the compression ratios of existing tokenizers by rearranging and merging the most frequent coordinate patterns.

Link: https://arxiv.org/abs/2505.13573
Authors: Jian Liu, Haohan Weng, Biwen Lei, Xianghui Yang, Zibo Zhao, Zhuo Chen, Song Guo, Tao Han, Chunchao Guo
Institution: Unknown
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2025, camera-ready version


Abstract:The next-coordinate prediction paradigm has emerged as the de facto standard in current auto-regressive mesh generation methods. Despite their effectiveness, there is no efficient measurement for the various tokenizers that serialize meshes into sequences. In this paper, we introduce a new metric Per-Token-Mesh-Entropy (PTME) to evaluate the existing mesh tokenizers theoretically without any training. Building upon PTME, we propose a plug-and-play tokenization technique called coordinate merging. It further improves the compression ratios of existing tokenizers by rearranging and merging the most frequent patterns of coordinates. Through experiments on various tokenization methods like MeshXL, MeshAnything V2, and Edgerunner, we further validate the performance of our method. We hope that the proposed PTME and coordinate merging can enhance the existing mesh tokenizers and guide the further development of native mesh generation.
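
The abstract does not spell out the PTME formula, but a natural reading is entropy of the serialized token stream, with coordinate merging behaving like BPE over coordinate tokens. The sketch below encodes that reading and is ours, not the paper's definition: it estimates per-token entropy from empirical frequencies and merges the most frequent adjacent coordinate pair into a new token.

```python
import math
from collections import Counter

def per_token_entropy(tokens):
    """Empirical Shannon entropy (bits/token) of a serialized mesh;
    one plausible ingredient of PTME, assumed here for illustration."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def merge_most_frequent_pair(tokens):
    """One BPE-style coordinate-merging step: replace the most frequent
    adjacent token pair with a single merged token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append((a, b))  # merged token shortens the sequence
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

coords = [3, 7, 3, 7, 1, 3, 7, 2]  # toy quantized mesh coordinates
print(per_token_entropy(coords))
print(per_token_entropy(merge_most_frequent_pair(coords)))
```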

[AI-107] Q2Forge: Minting Competency Questions and SPARQL Queries for Question-Answering Over Knowledge Graphs

【Quick Read】: This paper addresses the difficulty non-expert users face, when building knowledge graphs (KGs), in effectively generating and validating KG-related competency questions and their corresponding SPARQL queries. Writing these by hand is time-consuming and yields only a few examples, failing to showcase a KG's potential. The key to the solution is Q²Forge, which iteratively generates and validates new competency questions and corresponding SPARQL queries using human feedback and large language models (LLMs) as judges, forming a complete pipeline from question generation to query evaluation that supports building reference query sets for any target KG.

Link: https://arxiv.org/abs/2505.13572
Authors: Yousouf Taghzouti (WIMMICS, ICN), Franck Michel (Laboratoire I3S - SPARKS, WIMMICS), Tao Jiang (ICN), Louis-Félix Nothias (ICN), Fabien Gandon (WIMMICS, Laboratoire I3S - SPARKS)
Institution: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:


Abstract:The SPARQL query language is the standard method to access knowledge graphs (KGs). However, formulating SPARQL queries is a significant challenge for non-expert users, and remains time-consuming for the experienced ones. Best practices recommend to document KGs with competency questions and example queries to contextualise the knowledge they contain and illustrate their potential applications. In practice, however, this is either not the case or the examples are provided in limited numbers. Large Language Models (LLMs) are being used in conversational agents and are proving to be an attractive solution with a wide range of applications, from simple question-answering about common knowledge to generating code in a targeted programming language. However, training and testing these models to produce high quality SPARQL queries from natural language questions requires substantial datasets of question-query pairs. In this paper, we present Q²Forge that addresses the challenge of generating new competency questions for a KG and corresponding SPARQL queries. It iteratively validates those queries with human feedback and LLM as a judge. Q²Forge is open source, generic, extensible and modular, meaning that the different modules of the application (CQ generation, query generation and query refinement) can be used separately, as an integrated pipeline, or replaced by alternative services. The result is a complete pipeline from competency question formulation to query evaluation, supporting the creation of reference query sets for any target KG.

[AI-108] Learning Dynamics of RNNs in Closed-Loop Environments

【Quick Read】: This paper addresses the importance of modeling closed-loop dynamics in biologically plausible settings, focusing on the learning dynamics of recurrent neural networks (RNNs) in closed-loop environments. Conventional training paradigms rely on open-loop, supervised settings, whereas real-world learning unfolds in closed loops, so a new theoretical framework is needed to characterize the difference. The key to the solution is a mathematical theory of the learning dynamics of linear RNNs trained in closed loop, which shows that their learning trajectories differ markedly from open-loop RNNs and are governed by an interplay between two competing objectives: short-term policy improvement and long-term stability of the agent-environment interaction.

Link: https://arxiv.org/abs/2505.13567
Authors: Yoav Ger, Omri Barak
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 9 pages with 6 figures


Abstract:Recurrent neural networks (RNNs) trained on neuroscience-inspired tasks offer powerful models of brain computation. However, typical training paradigms rely on open-loop, supervised settings, whereas real-world learning unfolds in closed-loop environments. Here, we develop a mathematical theory describing the learning dynamics of linear RNNs trained in closed-loop contexts. We first demonstrate that two otherwise identical RNNs, trained in either closed- or open-loop modes, follow markedly different learning trajectories. To probe this divergence, we analytically characterize the closed-loop case, revealing distinct stages aligned with the evolution of the training loss. Specifically, we show that the learning dynamics of closed-loop RNNs, in contrast to open-loop ones, are governed by an interplay between two competing objectives: short-term policy improvement and long-term stability of the agent-environment interaction. Finally, we apply our framework to a realistic motor control task, highlighting its broader applicability. Taken together, our results underscore the importance of modeling closed-loop dynamics in a biologically plausible setting.

[AI-109] Aligning Trustworthy AI with Democracy: A Dual Taxonomy of Opportunities and Risks

【Quick Read】: This paper examines the complex impact of artificial intelligence (AI) on democratic governance, assessing AI's dual role as both a threat to and an enabler of democratic principles. The key to the solution is a dual taxonomy: AI Risks to Democracy (AIRD) and AI's Positive Contributions to Democracy (AIPD). Grounded in the EU's ethical AI governance framework and its seven Trustworthy AI requirements, the taxonomy pairs each identified risk with mitigation strategies, emphasizes the transversal importance of transparency and societal well-being across all risk categories, and offers a structured lens for aligning AI systems with democratic values.

Link: https://arxiv.org/abs/2505.13565
Authors: Oier Mentxaka, Natalia Díaz-Rodríguez, Mark Coeckelbergh, Marcos López de Prado, Emilia Gómez, David Fernández Llorca, Enrique Herrera-Viedma, Francisco Herrera
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 26 pages, 5 figures


Abstract:Artificial Intelligence (AI) poses both significant risks and valuable opportunities for democratic governance. This paper introduces a dual taxonomy to evaluate AI's complex relationship with democracy: the AI Risks to Democracy (AIRD) taxonomy, which identifies how AI can undermine core democratic principles such as autonomy, fairness, and trust; and the AI's Positive Contributions to Democracy (AIPD) taxonomy, which highlights AI's potential to enhance transparency, participation, efficiency, and evidence-based policymaking. Grounded in the European Union's approach to ethical AI governance, and particularly the seven Trustworthy AI requirements proposed by the European Commission's High-Level Expert Group on AI, each identified risk is aligned with mitigation strategies based on EU regulatory and normative frameworks. Our analysis underscores the transversal importance of transparency and societal well-being across all risk categories and offers a structured lens for aligning AI systems with democratic values. By integrating democratic theory with practical governance tools, this paper offers a normative and actionable framework to guide research, regulation, and institutional design to support trustworthy, democratic AI. It provides scholars with a conceptual foundation to evaluate the democratic implications of AI, equips policymakers with structured criteria for ethical oversight, and helps technologists align system design with democratic principles. In doing so, it bridges the gap between ethical aspirations and operational realities, laying the groundwork for more inclusive, accountable, and resilient democratic systems in the algorithmic age.

[AI-110] Language and Thought: The View from LLM s

【Quick Read】: This paper probes the nature of language's influence on thought, specifically testing Daniel Dennett's suggestion in Kinds of Minds that adding language may change a mind so fundamentally that calling linguistic and non-linguistic minds both "minds" is a mistake. The key to the solution is analyzing the performance of large language models (LLMs) on inferential reasoning tasks: the abstractness and encoding efficiency of language is argued to be what makes inference computationally tractable, supporting Dennett's radical view of language's transformative effect on thought, with AI experiments indicating that the mechanism holds across a wide range of domains.

Link: https://arxiv.org/abs/2505.13561
Authors: Daniel Rothschild
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 37 pages


Abstract:Daniel Dennett speculated in Kinds of Minds 1996: “Perhaps the kind of mind you get when you add language to it is so different from the kind of mind you can have without language that calling them both minds is a mistake.” Recent work in AI can be seen as testing Dennett’s thesis by exploring the performance of AI systems with and without linguistic training. I argue that the success of Large Language Models at inferential reasoning, limited though it may be, supports Dennett’s radical view about the effect of language on thought. I suggest it is the abstractness and efficiency of linguistic encoding that lies behind the capacity of LLMs to perform inferences across a wide range of domains. In a slogan, language makes inference computationally tractable. I assess what these results in AI indicate about the role of language in the workings of our own biological minds.

[AI-111] AMAQA: A Metadata-based QA Dataset for RAG Systems

【Quick Read】: This paper addresses the lack of metadata integration in existing question-answering (QA) benchmarks, which limits evaluation in scenarios that must combine textual data with external information. The key to the solution is AMAQA, an open-access QA dataset of about 1.1 million English messages enriched with metadata such as timestamps, topics, emotional tones, and toxicity indicators. It is the first single-hop QA benchmark to incorporate metadata and labels, enabling more precise and contextualized queries, and its experiments show that leveraging metadata substantially improves model accuracy.

Link: https://arxiv.org/abs/2505.13557
Authors: Davide Bruni, Marco Avvenuti, Nicola Tonellotto, Maurizio Tesconi
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:


Abstract:Retrieval-augmented generation (RAG) systems are widely used in question-answering (QA) tasks, but current benchmarks lack metadata integration, hindering evaluation in scenarios requiring both textual data and external information. To address this, we present AMAQA, a new open-access QA dataset designed to evaluate tasks combining text and metadata. The integration of metadata is especially important in fields that require rapid analysis of large volumes of data, such as cybersecurity and intelligence, where timely access to relevant information is critical. AMAQA includes about 1.1 million English messages collected from 26 public Telegram groups, enriched with metadata such as timestamps, topics, emotional tones, and toxicity indicators, which enable precise and contextualized queries by filtering documents based on specific criteria. It also includes 450 high-quality QA pairs, making it a valuable resource for advancing research on metadata-driven QA and RAG systems. To the best of our knowledge, AMAQA is the first single-hop QA benchmark to incorporate metadata and labels such as topics covered in the messages. We conduct extensive tests on the benchmark, establishing a new standard for future research. We show that leveraging metadata boosts accuracy from 0.12 to 0.61, highlighting the value of structured context. Building on this, we explore several strategies to refine the LLM input by iterating over provided context and enriching it with noisy documents, achieving a further 3-point gain over the best baseline and a 14-point improvement over simple metadata filtering. The dataset is available at this https URL

[AI-112] Counter-Inferential Behavior in Natural and Artificial Cognitive Systems

【Quick Read】: This study explores counter-inferential behavior in natural and artificial cognitive systems, i.e., patterns in which agents misattribute empirical success or suppress adaptive adjustment, leading to epistemic rigidity or maladaptive stability. The key insight is that such behavior arises from structured interactions between internal information models, empirical feedback, and higher-order evaluation mechanisms, not from noise or design flaws. The study stresses the importance of preserving minimal adaptive activation under stable conditions and proposes design principles for cognitive architectures that can resist rigidity under informational stress.

Link: https://arxiv.org/abs/2505.13551
Authors: Serge Dolgikh
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Social and Information Networks (cs.SI)
Comments: 23 pages, 3 figures


Abstract:This study explores the emergence of counter-inferential behavior in natural and artificial cognitive systems, that is, patterns in which agents misattribute empirical success or suppress adaptation, leading to epistemic rigidity or maladaptive stability. We analyze archetypal scenarios in which such behavior arises: reinforcement of stability through reward imbalance, meta-cognitive attribution of success to internal superiority, and protective reframing under perceived model fragility. Rather than arising from noise or flawed design, these behaviors emerge through structured interactions between internal information models, empirical feedback, and higher-order evaluation mechanisms. Drawing on evidence from artificial systems, biological cognition, human psychology, and social dynamics, we identify counter-inferential behavior as a general cognitive vulnerability that can manifest even in otherwise well-adapted systems. The findings highlight the importance of preserving minimal adaptive activation under stable conditions and suggest design principles for cognitive architectures that can resist rigidity under informational stress.

[AI-113] JIR-Arena: The First Benchmark Dataset for Just-in-time Information Recommendation

【Quick Read】: This paper addresses the lack of a systematic definition and evaluation framework for the just-in-time information recommendation (JIR) task, which has held the area back. The key to the solution is the first mathematical definition of JIR tasks and corresponding evaluation metrics, together with JIR-Arena, a multimodal benchmark dataset of diverse, information-request-intensive scenarios. The benchmark approximates user information-need distributions by combining input from multiple humans and large AI models, assesses retrieval effectiveness against static knowledge-base snapshots, and adopts a multi-turn, multi-entity validation framework for objectivity and generality. The authors also implement a baseline JIR system that processes real-time information streams, providing a starting point for future research.

Link: https://arxiv.org/abs/2505.13550
Authors: Ke Yang, Kevin Ros, Shankar Kumar Senthil Kumar, ChengXiang Zhai
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:


Abstract:Just-in-time Information Recommendation (JIR) is a service designed to deliver the most relevant information precisely when users need it, addressing their knowledge gaps with minimal effort and boosting decision-making and efficiency in daily life. Advances in device-efficient deployment of foundation models and the growing use of intelligent wearable devices have made always-on JIR assistants feasible. However, there has been no systematic effort to formally define JIR tasks or establish evaluation frameworks. To bridge this gap, we present the first mathematical definition of JIR tasks and associated evaluation metrics. Additionally, we introduce JIR-Arena, a multimodal benchmark dataset featuring diverse, information-request-intensive scenarios to evaluate JIR systems across critical dimensions: i) accurately inferring user information needs, ii) delivering timely and relevant recommendations, and iii) avoiding irrelevant content that may distract users. Developing a JIR benchmark dataset poses challenges due to subjectivity in estimating user information needs and uncontrollable system variables affecting reproducibility. To address these, JIR-Arena: i) combines input from multiple humans and large AI models to approximate information need distributions; ii) assesses JIR quality through information retrieval outcomes using static knowledge base snapshots; and iii) employs a multi-turn, multi-entity validation framework to improve objectivity and generality. Furthermore, we implement a baseline JIR system capable of processing real-time information streams aligned with user inputs. Our evaluation of this baseline system on JIR-Arena indicates that while foundation model-based JIR systems simulate user needs with reasonable precision, they face challenges in recall and effective content retrieval. To support future research in this new area, we fully release our code and data.

[AI-114] Exploring Federated Pruning for Large Language Models

【Quick Read】: This paper addresses the difficulty of obtaining public calibration samples for large language model (LLM) compression in privacy-sensitive domains. The key to the solution is FedPrLLM, a federated pruning framework in which each client computes a pruning mask matrix from its local calibration data only and shares the mask with the server, enabling collaborative pruning of the global model while keeping local data private.

Link: https://arxiv.org/abs/2505.13547
Authors: Pengxin Guo, Yinong Wang, Wei Li, Mengting Liu, Ming Li, Jinkai Zheng, Liangqiong Qu
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:


Abstract:LLM pruning has emerged as a promising technology for compressing LLMs, enabling their deployment on resource-limited devices. However, current methodologies typically require access to public calibration samples, which can be challenging to obtain in privacy-sensitive domains. To address this issue, we introduce FedPrLLM, a comprehensive federated pruning framework designed for the privacy-preserving compression of LLMs. In FedPrLLM, each client only needs to calculate a pruning mask matrix based on its local calibration data and share it with the server to prune the global model. This approach allows for collaborative pruning of the global model with the knowledge of each client while maintaining local data privacy. Additionally, we conduct extensive experiments to explore various possibilities within the FedPrLLM framework, including different comparison groups, pruning strategies, and the decision to scale weights. Our extensive evaluation reveals that one-shot pruning with layer comparison and no weight scaling is the optimal choice within the FedPrLLM framework. We hope our work will help guide future efforts in pruning LLMs in privacy-sensitive fields. Our code is available at this https URL.
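
The federated mask protocol can be sketched in a few lines. The code below is a simplified illustration under our own assumptions, not the paper's specification: each client scores weights with a Wanda-style |W|·||activation|| saliency on its local calibration data, and the server keeps a weight if a majority of client masks keep it, pruning once with layer-wise comparison and no weight scaling (the configuration the abstract reports as best).

```python
import numpy as np

def client_mask(W, X_local, sparsity=0.5):
    """Client-side scoring on private calibration data. The |W| * ||x||
    saliency is our assumption, not necessarily the paper's choice."""
    saliency = np.abs(W) * np.linalg.norm(X_local, axis=0)  # per-input norms
    threshold = np.quantile(saliency, sparsity)             # layer comparison
    return saliency >= threshold                            # True = keep

def server_prune(W, masks):
    """Server aggregates binary masks only (no calibration data leaves the
    clients) and prunes the global model once, with no weight rescaling."""
    votes = np.mean(np.stack(masks), axis=0)
    return W * (votes >= 0.5)

# Toy run: three clients, one linear layer of shape (8, 16).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
masks = [client_mask(W, rng.normal(size=(32, 16))) for _ in range(3)]
W_pruned = server_prune(W, masks)
```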

[AI-115] Prompt Stability Matters: Evaluating and Optimizing Auto-Generated Prompt in General-Purpose Systems

【Quick Read】: This paper addresses the problem that existing automatic prompt generation methods focus narrowly on immediate task performance and neglect the intrinsic reliability of prompts, which limits interpretability and ignores the inherent stochasticity of large language models (LLMs). The key to the solution is introducing prompt stability as an evaluation criterion: semantic stability quantifies the consistency of a prompt's responses, a LLaMA-based evaluator is fine-tuned to measure it automatically across tasks, and the first stability-aware general-purpose prompt generation system is built on top of these components, using stability feedback to iteratively improve both prompt quality and system-level performance.

Link: https://arxiv.org/abs/2505.13546
Authors: Ke Chen, Yufei Zhou, Xitong Zhang, Haohan Wang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:


Abstract:Automatic prompt generation plays a crucial role in enabling general-purpose multi-agent systems to perform diverse tasks autonomously. Existing methods typically evaluate prompts based on their immediate task performance, overlooking the intrinsic qualities that determine their reliability. This outcome-centric view not only limits interpretability but also fails to account for the inherent stochasticity of large language models (LLMs). In this work, we bring attention to prompt stability-the consistency of model responses across repeated executions-as a key factor for building robust and effective prompt generation systems. To quantify this, we propose semantic stability as a criterion for assessing the response consistency of prompts, and fine-tune a LLaMA-based evaluator to measure it automatically across tasks. These components have enabled us to develop the first stability-aware general-purpose prompt generation system that leverages stability feedback to iteratively enhance both prompt quality and system-level performance. Furthermore, we establish a logical chain between prompt stability and task success by analyzing the structural dependencies within our system, proving stability as a necessary condition for effective system-level execution. Empirical results across general and domain-specific tasks demonstrate that our stability-aware framework improves both accuracy and output consistency. By shifting the focus from one-off results to persistent reliability, our work offers a new perspective on prompt design and contributes practical tools for building more trustworthy general-purpose systems.
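
Semantic stability, as described, measures response consistency across repeated executions. A minimal embedding-based approximation is sketched below; it is our stand-in (the paper fine-tunes a LLaMA-based evaluator instead), scoring a prompt by the mean pairwise cosine similarity of its sampled responses.

```python
from itertools import combinations
import numpy as np

def semantic_stability(responses, embed):
    """Mean pairwise cosine similarity over repeated responses to one
    prompt. `embed(text) -> np.ndarray` is an assumed embedding function;
    the paper uses a fine-tuned evaluator rather than raw embeddings."""
    vecs = [embed(r) for r in responses]
    sims = [
        float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in combinations(vecs, 2)
    ]
    return sum(sims) / len(sims) if sims else 1.0

# Usage sketch: sample the same prompt N times, then score consistency.
# responses = [llm(prompt) for _ in range(8)]
# stability = semantic_stability(responses, embed)
```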

[AI-116] Know Or Not: a library for evaluating out-of-knowledge base robustness

【Quick Read】: This paper addresses the reliability problems that hallucination causes for large language models (LLMs) in high-stakes applications: even in retrieval-augmented generation (RAG) setups, LLMs may still answer inaccurately when questions fall outside the knowledge base. The key to the solution is a methodology for systematically evaluating the out-of-knowledge-base (OOKB) robustness of LLMs in the RAG setting without manually annotated gold-standard answers. The method is implemented in knowornot, an open-source library whose core features are a unified API, a modular architecture, a rigorous data-modeling design, and customizable tooling, letting users build their own evaluation data and pipelines.

Link: https://arxiv.org/abs/2505.13545
Authors: Jessica Foo, Pradyumna Shyama Prasad, Shaun Khoo
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:


Abstract:While the capabilities of large language models (LLMs) have progressed significantly, their use in high-stakes applications has been limited due to risks of hallucination. One key approach in reducing hallucination is retrieval-augmented generation (RAG), but even in such setups, LLMs may still hallucinate when presented with questions outside of the knowledge base. Such behavior is unacceptable in high-stake applications where LLMs are expected to abstain from answering queries it does not have sufficient context on. In this work, we present a novel methodology for systematically evaluating out-of-knowledge base (OOKB) robustness of LLMs (whether LLMs know or do not know) in the RAG setting, without the need for manual annotation of gold standard answers. We implement our methodology in knowornot, an open-source library that enables users to develop their own customized evaluation data and pipelines for OOKB robustness. knowornot comprises four main features. Firstly, it provides a unified, high-level API that streamlines the process of setting up and running robustness benchmarks. Secondly, its modular architecture emphasizes extensibility and flexibility, allowing users to easily integrate their own LLM clients and RAG settings. Thirdly, its rigorous data modeling design ensures experiment reproducibility, reliability and traceability. Lastly, it implements a comprehensive suite of tools for users to customize their pipelines. We demonstrate the utility of knowornot by developing a challenging benchmark, PolicyBench, which spans four Question-Answer (QA) chatbots on government policies, and analyze its OOKB robustness. The source code of knowornot is available at this https URL.

[AI-117] Multi-head Temporal Latent Attention

【Quick Read】: This paper addresses the inference-efficiency bottleneck in Transformer self-attention caused by the key-value (KV) cache growing linearly with sequence length. The key to the solution is Multi-head Temporal Latent Attention (MTLA), which further compresses the KV cache along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA uses a hyper-network to dynamically merge temporally adjacent KV cache vectors and introduces a stride-aware causal mask to resolve the mismatch between the compressed KV cache and the processed sequence length, enabling efficient parallel training consistent with inference behavior.

Link: https://arxiv.org/abs/2505.13544
Authors: Keqi Deng, Philip C. Woodland
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:


Abstract:While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the KV cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA employs a hyper-network to dynamically merge temporally adjacent KV cache vectors. To address the mismatch between the compressed KV cache and processed sequence lengths, a stride-aware causal mask is proposed to ensure efficient parallel training and consistency with inference behaviour. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance compared to standard Multi-Head Attention (MHA), while greatly improving inference speed and GPU memory usage. For example, on a English-German speech translation task, MTLA achieves a 5.3x speedup and a reduction in GPU memory usage by a factor of 8.3 compared to MHA, while maintaining translation quality.
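
Two ideas in MTLA lend themselves to a sketch: merging temporally adjacent KV vectors with stride s, and a stride-aware causal mask that lets each query attend only to compressed KV slots lying entirely in its past. The code below is our simplified reading (the paper merges with a learned hyper-network; plain averaging is a stand-in):

```python
import torch

def merge_kv(kv, stride=2):
    """Compress the KV cache along time by averaging each stride-sized
    group (the paper uses a hyper-network; averaging is our stand-in).
    kv: (T, d) -> (ceil(T/stride), d)"""
    T, d = kv.shape
    pad = (-T) % stride
    kv = torch.cat([kv, kv.new_zeros(pad, d)]) if pad else kv
    return kv.view(-1, stride, d).mean(dim=1)

def stride_aware_causal_mask(T, stride=2):
    """Boolean mask (T_query, n_compressed_slots): query t may attend to
    compressed slot j only if the whole merged group ends at or before t.
    (In practice the still-open group would be kept uncompressed.)"""
    n_slots = (T + stride - 1) // stride
    t = torch.arange(T).unsqueeze(1)                       # query positions
    group_end = (torch.arange(n_slots) + 1) * stride - 1   # last pos in group
    return group_end.unsqueeze(0) <= t                     # (T, n_slots)

kv = torch.randn(7, 4)
print(merge_kv(kv).shape)                  # torch.Size([4, 4])
print(stride_aware_causal_mask(7).int())   # lower-staircase pattern
```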

[AI-118] RAGXplain: From Explainable Evaluation to Actionable Guidance of RAG Pipelines

【Quick Read】: This paper addresses the problem that traditional retrieval-augmented generation (RAG) evaluation only reports quantitative scores and offers no actionable guidance for complex pipelines. The key to the solution is RAGXplain, an evaluation framework that turns quantitative RAG assessments into clear insights: using large language model (LLM) reasoning, it converts raw scores into coherent narratives, identifies performance bottlenecks, and proposes targeted improvements, raising system performance and strengthening user trust in AI decisions.

Link: https://arxiv.org/abs/2505.13538
Authors: Dvir Cohen, Lin Burg, Gilad Barkan
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:


Abstract:Retrieval-Augmented Generation (RAG) systems show promise by coupling large language models with external knowledge, yet traditional RAG evaluation methods primarily report quantitative scores while offering limited actionable guidance for refining these complex pipelines. In this paper, we introduce RAGXplain, an evaluation framework that quantifies RAG performance and translates these assessments into clear insights that clarify the workings of its complex, multi-stage pipeline and offer actionable recommendations. Using LLM reasoning, RAGXplain converts raw scores into coherent narratives identifying performance gaps and suggesting targeted improvements. By providing transparent explanations for AI decision-making, our framework fosters user trust-a key challenge in AI adoption. Our LLM-based metric assessments show strong alignment with human judgments, and experiments on public question-answering datasets confirm that applying RAGXplain’s actionable recommendations measurably improves system performance. RAGXplain thus bridges quantitative evaluation and practical optimization, empowering users to understand, trust, and enhance their AI systems.

[AI-119] Information Extraction from Visually Rich Documents using LLM-based Organization of Documents into Independent Textual Segments ACL

【Quick Read】: This paper targets information extraction (IE) from visually rich documents (VRDs) that combine text with layout features. Traditional non-LLM NLP approaches lack reasoning, cannot infer values not explicitly present in documents, and generalize poorly to unseen formats, while recent generative LLM-based approaches can reason but struggle to read document layout cues and underperform on heterogeneous VRD benchmarks. The key to the solution is BLOCKIE, which organizes VRDs into localized, reusable semantic text blocks that are processed independently; through focused and more generalizable reasoning it raises F1 scores significantly, adapts better to new document formats, and extracts information not explicitly stated in documents.

Link: https://arxiv.org/abs/2505.13535
Authors: Aniket Bhattacharyya, Anurag Tripathi, Ujjal Das, Archan Karmakar, Amit Pathak, Maneesh Gupta
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL Main 2025


Abstract:Information extraction (IE) from Visually Rich Documents (VRDs) containing layout features along with text is a critical and well-studied task. Specialized non-LLM NLP-based solutions typically involve training models using both textual and geometric information to label sequences/tokens as named entities or answers to specific questions. However, these approaches lack reasoning, are not able to infer values not explicitly present in documents, and do not generalize well to new formats. Generative LLM-based approaches proposed recently are capable of reasoning, but struggle to comprehend clues from document layout especially in previously unseen document formats, and do not show competitive performance in heterogeneous VRD benchmark datasets. In this paper, we propose BLOCKIE, a novel LLM-based approach that organizes VRDs into localized, reusable semantic textual segments called "semantic blocks", which are processed independently. Through focused and more generalizable reasoning, our approach outperforms the state-of-the-art on public VRD benchmarks by 1-3% in F1 scores, is resilient to document formats previously not encountered and shows abilities to correctly extract information not explicitly present in documents.

[AI-120] FinMaster: A Holistic Benchmark for Mastering Full-Pipeline Financial Workflows with LLMs

【Quick Read】: This paper addresses the inadequacy of benchmarks for evaluating large language models (LLMs) in finance: insufficient domain-specific data, overly simplistic task design, and incomplete evaluation frameworks. The key to the solution is FinMaster, a comprehensive financial benchmark for systematically assessing LLM capabilities in financial literacy, accounting, auditing, and consulting. Its core components are FinSim, a simulator that generates synthetic, privacy-compliant financial data for companies; FinSuite, a collection of 183 core financial tasks of varying types and difficulty levels; and FinEval, a unified evaluation interface, together providing a comprehensive and challenging testbed for LLMs in complex financial scenarios.

Link: https://arxiv.org/abs/2505.13533
Authors: Junzhe Jiang, Chang Yang, Aixin Cui, Sihan Jin, Ruiyu Wang, Bo Li, Xiao Huang, Dongning Sun, Xinrun Wang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:Financial tasks are pivotal to global economic stability; however, their execution faces challenges including labor intensive processes, low error tolerance, data fragmentation, and tool limitations. Although large language models (LLMs) have succeeded in various natural language processing tasks and have shown potential in automating workflows through reasoning and contextual understanding, current benchmarks for evaluating LLMs in finance lack sufficient domain-specific data, have simplistic task design, and incomplete evaluation frameworks. To address these gaps, this article presents FinMaster, a comprehensive financial benchmark designed to systematically assess the capabilities of LLM in financial literacy, accounting, auditing, and consulting. Specifically, FinMaster comprises three main modules: i) FinSim, which builds simulators that generate synthetic, privacy-compliant financial data for companies to replicate market dynamics; ii) FinSuite, which provides tasks in core financial domains, spanning 183 tasks of various types and difficulty levels; and iii) FinEval, which develops a unified interface for evaluation. Extensive experiments over state-of-the-art LLMs reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to merely 40% on complex scenarios requiring multi-step reasoning. This degradation exhibits the propagation of computational errors, where single-metric calculations initially demonstrating 58% accuracy decreased to 37% in multimetric scenarios. To the best of our knowledge, FinMaster is the first benchmark that covers full-pipeline financial workflows with challenging tasks. We hope that FinMaster can bridge the gap between research and industry practitioners, driving the adoption of LLMs in real-world financial practices to enhance efficiency and accuracy.

[AI-121] Distributional Soft Actor-Critic with Harmonic Gradient for Safe and Efficient Autonomous Driving in Multi-lane Scenarios

【Quick Read】: This paper addresses the challenge of handling constraints when applying reinforcement learning (RL) in practice, in particular the conflict between efficient driving and safety constraints in autonomous driving systems. The key to the solution is a safety-oriented training technique called harmonic policy iteration (HPI): it computes two policy gradients, one tied to efficient driving and one to safety constraints, and derives a harmonic gradient for the policy update that minimizes the conflict between them, yielding a more balanced and stable training process.

Link: https://arxiv.org/abs/2505.13532
Authors: Feihong Zhang, Guojian Zhan, Bin Shuai, Tianyi Zhang, Jingliang Duan, Shengbo Eben Li
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: IEEE Intelligent Vehicles Symposium (IV 2025)


Abstract:Reinforcement learning (RL), known for its self-evolution capability, offers a promising approach to training high-level autonomous driving systems. However, handling constraints remains a significant challenge for existing RL algorithms, particularly in real-world applications. In this paper, we propose a new safety-oriented training technique called harmonic policy iteration (HPI). At each RL iteration, it first calculates two policy gradients associated with efficient driving and safety constraints, respectively. Then, a harmonic gradient is derived for policy updating, minimizing conflicts between the two gradients and consequently enabling a more balanced and stable training process. Furthermore, we adopt the state-of-the-art DSAC algorithm as the backbone and integrate it with our HPI to develop a new safe RL algorithm, DSAC-H. Extensive simulations in multi-lane scenarios demonstrate that DSAC-H achieves efficient driving performance with near-zero safety constraint violations.
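
The abstract leaves the exact harmonic-gradient construction to the paper, so the sketch below is one plausible instantiation (ours, not the authors'): when the efficiency gradient and the safety gradient conflict, project out each gradient's conflicting component, similar in spirit to PCGrad, and average the results.

```python
import numpy as np

def harmonic_gradient(g_eff, g_safe):
    """One plausible conflict-minimizing combination of two policy
    gradients (an assumption; the paper defines its own harmonic rule).
    If the gradients conflict (negative inner product), remove from each
    the component along the other, then average."""
    if g_eff @ g_safe < 0:
        g_eff_p = g_eff - (g_eff @ g_safe) / (g_safe @ g_safe) * g_safe
        g_safe_p = g_safe - (g_safe @ g_eff) / (g_eff @ g_eff) * g_eff
        return 0.5 * (g_eff_p + g_safe_p)
    return 0.5 * (g_eff + g_safe)

g_efficiency = np.array([1.0, 0.2])   # toy gradient for efficient driving
g_safety = np.array([-0.5, 1.0])      # toy gradient for the safety constraint
update = harmonic_gradient(g_efficiency, g_safety)
```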

[AI-122] LLM -Based User Simulation for Low-Knowledge Shilling Attacks on Recommender Systems

【Quick Read】: This paper addresses the security of recommender systems (RS) against shilling attacks, in which attackers inject fake user profiles to manipulate system outputs. Traditional attack strategies rely on simple heuristics, require access to internal RS data, and ignore the manipulation potential of textual reviews. The key to the solution is Agent4SR, a framework that uses large language model (LLM)-based agents to mount low-knowledge, high-impact shilling attacks through both rating and review generation; its core components are targeted profile construction, hybrid memory retrieval, and a review attack strategy that propagates target-item features across unrelated reviews, improving both effectiveness and stealth.

Link: https://arxiv.org/abs/2505.13528
Authors: Shengkang Gu, Jiahao Liu, Dongsheng Li, Guangping Zhang, Mingzhe Han, Hansu Gu, Peng Zhang, Ning Gu, Li Shang, Tun Lu
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 11 pages, under review


Abstract:Recommender systems (RS) are increasingly vulnerable to shilling attacks, where adversaries inject fake user profiles to manipulate system outputs. Traditional attack strategies often rely on simplistic heuristics, require access to internal RS data, and overlook the manipulation potential of textual reviews. In this work, we introduce Agent4SR, a novel framework that leverages Large Language Model (LLM)-based agents to perform low-knowledge, high-impact shilling attacks through both rating and review generation. Agent4SR simulates realistic user behavior by orchestrating adversarial interactions, selecting items, assigning ratings, and crafting reviews, while maintaining behavioral plausibility. Our design includes targeted profile construction, hybrid memory retrieval, and a review attack strategy that propagates target item features across unrelated reviews to amplify manipulation. Extensive experiments on multiple datasets and RS architectures demonstrate that Agent4SR outperforms existing low-knowledge baselines in both effectiveness and stealth. Our findings reveal a new class of emergent threats posed by LLM-driven agents, underscoring the urgent need for enhanced defenses in modern recommender systems.

[AI-123] Geography-Aware Large Language Models for Next POI Recommendation

【Quick Read】: This paper addresses next point-of-interest (POI) recommendation from users' historical movement data, a task central to location-based services and personalized applications. Traditional methods model geographic information and POI transition relations poorly, and although large language models (LLMs) bring strong semantic understanding and contextual reasoning, applying them to spatial tasks faces two key challenges: specific GPS coordinates appear too infrequently for LLMs to model precise spatial context, and a lack of knowledge about POI transitions limits their ability to capture latent POI-POI relationships. The key to the solution is the GA-LLM (Geography-Aware Large Language Model) framework with two specialized components: the Geographic Coordinate Injection Module (GCIM) converts GPS coordinates into spatial representations via hierarchical and Fourier-based positional encoding, giving the model a multi-perspective view of geographic features, while the POI Alignment Module (PAM) injects POI transition relations into the LLM's semantic space so that it can infer global POI relationships and generalize to unseen POIs.

Link: https://arxiv.org/abs/2505.13526
Authors: Zhao Liu, Wei Liu, Huajie Zhu, Jianxing Yu, Jian Yin, Wang-Chien Lee, Shun Wang
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 7 figures


Abstract:The next Point-of-Interest (POI) recommendation task aims to predict users’ next destinations based on their historical movement data and plays a key role in location-based services and personalized applications. Accurate next POI recommendation depends on effectively modeling geographic information and POI transition relations, which are crucial for capturing spatial dependencies and user movement patterns. While Large Language Models (LLMs) exhibit strong capabilities in semantic understanding and contextual reasoning, applying them to spatial tasks like next POI recommendation remains challenging. First, the infrequent nature of specific GPS coordinates makes it difficult for LLMs to model precise spatial contexts. Second, the lack of knowledge about POI transitions limits their ability to capture potential POI-POI relationships. To address these issues, we propose GA-LLM (Geography-Aware Large Language Model), a novel framework that enhances LLMs with two specialized components. The Geographic Coordinate Injection Module (GCIM) transforms GPS coordinates into spatial representations using hierarchical and Fourier-based positional encoding, enabling the model to understand geographic features from multiple perspectives. The POI Alignment Module (PAM) incorporates POI transition relations into the LLM’s semantic space, allowing it to infer global POI relationships and generalize to unseen POIs. Experiments on three real-world datasets demonstrate the state-of-the-art performance of GA-LLM.
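
The GCIM idea of Fourier-based positional encoding for GPS coordinates can be illustrated directly. The sketch below is generic Fourier-feature code (our illustration, not the module's actual design): each normalized coordinate is mapped through sines and cosines at multiple frequencies, giving the model a dense, multi-scale view of a coordinate that rarely repeats verbatim.

```python
import numpy as np

def fourier_encode_gps(lat, lon, num_freqs=8):
    """Map a (lat, lon) pair to a 4 * num_freqs Fourier feature vector.
    A generic illustration of Fourier positional encoding; the paper's
    GCIM also adds a hierarchical encoding not shown here."""
    coords = np.array([lat / 90.0, lon / 180.0])     # normalize to [-1, 1]
    freqs = 2.0 ** np.arange(num_freqs) * np.pi      # octave-spaced bands
    angles = coords[:, None] * freqs[None, :]        # (2, num_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=None)

vec = fourier_encode_gps(22.3193, 114.1694)          # a sample coordinate
print(vec.shape)                                     # (32,)
```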

[AI-124] ACPs: Agent Collaboration Protocols for the Internet of Agents

【Quick Read】: This paper addresses the interoperability, scalability, and coordination challenges facing today's autonomous agents, in particular the fragmented, scenario-specific nature of existing agent communication protocols such as MCP, A2A, and ANP. The key to the solution is a comprehensive protocol suite, Agent Collaboration Protocols (ACPs), comprising registration, discovery, interaction, and tooling protocols to support trustable access, capability orchestration, and workflow construction, laying the foundation for a secure, open, and scalable Internet of Agents infrastructure.

Link: https://arxiv.org/abs/2505.13523
Authors: Jun Liu, Ke Yu, Keliang Chen, Ke Li, Yuxinyue Qian, Xiaolian Guo, Haozhe Song, Yinming Li
Institution: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 7 pages, 8 figures


Abstract:With the rapid advancement of artificial intelligence, the proliferation of autonomous agents has introduced new challenges in interoperability, scalability, and coordination. The Internet of Agents (IoA) aims to interconnect heterogeneous agents through standardized communication protocols, enabling seamless collaboration and intelligent task execution. However, existing agent communication protocols such as MCP, A2A, and ANP remain fragmented and scenario-specific. To address this gap, we propose Agent Collaboration Protocols (ACPs), a comprehensive protocol suite for the IoA. ACPs include registration, discovery, interaction, and tooling protocols to support trustable access, capability orchestration, and workflow construction. We present the architecture, key technologies, and application workflows of ACPs, and demonstrate its effectiveness in a collaborative restaurant booking scenario. ACPs lay the foundation for building a secure, open, and scalable agent internet infrastructure.

[AI-125] A Heuristic Algorithm Based on Beam Search and Iterated Local Search for the Maritime Inventory Routing Problem

【Quick Read】: This paper addresses the difficulty of solving the Maritime Inventory Routing Problem (MIRP), where efficient methods for large instances and their variants are still lacking. Because the problem is so complex, exact methods based on mixed integer programming (MIP) are impractical for daily operations due to their CPU time, while non-MIP heuristics struggle even to construct an effective initial solution given how highly constrained the problem is. The key to the solution is a heuristic that does not rely on mathematical optimization techniques, combining a variation of a Beam Search algorithm with an Iterated Local Search procedure, and it improves the best-known solutions within acceptable computation time.

Link: https://arxiv.org/abs/2505.13522
Authors: Nathalie Sanghikian, Rafael Meirelles, Rafael Martinelli, Anand Subramanian
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:


Abstract:Maritime Inventory Routing Problem (MIRP) plays a crucial role in the integration of global maritime commerce levels. However, there are still no well-established methodologies capable of efficiently solving large MIRP instances or their variants due to the high complexity of the problem. The adoption of exact methods, typically based on Mixed Integer Programming (MIP), for daily operations is nearly impractical due to the CPU time required, as planning must be executed multiple times while ensuring high-quality results within acceptable time limits. Non-MIP-based heuristics are less frequently applied due to the highly constrained nature of the problem, which makes even the construction of an effective initial solution challenging. Papageorgiou et al. (2014) introduced a single-product MIRP as the foundation for MIRPLib, aiming to provide a collection of publicly available benchmark instances. However, only a few studies that propose new methodologies have been published since then. To encourage the use of MIRPLib and facilitate result comparisons, this study presents a heuristic approach that does not rely on mathematical optimization techniques to solve a deterministic, finite-horizon, single-product MIRP. The proposed heuristic combines a variation of a Beam Search algorithm with an Iterated Local Search procedure. Among the 72 instances tested, the developed methodology can improve the best-known solution for ten instances within an acceptable CPU time.
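
The two-phase structure of the heuristic (constructive beam search followed by iterated local search) can be outlined generically. The skeleton below is ours and abstracts away everything MIRP-specific (vessel routes, inventory feasibility) behind assumed callables:

```python
def beam_search(initial, expand, score, beam_width=10, depth=50):
    """Constructive phase: keep the beam_width best partial solutions.
    `expand(sol) -> [sol, ...]` and `score(sol) -> float` (lower is
    better) are assumed problem-specific callables, e.g. MIRP route
    extensions and cost-plus-infeasibility penalties."""
    beam = [initial]
    for _ in range(depth):
        candidates = [child for sol in beam for child in expand(sol)]
        if not candidates:
            break
        beam = sorted(candidates, key=score)[:beam_width]
    return min(beam, key=score)

def iterated_local_search(sol, perturb, local_search, score, iters=100):
    """Improvement phase: perturb, re-optimize locally, accept if better."""
    best = local_search(sol)
    for _ in range(iters):
        candidate = local_search(perturb(best))
        if score(candidate) < score(best):
            best = candidate
    return best
```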

[AI-126] Beyond Retrieval: Joint Supervision and Multimodal Document Ranking for Textbook Question Answering

【Quick Read】: This paper addresses textbook question answering (TQA), where complex multimodal context makes semantic alignment and task-specific document retrieval difficult. The key to the solution is a multi-objective joint training mechanism that enhances semantic representations: pairwise ranking is combined with implicit supervision derived from the answers to optimize the semantic representations of questions and documents, improving the relevance of retrieved documents. Built on a retriever-generator architecture in which a multimodal large language model generates the answers, the method is validated on the CK12-QA dataset and markedly improves the discrimination between informative and irrelevant documents.

Link: https://arxiv.org/abs/2505.13520
Authors: Hessa Alawwad, Usman Naseem, Areej Alhothali, Ali Alkhathlan, Amani Jamal
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 14 pages, 16 figures


Abstract:Textbook question answering (TQA) is a complex task, requiring the interpretation of complex multimodal context. Although recent advances have improved overall performance, they often encounter difficulties in educational settings where accurate semantic alignment and task-specific document retrieval are essential. In this paper, we propose a novel approach to multimodal textbook question answering by introducing a mechanism for enhancing semantic representations through multi-objective joint training. Our model, Joint Embedding Training With Ranking Supervision for Textbook Question Answering (JETRTQA), is a multimodal learning framework built on a retriever–generator architecture that uses a retrieval-augmented generation setup, in which a multimodal large language model generates answers. JETRTQA is designed to improve the relevance of retrieved documents in complex educational contexts. Unlike traditional direct scoring approaches, JETRTQA learns to refine the semantic representations of questions and documents through a supervised signal that combines pairwise ranking and implicit supervision derived from answers. We evaluate our method on the CK12-QA dataset and demonstrate that it significantly improves the discrimination between informative and irrelevant documents, even when they are long, complex, and multimodal. JETRTQA outperforms the previous state of the art, achieving a 2.4% gain in accuracy on the validation set and 11.1% on the test set.

[AI-127] HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems

【Quick Read】: This paper addresses the limited adaptability and flexibility of existing LLM-based multi-agent systems (MAS) in complex interaction environments, which leads to subpar performance on highly specialized, expert-level tasks. The key to the solution is HALO, a framework with a hierarchical reasoning architecture: a high-level planning agent decomposes the task, mid-level role-design agents instantiate subtask-specific agents, and low-level inference agents execute the subtasks, with subtask execution reformulated as a structured workflow search problem. An Adaptive Prompt Refinement module is also introduced to turn raw user queries into more accurate task-specific prompts.

Link: https://arxiv.org/abs/2505.13516
Authors: Zhipeng Hou, Junyi Tang, Yipeng Wang
Institution: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: The code repository is available at this https URL


Abstract:Recent advancements in Multi-Agent Systems (MAS) powered by Large Language Models (LLMs) have demonstrated tremendous potential in diverse task scenarios. Nonetheless, existing agentic systems typically rely on predefined agent-role design spaces and static communication structures, limiting their adaptability as well as flexibility in complex interaction environments and leading to subpar performance on highly specialized and expert-level tasks. To address these issues, we introduce HALO, a multi-agent collaboration framework based on a hierarchical reasoning architecture. Specifically, we incorporate a high-level planning agent for task decomposition, mid-level role-design agents for subtask-specific agent instantiation, and low-level inference agents for subtask execution. Particularly, subtask execution is reformulated as a structured workflow search problem, where Monte Carlo Tree Search (MCTS) systematically explores the agentic action space to construct optimal reasoning trajectories. Additionally, as the majority of users lack expertise in prompt engineering, we leverage an Adaptive Prompt Refinement module to transform raw queries into task-specific prompts. Empirical evaluations on Code Generation (HumanEval), General Reasoning (MMLU), and Arithmetic Reasoning (MATH) benchmark datasets highlight the effectiveness of HALO, yielding a 14.4% average improvement over state-of-the-art baselines. Notably, HALO achieves up to 13.3% performance gain on the Moral Scenarios subject in the MMLU benchmark and up to 19.6% performance gain on the Algebra subarea in the MATH benchmark, indicating its advanced proficiency in tackling highly specialized and expert-level tasks. The code repository is available at this https URL.

[AI-128] An agentic system with reinforcement-learned subsystem improvements for parsing form-like documents

【Quick Read】: This paper addresses extracting alphanumeric data from form-like documents such as invoices, purchase orders, bills, and financial documents; traditional approaches rely on optical character recognition (OCR) with learning algorithms or monolithic pipelines, leaving little room for systemic improvement. The key to the solution is an agentic AI system that combines large language model (LLM) agents with a reinforcement learning (RL) driver agent to achieve consistent, self-improving extraction under LLM inference uncertainty. The framework is modular and multi-agent, with task-specific prompts and a reward-and-penalty policy that guides a meta-prompting agent to learn from past errors and improve the prompt-based actor agents.

Link: https://arxiv.org/abs/2505.13504
Authors: Ayesha Amjad, Saurav Sthapit, Tahir Qasim Syed
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

点击查看摘要

Abstract:Extracting alphanumeric data from form-like documents such as invoices, purchase orders, bills, and financial documents is often performed via vision (OCR) and learning algorithms or monolithic pipelines with limited potential for systemic improvements. We propose an agentic AI system that leverages Large Language Model (LLM) agents and a reinforcement learning (RL) driver agent to automate consistent, self-improving extraction under LLM inference uncertainty. Our work highlights the limitations of monolithic LLM-based extraction and introduces a modular, multi-agent framework with task-specific prompts and an RL policy of rewards and penalties to guide a meta-prompting agent to learn from past errors and improve prompt-based actor agents. This self-corrective adaptive system handles diverse documents, file formats, layouts, and LLMs, aiming to automate accurate information extraction without the need for human intervention. Results reported on two benchmark datasets, SOIRE and CORD, are promising for the agentic AI framework.
zh
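
下面用一个 ε-greedy 多臂赌博机的玩具代码示意“以奖惩信号引导元提示代理改进提示”的思想(仅为概念草图,并非论文的 RL 实现):templates 中的提示与 evaluate 的奖励均为本文假设的示例,实际奖励应来自抽取结果与标注或一致性校验的比对。

```python
import random

# 示意性草图:用 ε-greedy 赌博机在若干候选提示模板中学习最优者。
templates = [
    "Extract all invoice fields as JSON.",
    "List field:value pairs found in the document.",
    "Return {vendor, date, total} from the text.",
]
values = [0.0] * len(templates)   # 各模板的平均奖励估计
counts = [0] * len(templates)

def evaluate(template: str) -> float:
    # 假设的奖励信号:实际中应由抽取质量评估给出
    return random.random()

for step in range(200):
    if random.random() < 0.1:                 # 探索
        arm = random.randrange(len(templates))
    else:                                     # 利用
        arm = max(range(len(templates)), key=lambda i: values[i])
    r = evaluate(templates[arm])
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]  # 增量均值更新

best = max(range(len(templates)), key=lambda i: values[i])
print("best prompt:", templates[best])
```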

[AI-129] Optimal Control for Transformer Architectures: Enhancing Generalization Robustness and Efficiency

【速读】:该论文试图解决Transformer模型在训练和架构设计中的性能优化问题,旨在通过理论驱动的方法提升模型的泛化能力和鲁棒性。解决方案的关键在于将最优控制理论(optimal control theory)引入Transformer的研究中,利用连续时间形式化的工具,为模型的训练和结构设计提供可操作的见解,从而实现更高效的参数使用和更好的测试性能。

链接: https://arxiv.org/abs/2505.13499
作者: Kelvin Kan,Xingjian Li,Benjamin J. Zhang,Tuhin Sahai,Stanley Osher,Markos A. Katsoulakis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We study Transformers through the perspective of optimal control theory, using tools from continuous-time formulations to derive actionable insights into training and architecture design. This framework improves the performance of existing Transformer models while providing desirable theoretical guarantees, including generalization and robustness. Our framework is designed to be plug-and-play, enabling seamless integration with established Transformer models and requiring only slight changes to the implementation. We conduct seven extensive experiments on tasks motivated by text generation, sentiment analysis, image classification, and point cloud classification. Experimental results show that the framework improves the test performance of the baselines, while being more parameter-efficient. On character-level text generation with nanoGPT, our framework achieves a 46% reduction in final test loss while using 42% fewer parameters. On GPT-2, our framework achieves a 5.6% reduction in final test loss, demonstrating scalability to larger models. To the best of our knowledge, this is the first work that applies optimal control theory to both the training and architecture of Transformers. It offers a new foundation for systematic, theory-driven improvements and moves beyond costly trial-and-error approaches.
zh
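
下面给出一个体现“最优控制视角”的最小草图(基于摘要的一种理解,并非论文框架本身):把残差堆叠视为离散时间动力系统 x_{l+1} = x_l + f_l(x_l),并把各层“控制量”的能量作为正则项加入损失;正则权重 1e-2 为假设的超参数。

```python
import torch, torch.nn as nn

class ResidualStack(nn.Module):
    def __init__(self, dim: int = 32, depth: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]
        )

    def forward(self, x):
        control_cost = x.new_zeros(())
        for f in self.layers:
            dx = f(x)                            # 每层的"控制量" f_l(x_l)
            control_cost = control_cost + (dx ** 2).mean()
            x = x + dx                           # 离散动力系统更新
        return x, control_cost

model = ResidualStack()
x = torch.randn(8, 32)
y, cost = model(x)
loss = y.pow(2).mean() + 1e-2 * cost             # 任务损失 + 控制能量正则
loss.backward()
```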

[AI-130] LODGE: Joint Hierarchical Task Planning and Learning of Domain Models with Grounded Execution

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在基于自然语言指令进行规划时产生的计划缺陷问题,以及现有方法需要大量人工反馈才能获得有效模型的局限性。其解决方案的关键在于学习分层领域(hierarchical domains),通过将低层次谓词和动作组合为高层次对应物,并利用仿真验证其前提条件和效果,从而提升长期规划任务的成功率。此外,引入一个中心错误推理器以确保不同规划层级之间的一致性。

链接: https://arxiv.org/abs/2505.13497
作者: Claudius Kienle,Benjamin Alt,Oleg Arenz,Jan Peters
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) enable planning from natural language instructions using implicit world knowledge, but often produce flawed plans that require refinement. Instead of directly predicting plans, recent methods aim to learn a problem domain that can be solved for different goal states using classical planners. However, these approaches require significant human feedback to obtain useful models. We address this shortcoming by learning hierarchical domains, where low-level predicates and actions are composed into higher-level counterparts, and by leveraging simulation to validate their preconditions and effects. This hierarchical approach is particularly powerful for long-horizon planning, where LLM-based planning approaches typically struggle. Furthermore, we introduce a central error reasoner to ensure consistency among the different planning levels. Evaluation on two challenging International Planning Competition (IPC) domains and a long-horizon robot manipulation task demonstrates higher planning success rates than state-of-the-art domain synthesis and LLM-modulo planning methods, while constructing high-quality models of the domain. Resources, videos and detailed experiment results are available at this https URL.
zh

[AI-131] ADALog: Adaptive Unsupervised Anomaly detection in Logs with Self-attention Masked Language Model ICML

【速读】:该论文旨在解决现代软件系统中异构日志数据的异常检测问题,这类数据具有动态格式、碎片化事件序列和变化的时间模式,使得异常检测既重要又具有挑战性。其解决方案的关键在于提出ADALog框架,该框架无需依赖日志解析、严格的序列依赖或标记数据,而是直接处理未结构化的单个日志,提取日志内部的上下文关系,并对正常数据进行自适应阈值设定。该方法采用基于Transformer的预训练双向编码器,通过掩码语言建模任务微调正常日志,以捕捉领域特定的语法和语义模式,从而在令牌级别上通过重构概率识别异常,并通过自适应分位数阈值进行评估。

链接: https://arxiv.org/abs/2505.13496
作者: Przemek Pospieszny,Wojciech Mormul,Karolina Szyndler,Sanjeev Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Conference paper accepted at ICMLT 2025; to appear in the IEEE Conference Proceedings

点击查看摘要

Abstract:Modern software systems generate extensive heterogeneous log data with dynamic formats, fragmented event sequences, and varying temporal patterns, making anomaly detection both crucial and challenging. To address these complexities, we propose ADALog, an adaptive, unsupervised anomaly detection framework designed for practical applicability across diverse real-world environments. Unlike traditional methods reliant on log parsing, strict sequence dependencies, or labeled data, ADALog operates on individual unstructured logs, extracts intra-log contextual relationships, and performs adaptive thresholding on normal data. The proposed approach utilizes a transformer-based, pretrained bidirectional encoder with a masked language modeling task, fine-tuned on normal logs to capture domain-specific syntactic and semantic patterns essential for accurate anomaly detection. Anomalies are identified via token-level reconstruction probabilities, aggregated into log-level scores, with adaptive percentile-based thresholding calibrated only on normal data. This allows the model to dynamically adapt to evolving system behaviors while avoiding rigid, heuristic-based thresholds common in traditional systems. We evaluate ADALog on benchmark datasets BGL, Thunderbird, and Spirit, showing strong generalization and competitive performance compared to state-of-the-art supervised and unsupervised methods. Additional ablation studies examine the effects of masking, fine-tuning, and token positioning on model behavior and interpretability.
zh
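
以下草图示意 ADALog 的打分与阈值环节(数据为随机模拟,并非论文实现):把 token 级重构概率聚合为日志级异常分数,并只在正常日志上用分位数校准阈值;真实系统中 token 概率应来自微调后的掩码语言模型。

```python
import numpy as np

rng = np.random.default_rng(0)

def log_score(token_probs: np.ndarray) -> float:
    # token 级负对数概率的均值作为日志级分数(越高越异常)
    return float(-np.log(token_probs + 1e-12).mean())

# 正常日志的 token 重构概率普遍较高;异常日志含低概率 token
normal_scores = [log_score(rng.uniform(0.6, 1.0, size=30)) for _ in range(1000)]
threshold = np.percentile(normal_scores, 99)      # 仅在正常数据上校准的分位数阈值

test_log = rng.uniform(0.05, 0.4, size=30)        # 模拟一条异常日志的 token 概率
print("anomalous:", log_score(test_log) > threshold)
```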

[AI-132] Algorithmic Tradeoffs in Fair Lending: Profitability Compliance and Long-Term Impact

【速读】:该论文试图解决金融领域中机器学习模型在自动化贷款决策过程中面临的算法公平性问题,特别是公平性约束(如人口统计均等或机会均等)与放贷方盈利能力之间的权衡。其解决方案的关键在于通过反映真实贷款模式的合成数据进行模拟,量化不同公平性干预措施对利润率和违约率的影响;结果表明,机会均等约束的利润代价通常低于人口统计均等,而令人意外的是,从模型中去除受保护属性(即“无意识公平”,fairness through unawareness)在公平性与盈利能力两项指标上均优于显式公平性干预。

链接: https://arxiv.org/abs/2505.13469
作者: Aayam Bansal,Harsh Vardhan Narsaria
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注: 8 pages

点击查看摘要

Abstract:As financial institutions increasingly rely on machine learning models to automate lending decisions, concerns about algorithmic fairness have risen. This paper explores the tradeoff between enforcing fairness constraints (such as demographic parity or equal opportunity) and maximizing lender profitability. Through simulations on synthetic data that reflects real-world lending patterns, we quantify how different fairness interventions impact profit margins and default rates. Our results demonstrate that equal opportunity constraints typically impose lower profit costs than demographic parity, but surprisingly, removing protected attributes from the model (fairness through unawareness) outperforms explicit fairness interventions in both fairness and profitability metrics. We further identify the specific economic conditions under which fair lending becomes profitable and analyze the feature-specific drivers of unfairness. These findings offer practical guidance for designing lending algorithms that balance ethical considerations with business objectives.
zh
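
下面的草图在合成贷款数据上同时计算利润与人口统计均等差距,示意论文所度量的权衡(数据生成过程、利率与违约模型均为本文假设的示例):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
group = rng.integers(0, 2, n)                   # 受保护属性(0/1 两组)
score = rng.normal(0.5 + 0.1 * group, 0.2, n)   # 模拟信用评分
approve = score > 0.55                          # 简单的审批策略
default = rng.random(n) < np.clip(0.6 - score, 0.02, 0.9)

# 利润:按期还款赚取利息(假设利率 20%),违约损失本金
profit = np.where(approve, np.where(default, -1.0, 0.2), 0.0).sum()

# 人口统计均等差距:两组批准率之差
dp_gap = abs(approve[group == 0].mean() - approve[group == 1].mean())
print(f"profit={profit:.0f}, demographic parity gap={dp_gap:.3f}")
```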

[AI-133] AgentSGEN: Multi-Agent LLM in the Loop for Semantic Collaboration and GENeration of Synthetic Data

【速读】:该论文试图解决安全关键型应用(如施工安全)中由于伦理和物流障碍导致的真实场景数据稀缺问题,这限制了AI系统的训练效果。其解决方案的关键在于提出一种基于大语言模型(LLM)的多智能体框架,通过评估代理(Evaluator Agent)与编辑代理(Editor Agent)的迭代闭环协作,确保生成的合成数据在语义一致性和安全特定约束方面具有深度,从而生成符合实际规范且兼顾安全要求与视觉语义的合成场景。

链接: https://arxiv.org/abs/2505.13466
作者: Vu Dinh Xuan,Hao Vo,David Murphy,Hoang D. Nguyen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The scarcity of data depicting dangerous situations presents a major obstacle to training AI systems for safety-critical applications, such as construction safety, where ethical and logistical barriers hinder real-world data collection. This creates an urgent need for an end-to-end framework to generate synthetic data that can bridge this gap. While existing methods can produce synthetic scenes, they often lack the semantic depth required for scene simulations, limiting their effectiveness. To address this, we propose a novel multi-agent framework that employs an iterative, in-the-loop collaboration between two agents: an Evaluator Agent, acting as an LLM-based judge to enforce semantic consistency and safety-specific constraints, and an Editor Agent, which generates and refines scenes based on this guidance. Powered by LLM’s capabilities to reasoning and common-sense knowledge, this collaborative design produces synthetic images tailored to safety-critical scenarios. Our experiments suggest this design can generate useful scenes based on realistic specifications that address the shortcomings of prior approaches, balancing safety requirements with visual semantics. This iterative process holds promise for delivering robust, aesthetically sound simulations, offering a potential solution to the data scarcity challenge in multimedia safety applications.
zh
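
以下为评估代理与编辑代理迭代闭环的极简草图(llm 为假设的占位接口,约束与违例判定被简化为字符串匹配,并非论文的提示设计):

```python
# 示意性草图:Evaluator/Editor 双代理的 in-the-loop 迭代。

def llm(prompt: str) -> str:
    return f"<scene spec refined from: {prompt[:40]}...>"

def evaluator(scene: str, constraints: list[str]) -> list[str]:
    # 充当"裁判"的 LLM:返回未满足的约束(此处简化为子串匹配)
    return [c for c in constraints if c not in scene]

def editor(scene: str, violations: list[str]) -> str:
    return llm(f"Fix the scene: {scene}; violations: {violations}")

constraints = ["worker wears helmet", "guardrail present"]
scene = "construction site near an open edge"
for _ in range(5):                       # 迭代闭环,最多 5 轮
    violations = evaluator(scene, constraints)
    if not violations:
        break
    scene = editor(scene, violations)
print(scene)
```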

[AI-134] Pel: A Programming Language for Orchestrating AI Agents

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在控制和编排其能力方面存在的挑战,特别是在超越简单文本生成之外的表达性、可扩展性、成本、安全性和细粒度控制方面的限制。解决方案的关键在于引入一种名为Pel的新编程语言,该语言基于Lisp、Elixir、Gleam和Haskell的优势,提供了一个语法简洁、自指性强且语义丰富的平台,使LLMs能够安全高效地表达复杂操作、控制流和多智能体通信。Pel的设计强调了一个易于修改的最小语法,适用于受限的LLM生成环境,并通过语法级别的能力控制消除了对复杂沙箱机制的需求。其关键特性包括强大的管道机制、一等闭包支持、内置自然语言条件评估以及先进的REPeL交互环境,从而提升了LLM编排的鲁棒性、安全性和表达力。

链接: https://arxiv.org/abs/2505.13453
作者: Behnam Mohammadi
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: Added relevant figures and the section 4.5

点击查看摘要

Abstract:The proliferation of Large Language Models (LLMs) has opened new frontiers in computing, yet controlling and orchestrating their capabilities beyond simple text generation remains a challenge. Current methods, such as function/tool calling and direct code generation, suffer from limitations in expressiveness, scalability, cost, security, and the ability to enforce fine-grained control. This paper introduces Pel, a novel programming language specifically designed to bridge this gap. Inspired by the strengths of Lisp, Elixir, Gleam, and Haskell, Pel provides a syntactically simple, homoiconic, and semantically rich platform for LLMs to express complex actions, control flow, and inter-agent communication safely and efficiently. Pel’s design emphasizes a minimal, easily modifiable grammar suitable for constrained LLM generation, eliminating the need for complex sandboxing by enabling capability control at the syntax level. Key features include a powerful piping mechanism for linear composition, first-class closures enabling easy partial application and functional patterns, built-in support for natural language conditions evaluated by LLMs, and an advanced Read-Eval-Print-Loop (REPeL) with Common Lisp-style restarts and LLM-powered helper agents for automated error correction. Furthermore, Pel incorporates automatic parallelization of independent operations via static dependency analysis, crucial for performant agentic systems. We argue that Pel offers a more robust, secure, and expressive paradigm for LLM orchestration, paving the way for more sophisticated and reliable AI agentic frameworks.
zh
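
摘要未给出 Pel 的具体语法,下面用 Python 模拟其强调的两个语言特性:线性管道组合与一等闭包/部分应用,仅作概念演示,不代表 Pel 的实际语义。

```python
from functools import partial, reduce

def pipe(value, *fns):
    """把 value 依次送入 fns,模拟 Pel 式的管道组合。"""
    return reduce(lambda acc, f: f(acc), fns, value)

add = lambda k: lambda x: x + k             # 闭包实现部分应用
result = pipe(3, add(4), partial(pow, 2))   # 先 3+4=7,再计算 2**7
print(result)  # 128
```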

[AI-135] LLM Context Conditioning and PWP Prompting for Multimodal Validation of Chemical Formulas

【速读】:该论文试图解决在复杂科学和技术文档中识别细微技术错误的问题,尤其是需要多模态解释(如图像中的公式)的情况。传统大型语言模型(Large Language Models, LLMs)固有的错误修正倾向可能会掩盖实际存在的不准确性。该研究提出的关键解决方案是基于持续工作流提示(Persistent Workflow Prompting, PWP)原则的结构化LLM上下文条件化方法,旨在在推理阶段调节LLM的行为,从而提高通用LLM(如Gemini 2.5 Pro和ChatGPT Plus o3)在精确验证任务中的可靠性,且仅依赖其标准聊天界面,无需API访问或模型修改。

链接: https://arxiv.org/abs/2505.12257
作者: Evgeny Markhasin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
备注: 10 pages

点击查看摘要

Abstract:Identifying subtle technical errors within complex scientific and technical documents, especially those requiring multimodal interpretation (e.g., formulas in images), presents a significant hurdle for Large Language Models (LLMs) whose inherent error-correction tendencies can mask inaccuracies. This exploratory proof-of-concept (PoC) study investigates structured LLM context conditioning, informed by Persistent Workflow Prompting (PWP) principles, as a methodological strategy to modulate this LLM behavior at inference time. The approach is designed to enhance the reliability of readily available, general-purpose LLMs (specifically Gemini 2.5 Pro and ChatGPT Plus o3) for precise validation tasks, crucially relying only on their standard chat interfaces without API access or model modifications. To explore this methodology, we focused on validating chemical formulas within a single, complex test paper with known textual and image-based errors. Several prompting strategies were evaluated: while basic prompts proved unreliable, an approach adapting PWP structures to rigorously condition the LLM’s analytical mindset appeared to improve textual error identification with both models. Notably, this method also guided Gemini 2.5 Pro to repeatedly identify a subtle image-based formula error previously overlooked during manual review, a task where ChatGPT Plus o3 failed in our tests. These preliminary findings highlight specific LLM operational modes that impede detail-oriented validation and suggest that PWP-informed context conditioning offers a promising and highly accessible technique for developing more robust LLM-driven analytical workflows, particularly for tasks requiring meticulous error detection in scientific and technical documents. Extensive validation beyond this limited PoC is necessary to ascertain broader applicability.
zh

[AI-136] SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification INTERSPEECH2025

【速读】:该论文试图解决自监督学习(Self-Supervised Learning, SSL)在说话人验证(Speaker Verification, SV)中因使用同话语正样本采样和数据增强而导致的性能瓶颈问题,该策略主要编码了录音条件中的通道信息,而非说话人身份信息。解决方案的关键在于提出了一种新的正样本采样技术——自监督正样本采样(Self-Supervised Positive Sampling, SSPS),通过在潜在空间中利用聚类分配和正样本嵌入的记忆队列,寻找与锚点属于同一说话人但录音条件不同的正样本,从而降低说话人内方差并提升SV性能。

链接: https://arxiv.org/abs/2505.14561
作者: Theo Lepage,Reda Dehak
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注: accepted at Interspeech 2025

点击查看摘要

Abstract:Self-Supervised Learning (SSL) has led to considerable progress in Speaker Verification (SV). The standard framework uses same-utterance positive sampling and data-augmentation to generate anchor-positive pairs of the same speaker. This is a major limitation, as this strategy primarily encodes channel information from the recording condition, shared by the anchor and positive. We propose a new positive sampling technique to address this bottleneck: Self-Supervised Positive Sampling (SSPS). For a given anchor, SSPS aims to find an appropriate positive, i.e., of the same speaker identity but a different recording condition, in the latent space using clustering assignments and a memory queue of positive embeddings. SSPS improves SV performance for both SimCLR and DINO, reaching 2.57% and 2.53% EER, outperforming SOTA SSL methods on VoxCeleb1-O. In particular, SimCLR-SSPS achieves a 58% EER reduction by lowering intra-speaker variance, providing comparable performance to DINO-SSPS.
zh
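
以下草图示意 SSPS 的核心选样逻辑(数据为随机模拟,维度与聚类数均为假设):在记忆队列中挑选与锚点同一聚类(近似同一说话人)但录音编号不同的嵌入作为正样本,找不到时逐级回退。

```python
import numpy as np

rng = np.random.default_rng(0)
queue = rng.normal(size=(500, 64))              # 正样本嵌入的记忆队列
queue /= np.linalg.norm(queue, axis=1, keepdims=True)
cluster_of = rng.integers(0, 50, 500)           # 队列内各嵌入的聚类分配
utt_of = rng.integers(0, 1000, 500)             # 对应的录音(utterance)编号

def sample_positive(anchor_cluster: int, anchor_utt: int) -> np.ndarray:
    # 首选:同聚类(近似同一说话人)且不同录音条件
    mask = (cluster_of == anchor_cluster) & (utt_of != anchor_utt)
    candidates = np.flatnonzero(mask)
    if candidates.size == 0:                    # 回退:同聚类任意样本
        candidates = np.flatnonzero(cluster_of == anchor_cluster)
    if candidates.size == 0:                    # 再回退:整个队列
        candidates = np.arange(len(queue))
    return queue[rng.choice(candidates)]

pos = sample_positive(anchor_cluster=7, anchor_utt=42)
print(pos.shape)  # (64,)
```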

[AI-137] Benchmarking data encoding methods in Quantum Machine Learning

【速读】:该论文试图解决量子机器学习(Quantum Machine Learning, QML)中数据编码方法选择的难题,即如何根据特定数据集选择合适的编码方式以提升模型性能。解决方案的关键在于研究并基准测试多种常用的编码方法,以评估其在不同数据集上的表现,从而为实际应用提供指导。

链接: https://arxiv.org/abs/2505.14295
作者: Orlane Zang,Grégoire Barrué,Tony Quertier
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 30 pages, 8 figures

点击查看摘要

Abstract:Data encoding plays a fundamental and distinctive role in Quantum Machine Learning (QML). While classical approaches process data directly as vectors, QML may require transforming classical data into quantum states through encoding circuits, known as quantum feature maps or quantum embeddings. This step leverages the inherently high-dimensional and non-linear nature of Hilbert space, enabling more efficient data separation in complex feature spaces that may be inaccessible to classical methods. This encoding part significantly affects the performance of the QML model, so it is important to choose the right encoding method for the dataset to be encoded. However, this choice is generally arbitrary, since there is no “universal” rule for knowing which encoding to choose based on a specific set of data. There are currently a variety of encoding methods using different quantum logic gates. We studied the most commonly used types of encoding methods and benchmarked them using different datasets.
zh
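
作为被基准测试的编码方法之一,下面用纯 NumPy 演示最常见的角度编码(angle encoding):每个特征经 RY 旋转编码到一个量子比特,整体态为各比特态的张量积。这只是其中一种编码的示意,并不代表论文的全部实验设置。

```python
import numpy as np
from functools import reduce

def angle_encode(x: np.ndarray) -> np.ndarray:
    # RY(theta)|0> = [cos(theta/2), sin(theta/2)]
    qubits = [np.array([np.cos(t / 2), np.sin(t / 2)]) for t in x]
    return reduce(np.kron, qubits)            # 2**n 维态矢量

state = angle_encode(np.array([0.3, 1.2, 2.5]))
print(state.shape, np.isclose(np.linalg.norm(state), 1.0))  # (8,) True
```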

[AI-138] Articulatory Feature Prediction from Surface EMG during Speech Production INTERSPEECH2025

【速读】:该论文试图解决从表面肌电图(surface electromyography, EMG)信号中预测发音特征并将其转换为可理解语音波形的问题。解决方案的关键在于提出一种融合卷积层和Transformer块的模型,并通过独立的预测器对发音特征进行建模,从而实现高相关性的预测结果(约为0.9),并进一步将预测的发音特征解码为语音波形,这在现有研究中尚属首次。

链接: https://arxiv.org/abs/2505.13814
作者: Jihwan Lee,Kevin Huang,Kleanthis Avramidis,Simon Pistrosch,Monica Gonzalez-Machorro,Yoonjeong Lee,Björn Schuller,Louis Goldstein,Shrikanth Narayanan
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted for Interspeech2025

点击查看摘要

Abstract:We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. Our approach achieves a high prediction correlation of approximately 0.9 for most articulatory features. Furthermore, we demonstrate that these predicted articulatory features can be decoded into intelligible speech waveforms. To our knowledge, this is the first method to decode speech waveforms from surface EMG via articulatory features, offering a novel approach to EMG-based speech synthesis. Additionally, we analyze the relationship between EMG electrode placement and articulatory feature predictability, providing knowledge-driven insights for optimizing EMG electrode configurations. The source code and decoded speech samples are publicly available.
zh
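
按摘要描述,模型为“卷积前端 + Transformer 块 + 各发音特征独立预测头”。下面给出一个 PyTorch 骨架草图,其中通道数、层数与发音特征数(emg_ch=8、n_feats=9 等)均为假设,并非论文配置:

```python
import torch, torch.nn as nn

class EMG2Articulation(nn.Module):
    def __init__(self, emg_ch=8, dim=128, n_feats=9):
        super().__init__()
        self.conv = nn.Sequential(                       # 时域卷积前端
            nn.Conv1d(emg_ch, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # 每个发音特征一个独立预测头
        self.heads = nn.ModuleList(nn.Linear(dim, 1) for _ in range(n_feats))

    def forward(self, emg):                  # emg: (B, C, T)
        h = self.conv(emg).transpose(1, 2)   # -> (B, T, dim)
        h = self.encoder(h)
        return torch.cat([head(h) for head in self.heads], dim=-1)

out = EMG2Articulation()(torch.randn(2, 8, 100))
print(out.shape)  # torch.Size([2, 100, 9])
```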

[AI-139] Randomised Optimism via Competitive Co-Evolution for Matrix Games with Bandit Feedback IJCAI2025

【速读】:该论文旨在解决在收益矩阵未知、仅有赌博机反馈(bandit feedback)的两人零和矩阵博弈中的学习问题,其中每个玩家只能观察到自己的行动及对应的噪声收益。现有研究主要关注确定性乐观策略(如UCB)在该场景下的有效性,而随机乐观策略在矩阵博弈中的潜力尚未得到理论探索。论文提出的解决方案是Competitive Co-evolutionary Bandit Learning(CoEBL),其关键在于将进化算法(EAs)引入赌博机学习框架,通过进化算法的变异算子实现随机乐观策略,并在理论上证明CoEBL能够达到次线性遗憾,与基于确定性乐观的方法性能相当。

链接: https://arxiv.org/abs/2505.13562
作者: Shishen Lin
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 21 pages, 10 figures, accepted at IJCAI 2025

点击查看摘要

Abstract:Learning in games is a fundamental problem in machine learning and artificial intelligence, with numerous applications (Silver et al., 2016; Schrittwieser et al., 2020). This work investigates two-player zero-sum matrix games with an unknown payoff matrix and bandit feedback, where each player observes their actions and the corresponding noisy payoff. Prior studies have proposed algorithms for this setting (O'Donoghue et al., 2021; Maiti et al., 2023; Cai et al., 2024), with O'Donoghue et al. (2021) demonstrating the effectiveness of deterministic optimism (e.g., UCB) in achieving sublinear regret. However, the potential of randomised optimism in matrix games remains theoretically unexplored. We propose Competitive Co-evolutionary Bandit Learning (CoEBL), a novel algorithm that integrates evolutionary algorithms (EAs) into the bandit framework to implement randomised optimism through EA variation operators. We prove that CoEBL achieves sublinear regret, matching the performance of deterministic optimism-based methods. To the best of our knowledge, this is the first theoretical regret analysis of an evolutionary bandit learning algorithm in matrix games. Empirical evaluations on diverse matrix game benchmarks demonstrate that CoEBL not only achieves sublinear regret but also consistently outperforms classical bandit algorithms, including Exp3 (Auer et al., 2002), the variant of Cai et al. (2024), and UCB (O'Donoghue et al., 2021). These results highlight the potential of evolutionary bandit learning, particularly the efficacy of randomised optimism via evolutionary algorithms in game-theoretic settings.
zh
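
下面是一个高度简化的玩具示例,仅用于体会“用变异算子实现随机乐观”的直觉(并非 CoEBL 原算法,亦不具备其遗憾保证):在噪声赌博机反馈下,两名玩家以一定概率变异当前经验最优动作。

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])              # 石头剪刀布的行玩家收益矩阵

n = A.shape[0]
sums = np.zeros((2, n))
counts = np.ones((2, n))                      # 以 1 初始化避免除零(轻微偏置)

def select(player: int) -> int:
    if rng.random() < 0.2:                    # "变异算子":随机乐观式探索
        return int(rng.integers(n))
    return int(np.argmax(sums[player] / counts[player]))

for t in range(5000):
    i, j = select(0), select(1)
    payoff = A[i, j] + rng.normal(0, 0.1)     # 噪声赌博机反馈
    sums[0, i] += payoff;  counts[0, i] += 1
    sums[1, j] -= payoff;  counts[1, j] += 1

print("行玩家各动作的经验均值收益:", sums[0] / counts[0])
```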

[AI-140] Learning to Program Quantum Measurements for Machine Learning

【速读】:该论文试图解决量子机器学习(Quantum Machine Learning, QML)模型中可观测量(observable)的动态编程问题,以及传统QML模型在测量方案设计上的不足。其关键解决方案是提出一种创新框架,使量子系统的可观测量——特别是厄米特矩阵(Hermitian matrix)——可训练,通过端到端的可微分学习框架,同时优化用于编程参数化可观测量的神经网络和标准量子电路参数,从而实现可观测量根据输入数据流实时自适应调整。

链接: https://arxiv.org/abs/2505.13525
作者: Samual Yen-Chi Chen,Huan-Hsin Tseng,Hsin-Yi Lin,Shinjae Yoo
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:The rapid advancements in quantum computing (QC) and machine learning (ML) have sparked significant interest, driving extensive exploration of quantum machine learning (QML) algorithms to address a wide range of complex challenges. The development of high-performance QML models requires expert-level expertise, presenting a key challenge to the widespread adoption of QML. Critical obstacles include the design of effective data encoding strategies and parameterized quantum circuits, both of which are vital for the performance of QML models. Furthermore, the measurement process is often neglected-most existing QML models employ predefined measurement schemes that may not align with the specific requirements of the targeted problem. We propose an innovative framework that renders the observable of a quantum system-specifically, the Hermitian matrix-trainable. This approach employs an end-to-end differentiable learning framework, enabling simultaneous optimization of the neural network used to program the parameterized observables and the standard quantum circuit parameters. Notably, the quantum observable parameters are dynamically programmed by the neural network, allowing the observables to adapt in real time based on the input data stream. Through numerical simulations, we demonstrate that the proposed method effectively programs observables dynamically within variational quantum circuits, achieving superior results compared to existing approaches. Notably, it delivers enhanced performance metrics, such as higher classification accuracy, thereby significantly improving the overall effectiveness of QML models.
zh
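
以下草图示意“可训练可观测量”的核心构造(维度与网络结构均为假设):把 H 参数化为 B + B† 以保证厄米性,其中 B 由小型网络从输入数据动态生成,期望值 ⟨ψ|H|ψ⟩ 可端到端反传。

```python
import torch

d = 4                                    # 2 量子比特系统的希尔伯特空间维度
net = torch.nn.Linear(3, d * d * 2)      # 由输入特征生成 B 的实部与虚部

def observable(x: torch.Tensor) -> torch.Tensor:
    out = net(x).reshape(2, d, d)
    B = torch.complex(out[0], out[1])
    return B + B.conj().T                # H = B + B†,保证厄米性

psi = torch.randn(d, dtype=torch.cfloat)
psi = psi / psi.norm()                   # 归一化量子态
x = torch.randn(3)                       # 输入数据流中的一个样本

H = observable(x)                        # 可观测量随输入动态生成
expval = torch.real(psi.conj() @ H @ psi)   # 期望值 <psi|H|psi> 为实数
expval.backward()                            # 梯度可反传至 net 的参数
```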

[AI-141] Continuous Domain Generalization

【速读】:该论文试图解决传统领域泛化方法在处理现实世界数据分布连续变化时的局限性,这些变化通常涉及多个潜在因素(如时间、地理和社会经济背景),而现有方法往往将领域视为离散或沿单一轴(如时间)演化,无法捕捉真实世界的多维复杂性。解决方案的关键在于提出连续领域泛化(Continuous Domain Generalization, CDG)任务,并构建一个基于几何与代数理论的框架,证明跨领域的最优模型参数位于低维流形上。为此,论文提出了神经李变换算子(Neural Lie Transport Operator, NeuralLTO),通过强制几何连续性和代数一致性来实现结构化的参数迁移,并引入门控机制和局部坐标图策略以应对噪声或不完整的领域描述符。

链接: https://arxiv.org/abs/2505.13519
作者: Zekun Cai,Yiheng Yao,Guangji Bai,Renhe Jiang,Xuan Song,Ryosuke Shibasaki,Liang Zhao
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages, 9 figures

点击查看摘要

Abstract:Real-world data distributions often shift continuously across multiple latent factors such as time, geography, and socioeconomic context. However, existing domain generalization approaches typically treat domains as discrete or evolving along a single axis (e.g., time), which fails to capture the complex, multi-dimensional nature of real-world variation. This paper introduces the task of Continuous Domain Generalization (CDG), which aims to generalize predictive models to unseen domains defined by arbitrary combinations of continuous variation descriptors. We present a principled framework grounded in geometric and algebraic theory, showing that optimal model parameters across domains lie on a low-dimensional manifold. To model this structure, we propose a Neural Lie Transport Operator (NeuralLTO), which enables structured parameter transitions by enforcing geometric continuity and algebraic consistency. To handle noisy or incomplete domain descriptors, we introduce a gating mechanism to suppress irrelevant dimensions and a local chart-based strategy for robust generalization. Extensive experiments on synthetic and real-world datasets, including remote sensing, scientific documents, and traffic forecasting, demonstrate that our method significantly outperforms existing baselines in generalization accuracy and robustness under descriptor imperfections.
zh

[AI-142] Data Balancing Strategies: A Survey of Resampling and Augmentation Methods

【速读】:该论文旨在解决机器学习中由于类别标签分布不均导致的模型预测偏差和准确率下降问题(imbalanced data)。其解决方案的关键在于通过多种重采样策略对数据进行平衡,包括合成过采样、自适应技术、生成模型、基于集成的方法、混合方法、欠采样以及基于邻居的方法,其中生成式AI(Generative AI)如生成对抗网络(GANs)和变分自编码器(VAEs)被重点提及,因其能够生成高质量的合成样本以改善少数类的表示。

链接: https://arxiv.org/abs/2505.13518
作者: Behnam Yousefimehr,Mehdi Ghatee,Mohammad Amin Seifi,Javad Fazli,Sajed Tavakoli,Zahra Rafei,Shervin Ghaffari,Abolfazl Nikahd,Mahdi Razi Gandomani,Alireza Orouji,Ramtin Mahmoudi Kashani,Sarina Heshmati,Negin Sadat Mousavi
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Imbalanced data poses a significant obstacle in machine learning, as an unequal distribution of class labels often results in skewed predictions and diminished model accuracy. To mitigate this problem, various resampling strategies have been developed, encompassing both oversampling and undersampling techniques aimed at modifying class proportions. Conventional oversampling approaches like SMOTE enhance the representation of the minority class, whereas undersampling methods focus on trimming down the majority class. Advances in deep learning have facilitated the creation of more complex solutions, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), which are capable of producing high-quality synthetic examples. This paper reviews a broad spectrum of data balancing methods, classifying them into categories including synthetic oversampling, adaptive techniques, generative models, ensemble-based strategies, hybrid approaches, undersampling, and neighbor-based methods. Furthermore, it highlights current developments in resampling techniques and discusses practical implementations and case studies that validate their effectiveness. The paper concludes by offering perspectives on potential directions for future exploration in this domain.
zh
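
作为文中综述的合成过采样方法的代表,下面手写一个最小版 SMOTE(在少数类样本与其 k 近邻之间线性插值生成合成样本);实际应用中可直接使用 imbalanced-learn 等库的成熟实现:

```python
import numpy as np

def smote(X_min: np.ndarray, n_new: int, k: int = 5,
          rng=np.random.default_rng(0)) -> np.ndarray:
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]          # 排除自身的 k 近邻
        j = rng.choice(neighbors)
        lam = rng.random()                          # 插值系数
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote(X_minority, n_new=40)
print(X_new.shape)  # (40, 2)
```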

[AI-143] Exploring Emotional Synchrony in Dyadic Interactions: The Role of Speech Conditions in Facial and Vocal Affective Alignment

【速读】:该论文试图解决多模态情感表达与同步问题,特别是面部表情和语音在对话中的时空对齐机制。其关键解决方案是通过分析IEMOCAP数据集中的双人互动,利用EmoNet和基于Wav2Vec2的模型提取连续情感估计,并依据语音重叠情况进行分类,进而通过皮尔逊相关系数、滞后调整分析和动态时间规整(DTW)评估情感同步性。研究揭示了非重叠语音相较于重叠语音在情感同步性上的稳定性与可预测性,以及不同对话结构下情感表达的时序方向性差异。

链接: https://arxiv.org/abs/2505.13455
作者: Von Ralph Dane Marquez Herbuela,Yukie Nagai
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding how humans express and synchronize emotions across multiple communication channels, particularly facial expressions and speech, has significant implications for emotion recognition systems and human-computer interaction. Motivated by the notion that non-overlapping speech promotes clearer emotional coordination, while overlapping speech disrupts synchrony, this study examines how these conversational dynamics shape the spatial and temporal alignment of arousal and valence across facial and vocal modalities. Using dyadic interactions from the IEMOCAP dataset, we extracted continuous emotion estimates via EmoNet (facial video) and a Wav2Vec2-based model (speech audio). Segments were categorized based on speech overlap, and emotional alignment was assessed using Pearson correlation, lag-adjusted analysis, and Dynamic Time Warping (DTW). Across analyses, non-overlapping speech was associated with more stable and predictable emotional synchrony than overlapping speech. While zero-lag correlations were low and not statistically different, non-overlapping speech showed reduced variability, especially for arousal. Lag-adjusted correlations and best-lag distributions revealed clearer, more consistent temporal alignment in these segments. In contrast, overlapping speech exhibited higher variability and flatter lag profiles, though DTW indicated unexpectedly tighter alignment, suggesting distinct coordination strategies. Notably, directionality patterns showed that facial expressions more often preceded speech during turn-taking, while speech led during simultaneous vocalizations. These findings underscore the importance of conversational structure in regulating emotional communication and provide new insight into the spatial and temporal dynamics of multimodal affective alignment in real-world interaction.
zh
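
下面的草图示意文中的 lag-adjusted 相关分析:在一系列时间滞后下计算两条情感序列的皮尔逊相关并取最优滞后。信号为人工构造(语音刻意滞后面部 0.3 秒),仅用于演示:

```python
import numpy as np

def lagged_corr(a: np.ndarray, b: np.ndarray, max_lag: int = 20):
    best_lag, best_r = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[:len(a) - lag], b[lag:]     # a 领先 b lag 步
        else:
            x, y = a[-lag:], b[:len(b) + lag]
        r = np.corrcoef(x, y)[0, 1]
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

t = np.linspace(0, 10, 500)
face = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=500)
voice = np.sin(t - 0.3)                       # 语音滞后于面部 0.3
print(lagged_corr(face, voice))               # 最优滞后约为 15 步(面部领先)
```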

[AI-144] Uncertainty Quantification for Prior-Data Fitted Networks using Martingale Posteriors

【速读】:该论文试图解决生成式 AI (Generative AI) 在基于表格数据集进行预测时缺乏不确定性量化的问题,尤其是在预测均值、分位数等统计量方面。其解决方案的关键在于提出一种基于鞅后验(Martingale posteriors)的合理且高效的采样过程,以构建这些估计的贝叶斯后验分布,并证明了该方法的收敛性。

链接: https://arxiv.org/abs/2505.11325
作者: Thomas Nagler,David Rügamer
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Prior-data fitted networks (PFNs) have emerged as promising foundation models for prediction from tabular data sets, achieving state-of-the-art performance on small to moderate data sizes without tuning. While PFNs are motivated by Bayesian ideas, they do not provide any uncertainty quantification for predictive means, quantiles, or similar quantities. We propose a principled and efficient sampling procedure to construct Bayesian posteriors for such estimates based on Martingale posteriors, and prove its convergence. Several simulated and real-world data examples showcase the uncertainty quantification of our method in inference applications.
zh
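
鞅后验的一种经典构造是“预测重采样”(predictive resampling):反复从当前经验预测分布抽取下一个观测并追加,统计量在多条前向路径上的分布即为后验近似。下面以均值为例给出草图(论文针对的是 PFN 的预测量并证明收敛,此处仅演示思想):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)        # 已观测数据

posterior_means = []
for _ in range(1000):                                  # 1000 条前向路径
    path = list(data)
    for _ in range(200):                               # 向前模拟 200 个未来观测
        path.append(path[rng.integers(len(path))])     # 从经验预测分布采样
    posterior_means.append(np.mean(path))

lo, hi = np.percentile(posterior_means, [2.5, 97.5])
print(f"鞅后验 95% 区间: [{lo:.2f}, {hi:.2f}]")
```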

机器学习

[LG-0] Quartet: Native FP4 Training Can Be Optimal for Large Language Models

链接: https://arxiv.org/abs/2505.14669
作者: Roberto L. Castro,Andrei Panferov,Soroush Tabesh,Oliver Sieberling,Jiale Chen,Mahdi Nikdan,Saleh Ashkboos,Dan Alistarh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution, by improving both computational throughput and energy efficiency. Specifically, NVIDIA’s recent Blackwell architecture facilitates extremely low-precision operations, specifically FP4 variants, promising substantial efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training with all the major computations (in e.g. linear layers) being performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths and allows us to identify a “near-optimal” low-precision training technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs, and show that it can achieve state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our method demonstrates that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at this https URL.
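
Quartet 的训练算法较复杂,但其底层数值格式可以简单演示。下面的草图把张量伪量化(fake-quantize)到 FP4(E2M1)的可表示值网格并配合逐张量缩放,仅说明 FP4 精度本身,并非论文的训练方法:

```python
import torch

FP4_GRID = torch.tensor([0., 0.5, 1., 1.5, 2., 3., 4., 6.])   # E2M1 正值
FP4_GRID = torch.cat([-FP4_GRID.flip(0), FP4_GRID])            # 加入负值

def quantize_fp4(x: torch.Tensor) -> torch.Tensor:
    scale = x.abs().max() / 6.0                  # 逐张量缩放到 FP4 动态范围
    grid = FP4_GRID.view(-1, *([1] * x.dim()))
    idx = (x / scale - grid).abs().argmin(0)     # 最近邻取整
    return FP4_GRID[idx] * scale                 # 反缩放回原尺度

w = torch.randn(4, 4)
print((w - quantize_fp4(w)).abs().max())         # 量化误差
```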

[LG-1] Early Diagnosis of Atrial Fibrillation Recurrence: A Large Tabular Model Approach with Structured and Unstructured Clinical Data

链接: https://arxiv.org/abs/2505.14643
作者: Ane G. Domingo-Aldama,Marcos Merino Prado,Alain García Olea,Koldo Gojenola Galletebeitia,Josu Goikoetxea Salutregi,Aitziber Atutxa Salazar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:BACKGROUND: Atrial fibrillation (AF), the most common arrhythmia, is linked to high morbidity and mortality. In a fast-evolving AF rhythm control treatment era, predicting AF recurrence after its onset may be crucial to achieve the optimal therapeutic approach, yet traditional scores like CHADS2-VASc, HATCH, and APPLE show limited predictive accuracy. Moreover, early diagnosis studies often rely on codified electronic health record (EHR) data, which may contain errors and missing information. OBJECTIVE: This study aims to predict AF recurrence between one month and two years after onset by evaluating traditional clinical scores, ML models, and our LTM approach. Moreover, another objective is to develop a methodology for integrating structured and unstructured data to enhance tabular dataset quality. METHODS: A tabular dataset was generated by combining structured clinical data with free-text discharge reports processed through natural language processing techniques, reducing errors and annotation effort. A total of 1,508 patients with documented AF onset were identified, and models were evaluated on a manually annotated test set. The proposed approach includes a LTM compared against traditional clinical scores and ML models. RESULTS: The proposed LTM approach achieved the highest predictive performance, surpassing both traditional clinical scores and ML models. Additionally, the gender and age bias analyses revealed demographic disparities. CONCLUSION: The integration of structured data and free-text sources resulted in a high-quality dataset. The findings emphasize the limitations of traditional clinical scores in predicting AF recurrence and highlight the potential of ML-based approaches, particularly our LTM model.

[LG-2] Bridging Predictive Coding and MDL: A Two-Part Code Framework for Deep Learning

链接: https://arxiv.org/abs/2505.14635
作者: Benjamin Prada,Shion Matsumoto,Abdul Malik Zekri,Ankur Mali
类目: Machine Learning (cs.LG)
*备注: 24 pages, 2 figures

点击查看摘要

Abstract:We present the first theoretical framework that connects predictive coding (PC), a biologically inspired local learning rule, with the minimum description length (MDL) principle in deep networks. We prove that layerwise PC performs block-coordinate descent on the MDL two-part code objective, thereby jointly minimizing empirical risk and model complexity. Using Hoeffding’s inequality and a prefix-code prior, we derive a novel generalization bound of the form R(\theta) \le \hat{R}(\theta) + \frac{L(\theta)}{N}, capturing the tradeoff between fit and compression. We further prove that each PC sweep monotonically decreases the empirical two-part codelength, yielding tighter high-probability risk bounds than unconstrained gradient descent. Finally, we show that repeated PC updates converge to a block-coordinate stationary point, providing an approximate MDL-optimal solution. To our knowledge, this is the first result offering formal generalization and convergence guarantees for PC-trained deep models, positioning PC as a theoretically grounded and biologically plausible alternative to backpropagation.
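
文中的两部码界可以直接数值演算。下面的小例子中经验风险、码长与样本数均为假设数值,仅演示界的形式 R(θ) ≤ R̂(θ) + L(θ)/N:

```python
emp_risk = 0.08            # 经验风险 R̂(θ)
codelength = 1200.0        # 模型描述长度 L(θ)(前缀码,nats)
N = 50_000                 # 样本数

bound = emp_risk + codelength / N
print(f"泛化界: R(θ) ≤ {bound:.4f}")   # 0.08 + 0.024 = 0.104
```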

[LG-3] Virtual Cells: Predict Explain Discover

链接: https://arxiv.org/abs/2505.14613
作者: Emmanuel Noutahi,Jason Hartford,Prudencio Tossou,Shawn Whitfield,Alisandra K. Denton,Cas Wognum,Kristina Ulicna,Jonathan Hsu,Michael Cuccarese,Emmanuel Bengio,Dominique Beaini,Christopher Gibson,Daniel Cohen,Berton Earnshaw
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Drug discovery is fundamentally a process of inferring the effects of treatments on patients, and would therefore benefit immensely from computational models that can reliably simulate patient responses, enabling researchers to generate and test large numbers of therapeutic hypotheses safely and economically before initiating costly clinical trials. Even a more specific model that predicts the functional response of cells to a wide range of perturbations would be tremendously valuable for discovering safe and effective treatments that successfully translate to the clinic. Creating such virtual cells has long been a goal of the computational research community that unfortunately remains unachieved given the daunting complexity and scale of cellular biology. Nevertheless, recent advances in AI, computing power, lab automation, and high-throughput cellular profiling provide new opportunities for reaching this goal. In this perspective, we present a vision for developing and evaluating virtual cells that builds on our experience at Recursion. We argue that in order to be a useful tool to discover novel biology, virtual cells must accurately predict the functional response of a cell to perturbations and explain how the predicted response is a consequence of modifications to key biomolecular interactions. We then introduce key principles for designing therapeutically-relevant virtual cells, describe a lab-in-the-loop approach for generating novel insights with them, and advocate for biologically-grounded benchmarks to guide virtual cell development. Finally, we make the case that our approach to virtual cells provides a useful framework for building other models at higher levels of organization, including virtual patients. We hope that these directions prove useful to the research community in developing virtual models optimized for positive impact on drug discovery outcomes.

[LG-4] MMD-Newton Method for Multi-objective Optimization

链接: https://arxiv.org/abs/2505.14610
作者: Hao Wang,Chenyu Shi,Angel E. Rodriguez-Fernandez,Oliver Schütze
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Maximum mean discrepancy (MMD) has been widely employed to measure the distance between probability distributions. In this paper, we propose using MMD to solve continuous multi-objective optimization problems (MOPs). For solving MOPs, a common approach is to minimize the distance (e.g., Hausdorff) between a finite approximate set of the Pareto front and a reference set. Viewing these two sets as empirical measures, we propose using MMD to measure the distance between them. To minimize the MMD value, we provide the analytical expression of its gradient and Hessian matrix w.r.t. the search variables, and use them to devise a novel set-oriented, MMD-based Newton (MMDN) method. Also, we analyze the theoretical properties of MMD’s gradient and Hessian, including the first-order stationary condition and the eigenspectrum of the Hessian, which are important for verifying the correctness of MMDN. To solve complicated problems, we propose hybridizing MMDN with multiobjective evolutionary algorithms (MOEAs), where we first execute an EA for several iterations to get close to the global Pareto front and then warm-start MMDN with the result of the MOEA to efficiently refine the approximation. We empirically test the hybrid algorithm on 11 widely used benchmark problems, and the results show the hybrid (MMDN + MOEA) can achieve a much better optimization accuracy than EA alone with the same computation budget.
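
下面用高斯核实现两组点之间的 MMD²,并借助自动微分对近似集做梯度下降(论文给出的是解析梯度与 Hessian 并据此构造牛顿法,此处仅作一阶近似的示意):

```python
import torch

def mmd2(X, Y, sigma=1.0):
    def k(A, B):
        return torch.exp(-torch.cdist(A, B).pow(2) / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

Y = torch.rand(100, 2)                      # 参考集(如 Pareto 前沿的采样)
X = torch.rand(20, 2, requires_grad=True)   # 待优化的有限近似集

for _ in range(100):
    loss = mmd2(X, Y)
    loss.backward()
    with torch.no_grad():
        X -= 0.5 * X.grad                   # 梯度步(牛顿法需再乘 Hessian 逆)
    X.grad.zero_()
print(float(mmd2(X, Y)))                    # MMD^2 应随迭代下降
```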

[LG-5] Electrostatics from Laplacian Eigenbasis for Neural Network Interatomic Potentials

链接: https://arxiv.org/abs/2505.14606
作者: Maksim Zhdanov,Vladislav Kurenkov
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Recent advances in neural network interatomic potentials have emerged as a promising research direction. However, popular deep learning models often lack auxiliary constraints grounded in physical laws, which could accelerate training and improve fidelity through physics-based regularization. In this work, we introduce \Phi-Module, a universal plugin module that enforces Poisson’s equation within the message-passing framework to learn electrostatic interactions in a self-supervised manner. Specifically, each atom-wise representation is encouraged to satisfy a discretized Poisson’s equation, making it possible to acquire a potential \phi and a corresponding charge density \rho linked to the learnable Laplacian eigenbasis coefficients of a given molecular graph. We then derive an electrostatic energy term, crucial for improved total energy predictions. This approach integrates seamlessly into any existing neural potential with insignificant computational overhead. Experiments on the OE62 and MD22 benchmarks confirm that models combined with \Phi-Module achieve robust improvements over baseline counterparts. For OE62, error reduction ranges from 4.5% to 17.8%, and for MD22, baselines equipped with \Phi-Module achieve the best results on 5 out of 14 cases. Our results underscore how embedding a first-principles constraint in neural interatomic potentials can significantly improve performance while remaining hyperparameter-friendly, memory-efficient and lightweight in training. Code will be available at this https URL (dunnolab/phi-module).
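
以下为 \Phi-Module 思想的最小演示(图结构与特征均为随机模拟,并非论文实现):让网络对每个节点预测的电位 φ 与电荷密度 ρ 满足图拉普拉斯意义下的离散泊松方程 Lφ = ρ,其残差即自监督损失。

```python
import torch

n = 10
adj = (torch.rand(n, n) > 0.7).float()
adj = ((adj + adj.T) > 0).float().fill_diagonal_(0)   # 随机无向"分子图"
L = torch.diag(adj.sum(1)) - adj                      # 图拉普拉斯矩阵

feat = torch.randn(n, 16)                             # 节点(原子)表征
head = torch.nn.Linear(16, 2)                         # 同时预测 φ 与 ρ
phi, rho = head(feat).unbind(dim=1)

poisson_residual = ((L @ phi - rho) ** 2).mean()      # 自监督正则项
poisson_residual.backward()
```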

[LG-6] CSTS: A Benchmark for the Discovery of Correlation Structures in Time Series Clustering

链接: https://arxiv.org/abs/2505.14596
作者: Isabella Degen,Zahraa S Abdallah,Henry W J Reeve,Kate Robson Brown
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 9 pages main + 32 pages total, 2 figures main + 6 figures appendix, 1 table main + 17 tables appendix, dataset available at this https URL , code available at this https URL

点击查看摘要

Abstract:Time series clustering promises to uncover hidden structural patterns in data with applications across healthcare, finance, industrial systems, and other critical domains. However, without validated ground truth information, researchers cannot objectively assess clustering quality or determine whether poor results stem from absent structures in the data, algorithmic limitations, or inappropriate validation methods, raising the question whether clustering is “more art than science” (Guyon et al., 2009). To address these challenges, we introduce CSTS (Correlation Structures in Time Series), a synthetic benchmark for evaluating the discovery of correlation structures in multivariate time series data. CSTS provides a clean benchmark that enables researchers to isolate and identify specific causes of clustering failures by differentiating between correlation structure deterioration and limitations of clustering algorithms and validation methods. Our contributions are: (1) a comprehensive benchmark for correlation structure discovery with distinct correlation structures, systematically varied data conditions, established performance thresholds, and recommended evaluation protocols; (2) empirical validation of correlation structure preservation showing moderate distortion from downsampling and minimal effects from distribution shifts and sparsification; and (3) an extensible data generation framework enabling structure-first clustering evaluation. A case study demonstrates CSTS’s practical utility by identifying an algorithm’s previously undocumented sensitivity to non-normal distributions, illustrating how the benchmark enables precise diagnosis of methodological limitations. CSTS advances rigorous evaluation standards for correlation-based time series clustering.
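
构造这类“相关结构”合成基准的基本手段是 Cholesky 分解:把目标相关矩阵分解后注入独立白噪声。下面的目标矩阵为假设示例,并非 CSTS 的实际配置:

```python
import numpy as np

target_corr = np.array([[1.0, 0.8, 0.2],
                        [0.8, 1.0, 0.2],
                        [0.2, 0.2, 1.0]])
Lchol = np.linalg.cholesky(target_corr)

rng = np.random.default_rng(0)
white = rng.normal(size=(3, 5000))          # 独立白噪声
series = Lchol @ white                      # 注入目标相关结构

print(np.round(np.corrcoef(series), 2))     # 经验相关应接近 target_corr
```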

[LG-7] Physics-informed Reduced Order Modeling of Time-dependent PDEs via Differentiable Solvers

链接: https://arxiv.org/abs/2505.14595
作者: Nima Hosseini Dashtbayaz,Hesam Salehipour,Adrian Butscher,Nigel Morris
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reduced-order modeling (ROM) of time-dependent and parameterized differential equations aims to accelerate the simulation of complex high-dimensional systems by learning a compact latent manifold representation that captures the characteristics of the solution fields and their time-dependent dynamics. Although high-fidelity numerical solvers generate the training datasets, they have thus far been excluded from the training process, causing the learned latent dynamics to drift away from the discretized governing physics. This mismatch often limits generalization and forecasting capabilities. In this work, we propose Physics-informed ROM (\Phi-ROM) by incorporating differentiable PDE solvers into the training procedure. Specifically, the latent space dynamics and its dependence on PDE parameters are shaped directly by the governing physics encoded in the solver, ensuring a strong correspondence between the full and reduced systems. Our model outperforms state-of-the-art data-driven ROMs and other physics-informed strategies by accurately generalizing to new dynamics arising from unseen parameters, enabling long-term forecasting beyond the training horizon, maintaining continuity in both time and space, and reducing the data cost. Furthermore, \Phi-ROM learns to recover and forecast the solution fields even when trained or evaluated with sparse and irregular observations of the fields, providing a flexible framework for field reconstruction and data assimilation. We demonstrate the framework’s robustness across different PDE solvers and highlight its broad applicability by providing an open-source JAX implementation readily extensible to other PDE systems and differentiable solvers.

[LG-8] Adaptive Pruning of Deep Neural Networks for Resource-Aware Embedded Intrusion Detection on the Edge

链接: https://arxiv.org/abs/2505.14592
作者: Alexandre Broggi,Nathaniel Bastian,Lance Fiondella,Gokhan Kul
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Artificial neural network pruning is a method in which artificial neural network sizes can be reduced while attempting to preserve the predicting capabilities of the network. This is done to make the model smaller or faster during inference time. In this work we analyze the ability of a selection of artificial neural network pruning methods to generalize to a new cybersecurity dataset utilizing a simpler network type than was designed for. We analyze each method using a variety of pruning degrees to best understand how each algorithm responds to the new environment. This has allowed us to determine the most well fit pruning method of those we searched for the task. Unexpectedly, we have found that many of them do not generalize to the problem well, leaving only a few algorithms working to an acceptable degree.
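
作为被比较的剪枝方法中最基础的一类,下面给出全局幅值剪枝的草图(网络结构与剪枝率为假设示例):将绝对值最小的一部分权重置零。

```python
import torch, torch.nn as nn

def magnitude_prune(model: nn.Module, ratio: float = 0.5):
    weights = torch.cat([p.detach().abs().flatten()
                         for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(weights, ratio)       # 全局幅值阈值
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() > 1:
                p *= (p.abs() >= threshold).float()  # 置零小权重

net = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 2))
magnitude_prune(net, ratio=0.7)
remaining = sum((p != 0).sum().item() for p in net.parameters() if p.dim() > 1)
print("remaining nonzero weights:", remaining)
```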

[LG-9] Time to Embed: Unlocking Foundation Models for Time Series with Channel Descriptions

链接: https://arxiv.org/abs/2505.14543
作者: Utsav Dutta,Sina Khoshfetrat Pakazad,Henrik Ohlsson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional time series models are task-specific and often depend on dataset-specific training and extensive feature engineering. While Transformer-based architectures have improved scalability, foundation models, commonplace in text, vision, and audio, remain under-explored for time series and are largely restricted to forecasting. We introduce CHARM, a foundation embedding model for multivariate time series that learns shared, transferable, and domain-aware representations. To address the unique difficulties of time series foundation learning, CHARM incorporates architectural innovations that integrate channel-level textual descriptions while remaining invariant to channel order. The model is trained using a Joint Embedding Predictive Architecture (JEPA), with novel augmentation schemes and a loss function designed to improve interpretability and training stability. Our 7M-parameter model achieves state-of-the-art performance across diverse downstream tasks, setting a new benchmark for time series representation learning.

[LG-10] Spiking Neural Networks with Temporal Attention-Guided Adaptive Fusion for imbalanced Multi-modal Learning

链接: https://arxiv.org/abs/2505.14535
作者: Jiangrong Shen,Yulin Xie,Qi Xu,Gang Pan,Huajin Tang,Badong Chen
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Multimodal spiking neural networks (SNNs) hold significant potential for energy-efficient sensory processing but face critical challenges in modality imbalance and temporal misalignment. Current approaches suffer from uncoordinated convergence speeds across modalities and static fusion mechanisms that ignore time-varying cross-modal interactions. We propose the temporal attention-guided adaptive fusion framework for multimodal SNNs with two synergistic innovations: 1) The Temporal Attention-guided Adaptive Fusion (TAAF) module that dynamically assigns importance scores to fused spiking features at each timestep, enabling hierarchical integration of temporally heterogeneous spike-based features; 2) The temporal adaptive balanced fusion loss that modulates learning rates per modality based on the above attention scores, preventing dominant modalities from monopolizing optimization. The proposed framework implements adaptive fusion, especially in the temporal dimension, and alleviates the modality imbalance during multimodal learning, mimicking cortical multisensory integration principles. Evaluations on CREMA-D, AVE, and EAD datasets demonstrate state-of-the-art performance (77.55%, 70.65% and 97.5%accuracy, respectively) with energy efficiency. The system resolves temporal misalignment through learnable time-warping operations and faster modality convergence coordination than baseline SNNs. This work establishes a new paradigm for temporally coherent multimodal learning in neuromorphic systems, bridging the gap between biological sensory processing and efficient machine intelligence.
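
以下草图示意 TAAF 的核心操作(维度与打分器均为假设,且省略了脉冲神经元动力学):按时间步为各模态特征计算注意力分数,softmax 归一化后加权融合。

```python
import torch, torch.nn as nn

T, B, D = 10, 4, 32                       # 时间步、批大小、特征维度
audio = torch.randn(T, B, D)              # 两个模态的(脉冲)特征序列
vision = torch.randn(T, B, D)

scorer = nn.Linear(D, 1)                  # 为每个模态特征打重要性分
stack = torch.stack([audio, vision], 0)   # (2, T, B, D)
attn = torch.softmax(scorer(stack), dim=0)  # 逐时间步的模态权重 (2, T, B, 1)
fused = (attn * stack).sum(0)             # 加权融合 -> (T, B, D)
print(fused.shape)
```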

[LG-11] Lessons from Defending Gemini Against Indirect Prompt Injections

链接: https://arxiv.org/abs/2505.14534
作者: Chongyang Shi,Sharon Lin,Shuang Song,Jamie Hayes,Ilia Shumailov,Itay Yona,Juliette Pluto,Aneesh Pappu,Christopher A. Choquette-Choo,Milad Nasr,Chawin Sitawarin,Gena Gibson,Andreas Terzis,John “Four” Flynn
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gemini is increasingly used to perform tasks on behalf of users, where function-calling and tool-use capabilities enable the model to access user data. Some tools, however, require access to untrusted data introducing risk. Adversaries can embed malicious instructions in untrusted data which cause the model to deviate from the user’s expectations and mishandle their data or permissions. In this report, we set out Google DeepMind’s approach to evaluating the adversarial robustness of Gemini models and describe the main lessons learned from the process. We test how Gemini performs against a sophisticated adversary through an adversarial evaluation framework, which deploys a suite of adaptive attack techniques to run continuously against past, current, and future versions of Gemini. We describe how these ongoing evaluations directly help make Gemini more resilient against manipulation.

[LG-12] SifterNet: A Generalized and Model-Agnostic Trigger Purification Approach

链接: https://arxiv.org/abs/2505.14531
作者: Shaoye Luo,Xinxin Fan,Quanliang Jing,Chi Lin,Mengfan Li,Yunfeng Lu,Yongjun Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Aiming at resisting backdoor attacks in convolution neural networks and vision Transformer-based large model, this paper proposes a generalized and model-agnostic trigger-purification approach resorting to the classic Ising model. To date, existing trigger detection/removal studies usually require to know the detailed knowledge of target model in advance, access to a large number of clean samples or even model-retraining authorization, which brings the huge inconvenience for practical applications, especially inaccessible to target model. An ideal countermeasure ought to eliminate the implanted trigger without regarding whatever the target models are. To this end, a lightweight and black-box defense approach SifterNet is proposed through leveraging the memorization-association functionality of Hopfield network, by which the triggers of input samples can be effectively purified in a proper manner. The main novelty of our proposed approach lies in the introduction of ideology of Ising model. Extensive experiments also validate the effectiveness of our approach in terms of proper trigger purification and high accuracy achievement, and compared to the state-of-the-art baselines under several commonly-used datasets, our SiferNet has a significant superior performance.
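
下面用经典 Hopfield 网络的“存储-回忆”过程演示 SifterNet 借助联想记忆净化扰动输入的直觉(模式为随机 ±1 向量,并非论文在图像与触发器上的实现):

```python
import numpy as np

rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(3, 100))        # 存储的"干净"模式
W = patterns.T @ patterns / patterns.shape[1]        # Hebbian 权重
np.fill_diagonal(W, 0)

x = patterns[0].copy()
flip = rng.choice(100, size=15, replace=False)       # 模拟触发器式扰动
x[flip] *= -1

for _ in range(10):                                  # 同步回忆迭代
    x = np.sign(W @ x)
print("恢复比例:", (x == patterns[0]).mean())
```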

[LG-13] Interpretable Dual-Stream Learning for Local Wind Hazard Prediction in Vulnerable Communities

链接: https://arxiv.org/abs/2505.14522
作者: Mahmuda Akhter Nishu,Chenyu Huang,Milad Roohi,Xin Zhong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wind hazards such as tornadoes and straight-line winds frequently affect vulnerable communities in the Great Plains of the United States, where limited infrastructure and sparse data coverage hinder effective emergency response. Existing forecasting systems focus primarily on meteorological elements and often fail to capture community-specific vulnerabilities, limiting their utility for localized risk assessment and resilience planning. To address this gap, we propose an interpretable dual-stream learning framework that integrates structured numerical weather data with unstructured textual event narratives. Our architecture combines a Random Forest and RoBERTa-based transformer through a late fusion mechanism, enabling robust and context-aware wind hazard prediction. The system is tailored for underserved tribal communities and supports block-level risk assessment. Experimental results show significant performance gains over traditional baselines. Furthermore, gradient-based sensitivity and ablation studies provide insight into the model’s decision-making process, enhancing transparency and operational trust. The findings demonstrate both predictive effectiveness and practical value in supporting emergency preparedness and advancing community resilience.

[LG-14] Just One Layer Norm Guarantees Stable Extrapolation

链接: https://arxiv.org/abs/2505.14512
作者: Juliusz Ziomek,George Whittle,Michael A. Osborne
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In spite of their prevalence, the behaviour of Neural Networks when extrapolating far from the training distribution remains poorly understood, with existing results limited to specific cases. In this work, we prove general results – the first of their kind – by applying Neural Tangent Kernel (NTK) theory to analyse infinitely-wide neural networks trained until convergence and prove that the inclusion of just one Layer Norm (LN) fundamentally alters the induced NTK, transforming it into a bounded-variance kernel. As a result, the output of an infinitely wide network with at least one LN remains bounded, even on inputs far from the training data. In contrast, we show that a broad class of networks without LN can produce pathologically large outputs for certain inputs. We support these theoretical findings with empirical experiments on finite-width networks, demonstrating that while standard NNs often exhibit uncontrolled growth outside the training domain, a single LN layer effectively mitigates this instability. Finally, we explore real-world implications of this extrapolatory stability, including applications to predicting residue sizes in proteins larger than those seen during training and estimating age from facial images of underrepresented ethnicities absent from the training set.
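
The qualitative claim is easy to probe at finite width. Below is a minimal PyTorch sketch, assuming a toy 1-D regression task and an arbitrary architecture; it only illustrates the tendency of a LayerNorm-equipped MLP to stay bounded far outside the training range, not the paper's NTK analysis:

```python
import torch
import torch.nn as nn

def mlp(use_ln: bool) -> nn.Sequential:
    layers = [nn.Linear(1, 64)]
    if use_ln:
        layers.append(nn.LayerNorm(64))  # the single LN the theory highlights
    layers += [nn.ReLU(), nn.Linear(64, 1)]
    return nn.Sequential(*layers)

torch.manual_seed(0)
x_train = torch.linspace(-1, 1, 128).unsqueeze(1)
y_train = torch.sin(3 * x_train)
for use_ln in (False, True):
    net = mlp(use_ln)
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(500):
        opt.zero_grad()
        loss = ((net(x_train) - y_train) ** 2).mean()
        loss.backward()
        opt.step()
    x_far = torch.tensor([[100.0]])  # far outside the training range [-1, 1]
    print(f"LN={use_ln}: output at x=100 -> {net(x_far).item():.2f}")
```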

[LG-15] Federated prediction for scalable and privacy-preserved knowledge-based planning in radiotherapy

链接: https://arxiv.org/abs/2505.14507
作者: Jingyun Chen,David Horowitz,Yading Yuan
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Under review for publication by the journal of Medical Physics

点击查看摘要

Abstract:Background: Deep learning has potential to improve the efficiency and consistency of radiation therapy planning, but clinical adoption is hindered by the limited model generalizability due to data scarcity and heterogeneity among institutions. Although aggregating data from different institutions could alleviate this problem, data sharing is a practical challenge due to concerns about patient data privacy and other technical obstacles. Purpose: This work aims to address this dilemma by developing FedKBP+, a comprehensive federated learning (FL) platform for predictive tasks in real-world applications in radiotherapy treatment planning. Methods: We implemented a unified communication stack based on Google Remote Procedure Call (gRPC) to support communication between participants whether located on the same workstation or distributed across multiple workstations. In addition to supporting the centralized FL strategies commonly available in existing open-source frameworks, FedKBP+ also provides a fully decentralized FL model where participants directly exchange model weights with each other through Peer-to-Peer communication. We evaluated FedKBP+ on three predictive tasks using scale-attention network (SA-Net) as the predictive model. Conclusions: Our results demonstrate that FedKBP+ is highly effective, efficient and robust, showing great potential as a federated learning platform for radiation therapy.

[LG-16] Learning to Integrate Diffusion ODEs by Averaging the Derivatives

链接: https://arxiv.org/abs/2505.14502
作者: Wenze Liu,Xiangyu Yue
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When accelerating diffusion model inference, numerical solvers perform poorly at extremely small step counts, while distillation techniques often introduce complexity and instability. This work presents an intermediate strategy, balancing performance and cost, by learning ODE integration using loss functions derived from the derivative-integral relationship, inspired by Monte Carlo integration and Picard iteration. From a geometric perspective, the losses operate by gradually extending the tangent to the secant, and are thus named secant losses. The secant losses can rapidly convert (via fine-tuning or distillation) a pretrained diffusion model into its secant version. In our experiments, the secant version of EDM achieves a 10-step FID of 2.14 on CIFAR-10, while the secant version of SiT-XL/2 attains a 4-step FID of 2.27 and an 8-step FID of 1.96 on ImageNet-256\times256. Code will be available.

[LG-17] Personalised Insulin Adjustment with Reinforcement Learning: An In-Silico Validation for People with Diabetes on Intensive Insulin Treatment

链接: https://arxiv.org/abs/2505.14477
作者: Maria Panagiotou,Lorenzo Brigato,Vivien Streit,Amanda Hayoz,Stephan Proennecke,Stavros Athanasopoulos,Mikkel T. Olsen,Elizabeth J. den Brok,Cecilie H. Svensson,Konstantinos Makrilakis,Maria Xatzipsalti,Andriani Vazeou,Peter R. Mertens,Ulrik Pedersen-Bjergaard,Bastiaan E. de Galan,Stavroula Mougiakakou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite recent advances in insulin preparations and technology, adjusting insulin remains an ongoing challenge for the majority of people with type 1 diabetes (T1D) and longstanding type 2 diabetes (T2D). In this study, we propose the Adaptive Basal-Bolus Advisor (ABBA), a personalised insulin treatment recommendation approach based on reinforcement learning for individuals with T1D and T2D, performing self-monitoring blood glucose measurements and multiple daily insulin injection therapy. We developed and evaluated the ability of ABBA to achieve better time-in-range (TIR) for individuals with T1D and T2D, compared to a standard basal-bolus advisor (BBA). The in-silico test was performed using an FDA-accepted population, including 101 simulated adults with T1D and 101 with T2D. An in-silico evaluation shows that ABBA significantly improved TIR and significantly reduced both time below range and time above range, compared to BBA. ABBA's performance continued to improve over two months, whereas BBA exhibited only modest changes. This personalised method for adjusting insulin has the potential to further optimise glycaemic control and support people with T1D and T2D in their daily self-management. Our results warrant ABBA to be trialed for the first time in humans.

[LG-18] ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs

链接: https://arxiv.org/abs/2505.14468
作者: Yifan Sui,Hao Wang,Hanfei Yu,Yitao Hu,Jianxun Li,Hao Wang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Serverless computing has grown rapidly for serving Large Language Model (LLM) inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless platforms can effectively serve general LLMs but fail with Low-Rank Adaptation (LoRA) inference due to three key limitations: 1) massive parameter redundancy among functions where 99% of weights are unnecessarily duplicated, 2) costly artifact loading latency beyond LLM loading, and 3) magnified resource contention when serving multiple LoRA LLMs. These inefficiencies lead to massive GPU wastage, increased Time-To-First-Token (TTFT), and high monetary costs. We propose ServerlessLoRA, a novel serverless inference system designed for faster and cheaper LoRA LLM serving. ServerlessLoRA enables secure backbone LLM sharing across isolated LoRA functions to reduce redundancy. We design a pre-loading method that pre-loads comprehensive LoRA artifacts to minimize cold-start latency. Furthermore, ServerlessLoRA employs contention-aware batching and offloading to mitigate GPU resource conflicts during bursty workloads. Experiments on industrial workloads demonstrate that ServerlessLoRA reduces TTFT by up to 86% and cuts monetary costs by up to 89% compared to state-of-the-art LLM inference solutions.

[LG-19] Adverseness vs. Equilibrium: Exploring Graph Adversarial Resilience through Dynamic Equilibrium

链接: https://arxiv.org/abs/2505.14463
作者: Xinxin Fan,Wenxiong Chen,Mengfan Li,Wenqi Wei,Ling Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adversarial attacks on graph analytics are gaining increased attention. To date, two lines of countermeasures have been proposed to resist various graph adversarial attacks from the perspectives of either the graph itself or graph neural networks. Nevertheless, a fundamental question is whether there exists an intrinsic adversarial resilience state within a graph regime and how to find such a critical state if it exists. This paper tackles these research questions from three unique perspectives: i) we regard the process of adversarial learning on graphs as a complex multi-object dynamic system, and model the behavior of adversarial attacks; ii) we propose a generalized theoretical framework to show the existence of a critical adversarial resilience state; and iii) we develop a condensed one-dimensional function to capture the dynamic variation of the graph regime under perturbations, and pinpoint the critical state by solving the equilibrium point of the dynamic system. Multi-facet experiments show that our proposed approach significantly outperforms state-of-the-art defense methods on five commonly used real-world datasets under three representative attacks.

[LG-20] Interpretable Reinforcement Learning for Load Balancing using Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2505.14459
作者: Kamal Singh,Sami Marouani,Ahmad Al Sheikh,Pham Tran Anh Quang,Amaury Habrard
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has been increasingly applied to network control problems, such as load balancing. However, existing RL approaches often suffer from a lack of interpretability and difficulty in extracting controller equations. In this paper, we propose the use of Kolmogorov-Arnold Networks (KAN) for interpretable RL in network control. We employ a PPO agent with a 1-layer actor KAN model and an MLP Critic network to learn load balancing policies that maximize throughput utility while minimizing loss and delay. Our approach allows us to extract controller equations from the learned neural networks, providing insights into the decision-making process. We evaluate our approach using different reward functions, demonstrating its effectiveness in improving network performance while providing interpretable policies.

[LG-21] Explaining Neural Networks with Reasons

链接: https://arxiv.org/abs/2505.14424
作者: Levin Hornischer,Hannes Leitgeb
类目: Machine Learning (cs.LG)
*备注: 28 pages (12 pages main text), 29 figures

点击查看摘要

Abstract:We propose a new interpretability method for neural networks, which is based on a novel mathematico-philosophical theory of reasons. Our method computes a vector for each neuron, called its reasons vector. We then can compute how strongly this reasons vector speaks for various propositions, e.g., the proposition that the input image depicts digit 2 or that the input prompt has a negative sentiment. This yields an interpretation of neurons, and groups thereof, that combines a logical and a Bayesian perspective, and accounts for polysemanticity (i.e., that a single neuron can figure in multiple concepts). We show, both theoretically and empirically, that this method is: (1) grounded in a philosophically established notion of explanation, (2) uniform, i.e., applies to the common neural network architectures and modalities, (3) scalable, since computing reason vectors only involves forward-passes in the neural network, (4) faithful, i.e., intervening on a neuron based on its reason vector leads to expected changes in model output, (5) correct in that the model’s reasons structure matches that of the data source, (6) trainable, i.e., neural networks can be trained to improve their reason strengths, (7) useful, i.e., it delivers on the needs for interpretability by increasing, e.g., robustness and fairness.

[LG-22] Towards Non-Euclidean Foundation Models: Advancing AI Beyond Euclidean Frameworks WWW2025

链接: https://arxiv.org/abs/2505.14417
作者: Menglin Yang,Yifei Zhang,Jialin Chen,Melanie Weber,Rex Ying
类目: Computational Geometry (cs.CG); Machine Learning (cs.LG)
*备注: WWW 2025 Companion

点击查看摘要

Abstract:In the era of foundation models and Large Language Models (LLMs), Euclidean space is the de facto geometric setting of our machine learning architectures. However, recent literature has demonstrated that this choice comes with fundamental limitations. To that end, non-Euclidean learning is quickly gaining traction, particularly in web-related applications where complex relationships and structures are prevalent. Non-Euclidean spaces, such as hyperbolic, spherical, and mixed-curvature spaces, have been shown to provide more efficient and effective representations for data with intrinsic geometric properties, including web-related data like social network topology, query-document relationships, and user-item interactions. Integrating foundation models with non-Euclidean geometries has great potential to enhance their ability to capture and model the underlying structures, leading to better performance in search, recommendations, and content understanding. This workshop focuses on the intersection of Non-Euclidean Foundation Models and Geometric Learning (NEGEL), exploring its potential benefits, including the potential benefits for advancing web-related technologies, challenges, and future directions. Workshop page: [this https URL](this https URL)

[LG-23] Table Foundation Models: on knowledge pre-training for tabular learning

链接: https://arxiv.org/abs/2505.14415
作者: Myung Jun Kim,Félix Lefebvre,Gaëtan Brison,Alexandre Perez-Lebel,Gaël Varoquaux
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Table foundation models bring high hopes to data science: pre-trained on tabular data to embark knowledge or priors, they should facilitate downstream tasks on tables. One specific challenge is that of data semantics: numerical entries take their meaning from context, e.g., column name. Pre-trained neural networks that jointly model column names and table entries have recently boosted prediction accuracy. While these models outline the promises of world knowledge to interpret table values, they lack the convenience of popular foundation models in text or vision. Indeed, they must be fine-tuned to bring benefits, come with sizeable computation costs, and cannot easily be reused or combined with other architectures. Here we introduce TARTE, a foundation model that transforms tables to knowledge-enhanced vector representations, using the strings to capture semantics. Pre-trained on large relational data, TARTE yields representations that facilitate subsequent learning with little additional cost. These representations can be fine-tuned or combined with other learners, giving models that push the state-of-the-art prediction performance and improve the prediction/computation performance trade-off. Specialized to a task or a domain, TARTE gives domain-specific representations that facilitate further learning. Our study demonstrates an effective approach to knowledge pre-training for tabular learning.

[LG-24] Byte Pair Encoding for Efficient Time Series Forecasting

链接: https://arxiv.org/abs/2505.14411
作者: Leon Götz,Marcel Kollovieh,Stephan Günnemann,Leo Schwinn
类目: Machine Learning (cs.LG)
*备注: 24 pages in total, 17 figures

点击查看摘要

Abstract:Existing time series tokenization methods predominantly encode a constant number of samples into individual tokens. This inflexible approach can generate excessive tokens for even simple patterns like extended constant values, resulting in substantial computational overhead. Inspired by the success of byte pair encoding, we propose the first pattern-centric tokenization scheme for time series analysis. Based on a discrete vocabulary of frequent motifs, our method merges samples with underlying patterns into tokens, compressing time series adaptively. Exploiting our finite set of motifs and the continuous properties of time series, we further introduce conditional decoding as a lightweight yet powerful post-hoc optimization method, which requires no gradient computation and adds no computational overhead. On recent time series foundation models, our motif-based tokenization improves forecasting performance by 36% and boosts efficiency by 1990% on average. Conditional decoding further reduces MSE by up to 44%. In an extensive analysis, we demonstrate the adaptiveness of our tokenization to diverse temporal patterns, its generalization to unseen data, and its meaningful token representations capturing distinct time series properties, including statistical moments and trends.
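
To make the pattern-merging idea concrete, here is a minimal byte-pair-encoding sketch over a quantized toy series. The quantizer and merge count are arbitrary choices and stand in for the paper's learned motif vocabulary:

```python
from collections import Counter

def quantize(series, n_bins=4):
    """Map real values to discrete symbols (a crude stand-in for a motif vocabulary)."""
    lo, hi = min(series), max(series)
    step = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / step), n_bins - 1) for v in series]

def bpe_merge(tokens, n_merges=3):
    """Greedily merge the most frequent adjacent pair, as in byte pair encoding."""
    tokens = [(t,) for t in tokens]
    for _ in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)  # one token now covers a longer pattern
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

series = [0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.5, 1.0, 0.5]
print(bpe_merge(quantize(series)))  # repeated motifs compress into single tokens
```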

[LG-25] Explaining Unreliable Perception in Automated Driving: A Fuzzy-based Monitoring Approach

链接: https://arxiv.org/abs/2505.14407
作者: Aniket Salvi,Gereon Weiss,Mario Trapp
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autonomous systems that rely on Machine Learning (ML) utilize online fault tolerance mechanisms, such as runtime monitors, to detect ML prediction errors and maintain safety during operation. However, the lack of human-interpretable explanations for these errors can hinder the creation of strong assurances about the system's safety and reliability. This paper introduces a novel fuzzy-based monitor tailored for ML perception components. It provides human-interpretable explanations about how different operating conditions affect the reliability of perception components and also functions as a runtime safety monitor. We evaluated our proposed monitor using naturalistic driving datasets as part of an automated driving case study. The interpretability of the monitor was evaluated and we identified a set of operating conditions in which the perception component performs reliably. Additionally, we created an assurance case that links unit-level evidence of \textit{correct} ML operation to system-level \textit{safety}. The benchmarking demonstrated that our monitor achieved a better increase in safety (i.e., absence of hazardous situations) while maintaining availability (i.e., ability to perform the mission) compared to state-of-the-art runtime ML monitors in the evaluated dataset.

[LG-26] Algorithmic Hiring and Diversity: Reducing Human-Algorithm Similarity for Better Outcomes

链接: https://arxiv.org/abs/2505.14388
作者: Prasanna Parasurama,Panos Ipeirotis
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:Algorithmic tools are increasingly used in hiring to improve fairness and diversity, often by enforcing constraints such as gender-balanced candidate shortlists. However, we show theoretically and empirically that enforcing equal representation at the shortlist stage does not necessarily translate into more diverse final hires, even when there is no gender bias in the hiring stage. We identify a crucial factor influencing this outcome: the correlation between the algorithm’s screening criteria and the human hiring manager’s evaluation criteria – higher correlation leads to lower diversity in final hires. Using a large-scale empirical analysis of nearly 800,000 job applications across multiple technology firms, we find that enforcing equal shortlists yields limited improvements in hire diversity when the algorithmic screening closely mirrors the hiring manager’s preferences. We propose a complementary algorithmic approach designed explicitly to diversify shortlists by selecting candidates likely to be overlooked by managers, yet still competitive according to their evaluation criteria. Empirical simulations show that this approach significantly enhances gender diversity in final hires without substantially compromising hire quality. These findings highlight the importance of algorithmic design choices in achieving organizational diversity goals and provide actionable guidance for practitioners implementing fairness-oriented hiring algorithms.

[LG-27] Layer-wise Quantization for Quantized Optimistic Dual Averaging ICML2025

链接: https://arxiv.org/abs/2505.14371
作者: Anh Duc Nguyen,Ilia Markov,Frank Zhengqing Wu,Ali Ramezani-Kebrya,Kimon Antonakopoulos,Dan Alistarh,Volkan Cevher
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Accepted at the International Conference on Machine Learning (ICML 2025)

点击查看摘要

Abstract:Modern deep neural networks exhibit heterogeneity across numerous layers of various types such as residuals, multi-head attention, etc., due to varying structures (dimensions, activation functions, etc.) and distinct representation characteristics, which impact predictions. We develop a general layer-wise quantization framework with tight variance and code-length bounds, adapting to the heterogeneities over the course of training. We then apply a new layer-wise quantization technique within distributed variational inequalities (VIs), proposing a novel Quantized Optimistic Dual Averaging (QODA) algorithm with adaptive learning rates, which achieves competitive convergence rates for monotone VIs. We empirically show that QODA achieves up to a 150% speedup over the baselines in end-to-end training time for training Wasserstein GAN on 12+ GPUs.
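
The generic layer-wise ingredient, assigning each layer its own quantization budget, can be sketched with unbiased stochastic uniform quantization as below. The per-layer budgets are made up for illustration; this is not QODA's adaptive scheme or its variance/code-length analysis:

```python
import torch

def quantize_layer(grad: torch.Tensor, levels: int) -> torch.Tensor:
    """Unbiased stochastic uniform quantization with a per-layer number of levels."""
    scale = grad.abs().max()
    if scale == 0:
        return grad
    normed = grad.abs() / scale * (levels - 1)
    lower = normed.floor()
    prob = normed - lower                  # round up with this probability -> unbiased
    q = lower + torch.bernoulli(prob)
    return grad.sign() * q / (levels - 1) * scale

# Cheaper layers get coarser quantization; sensitive layers keep more levels.
grads = {"attention": torch.randn(4, 4), "residual": torch.randn(4, 4)}
budget = {"attention": 16, "residual": 4}
quantized = {name: quantize_layer(g, budget[name]) for name, g in grads.items()}
print(quantized)
```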

[LG-28] Towards eliciting latent knowledge from LLMs with mechanistic interpretability

链接: https://arxiv.org/abs/2505.14352
作者: Bartosz Cywiński,Emil Ryd,Senthooran Rajamanoharan,Neel Nanda
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As language models become more powerful and sophisticated, it is crucial that they remain trustworthy and reliable. There is concerning preliminary evidence that models may attempt to deceive or keep secrets from their operators. To explore the ability of current techniques to elicit such hidden knowledge, we train a Taboo model: a language model that describes a specific secret word without explicitly stating it. Importantly, the secret word is not presented to the model in its training data or prompt. We then investigate methods to uncover this secret. First, we evaluate non-interpretability (black-box) approaches. Subsequently, we develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders. Evaluation shows that both approaches are effective in eliciting the secret word in our proof-of-concept setting. Our findings highlight the promise of these approaches for eliciting hidden knowledge and suggest several promising avenues for future work, including testing and refining these methods on more complex model organisms. This work aims to be a step towards addressing the crucial problem of eliciting secret knowledge from language models, thereby contributing to their safe and reliable deployment.
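
One of the interpretability tools mentioned, the logit lens, is straightforward to reproduce on a public model. The sketch below decodes each layer's residual stream of GPT-2 through the final unembedding; it shows the generic recipe only, not the paper's Taboo-model setup:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The secret word is"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decode each layer's residual stream through the final unembedding ("logit lens").
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))  # last position only
    print(layer, tok.decode(logits.argmax(-1)))
```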

[LG-29] Better Neural Network Expressivity: Subdividing the Simplex

链接: https://arxiv.org/abs/2505.14338
作者: Egor Bakaev,Florestan Brunck,Christoph Hertrich,Jack Stade,Amir Yehudayoff
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Neural and Evolutionary Computing (cs.NE); Combinatorics (math.CO)
*备注: 11 pages, 1 figure

点击查看摘要

Abstract:This work studies the expressivity of ReLU neural networks with a focus on their depth. A sequence of previous works showed that \lceil \log_2(n+1) \rceil hidden layers are sufficient to compute all continuous piecewise linear (CPWL) functions on \mathbb{R}^n . Hertrich, Basu, Di Summa, and Skutella (NeurIPS'21) conjectured that this result is optimal in the sense that there are CPWL functions on \mathbb{R}^n , like the maximum function, that require this depth. We disprove the conjecture and show that \lceil\log_3(n-1)\rceil+1 hidden layers are sufficient to compute all CPWL functions on \mathbb{R}^n . A key step in the proof is that ReLU neural networks with two hidden layers can exactly represent the maximum function of five inputs. More generally, we show that \lceil\log_3(n-2)\rceil+1 hidden layers are sufficient to compute the maximum of n\geq 4 numbers. Our constructions almost match the \lceil\log_3(n)\rceil lower bound of Averkov, Hojny, and Merkert (ICLR'25) in the special case of ReLU networks with weights that are decimal fractions. The constructions have a geometric interpretation via polyhedral subdivisions of the simplex into "easier" polytopes.
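
For intuition, the classical construction behind the earlier \lceil \log_2(n+1) \rceil bound rests on the identity max(a,b) = (a+b+|a-b|)/2 with |x| = ReLU(x)+ReLU(-x), composed in a tournament. The NumPy check below illustrates that baseline only; the paper's improved \log_3 construction (max of five inputs in two hidden layers) is not reproduced here:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max2(a, b):
    # max(a, b) = (a + b + |a - b|) / 2, with |x| = relu(x) + relu(-x):
    # a one-hidden-layer ReLU network computing the max of two numbers.
    return 0.5 * (a + b + relu(a - b) + relu(b - a))

def max4(a, b, c, d):
    # Tournament composition: each extra "round" costs one more hidden layer,
    # giving the classical ceil(log2(n)) depth upper bound the paper improves on.
    return max2(max2(a, b), max2(c, d))

x = np.random.randn(4)
assert np.isclose(max4(*x), x.max())
print(max4(*x), x.max())
```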

[LG-30] Vulnerability of Transfer-Learned Neural Networks to Data Reconstruction Attacks in Small-Data Regime

链接: https://arxiv.org/abs/2505.14323
作者: Tomasz Maciążek,Robert Allison
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 27 pages

点击查看摘要

Abstract:Training data reconstruction attacks enable adversaries to recover portions of a released model’s training data. We consider the attacks where a reconstructor neural network learns to invert the (random) mapping between training data and model weights. Prior work has shown that an informed adversary with access to released model’s weights and all but one training data point can achieve high-quality reconstructions in this way. However, differential privacy can defend against such an attack with little to no loss in model’s utility when the amount of training data is sufficiently large. In this work we consider a more realistic adversary who only knows the distribution from which a small training dataset has been sampled and who attacks a transfer-learned neural network classifier that has been trained on this dataset. We exhibit an attack that works in this realistic threat model and demonstrate that in the small-data regime it cannot be defended against by DP-SGD without severely damaging the classifier accuracy. This raises significant concerns about the use of such transfer-learned classifiers when protection of training-data is paramount. We demonstrate the effectiveness and robustness of our attack on VGG, EfficientNet and ResNet image classifiers transfer-learned on MNIST, CIFAR-10 and CelebA respectively. Additionally, we point out that the commonly used (true-positive) reconstruction success rate metric fails to reliably quantify the actual reconstruction effectiveness. Instead, we make use of the Neyman-Pearson lemma to construct the receiver operating characteristic curve and consider the associated true-positive reconstruction rate at a fixed level of the false-positive reconstruction rate.
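
The evaluation point about reporting true-positive rate at a fixed false-positive rate can be sketched with synthetic scores; the score distributions below are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Hypothetical per-candidate reconstruction scores: higher should mean
# "this reconstruction matches a real training point".
scores_true = rng.normal(2.0, 1.0, 500)    # reconstructions of actual members
scores_false = rng.normal(0.0, 1.0, 500)   # spurious reconstructions

y = np.concatenate([np.ones(500), np.zeros(500)])
s = np.concatenate([scores_true, scores_false])
fpr, tpr, _ = roc_curve(y, s)

# Report TPR at a fixed low FPR instead of a raw "success rate".
target_fpr = 0.01
print("TPR @ 1% FPR:", tpr[np.searchsorted(fpr, target_fpr)])
```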

[LG-31] Low-Cost FlashAttention with Fused Exponential and Multiplication Hardware Operators

链接: https://arxiv.org/abs/2505.14314
作者: Kosmas Alexandridis,Vasileios Titopoulos,Giorgos Dimitrakopoulos
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: IEEE Computer Society Annual Symposium on VLSI (ISVLSI 2025)

点击查看摘要

Abstract:Attention mechanisms, particularly within Transformer architectures and large language models (LLMs), have revolutionized sequence modeling in machine learning and artificial intelligence applications. To compute attention for increasingly long sequences, specialized accelerators have been proposed to execute key attention steps directly in hardware. Among the various recently proposed architectures, those based on variants of the FlashAttention algorithm, originally designed for GPUs, stand out due to their optimized computation, tiling capabilities, and reduced memory traffic. In this work, we focus on optimizing the kernel of floating-point-based FlashAttention using new hardware operators that fuse the computation of exponentials and vector multiplications, e.g., e^x \cdot V . The proposed ExpMul hardware operators significantly reduce the area and power costs of FlashAttention-based hardware accelerators. When implemented in a 28nm ASIC technology, they achieve improvements of 28.8% in area and 17.6% in power, on average, compared to state-of-the-art hardware architectures with separate exponentials and vector multiplications hardware operators.
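
The exp-then-multiply pattern targeted by the ExpMul operators appears in the inner loop of FlashAttention-style streaming softmax. Here is a NumPy sketch of that software pattern (the hardware fusion itself is obviously not modeled):

```python
import numpy as np

def streaming_softmax_av(scores, values):
    """Numerically stable streaming softmax(scores) @ values, FlashAttention-style.
    The inner loop is dominated by exp-then-multiply (e^x * v), the very pattern
    the proposed operators fuse in hardware."""
    m, denom, acc = -np.inf, 0.0, np.zeros_like(values[0])
    for s, v in zip(scores, values):
        m_new = max(m, s)
        correction = np.exp(m - m_new)                   # rescale previous partial sums
        denom = denom * correction + np.exp(s - m_new)
        acc = acc * correction + np.exp(s - m_new) * v   # fused exp * vector multiply
        m = m_new
    return acc / denom

scores = np.array([0.3, 2.0, -1.0])
values = np.random.default_rng(0).normal(size=(3, 4))
ref = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ values
assert np.allclose(streaming_softmax_av(scores, values), ref)
```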

[LG-32] Taming Recommendation Bias with Causal Intervention on Evolving Personal Popularity

链接: https://arxiv.org/abs/2505.14310
作者: Shiyin Tan,Dongyuan Li,Renhe Jiang,Zhen Wang,Xingtong Yu,Manabu Okumura
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Popularity bias occurs when popular items are recommended far more frequently than they should be, negatively impacting both user experience and recommendation accuracy. Existing debiasing methods often mitigate popularity bias uniformly across all users and only partially consider the time evolution of users or items. However, users have different levels of preference for item popularity, and this preference evolves over time. To address these issues, we propose a novel method called CausalEPP (Causal Intervention on Evolving Personal Popularity) for taming recommendation bias, which accounts for the evolving personal popularity of users. Specifically, we first introduce a metric called Evolving Personal Popularity to quantify each user's preference for popular items. Then, we design a causal graph that integrates evolving personal popularity into the conformity effect, and apply deconfounded training to mitigate the popularity bias of the causal graph. During inference, we consider the evolution consistency between users and items to achieve better recommendations. Empirical studies demonstrate that CausalEPP outperforms baseline methods in reducing popularity bias while improving recommendation accuracy.

[LG-33] Optimizing Binary and Ternary Neural Network Inference on RRAM Crossbars using CIM-Explorer

链接: https://arxiv.org/abs/2505.14303
作者: Rebecca Pelke,José Cubero-Cascante,Nils Bosbach,Niklas Degener,Florian Idrizi,Lennart M. Reimann,Jan Moritz Joseph,Rainer Leupers
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Using Resistive Random Access Memory (RRAM) crossbars in Computing-in-Memory (CIM) architectures offers a promising solution to overcome the von Neumann bottleneck. Due to non-idealities like cell variability, RRAM crossbars are often operated in binary mode, utilizing only two states: Low Resistive State (LRS) and High Resistive State (HRS). Binary Neural Networks (BNNs) and Ternary Neural Networks (TNNs) are well-suited for this hardware due to their efficient mapping. Existing software projects for RRAM-based CIM typically focus on only one aspect: compilation, simulation, or Design Space Exploration (DSE). Moreover, they often rely on classical 8-bit quantization. To address these limitations, we introduce CIM-Explorer, a modular toolkit for optimizing BNN and TNN inference on RRAM crossbars. CIM-Explorer includes an end-to-end compiler stack, multiple mapping options, and simulators, enabling a DSE flow for accuracy estimation across different crossbar parameters and mappings. CIM-Explorer can accompany the entire design process, from early accuracy estimation for specific crossbar parameters, to selecting an appropriate mapping, and compiling BNNs and TNNs for a finalized crossbar chip. In DSE case studies, we demonstrate the expected accuracy for various mappings and crossbar parameters. CIM-Explorer can be found on GitHub.

[LG-34] A Private Approximation of the 2nd-Moment Matrix of Any Subsamplable Input

链接: https://arxiv.org/abs/2505.14251
作者: Bar Mahpud,Or Sheffet
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:We study the problem of differentially private second-moment estimation and present a new algorithm that achieves strong privacy-utility trade-offs even for worst-case inputs under subsamplability assumptions on the data. We call an input (m,\alpha,\beta)-subsamplable if a random subsample of size m (or larger) preserves w.p. \geq 1-\beta the spectral structure of the original second-moment matrix up to a multiplicative factor of 1\pm \alpha . Building upon subsamplability, we give a recursive algorithmic framework similar to Kamath et al 2019, that satisfies zero-Concentrated Differential Privacy (zCDP) while preserving w.h.p. the accuracy of the second-moment estimation up to an arbitrary factor of (1\pm\gamma) . We then show how to apply our algorithm to approximate the second-moment matrix of a distribution \mathcal{D} , even when a noticeable fraction of the input are outliers.
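
A heavily simplified, non-recursive sketch of the underlying idea: estimate the second moment from a subsample and privatize it with the Gaussian mechanism calibrated to zCDP. The sensitivity bound assumes unit-norm rows, and the sketch omits the paper's recursive framework and outlier handling:

```python
import numpy as np

def private_second_moment(X, m, rho, rng):
    """Subsample, estimate the second-moment matrix, then add Gaussian noise
    calibrated to rho-zCDP. Assumes rows with norm <= 1, so swapping one row
    changes the (1/m)-scaled matrix by at most 2/m in Frobenius norm."""
    idx = rng.choice(len(X), size=m, replace=False)
    S = X[idx].T @ X[idx] / m              # the subsample preserves the spectrum
    sigma = (2.0 / m) / np.sqrt(2 * rho)   # Gaussian mechanism for rho-zCDP
    noise = rng.normal(0.0, sigma, S.shape)
    return S + (noise + noise.T) / 2       # symmetrize to keep a valid estimate

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # enforce unit-norm rows
print(np.round(private_second_moment(X, m=2000, rho=1.0, rng=rng), 3))
```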

[LG-35] Learning with Local Search MCMC Layers

链接: https://arxiv.org/abs/2505.14240
作者: Germain Vivier-Ardisson,Mathieu Blondel,Axel Parmentier
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Integrating combinatorial optimization layers into neural networks has recently attracted significant research interest. However, many existing approaches lack theoretical guarantees or fail to perform adequately when relying on inexact solvers. This is a critical limitation, as many operations research problems are NP-hard, often necessitating the use of neighborhood-based local search heuristics. These heuristics iteratively generate and evaluate candidate solutions based on an acceptance rule. In this paper, we introduce a theoretically-principled approach for learning with such inexact combinatorial solvers. Inspired by the connection between simulated annealing and Metropolis-Hastings, we propose to transform problem-specific neighborhood systems used in local search heuristics into proposal distributions, implementing MCMC on the combinatorial space of feasible solutions. This allows us to construct differentiable combinatorial layers and associated loss functions. Replacing an exact solver by a local search strongly reduces the computational burden of learning on many applications. We demonstrate our approach on a large-scale dynamic vehicle routing problem with time windows.
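
The core move, turning a local-search neighborhood into an MCMC proposal, can be sketched on a toy TSP with 2-opt moves as a symmetric Metropolis-Hastings proposal. This illustrates the sampling layer only, not the differentiable learning machinery built on top of it:

```python
import random, math

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def mh_local_search(dist, steps=5000, temp=0.1, seed=0):
    """Metropolis-Hastings over TSP tours: 2-opt moves (a classic local-search
    neighborhood) serve as a symmetric proposal distribution."""
    rng = random.Random(seed)
    n = len(dist)
    tour = list(range(n))
    cost = tour_length(tour, dist)
    for _ in range(steps):
        i, j = sorted(rng.sample(range(n), 2))
        cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]      # 2-opt reversal
        c = tour_length(cand, dist)
        if rng.random() < math.exp(min(0.0, (cost - c) / temp)):  # MH acceptance
            tour, cost = cand, c
    return tour, cost

rng = random.Random(1)
pts = [(rng.random(), rng.random()) for _ in range(12)]
dist = [[math.dist(p, q) for q in pts] for p in pts]
print(mh_local_search(dist)[1])
```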

[LG-36] Regularized least squares learning with heavy-tailed noise is minimax optimal

链接: https://arxiv.org/abs/2505.14214
作者: Mattes Mollenhauer,Nicole Mücke,Dimitri Meunier,Arthur Gretton
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 32 pages, 1 figure

点击查看摘要

Abstract:This paper examines the performance of ridge regression in reproducing kernel Hilbert spaces in the presence of noise that exhibits a finite number of higher moments. We establish excess risk bounds consisting of subgaussian and polynomial terms based on the well known integral operator framework. The dominant subgaussian component allows to achieve convergence rates that have previously only been derived under subexponential noise - a prevalent assumption in related work from the last two decades. These rates are optimal under standard eigenvalue decay conditions, demonstrating the asymptotic robustness of regularized least squares against heavy-tailed noise. Our derivations are based on a Fuk-Nagaev inequality for Hilbert-space valued random variables.

[LG-37] A PID-Controlled Tensor Wheel Decomposition Model for Dynamic Link Prediction

链接: https://arxiv.org/abs/2505.14211
作者: Qu Wang,Yan Xia
类目: Machine Learning (cs.LG)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:Link prediction in dynamic networks remains a fundamental challenge in network science, requiring the inference of potential interactions and their evolving strengths through spatiotemporal pattern analysis. Traditional static network methods have inherent limitations in capturing temporal dependencies and weight dynamics, while tensor-based methods offer a promising paradigm by encoding dynamic networks into high-order tensors to explicitly model multidimensional interactions across nodes and time. Among them, tensor wheel decomposition (TWD) stands out for its innovative topological structure, which decomposes high-order tensors into cyclic factors and core tensors to maintain structural integrity. To improve the prediction accuracy, this study introduces a PID-controlled tensor wheel decomposition (PTWD) model, which mainly adopts the following two ideas: 1) exploiting the representation power of TWD to capture the latent features of dynamic network topology and weight evolution, and 2) integrating the proportional-integral-derivative (PID) control principle into the optimization process to obtain a stable model parameter learning scheme. The performance on four real datasets verifies that the proposed PTWD model has more accurate link prediction capabilities compared to other models.
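
To unpack the PID ingredient, here is a toy sketch of a proportional-integral-derivative update applied to gradients on a quadratic objective; the gains and the way PID enters the optimization are illustrative assumptions, not the paper's parameter learning scheme:

```python
import numpy as np

def pid_update(theta, grad, state, kp=0.1, ki=0.01, kd=0.05):
    """One PID-controlled parameter update: the proportional term is the current
    gradient, the integral term accumulates past gradients, and the derivative
    term reacts to gradient change (names and gains are illustrative)."""
    state["integral"] += grad
    derivative = grad - state["prev"]
    state["prev"] = grad.copy()
    return theta - (kp * grad + ki * state["integral"] + kd * derivative)

# Toy quadratic objective f(theta) = ||theta - 3||^2
theta = np.zeros(2)
state = {"integral": np.zeros(2), "prev": np.zeros(2)}
for _ in range(100):
    grad = 2 * (theta - 3.0)
    theta = pid_update(theta, grad, state)
print(theta)  # should approach [3, 3]
```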

[LG-38] MSDformer: Multi-scale Discrete Transformer For Time Series Generation

链接: https://arxiv.org/abs/2505.14202
作者: Zhicheng Chen,Shibo Feng,Xi Xiao,Zhong Zhang,Qing Li,Xingyu Gao,Peilin Zhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discrete Token Modeling (DTM), which employs vector quantization techniques, has demonstrated remarkable success in modeling non-natural language modalities, particularly in time series generation. While our prior work SDformer established the first DTM-based framework to achieve state-of-the-art performance in this domain, two critical limitations persist in existing DTM approaches: 1) their inability to capture multi-scale temporal patterns inherent to complex time series data, and 2) the absence of theoretical foundations to guide model optimization. To address these challenges, we propose a novel multi-scale DTM-based time series generation method, called Multi-Scale Discrete Transformer (MSDformer). MSDformer employs a multi-scale time series tokenizer to learn discrete token representations at multiple scales, which jointly characterize the complex nature of time series data. Subsequently, MSDformer applies a multi-scale autoregressive token modeling technique to capture the multi-scale patterns of time series within the discrete latent space. Theoretically, we validate the effectiveness of the DTM method and the rationality of MSDformer through the rate-distortion theorem. Comprehensive experiments demonstrate that MSDformer significantly outperforms state-of-the-art methods. Both theoretical analysis and experimental results demonstrate that incorporating multi-scale information and modeling multi-scale patterns can substantially enhance the quality of generated time series in DTM-based approaches. The code will be released upon acceptance.

[LG-39] Nonparametric Teaching for Graph Property Learners ICML2025

链接: https://arxiv.org/abs/2505.14170
作者: Chen Zhang,Weixin Bu,Zeyi Ren,Zhengwu Liu,Yik-Chung Wu,Ngai Wong
类目: Machine Learning (cs.LG)
*备注: ICML 2025 Spotlight (25 pages, 17 figures)

点击查看摘要

Abstract:Inferring properties of graph-structured data, e.g., the solubility of molecules, essentially involves learning the implicit mapping from graphs to their properties. This learning process is often costly for graph property learners like Graph Convolutional Networks (GCNs). To address this, we propose a paradigm called Graph Neural Teaching (GraNT) that reinterprets the learning process through a novel nonparametric teaching perspective. Specifically, the latter offers a theoretical framework for teaching implicitly defined (i.e., nonparametric) mappings via example selection. Such an implicit mapping is realized by a dense set of graph-property pairs, with the GraNT teacher selecting a subset of them to promote faster convergence in GCN training. By analytically examining the impact of graph structure on parameter-based gradient descent during training, and recasting the evolution of GCNs (shaped by parameter updates) through functional gradient descent in nonparametric teaching, we show for the first time that teaching graph property learners (i.e., GCNs) is consistent with teaching structure-aware nonparametric learners. These new findings readily enable GraNT to enhance the learning efficiency of the graph property learner, showing significant reductions in training time for graph-level regression (-36.62%), graph-level classification (-38.19%), node-level regression (-30.97%) and node-level classification (-47.30%), all while maintaining its generalization performance.

[LG-40] Personalized Bayesian Federated Learning with Wasserstein Barycenter Aggregation

链接: https://arxiv.org/abs/2505.14161
作者: Ting Wei,Biao Mei,Junliang Lyu,Renquan Zhang,Feng Zhou,Yifan Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Personalized Bayesian federated learning (PBFL) handles non-i.i.d. client data and quantifies uncertainty by combining personalization with Bayesian inference. However, existing PBFL methods face two limitations: restrictive parametric assumptions in client posterior inference and naive parameter averaging for server aggregation. To overcome these issues, we propose FedWBA, a novel PBFL method that enhances both local inference and global aggregation. At the client level, we use particle-based variational inference for nonparametric posterior representation. At the server level, we introduce particle-based Wasserstein barycenter aggregation, offering a more geometrically meaningful approach. Theoretically, we provide local and global convergence guarantees for FedWBA. Locally, we prove a KL divergence decrease lower bound per iteration for variational inference convergence. Globally, we show that the Wasserstein barycenter converges to the true parameter as the client data size increases. Empirically, experiments show that FedWBA outperforms baselines in prediction accuracy, uncertainty calibration, and convergence rate, with ablation studies confirming its robustness.
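
In one dimension the particle-based Wasserstein barycenter has a closed form, averaging sorted particles (quantile averaging), which gives a cheap way to see what the server-side aggregation does. The multi-client setup below is synthetic, and the paper's method of course handles the general multivariate case:

```python
import numpy as np

def wasserstein_barycenter_1d(particle_sets, weights=None):
    """In 1-D, the W2 barycenter of empirical measures is obtained by averaging
    sorted particles (i.e., averaging quantile functions) -- a cheap stand-in
    for the particle aggregation performed at the server."""
    sorted_sets = np.sort(np.asarray(particle_sets), axis=1)
    if weights is None:
        weights = np.ones(len(sorted_sets)) / len(sorted_sets)
    return np.average(sorted_sets, axis=0, weights=weights)

rng = np.random.default_rng(0)
# Three clients, each with 100 posterior particles for one scalar parameter.
clients = [rng.normal(loc=mu, scale=0.5, size=100) for mu in (0.8, 1.0, 1.3)]
print(wasserstein_barycenter_1d(clients).mean())  # barycenter centered near ~1.03
```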

[LG-41] MAS-KCL: Knowledge component graph structure learning with large language model-based agentic workflow

链接: https://arxiv.org/abs/2505.14126
作者: Yuan-Hao Jiang,Kezong Tang,Zi-Wei Chen,Yuang Wei,Tian-Yi Liu,Jiayi Wu
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
*备注: In CGI 2025: 42nd Computer Graphics International Conference, Kowloon, Hong Kong, Paper No. 134

点击查看摘要

Abstract:Knowledge components (KCs) are the fundamental units of knowledge in the field of education. A KC graph illustrates the relationships and dependencies between KCs. An accurate KC graph can assist educators in identifying the root causes of learners’ poor performance on specific KCs, thereby enabling targeted instructional interventions. To achieve this, we have developed a KC graph structure learning algorithm, named MAS-KCL, which employs a multi-agent system driven by large language models for adaptive modification and optimization of the KC graph. Additionally, a bidirectional feedback mechanism is integrated into the algorithm, where AI agents leverage this mechanism to assess the value of edges within the KC graph and adjust the distribution of generation probabilities for different edges, thereby accelerating the efficiency of structure learning. We applied the proposed algorithm to 5 synthetic datasets and 4 real-world educational datasets, and experimental results validate its effectiveness in learning path recognition. By accurately identifying learners’ learning paths, teachers are able to design more comprehensive learning plans, enabling learners to achieve their educational goals more effectively, thus promoting the sustainable development of education.

[LG-42] Assessing wildfire susceptibility in Iran: Leveraging machine learning for geospatial analysis of climatic and anthropogenic factors

链接: https://arxiv.org/abs/2505.14122
作者: Ehsan Masoudian,Ali Mirzaei,Hossein Bagheri
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study investigates the multifaceted factors influencing wildfire risk in Iran, focusing on the interplay between climatic conditions and human activities. Utilizing advanced remote sensing, geospatial information system (GIS) processing techniques such as cloud computing, and machine learning algorithms, this research analyzed the impact of climatic parameters, topographic features, and human-related factors on wildfire susceptibility assessment and prediction in Iran. Multiple scenarios were developed for this purpose based on the data sampling strategy. The findings revealed that climatic elements such as soil moisture, temperature, and humidity significantly contribute to wildfire susceptibility, while human activities-particularly population density and proximity to powerlines-also played a crucial role. Furthermore, the seasonal impact of each parameter was separately assessed during warm and cold seasons. The results indicated that human-related factors, rather than climatic variables, had a more prominent influence during the seasonal analyses. This research provided new insights into wildfire dynamics in Iran by generating high-resolution wildfire susceptibility maps using advanced machine learning classifiers. The generated maps identified high risk areas, particularly in the central Zagros region, the northeastern Hyrcanian Forest, and the northern Arasbaran forest, highlighting the urgent need for effective fire management strategies.

[LG-43] Personalized and Resilient Distributed Learning Through Opinion Dynamics

链接: https://arxiv.org/abs/2505.14081
作者: Luca Ballotta,Nicola Bastianello,Riccardo M. G. Ferrari,Karl H. Johansson
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注: This work has been submitted to IEEE for possible publication

点击查看摘要

Abstract:In this paper, we address two practical challenges of distributed learning in multi-agent network systems, namely personalization and resilience. Personalization is the need of heterogeneous agents to learn local models tailored to their own data and tasks, while still generalizing well; on the other hand, the learning process must be resilient to cyberattacks or anomalous training data to avoid disruption. Motivated by a conceptual affinity between these two requirements, we devise a distributed learning algorithm that combines distributed gradient descent and the Friedkin-Johnsen model of opinion dynamics to fulfill both of them. We quantify its convergence speed and the neighborhood that contains the final learned models, which can be easily controlled by tuning the algorithm parameters to enforce a more personalized/resilient behavior. We numerically showcase the effectiveness of our algorithm on synthetic and real-world distributed learning tasks, where it achieves high global accuracy both for personalized models and with malicious agents compared to standard strategies.
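
A rough sketch of how the Friedkin-Johnsen mixing can be combined with distributed gradient descent: each agent averages neighbors' models but remains "stubbornly" anchored to its own, which yields personalization and dampens a malicious neighbor's influence. All weights and the update ordering are assumptions, not the paper's algorithm:

```python
import numpy as np

def fj_distributed_step(W, models, anchors, grads, lam, lr=0.1):
    """One sketched update mixing neighbor averaging (W row) with a
    Friedkin-Johnsen pull toward each agent's own anchor, then a gradient step."""
    mixed = W @ models                                           # consensus term
    fj = lam[:, None] * mixed + (1 - lam)[:, None] * anchors     # stubbornness
    return fj - lr * grads                                       # local gradient step

n, d = 4, 3
rng = np.random.default_rng(0)
W = np.full((n, n), 1.0 / n)              # fully connected, uniform mixing weights
models = rng.normal(size=(n, d))
anchors = models.copy()                   # each agent's initial/local model
targets = rng.normal(size=(n, d))         # per-agent optimum (heterogeneous data)
lam = np.array([0.9, 0.9, 0.9, 0.3])      # agent 3 is more "stubborn"/personalized
for _ in range(200):
    grads = models - targets              # gradient of 0.5 * ||model - target||^2
    models = fj_distributed_step(W, models, anchors, grads, lam)
print(models.round(2))
```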

[LG-44] Generalized Category Discovery via Token Manifold Capacity Learning

链接: https://arxiv.org/abs/2505.14044
作者: Luyao Tang,Kunze Huang,Chaoqi Chen,Cheng Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generalized category discovery (GCD) is essential for improving deep learning models’ robustness in open-world scenarios by clustering unlabeled data containing both known and novel categories. Traditional GCD methods focus on minimizing intra-cluster variations, often sacrificing manifold capacity, which limits the richness of intra-class representations. In this paper, we propose a novel approach, Maximum Token Manifold Capacity (MTMC), that prioritizes maximizing the manifold capacity of class tokens to preserve the diversity and complexity of data. MTMC leverages the nuclear norm of singular values as a measure of manifold capacity, ensuring that the representation of samples remains informative and well-structured. This method enhances the discriminability of clusters, allowing the model to capture detailed semantic features and avoid the loss of critical information during clustering. Through theoretical analysis and extensive experiments on coarse- and fine-grained datasets, we demonstrate that MTMC outperforms existing GCD methods, improving both clustering accuracy and the estimation of category numbers. The integration of MTMC leads to more complete representations, better inter-class separability, and a reduction in dimensional collapse, establishing MTMC as a vital component for robust open-world learning. Code is in this http URL.
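
The capacity measure itself is a one-liner: the nuclear norm (sum of singular values) of a class's token matrix, which is differentiable in PyTorch and can be maximized as a regularizer. A minimal sketch, with the surrounding GCD training loop omitted:

```python
import torch

def token_manifold_capacity(tokens: torch.Tensor) -> torch.Tensor:
    """Nuclear norm of a class's token matrix, used here as the capacity
    measure to *maximize* so intra-class representations stay rich."""
    return torch.linalg.svdvals(tokens).sum()

tokens = torch.randn(32, 128, requires_grad=True)  # 32 class tokens, 128-dim
loss = -token_manifold_capacity(tokens)            # maximize capacity
loss.backward()                                    # differentiable regularizer
print(token_manifold_capacity(tokens).item())
```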

[LG-45] Unsupervised Graph Clustering with Deep Structural Entropy KDD2025

链接: https://arxiv.org/abs/2505.14040
作者: Jingyun Zhang,Hao Peng,Li Sun,Guanlin Wu,Chunyang Liu,Zhengtao Yu
类目: Machine Learning (cs.LG)
*备注: Accepted to Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2025 (KDD 2025). 13 pages, 10 figures, 11 tables

点击查看摘要

Abstract:Research on Graph Structure Learning (GSL) provides key insights for graph-based clustering, yet current methods like Graph Neural Networks (GNNs), Graph Attention Networks (GATs), and contrastive learning often rely heavily on the original graph structure. Their performance deteriorates when the original graph’s adjacency matrix is too sparse or contains noisy edges unrelated to clustering. Moreover, these methods depend on learning node embeddings and using traditional techniques like k-means to form clusters, which may not fully capture the underlying graph structure between nodes. To address these limitations, this paper introduces DeSE, a novel unsupervised graph clustering framework incorporating Deep Structural Entropy. It enhances the original graph with quantified structural information and deep neural networks to form clusters. Specifically, we first propose a method for calculating structural entropy with soft assignment, which quantifies structure in a differentiable form. Next, we design a Structural Learning layer (SLL) to generate an attributed graph from the original feature data, serving as a target to enhance and optimize the original structural graph, thereby mitigating the issue of sparse connections between graph nodes. Finally, our clustering assignment method (ASS), based on GNNs, learns node embeddings and a soft assignment matrix to cluster on the enhanced graph. The ASS layer can be stacked to meet downstream task requirements, minimizing structural entropy for stable clustering and maximizing node consistency with edge-based cross-entropy loss. Extensive comparative experiments are conducted on four benchmark datasets against eight representative unsupervised graph clustering baselines, demonstrating the superiority of the DeSE in both effectiveness and interpretability.

[LG-46] Learning High-dimensional Ionic Model Dynamics Using Fourier Neural Operators

链接: https://arxiv.org/abs/2505.14039
作者: Luca Pellegrini,Massimiliano Ghiotto,Edoardo Centofanti,Luca Franco Pavarino
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Ionic models, described by systems of stiff ordinary differential equations, are fundamental tools for simulating the complex dynamics of excitable cells in both Computational Neuroscience and Cardiology. Approximating these models using Artificial Neural Networks poses significant challenges due to their inherent stiffness, multiscale nonlinearities, and the wide range of dynamical behaviors they exhibit, including multiple equilibrium points, limit cycles, and intricate interactions. While in previous studies the dynamics of the transmembrane potential has been predicted in low dimensionality settings, in the present study we extend these results by investigating whether Fourier Neural Operators can effectively learn the evolution of all the state variables within these dynamical systems in higher dimensions. We demonstrate the effectiveness of this approach by accurately learning the dynamics of three well-established ionic models with increasing dimensionality: the two-variable FitzHugh-Nagumo model, the four-variable Hodgkin-Huxley model, and the forty-one-variable O’Hara-Rudy model. To ensure the selection of near-optimal configurations for the Fourier Neural Operator, we conducted automatic hyperparameter tuning under two scenarios: an unconstrained setting, where the number of trainable parameters is not limited, and a constrained case with a fixed number of trainable parameters. Both constrained and unconstrained architectures achieve comparable results in terms of accuracy across all the models considered. However, the unconstrained architecture required approximately half the number of training epochs to achieve similar error levels, as evidenced by the loss function values recorded during training. These results underline the capabilities of Fourier Neural Operators to accurately capture complex multiscale dynamics, even in high-dimensional dynamical systems.
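
Generating trajectories of the simplest model mentioned, the two-variable FitzHugh-Nagumo system, is a natural first step toward FNO training data. Below is a SciPy sketch with standard parameter values (the FNO itself is not shown):

```python
import numpy as np
from scipy.integrate import solve_ivp

def fitzhugh_nagumo(t, y, I_ext=0.5, a=0.7, b=0.8, tau=12.5):
    """Two-variable FitzHugh-Nagumo ionic model: v is the membrane potential,
    w the recovery variable."""
    v, w = y
    dv = v - v**3 / 3 - w + I_ext
    dw = (v + a - b * w) / tau
    return [dv, dw]

t_eval = np.linspace(0, 100, 1000)
sol = solve_ivp(fitzhugh_nagumo, (0, 100), y0=[-1.0, 1.0], t_eval=t_eval,
                method="LSODA")   # stiff-aware solver
trajectory = sol.y.T              # (time, state) pairs -> candidate FNO training data
print(trajectory.shape, trajectory[-1])
```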

[LG-47] Partition-wise Graph Filtering: A Unified Perspective Through the Lens of Graph Coarsening KDD2025

链接: https://arxiv.org/abs/2505.14033
作者: Guoming Li,Jian Yang,Yifan Chen
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Numerical Analysis (math.NA)
*备注: Accepted at the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2025 February Cycle

点击查看摘要

Abstract:Filtering-based graph neural networks (GNNs) constitute a distinct class of GNNs that employ graph filters to handle graph-structured data, achieving notable success in various graph-related tasks. Conventional methods adopt a graph-wise filtering paradigm, imposing a uniform filter across all nodes, yet recent findings suggest that this rigid paradigm struggles with heterophilic graphs. To overcome this, recent works have introduced node-wise filtering, which assigns distinct filters to individual nodes, offering enhanced adaptability. However, a fundamental gap remains: a comprehensive framework unifying these two strategies is still absent, limiting theoretical insights into the filtering paradigms. Moreover, through the lens of Contextual Stochastic Block Model, we reveal that a synthesis of graph-wise and node-wise filtering provides a sufficient solution for classification on graphs exhibiting both homophily and heterophily, suggesting the risk of excessive parameterization and potential overfitting with node-wise filtering. To address the limitations, this paper introduces Coarsening-guided Partition-wise Filtering (CPF). CPF innovates by performing filtering on node partitions. The method begins with structure-aware partition-wise filtering, which filters node partitions obtained via graph coarsening algorithms, and then performs feature-aware partition-wise filtering, refining node embeddings via filtering on clusters produced by k -means clustering over features. In-depth analysis is conducted for each phase of CPF, showing its superiority over other paradigms. Finally, benchmark node classification experiments, along with a real-world graph anomaly detection application, validate CPF’s efficacy and practical utility.

[LG-48] Adaptive Sentencing Prediction with Guaranteed Accuracy and Legal Interpretability

链接: https://arxiv.org/abs/2505.14011
作者: Yifei Jin,Xin Zheng,Lei Guo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing research on judicial sentencing prediction predominantly relies on end-to-end models, which often neglect the inherent sentencing logic and lack interpretability-a critical requirement for both scholarly research and judicial practice. To address this challenge, we make three key contributions: First, we propose a novel Saturated Mechanistic Sentencing (SMS) model, which provides inherent legal interpretability by virtue of its foundation in China's Criminal Law. We also introduce the corresponding Momentum Least Mean Squares (MLMS) adaptive algorithm for this model. Second, for the MLMS algorithm based adaptive sentencing predictor, we establish a mathematical theory on the accuracy of adaptive prediction without resorting to any stationarity and independence assumptions on the data. We also provide a best possible upper bound for the prediction accuracy achievable by the best predictor designed in the known parameters case. Third, we construct a Chinese Intentional Bodily Harm (CIBH) dataset. Utilizing this real-world data, extensive experiments demonstrate that our approach achieves a prediction accuracy that is not far from the best possible theoretical upper bound, validating both the model's suitability and the algorithm's accuracy.

[LG-49] VAMO: Efficient Large-Scale Nonconvex Optimization via Adaptive Zeroth Order Variance Reduction

链接: https://arxiv.org/abs/2505.13954
作者: Jiahe Chen,Ziye Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimizing large-scale nonconvex problems, common in machine learning, demands balancing rapid convergence with computational efficiency. First-order (FO) stochastic methods like SVRG provide fast convergence and good generalization but incur high costs due to full-batch gradients in large models. Conversely, zeroth-order (ZO) algorithms reduce this burden using estimated gradients, yet their slow convergence in high-dimensional settings limits practicality. We introduce VAMO (VAriance-reduced Mixed-gradient Optimizer), a stochastic variance-reduced method combining FO mini-batch gradients with lightweight ZO finite-difference probes under an SVRG-style framework. VAMO’s hybrid design uses a two-point ZO estimator to achieve a dimension-agnostic convergence rate of \mathcal{O}(1/T + 1/b) , where T is the number of iterations and b is the batch-size, surpassing the dimension-dependent slowdown of purely ZO methods and significantly improving over SGD’s \mathcal{O}(1/\sqrt{T}) rate. Additionally, we propose a multi-point ZO variant that mitigates the \mathcal{O}(1/b) error by adjusting the number of estimation points to balance convergence and cost, making it ideal for a whole range of computationally constrained scenarios. Experiments including traditional neural network training and LLM finetuning show VAMO outperforms established FO and ZO methods, offering a faster, more flexible option for improved efficiency.
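
As a rough illustration of the hybrid FO/ZO idea, the sketch below replaces SVRG's expensive full-batch reference gradient with an averaged two-point finite-difference probe (closer in spirit to the multi-point variant). Everything here, including the name `vamo_like_epoch`, is a hypothetical reconstruction from the abstract, not the authors' code.

```python
import numpy as np

def zo_grad(f, x, n_probes=16, mu=1e-4, rng=None):
    """Averaged two-point ZO estimator: each probe is a directional
    finite difference along a random Gaussian direction."""
    rng = rng or np.random.default_rng()
    g = np.zeros_like(x)
    for _ in range(n_probes):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_probes

def vamo_like_epoch(grad_fn, loss_fn, x, data, batch=32, lr=0.1, rng=None):
    rng = rng or np.random.default_rng()
    x_ref = x.copy()
    # Cheap ZO probe of the full objective at the reference point,
    # standing in for SVRG's expensive full-batch gradient.
    g_ref = zo_grad(lambda z: loss_fn(z, data), x_ref, rng=rng)
    for _ in range(len(data) // batch):
        idx = rng.choice(len(data), batch, replace=False)
        # Variance-reduced direction: FO mini-batch gradient corrected
        # at the reference point by the ZO estimate.
        v = grad_fn(x, data[idx]) - grad_fn(x_ref, data[idx]) + g_ref
        x = x - lr * v
    return x

# Toy objective: f(x) = mean_i ||x - d_i||^2 / 2, minimized at the data mean.
data = np.random.default_rng(0).standard_normal((256, 5)) + 3.0
loss_fn = lambda x, d: 0.5 * np.mean(np.sum((x - d) ** 2, axis=1))
grad_fn = lambda x, d: x - d.mean(axis=0)
x = np.zeros(5)
for epoch in range(20):
    x = vamo_like_epoch(grad_fn, loss_fn, x, data)
print("final loss:", loss_fn(x, data))  # approaches the minimum
```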

[LG-50] Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2505.13925
作者: Yunpeng Jiang,Jianshu Hu,Paul Weng,Yutong Ban
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Symmetry is pervasive in robotics and has been widely exploited to improve sample efficiency in deep reinforcement learning (DRL). However, existing approaches primarily focus on spatial symmetries, such as reflection, rotation, and translation, while largely neglecting temporal symmetries. To address this gap, we explore time reversal symmetry, a form of temporal symmetry commonly found in robotics tasks such as door opening and closing. We propose Time Reversal symmetry enhanced Deep Reinforcement Learning (TR-DRL), a framework that combines trajectory reversal augmentation and time reversal guided reward shaping to efficiently solve temporally symmetric tasks. Our method generates reversed transitions from fully reversible transitions, identified by a proposed dynamics-consistent filter, to augment the training data. For partially reversible transitions, we apply reward shaping to guide learning, according to successful trajectories from the reversed task. Extensive experiments on the Robosuite and MetaWorld benchmarks demonstrate that TR-DRL is effective in both single-task and multi-task settings, achieving higher sample efficiency and stronger final performance compared to baseline methods.
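
The trajectory-reversal half of the method can be sketched as a data-augmentation pass over collected transitions. The dynamics-consistent filter below is a simple reconstruction-error check in a toy deterministic world, standing in for the paper's learned filter (hypothetical names throughout).

```python
def reverse_transitions(trajectory, inverse_dynamics, tol=1e-3):
    """For each (s, a, r, s') step, keep a reversed transition
    (s', a_rev, r, s) only if a reverse action plausibly maps s' back
    to s, i.e. the step passes the dynamics-consistency filter."""
    augmented = []
    for (s, a, r, s_next) in trajectory:
        a_rev, err = inverse_dynamics(s_next, s)
        if err < tol:                       # dynamics-consistent filter
            augmented.append((s_next, a_rev, r, s))
    return augmented

# Toy deterministic world: the state is a position and the action a
# displacement, so the exact reverse action is the negative displacement.
def inverse_dynamics(s_from, s_to):
    a_rev = s_to - s_from
    err = abs((s_from + a_rev) - s_to)      # zero in this toy world
    return a_rev, err

traj = [(0.0, 1.0, 0.1, 1.0), (1.0, 0.5, 0.2, 1.5)]
print(reverse_transitions(traj, inverse_dynamics))
# [(1.0, -1.0, 0.1, 0.0), (1.5, -0.5, 0.2, 1.0)]
```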

[LG-51] ShortcutProbe: Probing Prediction Shortcuts for Learning Robust Models IJCAI2025

链接: https://arxiv.org/abs/2505.13910
作者: Guangtao Zheng,Wenqian Ye,Aidong Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted to IJCAI 2025

点击查看摘要

Abstract:Deep learning models often achieve high performance by inadvertently learning spurious correlations between targets and non-essential features. For example, an image classifier may identify an object via its background that spuriously correlates with it. This prediction behavior, known as spurious bias, severely degrades model performance on data that lacks the learned spurious correlations. Existing methods on spurious bias mitigation typically require a variety of data groups with spurious correlation annotations called group labels. However, group labels require costly human annotations and often fail to capture subtle spurious biases such as relying on specific pixels for predictions. In this paper, we propose a novel post hoc spurious bias mitigation framework without requiring group labels. Our framework, termed ShortcutProbe, identifies prediction shortcuts that reflect potential non-robustness in predictions in a given model’s latent space. The model is then retrained to be invariant to the identified prediction shortcuts for improved robustness. We theoretically analyze the effectiveness of the framework and empirically demonstrate that it is an efficient and practical tool for improving a model’s robustness to spurious bias on diverse datasets.

[LG-52] Cross-Domain Diffusion with Progressive Alignment for Efficient Adaptive Retrieval

链接: https://arxiv.org/abs/2505.13907
作者: Junyu Luo,Yusheng Zhao,Xiao Luo,Zhiping Xiao,Wei Ju,Li Shen,Dacheng Tao,Ming Zhang
类目: Machine Learning (cs.LG)
*备注: IEEE TIP

点击查看摘要

Abstract:Unsupervised efficient domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, while maintaining low storage cost and high retrieval efficiency. However, existing methods typically fail to address potential noise in the target domain, and directly align high-level features across domains, thus resulting in suboptimal retrieval performance. To address these challenges, we propose a novel Cross-Domain Diffusion with Progressive Alignment method (COUPLE). This approach revisits unsupervised efficient domain adaptive retrieval from a graph diffusion perspective, simulating cross-domain adaptation dynamics to achieve a stable target domain adaptation process. First, we construct a cross-domain relationship graph and leverage noise-robust graph flow diffusion to simulate the transfer dynamics from the source domain to the target domain, identifying lower noise clusters. We then leverage the graph diffusion results for discriminative hash code learning, effectively learning from the target domain while reducing the negative impact of noise. Furthermore, we employ a hierarchical Mixup operation for progressive domain alignment, which is performed along the cross-domain random walk paths. Utilizing target domain discriminative hash learning and progressive domain alignment, COUPLE enables effective domain adaptive hash learning. Extensive experiments demonstrate COUPLE’s effectiveness on competitive benchmarks.

[LG-53] New Evidence of the Two-Phase Learning Dynamics of Neural Networks ICLR2025

链接: https://arxiv.org/abs/2505.13900
作者: Zhanpeng Zhou,Yongyi Yang,Mahito Sugiyama,Junchi Yan
类目: Machine Learning (cs.LG)
*备注: This work extends the workshop paper, On the Cone Effect in the Learning Dynamics, accepted by ICLR 2025 Workshop DeLTa

点击查看摘要

Abstract:Understanding how deep neural networks learn remains a fundamental challenge in modern machine learning. A growing body of evidence suggests that training dynamics undergo a distinct phase transition, yet our understanding of this transition is still incomplete. In this paper, we introduce an interval-wise perspective that compares network states across a time window, revealing two new phenomena that illuminate the two-phase nature of deep learning. i) \textbf{The Chaos Effect}. By injecting an imperceptibly small parameter perturbation at various stages, we show that the response of the network to the perturbation exhibits a transition from chaotic to stable, suggesting there is an early critical period where the network is highly sensitive to initial conditions; ii) \textbf{The Cone Effect}. Tracking the evolution of the empirical Neural Tangent Kernel (eNTK), we find that after this transition point the model’s functional trajectory is confined to a narrow cone-shaped subset: while the kernel continues to change, it gets trapped into a tight angular region. Together, these effects provide a structural, dynamical view of how deep networks transition from sensitive exploration to stable refinement during training.
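
The perturbation probe behind the Chaos Effect is easy to state in code: inject a tiny parameter perturbation at step t0, keep training both copies on identical batches, and compare where they end up. The toy below uses convex logistic regression, where the copies contract regardless of t0; the paper's finding is that deep networks instead diverge for early injections. This is a hypothetical setup for illustrating the measurement only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 10))
y = (X @ rng.standard_normal(10) > 0).astype(float)

def grad(w, Xb, yb):
    p = 1 / (1 + np.exp(-Xb @ w))
    return Xb.T @ (p - yb) / len(yb)

def divergence_after_injection(t0, steps=400, lr=0.5, eps=1e-6):
    w, w_pert = np.zeros(10), None
    for t in range(steps):
        idx = rng.choice(512, 64)            # shared batch sequence
        if t == t0:                          # inject a tiny perturbation
            w_pert = w + eps * rng.standard_normal(10)
        w = w - lr * grad(w, X[idx], y[idx])
        if w_pert is not None:
            w_pert = w_pert - lr * grad(w_pert, X[idx], y[idx])
    return np.linalg.norm(w - w_pert)

for t0 in [0, 50, 200]:
    print(f"perturb at step {t0}: final distance "
          f"{divergence_after_injection(t0):.2e}")
```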

[LG-54] Exploring Causes of Representational Similarity in Machine Learning Models

链接: https://arxiv.org/abs/2505.13899
作者: Zeyu Michael Li,Hung Anh Vu,Damilola Awofisayo,Emily Wenger
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Numerous works have noted significant similarities in how machine learning models represent the world, even across modalities. Although much effort has been devoted to uncovering properties and metrics on which these models align, surprisingly little work has explored causes of this similarity. To advance this line of inquiry, this work explores how two possible causal factors – dataset overlap and task overlap – influence downstream model similarity. The exploration of dataset overlap is motivated by the reality that large-scale generative AI models are often trained on overlapping datasets of scraped internet data, while the exploration of task overlap seeks to substantiate claims from a recent work, the Platonic Representation Hypothesis, that task similarity may drive model similarity. We evaluate the effects of both factors through a broad set of experiments. We find that both positively correlate with higher representational similarity and that combining them provides the strongest effect. Our code and dataset are published.

[LG-55] CRAFT: Time Series Forecasting with Cross-Future Behavior Awareness

链接: https://arxiv.org/abs/2505.13896
作者: Yingwei Zhang,Ke Bu,Zhuoran Zhuang,Tao Xie,Yao Yu,Dong Li,Yang Guo,Detao Lv
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The past decades have witnessed significant advancements in time series forecasting (TSF) across various real-world domains, including e-commerce and disease spread prediction. However, TSF is usually constrained by the uncertainty dilemma of predicting future data with limited past observations. To address this, we explore the use of Cross-Future Behavior (CFB) in TSF, which occurs before the current time but takes effect in the future. We leverage CFB features and propose the CRoss-Future Behavior Awareness based Time Series Forecasting method (CRAFT). The core idea of CRAFT is to utilize the trend of cross-future behavior to mine the trend of the time series data to be predicted. Specifically, to handle the sparse and partial nature of cross-future behavior, CRAFT employs the Koopman Predictor Module to extract the key trend and the Internal Trend Mining Module to supplement the unknown area of the cross-future behavior matrix. Then, we introduce the External Trend Guide Module with a hierarchical structure to acquire more representative trends from higher levels. Finally, we apply the demand-constrained loss to calibrate the distribution deviation of prediction results. We conduct experiments on real-world datasets. Experiments on both an offline large-scale dataset and an online A/B test demonstrate the effectiveness of CRAFT. Our dataset and code are available at this https URL.

[LG-56] Certifiably Safe Manipulation of Deformable Linear Objects via Joint Shape and Tension Prediction ICRA2025

链接: https://arxiv.org/abs/2505.13889
作者: Yiting Zhang,Shichen Li
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted to ICRA 2025 Workshop on Learning Meets Model-Based Methods for Contact-Rich Manipulation

点击查看摘要

Abstract:Manipulating deformable linear objects (DLOs) is challenging due to their complex dynamics and the need for safe interaction in contact-rich environments. Most existing models focus on shape prediction alone and fail to account for contact and tension constraints, which can lead to damage to both the DLO and the robot. In this work, we propose a certifiably safe motion planning and control framework for DLO manipulation. At the core of our method is a predictive model that jointly estimates the DLO’s future shape and tension. These predictions are integrated into a real-time trajectory optimizer based on polynomial zonotopes, allowing us to enforce safety constraints throughout the execution. We evaluate our framework on a simulated wire harness assembly task using a 7-DOF robotic arm. Compared to state-of-the-art methods, our approach achieves a higher task success rate while avoiding all safety violations. The results demonstrate that our method enables robust and safe DLO manipulation in contact-rich environments.

[LG-57] TranSUN: A Preemptive Paradigm to Eradicate Retransformation Bias Intrinsically from Regression Models in Recommender Systems

链接: https://arxiv.org/abs/2505.13881
作者: Jiahao Yu,Haozhuang Liu,Yeqiu Yang,Lu Chen,Wu Jian,Yuning Jiang,Bo Zheng
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 22 pages, 6 figures

点击查看摘要

Abstract:Regression models are crucial in recommender systems. However, the retransformation bias problem has been conspicuously neglected within the community. While many works in other fields have devised effective bias correction methods, all of them are post-hoc cures external to the model, facing practical challenges when applied to real-world recommender systems. Hence, we propose a preemptive paradigm to eradicate the bias intrinsically from the models via minor model refinement. Specifically, a novel TranSUN method is proposed with a joint bias learning scheme to offer theoretically guaranteed unbiasedness with empirically superior convergence. It is further generalized into a novel generic regression model family, termed Generalized TranSUN (GTS), which not only offers more theoretical insights but also serves as a generic framework for flexibly developing various bias-free models. Comprehensive experimental results demonstrate the superiority of our methods across data from various domains. They have been successfully deployed in two real-world industrial recommendation scenarios, i.e., product and short video recommendation in the Guess What You Like business domain on the homepage of the Taobao App (a leading e-commerce platform), to serve the major online traffic. Code will be released after this paper is published.

[LG-58] Enforcing Hard Linear Constraints in Deep Learning Models with Decision Rules

链接: https://arxiv.org/abs/2505.13858
作者: Gonzalo E. Constante-Flores,Hao Chen,Can Li
类目: Machine Learning (cs.LG)
*备注: 1 figure

点击查看摘要

Abstract:Deep learning models are increasingly deployed in safety-critical tasks where predictions must satisfy hard constraints, such as physical laws, fairness requirements, or safety limits. However, standard architectures lack built-in mechanisms to enforce such constraints, and existing approaches based on regularization or projection are often limited to simple constraints, computationally expensive, or lack feasibility guarantees. This paper proposes a model-agnostic framework for enforcing input-dependent linear equality and inequality constraints on neural network outputs. The architecture combines a task network trained for prediction accuracy with a safe network trained using decision rules from the stochastic and robust optimization literature to ensure feasibility across the entire input space. The final prediction is a convex combination of the two subnetworks, guaranteeing constraint satisfaction during both training and inference without iterative procedures or runtime optimization. We prove that the architecture is a universal approximator of constrained functions and derive computationally tractable formulations based on linear decision rules. Empirical results on benchmark regression tasks show that our method consistently satisfies constraints while maintaining competitive accuracy and low inference latency.
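
The feasibility mechanism can be illustrated for inequality constraints A y <= b: given any strictly feasible anchor point (the role played by the safe network's output), the largest feasible convex combination with the task network's prediction has a closed form. This is a minimal sketch based on the abstract, not the paper's decision-rule construction.

```python
import numpy as np

def enforce(y_task, s_safe, A, b):
    """Return alpha*y_task + (1-alpha)*s_safe with the largest
    alpha in [0, 1] that satisfies A y <= b, assuming A s_safe < b."""
    Ay, As = A @ y_task, A @ s_safe
    viol = Ay > b                     # rows the raw prediction violates
    if not viol.any():
        return y_task                 # already feasible, keep it
    # Along the segment from s_safe to y_task, row i stays feasible
    # while alpha <= (b_i - (A s)_i) / ((A y)_i - (A s)_i).
    alpha = np.min((b[viol] - As[viol]) / (Ay[viol] - As[viol]))
    alpha = np.clip(alpha, 0.0, 1.0)
    return alpha * y_task + (1 - alpha) * s_safe

A = np.array([[1.0, 1.0], [-1.0, 0.0]])
b = np.array([1.0, 0.0])
y = enforce(np.array([2.0, 2.0]), np.array([0.2, 0.2]), A, b)
print(y, A @ y <= b + 1e-9)          # (0.5, 0.5), feasible output
```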

[LG-59] Rethink the Role of Deep Learning towards Large-scale Quantum Systems ICML2025

链接: https://arxiv.org/abs/2505.13852
作者: Yusheng Zhao,Chi Zhang,Yuxuan Du
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: ICML 2025

点击查看摘要

Abstract:Characterizing the ground state properties of quantum systems is fundamental to capturing their behavior but computationally challenging. Recent advances in AI have introduced novel approaches, with diverse machine learning (ML) and deep learning (DL) models proposed for this purpose. However, the necessity and specific role of DL models in these tasks remain unclear, as prior studies often employ varied or impractical quantum resources to construct datasets, resulting in unfair comparisons. To address this, we systematically benchmark DL models against traditional ML approaches across three families of Hamiltonian, scaling up to 127 qubits in three crucial ground-state learning tasks while enforcing equivalent quantum resource usage. Our results reveal that ML models often achieve performance comparable to or even exceeding that of DL approaches across all tasks. Furthermore, a randomization test demonstrates that measurement input features have minimal impact on DL models’ prediction performance. These findings challenge the necessity of current DL models in many quantum system learning scenarios and provide valuable insights into their effective utilization.

[LG-60] Fragments to Facts: Partial-Information Fragment Inference from LLMs

链接: https://arxiv.org/abs/2505.13819
作者: Lucas Rosenblatt,Bin Han,Robert Wolfe,Bill Howe
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) can leak sensitive training data through memorization and membership inference attacks. Prior work has primarily focused on strong adversarial assumptions, including attacker access to entire samples or long, ordered prefixes, leaving open the question of how vulnerable LLMs are when adversaries have only partial, unordered sample information. For example, if an attacker knows a patient has “hypertension,” under what conditions can they query a model fine-tuned on patient data to learn the patient also has “osteoarthritis?” In this paper, we introduce a more general threat model under this weaker assumption and show that fine-tuned LLMs are susceptible to these fragment-specific extraction attacks. To systematically investigate these attacks, we propose two data-blind methods: (1) a likelihood ratio attack inspired by methods from membership inference, and (2) a novel approach, PRISM, which regularizes the ratio by leveraging an external prior. Using examples from both medical and legal settings, we show that both methods are competitive with a data-aware baseline classifier that assumes access to labeled in-distribution data, underscoring their robustness.
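
A bare-bones version of the likelihood-ratio test reads as follows. The unigram "models" are toy stand-ins, and in a real attack `logp_finetuned` / `logp_reference` (hypothetical names) would score a prompt containing the known fragment under the fine-tuned LM and a reference LM.

```python
import math

def lr_score(candidate, logp_finetuned, logp_reference):
    """Higher score = the fine-tuned model is unusually confident in the
    candidate attribute relative to a generic prior, hinting that the
    (fragment, attribute) pair appeared in fine-tuning data."""
    return logp_finetuned(candidate) - logp_reference(candidate)

# Toy stand-ins: unigram "models" over a tiny vocabulary.
ft_probs  = {"hypertension": 0.3, "osteoarthritis": 0.25, "flu": 0.05}
ref_probs = {"hypertension": 0.1, "osteoarthritis": 0.05, "flu": 0.2}
logp_ft  = lambda w: math.log(ft_probs.get(w, 1e-6))
logp_ref = lambda w: math.log(ref_probs.get(w, 1e-6))

for cand in ["osteoarthritis", "flu"]:
    print(cand, round(lr_score(cand, logp_ft, logp_ref), 3))
# Thresholding these scores yields the membership-style decision;
# PRISM additionally regularizes the ratio with an external prior.
```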

[LG-61] Context-Free Synthetic Data Mitigates Forgetting

链接: https://arxiv.org/abs/2505.13811
作者: Parikshit Bansal,Sujay Sanghavi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fine-tuning a language model often results in a degradation of its existing performance on other tasks, due to a shift in the model parameters; this phenomenon is often referred to as (catastrophic) forgetting. We are interested in mitigating this, in settings where we only have access to the model weights but no access to its training data/recipe. A natural approach is to penalize the KL divergence between the original model and the new one. Our main realization is that a simple process - which we term context-free generation - allows for an approximate unbiased estimation of this KL divergence. We show that augmenting a fine-tuning dataset with context-free generations mitigates forgetting, in two settings: (a) preserving the zero-shot performance of pretrained-only models, and (b) preserving the reasoning performance of thinking models. We show that contextual synthetic data, and even a portion of the pretraining data, are less effective. We also investigate the effect of choices like generation temperature, data ratios etc. We present our results for OLMo-1B in the pretrained-only setting and R1-Distill-Llama-8B in the reasoning setting.
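
The recipe itself is short: sample from the original model with an empty context, then mix those samples into the fine-tuning set so the standard LM loss approximately penalizes KL(original || fine-tuned). Below is a minimal sketch using GPT-2 as a stand-in (the paper uses OLMo-1B and R1-Distill-Llama-8B).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Sample from the original model with no context beyond the BOS token.
bos = torch.tensor([[tok.bos_token_id]])
samples = model.generate(bos, do_sample=True, max_new_tokens=64,
                         num_return_sequences=4, top_p=0.95,
                         pad_token_id=tok.eos_token_id)
context_free_texts = [tok.decode(s, skip_special_tokens=True) for s in samples]

# Mixing these texts into the fine-tuning set (with the standard LM loss)
# approximately penalizes KL(original || fine-tuned) on the original
# model's own distribution, mitigating forgetting.
print(context_free_texts[0][:80])
```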

[LG-62] Scalable Autoregressive 3D Molecule Generation

链接: https://arxiv.org/abs/2505.13791
作者: Austin H. Cheng,Chong Sun,Alán Aspuru-Guzik
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Generative models of 3D molecular structure play a rapidly growing role in the design and simulation of molecules. Diffusion models currently dominate the space of 3D molecule generation, while autoregressive models have trailed behind. In this work, we present Quetzal, a simple but scalable autoregressive model that builds molecules atom-by-atom in 3D. Treating each molecule as an ordered sequence of atoms, Quetzal combines a causal transformer that predicts the next atom’s discrete type with a smaller Diffusion MLP that models the continuous next-position distribution. Compared to existing autoregressive baselines, Quetzal achieves substantial improvements in generation quality and is competitive with the performance of state-of-the-art diffusion models. In addition, by reducing the number of expensive forward passes through a dense transformer, Quetzal enables significantly faster generation speed, as well as exact divergence-based likelihood computation. Finally, without any architectural changes, Quetzal natively handles variable-size tasks like hydrogen decoration and scaffold completion. We hope that our work motivates a perspective on scalability and generality for generative modelling of 3D molecules.

[LG-63] Score-Based Training for Energy-Based TTS Models

链接: https://arxiv.org/abs/2505.13771
作者: Wanli Sun,Anton Ragni
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Noise contrastive estimation (NCE) is a popular method for training energy-based models (EBM) with intractable normalisation terms. The key idea of NCE is to learn by comparing unnormalised log-likelihoods of the reference and noisy samples, thus avoiding explicitly computing normalisation terms. However, NCE critically relies on the quality of noisy samples. Recently, sliced score matching (SSM) has been popularised by closely related diffusion models (DM). Unlike NCE, SSM learns a gradient of log-likelihood, or score, by learning the distribution of its projections on randomly chosen directions. However, both NCE and SSM disregard the form of the log-likelihood function, which is problematic given that EBMs and DMs make use of first-order optimisation during inference. This paper proposes a new criterion that learns scores more suitable for first-order schemes. Experiments contrast these approaches for training EBMs.

[LG-64] Augmenting Online RL with Offline Data is All You Need: A Unified Hybrid RL Algorithm Design and Analysis UAI2025

链接: https://arxiv.org/abs/2505.13768
作者: Ruiquan Huang,Donghao Li,Chengshuai Shi,Cong Shen,Jing Yang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by UAI2025

点击查看摘要

Abstract:This paper investigates a hybrid learning framework for reinforcement learning (RL) in which the agent can leverage both an offline dataset and online interactions to learn the optimal policy. We present a unified algorithm and analysis and show that augmenting confidence-based online RL algorithms with the offline dataset outperforms any pure online or offline algorithm alone and achieves state-of-the-art results under two learning metrics, i.e., sub-optimality gap and online learning regret. Specifically, we show that our algorithm achieves a sub-optimality gap \tilde{O}(\sqrt{1/(N_0/\mathtt{C}(\pi^*|\rho)+N_1)}), where \mathtt{C}(\pi^*|\rho) is a new concentrability coefficient, and N_0 and N_1 are the numbers of offline and online samples, respectively. For regret minimization, we show that it achieves a constant \tilde{O}(\sqrt{N_1/(N_0/\mathtt{C}(\pi^-|\rho)+N_1)}) speed-up compared to pure online learning, where \mathtt{C}(\pi^-|\rho) is the concentrability coefficient over all sub-optimal policies. Our results also reveal an interesting separation on the desired coverage properties of the offline dataset for sub-optimality gap minimization and regret minimization. We further validate our theoretical findings in several experiments in special RL models such as linear contextual bandits and Markov decision processes (MDPs).

[LG-65] WIND: Accelerated RNN-T Decoding with Windowed Inference for Non-blank Detection

链接: https://arxiv.org/abs/2505.13765
作者: Hainan Xu,Vladimir Bataev,Lilit Grigoryan,Boris Ginsburg
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose Windowed Inference for Non-blank Detection (WIND), a novel strategy that significantly accelerates RNN-T inference without compromising model accuracy. During model inference, instead of processing frames sequentially, WIND processes multiple frames simultaneously within a window in parallel, allowing the model to quickly locate non-blank predictions during decoding, resulting in significant speed-ups. We implement WIND for greedy decoding, batched greedy decoding with label-looping techniques, and also propose a novel beam-search decoding method. Experiments on multiple datasets with different conditions show that our method, when operating in greedy modes, speeds up as much as 2.4X compared to the baseline sequential approach while maintaining identical Word Error Rate (WER) performance. Our beam-search algorithm achieves slightly better accuracy than alternative methods, with significantly improved speed. We will open-source our WIND implementation.
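
The core decoding trick can be sketched independently of the RNN-T machinery: instead of checking frames one by one, evaluate a window of frames in one vectorized batch and jump straight to the first non-blank hit. Here `joint_logits` is a hypothetical stand-in for the joint network's per-frame outputs given the current decoder state.

```python
import numpy as np

def first_nonblank(joint_logits, blank_id, window=8):
    """Scan frames window-by-window; within a window, all frames are
    evaluated in one vectorized batch, and decoding jumps to the first
    frame whose argmax is a non-blank token."""
    T = joint_logits.shape[0]
    for start in range(0, T, window):
        block = joint_logits[start:start + window]   # parallel evaluation
        preds = block.argmax(axis=-1)
        hits = np.nonzero(preds != blank_id)[0]
        if hits.size:
            t = start + int(hits[0])
            return t, int(preds[hits[0]])            # (frame, token)
    return None, blank_id                            # everything was blank

rng = np.random.default_rng(1)
logits = rng.standard_normal((32, 5))
logits[:, 4] += 6.0           # blank-dominated frames
logits[19, 2] += 12.0         # a single non-blank emission at frame 19
print(first_nonblank(logits, blank_id=4))   # expected: (19, 2)
```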

[LG-66] Consistency Conditions for Differentiable Surrogate Losses

链接: https://arxiv.org/abs/2505.13760
作者: Drona Khurana,Anish Thilagar,Dhamma Kimpara,Rafael Frongillo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The statistical consistency of surrogate losses for discrete prediction tasks is often checked via the condition of calibration. However, directly verifying calibration can be arduous. Recent work shows that for polyhedral surrogates, a less arduous condition, indirect elicitation (IE), is still equivalent to calibration. We give the first results of this type for non-polyhedral surrogates, specifically the class of convex differentiable losses. We first prove that under mild conditions, IE and calibration are equivalent for one-dimensional losses in this class. We construct a counter-example that shows that this equivalence fails in higher dimensions. This motivates the introduction of strong IE, a strengthened form of IE that is equally easy to verify. We establish that strong IE implies calibration for differentiable surrogates and is both necessary and sufficient for strongly convex, differentiable surrogates. Finally, we apply these results to a range of problems to demonstrate the power of IE and strong IE for designing and analyzing consistent differentiable surrogates.

[LG-67] Panda: A pretrained forecast model for universal representation of chaotic dynamics

链接: https://arxiv.org/abs/2505.13755
作者: Jeffrey Lai,Anthony Bao,William Gilpin
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Chaotic systems are intrinsically sensitive to small errors, challenging efforts to construct predictive data-driven models of real-world dynamical systems such as fluid flows or neuronal activity. Prior efforts comprise either specialized models trained separately on individual time series, or foundation models trained on vast time series databases with little underlying dynamical structure. Motivated by dynamical systems theory, we present Panda, Patched Attention for Nonlinear DynAmics. We train Panda on a novel synthetic, extensible dataset of 2 \times 10^4 chaotic dynamical systems that we discover using an evolutionary algorithm. Trained purely on simulated data, Panda exhibits emergent properties: zero-shot forecasting of unseen real world chaotic systems, and nonlinear resonance patterns in cross-channel attention heads. Despite having been trained only on low-dimensional ordinary differential equations, Panda spontaneously develops the ability to predict partial differential equations without retraining. We demonstrate a neural scaling law for differential equations, underscoring the potential of pretrained models for probing abstract mathematical domains like nonlinear dynamics.

[LG-68] Finding Maximum Independent Sets in Dynamic Graphs using Unsupervised Learning

链接: https://arxiv.org/abs/2505.13754
作者: Devendra Parkar,Anya Chaturvedi,Andréa W. Richa,Joshua J. Daymude
类目: Machine Learning (cs.LG)
*备注: 11 pages, 3 tables

点击查看摘要

Abstract:We present the first unsupervised learning model for finding Maximum Independent Sets (MaxIS) in dynamic graphs where edges change over time. Our method combines structural learning from graph neural networks (GNNs) with a learned distributed update mechanism that, given an edge addition or deletion event, modifies nodes’ internal memories and infers their MaxIS membership in a single, parallel step. We parameterize our model by the update mechanism’s radius and investigate the resulting performance-runtime tradeoffs for various dynamic graph topologies. We evaluate our model against state-of-the-art MaxIS methods for static graphs, including a mixed integer programming solver, deterministic rule-based algorithms, and a heuristic learning framework based on dynamic programming and GNNs. Across synthetic and real-world dynamic graphs of 100-10,000 nodes, our model achieves competitive approximation ratios with excellent scalability; on large graphs, it significantly outperforms the state-of-the-art heuristic learning framework in solution quality, runtime, and memory usage. Our model generalizes well on graphs 100x larger than the ones used for training, achieving performance at par with both a greedy technique and a commercial mixed integer programming solver while running 1.5-23x faster than greedy.

[LG-69] Synthetic Non-stationary Data Streams for Recognition of the Unknown

链接: https://arxiv.org/abs/2505.13745
作者: Joanna Komorniczak
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The problem of data non-stationarity is commonly addressed in data stream processing. In a dynamic environment, methods should continuously be ready to analyze time-varying data – hence, they should enable incremental training and respond to concept drifts. An equally important variability typical for non-stationary data stream environments is the emergence of new, previously unknown classes. Often, methods focus on one of these two phenomena – detection of concept drifts or detection of novel classes – while both difficulties can be observed in data streams. Additionally, concerning previously unknown observations, the topic of open set of classes has become particularly important in recent years, where the goal of methods is to efficiently classify within known classes and recognize objects outside the model competence. This article presents a strategy for synthetic data stream generation in which both concept drifts and the emergence of new classes representing unknown objects occur. The presented research shows how unsupervised drift detectors address the task of detecting novelty and concept drifts and demonstrates how the generated data streams can be utilized in the open set recognition task.

[LG-70] Turbocharging Gaussian Process Inference with Approximate Sketch-and-Project

链接: https://arxiv.org/abs/2505.13723
作者: Pratik Rathore,Zachary Frangella,Sachin Garg,Shaghayegh Fazliani,Michał Dereziński,Madeleine Udell
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 28 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Gaussian processes (GPs) play an essential role in biostatistics, scientific machine learning, and Bayesian optimization for their ability to provide probabilistic predictions and model uncertainty. However, GP inference struggles to scale to large datasets (which are common in modern applications), since it requires the solution of a linear system whose size scales quadratically with the number of samples in the dataset. We propose an approximate, distributed, accelerated sketch-and-project algorithm (\texttt{ADASAP}) for solving these linear systems, which improves scalability. We use the theory of determinantal point processes to show that the posterior mean induced by sketch-and-project rapidly converges to the true posterior mean. In particular, this yields the first efficient, condition number-free algorithm for estimating the posterior mean along the top spectral basis functions, showing that our approach is principled for GP inference. \texttt{ADASAP} outperforms state-of-the-art solvers based on conjugate gradient and coordinate descent across several benchmark datasets and a large-scale Bayesian optimization task. Moreover, \texttt{ADASAP} scales to a dataset with 3 \cdot 10^8 samples, a feat which has not been accomplished in the literature.
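
Stripped of the approximation, distribution, and acceleration layers, the underlying sketch-and-project iteration for (K + \lambda I) x = b is a randomized block solve, as in this minimal sketch (a hypothetical plain version, not the ADASAP implementation).

```python
import numpy as np

def sketch_and_project(K, b, lam=1e-2, block=64, iters=200, seed=0):
    """Randomized block solve of (K + lam*I) x = b: repeatedly sample a
    coordinate block S and project x onto the solutions of those rows."""
    rng = np.random.default_rng(seed)
    n = len(b)
    A = K + lam * np.eye(n)
    x = np.zeros(n)
    for _ in range(iters):
        S = rng.choice(n, size=block, replace=False)
        r = b[S] - A[S] @ x                    # residual on the sketch
        delta = np.linalg.solve(A[np.ix_(S, S)], r)
        x[S] += delta                          # block coordinate update
    return x

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
K = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # RBF kernel
b = rng.standard_normal(500)
x = sketch_and_project(K, b)
print("residual norm:", np.linalg.norm((K + 1e-2 * np.eye(500)) @ x - b))
```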

[LG-71] Robust learning of halfspaces under log-concave marginals

链接: https://arxiv.org/abs/2505.13708
作者: Jane Lange,Arsen Vasilyan
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We say that a classifier is \emph{adversarially robust} to perturbations of norm r if, with high probability over a point x drawn from the input distribution, there is no point within distance \le r from x that is classified differently. The \emph{boundary volume} is the probability that a point falls within distance r of a point with a different label. This work studies the task of computationally efficient learning of hypotheses with small boundary volume, where the input is distributed as a subgaussian isotropic log-concave distribution over \mathbb{R}^d . Linear threshold functions are adversarially robust; they have boundary volume proportional to r . Such concept classes are efficiently learnable by polynomial regression, which produces a polynomial threshold function (PTF), but PTFs in general may have boundary volume \Omega(1) , even for r \ll 1 . We give an algorithm that agnostically learns linear threshold functions and returns a classifier with boundary volume O(r+\varepsilon) at radius of perturbation r . The time and sample complexity of d^{\tilde{O}(1/\varepsilon^2)} matches the complexity of polynomial regression. Our algorithm augments the classic approach of polynomial regression with three additional steps: a) performing the \ell_1 -error regression under noise sensitivity constraints, b) a structured partitioning and rounding step that returns a Boolean classifier with error \textsf{opt} + O(\varepsilon) and noise sensitivity O(r+\varepsilon) simultaneously, and c) a local corrector that ``smooths’’ a function with low noise sensitivity into a function that is adversarially robust.

[LG-72] Unsupervised anomaly detection in MeV ultrafast electron diffraction

链接: https://arxiv.org/abs/2505.13702
作者: Mariana A. Fazio,Salvador Sosa Güitron,Marcus Babzien,Mikhail Fedurin,Junjie Li,Mark Palmer,Sandra S. Biedron,Manel Martinez-Ramon
类目: Machine Learning (cs.LG); Instrumentation and Detectors (physics.ins-det)
*备注:

点击查看摘要

Abstract:This study focuses on the construction of an unsupervised anomaly detection methodology to detect faulty images in MUED (MeV ultrafast electron diffraction). We believe that unsupervised techniques are the best choice for our purposes because the data used to train the detector does not need to be manually labeled; instead, the machine is intended to detect the anomalies in the dataset by itself, which frees the user from tedious, time-consuming initial image examination. The structure must, additionally, provide the user with some measure of uncertainty in the detection, so the user can make decisions based on this measure.

[LG-73] HarmonE: A Self-Adaptive Approach to Architecting Sustainable MLOps

链接: https://arxiv.org/abs/2505.13693
作者: Hiya Bhatt,Shaunak Biswas,Srinivasan Rakhunathan,Karthik Vaidhyanathan
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: This paper has been accepted to ECSA 2025

点击查看摘要

Abstract:Machine Learning Enabled Systems (MLS) are becoming integral to real-world applications, but ensuring their sustainable performance over time remains a significant challenge. These systems operate in dynamic environments and face runtime uncertainties like data drift and model degradation, which affect the sustainability of MLS across multiple dimensions: technical, economical, environmental, and social. While Machine Learning Operations (MLOps) addresses the technical dimension by streamlining the ML model lifecycle, it overlooks other dimensions. Furthermore, some traditional practices, such as frequent retraining, incur substantial energy and computational overhead, thus amplifying sustainability concerns. To address them, we introduce HarmonE, an architectural approach that enables self-adaptive capabilities in MLOps pipelines using the MAPE-K loop. HarmonE allows system architects to define explicit sustainability goals and adaptation thresholds at design time, and performs runtime monitoring of key metrics, such as prediction accuracy, energy consumption, and data distribution shifts, to trigger appropriate adaptation strategies. We validate our approach using a Digital Twin (DT) of an Intelligent Transportation System (ITS), focusing on traffic flow prediction as our primary use case. The DT employs time series ML models to simulate real-time traffic and assess various flow scenarios. Our results show that HarmonE adapts effectively to evolving conditions while maintaining accuracy and meeting sustainability goals.

[LG-74] Optimal Client Sampling in Federated Learning with Client-Level Heterogeneous Differential Privacy

链接: https://arxiv.org/abs/2505.13655
作者: Jiahao Xu,Rui Hu,Olivera Kotevska
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning with client-level differential privacy (DP) provides a promising framework for collaboratively training models while rigorously protecting clients’ privacy. However, classic approaches like DP-FedAvg struggle when clients have heterogeneous privacy requirements, as they must uniformly enforce the strictest privacy level across clients, leading to excessive DP noise and significant model utility degradation. Existing methods to improve the model utility in such heterogeneous privacy settings often assume a trusted server and are largely heuristic, resulting in suboptimal performance and lacking strong theoretical underpinnings. In this work, we address these challenges under a practical attack model where both clients and the server are honest-but-curious. We propose GDPFed, which partitions clients into groups based on their privacy budgets and achieves client-level DP within each group to reduce the privacy budget waste and hence improve the model utility. Based on the privacy and convergence analysis of GDPFed, we find that the magnitude of DP noise depends on both model dimensionality and the per-group client sampling ratios. To further improve the performance of GDPFed, we introduce GDPFed^+, which integrates model sparsification to eliminate unnecessary noise and optimizes per-group client sampling ratios to minimize convergence error. Extensive empirical evaluations on multiple benchmark datasets demonstrate the effectiveness of GDPFed^+, showing substantial performance gains compared with state-of-the-art methods.
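
A minimal sketch of group-wise aggregation: clients are grouped by privacy budget, and each group gets its own noise scale rather than everyone inheriting the strictest one. The noise calibration below (sigma inversely proportional to epsilon) is a deliberately crude assumption; the paper's privacy accounting and per-group sampling-ratio optimization are more involved.

```python
import numpy as np

def group_dp_aggregate(updates, budgets, clip=1.0, base_sigma=0.5, seed=0):
    """updates: dict group -> list of client update vectors.
    budgets: dict group -> epsilon; a larger epsilon means a looser
    privacy requirement and hence less noise for that group."""
    rng = np.random.default_rng(seed)
    agg, total = None, 0
    for g, ups in updates.items():
        sigma = base_sigma / budgets[g]   # crude budget-to-noise mapping
        for u in ups:
            u = u * min(1.0, clip / np.linalg.norm(u))  # per-client clipping
            u = u + rng.normal(0.0, sigma * clip, size=u.shape)
            agg = u if agg is None else agg + u
            total += 1
    return agg / total

updates = {"strict": [np.ones(4), 2 * np.ones(4)], "lax": [3 * np.ones(4)]}
print(group_dp_aggregate(updates, budgets={"strict": 0.5, "lax": 4.0}))
```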

[LG-75] raceable Black-box Watermarks for Federated Learning

链接: https://arxiv.org/abs/2505.13651
作者: Jiahao Xu,Rui Hu,Olivera Kotevska,Zikai Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Due to the distributed nature of Federated Learning (FL) systems, each local client has access to the global model, posing a critical risk of model leakage. Existing works have explored injecting watermarks into local models to enable intellectual property protection. However, these methods either focus on non-traceable watermarks or traceable but white-box watermarks. We identify a gap in the literature regarding the formal definition of traceable black-box watermarking and the formulation of the problem of injecting such watermarks into FL systems. In this work, we first formalize the problem of injecting traceable black-box watermarks into FL. Based on the problem, we propose a novel server-side watermarking method, \mathbf{TraMark}, which creates a traceable watermarked model for each client, enabling verification of model leakage in black-box settings. To achieve this, \mathbf{TraMark} partitions the model parameter space into two distinct regions: the main task region and the watermarking region. Subsequently, a personalized global model is constructed for each client by aggregating only the main task region while preserving the watermarking region. Each model then learns a unique watermark exclusively within the watermarking region using a distinct watermark dataset before being sent back to the local client. Extensive results across various FL systems demonstrate that \mathbf{TraMark} ensures the traceability of all watermarked models while preserving their main task performance.
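
The partitioning step is easy to mimic with a boolean mask over parameters: only the main-task region is averaged across clients, while each client keeps its own watermark region. This is a hypothetical sketch; the real method additionally trains each watermark region on a distinct watermark dataset.

```python
import numpy as np

def personalized_aggregate(client_params, wm_mask):
    """client_params: list of parameter vectors; wm_mask: bool array
    where True marks watermark coordinates excluded from averaging."""
    stacked = np.stack(client_params)
    main_avg = stacked.mean(axis=0)       # aggregate main-task region only
    out = []
    for p in client_params:
        model = main_avg.copy()
        model[wm_mask] = p[wm_mask]       # keep this client's watermark
        out.append(model)
    return out

rng = np.random.default_rng(0)
params = [rng.standard_normal(10) for _ in range(3)]
mask = np.zeros(10, dtype=bool)
mask[-2:] = True                          # last 2 coords = watermark region
for m in personalized_aggregate(params, mask):
    print(np.round(m, 2))                 # shared prefix, unique suffix
```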

[LG-76] Collapsing Taylor Mode Automatic Differentiation

链接: https://arxiv.org/abs/2505.13644
作者: Felix Dangel,Tim Siebert,Marius Zeinhofer,Andrea Walther
类目: Machine Learning (cs.LG)
*备注: 10 pages + appendix

点击查看摘要

Abstract:Computing partial differential equation (PDE) operators via nested backpropagation is expensive, yet popular, and severely restricts their utility for scientific machine learning. Recent advances, like the forward Laplacian and randomizing Taylor mode automatic differentiation (AD), propose forward schemes to address this. We introduce an optimization technique for Taylor mode that ‘collapses’ derivatives by rewriting the computational graph, and demonstrate how to apply it to general linear PDE operators, and randomized Taylor mode. The modifications simply require propagating a sum up the computational graph, which could – or should – be done by a machine learning compiler, without exposing complexity to users. We implement our collapsing procedure and evaluate it on popular PDE operators, confirming it accelerates Taylor mode and outperforms nested backpropagation.
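
The flavor of "propagating a sum up the computational graph" shows up in a forward scheme that carries (value, gradient, Laplacian) triples through each operation, so the Laplacian emerges in a single forward pass. This is a hypothetical scalar-rule sketch in the spirit of the forward Laplacian, not the paper's compiler-level graph rewrite.

```python
import numpy as np

class FwdLap:
    """Carries (value, gradient, Laplacian) through a computation."""
    def __init__(self, val, grad, lap):
        self.val, self.grad, self.lap = val, grad, lap
    def __add__(self, o):
        return FwdLap(self.val + o.val, self.grad + o.grad, self.lap + o.lap)
    def __mul__(self, o):  # product rule for value, gradient, Laplacian
        return FwdLap(self.val * o.val,
                      self.val * o.grad + o.val * self.grad,
                      self.val * o.lap + o.val * self.lap
                      + 2.0 * self.grad @ o.grad)
    def sin(self):  # lap sin(u) = cos(u) * lap(u) - sin(u) * |grad u|^2
        return FwdLap(np.sin(self.val),
                      np.cos(self.val) * self.grad,
                      np.cos(self.val) * self.lap
                      - np.sin(self.val) * self.grad @ self.grad)

def seed(x):
    """Wrap input coordinates as FwdLap leaves with unit gradients."""
    n = len(x)
    return [FwdLap(x[i], np.eye(n)[i], 0.0) for i in range(n)]

x = np.array([0.3, -1.2])
u, v = seed(x)
f = (u * v).sin() + u * u          # f = sin(x0*x1) + x0^2
print(f.val, f.grad, f.lap)        # Laplacian from one forward pass
```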

[LG-77] 4Hammer: a board-game reinforcement learning environment for the hour long time frame

链接: https://arxiv.org/abs/2505.13638
作者: Massimo Fioravanti,Giovanni Agosta
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong performance on tasks with short time frames, but struggle with tasks requiring longer durations. While datasets covering extended-duration tasks, such as software engineering tasks or video games, do exist, there are currently few implementations of complex board games specifically designed for reinforcement learning and LLM evaluation. To address this gap, we propose the 4Hammer reinforcement learning environment, a digital twin simulation of a subset of Warhammer 40,000, a complex, zero-sum board game. Warhammer 40,000 features intricate rules, requiring human players to thoroughly read and understand over 50 pages of detailed natural language rules, grasp the interactions between their game pieces and those of their opponents, and independently track and communicate the evolving game state.

[LG-78] Incentivizing Truthful Language Models via Peer Elicitation Games

链接: https://arxiv.org/abs/2505.13636
作者: Baiting Chen,Tong Zhu,Jiale Han,Lexin Li,Gang Li,Xiaowu Dai
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong generative capabilities but remain prone to inconsistencies and hallucinations. We introduce Peer Elicitation Games (PEG), a training-free, game-theoretic framework for aligning LLMs through a peer elicitation mechanism involving a generator and multiple discriminators instantiated from distinct base models. Discriminators interact in a peer evaluation setting, where rewards are computed using a determinant-based mutual information score that provably incentivizes truthful reporting without requiring ground-truth labels. We establish theoretical guarantees showing that each agent, via online learning, achieves sublinear regret in the sense their cumulative performance approaches that of the best fixed truthful strategy in hindsight. Moreover, we prove last-iterate convergence to a truthful Nash equilibrium, ensuring that the actual policies used by agents converge to stable and truthful behavior over time. Empirical evaluations across multiple benchmarks demonstrate significant improvements in factual accuracy. These results position PEG as a practical approach for eliciting truthful behavior from LLMs without supervision or fine-tuning.

[LG-79] Deterministic Bounds and Random Estimates of Metric Tensors on Neuromanifolds

链接: https://arxiv.org/abs/2505.13614
作者: Ke Sun
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The high dimensional parameter space of modern deep neural networks – the neuromanifold – is endowed with a unique metric tensor defined by the Fisher information, estimating which is crucial for both theory and practical methods in deep learning. To analyze this tensor for classification networks, we return to a low dimensional space of probability distributions – the core space – and carefully analyze the spectrum of its Riemannian metric. We extend our discoveries there into deterministic bounds of the metric tensor on the neuromanifold. We introduce an unbiased random estimate of the metric tensor and its bounds based on Hutchinson’s trace estimator. It can be evaluated efficiently through a single backward pass and can be used to estimate the diagonal, or block diagonal, or the full tensor. Its quality is guaranteed with a standard deviation bounded by the true value up to scaling.
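
The random estimate rests on Hutchinson's trace estimator, which needs only matrix-vector products (realized in the paper by a single backward pass per probe). A minimal standalone sketch on an explicit matrix:

```python
import numpy as np

def hutchinson_trace(matvec, dim, n_probes=100, rng=None):
    """Unbiased trace estimate: E[v^T A v] = tr(A) for Rademacher v."""
    rng = rng or np.random.default_rng(0)
    est = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=dim)   # Rademacher probe
        est += v @ matvec(v)
    return est / n_probes

A = np.diag(np.arange(1.0, 11.0))               # trace = 55
print(hutchinson_trace(lambda v: A @ v, 10))    # approximately 55
```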

[LG-80] Half Search Space is All You Need

链接: https://arxiv.org/abs/2505.13586
作者: Pavel Rumiantsev,Mark Coates
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Neural Architecture Search (NAS) is a powerful tool for automating architecture design. One-Shot NAS techniques, such as DARTS, have gained substantial popularity due to their combination of search efficiency with simplicity of implementation. By design, One-Shot methods have high GPU memory requirements during the search. To mitigate this issue, we propose to prune the search space in an efficient automatic manner to reduce memory consumption and search time while preserving the search accuracy. Specifically, we utilise Zero-Shot NAS to efficiently remove low-performing architectures from the search space before applying One-Shot NAS to the pruned search space. Experimental results on the DARTS search space show that our approach reduces memory consumption by 81% compared to the baseline One-Shot setup while achieving the same level of accuracy.
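
The pruning step itself is short once a zero-cost proxy is chosen: score every candidate operation, keep the top half, and hand the reduced space to the One-Shot phase. The proxy values below are random stand-ins (hypothetical); a real run would use an established Zero-Shot NAS score.

```python
import numpy as np

def prune_search_space(candidates, proxy_score, keep_ratio=0.5):
    """Rank candidates by a zero-cost proxy and keep the top fraction;
    the surviving space is then searched with One-Shot NAS (e.g. DARTS)."""
    ranked = sorted(candidates, key=proxy_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

ops = ["sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "max_pool_3x3",
       "avg_pool_3x3", "skip_connect", "none"]
rng = np.random.default_rng(0)
fake_scores = {o: rng.random() for o in ops}   # stand-in proxy values
print(prune_search_space(ops, lambda o: fake_scores[o]))
```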

[LG-81] Uncovering Critical Sets of Deep Neural Networks via Sample-Independent Critical Lifting

链接: https://arxiv.org/abs/2505.13582
作者: Leyang Zhang,Yaoyu Zhang,Tao Luo
类目: Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:This paper investigates the sample dependence of critical points for neural networks. We introduce a sample-independent critical lifting operator that associates a parameter of one network with a set of parameters of another, thus defining sample-dependent and sample-independent lifted critical points. We then show by example that previously studied critical embeddings do not capture all sample-independent lifted critical points. Finally, we demonstrate the existence of sample-dependent lifted critical points for sufficiently large sample sizes and prove that saddles appear among them.

[LG-82] Symmetry-Breaking Descent for Invariant Cost Functionals

链接: https://arxiv.org/abs/2505.13578
作者: Mikhail Osipov
类目: Machine Learning (cs.LG)
*备注: 19 pages, 7 appendices

点击查看摘要

Abstract:We study the problem of reducing a task cost functional W(S) , defined over Sobolev-class signals S , when the cost is invariant under a global symmetry group G \subset \mathrm{Diff}(M) and accessible only as a black-box. Such scenarios arise in machine learning, imaging, and inverse problems, where cost metrics reflect model outputs or performance scores but are non-differentiable and model-internal. We propose a variational method that exploits the symmetry structure to construct explicit, symmetry-breaking deformations of the input signal. A gauge field \phi , obtained by minimizing an auxiliary energy functional, induces a deformation h = A_\phi[S] that generically lies transverse to the G-orbit of S . We prove that, under mild regularity, the cost W strictly decreases along this direction – either via Clarke subdifferential descent or by escaping locally flat plateaus. The exceptional set of degeneracies has zero Gaussian measure. Our approach requires no access to model gradients or labels and operates entirely at test time. It provides a principled tool for optimizing invariant cost functionals via Lie-algebraic variational flows, with applications to black-box models and symmetry-constrained tasks.

[LG-83] FlexFed: Mitigating Catastrophic Forgetting in Heterogeneous Federated Learning in Pervasive Computing Environments

链接: https://arxiv.org/abs/2505.13576
作者: Sara Alosaime(University of Warwick),Arshad Jhumka(University of Leeds)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training while preserving privacy by allowing clients to share model updates instead of raw data. Pervasive computing environments (e.g., for Human Activity Recognition, HAR), which we focus on in this paper, are characterized by resource-constrained end devices, streaming sensor data and intermittent client participation. Variations in user behavior, common in HAR environments, often result in non-stationary data distributions. As such, existing FL approaches face challenges in HAR settings due to differing assumptions. The combined effects of HAR characteristics, namely heterogeneous data and intermittent participation, can lead to a severe issue called catastrophic forgetting (CF). Unlike Continuous Learning (CL), which addresses CF using memory and replay mechanisms, FL’s privacy constraints prohibit such strategies. To tackle CF in HAR environments, we propose FlexFed, a novel FL approach that prioritizes data retention for efficient memory use and dynamically adjusts offline training frequency based on distribution shifts, client capability and offline duration. To better quantify CF in FL, we introduce a new metric that accounts for under-represented data, enabling more accurate evaluations. We also develop a realistic HAR-based evaluation framework that simulates streaming data, dynamic distributions, imbalances and varying availability. Experiments show that FlexFed mitigates CF more effectively, improves FL efficiency by 10 to 15% and achieves faster, more stable convergence, especially for infrequent or under-represented data.

[LG-84] An Overview of Arithmetic Adaptations for Inference of Convolutional Neural Networks on Re-configurable Hardware

链接: https://arxiv.org/abs/2505.13575
作者: Ilkay Wunderlich,Benjamin Koch,Sven Schönfeld
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) have gained high popularity as a tool for computer vision tasks and for that reason are used in various applications. There are many different concepts, like single shot detectors, that have been published for detecting objects in images or video streams. However, CNNs suffer from disadvantages regarding the deployment on embedded platforms such as re-configurable hardware like Field Programmable Gate Arrays (FPGAs). Due to the high computational intensity, memory requirements and arithmetic conditions, a variety of strategies for running CNNs on FPGAs have been developed. The following methods showcase our best practice approaches for a TinyYOLOv3 detector network on a XILINX Artix-7 FPGA using techniques like fusion of batch normalization, filter pruning and post training network quantization.

[LG-85] Surrogate Modeling of 3D Rayleigh-Benard Convection with Equivariant Autoencoders

链接: https://arxiv.org/abs/2505.13569
作者: Fynn Fromme,Christine Allen-Blanchette,Hans Harder,Sebastian Peitz
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:The use of machine learning for modeling, understanding, and controlling large-scale physics systems is quickly gaining in popularity, with examples ranging from electromagnetism over nuclear fusion reactors and magneto-hydrodynamics to fluid mechanics and climate modeling. These systems – governed by partial differential equations – present unique challenges regarding the large number of degrees of freedom and the complex dynamics over many scales both in space and time, and additional measures to improve accuracy and sample efficiency are highly desirable. We present an end-to-end equivariant surrogate model consisting of an equivariant convolutional autoencoder and an equivariant convolutional LSTM using G-steerable kernels. As a case study, we consider the three-dimensional Rayleigh-Bénard convection, which describes the buoyancy-driven fluid flow between a heated bottom and a cooled top plate. While the system is E(2)-equivariant in the horizontal plane, the boundary conditions break the translational equivariance in the vertical direction. Our architecture leverages vertically stacked layers of D_4-steerable kernels, with additional partial kernel sharing in the vertical direction for further efficiency improvement. Our results demonstrate significant gains both in sample and parameter efficiency, as well as a better scaling to more complex dynamics, that is, larger Rayleigh numbers. The accompanying code is available under this https URL.

[LG-86] Online Decision-Focused Learning

链接: https://arxiv.org/abs/2505.13564
作者: Aymeric Capitaine,Maxime Haddouche,Eric Moulines,Michael I. Jordan,Etienne Boursier,Alain Durmus
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Decision-focused learning (DFL) is an increasingly popular paradigm for training predictive models whose outputs are used in decision-making tasks. Instead of merely optimizing for predictive accuracy, DFL trains models to directly minimize the loss associated with downstream decisions. This end-to-end strategy holds promise for tackling complex combinatorial problems; however, existing studies focus solely on scenarios where a fixed batch of data is available and the objective function does not change over time. We instead investigate DFL in dynamic environments where the objective function and data distribution evolve over time. This setting is challenging because the objective function has zero or undefined gradients – which prevents the use of standard first-order optimization methods – and is generally non-convex. To address these difficulties, we (i) regularize the objective to make it differentiable and (ii) make use of the optimism principle, based on a near-optimal oracle along with an appropriate perturbation. This leads to a practical online algorithm for which we establish bounds on the expected dynamic regret, both when the decision space is a simplex and when it is a general bounded convex polytope. Finally, we demonstrate the effectiveness of our algorithm by comparing its performance with a classic prediction-focused approach on a simple knapsack experiment.

[LG-87] Learning Collision Risk from Naturalistic Driving with Generalised Surrogate Safety Measures

链接: https://arxiv.org/abs/2505.13556
作者: Yiru Jiao,Simeon C. Calvert,Sander van Cranenburgh,Hans van Lint
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 18 pages, 8 figures

点击查看摘要

Abstract:Accurate and timely alerts for drivers or automated systems to unfolding collisions remains a challenge in road safety, particularly in highly interactive urban traffic. Existing approaches require labour-intensive annotation of sparse risk, struggle to consider varying interaction context, or are useful only in the scenarios they are designed for. To address these limits, this study introduces the generalised surrogate safety measure (GSSM), a new approach that learns exclusively from naturalistic driving without crash or risk labels. GSSM captures the patterns of normal driving and estimates the extent to which a traffic interaction deviates from the norm towards unsafe extreme. Utilising neural networks, normal interactions are characterised by context-conditioned distributions of multi-directional spacing between road users. In the same interaction context, a spacing closer than normal entails higher risk of potential collision. Then a context-adaptive risk score and its associated probability can be calculated based on the theory of extreme values. Any measurable factors, such as motion kinematics, weather, lighting, can serve as part of the context, allowing for diverse coverage of safety-critical interactions. Multiple public driving datasets are used to train GSSMs, which are tested with 4,875 real-world crashes and near-crashes reconstructed from the SHRP2 NDS. A vanilla GSSM using only instantaneous states achieves AUPRC of 0.9 and secures a median time advance of 2.6 seconds to prevent potential collisions. Additional data and contextual factors provide further performance gains. Across various interaction types such as rear-end, merging, and crossing, the accuracy and timeliness of GSSM consistently outperforms existing baselines. GSSM therefore establishes a scalable, context-aware, and generalisable foundation to proactively quantify collision risk in traffic interactions.
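
The "closer than normal" probability can be illustrated with a one-dimensional stand-in: fit a context-conditioned distribution of spacing on normal driving, then score an observed interaction by how extreme its spacing is for that context. Everything below (the lognormal form, the linear log-mean) is an assumption for illustration; GSSM learns multi-directional spacing distributions with neural networks and applies extreme value theory.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
speed = rng.uniform(5, 25, 2000)                           # context variable
spacing = np.exp(0.12 * speed + rng.normal(0, 0.3, 2000))  # "normal" driving

# Fit log-spacing ~ a*speed + c on the normal-driving data.
a, c = np.polyfit(speed, np.log(spacing), 1)
sigma = np.std(np.log(spacing) - (a * speed + c))

def risk(obs_spacing, obs_speed):
    """P(normal driving keeps a larger spacing than observed): values
    near 1 flag an unusually close, risky interaction in this context."""
    mu = a * obs_speed + c
    return 1 - stats.lognorm.cdf(obs_spacing, s=sigma, scale=np.exp(mu))

print(risk(3.0, obs_speed=20.0))    # far below the norm -> near 1
print(risk(11.0, obs_speed=20.0))   # about the median   -> near 0.5
```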

[LG-88] Selective Code Generation for Functional Guarantees

链接: https://arxiv.org/abs/2505.13553
作者: Jaewoo Jeong,Taesoo Kim,Sangdon Park
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) show human-level performance, and their specialized descendants, code generation models, play core roles in solving complex tasks, including mathematical reasoning and software development. On the downside, the hallucination of LLMs mainly hinders their applicability to systems requiring higher safety standards, thus drawing the attention of the AI community. However, the hallucination of code generation models is rarely considered. One critical bottleneck in addressing code hallucination is that, unlike natural language, the unnatural form of code makes it intricate to identify whether generated code has the intended functionality. A handful of unit tests have been considered to address this issue, but scaling up their number is extremely expensive. We address this core bottleneck by automatically generating unit tests using dynamic code analysis tools, which leverages the executable nature of code. Given unit tests generated from true code for measuring the functional correctness of generated code, we propose to learn a selective code generator, which abstains from answering for unsure generations, to control the rate of code hallucination among non-abstaining answers in terms of a false discovery rate. This learning algorithm provides a controllability guarantee, providing trustworthiness of code generation. Finally, we propose to use generated unit tests in evaluation as well as in learning for precise code evaluation, calling this evaluation paradigm FuzzEval. We demonstrate the efficacy of our selective code generator over open and closed code generators, showing the clear benefit of leveraging generated unit tests along with the controllability of code hallucination and reasonable selection efficiency.
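
A minimal sketch of the abstention idea behind selective generation: on a calibration set of generations with unit-test outcomes, pick a confidence threshold so that the empirical false discovery rate among accepted answers stays below a target. The confidence scores and the plug-in calibration rule here are illustrative assumptions; the paper's learning algorithm and its formal guarantee differ in detail.

```python
import numpy as np

def calibrate_threshold(scores, correct, target_fdr=0.1):
    """Pick a confidence threshold whose empirical FDR (fraction of
    accepted generations that fail their unit tests) stays <= target_fdr."""
    order = np.argsort(-scores)               # most confident first
    wrong = (~correct[order]).cumsum()        # failures among the top-k
    accepted = np.arange(1, len(scores) + 1)
    fdr = wrong / accepted                    # empirical FDR at each cut
    ok = np.where(fdr <= target_fdr)[0]
    if len(ok) == 0:
        return np.inf                         # abstain on everything
    k = ok.max()                              # largest acceptance set meeting target
    return scores[order][k]

# Toy calibration data: confidence scores and unit-test pass/fail outcomes.
rng = np.random.default_rng(0)
scores = rng.uniform(size=500)
correct = rng.uniform(size=500) < scores      # higher confidence -> more often correct
tau = calibrate_threshold(scores, correct, target_fdr=0.1)
print(f"accept generations with confidence >= {tau:.3f}")
```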

[LG-89] Zero-Shot Forecasting Mortality Rates: A Global Study

链接: https://arxiv.org/abs/2505.13521
作者: Gabor Petnehazi,Laith Al Shaggah,Jozsef Gall,Bernadett Aradi
类目: Machine Learning (cs.LG); Risk Management (q-fin.RM); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:This study explores the potential of zero-shot time series forecasting, an innovative approach leveraging pre-trained foundation models, to forecast mortality rates without task-specific fine-tuning. We evaluate two state-of-the-art foundation models, TimesFM and CHRONOS, alongside traditional and machine learning-based methods across three forecasting horizons (5, 10, and 20 years) using data from 50 countries and 111 age groups. In our investigations, zero-shot models showed varying results: while CHRONOS delivered competitive shorter-term forecasts, outperforming traditional methods like ARIMA and the Lee-Carter model, TimesFM consistently underperformed. Fine-tuning CHRONOS on mortality data significantly improved long-term accuracy. A Random Forest model, trained on mortality data, achieved the best overall performance. These findings underscore the potential of zero-shot forecasting while highlighting the need for careful model selection and domain-specific adaptation.
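
As a concrete illustration of zero-shot forecasting with a foundation model, the sketch below queries a pretrained CHRONOS checkpoint through the open-source chronos-forecasting package. The checkpoint ID and the toy mortality-like series are assumptions; the paper's exact pipeline (preprocessing, horizons, fine-tuning) is not reproduced here.

```python
# pip install chronos-forecasting
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",   # small public checkpoint, for illustration
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Toy stand-in for a mortality-rate history (e.g. log rates for one age group).
history = torch.linspace(-4.0, -4.6, steps=60)   # 60 annual observations

# Zero-shot forecast of the next 10 years; returns sample paths of shape
# [num_series, num_samples, prediction_length].
samples = pipeline.predict(history, prediction_length=10)
forecast = samples.quantile(0.5, dim=1)          # median forecast path
print(forecast.shape)
```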

[LG-90] On the definition and importance of interpretability in scientific machine learning

链接: https://arxiv.org/abs/2505.13510
作者: Conor Rowan,Alireza Doostan
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); History and Philosophy of Physics (physics.hist-ph); Physics and Society (physics.soc-ph)
*备注:

点击查看摘要

Abstract:Though neural networks trained on large data sets have been successfully used to describe and predict many physical phenomena, there is a sense among scientists that, unlike traditional scientific models, where relationships come packaged in the form of simple mathematical expressions, the findings of the neural network cannot be integrated into the body of scientific knowledge. Critics of ML’s inability to produce human-understandable relationships have converged on the concept of “interpretability” as its point of departure from more traditional forms of science. As the growing interest in interpretability has shown, researchers in the physical sciences seek not just predictive models, but also to uncover the fundamental principles that govern a system of interest. However, clarity around a definition of interpretability and the precise role that it plays in science is lacking in the literature. In this work, we argue that researchers in equation discovery and symbolic regression tend to conflate the concept of sparsity with interpretability. We review key papers on interpretable ML from outside the scientific community and argue that, though the definitions and methods they propose can inform questions of interpretability for SciML, they are inadequate for this new purpose. Noting these deficiencies, we propose an operational definition of interpretability for the physical sciences. Our notion of interpretability emphasizes understanding of the mechanism over mathematical sparsity. Innocuous though it may seem, this emphasis on mechanism shows that sparsity is often unnecessary. It also questions the possibility of interpretable scientific discovery when prior knowledge is lacking. We believe a precise and philosophically informed definition of interpretability in SciML will help focus research efforts toward the most significant obstacles to realizing a data-driven scientific future.

[LG-91] Fuck the Algorithm: Conceptual Issues in Algorithmic Bias

链接: https://arxiv.org/abs/2505.13509
作者: Catherine Stinson
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Algorithmic bias has been the subject of much recent controversy. To clarify what is at stake and to make progress resolving the controversy, a better understanding of the concepts involved would be helpful. The discussion here focuses on the disputed claim that algorithms themselves cannot be biased. To clarify this claim we need to know what kind of thing ‘algorithms themselves’ are, and to disambiguate the several meanings of ‘bias’ at play. This further involves showing how bias of moral import can result from statistical biases, and drawing connections to previous conceptual work about political artifacts and oppressive things. Data bias has been identified in domains like hiring, policing and medicine. Examples where algorithms themselves have been pinpointed as the locus of bias include recommender systems that influence media consumption, academic search engines that influence citation patterns, and the 2020 UK algorithmically-moderated A-level grades. Recognition that algorithms are a kind of thing that can be biased is key to making decisions about responsibility for harm, and preventing algorithmically mediated discrimination.

[LG-92] Federated Low-Rank Adaptation for Foundation Models: A Survey

链接: https://arxiv.org/abs/2505.13502
作者: Yiyuan Yang,Guodong Long,Qinghua Lu,Liming Zhu,Jing Jiang,Chengqi Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effectively leveraging private datasets remains a significant challenge in developing foundation models. Federated Learning (FL) has recently emerged as a collaborative framework that enables multiple users to fine-tune these models while mitigating data privacy risks. Meanwhile, Low-Rank Adaptation (LoRA) offers a resource-efficient alternative for fine-tuning foundation models by dramatically reducing the number of trainable parameters. This survey examines how LoRA has been integrated into federated fine-tuning for foundation models, an area we term FedLoRA, by focusing on three key challenges: distributed learning, heterogeneity, and efficiency. We further categorize existing work based on the specific methods used to address each challenge. Finally, we discuss open research questions and highlight promising directions for future investigation, outlining the next steps for advancing FedLoRA.
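
For readers new to LoRA, the sketch below shows the core mechanism the survey builds on: a frozen base weight plus a trainable low-rank update, so only a tiny adapter is trained — and, in a federated setting, only the adapter matrices would need to be communicated. This is a generic textbook-style LoRA layer, not any specific FedLoRA method from the survey.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # only the adapter is trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 768 = 12288, vs. 768*768 + 768 in the frozen base layer
```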

[LG-93] SPIEDiff: robust learning of long-time macroscopic dynamics from short-time particle simulations with quantified epistemic uncertainty

链接: https://arxiv.org/abs/2505.13501
作者: Zequn He,Celia Reina
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:The data-driven discovery of long-time macroscopic dynamics and thermodynamics of dissipative systems with particle fidelity is hampered by significant obstacles. These include the strong time-scale limitations inherent to particle simulations, the non-uniqueness of the thermodynamic potentials and operators from given macroscopic dynamics, and the need for efficient uncertainty quantification. This paper introduces Statistical-Physics Informed Epistemic Diffusion Models (SPIEDiff), a machine learning framework designed to overcome these limitations in the context of purely dissipative systems by leveraging statistical physics, conditional diffusion models, and epinets. We evaluate the proposed framework on stochastic Arrhenius particle processes and demonstrate that SPIEDiff can accurately uncover both thermodynamics and kinetics, while enabling reliable long-time macroscopic predictions using only short-time particle simulation data. SPIEDiff can deliver accurate predictions with quantified uncertainty in minutes, drastically reducing the computational demand compared to direct particle simulations, which would take days or years in the examples considered. Overall, SPIEDiff offers a robust and trustworthy pathway for the data-driven discovery of thermodynamic models.

[LG-94] RTL++: Graph-enhanced LLM for RTL Code Generation

链接: https://arxiv.org/abs/2505.13479
作者: Mohammad Akyash,Kimia Azar,Hadi Kamali
类目: Programming Languages (cs.PL); Machine Learning (cs.LG)
*备注: Accepted to the IEEE International Conference on LLM-Aided Design (LAD '25)

点击查看摘要

Abstract:As hardware design complexity escalates, there is an urgent need for advanced automation in electronic design automation (EDA). Traditional register transfer level (RTL) design methods are manual, time-consuming, and prone to errors. While commercial (instruction-tuned) large language models (LLMs) show promising performance for automation, they pose security and privacy concerns. Open-source models offer alternatives; however, they frequently fall short in quality/correctness, largely due to the limited, high-quality RTL code data essential for effective training and generalization. This paper proposes RTL++, a first-of-its-kind LLM-assisted method for RTL code generation that utilizes graph representations of code structures to enhance the quality of generated code. By encoding RTL code into textualized control flow graphs (CFGs) and data flow graphs (DFGs), RTL++ captures the inherent hierarchy, dependencies, and relationships within the code. This structured graph-based approach enhances the context available to LLMs, enabling them to better understand and generate instructions. By focusing on data generation through graph representations, RTL++ addresses the limitations of previous approaches that rely solely on code and suffer from a lack of diversity. Experimental results demonstrate that RTL++ outperforms state-of-the-art models fine-tuned for RTL generation, as evaluated using the VerilogEval benchmark's Pass@1/5/10 metric as well as on the RTLLM1.1 benchmark, highlighting the effectiveness of graph-enhanced context in advancing the capabilities of LLM-assisted RTL code generation.

[LG-95] The Spotlight Resonance Method: Resolving the Alignment of Embedded Activations ICLR

链接: https://arxiv.org/abs/2505.13471
作者: George Bird
类目: Machine Learning (cs.LG)
*备注: 25 pages, 13 figures, 2nd Workshop on Representational Alignment, International Conference on Learning Representations (ICLR)

点击查看摘要

Abstract:Understanding how deep learning models represent data is currently difficult due to the limited number of methodologies available. This paper demonstrates a versatile and novel visualisation tool for determining the axis alignment of embedded data at any layer in any deep learning model. In particular, it evaluates the distribution around planes defined by the network’s privileged basis vectors. This method provides both an atomistic and a holistic, intuitive metric for interpreting the distribution of activations across all planes. It ensures that both positive and negative signals contribute, treating the activation vector as a whole. Depending on the application, several variations of this technique are presented, with a resolution scale hyperparameter to probe different angular scales. Using this method, multiple examples are provided that demonstrate embedded representations tend to be axis-aligned with the privileged basis. This is not necessarily the standard basis, and it is found that activation functions directly result in privileged bases. Hence, it provides a direct causal link between functional form symmetry breaking and representational alignment, explaining why representations have a tendency to align with the neuron basis. Therefore, using this method, we begin to answer the fundamental question of what causes the observed tendency of representations to align with neurons. Finally, examples of so-called grandmother neurons are found in a variety of networks.

[LG-96] Predicting The Evolution of Interfaces with Fourier Neural Operators

链接: https://arxiv.org/abs/2505.13463
作者: Paolo Guida,William L. Roberts
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Recent progress in AI has established neural operators as powerful tools that can predict the evolution of partial differential equations, such as the Navier-Stokes equations. Some complex problems rely on sophisticated algorithms to deal with strong discontinuities in the computational domain. For example, liquid-vapour multiphase flows are a challenging problem in many configurations, particularly those involving large density gradients or phase change. The complexity mentioned above has not allowed for fine control of fast industrial processes or applications because computational fluid dynamics (CFD) models do not have a quick enough forecasting ability. This work demonstrates that the time scale of neural-operator-based predictions is comparable to the time scale of multi-phase applications, thus proving they can be used to control processes that require fast response. Neural operators can be trained using experimental data, simulations, or a combination. In the following, neural operators were trained on volume-of-fluid simulations, and the resulting predictions showed very high accuracy, particularly in predicting the evolution of the liquid-vapour interface, one of the most critical tasks in a multi-phase process controller.

[LG-97] FPGA-based Acceleration for Convolutional Neural Networks: A Comprehensive Review

链接: https://arxiv.org/abs/2505.13461
作者: Junye Jiang,Yaan Zhou,Yuanhao Gong,Haoxuan Yuan,Shuanglong Liu
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: 19 pages, 3 figures

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) are fundamental to deep learning, driving applications across various domains. However, their growing complexity has significantly increased computational demands, necessitating efficient hardware accelerators. Field-Programmable Gate Arrays (FPGAs) have emerged as a leading solution, offering reconfigurability, parallelism, and energy efficiency. This paper provides a comprehensive review of FPGA-based hardware accelerators specifically designed for CNNs. It presents and summarizes the performance evaluation framework grounded in existing studies and explores key optimization strategies, such as parallel computing, dataflow optimization, and hardware-software co-design. It also compares various FPGA architectures in terms of latency, throughput, compute efficiency, power consumption, and resource utilization. Finally, the paper highlights future challenges and opportunities, emphasizing the potential for continued innovation in this field.

[LG-98] Tuning Learning Rates with the Cumulative-Learning Constant

链接: https://arxiv.org/abs/2505.13457
作者: Nathan Faraj
类目: Machine Learning (cs.LG)
*备注: 9 pages, 13 figures, 2 tables

点击查看摘要

Abstract:This paper introduces a novel method for optimizing learning rates in machine learning. A previously unrecognized proportionality between learning rates and dataset sizes is discovered, providing valuable insights into how dataset scale influences training dynamics. Additionally, a cumulative learning constant is identified, offering a framework for designing and optimizing advanced learning rate schedules. These findings have the potential to enhance training efficiency and performance across a wide range of machine learning applications.
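
The abstract does not state the exact functional form of the learning-rate/dataset-size proportionality, so the sketch below is only an empirical probe: it sweeps learning rates at several dataset sizes on a toy SGD problem and records the best rate per size — the kind of experiment from which such a relationship could be observed. All settings are illustrative.

```python
import numpy as np

def best_lr_for_size(n, lrs, steps=200, seed=0):
    """Fit y = w*x by SGD on n synthetic points for each candidate lr;
    return the lr with lowest final loss. An empirical probe only."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 3.0 * x + rng.normal(scale=0.1, size=n)
    losses = []
    for lr in lrs:
        w = 0.0
        for _ in range(steps):
            i = rng.integers(n)
            w -= lr * (w * x[i] - y[i]) * x[i]   # SGD step on squared error
        losses.append(np.mean((w * x - y) ** 2))
    return lrs[int(np.argmin(losses))]

lrs = np.logspace(-3, 0, 10)
for n in [100, 1000, 10000]:
    print(n, best_lr_for_size(n, lrs))
```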

[LG-99] Boosting LLM-based Relevance Modeling with Distribution-Aware Robust Learning

链接: https://arxiv.org/abs/2412.12504
作者: Hong Liu,Saisai Gong,Yixin Ji,Kaixin Wu,Jia Xu,Jinjie Gu
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:With the rapid advancement of pre-trained large language models (LLMs), recent endeavors have leveraged the capabilities of LLMs in relevance modeling, resulting in enhanced performance. This is usually done through the process of fine-tuning LLMs on specifically annotated datasets to determine the relevance between queries and items. However, there are two limitations when LLMs are naively employed for relevance modeling through fine-tuning and inference. First, LLMs are not inherently efficient at performing nuanced tasks beyond simple yes or no answers, such as assessing search relevance. They may therefore tend to be overconfident and struggle to distinguish fine-grained degrees of relevance (e.g., strong relevance, weak relevance, irrelevance) used in search engines. Second, they exhibit significant performance degradation when confronted with data distribution shift in real-world scenarios. In this paper, we propose a novel Distribution-Aware Robust Learning framework (DaRL) for relevance modeling in Alipay Search. Specifically, we design an effective loss function to enhance the discriminability of LLM-based relevance modeling across various fine-grained degrees of query-item relevance. To improve the generalizability of LLM-based relevance modeling, we first propose the Distribution-Aware Sample Augmentation (DASA) module. This module utilizes out-of-distribution (OOD) detection techniques to actively select appropriate samples that are not well covered by the original training set for model fine-tuning. Furthermore, we adopt a multi-stage fine-tuning strategy to simultaneously improve in-distribution (ID) and OOD performance, bridging the performance gap between them. DaRL has been deployed online to serve Alipay's insurance product search…

[LG-100] Quantum Optimization via Gradient-Based Hamiltonian Descent ICML2025

链接: https://arxiv.org/abs/2505.14670
作者: Jiaqi Leng,Bin Shi
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 19 pages, 6 figures. To appear in the proceedings of ICML 2025

点击查看摘要

Abstract:With rapid advancements in machine learning, first-order algorithms have emerged as the backbone of modern optimization techniques, owing to their computational efficiency and low memory requirements. Recently, the connection between accelerated gradient methods and damped heavy-ball motion, particularly within the framework of Hamiltonian dynamics, has inspired the development of innovative quantum algorithms for continuous optimization. One such algorithm, Quantum Hamiltonian Descent (QHD), leverages quantum tunneling to escape saddle points and local minima, facilitating the discovery of global solutions in complex optimization landscapes. However, QHD faces several challenges, including slower convergence rates compared to classical gradient methods and limited robustness in highly non-convex problems due to the non-local nature of quantum states. Furthermore, the original QHD formulation primarily relies on function value information, which limits its effectiveness. Inspired by insights from high-resolution differential equations that have elucidated the acceleration mechanisms in classical methods, we propose an enhancement to QHD by incorporating gradient information, leading to what we call gradient-based QHD. Gradient-based QHD achieves faster convergence and significantly increases the likelihood of identifying global solutions. Numerical simulations on challenging problem instances demonstrate that gradient-based QHD outperforms existing quantum and classical methods by at least an order of magnitude.

[LG-101] Sequential QCQP for Bilevel Optimization with Line Search

链接: https://arxiv.org/abs/2505.14647
作者: Sina Sharifi,Erfan Yazdandoost Hamedani,Mahyar Fazlyab
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Under Review

点击查看摘要

Abstract:Bilevel optimization involves a hierarchical structure where one problem is nested within another, leading to complex interdependencies between levels. We propose a single-loop, tuning-free algorithm that guarantees anytime feasibility, i.e., approximate satisfaction of the lower-level optimality condition, while ensuring descent of the upper-level objective. At each iteration, a convex quadratically-constrained quadratic program (QCQP) with a closed-form solution yields the search direction, followed by a backtracking line search inspired by control barrier functions to ensure safe, uniformly positive step sizes. The resulting method is scalable, requires no hyperparameter tuning, and converges under mild local regularity assumptions. We establish an O(1/k) ergodic convergence rate and demonstrate the algorithm’s effectiveness on representative bilevel tasks.

[LG-102] High-Dimensional Analysis of Bootstrap Ensemble Classifiers

链接: https://arxiv.org/abs/2505.14587
作者: Hamza Cherkaoui,Malik Tiomoko,Mohamed El Amine Seddik,Cosme Louart,Ekkehard Schnoor,Balazs Kegl
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bootstrap methods have long been a cornerstone of ensemble learning in machine learning. This paper presents a theoretical analysis of bootstrap techniques applied to the Least Square Support Vector Machine (LSSVM) ensemble in the context of large and growing sample sizes and feature dimensionalities. Leveraging tools from Random Matrix Theory, we investigate the performance of this classifier that aggregates decision functions from multiple weak classifiers, each trained on different subsets of the data. We provide insights into the use of bootstrap methods in high-dimensional settings, enhancing our understanding of their impact. Based on these findings, we propose strategies to select the number of subsets and the regularization parameter that maximize the performance of the LSSVM. Empirical experiments on synthetic and real-world datasets validate our theoretical results.
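
A hedged illustration of the bootstrap-ensemble classifier being analyzed: several least-squares kernel models (kernel ridge regression on ±1 labels, a close stand-in for LSSVM) are trained on bootstrap subsets and their decision functions averaged. Hyperparameters, subset sizes, and the train-set evaluation are illustrative only, not the paper's experimental setup.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=50, random_state=0)
y_pm = 2 * y - 1                        # LSSVM-style {-1, +1} targets

rng = np.random.default_rng(0)
n_models, scores = 20, np.zeros(len(X))
for _ in range(n_models):
    idx = rng.choice(len(X), size=len(X) // 2, replace=True)   # bootstrap subset
    m = KernelRidge(alpha=1.0, kernel="rbf", gamma=0.02).fit(X[idx], y_pm[idx])
    scores += m.predict(X)              # aggregate decision functions

pred = np.sign(scores / n_models)
print((pred == y_pm).mean())            # training-set agreement, for illustration only
```

The theory in the paper concerns exactly the two knobs exposed here: the number of subsets and the ridge regularization parameter.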

[LG-103] Performance Optimization of Energy-Harvesting Underlay Cognitive Radio Networks Using Reinforcement Learning

链接: https://arxiv.org/abs/2505.14581
作者: Deemah H. Tashman,Soumaya Cherkaoui,Walaa Hamouda
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, a reinforcement learning technique is employed to maximize the performance of a cognitive radio network (CRN). In the presence of primary users (PUs), it is presumed that two secondary users (SUs) access the licensed band in underlay mode. In addition, the SU transmitter is assumed to be an energy-constrained device that requires harvesting energy in order to transmit signals to its intended destination. Therefore, we propose that there are two main sources of energy: the interference from PUs' transmissions and ambient radio frequency (RF) sources. The SU will select whether to gather energy from PUs or only from ambient sources based on a predetermined threshold. The process of energy harvesting from the PUs' messages is accomplished via the time switching approach. In addition, based on a deep Q-network (DQN) approach, the SU transmitter determines whether to collect energy or transmit messages during each time slot, and selects a suitable transmission power in order to maximize its average data rate. Our approach outperforms a baseline strategy and converges, as shown by our findings.

[LG-104] A simple estimator of the correlation kernel matrix of a determinantal point process

链接: https://arxiv.org/abs/2505.14529
作者: Christian Gouriéroux,Yang Lu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Determinantal Point Process (DPP) is a parameterized model for multivariate binary variables, characterized by a correlation kernel matrix. This paper proposes a closed form estimator of this kernel, which is particularly easy to implement and can also be used as a starting value of learning algorithms for maximum likelihood estimation. We prove the consistency and asymptotic normality of our estimator, as well as its large deviation properties.
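
One classical moment-based construction matching the description of a closed-form kernel estimator is sketched below: diagonal entries come from singleton inclusion frequencies, and off-diagonal magnitudes from pairwise frequencies via P(i,j in S) = K_ii K_jj - K_ij^2. Whether this coincides with the paper's estimator is an assumption; note that off-diagonal signs are not identified from pairwise marginals, so only magnitudes are returned.

```python
import numpy as np

def estimate_dpp_kernel(X):
    """Moment-based estimate of a DPP correlation kernel from binary samples.

    X: (n_samples, d) 0/1 matrix, each row a sampled subset indicator.
    Uses P(i in S) = K_ii and P(i,j in S) = K_ii*K_jj - K_ij**2.
    """
    p1 = X.mean(axis=0)                      # singleton inclusion frequencies
    p2 = (X.T @ X) / len(X)                  # pairwise inclusion frequencies
    K = np.sqrt(np.clip(np.outer(p1, p1) - p2, 0.0, None))  # |K_ij| estimates
    np.fill_diagonal(K, p1)
    return K

# Sanity check on independent coordinates (true kernel is diagonal).
rng = np.random.default_rng(0)
X = (rng.uniform(size=(20000, 4)) < np.array([0.2, 0.5, 0.7, 0.3])).astype(float)
print(np.round(estimate_dpp_kernel(X), 2))   # near-diagonal matrix expected
```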

[LG-105] Steering Deep Non-Linear Spatially Selective Filters for Weakly Guided Extraction of Moving Speakers in Dynamic Scenarios INTERSPEECH2025

链接: https://arxiv.org/abs/2505.14517
作者: Jakob Kienegger,Timo Gerkmann
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Recent speaker extraction methods using deep non-linear spatial filtering perform exceptionally well when the target direction is known and stationary. However, spatially dynamic scenarios are considerably more challenging due to time-varying spatial features and arising ambiguities, e.g. when moving speakers cross. While in a static scenario it may be easy for a user to point to the target's direction, manually tracking a moving speaker is impractical. Instead of relying on accurate time-dependent directional cues, which we refer to as strong guidance, in this paper we propose a weakly guided extraction method solely depending on the target's initial position to cope with spatially dynamic scenarios. By incorporating our own deep tracking algorithm and developing a joint training strategy on a synthetic dataset, we demonstrate the proficiency of our approach in resolving spatial ambiguities and even outperform a mismatched, but strongly guided extraction method.

[LG-106] FlowTSE: Target Speaker Extraction with Flow Matching INTERSPEECH2025

链接: https://arxiv.org/abs/2505.14465
作者: Aviv Navon,Aviv Shamsian,Yael Segal-Feldman,Neta Glazer,Gil Hetz,Joseph Keshet
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: InterSpeech 2025

点击查看摘要

Abstract:Target speaker extraction (TSE) aims to isolate a specific speaker’s speech from a mixture using speaker enrollment as a reference. While most existing approaches are discriminative, recent generative methods for TSE achieve strong results. However, generative methods for TSE remain underexplored, with most existing approaches relying on complex pipelines and pretrained components, leading to computational overhead. In this work, we present FlowTSE, a simple yet effective TSE approach based on conditional flow matching. Our model receives an enrollment audio sample and a mixed speech signal, both represented as mel-spectrograms, with the objective of extracting the target speaker’s clean speech. Furthermore, for tasks where phase reconstruction is crucial, we propose a novel vocoder conditioned on the complex STFT of the mixed signal, enabling improved phase estimation. Experimental results on standard TSE benchmarks show that FlowTSE matches or outperforms strong baselines.

[LG-107] A system identification approach to clustering vector autoregressive time series

链接: https://arxiv.org/abs/2505.14421
作者: Zuogong Yue,Xinyi Wang,Victor Solo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Clustering of time series based on their underlying dynamics keeps attracting researchers due to its impact on assisting complex system modelling. Most current time series clustering methods handle only scalar time series, treat them as white noise, or rely on domain knowledge for high-quality feature construction, where the autocorrelation pattern/feature is mostly ignored. Instead of relying on heuristic feature/metric construction, the system identification approach allows vector time series to be clustered by explicitly considering their underlying autoregressive dynamics. We first derive a clustering algorithm based on a mixture autoregressive model. Unfortunately, it turns out to have significant computational problems. We then derive a 'small-noise' limiting version of the algorithm, which we call k-LMVAR (Limiting Mixture Vector AutoRegression), that is computationally manageable. We develop an associated BIC criterion for choosing the number of clusters and the model order. The algorithm performs very well in comparative simulations and also scales well computationally.
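
The sketch below is not k-LMVAR itself but the simplest system-identification baseline in its spirit: fit a VAR(1) model to each vector time series by least squares and cluster the vectorized coefficient matrices. All settings are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def var1_coeffs(Y):
    """Least-squares VAR(1) fit Y_t = A @ Y_{t-1} + e_t; returns vec(A).

    Y: (T, d) array, one vector time series. lstsq solves Y[:-1] @ M = Y[1:],
    so M = A.T and we return A flattened.
    """
    M, *_ = np.linalg.lstsq(Y[:-1], Y[1:], rcond=None)
    return M.T.ravel()

rng = np.random.default_rng(0)

def simulate(A, T=300):
    Y = np.zeros((T, A.shape[0]))
    for t in range(1, T):
        Y[t] = A @ Y[t - 1] + 0.1 * rng.normal(size=A.shape[0])
    return Y

A1 = np.array([[0.8, 0.1], [0.0, 0.5]])       # two stable VAR(1) regimes
A2 = np.array([[0.2, -0.4], [0.3, 0.7]])
series = [simulate(A1) for _ in range(10)] + [simulate(A2) for _ in range(10)]
feats = np.stack([var1_coeffs(Y) for Y in series])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats))
```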

[LG-108] Path-integral molecular dynamics with actively-trained and universal machine learning force fields

链接: https://arxiv.org/abs/2505.14245
作者: A. A. Solovykh(1, 2 and 3),N. E. Rybin(3 and 4),I. S. Novikov(3, 5, 6 and 7),A. V. Shapeev(3 and 4) ((1) Lomonosov Moscow State University, Faculty of Physics, Moscow, Russian Federation, (2) Skobeltsyn Institute of Nuclear Physics, Lomonosov Moscow State University, Moscow, Russian Federation, (3) Skolkovo Institute of Science and Technology, Moscow, Russian Federation, (4) Digital Materials LLC, Odintsovo, Russian Federation, (5) HSE University, Faculty of Computer Science, Moscow, Russian Federation, (6) Moscow Institute of Physics and Technology, Moscow, Russian Federation, (7) Emanuel Institute of Biochemical Physics of the Russian Academy of Sciences, Moscow, Russian Federation)
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 9 pages, 6 eps figures

点击查看摘要

Abstract:Accounting for nuclear quantum effects (NQEs) can significantly alter material properties at finite temperatures. Atomic modeling using the path-integral molecular dynamics (PIMD) method can fully account for such effects, but requires computationally efficient and accurate models of interatomic interactions. Empirical potentials are fast but may lack sufficient accuracy, whereas quantum-mechanical calculations are highly accurate but computationally expensive. Machine-learned interatomic potentials offer a solution to this challenge, providing near-quantum-mechanical accuracy while maintaining high computational efficiency compared to density functional theory (DFT) calculations. In this context, an interface was developed to integrate moment tensor potentials (MTPs) from the MLIP-2 software package into PIMD calculations using the i-PI software package. This interface was then applied to active learning of potentials and to investigate the influence of NQEs on material properties, namely the temperature dependence of lattice parameters and thermal expansion coefficients, as well as radial distribution functions, for lithium hydride (LiH) and silicon (Si) systems. The results were compared with experimental data, quasi-harmonic approximation calculations, and predictions from the universal machine learning force field MatterSim. These comparisons demonstrated the high accuracy and effectiveness of the MTP-PIMD approach.

[LG-109] QSVM-QNN: Quantum Support Vector Machine Based Quantum Neural Network Learning Algorithm for Brain-Computer Interfacing Systems

链接: https://arxiv.org/abs/2505.14192
作者: Bikash K. Behera,Saif Al-Kuwari,Ahmed Farouk
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 12 pages, 7 Figures, 7 Tables

点击查看摘要

Abstract:A brain-computer interface (BCI) system enables direct communication between the brain and external devices, offering significant potential for assistive technologies and advanced human-computer interaction. Despite progress, BCI systems face persistent challenges, including signal variability, classification inefficiency, and difficulty adapting to individual users in real time. In this study, we propose a novel hybrid quantum learning model, termed QSVM-QNN, which integrates a Quantum Support Vector Machine (QSVM) with a Quantum Neural Network (QNN), to improve classification accuracy and robustness in EEG-based BCI tasks. Unlike existing models, QSVM-QNN combines the decision boundary capabilities of QSVM with the expressive learning power of QNN, leading to superior generalization performance. The proposed model is evaluated on two benchmark EEG datasets, achieving high accuracies of 0.990 and 0.950, outperforming both classical and standalone quantum models. To demonstrate real-world viability, we further validated the robustness of QNN, QSVM, and QSVM-QNN against six realistic quantum noise models, including bit flip and phase damping. These experiments reveal that QSVM-QNN maintains stable performance under noisy conditions, establishing its applicability for deployment in practical, noisy quantum environments. Beyond BCI, the proposed hybrid quantum architecture is generalizable to other biomedical and time-series classification tasks, offering a scalable and noise-resilient solution for next-generation neurotechnological systems.

[LG-110] Hybrid Bernstein Normalizing Flows for Flexible Multivariate Density Regression with Interpretable Marginals

链接: https://arxiv.org/abs/2505.14164
作者: Marcel Arpogaus,Thomas Kneib,Thomas Nagler,David Rügamer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Density regression models allow a comprehensive understanding of data by modeling the complete conditional probability distribution. While flexible estimation approaches such as normalizing flows (NF) work particularly well in multiple dimensions, interpreting the input-output relationship of such models is often difficult, due to the black-box character of deep learning models. In contrast, existing statistical methods for multivariate outcomes such as multivariate conditional transformation models (MCTM) are restricted in flexibility and are often not expressive enough to represent complex multivariate probability distributions. In this paper, we combine MCTM with state-of-the-art and autoregressive NF to leverage the transparency of MCTM for modeling interpretable feature effects on the marginal distributions in the first step and the flexibility of neural-network-based NF techniques to account for complex and non-linear relationships in the joint data distribution. We demonstrate our method’s versatility in various numerical experiments and compare it with MCTM and other NF models on both simulated and real-world data.

[LG-111] High-dimensional Nonparametric Contextual Bandit Problem

链接: https://arxiv.org/abs/2505.14102
作者: Shogo Iwazaki,Junpei Komiyama,Masaaki Imaizumi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 38 pages

点击查看摘要

Abstract:We consider the kernelized contextual bandit problem with a large feature space. This problem involves K arms, and the goal of the forecaster is to maximize the cumulative reward by learning the relationship between the contexts and the rewards. It serves as a general framework for various decision-making scenarios, such as personalized online advertising and recommendation systems. Kernelized contextual bandits generalize the linear contextual bandit problem and offer greater modeling flexibility. Existing methods, when applied to Gaussian kernels, yield a trivial bound of O(T) when we consider \Omega(\log T) feature dimensions. To address this, we introduce stochastic assumptions on the context distribution and show that no-regret learning is achievable even when the number of dimensions grows up to the number of samples. Furthermore, we analyze lenient regret, which allows a per-round regret of at most \Delta > 0. We derive the rate of lenient regret in terms of \Delta.

[LG-112] Computational Efficiency under Covariate Shift in Kernel Ridge Regression

链接: https://arxiv.org/abs/2505.14083
作者: Andrea Della Vecchia,Arnaud Mavakala Watusadisi,Ernesto De Vito,Lorenzo Rosasco
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper addresses the covariate shift problem in the context of nonparametric regression within reproducing kernel Hilbert spaces (RKHSs). Covariate shift arises in supervised learning when the input distributions of the training and test data differ, presenting additional challenges for learning. Although kernel methods have optimal statistical properties, their high computational demands in terms of time and, particularly, memory, limit their scalability to large datasets. To address this limitation, the main focus of this paper is to explore the trade-off between computational efficiency and statistical accuracy under covariate shift. We investigate the use of random projections where the hypothesis space consists of a random subspace within a given RKHS. Our results show that, even in the presence of covariate shift, significant computational savings can be achieved without compromising learning performance.
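
A minimal sketch of the random-subspace idea: restrict kernel ridge regression to the span of m randomly chosen landmark functions k(., z_j) (a Nyström-type approximation), then evaluate under a shifted test distribution. The regularization scaling, kernel, and hyperparameters are assumptions; the paper's precise estimator and analysis may differ.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_krr(X, y, m=50, lam=1e-3, gamma=0.5, seed=0):
    """KRR restricted to a random m-dimensional subspace span{k(., z_j)}:
    minimize ||K_nm a - y||^2 + lam * n * a' K_mm a."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), size=m, replace=False)]      # random landmarks
    Knm, Kmm = rbf(X, Z, gamma), rbf(Z, Z, gamma)
    alpha = np.linalg.solve(Knm.T @ Knm + lam * len(X) * Kmm
                            + 1e-10 * np.eye(m), Knm.T @ y)
    return lambda Xt: rbf(Xt, Z, gamma) @ alpha

# Covariate-shift toy: training inputs centered at 0, test inputs shifted.
rng = np.random.default_rng(0)
Xtr = rng.normal(0.0, 1.0, size=(500, 1)); ytr = np.sin(3 * Xtr[:, 0])
Xte = rng.normal(0.8, 1.0, size=(200, 1))
f = nystrom_krr(Xtr, ytr, m=50)
print(np.mean((f(Xte) - np.sin(3 * Xte[:, 0])) ** 2))     # shifted-test MSE
```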

[LG-113] ThermoONet – a deep learning-based small body thermophysical network: applications to modelling water activity of comets

链接: https://arxiv.org/abs/2505.14016
作者: Shunjing Zhao,Xian Shi,Hanlun Lei
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: accepted for publication in A&A

点击查看摘要

Abstract:Cometary activity is a compelling subject of study, with thermophysical models playing a pivotal role in its understanding. However, traditional numerical solutions for small body thermophysical models are computationally intensive, posing challenges for investigations requiring high-resolution or repetitive modeling. To address this limitation, we employed a machine learning approach to develop ThermoONet - a neural network designed to predict the temperature and water ice sublimation flux of comets. Performance evaluations indicate that ThermoONet achieves a low average error in subsurface temperature of approximately 2% relative to the numerical simulation, while reducing computational time by nearly six orders of magnitude. We applied ThermoONet to model the water activity of comets 67P/Churyumov-Gerasimenko and 21P/Giacobini-Zinner. By successfully fitting the water production rate curves of these comets, as obtained by the Rosetta mission and the SOHO telescope, respectively, we demonstrate the network’s effectiveness and efficiency. Furthermore, when combined with a global optimization algorithm, ThermoONet proves capable of retrieving the physical properties of target bodies.

[LG-114] A Probabilistic Perspective on Model Collapse

链接: https://arxiv.org/abs/2505.13947
作者: Shirong Xu,Hengzhi He,Guang Cheng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, model collapse has become a critical issue in language model training, making it essential to understand the underlying mechanisms driving this phenomenon. In this paper, we investigate recursive parametric model training from a probabilistic perspective, aiming to characterize the conditions under which model collapse occurs and, crucially, how it can be mitigated. We conceptualize the recursive training process as a random walk of the model estimate, highlighting how the sample size influences the step size and how the estimation procedure determines the direction and potential bias of the random walk. Under mild conditions, we rigorously show that progressively increasing the sample size at each training step is necessary to prevent model collapse. In particular, when the estimation is unbiased, the required growth rate follows a superlinear pattern. This rate needs to be accelerated even further in the presence of substantial estimation bias. Building on this probabilistic framework, we also investigate the probability that recursive training on synthetic data yields models that outperform those trained solely on real data. Moreover, we extend these results to a general parametric model family in an asymptotic regime. Finally, we validate our theoretical results through extensive simulations and a real-world dataset.
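
The random-walk picture is easy to reproduce in a toy simulation: each generation refits a Gaussian mean to samples drawn from the previous fit. With a constant sample size the estimate drifts; with a superlinearly growing sample size (quadratic here) the drift stays bounded, in line with the stated result. Entirely illustrative, not the paper's experiments.

```python
import numpy as np

def recursive_training(schedule, generations=50, seed=0):
    """Each generation refits a Gaussian mean to samples drawn from the
    previous fit -- a toy version of the random-walk picture."""
    rng = np.random.default_rng(seed)
    mu = 0.0
    for t in range(1, generations + 1):
        n = schedule(t)
        mu = rng.normal(mu, 1.0, size=n).mean()   # unbiased step, std 1/sqrt(n)
    return mu

drift_const = [abs(recursive_training(lambda t: 100, seed=s)) for s in range(20)]
drift_grow = [abs(recursive_training(lambda t: 100 * t * t, seed=s)) for s in range(20)]
print(np.mean(drift_const), np.mean(drift_grow))  # growing n keeps the drift small
```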

[LG-115] An Asymptotic Equation Linking WAIC and WBIC in Singular Models ICONIP2025

链接: https://arxiv.org/abs/2505.13902
作者: Naoki Hayashi,Takuro Kutsuna,Sawa Takamuku
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 14 pages, to be submitted to ICONIP2025

点击查看摘要

Abstract:In statistical learning, models are classified as regular or singular depending on whether the mapping from parameters to probability distributions is injective. Most models with hierarchical structures or latent variables are singular, for which conventional criteria such as the Akaike Information Criterion and the Bayesian Information Criterion are inapplicable due to the breakdown of normal approximations for the likelihood and posterior. To address this, the Widely Applicable Information Criterion (WAIC) and the Widely Applicable Bayesian Information Criterion (WBIC) have been proposed. Since WAIC and WBIC are computed using posterior distributions at different temperature settings, separate posterior sampling is generally required. In this paper, we theoretically derive an asymptotic equation that links WAIC and WBIC, despite their dependence on different posteriors. This equation yields an asymptotically unbiased expression of WAIC in terms of the posterior distribution used for WBIC. The result clarifies the structural relationship between these criteria within the framework of singular learning theory, and deepens understanding of their asymptotic behavior. This theoretical contribution provides a foundation for future developments in the computational efficiency of model selection in singular models.
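
For reference, the standard (Watanabe) definitions that the paper's asymptotic equation connects are, to the best of our knowledge, the following; the paper's exact normalization may differ:

```latex
% WAIC: posterior at inverse temperature beta = 1; WBIC: tempered posterior.
\[
\mathrm{WAIC} = -\frac{1}{n}\sum_{i=1}^{n} \log \mathbb{E}_{\pi}\!\left[p(x_i \mid \theta)\right]
              + \frac{1}{n}\sum_{i=1}^{n} \mathbb{V}_{\pi}\!\left[\log p(x_i \mid \theta)\right],
\qquad
\mathrm{WBIC} = \mathbb{E}_{\pi_{\beta}}\!\left[-\sum_{i=1}^{n} \log p(x_i \mid \theta)\right],
\quad \beta = \frac{1}{\log n},
\]
```

where $\pi$ denotes the ordinary posterior and $\pi_\beta$ the posterior tempered at inverse temperature $\beta$; the paper's contribution is an asymptotic equation relating quantities computed under these two different posteriors.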

[LG-116] Graphon Mixtures

链接: https://arxiv.org/abs/2505.13864
作者: Sevvandi Kandanaarachchi,Cheng Soon Ong
类目: Machine Learning (stat.ML); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Social networks have a small number of large hubs, and a large number of small dense communities. We propose a generative model that captures both hub and dense structures. Based on recent results about graphons on line graphs, our model is a graphon mixture, enabling us to generate sequences of graphs where each graph is a combination of sparse and dense graphs. We propose a new condition on sparse graphs (the max-degree), which enables us to identify hubs. We show theoretically that we can estimate the normalized degree of the hubs, as well as estimate the graphon corresponding to sparse components of graph mixtures. We illustrate our approach on synthetic data, citation graphs, and social networks, showing the benefits of explicitly modeling sparse graphs.

[LG-117] Backward Conformal Prediction

链接: https://arxiv.org/abs/2505.13732
作者: Etienne Gauthier,Francis Bach,Michael I. Jordan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Code available at: this https URL

点击查看摘要

Abstract:We introduce *Backward Conformal Prediction*, a method that guarantees conformal coverage while providing flexible control over the size of prediction sets. Unlike standard conformal prediction, which fixes the coverage level and allows the conformal set size to vary, our approach defines a rule that constrains how prediction set sizes behave based on the observed data, and adapts the coverage level accordingly. Our method builds on two key foundations: (i) recent results by Gauthier et al. [2025] on post-hoc validity using e-values, which ensure marginal coverage of the form \mathbb{P}(Y_{\mathrm{test}} \in \hat{C}_n^{\tilde{\alpha}}(X_{\mathrm{test}})) \ge 1 - \mathbb{E}[\tilde{\alpha}] up to a first-order Taylor approximation for any data-dependent miscoverage \tilde{\alpha}, and (ii) a novel leave-one-out estimator \hat{\alpha}^{\mathrm{LOO}} of the marginal miscoverage \mathbb{E}[\tilde{\alpha}] based on the calibration set, ensuring that the theoretical guarantees remain computable in practice. This approach is particularly useful in applications where large prediction sets are impractical, such as medical diagnosis. We provide theoretical results and empirical evidence supporting the validity of our method, demonstrating that it maintains computable coverage guarantees while ensuring interpretable, well-controlled prediction set sizes.
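
For context, the sketch below implements the standard split conformal recipe that Backward Conformal Prediction inverts (fixed miscoverage alpha, data-dependent set size). It is background, not the paper's algorithm; the model and data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(1200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.3, size=1200)
Xtr, ytr = X[:500], y[:500]            # fit the point predictor
Xcal, ycal = X[500:1000], y[500:1000]  # calibration split
Xte, yte = X[1000:], y[1000:]

model = LinearRegression().fit(Xtr, ytr)
alpha = 0.1
scores = np.abs(ycal - model.predict(Xcal))           # conformity scores
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))     # finite-sample quantile index
q = np.sort(scores)[k - 1]

lo, hi = model.predict(Xte) - q, model.predict(Xte) + q
print("coverage:", np.mean((yte >= lo) & (yte <= hi)))  # ~0.9 by construction
```

Note that coverage holds even though the linear model underfits the quadratic target: the intervals simply widen, which is exactly the set-size behavior Backward Conformal Prediction aims to control.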

[LG-118] Sobolev Gradient Ascent for Optimal Transport: Barycenter Optimization and Convergence Analysis

链接: https://arxiv.org/abs/2505.13660
作者: Kaheon Kim,Bohan Zhou,Changbo Zhu,Xiaohui Chen
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper introduces a new constraint-free concave dual formulation for the Wasserstein barycenter. Tailoring the vanilla dual gradient ascent algorithm to the Sobolev geometry, we derive a scalable Sobolev gradient ascent (SGA) algorithm to compute the barycenter for input distributions supported on a regular grid. Despite the algorithmic simplicity, we provide a global convergence analysis that achieves the same rate as the classical subgradient descent methods for minimizing nonsmooth convex functions in the Euclidean space. A central feature of our SGA algorithm is that the computationally expensive c-concavity projection operator enforced on the Kantorovich dual potentials is unnecessary to guarantee convergence, leading to significant algorithmic and theoretical simplifications over all existing primal and dual methods for computing the exact barycenter. Our numerical experiments demonstrate the superior empirical performance of SGA over the existing optimal transport barycenter solvers.

[LG-119] Scalable Bayesian Monte Carlo: fast uncertainty estimation beyond deep ensembles

链接: https://arxiv.org/abs/2505.13585
作者: Xinzhu Liang,Joseph M. Lukens,Sanjaya Lohani,Brian T. Kirby,Thomas A. Searles,Xin Qiu,Kody J. H. Law
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 56 pages, 44 figures, 35 tables

点击查看摘要

Abstract:This work introduces a new method called scalable Bayesian Monte Carlo (SBMC). The model interpolates between a point estimator and the posterior, and the algorithm is a parallel implementation of a consistent (asymptotically unbiased) Bayesian deep learning algorithm: sequential Monte Carlo (SMC) or Markov chain Monte Carlo (MCMC). The method is motivated theoretically, and its utility is demonstrated on practical examples: MNIST, CIFAR, IMDb. A systematic numerical study reveals that parallel implementations of SMC and MCMC are comparable to serial implementations in terms of performance and total cost, and they achieve accuracy at or beyond the state-of-the-art (SOTA) methods like deep ensembles at convergence, along with substantially improved uncertainty quantification (UQ)–in particular, epistemic UQ. But even parallel implementations are expensive, with an irreducible time barrier much larger than the cost of the MAP estimator. Compressing time further leads to rapid degradation of accuracy, whereas UQ remains valuable. By anchoring to a point estimator we can recover accuracy, while retaining valuable UQ, ultimately delivering strong performance across metrics for a cost comparable to the SOTA.

[LG-120] Autonomous nanoparticle synthesis by design

链接: https://arxiv.org/abs/2505.13571
作者: Andy S. Anker,Jonas H. Jensen,Miguel Gonzalez-Duque,Rodrigo Moreno,Aleksandra Smolska,Mikkel Juelsholt,Vincent Hardion,Mads R. V. Jorgensen,Andres Faina,Jonathan Quinson,Kasper Stoy,Tejs Vegge
类目: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Controlled synthesis of materials with specified atomic structures underpins technological advances yet remains reliant on iterative, trial-and-error approaches. Nanoparticles (NPs), whose atomic arrangement dictates their emergent properties, are particularly challenging to synthesise due to numerous tunable parameters. Here, we introduce an autonomous approach explicitly targeting synthesis of atomic-scale structures. Our method autonomously designs synthesis protocols by matching real time experimental total scattering (TS) and pair distribution function (PDF) data to simulated target patterns, without requiring prior synthesis knowledge. We demonstrate this capability at a synchrotron, successfully synthesising two structurally distinct gold NPs: 5 nm decahedral and 10 nm face-centred cubic structures. Ultimately, specifying a simulated target scattering pattern, thus representing a bespoke atomic structure, and obtaining both the synthesised material and its reproducible synthesis protocol on demand may revolutionise materials design. Thus, ScatterLab provides a generalisable blueprint for autonomous, atomic structure-targeted synthesis across diverse systems and applications.

[LG-121] CATS: Clustering-Aggregated and Time Series for Business Customer Purchase Intention Prediction

链接: https://arxiv.org/abs/2505.13558
作者: Yingjie Kuang,Tianchen Zhang,Zhen-Wei Huang,Zhongjie Zeng,Zhe-Yuan Li,Ling Huang,Yuefang Gao
类目: Econometrics (econ.EM); Machine Learning (cs.LG); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:Accurately predicting customers' purchase intentions is critical to the success of a business strategy. Current research mainly focuses on analyzing the specific types of products that customers are likely to purchase in the future, while little attention has been paid to the critical factor of whether customers will engage in repurchase behavior at all. Predicting whether a customer will make a next purchase is a classic time series forecasting task. However, in real-world purchasing behavior, customer groups typically exhibit imbalance - i.e., there are a large number of occasional buyers and a small number of loyal customers. This head-to-tail distribution means traditional time series forecasting methods face certain limitations when dealing with such problems. To address the above challenges, this paper proposes a unified Clustering and Attention mechanism GRU model (CAGRU) that leverages multi-modal data for customer purchase intention prediction. The framework first performs customer profiling with respect to the customer characteristics and clusters the customers to delineate the different customer clusters that contain similar features. Then, the time series features of different customer clusters are extracted by a GRU neural network, and an attention mechanism is introduced to capture the significance of sequence locations. Furthermore, to mitigate the head-to-tail distribution of customer segments, we train the model separately for each customer segment, allowing it to capture more accurately both the differences in behavioral characteristics between customer segments and the similar characteristics of customers within the same segment. We constructed four datasets and conducted extensive experiments to demonstrate the superiority of the proposed CAGRU approach.
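
A schematic of the sequence-modeling core the abstract describes: a GRU over a customer's event sequence, attention pooling over time steps, and a sigmoid repurchase head. Dimensions and names are invented for illustration; in the paper, one such model would be trained per customer cluster on cluster-specific features.

```python
import torch
import torch.nn as nn

class GRUAttentionClassifier(nn.Module):
    """GRU over a customer's event sequence + attention pooling, as the
    abstract describes; all dimensions are illustrative."""
    def __init__(self, n_features=16, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.att = nn.Linear(hidden, 1)          # scores each time step
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (batch, time, features)
        h, _ = self.gru(x)                       # (batch, time, hidden)
        w = torch.softmax(self.att(h).squeeze(-1), dim=1)  # attention weights
        ctx = (w.unsqueeze(-1) * h).sum(dim=1)   # weighted sum of hidden states
        return torch.sigmoid(self.head(ctx)).squeeze(-1)   # repurchase probability

model = GRUAttentionClassifier()
print(model(torch.randn(8, 30, 16)).shape)       # torch.Size([8])
```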

[LG-122] SPIRIT: Patching Speech Language Models against Jailbreak Attacks

链接: https://arxiv.org/abs/2505.13541
作者: Amirbek Djanibekov,Nurdaulet Mukhituly,Kentaro Inui,Hanan Aldarmaki,Nils Lukas
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Speech Language Models (SLMs) enable natural interactions via spoken instructions, which more effectively capture user intent by detecting nuances in speech. The richer speech signal introduces new security risks compared to text-based models, as adversaries can better bypass safety mechanisms by injecting imperceptible noise to speech. We analyze adversarial attacks and find that SLMs are substantially more vulnerable to jailbreak attacks, which can achieve a perfect 100% attack success rate in some instances. To improve security, we propose post-hoc patching defenses used to intervene during inference by modifying the SLM’s activations that improve robustness up to 99% with (i) negligible impact on utility and (ii) without any re-training. We conduct ablation studies to maximize the efficacy of our defenses and improve the utility/security trade-off, validated with large-scale benchmarks unique to SLMs.

[LG-123] Polymer Data Challenges in the AI Era: Bridging Gaps for Next-Generation Energy Materials

链接: https://arxiv.org/abs/2505.13494
作者: Ying Zhao,Guanhua Chen,Jie Liu
类目: Soft Condensed Matter (cond-mat.soft); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 45 pages, 0 figures

点击查看摘要

Abstract:The pursuit of advanced polymers for energy technologies, spanning photovoltaics, solid-state batteries, and hydrogen storage, is hindered by fragmented data ecosystems that fail to capture the hierarchical complexity of these materials. Polymer science lacks interoperable databases, forcing reliance on disconnected literature and legacy records riddled with unstructured formats and irreproducible testing protocols. This fragmentation stifles machine learning (ML) applications and delays the discovery of materials critical for global decarbonization. Three systemic barriers compound the challenge. First, academic-industrial data silos restrict access to proprietary industrial datasets, while academic publications often omit critical synthesis details. Second, inconsistent testing methods undermine cross-study comparability. Third, incomplete metadata in existing databases limits their utility for training reliable ML models. Emerging solutions address these gaps through technological and collaborative innovation. Natural language processing (NLP) tools extract structured polymer data from decades of literature, while high-throughput robotic platforms generate self-consistent datasets via autonomous experimentation. Central to these advances is the adoption of FAIR (Findable, Accessible, Interoperable, Reusable) principles, adapted to polymer-specific ontologies, ensuring machine-readability and reproducibility. Future breakthroughs hinge on cultural shifts toward open science, accelerated by decentralized data markets and autonomous laboratories that merge robotic experimentation with real-time ML validation. By addressing data fragmentation through technological innovation, collaborative governance, and ethical stewardship, the polymer community can transform bottlenecks into accelerants.

信息检索

[IR-0] R2MED: A Benchmark for Reasoning-Driven Medical Retrieval

链接: https://arxiv.org/abs/2505.14558
作者: Lei Li,Xiao Zhou,Zheng Liu
类目: Information Retrieval (cs.IR)
*备注: 38 pages, 16 figures

点击查看摘要

Abstract:Current medical retrieval benchmarks primarily emphasize lexical or shallow semantic similarity, overlooking the reasoning-intensive demands that are central to clinical decision-making. In practice, physicians often retrieve authoritative medical evidence to support diagnostic hypotheses. Such evidence typically aligns with an inferred diagnosis rather than the surface form of a patient’s symptoms, leading to low lexical or semantic overlap between queries and relevant documents. To address this gap, we introduce R2MED, the first benchmark explicitly designed for reasoning-driven medical retrieval. It comprises 876 queries spanning three tasks: QA reference retrieval, clinical evidence retrieval, and clinical case retrieval. These tasks are drawn from five representative medical scenarios and twelve body systems, capturing the complexity and diversity of real-world medical information needs. We evaluate 15 widely-used retrieval systems on R2MED and find that even the best model achieves only 31.4 nDCG@10, demonstrating the benchmark’s difficulty. Classical re-ranking and generation-augmented retrieval methods offer only modest improvements. Although large reasoning models improve performance via intermediate inference generation, the best results still peak at 41.4 nDCG@10. These findings underscore a substantial gap between current retrieval techniques and the reasoning demands of real clinical tasks. We release R2MED as a challenging benchmark to foster the development of next-generation medical retrieval systems with enhanced reasoning capabilities. Data and code are available at this https URL
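
Since results are reported in nDCG@10, a small reference implementation of that metric may be useful. This uses the common linear-gain formulation; whether R2MED's official evaluation uses linear or exponential gains is not stated in the abstract.

```python
import numpy as np

def ndcg_at_k(ranked_rels, all_rels, k=10):
    """nDCG@k for one query: ranked_rels are the relevance grades of the
    retrieved list in rank order; all_rels are the grades of every relevant
    document (used to build the ideal ranking)."""
    gains = np.asarray(ranked_rels, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = (gains * discounts).sum()
    ideal = np.sort(np.asarray(all_rels, dtype=float))[::-1][:k]
    idcg = (ideal * (1.0 / np.log2(np.arange(2, len(ideal) + 2)))).sum()
    return dcg / idcg if idcg > 0 else 0.0

# A system ranks one relevant doc at position 1 and another at position 4.
print(round(ndcg_at_k([1, 0, 0, 1, 0], [1, 1], k=10), 3))   # ~0.877
```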

[IR-1] The Limits of Graph Samplers for Training Inductive Recommender Systems: Extended results

Link: https://arxiv.org/abs/2505.14241
Authors: Theis E. Jendal, Matteo Lissandrini, Peter Dolog, Katja Hose
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Inductive recommender systems can recommend to new users and for new items, avoiding the need to retrain after new data reaches the system. However, these methods are still trained on all the available data, requiring multiple days to train a single model, not counting hyperparameter tuning. In this work we focus on graph-based recommender systems, i.e., systems that model the data as a heterogeneous network. In other applications, graph sampling allows one to study a subgraph and generalize the findings to the original graph, so we investigate the applicability of sampling techniques to this task. We test on three real-world datasets, with three state-of-the-art inductive methods, using six different sampling methods. We find that it is possible to maintain performance using only 50% of the training data, with up to an 86% decrease in training time; however, using less training data leads to far worse performance. Further, we find that when it comes to data for recommendations, graph sampling should also account for the temporal dimension. Therefore, if higher data reduction is needed, new graph-based sampling techniques should be studied and new inductive methods should be designed.
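
The finding that sampling should respect the temporal dimension can be made concrete with a toy sampler. The sketch below is our own illustration, not the authors' method: per user, it drops the oldest share of (user, item, timestamp) interactions, so the retained subgraph favors recent behaviour instead of sampling edges uniformly.

```python
from collections import defaultdict

def temporal_user_sample(interactions, keep_ratio=0.5):
    """Per user, keep the most recent `keep_ratio` fraction of
    (user, item, timestamp) interactions."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((user, item, ts))
    kept = []
    for events in by_user.values():
        events.sort(key=lambda e: e[2])            # oldest -> newest
        cut = int(len(events) * (1 - keep_ratio))  # drop the oldest share
        kept.extend(events[cut:])
    return kept
```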

[IR-2] Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning

Link: https://arxiv.org/abs/2505.14069
Authors: Wenlin Zhang, Xiangyang Li, Kuicai Dong, Yichao Wang, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Derong Xu, Zhaocheng Du, Huifeng Guo, Ruiming Tang, Xiangyu Zhao
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) enhances the text generation capabilities of large language models (LLMs) by integrating external knowledge and up-to-date information. However, traditional RAG systems are limited by static workflows and lack the adaptability required for multistep reasoning and complex task management. To address these limitations, agentic RAG systems (e.g., DeepResearch) have been proposed, enabling dynamic retrieval strategies, iterative context refinement, and adaptive workflows for handling complex search queries beyond the capabilities of conventional RAG. Recent advances, such as Search-R1, have demonstrated promising gains using outcome-based reinforcement learning, where the correctness of the final answer serves as the reward signal. Nevertheless, such outcome-supervised agentic RAG methods face challenges including low exploration efficiency, gradient conflict, and sparse reward signals. To overcome these challenges, we propose to utilize fine-grained, process-level rewards to improve training stability, reduce computational costs, and enhance efficiency. Specifically, we introduce a novel method, ReasonRAG, that automatically constructs RAG-ProGuide, a high-quality dataset providing process-level rewards for (i) query generation, (ii) evidence extraction, and (iii) answer generation, thereby enhancing the model's inherent capabilities via process-supervised reinforcement learning. With process-level policy optimization, the proposed framework empowers LLMs to autonomously invoke search, generate queries, extract relevant evidence, and produce final answers. Compared to existing approaches such as Search-R1 and traditional RAG systems, ReasonRAG, leveraging RAG-ProGuide, achieves superior performance on five benchmark datasets using only 5k training instances, significantly fewer than the 90k required by Search-R1.
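
The contrast between outcome and process supervision is easy to state in code. The sketch below is a hedged illustration of the two reward shapes, not ReasonRAG's implementation; `judge` is a hypothetical scorer (an LLM grader or heuristic) returning a value in [0, 1] for each intermediate artifact.

```python
def outcome_reward(trajectory, gold_answer):
    """Outcome supervision: one sparse reward on the final answer only."""
    return 1.0 if trajectory["answer"] == gold_answer else 0.0

def process_rewards(trajectory, judge):
    """Process supervision: one reward per intermediate step, in the
    spirit of ReasonRAG's query / evidence / answer stages."""
    return {
        "query_generation":    judge(trajectory["query"]),
        "evidence_extraction": judge(trajectory["evidence"]),
        "answer_generation":   judge(trajectory["answer"]),
    }

# Example with a trivial stand-in judge:
traj = {"query": "q", "evidence": "e", "answer": "a"}
print(outcome_reward(traj, "a"))
print(process_rewards(traj, judge=lambda x: 0.5))
```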

[IR-3] DIFF: Dual Side-Information Filtering and Fusion for Sequential Recommendation SIGIR2025

Link: https://arxiv.org/abs/2505.13974
Authors: Hye-young Kim, Minjin Choi, Sunkyung Lee, Ilwoong Baek, Jongwuk Lee
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by SIGIR 2025. 10 pages

Click to view abstract

Abstract:Side-information Integrated Sequential Recommendation (SISR) benefits from auxiliary item information to infer hidden user preferences, which is particularly effective for sparse interactions and cold-start scenarios. However, existing studies face two main challenges. (i) They fail to remove noisy signals in item sequence and (ii) they underutilize the potential of side-information integration. To tackle these issues, we propose a novel SISR model, Dual Side-Information Filtering and Fusion (DIFF), which employs frequency-based noise filtering and dual multi-sequence fusion. Specifically, we convert the item sequence to the frequency domain to filter out noisy short-term fluctuations in user interests. We then combine early and intermediate fusion to capture diverse relationships across item IDs and attributes. Thanks to our innovative filtering and fusion strategy, DIFF is more robust in learning subtle and complex item correlations in the sequence. DIFF outperforms state-of-the-art SISR models, achieving improvements of up to 14.1% and 12.5% in Recall@20 and NDCG@20 across four benchmark datasets.
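
The frequency-based noise-filtering idea can be illustrated with a plain FFT low-pass over an item-embedding sequence. This is a generic sketch under our own assumptions (DIFF's actual filter design and the `keep_ratio` hyperparameter are not from the paper): short-term fluctuations live in high-frequency components, so zeroing them out smooths the user-interest signal.

```python
import numpy as np

def frequency_filter(seq_embeddings, keep_ratio=0.3):
    """Low-pass filter along the time axis of a (seq_len, dim) sequence:
    FFT, zero out high-frequency components, inverse FFT."""
    spectrum = np.fft.rfft(seq_embeddings, axis=0)
    cutoff = max(1, int(spectrum.shape[0] * keep_ratio))
    spectrum[cutoff:] = 0  # drop short-term fluctuations
    return np.fft.irfft(spectrum, n=seq_embeddings.shape[0], axis=0)

# Example: a user's sequence of 20 interacted items, 64-dim embeddings.
seq = np.random.randn(20, 64)
smoothed = frequency_filter(seq)
```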

[IR-4] Benchmarking the Myopic Trap: Positional Bias in Information Retrieval

Link: https://arxiv.org/abs/2505.13950
Authors: Ziyang Zeng, Dun Zhang, Jiacheng Li, Panxiang Zou, Yuqing Yang
Subjects: Information Retrieval (cs.IR)
Comments: 10 pages, 3 figures, 4 tables. Under review

Click to view abstract

Abstract:This study investigates a specific form of positional bias, termed the Myopic Trap, in which retrieval models disproportionately attend to the early parts of documents while overlooking relevant information that appears later. To systematically quantify this phenomenon, we propose a semantics-preserving evaluation framework that repurposes existing NLP datasets into position-aware retrieval benchmarks. By evaluating SOTA models across the full retrieval pipeline, including BM25, embedding models, ColBERT-style late-interaction models, and reranker models, we offer a broader empirical perspective on positional bias than prior work. Experimental results show that embedding models and ColBERT-style models exhibit significant performance degradation when query-related content is shifted toward later positions, indicating a pronounced head bias. Notably, under the same training configuration, the ColBERT-style approach shows greater potential for mitigating positional bias than the traditional single-vector approach. In contrast, BM25 and reranker models remain largely unaffected by such perturbations, underscoring their robustness to positional bias. Code and data are publicly available at: this http URL.
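
The benchmark construction can be mimicked with a simple position-shift harness: build semantics-preserving variants of a document that move the query-relevant sentence from head to tail, then compare a retriever's scores. The sketch below is illustrative only; `score_fn(query, doc)` is an assumed interface standing in for any of the evaluated models.

```python
def build_position_variants(relevant, fillers, positions=(0, 0.5, 1.0)):
    """Create document variants that place the query-relevant sentence
    at the start, middle, and end of the same filler text."""
    variants = {}
    for p in positions:
        idx = int(p * len(fillers))
        variants[p] = " ".join(fillers[:idx] + [relevant] + fillers[idx:])
    return variants

def head_bias_gap(score_fn, query, relevant, fillers):
    """Head-bias signal: score with the relevant sentence at the start
    minus score with it at the end. Positive gap suggests a Myopic Trap."""
    v = build_position_variants(relevant, fillers)
    return score_fn(query, v[0]) - score_fn(query, v[1.0])
```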

[IR-5] VulCPE: Context-Aware Cybersecurity Vulnerability Retrieval and Management

Link: https://arxiv.org/abs/2505.13895
Authors: Yuning Jiang, Feiyang Shang, Freedy Tan Wei You, Huilin Wang, Chia Ren Cong, Qiaoran Meng, Nay Oo, Hoon Wei Lim, Biplab Sikdar
Subjects: Cryptography and Security (cs.CR); Databases (cs.DB); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:The dynamic landscape of cybersecurity demands precise and scalable solutions for vulnerability management in heterogeneous systems, where configuration-specific vulnerabilities are often misidentified due to inconsistent data in databases like the National Vulnerability Database (NVD). Inaccurate Common Platform Enumeration (CPE) data in the NVD further leads to false positives and incomplete vulnerability retrieval. Informed by our systematic analysis of CPE and CVEdetails data, which revealed more than 50% vendor-name inconsistencies, we propose VulCPE, a framework that standardizes data and models configuration dependencies using a unified CPE schema (uCPE), entity recognition, relation extraction, and graph-based modeling. VulCPE achieves superior retrieval precision (0.766) and coverage (0.926) over existing tools, ensuring precise, context-aware vulnerability management and enhancing cyber resilience.
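
The vendor-name inconsistency problem is concrete: the same vendor appears under several spellings across CPE records. As a toy illustration (not VulCPE's uCPE schema or its entity-recognition pipeline), the sketch below parses a CPE 2.3 string and collapses vendor spellings with a few hand-written heuristics.

```python
import re

def normalize_vendor(raw):
    """Toy normalization: lowercase, strip punctuation and common legal
    suffixes so 'Apache Software Foundation', 'apache' and 'Apache, Inc.'
    collapse to one key. Heuristics are illustrative only."""
    s = re.sub(r"[^a-z0-9 ]", " ", raw.lower())
    s = re.sub(r"\b(inc|llc|ltd|corp|corporation|software|foundation|project)\b", " ", s)
    return "_".join(s.split())

def parse_cpe(cpe):
    """Split a 'cpe:2.3:a:vendor:product:version:...' string."""
    parts = cpe.split(":")
    return {"vendor": normalize_vendor(parts[3]),
            "product": parts[4], "version": parts[5]}

print(parse_cpe("cpe:2.3:a:Apache_Software_Foundation:http_server:2.4.57"))
# -> {'vendor': 'apache', 'product': 'http_server', 'version': '2.4.57'}
```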

Attachment Download

Click to download today's full paper list