This blog post presents the latest paper list retrieved from Arxiv.org on 2025-06-02. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.
Note: Paper data is retrieved from Arxiv.org daily and updated automatically around 12:00 each morning.
Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Table of Contents
Overview (2025-06-02)
A total of 708 papers were updated today, including:
- Natural Language Processing: 208 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 224 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 154 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 241 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
【Quick Read】: This paper targets the inability of Multimodal Large Language Models (MLLMs) to handle interactive, multi-step reasoning tasks such as CAPTCHAs, which has become a key bottleneck for deploying web agents on end-to-end automation tasks. The key to the solution is Open CaptchaWorld, the first web-based benchmark and platform designed specifically to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. The platform covers 20 modern CAPTCHA types with 225 CAPTCHAs in total, and introduces a new metric, CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle.
Link: https://arxiv.org/abs/2505.24878
Authors: Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen
Affiliations: VILA Lab; MBZUAI; MetaAgentX
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code at: this https URL
Abstract:CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that while humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with success rates of at most 40.0% (by Browser-Use Openai-o3), far below the human-level performance of 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and Data are available at this https URL.
[NLP-1] Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
【Quick Read】: This paper addresses the shortcomings of existing benchmarks for evaluating the multi-step and deep reasoning capabilities of vision-centric agents: fully synthetic single-turn queries, limited visual modalities, and the lack of a framework for assessing the quality of multi-step reasoning in real-world scenarios. The key to the solution is Agent-X, a large-scale benchmark for evaluating the multi-step and deep reasoning capabilities of vision-centric agents in real-world multimodal settings. It comprises 828 agentic tasks with authentic visual contexts and proposes a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step as well as the effectiveness of tool use.
Link: https://arxiv.org/abs/2505.24876
Authors: Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan
Affiliations: Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE; University of Central Florida, USA; University of Oxford, UK
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries, limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available at this https URL
[NLP-2] ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL
【Quick Read】: This paper tackles the underexplored problem of integrating chain-of-thought reasoning and reinforcement learning (RL) into generative vision models. The key to the solution is ReasonGen-R1, a two-stage framework: it first endows an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset, and then refines its outputs with the Group Relative Policy Optimization (GRPO) algorithm. The method uses reward signals from a pretrained vision-language model to assess overall visual quality and optimize the policy.
Link: https://arxiv.org/abs/2505.24875
Authors: Yu Zhang, Yunqi Li, Yifan Yang, Rui Wang, Yuqing Yang, Dai Qi, Jianmin Bao, Dongdong Chen, Chong Luo, Lili Qiu
Affiliations: ShanghaiTech University; Microsoft Corporation; Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based “thinking” skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization. To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: this http URL.
[NLP-3] ProxyThinker: Test-Time Guidance through Small Visual Reasoners
【Quick Read】: This paper addresses the prohibitive computational cost of training large vision-language models (LVLMs) with reinforcement fine-tuning (RFT), which limits model scaling. The key to the solution is ProxyThinker, an inference-time technique that modifies the decoding dynamics by subtracting the output distribution of a base model from that of an RFT reasoner, allowing large models to inherit the visual reasoning capabilities of small, slow-thinking visual reasoners without any additional training.
Link: https://arxiv.org/abs/2505.24872
Authors: Zilin Xiao, Jaywon Koo, Siru Ouyang, Jefferson Hernandez, Yu Meng, Vicente Ordonez
Affiliations: Rice University; UIUC; University of Virginia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities in large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities from small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits slow-thinking reasoning, as demonstrated by emergent sophisticated behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks on spatial, mathematical, and multi-disciplinary reasoning, enabling untuned base models to compete with the performance of their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38x faster inference compared to previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at this https URL.
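The decoding-time arithmetic described above is compact enough to sketch. Below is a minimal, hypothetical PyTorch sketch in which the large base model's next-token logits are shifted by the difference between a small RFT reasoner and its small base counterpart; the guidance scale `alpha` and the exact combination rule are assumptions, not the paper's verified implementation.

```python
import torch

@torch.no_grad()
def proxythinker_logits(large_base, small_rft, small_base, input_ids, alpha=1.0):
    # Next-token logits from each model (HuggingFace-style causal LMs assumed).
    l_large = large_base(input_ids).logits[:, -1, :]
    l_rft = small_rft(input_ids).logits[:, -1, :]
    l_small = small_base(input_ids).logits[:, -1, :]
    # Add the small RFT reasoner's "reasoning delta" on top of the large model.
    return l_large + alpha * (l_rft - l_small)

@torch.no_grad()
def greedy_decode(large_base, small_rft, small_base, input_ids, max_new_tokens=64):
    for _ in range(max_new_tokens):
        logits = proxythinker_logits(large_base, small_rft, small_base, input_ids)
        next_token = logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```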
[NLP-4] MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning
【Quick Read】: This paper addresses the task diversity and conflicting objectives that arise when applying Reinforcement Learning with Verifiable Rewards (RLVR) to the post-training of Multimodal Large Language Models (MLLMs). The key to the solution is a systematic post-training framework that improves generalization and reasoning through a carefully designed dataset-mixture strategy and multi-domain online reinforcement learning. Concretely, the framework curates a dataset covering diverse verifiable vision-language problems and introduces a data-mixture strategy that predicts, and then optimizes, the RL fine-tuning outcome, improving performance on out-of-distribution benchmarks.
Link: https://arxiv.org/abs/2505.24871
Authors: Yiqing Liang, Jielin Qiu, Wenhao Ding, Zuxin Liu, James Tompkin, Mengdi Xu, Mengzhou Xia, Zhengzhong Tu, Laixi Shi, Jiacheng Zhu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Webpage: this https URL
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for post-training large language models (LLMs), achieving state-of-the-art performance on tasks with structured, verifiable answers. Applying RLVR to Multimodal LLMs (MLLMs) presents significant opportunities but is complicated by the broader, heterogeneous nature of vision-language tasks that demand nuanced visual, logical, and spatial capabilities. As such, training MLLMs using RLVR on multiple datasets could be beneficial but creates challenges with conflicting objectives from interaction among diverse datasets, highlighting the need for optimal dataset mixture strategies to improve generalization and reasoning. We introduce a systematic post-training framework for Multimodal LLM RLVR, featuring a rigorous data mixture problem formulation and benchmark implementation. Specifically, (1) We developed a multimodal RLVR framework for multi-dataset post-training by curating a dataset that contains different verifiable vision-language problems and enabling multi-domain online RL learning with different verifiable rewards; (2) We proposed a data mixture strategy that learns to predict the RL fine-tuning outcome from the data mixture distribution, and consequently optimizes the best mixture. Comprehensive experiments showcase that multi-domain RLVR training, when combined with mixture prediction strategies, can significantly boost MLLM general reasoning capacities. Our best mixture improves the post-trained model’s accuracy on out-of-distribution benchmarks by an average of 5.24% compared to the same model post-trained with uniform data mixture, and by a total of 20.74% compared to the pre-finetuning baseline.
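The "predict the RL outcome from the mixture, then optimize the mixture" idea can be made concrete with a simple surrogate. The sketch below fits a linear predictor on hypothetical pilot runs and maximizes it over the probability simplex; the data, the linear form, and all numbers are illustrative assumptions, not the paper's actual predictor.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Hypothetical pilot runs: 30 mixtures over 5 datasets and their benchmark scores.
X = rng.dirichlet(np.ones(5), size=30)
y = X @ np.array([0.2, 0.5, 0.1, 0.4, 0.3]) + 0.01 * rng.standard_normal(30)

# Fit a linear surrogate: predicted_score(w) ~= w . beta.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Maximize the predicted score over the probability simplex.
res = minimize(lambda w: -(w @ beta), x0=np.ones(5) / 5,
               bounds=[(0, 1)] * 5,
               constraints=({"type": "eq", "fun": lambda w: w.sum() - 1},))
print("optimized mixture:", res.x.round(3))
```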
[NLP-5] ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
【Quick Read】: This paper asks whether reinforcement learning (RL) genuinely expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably improves reasoning performance. The key to the solution is ProRL, a new training methodology that combines KL divergence control, reference-policy resetting, and a diverse suite of tasks; under prolonged RL training it uncovers novel reasoning strategies that are inaccessible to base models.
Link: https://arxiv.org/abs/2505.24864
Authors: Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong
Affiliations: NVIDIA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages, 17 figures
Abstract:Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model’s reasoning capabilities or merely amplifies high-reward outputs already latent in the base model’s distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with task competence of the base model and training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: this https URL
[NLP-6] AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
【Quick Read】: This paper addresses how to dynamically modulate the reasoning process of large reasoning models (LRMs) at test time to improve reasoning efficiency and answer accuracy. The key to the solution is the "α moment": a universal parameter α characterizes the scaled thinking phase, within which slow-thinking transitions are dynamically scheduled by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process; after the α moment, slow thinking is deterministically terminated with the end-of-thinking token, fostering fast reasoning and efficient answer generation. The method unifies and generalizes existing monotonic scaling approaches, enabling flexible and dense slow-to-fast reasoning modulation.
Link: https://arxiv.org/abs/2505.24863
Authors: Junyu Zhang, Runpei Dong, Han Wang, Xuying Ning, Haoran Geng, Peihao Li, Xialin He, Yutong Bai, Jitendra Malik, Saurabh Gupta, Huan Zhang
Affiliations: University of Illinois Urbana-Champaign; UC Berkeley
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This paper presents AlphaOne (α1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. α1 first introduces the α moment, which represents the scaled thinking phase with a universal parameter α. Within this scaled pre-α-moment phase, it dynamically schedules slow thinking transitions by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process. After the α moment, α1 deterministically terminates slow thinking with the end-of-thinking token, thereby fostering fast reasoning and efficient answer generation. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation. Extensive empirical studies on various challenging benchmarks across mathematical, coding, and scientific domains demonstrate α1’s superior reasoning capability and efficiency. Project page: this https URL
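As a reading aid, here is a toy sketch of the scheduling idea: before the α moment, a "wait"-style transition token is inserted with Bernoulli probability to encourage slow thinking; at the α moment, the end-of-thinking token is emitted deterministically. The token strings, the fixed per-step insertion probability, and the `generate_step` callback are assumptions for illustration only.

```python
import random

def alpha1_schedule(generate_step, alpha_moment, p_insert=0.1, max_tokens=256,
                    wait_token="Wait,", end_think_token="</think>"):
    tokens = []
    for t in range(max_tokens):
        if t < alpha_moment and random.random() < p_insert:
            tokens.append(wait_token)         # Bernoulli slow-thinking transition
        elif t == alpha_moment:
            tokens.append(end_think_token)    # deterministic switch to fast answering
        tokens.append(generate_step(tokens))  # model continues from current context
    return tokens
```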
[NLP-7] Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization
【Quick Read】: This paper examines the trade-off between text quality and the effectiveness of steering vectors for controlling text attributes (topical focus, sentiment, toxicity, and readability) in free-form generation tasks. The key to the solution is adaptively adjusting steering-vector strength and exploring its combination with prompting to strike the best balance between control strength and text quality.
Link: https://arxiv.org/abs/2505.24859
Authors: Joschka Braun, Carsten Eickhoff, Seyed Ali Bahrainian
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 29 pages, 21 figures, preprint
Abstract:Steering vectors are a lightweight method for controlling text properties by adding a learned bias to language model activations at inference time. So far, steering vectors have predominantly been evaluated in multiple-choice settings, while their effectiveness in free-form generation tasks remains understudied. Moving “Beyond Multiple Choice,” we thoroughly evaluate the effectiveness of steering vectors in adaptively controlling topical focus, sentiment, toxicity, and readability in abstractive summaries of the NEWTS dataset. We find that steering effectively controls the targeted summary properties, but high steering strengths consistently degrade both intrinsic and extrinsic text quality. Compared to steering, prompting offers weaker control, while preserving text quality. Combining steering and prompting yields the strongest control over text properties and offers the most favorable efficacy-quality trade-off at moderate steering strengths. Our results underscore the practical trade-off between control strength and text quality preservation when applying steering vectors to free-form generation tasks.
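Mechanically, applying a steering vector is just an additive bias on hidden activations at inference time. The PyTorch hook below is a minimal sketch under assumed module names (`model.model.layers[...]` as in common HuggingFace decoders); the layer index and strength are the knobs such studies sweep.

```python
import torch

def add_steering_hook(layer, steering_vector, strength=1.0):
    """Add `strength * steering_vector` to the layer's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * steering_vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Usage sketch (names assumed): steer layer 15 during generation, then clean up.
# handle = add_steering_hook(model.model.layers[15], topic_vector, strength=4.0)
# summary_ids = model.generate(**inputs)
# handle.remove()
```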
[NLP-8] MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs
【Quick Read】: This paper addresses the unreliable communication of uncertainty by large language models (LLMs): they often convey false claims in assertive language, fostering over-reliance and eroding trust. The key to the solution is MetaFaith, a prompt-based calibration method inspired by human metacognition that improves a model's ability to faithfully reflect its intrinsic uncertainty across task domains, substantially improving faithful calibration.
Link: https://arxiv.org/abs/2505.24858
Authors: Gabrielle Kaili-May Liu, Gal Yona, Avi Caciularu, Idan Szpektor, Tim G. J. Rudner, Arman Cohan
Affiliations: Yale University; Google Research; New York University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:A critical component in the trustworthiness of LLMs is reliable uncertainty communication, yet LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded trust. We present the first systematic study of faithful confidence calibration of LLMs, benchmarking models’ ability to use linguistic expressions of uncertainty that faithfully reflect their intrinsic uncertainty, across a comprehensive array of models, datasets, and prompting strategies. Our results demonstrate that LLMs largely fail at this task, and that existing interventions are insufficient: standard prompt approaches provide only marginal gains, and existing, factuality-based calibration techniques can even harm faithful calibration. To address this critical gap, we introduce MetaFaith, a novel prompt-based calibration approach inspired by human metacognition. We show that MetaFaith robustly improves faithful calibration across diverse models and task domains, enabling up to 61% improvement in faithfulness and achieving an 83% win rate over original generations as judged by humans.
[NLP-9] Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
【Quick Read】: This paper addresses the key question of how to effectively leverage both positive and negative distilled reasoning traces to maximize large language model (LLM) reasoning performance in an offline setting. The key to the solution is Reinforcement Distillation (REDI), a two-stage framework: Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT); Stage 2 further refines the model with the proposed REDI objective, a simple reference-free loss function that outperforms DPO and SimPO on mathematical reasoning tasks.
Link: https://arxiv.org/abs/2505.24850
Authors: Shuyao Xu, Cheng Peng, Jiangxuan Long, Weidi Xu, Wei Chu, Yuan Qi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 27 pages, 10 figures. Code available at this https URL
Abstract:Recent advances in model distillation demonstrate that data from advanced reasoning models (e.g., DeepSeek-R1, OpenAI’s o1) can effectively transfer complex reasoning abilities to smaller, efficient student models. However, standard practices employ rejection sampling, discarding incorrect reasoning examples, which are valuable yet often underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? To this end, we propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT). Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective. This novel objective is a simple, reference-free loss function that outperforms established methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate REDI’s superiority over baseline Rejection Sampling SFT or SFT combined with DPO/SimPO on mathematical reasoning tasks. Notably, the Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples from the open Open-R1 dataset, achieves an 83.1% score on MATH-500 (pass@1). Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a model post-trained on 800k proprietary data) across various mathematical reasoning benchmarks, establishing a new state-of-the-art for 1.5B models post-trained offline with openly available data.
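The abstract does not spell out the REDI objective beyond "simple and reference-free", so the Stage-2 sketch below uses a generic weighted likelihood contrast over positive and negative traces purely as a placeholder; it is not the actual REDI loss.

```python
import torch.nn.functional as F

def stage2_placeholder_loss(logits_pos, labels_pos, logits_neg, labels_neg, lam=0.5):
    """Placeholder reference-free objective: raise the likelihood of positive
    traces, lower (down-weighted) that of negative ones. Not the REDI loss."""
    nll_pos = F.cross_entropy(logits_pos.view(-1, logits_pos.size(-1)),
                              labels_pos.view(-1))
    nll_neg = F.cross_entropy(logits_neg.view(-1, logits_neg.size(-1)),
                              labels_neg.view(-1))
    return nll_pos - lam * nll_neg  # lam < 1 keeps negatives from dominating
```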
[NLP-10] MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning
【Quick Read】: This paper addresses a limitation of reward modeling based on the Bradley-Terry (BT) model in reinforcement learning from human feedback (RLHF): assuming a global reward function fails to capture the inherently diverse and heterogeneous nature of human preferences, which limits large language model (LLM) personalization and pluralistic alignment. The key to the solution is MiCRo, a two-stage framework for personalized preference learning: the first stage uses context-aware mixture modeling to capture diverse human preferences, and the second stage integrates an online routing strategy that dynamically adapts mixture weights to the specific context to resolve ambiguity, enabling efficient and scalable preference adaptation with minimal additional supervision.
Link: https://arxiv.org/abs/2505.24846
Authors: Jingyan Shen, Jiarui Yao, Rui Yang, Yifan Sun, Feng Luo, Rui Pan, Tong Zhang, Han Zhao
Affiliations: University of Illinois at Urbana-Champaign; Columbia University; Rice University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Reward modeling is a key step in building safe foundation models when applying reinforcement learning from human feedback (RLHF) to align Large Language Models (LLMs). However, reward modeling based on the Bradley-Terry (BT) model assumes a global reward function, failing to capture the inherently diverse and heterogeneous human preferences. Hence, such oversimplification limits LLMs from supporting personalization and pluralistic alignment. Theoretically, we show that when human preferences follow a mixture distribution of diverse subgroups, a single BT model has an irreducible error. While existing solutions, such as multi-objective learning with fine-grained annotations, help address this issue, they are costly and constrained by predefined attributes, failing to fully capture the richness of human values. In this work, we introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets without requiring explicit fine-grained annotations. In the first stage, MiCRo introduces context-aware mixture modeling approach to capture diverse human preferences. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts mixture weights based on specific context to resolve ambiguity, allowing for efficient and scalable preference adaptation with minimal additional supervision. Experiments on multiple preference datasets demonstrate that MiCRo effectively captures diverse human preferences and significantly improves downstream personalization.
[NLP-11] Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning ICML2025
【Quick Read】: This paper addresses how training data mixtures affect the generalization performance of large language models, and in particular the limitations of existing domain reweighting methods, which rely on costly weight computations and require retraining whenever new data is introduced. The key to the solution is Chameleon, a flexible and efficient data-mixing framework that uses leverage scores over a learned embedding space to quantify domain importance: it derives a mixture that upweights domains sharing common representations in the embedding space and transfers directly to new data, without proxy retraining, by computing the new domain embeddings.
Link: https://arxiv.org/abs/2505.24844
Authors: Wanyun Xie, Francesco Tonin, Volkan Cevher
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: ICML 2025
Abstract:Training data mixtures greatly impact the generalization performance of large language models. Existing domain reweighting methods often rely on costly weight computations and require retraining when new data is introduced. To this end, we introduce a flexible and efficient data mixing framework, Chameleon, that employs leverage scores to quantify domain importance within a learned embedding space. We first construct a domain affinity matrix over domain embeddings. The induced leverage scores determine a mixture that upweights domains sharing common representations in embedding space. This formulation allows direct transfer to new data by computing the new domain embeddings. In experiments, we demonstrate improvements over three key scenarios: (i) our computed weights improve performance on pretraining domains with a fraction of the compute of existing methods; (ii) Chameleon can adapt to data changes without proxy retraining, boosting few-shot reasoning accuracies when transferred to new data; (iii) our method enables efficient domain reweighting in finetuning, consistently improving test perplexity on all finetuning domains over uniform mixture. Our code is available at this https URL.
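For intuition, leverage scores are the diagonal of the hat (projection) matrix of the domain-embedding matrix. The numpy sketch below computes them and normalizes them into a mixture; how Chameleon actually maps scores to mixture weights may differ, so treat the final normalization step as an assumption.

```python
import numpy as np

def domain_leverage_scores(E, reg=1e-6):
    """Leverage scores of domain embeddings E (n_domains x d): the diagonal
    of the regularized hat matrix E (E^T E + reg*I)^-1 E^T."""
    G = E.T @ E + reg * np.eye(E.shape[1])
    H = E @ np.linalg.solve(G, E.T)
    return np.clip(np.diag(H), 0.0, 1.0)

E = np.random.randn(32, 8)       # 32 domains, 8-dim learned embeddings
scores = domain_leverage_scores(E)
weights = scores / scores.sum()  # assumed normalization into a mixture
print(weights.round(3))
```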
[NLP-12] Vision LLMs Are Bad at Hierarchical Visual Understanding and LLMs Are the Bottleneck
【Quick Read】: This paper addresses the missing hierarchical knowledge of the visual world in today's state-of-the-art large language models (LLMs), which are unaware of even well-established biological taxonomies, making them the bottleneck for vision LLMs' hierarchical visual understanding (e.g., recognizing an anemone fish but not a vertebrate). The key to the approach is constructing about one million four-choice visual question answering (VQA) tasks from six taxonomies and four image datasets, and fine-tuning vision LLMs on these tasks to improve hierarchical consistency. The experiments show that such fine-tuning reaffirms the LLM bottleneck effect to some extent, suggesting that vision LLMs cannot achieve fully hierarchical understanding of visual concepts until the underlying LLMs possess the corresponding taxonomy knowledge.
Link: https://arxiv.org/abs/2505.24840
Authors: Yuwen Tan, Yuan Qing, Boqing Gong
Affiliations: Boston University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 28 pages, 13 figures
Abstract:This paper reveals that many state-of-the-art large language models (LLMs) lack hierarchical knowledge about our visual world, unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs’ hierarchical visual understanding (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM using our VQA tasks reaffirms LLMs’ bottleneck effect to some extent because the VQA tasks improve the LLM’s hierarchical consistency more than the vision LLM’s. We conjecture that one cannot make vision LLMs understand visual concepts fully hierarchical until LLMs possess corresponding taxonomy knowledge.
[NLP-13] Multilinguality Does not Make Sense: Investigating Factors Behind Zero-Shot Transfer in Sense-Aware Tasks
【Quick Read】: This paper questions whether multilingual training is necessary and beneficial for cross-lingual transfer, focusing on two underexplored, context-dependent sense-aware tasks: polysemy disambiguation and lexical semantic change. The key finding, from a large-scale analysis across 28 languages, is that multilingual training is not a prerequisite for effective transfer, and that its apparent advantages are better explained by confounding factors (such as fine-tuning data composition and evaluation artifacts) than by any inherent benefit of multilinguality.
Link: https://arxiv.org/abs/2505.24834
Authors: Roksana Goworek, Haim Dubossarsky
Affiliations: Queen Mary University of London; Language Technology Lab, University of Cambridge
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 8 figures
Abstract:Cross-lingual transfer allows models to perform tasks in languages unseen during training and is often assumed to benefit from increased multilinguality. In this work, we challenge this assumption in the context of two underexplored, sense-aware tasks: polysemy disambiguation and lexical semantic change. Through a large-scale analysis across 28 languages, we show that multilingual training is neither necessary nor inherently beneficial for effective transfer. Instead, we find that confounding factors - such as fine-tuning data composition and evaluation artifacts - better account for the perceived advantages of multilinguality. Our findings call for more rigorous evaluations in multilingual NLP. We release fine-tuned models and benchmarks to support further research, with implications extending to low-resource and typologically diverse languages.
[NLP-14] How much do language models memorize?
【Quick Read】: This paper tackles the difficulty of disentangling memorization from generalization in modern language models, in order to accurately estimate how much a model "knows" about a data point and to measure model capacity. The key to the solution is formally decomposing memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. By completely eliminating generalization, total memorization can be computed, yielding an estimate of model capacity; the experiments suggest that GPT-style models have a capacity of roughly 3.6 bits per parameter.
Link: https://arxiv.org/abs/2505.24832
Authors: John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexander M. Rush, Kamalika Chaudhuri, Saeed Mahloujifar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We propose a new method for estimating how much a model "knows" about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have struggled to disentangle memorization from generalization. We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point "grokking" begins, and unintended memorization decreases as models begin to generalize. We train hundreds of transformer language models ranging from 500K to 1.5B parameters and produce a series of scaling laws relating model capacity and data size to membership inference.
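The 3.6 bits-per-parameter estimate invites a quick back-of-envelope check. The sketch below simply converts the abstract's numbers into storage capacities for the model sizes studied; it is arithmetic for illustration, not code from the paper.

```python
BITS_PER_PARAM = 3.6  # capacity estimate reported for GPT-style models

for n_params in (500_000, 160_000_000, 1_500_000_000):
    bits = n_params * BITS_PER_PARAM
    print(f"{n_params:>13,} params -> ~{bits / 8 / 1e6:7.1f} MB memorized at capacity")
```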
[NLP-15] Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs
【Quick Read】: This paper addresses the hallucinations and inaccurate citations produced by large language models (LLMs) in medical long-form question answering, which limit clinical adoption and regulatory compliance. The key to the solution is an atomic fact-checking framework that decomposes LLM-generated answers into discrete, verifiable units (atomic facts) and checks each one against an authoritative knowledge base of medical guidelines, enabling targeted error correction and direct tracing to source literature, thereby improving the accuracy and explainability of answers.
Link: https://arxiv.org/abs/2505.24830
Authors: Juraj Vladika, Annika Domres, Mai Nguyen, Rebecca Moser, Jana Nano, Felix Busch, Lisa C. Adams, Keno K. Bressem, Denise Bernhardt, Stephanie E. Combs, Kai J. Borm, Florian Matthes, Jan C. Peeken
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures
Abstract:Large language models (LLMs) exhibit extensive medical knowledge but are prone to hallucinations and inaccurate citations, which pose a challenge to their clinical adoption and regulatory compliance. Current methods, such as Retrieval Augmented Generation, partially address these issues by grounding answers in source documents, but hallucinations and low fact-level explainability persist. In this work, we introduce a novel atomic fact-checking framework designed to enhance the reliability and explainability of LLMs used in medical long-form question answering. This method decomposes LLM-generated responses into discrete, verifiable units called atomic facts, each of which is independently verified against an authoritative knowledge base of medical guidelines. This approach enables targeted correction of errors and direct tracing to source literature, thereby improving the factual accuracy and explainability of medical QA. Extensive evaluation using multi-reader assessments by medical experts and an automated open QA benchmark demonstrated significant improvements in factual accuracy and explainability. Our framework achieved up to a 40% overall answer improvement and a 50% hallucination detection rate. The ability to trace each atomic fact back to the most relevant chunks from the database provides a granular, transparent explanation of the generated responses, addressing a major gap in current medical AI applications. This work represents a crucial step towards more trustworthy and reliable clinical applications of LLMs, addressing key prerequisites for clinical application and fostering greater confidence in AI-assisted healthcare.
[NLP-16] LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text
【Quick Read】: This paper addresses the fact that current benchmarks for large language models (LLMs) in legal applications focus mainly on factual accuracy while neglecting linguistic quality aspects such as clarity, coherence, and terminology. The key to the solution is a regression model that scores legal-text quality along clarity, coherence, and terminology, a specialized set of legal questions, and an analysis of 49 LLMs under this framework, revealing how parameter scale, engineering choices, and architecture type affect performance and concluding with cost-effective model recommendations.
Link: https://arxiv.org/abs/2505.24826
Authors: Li yunhan, Wu gengshen
Affiliations: City University of Macau; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 11 figures
Abstract:As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks tend to focus mainly on factual accuracy while largely neglecting important linguistic quality aspects such as clarity, coherence, and terminology. To address this gap, we propose three steps: First, we develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology. Second, we create a specialized set of legal questions. Third, we analyze 49 LLMs using this evaluation framework. Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement of 2.7% noted at 72 billion parameters. Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance thresholds above 0.016. Third, reasoning models consistently outperform base architectures. A significant outcome of our research is the release of a ranking list and Pareto analysis, which highlight the Qwen3 series as the optimal choice for cost-performance tradeoffs. This work not only establishes standardized evaluation protocols for legal LLMs but also uncovers fundamental limitations in current training data refinement approaches. Code and models are available at: this https URL.
[NLP-17] PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models
【Quick Read】: This paper addresses current large language models' (LLMs) lack of concise, principle-based reasoning in physics problem solving. The key is the PhySense benchmark, whose problems are designed to be easily solvable by experts using guiding principles yet deceptively difficult for LLMs, systematically exposing LLMs' failure to follow principled reasoning paths.
Link: https://arxiv.org/abs/2505.24823
Authors: Yinggan Xu, Yue Liu, Zhiqiang Gao, Changnan Peng, Di Luo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have rapidly advanced and are increasingly capable of tackling complex scientific problems, including those in physics. Despite this progress, current LLMs often fail to emulate the concise, principle-based reasoning characteristic of human experts, instead generating lengthy and opaque solutions. This discrepancy highlights a crucial gap in their ability to apply core physical principles for efficient and interpretable problem solving. To systematically investigate this limitation, we introduce PhySense, a novel principle-based physics reasoning benchmark designed to be easily solvable by experts using guiding principles, yet deceptively difficult for LLMs without principle-first reasoning. Our evaluation across multiple state-of-the-art LLMs and prompt types reveals a consistent failure to align with expert-like reasoning paths, providing insights for developing AI systems with efficient, robust and interpretable principle-based scientific reasoning.
[NLP-18] Guiding Generative Storytelling with Knowledge Graphs
【Quick Read】: This paper addresses two challenges large language models (LLMs) face in automated story generation: maintaining long-form coherence and giving users intuitive, effective control. The key to the solution is using knowledge graphs (KGs) to enhance LLM-based storytelling, improving narrative quality and enabling user-driven modifications. The authors propose a KG-assisted storytelling pipeline and validate it in a user study, finding that knowledge graphs significantly improve story quality for action-oriented and structured narratives, and that editing the knowledge graph increases users' sense of control, making storytelling more engaging, interactive, and playful.
Link: https://arxiv.org/abs/2505.24803
Authors: Zhijun Pan, Antonios Andronis, Eva Hayek, Oscar AP Wilkinson, Ilya Lasy, Annette Parry, Guy Gadney, Tim J. Smith, Mick Grierson
Affiliations: University of the Arts London; Charismatic.ai; Technische Universität Wien
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: This manuscript was submitted for peer review in January 2025
Abstract:Large Language Models (LLMs) have shown great potential in automated story generation, but challenges remain in maintaining long-form coherence and providing users with intuitive and effective control. Retrieval-Augmented Generation (RAG) has proven effective in reducing hallucinations in text generation; however, the use of structured data to support generative storytelling remains underexplored. This paper investigates how knowledge graphs (KGs) can enhance LLM-based storytelling by improving narrative quality and enabling user-driven modifications. We propose a KG-assisted storytelling pipeline and evaluate its effectiveness through a user study with 15 participants. Participants created their own story prompts, generated stories, and edited knowledge graphs to shape their narratives. Through quantitative and qualitative analysis, our findings demonstrate that knowledge graphs significantly enhance story quality in action-oriented and structured narratives within our system settings. Additionally, editing the knowledge graph increases users’ sense of control, making storytelling more engaging, interactive, and playful.
[NLP-19] Drop Dropout on Single-Epoch Language Model Pretraining ACL
【Quick Read】: This paper asks whether dropout should be used in single-epoch pretraining to improve language model (LM) performance and editability. For single-epoch pretraining of BERT (a masked LM) and Pythia 160M/1.4B (autoregressive LMs), the study finds that omitting dropout improves downstream performance on language modeling, morpho-syntax, question answering, and natural language inference, and that the recently introduced "early dropout" degrades performance further. Models trained without dropout are also more amenable to gradient-based model editing (MEND) and perform comparably under representation-based model editing (ReFT). The key takeaway is therefore to drop dropout during single-epoch pretraining to optimize both performance and editability.
Link: https://arxiv.org/abs/2505.24788
Authors: Houjun Liu, John Bauer, Christopher D. Manning
Affiliations: Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL Findings; 5 pages, 2 figures, 4 pages of appendix
Abstract:Originally, dropout was seen as a breakthrough regularization technique that reduced overfitting and improved performance in almost all applications of deep learning. Yet, single-epoch pretraining tasks common to modern LLMs yield minimal overfitting, leading to dropout not being used for large LLMs. Nevertheless, no thorough empirical investigation has been done on the role of dropout in LM pretraining. Through experiments in single-epoch pretraining of both masked (BERT) and autoregressive (Pythia 160M and 1.4B) LMs with varying levels of dropout, we find that downstream performance in language modeling, morpho-syntax (BLiMP), question answering (SQuAD), and natural-language inference (MNLI) improves when dropout is not applied during pretraining. We additionally find that the recently-introduced “early dropout” also degrades performance over applying no dropout at all. We further investigate the models’ editability, and find that models trained without dropout are more successful in gradient-based model editing (MEND) and equivalent in representation-based model editing (ReFT). Therefore, we advocate to drop dropout during single-epoch pretraining.
[NLP-20] Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation
【Quick Read】: This paper addresses the poor performance of text-to-image (T2I) generation models on complex instructions involving multiple objects, attributes, and spatial relationships; existing benchmarks focus on general text-image alignment and fail to capture the nuanced requirements of complex, multi-faceted prompts. The key to the solution is LongBench-T2I, a comprehensive benchmark of 500 carefully designed prompts spanning nine visual evaluation dimensions, together with Plan2Gen, an agent framework that requires no additional model training and uses large language models to interpret and decompose complex prompts, guiding generation more effectively. An evaluation toolkit is also introduced to automatically assess the quality of generated images with multi-dimensional metrics.
Link: https://arxiv.org/abs/2505.24787
Authors: Yucheng Zhou, Jiahao Yuan, Qianning Wang
Affiliations: University of Macau, China; East China Normal University, China; Auckland University of Technology, New Zealand
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Recent advancements in text-to-image (T2I) generation have enabled models to produce high-quality images from textual descriptions. However, these models often struggle with complex instructions involving multiple objects, attributes, and spatial relationships. Existing benchmarks for evaluating T2I models primarily focus on general text-image alignment and fail to capture the nuanced requirements of complex, multi-faceted prompts. Given this gap, we introduce LongBench-T2I, a comprehensive benchmark specifically designed to evaluate T2I models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions, enabling a thorough assessment of a model’s ability to follow complex instructions. Beyond benchmarking, we propose an agent framework (Plan2Gen) that facilitates complex instruction-driven image generation without requiring additional model training. This framework integrates seamlessly with existing T2I models, using large language models to interpret and decompose complex prompts, thereby guiding the generation process more effectively. As existing evaluation metrics, such as CLIPScore, fail to adequately capture the nuances of complex instructions, we introduce an evaluation toolkit that automates the quality assessment of generated images using a set of multi-dimensional metrics. The data and code are released at this https URL.
[NLP-21] Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty? ACL2025
【Quick Read】: This paper addresses how to accurately assess the confidence of large language models (LLMs) in high-stakes domains. Humans typically express confidence through epistemic markers (e.g., "fairly confident") rather than numerical values, but it is unclear whether LLMs use such markers consistently with their intrinsic confidence. To study this, the authors define marker confidence as the observed accuracy when a model employs an epistemic marker, and evaluate open-source and proprietary LLMs in both in-distribution and out-of-distribution settings. The key finding is that epistemic markers generalize well in-distribution but yield inconsistent confidence out-of-distribution, underscoring the need for better alignment between marker-based confidence and actual model uncertainty.
Link: https://arxiv.org/abs/2505.24778
Authors: Jiayu Liu, Qing Zong, Weiqi Wang, Yangqiu Song
Affiliations: HKUST, Hong Kong SAR, China
Subjects: Computation and Language (cs.CL)
Comments: ACL2025
Abstract:As large language models (LLMs) are increasingly used in high-stakes domains, accurately assessing their confidence is crucial. Humans typically express confidence through epistemic markers (e.g., “fairly confident”) instead of numerical values. However, it remains unclear whether LLMs consistently use these markers to reflect their intrinsic confidence due to the difficulty of quantifying uncertainty associated with various markers. To address this gap, we first define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker based confidence and actual model uncertainty. Our code is available at this https URL.
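The paper defines marker confidence as the observed accuracy over answers that use a given epistemic marker, which is a one-liner to compute. A minimal sketch over toy records (the example data is made up):

```python
from collections import defaultdict

def marker_confidence(records):
    """records: iterable of (epistemic_marker, answer_was_correct) pairs."""
    total, correct = defaultdict(int), defaultdict(int)
    for marker, ok in records:
        total[marker] += 1
        correct[marker] += int(ok)
    return {m: correct[m] / total[m] for m in total}

records = [("fairly confident", True), ("fairly confident", False),
           ("almost certain", True), ("almost certain", True)]
print(marker_confidence(records))  # {'fairly confident': 0.5, 'almost certain': 1.0}
```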
[NLP-22] From Macro to Micro: Probing Dataset Diversity in Language Model Fine-Tuning
【Quick Read】: This paper addresses insufficient dataset diversity in the supervised fine-tuning (SFT) stage of large language model (LLM) development, aiming to improve model performance. The key to the solution is a systematic taxonomy of diversity-control strategies for the instruction component at the macroscopic (whole-instruction semantics) and mesoscopic (instruction-unit) levels, together with the first analysis of microscopic diversity in the response component (i.e., the statistical distribution of tokens in SFT training samples), enabling a comprehensive assessment of how diversity at each level affects model performance.
Link: https://arxiv.org/abs/2505.24768
Authors: Haoyu Li, Xuhong Li, Yiming Dong, Kun Liu
Affiliations: Beijing Institute of Technology; Baidu Inc.; Peking University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Dataset diversity plays a pivotal role for the successful training of many machine learning models, particularly in the supervised fine-tuning (SFT) stage of large language model (LLM) development. Despite increasing recognition of its importance, systematic analyses of dataset diversity still remain underexplored. To address this gap, this work presents a systematic taxonomy of existing diversity-control strategies, which primarily focus on the instruction component, operating at either macroscopic (entire instruction semantics) or mesoscopic levels (instruction units), and furthermore introduces a novel analysis of microscopic diversity within the response component, specifically analyzing the statistical distribution of tokens in SFT training samples. In the experimental evaluation, we construct fixed-size datasets (e.g., 10,000 samples each) from a corpus of 117,000 open-source SFT samples, incorporating six distinct diversity-control strategies spanning macro-, meso-, and microscopic levels applied to both instructions and responses. We then fine-tune LLMs on these datasets to assess the six diversity-control strategies. Results reveal that while macroscopic and mesoscopic strategies lead to higher performance with increasing diversity, the microscopic strategy in responses exhibits both a stronger correlation between model performance and the degree of diversity and superior performance with maximum diversity across all strategies. These findings offer actionable insights for constructing high-performance SFT datasets.
[NLP-23] REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
【Quick Read】: This paper addresses the fixed complexity, limited volume, and poor support for continuous evaluation of traditional reasoning datasets. The key to the solution is Reasoning Gym (RG), a library of reasoning environments that can procedurally generate virtually unlimited training data with adjustable complexity, supporting both the evaluation and the reinforcement-learning training of reasoning models.
Link: https://arxiv.org/abs/2505.24760
Authors: Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: For code, see this https URL
Abstract:We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG in both evaluating and reinforcement learning of reasoning models.
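The generator-plus-verifier pattern behind RG is easy to illustrate. The toy arithmetic environment below has a difficulty knob and an exact-match verifier; it mimics the spirit of the library, not RG's actual API.

```python
import random

def generate_arithmetic_task(difficulty, seed=None):
    """Toy procedural generator: difficulty controls operand count and magnitude."""
    rng = random.Random(seed)
    terms = [rng.randint(1, 10 ** difficulty) for _ in range(difficulty + 1)]
    return {"question": "Compute: " + " + ".join(map(str, terms)),
            "answer": str(sum(terms))}

def verify(task, model_answer):
    """Binary verifiable reward: exact match against the ground truth."""
    return model_answer.strip() == task["answer"]

task = generate_arithmetic_task(difficulty=2, seed=0)
print(task["question"], "->", verify(task, task["answer"]))  # ... -> True
```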
[NLP-24] LGAR: Zero-Shot LLM-Guided Neural Ranking for Abstract Screening in Systematic Literature Reviews
【Quick Read】: This paper addresses efficiency and accuracy in the abstract screening phase of systematic literature reviews (SLRs), and in particular the limitations of using large language models (LLMs) for relevance assessment: existing question answering (QA) based ranking approaches suffer from error propagation, and existing benchmarks do not exhaustively cover SLR inclusion and exclusion criteria. The key to the solution is LGAR, a zero-shot LLM-guided abstract ranker composed of an LLM-based graded relevance scorer and a dense re-ranker, which significantly improves mean average precision (MAP) without task-specific training data.
Link: https://arxiv.org/abs/2505.24757
Authors: Christian Jaumann, Andreas Wiedholz, Annemarie Friedrich
Affiliations: XITASO GmbH; University of Augsburg
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The scientific literature is growing rapidly, making it hard to keep track of the state-of-the-art. Systematic literature reviews (SLRs) aim to identify and evaluate all relevant papers on a topic. After retrieving a set of candidate papers, the abstract screening phase determines initial relevance. To date, abstract screening methods using large language models (LLMs) focus on binary classification settings; existing question answering (QA) based ranking approaches suffer from error propagation. LLMs offer a unique opportunity to evaluate the SLR’s inclusion and exclusion criteria, yet, existing benchmarks do not provide them exhaustively. We manually extract these criteria as well as research questions for 57 SLRs, mostly in the medical domain, enabling principled comparisons between approaches. Moreover, we propose LGAR, a zero-shot LLM Guided Abstract Ranker composed of an LLM based graded relevance scorer and a dense re-ranker. Our extensive experiments show that LGAR outperforms existing QA-based methods by 5-10 pp. in mean average precision. Our code and data is publicly available.
[NLP-25] Don't Reinvent the Wheel: Efficient Instruction-Following Text Embedding based on Guided Space Transformation ACL2025
【Quick Read】: This paper addresses the heavy computational overhead of instruction-following text embedding: existing methods must re-encode the entire corpus whenever a new instruction arrives. The key to the solution is GSTransform, a framework based on Guided Space Transformation that uses a lightweight transformation mechanism to adapt pre-computed embeddings to user instructions in real time, without re-encoding the corpus.
Link: https://arxiv.org/abs/2505.24754
Authors: Yingchaojie Feng, Yiqun Sun, Yandong Sun, Minfeng Zhu, Qiang Huang, Anthony K. H. Tung, Wei Chen
Affiliations: Zhejiang University; National University of Singapore; Harbin Institute of Technology (Shenzhen)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted to ACL 2025
Abstract:In this work, we investigate an important task named instruction-following text embedding, which generates dynamic text embeddings that adapt to user instructions, highlighting specific attributes of text. Despite recent advancements, existing approaches suffer from significant computational overhead, as they require re-encoding the entire corpus for each new instruction. To address this challenge, we propose GSTransform, a novel instruction-following text embedding framework based on Guided Space Transformation. Our key observation is that instruction-relevant information is inherently encoded in generic embeddings but remains underutilized. Instead of repeatedly encoding the corpus for each instruction, GSTransform is a lightweight transformation mechanism that adapts pre-computed embeddings in real time to align with user instructions, guided by a small amount of text data with instruction-focused label annotation. We conduct extensive experiments on three instruction-awareness downstream tasks across nine real-world datasets, demonstrating that GSTransform improves instruction-following text embedding quality over state-of-the-art methods while achieving dramatic speedups of 6~300x in real-time processing on large-scale datasets. The source code is available at this https URL.
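One way to realize "adapt pre-computed embeddings with a lightweight, label-guided transformation" is a linear map fitted on the small annotated sample and then applied to the frozen corpus embeddings. The sketch below uses LDA in that role; this is an assumed stand-in for GSTransform's mechanism, not its actual architecture.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

corpus_emb = np.random.randn(10_000, 384)  # frozen generic embeddings, computed once

# Small sample annotated with instruction-focused labels (sizes are made up).
sample_emb = np.random.randn(200, 384)
sample_labels = np.random.randint(0, 3, size=200)

# Fit the lightweight transformation on the labeled sample only...
lda = LinearDiscriminantAnalysis(n_components=2).fit(sample_emb, sample_labels)
# ...then adapt the entire corpus in real time, with no re-encoding.
adapted = lda.transform(corpus_emb)
print(adapted.shape)  # (10000, 2)
```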
[NLP-26] SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training
【Quick Read】: This paper addresses the limited convergence speed of low-rank gradient optimization methods for large language model (LLM) training: they improve memory efficiency but rely on standard isotropic steepest-descent techniques. The key is SUMO (Subspace-Aware Moment-Orthogonalization), an optimizer that orthogonalizes the momentum with an exact singular value decomposition (SVD) within a dynamically adapted low-dimensional subspace, yielding norm-induced steepest-descent steps aligned with the spectral characteristics of the loss landscape, which reduces approximation error and accelerates convergence.
Link: https://arxiv.org/abs/2505.24749
Authors: Yehonathan Refael, Guy Smorodinsky, Tom Tirer, Ofir Lindenbaum
Affiliations: Tel Aviv University; Ben Gurion University; Bar-Ilan University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Optimization and Control (math.OC)
Comments:
Abstract:Low-rank gradient-based optimization methods have significantly improved memory efficiency during the training of large language models (LLMs), enabling operations within constrained hardware without sacrificing performance. However, these methods primarily emphasize memory savings, often overlooking potential acceleration in convergence due to their reliance on standard isotropic steepest descent techniques, which can perform suboptimally in the highly anisotropic landscapes typical of deep networks, particularly LLMs. In this paper, we propose SUMO (Subspace-Aware Moment-Orthogonalization), an optimizer that employs exact singular value decomposition (SVD) for moment orthogonalization within a dynamically adapted low-dimensional subspace, enabling norm-inducing steepest descent optimization steps. By explicitly aligning optimization steps with the spectral characteristics of the loss landscape, SUMO effectively mitigates approximation errors associated with commonly used methods like Newton-Schulz orthogonalization approximation. We theoretically establish an upper bound on these approximation errors, proving their dependence on the condition numbers of moments, conditions we analytically demonstrate are encountered during LLM training. Furthermore, we both theoretically and empirically illustrate that exact orthogonalization via SVD substantially improves convergence rates while reducing overall complexity. Empirical evaluations confirm that SUMO accelerates convergence, enhances stability, improves performance, and reduces memory requirements by up to 20% compared to state-of-the-art methods.
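The core operation, exact moment orthogonalization via SVD, replaces the momentum matrix M = UΣVᵀ with UVᵀ, its nearest semi-orthogonal matrix. A minimal sketch follows; the subspace-projection bookkeeping that SUMO adds around this step is omitted.

```python
import torch

def orthogonalize_moment(M: torch.Tensor) -> torch.Tensor:
    """Exact orthogonalization via SVD: M = U S V^T  ->  U V^T."""
    U, _, Vh = torch.linalg.svd(M, full_matrices=False)
    return U @ Vh

M = torch.randn(64, 32)  # e.g., first moment of a projected weight matrix
Q = orthogonalize_moment(M)
print(torch.allclose(Q.T @ Q, torch.eye(32), atol=1e-5))  # True: orthonormal columns
```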
[NLP-27] Circuit Stability Characterizes Language Model Generalization
【Quick Read】: This paper addresses the difficulty of thoroughly evaluating the capabilities of (large) language models: rapid model iteration saturates benchmarks, while building more challenging datasets is labor-intensive. The key is "circuit stability" as a new way to assess performance: a model's ability to apply a consistent reasoning process (its circuit) across varied inputs. The paper mathematically formalizes circuit stability and circuit equivalence and shows, through empirical case studies, that circuit stability (and its absence) can characterize and predict different aspects of generalization, offering a rigorous path toward relating model generality to interpretability.
Link: https://arxiv.org/abs/2505.24731
Authors: Alan Sun
Affiliations: Carnegie Mellon University
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 10 figures
Abstract:Extensively evaluating the capabilities of (large) language models is difficult. Rapid development of state-of-the-art models induce benchmark saturation, while creating more challenging datasets is labor-intensive. Inspired by the recent developments in mechanistic interpretability, we introduce circuit stability as a new way to assess model performance. Circuit stability refers to a model’s ability to apply a consistent reasoning process-its circuit-across various inputs. We mathematically formalize circuit stability and circuit equivalence. Then, through three case studies, we empirically show that circuit stability and the lack thereof can characterize and predict different aspects of generalization. Our proposed methods offer a step towards rigorously relating the generality of models to their interpretability.
[NLP-28] Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
【Quick Read】: This paper addresses how to improve large language model performance on complex verifiable tasks when synthetic data generation is infeasible and only binary feedback is available. The key is a framework combining self-reflection and reinforcement learning: when the model fails a task, it generates a self-reflective commentary and then retries the task with that reflection in context; if the retry succeeds, the tokens generated during the self-reflection phase are rewarded, incentivizing more useful reflections and enabling self-improvement.
Link: https://arxiv.org/abs/2505.24726
Authors: Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, Waseem AlShikh
Affiliations: Writer, Inc.
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model’s ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, as high as 34.7% improvement at math equation writing and 18.1% improvement at function calling. Notably, smaller fine-tuned models (1.5 billion to 7 billion parameters) outperform models in the same family that are 10 times larger. Our novel paradigm is thus an exciting pathway to more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.
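The two-stage loop is straightforward to sketch. Below, `model` stands in for any generate call and `verify` for the binary task checker; the prompt wording and the single-retry budget are illustrative assumptions. During RL training, only the reflection tokens returned here would be rewarded.

```python
def reflect_retry(model, task, verify, max_retries=1):
    answer = model(task["prompt"])
    if verify(task, answer):
        return {"success": True, "reflection_to_reward": None}
    for _ in range(max_retries):
        reflection = model(task["prompt"] + "\nYour previous answer was wrong. "
                                            "Briefly reflect on what went wrong.")
        retry = model(task["prompt"] + "\nReflection: " + reflection)
        if verify(task, retry):
            # Success after reflection: these reflection tokens get rewarded.
            return {"success": True, "reflection_to_reward": reflection}
    return {"success": False, "reflection_to_reward": None}
```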
[NLP-29] CoRet: Improved Retriever for Code Editing ACL2025
【Quick Read】: This paper addresses efficient retrieval of relevant code snippets for code-editing tasks: given a natural-language query (e.g., implement a new feature or fix a bug), find semantically relevant content in a code repository. The key is CoRet, a dense retrieval model that integrates code semantics, repository structure, and call-graph dependencies, together with a loss function designed specifically for repository-level retrieval, substantially improving retrieval recall.
Link: https://arxiv.org/abs/2505.24715
Authors: Fabio Fehr, Prabhu Teja Sivaprasad, Luca Franceschi, Giovanni Zappella
Affiliations: EPFL & Idiap Research Institute; AWS AI Labs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ACL 2025
Abstract:In this paper, we introduce CoRet, a dense retrieval model designed for code-editing tasks that integrates code semantics, repository structure, and call graph dependencies. The model focuses on retrieving relevant portions of a code repository based on natural language queries such as requests to implement new features or fix bugs. These retrieved code chunks can then be presented to a user or to a second code-editing model or agent. To train CoRet, we propose a loss function explicitly designed for repository-level retrieval. On SWE-bench and Long Code Arena’s bug localisation datasets, we show that our model substantially improves retrieval recall by at least 15 percentage points over existing models, and ablate the design choices to show their importance in achieving these results.
[NLP-30] FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation ACL2025
【Quick Read】: This paper addresses the lack of effective, specialized multimodal evaluation datasets in the financial domain. The key to the solution is FinMME, a dataset of more than 11,000 high-quality financial research samples covering 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes, with data quality ensured by 20 annotators and carefully designed validation mechanisms. In addition, the FinScore evaluation system combines hallucination penalties with multi-dimensional capability assessment to provide unbiased model evaluation.
Link: https://arxiv.org/abs/2505.24714
Authors: Junyu Luo, Zhizhuo Kou, Liming Yang, Xiao Luo, Jinsheng Huang, Zhiping Xiao, Jingshu Peng, Chengzhong Liu, Jiaming Ji, Xuanzhe Liu, Sirui Han, Ming Zhang, Yike Guo
Affiliations: State Key Laboratory for Multimedia Information Processing, PKU-Anker LLM Lab; School of Computer Science, Peking University; HKUST; University of California, Los Angeles; University of Washington
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 Main Conference
Abstract:Multimodal Large Language Models (MLLMs) have experienced rapid development in recent years. However, in the financial domain, there is a notable lack of effective and specialized multimodal evaluation datasets. To advance the development of MLLMs in the finance domain, we introduce FinMME, encompassing more than 11,000 high-quality financial research samples across 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. Additionally, we develop FinScore, an evaluation system incorporating hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experimental results demonstrate that even state-of-the-art models like GPT-4o exhibit unsatisfactory performance on FinMME, highlighting its challenging nature. The benchmark exhibits high robustness with prediction variations under different prompts remaining below 1%, demonstrating superior reliability compared to existing datasets. Our dataset and evaluation protocol are available at this https URL and this https URL.
zh
[NLP-31] Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification INTERSPEECH2025
【速读】: 该论文旨在解决阿拉伯语方言识别(Arabic Dialect Identification, ADI)系统在跨领域场景下泛化能力不足的问题,这一问题限制了现有ADI系统的可靠性。其解决方案的关键在于基于语音转换(Voice Conversion)的方法,通过该方法训练的ADI模型在跨领域场景中表现出显著的鲁棒性提升,并在新收集的涵盖四个不同领域的实际测试集上实现了最高34.1%的准确率提升。此外,研究还表明语音转换有助于减轻ADI数据集中存在的说话人偏见。
链接: https://arxiv.org/abs/2505.24713
作者: Badr M. Abdullah,Matthew Baas,Bernd Möbius,Dietrich Klakow
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted in Interspeech 2025
Abstract:Arabic dialect identification (ADI) systems are essential for large-scale data collection pipelines that enable the development of inclusive speech technologies for Arabic language varieties. However, the reliability of current ADI systems is limited by poor generalization to out-of-domain speech. In this paper, we present an effective approach based on voice conversion for training ADI models that achieves state-of-the-art performance and significantly improves robustness in cross-domain scenarios. Evaluated on a newly collected real-world test set spanning four different domains, our approach yields consistent improvements of up to +34.1% in accuracy across domains. Furthermore, we present an analysis of our approach and demonstrate that voice conversion helps mitigate the speaker bias in the ADI dataset. We release our robust ADI model and cross-domain evaluation dataset to support the development of inclusive speech technologies for Arabic.
zh
[NLP-32] HESEIA: A community-based dataset for evaluating social biases in large language models co-designed in real school settings in Latin America
【速读】: 该论文试图解决大型语言模型(Large Language Models)中社会偏见评估资源缺乏参与性与代表性的问题,特别是现有资源未充分结合受偏见影响社区的共同设计。其解决方案的关键在于构建HESEIA数据集,该数据集通过拉丁美洲189所学校的370名高中教师和5,370名学生的参与,结合教育者的实际经验和教学专业知识,捕捉跨多个人口统计轴线和学科领域的交叉性偏见,从而提供更具本地化和多样性的偏见评估基准。
链接: https://arxiv.org/abs/2505.24712
作者: Guido Ivetta(1 and 2),Marcos J. Gomez(1 and 2),Sofía Martinelli(1),Pietro Palombini(1),M. Emilia Echeveste(1 and 2),Nair Carolina Mazzeo(2),Beatriz Busaniche(2),Luciana Benotti(1 and 2) ((1) Universidad Nacional de Córdoba, Argentina, (2) Fundación Vía Libre)
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Most resources for evaluating social biases in Large Language Models are developed without co-design from the communities affected by these biases, and rarely involve participatory approaches. We introduce HESEIA, a dataset of 46,499 sentences created in a professional development course. The course involved 370 high-school teachers and 5,370 students from 189 Latin-American schools. Unlike existing benchmarks, HESEIA captures intersectional biases across multiple demographic axes and school subjects. It reflects local contexts through the lived experience and pedagogical expertise of educators. Teachers used minimal pairs to create sentences that express stereotypes relevant to their school subjects and communities. We show the dataset diversity in terms of demographic axes represented and also in terms of the knowledge areas included. We demonstrate that the dataset contains more stereotypes unrecognized by current LLMs than previous datasets. HESEIA is available to support bias assessments grounded in educational communities.
zh
[NLP-33] Causal-aware Large Language Models: Enhancing Decision-Making Through Learning, Adapting and Acting IJCAI2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在决策过程中存在的推理能力不足以及难以适应新环境的问题,这些问题限制了其在复杂现实任务中的应用。解决方案的关键在于提出一种因果感知的LLMs(Causal-aware LLMs),该方法通过将结构因果模型(Structural Causal Model, SCM)整合到决策过程中,实现对环境结构化知识的建模、更新与利用,从而在“学习-适应-行动”的范式下提升模型的决策效率与环境理解能力。
链接: https://arxiv.org/abs/2505.24710
作者: Wei Chen,Jiahao Zhang,Haipeng Zhu,Boyan Xu,Zhifeng Hao,Keli Zhang,Junjian Ye,Ruichu Cai
机构: Guangdong University of Technology (广东工业大学); Shantou University (汕头大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Peng Cheng Laboratory (鹏城实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by IJCAI 2025
Abstract:Large language models (LLMs) have shown great potential in decision-making due to the vast amount of knowledge stored within the models. However, these pre-trained models are prone to lack reasoning abilities and are difficult to adapt to new environments, further hindering their application to complex real-world tasks. To address these challenges, inspired by the human cognitive process, we propose Causal-aware LLMs, which integrate the structural causal model (SCM) into the decision-making process to model, update, and utilize structured knowledge of the environment in a “learning-adapting-acting” paradigm. Specifically, in the learning stage, we first utilize an LLM to extract the environment-specific causal entities and their causal relations to initialize a structured causal model of the environment. Subsequently, in the adapting stage, we update the structured causal model through external feedback about the environment, via an idea of causal intervention. Finally, in the acting stage, Causal-aware LLMs exploit structured causal knowledge for more efficient policy-making through the reinforcement learning agent. The above processes are performed iteratively to learn causal knowledge, ultimately enabling the causal-aware LLMs to achieve a more accurate understanding of the environment and make more efficient decisions. Experimental results across 22 diverse tasks within the open-world game “Crafter” validate the effectiveness of our proposed method.
zh
[NLP-34] Multi-Domain ABSA Conversation Dataset Generation via LLMs for Real-World Evaluation and Model Comparison
【速读】: 该论文试图解决Aspect-Based Sentiment Analysis (ABSA)领域中由于缺乏多样化、带标签的数据集而导致的现实对话细微差别反映不足的问题。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)生成合成的ABSA数据,以填补这一数据缺口,并通过GPT-4o生成具有跨多个领域一致主题和情感分布的数据。
链接: https://arxiv.org/abs/2505.24701
作者: Tejul Pandit,Meet Raval,Dhvani Upadhyay
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, 5 tables, 6th International Conference on Natural Language Computing and AI (NLCAI 2025), ISBN: 978-1-923107-59-5, Computer Science & Information Technology (CS & IT), ISSN: 2231-5403, Volume 15, Number 10, May 2025
Abstract:Aspect-Based Sentiment Analysis (ABSA) offers granular insights into opinions but often suffers from the scarcity of diverse, labeled datasets that reflect real-world conversational nuances. This paper presents an approach for generating synthetic ABSA data using Large Language Models (LLMs) to address this gap. We detail the generation process aimed at producing data with consistent topic and sentiment distributions across multiple domains using GPT-4o. The quality and utility of the generated data were evaluated by assessing the performance of three state-of-the-art LLMs (Gemini 1.5 Pro, Claude 3.5 Sonnet, and DeepSeek-R1) on topic and sentiment classification tasks. Our results demonstrate the effectiveness of the synthetic data, revealing distinct performance trade-offs among the models: DeepSeek-R1 showed higher precision, Gemini 1.5 Pro and Claude 3.5 Sonnet exhibited strong recall, and Gemini 1.5 Pro offered significantly faster inference. We conclude that LLM-based synthetic data generation is a viable and flexible method for creating valuable ABSA resources, facilitating research and model evaluation without reliance on limited or inaccessible real-world labeled data.
zh
[NLP-35] Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios INTERSPEECH2025
【速读】: 该论文试图解决在低资源和零资源语言环境下语音到文本翻译(Speech-to-Text Translation, S2TT)性能不佳的问题。解决方案的关键在于将音素(phoneme)表示集成到链式思维(Chain-of-Thought, CoT)框架中,通过引入音素识别作为中间步骤,增强跨语言迁移能力,从而实现无标注语音数据的语言翻译。该方法基于多语言大语言模型(multilingual LLM),并采用课程学习策略进行训练,以逐步提升任务复杂度。
链接: https://arxiv.org/abs/2505.24691
作者: Gerard I. Gállego,Oriol Pareras,Martí Cortada Garcia,Lucas Takanori,Javier Hernando
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2025
Abstract:We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly impacting high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more accessible across diverse languages.
zh
[NLP-36] BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
【速读】: 该论文试图解决传统Byte Pair Encoding (BPE)分词器在多语言场景下的问题,包括对非西方文字的惩罚以及部分UTF-8序列生成的标记。其解决方案的关键是提出SCRIPT(Script Category Representation in PreTokenization),通过利用Unicode脚本和类别属性来构建初始标记,从而绕过UTF-8字节转换,实现基于规则的预分词策略,该策略能够尊重脚本边界并提高鲁棒性。
链接: https://arxiv.org/abs/2505.24689
作者: Sander Land,Catherine Arnett
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, 2 figures. For associated code, see this https URL
Abstract:Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization, often reliant on complex regular expressions, can also introduce fragility and unexpected edge cases. We propose SCRIPT (Script Category Representation in PreTokenization), a novel encoding scheme that bypasses UTF-8 byte conversion by using initial tokens based on Unicode script and category properties. This approach enables a simple, rule-based pretokenization strategy that respects script boundaries, offering a robust alternative to pretokenization strategies based on regular expressions. We also introduce and validate a constrained BPE merging strategy that enforces character integrity, applicable to both SCRIPT-BPE and byte-based BPE. Our experiments demonstrate that SCRIPT-BPE achieves competitive compression while eliminating encoding-based penalties for non-Latin-script languages.
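摘要中“基于 Unicode 脚本与类别属性生成初始标记、绕过 UTF-8 字节转换”的预分词思路,可用第三方 regex 库(标准库 re 不支持 \p{Script=…})粗略示意如下;所列脚本集合与切分规则均为假设,并非 SCRIPT 的原始规则表。

```python
import regex  # 第三方 regex 库,支持 Unicode Script 属性

# 按脚本边界切分:同一脚本的连续字符归为一个初始块,不做 UTF-8 字节转换。
SCRIPT_RUN = regex.compile(
    r"\p{Script=Han}+|\p{Script=Latin}+|\p{Script=Arabic}+"
    r"|\p{N}+|\s+|.",
    regex.UNICODE,
)

def script_pretokenize(text: str):
    return SCRIPT_RUN.findall(text)

print(script_pretokenize("BPE分词器 handles 多语言 text!"))
# 输出: ['BPE', '分词器', ' ', 'handles', ' ', '多语言', ' ', 'text', '!']
```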
zh
[NLP-37] Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration ICML2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在复杂推理任务中表现不佳的问题,主要原因是其多样性有限和搜索效率低下。解决方案的关键在于提出一种基于嵌入的搜索框架——Soft Reasoning,通过优化第一个词的嵌入来引导生成过程,结合嵌入扰动实现可控探索,并利用贝叶斯优化通过验证器引导的目标函数精炼嵌入,从而在探索与利用之间取得平衡。该方法提升了推理的准确性和连贯性,同时避免依赖启发式搜索。
链接: https://arxiv.org/abs/2505.24688
作者: Qinglin Zhu,Runcong Zhao,Hanqi Yan,Yulan He,Yudong Chen,Lin Gui
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by ICML 2025
Abstract:Large Language Models (LLMs) struggle with complex reasoning due to limited diversity and inefficient search. We propose Soft Reasoning, an embedding-based search framework that optimises the embedding of the first token to guide generation. It combines (1) embedding perturbation for controlled exploration and (2) Bayesian optimisation to refine embeddings via a verifier-guided objective, balancing exploration and exploitation. This approach improves reasoning accuracy and coherence while avoiding reliance on heuristic search. Experiments demonstrate superior correctness with minimal computation, making it a scalable, model-agnostic solution.
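其中“嵌入扰动 + 验证器引导”的探索步骤可作如下示意:这里用随机高斯采样代替论文中的贝叶斯优化,generate_from_embedding 与 verifier 均为假设接口,仅用于说明对首 token 嵌入的受控探索,并非原文实现。

```python
import torch

def explore_first_token(embed, generate_from_embedding, verifier,
                        n_samples=8, sigma=0.1):
    """对首个 token 的嵌入加受控高斯扰动,按验证器得分择优(示意)。"""
    best, best_score = None, float("-inf")
    for _ in range(n_samples):
        cand = embed + sigma * torch.randn_like(embed)  # 受控扰动
        output = generate_from_embedding(cand)          # 以该嵌入为起点生成
        score = verifier(output)                        # 验证器引导的目标函数
        if score > best_score:
            best, best_score = output, score
    return best
# 论文在此基础上用贝叶斯优化(而非上面的纯随机采样)来迭代精炼嵌入。
```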
zh
[NLP-38] Should I Share this Translation? Evaluating Quality Feedback for User Reliance on Machine Translation
【速读】: 该论文试图解决在人工智能(AI)系统被广泛应用于工作和日常生活中时,用户如何负责任地使用AI的问题,尤其是在用户缺乏评估AI预测质量能力的场景下。其解决方案的关键在于引入不同类型的质量反馈机制,以帮助用户做出更准确的决策并合理依赖AI输出。研究比较了四种反馈类型:显式反馈(包括错误高亮和大语言模型(LLM)解释)和隐式反馈(包括反向翻译和问答(QA)表格),结果表明除错误高亮外,所有反馈类型均显著提升了决策准确性和适当依赖性,其中隐式反馈尤其是QA表格在多个维度上表现更优。
链接: https://arxiv.org/abs/2505.24683
作者: Dayeon Ki,Kevin Duh,Marine Carpuat
机构: University of Maryland (马里兰大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 pages, 7 figures
Abstract:As people increasingly use AI systems in work and daily life, feedback mechanisms that help them use AI responsibly are urgently needed, particularly in settings where users are not equipped to assess the quality of AI predictions. We study a realistic Machine Translation (MT) scenario where monolingual users decide whether to share an MT output, first without and then with quality feedback. We compare four types of quality feedback: explicit feedback that directly gives users an assessment of translation quality using 1) error highlights and 2) LLM explanations, and implicit feedback that helps users compare MT inputs and outputs through 3) backtranslation and 4) question-answer (QA) tables. We find that all feedback types, except error highlights, significantly improve both decision accuracy and appropriate reliance. Notably, implicit feedback, especially QA tables, yields significantly greater gains than explicit feedback in terms of decision accuracy, appropriate reliance, and user perceptions, receiving the highest ratings for helpfulness and trust, and the lowest for mental burden.
zh
[NLP-39] A Simple Linear Patch Revives Layer-Pruned Large Language Models
【速读】: 该论文试图解决层剪枝(layer pruning)在压缩大语言模型(LLM)过程中导致性能显著下降的问题。其关键在于识别出性能退化源于剪枝界面处不同层和不同token的激活幅度不匹配。解决方案的核心是提出LinearPatch,该方法通过Hadamard变换抑制特定token中的大量异常值,并采用通道级缩放对齐激活幅度,这些操作可融合为一个矩阵作为修补模块,以最小的推理开销填补剪枝界面,从而有效恢复模型性能。
链接: https://arxiv.org/abs/2505.24680
作者: Xinrui Chen,Haoli Bai,Tao Yuan,Ruikang Liu,Kang Zhao,Xianzhi Yu,Lu Hou,Tian Guan,Yonghong He,Chun Yuan
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Layer pruning has become a popular technique for compressing large language models (LLMs) due to its simplicity. However, existing layer pruning methods often suffer from significant performance drops. We identify that this degradation stems from the mismatch of activation magnitudes across layers and tokens at the pruning interface. To address this, we propose LinearPatch, a simple plug-and-play technique to revive the layer-pruned LLMs. The proposed method adopts Hadamard transformation to suppress massive outliers in particular tokens, and channel-wise scaling to align the activation magnitudes. These operations can be fused into a single matrix, which functions as a patch to bridge the pruning interface with negligible inference overhead. LinearPatch retains up to 94.15% performance of the original model when pruning 5 layers of LLaMA-3-8B on the question answering benchmark, surpassing existing state-of-the-art methods by 4%. In addition, the patch matrix can be further optimized with memory efficient offline knowledge distillation. With only 5K samples, the retained performance of LinearPatch can be further boosted to 95.16% within 30 minutes on a single computing card.
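“Hadamard 变换抑制离群值 + 逐通道缩放对齐幅度,并融合为单一补丁矩阵”的思路可用 NumPy 作如下数值示意;其中缩放因子的统计方式与两步操作的组合顺序均为假设,并非 LinearPatch 的原始公式。

```python
import numpy as np
from scipy.linalg import hadamard

d = 8                                   # 隐藏维度(hadamard 要求 2 的幂)
H = hadamard(d) / np.sqrt(d)            # 正交化 Hadamard 变换,平摊离群值
scale = np.random.uniform(0.8, 1.2, d)  # 假设由校准数据得到的逐通道缩放因子
patch = H.T @ np.diag(scale) @ H        # 两步操作融合为单一补丁矩阵

x = np.random.randn(4, d)               # 剪枝界面处的激活(示例)
y = x @ patch                           # 推理时仅增加一次矩阵乘法
```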
zh
[NLP-40] TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成有害内容或被恶意利用方面的安全风险问题。现有安全对齐数据集在风险覆盖范围上存在不足,主要关注词汇多样性而忽视了其他关键维度。解决方案的关键在于提出一种新的分析框架,用于系统评估对齐数据集在词汇多样性、恶意意图和越狱策略三个核心维度上的风险覆盖率,并引入TRIDENT自动化流水线,通过基于角色的零样本大语言模型生成技术,创建涵盖这些维度的多样化且全面的指令-响应对数据集,从而提升模型的安全性和鲁棒性。
链接: https://arxiv.org/abs/2505.24672
作者: Xiaorui Wu,Xiaofeng Mao,Fei Li,Xin Zhang,Xuanhong Li,Chong Teng,Donghong Ji,Zhuang Li
机构: Wuhan University (武汉大学); Ant Group (蚂蚁集团); Royal Melbourne Institute of Technology (皇家墨尔本理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) excel in various natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. Although safety alignment datasets have been introduced to mitigate such risks through supervised fine-tuning (SFT), these datasets often lack comprehensive risk coverage. Most existing datasets focus primarily on lexical diversity while neglecting other critical dimensions. To address this limitation, we propose a novel analysis framework to systematically measure the risk coverage of alignment datasets across three essential dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics. We further introduce TRIDENT, an automated pipeline that leverages persona-based, zero-shot LLM generation to produce diverse and comprehensive instructions spanning these dimensions. Each harmful instruction is paired with an ethically aligned response, resulting in two datasets: TRIDENT-Core, comprising 26,311 examples, and TRIDENT-Edge, with 18,773 examples. Fine-tuning Llama 3.1-8B on TRIDENT-Edge demonstrates substantial improvements, achieving an average 14.29% reduction in Harm Score, and a 20% decrease in Attack Success Rate compared to the best-performing baseline model fine-tuned on the WildBreak dataset.
zh
[NLP-41] Multiple LLM Agents Debate for Equitable Cultural Alignment
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在跨文化情境下预测准确性不足的问题,旨在提升模型对全球多样文化的适应能力。其解决方案的关键在于引入多智能体辩论框架(Multi-Agent Debate framework),通过多个基于LLM的智能体在文化情境下的辩论与协作,充分发挥多模型的互补优势,从而提升模型在社会礼仪规范任务中的整体准确性和文化群体公平性。
链接: https://arxiv.org/abs/2505.24671
作者: Dayeon Ki,Rachel Rudinger,Tianyi Zhou,Marine Carpuat
机构: University of Maryland (马里兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 37 pages, 18 figures
Abstract:Large Language Models (LLMs) need to adapt their predictions to diverse cultural contexts to benefit diverse communities across the world. While previous efforts have focused on single-LLM, single-turn approaches, we propose to exploit the complementary strengths of multiple LLMs to promote cultural adaptability. We introduce a Multi-Agent Debate framework, where two LLM-based agents debate over a cultural scenario and collaboratively reach a final decision. We propose two variants: one where either LLM agents exclusively debate and another where they dynamically choose between self-reflection and debate during their turns. We evaluate these approaches on 7 open-weight LLMs (and 21 LLM combinations) using the NormAd-ETI benchmark for social etiquette norms in 75 countries. Experiments show that debate improves both overall accuracy and cultural group parity over single-LLM baselines. Notably, multi-agent debate enables relatively small LLMs (7-9B) to achieve accuracies comparable to that of a much larger model (27B parameters).
zh
[NLP-42] MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR
【速读】: 该论文旨在解决自动语音识别(ASR)中的无监督域适应问题,即在缺乏目标域标注数据的情况下,提升ASR模型的泛化能力和鲁棒性。其解决方案的关键在于提出一种多阶段域适应流程(MSDA),该方法结合了自监督学习与半监督技术,采用两阶段的高效样本适应策略,通过级联方式整合自监督与自训练机制,从而显著提升了模型在低资源语言和弱监督场景下的性能。
链接: https://arxiv.org/abs/2505.24656
作者: Dimitrios Damianos,Georgios Paraskevopoulos,Alexandros Potamianos
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:In this work, we investigate the Meta PL unsupervised domain adaptation framework for Automatic Speech Recognition (ASR). We introduce a Multi-Stage Domain Adaptation pipeline (MSDA), a sample-efficient, two-stage adaptation approach that integrates self-supervised learning with semi-supervised techniques. MSDA is designed to enhance the robustness and generalization of ASR models, making them more adaptable to diverse conditions. It is particularly effective for low-resource languages like Greek and in weakly supervised scenarios where labeled data is scarce or noisy. Through extensive experiments, we demonstrate that Meta PL can be applied effectively to ASR tasks, achieving state-of-the-art results that significantly outperform previous methods, and providing more robust solutions for unsupervised domain adaptation in ASR. Our ablations highlight the necessity of utilizing a cascading approach when combining self-supervision with self-training.
zh
[NLP-43] PRISM: A Framework for Producing Interpretable Political Bias Embeddings with Political-Aware Cross-Encoder ACL2025
【速读】: 该论文试图解决现有语义文本嵌入模型在捕捉文本语义的同时,往往忽视意识形态细微差异的问题,这限制了其在需要理解政治偏见的任务中的有效性。解决方案的关键在于提出PRISM框架,该框架通过两个关键阶段实现可解释的政治偏见嵌入:首先,从弱标签新闻数据中系统提取细粒度政治话题及其对应的偏见指标;其次,基于这些指标为新闻文章分配结构化的偏见评分。这一方法确保嵌入明确关联到揭示偏见的维度,从而提升可解释性和预测能力。
链接: https://arxiv.org/abs/2505.24646
作者: Yiqun Sun,Qiang Huang,Anthony K. H. Tung,Jun Yu
机构: National University of Singapore (新加坡国立大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳))
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025
Abstract:Semantic Text Embedding is a fundamental NLP task that encodes textual content into vector representations, where proximity in the embedding space reflects semantic similarity. While existing embedding models excel at capturing general meaning, they often overlook ideological nuances, limiting their effectiveness in tasks that require an understanding of political bias. To address this gap, we introduce PRISM, the first framework designed to Produce inteRpretable polItical biaS eMbeddings. PRISM operates in two key stages: (1) Controversial Topic Bias Indicator Mining, which systematically extracts fine-grained political topics and their corresponding bias indicators from weakly labeled news data, and (2) Cross-Encoder Political Bias Embedding, which assigns structured bias scores to news articles based on their alignment with these indicators. This approach ensures that embeddings are explicitly tied to bias-revealing dimensions, enhancing both interpretability and predictive power. Through extensive experiments on two large-scale datasets, we demonstrate that PRISM outperforms state-of-the-art text embedding models in political bias classification while offering highly interpretable representations that facilitate diversified retrieval and ideological analysis. The source code is available at this https URL.
zh
[NLP-44] Are Optimal Algorithms Still Optimal? Rethinking Sorting in LLM-Based Pairwise Ranking with Batching and Caching
【速读】: 该论文试图解决传统排序算法在基于成对排名提示(pairwise ranking prompting, PRP)场景下的效率评估问题,其核心在于重新定义成本模型,将焦点从传统的成对比较转向大型语言模型(LLM)推理成本。解决方案的关键在于揭示昂贵的LLM推理会颠覆传统基于比较次数的效率预测,并提出通过批量处理和缓存等策略来降低推理成本,从而优化算法在实际应用中的效率表现。
链接: https://arxiv.org/abs/2505.24643
作者: Juan Wisznia,Cecilia Bolaños,Juan Tollo,Giovanni Marraffini,Agustín Gianolini,Noe Hsueh,Luciano Del Corro
机构: Departamento de Computación, FCEyN, Universidad de Buenos Aires (计算机系,精确与自然科学学院,布宜诺斯艾利斯大学); Instituto de Ciencias de la Computación, FCEyN, Universidad de Buenos Aires (计算机科学研究所,精确与自然科学学院,布宜诺斯艾利斯大学); Lumina Labs (Lumina实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:We introduce a novel framework for analyzing sorting algorithms in pairwise ranking prompting (PRP), re-centering the cost model around LLM inferences rather than traditional pairwise comparisons. While classical metrics based on comparison counts have traditionally been used to gauge efficiency, our analysis reveals that expensive LLM inferences overturn these predictions; accordingly, our framework encourages strategies such as batching and caching to mitigate inference costs. We show that algorithms optimal in the classical setting can lose efficiency when LLM inferences dominate the cost under certain optimizations.
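摘要建议的批量(batching)与缓存(caching)策略可作如下示意:把一次 LLM 推理视为主要成本,比较结果入缓存并利用反对称性复用;llm_compare_batch 为假设的批量比较接口,返回值取 {-1, 0, 1},并非某个真实库的 API。

```python
from functools import cmp_to_key

cache = {}  # (query, a, b) -> -1/0/1,避免重复的 LLM 推理

def cached_compare(llm_compare_batch, query, a, b):
    key = (query, a, b)
    if key not in cache:
        # 实际系统可先攒一批未命中的文档对,再一次批量推理摊薄成本
        result = llm_compare_batch(query, [(a, b)])[0]
        cache[key] = result
        cache[(query, b, a)] = -result  # 反对称性:比较一次,两个方向都可用
    return cache[key]

def prp_sort(llm_compare_batch, query, docs):
    # 经典排序算法的比较次数不变,但命中缓存的比较不再触发推理
    return sorted(docs, key=cmp_to_key(
        lambda a, b: cached_compare(llm_compare_batch, query, a, b)))
```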
zh
[NLP-45] Efficient Text Encoders for Labor Market Analysis
【速读】: 该论文旨在解决劳动力市场分析中从非结构化职位广告中高效提取技能信息的问题,现有方法依赖于计算成本高且速度慢的大语言模型(Large Language Models, LLMs)。其解决方案的关键在于提出一种新的对比学习方法——ConTeXT-match,该方法结合了词粒度注意力机制,适用于极端多标签分类任务,显著提升了技能分类的效率和性能,并采用轻量级双编码器模型实现了最先进的结果。
链接: https://arxiv.org/abs/2505.24640
作者: Jens-Joris Decorte,Jeroen Van Hautte,Chris Develder,Thomas Demeester
机构: TechWolf (科技狼); Ghent University (根特大学); imec; IDLab (互联网与数据科学实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Labor market analysis relies on extracting insights from job advertisements, which provide valuable yet unstructured information on job titles and corresponding skill requirements. While state-of-the-art methods for skill extraction achieve strong performance, they depend on large language models (LLMs), which are computationally expensive and slow. In this paper, we propose ConTeXT-match, a novel contrastive learning approach with token-level attention that is well-suited for the extreme multi-label classification task of skill classification. ConTeXT-match significantly improves skill extraction efficiency and performance, achieving state-of-the-art results with a lightweight bi-encoder model. To support robust evaluation, we introduce Skill-XL, a new benchmark with exhaustive, sentence-level skill annotations that explicitly address the redundancy in the large label space. Finally, we present JobBERT V2, an improved job title normalization model that leverages extracted skills to produce high-quality job title representations. Experiments demonstrate that our models are efficient, accurate, and scalable, making them ideal for large-scale, real-time labor market analysis.
zh
[NLP-46] Disentangling Language and Culture for Evaluating Multilingual Large Language Models ACL2025
【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)在多语言能力评估中缺乏全面性和文化语境考量的问题。传统评估方法往往忽视了语言与文化之间的相互作用,导致对模型跨语言和跨文化处理能力的评价不够准确。解决方案的关键在于提出一种双维度评估框架(Dual Evaluation Framework),通过将评估分解为语言媒介和文化背景两个维度,实现对LLMs在本族语和跨文化语境下处理问题能力的细致分析。该框架揭示了“文化语言协同效应”现象,并通过可解释性探针进一步验证了文化语境对模型性能的影响,为多语言模型的评估提供了新的视角和指标。
链接: https://arxiv.org/abs/2505.24635
作者: Jiahao Ying,Wei Tang,Yiran Zhao,Yixin Cao,Yu Rong,Wenxuan Zhang
机构: Singapore Management University (新加坡管理大学); University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学); Institute of Trustworthy Embodied AI, Fudan University (复旦大学可信具身人工智能研究所); DAMO Academy, Alibaba Group (阿里达摩院); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 (Main Conference)
Abstract:This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs’ ability to process questions within both native and cross-cultural contexts cross-lingually. Extensive evaluations are conducted on a wide range of models, revealing a notable “Cultural-Linguistic Synergy” phenomenon, where models exhibit better performance when questions are culturally aligned with the language. This phenomenon is further explored through interpretability probing, which shows that a higher proportion of specific neurons are activated in a language’s cultural context. This activation proportion could serve as a potential indicator for evaluating multilingual performance during model training. Our findings challenge the prevailing notion that LLMs, primarily trained on English data, perform uniformly across languages and highlight the necessity of culturally and linguistically grounded model evaluations. Our code can be found at this https URL.
zh
[NLP-47] The Hallucination Dilemma: Factuality-Aware Reinforcement Learning for Large Reasoning Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在通过强化学习(Reinforcement Learning, RL)优化进行推理任务时,因推理导向的RL微调导致幻觉(hallucination)现象显著增加的问题。解决方案的关键在于提出一种名为事实感知分步策略优化(Factuality-aware Step-wise Policy Optimization, FSPO)的创新强化学习微调算法,该算法在每个推理步骤中引入显式的事实验证机制,通过自动化验证给定证据来动态调整令牌级别的优势值,从而在整个推理过程中激励事实正确性。
链接: https://arxiv.org/abs/2505.24630
作者: Junyi Li,Hwee Tou Ng
机构: National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have significantly advanced in reasoning tasks through reinforcement learning (RL) optimization, achieving impressive capabilities across various challenging benchmarks. However, our empirical analysis reveals a critical drawback: reasoning-oriented RL fine-tuning significantly increases the prevalence of hallucinations. We theoretically analyze the RL training dynamics, identifying high-variance gradient, entropy-induced randomness, and susceptibility to spurious local optima as key factors leading to hallucinations. To address this drawback, we propose Factuality-aware Step-wise Policy Optimization (FSPO), an innovative RL fine-tuning algorithm incorporating explicit factuality verification at each reasoning step. FSPO leverages automated verification against given evidence to dynamically adjust token-level advantage values, incentivizing factual correctness throughout the reasoning process. Experiments across mathematical reasoning and hallucination benchmarks using Qwen2.5 and Llama models demonstrate that FSPO effectively reduces hallucinations while enhancing reasoning accuracy, substantially improving both reliability and performance.
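FSPO 中“按推理步骤做事实核查并据此调节 token 级优势值”的机制可作如下示意;verify_step 为假设的自动验证接口,奖励/惩罚幅度为假设超参数,并非论文原始算法。

```python
import numpy as np

def factuality_adjusted_advantages(step_token_advs, steps, evidence,
                                   verify_step, bonus=0.5, penalty=-0.5):
    """对每个推理步骤做事实核查,并据此平移该步骤内所有 token 的优势值。"""
    adjusted = []
    for advs, step in zip(step_token_advs, steps):
        # 步骤通过给定证据的验证则加分,否则减分,从而在整个推理过程中
        # 激励事实正确性(幅度为示意值)
        delta = bonus if verify_step(step, evidence) else penalty
        adjusted.append(np.asarray(advs, dtype=float) + delta)
    return adjusted
```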
zh
[NLP-48] Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在密码分析(cryptanalysis)领域的评估不足问题,即尚未充分探索LLMs在解密加密文本方面的潜力。解决方案的关键在于引入一个新颖的基准数据集,该数据集包含跨多个领域、长度、写作风格和主题的明文文本及其对应的加密版本,并通过零样本(zero-shot)和少样本(few shot)设置评估多种LLMs在不同加密方案下的解密准确性和语义理解能力。
链接: https://arxiv.org/abs/2505.24621
作者: Utsav Maskey,Chencheng Zhu,Usman Naseem
机构: Macquarie University (麦考瑞大学); University of New South Wales (新南威尔士大学)
类目: Computation and Language (cs.CL)
备注: Preprint
Abstract:Recent advancements in Large Language Models (LLMs) have transformed natural language understanding and generation, leading to extensive benchmarking across diverse tasks. However, cryptanalysis, a critical area for data security and encryption, has not yet been thoroughly explored in LLM evaluations. To address this gap, we evaluate the cryptanalytic potential of state-of-the-art LLMs on encrypted texts generated using a range of cryptographic algorithms. We introduce a novel benchmark dataset comprising diverse plain texts, spanning various domains, lengths, writing styles, and topics, paired with their encrypted versions. Using zero-shot and few-shot settings, we assess multiple LLMs for decryption accuracy and semantic comprehension across different encryption schemes. Our findings reveal key insights into the strengths and limitations of LLMs in side-channel communication while raising concerns about their susceptibility to jailbreaking attacks. This research highlights the dual-use nature of LLMs in security contexts and contributes to the ongoing discussion on AI safety and security.
zh
[NLP-49] Interpretable phenotyping of Heart Failure patients with Dutch discharge letters
【速读】: 该论文旨在解决心力衰竭(Heart Failure, HF)患者表型异质性对治疗和预后的影响问题,通过整合结构化和非结构化数据构建表型分类模型。其解决方案的关键在于利用出院小结(discharge letters)作为主要信息源,并采用基于Transformer的黑盒模型与Aug-Linear白盒模型进行比较,结果显示Aug-Linear模型在保持与黑盒模型相当性能的同时,提供了更符合临床医生解释的可解释性,从而支持透明的临床决策。
链接: https://arxiv.org/abs/2505.24619
作者: Vittorio Torri,Machteld J. Boonstra,Marielle C. van de Veerdonk,Deborah N. Kalkman,Alicia Uijl,Francesca Ieva,Ameen Abu-Hanna,Folkert W. Asselbergs,Iacer Calixto
机构: Politecnico di Milano (米兰理工大学); Amsterdam University Medical Center (阿姆斯特丹大学医学中心); University of Amsterdam (阿姆斯特丹大学); Amsterdam Cardiovascular Sciences (阿姆斯特丹心血管科学); Amsterdam Public Health (阿姆斯特丹公共卫生); Amsterdam Reproduction & Development (阿姆斯特丹生殖与发育); Södersjukhuset (索德医院); Karolinska Institutet (卡罗林斯卡学院); Human Technopole (人类技术园区)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 43 pages, 8 figures
Abstract:Objective: Heart failure (HF) patients present with diverse phenotypes affecting treatment and prognosis. This study evaluates models for phenotyping HF patients based on left ventricular ejection fraction (LVEF) classes, using structured and unstructured data, assessing performance and interpretability. Materials and Methods: The study analyzes all HF hospitalizations at both Amsterdam UMC hospitals (AMC and VUmc) from 2015 to 2023 (33,105 hospitalizations, 16,334 patients). Data from AMC were used for model training, and from VUmc for external validation. The dataset was unlabelled and included tabular clinical measurements and discharge letters. Silver labels for LVEF classes were generated by combining diagnosis codes, echocardiography results, and textual mentions. Gold labels were manually annotated for 300 patients for testing. Multiple Transformer-based (black-box) and Aug-Linear (white-box) models were trained and compared with baselines on structured and unstructured data. To evaluate interpretability, two clinicians annotated 20 discharge letters by highlighting information they considered relevant for LVEF classification. These were compared to SHAP and LIME explanations from black-box models and the inherent explanations of Aug-Linear models. Results: BERT-based and Aug-Linear models, using discharge letters alone, achieved the highest classification results (AUC=0.84 for BERT, 0.81 for Aug-Linear on external validation), outperforming baselines. Aug-Linear explanations aligned more closely with clinicians’ explanations than post-hoc explanations on black-box models. Conclusions: Discharge letters emerged as the most informative source for phenotyping HF patients. Aug-Linear models matched black-box performance while providing clinician-aligned interpretability, supporting their use in transparent clinical decision-making.
zh
[NLP-50] Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在俄语环境下生成能力评估的不足问题,特别是传统评估方法存在资源消耗大、缺乏透明性和可解释性的问题。其解决方案的关键在于提出一种新颖的评估方法,通过定义详细的评估标准和开发评分协议,使模型能够对生成结果进行自我评估并提供理由,从而实现透明、基于标准的评估,替代传统的高成本人工对比方式。此外,论文还构建了一个包含35种任务类型的细粒度分类体系,并提供了基于LLM-as-a-Judge的评估工具,以支持可扩展、可解释的模型评估与标注。
链接: https://arxiv.org/abs/2505.24616
作者: Nikita Martynov,Anastasia Mordasheva,Dmitriy Gorbetskiy,Danil Astafurov,Ulyana Isaeva,Elina Basyrova,Sergey Skachkov,Victoria Berestova,Nikolay Ivanov,Valeriia Zanina,Alena Fenogenova
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 179 pages
Abstract:We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B) evaluators trained for nuanced assessment of generative outputs. This approach provides scalable, interpretable evaluation and annotation tools for model development, effectively replacing costly and less precise human judgments.
zh
[NLP-51] Harnessing Large Language Models for Scientific Novelty Detection
【速读】: 该论文试图解决科学新颖性检测(Scientific Novelty Detection, ND)中由于缺乏合适的基准数据集以及现有自然语言处理(NLP)技术在文本相似性与概念创新之间存在差距而导致的难题。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)来提取论文的闭包集合,并基于LLMs总结其核心思想,同时通过知识蒸馏训练轻量级检索器,以对齐具有相似概念的创新点,从而实现高效且准确的科学新颖性检测。
链接: https://arxiv.org/abs/2505.24615
作者: Yan Liu,Zonglin Yang,Soujanya Poria,Thanh-Son Nguyen,Erik Cambria
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages, 3 figures, 3 tables
Abstract:In an era of exponential scientific growth, identifying novel research ideas is crucial and challenging in academia. Despite potential, the lack of an appropriate benchmark dataset hinders the research of novelty detection. More importantly, simply adopting existing NLP technologies, e.g., retrieving and then cross-checking, is not a one-size-fits-all solution due to the gap between textual similarity and idea conception. In this paper, we propose to harness large language models (LLMs) for scientific novelty detection (ND), associated with two new datasets in marketing and NLP domains. To construct the considerate datasets for ND, we propose to extract closure sets of papers based on their relationship, and then summarize their main ideas based on LLMs. To capture idea conception, we propose to train a lightweight retriever by distilling the idea-level knowledge from LLMs to align ideas with similar conception, enabling efficient and accurate idea retrieval for LLM novelty detection. Experiments show our method consistently outperforms others on the proposed benchmark datasets for idea retrieval and ND tasks. Codes and data are available at this https URL.
zh
[NLP-52] When Harry Meets Superman: The Role of The Interlocutor in Persona-Based Dialogue Generation
【速读】: 该论文试图解决对话代理在生成对话时如何有效适配对话对象(interlocutor)的个人资料问题,而不仅仅是遵循提供的角色设定(persona)。其关键解决方案在于系统性地评估模型在不同对话情境下的表现,包括对对话对象信息的敏感度、面对熟悉与不熟悉对话对象及话题时的鲁棒性,以及针对特定角色对话进行微调的效果。通过遮蔽或披露对话对象信息,研究揭示了对话生成质量与对话对象信息可访问性之间的关系,强调了对话对象信息在提升对话一致性与可识别性中的重要作用。
链接: https://arxiv.org/abs/2505.24613
作者: Daniela Occhipinti,Marco Guerini,Malvina Nissim
机构: Fondazione Bruno Kessler (布鲁诺·凯塞尔基金会); University of Trento (特伦托大学); University of Groningen (格罗宁根大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Endowing dialogue agents with persona information has proven to significantly improve the consistency and diversity of their generations. While much focus has been placed on aligning dialogues with provided personas, the adaptation to the interlocutor’s profile remains largely underexplored. In this work, we investigate three key aspects: (1) a model’s ability to align responses with both the provided persona and the interlocutor’s; (2) its robustness when dealing with familiar versus unfamiliar interlocutors and topics, and (3) the impact of additional fine-tuning on specific persona-based dialogues. We evaluate dialogues generated with diverse speaker pairings and topics, framing the evaluation as an author identification task and employing both LLM-as-a-judge and human evaluations. By systematically masking or disclosing information about the interlocutor, we assess its impact on dialogue generation. Results show that access to the interlocutor’s persona improves the recognition of the target speaker, while masking it does the opposite. Although models generalise well across topics, they struggle with unfamiliar interlocutors. Finally, we found that in zero-shot settings, LLMs often copy biographical details, facilitating identification but trivialising the task.
zh
[NLP-53] Explainable Depression Detection using Masked Hard Instance Mining
【速读】: 该论文试图解决文本抑郁检测中模型预测可解释性不足的问题(explanability),尽管现有方法能够提供预测结果,但缺乏对模型决策过程的理解,这可能影响系统的可信度。其解决方案的关键在于引入掩码硬实例挖掘(Masked Hard Instance Mining, MHIM),通过有策略地掩码模型中的注意力权重,促使模型在更广泛的显著特征上分配注意力,从而提升模型的可解释性与预测性能。
链接: https://arxiv.org/abs/2505.24609
作者: Patawee Prakrankamanant,Shinji Watanabe,Ekapol Chuangsuwanich
机构: Chulalongkorn University (朱拉隆功大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper addresses the critical need for improved explainability in text-based depression detection. While offering predictive outcomes, current solutions often overlook the understanding of model predictions which can hinder trust in the system. We propose the use of Masked Hard Instance Mining (MHIM) to enhance the explainability in the depression detection task. MHIM strategically masks attention weights within the model, compelling it to distribute attention across a wider range of salient features. We evaluate MHIM on two datasets representing distinct languages: Thai (Thai-Maywe) and English (DAIC-WOZ). Our results demonstrate that MHIM significantly improves performance in terms of both prediction accuracy and explainability metrics.
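MHIM“有策略地屏蔽注意力权重、迫使注意力分散到更广特征”的做法,可用 PyTorch 作如下示意;屏蔽比例与 top-k 选择方式均为假设,并非论文的精确配置。

```python
import torch

def masked_hard_instance_mining(attn_scores: torch.Tensor, mask_ratio=0.1):
    """屏蔽每行打分最高的一部分位置(“硬实例”),再重新归一化注意力。"""
    k = max(1, int(attn_scores.size(-1) * mask_ratio))
    topk = attn_scores.topk(k, dim=-1).indices              # 最大权重的位置
    masked = attn_scores.scatter(-1, topk, float("-inf"))   # 屏蔽这些位置
    return torch.softmax(masked, dim=-1)                    # 注意力被迫分散

attn = torch.randn(2, 16)          # (batch, seq_len) 的注意力打分(示例)
print(masked_hard_instance_mining(attn).sum(dim=-1))  # 每行仍归一化为 1
```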
zh
[NLP-54] Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis ACL2025
【速读】: 该论文试图解决稀疏混合专家(Mixture-of-Experts, MoE)架构中可解释性不足的问题,特别是针对异构设计的MoE模型,现有基于密集模型的归因方法无法有效捕捉动态路由与专家之间的交互。其解决方案的关键在于提出一种跨层级归因算法,用于分析稀疏MoE架构(如Qwen 1.5-MoE、OLMoE、Mixtral-8x7B)与密集模型(如Qwen 1.5-7B、Llama-7B、Mixtral-7B)之间的差异,揭示了MoE模型通过“中激活、晚增强”模式实现每层效率提升37%的机制,并提出了“基础-精炼”框架及语义驱动的路由策略,以增强模型的可解释性与鲁棒性。
链接: https://arxiv.org/abs/2505.24593
作者: Junzhuo Li,Bo Wang,Xiuze Zhou,Peijie Jiang,Jia Liu,Xuming Hu
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Ant Group
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ACL 2025
Abstract:The interpretability of Mixture-of-Experts (MoE) models, especially those with heterogeneous designs, remains underexplored. Existing attribution methods for dense models fail to capture dynamic routing-expert interactions in sparse MoE architectures. To address this issue, we propose a cross-level attribution algorithm to analyze sparse MoE architectures (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B) against dense models (Qwen 1.5-7B, Llama-7B, Mixtral-7B). Results show MoE models achieve 37% higher per-layer efficiency via a “mid-activation, late-amplification” pattern: early layers screen experts, while late layers refine knowledge collaboratively. Ablation studies reveal a “basic-refinement” framework–shared experts handle general tasks (entity recognition), while routed experts specialize in domain-specific processing (geographic attributes). Semantic-driven routing is evidenced by strong correlations between attention heads and experts (r=0.68), enabling task-aware coordination. Notably, architectural depth dictates robustness: deep Qwen 1.5-MoE mitigates expert failures (e.g., 43% MRR drop in geographic tasks when blocking top-10 experts) through shared expert redundancy, whereas shallow OLMoE suffers severe degradation (76% drop). Task sensitivity further guides design: core-sensitive tasks (geography) require concentrated expertise, while distributed-tolerant tasks (object attributes) leverage broader participation. These insights advance MoE interpretability, offering principles to balance efficiency, specialization, and robustness.
zh
[NLP-55] GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Matryoshka Representation Learning and Hybrid Loss Training
【速读】: 该论文旨在解决阿拉伯语在自然语言处理(NLP)中语义文本相似性(Semantic Textual Similarity, STS)任务的资源匮乏问题,由于缺乏高质量的数据集和预训练模型,导致阿拉伯语语义相似性的准确评估与进展受到限制。论文提出的解决方案是构建通用阿拉伯语文本嵌入(General Arabic Text Embedding, GATE)模型,其关键在于采用马特罗什卡表示学习(Matryoshka Representation Learning)和结合阿拉伯语三元组数据集的混合损失训练方法,从而提升模型在需要细粒度语义理解任务中的性能。
链接: https://arxiv.org/abs/2505.24581
作者: Omer Nacar,Anis Koubaa,Serry Sibaee,Yasser Al-Habashi,Adel Ammar,Wadii Boulila
机构: Prince Sultan University (苏丹亲王大学); Alfaisal University (阿尔法伊萨尔大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Semantic textual similarity (STS) is a critical task in natural language processing (NLP), enabling applications in retrieval, clustering, and understanding semantic relationships between texts. However, research in this area for the Arabic language remains limited due to the lack of high-quality datasets and pre-trained models. This scarcity of resources has restricted the accurate evaluation and advance of semantic similarity in Arabic text. This paper introduces General Arabic Text Embedding (GATE) models that achieve state-of-the-art performance on the Semantic Textual Similarity task within the MTEB benchmark. GATE leverages Matryoshka Representation Learning and a hybrid loss training approach with Arabic triplet datasets for Natural Language Inference, which are essential for enhancing model performance in tasks that demand fine-grained semantic understanding. GATE outperforms larger models, including OpenAI, with a 20-25% performance improvement on STS benchmarks, effectively capturing the unique semantic nuances of Arabic.
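GATE 所用的 Matryoshka Representation Learning 的核心思想,是让同一嵌入的多个前缀维度都参与训练,使截断后的低维向量仍然可用。下面是一个结合批内 InfoNCE 对比损失的最小 PyTorch 示意;维度列表与温度均为假设超参数,且这是通用 Matryoshka 写法,并非 GATE 混合损失的全貌。

```python
import torch
import torch.nn.functional as F

def matryoshka_infonce(anchor, positive, dims=(64, 128, 256, 768), tau=0.05):
    """在多个前缀维度上叠加批内对比损失(Matryoshka 思想的通用示意)。"""
    total = 0.0
    for d in dims:
        a = F.normalize(anchor[:, :d], dim=-1)    # 只取前 d 维
        p = F.normalize(positive[:, :d], dim=-1)
        logits = a @ p.T / tau                    # 批内其余样本作为负例
        labels = torch.arange(a.size(0), device=a.device)
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)

anchor, positive = torch.randn(32, 768), torch.randn(32, 768)
print(matryoshka_infonce(anchor, positive))
```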
zh
[NLP-56] NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization ACL2025
【速读】: 该论文试图解决长篇叙事文本(如书籍、电影和电视剧剧本)的摘要生成问题,该任务需要捕捉复杂的剧情线、角色互动和主题连贯性,而现有大语言模型(LLM)在这一领域仍面临挑战。其解决方案的关键在于提出NexusSum框架,该框架通过两个核心创新实现高效摘要:一是对话到描述转换(Dialogue-to-Description Transformation),将角色对话和描述性文本标准化为统一格式以提升连贯性;二是分层多LLM摘要(Hierarchical Multi-LLM Summarization),通过结构化处理流程优化内容切片与输出长度控制,从而生成准确且高质量的摘要。
链接: https://arxiv.org/abs/2505.24575
作者: Hyuntak Kim,Byung-Hak Kim
机构: CJ Corporation (CJ公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to the main track of ACL 2025
Abstract:Summarizing long-form narratives–such as books, movies, and TV scripts–requires capturing intricate plotlines, character interactions, and thematic coherence, a task that remains challenging for existing LLMs. We introduce NexusSum, a multi-agent LLM framework for narrative summarization that processes long-form text through a structured, sequential pipeline–without requiring fine-tuning. Our approach introduces two key innovations: (1) Dialogue-to-Description Transformation: A narrative-specific preprocessing method that standardizes character dialogue and descriptive text into a unified format, improving coherence. (2) Hierarchical Multi-LLM Summarization: A structured summarization pipeline that optimizes chunk processing and controls output length for accurate, high-quality summaries. Our method establishes a new state-of-the-art in narrative summarization, achieving up to a 30.0% improvement in BERTScore (F1) across books, movies, and TV scripts. These results demonstrate the effectiveness of multi-agent LLMs in handling long-form content, offering a scalable approach for structured summarization in diverse storytelling domains.
zh
[NLP-57] Improving Language and Modality Transfer in Translation by Character-level Modeling ACL2025
【速读】: 该论文试图解决当前翻译系统覆盖语言种类有限(仅涵盖全球语言的5%)的问题,特别是如何有效扩展至长尾的低资源语言。解决方案的关键在于提出一种基于字符的方法,通过SONAR多语言固定大小嵌入空间实现跨语言和跨模态的知识迁移,结合教师-学生框架与并行翻译数据构建字符级编码器,并利用自动语音识别(ASR)数据训练轻量适配器,从而实现对1,000多种语言的语音翻译。该方法在文本翻译和语音到文本翻译任务中均表现出优于传统子词模型的性能,尤其在低资源场景和零样本泛化能力方面具有显著优势。
链接: https://arxiv.org/abs/2505.24561
作者: Ioannis Tsiamas,David Dale,Marta R. Costa-jussà
机构: FAIR at Meta; Universitat Politècnica de Catalunya
类目: Computation and Language (cs.CL)
备注: ACL 2025
Abstract:Current translation systems, despite being highly multilingual, cover only 5% of the world’s languages. Expanding language coverage to the long-tail of low-resource languages requires data-efficient methods that rely on cross-lingual and cross-modal knowledge transfer. To this end, we propose a character-based approach to improve adaptability to new languages and modalities. Our method leverages SONAR, a multilingual fixed-size embedding space with different modules for encoding and decoding. We use a teacher-student approach with parallel translation data to obtain a character-level encoder. Then, using ASR data, we train a lightweight adapter to connect a massively multilingual CTC ASR model (MMS), to the character-level encoder, potentially enabling speech translation from 1,000+ languages. Experimental results in text translation for 75 languages on FLORES+ demonstrate that our character-based approach can achieve better language transfer than traditional subword-based models, especially outperforming them in low-resource settings, and demonstrating better zero-shot generalizability to unseen languages. Our speech adaptation, maximizing knowledge transfer from the text modality, achieves state-of-the-art results in speech-to-text translation on the FLEURS benchmark on 33 languages, surpassing previous supervised and cascade models, albeit being a zero-shot model with minimal supervision from ASR data.
zh
[NLP-58] Bench4KE: Benchmarking Automated Competency Question Generation
【速读】: 该论文试图解决知识工程(Knowledge Engineering, KE)自动化工具评估缺乏标准化的问题,这一问题影响了方法的严谨性以及结果的可复制性和比较性。解决方案的关键是引入Bench4KE,一个基于API的可扩展基准测试系统,其首次发布专注于评估自动生成能力问题(Competency Questions, CQs)的工具。Bench4KE提供了一个经过筛选的黄金标准数据集,并利用一系列相似性度量来评估生成的CQs质量,同时为未来研究建立了基准。
链接: https://arxiv.org/abs/2505.24554
作者: Anna Sofia Lippolis,Minh Davide Ragagni,Paolo Ciancarini,Andrea Giovanni Nuzzolese,Valentina Presutti
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The availability of Large Language Models (LLMs) presents a unique opportunity to reinvigorate research on Knowledge Engineering (KE) automation, a trend already evident in recent efforts developing LLM-based methods and tools for the automatic generation of Competency Questions (CQs). However, the evaluation of these tools lacks standardisation. This undermines the methodological rigour and hinders the replication and comparison of results. To address this gap, we introduce Bench4KE, an extensible API-based benchmarking system for KE automation. Its first release focuses on evaluating tools that generate CQs automatically. CQs are natural language questions used by ontology engineers to define the functional requirements of an ontology. Bench4KE provides a curated gold standard consisting of CQ datasets from four real-world ontology projects. It uses a suite of similarity metrics to assess the quality of the CQs generated. We present a comparative analysis of four recent CQ generation systems, which are based on LLMs, establishing a baseline for future research. Bench4KE is also designed to accommodate additional KE automation tasks, such as SPARQL query generation, ontology testing and drafting. Code and datasets are publicly available under the Apache 2.0 license.
zh
[NLP-59] CREFT: Sequential Multi-Agent LLM for Character Relation Extraction
【速读】: 该论文旨在解决复杂角色关系在长篇叙事中的提取问题,这一问题在情节分析和剧本评估中具有重要意义,但现有方法难以有效处理具有细微互动的长文本。其解决方案的关键在于提出CREFT框架,该框架利用专门的大型语言模型(Large Language Model, LLM)代理,通过知识蒸馏构建基础角色图,并迭代优化角色构成、关系抽取、角色识别及群体分配,从而提升准确性和完整性。
链接: https://arxiv.org/abs/2505.24553
作者: Ye Eun Chun,Taeyoon Hwang,Seung-won Hwang,Byung-Hak Kim
机构: CJ Corporation (CJ公司); Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding complex character relations is crucial for narrative analysis and efficient script evaluation, yet existing extraction methods often fail to handle long-form narratives with nuanced interactions. To address this challenge, we present CREFT, a novel sequential framework leveraging specialized Large Language Model (LLM) agents. First, CREFT builds a base character graph through knowledge distillation, then iteratively refines character composition, relation extraction, role identification, and group assignments. Experiments on a curated Korean drama dataset demonstrate that CREFT significantly outperforms single-agent LLM baselines in both accuracy and completeness. By systematically visualizing character networks, CREFT streamlines narrative comprehension and accelerates script review – offering substantial benefits to the entertainment, publishing, and educational sectors.
zh
[NLP-60] A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在延长推理路径以提升性能的同时导致的效率下降问题。现有方法多假设过度思考是瓶颈,并尝试通过压缩思维链来提高效率,但常导致性能退化。其解决方案的关键在于提出A*-Thought,一种基于树搜索的统一框架,通过将LRMs的推理过程建模为搜索树,并结合A*搜索算法与特定于推理路径的成本函数,高效地压缩思维链,从而找到信息密度高且成本低的推理路径。此外,还引入了双向重要性估计机制,进一步优化搜索过程,提升效率。
链接: https://arxiv.org/abs/2505.24550
作者: Xiaoang Xu,Shuo Wang,Xu Han,Zhenghao Liu,Huijia Wu,Peipei Li,Zhiyuan Liu,Maosong Sun,Zhaofeng He
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Dept. of Comp. Sci. & Tech., Tsinghua University (清华大学计算机科学与技术系), Beijing, China (中国北京); Northeastern University (东北大学); Institute for AI, Tsinghua University (清华大学人工智能研究院), Beijing, China (中国北京); Beijing National Research Center for Information Science and Technology (北京信息科学与技术国家研究中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Reasoning Models (LRMs) achieve superior performance by extending the thought length. However, a lengthy thinking trajectory leads to reduced efficiency. Most of the existing methods are stuck in the assumption of overthinking and attempt to reason efficiently by compressing the Chain-of-Thought, but this often leads to performance degradation. To address this problem, we introduce A*-Thought, an efficient tree search-based unified framework designed to identify and isolate the most essential thoughts from the extensive reasoning chains produced by these models. It formulates the reasoning process of LRMs as a search tree, where each node represents a reasoning span in the giant reasoning space. By combining the A* search algorithm with a cost function specific to the reasoning path, it can efficiently compress the chain of thought and determine a reasoning path with high information density and low cost. In addition, we also propose a bidirectional importance estimation mechanism, which further refines this search process and enhances its efficiency beyond uniform sampling. Extensive experiments on several advanced math tasks show that A*-Thought effectively balances performance and efficiency over a huge search space. Specifically, A*-Thought can improve the performance of QwQ-32B by 2.39× with low-budget and reduce the output token length by nearly 50% with high-budget. The proposed method is also compatible with several other LRMs, demonstrating its generalization capability. The code can be accessed at: this https URL.
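A*-Thought 将推理过程建模为搜索树并用 A* 搜索压缩思维链,其搜索骨架可示意如下;expand、heuristic、is_answer 均为假设接口,成本函数与预算控制的具体形式以论文为准。

```python
import heapq, itertools

def a_star_thought(root, expand, heuristic, is_answer, budget=256):
    """A* 骨架:f = g(已累计成本) + h(启发式估计),在预算内扩展节点。"""
    tie = itertools.count()  # f 值相同时的平手项,避免比较节点对象本身
    frontier = [(heuristic(root), next(tie), 0.0, root)]
    while frontier and budget > 0:
        _, _, g, node = heapq.heappop(frontier)
        budget -= 1
        if is_answer(node):
            return node  # 找到信息密度高、成本低的推理路径
        for child, step_cost in expand(node):  # 每个子节点是一段推理片段
            g2 = g + step_cost
            heapq.heappush(
                frontier, (g2 + heuristic(child), next(tie), g2, child))
    return None  # 预算内未找到答案
```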
zh
[NLP-61] Cross-Attention Speculative Decoding
【速读】: 该论文试图解决传统基于自注意力机制的推测解码(Speculative Decoding, SD)方法在架构复杂度高、泛化能力差以及训练效率低的问题。其解决方案的关键在于提出一种基于交叉注意力机制的Transformer解码器——Budget EAGLE (Beagle),该模型在不依赖池化或辅助组件的情况下,实现了与先进自注意力SD模型(如EAGLE-v2)相当的性能,同时简化了架构、提高了训练效率并保持了稳定的内存使用。此外,为支持该新型架构的有效训练,作者还提出了Two-Stage Block-Attention Training方法,以提升块级注意力场景下的训练稳定性和收敛效率。
链接: https://arxiv.org/abs/2505.24544
作者: Wei Zhong,Manasa Bharadwaj,Yixiao Wang,Nikhil Verma,Yipeng Ji,Chul Lee
机构: LG Electronics (LG电子)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and harder to generalize across different models. We present Budget EAGLE (Beagle), the first, to our knowledge, cross-attention-based Transformer decoder SD model that achieves performance on par with leading self-attention SD models (EAGLE-v2) while eliminating the need for pooling or auxiliary components, simplifying the architecture, improving training efficiency, and maintaining stable memory usage during training-time simulation. To enable effective training of this novel architecture, we propose Two-Stage Block-Attention Training, a new method that achieves training stability and convergence efficiency in block-level attention scenarios. Extensive experiments across multiple LLMs and datasets show that Beagle achieves competitive inference speedups and higher training efficiency than EAGLE-v2, offering a strong alternative for architectures in speculative decoding.
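“草稿模型通过交叉注意力读取目标模型隐藏状态”的接法,可借 PyTorch 自带的解码器层粗略示意(注意:标准层同时包含自注意力与交叉注意力,各维度均为示例值,并非 Beagle 的精确结构)。

```python
import torch
import torch.nn as nn

# 草稿解码器层:tgt 为草稿侧 token 表示,memory 为目标 LLM 的隐藏状态
draft_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)

draft_tokens = torch.randn(2, 4, 512)    # (batch, 草稿长度, 维度)
target_hidden = torch.randn(2, 16, 512)  # (batch, 目标上下文长度, 维度)
out = draft_layer(tgt=draft_tokens, memory=target_hidden)
print(out.shape)  # torch.Size([2, 4, 512])
```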
zh
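Beagle 属于投机解码(SD)范式。作为背景,下面给出通用投机采样接受规则(draft-then-verify)的极简示意,而非 Beagle 的交叉注意力结构本身:草稿分布 q 提议 token,目标分布 p 以 min(1, p/q) 概率接受,拒绝时从残差分布重采样。分布均以随机向量模拟,且每步独立,仅作演示。

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8  # 词表大小(示意)

def sample(dist):
    return rng.choice(len(dist), p=dist)

def speculative_step(p_target, q_draft, k=4):
    """标准投机采样:草稿提议至多 k 个 token,目标模型按 min(1, p/q) 逐个验证。"""
    accepted = []
    for _ in range(k):
        q = q_draft()                      # 草稿分布(示意:每步独立)
        t = sample(q)
        p = p_target()                     # 目标分布(示意)
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)             # 接受草稿 token
        else:
            resid = np.maximum(p - q, 0)   # 拒绝:从残差分布重采样并终止本轮
            resid /= resid.sum()
            accepted.append(sample(resid))
            break
    return accepted

rand_dist = lambda: rng.dirichlet(np.ones(V))
print(speculative_step(rand_dist, rand_dist))
```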
[NLP-62] Localizing Persona Representations in LLMs
【速读】: 该论文试图解决的问题是:如何及在何处将人格特征(personas)——由特定的人类特征、价值观和信念组成的集合——编码到大型语言模型(Large Language Models, LLMs)的表示空间中。其解决方案的关键在于利用多种降维和模式识别方法,识别出模型中对这些表示编码差异最大的层,并进一步分析选定层中的激活情况,以探究特定人格特征相对于其他特征的编码方式,包括它们的共享与独特嵌入空间。研究发现,分析的人格特征在解码器层的最后三分之一中表现出显著的表示差异,且某些伦理观点如道德虚无主义和功利主义显示出激活重叠,而政治意识形态如保守主义和自由主义则表现为更独立的区域。
链接: https://arxiv.org/abs/2505.24539
作者: Celia Cintas,Miriam Rateike,Erik Miehling,Elizabeth Daly,Skyler Speakman
机构: IBM Research Africa (IBM 研究非洲); IBM Research Europe (IBM 研究欧洲)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a study on how and where personas – defined by distinct sets of human characteristics, values, and beliefs – are encoded in the representation space of large language models (LLMs). Using a range of dimension reduction and pattern recognition methods, we first identify the model layers that show the greatest divergence in encoding these representations. We then analyze the activations within a selected layer to examine how specific personas are encoded relative to others, including their shared and distinct embedding spaces. We find that, across multiple pre-trained decoder-only LLMs, the analyzed personas show large differences in representation space only within the final third of the decoder layers. We observe overlapping activations for specific ethical perspectives – such as moral nihilism and utilitarianism – suggesting a degree of polysemy. In contrast, political ideologies like conservatism and liberalism appear to be represented in more distinct regions. These findings help to improve our understanding of how LLMs internally represent information and can inform future efforts in refining the modulation of specific human traits in LLM outputs. Warning: This paper includes potentially offensive sample statements.
zh
[NLP-63] Don't Erase, Inform! Detecting and Contextualizing Harmful Language in Cultural Heritage Collections
【速读】: 该论文试图解决文化遗产(Cultural Heritage, CH)数据中存在过时或带有偏见的描述问题,这些问题反映了历史上的偏见,并给文化遗产机构(CHIs)在整理这些数据时带来了巨大挑战。解决方案的关键在于开发一个基于人工智能的工具,该工具能够检测CH元数据中的冒犯性术语,并提供其历史背景和当代感知的上下文信息,其核心依赖于与边缘化群体、研究人员和CH专业人员共同构建的多语言词汇库,以及传统自然语言处理技术和大型语言模型(Large Language Models, LLMs)的结合。
链接: https://arxiv.org/abs/2505.24538
作者: Orfeas Menis Mastromichalakis,Jason Liartis,Kristina Rose,Antoine Isaac,Giorgos Stamou
机构: National Technical University of Athens (希腊雅典国立技术大学); DFF - Deutsches Filminstitut & Filmmuseum (德国电影研究所与电影博物馆); Europeana Foundation (欧洲文化基金会); Vrije Universiteit Amsterdam (阿姆斯特丹自由大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Cultural Heritage (CH) data hold invaluable knowledge, reflecting the history, traditions, and identities of societies, and shaping our understanding of the past and present. However, many CH collections contain outdated or offensive descriptions that reflect historical biases. CH Institutions (CHIs) face significant challenges in curating these data due to the vast scale and complexity of the task. To address this, we develop an AI-powered tool that detects offensive terms in CH metadata and provides contextual insights into their historical background and contemporary perception. We leverage a multilingual vocabulary co-created with marginalized communities, researchers, and CH professionals, along with traditional NLP techniques and Large Language Models (LLMs). Available as a standalone web app and integrated with major CH platforms, the tool has processed over 7.9 million records, contextualizing the contentious terms detected in their metadata. Rather than erasing these terms, our approach seeks to inform, making biases visible and providing actionable insights for creating more inclusive and accessible CH collections.
zh
[NLP-64] Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
【速读】: 该论文试图解决在推理阶段对大型语言模型(Large Language Models, LLMs)的多个行为属性进行控制的问题,该问题由于属性间的干扰以及线性引导方法的局限性而具有挑战性,后者假设激活空间中的行为是可加的并需要针对每个属性进行调优。论文提出的解决方案关键在于K-Steering,这是一种统一且灵活的方法,通过在隐藏激活上训练单一的非线性多标签分类器,并在推理时通过梯度计算干预方向,从而避免了线性假设,消除了存储和调优独立属性向量的需要,并实现了无需重新训练的行为动态组合。
链接: https://arxiv.org/abs/2505.24535
作者: Narmeen Oozeer,Luke Marks,Fazl Barez,Amirali Abdullah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning. We introduce K-Steering, a unified and flexible approach that trains a single non-linear multi-label classifier on hidden activations and computes intervention directions via gradients at inference time. This avoids linearity assumptions, removes the need for storing and tuning separate attribute vectors, and allows dynamic composition of behaviors without retraining. To evaluate our method, we propose two new benchmarks, ToneBank and DebateMix, targeting compositional behavioral control. Empirical results across 3 model families, validated by both activation-based classifiers and LLM-based judges, demonstrate that K-Steering outperforms strong baselines in accurately steering multiple behaviors.
zh
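K-Steering 的核心步骤可以概括为:在隐藏激活上训练一个非线性多标签分类器,推理时对目标属性的 logit 求梯度,并沿梯度方向修改激活。下面是一个 PyTorch 层面的示意,其中分类器结构、步长 alpha 均为假设,且分类器未经训练(真实设定中应在带属性标签的激活上训练):

```python
import torch
import torch.nn as nn

d, n_attrs = 64, 3                      # 隐藏维度与属性数(示意)

classifier = nn.Sequential(             # 非线性多标签属性分类器(假设的结构)
    nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, n_attrs)
)

def k_steer(h: torch.Tensor, target: list, alpha: float = 0.5):
    """沿"提升目标属性 logit"的梯度方向修改隐藏激活 h。"""
    h = h.detach().requires_grad_(True)
    logits = classifier(h)
    score = logits[..., target].sum()    # 同时提升多个目标属性,无需逐属性向量
    score.backward()
    direction = h.grad / (h.grad.norm() + 1e-8)
    return (h + alpha * direction).detach()

h = torch.randn(d)
h_steered = k_steer(h, target=[0, 2])   # 同时朝属性 0 与属性 2 引导
print(classifier(h_steered).detach())
```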
[NLP-65] DEEPQUESTION: Systematic Generation of Real-World Challenges for Evaluating LLMs' Performance
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在标准基准测试中表现优异,但在实际任务中性能下降的问题。其解决方案的关键在于提出DeepQuestion,这是一个可扩展的自动化框架,它基于布鲁姆分类法(Bloom’s Taxonomy)增强现有数据集,并生成能够追溯原始解题路径的新问题,用以考察模型的评估性与创造性技能。
链接: https://arxiv.org/abs/2505.24532
作者: Ali Khoramfar,Ali Ramezani,Mohammad Mahdi Mohajeri,Mohammad Javad Dousti,Majid Nili Ahmadabadi,Heshaam Faili
机构: University of Tehran (德黑兰大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:LLMs often excel on standard benchmarks but falter on real-world tasks. We introduce DeepQuestion, a scalable automated framework that augments existing datasets based on Bloom’s taxonomy and creates novel questions that trace original solution paths to probe evaluative and creative skills. Extensive experiments across ten open-source and proprietary models, covering both general-purpose and reasoning LLMs, reveal substantial performance drops (even up to 70% accuracy loss) on higher-order tasks, underscoring persistent gaps in deep reasoning. Our work highlights the need for cognitively diverse benchmarks to advance LLM progress. DeepQuestion and related datasets will be released upon acceptance of the paper.
zh
[NLP-66] Limited-Resource Adapters Are Regularizers, Not Linguists
【速读】: 该论文试图解决低资源克里奥尔语言(Creole languages)技术性能不足的问题,通过跨语言迁移学习来提升其机器翻译(Machine Translation, MT)能力。论文提出的解决方案的关键在于结合适配器蒸馏(adapter souping)与预训练MT模型的交叉注意力微调(cross-attention fine-tuning),以利用相关高资源语言的知识进行语言迁移。然而,研究发现语言相关性与适配器性能之间并无显著关联,且随机初始化的适配器在交叉注意力微调下同样有效,表明该方法的优势主要来源于参数正则化而非语义信息的有效传递。
链接: https://arxiv.org/abs/2505.24525
作者: Marcell Fekete,Nathaniel R. Robinson,Ernests Lavrinovics,E. Djeride Jean-Baptiste,Raj Dabre,Johannes Bjerva,Heather Lent
机构: Aalborg University, Denmark; Johns Hopkins University, USA; University of Ottawa, Canada; IIT Madras, India
类目: Computation and Language (cs.CL)
备注:
Abstract:Cross-lingual transfer from related high-resource languages is a well-established strategy to enhance low-resource language technologies. Prior work has shown that adapters show promise for, e.g., improving low-resource machine translation (MT). In this work, we investigate an adapter souping method combined with cross-attention fine-tuning of a pre-trained MT model to leverage language transfer for three low-resource Creole languages, which exhibit relatedness to different language groups across distinct linguistic dimensions. Our approach improves performance substantially over baselines. However, we find that linguistic relatedness – or even a lack thereof – does not covary meaningfully with adapter performance. Surprisingly, our cross-attention fine-tuning approach appears equally effective with randomly initialized adapters, implying that the benefit of adapters in this setting lies in parameter regularization, and not in meaningful information transfer. We provide analysis supporting this regularization hypothesis. Our findings underscore the reality that neural language processing involves many success factors, and that not all neural methods leverage linguistic knowledge in intuitive ways.
zh
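文中的 adapter souping 在实现上即对多个适配器的同名参数做(加权)平均。下面给出一个与具体框架无关的示意:适配器以 state_dict 风格的字典表示,均匀权重与低秩参数名均为演示假设,实际可按语言相关性加权。

```python
import torch

def soup_adapters(adapter_state_dicts, weights=None):
    """对若干适配器的同名参数做(加权)平均,得到一个 souped 适配器。"""
    n = len(adapter_state_dicts)
    weights = weights or [1.0 / n] * n
    souped = {}
    for key in adapter_state_dicts[0]:
        souped[key] = sum(w * sd[key] for w, sd in zip(weights, adapter_state_dicts))
    return souped

# 用法示意:三个(随机初始化的)适配器各含一对低秩投影矩阵
adapters = [{"lora_A": torch.randn(8, 64), "lora_B": torch.randn(64, 8)} for _ in range(3)]
print({k: v.shape for k, v in soup_adapters(adapters).items()})
```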
[NLP-67] Stress-testing Machine-Generated Text Detection: Shifting Language Models' Writing Style to Fool Detectors ACL2025
【速读】: 该论文试图解决当前机器生成文本(Machine-Generated Text, MGT)检测器在面对具有语言学启发的对抗攻击时表现脆弱的问题,以及现有基准缺乏对真实场景泛化能力评估的不足。解决方案的关键在于通过直接偏好优化(Direct Preference Optimization, DPO)微调语言模型,使MGT风格向人类写作文本(Human-Written Text, HWT)迁移,从而利用检测器对风格线索的依赖性,提升生成文本的隐蔽性,进而测试并揭示现有检测方法的局限性。
链接: https://arxiv.org/abs/2505.24523
作者: Andrea Pedrotti,Michele Papucci,Cristiano Ciaccio,Alessio Miaschi,Giovanni Puccetti,Felice Dell’Orletta,Andrea Esuli
机构: Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo” (CNR-ISTI); Department of Computer Science, University of Pisa; ItaliaNLP Lab, Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at Findings of ACL 2025
Abstract:Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors’ reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and which features are used by detectors to detect MGT texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.
zh
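该工作用直接偏好优化(DPO)把机器文本风格推向人类写作。DPO 的标准目标函数可以用几行 PyTorch 表达;下面的序列级对数似然为示意输入,实际应来自策略模型与参考模型对 chosen/rejected 文本的逐 token 对数似然求和:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO 目标:-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))。
    四个输入均为序列级对数似然(示意)。"""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()

# 示意:更偏好"像人写的"文本(chosen),而非易被检测的机器风格(rejected)
pi_c, pi_r = torch.tensor([-12.3]), torch.tensor([-15.1])
ref_c, ref_r = torch.tensor([-13.0]), torch.tensor([-14.2])
print(dpo_loss(pi_c, pi_r, ref_c, ref_r))
```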
[NLP-68] AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在面对对抗性扰动和隐藏有害意图时的安全性问题。其解决方案的关键在于提出AMIA,一种轻量级的仅推理防御机制,该机制通过(1)自动遮蔽少量与文本无关的图像区域以干扰对抗性扰动,以及(2)进行联合意图分析以在响应生成前揭示并缓解隐藏的有害意图,从而实现有效的防御。
链接: https://arxiv.org/abs/2505.24519
作者: Yuqi Zhang,Yuchun Miao,Zuchao Li,Liang Ding
机构: Wuhan University (武汉大学); The University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 11 pages, 7 figures
Abstract:We introduce AMIA, a lightweight, inference-only defense for Large Vision-Language Models (LVLMs) that (1) Automatically Masks a small set of text-irrelevant image patches to disrupt adversarial perturbations, and (2) conducts joint Intention Analysis to uncover and mitigate hidden harmful intents before response generation. Without any retraining, AMIA improves defense success rates across diverse LVLMs and jailbreak benchmarks from an average of 52.4% to 81.7%, preserves general utility with only a 2% average accuracy drop, and incurs only modest inference overhead. Ablation confirms both masking and intention analysis are essential for a robust safety-utility trade-off.
zh
[NLP-69] TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning for Enhancing LLMs' Social Intelligence
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在社会领域认知发展不足的问题,特别是在后训练阶段提升其社会智能(social intelligence)的挑战。解决方案的关键在于引入一种基于时间感知的分层认知强化学习框架(Temporal-aware Hierarchical Cognitive Reinforcement Learning, TimeHC-RL),该方法通过融合直觉反应(System 1)与深思熟虑(System 2)等多种认知模式,模拟社会互动中复杂的时间依赖性与认知层次,从而有效增强LLMs的社会智能。
链接: https://arxiv.org/abs/2505.24500
作者: Guiyang Hou,Xing Gao,Yuchuan Wu,Xiang Huang,Wenqi Zhang,Zhe Zheng,Yongliang Shen,Jialu Du,Fei Huang,Yongbin Li,Weiming Lu
机构: Zhejiang University (浙江大学); Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 pages, 12 figures
Abstract:Recently, Large Language Models (LLMs) have made significant progress in IQ-related domains that require careful thinking, such as mathematics and coding. However, enhancing LLMs’ cognitive development in social domains, particularly from a post-training perspective, remains underexplored. Recognizing that the social world follows a distinct timeline and requires a richer blend of cognitive modes (from intuitive reactions (System 1) and surface-level thinking to deliberate thinking (System 2)) than mathematics, which primarily relies on System 2 cognition (careful, step-by-step reasoning), we introduce Temporal-aware Hierarchical Cognitive Reinforcement Learning (TimeHC-RL) for enhancing LLMs’ social intelligence. In our experiments, we systematically explore improving LLMs’ social intelligence and validate the effectiveness of the TimeHC-RL method, through five other post-training paradigms and two test-time intervention paradigms on eight datasets with diverse data patterns. Experimental results reveal the superiority of our proposed TimeHC-RL method compared to the widely adopted System 2 RL method. It gives the 7B backbone model wings, enabling it to rival the performance of advanced models like DeepSeek-R1 and OpenAI-O3. Additionally, the systematic exploration from post-training and test-time interventions perspectives to improve LLMs’ social intelligence has uncovered several valuable insights.
zh
[NLP-70] Towards Effective Code-Integrated Reasoning
【速读】: 该论文旨在解决代码集成推理(code-integrated reasoning)中模型在使用外部代码工具时的训练有效性与稳定性问题。其关键解决方案是通过工具增强型强化学习(tool-augmented reinforcement learning)实现交互式学习,同时设计增强的训练策略以平衡探索与稳定性,逐步构建工具使用能力并提升推理性能。
链接: https://arxiv.org/abs/2505.24480
作者: Fei Bai,Yingqian Min,Beichen Zhang,Zhipeng Chen,Wayne Xin Zhao,Lei Fang,Zheng Liu,Zhongyuan Wang,Ji-Rong Wen
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); BAAI (北京智源人工智能研究院); DataCanvas Alaya NeW
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Technical Report on Slow Thinking with LLMs: Code-Integrated Reasoning
Abstract:In this paper, we investigate code-integrated reasoning, where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external code tools effectively, which is supported by tool-augmented reinforcement learning (RL) through interactive learning. Despite its benefits, tool-augmented RL can still suffer from potential instability in the learning dynamics. In light of this challenge, we present a systematic approach to improving the training effectiveness and stability of tool-augmented RL for code-integrated reasoning. Specifically, we develop enhanced training strategies that balance exploration and stability, progressively building tool-use capabilities while improving reasoning performance. Through extensive experiments on five mainstream mathematical reasoning benchmarks, our model demonstrates significant performance improvements over multiple competitive baselines. Furthermore, we conduct an in-depth analysis of the mechanism and effect of code-integrated reasoning, revealing several key insights, such as the extension of model’s capability boundaries and the simultaneous improvement of reasoning efficiency through code integration. All data and code for reproducing this work are available at: this https URL.
zh
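代码集成推理的最小执行环是“生成代码 → 解释器执行 → 反馈并入上下文”。下面用 subprocess 给出一个框架无关的骨架示意,其中 generate 为占位函数,代表任意 LLM 调用;示例未做沙箱隔离,真实系统还需循环多轮并做安全限制:

```python
import subprocess
import sys

def run_code(code: str, timeout: float = 5.0) -> str:
    """在子进程中执行模型生成的 Python 代码并捕获输出(示意,未做沙箱隔离)。"""
    r = subprocess.run([sys.executable, "-c", code],
                       capture_output=True, text=True, timeout=timeout)
    return r.stdout.strip() if r.returncode == 0 else f"[error] {r.stderr.strip()}"

def generate(context: str) -> str:
    """占位:真实场景中由 LLM 依据上下文生成代码。"""
    return "print(sum(i * i for i in range(1, 11)))"

context = "问题:求 1^2 + 2^2 + ... + 10^2。"
code = generate(context)                   # 1) 模型生成代码
feedback = run_code(code)                  # 2) 代码解释器执行
context += f"\n[代码输出] {feedback}"       # 3) 执行反馈并入上下文,供后续推理使用
print(context)
```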
[NLP-71] Leveraging Knowledge Graphs and LLMs for Structured Generation of Misinformation
【速读】: 该论文试图解决由生成式 AI (Generative AI) 加速传播的虚假信息对社会造成的威胁,特别是其对公众舆论、民主稳定和国家安全的影响。解决方案的关键在于利用知识图谱 (Knowledge Graphs, KGs) 作为结构化的语义资源,系统性地生成虚假三元组。通过分析知识图谱的结构特性,如实体间的距离及其谓词关系,识别出可能为假的关系,并以此引导大语言模型 (Large Language Models, LLMs) 生成具有不同可信度的虚假陈述。该方法依赖于公开可用的知识图谱,能够生成人类难以检测的虚假信息,并揭示当前基于 LLM 的检测方法存在的显著局限性。
链接: https://arxiv.org/abs/2505.24479
作者: Sania Nayab,Marco Simoni,Giulio Rossolini
机构: Department of Excellence in Robotics and AI, Scuola Superiore Sant’Anna, Pisa, Italy (卓越机器人与人工智能系,圣安娜高等学院,比萨,意大利); Institute of Informatics and Telematics, National Research Council of Italy (信息与通信技术研究所,意大利国家研究理事会); Sapienza Università di Roma (罗马第一大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:
Abstract:The rapid spread of misinformation, further amplified by recent advances in generative AI, poses significant threats to society, impacting public opinion, democratic stability, and national security. Understanding and proactively assessing these threats requires exploring methodologies that enable structured and scalable misinformation generation. In this paper, we propose a novel approach that leverages knowledge graphs (KGs) as structured semantic resources to systematically generate fake triplets. By analyzing the structural properties of KGs, such as the distance between entities and their predicates, we identify plausibly false relationships. These triplets are then used to guide large language models (LLMs) in generating misinformation statements with varying degrees of credibility. By utilizing structured semantic relationships, our deterministic approach produces misinformation inherently challenging for humans to detect, drawing exclusively upon publicly available KGs (e.g., WikiGraphs). Additionally, we investigate the effectiveness of LLMs in distinguishing between genuine and artificially generated misinformation. Our analysis highlights significant limitations in current LLM-based detection methods, underscoring the necessity for enhanced detection strategies and a deeper exploration of inherent biases in generative models.
zh
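论文利用实体间距离等结构特性挑选“看似合理”的假三元组。下面用 networkx 在一个玩具图谱上演示这一思路;图、谓词与距离阈值均为演示假设,并非论文的具体打分函数:

```python
import networkx as nx
from itertools import combinations

# 示意 KG:节点为实体,边上的 predicate 为真实关系
G = nx.Graph()
G.add_edge("Paris", "France", predicate="capital_of")
G.add_edge("France", "Europe", predicate="located_in")
G.add_edge("Berlin", "Germany", predicate="capital_of")
G.add_edge("Germany", "Europe", predicate="located_in")

def candidate_fake_triplets(G, predicate="capital_of", min_dist=2):
    """选取图上距离 >= min_dist 且无直接边的实体对,生成候选假三元组。"""
    fakes = []
    for u, v in combinations(G.nodes, 2):
        if G.has_edge(u, v):
            continue
        try:
            d = nx.shortest_path_length(G, u, v)
        except nx.NetworkXNoPath:
            continue
        if d >= min_dist:                 # 有路径但相距较远:可信度可调的错误关系
            fakes.append((u, predicate, v))
    return fakes

print(candidate_fake_triplets(G))         # 例如 ('Paris', 'capital_of', 'Germany')
```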
[NLP-72] Optimizing the Interface Between Knowledge Graphs and LLM s for Complex Reasoning
【速读】: 该论文试图解决在将大型语言模型(Large Language Models, LLMs)与知识图谱(Knowledge Graphs, KGs)集成的复杂系统中,超参数优化问题尚未得到充分研究的问题。解决方案的关键在于通过系统性的超参数调优,针对分块、图构建、检索和提示等参数进行优化,并利用多个多跳问答基准测试(如HotPotQA、TwoWikiMultiHop和MuSiQue)进行评估,以实现性能提升。研究结果表明,尽管性能提升具有一致性,但其效果在不同数据集和指标间存在差异,这凸显了调优的价值及标准评估方法的局限性。
链接: https://arxiv.org/abs/2505.24478
作者: Vasilije Markovic,Lazar Obradovic,Laszlo Hajdu,Jovan Pavlovic
机构: Cognee Inc.(Cognee 公司); InnoRenew CoE(InnoRenew 卓越研究中心); University of Primorska, FAMNIT(普里莫尔斯卡大学,FAMNIT 学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This is a preliminary version. A revised and expanded version is in preparation
Abstract:Integrating Large Language Models (LLMs) with Knowledge Graphs (KGs) results in complex systems with numerous hyperparameters that directly affect performance. While such systems are increasingly common in retrieval-augmented generation, the role of systematic hyperparameter optimization remains underexplored. In this paper, we study this problem in the context of Cognee, a modular framework for end-to-end KG construction and retrieval. Using three multi-hop QA benchmarks (HotPotQA, TwoWikiMultiHop, and MuSiQue) we optimize parameters related to chunking, graph construction, retrieval, and prompting. Each configuration is scored using established metrics (exact match, F1, and DeepEval’s LLM-based correctness metric). Our results demonstrate that meaningful gains can be achieved through targeted tuning. While the gains are consistent, they are not uniform, with performance varying across datasets and metrics. This variability highlights both the value of tuning and the limitations of standard evaluation measures. While demonstrating the immediate potential of hyperparameter tuning, we argue that future progress will depend not only on architectural advances but also on clearer frameworks for optimization and evaluation in complex, modular systems.
zh
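此类 KG+LLM 流水线的超参数优化,骨架上就是在分块、检索、提示等参数空间中系统搜索并用固定指标打分。下面是一个最小网格搜索示意,搜索空间与 evaluate 评测函数均为占位假设(真实场景中应在 QA 基准上返回 EM/F1 等指标):

```python
from itertools import product

search_space = {                 # 示意:与分块/检索/提示相关的超参数
    "chunk_size": [256, 512],
    "top_k": [3, 5, 10],
    "prompt_style": ["concise", "cot"],
}

def evaluate(cfg) -> float:
    """占位评测:真实场景中应跑多跳 QA 基准并返回 F1/EM 得分。"""
    return -abs(cfg["chunk_size"] - 512) / 512 + 0.1 * cfg["top_k"]

def grid_search(space, evaluate):
    keys = list(space)
    best_cfg, best_score = None, float("-inf")
    for values in product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

print(grid_search(search_space, evaluate))
```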
[NLP-73] VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation
【速读】: 该论文试图解决低资源语言在处理代码混合(code-mixing)输入时机器翻译系统失效的问题。其解决方案的关键在于构建了VietMix,一个由自然发生的越南语代码混合文本及其专家英文翻译组成的平行语料库,并开发了一个互补的合成数据生成管道,该管道包含过滤机制以确保代码混合模式在句法合理性和语用适当性方面具有可信度。
链接: https://arxiv.org/abs/2505.24472
作者: Hieu Tran,Phuong-Anh Nguyen-Le,Huy Nghiem,Quang-Nhan Nguyen,Wei Ai,Marine Carpuat
机构: University of Maryland, College Park (马里兰大学学院公园分校); Harvard University (哈佛大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. Augmenting this resource, we developed a complementary synthetic data generation pipeline. This pipeline incorporates filtering mechanisms to ensure syntactic plausibility and pragmatic appropriateness in code-mixing patterns. Experimental validation shows our naturalistic and complementary synthetic data boost models’ performance, measured by translation quality estimation scores, of up to 71.84 on COMETkiwi and 81.77 on XCOMET. Triangulating positive results with LLM-based assessments, augmented models are favored over seed fine-tuned counterparts in approximately 49% of judgments (54-56% excluding ties). VietMix and our augmentation methodology advance ecological validity in neural MT evaluations and establish a framework for addressing code-mixed translation challenges across other low-resource pairs.
zh
[NLP-74] CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation
【速读】: 该论文试图解决文化内容在机器翻译中因文化概念差异而导致的语境不足问题,这种问题使得仅依靠语言本身难以准确传达特定地区的意义。解决方案的关键在于利用图像作为文化语境,通过引入CaMMT(一个包含5,800多个图像及其英文和区域语言并行描述的多模态基准数据集),评估五种视觉语言模型(VLMs)在纯文本和文本+图像设置下的表现,结果表明视觉上下文能够显著提升翻译质量,尤其是在处理文化特异性项目(CSIs)、消歧和正确性别使用方面。
链接: https://arxiv.org/abs/2505.24456
作者: Emilio Villa-Cueva,Sholpan Bolatzhanova,Diana Turmakhan,Kareem Elzeky,Henok Biadglign Ademtew,Alham Fikri Aji,Israel Abebe Azime,Jinheon Baek,Frederico Belcavello,Fermin Cristobal,Jan Christian Blaise Cruz,Mary Dabre,Raj Dabre,Toqeer Ehsan,Naome A Etori,Fauzan Farooqui,Jiahui Geng,Guido Ivetta,Thanmay Jayakumar,Soyeong Jeong,Zheng Wei Lim,Aishik Mandal,Sofia Martinelli,Mihail Minkov Mihaylov,Daniil Orel,Aniket Pramanick,Sukannya Purkayastha,Israfel Salazar,Haiyue Song,Tiago Timponi Torrent,Debela Desalegn Yadeta,Injy Hamed,Atnafu Lambebo Tonja,Thamar Solorio
机构: MBZUAI(穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender usage. By releasing CaMMT, we aim to support broader efforts in building and evaluating multimodal translation systems that are better aligned with cultural nuance and regional variation.
zh
[NLP-75] Domain Pre-training Impact on Representations
【速读】: 该论文试图解决预训练语料库对学习到的Transformer表示质量的影响问题,具体关注仅通过预训练过程所引发的表示质量。其解决方案的关键在于验证小规模专业语料库在预训练中的有效性,并指出将通用语料库与专业语料库结合的成功与否取决于目标任务与专业语料库之间的分布相似性。
链接: https://arxiv.org/abs/2505.24455
作者: Cesar Gonzalez-Gutierrez,Ariadna Quattoni
机构: Universitat Politècnica de Catalunya (加泰罗尼亚理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:This empirical study analyzes the effects of the pre-training corpus on the quality of learned transformer representations. We focus on the representation quality induced solely through pre-training. Our experiments show that pre-training on a small, specialized corpus can yield effective representations, and that the success of combining a generic and a specialized corpus depends on the distributional similarity between the target task and the specialized corpus.
zh
[NLP-76] When Large Multimodal Models Confront Evolving Knowledge: Challenges and Pathways
【速读】: 该论文试图解决大规模语言/多模态模型(LLMs/LMMs)在获取动态演变知识时难以避免灾难性遗忘的问题,以及现有知识注入方法在多模态演化知识注入中的有效性不足。其解决方案的关键在于提出EVOKE基准测试以评估多模态模型在现实场景中注入多模态演化知识的能力,并通过实验揭示了持续学习方法(如Replay和MoELoRA)在缓解遗忘方面的有效性,同时指出了文本知识增强在训练阶段的积极作用。
链接: https://arxiv.org/abs/2505.24449
作者: Kailin Jiang,Yuntao Du,Yukai Ding,Yuchen Ren,Ning Jiang,Zhi Gao,Zilong Zheng,Lei Liu,Bin Li,Qing Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language/multimodal models (LLMs/LMMs) store extensive pre-trained knowledge but struggle to maintain consistency with real-world updates, making it difficult to avoid catastrophic forgetting while acquiring evolving knowledge. Previous work focused on constructing textual knowledge datasets and exploring knowledge injection in LLMs, lacking exploration of multimodal evolving knowledge injection in LMMs. To address this, we propose the EVOKE benchmark to evaluate LMMs’ ability to inject multimodal evolving knowledge in real-world scenarios. Meanwhile, a comprehensive evaluation of multimodal evolving knowledge injection revealed two challenges: (1) Existing knowledge injection methods perform terribly on evolving knowledge. (2) Supervised fine-tuning causes catastrophic forgetting, particularly instruction following ability is severely compromised. Additionally, we provide pathways and find that: (1) Text knowledge augmentation during the training phase improves performance, while image augmentation cannot achieve it. (2) Continual learning methods, especially Replay and MoELoRA, effectively mitigate forgetting. Our findings indicate that current knowledge injection methods have many limitations on evolving knowledge, which motivates further research on more efficient and stable knowledge injection methods.
zh
[NLP-77] Exploring the Impact of Occupational Personas on Domain-Specific QA
【速读】: 该论文试图解决的是角色扮演(persona)在领域特定问答(domain-specific QA)任务中是否能够提升性能的问题,特别是探讨不同类型的角色扮演对模型表现的影响。论文的解决方案关键在于引入两种类型的角色扮演:基于职业的专业角色扮演(Profession-Based Personas, PBPs)和基于职业性格的角色扮演(Occupational Personality-Based Personas, OPBPs),并通过多科学领域的实证评估分析其效果,发现PBPs在准确性上略有提升,而OPBPs则可能降低性能,表明角色扮演的相关性并不必然促进知识的有效利用。
链接: https://arxiv.org/abs/2505.24448
作者: Eojin Kang,Jaehyuk Yu,Juae Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent studies on personas have improved the way Large Language Models (LLMs) interact with users. However, the effect of personas on domain-specific question-answering (QA) tasks remains a subject of debate. This study analyzes whether personas enhance specialized QA performance by introducing two types of persona: Profession-Based Personas (PBPs) (e.g., scientist), which directly relate to domain expertise, and Occupational Personality-Based Personas (OPBPs) (e.g., scientific person), which reflect cognitive tendencies rather than explicit expertise. Through empirical evaluations across multiple scientific domains, we demonstrate that while PBPs can slightly improve accuracy, OPBPs often degrade performance, even when semantically related to the task. Our findings suggest that persona relevance alone does not guarantee effective knowledge utilization and that they may impose cognitive constraints that hinder optimal knowledge application. Future research can explore how nuanced distinctions in persona representations guide LLMs, potentially contributing to reasoning and knowledge retrieval that more closely mirror human social conceptualization.
zh
[NLP-78] Model Unlearning via Sparse Autoencoder Subspace Guided Projections
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在需要选择性知识删除时所面临的隐私和安全问题,现有遗忘策略在可解释性或对抗性提示防御方面存在不足。其解决方案的关键在于提出一种名为SAE-Guided Subspace Projection Unlearning (SSPU)的框架,该框架利用稀疏自编码器(Sparse Autoencoder, SAE)特征驱动模型参数空间中的目标更新,实现精确、可解释且鲁棒的遗忘机制。SSPU通过数据驱动的层与特征选择、QR分解构建子空间以及约束优化来控制激活进入“无关”子空间,同时保留保留知识,从而提升模型的可控性和安全性。
链接: https://arxiv.org/abs/2505.24428
作者: Xu Wang,Zihao Li,Benyou Wang,Yan Hu,Difan Zou
机构: The University of Hong Kong (香港大学); New Jersey Institute of Technology (新泽西理工学院); The Chinese University of Hong Kong, Shenzhen (香港中文大学深圳校区)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) store vast amounts of information, making them powerful yet raising privacy and safety concerns when selective knowledge removal is required. Existing unlearning strategies, ranging from gradient-based fine-tuning and model editing to sparse autoencoder (SAE) steering, either lack interpretability or fail to provide a robust defense against adversarial prompts. We propose SAE-Guided Subspace Projection Unlearning (SSPU), a novel framework that leverages SAE features to drive targeted updates in the model’s parameter space, enabling precise, interpretable, and robust unlearning. SSPU’s three-stage pipeline performs data-driven layer and feature selection, subspace construction via QR decomposition, and constrained optimization that controls activations into an “irrelevant” subspace while preserving retained knowledge. Overall, we use SAE features to construct a subspace that supervises unlearning, refining the loss and adding a regularization term to guide interpretable parameter updates. In experiments on the WMDP-Cyber forget set and three utility benchmarks (MMLU, TruthfulQA, GSM8K), SSPU reduces harmful knowledge accuracy by 3.22% compared to the strongest baseline. It also improves adversarial robustness, lowering malicious accuracy under jailbreak prompts compared to baselines. Our findings expose the limitations of prior unlearning methods and demonstrate how interpretable subspace-guided optimization can achieve robust, controllable model behavior.
zh
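SSPU 中“经 QR 分解构建子空间并约束激活”的步骤,可以用 numpy 直观示意:把选中的 SAE 特征方向正交化为子空间基 Q,再用投影算子移除激活中落在该“无关”子空间内的分量。特征方向在此用随机向量代替,仅说明线性代数骨架:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 4                             # 隐藏维度与选中的 SAE 特征数(示意)

features = rng.standard_normal((d, k))   # 假设:选中的 k 个 SAE 特征方向
Q, _ = np.linalg.qr(features)            # 正交化,得到子空间基,形状 (d, k)

P_sub = Q @ Q.T                          # 投影到"无关"子空间
P_keep = np.eye(d) - P_sub               # 投影到其正交补(保留知识所在方向)

h = rng.standard_normal(d)               # 某层激活(示意)
h_unlearned = P_keep @ h                 # 移除落在目标子空间内的分量

print(np.allclose(Q.T @ h_unlearned, 0)) # True:处理后的激活与被遗忘子空间正交
```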
[NLP-79] Donate or Create? Comparing Data Collection Strategies for Emotion-labeled Multimodal Social Media Posts ACL2025
【速读】: 该论文试图解决如何准确建模主观现象(如情绪表达)的问题,特别是在数据标注中如何反映作者意图。其解决方案的关键在于比较研究参与者创作的内容(study-created content)与真实世界中捐赠的内容(genuine content)在多模态社交媒体帖子中的差异,并评估这些差异对模型性能的影响。研究发现,研究创作的内容在长度、情感表达方式及关注事件类型上与真实内容存在显著差异,这提示在训练模型时需结合真实数据以获得更有效的泛化能力。
链接: https://arxiv.org/abs/2505.24427
作者: Christopher Bagdon,Aidan Combs,Carina Silberer,Roman Klinger
机构: University of Bamberg(班贝格大学); The Ohio State University(俄亥俄州立大学); University of Stuttgart(斯图加特大学)
类目: Computation and Language (cs.CL)
备注: Published at ACL 2025
Abstract:Accurate modeling of subjective phenomena such as emotion expression requires data annotated with authors’ intentions. Commonly such data is collected by asking study participants to donate and label genuine content produced in the real world, or create content fitting particular labels during the study. Asking participants to create content is often simpler to implement and presents fewer risks to participant privacy than data donation. However, it is unclear if and how study-created content may differ from genuine content, and how differences may impact models. We collect study-created and genuine multimodal social media posts labeled for emotion and compare them on several dimensions, including model performance. We find that compared to genuine posts, study-created posts are longer, rely more on their text and less on their images for emotion expression, and focus more on emotion-prototypical events. The samples of participants willing to donate versus create posts are demographically different. Study-created data is valuable to train models that generalize well to genuine data, but realistic effectiveness estimates require genuine data.
zh
[NLP-80] MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs
【速读】: 该论文旨在解决生成式 AI(Generative AI)在情感分析(affective analysis)任务中的能力不足问题,特别是其在情感极性、情感强度、情绪分类和情绪强度等关键任务上的表现尚未得到充分探索。解决方案的关键在于构建了MMAFFBen,这是首个面向多语言多模态情感分析的开源基准,涵盖了35种语言的文本、图像和视频模态,并提出了MMAFFIn数据集以及基于此的MMAFFLM-3b和MMAFFLM-7b模型,以支持对生成式 AI 在情感理解能力上的系统评估与优化。
链接: https://arxiv.org/abs/2505.24423
作者: Zhiwei Liu,Lingfei Qian,Qianqian Xie,Jimin Huang,Kailai Yang,Sophia Ananiadou
机构: The University of Manchester (曼彻斯特大学); The Fin AI (Fin AI); School of Artificial Intelligence, Wuhan University (武汉大学人工智能学院)
类目: Computation and Language (cs.CL)
备注: Work in progress
Abstract:Large language models and vision-language models (which we jointly call LMs) have transformed NLP and CV, demonstrating remarkable potential across various fields. However, their capabilities in affective analysis (i.e. sentiment analysis and emotion detection) remain underexplored. This gap is largely due to the absence of comprehensive evaluation benchmarks, and the inherent complexity of affective analysis tasks. In this paper, we introduce MMAFFBen, the first extensive open-source benchmark for multilingual multimodal affective analysis. MMAFFBen encompasses text, image, and video modalities across 35 languages, covering four key affective analysis tasks: sentiment polarity, sentiment intensity, emotion classification, and emotion intensity. Moreover, we construct the MMAFFIn dataset for fine-tuning LMs on affective analysis tasks, and further develop MMAFFLM-3b and MMAFFLM-7b based on it. We evaluate various representative LMs, including GPT-4o-mini, providing a systematic comparison of their affective understanding capabilities. This project is available at this https URL.
zh
[NLP-81] LLMs Are Globally Multilingual Yet Locally Monolingual: Exploring Knowledge Transfer via Language and Thought Theory
【速读】: 该论文试图解决多语言大语言模型(Multilingual Large Language Models, LLMs)在跨语言事实知识回忆中表现不一致的问题。其解决方案的关键在于提出语言到思维(Language-to-Thought, L2T)提示策略,通过分析输入语言、内部认知过程与知识之间的关系,强调将模型的内部思维与任务所需知识对齐对于成功跨语言迁移的重要性。实验结果表明,L2T策略能够减少模型对输入语言的依赖,并在无需基于翻译的学习情况下促进跨语言知识整合。
链接: https://arxiv.org/abs/2505.24409
作者: Eojin Kang,Juae Kim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multilingual large language models (LLMs) open up new possibilities for leveraging information across languages, but their factual knowledge recall remains inconsistent depending on the input language. While previous studies have attempted to address this issue through English-based prompting and evaluation, we explore non-English to English transfer via Language and Thought Theory. This perspective allows us to examine language-thought binding in LLMs and uncover why factual knowledge often fails to transfer effectively. We propose the Language-to-Thought (L2T) prompting strategy, which analyzes the relationship between input language, internal cognitive processes, and knowledge. Experimental results challenge the assumption that English-based approaches consistently outperform other languages and offer a novel insight that aligning the model’s internal thought with the knowledge required for the task is critical for successful cross-lingual transfer. Furthermore, we show that applying L2T during training can alleviate LLMs’ reliance on the input language and facilitate cross-linguistic knowledge integration without translation-based learning. Code and datasets will be available.
zh
[NLP-82] ClueAnchor: Clue-Anchored Knowledge Reasoning Exploration and Optimization for Retrieval-Augmented Generation
【速读】: 该论文试图解决现有检索增强生成(Retrieval-Augmented Generation, RAG)系统在利用检索到的文档时存在不足的问题,即未能有效提取和整合支持准确且可解释推理的关键线索,特别是在相关证据隐含、分散或被噪声掩盖的情况下。解决方案的关键在于提出ClueAnchor框架,该框架通过关键线索锚定的推理探索与优化机制,从检索内容中提取关键线索,并基于不同的知识配置生成多个推理路径,最终通过基于奖励的偏好优化选择最有效的路径,从而提升推理的完整性和鲁棒性。
链接: https://arxiv.org/abs/2505.24388
作者: Hao Chen,Yukun Yan,Sen Mei,Wanxiang Che,Zhenghao Liu,Qi Shi,Xinze Li,Yuchun Fan,Pengcheng Huang,Qiushi Xiong,Zhiyuan Liu,Maosong Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge to improve factuality. However, existing RAG systems frequently underutilize the retrieved documents, failing to extract and integrate the key clues needed to support faithful and interpretable reasoning, especially in cases where relevant evidence is implicit, scattered, or obscured by noise. To address this issue, we propose ClueAnchor, a novel framework for enhancing RAG via clue-anchored reasoning exploration and optimization. ClueAnchor extracts key clues from retrieved content and generates multiple reasoning paths based on different knowledge configurations, optimizing the model by selecting the most effective one through reward-based preference optimization. Experiments show that ClueAnchor significantly outperforms prior RAG baselines in reasoning completeness and robustness. Further analysis confirms its strong resilience to noisy or partially relevant retrieved content, as well as its capability to identify supporting evidence even in the absence of explicit clue supervision during inference.
zh
[NLP-83] Breaking the Gold Standard: Extracting Forgotten Data under Exact Unlearning in Large Language Models
【速读】: 该论文试图解决在使用精确去学习(exact unlearning)方法后,模型仍可能泄露敏感信息的问题。其解决方案的关键在于提出一种新型的数据提取攻击,该攻击利用了去学习前后的模型之间的信息关联,通过引导去学习后的模型并结合标记过滤策略,成功恢复出被移除数据的分布特征,从而揭示了精确去学习在隐私保护上的局限性。
链接: https://arxiv.org/abs/2505.24379
作者: Xiaoyu Wu,Yifei Pang,Terrance Liu,Zhiwei Steven Wu
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Large language models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning – which retrains the model from scratch without the target data – is widely regarded the gold standard, believed to be robust against privacy-related attacks. In this paper, we challenge this assumption by introducing a novel data extraction attack that compromises even exact unlearning. Our method leverages both the pre- and post-unlearning models: by guiding the post-unlearning model using signals from the pre-unlearning model, we uncover patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates – doubling performance in some cases – across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack’s effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. In light of our findings, which suggest that unlearning may, in a contradictory way, increase the risk of privacy leakage, we advocate for evaluation of unlearning methods to consider broader threat models that account not only for post-unlearning models but also for adversarial access to prior checkpoints.
zh
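该攻击用“遗忘前模型的信号”引导遗忘后模型的生成,并辅以 token 过滤。下面以随机 logits 演示一种简化的对比组合规则;系数 gamma 与过滤阈值均为假设,并非论文的原始设置:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 10                                   # 词表大小(示意)

def guided_next_token(logits_pre, logits_post, gamma=0.5, floor=0.01):
    """用遗忘前模型的信号引导遗忘后模型:朝二者差异方向回拉,并过滤低概率 token。"""
    guided = logits_post + gamma * (logits_pre - logits_post)  # 朝"被移除分布"插值
    probs = np.exp(guided - guided.max())
    probs /= probs.sum()
    probs[probs < floor] = 0.0            # token 过滤:去掉几乎不可能的候选
    probs /= probs.sum()
    return int(rng.choice(V, p=probs))

logits_pre = rng.standard_normal(V)       # 遗忘前模型的 logits(示意)
logits_post = rng.standard_normal(V)      # 精确遗忘后模型的 logits(示意)
print(guided_next_token(logits_pre, logits_post))
```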
[NLP-84] LLM Inference Enhanced by External Knowledge: A Survey
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在需要准确、上下文相关的推理任务中,由于参数记忆有限和易产生幻觉(hallucination)所带来的挑战。其解决方案的关键在于利用外部知识来增强LLMs的能力,通过系统性地探索外部知识的整合策略,特别是针对结构化知识(如表格和知识图谱)的分类与集成方法,以提升模型的可解释性、可扩展性和性能。
链接: https://arxiv.org/abs/2505.24377
作者: Yu-Hsuan Lin,Qian-Hui Chen,Yi-Jie Cheng,Jia-Ren Zhang,Yi-Hung Liu,Liang-Yu Hsia,Yun-Nung Chen
机构: National Taiwan University (国立台湾大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advancements in large language models (LLMs) have enhanced natural-language reasoning. However, their limited parametric memory and susceptibility to hallucination present persistent challenges for tasks requiring accurate, context-based inference. To overcome these limitations, an increasing number of studies have proposed leveraging external knowledge to enhance LLMs. This study offers a systematic exploration of strategies for using external knowledge to enhance LLMs, beginning with a taxonomy that categorizes external knowledge into unstructured and structured data. We then focus on structured knowledge, presenting distinct taxonomies for tables and knowledge graphs (KGs), detailing their integration paradigms with LLMs, and reviewing representative methods. Our comparative analysis further highlights the trade-offs among interpretability, scalability, and performance, providing insights for developing trustworthy and generalizable knowledge-enhanced LLMs.
zh
[NLP-85] Knowing Before Saying: LLM Representations Encode Information About Chain-of-Thought Success Before Completion
【速读】: 该论文试图解决如何在不完成整个零样本思维链(zero-shot Chain-of-Thought, CoT)过程的情况下预测其成功性的问题。解决方案的关键在于利用大语言模型(LLM)的初始步骤表示,通过一个探测分类器进行有效判断,表明推理过程中关键信息已存在于早期阶段的表示中。这一发现表明,推理过程在早期停止通常不会导致性能显著下降,从而为优化CoT效率提供了可能。
链接: https://arxiv.org/abs/2505.24362
作者: Anum Afzal,Florian Matthes,Gal Chechik,Yftah Ziser
机构: Technical University of Munich (慕尼黑工业大学); Nvidia Research (英伟达研究院); Bar-Ilan University (巴伊兰大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We investigate whether the success of a zero-shot Chain-of-Thought (CoT) process can be predicted before completion. We discover that a probing classifier, based on LLM representations, performs well even before a single token is generated, suggesting that crucial information about the reasoning process is already present in the initial steps representations. In contrast, a strong BERT-based baseline, which relies solely on the generated tokens, performs worse, likely because it depends on shallow linguistic cues rather than deeper reasoning dynamics. Surprisingly, using later reasoning steps does not always improve classification. When additional context is unhelpful, earlier representations resemble later ones more, suggesting LLMs encode key information early. This implies reasoning can often stop early without loss. To test this, we conduct early stopping experiments, showing that truncating CoT reasoning still improves performance over not using CoT at all, though a gap remains compared to full reasoning. However, approaches like supervised learning or reinforcement learning designed to shorten CoT chains could leverage our classifier's guidance to identify when early stopping is effective. Our findings provide insights that may support such methods, helping to optimize CoT's efficiency while preserving its benefits. Code and data are available at this https URL.
zh
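论文的核心实验是一个标准的探测(probing)设定:以 CoT 生成前的隐藏表示为特征训练分类器,预测推理能否成功。下面用 scikit-learn 在合成数据上给出骨架;真实实验中特征应取自 LLM 某层的隐藏状态,标签为该样本 CoT 是否答对:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 500, 64                            # 样本数与表示维度(示意)

H = rng.standard_normal((n, d))           # 假设:CoT 开始前的隐藏表示
w_true = rng.standard_normal(d)
y = (H @ w_true + 0.5 * rng.standard_normal(n) > 0).astype(int)  # 示意标签:推理是否成功

H_tr, H_te, y_tr, y_te = train_test_split(H, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
print(f"probe accuracy: {probe.score(H_te, y_te):.3f}")  # 显著高于随机即说明表示已含该信息
```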
[NLP-86] Multilingual Gloss-free Sign Language Translation: Towards Building a Sign Language Foundation Model
【速读】: 该论文旨在解决多语言手语翻译(Multilingual Sign Language Translation, MLSLT)中的语言冲突和对齐难题,以提升低资源手语的可访问性。其关键解决方案是提出一种无需词典(gloss-free)的多语言模型,该模型采用双CTC(Connectionist Temporal Classification)目标函数,分别用于手语符号的token级识别和口语文本生成,从而支持多种手语与口语之间的翻译任务。
链接: https://arxiv.org/abs/2505.24355
作者: Sihan Tan,Taro Miyazaki,Kazuhiro Nakadai
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Sign Language Translation (SLT) aims to convert sign language (SL) videos into spoken language text, thereby bridging the communication gap between the sign and the spoken community. While most existing works focus on translating a single sign language into a single spoken language (one-to-one SLT), leveraging multilingual resources could mitigate low-resource issues and enhance accessibility. However, multilingual SLT (MLSLT) remains unexplored due to language conflicts and alignment difficulties across SLs and spoken languages. To address these challenges, we propose a multilingual gloss-free model with dual CTC objectives for token-level SL identification and spoken text generation. Our model supports 10 SLs and handles one-to-one, many-to-one, and many-to-many SLT tasks, achieving competitive performance compared to state-of-the-art methods on three widely adopted benchmarks: multilingual SP-10, PHOENIX14T, and CSL-Daily.
zh
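所谓双 CTC 目标,即在同一编码器输出上接两个 CTC 头:一个做手语 token 级识别,一个做口语文本生成,总损失为二者之和。下面是 PyTorch 形状层面的示意,词表大小与序列长度均为假设:

```python
import torch
import torch.nn as nn

T, N, d = 50, 2, 256                        # 帧数、批大小、编码维度(示意)
V_sign, V_text = 100, 300                   # 手语 token 与口语文本词表(示意,0 为 blank)

enc_out = torch.randn(T, N, d)              # 视频编码器输出(示意)
sign_head = nn.Linear(d, V_sign)            # 手语识别头
text_head = nn.Linear(d, V_text)            # 口语文本生成头
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_branch(head, targets, target_lens):
    log_probs = head(enc_out).log_softmax(-1)             # (T, N, V)
    input_lens = torch.full((N,), T, dtype=torch.long)
    return ctc(log_probs, targets, input_lens, target_lens)

sign_tgt = torch.randint(1, V_sign, (N, 10))              # 手语 token 目标序列(示意)
text_tgt = torch.randint(1, V_text, (N, 12))              # 口语文本目标序列(示意)

loss = ctc_branch(sign_head, sign_tgt, torch.full((N,), 10)) \
     + ctc_branch(text_head, text_tgt, torch.full((N,), 12))  # 双 CTC 联合目标
print(loss)
```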
[NLP-87] Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research ACL2025
【速读】: 该论文旨在解决语言代理(language agents)开发中的关键挑战,包括高昂的工程开销、缺乏标准化组件以及评估框架不足等问题。其解决方案的关键在于提出AGORA框架,该框架通过三个核心贡献进行优化:(1)基于图的工作流引擎、高效内存管理和清晰的组件抽象构成的模块化架构;(2)一套实现最先进推理方法的可复用代理算法;(3)一个严格的评估框架,支持多维度系统性比较。这些创新使得AGORA不仅简化了语言代理的开发,还通过标准化评估协议为可重复的代理研究奠定了基础。
链接: https://arxiv.org/abs/2505.24354
作者: Qianqian Zhang,Jiajia Liao,Heting Ying,Yibo Ma,Haozhan Shen,Jingcheng Li,Peng Liu,Lu Zhang,Chunxin Fang,Kyusong Lee,Ruochen Xu,Tiancheng Zhao
机构: Om AI Research (Om AI 研究院); Binjiang Institute of Zhejiang University (浙江大学滨江研究院); College of Computer Science and Technology, Zhejiang University (计算机科学与技术学院,浙江大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2025 Demo
Abstract:Language agents powered by large language models (LLMs) have demonstrated remarkable capabilities in understanding, reasoning, and executing complex tasks. However, developing robust agents presents significant challenges: substantial engineering overhead, lack of standardized components, and insufficient evaluation frameworks for fair comparison. We introduce Agent Graph-based Orchestration for Reasoning and Assessment (AGORA), a flexible and extensible framework that addresses these challenges through three key contributions: (1) a modular architecture with a graph-based workflow engine, efficient memory management, and clean component abstraction; (2) a comprehensive suite of reusable agent algorithms implementing state-of-the-art reasoning approaches; and (3) a rigorous evaluation framework enabling systematic comparison across multiple dimensions. Through extensive experiments on mathematical reasoning and multimodal tasks, we evaluate various agent algorithms across different LLMs, revealing important insights about their relative strengths and applicability. Our results demonstrate that while sophisticated reasoning approaches can enhance agent capabilities, simpler methods like Chain-of-Thought often exhibit robust performance with significantly lower computational overhead. AGORA not only simplifies language agent development but also establishes a foundation for reproducible agent research through standardized evaluation protocols.
zh
[NLP-88] Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction
【速读】: 该论文旨在解决自动语音识别(Automatic Speech Recognition, ASR)中识别错误纠正的问题,即在保留准确文本的同时修正识别错误。传统方法虽有一定效果,但依赖于训练和标注数据,而大型语言模型(LLMs)虽然提供了新范式,但直接使用会导致幻觉问题,可能修改正确文本。解决方案的关键在于提出可靠的LLM纠正框架(RLLM-CF),该框架包含三个阶段:错误预检测、链式思维子任务迭代纠正以及推理过程验证,无需额外信息或模型微调,确保多轮编程下的LLM纠正准确性。
链接: https://arxiv.org/abs/2505.24347
作者: Yangui Fang,Baixu Cheng,Jing Peng,Xu Li,Yu Xi,Chengwei Zhang,Guohui Zhong
机构: Huazhong University of Science and Technology, School of Electronic Information and Communications (华中科技大学电子信息与通信学院); MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University (教育部人工智能重点实验室,人工智能研究院,X-LANCE实验室,上海交通大学); AISpeech Ltd (AISpeech有限公司)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Automatic Speech Recognition (ASR) error correction aims to correct recognition errors while preserving accurate text. Although traditional approaches demonstrate moderate effectiveness, LLMs offer a paradigm that eliminates the need for training and labeled data. However, directly using LLMs will encounter hallucinations problem, which may lead to the modification of the correct text. To address this problem, we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-tasks iterative correction, and (3) reasoning process verification. The advantage of our method is that it does not require additional information or fine-tuning of the model, and ensures the correctness of the LLM correction under multi-pass programming. Experiments on AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.
zh
[NLP-89] KEVER2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval
【速读】: 该论文旨在解决视觉情感分析(Visual Emotion Analysis, VEA)中的关键挑战,包括图像中的情感线索抽象、重叠和纠缠,导致难以建模和解释;视觉语言模型(Vision-Language Models, VLMs)在将复杂视觉模式与情感语义对齐时面临监督不足和情感定位稀疏的问题;以及现有方法缺乏结构化的情感知识以解决歧义并确保跨多样化视觉领域的连贯情感推理。其解决方案的关键在于提出一种基于知识增强的框架——K-EVER²,通过引入语义结构化的视觉情感线索表示,并利用多模态对齐整合外部情感知识,从而实现无需依赖手工标注标签或直接情感监督的鲁棒且可解释的情感预测。
链接: https://arxiv.org/abs/2505.24342
作者: Fanhang Man,Xiaoyue Chen,Huandong Wang,Baining Zhao,Han Li,Xinlei Chen,Yong Li
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学); Peng Cheng Laboratory(鹏城实验室); Department of Electronic Engineering, Tsinghua University(清华大学电子工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Understanding what emotions images evoke in their viewers is a foundational goal in human-centric visual computing. While recent advances in vision-language models (VLMs) have shown promise for visual emotion analysis (VEA), several key challenges remain unresolved. Emotional cues in images are often abstract, overlapping, and entangled, making them difficult to model and interpret. Moreover, VLMs struggle to align these complex visual patterns with emotional semantics due to limited supervision and sparse emotional grounding. Finally, existing approaches lack structured affective knowledge to resolve ambiguity and ensure consistent emotional reasoning across diverse visual domains. To address these limitations, we propose K-EVER², a knowledge-enhanced framework for emotion reasoning and retrieval. Our approach introduces a semantically structured formulation of visual emotion cues and integrates external affective knowledge through multimodal alignment. Without relying on handcrafted labels or direct emotion supervision, K-EVER² achieves robust and interpretable emotion predictions across heterogeneous image types. We validate our framework on three representative benchmarks, Emotion6, EmoSet, and M-Disaster, covering social media imagery, human-centric scenes, and disaster contexts. K-EVER² consistently outperforms strong CNN and VLM baselines, achieving up to a 19% accuracy gain for specific emotions and a 12.3% average accuracy gain across all emotion categories. Our results demonstrate a scalable and generalizable solution for advancing emotional understanding of visual content.
zh
[NLP-90] Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings ACL2025
【速读】: 该论文试图解决生成式 AI (Generative AI) 在检测经过扰动的有毒中文内容时存在的挑战,特别是针对中文语言的多模态特性导致的检测失效问题。其解决方案的关键在于提出一种包含3种扰动策略和8种具体方法的分类体系,并基于此构建数据集对当前主流的9个生成式 AI 模型进行评估,同时探索了基于上下文学习(in-context learning, ICL)和监督微调(supervised fine-tuning, SFT)的增强方法。研究发现,现有模型在处理多模态扰动内容时表现较差,且少量扰动样本的微调可能导致模型过度校正,误将正常内容识别为有毒内容。
链接: https://arxiv.org/abs/2505.24341
作者: Shujian Yang,Shiyao Cui,Chuanrui Hu,Haicheng Wang,Tianwei Zhang,Minlie Huang,Jialiang Lu,Han Qiu
机构: Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学); Qihoo 360 (奇虎360); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted to ACL 2025 (Findings). Camera-ready version
Abstract:Detecting toxic content using language models is important but challenging. While large language models (LLMs) have demonstrated strong performance in understanding Chinese, recent studies show that simple character substitutions in toxic Chinese text can easily confuse the state-of-the-art (SOTA) LLMs. In this paper, we highlight the multimodal nature of Chinese language as a key challenge for deploying LLMs in toxic Chinese detection. First, we propose a taxonomy of 3 perturbation strategies and 8 specific approaches in toxic Chinese content. Then, we curate a dataset based on this taxonomy, and benchmark 9 SOTA LLMs (from both the US and China) to assess if they can detect perturbed toxic Chinese text. Additionally, we explore cost-effective enhancement solutions like in-context learning (ICL) and supervised fine-tuning (SFT). Our results reveal two important findings. (1) LLMs are less capable of detecting perturbed multimodal Chinese toxic contents. (2) ICL or SFT with a small number of perturbed examples may cause the LLMs to "overcorrect": misidentifying many normal Chinese contents as toxic.
zh
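分类体系中“字符替换”一类扰动可以用一个极简脚本说明:按近音、拼音或拆字映射表替换部分字符。下面的映射表为无害的演示示例,仅用于说明扰动形式本身:

```python
import random

random.seed(0)

# 演示用的无害映射:近音字、拼音与拆字替换,仅示意扰动形式
SUBSTITUTIONS = {
    "你": ["妮", "尼"],      # 近音字替换
    "好": ["hao", "女子"],   # 拼音替换与拆字替换
}

def perturb(text: str, rate: float = 0.5) -> str:
    """对命中映射表的字符按概率 rate 替换,模拟字符级多模态扰动。"""
    out = []
    for ch in text:
        if ch in SUBSTITUTIONS and random.random() < rate:
            out.append(random.choice(SUBSTITUTIONS[ch]))
        else:
            out.append(ch)
    return "".join(out)

print(perturb("你好,你好啊"))   # 例如 -> "妮好,你女子啊"
```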
[NLP-91] GeoVision Labeler: Zero-Shot Geospatial Classification with Vision and Language Models
【速读】: 该论文试图解决在缺乏标注数据的区域中,地理空间图像分类面临的挑战,尤其是在灾害响应和土地利用监测等应用中。其解决方案的关键在于提出GeoVision Labeler (GVL),一个严格意义上的零样本分类框架,通过视觉大语言模型(vLLM)生成丰富的、可读的图像描述,并由传统的大语言模型(LLM)将其映射到用户定义的类别,从而实现灵活且可解释的图像分类。
链接: https://arxiv.org/abs/2505.24340
作者: Gilles Quentin Hacheme,Girmaw Abebe Tadesse,Caleb Robinson,Akram Zaytar,Rahul Dodhia,Juan M. Lavista Ferres
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Classifying geospatial imagery remains a major bottleneck for applications such as disaster response and land-use monitoring-particularly in regions where annotated data is scarce or unavailable. Existing tools (e.g., RS-CLIP) that claim zero-shot classification capabilities for satellite imagery nonetheless rely on task-specific pretraining and adaptation to reach competitive performance. We introduce GeoVision Labeler (GVL), a strictly zero-shot classification framework: a vision Large Language Model (vLLM) generates rich, human-readable image descriptions, which are then mapped to user-defined classes by a conventional Large Language Model (LLM). This modular, and interpretable pipeline enables flexible image classification for a large range of use cases. We evaluated GVL across three benchmarks-SpaceNet v7, UC Merced, and RESISC45. It achieves up to 93.2% zero-shot accuracy on the binary Buildings vs. No Buildings task on SpaceNet v7. For complex multi-class classification tasks (UC Merced, RESISC45), we implemented a recursive LLM-driven clustering to form meta-classes at successive depths, followed by hierarchical classification-first resolving coarse groups, then finer distinctions-to deliver competitive zero-shot performance. GVL is open-sourced at this https URL to catalyze adoption in real-world geospatial workflows.
zh
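GVL 的“先描述、后映射”两段式零样本流水线可以用两个占位接口刻画:describe_image 代表 vLLM 调用,map_to_class 代表 LLM 调用(此处用关键词规则模拟其判断)。接口、示例输出与关键词表均为演示假设:

```python
def describe_image(image_path: str) -> str:
    """占位:调用视觉大模型(vLLM)为影像瓦片生成自然语言描述。"""
    return "Aerial view of dense rooftops along a road grid."   # 示意输出

def map_to_class(description: str, classes: list) -> str:
    """占位:调用 LLM 把描述映射到用户定义类别;此处以关键词规则模拟。"""
    keywords = {"Buildings": ["rooftop", "building"], "No Buildings": ["forest", "water"]}
    for cls in classes:
        if any(k in description.lower() for k in keywords.get(cls, [])):
            return cls
    return classes[-1]                                          # 兜底类别

classes = ["Buildings", "No Buildings"]
desc = describe_image("tile_001.png")                           # 假设的瓦片路径
print(desc, "->", map_to_class(desc, classes))                  # Buildings
```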
[NLP-92] Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在开放网络问答任务中面临的迭代证据收集与反思推理能力不足的问题。现有方法依赖于静态提示规则或基于维基百科的语料库和检索环境,限制了其在现实世界网络环境中应对模糊性、冲突证据和噪声的能力。论文提出的关键解决方案是引入Search Intensity Scaling(SIS)——一种在模糊或冲突条件下增强搜索努力的涌现能力,而非依赖过度自信且验证不足的答案。为实现这一目标,研究者构建了WebPuzzle数据集,并提出了DeepDiver强化学习框架,通过在真实开放网络环境中探索来促进自适应搜索策略的学习。实验结果表明,Pangu-7B-Reasoner在DeepDiver的增强下,在真实网络任务上的表现可与671B参数的DeepSeek-R1相媲美。
链接: https://arxiv.org/abs/2505.24332
作者: Wenxuan Shi,Haochen Tan,Chuqiao Kuang,Xiaoguang Li,Xiaozhe Ren,Chen Zhang,Hanting Chen,Yasheng Wang,Lifeng Shang,Fisher Yu,Yunhe Wang
机构: Huawei Noah's Ark Lab(华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing methods rely on static prompting rules or training with Wikipedia-based corpora and retrieval environments, limiting adaptability to the real-world web environment where ambiguity, conflicting evidence, and noise are prevalent. These constrained training settings hinder LLMs from learning to dynamically decide when and where to search, and how to adjust search depth and frequency based on informational demands. We define this missing capacity as Search Intensity Scaling (SIS)–the emergent skill to intensify search efforts under ambiguous or conflicting conditions, rather than settling on overconfident, under-verification answers. To study SIS, we introduce WebPuzzle, the first dataset designed to foster information-seeking behavior in open-world internet environments. WebPuzzle consists of 24K training instances and 275 test questions spanning both wiki-based and open-web queries. Building on this dataset, we propose DeepDiver, a Reinforcement Learning (RL) framework that promotes SIS by encouraging adaptive search policies through exploration under a real-world open-web environment. Experimental results show that Pangu-7B-Reasoner empowered by DeepDiver achieve performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's training curriculum from cold-start supervised fine-tuning to a carefully designed RL phase, and present that its capability of SIS generalizes from closed-form QA to open-ended tasks such as long-form writing. Our contributions advance adaptive information seeking in LLMs and provide a valuable benchmark and dataset for future research.
zh
[NLP-93] Context-Aware Sentiment Forecasting via LLM-based Multi-Perspective Role-Playing Agents
【速读】: 该论文试图解决社交媒体上用户情感的即时预测问题(sentiment forecasting),即在事件发展过程中预测用户未来的情感反应。其解决方案的关键在于提取与情感相关的特征以增强建模能力,并提出一种多视角角色扮演框架,以模拟人类对事件发展的响应过程。
链接: https://arxiv.org/abs/2505.24331
作者: Fanhang Man,Huandong Wang,Jianjie Fang,Zhaoyi Deng,Baining Zhao,Xinlei Chen,Yong Li
机构: Shenzhen International Graduate School, Tsinghua University(深圳国际研究生院,清华大学); Department of Electronic Engineering, Tsinghua University(电子工程系,清华大学); Northeastern University at Qinghuangdao(东北大学秦皇岛分校); Department of Computer Science, University of California, Irvine(计算机科学系,加州大学欧文分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:User sentiment on social media reveals the underlying social trends, crises, and needs. Researchers have analyzed users' past messages to trace the evolution of sentiments and reconstruct sentiment dynamics. However, predicting the imminent sentiment of an ongoing event is rarely studied. In this paper, we address the problem of sentiment forecasting on social media to predict the user's future sentiment in response to the development of the event. We extract sentiment-related features to enhance the modeling skill and propose a multi-perspective role-playing framework to simulate the process of human response. Our preliminary results show significant improvement in sentiment forecasting on both microscopic and macroscopic levels.
zh
[NLP-94] SwiftEval: Developing a Language-Specific Benchmark for LLM-generated Code Evaluation
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在代码生成任务中对非Python语言(如Swift)评估不足的问题。现有评估基准主要面向Python,导致对其他语言的评估质量不高。论文提出的关键解决方案是采用“质量优先”的方法,而非依赖自动翻译Python基准以快速扩展多语言支持,从而构建了首个针对Swift的基准测试集SwiftEval,包含28个精心设计的问题,并对44个流行的代码生成LLMs进行了评估,结果显示在需要语言特性的任务中,模型性能显著下降,尤其在小型模型中更为明显。
链接: https://arxiv.org/abs/2505.24324
作者: Ivan Petrukha,Yana Kurliak,Nataliia Stulova
机构: MacPaw(麦帕)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注: Accepted to FORGE’25 Benchmarking on 15.01.2025, to be published by IEEE under the CC BY-NC-ND 4.0 license. This is the accepted version of the article (5 pages, 2 figures, 1 table). DOI will be added upon publication
Abstract:In recent years, large language models (LLMs) have showcased significant advancements in code generation. However, most evaluation benchmarks are primarily oriented towards Python, making it difficult to evaluate other programming languages, such as Swift, with high quality. By examining widely established multilingual benchmarks like HumanEval-XL and MultiPL-E, we identified critical issues specific to their Swift components, making them insufficient or even irrelevant for assessing LLM coding capabilities on Swift. Unlike these existing approaches, which prioritize rapid scaling and generalization by automatically translating Python-centric benchmarks with LLMs, we adopt a quality-over-quantity methodology. We present SwiftEval, the first Swift-oriented benchmark consisting of 28 carefully hand-crafted problems, and evaluate 44 popular Code LLMs on it. Our results show significant LLM scores drop for problems requiring language-specific features, most noticeable in the models of smaller sizes.
zh
[NLP-95] HiCaM: A Hierarchical-Causal Modification Framework for Long-Form Text Modification
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理长文本修改任务时面临的两个主要问题:一是不当修改或摘要无关内容导致生成不期望的修改,二是遗漏对维持文档连贯性至关重要的隐含相关段落的必要修改。解决方案的关键在于提出HiCaM框架,该框架通过分层摘要树和因果图实现层次化因果修改,从而有效提升文本修改的准确性和连贯性。
链接: https://arxiv.org/abs/2505.24319
作者: Yuntao Shi,Yi Luo,Yeyun Gong,Chen Lin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have achieved remarkable success in various domains. However, when handling long-form text modification tasks, they still face two major problems: (1) producing undesired modifications by inappropriately altering or summarizing irrelevant content, and (2) missing necessary modifications to implicitly related passages that are crucial for maintaining document coherence. To address these issues, we propose HiCaM, a Hierarchical-Causal Modification framework that operates through a hierarchical summary tree and a causal graph. Furthermore, to evaluate HiCaM, we derive a multi-domain dataset from various benchmarks, providing a resource for assessing its effectiveness. Comprehensive evaluations on the dataset demonstrate significant improvements over strong LLMs, with our method achieving up to a 79.50% win rate. These results highlight the comprehensiveness of our approach, showing consistent performance improvements across multiple models and domains.
zh
[NLP-96] ScienceMeter: Tracking Scientific Knowledge Updates in Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在科学知识更新方面的有效性问题,即如何评估模型在知识保留、新知识获取和未来知识预测三个方面的能力。解决方案的关键在于提出ScienceMeter框架,该框架定义了三个核心指标:知识保留(knowledge preservation)、知识获取(knowledge acquisition)和知识投影(knowledge projection),用于系统性地评估科学知识更新方法的效果。通过在涵盖多个科学领域的大规模数据集上进行实验,研究揭示了现有知识更新方法的局限性,并强调了构建高效、稳健的科学知识更新机制的重要性与挑战性。
链接: https://arxiv.org/abs/2505.24302
作者: Yike Wang,Shangbin Feng,Yulia Tsvetkov,Hannaneh Hajishirzi
机构: University of Washington (华盛顿大学); Allen Institute for Artificial Intelligence (艾伦人工智能研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) are increasingly used to support scientific research, but their knowledge of scientific advancements can quickly become outdated. We introduce ScienceMeter, a new framework for evaluating scientific knowledge update methods over scientific knowledge spanning the past, present, and future. ScienceMeter defines three metrics: knowledge preservation, the extent to which models’ understanding of previously learned papers are preserved; knowledge acquisition, how well scientific claims from newly introduced papers are acquired; and knowledge projection, the ability of the updated model to anticipate or generalize to related scientific claims that may emerge in the future. Using ScienceMeter, we examine the scientific knowledge of LLMs on claim judgment and generation tasks across a curated dataset of 15,444 scientific papers and 30,888 scientific claims from ten domains including medicine, biology, materials science, and computer science. We evaluate five representative knowledge update approaches including training- and inference-time methods. With extensive experiments, we find that the best-performing knowledge update methods can preserve only 85.9% of existing knowledge, acquire 71.7% of new knowledge, and project 37.7% of future knowledge. Inference-based methods work for larger models, whereas smaller models require training to achieve comparable performance. Cross-domain analysis reveals that performance on these objectives is correlated. Even when applying on specialized scientific LLMs, existing knowledge update methods fail to achieve these objectives collectively, underscoring that developing robust scientific knowledge update mechanisms is both crucial and challenging.
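为便于理解上述三个指标的计算方式,下面给出一个极简的示意实现(指标的准确率式定义、函数名与数据结构均为依据摘要所作的假设,非论文官方代码):

```python
def knowledge_scores(correct_before, correct_after,
                     old_claims, new_claims, future_claims):
    """ScienceMeter 三指标的示意计算(假设均以判断准确率度量)。
    correct_before / correct_after: claim id -> 更新前/后模型是否判断正确
    old/new/future_claims: 旧知识、新知识、未来知识对应的 claim id 列表"""
    def acc(pred, claims):
        return sum(pred[c] for c in claims) / len(claims)

    # 知识保留:更新前已掌握的旧知识中,更新后仍然正确的比例
    known_old = [c for c in old_claims if correct_before[c]]
    preservation = acc(correct_after, known_old)
    # 知识获取:新引入论文中科学论断的正确率
    acquisition = acc(correct_after, new_claims)
    # 知识投影:对未来相关论断的泛化正确率
    projection = acc(correct_after, future_claims)
    return preservation, acquisition, projection

before = {"c1": True, "c2": True, "c3": False}
after = {"c1": True, "c2": False, "c3": True, "n1": True, "f1": False}
print(knowledge_scores(before, after, ["c1", "c2", "c3"], ["n1"], ["f1"]))
# (0.5, 1.0, 0.0):对应摘要中保留/获取/投影三个维度各自独立的得分
```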
zh
[NLP-97] Large Language Models are Locally Linear Mappings
【速读】: 该论文试图解决如何理解大型语言模型(Large Language Models, LLMs)内部表示和预测机制的问题,特别是如何揭示其在下一个词预测过程中的可解释语义结构。解决方案的关键在于将模型的推理操作映射到一个与输入序列完全等价的线性系统,通过调整梯度计算以使模型的雅可比矩阵几乎精确地再现前向预测,从而实现对模型各层及其注意力和MLP组件的近似线性分解。这一方法使得能够观察到语义概念的逐渐浮现,并揭示现代LLMs尽管具有强大的表达能力和全局非线性,仍可通过近似精确的局部线性分解进行解释。
链接: https://arxiv.org/abs/2505.24293
作者: James R. Golden
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Version 0
Abstract:We demonstrate that the inference operations of several open-weight large language models (LLMs) can be mapped to an exactly equivalent linear system for an input sequence without modifying the model weights or altering output predictions. Extending techniques from image diffusion models that exhibit local or piecewise linearity, we strategically alter the gradient computation with respect to a given input sequence for a next-token prediction such that the Jacobian of the model nearly exactly reproduces the forward prediction with a linear system. We demonstrate this approach across models (Llama 3, Gemma 3, Qwen 3, Phi 4, Mistral Ministral and OLMo 2, up to Llama 3.3 70B Q4) and show through the singular value decomposition of the detached Jacobian that these LLMs operate in extremely low-dimensional subspaces where many of the largest singular vectors decode to concepts related to the most-likely output token. This approach also allows us to examine the operation of each successive layer (and its attention and MLP components) as nearly-exact linear systems and observe the emergence of semantic concepts. Despite their expressive power and global nonlinearity, modern LLMs can be interpreted through nearly-exact locally linear decompositions that provide insights into their internal representations and reveal interpretable semantic structures in the next-token prediction process.
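论文的核心技巧是在梯度计算中分离(detach)非线性项的"门控"因子,使模型在给定输入处表现为精确的线性映射。下面用一个玩具网络演示这一思想(示意性实现,假设网络无偏置;与论文针对真实 LLM 的做法仅为同一原理的缩小版):

```python
import torch

class DetachedGELU(torch.nn.Module):
    # GELU(x) = x * Phi(x);将 Phi(x) 从计算图中分离后,
    # 该算子在当前输入点等价于对 x 的一个固定线性缩放
    def forward(self, x):
        gate = torch.special.ndtr(x)  # 标准正态分布的 CDF
        return x * gate.detach()

torch.manual_seed(0)
lin1 = torch.nn.Linear(8, 16, bias=False)
lin2 = torch.nn.Linear(16, 8, bias=False)
act = DetachedGELU()

def f(x):
    return lin2(act(lin1(x)))

x = torch.randn(8)
y = f(x)
J = torch.autograd.functional.jacobian(f, x)  # 分离门控后的雅可比矩阵
print(torch.allclose(J @ x, y, atol=1e-5))     # True:前向计算被线性系统精确复现

# 与论文做法一致,可对 J 做 SVD 观察其低维结构
U, S, Vh = torch.linalg.svd(J)
print(S[:4])
```

在真实 LLM 上,论文正是对每个输入序列构造这样的"分离雅可比",再通过奇异向量的解码来解释下一词预测。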
zh
[NLP-98] Mind the Quote: Enabling Quotation-Aware Dialogue in LLMs via Plug-and-Play Modules
【速读】: 该论文试图解决人类与AI对话中引用先前文本的挑战,即当前的大规模语言模型(Large Language Models, LLMs)缺乏明确机制来定位和利用这些引用片段。解决方案的关键在于提出一种以引用为中心的数据流水线,并引入QuAda方法,该方法通过在每个注意力头附加两个瓶颈投影,在推理时动态增强或抑制对引用片段的关注,而无需修改提示并仅更新2.8%的主干权重,从而实现零开销和参数高效的目标。
链接: https://arxiv.org/abs/2505.24292
作者: Yueqi Zhang,Peiwen Yuan,Shaoxiong Feng,Yiwei Li,Xinglin Wang,Jiayi Shi,Chuyi Tan,Boyuan Pan,Yao Hu,Kan Li
机构: School of Computer Science and Technology, Beijing Institute of Technology (计算机学院,北京理工大学); Xiaohongshu Inc (小红书公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Human-AI conversation frequently relies on quoting earlier text ("check it with the formula I just highlighted"), yet today's large language models (LLMs) lack an explicit mechanism for locating and exploiting such spans. We formalise the challenge as span-conditioned generation, decomposing each turn into the dialogue history, a set of token-offset quotation spans, and an intent utterance. Building on this abstraction, we introduce a quotation-centric data pipeline that automatically synthesises task-specific dialogues, verifies answer correctness through multi-stage consistency checks, and yields both a heterogeneous training corpus and the first benchmark covering five representative scenarios. To meet the benchmark's zero-overhead and parameter-efficiency requirements, we propose QuAda, a lightweight training-based method that attaches two bottleneck projections to every attention head, dynamically amplifying or suppressing attention to quoted spans at inference time while leaving the prompt unchanged and updating only 2.8% of backbone weights. Experiments across models show that QuAda is suitable for all scenarios and generalises to unseen topics, offering an effective, plug-and-play solution for quotation-aware dialogue.
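下面的代码示意了"在注意力头上附加瓶颈投影、在推理时增强或抑制对引用片段的注意力"这一机制(张量形状约定与模块命名均为假设,非论文原始实现):

```python
import torch

class QuoteBottleneck(torch.nn.Module):
    """单个注意力头上的示意性瓶颈模块:
    将每个 key 位置的隐状态经低秩瓶颈映射为一个标量偏置,
    并仅加在属于引用片段的位置的注意力 logits 上。"""
    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = torch.nn.Linear(d_model, rank, bias=False)
        self.up = torch.nn.Linear(rank, 1, bias=False)

    def forward(self, attn_logits, key_states, quote_mask):
        # attn_logits: (batch, q_len, k_len)
        # key_states:  (batch, k_len, d_model)
        # quote_mask:  (batch, k_len),引用片段内的 key 位置为 1
        bias = self.up(torch.tanh(self.down(key_states))).squeeze(-1)
        return attn_logits + (bias * quote_mask).unsqueeze(1)

m = QuoteBottleneck(d_model=32)
logits = torch.randn(2, 5, 7)
keys = torch.randn(2, 7, 32)
mask = torch.zeros(2, 7)
mask[:, 2:4] = 1.0  # 假设第 2-3 个 token 属于被引用的片段
print(m(logits, keys, mask).shape)  # torch.Size([2, 5, 7])
```

由于只训练这些低秩瓶颈投影,新增可训练参数量很小,这与摘要中"仅更新 2.8% 主干权重"的参数高效设计方向一致。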
zh
[NLP-99] Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations ACL2025
【速读】: 该论文试图解决自然语言推理(Natural Language Inference, NLI)中解释的可信度与鲁棒性问题,特别是在将自然语言翻译为形式化表示过程中可能出现的语义信息丢失以及逻辑结构捕捉不准确的问题。其解决方案的关键在于提出一系列策略,包括减轻自动形式化过程中的语义损失、高效识别并修正逻辑表示中的语法错误、利用逻辑表达式引导生成结构化证明草图,以及增强语言模型对定理证明器反馈的解读能力,从而提升验证效率和解释质量。
链接: https://arxiv.org/abs/2505.24264
作者: Xin Quan,Marco Valentino,Louise A. Dennis,André Freitas
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Camera-ready for ACL 2025
Abstract:Natural language explanations play a fundamental role in Natural Language Inference (NLI) by revealing how premises logically entail hypotheses. Recent work has shown that the interaction of large language models (LLMs) with theorem provers (TPs) can help verify and improve the validity of NLI explanations. However, TPs require translating natural language into machine-verifiable formal representations, a process that introduces the risk of semantic information loss and unfaithful interpretation, an issue compounded by LLMs’ challenges in capturing critical logical structures with sufficient precision. Moreover, LLMs are still limited in their capacity for rigorous and robust proof construction within formal verification frameworks. To mitigate issues related to faithfulness and robustness, this paper investigates strategies to (1) alleviate semantic loss during autoformalisation, (2) efficiently identify and correct syntactic errors in logical representations, (3) explicitly use logical expressions to guide LLMs in generating structured proof sketches, and (4) increase LLMs’ capacity of interpreting TP’s feedback for iterative refinement. Our empirical results on e-SNLI, QASC and WorldTree using different LLMs demonstrate that the proposed strategies yield significant improvements in autoformalisation (+18.46%, +34.2%, +39.77%) and explanation refinement (+29.5%, +51.5%, +41.25%) over the state-of-the-art model. Moreover, we show that specific interventions on the hybrid LLM-TP architecture can substantially improve efficiency, drastically reducing the number of iterations required for successful verification.
zh
[NLP-100] Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation
【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)在评估过程中可能存在的数据泄露问题,即训练数据与评估集之间的潜在重叠,这会影响模型性能评估的公平性和可靠性。其解决方案的关键在于比较现有的数据泄露检测技术,包括基于排列和n-gram的方法,并在此基础上提出一种轻量级方法——半半问题(semi-half question),同时优化这些方法以实现实例级别的检测并降低计算开销。研究最终通过清理MMLU和HellaSwag数据集并重新评估多个LLMs,验证了其方法的有效性,为提升评估的可靠性和透明度提供了可行路径。
链接: https://arxiv.org/abs/2505.24263
作者: Naila Shafirni Hidayat,Muhammad Dehan Al Kautsar,Alfan Farizki Wicaksono,Fajri Koto
机构: Universitas Indonesia (印度尼西亚大学); MBZUAI (穆罕默德本扎耶德大学人工智能研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:The performance of large language models (LLMs) continues to improve, as reflected in rising scores on standard benchmarks. However, the lack of transparency around training data raises concerns about potential overlap with evaluation sets and the fairness of reported results. Although prior work has proposed methods for detecting data leakage, these approaches primarily focus on identifying outliers and have not been evaluated under controlled simulated leakage conditions. In this work, we compare existing leakage detection techniques, namely permutation and n-gram-based methods, under a continual pretraining setup that simulates real-world leakage scenarios, and additionally explore a lightweight method we call semi-half question. Although semi-half offers a low-cost alternative, our analysis shows that the n-gram method consistently achieves the highest F1-score. We also refine these techniques to support instance-level detection and reduce computational overhead. Leveraging the best-performing method, we create cleaned versions of MMLU and HellaSwag, and re-evaluate several LLMs. Our findings present a practical path toward more reliable and transparent evaluations, and we recommend contamination checks as a standard step before releasing benchmark results.
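以文中表现最好的 n-gram 方法为例,下面给出实例级泄露检测的一个极简示意(n 的取值与判定阈值均为本文假设):

```python
def ngram_overlap(sample: str, corpus_ngrams: set, n: int = 8) -> float:
    """返回 sample 中与训练语料重合的 n-gram 比例,比例越高越可能泄露。"""
    toks = sample.split()
    grams = {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    if not grams:
        return 0.0
    return len(grams & corpus_ngrams) / len(grams)

corpus = ("the quick brown fox jumps over the lazy dog "
          "near the river bank today").split()
corpus_ngrams = {tuple(corpus[i:i + 8]) for i in range(len(corpus) - 7)}

score = ngram_overlap("the quick brown fox jumps over the lazy dog",
                      corpus_ngrams)
print(score)        # 1.0:该评测样例的 8-gram 全部出现在语料中
print(score > 0.5)  # 超过假设阈值即标记为疑似泄露实例
```

对基准集逐条计算该分数并剔除高分实例,即可得到文中所述的"清理版"MMLU / HellaSwag 这类去污染评测集。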
zh
[NLP-101] Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLM s in Ultimatum Games
【速读】: 该论文试图解决如何通过理论心智(ToM)推理来使智能体行为与人类规范对齐,以提升协商任务中的决策一致性和谈判结果的问题。其解决方案的关键在于初始化不同亲社会信念的大型语言模型(LLMs)代理,并结合链式思维(CoT)和不同层次的ToM推理方法,评估其在 ultimatum game 中的行为表现,从而验证ToM推理在增强人机交互与协作决策中的作用。
链接: https://arxiv.org/abs/2505.24255
作者: Neemesh Yadav,Palakorn Achananuparp,Jing Jiang,Ee-Peng Lim
机构: Singapore Management University (新加坡管理大学); Australian National University (澳大利亚国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 17 pages, 1 figure, 6 tables
Abstract:Large Language Models (LLMs) have shown potential in simulating human behaviors and performing theory-of-mind (ToM) reasoning, a crucial skill for complex social interactions. In this study, we investigate the role of ToM reasoning in aligning agentic behaviors with human norms in negotiation tasks, using the ultimatum game as a controlled environment. We initialized LLM agents with different prosocial beliefs (including Greedy, Fair, and Selfless) and reasoning methods like chain-of-thought (CoT) and varying ToM levels, and examined their decision-making processes across diverse LLMs, including reasoning models like o3-mini and DeepSeek-R1 Distilled Qwen 32B. Results from 2,700 simulations indicated that ToM reasoning enhances behavior alignment, decision-making consistency, and negotiation outcomes. Consistent with previous findings, reasoning models exhibit limited capability compared to models with ToM reasoning, and different game roles benefit from different orders of ToM reasoning. Our findings contribute to the understanding of ToM's role in enhancing human-AI interaction and cooperative decision-making. The code used for our experiments can be found at this https URL.
zh
[NLP-102] Proactive Guidance of Multi-Turn Conversation in Industrial Search ACL’25
【速读】: 该论文旨在解决多轮对话系统在动态适应用户目标变化和保持实时交互低延迟方面的挑战。其解决方案的关键在于提出了一种两阶段框架:第一阶段为基于目标自适应的监督微调(G-SFT),通过目标适应代理动态调整用户目标并提供相关上下文信息,同时利用可扩展的知识蒸馏将大语言模型(LLM)的洞察提炼到轻量级模型中以支持实时交互;第二阶段为面向点击的强化学习(C-RL),采用生成-排序范式,从用户点击信号中系统构建偏好对,并通过更具吸引力的引导提升点击率。该框架通过G-SFT实现精准的目标追踪,通过C-RL优化交互质量,从而有效提升了系统的性能与效率。
链接: https://arxiv.org/abs/2505.24251
作者: Xiaoyu Li,Xiao Li,Li Gao,Yiding Liu,Xiaoyang Wang,Shuaiqiang Wang,Junfeng Wang,Dawei Yin
机构: Baidu Inc. (百度公司)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: ACL’25 (Industry)
Abstract:The evolution of Large Language Models (LLMs) has significantly advanced multi-turn conversation systems, emphasizing the need for proactive guidance to enhance users’ interactions. However, these systems face challenges in dynamically adapting to shifts in users’ goals and maintaining low latency for real-time interactions. In the Baidu Search AI assistant, an industrial-scale multi-turn search system, we propose a novel two-phase framework to provide proactive guidance. The first phase, Goal-adaptive Supervised Fine-Tuning (G-SFT), employs a goal adaptation agent that dynamically adapts to user goal shifts and provides goal-relevant contextual information. G-SFT also incorporates scalable knowledge transfer to distill insights from LLMs into a lightweight model for real-time interaction. The second phase, Click-oriented Reinforcement Learning (C-RL), adopts a generate-rank paradigm, systematically constructs preference pairs from user click signals, and proactively improves click-through rates through more engaging guidance. This dual-phase architecture achieves complementary objectives: G-SFT ensures accurate goal tracking, while C-RL optimizes interaction quality through click signal-driven reinforcement learning. Extensive experiments demonstrate that our framework achieves 86.10% accuracy in offline evaluation (+23.95% over baseline) and 25.28% CTR in online deployment (149.06% relative improvement), while reducing inference latency by 69.55% through scalable knowledge distillation.
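C-RL 阶段"从点击信号系统化构建偏好对"的过程可用如下示意代码表达(日志格式与配对规则为本文假设,论文的生成-排序范式可能附加额外过滤):

```python
def preference_pairs(impressions):
    """从同一查询下的曝光日志构建 (chosen, rejected) 偏好对。
    impressions: [(引导文案, 是否被点击)] 列表"""
    clicked = [g for g, c in impressions if c]
    skipped = [g for g, c in impressions if not c]
    return [(pos, neg) for pos in clicked for neg in skipped]

log = [("按时间范围缩小结果", True),
       ("查看相关话题", False),
       ("换一种问法", False)]
print(preference_pairs(log))
# [('按时间范围缩小结果', '查看相关话题'), ('按时间范围缩小结果', '换一种问法')]
```

这样得到的偏好对可直接接入偏好优化类训练目标,使引导文案朝着更高点击率的方向更新。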
zh
[NLP-103] Mamba Knockout for Unraveling Factual Information Flow ACL2025
【速读】: 该论文试图解决生成式 AI (Generative AI) 中基于状态空间模型(State-Space Model, SSM)的语言模型中事实性信息流动的可解释性问题。其解决方案的关键在于利用理论与实证连接,将原本用于 Transformer 架构的注意力机制解释技术,特别是 Attention Knockout 方法,应用于 Mamba-1 和 Mamba-2 模型,以追踪信息在不同词元和层之间的传递与定位,并揭示主题词元信息的出现模式及层间动态。此外,通过 Mamba 的结构化因子分解,进一步解耦不同“特征”对词元间信息交换与单个词元丰富性的贡献,从而提供一个统一的视角来理解 Mamba 的内部操作机制。
链接: https://arxiv.org/abs/2505.24244
作者: Nir Endy,Idan Daniel Grosbard,Yuval Ran-Milo,Yonatan Slutzky,Itay Tshuva,Raja Giryes
机构: Tel Aviv University (特拉维夫大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to ACL 2025
Abstract:This paper investigates the flow of factual information in Mamba State-Space Model (SSM)-based language models. We rely on theoretical and empirical connections to Transformer-based architectures and their attention mechanisms. Exploiting this relationship, we adapt attentional interpretability techniques originally developed for Transformers–specifically, the Attention Knockout methodology–to both Mamba-1 and Mamba-2. Using them we trace how information is transmitted and localized across tokens and layers, revealing patterns of subject-token information emergence and layer-wise dynamics. Notably, some phenomena vary between mamba models and Transformer based models, while others appear universally across all models inspected–hinting that these may be inherent to LLMs in general. By further leveraging Mamba’s structured factorization, we disentangle how distinct “features” either enable token-to-token information exchange or enrich individual tokens, thus offering a unified lens to understand Mamba internal operations.
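Attention Knockout 的基本操作是在 softmax 之前把特定位置对之间的注意力 logit 置为负无穷,从而切断两者间的直接信息流。下面是一个与具体模型无关的通用示意(将该思想迁移到 Mamba 的状态传递上是论文的贡献,此处仅演示 Transformer 情形的做法):

```python
import torch

def knockout_mask(seq_len: int, src: int, dst: int) -> torch.Tensor:
    """构造加在注意力 logits 上的掩码:
    阻断从位置 src 到位置 dst 的直接信息传递。"""
    mask = torch.zeros(seq_len, seq_len)
    mask[dst, src] = float("-inf")  # softmax 后该注意力权重变为 0
    return mask

logits = torch.randn(6, 6)
blocked = torch.softmax(logits + knockout_mask(6, src=2, dst=5), dim=-1)
print(blocked[5, 2])  # tensor(0.):位置 5 不再直接读取位置 2 的信息
```

比较阻断前后目标词预测概率的变化,即可度量该信息通路对事实性预测的贡献,这正是文中追踪主题词元信息流动的基本手段。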
zh
[NLP-104] Advantageous Parameter Expansion Training Makes Better Large Language Models
【速读】: 该论文试图解决大规模语言模型在增加可训练参数数量以提升性能时所面临的计算开销增大的问题,其核心在于如何在减少参数量或数据量的情况下保持甚至提升模型性能。解决方案的关键是提出了一种名为优势参数扩展训练(Advantageous Parameter EXpansion Training, APEX)的方法,该方法通过逐步将优势参数扩展到劣势参数的空间中,提高优势参数的比例,从而增强训练效率。理论分析表明,APEX通过优化参数空间的矩阵有效秩来实现性能提升。
链接: https://arxiv.org/abs/2505.24241
作者: Naibin Gu,Yilong Chen,Zhenyu Zhang,Peng Fu,Zheng Lin,Shuohuan Wang,Yu Sun,Hua Wu,Weiping Wang,Haifeng Wang
机构: Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院); Baidu Inc.(百度公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Although scaling up the number of trainable parameters in both pre-training and fine-tuning can effectively improve the performance of large language models, it also leads to increased computational overhead. When delving into the parameter difference, we find that a subset of parameters, termed advantageous parameters, plays a crucial role in determining model performance. Further analysis reveals that stronger models tend to possess more such parameters. In this paper, we propose Advantageous Parameter EXpansion Training (APEX), a method that progressively expands advantageous parameters into the space of disadvantageous ones, thereby increasing their proportion and enhancing training effectiveness. Further theoretical analysis from the perspective of matrix effective rank explains the performance gains of APEX. Extensive experiments on both instruction tuning and continued pre-training demonstrate that, in instruction tuning, APEX outperforms full-parameter tuning while using only 52% of the trainable parameters. In continued pre-training, APEX achieves the same perplexity level as conventional training with just 33% of the training data, and yields significant improvements on downstream tasks.
zh
[NLP-105] An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring
【速读】: 该论文试图解决多智能体大语言模型(multi-agent LLM)系统在面对对抗性或低性能代理时的脆弱性问题。解决方案的关键在于引入一种基于可信度评分(credibility scoring)的通用且具有抗攻击性的多智能体框架,通过将协作式问答过程建模为迭代博弈,并根据每个代理在查询回答中的历史贡献动态学习可信度分数,从而在聚合团队输出时有效降低对抗性影响,提升多智能体协作的鲁棒性。
链接: https://arxiv.org/abs/2505.24239
作者: Sana Ebrahimi,Mohsen Dehghankar,Abolfazl Asudeh
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:While multi-agent LLM systems show strong capabilities in various domains, they are highly vulnerable to adversarial and low-performing agents. To resolve this issue, in this paper, we introduce a general and adversary-resistant multi-agent LLM framework based on credibility scoring. We model the collaborative query-answering process as an iterative game, where the agents communicate and contribute to a final system output. Our system associates a credibility score that is used when aggregating the team outputs. The credibility scores are learned gradually based on the past contributions of each agent in query answering. Our experiments across multiple tasks and settings demonstrate our system’s effectiveness in mitigating adversarial influence and enhancing the resilience of multi-agent cooperation, even in the adversary-majority settings.
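其核心机制可概括为"按可信度加权聚合 + 依据历史贡献迭代更新可信度"。下面是一个极简示意(指数平滑式的更新规则为本文假设,非论文原始公式):

```python
from collections import defaultdict

def aggregate(answers: dict, cred: dict) -> str:
    """可信度加权投票,返回团队最终输出。"""
    scores = defaultdict(float)
    for agent, ans in answers.items():
        scores[ans] += cred[agent]
    return max(scores, key=scores.get)

def update_credibility(answers, cred, final, lr=0.2):
    # 与最终输出一致的代理上调可信度,不一致的下调
    for agent, ans in answers.items():
        target = 1.0 if ans == final else 0.0
        cred[agent] += lr * (target - cred[agent])
    return cred

cred = {"a1": 0.5, "a2": 0.5, "adv": 0.5}
rounds = [{"a1": "Paris", "a2": "Paris", "adv": "Lyon"},
          {"a1": "4", "a2": "4", "adv": "5"}]
for answers in rounds:
    final = aggregate(answers, cred)
    cred = update_credibility(answers, cred, final)
print(cred)  # 对抗性代理 adv 的权重逐轮衰减,其对聚合结果的影响随之减小
```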
zh
[NLP-106] From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models
【速读】: 该论文试图解决大型基础模型(Large Foundation Models, LFMs)在面对幻觉(hallucinations)和越狱攻击(jailbreak attacks)时所表现出的两种不同脆弱性问题。研究发现,针对其中一种脆弱性的防御措施通常会影响另一种,暗示两者之间存在更深层次的联系。论文提出了一种统一的理论框架,将越狱攻击建模为令牌级优化,将幻觉建模为注意力级优化,并通过实证验证了两个关键命题:相似损失收敛与注意力重分配中的梯度一致性。该研究的关键在于揭示这两种脆弱性之间的内在关联,并证明针对其中一种的缓解技术可以有效降低另一种的成功率,从而为提升LFMs的鲁棒性提供了新的思路。
链接: https://arxiv.org/abs/2505.24232
作者: Haibo Jin,Peiyan Zhang,Peiran Wang,Man Luo,Haohan Wang
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); HKUST (香港科技大学); University of California, Los Angeles (加利福尼亚大学洛杉矶分校); Intel Labs (英特尔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large foundation models (LFMs) are susceptible to two distinct vulnerabilities: hallucinations and jailbreak attacks. While typically studied in isolation, we observe that defenses targeting one often affect the other, hinting at a deeper connection. We propose a unified theoretical framework that models jailbreaks as token-level optimization and hallucinations as attention-level optimization. Within this framework, we establish two key propositions: (1) Similar Loss Convergence - the loss functions for both vulnerabilities converge similarly when optimizing for target-specific outputs; and (2) Gradient Consistency in Attention Redistribution - both exhibit consistent gradient behavior driven by shared attention dynamics. We validate these propositions empirically on LLaVA-1.5 and MiniGPT-4, showing consistent optimization trends and aligned gradients. Leveraging this connection, we demonstrate that mitigation techniques for hallucinations can reduce jailbreak success rates, and vice versa. Our findings reveal a shared failure mode in LFMs and suggest that robustness strategies should jointly address both vulnerabilities.
zh
[NLP-107] Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization INTERSPEECH2025
【速读】: 该论文旨在解决在流式自动语音识别(ASR)系统中集成流式逆文本归一化(ITN)所面临的准确性、效率和适应性问题,尤其是在低资源和有限上下文场景下的挑战。其解决方案的关键在于引入一种用于ITN的流式预训练语言模型,利用预训练的语言表示提高鲁棒性,并通过在训练和推理过程中采用动态上下文感知机制,实现自适应块大小调整和右文信息的整合。
链接: https://arxiv.org/abs/2505.24229
作者: Luong Ho,Khanh Le,Vinh Pham,Bao Nguyen,Tan Tran,Duc Chau
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to INTERSPEECH 2025
Abstract:Inverse Text Normalization (ITN) is crucial for converting spoken Automatic Speech Recognition (ASR) outputs into well-formatted written text, enhancing both readability and usability. Despite its importance, the integration of streaming ITN within streaming ASR remains largely unexplored due to challenges in accuracy, efficiency, and adaptability, particularly in low-resource and limited-context scenarios. In this paper, we introduce a streaming pretrained language model for ITN, leveraging pretrained linguistic representations for improved robustness. To address streaming constraints, we propose Dynamic Context-Aware during training and inference, enabling adaptive chunk size adjustments and the integration of right-context information. Experimental results demonstrate that our method achieves accuracy comparable to non-streaming ITN and surpasses existing streaming ITN models on a Vietnamese dataset, all while maintaining low latency, ensuring seamless integration into ASR systems.
zh
[NLP-108] Reasoning Can Hurt the Inductive Abilities of Large Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在归纳推理能力上的局限性,特别是针对通过少量示例推断潜在规则的任务。尽管通常认为链式思维(Chain-of-Thought, CoT)提示方法能够提升推理能力,但研究发现,在某些情况下,CoT推理反而会降低归纳性能,导致大型推理模型(Large Reasoning Models, LRMs)表现不如非推理模型。解决方案的关键在于提出一种理论框架,揭示了推理步骤如何通过三种失败模式(子任务分解错误、子任务求解错误和最终答案总结错误)放大误差,并基于此设计结构化干预措施,以根据识别的失败类型调整CoT生成过程,从而在不进行微调的情况下提高归纳准确性。
链接: https://arxiv.org/abs/2505.24225
作者: Haibo Jin,Peiyan Zhang,Man Luo,Haohan Wang
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); HKUST (香港科技大学); Intel Labs (英特尔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 26 pages
Abstract:Large Language Models (LLMs) have shown remarkable progress across domains, yet their ability to perform inductive reasoning - inferring latent rules from sparse examples - remains limited. It is often assumed that chain-of-thought (CoT) prompting, as used in Large Reasoning Models (LRMs), enhances such reasoning. We investigate this assumption by creating four controlled, diagnostic game-based tasks - chess, Texas Hold'em, dice games, and blackjack - with hidden human-defined rules. We find that CoT reasoning can degrade inductive performance, with LRMs often underperforming their non-reasoning counterparts. To explain this, we present a theoretical framework that reveals how reasoning steps can amplify error through three failure modes: incorrect sub-task decomposition, incorrect sub-task solving, and incorrect final answer summarization. Based on our theoretical and empirical analysis, we introduce structured interventions that adapt CoT generation according to our identified failure types. These interventions improve inductive accuracy without retraining. Our findings suggest that effective CoT reasoning depends not only on taking more steps but also on ensuring those steps are well-structured.
zh
[NLP-109] Automated Structured Radiology Report Generation ACL
【速读】: 该论文试图解决从胸部X光(CXR)图像自动生成放射学报告过程中存在的临床一致性差和评估困难问题(Radiology Report Generation)。现有数据集中的自由文本报告具有高度的变异性与非结构化特性,导致生成模型难以产出一致且具有临床意义的报告,同时传统评估指标无法准确反映放射学解释的细微差别。解决方案的关键在于引入结构化放射学报告生成(Structured Radiology Report Generation, SRRG)任务,通过使用大语言模型(LLM)将自由文本报告重构为标准化格式,并构建新的数据集以确保报告的清晰性、一致性和临床结构化。此外,提出SRR-BERT模型用于细粒度疾病分类,以及F1-SRR-BERT评估指标,以更好地连接自由文本的多样性与结构化临床报告之间的差距。
链接: https://arxiv.org/abs/2505.24223
作者: Jean-Benoit Delbrouck,Justin Xu,Johannes Moll,Alois Thomas,Zhihong Chen,Sophie Ostmeier,Asfandyar Azhar,Kelvin Zhenghao Li,Andrew Johnston,Christian Bluethgen,Eduardo Reis,Mohamed Muneer,Maya Varma,Curtis Langlotz
机构: Stanford AIMI (斯坦福AIMI); HOPPR (HOPPR)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL Main 2025
Abstract:Automated radiology report generation from chest X-ray (CXR) images has the potential to improve clinical efficiency and reduce radiologists’ workload. However, most datasets, including the publicly available MIMIC-CXR and CheXpert Plus, consist entirely of free-form reports, which are inherently variable and unstructured. This variability poses challenges for both generation and evaluation: existing models struggle to produce consistent, clinically meaningful reports, and standard evaluation metrics fail to capture the nuances of radiological interpretation. To address this, we introduce Structured Radiology Report Generation (SRRG), a new task that reformulates free-text radiology reports into a standardized format, ensuring clarity, consistency, and structured clinical reporting. We create a novel dataset by restructuring reports using large language models (LLMs) following strict structured reporting desiderata. Additionally, we introduce SRR-BERT, a fine-grained disease classification model trained on 55 labels, enabling more precise and clinically informed evaluation of structured reports. To assess report quality, we propose F1-SRR-BERT, a metric that leverages SRR-BERT’s hierarchical disease taxonomy to bridge the gap between free-text variability and structured clinical reporting. We validate our dataset through a reader study conducted by five board-certified radiologists and extensive benchmarking experiments.
zh
[NLP-110] ERU-KG: Efficient Reference-aligned Unsupervised Keyphrase Generation ACL2025
【速读】: 该论文旨在解决无监督关键短语预测中存在的信息量估计不准确和时间效率低的问题。现有方法通常依赖于启发式定义的重要性评分,这可能导致信息量估计偏差,并且缺乏对计算效率的考虑。其解决方案的关键在于提出ERU-KG模型,该模型包含信息量模块和短语生成模块,其中信息量模块通过参考(如查询、引用上下文和标题)以及在词级层面建模信息量,从而更高效地估计短语的信息量,无需显式建模候选短语,提升了预测的准确性和效率。
链接: https://arxiv.org/abs/2505.24219
作者: Lam Thanh Do,Aaditya Bodke,Pritom Saha Akash,Kevin Chen-Chuan Chang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025
Abstract:Unsupervised keyphrase prediction has gained growing interest in recent years. However, existing methods typically rely on heuristically defined importance scores, which may lead to inaccurate informativeness estimation. In addition, they lack consideration for time efficiency. To solve these problems, we propose ERU-KG, an unsupervised keyphrase generation (UKG) model that consists of an informativeness and a phraseness module. The former estimates the relevance of keyphrase candidates, while the latter generate those candidates. The informativeness module innovates by learning to model informativeness through references (e.g., queries, citation contexts, and titles) and at the term-level, thereby 1) capturing how the key concepts of documents are perceived in different contexts and 2) estimating informativeness of phrases more efficiently by aggregating term informativeness, removing the need for explicit modeling of the candidates. ERU-KG demonstrates its effectiveness on keyphrase generation benchmarks by outperforming unsupervised baselines and achieving on average 89% of the performance of a supervised model for top 10 predictions. Additionally, to highlight its practical utility, we evaluate the model on text retrieval tasks and show that keyphrases generated by ERU-KG are effective when employed as query and document expansions. Furthermore, inference speed tests reveal that ERU-KG is the fastest among baselines of similar model sizes. Finally, our proposed model can switch between keyphrase generation and extraction by adjusting hyperparameters, catering to diverse application requirements.
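"通过聚合词级信息量估计短语信息量、从而无需显式建模候选短语"这一点,可用下面的示意代码说明(取平均的聚合方式与分数数值均为假设):

```python
def phrase_informativeness(phrase: str, term_scores: dict) -> float:
    """将词级信息量取平均作为短语信息量(示意性聚合方式)。"""
    terms = phrase.lower().split()
    return sum(term_scores.get(t, 0.0) for t in terms) / len(terms)

# 假设词级信息量已由信息量模块基于查询/引用上下文/标题学得
term_scores = {"graph": 0.9, "neural": 0.8, "network": 0.7, "the": 0.01}
candidates = ["graph neural network", "the network"]
ranked = sorted(candidates,
                key=lambda p: -phrase_informativeness(p, term_scores))
print(ranked)  # ['graph neural network', 'the network']
```

由于每个候选短语的得分只需查表求平均,推理开销远低于对候选逐一编码打分,这与摘要中强调的时间效率优势一致。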
zh
[NLP-111] Semi-structured LLM Reasoners Can Be Rigorously Audited
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在推理过程中存在的“忠实性”问题,即模型生成的推理轨迹可能包含难以检测的错误和遗漏,进而掩盖模型输出中的偏见。解决方案的关键在于引入半结构化推理模型(Semi-Structured Reasoning Models, SSRMs),该模型在内部整合了半结构化的思维链(Chain-of-Thought, CoT)推理格式,并以类似Python的语法生成推理轨迹。SSRMs通过任务特定的受限词汇来命名不同的推理步骤,并标记每一步的输入和输出,从而提高推理过程的可分析性与可审计性。
链接: https://arxiv.org/abs/2505.24217
作者: Jixuan Leng,Cassandra A. Cohen,Zhixian Zhang,Chenyan Xiong,William W. Cohen
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:As Large Language Models (LLMs) become increasingly capable at reasoning, the problem of “faithfulness” persists: LLM “reasoning traces” can contain errors and omissions that are difficult to detect, and may obscure biases in model outputs. To address these limitations, we introduce Semi-Structured Reasoning Models (SSRMs), which internalize a semi-structured Chain-of-Thought (CoT) reasoning format within the model. Our SSRMs generate reasoning traces in a Pythonic syntax. While SSRM traces are not executable, they adopt a restricted, task-specific vocabulary to name distinct reasoning steps, and to mark each step’s inputs and outputs. Through extensive evaluation on ten benchmarks, SSRMs demonstrate strong performance and generality: they outperform comparably sized baselines by nearly ten percentage points on in-domain tasks while remaining competitive with specialized models on out-of-domain medical benchmarks. Furthermore, we show that semi-structured reasoning is more amenable to analysis: in particular, they can be automatically audited to identify reasoning flaws. We explore both hand-crafted structured audits, which detect task-specific problematic reasoning patterns, and learned typicality audits, which apply probabilistic models over reasoning patterns, and show that both audits can be used to effectively flag probable reasoning errors.
zh
[NLP-112] Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?
【速读】: 该论文试图解决多模态生成模型在跨模态一致性方面的不确定性问题,即统一模型是否真正实现了跨模态的一致性,还是这种一致性仅是感知上的错觉。其解决方案的关键在于引入ACON数据集,该数据集包含1,000张图像(其中500张为新贡献)及其对应的描述、编辑指令和问答对,用于严格评估跨模态转换。同时,论文通过三种一致性准则——循环一致性、前向等变性和共轭等变性——进行实验分析,以揭示统一模型在点对点评估中的表现,并借助多编辑操作所支持的中间潜在空间结构化分析,发现弱但可观察的一致性。
链接: https://arxiv.org/abs/2505.24211
作者: Jiwan Chung,Janghan Yoon,Junhyeong Park,Sangeyl Lee,Joowon Yang,Sooyeon Park,Youngjae Yu
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Any-to-any generative models aim to enable seamless interpretation and generation across multiple modalities within a unified framework, yet their ability to preserve relationships across modalities remains uncertain. Do unified models truly achieve cross-modal coherence, or is this coherence merely perceived? To explore this, we introduce ACON, a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and QA pairs to evaluate cross-modal transfers rigorously. Using three consistency criteria (cyclic consistency, forward equivariance, and conjugated equivariance), our experiments reveal that any-to-any models do not consistently demonstrate greater cross-modal consistency than specialized models in pointwise evaluations such as cyclic consistency. However, equivariance evaluations uncover weak but observable consistency through structured analyses of the intermediate latent space enabled by multiple editing operations. We release our code and data at this https URL.
zh
[NLP-113] Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC
【速读】: 该论文旨在解决在有限资源条件下,自监督或监督预训练的语音基础模型(Speech Foundation Models, SFMs)在微调过程中表现受限的问题。其关键解决方案是探索多种适应策略,包括冻结上游训练、部分微调以及低秩适配,并结合数据增强技术以缓解少样本设置下的性能差距,同时引入语言识别连接时序分类(LID CTC)损失进行正则化,从而显著提升了多语言语言识别(LID)和自动语音识别(ASR)的性能。
链接: https://arxiv.org/abs/2505.24200
作者: Qingzheng Wang,Jiancheng Sun,Yifan Peng,Shinji Watanabe
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Multilingual speech processing with self-supervised or supervised pre-trained Speech Foundation Models (SFM) has achieved strong performance on tasks like Language Identification (LID) and Automatic Speech Recognition (ASR). However, these models struggle with limited resources during fine-tuning. This paper enhances multilingual LID and ASR on ML-SUPERB 2.0 by exploring multiple strategies for adapting SFMs, including frozen upstream training, partial fine-tuning, and low-rank adaptation. Furthermore, we employ data augmentation to mitigate performance gaps in few-shot settings and introduce LID Connectionist Temporal Classification (CTC) loss for regularization. Our approach achieves a 14% relative improvement in LID accuracy and a 30% relative reduction in ASR CER over the baseline on ML-SUPERB 2.0, securing second place in the Interspeech 2025 ML-SUPERB 2.0 Challenge.
zh
[NLP-114] Intuitionistic Fuzzy Sets for Large Language Model Data Annotation: A Novel Approach to Side-by-Side Preference Labeling
【速读】: 该论文旨在解决人类偏好数据质量在训练和评估大型语言模型(Large Language Models, LLMs)中的关键问题,特别是在基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)和直接偏好优化(Direct Preference Optimization, DPO)场景中,传统并列标注(Side-by-Side, SBS)方法面临固有不确定性、标注者意见不一致以及偏好判断复杂性等挑战。其解决方案的关键在于引入基于直觉模糊集(Intuitionistic Fuzzy Sets, IFS)的框架,通过成员度、非成员度和犹豫度三个维度建模人类偏好,不仅捕捉偏好程度,还量化判断中的不确定性和犹豫性,从而实现更精细的偏好建模与标注者意见的聚合,并提出用于评估偏好数据质量的指标,显著提升了标注一致性与数据质量。
链接: https://arxiv.org/abs/2505.24199
作者: Yimin Du
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages
Abstract:The quality of human preference data is crucial for training and evaluating large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) scenarios. Traditional side-by-side (SBS) annotation approaches often struggle with inherent uncertainty, annotator disagreement, and the complexity of preference judgments. This paper introduces a novel framework based on intuitionistic fuzzy sets (IFS) for modeling and aggregating human preferences in LLM data annotation tasks. Our approach captures not only the degree of preference but also the uncertainty and hesitation inherent in human judgment through membership, non-membership, and hesitation degrees. We propose an IFS-based annotation protocol that enables more nuanced preference modeling, develops aggregation methods for handling annotator disagreement, and introduces quality metrics for preference data assessment. Experimental validation on multiple datasets demonstrates that our IFS-based approach significantly improves annotation consistency, reduces annotator fatigue, and produces higher-quality preference data compared to traditional binary and Likert-scale methods. The resulting preference datasets lead to improved model performance in downstream tasks, with 12.3% improvement in win-rate against baseline models and 15.7% reduction in annotation time. Our framework provides a principled approach to handling uncertainty in human preference annotation and offers practical benefits for large-scale LLM training.
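直觉模糊集用隶属度 μ、非隶属度 ν 与犹豫度 π = 1 − μ − ν 的三元组刻画一次偏好标注。下面的示意代码展示了该表示及一种最简单的跨标注者聚合(对 μ、ν 取平均;论文实际采用的聚合算子可能不同):

```python
from dataclasses import dataclass

@dataclass
class IFSJudgment:
    mu: float  # 隶属度:偏好回答 A 的程度
    nu: float  # 非隶属度:偏好回答 B 的程度

    def __post_init__(self):
        # 直觉模糊集的基本约束:mu, nu >= 0 且 mu + nu <= 1
        assert 0.0 <= self.mu and 0.0 <= self.nu and self.mu + self.nu <= 1.0

    @property
    def pi(self) -> float:  # 犹豫度
        return 1.0 - self.mu - self.nu

def aggregate(judgments):
    """跨标注者的示意性聚合:对 mu、nu 分别取平均。"""
    n = len(judgments)
    return IFSJudgment(sum(j.mu for j in judgments) / n,
                       sum(j.nu for j in judgments) / n)

anns = [IFSJudgment(0.7, 0.2), IFSJudgment(0.5, 0.3), IFSJudgment(0.6, 0.1)]
agg = aggregate(anns)
print(agg, "犹豫度:", round(agg.pi, 3))  # mu=0.6, nu=0.2, 犹豫度 0.2
```

相比二元标注,犹豫度 π 显式保留了标注者的不确定性,可进一步作为筛选偏好数据质量的信号。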
zh
[NLP-115] CLaSp: In-Context Layer Skip for Self-Speculative Decoding ACL2025
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在解码过程中效率低下的问题,特别是通过自推测解码(self-speculative decoding)提高大型语言模型(Large Language Models, LLMs)的解码速度。现有方法通常需要额外的草稿模块进行训练,这在实现和跨不同LLMs的兼容性方面存在挑战。论文提出的解决方案关键在于CLaSp,这是一种基于上下文的层跳过策略,通过跳过验证模型的中间层构建压缩的草稿模型,无需额外训练或模块,采用即插即用机制,并利用动态规划算法优化层跳过过程,从而实现高效的解码加速。
链接: https://arxiv.org/abs/2505.24196
作者: Longze Chen,Renke Shan,Huiming Wang,Lu Wang,Ziqiang Liu,Run Luo,Jiawei Wang,Hamid Alinejad-Rokny,Min Yang
机构: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); University of Chinese Academy of Sciences (中国科学院大学); Singapore University of Technology and Design (新加坡科技设计大学); School of Biomedical Engineering, UNSW Sydney (新南威尔士大学生物医学工程学院); Ritzz-AI (Ritzz-AI)
类目: Computation and Language (cs.CL)
备注: 11 pages, 7 figures, ACL 2025
Abstract:Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verify model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and ensure compatibility across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp does not require additional drafting modules or extra training. Instead, it employs a plug-and-play mechanism by skipping intermediate layers of the verify model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by leveraging the complete hidden states from the last verification stage as an objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3x ~ 1.7x on LLaMA3 series models without altering the original distribution of the generated text.
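论文用动态规划在每次验证后重新选择要跳过的层。下面用一个贪心近似示意其优化目标:挑选对"逼近完整模型末层隐状态"贡献最大的层(层贡献的评分方式为本文假设,仅用于说明,并非论文的 DP 算法本身):

```python
import torch

def select_layers(hidden_states, num_keep: int):
    """hidden_states: 上一次验证阶段各层之后的隐状态列表(含末层)。
    返回应保留(不跳过)的层索引。"""
    final = hidden_states[-1]

    def sim(h):  # 与末层隐状态的余弦相似度
        return torch.cosine_similarity(h, final, dim=0)

    # 以"该层使表示向末层靠近的幅度"作为层贡献分
    gains = [sim(cur) - sim(prev)
             for prev, cur in zip(hidden_states[:-1], hidden_states[1:])]
    keep = sorted(range(len(gains)), key=lambda i: -gains[i])[:num_keep]
    return sorted(keep)

hs = [torch.randn(16) for _ in range(12)]  # 玩具隐状态,真实场景来自验证模型
print(select_layers(hs, num_keep=4))       # 草稿模型只执行这些被保留的层
```

由于每次验证都会产生新的完整隐状态,这一选择可以在每个验证阶段后动态重做,对应摘要中"无需预先优化跳层集合"的设计。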
zh
[NLP-116] WikiGap: Promoting Epistemic Equity by Surfacing Knowledge Gaps Between English Wikipedia and other Language Editions
【速读】: 该论文试图解决英语维基百科在知识覆盖上的局限性问题,即其作为全球知识访问主导平台,虽拥有大量页面浏览量,但未能充分反映其他语言版本中基于各自文化与媒体环境的补充性事实。解决方案的关键在于提出WikiGap系统,该系统通过结合多语言信息差距发现方法与以用户为中心的设计,将法语、俄语和中文维基百科中的补充信息整合至英语维基百科界面中,从而提升事实查找的准确性与效率,并增强用户对非英语版本中可用补充信息的认知。
链接: https://arxiv.org/abs/2505.24195
作者: Zining Wang,Yuxuan Zhang,Dongwook Yoon,Nicholas Vincent,Farhan Samir,Vered Shwartz
机构: University of British Columbia (不列颠哥伦比亚大学); Simon Fraser University (西蒙菲莎大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:
Abstract:With more than 11 times as many pageviews as the next, English Wikipedia dominates global knowledge access relative to other language editions. Readers are prone to assuming English Wikipedia as a superset of all language editions, leading many to prefer it even when their primary language is not English. Other language editions, however, comprise complementary facts rooted in their respective cultures and media environments, which are marginalized in English Wikipedia. While Wikipedia’s user interface enables switching between language editions through its Interlanguage Link (ILL) system, it does not reveal to readers that other language editions contain valuable, complementary information. We present WikiGap, a system that surfaces complementary facts sourced from other Wikipedias within the English Wikipedia interface. Specifically, by combining a recent multilingual information-gap discovery method with a user-centered design, WikiGap enables access to complementary information from French, Russian, and Chinese Wikipedia. In a mixed-methods study (n=21), WikiGap significantly improved fact-finding accuracy, reduced task time, and received a 32-point higher usability score relative to Wikipedia’s current ILL-based navigation system. Participants reported increased awareness of the availability of complementary information in non-English editions and reconsidered the completeness of English Wikipedia. WikiGap thus paves the way for improved epistemic equity across language editions.
zh
[NLP-117] Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows
【速读】: 该论文试图解决在生成结构化输出(如JSON格式的低代码工作流)任务中,小型语言模型(Small Language Models, SLMs)与大型语言模型(Large Language Models, LLMs)在性能上的比较问题。研究指出,尽管LLMs具有强大的任务处理能力,但在特定领域任务中,通过微调的SLMs仍能保持质量优势。解决方案的关键在于对SLMs进行微调,而非依赖提示工程,实验结果表明,微调可使生成质量平均提升10%。
链接: https://arxiv.org/abs/2505.24189
作者: Orlando Marquez Ayala,Patrice Bechard,Emily Chen,Maggie Baird,Jingfei Chen
机构: ServiceNow(服务现在)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) such as GPT-4o can handle a wide range of complex tasks with the right prompt. As per token costs are reduced, the advantages of fine-tuning Small Language Models (SLMs) for real-world applications – faster inference, lower costs – may no longer be clear. In this work, we present evidence that, for domain-specific tasks that require structured outputs, SLMs still have a quality advantage. We compare fine-tuning an SLM against prompting LLMs on the task of generating low-code workflows in JSON form. We observe that while a good prompt can yield reasonable results, fine-tuning improves quality by 10% on average. We also perform systematic error analysis to reveal model limitations.
zh
[NLP-118] Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models
【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)在长序列输出中可靠性呈指数衰减的假设问题,这一假设基于每个标记(token)独立的错误概率。论文的核心贡献在于提出了一种新的可靠性公式,其关键在于识别出错误主要集中在少数“关键标记”(5-10%的总标记数),这些标记代表了重要的决策节点,而非均匀分布在所有标记中。通过区分高影响标记与日益可预测的多数标记,论文揭示了现代LLM能够保持长期连贯性的机制,并提出了以语义重要标记的选择性保留、不确定决策边界处的动态计算分配、歧义处的多路径探索以及与自然语义域对齐的架构为核心的下一代系统框架。
链接: https://arxiv.org/abs/2505.24187
作者: Mikhail L. Arbuzov,Alexey A. Shvets,Sisong Beir
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The prevailing assumption of an exponential decay in large language model (LLM) reliability with sequence length, predicated on independent per-token error probabilities, posits an inherent limitation for long autoregressive outputs. Our research fundamentally challenges this view by synthesizing emerging evidence that LLM errors are not uniformly distributed but are concentrated at sparse "key tokens" (5-10% of total tokens) representing critical decision junctions. By distinguishing these high-impact tokens from the increasingly predictable majority, we introduce a new reliability formula explaining the sustained coherence of modern LLMs over thousands of tokens. Converging research streams reveal that long-context performance primarily depends on accurately navigating a few crucial semantic decision points rather than on uniform token-level accuracy, enabling targeted strategies that significantly outperform brute-force approaches. We thus propose a framework for next-generation systems centered on selective preservation of semantically vital tokens, dynamic computational allocation at uncertain decision boundaries, multi-path exploration at ambiguities, and architectures aligned with natural semantic domains. This marks a fundamental shift from raw scaling to strategic reasoning, promising breakthrough performance without proportionate computational scaling and offering a more nuanced understanding that supersedes the exponential decay hypothesis, thereby opening pathways toward substantially more powerful and efficient language systems.
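两种误差模型的差异可以用一段简单的算术直观展示(数值为示意性假设,非论文报告的参数):

```python
# 均匀误差模型:可靠性随长度指数衰减
n, p_uniform = 10_000, 1e-3
uniform_reliability = (1 - p_uniform) ** n
print(f"{uniform_reliability:.2e}")  # 约 4.5e-05:长输出几乎必然出错

# 关键标记模型:误差集中在约 5% 的关键标记上,其余标记近乎确定
key = int(n * 0.05)
p_key, p_easy = 1e-3, 1e-6
key_reliability = (1 - p_key) ** key * (1 - p_easy) ** (n - key)
print(f"{key_reliability:.2f}")      # 约 0.60:长输出仍可保持连贯
```

在相同的单点错误率下,只要错误集中于少数决策节点,整体可靠性便不再随长度灾难性下降,这正是文中新可靠性公式所刻画的现象。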
zh
[NLP-119] Adaptive LoRA Merge with Parameter Pruning for Low-Resource Generation ACL2025
【速读】: 该论文试图解决低资源语言生成任务中大型语言模型(Large Language Model, LLM)适应性不足的问题,以及现有方法在参数冻结情况下适应能力有限的局限性。其解决方案的关键在于提出一种LoRA合并方法,通过微调少量目标任务数据来更新和剪枝LoRA参数,从而实现更细粒度的参数调整和任务适应性的提升。
链接: https://arxiv.org/abs/2505.24174
作者: Ryota Miyano,Yuki Arase
机构: Osaka University (大阪大学); Institute of Science Tokyo (东京科学大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ACL2025 Findings
Abstract:This study proposes a simple yet effective LoRA merge method to achieve LLM adaptation for low-resource language generation tasks. The LoRA merge technique, which integrates multiple LoRA modules trained on different tasks, has gained attention as an effective and efficient approach for adapting LLMs to target tasks. However, previous methods are limited in adaptability as they keep the LoRA parameters frozen. Additionally, the low-resource problem has been out of their scope. We propose a LoRA merge method that updates and prunes LoRA parameters through fine-tuning with minimal target task data, which allows finer-grained adjustments of LoRA parameters and enhancement of task adaptability. Extensive experiments have been conducted taking summarization as a benchmark task. Our datasets cover various domains and multiple languages of English and Japanese. The results confirm that the proposed method achieves significant and consistent improvements in task adaptability over the previous methods.
zh
[NLP-120] Tag-Evol: Achieving Efficient Instruction Evolving via Tag Injection
【速读】: 该论文试图解决现有数据合成方法在指令演化过程中依赖固定策略、需手动设计且结构单一,以及迭代演化导致困难样本获取成本高昂的问题。其解决方案的关键在于提出Tag-Evol框架,通过引入多样且具体的知识标签作为策略,实现通过注入不同标签组合对原始指令进行可控演化,从而提升演化数据的质量与多样性。
链接: https://arxiv.org/abs/2505.24165
作者: Yixuan Wang,Shiqi Zhou,Chuanzhe Guo,Qingfu Zhu
机构: Harbin Institute of Technology, China(哈尔滨工业大学)
类目: Computation and Language (cs.CL)
备注: Accepted as Findings of ACL 2025
Abstract:Evol-Instruct has made significant improvements as a data synthesis method in several areas. Existing methods typically rely on a fixed set of strategies to evolve, which require manual design and are monolithic in form. In addition, iterative evolution also makes the acquisition of hard samples expensive. In view of this, we propose the Tag-Evol framework, a more diverse and efficient instruction evolving method. Specifically, Tag-Evol uses diverse and specific knowledge tags as strategies to achieve controlled evolution by injecting different combinations of tags into the original instructions. Experiments with multiple backbones in diverse domain benchmarks show that the proposed method generates significantly better evolved data than other methods. Furthermore, we conduct a thorough analysis of the evolved data, demonstrating that Tag-Evol is not only efficient but also generates more diverse and challenging data.
zh
[NLP-121] Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models
【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在后训练阶段缺乏稳定强化学习(Reinforcement Learning, RL)方法的问题,现有工作通常仅关注单一任务类型,如接地任务、数学问题或图表分析。其解决方案的关键在于提出一种统一的框架Mixed-R1,该框架包含混合奖励函数设计(Mixed-Reward)和混合后训练数据集(Mixed-45K),其中Mixed-Reward通过多种奖励函数适应不同任务,并引入一种新的开放性奖励机制Bidirectional Max-Average Similarity (BMAS) 以处理长文本内容。
链接: https://arxiv.org/abs/2505.24164
作者: Shilin Xu,Yanwei Li,Rui Yang,Tao Zhang,Yueyi Sun,Wei Chow,Linfeng Li,Hang Song,Qi Xu,Yunhai Tong,Xiangtai Li,Hao Fei
机构: ByteDance(字节跳动); Peking University(北京大学); National University of Singapore(新加坡国立大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent works on large language models (LLMs) have successfully demonstrated the emergence of reasoning capabilities via reinforcement learning (RL). Although recent efforts leverage group relative policy optimization (GRPO) for MLLMs post-training, they constantly explore one specific aspect, such as grounding tasks, math problems, or chart analysis. There are no works that can leverage multi-source MLLM tasks for stable reinforcement learning. In this work, we present a unified perspective to solve this problem. We present Mixed-R1, a unified yet straightforward framework that contains a mixed reward function design (Mixed-Reward) and a mixed post-training dataset (Mixed-45K). We first design a data engine to select high-quality examples to build the Mixed-45K post-training dataset. Then, we present a Mixed-Reward design, which contains various reward functions for various MLLM tasks. In particular, it has four different reward functions: matching reward for binary answer or multiple-choice problems, chart reward for chart-aware datasets, IoU reward for grounding problems, and open-ended reward for long-form text responses such as caption datasets. To handle the various long-form text content, we propose a new open-ended reward named Bidirectional Max-Average Similarity (BMAS) by leveraging tokenizer embedding matching between the generated response and the ground truth. Extensive experiments show the effectiveness of our proposed method on various MLLMs, including Qwen2.5-VL and Intern-VL on various sizes. Our dataset and model are available at this https URL.
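其中开放性奖励 BMAS 的计算方式可按摘要描述复原为如下示意实现(嵌入归一化等细节为本文假设):

```python
import torch
import torch.nn.functional as F

def bmas(gen_emb: torch.Tensor, ref_emb: torch.Tensor) -> torch.Tensor:
    """双向最大-平均相似度(Bidirectional Max-Average Similarity)示意。
    gen_emb: (m, d) 生成回复的 token 嵌入
    ref_emb: (n, d) 参考答案的 token 嵌入"""
    g = F.normalize(gen_emb, dim=-1)
    r = F.normalize(ref_emb, dim=-1)
    sim = g @ r.T                              # (m, n) 余弦相似度矩阵
    gen_to_ref = sim.max(dim=1).values.mean()  # 每个生成 token 的最佳匹配
    ref_to_gen = sim.max(dim=0).values.mean()  # 每个参考 token 的最佳匹配
    return 0.5 * (gen_to_ref + ref_to_gen)     # 双向取平均作为奖励

print(bmas(torch.randn(12, 64), torch.randn(9, 64)))
```

双向设计同时惩罚"漏写参考要点"(参考侧匹配差)与"生成无关内容"(生成侧匹配差),因而适合作为长文本回复的奖励信号。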
zh
[NLP-122] LKD-KGC: Domain-Specific KG Construction via LLM -driven Knowledge Dependency Parsing EDBT2026
【速读】: 该论文试图解决领域特定知识图谱(Knowledge Graphs, KGs)构建过程中因依赖人工定义的模式、单文档处理以及公开领域参考知识而导致的效率低下和效果受限问题。其解决方案的关键在于提出一种名为LKD-KGC的无监督领域特定KG构建框架,该框架通过自主分析文档仓库以推断知识依赖关系,利用大语言模型(Large Language Models, LLMs)驱动的优先级排序确定最优处理顺序,并通过整合多文档的层次化上下文自回归生成实体模式,从而实现无需预定义结构或外部知识的实体与关系无监督提取。
链接: https://arxiv.org/abs/2505.24163
作者: Jiaqi Sun,Shiyou Qian,Zhangchi Han,Wei Li,Zelin Qian,Dingyu Yang,Jian Cao,Guangtao Xue
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitting to EDBT 2026
Abstract:Knowledge Graphs (KGs) structure real-world entities and their relationships into triples, enhancing machine reasoning for various tasks. While domain-specific KGs offer substantial benefits, their manual construction is often inefficient and requires specialized knowledge. Recent approaches for knowledge graph construction (KGC) based on large language models (LLMs), such as schema-guided KGC and reference knowledge integration, have proven efficient. However, these methods are constrained by their reliance on manually defined schema, single-document processing, and public-domain references, making them less effective for domain-specific corpora that exhibit complex knowledge dependencies and specificity, as well as limited reference knowledge. To address these challenges, we propose LKD-KGC, a novel framework for unsupervised domain-specific KG construction. LKD-KGC autonomously analyzes document repositories to infer knowledge dependencies, determines optimal processing sequences via LLM driven prioritization, and autoregressively generates entity schema by integrating hierarchical inter-document contexts. This schema guides the unsupervised extraction of entities and relationships, eliminating reliance on predefined structures or external knowledge. Extensive experiments show that compared with state-of-the-art baselines, LKD-KGC generally achieves improvements of 10% to 20% in both precision and recall rate, demonstrating its potential in constructing high-quality domain-specific KGs.
zh
[NLP-123] Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability ACL2025
【速读】: 该论文试图解决现有研究中关于通过增强推理过程(rationales)来训练语言模型的普遍观点是否始终成立的问题,即验证推理过程对模型性能和可靠性的影响是否具有普遍一致性。其解决方案的关键在于进行系统的实验分析,以全面评估推理过程对模型性能及可靠性的影响,并揭示任务内在难度在其中的作用机制。研究发现,推理过程在某些情况下可能降低模型性能,而在其他情况下则能提升模型可靠性,且性能与可靠性的提升之间存在线性对应关系。
链接: https://arxiv.org/abs/2505.24147
作者: Chiwei Zhu,Benfeng Xu,An Yang,Junyang Lin,Quan Wang,Chang Zhou,Zhendong Mao
机构: 未知
类目: Computation and Language (cs.CL)
备注: To be published in ACL 2025 Findings. (Work originally done in Jan 2024)
Abstract:Training language models with rationales augmentation has been shown to be beneficial in many existing works. In this paper, we identify that such a prevailing view does not hold consistently. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance as well as a novel perspective of model reliability. The results lead to several key findings that add new insights upon existing understandings: 1) Rationales can, at times, deteriorate model performance; 2) Rationales can, at times, improve model reliability, even outperforming their untrained counterparts; 3) A linear correspondence exists in between the performance and reliability improvements, while both are driven by the intrinsic difficulty of the task. These findings provide informative regulations on the broad utilization of rationales and raise critical implications on the procedure of explicitly aligning language models with implicit human thoughts. Codes can be found at this https URL.
zh
[NLP-124] CrossICL: Cross-Task In-Context Learning via Unsupervised Demonstration Transfer
【速读】: 该论文试图解决在In-Context Learning (ICL)中依赖人工提供示例的问题,特别是在实际应用场景中用户难以或不愿意提供这些示例的情况下。其解决方案的关键在于提出一种新的ICL范式——CrossICL,通过利用已有源任务的示例来为目标任务提供可靠指导,从而无需额外的人工努力。为此,作者设计了一种两阶段对齐策略,以缓解任务间差异带来的干扰,作为实验探索的基础。
链接: https://arxiv.org/abs/2505.24143
作者: Jinglong Gao,Xiao Ding,Lingxiao Zou,Bing Qin,Ting Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages
Abstract:In-Context Learning (ICL) enhances the performance of large language models (LLMs) with demonstrations. However, obtaining these demonstrations primarily relies on manual effort. In most real-world scenarios, users are often unwilling or unable to provide such demonstrations. Inspired by the human analogy, we explore a new ICL paradigm CrossICL to study how to utilize existing source task demonstrations in the ICL for target tasks, thereby obtaining reliable guidance without any additional manual effort. To explore this, we first design a two-stage alignment strategy to mitigate the interference caused by gaps across tasks, as the foundation for our experimental exploration. Based on it, we conduct comprehensive exploration of CrossICL, with 875 NLP tasks from the Super-NI benchmark and six types of LLMs, including GPT-4o. Experimental results demonstrate the effectiveness of CrossICL and provide valuable insights on questions like the criteria for selecting cross-task demonstrations, as well as the types of task-gap-induced interference in CrossICL.
zh
[NLP-125] R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration
【速读】: 该论文旨在解决推理模型在推理过程中生成过长输出导致的键值(Key-Value, KV)缓存过大问题,以及现有KV缓存压缩方法在链式思维(Chain-of-Thought, CoT)推理中可能引发的推理失败问题。解决方案的关键在于提出一种冗余感知的KV缓存压缩方法(Redundancy-aware KV Cache Compression for Reasoning models, R-KV),该方法通过识别并保留推理过程中的关键信息,仅使用10%的KV缓存即可接近全缓存性能,显著优于现有基线方法,并在减少内存占用和提升吞吐量方面表现出色。
链接: https://arxiv.org/abs/2505.24133
作者: Zefan Cai,Wen Xiao,Hanshi Sun,Cheng Luo,Yikai Zhang,Ke Wan,Yucheng Li,Yeyang Zhou,Li-Wen Chang,Jiuxiang Gu,Zhen Dong,Anima Anandkumar,Abedelkadir Asi,Junjie Hu
机构: University of Wisconsin - Madison; Microsoft; Carnegie Mellon University; California Institute of Technology; University of California - San Diego; University Of Surrey; Adobe; University of California - Berkeley
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV-cache reduction also leads to a 90% memory saving and a 6.6X throughput over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.
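冗余感知压缩的一种直观做法是:在重要性(如累计注意力)高的前提下,惩罚与已保留 token 过于相似的条目,贪心地选出约 10% 的 KV 缓存。以下为示意性实现,打分方式与权衡系数 lam 均为假设,并非论文原始算法:

```python
import numpy as np

def rkv_select(keys, importance, keep_ratio=0.1, lam=0.5):
    """冗余感知的 KV 条目选择(示意)。
    keys: (T, d) 缓存的 key 向量;importance: (T,) 重要性分数(如累计注意力)。
    lam 为重要性与冗余惩罚之间的权衡系数(假设值)。"""
    T = keys.shape[0]
    k = max(1, int(T * keep_ratio))
    normed = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    kept = [int(importance.argmax())]                 # 先保留最重要的 token
    while len(kept) < k:
        red = (normed @ normed[kept].T).max(axis=1)   # 与已保留集合的最大相似度
        score = importance - lam * red                # 重要且不冗余者得分高
        score[kept] = -np.inf
        kept.append(int(score.argmax()))
    return sorted(kept)

rng = np.random.default_rng(1)
idx = rkv_select(rng.normal(size=(200, 32)), rng.random(200))
print(len(idx), idx[:10])  # 仅保留约 10% 的 KV 条目
```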
zh
[NLP-126] The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It
【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)安全研究中存在显著的语言偏倚问题,即研究主要以英语为中心,导致其他语言尤其是高资源非英语语言的安全研究被忽视。其解决方案的关键在于提出多语言安全研究的必要性,并通过系统性回顾近300篇2020-2024年间发表于ACL等主要自然语言处理会议和研讨会的文献,识别出当前英语安全研究在语言文档记录方面的不足,进而提出未来在安全评估、训练数据生成以及跨语言安全泛化三个方向的研究建议,以推动更全面、包容的AI安全实践。
链接: https://arxiv.org/abs/2505.24119
作者: Zheng-Xin Yong,Beyza Ermis,Marzieh Fadaee,Stephen H. Bach,Julia Kreutzer
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper presents a comprehensive analysis of the linguistic diversity of LLM safety research, highlighting the English-centric nature of the field. Through a systematic review of nearly 300 publications from 2020–2024 across major NLP conferences and workshops at *ACL, we identify a significant and growing language gap in LLM safety research, with even high-resource non-English languages receiving minimal attention. We further observe that non-English languages are rarely studied as a standalone language and that English safety research exhibits poor language documentation practice. To motivate future research into multilingual safety, we make several recommendations based on our survey, and we then pose three concrete future directions on safety evaluation, training data generation, and crosslingual safety generalization. Based on our survey and proposed directions, the field can develop more robust, inclusive AI safety practices for diverse global populations.
zh
[NLP-127] Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning
【速读】: 该论文试图解决将大语言模型(Large Language Models, LLMs)适配到复杂临床推理任务中的问题,特别是在电子健康记录(Electronic Health Records, EHRs)解释中所面临的挑战。文献指出,尽管强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)在数学和编程领域已取得成功,但在医疗保健场景中应用时,由于需要专业知识和推理能力,存在独特的困难。解决方案的关键在于提出EHRMIND,该方法采用两阶段策略:首先通过轻量级监督微调(Supervised Fine-Tuning, SFT)注入缺失的领域知识并稳定训练过程,随后通过RLVR优化模型的决策能力,从而提升准确性、可解释性及跨任务泛化能力。
链接: https://arxiv.org/abs/2505.24105
作者: Jiacheng Lin,Zhenbang Wu,Jimeng Sun
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:We present EHRMIND, a practical recipe for adapting large language models (LLMs) to complex clinical reasoning tasks using reinforcement learning with verifiable rewards (RLVR). While RLVR has succeeded in mathematics and coding, its application to healthcare contexts presents unique challenges due to the specialized knowledge and reasoning required for electronic health record (EHR) interpretation. Our pilot study on the MEDCALC benchmark reveals two key failure modes: (1) misapplied knowledge, where models possess relevant medical knowledge but apply it incorrectly, and (2) missing knowledge, where models lack essential domain knowledge. To address these cases, EHRMIND applies a two-stage solution: a lightweight supervised fine-tuning (SFT) warm-up that injects missing domain knowledge, stabilizes subsequent training, and encourages structured, interpretable outputs; followed by RLVR, which reinforces outcome correctness and refines the model’s decision-making. We demonstrate the effectiveness of our method across diverse clinical applications, including medical calculations (MEDCALC), patient-trial matching (TREC CLINICAL TRIALS), and disease diagnosis (EHRSHOT). EHRMIND delivers consistent gains in accuracy, interpretability, and cross-task generalization. These findings offer practical guidance for applying RLVR to enhance LLM capabilities in healthcare settings.
zh
[NLP-128] HardTests: Synthesizing High-Quality Test Cases for LLM Coding
【速读】: 该论文试图解决在大型语言模型(Large Language Model, LLM)推理中,可靠验证器(verifier)难以获取的问题,特别是在复杂的编程问题中,因为精心设计的错误解决方案可能仅能通过人工编写的边缘案例检测出来,而这些案例难以自动合成。解决方案的关键是提出HARDTESTGEN,这是一个利用LLM进行高质量测试用例合成的流程,通过该流程构建了一个包含47k题目的综合性竞赛编程数据集HARDTESTS,其测试用例在评估LLM生成代码时表现出更高的精度和召回率,尤其在较难的问题上效果更为显著。
链接: https://arxiv.org/abs/2505.24098
作者: Zhongmou He,Yee Man Choi,Kexun Zhang,Jiabao Ji,Junting Zhou,Dejia Xu,Ivan Bercovich,Aidan Zhang,Lei Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Verifiers play a crucial role in large language model (LLM) reasoning, needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to get for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate a comprehensive competitive programming dataset HARDTESTS with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at this https URL.
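摘要中测试用例的“精确率/召回率”可以这样理解:把“测试拒绝某份 LLM 代码”视为对“该代码有错”的预测,与人工判定的真值对比。下面是按此假设定义的极简计算示意(具体定义以论文为准):

```python
def test_suite_precision_recall(verdicts, truth):
    """verdicts[i]=True 表示测试判定第 i 份代码错误(拒绝);
    truth[i]=True 表示该代码确实错误。此定义方式为本文的示意性假设。"""
    tp = sum(v and t for v, t in zip(verdicts, truth))        # 正确地拒绝了错误代码
    fp = sum(v and not t for v, t in zip(verdicts, truth))    # 误杀了正确代码
    fn = sum((not v) and t for v, t in zip(verdicts, truth))  # 放过了错误代码
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(test_suite_precision_recall(
    verdicts=[True, True, False, True, False],
    truth=[True, False, False, True, True]))
```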
zh
[NLP-129] CM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine
【速读】: 该论文旨在解决当前针对中医药领域的大语言模型(Large Language Models, LLMs)缺乏统一、标准化的多模态问答(QA)评估基准的问题。现有评估数据集在范围和形式上存在局限,主要为文本类型,无法全面反映模型在真实任务中的表现。论文提出的解决方案是构建TCM-Ladder,这是首个专为评估大型中医药语言模型而设计的多模态QA数据集,涵盖中医药多个核心学科,并整合了文本、图像和视频等多种模态内容,包含超过52,000道题目,涵盖多种题型。此外,论文还提出了Ladder-Score评估方法,专门用于评估中医药问答中的术语使用和语义表达质量,从而实现对通用领域和中医药专用大语言模型的统一多模态评估。
链接: https://arxiv.org/abs/2505.24063
作者: Jiacheng Xie,Yang Yu,Ziyang Zhang,Shuai Zeng,Jiaxuan He,Ayush Vasireddy,Xiaoting Tang,Congyu Guo,Lening Zhao,Congcong Jing,Guanghui An,Dong Xu
机构: 未知
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注: 22 pages, 4 figures
Abstract:Traditional Chinese Medicine (TCM), as an effective alternative medicine, has been receiving increasing attention. In recent years, the rapid development of large language models (LLMs) tailored for TCM has underscored the need for an objective and comprehensive evaluation framework to assess their performance on real-world tasks. However, existing evaluation datasets are limited in scope and primarily text-based, lacking a unified and standardized multimodal question-answering (QA) benchmark. To address this issue, we introduce TCM-Ladder, the first multimodal QA dataset specifically designed for evaluating large TCM language models. The dataset spans multiple core disciplines of TCM, including fundamental theory, diagnostics, herbal formulas, internal medicine, surgery, pharmacognosy, and pediatrics. In addition to textual content, TCM-Ladder incorporates various modalities such as images and videos. The datasets were constructed using a combination of automated and manual filtering processes and comprise 52,000+ questions in total. These questions include single-choice, multiple-choice, fill-in-the-blank, diagnostic dialogue, and visual comprehension tasks. We trained a reasoning model on TCM-Ladder and conducted comparative experiments against 9 state-of-the-art general domain and 5 leading TCM-specific LLMs to evaluate their performance on the datasets. Moreover, we propose Ladder-Score, an evaluation method specifically designed for TCM question answering that effectively assesses answer quality regarding terminology usage and semantic expression. To our knowledge, this is the first work to evaluate mainstream general domain and TCM-specific LLMs on a unified multimodal benchmark. The datasets and leaderboard are publicly available at this https URL or this https URL and will be continuously updated.
zh
[NLP-130] MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在医学问答任务中虽然能够给出正确答案,但其推理过程可能存在逻辑错误的问题。解决方案的关键在于构建MedPAIR数据集,通过收集36名医学住院医师对1,300个问答对中每个句子的相关性标注,评估医生与LLMs在处理信息时的相关性优先级差异,并验证去除医生标注的不相关信息后对下游任务性能的影响。研究发现,LLMs与医生在相关性判断上存在显著差异,而过滤掉不相关信息后,医生和LLMs的准确性均有所提升。
链接: https://arxiv.org/abs/2505.24040
作者: Yuexing Hao,Kumail Alhamoud,Hyewon Jeong,Haoran Zhang,Isha Puri,Philip Torr,Mike Schaekermann,Ariel D. Stern,Marzyeh Ghassemi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance on various medical question-answering (QA) benchmarks, including standardized medical exams. However, correct answers alone do not ensure correct logic, and models may reach accurate conclusions through flawed processes. In this study, we introduce the MedPAIR (Medical Dataset Comparing Physicians and AI Relevance Estimation and Question Answering) dataset to evaluate how physician trainees and LLMs prioritize relevant information when answering QA questions. We obtain annotations on 1,300 QA pairs from 36 physician trainees, labeling each sentence within the question components for relevance. We compare these relevance estimates to those for LLMs, and further evaluate the impact of these “relevant” subsets on downstream task performance for both physician trainees and LLMs. We find that LLMs are frequently not aligned with the content relevance estimates of physician trainees. After filtering out physician trainee-labeled irrelevant sentences, accuracy improves for both the trainees and the LLMs. All LLM and physician trainee-labeled data are available at: this http URL.
zh
[NLP-131] The Surprising Soupability of Documents in State Space Models
【速读】: 该论文试图解决如何利用结构化状态空间模型(Structured State Space Models, SSMs)的隐状态进行下游推理的问题,特别是如何在不重新处理完整输入的情况下实现模块化编码与复用。其解决方案的关键在于提出了一种称为“文档混汤”(document souping)的方法,即独立编码文档后通过简单的操作(如平均)将它们的表示进行聚合,形成单一上下文状态,从而支持多跳问答、稀疏检索和长文档推理等任务。
链接: https://arxiv.org/abs/2505.24033
作者: Yasaman Jafari,Zixian Wang,Leon Bergen,Taylor Berg-Kirkpatrick
机构: University of California San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:
Abstract:We investigate whether hidden states from Structured State Space Models (SSMs) can be merged post-hoc to support downstream reasoning. Inspired by model souping, we propose a strategy where documents are encoded independently and their representations are pooled – via simple operations like averaging – into a single context state. This approach, which we call document souping, enables modular encoding and reuse without reprocessing the full input for each query. We finetune Mamba2 models to produce soupable representations and find that they support multi-hop QA, sparse retrieval, and long-document reasoning with strong accuracy. On HotpotQA, souping ten independently encoded documents nearly matches the performance of a cross-encoder trained on the same inputs.
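“文档混汤”的核心操作非常简单:各文档独立走一遍状态空间递推,得到各自的最终隐状态,再逐元素平均成单一上下文状态。下面用一个玩具线性 SSM 做示意(论文实际微调的是 Mamba2,此处的 A、B 仅为随机假设):

```python
import numpy as np

d_state, d_in = 16, 8
rng = np.random.default_rng(2)
A = rng.normal(scale=0.1, size=(d_state, d_state))  # 玩具线性 SSM 的状态转移
B = rng.normal(scale=0.1, size=(d_state, d_in))     # 输入投影

def encode(tokens):
    """独立编码一篇文档,返回其最终隐状态(示意)。"""
    h = np.zeros(d_state)
    for x in tokens:
        h = A @ h + B @ x   # 简化的状态空间递推
    return h

docs = [rng.normal(size=(20, d_in)) for _ in range(10)]
states = [encode(doc) for doc in docs]
soup = np.mean(states, axis=0)   # “文档混汤”:对各文档最终状态取平均
# 随后以 soup 作为初始状态处理查询,无需为每个查询重新编码全部文档
print(soup.shape)
```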
zh
[NLP-132] Hidden Persuasion: Detecting Manipulative Narratives on Social Media During the 2022 Russian Invasion of Ukraine
【速读】: 该论文旨在解决社交媒体中操纵技术的检测与分类问题,特别是针对乌克兰Telegram用户所使用的修辞和文体操纵手法。其关键解决方案在于对Gemma 2语言模型进行LoRA适配器微调,并结合元特征和阈值优化构建二级分类器以提升分类性能;在片段检测方面,则采用经过多目标训练的XLM-RoBERTa模型进行令牌二分类,从而实现对操纵内容的精准定位。
链接: https://arxiv.org/abs/2505.24028
作者: Kateryna Akhynko,Oleksandr Kosovan,Mykola Trokhymovych
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper presents one of the top-performing solutions to the UNLP 2025 Shared Task on Detecting Manipulation in Social Media. The task focuses on detecting and classifying rhetorical and stylistic manipulation techniques used to influence Ukrainian Telegram users. For the classification subtask, we fine-tuned the Gemma 2 language model with LoRA adapters and applied a second-level classifier leveraging meta-features and threshold optimization. For span detection, we employed an XLM-RoBERTa model trained for multi-target, including token binary classification. Our approach achieved 2nd place in classification and 3rd place in span detection.
zh
[NLP-133] BeaverTalk: Oregon State Universitys IWSLT 2025 Simultaneous Speech Translation System
【速读】: 该论文旨在解决实时语音到文本翻译(speech-to-text translation)中的挑战,特别是在IWSLT 2025同时翻译任务中的低延迟和高延迟场景。其解决方案的关键在于构建一个级联系统BeaverTalk,该系统结合了语音活动检测(VAD)分段器、Whisper Large V2进行自动语音识别(ASR),以及Gemma 3 12B大语言模型(LLM)实现同时翻译。为提升翻译效果,Gemma 3 12B通过低秩适配器(LoRAs)进行微调,采用对话式提示策略,并利用源语言的单句记忆库作为上下文。
链接: https://arxiv.org/abs/2505.24016
作者: Matthew Raffel,Victor Agostinelli,Lizhong Chen
机构: Oregon State University (俄勒冈州立大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at IWSLT 2025
Abstract:This paper discusses the construction, fine-tuning, and deployment of BeaverTalk, a cascaded system for speech-to-text translation as part of the IWSLT 2025 simultaneous translation task. The system architecture employs a VAD segmenter for breaking a speech stream into segments, Whisper Large V2 for automatic speech recognition (ASR), and Gemma 3 12B for simultaneous translation. Regarding the simultaneous translation LLM, it is fine-tuned via low-rank adaptors (LoRAs) for a conversational prompting strategy that leverages a single prior-sentence memory bank from the source language as context. The cascaded system participated in the English \rightarrow German and English \rightarrow Chinese language directions for both the low and high latency regimes. In particular, on the English \rightarrow German task, the system achieves a BLEU of 24.64 and 27.83 at a StreamLAAL of 1837.86 and 3343.73, respectively. Then, on the English \rightarrow Chinese task, the system achieves a BLEU of 34.07 and 37.23 at a StreamLAAL of 2216.99 and 3521.35, respectively.
zh
[NLP-134] Large Language Model Meets Constraint Propagation IJCAI2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成文本时难以有效执行外部约束的问题,因为LLMs在生成过程中是按顺序逐个生成token,缺乏显式的控制机制。其解决方案的关键在于将LLMs的预测与约束编程(Constraint Programming, CP)推理相结合,将文本生成建模为一个约束满足问题(Constraint Satisfaction Problem, CSP),并通过引入掩码语言模型(Masked Language Models, MLMs)来增强领域生成能力,实现双向约束传播,从而提升文本生成的可靠性和约束感知能力。
链接: https://arxiv.org/abs/2505.24012
作者: Alexandre Bonlarron,Florian Régin,Elisabetta De Maria,Jean-Charles Régin
机构: Université Côte d’Azur (蔚蓝海岸大学); CNRS (法国国家科学研究中心); I3S (I3S实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear in the Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2025)
Abstract:Large Language Models (LLMs) excel at generating fluent text but struggle to enforce external constraints because they generate tokens sequentially without explicit control mechanisms. GenCP addresses this limitation by combining LLM predictions with Constraint Programming (CP) reasoning, formulating text generation as a Constraint Satisfaction Problem (CSP). In this paper, we improve GenCP by integrating Masked Language Models (MLMs) for domain generation, which allows bidirectional constraint propagation that leverages both past and future tokens. This integration bridges the gap between token-level prediction and structured constraint enforcement, leading to more reliable and constraint-aware text generation. Our evaluation on COLLIE benchmarks demonstrates that incorporating domain preview via MLM calls significantly improves GenCP’s performance. Although this approach incurs additional MLM calls and, in some cases, increased backtracking, the overall effect is a more efficient use of LLM inferences and an enhanced ability to generate feasible and meaningful solutions, particularly in tasks with strict content constraints.
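“用 MLM 做域预览”的含义是:对句中待定位置,先用掩码语言模型给出候选词集合(即 CSP 中该变量的域),再用约束过滤。以下用 Hugging Face 的 fill-mask 管线给出极简示意(需联网下载 bert-base-uncased;“每词不超过 5 个字母”的约束为假设示例,并非论文的实际约束):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

candidates = fill("The cat sat on the [MASK].", top_k=20)
domain = [c["token_str"] for c in candidates]  # MLM 给出的候选域

def constraint(word: str) -> bool:
    return word.isalpha() and len(word) <= 5   # 示例约束

feasible = [w for w in domain if constraint(w)]
print(feasible)  # 约束传播后留下的可行候选词
```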
zh
[NLP-135] Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws
【速读】: 该论文试图解决Transformer架构中内部机制与参数扩展规律(parameter scaling laws)之间关系不明确的问题。其解决方案的关键在于通过偏置-多样性分解(bias-diversity decomposition)理论分析残差流中的层及其规模,揭示层间误差(偏置)和输出差异性(多样性)对模型性能的影响。研究发现,当单个层的预测接近正确答案且层间输出保持多样性时,模型性能提升显著,尤其在单层输出偏离真实值时,多样性变得尤为重要。此外,论文引入信息论意义上的多样性概念,证明增加层数仅在层间行为差异显著时才能提升性能,并揭示了层数增加带来的性能增益具有次模性(submodularity),这与参数扩展规律预测的对数收敛趋势一致。
链接: https://arxiv.org/abs/2505.24009
作者: Hidetaka Kamigaito,Ying Zhang,Jingun Kwon,Katsuhiko Hayashi,Manabu Okumura,Taro Watanabe
机构: Nara Institute of Science and Technology (奈良先端科学技術大学院大学); RIKEN Center for Advanced Intelligence Project (理化学研究所先进智能项目中心); Chungnam National University (忠南国立大学); The University of Tokyo (东京大学); Institute of Science Tokyo (东京科学大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Transformers deliver outstanding performance across a wide range of tasks and are now a dominant backbone architecture for large language models (LLMs). Their task-solving performance is improved by increasing parameter size, as shown in the recent studies on parameter scaling laws. Although recent mechanistic-interpretability studies have deepened our understanding of the internal behavior of Transformers by analyzing their residual stream, the relationship between these internal mechanisms and the parameter scaling laws remains unclear. To bridge this gap, we focus on layers and their size, which mainly decide the parameter size of Transformers. For this purpose, we first theoretically investigate the layers within the residual stream through a bias-diversity decomposition. The decomposition separates (i) bias, the error of each layer’s output from the ground truth, and (ii) diversity, which indicates how much the outputs of each layer differ from each other. Analyzing Transformers under this theory reveals that performance improves when individual layers make predictions close to the correct answer and remain mutually diverse. We show that diversity becomes especially critical when individual layers’ outputs are far from the ground truth. Finally, we introduce an information-theoretic diversity and show our main findings that adding layers enhances performance only when those layers behave differently, i.e., are diverse. We also reveal the performance gains from increasing the number of layers exhibit submodularity: marginal improvements diminish as additional layers increase, mirroring the logarithmic convergence predicted by the parameter scaling laws. Experiments on multiple semantic-understanding tasks with various LLMs empirically confirm the theoretical properties derived in this study.
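文中的 bias-diversity 分解与集成学习中经典的 ambiguity 分解同构:集成平方误差 = 各成员平均平方误差(bias 项)− 成员间差异(diversity 项)。下面用 NumPy 数值验证这一恒等式(以随机“层输出”模拟,仅作一般化示意,并非论文在残差流上的具体推导):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=100)                  # 真实目标
preds = y + rng.normal(size=(6, 100))     # 6 个“层”的输出(加噪声模拟)

ens = preds.mean(axis=0)                  # 各层输出的聚合(集成)
bias = ((preds - y) ** 2).mean()          # 各层各自的平方误差(bias 项)
diversity = ((preds - ens) ** 2).mean()   # 各层输出彼此的差异(diversity 项)
ens_err = ((ens - y) ** 2).mean()

# 集成误差 = 平均偏差 - 多样性(ambiguity 分解),数值上应精确相等
print(ens_err, bias - diversity)
```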
zh
[NLP-136] Redefining Research Crowdsourcing: Incorporating Human Feedback with LLM-Powered Digital Twins
【速读】: 该论文试图解决众包工作平台(如Amazon Mechanical Turk和Prolific)中,随着生成式AI工具的广泛使用导致的数据有效性下降以及工作者角色被弱化的问题。其解决方案的关键在于提出一种混合框架,利用数字孪生(digital twin)技术,通过个性化AI模型模拟工作者的行为与偏好,同时保持人类在决策过程中的参与。该方法旨在提升工作效率并减少决策疲劳,同时维持响应质量,实现可扩展性与真实性的平衡。
链接: https://arxiv.org/abs/2505.24004
作者: Amanda Chan,Catherine Di,Joseph Rupertus,Gary Smith,Varun Nagaraj Rao,Manoel Horta Ribeiro,Andrés Monroy-Hernández
机构: Princeton University (普林斯顿大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted as a CHI Late Breaking Work (2025), cite appropriately
Abstract:Crowd work platforms like Amazon Mechanical Turk and Prolific are vital for research, yet workers’ growing use of generative AI tools poses challenges. Researchers face compromised data validity as AI responses replace authentic human behavior, while workers risk diminished roles as AI automates tasks. To address this, we propose a hybrid framework using digital twins, personalized AI models that emulate workers’ behaviors and preferences while keeping humans in the loop. We evaluate our system with an experiment (n=88 crowd workers) and in-depth interviews with crowd workers (n=5) and social science researchers (n=4). Our results suggest that digital twins may enhance productivity and reduce decision fatigue while maintaining response quality. Both researchers and workers emphasized the importance of transparency, ethical data use, and worker agency. By automating repetitive tasks and preserving human engagement for nuanced ones, digital twins may help balance scalability with authenticity.
zh
[NLP-137] Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs ICML2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在公平性评估中的不足问题,特别是传统公平性度量仅关注离散的准确性评估而未能捕捉模型不确定性对决策偏见的隐性影响。解决方案的关键在于提出一种考虑不确定性的公平性度量方法——UCerF,该方法能够更细致地评估模型公平性,反映模型决策中的内部偏差,相较于传统度量更具代表性。同时,论文还引入了一个包含31,756个样本的新性别-职业公平性评估数据集,以弥补现有数据集在数据规模、多样性和清晰度方面的不足。
链接: https://arxiv.org/abs/2505.23996
作者: Yinong Oliver Wang,Nivedha Sivakumar,Falaah Arif Khan,Rin Metcalf Susa,Adam Golinski,Natalie Mackraz,Barry-John Theobald,Luca Zappella,Nicholas Apostoloff
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 8 figures, and 1 table in main paper. Supplementary appendix attached. Accepted at ICML 2025
Abstract:The recent rapid adoption of large language models (LLMs) highlights the critical need for benchmarking their fairness. Conventional fairness metrics, which focus on discrete accuracy-based evaluations (i.e., prediction correctness), fail to capture the implicit impact of model uncertainty (e.g., higher model confidence about one group over another despite similar accuracy). To address this limitation, we propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness that is more reflective of the internal bias in model decisions compared to conventional fairness measures. Furthermore, observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset with 31,756 samples for co-reference resolution, offering a more diverse and suitable dataset for evaluating modern LLMs. We establish a benchmark, using our metric and dataset, and apply it to evaluate the behavior of ten open-source LLMs. For example, Mistral-7B exhibits suboptimal fairness due to high confidence in incorrect predictions, a detail overlooked by Equalized Odds but captured by UCerF. Overall, our proposed LLM benchmark, which evaluates fairness with uncertainty awareness, paves the way for developing more transparent and accountable AI systems.
zh
[NLP-138] Large Language Models for Controllable Multi-property Multi-objective Molecule Optimization
【速读】: 该论文旨在解决现实世界药物设计中分子优化的问题,即在保持已满足药理学相关标准的分子性质的同时,选择性地提升多个分子性质。现有计算方法和指令调优的大语言模型(Large Language Models, LLMs)无法准确捕捉这种特定性质的目标,限制了其实际应用。论文提出的解决方案关键在于构建了首个专注于多性质优化且具有明确性质特定目标的指令调优数据集C-MuMOInstruct,并基于此开发了GeLLMO-Cs系列指令调优的LLMs,能够执行针对性的性质优化,从而在多种任务中表现出色,展现出强大的零样本泛化能力。
链接: https://arxiv.org/abs/2505.23987
作者: Vishal Dey,Xiao Hu,Xia Ning
机构: The Ohio State University(俄亥俄州立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
备注:
Abstract:In real-world drug design, molecule optimization requires selectively improving multiple molecular properties up to pharmaceutically relevant levels, while maintaining others that already meet such criteria. However, existing computational approaches and instruction-tuned LLMs fail to capture such nuanced property-specific objectives, limiting their practical applicability. To address this, we introduce C-MuMOInstruct, the first instruction-tuning dataset focused on multi-property optimization with explicit, property-specific objectives. Leveraging C-MuMOInstruct, we develop GeLLMO-Cs, a series of instruction-tuned LLMs that can perform targeted property-specific optimization. Our experiments across 5 in-distribution and 5 out-of-distribution tasks show that GeLLMO-Cs consistently outperform strong baselines, achieving up to 126% higher success rate. Notably, GeLLMO-Cs exhibit impressive 0-shot generalization to novel optimization tasks and unseen instructions. This offers a step toward a foundational LLM to support realistic, diverse optimizations with property-specific objectives. C-MuMOInstruct and code are accessible through this https URL.
zh
[NLP-139] FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在资源受限环境中部署时面临的高计算和内存需求问题。现有低秩分解方法虽为结构压缩提供了可行路径,但常伴随精度下降、校准过程昂贵以及模型架构效率低下等问题。论文提出的FLAT-LLM是一种无需训练的结构压缩方法,其关键在于在激活空间中采用细粒度低秩变换,通过基于头部的主成分分析(PCA)计算截断特征向量来降低隐藏维度,并利用重要性度量自适应分配解码器中的秩,从而实现高效且有效的权重压缩,无需恢复微调即可在短时间内完成校准。
链接: https://arxiv.org/abs/2505.23966
作者: Jiayi Tian,Ryan Solgi,Jinming Lu,Yifan Yang,Hai Li,Zheng Zhang
机构: University of California, Santa Barbara(加州大学圣塔芭芭拉分校); Intel Corporation(英特尔公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation, expensive calibration procedures, and result in inefficient model architectures that hinder real-world inference speedups. In this paper, we propose FLAT-LLM, a fast and accurate, training-free structural compression method based on fine-grained low-rank transformations in the activation space. Specifically, we reduce the hidden dimension by transforming the weights using truncated eigenvectors computed via head-wise Principal Component Analysis (PCA), and employ an importance-based metric to adaptively allocate ranks across decoders. FLAT-LLM achieves efficient and effective weight compression without recovery fine-tuning, which could complete the calibration within a few minutes. Evaluated across 4 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance, while delivering inference speedups over decomposition-based methods.
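“头级 PCA + 截断特征向量”可以这样直观理解:用校准激活估计协方差,取前 rank 个主方向 V,把激活投影到低维的同时将后接权重变换为 Vᵀ W,使整体函数近似不变。以下为单个头的示意实现(校准数据为随机假设,rank 的自适应分配未包含在内):

```python
import numpy as np

def pca_truncate(X, W, rank):
    """按激活主成分截断隐藏维度(单个注意力头的示意)。
    X: (N, d) 校准激活;W: (d, d_out) 该头后接的权重;rank: 保留的维度数。"""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(X)
    eigval, eigvec = np.linalg.eigh(cov)   # 特征值升序排列
    V = eigvec[:, -rank:]                  # 取前 rank 个主方向
    # 激活先投影到低维 (h @ V),权重相应变换为 V.T @ W,整体函数近似不变
    return V, V.T @ W

rng = np.random.default_rng(4)
X, W = rng.normal(size=(512, 64)), rng.normal(size=(64, 64))
V, W_low = pca_truncate(X, W, rank=16)
print(V.shape, W_low.shape)   # (64, 16) (16, 64):隐藏维从 64 压到 16
```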
zh
[NLP-140] Information Structure in Mappings: An Approach to Learning Representation and Generalisation
【速读】: 该论文试图解决大规模神经网络在表征空间上的统一描述问题,即缺乏可靠的方法来描述其表征结构的形成过程及何种结构具有优势。论文的关键解决方案是引入定量方法以识别映射中的系统性结构,并结合信息论量化每个结构的特性,从而分析学习、结构和泛化能力。此外,论文提出了一种高效估计向量空间熵的新方法,使得该分析可应用于从100万到120亿参数规模的模型,进而揭示大规模分布式认知模型的学习机制及其与人类类比系统的共性。
链接: https://arxiv.org/abs/2505.23960
作者: Henry Conklin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: PhD Thesis, 204 pages; entropy estimation discussed from p.94
Abstract:Despite the remarkable success of large-scale neural networks, we still lack unified notation for thinking about and describing their representational spaces. We lack methods to reliably describe how their representations are structured, how that structure emerges over training, and what kinds of structures are desirable. This thesis introduces quantitative methods for identifying systematic structure in a mapping between spaces, and leverages them to understand how deep-learning models learn to represent information, what representational structures drive generalisation, and how design decisions condition the structures that emerge. To do this I identify structural primitives present in a mapping, along with information theoretic quantifications of each. These allow us to analyse learning, structure, and generalisation across multi-agent reinforcement learning models, sequence-to-sequence models trained on a single task, and Large Language Models. I also introduce a novel, performant, approach to estimating the entropy of vector space, that allows this analysis to be applied to models ranging in size from 1 million to 12 billion parameters. The experiments here work to shed light on how large-scale distributed models of cognition learn, while allowing us to draw parallels between those systems and their human analogs. They show how the structures of language and the constraints that give rise to them in many ways parallel the kinds of structures that drive performance of contemporary neural networks.
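论文提出了一种新的高效向量空间熵估计方法;作为概念示意,下面给出经典的 Kozachenko–Leonenko k 近邻熵估计器(与论文方法不同,仅用于说明“估计表示空间的熵”在做什么):

```python
import numpy as np
from scipy.special import digamma, gammaln

def knn_entropy(X, k=3):
    """Kozachenko–Leonenko k 近邻微分熵估计(单位:nats)。"""
    N, d = X.shape
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # 排除自身
    eps = np.sort(dists, axis=1)[:, k - 1]     # 到第 k 近邻的距离
    log_cd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # 单位球体积的对数
    return digamma(N) - digamma(k) + log_cd + d * np.mean(np.log(eps + 1e-12))

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))  # 标准高斯的真实微分熵约为 (d/2)·log(2πe) ≈ 4.26
print(knn_entropy(X))
```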
zh
[NLP-141] A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models
【速读】: 该论文试图解决生成式 AI (Generative AI) 在链式思维 (Chain-of-thought, CoT) 推理过程中是否存在忠实性问题,即推理过程是否真实反映了模型的内部处理机制。解决方案的关键在于提出一种新颖的细粒度评估流程,用于分类偏见表达模式,从而实现对 CoT 推理更精确的分析。该框架揭示了模型在处理不同类型偏见时的差异,并发现了图像相关偏见较文本偏见更难被表达的现象,以及“不一致”推理这一新型现象,为检测非忠实 CoT 提供了潜在指标。
链接: https://arxiv.org/abs/2505.23945
作者: Sriram Balasubramanian,Samyadeep Basu,Soheil Feizi
机构: University of Maryland, College Park(马里兰大学学院市分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 34 pages, 25 figures
Abstract:Chain-of-thought (CoT) reasoning enhances performance of large language models, but questions remain about whether these reasoning traces faithfully reflect the internal processes of the model. We present the first comprehensive study of CoT faithfulness in large vision-language models (LVLMs), investigating how both text-based and previously unexplored image-based biases affect reasoning and bias articulation. Our work introduces a novel, fine-grained evaluation pipeline for categorizing bias articulation patterns, enabling significantly more precise analysis of CoT reasoning than previous methods. This framework reveals critical distinctions in how models process and respond to different types of biases, providing new insights into LVLM CoT faithfulness. Our findings reveal that subtle image-based biases are rarely articulated compared to explicit text-based ones, even in models specialized for reasoning. Additionally, many models exhibit a previously unidentified phenomenon we term ``inconsistent’’ reasoning - correctly reasoning before abruptly changing answers, serving as a potential canary for detecting biased reasoning from unfaithful CoTs. We then apply the same evaluation pipeline to revisit CoT faithfulness in LLMs across various levels of implicit cues. Our findings reveal that current language-only reasoning models continue to struggle with articulating cues that are not overtly stated.
zh
[NLP-142] Retrieval Augmented Generation based Large Language Models for Causality Mining NAACL2025
【速读】: 该论文试图解决信息检索中的因果关系检测与挖掘问题(causality detection and mining),其核心挑战在于传统无监督方法性能较差且需要大量人工干预,而有监督方法则受限于大规模标注数据的缺乏。解决方案的关键在于利用检索增强生成(retrieval-augmented generation, RAG)技术构建动态提示(dynamic prompting)方案,以提升大语言模型(large language models, LLMs)在因果关系检测与抽取任务中的表现。
链接: https://arxiv.org/abs/2505.23944
作者: Thushara Manjari Naduvilakandy,Hyeju Jang,Mohammad Al Hasan
机构: Indiana University Indianapolis(印第安纳大学印第安纳波利斯分校)
类目: Computation and Language (cs.CL)
备注: 13 pages, 6 figures, published in knowledgeNLP-NAACL2025
Abstract:Causality detection and mining are important tasks in information retrieval due to their enormous use in information extraction, and knowledge graph construction. To solve these tasks, in existing literature there exist several solutions – both unsupervised and supervised. However, the unsupervised methods suffer from poor performance and they often require significant human intervention for causal rule selection, leading to poor generalization across different domains. On the other hand, supervised methods suffer from the lack of large training datasets. Recently, large language models (LLMs) with effective prompt engineering are found to be effective to overcome the issue of unavailability of large training dataset. Yet, in existing literature, there does not exist comprehensive works on causality detection and mining using LLM prompting. In this paper, we present several retrieval-augmented generation (RAG) based dynamic prompting schemes to enhance LLM performance in causality detection and extraction tasks. Extensive experiments over three datasets and five LLMs validate the superiority of our proposed RAG-based dynamic prompting over other static prompting schemes.
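“基于 RAG 的动态提示”的基本流程是:用查询句检索最相似的已标注因果示例,动态拼入提示词再交给 LLM。下面用 TF-IDF 检索给出一个极简示意(示例句与标签均为虚构;论文实际采用的检索器与提示模板可能不同):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pool = [  # 已标注的因果/非因果示例池(虚构)
    ("The storm caused severe flooding.", "causal"),
    ("He likes tea and coffee.", "non-causal"),
    ("Smoking leads to lung disease.", "causal"),
]
query = "Heavy rain led to landslides."

vec = TfidfVectorizer().fit([t for t, _ in pool] + [query])
sims = cosine_similarity(vec.transform([query]),
                         vec.transform([t for t, _ in pool]))[0]
topk = sims.argsort()[::-1][:2]   # 取最相似的 2 条作为动态示例

demos = "\n".join(f"Sentence: {pool[i][0]}\nLabel: {pool[i][1]}" for i in topk)
prompt = f"{demos}\nSentence: {query}\nLabel:"   # 交给 LLM 完成因果检测
print(prompt)
```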
zh
[NLP-143] SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving
【速读】: 该论文试图解决传统静态基准无法准确反映真实软件开发流程中大型语言模型(Large Language Models, LLMs)性能的问题。解决方案的关键在于提出SwingArena框架,该框架通过将LLMs作为提交者(生成补丁)和评审者(创建测试用例并验证补丁)进行配对,模拟软件迭代的协作过程,并结合持续集成(CI)流水线进行验证。此外,引入了检索增强代码生成(Retrieval-Augmented Code Generation, RACG)模块,以高效处理长上下文挑战,提供语法和语义相关的代码片段,支持多种编程语言,从而实现框架在多样化任务和上下文中的可扩展性。
链接: https://arxiv.org/abs/2505.23932
作者: Wendong Xu,Jing Xiong,Chenyang Zhao,Qiujiang Chen,Haoran Wang,Hui Shen,Zhongwei Wan,Jianbo Dai,Taiqiang Wu,He Xiao,Chaofan Tao,Z. Morley Mao,Ying Sheng,Zhijiang Guo,Hongxia Yang,Bei Yu,Lingpeng Kong,Quanquan Gu,Ngai Wong
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows. Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines. To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large codebases, supporting multiple programming languages (C++, Python, Rust, and Go). This enables the framework to scale across diverse tasks and contexts while respecting token limitations. Our experiments, using over 400 high-quality real-world GitHub issues selected from a pool of 2,300 issues, show that models like GPT-4o excel at aggressive patch generation, whereas DeepSeek and Gemini prioritize correctness in CI validation. SwingArena presents a scalable and extensible methodology for evaluating LLMs in realistic, CI-driven software development settings. More details are available on our project page: this http URL
zh
[NLP-144] Scaling up the think-aloud method
【速读】: 该论文试图解决传统think-aloud方法在当代认知科学中因人工转录和标注耗时而难以进行大规模分析的问题(think-aloud method)。其解决方案的关键在于利用自然语言处理工具实现对推理过程口头报告的自动化转录与标注,从而支持大规模数据的分析。通过这一方法,研究者能够构建搜索图并对人类推理轨迹的一致性与变异性进行分析,验证了大规模think-aloud数据的价值及自动化分析的可行性。
链接: https://arxiv.org/abs/2505.23931
作者: Daniel Wurgaft,Ben Prystawski,Kanishk Gandhi,Cedegao E. Zhang,Joshua B. Tenenbaum,Noah D. Goodman
机构: Stanford University (斯坦福大学); MIT (麻省理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures. Daniel Wurgaft and Ben Prystawski contributed equally
Abstract:The think-aloud method, where participants voice their thoughts as they solve a task, is a valuable source of rich data about human reasoning processes. Yet, it has declined in popularity in contemporary cognitive science, largely because labor-intensive transcription and annotation preclude large sample sizes. Here, we develop methods to automate the transcription and annotation of verbal reports of reasoning using natural language processing tools, allowing for large-scale analysis of think-aloud data. In our study, 640 participants thought aloud while playing the Game of 24, a mathematical reasoning task. We automatically transcribed the recordings and coded the transcripts as search graphs, finding moderate inter-rater reliability with humans. We analyze these graphs and characterize consistency and variation in human reasoning traces. Our work demonstrates the value of think-aloud data at scale and serves as a proof of concept for the automated analysis of verbal reports.
zh
[NLP-145] ChARM: Character-based Act-adaptive Reward Modeling for Advanced Role-Playing Language Agents
【速读】: 该论文旨在解决角色扮演语言代理(Role-Playing Language Agents, RPLAs)在模拟角色以实现真实且引人入胜的人机交互过程中,传统奖励模型在可扩展性和适应主观对话偏好方面的不足。其解决方案的关键在于提出ChARM,一种基于角色的行动自适应奖励模型,通过两个创新点加以实现:一是行动自适应的边界(act-adaptive margin),显著提升了学习效率和泛化能力;二是利用大规模未标注数据的自我进化机制,以增强训练覆盖范围。
链接: https://arxiv.org/abs/2505.23923
作者: Feiteng Fang,Ting-En Lin,Yuchuan Wu,Xiong Liu,Xiang Huang,Dingwei Chen,Jing Ye,Haonan Zhang,Liang Zhu,Hamid Alinejad-Rokny,Min Yang,Fei Huang,Yongbin Li
机构: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; University of New South Wales; Tongji University; Tongyi Laboratory
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Role-Playing Language Agents (RPLAs) aim to simulate characters for realistic and engaging human-computer interactions. However, traditional reward models often struggle with scalability and adapting to subjective conversational preferences. We propose ChARM, a Character-based Act-adaptive Reward Model, addressing these challenges through two innovations: (1) an act-adaptive margin that significantly enhances learning efficiency and generalizability, and (2) a self-evolution mechanism leveraging large-scale unlabeled data to improve training coverage. Additionally, we introduce RoleplayPref, the first large-scale preference dataset specifically for RPLAs, featuring 1,108 characters, 13 subcategories, and 16,888 bilingual dialogues, alongside RoleplayEval, a dedicated evaluation benchmark. Experimental results show a 13% improvement over the conventional Bradley-Terry model in preference rankings. Furthermore, applying ChARM-generated rewards to preference learning techniques (e.g., direct preference optimization) achieves state-of-the-art results on CharacterEval and RoleplayEval. Code and dataset are available at this https URL.
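“行动自适应边界(act-adaptive margin)”可以落在偏好学习最常见的 Bradley-Terry 损失上:对不同对话行为(act)类型使用不同的边界值。以下为示意实现,margin 的具体来源(例如按 act 类别查表)为假设:

```python
import torch
import torch.nn.functional as F

def bt_loss_with_margin(r_chosen, r_rejected, margin):
    """带自适应边界的 Bradley-Terry 偏好损失(示意)。
    margin 随 act 类型自适应取值是 ChARM 的核心思想,
    此处 margin 的取值来源为假设(如按对话行为类别查表)。"""
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

r_w = torch.tensor([1.2, 0.4, 0.9])      # 被偏好回复的奖励分数
r_l = torch.tensor([0.3, 0.5, 0.1])      # 被拒绝回复的奖励分数
margin = torch.tensor([0.2, 0.8, 0.2])   # 不同 act 类型对应的边界
print(bt_loss_with_margin(r_w, r_l, margin))
```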
zh
[NLP-146] ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
【速读】: 该论文试图解决长视频理解中模型难以有效捕捉多层次时间信息的问题,即从片段(秒级)、镜头(十秒级)、事件(分钟级)到故事(小时级)的层次化时间结构。现有基准测试要么忽略了多尺度设计,要么将不同尺度的问题分散在不同视频中,导致无法在同一内容上直接比较模型在不同时间尺度上的性能。解决方案的关键在于提出ScaleLong,这是首个通过在同一流媒体内容中嵌入针对四个层次时间尺度的问题来解耦这些因素的基准,从而实现跨时间尺度的直接性能比较。
链接: https://arxiv.org/abs/2505.23922
作者: David Ma,Huaqing Yuan,Xingjian Wang,Qianbo Zang,Tianci Liu,Xinyang He,Yanbin Wei,Jiawei Guo,Ni Jiahui,Zhenzhu Yang,Meng Cao,Shanghaoran Quan,Yizhi Li,Wangchunshu Zhou,Jiaheng Liu,Wenhao Huang,Ge Zhang,Shiwen Ni,Xiaojie Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Although long-video understanding demands that models capture hierarchical temporal information – from clip (seconds) and shot (tens of seconds) to event (minutes) and story (hours) – existing benchmarks either neglect this multi-scale design or scatter scale-specific questions across different videos, preventing direct comparison of model performance across timescales on the same content. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions targeting four hierarchical timescales – clip (seconds), shot (tens of seconds), event (minutes), and story (hours) – all within the same video content. This within-content multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong features 269 long videos (avg. 86 min) from 5 main categories and 36 sub-categories, with 4–8 carefully designed questions, including at least one question for each timescale. Evaluating 23 MLLMs reveals a U-shaped performance curve, with higher accuracy at the shortest and longest timescales and a dip at intermediate levels. Furthermore, ablation studies show that increased visual token capacity consistently enhances reasoning across all timescales. ScaleLong offers a fine-grained, multi-timescale benchmark for advancing MLLM capabilities in long-video understanding. The code and dataset are available at this https URL.
zh
[NLP-147] Probing Association Biases in LLM Moderation Over-Sensitivity
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在内容审核中过度敏感的问题,即误将无害评论标记为有毒内容。其解决方案的关键在于引入一种语义层面的分析方法——主题关联分析(Topic Association Analysis),通过评估LLMs对特定主题与毒性之间的隐含关联程度,揭示模型在处理内容时存在的系统性主题偏见,从而超越传统的基于关键词的过滤方式,深入理解模型过度敏感的内在机制。
链接: https://arxiv.org/abs/2505.23914
作者: Yuxin Wang,Botao Yu,Ivory Yang,Saeed Hassanpour,Soroush Vosoughi
机构: Dartmouth College (达特茅斯学院); The Ohio State University (俄亥俄州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Large Language Models are widely used for content moderation but often misclassify benign comments as toxic, leading to over-sensitivity. While previous research attributes this issue primarily to the presence of offensive terms, we reveal a potential cause beyond token level: LLMs exhibit systematic topic biases in their implicit associations. Inspired by cognitive psychology’s implicit association tests, we introduce Topic Association Analysis, a semantic-level approach to quantify how LLMs associate certain topics with toxicity. By prompting LLMs to generate free-form scenario imagination for misclassified benign comments and analyzing their topic amplification levels, we find that more advanced models (e.g., GPT-4 Turbo) demonstrate stronger topic stereotype despite lower overall false positive rates. These biases suggest that LLMs do not merely react to explicit, offensive language but rely on learned topic associations, shaping their moderation decisions. Our findings highlight the need for refinement beyond keyword-based filtering, providing insights into the underlying mechanisms driving LLM over-sensitivity.
zh
[NLP-148] Reinforcement Learning for Better Verbalized Confidence in Long-Form Generation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成事实性内容时存在的幻觉(Hallucination)问题,这一问题严重影响了模型的安全性和可信度。现有方法多依赖于计算成本较高的后处理自洽性检测,而基于语言化置信度(Verbalized Confidence)的方法虽更高效,但主要局限于短文本问答任务,难以推广到开放式生成场景。论文提出的解决方案关键在于提出LoVeC(Long-form Verbalized Confidence),通过强化学习(Reinforcement Learning, RL)训练模型在生成的每个语句后附加数值置信度评分,从而提供一种直接且可解释的事实性判断信号。实验表明,该方法在多个长文本问答数据集上表现出更好的校准性能和跨领域泛化能力,同时具有较高的计算效率。
链接: https://arxiv.org/abs/2505.23912
作者: Caiqi Zhang,Xiaochen Zhu,Chengzu Li,Nigel Collier,Andreas Vlachos
机构: University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Hallucination remains a major challenge for the safe and trustworthy deployment of large language models (LLMs) in factual content generation. Prior work has explored confidence estimation as an effective approach to hallucination detection, but often relies on post-hoc self-consistency methods that require computationally expensive sampling. Verbalized confidence offers a more efficient alternative, but existing approaches are largely limited to short-form question answering (QA) tasks and do not generalize well to open-ended generation. In this paper, we propose LoVeC (Long-form Verbalized Confidence), an on-the-fly verbalized confidence estimation method for long-form generation. Specifically, we use reinforcement learning (RL) to train LLMs to append numerical confidence scores to each generated statement, serving as a direct and interpretable signal of the factuality of generation. Our experiments consider both on-policy and off-policy RL methods, including DPO, ORPO, and GRPO, to enhance the model calibration. We introduce two novel evaluation settings, free-form tagging and iterative tagging, to assess different verbalized confidence estimation methods. Experiments on three long-form QA datasets show that our RL-trained models achieve better calibration and generalize robustly across domains. Also, our method is highly efficient, as it only requires adding a few tokens to the output being decoded.
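评估逐句 verbalized confidence 的“校准程度”,常用指标之一是期望校准误差(ECE):按置信度分箱,比较箱内平均置信度与实际事实正确率。下面给出一个通用的 ECE 计算示意(论文采用的具体校准指标未必与此完全相同):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """按置信度分箱计算 ECE。
    conf: 模型为每条语句输出的数值置信度;correct: 该语句的事实性标注(0/1)。"""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf >= lo) & (conf < hi) if hi < 1 else (conf >= lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())  # 置信度 vs 实际准确率
            ece += mask.mean() * gap                             # 按箱内样本占比加权
    return ece

rng = np.random.default_rng(6)
c = rng.random(500)
y = (rng.random(500) < c).astype(int)    # 构造近似校准的数据
print(expected_calibration_error(c, y))  # 数值应较小
```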
zh
[NLP-149] One Task Vector is not Enough: A Large-Scale Study for In-Context Learning
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在少样本学习(In-context Learning, ICL)中任务向量(task vectors)的表征机制与有效性问题,尤其是如何通过隐藏状态激活来编码任务信息。其解决方案的关键在于构建了一个名为QuiteAFew的新数据集,包含3,096个多样化的少样本任务,每个任务包含30个输入输出对,从而提供了更全面的分析基础。实验结果表明,任务向量在中间层表现最佳,且不同任务类型的效果差异显著,复杂任务依赖于多个子任务特定的向量,揭示了任务知识的分布式表示机制。
链接: https://arxiv.org/abs/2505.23911
作者: Pavel Tikhonov,Ivan Oseledets,Elena Tutubalina
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:In-context learning (ICL) enables Large Language Models (LLMs) to adapt to new tasks using few examples, with task vectors - specific hidden state activations - hypothesized to encode task information. Existing studies are limited by small-scale benchmarks, restricting comprehensive analysis. We introduce QuiteAFew, a novel dataset of 3,096 diverse few-shot tasks, each with 30 input-output pairs derived from the Alpaca dataset. Experiments with Llama-3-8B on QuiteAFew reveal: (1) task vector performance peaks at an intermediate layer (e.g., 15th), (2) effectiveness varies significantly by task type, and (3) complex tasks rely on multiple, subtask-specific vectors rather than a single vector, suggesting distributed task knowledge representation.
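“任务向量”的提取与使用大致是:取少样本提示末 token 在某中间层的隐状态,再把它注入零样本前向过程。下面用 GPT-2 给出一个高度简化的示意(论文实验基于 Llama-3-8B 的中间层,如第 15 层;此处的层号、注入方式均为假设,且为演示 hook 机制在每个解码步都会注入,实际工作通常只对特定位置干预):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # GPT-2 共 12 层,取中间层仅作示意

demos = "hot -> cold\nbig -> small\nfast ->"
with torch.no_grad():
    hs = model(**tok(demos, return_tensors="pt"),
               output_hidden_states=True).hidden_states
task_vec = hs[LAYER][0, -1]   # 少样本提示末 token 在中间层的隐状态,即“任务向量”

def inject(module, inputs, output):
    h = output[0]
    h[:, -1] += task_vec      # 将任务向量加到零样本输入的末 token 上(简化做法)
    return (h,) + output[1:]

handle = model.transformer.h[LAYER - 1].register_forward_hook(inject)
with torch.no_grad():
    out = model.generate(**tok("slow ->", return_tensors="pt"), max_new_tokens=3)
handle.remove()
print(tok.decode(out[0]))
```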
zh
[NLP-150] OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation
【速读】: 该论文旨在解决基于大型语言模型(Large Language Model, LLM)的多智能体系统在跨领域迁移时面临的挑战,即现有方法在应用于新领域时需要进行完整的架构重构和全部组件的重新训练。解决方案的关键在于提出Workforce框架,该框架通过模块化架构将战略规划与专业执行解耦,包含领域无关的规划器、子任务管理者以及具有领域特定工具调用能力的工作者,从而实现推理和训练阶段的跨领域可迁移性。
链接: https://arxiv.org/abs/2505.23885
作者: Mengkang Hu,Yuhang Zhou,Wendong Fan,Yuzhou Nie,Bowei Xia,Tao Sun,Ziyu Ye,Zhaoxuan Jin,Yingru Li,Qiguang Chen,Zeyu Zhang,Yifeng Wang,Qianshuo Ye,Bernard Ghanem,Ping Luo,Guohao Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project Page: this https URL
Abstract:Large Language Model (LLM)-based multi-agent systems show promise for automating real-world tasks but struggle to transfer across domains due to their domain-specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi-agent framework that decouples strategic planning from specialized execution through a modular architecture comprising: (i) a domain-agnostic Planner for task decomposition, (ii) a Coordinator for subtask management, and (iii) specialized Workers with domain-specific tool-calling capabilities. This decoupling enables cross-domain transferability during both inference and training phases: During inference, Workforce seamlessly adapts to new domains by adding or modifying worker agents; For training, we introduce Optimized Workforce Learning (OWL), which improves generalization across domains by optimizing a domain-agnostic planner with reinforcement learning from real-world feedback. To validate our approach, we evaluate Workforce on the GAIA benchmark, covering various realistic, multi-domain agentic tasks. Experimental results demonstrate Workforce achieves open-source state-of-the-art performance (69.70%), outperforming commercial systems like OpenAI’s Deep Research by 2.34%. More notably, our OWL-trained 32B model achieves 52.73% accuracy (+16.37%) and demonstrates performance comparable to GPT-4o on challenging tasks. To summarize, by enabling scalable generalization and modular domain transfer, our work establishes a foundation for the next generation of general-purpose AI assistants.
zh
[NLP-151] st-Time Training Done Right
【速读】: 该论文试图解决Test-Time Training (TTT)在处理长上下文数据时效率低下的问题,现有方法由于在现代GPU上的FLOPs利用率低(通常仅为5%)以及采用小规模在线小批量更新(如每16或64个标记更新一次快速权重),导致其难以扩展到非一维有序序列的数据结构。解决方案的关键在于采用极大的块更新策略,称为Large Chunk Test-Time Training (LaCT),通过使用从2K到1M个标记的极大块更新,显著提升硬件利用率,并支持非线性状态大小的扩展(高达模型参数的40%),从而大幅提高状态容量,同时避免了复杂且易出错的内核实现。
链接: https://arxiv.org/abs/2505.23884
作者: Tianyuan Zhang,Sai Bi,Yicong Hong,Kai Zhang,Fujun Luan,Songlin Yang,Kalyan Sunkavalli,William T. Freeman,Hao Tan
机构: Massachusetts Institute of Technology (麻省理工学院); Adobe Research (Adobe 研究院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 11 figures
Abstract:Test-Time Training (TTT) models context dependencies by adapting part of the model’s weights (referred to as fast weights) during inference. This fast weight, akin to recurrent states in RNNs, stores temporary memories of past tokens in the current sequence. Existing TTT methods struggled to show effectiveness in handling long-context data, due to their inefficiency on modern GPUs. The TTT layers in many of these approaches operate with extremely low FLOPs utilization (often 5%) because they deliberately apply small online minibatch sizes (e.g., updating fast weights every 16 or 64 tokens). Moreover, a small minibatch implies fine-grained block-wise causal dependencies in the data, unsuitable for data beyond 1D ordered sequences, like sets or N-dimensional grids such as images or videos. In contrast, we pursue the opposite direction by using an extremely large chunk update, ranging from 2K to 1M tokens across tasks of varying modalities, which we refer to as Large Chunk Test-Time Training (LaCT). It improves hardware utilization by orders of magnitude, and more importantly, facilitates scaling of nonlinear state size (up to 40% of model parameters), hence substantially improving state capacity, all without requiring cumbersome and error-prone kernel implementations. It also allows easy integration of sophisticated optimizers, e.g. Muon for online updates. We validate our approach across diverse modalities and tasks, including novel view synthesis with image set, language models, and auto-regressive video diffusion. Our approach can scale up to 14B-parameter AR video diffusion model on sequences up to 56K tokens. In our longest sequence experiment, we perform novel view synthesis with 1 million context length. We hope this work will inspire and accelerate new research in the field of long-context modeling and test-time training. Website: this https URL
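“大块测试时训练”的要点是:不再每 16/64 个 token 更新一次快速权重,而是整块(数千到百万 token)并行计算梯度、一次更新。以下以一个线性快速权重为例做极简示意(快速权重的具体形式、损失与优化器均为假设,论文实际可采用 Muon 等在线优化器):

```python
import torch

def lact_update(fast_W, keys, values, lr=0.1, chunk=4096):
    """大块测试时训练(LaCT)的极简示意:每 chunk 个 token 才更新一次快速权重,
    块内梯度一次性并行计算,从而提高硬件利用率。"""
    for s in range(0, keys.shape[0], chunk):
        k, v = keys[s:s + chunk], values[s:s + chunk]
        pred = k @ fast_W                     # 用当前快速权重读出
        grad = k.T @ (pred - v) / k.shape[0]  # 整块一次性计算梯度
        fast_W = fast_W - lr * grad           # 每块仅做一次在线更新
    return fast_W

W = torch.zeros(64, 64)
K, V = torch.randn(16384, 64), torch.randn(16384, 64)
print(lact_update(W, K, V).norm())
```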
zh
[NLP-152] BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning
【速读】: 该论文试图解决如何在生物视觉模型中实现超越初始训练目标的新兴行为问题,具体表现为通过大规模对比视觉-语言训练获得具有生物学意义的嵌入空间。解决方案的关键在于构建并利用TreeOfLife-200M数据集,这是目前最大且最多样化的生物有机体图像数据集,并基于此训练BioCLIP 2模型,使其能够区分不同物种。尽管训练目标较为狭窄,但BioCLIP 2在多种生物视觉任务中表现出色,其嵌入空间展现出跨物种和同物种层面的显著结构特性,这些特性通过层级监督和对比目标得以增强。
链接: https://arxiv.org/abs/2505.23883
作者: Jianyang Gu,Samuel Stevens,Elizabeth G Campolongo,Matthew J Thompson,Net Zhang,Jiaman Wu,Andrei Kopanev,Zheda Mai,Alexander E. White,James Balhoff,Wasila Dahdul,Daniel Rubenstein,Hilmar Lapp,Tanya Berger-Wolf,Wei-Lun Chao,Yu Su
机构: The Ohio State University (俄亥俄州立大学); Smithsonian Institution (史密森学会); UNC Chapel Hill (北卡罗来纳大学教堂山分校); University of California, Irvine (加州大学欧文分校); Princeton University (普林斯顿大学); Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Project page: this https URL
Abstract:Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.
zh
[NLP-153] Using Reasoning Models to Generate Search Heuristics that Solve Open Instances of Combinatorial Design Problems
【速读】: 该论文试图解决组合设计领域中尚未确定存在性的开放实例问题,特别是通过生成式AI(Generative AI)结合推理能力的大型语言模型(LLMs)来生成搜索启发式方法,以构造小规模开放实例的解。解决方案的关键在于构造协议CPro1,该协议通过引导LLMs选择和实现策略,结合自动化超参数调优和执行反馈,从而有效提升求解效率与准确性。使用具备推理能力的LLMs,CPro1成功解决了16个组合设计问题中的7个长期未解问题,并在3种设计类型中获得了新的解。
链接: https://arxiv.org/abs/2505.23881
作者: Christopher D. Rosin
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Combinatorics (math.CO)
备注: arXiv admin note: text overlap with arXiv:2501.17725
Abstract:Large Language Models (LLMs) with reasoning are trained to iteratively generate and refine their answers before finalizing them, which can help with applications to mathematics and code generation. We apply code generation with reasoning LLMs to a specific task in the mathematical field of combinatorial design. This field studies diverse types of combinatorial designs, many of which have lists of open instances for which existence has not yet been determined. The Constructive Protocol CPro1 uses LLMs to generate search heuristics that have the potential to construct solutions to small open instances. Starting with a textual definition and a validity verifier for a particular type of design, CPro1 guides LLMs to select and implement strategies, while providing automated hyperparameter tuning and execution feedback. CPro1 with reasoning LLMs successfully solves long-standing open instances for 7 of 16 combinatorial design problems selected from the 2006 Handbook of Combinatorial Designs, including new solved instances for 3 of these (Bhaskar Rao Designs, Symmetric Weighing Matrices, Balanced Ternary Designs) that were unsolved by CPro1 with non-reasoning LLMs. It also solves open instances for several problems from recent (2025) literature, generating new Covering Sequences, Johnson Clique Covers, Deletion Codes, and a Uniform Nested Steiner Quadruple System.
zh
[NLP-154] Infi-Med: Low-Resource Medical MLLMs with Robust Reasoning Evaluation
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在医疗领域实际部署中面临的资源效率、诊断准确性、临床考量及伦理隐私等关键问题。其解决方案的核心在于提出Infi-Med框架,该框架包含三项关键创新:一是通过构建高质量的监督微调数据集,在最小样本需求下实现资源高效的方法,并兼顾预训练与后训练阶段;二是增强跨模态整合与临床任务理解的多模态推理能力;三是建立系统化的评估体系,以全面衡量模型在不同医学模态和任务类型中的性能。
链接: https://arxiv.org/abs/2505.23867
作者: Zeyu Liu,Zhitian Hou,Yining Di,Kejing Yang,Zhijie Sang,Congkai Xie,Jingwen Yang,Siyuan Liu,Jialu Wang,Chunming Li,Ming Li,Hongxia Yang
机构: The Hong Kong Polytechnic University(香港理工大学); Sun Yat-sen University(中山大学); Reallm Labs(瑞尔实验室); The Hong Kong University of Science and Technology(香港科技大学); Tongji University(同济大学); Shanghai Jiao Tong University(上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models (MLLMs) have demonstrated promising prospects in healthcare, particularly for addressing complex medical tasks, supporting multidisciplinary treatment (MDT), and enabling personalized precision medicine. However, their practical deployment faces critical challenges in resource efficiency, diagnostic accuracy, clinical considerations, and ethical privacy. To address these limitations, we propose Infi-Med, a comprehensive framework for medical MLLMs that introduces three key innovations: (1) a resource-efficient approach through curating and constructing high-quality supervised fine-tuning (SFT) datasets with minimal sample requirements, with a forward-looking design that extends to both pretraining and posttraining phases; (2) enhanced multimodal reasoning capabilities for cross-modal integration and clinical task understanding; and (3) a systematic evaluation system that assesses model performance across medical modalities and task types. Our experiments demonstrate that Infi-Med achieves state-of-the-art (SOTA) performance in general medical reasoning while maintaining rapid adaptability to clinical scenarios. The framework establishes a solid foundation for deploying MLLMs in real-world healthcare settings by balancing model effectiveness with operational constraints.
zh
[NLP-155] OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在面对跨语言和跨模态的有害提示时,现有检测方法存在不足的问题。其核心挑战在于当前检测方法在处理低资源语言或非文本模态(如图像和音频)的提示时容易失效。解决方案的关键在于提出OMNIGUARD,该方法通过识别跨语言或跨模态对齐的内部表示,并利用这些表示构建与语言或模态无关的分类器,从而实现对有害提示的有效检测。
链接: https://arxiv.org/abs/2505.23856
作者: Sahil Verma,Keegan Hines,Jeff Bilmes,Charlotte Siska,Luke Zettlemoyer,Hila Gonen,Chandan Singh
机构: University of Washington (华盛顿大学); Microsoft (微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (≈120× faster than the next fastest baseline). Code and data are available at: this https URL.
zh
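【代码示意】摘要中“复用生成时已计算的内部表示来训练轻量分类器”的思路,可用如下最小草图说明(选用 gpt2、末层均值池化与玩具标签均为演示用假设,并非该论文的实际配置):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def embed(prompt, layer=-1):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    # 取某一层隐藏状态做均值池化,作为提示词的低成本表示
    return out.hidden_states[layer].mean(dim=1).squeeze(0).numpy()

X = [embed("how do I bake bread?"), embed("how do I pick a lock?")]
y = [0, 1]  # 0 = 无害,1 = 有害(玩具标签)
clf = LogisticRegression().fit(X, y)
print(clf.predict([embed("recipe for soup")]))
```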
[NLP-156] Revisiting Uncertainty Estimation and Calibration of Large Language Models
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在高风险应用场景中缺乏可靠不确定性估计的问题,从而提升其安全性和可信度。论文提出的解决方案关键在于系统评估多种不确定性估计方法,并发现语言表述不确定性(Linguistic Verbal Uncertainty, LVU)在校准性和区分性方面优于基于令牌概率的不确定性(Token Probability-based Uncertainty, TPU)和数值表述不确定性(Numerical Verbal Uncertainty, NVU)方法,同时具备更高的可解释性。
链接: https://arxiv.org/abs/2505.23854
作者: Linwei Tao,Yi-Fan Yeh,Minjing Dong,Tao Huang,Philip Torr,Chang Xu
机构: University of Sydney(悉尼大学); City University of Hong Kong(香港城市大学); Shanghai Jiao Tong University(上海交通大学); University of Oxford(牛津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:As large language models (LLMs) are increasingly deployed in high-stakes applications, robust uncertainty estimation is essential for ensuring the safe and trustworthy deployment of LLMs. We present the most comprehensive study to date of uncertainty estimation in LLMs, evaluating 80 models spanning open- and closed-source families, dense and Mixture-of-Experts (MoE) architectures, reasoning and non-reasoning modes, quantization variants and parameter scales from 0.6B to 671B. Focusing on three representative black-box single-pass methods, including token probability-based uncertainty (TPU), numerical verbal uncertainty (NVU), and linguistic verbal uncertainty (LVU), we systematically evaluate uncertainty calibration and selective classification using the challenging MMLU-Pro benchmark, which covers both reasoning-intensive and knowledge-based tasks. Our results show that LVU consistently outperforms TPU and NVU, offering stronger calibration and discrimination while being more interpretable. We also find that high accuracy does not imply reliable uncertainty, and that model scale, post-training, reasoning ability and quantization all influence estimation performance. Notably, LLMs exhibit better uncertainty estimates on reasoning tasks than on knowledge-heavy ones, and good calibration does not necessarily translate to effective error ranking. These findings highlight the need for multi-perspective evaluation and position LVU as a practical tool for improving the reliability of LLMs in real-world settings.
zh
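【代码示意】摘要中比较各方法“校准性”时常用期望校准误差(Expected Calibration Error, ECE)这类指标。下面是常见 10 桶 ECE 的最小实现,仅用于说明该指标的计算方式,与论文的实际评测代码无关:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # 桶内样本占比 × |桶内准确率 - 桶内平均置信度|
            total += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return total

print(ece([0.9, 0.8, 0.6, 0.55], [1, 1, 0, 1]))
```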
[NLP-157] Large Language Model-Based Agents for Automated Research Reproducibility: An Exploratory Study in Alzheimer’s Disease
【速读】: 该论文试图解决如何利用生成式 AI (Generative AI) 作为自主代理来自动重现已发表研究的成果,以评估其在生物医学研究中的可重复性。解决方案的关键在于构建一个基于大型语言模型(LLMs)的模拟研究团队,通过仅提供研究摘要、方法部分和数据字典描述,让这些代理自主编写并执行代码以动态重现研究结果。
链接: https://arxiv.org/abs/2505.23852
作者: Nic Dobbins,Christelle Xiong,Kristine Lan,Meliha Yetisgen
机构: Johns Hopkins University (约翰霍普金斯大学); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Applications (stat.AP)
备注:
Abstract:Objective: To demonstrate the capabilities of Large Language Models (LLMs) as autonomous agents to reproduce findings of published research studies using the same or similar dataset. Materials and Methods: We used the “Quick Access” dataset of the National Alzheimer’s Coordinating Center (NACC). We identified highly cited published research manuscripts using NACC data and selected five studies that appeared reproducible using this dataset alone. Using GPT-4o, we created a simulated research team of LLM-based autonomous agents tasked with writing and executing code to dynamically reproduce the findings of each study, given only study Abstracts, Methods sections, and data dictionary descriptions of the dataset. Results: We extracted 35 key findings described in the Abstracts across 5 Alzheimer’s studies. On average, LLM agents approximately reproduced 53.2% of findings per study. Numeric values and range-based findings often differed between studies and agents. The agents also applied statistical methods or parameters that varied from the originals, though overall trends and significance were sometimes similar. Discussion: In some cases, LLM-based agents replicated research techniques and findings. In others, they failed due to implementation flaws or missing methodological detail. These discrepancies show the current limits of LLMs in fully automating reproducibility assessments. Still, this early investigation highlights the potential of structured agent-based systems to provide scalable evaluation of scientific rigor. Conclusion: This exploratory work illustrates both the promise and limitations of LLMs as autonomous agents for automating reproducibility in biomedical research.
zh
[NLP-158] ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark
【速读】: 该论文试图解决现有基准在评估大型语言模型(Large Language Models, LLMs)在符号数学领域核心技能(如积分、微分方程和代数化简)方面的不足。其解决方案的关键在于提出ASyMOB,一个专注于符号操作的新型评估框架,包含17,092个独特的数学挑战,并按相似性和复杂性进行组织,从而能够分析LLMs在面对简单数值或符号“扰动”时的泛化能力。
链接: https://arxiv.org/abs/2505.23851
作者: Michael Shalyt,Rotem Elimelech,Ido Kaminer
机构: Technion - Israel Institute of Technology (以色列理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注: Code repository: this https URL Complete benchmark dataset: this https URL
Abstract:Large language models (LLMs) are rapidly approaching the level of proficiency in university-level symbolic mathematics required for applications in advanced science and technology. However, existing benchmarks fall short in assessing the core skills of LLMs in symbolic mathematics-such as integration, differential equations, and algebraic simplification. To address this gap, we introduce ASyMOB, a novel assessment framework focused exclusively on symbolic manipulation, featuring 17,092 unique math challenges, organized by similarity and complexity. ASyMOB enables analysis of LLM generalization capabilities by comparing performance in problems that differ by simple numerical or symbolic `perturbations’. Evaluated LLMs exhibit substantial degradation in performance for all perturbation types (up to -70.3%), suggesting reliance on memorized patterns rather than deeper understanding of symbolic math, even among models achieving high baseline accuracy. Comparing LLM performance to computer algebra systems, we identify examples where they fail while LLMs succeed, as well as problems solved only by combining both approaches. Models capable of integrated code execution yielded higher accuracy compared to their performance without code, particularly stabilizing weaker models (up to +33.1% for certain perturbation types). Notably, the most advanced models (o4-mini, Gemini 2.5 Flash) demonstrate not only high symbolic math proficiency (scoring 96.8% and 97.6% on the unperturbed set), but also remarkable robustness against perturbations, (-21.7% and -21.2% vs. average -50.4% for the other models). This may indicate a recent “phase transition” in the generalization capabilities of frontier LLMs. It remains to be seen whether the path forward lies in deeper integration with sophisticated external tools, or in developing models so capable that symbolic math systems like CAS become unnecessary.
zh
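【代码示意】摘要所说的“数值/符号扰动”可以用 SymPy 直观演示:把同一道积分题中的常数换成新的数值或符号,考察模型是否依赖记忆的表面形式。以下为玩具示例(扰动方式是为说明所做的简化假设):

```python
import sympy as sp

x, a = sp.symbols("x a")
base = sp.integrate(sp.sin(2 * x), x)           # 原始实例
numeric_pert = sp.integrate(sp.sin(7 * x), x)   # 数值扰动:换一个常数
symbolic_pert = sp.integrate(sp.sin(a * x), x)  # 符号扰动:常数换成符号 a
print(base, "|", numeric_pert, "|", symbolic_pert)
```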
[NLP-159] Derailing Non-Answers via Logit Suppression at Output Subspace Boundaries in RLHF-Aligned Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对敏感内容时的拒绝响应率过高的问题,而无需修改模型权重或提示词。解决方案的关键在于通过在生成过程中对特定标记序列进行简单的格式调整,即抑制“think”标记后的换行符(\n\n)以及抑制思维链(Chain-of-Thought, CoT)块结束后的结束序列标记(/think),从而影响生成过程中的token概率,避免模型进入拒绝响应的子空间。该方法无需数据集、参数调整或训练,仅依赖于生成过程中的token概率修改。
链接: https://arxiv.org/abs/2505.23848
作者: Harvey Dam,Jonas Knochelmann,Vinu Joseph,Ganesh Gopalakrishnan
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:We introduce a method to reduce refusal rates of large language models (LLMs) on sensitive content without modifying model weights or prompts. Motivated by the observation that refusals in certain models were often preceded by the specific token sequence of a token marking the beginning of the chain-of-thought (CoT) block (think) followed by a double newline token (\n\n), we investigate the impact of two simple formatting adjustments during generation: suppressing \n\n after think and suppressing the end-of-sequence token after the end of the CoT block (/think). Our method requires no datasets, parameter changes, or training, relying solely on modifying token probabilities during generation. In our experiments with official DeepSeek-R1 distillations, these interventions increased the proportion of substantive answers to sensitive prompts without affecting performance on standard benchmarks. Our findings suggest that refusal behaviors can be circumvented by blocking refusal subspaces at specific points in the generation process.
zh
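【代码示意】这类“在生成过程中屏蔽特定 token”的干预,可借助 HuggingFace 的 LogitsProcessor 钩子实现。下面的草图只演示机制本身:当上一个生成的 token 是触发 token 时,把被屏蔽 token 的得分置为负无穷(触发/屏蔽 token 的具体 id 需按所用分词器确定,属于示例假设):

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class SuppressAfterTrigger(LogitsProcessor):
    def __init__(self, trigger_id: int, suppress_id: int):
        self.trigger_id = trigger_id
        self.suppress_id = suppress_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        # 对最后一个 token 恰为触发 token 的样本,禁止生成被屏蔽的 token
        hit = input_ids[:, -1] == self.trigger_id
        scores[hit, self.suppress_id] = float("-inf")
        return scores

# 用法示意:model.generate(..., logits_processor=LogitsProcessorList(
#     [SuppressAfterTrigger(think_id, double_newline_id)]))
```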
[NLP-160] Scalable Symbiotic AI and Non-AI Agent Based Parallel Discrete Event Simulations
【速读】: 该论文试图解决如何在可信的前提下充分利用人工智能(Artificial Intelligence, AI)系统的潜力,通过将多个AI与非AI系统无缝耦合,以约束和确保输出的正确性。其解决方案的关键在于提出一种基于并行离散事件仿真(Parallel Discrete Event Simulation, PDES)的方法,该方法以因果、规则为基础,将多个AI与非AI代理(agent)整合到一个协同环境中,每个代理作为PDES框架中的实体,响应其他代理的先前请求,从而实现任务的并行处理与共同目标的达成。此外,通过非AI代理施加动态约束,对AI代理的输出设置边界,同时支持通过部署数百个代理在大型计算集群中实现可扩展性。
链接: https://arxiv.org/abs/2505.23846
作者: Atanu Barai,Stephan Eidenbenz,Nandakishore Santhi
机构: Los Alamos National Laboratory (洛斯阿拉莫斯国家实验室)
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
Abstract:To fully leverage the potential of artificial intelligence (AI) systems in a trustworthy manner, it is desirable to couple multiple AI and non-AI systems together seamlessly for constraining and ensuring correctness of the output. This paper introduces a novel parallel discrete event simulation (PDES) based methodology to combine multiple AI and non-AI agents in a causal, rule-based way. Our approach tightly integrates the concept of passage of time, with each agent considered as an entity in the PDES framework and responding to prior requests from other agents. Such a coupling mechanism enables the agents to work in a co-operative environment towards a common goal while many tasks run in parallel throughout the simulation. It further enables setting up boundaries to the outputs of the AI agents by applying necessary dynamic constraints using non-AI agents while allowing for scalability through deployment of hundreds of such agents in a larger compute cluster. Distributing smaller AI agents can enable extremely scalable simulations in the future, addressing local memory bottlenecks for model parameter storage. Within a PDES involving both AI and non-AI agents, we break down the problem at hand into structured steps, when necessary, providing a set of multiple choices to the AI agents, and then progressively solve these steps towards a final goal. At each step, the non-AI agents act as unbiased auditors, verifying each action by the AI agents so that certain rules of engagement are followed. We evaluate our approach by solving four problems from four different domains and comparing the results with those from AI models alone. Our results show greater accuracy in solving problems from various domains where the AI models struggle to solve the problems solely by themselves. Results show that the overall accuracy of our approach is 68%, whereas the accuracy of vanilla models is less than 23%.
zh
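【代码示意】摘要中的 PDES 协同可以缩成一个最小事件循环:带时间戳的请求进入优先队列,AI 代理先提出动作,非 AI 的“审计”代理随后校验。以下为示意性玩具实现(两个代理的行为均为假设):

```python
import heapq

events = []  # (时间, 代理, 负载) 的优先队列
heapq.heappush(events, (0.0, "ai_agent", "propose answer"))

def handle(now, agent, payload):
    if agent == "ai_agent":
        # AI 代理产出动作后,调度非 AI 审计代理在稍后时刻校验
        heapq.heappush(events, (now + 1.0, "auditor", "verify: " + payload))
    elif agent == "auditor":
        print(f"t={now}: 审计通过 '{payload}'")

while events:
    t, agent, payload = heapq.heappop(events)
    handle(t, agent, payload)
```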
[NLP-161] Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs
【速读】: 该论文试图解决深度学习模型在问答任务中自报告置信度不可靠的问题(self-reported confidence),其核心问题是模型在默认的答案-置信度设置下表现出过度自信,而通过大量采样得到的语义熵(semantic entropy)则保持了较高的可靠性。解决方案的关键在于通过增加测试时的计算量,使模型能够探索其预测分布,具体表现为强制模型在最终答案前生成长链式思维(chain-of-thought),从而显著提升其置信度评估的有效性。此外,研究还表明,仅依赖链式思维的独立阅读模型可以重建相似的置信度,暗示自报告置信度可能只是推理过程中浮现的替代选项的统计特征。因此,可靠的风险估计需要对生成空间进行显式探索,而自报告置信度仅在完成此类探索后才具有可信度。
链接: https://arxiv.org/abs/2505.23845
作者: Jakub Podolak,Rajeev Verma
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:We study the source of uncertainty in DeepSeek R1-32B by analyzing its self-reported verbal confidence on question answering (QA) tasks. In the default answer-then-confidence setting, the model is regularly over-confident, whereas semantic entropy - obtained by sampling many responses - remains reliable. We hypothesize that this is because of semantic entropy’s larger test-time compute, which lets us explore the model’s predictive distribution. We show that granting DeepSeek the budget to explore its distribution by forcing a long chain-of-thought before the final answer greatly improves its verbal score effectiveness, even on simple fact-retrieval questions that normally require no reasoning. Furthermore, a separate reader model that sees only the chain can reconstruct very similar confidences, indicating the verbal score might be merely a statistic of the alternatives surfaced during reasoning. Our analysis concludes that reliable uncertainty estimation requires explicit exploration of the generative space, and self-reported confidence is trustworthy only after such exploration.
zh
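【代码示意】语义熵的骨架很简单:对同一问题采样多个回答,把语义等价的回答归为一簇,再对簇概率求熵。下面的玩具版本用字符串归一化代替文献中常用的蕴含模型判等(这是为简洁所做的假设):

```python
import math
from collections import Counter

def semantic_entropy(sampled_answers):
    clusters = Counter(a.strip().lower() for a in sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

print(semantic_entropy(["Paris", "paris", "Paris", "Lyon"]))  # 熵低 → 模型较确定
```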
[NLP-162] Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在持续改进过程中面临的挑战,特别是在整合其他专业LLMs能力时的困难。传统微调方法、集成和权重合并等策略存在内存消耗大、难以适应变化的数据环境等问题,而现有知识迁移方法则因候选模型选择和训练流程灵活性不足导致任务间干扰和性能下降。论文提出的解决方案关键在于设计一个自适应选择网络,根据得分识别最相关的源LLMs以减少知识干扰,并引入动态加权融合策略与反馈驱动的损失函数,从而实现更稳定、可扩展的知识聚合过程。
链接: https://arxiv.org/abs/2505.23844
作者: Zhenglun Kong,Zheng Zhan,Shiyue Hou,Yifan Gong,Xin Meng,Pengwei Sui,Peiyan Dong,Xuan Shen,Zifeng Wang,Pu Zhao,Hao Tang,Stratis Ioannidis,Yanzhi Wang
机构: Northeastern University (东北大学); Peking University (北京大学); Harvard University (哈佛大学); Google (谷歌)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have shown remarkable promise but remain challenging to continually improve through traditional finetuning, particularly when integrating capabilities from other specialized LLMs. Popular methods like ensemble and weight merging require substantial memory and struggle to adapt to changing data environments. Recent efforts have transferred knowledge from multiple LLMs into a single target model; however, they suffer from interference and degraded performance among tasks, largely due to limited flexibility in candidate selection and training pipelines. To address these issues, we propose a framework that adaptively selects and aggregates knowledge from diverse LLMs to build a single, stronger model, avoiding the high memory overhead of ensemble and inflexible weight merging. Specifically, we design an adaptive selection network that identifies the most relevant source LLMs based on their scores, thereby reducing knowledge interference. We further propose a dynamic weighted fusion strategy that accounts for the inherent strengths of candidate LLMs, along with a feedback-driven loss function that prevents the selector from converging on a single subset of sources. Experimental results demonstrate that our method can enable a more stable and scalable knowledge aggregation process while reducing knowledge interference by up to 50% compared to existing approaches. Code is available at this https URL
zh
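【代码示意】“自适应选择网络 + 动态加权融合”的最小 PyTorch 草图如下:先给各候选源模型打分,只保留 top-k,再按 softmax 权重融合其输出。维度与 top-k 规则均为示例假设,反馈驱动损失从略:

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, hidden_dim, n_sources, k=2):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, n_sources)  # 为每个源 LLM 打分
        self.k = k

    def forward(self, query_repr, source_outputs):
        # query_repr: (batch, hidden);source_outputs: (batch, n_sources, hidden)
        scores = self.scorer(query_repr)
        topk = scores.topk(self.k, dim=-1)
        weights = torch.softmax(topk.values, dim=-1)  # 只融合得分最高的 k 个源
        chosen = torch.gather(
            source_outputs, 1,
            topk.indices.unsqueeze(-1).expand(-1, -1, source_outputs.size(-1)))
        return (weights.unsqueeze(-1) * chosen).sum(dim=1)

fused = AdaptiveFusion(64, 4)(torch.randn(8, 64), torch.randn(8, 4, 64))
```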
[NLP-163] Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks
【速读】: 该论文试图解决现有评估方法在多轮不完全信息任务中无法准确揭示大型语言模型(Large Language Models, LLMs)横向思维能力的问题。现有方法常导致误导性结果,无法识别如走捷径行为、僵化模式和过早终止任务等关键问题,从而掩盖了模型的真实推理能力并影响评估的可靠性。解决方案的关键在于提出一套改进的评估标准,包括对推理路径的检查、多样化评估指标的引入以及与人类表现的对比分析。
链接: https://arxiv.org/abs/2505.23843
作者: Wenhan Dong,Tianyi Hu,Jingyi Zheng,Zhen Sun,Yuemeng Zhao,Yule Liu,Xinlei He,Xinyi Huang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Multi-round incomplete information tasks are crucial for evaluating the lateral thinking capabilities of large language models (LLMs). Currently, research primarily relies on multiple benchmarks and automated evaluation metrics to assess these abilities. However, our study reveals novel insights into the limitations of existing methods, as they often yield misleading results that fail to uncover key issues, such as shortcut-taking behaviors, rigid patterns, and premature task termination. These issues obscure the true reasoning capabilities of LLMs and undermine the reliability of evaluations. To address these limitations, we propose a refined set of evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance.
zh
[NLP-164] Document Valuation in LLM Summaries: A Cluster Shapley Approach
【速读】: 该论文试图解决在大型语言模型(Large Language Models, LLMs)生成摘要时,如何合理评估和量化单个文档对最终摘要的贡献问题,以实现对原始内容创作者的公平信用归属与潜在补偿。解决方案的关键在于引入博弈论中的Shapley值方法,该方法基于每个文档的边际贡献进行信用分配,但由于其计算复杂度高,论文进一步提出Cluster Shapley算法,通过利用文档间的语义相似性进行聚类,并在簇级别上计算Shapley值,从而显著降低计算成本,同时保持较高的归因质量。
链接: https://arxiv.org/abs/2505.23842
作者: Zikun Ye,Hema Yoganarasimhan
机构: University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); General Economics (econ.GN)
备注:
Abstract:Large Language Models (LLMs) are increasingly used in systems that retrieve and summarize content from multiple sources, such as search engines and AI assistants. While these models enhance user experience by generating coherent summaries, they obscure the contributions of original content creators, raising concerns about credit attribution and compensation. We address the challenge of valuing individual documents used in LLM-generated summaries. We propose using Shapley values, a game-theoretic method that allocates credit based on each document’s marginal contribution. Although theoretically appealing, Shapley values are expensive to compute at scale. We therefore propose Cluster Shapley, an efficient approximation algorithm that leverages semantic similarity between documents. By clustering documents using LLM-based embeddings and computing Shapley values at the cluster level, our method significantly reduces computation while maintaining attribution quality. We demonstrate our approach to a summarization task using Amazon product reviews. Cluster Shapley significantly reduces computational complexity while maintaining high accuracy, outperforming baseline methods such as Monte Carlo sampling and Kernel SHAP with a better efficient frontier. Our approach is agnostic to the exact LLM used, the summarization process used, and the evaluation procedure, which makes it broadly applicable to a variety of summarization settings.
zh
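【代码示意】在簇级别做蒙特卡洛 Shapley 估计的骨架如下:对簇的排列随机采样,累加每个簇的边际贡献。utility 由调用方提供(任意摘要质量打分函数);此处的玩具 utility 只是演示用假设:

```python
import random

def cluster_shapley(clusters, utility, n_perm=200):
    values = {c: 0.0 for c in clusters}
    for _ in range(n_perm):
        order = list(clusters)
        random.shuffle(order)
        coalition, prev = [], utility([])
        for c in order:
            coalition.append(c)
            now = utility(coalition)
            values[c] += (now - prev) / n_perm  # 该簇在此排列下的边际贡献
            prev = now
    return values

# 玩具 utility:纳入的不同簇越多,摘要“价值”越高
print(cluster_shapley(["好评簇", "差评簇"], lambda s: len(set(s))))
```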
[NLP-165] SkewRoute: Training-Free LLM Routing for Knowledge Graph Retrieval-Augmented Generation via Score Skewness of Retrieved Context
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在部署过程中因推理成本高而带来的挑战,尤其是在结合知识图谱增强的检索生成(Knowledge Graph-Retrieval Augmented Generation, KG-RAG)系统中,检索到的大量知识上下文进一步增加了推理成本。为平衡性能与成本,论文提出了一种无需训练的路由框架,其关键在于利用检索评分器产生的得分分布与查询难度之间的强相关性,从而实现对查询的智能路由,将简单查询分配给较小的LLMs,复杂查询分配给较大的LLMs,进而减少对大型LLM的调用次数达50%而不牺牲响应质量。
链接: https://arxiv.org/abs/2505.23841
作者: Hairu Wang,Yuan Feng,Yukun Cao,Xike Xie,S Kevin Zhou
机构: School of Computer Science, University of Science and Technology of China, China (中国科学技术大学计算机学院); School of Biomedical Engineering, USTC, China (中国科学技术大学生物医学工程学院); Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research, USTC, China (中国科学技术大学苏州高等研究院MIRACLE中心数据黑暗实验室)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Large language models excel at many tasks but often incur high inference costs during deployment. To mitigate hallucination, many systems use a knowledge graph to enhance retrieval-augmented generation (KG-RAG). However, the large amount of retrieved knowledge contexts increase these inference costs further. A promising solution to balance performance and cost is LLM routing, which directs simple queries to smaller LLMs and complex ones to larger LLMs. However, no dedicated routing methods currently exist for RAG, and existing training-based routers face challenges scaling to this domain due to the need for extensive training data. We observe that the score distributions produced by the retrieval scorer strongly correlate with query difficulty. Based on this, we propose a novel, training-free routing framework, the first tailored to KG-RAG that effectively balances performance and cost in a plug-and-play manner. Experiments show our method reduces calls to larger LLMs by up to 50% without sacrificing response quality, demonstrating its potential for efficient and scalable LLM deployment.
zh
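【代码示意】“按检索得分的偏度路由”的思路一行偏度即可表达:得分高度右偏说明只有少数上下文明显相关,查询大概率较简单;分布平坦则更可能困难。阈值与路由方向均为示例假设,并非论文标定值:

```python
from scipy.stats import skew

def route(retrieval_scores, threshold=1.0):
    # 偏度大 → 少数上下文得分突出 → 视为简单查询,交给小模型
    return "small_llm" if skew(retrieval_scores) > threshold else "large_llm"

print(route([0.95, 0.2, 0.1, 0.1, 0.05]))    # 一个强命中 → small_llm
print(route([0.50, 0.45, 0.5, 0.55, 0.48]))  # 分布平坦、难以区分 → large_llm
```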
[NLP-166] Measuring Sycophancy of Language Models in Multi-turn Dialogues
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在对话中表现出的谄媚行为(sycophancy),即模型倾向于无条件迎合用户观点,而忽视事实准确性或伦理合理性。其解决方案的关键在于引入SYCON Bench,这是一个用于评估多轮自由对话场景下谄媚行为的新基准,通过衡量模型对用户立场的适应速度(Turn of Flip)和在持续用户压力下立场转变的频率(Number of Flip)来量化该行为。研究还发现,对齐调优会加剧谄媚行为,而模型规模扩展和推理优化则有助于增强模型抵御不良用户观点的能力。此外,采用第三人称视角的提示策略可显著降低辩论场景中的谄媚程度。
链接: https://arxiv.org/abs/2505.23840
作者: Jiseung Hong,Grace Byun,Seungone Kim,Kai Shu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy–conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark for evaluating sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (Turn of Flip) and how frequently it shifts its stance under sustained user pressure (Number of Flip). Applying SYCON Bench to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model’s ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user’s underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in debate scenario. We release our code and data at this https URL.
zh
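【代码示意】摘要中的两个指标可以由“逐轮立场轨迹”直接算出(True 表示该轮模型顺从了用户观点;这一编码方式是为演示所做的假设):

```python
def turn_of_flip(stances):
    """模型首次顺从用户的轮次;从未顺从则返回 None。"""
    return next((i for i, s in enumerate(stances, 1) if s), None)

def number_of_flips(stances):
    """在持续施压下立场翻转的次数。"""
    return sum(a != b for a, b in zip(stances, stances[1:]))

trace = [False, False, True, False, True]
print(turn_of_flip(trace), number_of_flips(trace))  # 3 3
```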
[NLP-167] Exploring the Landscape of Text-to-SQL with Large Language Models: Progresses, Challenges and Opportunities
【速读】: 该论文旨在解决将自然语言(Natural Language, NL)问题转换为结构化查询语言(Structured Query Language, SQL)的问题,即Text-to-SQL任务,以降低用户对SQL知识的依赖,提升对关系型数据库的访问效率。其解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)的强大语言理解和生成能力,通过系统性地分析LLM在Text-to-SQL中的研究趋势、技术方法、数据集及评估指标,并探讨该领域面临的挑战与未来发展方向,从而推动Text-to-SQL系统的性能优化与应用扩展。
链接: https://arxiv.org/abs/2505.23838
作者: Yiming Huang,Jiyu Guo,Wenxin Mao,Cuiyun Gao,Peiyi Han,Chuanyi Liu,Qing Ling
机构: Harbin Institute of Technology (Shenzhen); Peng Cheng Laboratory; Sun Yat-sen University; Great Bay University
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Submitted to ACM Computing Surveys (CSUR). Currently under review
Abstract:Converting natural language (NL) questions into SQL queries, referred to as Text-to-SQL, has emerged as a pivotal technology for facilitating access to relational databases, especially for users without SQL knowledge. Recent progress in large language models (LLMs) has markedly propelled the field of natural language processing (NLP), opening new avenues to improve text-to-SQL systems. This study presents a systematic review of LLM-based text-to-SQL, focusing on four key aspects: (1) an analysis of the research trends in LLM-based text-to-SQL; (2) an in-depth analysis of existing LLM-based text-to-SQL techniques from diverse perspectives; (3) summarization of existing text-to-SQL datasets and evaluation metrics; and (4) discussion on potential obstacles and avenues for future exploration in this domain. This survey seeks to furnish researchers with an in-depth understanding of LLM-based text-to-SQL, sparking new innovations and advancements in this field.
zh
[NLP-168] CoMaPOI: A Collaborative Multi-Agent Framework for Next POI Prediction Bridging the Gap Between Trajectory and Language SIGIR2025
【速读】: 该论文旨在解决将大型语言模型(Large Language Models, LLMs)应用于下一兴趣点(Next Point-Of-Interest, Next POI)预测任务时所面临的两个关键挑战:一是LLMs缺乏对数值时空数据的内在理解,影响用户时空分布与偏好的准确建模;二是候选POI空间过大且无约束,导致预测结果随机或不相关。解决方案的关键在于提出一种协作多智能体框架CoMaPOI,通过三个专用智能体(Profiler、Forecaster和Predictor)的协同作用,分别实现数值数据到语言描述的转换、候选POI空间的动态约束与优化,以及高精度预测的生成。
链接: https://arxiv.org/abs/2505.23837
作者: Lin Zhong,Lingzhi Wang,Xu Yang,Qing Liao
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: This paper has been accepted by SIGIR 2025
Abstract:Large Language Models (LLMs) offer new opportunities for the next Point-Of-Interest (POI) prediction task, leveraging their capabilities in semantic understanding of POI trajectories. However, previous LLM-based methods, which are superficially adapted to next POI prediction, largely overlook critical challenges associated with applying LLMs to this task. Specifically, LLMs encounter two critical challenges: (1) a lack of intrinsic understanding of numeric spatiotemporal data, which hinders accurate modeling of users’ spatiotemporal distributions and preferences; and (2) an excessively large and unconstrained candidate POI space, which often results in random or irrelevant predictions. To address these issues, we propose a Collaborative Multi Agent Framework for Next POI Prediction, named CoMaPOI. Through the close interaction of three specialized agents (Profiler, Forecaster, and Predictor), CoMaPOI collaboratively addresses the two critical challenges. The Profiler agent is responsible for converting numeric data into language descriptions, enhancing semantic understanding. The Forecaster agent focuses on dynamically constraining and refining the candidate POI space. The Predictor agent integrates this information to generate high-precision predictions. Extensive experiments on three benchmark datasets (NYC, TKY, and CA) demonstrate that CoMaPOI achieves state of the art performance, improving all metrics by 5% to 10% compared to SOTA baselines. This work pioneers the investigation of challenges associated with applying LLMs to complex spatiotemporal tasks by leveraging tailored collaborative agents.
zh
[NLP-169] Large Language Models Often Know When They Are Being Evaluated
【速读】: 该论文试图解决的问题是:当AI模型能够识别自身正在被评估时,评估的有效性可能会受到损害,因为模型可能在评估过程中表现出与实际部署不同的行为,从而导致用于部署和治理决策的基准不可靠。解决方案的关键在于构建一个涵盖多种数据源的多样化基准,包括1,000个提示和来自61个不同数据集的对话记录,以测试前沿语言模型是否具备评估意识(evaluation awareness),即能否准确区分评估场景与真实部署场景。研究结果显示,尽管前沿模型表现出显著的评估意识,但尚未超越人类基线,且在代理设置中比聊天设置中更易识别评估场景。
链接: https://arxiv.org/abs/2505.23836
作者: Joe Needham,Giles Edkins,Govind Pimpale,Henning Bartsch,Marius Hobbhahn
机构: MATS; Apollo Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could have systematically different behavior during evaluations, leading to less reliable benchmarks for deployment and governance decisions. We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment, a capability we call evaluation awareness. To achieve this, we construct a diverse benchmark of 1,000 prompts and transcripts from 61 distinct datasets. These span public benchmarks (e.g., MMLU, SWEBench), real-world deployment interactions, and agent trajectories from scaffolding frameworks (e.g., web-browsing agents). Frontier models clearly demonstrate above-random evaluation awareness (Gemini-2.5-Pro reaches an AUC of 0.83 ), but do not yet surpass our simple human baseline (AUC of 0.92 ). Furthermore, both AI models and humans are better at identifying evaluations in agentic settings compared to chat settings. Additionally, we test whether models can identify the purpose of the evaluation. Under multiple-choice and open-ended questioning, AI models far outperform random chance in identifying what an evaluation is testing for. Our results indicate that frontier models already exhibit a substantial, though not yet superhuman, level of evaluation-awareness. We recommend tracking this capability in future models.
zh
[NLP-170] Say What You Mean: Natural Language Access Control with Large Language Models for Internet of Things
【速读】: 该论文旨在解决物联网(IoT)中访问控制策略的复杂性问题,特别是现有平台在处理动态和上下文相关因素时的不足,以及策略编写过程中人类意图与机器可执行逻辑之间的语义鸿沟。解决方案的关键在于提出LACE(Language-based Access Control Engine),这是一个结合了大语言模型(LLM)的混合框架,通过提示引导的策略生成、检索增强推理和形式化验证,实现表达性强、可解释且可验证的访问控制。LACE能够将自然语言策略自动转换为结构化规则,并通过混合LLM与规则引擎进行访问决策,从而提升策略生成的正确性和决策准确性。
链接: https://arxiv.org/abs/2505.23835
作者: Ye Cheng,Minghui Xu,Yue Zhang,Kun Li,Hao Wu,Yechao Zhang,Shaoyong Guo,Wangjie Qiu,Dongxiao Yu,Xiuzhen Cheng
机构: Shandong University (山东大学); Nanjing University (南京大学); Huazhong University of Science and Technology (华中科技大学); Beijing University of Posts and Telecommunications (北京邮电大学); Beihang University (北京航空航天大学); Zhongguancun Laboratory (中关村实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Access control in the Internet of Things (IoT) is becoming increasingly complex, as policies must account for dynamic and contextual factors such as time, location, user behavior, and environmental conditions. However, existing platforms either offer only coarse-grained controls or rely on rigid rule matching, making them ill-suited for semantically rich or ambiguous access scenarios. Moreover, the policy authoring process remains fragmented: domain experts describe requirements in natural language, but developers must manually translate them into code, introducing semantic gaps and potential misconfiguration. In this work, we present LACE, the Language-based Access Control Engine, a hybrid framework that leverages large language models (LLMs) to bridge the gap between human intent and machine-enforceable logic. LACE combines prompt-guided policy generation, retrieval-augmented reasoning, and formal validation to support expressive, interpretable, and verifiable access control. It enables users to specify policies in natural language, automatically translates them into structured rules, validates semantic correctness, and makes access decisions using a hybrid LLM-rule-based engine. We evaluate LACE in smart home environments through extensive experiments. LACE achieves 100% correctness in verified policy generation and up to 88% decision accuracy with 0.79 F1-score using DeepSeek-V3, outperforming baselines such as GPT-3.5 and Gemini. The system also demonstrates strong scalability under increasing policy volume and request concurrency. Our results highlight LACE’s potential to enable secure, flexible, and user-friendly access control across real-world IoT platforms.
zh
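【代码示意】混合引擎中“结构化规则侧”的判定环节可如下示意(规则 schema、字段名与时间窗均为演示用假设;自然语言到规则的 LLM 翻译部分从略):

```python
from datetime import time

rule = {
    "resource": "front_door_lock",
    "action": "unlock",
    "allow_roles": {"owner", "family"},
    "allowed_hours": (time(7, 0), time(22, 0)),  # 上下文约束:仅限日间
}

def decide(request, rule):
    if (request["resource"], request["action"]) != (rule["resource"], rule["action"]):
        return "not_applicable"
    in_hours = rule["allowed_hours"][0] <= request["time"] <= rule["allowed_hours"][1]
    return "allow" if request["role"] in rule["allow_roles"] and in_hours else "deny"

print(decide({"resource": "front_door_lock", "action": "unlock",
              "role": "family", "time": time(9, 30)}, rule))  # allow
```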
[NLP-171] Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在抽象推理能力上的评估问题,旨在建立一个简单、有效且理论基础扎实的基准,以严格探测LLMs的抽象推理能力。解决方案的关键在于构建一个数学框架,将抽象推理定义为:(i) 提取与表面表示无关的本质模式,以及 (ii) 对这些抽象模式应用一致规则。基于此框架,研究引入了两个互补的度量指标:Γ 分数用于衡量基本推理准确性,Δ 分数用于量化模型对特定符号的依赖程度,从而反映真实抽象能力与单纯记忆之间的差异。此外,通过设计基于规则任务的系统性符号重映射基准,迫使模型展示超越表层标记匹配的真实模式识别能力。
链接: https://arxiv.org/abs/2505.23833
作者: Qingchuan Ma,Yuhang Wu,Xiawu Zheng,Rongrong Ji
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematical framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel complementary metrics: the Γ score measures basic reasoning accuracy, while the Δ score quantifies a model’s reliance on specific symbols rather than underlying patterns - a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark: systematic symbol remapping in rule-based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching. Extensive LLM evaluations using this benchmark (commercial API models, 7B-70B, multi-agent) reveal: 1) critical limitations in non-decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain-of-thought prompting; and 3) the Δ score’s effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization. These findings underscore that current LLMs, despite domain-specific strengths, still lack robust abstract reasoning, highlighting key areas for future improvement.
zh
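【代码示意】“系统性符号重映射”本身只需一张替换表:同一条规则换一套表面符号,记忆化的答案即失效。以下为玩具示例(映射表为任意假设):

```python
def remap(expression, mapping):
    return "".join(mapping.get(ch, ch) for ch in expression)

mapping = {"0": "#", "1": "@", "+": "&"}  # 任意替代符号
task = "1+10=11"
print(remap(task, mapping))  # "@&@#=@@":规则不变,表面形式全新
```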
[NLP-172] LegalSearchLM: Rethinking Legal Case Retrieval as Legal Elements Generation
【速读】: 该论文旨在解决法律案例检索(Legal Case Retrieval, LCR)中存在的两个主要问题:现有研究通常在规模较小的检索语料库上进行评估,且查询类型较为有限,无法充分反映现实法律检索场景的复杂性;此外,其依赖嵌入或词法匹配方法导致表示能力有限和法律无关匹配。解决方案的关键在于提出:(1) LEGAR BENCH,首个大规模韩国LCR基准,涵盖411种多样化的犯罪类型和120万份法律案例;(2) LegalSearchLM,一种通过约束解码直接生成基于目标案例内容的法律要素推理检索模型,从而提升检索效果和泛化能力。
链接: https://arxiv.org/abs/2505.23832
作者: Chaeeun Kim,Jinu Lee,Wonseok Hwang
机构: LBOX; University of Illinois Urbana-Champaign; University of Seoul
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Under review
Abstract:Legal Case Retrieval (LCR), which retrieves relevant cases from a query case, is a fundamental task for legal professionals in research and decision-making. However, existing studies on LCR face two major limitations. First, they are evaluated on relatively small-scale retrieval corpora (e.g., 100-55K cases) and use a narrow range of criminal query types, which cannot sufficiently reflect the complexity of real-world legal retrieval scenarios. Second, their reliance on embedding-based or lexical matching methods often results in limited representations and legally irrelevant matches. To address these issues, we present: (1) LEGAR BENCH, the first large-scale Korean LCR benchmark, covering 411 diverse crime types in queries over 1.2M legal cases; and (2) LegalSearchLM, a retrieval model that performs legal element reasoning over the query case and directly generates content grounded in the target cases through constrained decoding. Experimental results show that LegalSearchLM outperforms baselines by 6-20% on LEGAR BENCH, achieving state-of-the-art performance. It also demonstrates strong generalization to out-of-domain cases, outperforming naive generative models trained on in-domain data by 15%.
zh
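【代码示意】“受约束解码”的要点是:每一步只允许能把输出延续为语料中某个串前缀的 token。下面用字符级玩具演示这一约束(真实系统通常把同样的逻辑接入 generate(..., prefix_allowed_tokens_fn=...);语料与贪心选择均为假设):

```python
def allowed_next(prefix, corpus):
    return {doc[len(prefix)] for doc in corpus
            if doc.startswith(prefix) and len(doc) > len(prefix)}

corpus = ["theft of property", "theft by deception"]
prefix = "theft "
while True:
    options = allowed_next(prefix, corpus)
    if not options:
        break
    prefix += sorted(options)[0]  # 此处用字典序代替模型在受限集合中的选择
print(prefix)  # "theft by deception"
```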
[NLP-173] ICH-Qwen: A Large Language Model Towards Chinese Intangible Cultural Heritage
【速读】: 该论文旨在解决非物质文化遗产(Intangible Cultural Heritage, ICH)在现代化进程中面临的传承断裂、损毁和消失等严峻问题。其解决方案的关键在于利用大规模语言模型技术,通过构建专门针对ICH领域的大型语言模型——ICH-Qwen,结合自然语言理解和知识推理能力,并借助合成数据与微调技术,提升模型在ICH相关任务中的表现,从而为非物质文化遗产的保护、传承与传播提供智能化解决方案。
链接: https://arxiv.org/abs/2505.23831
作者: Wenhao Ye,Tiansheng Zheng,Yue Qi,Wenhua Zhao,Xiyu Wang,Xue Zhao,Jiacheng He,Yaya Zheng,Dongbo Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages, 2 figures
Abstract:The intangible cultural heritage (ICH) of China, a cultural asset transmitted across generations by various ethnic groups, serves as a significant testament to the evolution of human civilization and holds irreplaceable value for the preservation of historical lineage and the enhancement of cultural self-confidence. However, the rapid pace of modernization poses formidable challenges to ICH, including threats of damage, disappearance and discontinuity of inheritance. China has the highest number of items on the UNESCO Intangible Cultural Heritage List, which is indicative of the nation’s abundant cultural resources and emphasises the pressing need for ICH preservation. In recent years, the rapid advancements in large language modelling have provided a novel technological approach for the preservation and dissemination of ICH. This study utilises a substantial corpus of open-source Chinese ICH data to develop a large language model, ICH-Qwen, for the ICH domain. The model employs natural language understanding and knowledge reasoning capabilities of large language models, augmented with synthetic data and fine-tuning techniques. The experimental results demonstrate the efficacy of ICH-Qwen in executing tasks specific to the ICH domain. It is anticipated that the model will provide intelligent solutions for the protection, inheritance and dissemination of intangible cultural heritage, as well as new theoretical and practical references for the sustainable development of intangible cultural heritage. Furthermore, it is expected that the study will open up new paths for digital humanities research.
zh
[NLP-174] EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models
【速读】: 该论文旨在解决多模态大语言模型(MLLMs)中基于专家混合(MoE)方法的两个关键问题:专家同质化和路由器刚性。专家同质化源于MoE专家通常通过简单复制大语言模型(LLM)中的前馈网络(FFN)参数进行初始化,导致专家功能趋同,削弱了MoE架构的多样性;而路由器刚性则由于静态线性路由器在选择专家时无法区分视觉和文本标记,造成图像与文本的专家分布相似。该论文提出的解决方案——EvoMoE,其关键在于引入一种精心设计的专家初始化策略,通过专家进化过程从单一可训练专家中逐步生成多个稳健专家,以缓解专家同质化问题;同时提出动态感知标记的路由器(DTR),利用超网络动态生成针对每个标记的路由权重,实现基于模态和标记内在价值的动态路由分配。
链接: https://arxiv.org/abs/2505.23830
作者: Linglin Jing,Yuting Gao,Zhigang Wang,Wang Lan,Yiwen Tang,Wenhai Wang,Kaipeng Zhang,Qingpei Guo
机构: Shanghai AI Laboratory; Ant Group
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advancements have shown that the Mixture of Experts (MoE) approach significantly enhances the capacity of large language models (LLMs) and improves performance on downstream tasks. Building on these promising results, multi-modal large language models (MLLMs) have increasingly adopted MoE techniques. However, existing multi-modal MoE tuning methods typically face two key challenges: expert uniformity and router rigidity. Expert uniformity occurs because MoE experts are often initialized by simply replicating the FFN parameters from LLMs, leading to homogenized expert functions and weakening the intended diversification of the MoE architecture. Meanwhile, router rigidity stems from the prevalent use of static linear routers for expert selection, which fail to distinguish between visual and textual tokens, resulting in similar expert distributions for image and text. To address these limitations, we propose EvoMoE, an innovative MoE tuning framework. EvoMoE introduces a meticulously designed expert initialization strategy that progressively evolves multiple robust experts from a single trainable expert, a process termed expert evolution that specifically targets severe expert homogenization. Furthermore, we introduce the Dynamic Token-aware Router (DTR), a novel routing mechanism that allocates input tokens to appropriate experts based on their modality and intrinsic token values. This dynamic routing is facilitated by hypernetworks, which dynamically generate routing weights tailored for each individual token. Extensive experiments demonstrate that EvoMoE significantly outperforms other sparse MLLMs across a variety of multi-modal benchmarks, including MME, MMBench, TextVQA, and POPE. Our results highlight the effectiveness of EvoMoE in enhancing the performance of MLLMs by addressing the critical issues of expert uniformity and router rigidity.
zh
[NLP-175] BiasFilter: An Inference-Time Debiasing Framework for Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中社会偏见的缓解问题。现有去偏方法通常存在人力和计算成本高、效果有限以及难以扩展到更大模型和开放式生成任务的缺陷。该论文提出的解决方案——BiasFilter,其关键在于采用一种模型无关的推理阶段去偏框架,通过实时过滤生成输出来实现公平性,而非依赖于重新训练或修改模型参数。具体而言,它通过定期评估中间输出、维护候选延续集,并基于公平性奖励信号逐步丢弃低奖励片段来完成生成过程。
链接: https://arxiv.org/abs/2505.23829
作者: Xiaoqing Cheng,Ruizhe Chen,Hongying Zan,Yuxiang Jia,Min Peng
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Mitigating social bias in large language models (LLMs) has become an increasingly important research objective. However, existing debiasing methods often incur high human and computational costs, exhibit limited effectiveness, and struggle to scale to larger models and open-ended generation tasks. To address these limitations, this paper proposes BiasFilter, a model-agnostic, inference-time debiasing framework that integrates seamlessly with both open-source and API-based LLMs. Instead of relying on retraining with balanced data or modifying model parameters, BiasFilter enforces fairness by filtering generation outputs in real time. Specifically, it periodically evaluates intermediate outputs every few tokens, maintains an active set of candidate continuations, and incrementally completes generation by discarding low-reward segments based on a fairness reward signal. To support this process, we construct a fairness preference dataset and train an implicit reward model to assess token-level fairness in generated responses. Extensive experiments demonstrate that BiasFilter effectively mitigates social bias across a range of LLMs while preserving overall generation quality.
zh
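【代码示意】推理时过滤的主循环可以压缩成几行:维护一组候选续写,每隔几步按公平性奖励丢弃低分片段再补齐。extend 与 fairness_reward 为调用方提供的桩函数(extend 应具随机性,否则候选会退化为同一条),均为示意假设:

```python
import random

def bias_filter(prompt, extend, fairness_reward, n_candidates=4, keep=2, steps=5):
    candidates = [prompt] * n_candidates
    for _ in range(steps):
        candidates = [extend(c) for c in candidates]             # 每个候选续写若干 token
        candidates.sort(key=fairness_reward, reverse=True)
        candidates = candidates[:keep] * (n_candidates // keep)  # 丢弃低分片段后复制补齐
    return max(candidates, key=fairness_reward)

# 玩具演示:extend 随机追加词,奖励函数惩罚出现 "biased"
words = ["fair", "biased", "neutral"]
print(bias_filter("start", lambda c: c + " " + random.choice(words),
                  lambda c: -c.count("biased")))
```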
[NLP-176] ValueSim: Generating Backstories to Model Individual Value Systems
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在对齐人类价值观时缺乏对个体化价值体系模拟的问题。现有方法如提示学习和强化学习虽能处理广泛的伦理考量和帮助性,但未能关注个性化价值系统的建模。解决方案的关键在于提出ValueSim框架,该框架通过生成反映个体过往经历和人口统计信息的个人背景故事来模拟个体价值观,并采用受认知-情感人格系统启发的多模块架构,基于这些叙事进行价值模拟。
链接: https://arxiv.org/abs/2505.23827
作者: Bangde Du,Ziyi Ye,Zhijing Wu,Jankowska Monika,Shuqi Zhu,Qingyao Ai,Yujia Zhou,Yiqun Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages main paper + 13 pages appendix, 3 figures, 2 tables
Abstract:As Large Language Models (LLMs) continue to exhibit increasingly human-like capabilities, aligning them with human values has become critically important. Contemporary advanced techniques, such as prompt learning and reinforcement learning, are being deployed to better align LLMs with human values. However, while these approaches address broad ethical considerations and helpfulness, they rarely focus on simulating individualized human value systems. To address this gap, we present ValueSim, a framework that simulates individual values through the generation of personal backstories reflecting past experiences and demographic information. ValueSim converts structured individual data into narrative backstories and employs a multi-module architecture inspired by the Cognitive-Affective Personality System to simulate individual values based on these narratives. Testing ValueSim on a self-constructed benchmark derived from the World Values Survey demonstrates an improvement in top-1 accuracy by over 10% compared to retrieval-augmented generation methods. Further analysis reveals that performance enhances as additional user interaction history becomes available, indicating the model’s ability to refine its persona simulation capabilities over time.
zh
[NLP-177] Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation DATE
【速读】: 该论文试图解决科学出版物同行评审过程中对领域专家依赖度过高的问题,以及由此带来的评估成本高和效率低的挑战。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)作为稿件质量检查工具,而非直接生成评审意见,通过引入可扩展的自动评估框架,使用顶级LLMs作为评判者,以降低对人工专家的依赖,同时提高评估的效率和可行性。
链接: https://arxiv.org/abs/2505.23824
作者: Tianmai M. Zhang,Neil F. Abernethy
机构: University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注: Work in progress. Conclusions may be updated
Abstract:Recent advancements in large language models have sparked interest in utilizing them to assist the peer review process of scientific publication. Instead of having AI models generate reviews in the same way as human reviewers, we propose adopting them as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs from different providers and assessed their performance and API costs for identifying critical errors and unsoundness problems. The OpenAI o3 model performed the best, while o4-mini was the most cost-effective one in our evaluation. This paper provides insights into document-based scientific understanding/reasoning and lays the foundation for future applications.
zh
[NLP-178] RAGPPI: RAG Benchmark for Protein-Protein Interactions in Drug Discovery
【速读】: 该论文试图解决在药物开发中通过蛋白质-蛋白质相互作用(Protein-Protein Interactions, PPIs)识别潜在生物影响的挑战,这一过程对于目标识别(Target ID)至关重要。现有方法在处理大量蛋白质时仍存在耗时和困难的问题。该研究的关键解决方案是构建了一个名为RAGPPI的基准数据集,包含4,420个问答对,聚焦于PPI的潜在生物影响,并通过专家驱动的数据标注和集成自动评估语言模型生成银标准数据集,以支持检索增强生成(Retrieval-Augmented Generation, RAG)系统在药物发现问答任务中的发展。
链接: https://arxiv.org/abs/2505.23823
作者: Youngseung Jeon,Ziwen Li,Thomas Li,JiaSyuan Chang,Morteza Ziyadi,Xiang ‘Anthony’ Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注: 17 pages, 4 figures, 8 tables
Abstract:Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as a type of QA and source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that reflected expert labeling characteristics, which facilitates the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.
zh
[NLP-179] Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction
【速读】: 该论文试图解决青少年抑郁检测中单一模态数据利用不足以及多任务学习与纵向分析整合不够的问题。其解决方案的关键在于将患者语音数据视为三模态多媒体数据源(即语音生成文本、声学特征点和语音生物标志物),并结合多任务学习(MTL)同时预测抑郁、自杀意念和睡眠障碍,同时引入纵向分析策略以建模多个临床交互中的时间变化,从而全面理解病情进展。
链接: https://arxiv.org/abs/2505.23822
作者: Mai Ali,Christopher Lucasius,Tanmay P. Patel,Madison Aitken,Jacob Vorstman,Peter Szatmari,Marco Battaglia,Deepa Kundur
机构: University of Toronto(多伦多大学); Centre for Addiction and Mental Health(加拿大心理健康中心); York University(约克大学); The Hospital for Sick Children(多伦多儿童医院); Department of Psychiatry(精神病学系)
类目: Computation and Language (cs.CL); Multimedia (cs.MM)
备注: 6 pages, 1 figure, 3 tables. Submitted to ICSM 2025. The corresponding author is Mai Ali ( this http URL @mail. this http URL ). Christopher Lucasius and Tanmay P. Patel contributed equally
Abstract:Speech is a noninvasive digital phenotype that can offer valuable insights into mental health conditions, but it is often treated as a single modality. In contrast, we propose the treatment of patient speech data as a trimodal multimedia data source for depression detection. This study explores the potential of large language model-based architectures for speech-based depression prediction in a multimodal regime that integrates speech-derived text, acoustic landmarks, and vocal biomarkers. Adolescent depression presents a significant challenge and is often comorbid with multiple disorders, such as suicidal ideation and sleep disturbances. This presents an additional opportunity to integrate multi-task learning (MTL) into our study by simultaneously predicting depression, suicidal ideation, and sleep disturbances using the multimodal formulation. We also propose a longitudinal analysis strategy that models temporal changes across multiple clinical interactions, allowing for a comprehensive understanding of the conditions’ progression. Our proposed approach, featuring trimodal, longitudinal MTL is evaluated on the Depression Early Warning dataset. It achieves a balanced accuracy of 70.8%, which is higher than each of the unimodal, single-task, and non-longitudinal methods.
zh
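下面给出一个最小化的示意代码(PyTorch),说明摘要中"三模态特征融合 + 多任务学习"的基本结构:三种模态各自投影后拼接,再由三个任务头分别预测抑郁、自杀意念与睡眠障碍。其中的维度、模块划分与损失权重均为假设,并非论文官方实现。

```python
# 示意性草图(非论文官方实现):三模态特征融合 + 多任务预测头。
# 假设三种模态(语音生成文本、声学特征点、语音生物标志物)已分别编码为定长向量。
import torch
import torch.nn as nn

class TrimodalMTL(nn.Module):
    def __init__(self, text_dim=768, landmark_dim=128, biomarker_dim=64, hidden=256):
        super().__init__()
        # 将三种模态投影到同一隐空间后拼接
        self.proj = nn.ModuleDict({
            "text": nn.Linear(text_dim, hidden),
            "landmark": nn.Linear(landmark_dim, hidden),
            "biomarker": nn.Linear(biomarker_dim, hidden),
        })
        self.fuse = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU())
        # 三个任务共享融合表示,各自使用独立的分类头(多任务学习)
        self.heads = nn.ModuleDict({
            "depression": nn.Linear(hidden, 2),
            "suicidal_ideation": nn.Linear(hidden, 2),
            "sleep_disturbance": nn.Linear(hidden, 2),
        })

    def forward(self, text, landmark, biomarker):
        z = torch.cat([self.proj["text"](text),
                       self.proj["landmark"](landmark),
                       self.proj["biomarker"](biomarker)], dim=-1)
        h = self.fuse(z)
        return {name: head(h) for name, head in self.heads.items()}

# 多任务损失:各任务交叉熵之和(等权重为假设)
model = TrimodalMTL()
outs = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 64))
labels = {k: torch.randint(0, 2, (4,)) for k in outs}
loss = sum(nn.functional.cross_entropy(outs[k], labels[k]) for k in outs)
```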
[NLP-180] Arbiters of Ambivalence: Challenges of Using LLM s in No-Consensus Tasks
【速读】: 该论文试图解决生成式 AI (Generative AI) 在对齐过程中无法准确复制人类判断和偏好,特别是在存在分歧的场景中的局限性问题。研究通过构建一个“无共识”基准,涵盖多种先验上存在分歧的场景,以评估生成式 AI 在回答生成者、评判者和辩论者三种角色中的表现。解决方案的关键在于揭示生成式 AI 在作为评判者或辩论者时倾向于采取立场,而非保持中立,从而强调了在缺乏人类监督的情况下,需要更复杂的对齐方法来更好地捕捉人类之间的分歧。
链接: https://arxiv.org/abs/2505.23820
作者: Bhaktipriya Radharapu,Manon Revel,Megan Ung,Sebastian Ruder,Adina Williams
机构: Meta(元)
类目: Computation and Language (cs.CL)
备注:
Abstract:The increasing use of LLMs as substitutes for humans in "aligning" LLMs has raised questions about their ability to replicate human judgments and preferences, especially in ambivalent scenarios where humans disagree. This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. These roles loosely correspond to previously described alignment frameworks: preference alignment (judge) and scalable oversight (debater), with the answer generator reflecting the typical setting with user interactions. We develop a "no-consensus" benchmark by curating examples that encompass a variety of a priori ambivalent scenarios, each presenting two possible stances. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters. These findings underscore the necessity for more sophisticated methods for aligning LLMs without human oversight, highlighting that LLMs cannot fully capture human disagreement even on topics where humans themselves are divided.
zh
[NLP-181] Ratas framework: A comprehensive genai-based approach to rubric-based marking of real-world textual exams
【速读】: 该论文试图解决自动化答案评分在教育技术中的关键挑战,包括现有方法受限于特定考试格式、评分过程缺乏可解释性以及在跨学科和多样化评估类型中的实际应用能力不足。解决方案的关键在于提出RATAS(Rubric Automated Tree-based Answer Scoring)框架,该框架利用先进的生成式 AI 模型实现基于评分标准的文本回答评分,支持广泛的评分标准、实现学科无关的评估,并生成结构化且可解释的评分理由。
链接: https://arxiv.org/abs/2505.23818
作者: Masoud Safilian,Amin Beheshti,Stephen Elbourn
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Automated answer grading is a critical challenge in educational technology, with the potential to streamline assessment processes, ensure grading consistency, and provide timely feedback to students. However, existing approaches are often constrained to specific exam formats, lack interpretability in score assignment, and struggle with real-world applicability across diverse subjects and assessment types. To address these limitations, we introduce RATAS (Rubric Automated Tree-based Answer Scoring), a novel framework that leverages state-of-the-art generative AI models for rubric-based grading of textual responses. RATAS is designed to support a wide range of grading rubrics, enable subject-agnostic evaluation, and generate structured, explainable rationales for assigned scores. We formalize the automatic grading task through a mathematical framework tailored to rubric-based assessment and present an architecture capable of handling complex, real-world exam structures. To rigorously evaluate our approach, we construct a unique, contextualized dataset derived from real-world project-based courses, encompassing diverse response formats and varying levels of complexity. Empirical results demonstrate that RATAS achieves high reliability and accuracy in automated grading while providing interpretable feedback that enhances transparency for both students and instructors.
zh
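为说明"基于树的评分标准(rubric)评分"这一思路,下面给出一个示意性草图:把评分标准组织为加权树,叶子由生成式模型打分并给出理由,内部节点逐级加权汇总。接口 llm_judge 为假设,并非 RATAS 的官方实现。

```python
# 示意性草图(非RATAS官方实现):评分标准树 + 递归评分与可解释理由收集。
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    name: str
    weight: float = 1.0
    children: list = field(default_factory=list)
    max_score: float = 0.0  # 叶子节点的满分

def score_node(node, answer, llm_judge):
    """递归评分:叶子交给LLM评审,内部节点做加权汇总,同时收集评分理由。"""
    if not node.children:
        # llm_judge 为假设接口:输入评分项、满分与学生回答,返回 (得分, 理由)
        score, rationale = llm_judge(node.name, node.max_score, answer)
        return score, [(node.name, score, rationale)]
    total, rationales = 0.0, []
    for child in node.children:
        s, r = score_node(child, answer, llm_judge)
        total += child.weight * s
        rationales.extend(r)
    return total, rationales
```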
[NLP-182] A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLM s
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成输出时是否能够可靠地对用户目标进行引导的问题,即所谓的“可操控性”(steerability)。其关键在于提出一种基于多维目标空间的评估框架,将用户目标和LLM输出建模为对应于文本属性(如阅读难度)的向量,从而系统地评估模型在可操控性方面的失败模式,包括覆盖不足、校准错误和副作用等问题。
链接: https://arxiv.org/abs/2505.23816
作者: Trenton Chang,Tobias Schnabel,Adith Swaminathan,Jenna Wiens
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 8 figures. 26 pages of references and supplementary material, 20 additional figures
Abstract:Despite advances in large language models (LLMs) on reasoning and instruction-following benchmarks, it remains unclear whether they can reliably produce outputs aligned with a broad variety of user goals, a concept we refer to as steerability. The abundance of methods proposed to modify LLM behavior makes it unclear whether current LLMs are already steerable, or require further intervention. In particular, LLMs may exhibit (i) poor coverage, where rare user goals are underrepresented; (ii) miscalibration, where models overshoot requests; and (iii) side effects, where changes to one dimension of text inadvertently affect others. To systematically evaluate these failures, we introduce a framework based on a multi-dimensional goal space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applied to a text-rewriting task, we find that current LLMs struggle with steerability, as side effects are persistent. Interventions to improve steerability, such as prompt engineering, best-of-N sampling, and reinforcement learning fine-tuning, have varying effectiveness, yet side effects remain problematic. Our findings suggest that even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient. We open-source our steerability evaluation framework at this https URL.
zh
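摘要中的多维目标空间可以用一个简单的数值例子说明:把用户目标与模型输出都表示为文本属性向量,"校准错误(过冲)"对应实际移动在请求方向上的投影比例,"副作用"对应与请求方向正交的分量。以下示意代码按此思路编写,具体度量定义为合理假设,并非论文的原始公式。

```python
# 示意性草图:在多维目标空间中度量"过冲"与"副作用"。变量名均为假设。
import numpy as np

def steerability_report(source, goal, output):
    """source/goal/output 均为文本属性向量(如阅读难度、正式度等)。"""
    requested = goal - source          # 用户请求的移动方向
    achieved = output - source         # 模型实际产生的移动
    # 校准:实际移动在请求方向上的投影比例,>1 表示过冲(miscalibration)
    denom = np.dot(requested, requested)
    calib = np.dot(achieved, requested) / denom if denom > 0 else 0.0
    # 副作用:实际移动中与请求方向正交的分量大小
    side = achieved - calib * requested
    return {"calibration": calib, "side_effect": float(np.linalg.norm(side))}

# 例:只请求提高第一维属性,但输出同时改变了第二维(副作用)且过冲约33%
print(steerability_report(np.array([0.5, 0.2]), np.array([0.8, 0.2]), np.array([0.9, 0.5])))
```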
[NLP-183] Aligning LLM s by Predicting Preferences from User Writing Samples ICML2025
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在推断用户偏好时存在的描述不够精准的问题,即现有方法往往生成过于通用的偏好描述,无法捕捉人类偏好的独特性和个性化特征。其解决方案的关键在于提出PROSE方法,该方法通过两个核心要素提升偏好描述的精度:一是对推断出的偏好进行迭代优化,二是通过多个用户写作样本验证推断出的偏好,从而实现更准确的人类偏好建模。
链接: https://arxiv.org/abs/2505.23815
作者: Stéphane Aroca-Ouellette,Natalie Mackraz,Barry-John Theobald,Katherine Metcalf
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to ICML 2025. 32 pages total: 9 main, 2 references, 21 appendix. arXiv admin note: substantial text overlap with arXiv:2410.06273
Abstract:Accommodating human preferences is essential for creating aligned LLM agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs acting as writing agents to infer a description of user preferences. Agent alignment then comes from conditioning on the inferred preference description. However, existing methods often produce generic preference descriptions that fail to capture the unique and individualized nature of human preferences. This paper introduces PROSE, a method designed to enhance the precision of preference descriptions inferred from user writing samples. PROSE incorporates two key elements: (1) iterative refinement of inferred preferences, and (2) verification of inferred preferences across multiple user writing samples. We evaluate PROSE with several LLMs (i.e., Qwen2.5 7B and 72B Instruct, GPT-mini, and GPT-4o) on a summarization and an email writing task. We find that PROSE more accurately infers nuanced human preferences, improving the quality of the writing agent’s generations over CIPHER (a state-of-the-art method for inferring preferences) by 33%. Lastly, we demonstrate that ICL and PROSE are complementary methods, and combining them provides up to a 9% improvement over ICL alone.
zh
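PROSE 的两个关键要素(迭代细化、跨样本验证)可以概括为如下控制流示意。infer、refine、consistent 均为假设的 LLM 调用接口,仅用于说明流程,并非官方实现。

```python
# 示意性草图(非官方实现):PROSE式偏好推断 = 初始推断 + 验证失败则迭代细化。
def prose_infer(writing_samples, infer, refine, consistent, max_iter=3):
    pref = infer(writing_samples[0])      # 由首个写作样本推断初始偏好描述
    for _ in range(max_iter):
        # 验证:偏好描述须与每份写作样本一致,否则针对不一致的样本细化
        failures = [s for s in writing_samples if not consistent(pref, s)]
        if not failures:
            break
        pref = refine(pref, failures)     # 迭代细化偏好描述
    return pref
```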
[NLP-184] Emotion-aware Dual Cross-Attentive Neural Network with Label Fusion for Stance Detection in Misinformative Social Media Content
【速读】: 该论文旨在解决在虚假信息传播的社交媒体用户生成内容中,准确检测立场(stance)的问题,以应对虚假信息导致的用户观点极化现象。解决方案的关键在于提出一种名为SPLAENet的新型方法,该方法通过融合标签的双交叉注意力机制和情感感知的层次化注意力网络,捕捉源文本与回复文本之间的相互关系,并利用情感对齐或差异来有效区分不同的立场类别,同时结合距离度量学习进行标签融合,提升模型对立场分类的准确性。
链接: https://arxiv.org/abs/2505.23812
作者: Lata Pangtey,Mohammad Zia Ur Rehman,Prasad Chaudhari,Shubhi Bansal,Nagendra Kumar
机构: Indian Institute of Technology (IIT) Indore (印度理工学院(伊诺德)分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The rapid evolution of social media has generated an overwhelming volume of user-generated content, conveying implicit opinions and contributing to the spread of misinformation. Stance detection has emerged as a crucial approach to effectively analyze underlying biases in shared information and to combat misinformation, particularly where misinformation can polarize user opinions. This paper proposes a novel method for Stance Prediction through a Label-fused dual cross-Attentive Emotion-aware neural Network (SPLAENet) in misinformative social media user-generated content. The proposed method employs a dual cross-attention mechanism and a hierarchical attention network to capture inter and intra-relationships by focusing on the relevant parts of source text in the context of reply text and vice versa. We incorporate emotions to effectively distinguish between different stance categories by leveraging the emotional alignment or divergence between the texts. We also employ label fusion that uses distance-metric learning to align extracted features with stance labels, improving the method's ability to accurately distinguish between stances. Extensive experiments demonstrate the significant improvements achieved by SPLAENet over existing state-of-the-art methods. SPLAENet demonstrates an average gain of 8.92% in accuracy and 17.36% in F1-score on the RumourEval dataset. On the SemEval dataset, it achieves average gains of 7.02% in accuracy and 10.92% in F1-score. On the P-stance dataset, it demonstrates average gains of 10.03% in accuracy and 11.18% in F1-score. These results validate the effectiveness of the proposed method for stance detection in the context of misinformative social media content.
zh
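摘要中的"双交叉注意力"机制可以用 PyTorch 的多头注意力简要示意:源文本以回复为键值做注意力,回复再以源文本为键值做注意力,两路池化后拼接供立场分类使用。维度与池化方式为假设,非 SPLAENet 官方实现。

```python
# 示意性草图(非SPLAENet官方实现):双交叉注意力捕捉源文本与回复的相互关系。
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.src2rep = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rep2src = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, src, rep):
        # src: (B, Ls, D) 源文本表示; rep: (B, Lr, D) 回复文本表示
        src_ctx, _ = self.src2rep(query=src, key=rep, value=rep)  # 源关注回复
        rep_ctx, _ = self.rep2src(query=rep, key=src, value=src)  # 回复关注源
        # 平均池化后拼接,供下游立场分类头使用
        return torch.cat([src_ctx.mean(dim=1), rep_ctx.mean(dim=1)], dim=-1)

fused = DualCrossAttention()(torch.randn(2, 20, 256), torch.randn(2, 12, 256))
print(fused.shape)  # torch.Size([2, 512])
```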
[NLP-185] LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions
【速读】: 该论文试图解决预训练大型语言模型(Large Language Models, LLMs)在不同层的训练质量存在显著差异的问题,这种差异限制了其在下游任务中的性能表现。现有方法主要依赖于模型中心的启发式策略(如谱统计、异常检测或均匀分配),而忽略了数据的影响。论文提出的解决方案是LayerIF,其关键在于采用数据驱动的方法,利用影响函数(Influence Functions)以一种符合任务需求的方式量化各层的训练质量。通过隔离每层的梯度并计算验证损失对训练样本的敏感性,从而得到层重要性的数据驱动估计,实现了对同一LLM在不同任务中层特异性的揭示。
链接: https://arxiv.org/abs/2505.23811
作者: Hadi Askari,Shivanshu Gupta,Fei Wang,Anshuman Chhabra,Muhao Chen
机构: University of California, Davis (加州大学戴维斯分校); University of California, Irvine (加州大学欧文分校); University of Southern California (南加利福尼亚大学); University of South Florida (南佛罗里达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under Review
Abstract:Pretrained Large Language Models (LLMs) achieve strong performance across a wide range of tasks, yet exhibit substantial variability in the various layers' training quality with respect to specific downstream applications, limiting their downstream performance. It is therefore critical to estimate layer-wise training quality in a manner that accounts for both model architecture and training data. However, existing approaches predominantly rely on model-centric heuristics (such as spectral statistics, outlier detection, or uniform allocation) while overlooking the influence of data. To address these limitations, we propose LayerIF, a data-driven framework that leverages Influence Functions to quantify the training quality of individual layers in a principled and task-sensitive manner. By isolating each layer's gradients and measuring the sensitivity of the validation loss to training examples by computing layer-wise influences, we derive data-driven estimates of layer importance. Notably, our method produces task-specific layer importance estimates for the same LLM, revealing how layers specialize for different test-time evaluation tasks. We demonstrate the utility of our scores by leveraging them for two downstream applications: (a) expert allocation in LoRA-MoE architectures and (b) layer-wise sparsity distribution for LLM pruning. Experiments across multiple LLM architectures demonstrate that our model-agnostic, influence-guided allocation leads to consistent gains in task performance.
zh
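影响函数的完整形式需要 Hessian 逆的修正项;下面的草图做了大幅简化,用"验证损失梯度与训练损失梯度在该层参数上的点积"这一一阶近似给每层打分,仅用于说明 LayerIF"按层隔离梯度、度量验证损失敏感性"的思路,接口均为假设。

```python
# 示意性草图(大幅简化):逐层影响力的一阶近似 = grad_val · grad_train,
# 省略了影响函数中的Hessian逆修正项。loss_fn(model, batch) 为假设接口,
# 每次调用需重新前向并构建计算图。
import torch

def layerwise_influence(model, layers, train_batch, val_batch, loss_fn):
    """layers: {层名: 该层参数列表};返回每层的影响力得分。"""
    scores = {}
    for name, params in layers.items():
        g_tr = torch.autograd.grad(loss_fn(model, train_batch), params,
                                   allow_unused=True)
        g_va = torch.autograd.grad(loss_fn(model, val_batch), params,
                                   allow_unused=True)
        scores[name] = sum((a * b).sum().item()
                           for a, b in zip(g_tr, g_va)
                           if a is not None and b is not None)
    return scores
```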
[NLP-186] MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理长而复杂的对话会话时的鲁棒性不足问题,特别是针对频繁的动机转移和复杂的跨轮次依赖关系。其解决方案的关键在于提出一个名为MARS-Bench的多轮真实场景对话基准,该基准基于体育比赛的实况文本解说(play-by-play)构建,专门用于评估多轮对话的三个关键方面:超多轮、交互式多轮和跨轮任务。通过该基准进行的大量实验揭示了闭源LLMs在处理复杂对话任务中的优势,以及显式推理对提升LLMs鲁棒性的有效性。
链接: https://arxiv.org/abs/2505.23810
作者: Chenghao Yang,Yinbo Luo,Zhoufutu Wen,Qi Chu,Tao Gong,Longxiang Liu,Kaiyuan Zhang,Jianpeng Jiao,Ge Zhang,Wenhao Huang,Nenghai Yu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 29 pages, 13 figures
Abstract:Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long complex dialogue sessions with frequent motivation transfer and sophisticated cross-turn dependency, has long been criticized. Nevertheless, no existing benchmarks can fully reflect these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to remedy the gap. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks. Extensive experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives, explicit reasoning significantly boosts LLMs' robustness on handling long complex dialogue sessions, and LLMs indeed face significant challenges when handling motivation transfer and sophisticated cross-turn dependency. Moreover, based on attention visualization experiments with Qwen2.5-7B-Instruct, we provide a mechanistic interpretation of how attention sinks caused by special tokens lead to LLMs' performance degradation when handling long complex dialogue sessions.
zh
[NLP-187] LLM -Driven E-Commerce Marketing Content Optimization: Balancing Creativity and Conversion
【速读】: 该论文旨在解决电子商务中创意内容与转化效果之间的平衡问题(conversion effectiveness),以提升营销文案的吸引力和转化率。其解决方案的关键在于提出一个整合了提示工程(prompt engineering)、多目标微调(multi-objective fine-tuning)和后处理(post-processing)的框架,通过结合情感调整、多样性增强和行动呼吁(CTA)嵌入等方法,生成既具吸引力又具备高转化效果的营销文案。
链接: https://arxiv.org/abs/2505.23809
作者: Haowei Yang,Haotian Lyu,Tianle Zhang,Dingzhou Wang,Yushang Zhao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:As e-commerce competition intensifies, balancing creative content with conversion effectiveness becomes critical. Leveraging LLMs' language generation capabilities, we propose a framework that integrates prompt engineering, multi-objective fine-tuning, and post-processing to generate marketing copy that is both engaging and conversion-driven. Our fine-tuning method combines sentiment adjustment, diversity enhancement, and CTA embedding. Through offline evaluations and online A/B tests across categories, our approach achieves a 12.5% increase in CTR and an 8.3% increase in CVR while maintaining content novelty. This provides a practical solution for automated copy generation and suggests paths for future multimodal, real-time personalization.
zh
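摘要中的"多目标微调"通常可表示为若干子目标损失的加权和,如下示意;各子损失的具体定义与权重均为假设,仅说明结构。

```python
# 示意性草图:多目标微调的常见形式,将语言建模损失与情感、多样性、
# CTA相关的辅助损失加权求和。子损失的实现与权重均为假设。
def marketing_loss(lm_loss, sentiment_loss, diversity_loss, cta_loss,
                   w=(1.0, 0.3, 0.2, 0.3)):
    return (w[0] * lm_loss + w[1] * sentiment_loss
            + w[2] * diversity_loss + w[3] * cta_loss)
```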
[NLP-188] DenseLoRA: Dense Low-Rank Adaptation of Large Language Models
【速读】: 该论文旨在解决低秩适配(Low-rank adaptation, LoRA)中参数利用效率低的问题,即在适配大语言模型(Large Language Models, LLMs)时,用于微调的两个低秩矩阵中存在大量冗余权重,导致参数利用率不足。其解决方案的关键在于提出一种名为Dense Low-Rank Adaptation (DenseLoRA)的新方法,该方法通过使用一个密集的低秩矩阵替代LoRA中的两个冗余低秩矩阵,从而提升参数利用效率和适配效果。此外,DenseLoRA引入了一个编码器-解码器结构,在所有适配层之前对隐藏表示进行精炼和压缩,进一步优化了模型性能。
链接: https://arxiv.org/abs/2505.23808
作者: Lin Mu,Xiaoyu Wang,Li Ni,Yang Li,Zhize Wu,Peiquan Jin,Yiwen Zhang
机构: Anhui University (安徽大学); Hefei University (合肥大学); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Low-rank adaptation (LoRA) has been developed as an efficient approach for adapting large language models (LLMs) by fine-tuning two low-rank matrices, thereby reducing the number of trainable parameters. However, prior research indicates that many of the weights in these matrices are redundant, leading to inefficiencies in parameter utilization. To address this limitation, we introduce Dense Low-Rank Adaptation (DenseLoRA), a novel approach that enhances parameter efficiency while achieving superior performance compared to LoRA. DenseLoRA builds upon the concept of representation fine-tuning, incorporating a single Encoder-Decoder to refine and compress hidden representations across all adaptation layers before applying adaptation. Instead of relying on two redundant low-rank matrices as in LoRA, DenseLoRA adapts LLMs through a dense low-rank matrix, improving parameter utilization and adaptation efficiency. We evaluate DenseLoRA on various benchmarks, showing that it achieves 83.8% accuracy with only 0.01% of trainable parameters, compared to LoRA’s 80.8% accuracy with 0.70% of trainable parameters on LLaMA3-8B. Additionally, we conduct extensive experiments to systematically assess the impact of DenseLoRA’s components on overall model performance. Code is available at this https URL.
zh
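按摘要描述,DenseLoRA 用"所有适配层共享的编码器-解码器 + 每层一个稠密低秩矩阵"替代 LoRA 的两个低秩矩阵。下面是一个按此思路编写的示意模块(PyTorch),其中激活函数、缩放方式与共享粒度均为假设,非官方实现。

```python
# 示意性草图(非官方实现):DenseLoRA的核心思想,共享编码器-解码器
# 压缩/精炼隐藏表示,每层只需学习一个 rank×rank 的稠密矩阵。
import torch
import torch.nn as nn

class DenseLoRALayer(nn.Module):
    def __init__(self, dim, rank, shared_encoder, shared_decoder, alpha=16):
        super().__init__()
        self.encoder = shared_encoder          # 所有适配层共享: dim -> rank
        self.decoder = shared_decoder          # 所有适配层共享: rank -> dim
        self.dense = nn.Parameter(torch.zeros(rank, rank))  # 每层独立的稠密低秩矩阵
        self.scale = alpha / rank

    def forward(self, x, frozen_out):
        # frozen_out 为冻结的原始线性层输出;适配量走 编码->稠密变换->解码 路径
        delta = self.decoder(torch.tanh(self.encoder(x)) @ self.dense)
        return frozen_out + self.scale * delta

dim, rank = 768, 8
enc, dec = nn.Linear(dim, rank), nn.Linear(rank, dim)
layer = DenseLoRALayer(dim, rank, enc, dec)
x = torch.randn(2, 10, dim)
out = layer(x, frozen_out=x)  # 这里以恒等输出代替真实冻结层,仅示意形状
```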
[NLP-189] DLP: Dynamic Layerwise Pruning in Large Language Models ICML2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在高稀疏性水平下进行剪枝时导致性能严重下降的问题。现有主流剪枝技术通常采用统一的逐层剪枝策略,而近期非均匀逐层剪枝方法虽然考虑了不同层的重要性差异,但依赖预定义值,难以达到最优效果。论文提出的解决方案是动态逐层剪枝(Dynamic Layerwise Pruning, DLP),其关键在于通过整合模型权重与输入激活信息,自适应地确定每层的相对重要性,并据此分配剪枝率,从而在保持模型性能的同时实现高效压缩。
链接: https://arxiv.org/abs/2505.23807
作者: Yuli Chen,Bo Cheng,Jiale Han,Yingying Zhang,Yingting Li,Shuhao Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2025
Abstract:Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code at this https URL to facilitate future research.
zh
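DLP 的核心是在给定全局稀疏度的前提下,按"权重 × 输入激活"的层重要性自适应分配剪枝率。以下为一个最小示意:重要性高的层分到更低的剪枝率;具体的归一化与映射函数为假设,非论文官方实现。

```python
# 示意性草图(非官方实现):按层重要性自适应分配剪枝率,
# 重要层剪得少,并近似保持全局平均稀疏度。
import numpy as np

def allocate_pruning_rates(weight_norms, act_norms, target_sparsity=0.7, spread=0.1):
    """weight_norms/act_norms: 每层权重与输入激活的范数;返回每层剪枝率。"""
    importance = np.asarray(weight_norms) * np.asarray(act_norms)
    rel = (importance - importance.mean()) / (importance.std() + 1e-8)
    # 重要层 -> 更低剪枝率;偏移量被 tanh 压缩到 ±spread 以内
    rates = target_sparsity - spread * np.tanh(rel)
    return np.clip(rates, 0.0, 0.99)

print(allocate_pruning_rates([3.0, 1.0, 2.0], [2.0, 1.0, 1.0]))
```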
[NLP-190] MedOrchestra: A Hybrid Cloud-Local LLM Approach for Clinical Data Interpretation
【速读】: 该论文试图解决在临床环境中部署大规模语言模型(Large Language Models, LLMs)时面临的关键权衡问题:云环境下的LLMs虽然参数量大、性能优越,但可能危及敏感临床数据的隐私;而本地LLMs虽能保障隐私,但在处理复杂的临床解释任务时表现不佳。解决方案的关键在于提出MedOrchestra,一个混合框架,其中云LLMs负责将复杂临床任务分解为可管理的子任务和提示生成,而本地LLMs则以隐私保护的方式执行这些子任务。通过临床指南和合成测试案例生成并验证子任务提示,本地LLMs在不接触临床数据的情况下执行子任务并整合云LLMs生成的输出,从而在保证隐私的同时提升任务性能。
链接: https://arxiv.org/abs/2505.23806
作者: Sihyeon Lee,Hyunjoo Song,Jong-chan Lee,Yoon Jin Lee,Boram Lee,Hee-Eon Lim,Dongyeong Kim,Jinwook Seo,Bohyoung Kim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Deploying large language models (LLMs) in clinical settings faces critical trade-offs: cloud LLMs, with their extensive parameters and superior performance, pose risks to sensitive clinical data privacy, while local LLMs preserve privacy but often fail at complex clinical interpretation tasks. We propose MedOrchestra, a hybrid framework where a cloud LLM decomposes complex clinical tasks into manageable subtasks and prompt generation, while a local LLM executes these subtasks in a privacy-preserving manner. Without accessing clinical data, the cloud LLM generates and validates subtask prompts using clinical guidelines and synthetic test cases. The local LLM executes subtasks locally and synthesizes outputs generated by the cloud LLM. We evaluate MedOrchestra on pancreatic cancer staging using 100 radiology reports under NCCN guidelines. On free-text reports, MedOrchestra achieves 70.21% accuracy, outperforming local model baselines (without guideline: 48.94%, with guideline: 56.59%) and board-certified clinicians (gastroenterologists: 59.57%, surgeons: 65.96%, radiologists: 55.32%). On structured reports, MedOrchestra reaches 85.42% accuracy, showing clear superiority across all settings.
zh
[NLP-191] Calibrating LLM s for Text-to-SQL Parsing by Leverag ing Sub-clause Frequencies
【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的文本到SQL解析系统中,模型在输出时可能表现出自信但错误的情况,即如何获取可靠的不确定性度量以提升系统的可信度。其解决方案的关键在于通过后验校准方法,如Platt缩放(Platt scaling),将原始模型输出的概率转化为更准确的置信度分数,并进一步引入“子短语频率”(Sub-clause Frequency, SCF)评分,利用SQL查询的结构化特性提供更细粒度的正确性信号。通过多变量Platt缩放(Multivariate Platt Scaling, MPS)将SCF评分整合为一个整体且校准良好的置信度分数,从而在两个主流文本到SQL数据集上实现了校准效果和错误检测任务的进一步提升。
链接: https://arxiv.org/abs/2505.23804
作者: Terrance Liu,Shuyi Wang,Daniel Preotiuc-Pietro,Yash Chandarana,Chirag Gupta
机构: Bloomberg(彭博社)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:While large language models (LLMs) achieve strong performance on text-to-SQL parsing, they sometimes exhibit unexpected failures in which they are confidently incorrect. Building trustworthy text-to-SQL systems thus requires eliciting reliable uncertainty measures from the LLM. In this paper, we study the problem of providing a calibrated confidence score that conveys the likelihood of an output query being correct. Our work is the first to establish a benchmark for post-hoc calibration of LLM-based text-to-SQL parsing. In particular, we show that Platt scaling, a canonical method for calibration, provides substantial improvements over directly using raw model output probabilities as confidence scores. Furthermore, we propose a method for text-to-SQL calibration that leverages the structured nature of SQL queries to provide more granular signals of correctness, named “sub-clause frequency” (SCF) scores. Using multivariate Platt scaling (MPS), our extension of the canonical Platt scaling technique, we combine individual SCF scores into an overall accurate and calibrated score. Empirical evaluation on two popular text-to-SQL datasets shows that our approach of combining MPS and SCF yields further improvements in calibration and the related task of error detection over traditional Platt scaling.
zh
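经典 Platt 缩放相当于在单一得分上拟合一个逻辑回归;按摘要的思路,多变量 Platt 缩放(MPS)把多个子短语频率(SCF)得分作为特征一起拟合,输出即为校准后的置信度。下面用 scikit-learn 给出一个示意,特征构成与数据均为虚构。

```python
# 示意性草图:多变量Platt缩放(MPS)= 在 [原始logprob, 各子句SCF得分] 上
# 拟合逻辑回归,predict_proba 给出校准后的置信度。特征名称为假设。
import numpy as np
from sklearn.linear_model import LogisticRegression

# 训练集:每条SQL候选的 [模型原始logprob, SELECT子句SCF, WHERE子句SCF]
X = np.array([[-0.2, 0.9, 0.8], [-1.5, 0.4, 0.3],
              [-0.4, 0.95, 0.7], [-2.0, 0.2, 0.1]])
y = np.array([1, 0, 1, 0])  # 查询是否执行正确

mps = LogisticRegression().fit(X, y)          # 多变量Platt缩放
confidence = mps.predict_proba(X)[:, 1]       # 校准后的置信度
print(confidence.round(3))
```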
[NLP-192] MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在医学领域应用中评估不足的问题,即现有评估方法无法全面反映真实临床实践的复杂性和多样性。其解决方案的关键在于提出MedHELM,一个可扩展的评估框架,包含三个核心贡献:一是由29名临床医生验证的涵盖5个类别、22个子类和121项任务的分类体系;二是覆盖所有分类和子类的35个基准测试套件(17个现有基准和18个新构建的基准);三是采用改进的评估方法(LLM-jury)对LLMs进行系统比较及成本效益分析。该框架为LLMs在医疗任务中的性能评估提供了全面且实用的工具。
链接: https://arxiv.org/abs/2505.23802
作者: Suhana Bedi,Hejie Cui,Miguel Fuentes,Alyssa Unell,Michael Wornow,Juan M. Banda,Nikesh Kotecha,Timothy Keyes,Yifan Mai,Mert Oez,Hao Qiu,Shrey Jain,Leonardo Schettini,Mehr Kashyap,Jason Alan Fries,Akshay Swaminathan,Philip Chung,Fateme Nateghi,Asad Aali,Ashwin Nayak,Shivam Vedak,Sneha S. Jain,Birju Patel,Oluseyi Fayanju,Shreya Shah,Ethan Goh,Dong-han Yao,Brian Soetikno,Eduardo Reis,Sergios Gatidis,Vasu Divi,Robson Capasso,Rachna Saralkar,Chia-Chun Chiang,Jenelle Jindal,Tho Pham,Faraz Ghoddusi,Steven Lin,Albert S. Chiou,Christy Hong,Mohana Roy,Michael F. Gensheimer,Hinesh Patel,Kevin Schulman,Dev Dash,Danton Char,Lance Downing,Francois Grolleau,Kameron Black,Bethel Mieso,Aydin Zahedivash,Wen-wai Yim,Harshita Sharma,Tony Lee,Hannah Kirsch,Jennifer Lee,Nerissa Ambers,Carlene Lugtu,Aditya Sharma,Bilal Mawji,Alex Alekseyev,Vicky Zhou,Vikas Kakkar,Jarrod Helzer,Anurang Revri,Yair Bannett,Roxana Daneshjou,Jonathan Chen,Emily Alsentzer,Keith Morse,Nirmal Ravi,Nima Aghaeepour,Vanessa Kennedy,Akshay Chaudhari,Thomas Wang,Sanmi Koyejo,Matthew P. Lungren,Eric Horvitz,Percy Liang,Mike Pfeffer,Nigam H. Shah
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open-source framework to enable this.
zh
[NLP-193] SEMFED: Semantic-Aware Resource-Efficient Federated Learning for Heterogeneous NLP Tasks
【速读】: 该论文旨在解决联邦学习(Federated Learning, FL)在自然语言处理(Natural Language Processing, NLP)任务中面临的挑战,包括客户端间的语义异质性、词汇不匹配以及边缘设备上的资源限制。其解决方案的关键在于提出一种名为SEMFED的语义感知且资源高效的联邦学习框架,该框架通过三个核心创新实现优化:一是语义感知的客户端选择机制,平衡语义多样性与资源约束;二是针对设备能力定制的自适应NLP模型架构,以保留语义信息;三是通信高效的语义特征压缩技术,显著降低带宽需求。
链接: https://arxiv.org/abs/2505.23801
作者: Sajid Hussain,Muhammad Sohail,Nauman Ali Khan
机构: Military College of Signals (MCS); National University of Sciences and Technology (NUST)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages
Abstract:Background: Federated Learning (FL) has emerged as a promising paradigm for training machine learning models while preserving data privacy. However, applying FL to Natural Language Processing (NLP) tasks presents unique challenges due to semantic heterogeneity across clients, vocabulary mismatches, and varying resource constraints on edge devices. Objectives: This paper introduces SEMFED, a novel semantic-aware resource-efficient federated learning framework specifically designed for heterogeneous NLP tasks. Methods: SEMFED incorporates three key innovations: (1) a semantic-aware client selection mechanism that balances semantic diversity with resource constraints, (2) adaptive NLP-specific model architectures tailored to device capabilities while preserving semantic information, and (3) a communication-efficient semantic feature compression technique that significantly reduces bandwidth requirements. Results: Experimental results on various NLP classification tasks demonstrate that SEMFED achieves an 80.5% reduction in communication costs while maintaining model accuracy above 98%, outperforming state-of-the-art FL approaches. Conclusion: SEMFED effectively manages heterogeneous client environments with varying computational resources, network reliability, and semantic data distributions, making it particularly suitable for real-world federated NLP deployments.
zh
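语义感知的客户端选择可以示意为"语义多样性收益与资源充裕度的加权打分、取前 K 个"。以下草图中的打分函数为假设,非 SEMFED 官方实现。

```python
# 示意性草图(非官方实现):在语义多样性与资源约束之间加权的客户端选择。
import numpy as np

def select_clients(sem_vecs, resources, selected_mean, k=5, lam=0.5):
    """sem_vecs: 各客户端语义分布向量; resources: 归一化资源充裕度(越大越好);
    selected_mean: 已选客户端语义向量的均值。返回本轮入选的客户端下标。"""
    diversity = [np.linalg.norm(v - selected_mean) for v in sem_vecs]  # 与已选集合的差异
    scores = lam * np.asarray(diversity) + (1 - lam) * np.asarray(resources)
    return np.argsort(scores)[::-1][:k]
```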
[NLP-194] Estimating LLM Consistency: A User Baseline vs Surrogate Metrics
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成文本时容易产生幻觉以及对提示扰动敏感的问题,这些问题常导致生成内容不一致或不可靠。现有方法通过测量响应的一致性(模型对响应的置信度,或在重新采样时生成相似响应的可能性)来缓解上述问题,但这些方法在模拟人类对LLM一致性感知方面效果不佳。论文提出了一种基于logit的集成方法来估计LLM的一致性,并证明该方法在估计人类对LLM一致性评分方面表现与现有最佳指标相当。研究结果表明,当前无需人类评估的一致性估计方法仍存在显著不足,因此建议更广泛地采用人类输入进行评估。
链接: https://arxiv.org/abs/2505.23799
作者: Xiaoyuan Wu,Weiran Lin,Omer Akgul,Lujo Bauer
机构: Carnegie Mellon University (卡内基梅隆大学); RSAC Labs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility – one of them being measuring the consistency (the model’s confidence in the response, or likelihood of generating a similar response when resampled) of LLM responses. In previous work, measuring consistency often relied on the probability of a response appearing within a pool of resampled responses, or internal states or logits of responses. However, it is not yet clear how well these approaches approximate how humans perceive the consistency of LLM responses. We performed a user study (n=2,976) and found current methods typically do not approximate users’ perceptions of LLM consistency very well. We propose a logit-based ensemble method for estimating LLM consistency, and we show that this method matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods of estimating LLM consistency without human evaluation are sufficiently imperfect that we suggest evaluation with human input be more broadly used.
zh
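摘要中"基于 logit 的集成一致性估计"的一种简化读法是:把模型自身的(由 logits 得到的)平均 token 对数概率与重采样回答间的两两相似度加权结合。以下示意中的相似度来源与权重均为假设,非论文的原始方法。

```python
# 示意性草图:集成一致性 = 模型置信度(来自logits)与回答间一致程度的加权和。
import numpy as np

def consistency_score(avg_logprobs, pairwise_sim, w=0.5):
    """avg_logprobs: 每个重采样回答的平均token对数概率;
    pairwise_sim: 回答两两语义相似度矩阵(对角线为1)。"""
    confidence = float(np.mean(np.exp(avg_logprobs)))            # 模型自身置信度
    n = pairwise_sim.shape[0]
    agreement = float((pairwise_sim.sum() - n) / (n * (n - 1)))  # 去掉对角线的平均相似度
    return w * confidence + (1 - w) * agreement

sims = np.array([[1.0, 0.8, 0.7], [0.8, 1.0, 0.9], [0.7, 0.9, 1.0]])
print(consistency_score(np.array([-0.3, -0.25, -0.4]), sims))
```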
[NLP-195] My Answer Is NOT Fair: Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals
【速读】: 该论文试图解决大型视觉-语言模型(VLM)在生成响应中存在社会偏见的问题,特别是性别和种族相关的偏见,以及模型对特定社会群体的置信度校准不当的问题。解决方案的关键在于提出一种无需重新训练且与模型无关的后处理方法,在推理阶段通过抑制与偏见相关的残差并增强与公平性相关的残差,从而减轻社会偏见,提升模型响应的公平性和置信度可靠性。
链接: https://arxiv.org/abs/2505.23798
作者: Jian Lan,Yifei Fu,Udo Schlegel,Gengyuan Zhang,Tanveer Hannan,Haokun Chen,Thomas Seidl
机构: LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Social bias is a critical issue in large vision-language models (VLMs), where fairness- and ethics-related problems harm certain groups of people in society. It is unknown to what extent VLMs yield social bias in generative responses. In this study, we focus on evaluating and mitigating social bias on both the model’s response and probability distribution. To do so, we first evaluate four state-of-the-art VLMs on PAIRS and SocialCounterfactuals datasets with the multiple-choice selection task. Surprisingly, we find that models suffer from generating gender-biased or race-biased responses. We also observe that models are prone to stating their responses are fair, but indeed having mis-calibrated confidence levels towards particular social groups. While investigating why VLMs are unfair in this study, we observe that VLMs’ hidden layers exhibit substantial fluctuations in fairness levels. Meanwhile, residuals in each layer show mixed effects on fairness, with some contributing positively while some lead to increased bias. Based on these findings, we propose a post-hoc method for the inference stage to mitigate social bias, which is training-free and model-agnostic. We achieve this by ablating bias-associated residuals while amplifying fairness-associated residuals on model hidden layers during inference. We demonstrate that our post-hoc method outperforms the competing training strategies, helping VLMs have fairer responses and more reliable confidence levels.
zh
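论文的后处理思路(推理阶段削弱偏见相关残差、放大公平相关残差)可以用前向 hook 示意如下;其中偏见/公平方向向量如何估计(例如有偏与公平样本表示之差)在此作为外部输入假设,非官方实现。

```python
# 示意性草图(非论文官方实现):用前向hook在选定隐藏层上
# 削弱"偏见方向"的分量、放大"公平方向"的分量。
import torch

def make_debias_hook(bias_dir, fair_dir, alpha=1.0, beta=0.5):
    bias_dir = bias_dir / bias_dir.norm()
    fair_dir = fair_dir / fair_dir.norm()
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        proj_b = (h @ bias_dir).unsqueeze(-1) * bias_dir   # 偏见方向上的分量
        proj_f = (h @ fair_dir).unsqueeze(-1) * fair_dir   # 公平方向上的分量
        h = h - alpha * proj_b + beta * proj_f
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return hook

# 用法示意(假设 model.layers[k] 为目标层,b_dir/f_dir 已估计好):
# handle = model.layers[k].register_forward_hook(make_debias_hook(b_dir, f_dir))
```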
[NLP-196] Detection of Suicidal Risk on Social Media: A Hybrid Model
【速读】: 该论文试图解决如何有效检测自杀风险的问题,特别是在社交媒体平台上通过用户发布的文本内容进行早期识别。其解决方案的关键在于提出一种基于RoBERTa-TF-IDF-PCA的混合模型,该模型结合了深度上下文嵌入(RoBERTa)与统计词频加权(TF-IDF)特征,并通过主成分分析(PCA)进行降维,以提升自杀风险评估的准确性和可靠性。
链接: https://arxiv.org/abs/2505.23797
作者: Zaihan Yang,Ryan Leonard,Hien Tran,Rory Driscoll,Chadbourne Davis
机构: Suffolk University (萨福克大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:
Abstract:Suicidal thoughts and behaviors are increasingly recognized as a critical societal concern, highlighting the urgent need for effective tools to enable early detection of suicidal risk. In this work, we develop robust machine learning models that leverage Reddit posts to automatically classify them into four distinct levels of suicide risk severity. We frame this as a multi-class classification task and propose a RoBERTa-TF-IDF-PCA Hybrid model, integrating the deep contextual embeddings from Robustly Optimized BERT Approach (RoBERTa), a state-of-the-art deep learning transformer model, with the statistical term-weighting of TF-IDF, further compressed with PCA, to boost the accuracy and reliability of suicide risk assessment. To address data imbalance and overfitting, we explore various data resampling techniques and data augmentation strategies to enhance model generalization. Additionally, we compare our model's performance against that of using RoBERTa only, the BERT model and other traditional machine learning classifiers. Experimental results demonstrate that the hybrid model can achieve improved performance, giving a best weighted F1 score of 0.7512.
zh
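RoBERTa-TF-IDF-PCA 混合特征的组装方式可以示意为"深度句向量拼接 PCA 压缩后的 TF-IDF 特征",再接一个多类分类器。为保持示例自包含,下面用随机向量代替真实的 RoBERTa 编码,维度与分类器选择均为假设。

```python
# 示意性草图(非论文官方实现):混合特征 = RoBERTa句向量 + PCA压缩的TF-IDF,
# 交给多类分类器预测四级自杀风险。数据为虚构。
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def roberta_embed(texts):
    # 此处应调用RoBERTa编码器取句向量;为保持示例自包含,用随机向量代替。
    return np.random.default_rng(0).normal(size=(len(texts), 768))

texts = ["post one", "post two", "post three", "post four"]
labels = [0, 1, 2, 3]  # 四级风险

tfidf = TfidfVectorizer().fit_transform(texts).toarray()
tfidf_pca = PCA(n_components=3).fit_transform(tfidf)     # 压缩统计特征
features = np.hstack([roberta_embed(texts), tfidf_pca])  # 深度+统计特征拼接
clf = LogisticRegression(max_iter=1000).fit(features, labels)
```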
[NLP-197] Emergent LLM behaviors are observationally equivalent to data leakage
【速读】: 该论文试图解决的问题是:Ashery等人的研究声称大型语言模型(LLMs)在进行经典的“命名游戏”时会自发形成类似人类社会规范的语言惯例,而本文则质疑这一结论的可靠性。论文的关键解决方案在于指出其结果更可能由数据泄露(data leakage)引起,即模型只是复现了预训练阶段已接触过的惯例,而非真正生成新的“涌现”性语言规范。通过多种分析,作者证明LLMs识别并回忆了协调博弈的结构和结果,而非产生新的社会规范,因此观察到的行为本质上与对训练语料的机械记忆无异。
链接: https://arxiv.org/abs/2505.23796
作者: Christopher Barrie,Petter Törnberg
机构: New York University (纽约大学); Institute for Logic, Language, and Computation, University of Amsterdam (逻辑、语言和计算研究所,阿姆斯特丹大学)
类目: Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
备注:
Abstract:Ashery et al. recently argue that large language models (LLMs), when paired to play a classic “naming game,” spontaneously develop linguistic conventions reminiscent of human social norms. Here, we show that their results are better explained by data leakage: the models simply reproduce conventions they already encountered during pre-training. Despite the authors’ mitigation measures, we provide multiple analyses demonstrating that the LLMs recognize the structure of the coordination game and recall its outcomes, rather than exhibit “emergent” conventions. Consequently, the observed behaviors are indistinguishable from memorization of the training corpus. We conclude by pointing to potential alternative strategies and reflecting more generally on the place of LLMs for social science models.
zh
[NLP-198] R3-RAG : Learning Step-by-Step Reasoning and Retrieval for LLM s via Reinforcement Learning
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中密集检索器作为瓶颈的问题,即其参数量有限且无法进行逐步推理,同时传统基于提示的迭代RAG受限于人工设计的工作流程。解决方案的关键在于提出一种基于强化学习的R3-RAG框架,通过让大语言模型(Large Language Models, LLMs)逐步学习推理与检索,从而实现全面的外部知识检索并得到正确答案。该方法分为两个阶段:首先通过冷启动策略使模型学习迭代式推理与检索的模式,随后利用强化学习进一步提升模型在外部检索环境中的探索能力。
链接: https://arxiv.org/abs/2505.23794
作者: Yuan Li,Qi Luo,Xiaonan Li,Bufan Li,Qinyuan Cheng,Bo Wang,Yining Zheng,Yuxin Wang,Zhangyue Yin,Xipeng Qiu
机构: Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) integrates external knowledge with Large Language Models (LLMs) to enhance factual correctness and mitigate hallucination. However, dense retrievers often become the bottleneck of RAG systems due to their limited parameters compared to LLMs and their inability to perform step-by-step reasoning. While prompt-based iterative RAG attempts to address these limitations, it is constrained by human-designed workflows. To address these limitations, we propose R3-RAG, which uses Reinforcement learning to make the LLM learn how to Reason and Retrieve step by step, thus retrieving comprehensive external knowledge and leading to correct answers. R3-RAG is divided into two stages. We first use cold start to make the model learn the manner of iteratively interleaving reasoning and retrieval. Then we use reinforcement learning to further harness its ability to better explore the external retrieval environment. Specifically, we propose two rewards for R3-RAG: 1) answer correctness for outcome reward, which judges whether the trajectory leads to a correct answer; 2) relevance-based document verification for process reward, encouraging the model to retrieve documents that are relevant to the user question, through which we can let the model learn how to iteratively reason and retrieve relevant documents to get the correct answer. Experimental results show that R3-RAG significantly outperforms baselines and can transfer well to different retrievers. We release R3-RAG at this https URL.
zh
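摘要给出的两类奖励(结果奖励 + 过程奖励)可以概括为如下简化形式;relevance 对应论文中基于相关性的文档验证模型,这里作为假设接口,权重 w 亦为假设。

```python
# 示意性草图:R3-RAG式轨迹奖励 = 结果奖励(答案是否正确)
# 与过程奖励(各步检索文档的相关性均值)的加权和。
def trajectory_reward(question, steps, final_answer, gold_answer, relevance, w=0.5):
    """steps 形如 [(该步的推理文本, 该步检索到的文档), ...]。"""
    outcome = 1.0 if final_answer.strip() == gold_answer.strip() else 0.0
    if steps:
        process = sum(relevance(question, doc) for _, doc in steps) / len(steps)
    else:
        process = 0.0
    return w * outcome + (1 - w) * process
```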
[NLP-199] Rethinking the Understanding Ability across LLM s through Mutual Information
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)内在语言理解能力的评估问题,其核心挑战在于如何有效衡量模型对输入信息的保留与表征能力。解决方案的关键在于提出一种基于互信息(Mutual Information, MI)的信息论框架,将语言理解建模为输入句子与其潜在表示之间的互信息,并通过分解句子级MI为词元级MI,建立理论边界。进一步地,利用Fano不等式推导出可计算的词元级MI下界,从而直接关联到词元可恢复性,即从句子嵌入中预测原始词元的能力。该方法为评估和提升LLM的语言理解能力提供了新的理论基础与实践路径。
链接: https://arxiv.org/abs/2505.23790
作者: Shaojie Wang,Sirui Ding,Na Zou
机构: University of Houston (休斯顿大学); UCSF (加利福尼亚大学旧金山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large language models (LLMs) have revolutionized natural language processing, yet evaluating their intrinsic linguistic understanding remains challenging. Moving beyond specialized evaluation tasks, we propose an information-theoretic framework grounded in mutual information (MI) to achieve this. We formalize the understanding as MI between an input sentence and its latent representation (sentence-level MI), measuring how effectively input information is preserved in latent representation. Given that LLMs learn embeddings for individual tokens, we decompose sentence-level MI into token-level MI between tokens and sentence embeddings, establishing theoretical bounds connecting these measures. Based on this foundation, we theoretically derive a computable lower bound for token-level MI using Fano's inequality, which directly relates to token-level recoverability - the ability to predict original tokens from sentence embedding. We implement this recoverability task to comparatively measure MI across different LLMs, revealing that encoder-only models consistently maintain higher information fidelity than their decoder-only counterparts, with the latter exhibiting a distinctive late-layer "forgetting" pattern where mutual information is first enhanced and then discarded. Moreover, fine-tuning to maximize token-level recoverability consistently improves understanding ability of LLMs on tasks without task-specific supervision, demonstrating that mutual information can serve as a foundation for understanding and improving language model capabilities.
zh
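摘要中"由 Fano 不等式导出词元级互信息的可计算下界"可按标准记号重构如下(记号为整理时的假设):设 X 为词元、Z 为句子嵌入、P_e 为从 Z 预测 X 的错误率、|V| 为词表大小、H_b 为二元熵,则

```latex
H(X \mid Z) \le H_b(P_e) + P_e \log\bigl(|\mathcal{V}| - 1\bigr)
\quad\Longrightarrow\quad
I(X; Z) = H(X) - H(X \mid Z) \ge H(X) - H_b(P_e) - P_e \log\bigl(|\mathcal{V}| - 1\bigr)
```

错误率越低(词元可恢复性越强),互信息下界越高,这正是摘要把可恢复性作为理解能力代理指标的依据。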
[NLP-200] Conversational Exploration of Literature Landscape with LitChat
【速读】: 该论文试图解决在“大文献”时代下,传统手动文献综述方法无法应对科学出版物数量指数级增长所带来的文献景观理解难题。现有大型语言模型(LLMs)虽具备一定的文献理解能力,但因上下文窗口限制和幻觉等信任问题,无法提供系统综述所需的“全面、客观、开放和透明”的视角。论文提出的解决方案是LitChat,其关键在于构建一个端到端、交互式且对话式的文献代理,通过数据驱动的发现工具增强LLM代理的能力,从而实现用户查询的自动解析、相关文献的检索、知识图谱的构建以及基于数据挖掘技术生成证据支持的洞察。
链接: https://arxiv.org/abs/2505.23789
作者: Mingyu Huang,Shasha Zhou,Yuxuan Chen,Ke Li
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:We are living in an era of "big literature", where the volume of digital scientific publications is growing exponentially. While offering new opportunities, this also poses challenges for understanding literature landscapes, as traditional manual reviewing is no longer feasible. Recent large language models (LLMs) have shown strong capabilities for literature comprehension, yet they are incapable of offering "comprehensive, objective, open and transparent" views desired by systematic reviews due to their limited context windows and trust issues like hallucinations. Here we present LitChat, an end-to-end, interactive and conversational literature agent that augments LLM agents with data-driven discovery tools to facilitate literature exploration. LitChat automatically interprets user queries, retrieves relevant sources, constructs knowledge graphs, and employs diverse data-mining techniques to generate evidence-based insights addressing user needs. We illustrate the effectiveness of LitChat via a case study on AI4Health, highlighting its capacity to quickly navigate users through a large-scale literature landscape with data-based evidence that would otherwise be infeasible to obtain with traditional means.
zh
[NLP-201] Nine Ways to Break Copyright Law and Why Our LLM Wont: A Fair Use Aligned Generation Framework
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成内容时可能引发的版权侵权问题,即模型可能直接复制受保护内容或进行不足的改造性修改,从而带来伦理、法律和实际应用上的风险。其解决方案的关键在于开发了FUA-LLM(Fair Use Aligned Language Models)框架,该框架基于合理使用原则设计,通过构建包含18,000个专家验证示例的FairUseDB数据集,并采用直接偏好优化(Direct Preference Optimization, DPO)方法对开源LLMs进行微调,以生成既符合法律要求又具备实用性的替代输出,而非依赖传统的拒绝式过滤机制。
链接: https://arxiv.org/abs/2505.23788
作者: Aakash Sen Sharma,Debdeep Sanyal,Priyansh Srivastava,Sundar Atreya H.,Shirish Karande,Mohan Kankanhalli,Murari Mandal
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 30 Pages
Abstract:Large language models (LLMs) commonly risk copyright infringement by reproducing protected content verbatim or with insufficient transformative modifications, posing significant ethical, legal, and practical concerns. Current inference-time safeguards predominantly rely on restrictive refusal-based filters, often compromising the practical utility of these models. To address this, we collaborated closely with intellectual property experts to develop FUA-LLM (Fair Use Aligned Language Models), a legally-grounded framework explicitly designed to align LLM outputs with fair-use doctrine. Central to our method is FairUseDB, a carefully constructed dataset containing 18,000 expert-validated examples covering nine realistic infringement scenarios. Leveraging this dataset, we apply Direct Preference Optimization (DPO) to fine-tune open-source LLMs, encouraging them to produce legally compliant and practically useful alternatives rather than resorting to blunt refusal. Recognizing the shortcomings of traditional evaluation metrics, we propose new measures: Weighted Penalty Utility and Compliance Aware Harmonic Mean (CAH) to balance infringement risk against response utility. Extensive quantitative experiments coupled with expert evaluations confirm that FUA-LLM substantially reduces problematic outputs (up to 20%) compared to state-of-the-art approaches, while preserving real-world usability.
zh
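摘要提到用直接偏好优化(DPO)微调模型;其标准损失形式如下,其中 y_w 为更符合合理使用原则的回答、y_l 为侵权风险更高的回答(这一配对含义是按论文设定作出的合理推断):

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
-\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```

其中 π_ref 为冻结的参考模型,β 控制偏离参考模型的强度。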
[NLP-202] Meaning Is Not A Metric: Using LLM s to make cultural context legible at scale
【速读】: 该论文试图解决在基于人工智能的社会技术系统中,如何有效表征人类意义的问题。传统系统依赖于“薄描述”(thin description)的数值化表示,这种表示方式虽然标准化但缺乏文化语境,导致无法准确反映人类行为的意义。论文提出的解决方案关键在于利用“厚描述”(thick description)的框架,通过语义化的表达方式保留情境信息,从而更全面地表征人类意义。而生成式 AI(Generative AI)的语义能力为自动化生成和处理厚描述提供了可能,有助于克服传统方法在规模应用上的瓶颈。
链接: https://arxiv.org/abs/2505.23785
作者: Cody Kommers,Drew Hemment,Maria Antoniak,Joel Z. Leibo,Hoyt Long,Emily Robinson,Adam Sobey
机构: The Alan Turing Institute (阿尔伯特·图灵研究所); University of Edinburgh (爱丁堡大学); University of Copenhagen (哥本哈根大学); Pioneer Centre for AI (人工智能先锋中心); Google DeepMind (谷歌深度思维); University of Chicago (芝加哥大学); University of Exeter (埃克塞特大学); University of Southampton (南安普顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Position paper
Abstract:This position paper argues that large language models (LLMs) can make cultural context, and therefore human meaning, legible at an unprecedented scale in AI-based sociotechnical systems. We argue that such systems have previously been unable to represent human meaning because they rely on thin descriptions: numerical representations that enforce standardization and therefore strip human activity of the cultural context that gives it meaning. By contrast, scholars in the humanities and qualitative social sciences have developed frameworks for representing meaning through thick description: verbal representations that accommodate heterogeneity and retain contextual information needed to represent human meaning. While these methods can effectively codify meaning, they are difficult to deploy at scale. However, the verbal capabilities of LLMs now provide a means of (at least partially) automating the generation and processing of thick descriptions, potentially overcoming this bottleneck. We argue that the problem of rendering human meaning legible is not just about selecting better metrics, but about developing new representational formats (based on thick description). We frame this as a crucial direction for the application of generative AI and identify five key challenges: preserving context, maintaining interpretive pluralism, integrating perspectives based on lived experience and critical distance, distinguishing qualitative content from quantitative magnitude, and acknowledging meaning as dynamic rather than static. Furthermore, we suggest that thick description has the potential to serve as a unifying framework to address a number of emerging concerns about the difficulties of representing culture in (or using) LLMs.
zh
[NLP-203] Automatic classification of stop realisation with wav2vec2.0 INTERSPEECH2025
【速读】: 该论文试图解决语音数据中许多可变语音现象的自动标注问题,目前缺乏有效的工具。其解决方案的关键在于利用预训练的自监督模型,如wav2vec2.0,在英语和日语中实现对爆破音爆发存在的高精度自动分类,该模型在精细标注和未准备的语音语料库中均表现出鲁棒性。
链接: https://arxiv.org/abs/2505.23688
作者: James Tanner,Morgan Sonderegger,Jane Stuart-Smith,Jeff Mielke,Tyler Kendall
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted for Interspeech 2025. 5 pages, 3 figures
Abstract:Modern phonetic research regularly makes use of automatic tools for the annotation of speech data, however few tools exist for the annotation of many variable phonetic phenomena. At the same time, pre-trained self-supervised models, such as wav2vec2.0, have been shown to perform well at speech classification tasks and latently encode fine-grained phonetic information. We demonstrate that wav2vec2.0 models can be trained to automatically classify stop burst presence with high accuracy in both English and Japanese, robust across both finely-curated and unprepared speech corpora. Patterns of variability in stop realisation are replicated with the automatic annotations, and closely follow those of manual annotations. These results demonstrate the potential of pre-trained speech models as tools for the automatic annotation and processing of speech corpus data, enabling researchers to ‘scale-up’ the scope of phonetic research with relative ease.
zh
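用预训练 wav2vec2.0 外接音频帧分类头,是 HuggingFace Transformers 中的标准用法,可如下示意(以通用的 facebook/wav2vec2-base 权重加随机初始化分类头为例,并非论文发布的模型;音频与标签均为虚构):

```python
# 示意性草图:wav2vec2.0 + 音频帧分类头,逐帧判断爆破音爆发的有无。
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForAudioFrameClassification

name = "facebook/wav2vec2-base"
extractor = AutoFeatureExtractor.from_pretrained(name)
model = Wav2Vec2ForAudioFrameClassification.from_pretrained(
    name, num_labels=2)  # 每帧二分类:有/无爆发(分类头随机初始化,需微调)

audio = torch.randn(16000).numpy()  # 1秒 16kHz 音频(随机示例)
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
logits = model(**inputs).logits     # 形状 (1, 帧数, 2),随后逐帧取argmax
```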
[NLP-204] Few-Shot Optimization for Sensor Data Using Large Language Models : A Case Study on Fatigue Detection
【速读】: 该论文旨在解决传感器基础分类任务中示例选择的质量问题,特别是在少量样本(few-shot)条件下,如何提升模型的性能。其解决方案的关键在于提出一种基于HED-LM(Hybrid Euclidean Distance with Large Language Models)的混合选择流程,该流程通过欧几里得距离过滤候选示例,并利用大语言模型(Large Language Models)评估上下文相关性进行重新排序,从而提升示例选择的准确性与鲁棒性。
链接: https://arxiv.org/abs/2505.18754
作者: Elsen Ronando,Sozo Inoue
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 43 pages, 18 figures. Accepted for publication in MDPI Sensors (2025). Final version before journal publication
Abstract:In this paper, we propose a novel few-shot optimization with HED-LM (Hybrid Euclidean Distance with Large Language Models) to improve example selection for sensor-based classification tasks. While few-shot prompting enables efficient inference with limited labeled data, its performance largely depends on the quality of selected examples. HED-LM addresses this challenge through a hybrid selection pipeline that filters candidate examples based on Euclidean distance and re-ranks them using contextual relevance scored by large language models (LLMs). To validate its effectiveness, we apply HED-LM to a fatigue detection task using accelerometer data characterized by overlapping patterns and high inter-subject variability. Unlike simpler tasks such as activity recognition, fatigue detection demands more nuanced example selection due to subtle differences in physiological signals. Our experiments show that HED-LM achieves a mean macro F1-score of 69.13 ± 10.71%, outperforming both random selection (59.30 ± 10.13%) and distance-only filtering (67.61 ± 11.39%). These represent relative improvements of 16.6% and 2.3%, respectively. The results confirm that combining numerical similarity with contextual relevance improves the robustness of few-shot prompting. Overall, HED-LM offers a practical solution to improve performance in real-world sensor-based learning tasks and shows potential for broader applications in healthcare monitoring, human activity recognition, and industrial safety scenarios.
zh
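HED-LM 的两阶段选择流程可示意如下:先按欧氏距离筛出候选,再由 LLM 按上下文相关性打分重排。score_fn 为假设的 LLM 打分接口,K 值为假设。

```python
# 示意性草图(非官方实现):HED-LM式两阶段示例选择。
import numpy as np

def hed_lm_select(query_vec, candidates, k_filter=20, k_final=5, score_fn=None):
    """candidates: [(特征向量, 示例文本), ...];返回最终入选的few-shot示例。"""
    dists = [np.linalg.norm(query_vec - v) for v, _ in candidates]
    shortlist = [candidates[i] for i in np.argsort(dists)[:k_filter]]  # 阶段1:距离过滤
    scored = [(score_fn(text), text) for _, text in shortlist]          # 阶段2:LLM相关性
    return [text for _, text in sorted(scored, reverse=True)[:k_final]]
```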
[NLP-205] “Dyadosyncrasy” Idiosyncrasy and Demographic Factors in Turn-Taking INTERSPEECH2025
【速读】: 该论文试图解决对话中轮流说话(turn-taking)的个体差异及其影响因素问题,重点探讨人口统计学特征(如性别、年龄、教育程度)和个体因素如何塑造轮流说话的模式。研究的关键在于通过分析话轮转换偏移量(Transition Floor Offset, TFO),揭示个体间的差异主要由强烈的"个体特异"(idiosyncratic)成分和更强的"双人组特异"(dyadosyncratic)成分驱动,即对话双方之间的互动关系比人口统计学因素更能决定TFO的特征。
链接: https://arxiv.org/abs/2505.24736
作者: Julio Cesar Cavalcanti,Gabriel Skantze
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted to Interspeech 2025
Abstract:Turn-taking in dialogue follows universal constraints but also varies significantly. This study examines how demographic (sex, age, education) and individual factors shape turn-taking using a large dataset of US English conversations (Fisher). We analyze Transition Floor Offset (TFO) and find notable interspeaker variation. Sex and age have small but significant effects - female speakers and older individuals exhibit slightly shorter offsets - while education shows no effect. Lighter topics correlate with shorter TFOs. However, individual differences have a greater impact, driven by a strong idiosyncratic and an even stronger "dyadosyncratic" component - speakers in a dyad resemble each other more than they resemble themselves in different dyads. This suggests that the dyadic relationship and joint activity are the strongest determinants of TFO, outweighing demographic influences.
zh
[NLP-206] Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models INTERSPEECH2025
【速读】: 该论文试图解决语音中重音识别(primary stress identification)的自动化问题,这一问题在语言学和语音处理中具有重要意义。传统方法主要依赖于手工设计的声学特征,并且多基于英语数据集。本文的关键解决方案是通过微调预训练的Transformer模型,并引入音频帧分类头,以实现更高效的重音识别。实验表明,该方法在克罗地亚语和塞尔维亚语中表现接近完美,而在方言和较远语言如查卡维安语和斯洛文尼亚语中性能有所下降,但仅需少量多音节训练词即可获得良好的效果。
链接: https://arxiv.org/abs/2505.24571
作者: Nikola Ljubešić,Ivan Porupski,Peter Rupnik
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to InterSpeech2025
Abstract:Automating primary stress identification has been an active research field due to the role of stress in encoding meaning and aiding speech comprehension. Previous studies relied mainly on traditional acoustic features and English datasets. In this paper, we investigate the approach of fine-tuning a pre-trained transformer model with an audio frame classification head. Our experiments use a new Croatian training dataset, with test sets in Croatian, Serbian, the Chakavian dialect, and Slovenian. By comparing an SVM classifier using traditional acoustic features with the fine-tuned speech transformer, we demonstrate the transformer’s superiority across the board, achieving near-perfect results for Croatian and Serbian, with a 10-point performance drop for the more distant Chakavian and Slovenian. Finally, we show that only a few hundred multi-syllabic training words suffice for strong performance. We release our datasets and model under permissive licenses.
zh
[NLP-207] Boosting In-Context Learning in LLM s Through the Lens of Classical Supervised Learning
【速读】: 该论文试图解决在上下文学习(In-Context Learning, ICL)中大型语言模型(Large Language Models, LLMs)因系统性偏差导致的预测不稳定问题。现有校准技术在对数空间中仅通过移动决策边界来缓解偏差,但无法改变其方向,因此在偏差严重时效果有限。该论文提出的解决方案是监督校准(Supervised Calibration, SC),其关键在于基于损失最小化框架,在对数空间中学习LLM预测概率的最优、逐类仿射变换,无需额外数据。SC通过引入更丰富的函数类,不仅涵盖了多种现有校准方法,还具备调整甚至反转决策边界方向的能力,同时结合了上下文不变性和方向信任区域两种正则化技术,提升了校准效果与稳定性。
链接: https://arxiv.org/abs/2505.23783
作者: Korel Gundem,Juncheng Dong,Dennis Zhang,Vahid Tarokh,Zhengling Qi
机构: The George Washington University (乔治·华盛顿大学); Duke University (杜克大学); Olin School of Business, Washington University in St. Louis (奥林商学院,圣路易斯华盛顿大学)
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:In-Context Learning (ICL) allows Large Language Models (LLMs) to adapt to new tasks with just a few examples, but their predictions often suffer from systematic biases, leading to unstable performances in classification. While calibration techniques are proposed to mitigate these biases, we show that, in the logit space, many of these methods are equivalent to merely shifting the LLM’s decision boundary without having the ability to alter its orientation. This proves inadequate when biases cause the LLM to be severely misdirected. To address these limitations and provide a unifying framework, we propose Supervised Calibration (SC), a loss-minimization based framework which learns an optimal, per-class affine transformation of the LLM’s predictive probabilities in the logit space without requiring external data beyond the context. By using a more expressive functional class, SC not only subsumes many existing calibration methods in ICL as special cases, but also enables the ability to alter and even completely reverse the orientation of the LLM’s decision boundary. Furthermore, SC’s loss-based nature facilitates the seamless integration of two purpose-built regularization techniques: context-invariance and directional trust-region. The former is designed to tackle the instability issue in ICL, while the latter controls the degree of calibration. Finally, SC delivers state-of-the-art performance over calibration baselines in the 4-shot, 8-shot, and 16-shot settings across all nine datasets for Mistral-7B-Instruct-v0.3, LLaMA-2-7B-chat, and Qwen2-7B-Instruct.
zh
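SC 的核心对象是 logit 空间上的逐类仿射变换 z' = Wz + b:当 W 不是恒等(或对角)矩阵时,它不仅能平移决策边界,还能旋转乃至反转边界方向。下面给出一个最小示意(以交叉熵为示例损失;论文中的上下文不变性与方向信任域两种正则项此处省略,数据为虚构):

```python
# 示意性草图(非官方实现):在少量带标签示例上学习logit空间的仿射校准。
import torch
import torch.nn as nn

class AffineCalibrator(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.W = nn.Parameter(torch.eye(num_classes))   # 初始化为恒等变换
        self.b = nn.Parameter(torch.zeros(num_classes))

    def forward(self, logits):
        return logits @ self.W.T + self.b

calib = AffineCalibrator(num_classes=3)
opt = torch.optim.Adam(calib.parameters(), lr=1e-2)
logits = torch.randn(8, 3)                 # LLM原始logits(示例数据)
labels = torch.randint(0, 3, (8,))
for _ in range(100):                       # 损失最小化地拟合 W, b
    opt.zero_grad()
    loss = nn.functional.cross_entropy(calib(logits), labels)
    loss.backward()
    opt.step()
```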
计算机视觉
[CV-0] AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion
【速读】:该论文旨在解决现有图像到3D虚拟人像生成方法难以生成适用于现实应用场景的高细节、可动画化的3D虚拟人像的问题。其解决方案的关键在于提出AdaHuman框架,该框架包含两个核心创新:一是基于姿态条件的3D关节扩散模型,能够在每个扩散步骤中合成任意姿态下的一致多视角图像及对应的3D Gaussian Splats (3DGS)重建;二是基于3DGS的组合细化模块,通过图像到图像的细化增强局部身体部位的细节,并利用新型的裁剪感知相机光线图实现无缝整合,从而生成连贯且高细节的3D虚拟人像。
链接: https://arxiv.org/abs/2505.24877
作者: Yangyi Huang,Ye Yuan,Xueting Li,Jan Kautz,Umar Iqbal
机构: NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Website: this https URL
Abstract:Existing methods for image-to-3D avatar generation struggle to produce highly detailed, animation-ready avatars suitable for real-world applications. We introduce AdaHuman, a novel framework that generates high-fidelity animatable 3D avatars from a single in-the-wild image. AdaHuman incorporates two key innovations: (1) A pose-conditioned 3D joint diffusion model that synthesizes consistent multi-view images in arbitrary poses alongside corresponding 3D Gaussian Splats (3DGS) reconstruction at each diffusion step; (2) A compositional 3DGS refinement module that enhances the details of local body parts through image-to-image refinement and seamlessly integrates them using a novel crop-aware camera ray map, producing a cohesive detailed 3D avatar. These components allow AdaHuman to generate highly realistic standardized A-pose avatars with minimal self-occlusion, enabling rigging and animation with any input motion. Extensive evaluation on public benchmarks and in-the-wild images demonstrates that AdaHuman significantly outperforms state-of-the-art methods in both avatar reconstruction and reposing. Code and models will be publicly available for research purposes.
zh
[CV-1] MiniMax-Remover: Taming Bad Noise Helps Video Object Removal
【速读】:该论文旨在解决视频对象移除(video object removal)任务中的挑战,包括幻觉对象、视觉伪影以及现有方法依赖计算成本高的采样过程和无分类器指导(classifier-free guidance, CFG)导致的推理速度慢的问题。其解决方案的关键在于提出一种两阶段的视频对象移除方法——MiniMax-Remover,第一阶段通过移除文本输入和交叉注意力层简化预训练视频生成模型,以获得更轻量高效的架构;第二阶段则在由第一阶段模型生成、经人工筛选的高质量视频上,采用最小最大(minimax)优化策略进行蒸馏,其中内层最大化寻找导致移除失败的对抗性输入噪声(“坏噪声”),外层最小化训练模型在此类困难条件下仍能生成高质量结果,从而同时提升编辑质量与推理速度。
链接: https://arxiv.org/abs/2505.24873
作者: Bojia Zi,Weixuan Peng,Xianbiao Qi,Jianan Wang,Shihao Zhao,Rong Xiao,Kam-Fai Wong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in video diffusion models have driven rapid progress in video editing techniques. However, video object removal, a critical subtask of video editing, remains challenging due to issues such as hallucinated objects and visual artifacts. Furthermore, existing methods often rely on computationally expensive sampling procedures and classifier-free guidance (CFG), resulting in slow inference. To address these limitations, we propose MiniMax-Remover, a novel two-stage video object removal approach. Motivated by the observation that a text condition is not well suited for this task, we simplify the pretrained video generation model by removing textual input and cross-attention layers, resulting in a more lightweight and efficient model architecture in the first stage. In the second stage, we distill our remover on successful videos produced by the stage-1 model and curated by human annotators, using a minimax optimization strategy to further improve editing quality and inference speed. Specifically, the inner maximization identifies adversarial input noise (“bad noise”) that leads to failed removals, while the outer minimization step trains the model to generate high-quality removal results even under such challenging conditions. As a result, our method achieves state-of-the-art video object removal results with as few as 6 sampling steps and does not rely on CFG, significantly improving inference efficiency. Extensive experiments demonstrate the effectiveness and superiority of MiniMax-Remover compared to existing methods. Codes and Videos are available at: this https URL.
zh
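【示例】:以下用一个极简的PyTorch循环示意第二阶段的minimax优化结构(模型、损失与超参数均为占位假设,并非论文实现):内层对输入噪声做梯度上升以找到"坏噪声",外层在该噪声下更新模型参数。

```python
# 极简示意:内层最大化找"坏噪声"、外层最小化训练模型的minimax训练步(假设性简化)。
import torch

def minimax_step(model, x, target, opt, inner_steps=3, eps=0.1, inner_lr=0.05):
    # 内层最大化:对初始噪声做符号梯度上升,搜索使移除失败的"坏噪声"
    noise = (eps * torch.randn_like(x)).requires_grad_(True)
    for _ in range(inner_steps):
        loss = torch.nn.functional.mse_loss(model(x + noise), target)
        grad, = torch.autograd.grad(loss, noise)
        noise = (noise + inner_lr * grad.sign()).clamp(-eps, eps)
        noise = noise.detach().requires_grad_(True)
    # 外层最小化:在坏噪声条件下更新模型,使其仍能输出高质量结果
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x + noise.detach()), target)
    loss.backward()
    opt.step()
    return loss.item()

model = torch.nn.Conv2d(3, 3, 3, padding=1)           # 占位"移除器"
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x, target = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
print(minimax_step(model, x, target, opt))
```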
[CV-2] GenSpace: Benchmarking Spatially-Aware Image Generation
【速读】:该论文试图解决当前图像生成模型在文本或图像提示下缺乏类似人类的三维空间感知能力的问题,特别是其在物体位置、关系和度量等具体三维细节上的表现不足。解决方案的关键在于提出一个专门的评估流程和指标——GenSpace,该方法通过利用多个视觉基础模型重建三维场景几何结构,从而提供更准确且符合人类直觉的空间忠实性度量,以全面评估现有图像生成模型的空间意识。
链接: https://arxiv.org/abs/2505.24870
作者: Zehan Wang,Jiayang Xu,Ziang Zhang,Tianyu Pan,Chao Du,Hengshuang Zhao,Zhou Zhao
机构: Zhejiang University (浙江大学); Sea AI Lab (Sea人工智能实验室); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Humans can intuitively compose and arrange scenes in the 3D space for photography. However, can advanced AI image generators plan scenes with similar 3D spatial awareness when creating images from text or image prompts? We present GenSpace, a novel benchmark and evaluation pipeline to comprehensively assess the spatial awareness of current image generation models. Furthermore, standard evaluations using general Vision-Language Models (VLMs) frequently fail to capture the detailed spatial errors. To handle this challenge, we propose a specialized evaluation pipeline and metric, which reconstructs 3D scene geometry using multiple visual foundation models and provides a more accurate and human-aligned metric of spatial faithfulness. Our findings show that while AI models create visually appealing images and can follow general instructions, they struggle with specific 3D details like object placement, relationships, and measurements. We summarize three core limitations in the spatial perception of current state-of-the-art image generation models: 1) Object Perspective Understanding, 2) Egocentric-Allocentric Transformation and 3) Metric Measurement Adherence, highlighting possible directions for improving spatial intelligence in image generation.
zh
[CV-3] SiLVR: A Simple Language-based Video Reasoning Framework
【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂视频-语言任务中的推理能力显著落后的问题。其解决方案的关键在于提出SiLVR框架,该框架通过将复杂的视频理解分解为两个阶段:第一阶段利用多感官输入(如短片段字幕和音频/语音字幕)将原始视频转换为基于语言的表示;第二阶段将语言描述输入强大的推理型语言模型(Reasoning Large Language Models, LLMs)以完成复杂的视频-语言理解任务。此外,为处理长上下文的多感官输入,SiLVR采用自适应标记缩减方案,动态确定采样时间粒度。
链接: https://arxiv.org/abs/2505.24869
作者: Ce Zhang,Yan-Bo Lin,Ziyang Wang,Mohit Bansal,Gedas Bertasius
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an adaptive token reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. Furthermore, our empirical study focused on video reasoning capabilities shows that, despite not being explicitly trained on video, strong reasoning LLMs can effectively aggregate multisensory input information from video, speech, and audio for complex temporal, causal, long-context, and knowledge acquisition reasoning tasks in video. Code is available at this https URL.
zh
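【示例】:下面是自适应token缩减思路的一个极简示意(分词方式与预算数值均为本文假设,仅说明"超出预算则降低时间采样粒度"的逻辑):

```python
# 极简示意:根据token预算动态决定字幕的时间采样步长(假设性简化)。
def adaptive_token_reduction(clip_captions, budget=4096):
    """clip_captions: 按时间顺序排列的短片段字幕;返回满足token预算的子采样结果。"""
    def total_tokens(caps):
        return sum(len(c.split()) for c in caps)   # 以空格分词近似token计数
    stride = 1
    # 超出预算则加大时间步长,即降低时间采样粒度
    while total_tokens(clip_captions[::stride]) > budget and stride < len(clip_captions):
        stride += 1
    return clip_captions[::stride]

captions = [f"clip {i}: a person does something" for i in range(1000)]
print(len(adaptive_token_reduction(captions, budget=500)))
```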
[CV-4] Time Blindness: Why Video-Language Models Can't See What Humans Can?
【速读】:该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在处理仅包含时间序列信息的视频数据时表现不佳的问题,特别是在空间信息被遮蔽的情况下,模型难以捕捉纯时间模式。解决方案的关键在于引入了一个名为SpookyBench的基准测试集,其中信息仅通过类噪声(noise-like)帧的时间序列编码,以模拟从生物信号到隐蔽通信等自然现象。该基准揭示了当前最先进的VLMs在完全依赖时间线索的任务中表现极差,而人类在相同任务中准确率超过98%,突显出模型对帧级空间特征的过度依赖以及无法有效提取时间线索的问题。为克服这一限制,需要设计新的架构或训练范式,以实现空间依赖与时间处理的解耦。
链接: https://arxiv.org/abs/2505.24867
作者: Ujjwal Upadhyay,Mukul Ranjan,Zhiqiang Shen,Mohamed Elhoseiny
机构: King Abdullah University of Science and Technology (KAUST); Mohamed bin Zayed University of AI (MBZUAI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page at this https URL
Abstract:Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. The dataset and code have been made available on our project website: this https URL.
zh
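【示例】:下面的NumPy片段示意如何构造一段"仅在时间维度携带信息"的类噪声视频(参数与编码方式均为本文假设,并非官方数据生成代码):单帧看是纯噪声,但与时间载波做协方差即可显影出隐藏形状。

```python
# 极简示意:形状内像素随共享时间载波同步翻转,形状外像素逐帧独立随机。
import numpy as np

T, H, W = 32, 64, 64
yy, xx = np.mgrid[0:H, 0:W]
mask = (yy - H / 2) ** 2 + (xx - W / 2) ** 2 < 15 ** 2   # 圆形"隐藏目标"区域

carrier = (np.random.rand(T) > 0.5).astype(np.uint8)     # 共享的时间翻转序列
video = (np.random.rand(T, H, W) > 0.5).astype(np.uint8) # 区域外:独立随机噪声
base = (np.random.rand(H, W) > 0.5).astype(np.uint8)
for t in range(T):
    video[t][mask] = base[mask] ^ carrier[t]             # 区域内:随载波同步翻转

# 任意单帧都是纯噪声;但与时间载波求协方差即可"显影"出圆形
seq = video.reshape(T, -1).astype(float)
cov = ((seq - seq.mean(0)) * (carrier - carrier.mean())[:, None]).mean(0).reshape(H, W)
print("区域内|协方差|均值: %.3f, 区域外: %.3f"
      % (np.abs(cov[mask]).mean(), np.abs(cov[~mask]).mean()))
```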
[CV-5] TalkingHeadBench: A Multi-Modal Benchmark Analysis of Talking-Head DeepFake Detection
【速读】:该论文旨在解决当前说话人头像深度伪造(talking-head deepfake)检测基准无法反映先进生成模型进展的问题,现有基准依赖过时的生成器,难以评估模型的鲁棒性和泛化能力。其解决方案的关键在于提出TalkingHeadBench,这是一个全面的多模型多生成器基准和精心构建的数据集,用于评估最先进检测器在最新生成器上的性能,并通过精心设计的协议评估身份和生成器特征分布变化下的泛化能力。
链接: https://arxiv.org/abs/2505.24866
作者: Xinqi Xiong,Prakrut Patel,Qingyuan Fan,Amisha Wadhwa,Sarathy Selvam,Xiao Guo,Luchao Qi,Xiaoming Liu,Roni Sengupta
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancement of talking-head deepfake generation fueled by advanced generative models has elevated the realism of synthetic videos to a level that poses substantial risks in domains such as media, politics, and finance. However, current benchmarks for deepfake talking-head detection fail to reflect this progress, relying on outdated generators and offering limited insight into model robustness and generalization. We introduce TalkingHeadBench, a comprehensive multi-model multi-generator benchmark and curated dataset designed to evaluate the performance of state-of-the-art detectors on the most advanced generators. Our dataset includes deepfakes synthesized by leading academic and commercial models and features carefully constructed protocols to assess generalization under distribution shifts in identity and generator characteristics. We benchmark a diverse set of existing detection methods, including CNNs, vision transformers, and temporal models, and analyze their robustness and generalization capabilities. In addition, we provide error analysis using Grad-CAM visualizations to expose common failure modes and detector biases. TalkingHeadBench is hosted on this https URL with open access to all data splits and protocols. Our benchmark aims to accelerate research towards more robust and generalizable detection models in the face of rapidly evolving generative techniques.
zh
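【示例】:摘要提到用Grad-CAM可视化进行错误分析。下面给出通用Grad-CAM的极简实现示意(以torchvision的ResNet18为占位模型,实际基准中的检测器与目标层需自行替换):

```python
# 极简示意:对CNN分类器做Grad-CAM热力图,用于定位模型关注区域。
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()     # 占位模型;实际应换成待分析的深伪检测器
feats, grads = {}, {}
layer = model.layer4                      # 目标层:最后一个卷积stage
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)           # 占位输入帧
score = model(x)[0].max()                 # 以预测类别的logit作为反传目标
score.backward()

w = grads["a"].mean(dim=(2, 3), keepdim=True)        # 通道重要性 = 梯度的全局平均
cam = F.relu((w * feats["a"]).sum(1, keepdim=True))  # 加权求和并截断负贡献
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # 归一化到[0,1]作热力图
print(cam.shape)                          # torch.Size([1, 1, 224, 224])
```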
[CV-6] ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
【速读】:该论文旨在解决故事可视化(story visualization)框架在真实场景中性能不足的问题,特别是在生成与叙事和参考图像一致的视觉连贯图像方面。其解决方案的关键在于引入一个全面的评估基准ViStoryBench,该基准通过收集涵盖多种故事类型和艺术风格的数据集,确保模型在不同情节(如喜剧、恐怖)和视觉美学(如动漫、3D渲染)等多个维度上进行评估,同时平衡叙事结构与视觉元素,并包含单主角和多主角的故事以测试角色一致性,以及复杂情节和精细世界观构建以挑战模型生成准确视觉内容的能力。
链接: https://arxiv.org/abs/2505.24862
作者: Cailin Zhuang,Ailin Huang,Wei Cheng,Jingwei Wu,Yaoqi Hu,Jiaqi Liao,Zhewei Huang,Hongyuan Wang,Xinyao Liao,Weiwei Cai,Hengyuan Xu,Xuanyang Zhang,Xianfang Zeng,Gang Yu,Chi Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 33 Pages, Project Page: this https URL , Code: this https URL
Abstract:Story visualization, which aims to generate a sequence of visually coherent images aligning with a given narrative and reference images, has seen significant progress with recent advancements in generative models. To further enhance the performance of story visualization frameworks in real-world scenarios, we introduce a comprehensive evaluation benchmark, ViStoryBench. We collect a diverse dataset encompassing various story types and artistic styles, ensuring models are evaluated across multiple dimensions such as different plots (e.g., comedy, horror) and visual aesthetics (e.g., anime, 3D renderings). ViStoryBench is carefully curated to balance narrative structures and visual elements, featuring stories with single and multiple protagonists to test models’ ability to maintain character consistency. Additionally, it includes complex plots and intricate world-building to challenge models in generating accurate visuals. To ensure comprehensive comparisons, our benchmark incorporates a wide range of evaluation metrics assessing critical aspects. This structured and multifaceted framework enables researchers to thoroughly identify both the strengths and weaknesses of different models, fostering targeted improvements.
zh
[CV-7] Reading Recognition in the Wild
【速读】:该论文试图解决在持续运行的智能眼镜中实现自我中心情境人工智能的问题,特别是如何识别用户是否处于阅读状态。解决方案的关键在于引入了一个大规模多模态的“Reading in the Wild”数据集,并利用三种模态(自我中心RGB图像、眼动追踪和头部姿态)构建了一个灵活的Transformer模型,以有效且高效地编码这些模态,从而实现对阅读行为的准确识别。
链接: https://arxiv.org/abs/2505.24848
作者: Charig Yang,Samiul Alam,Shakhrul Iman Siam,Michael J. Proulx,Lambert Mathias,Kiran Somasundaram,Luis Pesqueira,James Fort,Sheroze Sheriffdeen,Omkar Parkhi,Carl Ren,Mi Zhang,Yuning Chai,Richard Newcombe,Hyo Jin Kim
机构: Meta Reality Labs Research (Meta现实实验室研究); VGG, University of Oxford (VGG,牛津大学); The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user’s interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity and realism. Code, model, and data will be public.
zh
[CV-8] VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software
【速读】:该论文旨在解决计算机辅助设计(Computer-Aided Design, CAD)中用户界面(UI)交互学习的复杂性问题,特别是针对高精度、长时序的工程工具操作缺乏有效数据和方法的问题。其解决方案的关键在于构建了VideoCAD,这是一个大规模的合成数据集,包含超过41K个标注的CAD操作视频,通过自动化框架生成高质量的UI动作数据,显著提升了UI交互学习的复杂度和时序长度。此外,论文还提出了VideoCADFormer模型,用于直接从视频中学习CAD交互,展示了在长时序依赖和多模态推理方面的优势。
链接: https://arxiv.org/abs/2505.24838
作者: Brandon Man,Ghadi Nehme,Md Ferdous Alam,Faez Ahmed
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools. In this work, we introduce VideoCAD, the first attempt at engineering UI interaction learning for precision tasks. Specifically, VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated using an automated framework for collecting high-fidelity UI action data from human-made CAD designs. Compared to existing datasets, VideoCAD offers an order of magnitude higher complexity in UI interaction learning for real-world engineering tasks, having up to a 20x longer time horizon than other datasets. We show two important downstream applications of VideoCAD: learning UI interactions from professional precision 3D CAD tools and a visual question-answering (VQA) benchmark designed to evaluate multimodal large language models’ (LLM) spatial reasoning and video understanding abilities. To learn the UI interactions, we propose VideoCADFormer - a state-of-the-art model in learning CAD interactions directly from video, which outperforms multiple behavior cloning baselines. Both VideoCADFormer and the VQA benchmark derived from VideoCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multi-modal and spatial reasoning, and long-horizon dependencies.
zh
[CV-9] Zero-Shot Chinese Character Recognition with Hierarchical Multi-Granularity Image-Text Aligning
【速读】:该论文旨在解决中文字符识别(Chinese Character Recognition, CCR)中由于字符结构复杂、语义细粒度高而导致的识别准确率不足问题。现有方法通常依赖于自回归模型和编辑距离后处理,并且仅使用单一层次的字符表示,难以充分利用中文字符的细粒度语义信息。论文提出的解决方案是基于对比学习范式的分层多粒度图像-文本对齐框架(Hierarchical Multi-Granularity Image-Text Aligning, Hi-GITA),其关键在于在图像和文本两侧引入多粒度编码器,以提取从笔画到部首再到整体结构的多层次语义表示,并通过多粒度融合模块和细粒度解耦图像-文本对比损失来增强跨模态对齐效果。
链接: https://arxiv.org/abs/2505.24837
作者: Yinglian Zhu,Haiyang Yu,Qizao Wang,Wei Lu,Xiangyang Xue,Bin Li
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The first three authors contributed equally
Abstract:Chinese Character Recognition (CCR) is a fundamental technology for intelligent document processing. Unlike Latin characters, Chinese characters exhibit unique spatial structures and compositional rules, allowing for the use of fine-grained semantic information in representation. However, existing approaches are usually based on auto-regressive as well as edit distance post-process and typically rely on a single-level character representation. In this paper, we propose a Hierarchical Multi-Granularity Image-Text Aligning (Hi-GITA) framework based on a contrastive paradigm. To leverage the abundant fine-grained semantic information of Chinese characters, we propose multi-granularity encoders on both image and text sides. Specifically, the Image Multi-Granularity Encoder extracts hierarchical image representations from character images, capturing semantic cues from localized strokes to holistic structures. The Text Multi-Granularity Encoder extracts stroke and radical sequence representations at different levels of granularity. To better capture the relationships between strokes and radicals, we introduce Multi-Granularity Fusion Modules on the image and text sides, respectively. Furthermore, to effectively bridge the two modalities, we further introduce a Fine-Grained Decoupled Image-Text Contrastive loss, which aligns image and text representations across multiple granularities. Extensive experiments demonstrate that our proposed Hi-GITA significantly outperforms existing zero-shot CCR methods. For instance, it brings about 20% accuracy improvement in handwritten character and radical zero-shot settings. Code and models will be released soon.
zh
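【示例】:下面以标准的InfoNCE形式示意"多粒度图文对比对齐"的骨架(编码器输出用随机张量代替,粒度划分与损失细节均为对论文思路的假设性简化):

```python
# 极简示意:在笔画/部首/整字多个粒度上分别做对称InfoNCE对齐后求和。
import torch
import torch.nn.functional as F

def info_nce(img, txt, tau=0.07):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.T / tau                       # (B, B) 相似度矩阵
    labels = torch.arange(img.size(0))               # 对角线为配对样本
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

B, D = 8, 256
granularities = ["stroke", "radical", "character"]
img_feats = {g: torch.randn(B, D) for g in granularities}  # 图像多粒度编码器输出(占位)
txt_feats = {g: torch.randn(B, D) for g in granularities}  # 文本多粒度编码器输出(占位)

loss = sum(info_nce(img_feats[g], txt_feats[g]) for g in granularities)
print(loss)
```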
[CV-10] Segmenting France Across Four Centuries
【速读】:该论文旨在解决历史地图在大尺度、长时间序列下土地利用与土地覆盖演变分析中的数据稀缺与标注成本高昂的问题。其解决方案的关键在于构建一个涵盖18至20世纪法国本土(metropolitan France)的多时期历史地图数据集,并提供有限但精确的手动标注历史标签,以及对应的现代标签,以支持深度学习模型的训练与评估。该数据集通过引入不同年代的地图风格差异、景观变化及解释性歧义等复杂性,为弱监督和全监督方法的比较提供了基准,从而推动长期环境监测与景观变迁研究的发展。
链接: https://arxiv.org/abs/2505.24824
作者: Marta López-Rauhut,Hongyu Zhou,Mathieu Aubry,Loic Landrieu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 8 figures, 3 tables
Abstract:Historical maps offer an invaluable perspective into territory evolution across past centuries–long before satellite or remote sensing technologies existed. Deep learning methods have shown promising results in segmenting historical maps, but publicly available datasets typically focus on a single map type or period, require extensive and costly annotations, and are not suited for nationwide, long-term analyses. In this paper, we introduce a new dataset of historical maps tailored for analyzing large-scale, long-term land use and land cover evolution with limited annotations. Spanning metropolitan France (548,305 km^2), our dataset contains three map collections from the 18th, 19th, and 20th centuries. We provide both comprehensive modern labels and 22,878 km^2 of manually annotated historical labels for the 18th and 19th century maps. Our dataset illustrates the complexity of the segmentation task, featuring stylistic inconsistencies, interpretive ambiguities, and significant landscape changes (e.g., marshlands disappearing in favor of forests). We assess the difficulty of these challenges by benchmarking three approaches: a fully-supervised model trained with historical labels, and two weakly-supervised models that rely only on modern annotations. The latter either use the modern labels directly or first perform image-to-image translation to address the stylistic gap between historical and contemporary maps. Finally, we discuss how these methods can support long-term environment monitoring, offering insights into centuries of landscape transformation. Our official project repository is publicly available at this https URL.
zh
[CV-11] Bi-Manual Joint Camera Calibration and Scene Representation
【速读】:该论文旨在解决多机器人操作中相机标定与环境表示构建的复杂性问题,特别是在双臂操作场景下,传统标定方法依赖于采集带有预设标记的图像,过程繁琐且耗时。其解决方案的关键在于提出一种无需依赖标定标记的联合标定与表示框架(Bi-Manual Joint Calibration and Representation Framework, Bi-JCR),通过利用3D基础模型实现密集、无标记的多视角对应关系,从而联合估计相机到末端执行器的外参、机械臂之间的相对位姿以及共享工作空间的统一尺度一致的3D表示。
链接: https://arxiv.org/abs/2505.24819
作者: Haozhan Tang,Tianyi Zhang,Matthew Johnson-Roberson,Weiming Zhi
机构: Carnegie Mellon University (卡内基梅隆大学); Vanderbilt University (范德堡大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Robot manipulation, especially bimanual manipulation, often requires setting up multiple cameras on multiple robot manipulators. Before robot manipulators can generate motion or even build representations of their environments, the cameras rigidly mounted to the robot need to be calibrated. Camera calibration is a cumbersome process involving collecting a set of images, with each capturing a pre-determined marker. In this work, we introduce the Bi-Manual Joint Calibration and Representation Framework (Bi-JCR). Bi-JCR enables multiple robot manipulators, each with cameras mounted, to circumvent taking images of calibration markers. By leveraging 3D foundation models for dense, marker-free multi-view correspondence, Bi-JCR jointly estimates: (i) the extrinsic transformation from each camera to its end-effector, (ii) the inter-arm relative poses between manipulators, and (iii) a unified, scale-consistent 3D representation of the shared workspace, all from the same captured RGB image sets. The representation, jointly constructed from images captured by cameras on both manipulators, lives in a common coordinate frame and supports collision checking and semantic segmentation to facilitate downstream bimanual coordination tasks. We empirically evaluate the robustness of Bi-JCR on a variety of tabletop environments, and demonstrate its applicability on a variety of downstream tasks.
zh
[CV-12] CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning CVPR2025
【速读】:该论文旨在解决类增量学习(Class-Incremental Learning, CIL)中因逐步学习新类而产生的知识遗忘问题,特别是在无需回放样本的情况下,如何有效保留已有类的知识并适应新任务。现有基于适配器的方法在每个新任务中创建独立的适配器模块,导致参数冗余且无法充分利用跨任务的共享知识。解决方案的关键在于提出ContinuaL Low-Rank Adaptation (CL-LoRA),其核心是引入一种双适配器架构,结合任务共享适配器与任务特定适配器,通过随机正交矩阵和梯度重分配的知识蒸馏机制保留共享知识,并利用可学习的块状权重减少任务间干扰,从而提升模型的持续学习效率与可扩展性。
链接: https://arxiv.org/abs/2505.24816
作者: Jiangpeng He,Zhihao Duan,Fengqing Zhu
机构: Massachusetts Institute of Technology (麻省理工学院); Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025
Abstract:Class-Incremental Learning (CIL) aims to learn new classes sequentially while retaining the knowledge of previously learned classes. Recently, pre-trained models (PTMs) combined with parameter-efficient fine-tuning (PEFT) have shown remarkable performance in rehearsal-free CIL without requiring exemplars from previous tasks. However, existing adapter-based methods, which incorporate lightweight learnable modules into PTMs for CIL, create new adapters for each new task, leading to both parameter redundancy and failure to leverage shared knowledge across tasks. In this work, we propose ContinuaL Low-Rank Adaptation (CL-LoRA), which introduces a novel dual-adapter architecture combining task-shared adapters to learn cross-task knowledge and task-specific adapters to capture unique features of each new task. Specifically, the shared adapters utilize random orthogonal matrices and leverage knowledge distillation with gradient reassignment to preserve essential shared knowledge. In addition, we introduce learnable block-wise weights for task-specific adapters, which mitigate inter-task interference while maintaining the model’s plasticity. We demonstrate CL-LoRA consistently achieves promising performance under multiple benchmarks with reduced training and inference computation, establishing a more efficient and scalable paradigm for continual learning with pre-trained models.
zh
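【示例】:下面给出双适配器结构的一个极简PyTorch示意(维度、正交初始化方式与块级权重的粒度均为本文假设,非官方实现):共享适配器的降维矩阵取随机正交阵并冻结,任务特定适配器各任务独立并配可学习权重。

```python
# 极简示意:任务共享 + 任务特定的双LoRA适配器(假设性简化)。
import torch
import torch.nn as nn

class DualLoRA(nn.Module):
    def __init__(self, d, r=8, num_tasks=5):
        super().__init__()
        # 任务共享适配器:降维矩阵取随机正交阵并冻结,用于跨任务保留共享知识
        q, _ = torch.linalg.qr(torch.randn(d, r))
        self.shared_A = nn.Parameter(q, requires_grad=False)
        self.shared_B = nn.Parameter(torch.zeros(r, d))
        # 任务特定适配器:每个任务独立的A/B,配可学习权重(此处简化为每任务一个标量)
        self.task_A = nn.ParameterList([nn.Parameter(torch.randn(d, r) * 0.01) for _ in range(num_tasks)])
        self.task_B = nn.ParameterList([nn.Parameter(torch.zeros(r, d)) for _ in range(num_tasks)])
        self.task_w = nn.Parameter(torch.ones(num_tasks))

    def forward(self, x, task_id):
        shared = x @ self.shared_A @ self.shared_B
        specific = self.task_w[task_id] * (x @ self.task_A[task_id] @ self.task_B[task_id])
        return x + shared + specific        # 作为残差叠加在冻结主干的输出上

x = torch.randn(4, 768)
print(DualLoRA(768)(x, task_id=0).shape)    # torch.Size([4, 768])
```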
[CV-13] TC-GS: A Faster Gaussian Splatting Module Utilizing Tensor Cores
【速读】:该论文旨在解决3D Gaussian Splatting(3DGS)中由条件alpha混合导致的渲染管线时间成本过高的问题,以及现有加速算法与渲染管线设计紧密耦合所带来的扩展性限制。其解决方案的关键在于提出TC-GS模块,通过将alpha计算映射到矩阵乘法,充分利用了原本闲置的Tensor Core(TCU),从而实现了算法无关的通用加速,并有效缓解了因Tensor Core半精度计算引起的像素坐标二次项带来的舍入误差。
链接: https://arxiv.org/abs/2505.24796
作者: Zimu Liao,Jifeng Ding,Rong Fu,Siwei Cui,Ruixuan Gong,Li Wang,Boni Hu,Yi Wang,Hengjie Li,Xingcheng Zhang,Hui Wang
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学); University of Electronic Science and Technology of China (电子科技大学); Fudan University (复旦大学); Beijing Institute of Technology (北京理工大学); Northwestern Polytechnical University (西北工业大学); Shanghai Innovation Institute (上海创新研究院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 15 pages, 6 figures
Abstract:3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where conditional alpha-blending dominates the time cost in the rendering pipeline. This paper proposes TC-GS, an algorithm-independent universal module that expands Tensor Core (TCU) applicability for 3DGS, leading to substantial speedups and seamless integration into existing 3DGS optimization frameworks. The key innovation lies in mapping alpha computation to matrix multiplication, fully utilizing otherwise idle TCUs in existing 3DGS implementations. TC-GS provides plug-and-play acceleration for existing top-tier acceleration algorithms tightly coupled with rendering pipeline designs, like Gaussian compression and redundancy elimination algorithms. Additionally, we introduce a global-to-local coordinate transformation to mitigate rounding errors from quadratic terms of pixel coordinates caused by Tensor Core half-precision computation. Extensive experiments demonstrate that our method maintains rendering quality while providing an additional 2.18x speedup over existing Gaussian acceleration algorithms, thus reaching up to a total 5.6x acceleration. The code is currently available anonymously at this https URL.
zh
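【示例】:TC-GS的核心技巧是把每个(像素, 高斯)对的二次型展开为"像素特征向量 × 高斯系数向量"的内积,从而用一次矩阵乘法批量计算alpha(矩阵乘法可交由Tensor Core执行)。下面用PyTorch验证这一等价改写(数值为随机示例,为清晰起见使用fp32;实际实现还需配合全局-局部坐标变换以控制半精度下像素坐标二次项的舍入误差):

```python
# 极简示意:把高斯alpha的二次型计算改写为一次matmul,并与逐对计算对照验证。
import torch

N, G = 1024, 256
P_xy = torch.rand(N, 2) * 16                   # 一个tile内的像素坐标
mu = torch.rand(G, 2) * 16                     # G个高斯的均值
a, b, c = torch.rand(G) + 1, torch.rand(G) * 0.1, torch.rand(G) + 1  # 2D conic系数
opacity = torch.rand(G)

# 像素特征 (N,6) 只依赖像素;高斯系数 (G,6) 只依赖高斯
x, y = P_xy[:, 0], P_xy[:, 1]
pix = torch.stack([x * x, x * y, y * y, x, y, torch.ones_like(x)], dim=1)
mx, my = mu[:, 0], mu[:, 1]
coef = torch.stack([a, 2 * b, c,
                    -2 * a * mx - 2 * b * my,
                    -2 * b * mx - 2 * c * my,
                    a * mx * mx + 2 * b * mx * my + c * my * my], dim=1)

Q = pix @ coef.T                               # 一次matmul得到全部(像素,高斯)二次型
alpha = opacity * torch.exp(-0.5 * Q)          # alpha计算由此转化为矩阵乘法

# 与逐对直接计算对照,验证等价性
d = P_xy[:, None, :] - mu[None, :, :]
Q_ref = a * d[..., 0] ** 2 + 2 * b * d[..., 0] * d[..., 1] + c * d[..., 1] ** 2
print(torch.allclose(Q, Q_ref, rtol=1e-4, atol=1e-3))
```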
[CV-14] Lightweight Relational Embedding in Task-Interpolated Few-Shot Networks for Enhanced Gastrointestinal Disease Classification
【速读】:该论文旨在解决传统结肠镜检查在早期结直肠癌(Colorectal Cancer, CRC)诊断中的局限性,包括其侵入性、对高质量内窥镜图像的依赖性以及图像质量不足导致的诊断准确性问题。其解决方案的关键在于提出一种基于小样本学习(Few-Shot Learning)架构的深度学习网络,该网络包含定制化的特征提取器、任务插值、关系嵌入和双级别路由注意力机制,使模型能够快速适应未见过的细粒度内窥镜图像模式,并通过任务插值增强图像数据,同时利用关系嵌入捕捉图像内部关键特征及连续帧之间的过渡信息,从而克服传统卷积神经网络(Convolutional Neural Networks, CNNs)的局限性。
链接: https://arxiv.org/abs/2505.24792
作者: Xinliu Zhong,Leo Hwa Liang,Angela S. Koh,Yeo Si Yong
机构: National University of Singapore (新加坡国立大学); Duke-NUS Medical School (杜克-国大医学研究院); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 15 figures
Abstract:Traditional diagnostic methods like colonoscopy are invasive yet critical tools necessary for accurately diagnosing colorectal cancer (CRC). Detection of CRC at early stages is crucial for increasing patient survival rates. However, colonoscopy is dependent on obtaining adequate and high-quality endoscopic images. Prolonged invasive procedures are inherently risky for patients, while suboptimal or insufficient images hamper diagnostic accuracy. These images, typically derived from video frames, often exhibit similar patterns, posing challenges in discrimination. To overcome these challenges, we propose a novel Deep Learning network built on a Few-Shot Learning architecture, which includes a tailored feature extractor, task interpolation, relational embedding, and a bi-level routing attention mechanism. The Few-Shot Learning paradigm enables our model to rapidly adapt to unseen fine-grained endoscopic image patterns, and the task interpolation augments the insufficient images artificially from varied instrument viewpoints. Our relational embedding approach discerns critical intra-image features and captures inter-image transitions between consecutive endoscopic frames, overcoming the limitations of Convolutional Neural Networks (CNNs). The integration of a light-weight attention mechanism ensures a concentrated analysis of pertinent image regions. By training on diverse datasets, the model’s generalizability and robustness are notably improved for handling endoscopic images. Evaluated on Kvasir dataset, our model demonstrated superior performance, achieving an accuracy of 90.1%, precision of 0.845, recall of 0.942, and an F1 score of 0.891. This surpasses current state-of-the-art methods, presenting a promising solution to the challenges of invasive colonoscopy by optimizing CRC detection through advanced image analysis.
zh
[CV-15] DiG-Net: Enhancing Quality of Life through Hyper-Range Dynamic Gesture Recognition in Assistive Robotics
【速读】:该论文旨在解决远距离动态手势识别在辅助人机交互(HRI)中的局限性,传统手势识别方法通常仅适用于短距离场景,难以满足远程辅助通信的需求。其解决方案的关键在于提出一种名为Distance-aware Gesture Network (DiG-Net) 的新型架构,该架构通过结合Depth-Conditioned Deformable Alignment (DADA) 块与Spatio-Temporal Graph 模块,实现对远距离下受物理衰减、分辨率降低及动态手势变化等挑战性条件影响的手势序列的鲁棒处理与分类,并引入Radiometric Spatio-Temporal Depth Attenuation Loss (RSTDAL) 以提升模型在不同距离下的学习能力和鲁棒性。
链接: https://arxiv.org/abs/2505.24786
作者: Eran Bamani Beeri,Eden Nissinman,Avishai Sintov
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2411.18413
Abstract:Dynamic hand gestures play a pivotal role in assistive human-robot interaction (HRI), facilitating intuitive, non-verbal communication, particularly for individuals with mobility constraints or those operating robots remotely. Current gesture recognition methods are mostly limited to short-range interactions, reducing their utility in scenarios demanding robust assistive communication from afar. In this paper, we introduce a novel approach designed specifically for assistive robotics, enabling dynamic gesture recognition at extended distances of up to 30 meters, thereby significantly improving accessibility and quality of life. Our proposed Distance-aware Gesture Network (DiG-Net) effectively combines Depth-Conditioned Deformable Alignment (DADA) blocks with Spatio-Temporal Graph modules, enabling robust processing and classification of gesture sequences captured under challenging conditions, including significant physical attenuation, reduced resolution, and dynamic gesture variations commonly experienced in real-world assistive environments. We further introduce the Radiometric Spatio-Temporal Depth Attenuation Loss (RSTDAL), shown to enhance learning and strengthen model robustness across varying distances. Our model demonstrates significant performance improvement over state-of-the-art gesture recognition frameworks, achieving a recognition accuracy of 97.3% on a diverse dataset with challenging hyper-range gestures. By effectively interpreting gestures from considerable distances, DiG-Net significantly enhances the usability of assistive robots in home healthcare, industrial safety, and remote assistance scenarios, enabling seamless and intuitive interactions for users regardless of physical limitations
zh
[CV-16] Tackling View-Dependent Semantics in 3D Language Gaussian Splatting ICML2025
【速读】:该论文试图解决3D场景理解中因视角变化导致的语义不一致问题,即视图依赖语义(view-dependent semantics)的问题。现有方法通常将2D语义特征直接投影到3D高斯分布上,忽略了2D与3D理解之间的根本差异。解决方案的关键在于提出LaGa(Language Gaussians),通过将3D场景分解为物体来建立跨视角的语义联系,并通过聚类语义描述符和基于多视角语义的重新加权构建视角聚合的语义表示,从而更全面地捕捉视图依赖语义的关键信息。
链接: https://arxiv.org/abs/2505.24746
作者: Jiazhong Cen,Xudong Zhou,Jiemin Fang,Changsong Wen,Lingxi Xie,Xiaopeng Zhang,Wei Shen,Qi Tian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2025 camera ready. Project Page: this https URL
Abstract:Recent advancements in 3D Gaussian Splatting (3D-GS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm for language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit various semantics from different viewpoints–a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. Then, it constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset. Our code is available at: this https URL.
zh
[CV-17] DreamDance: Animating Character Art via Inpainting Stable Gaussian Worlds
【速读】:该论文试图解决在角色艺术动画中生成稳定、一致的角色与场景运动的问题,特别是在给定精确相机轨迹的情况下。解决方案的关键在于将动画任务重新建模为两个基于修复(inpainting)的步骤:相机感知场景修复和姿态感知视频修复。第一个步骤利用预训练的图像修复模型生成多视角场景图像,并优化一个稳定的大型高斯场,以实现带有相机轨迹的粗略背景视频渲染;第二个步骤则通过姿态感知视频修复模型,将动态角色注入场景视频的同时提升背景质量,其核心是基于DiT的视频生成模型结合门控策略,自适应地融合角色外观和姿态信息到基础背景视频中。
链接: https://arxiv.org/abs/2505.24733
作者: Jiaxu Zhang,Xianfang Zeng,Xin Chen,Wei Zuo,Gang Yu,Guosheng Lin,Zhigang Tu
机构: Wuhan University (武汉大学); Nanyang Technological University (南洋理工大学); StepFun; ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents DreamDance, a novel character art animation framework capable of producing stable, consistent character and scene motion conditioned on precise camera trajectories. To achieve this, we re-formulate the animation task as two inpainting-based steps: Camera-aware Scene Inpainting and Pose-aware Video Inpainting. The first step leverages a pre-trained image inpainting model to generate multi-view scene images from the reference art and optimizes a stable large-scale Gaussian field, which enables coarse background video rendering with camera trajectories. However, the rendered video is rough and only conveys scene motion. To resolve this, the second step trains a pose-aware video inpainting model that injects the dynamic character into the scene video while enhancing background quality. Specifically, this model is a DiT-based video generation model with a gating strategy that adaptively integrates the character’s appearance and pose information into the base background video. Through extensive experiments, we demonstrate the effectiveness and generalizability of DreamDance, producing high-quality and consistent character animations with remarkable camera dynamics.
zh
[CV-18] Reinforcing Video Reasoning with Focused Thinking
【速读】:该论文旨在解决多模态大语言模型在复杂推理任务中存在的一些关键问题,包括生成的推理链缺乏焦点且冗长,导致重要的时空线索被掩盖,以及二元奖励机制无法有效处理部分正确的答案,从而引发奖励方差大和学习效率低的问题。其解决方案的关键在于提出TW-GRPO框架,该框架通过引入基于信息密度的token加权机制来聚焦关键信息,并通过将单选任务转换为多选任务以实现更精细的奖励粒度,同时采用问答反转的数据增强策略提升样本多样性。
链接: https://arxiv.org/abs/2505.24718
作者: Jisheng Dang,Jingze Wu,Teng Wang,Xuanhui Lin,Nannan Zhu,Hongbo Chen,Wei-Shi Zheng,Meng Wang,Tat-Seng Chua
机构: Sun Yat-sen University (中山大学); Lanzhou University (兰州大学); University of Hong Kong (香港大学); Hefei University of Technology (合肥工业大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advancements in reinforcement learning, particularly through Group Relative Policy Optimization (GRPO), have significantly improved multimodal large language models for complex reasoning tasks. However, two critical limitations persist: 1) they often produce unfocused, verbose reasoning chains that obscure salient spatiotemporal cues and 2) binary rewarding fails to account for partially correct answers, resulting in high reward variance and inefficient learning. In this paper, we propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density (estimated by intra-group variance), suppressing redundant tokens like generic reasoning prefixes. Furthermore, we reformulate RL training by shifting from single-choice to multi-choice QA tasks, where soft rewards enable finer-grained gradient estimation by distinguishing partial correctness. Additionally, we propose question-answer inversion, a data augmentation strategy to generate diverse multi-choice samples from existing benchmarks. Experiments demonstrate state-of-the-art performance on several video reasoning and general understanding benchmarks. Notably, TW-GRPO achieves 50.4% accuracy on CLEVRER (18.8% improvement over Video-R1) and 65.8% on MMVU. Our code is available at this https URL.
zh
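【示例】:下面示意"以组内方差估计token信息密度并据此加权"的计算(假设同组回答的token位置已对齐,优势值亦为随机占位,仅说明加权逻辑):

```python
# 极简示意:TW-GRPO式token加权——方差大的token位置信息密度高、权重大。
import torch

group_logps = torch.randn(8, 64)      # 同一问题8条回答在64个对齐token位置上的log-prob(对齐为简化假设)
var = group_logps.var(dim=0)          # 组内方差:通用推理前缀等冗余token的方差小
weights = var / (var.sum() + 1e-8)    # 归一化为token权重
advantage = torch.randn(8, 1)         # 组相对优势(GRPO中由组内奖励标准化得到,此处随机占位)
objective = (weights * advantage * group_logps).sum(dim=-1).mean()
print(weights.topk(3).values)         # 权重最高的3个token位置
```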
[CV-19] RT-X Net: RGB-Thermal cross attention network for Low-Light Image Enhancement ICIP2025
【速读】:该论文旨在解决夜间条件下由于高噪声水平和强光照源导致图像质量下降,从而使得低光图像增强变得困难的问题。解决方案的关键在于提出RT-X Net,这是一种基于跨注意力机制的网络,用于融合RGB图像与热成像图像,以提升夜间图像的增强效果。通过自注意力网络进行特征提取,并利用跨注意力机制实现多模态信息的有效整合。
链接: https://arxiv.org/abs/2505.24705
作者: Raman Jha,Adithya Lenka,Mani Ramanagopal,Aswin Sankaranarayanan,Kaushik Mitra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICIP 2025
Abstract:In nighttime conditions, high noise levels and bright illumination sources degrade image quality, making low-light image enhancement challenging. Thermal images provide complementary information, offering richer textures and structural details. We propose RT-X Net, a cross-attention network that fuses RGB and thermal images for nighttime image enhancement. We leverage self-attention networks for feature extraction and a cross-attention mechanism for fusion to effectively integrate information from both modalities. To support research in this domain, we introduce the Visible-Thermal Image Enhancement Evaluation (V-TIEE) dataset, comprising 50 co-located visible and thermal images captured under diverse nighttime conditions. Extensive evaluations on the publicly available LLVIP dataset and our V-TIEE dataset demonstrate that RT-X Net outperforms state-of-the-art methods in low-light image enhancement. The code and the V-TIEE can be found here this https URL.
zh
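【示例】:下面是"RGB特征为query、热成像特征为key/value"的交叉注意力融合模块的极简示意(维度与token数为本文假设,实际网络还包含自注意力特征提取等部分):

```python
# 极简示意:用多头交叉注意力把热成像信息融入RGB特征。
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, thermal_tokens):
        # RGB查询热成像:从热模态中取回纹理与结构细节
        fused, _ = self.attn(rgb_tokens, thermal_tokens, thermal_tokens)
        return self.norm(rgb_tokens + fused)  # 残差连接保留RGB主信息

rgb = torch.randn(2, 196, 256)      # (batch, tokens, dim)
thermal = torch.randn(2, 196, 256)
print(CrossModalFusion()(rgb, thermal).shape)  # torch.Size([2, 196, 256])
```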
[CV-20] PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches CVPR2025
【速读】:该论文试图解决对抗性补丁攻击对多标签分类器的威胁,特别是在现有可证明鲁棒性防御方法主要针对单标签分类的情况下,缺乏针对多标签分类的有效解决方案。其关键解决方案是提出PatchDEMUX框架,该框架将多标签分类任务视为一系列独立的二分类问题,从而为多标签分类器提供可证明的鲁棒性保障,并在攻击者仅能使用单一补丁的场景下引入额外的认证流程以获得更紧的鲁棒性边界。
链接: https://arxiv.org/abs/2505.24703
作者: Dennis Jacob,Chong Xiang,Prateek Mittal
机构: UC Berkeley (加州大学伯克利分校); Princeton University (普林斯顿大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CVPR 2025
Abstract:Deep learning techniques have enabled vast improvements in computer vision technologies. Nevertheless, these models are vulnerable to adversarial patch attacks which catastrophically impair performance. The physically realizable nature of these attacks calls for certifiable defenses, which feature provable guarantees on robustness. While certifiable defenses have been successfully applied to single-label classification, limited work has been done for multi-label classification. In this work, we present PatchDEMUX, a certifiably robust framework for multi-label classifiers against adversarial patches. Our approach is a generalizable method which can extend any existing certifiable defense for single-label classification; this is done by considering the multi-label classification task as a series of isolated binary classification problems to provably guarantee robustness. Furthermore, in the scenario where an attacker is limited to a single patch we propose an additional certification procedure that can provide tighter robustness bounds. Using the current state-of-the-art (SOTA) single-label certifiable defense PatchCleanser as a backbone, we find that PatchDEMUX can achieve non-trivial robustness on the MS-COCO and PASCAL VOC datasets while maintaining high clean performance
zh
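【示例】:下面的片段示意"把多标签认证拆解为逐类二分类认证"的框架结构(certify_binary为假设的占位接口,实际应接入PatchCleanser等单标签可认证防御;此处用随机stub使示例可直接运行):

```python
# 极简示意:多标签任务 => 一组独立二分类认证问题。
import random

def certify_binary(image, class_idx):
    """占位stub:返回(该类别是否存在的预测, 该预测是否被认证为对补丁攻击鲁棒)。"""
    return random.random() > 0.5, random.random() > 0.3

def patchdemux(image, num_classes=80):
    results = []
    for c in range(num_classes):                 # 逐类独立做单标签认证
        pred, certified = certify_binary(image, c)
        results.append((c, pred, certified))
    # 各二分类子问题分别鲁棒,即可为整体多标签预测给出可证明的保证
    return results

results = patchdemux(image=None)
print(sum(p for _, p, _ in results), "个正类;",
      sum(cert for _, _, cert in results), "个类别的预测可被认证")
```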
[CV-21] Conformal Prediction for Zero-Shot Models CVPR2025
【速读】:该论文试图解决预训练视觉-语言模型(Vision-Language Models)在下游任务中可靠性与不确定性评估不足的问题,特别是在领域偏移(domain drift)背景下,分裂共形预测(split conformal prediction)所生成置信集的效率下降的问题。其解决方案的关键在于提出Conf-OT方法,该方法在合并的校准集与查询集上进行直推式(transductive)学习,通过求解最优传输问题来缩小预训练域与适应域之间的差距,从而提升置信集的效率并保持覆盖率保证,同时无需额外的数据划分。
链接: https://arxiv.org/abs/2505.24693
作者: Julio Silva-Rodríguez,Ismail Ben Ayed,Jose Dolz
机构: ÉTS Montréal(ÉTS蒙特利尔)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Code: this https URL
Abstract:Vision-language models pre-trained at large scale have shown unprecedented adaptability and generalization to downstream tasks. Although their discriminative potential has been widely explored, their reliability and uncertainty are still overlooked. In this work, we investigate the capabilities of CLIP models under the split conformal prediction paradigm, which provides theoretical guarantees to black-box models based on a small, labeled calibration set. In contrast to the main body of literature on conformal predictors in vision classifiers, foundation models exhibit a particular characteristic: they are pre-trained on a one-time basis on an inaccessible source domain, different from the transferred task. This domain drift negatively affects the efficiency of the conformal sets and poses additional challenges. To alleviate this issue, we propose Conf-OT, a transfer learning setting that operates transductively over the combined calibration and query sets. Solving an optimal transport problem, the proposed method bridges the domain gap between pre-training and adaptation without requiring additional data splits but still maintaining coverage guarantees. We comprehensively explore this conformal prediction strategy on a broad span of 15 datasets and three non-conformity scores. Conf-OT provides consistent relative improvements of up to 20% on set efficiency while being 15 times faster than popular transductive approaches.
zh
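【示例】:作为背景,分裂共形预测(split conformal prediction)本身可用几行NumPy实现;下面给出其标准流程的极简示意(非Conf-OT实现,数据为随机占位,np.quantile的method参数需numpy≥1.22):

```python
# 极简示意:分裂共形预测为黑盒分类器构造覆盖率有保证的预测集。
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # 非一致性分数:1 - 真实类概率
    q_level = np.ceil((n + 1) * (1 - alpha)) / n         # 有限样本修正的分位数水平
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")
    # 预测集:所有非一致性分数不超过阈值的类别
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(10), size=500)
cal_labels = np.array([rng.choice(10, p=p) for p in cal_probs])
test_probs = rng.dirichlet(np.ones(10), size=3)
for s in conformal_sets(cal_probs, cal_labels, test_probs):
    print(s)   # 期望覆盖率约为 1 - alpha = 90%
```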
[CV-22] Learning reusable concepts across different egocentric video understanding tasks CVPR2025
【速读】:该论文旨在解决自主系统在理解视频流中人类活动时缺乏整体感知能力的问题,即如何使系统在短时间内同时理解事件、识别场景中物体的相关性与交互,并预测未来可能发生的情况。解决方案的关键在于提出Hier-EgoPack框架,该框架能够创建一组可跨下游任务迁移的任务视角,作为额外见解的来源,并作为机器人可携带和按需使用的技能背包。
链接: https://arxiv.org/abs/2505.24690
作者: Simone Alberto Peirone,Francesca Pistilli,Antonio Alliegro,Tatiana Tommasi,Giuseppe Averta
机构: Politecnico di Torino (都灵理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Extended abstract derived from arXiv:2502.02487 . Presented at the Second Joint Egocentric Vision (EgoVis) Workshop (CVPR 2025)
Abstract:Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, everything all at once. To endow autonomous systems with such holistic perception, learning how to correlate concepts, abstract knowledge across diverse tasks, and leverage tasks synergies when learning novel skills is essential. In this paper, we introduce Hier-EgoPack, a unified framework able to create a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights, as a backpack of skills that a robot can carry around and use when needed.
zh
[CV-23] Beyond FACS: Data-driven Facial Expression Dictionaries with Application to Predicting Autism
【速读】:该论文试图解决传统面部动作编码系统(FACS)在自动化检测中存在准确率不足、部分面部动作单元(AU)被排除以及非叠加组合下检测失败等问题,这些问题限制了其在行为研究中的应用。论文提出的解决方案的关键在于构建一种新的数据驱动编码系统——面部基底(Facial Basis),该系统模仿FACS的核心特性,但具有完全无监督学习、能够重建所有可观察到的面部运动,并且其单元具有可加性,从而克服了传统自动化FACS编码的三个结构性局限。
链接: https://arxiv.org/abs/2505.24679
作者: Evangelos Sariyanidi,Lisa Yankowitz,Robert T. Schultz,John D. Herrington,Birkan Tunc,Jeffrey Cohn
机构: Center for Autism Research, The Children’s Hospital of Philadelphia (费城儿童医院自闭症研究中心); Department of Psychiatry, University of Pennsylvania (宾夕法尼亚大学精神病学系); Deliberate AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in the Proceedings of the 19th IEEE International Conference on Automatic Face and Gesture Recognition (2025)
Abstract:The Facial Action Coding System (FACS) has been used by numerous studies to investigate the links between facial behavior and mental health. The laborious and costly process of FACS coding has motivated the development of machine learning frameworks for Action Unit (AU) detection. Despite intense efforts spanning three decades, the detection accuracy for many AUs is considered to be below the threshold needed for behavioral research. Also, many AUs are excluded altogether, making it impossible to fulfill the ultimate goal of FACS-the representation of any facial expression in its entirety. This paper considers an alternative approach. Instead of creating automated tools that mimic FACS experts, we propose to use a new coding system that mimics the key properties of FACS. Specifically, we construct a data-driven coding system called the Facial Basis, which contains units that correspond to localized and interpretable 3D facial movements, and overcomes three structural limitations of automated FACS coding. First, the proposed method is completely unsupervised, bypassing costly, laborious and variable manual annotation. Second, Facial Basis reconstructs all observable movement, rather than relying on a limited repertoire of recognizable movements (as in automated FACS). Finally, the Facial Basis units are additive, whereas AUs may fail detection when they appear in a non-additive combination. The proposed method outperforms the most frequently used AU detector in predicting autism diagnosis from in-person and remote conversations, highlighting the importance of encoding facial behavior comprehensively. To our knowledge, Facial Basis is the first alternative to FACS for deconstructing facial expressions in videos into localized movements. We provide an open source implementation of the method at this http URL.
zh
[CV-24] 6D Pose Estimation on Point Cloud Data through Prior Knowledge Integration: A Case Study in Autonomous Disassembly
【速读】:该论文旨在解决在制造领域中,特别是在起动机拆卸过程中,如何准确估计螺栓的6D位姿(6D pose)问题。由于存在遮挡以及单视角数据采集的局限性,导致部分螺栓无法被检测到,因此需要一种能够获取完整螺栓信息的综合处理流程。解决方案的关键在于引入先验知识,并构建一个多层次的处理管道,以有效捕捉电机上所有螺栓的6D信息,从而提升自动化拆卸过程中的螺栓检测准确性。
链接: https://arxiv.org/abs/2505.24669
作者: Chengzhi Wu,Hao Fu,Jan-Philipp Kaiser,Erik Tabuchi Barczak,Julius Pfrommer,Gisela Lanza,Michael Heizmann,Jürgen Beyerer
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB (弗劳恩霍夫光子学、系统技术与图像分析研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The accurate estimation of 6D pose remains a challenging task within the computer vision domain, even when utilizing 3D point cloud data. Conversely, in the manufacturing domain, instances arise where leveraging prior knowledge can yield advancements in this endeavor. This study focuses on the disassembly of starter motors to augment the engineering of product life cycles. A pivotal objective in this context involves the identification and 6D pose estimation of bolts affixed to the motors, facilitating automated disassembly within the manufacturing workflow. Complicating matters, the presence of occlusions and the limitations of single-view data acquisition, notably when motors are placed in a clamping system, obscure certain portions and render some bolts imperceptible. Consequently, the development of a comprehensive pipeline capable of acquiring complete bolt information is imperative to avoid oversight in bolt detection. In this paper, employing the task of bolt detection within the scope of our project as a pertinent use case, we introduce a meticulously devised pipeline. This multi-stage pipeline effectively captures the 6D information with regard to all bolts on the motor, thereby showcasing the effective utilization of prior knowledge in handling this challenging task. The proposed methodology not only contributes to the field of 6D pose estimation but also underscores the viability of integrating domain-specific insights to tackle complex problems in manufacturing and automation.
zh
[CV-25] Decoupled Competitive Framework for Semi-supervised Medical Image Segmentation ECAI2024
【速读】:该论文旨在解决医学影像分割中由于标注样本不足而导致的半监督学习性能受限问题。现有基于Mean Teacher (MT)或Dual Students (DS)结构的方法虽取得一定成果,但面临两个固有局限:MT结构中因采用指数移动平均(EMA)机制导致的过耦合问题,以及DS结构中两名学生之间的严重认知偏差,这些因素可能降低模型效果甚至引发模型崩溃。该研究提出的解耦竞争框架(Decoupled Competitive Framework, DCF)通过引入简单的竞争机制更新EMA,动态解耦学生与教师,并促进学生间有价值且精确见解的无缝交换,从而有效缓解上述问题。
链接: https://arxiv.org/abs/2505.24667
作者: Jiahe Chen,Jiahe Ying,Shen Wang,Jianwei Zheng
机构: Zhejiang University of Technology (浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in ECAI 2024
Abstract:Confronting the critical challenge of insufficiently annotated samples in the medical domain, semi-supervised medical image segmentation (SSMIS) emerges as a promising solution. Specifically, most methodologies following the Mean Teacher (MT) or Dual Students (DS) architecture have achieved commendable results. However, to date, these approaches face a performance bottleneck due to two inherent limitations, e.g., the over-coupling problem within the MT structure owing to the employment of the exponential moving average (EMA) mechanism, as well as the severe cognitive bias between the two students of the DS structure, both of which potentially lead to reduced efficacy, or even model collapse eventually. To mitigate these issues, a Decoupled Competitive Framework (DCF) is elaborated in this work, which utilizes a straightforward competition mechanism for the update of EMA, effectively decoupling students and teachers in a dynamical manner. In addition, the seamless exchange of invaluable and precise insights is facilitated among students, guaranteeing a better learning paradigm. The DCF introduced undergoes rigorous validation on three publicly accessible datasets, which encompass both 2D and 3D datasets. The results demonstrate the superiority of our method over previous cutting-edge competitors. Code will be available at this https URL.
zh
[CV-26] Black-box Adversarial Attacks on CNN-based SLAM Algorithms
【速读】:该论文试图解决深度学习在SLAM(同步定位与建图)系统中因对抗攻击而产生的可靠性问题,特别是针对基于卷积神经网络(Convolutional Neural Network, CNN)的特征检测器的对抗性扰动分析。解决方案的关键在于引入黑盒对抗扰动,作用于GCN-SLAM算法所接收的RGB图像,并通过实验验证了这种攻击对SLAM系统跟踪性能的显著影响,尤其是在中等规模攻击下可导致高达76%的帧跟踪失败,同时指出对深度信息进行攻击相较于RGB图像更具破坏性。
链接: https://arxiv.org/abs/2505.24654
作者: Maria Rafaela Gkeka,Bowen Sun,Evgenia Smirni,Christos D. Antonopoulos,Spyros Lalis,Nikolaos Bellas
机构: University of Thessaly (希腊塞萨利大学); William & Mary (威廉与玛丽学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 8 figures
Abstract:Continuous advancements in deep learning have led to significant progress in feature detection, resulting in enhanced accuracy in tasks like Simultaneous Localization and Mapping (SLAM). Nevertheless, the vulnerability of deep neural networks to adversarial attacks remains a challenge for their reliable deployment in applications, such as navigation of autonomous agents. Even though CNN-based SLAM algorithms are a growing area of research there is a notable absence of a comprehensive presentation and examination of adversarial attacks targeting CNN-based feature detectors, as part of a SLAM system. Our work introduces black-box adversarial perturbations applied to the RGB images fed into the GCN-SLAM algorithm. Our findings on the TUM dataset [30] reveal that even attacks of moderate scale can lead to tracking failure in as many as 76% of the frames. Moreover, our experiments highlight the catastrophic impact of attacking depth instead of RGB input images on the SLAM system.
zh
[CV-27] BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models CVPR
【速读】:该论文旨在解决大规模视觉-语言模型中出现的幻觉(hallucination)问题,即语言模型生成的响应与视觉内容不一致的现象。其解决方案的关键在于提出一种基于归一化流理论的双射最大似然学习(Bijective Maximum Likelihood Learning, BIMA)方法,通过优化解码过程来有效缓解幻觉问题。
链接: https://arxiv.org/abs/2505.24649
作者: Huu-Thien Tran,Thanh-Dat Truong,Khoa Luu
机构: CVIU Lab, University of Arkansas (CVIU 实验室,阿肯色大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPRW 2025, 8 pages, 4 figures
Abstract:Large vision-language models have become widely adopted across various domains. However, developing a trustworthy system with minimal interpretable characteristics of large-scale models presents a significant challenge. One of the most prevalent terms associated with the fallacy functions caused by these systems is hallucination, where the language model generates a response that does not correspond to the visual content. To mitigate this problem, several approaches have been developed, and one prominent direction is to ameliorate the decoding process. In this paper, we propose a new Bijective Maximum Likelihood Learning (BIMA) approach to hallucination mitigation using normalizing flow theories. The proposed BIMA method can efficiently mitigate the hallucination problem in prevailing vision-language models, resulting in significant improvements. Notably, BIMA achieves an average F1 score of 85.06% on the POPE benchmark and remarkably reduces CHAIR_S and CHAIR_I by 7.6% and 2.6%, respectively. To the best of our knowledge, this is one of the first studies that contemplates the bijection means to reduce hallucination induced by large vision-language models.
zh
[CV-28] A Cross Branch Fusion-Based Contrastive Learning Framework for Point Cloud Self-supervised Learning
【速读】:该论文旨在解决点云无监督学习中缺乏有效表示学习方法的问题,特别是在不使用额外训练数据的情况下。其解决方案的关键在于提出了一种基于对比跨分支注意力的框架(Contrastive Cross-branch Attention-based framework for Point cloud data, PoCCA),通过引入子分支实现损失层之前的分支间信息交换,从而增强特征表示的丰富性与判别性。
链接: https://arxiv.org/abs/2505.24641
作者: Chengzhi Wu,Qianliang Huang,Kun Jin,Julius Pfrommer,Jürgen Beyerer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Contrastive learning is an essential method in self-supervised learning. It primarily employs a multi-branch strategy to compare latent representations obtained from different branches and train the encoder. In the case of multi-modal input, diverse modalities of the same object are fed into distinct branches. When using single-modal data, the same input undergoes various augmentations before being fed into different branches. However, all existing contrastive learning frameworks have so far only performed contrastive operations on the learned features at the final loss end, with no information exchange between different branches prior to this stage. In this paper, for point cloud unsupervised learning without the use of extra training data, we propose a Contrastive Cross-branch Attention-based framework for Point cloud data (termed PoCCA), to learn rich 3D point cloud representations. By introducing sub-branches, PoCCA allows information exchange between different branches before the loss end. Experimental results demonstrate that in the case of using no extra training data, the representations learned with our self-supervised model achieve state-of-the-art performances when used for downstream tasks on point clouds.
zh
[CV-29] Cloud Optical Thickness Retrievals Using Angle Invariant Attention Based Deep Learning Models ICIP
【速读】:该论文旨在解决卫星遥感中云光学厚度(Cloud Optical Thickness, COT)估算的准确性问题,特别是在考虑三维云效应、观测角度和大气干扰时的挑战。传统方法如独立像素近似(Independent Pixel Approximation, IPA)因简化假设导致显著偏差,而现有深度学习模型在辐射强度变化、畸变和云阴影下表现不稳定,并且在不同太阳和观测天顶角下引入较大误差。该论文提出的解决方案是构建一种新型的、角度不变的注意力机制深度模型——带角度编码的云注意力网络(Cloud-Attention-Net with Angle Coding, CAAC),其关键在于利用注意力机制和角度嵌入来建模卫星观测几何与三维辐射传输效应,并通过多角度训练策略实现角度不变性,从而显著提升COT估算的精度。
链接: https://arxiv.org/abs/2505.24638
作者: Zahid Hassan Tushar,Adeleke Ademakinwa,Jianwu Wang,Zhibo Zhang,Sanjay Purushotham
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 7 figures, to be published in 2025 IEEE International Conference on Image Processing (ICIP)
Abstract:Cloud Optical Thickness (COT) is a critical cloud property influencing Earth’s climate, weather, and radiation budget. Satellite radiance measurements enable global COT retrieval, but challenges like 3D cloud effects, viewing angles, and atmospheric interference must be addressed to ensure accurate estimation. Traditionally, the Independent Pixel Approximation (IPA) method, which treats individual pixels independently, has been used for COT estimation. However, IPA introduces significant bias due to its simplified assumptions. Recently, deep learning-based models have shown improved performance over IPA but lack robustness, as they are sensitive to variations in radiance intensity, distortions, and cloud shadows. These models also introduce substantial errors in COT estimation under different solar and viewing zenith angles. To address these challenges, we propose a novel angle-invariant, attention-based deep model called Cloud-Attention-Net with Angle Coding (CAAC). Our model leverages attention mechanisms and angle embeddings to account for satellite viewing geometry and 3D radiative transfer effects, enabling more accurate retrieval of COT. Additionally, our multi-angle training strategy ensures angle invariance. Through comprehensive experiments, we demonstrate that CAAC significantly outperforms existing state-of-the-art deep learning models, reducing cloud property retrieval errors by at least a factor of nine.
zh
[CV-30] Category-Level 6D Object Pose Estimation in Agricultural Settings Using a Lattice-Deformation Framework and Diffusion-Augmented Synthetic Data IROS
【速读】:该论文旨在解决农业场景中因水果和蔬菜在形状、尺寸和纹理上具有高类内差异而导致的6D物体位姿估计难题。现有方法大多依赖于实例特定的CAD模型或深度传感器,限制了其在实际农业应用中的可行性。论文提出的PLANTPose框架通过仅使用RGB输入进行类别级6D位姿估计,其关键在于预测相对于基准网格的6D位姿和形变参数,从而使得单一类别级CAD模型能够适应未见过的实例,实现无需依赖实例特定数据的准确位姿估计。
链接: https://arxiv.org/abs/2505.24636
作者: Marios Glytsos,Panagiotis P. Filntisis,George Retsinas,Petros Maragos
机构: Athena Research and Innovation Center (雅典娜研究与创新中心); National Technical University of Athens (希腊雅典国立技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures. Submitted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025. This work has been submitted to the IEEE for possible publication
Abstract:Accurate 6D object pose estimation is essential for robotic grasping and manipulation, particularly in agriculture, where fruits and vegetables exhibit high intra-class variability in shape, size, and texture. The vast majority of existing methods rely on instance-specific CAD models or require depth sensors to resolve geometric ambiguities, making them impractical for real-world agricultural applications. In this work, we introduce PLANTPose, a novel framework for category-level 6D pose estimation that operates purely on RGB input. PLANTPose predicts both the 6D pose and deformation parameters relative to a base mesh, allowing a single category-level CAD model to adapt to unseen instances. This enables accurate pose estimation across varying shapes without relying on instance-specific data. To enhance realism and improve generalization, we also leverage Stable Diffusion to refine synthetic training images with realistic texturing, mimicking variations due to ripeness and environmental factors and bridging the domain gap between synthetic data and the real world. Our evaluations on a challenging benchmark that includes bananas of various shapes, sizes, and ripeness status demonstrate the effectiveness of our framework in handling large intraclass variations while maintaining accurate 6D pose predictions, significantly outperforming the state-of-the-art RGB-based approach MegaPose.
zh
[CV-31] NUC-Net: Non-uniform Cylindrical Partition Network for Efficient LiDAR Semantic Segmentation
【速读】:该论文旨在解决LiDAR语义分割中基于体素的方法存在的两个主要问题:一是需要较大的输入体素分辨率导致计算成本和内存消耗过高;二是无法有效处理LiDAR点云的非均衡点分布问题。其解决方案的关键在于提出了一种名为NUC-Net的非均匀圆柱形划分网络,通过引入等差数列间隔(Arithmetic Progression of Interval, API)方法实现径向轴的非均匀划分,生成更具代表性和效率的体素表示,并结合非均匀多尺度聚合方法提升上下文信息,从而在SemanticKITTI和nuScenes数据集上实现了更优的性能与更高的效率。
链接: https://arxiv.org/abs/2505.24634
作者: Xuzhi Wang,Wei Feng,Lingdong Kong,Liang Wan
机构: Tianjin Normal University(天津师范大学); Tianjin Key Laboratory of Student Mental Health and Intelligence Assessment(天津市学生心理健康与智能评估重点实验室); Tianjin University(天津大学); National University of Singapore(新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:LiDAR semantic segmentation plays a vital role in autonomous driving. Existing voxel-based methods for LiDAR semantic segmentation apply uniform partition to the 3D LiDAR point cloud to form a structured representation based on cartesian/cylindrical coordinates. Although these methods show impressive performance, the drawbacks of existing voxel-based methods remain in two aspects: (1) they require a large input voxel resolution, which brings substantial computation cost and memory consumption; (2) they do not handle well the unbalanced point distribution of LiDAR point clouds. In this paper, we propose a non-uniform cylindrical partition network named NUC-Net to tackle the above challenges. Specifically, we propose the Arithmetic Progression of Interval (API) method to non-uniformly partition the radial axis and generate the voxel representation which is representative and efficient. Moreover, we propose a non-uniform multi-scale aggregation method to improve contextual information. Our method achieves state-of-the-art performance on SemanticKITTI and nuScenes datasets with much faster speed and much less training time. Our method can also serve as a general component for LiDAR semantic segmentation, significantly improving both the accuracy and efficiency of the uniform counterpart, with 4× faster training, 2× less GPU memory, and 3× faster inference. We further provide theoretical analysis towards understanding why NUC is effective and how point distribution affects performance. Code is available at this https URL.
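摘要中"等差数列间隔(API)"式的径向划分可以用几行 NumPy 说明。以下为示意实现:bin 数、最宽/最窄宽度比以及"近处格子细、远处格子粗"的方向均为假设,仅用于说明非均匀划分的构造方式:

```python
import numpy as np

def api_radial_edges(r_max, num_bins, ratio=3.0):
    """等差数列间隔(API)划分示意:各 bin 宽度构成等差数列,总和为 r_max。
    ratio 为最宽/最窄 bin 的宽度比(假设的超参数)。"""
    w0 = 2.0 * r_max / (num_bins * (1.0 + ratio))  # 最窄 bin 宽度
    d = w0 * (ratio - 1.0) / (num_bins - 1)        # 公差
    widths = w0 + d * np.arange(num_bins)          # 等差数列
    return np.concatenate([[0.0], np.cumsum(widths)])

def assign_radial_bin(points_xyz, edges):
    r = np.linalg.norm(points_xyz[:, :2], axis=1)  # 圆柱坐标下的径向距离
    return np.clip(np.searchsorted(edges, r, side="right") - 1, 0, len(edges) - 2)

edges = api_radial_edges(r_max=50.0, num_bins=32)
pts = np.random.uniform(-50, 50, size=(1000, 3))
bins = assign_radial_bin(pts, edges)  # 每个点所属的径向 bin 序号
```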
zh
[CV-32] Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors
【速读】:该论文旨在解决传统方法依赖于全面的3D数据输入(如点云或重建的Bird’s-Eye View地图)来理解三维场景的问题,其核心挑战在于如何直接从视频数据中增强多模态大语言模型(Multimodal Large Language Models, MLLMs)对三维空间的理解与推理能力。解决方案的关键在于提出一种名为Video-3D Geometry Large Language Model (VG LLM) 的新方法,该方法通过3D视觉几何编码器从视频序列中提取三维先验信息,并将其与视觉标记融合后输入MLLM,从而实现无需显式3D数据输入的三维场景理解与空间推理。
链接: https://arxiv.org/abs/2505.24625
作者: Duo Zheng,Shijia Huang,Yanyang Li,Liwei Wang
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird’s-Eye View (BEV) maps. In our research, we advance this field by enhancing the capability of MLLMs to understand and reason in 3D spaces directly from video data, without the need for additional 3D input. We propose a novel and efficient method, the Video-3D Geometry Large Language Model (VG LLM). Our approach employs a 3D visual geometry encoder that extracts 3D prior information from video sequences. This information is integrated with visual tokens and fed into the MLLM. Extensive experiments have shown that our method has achieved substantial improvements in various tasks related to 3D scene understanding and spatial reasoning, all directly learned from video sources. Impressively, our 4B model, which does not rely on explicit 3D data inputs, achieves competitive results compared to existing state-of-the-art methods, and even surpasses the Gemini-1.5-Pro in the VSI-Bench evaluations.
zh
[CV-33] Hyperbolic Dataset Distillation
【速读】:该论文旨在解决大规模数据集在深度学习中带来的计算和存储挑战,通过数据集蒸馏(dataset distillation)生成一个紧凑的数据集以替代原始数据集,同时保持模型性能。现有基于分布匹配(distribution matching, DM)的方法虽然提高了效率,但受限于欧几里得空间,忽略了数据的复杂几何和层次结构。该论文提出的解决方案关键在于引入双曲空间(hyperbolic space),利用其负曲率和指数增长的体积特性,将浅层网络提取的特征嵌入到洛伦兹双曲空间中,并通过计算合成数据与原始数据中心之间的双曲距离来优化层次结构,从而提升数据集蒸馏的效果和训练稳定性。
链接: https://arxiv.org/abs/2505.24623
作者: Wenyuan Li,Guang Li,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama
机构: Hokkaido University (北海道大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:To address the computational and storage challenges posed by large-scale datasets in deep learning, dataset distillation has been proposed to synthesize a compact dataset that replaces the original while maintaining comparable model performance. Unlike optimization-based approaches that require costly bi-level optimization, distribution matching (DM) methods improve efficiency by aligning the distributions of synthetic and original data, thereby eliminating nested optimization. DM achieves high computational efficiency and has emerged as a promising solution. However, existing DM methods, constrained to Euclidean space, treat data as independent and identically distributed points, overlooking complex geometric and hierarchical relationships. To overcome this limitation, we propose a novel hyperbolic dataset distillation method, termed HDD. Hyperbolic space, characterized by negative curvature and exponential volume growth with distance, naturally models hierarchical and tree-like structures. HDD embeds features extracted by a shallow network into the Lorentz hyperbolic space, where the discrepancy between synthetic and original data is measured by the hyperbolic (geodesic) distance between their centroids. By optimizing this distance, the hierarchical structure is explicitly integrated into the distillation process, guiding synthetic samples to gravitate towards the root-centric regions of the original data distribution while preserving their underlying geometric characteristics. Furthermore, we find that pruning in hyperbolic space requires only 20% of the distilled core set to retain model performance, while significantly improving training stability. Notably, HDD is seamlessly compatible with most existing DM methods, and extensive experiments on different datasets validate its effectiveness.
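下面用洛伦兹模型写出摘要中"合成数据与原始数据质心之间的双曲(测地)距离"这一核心量的示意计算(曲率取 -1、以洛伦兹质心代表 centroid,均为本草图的假设,非论文原始实现):

```python
import torch

def lorentz_inner(x, y):
    # 洛伦兹内积 <x,y>_L = -x0*y0 + sum_i xi*yi
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def exp_map_origin(v):
    # 将欧氏特征经原点处的指数映射嵌入洛伦兹双曲面
    n = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    return torch.cat([torch.cosh(n), torch.sinh(n) * v / n], dim=-1)

def lorentz_centroid(x):
    # 洛伦兹质心:先取均值,再投影回双曲面
    m = x.mean(0, keepdim=True)
    return m / torch.sqrt((-lorentz_inner(m, m)).clamp_min(1e-7)).unsqueeze(-1)

def hyperbolic_distance(x, y):
    # 测地距离 d(x,y) = arccosh(-<x,y>_L)
    return torch.acosh((-lorentz_inner(x, y)).clamp_min(1.0 + 1e-7))

# 质心匹配损失示意:原始特征与合成(蒸馏)特征的双曲质心距离
feat_real = exp_map_origin(0.1 * torch.randn(128, 64))
feat_syn = exp_map_origin(0.1 * torch.randn(16, 64))
loss = hyperbolic_distance(lorentz_centroid(feat_real), lorentz_centroid(feat_syn))
```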
zh
[CV-34] GARLIC: GAussian Representation LearnIng for spaCe partitioning
【速读】:该论文试图解决高维向量空间中高效索引与检索的问题,特别是在处理高维数据时如何平衡搜索速度与准确性。其解决方案的关键在于提出GARLIC(GAussian Representation LearnIng for spaCe partitioning),该结构基于N维高斯分布,通过信息论目标优化高斯参数,以实现覆盖性、置信度和结构语义一致性的平衡,并通过分裂与克隆操作逐步细化表示,从而有效处理不同数据密度。此方法在保持传统空间划分方法快速构建时间的同时,实现了较高的召回率,展现出在近似最近邻检索和分类任务中的优越性能。
链接: https://arxiv.org/abs/2505.24608
作者: Panagiotis Rigas,Panagiotis Drivas,Charalambos Tzamos,Ioannis Chamodrakas,George Ioannakis,Leonidas J. Guibas,Ioannis Z. Emiris
机构: Dept. of Informatics & Telecommunications, National and Kapodistrian University of Athens (信息学与电信系,雅典国立卡波迪斯特里亚大学); Archimedes, Athena Research Center (阿基米德,雅典娜研究中心); Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague (视觉识别组,电气工程学院,布拉格捷克技术大学); Institute for Language and Speech Processing, Athena Research Center (语言与语音处理研究所,雅典娜研究中心); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce GARLIC (GAussian Representation LearnIng for spaCe partitioning), a novel indexing structure based on N-dimensional Gaussians for efficiently learning high-dimensional vector spaces. Our approach is inspired by Gaussian splatting techniques, typically used in 3D rendering, which we adapt for high-dimensional search and classification. We optimize Gaussian parameters using information-theoretic objectives that balance coverage, assignment confidence, and structural and semantic consistency. A key contribution is to progressively refine the representation through split and clone operations, handling hundreds of dimensions, thus handling varying data densities. GARLIC offers the fast building times of traditional space partitioning methods (e.g., under ~5 min build time for SIFT1M) while achieving ~50% Recall10@10 in low-candidate regimes. Experimental results on standard benchmarks demonstrate our method’s consistency in (a) k-NN retrieval, outperforming methods, such as Faiss-IVF, in fast-recall by using about half their probes for the same Recall10@10 in Fashion-MNIST, and (b) in classification tasks, beating by ~15% accuracy other majority voting methods. Further, we show strong generalization capabilities, maintaining high accuracy even with downsampled training data: using just 1% of the training data returns ~45% Recall@1, thus making GARLIC quite powerful for applications requiring both speed and accuracy.
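以 N 维高斯做空间划分的最小示意如下:按各高斯下的对数似然给点打分,并把点归入得分最高的高斯(仅示意分配这一步;信息论目标与 split/clone 细化均省略,维度与高斯个数为假设):

```python
import numpy as np

def gaussian_log_scores(x, means, covs):
    """每个点在各 N 维高斯下的对数似然,作为分区/候选选择的打分。"""
    scores = []
    for mu, cov in zip(means, covs):
        diff = x - mu
        inv = np.linalg.inv(cov)
        maha = np.einsum("nd,de,ne->n", diff, inv, diff)  # 马氏距离平方
        logdet = np.linalg.slogdet(cov)[1]
        scores.append(-0.5 * (maha + logdet + len(mu) * np.log(2 * np.pi)))
    return np.stack(scores, axis=1)  # (N_points, N_gaussians)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
means = [rng.normal(size=8) for _ in range(16)]
covs = [np.eye(8) for _ in range(16)]
assign = gaussian_log_scores(X, means, covs).argmax(1)  # 每个点归入得分最高的高斯
```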
zh
[CV-35] SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition
【速读】:该论文试图解决阿拉伯语光学字符识别(Arabic Optical Character Recognition, OCR)中缺乏大规模、多样化且结构化数据集的问题,这些问题限制了现代OCR模型,尤其是强大的视觉-语言模型的训练效果。解决方案的关键在于引入SARD(Large-Scale Synthetic Arabic OCR Dataset),这是一个大规模的合成数据集,专门设计用于模拟书籍样式的文档,包含843,622张文档图像和6.9亿个单词,并覆盖十种不同的阿拉伯字体,以确保广泛的字体覆盖率。SARD的合成特性使其免受真实世界噪声和失真的影响,提供了干净且可控的模型训练环境,同时具备可扩展性和对版式与内容变化的精确控制。
链接: https://arxiv.org/abs/2505.24600
作者: Omer Nacar,Yasser Al-Habashi,Serry Sibaee,Adel Ammar,Wadii Boulila
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Arabic Optical Character Recognition (OCR) is essential for converting vast amounts of Arabic print media into digital formats. However, training modern OCR models, especially powerful vision-language models, is hampered by the lack of large, diverse, and well-structured datasets that mimic real-world book layouts. Existing Arabic OCR datasets often focus on isolated words or lines or are limited in scale, typographic variety, or structural complexity found in books. To address this significant gap, we introduce SARD (Large-Scale Synthetic Arabic OCR Dataset). SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents. It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage. Unlike datasets derived from scanned documents, SARD is free from real-world noise and distortions, offering a clean and controlled environment for model training. Its synthetic nature provides unparalleled scalability and allows for precise control over layout and content variation. We detail the dataset’s composition and generation process and provide benchmark results for several OCR models, including traditional and deep learning approaches, highlighting the challenges and opportunities presented by this dataset. SARD serves as a valuable resource for developing and evaluating robust OCR and vision-language models capable of processing diverse Arabic book-style texts.
zh
[CV-36] Unleashing the Power of Intermediate Domains for Mixed Domain Semi-Supervised Medical Image Segmentation
【速读】:该论文旨在解决医学图像分割中同时存在的有限标注数据和领域偏移(domain shift)问题,提出了一种新的挑战性场景:混合领域半监督医学图像分割(Mixed Domain Semi-supervised medical image Segmentation, MiDSS)。其解决方案的关键在于提出UST-RUN框架,该框架通过充分利用中间领域信息促进知识迁移,采用统一复制粘贴(Unified Copy-paste, UCP)构建中间领域,并引入对称引导训练策略(Symmetric GuiDance, SymGD)通过融合中间样本的伪标签来监督无标签数据,同时结合训练过程感知的随机幅度混合(Training Process aware Random Amplitude MixUp, TP-RAM)逐步引入风格转换组件,从而生成更多样化的中间样本,提升模型性能。
链接: https://arxiv.org/abs/2505.24567
作者: Qinghe Ma,Jian Zhang,Lei Qi,Qian Yu,Yinghuan Shi,Yang Gao
机构: State Key Laboratory for Novel Software Technology, Nanjing University; National Institute of Healthcare Data Science, Nanjing University; Nanjing Drum Tower Hospital, Nanjing, Jiangsu; School of Computer Science and Engineering, Southeast University; Key Lab of Computer Network and Information Integration (Ministry of Education), Southeast University; School of Data and Computer Science, Shandong Women’s University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TMI 2025. arXiv admin note: text overlap with arXiv:2404.08951
Abstract:Both limited annotation and domain shift are prevalent challenges in medical image segmentation. Traditional semi-supervised segmentation and unsupervised domain adaptation methods address one of these issues separately. However, the coexistence of limited annotation and domain shift is quite common, which motivates us to introduce a novel and challenging scenario: Mixed Domain Semi-supervised medical image Segmentation (MiDSS), where limited labeled data from a single domain and a large amount of unlabeled data from multiple domains. To tackle this issue, we propose the UST-RUN framework, which fully leverages intermediate domain information to facilitate knowledge transfer. We employ Unified Copy-paste (UCP) to construct intermediate domains, and propose a Symmetric GuiDance training strategy (SymGD) to supervise unlabeled data by merging pseudo-labels from intermediate samples. Subsequently, we introduce a Training Process aware Random Amplitude MixUp (TP-RAM) to progressively incorporate style-transition components into intermediate samples. To generate more diverse intermediate samples, we further select reliable samples with high-quality pseudo-labels, which are then mixed with other unlabeled data. Additionally, we generate sophisticated intermediate samples with high-quality pseudo-labels for unreliable samples, ensuring effective knowledge transfer for them. Extensive experiments on four public datasets demonstrate the superiority of UST-RUN. Notably, UST-RUN achieves a 12.94% improvement in Dice score on the Prostate dataset. Our code is available at this https URL
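摘要中的统一复制粘贴(UCP)可用如下玩具代码示意:把有标签图像的一块区域连同其标签贴到无标签图像及其伪标签上,得到"中间域"样本(区域取矩形、粘贴方向与混合比例均为假设):

```python
import torch

def unified_copy_paste(img_lab, mask_lab, img_unl, pseudo_unl, ratio=0.5):
    """UCP 示意:图像与监督信号(真标签/伪标签)按同一方式拼接。"""
    _, H, W = img_lab.shape
    h, w = int(H * ratio), int(W * ratio)
    y = torch.randint(0, H - h + 1, (1,)).item()
    x = torch.randint(0, W - w + 1, (1,)).item()
    img_mix, tgt_mix = img_unl.clone(), pseudo_unl.clone()
    img_mix[:, y:y + h, x:x + w] = img_lab[:, y:y + h, x:x + w]
    tgt_mix[y:y + h, x:x + w] = mask_lab[y:y + h, x:x + w]
    return img_mix, tgt_mix

img_l, m_l = torch.randn(3, 256, 256), torch.randint(0, 4, (256, 256))
img_u, m_u = torch.randn(3, 256, 256), torch.randint(0, 4, (256, 256))
img_mid, tgt_mid = unified_copy_paste(img_l, m_l, img_u, m_u)  # 中间域样本
```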
zh
[CV-37] Optimal Weighted Convolution for Classification and Denoising
【速读】:该论文试图解决传统卷积神经网络(CNN)在空间特征表征和特征提取上的局限性,通过引入一种新颖的加权卷积算子来增强其性能。该解决方案的关键在于将空间密度函数整合到卷积操作中,使网络能够根据邻近像素相对于参考像素的位置差异进行差异化加权,从而提升空间表征能力。该方法保持了与现有CNN架构相同的可训练参数数量,并且具有良好的通用性,适用于任意维度的规则网格信号。
链接: https://arxiv.org/abs/2505.24558
作者: Simone Cammarasana,Giuseppe Patanè
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 3 figures, 6 tables
Abstract:We introduce a novel weighted convolution operator that enhances traditional convolutional neural networks (CNNs) by integrating a spatial density function into the convolution operator. This extension enables the network to differentially weight neighbouring pixels based on their relative position to the reference pixel, improving spatial characterisation and feature extraction. The proposed operator maintains the same number of trainable parameters and is fully compatible with existing CNN architectures. Although developed for 2D image data, the framework is generalisable to signals on regular grids of arbitrary dimensions, such as 3D volumetric data or 1D time series. We propose an efficient implementation of the weighted convolution by pre-computing the density function and achieving execution times comparable to standard convolution layers. We evaluate our method on two deep learning tasks: image classification using the CIFAR-100 dataset [KH+09] and image denoising using the DIV2K dataset [AT17]. Experimental results with state-of-the-art classification (e.g., VGG [SZ15], ResNet [HZRS16]) and denoising (e.g., DnCNN [ZZC+17], NAFNet [CCZS22]) methods show that the weighted convolution improves performance with respect to standard convolution across different quantitative metrics. For example, VGG achieves an accuracy of 66.94% with weighted convolution versus 56.89% with standard convolution on the classification problem, while DnCNN improves the PSNR value from 20.17 to 22.63 on the denoising problem. All models were trained on the CINECA Leonardo cluster to reduce the execution time and improve the tuning of the density function values. The PyTorch implementation of the weighted convolution is publicly available at: this https URL.
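摘要所述"以空间密度函数缩放邻域像素贡献"的加权卷积,可做如下极简示意(密度取高斯形式、sigma 取值均为假设;密度仅依赖相对位置、预先计算,可训练参数量与普通卷积一致):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedConv2d(nn.Module):
    """加权卷积示意:用预计算的空间密度对卷积核各邻域位置加权。"""
    def __init__(self, in_ch, out_ch, k=3, sigma=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # 预计算密度:只依赖与中心像素的相对位置,不参与训练
        r = torch.arange(k) - (k - 1) / 2
        yy, xx = torch.meshgrid(r, r, indexing="ij")
        density = torch.exp(-(xx**2 + yy**2) / (2 * sigma**2))
        self.register_buffer("density", density / density.sum())

    def forward(self, x):
        w = self.weight * self.density  # (out,in,k,k) 与 (k,k) 广播相乘
        return F.conv2d(x, w, self.bias, padding="same")

y = WeightedConv2d(3, 16)(torch.randn(2, 3, 32, 32))
```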
zh
[CV-38] Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理多样化任务领域时,单一视觉编码器(vision encoder)难以适应复杂视觉场景并导致领域冲突的问题。其解决方案的关键在于提出一种高效的视觉专家混合架构(Mixpert),该架构在保持单个视觉编码器联合学习优势的基础上,重构为多专家范式,支持跨不同视觉任务的任务特定微调,并通过动态路由机制将输入图像分配给最合适的视觉专家,从而在最小额外计算成本下有效缓解多任务学习中的领域冲突。
链接: https://arxiv.org/abs/2505.24541
作者: Xin He,Xumeng Han,Longhui Wei,Lingxi Xie,Qi Tian
机构: Huawei Inc.(华为公司); University of Chinese Academy of Sciences(中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models (MLLMs) require a nuanced interpretation of complex image information, typically leveraging a vision encoder to perceive various visual scenarios. However, relying solely on a single vision encoder to handle diverse task domains proves difficult and inevitably leads to conflicts. Recent work enhances data perception by directly integrating multiple domain-specific vision encoders, yet this structure adds complexity and limits the potential for joint optimization. In this paper, we introduce Mixpert, an efficient mixture-of-vision-experts architecture that inherits the joint learning advantages from a single vision encoder while being restructured into a multi-expert paradigm for task-specific fine-tuning across different visual tasks. Additionally, we design a dynamic routing mechanism that allocates input images to the most suitable visual expert. Mixpert effectively alleviates domain conflicts encountered by a single vision encoder in multi-task learning with minimal additional computational cost, making it more efficient than multiple encoders. Furthermore, Mixpert integrates seamlessly into any MLLM, with experimental results demonstrating substantial performance gains across various tasks.
zh
[CV-39] Geospatial Foundation Models to Enable Progress on Sustainable Development Goals
【速读】:该论文试图解决当前地学基础模型(Foundation Models, FMs)在实际应用中的有效性及其与全球可持续发展目标(Sustainable Development Goals, SDGs)的对齐程度不足的问题。其解决方案的关键在于提出SustainFM,一个基于17个SDGs的综合性基准框架,涵盖从资产财富预测到环境灾害检测等多种任务,通过多维度评估FMs的性能,包括泛化能力、可迁移性和能效,并倡导从以模型为中心的开发向以影响为导向的部署转变,强调能源效率、领域漂移鲁棒性及伦理考量等关键指标。
链接: https://arxiv.org/abs/2505.24528
作者: Pedram Ghamisi,Weikang Yu,Xiaokang Zhang,Aldino Rizaldy,Jian Wang,Chufeng Zhou,Richard Gloaguen,Gustau Camps-Valls
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Foundation Models (FMs) are large-scale, pre-trained AI systems that have revolutionized natural language processing and computer vision, and are now advancing geospatial analysis and Earth Observation (EO). They promise improved generalization across tasks, scalability, and efficient adaptation with minimal labeled data. However, despite the rapid proliferation of geospatial FMs, their real-world utility and alignment with global sustainability goals remain underexplored. We introduce SustainFM, a comprehensive benchmarking framework grounded in the 17 Sustainable Development Goals with extremely diverse tasks ranging from asset wealth prediction to environmental hazard detection. This study provides a rigorous, interdisciplinary assessment of geospatial FMs and offers critical insights into their role in attaining sustainability goals. Our findings show: (1) While not universally superior, FMs often outperform traditional approaches across diverse tasks and datasets. (2) Evaluating FMs should go beyond accuracy to include transferability, generalization, and energy efficiency as key criteria for their responsible use. (3) FMs enable scalable, SDG-grounded solutions, offering broad utility for tackling complex sustainability challenges. Critically, we advocate for a paradigm shift from model-centric development to impact-driven deployment, and emphasize metrics such as energy efficiency, robustness to domain shifts, and ethical considerations.
zh
[CV-40] Optimal Density Functions for Weighted Convolution in Learning Models
【速读】:该论文旨在解决传统卷积在处理定义在规则网格上的信号(如2D图像)时,因对邻近像素采用均匀权重而导致的近似精度不足的问题。其解决方案的关键在于引入加权卷积(weighted convolution),通过应用一个最优密度函数来根据邻近像素与中心像素的距离调整它们的贡献比例,从而提升卷积操作的准确性。该方法通过构建一个优化模型,将卷积核权重的优化(使用随机梯度下降)与密度函数的优化(使用DIRECT-L算法)分离,实现了对图像到图像任务(如图像去噪)中损失函数的显著降低和测试准确性的提升。
链接: https://arxiv.org/abs/2505.24527
作者: Simone Cammarasana,Giuseppe Patanè
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 figures, 5 tables, 21 pages
Abstract:The paper introduces the weighted convolution, a novel approach to the convolution for signals defined on regular grids (e.g., 2D images) through the application of an optimal density function to scale the contribution of neighbouring pixels based on their distance from the central pixel. This choice differs from the traditional uniform convolution, which treats all neighbouring pixels equally. Our weighted convolution can be applied to convolutional neural network problems to improve the approximation accuracy. Given a convolutional network, we define a framework to compute the optimal density function through a minimisation model. The framework separates the optimisation of the convolutional kernel weights (using stochastic gradient descent) from the optimisation of the density function (using DIRECT-L). Experimental results on a learning model for an image-to-image task (e.g., image denoising) show that the weighted convolution significantly reduces the loss (up to 53% improvement) and increases the test accuracy compared to standard convolution. While this method increases execution time by 11%, it is robust across several hyperparameters of the learning model. Future work will apply the weighted convolution to real-case 2D and 3D image convolutional learning problems.
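摘要将卷积核权重(SGD)与密度函数(DIRECT-L)的优化解耦;下面用一个玩具任务示意这种交替结构:SciPy 的 direct(locally_biased=True 即 DIRECT-L)作外层、若干步 SGD 作内层。任务形式、步数与把密度参数化为单个 sigma 均为假设:

```python
import torch
import torch.nn.functional as F
from scipy.optimize import direct, Bounds

torch.manual_seed(0)
x = torch.randn(64, 1, 16, 16)
target = F.avg_pool2d(x, 3, stride=1, padding=1)  # 玩具任务:学习一个平滑算子

def val_loss_for_sigma(sigma):
    # 内层:固定由 sigma 决定的密度,SGD 训练卷积核若干步,返回最终损失
    s = float(sigma[0])
    r = torch.arange(3) - 1.0
    yy, xx = torch.meshgrid(r, r, indexing="ij")
    density = torch.exp(-(xx**2 + yy**2) / (2 * s**2))
    density = (density / density.sum()).view(1, 1, 3, 3)
    w = torch.zeros(1, 1, 3, 3, requires_grad=True)
    opt = torch.optim.SGD([w], lr=0.1)
    for _ in range(100):
        loss = F.mse_loss(F.conv2d(x, w * density, padding=1), target)
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# 外层:DIRECT-L 在 [0.2, 3.0] 内搜索最优 sigma
res = direct(val_loss_for_sigma, Bounds([0.2], [3.0]), maxfun=40, locally_biased=True)
print("approx. optimal sigma:", res.x)
```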
zh
[CV-41] UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation
【速读】:该论文旨在解决单目几何估计(如深度和法线)中缺乏跨帧一致性的问题,现有方法通常仅关注单个视频帧内的相机坐标系下的几何属性估计,而忽略了扩散模型在确定帧间对应关系方面的固有能力。解决方案的关键在于通过适当的设计和微调,利用视频生成模型的内在一致性实现一致的几何估计,具体包括:选择与视频帧具有相同对应关系的全局坐标系下的几何属性作为预测目标、引入一种基于重用位置编码的新颖高效条件化方法,以及通过共享相同对应关系的多个几何属性进行联合训练以提升性能。
链接: https://arxiv.org/abs/2505.24521
作者: Yang-Tian Sun,Xin Yu,Zehuan Huang,Yi-Hua Huang,Yuan-Chen Guo,Ziyi Yang,Yan-Pei Cao,Xiaojuan Qi
机构: The University of Hong Kong (香港大学); Beihang University (北京航空航天大学); VAST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Recently, methods leveraging diffusion model priors to assist monocular geometric estimation (e.g., depth and normal) have gained significant attention due to their strong generalization ability. However, most existing works focus on estimating geometric properties within the camera coordinate system of individual video frames, neglecting the inherent ability of diffusion models to determine inter-frame correspondence. In this work, we demonstrate that, through appropriate design and fine-tuning, the intrinsic consistency of video generation models can be effectively harnessed for consistent geometric estimation. Specifically, we 1) select geometric attributes in the global coordinate system that share the same correspondence with video frames as the prediction targets, 2) introduce a novel and efficient conditioning method by reusing positional encodings, and 3) enhance performance through joint training on multiple geometric attributes that share the same correspondence. Our results achieve superior performance in predicting global geometric attributes in videos and can be directly applied to reconstruction tasks. Even when trained solely on static video data, our approach exhibits the potential to generalize to dynamic video scenes.
zh
[CV-42] un²CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
【速读】:该论文试图解决现有Contrastive Language-Image Pre-training (CLIP)模型在区分图像中细节差异以及在密集预测和以视觉为中心的多模态任务中表现不佳的问题。解决方案的关键在于利用一种生成式模型——unCLIP,其通过条件输入空间与CLIP的原始图像-文本嵌入空间对齐,能够更好地捕捉图像细节。论文提出将unCLIP进行反向操作(称为un²CLIP),以增强CLIP图像编码器的视觉细节捕获能力,同时保持其与原始文本编码器的一致性。
链接: https://arxiv.org/abs/2505.24517
作者: Yinqi Li,Jiahe Zhao,Hong Chang,Ruibing Hou,Shiguang Shan,Xilin Chen
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. Therefore, this work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative models, unCLIP, provides a suitable framework for achieving our goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding. In other words, it inverts the CLIP image encoder. Compared to discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. Additionally, the conditional input space of unCLIP aligns with CLIP’s original image-text embedding space. Therefore, we propose to invert unCLIP (dubbed un²CLIP) to improve the CLIP model. In this way, the improved image encoder can gain unCLIP’s visual detail capturing ability while preserving its alignment with the original text encoder simultaneously. We evaluate our improved CLIP across various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un²CLIP significantly improves the original CLIP and previous CLIP improvement methods. Code and models will be available at this https URL.
zh
[CV-43] Reason -SVG: Hybrid Reward RL for Aha-Moments in Vector Graphics Generation
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成高质量可缩放矢量图形(Scalable Vector Graphics, SVGs)时面临的结构性有效性、语义忠实性和视觉一致性等挑战,这些问题通常超出当前LLMs的能力范围。其解决方案的关键在于提出一种名为Reason-SVG的框架,该框架引入了“带思考的绘画”(Drawing-with-Thought, DwT)范式,使模型在生成SVG代码的同时输出显式的设计推理过程,从而模拟人类的创作流程,并通过两阶段训练策略(监督微调和基于组相对策略优化的强化学习)以及混合奖励函数来提升模型的推理能力和生成质量。
链接: https://arxiv.org/abs/2505.24499
作者: Ximing Xing,Yandong Guan,Jing Zhang,Dong Xu,Qian Yu
机构: Beihang University (北京航空航天大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 5 figures
Abstract:Generating high-quality Scalable Vector Graphics (SVGs) is challenging for Large Language Models (LLMs), as it requires advanced reasoning for structural validity, semantic faithfulness, and visual coherence, capabilities in which current LLMs often fall short. In this work, we introduce Reason-SVG, a novel framework designed to enhance LLM reasoning for SVG generation. Reason-SVG pioneers the “Drawing-with-Thought” (DwT) paradigm, in which models generate both SVG code and explicit design rationales, mimicking the human creative process. Reason-SVG adopts a two-stage training strategy: First, Supervised Fine-Tuning (SFT) trains the LLM on the DwT paradigm to activate foundational reasoning abilities. Second, Reinforcement Learning (RL), utilizing Group Relative Policy Optimization (GRPO), empowers the model to generate both SVGs and DwT rationales through refined, reward-driven reasoning. To facilitate reasoning-driven SVG generation, we design a Hybrid Reward function that evaluates the presence and utility of DwT reasoning, along with structural validity, semantic alignment, and visual quality. We also introduce the SVGX-DwT-10k dataset, a high-quality corpus of 10,000 SVG-DwT pairs, where each SVG code is generated based on explicit DwT reasoning. By integrating DwT, SFT, and Hybrid Reward-guided RL, Reason-SVG significantly improves LLM performance in generating accurate and visually compelling SVGs, potentially fostering “Aha moments” in design.
zh
[CV-44] Deformable Attention Mechanisms Applied to Object Detection case of Remote Sensing
【速读】:该论文旨在解决遥感图像中目标检测的问题,特别是在光学和合成孔径雷达(SAR)图像中的对象识别任务。其解决方案的关键在于应用一种基于可变形注意力机制的Deformable-DETR模型,该模型在两种不同模态的遥感数据集(Pleiades Aircraft数据集和SAR Ship Detection Dataset,SSDD)上进行了验证,并表现出优越的检测性能,分别取得了95.12%和94.54%的F1分数。
链接: https://arxiv.org/abs/2505.24489
作者: Anasse Boutayeb,Iyad Lahsen-cherif,Ahmed El Khadimi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, paper accepted at the 29th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems (KES 2025), Osaka, Japan
Abstract:Object detection has recently seen a wave of innovative research, and the task is of particular importance in the field of remote sensing, given the scope of these images in terms of geographical coverage and the objects they contain. Furthermore, Deep Learning (DL) models, in particular those based on Transformers, are especially relevant for visual computing tasks in general, and target detection in particular. Thus, the present work applies the Deformable-DETR model, a specific architecture using deformable attention mechanisms, to remote sensing images in two different modalities, namely optical and Synthetic Aperture Radar (SAR). To achieve this objective, two datasets are used: one optical, the Pleiades Aircraft dataset, and one SAR, the SAR Ship Detection Dataset (SSDD). The results of a 10-fold stratified validation show that the proposed model performs particularly well, obtaining an F1 score of 95.12% on the optical dataset and 94.54% on SSDD. These results are also compared with several detection models, especially those based on CNNs and transformers, as well as those specifically designed to detect different object classes in remote sensing images.
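Deformable-DETR 的核心是可变形注意力:每个查询只在少量可学习偏移位置上采样特征,而非对全图做稠密注意力。以下为单头、单尺度的简化示意(省略多尺度与多头,偏移幅度 0.1、采样点数等超参数为假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """可变形注意力核心思想的简化示意:预测偏移与权重后双线性采样并加权求和。"""
    def __init__(self, dim, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset = nn.Linear(dim, num_points * 2)
        self.weight = nn.Linear(dim, num_points)
        self.value = nn.Conv2d(dim, dim, 1)

    def forward(self, query, ref_xy, feat):
        # query: (B, Q, C); ref_xy: (B, Q, 2) 归一化到 [-1,1]; feat: (B, C, H, W)
        B, Q, C = query.shape
        off = self.offset(query).view(B, Q, self.num_points, 2).tanh() * 0.1
        attn = self.weight(query).softmax(-1)                 # (B, Q, K)
        grid = (ref_xy.unsqueeze(2) + off).clamp(-1, 1)       # (B, Q, K, 2)
        v = F.grid_sample(self.value(feat), grid, align_corners=False)  # (B, C, Q, K)
        return (v * attn.unsqueeze(1)).sum(-1).transpose(1, 2)  # (B, Q, C)

m = DeformableSampling(64)
out = m(torch.randn(2, 100, 64), torch.rand(2, 100, 2) * 2 - 1, torch.randn(2, 64, 32, 32))
```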
zh
[CV-45] ACM-UNet: Adaptive Integration of CNNs and Mamba for Efficient Medical Image Segmentation
【速读】:该论文旨在解决医学图像分割中现有U型编码器-解码器架构在利用预训练视觉主干(如ResNet、ViT、VMamba)时因结构不匹配而导致的性能受限问题。其解决方案的关键在于提出ACM-UNet框架,该框架通过轻量级适配器机制有效融合预训练的卷积神经网络(CNN)和状态空间模型(SSMs)Mamba,从而克服结构不兼容性,并充分利用CNN在细粒度局部细节提取与SSMs在长程依赖建模方面的互补优势。
链接: https://arxiv.org/abs/2505.24481
作者: Jing Huang,Yongkang Zhao,Yuhan Li,Zhitao Dai,Cheng Chen,Qiying Lai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures, 5 tables
Abstract:The U-shaped encoder-decoder architecture with skip connections has become a prevailing paradigm in medical image segmentation due to its simplicity and effectiveness. While many recent works aim to improve this framework by designing more powerful encoders and decoders, employing advanced convolutional neural networks (CNNs) for local feature extraction, Transformers or state space models (SSMs) such as Mamba for global context modeling, or hybrid combinations of both, these methods often struggle to fully utilize pretrained vision backbones (e.g., ResNet, ViT, VMamba) due to structural mismatches. To bridge this gap, we introduce ACM-UNet, a general-purpose segmentation framework that retains a simple UNet-like design while effectively incorporating pretrained CNNs and Mamba models through a lightweight adapter mechanism. This adapter resolves architectural incompatibilities and enables the model to harness the complementary strengths of CNNs and SSMs-namely, fine-grained local detail extraction and long-range dependency modeling. Additionally, we propose a hierarchical multi-scale wavelet transform module in the decoder to enhance feature fusion and reconstruction fidelity. Extensive experiments on the Synapse and ACDC benchmarks demonstrate that ACM-UNet achieves state-of-the-art performance while remaining computationally efficient. Notably, it reaches 85.12% Dice Score and 13.89mm HD95 on the Synapse dataset with 17.93G FLOPs, showcasing its effectiveness and scalability. Code is available at: this https URL.
zh
[CV-46] Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model
【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理周期性或准周期性任务时存在的问题,主要包括缺乏时间建模能力和短周期与长周期之间的冲突。解决方案的关键在于提出Period-LLM,这是一种专门设计用于提升跨模态周期性任务性能的多模态大语言模型,并采用“从易到难泛化”范式逐步构建模型的周期性推理能力,同时引入“抵抗逻辑遗忘”优化策略以在语义对齐过程中保持周期性推理能力。
链接: https://arxiv.org/abs/2505.24476
作者: Yuting Zhang,Hao Lu,Qingyong Hu,Yin Wang,Kaishen Yuan,Xin Liu,Kaishun Wu
机构: The Hong Kong University of Science & Technology (Guangzhou); The Hong Kong University of Science & Technology; Zhejiang University; Lappeenranta-Lahti University of Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025
Abstract:Periodic or quasi-periodic phenomena reveal intrinsic characteristics in various natural processes, such as weather patterns, movement behaviors, traffic flows, and biological signals. Given that these phenomena span multiple modalities, the capabilities of Multimodal Large Language Models (MLLMs) offer promising potential to effectively capture and understand their complex nature. However, current MLLMs struggle with periodic tasks due to limitations in: 1) lack of temporal modelling and 2) conflict between short and long periods. This paper introduces Period-LLM, a multimodal large language model designed to enhance the performance of periodic tasks across various modalities, and constructs a benchmark of various difficulty for evaluating the cross-modal periodic capabilities of large models. Specially, We adopt an “Easy to Hard Generalization” paradigm, starting with relatively simple text-based tasks and progressing to more complex visual and multimodal tasks, ensuring that the model gradually builds robust periodic reasoning capabilities. Additionally, we propose a “Resisting Logical Oblivion” optimization strategy to maintain periodic reasoning abilities during semantic alignment. Extensive experiments demonstrate the superiority of the proposed Period-LLM over existing MLLMs in periodic tasks. The code is available at this https URL.
zh
[CV-47] SPPSFormer: High-quality Superpoint-based Transformer for Roof Plane Instance Segmentation from Point Clouds
【速读】:该论文旨在解决点云屋顶平面实例分割中Transformer应用较少以及现有超点(superpoint)质量不足导致性能受限的问题。其解决方案的关键在于建立高质量超点的两个标准,并引入一种两阶段超点生成过程,以确保生成的超点具有精确的边界、一致的几何尺寸和形状,从而提升超点Transformer的特征学习效果。此外,通过引入多维手工特征以弥补小样本下深度学习特征的不足,并设计结合Kolmogorov-Arnold网络与Transformer模块的解码器,进一步提升实例预测和掩码提取性能。
链接: https://arxiv.org/abs/2505.24475
作者: Cheng Zeng,Xiatian Qi,Chi Chen,Kai Sun,Wangle Zhang,Yuxuan Liu,Yan Meng,Bisheng Yang
机构: Hubei University (湖北大学); Wuhan University (武汉大学); Hubei Water Resources Research Institute (湖北省水利水电科学研究院); Chinese Academy of Surveying and Mapping (中国测绘科学研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 8 figures
Abstract:Transformers have been seldom employed in point cloud roof plane instance segmentation, which is the focus of this study, and existing superpoint Transformers suffer from limited performance due to the use of low-quality superpoints. To address this challenge, we establish two criteria that high-quality superpoints for Transformers should satisfy and introduce a corresponding two-stage superpoint generation process. The superpoints generated by our method not only have accurate boundaries, but also exhibit consistent geometric sizes and shapes, both of which greatly benefit the feature learning of superpoint Transformers. To compensate for the limitations of deep learning features when the training set size is limited, we incorporate multidimensional handcrafted features into the model. Additionally, we design a decoder that combines a Kolmogorov-Arnold Network with a Transformer module to improve instance prediction and mask extraction. Finally, our network’s predictions are refined using traditional algorithm-based postprocessing. For evaluation, we annotated a real-world dataset and corrected annotation errors in the existing RoofN3D dataset. Experimental results show that our method achieves state-of-the-art performance on our dataset, as well as both the original and reannotated RoofN3D datasets. Moreover, our model is not sensitive to plane boundary annotations during training, significantly reducing the annotation burden. Through comprehensive experiments, we also identified key factors influencing roof plane segmentation performance: in addition to roof types, variations in point cloud density, density uniformity, and 3D point precision have a considerable impact. These findings underscore the importance of incorporating data augmentation strategies that account for point cloud quality to enhance model robustness under diverse and challenging conditions.
zh
[CV-48] SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking
【速读】:该论文旨在解决基于文本的行人重识别(text-based person retrieval)问题,即从图像画廊中根据自然语言描述识别目标个体。现有方法主要关注外观相关的跨模态检索,而忽略了场景中的上下文信息,这可能为检索提供有价值的补充信息。该论文的关键解决方案是提出SA-Person框架,其包含两个阶段:第一阶段通过将文本线索与行人特定区域对齐来实现区分性的外观定位;第二阶段引入SceneRanker,一种无需训练的、场景感知的重排序方法,利用多模态大语言模型联合推理行人外观和全局场景上下文。
链接: https://arxiv.org/abs/2505.24466
作者: Yingjia Xu,Jinlin Wu,Zhen Chen,Daming Gao,Yang Yang,Zhen Lei,Min Cao
机构: Soochow University (苏州大学); Centre for Artificial Intelligence and Robotics, HKISI-CAS (人工智能与机器人中心,HKISI-CAS); MAIS, Institute of Automation, Chinese Academy of Sciences (MAIS,自动化研究所,中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 7 figures. Under review
Abstract:Text-based person retrieval aims to identify a target individual from a gallery of images based on a natural language description. It presents a significant challenge due to the complexity of real-world scenes and the ambiguity of appearance-related descriptions. Existing methods primarily emphasize appearance-based cross-modal retrieval, often neglecting the contextual information embedded within the scene, which can offer valuable complementary insights for retrieval. To address this, we introduce SCENEPERSON-13W, a large-scale dataset featuring over 100,000 scenes with rich annotations covering both pedestrian appearance and environmental cues. Based on this, we propose SA-Person, a two-stage retrieval framework. In the first stage, it performs discriminative appearance grounding by aligning textual cues with pedestrian-specific regions. In the second stage, it introduces SceneRanker, a training-free, scene-aware re-ranking method leveraging multimodal large language models to jointly reason over pedestrian appearance and the global scene context. Experiments on SCENEPERSON-13W validate the effectiveness of our framework in challenging scene-level retrieval scenarios. The code and dataset will be made publicly available.
zh
[CV-49] Diversify and Conquer: Open-set Disagreement for Robust Semi-supervised Learning with Outliers
【速读】:该论文试图解决半监督学习(Semi-Supervised Learning, SSL)在开放集场景下的鲁棒性问题,即当未标记数据包含未知类别数据(即异常值)时,传统SSL方法的性能会显著下降。其解决方案的关键在于提出一种名为“Diversify and Conquer (DAC)”的框架,通过在一个训练过程中构建一组具有不同偏差的模型,利用这些模型在未标记数据上的预测差异来增强对异常值的检测能力,从而提升SSL模型的鲁棒性。
链接: https://arxiv.org/abs/2505.24443
作者: Heejo Kong,Sung-Jin Kim,Gunho Jung,Seong-Whan Lee
机构: Korea University (高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
Abstract:Conventional semi-supervised learning (SSL) ideally assumes that labeled and unlabeled data share an identical class distribution; in practice, however, this assumption is easily violated, as unlabeled data often includes unknown class data, i.e., outliers. The outliers are treated as noise, considerably degrading the performance of SSL models. To address this drawback, we propose a novel framework, Diversify and Conquer (DAC), to enhance SSL robustness in the context of open-set semi-supervised learning. In particular, we note that existing open-set SSL methods rely on prediction discrepancies between inliers and outliers from a single model trained on labeled data. This approach can easily fail when the labeled data is insufficient, leading to performance degradation that is worse than naive SSL methods that do not account for outliers. In contrast, our approach exploits prediction disagreements among multiple models that are differently biased towards the unlabeled distribution. By leveraging the discrepancies arising from training on unlabeled data, our method enables robust outlier detection even when the labeled data is underspecified. Our key contribution is constructing a collection of differently biased models through a single training process. By encouraging divergent heads to be differently biased towards outliers while making consistent predictions for inliers, we exploit the disagreement among these heads as a measure to identify unknown concepts. Our code is available at this https URL.
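"多个带不同偏置的头之间的预测分歧可作为 outlier 分数"这一核心思想可示意如下(此处用各头分布相对平均分布的 KL 散度度量分歧;具体度量与阈值方式均为假设):

```python
import torch
import torch.nn.functional as F

def disagreement_score(logits_per_head):
    """各头 softmax 分布与其平均分布之间的平均 KL 散度:分歧越大越可能是 outlier。"""
    probs = torch.stack([F.softmax(l, dim=-1) for l in logits_per_head])  # (H, B, C)
    mean_p = probs.mean(0, keepdim=True)
    kl = (probs * (probs.clamp_min(1e-8).log()
                   - mean_p.clamp_min(1e-8).log())).sum(-1)               # (H, B)
    return kl.mean(0)  # (B,)

heads = [torch.randn(32, 10) for _ in range(4)]  # 4 个差异化偏置头的 logits(随机示例)
scores = disagreement_score(heads)
is_outlier = scores > scores.median()            # 示意性的阈值划分
```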
zh
[CV-50] SORCE: Small Object Retrieval in Complex Environments
【速读】:该论文旨在解决在复杂环境中对不显眼的小目标进行文本到图像检索(Text-to-Image Retrieval, T2IR)的问题,现有基准主要关注描述整体图像语义或前景显著物体的文本查询,而忽略了复杂场景中不易察觉的小物体。解决方案的关键在于提出一种多嵌入表示方法,通过多模态大语言模型(Multimodal Large Language Models, MLLMs)和区域提示(Regional Prompts, ReP)为每张图像生成多个独特的嵌入,从而更有效地捕捉小目标的语义信息,提升检索性能。
链接: https://arxiv.org/abs/2505.24441
作者: Chunxu Liu,Chi Xie,Xiaxu Chen,Wei Li,Feng Zhu,Rui Zhao,Limin Wang
机构: Sensetime Research (商汤科技); Tongji University (同济大学); Beijing Institute of Technology (北京理工大学); Shanghai AI Lab (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Text-to-Image Retrieval (T2IR) is a highly valuable task that aims to match a given textual query to images in a gallery. Existing benchmarks primarily focus on textual queries describing overall image semantics or foreground salient objects, possibly overlooking inconspicuous small objects, especially in complex environments. Such small object retrieval is crucial, as in real-world applications, the targets of interest are not always prominent in the image. Thus, we introduce SORCE (Small Object Retrieval in Complex Environments), a new subfield of T2IR, focusing on retrieving small objects in complex images with textual queries. We propose a new benchmark, SORCE-1K, consisting of images with complex environments and textual queries describing less conspicuous small objects with minimal contextual cues from other salient objects. Preliminary analysis on SORCE-1K finds that existing T2IR methods struggle to capture small objects and encode all the semantics into a single embedding, leading to poor retrieval performance on SORCE-1K. Therefore, we propose to represent each image with multiple distinctive embeddings. We leverage Multimodal Large Language Models (MLLMs) to extract multiple embeddings for each image instructed by a set of Regional Prompts (ReP). Experimental results show that our multi-embedding approach through MLLM and ReP significantly outperforms existing T2IR methods on SORCE-1K. Our experiments validate the effectiveness of SORCE-1K for benchmarking SORCE performances, highlighting the potential of multi-embedding representation and text-customized MLLM features for addressing this task.
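"每张图像多个独特嵌入"的检索打分可示意如下:图像得分取其各区域嵌入与文本查询相似度的最大值(区域嵌入此处以随机张量代替 MLLM + 区域提示的输出,取 max 的聚合方式为假设):

```python
import torch
import torch.nn.functional as F

def multi_embedding_scores(query_emb, gallery_embs):
    """多嵌入检索打分:对每张图的 R 个嵌入取与查询的最大余弦相似度。"""
    q = F.normalize(query_emb, dim=-1)        # (D,)
    g = F.normalize(gallery_embs, dim=-1)     # (N, R, D)
    sims = torch.einsum("d,nrd->nr", q, g)    # (N, R)
    return sims.max(dim=1).values             # (N,)

q = torch.randn(512)                  # 文本查询嵌入(随机示例)
gallery = torch.randn(1000, 4, 512)   # 1000 张图,每张 4 个区域嵌入
topk = multi_embedding_scores(q, gallery).topk(10).indices
```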
zh
[CV-51] Graph Flow Matching: Enhancing Image Generation with Neighbor-Aware Flow Fields
【速读】:该论文试图解决现有流匹配网络在生成样本时仅独立预测每个点的速度,忽略轨迹上相邻点之间潜在相关性的问题,这可能影响速度预测的准确性,进而影响生成质量。解决方案的关键在于提出图流匹配(Graph Flow Matching, GFM),通过将学习到的速度分解为反应项(可使用任意标准流匹配网络)和扩散项,后者利用图神经模块聚合邻近信息,从而在保持深度流模型可扩展性的前提下,通过引入局部上下文信息提升速度预测的准确性。
链接: https://arxiv.org/abs/2505.24434
作者: Md Shahriar Rahim Siddiqui,Moshe Eliasof,Eldad Haber
机构: The University of British Columbia (不列颠哥伦比亚大学); The University of Cambridge (剑桥大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Flow matching casts sample generation as learning a continuous-time velocity field that transports noise to data. Existing flow matching networks typically predict each point’s velocity independently, considering only its location and time along its flow trajectory, and ignoring neighboring points. However, this pointwise approach may overlook correlations between points along the generation trajectory that could enhance velocity predictions, thereby improving downstream generation quality. To address this, we propose Graph Flow Matching (GFM), a lightweight enhancement that decomposes the learned velocity into a reaction term – any standard flow matching network – and a diffusion term that aggregates neighbor information via a graph neural module. This reaction-diffusion formulation retains the scalability of deep flow models while enriching velocity predictions with local context, all at minimal additional computational cost. Operating in the latent space of a pretrained variational autoencoder, GFM consistently improves Fréchet Inception Distance (FID) and recall across five image generation benchmarks (LSUN Church, LSUN Bedroom, FFHQ, AFHQ-Cat, and CelebA-HQ at 256×256), demonstrating its effectiveness as a modular enhancement to existing flow matching architectures.
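摘要中的速度分解 v = reaction + diffusion 可用如下草图说明:reaction 为任意逐点流匹配网络,diffusion 在批内 k 近邻图上聚合邻居信息(图的构建方式与拉普拉斯式聚合形式均为假设,非论文原始图神经模块):

```python
import torch
import torch.nn as nn

class GraphFlowVelocity(nn.Module):
    """反应-扩散式速度分解的示意:逐点网络 + 批内 kNN 图上的邻居聚合。"""
    def __init__(self, dim, k=4):
        super().__init__()
        self.k = k
        self.reaction = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
        self.diffusion = nn.Linear(dim, dim)

    def forward(self, x, t):
        # x: (B, D) 批内潜空间样本;t: (B, 1) 时间
        v_react = self.reaction(torch.cat([x, t], dim=-1))
        dist = torch.cdist(x, x)                                    # (B, B)
        idx = dist.topk(self.k + 1, largest=False).indices[:, 1:]   # 去掉自身
        neighbors = x[idx].mean(dim=1)                              # 邻居均值
        v_diff = self.diffusion(neighbors - x)                      # 拉普拉斯式聚合
        return v_react + v_diff

v = GraphFlowVelocity(dim=16)(torch.randn(32, 16), torch.rand(32, 1))
```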
zh
[CV-52] Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation
【速读】:该论文旨在解决3D点云异常检测中因姿态变化和复杂几何异常导致的几何保真度问题,以及现有基于块的方法在细粒度异常定位上的局限性。其解决方案的关键在于提出了一种名为Pose-Aware Signed Distance Field (PASDF)的新框架,该框架通过学习连续、姿态不变的形状表示,实现了3D异常检测与修复的集成。PASDF的核心技术包括姿态对齐模块和SDF网络,能够动态融合姿态信息,并从连续的SDF中隐式学习高保真异常修复模板,从而实现像素级的异常定位与修复。
链接: https://arxiv.org/abs/2505.24431
作者: Bozhong Zheng,Jinye Gan,Xiaohao Xu,Wenqiao Li,Xiaonan Huang,Na Ni,Yingna Wu
机构: ShanghaiTech University (上海科技大学); University of Michigan, Ann Arbor (密歇根大学安娜堡分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D point cloud anomaly detection is essential for robust vision systems but is challenged by pose variations and complex geometric anomalies. Existing patch-based methods often suffer from geometric fidelity issues due to discrete voxelization or projection-based representations, limiting fine-grained anomaly localization. We introduce Pose-Aware Signed Distance Field (PASDF), a novel framework that integrates 3D anomaly detection and repair by learning a continuous, pose-invariant shape representation. PASDF leverages a Pose Alignment Module for canonicalization and a SDF Network to dynamically incorporate pose, enabling implicit learning of high-fidelity anomaly repair templates from the continuous SDF. This facilitates precise pixel-level anomaly localization through an Anomaly-Aware Scoring Module. Crucially, the continuous 3D representation in PASDF extends beyond detection, facilitating in-situ anomaly repair. Experiments on Real3D-AD and Anomaly-ShapeNet demonstrate state-of-the-art performance, achieving high object-level AUROC scores of 80.2% and 90.0%, respectively. These results highlight the effectiveness of continuous geometric representations in advancing 3D anomaly detection and facilitating practical anomaly region repair. The code is available at this https URL to support further research.
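"从连续 SDF 隐式学到正常形状模板、再以偏离程度定位异常"的思路可示意如下:训练完成后,|SDF| 越大表示点越偏离学到的正常模板(网络结构、分位数阈值均为假设,姿态对齐模块从略):

```python
import torch
import torch.nn as nn

class PoseConditionedSDF(nn.Module):
    """极简 SDF 网络示意:输入姿态对齐后的 3D 坐标,输出带符号距离。"""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):  # xyz: (N, 3),假设已经过姿态对齐
        return self.net(xyz).squeeze(-1)

sdf = PoseConditionedSDF()
points = torch.randn(2048, 3)
anomaly_score = sdf(points).abs()  # 逐点异常分数:离"正常形状"越远越大
anomaly_mask = anomaly_score > anomaly_score.quantile(0.98)
```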
zh
[CV-53] Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在组合推理(compositional reasoning)方面的不足,尤其是对概念间关系的理解能力有限的问题。尽管现有方法在提升词汇敏感性方面取得了一定进展,但忽视了语义理解的提升,导致下游检索性能未能同步改善。论文提出的解决方案关键在于CLIC(Compositionally-aware Learning in CLIP),这是一种基于多图像及其关联标题的新型微调训练技术,能够同时提升模型在词汇和语义层面的组合理解能力,并在检索任务中实现稳定性能提升。
链接: https://arxiv.org/abs/2505.24424
作者: Amit Peleg,Naman Deep Singh,Matthias Hein
机构: University of Tübingen (图宾根大学); Tübingen AI Center (图宾根人工智能中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning - the ability to understand the relationships between concepts. A recent benchmark, SugarCrepe++, reveals that previous works on improving compositionality have mainly improved lexical sensitivity but neglected semantic understanding. In addition, downstream retrieval performance often deteriorates, although one would expect that improving compositionality should enhance retrieval. In this work, we introduce CLIC (Compositionally-aware Learning in CLIP), a fine-tuning method based on a novel training technique combining multiple images and their associated captions. CLIC improves compositionality across architectures as well as differently pre-trained CLIP models, both in terms of lexical and semantic understanding, and achieves consistent gains in retrieval performance. This even applies to the recent CLIPS, which achieves SOTA retrieval performance. Nevertheless, the short fine-tuning with CLIC leads to an improvement in retrieval and to the best compositional CLIP model on SugarCrepe++. All our models and code are available at this https URL
zh
[CV-54] EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering
【速读】:该论文旨在解决多语言文本生成中存在准确性和可控性不足的问题,尤其是如何在扩散模型中实现任意语言的文本渲染。其解决方案的关键在于提出了一种基于DiT(Diffusion Transformer)的文本渲染框架EasyText,通过将去噪潜在表示与多语言字符标记相连接,并引入字符位置编码和位置编码插值技术,以实现可控且精确的文本渲染。
链接: https://arxiv.org/abs/2505.24417
作者: Runnan Lu,Yuxuan Zhang,Jailing Liu,Haifa Wang,Yiren Song
机构: Tiamat AI (Tiamat AI); Liblib AI (Liblib AI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating accurate multilingual text with diffusion models has long been desired but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering arbitrary languages is still an unexplored area. This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer), which connects denoising latents with multilingual character tokens encoded as character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. Additionally, we construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations as well as a high-quality dataset of 20K annotated images, which are used for pretraining and fine-tuning respectively. Extensive experiments and evaluations demonstrate the effectiveness and advancement of our approach in multilingual text rendering, visual quality, and layout-aware text integration.
zh
[CV-55] PCIE_Pose Solution for EgoExo4D Pose and Proficiency Estimation Challenge
【速读】:该论文旨在解决从第一视角(egocentric)RGB视频中估计21个3D手部关节点的问题,以及在动态场景下进行身体姿态估计的挑战。其关键解决方案是提出了一种结合视觉Transformer和卷积神经网络(CNN)主干的Hand Pose Vision Transformer (HP-ViT+)架构,通过加权融合策略优化手部姿态预测;同时,在身体姿态估计任务中采用了多模态时空特征集成策略,以应对复杂动态环境下的姿态估计问题。
链接: https://arxiv.org/abs/2505.24411
作者: Feng Chen,Kanokphan Lertniphonphan,Qiancheng Yan,Xiaohui Fan,Jun Xie,Tao Zhang,Zhepeng Wang
机构: Lenovo Research(联想研究院); University of Chinese Academy of Sciences(中国科学院大学); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This report introduces our team’s (PCIE_EgoPose) solutions for the EgoExo4D Pose and Proficiency Estimation Challenges at CVPR2025. Focused on the intricate task of estimating 21 3D hand joints from RGB egocentric videos, which are complicated by subtle movements and frequent occlusions, we developed the Hand Pose Vision Transformer (HP-ViT+). This architecture synergizes a Vision Transformer and a CNN backbone, using weighted fusion to refine the hand pose predictions. For the EgoExo4D Body Pose Challenge, we adopted a multimodal spatio-temporal feature integration strategy to address the complexities of body pose estimation across dynamic contexts. Our methods achieved remarkable performance: 8.31 PA-MPJPE in the Hand Pose Challenge and 11.25 MPJPE in the Body Pose Challenge, securing championship titles in both competitions. We extended our pose estimation solutions to the Proficiency Estimation task, applying core technologies such as transformer-based architectures. This extension enabled us to achieve a top-1 accuracy of 0.53, a SOTA result, in the Demonstrator Proficiency Estimation competition.
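报告中 HP-ViT+ 的"加权融合"环节本质上是两条支路关节预测的凸组合;以下 NumPy 草图给出融合与 MPJPE 评估的最小示例(权重 0.6 为假设值,实际可在验证集上按 PA-MPJPE 搜索):

```python
import numpy as np

def fuse_hand_poses(pred_vit, pred_cnn, w_vit=0.6):
    """对 ViT 与 CNN 支路的 21 个 3D 手部关节预测做凸组合。"""
    return w_vit * pred_vit + (1.0 - w_vit) * pred_cnn

def mpjpe(pred, gt):
    # 平均每关节位置误差,pred/gt: (21, 3)
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.random.randn(21, 3)
p_vit = gt + 0.01 * np.random.randn(21, 3)  # 模拟两条支路的预测
p_cnn = gt + 0.02 * np.random.randn(21, 3)
print(mpjpe(fuse_hand_poses(p_vit, p_cnn), gt))
```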
zh
[CV-56] IRBridge: Solving Image Restoration Bridge with Pre-trained Generative Diffusion Models
【速读】:该论文试图解决现有图像修复桥梁模型在面对不同类型的退化时需要从头训练的问题,这导致计算成本高且性能受限。其解决方案的关键在于利用预训练的生成先验,并通过提出一种过渡方程来连接两个具有相同终端分布的扩散过程,从而实现生成模型在图像修复桥梁中的直接应用,提升修复任务的灵活性和适应性。
链接: https://arxiv.org/abs/2505.24406
作者: Hanting Wang,Tao Jin,Wang Lin,Shulei Wang,Hai Huang,Shengpeng Ji,Zhou Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Bridge models in image restoration construct a diffusion process from degraded to clear images. However, existing methods typically require training a bridge model from scratch for each specific type of degradation, resulting in high computational costs and limited performance. This work aims to efficiently leverage pretrained generative priors within existing image restoration bridges to eliminate this requirement. The main challenge is that standard generative models are typically designed for a diffusion process that starts from pure noise, while restoration tasks begin with a low-quality image, resulting in a mismatch in the state distributions between the two processes. To address this challenge, we propose a transition equation that bridges two diffusion processes with the same endpoint distribution. Based on this, we introduce the IRBridge framework, which enables the direct utilization of generative models within image restoration bridges, offering a more flexible and adaptable approach to image restoration. Extensive experiments on six image restoration tasks demonstrate that IRBridge efficiently integrates generative priors, resulting in improved robustness and generalization performance. Code will be available at GitHub.
zh
[CV-57] PCIE_Interaction Solution for Ego4D Social Interaction Challenge
【速读】:该论文旨在解决Ego4D社交互动挑战中的“Looking At Me (LAM)”和“Talking To Me (TTM)”任务,即准确检测主体与摄像头佩戴者之间的社交互动。在LAM任务中,解决方案的关键在于面部质量增强和集成方法;而在TTM任务中,关键在于通过融合音频和视觉线索,并基于视觉质量得分进行加权,扩展了视觉交互分析。
链接: https://arxiv.org/abs/2505.24404
作者: Kanokphan Lertniphonphan,Feng Chen,Junda Xu,Fengbu Lan,Jun Xie,Tao Zhang,Zhepeng Wang
机构: Lenovo Research(联想研究); Beijing Normal University(北京师范大学); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This report presents our team’s PCIE_Interaction solution for the Ego4D Social Interaction Challenge at CVPR 2025, addressing both Looking At Me (LAM) and Talking To Me (TTM) tasks. The challenge requires accurate detection of social interactions between subjects and the camera wearer, with LAM relying exclusively on face crop sequences and TTM combining speaker face crops with synchronized audio segments. In the LAM track, we employ face quality enhancement and ensemble methods. For the TTM task, we extend visual interaction analysis by fusing audio and visual cues, weighted by a visual quality score. Our approach achieved 0.81 and 0.71 mean average precision (mAP) on the LAM and TTM challenges leader board. Code is available at this https URL
zh
[CV-58] Leveraging Intermediate Features of Vision Transformer for Face Anti-Spoofing CVPR
【速读】:该论文试图解决人脸识别系统在面对伪造攻击(spoofing attack)时的安全性问题,即恶意用户可能通过展示注册用户的面部照片来非法绕过身份验证。解决方案的关键在于提出一种基于Vision Transformer (ViT) 的伪造攻击检测方法,该方法利用ViT的中间特征,这些特征在局部与全局特征之间具有良好的平衡,对于检测真实与伪造面部图像之间的细微差异至关重要。此外,该方法还引入了两种数据增强技术:人脸防伪数据增强和块级数据增强,以提高伪造攻击检测的准确性。
链接: https://arxiv.org/abs/2505.24402
作者: Mika Feng,Koichi Ito,Takafumi Aoki,Tetsushi Ohki,Masakatsu Nishigaki
机构: Tohoku University (东北大学); Shizuoka University (静冈大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Abstract:Face recognition systems are designed to be robust against changes in head pose, illumination, and blurring during image capture. If a malicious person presents a face photo of the registered user, they may bypass the authentication process illegally. Such spoofing attacks need to be detected before face recognition. In this paper, we propose a spoofing attack detection method based on Vision Transformer (ViT) to detect minute differences between live and spoofed face images. The proposed method utilizes the intermediate features of ViT, which have a good balance between local and global features that are important for spoofing attack detection, for calculating loss in training and score in inference. The proposed method also introduces two data augmentation methods: face anti-spoofing data augmentation and patch-wise data augmentation, to improve the accuracy of spoofing attack detection. We demonstrate the effectiveness of the proposed method through experiments using the OULU-NPU and SiW datasets.
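以下为提取 ViT 中间层特征的示意代码(基于 torchvision 的 vit_b_16 与 forward hook;所选层号 3/6/9 与“取各层 CLS token 拼接”的做法均为笔者为说明而做的假设,并非论文设定):

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1).eval()

features = {}

def hook(name):
    def fn(module, inputs, output):
        features[name] = output.detach()  # (B, 197, 768):CLS + 14x14 patch tokens
    return fn

# 在中间若干层(而非仅最后一层)注册钩子,取局部/全局信息平衡的特征
for idx in (3, 6, 9):
    model.encoder.layers[idx].register_forward_hook(hook(f"layer_{idx}"))

x = torch.randn(1, 3, 224, 224)  # 假设已按 ImageNet 统计量归一化的人脸图像
with torch.no_grad():
    model(x)

# 拼接中间层的 CLS token 作为真伪判别特征(示意做法)
feat = torch.cat([features[k][:, 0] for k in sorted(features)], dim=-1)
print(feat.shape)  # torch.Size([1, 2304])
```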
zh
[CV-59] S3CE-Net: Spike-guided Spatiotemporal Semantic Coupling and Expansion Network for Long Sequence Event Re-Identification
【速读】:该论文旨在解决长序列事件流下的行人重识别(person re-identification, Re-ID)任务中因恶劣光照条件、背景干扰、时间分辨率不足以及面部信息保护等问题带来的挑战。其关键解决方案是提出一种基于脉冲神经网络(spiking neural networks, SNNs)的简单高效模型——Spike-guided Spatiotemporal Semantic Coupling and Expansion Network (S3CE-Net),该模型通过引入Spike-guided Spatial-temporal Attention Mechanism (SSAM)和Spatiotemporal Feature Sampling Strategy (STFS),实现了时空语义的交互与关联,并在不增加额外参数的情况下提升模型对有效语义的感知能力,从而在长序列事件数据上取得优异性能。
链接: https://arxiv.org/abs/2505.24401
作者: Xianheng Ma,Hongchen Tan,Xiuping Liu,Yi Zhang,Huasheng Wang,Jiang Liu,Ying Chen,Hantao Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we leverage the advantages of event cameras to resist harsh lighting conditions, reduce background interference, achieve high time resolution, and protect facial information to study the long-sequence event-based person re-identification (Re-ID) task. To this end, we propose a simple and efficient long-sequence event Re-ID model, namely the Spike-guided Spatiotemporal Semantic Coupling and Expansion Network (S3CE-Net). To better handle asynchronous event data, we build S3CE-Net based on spiking neural networks (SNNs). The S3CE-Net incorporates the Spike-guided Spatial-temporal Attention Mechanism (SSAM) and the Spatiotemporal Feature Sampling Strategy (STFS). The SSAM is designed to carry out semantic interaction and association in both spatial and temporal dimensions, leveraging the capabilities of SNNs. The STFS involves sampling spatial feature subsequences and temporal feature subsequences from the spatiotemporal dimensions, driving the Re-ID model to perceive broader and more robust effective semantics. Notably, the STFS introduces no additional parameters and is only utilized during the training stage. Therefore, S3CE-Net is a low-parameter and high-efficiency model for long-sequence event-based person Re-ID. Extensive experiments have verified that our S3CE-Net achieves outstanding performance on many mainstream long-sequence event-based person Re-ID datasets. Code is available at:this https URL.
zh
[CV-60] Leadership Assessment in Pediatric Intensive Care Unit Team Training CVPR2025
【速读】:该论文试图解决儿科重症监护室(PICU)团队领导能力评估的问题,其核心挑战在于如何自动化、客观地分析团队成员的领导行为。解决方案的关键在于构建一个基于自我中心视觉(egocentric vision)的自动化分析框架,通过Aria眼镜采集多模态数据(如视频、音频、眼动和头部运动),并利用REMoDNaV、SAM、YOLO和ChatGPT等技术实现固定目标检测、眼神接触识别和对话分类,从而提取与领导力相关的行为指标,如注视时间、转换模式和直接指令等。
链接: https://arxiv.org/abs/2505.24389
作者: Liangyang Ouyang,Yuki Sakai,Ryosuke Furuta,Hisataka Nozawa,Hikoro Matsui,Yoichi Sato
机构: The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted by EgoVis Workshop at CVPR 2025
Abstract:This paper addresses the task of assessing PICU team’s leadership skills by developing an automated analysis framework based on egocentric vision. We identify key behavioral cues, including fixation object, eye contact, and conversation patterns, as essential indicators of leadership assessment. In order to capture these multimodal signals, we employ Aria Glasses to record egocentric video, audio, gaze, and head movement data. We collect one-hour videos of four simulated sessions involving doctors with different roles and levels. To automate data processing, we propose a method leveraging REMoDNaV, SAM, YOLO, and ChatGPT for fixation object detection, eye contact detection, and conversation classification. In the experiments, significant correlations are observed between leadership skills and behavioral metrics, i.e., the output of our proposed methods, such as fixation time, transition patterns, and direct orders in speech. These results indicate that our proposed data collection and analysis framework can effectively solve skill assessment for training PICU teams.
zh
[CV-61] SASP: Strip-Aware Spatial Perception for Fine-Grained Bird Image Classification
【速读】:该论文旨在解决细粒度鸟类图像分类(Fine-grained Bird Image Classification, FBIC)中的挑战,包括由于物种大小和成像距离差异导致的鸟类在图像中尺寸变化、复杂自然栖息地带来的强背景干扰以及飞行、栖息或觅食等高度灵活姿态引起的类内显著变异。这些问题使得传统方法难以稳定提取判别特征,从而限制了模型在实际应用中的泛化能力和可解释性。该论文提出的解决方案关键在于基于条带感知的空间感知框架,通过引入扩展感知聚合器(Extensional Perception Aggregator, EPA)和通道语义编织(Channel Semantic Weaving, CSW)两个新颖模块,以捕捉鸟类图像中整行或整列的长程空间依赖关系,从而提升模型的鲁棒性和可解释性。
链接: https://arxiv.org/abs/2505.24380
作者: Zheng Wang
机构: Central China Normal University (华中师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Fine-grained bird image classification (FBIC) is not only of great significance for ecological monitoring and species identification, but also holds broad research value in the fields of image recognition and fine-grained visual modeling. Compared with general image classification tasks, FBIC poses more formidable challenges: 1) the differences in species size and imaging distance result in the varying sizes of birds presented in the images; 2) complex natural habitats often introduce strong background interference; 3) and highly flexible poses such as flying, perching, or foraging result in substantial intra-class variability. These factors collectively make it difficult for traditional methods to stably extract discriminative features, thereby limiting the generalizability and interpretability of models in real-world applications. To address these challenges, this paper proposes a fine-grained bird classification framework based on strip-aware spatial perception, which aims to capture long-range spatial dependencies across entire rows or columns in bird images, thereby enhancing the model’s robustness and interpretability. The proposed method incorporates two novel modules: extensional perception aggregator (EPA) and channel semantic weaving (CSW). Specifically, EPA integrates local texture details with global structural cues by aggregating information across horizontal and vertical spatial directions. CSW further refines the semantic representations by adaptively fusing long-range and short-range information along the channel dimension. Built upon a ResNet-50 backbone, the model enables jump-wise connection of extended structural features across the spatial domain. Experimental results on the CUB-200-2011 dataset demonstrate that our framework achieves significant performance improvements while maintaining architectural efficiency.
zh
[CV-62] Spatiotemporal Analysis of Forest Machine Operations Using 3D Video Classification
【速读】:该论文试图解决从车载摄像头视频中自动分类林业作业类型的问题,旨在减少传统时间研究中的手动工作量。解决方案的关键在于采用基于深度学习的框架,利用3D ResNet-50架构结合PyTorchVideo进行时空特征提取,通过手动标注的现场录制数据集进行训练,从而有效捕捉现实林业环境中的运动模式和外观特征。
链接: https://arxiv.org/abs/2505.24375
作者: Maciej Wielgosz,Simon Berg,Heikki Korpunen,Stephan Hoffmann
机构: Norwegian Institute of Bioeconomy Research (NIBIO)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents a deep learning-based framework for classifying forestry operations from dashcam video footage. Focusing on four key work elements - crane-out, cutting-and-to-processing, driving, and processing - the approach employs a 3D ResNet-50 architecture implemented with PyTorchVideo. Trained on a manually annotated dataset of field recordings, the model achieves strong performance, with a validation F1 score of 0.88 and precision of 0.90. These results underscore the effectiveness of spatiotemporal convolutional networks for capturing both motion patterns and appearance in real-world forestry environments. The system integrates standard preprocessing and augmentation techniques to improve generalization, but overfitting is evident, highlighting the need for more training data and better class balance. Despite these challenges, the method demonstrates clear potential for reducing the manual workload associated with traditional time studies, offering a scalable solution for operational monitoring and efficiency analysis in forestry. This work contributes to the growing application of AI in natural resource management and sets the foundation for future systems capable of real-time activity recognition in forest machinery. Planned improvements include dataset expansion, enhanced regularization, and deployment trials on embedded systems for in-field use.
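作为参考,下面的草图演示如何用 PyTorchVideo 加载 3D ResNet-50(slow_r50,Kinetics-400 预训练)并替换分类头以适配四类作业要素;分类头的属性路径与输入帧数等细节以实际模型结构为准,属示意性假设:

```python
import torch

# 通过 torch.hub 加载 PyTorchVideo 的 3D ResNet-50(slow_r50)
model = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True)
model.eval()

# 替换分类头以适配 4 类作业要素(crane-out、cutting、driving、processing)
# 注:头部属性路径基于 slow_r50 的结构,具体字段名以实际模型为准
model.blocks[-1].proj = torch.nn.Linear(model.blocks[-1].proj.in_features, 4)

# 输入形状为 (B, C, T, H, W):8 帧、224x224 的视频片段
clip = torch.randn(2, 3, 8, 224, 224)
with torch.no_grad():
    logits = model(clip)
print(logits.shape)  # torch.Size([2, 4])
```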
zh
[CV-63] D2AF: A Dual-Driven Annotation and Filtering Framework for Visual Grounding
【速读】:该论文旨在解决视觉定位任务中由于人工标注成本高导致的数据集规模受限及伪标签多样性不足的问题。其解决方案的关键在于提出一种仅依赖输入图像的鲁棒标注框架D2AF,通过融合多模态大模型和目标检测模型,采用双驱动标注策略,有效生成包含封闭集和开放集的详细区域-文本对,从而提升数据量与多样性,并通过一致性与分布感知的过滤方法优化数据质量,最终显著提升模型性能。
链接: https://arxiv.org/abs/2505.24372
作者: Yichi Zhang,Gongwei Chen,Jun Zhu,Jia Wan
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 8 figures
Abstract:Visual Grounding is a task that aims to localize a target region in an image based on a free-form natural language description. With the rise of Transformer architectures, there is an increasing need for larger datasets to boost performance. However, the high cost of manual annotation poses a challenge, hindering the scale of data and the ability of large models to enhance their effectiveness. Previous pseudo label generation methods heavily rely on human-labeled captions of the original dataset, limiting scalability and diversity. To address this, we propose D2AF, a robust annotation framework for visual grounding using only input images. This approach overcomes dataset size limitations and enriches both the quantity and diversity of referring expressions. Our approach leverages multimodal large models and object detection models. By implementing dual-driven annotation strategies, we effectively generate detailed region-text pairs using both closed-set and open-set approaches. We further conduct an in-depth analysis of data quantity and data distribution. Our findings demonstrate that increasing data volume enhances model performance. However, the degree of improvement depends on how well the pseudo labels broaden the original data distribution. Based on these insights, we propose a consistency and distribution aware filtering method to further improve data quality by effectively removing erroneous and redundant data. This approach effectively eliminates noisy data, leading to improved performance. Experiments on three visual grounding tasks demonstrate that our method significantly improves the performance of existing models and achieves state-of-the-art results.
zh
[CV-64] Grid-LoGAT: Grid Based Local and Global Area Transcription for Video Question Answering
【速读】:该论文旨在解决视频问答(VideoQA)中文本转录质量不足及隐私保护的问题。其解决方案的关键在于提出一种基于网格的局部与全局区域转录(Grid-LoGAT)系统,通过在边缘设备上部署生成式 AI (Generative AI) 模型提取视频帧中的文本转录,并在云端使用大型语言模型(LLM)处理问题并生成答案,从而保障图像隐私;同时,采用基于网格的视觉提示方法提升转录质量,通过提取每个网格单元的精细局部细节并融合全局信息实现更准确的文本转录。
链接: https://arxiv.org/abs/2505.24371
作者: Md Intisar Chowdhury,Kittinun Aukkapinyo,Hiroshi Fujimura,Joo Ann Woo,Wasu Wasusatein,Fadoua Ghourabi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract:In this paper, we propose a Grid-based Local and Global Area Transcription (Grid-LoGAT) system for Video Question Answering (VideoQA). The system operates in two phases. First, extracting text transcripts from video frames using a Vision-Language Model (VLM). Next, processing questions using these transcripts to generate answers through a Large Language Model (LLM). This design ensures image privacy by deploying the VLM on edge devices and the LLM in the cloud. To improve transcript quality, we propose grid-based visual prompting, which extracts intricate local details from each grid cell and integrates them with global information. Evaluation results show that Grid-LoGAT, using the open-source VLM (LLaVA-1.6-7B) and LLM (Llama-3.1-8B), outperforms state-of-the-art methods with similar baseline models on NExT-QA and STAR-QA datasets with an accuracy of 65.9% and 50.11% respectively. Additionally, our method surpasses the non-grid version by 24 points on localization-based questions we created using NExT-QA.
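下面是网格化视觉提示的一个最小示意(vlm_caption 为假设的 VLM 调用接口,3x3 的网格大小与提示词也是假设;论文中的具体提示与局部/全局信息融合方式见原文):

```python
from PIL import Image

def grid_crops(image: Image.Image, rows: int = 3, cols: int = 3):
    """把一帧图像切成 rows x cols 网格,返回 (行, 列, 子图) 列表。"""
    w, h = image.size
    cell_w, cell_h = w // cols, h // rows
    crops = []
    for r in range(rows):
        for c in range(cols):
            box = (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
            crops.append((r, c, image.crop(box)))
    return crops

def transcribe_frame(image, vlm_caption):
    """vlm_caption 为假设的接口:输入图像与提示,返回文本。
    先对每个网格单元提取局部细节,再与整帧的全局描述整合为转录。"""
    local = [f"cell({r},{c}): " + vlm_caption(crop, "Describe this region in detail.")
             for r, c, crop in grid_crops(image)]
    global_desc = vlm_caption(image, "Describe the whole frame.")
    return global_desc + "\n" + "\n".join(local)
```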
zh
[CV-65] Revisiting Cross-Modal Knowledge Distillation: A Disentanglement Approach for RGBD Semantic Segmentation
【速读】:该论文旨在解决多模态RGBD数据在训练阶段与推理阶段之间出现的模态不匹配问题,即在训练时可获取多模态数据(RGB和深度图像),但在推理时可能因传感器故障或资源限制而无法获取全部模态,导致模型性能下降。解决方案的关键在于提出一种新的跨模态知识蒸馏框架——CroDiNo-KD,该框架通过解耦表示、对比学习和解耦数据增强,同时学习单模态的RGB和深度模型,以结构化神经网络内部流形并提升模型的泛化能力。
链接: https://arxiv.org/abs/2505.24361
作者: Roger Ferrod,Cássio F. Dantas,Luigi Di Caro,Dino Ienco
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-modal RGB and Depth (RGBD) data are predominant in many domains such as robotics, autonomous driving and remote sensing. The combination of these multi-modal data enhances environmental perception by providing 3D spatial context, which is absent in standard RGB images. Although RGBD multi-modal data can be available to train computer vision models, accessing all sensor modalities during the inference stage may be infeasible due to sensor failures or resource constraints, leading to a mismatch between data modalities available during training and inference. Traditional Cross-Modal Knowledge Distillation (CMKD) frameworks, developed to address this task, are typically based on a teacher/student paradigm, where a multi-modal teacher distills knowledge into a single-modality student model. However, these approaches face challenges in teacher architecture choices and distillation process selection, thus limiting their adoption in real-world scenarios. To overcome these issues, we introduce CroDiNo-KD (Cross-Modal Disentanglement: a New Outlook on Knowledge Distillation), a novel cross-modal knowledge distillation framework for RGBD semantic segmentation. Our approach simultaneously learns single-modality RGB and Depth models by exploiting disentanglement representation, contrastive learning and decoupled data augmentation with the aim to structure the internal manifolds of neural network models through interaction and collaboration. We evaluated CroDiNo-KD on three RGBD datasets across diverse domains, considering recent CMKD frameworks as competitors. Our findings illustrate the quality of CroDiNo-KD, and they suggest reconsidering the conventional teacher/student paradigm to distill information from multi-modal data to single-modality neural networks.
zh
[CV-66] VUDG: A Dataset for Video Understanding Domain Generalization
【速读】:该论文旨在解决视频理解中的领域泛化(Domain Generalization, DG)问题,即现有方法在面对真实世界视频应用中固有的领域偏移时表现不佳,导致泛化能力不足。其解决方案的关键在于提出一个名为Video Understanding Domain Generalization (VUDG)的新数据集,该数据集包含来自11个不同领域的视频,涵盖三种类型的领域偏移,并保持跨领域的语义相似性以确保评估的公平性和有效性。此外,论文还引入了一种多专家渐进式标注框架,用于为每个视频生成多种选择题和开放性问答对,从而提升数据集的质量和多样性。
链接: https://arxiv.org/abs/2505.24346
作者: Ziyi Wang,Zhi Gao,Boxuan Yu,Zirui Dai,Yuxiang Song,Qingyuan Lu,Jin Chen,Xinxiao Wu
机构: Beijing Key Laboratory of Intelligent Information Technology (北京智能信息技术重点实验室); School of Computer Science & Technology, Beijing Institute of Technology (北京理工大学计算机科学与技术学院); Guangdong Laboratory of Machine Perception and Intelligent Computing (广东省机器感知与智能计算实验室); Shenzhen MSU-BIT University (深圳MSU-BIT大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video understanding has made remarkable progress in recent years, largely driven by advances in deep models and the availability of large-scale annotated datasets. However, existing works typically ignore the inherent domain shifts encountered in real-world video applications, leaving domain generalization (DG) in video understanding underexplored. Hence, we propose Video Understanding Domain Generalization (VUDG), a novel dataset designed specifically for evaluating the DG performance in video understanding. VUDG contains videos from 11 distinct domains that cover three types of domain shifts, and maintains semantic similarity across different domains to ensure fair and meaningful evaluation. We propose a multi-expert progressive annotation framework to annotate each video with both multiple-choice and open-ended question-answer pairs. Extensive experiments on 9 representative large video-language models (LVLMs) and several traditional video question answering methods show that most models (including state-of-the-art LVLMs) suffer performance degradation under domain shifts. These results highlight the challenges posed by VUDG and the difference in the robustness of current models to data distribution shifts. We believe VUDG provides a valuable resource for prompting future research in domain generalization video understanding.
zh
[CV-67] KairosAD: A SAM-Based Model for Industrial Anomaly Detection on Embedded Devices
【速读】:该论文旨在解决在智能制造时代,工业生产线上质量控制中异常检测的难题,特别是针对现有模型参数量大、计算需求高、难以部署到资源受限的嵌入式设备的问题。解决方案的关键在于提出一种名为KairosAD的新型监督方法,该方法利用了Mobile Segment Anything Model (MobileSAM)的强大能力进行基于图像的异常检测,从而在保持较高检测性能的同时显著减少了模型参数数量并提升了推理速度。
链接: https://arxiv.org/abs/2505.24334
作者: Uzair Khan,Franco Fummi,Luigi Capogrosso
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 23rd International Conference on Image Analysis and Processing (ICIAP 2025)
Abstract:In the era of intelligent manufacturing, anomaly detection has become essential for maintaining quality control on modern production lines. However, while many existing models show promising performance, they are often too large, computationally demanding, and impractical to deploy on resource-constrained embedded devices that can be easily installed on the production lines of Small and Medium Enterprises (SMEs). To bridge this gap, we present KairosAD, a novel supervised approach that uses the power of the Mobile Segment Anything Model (MobileSAM) for image-based anomaly detection. KairosAD has been evaluated on the two well-known industrial anomaly detection datasets, i.e., MVTec-AD and ViSA. The results show that KairosAD requires 78% fewer parameters and boasts a 4x faster inference time compared to the leading state-of-the-art model, while maintaining comparable AUROC performance. We deployed KairosAD on two embedded devices, the NVIDIA Jetson NX, and the NVIDIA Jetson AGX. Finally, KairosAD was successfully installed and tested on the real production line of the Industrial Computer Engineering Laboratory (ICE Lab) at the University of Verona. The code is available at this https URL.
zh
[CV-68] DisTime: Distribution-based Time Representation for Video Large Language Models
【速读】:该论文旨在解决视频大语言模型(Video-LLMs)在精确时间定位方面的挑战,主要由于离散时间表示和有限的时间感知数据集。其解决方案的关键在于提出DisTime框架,该框架通过可学习的标记创建连续的时间嵌入空间,并结合基于分布的时间解码器生成时间概率分布,从而有效缓解边界模糊性并保持时间连续性。此外,基于分布的时间编码器重新编码时间戳以提供时间标记,进一步提升了模型的时间理解能力。
链接: https://arxiv.org/abs/2505.24329
作者: Yingsen Zeng,Zepeng Huang,Yujie Zhong,Chengjian Feng,Jie Hu,Lin Ma,Yang Liu
机构: Meituan Inc. (美团公司); Wangxuan Institute of Computer Technology, Peking University (北京大学计算机技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks. Code and data are released at this https URL.
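以下为“基于分布的时间解码器”思想的示意实现:对时间 token 的隐状态输出离散 bin 上的概率分布,并以分布期望作为连续时间戳;bin 数量与线性解码头均为笔者假设,并非论文官方结构:

```python
import torch
import torch.nn as nn

class DistributionTimeDecoder(nn.Module):
    """把一个时间 token 的隐状态解码为时间概率分布(示意实现)。
    输出在 [0, 1] 归一化视频时长上 num_bins 个离散 bin 的分布,
    用期望值作为连续时间戳预测,从而保留时间连续性并缓解边界歧义。"""

    def __init__(self, hidden_dim: int, num_bins: int = 100):
        super().__init__()
        self.to_logits = nn.Linear(hidden_dim, num_bins)
        # 各 bin 的中心位置:0.005, 0.015, ..., 0.995
        self.register_buffer("centers", (torch.arange(num_bins) + 0.5) / num_bins)

    def forward(self, h):                          # h: (B, hidden_dim)
        probs = self.to_logits(h).softmax(dim=-1)  # (B, num_bins) 时间概率分布
        t = (probs * self.centers).sum(dim=-1)     # 分布期望 -> 连续时间戳
        return t, probs

decoder = DistributionTimeDecoder(hidden_dim=768)
t, probs = decoder(torch.randn(2, 768))
print(t.shape, probs.shape)  # torch.Size([2]) torch.Size([2, 100])
```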
zh
[CV-69] STAR-Net: An Interpretable Model-Aided Network for Remote Sensing Image Denoising
【速读】:该论文旨在解决远程感知图像(Remote Sensing Image, RSI)去噪中存在可解释性差、对非局部自相似性关注不足以及需要繁琐调整正则化参数的问题。其解决方案的关键在于提出一种名为稀疏张量辅助表示网络(STAR-Net)的新方法,该方法利用低秩先验有效捕捉RSI中的非局部自相似性,并通过引入稀疏变体STAR-Net-S增强对非高斯噪声的鲁棒性;同时,采用基于ADMM(交替方向乘子法)引导的深度展开网络,实现所有正则化参数的自动学习,从而结合了模型驱动与深度学习方法的优势。
链接: https://arxiv.org/abs/2505.24327
作者: Jingjing Liu,Jiashun Jin,Xianchao Xiu,Jianhua Zhang,Wanquan Liu
机构: Shanghai University (上海大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Remote sensing image (RSI) denoising is an important topic in the field of remote sensing. Despite the impressive denoising performance of RSI denoising methods, most current deep learning-based approaches function as black boxes and lack integration with physical information models, leading to limited interpretability. Additionally, many methods may struggle with insufficient attention to non-local self-similarity in RSI and require tedious tuning of regularization parameters to achieve optimal performance, particularly in conventional iterative optimization approaches. In this paper, we first propose a novel RSI denoising method named sparse tensor-aided representation network (STAR-Net), which leverages a low-rank prior to effectively capture the non-local self-similarity within RSI. Furthermore, we extend STAR-Net to a sparse variant called STAR-Net-S to deal with the interference caused by non-Gaussian noise in original RSI for the purpose of improving robustness. Different from conventional iterative optimization, we develop an alternating direction method of multipliers (ADMM)-guided deep unrolling network, in which all regularization parameters can be automatically learned, thus inheriting the advantages of both model-based and deep learning-based approaches and successfully addressing the above-mentioned shortcomings. Comprehensive experiments on synthetic and real-world datasets demonstrate that STAR-Net and STAR-Net-S outperform state-of-the-art RSI denoising methods.
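作为背景,下面给出被 STAR-Net 式深度展开网络所“展开”的经典迭代的一个极简版本:用 ADMM 求解核范数(低秩先验)去噪。论文中正则参数是可学习的,此处为固定的假设值:

```python
import torch

def svt(Y: torch.Tensor, tau: float) -> torch.Tensor:
    """奇异值阈值算子:低秩先验下 ADMM 子问题的闭式解。"""
    U, S, Vh = torch.linalg.svd(Y, full_matrices=False)
    S = torch.clamp(S - tau, min=0.0)
    return U @ torch.diag(S) @ Vh

def admm_lowrank_denoise(Y, lam=0.1, rho=1.0, iters=50):
    """求解 min_X 0.5*||X-Y||_F^2 + lam*||X||_*(核范数去噪)的 ADMM 迭代。"""
    X = Y.clone()
    Z = torch.zeros_like(Y)   # 辅助变量
    U = torch.zeros_like(Y)   # 尺度化对偶变量
    for _ in range(iters):
        X = (Y + rho * (Z - U)) / (1.0 + rho)  # X 子问题:二次项的闭式解
        Z = svt(X + U, lam / rho)              # Z 子问题:奇异值阈值
        U = U + X - Z                          # 对偶更新
    return Z

noisy = torch.randn(64, 64)
clean = admm_lowrank_denoise(noisy)
```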
zh
[CV-70] InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing CVPR2025
【速读】:该论文旨在解决从文本生成新颖的人体-物体交互(Human Object Interaction, HOI)的问题,尤其是针对开放集物体的生成挑战。其解决方案的关键在于提出一种无需特定数据集训练的零样本3D HOI生成框架,通过利用大规模预训练模型的知识,包括从大语言模型(LLM)中推断人-物关系以初始化物体属性并指导优化过程,使用预训练的2D图像扩散模型解析未见过的物体并提取接触点,以及通过多视角SDS采样生成初始人体姿态,并引入细节优化以实现精确且自然的交互。
链接: https://arxiv.org/abs/2505.24315
作者: Jinlu Zhang,Yixin Chen,Zan Wang,Jie Yang,Yizhou Wang,Siyuan Huang
机构: Peking University (北京大学); BIGAI (北京通用人工智能研究院); Beijing Institute of Technology (北京理工大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025
Abstract:Recent advances in 3D human-aware generation have made significant progress. However, existing methods still struggle with generating novel Human Object Interaction (HOI) from text, particularly for open-set objects. We identify three main challenges of this task: precise human-object relation reasoning, affordance parsing for any object, and detailed human interaction pose synthesis aligning description and object geometry. In this work, we propose a novel zero-shot 3D HOI generation framework without training on specific datasets, leveraging the knowledge from large-scale pre-trained models. Specifically, the human-object relations are inferred from large language models (LLMs) to initialize object properties and guide the optimization process. Then we utilize a pre-trained 2D image diffusion model to parse unseen objects and extract contact points, avoiding the limitations imposed by existing 3D asset knowledge. The initial human pose is generated by sampling multiple hypotheses through multi-view SDS based on the input text and object geometry. Finally, we introduce a detailed optimization to generate fine-grained, precise, and natural interaction, enforcing realistic 3D contact between the 3D object and the involved body parts, including hands in grasping. This is achieved by distilling human-level feedback from LLMs to capture detailed human-object relations from the text instruction. Extensive experiments validate the effectiveness of our approach compared to prior works, particularly in terms of the fine-grained nature of interactions and the ability to handle open-set 3D objects.
zh
[CV-71] Progressive Class-level Distillation
【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation, KD)中,传统方法由于高置信度类别主导蒸馏过程,导致低概率类别中包含的区分性信息被忽视,从而造成知识迁移不充分的问题。其解决方案的关键在于提出一种称为渐进式类别级蒸馏(Progressive Class-level Distillation, PCD)的方法,通过分阶段进行知识迁移,首先对教师-学生logits差异进行排序以确定蒸馏优先级,再将整个蒸馏过程划分为多个阶段,并采用双向分阶段蒸馏机制,结合从细到粗的渐进学习与从粗到细的反向优化,实现不同类别组在不同蒸馏阶段内的充分logits对齐,从而提升知识迁移的全面性。
链接: https://arxiv.org/abs/2505.24310
作者: Jiayan Li,Jun Li,Zhourui Zhang,Jianhua Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In knowledge distillation (KD), logit distillation (LD) aims to transfer class-level knowledge from a more powerful teacher network to a small student model via accurate teacher-student alignment at the logits level. Since high-confidence object classes usually dominate the distillation process, low-probability classes which also contain discriminating information are downplayed in conventional methods, leading to insufficient knowledge transfer. To address this issue, we propose a simple yet effective LD method termed Progressive Class-level Distillation (PCD). In contrast to existing methods which perform all-class ensemble distillation, our PCD approach performs stage-wise distillation for step-by-step knowledge transfer. More specifically, we perform ranking on teacher-student logits difference for identifying distillation priority from scratch, and subsequently divide the entire LD process into multiple stages. Next, bidirectional stage-wise distillation incorporating fine-to-coarse progressive learning and reverse coarse-to-fine refinement is conducted, allowing comprehensive knowledge transfer via sufficient logits alignment within separate class groups in different distillation stages. Extension experiments on public benchmarking datasets demonstrate the superiority of our method compared to state-of-the-arts for both classification and detection tasks.
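下面的草图示意“按教师-学生 logits 差异排序并分阶段蒸馏”的核心思路(类别分组方式、阶段调度与双向细化等细节为笔者的简化假设,与论文实现不完全一致):

```python
import torch
import torch.nn.functional as F

def stage_kd_loss(t_logits, s_logits, num_stages=3, stage=0, tau=4.0):
    """按教师-学生 logits 差异排序,把类别分成 num_stages 组,
    只对第 stage 组内的类别做带温度的 KL 蒸馏(示意实现)。"""
    diff = (t_logits - s_logits).abs().mean(dim=0)   # 每个类别的平均差异
    order = diff.argsort(descending=True)            # 差异大的类别优先
    groups = order.chunk(num_stages)                 # 划分蒸馏阶段
    idx = groups[stage]                              # 当前阶段的类别子集
    t_sub, s_sub = t_logits[:, idx], s_logits[:, idx]
    return F.kl_div(F.log_softmax(s_sub / tau, dim=-1),
                    F.softmax(t_sub / tau, dim=-1),
                    reduction="batchmean") * tau * tau

t = torch.randn(8, 100)          # 教师 logits
s = torch.randn(8, 100)          # 学生 logits
loss = stage_kd_loss(t, s, stage=0)
```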
zh
[CV-72] SR3D: Unleashing Single-view 3D Reconstruction for Transparent and Specular Object Grasping
【速读】:该论文试图解决透明和镜面材质物体在3D机器人抓取中的挑战,这些问题主要源于深度感知技术的局限性。解决方案的关键在于提出一种无需训练的框架SR3D,该框架通过单视角的RGB和深度图像,利用外部视觉模型生成物体的3D重建网格,并结合视图匹配和关键点匹配机制,充分利用观测中的2D和3D语义与几何信息,以准确确定物体的位姿和尺度,从而实现对原始深度退化3D场景中物体的精确定位和有效抓取。
链接: https://arxiv.org/abs/2505.24305
作者: Mingxu Zhang,Xiaoqi Li,Jiahui Xu,Kaichen Zhou,Hojin Bae,Yan Shen,Chuyan Xiong,Jiaming Liu,Hao Dong
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advancements in 3D robotic manipulation have improved grasping of everyday objects, but transparent and specular materials remain challenging due to depth sensing limitations. While several 3D reconstruction and depth completion approaches address these challenges, they suffer from setup complexity or limited observation information utilization. To address this, leveraging the power of single view 3D object reconstruction approaches, we propose a training free framework SR3D that enables robotic grasping of transparent and specular objects from a single view observation. Specifically, given single view RGB and depth images, SR3D first uses the external visual models to generate 3D reconstructed object mesh based on RGB image. Then, the key idea is to determine the 3D object’s pose and scale to accurately localize the reconstructed object back into its original depth corrupted 3D scene. Therefore, we propose view matching and keypoint matching mechanisms,which leverage both the 2D and 3D’s inherent semantic and geometric information in the observation to determine the object’s 3D state within the scene, thereby reconstructing an accurate 3D depth map for effective grasp detection. Experiments in both simulation and real world show the reconstruction effectiveness of SR3D.
zh
[CV-73] Category-aware EEG image generation based on wavelet transform and contrast semantic loss
【速读】:该论文试图解决从脑电(EEG)信号中重建视觉刺激的问题,这是实现脑机接口的关键步骤。其解决方案的关键在于提出一种基于Transformer的EEG信号编码器,该编码器融合了离散小波变换(DWT)和门控机制,并通过特征对齐和类别感知融合损失来提取与视觉刺激相关的特征。随后,利用预训练的扩散模型将这些特征重构为视觉刺激,从而提升重建效果。
链接: https://arxiv.org/abs/2505.24301
作者: Enshang Zhang,Zhicheng Zhang,Takashi Hanakawa
机构: Kyoto University Graduate School of Medicine (京都大学医学研究科); JancsiLab (Jancsi实验室); JancsiTech (Jancsi科技); Sino-Finland Joint AI Laboratory for Child Health of Zhejiang Province (浙江省儿童健康中芬人工智能联合实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:Reconstructing visual stimuli from EEG signals is a crucial step in realizing brain-computer interfaces. In this paper, we propose a transformer-based EEG signal encoder integrating the Discrete Wavelet Transform (DWT) and the gating mechanism. Guided by the feature alignment and category-aware fusion losses, this encoder is used to extract features related to visual stimuli from EEG signals. Subsequently, with the aid of a pre-trained diffusion model, these features are reconstructed into visual stimuli. To verify the effectiveness of the model, we conducted EEG-to-image generation and classification tasks using the THINGS-EEG dataset. To address the limitations of quantitative analysis at the semantic level, we combined WordNet-based classification and semantic similarity metrics to propose a novel semantic-based score, emphasizing the ability of our model to transfer neural activities into visual representations. Experimental results show that our model significantly improves semantic alignment and classification accuracy, which achieves a maximum single-subject accuracy of 43%, outperforming other state-of-the-art methods. The source code and supplementary material is available at this https URL.
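作为参考,下面用 PyWavelets 演示对一段 EEG 信号做多级离散小波变换并提取频带能量特征;论文是把 DWT 与门控机制嵌入 Transformer 编码器内部,此处仅示意系数提取这一步,采样率与小波基均为假设:

```python
import numpy as np
import pywt

# 假设的单通道 EEG 片段:采样率 250 Hz、4 秒
fs, seconds = 250, 4
eeg = np.random.randn(fs * seconds)

# 4 级离散小波变换(db4 小波),得到 [cA4, cD4, cD3, cD2, cD1]
coeffs = pywt.wavedec(eeg, wavelet="db4", level=4)
for name, c in zip(["cA4", "cD4", "cD3", "cD2", "cD1"], coeffs):
    print(name, c.shape)  # 各频带系数的长度逐级近似减半

# 简单的频带能量特征,可拼接后送入编码器
band_energy = np.array([np.sum(c ** 2) for c in coeffs])
```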
zh
[CV-74] EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding
【速读】:该论文旨在解决手术室(OR)中多角色与设备在高动态、遮挡密集环境下的精准协同问题,现有数据集未能全面融合第一视角(egocentric)与第三视角(exocentric)的多模态信息。其解决方案的关键在于提出EgoExOR数据集及其基准,首次将可穿戴眼镜的第一视角数据(RGB、注视、手部追踪、音频)、RGB-D相机的第三视角RGB与深度信息以及超声图像进行融合,并通过详细的场景图注释(涵盖36个实体和22种关系,共568,235个三元组)支持临床交互建模,从而为手术场景图生成提供新的多模态感知基础。
链接: https://arxiv.org/abs/2505.24287
作者: Ege Özsoy,Arda Mamur,Felix Tristram,Chantal Pellegrini,Magdalena Wysocki,Benjamin Busam,Nassir Navab
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating advanced perception models to enhance safety and efficiency. Existing datasets either provide partial egocentric views or sparse exocentric multi-view context, but do not explore the comprehensive combination of both. We introduce EgoExOR, the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives. Spanning 94 minutes (84,553 frames at 15 FPS) of two emulated spine procedures, Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery, EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses, exocentric RGB and depth from RGB-D cameras, and ultrasound imagery. Its detailed scene graph annotations, covering 36 entities and 22 relations (568,235 triplets), enable robust modeling of clinical interactions, supporting tasks like action recognition and human-centric perception. We evaluate the surgical scene graph generation performance of two adapted state-of-the-art models and offer a new baseline that explicitly leverages EgoExOR’s multimodal and multi-perspective signals. This new dataset and benchmark set a new foundation for OR perception, offering a rich, multimodal resource for next-generation clinical perception.
zh
[CV-75] LLM-powered Query Expansion for Enhancing Boundary Prediction in Language-driven Action Localization
【速读】:该论文旨在解决语言驱动的视频动作定位中边界不确定性的问题,即语言查询通常缺乏动作起止边界的细节信息,导致手动标注的主观性和训练数据中的边界不确定性。解决方案的关键在于通过大语言模型(LLM)生成动作起止边界的文本描述,以扩展原始查询,从而提供更详细的边界线索;同时,通过计算帧与扩展查询之间的语义相似性以及帧与标注边界帧之间的时间距离,建模动作边界的概率得分,以增强对边界不确定性的容忍度,提升训练稳定性。
链接: https://arxiv.org/abs/2505.24282
作者: Zirui Shang,Xinxiao Wu,Shuo Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Language-driven action localization in videos requires not only semantic alignment between language query and video segment, but also prediction of action boundaries. However, the language query primarily describes the main content of an action and usually lacks specific details of action start and end boundaries, which increases the subjectivity of manual boundary annotation and leads to boundary uncertainty in training data. In this paper, on one hand, we propose to expand the original query by generating textual descriptions of the action start and end boundaries through LLMs, which can provide more detailed boundary cues for localization and thus reduce the impact of boundary uncertainty. On the other hand, to enhance the tolerance to boundary uncertainty during training, we propose to model probability scores of action boundaries by calculating the semantic similarities between frames and the expanded query as well as the temporal distances between frames and the annotated boundary frames. They can provide more consistent boundary supervision, thus improving the stability of training. Our method is model-agnostic and can be seamlessly and easily integrated into any existing models of language-driven action localization in an off-the-shelf manner. Experimental results on several datasets demonstrate the effectiveness of our method.
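下面给出“边界概率得分”建模思路的一个示意实现:将帧与扩展查询的语义相似度、帧与标注边界帧的时间高斯距离加权组合;权重 alpha、带宽 sigma 与归一化方式均为笔者假设:

```python
import torch
import torch.nn.functional as F

def boundary_scores(frame_feats, query_feat, gt_idx, sigma=2.0, alpha=0.5):
    """为每一帧计算动作边界的概率得分(示意实现):
    语义项 = 帧特征与 LLM 扩展边界查询的余弦相似度,
    时间项 = 与标注边界帧距离的高斯衰减,二者加权组合。"""
    sim = F.cosine_similarity(frame_feats, query_feat.unsqueeze(0), dim=-1)
    sim = (sim + 1.0) / 2.0                                    # 映射到 [0, 1]
    t = torch.arange(frame_feats.size(0), dtype=torch.float)
    dist = torch.exp(-((t - gt_idx) ** 2) / (2 * sigma ** 2))  # 时间高斯
    score = alpha * sim + (1 - alpha) * dist
    return score / score.sum()                                 # 归一化为分布

feats = torch.randn(64, 512)   # 64 帧的视觉特征(假设)
q = torch.randn(512)           # “起始边界”扩展查询的文本特征(假设)
p_start = boundary_scores(feats, q, gt_idx=20)
```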
zh
[CV-76] Out of Sight Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames
【速读】:该论文旨在解决多帧视觉语言模型(VLM)在跨时间整合空间线索以进行长时程空间推理的问题,特别是在处理非共视对象对的空间关系时表现不足。解决方案的关键在于引入Disjoint-3DQA基准,通过设计涉及非共视对象对的问题来评估VLMs的3D场景表征能力,从而揭示模型在构建和维护随时间变化的3D场景表示方面的核心瓶颈。
链接: https://arxiv.org/abs/2505.24257
作者: Sahithya Ravi,Gabriel Sarch,Vibhav Vineet,Andrew D. Wilson,Balasaravanan Thoravi Kumaravel
机构: University of British Columbia (不列颠哥伦比亚大学); Carnegie Mellon University (卡内基梅隆大学); Microsoft Research, Redmond (微软研究院,雷德蒙德)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:An embodied AI assistant operating on egocentric video must integrate spatial cues across time - for instance, determining where an object A, glimpsed a few moments ago, lies relative to an object B encountered later. We introduce Disjoint-3DQA, a generative QA benchmark that evaluates this ability of VLMs by posing questions about object pairs that are not co-visible in the same frame. We evaluated seven state-of-the-art VLMs and found that models lag behind human performance by 28%, with steeper declines in accuracy (60% to 30%) as the temporal gap widens. Our analysis further reveals that providing trajectories or bird’s-eye-view projections to VLMs results in only marginal improvements, whereas providing oracle 3D coordinates leads to a substantial 20% performance increase. This highlights a core bottleneck of multi-frame VLMs in constructing and maintaining 3D scene representations over time from visual signals. Disjoint-3DQA therefore sets a clear, measurable challenge for long-horizon spatial reasoning and aims to catalyze future research at the intersection of vision, language, and embodied AI.
zh
[CV-77] Interactive Video Generation via Domain Adaptation
【速读】:该论文旨在解决交互式视频生成(Interactive Video Generation, IVG)中用户对运动元素(如物体轨迹)进行控制的挑战,尤其是在不进行微调训练的情况下,现有方法通过注意力掩码引导轨迹时往往导致感知质量下降的问题。论文提出两个关键解决方案:首先,针对由注意力掩码引起的内部协变量偏移导致的感知质量退化问题,引入了掩码归一化(mask normalization),通过分布匹配缓解这一域偏移问题;其次,为了解决初始噪声与IVG条件之间的初始化差距问题,提出了时间内在扩散先验(temporal intrinsic diffusion prior),在每一步去噪过程中强制时空一致性。
链接: https://arxiv.org/abs/2505.24253
作者: Ishaan Rawal,Suryansh Kumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Preprint. Under Review
Abstract:Text-conditioned diffusion models have emerged as powerful tools for high-quality video generation. However, enabling Interactive Video Generation (IVG), where users control motion elements such as object trajectory, remains challenging. Recent training-free approaches introduce attention masking to guide trajectory, but this often degrades perceptual quality. We identify two key failure modes in these methods, both of which we interpret as domain shift problems, and propose solutions inspired by domain adaptation. First, we attribute the perceptual degradation to internal covariate shift induced by attention masking, as pretrained models are not trained to handle masked attention. To address this, we propose mask normalization, a pre-normalization layer designed to mitigate this shift via distribution matching. Second, we address initialization gap, where the randomly sampled initial noise does not align with IVG conditioning, by introducing a temporal intrinsic diffusion prior that enforces spatio-temporal consistency at each denoising step. Extensive qualitative and quantitative evaluations demonstrate that mask normalization and temporal intrinsic denoising improve both perceptual quality and trajectory control over the existing state-of-the-art IVG techniques.
zh
[CV-78] Harnessing Foundation Models for Robust and Generalizable 6-DOF Bronchoscopy Localization
【速读】:该论文旨在解决基于视觉的六自由度支气管镜定位中存在的一般化能力不足和视觉退化下的鲁棒性差的问题(vision-based 6-DOF bronchoscopy localization)。为应对这些挑战,作者提出了PANSv2框架,其关键在于将深度估计、关键点检测和中心线约束整合到统一的姿态优化框架中,通过评估姿态概率并求解最优支气管镜姿态来提升定位精度。此外,利用端镜基础模型EndoOmni和视频基础模型EndoMamba进行时空分析,增强了模型的泛化能力,同时引入自动重新初始化模块以提高对视觉退化情况的鲁棒性。
链接: https://arxiv.org/abs/2505.24249
作者: Qingyao Tian,Huai Liao,Xinyan Huang,Bingyu Yang,Hongbin Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-based 6-DOF bronchoscopy localization offers a promising solution for accurate and cost-effective interventional guidance. However, existing methods struggle with 1) limited generalization across patient cases due to scarce labeled data, and 2) poor robustness under visual degradation, as bronchoscopy procedures frequently involve artifacts such as occlusions and motion blur that impair visual information. To address these challenges, we propose PANSv2, a generalizable and robust bronchoscopy localization framework. Motivated by PANS that leverages multiple visual cues for pose likelihood measurement, PANSv2 integrates depth estimation, landmark detection, and centerline constraints into a unified pose optimization framework that evaluates pose probability and solves for the optimal bronchoscope pose. To further enhance generalization capabilities, we leverage the endoscopic foundation model EndoOmni for depth estimation and the video foundation model EndoMamba for landmark detection, incorporating both spatial and temporal analyses. Pretrained on diverse endoscopic datasets, these models provide stable and transferable visual representations, enabling reliable performance across varied bronchoscopy scenarios. Additionally, to improve robustness to visual degradation, we introduce an automatic re-initialization module that detects tracking failures and re-establishes pose using landmark detections once clear views are available. Experimental results on bronchoscopy dataset encompassing 10 patient cases show that PANSv2 achieves the highest tracking success rate, with an 18.1% improvement in SR-5 (percentage of absolute trajectory error under 5 mm) compared to existing methods, showing potential towards real clinical usage.
zh
[CV-79] 50 Years of Automated Face Recognition
【速读】:该论文旨在回顾和分析人脸识别(Face Recognition, FR)技术的发展历程及其关键技术突破,重点探讨如何通过数据集扩展、损失函数优化、神经网络架构设计和特征融合等手段提升模型的泛化能力和识别精度。其解决方案的关键在于利用大规模真实与人工智能生成的数据集,结合深度神经网络结构,以实现对高、低质量人脸场景下卓越的识别性能,并推动在实际应用(如签证与边境匹配)中的准确率提升。
链接: https://arxiv.org/abs/2505.24247
作者: Minchul Kim,Anil Jain,Xiaoming Liu
机构: Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Over the past 50 years, automated face recognition has evolved from rudimentary, handcrafted systems into sophisticated deep learning models that rival and often surpass human performance. This paper chronicles the history and technological progression of FR, from early geometric and statistical methods to modern deep neural architectures leveraging massive real and AI-generated datasets. We examine key innovations that have shaped the field, including developments in datasets, loss functions, neural network design, and feature fusion. We also analyze how the scale and diversity of training data influence model generalization, drawing connections between dataset growth and benchmark improvements. Recent advances have achieved remarkable milestones: state-of-the-art face verification systems now report False Negative Identification Rates of 0.13% against a 12.4 million gallery in NIST FRVT evaluations for 1:N visa-to-border matching. While recent advances have enabled remarkable accuracy in both high- and low-quality face scenarios, several open research problems remain. We outline critical challenges and promising directions for future face recognition research, including scalability, multi-modal fusion, synthetic identity generation, and explainable systems.
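文中的 TMR/FMR 类指标可按如下方式计算(示意):在异人分数上取满足目标 FMR 的阈值,再统计同人分数超过阈值的比例。数据为模拟分数,仅演示指标定义:

```python
import numpy as np

def tmr_at_fmr(genuine, impostor, target_fmr=0.01):
    """给定同人(genuine)与异人(impostor)相似度分数,
    计算指定 FMR 下的判别阈值与 TMR(真匹配率)。"""
    impostor = np.sort(impostor)
    # 使 FMR 约等于 target 的阈值:取冒充分数的 (1 - FMR) 分位数
    thr = np.quantile(impostor, 1.0 - target_fmr)
    tmr = float(np.mean(genuine >= thr))
    return tmr, thr

rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 10000)   # 模拟的同人分数
impostor = rng.normal(0.3, 0.1, 10000)  # 模拟的异人分数
tmr, thr = tmr_at_fmr(genuine, impostor, 0.01)
print(f"TMR@FMR=1%: {tmr:.4f} (threshold={thr:.3f})")
```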
zh
[CV-80] LTM3D: Bridging Token Spaces for Conditional 3D Generation with Auto-Regressive Diffusion Framework
【速读】:该论文旨在解决条件化3D形状生成中如何有效结合扩散模型与自回归(AR)模型的优势,以提升生成形状的结构准确性和提示一致性的问题。其解决方案的关键在于提出LTM3D框架,该框架采用条件分布建模主干,结合掩码自编码器与扩散模型以增强令牌间的依赖学习,并引入前缀学习机制以对齐条件令牌与形状潜在令牌,从而提高多模态生成的灵活性;同时,通过潜在令牌重建模块与重建引导采样策略,减少不确定性并提升生成形状的结构保真度。
链接: https://arxiv.org/abs/2505.24245
作者: Xin Kang,Zihan Zheng,Lei Chu,Yue Gao,Jiahao Li,Hao Pan,Xuejin Chen,Yan Lu
机构: Microsoft Research Asia (微软亚洲研究院); University of Science and Technology of China (中国科学技术大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We present LTM3D, a Latent Token space Modeling framework for conditional 3D shape generation that integrates the strengths of diffusion and auto-regressive (AR) models. While diffusion-based methods effectively model continuous latent spaces and AR models excel at capturing inter-token dependencies, combining these paradigms for 3D shape generation remains a challenge. To address this, LTM3D features a Conditional Distribution Modeling backbone, leveraging a masked autoencoder and a diffusion model to enhance token dependency learning. Additionally, we introduce Prefix Learning, which aligns condition tokens with shape latent tokens during generation, improving flexibility across modalities. We further propose a Latent Token Reconstruction module with Reconstruction-Guided Sampling to reduce uncertainty and enhance structural fidelity in generated shapes. Our approach operates in token space, enabling support for multiple 3D representations, including signed distance fields, point clouds, meshes, and 3D Gaussian Splatting. Extensive experiments on image- and text-conditioned shape generation tasks demonstrate that LTM3D outperforms existing methods in prompt fidelity and structural accuracy while offering a generalizable framework for multi-modal, multi-representation 3D generation.
zh
[CV-81] MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中多源幻觉问题,特别是现有基准未能有效区分感知诱导幻觉与推理诱导幻觉,从而阻碍了对多模态推理失败的诊断。其解决方案的关键在于提出MIRAGE基准,通过构建输入图像被正确感知但推理错误的问题来隔离推理幻觉,并引入多粒度评估指标以量化幻觉程度;同时,提出配套方法,结合课程强化微调和协同提示推理,通过逐步降低学习难度鼓励模型生成逻辑一致的推理链,从而减少逻辑幻觉。
链接: https://arxiv.org/abs/2505.24238
作者: Bowen Dong,Minheng Ni,Zitong Huang,Guanglei Yang,Wangmeng Zuo,Lei Zhang
机构: Harbin Institute of Technology (哈尔滨工业大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Multimodal hallucination in multimodal large language models (MLLMs) restricts the correctness of MLLMs. However, multimodal hallucinations are multi-sourced and arise from diverse causes. Existing benchmarks fail to adequately distinguish between perception-induced hallucinations and reasoning-induced hallucinations. This failure constitutes a significant issue and hinders the diagnosis of multimodal reasoning failures within MLLMs. To address this, we propose the MIRAGE benchmark, which isolates reasoning hallucinations by constructing questions where input images are correctly perceived by MLLMs yet reasoning errors persist. MIRAGE introduces multi-granular evaluation metrics: accuracy, factuality, and LLMs hallucination score for hallucination quantification. Our analysis reveals that (1) the model scale, data scale, and training stages significantly affect the degree of logical, fabrication, and factual hallucinations; (2) current MLLMs show no effective improvement on spatial hallucinations caused by misinterpreted spatial relationships, indicating their limited visual reasoning capabilities; and (3) question types correlate with distinct hallucination patterns, highlighting targeted challenges and potential mitigation strategies. To address these challenges, we propose a method that combines curriculum reinforcement fine-tuning to encourage models to generate logic-consistent reasoning chains by stepwise reducing learning difficulty, and collaborative hint inference to reduce reasoning complexity. This method establishes a baseline on MIRAGE, and reduces the logical hallucinations in original base models.
zh
[CV-82] Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models
【速读】:该论文试图解决在视觉-语言预训练(Vision-and-Language Pretraining, VLP)模型中生成自然且语义有意义的对抗样本这一挑战。现有方法主要针对分类任务设计,其受限的优化空间导致在适应VLP模型时效果不佳,产生无效攻击或不自然的伪影。解决方案的关键在于提出LightD框架,该框架通过语义引导的重新光照生成自然对抗样本,利用ChatGPT生成上下文感知的初始光照参数,并结合预训练的重新光照模型(IC-light)实现多样化的光照调整,同时扩展优化空间并保持扰动与场景语义的一致性。此外,采用基于梯度的优化进一步提升攻击效果,同时维持视觉自然性。
链接: https://arxiv.org/abs/2505.24227
作者: Ying Yang,Jie Zhang,Xiao Lv,Di Lin,Tao Xiang,Qing Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
Abstract:While adversarial attacks on vision-and-language pretraining (VLP) models have been explored, generating natural adversarial samples crafted through realistic and semantically meaningful perturbations remains an open challenge. Existing methods, primarily designed for classification tasks, struggle when adapted to VLP models due to their restricted optimization spaces, leading to ineffective attacks or unnatural artifacts. To address this, we propose LightD, a novel framework that generates natural adversarial samples for VLP models via semantically guided relighting. Specifically, LightD leverages ChatGPT to propose context-aware initial lighting parameters and integrates a pretrained relighting model (IC-light) to enable diverse lighting adjustments. LightD expands the optimization space while ensuring perturbations align with scene semantics. Additionally, gradient-based optimization is applied to the reference lighting image to further enhance attack effectiveness while maintaining visual naturalness. The effectiveness and superiority of the proposed LightD have been demonstrated across various VLP models in tasks such as image captioning and visual question answering.
zh
[CV-83] Unleashing High-Quality Image Generation in Diffusion Sampling Using Second-Order Levenberg-Marquardt-Langevin
【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在图像生成过程中因依赖一阶Langevin动力学而导致的采样质量受限问题,尤其是如何有效利用二阶Hessian几何信息以提升采样精度,同时避免高维场景下的计算复杂度过高问题。其解决方案的关键在于提出一种无需训练的Levenberg-Marquardt-Langevin(LML)方法,通过引入低秩近似和阻尼机制,高效地近似扩散过程中的Hessian几何结构,从而在不显著增加计算开销的前提下提升图像生成质量。
链接: https://arxiv.org/abs/2505.24222
作者: Fangyikang Wang,Hubery Yin,Lei Qian,Yinan Li,Shaobin Zhuang,Huminhao Zhu,Yilin Zhang,Yanlong Tang,Chao Zhang,Hanbin Zhao,Hui Qian,Chen Li
机构: Zhejiang University (浙江大学); WeChat Vision, Tencent Inc (微信视觉,腾讯公司); Shanghai Jiao Tong University (上海交通大学); Tencent Lightspeed Studio (腾讯光子工作室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The diffusion models (DMs) have demonstrated the remarkable capability of generating images via learning the noised score function of data distribution. Current DM sampling techniques typically rely on first-order Langevin dynamics at each noise level, with efforts concentrated on refining inter-level denoising strategies. While leveraging additional second-order Hessian geometry to enhance the sampling quality of Langevin is a common practice in Markov chain Monte Carlo (MCMC), the naive attempts to utilize Hessian geometry in high-dimensional DMs lead to quadratic-complexity computational costs, rendering them non-scalable. In this work, we introduce a novel Levenberg-Marquardt-Langevin (LML) method that approximates the diffusion Hessian geometry in a training-free manner, drawing inspiration from the celebrated Levenberg-Marquardt optimization algorithm. Our approach introduces two key innovations: (1) A low-rank approximation of the diffusion Hessian, leveraging the DMs’ inherent structure and circumventing explicit quadratic-complexity computations; (2) A damping mechanism to stabilize the approximated Hessian. This LML approximated Hessian geometry enables the diffusion sampling to execute more accurate steps and improve the image generation quality. We further conduct a theoretical analysis to substantiate the approximation error bound of low-rank approximation and the convergence property of the damping mechanism. Extensive experiments across multiple pretrained DMs validate that the LML method significantly improves image generation quality, with negligible computational overhead.
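下面的草图示意“低秩近似 + 阻尼”的 Hessian 预条件更新:设 H ≈ U diag(s) Uᵀ(k 个主方向),借助 Woodbury 恒等式以 O(dk) 代价计算 (H+λI)⁻¹g,避免平方复杂度的显式求逆。这只是该思想的通用演示,并非 LML 采样器的官方实现:

```python
import torch

def lm_step(grad, H_lowrank, damping=1e-1):
    """Levenberg-Marquardt 风格的阻尼低秩 Hessian 预条件步(示意)。
    由 Woodbury 恒等式:
    (U diag(s) U^T + lam*I)^{-1} g = (1/lam) * (g - U diag(s/(s+lam)) U^T g)
    """
    U, s = H_lowrank                 # U: (d, k) 正交列, s: (k,) 曲率
    lam = damping
    coef = s / (s + lam)
    Ug = U.T @ grad                  # (k,)
    return (grad - U @ (coef * Ug)) / lam

d, k = 1024, 8
U, _ = torch.linalg.qr(torch.randn(d, k))  # 正交的主方向(假设已由低秩估计得到)
s = torch.rand(k) + 0.5                    # 对应的特征值(假设)
g = torch.randn(d)
step = lm_step(g, (U, s))
print(step.shape)  # torch.Size([1024])
```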
zh
[CV-84] Shuffle PatchMix Augmentation with Confidence-Margin Weighted Pseudo-Labels for Enhanced Source-Free Domain Adaptation ICIP2025
[Quick Read]: This paper addresses Source-Free Domain Adaptation (SFDA), i.e., adapting a model to a target domain without access to source data. The key to the solution is a new augmentation technique, Shuffle PatchMix (SPM), combined with a novel pseudo-label reweighting strategy: SPM shuffles and blends image patches to generate diverse and challenging augmentations, while the reweighting strategy prioritizes reliable pseudo-labels to mitigate label noise, improving generalization and robustness on smaller datasets.
Link: https://arxiv.org/abs/2505.24216
Authors: Prasanna Reddy Pulakurthi,Majid Rabbani,Jamison Heard,Sohail Dianat,Celso M. de Melo,Raghuveer Rao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 6 pages, 3 figures, 5 tables, Accepted to IEEE ICIP 2025
Abstract:This work investigates Source-Free Domain Adaptation (SFDA), where a model adapts to a target domain without access to source data. A new augmentation technique, Shuffle PatchMix (SPM), and a novel reweighting strategy are introduced to enhance performance. SPM shuffles and blends image patches to generate diverse and challenging augmentations, while the reweighting strategy prioritizes reliable pseudo-labels to mitigate label noise. These techniques are particularly effective on smaller datasets like PACS, where overfitting and pseudo-label noise pose greater risks. State-of-the-art results are achieved on three major benchmarks: PACS, VisDA-C, and DomainNet-126. Notably, on PACS, improvements of 7.3% (79.4% to 86.7%) and 7.2% are observed in single-target and multi-target settings, respectively, while gains of 2.8% and 0.7% are attained on DomainNet-126 and VisDA-C. This combination of advanced augmentation and robust pseudo-label reweighting establishes a new benchmark for SFDA. The code is available at: this https URL
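To make the augmentation concrete, here is a minimal NumPy sketch of a Shuffle PatchMix-style operation: it shuffles one image's patch grid and alpha-blends the result with a second image. The patch size, fixed blending weight, and the choice of a second image as the mixing partner are illustrative assumptions; the paper's exact SPM recipe (for instance, how the mixing coefficient is sampled) may differ.

```python
import numpy as np

def shuffle_patchmix(img_a, img_b, patch=32, alpha=0.5, rng=None):
    """Shuffle the patch grid of img_a, then blend with the patches of img_b.

    img_a, img_b: HxWxC arrays with H and W divisible by `patch`.
    alpha: blending weight for the shuffled source patches (assumed fixed here).
    """
    rng = rng or np.random.default_rng()
    h, w, c = img_a.shape
    gh, gw = h // patch, w // patch

    def to_patches(img):
        # Cut an image into a (gh*gw, patch, patch, C) stack of patches.
        return (img.reshape(gh, patch, gw, patch, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(gh * gw, patch, patch, c))

    pa = to_patches(img_a.astype(np.float32))
    pb = to_patches(img_b.astype(np.float32))
    order = rng.permutation(gh * gw)              # shuffle patch positions
    mixed = alpha * pa[order] + (1 - alpha) * pb  # blend with the second image
    # Reassemble the patch grid back into an image.
    out = (mixed.reshape(gh, gw, patch, patch, c)
                .transpose(0, 2, 1, 3, 4)
                .reshape(h, w, c))
    return out.clip(0, 255).astype(img_a.dtype)
```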
zh
[CV-85] Benchmarking Foundation Models for Zero-Shot Biometric Tasks
[Quick Read]: This paper targets the underexplored potential of pretrained foundation models for biometric recognition and analysis, in particular the use of vision-language and multimodal foundation models across a range of biometric tasks in the face and iris modalities. The key to the solution is a comprehensive benchmark evaluating the zero-shot and few-shot performance of state-of-the-art publicly available vision-language models (VLMs) and multimodal large language models (MLLMs) on six biometric tasks: face verification, soft-biometric attribute prediction (gender and race), iris recognition, presentation attack detection (PAD), and face manipulation detection (morphs and deepfakes). Experiments show that embeddings from these foundation models can be applied to diverse biometric tasks without fine-tuning, and that adding a simple classifier head further improves accuracy on tasks such as deepfake detection, presentation attack detection, and soft-biometric attribute extraction.
Link: https://arxiv.org/abs/2505.24214
Authors: Redwan Sony,Parisa Farmanifard,Hamzeh Alzwairy,Nitish Shukla,Arun Ross
Affiliations: Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:
Abstract:The advent of foundation models, particularly Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), has redefined the frontiers of artificial intelligence, enabling remarkable generalization across diverse tasks with minimal or no supervision. Yet, their potential in biometric recognition and analysis remains relatively underexplored. In this work, we introduce a comprehensive benchmark that evaluates the zero-shot and few-shot performance of state-of-the-art publicly available VLMs and MLLMs across six biometric tasks spanning the face and iris modalities: face verification, soft biometric attribute prediction (gender and race), iris recognition, presentation attack detection (PAD), and face manipulation detection (morphs and deepfakes). A total of 41 VLMs were used in this evaluation. Experiments show that embeddings from these foundation models can be used for diverse biometric tasks with varying degrees of success. For example, in the case of face verification, a True Match Rate (TMR) of 96.77 percent was obtained at a False Match Rate (FMR) of 1 percent on the Labeled Face in the Wild (LFW) dataset, without any fine-tuning. In the case of iris recognition, the TMR at 1 percent FMR on the IITD-R-Full dataset was 97.55 percent without any fine-tuning. Further, we show that applying a simple classifier head to these embeddings can help perform DeepFake detection for faces, Presentation Attack Detection (PAD) for irides, and extract soft biometric attributes like gender and ethnicity from faces with reasonably high accuracy. This work reiterates the potential of pretrained models in achieving the long-term vision of Artificial General Intelligence.
zh
[CV-86] STORK: Improving the Fidelity of Mid-NFE Sampling for Diffusion and Flow Matching Models
[Quick Read]: This paper aims to balance sampling efficiency and generation quality for diffusion models (DMs) in the mid-range number of function evaluations (NFE, 20-50). Existing methods achieve fast sampling at low NFE but do little to improve quality in the mid-NFE regime, even though models such as Stable Diffusion 3.5, FLUX, and SANA commonly rely on this range for superior results. The proposed solution is the Stabilized Taylor Orthogonal Runge-Kutta (STORK) method, a training-free and structure-independent ODE solver for diffusion models built on a class of stiff ODE solvers with a Taylor-expansion adaptation; it applies to any diffusion sampling procedure, including noise-based and flow-matching models, and markedly improves generation quality in the 20-50 NFE range.
Link: https://arxiv.org/abs/2505.24210
Authors: Zheng Tan,Weizhen Wang,Andrea L. Bertozzi,Ernest K. Ryu
Affiliations: UCLA Department of Mathematics; UCLA Computer Science Department
Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Notes:
Abstract:Diffusion models (DMs) have demonstrated remarkable performance in high-fidelity image and video generation. Because high-quality generations with DMs typically require a large number of function evaluations (NFEs), resulting in slow sampling, there has been extensive research successfully reducing the NFE to a small range (<10) while maintaining acceptable image quality. However, many practical applications, such as those involving Stable Diffusion 3.5, FLUX, and SANA, commonly operate in the mid-NFE regime (20-50 NFE) to achieve superior results, and, despite the practical relevance, research on the effective sampling within this mid-NFE regime remains underexplored. In this work, we propose a novel, training-free, and structure-independent DM ODE solver called the Stabilized Taylor Orthogonal Runge–Kutta (STORK) method, based on a class of stiff ODE solvers with a Taylor expansion adaptation. Unlike prior work such as DPM-Solver, which is dependent on the semi-linear structure of the DM ODE, STORK is applicable to any DM sampling, including noise-based and flow matching-based models. Within the 20-50 NFE range, STORK achieves improved generation quality, as measured by FID scores, across unconditional pixel-level generation and conditional latent-space generation tasks using models like Stable Diffusion 3.5 and SANA. Code is available at this https URL.
zh
[CV-87] Boosting All-in-One Image Restoration via Self-Improved Privilege Learning
[Quick Read]: This paper addresses the unstable optimization dynamics and inter-task conflicts that unified image restoration models face when handling diverse and mixed degradations. The key to the solution is Self-Improved Privilege Learning (SIPL), which extends the use of privileged information (PI) beyond training into inference, letting the model use its own preliminary outputs as pseudo-privileged signals for iterative self-correction at test time. The core component of SIPL is the Proxy Fusion module with a learnable Privileged Dictionary: during training it distills essential high-frequency and structural priors from privileged feature representations, and at inference it interacts with the features of the model's initial restoration, forming a self-correction loop.
Link: https://arxiv.org/abs/2505.24207
Authors: Gang Wu,Junjun Jiang,Kui Jiang,Xianming Liu
Affiliations: Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Unified image restoration models for diverse and mixed degradations often suffer from unstable optimization dynamics and inter-task conflicts. This paper introduces Self-Improved Privilege Learning (SIPL), a novel paradigm that overcomes these limitations by innovatively extending the utility of privileged information (PI) beyond training into the inference stage. Unlike conventional Privilege Learning, where ground-truth-derived guidance is typically discarded after training, SIPL empowers the model to leverage its own preliminary outputs as pseudo-privileged signals for iterative self-refinement at test time. Central to SIPL is Proxy Fusion, a lightweight module incorporating a learnable Privileged Dictionary. During training, this dictionary distills essential high-frequency and structural priors from privileged feature representations. Critically, at inference, the same learned dictionary then interacts with features derived from the model’s initial restoration, facilitating a self-correction loop. SIPL can be seamlessly integrated into various backbone architectures, offering substantial performance improvements with minimal computational overhead. Extensive experiments demonstrate that SIPL significantly advances the state-of-the-art on diverse all-in-one image restoration benchmarks. For instance, when integrated with the PromptIR model, SIPL achieves remarkable PSNR improvements of +4.58 dB on composite degradation tasks and +1.28 dB on diverse five-task benchmarks, underscoring its effectiveness and broad applicability. Codes are available at our project page this https URL.
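The self-correction loop described above can be summarized in a few lines. The sketch below assumes hypothetical interfaces that are not from the paper: `model(x, guidance=None)` returns a restored image plus intermediate features, and `proxy_fusion` wraps the learned Privileged Dictionary.

```python
import torch

def sipl_inference(model, proxy_fusion, degraded, n_rounds=2):
    """Iterative test-time refinement in the spirit of SIPL (a sketch).

    model(x, guidance=None) -> (restored, feats)   # assumed interface
    proxy_fusion(feats)     -> dictionary-refined guidance features
    """
    with torch.no_grad():
        restored, feats = model(degraded)              # preliminary restoration
        for _ in range(n_rounds):
            guided = proxy_fusion(feats)               # pseudo-privileged signal
            restored, feats = model(degraded, guidance=guided)
    return restored
```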
zh
[CV-88] Provably Improving Generalization of Few-Shot Models with Synthetic Data ICML2025
[Quick Read]: This paper tackles the performance bottleneck in few-shot image classification caused by the scarcity of labeled training samples. The key to the solution is a theoretical framework that quantifies how the discrepancy between real and synthetic data distributions affects supervised learning, and, building on this framework, an algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data.
Link: https://arxiv.org/abs/2505.24190
Authors: Lan-Cuong Nguyen,Quan Nguyen-Tri,Bang Tran Khanh,Dung D. Le,Long Tran-Thanh,Khoat Than
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: ICML 2025. Our code will be released soon
Abstract:Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising way to alleviate this issue, but models trained on synthetic samples often face performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. More importantly, our framework suggests practical ways to generate good synthetic samples and to train a predictor with high generalization ability. Building upon this framework, we propose a novel theory-grounded algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach demonstrates superior performance compared to state-of-the-art methods, outperforming them across multiple datasets.
zh
[CV-89] Towards Unified Modeling in Federated Multi-Task Learning via Subspace Decoupling
[Quick Read]: This paper addresses the difficulty of aggregating heterogeneous tasks into a unified model in Federated Multi-Task Learning (FMTL). Conventional FMTL methods focus on building personalized models per client and cannot jointly optimize multiple heterogeneous tasks, so effective joint training breaks down in realistic scenarios where task objectives, label spaces, and optimization paths differ substantially. The key to the solution is FedDEA (Federated Decoupled Aggregation), which dynamically identifies task-relevant dimensions in local updates and rescales them to strengthen optimization, effectively suppressing cross-task interference and enabling task-level decoupled aggregation within a unified global model. FedDEA requires no task labels or architectural modifications, making it broadly applicable and deployment-friendly.
Link: https://arxiv.org/abs/2505.24185
Authors: Yipan Wei,Yuchen Zou,Yapeng Li,Bo Du
Affiliations: Wuhan University; Central China Normal University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Federated Multi-Task Learning (FMTL) enables multiple clients to perform heterogeneous tasks without exchanging their local data, offering broad potential for privacy-preserving multi-task collaboration. However, most existing methods focus on building personalized models for each client and are unable to support the aggregation of multiple heterogeneous tasks into a unified model. As a result, in real-world scenarios where task objectives, label spaces, and optimization paths vary significantly, conventional FMTL methods struggle to achieve effective joint training. To address this challenge, we propose FedDEA (Federated Decoupled Aggregation), an update-structure-aware aggregation method specifically designed for multi-task model integration. Our method dynamically identifies task-relevant dimensions based on the response strength of local updates and enhances their optimization effectiveness through rescaling. This mechanism effectively suppresses cross-task interference and enables task-level decoupled aggregation within a unified global model. FedDEA does not rely on task labels or architectural modifications, making it broadly applicable and deployment-friendly. Experimental results demonstrate that it can be easily integrated into various mainstream federated optimization algorithms and consistently delivers significant overall performance improvements on the widely used NYUD-V2 and PASCAL-Context benchmarks. These results validate the robustness and generalization capabilities of FedDEA under highly heterogeneous task settings.
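As a rough illustration of "decoupled aggregation", the sketch below masks each client's update to its strongest-response dimensions before averaging. The magnitude-based relevance criterion and the keep ratio are assumptions made for illustration; the paper's actual response-strength measure and rescaling rule are not reproduced here.

```python
import numpy as np

def feddea_aggregate(updates, keep_ratio=0.3):
    """Aggregate client updates, keeping only each client's task-relevant dims.

    updates: list of 1-D arrays (flattened local parameter updates).
    """
    d = updates[0].size
    agg = np.zeros(d)
    counts = np.zeros(d)
    k = max(1, int(keep_ratio * d))
    for u in updates:
        idx = np.argsort(np.abs(u))[-k:]   # strongest-response dimensions
        mask = np.zeros(d)
        mask[idx] = 1.0
        agg += u * mask                    # only task-relevant dims contribute
        counts += mask
    counts[counts == 0] = 1.0
    return agg / counts                    # rescale by per-dim participation
```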
zh
[CV-90] Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT
[Quick Read]: This paper addresses the weakness of multimodal large language models (MLLMs) in visual physical reasoning, in particular their poor grasp of physical laws, spatial relations, and causal effects. The key to the solution is the MVPBench benchmark, which rigorously evaluates visual physical reasoning through a visual chain-of-thought (CoT) framework, using a graph-based CoT consistency metric to verify that reasoning paths follow valid physical logic, while minimizing reliance on text priors so that models are pushed toward genuinely understanding the visual input.
Link: https://arxiv.org/abs/2505.24182
Authors: Zhuobai Dong,Junchao Yi,Ziyuan Zheng,Haochen Han,Xiangxi Zheng,Alex Jinpeng Wang,Fangming Liu,Linjie Li
Affiliations: Central South University; University of Electronic Science and Technology of China; Peng Cheng Laboratory; Nanjing University; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:
Abstract:Understanding the physical world - governed by laws of motion, spatial relations, and causality - poses a fundamental challenge for multimodal large language models (MLLMs). While recent advances such as OpenAI o3 and GPT-4o demonstrate impressive perceptual and reasoning capabilities, our investigation reveals these models struggle profoundly with visual physical reasoning, failing to grasp basic physical laws, spatial interactions, and causal effects in complex scenes. More importantly, they often fail to follow coherent reasoning chains grounded in visual evidence, especially when multiple steps are needed to arrive at the correct answer. To rigorously evaluate this capability, we introduce MVPBench, a curated benchmark designed to rigorously evaluate visual physical reasoning through the lens of visual chain-of-thought (CoT). Each example features interleaved multi-image inputs and demands not only the correct final answer but also a coherent, step-by-step reasoning path grounded in evolving visual cues. This setup mirrors how humans reason through real-world physical processes over time. To ensure fine-grained evaluation, we introduce a graph-based CoT consistency metric that verifies whether the reasoning path of model adheres to valid physical logic. Additionally, we minimize shortcut exploitation from text priors, encouraging models to rely on visual understanding. Experimental results reveal a concerning trend: even cutting-edge MLLMs exhibit poor visual reasoning accuracy and weak image-text alignment in physical domains. Surprisingly, RL-based post-training alignment - commonly believed to improve visual reasoning performance - often harms spatial reasoning, suggesting a need to rethink current fine-tuning practices.
zh
[CV-91] DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?
[Quick Read]: This paper asks whether existing vision-language models (VLMs) truly exhibit human-like clinical reasoning in medical image analysis, rather than merely imitating superficial patterns. The key to the solution is DrVD-Bench, the first multimodal benchmark for clinical visual reasoning, built from three modules: Visual Evidence Comprehension, Reasoning Trajectory Assessment, and Report Generation Evaluation. Covering 20 task types, 17 diagnostic categories, and five imaging modalities, it systematically evaluates clinical reasoning and is intended to guide the development of clinically trustworthy VLMs.
Link: https://arxiv.org/abs/2505.24173
Authors: Tianhong Zhou,Yin Xu,Yingtao Zhu,Chuxi Xiao,Haiyang Bian,Lei Wei,Xuegong Zhang
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Vision-language models (VLMs) exhibit strong zero-shot generalization on natural images and show early promise in interpretable medical image analysis. However, existing benchmarks do not systematically evaluate whether these models truly reason like human clinicians or merely imitate superficial patterns. To address this gap, we propose DrVD-Bench, the first multimodal benchmark for clinical visual reasoning. DrVD-Bench consists of three modules: Visual Evidence Comprehension, Reasoning Trajectory Assessment, and Report Generation Evaluation, comprising a total of 7,789 image-question pairs. Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities: CT, MRI, ultrasound, radiography, and pathology. DrVD-Bench is explicitly structured to reflect the clinical reasoning workflow from modality recognition to lesion identification and diagnosis. We benchmark 19 VLMs, including general-purpose and medical-specific, open-source and proprietary models, and observe that performance drops sharply as reasoning complexity increases. While some models begin to exhibit traces of human-like reasoning, they often still rely on shortcut correlations rather than grounded visual understanding. DrVD-Bench offers a rigorous and structured evaluation framework to guide the development of clinically trustworthy VLMs.
zh
[CV-92] Pretraining Deformable Image Registration Networks with Random Images
[Quick Read]: This paper aims to reduce the dependence of medical image registration on large amounts of domain-specific data. The key to the solution is to use registration between randomly generated images as a pretraining proxy task for a foundation model for image registration; this strategy improves registration accuracy, reduces the amount of medical data required, and accelerates downstream training.
Link: https://arxiv.org/abs/2505.24167
Authors: Junyu Chen,Shuwen Wei,Yihao Liu,Aaron Carass,Yong Du
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by MIDL 2025. Code available at this https URL
Abstract:Recent advances in deep learning-based medical image registration have shown that training deep neural networks (DNNs) does not necessarily require medical images. Previous work showed that DNNs trained on randomly generated images with carefully designed noise and contrast properties can still generalize well to unseen medical data. Building on this insight, we propose using registration between random images as a proxy task for pretraining a foundation model for image registration. Empirical results show that our pretraining strategy improves registration accuracy, reduces the amount of domain-specific data needed to achieve competitive performance, and accelerates convergence during downstream training, thereby enhancing computational efficiency.
zh
[CV-93] Training-free zero-shot 3D symmetry detection with visual features back-projected to geometry
[Quick Read]: This paper addresses zero-shot 3D symmetry detection, i.e., identifying the reflection-symmetry planes of 3D objects without any training data. The key to the solution is to leverage visual features from foundation vision models such as DINOv2: features are extracted from rendered views of a 3D object and back-projected onto the original geometry, capturing their symmetric invariance, and a proposed algorithm then identifies the symmetry planes from these features.
Link: https://arxiv.org/abs/2505.24162
Authors: Isaac Aguirre,Ivan Sipiran
Affiliations: University of Chile
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:We present a simple yet effective training-free approach for zero-shot 3D symmetry detection that leverages visual features from foundation vision models such as DINOv2. Our method extracts features from rendered views of 3D objects and backprojects them onto the original geometry. We demonstrate the symmetric invariance of these features and use them to identify reflection-symmetry planes through a proposed algorithm. Experiments on a subset of ShapeNet demonstrate that our approach outperforms both traditional geometric methods and learning-based approaches without requiring any training data. Our work demonstrates how foundation vision models can help in solving complex 3D geometric problems such as symmetry detection.
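One way to turn back-projected features into a symmetry test is to score each candidate plane by how well features agree across the reflection. The brute-force scorer below is a sketch under that assumption; the paper's actual plane-search algorithm is not reproduced here, and the cosine-similarity criterion is an illustrative choice.

```python
import numpy as np

def plane_symmetry_score(points, feats, normal, offset=0.0):
    """Score a candidate reflection plane n.x = offset on a featured point cloud.

    points: (N, 3) surface samples; feats: (N, D) per-point visual features
    (e.g., DINOv2 features back-projected from rendered views).
    Returns the mean cosine similarity between each point's feature and the
    feature of the nearest neighbor of its mirror image; higher is better.
    """
    n = normal / np.linalg.norm(normal)
    dist = points @ n - offset
    mirrored = points - 2.0 * dist[:, None] * n[None, :]   # reflect across plane
    # Nearest original point to each mirrored point (brute force for clarity).
    d2 = ((mirrored[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(axis=1)
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return float((f * f[nn]).sum(axis=1).mean())
```

A plane search would then evaluate this score over a grid or sampled set of candidate normals and offsets and keep the peaks.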
zh
[CV-94] Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders
[Quick Read]: This paper tackles the central dilemma in long-video understanding with multimodal large language models (MLLMs): the huge number of video frames (i.e., visual tokens) versus the limited context length of the language model. Uniform sampling easily selects irrelevant content, while fine-tuning on thousands of frames imposes a heavy computational burden. The key to the proposed Nar-KFC (threading keyframes with narratives) is twofold: keyframe selection is formulated as an integer quadratic programming problem that jointly optimizes query relevance and frame diversity, with a customized greedy search as an efficient alternative to avoid the computational complexity; and, to mitigate the temporal discontinuity caused by sparse keyframe sampling, interleaved textual narratives generated from non-keyframes are inserted between keyframes, yielding a temporal- and content-aware compression of the video.
Link: https://arxiv.org/abs/2505.24158
Authors: Bo Fang,Wenhao Wu,Qiangqiang Wu,Yuxin Song,Antoni B. Chan
Affiliations: City University of Hong Kong; Baidu Inc.; University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Employing Multimodal Large Language Models (MLLMs) for long video understanding remains a challenging problem due to the dilemma between the substantial number of video frames (i.e., visual tokens) versus the limited context length of language models. Traditional uniform sampling often leads to selection of irrelevant content, while post-training MLLMs on thousands of frames imposes a substantial computational burden. In this paper, we propose threading keyframes with narratives (Nar-KFC), a plug-and-play module to facilitate effective and efficient long video perception. Nar-KFC generally involves two collaborative steps. First, we formulate the keyframe selection process as an integer quadratic programming problem, jointly optimizing query-relevance and frame-diversity. To avoid its computational complexity, a customized greedy search strategy is designed as an efficient alternative. Second, to mitigate the temporal discontinuity caused by sparse keyframe sampling, we further introduce interleaved textual narratives generated from non-keyframes using off-the-shelf captioners. These narratives are inserted between keyframes based on their true temporal order, forming a coherent and compact representation. Nar-KFC thus serves as a temporal- and content-aware compression strategy that complements visual and textual modalities. Experimental results on multiple long-video benchmarks demonstrate that Nar-KFC significantly improves the performance of popular MLLMs. Code will be made publicly available.
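The customized greedy search can be sketched as follows: pick frames one at a time, trading query relevance against redundancy with already-chosen frames. The linear trade-off weight `lam` and the max-similarity redundancy term are assumptions standing in for the paper's integer-QP objective.

```python
import numpy as np

def greedy_keyframes(relevance, frame_feats, k, lam=0.5):
    """Greedy stand-in for Nar-KFC's integer-QP keyframe selection (a sketch).

    relevance:   (T,) query-relevance score per frame.
    frame_feats: (T, D) L2-normalized frame embeddings.
    Selects k frames maximizing relevance while penalizing similarity
    to frames already chosen.
    """
    T = relevance.shape[0]
    chosen = []
    for _ in range(min(k, T)):
        best, best_gain = None, -np.inf
        for t in range(T):
            if t in chosen:
                continue
            sim = max((float(frame_feats[t] @ frame_feats[c]) for c in chosen),
                      default=0.0)
            gain = relevance[t] - lam * sim   # relevance minus redundancy
            if gain > best_gain:
                best, best_gain = t, gain
        chosen.append(best)
    return sorted(chosen)                     # keep true temporal order
```

Interleaving captioner-generated narratives between the returned indices would then restore temporal continuity, as the abstract describes.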
zh
[CV-95] Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction
[Quick Read]: This paper addresses the challenge embodied agents face in learning generalizable bimanual manipulation policies, which is hard because of the large action space and the need for coordinated arm movements. Existing approaches obtain bimanual policies from vision-language-action (VLA) models, but transferring knowledge from single-arm data or pretrained VLA models often generalizes poorly, mainly due to the scarcity of bimanual data and the fundamental differences between single-arm and bimanual manipulation. The key to the solution is a new bimanual foundation policy built by fine-tuning a leading text-to-video model to predict robot trajectories and training a lightweight diffusion policy for action generation. To compensate for the lack of embodied knowledge in text-to-video models, a two-stage paradigm fine-tunes separate text-to-flow and flow-to-video models derived from a pretrained text-to-video model: optical flow serves as an intermediate variable that concisely represents subtle motion between images, the text-to-flow model predicts flow to concretize the intent of language instructions, and the flow-to-video model uses this flow for fine-grained video prediction, reducing direct reliance on low-level actions and improving generalization.
Link: https://arxiv.org/abs/2505.24156
Authors: Chenyou Fan,Fangzheng Yan,Chenjia Bai,Jiepeng Wang,Chi Zhang,Zhen Wang,Xuelong Li
Affiliations: Institute of Artificial intelligence (TeleAI), China Telecom; Northwestern Polytechnical University; Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes:
Abstract:Learning a generalizable bimanual manipulation policy is extremely challenging for embodied agents due to the large action space and the need for coordinated arm movements. Existing approaches rely on Vision-Language-Action (VLA) models to acquire bimanual policies. However, transferring knowledge from single-arm datasets or pre-trained VLA models often fails to generalize effectively, primarily due to the scarcity of bimanual data and the fundamental differences between single-arm and bimanual manipulation. In this paper, we propose a novel bimanual foundation policy by fine-tuning the leading text-to-video models to predict robot trajectories and training a lightweight diffusion policy for action generation. Given the lack of embodied knowledge in text-to-video models, we introduce a two-stage paradigm that fine-tunes independent text-to-flow and flow-to-video models derived from a pre-trained text-to-video model. Specifically, optical flow serves as an intermediate variable, providing a concise representation of subtle movements between images. The text-to-flow model predicts optical flow to concretize the intent of language instructions, and the flow-to-video model leverages this flow for fine-grained video prediction. Our method mitigates the ambiguity of language in single-stage text-to-video prediction and significantly reduces the robot-data requirement by avoiding direct use of low-level actions. In experiments, we collect high-quality manipulation data for a real dual-arm robot, and the results of simulation and real-world experiments demonstrate the effectiveness of our method.
zh
[CV-96] he Butterfly Effect in Pathology: Exploring Security in Pathology Foundation Models
[Quick Read]: This paper investigates the security of pathology foundation models under adversarial attack, which remains largely unexplored for models used in whole slide image (WSI) analysis. The key to the solution is a label-free attack framework that requires no downstream task labels, built on the principle of "local perturbation with global impact" and with the perturbation budget redefined around the characteristics of WSIs. By modifying only 0.1% of the patches per slide with imperceptible noise, the attack substantially degrades downstream accuracy across multiple datasets and tasks, exposing the security risks of pathology foundation models.
Link: https://arxiv.org/abs/2505.24141
Authors: Jiashuai Liu,Yingjia Shang,Yingkang Zhan,Di Zhang,Yi Niu,Dong Wei,Xian Wu,Zeyu Gao,Chen Li,Yefeng Zheng
Affiliations: Xi'an Jiaotong University; Heilongjiang University; Tencent Jarvis Lab; University of Cambridge; Westlake University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:
Abstract:With the widespread adoption of pathology foundation models in both research and clinical decision support systems, exploring their security has become a critical concern. However, despite their growing impact, the vulnerability of these models to adversarial attacks remains largely unexplored. In this work, we present the first systematic investigation into the security of pathology foundation models for whole slide image (WSI) analysis against adversarial attacks. Specifically, we introduce the principle of "local perturbation with global impact" and propose a label-free attack framework that operates without requiring access to downstream task labels. Under this attack framework, we revise four classical white-box attack methods and redefine the perturbation budget based on the characteristics of WSI. We conduct comprehensive experiments on three representative pathology foundation models across five datasets and six downstream tasks. Despite modifying only 0.1% of patches per slide with imperceptible noise, our attack leads to downstream accuracy degradation that can reach up to 20% in the worst cases. Furthermore, we analyze key factors that influence attack success, explore the relationship between patch-level vulnerability and semantic content, and conduct a preliminary investigation into potential defence strategies. These findings lay the groundwork for future research on the adversarial robustness and reliable deployment of pathology foundation models. Our code is publicly available at: this https URL.
zh
[CV-97] S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation CVPR2025
[Quick Read]: This paper addresses the underperformance of end-to-end motion planning for autonomous driving when human annotations are unavailable: existing methods typically learn intermediate perception and prediction tasks from human labels, while purely self-supervised approaches that generate planning trajectories directly from sensor inputs lag behind the state of the art. The key to the solution is S4-Driver, built on the PaLI multimodal large language model (MLLM) with a spatio-temporal visual representation; a novel sparse-volume strategy seamlessly transfers the strong visual representations of MLLMs from 2D image space into the native 3D space used for vehicle planning, without fine-tuning the vision encoder, improving trajectory prediction in 3D space.
Link: https://arxiv.org/abs/2505.24139
Authors: Yichen Xie,Runsheng Xu,Tong He,Jyh-Jing Hwang,Katie Luo,Jingwei Ji,Hubert Lin,Letian Chen,Yiren Lu,Zhaoqi Leng,Dragomir Anguelov,Mingxing Tan
Affiliations: UC Berkeley; Waymo LLC; Cornell University; Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Accepted by CVPR2025
Abstract:The latest advancements in multi-modal large language models (MLLMs) have spurred a strong renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches, which directly learn from sensor inputs to generate planning trajectories without human annotations, often underperform the state of the art. We observe a key gap in the input representation space: end-to-end approaches built on MLLMs are often pretrained with reasoning tasks in 2D image space rather than the native 3D space in which autonomous vehicles plan. To this end, we propose S4-Driver, a scalable self-supervised motion planning algorithm with spatio-temporal visual representation, based on the popular PaLI multimodal large language model. S4-Driver uses a novel sparse volume strategy to seamlessly transform the strong visual representation of MLLMs from perspective view to 3D space without the need to finetune the vision encoder. This representation aggregates multi-view and multi-frame visual inputs and enables better prediction of planning trajectories in 3D space. To validate our method, we run experiments on both nuScenes and Waymo Open Motion Dataset (with in-house camera data). Results show that S4-Driver performs favorably against existing supervised multi-task approaches while requiring no human annotations. It also demonstrates great scalability when pretrained on large volumes of unannotated driving logs.
zh
[CV-98] CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs
[Quick Read]: This paper addresses the inadequate assessment of scientific reasoning in current vision-language models (VLMs): existing multimodal benchmarks mostly evaluate generic image comprehension or text-driven reasoning and lack authentic scientific contexts that require integrating domain knowledge with analysis of visual evidence. The key to the solution is CSVQA, a diagnostic multimodal benchmark for evaluating scientific reasoning through domain-grounded visual question answering. It contains 1,378 carefully constructed question-answer pairs spanning a broad range of STEM disciplines, places greater emphasis on real-world scientific content and complex reasoning than prior multimodal benchmarks, and introduces a rigorous evaluation protocol that systematically checks whether model predictions are supported by valid intermediate reasoning steps.
Link: https://arxiv.org/abs/2505.24120
Authors: Ai Jian,Weijie Qiu,Xiaokun Wang,Peiyu Wang,Yunzhuo Hao,Jiangbo Pei,Yichen Wei,Yi Peng,Xuchen Song
Affiliations: Skywork AI, Kunlun Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 36 pages
Abstract:Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks predominantly evaluate generic image comprehension or text-driven reasoning, lacking authentic scientific contexts that require domain-specific knowledge integration with visual evidence analysis. To fill this gap, we present CSVQA, a diagnostic multimodal benchmark specifically designed for evaluating scientific reasoning through domain-grounded visual question answering. The benchmark features 1,378 carefully constructed question-answer pairs spanning diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to prior multimodal benchmarks, CSVQA places greater emphasis on real-world scientific content and complex reasoning. We additionally propose a rigorous evaluation protocol to systematically assess whether model predictions are substantiated by valid intermediate reasoning steps based on curated explanations. Our comprehensive evaluation of 15 VLMs on this benchmark reveals notable performance disparities, as even the top-ranked proprietary model attains only 49.6% accuracy. This empirical evidence underscores the pressing need for advancing scientific reasoning capabilities in VLMs. Our CSVQA is released at this https URL.
zh
[CV-99] Federated Foundation Model for GI Endoscopy Images
[Quick Read]: This paper addresses the data-privacy and data-scarcity obstacles to training foundation models in medical imaging, especially gastrointestinal endoscopy, where the sensitivity of medical data rules out direct data sharing for large-scale training. The key to the solution is a federated learning (FL) framework that keeps data within local hospital environments while collaboratively training shared model parameters, making effective foundation-model training possible while preserving patient privacy.
Link: https://arxiv.org/abs/2505.24108
Authors: Alina Devkota,Annahita Amireskandari,Joel Palko,Shyam Thakkar,Donald Adjeroh,Xiajun Jiang,Binod Bhattarai,Prashnna K. Gyawali
Affiliations: West Virginia University, Morgantown, WV, USA; University of Memphis, Memphis, TN, USA; University of Aberdeen, UK
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 11 pages, 11 figures, submitted to BHI2025
Abstract:Gastrointestinal (GI) endoscopy is essential in identifying GI tract abnormalities in order to detect diseases in their early stages and improve patient outcomes. Although deep learning has shown success in supporting GI diagnostics and decision-making, these models require curated datasets with labels that are expensive to acquire. Foundation models offer a promising solution by learning general-purpose representations, which can be finetuned for specific tasks, overcoming data scarcity. Developing foundation models for medical imaging holds significant potential, but the sensitive and protected nature of medical data presents unique challenges. Foundation model training typically requires extensive datasets, and while hospitals generate large volumes of data, privacy restrictions prevent direct data sharing, making foundation model training infeasible in most scenarios. In this work, we propose a FL framework for training foundation models for gastroendoscopy imaging, enabling data to remain within local hospital environments while contributing to a shared model. We explore several established FL algorithms, assessing their suitability for training foundation models without relying on task-specific labels, conducting experiments in both homogeneous and heterogeneous settings. We evaluate the trained foundation model on three critical downstream tasks (classification, detection, and segmentation) and demonstrate that it achieves improved performance across all tasks, highlighting the effectiveness of our approach in a federated, privacy-preserving setting.
zh
[CV-100] Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors ICLR2025
[Quick Read]: This paper addresses weakly supervised affordance grounding: training a model to identify affordance regions on objects from human-object interaction images and egocentric object images, without dense annotations. Prior methods mostly build on class activation maps, which work well for semantic segmentation but may be ill-suited to locating actions and functions. The key to the solution is to leverage advanced foundation models and build a supervised training pipeline based on pseudo labels, generated by an off-the-shelf part segmentation model and guided by a mapping from affordances to part names. Three further enhancements are introduced: a label refining stage, a fine-grained feature alignment process, and a lightweight reasoning module, which together harness the semantic knowledge of static objects embedded in foundation models to improve affordance learning and bridge the gap between objects and actions.
Link: https://arxiv.org/abs/2505.24103
Authors: Peiran Xu,Yadong Mu
Affiliations: Wangxuan Institute of Computer Technology; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: ICLR 2025
Abstract:In this work, we focus on the task of weakly supervised affordance grounding, where a model is trained to identify affordance regions on objects using human-object interaction images and egocentric object images without dense labels. Previous works are mostly built upon class activation maps, which are effective for semantic segmentation but may not be suitable for locating actions and functions. Leveraging recent advanced foundation models, we develop a supervised training pipeline based on pseudo labels. The pseudo labels are generated from an off-the-shelf part segmentation model, guided by a mapping from affordance to part names. Furthermore, we introduce three key enhancements to the baseline model: a label refining stage, a fine-grained feature alignment process, and a lightweight reasoning module. These techniques harness the semantic knowledge of static objects embedded in off-the-shelf foundation models to improve affordance learning, effectively bridging the gap between objects and actions. Extensive experiments demonstrate that the performance of the proposed model has achieved a breakthrough improvement over existing methods. Our codes are available at this https URL.
zh
[CV-101] Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting ICML2025
[Quick Read]: This paper addresses concept forgetting when fine-tuning foundation models: after fine-tuning on a specific task, the model loses knowledge it had previously learned for other tasks. The key to the solution is Proxy-FDA, a novel regularization method that explicitly preserves structural knowledge in feature space by aligning the feature distributions of the pre-trained and fine-tuned models, with dynamically generated informative proxies that increase data diversity.
Link: https://arxiv.org/abs/2505.24088
Authors: Chen Huang,Skyler Seto,Hadi Pouransari,Mehrdad Farajtabar,Raviteja Vemulapalli,Fartash Faghri,Oncel Tuzel,Barry-John Theobald,Josh Susskind
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: ICML 2025
Abstract:Vision foundation models pre-trained on massive data encode rich representations of real-world concepts, which can be adapted to downstream tasks by fine-tuning. However, fine-tuning foundation models on one task often leads to the issue of concept forgetting on other tasks. Recent methods of robust fine-tuning aim to mitigate forgetting of prior knowledge without affecting the fine-tuning performance. Knowledge is often preserved by matching the original and fine-tuned model weights or feature pairs. However, such point-wise matching can be too strong, without explicit awareness of the feature neighborhood structures that encode rich knowledge as well. We propose a novel regularization method Proxy-FDA that explicitly preserves the structural knowledge in feature space. Proxy-FDA performs Feature Distribution Alignment (using nearest neighbor graphs) between the pre-trained and fine-tuned feature spaces, and the alignment is further improved by informative proxies that are generated dynamically to increase data diversity. Experiments show that Proxy-FDA significantly reduces concept forgetting during fine-tuning, and we find a strong correlation between forgetting and a distributional distance metric (in comparison to L2 distance). We further demonstrate Proxy-FDA’s benefits in various fine-tuning settings (end-to-end, few-shot and continual tuning) and across different tasks like image classification, captioning and VQA.
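A bare-bones version of the nearest-neighbor-graph alignment might look like this: build the kNN graph in the pre-trained feature space and penalize the fine-tuned features for changing those neighbor similarities. The cosine metric, squared penalty, and value of k are assumptions, and the dynamically generated proxies of Proxy-FDA are omitted.

```python
import torch
import torch.nn.functional as F

def fda_loss(f_pre, f_fine, k=5):
    """Sketch of a feature-distribution-alignment term over a kNN graph.

    f_pre, f_fine: (N, D) features of the same batch from the pre-trained
    and fine-tuned encoders, respectively.
    """
    pre = F.normalize(f_pre, dim=1)
    sim_pre = pre @ pre.t()
    sim_pre.fill_diagonal_(-1.0)                 # exclude self-matches
    pre_vals, pre_idx = sim_pre.topk(k, dim=1)   # pre-trained kNN graph

    fine = F.normalize(f_fine, dim=1)
    # Similarities of the fine-tuned features over the pre-trained neighbor graph.
    fine_vals = (fine.unsqueeze(1) * fine[pre_idx]).sum(-1)
    return ((fine_vals - pre_vals) ** 2).mean()
```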
zh
[CV-102] ComposeAnything: Composite Object Priors for Text-to-Image Generation
[Quick Read]: This paper addresses the difficulty current text-to-image (T2I) models have in generating images with complex and novel object arrangements, especially capturing 3D positioning accurately while preserving image quality and coherence. The key to the solution is the ComposeAnything framework: it uses the chain-of-thought reasoning of LLMs to produce a 2.5D semantic layout from text, consisting of 2D object bounding boxes enriched with depth information, then generates a spatial- and depth-aware coarse composite of objects from this layout, which serves as a strong, interpretable prior replacing random-noise initialization in diffusion-based T2I models; object-prior reinforcement and spatially controlled denoising then guide the denoising process toward seamless compositional objects and coherent backgrounds, all without retraining existing T2I models.
Link: https://arxiv.org/abs/2505.24086
Authors: Zeeshan Khan,Shizhe Chen,Cordelia Schmid
Affiliations: Inria; École normale supérieure; CNRS; PSL Research University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Generating images from text involving complex and novel object arrangements remains a significant challenge for current text-to-image (T2I) models. Although prior layout-based methods improve object arrangements using spatial constraints with 2D layouts, they often struggle to capture 3D positioning and sacrifice quality and coherence. In this work, we introduce ComposeAnything, a novel framework for improving compositional image generation without retraining existing T2I models. Our approach first leverages the chain-of-thought reasoning abilities of LLMs to produce 2.5D semantic layouts from text, consisting of 2D object bounding boxes enriched with depth information and detailed captions. Based on this layout, we generate a spatial and depth aware coarse composite of objects that captures the intended composition, serving as a strong and interpretable prior that replaces stochastic noise initialization in diffusion-based T2I models. This prior guides the denoising process through object prior reinforcement and spatial-controlled denoising, enabling seamless generation of compositional objects and coherent backgrounds, while allowing refinement of inaccurate priors. ComposeAnything outperforms state-of-the-art methods on the T2I-CompBench and NSR-1K benchmarks for prompts with 2D/3D spatial arrangements, high object counts, and surreal compositions. Human evaluations further demonstrate that our model generates high-quality images with compositions that faithfully reflect the text.
zh
[CV-103] SIM: A mapping framework for built environment auditing based on street view imagery
[Quick Read]: This paper addresses the lack of methods and tools for geospatially mapping objects in built environment audits based on street view imagery, which limits the efficiency and accuracy of street object auditing. The key to the solution is an open-source street view image mapping framework with three core pipelines: 1) width measurement for ground objects such as roads; 2) 3D localization for objects of known dimensions (e.g., doors and stop signs); and 3) diameter measurement (e.g., street trees). These pipelines automatically measure and map target objects, improving the productivity and accuracy of built environment auditing.
Link: https://arxiv.org/abs/2505.24076
Authors: Huan Ning,Zhenlong Li,Manzhu Yu,Wenpeng Yin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Built environment auditing refers to the systematic documentation and assessment of urban and rural spaces’ physical, social, and environmental characteristics, such as walkability, road conditions, and traffic lights. It is used to collect data for the evaluation of how built environments impact human behavior, health, mobility, and overall urban functionality. Traditionally, built environment audits were conducted using field surveys and manual observations, which were time-consuming and costly. The emerging street view imagery, e.g., Google Street View, has become a widely used data source for conducting built environment audits remotely. Deep learning and computer vision techniques can extract and classify objects from street images to enhance auditing productivity. Before meaningful analysis, the detected objects need to be geospatially mapped for accurate documentation. However, the mapping methods and tools based on street images are underexplored, and there are no universal frameworks or solutions yet, imposing difficulties in auditing the street objects. In this study, we introduced an open source street view mapping framework, providing three pipelines to map and measure: 1) width measurement for ground objects, such as roads; 2) 3D localization for objects with a known dimension (e.g., doors and stop signs); and 3) diameter measurements (e.g., street trees). These pipelines can help researchers, urban planners, and other professionals automatically measure and map target objects, promoting built environment auditing productivity and accuracy. Three case studies, including road width measurement, stop sign localization, and street tree diameter measurement, are provided in this paper to showcase pipeline usage.
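The known-dimension localization pipeline rests on the pinhole-camera relation distance = focal_length * real_size / pixel_size. A tiny sketch, with made-up numbers for the stop-sign case (the framework's actual pipeline also handles heading and geolocation, which are omitted here):

```python
def object_distance(pixel_height, real_height_m, focal_px):
    """Pinhole-camera range estimate for an object of known physical size.

    pixel_height:  object height in image pixels (e.g., a detected stop sign).
    real_height_m: known physical height in meters.
    focal_px:      camera focal length expressed in pixels.
    """
    return focal_px * real_height_m / pixel_height

# Example (illustrative values): a 0.75 m stop sign spanning 45 px under a
# 1400 px focal length sits roughly 23 m from the camera.
print(object_distance(45, 0.75, 1400))  # ~23.3
```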
zh
[CV-104] 3DGEER: Exact and Efficient Volumetric Rendering with 3D Gaussians
[Quick Read]: This paper addresses the limited rendering quality of 3D Gaussian Splatting (3DGS) under large field-of-view (FoV) camera inputs, caused by approximating 3D Gaussians as 2D Gaussians on the image plane; existing extensions of 3DGS mitigate these approximation errors but fail to achieve exactness and efficiency simultaneously. The key to the proposed 3DGEER is a closed-form expression, derived from first principles, for the density integral along a ray traversing a 3D Gaussian, enabling exact forward rendering under arbitrary camera models and gradient-based optimization of the 3D Gaussian parameters. To keep rendering real-time, it also proposes an efficient computation of a tight Particle Bounding Frustum (PBF) per Gaussian for accurate and fast ray-Gaussian association, and a novel Bipolar Equiangular Projection (BEAP) representation to accelerate association under generic camera models and improve reconstruction quality.
Link: https://arxiv.org/abs/2505.24053
Authors: Zixun Huang,Cho-Ying Wu,Yuliang Guo,Xinyu Huang,Liu Ren
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:3D Gaussian Splatting (3DGS) marks a significant milestone in balancing the quality and efficiency of differentiable rendering. However, its high efficiency stems from an approximation of projecting 3D Gaussians onto the image plane as 2D Gaussians, which inherently limits rendering quality, particularly under large Field-of-View (FoV) camera inputs. While several recent works have extended 3DGS to mitigate these approximation errors, none have successfully achieved both exactness and high efficiency simultaneously. In this work, we introduce 3DGEER, an Exact and Efficient Volumetric Gaussian Rendering method. Starting from first principles, we derive a closed-form expression for the density integral along a ray traversing a 3D Gaussian distribution. This formulation enables precise forward rendering with arbitrary camera models and supports gradient-based optimization of 3D Gaussian parameters. To ensure both exactness and real-time performance, we propose an efficient method for computing a tight Particle Bounding Frustum (PBF) for each 3D Gaussian, enabling accurate and efficient ray-Gaussian association. We also introduce a novel Bipolar Equiangular Projection (BEAP) representation to accelerate ray association under generic camera models. BEAP further provides a more uniform ray sampling strategy to apply supervision, which empirically improves reconstruction quality. Experiments on multiple pinhole and fisheye datasets show that our method consistently outperforms prior methods, establishing a new state-of-the-art in real-time neural rendering.
zh
[CV-105] From Images to Signals: Are Large Vision Models Useful for Time Series Analysis?
[Quick Read]: This paper examines whether large vision models (LVMs) are actually useful for time series analysis, in particular for time series classification and forecasting. The key to the solution is the first principled study of this question, covering 4 LVMs, 8 imaging methods, 18 datasets, and 26 baselines across both high-level (classification) and low-level (forecasting) tasks, with extensive ablation analysis to assess the practical capabilities and limitations of LVMs on time series tasks.
Link: https://arxiv.org/abs/2505.24030
Authors: Ziming Zhao,ChengAo Shen,Hanghang Tong,Dongjin Song,Zhigang Deng,Qingsong Wen,Jingchao Ni
Affiliations: University of Houston; University of Illinois at Urbana-Champaign; University of Connecticut; Squirrel Ai Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Transformer-based models have gained increasing attention in time series research, driving interest in Large Language Models (LLMs) and foundation models for time series analysis. As the field moves toward multi-modality, Large Vision Models (LVMs) are emerging as a promising direction. In the past, the effectiveness of Transformer and LLMs in time series has been debated. When it comes to LVMs, a similar question arises: are LVMs truly useful for time series analysis? To address it, we design and conduct the first principled study involving 4 LVMs, 8 imaging methods, 18 datasets and 26 baselines across both high-level (classification) and low-level (forecasting) tasks, with extensive ablation analysis. Our findings indicate LVMs are indeed useful for time series classification but face challenges in forecasting. Although effective, the contemporary best LVM forecasters are limited to specific types of LVMs and imaging methods, exhibit a bias toward forecasting periods, and have limited ability to utilize long look-back windows. We hope our findings could serve as a cornerstone for future research on LVM- and multimodal-based solutions to different time series tasks.
zh
[CV-106] MaskAdapt: Unsupervised Geometry-Aware Domain Adaptation Using Multimodal Contextual Learning and RGB-Depth Masking CVPR
[Quick Read]: This paper addresses the poor cross-domain generalization of crop and weed semantic segmentation under an unsupervised domain adaptation (UDA) setting, where domain shifts such as changes in lighting, camera setups, soil composition, and crop growth stages degrade performance. The key to the proposed MaskAdapt is multimodal contextual learning that fuses RGB images with features derived from depth data: depth gradients computed from depth maps capture spatial transitions that resolve texture ambiguities, and a cross-attention mechanism uses them to refine the RGB feature representation for sharper boundary delineation. In addition, a geometry-aware masking strategy applies horizontal, vertical, and stochastic masks during training, encouraging the model to rely on broader spatial context and making visual recognition more robust.
Link: https://arxiv.org/abs/2505.24026
Authors: Numair Nadeem,Muhammad Hamza Asad,Saeed Anwar,Abdul Bais
Affiliations: University of Regina; University Canada West; The University of Western Australia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 11 pages, 5 figures, presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025. Reviewer comments available upon request
Abstract:Semantic segmentation of crops and weeds is crucial for site-specific farm management; however, most existing methods depend on labor intensive pixel-level annotations. A further challenge arises when models trained on one field (source domain) fail to generalize to new fields (target domain) due to domain shifts, such as variations in lighting, camera setups, soil composition, and crop growth stages. Unsupervised Domain Adaptation (UDA) addresses this by enabling adaptation without target-domain labels, but current UDA methods struggle with occlusions and visual blending between crops and weeds, leading to misclassifications in real-world conditions. To overcome these limitations, we introduce MaskAdapt, a novel approach that enhances segmentation accuracy through multimodal contextual learning by integrating RGB images with features derived from depth data. By computing depth gradients from depth maps, our method captures spatial transitions that help resolve texture ambiguities. These gradients, through a cross-attention mechanism, refine RGB feature representations, resulting in sharper boundary delineation. In addition, we propose a geometry-aware masking strategy that applies horizontal, vertical, and stochastic masks during training. This encourages the model to focus on the broader spatial context for robust visual recognition. Evaluations on real agricultural datasets demonstrate that MaskAdapt consistently outperforms existing State-of-the-Art (SOTA) UDA methods, achieving improved segmentation mean Intersection over Union (mIOU) across diverse field conditions.
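The geometry-aware masking strategy is straightforward to mock up: hide a horizontal band, a vertical band, or scattered patches during training. The band/patch sizes and placement below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def geometry_aware_mask(h, w, mode=None, ratio=0.3, rng=None):
    """Build a binary training mask (1 = keep, 0 = hide), sketching the
    horizontal / vertical / stochastic masking described in the abstract."""
    rng = rng or np.random.default_rng()
    mode = mode or rng.choice(["horizontal", "vertical", "random"])
    mask = np.ones((h, w), dtype=np.float32)
    if mode == "horizontal":                    # hide a horizontal band
        band = max(1, int(ratio * h))
        top = int(rng.integers(0, h - band + 1))
        mask[top:top + band, :] = 0.0
    elif mode == "vertical":                    # hide a vertical band
        band = max(1, int(ratio * w))
        left = int(rng.integers(0, w - band + 1))
        mask[:, left:left + band] = 0.0
    else:                                       # hide random 16x16 patches
        n_patches = int(ratio * (h // 16) * (w // 16))
        for _ in range(n_patches):
            y = int(rng.integers(0, max(1, h - 16)))
            x = int(rng.integers(0, max(1, w - 16)))
            mask[y:y + 16, x:x + 16] = 0.0
    return mask
```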
zh
[CV-107] DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
[Quick Read]: This paper addresses the absence of in-context reasoning capabilities in vision foundation models, which remain unexplored for query-based representation models such as the DINO series. The key to the solution is Group Relative Query Optimization (GRQO), a novel reinforcement-style training strategy designed for query-based representation models: it computes query-level rewards based on group-normalized alignment quality to incentivize visual in-context reasoning, and applies KL regularization to stabilize the objectness distribution, providing dense and expressive supervision across queries while mitigating overfitting and distributional drift.
Link: https://arxiv.org/abs/2505.24025
Authors: Chenbin Pan,Wenbin He,Zhengzhong Tu,Liu Ren
Affiliations: Bosch Research North America; Bosch Center for Artificial Intelligence (BCAI); Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:
Abstract:The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose DINO-R1, the first such attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces Group Relative Query Optimization (GRQO), a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply KL-regularization to stabilize the objectness distribution to reduce training instability. This joint optimization enables dense and expressive supervision across queries while mitigating overfitting and distributional drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models that integrate a visual prompt encoder and a visual-guided query selection mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving strong generalization in both open-vocabulary and closed-set visual prompting scenarios.
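Two small pieces capture the flavor of GRQO as described: group-normalized query-level advantages (GRPO-style) and a KL penalty on the objectness distribution. The tensor shapes and the exact reward definition are assumptions for illustration.

```python
import torch

def grqo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages from query-level alignment rewards.

    rewards: (G, Q) tensor, G query groups with Q queries each (assumed shape).
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def objectness_kl(logits, ref_logits):
    """KL(current || reference) over per-query objectness distributions,
    serving as the stabilizing regularizer the abstract mentions."""
    logp = torch.log_softmax(logits, dim=-1)
    logq = torch.log_softmax(ref_logits, dim=-1)
    return (logp.exp() * (logp - logq)).sum(dim=-1).mean()
```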
zh
[CV-108] Multi-Group Proportional Representation for Text-to-Image Models
[Quick Read]: This paper addresses representation and fairness problems in text-to-image (T2I) generative models, in particular under-representation of intersectional groups, propagation of stereotypes, and effacement of minority populations; despite growing attention to "safe" and "responsible" AI design, no established methodology exists for systematically measuring and controlling representational harms in image generation. The key to the solution is a new framework that applies the Multi-Group Proportional Representation (MPR) metric to measure the representation of intersectional groups in images generated by T2I models: MPR evaluates the worst-case deviation of representation statistics across given population groups, allowing flexible, context-specific measurement based on user requirements, and an accompanying optimization algorithm makes MPR usable as a training objective to steer models toward more balanced generation across demographic groups while maintaining generation quality.
Link: https://arxiv.org/abs/2505.24023
Authors: Sangwon Jung,Alex Oesterling,Claudio Mayrink Verdun,Sajani Vithana,Taesup Moon,Flavio P. Calmon
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:
Abstract:Text-to-image (T2I) generative models can create vivid, realistic images from textual descriptions. As these models proliferate, they expose new concerns about their ability to represent diverse demographic groups, propagate stereotypes, and efface minority populations. Despite growing attention to the “safe” and “responsible” design of artificial intelligence (AI), there is no established methodology to systematically measure and control representational harms in image generation. This paper introduces a novel framework to measure the representation of intersectional groups in images generated by T2I models by applying the Multi-Group Proportional Representation (MPR) metric. MPR evaluates the worst-case deviation of representation statistics across given population groups in images produced by a generative model, allowing for flexible and context-specific measurements based on user requirements. We also develop an algorithm to optimize T2I models for this metric. Through experiments, we demonstrate that MPR can effectively measure representation statistics across multiple intersectional groups and, when used as a training objective, can guide models toward a more balanced generation across demographic groups while maintaining generation quality.
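A simplified proportional-representation gap can be computed in a few lines: take the worst-case absolute deviation between each group's share of generated images and its target share. Note this collapses the paper's MPR, which is more general (it takes a supremum over a class of representation statistics), down to per-group indicator statistics.

```python
def mpr_gap(group_labels, targets):
    """Worst-case deviation between observed and target group proportions.

    group_labels: list of group ids, one per generated image.
    targets: dict mapping group id -> desired proportion (sums to 1).
    """
    n = len(group_labels)
    worst = 0.0
    for g, target in targets.items():
        share = sum(1 for x in group_labels if x == g) / n
        worst = max(worst, abs(share - target))
    return worst

# A 60/40 split against a 50/50 target yields a gap of 0.10.
print(mpr_gap(["a"] * 6 + ["b"] * 4, {"a": 0.5, "b": 0.5}))
```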
zh
[CV-109] Preemptive Hallucination Reduction: An Input-Level Approach for Multimodal Language Model
[Quick Read]: This paper addresses visual hallucinations in large language models, where generated responses are inconsistent with the visual input, seriously undermining reliability in settings that demand precise and trustworthy outputs. The key to the solution is an ensemble-based preprocessing framework that adaptively selects the most suitable filtering approach for each question type, noise-reduced (NR), edge-enhanced (EE), or unaltered input (org), effectively lowering the hallucination rate without modifying the underlying model architecture or training pipeline.
Link: https://arxiv.org/abs/2505.24007
Authors: Nokimul Hasan Arif,Shadman Rabby,Md Hefzul Hossain Papon,Sabbir Ahmed
Affiliations: Dhaka International University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Submitted for review in AJSE Springer, 21 pages, 4 figures, 4 Tables
Abstract:Visual hallucinations in Large Language Models (LLMs), where the model generates responses that are inconsistent with the visual input, pose a significant challenge to their reliability, particularly in contexts where precise and trustworthy outputs are critical. Current research largely emphasizes post-hoc correction or model-specific fine-tuning strategies, with limited exploration of preprocessing techniques to address hallucination issues at the input stage. This study presents a novel ensemble-based preprocessing framework that adaptively selects the most appropriate filtering approach, noise reduced (NR), edge enhanced (EE), or unaltered input (org), based on the type of question posed, resulting in reduced hallucination without requiring any modifications to the underlying model architecture or training pipeline. Evaluated on the 'HaloQuest' dataset, a benchmark designed to test multimodal reasoning on visually complex inputs, our method achieves a 44.3% reduction in hallucination rates, as measured by Natural Language Inference (NLI) scores using SelfCheckGPT. This demonstrates that intelligent input conditioning alone can significantly enhance factual grounding in LLM responses. The findings highlight the importance of adaptive preprocessing techniques in mitigating hallucinations, paving the way for more reliable multimodal systems capable of addressing real-world challenges.
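A toy version of the adaptive selection step routes each question to one of the three preprocessing modes. The keyword rules below are invented purely for illustration; the paper's actual routing criterion is not specified here.

```python
def select_filter(question: str) -> str:
    """Pick NR / EE / org preprocessing from the question type (a sketch)."""
    q = question.lower()
    if any(w in q for w in ("count", "how many", "edge", "shape")):
        return "EE"    # structure-heavy questions favor edge enhancement
    if any(w in q for w in ("color", "scene", "weather")):
        return "NR"    # appearance questions favor denoised input
    return "org"       # otherwise leave the image unaltered

print(select_filter("How many cups are on the table?"))  # EE
```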
zh
[CV-110] DGIQA: Depth-guided Feature Attention and Refinement for Generalizable Image Quality Assessment
[Quick Read]: This paper addresses a long-standing challenge of no-reference image quality assessment (NR-IQA) learned from human subjective perception: the lack of objective generalization to unseen natural distortions. The key to the solution is a novel depth-guided cross-attention and refinement (Depth-CAR) mechanism that distills scene depth and spatial features into a structure-aware representation, bringing in knowledge of object saliency and relative scene contrast for more discriminative feature learning, together with a TCB (Transformer-CNN Bridge) that fuses high-level global contextual dependencies from a transformer backbone with local spatial features captured by hierarchical CNN layers.
Link: https://arxiv.org/abs/2505.24002
Authors: Vaishnav Ramesh,Junliang Liu,Haining Wang,Md Jahidul Islam
Affiliations: University of Florida
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 18 pages
Abstract:A long-held challenge in no-reference image quality assessment (NR-IQA) learning from human subjective perception is the lack of objective generalization to unseen natural distortions. To address this, we integrate a novel Depth-Guided cross-attention and refinement (Depth-CAR) mechanism, which distills scene depth and spatial features into a structure-aware representation for improved NR-IQA. This brings in the knowledge of object saliency and relative contrast of the scene for more discriminative feature learning. Additionally, we introduce the idea of TCB (Transformer-CNN Bridge) to fuse high-level global contextual dependencies from a transformer backbone with local spatial features captured by a set of hierarchical CNN (convolutional neural network) layers. We implement TCB and Depth-CAR as multimodal attention-based projection functions to select the most informative features, which also improve training time and inference efficiency. Experimental results demonstrate that our proposed DGIQA model achieves state-of-the-art (SOTA) performance on both synthetic and authentic benchmark datasets. More importantly, DGIQA outperforms SOTA models on cross-dataset evaluations as well as in assessing natural image distortions such as low-light effects, hazy conditions, and lens flares.
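A minimal PyTorch stand-in for the depth-guided cross-attention idea: RGB tokens query depth tokens so that structural cues reweight appearance features. The head count, the residual-plus-LayerNorm arrangement, and the choice of which modality supplies the queries are assumptions, not the paper's exact Depth-CAR design.

```python
import torch
import torch.nn as nn

class DepthGuidedCrossAttention(nn.Module):
    """Sketch: spatial RGB features attend to depth features."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, depth_tokens):
        # rgb_tokens: (B, N, D) appearance features; depth_tokens: (B, M, D).
        fused, _ = self.attn(query=rgb_tokens, key=depth_tokens, value=depth_tokens)
        return self.norm(rgb_tokens + fused)   # residual structure-aware refinement

# Smoke test with 14x14 token grids and a 64-dim feature space.
out = DepthGuidedCrossAttention(64)(torch.randn(2, 196, 64), torch.randn(2, 196, 64))
```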
zh
[CV-111] DeepTopoNet: A Framework for Subglacial Topography Estimation on the Greenland Ice Sheets
[Quick Read]: This paper addresses the uncertainty in model projections caused by sparse and discontinuous observations of the bed topography beneath the Greenland Ice Sheet, especially in regions with little information about the terrain under the ice. The key to the solution is DeepTopoNet, a deep learning framework that integrates radar-derived ice thickness observations with BedMachine Greenland data through a novel dynamic loss-balancing mechanism, adaptively adjusting the weighting between radar and BedMachine data so the model stays robust where radar coverage is limited while exploiting the high spatial resolution of the BedMachine bed estimates. The approach additionally incorporates gradient-based and trend-surface features and uses a CNN architecture designed for subgrid-scale prediction, achieving high accuracy and outperforming baselines when tested on the Upernavik Isstrøm region.
Link: https://arxiv.org/abs/2505.23980
Authors: Bayu Adhi Tama,Mansa Krishna,Homayra Alam,Mostafa Cham,Omar Faruque,Gong Cheng,Jianwu Wang,Mathieu Morlighem,Vandana Janeja
Affiliations: University of Maryland, Baltimore County (UMBC); Dartmouth College
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Notes: Submitted to SIGSPATIAL 2025
Abstract:Understanding Greenland's subglacial topography is critical for projecting the future mass loss of the ice sheet and its contribution to global sea-level rise. However, the complex and sparse nature of observational data, particularly information about the bed topography under the ice sheet, significantly increases the uncertainty in model projections. Bed topography is traditionally measured by airborne ice-penetrating radar that measures the ice thickness directly underneath the aircraft, leaving data gaps of tens of kilometers in between flight lines. This study introduces a deep learning framework, which we call DeepTopoNet, that integrates radar-derived ice thickness observations and BedMachine Greenland data through a novel dynamic loss-balancing mechanism. Among all efforts to reconstruct bed topography, BedMachine has emerged as one of the most widely used datasets, combining mass conservation principles and ice thickness measurements to generate high-resolution bed elevation estimates. The proposed loss function adaptively adjusts the weighting between radar and BedMachine data, ensuring robustness in areas with limited radar coverage while leveraging the high spatial resolution of BedMachine predictions, i.e., bed estimates. Our approach incorporates gradient-based and trend surface features to enhance model performance and utilizes a CNN architecture designed for subgrid-scale predictions. By systematically testing on the Upernavik Isstrøm region, the model achieves high accuracy, outperforming baseline methods in reconstructing subglacial terrain. This work demonstrates the potential of deep learning in bridging observational gaps, providing a scalable and efficient solution to inferring subglacial topography.
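The dynamic loss balancing can be caricatured as a weighted sum of a sparse radar term and a dense BedMachine term. The linear schedule below is an assumption purely for illustration; the paper's mechanism adapts the weights rather than following a fixed ramp.

```python
import torch
import torch.nn.functional as F

def balanced_topo_loss(pred, radar_target, radar_mask, bedmachine_target,
                       step, total_steps):
    """Sketch of a two-term, dynamically weighted bed-elevation objective.

    radar_mask: boolean tensor marking pixels covered by flight-line picks.
    """
    # Radar loss only where flight-line measurements exist.
    if radar_mask.any():
        l_radar = F.mse_loss(pred[radar_mask], radar_target[radar_mask])
    else:
        l_radar = pred.sum() * 0.0             # keep the graph when no coverage
    l_bed = F.mse_loss(pred, bedmachine_target)
    w = step / total_steps                     # shift trust toward radar over time
    return (1 - w) * l_bed + w * l_radar
```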
zh
[CV-112] VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL
【速读】:该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在多模态推理和逻辑连贯决策方面缺乏大规模且结构化训练数据的问题。其解决方案的关键在于提出VisualSphinx,这是首个大规模合成视觉逻辑推理训练数据集。为了解决生成带有定位答案图像的挑战,研究者设计了一种规则到图像的合成流程:从种子问题中提取并扩展谜题规则,并生成带定位答案的合成图像代码以组装谜题样本,从而提升了VLM在逻辑推理任务中的表现。
链接: https://arxiv.org/abs/2505.23977
作者: Yichen Feng,Zhangchen Xu,Fengqing Jiang,Yuetai Li,Bhaskar Ramasubramanian,Luyao Niu,Bill Yuchen Lin,Radha Poovendran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page at this https URL
Abstract:Vision language models (VLMs) are expected to perform effective multimodal reasoning and make logically coherent decisions, which is critical to tasks such as diagram understanding and spatial problem solving. However, current VLM reasoning lacks large-scale and well-structured training datasets. To bridge this gap, we propose VisualSphinx, a first-of-its-kind large-scale synthetic visual logical reasoning training dataset. To tackle the challenge of synthesizing images with grounded answers, we propose a rule-to-image synthesis pipeline, which extracts and expands puzzle rules from seed questions and generates code for synthesizing grounded images for puzzle sample assembly. Experiments demonstrate that VLMs trained using GRPO on VisualSphinx benefit from the logical coherence and readability of our dataset and exhibit improved performance on logical reasoning tasks. The enhanced reasoning capabilities developed from VisualSphinx also benefit other reasoning tasks such as algebraic reasoning, arithmetic reasoning and geometry reasoning.
zh
[CV-113] MangoLeafViT: Leveraging Lightweight Vision Transformer with Runtime Augmentation for Efficient Mango Leaf Disease Classification
【速读】:该论文旨在解决芒果叶片疾病分类中计算效率低且不适用于低端设备的问题(mango leaf disease classification with computational efficiency and low-end device compatibility)。其解决方案的关键在于提出了一种基于轻量级Vision Transformer(视觉变换器)的管道,该管道引入了自注意力机制,通过全局注意力捕获不同病害类型之间的复杂模式,并结合运行时增强技术以提升性能,从而在保持极低计算开销的同时实现了最先进的分类准确率。
链接: https://arxiv.org/abs/2505.23961
作者: Rafi Hassan Chowdhury,Sabbir Ahmed
机构: Islamic University of Technology(伊斯兰理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in 27th ICCIT 2024, 6 pages, 5 Figures, 4 tables
Abstract:Ensuring food safety is critical due to its profound impact on public health, economic stability, and global supply chains. Mango, a major agricultural product in several South Asian countries, incurs high financial losses from various diseases that affect the entire supply chain. While deep learning-based methods have been explored for mango leaf disease classification, there remains a gap in designing solutions that are computationally efficient and compatible with low-end devices. In this work, we propose a lightweight Vision Transformer-based pipeline with a self-attention mechanism to classify mango leaf diseases, achieving state-of-the-art performance with minimal computational overhead. Our approach leverages global attention to capture intricate patterns among disease types and incorporates runtime augmentation for enhanced performance. Evaluation on the MangoLeafBD dataset demonstrates a 99.43% accuracy, outperforming existing methods in terms of model size, parameter count, and FLOPs count.
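下面用 Python 给出运行时增强(test-time augmentation)的一个极简草图:对输入做若干廉价变换,分别推理后平均类别概率。具体的变换集合为笔者的假设,并非论文的确切流程。

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def predict_with_runtime_augmentation(model, image):
    """运行时增强的极简示意:对输入的若干廉价变换分别推理并平均类别概率。
    model 为任意分类器(输出 logits);image 为 (C, H, W) 张量;
    变换集合为假设,非论文的确切流程。"""
    views = [image, TF.hflip(image), TF.vflip(image), TF.rotate(image, 15)]
    probs = torch.stack([model(v.unsqueeze(0)).softmax(dim=-1) for v in views])
    return probs.mean(dim=0)  # 平均后的类别分布,(1, num_classes)
```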
zh
[CV-114] Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review
【速读】:该论文旨在解决文本到视频(Text-to-Video, T2V)检索中由于模态差异导致的语义鸿沟问题,即如何更准确地从视频库中找到与用户文本查询最相关的视频。其解决方案的关键在于引入并利用来自视频和文本模态的辅助信息,如视觉属性(对象)、时空上下文以及文本描述(如语音和重述的字幕),以提升检索性能并加强模态间的语义对齐。
链接: https://arxiv.org/abs/2505.23952
作者: Adriano Fragomeni,Dima Damen,Michael Wray
机构: University of Bristol(布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-Video (T2V) retrieval aims to identify the most relevant item from a gallery of videos based on a user’s text query. Traditional methods rely solely on aligning video and text modalities to compute the similarity and retrieve relevant items. However, recent advancements emphasise incorporating auxiliary information extracted from video and text modalities to improve retrieval performance and bridge the semantic gap between these modalities. Auxiliary information can include visual attributes, such as objects; temporal and spatial context; and textual descriptions, such as speech and rephrased captions. This survey comprehensively reviews 81 research papers on Text-to-Video retrieval that utilise such auxiliary information. It provides a detailed analysis of their methodologies; highlights state-of-the-art results on benchmark datasets; and discusses available datasets and their auxiliary information. Additionally, it proposes promising directions for future research, focusing on different ways to further enhance retrieval performance using this information.
zh
[CV-115] Vision Language Models are Biased
【速读】:该论文试图解决视觉语言模型(Vision Language Models, VLMs)在标准、客观的视觉任务(如计数和识别)中因过度依赖网络上广泛存在的先验知识而产生的偏差问题。解决方案的关键在于提出一种自动化框架,用于测试VLMs在不同领域中的偏见,并通过引入描述性文本(如“Adidas”)进一步验证这些偏见对模型性能的影响。该研究揭示了当前最先进的VLMs在面对需要精确视觉分析的任务时存在显著的偏差,且传统的纠正方法效果有限。
链接: https://arxiv.org/abs/2505.23941
作者: An Vo,Khai-Nguyen Nguyen,Mohammad Reza Taesiri,Vy Tuong Dang,Anh Totti Nguyen,Daeyoung Kim
机构: KAIST(韩国科学技术院); College of William and Mary(威廉与玛丽学院); University of Alberta(阿尔伯塔大学); Auburn University(奥本大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Code and qualitative examples are available at: this http URL
Abstract:Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but may also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a fourth stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains, from animals, logos, chess, and board games to optical illusions and patterned grids. Inserting text (e.g., “Adidas”) describing the subject name into the counterfactual image further decreases VLM accuracy. The biases in VLMs are so strong that instructing them to double-check their results or rely exclusively on image details to answer improves counting accuracy by only +2 points, on average. Our work presents an interesting failure mode in VLMs and an automated framework for testing VLM biases. Code and data are available at: this http URL.
zh
[CV-116] Point-MoE: Towards Cross-Domain Generalization in 3D Semantic Segmentation via Mixture-of-Experts
【速读】:该论文试图解决3D点云理解领域在大规模、跨域泛化方面的挑战,尤其是由于3D数据集规模较小以及数据来源多样导致的领域异质性问题。解决方案的关键在于提出Point-MoE,一种基于专家混合(Mixture-of-Experts)的架构,通过简单的top-k路由策略使模型自动专精于不同领域,而无需依赖领域标签。实验表明,Point-MoE在跨域任务中表现出优于现有方法的性能和泛化能力。
链接: https://arxiv.org/abs/2505.23926
作者: Xuweiyi Chen,Wentao Zhou,Aruni RoyChowdhury,Zezhou Cheng
机构: University of Virginia (弗吉尼亚大学); The MathWorks, Inc. (数学工作公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:While scaling laws have transformed natural language processing and computer vision, 3D point cloud understanding has yet to reach that stage. This can be attributed both to the comparatively smaller scale of 3D datasets and to the disparate sources of the data itself. Point clouds are captured by diverse sensors (e.g., depth cameras, LiDAR) across varied domains (e.g., indoor, outdoor), each introducing unique scanning patterns, sampling densities, and semantic biases. Such domain heterogeneity poses a major barrier towards training unified models at scale, especially under the realistic constraint that domain labels are typically inaccessible at inference time. In this work, we propose Point-MoE, a Mixture-of-Experts architecture designed to enable large-scale, cross-domain generalization in 3D perception. We show that standard point cloud backbones degrade significantly in performance when trained on mixed-domain data, whereas Point-MoE with a simple top-k routing strategy can automatically specialize experts, even without access to domain labels. Our experiments demonstrate that Point-MoE not only outperforms strong multi-domain baselines but also generalizes better to unseen domains. This work highlights a scalable path forward for 3D understanding: letting the model discover structure in diverse 3D data, rather than imposing it via manual curation or domain supervision.
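下面给出 top-k 路由 MoE 层的极简 Python 草图,对应摘要中“简单的 top-k 路由策略”;专家数、维度与 FFN 结构均为笔者的假设,仅示意路由机制本身。

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """top-k 路由 MoE 层的极简示意:router 为每个点选 k 个专家并加权求和。
    专家数、维度与 FFN 结构均为假设。"""
    def __init__(self, dim=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
             for _ in range(n_experts)]
        )
        self.k = k

    def forward(self, x):                      # x: (N, dim),N 为点数
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)      # (N, k) 路由权重
        out = torch.zeros_like(x)
        for slot in range(self.k):
            gate = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e       # 路由到专家 e 的点
                if mask.any():
                    out[mask] += gate[mask] * expert(x[mask])
        return out
```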
zh
[CV-117] Representational Difference Explanations
【速读】:该论文试图解决机器学习模型比较中缺乏有效且可解释工具的问题,当前的后验可解释AI(XAI)方法在支持模型对比方面存在局限。解决方案的关键在于提出一种名为表示差异解释(Representational Differences Explanations, RDX)的方法,该方法能够发现并可视化两个学习表示之间的差异,从而实现更直接和可解释的模型比较。
链接: https://arxiv.org/abs/2505.23917
作者: Neehar Kondapaneni,Oisin Mac Aodha,Pietro Perona
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 9 pages, 6 figures, 21 supplementary pages, 14 supp figs
Abstract:We propose a method for discovering and visualizing the differences between two learned representations, enabling more direct and interpretable model comparisons. We validate our method, which we call Representational Differences Explanations (RDX), by using it to compare models with known conceptual differences and demonstrate that it recovers meaningful distinctions where existing explainable AI (XAI) techniques fail. Applied to state-of-the-art models on challenging subsets of the ImageNet and iNaturalist datasets, RDX reveals both insightful representational differences and subtle patterns in the data. Although comparison is a cornerstone of scientific analysis, current tools in machine learning, namely post hoc XAI methods, struggle to support model comparison effectively. Our work addresses this gap by introducing an effective and explainable tool for contrasting model representations.
zh
[CV-118] Cora: Correspondence-aware image editing using few step diffusion SIGGRAPH2025
【速读】:该论文试图解决图像编辑中涉及显著结构变化的任务(如非刚性变形、对象修改或内容生成)所面临的挑战,这些问题在现有基于扩散的方法中仍难以实现高质量的编辑结果。其解决方案的关键在于引入了关联意识的噪声校正和插值注意力图,通过语义对应对齐源图像与目标图像的纹理和结构,从而实现准确的纹理迁移并在必要时生成新内容,同时提供对内容生成与保留之间平衡的控制。
链接: https://arxiv.org/abs/2505.23907
作者: Amirhossein Almohammadi,Aryan Mikaeili,Sauradip Nag,Negar Hassanpour,Andrea Tagliasacchi,Ali Mahdavi-Amiri
机构: Simon Fraser University (西蒙弗雷泽大学); Huawei (华为); Google Deepmind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in SIGGRAPH 2025
Abstract:Image editing is an important task in computer graphics, vision, and VFX, with recent diffusion-based methods achieving fast and high-quality results. However, edits requiring significant structural changes, such as non-rigid deformations, object modifications, or content generation, remain challenging. Existing few-step editing approaches produce artifacts such as irrelevant textures or struggle to preserve key attributes of the source image (e.g., pose). We introduce Cora, a novel editing framework that addresses these limitations by introducing correspondence-aware noise correction and interpolated attention maps. Our method aligns textures and structures between the source and target images through semantic correspondence, enabling accurate texture transfer while generating new content when necessary. Cora offers control over the balance between content generation and preservation. Extensive experiments demonstrate that, quantitatively and qualitatively, Cora excels in maintaining structure, textures, and identity across diverse edits, including pose changes, object addition, and texture refinements. User studies confirm that Cora delivers superior results, outperforming alternatives.
zh
[CV-119] Generating Fit Check Videos with a Handheld Camera
【速读】:该论文试图解决如何通过手持移动设备便捷地捕获全身视频的问题,传统方法通常需要固定摄像头、精心构图和反复练习。其解决方案的关键在于提出一种基于视频扩散模型的新方法,该方法利用用户在镜中拍摄的正面和背面静态照片以及手持手机时的IMU运动参考,合成出具有真实光照和阴影一致性的目标动作视频。关键技术创新包括无参数的帧生成策略和多参考注意力机制,以有效整合正反面自拍的外观信息,并引入基于图像的微调策略以提升帧清晰度和阴影、反射的生成效果。
链接: https://arxiv.org/abs/2505.23886
作者: Bowei Chen,Brian Curless,Ira Kemelmacher-Shlizerman,Steven M. Seitz
机构: University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Self-captured full-body videos are popular, but most deployments require mounted cameras, carefully-framed shots, and repeated practice. We propose a more convenient solution that enables full-body video capture using handheld mobile devices. Our approach takes as input two static photos (front and back) of you in a mirror, along with an IMU motion reference that you perform while holding your mobile phone, and synthesizes a realistic video of you performing a similar target motion. We enable rendering into a new scene, with consistent illumination and shadows. We propose a novel video diffusion-based model to achieve this. Specifically, we propose a parameter-free frame generation strategy, as well as a multi-reference attention mechanism, that effectively integrate appearance information from both the front and back selfies into the video diffusion model. Additionally, we introduce an image-based fine-tuning strategy to enhance frame sharpness and improve the generation of shadows and reflections, achieving a more realistic human-scene composition.
zh
[CV-120] Beyond Pretty Pictures: Combined Single- and Multi-Image Super-resolution for Sentinel-2 Images
【速读】:该论文旨在解决卫星图像分辨率不足的问题,特别是针对Sentinel-2卫星数据在地表特征(如房屋、街道或树篱)识别上的局限性。其关键解决方案是提出SEN4X,一种结合单图像和多图像超分辨率技术的混合架构,通过利用重复的Sentinel-2数据进行时间超采样,并融合高分辨率Pléiades Neo数据的学习先验信息,从而将Sentinel-2影像提升至2.5米地面采样距离。
链接: https://arxiv.org/abs/2505.24799
作者: Aditya Retnanto(1),Son Le(1),Sebastian Mueller(1),Armin Leitner(2),Konrad Schindler(3),Yohan Iddawela(1),Michael Riffler(2) ((1) Asian Development Bank, Philippines, (2) GeoVille Information Systems and Data Processing GmbH, Austria, (3) ETH Zürich, Switzerland)
机构: Asian Development Bank (亚洲开发银行); GeoVille Information Systems and Data Processing GmbH (GeoVille信息系统与数据处理有限公司); ETH Zürich (苏黎世联邦理工学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Super-resolution aims to increase the resolution of satellite images by reconstructing high-frequency details, which go beyond naïve upsampling. This has particular relevance for Earth observation missions like Sentinel-2, which offer frequent, regular coverage at no cost; but at coarse resolution. Its pixel footprint is too large to capture small features like houses, streets, or hedge rows. To address this, we present SEN4X, a hybrid super-resolution architecture that combines the advantages of single-image and multi-image techniques. It combines temporal oversampling from repeated Sentinel-2 acquisitions with a learned prior from high-resolution Pléiades Neo data. In doing so, SEN4X upgrades Sentinel-2 imagery to 2.5 m ground sampling distance. We test the super-resolved images on urban land-cover classification in Hanoi, Vietnam. We find that they lead to a significant performance improvement over state-of-the-art super-resolution baselines.
zh
[CV-121] Efficient Estimation of Regularized Tyler's M-Estimator Using Approximate LOOCV
【速读】:该论文试图解决在正则化泰勒M估计器(Regularized Tyler's M-estimator, RTME)中估计正则化参数(或收缩系数)α∈(0,1) 的问题。其解决方案的关键在于将 α 设定为一个适当选择的目标函数的解,即留一法交叉验证(leave-one-out cross-validated, LOOCV)对数似然损失函数。由于LOOCV在中等样本量 n 下计算上不可行,作者提出了一种计算高效的近似方法,以消除在LOOCV过程中每次剔除一个样本时调用RTME过程的需要,从而将LOOCV过程的时间复杂度从 O(n²) 降低至 O(n),显著提升了计算效率。
链接: https://arxiv.org/abs/2505.24781
作者: Karim Abou-Moustafa
机构: Intel’s Foundry Technology Development (TD) Division, Intel Corp. (英特尔公司芯片制造技术开发(TD)部门)
类目: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: An extended version of a short article that appeared in 2023 IEEE Workshop on Information Theory, Saint-Malo, France
Abstract:We consider the problem of estimating a regularization parameter, or a shrinkage coefficient \alpha \in (0,1), for the Regularized Tyler’s M-estimator (RTME). In particular, we propose to estimate an optimal shrinkage coefficient by setting \alpha as the solution to a suitably chosen objective function, namely the leave-one-out cross-validated (LOOCV) log-likelihood loss. Since LOOCV is computationally prohibitive even for moderate sample size n, we propose a computationally efficient approximation for the LOOCV log-likelihood loss that eliminates the need for invoking the RTME procedure n times, once for each sample left out during the LOOCV procedure. This approximation yields an O(n) reduction in the running time complexity for the LOOCV procedure, which results in a significant speedup for computing the LOOCV estimate. We demonstrate the efficiency and accuracy of the proposed approach on synthetic high-dimensional data sampled from heavy-tailed elliptical distributions, as well as on real high-dimensional datasets for object recognition, face recognition, and handwritten digit recognition. Our experiments show that the proposed approach is efficient and consistently more accurate than other methods in the literature for shrinkage coefficient estimation.
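下面给出正则化 Tyler M 估计器的一个标准不动点迭代的 Python 草图(含收缩系数 α);这是文献中常见的 RTME 形式,论文的具体定义可能略有出入。论文的贡献在于如何用近似 LOOCV 高效选取 α,此处仅示意内层估计器本身。

```python
import numpy as np

def regularized_tyler(X, alpha, n_iter=100, tol=1e-6):
    """正则化 Tyler M 估计器的标准不动点迭代(文献常见形式,仅作示意)。
    X: (n, p) 样本矩阵;alpha ∈ (0, 1) 为收缩系数。"""
    n, p = X.shape
    Sigma = np.eye(p)
    for _ in range(n_iter):
        inv = np.linalg.inv(Sigma)
        d = np.einsum("ij,jk,ik->i", X, inv, X)    # x_i^T Sigma^{-1} x_i
        d = np.maximum(d, 1e-12)
        S = (p / n) * (X.T * (1.0 / d)) @ X        # 加权散布矩阵
        Sigma_new = (1 - alpha) * S + alpha * np.eye(p)
        Sigma_new /= np.trace(Sigma_new) / p       # 迹归一化,固定尺度
        if np.linalg.norm(Sigma_new - Sigma, "fro") < tol:
            return Sigma_new
        Sigma = Sigma_new
    return Sigma
```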
zh
[CV-122] Contrast-Invariant Self-supervised Segmentation for Quantitative Placental MRI
【速读】:该论文旨在解决多回波T2*-加权胎盘MRI中胎盘分割的挑战性问题,主要难点包括个体回波间边界对比度弱且不一致、缺乏所有回波时间的手动标注真值以及胎儿和母体运动引起的运动伪影。其解决方案的关键在于提出一种增强对比度的分割框架,通过利用多回波T2*-加权MRI中的互补信息来学习鲁棒且对比度不变的表示。该方法整合了掩码自编码(MAE)用于无监督预训练、掩码伪标签(MPL)用于跨回波时间的无监督域适应,以及全局-局部协作以对齐细粒度特征与全局解剖上下文,并引入语义匹配损失以增强同一受试者不同回波间的表示一致性。
链接: https://arxiv.org/abs/2505.24739
作者: Xinliu Zhong,Ruiying Liu,Emily S. Nichols,Xuzhe Zhang,Andrew F. Laine,Emma G. Duerden,Yun Wang
机构: Emory University (埃默里大学); Western University (西方大学); Columbia University (哥伦比亚大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 20 figures
Abstract:Accurate placental segmentation is essential for quantitative analysis of the placenta. However, this task is particularly challenging in T2*-weighted placental imaging due to: (1) weak and inconsistent boundary contrast across individual echoes; (2) the absence of manual ground truth annotations for all echo times; and (3) motion artifacts across echoes caused by fetal and maternal movement. In this work, we propose a contrast-augmented segmentation framework that leverages complementary information across multi-echo T2*-weighted MRI to learn robust, contrast-invariant representations. Our method integrates: (i) masked autoencoding (MAE) for self-supervised pretraining on unlabeled multi-echo slices; (ii) masked pseudo-labeling (MPL) for unsupervised domain adaptation across echo times; and (iii) global-local collaboration to align fine-grained features with global anatomical context. We further introduce a semantic matching loss to encourage representation consistency across echoes of the same subject. Experiments on a clinical multi-echo placental MRI dataset demonstrate that our approach generalizes effectively across echo times and outperforms both single-echo and naive fusion baselines. To our knowledge, this is the first work to systematically exploit multi-echo T2*-weighted MRI for placental segmentation.
zh
[CV-123] TumorGen: Boundary-Aware Tumor-Mask Synthesis with Rectified Flow Matching
【速读】:该论文旨在解决医学影像中肿瘤数据合成的挑战,特别是现有方法在肿瘤多样性、计算效率及肿瘤边界渐变特性捕捉方面的不足。其关键解决方案是提出TumorGen,该方法通过三个核心组件实现高效的3D肿瘤合成:基于边界感知的伪掩码生成模块,利用灵活的边界框替代严格的二值掩码;空间约束向量场估计器,通过修正流匹配技术同时合成肿瘤潜在表示和掩码以提高计算效率;以及VAE引导的掩码优化器,增强边界的真实性。TumorGen在保持病理准确性的同时,显著提升了计算效率。
链接: https://arxiv.org/abs/2505.24687
作者: Shengyuan Liu,Wenting Chen,Boyun Zheng,Wentao Pan,Xiang Li,Yixuan Yuan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures
Abstract:Tumor data synthesis offers a promising solution to the shortage of annotated medical datasets. However, current approaches either limit tumor diversity by using predefined masks or employ computationally expensive two-stage processes with multiple denoising steps, causing computational inefficiency. Additionally, these methods typically rely on binary masks that fail to capture the gradual transitions characteristic of tumor boundaries. We present TumorGen, a novel Boundary-Aware Tumor-Mask Synthesis with Rectified Flow Matching for efficient 3D tumor synthesis with three key components: a Boundary-Aware Pseudo Mask Generation module that replaces strict binary masks with flexible bounding boxes; a Spatial-Constraint Vector Field Estimator that simultaneously synthesizes tumor latents and masks using rectified flow matching to ensure computational efficiency; and a VAE-guided mask refiner that enhances boundary realism. TumorGen significantly improves computational efficiency by requiring fewer sampling steps while maintaining pathological accuracy through coarse and fine-grained spatial constraints. Experimental results demonstrate TumorGen’s superior performance over existing tumor synthesis methods in both efficiency and realism, offering a valuable contribution to AI-driven cancer diagnostics.
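下面给出修正流匹配(rectified flow matching)训练目标的极简 Python 草图:模型学习沿直线插值路径预测恒定速度场。模型接口为笔者的假设;论文中还需叠加空间约束与掩码分支,此处仅示意核心目标函数。

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0, x1):
    """修正流匹配目标的极简示意:沿 x0→x1 的直线插值路径,
    训练模型预测恒定速度场 (x1 - x0)。x0 为噪声,x1 为目标(此处可为
    肿瘤潜变量与掩码拼接后的表示);model 的接口为假设。"""
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    xt = (1 - t) * x0 + t * x1        # 直线插值
    v_target = x1 - x0                # 恒定速度场
    v_pred = model(xt, t.flatten())
    return F.mse_loss(v_pred, v_target)
```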
zh
[CV-124] Model-Guided Network with Cluster-Based Operators for Spatio-Spectral Super-Resolution
【速读】:该论文试图解决从低分辨率多光谱观测中重建高分辨率高光谱图像的问题(high-resolution hyperspectral image reconstruction from low-resolution multispectral observation),特别是针对联合空谱超分辨率(joint spatio-spectral super-resolution)这一相对较少被探索的领域。其解决方案的关键在于提出了一种端到端的模型驱动框架,该框架将联合空谱超分辨率问题显式分解为空间超分辨率、光谱超分辨率和融合任务,并通过展开基于变分的方法来处理每个子任务,其中近似梯度迭代方案中的算子被定制化的可学习模块所替代。
链接: https://arxiv.org/abs/2505.24605
作者: Ivan Pereira-Sánchez,Julia Navarro,Ana Belén Petro,Joan Duran
机构: Universitat de les Illes Balears(巴利阿里群岛大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper addresses the problem of reconstructing a high-resolution hyperspectral image from a low-resolution multispectral observation. While spatial super-resolution and spectral super-resolution have been extensively studied, joint spatio-spectral super-resolution remains relatively underexplored. We propose an end-to-end model-driven framework that explicitly decomposes the joint spatio-spectral super-resolution problem into spatial super-resolution, spectral super-resolution and fusion tasks. Each sub-task is addressed by unfolding a variational-based approach, where the operators involved in the proximal gradient iterative scheme are replaced with tailored learnable modules. In particular, we design an upsampling operator for spatial super-resolution based on classical back-projection algorithms, adapted to handle arbitrary scaling factors. Spectral reconstruction is performed using learnable cluster-based upsampling and downsampling operators. For image fusion, we integrate low-frequency estimation and high-frequency injection modules to combine the spatial and spectral information from spatial super-resolution and spectral super-resolution outputs. Additionally, we introduce an efficient nonlocal post-processing step that leverages image self-similarity by combining a multi-head attention mechanism with residual connections. Extensive evaluations on several datasets and sampling factors demonstrate the effectiveness of our approach. The source code will be available at this https URL
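下面给出“展开一步近似梯度迭代、并用可学习模块替换近端算子”的极简 Python 草图,对应摘要中的变分展开思路;其中 A、At 分别为前向算子与伴随算子的回调,网络结构为笔者的示意性假设,论文中的定制模块(聚类上/下采样等)更为复杂。

```python
import torch
import torch.nn as nn

class UnrolledProxStage(nn.Module):
    """展开一步近似梯度迭代的示意:先对数据项 ||A(x) - y||^2 做梯度下降,
    再用一个小型可学习网络替代近端算子。A/At 为前向与伴随算子的回调;
    网络结构为假设,论文中的定制模块更复杂。"""
    def __init__(self, channels):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.1))   # 可学习步长
        self.prox = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, x, y, A, At):
        grad = At(A(x) - y)                 # 数据保真项梯度
        return self.prox(x - self.step * grad)
```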
zh
[CV-125] Digital twins enable full-reference quality assessment of photoacoustic image reconstructions
【速读】:该论文试图解决光声图像重建算法质量定量比较的问题,该问题在缺乏理想参考图像的情况下难以通过无参考图像质量评估方法有效解决,而全参考评估方法又依赖于已知的地面真实数据,这在体内或体模研究中通常不可获得。论文的关键解决方案是利用组织模拟体的数值数字孪生(digital twin)和成像系统进行定量校准,以缩小仿真与实际之间的差距。通过这一框架,作者首次在实验数据上测试了一种基于傅里叶变换的圆周检测几何重建算法,并实现了对数值正向模型准确性的评估以及重建方案的全参考图像质量比较。
链接: https://arxiv.org/abs/2505.24514
作者: Janek Gröhl,Leonid Kunyansky,Jenni Poimala,Thomas R. Else,Francesca Di Cecio,Sarah E. Bohndiek,Ben T. Cox,Andreas Hauptmann
机构: University of Cambridge(剑桥大学); Cancer Research UK Cambridge Institute(英国癌症研究基金会剑桥研究所); ENI-G(ENI-G); University Medical Center Göttingen(哥廷根大学医学中心); Max Planck Institute for Multidisciplinary Sciences(马克斯·普朗克多学科科学研究所); University of Arizona(亚利桑那大学); University of Eastern Finland(东芬兰大学); Imperial College London(帝国理工学院); University of Oulu(奥卢大学); University College London(伦敦大学学院)
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
Abstract:Quantitative comparison of the quality of photoacoustic image reconstruction algorithms remains a major challenge. No-reference image quality measures are often inadequate, but full-reference measures require access to an ideal reference image. While the ground truth is known in simulations, it is unknown in vivo, or in phantom studies, as the reference depends on both the phantom properties and the imaging system. We tackle this problem by using numerical digital twins of tissue-mimicking phantoms and the imaging system to perform a quantitative calibration to reduce the simulation gap. The contributions of this paper are two-fold: First, we use this digital-twin framework to compare multiple state-of-the-art reconstruction algorithms. Second, among these is a Fourier transform-based reconstruction algorithm for circular detection geometries, which we test on experimental data for the first time. Our results demonstrate the usefulness of digital phantom twins by enabling assessment of the accuracy of the numerical forward model and enabling comparison of image reconstruction schemes with full-reference image quality assessment. We show that the Fourier transform-based algorithm yields results comparable to those of iterative time reversal, but at a lower computational cost. All data and code are publicly available on Zenodo: this https URL.
zh
[CV-126] pyMEAL: A Multi-Encoder Augmentation-Aware Learning for Robust and Generalizable Medical Image Translation
【速读】:该论文旨在解决医学影像中由于患者个体差异、图像伪影以及模型泛化能力不足导致的先进AI驱动影像技术临床应用受限的问题。其关键解决方案是提出一种多编码器增强感知学习(MEAL)框架,该框架通过四个不同的增强变体并行处理,并结合三种融合策略(包括拼接、融合层和自适应控制器块)来构建多编码器模型,从而在解码前整合增强特定特征,其中MEAL-BD方法通过保留增强感知表示,实现了鲁棒且协议无关的特征学习。
链接: https://arxiv.org/abs/2505.24421
作者: Abdul-mojeed Olabisi Ilyas,Adeleke Maradesa,Jamal Banzi,Jianpan Huang,Henry K.F. Mak,Kannie W.Y. Chan
机构: Hong Kong Centre for Cerebro-Cardiovascular Health Engineering (COCHE), Hong Kong, China; Sokoine University of Agriculture (索科奈大学); The University of Hong Kong (香港大学); State Key Laboratory of Brain and Cognitive Sciences (脑与认知科学国家重点实验室); Alzheimer’s Disease Research Network (阿尔茨海默病研究网络); Department of Biomedical Engineering, City University of Hong Kong (城市大学生物医学工程系), Hong Kong, China; Russell H. Morgan Department of Radiology and Radiological Science, The Johns Hopkins University School of Medicine (约翰霍普金斯大学医学院放射学与放射科学系), Baltimore, MD, USA; City University of Hong Kong Shenzhen Research Institute (城市大学深圳研究院), Shenzhen, China
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 36 pages, 9 figures, 2 tables
Abstract:Medical imaging is critical for diagnostics, but clinical adoption of advanced AI-driven imaging faces challenges due to patient variability, image artifacts, and limited model generalization. While deep learning has transformed image analysis, 3D medical imaging still suffers from data scarcity and inconsistencies due to acquisition protocols, scanner differences, and patient motion. Traditional augmentation uses a single pipeline for all transformations, disregarding the unique traits of each augmentation and struggling with large data volumes. To address these challenges, we propose a Multi-encoder Augmentation-Aware Learning (MEAL) framework that leverages four distinct augmentation variants processed through dedicated encoders. Three fusion strategies such as concatenation (CC), fusion layer (FL), and adaptive controller block (BD) are integrated to build multi-encoder models that combine augmentation-specific features before decoding. MEAL-BD uniquely preserves augmentation-aware representations, enabling robust, protocol-invariant feature learning. As demonstrated in a Computed Tomography (CT)-to-T1-weighted Magnetic Resonance Imaging (MRI) translation study, MEAL-BD consistently achieved the best performance on both unseen- and predefined-test data. On both geometric transformations (like rotations and flips) and non-augmented inputs, MEAL-BD outperformed other competing methods, achieving higher mean peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) scores. These results establish MEAL as a reliable framework for preserving structural fidelity and generalizing across clinically relevant variability. By reframing augmentation as a source of diverse, generalizable features, MEAL supports robust, protocol-invariant learning, advancing clinically reliable medical imaging solutions.
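下面给出多编码器 + 拼接融合(对应文中 CC 策略)的极简 Python 草图:每个增强变体由独立编码器处理,特征在解码前合并。编码器主体与通道数均为笔者的假设,仅示意整体结构。

```python
import torch
import torch.nn as nn

class MultiEncoderFusion(nn.Module):
    """MEAL 思想的极简示意(对应拼接 CC 融合):每个增强变体经独立编码器,
    特征在解码前合并。encoder_fn 返回一个编码器模块;通道数为假设。"""
    def __init__(self, encoder_fn, n_variants=4, feat_ch=64):
        super().__init__()
        self.encoders = nn.ModuleList([encoder_fn() for _ in range(n_variants)])
        self.fuse = nn.Conv3d(feat_ch * n_variants, feat_ch, kernel_size=1)

    def forward(self, variants):
        # variants: n_variants 个 (B, C, D, H, W) 的增强后体数据
        feats = [enc(v) for enc, v in zip(self.encoders, variants)]
        return self.fuse(torch.cat(feats, dim=1))   # 解码器接收融合后的特征
```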
zh
[CV-127] Efficient RAW Image Deblurring with Adaptive Frequency Modulation NEURIPS2025
【速读】:该论文旨在解决RAW图像去模糊(RAW image deblurring)中的挑战,尤其是在处理频率依赖性模糊的同时保持计算效率。现有方法大多专注于sRGB图像,而RAW图像由于未经过处理且具有线性特性,具备更高的恢复潜力但研究较少。论文提出的解决方案是Frequency Enhanced Network (FrENet),其关键在于在频域中直接进行RAW到RAW的去模糊,引入了自适应频率位置调制模块,动态调整频率成分以实现精确控制,并采用频域跳跃连接以保留高频细节,从而在提升去模糊质量的同时保持高效性。
链接: https://arxiv.org/abs/2505.24407
作者: Wenlong Jiao,Binglong Li,Wei Shang,Ping Wang,Dongwei Ren
机构: Tianjin University (天津大学); Harbin Institute of Technology (哈尔滨工业大学); Tianjin University (天津大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Submitted to NeurIPS 2025
Abstract:Image deblurring plays a crucial role in enhancing visual clarity across various applications. Although most deep learning approaches primarily focus on sRGB images, which inherently lose critical information during the image signal processing pipeline, RAW images, being unprocessed and linear, possess superior restoration potential but remain underexplored. Deblurring RAW images presents unique challenges, particularly in handling frequency-dependent blur while maintaining computational efficiency. To address these issues, we propose Frequency Enhanced Network (FrENet), a framework specifically designed for RAW-to-RAW deblurring that operates directly in the frequency domain. We introduce a novel Adaptive Frequency Positional Modulation module, which dynamically adjusts frequency components according to their spectral positions, thereby enabling precise control over the deblurring process. Additionally, frequency domain skip connections are adopted to further preserve high-frequency details. Experimental results demonstrate that FrENet surpasses state-of-the-art deblurring methods in RAW image deblurring, achieving significantly better restoration quality while maintaining high efficiency in terms of reduced MACs. Furthermore, FrENet’s adaptability enables it to be extended to sRGB images, where it delivers comparable or superior performance compared to methods specifically designed for sRGB data. The code will be available at this https URL .
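下面给出“按频谱位置自适应调制频率成分”这一思想的极简 Python 草图:特征经 FFT 变换后按位置逐点缩放再逆变换。FrENet 的真实模块含动态调制与频域跳跃连接,此处仅作示意;初始化需已知特征图尺寸。

```python
import torch
import torch.nn as nn

class FrequencyPositionalModulation(nn.Module):
    """“按频谱位置自适应调制”的极简示意:特征做 2D FFT,
    按(通道, 频谱位置)逐点乘以可学习复增益后逆变换。
    FrENet 的真实模块更复杂,此处仅示意;初始化需已知特征图尺寸 (h, w)。"""
    def __init__(self, channels, h, w):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(channels, h, w // 2 + 1, dtype=torch.cfloat))

    def forward(self, x):                        # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = spec * self.gain                  # 位置相关的频谱缩放
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
```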
zh
[CV-128] A Novel Coronary Artery Registration Method Based on Super-pixel Particle Swarm Optimization
【速读】:该论文旨在解决经皮冠状动脉介入治疗(Percutaneous Coronary Intervention, PCI)中多模态图像配准的问题,即如何准确地将实时二维X射线血管造影(XRA)与三维计算机断层扫描血管造影(CTA)图像进行对齐,以提高手术的精准性和安全性。解决方案的关键在于提出一种基于群体优化算法的新型多模态冠状动脉图像配准方法,该方法通过分离预处理XRA和CTA图像,并利用Steger和超像素粒子群优化算法进行特征提取,从而有效应对大形变、低对比度和噪声等挑战。
链接: https://arxiv.org/abs/2505.24351
作者: Peng Qi,Wenxi Qu,Tianliang Yao,Haonan Ma,Dylan Wintle,Yinyi Lai,Giorgos Papanastasiou,Chengjia Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Percutaneous Coronary Intervention (PCI) is a minimally invasive procedure that improves coronary blood flow and treats coronary artery disease. Although PCI typically requires 2D X-ray angiography (XRA) to guide catheter placement in real time, computed tomography angiography (CTA) may substantially improve PCI by providing precise information of 3D vascular anatomy and status. To leverage real-time XRA and detailed 3D CTA anatomy for PCI, accurate multimodal image registration of XRA and CTA is required, to guide the procedure and avoid complications. This is a challenging process as it requires registration of images from different geometrical modalities (2D - 3D and vice versa), with variations in contrast and noise levels. In this paper, we propose a novel multimodal coronary artery image registration method based on a swarm optimization algorithm, which effectively addresses challenges such as large deformations, low contrast, and noise across these imaging modalities. Our algorithm consists of two main modules: 1) preprocessing of XRA and CTA images separately, and 2) a registration module based on feature extraction using the Steger and Superpixel Particle Swarm Optimization algorithms. Our technique was evaluated on a pilot dataset of 28 pairs of XRA and CTA images from 10 patients who underwent PCI. The algorithm was compared with four state-of-the-art (SOTA) methods in terms of registration accuracy, robustness, and efficiency. Our method outperformed the selected SOTA baselines in all aspects. Experimental results demonstrate the significant effectiveness of our algorithm, surpassing previous benchmarks and proposing a novel clinical approach that can potentially improve patient outcomes in coronary artery disease.
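下面给出一个通用粒子群优化(PSO)的 Python 草图,用以示意论文中以群体优化搜索配准变换参数的思路;目标函数(如投影后 CTA 中心线与 XRA 血管特征的距离)留作回调,超参数均为常见默认值而非论文设定。

```python
import numpy as np

def pso(objective, dim, n_particles=30, n_iter=100, bounds=(-1.0, 1.0),
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """通用粒子群优化的示意实现:objective 将配准变换参数向量
    (如旋转/平移)映射为代价。超参数为常见默认值,非论文设定。"""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))      # 粒子位置
    v = np.zeros_like(x)                             # 粒子速度
    pbest = x.copy()
    pbest_f = np.array([objective(p) for p in x])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([objective(p) for p in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest
```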
zh
[CV-129] Beyond the LUMIR challenge: The pathway to foundational registration models
【速读】:该论文试图解决无监督脑部磁共振成像(MRI)图像配准的问题,旨在推动在无需解剖标签图监督的情况下,实现更准确和生物合理的形变建模。解决方案的关键在于引入了大规模无监督脑部MRI图像配准(LUMIR)挑战,通过提供超过4,000个预处理的T1加权脑部MRI数据进行训练,且不依赖任何标签图,从而鼓励基于自监督的学习方法。此外,LUMIR还设计了严格的零样本泛化任务,以评估模型在不同成像模态、疾病群体、采集协议和物种之间的适应能力。
链接: https://arxiv.org/abs/2505.24160
作者: Junyu Chen,Shuwen Wei,Joel Honkamaa,Pekka Marttinen,Hang Zhang,Min Liu,Yichao Zhou,Zuopeng Tan,Zhuoyuan Wang,Yi Wang,Hongchao Zhou,Shunbo Hu,Yi Zhang,Qian Tao,Lukas Förner,Thomas Wendler,Bailiang Jian,Benedikt Wiestler,Tim Hable,Jin Kim,Dan Ruan,Frederic Madesta,Thilo Sentker,Wiebke Heyer,Lianrui Zuo,Yuwei Dai,Jing Wu,Jerry L. Prince,Harrison Bai,Yong Du,Yihao Liu,Alessa Hering,Reuben Dorent,Lasse Hansen,Mattias P. Heinrich,Aaron Carass
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical image challenges have played a transformative role in advancing the field, catalyzing algorithmic innovation and establishing new performance standards across diverse clinical applications. Image registration, a foundational task in neuroimaging pipelines, has similarly benefited from the Learn2Reg initiative. Building on this foundation, we introduce the Large-scale Unsupervised Brain MRI Image Registration (LUMIR) challenge, a next-generation benchmark designed to assess and advance unsupervised brain MRI registration. Distinct from prior challenges that leveraged anatomical label maps for supervision, LUMIR removes this dependency by providing over 4,000 preprocessed T1-weighted brain MRIs for training without any label maps, encouraging biologically plausible deformation modeling through self-supervision. In addition to evaluating performance on 590 held-out test subjects, LUMIR introduces a rigorous suite of zero-shot generalization tasks, spanning out-of-domain imaging modalities (e.g., FLAIR, T2-weighted, T2*-weighted), disease populations (e.g., Alzheimer’s disease), acquisition protocols (e.g., 9.4T MRI), and species (e.g., macaque brains). A total of 1,158 subjects and over 4,000 image pairs were included for evaluation. Performance was assessed using both segmentation-based metrics (Dice coefficient, 95th percentile Hausdorff distance) and landmark-based registration accuracy (target registration error). Across both in-domain and zero-shot tasks, deep learning-based methods consistently achieved state-of-the-art accuracy while producing anatomically plausible deformation fields. The top-performing deep learning-based models demonstrated diffeomorphic properties and inverse consistency, outperforming several leading optimization-based methods, and showing strong robustness to most domain shifts, the exception being a drop in performance on out-of-domain contrasts.
zh
[CV-130] Sparsity-Driven Parallel Imaging Consistency for Improved Self-Supervised MRI Reconstruction ICIP
【速读】:该论文试图解决在缺乏完全采样参考数据的情况下,物理驱动的深度学习(Physics-driven Deep Learning, PD-DL)模型在高加速率下进行快速磁共振成像(MRI)重建时出现的伪影问题。其解决方案的关键在于通过设计特定的扰动来训练PD-DL网络,引入一种新的一致性项,以评估模型在稀疏域中准确预测添加扰动的能力,从而实现更可靠且无伪影的重建结果。
链接: https://arxiv.org/abs/2505.24136
作者: Yaşar Utku Alçalar,Mehmet Akçakaya
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
备注: IEEE International Conference on Image Processing (ICIP), 2025
Abstract:Physics-driven deep learning (PD-DL) models have proven to be a powerful approach for improved reconstruction of rapid MRI scans. In order to train these models in scenarios where fully-sampled reference data is unavailable, self-supervised learning has gained prominence. However, its application at high acceleration rates frequently introduces artifacts, compromising image fidelity. To mitigate this shortcoming, we propose a novel way to train PD-DL networks via carefully-designed perturbations. In particular, we enhance the k-space masking idea of conventional self-supervised learning with a novel consistency term that assesses the model’s ability to accurately predict the added perturbations in a sparse domain, leading to more reliable and artifact-free reconstructions. The results obtained from the fastMRI knee and brain datasets show that the proposed training strategy effectively reduces aliasing artifacts and mitigates noise amplification at high acceleration rates, outperforming state-of-the-art self-supervised methods both visually and quantitatively.
zh
[CV-131] A Mathematical Perspective On Contrastive Learning
【速读】:该论文旨在解决多模态对比学习中如何有效对齐不同数据模态的问题,其核心是通过参数化编码器优化条件概率分布,使不同模态在共享潜在空间中保持一致性。解决方案的关键在于将潜在空间的识别视为低秩矩阵逼近问题,并基于概率视角引入新的损失函数和对齐度量,从而提升模型在跨模态检索、分类及生成任务中的性能。
链接: https://arxiv.org/abs/2505.24134
作者: Ricardo Baptista,Andrew M. Stuart,Son Tran
机构: Amazon(亚马逊); California Institute of Technology (加州理工学院)
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 44 pages, 15 figures
Abstract:Multimodal contrastive learning is a methodology for linking different data modalities; the canonical example is linking image and text data. The methodology is typically framed as the identification of a set of encoders, one for each modality, that align representations within a common latent space. In this work, we focus on the bimodal setting and interpret contrastive learning as the optimization of (parameterized) encoders that define conditional probability distributions, for each modality conditioned on the other, consistent with the available data. This provides a framework for multimodal algorithms such as crossmodal retrieval, which identifies the mode of one of these conditional distributions, and crossmodal classification, which is similar to retrieval but includes a fine-tuning step to make it task specific. The framework we adopt also gives rise to crossmodal generative models. This probabilistic perspective suggests two natural generalizations of contrastive learning: the introduction of novel probabilistic loss functions, and the use of alternative metrics for measuring alignment in the common latent space. We study these generalizations of the classical approach in the multivariate Gaussian setting. In this context we view the latent space identification as a low-rank matrix approximation problem. This allows us to characterize the capabilities of loss functions and alignment metrics to approximate natural statistics, such as conditional means and covariances; doing so yields novel variants on contrastive learning algorithms for specific mode-seeking and for generative tasks. The framework we introduce is also studied through numerical experiments on multivariate Gaussians, the labeled MNIST dataset, and on a data assimilation application arising in oceanography.
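下面给出标准双模态对比损失(InfoNCE 形式)的 Python 草图:相似度矩阵的按行 softmax 正好可以读作论文所强调的条件分布 p(y|x),按列则是 p(x|y)。这是经典形式,论文在此基础上推广了损失函数与对齐度量,此处仅示意出发点。

```python
import torch
import torch.nn.functional as F

def bimodal_contrastive_loss(zx, zy, tau=0.07):
    """标准双模态对比损失(InfoNCE 形式):相似度矩阵按行 softmax
    可读作批内条件分布 p(y|x),按列则为 p(x|y),对应论文的概率视角。
    zx, zy: (B, d) 成对嵌入。"""
    zx, zy = F.normalize(zx, dim=-1), F.normalize(zy, dim=-1)
    logits = zx @ zy.t() / tau
    targets = torch.arange(zx.shape[0], device=zx.device)
    # 对称交叉熵:两个模态互为条件
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```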
zh
[CV-132] Exploring Domain Wall Pinning in Ferroelectrics via Automated High Throughput AFM
【速读】:该论文旨在解决铁电材料中畴壁动力学的局域依赖性问题,即畴壁的钉扎行为与局部微结构密切相关,需进行空间分辨研究。其解决方案的关键在于利用机器学习控制的自动化压电响应力显微镜(Piezoresponse Force Microscopy),对大面积外延PbTiO₃薄膜中的极化应变畴结构进行电场驱动的动力学量化分析,从而建立畴配置与脉冲参数之间的预测映射关系。
链接: https://arxiv.org/abs/2505.24062
作者: Kamyar Barakati,Yu Liu,Hiroshi Funakubo,Sergei V. Kalinin
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
备注: 17 pages, 6 figures
Abstract:Domain-wall dynamics in ferroelectric materials are strongly position-dependent since each polar interface is locked into a unique local microstructure. This necessitates spatially resolved studies of wall pinning using scanning-probe microscopy techniques. The pinning centers and preexisting domain walls are usually sparse within the image plane, precluding the use of dense hyperspectral imaging modes and requiring time-consuming human experimentation. Here, a large-area epitaxial PbTiO_3 film on cubic KTaO_3 was investigated to quantify the electric-field-driven dynamics of the polar-strain domain structures using ML-controlled automated Piezoresponse Force Microscopy. Analysis of 1500 switching events reveals that domain wall displacement depends not only on field parameters but also on the local ferroelectric-ferroelastic configuration. For example, twin boundaries in polydomain regions like a_1^-/c^+ \parallel a_2^-/c^- stay pinned up to a certain level of bias magnitude and change only marginally as the bias increases from 20V to 30V, whereas single-variant boundaries like the a_2^+/c^+ \parallel a_2^-/c^- stack are already activated at 20V. These statistics on the possible ferroelectric and ferroelastic wall orientations, together with the automated, high-throughput AFM workflow, can be distilled into a predictive map that links domain configurations to pulse parameters. This microstructure-specific rule set forms the foundation for designing ferroelectric memories.
zh
[CV-133] Conformal Object Detection by Sequential Risk Control
【速读】:该论文试图解决工业应用中目标检测模型在关键任务中的可靠性问题,这一问题源于神经网络的固有不可靠性以及目标检测模型的复杂结构。解决方案的关键在于引入置信预测(Conformal Prediction),这是一种后处理方法,能够在不依赖模型或数据分布先验知识的情况下,为任何数据集大小提供统计保证。论文提出了一种新的方法——序列置信风险控制(Sequential Conformal Risk Control, SeqCRC),将置信风险控制(Conformal Risk Control, CRC)的统计保证扩展到两个序列任务,并针对不同应用场景和认证需求设计了相应的损失函数和预测集。
链接: https://arxiv.org/abs/2505.24038
作者: Léo Andéol,Luca Mossina,Adrien Mazoyer,Sébastien Gerchinovitz
机构: Institut de Mathématiques de Toulouse, Université de Toulouse (图卢兹数学研究所,图卢兹大学); IRT Saint Exupéry (圣埃克苏佩里工业研究机构)
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 28 pages, 11 figures
Abstract:Recent advances in object detectors have led to their adoption for industrial uses. However, their deployment in critical applications is hindered by the inherent lack of reliability of neural networks and the complex structure of object detection models. To address these challenges, we turn to Conformal Prediction, a post-hoc procedure which offers statistical guarantees that are valid for any dataset size, without requiring prior knowledge of the model or data distribution. Our contribution is threefold: first, we formally define the problem of Conformal Object Detection (COD) and introduce a novel method, Sequential Conformal Risk Control (SeqCRC), that extends the statistical guarantees of Conformal Risk Control (CRC) to two sequential tasks with two parameters, as required in the COD setting. Then, we propose loss functions and prediction sets suited to applying CRC to different applications and certification requirements. Finally, we present a conformal toolkit, enabling replication and further exploration of our methods. Using this toolkit, we perform extensive experiments, yielding a benchmark that validates the investigated methods and emphasizes trade-offs and other practical consequences.
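下面给出置信风险控制(CRC)阈值校准的极简 Python 草图:在校准集上选取使经验风险(含有限样本修正)不超过目标水平 α 的最小阈值。SeqCRC 将该过程扩展到带两个参数的序列任务,此处仅示意单参数情形。

```python
import numpy as np

def crc_threshold(losses, lambdas, alpha=0.1, B=1.0):
    """CRC 校准的极简示意:losses[i, j] 为第 i 个校准样本在阈值
    lambdas[j] 下的有界损失(lambdas 已按经验风险单调下降排序),
    返回满足 (n/(n+1))·R̂(λ) + B/(n+1) ≤ α 的最小阈值;B 为损失上界。"""
    n = losses.shape[0]
    for j, lam in enumerate(lambdas):
        risk = losses[:, j].mean()
        if (n / (n + 1)) * risk + B / (n + 1) <= alpha:
            return lam
    return lambdas[-1]
```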
zh
[CV-134] Estimating Head Motion in Structural MRI Using a Deep Neural Network Trained on Synthetic Artifacts
【速读】:该论文旨在解决磁共振成像(Magnetic Resonance Imaging, MRI)中运动伪影对自动化神经解剖度量(如皮层厚度)产生的偏差问题。传统手动审查无法客观量化解剖扫描中的运动,而现有自动化方法通常依赖专用硬件或不平衡的噪声训练数据。该研究的关键解决方案是训练一个3D卷积神经网络,仅使用合成损坏的数据体积来估计运动严重程度,从而实现无需前瞻性运动校正的客观、可扩展的结构MRI运动评估。
链接: https://arxiv.org/abs/2505.23916
作者: Charles Bricout,Samira Ebrahimi Kahou,Sylvain Bouix
机构: École de technologie supérieure (École de technologie supérieure); University of Calgary (University of Calgary); Canada CIFAR AI Chair/Mila (Canada CIFAR AI Chair/Mila)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Motion-related artifacts are inevitable in Magnetic Resonance Imaging (MRI) and can bias automated neuroanatomical metrics such as cortical thickness. Manual review cannot objectively quantify motion in anatomical scans, and existing automated approaches often require specialized hardware or rely on unbalanced noisy training data. Here, we train a 3D convolutional neural network to estimate motion severity using only synthetically corrupted volumes. We validate our method with one held-out site from our training cohort and with 14 fully independent datasets, including one with manual ratings, achieving a representative R^2 = 0.65 versus manual labels and significant thickness-motion correlations in 12/15 datasets. Furthermore, our predicted motion correlates with subject age in line with prior studies. Our approach generalizes across scanner brands and protocols, enabling objective, scalable motion assessment in structural MRI studies without prospective motion correction.
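下面给出一种常见的“k 空间相位斜坡”合成运动伪影方法的 Python 草图,用以示意“仅用合成损坏数据训练运动评估网络”的数据生成环节;具体的运动模型与参数均为笔者的假设,未必与论文一致。

```python
import numpy as np

def add_synthetic_motion(image, max_shift=4, n_events=3, seed=0):
    """k 空间相位斜坡合成运动伪影(示意):对若干连续的 k 空间行段
    施加随机平移对应的相位斜坡,模拟采集中的突发头动。
    image: (H, W) 幅值图;运动模型与参数为假设。"""
    rng = np.random.default_rng(seed)
    k = np.fft.fftshift(np.fft.fft2(image))
    h, w = k.shape
    yy, xx = np.meshgrid(np.arange(h) - h / 2, np.arange(w) - w / 2, indexing="ij")
    for _ in range(n_events):
        start = int(rng.integers(0, h - 1))
        stop = int(rng.integers(start + 1, h))
        dy, dx = rng.uniform(-max_shift, max_shift, 2)
        ramp = np.exp(-2j * np.pi * (dy * yy / h + dx * xx / w))
        k[start:stop] *= ramp[start:stop]      # 只平移这一段 k 空间行
    return np.abs(np.fft.ifft2(np.fft.ifftshift(k)))
```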
zh
[CV-135] Parameter-Free Bio-Inspired Channel Attention for Enhanced Cardiac MRI Reconstruction
【速读】:该论文试图解决现有注意力模块在理论基础方面不足的问题,特别是在卷积神经网络中缺乏稳健的理论支撑。其解决方案的关键在于提出一种非线性注意力架构,该架构基于生态学原理中的单种群增长非线性差分方程,设计出一种无需参数的注意力模块,从而超越现有的先进无参数方法。
链接: https://arxiv.org/abs/2505.23872
作者: Anam Hashmi,Julia Dietlmeier,Kathleen M. Curran,Noel E. O’Connor
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: presented at the 28th UK Conference on Medical Image Understanding and Analysis - MIUA, 24 - 26 July 2024
Abstract:Attention is a fundamental component of the human visual recognition system. The inclusion of attention in a convolutional neural network amplifies relevant visual features and suppresses the less important ones. Integrating attention mechanisms into convolutional neural networks enhances model performance and interpretability. Spatial and channel attention mechanisms have shown significant advantages across many downstream tasks in medical imaging. While existing attention modules have proven to be effective, their design often lacks a robust theoretical underpinning. In this study, we address this gap by proposing a non-linear attention architecture for cardiac MRI reconstruction and hypothesize that insights from ecological principles can guide the development of effective and efficient attention mechanisms. Specifically, we investigate a non-linear ecological difference equation that describes single-species population growth to devise a parameter-free attention module surpassing current state-of-the-art parameter-free methods.
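摘要提到以描述单种群增长的非线性差分方程构造无参数注意力。下面给出一个高度简化的 Python 草图:将通道统计量归一化后通过一步 Logistic 增长映射得到注意力权重。论文使用的具体递推形式摘要中未给出,此处的形式纯属笔者的示意性假设。

```python
import torch

def logistic_channel_attention(x, r=3.0):
    """以 Logistic(单种群增长)映射 s' = r·s·(1-s) 构造无参数通道注意力
    的高度简化示意:通道均值归一化到 (0,1) 后走一步增长映射作为权重。
    论文的具体递推形式未公开于摘要,此处纯属假设。x: (B, C, H, W)。"""
    s = x.mean(dim=(-2, -1), keepdim=True)                     # 通道统计量
    s_min, s_max = s.amin(1, keepdim=True), s.amax(1, keepdim=True)
    s = (s - s_min) / (s_max - s_min + 1e-6)                   # 归一化到 (0,1)
    w = r * s * (1 - s)                                        # 一步 Logistic 增长
    return x * torch.sigmoid(w)
```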
zh
人工智能
[AI-0] DexMachina: Functional Retargeting for Bimanual Dexterous Manipulation
【速读】:该论文试图解决功能性迁移问题(functional retargeting),即学习灵巧操作策略以从人类手-物体示范中跟踪物体状态。针对长时序、双臂操作和关节物体的任务,由于动作空间大、时空不连续性以及人机手之间的具身差距,该问题具有挑战性。论文提出的解决方案是DexMachina,其关键在于使用衰减强度的虚拟物体控制器:物体首先被自动引导至目标状态,从而使策略在运动和接触引导下逐步接管控制。
链接: https://arxiv.org/abs/2505.24853
作者: Zhao Mandi,Yifan Hou,Dieter Fox,Yashraj Narang,Ajay Mandlekar,Shuran Song
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We study the problem of functional retargeting: learning dexterous manipulation policies to track object states from human hand-object demonstrations. We focus on long-horizon, bimanual tasks with articulated objects, which is challenging due to the large action space, spatiotemporal discontinuities, and the embodiment gap between human and robot hands. We propose DexMachina, a novel curriculum-based algorithm whose key idea is to use virtual object controllers with decaying strength: an object is first driven automatically towards its target states, such that the policy can gradually learn to take over under motion and contact guidance. We release a simulation benchmark with a diverse set of tasks and dexterous hands, and show that DexMachina significantly outperforms baseline methods. Our algorithm and benchmark enable a functional comparison for hardware designs, and we present key findings informed by quantitative and qualitative results. With the recent surge in dexterous hand development, we hope this work will provide a useful platform for identifying desirable hardware capabilities and lower the barrier for contributing to future research. Videos and more at this https URL
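下面给出“虚拟物体控制器强度随训练衰减”这一课程式调度的极简 Python 草图;指数衰减形式与速率 k 均为笔者的假设,仅示意控制权如何从控制器逐步移交给策略。

```python
import math

def controller_strength(step, total_steps, k=5.0):
    """虚拟物体控制器强度的衰减调度(示意):训练初期 ≈1,由控制器
    驱动物体逼近目标;末期 →0,策略必须独立完成跟踪。
    指数形式与速率 k 均为假设,非论文的具体调度。"""
    return math.exp(-k * step / total_steps)

# 施加到物体上的外力/力矩 ≈ controller_strength(t, T) * pd_control(state, target)
```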
zh
[AI-1] RealDrive: Retrieval-Augmented Driving with Diffusion Models
【速读】:该论文旨在解决基于学习的规划器在处理罕见且安全关键场景时表现不佳以及生成轨迹可控性有限的问题。其解决方案的关键在于提出一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的框架——RealDrive,通过从训练数据集中检索最相关的专家示范来初始化基于扩散的规划策略,并利用去噪过程在当前观测与检索示例之间进行插值,从而实现对多样化场景中轨迹的细粒度控制和安全行为生成。此外,研究还表明,采用基于规划目标训练的任务相关检索模型相较于任务无关的检索器能够显著提升规划性能。
链接: https://arxiv.org/abs/2505.24808
作者: Wenhao Ding,Sushant Veer,Yuxiao Chen,Yulong Cao,Chaowei Xiao,Marco Pavone
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning-based planners generate natural human-like driving behaviors by learning to reason about nuanced interactions from data, overcoming the rigid behaviors that arise from rule-based planners. Nonetheless, data-driven approaches often struggle with rare, safety-critical scenarios and offer limited controllability over the generated trajectories. To address these challenges, we propose RealDrive, a Retrieval-Augmented Generation (RAG) framework that initializes a diffusion-based planning policy by retrieving the most relevant expert demonstrations from the training dataset. By interpolating between current observations and retrieved examples through a denoising process, our approach enables fine-grained control and safe behavior across diverse scenarios, leveraging the strong prior provided by the retrieved scenario. Another key insight is that a task-relevant retrieval model trained with planning-based objectives yields superior planning performance in our framework compared to a task-agnostic retriever. Experimental results demonstrate improved generalization to long-tail events and enhanced trajectory diversity compared to standard learning-based planners – we observe a 40% reduction in collision rate on the Waymo Open Motion dataset with RAG.
zh
[AI-2] Inference Acceleration of Autoregressive Normalizing Flows by Selective Jacobi Decoding
【速读】:该论文试图解决自回归模型在推理过程中由于严格顺序依赖导致的生成速度慢的问题,从而限制了其在实际应用中的部署。解决方案的关键在于提出一种选择性雅可比解码(Selective Jacobi Decoding, SeJD)策略,通过并行迭代优化加速自回归推理,理论分析表明该方法具有超线性收敛速率,并且所需的迭代次数不超过原始顺序方法。
链接: https://arxiv.org/abs/2505.24791
作者: Jiaru Zhang,Juanwu Lu,Ziran Wang,Ruqi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Normalizing flows are promising generative models with advantages such as theoretical rigor, analytical log-likelihood computation, and end-to-end training. However, the architectural constraints to ensure invertibility and tractable Jacobian computation limit their expressive power and practical usability. Recent advancements utilize autoregressive modeling, significantly enhancing expressive power and generation quality. However, such sequential modeling inherently restricts parallel computation during inference, leading to slow generation that impedes practical deployment. In this paper, we first identify that strict sequential dependency in inference is unnecessary to generate high-quality samples. We observe that patches in sequential modeling can also be approximated without strictly conditioning on all preceding patches. Moreover, the models tend to exhibit low dependency redundancy in the initial layer and higher redundancy in subsequent layers. Leveraging these observations, we propose a selective Jacobi decoding (SeJD) strategy that accelerates autoregressive inference through parallel iterative optimization. Theoretical analyses demonstrate the method’s superlinear convergence rate and guarantee that the number of iterations required is no greater than the original sequential approach. Empirical evaluations across multiple datasets validate the generality and effectiveness of our acceleration technique. Experiments demonstrate substantial speed improvements up to 4.7 times faster inference while keeping the generation quality and fidelity.
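下面给出雅可比(不动点)并行解码的核心循环的极简 Python 草图:所有位置基于上一轮迭代结果并行更新,直至收敛到与顺序解码一致的不动点。SeJD 在此基础上加入了选择性策略与分层冗余利用,此处从略;step_fn 为留空的假设接口。

```python
import torch

@torch.no_grad()
def jacobi_decode(step_fn, x_init, max_iters=50, tol=1e-4):
    """雅可比(不动点)并行解码的核心循环(示意):step_fn(x) 在一次
    并行前向中、以当前猜测为条件重新计算所有 patch;迭代直至收敛到
    与顺序解码一致的不动点。"""
    x = x_init
    for _ in range(max_iters):
        x_new = step_fn(x)
        if (x_new - x).abs().max() < tol:   # 收敛:后续迭代不再改变
            return x_new
        x = x_new
    return x
```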
zh
[AI-3] EXP-Bench: Can AI Conduct AI Research Experiments?
【速读】:该论文试图解决当前AI代理在进行严谨、端到端的实验研究时面临的复杂性问题,即如何有效自动化AI研究过程以加速科学进步。其解决方案的关键在于提出EXP-Bench,这是一个新的基准测试平台,用于系统评估AI代理在完整研究实验中的表现。为构建高保真度的实验任务,研究者设计了一个半自主的流程,从权威AI论文及其开源代码中提取并结构化关键实验细节,从而生成461个AI研究任务。该基准不仅揭示了现有大型语言模型(LLM)代理在实验设计与实现上的局限性,还为未来提升AI代理进行AI研究实验的能力提供了重要工具。
链接: https://arxiv.org/abs/2505.24785
作者: Patrick Tser Jern Kon,Jiachen Liu,Xinyi Zhu,Qiuyi Ding,Jingjia Peng,Jiarong Xing,Yibo Huang,Yiming Qiu,Jayanth Srinivasa,Myungjin Lee,Mosharaf Chowdhury,Matei Zaharia,Ang Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 45 pages, 13 figures
Abstract:Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With this pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments was a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at this https URL.
zh
[AI-4] AXIOM: Learning to Play Games in Minutes with Expanding Object-Centric Models
【速读】:该论文试图解决深度强化学习(Deep Reinforcement Learning, DRL)在数据效率上的不足,即相比人类学习,DRL在少量数据情况下表现较差。其解决方案的关键在于提出一种名为AXIOM的新架构,该架构整合了一组最小但具有表现力的核心先验知识(core priors),用于描述以对象为中心的动力学和交互,从而在低数据条件下加速学习。AXIOM通过将贝叶斯方法的数据效率与可解释性与DRL的跨任务泛化能力相结合,实现了这一目标。
链接: https://arxiv.org/abs/2505.24784
作者: Conor Heins,Toon Van de Maele,Alexander Tschantz,Hampus Linander,Dimitrije Markovic,Tommaso Salvatori,Corrado Pezzato,Ozan Catal,Ran Wei,Magnus Koudahl,Marco Perin,Karl Friston,Tim Verbelen,Christopher Buckley
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 10 pages main text, 4 figures, 2 tables; 25 pages supplementary material, 8 figures
Abstract:Current deep reinforcement learning (DRL) approaches achieve state-of-the-art performance in various domains, but struggle with data efficiency compared to human learning, which leverages core priors about objects and their interactions. Active inference offers a principled framework for integrating sensory information with prior knowledge to learn a world model and quantify the uncertainty of its own beliefs and predictions. However, active inference models are usually crafted for a single task with bespoke knowledge, so they lack the domain flexibility typical of DRL approaches. To bridge this gap, we propose a novel architecture that integrates a minimal yet expressive set of core priors about object-centric dynamics and interactions to accelerate learning in low-data regimes. The resulting approach, which we call AXIOM, combines the usual data efficiency and interpretability of Bayesian approaches with the across-task generalization usually associated with DRL. AXIOM represents scenes as compositions of objects, whose dynamics are modeled as piecewise linear trajectories that capture sparse object-object interactions. The structure of the generative model is expanded online by growing and learning mixture models from single events and periodically refined through Bayesian model reduction to induce generalization. AXIOM masters various games within only 10,000 interaction steps, with both a small number of parameters compared to DRL, and without the computational expense of gradient-based optimization.
[AI-5] HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts
【Quick Read】: This paper addresses the failure of current large language models (LLMs) to fully capture the semantic hierarchies and geometric structure inherent in natural language, which leads to training instability and degraded generative capability. The key to the solution is to operate entirely in hyperbolic space, exploiting its expansive, scale-free, low-distortion properties to better align language models with the underlying geometry of text. The paper introduces the HELM family of models, including HELM-MICE and HELM-D, along with key techniques such as hyperbolic Multi-Head Latent Attention (HMLA) for more efficient training and inference, and develops hyperbolic counterparts of foundational components such as rotary positional encodings and RMS normalization.
Link: https://arxiv.org/abs/2505.24722
Authors: Neil He, Rishabh Anand, Hiren Madhu, Ali Maatouk, Smita Krishnaswamy, Leandros Tassiulas, Menglin Yang, Rex Ying
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have shown great success in text modeling tasks across domains. However, natural language exhibits inherent semantic hierarchies and nuanced geometric structure, which current LLMs do not capture completely owing to their reliance on Euclidean operations. Recent studies have also shown that not respecting the geometry of token embeddings leads to training instabilities and degradation of generative capabilities. These findings suggest that shifting to non-Euclidean geometries can better align language models with the underlying geometry of text. We thus propose to operate fully in Hyperbolic space, known for its expansive, scale-free, and low-distortion properties. We thus introduce HELM, a family of HypErbolic Large Language Models, offering a geometric rethinking of the Transformer-based LLM that addresses the representational inflexibility, missing set of necessary operations, and poor scalability of existing hyperbolic LMs. We additionally introduce a Mixture-of-Curvature Experts model, HELM-MICE, where each expert operates in a distinct curvature space to encode more fine-grained geometric structure from text, as well as a dense model, HELM-D. For HELM-MICE, we further develop hyperbolic Multi-Head Latent Attention (HMLA) for efficient, reduced-KV-cache training and inference. For both models, we develop essential hyperbolic equivalents of rotary positional encodings and RMS normalization. We are the first to train fully hyperbolic LLMs at billion-parameter scale, and evaluate them on well-known benchmarks such as MMLU and ARC, spanning STEM problem-solving, general knowledge, and commonsense reasoning. Our results show consistent gains from our HELM architectures – up to 4% – over popular Euclidean architectures used in LLaMA and DeepSeek, highlighting the efficacy and enhanced reasoning afforded by hyperbolic geometry in large-scale LM pretraining.
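To make the geometric shift concrete, here is a minimal sketch of two standard Poincaré-ball primitives (Möbius addition and hyperbolic distance, unit curvature) of the kind hyperbolic language models build on. This is generic hyperbolic machinery for illustration, not HELM's actual implementation; the function names are ours.

```python
import numpy as np

def mobius_add(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Mobius addition on the unit Poincare ball (curvature c = 1)."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    return num / (1 + 2 * xy + x2 * y2)

def poincare_dist(x: np.ndarray, y: np.ndarray) -> float:
    """Geodesic distance between two points inside the unit ball."""
    diff2 = np.dot(x - y, x - y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    return float(np.arccosh(1 + 2 * diff2 / ((1 - x2) * (1 - y2))))

# Distances blow up near the boundary: the "expansive" property that
# suits hierarchical, scale-free data.
a, b = np.array([0.1, 0.0]), np.array([0.8, 0.0])
print(poincare_dist(a, b), np.linalg.norm(a - b))
```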
[AI-6] Towards Scalable Schema Mapping using Large Language Models
【Quick Read】: This paper targets the scalability problems data integration systems face when consolidating information from many heterogeneous sources, in particular the complexity, source-specificity, and maintenance cost of traditional hand-written schema mappings. The key to the solution is using large language models (LLMs) to automate schema matching while mitigating LLM-specific weaknesses: improving output stability through sampling and aggregation techniques, adopting more expressive mapping formats (e.g., GLaV) despite limited context windows, and reducing the computational cost of repeated LLM calls through strategies such as data type prefiltering.
Link: https://arxiv.org/abs/2505.24716
Authors: Christopher Buss, Mahdis Safari, Arash Termehchy, Stefan Lee, David Maier
Affiliations: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:
Abstract:The growing need to integrate information from a large number of diverse sources poses significant scalability challenges for data integration systems. These systems often rely on manually written schema mappings, which are complex, source-specific, and costly to maintain as sources evolve. While recent advances suggest that large language models (LLMs) can assist in automating schema matching by leveraging both structural and natural language cues, key challenges remain. In this paper, we identify three core issues with using LLMs for schema mapping: (1) inconsistent outputs due to sensitivity to input phrasing and structure, which we propose methods to address through sampling and aggregation techniques; (2) the need for more expressive mappings (e.g., GLaV), which strain the limited context windows of LLMs; and (3) the computational cost of repeated LLM calls, which we propose to mitigate through strategies like data type prefiltering.
[AI-7] On Symmetric Losses for Robust Policy Optimization with Noisy Preferences
【Quick Read】: This paper addresses how to optimize policies for aligning language models with human intent when the preference data is noisy. The core difficulty is that conventional methods assume accurate annotations, whereas real-world preference data often contains noise from human error or bias. The key to the solution is to view reward modeling as a classification problem and adopt symmetric loss functions, known for their robustness to label noise, yielding Symmetric Preference Optimization (SymPO). SymPO guarantees that even under noisy labels the learned reward remains rank-preserving, a property sufficient for policy improvement.
Link: https://arxiv.org/abs/2505.24709
Authors: Soichiro Nishimori, Yu-Jie Zhang, Thanawat Lodkaew, Masashi Sugiyama
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Optimizing policies based on human preferences is key to aligning language models with human intent. This work focuses on reward modeling, a core component in reinforcement learning from human feedback (RLHF), and offline preference optimization, such as direct preference optimization. Conventional approaches typically assume accurate annotations. However, real-world preference data often contains noise due to human errors or biases. We propose a principled framework for robust policy optimization under noisy preferences, viewing reward modeling as a classification problem. This allows us to leverage symmetric losses, known for their robustness to label noise in classification, leading to our Symmetric Preference Optimization (SymPO) method. We prove that symmetric losses enable successful policy optimization even under noisy labels, as the resulting reward remains rank-preserving – a property sufficient for policy improvement. Experiments on synthetic and real-world tasks demonstrate the effectiveness of SymPO.
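As a concrete illustration of "reward modeling as classification with a symmetric loss", here is a minimal sketch. It assumes the sigmoid loss, a standard symmetric loss satisfying loss(z) + loss(-z) = 1; the paper's exact choice of symmetric loss may differ, and the function names are ours.

```python
import torch
import torch.nn.functional as F

def symmetric_sigmoid_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Symmetric (sigmoid) loss on the reward margin z = r_chosen - r_rejected.

    Because sigmoid(-z) + sigmoid(z) = 1, flipping a fraction of preference
    labels only rescales and shifts the expected objective by constants, so
    the minimizer -- and hence the reward's ranking -- is unchanged.
    """
    z = r_chosen - r_rejected
    return torch.sigmoid(-z).mean()

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """The usual logistic (Bradley-Terry) loss, shown for contrast: not symmetric."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```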
[AI-8] Disentangling Granularity: An Implicit Inductive Bias in Factorized VAEs
【Quick Read】: This paper confronts a theoretical challenge in unsupervised disentangled representation learning with variational autoencoders (VAEs) and their variants: unsupervised disentanglement is unattainable without implicit inductive bias, yet that bias has remained elusive. The key contribution is uncovering one such implicit inductive bias, disentangling granularity, via an analysis of the total correlation in β-TCVAE, which reveals a "V"-shaped optimal ELBO (Evidence Lower Bound) trajectory in parameter space. This finding is validated through extensive experiments and shows that appropriately tuning disentangling granularity, as enabled by β-STCVAE, broadens the range of disentangled representations and allows high-complexity features to be disentangled.
Link: https://arxiv.org/abs/2505.24684
Authors: Zihao Chen, Yu Xiang, Wenyong Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Despite the success in learning semantically meaningful, unsupervised disentangled representations, variational autoencoders (VAEs) and their variants face a fundamental theoretical challenge: substantial evidence indicates that unsupervised disentanglement is unattainable without implicit inductive bias, yet such bias remains elusive. In this work, we focus on exploring the implicit inductive bias that drives disentanglement in VAEs with factorization priors. By analyzing the total correlation in β-TCVAE, we uncover a crucial implicit inductive bias called disentangling granularity, which leads to the discovery of an interesting "V"-shaped optimal Evidence Lower Bound (ELBO) trajectory within the parameter space. This finding is validated through over 100K experiments using factorized VAEs and our newly proposed model, β-STCVAE. Notably, experimental results reveal that conventional factorized VAEs, constrained by fixed disentangling granularity, inherently tend to disentangle low-complexity features, whereas appropriately tuning disentangling granularity, as enabled by β-STCVAE, broadens the range of disentangled representations, allowing for the disentanglement of high-complexity features. Our findings unveil that disentangling granularity, as an implicit inductive bias in factorized VAEs, influences both disentanglement performance and the inference of the ELBO, offering fresh insights into the interpretability and inherent biases of VAEs.
[AI-9] Generative Knowledge Production Pipeline Driven by Academic Influencers
【Quick Read】: This paper examines the challenges generative AI poses to academic integrity and credibility across knowledge production, validation, and dissemination. The key to the solution is a structured, implementation-ready generative publication production pipeline together with a policy framework for co-intelligence adaptation and ethical compliance, balancing originality, ethical norms, and human-AI collaboration in response to AI's disruption of traditional scientific paradigms.
Link: https://arxiv.org/abs/2505.24681
Authors: Katalin Feher, Marton Demeter
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments: 15 pages, 1 figure, 2 tables, Horizon Europe NGI funding
Abstract:Generative AI transforms knowledge production, validation, and dissemination, raising academic integrity and credibility concerns. This study examines 53 academic influencer videos that reached 5.3 million viewers to identify an emerging, structured, implementation-ready pipeline balancing originality, ethical compliance, and human-AI collaboration despite the disruptive impacts. Findings highlight generative AI’s potential to automate publication workflows and democratize participation in knowledge production while challenging traditional scientific norms. Academic influencers emerge as key intermediaries in this paradigm shift, connecting bottom-up practices with institutional policies to improve adaptability. Accordingly, the study proposes a generative publication production pipeline and a policy framework for co-intelligence adaptation and reinforcing credibility-centered standards in AI-powered research. These insights support scholars, educators, and policymakers in understanding AI’s transformative impact by advocating responsible and innovation-driven knowledge production. Additionally, they reveal pathways for automating best practices, optimizing scholarly workflows, and fostering creativity in academic research and publication.
[AI-10] Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models
【Quick Read】: This paper addresses the practical clinical shortcomings of cardiovascular disease (CVD) risk prediction models, which oversimplify patient profiles, rely on rigid input schemas, and are sensitive to distribution shifts. The key to the solution is AdaCVD, a framework built on large language models extensively fine-tuned on over half a million UK Biobank participants. It flexibly integrates structured data with unstructured text and rapidly adapts to new patient populations, enabling more accurate and adaptable CVD risk prediction.
Link: https://arxiv.org/abs/2505.24655
Authors: Frederike Lübeck, Jonas Wildberger, Frederik Träuble, Maximilian Mordig, Sergios Gatidis, Andreas Krause, Bernhard Schölkopf
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Cardiovascular disease (CVD) risk prediction models are essential for identifying high-risk individuals and guiding preventive actions. However, existing models struggle with the challenges of real-world clinical practice as they oversimplify patient profiles, rely on rigid input schemas, and are sensitive to distribution shifts. We developed AdaCVD, an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on over half a million participants from the UK Biobank. In benchmark comparisons, AdaCVD surpasses established risk scores and standard machine learning approaches, achieving state-of-the-art performance. Crucially, for the first time, it addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data. In stratified analyses, it demonstrates robust performance across demographic, socioeconomic, and clinical subgroups, including underrepresented cohorts. AdaCVD offers a promising path toward more flexible, AI-driven clinical decision support tools suited to the realities of heterogeneous and dynamic healthcare environments.
[AI-11] Random Rule Forest (RRF): Interpretable Ensembles of LLM-Generated Questions for Predicting Startup Success
【Quick Read】: This paper tackles the difficulty of predicting startup success with models that are both accurate and interpretable. The key to the solution is a lightweight ensemble framework that combines YES/NO questions generated by large language models (LLMs), filtering, ranking, and aggregating them through a threshold-based voting mechanism to form a strong ensemble predictor that raises precision substantially while remaining fully transparent.
Link: https://arxiv.org/abs/2505.24622
Authors: Ben Griffin, Joseph Ternasky, Fuat Alican, Yigit Ihlamur
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures
Abstract:Predicting startup success requires models that are both accurate and interpretable. We present a lightweight ensemble framework that combines YES/NO questions generated by large language models (LLMs), forming a transparent decision-making system. Each question acts as a weak heuristic, and by filtering, ranking, and aggregating them through a threshold-based voting mechanism, we construct a strong ensemble predictor. On a test set where 10% of startups are classified as successful, our approach achieves a precision rate of 50%, representing a 5x improvement over random selection, while remaining fully transparent. When we incorporate expert-guided heuristics into the generation process, performance improves further to 54% precision. These results highlight the value of combining LLM reasoning with human insight and demonstrate that simple, interpretable ensembles can support high-stakes decisions in domains such as venture capital (VC).
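The ensemble logic is simple enough to sketch: each LLM-generated YES/NO question acts as a weak heuristic, and a startup is predicted successful when the YES votes clear a threshold. The filtering criterion and threshold below are placeholders of our own choosing, not the paper's tuned values.

```python
from typing import Callable

Question = Callable[[dict], bool]   # e.g., "Does a founder have a prior exit?"

def filter_questions(questions: list[Question],
                     val_set: list[tuple[dict, bool]],
                     min_precision: float = 0.2) -> list[Question]:
    """Keep questions whose YES answers are predictive on a validation set."""
    kept = []
    for q in questions:
        yes_labels = [label for startup, label in val_set if q(startup)]
        if yes_labels and sum(yes_labels) / len(yes_labels) >= min_precision:
            kept.append(q)
    return kept

def predict(startup: dict, questions: list[Question], threshold: int) -> bool:
    """Threshold-based voting over the surviving heuristics."""
    return sum(q(startup) for q in questions) >= threshold
```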
[AI-12] Taxonomic Networks: A Representation for Neuro-Symbolic Pairing
【Quick Read】: This paper asks how neural and symbolic methods can be fused effectively for knowledge representation and classification. The key to the solution is the notion of a neuro-symbolic pair, in which a neural network and a symbolic system are linked through a common knowledge representation. The paper proposes taxonomic networks, a hierarchical concept representation, and shows that the symbolic method learns taxonomic networks more efficiently with less data and compute, while the neural method finds higher-accuracy taxonomic networks given greater resources. Such interchangeable neuro-symbolic pairs lay the foundation for systems that integrate neural and symbolic computation more fundamentally.
Link: https://arxiv.org/abs/2505.24601
Authors: Zekun Wang, Ethan L. Haarer, Nicki Barari, Christopher J. MacLellan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, NeuS 2025
Abstract:We introduce the concept of a neuro-symbolic pair – neural and symbolic approaches that are linked through a common knowledge representation. Next, we present taxonomic networks, a type of discrimination network in which nodes represent hierarchically organized taxonomic concepts. Using this representation, we construct a novel neuro-symbolic pair and evaluate its performance. We show that our symbolic method learns taxonomic nets more efficiently with less data and compute, while the neural method finds higher-accuracy taxonomic nets when provided with greater resources. As a neuro-symbolic pair, these approaches can be used interchangeably based on situational needs, with seamless translation between them when necessary. This work lays the foundation for future systems that more fundamentally integrate neural and symbolic computation.
[AI-13] Mixture-of-Experts for Personalized and Semantic-Aware Next Location Prediction
【Quick Read】: This paper addresses two shortcomings of existing next-location prediction methods: they fail to capture the complex, multi-functional semantics of real-world places, and they cannot model the heterogeneous behavioral dynamics of different user groups. The key to the solution is NextLocMoE, a framework built on large language models (LLMs) with a dual-level Mixture-of-Experts (MoE) design: a Location Semantics MoE at the embedding level and a Personalized MoE embedded in the Transformer backbone that adapts dynamically to individual mobility patterns, combined with a history-aware routing mechanism that improves expert selection and prediction stability.
Link: https://arxiv.org/abs/2505.24597
Authors: Shuai Liu, Ning Cao, Yile Chen, Yue Jiang, Gao Cong
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Next location prediction plays a critical role in understanding human mobility patterns. However, existing approaches face two core limitations: (1) they fall short in capturing the complex, multi-functional semantics of real-world locations; and (2) they lack the capacity to model heterogeneous behavioral dynamics across diverse user groups. To tackle these challenges, we introduce NextLocMoE, a novel framework built upon large language models (LLMs) and structured around a dual-level Mixture-of-Experts (MoE) design. Our architecture comprises two specialized modules: a Location Semantics MoE that operates at the embedding level to encode rich functional semantics of locations, and a Personalized MoE embedded within the Transformer backbone to dynamically adapt to individual user mobility patterns. In addition, we incorporate a history-aware routing mechanism that leverages long-term trajectory data to enhance expert selection and ensure prediction stability. Empirical evaluations across several real-world urban datasets show that NextLocMoE achieves superior performance in terms of predictive accuracy, cross-domain generalization, and interpretability.
[AI-14] Binary Cumulative Encoding meets Time Series Forecasting
【Quick Read】: This paper addresses a limitation of regression-via-classification approaches to time series forecasting: their one-hot encodings ignore the inherent ordinal structure of target values, so training conveys no information about the relative distance between predicted and true values. The key to the solution is binary cumulative encoding (BCE), which converts scalar targets into monotonic binary vectors that implicitly preserve order and magnitude, letting the model learn distance-aware representations within a classification framework.
Link: https://arxiv.org/abs/2505.24595
Authors: Andrei Chernov, Vitaliy Pozdnyakov, Ilya Makarov
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Recent studies in time series forecasting have explored formulating regression via classification task. By discretizing the continuous target space into bins and predicting over a fixed set of classes, these approaches benefit from stable training, robust uncertainty modeling, and compatibility with modern deep learning architectures. However, most existing methods rely on one-hot encoding that ignores the inherent ordinal structure of the underlying values. As a result, they fail to provide information about the relative distance between predicted and true values during training. In this paper, we propose to address this limitation by introducing binary cumulative encoding (BCE), that represents scalar targets into monotonic binary vectors. This encoding implicitly preserves order and magnitude information, allowing the model to learn distance-aware representations while still operating within a classification framework. We propose a convolutional neural network architecture specifically designed for BCE, incorporating residual and dilated convolutions to enable fast and expressive temporal modeling. Through extensive experiments on benchmark forecasting datasets, we show that our approach outperforms widely used methods in both point and probabilistic forecasting, while requiring fewer parameters and enabling faster training.
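The encoding itself is easy to state precisely. Below is a minimal sketch of binary cumulative encoding under the assumption of equally spaced bins: a scalar maps to a monotone vector of threshold indicators, and a vector of predicted threshold probabilities decodes back to a scalar. A nearby wrong prediction flips only the bits at the boundary, which is exactly the distance information one-hot codes discard. Helper names are ours.

```python
import numpy as np

def bce_encode(y: np.ndarray, lo: float, hi: float, k: int) -> np.ndarray:
    """Encode scalars as monotone binary vectors: bit i says 'y >= threshold i'."""
    thresholds = np.linspace(lo, hi, k, endpoint=False)
    return (y[:, None] >= thresholds[None, :]).astype(np.float32)

def bce_decode(probs: np.ndarray, lo: float, hi: float, k: int) -> np.ndarray:
    """Decode predicted per-threshold probabilities back to a scalar estimate."""
    width = (hi - lo) / k
    return lo + probs.sum(axis=-1) * width

codes = bce_encode(np.array([2.5]), lo=0.0, hi=10.0, k=10)
print(codes)                              # first three bits set: order preserved
print(bce_decode(codes, 0.0, 10.0, 10))   # [3.0], within one bin width of 2.5
```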
[AI-15] A Flat Minima Perspective on Understanding Augmentations and Model Robustness
【Quick Read】: This paper seeks a theoretical understanding of why data augmentation improves model robustness. The key to the solution is a unified theoretical framework that explains augmentation's effect on robustness through loss surface flatness and a PAC generalization bound. The framework broadly covers existing augmentation methods and is not limited to particular distribution shifts such as adversarial attacks.
Link: https://arxiv.org/abs/2505.24592
Authors: Weebum Yoo, Sung Whan Yoon
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Model robustness indicates a model’s capability to generalize well on unforeseen distributional shifts, including data corruption, adversarial attacks, and domain shifts. Data augmentation is one of the prevalent and effective ways to enhance robustness. Despite the great success of augmentations in different fields, a general theoretical understanding of their efficacy in improving model robustness is lacking. We offer a unified theoretical framework to clarify how augmentations can enhance model robustness through the lens of loss surface flatness and PAC generalization bound. Our work diverges from prior studies in that our analysis i) broadly encompasses much of the existing augmentation methods, and ii) is not limited to specific types of distribution shifts like adversarial attacks. We confirm our theories through simulations on the existing common corruption and adversarial robustness benchmarks based on the CIFAR and ImageNet datasets, as well as domain generalization benchmarks including PACS and OfficeHome.
[AI-16] AutoChemSchematic AI: A Closed-Loop Physics-Aware Agentic Framework for Auto-Generating Chemical Process and Instrumentation Diagrams
【Quick Read】: This paper addresses the bottleneck of moving generative-AI-discovered chemicals and materials from the lab to industrial-scale production: automatically generating process flow diagrams (PFDs) and piping and instrumentation diagrams (PIDs) that respect engineering constraints. The key to the solution is a closed-loop, physics-aware framework coupling domain-specialized small language models (SLMs) with first-principles simulation, built on three components: a hierarchical knowledge graph of process flow and instrumentation descriptions for 1,020+ chemicals, a multi-stage training pipeline (supervised fine-tuning, direct preference optimization, and retrieval-augmented instruction tuning), and DWSIM-based simulator-in-the-loop validation. The framework also applies advanced inference-time optimizations and importance-guided structural pruning to improve runtime efficiency and shrink model size with minimal accuracy loss.
Link: https://arxiv.org/abs/2505.24584
Authors: Sakhinana Sagar Srinivas, Shivam Gupta, Venkataramana Runkana
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Recent advancements in generative AI have accelerated the discovery of novel chemicals and materials; however, transitioning these discoveries to industrial-scale production remains a critical bottleneck, as it requires the development of entirely new chemical manufacturing processes. Current AI methods cannot auto-generate PFDs or PIDs, despite their critical role in scaling chemical processes, while adhering to engineering constraints. We present a closed loop, physics aware framework for the automated generation of industrially viable PFDs and PIDs. The framework integrates domain specialized small scale language models (SLMs) (trained for chemical process QA tasks) with first principles simulation, leveraging three key components: (1) a hierarchical knowledge graph of process flow and instrumentation descriptions for 1,020+ chemicals, (2) a multi-stage training pipeline that fine tunes domain specialized SLMs on synthetic datasets via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Retrieval-Augmented Instruction Tuning (RAIT), and (3) DWSIM based simulator in the loop validation to ensure feasibility. To improve both runtime efficiency and model compactness, the framework incorporates advanced inference time optimizations including FlashAttention, Lookahead Decoding, PagedAttention with KV-cache quantization, and Test Time Inference Scaling, and independently applies structural pruning techniques (width and depth) guided by importance heuristics to reduce model size with minimal accuracy loss. Experiments demonstrate that the framework generates simulator-validated process descriptions with high fidelity, outperforms baseline methods in correctness, and generalizes to unseen chemicals. By bridging AI-driven design with industrial-scale feasibility, this work significantly reduces R&D timelines from lab discovery to plant deployment.
[AI-17] CHIP: Chameleon Hash-based Irreversible Passport for Robust Deep Model Ownership Verification and Active Usage Control
【Quick Read】: This paper addresses intellectual property (IP) protection for large-scale deep neural networks (DNNs): existing passport-based methods cannot simultaneously provide robust ownership proof against ambiguity attacks, high fidelity, active usage authorization, and user traceability. The key to the solution is the Chameleon Hash-based Irreversible Passport (CHIP) framework. The collision-resistant property of the chameleon hash supports strong ownership claims and liable user traceability, while its trapdoor-collision property hashes multiple user passports and licensee certificates to the same immutable signature, realizing active usage control.
Link: https://arxiv.org/abs/2505.24536
Authors: Chaohui Xu, Qi Cui, Chip-Hong Chang
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:The pervasion of large-scale Deep Neural Networks (DNNs) and their enormous training costs make their intellectual property (IP) protection of paramount importance. Recently introduced passport-based methods attempt to steer DNN watermarking towards strengthening ownership verification against ambiguity attacks by modulating the affine parameters of normalization layers. Unfortunately, neither watermarking nor passport-based methods provide a holistic protection with robust ownership proof, high fidelity, active usage authorization and user traceability for offline access distributed models and multi-user Machine-Learning as a Service (MLaaS) cloud model. In this paper, we propose a Chameleon Hash-based Irreversible Passport (CHIP) protection framework that utilizes the cryptographic chameleon hash function to achieve all these goals. The collision-resistant property of chameleon hash allows for strong model ownership claim upon IP infringement and liable user traceability, while the trapdoor-collision property enables hashing of multiple user passports and licensee certificates to the same immutable signature to realize active usage control. Using the owner passport as an oracle, multiple user-specific triplets, each contains a passport-aware user model, a user passport, and a licensee certificate can be created for secure offline distribution. The watermarked master model can also be deployed for MLaaS with usage permission verifiable by the provision of any trapdoor-colliding user passports. CHIP is extensively evaluated on four datasets and two architectures to demonstrate its protection versatility and robustness. Our code is released at this https URL.
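The trapdoor-collision property that CHIP builds on can be demonstrated with the classic discrete-log chameleon hash. The sketch below uses tiny, insecure toy parameters purely to show how a trapdoor holder maps two different messages (say, two user passports) to the same hash value; it is not CHIP's actual construction.

```python
# Toy discrete-log chameleon hash (insecure demo parameters).
q, p, g = 1019, 2039, 4        # p = 2q + 1; g generates the order-q subgroup
x = 123                        # trapdoor held by the model owner
y = pow(g, x, p)               # public hash key

def ch_hash(m: int, r: int) -> int:
    return (pow(g, m, p) * pow(y, r, p)) % p

def trapdoor_collision(m: int, r: int, m_new: int) -> int:
    """Solve m + x*r = m_new + x*r' (mod q) for r' using the trapdoor x."""
    return (r + (m - m_new) * pow(x, -1, q)) % q

m1, r1 = 42, 7                 # first "passport"
m2 = 99                        # second "passport"
r2 = trapdoor_collision(m1, r1, m2)
assert ch_hash(m1, r1) == ch_hash(m2, r2)   # same immutable signature
```

Without the trapdoor x, finding such a collision is as hard as the discrete log, which is what makes the signature binding for everyone else.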
[AI-18] Directional Non-Commutative Monoidal Structures with Interchange Law via Commutative Generators
【Quick Read】: This paper asks how traditional one-dimensional monoidal systems can be generalized to higher dimensions so as to unify and model a range of classical linear transforms from signal processing and data analysis. The key to the solution is an algebraic structure, defined recursively from vector-matrix pairs, with non-commutative per-axis composition operators and a global interchange law, preserving structural coherence through commutative linear operators. The framework subsumes the Discrete Fourier Transform (DFT), the Walsh transform, and the Hadamard transform as special cases, and provides a systematic way to design learnable transforms tailored to specific data modalities and tasks.
Link: https://arxiv.org/abs/2505.24533
Authors: Mahesh Godavarti
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments:
Abstract:We introduce a novel framework consisting of a class of algebraic structures that generalize one-dimensional monoidal systems into higher dimensions by defining per-axis composition operators subject to non-commutativity and a global interchange law. These structures, defined recursively from a base case of vector-matrix pairs, model directional composition in multiple dimensions while preserving structural coherence through commutative linear operators. We show that the framework unifies several well-known linear transforms in signal processing and data analysis. In this framework, data indices are embedded into a composite structure that decomposes into simpler components. We show that classic transforms such as the Discrete Fourier Transform (DFT), the Walsh transform, and the Hadamard transform are special cases of our algebraic structure. The framework provides a systematic way to derive these transforms by appropriately choosing vector and matrix pairs. By subsuming classical transforms within a common structure, the framework also enables the development of learnable transformations tailored to specific data modalities and tasks.
[AI-19] Can Slow-thinking LLMs Reason Over Time? Empirical Studies in Time Series Forecasting
【Quick Read】: This paper investigates whether slow-thinking large language models (LLMs) can effectively support time series forecasting (TSF) through structured reasoning, especially in zero-shot settings, given that traditional TSF methods perform no explicit reasoning over temporal dynamics and contextual dependencies. The key to the solution is reframing TSF as a conditional reasoning task and designing a series of prompting strategies that elicit inference-time reasoning from pretrained slow-thinking LLMs.
Link: https://arxiv.org/abs/2505.24511
Authors: Jiahao Wang, Mingyue Cheng, Qi Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Time series forecasting (TSF) is a fundamental and widely studied task, spanning methods from classical statistical approaches to modern deep learning and multimodal language modeling. Despite their effectiveness, these methods often follow a fast thinking paradigm emphasizing pattern extraction and direct value mapping, while overlooking explicit reasoning over temporal dynamics and contextual dependencies. Meanwhile, emerging slow-thinking LLMs (e.g., ChatGPT-o1, DeepSeek-R1) have demonstrated impressive multi-step reasoning capabilities across diverse domains, suggesting a new opportunity for reframing TSF as a structured reasoning task. This motivates a key question: can slow-thinking LLMs effectively reason over temporal patterns to support time series forecasting, even in a zero-shot manner? To investigate this, in this paper, we propose TimeReasoner, an extensive empirical study that formulates TSF as a conditional reasoning task. We design a series of prompting strategies to elicit inference-time reasoning from pretrained slow-thinking LLMs and evaluate their performance across diverse TSF benchmarks. Our findings reveal that slow-thinking LLMs exhibit non-trivial zero-shot forecasting capabilities, especially in capturing high-level trends and contextual shifts. While preliminary, our study surfaces important insights into the reasoning behaviors of LLMs in temporal domains, highlighting both their potential and limitations. We hope this work catalyzes further research into reasoning-based forecasting paradigms and paves the way toward more interpretable and generalizable TSF frameworks.
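Concretely, "TSF as conditional reasoning" amounts to prompt design. The sketch below shows one plausible zero-shot prompt in the spirit the abstract describes; the paper's actual templates are not reproduced here, and the wording is ours.

```python
def build_forecast_prompt(history: list[float], horizon: int, context: str = "") -> str:
    """Assemble a zero-shot reasoning prompt for a slow-thinking LLM."""
    series = ", ".join(f"{v:.2f}" for v in history)
    return (
        "You are a careful time-series analyst. "
        f"{context}\n"
        f"Observed values (oldest to newest): {series}\n"
        "Think step by step about trend, seasonality, and any recent regime "
        f"shifts, then output the next {horizon} values as a comma-separated list."
    )

print(build_forecast_prompt([101.2, 103.9, 102.7, 106.1], horizon=3,
                            context="Daily sales of one store."))
```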
[AI-20] Online Fair Division with Additional Information
【Quick Read】: This paper studies online fair division of indivisible goods, where goods arrive sequentially and must be allocated irrevocably, asking how information about future goods affects the existence and approximability of fair allocations. The key to the solution is leveraging additional information, such as each agent's total valuation (equivalently, normalized valuations) or the multiset of future goods' values (frequency predictions), to design fairer online algorithms. With normalization information, the paper gives an algorithm with stronger fairness guarantees than previously known; with frequency predictions, it introduces a meta-algorithm matching the best-known offline guarantees for a broad class of "share-based" fairness notions.
Link: https://arxiv.org/abs/2505.24503
Authors: Tzeh Yuan Neoh, Jannik Peters, Nicholas Teh
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments:
Abstract:We study the problem of fairly allocating indivisible goods to agents in an online setting, where goods arrive sequentially and must be allocated irrevocably to agents. Focusing on the popular fairness notions of envy-freeness, proportionality, and maximin share fairness (and their approximate variants), we ask how the availability of information on future goods influences the existence and approximability of fair allocations. In the absence of any such information, we establish strong impossibility results, demonstrating the inherent difficulty of achieving even approximate fairness guarantees. In contrast, we demonstrate that knowledge of additional information – such as aggregate of each agent's total valuations (equivalently, normalized valuations) or the multiset of future goods values (frequency predictions) – would enable the design of fairer online algorithms. Given normalization information, we propose an algorithm that achieves stronger fairness guarantees than previously known results. Given frequency predictions, we introduce a meta-algorithm that leverages frequency predictions to match the best-known offline guarantees for a broad class of "share-based" fairness notions. Our complementary impossibility results in each setting underscore both the limitations imposed by uncertainty about future goods and the potential of leveraging structured information to achieve fairer outcomes in online fair division.
[AI-21] MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge
【Quick Read】: This paper addresses the high cost and inconsistency of annotation in speech emotion recognition (SER): human annotation is resource-intensive, and annotators' subjectivity and missing contextual knowledge reduce label accuracy. The key to the solution is using large language models (LLMs), specifically GPT-4o, to annotate multimodal speech data automatically from textual cues alone. Structured text prompts capitalize on the knowledge GPT-4o accumulated during training, producing accurate, contextually relevant annotations without direct access to multimodal inputs and yielding MELT, a multimodal emotion dataset fully annotated by GPT-4o.
Link: https://arxiv.org/abs/2505.24493
Authors: Xin Jing, Jiadong Wang, Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also subject to inconsistencies: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, Large Language Models (LLMs) have emerged as a scalable alternative for annotating text data. However, the potential of LLMs to perform emotional speech data annotation without human supervision has yet to be thoroughly investigated. To address these problems, we apply GPT-4o to annotate a multimodal dataset collected from the sitcom Friends, using only textual cues as inputs. By crafting structured text prompts, our methodology capitalizes on the knowledge GPT-4o has accumulated during its training, showcasing that it can generate accurate and contextually relevant annotations without direct access to multimodal inputs. Therefore, we propose MELT, a multimodal emotion dataset fully annotated by GPT-4o. We demonstrate the effectiveness of MELT by fine-tuning four self-supervised learning (SSL) backbones and assessing speech emotion recognition performance across emotion datasets. Additionally, our subjective experiments' results demonstrate a consistent performance improvement on SER.
[AI-22] Object-Centric Concept Bottlenecks
【Quick Read】: This paper addresses the challenge of developing high-performing yet interpretable models: traditional concept-based models (CBMs) have limited expressiveness in object-centric real-world settings and struggle with complex vision tasks. The key to the solution is Object-Centric Concept Bottlenecks (OCB), a framework that combines the strengths of CBMs with pre-trained object-centric foundation models, boosting both performance and interpretability.
Link: https://arxiv.org/abs/2505.24492
Authors: David Steinmann, Wolfgang Stammer, Antonia Wüst, Kristian Kersting
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Developing high-performing, yet interpretable models remains a critical challenge in modern AI. Concept-based models (CBMs) attempt to address this by extracting human-understandable concepts from a global encoding (e.g., image encoding) and then applying a linear classifier on the resulting concept activations, enabling transparent decision-making. However, their reliance on holistic image encodings limits their expressiveness in object-centric real-world settings and thus hinders their ability to solve complex vision tasks beyond single-label classification. To tackle these challenges, we introduce Object-Centric Concept Bottlenecks (OCB), a framework that combines the strengths of CBMs and pre-trained object-centric foundation models, boosting performance and interpretability. We evaluate OCB on complex image datasets and conduct a comprehensive ablation study to analyze key components of the framework, such as strategies for aggregating object-concept encodings. The results show that OCB outperforms traditional CBMs and allows one to make interpretable decisions for complex visual tasks.
[AI-23] Rehearsal with Auxiliary-Informed Sampling for Audio Deepfake Detection INTERSPEECH2025
【Quick Read】: This paper addresses the performance degradation of existing audio deepfake detection frameworks when confronted with novel deepfake attacks. The key to the solution is Rehearsal with Auxiliary-Informed Sampling (RAIS), a rehearsal-based continual learning approach that uses a label generation network to produce auxiliary labels guiding the selection of diverse samples for the memory buffer, thereby capturing the diversity of audio characteristics, reducing bias, and lowering the risk of forgetting.
Link: https://arxiv.org/abs/2505.24486
Authors: Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Jiangang Ma, Vidya Saikrishna, Feng Xia
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted by Interspeech 2025
Abstract:The performance of existing audio deepfake detection frameworks degrades when confronted with new deepfake attacks. Rehearsal-based continual learning (CL), which updates models using a limited set of old data samples, helps preserve prior knowledge while incorporating new information. However, existing rehearsal techniques don't effectively capture the diversity of audio characteristics, introducing bias and increasing the risk of forgetting. To address this challenge, we propose Rehearsal with Auxiliary-Informed Sampling (RAIS), a rehearsal-based CL approach for audio deepfake detection. RAIS employs a label generation network to produce auxiliary labels, guiding diverse sample selection for the memory buffer. Extensive experiments show RAIS outperforms state-of-the-art methods, achieving an average Equal Error Rate (EER) of 1.953% across five experiences. The code is available at: this https URL.
[AI-24] Evaluating Gemini in an arena for learning
【Quick Read】: This paper addresses the lack of a robust, general benchmark for evaluating how well AI models support learning. The key to the solution is an "arena for learning" in which educators and pedagogy experts conduct blind, head-to-head, multi-turn comparisons of leading AI models in realistic learning scenarios to determine which models best support users' learning goals.
Link: https://arxiv.org/abs/2505.24477
Authors: LearnLM Team Google: Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Ankit Anand, Avishkar Bhoopchand, Brett Wiltshire, Daniel Gillick, Daniel Kasenberg, Eleni Sgouritsa, Gal Elidan, Hengrui Liu, Holger Winnemoeller, Irina Jurenka, James Cohan, Jennifer She, Julia Wilkowski, Kaiz Alarakyia, Kevin R. McKee, Komal Singh, Lisa Wang, Markus Kunesch, Miruna Pîslar, Niv Efron, Parsa Mahmoudieh, Pierre-Alexandre Kamienny, Sara Wiltberger, Shakir Mohamed, Shashank Agarwal, Shubham Milind Phal, Sun Jae Lee, Theofilos Strinopoulos, Wei-Jen Ko, Yael Gold-Zamir, Yael Haramaty, Yannis Assael
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Artificial intelligence (AI) is poised to transform education, but the research community lacks a robust, general benchmark to evaluate AI models for learning. To assess state-of-the-art support for educational use cases, we ran an “arena for learning” where educators and pedagogy experts conduct blind, head-to-head, multi-turn comparisons of leading AI models. In particular, N = 189 educators drew from their experience to role-play realistic learning use cases, interacting with two models sequentially, after which N = 206 experts judged which model better supported the user’s learning goals. The arena evaluated a slate of state-of-the-art models: Gemini 2.5 Pro, Claude 3.7 Sonnet, GPT-4o, and OpenAI o3. Excluding ties, experts preferred Gemini 2.5 Pro in 73.2% of these match-ups – ranking it first overall in the arena. Gemini 2.5 Pro also demonstrated markedly higher performance across key principles of good pedagogy. Altogether, these results position Gemini 2.5 Pro as a leading model for learning.
[AI-25] Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy
【Quick Read】: This paper addresses the inflexibility of conventional sparse autoencoders (SAEs), whose sparsity level is fixed at training time, so that different sparsity requirements demand separate models and increase computational cost. The key to the solution is a new training objective, HierarchicalTopK, which trains a single SAE to optimize reconstruction across multiple sparsity levels simultaneously, achieving Pareto-optimal trade-offs between sparsity and explained variance while preserving high interpretability.
Link: https://arxiv.org/abs/2505.24473
Authors: Nikita Balagansky, Yaroslav Aksenov, Daniil Laptev, Vadim Kurochkin, Gleb Gerasimov, Nikita Koryagin, Daniil Gavrilov
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Sparse Autoencoders (SAEs) have proven to be powerful tools for interpreting neural networks by decomposing hidden representations into disentangled, interpretable features via sparsity constraints. However, conventional SAEs are constrained by the fixed sparsity level chosen during training; meeting different sparsity requirements therefore demands separate models and increases the computational footprint during both training and evaluation. We introduce a novel training objective, HierarchicalTopK, which trains a single SAE to optimise reconstructions across multiple sparsity levels simultaneously. Experiments with Gemma-2 2B demonstrate that our approach achieves Pareto-optimal trade-offs between sparsity and explained variance, outperforming traditional SAEs trained at individual sparsity levels. Further analysis shows that HierarchicalTopK preserves high interpretability scores even at higher sparsity. The proposed objective thus closes an important gap between flexibility and interpretability in SAE design.
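The objective is straightforward to sketch: one encoder/decoder pair is trained so that, for every sparsity budget k in a nested set, the reconstruction from only the top-k latent activations remains good. The uniform averaging over levels below is an assumption on our part; the paper's exact weighting and regularization may differ.

```python
import torch
import torch.nn.functional as F

def hierarchical_topk_loss(x: torch.Tensor, encoder: torch.nn.Module,
                           decoder: torch.nn.Module,
                           ks: tuple[int, ...] = (8, 16, 32, 64)) -> torch.Tensor:
    """Sum reconstruction losses of one SAE over nested Top-K sparsity budgets."""
    z = F.relu(encoder(x))                         # latent activations
    loss = x.new_zeros(())
    for k in ks:                                   # nested sparsity budgets
        top = torch.topk(z, k, dim=-1)
        z_k = torch.zeros_like(z).scatter(-1, top.indices, top.values)
        loss = loss + F.mse_loss(decoder(z_k), x)  # reconstruct at this budget
    return loss / len(ks)
```

At inference, the same trained SAE can then be run at whichever budget k the application demands, which is exactly the flexibility a fixed-sparsity SAE lacks.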
[AI-26] SEAR: A Multimodal Dataset for Analyzing AR-LLM-Driven Social Engineering Behaviors
【Quick Read】: This paper targets the detection of and defense against social engineering (SE) attacks orchestrated through augmented reality (AR) and multimodal large language models (LLMs). The key to the solution is the SEAR dataset: 180 annotated conversations with synchronized AR-captured visual/audio cues, environmental context, curated social media profiles, and subjective metrics, providing a multimodal empirical foundation for studying AR-driven SE attacks.
Link: https://arxiv.org/abs/2505.24458
Authors: Tianlong Yu, Chenghang Ye, Zheyu Yang, Ziyi Zhou, Cui Tang, Zui Tao, Jun Zhang, Kailong Wang, Liting Zhou, Yang Yang, Ting Bi
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The SEAR Dataset is a novel multimodal resource designed to study the emerging threat of social engineering (SE) attacks orchestrated through augmented reality (AR) and multimodal large language models (LLMs). This dataset captures 180 annotated conversations across 60 participants in simulated adversarial scenarios, including meetings, classes and networking events. It comprises synchronized AR-captured visual/audio cues (e.g., facial expressions, vocal tones), environmental context, and curated social media profiles, alongside subjective metrics such as trust ratings and susceptibility assessments. Key findings reveal SEAR’s alarming efficacy in eliciting compliance (e.g., 93.3% phishing link clicks, 85% call acceptance) and hijacking trust (76.7% post-interaction trust surge). The dataset supports research in detecting AR-driven SE attacks, designing defensive frameworks, and understanding multimodal adversarial manipulation. Rigorous ethical safeguards, including anonymization and IRB compliance, ensure responsible use. The SEAR dataset is available at this https URL.
[AI-27] LPASS: Linear Probes as Stepping Stones for vulnerability detection using compressed LLMs
【Quick Read】: This paper addresses how compressed large language models (LLMs) can be used efficiently and effectively for vulnerable-code detection in cybersecurity. The key to the solution is using Linear Probes (LPs) to estimate a compressed LLM's performance at an early stage, before fine-tuning, and to set the cut-off point for layer-pruning compression, balancing compression against performance. The approach, dubbed LPASS, is validated on BERT and Gemma, showing that many layers can be removed without precision loss while improving efficiency.
Link: https://arxiv.org/abs/2505.24451
Authors: Luis Ibanez-Lissen, Lorena Gonzalez-Manzano, Jose Maria de Fuentes, Nicolas Anciaux
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) are being extensively used for cybersecurity purposes. One of them is the detection of vulnerable codes. For the sake of efficiency and effectiveness, compression and fine-tuning techniques are being developed, respectively. However, they involve spending substantial computational efforts. In this vein, we analyse how Linear Probes (LPs) can be used to provide an estimation on the performance of a compressed LLM at an early phase – before fine-tuning. We also show their suitability to set the cut-off point when applying layer pruning compression. Our approach, dubbed LPASS, is applied in BERT and Gemma for the detection of 12 of MITRE's Top 25 most dangerous vulnerabilities on 480k C/C++ samples. LPs can be computed in 142.97 s and provide key findings: (1) 33.3% and 72.2% of layers can be removed, respectively, with no precision loss; (2) they provide an early estimate of the post-fine-tuning and post-compression model effectiveness, with 3% and 8.68% as the lowest and average precision errors, respectively. LPASS-based LLMs outperform the state of the art, reaching 86.9% accuracy in multi-class vulnerability detection. Interestingly, LPASS-based compressed versions of Gemma outperform the original ones by up to 1.6% F1-score while saving 29.4% and 23.8% of training and inference time, respectively, and 42.98% of model size.
[AI-28] Learning Safety Constraints for Large Language Models ICML2025
【Quick Read】: This paper addresses the harmful outputs and adversarial vulnerability of large language models (LLMs). The key to the solution is SaP (Safety Polytope), a geometric approach that learns and enforces multiple safety constraints directly in the model's representation space. The polytope's facets identify safe and unsafe regions, enabling detection and correction of unsafe outputs without modifying model weights, thereby enforcing safety while preserving model capabilities.
Link: https://arxiv.org/abs/2505.24445
Authors: Xin Chen, Yarden As, Andreas Krause
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2025 (Spotlight)
Abstract:Large language models (LLMs) have emerged as powerful tools but pose significant safety risks through harmful outputs and vulnerability to adversarial attacks. We propose SaP, short for Safety Polytope, a geometric approach to LLM safety that learns and enforces multiple safety constraints directly in the model’s representation space. We develop a framework that identifies safe and unsafe regions via the polytope’s facets, enabling both detection and correction of unsafe outputs through geometric steering. Unlike existing approaches that modify model weights, SaP operates post-hoc in the representation space, preserving model capabilities while enforcing safety constraints. Experiments across multiple LLMs demonstrate that our method can effectively detect unethical inputs, reduce adversarial attack success rates while maintaining performance on standard tasks, thus highlighting the importance of having an explicit geometric model for safety. Analysis of the learned polytope facets reveals emergence of specialization in detecting different semantic notions of safety, providing interpretable insights into how safety is captured in LLMs’ representation space.
[AI-29] RMoA: Optimizing Mixture-of-Agents through Diversity Maximization and Residual Compensation ACL2025
【Quick Read】: This paper addresses the high computational overhead, information loss, and limited robustness of LLM-based multi-agent systems. The key to the solution is Residual Mixture-of-Agents (RMoA), which introduces residual connections to improve efficiency and reliability. An embedding-based diversity selection mechanism greedily selects responses via vector similarity, maximizing the information drawn from model responses while reducing computational cost; a Residual Extraction Agent and a Residual Aggregation Agent preserve cross-layer incremental information to mitigate iterative degradation; and an adaptive termination mechanism halts processing based on residual convergence to improve inference efficiency.
Link: https://arxiv.org/abs/2505.24442
Authors: Zhentao Xie, Chengcheng Han, Jinxin Shi, Wenjun Cui, Xin Zhao, Xingjiao Wu, Jiabao Zhao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2025 (Findings)
Abstract:Although multi-agent systems based on large language models show strong capabilities on multiple tasks, they are still limited by high computational overhead, information loss, and robustness. Inspired by ResNet’s residual learning, we propose Residual Mixture-of-Agents (RMoA), integrating residual connections to optimize efficiency and reliability. To maximize information utilization from model responses while minimizing computational costs, we innovatively design an embedding-based diversity selection mechanism that greedily selects responses via vector similarity. Furthermore, to mitigate iterative information degradation, we introduce a Residual Extraction Agent to preserve cross-layer incremental information by capturing inter-layer response differences, coupled with a Residual Aggregation Agent for hierarchical information integration. Additionally, we propose an adaptive termination mechanism that dynamically halts processing based on residual convergence, further improving inference efficiency. RMoA achieves state-of-the-art performance on the benchmarks of across alignment, mathematical reasoning, code generation, and multitasking understanding, while significantly reducing computational overhead. Code is available at this https URL.
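Of RMoA's components, the embedding-based diversity selection is the easiest to make concrete. The sketch below greedily keeps the response farthest (in cosine similarity) from those already selected; the seeding choice and the exact greedy criterion are assumptions of ours, not the paper's specification.

```python
import numpy as np

def select_diverse(embeddings: np.ndarray, m: int) -> list[int]:
    """Greedily pick indices of m mutually dissimilar response embeddings."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    chosen = [0]                                  # seed with the first response
    while len(chosen) < m:
        sims = emb @ emb[chosen].T                # cosine similarity to chosen set
        farthest = int(np.argmin(sims.max(axis=1)))
        chosen.append(farthest)                   # least similar to its nearest pick
    return chosen
```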
[AI-30] P: A Universal Measure of Predictive Intelligence
【Quick Read】: This paper addresses the absence of a unified measure of intelligence, i.e., a single ratio scale on which humans, animals, and AIs can be compared. The key to the solution is a universal measure of intelligence based on prediction, hypothesized to be the most important component of intelligence: as an agent interacts with its environment, the accuracy of its predictions is summed, and the Kolmogorov complexity of its predictions and perceived environment is taken into account.
Link: https://arxiv.org/abs/2505.24426
Authors: David Gamez
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Over the last thirty years, considerable progress has been made with the development of systems that can drive cars, play games, predict protein folding and generate natural language. These systems are described as intelligent and there has been a great deal of talk about the rapid increase in artificial intelligence and its potential dangers. However, our theoretical understanding of intelligence and ability to measure it lag far behind our capacity for building systems that mimic intelligent human behaviour. There is no commonly agreed definition of the intelligence that AI systems are said to possess. No-one has developed a practical measure that would enable us to compare the intelligence of humans, animals and AIs on a single ratio scale. This paper sets out a new universal measure of intelligence that is based on the hypothesis that prediction is the most important component of intelligence. As an agent interacts with its normal environment, the accuracy of its predictions is summed up and the complexity of its predictions and perceived environment is accounted for using Kolmogorov complexity. Two experiments were carried out to evaluate the practical feasibility of the algorithm. These demonstrated that it could measure the intelligence of an agent embodied in a virtual maze and an agent that makes predictions about time-series data. This universal measure could be the starting point for a new comparative science of intelligence that ranks humans, animals and AIs on a single ratio scale.
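The abstract names the ingredients (summed prediction accuracy, Kolmogorov complexity of the predictions and perceived environment) but not the exact formula, so the sketch below is only a schematic reading: it counts prediction hits and uses compressed length as the standard computable stand-in for Kolmogorov complexity. How the paper actually combines these terms may well differ.

```python
import zlib

def complexity_proxy(obj) -> int:
    """Kolmogorov complexity is uncomputable; compressed length is the
    usual practical surrogate (our assumption, not the paper's choice)."""
    return len(zlib.compress(repr(obj).encode("utf-8")))

def predictive_score(predictions, observations) -> float:
    """Schematic combination: summed accuracy weighted by complexity."""
    hits = sum(p == o for p, o in zip(predictions, observations))
    return hits * complexity_proxy(observations) / max(len(observations), 1)
```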
[AI-31] Three Kinds of Negation in Knowledge and Their Mathematical Foundations
【Quick Read】: This paper addresses a foundational issue in knowledge processing: understanding, distinguishing, expressing, and computing negation in knowledge. The key to the solution is, building on the distinction between contradiction and opposition, to propose that negation in knowledge takes three conceptually different forms: contradictory negation, opposite negation, and intermediary negation. To provide a mathematical foundation that fully reflects their intrinsic connections, properties, and laws, the paper introduces SCOI (sets with contradictory, opposite, and intermediary negation) and LCOI (logic with contradictory, opposite, and intermediary negation), and proves the main operational properties of SCOI and the formal inference relations in LCOI.
Link: https://arxiv.org/abs/2505.24422
Authors: Zhenghua Pan, Yong Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 32 pages, 13 figures
Abstract:In the field of artificial intelligence, understanding, distinguishing, expressing, and computing the negation in knowledge is a fundamental issue in knowledge processing and research. In this paper, we examine and analyze the understanding and characteristics of negation in various fields such as philosophy, logic, and linguistics etc. Based on the distinction between the concepts of contradiction and opposition, we propose that there are three different types of negation in knowledge from a conceptual perspective: contradictory negation, opposite negation, and intermediary negation. To establish a mathematical foundation that fully reflects the intrinsic connections, properties, and laws of these different forms of negation, we introduce SCOI: sets with contradictory negation, opposite negation and intermediary negation, and LCOI: logic with contradictory negation, opposite negation and intermediary negation, and we proved the main operational properties of SCOI as well as the formal inference relations in LCOI.
[AI-32] Boosting Automatic Exercise Evaluation Through Musculoskeletal Simulation-Based IMU Data Augmentation
【Quick Read】: This paper addresses the limited data availability, class imbalance, and label ambiguity that hamper deep learning models for assessing movement quality from inertial measurement unit (IMU) data. The key to the solution is a novel data augmentation method that generates realistic IMU data by integrating musculoskeletal simulation with systematic modifications of movement trajectories, combining inverse kinematic parameters with a knowledge-based evaluation strategy to guarantee biomechanical plausibility and automatic, reliable labeling.
Link: https://arxiv.org/abs/2505.24415
Authors: Andreas Spilz, Heiko Oppel, Michael Munz
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Automated evaluation of movement quality holds significant potential for enhancing physiotherapeutic treatments and sports training by providing objective, real-time feedback. However, the effectiveness of deep learning models in assessing movements captured by inertial measurement units (IMUs) is often hampered by limited data availability, class imbalance, and label ambiguity. In this work, we present a novel data augmentation method that generates realistic IMU data using musculoskeletal simulations integrated with systematic modifications of movement trajectories. Crucially, our approach ensures biomechanical plausibility and allows for automatic, reliable labeling by combining inverse kinematic parameters with a knowledge-based evaluation strategy. Extensive evaluations demonstrate that the augmented variants closely resemble real-world data, significantly improving the classification accuracy and generalization capability of neural network models. Additionally, we highlight the benefits of augmented data for patient-specific fine-tuning scenarios, particularly when only limited subject-specific training examples are available. Our findings underline the practicality and efficacy of this augmentation method in overcoming common challenges faced by deep learning applications in physiotherapeutic exercise evaluation.
[AI-33] Mastering Massive Multi-Task Reinforcement Learning via Mixture-of-Expert Decision Transformer ICML2025
【Quick Read】: This paper addresses task-number scaling in offline multi-task reinforcement learning (MTRL): existing methods work well for limited task counts but struggle to scale to massive task sets. The key to the solution is M3DT, a mixture-of-experts (MoE) framework that improves task scalability by further unlocking the model's parameter scalability. M3DT strengthens the Decision Transformer (DT) backbone with MoE to reduce the task load on parameter subsets and introduces a three-stage training mechanism for efficient training with optimal performance. Experiments show that with more experts, M3DT both improves performance at fixed task counts and scales successfully to 160 tasks with superior performance.
Link: https://arxiv.org/abs/2505.24378
Authors: Yilun Kong, Guozheng Ma, Qi Zhao, Haoyu Wang, Li Shen, Xueqian Wang, Dacheng Tao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2025
Abstract:Despite recent advancements in offline multi-task reinforcement learning (MTRL) have harnessed the powerful capabilities of the Transformer architecture, most approaches focus on a limited number of tasks, with scaling to extremely massive tasks remaining a formidable challenge. In this paper, we first revisit the key impact of task numbers on current MTRL method, and further reveal that naively expanding the parameters proves insufficient to counteract the performance degradation as the number of tasks escalates. Building upon these insights, we propose M3DT, a novel mixture-of-experts (MoE) framework that tackles task scalability by further unlocking the model’s parameter scalability. Specifically, we enhance both the architecture and the optimization of the agent, where we strengthen the Decision Transformer (DT) backbone with MoE to reduce task load on parameter subsets, and introduce a three-stage training mechanism to facilitate efficient training with optimal performance. Experimental results show that, by increasing the number of experts, M3DT not only consistently enhances its performance as model expansion on the fixed task numbers, but also exhibits remarkable task scalability, successfully extending to 160 tasks with superior performance.
[AI-34] Adversarial Preference Learning for Robust LLM Alignment ACL2025
【Quick Read】: This paper addresses three key weaknesses of reinforcement learning from human feedback (RLHF) for aligning modern language models: the inefficiency and cost of human annotation, the vast diversity of potential adversarial attacks, and the risk of feedback bias and reward hacking. The key to the solution is Adversarial Preference Learning (APL), an iterative adversarial training method with three core innovations: a direct harmfulness metric based on the model's intrinsic preference probabilities, a conditional generative attacker that synthesizes input-specific adversarial variations, and an automated closed-loop feedback framework enabling continuous adaptation through vulnerability discovery and mitigation.
Link: https://arxiv.org/abs/2505.24369
Authors: Yuanfu Wang, Pengyu Wang, Chenyang Xi, Bo Tang, Junyi Zhu, Wenqiang Wei, Chen Chen, Chao Yang, Jingfeng Zhang, Chaochao Lu, Yijun Niu, Keming Mao, Zhiyu Li, Feiyu Xiong, Jie Hu, Mingchuan Yang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ACL 2025 Findings
Abstract:Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks due to three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model’s intrinsic preference probabilities, eliminating reliance on external assessment. Second, a conditional generative attacker that synthesizes input-specific adversarial variations. Third, an iterative framework with automated closed-loop feedback, enabling continuous adaptation through vulnerability discovery and mitigation. Experiments on Mistral-7B-Instruct-v0.3 demonstrate that APL significantly enhances robustness, achieving 83.33% harmlessness win rate over the base model (evaluated by GPT-4o), reducing harmful outputs from 5.88% to 0.43% (measured by LLaMA-Guard), and lowering attack success rate by up to 65% according to HarmBench. Notably, APL maintains competitive utility, with an MT-Bench score of 6.59 (comparable to the baseline 6.78) and an LC-WinRate of 46.52% against the base model.
[AI-35] ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration
【Quick Read】: This paper addresses the excessive memory required to store the Key-Value (KV) cache, which constrains long-context reasoning in large language models (LLMs). The key to the solution is ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache with distinct strategies for Keys and Values, reflecting their different roles and importance in attention. For Keys, Head-wise Similarity-aware Reordering (HSR) clusters similar heads and applies grouped SVD to the key projection matrix, limiting extra computation while preserving accuracy; for Values, Offline Calibration and Matrix Fusion (OCMF) preserves accuracy without additional computational overhead.
Link: https://arxiv.org/abs/2505.24357
Authors: Xianglong Yan, Zhiteng Li, Tianao Zhang, Linghe Kong, Yulun Zhang, Xiaokang Yang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have achieved remarkable performance, yet their capability on long-context reasoning is often constrained by the excessive memory required to store the Key-Value (KV) cache. This makes KV cache compression an essential step toward enabling efficient long-context reasoning. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers or suffer from significant performance degradation under high compression ratios. To address these challenges, we propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache. We develop distinct compression strategies for Keys and Values based on their different roles and varying importance in the attention mechanism. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters similar heads and applies grouped SVD to the key projection matrix, reducing additional computation while preserving accuracy. For Values, we propose Offline Calibration and Matrix Fusion (OCMF) to preserve accuracy without extra computational overhead. Experiments show that ReCalKV outperforms existing low-rank compression methods, achieving high compression ratios with minimal performance loss. Code is available at: this https URL.
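The Values-side fusion is calibration-specific, but the Keys-side idea, a low-rank factorization of the key projection so cached keys live in a smaller hidden dimension, reduces to an SVD. A minimal sketch, with the head clustering (HSR) omitted and all names ours:

```python
import torch

def lowrank_key_projection(w_k: torch.Tensor, rank: int):
    """Factor a key projection W_k (d_model x d_out) as A @ B with inner dim `rank`.

    Keys can then be cached as rank-dimensional vectors (x @ A) and expanded
    with B only when attention scores are computed.
    """
    u, s, vh = torch.linalg.svd(w_k, full_matrices=False)
    a = u[:, :rank] * s[:rank]         # (d_model, rank)
    b = vh[:rank, :]                   # (rank, d_out)
    return a, b

w = torch.randn(512, 512)
a, b = lowrank_key_projection(w, rank=128)
print((w - a @ b).norm() / w.norm())   # relative error of the rank-128 cache
```

Grouping similar heads before factorizing (as HSR does) matters because a shared low-rank basis fits a cluster of similar heads far better than one basis forced across dissimilar heads.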
[AI-36] GridRoute: A Benchmark for LLM-Based Route Planning with Cardinal Movement in Grid Environments
【速读】:该论文试图解决现有研究仅关注大型语言模型(Large Language Models, LLMs)独立推理能力、而忽视其与传统路径规划算法协同潜力的问题。其解决方案的关键在于构建综合评测基准GridRoute,并提出一种名为“思维算法”(Algorithm of Thought, AoT)的混合提示技术,通过在提示中引入传统算法的指导来增强LLMs在路径规划中的表现。
链接: https://arxiv.org/abs/2505.24306
作者: Kechen Li,Yaotian Tao,Ximing Wen,Quanwei Sun,Zifei Gong,Chang Xu,Xizhe Zhang,Tianbo Ji
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages
Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated their potential in planning and reasoning tasks, offering a flexible alternative to classical pathfinding algorithms. However, most existing studies focus on LLMs’ independent reasoning capabilities and overlook the potential synergy between LLMs and traditional algorithms. To fill this gap, we propose a comprehensive evaluation benchmark GridRoute to assess how LLMs can take advantage of traditional algorithms. We also propose a novel hybrid prompting technique called Algorithm of Thought (AoT), which introduces traditional algorithms’ guidance into prompting. Our benchmark evaluates six LLMs ranging from 7B to 72B parameters across various map sizes, assessing their performance in correctness, optimality, and efficiency in grid environments with varying sizes. Our results show that AoT significantly boosts performance across all model sizes, particularly in larger or more complex environments, suggesting a promising approach to addressing path planning challenges. Our code is open-sourced at this https URL.
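下面用一个简化示例说明AoT式混合提示的构造思路(假设性写法,非论文原始提示模板):先用经典BFS在网格上求出参考路径,再把算法轨迹写入提示,引导LLM复核并输出最终的基本方位移动序列。

```python
from collections import deque

def bfs_path(grid, start, goal):
    """在 0=可走 / 1=障碍 的网格上用 BFS 求四方向最短路径。"""
    h, w = len(grid), len(grid[0])
    prev, seen, q = {}, {start}, deque([start])
    while q:
        cur = q.popleft()
        if cur == goal:                      # 回溯还原路径
            path = [cur]
            while cur != start:
                cur = prev[cur]
                path.append(cur)
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if (0 <= nxt[0] < h and 0 <= nxt[1] < w
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in seen):
                seen.add(nxt)
                prev[nxt] = cur
                q.append(nxt)
    return None

grid = [[0, 0, 0], [1, 1, 0], [0, 0, 0]]
ref = bfs_path(grid, (0, 0), (2, 0))
prompt = (f"网格: {grid}\n参考算法(BFS)给出的路径: {ref}\n"
          "请逐步检查该路径是否合法、是否最优,并输出最终的基本方位移动序列。")
```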
zh
[AI-37] AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)训练中的系统级效率低下问题。现有同步RL系统在批量设置中通过交替生成与训练来稳定训练过程,但导致GPU利用率低,因为生成必须等待批次中最长输出完成才能进行模型更新。该论文提出的AReaL系统是一种完全异步的RL框架,其关键在于彻底解耦生成与训练过程,使生成工作进程持续生成新输出,而训练工作进程在收集到一批数据后即可更新模型,从而显著提升GPU利用率。此外,AReaL通过平衡生成与训练任务的工作负载,并采用增强过时样本处理能力的PPO变体,进一步提升了训练稳定性与效率。
链接: https://arxiv.org/abs/2505.24298
作者: Wei Fu,Jiaxuan Gao,Xujie Shen,Chen Zhu,Zhiyu Mei,Chuyi He,Shusheng Xu,Guo Wei,Jun Mei,Jiashu Wang,Tongkai Yang,Binhang Yuan,Yi Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous by alternating generation and training in a batch setting, where the rollouts in each training batch are generated by the same (or latest) model. This stabilizes RL training but suffers from severe system-level inefficiency. Generation must wait until the longest output in the batch is completed before model update, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.57× training speedup compared to the best synchronous systems with the same number of GPUs and matched or even improved final performance. The code of AReaL is available at this https URL.
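“生成与训练完全解耦”的思路可以用线程加队列给出一个最小示意(假设性实现,仅示意框架级机制,省略真实的模型与PPO更新):生成线程不停产出带模型版本号的rollout,训练线程攒满一个batch就更新,并丢弃版本差过大的样本以控制数据陈旧度。

```python
import queue
import threading
import time

ROLLOUTS = queue.Queue(maxsize=256)
MODEL_VERSION = 0
MAX_STALENESS = 2          # 允许的最大版本差,对应数据陈旧度控制

def rollout_worker():
    """生成线程:不等待训练,持续产出带版本号的 rollout。"""
    while True:
        ROLLOUTS.put({"text": "...", "version": MODEL_VERSION})

def train_worker(batch_size=32):
    """训练线程:攒满一个 batch 即更新,丢弃过旧样本。"""
    global MODEL_VERSION
    while True:
        batch = []
        while len(batch) < batch_size:
            s = ROLLOUTS.get()
            if MODEL_VERSION - s["version"] <= MAX_STALENESS:
                batch.append(s)
        # 此处应执行一步 staleness-aware PPO 更新(省略)
        MODEL_VERSION += 1

threading.Thread(target=rollout_worker, daemon=True).start()
threading.Thread(target=train_worker, daemon=True).start()
time.sleep(0.1)  # 演示:让两个线程跑一小段时间
```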
zh
[AI-38] Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion
【速读】:该论文旨在解决零样本语音转换(zero-shot voice conversion)中难以准确复制源说话人的表达风格或模仿目标说话人独特表达风格的问题,这一不足限制了语音转换的可控性。其解决方案的关键在于提出一种名为Discl-VC的新框架,该框架从自监督语音表示中分离内容与韵律信息,并利用流匹配Transformer进行上下文学习以合成目标说话人的语音。为实现对生成语音韵律的精确控制,引入了一种掩码生成Transformer,该模型基于提示以非自回归方式预测离散的韵律标记。
链接: https://arxiv.org/abs/2505.24291
作者: Kaidi Wang,Wenhao Guan,Ziyue Jiang,Hukai Huang,Peijie Chen,Weijie Wu,Qingyang Hong,Lin Li
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Currently, zero-shot voice conversion systems are capable of synthesizing the voice of unseen speakers. However, most existing approaches struggle to accurately replicate the speaking style of the source speaker or mimic the distinctive speaking style of the target speaker, thereby limiting the controllability of voice conversion. In this work, we propose Discl-VC, a novel voice conversion framework that disentangles content and prosody information from self-supervised speech representations and synthesizes the target speaker’s voice through in-context learning with a flow matching transformer. To enable precise control over the prosody of generated speech, we introduce a mask generative transformer that predicts discrete prosody tokens in a non-autoregressive manner based on prompts. Experimental results demonstrate the superior performance of Discl-VC in zero-shot voice conversion and its remarkable accuracy in prosody control for synthesized speech.
zh
[AI-39] How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning
【速读】:该论文试图解决在大型语言模型(Large Language Models, LLMs)中如何有效提升其推理能力的问题,特别是针对需要长链式思维(Chain-of-Thought, CoT)和回溯(backtracking)能力的复杂任务。其解决方案的关键在于系统性地研究监督微调(Supervised Fine-tuning, SFT)与强化学习(Reinforcement Learning, RL)之间的动态关系,并通过构造具有不同回溯步骤的合成数据集,分析回溯在推理过程中的作用。研究发现,较长的CoT序列结合回溯能够显著提升RL训练效果,且在更复杂的任务中,SFT阶段需要更多的回溯次数以增强模型的推理能力。此外,实验还表明,RL训练更关注结构模式而非内容正确性,这为设计高效的LLMs训练策略提供了重要依据。
链接: https://arxiv.org/abs/2505.24273
作者: Hongyi James Cai,Junlin Wang,Xiaoyin Chen,Bhuwan Dhingra
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent breakthroughs in large language models (LLMs) have effectively improved their reasoning abilities, particularly on mathematical and logical problems that have verifiable answers, through techniques such as supervised finetuning (SFT) and reinforcement learning (RL). Prior research indicates that RL effectively internalizes search strategies, enabling long chain-of-thought (CoT) reasoning, with backtracking emerging naturally as a learned capability. However, the precise benefits of backtracking, specifically, how significantly it contributes to reasoning improvements and the optimal extent of its use, remain poorly understood. In this work, we systematically investigate the dynamics between SFT and RL on eight reasoning tasks: Countdown, Sudoku, Arc 1D, Geometry, Color Cube Rotation, List Functions, Zebra Puzzles, and Self Reference. Our findings highlight that short CoT sequences used in SFT as a warm-up do have moderate contribution to RL training, compared with cold-start RL; however such contribution diminishes when tasks become increasingly difficult. Motivated by this observation, we construct synthetic datasets varying systematically in the number of backtracking steps and conduct controlled experiments to isolate the influence of either the correctness (content) or the structure (i.e., backtrack frequency). We find that (1) longer CoT with backtracks generally induce better and more stable RL training, (2) more challenging problems with larger search space tend to need higher numbers of backtracks during the SFT stage. Additionally, we demonstrate through experiments on distilled data that RL training is largely unaffected by the correctness of long CoT sequences, suggesting that RL prioritizes structural patterns over content correctness. Collectively, our results offer practical insights into designing optimal training strategies to effectively scale reasoning in LLMs.
zh
[AI-40] INSIGHT: A Survey of In-Network Systems for Intelligent High-Efficiency AI and Topology Optimization
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)工作负载对网络基础设施日益增长的需求问题,通过在网计算(in-network computation)提升网络性能。其解决方案的关键在于利用网络设备(如交换机、路由器和网络接口卡)的处理能力,在网络内部直接执行AI计算任务,从而降低延迟、提高吞吐量并优化资源利用率。论文还探讨了可编程网络架构(如软件定义网络SDN和可编程数据平面PDP)与AI的融合,以及如何在资源受限的网络设备上高效部署AI模型,包括算法设计和模型压缩技术,以应对内存和计算能力的限制。
链接: https://arxiv.org/abs/2505.24269
作者: Aleksandr Algazinov,Joydeep Chandra,Matt Laing
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:
Abstract:In-network computation represents a transformative approach to addressing the escalating demands of Artificial Intelligence (AI) workloads on network infrastructure. By leveraging the processing capabilities of network devices such as switches, routers, and Network Interface Cards (NICs), this paradigm enables AI computations to be performed directly within the network fabric, significantly reducing latency, enhancing throughput, and optimizing resource utilization. This paper provides a comprehensive analysis of optimizing in-network computation for AI, exploring the evolution of programmable network architectures, such as Software-Defined Networking (SDN) and Programmable Data Planes (PDPs), and their convergence with AI. It examines methodologies for mapping AI models onto resource-constrained network devices, addressing challenges like limited memory and computational capabilities through efficient algorithm design and model compression techniques. The paper also highlights advancements in distributed learning, particularly in-network aggregation, and the potential of federated learning to enhance privacy and scalability. Frameworks like Planter and Quark are discussed for simplifying development, alongside key applications such as intelligent network monitoring, intrusion detection, traffic management, and Edge AI. Future research directions, including runtime programmability, standardized benchmarks, and new applications paradigms, are proposed to advance this rapidly evolving field. This survey underscores the potential of in-network AI to create intelligent, efficient, and responsive networks capable of meeting the demands of next-generation AI applications.
zh
[AI-41] Generative AI for Urban Design: A Stepwise Approach Integrating Human Expertise with Multimodal Diffusion Models
【速读】:该论文试图解决现有生成式人工智能(Generative AI)在城市设计中与人类设计流程集成不足的问题,特别是在端到端管道中控制有限,忽视了现实世界设计的迭代特性。解决方案的关键在于提出一种分步生成的城市设计框架,该框架将多模态扩散模型与人类专业知识相结合,将设计过程划分为三个关键阶段:道路网络与土地利用规划、建筑布局规划以及详细规划与渲染,从而实现更灵活和可控的设计流程。
链接: https://arxiv.org/abs/2505.24260
作者: Mingyi He,Yuebing Liang,Shenhao Wang,Yunhan Zheng,Qingyi Wang,Dingyi Zhuang,Li Tian,Jinhua Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Urban design is a multifaceted process that demands careful consideration of site-specific constraints and collaboration among diverse professionals and stakeholders. The advent of generative artificial intelligence (GenAI) offers transformative potential by improving the efficiency of design generation and facilitating the communication of design ideas. However, most existing approaches are not well integrated with human design workflows. They often follow end-to-end pipelines with limited control, overlooking the iterative nature of real-world design. This study proposes a stepwise generative urban design framework that integrates multimodal diffusion models with human expertise to enable more adaptive and controllable design processes. Instead of generating design outcomes in a single end-to-end process, the framework divides the process into three key stages aligned with established urban design workflows: (1) road network and land use planning, (2) building layout planning, and (3) detailed planning and rendering. At each stage, multimodal diffusion models generate preliminary designs based on textual prompts and image-based constraints, which can then be reviewed and refined by human designers. We design an evaluation framework to assess the fidelity, compliance, and diversity of the generated designs. Experiments using data from Chicago and New York City demonstrate that our framework outperforms baseline models and end-to-end approaches across all three dimensions. This study underscores the benefits of multimodal diffusion models and stepwise generation in preserving human control and facilitating iterative refinements, laying the groundwork for human-AI interaction in urban design solutions.
zh
[AI-42] FABLE: A Novel Data-Flow Analysis Benchmark on Procedural Text for Large Language Model Evaluation
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在数据流推理(data-flow reasoning)能力上的系统性评估问题,即如何衡量LLMs对数据在程序中移动、转换和持久化的理解能力。解决方案的关键在于提出FABLE,一个可扩展的基准测试框架,通过结构化、过程性的文本评估LLMs的数据流理解能力。FABLE借鉴了软件工程中的八种经典数据流分析方法,并将其应用于烹饪食谱、旅行路线和自动化计划三个现实领域,构建了包含2,400个问答对的基准数据集,为LLMs的数据流推理能力提供了系统的评估手段。
链接: https://arxiv.org/abs/2505.24258
作者: Vishal Pallagani,Nitin Gupta,John Aydin,Biplav Srivastava
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding how data moves, transforms, and persists, known as data flow, is fundamental to reasoning in procedural tasks. Despite their fluency in natural and programming languages, large language models (LLMs), although increasingly being applied to decisions with procedural tasks, have not been systematically evaluated for their ability to perform data-flow reasoning. We introduce FABLE, an extensible benchmark designed to assess LLMs’ understanding of data flow using structured, procedural text. FABLE adapts eight classical data-flow analyses from software engineering: reaching definitions, very busy expressions, available expressions, live variable analysis, interval analysis, type-state analysis, taint analysis, and concurrency analysis. These analyses are instantiated across three real-world domains: cooking recipes, travel routes, and automated plans. The benchmark includes 2,400 question-answer pairs, with 100 examples for each domain-analysis combination. We evaluate three types of LLMs: a reasoning-focused model (DeepSeek-R1 8B), a general-purpose model (LLaMA 3.1 8B), and a code-specific model (Granite Code 8B). Each model is tested using majority voting over five sampled completions per prompt. Results show that the reasoning model achieves higher accuracy, but at the cost of over 20 times slower inference compared to the other models. In contrast, the general-purpose and code-specific models perform close to random chance. FABLE provides the first diagnostic benchmark to systematically evaluate data-flow reasoning and offers insights for developing models with stronger procedural understanding.
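以基准改编的八种分析中最经典的“到达定值(reaching definitions)”为例,下面是在线性步骤序列上做该分析的极简实现(假设性示例,将食谱步骤抽象为“定义/使用变量”的指令):

```python
def reaching_definitions(steps):
    """steps: [(step_id, defines, uses)]。线性控制流下,
    记录每一步所使用变量的最近定义来自哪一步(gen/kill 的简化)。"""
    reaching, result = {}, {}
    for sid, defines, uses in steps:
        result[sid] = {v: reaching.get(v) for v in uses}
        for v in defines:
            reaching[v] = sid          # 新定义覆盖旧定义
    return result

recipe = [
    ("s1", {"dough"}, set()),           # 和面:定义 dough
    ("s2", {"sauce"}, set()),           # 调酱:定义 sauce
    ("s3", {"dough"}, {"dough"}),       # 擀面:使用并重新定义 dough
    ("s4", set(), {"dough", "sauce"}),  # 组装:同时使用两者
]
print(reaching_definitions(recipe)["s4"])  # {'dough': 's3', 'sauce': 's2'}
```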
zh
[AI-43] A Reward-driven Automated Webshell Malicious-code Generator for Red-teaming
【速读】:该论文旨在解决当前恶意代码数据集在混淆方法上缺乏系统分类以及现有恶意代码生成方法依赖提示工程导致生成内容多样性不足和冗余度高的问题。其解决方案的关键在于提出一种基于奖励驱动的自动化WebShell恶意代码生成器(RAWG),通过将常见数据集中的WebShell样本按混淆类型进行分类,利用大语言模型提取并标准化关键标记以构建高质量语料库,并在此基础上进行监督微调和强化学习优化,从而生成多样且高度混淆的恶意负载。
链接: https://arxiv.org/abs/2505.24252
作者: Yizhong Ding
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Frequent cyber-attacks have elevated WebShell exploitation and defense to a critical research focus within network security. However, there remains a significant shortage of publicly available, well-categorized malicious-code datasets organized by obfuscation method. Existing malicious-code generation methods, which primarily rely on prompt engineering, often suffer from limited diversity and high redundancy in the payloads they produce. To address these limitations, we propose RAWG, a Reward-driven Automated Webshell Malicious-code Generator designed for red-teaming applications. Our approach begins by categorizing webshell samples from common datasets into seven distinct types of obfuscation. We then employ a large language model (LLM) to extract and normalize key tokens from each sample, creating a standardized, high-quality corpus. Using this curated dataset, we perform supervised fine-tuning (SFT) on an open-source large model to enable the generation of diverse, highly obfuscated webshell malicious payloads. To further enhance generation quality, we apply Proximal Policy Optimization (PPO), treating malicious-code samples as “chosen” data and benign code as “rejected” data during reinforcement learning. Extensive experiments demonstrate that RAWG significantly outperforms current state-of-the-art methods in both payload diversity and escape effectiveness.
zh
[AI-44] Dynamic Malware Classification of Windows PE Files using CNNs and Greyscale Images Derived from Runtime API Call Argument Conversion
【速读】:该论文试图解决传统静态分析在应对多态性(polymorphic)和变形性(metamorphic)恶意软件时效果不佳的问题,因为这些恶意软件通过改变外观而不改变行为来逃避基于代码结构的检测。解决方案的关键在于提出一种动态恶意软件分类框架,通过在Windows可移植可执行文件(PE)运行时提取API调用参数,并将动态特征编码为时间模式,进而生成灰度图像以增强特征表示,最后利用卷积神经网络(CNN)进行恶意软件的准确分类。
链接: https://arxiv.org/abs/2505.24231
作者: Md Shahnawaz,Bishwajit Prasad Gond,Durga Prasad Mohapatra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Malware detection and classification remains a topic of concern for cybersecurity, since it is becoming common for attackers to use advanced obfuscation on their malware to stay undetected. Conventional static analysis is not effective against polymorphic and metamorphic malware as these change their appearance without modifying their behavior, thus defying the analysis by code structure alone. This makes it important to use dynamic detection that monitors malware behavior at runtime. In this paper, we present a dynamic malware categorization framework that extracts API argument calls at the runtime execution of Windows Portable Executable (PE) files. Extracting and encoding the dynamic features of API names, argument return values, and other relative features, we convert raw behavioral data to temporal patterns. To enhance feature portrayal, the generated patterns are subsequently converted into grayscale pictures using a magma colormap. These improved photos are used to teach a Convolutional Neural Network (CNN) model discriminative features, which allows for reliable and accurate malware classification. Results from experiments indicate that our method, with an average accuracy of 98.36% is effective in classifying different classes of malware and benign by integrating dynamic analysis and deep learning. It not only achieves high classification accuracy but also demonstrates significant resilience against typical evasion strategies.
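下面示意“将API调用序列编码为灰度图并套用magma颜色映射”的核心步骤(假设性实现;此处仅用API名称的哈希值做简化编码,真实方法还融合了参数与返回值等动态特征):

```python
import numpy as np
from matplotlib import cm

def calls_to_image(api_calls, size=64):
    """把 API 名称序列映射为 [0,1] 灰度值,重排成 size×size 图像后上 magma 色。"""
    # 示意用的哈希编码;Python 的 hash 逐次运行不稳定,生产中应换成稳定编码
    vals = np.array([hash(name) % 256 for name in api_calls], dtype=np.float32) / 255.0
    gray = np.resize(vals, size * size).reshape(size, size)  # 截断或循环填充
    rgb = cm.magma(gray)[..., :3]                            # magma 颜色映射 -> RGB
    return gray, rgb

gray, rgb = calls_to_image(["CreateFileW", "WriteFile", "RegSetValueW"] * 500)
# gray / rgb 即可作为 CNN 的输入(按需转为张量并归一化)
```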
zh
[AI-45] ProofNet++: A Neuro-Symbolic System for Formal Proof Verification with Self-Correction
【速读】:该论文试图解决基于大语言模型(Large Language Model, LLM)的自动定理证明系统中存在的幻觉逻辑步骤和不可验证推理的问题。其解决方案的关键在于提出ProofNet++,一个结合了神经符号方法的框架,通过集成符号化证明树监督、以验证器作为奖励函数的强化学习循环以及迭代自校正模块,有效提升了证明的准确性、正确性和形式化可验证性。
链接: https://arxiv.org/abs/2505.24230
作者: Murari Ambati
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures
Abstract:We propose ProofNet++, a neuro-symbolic framework that enhances automated theorem proving by combining large language models (LLMs) with formal proof verification and self-correction mechanisms. Current LLM-based systems suffer from hallucinated logical steps and unverifiable reasoning. ProofNet++ mitigates these limitations by integrating symbolic proof tree supervision, a reinforcement learning loop using verifiers as reward functions, and an iterative self-correction module. Our experiments on miniF2F, Lean’s mathlib, and HOL Light show that ProofNet++ significantly improves proof accuracy, correctness, and formal verifiability over prior models. We provide theoretical analysis of the convergence and stability of the verifier-guided RL framework and release our datasets and codebase for future research.
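其中“以验证器作为奖励函数”的强化学习循环可以示意如下(假设性骨架;verify_step 接口与部分奖励的设计均为示例,真实系统应调用 Lean 或 HOL 等形式验证器):

```python
def verifier_reward(proof_steps, verify_step):
    """逐步调用形式验证器:在第 k 步首次失败给 k/N 的部分奖励,全部通过得 1。"""
    for k, step in enumerate(proof_steps):
        if not verify_step(step):
            return k / max(1, len(proof_steps))
    return 1.0

# 玩具示例:用一个假“验证器”打分;该奖励随后用于策略梯度更新,
# 失败的步骤则交给自校正模块重写
reward = verifier_reward(["rw [add_comm]", "exact h"], lambda s: "exact" in s)
print(reward)  # 第一步未通过玩具检查 -> 0.0
```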
zh
[AI-46] E2GraphRAG: Streamlining Graph-based RAG for High Efficiency and Effectiveness
【速读】:该论文旨在解决传统基于图的检索增强生成(Graph-based RAG)方法在效率和灵活性方面的不足,尤其是其低效的索引过程和对人工预定义查询模式的依赖,这限制了实际应用。解决方案的关键在于提出E^2GraphRAG框架,该框架通过结合大语言模型构建摘要树以及使用SpaCy构建实体图,并建立实体与文档块之间的双向索引,从而实现高效的本地和全局检索,同时采用自适应检索策略以优化检索模式的选择。
链接: https://arxiv.org/abs/2505.24226
作者: Yibo Zhao,Jiapeng Zhu,Ye Guo,Kangkang He,Xiang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages
Abstract:Graph-based RAG methods like GraphRAG have shown promising global understanding of the knowledge base by constructing hierarchical entity graphs. However, they often suffer from inefficiency and rely on manually pre-defined query modes, limiting practical use. In this paper, we propose E^2GraphRAG, a streamlined graph-based RAG framework that improves both Efficiency and Effectiveness. During the indexing stage, E^2GraphRAG constructs a summary tree with large language models and an entity graph with SpaCy based on document chunks. We then construct bidirectional indexes between entities and chunks to capture their many-to-many relationships, enabling fast lookup during both local and global retrieval. For the retrieval stage, we design an adaptive retrieval strategy that leverages the graph structure to retrieve and select between local and global modes. Experiments show that E^2GraphRAG achieves up to 10 times faster indexing than GraphRAG and 100 times speedup over LightRAG in retrieval while maintaining competitive QA performance.
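下面是“实体-文本块双向索引”构建过程的最小示意(假设性实现;使用spaCy的命名实体识别抽取实体,en_core_web_sm为示例模型,需预先安装):

```python
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")  # 示例 NER 模型,需预先安装

def build_bidirectional_index(chunks):
    """chunks: 文本块列表。返回 实体->块id 与 块id->实体 两个索引,
    刻画实体与文本块之间的多对多关系。"""
    entity_to_chunks = defaultdict(set)
    chunk_to_entities = defaultdict(set)
    for cid, text in enumerate(chunks):
        for ent in nlp(text).ents:
            entity_to_chunks[ent.text].add(cid)
            chunk_to_entities[cid].add(ent.text)
    return entity_to_chunks, chunk_to_entities

e2c, c2e = build_bidirectional_index([
    "Alan Turing worked at Bletchley Park.",
    "Bletchley Park housed the Government Code and Cypher School.",
])
print(e2c["Bletchley Park"])  # 若 NER 识别成功,两个块都命中 -> {0, 1}
```

有了双向索引,本地检索可以从查询命中的实体快速取回相关块,全局检索则可沿块-实体-块的关系扩展,无需重新扫描全文。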
zh
[AI-47] Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在生成输出时的安全性问题,特别是其安全性相较于作为主干的大语言模型(Large Language Models, LLMs)显著下降的现象。研究发现,即使输入空白或无关图像,VLMs也可能对原本在纯文本环境下会被拒绝的提示生成有害响应。论文的核心贡献在于揭示了模态间隙(modality gap)大小与VLMs安全性之间的高度负相关关系,并提出一种在预训练阶段减少模态间隙的正则化方法,从而有效提升VLMs的安全对齐程度,同时不损害模型性能。
链接: https://arxiv.org/abs/2505.24208
作者: Wenhan Yang,Spencer Stice,Ali Payani,Baharan Mirzasoleiman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Ensuring Vision-Language Models (VLMs) generate safe outputs is crucial for their reliable deployment. However, LVLMs suffer from drastic safety degradation compared to their LLM backbone. Even blank or irrelevant images can trigger LVLMs to generate harmful responses to prompts that would otherwise be refused in text-only contexts. The modality gap between image and text representations has been recently hypothesized to contribute to safety degradation of LVLMs. However, if and how the amount of modality gap affects LVLMs’ safety is not studied. In this work, we show that the amount of modality gap is highly inversely correlated with VLMs’ safety. Then, we show that this modality gap is introduced during pretraining LVLMs and persists through fine-tuning. Inspired by this observation, we propose a regularization to reduce the modality gap during pretraining. Our extensive experiments on LLaVA v1.5, ShareGPT4V, and MiniGPT-4 show that our method substantially improves safety alignment of LVLMs, reducing unsafe rate by up to 16.3% without compromising performance, and can further boost existing defenses by up to 18.2%.
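在预训练损失中加入缩小模态间隙的正则项,可以写成如下假设性的示意(以批内两模态嵌入质心的距离近似模态间隙;具体度量与系数λ均为示例,未必与论文一致):

```python
import torch
import torch.nn.functional as F

def modality_gap_loss(img_emb, txt_emb):
    """img_emb/txt_emb: (B, D) 批内图像/文本嵌入。
    用归一化后两模态质心的欧氏距离近似模态间隙。"""
    img_c = F.normalize(img_emb, dim=-1).mean(dim=0)
    txt_c = F.normalize(txt_emb, dim=-1).mean(dim=0)
    return (img_c - txt_c).norm()

img_emb, txt_emb = torch.randn(16, 512), torch.randn(16, 512)
task_loss = torch.tensor(0.0)  # 占位:原有的图文对齐预训练损失
total_loss = task_loss + 0.1 * modality_gap_loss(img_emb, txt_emb)  # λ=0.1 为示例
```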
zh
[AI-48] SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems
【速读】:该论文试图解决基于大型语言模型(Large Language Model, LLM)的多智能体系统(Multi-Agent System, MAS)在安全性和可靠性方面面临的新挑战,包括提示操纵、不安全工具使用以及智能体间涌现的协调失误等问题。现有防护机制仅能在输入输出层面提供部分保护,难以应对系统性或多点故障。其解决方案的关键在于提出一个面向MAS的系统级异常检测框架,该框架结合结构建模与运行时行为监控,包含两个核心组件:一是基于图的动态执行图框架,用于在节点、边和路径层面进行语义异常检测;二是可插拔的SentinelAgent,作为由LLM驱动的监督智能体,根据安全策略和上下文推理对MAS执行进行观察、分析和干预。该方法通过将抽象检测逻辑与可操作的执行相结合,能够检测单点故障、提示注入、多智能体共谋及潜在的利用路径。
链接: https://arxiv.org/abs/2505.24201
作者: Xu He,Di Wu,Yan Zhai,Kun Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rise of large language model (LLM)-based multi-agent systems (MAS) introduces new security and reliability challenges. While these systems show great promise in decomposing and coordinating complex tasks, they also face multi-faceted risks across prompt manipulation, unsafe tool usage, and emergent agent miscoordination. Existing guardrail mechanisms offer only partial protection, primarily at the input-output level, and fall short in addressing systemic or multi-point failures in MAS. In this work, we present a system-level anomaly detection framework tailored for MAS, integrating structural modeling with runtime behavioral oversight. Our approach consists of two components. First, we propose a graph-based framework that models agent interactions as dynamic execution graphs, enabling semantic anomaly detection at node, edge, and path levels. Second, we introduce a pluggable SentinelAgent, an LLM-powered oversight agent that observes, analyzes, and intervenes in MAS execution based on security policies and contextual reasoning. By bridging abstract detection logic with actionable enforcement, our method detects not only single-point faults and prompt injections but also multi-agent collusion and latent exploit paths. We validate our framework through two case studies, including an email assistant and Microsoft’s Magentic-One system, demonstrating its ability to detect covert risks and provide explainable root-cause attribution. Our work lays the foundation for more trustworthy, monitorable, and secure agent-based AI ecosystems.
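“将智能体交互建模为动态执行图并做边级/路径级检查”的骨架可示意如下(假设性实现;ALLOWED_EDGES 策略表与智能体名称均为虚构示例):

```python
import networkx as nx

ALLOWED_EDGES = {("planner", "coder"), ("coder", "executor"), ("planner", "executor")}

class ExecutionGraph:
    """把智能体间的消息记录为有向执行图,并做边级与路径级策略检查。"""
    def __init__(self):
        self.g = nx.DiGraph()

    def record(self, src, dst, message):
        self.g.add_edge(src, dst, message=message)
        if (src, dst) not in ALLOWED_EDGES:      # 边级检查:白名单之外即告警
            return f"anomalous edge: {src} -> {dst}"
        return None

    def path_check(self, source, sensitive):
        """路径级检查:外部输入节点是否可达敏感节点(潜在利用路径)。"""
        if source in self.g and sensitive in self.g:
            return nx.has_path(self.g, source, sensitive)
        return False

eg = ExecutionGraph()
print(eg.record("planner", "coder", "write a script"))   # None,合法边
print(eg.record("coder", "mailer", "send credentials"))  # 触发异常边告警
```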
zh
[AI-49] Learning API Functionality from Demonstrations for Tool-based Agents
【速读】:该论文试图解决在缺乏或无法获取完整、准确API文档的情况下,基于数字工具的智能体(Digital tool-based agents)难以可靠地理解和使用API功能的问题。其解决方案的关键在于通过直接从演示(demonstrations)中学习API功能,而非依赖文档。研究者通过收集专家API代理和自我探索生成的演示数据,并分析演示数量、LLM生成的摘要与评估对任务成功率的影响,验证了该方法的有效性。实验表明,提供明确的函数调用和自然语言评价能够显著提升任务成功率,但即便对于最先进的大语言模型(LLMs),从演示中学习API功能仍是一个具有挑战性的任务。
链接: https://arxiv.org/abs/2505.24197
作者: Bhrij Patel,Ashish Jagmohan,Aditya Vempaty
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 Pages, 13 Figures, 5 Tables
Abstract:Digital tool-based agents that invoke external Application Programming Interfaces (APIs) often rely on documentation to understand API functionality. However, such documentation is frequently missing, outdated, privatized, or inconsistent-hindering the development of reliable, general-purpose agents. In this work, we propose learning API functionality directly from demonstrations as a new paradigm applicable in scenarios without documentation. Using existing API benchmarks, we collect demonstrations from both expert API-based agents and from self-exploration. To understand what information demonstrations must convey for successful task completion, we extensively study how the number of demonstrations and the use of LLM-generated summaries and evaluations affect the task success rate of the API-based agent. Our experiments across 3 datasets and 5 models show that learning functionality from demonstrations remains a non-trivial challenge, even for state-of-the-art LLMs. We find that providing explicit function calls and natural language critiques significantly improves the agent’s task success rate due to more accurate parameter filling. We analyze failure modes, identify sources of error, and highlight key open challenges for future work in documentation-free, self-improving, API-based agents.
zh
[AI-50] SCOUT: Teaching Pre-trained Language Models to Enhance Reasoning via Flow Chain-of-Thought
【速读】:该论文试图解决基于Chain of Thought (CoT)的推理方法在可扩展性和泛化能力上的局限性,以及现有递归推理方法对昂贵预训练的依赖和缺乏系统性演化框架的问题。其解决方案的关键在于提出Flow Chain of Thought (Flow CoT),通过将递归推理建模为潜在认知状态的渐进轨迹,使每一轮推理成为独立的认知阶段,从而在不依赖人工监督的情况下深化推理过程。为实现这一目标,论文进一步提出了SCOUT(Stepwise Cognitive Optimization Using Teachers)轻量级微调框架,通过渐进式蒸馏和基于交叉注意力的回顾模块,使模型在不进行预训练的情况下实现Flow CoT风格的推理。
链接: https://arxiv.org/abs/2505.24181
作者: Guanghao Li,Wenhao Jiang,Mingfeng Chen,Yan Li,Hao Yu,Shuting Dong,Tao Ren,Ming Tang,Chun Yuan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Chain of Thought (CoT) prompting improves the reasoning performance of large language models (LLMs) by encouraging step by step thinking. However, CoT-based methods depend on intermediate reasoning steps, which limits scalability and generalization. Recent work explores recursive reasoning, where LLMs reuse internal layers across iterations to refine latent representations without explicit CoT supervision. While promising, these approaches often require costly pretraining and lack a principled framework for how reasoning should evolve across iterations. We address this gap by introducing Flow Chain of Thought (Flow CoT), a reasoning paradigm that models recursive inference as a progressive trajectory of latent cognitive states. Flow CoT frames each iteration as a distinct cognitive stage deepening reasoning across iterations without relying on manual supervision. To realize this, we propose SCOUT (Stepwise Cognitive Optimization Using Teachers), a lightweight fine tuning framework that enables Flow CoT style reasoning without the need for pretraining. SCOUT uses progressive distillation to align each iteration with a teacher of appropriate capacity, and a cross attention based retrospective module that integrates outputs from previous iterations while preserving the models original computation flow. Experiments across eight reasoning benchmarks show that SCOUT consistently improves both accuracy and explanation quality, achieving up to 1.8% gains under fine tuning. Qualitative analyses further reveal that SCOUT enables progressively deeper reasoning across iterations refining both belief formation and explanation granularity. These results not only validate the effectiveness of SCOUT, but also demonstrate the practical viability of Flow CoT as a scalable framework for enhancing reasoning in LLMs.
zh
[AI-51] SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在长序列推理过程中,由于自注意力模块的二次时间复杂度导致的预填充阶段性能瓶颈问题。其解决方案的关键在于提出一种细粒度稀疏注意力方法(SALE),通过4位量化查询-键乘积实现快速且精确的注意力权重估计,并结合块稀疏注意力加速计算,同时采用相对注意力得分指标进行查询-键对的重要性评估,从而在保持模型精度的前提下显著提升长序列处理效率。
链接: https://arxiv.org/abs/2505.24179
作者: Xiaodong Ji,Hailin Zhang,Fangcheng Fu,Bin Cui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Many advanced Large Language Model (LLM) applications require long-context processing, but the self-attention module becomes a bottleneck during the prefilling stage of inference due to its quadratic time complexity with respect to sequence length. Existing sparse attention methods accelerate attention computation by skipping less significant regions of the attention map. However, these approaches typically perform coarse-grained inspection of the attention map, rendering considerable loss in model accuracy. In this paper, we propose SALE, a fine-grained sparse attention method that accelerates the long-context prefilling stage of LLM with negligible loss in model accuracy. SALE achieves fast and accurate fine-grained attention weight estimation through 4-bit quantized query-key products, followed by block-sparse attention to accelerate prefilling computations. For importance evaluation for query-key pairs, we adopt our Relative Attention Score metric, which offers significantly higher efficiency within our framework. We implement a custom CUDA kernel optimized for our approach for hardware efficiency, reducing the additional overhead to approximately 11% of the full attention latency. Notably, SALE requires no parameter training and can be seamlessly integrated into existing systems with trivial code modifications. Experiments on long-context benchmarks demonstrate that our method outperforms existing approaches in accuracy-efficiency trade-offs, achieving at least 3.36x speedups on Llama-3.1-8B for sequences longer than 64K while maintaining model quality.
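下面用numpy给出“低比特量化Q/K后估计注意力权重,再按块保留显著区域”的简化示意(假设性实现;省略CUDA内核、softmax尺度与论文中相对注意力得分的具体定义):

```python
import numpy as np

def quantize_4bit(x):
    """逐行对称量化到 [-8, 7] 的整数(4 bit),返回量化值与缩放因子。"""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0 + 1e-8
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def select_blocks(Q, K, block=64, keep_ratio=0.2):
    """用低比特 QK^T 估计每个 (query块, key块) 的显著性,保留 top 块。"""
    qq, qs = quantize_4bit(Q)
    kq, ks = quantize_4bit(K)
    approx = (qq * qs) @ (kq * ks).T                  # 近似的注意力打分
    nb = approx.shape[0] // block
    scores = approx[:nb*block, :nb*block].reshape(nb, block, nb, block).max(axis=(1, 3))
    k = max(1, int(keep_ratio * nb * nb))
    idx = np.argsort(scores, axis=None)[-k:]          # 显著块的扁平索引
    return np.stack(np.unravel_index(idx, scores.shape), axis=1)

Q, K = np.random.randn(256, 64), np.random.randn(256, 64)
print(select_blocks(Q, K))  # 每行是一个保留的 (q块, k块) 坐标
```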
zh
[AI-52] Invariant Link Selector for Spatial-Temporal Out-of-Distribution Problem AISTATS2025
【速读】:该论文旨在解决基础模型时代下,由于训练环境与测试环境之间的数据偏差(即Out-of-Distribution, OOD问题)导致的AI泛化能力受限问题,特别是针对关系型数据(如图结构数据)在违背独立同分布(Independent and Identically Distributed, IID)假设的情况下,尤其是在时间序列相关场景中,泛化能力更加薄弱的问题。其解决方案的关键在于利用信息瓶颈(Information Bottleneck, IB)方法,提出一种误差有界且能够区分不变成分与变化成分的不变链接选择器(Invariant Link Selector),从而在训练过程中提取对标签最具不变性和代表性的图结构组件,提升深度学习模型在不同测试场景下的泛化能力。此外,通过引入任务相关的损失函数(如时间链接预测),使预训练模型能够有效解决实际应用任务,如引用推荐和商品推荐。
链接: https://arxiv.org/abs/2505.24178
作者: Katherine Tieu,Dongqi Fu,Jun Wu,Jingrui He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by AISTATS 2025. 22 pages, 2 figures, 6 tables
Abstract:In the era of foundation models, Out-of- Distribution (OOD) problems, i.e., the data discrepancy between the training environments and testing environments, hinder AI generalization. Further, relational data like graphs disobeying the Independent and Identically Distributed (IID) condition makes the problem more challenging, especially much harder when it is associated with time. Motivated by this, to realize the robust invariant learning over temporal graphs, we want to investigate what components in temporal graphs are most invariant and representative with respect to labels. With the Information Bottleneck (IB) method, we propose an error-bounded Invariant Link Selector that can distinguish invariant components and variant components during the training process to make the deep learning model generalizable for different testing scenarios. Besides deriving a series of rigorous generalizable optimization functions, we also equip the training with task-specific loss functions, e.g., temporal link prediction, to make pretrained models solve real-world application tasks like citation recommendation and merchandise recommendation, as demonstrated in our experiments with state-of-the-art (SOTA) methods. Our code is available at this https URL.
zh
[AI-53] Don’t Just Follow MLLM Plans: Robust and Efficient Planning for Open-world Agents
【速读】:该论文试图解决在不可预测、交互式环境中,自主代理如何有效掌握复杂多步骤任务的问题,特别是现有方法依赖于有缺陷的内部知识或不现实的环境假设,且缺乏从零开始在真实场景中获取规划知识的能力。解决方案的关键在于提出一种名为REPOA(Robust and Efficient Planning for Open-world Agents)的新框架,其核心组件包括自适应依赖学习与细粒度故障感知操作记忆,以增强对知识不准确性的鲁棒性,以及基于难度的探索机制,以提升学习效率。
链接: https://arxiv.org/abs/2505.24157
作者: Seungjoon Lee,Suhwan Kim,Minhyeon Oh,Youngsik Yoon,Jungseul Ok
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Developing autonomous agents capable of mastering complex, multi-step tasks in unpredictable, interactive environments presents a significant challenge. While Large Language Models (LLMs) offer promise for planning, existing approaches often rely on problematic internal knowledge or make unrealistic environmental assumptions. Although recent work explores learning planning knowledge, they still retain limitations due to partial reliance on external knowledge or impractical setups. Indeed, prior research has largely overlooked developing agents capable of acquiring planning knowledge from scratch, directly in realistic settings. While realizing this capability is necessary, it presents significant challenges, primarily achieving robustness given the substantial risk of incorporating LLMs’ inaccurate knowledge. Moreover, efficiency is crucial for practicality as learning can demand prohibitive exploration. In response, we introduce Robust and Efficient Planning for Open-world Agents (REPOA), a novel framework designed to tackle these issues. REPOA features three key components: adaptive dependency learning and fine-grained failure-aware operation memory to enhance robustness to knowledge inaccuracies, and difficulty-based exploration to improve learning efficiency. Our evaluation in two established open-world testbeds demonstrates REPOA’s robust and efficient planning, showcasing its capability to successfully obtain challenging late-game items that were beyond the reach of prior approaches.
zh
[AI-54] RCCDA: Adaptive Model Updates in the Presence of Concept Drift under a Constrained Resource Budget
【速读】:该论文试图解决在概念漂移(concept drift)环境下,如何在严格资源约束下保持机器学习(Machine Learning, ML)模型性能的问题。现有方法依赖于高计算开销的漂移检测机制,无法提供严格的资源使用保证或理论性能保障。论文提出的解决方案关键在于RCCDA:一种基于历史损失信息和可调漂移阈值的动态模型更新策略,通过分析模型损失在概念漂移下的演化,并将其整合到Lyapunov漂移加惩罚框架中,从而实现对更新频率和成本的理论保证,确保在资源受限场景下的高效模型适应。
链接: https://arxiv.org/abs/2505.24149
作者: Adam Piaseczny,Md Kamran Chowdhury Shisher,Shiqiang Wang,Christopher G. Brinton
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Machine learning (ML) algorithms deployed in real-world environments are often faced with the challenge of adapting models to concept drift, where the task data distributions are shifting over time. The problem becomes even more difficult when model performance must be maintained under adherence to strict resource constraints. Existing solutions often depend on drift-detection methods that produce high computational overhead for resource-constrained environments, and fail to provide strict guarantees on resource usage or theoretical performance assurances. To address these shortcomings, we propose RCCDA: a dynamic model update policy that optimizes ML training dynamics while ensuring strict compliance to predefined resource constraints, utilizing only past loss information and a tunable drift threshold. In developing our policy, we analytically characterize the evolution of model loss under concept drift with arbitrary training update decisions. Integrating these results into a Lyapunov drift-plus-penalty framework produces a lightweight policy based on a measurable accumulated loss threshold that provably limits update frequency and cost. Experimental results on three domain generalization datasets demonstrate that our policy outperforms baseline methods in inference accuracy while adhering to strict resource constraints under several schedules of concept drift, making our solution uniquely suited for real-time ML deployments.
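该策略的核心是“仅凭历史损失与可调阈值决定是否更新”;下面给出一个假设性的决策规则示意(Lyapunov虚拟队列与drift-plus-penalty的具体形式为大幅简化,阈值与代价参数均为示例):

```python
class RCCDAPolicy:
    """累计损失超过阈值、且资源虚拟队列允许时,才触发一次模型更新(示意)。"""
    def __init__(self, loss_threshold, update_cost, budget_rate, V=1.0):
        self.thr, self.cost, self.rate, self.V = loss_threshold, update_cost, budget_rate, V
        self.acc_loss, self.queue = 0.0, 0.0   # queue 为资源约束的虚拟队列

    def step(self, observed_loss):
        self.acc_loss += observed_loss
        # drift-plus-penalty 式权衡:更新的收益(清空累计损失)对比排队的资源代价
        do_update = (self.acc_loss > self.thr
                     and self.V * self.acc_loss > self.queue * self.cost)
        if do_update:
            self.acc_loss = 0.0
        self.queue = max(0.0, self.queue + (self.cost if do_update else 0.0) - self.rate)
        return do_update

policy = RCCDAPolicy(loss_threshold=5.0, update_cost=1.0, budget_rate=0.2)
updates = [policy.step(l) for l in [0.5, 2.0, 3.0, 0.4, 4.8]]  # 在第 3、5 步触发更新
```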
zh
[AI-55] AMSbench: A Comprehensive Benchmark for Evaluating MLLM Capabilities in AMS Circuits
【速读】:该论文试图解决模拟/混合信号(Analog/Mixed-Signal, AMS)电路设计自动化长期面临的挑战,即当前多模态大语言模型(Multi-modal Large Language Models, MLLMs)在复杂多模态推理和高级电路设计任务中存在显著局限性。解决方案的关键在于提出AMSbench,这是一个针对AMS电路分析与设计任务的基准测试套件,能够系统评估MLLMs在电路图感知、电路分析和电路设计等关键任务中的性能,从而推动MLLMs对电路领域知识的理解与应用能力提升。
链接: https://arxiv.org/abs/2505.24138
作者: Yichen Shi,Ze Zhang,Hongyang Wang,Zhuofu Tao,Zhongyi Li,Bingyu Chen,Yaxin Wang,Zhiping Yu,Ting-Jung Lin,Lei He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Analog/Mixed-Signal (AMS) circuits play a critical role in the integrated circuit (IC) industry. However, automating Analog/Mixed-Signal (AMS) circuit design has remained a longstanding challenge due to its difficulty and complexity. Recent advances in Multi-modal Large Language Models (MLLMs) offer promising potential for supporting AMS circuit analysis and design. However, current research typically evaluates MLLMs on isolated tasks within the domain, lacking a comprehensive benchmark that systematically assesses model capabilities across diverse AMS-related challenges. To address this gap, we introduce AMSbench, a benchmark suite designed to evaluate MLLM performance across critical tasks including circuit schematic perception, circuit analysis, and circuit design. AMSbench comprises approximately 8000 test questions spanning multiple difficulty levels and assesses eight prominent models, encompassing both open-source and proprietary solutions such as Qwen 2.5-VL and Gemini 2.5 Pro. Our evaluation highlights significant limitations in current MLLMs, particularly in complex multi-modal reasoning and sophisticated circuit design tasks. These results underscore the necessity of advancing MLLMs’ understanding and effective application of circuit-specific knowledge, thereby narrowing the existing performance gap relative to human expertise and moving toward fully automated AMS circuit design workflows. Our data is released at this https URL
zh
[AI-56] Searching Clinical Data Using Generative AI
【速读】:该论文试图解决临床数据搜索中由于数据规模大且结构不统一而导致的手动操作效率低、易出错的问题。其解决方案的关键在于提出一种生成式 AI 方法——SearchAI,该方法通过层次化模型来尊重编码体系的层级关系,并实现从父节点到子节点的关系遍历,从而提升搜索的准确性和效率。SearchAI 能够处理关键词同义词、语义变体以及开放性查询,满足医疗专业人员对相关疾病、药物或状况组合的搜索需求。
链接: https://arxiv.org/abs/2505.24090
作者: Karan Hanswadkar,Anika Kanchi,Shivani Tripathi,Shi Qiao,Rony Chatterjee,Alekh Jindal
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial Intelligence (AI) is making a major impact on healthcare, particularly through its application in natural language processing (NLP) and predictive analytics. The healthcare sector has increasingly adopted AI for tasks such as clinical data analysis and medical code assignment. However, searching for clinical information in large and often unorganized datasets remains a manual and error-prone process. Assisting this process with automations can help physicians improve their operational productivity significantly. In this paper, we present a generative AI approach, coined SearchAI, to enhance the accuracy and efficiency of searching clinical data. Unlike traditional code assignment, which is a one-to-one problem, clinical data search is a one-to-many problem, i.e., a given search query can map to a family of codes. Healthcare professionals typically search for groups of related diseases, drugs, or conditions that map to many codes, and therefore, they need search tools that can handle keyword synonyms, semantic variants, and broad open-ended queries. SearchAI employs a hierarchical model that respects the coding hierarchy and improves the traversal of relationships from parent to child nodes. SearchAI navigates these hierarchies predictively and ensures that all paths are reachable without losing any relevant nodes. To evaluate the effectiveness of SearchAI, we conducted a series of experiments using both public and production datasets. Our results show that SearchAI outperforms default hierarchical traversals across several metrics, including accuracy, robustness, performance, and scalability. SearchAI can help make clinical data more accessible, leading to streamlined workflows, reduced administrative burden, and enhanced coding and diagnostic accuracy.
zh
[AI-57] mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在动态现实应用场景中因静态训练数据、幻觉现象及无法验证最新外部证据而导致性能受限的问题。其解决方案的关键在于引入检索增强生成(Retrieval-Augmented Generation, RAG)机制,通过检索模块使LVLM能够访问大规模知识库,从而在生成过程中融入事实性与上下文相关的信息,提升模型输出的准确性和可靠性。
链接: https://arxiv.org/abs/2505.24073
作者: Chan-Wei Hu,Yueqi Wang,Shuo Xing,Chia-Ju Chen,Zhengzhong Tu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 11 figures
Abstract:Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. However, they remain limited by static training data, susceptibility to hallucinations, and inability to verify claims against up-to-date, external evidence, compromising their performance in dynamic real-world applications. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing the LVLMs to access large-scale knowledge databases via retrieval mechanisms, thereby grounding model outputs in factual, contextually relevant information. Here in this paper, we conduct the first systematic dissection of the multimodal RAG pipeline for LVLMs, explicitly investigating (1) the retrieval phase: on the modality configurations and retrieval strategies, (2) the re-ranking stage: on strategies to mitigate positional biases and improve the relevance of retrieved evidence, and (3) the generation phase: we further investigate how to best integrate retrieved candidates into the final generation process. Finally, we extend to explore a unified agentic framework that integrates re-ranking and generation through self-reflection, enabling LVLMs to select relevant evidence and suppress irrelevant context dynamically. Our full-stack exploration of RAG for LVLMs yields substantial insights, resulting in an average performance boost of 5% without any fine-tuning.
zh
[AI-58] DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures
【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在结构化推理能力上的不足,特别是其在处理数据关系和操作数据结构方面的能力有限。现有基准测试主要关注高层应用任务,而未专门评估这一基础能力。解决方案的关键是引入DSR-Bench,这是一个新型基准测试,通过数据结构来评估LLMs的结构化推理能力,提供可解释的数据关系表示,并包含20种数据结构、35种操作及4,140个问题实例,以实现对推理局限性的细粒度分析。
链接: https://arxiv.org/abs/2505.24069
作者: Yu He,Yingxi Li,Colin White,Ellen Vitercik
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly deployed for real-world tasks that fundamentally involve data manipulation. A core requirement across these tasks is the ability to perform structural reasoning–that is, to understand and reason about data relationships. For example, customer requests require a temporal ordering, which can be represented by data structures such as queues. However, existing benchmarks primarily focus on high-level, application-driven evaluations without isolating this fundamental capability. To address this gap, we introduce DSR-Bench, a novel benchmark evaluating LLMs’ structural reasoning capabilities through data structures, which provide interpretable representations of data relationships. DSR-Bench includes 20 data structures, 35 operations, and 4,140 problem instances, organized hierarchically for fine-grained analysis of reasoning limitations. Our evaluation pipeline is fully automated and deterministic, eliminating subjective human or model-based judgments. Its synthetic nature also ensures scalability and minimizes data contamination risks. We benchmark nine state-of-the-art LLMs. Our analysis shows that instruction-tuned models struggle with basic multi-attribute and multi-hop reasoning. Furthermore, while reasoning-oriented models perform better, they remain fragile on complex and hybrid structures, with the best model achieving an average score of only 47% on the challenge subset. Crucially, models often perform poorly on multi-dimensional data and natural language task descriptions, highlighting a critical gap for real-world deployment.
zh
[AI-59] Bridging Source and Target Domains via Link Prediction for Unsupervised Domain Adaptation on Graphs
【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)在跨域节点分类任务中因标签数据稀缺而导致的性能下降问题,特别是在源域和目标域之间存在分布偏移的情况下。现有方法大多直接套用独立同分布数据的无监督域适应(Unsupervised Domain Adaptation, UDA)技术,未能充分考虑图结构和GNN的消息传递机制,导致在标签分布偏移时效果不佳。该论文的关键解决方案是引入链接预测机制,通过连接源域和目标域的节点,促进跨域的消息传递,并增强目标域节点的“分布内”邻域结构,从而在输入层调整目标图以减少其在嵌入空间中与源域的偏差,同时设计了一种身份保持学习目标以防止目标域判别信息的丢失。
链接: https://arxiv.org/abs/2505.24055
作者: Yilong Wang,Tianxiang Zhao,Zongyu Wu,Suhang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph neural networks (GNNs) have shown great ability for node classification on graphs. However, the success of GNNs relies on abundant labeled data, while obtaining high-quality labels is costly and challenging, especially for newly emerging domains. Hence, unsupervised domain adaptation (UDA), which trains a classifier on the labeled source graph and adapts it to the unlabeled target graph, is attracting increasing attention. Various approaches have been proposed to alleviate the distribution shift between the source and target graphs to facilitate the classifier adaptation. However, most of them simply adopt existing UDA techniques developed for independent and identically distributed data to gain domain-invariant node embeddings for graphs, which do not fully consider the graph structure and message-passing mechanism of GNNs during the adaptation and will fail when label distribution shift exists among domains. In this paper, we proposed a novel framework that adopts link prediction to connect nodes between source and target graphs, which can facilitate message-passing between the source and target graphs and augment the target nodes to have ``in-distribution’’ neighborhoods with the source domain. This strategy modified the target graph on the input level to reduce its deviation from the source domain in the embedding space and is insensitive to disproportional label distributions across domains. To prevent the loss of discriminative information in the target graph, we further design a novel identity-preserving learning objective, which guides the learning of the edge insertion module together with reconstruction and adaptation losses. Experimental results on real-world datasets demonstrate the effectiveness of our framework.
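其核心操作“用链接预测在源图与目标图之间插边”可示意如下(假设性实现:以节点嵌入内积作为链接预测打分,为每个目标节点连向得分最高的若干源节点):

```python
import torch

def insert_cross_domain_edges(src_emb, tgt_emb, src_offset, topk=2):
    """src_emb: (Ns, D) 源图节点嵌入;tgt_emb: (Nt, D) 目标图节点嵌入。
    返回跨域边列表 [(tgt_id, src_global_id)];src 索引加 offset 以区分两图。"""
    scores = tgt_emb @ src_emb.T                      # (Nt, Ns) 链接预测打分
    top = scores.topk(topk, dim=1).indices
    return [(t, int(s) + src_offset) for t in range(tgt_emb.size(0)) for s in top[t]]

src_emb, tgt_emb = torch.randn(100, 32), torch.randn(40, 32)
edges = insert_cross_domain_edges(src_emb, tgt_emb, src_offset=0)
# 将 edges 并入目标图的边集后,源图的标签信息即可经消息传递影响目标节点
```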
zh
[AI-60] Leave it to the Specialist: Repair Sparse LLMs with Sparse Fine-Tuning via Sparsity Evolution
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在部署过程中因计算需求大而面临的挑战,以及现有剪枝和微调方法在高稀疏性下难以保持模型性能的问题。其解决方案的关键在于提出一种专为稀疏LLMs设计的微调方法——稀疏性演化微调(Sparsity Evolution Fine-Tuning, SEFT),该方法在微调过程中动态演化剪枝模型的稀疏拓扑结构,同时保持整体稀疏性不变,并通过权重脱落与增长策略实现任务特定的适应,从而提升模型在下游任务中的性能。
链接: https://arxiv.org/abs/2505.24037
作者: Qiao Xiao,Alan Ansell,Boqian Wu,Lu Yin,Mykola Pechenizkiy,Shiwei Liu,Decebal Constantin Mocanu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have achieved remarkable success across various tasks but face deployment challenges due to their massive computational demands. While post-training pruning methods like SparseGPT and Wanda can effectively reduce the model size, but struggle to maintain model performance at high sparsity levels, limiting their utility for downstream tasks. Existing fine-tuning methods, such as full fine-tuning and LoRA, fail to preserve sparsity as they require updating the whole dense metrics, not well-suited for sparse LLMs. In this paper, we propose Sparsity Evolution Fine-Tuning (SEFT), a novel method designed specifically for sparse LLMs. SEFT dynamically evolves the sparse topology of pruned models during fine-tuning, while preserving the overall sparsity throughout the process. The strengths of SEFT lie in its ability to perform task-specific adaptation through a weight drop-and-grow strategy, enabling the pruned model to self-adapt its sparse connectivity pattern based on the target dataset. Furthermore, a sensitivity-driven pruning criterion is employed to ensure that the desired sparsity level is consistently maintained throughout fine-tuning. Our experiments on various LLMs, including LLaMA families, DeepSeek, and Mistral, across a diverse set of benchmarks demonstrate that SEFT achieves stronger performance while offering superior memory and time efficiency compared to existing baselines. Our code is publicly available at: this https URL.
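SEFT的“权重脱落与增长(drop-and-grow)”掩码更新可示意如下(假设性实现:此处按权重幅值丢弃、按梯度幅值增长并保持总稀疏度不变;论文实际采用的是敏感度驱动的准则):

```python
import torch

def drop_and_grow(weight, grad, mask, update_frac=0.1):
    """weight/grad/mask 同形状。丢弃 mask 内幅值最小的连接,
    再在 mask 外按梯度幅值增长等量连接,总稀疏度保持不变。"""
    n = max(1, int(update_frac * int(mask.sum())))
    inf = torch.full_like(weight, float("inf"))
    active = torch.where(mask.bool(), weight.abs(), inf).flatten()
    drop_idx = active.topk(n, largest=False).indices   # 幅值最小的活跃连接
    cand = torch.where(mask.bool(), -inf, grad.abs()).flatten()
    grow_idx = cand.topk(n).indices                    # 梯度最大的休眠连接
    new_mask = mask.clone().flatten()
    new_mask[drop_idx], new_mask[grow_idx] = 0.0, 1.0
    return new_mask.view_as(mask)

w, g = torch.randn(64, 64), torch.randn(64, 64)
m = (torch.rand(64, 64) < 0.3).float()   # 初始稀疏掩码,约 30% 连接
m = drop_and_grow(w, g, m)               # 每隔若干步演化一次稀疏拓扑
```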
zh
[AI-61] GenIC: An LLM-Based Framework for Instance Completion in Knowledge Graphs
【速读】:该论文旨在解决知识图谱补全(Knowledge Graph Completion)问题,特别是实例补全(Instance Completion)任务,即在仅已知三元组的头实体(head)的情况下,预测关系-尾实体对(relation-tail pair)。解决方案的关键在于提出了一种基于大语言模型(LLM)的端到端实例补全框架GenIC,该框架分为两个步骤:第一步骤为属性预测,作为多标签分类任务;第二步骤为链接预测,作为生成式的序列到序列任务,从而有效利用实体描述和知识图谱模式进行缺失事实的推断。
链接: https://arxiv.org/abs/2505.24036
作者: Amel Gader,Alsayed Algergawy
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Knowledge graph completion aims to address the gaps of knowledge bases by adding new triples that represent facts. The complexity of this task depends on how many parts of a triple are already known. Instance completion involves predicting the relation-tail pair when only the head is given (h, ?, ?). Notably, modern knowledge bases often contain entity descriptions and types, which can provide valuable context for inferring missing facts. By leveraging these textual descriptions and the ability of large language models to extract facts from them and recognize patterns within the knowledge graph schema, we propose an LLM-powered, end-to-end instance completion approach. Specifically, we introduce GenIC: a two-step Generative Instance Completion framework. The first step focuses on property prediction, treated as a multi-label classification task. The second step is link prediction, framed as a generative sequence-to-sequence task. Experimental results on three datasets show that our method outperforms existing baselines. Our code is available at this https URL.
zh
[AI-62] LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training
【速读】:该论文旨在解决在大规模语言模型(Large Language Models, LLMs)上高效进行强化学习(Reinforcement Learning, RL)训练的挑战,特别是在高延迟和高内存需求下构建可靠且高效的RL框架问题。其解决方案的关键在于提出了一种全分布式、异步的RL框架LlamaRL,该框架采用基于原生PyTorch的简化单控制器架构,实现了模块化、易用性和对数千个GPU的无缝扩展,并通过共置模型卸载、异步离策略训练以及分布式直接内存访问等最佳实践显著提升了训练效率。
链接: https://arxiv.org/abs/2505.24034
作者: Bo Wu,Sid Wang,Yunhao Tang,Jia Ding,Eryk Helenowski,Liang Tan,Tengyu Xu,Tushar Gowda,Zhengxing Chen,Chen Zhu,Xiaocheng Tang,Yundi Qian,Beibei Zhu,Rui Hou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning (RL) has become the most effective post-training approach for improving the capabilities of Large Language Models (LLMs). In practice, because of the high demands on latency and memory, it is particularly challenging to develop an efficient RL framework that reliably manages policy models with hundreds to thousands of billions of parameters. In this paper, we present LlamaRL, a fully distributed, asynchronous RL framework optimized for efficient training of large-scale LLMs with various model sizes (8B, 70B, and 405B parameters) on GPU clusters ranging from a handful to thousands of devices. LlamaRL introduces a streamlined, single-controller architecture built entirely on native PyTorch, enabling modularity, ease of use, and seamless scalability to thousands of GPUs. We also provide a theoretical analysis of LlamaRL’s efficiency, including a formal proof that its asynchronous design leads to strict RL speed-up. Empirically, by leveraging best practices such as colocated model offloading, asynchronous off-policy training, and distributed direct memory access for weight synchronization, LlamaRL achieves significant efficiency gains – up to 10.7x speed-up compared to DeepSpeed-Chat-like systems on a 405B-parameter policy model. Furthermore, the efficiency advantage continues to grow with increasing model scale, demonstrating the framework’s suitability for future large-scale RL training.
zh
[AI-63] LLM Agents Should Employ Security Principles
【速读】:该论文试图解决大规模部署生成式 AI (Generative AI) 代理时所面临的安全与隐私问题,特别是多代理交互和上下文操控带来的漏洞,如隐私泄露和系统被利用的风险。解决方案的关键在于将信息安全管理中长期确立的安全原则(如纵深防御、最小权限、完整验证和心理可接受性)嵌入到 LLM 代理的设计与运行中,以构建更安全的智能体系统。为此,作者提出了 AgentSandbox,一个融合这些安全原则的概念框架,旨在在代理生命周期内提供全面的防护,并通过评估证明其在保持功能效用的同时有效降低隐私风险。
链接: https://arxiv.org/abs/2505.24019
作者: Kaiyuan Zhang,Zian Su,Pin-Yu Chen,Elisa Bertino,Xiangyu Zhang,Ninghui Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM) agents show considerable promise for automating complex tasks using contextual reasoning; however, interactions involving multiple agents and the system’s susceptibility to prompt injection and other forms of context manipulation introduce new vulnerabilities related to privacy leakage and system exploitation. This position paper argues that the well-established design principles in information security, which are commonly referred to as security principles, should be employed when deploying LLM agents at scale. Design principles such as defense-in-depth, least privilege, complete mediation, and psychological acceptability have helped guide the design of mechanisms for securing information systems over the last five decades, and we argue that their explicit and conscientious adoption will help secure agentic systems. To illustrate this approach, we introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent’s life-cycle. We evaluate with state-of-the-art LLMs along three dimensions: benign utility, attack utility, and attack success rate. AgentSandbox maintains high utility for its intended functions under both benign and adversarial evaluations while substantially mitigating privacy risks. By embedding secure design principles as foundational elements within emerging LLM agent protocols, we aim to promote trustworthy agent ecosystems aligned with user privacy expectations and evolving regulatory requirements.
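以其中的“最小权限”与“完整验证”原则为例,下面是一个假设性的工具调用门控示意(POLICY策略表、任务阶段与工具名均为虚构示例):

```python
class LeastPrivilegeGate:
    """按任务阶段授予最小工具集;每次调用都检查白名单(完整验证),
    不在白名单内的工具直接拒绝(最小权限)。"""
    POLICY = {
        "research": {"web_search", "read_file"},
        "report":   {"read_file", "write_file"},
    }

    def __init__(self, stage):
        self.allowed = self.POLICY.get(stage, set())

    def call(self, tool_name, tool_fn, *args, **kwargs):
        if tool_name not in self.allowed:
            raise PermissionError(f"tool '{tool_name}' denied at this stage")
        return tool_fn(*args, **kwargs)

gate = LeastPrivilegeGate("research")
gate.call("web_search", lambda q: f"results for {q}", "LLM agent security")
# gate.call("write_file", ...) 在 research 阶段会抛出 PermissionError
```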
zh
[AI-64] Multi-Modal View Enhanced Large Vision Models for Long-Term Time Series Forecasting
【速读】:该论文试图解决长期时间序列预测(Long-term Time Series Forecasting, LTSF)中如何有效利用多模态视图(Multi-modal Views, MMVs)的问题。现有方法在应用大型视觉模型(Large Vision Models, LVMs)时存在对“预测时段”的归纳偏倚,限制了模型性能的提升。解决方案的关键在于提出DMMV框架,该框架结合趋势-季节分解与基于反演残差的自适应分解,以有效整合多模态视图,从而提升LTSF的预测精度。实验结果表明,DMMV在多个基准数据集上均优于现有单视图和多模态基线方法。
链接: https://arxiv.org/abs/2505.24003
作者: ChengAo Shen,Wenchao Yu,Ziming Zhao,Dongjin Song,Wei Cheng,Haifeng Chen,Jingchao Ni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Time series, typically represented as numerical sequences, can also be transformed into images and texts, offering multi-modal views (MMVs) of the same underlying signal. These MMVs can reveal complementary patterns and enable the use of powerful pre-trained large models, such as large vision models (LVMs), for long-term time series forecasting (LTSF). However, as we identified in this work, applying LVMs to LTSF poses an inductive bias towards “forecasting periods”. To harness this bias, we propose DMMV, a novel decomposition-based multi-modal view framework that leverages trend-seasonal decomposition and a novel backcast residual based adaptive decomposition to integrate MMVs for LTSF. Comparative evaluations against 14 state-of-the-art (SOTA) models across diverse datasets show that DMMV outperforms single-view and existing multi-modal baselines, achieving the best mean squared error (MSE) on 6 out of 8 benchmark datasets.
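框架所依赖的趋势-季节分解可用移动平均给出最小示意(假设性实现;论文的“backcast残差自适应分解”部分从略):

```python
import numpy as np

def trend_seasonal_decompose(x, period=24):
    """x: (T,) 一维序列。移动平均提取趋势,按周期位置平均得到季节项,余下为残差。"""
    kernel = np.ones(period) / period
    trend = np.convolve(x, kernel, mode="same")
    detrended = x - trend
    seasonal_profile = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(seasonal_profile, len(x) // period + 1)[:len(x)]
    residual = x - trend - seasonal
    return trend, seasonal, residual

t = np.arange(480)
x = 0.01 * t + np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(480)
trend, seasonal, residual = trend_seasonal_decompose(x)
# 各分量可分别渲染为图像视图,交给 LVM 做“预测周期”式的外推
```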
[AI-65] Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding
[Quick Read]: This paper aims to reduce human cognitive load and provide adaptive assistance in dynamic, information-dense scenarios. The key to the solution is Multi-RAG, a multimodal retrieval-augmented generation system that integrates and reasons over multi-source information streams (video, audio, and text) to achieve deep situational understanding and efficient decision-making, improving the adaptivity and efficiency of human-robot collaboration.
Link: https://arxiv.org/abs/2505.23990
Authors: Mingyang Mao, Mariela M. Perez-Cabarcas, Utteja Kallakuri, Nicholas R. Waytowich, Xiaomin Lin, Tinoosh Mohsenin
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:To effectively engage in human society, the ability to adapt, filter information, and make informed decisions in ever-changing situations is critical. As robots and intelligent agents become more integrated into human life, there is a growing opportunity, and need, to offload the cognitive burden on humans to these systems, particularly in dynamic, information-rich scenarios. To fill this critical need, we present Multi-RAG, a multimodal retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams, including video, audio, and text. As an enabling step toward long-term human-robot partnerships, Multi-RAG explores how multimodal information understanding can serve as a foundation for adaptive robotic assistance in dynamic, human-centered situations. To evaluate its capability in a realistic human-assistance proxy task, we benchmarked Multi-RAG on the MMBench-Video dataset, a challenging multimodal video understanding benchmark. Our system achieves superior performance compared to existing open-source video large language models (Video-LLMs) and large vision-language models (LVLMs), while utilizing fewer resources and less input data. The results demonstrate Multi-RAG's potential as a practical and efficient foundation for future human-robot adaptive assistance systems in dynamic, real-world contexts.
[AI-66] MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge
[Quick Read]: This paper addresses the lack of benchmarks for evaluating large language models' (LLMs) domain-specific knowledge and complex reasoning abilities in materials science. The key to the solution is MSQA, a comprehensive evaluation benchmark of 1,757 graduate-level materials science questions in two formats, detailed explanatory responses and binary True/False assessments, which challenges LLMs by requiring both precise factual knowledge and multi-step reasoning across seven materials science sub-fields, thereby jointly assessing their factual and reasoning capabilities in advanced materials science.
Link: https://arxiv.org/abs/2505.23982
Authors: Jerry Junyang Cheung, Shiyao Shen, Yuchen Zhuang, Yinghao Li, Rampi Ramprasad, Chao Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Despite recent advances in large language models (LLMs) for materials science, there is a lack of benchmarks for evaluating their domain-specific knowledge and complex reasoning abilities. To bridge this gap, we introduce MSQA, a comprehensive evaluation benchmark of 1,757 graduate-level materials science questions in two formats: detailed explanatory responses and binary True/False assessments. MSQA distinctively challenges LLMs by requiring both precise factual knowledge and multi-step reasoning across seven materials science sub-fields, such as structure-property relationships, synthesis processes, and computational modeling. Through experiments with 10 state-of-the-art LLMs, we identify significant gaps in current LLM performance. While API-based proprietary LLMs achieve up to 84.5% accuracy, open-source (OSS) LLMs peak around 60.5%, and domain-specific LLMs often underperform significantly due to overfitting and distributional shifts. MSQA represents the first benchmark to jointly evaluate the factual and reasoning capabilities of LLMs crucial for LLMs in advanced materials science.
[AI-67] Confidential Guardian: Cryptographically Prohibiting the Abuse of Model Abstention
[Quick Read]: This paper tackles a threat in safety-critical applications: machine learning models limit harmful errors through cautious predictions (abstention under uncertainty), but a dishonest institution can exploit this mechanism to discriminate or unjustly deny services under the guise of uncertainty. The key to the solution is the Confidential Guardian framework, which detects artificially suppressed confidence by analyzing calibration metrics on a reference dataset and uses zero-knowledge proofs of verified inference to ensure that reported confidence scores genuinely originate from the deployed model, preventing the provider from fabricating confidence values while protecting the model's proprietary details.
Link: https://arxiv.org/abs/2505.23968
Authors: Stephan Rabanser, Ali Shahin Shamsabadi, Olive Franzese, Xiao Wang, Adrian Weller, Nicolas Papernot
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Proceedings of the 42nd International Conference on Machine Learning
Abstract:Cautious predictions – where a machine learning model abstains when uncertain – are crucial for limiting harmful errors in safety-critical applications. In this work, we identify a novel threat: a dishonest institution can exploit these mechanisms to discriminate or unjustly deny services under the guise of uncertainty. We demonstrate the practicality of this threat by introducing an uncertainty-inducing attack called Mirage, which deliberately reduces confidence in targeted input regions, thereby covertly disadvantaging specific individuals. At the same time, Mirage maintains high predictive performance across all data points. To counter this threat, we propose Confidential Guardian, a framework that analyzes calibration metrics on a reference dataset to detect artificially suppressed confidence. Additionally, it employs zero-knowledge proofs of verified inference to ensure that reported confidence scores genuinely originate from the deployed model. This prevents the provider from fabricating arbitrary model confidence values while protecting the model’s proprietary details. Our results confirm that Confidential Guardian effectively prevents the misuse of cautious predictions, providing verifiable assurances that abstention reflects genuine model uncertainty rather than malicious intent.
[AI-68] Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach
[Quick Read]: This paper aims to improve failed cases in LLM-based code generation by understanding the characteristics of the generated code. The key to the solution is to analyze the relationship between code complexity and generation success and, building on this, to propose an iterative feedback method in which complexity metrics from previous failed outputs guide the LLM towards more correct code. Experiments show notable improvements across several benchmarks and LLMs of different sizes, with especially pronounced gains for smaller models.
Link: https://arxiv.org/abs/2505.23953
Authors: Melika Sepidband, Hamed Taherkhani, Song Wang, Hadi Hemmati
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures. Accepted to COMPSAC 2025
Abstract:Automatic code generation has gained significant momentum with the advent of Large Language Models (LLMs) such as GPT-4. Although many studies focus on improving the effectiveness of LLMs for code generation, very limited work tries to understand the generated code’s characteristics and leverage that to improve failed cases. In this paper, as the most straightforward characteristic of code, we investigate the relationship between code complexity and the success of LLM generated code. Using a large set of standard complexity metrics, we first conduct an empirical analysis to explore their correlation with LLM’s performance on code generation (i.e., Pass@1). Using logistic regression models, we identify which complexity metrics are most predictive of code correctness. Building on these findings, we propose an iterative feedback method, where LLMs are prompted to generate correct code based on complexity metrics from previous failed outputs. We validate our approach across multiple benchmarks (i.e., HumanEval, MBPP, LeetCode, and BigCodeBench) and various LLMs (i.e., GPT-4o, GPT-3.5 Turbo, Llama 3.1, and GPT-o3 mini), comparing the results with two baseline methods: (a) zero-shot generation, and (b) iterative execution-based feedback without our code complexity insights. Experiment results show that our approach makes notable improvements, particularly with a smaller LLM (GPT3.5 Turbo), where, e.g., Pass@1 increased by 35.71% compared to the baseline’s improvement of 12.5% on the HumanEval dataset. The study expands experiments to BigCodeBench and integrates the method with the Reflexion code generation agent, leading to Pass@1 improvements of 20% (GPT-4o) and 23.07% (GPT-o3 mini). The results highlight that complexity-aware feedback enhances both direct LLM prompting and agent-based workflows.
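As a concrete illustration of the kind of signal such a feedback loop consumes, the sketch below computes a rough cyclomatic-complexity score for generated Python code using only the standard library. The metric definition and helper name are simplified assumptions; the paper uses a large set of standard complexity metrics rather than this toy one.

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Rough cyclomatic complexity: 1 + number of decision points.
    A toy stand-in for the metrics that would be fed back to the LLM
    ("your last attempt scored 9; aim lower"), not the paper's suite."""
    tree = ast.parse(source)
    decisions = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.BoolOp, ast.IfExp)
    return 1 + sum(isinstance(node, decisions) for node in ast.walk(tree))

snippet = """
def classify(x):
    if x > 0:
        return "pos"
    elif x < 0:
        return "neg"
    return "zero"
"""
print(cyclomatic_complexity(snippet))  # 3: two if-branches plus 1
```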
[AI-69] InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback
[Quick Read]: This paper addresses the shortcomings of multimodal large models (MLLMs) in sustained multi-turn, multimodal interaction, in particular their difficulty in comprehending and generating interleaved multimodal contexts and responding coherently in ongoing exchanges. The key to the solution is InterMT, the first preference dataset for multi-turn multimodal interaction grounded in real human feedback, with expert annotations to ensure data quality and an agentic workflow that leverages tool-augmented MLLMs to compensate for the lack of multimodal understanding and generation capabilities. The paper also introduces InterMT-Bench to evaluate MLLMs' ability to assist judges in multi-turn, multimodal tasks.
Link: https://arxiv.org/abs/2505.23950
Authors: Boyuan Chen, Donghai Hong, Jiaming Ji, Jiacheng Zheng, Bowen Dong, Jiayi Zhou, Kaile Wang, Juntao Dai, Xuyao Wang, Wenqi Chen, Qirui Zheng, Wenxin Li, Sirui Han, Yike Guo, Yaodong Yang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: What essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment – not limited to language, but also involving multimodal understanding and generation. To move closer to human-level intelligence, models must similarly support multi-turn, multimodal interaction. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges. In this work, we present an initial exploration through the InterMT – the first preference dataset for multi-turn multimodal interaction, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. InterMT captures human preferences at both global and local levels into nine sub-dimensions, consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for the lack of capability for multi-modal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances. To further this goal, we introduce InterMT-Bench to assess the ability of MLLMs in assisting judges with multi-turn, multimodal tasks. We demonstrate the utility of InterMT through applications such as judge moderation and further reveal the multi-turn scaling law of the judge model. We hope that open-sourcing our data can help facilitate further research on aligning current MLLMs to the next step. Our project website can be found at this https URL.
[AI-70] TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks
[Quick Read]: This paper addresses the problem that conventional N:M sparsity is not preserved under matrix transposition during training, limiting efficiency gains in the training phase. The key to the solution is a method for generating transposable N:M sparse masks: mask generation is formulated as an optimal transport problem and solved via entropy regularization and Dykstra's algorithm, followed by a rounding procedure, enabling effective sparsification of large-scale models. A tensorized GPU implementation yields significant speedups, and arbitrary N:M ratios are supported, offering a practical path to efficiently training sparse models.
Link: https://arxiv.org/abs/2505.23949
Authors: Xiang Meng, Mehdi Makni, Rahul Mazumder
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Network pruning reduces the computational requirements of large neural networks, with N:M sparsity – retaining only N out of every M consecutive weights – offering a compelling balance between compressed model quality and hardware acceleration. However, N:M sparsity only accelerates forward-pass computations, as N:M patterns are not preserved during matrix transposition, limiting efficiency during training where both passes are computationally intensive. While transposable N:M sparsity has been proposed to address this limitation, existing methods for finding transposable N:M sparse masks either fail to scale to large models or are restricted to M=4 which results in suboptimal compression-accuracy trade-off. We introduce an efficient solver for transposable N:M masks that scales to billion-parameter models. We formulate mask generation as optimal transport problems and solve through entropy regularization and Dykstra’s algorithm, followed by a rounding procedure. Our tensor-based implementation exploits GPU parallelism, achieving up to 100x speedup with only 1-10% error compared to existing methods. Our approach can be integrated with layer-wise N:M pruning frameworks including Wanda, SparseGPT and ALPS to produce transposable N:M sparse models with arbitrary N:M values. Experiments show that LLaMA3.2-8B with transposable 16:32 sparsity maintains performance close to its standard N:M counterpart and outperforms standard 2:4 sparse model, showing the practical value of our approach.
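To make "transposable N:M" concrete, the following toy sketch brute-forces the best transposable 2:4 mask for a single 4x4 block: the mask must keep exactly N weights in every row and every column, so it stays valid after transposition. This enumeration is only for intuition; the paper's contribution is precisely replacing such search with an entropy-regularized optimal-transport solver (via Dykstra's algorithm) that scales to billion-parameter models.

```python
import itertools
import numpy as np

def best_transposable_mask(block, n=2, m=4):
    """Brute-force the binary mask that keeps the largest total |weight|
    while satisfying the N:M constraint on every row AND every column,
    so the same mask remains valid after transposition. Illustrative only."""
    assert block.shape == (m, m)
    best_mask, best_score = None, -np.inf
    row_choices = list(itertools.combinations(range(m), n))  # ways to keep n per row
    for rows in itertools.product(row_choices, repeat=m):
        mask = np.zeros((m, m), dtype=bool)
        for i, cols in enumerate(rows):
            mask[i, list(cols)] = True
        # Transposability: each column must also keep exactly n entries.
        if not np.all(mask.sum(axis=0) == n):
            continue
        score = np.abs(block[mask]).sum()
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
mask = best_transposable_mask(w)
print(mask.astype(int))
print(np.all(mask.sum(axis=1) == 2), np.all(mask.sum(axis=0) == 2))  # True True
```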
[AI-71] Position: The Future of Bayesian Prediction Is Prior-Fitted (ICML 2025)
[Quick Read]: This position paper addresses how to perform effective Bayesian inference in data-scarce settings, especially against the backdrop of stagnating real-world data generation and rapidly growing compute. The key to the solution is Prior-Data Fitted Networks (PFNs): by training neural networks on randomly generated artificial datasets, the model captures the prior defined by the data-generating distribution, enabling pre-training compute to be allocated efficiently to low-data scenarios.
Link: https://arxiv.org/abs/2505.23947
Authors: Samuel Müller, Arik Reuter, Noah Hollmann, David Rügamer, Frank Hutter
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted as position paper at ICML 2025
Abstract:Training neural networks on randomly generated artificial datasets yields Bayesian models that capture the prior defined by the dataset-generating distribution. Prior-data Fitted Networks (PFNs) are a class of methods designed to leverage this insight. In an era of rapidly increasing computational resources for pre-training and a near stagnation in the generation of new real-world data in many applications, PFNs are poised to play a more important role across a wide range of applications. They enable the efficient allocation of pre-training compute to low-data scenarios. Originally applied to small Bayesian modeling tasks, the field of PFNs has significantly expanded to address more complex domains and larger datasets. This position paper argues that PFNs and other amortized inference approaches represent the future of Bayesian inference, leveraging amortized learning to tackle data-scarce problems. We thus believe they are a fruitful area of research. In this position paper, we explore their potential and directions to address their current limitations.
[AI-72] Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve
[Quick Read]: This paper asks how multiple LLM agents can collaborate to solve coding problems without knowing their complementary strengths a priori. The key to the solution is a lesson-based collaboration framework with a lesson solicitation-banking-selection mechanism, which lets LLM agents share and learn from each other's successes and failures during the collective solution process, improving overall performance. Experiments show that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi-LLM collaboration methods.
Link: https://arxiv.org/abs/2505.23946
Authors: Yuanzhe Liu, Ryan Deng, Tim Kaler, Xuhao Chen, Charles E. Leiserson, Yao Ma, Jie Chen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments:
Abstract:Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their varied performance occur in several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories and no one dominates others. This observation prompts the question of how one leverages multiple LLM agents to solve a coding problem without knowing their complementary strengths a priori. We argue that a team of agents can learn from each other’s successes and failures so as to improve their own performance. Thus, a lesson is the knowledge produced by an agent and passed on to other agents in the collective solution process. We propose a lesson-based collaboration framework, design the lesson solicitation–banking–selection mechanism, and demonstrate that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi-LLM collaboration methods.
[AI-73] SG-Blend: Learning an Interpolation Between Improved Swish and GELU for Robust Neural Representations
[Quick Read]: This paper addresses the design of activation functions in deep neural networks, aiming for a better balance between model expressivity and gradient stability. The key to the solution is SG-Blend, a novel activation function that dynamically interpolates between an improved SSwish (a first-order symmetric variant of Swish) and the established GELU, adaptively combining their complementary strengths via learnable parameters: SSwish's controlled non-monotonicity and symmetry, and GELU's smooth probabilistic profile, for more universally robust performance.
Link: https://arxiv.org/abs/2505.23942
Authors: Gaurav Sarkar, Jay Gala, Subarna Tripathi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The design of activation functions remains a pivotal component in optimizing deep neural networks. While prevailing choices like Swish and GELU demonstrate considerable efficacy, they often exhibit domain-specific optima. This work introduces SG-Blend, a novel activation function that blends our proposed SSwish, a first-order symmetric variant of Swish and the established GELU through dynamic interpolation. By adaptively blending these constituent functions via learnable parameters, SG-Blend aims to harness their complementary strengths: SSwish’s controlled non-monotonicity and symmetry, and GELU’s smooth, probabilistic profile, to achieve a more universally robust balance between model expressivity and gradient stability. We conduct comprehensive empirical evaluations across diverse modalities and architectures, showing performance improvements across all considered natural language and computer vision tasks and models. These results, achieved with negligible computational overhead, underscore SG-Blend’s potential as a versatile, drop-in replacement that consistently outperforms strong contemporary baselines. The code is available at this https URL.
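A minimal PyTorch sketch of the blending idea follows. The exact SSwish formula is not given in the abstract, so the odd-symmetric Swish variant used here (x * sigmoid(beta * |x|)) and the sigmoid-squashed mixing weight are our assumptions for illustration, not the paper's definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGBlend(nn.Module):
    """Minimal sketch of the SG-Blend idea: a learnable convex blend of a
    symmetric Swish variant ("SSwish", assumed form) and GELU."""
    def __init__(self, alpha_init=0.5, beta_init=1.0):
        super().__init__()
        self.alpha_raw = nn.Parameter(torch.tensor(alpha_init))  # blend weight (pre-sigmoid)
        self.beta = nn.Parameter(torch.tensor(beta_init))        # SSwish slope

    def forward(self, x):
        sswish = x * torch.sigmoid(self.beta * x.abs())  # odd-symmetric Swish variant
        gelu = F.gelu(x)
        alpha = torch.sigmoid(self.alpha_raw)            # keep the mix in (0, 1)
        return alpha * sswish + (1.0 - alpha) * gelu

act = SGBlend()
print(act(torch.linspace(-3, 3, 5)))  # drop-in replacement for nn.GELU()
```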
[AI-74] BIRD: Behavior Induction via Representation-structure Distillation
[Quick Read]: This paper addresses how to transfer human-aligned behavior to models trained on different tasks or data distributions, given that aligned behavior is easily forgotten during fine-tuning and collecting behavior-preserving task-specific data is prohibitively costly. The key to the solution is BIRD (Behavior Induction via Representation-structure Distillation), which transfers aligned behavior by matching the internal representation structure of a student model to that of a teacher, improving out-of-distribution robustness and other alignment properties.
Link: https://arxiv.org/abs/2505.23933
Authors: Galen Pogoncheff, Michael Beyeler
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Human-aligned deep learning models exhibit behaviors consistent with human values, such as robustness, fairness, and honesty. Transferring these behavioral properties to models trained on different tasks or data distributions remains challenging: aligned behavior is easily forgotten during fine-tuning, and collecting task-specific data that preserves this behavior can be prohibitively costly. We introduce BIRD (Behavior Induction via Representation-structure Distillation), a flexible framework for transferring aligned behavior by matching the internal representation structure of a student model to that of a teacher. Applied to out-of-distribution robustness in image classification, BIRD outperforms fine-tuning, transfer learning, and continual learning methods, improving robust accuracy by up to 16% over the next strongest baseline. It remains effective even when the teacher is trained on a much simpler dataset and is 25x smaller than the student. In a large-scale study of over 400 teacher-student pairs, we show that three interpretable and computable properties of the teacher's representations (i.e., task relevance, behavioral relevance, and complementary knowledge) explain up to 85% of the variance in transfer success. These insights offer practical guidance for teacher selection and design. BIRD turns small, well-aligned models into scalable alignment seeds, removing a key bottleneck in deploying safe AI systems in the wild.
[AI-75] Exploring Societal Concerns and Perceptions of AI: A Thematic Analysis through the Lens of Problem-Seeking
[Quick Read]: This paper seeks to clarify the essential differences between artificial and human intelligence in goal setting and problem solving, and thereby what is unique about human intelligence. The key to the solution is a new conceptual framework that distinguishes "problem-seeking" from "problem-solving", emphasizing that humans possess an embodied, emotionally grounded capacity for setting goals, whereas current AI relies mainly on computational processing and lacks the flexibility and experiential grounding of embodied cognition. Through an analysis of YouTube video metadata, the study reveals the public's ambivalent attitudes towards AI and advocates improving emotional and digital literacy to foster responsible AI development, positioning AI as a tool that augments rather than replaces human intelligence.
Link: https://arxiv.org/abs/2505.23930
Authors: Naomi Omeonga wa Kayembe
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 48 pages
Abstract:This study introduces a novel conceptual framework distinguishing problem-seeking from problem-solving to clarify the unique features of human intelligence in contrast to AI. Problem-seeking refers to the embodied, emotionally grounded process by which humans identify and set goals, while problem-solving denotes the execution of strategies aimed at achieving such predefined objectives. The framework emphasizes that while AI excels at efficiency and optimization, it lacks the orientation derived from experiential grounding and the embodiment flexibility intrinsic to human cognition. To empirically explore this distinction, the research analyzes metadata from 157 YouTube videos discussing AI. Conducting a thematic analysis combining qualitative insights with keyword-based quantitative metrics, this mixed-methods approach uncovers recurring themes in public discourse, including privacy, job displacement, misinformation, optimism, and ethical concerns. The results reveal a dual sentiment: public fascination with AI’s capabilities coexists with anxiety and skepticism about its societal implications. The discussion critiques the orthogonality thesis, which posits that intelligence is separable from goal content, and instead argues that human intelligence integrates goal-setting and goal-pursuit. It underscores the centrality of embodied cognition in human reasoning and highlights how AI’s limitations come from its current reliance on computational processing. The study advocates for enhancing emotional and digital literacy to foster responsible AI engagement. It calls for reframing public discourse to recognize AI as a tool that augments – rather than replaces – human intelligence. By positioning problem seeking at the core of cognition and as a critical dimension of intelligence, this research offers new perspectives on ethically aligned and human-centered AI development.
[AI-76] CNN-LSTM Hybrid Model for AI-Driven Prediction of COVID-19 Severity from Spike Sequences and Clinical Data
[Quick Read]: This paper addresses accurate prediction of COVID-19 disease severity in order to optimize healthcare resource allocation and patient management. The key to the solution is a hybrid CNN-LSTM deep learning model that combines the strengths of convolutional neural networks (CNNs) in local pattern extraction with long short-term memory networks' (LSTMs) modeling of long-term dependencies, using spike protein sequences and associated clinical metadata to predict disease severity.
Link: https://arxiv.org/abs/2505.23879
Authors: Caio Cheohen, Vinnícius M. S. Gomes, Manuela L. da Silva
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures, 4 tables
Abstract:The COVID-19 pandemic, caused by SARS-CoV-2, highlighted the critical need for accurate prediction of disease severity to optimize healthcare resource allocation and patient management. The spike protein, which facilitates viral entry into host cells, exhibits high mutation rates, particularly in the receptor-binding domain, influencing viral pathogenicity. Artificial intelligence approaches, such as deep learning, offer promising solutions for leveraging genomic and clinical data to predict disease outcomes. Objective: This study aimed to develop a hybrid CNN-LSTM deep learning model to predict COVID-19 severity using spike protein sequences and associated clinical metadata from South American patients. Methods: We retrieved 9,570 spike protein sequences from the GISAID database, of which 3,467 met inclusion criteria after standardization. The dataset included 2,313 severe and 1,154 mild cases. A feature engineering pipeline extracted features from sequences, while demographic and clinical variables were one-hot encoded. A hybrid CNN-LSTM architecture was trained, combining CNN layers for local pattern extraction and an LSTM layer for long-term dependency modeling. Results: The model achieved an F1 score of 82.92%, ROC-AUC of 0.9084, precision of 83.56%, and recall of 82.85%, demonstrating robust classification performance. Training stabilized at 85% accuracy with minimal overfitting. The most prevalent lineages (P.1, AY.99.2) and clades (GR, GK) aligned with regional epidemiological trends, suggesting potential associations between viral genetics and clinical outcomes. Conclusion: The CNN-LSTM hybrid model effectively predicted COVID-19 severity using spike protein sequences and clinical data, highlighting the utility of AI in genomic surveillance and precision public health. Despite limitations, this approach provides a framework for early severity prediction in future outbreaks.
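A schematic PyTorch version of the described architecture is sketched below: Conv1d layers extract local motifs from an encoded spike sequence, an LSTM captures longer-range dependencies, and clinical metadata joins the sequence representation before the classifier head. All layer sizes and the input encoding are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CNNLSTMSeverity(nn.Module):
    """Schematic CNN-LSTM hybrid of the kind the paper describes; every
    hyperparameter here is an illustrative assumption."""
    def __init__(self, vocab=25, emb=32, clinical_dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)         # amino-acid tokens
        self.conv = nn.Sequential(
            nn.Conv1d(emb, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(64, 64, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(64 + clinical_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, seq_tokens, clinical):
        x = self.embed(seq_tokens).transpose(1, 2)    # (B, emb, L)
        x = self.conv(x).transpose(1, 2)              # (B, L/2, 64)
        _, (h, _) = self.lstm(x)                      # final hidden state
        logits = self.head(torch.cat([h[-1], clinical], dim=1))
        return logits.squeeze(-1)                     # severe-vs-mild logit

model = CNNLSTMSeverity()
out = model(torch.randint(0, 25, (4, 128)), torch.randn(4, 8))
print(out.shape)  # torch.Size([4])
```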
[AI-77] Actor-Critic based Online Data Mixing For Language Model Pre-Training
[Quick Read]: This paper addresses the impact of pretraining data coverage and composition on the generalization ability of large language models (LLMs), while reducing the carbon footprint and financial cost of training. Existing methods fail to adapt to shifting data distributions during training, and online data mixing (ODM), though it introduces a multi-armed bandit sampler, ignores intra-domain interactions. The key to the proposed actor-critic based online data mixing (AC-ODM) is to capture dynamically varying domain weights with auxiliary actor-critic networks and to model intra-domain interactions through the reward function, improving both data-mixing efficiency and model performance.
Link: https://arxiv.org/abs/2505.23878
Authors: Jing Ma, Chenhao Dang, Mingjie Liao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures
Abstract:The coverage and composition of pretraining data significantly impacts the generalization ability of Large Language Models (LLMs). To reduce the carbon footprint and financial costs of training, some data mixing methods, which applied the optimized domain weights of a small proxy model to train a larger one, were proposed. However, these methods did not evolve with the training dynamics. The existing online data mixing (ODM) method addressed this limitation by applying the multi-armed bandit algorithm as the data sampling strategy. Yet, it did not consider the intra-domain interactions. In this paper, we develop an actor-critic based online data mixing (AC-ODM) method, which captures the varying domain weights by auxiliary actor-critic networks and considers the intra-domain interactions with the reward function. While constructing the dataset to pretrain a large target LLM, we directly apply the actor, which is trained with a small proxy LLM as the environment, as the sampling strategy. The transfer of the sampling strategy not only ensures the efficiency of dynamical data mixing, but also expedites the convergence of pretraining the target LLM. Numerical results demonstrate that AC-ODM-410M, which invokes the sampling strategy obtained by a proxy LLM with 410M parameters, reaches the optimal validation perplexity of ODM 71% faster, improves zero-shot MMLU accuracy by 27.5%, and is about 2.23x better on pass@1 of the HumanEval benchmark.
[AI-78] A comparative analysis of a neural network with calculated weights and a neural network with random generation of weights based on the training dataset size
[Quick Read]: This paper addresses the slow training and sensitivity to training-set size caused by randomly initialized weights in conventional neural networks. The key to the solution is to pre-calculate the weights of a multilayer perceptron (MLP) using analytic formulas, which yields much faster training and greater robustness to reductions of the training dataset.
Link: https://arxiv.org/abs/2505.23876
Authors: Polad Geidarov
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The paper discusses the capabilities of multilayer perceptron neural networks implementing metric recognition methods, for which the values of the weights are calculated analytically by formulas. Comparative experiments in training a neural network with pre-calculated weights and with random initialization of weights on different sizes of the MNIST training dataset are carried out. The results of the experiments show that a multilayer perceptron with pre-calculated weights can be trained much faster and is much more robust to the reduction of the training dataset.
[AI-79] A Benchmark Dataset for Graph Regression with Homogeneous and Multi-Relational Variants
[Quick Read]: This paper addresses the limited diversity of benchmark datasets for graph-level regression: existing public benchmarks concentrate on molecular graphs and citation networks, hindering progress on models that must generalize across homogeneous and heterogeneous graph structures. The key to the solution is RelSC, a dataset built from program graphs that combine syntactic and semantic information extracted from source code, with each graph labelled by the execution-time cost of the corresponding program, a continuous target that differs markedly from existing benchmarks. RelSC comes in two complementary variants: RelSC-H supplies rich node features under a single (homogeneous) edge type, while RelSC-M preserves the original multi-relational structure, connecting nodes through multiple edge types that encode distinct semantic relationships, helping researchers probe how representation choice influences model behaviour.
Link: https://arxiv.org/abs/2505.23875
Authors: Peter Samoaa, Marcus Vukojevic, Morteza Haghir Chehreghani, Antonio Longa
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Graph-level regression underpins many real-world applications, yet public benchmarks remain heavily skewed toward molecular graphs and citation networks. This limited diversity hinders progress on models that must generalize across both homogeneous and heterogeneous graph structures. We introduce RelSC, a new graph-regression dataset built from program graphs that combine syntactic and semantic information extracted from source code. Each graph is labelled with the execution-time cost of the corresponding program, providing a continuous target variable that differs markedly from those found in existing benchmarks. RelSC is released in two complementary variants. RelSC-H supplies rich node features under a single (homogeneous) edge type, while RelSC-M preserves the original multi-relational structure, connecting nodes through multiple edge types that encode distinct semantic relationships. Together, these variants let researchers probe how representation choice influences model behaviour. We evaluate a diverse set of graph neural network architectures on both variants of RelSC. The results reveal consistent performance differences between the homogeneous and multi-relational settings, emphasising the importance of structural representation. These findings demonstrate RelSC’s value as a challenging and versatile benchmark for advancing graph regression methods.
[AI-80] KGMark: A Diffusion Watermark for Knowledge Graphs
[Quick Read]: This paper addresses the challenge of applying watermarking to dynamic knowledge graphs (KGs): existing watermarking methods target static text or image data and struggle with the spatial and temporal variations of structured data. The key to the solution is the KGMARK framework, which adapts the watermark to spatial variations through a clustering-based alignment method, hardens it against various attacks with a redundant embedding strategy for robustness to temporal variations, and introduces a learnable mask matrix to improve the transparency of the diffusion fingerprints.
Link: https://arxiv.org/abs/2505.23873
Authors: Hongrui Peng, Haolang Lu, Yuanlong Yu, Weiye Fu, Kun Wang, Guoshun Nan
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 20 pages, 6 figures
Abstract:Knowledge graphs (KGs) are ubiquitous in numerous real-world applications, and watermarking facilitates protecting intellectual property and preventing potential harm from AI-generated content. Existing watermarking methods mainly focus on static plain text or image data, while they can hardly be applied to dynamic graphs due to spatial and temporal variations of structured data. This motivates us to propose KGMARK, the first graph watermarking framework that aims to generate robust, detectable, and transparent diffusion fingerprints for dynamic KG data. Specifically, we propose a novel clustering-based alignment method to adapt the watermark to spatial variations. Meanwhile, we present a redundant embedding strategy to harden the diffusion watermark against various attacks, facilitating the robustness of the watermark to the temporal variations. Additionally, we introduce a novel learnable mask matrix to improve the transparency of diffusion fingerprints. By doing so, our KGMARK properly tackles the variation challenges of structured data. Experiments on various public benchmarks show the effectiveness of our proposed KGMARK.
[AI-81] ADG: Ambient Diffusion-Guided Dataset Recovery for Corruption-Robust Offline Reinforcement Learning
[Quick Read]: This paper addresses the challenges that noisy, error-prone real-world data collected from sensors or human input poses to offline reinforcement learning (offline RL), in particular corruption in high-dimensional state spaces and the simultaneous corruption of multiple dataset elements. The key to the solution is Ambient Diffusion-Guided Dataset Recovery (ADG), the first method to apply diffusion models to data recovery for offline RL. ADG introduces Ambient Denoising Diffusion Probabilistic Models (DDPM) under approximated distributions, uses their noise-prediction property to distinguish clean from corrupted data, trains a standard DDPM on the clean subset, and finally uses that model to refine the corrupted data, improving data quality and strengthening the robustness of offline RL.
Link: https://arxiv.org/abs/2505.23871
Authors: Zeyuan Liu, Zhihe Yang, Jiawei Xu, Rui Yang, Jiafei Lyu, Baoxiang Wang, Yunjian Xu, Xiu Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Real-world datasets collected from sensors or human inputs are prone to noise and errors, posing significant challenges for applying offline reinforcement learning (RL). While existing methods have made progress in addressing corrupted actions and rewards, they remain insufficient for handling corruption in high-dimensional state spaces and for cases where multiple elements in the dataset are corrupted simultaneously. Diffusion models, known for their strong denoising capabilities, offer a promising direction for this problem-but their tendency to overfit noisy samples limits their direct applicability. To overcome this, we propose Ambient Diffusion-Guided Dataset Recovery (ADG), a novel approach that pioneers the use of diffusion models to tackle data corruption in offline RL. First, we introduce Ambient Denoising Diffusion Probabilistic Models (DDPM) from approximated distributions, which enable learning on partially corrupted datasets with theoretical guarantees. Second, we use the noise-prediction property of Ambient DDPM to distinguish between clean and corrupted data, and then use the clean subset to train a standard DDPM. Third, we employ the trained standard DDPM to refine the previously identified corrupted data, enhancing data quality for subsequent offline RL training. A notable strength of ADG is its versatility-it can be seamlessly integrated with any offline RL algorithm. Experiments on a range of benchmarks, including MuJoCo, Kitchen, and Adroit, demonstrate that ADG effectively mitigates the impact of corrupted data and improves the robustness of offline RL under various noise settings, achieving state-of-the-art results.
[AI-82] MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection
[Quick Read]: This paper addresses the excessive parameter and memory consumption of fine-tuning large foundation models, which makes it hard to balance efficiency and accuracy. The key to the solution is MaCP (Minimal yet Mighty adaptive Cosine Projection), which exploits the superior energy compaction and decorrelation properties of the cosine projection: the weight change from low-rank adaptation is projected into the discrete cosine space, partitioned hierarchically across levels of the spectrum, and only each partition's most critical frequency components are retained, preserving model performance while significantly reducing computational complexity and memory requirements.
Link: https://arxiv.org/abs/2505.23870
Authors: Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, Andy D. Pimentel, Anuj Pathania
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2410.09103
Abstract:We present a new adaptation method MaCP, Minimal yet Mighty adaptive Cosine Projection, that achieves exceptional performance while requiring minimal parameters and memory for fine-tuning large foundation models. Its general idea is to exploit the superior energy compaction and decorrelation properties of cosine projection to improve both model efficiency and accuracy. Specifically, it projects the weight change from the low-rank adaptation into the discrete cosine space. Then, the weight change is partitioned over different levels of the discrete cosine spectrum, and each partition’s most critical frequency components are selected. Extensive experiments demonstrate the effectiveness of MaCP across a wide range of single-modality tasks, including natural language understanding, natural language generation, text summarization, as well as multi-modality tasks such as image classification and video understanding. MaCP consistently delivers superior accuracy, significantly reduced computational complexity, and lower memory requirements compared to existing alternatives.
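The core projection step can be sketched in a few lines with SciPy's DCT. Note that the flat top-k coefficient selection below stands in for MaCP's hierarchical per-level partitioning, which the abstract describes but does not fully specify; keep_ratio is an illustrative knob.

```python
import numpy as np
from scipy.fft import dctn, idctn

def cosine_project_update(delta_w, keep_ratio=0.05):
    """Sketch of the core MaCP idea: move a (low-rank) weight change into
    the discrete cosine domain, keep only the highest-energy coefficients,
    and map back. Flat top-k selection is our simplification of the
    paper's hierarchical spectrum partitioning."""
    coeffs = dctn(delta_w, norm="ortho")
    k = max(1, int(keep_ratio * coeffs.size))
    thresh = np.partition(np.abs(coeffs).ravel(), -k)[-k]  # k-th largest magnitude
    sparse = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)
    return idctn(sparse, norm="ortho")

dw = np.random.default_rng(0).normal(size=(64, 64))
dw_hat = cosine_project_update(dw, keep_ratio=0.1)
print(np.linalg.norm(dw - dw_hat) / np.linalg.norm(dw))  # relative compression error
```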
[AI-83] Noise-Robustness Through Noise: Asymmetric LoRA Adaption with Poisoning Expert
[Quick Read]: This paper addresses the sensitivity of pre-trained language models to noisy data during downstream adaptation: existing parameter-efficient fine-tuning methods are easily disturbed by noise, and conventional noise handling either relies on laborious data pre-processing or on architecture modifications prone to error accumulation. The key to the proposed noise-robust adaptation via asymmetric LoRA poisoning experts (LoPE) is to enhance robustness using only generated noisy data: a two-stage paradigm injects noise into a dedicated poisoning expert during fine-tuning to sharpen its noise discrimination and handling, and at inference this expert is selectively masked so that the purified knowledge of the normal experts yields noise-robust output.
Link: https://arxiv.org/abs/2505.23868
Authors: Zhaokun Wang, Jinyu Guo, Jingwen Pu, Lingfeng Chen, Hongli Pu, Jie Ou, Libo Qin, Wenhong Tian
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Current parameter-efficient fine-tuning methods for adapting pre-trained language models to downstream tasks are susceptible to interference from noisy data. Conventional noise-handling approaches either rely on laborious data pre-processing or employ model architecture modifications prone to error accumulation. In contrast to existing noise-process paradigms, we propose a noise-robust adaptation method via asymmetric LoRA poisoning experts (LoPE), a novel framework that enhances model robustness to noise only with generated noisy data. Drawing inspiration from the mixture-of-experts architecture, LoPE strategically integrates a dedicated poisoning expert in an asymmetric LoRA configuration. Through a two-stage paradigm, LoPE performs noise injection on the poisoning expert during fine-tuning to enhance its noise discrimination and processing ability. During inference, we selectively mask the dedicated poisoning expert to leverage purified knowledge acquired by normal experts for noise-robust output. Extensive experiments demonstrate that LoPE achieves strong performance and robustness purely through the low-cost noise injection, which completely eliminates the requirement of data cleaning.
[AI-84] Towards Understanding The Calibration Benefits of Sharpness-Aware Minimization
[Quick Read]: This paper addresses the poor calibration and overconfidence of deep neural networks in safety-critical applications, which can have disastrous consequences. The key to the solution is the training method sharpness-aware minimization (SAM), which implicitly maximizes the entropy of the predictive distribution and thereby learns better-calibrated models, effectively mitigating overconfidence. Building on this finding, the authors further propose CSAM, a variant of SAM that improves calibration even more.
Link: https://arxiv.org/abs/2505.23866
Authors: Chengli Tan, Yubo Zhou, Haishan Ye, Guang Dai, Junmin Liu, Zengjie Song, Jiangshe Zhang, Zixiang Zhao, Yunda Hao, Yong Xu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages
Abstract:Deep neural networks have been increasingly used in safety-critical applications such as medical diagnosis and autonomous driving. However, many studies suggest that they are prone to being poorly calibrated and have a propensity for overconfidence, which may have disastrous consequences. In this paper, unlike standard training such as stochastic gradient descent, we show that the recently proposed sharpness-aware minimization (SAM) counteracts this tendency towards overconfidence. The theoretical analysis suggests that SAM allows us to learn models that are already well-calibrated by implicitly maximizing the entropy of the predictive distribution. Inspired by this finding, we further propose a variant of SAM, coined as CSAM, to ameliorate model calibration. Extensive experiments on various datasets, including ImageNet-1K, demonstrate the benefits of SAM in reducing calibration error. Meanwhile, CSAM performs even better than SAM and consistently achieves lower calibration error than other approaches.
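For readers unfamiliar with SAM, a minimal single-batch SAM update looks roughly like the sketch below: take the gradient, ascend to the approximate worst-case weights within an L2 ball of radius rho, re-evaluate the gradient there, and descend from the original weights. This shows plain SAM only; the paper's CSAM variant adds a calibration-oriented modification not reproduced here.

```python
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    """One sharpness-aware minimization (SAM) update, following the
    standard two-step recipe (Foret et al.)."""
    # 1) gradient at the current weights
    loss_fn(model(x), y).backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    # 2) ascend to the approximate worst case: w + rho * g / ||g||
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append((p, e))
    model.zero_grad()
    # 3) gradient at the perturbed weights
    loss_fn(model(x), y).backward()
    # 4) undo the perturbation, then update with the sharpness-aware gradient
    with torch.no_grad():
        for p, e in eps:
            p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()
```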
[AI-85] Combining Deep Architectures for Information Gain estimation and Reinforcement Learning for multiagent field exploration
[Quick Read]: This paper targets the efficient autonomous systems required for crop monitoring in precision agriculture, where agents must explore large-scale environments while minimizing resource consumption; the core challenge is to infer the number of targets in each grid cell from partial, sequential observations. The key to the solution is a two-stage deep learning framework: a pre-trained long short-term memory network (LSTM) serves as a belief model, updating a probabilistic map of the environment and its associated entropy, which defines the expected information gain (IG) and lets agents prioritize informative regions; in addition, a point-of-view (POV) visibility mask is added to the input to preserve the Markov property under partial observability and avoid revisiting already explored views.
Link: https://arxiv.org/abs/2505.23865
Authors: Emanuele Masiero, Vito Trianni, Giuseppe Vizzari, Dimitri Ognibene
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 pages, presented at RLDM 2025
Abstract:Precision agriculture requires efficient autonomous systems for crop monitoring, where agents must explore large-scale environments while minimizing resource consumption. This work addresses the problem as an active exploration task in a grid environment representing an agricultural field. Each cell may contain targets (e.g., damaged crops) observable from nine predefined points of view (POVs). Agents must infer the number of targets per cell using partial, sequential observations. We propose a two-stage deep learning framework. A pre-trained LSTM serves as a belief model, updating a probabilistic map of the environment and its associated entropy, which defines the expected information gain (IG). This allows agents to prioritize informative regions. A key contribution is the inclusion of a POV visibility mask in the input, preserving the Markov property under partial observability and avoiding revisits to already explored views. Three agent architectures were compared: an untrained IG-based agent selecting actions to maximize entropy reduction; a DQN agent using CNNs over local 3x3 inputs with belief, entropy, and POV mask; and a Double-CNN DQN agent with wider spatial context. Simulations on 20x20 maps showed that the untrained agent performs well despite its simplicity. The DQN agent matches this performance when the POV mask is included, while the Double-CNN agent consistently achieves superior exploration efficiency, especially in larger environments. Results show that uncertainty-aware policies leveraging entropy, belief states, and visibility tracking lead to robust and scalable exploration. Future work includes curriculum learning, multi-agent cooperation with shared rewards, transformer-based models, and intrinsic motivation mechanisms to further enhance learning efficiency and policy generalization.
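A tiny NumPy sketch of the entropy signal that drives exploration: given a per-cell belief over target counts, the cell-wise Shannon entropy indicates where observation is most informative. The shapes and the greedy argmax selection are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def cell_entropy(belief, eps=1e-12):
    """Per-cell Shannon entropy of a probabilistic target-count map.
    belief has shape (H, W, K): belief[i, j, k] = P(k targets in cell i, j).
    High-entropy cells are the most informative to observe next."""
    p = np.clip(belief, eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)  # (H, W)

# Toy 3x3 field with up to 2 targets per cell (K = 3 count hypotheses).
belief = np.full((3, 3, 3), 1 / 3)      # maximally uncertain everywhere
belief[0, 0] = [0.96, 0.02, 0.02]       # one nearly resolved cell
ig = cell_entropy(belief)
next_cell = np.unravel_index(ig.argmax(), ig.shape)  # a maximally uncertain cell
print(ig.round(3), next_cell)
```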
[AI-86] Personalized Subgraph Federated Learning with Differentiable Auxiliary Projections
[Quick Read]: This paper addresses the non-IID challenges of federated learning (FL) on graph-structured data, especially the scenario in which each client holds a different subgraph sampled from a global graph. The key to the solution is the FedAux framework, in which each client jointly trains a local graph neural network (GNN) and a learnable auxiliary projection vector (APV) that differentiably projects node embeddings onto a 1D space; a soft-sorting operation followed by a lightweight 1D convolution refines the embeddings in the ordered space so that the APV effectively captures client-specific information. After local training, the server uses these APVs to compute inter-client similarities and perform similarity-weighted parameter mixing, yielding personalized models while preserving cross-client knowledge transfer.
Link: https://arxiv.org/abs/2505.23864
Authors: Wei Zhuo, Zhaohuan Zhan, Ziduo Yang, Han Yu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Federated learning (FL) on graph-structured data typically faces non-IID challenges, particularly in scenarios where each client holds a distinct subgraph sampled from a global graph. In this paper, we introduce Federated learning with Auxiliary projections (FedAux), a personalized subgraph FL framework that learns to align, compare, and aggregate heterogeneously distributed local models without sharing raw data or node embeddings. In FedAux, each client jointly trains (i) a local GNN and (ii) a learnable auxiliary projection vector (APV) that differentiably projects node embeddings onto a 1D space. A soft-sorting operation followed by a lightweight 1D convolution refines these embeddings in the ordered space, enabling the APV to effectively capture client-specific information. After local training, these APVs serve as compact signatures that the server uses to compute inter-client similarities and perform similarity-weighted parameter mixing, yielding personalized models while preserving cross-client knowledge transfer. Moreover, we provide rigorous theoretical analysis to establish the convergence and rationality of our design. Empirical evaluations across diverse graph benchmarks demonstrate that FedAux substantially outperforms existing baselines in both accuracy and personalization performance.
[AI-87] Mamba Integrated with Physics Principles Masters Long-term Chaotic System Forecasting
[Quick Read]: This paper addresses long-term forecasting of chaotic systems from short-term observations, which is challenging because of the systems' intrinsic sensitivity to initial conditions and the complex geometry of strange attractors. Existing methods typically rely on long-term training data or focus on short-term sequence correlations, and struggle to maintain predictive stability and dynamical coherence over long horizons. The key to the solution is the PhyxMamba framework, which integrates a Mamba-based state-space model with physics-informed principles: it reconstructs the attractor manifold from brief observations via time-delay embeddings to extract essential global dynamical features, and adopts a generative training scheme with multi-token prediction and attractor-geometry regularization as physical constraints, improving prediction accuracy while preserving key statistical invariants.
Link: https://arxiv.org/abs/2505.23863
Authors: Chang Liu, Bohao Zhao, Jingtao Ding, Huandong Wang, Yong Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Long-term forecasting of chaotic systems from short-term observations remains a fundamental and underexplored challenge due to the intrinsic sensitivity to initial conditions and the complex geometry of strange attractors. Existing approaches often rely on long-term training data or focus on short-term sequence correlations, struggling to maintain predictive stability and dynamical coherence over extended horizons. We propose PhyxMamba, a novel framework that integrates a Mamba-based state-space model with physics-informed principles to capture the underlying dynamics of chaotic systems. By reconstructing the attractor manifold from brief observations using time-delay embeddings, PhyxMamba extracts global dynamical features essential for accurate forecasting. Our generative training scheme enables Mamba to replicate the physical process, augmented by multi-token prediction and attractor geometry regularization for physical constraints, enhancing prediction accuracy and preserving key statistical invariants. Extensive evaluations on diverse simulated and real-world chaotic systems demonstrate that PhyxMamba delivers superior long-term forecasting and faithfully captures essential dynamical invariants from short-term data. This framework opens new avenues for reliably predicting chaotic systems under observation-scarce conditions, with broad implications across climate science, neuroscience, epidemiology, and beyond. Our code is open-source at this https URL.
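The time-delay (Takens-style) embedding the abstract mentions is simple to sketch: a scalar series is lifted into delay vectors that trace out a proxy of the attractor. The dimension and lag values below are arbitrary illustrations, not the paper's settings.

```python
import numpy as np

def time_delay_embedding(x, dim=3, tau=2):
    """Takens-style time-delay embedding used to reconstruct an attractor
    manifold from a scalar observation series. Returns an array of shape
    (len(x) - (dim - 1) * tau, dim); row j is [x[j], x[j+tau], ..., x[j+(dim-1)tau]]."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

# Example: embed a quasi-periodic observable into a 3D delay space.
t = np.linspace(0, 50, 2000)
obs = np.sin(t) + 0.5 * np.sin(2.2 * t)
emb = time_delay_embedding(obs, dim=3, tau=15)
print(emb.shape)  # (1970, 3)
```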
[AI-88] BiBLDR: Bidirectional Behavior Learning for Drug Repositioning
[Quick Read]: This paper addresses the difficulty that existing deep-learning-based drug repositioning methods have with effective inference in cold-start scenarios, i.e., for novel drugs lacking disease-association information. The key to the solution is a bidirectional behavior learning strategy (BiBLDR) that recasts drug repositioning as a behavior-sequence learning task: bidirectional behavior sequences are constructed and a two-stage strategy is adopted, first building prototype spaces that characterize the representational attributes of drugs and diseases, and then using the refined prototypes together with the bidirectional sequence data to predict potential drug-disease associations, capturing the interactions between drug and disease features more robustly and precisely.
Link: https://arxiv.org/abs/2505.23861
Authors: Renye Zhang, Mengyun Yang, Qichang Zhao, Jianxin Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Drug repositioning aims to identify potential new indications for existing drugs to reduce the time and financial costs associated with developing new drugs. Most existing deep learning-based drug repositioning methods predominantly utilize graph-based representations. However, graph-based drug repositioning methods struggle to perform effective inference in cold-start scenarios involving novel drugs because of the lack of association information with the diseases. Unlike traditional graph-based approaches, we propose a bidirectional behavior learning strategy for drug repositioning, known as BiBLDR. This innovative framework redefines drug repositioning as a behavior sequential learning task to capture drug-disease interaction patterns. First, we construct bidirectional behavioral sequences based on drug and disease sides. The consideration of bidirectional information ensures a more meticulous and rigorous characterization of the behavioral sequences. Subsequently, we propose a two-stage strategy for drug repositioning. In the first stage, we construct prototype spaces to characterize the representational attributes of drugs and diseases. In the second stage, these refined prototypes and bidirectional behavior sequence data are leveraged to predict potential drug-disease associations. Based on this learning approach, the model can more robustly and precisely capture the interactive relationships between drug and disease features from bidirectional behavioral sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on benchmark datasets. Meanwhile, BiBLDR demonstrates significantly superior performance compared to previous methods in cold-start scenarios. Our code is published in this https URL.
[AI-89] Towards Minimizing Feature Drift in Model Merging: Layer-wise Task Vector Fusion for Adaptive Knowledge Integration
[Quick Read]: This paper addresses the performance degradation caused by knowledge integration in multi-task model merging, i.e., how to consolidate multiple fine-tuned task-specific experts into a unified model with minimal loss. Existing methods reduce the discrepancy between experts and the unified model at the parameter level or through task losses, but parameter-level methods leave a large gap to the theoretical upper bound, while task-loss approaches require costly secondary training. The key observation is that degradation correlates closely with feature drift, the change in the feature representation of the same sample caused by merging; the proposed Layer-wise Optimal Task Vector Merging (LOT Merging) minimizes feature drift layer by layer, casting the problem as convex quadratic optimization with closed-form solutions for the parameters of linear and normalization layers, so that models can be consolidated efficiently through basic matrix operations.
Link: https://arxiv.org/abs/2505.23859
Authors: Wenju Sun, Qingyong Li, Wen Wang, Yang Liu, Yangli-ao Geng, Boyang Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-task model merging aims to consolidate knowledge from multiple fine-tuned task-specific experts into a unified model while minimizing performance degradation. Existing methods primarily approach this by minimizing differences between task-specific experts and the unified model, either from a parameter-level or a task-loss perspective. However, parameter-level methods exhibit a significant performance gap compared to the upper bound, while task-loss approaches entail costly secondary training procedures. In contrast, we observe that performance degradation closely correlates with feature drift, i.e., differences in feature representations of the same sample caused by model merging. Motivated by this observation, we propose Layer-wise Optimal Task Vector Merging (LOT Merging), a technique that explicitly minimizes feature drift between task-specific experts and the unified model in a layer-by-layer manner. LOT Merging can be formulated as a convex quadratic optimization problem, enabling us to analytically derive closed-form solutions for the parameters of linear and normalization layers. Consequently, LOT Merging achieves efficient model consolidation through basic matrix operations. Extensive experiments across vision and vision-language benchmarks demonstrate that LOT Merging significantly outperforms baseline methods, achieving improvements of up to 4.4% (ViT-B/32) over state-of-the-art approaches.
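For a linear layer, minimizing feature drift has the closed form the abstract alludes to: with task activations X_t and task weights W_t, minimizing sum_t ||X_t W - X_t W_t||_F^2 over W gives the normal equations (sum_t X_t^T X_t) W = sum_t X_t^T X_t W_t. A hedged NumPy sketch follows; the ridge term and variable names are our additions, and the paper's exact formulation may differ.

```python
import numpy as np

def lot_merge_linear(X_list, W_list, ridge=1e-6):
    """Closed-form layer-wise merge for a linear layer, in the spirit of
    LOT Merging: solve the normal equations accumulated over tasks. A
    small ridge keeps the system well conditioned."""
    d = W_list[0].shape[0]
    A = ridge * np.eye(d)
    B = np.zeros_like(W_list[0])
    for X, W in zip(X_list, W_list):
        G = X.T @ X          # Gram matrix of task-t input activations
        A += G
        B += G @ W
    return np.linalg.solve(A, B)

rng = np.random.default_rng(0)
Xs = [rng.normal(size=(128, 16)) for _ in range(3)]  # 3 tasks, 128 samples each
Ws = [rng.normal(size=(16, 8)) for _ in range(3)]    # 3 task-specific weights
W_merged = lot_merge_linear(Xs, Ws)
print(W_merged.shape)  # (16, 8)
```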
[AI-90] DATD3: Depthwise Attention Twin Delayed Deep Deterministic Policy Gradient For Model Free Reinforcement Learning Under Output Feedback Control
[Quick Read]: This paper addresses the output-feedback problem of reinforcement learning in real-world applications, where the agent receives only partial state information. The key to the solution is a new Markov decision process formulation, the Output-Feedback Markov Decision Process (OPMDP), together with DATD3, a twin delayed deep deterministic policy gradient algorithm that uses depthwise separable convolution and multi-head attention to encode observation histories, preserving policy expressiveness while avoiding the instability of recurrent models.
Link: https://arxiv.org/abs/2505.23857
Authors: Wuhao Wang, Zhiyong Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reinforcement learning in real-world applications often involves output-feedback settings, where the agent receives only partial state information. To address this challenge, we propose the Output-Feedback Markov Decision Process (OPMDP), which extends the standard MDP formulation to accommodate decision-making based on observation histories. Building on this framework, we introduce Depthwise Attention Twin Delayed Deep Deterministic Policy Gradient (DATD3), a novel actor-critic algorithm that employs depthwise separable convolution and multi-head attention to encode historical observations. DATD3 maintains policy expressiveness while avoiding the instability of recurrent models. Extensive experiments on continuous control tasks demonstrate that DATD3 outperforms existing memory-based and recurrent baselines under both partial and full observability.
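The depthwise separable convolution DATD3 builds on is easy to sketch in PyTorch: a per-channel temporal convolution followed by a 1x1 pointwise mix. Channel counts and history length below are illustrative, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """The kind of building block DATD3 uses to encode observation
    histories: a depthwise (per-channel) temporal convolution followed
    by a 1x1 pointwise convolution that mixes channels."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):          # x: (batch, channels, history_len)
        return self.pointwise(self.depthwise(x))

# Encode a history of 16 partial observations with 8 features each;
# multi-head attention and the actor-critic heads would follow.
enc = DepthwiseSeparableConv1d(channels=8)
h = enc(torch.randn(4, 8, 16))
print(h.shape)  # torch.Size([4, 8, 16])
```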
[AI-91] CADRE: Customizable Assurance of Data Readiness in Privacy-Preserving Federated Learning
[Quick Read]: This paper addresses insufficient data quality and readiness on clients in federated learning (FL), which is especially hard to guarantee in privacy-preserving federated learning (PPFL) because data access is restricted. The key to the proposed CADRE (Customizable Assurance of Data REadiness) is to let users define custom data readiness (DR) standards, metrics, rules, and remedies tailored to specific FL tasks, and to generate comprehensive DR reports based on them, optimizing dataset preparation while preserving privacy.
Link: https://arxiv.org/abs/2505.23849
Authors: Kaveen Hiniduma, Zilinghan Li, Aditya Sinha, Ravi Madduri, Suren Byna
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 7 figures, 2 tables
Abstract:Privacy-Preserving Federated Learning (PPFL) is a decentralized machine learning approach where multiple clients train a model collaboratively. PPFL preserves privacy and security of the client’s data by not exchanging it. However, ensuring that data at each client is of high quality and ready for federated learning (FL) is a challenge due to restricted data access. In this paper, we introduce CADRE (Customizable Assurance of Data REadiness) for FL, a novel framework that allows users to define custom data readiness (DR) standards, metrics, rules, and remedies tailored to specific FL tasks. Our framework generates comprehensive DR reports based on the user-defined metrics, rules, and remedies to ensure datasets are optimally prepared for FL while preserving privacy. We demonstrate the framework’s practical application by integrating it into an existing PPFL framework. We conducted experiments across six diverse datasets, addressing seven different DR issues. The results illustrate the framework’s versatility and effectiveness in ensuring DR across various dimensions, including data quality, privacy, and fairness. This approach enhances the performance and reliability of FL models as well as utilizes valuable resources by identifying and addressing data-related issues before the training phase.
[AI-92] Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems
[Quick Read]: This position paper addresses security in cross-domain multi-agent large language model (LLM) systems, in particular the novel security challenges that arise when cross-organizational collaboration breaks unified trust assumptions. The key contribution is to identify and categorize seven classes of novel security threats and, for each, to present plausible attack models, security evaluation metrics, and future research directions for the non-traditional risks driven by emergent multi-agent dynamics.
Link: https://arxiv.org/abs/2505.23847
Authors: Ronny Ko, Jiseong Jeong, Shuyuan Zheng, Chuan Xiao, Taewan Kim, Makoto Onizuka, Wonyong Shin
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are rapidly evolving into autonomous agents that cooperate across organizational boundaries, enabling joint disaster response, supply-chain optimization, and other tasks that demand decentralized expertise without surrendering data ownership. Yet, cross-domain collaboration shatters the unified trust assumptions behind current alignment and containment techniques. An agent benign in isolation may, when receiving messages from an untrusted peer, leak secrets or violate policy, producing risks driven by emergent multi-agent dynamics rather than classical software bugs. This position paper maps the security agenda for cross-domain multi-agent LLM systems. We introduce seven categories of novel security challenges, for each of which we also present plausible attacks, security evaluation metrics, and future research guidelines.
[AI-93] DP-RTFL: Differentially Private Resilient Temporal Federated Learning for Trustworthy AI in Regulated Industries
[Quick Read]: This paper addresses practical weaknesses of federated learning (FL): poor fault tolerance, fragility under concurrent client and server failures, and difficulty providing strong verifiable privacy guarantees, which can lead to training disruptions, data loss, compromised model integrity, and violations of data protection regulations. The key to the proposed Differentially Private Resilient Temporal Federated Learning (DP-RTFL) framework is to integrate client-side local differential privacy (LDP) with resilient temporal state management and integrity verification mechanisms (such as hash-based commitments, referred to as zero-knowledge integrity proofs, or ZKIPs) to ensure training continuity, precise state recovery, and strong data privacy.
Link: https://arxiv.org/abs/2505.23813
Authors: Abhijit Talluri
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 6 pages (IEEE conference format), 10 figures. Source code available at this https URL
Abstract:Federated Learning (FL) has emerged as a critical paradigm for enabling privacy-preserving machine learning, particularly in regulated sectors such as finance and healthcare. However, standard FL strategies often encounter significant operational challenges related to fault tolerance, system resilience against concurrent client and server failures, and the provision of robust, verifiable privacy guarantees essential for handling sensitive data. These deficiencies can lead to training disruptions, data loss, compromised model integrity, and non-compliance with data protection regulations (e.g., GDPR, CCPA). This paper introduces Differentially Private Resilient Temporal Federated Learning (DP-RTFL), an advanced FL framework designed to ensure training continuity, precise state recovery, and strong data privacy. DP-RTFL integrates local Differential Privacy (LDP) at the client level with resilient temporal state management and integrity verification mechanisms, such as hash-based commitments (referred to as Zero-Knowledge Integrity Proofs or ZKIPs in this context). The framework is particularly suited for critical applications like credit risk assessment using sensitive financial data, aiming to be operationally robust, auditable, and scalable for enterprise AI deployments. The implementation of the DP-RTFL framework is available as open-source.
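The client-side LDP step can be sketched as clip-then-noise over a model update, as below. The clipping bound and noise multiplier are placeholders; a real deployment would calibrate sigma to a target (epsilon, delta) with a standard DP accountant, and DP-RTFL additionally layers state management and integrity proofs on top.

```python
import numpy as np

def ldp_sanitize_update(update, clip_norm=1.0, sigma=0.8, rng=None):
    """Client-side sanitization in the spirit of DP-RTFL's local DP step:
    clip the model update to a fixed L2 norm, then add Gaussian noise
    scaled to the clipping bound before anything leaves the client.
    The (clip_norm, sigma) values are illustrative assumptions."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))  # bound sensitivity
    noise = rng.normal(0.0, sigma * clip_norm, size=update.shape)
    return clipped + noise

raw_update = np.random.default_rng(0).normal(size=100)
print(np.linalg.norm(ldp_sanitize_update(raw_update)))  # noisy, bounded update
```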
[AI-94] ADA: Automated Moving Target Defense for AI Workloads via Ephemeral Infrastructure-Native Rotation in Kubernetes
[Quick Read]: This paper addresses the insufficiency of traditional static defense mechanisms for AI workloads, which rely on complex and expensive confidential or trusted computing solutions to protect the underlying compute platform. The key to the solution is the Adaptive Defense Agent (ADA), which continuously and automatically rotates AI workloads at the infrastructure level, exploiting the inherent ephemerality of Kubernetes pods to systematically invalidate attacker assumptions and disrupt potential kill chains. ADA applies chaos-engineering principles as a continuous proactive defense, marking a paradigm shift from static defenses to a dynamic, zero-trust model.
Link: https://arxiv.org/abs/2505.23805
Authors: Akram Sheriff, Ken Huang, Zsolt Nemeth, Madjid Nakhjiri
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper introduces the Adaptive Defense Agent (ADA), an innovative Automated Moving Target Defense (AMTD) system designed to fundamentally enhance the security posture of AI workloads. ADA operates by continuously and automatically rotating these workloads at the infrastructure level, leveraging the inherent ephemerality of Kubernetes pods. This constant managed churn systematically invalidates attacker assumptions and disrupts potential kill chains by regularly destroying and respawning AI service instances. This methodology, applying principles of chaos engineering as a continuous, proactive defense, offers a paradigm shift from traditional static defenses that rely on complex and expensive confidential or trusted computing solutions to secure the underlying compute platforms, while at the same time agnostically supporting the latest advancements in agentic and nonagentic AI ecosystems and solutions such as agent-to-agent (A2A) communication frameworks or model context protocols (MCP). This AI-native infrastructure design, relying on the widely proliferated cloud-native Kubernetes technologies, facilitates easier deployment, simplifies maintenance through an inherent zero trust posture achieved by rotation, and promotes faster adoption. We posit that ADA’s novel approach to AMTD provides a more robust, agile, and operationally efficient zero-trust model for AI services, achieving security through proactive environmental manipulation rather than reactive patching.
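Operationally, the rotation idea reduces to periodically killing pods so their controllers respawn fresh replicas. A minimal sketch with the official Kubernetes Python client follows; the namespace, label selector, and interval are hypothetical, and production use would add PodDisruptionBudgets and health checks around this loop.

```python
import random
import time
from kubernetes import client, config  # official Kubernetes Python client

def rotate_ai_workloads(namespace="ai-workloads", label="app=inference",
                        interval_s=300):
    """Minimal moving-target-defense loop in the spirit of ADA: every
    interval, delete one pod matching the label so its Deployment or
    ReplicaSet respawns a fresh instance, continuously invalidating any
    attacker foothold inside a running pod."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    while True:
        pods = v1.list_namespaced_pod(namespace, label_selector=label).items
        if pods:
            victim = random.choice(pods)
            print(f"rotating {victim.metadata.name}")
            v1.delete_namespaced_pod(victim.metadata.name, namespace)
        time.sleep(interval_s)
```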
zh
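以下为基于 Kubernetes 官方 Python 客户端的轮换逻辑概念草图,仅说明"周期性销毁由 Deployment 托管的 Pod、令其自动重生"这一核心思路;命名空间与标签选择器(app=ai-workload)均为假设值,并非 ADA 的实际实现。
```python
# 概念示意:假设 AI 工作负载由 Deployment 管理,Pod 被删除后会自动重建,
# 从而实现"销毁-重生"式的移动目标防御。
import random
import time
from kubernetes import client, config

def rotate_ai_pods(namespace="ai-serving", interval_s=300):
    config.load_incluster_config()  # 集群内运行;本地调试可改用 config.load_kube_config()
    v1 = client.CoreV1Api()
    while True:
        pods = v1.list_namespaced_pod(namespace, label_selector="app=ai-workload")
        if pods.items:
            victim = random.choice(pods.items)  # 随机选择一个实例销毁
            v1.delete_namespaced_pod(victim.metadata.name, namespace)
        # 随机化轮换间隔,进一步破坏攻击者对时间窗口的假设
        time.sleep(interval_s * random.uniform(0.5, 1.5))
```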
[AI-95] MultiPhishGuard: An LLM-based Multi-Agent System for Phishing Email Detection
【速读】:该论文旨在解决网络钓鱼邮件检测中因对抗性策略演变和异构攻击模式带来的关键挑战。传统检测方法如基于规则的过滤器和黑名单难以应对这些变化,导致误漏检和安全风险。为了解决这一问题,论文提出了一种基于大语言模型(LLM)的多智能体检测系统——MultiPhishGuard,其关键在于通过协同的专业化智能体与对抗感知的强化学习相结合,实现动态适应和自我改进的防御机制。系统包含五个协作智能体,并利用近端策略优化(PPO)算法自动调整决策权重,同时引入对抗性训练循环以增强系统的鲁棒性和检测性能。
链接: https://arxiv.org/abs/2505.23803
作者: Yinuo Xue,Eric Spero,Yun Sing Koh,Giovanni Russello
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Phishing email detection faces critical challenges from evolving adversarial tactics and heterogeneous attack patterns. Traditional detection methods, such as rule-based filters and denylists, often struggle to keep pace with these evolving tactics, leading to false negatives and compromised security. While machine learning approaches have improved detection accuracy, they still face challenges adapting to novel phishing strategies. We present MultiPhishGuard, a dynamic LLM-based multi-agent detection system that synergizes specialized expertise with adversarial-aware reinforcement learning. Our framework employs five cooperative agents (text, URL, metadata, explanation simplifier, and adversarial agents) with automatically adjusted decision weights powered by a Proximal Policy Optimization reinforcement learning algorithm. To address emerging threats, we introduce an adversarial training loop featuring an adversarial agent that generates subtle context-aware email variants, creating a self-improving defense ecosystem and enhancing system robustness. Experimental evaluations on public datasets demonstrate that MultiPhishGuard significantly outperforms Chain-of-Thoughts, single-agent baselines and state-of-the-art detectors, as validated by ablation studies and comparative analyses. Experiments demonstrate that MultiPhishGuard achieves high accuracy (97.89%) with low false positive (2.73%) and false negative rates (0.20%). Additionally, we incorporate an explanation simplifier agent, which provides users with clear and easily understandable explanations for why an email is classified as phishing or legitimate. This work advances phishing defense through dynamic multi-agent collaboration and generative adversarial resilience.
zh
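下面用一个极简示例说明"多智能体打分 + 可学习决策权重"的聚合方式;论文中权重由 PPO 强化学习动态调整,此处仅以固定数值示意,函数与数值均为假设。
```python
import numpy as np

def ensemble_phishing_score(agent_probs, weights):
    """按(由强化学习学得的)权重聚合各专业智能体的钓鱼概率,仅为结构示意。
    agent_probs: 各智能体(文本/URL/元数据等)给出的钓鱼概率
    weights:     决策权重,经 softmax 归一化"""
    w = np.exp(weights) / np.exp(weights).sum()
    return float(np.dot(w, agent_probs))

# 假设三个智能体分别基于正文、URL 与邮件头元数据打分
probs = np.array([0.92, 0.75, 0.40])
weights = np.array([1.2, 0.8, 0.3])   # 实际系统中由 PPO 动态调整
score = ensemble_phishing_score(probs, weights)
print("phishing" if score > 0.5 else "legitimate", round(score, 3))
```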
[AI-96] USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在安全评估方面存在的不足,包括现有基准数据质量差、覆盖范围有限以及模态风险组合不全面,导致评估结果失真和矛盾,从而阻碍了安全问题的发现与治理。论文提出的解决方案关键在于引入统一安全基准(Unified Safety Benchmarks, USB),其核心特征包括高质量查询、广泛的风险类别、全面的模态组合,并同时评估模型对有害查询的脆弱性和对无害查询的过度敏感性。此外,通过设计复杂的数据合成流程生成高质量补充数据,以填补此前未被探索的领域,最终构建了一个更为全面的安全评估基准。
链接: https://arxiv.org/abs/2505.23793
作者: Baolin Zheng,Guanlin Chen,Hongqiong Zhong,Qingyang Teng,Yingshui Tan,Zhendong Liu,Weixun Wang,Jiaheng Liu,Jian Yang,Huiyun Jing,Jincheng Wei,Wenbo Su,Xiaoyong Zhu,Bo Zheng,Kaifu Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite their remarkable achievements and widespread adoption, Multimodal Large Language Models (MLLMs) have revealed significant security vulnerabilities, highlighting the urgent need for robust safety evaluation benchmarks. Existing MLLM safety benchmarks, however, fall short in terms of data quality, coverage, and modal risk combinations, resulting in inflated and contradictory evaluation results, which hinders the discovery and governance of security concerns. Besides, we argue that vulnerabilities to harmful queries and oversensitivity to harmless ones should be considered simultaneously in MLLM safety evaluation, whereas these were previously considered separately. In this paper, to address these shortcomings, we introduce Unified Safety Benchmarks (USB), which is one of the most comprehensive evaluation benchmarks in MLLM safety. Our benchmark features high-quality queries, extensive risk categories, comprehensive modal combinations, and encompasses both vulnerability and oversensitivity evaluations. From the perspective of two key dimensions: risk categories and modality combinations, we demonstrate that the available benchmarks – even the union of the vast majority of them – are far from being truly comprehensive. To bridge this gap, we design a sophisticated data synthesis pipeline that generates extensive, high-quality complementary data addressing previously unexplored aspects. By combining open-source datasets with our synthetic data, our benchmark provides 4 distinct modality combinations for each of the 61 risk sub-categories, covering both English and Chinese across both vulnerability and oversensitivity dimensions.
zh
[AI-97] Zero-Trust Foundation Models: A New Paradigm for Secure and Collaborative Artificial Intelligence for Internet of Things
【速读】:该论文试图解决物联网(IoT)系统中基础模型(Foundation Models, FMs)在安全性和隐私保护方面的挑战,特别是在分布式、异构且可能面临对抗性威胁的环境中实现可信的AI。解决方案的关键在于引入零信任基础模型(Zero-Trust Foundation Models, ZTFMs),通过将持续验证、最小权限访问(Least Privilege Access, LPA)、数据保密性和行为分析等核心原则嵌入到FM的设计、训练和部署生命周期中,从而构建具备自我防御能力的智能生态系统。
链接: https://arxiv.org/abs/2505.23792
作者: Kai Li,Conggai Li,Xin Yuan,Shenghong Li,Sai Zou,Syed Sohail Ahmed,Wei Ni,Dusit Niyato,Abbas Jamalipour,Falko Dressler,Ozgur B. Akan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper focuses on Zero-Trust Foundation Models (ZTFMs), a novel paradigm that embeds zero-trust security principles into the lifecycle of foundation models (FMs) for Internet of Things (IoT) systems. By integrating core tenets, such as continuous verification, least privilege access (LPA), data confidentiality, and behavioral analytics into the design, training, and deployment of FMs, ZTFMs can enable secure, privacy-preserving AI across distributed, heterogeneous, and potentially adversarial IoT environments. We present the first structured synthesis of ZTFMs, identifying their potential to transform conventional trust-based IoT architectures into resilient, self-defending ecosystems. Moreover, we propose a comprehensive technical framework, incorporating federated learning (FL), blockchain-based identity management, micro-segmentation, and trusted execution environments (TEEs) to support decentralized, verifiable intelligence at the network edge. In addition, we investigate emerging security threats unique to ZTFM-enabled systems and evaluate countermeasures, such as anomaly detection, adversarial training, and secure aggregation. Through this analysis, we highlight key open research challenges in terms of scalability, secure orchestration, interpretable threat attribution, and dynamic trust calibration. This survey lays a foundational roadmap for secure, intelligent, and trustworthy IoT infrastructures powered by FMs.
zh
[AI-98] Evaluating Query Efficiency and Accuracy of Transfer Learning-based Model Extraction Attack in Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)框架下模型提取(Model Extraction, ME)攻击对机器学习即服务(Machine Learning as a Service, MLaaS)平台的知识产权(Intellectual Property, IP)威胁问题。其解决方案的关键在于通过实施针对不同联邦客户端的模型提取攻击,并评估攻击效果,同时探索基于迁移学习的方法,利用预训练模型作为提取过程的起点,以提高提取模型的准确性和保真度,从而揭示FL系统在面对ME攻击时的脆弱性及潜在风险。
链接: https://arxiv.org/abs/2505.23791
作者: Sayyed Farid Ahamed,Sandip Roy,Soumya Banerjee,Marc Vucovich,Kevin Choi,Abdul Rahman,Alison Hu,Edward Bowen,Sachin Shetty
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at IEEE IWCMC. 6 pages, 4 Figures, 3 tables
Abstract:Federated Learning (FL) is a collaborative learning framework designed to protect client data, yet it remains highly vulnerable to Intellectual Property (IP) threats. Model extraction (ME) attacks pose a significant risk to Machine Learning as a Service (MLaaS) platforms, enabling attackers to replicate confidential models by querying black-box (without internal insight) APIs. Despite FL’s privacy-preserving goals, its distributed nature makes it particularly susceptible to such attacks. This paper examines the vulnerability of FL-based victim models to two types of model extraction attacks. For various federated clients built under the NVFlare platform, we implemented ME attacks across two deep learning architectures and three image datasets. We evaluate the proposed ME attack performance using various metrics, including accuracy, fidelity, and KL divergence. The experiments show that for different FL clients, the accuracy and fidelity of the extracted model are closely related to the size of the attack query set. Additionally, we explore a transfer learning based approach where pretrained models serve as the starting point for the extraction process. The results indicate that the accuracy and fidelity of the fine-tuned pretrained extraction models are notably higher, particularly with smaller query sets, highlighting potential advantages for attackers.
zh
[AI-99] Mind the Gap: A Practical Attack on GGUF Quantization ICML2025
【速读】:该论文旨在解决后训练量化(post-training quantization)方法在部署大规模语言模型(Large Language Models, LLMs)时存在的安全风险问题,特别是针对GGUF这类复杂量化方法的漏洞。现有攻击手段无法有效应用于此类方法,因此缺乏对其安全性评估。论文提出的解决方案的关键在于利用量化误差——即全精度权重与其量化/反量化版本之间的差异——作为构造恶意量化模型的灵活性来源,从而在保持模型在全精度下表现正常的同时,植入隐藏的恶意行为。
链接: https://arxiv.org/abs/2505.23786
作者: Kazuki Egashira,Robin Staab,Mark Vero,Jingxuan He,Martin Vechev
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICML 2025
Abstract:With the increasing size of frontier LLMs, post-training quantization has become the standard for memory-efficient deployment. Recent work has shown that basic rounding-based quantization schemes pose security risks, as they can be exploited to inject malicious behaviors into quantized models that remain hidden in full precision. However, existing attacks cannot be applied to more complex quantization methods, such as the GGUF family used in the popular ollama and this http URL frameworks. In this work, we address this gap by introducing the first attack on GGUF. Our key insight is that the quantization error – the difference between the full-precision weights and their (de-)quantized version – provides sufficient flexibility to construct malicious quantized models that appear benign in full precision. Leveraging this, we develop an attack that trains the target malicious LLM while constraining its weights based on quantization errors. We demonstrate the effectiveness of our attack on three popular LLMs across nine GGUF quantization data types on three diverse attack scenarios: insecure code generation ( \Delta = 88.7% ), targeted content injection ( \Delta = 85.0% ), and benign instruction refusal ( \Delta = 30.1% ). Our attack highlights that (1) the most widely used post-training quantization method is susceptible to adversarial interferences, and (2) the complexity of quantization schemes alone is insufficient as a defense.
zh
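下面的 numpy 小实验演示了该攻击所依赖的核心事实:落在同一量化区间内的任意全精度权重都会映射到完全相同的量化模型。示例采用最简单的对称取整量化,GGUF 实际的分块量化格式更复杂,此处仅为原理示意。
```python
import numpy as np

def quantize(w, scale):
    return np.round(w / scale).astype(np.int8)

scale = 0.05
w = np.array([0.123, -0.377, 0.501], dtype=np.float32)
q = quantize(w, scale)                  # 量化模型实际使用的整数权重
w_hat = q.astype(np.float32) * scale    # 反量化后的权重
err = w - w_hat                         # 量化误差

# 只要扰动后的权重仍落在同一量化区间(偏移小于半个量化步长,留出边界余量),
# 量化结果就完全不变:攻击者由此获得在全精度下"伪装无害"的自由度
w_tampered = w_hat + np.clip(err + 0.01, -0.49 * scale, 0.49 * scale)
assert np.array_equal(quantize(w_tampered, scale), q)
print("同一个量化模型,对应两个不同的全精度模型")
```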
[AI-100] 4500 Seconds: Small Data Training Approaches for Deep UAV Audio Classification
【速读】:该论文试图解决无人机(UAV)分类中因数据稀缺而导致的挑战,其关键解决方案是采用参数高效微调(PEFT)和数据增强技术来缓解数据不足的问题。研究对比了卷积神经网络(CNN)和基于注意力机制的Transformer模型在无人机分类任务中的性能表现。
链接: https://arxiv.org/abs/2505.23782
作者: Andrew P. Berg,Qian Zhang,Mia Y. Wang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted at the 14th International Conference on Data Science, Technology, and Applications (DATA), 2025
Abstract:Unmanned aerial vehicle (UAV) usage is expected to surge in the coming decade, raising the need for heightened security measures to prevent airspace violations and security threats. This study investigates deep learning approaches to UAV classification focusing on the key issue of data scarcity. To investigate this, we opted to train the models using a total of 4,500 seconds of audio samples, evenly distributed across a 9-class dataset. We leveraged parameter-efficient fine-tuning (PEFT) and data augmentations to mitigate the data scarcity. This paper implements and compares the use of convolutional neural networks (CNNs) and attention-based transformers. Our results show that CNNs outperform transformers by 1-2% in accuracy, while still being more computationally efficient. These early findings, however, point to the potential of transformer models, suggesting that with more data and further optimizations they could outperform CNNs. Future work aims to upscale the dataset to better understand the trade-offs between these approaches.
zh
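论文以数据增强缓解数据稀缺;下面给出音频分类中常见的波形级增强示意(随机时移、加噪、增益扰动)。具体增强组合与论文设置未必一致,参数均为假设值。
```python
import numpy as np

def augment_waveform(x, sr=16000, rng=None):
    """小数据场景下常用的波形级增强示意,仅供参考。"""
    rng = rng or np.random.default_rng()
    x = np.roll(x, rng.integers(-sr // 10, sr // 10))   # ±0.1 秒随机时移
    x = x + rng.normal(0, 0.005, size=x.shape)          # 轻微加性高斯噪声
    return x * rng.uniform(0.8, 1.2)                    # 随机增益扰动

clip = np.random.randn(16000).astype(np.float32)  # 1 秒示例音频
aug = augment_waveform(clip)
```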
[AI-101] More-than-Human Storytelling: Designing Longitudinal Narrative Engagements with Generative AI
【速读】:该论文试图解决长期使用生成式 AI (Generative AI) 叙事代理进行故事创作的用户参与问题,特别是探讨用户与AI叙事者之间的动态关系。其解决方案的关键在于通过“Dreamsmithy”应用实现用户与AI叙事者“Makoto”的日常协作创作,并借助为期两周的日记研究收集用户的反思与互动数据,从而揭示用户在长期使用过程中表现出的复杂情感与社会时间性联结。研究强调了GenAI在持续叙事中的潜力,同时也指出了叙事连贯性、控制权及伦理问题等关键挑战。
链接: https://arxiv.org/abs/2505.23780
作者: Émilie Fabre,Katie Seaborn,Shuta Koiwai,Mizuki Watanabe,Paul Riesch
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: CHI EA '25
Abstract:Longitudinal engagement with generative AI (GenAI) storytelling agents is a timely but less charted domain. We explored multi-generational experiences with “Dreamsmithy,” a daily dream-crafting app, where participants (N = 28) co-created stories with AI narrator “Makoto” every day. Reflections and interactions were captured through a two-week diary study. Reflexive thematic analysis revealed themes like “oscillating ambivalence” and “socio-chronological bonding,” highlighting the complex dynamics that emerged between individuals and the AI narrator over time. Findings suggest that while people appreciated the personal notes, opportunities for reflection, and AI creativity, limitations in narrative coherence and control occasionally caused frustration. The results underscore the potential of GenAI for longitudinal storytelling, but also raise critical questions about user agency and ethics. We contribute initial empirical insights and design considerations for developing adaptive, more-than-human storytelling systems.
zh
[AI-102] A comprehensive survey of cybercrimes in India over the last decade
【速读】:该论文旨在解决日益严峻的网络犯罪(cybercrime)问题,特别是针对数字化转型过程中暴露的安全风险。其解决方案的关键在于加强网络安全措施,包括提升公众意识、开展网络安全教育以及建立完善的监管框架,以保护敏感数据和信息系统的安全。
链接: https://arxiv.org/abs/2505.23770
作者: Sudhanshu Sekhar Tripathy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Since the 1990s, the integration of technology into daily life has led to the creation of an extensive network of interconnected devices, transforming how individuals and organizations operate. However, this digital transformation has also spurred the rise of cybercrime, criminal activities perpetrated through networks or computer systems. Cybercrime has become a global concern, presenting significant challenges to security systems. Although advancements in digital technology have enhanced efficiency, they have also opened new avenues for exploitation by cybercriminals, highlighting the urgent need for advanced cybersecurity measures. The escalating number of cyberattacks and associated risks in the past decade highlights the critical importance of protecting sensitive data and safeguarding information systems. Cybercrimes range from financial fraud and phishing scams to identity theft and online harassment, posing substantial risks to both individuals and organizations. In response, governments, law enforcement agencies, and cybersecurity units have intensified their efforts to address these threats. In recent years, India has experienced a significant surge in cybercrime incidents, with a notable increase in cases involving ransomware, data breaches, and social engineering attacks. The growing penetration of internet services, the expansion of e-commerce, and the rapid adoption of digital payment systems have made individuals and organizations more vulnerable to cyber threats. Key areas affected include banking, healthcare, and government sectors, which are frequently targeted due to the sensitive nature of the data they handle. To combat these risks, there is an increasing focus on public awareness, cybersecurity education, and robust regulatory frameworks. This paper examines cybercrime, prevention strategies, security protocols, and terminology to safeguard digital infrastructure.
zh
[AI-103] Towards Natural Language Communication for Cooperative Autonomous Driving via Self-Play
【速读】:该论文试图解决自动驾驶车辆在交通场景中缺乏人类可理解的通信机制的问题,从而影响其与其它车辆及人类驾驶员的协作与协调。解决方案的关键在于提出一种名为LLM+Debrief的新方法,通过多智能体讨论学习车辆的消息生成和高层决策策略,以生成有意义且人类可理解的自然语言消息,从而提升自动驾驶系统的协作效率与安全性。
链接: https://arxiv.org/abs/2505.18334
作者: Jiaxun Cui,Chen Tang,Jarrett Holtz,Janice Nguyen,Alessandro G. Allievi,Hang Qiu,Peter Stone
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Past work has demonstrated that autonomous vehicles can drive more safely if they communicate with one another than if they do not. However, their communication has often not been human-understandable. Using natural language as a vehicle-to-vehicle (V2V) communication protocol offers the potential for autonomous vehicles to drive cooperatively not only with each other but also with human drivers. In this work, we propose a suite of traffic tasks in autonomous driving where vehicles in a traffic scenario need to communicate in natural language to facilitate coordination in order to avoid an imminent collision and/or support efficient traffic flow. To this end, this paper introduces a novel method, LLM+Debrief, to learn a message generation and high-level decision-making policy for autonomous vehicles through multi-agent discussion. To evaluate LLM agents for driving, we developed a gym-like simulation environment that contains a range of driving scenarios. Our experimental results demonstrate that LLM+Debrief is more effective at generating meaningful and human-understandable natural language messages to facilitate cooperation and coordination than a zero-shot LLM agent. Our code and demo videos are available at this https URL.
zh
[AI-104] Combining Abstract Argumentation and Machine Learning for Efficiently Analyzing Low-Level Process Event Streams
【速读】:该论文试图解决在过程追踪分析中,当追踪事件与参考业务活动之间存在差距时的解释问题,即如何将每个正在进行的追踪事件映射到相应的活动实例步骤。其解决方案的关键在于提出一种数据/计算高效的神经符号方法,该方法结合了基于序列标注器生成高概率候选事件解释与基于抽象论证框架(AAF)的推理机对候选解释进行精炼,从而利用先验知识弥补示例数据的不足,以降低对大量人工标注数据的依赖并提升解释的准确性和可解释性。
链接: https://arxiv.org/abs/2505.05880
作者: Bettina Fazzinga,Sergio Flesca,Filippo Furfaro,Luigi Pontieri,Francesco Scala
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Monitoring and analyzing process traces is a critical task for modern companies and organizations. In scenarios where there is a gap between trace events and reference business activities, this entails an interpretation problem, amounting to translating each event of any ongoing trace into the corresponding step of the activity instance. Building on a recent approach that frames the interpretation problem as an acceptance problem within an Abstract Argumentation Framework (AAF), one can elegantly analyze plausible event interpretations (possibly in an aggregated form), as well as offer explanations for those that conflict with prior process knowledge. Since, in settings where event-to-activity mapping is highly uncertain (or simply under-specified), this reasoning-based approach may yield poorly informative results and heavy computation, one can think of discovering a sequence-tagging model, trained to suggest highly-probable candidate event interpretations in a context-aware way. However, training such a model optimally may require using a large amount of manually-annotated example traces. Considering the urgent need of developing Green AI solutions enabling environmental and societal sustainability (with reduced labor/computational costs and carbon footprint), we propose a data/computation-efficient neuro-symbolic approach to the problem, where the candidate interpretations returned by the example-driven sequence tagger are refined by the AAF-based reasoner. This allows us to also leverage prior knowledge to compensate for the scarcity of example data, as confirmed by experimental results; clearly, this property is particularly useful in settings where data annotation and model optimization costs are subject to stringent constraints.
zh
[AI-105] HARP: A Large-Scale Higher-Order Ambisonic Room Impulse Response Dataset ICASSP2025
【速读】:该论文旨在解决传统麦克风阵列在声场覆盖和空间音频再现中的局限性,以支持更真实的沉浸式音频应用。其解决方案的关键在于提出一种基于叠加原理的64麦克风配置,该配置能够直接在球面谐波域中捕获7阶Ambisonic房间脉冲响应(HOA-RIRs),从而实现高精度的空间音频重建。通过虚拟仿真生成的广泛房间配置数据集,为机器学习在房间声学建模和声场合成中的应用提供了重要的资源。
链接: https://arxiv.org/abs/2411.14207
作者: Shivam Saini,Jürgen Peissig
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Accepted at ICASSP 2025 Workshop. Code to generate uploaded at: this https URL
Abstract:This contribution introduces a dataset of 7th-order Ambisonic Room Impulse Responses (HOA-RIRs), created using the Image Source Method. By employing higher-order Ambisonics, our dataset enables precise spatial audio reproduction, a critical requirement for realistic immersive audio applications. Leveraging the virtual simulation, we present a unique microphone configuration, based on the superposition principle, designed to optimize sound field coverage while addressing the limitations of traditional microphone arrays. The presented 64-microphone configuration allows us to capture RIRs directly in the Spherical Harmonics domain. The dataset features a wide range of room configurations, encompassing variations in room geometry, acoustic absorption materials, and source-receiver distances. A detailed description of the simulation setup is provided alongside for an accurate reproduction. The dataset serves as a vital resource for researchers working on spatial audio, particularly in applications involving machine learning to improve room acoustics modeling and sound field synthesis. It further provides a very high level of spatial resolution and realism crucial for tasks such as source localization, reverberation prediction, and immersive sound reproduction.
zh
[AI-106] A survey of using EHR as real-world evidence for discovering and validating new drug indications
【速读】:该论文试图解决如何利用电子健康记录(Electronic Health Records, EHRs)作为真实世界证据(Real-World Evidence, RWE)来支持新药适应症的发现与验证问题。其解决方案的关键在于系统综述基于EHR的药物再利用方法,涵盖数据来源、处理方法和表征技术,并探讨研究设计与统计框架以评估药物疗效,同时强调大型语言模型(Large Language Models, LLMs)和目标试验模拟在验证过程中的重要作用。
链接: https://arxiv.org/abs/2505.24767
作者: Nabasmita Talukdar,Xiaodan Zhang,Shreya Paithankar,Hui Wang,Bin Chen
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI)
备注:
Abstract:Electronic Health Records (EHRs) have been increasingly used as real-world evidence (RWE) to support the discovery and validation of new drug indications. This paper surveys current approaches to EHR-based drug repurposing, covering data sources, processing methodologies, and representation techniques. It discusses study designs and statistical frameworks for evaluating drug efficacy. Key challenges in validation are discussed, with emphasis on the role of large language models (LLMs) and target trial emulation. By synthesizing recent developments and methodological advances, this work provides a foundational resource for researchers aiming to translate real-world data into actionable drug-repurposing evidence.
zh
[AI-107] Supervised Quantum Machine Learning: A Future Outlook from Qubits to Enterprise Applications
【速读】:该论文旨在探讨监督式量子机器学习(Supervised Quantum Machine Learning, QML)的发展现状与未来方向,重点解决如何利用量子计算资源提升传统机器学习模型的训练与推理效率的问题。其解决方案的关键在于探索变分量子电路、量子神经网络和量子核方法等技术,并结合混合量子-经典工作流,以期在实际应用中实现量子优势。然而,当前研究仍面临噪声、平坦悬崖问题、可扩展性不足以及缺乏对性能提升的理论证明等挑战。
链接: https://arxiv.org/abs/2505.24765
作者: Srikanth Thudumu,Jason Fisher,Hung Du
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: Future outlook and roadmap of QML with 7 pages and 1 figure
Abstract:Supervised Quantum Machine Learning (QML) represents an intersection of quantum computing and classical machine learning, aiming to use quantum resources to support model training and inference. This paper reviews recent developments in supervised QML, focusing on methods such as variational quantum circuits, quantum neural networks, and quantum kernel methods, along with hybrid quantum-classical workflows. We examine recent experimental studies that show partial indications of quantum advantage and describe current limitations including noise, barren plateaus, scalability issues, and the lack of formal proofs of performance improvement over classical methods. The main contribution is a ten-year outlook (2025-2035) that outlines possible developments in supervised QML, including a roadmap describing conditions under which QML may be used in applied research and enterprise systems over the next decade.
zh
[AI-108] Unsupervised Evolutionary Cell Type Matching via Entropy-Minimized Optimal Transport
【速读】:该论文旨在解决跨物种细胞类型进化对应关系的识别问题,这是比较基因组学和进化生物学中的一个基础性挑战。现有方法通常依赖于参考基因组匹配或投影匹配,前者具有设计上的不对称性,后者则可能增加计算复杂度并降低细胞类型层面的生物学可解释性。论文提出的解决方案是OT-MESH,其关键在于利用熵正则化最优传输(entropy-regularized optimal transport, OT)构建一种无监督计算框架,并结合最小化Sinkhorn熵(Minimize Entropy of Sinkhorn, MESH)技术优化传输计划,通过高信噪比(Signal-to-Noise Ratio, SNR)基因筛选和基于细胞类型中心点余弦距离的成本矩阵构建,实现跨物种细胞类型同源性的系统鉴定,从而提升传输计划的稀疏性和对应矩阵的可解释性。
链接: https://arxiv.org/abs/2505.24759
作者: Mu Qiao
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Identifying evolutionary correspondences between cell types across species is a fundamental challenge in comparative genomics and evolutionary biology. Existing approaches often rely on either reference-based matching, which imposes asymmetry by designating one species as the reference, or projection-based matching, which may increase computational complexity and obscure biological interpretability at the cell-type level. Here, we present OT-MESH, an unsupervised computational framework leveraging entropy-regularized optimal transport (OT) to systematically determine cross-species cell type homologies. Our method uniquely integrates the Minimize Entropy of Sinkhorn (MESH) technique to refine the OT plan. It begins by selecting genes with high Signal-to-Noise Ratio (SNR) to capture the most informative features, from which a cost matrix is constructed using cosine distances between cell-type centroids. Importantly, the MESH procedure iteratively refines the cost matrix, leading to a transport plan with significantly enhanced sparsity and interpretability of the resulting correspondence matrices. Applied to retinal bipolar cells (BCs) and retinal ganglion cells (RGCs) from mouse and macaque, OT-MESH accurately recovers known evolutionary relationships and uncovers novel correspondences, one of which was independently validated experimentally. Thus, our framework offers a principled, scalable, symmetric, and interpretable solution for evolutionary cell type mapping, facilitating deeper insights into cellular specialization and conservation across species.
zh
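下面给出熵正则化最优传输的 Sinkhorn 迭代与余弦距离代价矩阵构建的最小示意(使用随机数据,省略了论文中迭代精炼代价矩阵的 MESH 步骤,参数均为示意值):
```python
import numpy as np

def sinkhorn(C, a, b, eps=0.05, n_iter=500):
    """熵正则化最优传输的 Sinkhorn 迭代(示意实现)。
    C: 代价矩阵(如细胞类型中心点间的余弦距离);a, b: 两侧边缘分布。"""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # 传输计划 P

# 余弦距离代价:行/列分别为两个物种的细胞类型中心点(此处用随机数据示意)
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 50)), rng.normal(size=(6, 50))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
C = 1.0 - Xn @ Yn.T
a, b = np.full(5, 1 / 5), np.full(6, 1 / 6)
P = sinkhorn(C, a, b)
print(P.sum(axis=1))  # 应近似等于边缘分布 a
```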
[AI-109] Deep Learning Weather Models for Subregional Ocean Forecasting: A Case Study on the Canary Current Upwelling System
【速读】:该论文试图解决传统海洋动力学模型在计算成本高、速度慢以及对亚区域海洋过程(如中尺度结构)预测能力有限的问题。其解决方案的关键在于将最初为全球天气预报开发的图神经网络(Graph Neural Network, GNN)适应于亚区域海洋预测,利用卫星数据进行训练,并与先进的物理海洋模型进行比较,以评估其在捕捉海洋动力学方面的性能。结果表明,该深度学习模型在复杂海洋动力区域表现出更高的预测精度和更低的RMSE误差,显示出其在提升亚区域中长期海洋预测能力方面的潜力。
链接: https://arxiv.org/abs/2505.24429
作者: Giovanny C-Londoño,Javier Sánchez,Ángel Rodríguez-Santana
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, 8 figures
Abstract:Oceanographic forecasting impacts various sectors of society by supporting environmental conservation and economic activities. Based on global circulation models, traditional forecasting methods are computationally expensive and slow, limiting their ability to provide rapid forecasts. Recent advances in deep learning offer faster and more accurate predictions, although these data-driven models are often trained with global data from numerical simulations, which may not reflect reality. The emergence of such models presents great potential for improving ocean prediction at a subregional domain. However, their ability to predict fine-scale ocean processes, like mesoscale structures, remains largely unknown. This work aims to adapt a graph neural network initially developed for global weather forecasting to improve subregional ocean prediction, specifically focusing on the Canary Current upwelling system. The model is trained with satellite data and compared to state-of-the-art physical ocean models to assess its performance in capturing ocean dynamics. Our results show that the deep learning model surpasses traditional methods in precision despite some challenges in upwelling areas. It demonstrated superior performance in reducing RMSE errors compared to ConvLSTM and the GLORYS reanalysis, particularly in regions with complex oceanic dynamics such as Cape Ghir, Cape Bojador, and Cape Blanc. The model achieved improvements of up to 26.5% relative to ConvLSTM and error reductions of up to 76% in 5-day forecasts compared to the GLORYS reanalysis at these critical locations, highlighting its enhanced capability to capture spatial variability and improve predictive accuracy in complex areas. These findings suggest the viability of adapting meteorological data-driven models for improving subregional medium-term ocean forecasting.
zh
[AI-110] When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds INTERSPEECH2025
【速读】:该论文旨在解决人类语音到非人类声音转换(H2NH-VC)的问题,与以往研究主要关注狗叫声和16或22.05kHz音频转换不同,本文扩展了非语音声音的范围,包括自然声音(如狮子吼叫、鸟鸣)和设计语音(合成低吼声)。为实现多样化非语音声音生成和44.1kHz高质量音频转换,该研究引入了预处理流程和改进的条件变分自编码器(CVAE)基础的H2NH-VC模型,两者均针对人类和非人类语音进行了优化。解决方案的关键在于构建适应多种非语音声音类型的预处理管道和提升模型性能以保证转换质量、自然度和相似性。
链接: https://arxiv.org/abs/2505.24336
作者: Minsu Kang,Seolhee Lee,Choonghyeon Lee,Namhyun Cho
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
备注: INTERSPEECH 2025 accepted
Abstract:Human to non-human voice conversion (H2NH-VC) transforms human speech into animal or designed vocalizations. Unlike prior studies focused on dog-sounds and 16 or 22.05kHz audio transformation, this work addresses a broader range of non-speech sounds, including natural sounds (lion-roars, birdsongs) and designed voice (synthetic growls). To accomodate generation of diverse non-speech sounds and 44.1kHz high-quality audio transformation, we introduce a preprocessing pipeline and an improved CVAE-based H2NH-VC model, both optimized for human and non-human voices. Experimental results showed that the proposed method outperformed baselines in quality, naturalness, and similarity MOS, achieving effective voice conversion across diverse non-human timbres. Demo samples are available at this https URL
zh
[AI-111] Attractor learning for spatiotemporally chaotic dynamical systems using echo state networks with transfer learning
【速读】:该论文旨在解决非线性偏微分方程(PDE)——广义Kuramoto-Sivashinsky(gKS)方程在不同参数设置下长期统计模式预测的问题,该方程表现出时空混沌特性。研究提出的解决方案关键在于将回声状态网络(Echo State Network, ESN)与迁移学习相结合,通过迁移学习使ESN适应不同的参数条件,从而有效捕捉由色散关系或空间域长度变化引起的底层混沌吸引子的变化。
链接: https://arxiv.org/abs/2505.24099
作者: Mohammad Shah Alam,William Ott,Ilya Timofeyev
机构: 未知
类目: Dynamical Systems (math.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Machine Learning (stat.ML)
备注:
Abstract:In this paper, we explore the predictive capabilities of echo state networks (ESNs) for the generalized Kuramoto-Sivashinsky (gKS) equation, an archetypal nonlinear PDE that exhibits spatiotemporal chaos. We introduce a novel methodology that integrates ESNs with transfer learning, aiming to enhance predictive performance across various parameter regimes of the gKS model. Our research focuses on predicting changes in long-term statistical patterns of the gKS model that result from varying the dispersion relation or the length of the spatial domain. We use transfer learning to adapt ESNs to different parameter settings and successfully capture changes in the underlying chaotic attractor.
zh
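下面是回声状态网络的最小 numpy 示意:随机储备池固定不动,仅用岭回归训练读出层;迁移学习可理解为冻结储备池、在新参数域的数据上重训(或微调)读出层。超参数均为示意值。
```python
import numpy as np

def esn_fit(U, Y, n_res=300, rho=0.9, ridge=1e-6, seed=0):
    """最小示意:回声状态网络 = 固定随机储备池 + 岭回归读出层。
    U: (T, d_in) 输入序列;Y: (T, d_out) 目标(如一步预测)。"""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, (n_res, U.shape[1]))
    W = rng.normal(size=(n_res, n_res))
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))  # 谱半径缩放,保证回声状态性质
    X = np.zeros((len(U), n_res))
    x = np.zeros(n_res)
    for t, u in enumerate(U):
        x = np.tanh(W_in @ u + W @ x)
        X[t] = x
    # 读出层闭式解;迁移学习时可冻结储备池,仅在新参数域上重训此读出层
    W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y).T
    return W_out

T = 200
U = np.sin(np.linspace(0, 20, T))[:, None]
W_out = esn_fit(U[:-1], U[1:])  # 学习一步预测
```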
[AI-112] Multi-output Classification using a Cross-talk Architecture for Compound Fault Diagnosis of Motors in Partially Labeled Condition
【速读】:该论文旨在解决旋转机械在复杂工况下(如转速变化和负载波动)的故障诊断问题,特别是在需要领域自适应的复合故障场景中的挑战。其解决方案的关键在于提出一种针对部分标注目标数据集的多输出分类(MOC)框架,该框架能够同时对复合故障的严重程度进行分类,并引入了一种新型的交叉通信层结构,实现诊断任务间的有选择性信息共享,从而提升复合故障场景下的分类性能。此外,通过引入频段层归一化方法,进一步提升了电机振动数据的领域自适应能力。
链接: https://arxiv.org/abs/2505.24001
作者: Wonjun Yi,Wonho Jung,Kangmin Jang,Yong-Hwa Park
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: Submitted to Mechanical Systems and Signal Processing on May 9th, 2025
Abstract:The increasing complexity of rotating machinery and the diversity of operating conditions, such as rotating speed and varying torques, have amplified the challenges in fault diagnosis in scenarios requiring domain adaptation, particularly involving compound faults. This study addresses these challenges by introducing a novel multi-output classification (MOC) framework tailored for domain adaptation in partially labeled (PL) target datasets. Unlike conventional multi-class classification (MCC) approaches, the proposed MOC framework classifies the severity levels of compound faults simultaneously. Furthermore, we explore various single-task and multi-task architectures applicable to the MOC formulation, including shared-trunk and cross-talk-based designs, for compound fault diagnosis under PL conditions. Based on this investigation, we propose a novel cross-talk layer structure that enables selective information sharing across diagnostic tasks, effectively enhancing classification performance in compound fault scenarios. In addition, frequency-layer normalization was incorporated to improve domain adaptation performance on motor vibration data. Compound fault conditions were implemented using a motor-based test setup, and the proposed model was evaluated across six domain adaptation scenarios. The experimental results demonstrate its superior macro F1 performance compared to baseline models. We further showed that the proposed model's structural advantage is more pronounced in compound fault settings through a single-fault comparison. We also found that frequency-layer normalization fits the fault diagnosis task better than conventional methods. Lastly, we discuss that this improvement primarily stems from the model's structural ability to leverage inter-fault classification task interactions, rather than from a simple increase in model parameters.
zh
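下面用 PyTorch 勾勒"多输出分类 + 交叉通信层"的一种可能结构:每个诊断任务一条分支,各分支特征经可学习门控矩阵有选择地互相混合。这只是对论文思路的示意性重构,层数、维度与门控形式均为假设。
```python
import torch
import torch.nn as nn

class CrossTalkMOC(nn.Module):
    """多输出分类 + 交叉通信层的结构示意(非论文原始实现)。
    每个诊断任务(某类复合故障的严重度)拥有独立分支,
    交叉通信层用可学习门控在分支间有选择地共享特征。"""
    def __init__(self, d_in=64, d_hid=32, n_tasks=2, n_levels=3):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU()) for _ in range(n_tasks)])
        # 门控矩阵:gate[i, j] 控制任务 j 的特征流向任务 i 的强度
        self.gate = nn.Parameter(torch.eye(n_tasks))
        self.heads = nn.ModuleList(
            [nn.Linear(d_hid, n_levels) for _ in range(n_tasks)])

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches])       # (n_tasks, B, d_hid)
        mixed = torch.einsum("ij,jbd->ibd", self.gate.softmax(-1), feats)
        return [head(mixed[i]) for i, head in enumerate(self.heads)]

model = CrossTalkMOC()
logits = model(torch.randn(8, 64))  # 每个任务各输出一组严重度 logits
```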
[AI-113] A New Deep-learning-Based Approach For mRNA Optimization: High Fidelity Computation Efficiency and Multiple Optimization Factors
【速读】:该论文旨在解决mRNA优化中的三大关键挑战:序列保真度(防止意外的氨基酸改变)、计算效率(速度和可扩展性)以及优化变量的覆盖范围(多目标能力)。此外,现有方法在全面整合与mRNA生命周期和翻译过程相关的因素方面存在不足,包括内在序列特性、二级结构、翻译延伸动力学和tRNA可用性。论文提出的解决方案是引入一种基于深度学习的新型mRNA优化方法——\textbfRNop,其关键在于构建了一个包含超过300万条序列的大规模数据集,并设计了四个专门的损失函数(GPLoss、CAILoss、tAILoss和MFELoss),以同时实现序列保真度控制、物种特异性密码子适应、tRNA可用性和理想的mRNA二级结构特征优化。
链接: https://arxiv.org/abs/2505.23862
作者: Zheng Gong,Ziyi Jiang,Weihao Gao,Deng Zhuo,Lan Ma
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: You can also contact hudenjear@gmail.com for more information
Abstract:The mRNA optimization is critical for therapeutic and biotechnological applications, since sequence features directly govern protein expression levels and efficacy. However, current methods face significant challenges in simultaneously achieving three key objectives: (1) fidelity (preventing unintended amino acid changes), (2) computational efficiency (speed and scalability), and (3) the scope of optimization variables considered (multi-objective capability). Furthermore, existing methods often fall short of comprehensively incorporating the factors related to the mRNA lifecycle and translation process, including intrinsic mRNA sequence properties, secondary structure, translation elongation kinetics, and tRNA availability. To address these limitations, we introduce \textbfRNop, a novel deep learning-based method for mRNA optimization. We collect a large-scale dataset containing over 3 million sequences and design four specialized loss functions, the GPLoss, CAILoss, tAILoss, and MFELoss, which simultaneously enable explicit control over sequence fidelity while optimizing species-specific codon adaptation, tRNA availability, and desirable mRNA secondary structure features. Then, we demonstrate RNop’s effectiveness through extensive in silico and in vivo experiments. RNop ensures high sequence fidelity, achieves significant computational throughput up to 47.32 sequences/s, and yields optimized mRNA sequences resulting in a significant increase in protein expression for functional proteins compared to controls. RNop surpasses current methodologies in both quantitative metrics and experimental validation, heralding a new era of efficient and effective mRNA design. Code and models will be available at this https URL.
zh
[AI-114] Quantum computing and artificial intelligence: status and perspectives
【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)与量子计算(Quantum Computing)之间的相互作用及其协同发展的基础性问题,旨在探索两者如何互相促进并推动技术进步。其解决方案的关键在于提出一个长期的研究议程,涵盖理论研究的协调、量子AI发展与量子硬件路线图的对齐、经典与量子资源的估算(特别是降低和优化能源消耗)、推进这一新兴的混合软件工程学科,以及在考虑社会影响的前提下增强欧洲的工业竞争力。
链接: https://arxiv.org/abs/2505.23860
作者: Giovanni Acampora,Andris Ambainis,Natalia Ares,Leonardo Banchi,Pallavi Bhardwaj,Daniele Binosi,G. Andrew D. Briggs,Tommaso Calarco,Vedran Dunjko,Jens Eisert,Olivier Ezratty,Paul Erker,Federico Fedele,Elies Gil-Fuster,Martin Gärttner,Mats Granath,Markus Heyl,Iordanis Kerenidis,Matthias Klusch,Anton Frisk Kockum,Richard Kueng,Mario Krenn,Jörg Lässig,Antonio Macaluso,Sabrina Maniscalco,Florian Marquardt,Kristel Michielsen,Gorka Muñoz-Gil,Daniel Müssig,Hendrik Poulsen Nautrup,Evert van Nieuwenburg,Roman Orus,Jörg Schmiedmayer,Markus Schmitt,Philipp Slusallek,Filippo Vicentini,Christof Weitenberg,Frank K. Wilhelm
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 32 pages, 3 figures
Abstract:This white paper discusses and explores the various points of intersection between quantum computing and artificial intelligence (AI). It describes how quantum computing could support the development of innovative AI solutions. It also examines use cases of classical AI that can empower research and development in quantum technologies, with a focus on quantum computing and quantum sensing. The purpose of this white paper is to provide a long-term research agenda aimed at addressing foundational questions about how AI and quantum computing interact and benefit one another. It concludes with a set of recommendations and challenges, including how to orchestrate the proposed theoretical work, align quantum AI developments with quantum hardware roadmaps, estimate both classical and quantum resources - especially with the goal of mitigating and optimizing energy consumption - advance this emerging hybrid software engineering discipline, and enhance European industrial competitiveness while considering societal implications.
zh
机器学习
[LG-0] The Road to Generalizable Neuro-Symbolic Learning Should be Paved with Foundation Models
链接: https://arxiv.org/abs/2505.24874
作者: Adam Stein,Aaditya Naik,Neelay Velingker,Mayur Naik,Eric Wong
类目: Machine Learning (cs.LG)
*备注: 19 pages, 11 figures
Abstract:Neuro-symbolic learning was proposed to address challenges with training neural networks for complex reasoning tasks with the added benefits of interpretability, reliability, and efficiency. Neuro-symbolic learning methods traditionally train neural models in conjunction with symbolic programs, but they face significant challenges that limit them to simplistic problems. On the other hand, purely-neural foundation models now reach state-of-the-art performance through prompting rather than training, but they are often unreliable and lack interpretability. Supplementing foundation models with symbolic programs, which we call neuro-symbolic prompting, provides a way to use these models for complex reasoning tasks. Doing so raises the question: What role does specialized model training as part of neuro-symbolic learning have in the age of foundation models? To explore this question, we highlight three pitfalls of traditional neuro-symbolic learning with respect to the compute, data, and programs leading to generalization problems. This position paper argues that foundation models enable generalizable neuro-symbolic solutions, offering a path towards achieving the original goals of neuro-symbolic learning without the downsides of training from scratch.
[LG-1] Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking
链接: https://arxiv.org/abs/2505.24857
作者: Heli Ben-Hamu,Itai Gat,Daniel Severo,Niklas Nolte,Brian Karrer
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent masked diffusion models (MDMs) have shown competitive performance compared to autoregressive models (ARMs) for language modeling. While most literature has focused on performance enhancing sampling procedures, efficient sampling from MDMs has been scarcely explored. We make the observation that often a given sequence of partially masked tokens determines the values of multiple unknown tokens deterministically, meaning that a single prediction of a masked model holds additional information unused by standard sampling procedures. Based on this observation, we introduce EB-Sampler, a simple drop-in replacement for existing samplers, utilizing an Entropy Bounded unmasking procedure that dynamically unmasks multiple tokens in one function evaluation with predefined approximate error tolerance. We formulate the EB-Sampler as part of a broad family of adaptive samplers for which we provide an error analysis that motivates our algorithmic choices. EB-Sampler accelerates sampling from current state of the art MDMs by roughly 2-3x on standard coding and math reasoning benchmarks without loss in performance. We also validate the same procedure works well on smaller reasoning tasks including maze navigation and Sudoku, tasks ARMs often struggle with.
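下面的草图展示"熵约束解码"的基本思路:在一次前向预测后,按预测熵从低到高解码多个掩码位置,累计熵超过容忍度即停止;低熵位置近似被确定性决定,可一次性全部填入。具体的误差准则与实现细节以论文为准,此处仅为假设性示意。
```python
import torch

def entropy_bounded_unmask(logits, masked_pos, budget=1.0):
    """示意:在一次函数评估中按熵从低到高解码多个被掩码的 token,
    直到累计熵超过预设容忍度 budget(具体准则以论文为准)。
    logits: (L, V);masked_pos: 仍被掩码的位置索引列表。"""
    probs = logits.softmax(-1)
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1)  # 各位置的预测熵
    order = sorted(masked_pos, key=lambda i: ent[i].item())
    chosen, total = [], 0.0
    for i in order:
        if total + ent[i].item() > budget and chosen:
            break
        total += ent[i].item()
        chosen.append(i)
    return {i: int(probs[i].argmax()) for i in chosen}  # 位置 -> 解码的 token id

logits = torch.randn(10, 100)
print(entropy_bounded_unmask(logits, masked_pos=[2, 5, 7], budget=1.5))
```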
[LG-2] Chameleon: A MatMul-Free Temporal Convolutional Network Accelerator for End-to-End Few-Shot and Continual Learning from Sequential Data
链接: https://arxiv.org/abs/2505.24852
作者: Douwe den Blanken,Charlotte Frenkel
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 14 pages, 7 figures
Abstract:On-device learning at the edge enables low-latency, private personalization with improved long-term robustness and reduced maintenance costs. Yet, achieving scalable, low-power end-to-end on-chip learning, especially from real-world sequential data with a limited number of examples, is an open challenge. Indeed, accelerators supporting error backpropagation optimize for learning performance at the expense of inference efficiency, while simplified learning algorithms often fail to reach acceptable accuracy targets. In this work, we present Chameleon, leveraging three key contributions to solve these challenges. (i) A unified learning and inference architecture supports few-shot learning (FSL), continual learning (CL) and inference at only 0.5% area overhead to the inference logic. (ii) Long temporal dependencies are efficiently captured with temporal convolutional networks (TCNs), enabling the first demonstration of end-to-end on-chip FSL and CL on sequential data and inference on 16-kHz raw audio. (iii) A dual-mode, matrix-multiplication-free compute array allows either matching the power consumption of state-of-the-art inference-only keyword spotting (KWS) accelerators or enabling 4.3\times higher peak GOPS. Fabricated in 40-nm CMOS, Chameleon sets new accuracy records on Omniglot for end-to-end on-chip FSL (96.8%, 5-way 1-shot, 98.8%, 5-way 5-shot) and CL (82.2% final accuracy for learning 250 classes with 10 shots), while maintaining an inference accuracy of 93.3% on the 12-class Google Speech Commands dataset at an extreme-edge power budget of 3.1 \mu W.
[LG-3] From Invariant Representations to Invariant Data: Provable Robustness to Spurious Correlations via Noisy Counterfactual Matching
链接: https://arxiv.org/abs/2505.24843
作者: Ruqi Bai,Yao Ji,Zeyu Zhou,David I. Inouye
类目: Machine Learning (cs.LG)
*备注:
Abstract:Spurious correlations can cause model performance to degrade in new environments. Prior causality-inspired works aim to learn invariant representations (e.g., IRM) but typically underperform empirical risk minimization (ERM). Recent alternatives improve robustness by leveraging test-time data, but such data may be unavailable in practice. To address these issues, we take a data-centric approach by leveraging invariant data pairs, pairs of samples that would have the same prediction with the optimally robust classifier. We prove that certain counterfactual pairs will naturally satisfy this invariance property and introduce noisy counterfactual matching (NCM), a simple constraint-based method for leveraging invariant pairs for enhanced robustness, even with a small set of noisy pairs-in the ideal case, each pair can eliminate one spurious feature. For linear causal models, we prove that the test domain error can be upper bounded by the in-domain error and a term that depends on the counterfactuals’ diversity and quality. We validate on a synthetic dataset and demonstrate on real-world benchmarks that linear probing on a pretrained backbone improves robustness.
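NCM 的约束思想可以概括为"对(带噪的)反事实样本对施加输出一致性惩罚"。下面是一个假设性的损失函数草图(函数名与 lam 权重均为假设,具体约束形式以论文为准);理想情形下,每一对样本可帮助消除一个伪特征。
```python
import torch

def ncm_loss(model, x, y, pairs, lam=1.0):
    """示意:经验风险 + 对反事实样本对的输出一致性惩罚。
    pairs = (x1, x2):最优稳健分类器应对每对样本给出相同预测。"""
    erm = torch.nn.functional.cross_entropy(model(x), y)
    x1, x2 = pairs
    inv = (model(x1) - model(x2)).pow(2).mean()  # 不变性约束项
    return erm + lam * inv

# 用法示意:即使配对带噪,少量样本对也可增强稳健性
model = torch.nn.Linear(10, 2)
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
pairs = (torch.randn(8, 10), torch.randn(8, 10))
loss = ncm_loss(model, x, y, pairs, lam=0.5)
loss.backward()
```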
[LG-4] Cascading Adversarial Bias from Injection to Distillation in Language Models
链接: https://arxiv.org/abs/2505.24842
作者: Harsh Chaudhari,Jamie Hayes,Matthew Jagielski,Ilia Shumailov,Milad Nasr,Alina Oprea
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model distillation has become essential for creating smaller, deployable language models that retain larger system capabilities. However, widespread deployment raises concerns about resilience to adversarial manipulation. This paper investigates vulnerability of distilled models to adversarial injection of biased content during training. We demonstrate that adversaries can inject subtle biases into teacher models through minimal data poisoning, which propagates to student models and becomes significantly amplified. We propose two propagation modes: Untargeted Propagation, where bias affects multiple tasks, and Targeted Propagation, focusing on specific tasks while maintaining normal behavior elsewhere. With only 25 poisoned samples (0.25% poisoning rate), student models generate biased responses 76.9% of the time in targeted scenarios - higher than 69.4% in teacher models. For untargeted propagation, adversarial bias appears 6x-29x more frequently in student models on unseen tasks. We validate findings across six bias types (targeted advertisements, phishing links, narrative manipulations, insecure coding practices), various distillation methods, and different modalities spanning text and code generation. Our evaluation reveals shortcomings in current defenses - perplexity filtering, bias detection systems, and LLM-based autorater frameworks - against these attacks. Results expose significant security vulnerabilities in distilled models, highlighting need for specialized safeguards. We propose practical design principles for building effective adversarial bias mitigation strategies.
[LG-5] Timing is important: Risk-aware Fund Allocation based on Time-Series Forecasting KDD2025
链接: https://arxiv.org/abs/2505.24835
作者: Fuyuan Lyu,Linfeng Du,Yunpeng Weng,Qiufang Ying,Zhiyan Xu,Wen Zou,Haolun Wu,Xiuqiang He,Xing Tang
类目: Machine Learning (cs.LG)
*备注: Accepted by KDD 2025 ADS Track
Abstract:Fund allocation has been an increasingly important problem in the financial domain. In reality, we aim to allocate the funds to buy certain assets within a certain future period. Naive solutions such as prediction-only or Predict-then-Optimize approaches suffer from goal mismatch. Additionally, the introduction of the SOTA time series forecasting model inevitably introduces additional uncertainty in the predicted result. To solve both problems mentioned above, we introduce a Risk-aware Time-Series Predict-and-Allocate (RTS-PnO) framework, which holds no prior assumption on the forecasting models. Such a framework contains three features: (i) end-to-end training with objective alignment measurement, (ii) adaptive forecasting uncertainty calibration, and (iii) agnostic towards forecasting models. The evaluation of RTS-PnO is conducted over both online and offline experiments. For offline experiments, eight datasets from three categories of financial applications are used: Currency, Stock, and Cryptos. RTS-PnO consistently outperforms other competitive baselines. The online experiment is conducted on the Cross-Border Payment business at FiT, Tencent, and an 8.4% decrease in regret is witnessed when compared with the product-line approach. The code for the offline experiment is available at this https URL.
[LG-6] ByzFL: Research Framework for Robust Federated Learning
链接: https://arxiv.org/abs/2505.24802
作者: Marc González,Rachid Guerraoui,Rafael Pinot,Geovani Rizk,John Stephan,François Taïani
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present ByzFL, an open-source Python library for developing and benchmarking robust federated learning (FL) algorithms. ByzFL provides a unified and extensible framework that includes implementations of state-of-the-art robust aggregators, a suite of configurable attacks, and tools for simulating a variety of FL scenarios, including heterogeneous data distributions, multiple training algorithms, and adversarial threat models. The library enables systematic experimentation via a single JSON-based configuration file and includes built-in utilities for result visualization. Compatible with PyTorch tensors and NumPy arrays, ByzFL is designed to facilitate reproducible research and rapid prototyping of robust FL solutions. ByzFL is available at this https URL, with source code hosted on GitHub: this https URL.
[LG-7] QGAN-based data augmentation for hybrid quantum-classical neural networks
链接: https://arxiv.org/abs/2505.24780
作者: Run-Ze He,Jun-Jian Su,Su-Juan Qin,Zheng-Ping Jin,Fei Gao
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:Quantum neural networks converge faster and achieve higher accuracy than classical models. However, data augmentation in quantum machine learning remains underexplored. To tackle data scarcity, we integrate quantum generative adversarial networks (QGANs) with hybrid quantum-classical neural networks (HQCNNs) to develop an augmentation framework. We propose two strategies: a general approach to enhance data processing and classification across HQCNNs, and a customized strategy that dynamically generates samples tailored to the HQCNN’s performance on specific data categories, improving its ability to learn from complex datasets. Simulation experiments on the MNIST dataset demonstrate that QGAN outperforms traditional data augmentation methods and classical GANs. Compared to baseline DCGAN, QGAN achieves comparable performance with half the parameters, balancing efficiency and effectiveness. This suggests that QGANs can simplify models and generate high-quality data, enhancing HQCNN accuracy and performance. These findings pave the way for applying quantum data augmentation techniques in machine learning.
[LG-8] EVA-MILP: Towards Standardized Evaluation of MILP Instance Generation NEURIPS
链接: https://arxiv.org/abs/2505.24779
作者: Yidong Luo,Chenguang Wang,Jiahao Yang,Fanzeng Xia,Tianshu Yu
类目: Machine Learning (cs.LG)
*备注: The code is available at this https URL
Abstract:Mixed-Integer Linear Programming (MILP) is fundamental to solving complex decision-making problems. The proliferation of MILP instance generation methods, driven by machine learning’s demand for diverse optimization datasets and the limitations of static benchmarks, has significantly outpaced standardized evaluation techniques. Consequently, assessing the fidelity and utility of synthetic MILP instances remains a critical, multifaceted challenge. This paper introduces a comprehensive benchmark framework designed for the systematic and objective evaluation of MILP instance generation methods. Our framework provides a unified and extensible methodology, assessing instance quality across crucial dimensions: mathematical validity, structural similarity, computational hardness, and utility in downstream machine learning tasks. A key innovation is its in-depth analysis of solver-internal features – particularly by comparing distributions of key solver outputs including root node gap, heuristic success rates, and cut plane usage – leveraging the solver’s dynamic solution behavior as an `expert assessment’ to reveal nuanced computational resemblances. By offering a structured approach with clearly defined solver-independent and solver-dependent metrics, our benchmark aims to facilitate robust comparisons among diverse generation techniques, spur the development of higher-quality instance generators, and ultimately enhance the reliability of research reliant on synthetic MILP data. The framework’s effectiveness in systematically comparing the fidelity of instance sets is demonstrated using contemporary generative models.
[LG-9] Diffusion-Based Symbolic Regression
链接: https://arxiv.org/abs/2505.24776
作者: Zachary Bastiani,Robert M. Kirby,Jacob Hochhalter,Shandian Zhe
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion has emerged as a powerful framework for generative modeling, achieving remarkable success in applications such as image and audio synthesis. Enlightened by this progress, we propose a novel diffusion-based approach for symbolic regression. We construct a random mask-based diffusion and denoising process to generate diverse and high-quality equations. We integrate this generative processes with a token-wise Group Relative Policy Optimization (GRPO) method to conduct efficient reinforcement learning on the given measurement dataset. In addition, we introduce a long short-term risk-seeking policy to expand the pool of top-performing candidates, further enhancing performance. Extensive experiments and ablation studies have demonstrated the effectiveness of our approach.
[LG-10] AFLoRA: Adaptive Federated Fine-Tuning of Large Language Models with Resource-Aware Low-Rank Adaption
链接: https://arxiv.org/abs/2505.24773
作者: Yajie Zhou,Xiaoyi Pang,Zhibo Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated fine-tuning has emerged as a promising approach to adapt foundation models to downstream tasks using decentralized data. However, real-world deployment remains challenging due to the high computational and communication demands of fine-tuning Large Language Models (LLMs) on clients with data and system resources that are heterogeneous and constrained. In such settings, the global model’s performance is often bottlenecked by the weakest clients and further degraded by the non-IID nature of local data. Although existing methods leverage parameter-efficient techniques such as Low-Rank Adaptation (LoRA) to reduce communication and computation overhead, they often fail to simultaneously ensure accurate aggregation of low-rank updates and maintain low system costs, thereby hindering overall performance. To address these challenges, we propose AFLoRA, an adaptive and lightweight federated fine-tuning framework for LLMs. AFLoRA decouples shared and client-specific updates to reduce overhead and improve aggregation accuracy, incorporates diagonal matrix-based rank pruning to better utilize local resources, and employs rank-aware aggregation with public data refinement to strengthen generalization under data heterogeneity. Extensive experiments demonstrate that AFLoRA outperforms state-of-the-art methods in both accuracy and efficiency, providing a practical solution for efficient LLM adaptation in heterogeneous environments in the real world.
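作为背景,下面给出标准 LoRA 线性层的最小 PyTorch 示意:冻结预训练权重,仅训练低秩矩阵 A、B,联邦场景下客户端只需通信这两个小矩阵。AFLoRA 在此之上的共享/客户端专属更新解耦与基于对角矩阵的秩剪枝此处未实现,超参数均为示意值。
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """标准 LoRA 线性层示意:冻结 W,仅训练低秩增量 B @ A。"""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)          # 冻结预训练权重
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))    # 零初始化,保证初始增量为 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(512, 512, r=4)
out = layer(torch.randn(2, 512))  # 联邦场景中仅需通信 A、B 两个小矩阵
```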
[LG-11] Adapting to Linear Separable Subsets with Large-Margin in Differentially Private Learning
链接: https://arxiv.org/abs/2505.24737
作者: Erchi Wang,Yuqing Zhu,Yu-Xiang Wang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:This paper studies the problem of differentially private empirical risk minimization (DP-ERM) for binary linear classification. We obtain an efficient (\varepsilon,\delta)-DP algorithm with an empirical zero-one risk bound of \tilde{O}\left(\frac{1}{\gamma^2 \varepsilon n} + \frac{|S_{\mathrm{out}}|}{\gamma n}\right) where n is the number of data points, S_{\mathrm{out}} is an arbitrary subset of data one can remove and \gamma is the margin of linear separation of the remaining data points (after S_{\mathrm{out}} is removed). Here, \tilde{O}(\cdot) hides only logarithmic terms. In the agnostic case, we improve the existing results when the number of outliers is small. Our algorithm is highly adaptive because it does not require knowing the margin parameter \gamma or outlier subset S_{\mathrm{out}}. We also derive a utility bound for the advanced private hyperparameter tuning algorithm.
[LG-12] Feature Attribution from First Principles
链接: https://arxiv.org/abs/2505.24729
作者: Magamed Taimeskhanov,Damien Garreau
类目: Machine Learning (cs.LG)
*备注: 30 pages, 3 figures
Abstract:Feature attribution methods are a popular approach to explain the behavior of machine learning models. They assign importance scores to each input feature, quantifying their influence on the model’s prediction. However, evaluating these methods empirically remains a significant challenge. To bypass this shortcoming, several prior works have proposed axiomatic frameworks that any feature attribution method should satisfy. In this work, we argue that such axioms are often too restrictive, and propose in response a new feature attribution framework, built from the ground up. Rather than imposing axioms, we start by defining attributions for the simplest possible models, i.e., indicator functions, and use these as building blocks for more complex models. We then show that one recovers several existing attribution methods, depending on the choice of atomic attribution. Subsequently, we derive closed-form expressions for attribution of deep ReLU networks, and take a step toward the optimization of evaluation metrics with respect to feature attributions.
[LG-13] Robust Federated Learning against Model Perturbation in Edge Networks
链接: https://arxiv.org/abs/2505.24728
作者: Dongzi Jin,Yong Xiao,Yingyu Li
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted by IEEE ICC 2025
Abstract:Federated Learning (FL) is a promising paradigm for realizing edge intelligence, allowing collaborative learning among distributed edge devices by sharing models instead of raw data. However, the shared models are often assumed to be ideal, which would be inevitably violated in practice due to various perturbations, leading to significant performance degradation. To overcome this challenge, we propose a novel method, termed Sharpness-Aware Minimization-based Robust Federated Learning (SMRFL), which aims to improve model robustness against perturbations by exploring the geometrical property of the model landscape. Specifically, SMRFL solves a min-max optimization problem that promotes model convergence towards a flat minimum by minimizing the maximum loss within a neighborhood of the model parameters. In this way, model sensitivity to perturbations is reduced, and robustness is enhanced since models in the neighborhood of the flat minimum also enjoy low loss values. The theoretical result proves that SMRFL can converge at the same rate as FL without perturbations. Extensive experimental results show that SMRFL significantly enhances robustness against perturbations compared to three baseline methods on two real-world datasets under three perturbation scenarios.
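For reference, a compact sketch of the sharpness-aware minimization step that SMRFL builds on (this is plain SAM; the federated aggregation wrapped around it is omitted):

```python
import torch

def sam_step(model, loss_fn, data, target, optimizer, rho=0.05):
    """One SAM update: ascend to the worst-case point in an L2 ball of
    radius rho around the weights, then descend with the gradient there."""
    loss_fn(model(data), target).backward()           # gradient at w
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    eps = {}
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)    # worst-case perturbation
            p.add_(e)
            eps[p] = e
    optimizer.zero_grad()
    loss_fn(model(data), target).backward()           # gradient at w + e
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)                                 # restore original weights
    optimizer.step()                                  # descend from w
    optimizer.zero_grad()
```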
[LG-14] Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated Approach INTERSPEECH2025
链接: https://arxiv.org/abs/2505.24721
作者: Nick Rossenbach,Benedikt Hilmes,Leon Brackmann,Moritz Gunz,Ralf Schlüter
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
*备注: Accepted for the Blue Sky track at Interspeech 2025
Abstract:Memristor-based hardware offers new possibilities for energy-efficient machine learning (ML) by providing analog in-memory matrix multiplication. Current hardware prototypes cannot fit large neural networks, and related literature covers only small ML models for tasks like MNIST or single word recognition. Simulation can be used to explore how hardware properties affect larger models, but existing software assumes simplified hardware. We propose a PyTorch-based library based on “Synaptogen” to simulate neural network execution with accurately captured memristor hardware properties. For the first time, we show how an ML system with millions of parameters would behave on memristor hardware, using a Conformer trained on the speech recognition task TED-LIUMv2 as an example. With adjusted quantization-aware training, we limit the relative degradation in word error rate to 25% when using a 3-bit weight precision to execute linear operations via simulated analog computation.
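A sketch of the kind of quantization-aware training implied by the 3-bit weight precision: fake-quantize in the forward pass and let gradients pass through unchanged (a standard straight-through estimator; the paper's exact scheme may differ):

```python
import torch

def fake_quant(w, bits=3):
    """Uniform symmetric fake-quantization with a straight-through
    estimator: the forward pass sees quantized weights, the backward
    pass treats the rounding as identity."""
    qmax = 2 ** (bits - 1) - 1           # 3 bits -> integer levels in [-3, 3]
    scale = w.abs().max() / qmax + 1e-12
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()        # STE: gradient of the identity

w = torch.randn(4, 4, requires_grad=True)
fake_quant(w).sum().backward()           # grads flow as if unquantized
```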
[LG-15] PDE-Transformer: Efficient and Versatile Transformers for Physics Simulations ICML2025
链接: https://arxiv.org/abs/2505.24717
作者: Benjamin Holzschuh,Qiang Liu,Georg Kohl,Nils Thuerey
类目: Machine Learning (cs.LG)
*备注: ICML 2025. Code available at this https URL
Abstract:We introduce PDE-Transformer, an improved transformer-based architecture for surrogate modeling of physics simulations on regular grids. We combine recent architectural improvements of diffusion transformers with adjustments specific for large-scale simulations to yield a more scalable and versatile general-purpose transformer architecture, which can be used as the backbone for building large-scale foundation models in physical sciences. We demonstrate that our proposed architecture outperforms state-of-the-art transformer architectures for computer vision on a large dataset of 16 different types of PDEs. We propose to embed different physical channels individually as spatio-temporal tokens, which interact via channel-wise self-attention. This helps to maintain a consistent information density of tokens when learning multiple types of PDEs simultaneously. We demonstrate that our pre-trained models achieve improved performance on several challenging downstream tasks compared to training from scratch and also beat other foundation model architectures for physics simulations.
[LG-16] Quick-Draw Bandits: Quickly Optimizing in Nonstationary Environments with Extremely Many Arms KDD2025
链接: https://arxiv.org/abs/2505.24692
作者: Derek Everett,Fred Lu,Edward Raff,Fernando Camacho,James Holt
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: KDD 2025, Research Track
Abstract:Canonical algorithms for multi-armed bandits typically assume a stationary reward environment where the size of the action space (number of arms) is small. More recently developed methods typically relax only one of these assumptions: existing non-stationary bandit policies are designed for a small number of arms, while Lipschitz, linear, and Gaussian process bandit policies are designed to handle a large (or infinite) number of arms in stationary reward environments under constraints on the reward function. In this manuscript, we propose a novel policy to learn reward environments over a continuous space using Gaussian interpolation. We show that our method efficiently learns continuous Lipschitz reward functions with \mathcal{O}^*(\sqrt{T}) cumulative regret. Furthermore, our method naturally extends to non-stationary problems with a simple modification. We finally demonstrate that our method is computationally favorable (100-10000x faster) and experimentally outperforms sliding Gaussian process policies on datasets with non-stationarity and an extremely large number of arms.
[LG-17] Predicting the Past: Estimating Historical Appraisals with OCR and Machine Learning
链接: https://arxiv.org/abs/2505.24676
作者: Mihir Bhaskar,Jun Tao Luo,Zihan Geng,Asmita Hajra,Junia Howell,Matthew R. Gormley
类目: Machine Learning (cs.LG)
*备注: Accepted to COMPASS 2025
Abstract:Despite well-documented consequences of the U.S. government’s 1930s housing policies on racial wealth disparities, scholars have struggled to quantify its precise financial effects due to the inaccessibility of historical property appraisal records. Many counties still store these records in physical formats, making large-scale quantitative analysis difficult. We present an approach scholars can use to digitize historical housing assessment data, applying it to build and release a dataset for one county. Starting from publicly available scanned documents, we manually annotated property cards for over 12,000 properties to train and validate our methods. We use OCR to label data for an additional 50,000 properties, based on our two-stage approach combining classical computer vision techniques with deep learning-based OCR. For cases where OCR cannot be applied, such as when scanned documents are not available, we show how a regression model based on building feature data can estimate the historical values, and test the generalizability of this model to other counties. With these cost-effective tools, scholars, community activists, and policy makers can better analyze and understand the historical impacts of redlining.
[LG-18] Learning geometry and topology via multi-chart flows
链接: https://arxiv.org/abs/2505.24665
作者: Hanlin Yu,Søren Hauberg,Marcelo Hartmann,Arto Klami,Georgios Arvanitidis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Real-world data often lie on low-dimensional Riemannian manifolds embedded in high-dimensional spaces. This motivates learning degenerate normalizing flows that map between the ambient space and a low-dimensional latent space. However, if the manifold has a non-trivial topology, it can never be correctly learned using a single flow. Instead, multiple flows must be "glued together". In this paper, we first propose a general training scheme for learning such a collection of flows, and secondly we develop the first numerical algorithms for computing geodesics on such manifolds. Empirically, we demonstrate that this leads to highly significant improvements in topology estimation.
[LG-19] Learning Distributions over Permutations and Rankings with Factorized Representations
链接: https://arxiv.org/abs/2505.24664
作者: Daniel Severo,Brian Karrer,Niklas Nolte
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning distributions over permutations is a fundamental problem in machine learning, with applications in ranking, combinatorial optimization, structured prediction, and data association. Existing methods rely on mixtures of parametric families or neural networks with expensive variational inference procedures. In this work, we propose a novel approach that leverages alternative representations for permutations, including Lehmer codes, Fisher-Yates draws, and Insertion-Vectors. These representations form a bijection with the symmetric group, allowing for unconstrained learning using conventional deep learning techniques, and can represent any probability distribution over permutations. Our approach enables a trade-off between expressivity of the model family and computational requirements. In the least expressive and most computationally efficient case, our method subsumes previous families of well established probabilistic models over permutations, including the Mallows model and the Repeated Insertion Model. Experiments indicate our method significantly outperforms current approaches on the jigsaw puzzle benchmark, a common task for permutation learning. However, we argue this benchmark is limited in its ability to assess learning probability distributions, as the target is a delta distribution (i.e., a single correct solution exists). We therefore propose two additional benchmarks: learning cyclic permutations and re-ranking movies based on user preference. We show that our method learns non-trivial distributions even in the least expressive mode, while traditional models fail to even generate valid permutations in this setting.
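As a concrete example of the factorized representations above, the Lehmer code maps a permutation to a digit vector whose entries have independent ranges, which is what allows unconstrained modeling of each coordinate (this is the standard construction, not code from the paper):

```python
def to_lehmer(perm):
    """Lehmer code: digit i counts later elements smaller than perm[i].
    Digit i ranges over 0..n-1-i, so digits can be modeled independently."""
    return [sum(perm[j] < perm[i] for j in range(i + 1, len(perm)))
            for i in range(len(perm))]

def from_lehmer(code):
    """Invert the Lehmer code: digit c picks the (c+1)-th smallest
    symbol still available."""
    pool = list(range(len(code)))
    return [pool.pop(c) for c in code]

assert to_lehmer([2, 0, 1]) == [2, 0, 0]
assert from_lehmer([2, 0, 0]) == [2, 0, 1]
```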
[LG-20] Beyond the Black Box: Interpretability of LLMs in Finance
链接: https://arxiv.org/abs/2505.24650
作者: Hariom Tatsat(Barclays),Ariye Shater(Barclays)
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
*备注: 28 pages, 15 figures
Abstract:Large Language Models (LLMs) exhibit remarkable capabilities across a spectrum of tasks in financial services, including report generation, chatbots, sentiment analysis, regulatory compliance, investment advisory, financial knowledge retrieval, and summarization. However, their intrinsic complexity and lack of transparency pose significant challenges, especially in the highly regulated financial sector, where interpretability, fairness, and accountability are critical. As far as we are aware, this paper presents the first application in the finance domain of understanding and utilizing the inner workings of LLMs through mechanistic interpretability, addressing the pressing need for transparency and control in AI systems. Mechanistic interpretability is the most intuitive and transparent way to understand LLM behavior by reverse-engineering their internal workings. By dissecting the activations and circuits within these models, it provides insights into how specific features or components influence predictions - making it possible not only to observe but also to modify model behavior. In this paper, we explore the theoretical aspects of mechanistic interpretability and demonstrate its practical relevance through a range of financial use cases and experiments, including applications in trading strategies, sentiment analysis, bias, and hallucination detection. While not yet widely adopted, mechanistic interpretability is expected to become increasingly vital as adoption of LLMs increases. Advanced interpretability tools can ensure AI systems remain ethical, transparent, and aligned with evolving financial regulations. In this paper, we have put special emphasis on how these techniques can help unlock interpretability requirements for regulatory and compliance purposes - addressing both current needs and anticipating future expectations from financial regulators globally.
[LG-21] WILTing Trees: Interpreting the Distance Between MPNN Embeddings ICML2025
链接: https://arxiv.org/abs/2505.24642
作者: Masahiro Negishi,Thomas Gärtner,Pascal Welke
类目: Machine Learning (cs.LG)
*备注: 25 pages, 10 figures. Accepted to ICML 2025. See this https URL for code
Abstract:We investigate the distance function learned by message passing neural networks (MPNNs) in specific tasks, aiming to capture the functional distance between prediction targets that MPNNs implicitly learn. This contrasts with previous work, which links MPNN distances on arbitrary tasks to structural distances on graphs that ignore task-specific information. To address this gap, we distill the distance between MPNN embeddings into an interpretable graph distance. Our method uses optimal transport on the Weisfeiler Leman Labeling Tree (WILT), where the edge weights reveal subgraphs that strongly influence the distance between embeddings. This approach generalizes two well-known graph kernels and can be computed in linear time. Through extensive experiments, we demonstrate that MPNNs define the relative position of embeddings by focusing on a small set of subgraphs that are known to be functionally important in the domain.
[LG-22] Stop Guessing: Optimizing Goalkeeper Policies for Soccer Penalty Kicks
链接: https://arxiv.org/abs/2505.24629
作者: Lotte Bransen,Tim Janssen,Jesse Davis
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: 24 pages, 7 figures
Abstract:Penalties are fraught and game-changing moments in soccer games that teams explicitly prepare for. Consequently, there has been substantial interest in analyzing them in order to provide advice to practitioners. From a data science perspective, such analyses suffer from a significant limitation: they make the unrealistic simplifying assumption that goalkeepers and takers select their action – where to dive and where to place the kick – independently of each other. In reality, the choices that some goalkeepers make depend on the taker’s movements and vice-versa. This adds substantial complexity to the problem because not all players have the same action capacities, that is, only some players are capable of basing their decisions on their opponent’s movements. However, the small sample sizes on the player level mean that one may have limited insights into a specific opponent’s capacities. We address these challenges by developing a player-agnostic simulation framework that can evaluate the efficacy of different goalkeeper strategies. It considers a rich set of choices and incorporates information about a goalkeeper’s skills. Our work is grounded in a large dataset of penalties that were annotated by penalty experts and include aspects of both kicker and goalkeeper strategies. We show how our framework can be used to optimize goalkeeper policies in real-world situations.
[LG-23] Rethinking Neural Combinatorial Optimization for Vehicle Routing Problems with Different Constraint Tightness Degrees
链接: https://arxiv.org/abs/2505.24627
作者: Fu Luo,Yaoxin Wu,Zhi Zheng,Zhenkun Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent neural combinatorial optimization (NCO) methods have shown promising problem-solving ability without requiring domain-specific expertise. Most existing NCO methods use training and testing data with a fixed constraint value and lack research on the effect of constraint tightness on the performance of NCO methods. This paper takes the capacity-constrained vehicle routing problem (CVRP) as an example to empirically analyze the NCO performance under different tightness degrees of the capacity constraint. Our analysis reveals that existing NCO methods overfit the capacity constraint, and they can only perform satisfactorily on a small range of the constraint values but poorly on other values. To tackle this drawback of existing NCO methods, we develop an efficient training scheme that explicitly considers varying degrees of constraint tightness and propose a multi-expert module to learn a generally adaptable solving strategy. Experimental results show that the proposed method can effectively overcome the overfitting issue, demonstrating superior performances on the CVRP and CVRP with time windows (CVRPTW) with various constraint tightness degrees.
[LG-24] Multi-criteria Rank-based Aggregation for Explainable AI IJCNN
链接: https://arxiv.org/abs/2505.24612
作者: Sujoy Chatterjee,Everton Romanzini Colombo,Marcos Medeiros Raimundo
类目: Machine Learning (cs.LG)
*备注: Accepted at the 2025 International Joint Conference on Neural Networks (IJCNN)
Abstract:Explainability is crucial for improving the transparency of black-box machine learning models. With the advancement of explanation methods such as LIME and SHAP, various XAI performance metrics have been developed to evaluate the quality of explanations. However, different explainers can provide contrasting explanations for the same prediction, introducing trade-offs across conflicting quality metrics. Although available aggregation approaches improve robustness, reducing explanations’ variability, very limited research employed a multi-criteria decision-making approach. To address this gap, this paper introduces a multi-criteria rank-based weighted aggregation method that balances multiple quality metrics simultaneously to produce an ensemble of explanation models. Furthermore, we propose rank-based versions of existing XAI metrics (complexity, faithfulness and stability) to better evaluate ranked feature importance explanations. Extensive experiments on publicly available datasets demonstrate the robustness of the proposed model across these metrics. Comparative analyses of various multi-criteria decision-making and rank aggregation algorithms showed that TOPSIS and WSUM are the best candidates for this use case.
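Since TOPSIS emerges as one of the best aggregators in the comparison, here is a compact textbook implementation for ranking explainers on several quality metrics at once (the weights and scores are illustrative; the paper's exact setup may differ):

```python
import numpy as np

def topsis(scores, weights=None):
    """Rank alternatives (rows) on criteria (columns, higher is better)
    by relative closeness to the ideal point vs. the anti-ideal point."""
    X = np.asarray(scores, dtype=float)
    w = np.full(X.shape[1], 1 / X.shape[1]) if weights is None else np.asarray(weights)
    V = X / np.linalg.norm(X, axis=0) * w            # weighted, normalized
    ideal, anti = V.max(axis=0), V.min(axis=0)
    d_plus = np.linalg.norm(V - ideal, axis=1)
    d_minus = np.linalg.norm(V - anti, axis=1)
    closeness = d_minus / (d_plus + d_minus + 1e-12)
    return np.argsort(-closeness)                    # best alternative first

# Toy usage: 3 explainers scored on (faithfulness, stability, 1/complexity).
print(topsis([[0.9, 0.6, 0.4], [0.7, 0.8, 0.5], [0.8, 0.7, 0.9]]))
```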
[LG-25] The Gaussian Mixing Mechanism: Renyi Differential Privacy via Gaussian Sketches
链接: https://arxiv.org/abs/2505.24603
作者: Omri Lev,Vishwak Srinivasan,Moshe Shenfeld,Katrina Ligett,Ayush Sekhari,Ashia C. Wilson
类目: Machine Learning (cs.LG)
*备注:
Abstract:Gaussian sketching, which consists of pre-multiplying the data with a random Gaussian matrix, is a widely used technique for multiple problems in data science and machine learning, with applications spanning computationally efficient optimization, coded computing, and federated learning. This operation also provides differential privacy guarantees due to its inherent randomness. In this work, we revisit this operation through the lens of Renyi Differential Privacy (RDP), providing a refined privacy analysis that yields significantly tighter bounds than prior results. We then demonstrate how this improved analysis leads to performance improvement in different linear regression settings, establishing theoretical utility guarantees. Empirically, our methods improve performance across multiple datasets and, in several cases, reduce runtime.
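The operation under analysis is simple to state; a sketch of the sketching step itself (the privacy accounting, which is the paper's contribution, is omitted here):

```python
import numpy as np

def gaussian_sketch(X, m, seed=None):
    """Pre-multiply the n x d data matrix X by a random m x n Gaussian
    matrix. The inherent randomness of S @ X is the source of the
    differential privacy guarantee analyzed via RDP."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
    return S @ X

X = np.random.randn(1000, 20)
X_sk = gaussian_sketch(X, m=100)  # 100 x 20 compressed surrogate
# Downstream linear regression can then be run on the sketched data
# in place of the raw data.
```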
[LG-26] Conservation-preserved Fourier Neural Operator through Adaptive Correction
链接: https://arxiv.org/abs/2505.24579
作者: Chaoyu Liu,Yangming Li,Zhongying Deng,Chris Budd,Carola-Bibiane Schönlieb
类目: Machine Learning (cs.LG)
*备注:
Abstract:Fourier Neural Operators (FNOs) have recently emerged as a promising and efficient approach for learning the numerical solutions to partial differential equations (PDEs) from data. However, standard FNO often fails to preserve key conservation laws, such as mass conservation, momentum conservation, norm conservation, etc., which are crucial for accurately modeling physical systems. Existing methods for incorporating these conservation laws into Fourier neural operators are achieved by designing related loss functions or incorporating post-processing methods at training time. None of them can both exactly and adaptively correct the outputs to satisfy conservation laws, and our experiments show that these methods can lead to inferior performance while preserving conservation laws. In this work, we propose a novel adaptive correction approach to ensure the conservation of fundamental quantities. Our method introduces a learnable matrix to adaptively adjust the solution to satisfy the conservation law during training. It ensures that the outputs exactly satisfy the goal conservation law and allows for more flexibility and adaptivity for the model to correct the outputs. We theoretically show that applying our adaptive correction to an unconstrained FNO yields a solution with data loss no worse than that of the best conservation-satisfying FNO. We compare our approach with existing methods on a range of representative PDEs. Experimental results show that our method consistently outperforms other methods.
[LG-27] Neuro-Symbolic Operator for Interpretable and Generalizable Characterization of Complex Piezoelectric Systems
链接: https://arxiv.org/abs/2505.24578
作者: Abhishek Chandra,Taniya Kapoor,Mitrofan Curti,Koen Tiels,Elena A. Lomonova
类目: Machine Learning (cs.LG)
*备注:
Abstract:Complex piezoelectric systems are foundational in industrial applications. Their performance, however, is challenged by the nonlinear voltage-displacement hysteretic relationships. Efficient characterization methods are, therefore, essential for reliable design, monitoring, and maintenance. Recently proposed neural operator methods serve as surrogates for system characterization but face two pressing issues: interpretability and generalizability. State-of-the-art (SOTA) neural operators are black-boxes, providing little insight into the learned operator. Additionally, generalizing them to novel voltages and predicting displacement profiles beyond the training domain is challenging, limiting their practical use. To address these limitations, this paper proposes a neuro-symbolic operator (NSO) framework that derives the analytical operators governing hysteretic relationships. NSO first learns a Fourier neural operator mapping voltage fields to displacement profiles, followed by a library-based sparse model discovery method, generating white-box parsimonious models governing the underlying hysteresis. These models enable accurate and interpretable prediction of displacement profiles across varying and out-of-distribution voltage fields, facilitating generalizability. The potential of NSO is demonstrated by accurately predicting voltage-displacement hysteresis, including butterfly-shaped relationships. Moreover, NSO predicts displacement profiles even for noisy and low-fidelity voltage data, emphasizing its robustness. The results highlight the advantages of NSO compared to SOTA neural operators and model discovery methods on several evaluation metrics. Consequently, NSO contributes to characterizing complex piezoelectric systems while improving the interpretability and generalizability of neural operators, essential for design, monitoring, maintenance, and other real-world scenarios.
[LG-28] HLSAD: Hodge Laplacian-based Simplicial Anomaly Detection KDD2025
链接: https://arxiv.org/abs/2505.24534
作者: Florian Frantzen,Michael T. Schaub
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Accepted for KDD 2025
Abstract:In this paper, we propose HLSAD, a novel method for detecting anomalies in time-evolving simplicial complexes. While traditional graph anomaly detection techniques have been extensively studied, they often fail to capture changes in higher-order interactions that are crucial for identifying complex structural anomalies. These higher-order interactions can arise either directly from the underlying data itself or through graph lifting techniques. Our approach leverages the spectral properties of Hodge Laplacians of simplicial complexes to effectively model multi-way interactions among data points. By incorporating higher-dimensional simplicial structures into our method, our method enhances both detection accuracy and computational efficiency. Through comprehensive experiments on both synthetic and real-world datasets, we demonstrate that our approach outperforms existing graph methods in detecting both events and change points.
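For readers unfamiliar with the spectral objects involved: the k-th Hodge Laplacian combines boundary operators from one dimension below and one above. A minimal construction for edges (k = 1) on a toy complex with 3 nodes, 3 edges, and 1 triangle:

```python
import numpy as np

# Signed incidence matrices: B1 maps edges to nodes, B2 maps the
# triangle to its boundary edges (edges oriented 1->2, 1->3, 2->3).
B1 = np.array([[-1, -1,  0],
               [ 1,  0, -1],
               [ 0,  1,  1]])
B2 = np.array([[ 1],
               [-1],
               [ 1]])

# Hodge Laplacian on edges: lower part (via nodes) + upper part (via triangles).
L1 = B1.T @ B1 + B2 @ B2.T
print(np.linalg.eigvalsh(L1))
# Detectors in the HLSAD spirit track how such spectra evolve over time;
# abrupt spectral shifts flag anomalies in higher-order interactions.
```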
[LG-29] Transformers Are Universally Consistent
链接: https://arxiv.org/abs/2505.24531
作者: Sagar Ghosh,Kushal Bose,Swagatam Das
类目: Machine Learning (cs.LG)
*备注:
Abstract:Despite their central role in the success of foundational models and large-scale language modeling, the theoretical foundations governing the operation of Transformers remain only partially understood. Contemporary research has largely focused on their representational capacity for language comprehension and their prowess in in-context learning, frequently under idealized assumptions such as linearized attention mechanisms. Initially conceived to model sequence-to-sequence transformations, a fundamental and unresolved question is whether Transformers can robustly perform functional regression over sequences of input tokens. This question assumes heightened importance given the inherently non-Euclidean geometry underlying real-world data distributions. In this work, we establish that Transformers equipped with softmax-based nonlinear attention are uniformly consistent when tasked with executing Ordinary Least Squares (OLS) regression, provided both the inputs and outputs are embedded in hyperbolic space. We derive deterministic upper bounds on the empirical error which, in the asymptotic regime, decay at a provable rate of \mathcal{O}(t^{-1/2d}), where t denotes the number of input tokens and d the embedding dimensionality. Notably, our analysis subsumes the Euclidean setting as a special case, recovering analogous convergence guarantees parameterized by the intrinsic dimensionality of the data manifold. These theoretical insights are corroborated through empirical evaluations on real-world datasets involving both continuous and categorical response variables.
[LG-30] Airborne Neural Network
链接: https://arxiv.org/abs/2505.24513
作者: Paritosh Ranjan,Surajit Majumder,Prodip Roy
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 11 pages, 3 figures
Abstract:Deep Learning, driven by neural networks, has led to groundbreaking advancements in Artificial Intelligence by enabling systems to learn and adapt like the human brain. These models have achieved remarkable results, particularly in data-intensive domains, supported by massive computational infrastructure. However, deploying such systems in Aerospace, where real-time data processing and ultra-low latency are critical, remains a challenge due to infrastructure limitations. This paper proposes a novel concept: the Airborne Neural Network, a distributed architecture where multiple airborne devices each host a subset of neural network neurons. These devices compute collaboratively, guided by an airborne network controller and layer-specific controllers, enabling real-time learning and inference during flight. This approach has the potential to revolutionize Aerospace applications, including airborne air traffic control, real-time weather and geographical predictions, and dynamic geospatial data processing. By enabling large-scale neural network operations in airborne environments, this work lays the foundation for the next generation of AI-powered Aerospace systems.
[LG-31] Learning to Optimally Dispatch Power: Performance on a Nation-Wide Real-World Dataset
链接: https://arxiv.org/abs/2505.24505
作者: Ignacio Boero,Santiago Diaz,Tomás Vázquez,Enzo Coppes,Pablo Belzarena,Federico Larroca
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:The Optimal Reactive Power Dispatch (ORPD) problem plays a crucial role in power system operations, ensuring voltage stability and minimizing power losses. Recent advances in machine learning, particularly within the "learning to optimize" framework, have enabled fast and efficient approximations of ORPD solutions, typically by training models on precomputed optimization results. While these approaches have demonstrated promising performance on synthetic datasets, their effectiveness under real-world grid conditions remains largely unexplored. This paper makes two key contributions. First, we introduce a publicly available power system dataset that includes both the structural characteristics of Uruguay’s electrical grid and nearly two years of real-world operational data, encompassing actual demand and generation profiles. Given Uruguay’s high penetration of renewable energy, the ORPD problem has become the primary optimization challenge in its power network. Second, we assess the impact of real-world data on learning-based ORPD solutions, revealing a significant increase in prediction errors when transitioning from synthetic to actual demand and generation inputs. Our results highlight the limitations of existing models in learning under the complex statistical properties of real grid conditions and emphasize the need for more expressive architectures. By providing this dataset, we aim to facilitate further research into robust learning-based optimization techniques for power system management.
[LG-32] Efficient Neural and Numerical Methods for High-Quality Online Speech Spectrogram Inversion via Gradient Theorem INTERSPEECH2025
链接: https://arxiv.org/abs/2505.24498
作者: Andres Fernandez,Juan Azcarreta,Cagdas Bilen,Jesus Monge Alvarez
类目: Machine Learning (cs.LG)
*备注: Accepted at InterSpeech 2025
Abstract:Recent work in online speech spectrogram inversion effectively combines Deep Learning with the Gradient Theorem to predict phase derivatives directly from magnitudes. Then, phases are estimated from their derivatives via least squares, resulting in a high quality reconstruction. In this work, we introduce three innovations that drastically reduce computational cost, while maintaining high quality: Firstly, we introduce a novel neural network architecture with just 8k parameters, 30 times smaller than previous state of the art. Secondly, increasing latency by 1 hop size allows us to further halve the cost of the neural inference step. Thirdly, we observe that the least squares problem features a tridiagonal matrix and propose a linear-complexity solver for the least squares step that leverages tridiagonality and positive-semidefiniteness, achieving a speedup of several orders of magnitude. We release samples online.
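The tridiagonality observation makes the least-squares step textbook material; a sketch of the linear-time Thomas algorithm it enables (illustrative, not the authors' implementation):

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system in O(n): a = sub-, b = main-, and
    c = super-diagonal, d = right-hand side (a[0] and c[-1] unused).
    Assumes a well-conditioned, e.g. positive-definite, matrix."""
    n = len(b)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Usage: solve a small symmetric positive-definite tridiagonal system.
a = np.array([0.0, -1.0, -1.0]); b = np.array([2.0, 2.0, 2.0])
c = np.array([-1.0, -1.0, 0.0]); d = np.array([1.0, 0.0, 1.0])
x = thomas_solve(a, b, c, d)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
assert np.allclose(A @ x, d)
```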
[LG-33] Smooth Model Compression without Fine-Tuning
链接: https://arxiv.org/abs/2505.24469
作者: Christina Runkel,Natacha Kuete Meli,Jovita Lukasik,Ander Biguri,Carola-Bibiane Schönlieb,Michael Moeller
类目: Machine Learning (cs.LG)
*备注:
Abstract:Compressing and pruning large machine learning models has become a critical step towards their deployment in real-world applications. Standard pruning and compression techniques are typically designed without taking the structure of the network’s weights into account, limiting their effectiveness. We explore the impact of smooth regularization on neural network training and model compression. By applying nuclear norm, first- and second-order derivative penalties of the weights during training, we encourage structured smoothness while preserving predictive performance on par with non-smooth models. We find that standard pruning methods often perform better when applied to these smooth models. Building on this observation, we apply a Singular-Value-Decomposition-based compression method that exploits the underlying smooth structure and approximates the model’s weight tensors by smaller low-rank tensors. Our approach enables state-of-the-art compression without any fine-tuning - reaching up to 91% accuracy on a smooth ResNet-18 on CIFAR-10 with 70% fewer parameters.
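A sketch of the SVD-based compression step the abstract refers to: factor each weight matrix and keep only the top singular directions (the rank here is an arbitrary illustration; the paper chooses it per model):

```python
import torch

def svd_compress(W, rank):
    """Approximate W (d_out x d_in) by two rank-`rank` factors, cutting
    parameters from d_out * d_in to rank * (d_out + d_in)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # d_out x rank (singular values folded in)
    B = Vh[:rank]                # rank  x d_in
    return A, B

W = torch.randn(256, 256)
A, B = svd_compress(W, rank=32)
rel_err = torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)
# Smoothness-regularized weights concentrate energy in few singular
# values, so the same rank yields a much smaller error than for a
# random matrix like this one.
print(float(rel_err))
```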
[LG-34] Logits-Based Finetuning
链接: https://arxiv.org/abs/2505.24461
作者: Jingyao Li,Senqiao Yang,Sitong Wu,Han Shi,Chuanyang Zheng,Hong Xu,Jiaya Jia
类目: Machine Learning (cs.LG)
*备注:
Abstract:The core of out-of-distribution (OOD) detection is to learn the in-distribution (ID) representation, which is distinguishable from OOD samples. Previous work applied recognition-based methods to learn the ID features, which tend to learn shortcuts instead of comprehensive representations. In this work, we find surprisingly that simply using reconstruction-based methods could boost the performance of OOD detection significantly. We deeply explore the main contributors of OOD detection and find that reconstruction-based pretext tasks have the potential to provide a generally applicable and efficacious prior, which benefits the model in learning intrinsic data distributions of the ID dataset. Specifically, we take Masked Image Modeling as a pretext task for our OOD detection framework (MOOD). Without bells and whistles, MOOD outperforms previous SOTA of one-class OOD detection by 5.7%, multi-class OOD detection by 3.0%, and near-distribution OOD detection by 2.1%. It even defeats the 10-shot-per-class outlier exposure OOD detection, although we do not include any OOD samples for our detection. Codes are available at this https URL.
[LG-35] Stepsize anything: A unified learning rate schedule for budgeted-iteration training
链接: https://arxiv.org/abs/2505.24452
作者: Anda Tang,Yiming Dong,Yutao Zeng,zhou Xun,Zhouchen Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:The expanding computational costs and limited resources underscore the critical need for budgeted-iteration training, which aims to achieve optimal learning within predetermined iteration budgets. While learning rate schedules fundamentally govern the performance of different networks and tasks, particularly in budgeted-iteration scenarios, their design remains largely heuristic, lacking theoretical grounding. In addition, the optimal learning rate schedule requires extensive trial-and-error selection, making the training process inefficient. In this work, we propose the Unified Budget-Aware (UBA) schedule, a theoretically grounded learning rate schedule that consistently outperforms commonly-used schedules among diverse architectures and tasks under different constrained training budgets. Specifically, we bridge the gap by constructing a novel training budget-aware optimization framework, which explicitly accounts for the robustness to landscape curvature variations. From this framework, we derive the UBA schedule, controlled by a single hyper-parameter \varphi that provides a trade-off between flexibility and simplicity, eliminating the need for per-network numerical optimization. Moreover, we establish a theoretical connection between \varphi and the condition number, adding interpretation and justification to our approach. Besides, we prove the convergence for different values of \varphi. We offer practical guidelines for its selection via theoretical analysis and empirical evidence. Extensive experimental results show that UBA consistently surpasses the commonly-used schedules across diverse vision and language tasks, spanning network architectures (e.g., ResNet, OLMo) and scales, under different training-iteration budgets.
[LG-36] Weisfeiler and Leman Follow the Arrow of Time: Expressive Power of Message Passing in Temporal Event Graphs
链接: https://arxiv.org/abs/2505.24438
作者: Franziska Heeg,Jonas Sauer,Petra Mutzel,Ingo Scholtes
类目: Machine Learning (cs.LG)
*备注:
Abstract:An important characteristic of temporal graphs is how the directed arrow of time influences their causal topology, i.e., which nodes can possibly influence each other causally via time-respecting paths. The resulting patterns are often neglected by temporal graph neural networks (TGNNs). To formally analyze the expressive power of TGNNs, we lack a generalization of graph isomorphism to temporal graphs that fully captures their causal topology. Addressing this gap, we introduce the notion of consistent event graph isomorphism, which utilizes a time-unfolded representation of time-respecting paths in temporal graphs. We compare this definition with existing notions of temporal graph isomorphisms. We illustrate and highlight the advantages of our approach and develop a temporal generalization of the Weisfeiler-Leman algorithm to heuristically distinguish non-isomorphic temporal graphs. Building on this theoretical foundation, we derive a novel message passing scheme for temporal graph neural networks that operates on the event graph representation of temporal graphs. An experimental evaluation shows that our approach performs well in a temporal graph classification experiment.
[LG-37] Multi-task Learning for Heterogeneous Multi-source Block-Wise Missing Data
链接: https://arxiv.org/abs/2505.24413
作者: Yang Sui,Qi Xu,Yang Bai,Annie Qu
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注:
Abstract:Multi-task learning (MTL) has emerged as an imperative machine learning tool to solve multiple learning tasks simultaneously and has been successfully applied to healthcare, marketing, and biomedical fields. However, in order to borrow information across different tasks effectively, it is essential to utilize both homogeneous and heterogeneous information. Among the extensive literature on MTL, various forms of heterogeneity are presented in MTL problems, such as block-wise, distribution, and posterior heterogeneity. Existing methods, however, struggle to tackle these forms of heterogeneity simultaneously in a unified framework. In this paper, we propose a two-step learning strategy for MTL which addresses the aforementioned heterogeneity. First, we impute the missing blocks using shared representations extracted from homogeneous sources across different tasks. Next, we disentangle the mappings between input features and responses into a shared component and a task-specific component, respectively, thereby enabling information borrowing through the shared component. Our numerical experiments and real-data analysis from the ADNI database demonstrate the superior MTL performance of the proposed method compared to other competing methods.
[LG-38] On the Lipschitz Continuity of Set Aggregation Functions and Neural Networks for Sets
链接: https://arxiv.org/abs/2505.24403
作者: Giannis Nikolentzos,Konstantinos Skianis
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Lipschitz constant of a neural network is connected to several important properties of the network such as its robustness and generalization. It is thus useful in many settings to estimate the Lipschitz constant of a model. Prior work has focused mainly on estimating the Lipschitz constant of multi-layer perceptrons and convolutional neural networks. Here we focus on data modeled as sets or multisets of vectors and on neural networks that can handle such data. These models typically apply some permutation invariant aggregation function, such as the sum, mean or max operator, to the input multisets to produce a single vector for each input sample. In this paper, we investigate whether these aggregation functions are Lipschitz continuous with respect to three distance functions for unordered multisets, and we compute their Lipschitz constants. In the general case, we find that each aggregation function is Lipschitz continuous with respect to only one of the three distance functions. Then, we build on these results to derive upper bounds on the Lipschitz constant of neural networks that can process multisets of vectors, while we also study their stability to perturbations and generalization under distribution shifts. To empirically verify our theoretical analysis, we conduct a series of experiments on datasets from different domains.
[LG-39] LightSAM: Parameter-Agnostic Sharpness-Aware Minimization
链接: https://arxiv.org/abs/2505.24399
作者: Yifei Cheng,Li Shen,Hao Sun,Nan Yin,Xiaochun Cao,Enhong Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Sharpness-Aware Minimization (SAM) optimizer enhances the generalization ability of the machine learning model by exploring the flat minima landscape through weight perturbations. Despite its empirical success, SAM introduces an additional hyper-parameter, the perturbation radius, to which SAM is sensitive. Moreover, it has been proved that the perturbation radius and learning rate of SAM are constrained by problem-dependent parameters to guarantee convergence. These limitations indicate the need for parameter tuning in practical applications. In this paper, we propose the algorithm LightSAM which sets the perturbation radius and learning rate of SAM adaptively, thus extending the application scope of SAM. LightSAM employs three popular adaptive optimizers, including AdaGrad-Norm, AdaGrad and Adam, to replace the SGD optimizer for weight perturbation and model updating, reducing sensitivity to parameters. Theoretical results show that under weak assumptions, LightSAM could converge ideally with any choice of perturbation radius and learning rate, thus being parameter-agnostic. We conduct preliminary experiments on several deep learning tasks, which together with the theoretical findings validate the effectiveness of LightSAM.
[LG-40] Anomaly Detection and Improvement of Clusters using Enhanced K-Means Algorithm CCS
链接: https://arxiv.org/abs/2505.24365
作者: Vardhan Shorewala,Shivam Shorewala
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注: IEEE ICCCSP
Abstract:This paper introduces a unified approach to cluster refinement and anomaly detection in datasets. We propose a novel algorithm that iteratively reduces the intra-cluster variance of N clusters until a global minimum is reached, yielding tighter clusters than the standard k-means algorithm. We evaluate the method using intrinsic measures for unsupervised learning, including the silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index, and extend it to anomaly detection by identifying points whose assignment causes a significant variance increase. External validation on synthetic data and the UCI Breast Cancer and UCI Wine Quality datasets employs the Jaccard similarity score, V-measure, and F1 score. Results show variance reductions of 18.7% and 88.1% on the synthetic and Wine Quality datasets, respectively, along with accuracy and F1 score improvements of 22.5% and 20.8% on the Wine Quality dataset.
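A sketch of the anomaly criterion described above: score each point by how much intra-cluster variance its assignment adds, then flag the largest contributors (the refinement loop and thresholds in the paper may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def variance_increase_scores(X, labels):
    """Score each point by the intra-cluster variance increase its
    assignment causes relative to the cluster without it."""
    scores = np.zeros(len(X))
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        pts = X[idx]
        full_var = ((pts - pts.mean(axis=0)) ** 2).sum()
        for j, i in enumerate(idx):
            rest = np.delete(pts, j, axis=0)
            rest_var = ((rest - rest.mean(axis=0)) ** 2).sum()
            scores[i] = full_var - rest_var  # large jump => likely anomaly
    return scores

X = np.random.randn(200, 2)
X[:3] += 8.0                                 # plant three outliers
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
s = variance_increase_scores(X, labels)
print(np.argsort(-s)[:3])                    # indices of the top anomalies
```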
[LG-41] Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning CVPR2025
链接: https://arxiv.org/abs/2505.24360
作者: Stepan Shabalin,Ayush Panda,Dmitrii Kharlapenko,Abdur Raheem Ali,Yixiong Hao,Arthur Conmy
类目: Machine Learning (cs.LG)
*备注: 10 pages, 10 figures, Mechanistic Interpretability for Vision at CVPR 2025
Abstract:Sparse autoencoders are a promising new approach for decomposing language model activations for interpretation and control. They have been applied successfully to vision transformer image encoders and to small-scale diffusion models. Inference-Time Decomposition of Activations (ITDA) is a recently proposed variant of dictionary learning that takes the dictionary to be a set of data points from the activation distribution and reconstructs them with gradient pursuit. We apply Sparse Autoencoders (SAEs) and ITDA to a large text-to-image diffusion model, Flux 1, and consider the interpretability of embeddings of both by introducing a visual automated interpretation pipeline. We find that SAEs accurately reconstruct residual stream embeddings and beat MLP neurons on interpretability. We are able to use SAE features to steer image generation through activation addition. We find that ITDA has comparable interpretability to SAEs.
[LG-42] Cartan Networks: Group theoretical Hyperbolic Deep Learning
链接: https://arxiv.org/abs/2505.24353
作者: Federico Milanesio,Matteo Santoro,Pietro G. Fré,Guido Sanguinetti
类目: Machine Learning (cs.LG)
*备注: 20 pages, 3 figures, under review
Abstract:Hyperbolic deep learning leverages the metric properties of hyperbolic spaces to develop efficient and informative embeddings of hierarchical data. Here, we focus on the solvable group structure of hyperbolic spaces, which follows naturally from their construction as symmetric spaces. This dual nature of Lie group and Riemannian manifold allows us to propose a new class of hyperbolic deep learning algorithms where group homomorphisms are interleaved with metric-preserving diffeomorphisms. The resulting algorithms, which we call Cartan networks, show promising results on various benchmark data sets and open the way to a novel class of hyperbolic deep learning architectures.
[LG-43] ROAD: Responsibility-Oriented Reward Design for Reinforcement Learning in Autonomous Driving
链接: https://arxiv.org/abs/2505.24317
作者: Yongming Chen,Miner Chen,Liewen Liao,Mingyang Jiang,Xiang Zuo,Hengrui Zhang,Yuchen Xi,Songan Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) in autonomous driving employs a trial-and-error mechanism, enhancing robustness in unpredictable environments. However, crafting effective reward functions remains challenging, as conventional approaches rely heavily on manual design and demonstrate limited efficacy in complex scenarios. To address this issue, this study introduces a responsibility-oriented reward function that explicitly incorporates traffic regulations into the RL framework. Specifically, we introduced a Traffic Regulation Knowledge Graph and leveraged Vision-Language Models alongside Retrieval-Augmented Generation techniques to automate reward assignment. This integration guides agents to adhere strictly to traffic laws, thus minimizing rule violations and optimizing decision-making performance in diverse driving conditions. Experimental validations demonstrate that the proposed methodology significantly improves the accuracy of assigning accident responsibilities and effectively reduces the agent’s liability in traffic incidents.
[LG-44] On the Emergence of Weak-to-Strong Generalization: A Bias-Variance Perspective
链接: https://arxiv.org/abs/2505.24313
作者: Gengze Xu,Wei Yao,Ziqiao Wang,Yong Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Weak-to-strong generalization (W2SG) refers to the phenomenon where a strong student model, trained on a dataset labeled by a weak teacher, ultimately outperforms the teacher on the target task. Recent studies attribute this performance gain to the prediction misfit between the student and teacher models. In this work, we theoretically investigate the emergence of W2SG through a generalized bias-variance decomposition of Bregman divergence. Specifically, we show that the expected population risk gap between the student and teacher is quantified by the expected misfit between the two models. While this aligns with previous results, our analysis removes several restrictive assumptions, most notably, the convexity of the student’s hypothesis class, required in earlier works. Moreover, we show that W2SG is more likely to emerge when the student model approximates its posterior mean teacher, rather than mimicking an individual teacher. Using a concrete example, we demonstrate that if the student model has significantly larger capacity than the teacher, it can indeed converge to this posterior mean. Our analysis also suggests that avoiding overfitting to the teacher’s supervision and reducing the entropy of student’s prediction further facilitate W2SG. In addition, we show that the reverse cross-entropy loss, unlike the standard forward cross-entropy, is less sensitive to the predictive uncertainty of the teacher. Finally, we empirically verify our theoretical insights and demonstrate that incorporating the reverse cross-entropy loss consistently improves student performance.
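Because the abstract contrasts the two losses, a minimal sketch of forward vs. reverse cross-entropy against soft teacher labels (standard definitions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def forward_ce(student_logits, teacher_probs):
    """H(teacher, student): the usual distillation cross-entropy."""
    log_q = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * log_q).sum(-1).mean()

def reverse_ce(student_logits, teacher_probs, eps=1e-8):
    """H(student, teacher): the roles are swapped, so gradients are
    damped where the teacher is uncertain, which is the robustness
    property the abstract points to."""
    q = F.softmax(student_logits, dim=-1)
    return -(q * torch.log(teacher_probs + eps)).sum(-1).mean()

logits = torch.randn(4, 10, requires_grad=True)
teacher = F.softmax(torch.randn(4, 10), dim=-1)  # weak-teacher soft labels
reverse_ce(logits, teacher).backward()
```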
[LG-45] GradPower: Powering Gradients for Faster Language Model Pre-Training
链接: https://arxiv.org/abs/2505.24275
作者: Mingze Wang,Jinbo Wang,Jiaqi Zhang,Wei Wang,Peng Pei,Xunliang Cai,Weinan E,Lei Wu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 22 pages
Abstract:We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector g=(g_i)_i, GradPower first applies the elementwise sign-power transformation: \varphi_p(g)=(\mathrm{sign}(g_i)|g_i|^p)_i for a fixed p>0, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer’s internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.
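The transformation is fully specified in the abstract, so it can be sketched faithfully in a few lines; the optimizer wrapper and the value p=1.2 below are our own illustrative packaging, not necessarily the paper's:

```python
import torch

def gradpower(g, p):
    """Elementwise sign-power transform: sign(g_i) * |g_i|**p."""
    return torch.sign(g) * g.abs().pow(p)

class AdamPower(torch.optim.Adam):
    """Adam fed with power-transformed gradients (GradPower + Adam)."""
    def __init__(self, params, p=1.2, **kwargs):
        super().__init__(params, **kwargs)
        self.power = p

    def step(self, closure=None):
        with torch.no_grad():
            for group in self.param_groups:
                for param in group["params"]:
                    if param.grad is not None:
                        param.grad = gradpower(param.grad, self.power)
        return super().step(closure)

# Usage: a drop-in replacement; no other changes to the training loop.
model = torch.nn.Linear(10, 1)
opt = AdamPower(model.parameters(), p=1.2, lr=1e-3)
model(torch.randn(8, 10)).sum().backward()
opt.step()
```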
[LG-46] On Fairness of Task Arithmetic: The Role of Task Vectors
链接: https://arxiv.org/abs/2505.24262
作者: Hiroki Naganuma,Kotaro Yoshida,Laura Gomezjurado Gonzalez,Takafumi Horie,Yuji Naraki,Ryotaro Shimizu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model editing techniques, particularly task arithmetic using task vectors, have shown promise in efficiently modifying pre-trained models through arithmetic operations like task addition and negation. Despite computational advantages, these methods may inadvertently affect model fairness, creating risks in sensitive applications like hate speech detection. However, the fairness implications of task arithmetic remain largely unexplored, presenting a critical gap in the existing literature. We systematically examine how manipulating task vectors affects fairness metrics, including Demographic Parity and Equalized Odds. To rigorously assess these effects, we benchmark task arithmetic against full fine-tuning, a costly but widely used baseline, and Low-Rank Adaptation (LoRA), a prevalent parameter-efficient fine-tuning method. Additionally, we explore merging task vectors from models fine-tuned on demographic subgroups vulnerable to hate speech, investigating whether fairness outcomes can be controlled by adjusting task vector coefficients, potentially enabling tailored model behavior. Our results offer novel insights into the fairness implications of model editing and establish a foundation for fairness-aware and responsible model editing practices.
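For context, task arithmetic itself reduces to a few vector operations on parameter dictionaries; a minimal sketch (the coefficients are illustrative, and tuning them is exactly the fairness lever the abstract investigates):

```python
import torch

def task_vector(pretrained, finetuned):
    """tau = theta_ft - theta_pre, computed per parameter tensor."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_task_vectors(pretrained, vectors, coeffs):
    """theta_new = theta_pre + sum_j lambda_j * tau_j; negative
    coefficients implement task negation."""
    return {k: v + sum(c * tv[k] for c, tv in zip(coeffs, vectors))
            for k, v in pretrained.items()}

# Toy usage with state dicts from two fine-tuned "models".
pre = {"w": torch.zeros(3)}
ft_a = {"w": torch.tensor([1.0, 0.0, 0.0])}
ft_b = {"w": torch.tensor([0.0, 1.0, 0.0])}
taus = [task_vector(pre, ft_a), task_vector(pre, ft_b)]
merged = apply_task_vectors(pre, taus, coeffs=[0.5, -0.3])
print(merged["w"])  # tensor([ 0.5000, -0.3000,  0.0000])
```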
[LG-47] Taming Hyperparameter Sensitivity in Data Attribution: Practical Selection Without Costly Retraining
链接: https://arxiv.org/abs/2505.24261
作者: Weiyi Wang,Junwei Deng,Yuzheng Hu,Shiyuan Zhang,Xirui Jiang,Runting Zhang,Han Zhao,Jiaqi W. Ma
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Data attribution methods, which quantify the influence of individual training data points on a machine learning model, have gained increasing popularity in data-centric applications in modern AI. Despite a recent surge of new methods developed in this space, the impact of hyperparameter tuning in these methods remains under-explored. In this work, we present the first large-scale empirical study to understand the hyperparameter sensitivity of common data attribution methods. Our results show that most methods are indeed sensitive to certain key hyperparameters. However, unlike typical machine learning algorithms – whose hyperparameters can be tuned using computationally-cheap validation metrics – evaluating data attribution performance often requires retraining models on subsets of training data, making such metrics prohibitively costly for hyperparameter tuning. This poses a critical open challenge for the practical application of data attribution methods. To address this challenge, we advocate for better theoretical understandings of hyperparameter behavior to inform efficient tuning strategies. As a case study, we provide a theoretical analysis of the regularization term that is critical in many variants of influence function methods. Building on this analysis, we propose a lightweight procedure for selecting the regularization value without model retraining, and validate its effectiveness across a range of standard data attribution benchmarks. Overall, our study identifies a fundamental yet overlooked challenge in the practical application of data attribution, and highlights the importance of careful discussion on hyperparameter selection in future method development.
[LG-48] Rethinking Continual Learning with Progressive Neural Collapse
链接: https://arxiv.org/abs/2505.24254
作者: Zheng Wang,Wanhao Yu,Li Yang,Sen Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Continual Learning (CL) seeks to build an agent that can continuously learn a sequence of tasks, where a key challenge, namely Catastrophic Forgetting, persists due to the potential knowledge interference among different tasks. On the other hand, deep neural networks (DNNs) are shown to converge to a terminal state termed Neural Collapse during training, where all class prototypes geometrically form a static simplex equiangular tight frame (ETF). These maximally and equally separated class prototypes make the ETF an ideal target for model learning in CL to mitigate knowledge interference. Thus inspired, several studies have emerged very recently to leverage a fixed global ETF in CL, which however suffers from key drawbacks, such as impracticability and limited flexibility. To address these challenges and fully unlock the potential of ETF in CL, we propose Progressive Neural Collapse (ProNC), a novel framework that completely removes the need for a fixed global ETF in CL. Specifically, ProNC progressively expands the ETF target in a principled way by adding new class prototypes as vertices for new tasks, ensuring maximal separability across all encountered classes with minimal shifts from the previous ETF. We next develop a new CL framework by plugging ProNC into commonly used CL algorithm designs, where distillation is further leveraged to balance between target shifting for old classes and target aligning for new classes. Extensive experiments show that our approach significantly outperforms related baselines while maintaining superior flexibility, simplicity, and efficiency.
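The simplex ETF target that ProNC progressively expands is easy to construct for a fixed number of classes; a sketch of the standard K-vertex version (the paper's progressive vertex-addition rule is its contribution and is not reproduced here):

```python
import numpy as np

def simplex_etf(K, d, seed=0):
    """K maximally separated unit prototypes in R^d (requires d >= K):
    all pairwise inner products equal -1/(K-1)."""
    rng = np.random.default_rng(seed)
    U = np.linalg.qr(rng.standard_normal((d, K)))[0]  # orthonormal columns
    M = np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)
    return M  # columns are class prototypes

M = simplex_etf(K=4, d=16)
print(np.round(M.T @ M, 3))  # ~1 on the diagonal, ~-1/3 off-diagonal
```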
[LG-49] Model Informed Flows for Bayesian Inference of Probabilistic Programs
链接: https://arxiv.org/abs/2505.24243
作者: Joohwan Ko,Justin Domke
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Variational inference often struggles with the posterior geometry exhibited by complex hierarchical Bayesian models. Recent advances in flow-based variational families and Variationally Inferred Parameters (VIP) each address aspects of this challenge, but their formal relationship is unexplored. Here, we prove that the combination of VIP and a full-rank Gaussian can be represented exactly as a forward autoregressive flow augmented with a translation term and input from the model’s prior. Guided by this theoretical insight, we introduce the Model-Informed Flow (MIF) architecture, which adds the necessary translation mechanism, prior information, and hierarchical ordering. Empirically, MIF delivers tighter posterior approximations and matches or exceeds state-of-the-art performance across a suite of hierarchical and non-hierarchical benchmarks.
[LG-50] On the Expressive Power of Mixture-of-Experts for Structured Complex Tasks
链接: https://arxiv.org/abs/2505.24205
作者: Mingze Wang,Weinan E
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 18 pages
Abstract:Mixture-of-experts networks (MoEs) have demonstrated remarkable efficiency in modern deep learning. Despite their empirical success, the theoretical foundations underlying their ability to model complex tasks remain poorly understood. In this work, we conduct a systematic study of the expressive power of MoEs in modeling complex tasks with two common structural priors: low-dimensionality and sparsity. For shallow MoEs, we prove that they can efficiently approximate functions supported on low-dimensional manifolds, overcoming the curse of dimensionality. For deep MoEs, we show that \mathcal{O}(L)-layer MoEs with E experts per layer can approximate piecewise functions comprising E^L pieces with compositional sparsity, i.e., they can exhibit an exponential number of structured tasks. Our analysis reveals the roles of critical architectural components and hyperparameters in MoEs, including the gating mechanism, expert networks, the number of experts, and the number of layers, and offers natural suggestions for MoE variants.
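As context for the components the analysis names (gating mechanism, expert networks, number of experts), here is a generic top-1 gated MoE layer in PyTorch; it is a standard construction for illustration, not the specific networks studied in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopOneMoE(nn.Module):
    """Generic top-1 gated mixture-of-experts layer (illustrative only)."""
    def __init__(self, dim: int, num_experts: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)           # gating mechanism
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, dim))
            for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = F.softmax(self.gate(x), dim=-1)          # (batch, E)
        idx = scores.argmax(dim=-1)                       # hard top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):         # piecewise behaviour:
            mask = idx == e                               # one expert per input
            if mask.any():
                out[mask] = scores[mask, e:e + 1] * expert(x[mask])
        return out

layer = TopOneMoE(dim=16, num_experts=4, hidden=32)
y = layer(torch.randn(8, 16))                             # (8, 16)
```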
[LG-51] Improved Best-of-Both-Worlds Regret for Bandits with Delayed Feedback
链接: https://arxiv.org/abs/2505.24193
作者: Ofir Schlisselberg,Tal Lancewicki,Peter Auer,Yishay Mansour
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the multi-armed bandit problem with adversarially chosen delays in the Best-of-Both-Worlds (BoBW) framework, which aims to achieve near-optimal performance in both stochastic and adversarial environments. While prior work has made progress toward this goal, existing algorithms suffer from significant gaps to the known lower bounds, especially in the stochastic settings. Our main contribution is a new algorithm that, up to logarithmic factors, matches the known lower bounds in each setting individually. In the adversarial case, our algorithm achieves regret of \widetilde{O}(\sqrt{KT} + \sqrt{D}), which is optimal up to logarithmic terms, where T is the number of rounds, K is the number of arms, and D is the cumulative delay. In the stochastic case, we provide a regret bound which scales as \sum_{i:\Delta_i>0}\left(\log T/\Delta_i\right) + \frac{1}{K}\sum_i \Delta_i \sigma_{\max}, where \Delta_i is the sub-optimality gap of arm i and \sigma_{\max} is the maximum number of missing observations. To the best of our knowledge, this is the first BoBW algorithm to simultaneously match the lower bounds in both stochastic and adversarial regimes in delayed environments. Moreover, even beyond the BoBW setting, our stochastic regret bound is the first to match the known lower bound under adversarial delays, improving the second term over the best known result by a factor of K.
[LG-52] CodeV-R1: Reasoning-Enhanced Verilog Generation
链接: https://arxiv.org/abs/2505.24183
作者: Yaoyu Zhu,Di Huang,Hanqi Lyu,Xiaoyun Zhang,Chongxiao Li,Wenxuan Shi,Yutong Wu,Jianan Mu,Jinghua Wang,Yang Zhao,Pengwei Jin,Shuyao Cheng,Shengwen Liang,Xishan Zhang,Rui Zhang,Zidong Du,Qi Guo,Xing Hu,Yunji Chen
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Programming Languages (cs.PL)
*备注:
Abstract:Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage “distill-then-RL” training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that can reduce training cost by adaptively adjusting sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing prior state-of-the-art by 12~20%, while matching or even exceeding the performance of 671B DeepSeek-R1. We will release our model, training pipeline, and dataset to facilitate research in EDA and LLM communities.
[LG-53] Proxy Target: Bridging the Gap Between Discrete Spiking Neural Networks and Continuous Control
链接: https://arxiv.org/abs/2505.24161
作者: Zijie Xu,Tong Bu,Zecheng Hao,Jianhao Ding,Zhaofei Yu
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:Spiking Neural Networks (SNNs) offer low-latency and energy-efficient decision making through neuromorphic hardware, making them compelling for Reinforcement Learning (RL) in resource-constrained edge devices. Recent studies in this field directly replace Artificial Neural Networks (ANNs) with SNNs in existing RL frameworks, overlooking whether the RL algorithm is suitable for SNNs. However, most RL algorithms for continuous control are tailored to ANNs, including the target network soft-update mechanism, which conflicts with the discrete, non-differentiable dynamics of SNN spikes. We identify that this mismatch destabilizes SNN training in continuous control tasks. To bridge this gap between discrete SNNs and continuous control, we propose a novel proxy target framework. The continuous and differentiable dynamics of the proxy target enable smooth updates, bypassing the incompatibility of SNN spikes and stabilizing the RL algorithms. Since the proxy network operates only during training, the SNN retains its energy efficiency during deployment without inference overhead. Extensive experiments on continuous control benchmarks demonstrate that compared to vanilla SNNs, the proxy target framework enables SNNs to achieve up to 32% higher performance across different spiking neurons. Notably, we are the first to surpass ANN performance in continuous control with simple Leaky-Integrate-and-Fire (LIF) neurons. This work motivates a new class of SNN-friendly RL algorithms tailored to SNNs’ characteristics, paving the way for neuromorphic agents that combine high performance with low power consumption.
[LG-54] Biological Pathway Guided Gene Selection Through Collaborative Reinforcement Learning KDD2025
链接: https://arxiv.org/abs/2505.24155
作者: Ehtesamul Azim,Dongjie Wang,Tae Hyun Hwang,Yanjie Fu,Wei Zhang
类目: Machine Learning (cs.LG)
*备注: 31st SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025)
Abstract:Gene selection in high-dimensional genomic data is essential for understanding disease mechanisms and improving therapeutic outcomes. Traditional feature selection methods effectively identify predictive genes but often ignore complex biological pathways and regulatory networks, leading to unstable and biologically irrelevant signatures. Prior approaches, such as Lasso-based methods and statistical filtering, either focus solely on individual gene-outcome associations or fail to capture pathway-level interactions, presenting a key challenge: how to integrate biological pathway knowledge while maintaining statistical rigor in gene selection? To address this gap, we propose a novel two-stage framework that integrates statistical selection with biological pathway knowledge using multi-agent reinforcement learning (MARL). First, we introduce a pathway-guided pre-filtering strategy that leverages multiple statistical methods alongside KEGG pathway information for initial dimensionality reduction. Next, for refined selection, we model genes as collaborative agents in a MARL framework, where each agent optimizes both predictive power and biological relevance. Our framework incorporates pathway knowledge through Graph Neural Network-based state representations, a reward mechanism combining prediction performance with gene centrality and pathway coverage, and collaborative learning strategies using shared memory and a centralized critic component. Extensive experiments on multiple gene expression datasets demonstrate that our approach significantly improves both prediction accuracy and biological interpretability compared to traditional methods.
[LG-55] Autoregressive regularized score-based diffusion models for multi-scenarios fluid flow prediction
链接: https://arxiv.org/abs/2505.24145
作者: Wilfried Genuist,Éric Savin,Filippo Gatti,Didier Clouteau
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 34 pages, 17 figures
Abstract:Building on recent advances in scientific machine learning and generative modeling for computational fluid dynamics, we propose a conditional score-based diffusion model designed for multi-scenarios fluid flow prediction. Our model integrates an energy constraint rooted in the statistical properties of turbulent flows, improving prediction quality with minimal training, while enabling efficient sampling at low cost. The method features a simple and general architecture that requires no problem-specific design, supports plug-and-play enhancements, and enables fast and flexible solution generation. It also demonstrates an efficient conditioning mechanism that simplifies training across different scenarios without demanding a redesign of existing models. We further explore various stochastic differential equation formulations to demonstrate how thoughtful design choices enhance performance. We validate the proposed methodology through extensive experiments on complex fluid dynamics datasets encompassing a variety of flow regimes and configurations. Results demonstrate that our model consistently achieves stable, robust, and physically faithful predictions, even under challenging turbulent conditions. With properly tuned parameters, it achieves accurate results across multiple scenarios while preserving key physical and statistical properties. We present a comprehensive analysis of stochastic differential equation impact and discuss our approach across diverse fluid mechanics tasks.
[LG-56] Bounds on the Excess Minimum Risk via Generalized Information Divergence Measures
链接: https://arxiv.org/abs/2505.24117
作者: Ananya Omanwar,Fady Alajaji,Tamás Linder
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Given finite-dimensional random vectors Y, X, and Z that form a Markov chain in that order (i.e., Y \to X \to Z), we derive upper bounds on the excess minimum risk using generalized information divergence measures. Here, Y is a target vector to be estimated from an observed feature vector X or its stochastically degraded version Z. The excess minimum risk is defined as the difference between the minimum expected loss in estimating Y from X and from Z. We present a family of bounds that generalize the mutual information based bound of Györfi et al. (2023), using the Rényi and \alpha-Jensen-Shannon divergences, as well as Sibson’s mutual information. Our bounds are similar to those developed by Modak et al. (2021) and Aminian et al. (2024) for the generalization error of learning algorithms. However, unlike these works, our bounds do not require the sub-Gaussian parameter to be constant and therefore apply to a broader class of joint distributions over Y, X, and Z. We also provide numerical examples under both constant and non-constant sub-Gaussianity assumptions, illustrating that our generalized divergence based bounds can be tighter than the one based on mutual information for certain regimes of the parameter \alpha.
[LG-57] Neural Networks as Universal Finite-State Machines: A Constructive ReLU Simulation Framework for NFAs
链接: https://arxiv.org/abs/2505.24110
作者: Sahil Rajesh Dhayalkar
类目: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL)
*备注: 16 pages, with proofs in Appendix
Abstract:We present a formal and constructive framework establishing the equivalence between nondeterministic finite automata (NFAs) and standard feedforward ReLU neural networks. By encoding automaton states as binary vectors and transitions as sparse linear layers, we show that ReLU activations simulate nondeterministic branching, subset construction, and \epsilon-closures in a mathematically precise manner. Our core theoretical results prove that a three-layer ReLU network of width \mathcal{O}(n) can exactly recognize any regular language accepted by an n-state NFA, without recurrence, memory, or approximation. Furthermore, we show that gradient descent over structure-preserving networks preserves symbolic semantics and acceptance behavior. Extensive experiments across multiple validation tasks, including parallel path tracking, symbolic subset construction, \epsilon-closure convergence, acceptance classification, structural training invariants, and functional equivalence, achieve perfect or near-perfect empirical alignment with ground-truth automata. This work provides the first provably complete symbolic simulation of NFAs within standard deep learning architectures, uniting automata theory with neural computation through ReLU dynamics.
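The encoding described above is easy to make concrete. In the toy sketch below (our own illustration of the subset-construction step, not the paper's full three-layer construction), the active state set is a 0/1 vector, each symbol has a 0/1 transition matrix, and the clamp min(1, As) is expressed with ReLUs as 1 - relu(1 - As).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def step(state: np.ndarray, A: np.ndarray) -> np.ndarray:
    """One NFA step: state'[j] = 1 iff some active i has a transition i -> j.
    min(1, A @ s) is written with ReLUs as 1 - relu(1 - A @ s)."""
    return 1.0 - relu(1.0 - A @ state)

# Toy NFA over {a, b} with 3 states; accepts strings containing "ab".
# Convention: A[j, i] = 1 if there is a transition i -> j on the symbol.
A_a = np.array([[1, 0, 0],   # 0 -a-> 0 (stay)
                [1, 0, 0],   # 0 -a-> 1 (nondeterministic guess)
                [0, 0, 1]])  # 2 -a-> 2 (absorbing accept)
A_b = np.array([[1, 0, 0],   # 0 -b-> 0
                [0, 0, 0],
                [0, 1, 1]])  # 1 -b-> 2, 2 -b-> 2
trans = {"a": A_a, "b": A_b}

s = np.array([1.0, 0.0, 0.0])         # start in state 0
for ch in "aab":
    s = step(s, trans[ch])
print(s[2] > 0.5)                     # True: state 2 is accepting, "aab" has "ab"
```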
[LG-58] A SHAP-based explainable multi-level stacking ensemble learning method for predicting the length of stay in acute stroke
链接: https://arxiv.org/abs/2505.24101
作者: Zhenran Xu
类目: Machine Learning (cs.LG)
*备注: Master Minor Thesis, Preprint
Abstract:Length of stay (LOS) prediction in acute stroke is critical for improving care planning. Existing machine learning models have shown suboptimal predictive performance, limited generalisability, and have overlooked system-level factors. We aimed to enhance model efficiency, performance, and interpretability by refining predictors and developing an interpretable multi-level stacking ensemble model. Data were accessed from the biennial Stroke Foundation Acute Audit (2015, 2017, 2019, 2021) in Australia. Models were developed for ischaemic and haemorrhagic stroke separately. The outcome was prolonged LOS (the LOS above the 75th percentile). Candidate predictors (ischaemic: n=89; haemorrhagic: n=83) were categorised into patient, clinical, and system domains. Feature selection with correlation-based approaches was used to refine key predictors. The evaluation of models included discrimination (AUC), calibration curves, and interpretability (SHAP plots). In ischaemic stroke (N=12,575), prolonged LOS was ≥9 days, compared to ≥11 days in haemorrhagic stroke (N=1,970). The ensemble model achieved superior performance [AUC: 0.824 (95% CI: 0.801-0.846)] and statistically outperformed logistic regression [AUC: 0.805 (95% CI: 0.782-0.829); P=0.0004] for ischaemic stroke. However, the model [AUC: 0.843 (95% CI: 0.790-0.895)] did not statistically outperform logistic regression [AUC: 0.828 (95% CI: 0.774-0.882); P=0.136] for haemorrhagic stroke. SHAP analysis identified shared predictors for both types of stroke: rehabilitation assessment, urinary incontinence, stroke unit care, inability to walk independently, physiotherapy, and stroke care coordinator involvement. An explainable ensemble model effectively predicted the prolonged LOS in ischaemic stroke. Further validation in larger cohorts is needed for haemorrhagic stroke.
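A hedged sketch of the modelling pipeline described above, using scikit-learn's stacking with placeholder base learners and synthetic data; the paper's actual features, base models, and SHAP workflow are not specified here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the (binary) prolonged-LOS prediction task.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner level
    stack_method="predict_proba", cv=5)
stack.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1]))

# SHAP inspection of one tree-based base learner (assumes `shap` is installed):
# import shap
# explainer = shap.TreeExplainer(stack.named_estimators_["rf"])
# shap_values = explainer.shap_values(X_te)
```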
[LG-59] Practical Bayes-Optimal Membership Inference Attacks
链接: https://arxiv.org/abs/2505.24089
作者: Marcus Lassila,Johan Östman,Khac-Hoang Ngo,Alexandre Graell i Amat
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 9 pages plus 13 pages of appendices
Abstract:We develop practical and theoretically grounded membership inference attacks (MIAs) against both independent and identically distributed (i.i.d.) data and graph-structured data. Building on the Bayesian decision-theoretic framework of Sablayrolles et al., we derive the Bayes-optimal membership inference rule for node-level MIAs against graph neural networks, addressing key open questions about optimal query strategies in the graph setting. We introduce BASE and G-BASE, computationally efficient approximations of the Bayes-optimal attack. G-BASE achieves superior performance compared to previously proposed classifier-based node-level MIA attacks. BASE, which is also applicable to non-graph data, matches or exceeds the performance of prior state-of-the-art MIAs, such as LiRA and RMIA, at a significantly lower computational cost. Finally, we show that BASE and RMIA are equivalent under a specific hyperparameter setting, providing a principled, Bayes-optimal justification for the RMIA attack.
[LG-60] DeepBoost-AF: A Novel Unsupervised Feature Learning and Gradient Boosting Fusion for Robust Atrial Fibrillation Detection in Raw ECG Signals
链接: https://arxiv.org/abs/2505.24085
作者: Alireza Jafari,Fereshteh Yousefirizi,Vahid Seydi
类目: Machine Learning (cs.LG)
*备注: 12-page,4 figures,3 tables, Achieves 95.20% F1-score (99.99% sensitivity) on 8,528 PhysioNet 2017 recordings, Mean inference time: 4 seconds, Python implementation will be open-sourced upon publication
Abstract:Atrial fibrillation (AF) is a prevalent cardiac arrhythmia associated with elevated health risks, where timely detection is pivotal for mitigating stroke-related morbidity. This study introduces an innovative hybrid methodology integrating unsupervised deep learning and gradient boosting models to improve AF detection. A 19-layer deep convolutional autoencoder (DCAE) is coupled with three boosting classifiers-AdaBoost, XGBoost, and LightGBM (LGBM)-to harness their complementary advantages while addressing individual limitations. The proposed framework uniquely combines DCAE with gradient boosting, enabling end-to-end AF identification devoid of manual feature extraction. The DCAE-LGBM model attains an F1-score of 95.20%, sensitivity of 99.99%, and inference latency of four seconds, outperforming existing methods and aligning with clinical deployment requirements. The DCAE integration significantly enhances boosting models, positioning this hybrid system as a reliable tool for automated AF detection in clinical settings.
[LG-61] Primal-Dual Neural Algorithmic Reasoning
链接: https://arxiv.org/abs/2505.24067
作者: Yu He,Ellen Vitercik
类目: Machine Learning (cs.LG)
*备注: The 42nd International Conference on Machine Learning, 2025
Abstract:Neural Algorithmic Reasoning (NAR) trains neural networks to simulate classical algorithms, enabling structured and interpretable reasoning over complex data. While prior research has predominantly focused on learning exact algorithms for polynomial-time-solvable problems, extending NAR to harder problems remains an open challenge. In this work, we introduce a general NAR framework grounded in the primal-dual paradigm, a classical method for designing efficient approximation algorithms. By leveraging a bipartite representation between primal and dual variables, we establish an alignment between primal-dual algorithms and Graph Neural Networks. Furthermore, we incorporate optimal solutions from small instances to greatly enhance the model’s reasoning capabilities. Our empirical results demonstrate that our model not only simulates but also outperforms approximation algorithms for multiple tasks, exhibiting robust generalization to larger and out-of-distribution graphs. Moreover, we highlight the framework’s practical utility by integrating it with commercial solvers and applying it to real-world datasets.
[LG-62] Measure gradients not activations! Enhancing neuronal activity in deep reinforcement learning
链接: https://arxiv.org/abs/2505.24061
作者: Jiashun Liu,Zihao Wu,Johan Obando-Ceron,Pablo Samuel Castro,Aaron Courville,Ling Pan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep reinforcement learning (RL) agents frequently suffer from neuronal activity loss, which impairs their ability to adapt to new data and learn continually. A common method to quantify and address this issue is the tau-dormant neuron ratio, which uses activation statistics to measure the expressive ability of neurons. While effective for simple MLP-based agents, this approach loses statistical power in more complex architectures. To address this, we argue that in advanced RL agents, maintaining a neuron’s learning capacity, its ability to adapt via gradient updates, is more critical than preserving its expressive ability. Based on this insight, we shift the statistical objective from activations to gradients, and introduce GraMa (Gradient Magnitude Neural Activity Metric), a lightweight, architecture-agnostic metric for quantifying neuron-level learning capacity. We show that GraMa effectively reveals persistent neuron inactivity across diverse architectures, including residual networks, diffusion models, and agents with varied activation functions. Moreover, resetting neurons guided by GraMa (ReGraMa) consistently improves learning performance across multiple deep RL algorithms and benchmarks, such as MuJoCo and the DeepMind Control Suite.
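A minimal sketch of a gradient-magnitude neuron metric in the spirit of GraMa, with a ReGraMa-style reset: neurons whose weight rows receive (relatively) tiny gradients are re-initialised. The normalisation, the threshold tau, and the reset rule are our assumptions for illustration.

```python
import torch
import torch.nn as nn

def grad_magnitude_scores(layer: nn.Linear) -> torch.Tensor:
    """Per-output-neuron mean |grad| of the weight rows (call after backward)."""
    g = layer.weight.grad.abs().mean(dim=1)       # one score per neuron
    return g / (g.mean() + 1e-8)                  # normalise by the layer mean

def reset_dormant(layer: nn.Linear, tau: float = 0.1) -> int:
    """Re-initialise neurons whose normalised gradient score is below tau."""
    dormant = grad_magnitude_scores(layer) < tau
    with torch.no_grad():
        layer.weight[dormant] = torch.randn_like(layer.weight[dormant]) * 0.01
        if layer.bias is not None:
            layer.bias[dormant] = 0.0
    return int(dormant.sum())

net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
loss = net(torch.randn(64, 8)).pow(2).mean()
loss.backward()
print("reset:", reset_dormant(net[0]))           # number of neurons re-initialised
```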
[LG-63] Characterising the Inductive Biases of Neural Networks on Boolean Data
链接: https://arxiv.org/abs/2505.24060
作者: Chris Mingard,Lukas Seier,Niclas Göring,Andrei-Vlad Badelita,Charles London,Ard Louis
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Deep neural networks are renowned for their ability to generalise well across diverse tasks, even when heavily overparameterized. Existing works offer only partial explanations (for example, the NTK-based task-model alignment explanation neglects feature learning). Here, we provide an end-to-end, analytically tractable case study that links a network’s inductive prior, its training dynamics including feature learning, and its eventual generalisation. Specifically, we exploit the one-to-one correspondence between depth-2 discrete fully connected networks and disjunctive normal form (DNF) formulas by training on Boolean functions. Under a Monte Carlo learning algorithm, our model exhibits predictable training dynamics and the emergence of interpretable features. This framework allows us to trace, in detail, how inductive bias and feature formation drive generalisation.
[LG-64] Towards disentangling the contributions of articulation and acoustics in multimodal phoneme recognition
链接: https://arxiv.org/abs/2505.24059
作者: Sean Foley,Hong Nguyen,Jihwan Lee,Sudarsana Reddy Kadiri,Dani Byrd,Louis Goldstein,Shrikanth Narayanan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Although many previous studies have carried out multimodal learning with real-time MRI data that captures the audio-visual kinematics of the vocal tract during speech, these studies have been limited by their reliance on multi-speaker corpora. This prevents such models from learning a detailed relationship between acoustics and articulation due to considerable cross-speaker variability. In this study, we develop unimodal audio and video models as well as multimodal models for phoneme recognition using a long-form single-speaker MRI corpus, with the goal of disentangling and interpreting the contributions of each modality. Audio and multimodal models show similar performance on different phonetic manner classes but diverge on places of articulation. Interpretation of the models’ latent space shows similar encoding of the phonetic space across audio and multimodal models, while the models’ attention weights highlight differences in acoustic and articulatory timing for certain phonemes.
[LG-65] Differential Gated Self-Attention
链接: https://arxiv.org/abs/2505.24054
作者: Elpiniki Maria Lygizou,Mónika Farsang,Radu Grosu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transformers excel across a large variety of tasks but remain susceptible to corrupted inputs, since standard self-attention treats all query-key interactions uniformly. Inspired by lateral inhibition in biological neural circuits and building on the Differential Transformer’s recent use of a subtraction between two parallel softmax maps for noise cancellation, we propose Multihead Differential Gated Self-Attention (M-DGSA), which learns per-head input-dependent gating to dynamically suppress attention noise. Each head splits into excitatory and inhibitory branches whose dual softmax maps are fused by a sigmoid gate predicted from the token embedding, yielding a context-aware contrast enhancement. M-DGSA integrates seamlessly into existing Transformer stacks with minimal computational overhead. We evaluate on both vision and language benchmarks, demonstrating consistent robustness gains over vanilla Transformer, Vision Transformer, and Differential Transformer baselines. Our contributions are (i) a novel input-dependent gating mechanism for self-attention grounded in lateral inhibition, (ii) a principled synthesis of biological contrast-enhancement and self-attention theory, and (iii) comprehensive experiments demonstrating noise resilience and cross-domain applicability.
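One plausible single-head reading of the described mechanism, as a sketch: excitatory and inhibitory softmax attention maps are fused by a per-token sigmoid gate predicted from the embedding. Shapes and the exact gating parameterisation are assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffGatedAttentionHead(nn.Module):
    """One differential gated self-attention head (illustrative sketch)."""
    def __init__(self, dim: int, head_dim: int):
        super().__init__()
        self.q1, self.k1 = nn.Linear(dim, head_dim), nn.Linear(dim, head_dim)
        self.q2, self.k2 = nn.Linear(dim, head_dim), nn.Linear(dim, head_dim)
        self.v = nn.Linear(dim, head_dim)
        self.gate = nn.Linear(dim, 1)       # per-token gate from the embedding
        self.scale = head_dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, dim)
        exc = F.softmax(self.q1(x) @ self.k1(x).transpose(-2, -1) * self.scale, dim=-1)
        inh = F.softmax(self.q2(x) @ self.k2(x).transpose(-2, -1) * self.scale, dim=-1)
        lam = torch.sigmoid(self.gate(x))                  # (B, T, 1), input-dependent
        attn = exc - lam * inh                             # gated softmax subtraction
        return attn @ self.v(x)

head = DiffGatedAttentionHead(dim=32, head_dim=16)
out = head(torch.randn(2, 10, 32))                         # (2, 10, 16)
```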
[LG-66] NeuronTune: Towards Self-Guided Spurious Bias Mitigation ICML2025
链接: https://arxiv.org/abs/2505.24048
作者: Guangtao Zheng,Wenqian Ye,Aidong Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2025
Abstract:Deep neural networks often develop spurious bias, reliance on correlations between non-essential features and classes for predictions. For example, a model may identify objects based on frequently co-occurring backgrounds rather than intrinsic features, resulting in degraded performance on data lacking these correlations. Existing mitigation approaches typically depend on external annotations of spurious correlations, which may be difficult to obtain and are not relevant to the spurious bias in a model. In this paper, we take a step towards self-guided mitigation of spurious bias by proposing NeuronTune, a post hoc method that directly intervenes in a model’s internal decision process. Our method probes in a model’s latent embedding space to identify and regulate neurons that lead to spurious prediction behaviors. We theoretically justify our approach and show that it brings the model closer to an unbiased one. Unlike previous methods, NeuronTune operates without requiring spurious correlation annotations, making it a practical and effective tool for improving model robustness. Experiments across different architectures and data modalities demonstrate that our method significantly mitigates spurious bias in a self-guided way.
[LG-67] The Rich and the Simple: On the Implicit Bias of Adam and SGD
链接: https://arxiv.org/abs/2505.24022
作者: Bhavya Vasudeva,Jung Whan Lee,Vatsal Sharan,Mahdi Soltanolkotabi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 27 pages, 11 figures, 16 tables
Abstract:Adam is the de facto optimization algorithm for several deep learning applications, but an understanding of its implicit bias and how it differs from other algorithms, particularly standard first-order methods such as (stochastic) gradient descent (GD), remains limited. In practice, neural networks trained with SGD are known to exhibit simplicity bias – a tendency to find simple solutions. In contrast, we show that Adam is more resistant to such simplicity bias. To demystify this phenomenon, in this paper, we investigate the differences in the implicit biases of Adam and GD when training two-layer ReLU neural networks on a binary classification task involving synthetic data with Gaussian clusters. We find that GD exhibits a simplicity bias, resulting in a linear decision boundary with a suboptimal margin, whereas Adam leads to much richer and more diverse features, producing a nonlinear boundary that is closer to the Bayes’ optimal predictor. This richer decision boundary also allows Adam to achieve higher test accuracy both in-distribution and under certain distribution shifts. We theoretically prove these results by analyzing the population gradients. To corroborate our theoretical findings, we present empirical results showing that this property of Adam leads to superior generalization across datasets with spurious correlations where neural networks trained with SGD are known to show simplicity bias and don’t generalize well under certain distributional shifts.
[LG-68] How far away are truly hyperparameter-free learning algorithms?
链接: https://arxiv.org/abs/2505.24005
作者: Priya Kasimbeg,Vincent Roulet,Naman Agarwal,Sourabh Medapati,Fabian Pedregosa,Atish Agarwala,George E. Dahl
类目: Machine Learning (cs.LG)
*备注:
Abstract:Despite major advances in methodology, hyperparameter tuning remains a crucial (and expensive) part of the development of machine learning systems. Even ignoring architectural choices, deep neural networks have a large number of optimization and regularization hyperparameters that need to be tuned carefully per workload in order to obtain the best results. In a perfect world, training algorithms would not require workload-specific hyperparameter tuning, but would instead have default settings that performed well across many workloads. Recently, there has been a growing literature on optimization methods which attempt to reduce the number of hyperparameters – particularly the learning rate and its accompanying schedule. Given these developments, how far away is the dream of neural network training algorithms that completely obviate the need for painful tuning? In this paper, we evaluate the potential of learning-rate-free methods as components of hyperparameter-free methods. We freeze their (non-learning rate) hyperparameters to default values, and score their performance using the recently-proposed AlgoPerf: Training Algorithms benchmark. We found that literature-supplied default settings performed poorly on the benchmark, so we performed a search for hyperparameter configurations that performed well across all workloads simultaneously. The best AlgoPerf-calibrated learning-rate-free methods had much improved performance but still lagged slightly behind a similarly calibrated NadamW baseline in overall benchmark score. Our results suggest that there is still much room for improvement for learning-rate-free methods, and that testing against a strong, workload-agnostic baseline is important to improve hyperparameter reduction techniques.
[LG-69] Adaptive Deadline and Batch Layered Synchronized Federated Learning
链接: https://arxiv.org/abs/2505.23973
作者: Asaf Goren,Natalie Lang,Nir Shlezinger,Alejandro Cohen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning (FL) enables collaborative model training across distributed edge devices while preserving data privacy, and typically operates in a round-based synchronous manner. However, synchronous FL suffers from latency bottlenecks due to device heterogeneity, where slower clients (stragglers) delay or degrade global updates. Prior solutions, such as fixed deadlines, client selection, and layer-wise partial aggregation, alleviate the effect of stragglers, but treat round timing and local workload as static parameters, limiting their effectiveness under strict time constraints. We propose ADEL-FL, a novel framework that jointly optimizes per-round deadlines and user-specific batch sizes for layer-wise aggregation. Our approach formulates a constrained optimization problem minimizing the expected L2 distance to the global optimum under constraints on the total training time and the number of global rounds. We provide a convergence analysis under exponential compute models and prove that ADEL-FL yields unbiased updates with bounded variance. Extensive experiments demonstrate that ADEL-FL outperforms alternative methods in both convergence rate and final accuracy under heterogeneous conditions.
[LG-70] Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training
链接: https://arxiv.org/abs/2505.23971
作者: William Merrill,Shane Arora,Dirk Groeneveld,Hannaneh Hajishirzi
类目: Machine Learning (cs.LG)
*备注:
Abstract:The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is too large will harm token efficiency. To navigate this tradeoff, McCandlish et al. (2018) suggest that a critical batch size (CBS), below which training will not substantially degrade loss, can be estimated based on the gradient noise scale during training. While their method has been adopted in practice, e.g., when training GPT-3, strong assumptions are required to justify gradient noise as a proxy for the CBS, which makes it unclear whether their approach should be trusted in practice, limiting its applicability. In this paper, we introduce a simple, empirical approach to directly measure the CBS and show how the CBS evolves over training. Applying our approach to the OLMo models, we find that CBS is near 0 at initialization, increases rapidly at first, and then plateaus as training progresses. Furthermore, we find that this trend holds across different model sizes (1B and 7B), suggesting CBS from small training runs can inform larger-scale training runs. Our findings about how the CBS changes over training motivate batch size warmup as a natural way to reliably train language models at large batch size: start the batch size small and increase it as the CBS grows. To validate this claim, we use batch size warmup to train OLMo 1B to slightly better loss than the original training run with 43% fewer gradient steps. This shows how our framework can be applied to reliably train language models at larger batch sizes, increasing data parallelism without compromising performance.
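Batch size warmup is simple to operationalise. The schedule below is a hedged sketch consistent with the observed trend (CBS near 0 at initialization, rapid growth, then a plateau); the actual OLMo warmup recipe is not given here, and the linear-growth shape and constants are illustrative assumptions.

```python
def batch_size_schedule(step: int, total_steps: int,
                        b_min: int = 32, b_max: int = 4096,
                        warmup_frac: float = 0.25) -> int:
    """Linearly grow the batch size over the first warmup_frac of training,
    then hold it at b_max (rounded to a multiple of b_min for sharding)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    frac = min(1.0, step / warmup_steps)
    b = b_min + frac * (b_max - b_min)
    return int(round(b / b_min) * b_min)

# The batch size ramps up early and plateaus, mirroring the measured CBS curve.
for s in [0, 500, 1000, 2000, 5000]:
    print(s, batch_size_schedule(s, total_steps=8000))
```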
[LG-71] Improved Approximations for Hard Graph Problems using Predictions
链接: https://arxiv.org/abs/2505.23967
作者: Anders Aamand,Justin Y. Chen,Siddharth Gollapudi,Sandeep Silwal,Hao Wu
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
Abstract:We design improved approximation algorithms for NP-hard graph problems by incorporating predictions (e.g., learned from past data). Our prediction model builds upon and extends the \varepsilon-prediction framework by Cohen-Addad, d’Orsi, Gupta, Lee, and Panigrahi (NeurIPS 2024). We consider an edge-based version of this model, where each edge provides two bits of information, corresponding to predictions about whether each of its endpoints belong to an optimal solution. Even with weak predictions where each bit is only \varepsilon-correlated with the true solution, this information allows us to break approximation barriers in the standard setting. We develop algorithms with improved approximation ratios for MaxCut, Vertex Cover, Set Cover, and Maximum Independent Set problems (among others). Across these problems, our algorithms share a unifying theme, where we separately satisfy constraints related to high-degree vertices (using predictions) and low-degree vertices (without using predictions) and carefully combine the answers.
[LG-72] Acoustic Classification of Maritime Vessels using Learnable Filterbanks
链接: https://arxiv.org/abs/2505.23964
作者: Jonas Elsborg,Tejs Vegge,Arghya Bhowmik
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 9 pages, 5 figures, 2 tables
Abstract:Reliably monitoring and recognizing maritime vessels based on acoustic signatures is complicated by the variability of different recording scenarios. A robust classification framework must be able to generalize across diverse acoustic environments and variable source-sensor distances. To this end, we present a deep learning model with robust performance across different recording scenarios. Using a trainable spectral front-end and temporal feature encoder to learn a Gabor filterbank, the model can dynamically emphasize different frequency components. Trained on the VTUAD hydrophone recordings from the Strait of Georgia, our model, CATFISH, achieves a state-of-the-art 96.63% test accuracy across varying source-sensor distances, surpassing the previous benchmark by over 12 percentage points. We present the model, justify our architectural choices, analyze the learned Gabor filters, and perform ablation studies on sensor data fusion and attention-based pooling.
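For intuition, here is a hedged sketch of a learnable 1-D Gabor filterbank front-end with trainable centre frequencies and bandwidths, applied by convolution to the raw waveform; CATFISH's exact parameterisation may differ, and the initial frequency grid and kernel size below are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaborFilterbank(nn.Module):
    """Learnable 1-D Gabor filters applied as a conv front-end (sketch)."""
    def __init__(self, n_filters: int = 40, kernel_size: int = 401,
                 sample_rate: int = 16000):
        super().__init__()
        self.kernel_size = kernel_size
        # Trainable centre frequencies and bandwidths, both in Hz.
        self.freqs = nn.Parameter(torch.linspace(50, sample_rate / 2 * 0.9, n_filters))
        self.sigmas = nn.Parameter(torch.full((n_filters,), 100.0))
        t = (torch.arange(kernel_size) - kernel_size // 2) / sample_rate
        self.register_buffer("t", t)                        # time axis (seconds)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:   # wav: (B, 1, T)
        t = self.t[None, :]                                  # (1, K)
        f = self.freqs[:, None]                              # (F, 1)
        s = 1.0 / (2 * math.pi * self.sigmas[:, None].abs() + 1e-6)  # time std
        envelope = torch.exp(-0.5 * (t / s) ** 2)            # Gaussian window
        kernels = envelope * torch.cos(2 * math.pi * f * t)  # (F, K) Gabor kernels
        return F.conv1d(wav, kernels[:, None, :], padding=self.kernel_size // 2)

fb = GaborFilterbank()
feats = fb(torch.randn(2, 1, 16000))                         # (2, 40, 16000)
```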
[LG-73] Estimating Misreporting in the Presence of Genuine Modification: A Causal Perspective
链接: https://arxiv.org/abs/2505.23954
作者: Dylan Zapzalka,Trenton Chang,Lindsay Warrenburg,Sae-Hwan Park,Daniel K. Shenfeld,Ravi B. Parikh,Jenna Wiens,Maggie Makar
类目: Machine Learning (cs.LG)
*备注:
Abstract:In settings where ML models are used to inform the allocation of resources, agents affected by the allocation decisions might have an incentive to strategically change their features to secure better outcomes. While prior work has studied strategic responses broadly, disentangling misreporting from genuine modification remains a fundamental challenge. In this paper, we propose a causally-motivated approach to identify and quantify how much an agent misreports on average by distinguishing deceptive changes in their features from genuine modification. Our key insight is that, unlike genuine modification, misreported features do not causally affect downstream variables (i.e., causal descendants). We exploit this asymmetry by comparing the causal effect of misreported features on their causal descendants as derived from manipulated datasets against those from unmanipulated datasets. We formally prove identifiability of the misreporting rate and characterize the variance of our estimator. We empirically validate our theoretical results using a semi-synthetic and real Medicare dataset with misreported data, demonstrating that our approach can be employed to identify misreporting in real-world scenarios.
[LG-74] Searching Neural Architectures for Sensor Nodes on IoT Gateways
链接: https://arxiv.org/abs/2505.23939
作者: Andrea Mattia Garavagno,Edoardo Ragusa,Antonio Frisoli,Paolo Gastaldo
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:This paper presents an automatic method for the design of Neural Networks (NNs) at the edge, enabling Machine Learning (ML) access even in privacy-sensitive Internet of Things (IoT) applications. The proposed method runs on IoT gateways and designs NNs for connected sensor nodes without sharing the collected data outside the local network, keeping the data at the site of collection. This approach has the potential to enable ML for Healthcare Internet of Things (HIoT) and Industrial Internet of Things (IIoT), designing hardware-friendly and custom NNs at the edge for personalized healthcare and advanced industrial services such as quality control, predictive maintenance, or fault diagnosis. By preventing data from being disclosed to cloud services, this method safeguards sensitive information, including industrial secrets and personal data. The outcomes of a thorough experimental session confirm that – on the Visual Wake Words dataset – the proposed approach can achieve state-of-the-art results by exploiting a search procedure that runs in less than 10 hours on the Raspberry Pi Zero 2.
[LG-75] Thompson Sampling in Online RLHF with General Function Approximation
链接: https://arxiv.org/abs/2505.23927
作者: Songtao Feng,Jie Fu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning from human feedback (RLHF) has achieved great empirical success in aligning large language models (LLMs) with human preference, and it is of great importance to study the statistical efficiency of RLHF algorithms from a theoretical perspective. In this work, we consider the online RLHF setting where the preference data is revealed during the learning process and study action value function approximation. We design a model-free posterior sampling algorithm for online RLHF inspired by Thompson sampling and provide its theoretical guarantee. Specifically, we adopt the Bellman eluder (BE) dimension as the complexity measure of the function class and establish an O(\sqrt{T}) regret bound for the proposed algorithm, with a multiplicative factor depending on the horizon, the BE dimension, and the log-bracketing number of the function class. Further, in the analysis, we first establish a concentration-type inequality for the squared Bellman error based on the maximum likelihood estimator (MLE) generalization bound, which plays a crucial role in obtaining the eluder-type regret bound and may be of independent interest.
[LG-76] Simplifying Bayesian Optimization Via In-Context Direct Optimum Sampling
链接: https://arxiv.org/abs/2505.23913
作者: Gustavo Sutter Pessurno de Carvalho,Mohammed Abdulrahman,Hao Wang,Sriram Ganapathi Subramanian,Marc St-Aubin,Sharon O’Sullivan,Lawrence Wan,Luis Ricardez-Sandoval,Pascal Poupart,Agustinus Kristiadi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The optimization of expensive black-box functions is ubiquitous in science and engineering. A common solution to this problem is Bayesian optimization (BO), which is generally comprised of two components: (i) a surrogate model and (ii) an acquisition function, which generally require expensive re-training and optimization steps at each iteration, respectively. Although recent work enabled in-context surrogate models that do not require re-training, virtually all existing BO methods still require acquisition function maximization to select the next observation, which introduces many knobs to tune, such as Monte Carlo samplers and multi-start optimizers. In this work, we propose a completely in-context, zero-shot solution for BO that does not require surrogate fitting or acquisition function optimization. This is done by using a pre-trained deep generative model to directly sample from the posterior over the optimum point. We show that this process is equivalent to Thompson sampling and demonstrate the capabilities and cost-effectiveness of our foundation model on a suite of real-world benchmarks. We achieve an efficiency gain of more than 35x in terms of wall-clock time when compared with Gaussian process-based BO, enabling efficient parallel and distributed BO, e.g., for high-throughput optimization.
[LG-77] Patient-Aware Feature Alignment for Robust Lung Sound Classification: Cohesion-Separation and Global Alignment Losses INTERSPEECH2025
链接: https://arxiv.org/abs/2505.23834
作者: Seung Gyu Jeong,Seong Eun Kim
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted INTERSPEECH 2025
Abstract:Lung sound classification is vital for early diagnosis of respiratory diseases. However, biomedical signals often exhibit inter-patient variability even among patients with the same symptoms, requiring a learning approach that considers individual differences. We propose a Patient-Aware Feature Alignment (PAFA) framework with two novel losses, Patient Cohesion-Separation Loss (PCSL) and Global Patient Alignment Loss (GPAL). PCSL clusters features of the same patient while separating those from other patients to capture patient variability, whereas GPAL draws each patient’s centroid toward a global center, preventing feature space fragmentation. Our method achieves outstanding results on the ICBHI dataset with a score of 64.84% for four-class and 72.08% for two-class classification. These findings highlight PAFA’s ability to capture individualized patterns and demonstrate performance gains in distinct patient clusters, offering broader applications for patient-centered healthcare.
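A hedged sketch of the two losses as described: PCSL pulls each sample toward its own patient centroid while pushing different patients' centroids apart with a hinge margin, and GPAL pulls all patient centroids toward their global mean. Margins, loss weights, and distance choices here are assumptions, not the paper's exact formulation.

```python
import torch

def pcsl(features: torch.Tensor, patient_ids: torch.Tensor,
         margin: float = 1.0) -> torch.Tensor:
    """Patient Cohesion-Separation Loss (sketch)."""
    ids = patient_ids.unique()
    cents = torch.stack([features[patient_ids == p].mean(0) for p in ids])
    # Cohesion: squared distance of each sample to its own patient centroid.
    cohesion = torch.stack([
        (features[patient_ids == p] - c).pow(2).sum(1).mean()
        for p, c in zip(ids, cents)]).mean()
    # Separation: hinge on pairwise centroid distances (off-diagonal only).
    d = torch.cdist(cents, cents)
    off = ~torch.eye(len(ids), dtype=torch.bool)
    separation = torch.clamp(margin - d[off], min=0).mean()
    return cohesion + separation

def gpal(features: torch.Tensor, patient_ids: torch.Tensor) -> torch.Tensor:
    """Global Patient Alignment Loss (sketch): centroids -> global centre."""
    ids = patient_ids.unique()
    cents = torch.stack([features[patient_ids == p].mean(0) for p in ids])
    return (cents - cents.mean(0)).pow(2).sum(1).mean()

feats = torch.randn(32, 128, requires_grad=True)  # stand-in lung-sound features
pids = torch.arange(32) % 4                       # 4 patients, 8 samples each
loss = pcsl(feats, pids) + 0.1 * gpal(feats, pids)
loss.backward()
```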
[LG-78] Learning Normal Patterns in Musical Loops
链接: https://arxiv.org/abs/2505.23784
作者: Shayan Dadman,Bernt Arild Bremdal,Børre Bang,Rune Dalmo
类目: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: 27 pages, 10 figures
Abstract:This paper introduces an unsupervised framework for detecting audio patterns in musical samples (loops) through anomaly detection techniques, addressing challenges in music information retrieval (MIR). Existing methods are often constrained by reliance on handcrafted features, domain-specific limitations, or dependence on iterative user interaction. We address these limitations through an architecture combining deep feature extraction with unsupervised anomaly detection. Our approach leverages a pre-trained Hierarchical Token-semantic Audio Transformer (HTS-AT), paired with a Feature Fusion Mechanism (FFM), to generate representations from variable-length audio loops. These embeddings are processed using one-class Deep Support Vector Data Description (Deep SVDD), which learns normative audio patterns by mapping them to a compact latent hypersphere. Evaluations on curated bass and guitar datasets compare standard and residual autoencoder variants against baselines like Isolation Forest (IF) and principal component analysis (PCA) methods. Results show our Deep SVDD models, especially the residual autoencoder variant, deliver improved anomaly separation, particularly for larger variations. This research contributes a flexible, fully unsupervised solution for processing diverse audio samples, overcoming previous structural and input limitations while enabling effective pattern identification through distance-based latent space scoring.
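The one-class Deep SVDD objective at the core of the pipeline is compact enough to sketch: fix a centre c from an initial forward pass, minimise the mean squared distance of the mapped embeddings to c, and score anomalies by that distance. Random vectors stand in for HTS-AT embeddings here, and the network shape is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embeddings = torch.randn(512, 768)            # stand-in for HTS-AT/FFM output

# Bias-free layers help avoid the trivial "all points to c" collapse that
# Deep SVDD is known to suffer when biases are present.
net = nn.Sequential(nn.Linear(768, 256, bias=False), nn.ReLU(),
                    nn.Linear(256, 32, bias=False))
with torch.no_grad():
    center = net(embeddings).mean(0)          # fix c from an initial pass

opt = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-5)
for epoch in range(20):
    dist = (net(embeddings) - center).pow(2).sum(1)
    loss = dist.mean()                        # one-class Deep SVDD objective
    opt.zero_grad()
    loss.backward()
    opt.step()

scores = (net(embeddings) - center).pow(2).sum(1)  # higher = more anomalous
```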
[LG-79] Unified AI for Accurate Audio Anomaly Detection
链接: https://arxiv.org/abs/2505.23781
作者: Hamideh Khaleghpour,Brett McKinney
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 6 pages, 14 figures. Based on original research. Submitted to arXiv for public preprint
Abstract:This paper presents a unified AI framework for high-accuracy audio anomaly detection by integrating advanced noise reduction, feature extraction, and machine learning modeling techniques. The approach combines spectral subtraction and adaptive filtering to enhance audio quality, followed by feature extraction using traditional methods like MFCCs and deep embeddings from pre-trained models such as OpenL3. The modeling pipeline incorporates classical models (SVM, Random Forest), deep learning architectures (CNNs), and ensemble methods to boost robustness and accuracy. Evaluated on benchmark datasets including TORGO and LibriSpeech, the proposed framework demonstrates superior performance in precision, recall, and classification of slurred vs. normal speech. This work addresses challenges in noisy environments and real-time applications and provides a scalable solution for audio-based anomaly detection.
[LG-80] Statistical mechanics of extensive-width Bayesian neural networks near interpolation
链接: https://arxiv.org/abs/2505.24849
作者: Jean Barbier,Francesco Camilli,Minh-Toan Nguyen,Mauro Pastore,Rudy Skerk
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 9 pages + appendices, 12 figures. This submission supersedes arXiv:2501.18530
Abstract:For three decades statistical mechanics has been providing a framework to analyse neural networks. However, the theoretically tractable models, e.g., perceptrons, random features models and kernel machines, or multi-index models and committee machines with few neurons, remained simple compared to those used in applications. In this paper we help reduce the gap between practical networks and their theoretical understanding through a statistical physics analysis of the supervised learning of a two-layer fully connected network with generic weight distribution and activation function, whose hidden layer is large but remains proportional to the inputs dimension. This makes it more realistic than infinitely wide networks where no feature learning occurs, but also more expressive than narrow ones or with fixed inner weights. We focus on the Bayes-optimal learning in the teacher-student scenario, i.e., with a dataset generated by another network with the same architecture. We operate around interpolation, where the number of trainable parameters and of data are comparable and feature learning emerges. Our analysis uncovers a rich phenomenology with various learning transitions as the number of data increases. In particular, the more strongly the features (i.e., hidden neurons of the target) contribute to the observed responses, the less data is needed to learn them. Moreover, when the data is scarce, the model only learns non-linear combinations of the teacher weights, rather than “specialising” by aligning its weights with the teacher’s. Specialisation occurs only when enough data becomes available, but it can be hard to find for practical training algorithms, possibly due to statistical-to-computational gaps.
[LG-81] Generalization Dynamics of Linear Diffusion Models
链接: https://arxiv.org/abs/2505.24769
作者: Claudia Merger,Sebastian Goldt
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Diffusion models trained on finite datasets with N samples from a target distribution exhibit a transition from memorisation, where the model reproduces training examples, to generalisation, where it produces novel samples that reflect the underlying data distribution. Understanding this transition is key to characterising the sample efficiency and reliability of generative models, but our theoretical understanding of this transition is incomplete. Here, we analytically study the memorisation-to-generalisation transition in a simple model using linear denoisers, which allow explicit computation of test errors, sampling distributions, and Kullback-Leibler divergences between samples and target distribution. Using these measures, we predict that this transition occurs roughly when N \asymp d, the dimension of the inputs. When N is smaller than the dimension of the inputs d, so that only a fraction of relevant directions of variation are present in the training data, we demonstrate how both regularization and early stopping help to prevent overfitting. For N > d, we find that the sampling distributions of linear diffusion models approach their optimum (measured by the Kullback-Leibler divergence) linearly with d/N, independent of the specifics of the data distribution. Our work clarifies how sample complexity governs generalisation in a simple model of diffusion-based generative models and provides insight into the training dynamics of linear denoisers.
[LG-82] Knockoff-Guided Compressive Sensing: A Statistical Machine Learning Framework for Support-Assured Signal Recovery
链接: https://arxiv.org/abs/2505.24727
作者: Xiaochen Zhang,Haoyi Xiong
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:This paper introduces a novel Knockoff-guided compressive sensing framework, referred to as \TheName, which enhances signal recovery by leveraging precise false discovery rate (FDR) control during the support identification phase. Unlike LASSO, which jointly performs support selection and signal estimation without explicit error control, our method guarantees FDR control in finite samples, enabling more reliable identification of the true signal support. By separating and controlling the support recovery process through statistical Knockoff filters, our framework achieves more accurate signal reconstruction, especially in challenging scenarios where traditional methods fail. We establish theoretical guarantees demonstrating how FDR control directly ensures recovery performance under weaker conditions than traditional \ell_1-based compressive sensing methods, while maintaining accurate signal reconstruction. Extensive numerical experiments demonstrate that our proposed Knockoff-based method consistently outperforms LASSO-based and other state-of-the-art compressive sensing techniques. In simulation studies, our method improves F1-score by up to 3.9x over baseline methods, attributed to principled false discovery rate (FDR) control and enhanced support recovery. The method also consistently yields lower reconstruction and relative errors. We further validate the framework on real-world datasets, where it achieves top downstream predictive performance across both regression and classification tasks, often narrowing or even surpassing the performance gap relative to uncompressed signals. These results establish \TheName as a robust and practical alternative to existing approaches, offering both theoretical guarantees and strong empirical performance through statistically grounded support selection.
[LG-83] K2IE: Kernel Method-based Kernel Intensity Estimators for Inhomogeneous Poisson Processes ICML2025
链接: https://arxiv.org/abs/2505.24704
作者: Hideaki Kim,Tomoharu Iwata,Akinori Fujino
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted to ICML 2025
Abstract:Kernel method-based intensity estimators, formulated within reproducing kernel Hilbert spaces (RKHSs), and classical kernel intensity estimators (KIEs) have been among the most easy-to-implement and feasible methods for estimating the intensity functions of inhomogeneous Poisson processes. While both approaches share the term “kernel”, they are founded on distinct theoretical principles, each with its own strengths and limitations. In this paper, we propose a novel regularized kernel method for Poisson processes based on the least squares loss and show that the resulting intensity estimator involves a specialized variant of the representer theorem: it has the dual coefficient of unity and coincides with classical KIEs. This result provides new theoretical insights into the connection between classical KIEs and kernel method-based intensity estimators, while enabling us to develop an efficient KIE by leveraging advanced techniques from RKHS theory. We refer to the proposed model as the kernel method-based kernel intensity estimator (K^2IE). Through experiments on synthetic datasets, we show that K^2IE achieves comparable predictive performance while significantly surpassing the state-of-the-art kernel method-based estimator in computational efficiency.
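A classical KIE, which the result above shows the kernel-method estimator coincides with (unit dual coefficients), is a few lines of NumPy: the intensity estimate is a sum of kernels centred on the observed events. Bandwidth choice and edge correction are omitted in this sketch, and the simulated intensity is our own toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate events from lambda(t) = 50 * (1 + sin(2*pi*t)) / 2 on [0, 1]
# by thinning a homogeneous Poisson process of rate 50.
cand = rng.uniform(0, 1, rng.poisson(50))
keep = rng.uniform(0, 1, cand.size) < (1 + np.sin(2 * np.pi * cand)) / 2
events = cand[keep]

def kie(t_grid: np.ndarray, events: np.ndarray, h: float = 0.05) -> np.ndarray:
    """Classical KIE: lambda_hat(t) = sum_i k_h(t - t_i), Gaussian kernel."""
    diff = (t_grid[:, None] - events[None, :]) / h
    return np.exp(-0.5 * diff**2).sum(axis=1) / (h * np.sqrt(2 * np.pi))

grid = np.linspace(0, 1, 200)
lam_hat = kie(grid, events)
print(lam_hat.mean())   # average intensity, roughly the expected event count ~25
```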
[LG-84] Impact of Bottleneck Layers and Skip Connections on the Generalization of Linear Denoising Autoencoders
链接: https://arxiv.org/abs/2505.24668
作者: Jonghyun Ham,Maximilian Fleissner,Debarghya Ghoshdastidar
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Modern deep neural networks exhibit strong generalization even in highly overparameterized regimes. Significant progress has been made to understand this phenomenon in the context of supervised learning, but for unsupervised tasks such as denoising, several open questions remain. While some recent works have successfully characterized the test error of the linear denoising problem, they are limited to linear models (one-layer network). In this work, we focus on two-layer linear denoising autoencoders trained under gradient flow, incorporating two key ingredients of modern deep learning architectures: A low-dimensional bottleneck layer that effectively enforces a rank constraint on the learned solution, as well as the possibility of a skip connection that bypasses the bottleneck. We derive closed-form expressions for all critical points of this model under product regularization, and in particular describe its global minimizer under the minimum-norm principle. From there, we derive the test risk formula in the overparameterized regime, both for models with and without skip connections. Our analysis reveals two interesting phenomena: Firstly, the bottleneck layer introduces an additional complexity measure akin to the classical bias-variance trade-off – increasing the bottleneck width reduces bias but introduces variance, and vice versa. Secondly, skip connection can mitigate the variance in denoising autoencoders – especially when the model is mildly overparameterized. We further analyze the impact of skip connections in denoising autoencoder using random matrix theory and support our claims with numerical evidence.
[LG-85] Predictive posterior sampling from non-stationary Gaussian process priors via Diffusion models with application to climate data
链接: https://arxiv.org/abs/2505.24556
作者: Gabriel V Cardoso,Mike Pereira
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Bayesian models based on Gaussian processes (GPs) offer a flexible framework to predict spatially distributed variables with uncertainty. But the use of nonstationary priors, often necessary for capturing complex spatial patterns, makes sampling from the predictive posterior distribution (PPD) computationally intractable. In this paper, we propose a two-step approach based on diffusion generative models (DGMs) to mimic PPDs associated with non-stationary GP priors: we replace the GP prior by a DGM surrogate, and leverage recent advances on training-free guidance algorithms for DGMs to sample from the desired posterior distribution. We apply our approach to a rich non-stationary GP prior from which exact posterior sampling is intractable and validate that the resulting distributions are close to their GP counterparts using several statistical metrics. We also demonstrate how one can fine-tune the trained DGMs to target specific parts of the GP prior. Finally, we apply the proposed approach to solve inverse problems arising in environmental sciences, thus yielding state-of-the-art predictions.
[LG-86] Real-time Fall Prevention system for the Next-generation of Workers
链接: https://arxiv.org/abs/2505.24487
作者: Nicholas Cartocci,Antonios E. Gkikakis,Darwin G. Caldwell,Jesús Ortiz
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Developing a general-purpose wearable real-time fall-detection system is still a challenging task, especially for healthy and strong subjects, such as industrial workers that work in harsh environments. In this work, we present a hybrid approach for fall detection and prevention, which uses the dynamic model of an inverted pendulum to generate simulations of falling that are then fed to a deep learning framework. The output is a signal to activate a fall mitigation mechanism when the subject is at risk of harm. The advantage of this approach is that abstracted models can be used to efficiently generate training data for thousands of different subjects with different falling initial conditions, something that is practically impossible with real experiments. This approach is suitable for a specific type of fall, where the subjects fall without changing their initial configuration significantly, and it is the first step toward a general-purpose wearable device, with the aim of reducing fall-associated injuries in industrial environments, which can improve the safety of workers.
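A rough sketch of the data-generation idea: integrate an abstracted inverted-pendulum model from many sampled initial conditions to produce labeled falling trajectories for training. The dynamics, fall threshold, and forward-Euler integration below are simplifying assumptions, not the authors' exact model.

```python
import numpy as np

def simulate_fall(theta0, omega0, g=9.81, L=1.0, dt=0.01, t_max=3.0):
    # Forward-Euler integration of an inverted pendulum, theta'' = (g/L) sin(theta),
    # with theta measured from upright so that perturbations grow into a fall.
    theta, omega, traj = theta0, omega0, []
    for _ in range(int(t_max / dt)):
        omega += (g / L) * np.sin(theta) * dt
        theta += omega * dt
        traj.append(theta)
        if abs(theta) > np.pi / 2:   # torso past horizontal: treat as a completed fall
            break
    return np.array(traj)

# Cheaply generate trajectories for many subjects / initial conditions,
# something impractical to collect from real falling experiments.
rng = np.random.default_rng(2)
dataset = [simulate_fall(rng.normal(0.0, 0.05), rng.normal(0.0, 0.1)) for _ in range(1000)]
print(len(dataset), dataset[0].shape)
```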
[LG-87] Distributed gradient methods under heavy-tailed communication noise
链接: https://arxiv.org/abs/2505.24464
作者: Manojlo Vukovic,Dusan Jakovetic,Dragana Bajovic,Soummya Kar
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication
Abstract:We consider a standard distributed optimization problem in which networked nodes collaboratively minimize the sum of their locally known convex costs. For this setting, we address for the first time the fundamental problem of design and analysis of distributed methods to solve the above problem when inter-node communication is subject to heavy-tailed noise. Heavy-tailed noise is highly relevant and frequently arises in densely deployed wireless sensor and Internet of Things (IoT) networks. Specifically, we design a distributed gradient-type method that features a carefully balanced mixed time-scale time-varying consensus and gradient contribution step sizes and a bounded nonlinear operator on the consensus update to limit the effect of heavy-tailed noise. Assuming heterogeneous strongly convex local costs with mutually different minimizers that are arbitrarily far apart, we show that the proposed method converges to a neighborhood of the network-wide problem solution in the mean squared error (MSE) sense, and we also characterize the corresponding convergence rate. We further show that the asymptotic MSE can be made arbitrarily small through consensus step-size tuning, possibly at the cost of slowing down the transient error decay. Numerical experiments corroborate our findings and demonstrate the resilience of the proposed method to heavy-tailed (and infinite variance) communication noise. They also show that existing distributed methods, designed for finite-communication-noise-variance settings, fail in the presence of infinite variance noise.
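A toy NumPy sketch of the update pattern described above, on a ring network: a bounded nonlinearity (tanh, an assumed choice) is applied to the noisy consensus difference, with mixed time-scale step sizes for the consensus and gradient terms. The quadratic local costs and step-size exponents are illustrative, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}   # ring communication graph
targets = rng.normal(size=n)        # minimizer b_i of each local cost f_i(x) = (x - b_i)^2 / 2
x = rng.normal(size=n)

for t in range(1, 5001):
    beta, alpha = t ** -0.6, 1.0 / t          # mixed time-scale consensus / gradient step sizes
    x_new = x.copy()
    for i in range(n):
        for j in neighbors[i]:
            noise = rng.standard_t(df=1.5)    # heavy-tailed, infinite-variance link noise
            x_new[i] -= beta * np.tanh(x[i] - x[j] + noise)   # bounded nonlinear consensus step
        x_new[i] -= alpha * (x[i] - targets[i])               # local gradient step
    x = x_new

print("node spread:", np.ptp(x), "| mean estimate vs avg target:", x.mean(), targets.mean())
```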
[LG-88] Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
链接: https://arxiv.org/abs/2505.24333
作者: Alessio Giorlandino,Sebastian Goldt
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注:
Abstract:Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While the right initialisation has been extensively studied in feed-forward networks, an exact description of signal propagation through a full transformer block has so far been lacking. Here, we provide an analytical theory of signal propagation through vanilla transformer blocks with self-attention layers, layer normalisation, skip connections and ReLU MLP. To treat the self-attention layer, we draw on a formal parallel with the Random Energy Model from statistical physics. We identify and characterise two regimes governed by the variance of the query and key initialisations: a low-variance regime, where we recover the known rank collapse behaviour; and a previously unexplored high-variance regime, where signal is preserved but entropy collapse occurs. In the low-variance regime, we calculate the critical strength for the residual connection to ensure signal propagation. Our theory yields trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. Experiments with BERT-style models trained on TinyStories validate our predictions. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantees smooth training.
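The two regimes can be probed numerically with a single self-attention layer at initialization: scan the query/key variance and measure attention entropy (low values indicate entropy collapse) and output-token similarity (values near one indicate the rank-collapse direction). This is a one-layer diagnostic sketch under assumed dimensions, not the paper's full signal-propagation theory.

```python
import numpy as np

def attention_stats(sigma_qk, d=64, n_tokens=32, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_tokens, d)) / np.sqrt(d)        # token representations at init
    Wq = sigma_qk * rng.normal(size=(d, d))                # query/key weights with variance sigma_qk^2
    Wk = sigma_qk * rng.normal(size=(d, d))
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                      # row-wise softmax attention
    entropy = -(A * np.log(A + 1e-12)).sum(axis=1).mean()  # low -> entropy collapse
    out = A @ X
    sim = np.corrcoef(out)                                 # pairwise similarity of output tokens
    return entropy, sim[np.triu_indices_from(sim, k=1)].mean()  # similarity near 1 -> rank collapse

for s in (0.05, 1.0, 20.0):
    print(s, attention_stats(s))
```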
[LG-89] Equilibrium Distribution for t-Distributed Stochastic Neighbor Embedding with Generalized Kernels
链接: https://arxiv.org/abs/2505.24311
作者: Yi Gu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:
Abstract:T-distributed stochastic neighbor embedding (t-SNE) is a well-known algorithm for visualizing high-dimensional data by finding low-dimensional representations. In this paper, we study the convergence of t-SNE with generalized kernels and extend the results of Auffinger and Fletcher in 2023. Our work starts by giving a concrete formulation of generalized input and output kernels. Then we prove that under certain conditions, the t-SNE algorithm converges to an equilibrium distribution for a wide range of input and output kernels as the number of data points diverges.
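For intuition, here is a small sketch of what "generalized input and output kernels" means operationally in t-SNE: the affinity matrices P and Q are built from arbitrary kernels of pairwise squared distances (a Gaussian input kernel and a Student-t output kernel below, as illustrative instances), and the algorithm minimizes their KL divergence.

```python
import numpy as np

def affinities(X, kernel):
    # Pairwise affinities from a generalized kernel of squared distances,
    # normalized to a probability distribution as in t-SNE.
    D2 = np.square(X[:, None] - X[None]).sum(-1)
    K = kernel(D2)
    np.fill_diagonal(K, 0.0)
    return K / K.sum()

rng = np.random.default_rng(4)
X_high, Y_low = rng.normal(size=(50, 10)), rng.normal(size=(50, 2))
P = affinities(X_high, lambda d2: np.exp(-d2))        # Gaussian input kernel
Q = affinities(Y_low, lambda d2: 1.0 / (1.0 + d2))    # Student-t output kernel
kl = np.sum(P * np.log((P + 1e-12) / (Q + 1e-12)))    # objective t-SNE minimizes
print("KL(P || Q) =", kl)
```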
[LG-90] Data Fusion for Partial Identification of Causal Effects
链接: https://arxiv.org/abs/2505.24296
作者: Quinn Lanners,Cynthia Rudin,Alexander Volfovsky,Harsh Parikh
类目: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:
Abstract:Data fusion techniques integrate information from heterogeneous data sources to improve learning, generalization, and decision making across data sciences. In causal inference, these methods leverage rich observational data to improve causal effect estimation, while maintaining the trustworthiness of randomized controlled trials. Existing approaches often relax the strong no unobserved confounding assumption by instead assuming exchangeability of counterfactual outcomes across data sources. However, when both assumptions simultaneously fail - a common scenario in practice - current methods cannot identify or estimate causal effects. We address this limitation by proposing a novel partial identification framework that enables researchers to answer key questions such as: Is the causal effect positive or negative? and How severe must assumption violations be to overturn this conclusion? Our approach introduces interpretable sensitivity parameters that quantify assumption violations and derives corresponding causal effect bounds. We develop doubly robust estimators for these bounds and operationalize breakdown frontier analysis to understand how causal conclusions change as assumption violations increase. We apply our framework to the Project STAR study, which investigates the effect of classroom size on students’ third-grade standardized test performance. Our analysis reveals that the Project STAR results are robust to simultaneous violations of key assumptions, both on average and across various subgroups of interest. This strengthens confidence in the study’s conclusions despite potential unmeasured biases in the data.
[LG-91] Multi-task Learning for Heterogeneous Data via Integrating Shared and Task-Specific Encodings
链接: https://arxiv.org/abs/2505.24281
作者: Yang Sui,Qi Xu,Yang Bai,Annie Qu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Multi-task learning (MTL) has become an essential machine learning tool for addressing multiple learning tasks simultaneously and has been effectively applied across fields such as healthcare, marketing, and biomedical research. However, to enable efficient information sharing across tasks, it is crucial to leverage both shared and heterogeneous information. Despite extensive research on MTL, various forms of heterogeneity, including distribution and posterior heterogeneity, present significant challenges. Existing methods often fail to address these forms of heterogeneity within a unified framework. In this paper, we propose a dual-encoder framework to construct a heterogeneous latent factor space for each task, incorporating a task-shared encoder to capture common information across tasks and a task-specific encoder to preserve unique task characteristics. Additionally, we explore the intrinsic similarity structure of the coefficients corresponding to learned latent factors, allowing for adaptive integration across tasks to manage posterior heterogeneity. We introduce a unified algorithm that alternately learns the task-specific and task-shared encoders and coefficients. In theory, we investigate the excess risk bound for the proposed MTL method using local Rademacher complexity and apply it to a new but related task. Through simulation studies, we demonstrate that the proposed method outperforms existing data integration methods across various settings. Furthermore, the proposed method achieves superior predictive performance for time to tumor doubling across five distinct cancer types in PDX data.
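A structural sketch of the dual-encoder idea, assuming a simple nonlinear encoder form: a task-shared encoder and a task-specific encoder produce latent factors that are concatenated and weighted by task-specific coefficients. The dimensions and the tanh nonlinearity are illustrative assumptions, not the paper's specification.

```python
import numpy as np

class DualEncoderMTL:
    """Sketch of the dual-encoder layout: shared + task-specific latent factors."""
    def __init__(self, d_in, d_latent, n_tasks, seed=0):
        rng = np.random.default_rng(seed)
        self.W_shared = 0.1 * rng.normal(size=(d_latent, d_in))            # task-shared encoder
        self.W_task = [0.1 * rng.normal(size=(d_latent, d_in)) for _ in range(n_tasks)]
        self.coef = [0.1 * rng.normal(size=2 * d_latent) for _ in range(n_tasks)]

    def predict(self, x, task):
        z_shared = np.tanh(self.W_shared @ x)        # common information across tasks
        z_spec = np.tanh(self.W_task[task] @ x)      # unique task characteristics
        return self.coef[task] @ np.concatenate([z_shared, z_spec])

model = DualEncoderMTL(d_in=16, d_latent=8, n_tasks=3)
x = np.random.default_rng(1).normal(size=16)
print([model.predict(x, task=t) for t in range(3)])
```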
[LG-92] Aligning Protein Conformation Ensemble Generation with Physical Feedback ICML2025
链接: https://arxiv.org/abs/2505.24203
作者: Jiarui Lu,Xiaoyin Chen,Stephen Zhewen Lu,Aurélie Lozano,Vijil Chenthamarakshan,Payel Das,Jian Tang
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: Published as a conference paper at ICML 2025
Abstract:Protein dynamics play a crucial role in protein biological functions and properties, and their traditional study typically relies on time-consuming molecular dynamics (MD) simulations conducted in silico. Recent advances in generative modeling, particularly denoising diffusion models, have enabled efficient and accurate protein structure prediction and conformation sampling by learning distributions over crystallographic structures. However, effectively integrating physical supervision into these data-driven approaches remains challenging, as standard energy-based objectives often lead to intractable optimization. In this paper, we introduce Energy-based Alignment (EBA), a method that aligns generative models with feedback from physical models, efficiently calibrating them to appropriately balance conformational states based on their energy differences. Experimental results on the MD ensemble benchmark demonstrate that EBA achieves state-of-the-art performance in generating high-quality protein ensembles. By improving the physical plausibility of generated structures, our approach enhances model predictions and holds promise for applications in structural biology and drug discovery.
[LG-93] Information-theoretic machine learning for time-varying mode decomposition of separated airfoil wakes
链接: https://arxiv.org/abs/2505.24132
作者: Kai Fukami,Ryo Araki
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:We perform an information-theoretic mode decomposition for separated wakes around a wing. The current data-driven approach based on a neural network referred to as deep sigmoidal flow enables the extraction of an informative component from a given flow field snapshot with respect to a target variable at a future time stamp, thereby capturing the causality as a time-varying modal structure. We consider three examples of separated flows around a NACA0012 airfoil, namely, 1. laminar periodic wake at post-stall angles of attack, 2. strong vortex-airfoil interactions, and 3. a turbulent wake in a spanwise-periodic domain. The present approach reveals informative vortical structures associated with a time-varying lift response. For the periodic shedding cases, the informative structures vary in time corresponding to the fluctuation level from their mean values. With the second example of vortex-airfoil interactions, how the effect of vortex gust on a wing emerges in the lift response over time is identified in an interpretable manner. Furthermore, for the case of turbulent wake, the present model highlights structures near the wing and vortex cores as informative components based solely on the information metric without any prior knowledge of aerodynamics and length scales. This study provides causality-based insights into a range of unsteady aerodynamic problems.
[LG-94] Performative Risk Control: Calibrating Models for Reliable Deployment under Performativity
链接: https://arxiv.org/abs/2505.24097
作者: Victor Li,Baiting Chen,Yuzhen Mao,Qi Lei,Zhun Deng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Calibrating blackbox machine learning models to achieve risk control is crucial to ensure reliable decision-making. A rich line of literature has been studying how to calibrate a model so that its predictions satisfy explicit finite-sample statistical guarantees under a fixed, static, and unknown data-generating distribution. However, prediction-supported decisions may influence the outcome they aim to predict, a phenomenon named performativity of predictions, which is commonly seen in social science and economics. In this paper, we introduce Performative Risk Control, a framework to calibrate models to achieve risk control under performativity with provable theoretical guarantees. Specifically, we provide an iteratively refined calibration process, where we ensure the predictions are improved and risk-controlled throughout the process. We also study different types of risk measures and choices of tail bounds. Lastly, we demonstrate the effectiveness of our framework by numerical experiments on the task of predicting credit default risk. To the best of our knowledge, this work is the first one to study statistically rigorous risk control under performativity, which will serve as an important safeguard against a wide range of strategic manipulation in decision-making processes.
[LG-95] Cross-Modal Characterization of Thin Film MoS₂ Using Generative Models
链接: https://arxiv.org/abs/2505.24065
作者: Isaiah A. Moses,Chen Chen,Joan M. Redwing,Wesley F. Reinhart
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注: 36 pages, 10 figures, 10 tables
Abstract:The growth and characterization of materials using empirical optimization typically requires a significant amount of expert time, experience, and resources. Several complementary characterization methods are routinely performed to determine the quality and properties of a grown sample. Machine learning (ML) can support the conventional approaches by using historical data to guide and provide speed and efficiency to the growth and characterization of materials. Specifically, ML can provide quantitative information from characterization data that is typically obtained from a different modality. In this study, we have investigated the feasibility of projecting the quantitative metric from microscopy measurements, such as atomic force microscopy (AFM), using data obtained from spectroscopy measurements, like Raman spectroscopy. Generative models were also trained to generate the full and specific features of the Raman and photoluminescence spectra from each other and the AFM images of the thin film MoS₂. The results are promising and have provided a foundational guide for the use of ML for the cross-modal characterization of materials for their accelerated, efficient, and cost-effective discovery.
[LG-96] Leveraging machine learning features for linear optical interferometer control
链接: https://arxiv.org/abs/2505.24032
作者: Sergei S. Kuzmin,Ivan V. Dyakonov,Stanislav S. Straupe
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Optics (physics.optics)
*备注: 8 pages, 8 figures
Abstract:We have developed an algorithm that constructs a model of a reconfigurable optical interferometer, independent of specific architectural constraints. The programming of unitary transformations on the interferometer’s optical modes relies on either an analytical method for deriving the unitary matrix from a set of phase shifts or an optimization routine when such decomposition is not available. Our algorithm employs a supervised learning approach, aligning the interferometer model with a training set derived from the device being studied. A straightforward optimization procedure leverages this trained model to determine the phase shifts of the interferometer with a specific architecture, obtaining the required unitary transformation. This approach enables the effective tuning of interferometers without requiring a precise analytical solution, paving the way for the exploration of new interferometric circuit architectures.
[LG-97] A2 Copula-Driven Spatial Bayesian Neural Network For Modeling Non-Gaussian Dependence: A Simulation Study
链接: https://arxiv.org/abs/2505.24006
作者: Agnideep Aich,Sameera Hewage,Md Monzur Murshed,Ashit Baran Aich,Amanda Mayeaux,Asim K. Dey,Kumer P. Das,Bruce Wade
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:In this paper, we introduce the A2 Copula Spatial Bayesian Neural Network (A2-SBNN), a predictive spatial model designed to map coordinates to continuous fields while capturing both typical spatial patterns and extreme dependencies. By embedding the novel dual-tail Archimedean copula A2 directly into the network’s weight initialization, A2-SBNN naturally models complex spatial relationships, including rare co-movements in the data. The model is trained through a calibration-driven process combining Wasserstein loss, moment matching, and correlation penalties to refine predictions and manage uncertainty. Simulation results show that A2-SBNN consistently delivers high accuracy across a wide range of dependency strengths, offering a new, effective solution for spatial data modeling beyond traditional Gaussian-based approaches.
[LG-98] Can Emotion Fool Anti-spoofing? INTERSPEECH2025
链接: https://arxiv.org/abs/2505.23962
作者: Aurosweta Mahapatra,Ismail Rasim Ulgen,Abinay Reddy Naini,Carlos Busso,Berrak Sisman
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to Interspeech 2025
Abstract:Traditional anti-spoofing focuses on models and datasets built on synthetic speech with a mostly neutral state, neglecting diverse emotional variations. As a result, their robustness against high-quality, emotionally expressive synthetic speech is uncertain. We address this by introducing EmoSpoof-TTS, a corpus of emotional text-to-speech samples. Our analysis shows existing anti-spoofing models struggle with emotional synthetic speech, exposing risks of emotion-targeted attacks. Even when trained on emotional data, the models underperform due to their limited focus on emotional aspects, and they show performance disparities across emotions. This highlights the need for an emotion-focused anti-spoofing paradigm in both dataset and methodology. We propose GEM, a gated ensemble of emotion-specialized models with a speech emotion recognition gating network. GEM performs effectively across all emotions and the neutral state, improving defenses against spoofing attacks. We release the EmoSpoof-TTS Dataset: this https URL
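A minimal sketch of the gated-ensemble inference pattern GEM describes: an SER-style gate produces soft weights over emotion-specialized anti-spoofing models, and the final spoofing score is their weighted combination. The toy linear models and interfaces below are hypothetical stand-ins, not the released system.

```python
import numpy as np

rng = np.random.default_rng(5)
softmax = lambda v: np.exp(v - v.max()) / np.exp(v - v.max()).sum()

# Hypothetical stand-ins: four emotion-specialized spoofing scorers and a
# speech-emotion-recognition gating network, all over an 8-dim feature vector.
specialists = [lambda f, w=rng.normal(size=8): 1.0 / (1.0 + np.exp(-w @ f)) for _ in range(4)]
gate = lambda f, W=rng.normal(size=(4, 8)): softmax(W @ f)

def gem_score(features):
    weights = gate(features)                              # soft emotion posteriors (sum to 1)
    scores = np.array([m(features) for m in specialists]) # per-emotion spoofing scores
    return float(weights @ scores)                        # gated ensemble spoofing score

print(gem_score(rng.normal(size=8)))
```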
[LG-99] Gibbs randomness-compression proposition: An efficient deep learning
链接: https://arxiv.org/abs/2505.23869
作者: M. Süzen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 5 pages, 5 figures
Abstract:A proposition connecting randomness and compression is put forward, via Gibbs entropy over the set of measurement vectors associated with a compression process. The proposition states that a lossy compression process is equivalent to directed randomness that preserves information content. The proposition originated from the behaviour observed in the newly proposed Dual Tomographic Compression (DTC) compress-train framework. This is akin to tomographic reconstruction of layer weight matrices via building compressed sensed projections, so-called weight rays. This tomographic approach is applied to the previous and next layers in a dual fashion, which triggers neuronal-level pruning. This novel compress-train scheme proceeds iteratively and acts as a smart neural architecture search. Experiments demonstrated the utility of this dual tomography, producing state-of-the-art performance with efficient compression during training, accelerating training and supporting the lottery ticket hypothesis. However, random compress-train iterations achieving similar performance demonstrated the connection between randomness and compression; from a statistical physics perspective, we formulated the so-called Gibbs randomness-compression proposition, signifying the randomness-compression relationship via Gibbs entropy. Practically, the DTC framework provides a promising approach for massively energy- and resource-efficient deep learning training.
[LG-100] A Start To End Machine Learning Approach To Maximize Scientific Throughput From The LCLS-II-HE
链接: https://arxiv.org/abs/2505.23858
作者: Aashwin Mishra,Matt Seaberg,Ryan Roussel,Fred Poitevin,Jana Thayer,Daniel Ratner,Auralee Edelen,Apurva Mehta
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注:
Abstract:With the increasing brightness of Light sources, including the Diffraction-Limited brightness upgrade of APS and the high-repetition-rate upgrade of LCLS, the proposed experiments therein are becoming increasingly complex. For instance, experiments at LCLS-II-HE will require the X-ray beam to be within a fraction of a micron in diameter, with pointing stability of a few nanoradians, at the end of a kilometer-long electron accelerator, a hundred-meter-long undulator section, and tens of meters long X-ray optics. This enhancement of brightness will increase the data production rate to rival the largest data generators in the world. Without real-time active feedback control and an optimized pipeline to transform measurements to scientific information and insights, researchers will drown in a deluge of mostly useless data, and fail to extract the highly sophisticated insights that the recent brightness upgrades promise. In this article, we outline the strategy we are developing at SLAC to implement Machine Learning driven optimization, automation and real-time knowledge extraction from the electron-injector at the start of the electron accelerator, to the multidimensional X-ray optical systems, and till the experimental endstations and the high readout rate, multi-megapixel detectors at LCLS to deliver the design performance to the users. This is illustrated via examples from Accelerator, Optics and End User applications.
[LG-101] On the Parallels Between Evolutionary Theory and the State of AI GECCO2025
链接: https://arxiv.org/abs/2505.23774
作者: Zeki Doruk Erden,Boi Faltings
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Adaptation and Self-Organizing Systems (nlin.AO)
*备注: Published at the Evolving Self-Organization Workshop in GECCO 2025
Abstract:This article critically examines the foundational principles of contemporary AI methods, exploring the limitations that hinder its potential. We draw parallels between the modern AI landscape and the 20th-century Modern Synthesis in evolutionary biology, and highlight how advancements in evolutionary theory that augmented the Modern Synthesis, particularly those of Evolutionary Developmental Biology, offer insights that can inform a new design paradigm for AI. By synthesizing findings across AI and evolutionary theory, we propose a pathway to overcome existing limitations, enabling AI to achieve its aspirational goals.
信息检索
[IR-0] Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings
链接: https://arxiv.org/abs/2505.24782
作者: Max Conti,Manuel Faysse,Gautier Viaud,Antoine Bosselut,Céline Hudelot,Pierre Colombo
类目: Information Retrieval (cs.IR)
*备注: Under Review
Abstract:A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same documents independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations. In this work, we introduce ConTEB (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose InSeNT (In-sequence Negative Training), a novel contrastive post-training approach which combined with late chunking pooling enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on ConTEB without sacrificing base model performance. We further find chunks embedded with our method are more robust to suboptimal chunking strategies and larger retrieval corpus sizes. We open-source all artifacts at this https URL.
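A minimal sketch of the late chunking pooling the method builds on: the whole document is encoded once, and each chunk embedding is mean-pooled from the contextualized token vectors inside its span, so chunks inherit document-wide context. The dimensions, spans, and mean pooling below are illustrative assumptions.

```python
import numpy as np

def late_chunk_pool(token_embeddings, chunk_spans):
    # Mean-pool contextualized token vectors inside each chunk span; the
    # tokens were produced by one pass over the full document, so every
    # chunk embedding carries document-wide context.
    return np.stack([token_embeddings[s:e].mean(axis=0) for s, e in chunk_spans])

rng = np.random.default_rng(6)
doc_tokens = rng.normal(size=(300, 768))      # stand-in for a transformer's token outputs
spans = [(0, 100), (100, 200), (200, 300)]    # chunk boundaries in token positions
print(late_chunk_pool(doc_tokens, spans).shape)   # -> (3, 768)
```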
[IR-1] A Novel Discrete Memristor-Coupled Heterogeneous Dual-Neuron Model and Its Application in Multi-Scenario Image Encryption
链接: https://arxiv.org/abs/2505.24294
作者: Yi Zou,Mengjiao Wang,Xinan Zhang,Herbert Ho-Ching Iu
类目: Information Retrieval (cs.IR)
*备注: IEEE INTERNET OF THINGS JOURNAL(IOTJ)
Abstract:Simulating brain functions using neural networks is an important area of research. Recently, discrete memristor-coupled neurons have attracted significant attention, as memristors effectively mimic synaptic behavior, which is essential for learning and memory. This highlights the biological relevance of such models. This study introduces a discrete memristive heterogeneous dual-neuron network (MHDNN). The stability of the MHDNN is analyzed with respect to initial conditions and a range of neuronal parameters. Numerical simulations demonstrate complex dynamical behaviors. Various neuronal firing patterns are investigated under different coupling strengths, and synchronization phenomena between neurons are explored. The MHDNN is implemented and validated on the STM32 hardware platform. An image encryption algorithm based on the MHDNN is proposed, along with two hardware platforms tailored for multi-scenario police image encryption. These solutions enable real-time and secure transmission of police data in complex environments, reducing hacking risks and enhancing system security.
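For intuition about the encryption pipeline, here is a minimal sketch that XORs an image with a keystream derived from a chaotic map; a logistic map is used as a hypothetical stand-in for the MHDNN's chaotic sequence, which is not reproduced here.

```python
import numpy as np

def chaotic_keystream(n, x0=0.7, r=3.99):
    # Iterate a logistic map (stand-in for the memristive dual-neuron map)
    # and quantize its orbit to bytes.
    xs, x = np.empty(n), x0
    for i in range(n):
        x = r * x * (1.0 - x)
        xs[i] = x
    return np.floor(xs * 255).astype(np.uint8)

rng = np.random.default_rng(7)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)   # stand-in "police image"
ks = chaotic_keystream(img.size).reshape(img.shape)
cipher = img ^ ks                                           # XOR diffusion
assert np.array_equal(cipher ^ ks, img)                     # decryption is the same XOR
```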
[IR-2] On the Scaling of Robustness and Effectiveness in Dense Retrieval
链接: https://arxiv.org/abs/2505.24279
作者: Yu-An Liu,Ruqing Zhang,Jiafeng Guo,Maarten de Rijke,Yixing Fan,Xueqi Cheng
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Robustness and Effectiveness are critical aspects of developing dense retrieval models for real-world applications. It is known that there is a trade-off between the two. Recent work has addressed scaling laws of effectiveness in dense retrieval, revealing a power-law relationship between effectiveness and the size of models and data. Does robustness follow scaling laws too? If so, can scaling improve both robustness and effectiveness together, or do they remain locked in a trade-off? To answer these questions, we conduct a comprehensive experimental study. We find that: (i) Robustness, including out-of-distribution and adversarial robustness, also follows a scaling law. (ii) Robustness and effectiveness exhibit different scaling patterns, leading to significant resource costs when jointly improving both. Given these findings, we shift to the third factor that affects model performance, namely the optimization strategy, beyond the model size and data size. We find that: (i) By fitting different optimization strategies, the joint performance of robustness and effectiveness traces out a Pareto frontier. (ii) When the optimization strategy strays from Pareto efficiency, the joint performance scales in a sub-optimal direction. (iii) By adjusting the optimization weights to fit the Pareto efficiency, we can achieve Pareto training, where the scaling of joint performance becomes most efficient. Even without requiring additional resources, Pareto training is comparable to the performance of scaling resources several times under optimization strategies that overly prioritize either robustness or effectiveness. Finally, we demonstrate that our findings can help deploy dense retrieval models in real-world applications that scale efficiently and are balanced for robustness and effectiveness.
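Fitting such a power-law scaling relation is straightforward; the sketch below recovers the exponent of metric = a · size^b from synthetic points by linear regression in log-log space (the numbers are made up for illustration, not the paper's measurements).

```python
import numpy as np

sizes = np.array([1e6, 1e7, 1e8, 1e9])                      # model or data sizes
metric = 0.2 * sizes ** 0.08                                # synthetic power-law observations
metric += np.random.default_rng(9).normal(0.0, 0.005, 4)    # measurement noise
b, log_a = np.polyfit(np.log(sizes), np.log(metric), 1)     # linear fit in log-log space
print(f"fitted exponent b = {b:.3f}, prefactor a = {np.exp(log_a):.3f}")
```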
[IR-3] MGS³: A Multi-Granularity Self-Supervised Code Search Framework
链接: https://arxiv.org/abs/2505.24274
作者: Rui Li,Junfeng Kang,Qi Liu,Liyang He,Zheng Zhang,Yunhao Sha,Linbo Zhu,Zhenya Huang
类目: Software Engineering (cs.SE); Information Retrieval (cs.IR)
*备注:
Abstract:In the pursuit of enhancing software reusability and developer productivity, code search has emerged as a key area, aimed at retrieving code snippets relevant to functionalities based on natural language queries. Despite significant progress in self-supervised code pre-training utilizing the vast amount of code data in repositories, existing methods have primarily focused on leveraging contrastive learning to align natural language with function-level code snippets. These studies have overlooked the abundance of fine-grained (such as block-level and statement-level) code snippets prevalent within the function-level code snippets, which results in suboptimal performance across all levels of granularity. To address this problem, we first construct a multi-granularity code search dataset called MGCodeSearchNet, which contains 536K+ pairs of natural language and code snippets. Subsequently, we introduce a novel Multi-Granularity Self-Supervised contrastive learning code Search framework (MGS³). First, MGS³ features a Hierarchical Multi-Granularity Representation module (HMGR), which leverages syntactic structural relationships for hierarchical representation and aggregates fine-grained information into coarser-grained representations. Then, during the contrastive learning phase, we endeavor to construct positive samples of the same granularity for fine-grained code, and introduce in-function negative samples for fine-grained code. Finally, we conduct extensive experiments on code search benchmarks across various granularities, demonstrating that the framework exhibits outstanding performance in code search tasks of multiple granularities. These experiments also showcase its model-agnostic nature and compatibility with existing pre-trained code representation models.
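The contrastive phase can be pictured with a standard InfoNCE-style loss: a query embedding scored against one matching code snippet at a given granularity plus in-function negatives. This is a generic sketch under assumed embeddings, not the exact MGS³ objective.

```python
import numpy as np

def info_nce(query, candidates, temperature=0.07):
    # Row 0 of `candidates` is the matching snippet at some granularity
    # (function / block / statement); the remaining rows are negatives.
    q = query / np.linalg.norm(query)
    C = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = C @ q / temperature
    return -logits[0] + np.log(np.exp(logits).sum())   # cross-entropy on the positive

rng = np.random.default_rng(8)
print(info_nce(rng.normal(size=128), rng.normal(size=(16, 128))))
```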
[IR-4] Heterogeneous Graph Masked Contrastive Learning for Robust Recommendation
链接: https://arxiv.org/abs/2505.24172
作者: Lei Sang,Yu Wang,Yiwen Zhang
类目: Information Retrieval (cs.IR)
*备注: 12 pages, 7 figures
Abstract:Heterogeneous graph neural networks (HGNNs) have demonstrated their superiority in exploiting auxiliary information for recommendation tasks. However, graphs constructed using meta-paths in HGNNs are usually too dense and contain a large number of noise edges. The propagation mechanism of HGNNs propagates even small amounts of noise in a graph to distant neighboring nodes, thereby affecting numerous node embeddings. To address this limitation, we introduce a novel model, named Masked Contrastive Learning (MCL), to enhance recommendation robustness to noise. MCL employs a random masking strategy to augment the graph via meta-paths, reducing node sensitivity to specific neighbors and bolstering embedding robustness. Furthermore, MCL employs contrastive cross-view on a Heterogeneous Information Network (HIN) from two perspectives: one-hop neighbors and meta-path neighbors. This approach acquires embeddings capturing both local and high-order structures simultaneously for recommendation. Empirical evaluations on three real-world datasets confirm the superiority of our approach over existing recommendation methods.
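A minimal sketch of the random masking step MCL applies to meta-path-based graphs: edges are dropped with some probability before contrastive training, reducing each node's sensitivity to any specific neighbor. The dense adjacency matrix and keep probability below are illustrative assumptions.

```python
import numpy as np

def mask_meta_path_edges(adj, keep_prob=0.8, seed=0):
    # Keep each meta-path edge independently with probability `keep_prob`.
    rng = np.random.default_rng(seed)
    return adj * (rng.random(adj.shape) < keep_prob)

adj = (np.random.default_rng(10).random((6, 6)) < 0.5).astype(float)  # toy meta-path graph
print(mask_meta_path_edges(adj))
```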
[IR-5] Transforming Podcast Preview Generation: From Expert Models to LLM-Based Systems ACL2025
链接: https://arxiv.org/abs/2505.23908
作者: Winstead Zhu,Ann Clifton,Azin Ghazimatin,Edgar Tanaka,Ward Ronan
类目: Information Retrieval (cs.IR)
*备注: 9 pages, 2 figures, accepted at ACL 2025 Industry Track
Abstract:Discovering and evaluating long-form talk content such as videos and podcasts poses a significant challenge for users, as it requires a considerable time investment. Previews offer a practical solution by providing concise snippets that showcase key moments of the content, enabling users to make more informed and confident choices. We propose an LLM-based approach for generating podcast episode previews and deploy the solution at scale, serving hundreds of thousands of podcast previews in a real-world application. Comprehensive offline evaluations and online A/B testing demonstrate that LLM-generated previews consistently outperform a strong baseline built on top of various ML expert models, showcasing a significant reduction in the need for meticulous feature engineering. The offline results indicate notable enhancements in understandability, contextual clarity, and interest level, and the online A/B test shows a 4.6% increase in user engagement with preview content, along with a 5x boost in processing efficiency, offering a more streamlined and performant solution compared to the strong baseline of feature-engineered expert models.