This post lists the latest papers retrieved from arXiv.org on 2025-05-30, updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.
Note: paper data is fetched from arXiv.org and updated automatically around 12:00 each day.
Table of Contents
Overview (2025-05-30)
A total of 695 new papers today, including:
- Natural Language Processing: 144 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 202 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 179 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 252 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] From Chat Logs to Collective Insights: Aggregative Question Answering
[Quick Read]: This paper addresses the inability of existing approaches to aggregate and reason over large-scale conversation logs to extract collective insights. The key to its solution is a new task, Aggregative Question Answering, which requires models to reason explicitly over thousands of user-chatbot interactions to answer aggregative questions, such as identifying emerging concerns among specific demographics. To support research in this direction, the authors construct WildChat-AQA, a benchmark of 6,027 aggregative questions derived from 182,330 real-world conversations.
Link: https://arxiv.org/abs/2505.23765
Authors: Wentao Zhang, Woojeong Kim, Yuntian Deng
Affiliations: University of Waterloo; Cornell University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Conversational agents powered by large language models (LLMs) are rapidly becoming integral to our daily interactions, generating unprecedented amounts of conversational data. Such datasets offer a powerful lens into societal interests, trending topics, and collective concerns. Yet, existing approaches typically treat these interactions as independent and miss critical insights that could emerge from aggregating and reasoning across large-scale conversation logs. In this paper, we introduce Aggregative Question Answering, a novel task requiring models to reason explicitly over thousands of user-chatbot interactions to answer aggregative queries, such as identifying emerging concerns among specific demographics. To enable research in this direction, we construct a benchmark, WildChat-AQA, comprising 6,027 aggregative questions derived from 182,330 real-world chatbot conversations. Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs, underscoring the need for new approaches capable of extracting collective insights from large-scale conversational data.
[NLP-1] MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
[Quick Read]: This paper targets the inadequate evaluation of multi-image spatial reasoning in multimodal large language models (MLLMs): existing benchmarks probe only single-image relations and fail to reflect the multi-image spatial reasoning that real-world scenarios demand. The key to its solution is MMSI-Bench, a visual question answering (VQA) benchmark dedicated to multi-image spatial intelligence, which comprehensively assesses spatial reasoning through 1,000 carefully crafted, unambiguous multiple-choice questions that combine multiple images with meticulously designed distractors and step-by-step reasoning processes.
Link: https://arxiv.org/abs/2505.23764
Authors: Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, Jiangmiao Pang
Affiliations: Shanghai AI Laboratory; The Chinese University of Hong Kong; Zhejiang University; Tsinghua University; Shanghai Jiaotong University; University of Hong Kong; Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 34 pages. A comprehensive, fully human-curated, multi-image-based spatial intelligence benchmark with reasoning annotation for MLLMs. Project page: this https URL
Abstract:Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI’s o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering valuable insights for advancing multi-image spatial intelligence. Project page: this https URL .
[NLP-2] ZeroGUI: Automating Online GUI Learning at Zero Human Cost
[Quick Read]: This paper tackles two problems in training existing pure-vision GUI agents: heavy reliance on high-quality manual annotation, and poor adaptability to dynamic, interactive environments. The key to its solution is ZeroGUI, a scalable online learning framework that trains GUI agents without human intervention by integrating vision-language model (VLM)-based automatic task generation, automatic reward estimation, and two-stage online reinforcement learning.
Link: https://arxiv.org/abs/2505.23762
Authors: Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at this https URL.
[NLP-3] Differential Information: An Information-Theoretic Perspective on Preference Optimization
[Quick Read]: This paper addresses the incomplete theoretical justification for the log-ratio reward parameterization in Direct Preference Optimization (DPO). The key to its solution is the Differential Information Distribution (DID), which characterizes the information gained during policy updates: the authors show that when preference labels encode the differential information required to transform a reference policy into a target policy, the log-ratio reward in DPO is the uniquely optimal form for learning the target policy via preference optimization. They further analyze how the entropy of the differential information shapes the resulting policy distribution.
Link: https://arxiv.org/abs/2505.23761
Authors: Yunjae Won, Hyunji Lee, Hyeonbin Hwang, Minjoon Seo
Affiliations: KAIST AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 41 pages, 13 figures; due to the 1,920-character limitation imposed on the abstract field by arXiv, the abstract included on the arXiv page is slightly abbreviated compared to the version presented in the PDF
Abstract:Direct Preference Optimization (DPO) has become a standard technique for aligning language models with human preferences in a supervised manner. Despite its empirical success, the theoretical justification behind its log-ratio reward parameterization remains incomplete. In this work, we address this gap by utilizing the Differential Information Distribution (DID): a distribution over token sequences that captures the information gained during policy updates. First, we show that when preference labels encode the differential information required to transform a reference policy into a target policy, the log-ratio reward in DPO emerges as the uniquely optimal form for learning the target policy via preference optimization. This result naturally yields a closed-form expression for the optimal sampling distribution over rejected responses. Second, we find that the condition for preferences to encode differential information is fundamentally linked to an implicit assumption regarding log-margin ordered policies-an inductive bias widely used in preference optimization yet previously unrecognized. Finally, by analyzing the entropy of the DID, we characterize how learning low-entropy differential information reinforces the policy distribution, while high-entropy differential information induces a smoothing effect, which explains the log-likelihood displacement phenomenon. We validate our theoretical findings in synthetic experiments and extend them to real-world instruction-following datasets. Our results suggest that learning high-entropy differential information is crucial for general instruction-following, while learning low-entropy differential information benefits knowledge-intensive question answering. Overall, our work presents a unifying perspective on the DPO objective, the structure of preference data, and resulting policy behaviors through the lens of differential information.
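For readers unfamiliar with the parameterization under discussion, the log-ratio reward and the DPO objective it induces take the following standard form (a sketch of the widely used DPO formulation, with $\beta$ the KL-regularization strength and $\sigma$ the sigmoid; the notation follows common usage and is not necessarily this paper's):

```latex
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

Here $y_w$ and $y_l$ denote the chosen and rejected responses; it is the optimality of this log-ratio form of $r_\theta$ that the paper derives from the differential-information perspective.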
[NLP-4] Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
[Quick Read]: This paper investigates the limitations of vision-language models (VLMs) in interpreting and solving rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, and that demand multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic, and linguistic puns. The key to its solution is a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially dependent cues, used to evaluate current VLMs and expose their weaknesses in abstract reasoning, lateral thinking, and understanding visual metaphors.
Link: https://arxiv.org/abs/2505.23759
Authors: Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan
Affiliations: University of California, Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues (“head” over “heels”). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.
[NLP-5] DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning
[Quick Read]: This paper addresses the poor performance of large language models (LLMs) on automated theorem proving (ATP), whose core challenge is that traditional ATP approaches depend on formal proof systems poorly aligned with the informal, natural-language knowledge LLMs acquire during pre-training. The key to its solution is DeepTheorem, a comprehensive informal theorem-proving framework that strengthens LLM mathematical reasoning through natural language: it comprises a large-scale, high-quality dataset of IMO-level informal theorems and proofs, and introduces a reinforcement learning strategy (RL-Zero) tailored to informal theorem proving that leverages verifiable theorem variants to incentivize robust mathematical inference.
Link: https://arxiv.org/abs/2505.23754
Authors: Ziyin Zhang, Jiahao Xu, Zhiwei He, Tian Liang, Qiuzhi Liu, Yansi Li, Linfeng Song, Zhengwen Liang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that poorly align with LLMs’ strength derived from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework exploiting natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference. Additionally, we propose comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps. Extensive experimental analyses demonstrate DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem’s potential to fundamentally advance automated informal theorem proving and mathematical exploration.
[NLP-6] ATLAS: Learning to Optimally Memorize the Context at Test Time
[Quick Read]: This paper targets the limitations of existing sequence models on long-sequence tasks, particularly their weaknesses in long-context understanding and extrapolation to longer sequences. These shortcomings stem from three design flaws: limited memory capacity, an online update rule that optimizes memory only with respect to the most recent input, and less expressive management of fixed-size memory. The key to its solution is the ATLAS module, which strengthens long-term memory by optimizing the memory based on both current and past tokens, overcoming the online-update limitation of prior models and markedly improving performance on long-sequence tasks.
Link: https://arxiv.org/abs/2505.23735
Authors: Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni
Affiliations: Google
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, bound their applicability in longer sequences and so has motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a long-term recurrent memory module). Despite their recent success in diverse downstream tasks, they struggle in tasks that requires long context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects in their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) online nature of update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To enhance all these three aspects, we present ATLAS, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long context performance of Titans, achieving +80% accuracy in 10M context length of BABILong benchmark.
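The contrast between the online update the abstract criticizes and the context memorization ATLAS proposes can be sketched schematically (our illustrative notation, not the paper's; $\ell$ is a per-token memory loss over key-value pairs $(k_i, v_i)$ and $\gamma$ a decay weight over past tokens):

```latex
\text{online update:}\qquad
\mathcal{M}_t \;=\; \arg\min_{\mathcal{M}} \; \ell(\mathcal{M};\, k_t, v_t)

\text{context memorization:}\qquad
\mathcal{M}_t \;=\; \arg\min_{\mathcal{M}} \; \sum_{i \le t} \gamma^{\,t-i}\, \ell(\mathcal{M};\, k_i, v_i)
```

The first form sees only the last input, while the second optimizes the memory against the current and past tokens, which is the design shift the abstract describes.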
[NLP-7] Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time ICML2025
[Quick Read]: This paper addresses the challenge of aligning large language models with human preferences, which is difficult because preference feedback is inherently multifaceted. Existing approaches typically frame alignment as multi-objective optimization but overlook how humans actually make decisions; research on bounded rationality shows that human decision-making follows satisficing strategies, optimizing a primary objective while ensuring others meet acceptable thresholds. The key to its solution is SITAlign, an inference-time framework that realizes satisficing alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria.
Link: https://arxiv.org/abs/2505.23729
Authors: Mohamad Chehade, Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Dinesh Manocha, Hao Zhu, Amrit Singh Bedi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2025
Abstract:Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies-optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing based inference alignment approach. We empirically validate SITAlign’s performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness.
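Schematically, the satisficing formulation the abstract describes is an inference-time constrained problem of the following form (illustrative notation; $r_{\mathrm{primary}}$ and $r_i$ denote reward models for the primary and secondary criteria, $\tau_i$ the acceptable thresholds):

```latex
\max_{y} \;\; \mathbb{E}\left[ r_{\mathrm{primary}}(x, y) \right]
\qquad \text{s.t.} \qquad
\mathbb{E}\left[ r_i(x, y) \right] \;\ge\; \tau_i, \quad i = 1, \dots, k
```

In the PKU-SafeRLHF experiment from the abstract, for instance, the primary objective is helpfulness and the single constraint is a harmlessness threshold.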
[NLP-8] ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering
[Quick Read]: This paper addresses the over-reliance of traditional large language model (LLM)-based agents on manual prompt engineering in autonomous machine learning (ML) engineering, and their inability to adapt and improve from diverse experimental experience. The key to its solution is a learning-based agentic ML paradigm in which an agent learns through interactive experimentation on ML tasks via online reinforcement learning (RL), supported by a novel training framework with three core components: exploration-enriched fine-tuning, step-wise RL, and an agentic-ML-specific reward module, which together improve the agent's exploration, training efficiency, and feedback consistency.
Link: https://arxiv.org/abs/2505.23723
Authors: Zexi Liu, Jingyi Chai, Xinyu Zhu, Shuo Tang, Rui Ye, Bo Zhang, Lei Bai, Siheng Chen
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The emergence of large language model (LLM)-based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, most existing approaches rely heavily on manual prompt engineering, failing to adapt and optimize based on diverse experimental experiences. Focusing on this, for the first time, we explore the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). To realize this, we propose a novel agentic ML training framework with three key components: (1) exploration-enriched fine-tuning, which enables LLM agents to generate diverse actions for enhanced RL exploration; (2) step-wise RL, which enables training on a single action step, accelerating experience collection and improving training efficiency; (3) an agentic ML-specific reward module, which unifies varied ML feedback signals into consistent rewards for RL optimization. Leveraging this framework, we train ML-Agent, driven by a 7B-sized Qwen-2.5 LLM for autonomous ML. Remarkably, despite being trained on merely 9 ML tasks, our 7B-sized ML-Agent outperforms the 671B-sized DeepSeek-R1 agent. Furthermore, it achieves continuous performance improvements and demonstrates exceptional cross-task generalization capabilities.
[NLP-9] Label-Guided In-Context Learning for Named Entity Recognition
[Quick Read]: This paper addresses the suboptimal performance of in-context learning (ICL) for Named Entity Recognition (NER) when demonstrations are selected purely by semantic similarity, ignoring training labels. The key to its solution is DEER, a method that exploits token-level statistics of the training labels to improve both demonstration selection and the correction of error-prone tokens, thereby boosting ICL performance on NER.
Link: https://arxiv.org/abs/2505.23722
Authors: Fan Bai, Hamid Hassanzadeh, Ardavan Saeedi, Mark Dredze
Affiliations: Johns Hopkins University; Optum
Categories: Computation and Language (cs.CL)
Comments: Preprint
Abstract:In-context learning (ICL) enables large language models (LLMs) to perform new tasks using only a few demonstrations. In Named Entity Recognition (NER), demonstrations are typically selected based on semantic similarity to the test instance, ignoring training labels and resulting in suboptimal performance. We introduce DEER, a new method that leverages training labels through token-level statistics to improve ICL performance. DEER first enhances example selection with a label-guided, token-based retriever that prioritizes tokens most informative for entity recognition. It then prompts the LLM to revisit error-prone tokens, which are also identified using label statistics, and make targeted corrections. Evaluated on five NER datasets using four different LLMs, DEER consistently outperforms existing ICL methods and approaches the performance of supervised fine-tuning. Further analysis shows its effectiveness on both seen and unseen entities and its robustness in low-resource settings.
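To make the idea of label-guided, token-level statistics concrete, here is a minimal sketch of demonstration selection driven by how often each token appears inside an entity in the training data. This is an illustration of the general idea only, not the authors' DEER implementation; all function names and the toy data are hypothetical.

```python
from collections import Counter

def token_informativeness(train):
    """train: list of (tokens, bio_tags) pairs.
    Returns, per token, the fraction of its occurrences inside an entity,
    i.e. an estimate of P(tag != 'O' | token)."""
    inside, total = Counter(), Counter()
    for tokens, tags in train:
        for tok, tag in zip(tokens, tags):
            total[tok] += 1
            if tag != "O":
                inside[tok] += 1
    return {t: inside[t] / total[t] for t in total}

def select_demonstrations(test_tokens, train, k=2):
    """Rank training examples by how much entity-informative vocabulary
    they share with the test input, and return the top k."""
    info = token_informativeness(train)
    def score(example):
        tokens, _ = example
        shared = set(tokens) & set(test_tokens)
        return sum(info.get(t, 0.0) for t in shared)
    return sorted(train, key=score, reverse=True)[:k]

# Toy training set with BIO tags (hypothetical data).
train = [
    (["Paris", "is", "lovely"], ["B-LOC", "O", "O"]),
    (["the", "cat", "sat"], ["O", "O", "O"]),
    (["visit", "Paris", "soon"], ["O", "B-LOC", "O"]),
]
demos = select_demonstrations(["I", "flew", "to", "Paris"], train)
```

A purely semantic retriever might rank the entity-free sentence highly; weighting shared tokens by their entity informativeness instead prioritizes demonstrations whose vocabulary actually matters for entity recognition, which is the intuition behind label-guided retrieval.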
[NLP-10] Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models
[Quick Read]: This paper addresses the tendency of large language models (LLMs) to accept flawed or contradictory premises uncritically, which leads to inefficient reasoning and unreliable outputs. The key to its solution is the Premise Critique Bench (PCBench), an evaluation benchmark that systematically measures LLMs' premise critique ability by combining four error types across three difficulty levels with multi-faceted evaluation metrics. The study exposes the limited autonomous critique of current models and underscores the need to strengthen their proactive assessment of input validity.
Link: https://arxiv.org/abs/2505.23715
Authors: Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu
Affiliations: School of Artificial Intelligence, Jilin University; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China; International Center of Future Science, Jilin University
Categories: Computation and Language (cs.CL)
Comments: 31 pages, 13 figures, 15 tables
Abstract:Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the \textbfPremise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs’ reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the \textbfPremise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs. Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to detect than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs’ proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems. The code is available at this https URL.
[NLP-11] SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods ACL
[Quick Read]: This paper addresses the scarcity of high-quality evaluation datasets for low-resource languages, which holds back cross-lingual transfer research. The key to its solution is the release of sense-annotated datasets of sentences containing polysemous words, together with a semi-automatic annotation method that makes dataset creation more efficient. Word-in-Context (WiC) formatted experiments validate the datasets and highlight the importance of targeted dataset creation and evaluation for polysemy disambiguation and transfer studies in low-resource settings.
Link: https://arxiv.org/abs/2505.23714
Authors: Roksana Goworek, Harpal Karlcut, Muhammad Shezad, Nijaguna Darshana, Abhishek Mane, Syam Bondada, Raghav Sikka, Ulvi Mammadov, Rauf Allahverdiyev, Sriram Purighella, Paridhi Gupta, Muhinyia Ndegwa, Haim Dubossarsky
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 22 figures, submitted to SIGTYP 2025 workshop in ACL
Abstract:This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness is dependent on quality and suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning nine low-resource languages across diverse language families and scripts. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.
[NLP-12] SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models
[Quick Read]: This paper addresses the lack of a systematic framework for comprehensively evaluating the social reasoning ability of large language models (LLMs); existing efforts tend to oversimplify real-world scenarios, with tasks too basic to challenge advanced models. The key to its solution is SocialMaze, a benchmark purpose-built for social reasoning that systematically incorporates three core challenges, deep reasoning, dynamic interaction, and information uncertainty, across six diverse tasks in three key settings, yielding a more realistic measure of models' social reasoning abilities.
Link: https://arxiv.org/abs/2505.23713
Authors: Zixiang Xu, Yanbo Wang, Yue Huang, Jiayi Ye, Haomin Zhuang, Zirui Song, Lang Gao, Chenxi Wang, Zhaorun Chen, Yujun Zhou, Sixian Li, Wang Pan, Yue Zhao, Jieyu Zhao, Xiangliang Zhang, Xiuying Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Code available at this https URL
Abstract:Large language models (LLMs) are increasingly applied to socially grounded tasks, such as online community moderation, media content analysis, and social reasoning games. Success in these contexts depends on a model’s social reasoning ability - the capacity to interpret social contexts, infer others’ mental states, and assess the truthfulness of presented information. However, there is currently no systematic evaluation framework that comprehensively assesses the social reasoning capabilities of LLMs. Existing efforts often oversimplify real-world scenarios and consist of tasks that are too basic to challenge advanced models. To address this gap, we introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms. Both automated and human validation are used to ensure data quality. Our evaluation reveals several key insights: models vary substantially in their ability to handle dynamic interactions and integrate temporally evolving information; models with strong chain-of-thought reasoning perform better on tasks requiring deeper inference beyond surface-level cues; and model reasoning degrades significantly under uncertainty. Furthermore, we show that targeted fine-tuning on curated reasoning examples can greatly improve model performance in complex social scenarios. The dataset is publicly available at: this https URL
[NLP-13] Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation
[Quick Read]: This paper addresses a confound in final-answer-based metrics for evaluating the math-problem-solving ability of large language models (LLMs): such metrics conflate two distinct sub-skills, abstract formulation and arithmetic computation, and therefore cannot accurately assess reasoning ability. The key to its solution is a disentangled evaluation that quantifies the two skills separately, revealing that without chain-of-thought (CoT) prompting, final-answer accuracy is bottlenecked overwhelmingly by the arithmetic computation step rather than by abstract formulation. The study further finds that even without explicit reasoning steps, models handle both skills within a single forward pass via an abstract-then-compute mechanism, pointing to new directions for improving model reasoning.
Link: https://arxiv.org/abs/2505.23701
Authors: Ziling Cheng, Meng Cao, Leila Pishdad, Yanshuai Cao, Jackie Chi Kit Cheung
Affiliations: Mila – Quebec AI Institute; McGill University; Borealis AI; Canada CIFAR AI Chair
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.
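The distinction between the two sub-skills can be illustrated with a small scoring sketch: formulation is correct if the model's expression evaluates to the gold answer, while computation is correct if the model's final number matches the value of its own expression. This is an illustrative scheme in the spirit of the paper's disentangled evaluation, not its actual protocol; the function name and toy example are hypothetical.

```python
def disentangled_scores(pred_expression, pred_answer, gold_answer):
    """Score abstract formulation and arithmetic computation separately.
    Assumes pred_expression is a plain arithmetic expression string."""
    expr_value = eval(pred_expression)          # value of the model's own expression
    formulation_ok = (expr_value == gold_answer)  # did it capture the right abstraction?
    computation_ok = (pred_answer == expr_value)  # did it compute its expression correctly?
    return formulation_ok, computation_ok

# Toy word problem: 3 pencils per box, 5 boxes, plus 2 spares -> gold answer 17.
f_ok, c_ok = disentangled_scores("3 * 5 + 2", 17, 17)    # correct on both sub-skills
f_bad, c_ok2 = disentangled_scores("3 + 5 + 2", 10, 17)  # wrong abstraction, right arithmetic
```

Under a final-answer-only metric, both sub-skills collapse into a single pass/fail; scoring them separately is what lets one see which step (formulation or computation) is the bottleneck.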
[NLP-14] VF-Eval: Evaluating Multimodal LLM s for Generating Feedback on AIGC Videos ACL2025
[Quick Read]: This paper addresses the limited ability of multimodal large language models (MLLMs) to understand and evaluate AI-generated content (AIGC) videos: existing research concentrates on natural videos, and although some video-generation work relies on MLLMs for quality assessment, their capacity to interpret AIGC videos remains largely unexplored. The solution is VF-Eval, a new benchmark comprising four tasks, coherence validation, error awareness, error type detection, and reasoning evaluation, whose diverse and challenging design comprehensively measures MLLM capability on AIGC videos, exposes their limitations in understanding synthetic video, and offers guidance for improving video generation quality.
Link: https://arxiv.org/abs/2505.23693
Authors: Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao
Affiliations: School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; National University of Singapore; Zhejiang University; Yale University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ACL 2025 Main
Abstract:MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks-coherence validation, error awareness, error type detection, and reasoning evaluation-to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.
[NLP-15] Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models
[Quick Read]: This paper examines differences in language models' (LMs) generalization across training corpora, in particular whether Child-Directed Language (CDL) is more effective training data than commonly used internet-crawled data such as Wikipedia. The key to its solution is FIT-CLAMS, a new testing methodology with a frequency-controlled design that enables balanced comparisons across training corpora and demonstrates the importance of controlling for frequency effects when evaluating syntactic ability.
Link: https://arxiv.org/abs/2505.23689
Authors: Francesca Padovani, Jaap Jumelet, Yevgen Matusevych, Arianna Bisazza
Affiliations: University of Groningen
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 4 figures, 4 tables
Abstract:Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can reach similar syntactic abilities as LMs trained on much larger amounts of adult-directed written text, suggesting that CDL could provide more effective LM training material than the commonly used internet-crawled data. However, the generalizability of these results across languages, model types, and evaluation settings remains unclear. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in previous benchmarks, and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal pair evaluations and regression analysis we show that training on CDL does not yield stronger generalizations for acquiring syntax and highlight the importance of controlling for frequency effects when evaluating syntactic ability.
zh
[NLP-16] Automatic classification of stop realisation with wav2vec2.0 INTERSPEECH2025
【速读】: 该论文试图解决语音数据中许多可变语音现象缺乏自动标注工具的问题,特别是辅音爆发(stop burst)的存在性分类问题。解决方案的关键在于利用预训练的自监督模型wav2vec2.0,该模型在语音分类任务中表现出色,并能潜在地编码精细的语音信息,从而实现对英语和日语中辅音爆发的高精度自动分类。
链接: https://arxiv.org/abs/2505.23688
作者: James Tanner,Morgan Sonderegger,Jane Stuart-Smith,Jeff Mielke,Tyler Kendall
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted for Interspeech 2025. 5 pages, 3 figures
Abstract:Modern phonetic research regularly makes use of automatic tools for the annotation of speech data; however, few tools exist for the annotation of many variable phonetic phenomena. At the same time, pre-trained self-supervised models, such as wav2vec2.0, have been shown to perform well at speech classification tasks and latently encode fine-grained phonetic information. We demonstrate that wav2vec2.0 models can be trained to automatically classify stop burst presence with high accuracy in both English and Japanese, robust across both finely-curated and unprepared speech corpora. Patterns of variability in stop realisation are replicated with the automatic annotations, and closely follow those of manual annotations. These results demonstrate the potential of pre-trained speech models as tools for the automatic annotation and processing of speech corpus data, enabling researchers to 'scale up' the scope of phonetic research with relative ease.
zh
[NLP-17] GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
【速读】: 该论文试图解决如何评估语言模型在开发高性能软件方面的能力问题,具体表现为通过自动化管道生成和执行性能测试,以分析代码库提交历史并识别出102个具有挑战性的优化任务。解决方案的关键在于构建一个基准测试框架(benchmark),其中包含具体的代码库和性能测试作为精确规范,要求智能体提升运行时效率,并与专家开发者优化结果进行对比。该方法不仅提供了量化评估手段,还揭示了当前SWE-Agents在低级语言处理、优化策略选择及瓶颈定位等方面的局限性。
链接: https://arxiv.org/abs/2505.23671
作者: Manish Shetty,Naman Jain,Jinjian Liu,Vijay Kethanaboyina,Koushik Sen,Ion Stoica
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Website: this https URL
Abstract:Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models’ capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories to identify 102 challenging optimization tasks across 10 codebases, spanning diverse domains and programming languages. An agent is provided with a codebase and performance test as a precise specification, and tasked to improve the runtime efficiency, which is measured against the expert developer optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving less than 5% success rate, with limited improvements even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, practicing lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.
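下面用一个极简的 Python 片段示意 GSO 这类基准的核心测量逻辑:对同一工作负载分别计时"优化前"与"优化后"的提交,并以运行时加速比作为评分依据。该片段为假设性示意,并非 GSO 官方实现;`slow_join`/`fast_join` 仅是演示用的玩具优化任务。

```python
import time

def measure(fn, *args, repeats=3):
    """对工作负载函数取多次计时中的最优值,降低噪声影响。"""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def slow_join(parts):
    """基线提交:逐个拼接字符串(示意"优化前"的实现)。"""
    s = ""
    for p in parts:
        s = s + p
    return s

def fast_join(parts):
    """专家/智能体提交:一次性 join(示意"优化后"的实现)。"""
    return "".join(parts)

parts = ["x"] * 20_000
t_slow = measure(slow_join, parts)
t_fast = measure(fast_join, parts)
print(f"speedup: {t_slow / t_fast:.1f}x")  # 加速比即这类基准评分的核心量
```

真实的 GSO 任务中,性能测试由自动化管道从提交历史中生成,智能体的得分以专家开发者提交所达到的加速为参照。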
zh
[NLP-18] LoLA: Low-Rank Linear Attention With Sparse Caching
【速读】: 该论文旨在解决基于Transformer的大语言模型在长序列推理中面临的二次复杂度问题,以及线性注意力方法在准确逼近softmax注意力上的不足。其关键解决方案是提出LoLA(Low-rank Linear Attention with sparse caching),通过将额外的键值对分别存储于局部滑动窗口、稀疏全局缓存和线性注意力的循环隐藏状态中,有效缓解了“记忆冲突”问题,并显著提升了模型在长上下文任务中的性能。
链接: https://arxiv.org/abs/2505.23666
作者: Luke McDermott,Robert W. Heath Jr.,Rahul Parhi
机构: UC San Diego(加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Transformer-based large language models suffer from quadratic complexity at inference on long sequences. Linear attention methods are efficient alternatives, however, they fail to provide an accurate approximation of softmax attention. By additionally incorporating sliding window attention into each linear attention head, this gap can be closed for short context-length tasks. Unfortunately, these approaches cannot recall important information from long contexts due to “memory collisions”. In this paper , we propose LoLA: Low-rank Linear Attention with sparse caching. LoLA separately stores additional key-value pairs that would otherwise interfere with past associative memories. Moreover, LoLA further closes the gap between linear attention models and transformers by distributing past key-value pairs into three forms of memory: (i) recent pairs in a local sliding window; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. As an inference-only strategy, LoLA enables pass-key retrieval on up to 8K context lengths on needle-in-a-haystack tasks from RULER. It boosts the accuracy of the base subquadratic model from 0.6% to 97.4% at 4K context lengths, with a 4.6x smaller cache than that of Llama-3.1 8B. LoLA demonstrates strong performance on zero-shot commonsense reasoning tasks among 1B and 8B parameter subquadratic models. Finally, LoLA is an extremely lightweight approach: Nearly all of our results can be reproduced on a single consumer GPU.
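按摘要所述,LoLA 将历史键值对分流到三类记忆中。下面是一个纯 Python 的玩具示意(非官方实现;`interference` 判据、窗口与缓存容量均为假设),仅说明"滑动窗口 → 稀疏全局缓存 / 循环隐藏状态"的路由思路。

```python
from collections import deque

class LoLACacheSketch:
    """LoLA 三类记忆的玩具示意:
    (i) 近期键值对放入局部滑动窗口;
    (ii) "难以记忆"(会与联想记忆冲突)的键值对进入小容量稀疏缓存;
    (iii) 其余键值对折叠进线性注意力的循环隐藏状态(此处用一个
          累加值代替真实的状态更新)。"""

    def __init__(self, window=4, cache_size=2):
        self.window = deque(maxlen=window)   # (i) 滑动窗口
        self.cache = {}                      # (ii) key -> value 稀疏缓存
        self.cache_size = cache_size
        self.state = 0.0                     # (iii) 循环状态的替身

    def insert(self, key, value, interference):
        evicted = None
        if len(self.window) == self.window.maxlen:
            evicted = self.window[0]         # 即将被窗口挤出的最旧键值对
        self.window.append((key, value))
        if evicted is not None:
            k, v = evicted
            # 高冲突的键值对保留在稀疏缓存中,其余并入循环状态
            if interference(k) and len(self.cache) < self.cache_size:
                self.cache[k] = v
            else:
                self.state += v
```

论文中"冲突"由注意力记忆的干扰程度决定,这里用一个任意的布尔函数代替,仅为展示数据流向。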
zh
[NLP-19] ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions
【速读】: 该论文试图解决现有评估方法在短文本上下文中测试大型语言模型(Large Language Models, LLMs)的工具使用能力,而未能充分反映模型在真实长期交互中的表现问题。解决方案的关键在于提出ToolHaystack基准,该基准通过包含多个任务执行上下文和连续对话中的现实噪声,模拟长期交互场景,从而评估模型在保持上下文连贯性和处理干扰方面的能力。
链接: https://arxiv.org/abs/2505.23662
作者: Beong-woo Kwak,Minju Kim,Dongha Lim,Hyungjoo Chae,Dongjin Kang,Sunghwan Kim,Dongil Yang,Jinyoung Yeo
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL)
备注: Our code and data are available at this https URL
Abstract:Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing the tool use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task-execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often significantly struggle in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.
zh
[NLP-20] Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成过程中容易出现幻觉(hallucination)的问题,尤其是在长上下文场景下。其解决方案的关键在于提出一种名为主动层对比解码(Active Layer-Contrastive Decoding, ActLCD)的新颖解码策略,该策略通过将解码过程建模为序列决策问题,并利用基于奖励的分类器引导强化学习策略,在超越单个词元(token)层面优化事实准确性。
链接: https://arxiv.org/abs/2505.23657
作者: Hongxiang Zhang,Hao Chen,Tianyi Zhang,Muhao Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent decoding methods improve the factuality of large language models~(LLMs) by refining how the next token is selected during generation. These methods typically operate at the token level, leveraging internal representations to suppress superficial patterns. Nevertheless, LLMs remain prone to hallucinations, especially over longer contexts. In this paper, we propose Active Layer-Contrastive Decoding (ActLCD), a novel decoding strategy that actively decides when to apply contrasting layers during generation. By casting decoding as a sequential decision-making problem, ActLCD employs a reinforcement learning policy guided by a reward-aware classifier to optimize factuality beyond the token level. Our experiments demonstrate that ActLCD surpasses state-of-the-art methods across five benchmarks, showcasing its effectiveness in mitigating hallucinations in diverse generation scenarios.
zh
[NLP-21] ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs
【速读】: 该论文试图解决在抽象摘要生成过程中,大型语言模型(Large Language Models, LLMs)对论点角色(argument roles)的保留不足问题,尤其是在高风险领域如法律文档中的关键信息可能被遗漏。解决方案的关键在于引入Argument Representation Coverage (ARC),这是一个用于衡量LLMs生成摘要中捕捉显著论点能力的框架,通过该框架分析不同模型在法律意见和科学文章等以论点角色为核心领域的摘要表现,并揭示位置偏差和角色特定偏好对关键论点覆盖的影响,从而强调需要更注重论点意识的摘要策略。
链接: https://arxiv.org/abs/2505.23654
作者: Mohamed Elaraby,Diane Litman
机构: University of Pittsburgh (匹兹堡大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Integrating structured information has long improved the quality of abstractive summarization, particularly in retaining salient content. In this work, we focus on a specific form of structure: argument roles, which are crucial for summarizing documents in high-stakes domains such as law. We investigate whether instruction-tuned large language models (LLMs) adequately preserve this information. To this end, we introduce Argument Representation Coverage (ARC), a framework for measuring how well LLM-generated summaries capture salient arguments. Using ARC, we analyze summaries produced by three open-weight LLMs in two domains where argument roles are central: long legal opinions and scientific articles. Our results show that while LLMs cover salient argument roles to some extent, critical information is often omitted in generated summaries, particularly when arguments are sparsely distributed throughout the input. Further, we use ARC to uncover behavioral patterns – specifically, how the positional bias of LLM context windows and role-specific preferences impact the coverage of key arguments in generated summaries, emphasizing the need for more argument-aware summarization strategies.
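作为示意,可以用一个非常粗糙的代理指标来直观理解"论点覆盖率"的含义:统计摘要中完整出现的显著论点比例。注意这只是假设性的简化版本,ARC 的实际度量远比下面的词面匹配精细。

```python
def argument_coverage(summary, salient_arguments):
    """玩具版覆盖率:一个显著论点的所有词都出现在摘要里才算"覆盖"。
    返回被覆盖论点占全部显著论点的比例(0 到 1)。"""
    summary_words = set(summary.lower().split())
    covered = sum(
        1 for arg in salient_arguments
        if set(arg.lower().split()) <= summary_words
    )
    return covered / len(salient_arguments)
```

基于这类逐论点的覆盖统计,即可进一步分析摘要遗漏与论点在原文中位置(例如上下文窗口的位置偏差)之间的关系。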
zh
[NLP-22] Are Reasoning Models More Prone to Hallucination?
【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在事实性任务中是否更容易产生幻觉(hallucination)的问题。其解决方案的关键在于从三个角度深入分析:首先,通过全面评估发现,LRMs经过冷启动监督微调(cold start supervised fine-tuning)和可验证奖励强化学习(verifiable reward RL)的后训练流程能够有效缓解幻觉;其次,通过行为分析揭示了两种影响模型事实性的关键认知行为——错误重复(Flaw Repetition)和思考-答案不匹配(Think-Answer Mismatch);最后,从模型不确定性(model uncertainty)的角度探讨了幻觉产生的机制,发现幻觉的增加通常与模型不确定性与事实准确性之间的不对齐有关。
链接: https://arxiv.org/abs/2505.23646
作者: Zijun Yao,Yantao Liu,Yanxu Chen,Jianhui Chen,Junfeng Fang,Lei Hou,Juanzi Li,Tat-Seng Chua
机构: Tsinghua University (清华大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Recently evolved large reasoning models (LRMs) show powerful performance in solving complex tasks with long chain-of-thought (CoT) reasoning capability. As these LRMs are mostly developed by post-training on formal reasoning tasks, whether they generalize the reasoning capability to help reduce hallucination in fact-seeking tasks remains unclear and debated. For instance, DeepSeek-R1 reports increased performance on SimpleQA, a fact-seeking benchmark, while OpenAI-o3 observes even more severe hallucination. This discrepancy naturally raises the following research question: Are reasoning models more prone to hallucination? This paper addresses the question from three perspectives. (1) We first conduct a holistic evaluation of hallucination in LRMs. Our analysis reveals that a full post-training pipeline with cold-start supervised fine-tuning (SFT) and verifiable-reward RL generally alleviates hallucination in LRMs. In contrast, both distillation alone and RL training without cold-start fine-tuning introduce more nuanced hallucinations. (2) To explore why different post-training pipelines alter their impact on hallucination in LRMs, we conduct behavior analysis. We characterize two critical cognitive behaviors that directly affect the factuality of an LRM: Flaw Repetition, where surface-level reasoning attempts repeatedly follow the same underlying flawed logic, and Think-Answer Mismatch, where the final answer fails to faithfully match the previous CoT process. (3) Further, we investigate the mechanism behind the hallucination of LRMs from the perspective of model uncertainty. We find that increased hallucination of LRMs is usually associated with misalignment between model uncertainty and factual accuracy. Our work provides an initial understanding of hallucination in LRMs.
zh
[NLP-23] Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education
【速读】: 该论文试图解决在特殊教育等敏感环境中准确评估学生抑郁问题的挑战,传统标准化问卷无法全面反映学生的实际情况,而自动化方法在处理丰富的学生叙述时又缺乏来自教师共情连接的关键个性化洞察。解决方案的关键在于引入一种名为“Human Empathy as Encoder (HEAE)”的人本主义AI框架,该框架通过将学生叙述文本与教师生成的9维“共情向量”(Empathy Vector, EV)相结合,将隐性共情洞察结构化为AI输入,从而增强而非取代人类判断,实现了人机协同的透明且社会负责的抑郁严重程度评估。
链接: https://arxiv.org/abs/2505.23631
作者: Boning Zhao
机构: New York University (纽约大学); Tandon School of Engineering (坦登工程学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, 6 figures. Under review
Abstract:Assessing student depression in sensitive environments like special education is challenging. Standardized questionnaires may not fully reflect students’ true situations. Furthermore, automated methods often falter with rich student narratives, lacking the crucial, individualized insights stemming from teachers’ empathetic connections with students. Existing methods often fail to address this ambiguity or effectively integrate educator understanding. To address these limitations by fostering a synergistic human-AI collaboration, this paper introduces Human Empathy as Encoder (HEAE), a novel, human-centered AI framework for transparent and socially responsible depression severity assessment. Our approach uniquely integrates student narrative text with a teacher-derived, 9-dimensional “Empathy Vector” (EV), its dimensions guided by the PHQ-9 framework, to explicitly translate tacit empathetic insight into a structured AI input, enhancing rather than replacing human judgment. Rigorous experiments optimized the multimodal fusion, text representation, and classification architecture, achieving 82.74% accuracy for 7-level severity classification. This work demonstrates a path toward more responsible and ethical affective computing by structurally embedding human empathy.
zh
[NLP-24] GeNRe: A French Gender-Neutral Rewriting System Using Collective Nouns ACL2025
【速读】: 该论文试图解决自然语言处理(Natural Language Processing, NLP)领域中文本数据存在的性别偏见问题,特别是由于使用阳性泛指(masculine generics)所导致的性别刻板印象强化。解决方案的关键在于开发一种法语性别中性重写系统(GeNRe),该系统利用集体名词(collective nouns)进行性别中性化处理,并结合基于规则的系统(Rule-Based System, RBS)以及微调的语言模型,以实现对性别化形式的自动检测与替换。此外,研究还探索了指令调优模型在提升系统性能方面的潜力。
链接: https://arxiv.org/abs/2505.23630
作者: Enzo Doyen,Amalia Todirascu
机构: LiLPa, University of Strasbourg (LiLPa,斯特拉斯堡大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 Findings; 9 pages, 2 figures
Abstract:A significant portion of the textual data used in the field of Natural Language Processing (NLP) exhibits gender biases, particularly due to the use of masculine generics (masculine words that are supposed to refer to mixed groups of men and women), which can perpetuate and amplify stereotypes. Gender rewriting, an NLP task that involves automatically detecting and replacing gendered forms with neutral or opposite forms (e.g., from masculine to feminine), can be employed to mitigate these biases. While such systems have been developed in a number of languages (English, Arabic, Portuguese, German, French), automatic use of gender neutralization techniques (as opposed to inclusive or gender-switching techniques) has only been studied for English. This paper presents GeNRe, the very first French gender-neutral rewriting system using collective nouns, which are gender-fixed in French. We introduce a rule-based system (RBS) tailored for the French language alongside two fine-tuned language models trained on data generated by our RBS. We also explore the use of instruct-based models to enhance the performance of our other systems and find that Claude 3 Opus combined with our dictionary achieves results close to our RBS. Through this contribution, we hope to promote the advancement of gender bias mitigation techniques in NLP for French.
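GeNRe 规则系统的核心思路,是把法语中的阳性泛指名词短语替换为性别固定的集体名词。下面的微型词典与替换函数仅为示意:映射条目是举例,真实系统的词典规模大得多,且还需改写动词、形容词等的一致性(此处未处理)。

```python
import re

# 示意性的"阳性泛指 -> 集体名词"映射(条目为举例)
COLLECTIVES = {
    "les professeurs": "le corps professoral",
    "les lecteurs": "le lectorat",
    "les clients": "la clientèle",
}

def neutralize(sentence):
    """按词典逐条做大小写不敏感的字面替换(未处理句法一致性)。"""
    for masc, collective in COLLECTIVES.items():
        sentence = re.sub(re.escape(masc), collective, sentence,
                          flags=re.IGNORECASE)
    return sentence
```

例如 `neutralize("J'écris pour les lecteurs")` 会返回 "J'écris pour le lectorat";论文中另以微调语言模型和指令模型来弥补纯规则方法在一致性改写上的不足。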
zh
[NLP-25] AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
【速读】: 该论文试图解决传统知识图谱构建中依赖预定义模式(schema)的局限性,从而实现完全自动化的知识图谱构造。其解决方案的关键在于提出AutoSchemaKG框架,该框架利用大语言模型同时抽取知识三元组并直接从文本中推导出全面的模式,通过概念化将实例组织到语义类别中,无需人工干预即可实现高语义对齐的模式生成。
链接: https://arxiv.org/abs/2505.23628
作者: Jiaxin Bai,Wei Fan,Qi Hu,Qing Zong,Chunyang Li,Hong Ting Tsang,Hongyu Luo,Yauwai Yim,Haoyu Huang,Xiao Zhou,Feng Qin,Tianshi Zheng,Xi Peng,Xin Yao,Huiwen Yang,Leijie Wu,Yi Ji,Gong Zhang,Renhai Chen,Yangqiu Song
机构: CSE, HKUST (计算机科学与工程系,香港科技大学); CSE, CUHK (计算机科学与工程系,香港中文大学); Theory Lab, Huawei (理论实验室,华为)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, preprint, code: this https URL
Abstract:We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.
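摘要中的"概念化"一步可以用下面的玩具代码直观理解:给定抽取出的三元组,用一个概念化函数(论文中由大语言模型承担,这里用假设的查表函数代替)把实体归入语义类别,从而动态归纳出模式,而无需预先定义 schema。

```python
def induce_schema(triples, conceptualize):
    """把三元组中的实体按概念化结果分组,得到"类别 -> 实例集合"的
    动态模式。conceptualize 在 AutoSchemaKG 中是 LLM 调用,此处为
    任意的实体 -> 类别函数(玩具替身)。"""
    schema = {}
    for head, rel, tail in triples:
        for ent in (head, tail):
            schema.setdefault(conceptualize(ent), set()).add(ent)
    return schema

# 玩具数据:两个"首都"三元组,概念化函数用查表代替 LLM
toy_types = {"Paris": "City", "France": "Country",
             "Berlin": "City", "Germany": "Country"}
schema = induce_schema(
    [("Paris", "capital_of", "France"),
     ("Berlin", "capital_of", "Germany")],
    toy_types.get,
)
```

真实系统在数十亿节点规模上执行同样的"抽取 + 概念化"流程,并报告归纳出的模式与人工模式有约 95% 的语义对齐。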
zh
[NLP-26] Characterizing the Expressivity of Transformer Language Models
【速读】: 该论文试图解决Transformer-based语言模型(LMs)的理论表达能力尚未完全理解的问题,特别是现有研究中使用的理想化模型与实际Transformer架构之间的差异。其解决方案的关键在于对固定精度的Transformer进行精确表征,该模型采用了严格的未来屏蔽和软注意力机制,更贴近实际应用。研究证明,这类模型的表达能力恰好等价于仅包含一个时间算子(过去算子)的线性时序逻辑片段,并通过将其与形式语言理论、自动机理论和代数中的经典类进行关联,构建了一个统一的理论框架。
链接: https://arxiv.org/abs/2505.23623
作者: Jiaoda Li,Ryan Cotterell
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Transformer-based language models (LMs) have achieved widespread empirical success, but their theoretical expressive power remains only partially understood. Prior work often relies on idealized models with assumptions – such as arbitrary numerical precision and hard attention – that diverge from real-world transformers. In this work, we provide an exact characterization of fixed-precision transformers with strict future masking and soft attention, an idealization that more closely mirrors practical implementations. We show that these models are precisely as expressive as a specific fragment of linear temporal logic that includes only a single temporal operator: the past operator. We further relate this logic to established classes in formal language theory, automata theory, and algebra, yielding a rich and unified theoretical framework for understanding transformer expressivity. Finally, we present empirical results that align closely with our theory: transformers trained on languages within their theoretical capacity generalize perfectly over lengths, while they consistently fail to generalize on languages beyond it.
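作为背景示意,线性时序逻辑中一个标准的"过去"模态(读作"曾经")的语义可写成如下形式。注意这只是该类算子的常见定义,论文所刻画片段中算子的确切形式以原文为准:

```latex
% 词 w 在位置 i 处满足 "曾经 φ",当且仅当存在不晚于 i 的位置满足 φ
w, i \models \mathsf{P}\,\varphi
\quad\Longleftrightarrow\quad
\exists\, j \le i :\; w, j \models \varphi
```

论文的结论即:带严格未来屏蔽与软注意力的固定精度 Transformer,其表达能力恰好等于只含这一类过去算子的线性时序逻辑片段。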
zh
[NLP-27] Table-R1: Inference-Time Scaling for Table Reasoning
【速读】: 该论文试图解决在表格推理任务中实现推理时的模型扩展问题,即如何在不增加训练成本的情况下提升模型在推理阶段的表现。其解决方案的关键在于提出两种后训练策略:从前沿模型的推理轨迹进行知识蒸馏以及基于可验证奖励的强化学习(RLVR)。知识蒸馏利用由DeepSeek-R1生成的大规模推理轨迹数据对LLM进行微调,而RLVR则通过任务特定的可验证奖励函数和GRPO算法优化模型,最终得到Table-R1-Zero模型,在保持较低参数量的同时实现了与高性能模型相当甚至更优的性能。
链接: https://arxiv.org/abs/2505.23621
作者: Zheyuan Yang,Lyuhao Chen,Arman Cohan,Yilun Zhao
机构: Yale NLP Lab (耶鲁自然语言处理实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as emergence of essential table reasoning skills during RL training.
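RLVR 中"可验证奖励"的最小形态,是对短答案做归一化之后的精确匹配。下面是一个假设性的简化示例(并非论文中任务特定的奖励函数),演示这类奖励为何无需训练奖励模型即可直接验证。

```python
import re

def normalize(ans):
    """小写、去空白,并剥离逗号/货币/百分号,使 '1,024' 与 '1024' 等价;
    能解析为数值的答案统一成浮点字符串。"""
    ans = ans.strip().lower()
    ans = re.sub(r"[\s,$%]", "", ans)
    try:
        return str(float(ans))
    except ValueError:
        return ans

def exact_match_reward(prediction, gold):
    """短答案表格问答的二值可验证奖励:归一化后完全一致得 1,否则得 0。"""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0
```

在 GRPO 这类算法中,此奖励对同一问题的一组采样回答逐个打分,组内的相对优势再用于更新策略。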
zh
[NLP-28] Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering
【速读】: 该论文旨在解决小参数量语言模型(LM)在真实软件工程任务中表现不佳的问题,尤其是在处理SWE-Bench等基准测试时,当模型参数少于100B时性能显著下降。现有方法主要依赖于高质量数据的监督微调(SFT),但其数据收集成本高昂。论文提出的解决方案是进化式测试时缩放(EvoScale),其关键在于将生成过程视为一个进化过程,通过迭代选择和变异优化输出,从而在减少样本数量的同时提升解的质量。此外,通过强化学习(RL)训练模型自我进化,使其在不依赖外部验证器的情况下自主提升生成结果的评分,最终在SWE-Bench-Verified上实现了32B参数模型与超过100B参数模型相当或更优的性能。
链接: https://arxiv.org/abs/2505.23604
作者: Guangtao Zeng,Maohao Shen,Delin Chen,Zhenting Qi,Subhro Das,Dan Gutfreund,David Cox,Gregory Wornell,Wei Lu,Zhang-Wei Hong,Chuang Gan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when model parameters are less than 100B. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead from repeatedly sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using a few samples. Code, data, and models will be fully open-sourced.
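EvoScale 把生成视为"选择 + 变异"的进化过程。下面用一个一维玩具问题示意该循环的骨架:保留高分候选、变异补齐种群、迭代后输出最优解。评分函数与变异算子均为假设,仅说明输出分布如何被逐步推向高分区域。

```python
import random

def evoscale_sketch(initial, mutate, score, generations=3,
                    population=4, keep=2):
    """玩具版选择-变异循环:每代保留得分最高的 keep 个候选,
    对幸存者随机变异以补齐种群,最后返回得分最高的输出。"""
    pool = list(initial)
    for _ in range(generations):
        pool.sort(key=score, reverse=True)
        survivors = pool[:keep]                  # 选择(含精英保留)
        pool = survivors + [mutate(random.choice(survivors))
                            for _ in range(population - keep)]
    return max(pool, key=score)

# 玩具任务:把整数"补丁质量"进化到目标值 100 附近
random.seed(0)
score = lambda x: -abs(100 - x)
mutate = lambda x: x + random.choice([-3, -1, 1, 3])
best = evoscale_sketch([10, 40, 60, 80], mutate, score, generations=30)
```

论文中评分由验证器承担,并进一步用强化学习训练模型"自我进化",在推理时不再依赖外部验证器。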
zh
[NLP-29] Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
【速读】: 该论文旨在解决规则驱动的强化学习(Rule-based Reinforcement Learning, RL)在多模态大语言模型(Multimodal Large Language Models, MLLMs)中的应用问题,特别是针对感知密集型任务所面临的独特挑战。其解决方案的关键在于采用拼图任务作为结构化的实验框架,通过微调使MLLMs从随机猜测逐步提升至接近完美的准确率,并验证了视觉任务中泛化能力的提升与任务配置密切相关。此外,研究还揭示了MLLMs在无需显式推理的情况下也能实现有效学习与泛化,同时指出复杂推理模式可能是预存的而非涌现的,以及强化学习相比监督微调在泛化性能上的优势。
链接: https://arxiv.org/abs/2505.23590
作者: Zifu Wang,Junyi Zhu,Bo Tang,Zhiyu Li,Feiyu Xiong,Jiaqian Yu,Matthew B. Blaschko
机构: ESAT-PSI, KU Leuven (ESAT-PSI, 比利时鲁汶大学); University of Science and Technology of China (中国科学技术大学); Institute for Advanced Algorithms Research, Shanghai (上海先进算法研究院); Memory Tensor, Shanghai (上海Memory Tensor); Samsung R&D Institute China, Beijing (三星中国研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL using jigsaw puzzles as a structured experimental framework, revealing several key findings. \textitFirstly, we find that MLLMs, initially performing near to random guessing on simple puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. \textitSecondly, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. \textitThirdly, MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. \textitFourthly, we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. \textitFinally, our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and an initial SFT cold start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece of jigsaw to the larger puzzle of collective understanding rule-based visual RL and its potential in multimodal learning. The code is available at: \hrefthis https URLthis https URL.
zh
[NLP-30] On-Policy RL with Optimal Reward Baseline
【速读】: 该论文旨在解决大规模语言模型对齐和推理任务中强化学习算法存在的训练不稳定性和计算效率低的问题。当前的强化学习算法由于宽松的在线策略约束和辅助模型导致的计算效率低下,常常面临训练不稳定的问题。论文提出的解决方案是基于最优奖励基线的在线策略强化学习(On-Policy RL with Optimal reward baseline, OPO),其关键在于强调精确的在线策略训练以稳定训练过程并增强探索能力,同时引入理论上的最优奖励基线以最小化梯度方差,从而提升训练效率和模型输出的多样性。
链接: https://arxiv.org/abs/2505.23585
作者: Yaru Hao,Li Dong,Xun Wu,Shaohan Huang,Zewen Chi,Furu Wei
机构: Microsoft Research (微软研究院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning algorithms are fundamental to align large language models with human preferences and to enhance their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO introduces the optimal reward baseline that theoretically minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is provided at this https URL.
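OPO 中"最优奖励基线"的思想可以追溯到策略梯度文献里的经典方差最小化结果:用梯度范数平方对各样本奖励加权,而不是对奖励简单取平均。下面的片段演示这一经典公式的计算(这是一般性结论的示意,OPO 论文中基线的具体形式请以原文为准)。

```python
def optimal_baseline(rewards, grad_sq_norms):
    """经典的方差最小化基线:
    b* = E[||∇log π||² · R] / E[||∇log π||²]
    即按每条轨迹的梯度范数平方对奖励加权平均。"""
    num = sum(w * r for w, r in zip(grad_sq_norms, rewards))
    den = sum(grad_sq_norms)
    return num / den

def advantages(rewards, grad_sq_norms):
    """各样本的优势 = 奖励 - 最优基线。"""
    b = optimal_baseline(rewards, grad_sq_norms)
    return [r - b for r in rewards]
```

当所有权重相等时该基线退化为普通的平均奖励基线;权重差异越大,加权基线相对简单平均的方差收益越明显。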
zh
[NLP-31] Evaluating AI capabilities in detecting conspiracy theories on YouTube
【速读】: 该论文试图解决在线平台中识别YouTube上阴谋论视频的问题,特别是在面对有害内容(如虚假信息和阴谋论)时的检测需求。解决方案的关键在于利用开源权重的大规模语言模型(LLMs),包括纯文本模型和多模态模型,进行零样本设置下的分类任务,并与微调的RoBERTa基线模型进行性能比较。研究发现,文本基础的LLMs在召回率上表现优异但精确度较低,而多模态模型在整合视觉数据方面效果有限,表明当前LLM方法在准确性和鲁棒性方面仍存在改进空间。
链接: https://arxiv.org/abs/2505.23570
作者: Leonardo La Rocca,Francesco Corso,Francesco Pierri
机构: Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano(Via Ponzio 34, Milan, Italy); CENTAI(Corso Inghilterra 4, Turin, Italy)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注: Submitted for review to OSNEM Special Issue of April 2025
Abstract:As a leading online platform with a vast global audience, YouTube’s extensive reach also makes it susceptible to hosting harmful content, including disinformation and conspiracy theories. This study explores the use of open-weight Large Language Models (LLMs), both text-only and multimodal, for identifying conspiracy theory videos shared on YouTube. Leveraging a labeled dataset of thousands of videos, we evaluate a variety of LLMs in a zero-shot setting and compare their performance to a fine-tuned RoBERTa baseline. Results show that text-based LLMs achieve high recall but lower precision, leading to increased false positives. Multimodal models lag behind their text-only counterparts, indicating limited benefits from visual data integration. To assess real-world applicability, we evaluate the most accurate models on an unlabeled dataset, finding that RoBERTa achieves performance close to LLMs with a larger number of parameters. Our work highlights the strengths and limitations of current LLM-based approaches for online harmful content detection, emphasizing the need for more precise and robust systems.
zh
[NLP-32] Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
【速读】: 该论文试图解决大型语言模型在使用强化学习(Reinforcement Learning, RL)提升推理能力时面临的两个主要问题:基于token级别的方法(如PPO)虽然提供细粒度的优势信号,但由于难以训练准确的critic模型导致估计不准确;而基于轨迹级别的方法(如GRPO)仅依赖最终奖励的粗粒度优势信号,导致信用分配不精确。解决方案的关键在于提出一种新的强化学习框架——Segment Policy Optimization (SPO),该框架采用中间粒度的段级别优势估计,在保证更精确信用分配的同时,减少估计点数量,从而在无需critic模型的情况下实现更准确的优势估计。SPO通过三个核心组件实现这一目标:灵活的段划分、准确的段优势估计以及基于段优势的策略优化,其中包含一种新颖的概率掩码策略。
链接: https://arxiv.org/abs/2505.23564
作者: Yiran Guo,Lijie Xu,Jie Liu,Dan Ye,Shuang Qiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: Token-level methods (e.g., PPO) aim to provide the fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving 6 - 12 percentage point improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving 7 - 11 percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at this https URL.
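SPO 的段级优势估计可以概括为:在每个切分点处用蒙特卡洛滚动估计状态价值,段的优势即相邻两个价值估计之差,因而无需 critic 模型。下面是一个高度简化的示意(切分点与滚动回报数据均为假设)。

```python
def mc_value(prefix, rollout_returns):
    """V(prefix) 的蒙特卡洛估计:从该前缀出发的滚动回报的均值。"""
    returns = rollout_returns[prefix]
    return sum(returns) / len(returns)

def segment_advantages(cutpoints, rollout_returns):
    """第 k 段的优势 = V(段结束处的前缀) - V(段开始处的前缀),
    每个价值都由蒙特卡洛估计得到,不需要训练 critic。"""
    values = [mc_value(c, rollout_returns) for c in cutpoints]
    return [values[i + 1] - values[i] for i in range(len(values) - 1)]

# 玩具数据:三个切分点("" -> "a" -> "ab")各有两条滚动回报
rollout_returns = {"": [0.2, 0.4], "a": [0.5, 0.5], "ab": [1.0, 0.8]}
advs = segment_advantages(["", "a", "ab"], rollout_returns)
```

SPO-chain 与 SPO-tree 的区别主要在于切分方式与滚动回报的组织(链式 vs 树式),但优势的定义同为上面的相邻价值之差。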
zh
[NLP-33] Understanding Refusal in Language Models with Sparse Autoencoders
【速读】: 该论文试图解决对齐语言模型中拒绝行为(refusal)内部机制的理解问题,特别是如何识别和干预导致拒绝行为的潜在特征。解决方案的关键在于使用稀疏自编码器(sparse autoencoders)来发现那些因果性地介导拒绝行为的潜在特征,并通过在两个开源聊天模型上进行干预实验,验证这些特征对生成结果的影响,从而实现对拒绝行为在激活层面上的细粒度分析。
链接: https://arxiv.org/abs/2505.23556
作者: Wei Jie Yeo,Nirmalendu Prakash,Clement Neo,Roy Ka-Wei Lee,Erik Cambria,Ranjan Satapathy
机构: Nanyang Technological University (南洋理工大学); Singapore University of Technology and Design (新加坡科技设计大学); Digital Trust Centre (数字信任中心); Institute of High Performance Computing (高性能计算研究所); Agency for Science, Technology and Research (科技研究局)
类目: Computation and Language (cs.CL)
备注:
Abstract:Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating upstream-downstream latent relationship and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing generalization for linear probes to out-of-distribution adversarial samples in classification tasks. We open source our code in this https URL.
zh
[NLP-34] Translation in the Wild
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在无需专门翻译目标函数训练的情况下,为何仍能表现出卓越的翻译能力这一问题。其解决方案的关键在于提出一种假设,即LLMs的翻译能力来源于两种不同类型的预训练数据,这些数据可能被模型以不同的方式内化。论文进一步探讨了如何通过实证方法检验这一“双重性”假设,并分析其对重新理解深度学习时代人类与机器翻译的启示。
链接: https://arxiv.org/abs/2505.23548
作者: Yuri Balashov
机构: University of Georgia (佐治亚大学)
类目: Computation and Language (cs.CL)
备注: 4 figures
Abstract:Large Language Models (LLMs) excel in translation among other things, demonstrating competitive performance for many language pairs in zero- and few-shot settings. But unlike dedicated neural machine translation models, LLMs are not trained on any translation-related objective. What explains their remarkable translation abilities? Are these abilities grounded in “incidental bilingualism” (Briakou et al. 2023) in training data? Does instruction tuning contribute to it? Are LLMs capable of aligning and leveraging semantically identical or similar monolingual contents from different corners of the internet that are unlikely to fit in a single context window? I offer some reflections on this topic, informed by recent studies and growing user experience. My working hypothesis is that LLMs’ translation abilities originate in two different types of pre-training data that may be internalized by the models in different ways. I discuss the prospects for testing the “duality” hypothesis empirically and its implications for reconceptualizing translation, human and machine, in the age of deep learning.
zh
[NLP-35] Probability-Consistent Preference Optimization for Enhanced LLM Reasoning ACL2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在数学推理能力提升过程中,现有偏好优化方法过于依赖结果导向的评估标准(如答案正确性或一致性),而忽视了响应内部逻辑一致性的缺陷。其解决方案的关键在于提出一种名为概率一致性偏好优化(Probability-Consistent Preference Optimization, PCPO)的新框架,该框架通过建立两个定量指标进行偏好选择:(1)表面层的答案正确性,以及(2)跨响应的内在标记级概率一致性。
链接: https://arxiv.org/abs/2505.23540
作者: Yunqiao Yang,Houxing Ren,Zimu Lu,Ke Wang,Weikang Shi,Aojun Zhou,Junting Pan,Mingjie Zhan,Hongsheng Li
机构: CUHK MMLab (香港中文大学多媒体实验室); SenseTime Research (商汤科技研究部); CPII under InnoHK (InnoHK创新科技研究院); Shanghai AI Laboratory (上海人工智能实验室); Zhiyuan College, SJTU (上海交通大学智源学院)
类目: Computation and Language (cs.CL)
备注: 14 pages, to be published in ACL 2025 findings
Abstract:Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at this https URL.
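为帮助理解"答案正确性 + token 级概率一致性"双指标的偏好选择,下面给出一个极简示意(打分方式为笔者假设的代理指标,并非论文中 PCPO 的具体定义):

```python
def mean_token_logprob(token_logprobs):
    """响应的平均 token 对数概率,作为内部一致性的粗略代理。"""
    return sum(token_logprobs) / len(token_logprobs)

def select_preference_pair(responses):
    """responses: [(is_correct, token_logprobs), ...]。
    玩具规则:在答案正确的响应中选平均对数概率最高者作为 chosen,
    在答案错误的响应中选最低者作为 rejected。"""
    correct = [r for r in responses if r[0]]
    incorrect = [r for r in responses if not r[0]]
    chosen = max(correct, key=lambda r: mean_token_logprob(r[1]))
    rejected = min(incorrect, key=lambda r: mean_token_logprob(r[1]))
    return chosen, rejected
```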
zh
[NLP-36] CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification ACL2025 SEMEVAL-2025
【速读】: 该论文旨在解决企业ESG(Environmental, Social, and Governance)报告中承诺验证的问题,具体包括承诺识别、支持证据评估、清晰度评价和验证时间确定四个子任务。其解决方案的关键在于探索三种模型架构:第一种基于ESG-BERT并引入任务特定的分类器头;第二种在该架构中融入针对每个子任务的语言学特征;第三种则采用结合子任务的模型,包含基于注意力的序列池化、带有文档元数据增强的Transformer表示以及多目标学习。实验结果表明,该综合子任务方法在ML-Promise数据集上取得了优于基线的性能,突显了语言学特征提取、注意力池化和多目标学习在承诺验证任务中的有效性。
链接: https://arxiv.org/abs/2505.23538
作者: Nawar Turk,Eeham Khan,Leila Kosseim
机构: Concordia University (康考迪亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to SemEval-2025 Task 6 (ACL 2025)
Abstract:This paper presents our approach to the SemEval-2025 Task 6 (PromiseEval), which focuses on verifying promises in corporate ESG (Environmental, Social, and Governance) reports. We explore three model architectures to address the four subtasks of promise identification, supporting evidence assessment, clarity evaluation, and verification timing. Our first model utilizes ESG-BERT with task-specific classifier heads, while our second model enhances this architecture with linguistic features tailored for each subtask. Our third approach implements a combined subtask model with attention-based sequence pooling, transformer representations augmented with document metadata, and multi-objective learning. Experiments on the English portion of the ML-Promise dataset demonstrate progressive improvement across our models, with our combined subtask approach achieving a leaderboard score of 0.5268, outperforming the provided baseline of 0.5227. Our work highlights the effectiveness of linguistic feature extraction, attention pooling, and multi-objective learning in promise verification tasks, despite challenges posed by class imbalance and limited training data.
zh
[NLP-37] Domain-Aware Tensor Network Structure Search
【速读】: 该论文试图解决张量网络结构搜索(Tensor Network Structure Search, TN-SS)问题,即在高维数据中寻找最优张量网络(TN)结构的挑战。现有最先进的算法计算成本高,且未能利用真实张量数据中的领域信息,缺乏所识别结构的透明性。该论文提出的解决方案关键在于引入一种名为tnLLM的框架,该框架结合了数据的领域信息,并利用大语言模型(Large Language Models, LLMs)的推理能力,直接预测合适的张量网络结构。其核心创新在于设计了一个领域感知的提示流程,使LLM能够根据张量模式之间的实际关系推断出合适的结构,从而在减少函数评估次数的同时保持性能,并生成领域相关的解释。
链接: https://arxiv.org/abs/2505.23537
作者: Giorgos Iacovides,Wuyang Zhou,Chao Li,Qibin Zhao,Danilo Mandic
机构: Imperial College London (帝国理工学院); RIKEN AIP (理化学研究所人工智能项目)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Tensor networks (TNs) provide efficient representations of high-dimensional data, yet identification of the optimal TN structures, the so called tensor network structure search (TN-SS) problem, remains a challenge. Current state-of-the-art (SOTA) algorithms are computationally expensive as they require extensive function evaluations, which is prohibitive for real-world applications. In addition, existing methods ignore valuable domain information inherent in real-world tensor data and lack transparency in their identified TN structures. To this end, we propose a novel TN-SS framework, termed the tnLLM, which incorporates domain information about the data and harnesses the reasoning capabilities of large language models (LLMs) to directly predict suitable TN structures. The proposed framework involves a domain-aware prompting pipeline which instructs the LLM to infer suitable TN structures based on the real-world relationships between tensor modes. In this way, our approach is capable of not only iteratively optimizing the objective function, but also generating domain-aware explanations for the identified structures. Experimental results demonstrate that tnLLM achieves comparable TN-SS objective function values with much fewer function evaluations compared to SOTA algorithms. Furthermore, we demonstrate that the LLM-enabled domain information can be used to find good initializations in the search space for sampling-based SOTA methods to accelerate their convergence while preserving theoretical performance guarantees.
zh
[NLP-38] Identity resolution of software metadata using Large Language Models
【速读】: 该论文试图解决研究软件元数据的身份识别问题,这是构建一致的研究软件集合的关键步骤,而该集合是OpenEBench平台中Software Observatory的核心组成部分。解决方案的关键在于评估指令微调的大语言模型在软件元数据身份解析任务中的表现,并引入基于协议的代理方法以实现高置信度的自动化决策,从而应对元数据异构性和规模带来的挑战。
链接: https://arxiv.org/abs/2505.23500
作者: Eva Martín del Pico,Josep Lluís Gelpí,Salvador Capella-Gutiérrez
机构: Barcelona Supercomputing Center (巴塞罗那超级计算中心); University of Barcelona (巴塞罗那大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注:
Abstract:Software is an essential component of research. However, little attention has been paid to it compared with that paid to research data. Recently, there has been an increase in efforts to acknowledge and highlight the importance of software in research activities. Structured metadata from platforms like this http URL, Bioconductor, and Galaxy ToolShed offers valuable insights into research software in the Life Sciences. Although originally intended to support discovery and integration, this metadata can be repurposed for large-scale analysis of software practices. However, its quality and completeness vary across platforms, reflecting diverse documentation practices. To gain a comprehensive view of software development and sustainability, consolidating this metadata is necessary, but requires robust mechanisms to address its heterogeneity and scale. This article presents an evaluation of instruction-tuned large language models for the task of software metadata identity resolution, a critical step in assembling a cohesive collection of research software. Such a collection is the reference component for the Software Observatory at OpenEBench, a platform that aggregates metadata to monitor the FAIRness of research software in the Life Sciences. We benchmarked multiple models against a human-annotated gold standard, examined their behavior on ambiguous cases, and introduced an agreement-based proxy for high-confidence automated decisions. The proxy achieved high precision and statistical robustness, while also highlighting the limitations of current models and the broader challenges of automating semantic judgment in FAIR-aligned software metadata across registries and repositories. 
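摘要中提到的"基于一致性的代理"(agreement-based proxy)可以用如下极简逻辑示意(阈值与接口均为笔者假设):

```python
from collections import Counter

def agreement_proxy(votes, threshold=1.0):
    """votes: 多个模型对同一对软件元数据是否为同一实体的判断。
    仅当多数票占比达到阈值时才接受自动决策,否则交由人工复核(返回 None)。"""
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= threshold else None
```

默认阈值 1.0 要求所有模型完全一致,对应高精度、低覆盖的自动化决策。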
zh
[NLP-39] Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking
【速读】: 该论文试图解决当前知识图谱问答(Knowledge Graph Question Answering, KGQA)基准数据集存在的质量问题,如不准确或不完整的答案标注、构造不良的问题以及过时或不一致的知识。其解决方案的关键在于提出KGQAGen框架,该框架结合结构化知识定位、大语言模型(LLM)引导生成和符号验证,系统性地解决上述问题,从而生成具有挑战性和可验证性的问答实例。
链接: https://arxiv.org/abs/2505.23495
作者: Liangliang Zhang,Zhuorui Jiang,Hongliang Chi,Haoyang Chen,Mohammed Elkoumy,Fali Wang,Qiong Wu,Zhengyi Zhou,Shirui Pan,Suhang Wang,Yao Ma
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages
Abstract:Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57 %. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
zh
[NLP-40] Spoken Language Modeling with Duration-Penalized Self-Supervised Units INTERSPEECH2025
【速读】: 该论文试图解决语音语言模型(Spoken Language Models, SLMs)中码本大小与音素单位粗粒度(即持续时间)之间的相互作用问题,以及这种相互作用对模型性能的影响。解决方案的关键在于使用简单的持续时间惩罚动态规划(Duration-Penalized Dynamic Programming, DPDP)方法,通过调整码本大小和单位粗粒度来优化SLM的性能。研究发现,在特定任务中,如句子重合成、词汇和句法语言建模,较粗粒度的单位在低比特率下能够提升模型准确性,而DPDP方法则提供了一种高效获取有益粗粒度单位的方式。
链接: https://arxiv.org/abs/2505.23494
作者: Nicol Visser,Herman Kamper
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2025
Abstract:Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren’t always better, but that DPDP is a simple and efficient way to obtain coarser units for the tasks where they are beneficial.
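DPDP 的核心思想是在动态规划切分时给每个段额外加一个固定惩罚,惩罚越大得到的单元越粗。下面是一个一维玩具实现(码本与平方误差代价均为示意假设,并非论文中对自监督语音表示的实际用法):

```python
def dpdp_segment(frames, codebook, penalty):
    """把一维序列切成若干连续段,每段用最接近的码本值标注;
    penalty 为每新开一段的固定代价,penalty 越大段越少(单元越粗)。"""
    n = len(frames)
    INF = float("inf")
    best = [INF] * (n + 1)      # best[i]: frames[:i] 的最小切分代价
    best[0] = 0.0
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            seg = frames[j:i]
            # 整段用同一个码本单元时的最小平方误差
            cost = min(sum((f - c) ** 2 for f in seg) for c in codebook)
            total = best[j] + cost + penalty
            if total < best[i]:
                best[i], back[i] = total, j
    bounds, i = [], n           # 回溯得到各段边界 (start, end)
    while i > 0:
        bounds.append((back[i], i))
        i = back[i]
    return list(reversed(bounds))
```

penalty 很小时每帧几乎可以自成一段;penalty 增大后相邻相似帧被合并为更粗的单元,对应摘要中"时长惩罚控制单元粗细"的作用。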
zh
[NLP-41] R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation
【速读】: 该论文旨在解决当前文本到图像(T2I)生成模型在推理能力方面的不足问题,特别是在处理需要常识、逻辑、因果关系等复杂推理任务时表现有限。其解决方案的关键在于引入R2I-Bench,一个专门设计的全面基准测试平台,用于严格评估基于推理的T2I生成能力,并通过R2IScore这一基于实例特定推理导向问题的评分机制,从文本-图像对齐、推理准确性和图像质量三个维度进行细粒度评估。
链接: https://arxiv.org/abs/2505.23493
作者: Kaijie Chen,Zihao Lin,Zhiyang Xu,Ying Shen,Yuguang Yao,Joy Rimchala,Jiaxin Zhang,Lifu Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Project Page: this https URL
Abstract:Reasoning is a fundamental capability often required in real-world text-to-image (T2I) generation, e.g., generating "a bitten apple that has been left in the air for more than a week" necessitates understanding temporal decay and commonsense concepts. While recent T2I models have made impressive progress in producing photorealistic images, their reasoning capability remains underdeveloped and insufficiently evaluated. To bridge this gap, we introduce R2I-Bench, a comprehensive benchmark specifically designed to rigorously assess reasoning-driven T2I generation. R2I-Bench comprises meticulously curated data instances, spanning core reasoning categories, including commonsense, mathematical, logical, compositional, numerical, causal, and concept mixing. To facilitate fine-grained evaluation, we design R2IScore, a QA-style metric based on instance-specific, reasoning-oriented evaluation questions that assess three critical dimensions: text-image alignment, reasoning accuracy, and image quality. Extensive experiments with 16 representative T2I models, including a strong pipeline-based framework that decouples reasoning and generation using the state-of-the-art language and image generation models, demonstrate consistently limited reasoning performance, highlighting the need for more robust, reasoning-aware architectures in the next generation of T2I systems. Project Page: this https URL
zh
[NLP-42] Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt
【速读】: 该论文试图解决大型语言模型在复杂任务中出现的“过度思考”(overthinking)问题,即模型在得出正确答案后仍进行不必要的推理步骤,导致计算资源浪费和推理效率降低。解决方案的关键在于从自我怀疑(self-doubt)的角度进行量化分析,并提出一种简单的提示方法,通过引导模型质疑输入问题的有效性,从而减少对输入问题的过度依赖,避免自我怀疑,最终降低推理步骤数量和答案长度。
链接: https://arxiv.org/abs/2505.23480
作者: Keqin Peng,Liang Ding,Yuanxin Ouyang,Meng Fang,Dacheng Tao
机构: Beihang University (北京航空航天大学); The University of Sydney (悉尼大学); University of Liverpool (利物浦大学); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks, largely due to the adoption of Long Chain-of-Thought (Long CoT) reasoning. However, they often exhibit overthinking – performing unnecessary reasoning steps even after arriving at the correct answer. Prior work has largely focused on qualitative analyses of overthinking through sample-based observations of long CoTs. In contrast, we present a quantitative analysis of overthinking from the perspective of self-doubt, characterized by excessive token usage devoted to re-verifying already-correct answer. We find that self-doubt significantly contributes to overthinking. In response, we introduce a simple and effective prompting method to reduce the model’s over-reliance on input questions, thereby avoiding self-doubt. Specifically, we first prompt the model to question the validity of the input question, and then respond concisely based on the outcome of that evaluation. Experiments on three mathematical reasoning tasks and four datasets with missing premises demonstrate that our method substantially reduces answer length and yields significant improvements across nearly all datasets upon 4 widely-used RLLMs. Further analysis demonstrates that our method effectively minimizes the number of reasoning steps and reduces self-doubt.
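论文方法的核心是一个两步提示:先让模型检查输入问题的有效性,再基于该判断简洁作答。下面给出一个示意性的提示模板构造函数(具体措辞为笔者假设,并非论文原文):

```python
def build_prompt(question):
    """两步提示:先检查问题有效性(如是否缺少前提),再简洁作答。"""
    return (
        "Before solving, briefly check whether the question is valid and "
        "contains all the premises needed to answer it.\n"
        "- If a premise is missing, state what is missing and stop.\n"
        "- Otherwise, answer concisely without re-verifying an answer "
        "you have already confirmed.\n\n"
        f"Question: {question}"
    )
```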
zh
[NLP-43] Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在神经外科类考试题目中的表现及其对干扰性陈述的鲁棒性问题。研究的关键在于通过使用来自神经外科协会自评考试(CNS-SANS)的2,904道题目对28个LLMs进行综合评估,并引入一种干扰框架,以测试模型在面对包含多义词的无关干扰语句时的表现退化情况。研究发现,尽管部分模型能够达到通过标准,但其性能在受到干扰时显著下降,表明当前LLMs在临床应用中仍需提升对文本内干扰的抗性。
链接: https://arxiv.org/abs/2505.23477
作者: Krithik Vishwanath,Anton Alyakin,Mrigayu Ghosh,Jin Vivian Lee,Daniel Alexander Alber,Karl L. Sangwon,Douglas Kondziolka,Eric Karl Oermann
机构: 未知
类目: Computation and Language (cs.CL)
备注: 22 pages, 3 main figures, 3 supplemental figures
Abstract:The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models’ (LLMs) neurosurgical knowledge. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS. Additionally, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in non-clinical contexts to determine the extent to which such distractions degrade model performance on standard medical benchmarks. 6 of the 28 tested LLMs achieved board-passing outcomes, with the top-performing models scoring over 15.7% above the passing threshold. When exposed to distractions, accuracy across various model architectures was significantly reduced-by as much as 20.4%-with one model failing that had previously passed. Both general-purpose and medical open-source models experienced greater performance declines compared to proprietary variants when subjected to the added distractors. While current LLMs demonstrate an impressive ability to answer neurosurgery board-like exam questions, their performance is markedly vulnerable to extraneous, distracting information. These findings underscore the critical need for developing novel mitigation strategies aimed at bolstering LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.
zh
[NLP-44] Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns
【速读】: 该论文旨在解决现有Process Reward Models (PRMs)在面对多种推理模式(如分解、转换、归纳等)时,缺乏系统性评估的问题。当前的基准测试主要关注步骤正确性,而忽视了在不同推理模式下的全面评估。论文提出的解决方案是引入Socratic-PRMBench,这是一个新的基准测试集,涵盖六种推理模式下的2995条存在缺陷的推理路径,以系统评估PRMs在不同推理模式下的表现。其关键在于通过多样化的推理模式构建全面的评估框架,从而揭示现有PRMs在多模式推理评估中的显著不足。
链接: https://arxiv.org/abs/2505.23474
作者: Xiang Li,Haiyang Yu,Xinghua Zhang,Ziyang Huang,Shizhu He,Kang Liu,Jun Zhao,Fei Huang,Yongbin Li
机构: Tongyi Lab, Alibaba Group(通义实验室,阿里巴巴集团); Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, potentially suffering from errors under various reasoning patterns. Therefore, PRMs are required to identify errors under various reasoning patterns during the reasoning process. However, existing benchmarks mainly focus on evaluating PRMs with stepwise correctness, ignoring a systematic evaluation of PRMs under various reasoning patterns. To mitigate this gap, we introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns, including Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in conducting evaluations on reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.
zh
[NLP-45] UAQFact: Evaluating Factual Knowledge Utilization of LLM s on Unanswerable Questions ACL2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理无答案问题(Unanswerable Questions, UAQ)时缺乏事实知识支持的问题,这限制了对LLMs利用其事实知识能力的评估。解决方案的关键在于引入一个名为UAQFact的新无答案问题数据集,该数据集为双语并包含辅助事实知识,来源于知识图谱(Knowledge Graph)。基于UAQFact,研究进一步定义了两个新任务,以分别衡量LLMs利用内部和外部事实知识的能力。
链接: https://arxiv.org/abs/2505.23461
作者: Chuanyuan Tan,Wenbiao Shao,Hao Xiong,Tong Zhu,Zhenhua Liu,Kai Shi,Wenliang Chen
机构: School of Computer Science and Technology, Soochow University (计算机科学与技术学院,苏州大学); OPPO AI Center (OPPO人工智能中心)
类目: Computation and Language (cs.CL)
备注: ACL 2025 Findings
Abstract:Handling unanswerable questions (UAQ) is crucial for LLMs, as it helps prevent misleading responses in complex situations. While previous studies have built several datasets to assess LLMs’ performance on UAQ, these datasets lack factual knowledge support, which limits the evaluation of LLMs’ ability to utilize their factual knowledge when handling UAQ. To address the limitation, we introduce a new unanswerable question dataset UAQFact, a bilingual dataset with auxiliary factual knowledge created from a Knowledge Graph. Based on UAQFact, we further define two new tasks to measure LLMs’ ability to utilize internal and external factual knowledge, respectively. Our experimental results across multiple LLM series show that UAQFact presents significant challenges, as LLMs do not consistently perform well even when they have factual knowledge stored. Additionally, we find that incorporating external knowledge may enhance performance, but LLMs still cannot make full use of the knowledge which may result in incorrect responses.
zh
[NLP-46] Rethinking Regularization Methods for Knowledge Graph Completion
【速读】: 该论文试图解决知识图谱补全(Knowledge Graph Completion, KGC)中模型泛化能力不足和性能上限受限的问题。其解决方案的关键在于重新思考正则化方法在KGC中的应用,通过精心设计的正则化策略不仅缓解过拟合和减少方差,还使模型突破原有性能上限。论文提出了一种新颖的稀疏正则化方法(Sparse-Regularization, SPR),其核心思想是将基于秩的可选择稀疏性概念引入KGC正则化器,通过对嵌入向量中具有显著特征的组件进行选择性惩罚,从而有效忽略贡献较小且可能仅代表噪声的组件。
链接: https://arxiv.org/abs/2505.23442
作者: Linyu Li,Zhi Jin,Yuanpeng He,Dongming Jin,Haoran Duan,Zhengwei Tao,Xuan Zhang,Jiandong Li
机构: Peking University (北京大学); Wuhan University (武汉大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Knowledge graph completion (KGC) has attracted considerable attention in recent years because it is critical to improving the quality of knowledge graphs. Researchers have continuously explored various models. However, most previous efforts have neglected to take advantage of regularization from a deeper perspective and therefore have not been used to their full potential. This paper rethinks the application of regularization methods in KGC. Through extensive empirical studies on various KGC models, we find that carefully designed regularization not only alleviates overfitting and reduces variance but also enables these models to break through the upper bounds of their original performance. Furthermore, we introduce a novel sparse-regularization method that embeds the concept of rank-based selective sparsity into the KGC regularizer. The core idea is to selectively penalize those components with significant features in the embedding vector, thus effectively ignoring many components that contribute little and may only represent noise. Various comparative experiments on multiple datasets and multiple models show that the SPR regularization method is better than other regularization methods and can enable the KGC model to further break through the performance margin.
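SPR 的"基于秩的可选择稀疏性"大意是只惩罚嵌入向量中幅值最大的若干分量,忽略贡献很小、可能只是噪声的分量。下面是一个示意实现(top-k 平方惩罚的具体形式为笔者假设,并非论文中 SPR 的精确定义):

```python
def spr_penalty(embedding, k, weight=1.0):
    """只对幅值最大的 k 个分量施加 L2 惩罚,其余分量不计入正则项。"""
    top = sorted((abs(x) for x in embedding), reverse=True)[:k]
    return weight * sum(v ** 2 for v in top)
```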
zh
[NLP-47] The Warmup Dilemma: How Learning Rate Strategies Impact Speech-to-Text Model Convergence
【速读】: 该论文旨在解决大规模语音到文本(Speech-to-Text, S2T)训练中学习率(Learning Rate, LR)调度策略不足的问题,特别是在使用复杂Transformer变体如Conformer或Branchformer时,传统的简单学习率衰减方法效果有限。论文提出的关键解决方案是采用双线性预热(double linear warmup)策略,即在第一阶段将学习率增加到一个极小值,随后在第二阶段提升至较高值,以改善训练过程中的收敛性。研究揭示了大规模S2T训练需要次指数级学习率预热,并指出预热阶段较高的学习率虽能加速初始收敛,但对最终性能无显著提升。
链接: https://arxiv.org/abs/2505.23420
作者: Marco Gaido,Sara Papi,Luisa Bentivogli,Alessio Brutti,Mauro Cettolo,Roberto Gretter,Marco Matassoni,Mohamed Nabih,Matteo Negri
机构: Fondazione Bruno Kessler(布鲁诺·凯塞尔基金会)
类目: Computation and Language (cs.CL)
备注: Accepted to IWSLT 2025
Abstract:Training large-scale models presents challenges not only in terms of resource requirements but also in terms of their convergence. For this reason, the learning rate (LR) is often decreased when the size of a model is increased. Such a simple solution is not enough in the case of speech-to-text (S2T) trainings, where evolved and more complex variants of the Transformer architecture – e.g., Conformer or Branchformer – are used in light of their better performance. As a workaround, OWSM designed a double linear warmup of the LR, increasing it to a very small value in the first phase before updating it to a higher value in the second phase. While this solution worked well in practice, it was not compared with alternative solutions, nor was the impact on the final performance of different LR warmup schedules studied. This paper fills this gap, revealing that i) large-scale S2T trainings demand a sub-exponential LR warmup, and ii) a higher LR in the warmup phase accelerates initial convergence, but it does not boost final performance.
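摘要中描述的 OWSM 式"双线性预热"可以写成如下调度函数(阶段步数与学习率取值仅为示意,非论文中的实际超参数):

```python
def double_linear_warmup(step, lr_small, lr_peak, steps1, steps2):
    """阶段一:0 -> lr_small 线性上升;阶段二:lr_small -> lr_peak 线性上升;
    之后保持 lr_peak(实际训练中通常再接衰减,此处从略)。"""
    if step <= steps1:
        return lr_small * step / steps1
    if step <= steps1 + steps2:
        frac = (step - steps1) / steps2
        return lr_small + (lr_peak - lr_small) * frac
    return lr_peak
```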
[NLP-48] SWE-bench Goes Live!
【Quick Read】: This paper targets the limitations of existing issue-resolving benchmarks such as SWE-bench: stale releases, narrow repository coverage, and heavy reliance on manual effort, which hurt scalability and invite overfitting. The key solution is SWE-bench-Live, a live-updatable benchmark of 1,319 tasks derived from real GitHub issues created since 2024 across 93 repositories, each with a dedicated Docker image for reproducible execution. Its core innovation is \method, an automated curation pipeline that automates everything from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates.
Link: https://arxiv.org/abs/2505.23419
Authors: Linghao Zhang,Shilin He,Chaoyun Zhang,Yu Kang,Bowen Li,Chengxing Xie,Junhao Wang,Maoquan Wang,Yufan Huang,Shengyu Fu,Elsie Nallipogu,Qingwei Lin,Yingnong Dang,Saravan Rajmohan,Dongmei Zhang
Affiliations: Microsoft; Shanghai Artificial Intelligence Laboratory
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Homepage: \url{ this https URL }, Code: \url{ this https URL }, Dataset: \url{ this https URL }
Abstract:The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is \method, an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.
[NLP-49] From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs
【Quick Read】: This paper addresses the factuality gap between known and unknown knowledge that arises when fine-tuning pre-trained language models. The key finding is that this gap can be mitigated at inference time, either under out-of-distribution (OOD) settings or with appropriate in-context learning (ICL) prompts such as few-shot learning and Chain of Thought (CoT). From a knowledge-graph perspective, the authors theoretically show that the test-time prompt can diminish or even overshadow the influence of the fine-tuning data, becoming the dominant factor in knowledge extraction.
Link: https://arxiv.org/abs/2505.23410
Authors: Xuan Gong,Hanbo Huang,Shiyu Liang
Affiliations: Tongji University; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: The code of this paper will be released soon
Abstract:Factual knowledge extraction aims to explicitly extract knowledge parameterized in pre-trained language models for application in downstream tasks. While prior work has been investigating the impact of supervised fine-tuning data on the factuality of large language models (LLMs), its mechanism remains poorly understood. We revisit this impact through systematic experiments, with a particular focus on the factuality gap that arises when fine-tuning on known versus unknown knowledge. Our findings show that this gap can be mitigated at the inference stage, either under out-of-distribution (OOD) settings or by using appropriate in-context learning (ICL) prompts (i.e., few-shot learning and Chain of Thought (CoT)). We prove this phenomenon theoretically from the perspective of knowledge graphs, showing that the test-time prompt may diminish or even overshadow the impact of fine-tuning data and play a dominant role in knowledge extraction. Ultimately, our results shed light on the interaction between finetuning data and test-time prompt, demonstrating that ICL can effectively compensate for shortcomings in fine-tuning data, and highlighting the need to reconsider the use of ICL prompting as a means to evaluate the effectiveness of fine-tuning data selection methods.
[NLP-50] Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models
【Quick Read】: This paper studies adversarial attacks on large language models (LLMs) via jailbreaking, i.e., techniques that circumvent built-in safety and ethical constraints and thereby undermine model reliability. The key solution is an adaptive jailbreaking framework based on the semantic understanding capabilities of LLMs, which classifies models into Type I and Type II by comprehension ability and designs tailored strategies for each category to exploit their vulnerabilities, markedly improving attack success rates. Experiments report a 98.9% success rate on GPT-4o.
Link: https://arxiv.org/abs/2505.23404
Authors: Mingyu Yu,Wei Wang,Yanjie Wei,Sujuan Qin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Adversarial attacks on Large Language Models (LLMs) via jailbreaking techniques (methods that circumvent their built-in safety and ethical constraints) have emerged as a critical challenge in AI security. These attacks compromise the reliability of LLMs by exploiting inherent weaknesses in their comprehension capabilities. This paper investigates the efficacy of jailbreaking strategies that are specifically adapted to the diverse levels of understanding exhibited by different LLMs. We propose the Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models, a novel framework that classifies LLMs into Type I and Type II categories according to their semantic comprehension abilities. For each category, we design tailored jailbreaking strategies aimed at leveraging their vulnerabilities to facilitate successful attacks. Extensive experiments conducted on multiple LLMs demonstrate that our adaptive strategy markedly improves the success rate of jailbreaking. Notably, our approach achieves an exceptional 98.9% success rate in jailbreaking GPT-4o (29 May 2025 release)
[NLP-51] Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation
【Quick Read】: This paper addresses human label variation, i.e., plausible disagreement among annotators labeling the same data instance, aiming to use LLM-generated chains of thought (CoTs) to better understand it and to align model predictions with human label distributions. The key solution is an LLM-based pipeline enriched with linguistically grounded discourse segmenters that extracts supporting and opposing statements for each answer option from CoTs with improved accuracy, providing implicit rationales for each option. It further introduces a rank-based human label variation (HLV) evaluation framework that prioritizes answer ranking over exact scores, better matching how human label distributions are compared.
Link: https://arxiv.org/abs/2505.23368
Authors: Beiduo Chen,Yang Janet Liu,Anna Korhonen,Barbara Plank
Affiliations: LMU Munich; Munich Center for Machine Learning (MCML); University of Cambridge
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 7 figures
Abstract:The recent rise of reasoning-tuned Large Language Models (LLMs)–which generate chains of thought (CoTs) before giving the final answer–has attracted significant attention and offers new opportunities for gaining insights into human label variation, which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but typically adopt a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option, before generating the answers. We thus propose a novel LLM-based pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which instead favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.
[NLP-52] Discriminative Policy Optimization for Token-Level Reward Models ICML2025
【Quick Read】: This paper targets two problems: outcome reward models (ORMs) provide supervision too coarse to advance generative language models on complex reasoning tasks, and existing step- and token-level process reward models (PRMs) suffer from the conflict between generative language modeling and reward modeling, causing instability and inaccurate credit assignment. The key solution decouples reward modeling from language generation and derives a token-level reward model via discriminative policy optimization: the Q-function Reward Model (Q-RM), which explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations.
Link: https://arxiv.org/abs/2505.23363
Authors: Hongzhan Chen,Tao Yang,Shiping Gao,Ruijun Chen,Xiaojun Quan,Hongtao Tian,Ting Yao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: ICML 2025
Abstract:Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH. Code and data are available at this https URL.
[NLP-53] Nosey: Open-source hardware for acoustic nasalance INTERSPEECH2025
【Quick Read】: This paper tackles the high cost and limited flexibility of conventional nasalance measurement devices by introducing Nosey (Nasalance Open Source Estimation sYstem), a low-cost, customizable, 3D-printed nasalance recording system. The key to the solution is an open-source hardware design that keeps the system affordable and customizable while remaining comparable to a commercial device, particularly in the magnitude of contrasts between phonological environments.
Link: https://arxiv.org/abs/2505.23339
Authors: Maya Dewhurst,Jack Collins,Justin J. H. Lo,Roy Alderton,Sam Kirkham
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2025
Abstract:We introduce Nosey (Nasalance Open Source Estimation sYstem), a low-cost, customizable, 3D-printed system for recording acoustic nasalance data that we have made available as open-source hardware (this http URL). We first outline the motivations and design principles behind our hardware nasalance system, and then present a comparison between Nosey and a commercial nasalance device. Nosey shows consistently higher nasalance scores than the commercial device, but the magnitude of contrast between phonological environments is comparable between systems. We also review ways of customizing the hardware to facilitate testing, such as comparison of microphones and different construction materials. We conclude that Nosey is a flexible and cost-effective alternative to commercial nasometry devices and propose some methodological considerations for its use in data collection.
[NLP-54] Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors
【Quick Read】: This position paper pushes back against polarized views of large language model (LLM) capabilities, i.e., treating LLMs either as mere "stochastic parrots" or as systems with "emergent" advanced reasoning. Its key proposal is "context-directed extrapolation": LLMs extrapolate from priors in their training data, with a mechanism akin to in-context learning targeting the appropriate information to extrapolate from. On this view, LLM reasoning capabilities are predictable and controllable, are not human-like high-level cognition, and do not scale without limit under additional training.
Link: https://arxiv.org/abs/2505.23323
Authors: Harish Tayyar Madabushi,Melissa Torgbi,Claire Bonial
Affiliations: University of Bath; Army Research Lab
Categories: Computation and Language (cs.CL)
Comments:
Abstract:In this position paper we raise critical awareness of a realistic view of LLM capabilities that eschews extreme alternative views that LLMs are either “stochastic parrots” or in possession of “emergent” advanced reasoning capabilities, which, due to their unpredictable emergence, constitute an existential threat. Our middle-ground view is that LLMs extrapolate from priors from their training data, and that a mechanism akin to in-context learning enables the targeting of the appropriate information from which to extrapolate. We call this “context-directed extrapolation.” Under this view, substantiated though existing literature, while reasoning capabilities go well beyond stochastic parroting, such capabilities are predictable, controllable, not indicative of advanced reasoning akin to high-level cognitive capabilities in humans, and not infinitely scalable with additional training. As a result, fears of uncontrollable emergence of agency are allayed, while research advances are appropriately refocused on the processes of context-directed extrapolation and how this interacts with training data to produce valuable capabilities in LLMs. Future work can therefore explore alternative augmenting techniques that do not rely on inherent advanced reasoning in LLMs.
[NLP-55] Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO
【Quick Read】: This paper addresses likelihood underdetermination in contrastive alignment methods: optimizing an LLM to match relative preferences decreases the absolute likelihoods of example responses, causing generations to drift from expected patterns. The key step is revisiting Direct Preference Optimization (DPO) and showing that its loss theoretically admits a decomposed reformulation; reinstating the complete regularizer that the standard implementation implicitly oversimplifies resolves the underdetermination. Building on this, the authors propose PRO (PRoximalized PReference Optimization), a unified alignment method that handles diverse feedback types and eliminates likelihood underdetermination via an efficient approximation of the complete regularizer.
Link: https://arxiv.org/abs/2505.23316
Authors: Kaiyang Guo,Yinchuan Li,Zhitang Chen
Affiliations: Huawei Noah’s Ark Lab
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Direct alignment methods typically optimize large language models (LLMs) by contrasting the likelihoods of preferred versus dispreferred responses. While effective in steering LLMs to match relative preference, these methods are frequently noted for decreasing the absolute likelihoods of example responses. As a result, aligned models tend to generate outputs that deviate from the expected patterns, exhibiting reward-hacking effect even without a reward model. This undesired consequence exposes a fundamental limitation in contrastive alignment, which we characterize as likelihood underdetermination. In this work, we revisit direct preference optimization (DPO) – the seminal direct alignment method – and demonstrate that its loss theoretically admits a decomposed reformulation. The reformulated loss not only broadens applicability to a wider range of feedback types, but also provides novel insights into the underlying cause of likelihood underdetermination. Specifically, the standard DPO implementation implicitly oversimplifies a regularizer in the reformulated loss, and reinstating its complete version effectively resolves the underdetermination issue. Leveraging these findings, we introduce PRoximalized PReference Optimization (PRO), a unified method to align with diverse feedback types, eliminating likelihood underdetermination through an efficient approximation of the complete regularizer. Comprehensive experiments show the superiority of PRO over existing methods in scenarios involving pairwise, binary and scalar feedback.
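For reference, the seminal DPO loss that the paper decomposes can be written in its standard form (from the original DPO formulation, not the paper's reformulated loss; here $\beta$ is the KL-regularization strength and $\sigma$ the logistic function):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
\log\sigma\!\left(
  \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
  -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\right)\right]
```

The contrast of the two log-ratio terms is exactly what constrains only the relative preference, leaving the absolute likelihoods of $y_w$ and $y_l$ underdetermined.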
[NLP-56] Enhancing Marker Scoring Accuracy through Ordinal Confidence Modelling in Educational Assessments ACL2025
【Quick Read】: This paper addresses an ethical requirement of automated essay scoring (AES): scores should only be released once they meet high reliability standards, which confidence modelling supports by attaching a reliability estimate (a confidence score) to each automated score. The key idea is to frame confidence estimation as classification: predicting whether an AES-generated score places a candidate in the correct CEFR (Common European Framework of Reference for Languages) level, reformulated as an n-ary classification problem via score binning, together with a novel Kernel Weighted Ordinal Categorical Cross Entropy (KWOCCE) loss that exploits the ordinal structure of CEFR labels. The best model achieves an F1 score of 0.97 and substantially increases the fraction of scores the system can release at given CEFR agreement thresholds.
Link: https://arxiv.org/abs/2505.23315
Authors: Abhirup Chakravarty,Mark Brenchley,Trevor Breakspear,Ian Lewin,Yan Huang
Affiliations: Cambridge University Press & Assessment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This is the preprint version of our paper accepted to ACL 2025 (Industry Track). The DOI will be added once available
Abstract:A key ethical challenge in Automated Essay Scoring (AES) is ensuring that scores are only released when they meet high reliability standards. Confidence modelling addresses this by assigning a reliability estimate measure, in the form of a confidence score, to each automated score. In this study, we frame confidence estimation as a classification task: predicting whether an AES-generated score correctly places a candidate in the appropriate CEFR level. While this is a binary decision, we leverage the inherent granularity of the scoring domain in two ways. First, we reformulate the task as an n-ary classification problem using score binning. Second, we introduce a set of novel Kernel Weighted Ordinal Categorical Cross Entropy (KWOCCE) loss functions that incorporate the ordinal structure of CEFR labels. Our best-performing model achieves an F1 score of 0.97, and enables the system to release 47% of scores with 100% CEFR agreement and 99% with at least 95% CEFR agreement, compared to approximately 92% CEFR agreement from the standalone AES model where we release all AM predicted scores.
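The exact KWOCCE kernels are not given in the abstract; the following sketch only illustrates how an ordinal-distance-weighted loss differs from plain cross-entropy, using a linear distance weight as a stand-in assumption (the function name and `alpha` parameter are hypothetical).

```python
import numpy as np

def ordinal_weighted_ce(probs, true_idx, alpha=1.0):
    """Illustrative ordinal-aware loss: standard negative log-likelihood
    plus an expected ordinal-distance penalty. The weight |j - true_idx|
    charges probability mass more the further it sits from the true bin."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    dist = np.abs(np.arange(len(p)) - true_idx)
    return float(-np.log(p[true_idx]) + alpha * np.sum(dist * p))
```

Unlike ordinary cross-entropy, this loss is larger when the model puts mass on a bin far from the true CEFR level than on an adjacent one, which is the property an ordinal kernel is meant to encode.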
[NLP-57] Generalized Category Discovery in Event-Centric Contexts: Latent Pattern Mining with LLMs
【Quick Read】: This paper addresses the insufficient validation of textual Generalized Category Discovery (GCD) methods in realistic settings, focusing on Event-Centric GCD (EC-GCD), which poses two main challenges: divergence between clustering and classification groupings caused by subjective criteria, and unfair alignment for minority classes. The key solution is PaMA, a framework that leverages LLMs to extract and refine event patterns for improved cluster-class alignment, with a ranking-filtering-mining pipeline that ensures balanced prototype representation under imbalanced class distributions.
Link: https://arxiv.org/abs/2505.23304
Authors: Yi Luo,Qiwen Wang,Junqi Yang,Luyao Tang,Zhenghao Lin,Zhenzhe Ying,Weiqiang Wang,Chen Lin
Affiliations: Xiamen University; Ant Group
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Generalized Category Discovery (GCD) aims to classify both known and novel categories using partially labeled data that contains only known classes. Despite achieving strong performance on existing benchmarks, current textual GCD methods lack sufficient validation in realistic settings. We introduce Event-Centric GCD (EC-GCD), characterized by long, complex narratives and highly imbalanced class distributions, posing two main challenges: (1) divergent clustering versus classification groupings caused by subjective criteria, and (2) Unfair alignment for minority classes. To tackle these, we propose PaMA, a framework leveraging LLMs to extract and refine event patterns for improved cluster-class alignment. Additionally, a ranking-filtering-mining pipeline ensures balanced representation of prototypes across imbalanced categories. Evaluations on two EC-GCD benchmarks, including a newly constructed Scam Report dataset, demonstrate that PaMA outperforms prior methods with up to 12.58% H-score gains, while maintaining strong generalization on base GCD datasets.
[NLP-58] Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs
【Quick Read】: This paper addresses the reliability problems that weak hallucination detection causes for industrial deployments of large language models (LLMs) and retrieval-augmented generation (RAG) systems. The key solution combines efficient classification algorithms with dimensionality reduction to cut the training-data requirements of two state-of-the-art hallucination detection frameworks: Lookback Lens, which analyzes attention-head dynamics, and probing-based methods, which decode internal model representations. With only 250 training samples, the approach matches strong proprietary LLM-based baselines, improving scalability in annotation-constrained scenarios.
Link: https://arxiv.org/abs/2505.23299
Authors: Julia Belikova,Konstantin Polev,Rauf Parchiev,Dmitry Simakov
Affiliations: Sber AI Lab; Moscow Institute of Physics and Technology
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly deployed in industry applications, yet their reliability remains hampered by challenges in detecting hallucinations. While supervised state-of-the-art (SOTA) methods that leverage LLM hidden states – such as activation tracing and representation analysis – show promise, their dependence on extensively annotated datasets limits scalability in real-world applications. This paper addresses the critical bottleneck of data annotation by investigating the feasibility of reducing training data requirements for two SOTA hallucination detection frameworks: Lookback Lens, which analyzes attention head dynamics, and probing-based approaches, which decode internal model representations. We propose a methodology combining efficient classification algorithms with dimensionality reduction techniques to minimize sample size demands while maintaining competitive performance. Evaluations on standardized question-answering RAG benchmarks show that our approach achieves performance comparable to strong proprietary LLM-based baselines with only 250 training samples. These results highlight the potential of lightweight, data-efficient paradigms for industrial deployment, particularly in annotation-constrained scenarios.
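A minimal stand-in for the "dimensionality reduction + efficient classifier" recipe can be sketched with PCA (via SVD) followed by a nearest-centroid classifier. This is not the paper's pipeline: the real features come from Lookback Lens attention dynamics or LLM internal representations, and the function name and component count here are illustrative.

```python
import numpy as np

def fit_probe(hidden, labels, n_components=4):
    """Fit a tiny, data-efficient probe on (n, d) feature vectors.

    Reduces dimensionality with PCA computed from the SVD of the
    centered data, then classifies by nearest class centroid in the
    reduced space. Returns a predict(x) function."""
    mu = hidden.mean(axis=0)
    _, _, Vt = np.linalg.svd(hidden - mu, full_matrices=False)
    comps = Vt[:n_components]                      # top principal axes
    proj = (hidden - mu) @ comps.T
    centroids = {c: proj[labels == c].mean(axis=0) for c in np.unique(labels)}

    def predict(x):
        z = (x - mu) @ comps.T
        return min(centroids, key=lambda c: np.linalg.norm(z - centroids[c]))

    return predict
```

With well-separated features, such a probe needs very few labeled samples, which is the point the paper makes about annotation-constrained settings.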
[NLP-59] EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian
【Quick Read】: This paper addresses the lack of a publicly available benchmark for emotion classification in Ukrainian. The key contribution is EmoBench-UA, the first annotated dataset for emotion detection in Ukrainian texts, with an annotation schema adapted from earlier English-centric emotion-detection work and crowdsourced annotation to ensure quality. The authors also evaluate a range of approaches on the dataset, from linguistic baselines and synthetic data translated from English to large language models (LLMs), highlighting the challenges of emotion classification in non-mainstream languages and the need for Ukrainian-specific models and training resources.
Link: https://arxiv.org/abs/2505.23297
Authors: Daryna Dementieva,Nikolay Babakov,Alexander Fraser
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:While Ukrainian NLP has seen progress in many texts processing tasks, emotion classification remains an underexplored area with no publicly available benchmark to date. In this work, we introduce EmoBench-UA, the first annotated dataset for emotion detection in Ukrainian texts. Our annotation schema is adapted from the previous English-centric works on emotion detection (Mohammad et al., 2018; Mohammad, 2022) guidelines. The dataset was created through crowdsourcing using the this http URL platform ensuring high-quality of the annotation process. Then, we evaluate a range of approaches on the collected dataset, starting from linguistic-based baselines, synthetic data translated from English, to large language models (LLMs). Our findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian and emphasize the need for further development of Ukrainian-specific models and training resources.
[NLP-60] How Does Response Length Affect Long-Form Factuality ACL2025
【Quick Read】: This paper studies factual errors in long-form text generated by large language models (LLMs), focusing on the underexplored effect of response length on factual accuracy. The key solution is an automatic, bi-level long-form factuality evaluation framework that achieves high agreement with human annotations while remaining cost-effective, providing an effective tool for systematically studying the relationship between response length and factuality.
Link: https://arxiv.org/abs/2505.23295
Authors: James Xu Zhao,Jimmy Z.J. Liu,Bryan Hooi,See-Kiong Ng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ACL 2025 Findings. 24 pages, 10 figures, 18 tables. Code available at this https URL
Abstract:Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.
[NLP-61] ScEdit: Script-based Assessment of Knowledge Editing ACL2025
【Quick Read】: This paper addresses the gap between current knowledge editing (KE) tasks and realistic application scenarios, especially in LLM-as-agent settings. The key contribution is ScEdit (Script-based Knowledge Editing Benchmark), a novel script-based benchmark covering both counterfactual and temporal edits that integrates token-level and text-level evaluation, extending traditional fact-based ("What"-type question) evaluation to action-based ("How"-type question) evaluation for a more comprehensive assessment of KE techniques.
Link: https://arxiv.org/abs/2505.23291
Authors: Xinye Li,Zunwen Zheng,Qian Zhang,Dekai Zhuang,Jiabao Kang,Liyan Xu,Qingbin Liu,Xi Chen,Zhiying Tu,Dianhui Chu,Dianbo Sui
Affiliations: Harbin Institute of Technology; Jilin University; Tencent
Categories: Computation and Language (cs.CL)
Comments: ACL 2025 Findings
Abstract:Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce a novel script-based benchmark – ScEdit (Script-based Knowledge Editing Benchmark) – which encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based (“What”-type question) evaluation to action-based (“How”-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating a challenging task. Our benchmark is available at this https URL.
[NLP-62] Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective
【Quick Read】: This paper addresses two problems in retrieval-augmented generation (RAG): retrieved context is often long, noisy, or over the input limit, and existing compression methods require supervised training of dedicated compression models, which is costly and hurts portability. The key solution is Sentinel, a framework that reframes context filtering as an attention-based understanding task: a lightweight classifier probes decoder attention from an off-the-shelf 0.5B-parameter proxy LLM to identify sentence relevance, enabling efficient sentence-level compression.
Link: https://arxiv.org/abs/2505.23277
Authors: Yong Zhang,Yanwen Huang,Ning Cheng,Yang Guo,Yun Zhu,Yanmeng Wang,Shaojun Wang,Jing Xiao
Affiliations: Ping An Technology (Shenzhen) Co., Ltd., China; University of Electronic Science and Technology of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint. 17 pages including appendix
Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5× compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code available at: this https URL.
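The question-aware selection step can be illustrated with a toy sketch: score each context sentence by the attention mass a proxy model assigns to its tokens and keep the top-scoring sentences. Sentinel actually trains a lightweight classifier on attention features; the mean-attention scoring here is a simplifying assumption.

```python
import numpy as np

def select_sentences(attn_to_context, sent_spans, keep=2):
    """Keep the `keep` context sentences with the highest mean attention.

    attn_to_context: 1-D array of per-token attention weights from a
    (hypothetical) proxy decoder, aggregated over heads and query tokens.
    sent_spans: list of (start, end) token index pairs, one per sentence.
    Returns the kept sentence indices in original document order."""
    scores = [float(np.mean(attn_to_context[s:e])) for s, e in sent_spans]
    ranked = np.argsort(scores)[::-1][:keep]      # highest scores first
    return sorted(int(i) for i in ranked)
```

Preserving the original sentence order after selection matters in practice, since the compressed context is fed back to the answering LLM as running text.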
[NLP-63] The Arabic AI Fingerprint: Stylometric Analysis and Detection of Large Language Models Text
【Quick Read】: This paper addresses information-integrity risks from AI-generated Arabic text in education, social media, and academia, including misinformation campaigns, misleading health guidance, and targeted propaganda. The key contribution is a systematic analysis across multiple generation strategies (title-only generation, content-aware generation, and text refinement) and model architectures (ALLaM, Jais, Llama, and GPT-4), combined with in-depth stylometric analysis that reveals linguistic patterns distinguishing human-written from machine-generated Arabic text. Building on these findings, the authors develop BERT-based detection models that reach up to a 99.9% F1-score in formal contexts with strong generalization across model architectures.
Link: https://arxiv.org/abs/2505.23276
Authors: Maged S. Al-Shaibani,Moataz Ahmed
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) have achieved unprecedented capabilities in generating human-like text, posing subtle yet significant challenges for information integrity across critical domains, including education, social media, and academia, enabling sophisticated misinformation campaigns, compromising healthcare guidance, and facilitating targeted propaganda. This challenge becomes severe, particularly in under-explored and low-resource languages like Arabic. This paper presents a comprehensive investigation of Arabic machine-generated text, examining multiple generation strategies (generation from the title only, content-aware generation, and text refinement) across diverse model architectures (ALLaM, Jais, Llama, and GPT-4) in academic, and social media domains. Our stylometric analysis reveals distinctive linguistic patterns differentiating human-written from machine-generated Arabic text across these varied contexts. Despite their human-like qualities, we demonstrate that LLMs produce detectable signatures in their Arabic outputs, with domain-specific characteristics that vary significantly between different contexts. Based on these insights, we developed BERT-based detection models that achieved exceptional performance in formal contexts (up to 99.9% F1-score) with strong precision across model architectures. Our cross-domain analysis confirms generalization challenges previously reported in the literature. To the best of our knowledge, this work represents the most comprehensive investigation of Arabic machine-generated text to date, uniquely combining multiple prompt generation methods, diverse model architectures, and in-depth stylometric analysis across varied textual domains, establishing a foundation for developing robust, linguistically-informed detection systems essential for preserving information integrity in Arabic-language contexts.
[NLP-64] Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs
【Quick Read】: This paper addresses the data-privacy and copyright challenges of large language models (LLMs), in particular how to verify that specific sensitive information has been removed from a model without costly retraining. The key solution is a comprehensive auditing framework for unlearning evaluation, comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods, plus a novel technique that perturbs intermediate activations to overcome the limitations of auditing methods that rely only on model inputs and outputs.
Link: https://arxiv.org/abs/2505.23270
Authors: Haokun Chen,Yueqi Zhang,Yuan Bi,Yao Zhang,Tong Liu,Jinhe Bi,Jian Lan,Jindong Gu,Claudia Grosser,Denis Krompass,Nassir Navab,Volker Tresp
Affiliations: LMU Munich, Germany; Siemens AG, Munich, Germany; Munich Center for Machine Learning (MCML), Germany; Technical University of Munich, Germany; University of Oxford, UK
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:In recent years, Large Language Models (LLMs) have achieved remarkable advancements, drawing significant attention from the research community. Their capabilities are largely attributed to large-scale architectures, which require extensive training on massive datasets. However, such datasets often contain sensitive or copyrighted content sourced from the public internet, raising concerns about data privacy and ownership. Regulatory frameworks, such as the General Data Protection Regulation (GDPR), grant individuals the right to request the removal of such sensitive information. This has motivated the development of machine unlearning algorithms that aim to remove specific knowledge from models without the need for costly retraining. Despite these advancements, evaluating the efficacy of unlearning algorithms remains a challenge due to the inherent complexity and generative nature of LLMs. In this work, we introduce a comprehensive auditing framework for unlearning evaluation, comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods. By using various auditing algorithms, we evaluate the effectiveness and robustness of different unlearning strategies. To explore alternatives beyond prompt-based auditing, we propose a novel technique that leverages intermediate activation perturbations, addressing the limitations of auditing methods that rely solely on model inputs and outputs.
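The proposed intermediate-activation perturbation can be illustrated on a toy two-layer network: inject Gaussian noise into the hidden activation and measure how much the output shifts. All shapes, constants, and the function name are illustrative; the actual audit operates on LLM hidden states, where high sensitivity on supposedly forgotten prompts may signal residual knowledge.

```python
import numpy as np

def perturbation_sensitivity(x, W1, W2, sigma=0.05, n_trials=16, seed=0):
    """Mean output shift when Gaussian noise of scale `sigma` is added
    to the intermediate activation of a two-layer network y = tanh(xW1)W2."""
    rng = np.random.default_rng(seed)
    h = np.tanh(x @ W1)              # intermediate activation
    base = h @ W2                    # unperturbed output
    shifts = []
    for _ in range(n_trials):
        noisy = (h + rng.normal(0.0, sigma, h.shape)) @ W2
        shifts.append(np.linalg.norm(noisy - base))
    return float(np.mean(shifts))
```

Comparing this sensitivity on "forget set" prompts versus unrelated prompts is one way such an audit could go beyond purely input-output probing.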
zh
[NLP-65] Automatic Construction of Multiple Classification Dimensions for Managing Approaches in Scientific Papers
【速读】: 该论文试图解决从大量科学论文中高效查询和利用相关方法(approach)的问题,这一过程因缺乏有效的组织管理框架而面临显著挑战。其解决方案的关键在于通过自上而下的方式识别方法模式,并基于语义、话语、句法和词汇四个语言层次进行细化,进而提取科学论文中的方法。此外,论文提出了五种分类维度,并采用树结构表示步骤,利用基于树结构的相似性度量关注句法层面的相似性,同时引入集合相似性度量计算方法间的相似性。通过自下而上的聚类算法构建各维度下的类树,利用聚类过程中生成的类标签管理方法组件,最终形成多维方法空间,从而实现高效的方法查询与管理。
链接: https://arxiv.org/abs/2505.23252
作者: Bing Ma,Hai Zhuge
机构: 未知
类目: Computation and Language (cs.CL)
备注: 26 pages, 9 figures
Abstract:Approaches form the foundation for conducting scientific research. Querying approaches from a vast body of scientific papers is extremely time-consuming, and without a well-organized management framework, researchers may face significant challenges in querying and utilizing relevant approaches. Constructing multiple dimensions on approaches and managing them from these dimensions can provide an efficient solution. Firstly, this paper identifies approach patterns in a top-down way, refining the patterns through four distinct linguistic levels: semantic level, discourse level, syntactic level, and lexical level. Approaches in scientific papers are extracted based on approach patterns. Additionally, five dimensions for categorizing approaches are identified using these patterns. This paper proposes using a tree structure to represent steps and measuring the similarity between different steps with a tree-structure-based similarity measure that focuses on syntactic-level similarities. A collection similarity measure is proposed to compute the similarity between approaches. A bottom-up clustering algorithm is proposed to construct class trees for approach components within each dimension by merging each approach component or class with its most similar approach component or class in each iteration. The class labels generated during the clustering process indicate the common semantics of the step components within the approach components in each class and are used to manage the approaches within the class. The class trees of the five dimensions collectively form a multi-dimensional approach space. The application of approach queries on the multi-dimensional approach space demonstrates that querying within this space ensures strong relevance between user queries and results and rapidly reduces the search space through a class-based query mechanism.
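文中"每轮合并最相似的一对、并以公共语义作类标签"的自底向上聚类可以用一个玩具实现来说明(假设性示意:论文用的是基于树结构的句法相似度,此处以词集 Jaccard 相似度代替):

```python
def jaccard(a, b):
    """两个词集的 Jaccard 相似度。"""
    return len(a & b) / len(a | b) if a | b else 0.0

def bottom_up_cluster(components, threshold=0.2):
    """自底向上聚类示意: 每轮合并相似度最高的一对类,
    类标签取成员的公共词, 近似"类内步骤成分的共性语义"。"""
    clusters = [{"members": [c], "label": set(c)} for c in components]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = jaccard(clusters[i]["label"], clusters[j]["label"])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:  # 没有足够相似的类可合并, 停止
            break
        i, j = pair
        merged = {"members": clusters[i]["members"] + clusters[j]["members"],
                  "label": clusters[i]["label"] & clusters[j]["label"]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# 三个假设的"步骤成分", 以词集表示
steps = [{"train", "model", "bert"}, {"train", "model", "lstm"}, {"evaluate", "bleu"}]
result = bottom_up_cluster(steps)
```

前两个步骤成分因共享 {train, model} 被合并为一类,其标签即该类的共性语义;第三个成分与之相似度为 0,保持独立。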
zh
[NLP-66] ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering
【速读】: 该论文旨在解决现实场景中复杂图表问答(Complex Chart Question Answering, CQA)任务的评估与建模问题,传统方法因依赖固定输出格式和客观指标而未能满足实际应用中的复杂需求。其解决方案的关键在于提出一种上下文感知且模型无关的框架——ChartLLM,该框架通过提取关键上下文元素、降低噪声并提升多模态大语言模型的推理准确性,实现了对图表理解的灵活性和有效性增强。
链接: https://arxiv.org/abs/2505.23242
作者: Jingxuan Wei,Nan Xu,Junnan Zhu,Yanni Hao,Gaowei Wu,Bihui Yu,Lei Wang
机构: Beijing Wenge Technology Co., Ltd. (北京中科闻歌科技股份有限公司); Shenyang Institute of Computing Technology, Chinese Academy of Sciences (中国科学院沈阳计算技术研究所); MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所多模态人工智能系统实验室); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. While early approaches have shown promising performance by focusing on visual features or leveraging large-scale pre-training, most existing evaluations rely on rigid output formats and objective metrics, thus ignoring the complex, real-world demands of practical chart analysis. In this paper, we introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. ChartMind covers seven task categories, incorporates multilingual contexts, supports open-domain textual outputs, and accommodates diverse chart formats, bridging the gap between real-world applications and traditional academic benchmarks. Furthermore, we propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements, reducing noise, and enhancing the reasoning accuracy of multimodal large language models. Extensive evaluations on ChartMind and three representative public benchmarks with 14 mainstream multimodal models show our framework significantly outperforms the previous three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought, highlighting the importance of flexible chart understanding for real-world CQA. These findings suggest new directions for developing more robust chart reasoning in future research.
zh
[NLP-67] MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration
【速读】: 该论文旨在解决将蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)与大型语言模型(Large Language Models, LLMs)结合应用于开放性对话任务时出现的对齐问题,特别是在心理辅导等以人类为中心的场景中,传统基于结果导向的MCTS方法难以满足情感共鸣、伦理规范和人类偏好等主观标准。解决方案的关键在于提出MCTSr-Zero框架,其核心创新是“领域对齐”机制,将MCTS的搜索目标从预定义的终态转向符合目标领域原则的对话轨迹,同时引入“再生”和“元提示适应”机制以扩展探索空间,从而生成高质量且符合原则的对话数据。
链接: https://arxiv.org/abs/2505.23229
作者: Hao Lu,Yanchi Gu,Haoyuan Huang,Yulin Zhou,Ningxin Zhu,Chen Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 50 pages, 3 figures
Abstract:The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like empathetic engagement, ethical adherence, and alignment with human preferences, for which strict “correctness” criteria are ill-defined. Existing result-oriented MCTS approaches can therefore produce misaligned responses. To address this, we introduce MCTSr-Zero, an MCTS framework designed for open-ended, human-centric dialogues. Its core innovation is “domain alignment”, which shifts the MCTS search objective from predefined end-states towards conversational trajectories that conform to target domain principles (e.g., empathy in counseling). Furthermore, MCTSr-Zero incorporates “Regeneration” and “Meta-Prompt Adaptation” mechanisms to substantially broaden exploration by allowing the MCTS to consider fundamentally different initial dialogue strategies. We evaluate MCTSr-Zero in psychological counseling by generating multi-turn dialogue data, which is used to fine-tune an LLM, PsyLLM. We also introduce PsyEval, a benchmark for assessing multi-turn psychological counseling dialogues. Experiments demonstrate that PsyLLM achieves state-of-the-art performance on PsyEval and other relevant metrics, validating MCTSr-Zero’s effectiveness in generating high-quality, principle-aligned conversational data for human-centric domains and addressing the LLM challenge of consistently adhering to complex psychological standards.
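"领域对齐"的核心是把 MCTS 的评分从"终态对错"换成"整条对话轨迹对领域原则的符合程度"。下面是一个极简的假设性示意(非论文实现):用共情关键词近似原则打分,并用最小化的 UCB 采样在候选回复间做搜索。

```python
import math
import random

EMPATHIC = {"understand", "feel", "sorry", "here"}  # 假设的共情关键词表

def principle_score(dialogue):
    """"领域对齐"打分示意: 不判断终态对错,
    而按目标领域原则(此处用共情词频近似)为整条对话轨迹打分。"""
    words = " ".join(dialogue).lower().replace(",", " ").replace(".", " ").split()
    return sum(w in EMPATHIC for w in words) / max(len(words), 1)

def mcts_select(history, candidates, n_sim=200, c=0.5, seed=0):
    """极简 UCB 搜索: 真实 MCTS 会继续展开多轮对话再回传,
    此处把"模拟"折叠为直接打分 + 噪声, 仅演示选择-评分-回传的骨架。"""
    rng = random.Random(seed)
    stats = {a: [0, 0.0] for a in candidates}  # 访问次数, 累计原则得分
    for t in range(1, n_sim + 1):
        def ucb(a):
            n, w = stats[a]
            return float("inf") if n == 0 else w / n + c * math.sqrt(math.log(t) / n)
        a = max(candidates, key=ucb)
        r = principle_score(history + [a]) + rng.gauss(0.0, 0.01)  # 模拟噪声
        stats[a][0] += 1
        stats[a][1] += r
    return max(candidates, key=lambda a: stats[a][1] / stats[a][0])

history = ["User: I failed my exam and feel worthless."]
cands = ["I understand how you feel, I am here for you.",
         "Exams are just tests. Next question?"]
best = mcts_select(history, cands)
```

在这种目标下,搜索会稳定偏向符合共情原则的回复,而不依赖任何"正确答案"。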
zh
[NLP-68] MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration ACL2025
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在多模态推理过程中存在的置信度评估不足问题,特别是在每个推理步骤中缺乏精确的置信度校准,导致幻觉(hallucination)现象的累积。其解决方案的关键在于提出MMBoundary框架,通过引入互补的文本和跨模态自奖励信号,在每个推理步骤中估计置信度,并结合监督微调与强化学习阶段,进一步对齐模型知识并校准每一步的置信度,从而提升推理链的自我修正能力。
链接: https://arxiv.org/abs/2505.23224
作者: Zhitao He,Sandeep Polisetty,Zhiyuan Fan,Yuchen Huang,Shujin Wu,Yi R. (May) Fung
机构: Hong Kong University of Science and Technology (香港科技大学); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); University of Southern California (南加利福尼亚大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025
Abstract:In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chain) advanced inferencing. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning MLLM on this set of self-rewarded confidence estimation signal for initial confidence expression warm-up, we introduce a reinforcement learning stage with multiple reward functions for further aligning model knowledge and calibrating confidence at each reasoning step, enhancing reasoning chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average of 7.5% reduction in multimodal confidence calibration errors and up to 8.3% improvement in task performance.
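逐步置信度校准的好坏通常用期望校准误差(ECE)来度量。下面给出一个按推理步计算 ECE 的最简示意(假设性实现,仅说明指标本身,与论文的训练框架无关):

```python
def step_calibration_error(steps, n_bins=5):
    """按推理步计算期望校准误差(ECE): steps 为 (置信度, 该步是否正确)
    列表; 将置信度分桶, ECE 为各桶 |平均置信度 - 准确率| 的加权和。"""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in steps:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece, total = 0.0, len(steps)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - acc)
    return ece

# 校准良好的步骤序列 (置信度 0.9, 准确率 0.9) vs 过度自信的序列
well = [(0.9, True)] * 9 + [(0.9, False)]
over = [(0.9, False)] * 10
```

校准良好时每一步的置信度与实际正确率一致,ECE 接近 0;模型在每一步都过度自信时,ECE 会接近置信度与准确率的差距,这正是逐步校准要压低的量。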
zh
[NLP-69] ExpeTrans: LLM s Are Experiential Transfer Learners
【速读】: 该论文试图解决传统方法在为大型语言模型(Large Language Models, LLMs)提供文本任务求解经验时依赖大量人工劳动或时间成本的问题,这在用户查询任务类型日益多样化的背景下变得不切实际。解决方案的关键在于设计一种自主经验迁移框架,使LLMs能够模仿人类认知智能,自主地将现有源任务的经验迁移到新遇到的目标任务中,从而实现无需高昂成本获取经验,并为LLMs的泛化能力提供新的路径。
链接: https://arxiv.org/abs/2505.23191
作者: Jinglong Gao,Xiao Ding,Lingxiao Zou,Bibo Cai,Bing Qin,Ting Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 12 figs/tables
Abstract:Recent studies provide large language models (LLMs) with textual task-solving experiences via prompts to improve their performance. However, previous methods rely on substantial human labor or time to gather such experiences for each task, which is impractical given the growing variety of task types in user queries to LLMs. To address this issue, we design an autonomous experience transfer framework to explore whether LLMs can mimic human cognitive intelligence to autonomously transfer experience from existing source tasks to newly encountered target tasks. This not only allows the acquisition of experience without extensive costs of previous methods, but also offers a novel path for the generalization of LLMs. Experimental results on 13 datasets demonstrate that our framework effectively improves the performance of LLMs. Furthermore, we provide a detailed analysis of each module in the framework.
zh
[NLP-70] Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration
【速读】: 该论文旨在解决大型语言模型驱动的多智能体系统(MAS)在处理结构相似任务时存在的计算冗余和泛化能力有限的问题。其关键解决方案是提出多智能体跨任务经验学习(MAEL)框架,该框架通过图结构的多智能体协作网络建模任务求解流程,并在经验学习阶段量化每个步骤的质量,将相应的奖励、输入和输出存储到每个智能体的经验池中,从而在推理阶段利用高奖励的任务相关经验作为少样本示例,提升推理步骤的有效性。
链接: https://arxiv.org/abs/2505.23187
作者: Yilong Li,Chen Qian,Yu Xia,Ruijie Shi,Yufan Dang,Zihao Xie,Ziming You,Weize Chen,Cheng Yang,Weichuan Liu,Ye Tian,Xuantang Xiong,Lei Han,Zhiyuan Liu,Maosong Sun
机构: Peking University (北京大学); Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学); Beijing University of Posts and Telecommunications (北京邮电大学); Siemens (西门子); Tencent Robotics X (腾讯机器人实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Work in Progress
Abstract:Large Language Model-based multi-agent systems (MAS) have shown remarkable progress in solving complex tasks through collaborative reasoning and inter-agent critique. However, existing approaches typically treat each task in isolation, resulting in redundant computations and limited generalization across structurally similar tasks. To address this, we introduce multi-agent cross-task experiential learning (MAEL), a novel framework that endows LLM-driven agents with explicit cross-task learning and experience accumulation. We model the task-solving workflow on a graph-structured multi-agent collaboration network, where agents propagate information and coordinate via explicit connectivity. During the experiential learning phase, we quantify the quality for each step in the task-solving workflow and store the resulting rewards along with the corresponding inputs and outputs into each agent’s individual experience pool. During inference, agents retrieve high-reward, task-relevant experiences as few-shot examples to enhance the effectiveness of each reasoning step, thereby enabling more accurate and efficient multi-agent collaboration. Experimental results on diverse datasets demonstrate that MAEL empowers agents to learn from prior task experiences effectively-achieving faster convergence and producing higher-quality solutions on current tasks.
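"经验池 + 高奖励检索"这一环节可以用一个玩具实现来说明(假设性示意:相关度用词集 Jaccard 近似,真实系统通常用向量检索):

```python
class ExperiencePool:
    """智能体经验池示意: 存储 (输入, 输出, 奖励),
    推理时按"与当前任务相关且奖励高"检索 top-k 作为少样本示例。"""

    def __init__(self):
        self.pool = []

    def add(self, task_input, output, reward):
        self.pool.append({"input": task_input, "output": output, "reward": reward})

    @staticmethod
    def _relevance(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    def retrieve(self, query, k=2, min_reward=0.5):
        # 先过滤掉低奖励经验, 再按与查询的相关度排序
        cands = [e for e in self.pool if e["reward"] >= min_reward]
        cands.sort(key=lambda e: self._relevance(query, e["input"]), reverse=True)
        return cands[:k]

pool = ExperiencePool()
pool.add("sort a list of numbers in python", "use sorted(xs)", 0.9)
pool.add("reverse a string in python", "use s[::-1]", 0.8)
pool.add("sort a list of numbers in python", "bubble sort loop", 0.2)
shots = pool.retrieve("sort numbers in python", k=1)
```

低奖励的历史解法被过滤,检索结果即可直接拼入提示作为少样本示例,这正是跨任务经验复用的最小形态。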
zh
[NLP-71] Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
【速读】: 该论文旨在解决机器翻译输出中细粒度错误片段的自动识别问题,即词级质量评估(Word-level Quality Estimation, WQE)。传统WQE方法通常依赖于大型语言模型的提示或大量人工标注数据的专门训练,成本较高。本文提出了一种高效的替代方案,关键在于利用语言模型可解释性和不确定性量化方面的最新进展,从翻译模型内部机制中识别翻译错误。通过在12个翻译方向上对14个指标的评估,并使用多组人工标注数据量化人工标注差异对指标性能的影响,研究揭示了无监督指标的未开发潜力、监督方法在面对标注不确定性时的不足以及单标注者评估实践的脆弱性。
链接: https://arxiv.org/abs/2505.23183
作者: Gabriele Sarti,Vilém Zouhar,Malvina Nissim,Arianna Bisazza
机构: University of Groningen (格罗宁根大学); ETH Zurich (苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Under review. Code: this https URL Metrics: this https URL
Abstract:Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
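论文考察的无监督指标都来自翻译模型自身的内部信号。作为其中最简单的一种思路的示意(假设性实现):把模型给出的 token 概率低于阈值的连续片段标为疑似错误区间。

```python
def flag_error_spans(tokens, probs, tau=0.5):
    """无监督词级质量评估示意: token 概率 probs 由翻译模型自身给出,
    低于阈值 tau 的连续片段被标为疑似错误区间 (start, end, 片段)。"""
    spans, start = [], None
    for i, p in enumerate(probs):
        if p < tau and start is None:
            start = i                       # 进入低置信区间
        elif p >= tau and start is not None:
            spans.append((start, i - 1))    # 离开低置信区间
            start = None
    if start is not None:
        spans.append((start, len(tokens) - 1))
    return [(s, e, tokens[s:e + 1]) for s, e in spans]

# 假设的翻译输出: 末尾两个 token 未被翻译, 概率显著偏低
tokens = ["The", "cat", "sat", "on", "le", "tapis"]
probs = [0.95, 0.9, 0.85, 0.8, 0.2, 0.3]
errors = flag_error_spans(tokens, probs)
```

这种指标无需任何人工标注即可运行,论文的结论之一正是此类信号的潜力被低估。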
zh
[NLP-72] Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification
【速读】: 该论文旨在解决传统代码指令数据合成方法中存在的多样性不足和逻辑性较差的问题。其解决方案的关键在于提出了一种名为Infinite-Instruct的自动化框架,该框架通过“逆向构建”将代码片段转化为多样化的编程问题,并利用“反馈构建”将问题中的关键词结构化为知识图谱以增强问题的内部逻辑;最后通过跨语言静态代码分析管道过滤无效样本,从而提升生成代码的质量和逻辑性。
链接: https://arxiv.org/abs/2505.23177
作者: Wenjing Xing,Wenke Lu,Yeheng Duan,Bing Zhao,Zhenghui kang,Yaolong Wang,Kai Gao,Lei Qiao
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized code. First, “Reverse Construction” transforms code snippets into diverse programming problems. Then, through “Backfeeding Construction,” keywords in programming problems are structured into a knowledge graph to reconstruct them into programming problems with stronger internal logic. Finally, a cross-lingual static code analysis pipeline filters invalid samples to ensure data quality. Experiments show that on mainstream code generation benchmarks, our fine-tuned models achieve an average performance improvement of 21.70% on 7B-parameter models and 36.95% on 32B-parameter models. Using less than one-tenth of the instruction fine-tuning data, we achieved performance comparable to the Qwen-2.5-Coder-Instruct. Infinite-Instruct provides a scalable solution for LLM training in programming. We open-source the datasets used in the experiments, including both unfiltered versions and filtered versions via static analysis. The data are available at this https URL
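"跨语言静态代码分析过滤无效样本"这一步,对 Python 样本而言最直接的做法是用标准库 `ast` 做语法校验。下面是一个单语言的最简示意(论文为跨语言流水线,此处仅演示 Python 一种):

```python
import ast

def static_filter(samples):
    """静态校验示意: 用 ast.parse 过滤语法非法或空的合成代码样本。"""
    valid, invalid = [], []
    for code in samples:
        try:
            tree = ast.parse(code)
            if not tree.body:               # 额外的轻量检查: 丢弃空样本
                raise SyntaxError("empty module")
            valid.append(code)
        except SyntaxError:
            invalid.append(code)
    return valid, invalid

samples = ["def add(a, b):\n    return a + b\n",  # 合法
           "def broken(:\n    pass\n",            # 语法错误
           ""]                                     # 空样本
valid, invalid = static_filter(samples)
```

语法校验只是静态分析的下限;在其之上还可以叠加未定义名检查、类型检查等,但仅此一步就能滤掉大量明显无效的合成样本。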
zh
[NLP-73] MapMake: Schema Guided Text to Table Generation ACL2025
【速读】: 该论文试图解决文本到表格生成(Text-to-Table generation)中信息提取不准确、缺乏数据推理能力以及难以捕捉定性和定量信息的问题。其解决方案的关键在于提出一种名为MapMake的通用方法,该方法通过将文本分解为命题原子语句,实现细粒度的信息分解,从而提取潜在的模式结构,并利用该结构填充能够体现原始文本定性细节和定量事实的表格。
链接: https://arxiv.org/abs/2505.23174
作者: Naman Ahuja,Fenil Bardoliya,Chitta Baral,Vivek Gupta
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025
Abstract:Transforming dense, detailed, unstructured text into an interpretable and summarised table, also colloquially known as Text-to-Table generation, is an essential task for information retrieval. Current methods, however, miss out on how and what complex information to extract; they also lack the ability to infer data from the text. In this paper, we introduce a versatile approach, MapMake, which “dissects” text into propositional atomic statements. This facilitates granular decomposition to extract the latent schema. The schema is then used to populate the tables that capture the qualitative nuances and the quantitative facts in the original text. Our approach is tested against two challenging datasets, Rotowire, renowned for its complex and multi-table schema, and Livesum, which demands numerical aggregation. By carefully identifying and correcting hallucination errors in Rotowire, we aim to achieve a cleaner and more reliable benchmark. We evaluate our method rigorously on a comprehensive suite of comparative and referenceless metrics. Our findings demonstrate significant improvement results across both datasets with better interpretability in Text-to-Table generation. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to superior performance and validate the practicality of our framework in structured summarization tasks.
zh
[NLP-74] ZIPA: A family of efficient models for multilingual phone recognition ACL2025
【速读】: 该论文试图解决跨语言音素识别(crosslinguistic phone recognition)的性能瓶颈问题,旨在提升现有系统的准确性和效率。其解决方案的关键在于构建了一个大规模多语言语音语料库IPAPack++,并引入了基于Zipformer架构的高效模型ZIPA,包括基于转换器(transducer)和CTC(Connectionist Temporal Classification)的变体,通过减少参数量实现更优的性能,同时利用噪声学生训练进一步提升模型效果。
链接: https://arxiv.org/abs/2505.23170
作者: Jian Zhu,Farhan Samir,Eleanor Chodroff,David R. Mortensen
机构: University of British Columbia (不列颠哥伦比亚大学); University of Zurich (苏黎世大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: ACL 2025 Main
Abstract:We present ZIPA, a family of efficient speech models that advances the state-of-the-art performance of crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions and a novel evaluation set capturing unseen languages and sociophonetic variation. With the large-scale training data, ZIPA, including transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverage the efficient Zipformer backbones and outperform existing phone recognition systems with much fewer parameters. Further scaling via noisy student training on 11,000 hours of pseudo-labeled multilingual data yields further improvement. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.
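ZIPA-CR 这类基于 CTC 的音素识别模型,推理时最简单的解码方式是贪心解码:先合并相邻重复标签,再去掉 blank。一个最简示意(与 ZIPA 的具体实现无关):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """CTC 贪心解码: 合并相邻重复标签后去掉 blank。"""
    collapsed, prev = [], None
    for t in frame_ids:
        if t != prev:          # 只保留与前一帧不同的标签
            collapsed.append(t)
        prev = t
    return [t for t in collapsed if t != blank]

# 帧级 argmax 输出 (0=blank, 其余为音素 id): k k _ ae ae _ t -> [k, ae, t]
frames = [5, 5, 0, 3, 3, 0, 7]
phones = ctc_greedy_decode(frames)
```

blank 的作用在第二个用例中可见:`[0, 1, 1, 0, 1]` 解码为两个连续的相同音素 `[1, 1]`,blank 把它们隔开而不被合并。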
zh
[NLP-75] Don't Show: Leveraging Language Models' Abstractive Retellings to Model Literary Themes ACL2025
【速读】: 该论文试图解决传统词袋模型方法(如潜在狄利克雷分布,LDA)在文学文本主题建模中的局限性,因为文学语言注重沉浸式感官细节而非抽象描述或阐述。解决方案的关键在于提出Retell方法,即通过提示资源高效的生成式语言模型(Generative Language Models, LMs)对文本段落进行重述,从而将叙事的表层形式转化为更高级的概念和主题,再利用LDA对重述内容进行主题建模,以获得比单独使用LDA或直接让LM列举主题更精确和信息量大的主题结果。
链接: https://arxiv.org/abs/2505.23166
作者: Li Lucy,Camilla Griffiths,Sarah Levine,Jennifer L. Eberhardt,Dorottya Demszky,David Bamman
机构: University of California Berkeley (加州大学伯克利分校); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: 26 pages, 7 figures, Findings of ACL 2025
Abstract:Conventional bag-of-words approaches for topic modeling, like latent Dirichlet allocation (LDA), struggle with literary text. Literature challenges lexical methods because narrative language focuses on immersive sensory details instead of abstractive description or exposition: writers are advised to “show, don’t tell.” We propose Retell, a simple, accessible topic modeling approach for literature. Here, we prompt resource-efficient, generative language models (LMs) to tell what passages show, thereby translating narratives’ surface forms into higher-level concepts and themes. By running LDA on LMs’ retellings of passages, we can obtain more precise and informative topics than by running LDA alone or by directly asking LMs to list topics. To investigate the potential of our method for cultural analytics, we compare our method’s outputs to expert-guided annotations in a case study on racial/cultural identity in high school English language arts books.
zh
[NLP-76] Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models
【速读】: 该论文试图解决在专业领域中进行跨语言词典提取(Bilingual Lexicon Induction, BLI)时性能下降的问题,特别是在医学等专业领域中,传统方法如Muse和Vecmap的表现显著低于在通用领域中的表现。其解决方案的关键在于引入代码切换(Code Switch)策略,以适应不同上下文中的差异性匹配,从而提升模型在特定领域中的适用性和翻译质量。通过利用预训练模型的能力,该方法构建了更优的词向量表示,并在三个具体领域中平均提升了0.78分的性能。
链接: https://arxiv.org/abs/2505.23146
作者: Qiuyu Ding,Zhiqiang Cao,Hailong Cao,Tiejun Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Bilingual Lexicon Induction (BLI) is generally based on common domain data to obtain monolingual word embeddings, and the monolingual word embeddings are aligned to obtain cross-lingual embeddings from which word translation pairs are extracted. In this paper, we propose a new task of BLI, which is to use the monolingual corpora of the general domain and a target domain to extract domain-specific bilingual dictionaries. Motivated by the ability of pre-trained models, we propose a method to get better word embeddings that builds on recent work on BLI. In this way, we introduce Code Switch (Qin et al., 2020) for the first time in the cross-domain BLI task, which can match different strategies in different contexts, making the model more suitable for this task. It had yet to be seen whether existing methods are suitable for bilingual lexicon extraction in professional fields: as we can see in Table 1, the classic and efficient BLI approaches, Muse and Vecmap, perform much worse on the Medical dataset than on the Wiki dataset. On one hand, specialized domain datasets are relatively small compared to generic domain datasets, and specialized words have lower frequency, which directly affects the translation quality of bilingual dictionaries. On the other hand, static word embeddings are widely used for BLI; however, in some specific fields the meaning of words is greatly influenced by context, in which case using only static word embeddings may lead to greater bias. Experimental results show that our method can improve performance over robust BLI baselines on three specific domains by 0.78 points on average.
zh
[NLP-77] Enhancing Large Language Models' Machine Translation via Dynamic Focus Anchoring
【速读】: 该论文试图解决大型语言模型在机器翻译(MT)任务中对上下文敏感单元(context-sensitive units, CSUs)处理不足的问题,这些问题包括多义词等,导致局部翻译准确性下降,甚至影响句子和任务的整体理解,最终可能引发翻译失败。解决方案的关键在于通过获取CSUs并应用语义聚焦来增强LLMs的MT能力,具体方法是动态分析和识别翻译挑战,并以结构化方式将其融入LLMs,从而缓解因信息扁平化导致的CSUs误译或误解问题,进而提升复杂术语的翻译准确性。
链接: https://arxiv.org/abs/2505.23140
作者: Qiuyu Ding,Zhiqiang Cao,Hailong Cao,Tiejun Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models have demonstrated exceptional performance across multiple crosslingual NLP tasks, including machine translation (MT). However, persistent challenges remain in addressing context-sensitive units (CSUs), such as polysemous words. These CSUs not only affect the local translation accuracy of LLMs, but also affect LLMs’ understanding capability for sentences and tasks, and even lead to translation failure. To address this problem, we propose a simple but effective method to enhance LLMs’ MT capabilities by acquiring CSUs and applying semantic focus. Specifically, we dynamically analyze and identify translation challenges, then incorporate them into LLMs in a structured manner to mitigate mistranslations or misunderstandings of CSUs caused by information flattening. Efficiently activate LLMs to identify and apply relevant knowledge from its vast data pool in this way, ensuring more accurate translations for translating difficult terms. On a benchmark dataset of MT, our proposed method achieved competitive performance compared to multiple existing open-sourced MT baseline models. It demonstrates effectiveness and robustness across multiple language pairs, including both similar language pairs and distant language pairs. Notably, the proposed method requires no additional model training and enhances LLMs’ performance across multiple NLP tasks with minimal resource consumption.
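"识别上下文敏感单元(CSU)并以结构化方式注入提示"这一步可以用一个玩具实现来说明(假设性示意:词表、释义与提示模板均为虚构,真实方法是动态分析翻译难点):

```python
CSU_GLOSSARY = {
    # 假设的多义词词表: 词 -> 该语境下应采用的释义
    "bank": "financial institution (not river bank)",
    "spring": "season (not coil or water source)",
}

def build_focus_prompt(source_sentence):
    """动态聚焦示意: 找出句中的上下文敏感单元(CSU),
    将其释义以结构化条目注入翻译提示, 缓解信息扁平化导致的误译。"""
    hits = [w for w in source_sentence.lower().split() if w in CSU_GLOSSARY]
    notes = "\n".join(f"- '{w}': {CSU_GLOSSARY[w]}" for w in hits)
    prompt = f"Translate the sentence below.\nSentence: {source_sentence}"
    if notes:
        prompt += f"\nPay attention to these context-sensitive terms:\n{notes}"
    return prompt, hits

prompt, hits = build_focus_prompt("I deposited cash at the bank in spring")
```

这种做法不需要额外训练:只改写提示,让模型在翻译难点上显式地"聚焦",与摘要中"无需额外模型训练"的特性一致。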
zh
[NLP-78] PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在处理历史语言学启发的归纳推理问题时表现不足的问题,特别是针对编程通过示例(Programming by Examples)任务的挑战。其解决方案的关键在于开发了一个完全自动化的流水线,用于动态生成具有可控难度的基准测试集,以应对可扩展性和数据污染等问题。通过该流水线生成的测试集包含近1000个实例,对当前最先进的推理LLMs仍具有相当大的挑战性,表明尽管LCoT LLMs在抽象推理方面表现出色,但在历史语言学及其他领域常见的推理任务上仍存在显著局限。
链接: https://arxiv.org/abs/2505.23126
作者: Atharva Naik,Darsh Agrawal,Manav Kapadnis,Yuwei An,Yash Mathur,Carolyn Rose,David Mortensen
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recently, long chain-of-thought (LCoT) Large Language Models (LLMs) have taken the machine learning world by storm with their breathtaking reasoning capabilities. However, are the abstract reasoning abilities of these models general enough for problems of practical importance? Unlike past work, which has focused mainly on math, coding, and data wrangling, we focus on a historical linguistics-inspired inductive reasoning problem, formulated as Programming by Examples. We develop a fully automated pipeline for dynamically generating a benchmark for this task with controllable difficulty in order to tackle scalability and contamination issues to which many reasoning benchmarks are subject. Using our pipeline, we generate a test set with nearly 1k instances that is challenging for all state-of-the-art reasoning LLMs, with the best model (Claude-3.7-Sonnet) achieving a mere 54% pass rate, demonstrating that LCoT LLMs still struggle with a class of reasoning that is ubiquitous in historical linguistics as well as many other domains.
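历史语言学式的 Programming by Examples,本质上是从若干输入输出词对中归纳一组重写规则(类似音变规则)。一个最简的假设性示意:规则程序表示为有序替换列表,判定即检验其与全部示例一致。

```python
def apply_rules(word, rules):
    """依次应用字符串重写规则(类似历史语言学中的音变规则级联)。"""
    for src, dst in rules:
        word = word.replace(src, dst)
    return word

def consistent(rules, examples):
    """Programming by Examples 的判定步骤示意:
    候选"程序"(规则级联)须与全部输入输出示例一致。"""
    return all(apply_rules(x, rules) == y for x, y in examples)

# 假设的示例: p->b, t->d 的清塞音浊化
examples = [("pater", "bader"), ("tapa", "daba")]
good = [("p", "b"), ("t", "d")]   # 覆盖全部示例
bad = [("p", "b")]                 # 缺少 t->d, 与示例不一致
ok_good = consistent(good, examples)
ok_bad = consistent(bad, examples)
```

基准的难度可以通过规则数量、规则间的相互作用(顺序敏感性)与示例数量来控制,这与摘要中"可控难度的自动生成流水线"的思路一致。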
zh
[NLP-79] ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations
【速读】: 该论文旨在解决现有的开源多模态模型在多轮交互(multi-turn interaction)能力上的不足,尤其是在处理长上下文时的表现较弱的问题。其解决方案的关键在于引入了一个名为ContextQFormer的上下文建模模块,该模块通过使用记忆块(memory block)来增强上下文信息的表示能力,从而提升模型在多轮多模态对话中的表现。
链接: https://arxiv.org/abs/2505.23121
作者: Yiming Lei,Zhizheng Yang,Zeming Liu,Haitao Leng,Shaoguo Liu,Tingting Gao,Qingjie Liu,Yunhong Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures
Abstract:Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, the existing open-source multi-modal models suffer from the weak capability of multi-turn interaction, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports the research of multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog and experimental results illustrate that ContextQFormer achieves an improvement of 2%-4% in available rate over baselines.
zh
[NLP-80] Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios
【速读】: 该论文旨在解决医疗领域中多模态推理模型在复杂任务中的表现不足问题,尤其是在临床决策制定中缺乏有效的多源证据迭代推理能力。其解决方案的关键在于提出一种两阶段的后训练管道 MedE^2:首先通过2,000个仅含文本的数据样本微调模型以激发推理行为,随后利用1,500个严格筛选的多模态医学案例进一步增强模型的推理能力,使其输出与提出的多模态医学推理偏好对齐。
链接: https://arxiv.org/abs/2505.23118
作者: Linjie Mu,Zhongzhen Huang,Yakun Zhu,Xiangyu Zhao,Shaoting Zhang,Xiaofan Zhang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose MedE^2, a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of MedE^2 in improving the reasoning performance of medical multimodal models. Notably, models trained with MedE^2 consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.
zh
[NLP-81] Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data
【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)对齐过程中人工偏好数据收集成本高、效率低的问题,这一问题限制了模型对齐的可扩展性。解决方案的关键在于引入一种名为Alignment Data Map的工具,该工具利用GPT-4o作为LLM对齐的代理,计算生成响应与指令之间的对齐分数,并基于这些分数的均值和方差构建数据地图。通过识别高均值、低方差区域的高质量样本,该方法能够在仅使用33%数据的情况下实现与全量数据相当或更优的性能,从而显著提升数据收集效率并减少对显式标注的依赖。
链接: https://arxiv.org/abs/2505.23114
作者: Seohyeong Lee,Eunwon Kim,Hwaran Lee,Buru Chang
机构: Sogang University (首尔歌德大学); Korea University (韩国外国语大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Human preference data plays a critical role in aligning large language models (LLMs) with human values. However, collecting such data is often expensive and inefficient, posing a significant scalability challenge. To address this, we introduce Alignment Data Map, a GPT-4o-assisted tool for analyzing and diagnosing preference data. Using GPT-4o as a proxy for LLM alignment, we compute alignment scores for LLM-generated responses to instructions from existing preference datasets. These scores are then used to construct an Alignment Data Map based on their mean and variance. Our experiments show that using only 33 percent of the data, specifically samples in the high-mean, low-variance region, achieves performance comparable to or better than using the entire dataset. This finding suggests that the Alignment Data Map can significantly improve data collection efficiency by identifying high-quality samples for LLM alignment without requiring explicit annotations. Moreover, the Alignment Data Map can diagnose existing preference datasets. Our analysis shows that it effectively detects low-impact or potentially misannotated samples. Source code is available online.
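"按对齐分数的均值和方差构图,保留高均值、低方差区域样本"这一筛选逻辑本身很简单,可用标准库 `statistics` 给出一个假设性示意(阈值与打分均为虚构,真实打分由 GPT-4o 代理完成):

```python
from statistics import mean, pvariance

def select_high_quality(samples, scores_per_sample, mean_th=0.7, var_th=0.01):
    """Alignment Data Map 式筛选示意: 对每条样本的多次对齐打分
    计算均值与方差, 仅保留"高均值、低方差"区域的样本。"""
    kept = []
    for sample, scores in zip(samples, scores_per_sample):
        m, v = mean(scores), pvariance(scores)
        if m >= mean_th and v <= var_th:
            kept.append(sample)
    return kept

samples = ["good-stable", "good-noisy", "bad-stable"]
scores = [[0.80, 0.82, 0.81],  # 高均值、低方差 -> 保留
          [0.90, 0.50, 1.00],  # 高均值、高方差 -> 丢弃
          [0.20, 0.21, 0.19]]  # 低均值 -> 丢弃
kept = select_high_quality(samples, scores)
```

高方差样本往往对应打分不稳定、可能被误标注的数据,这正是摘要中"诊断现有偏好数据集"所利用的信号。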
zh
[NLP-82] Generating Diverse Training Samples for Relation Extraction with Large Language Models ACL2025
【速读】: 该论文试图解决在关系抽取(Relation Extraction, RE)任务中,使用生成式AI(Generative AI)直接生成的训练样本存在结构相似性高、表达方式单一的问题,从而影响模型的泛化能力和性能。解决方案的关键在于通过改进生成过程以提升样本多样性,同时保持其正确性,具体包括两种方法:一是通过在上下文学习(In-Context Learning, ICL)提示中直接给出指令以促使模型生成差异化的样本;二是通过直接偏好优化(Direct Preference Optimization, DPO)对大语言模型进行微调,以增强其生成多样且正确的训练样本的能力。
链接: https://arxiv.org/abs/2505.23108
作者: Zexuan Li,Hongliang Dai,Piji Li
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); MIIT Key Laboratory of Pattern Analysis and Machine Intelligence (工业和信息化部模式分析与机器智能重点实验室); The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education (教育部脑机智能技术重点实验室)
类目: Computation and Language (cs.CL)
备注: ACL2025 Main
Abstract:Using Large Language Models (LLMs) to generate training data can potentially be a preferable way to improve zero or few-shot NLP tasks. However, many problems remain to be investigated for this direction. For the task of Relation Extraction (RE), we find that samples generated by directly prompting LLMs may easily have high structural similarities with each other. They tend to use a limited variety of phrasing while expressing the relation between a pair of entities. Therefore, in this paper, we study how to effectively improve the diversity of the training samples generated with LLMs for RE, while also maintaining their correctness. We first try to make the LLMs produce dissimilar samples by directly giving instructions in In-Context Learning (ICL) prompts. Then, we propose an approach to fine-tune LLMs for diversity training sample generation through Direct Preference Optimization (DPO). Our experiments on commonly used RE datasets show that both attempts can improve the quality of the generated training data. We also find that comparing with directly performing RE with an LLM, training a non-LLM RE model with its generated samples may lead to better performance.
zh
[NLP-83] MAP: Revisiting Weight Decomposition for Low-Rank Adaptation
【速读】: 该论文试图解决参数高效微调(Parameter-efficient fine-tuning, PEFT)方法中权重适应方向与幅度分解不具严格几何基础的问题,现有方法通常在列级别上启发式地定义方向,缺乏理论支撑。解决方案的关键在于提出MAP框架,该框架将权重矩阵重新表述为高维向量,并以严谨的方式将其适应过程解耦为方向和幅度,通过归一化预训练权重、学习方向更新以及引入两个标量系数独立缩放基础向量和更新向量的幅度,实现更可解释和灵活的适应。
链接: https://arxiv.org/abs/2505.23094
作者: Chongjie Si,Zhiyi Shi,Yadao Wang,Xiaokang Yang,Susanto Rahardja,Wei Shen
机构: Shanghai Jiao Tong University (上海交通大学); Harvard University (哈佛大学); Alibaba Group (阿里巴巴集团); Singapore Institute of Technology (新加坡理工大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:The rapid development of large language models has revolutionized natural language processing, but their fine-tuning remains computationally expensive, hindering broad deployment. Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, have emerged as solutions. Recent work like DoRA attempts to further decompose weight adaptation into direction and magnitude components. However, existing formulations often define direction heuristically at the column level, lacking a principled geometric foundation. In this paper, we propose MAP, a novel framework that reformulates weight matrices as high-dimensional vectors and decouples their adaptation into direction and magnitude in a rigorous manner. MAP normalizes the pre-trained weights, learns a directional update, and introduces two scalar coefficients to independently scale the magnitude of the base and update vectors. This design enables more interpretable and flexible adaptation, and can be seamlessly integrated into existing PEFT methods. Extensive experiments show that MAP significantly improves performance when coupling with existing methods, offering a simple yet powerful enhancement to existing PEFT methods. Given the universality and simplicity of MAP, we hope it can serve as a default setting for designing future PEFT methods.
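MAP 把权重矩阵视作单个高维向量并严格解耦方向与幅度,其核心思想可用如下示意代码表达(`alpha`、`beta` 对应论文中独立缩放基向量与更新向量幅度的两个标量系数;训练细节为假设性简化):

```python
import numpy as np

def map_adapt(W, delta, alpha, beta):
    """把权重矩阵摊平成高维向量, 归一化出方向, 再用两个标量独立缩放幅度。"""
    w = W.reshape(-1)
    d = delta.reshape(-1)
    w_hat = w / np.linalg.norm(w)          # 归一化后的预训练方向
    d_hat = d / np.linalg.norm(d)          # 学到的方向性更新
    w_new = alpha * w_hat + beta * d_hat   # 幅度由 alpha / beta 独立控制
    return w_new.reshape(W.shape)

W = np.array([[3.0, 0.0], [0.0, 4.0]])      # Frobenius 范数为 5
delta = np.array([[0.0, 1.0], [0.0, 0.0]])  # 单位范数的更新方向
W_new = map_adapt(W, delta, alpha=5.0, beta=0.5)
```

当 `alpha` 取原权重范数 5 时,基向量部分恰好还原 W,更新部分仅叠加 0.5 倍的方向扰动,体现了幅度与方向的独立可解释性。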
zh
[NLP-84] Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models
【速读】: 该论文旨在解决多模态小语言模型(MSLMs)在扩展大型语言模型(LLMs)推理能力时所面临的三大关键挑战:高质量多模态推理数据集的稀缺性、视觉处理集成导致的推理能力退化以及直接应用强化学习可能引发复杂但错误的推理过程。解决方案的关键在于设计了一个名为Infi-MMR的框架,通过三个精心构建的阶段系统地释放MSLMs的推理潜力,包括基础推理激活、跨模态推理适应和多模态推理增强,从而有效提升模型在多模态数学推理和通用推理任务上的性能。
链接: https://arxiv.org/abs/2505.23091
作者: Zeyu Liu,Yuhang Liu,Guanghao Zhu,Congkai Xie,Zhen Li,Jianbo Yuan,Xinyao Wang,Qing Li,Shing-Chi Cheung,Shengyu Zhang,Fei Wu,Hongxia Yang
机构: The Hong Kong Polytechnic University(香港理工大学); Zhejiang University(浙江大学); University of Electronic Science and Technology of China(电子科技大学); Reallm Labs(Reallm 实验室); The Hong Kong University of Science and Technology(香港科技大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Recent advancements in large language models (LLMs) have demonstrated substantial progress in reasoning capabilities, such as DeepSeek-R1, which leverages rule-based reinforcement learning to enhance logical reasoning significantly. However, extending these achievements to multimodal large language models (MLLMs) presents critical challenges, which are frequently more pronounced for Multimodal Small Language Models (MSLMs) given their typically weaker foundational reasoning abilities: (1) the scarcity of high-quality multimodal reasoning datasets, (2) the degradation of reasoning capabilities due to the integration of visual processing, and (3) the risk that direct application of reinforcement learning may produce complex yet incorrect reasoning processes. To address these challenges, we design a novel framework Infi-MMR to systematically unlock the reasoning potential of MSLMs through a curriculum of three carefully structured phases and propose our multimodal reasoning model Infi-MMR-3B. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model’s logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning. Infi-MMR-3B achieves both state-of-the-art multimodal math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on MathVista testmini).
zh
[NLP-85] Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport ACL2025
【速读】: 该论文旨在解决文档级文本生成任务中由于需要理解更长上下文而带来的挑战,特别是传统最小贝叶斯风险(Minimum Bayes Risk, MBR)解码方法在该任务中的性能受限问题。其解决方案的关键在于提出MBR-OT,这是一种基于Wasserstein距离的MBR解码变体,通过将句子级效用函数扩展至文档级别,从而更有效地评估文档的整体效用。实验结果表明,MBR-OT在文档级机器翻译、文本简化和密集图像描述生成任务中均优于标准MBR方法。
链接: https://arxiv.org/abs/2505.23078
作者: Yuu Jinnai
机构: CyberAgent
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025
Abstract:Document-level text generation tasks are known to be more difficult than sentence-level text generation tasks as they require the understanding of longer context to generate high-quality texts. In this paper, we investigate the adaption of Minimum Bayes Risk (MBR) decoding for document-level text generation tasks. MBR decoding makes use of a utility function to estimate the output with the highest expected utility from a set of candidate outputs. Although MBR decoding is shown to be effective in a wide range of sentence-level text generation tasks, its performance on document-level text generation tasks is limited as many of the utility functions are designed for evaluating the utility of sentences. To this end, we propose MBR-OT, a variant of MBR decoding using Wasserstein distance to compute the utility of a document using a sentence-level utility function. The experimental result shows that the performance of MBR-OT outperforms that of the standard MBR in document-level machine translation, text simplification, and dense image captioning tasks. Our code is available at this https URL
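当候选文档与参考文档句子数相同且权重均匀时,句子层面的最优传输退化为二分图最优匹配;下面用 scipy 的匹配求解器给出 MBR-OT 思路的一个简化示意(句子效用函数用词集合 Jaccard 重叠代替论文中的真实指标,属演示性假设):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def jaccard(a, b):
    """演示用的句子级效用: 词集合 Jaccard 重叠。"""
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)

def doc_utility(cand_sents, ref_sents, sent_util):
    """等量句子 + 均匀权重时, 最优传输退化为最优匹配。"""
    U = np.array([[sent_util(c, r) for r in ref_sents] for c in cand_sents])
    rows, cols = linear_sum_assignment(-U)  # 取负号以最大化总效用
    return float(U[rows, cols].mean())

def mbr_decode(candidates, sent_util):
    """MBR: 选出对其余候选期望效用最高的文档。"""
    scores = [np.mean([doc_utility(c, r, sent_util)
                       for r in candidates if r is not c])
              for c in candidates]
    return candidates[int(np.argmax(scores))]

cands = [
    ["the cat sat", "it was grey"],
    ["the cat sat", "it was gray"],
    ["dogs bark loudly", "birds fly high"],
]
best = mbr_decode(cands, jaccard)
```

前两个候选互为近义、彼此效用高,第三个候选与其他候选无重叠,MBR 会从前两者中选出输出。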
zh
[NLP-86] SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在多模态社交网络服务(Social Networking Services, SNS)场景下的能力评估不足问题,特别是在视觉与文本内容融合的多模态上下文中,现有基准测试主要侧重于文本任务,缺乏对实际SNS生态中多模态交互的全面覆盖。解决方案的关键在于提出SNS-Bench-VL,这是一个涵盖8个多模态任务的综合性基准,包含4,001个精心构建的多模态问答对,用于评估视觉-语言LLMs在真实社交媒体场景中的性能,从而推动下一代社交网络服务中鲁棒、上下文感知且符合人类价值观的多模态智能的发展。
链接: https://arxiv.org/abs/2505.23065
作者: Hongcheng Guo,Zheyong Xie,Shaosheng Cao,Boyang Wang,Weiting Liu,Anjie Le,Lei Li,Zhoujun Li
机构: Beihang University (北京航空航天大学); Xiaohongshu Inc. (小红书公司); Fudan University (复旦大学); University of Oxford (牛津大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:With the increasing integration of visual and textual content in Social Networking Services (SNS), evaluating the multimodal capabilities of Large Language Models (LLMs) is crucial for enhancing user experience, content understanding, and platform intelligence. Existing benchmarks primarily focus on text-centric tasks, lacking coverage of the multimodal contexts prevalent in modern SNS ecosystems. In this paper, we introduce SNS-Bench-VL, a comprehensive multimodal benchmark designed to assess the performance of Vision-Language LLMs in real-world social media scenarios. SNS-Bench-VL incorporates images and text across 8 multimodal tasks, including note comprehension, user engagement analysis, information retrieval, and personalized recommendation. It comprises 4,001 carefully curated multimodal question-answer pairs, covering single-choice, multiple-choice, and open-ended tasks. We evaluate over 25 state-of-the-art multimodal LLMs, analyzing their performance across tasks. Our findings highlight persistent challenges in multimodal social context comprehension. We hope SNS-Bench-VL will inspire future research towards robust, context-aware, and human-aligned multimodal intelligence for next-generation social networking services.
zh
[NLP-87] Self-Correcting Code Generation Using Small Language Models
【速读】: 该论文试图解决小规模语言模型在代码生成过程中缺乏有效自我修正能力的问题(self-correction),即如何使小型模型通过自我反思引导其输出的改进。解决方案的关键在于提出一种名为CoCoS的方法,其核心是引入在线强化学习目标,使模型在多轮修正中自信地保留正确输出并逐步纠正错误输出,同时采用累积奖励函数和细粒度奖励机制,以更好地适应多轮修正场景,从而提升初始响应质量并实现显著的自我修正效果。
链接: https://arxiv.org/abs/2505.23060
作者: Jeonghun Cho,Deokhyung Kang,Hyounghun Kim,Gary Geunbae Lee
机构: Pohang University of Science and Technology (POSTECH)
类目: Computation and Language (cs.CL)
备注:
Abstract:Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement. Recent studies have explored prompting-based strategies that incorporate verification or feedback loops using proprietary models, as well as training-based methods that leverage their strong reasoning capabilities. However, whether smaller models possess the capacity to effectively guide their outputs through self-reflection remains unexplored. Our findings reveal that smaller models struggle to exhibit reflective revision behavior across both self-correction paradigms. In response, we introduce CoCoS, an approach designed to enhance the ability of small language models for multi-turn code correction. Specifically, we propose an online reinforcement learning objective that trains the model to confidently maintain correct outputs while progressively correcting incorrect outputs as turns proceed. Our approach features an accumulated reward function that aggregates rewards across the entire trajectory and a fine-grained reward better suited to multi-turn correction scenarios. This facilitates the model in enhancing initial response quality while achieving substantial improvements through self-correction. With 1B-scale models, CoCoS achieves improvements of 35.8% on the MBPP and 27.7% on HumanEval compared to the baselines.
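CoCoS 累计奖励的直觉——“保持正确加分、成功纠错加分、把对改错重罚”——可用如下玩具函数示意(各奖励数值为演示用假设,并非论文超参数):

```python
def accumulated_reward(turn_correct, keep_bonus=1.0, fix_bonus=2.0, break_penalty=-3.0):
    """turn_correct: 每一轮输出是否正确的布尔序列。
    沿整条轨迹累计: 保持正确 -> 加分; 纠正错误 -> 加分; 把正确改错 -> 扣分。"""
    total = 0.0
    for prev, curr in zip(turn_correct, turn_correct[1:]):
        if prev and curr:
            total += keep_bonus      # 自信地保持了正确输出
        elif not prev and curr:
            total += fix_bonus       # 成功纠正了错误输出
        elif prev and not curr:
            total += break_penalty   # 发生回退
    return total

r = accumulated_reward([False, True, True])  # 先纠错, 再保持正确
```

这种轨迹级聚合使模型既被鼓励提升初始响应质量,也被鼓励在后续轮次中稳定地自我修正。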
zh
[NLP-88] Be.FM: Open Foundation Models for Human Behavior
【速读】: 该论文试图解决基础模型在建模和理解人类行为方面的潜力尚未被充分探索的问题。解决方案的关键在于引入名为Be.FM的开源基础模型,该模型基于开源的大语言模型,并在多样化的行为数据上进行微调,从而能够理解和预测人类决策。此外,研究者构建了一套全面的基准任务以测试行为基础模型的能力,证明该模型可以预测行为、推断个体和群体特征、生成关于情境的见解,并应用行为科学知识。
链接: https://arxiv.org/abs/2505.23058
作者: Yutong Xie,Zhuoheng Li,Xiyuan Wang,Yijun Pan,Qijia Liu,Xingzhi Cui,Kuang-Yu Lo,Ruoyi Gao,Xingjian Zhang,Jin Huang,Walter Yuan,Matthew O. Jackson,Qiaozhu Mei
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:
Abstract:Despite their success in numerous fields, the potential of foundation models for modeling and understanding human behavior remains largely unexplored. We introduce Be.FM, one of the first open foundation models designed for human behavior modeling. Built upon open-source large language models and fine-tuned on a diverse range of behavioral data, Be.FM can be used to understand and predict human decision-making. We construct a comprehensive set of benchmark tasks for testing the capabilities of behavioral foundation models. Our results demonstrate that Be.FM can predict behaviors, infer characteristics of individuals and populations, generate insights about contexts, and apply behavioral science knowledge.
zh
[NLP-89] Query Routing for Retrieval-Augmented Language Models
【速读】: 该论文试图解决在检索增强生成(Retrieval-Augmented Generation, RAG)框架下,不同大型语言模型(Large Language Models, LLMs)响应质量差异导致的模型路由效率问题。现有路由方法依赖静态参数化知识表示,在动态变化的检索文档影响下表现不佳。解决方案的关键在于提出RAGRouter,其通过结合文档嵌入与RAG能力嵌入,并利用对比学习捕捉知识表示的变化,从而实现更精准的路由决策。
链接: https://arxiv.org/abs/2505.23052
作者: Jiarui Zhang,Xiangyu Liu,Yong Hu,Chaoyue Niu,Fan Wu,Guihai Chen
机构: Shanghai Jiao Tong University (上海交通大学); WeChat, Tencent Inc (微信,腾讯公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks. However, varying response quality across LLMs under RAG necessitates intelligent routing mechanisms, which select the most suitable model for each query from multiple retrieval-augmented LLMs via a dedicated router model. We observe that external documents dynamically affect LLMs’ ability to answer queries, while existing routing methods, which rely on static parametric knowledge representations, exhibit suboptimal performance in RAG scenarios. To address this, we formally define the new retrieval-augmented LLM routing problem, incorporating the influence of retrieved documents into the routing framework. We propose RAGRouter, a RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Extensive experiments on diverse knowledge-intensive tasks and retrieval settings show that RAGRouter outperforms the best individual LLM by 3.61% on average and existing routing methods by 3.29%-9.33%. With an extended score-threshold-based mechanism, it also achieves strong performance-efficiency trade-offs under low-latency constraints.
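RAGRouter 的核心是把“查询 + 检索文档”的联合表示与各候选模型的 RAG 能力嵌入做匹配打分;下面是该路由决策的一个高度简化示意(嵌入相加与点积打分均为演示性假设,论文实际使用对比学习训练的编码器):

```python
import numpy as np

def route(query_emb, doc_emb, capability_embs):
    """用 (查询 + 文档) 的联合表示与各模型能力嵌入做点积打分, 取最高者。"""
    ctx = query_emb + doc_emb            # 检索文档会移动知识表示
    scores = [float(ctx @ c) for c in capability_embs]
    return int(np.argmax(scores)), scores

query = np.array([1.0, 0.0])
doc = np.array([0.0, 1.0])               # 文档把表示推向第二个维度
caps = [np.array([1.0, -1.0]), np.array([0.5, 0.5])]
best_model, scores = route(query, doc, caps)
```

示例中检索文档改变了联合表示的方向,使路由结果从默认模型切换到更适配该文档的模型,对应论文强调的“检索文档动态影响路由决策”。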
zh
[NLP-90] DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在剪枝过程中性能显著下降的问题,尤其是在半结构化稀疏性约束下。现有剪枝方法主要关注单个权重的重要性估计,限制了模型关键能力的保留。论文提出的解决方案的关键在于重新分配参数重要性,使模型本身更易于剪枝,通过最小化归一化重要性分数的信息熵,将重要性集中到更小的权重子集上,从而提升剪枝的鲁棒性。
链接: https://arxiv.org/abs/2505.23049
作者: Tianteng Gu,Bei Liu,Bo Xiao,Ke Zeng,Jiacheng Liu,Yanmin Qian
机构: Shanghai Jiao Tong University (上海交通大学); HKUST (香港科技大学); Meituan (美团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Pruning is a widely used technique to compress large language models (LLMs) by removing unimportant weights, but it often suffers from significant performance degradation - especially under semi-structured sparsity constraints. Existing pruning methods primarily focus on estimating the importance of individual weights, which limits their ability to preserve critical capabilities of the model. In this work, we propose a new perspective: rather than merely selecting which weights to prune, we first redistribute parameter importance to make the model inherently more amenable to pruning. By minimizing the information entropy of normalized importance scores, our approach concentrates importance onto a smaller subset of weights, thereby enhancing pruning robustness. We instantiate this idea through DenoiseRotator, which applies learnable orthogonal transformations to the model’s weight matrices. Our method is model-agnostic and can be seamlessly integrated with existing pruning techniques such as Magnitude, SparseGPT, and Wanda. Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. Codes are available at this https URL.
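“最小化归一化重要性分数的信息熵”可直观理解为把重要性集中到少数权重上;下面的小例子演示熵为何随重要性集中而下降(重要性以 |w| 代替,属简化假设,论文中由可学习的正交变换来实现这种集中):

```python
import numpy as np

def importance_entropy(scores):
    """归一化重要性分数的香农熵; 熵越低说明重要性越集中, 模型越利于剪枝。"""
    p = np.abs(scores) / np.abs(scores).sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

uniform = np.ones(8)                  # 重要性均匀分布 (最难剪枝)
peaked = np.array([7.0] + [1.0] * 7)  # 重要性集中在单个权重上
H_uniform = importance_entropy(uniform)
H_peaked = importance_entropy(peaked)
```

均匀分布取得最大熵 ln(8),而集中分布的熵更低,剪掉低重要性权重造成的损失也更小。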
zh
[NLP-91] TailorSQL: An NL2SQL System Tailored to Your Query Workload
【速读】: 该论文试图解决NL2SQL(自然语言到SQL)翻译的准确性与效率问题,特别是在现有数据库中利用历史查询工作负载信息以提升翻译效果。传统方法主要依赖数据库模式等显式信息,而忽略了历史查询中隐含的有用信息,如常见的连接路径以及命名晦涩的表和列的语义。解决方案的关键在于引入TailorSQL系统,该系统通过专门针对特定工作负载进行优化,充分利用历史查询数据中的隐含信息,从而显著提高NL2SQL的准确性和执行效率,实验证明其在标准基准测试中执行准确率提升了高达2倍。
链接: https://arxiv.org/abs/2505.23039
作者: Kapil Vaidya,Jialin Ding,Sebastian Kosak,David Kernert,Chuan Lei,Xiao Qin,Abhinav Tripathy,Ramesh Balan,Balakrishnan Narayanaswamy,Tim Kraska
机构: Amazon Web Services (亚马逊网络服务); Technical University of Munich (慕尼黑工业大学); STACKIT
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注:
Abstract:NL2SQL (natural language to SQL) translates natural language questions into SQL queries, thereby making structured data accessible to non-technical users, serving as the foundation for intelligent data applications. State-of-the-art NL2SQL techniques typically perform translation by retrieving database-specific information, such as the database schema, and invoking a pre-trained large language model (LLM) using the question and retrieved information to generate the SQL query. However, existing NL2SQL techniques miss a key opportunity which is present in real-world settings: NL2SQL is typically applied on existing databases which have already served many SQL queries in the past. The past query workload implicitly contains information which is helpful for accurate NL2SQL translation and is not apparent from the database schema alone, such as common join paths and the semantics of obscurely-named tables and columns. We introduce TailorSQL, a NL2SQL system that takes advantage of information in the past query workload to improve both the accuracy and latency of translating natural language questions into SQL. By specializing to a given workload, TailorSQL achieves up to 2× improvement in execution accuracy on standardized benchmarks.
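TailorSQL 利用历史查询工作负载补充模式信息的思路,可以用一个假设性的提示构造函数示意(基于关键词重叠挑选相似历史 SQL 仅为演示,并非论文实际的检索实现):

```python
def build_prompt(question, schema, past_queries, k=2):
    """示意: 在 NL2SQL 提示中附上与问题关键词重叠最多的 k 条历史 SQL,
    让 LLM 看到模式之外的工作负载线索 (常见连接路径、列的语义)。"""
    q_words = set(question.lower().split())
    scored = sorted(past_queries,
                    key=lambda sql: len(q_words & set(sql.lower().split())),
                    reverse=True)
    hints = "\n".join(scored[:k])
    return (f"Schema:\n{schema}\n\n"
            f"Similar past queries:\n{hints}\n\n"
            f"Question: {question}\nSQL:")

past = [
    "SELECT name FROM users JOIN orders ON users.id = orders.user_id",
    "SELECT count(*) FROM logs",
]
prompt = build_prompt("list users with orders",
                      "users(id, name); orders(id, user_id)", past)
```

与问题相关的历史查询(含 users/orders 的连接路径)被排在提示前列,为生成正确 SQL 提供了模式之外的隐含信息。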
zh
[NLP-92] EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models
【速读】: 该论文试图解决基于大型语言模型(Large Language Models, LLMs)的上下文学习(In-Context Learning, ICL)在命名实体识别(Named Entity Recognition, NER)任务中存在的计算资源消耗高、部署成本大以及数据隐私和社区协作受限的问题。其解决方案的关键在于提出一种集成学习方法(Ensemble Learning Method for Named Entity Recognition, EL4NER),通过聚合多个开源的小参数LLMs的ICL输出,实现更低的部署与推理成本,同时提升NER任务的整体性能。该方法包含三个核心组件:基于任务分解的多阶段集成学习流水线、面向NER任务优化的句级句子相似度算法,以及用于减少集成过程中噪声的自验证机制。
链接: https://arxiv.org/abs/2505.23038
作者: Yuzhen Xiao,Jiahe Song,Yongxin Xu,Ruizhe Zhang,Yiqi Xiao,Xin Lu,Runchuan Zhu,Bowen Jiang,Junfeng Zhao
机构: Peking University (北京大学); High Confidence Software Technologies (高可信软件技术); Nanhu Laboratory (南湖实验室); Beijing University of Aeronautics and Astronautics (北京航空航天大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:In-Context Learning (ICL) technique based on Large Language Models (LLMs) has gained prominence in Named Entity Recognition (NER) tasks for its lower computing resource consumption, less manual labeling overhead, and stronger generalizability. Nevertheless, most ICL-based NER methods depend on large-parameter LLMs: the open-source models demand substantial computational resources for deployment and inference, while the closed-source ones incur high API costs, raise data-privacy concerns, and hinder community collaboration. To address this question, we propose an Ensemble Learning Method for Named Entity Recognition (EL4NER), which aims at aggregating the ICL outputs of multiple open-source, small-parameter LLMs to enhance overall performance in NER tasks at less deployment and inference cost. Specifically, our method comprises three key components. First, we design a task decomposition-based pipeline that facilitates deep, multi-stage ensemble learning. Second, we introduce a novel span-level sentence similarity algorithm to establish an ICL demonstration retrieval mechanism better suited for NER tasks. Third, we incorporate a self-validation mechanism to mitigate the noise introduced during the ensemble process. We evaluated EL4NER on multiple widely adopted NER datasets from diverse domains. Our experimental results indicate that EL4NER surpasses most closed-source, large-parameter LLM-based methods at a lower parameter cost and even attains state-of-the-art (SOTA) performance among ICL-based methods on certain datasets. These results show the parameter efficiency of EL4NER and underscore the feasibility of employing open-source, small-parameter LLMs within the ICL paradigm for NER tasks.
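EL4NER 聚合多个小参数模型 ICL 输出的思想,最简可退化为对 (实体片段, 类型) 预测的投票(以下多数投票仅是示意,论文实际为基于任务分解的多阶段流水线,并带有跨度级相似度检索与自验证机制):

```python
from collections import Counter

def ensemble_ner(predictions, min_votes=2):
    """聚合多个模型的 (实体, 类型) 预测, 只保留获得足够票数的结果。"""
    votes = Counter(p for model_preds in predictions for p in set(model_preds))
    return sorted(span for span, n in votes.items() if n >= min_votes)

# 三个小模型对同一句话的 ICL 输出
preds = [
    [("Paris", "LOC"), ("Marie Curie", "PER")],
    [("Paris", "LOC")],
    [("Paris", "ORG"), ("Marie Curie", "PER")],
]
entities = ensemble_ner(preds)
```

单个模型的误判(如把 Paris 标成 ORG)因票数不足被过滤,体现了集成对噪声的抑制作用。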
zh
[NLP-93] Improving Multilingual Social Media Insights: Aspect-based Comment Analysis
【速读】: 该论文试图解决社交媒体帖子由于语言使用自由、观点和话题多样且不连贯,从而对下游自然语言处理(NLP)任务如评论聚类、评论摘要和社会媒体观点分析带来的挑战。解决方案的关键在于提出在细粒度层面识别和生成评论方面术语(comment aspect term, CAT),以引导模型注意力,具体通过监督微调的多语言大语言模型进行CAT生成,并利用直接偏好优化(DPO)进一步使模型预测与人类期望对齐。
链接: https://arxiv.org/abs/2505.23037
作者: Longyin Zhang,Bowei Zou,Ai Ti Aw
机构: Institute for Infocomm Research, A*STAR, Singapore (信息通信研究院,A*STAR,新加坡)
类目: Computation and Language (cs.CL)
备注: The paper was peer-reviewed
Abstract:The inherent nature of social media posts, characterized by the freedom of language use with a disjointed array of diverse opinions and topics, poses significant challenges to downstream NLP tasks such as comment clustering, comment summarization, and social media opinion analysis. To address this, we propose a granular level of identifying and generating aspect terms from individual comments to guide model attention. Specifically, we leverage multilingual large language models with supervised fine-tuning for comment aspect term generation (CAT-G), further aligning the model’s predictions with human expectations through DPO. We demonstrate the effectiveness of our method in enhancing the comprehension of social media discourse on two NLP tasks. Moreover, this paper contributes the first multilingual CAT-G test set on English, Chinese, Malay, and Bahasa Indonesian. As LLM capabilities vary among languages, this test set allows for a comparative analysis of performance across languages with varying levels of LLM proficiency.
zh
[NLP-94] Machine-Facing English: Defining a Hybrid Register Shaped by Human-AI Discourse
【速读】: 该论文试图解决人类与人工智能(Artificial Intelligence, AI)交互过程中语言适应性变化的问题,具体表现为语法僵化、语用简化和高度明确的表达方式,这些特征虽然提升了机器可解析性,却牺牲了自然流畅性。其解决方案的关键在于通过定性观察和自然语言声明式提示(Natural Language Declarative Prompting, NLD-P)方法,识别出五种典型的语言特征:冗余清晰度、指令性语法、受控词汇、扁平化语调和单一意图结构,这些特征在提高执行准确性的同时也压缩了表达范围。研究强调了在对话界面设计和多语言用户教学中平衡沟通效率与语言丰富性的挑战。
链接: https://arxiv.org/abs/2505.23035
作者: Hyunwoo Kim,Hanau Yi
机构: ddai Inc. (ddai 公司)
类目: Computation and Language (cs.CL)
备注:
Abstract:Machine-Facing English (MFE) is an emergent register shaped by the adaptation of everyday language to the expanding presence of AI interlocutors. Drawing on register theory (Halliday 1985, 2006), enregisterment (Agha 2003), audience design (Bell 1984), and interactional pragmatics (Giles & Ogay 2007), this study traces how sustained human-AI interaction normalizes syntactic rigidity, pragmatic simplification, and hyper-explicit phrasing - features that enhance machine parseability at the expense of natural fluency. Our analysis is grounded in qualitative observations from bilingual (Korean/English) voice- and text-based product testing sessions, with reflexive drafting conducted using Natural Language Declarative Prompting (NLD-P) under human curation. Thematic analysis identifies five recurrent traits - redundant clarity, directive syntax, controlled vocabulary, flattened prosody, and single-intent structuring - that improve execution accuracy but compress expressive range. MFE’s evolution highlights a persistent tension between communicative efficiency and linguistic richness, raising design challenges for conversational interfaces and pedagogical considerations for multilingual users. We conclude by underscoring the need for comprehensive methodological exposition and future empirical validation.
zh
[NLP-95] Can Modern NLP Systems Reliably Annotate Chest Radiography Exams? A Pre-Purchase Evaluation and Comparative Study of Solutions from AWS Google Azure John Snow Labs and Open-Source Models on an Independent Pediatric Dataset
【速读】: 该论文试图解决临床报告自动标注中通用自然语言处理(NLP)工具在特定任务如儿科胸部X光片(CXR)报告标注中的性能评估不足的问题。解决方案的关键在于对比四种商业临床NLP系统(AWS、GC、AZ、SP)以及两个专门的胸部X光片报告标注器(CheXpert和CheXbert)在实体抽取和断言检测方面的表现,并基于一致性伪真实数据进行评估,以揭示各系统在实体数量、断言分布及准确性上的显著差异。
链接: https://arxiv.org/abs/2505.23030
作者: Shruti Hegde,Mabon Manoj Ninan,Jonathan R. Dillman,Shireen Hayatghaibi,Lynn Babcock,Elanchezhian Somasundaram
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:General-purpose clinical natural language processing (NLP) tools are increasingly used for the automatic labeling of clinical reports. However, independent evaluations for specific tasks, such as pediatric chest radiograph (CXR) report labeling, are limited. This study compares four commercial clinical NLP systems - Amazon Comprehend Medical (AWS), Google Healthcare NLP (GC), Azure Clinical NLP (AZ), and SparkNLP (SP) - for entity extraction and assertion detection in pediatric CXR reports. Additionally, CheXpert and CheXbert, two dedicated chest radiograph report labelers, were evaluated on the same task using CheXpert-defined labels. We analyzed 95,008 pediatric CXR reports from a large academic pediatric hospital. Entities and assertion statuses (positive, negative, uncertain) from the findings and impression sections were extracted by the NLP systems, with impression section entities mapped to 12 disease categories and a No Findings category. CheXpert and CheXbert extracted the same 13 categories. Outputs were compared using Fleiss Kappa and accuracy against a consensus pseudo-ground truth. Significant differences were found in the number of extracted entities and assertion distributions across NLP systems. SP extracted 49,688 unique entities, GC 16,477, AZ 31,543, and AWS 27,216. Assertion accuracy across models averaged around 62%, with SP highest (76%) and AWS lowest (50%). CheXpert and CheXbert achieved 56% accuracy. Considerable variability in performance highlights the need for careful validation and review before deploying NLP tools for clinical report labeling.
zh
[NLP-96] Uncovering Visual-Semantic Psycholinguistic Properties from the Distributional Structure of Text Embedding Spac ACL2025
【速读】: 该论文试图解决如何准确估计文本的图像可表征性(imageability)和具体性(concreteness)的问题,传统方法通常依赖于并行的视觉和语义空间,如图像-标题对或跨模态模型。论文的关键解决方案是提出一种无监督、分布无关的度量方法,称为邻域稳定性度量(Neighborhood Stability Measure, NSM),该方法基于词在语义嵌入空间中的邻域尖锐程度来反映其图像可表征性和具体性。实验表明,NSM相较于现有无监督方法与真实评分具有更强的相关性,并且在分类任务中表现出良好的预测能力。
链接: https://arxiv.org/abs/2505.23029
作者: Si Wu,Sebastian Bruch
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注: Accepted for ACL 2025. This is the camera-ready version. Will be presenting in July 2025 in Vienna
Abstract:Imageability (potential of text to evoke a mental image) and concreteness (perceptibility of text) are two psycholinguistic properties that link visual and semantic spaces. It is little surprise that computational methods that estimate them do so using parallel visual and semantic spaces, such as collections of image-caption pairs or multi-modal models. In this paper, we work on the supposition that text itself in an image-caption dataset offers sufficient signals to accurately estimate these properties. We hypothesize, in particular, that the peakedness of the neighborhood of a word in the semantic embedding space reflects its degree of imageability and concreteness. We then propose an unsupervised, distribution-free measure, which we call Neighborhood Stability Measure (NSM), that quantifies the sharpness of peaks. Extensive experiments show that NSM correlates more strongly with ground-truth ratings than existing unsupervised methods, and is a strong predictor of these properties for classification. Our code and data are available on GitHub (this https URL).
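NSM 所度量的“邻域尖锐程度”可以用词向量与其近邻余弦相似度的均值来直观示意(以下为概念性简化,并非论文中 NSM 的精确定义):

```python
import numpy as np

def nsm(word_vec, neighbor_vecs, k=2):
    """词向量与其 k 个最近邻的平均余弦相似度, 作为邻域尖锐度的示意。"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = sorted((cos(word_vec, np.asarray(v)) for v in neighbor_vecs),
                  reverse=True)
    return float(np.mean(sims[:k]))

w = np.array([1.0, 0.0])
tight = [[1.0, 0.01], [1.0, -0.01], [0.99, 0.0], [1.01, 0.0]]    # 尖锐邻域 (偏具体词)
scattered = [[0.0, 1.0], [-1.0, 0.0], [0.7, 0.7], [0.0, -1.0]]   # 松散邻域 (偏抽象词)
s_tight = nsm(w, tight)
s_scattered = nsm(w, scattered)
```

紧凑邻域得到接近 1 的分数,松散邻域分数明显更低,对应高/低图像可表征性与具体性的假设。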
zh
[NLP-97] Context Robust Knowledge Editing for Language Models ACL2025
【速读】: 该论文旨在解决知识编辑(Knowledge Editing, KE)方法在实际应用中因前置上下文导致编辑效果失效的问题。现有KE评估方法仅关注编辑后的知识而忽略前置上下文的影响,然而在真实场景中,前置上下文可能触发原始知识的检索,从而破坏编辑效果。为应对这一问题,作者提出了CHED基准,用于评估KE方法在存在前置上下文时的鲁棒性。解决方案的关键在于提出CoRE方法,通过最小化模型隐藏状态中与上下文相关的方差,增强KE方法对前置上下文的鲁棒性,从而在保持模型整体能力的同时提升编辑成功率。
链接: https://arxiv.org/abs/2505.23026
作者: Haewon Park,Gyubin Choi,Minjun Kim,Yohan Jo
机构: Seoul National University (首尔国立大学); Gwangju Institute of Science and Technology (光州科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 Findings. Our code and datasets are available at ( this https URL )
Abstract:Knowledge editing (KE) methods offer an efficient way to modify knowledge in large language models. Current KE evaluations typically assess editing success by considering only the edited knowledge without any preceding contexts. In real-world applications, however, preceding contexts often trigger the retrieval of the original knowledge and undermine the intended edit. To address this issue, we develop CHED – a benchmark designed to evaluate the context robustness of KE methods. Evaluations on CHED show that they often fail when preceding contexts are present. To mitigate this shortcoming, we introduce CoRE, a KE method designed to strengthen context robustness by minimizing context-sensitive variance in hidden states of the model for edited knowledge. This method not only improves the editing success rate in situations where a preceding context is present but also preserves the overall capabilities of the model. We provide an in-depth analysis of the differing impacts of preceding contexts when introduced as user utterances versus assistant responses, and we dissect attention-score patterns to assess how specific tokens influence editing success.
zh
[NLP-98] AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models ACL2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在具备代理能力(agentic capabilities)后,由于其执行动作的能力增强而面临的安全隐患问题,特别是在后训练阶段代理使用安全对齐(safety alignment)的不足。解决方案的关键在于提出AgentAlign框架,该框架通过抽象行为链(abstract behavior chains)作为安全对齐数据合成的媒介,在多样化工具实例的模拟环境中实例化这些行为链,从而生成高度真实且可执行的指令,同时捕捉复杂的多步骤动态,并通过非恶意解释的行为链比例合成良性指令,精确校准有益性与无害性的边界。
链接: https://arxiv.org/abs/2505.23020
作者: Jinchuan Zhang,Lu Yin,Yan Zhou,Songlin Hu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to ACL 2025
Abstract:The acquisition of agentic capabilities has transformed LLMs from “knowledge providers” to “action executors”, a trend that while expanding LLMs’ capability boundaries, significantly increases their susceptibility to malicious use. Previous work has shown that current LLM-based agents execute numerous malicious tasks even without being attacked, indicating a deficiency in agentic use safety alignment during the post-training phase. To address this gap, we propose AgentAlign, a novel framework that leverages abstract behavior chains as a medium for safety alignment data synthesis. By instantiating these behavior chains in simulated environments with diverse tool instances, our framework enables the generation of highly authentic and executable instructions while capturing complex multi-step dynamics. The framework further ensures model utility by proportionally synthesizing benign instructions through non-malicious interpretations of behavior chains, precisely calibrating the boundary between helpfulness and harmlessness. Evaluation results on AgentHarm demonstrate that fine-tuning three families of open-source models using our method substantially improves their safety (35.8% to 79.5% improvement) while minimally impacting or even positively enhancing their helpfulness, outperforming various prompting methods. The dataset and code have both been open-sourced.
zh
[NLP-99] Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models
【速读】: 该论文旨在解决在微调大语言模型(Large Language Models, LLMs)时,由于数据集中包含隐蔽后门(stealthy backdoors)而带来的安全风险问题。现有检测方法要么仅适用于分类任务,要么可能降低生成性能并引入新触发器,因此无法高效消除隐蔽的污染样本。论文提出的解决方案关键在于利用参考模型过滤(Reference-Filtration)与TF-IDF聚类(Tfidf-Clustering)机制(RFTC),通过比较样本响应与参考模型输出的差异来识别可疑样本,并基于类内距离进一步确定真正的污染样本,从而有效提升后门检测效果与模型性能。
链接: https://arxiv.org/abs/2505.23015
作者: Jinwen Chen,Hainan Zhang,Fei Sun,Qinnan Zhang,Sijia Wen,Ziwei Wang,Zhiming Zheng
机构: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing (北京先进创新中心未来区块链与隐私计算); School of Artificial Intelligence, Beihang University, China (人工智能学院,北京航空航天大学,中国); Institute of Computing Technology, Chinese Academy of Sciences (计算技术研究所,中国科学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Fine-tuning LLMs with datasets containing stealthy backdoors from publishers poses security risks to downstream applications. Mainstream detection methods either identify poisoned samples by analyzing the prediction probability of poisoned classification models or rely on the rewriting model to eliminate the stealthy triggers. However, the former cannot be applied to generation tasks, while the latter may degrade generation performance and introduce new triggers. Therefore, efficiently eliminating stealthy poisoned samples for LLMs remains an urgent problem. We observe that after applying TF-IDF clustering to the sample response, there are notable differences in the intra-class distances between clean and poisoned samples. Poisoned samples tend to cluster closely because of their specific malicious outputs, whereas clean samples are more scattered due to their more varied responses. Thus, in this paper, we propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms (RFTC). Specifically, we first compare the sample response with the reference model’s outputs and consider the sample suspicious if there’s a significant discrepancy. And then we perform TF-IDF clustering on these suspicious samples to identify the true poisoned samples based on the intra-class distance. Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in backdoor detection and model performance. Further analysis of different reference models also confirms the effectiveness of our Reference-Filtration.
zh
[NLP-100] Synthetic Document Question Answering in Hungarian
【速读】: 该论文旨在解决低资源语言在文档视觉问答(DocVQA)任务中的性能瓶颈问题,特别是在匈牙利语等资源较少的语言中,由于缺乏足够的训练和评估数据,现代视觉语言模型(VLMs)表现显著不足。其解决方案的关键在于构建高质量的匈牙利语文档VQA数据集,包括通过人工筛选的HuDocVQA-manual和基于Common Crawl的合成生成数据集HuDocVQA,并通过多轮质量过滤与去重提升数据集质量,同时提供用于训练匈牙利语OCR的HuCCPDF数据集,以支持多语言DocVQA的研究。
链接: https://arxiv.org/abs/2505.23008
作者: Jonathan Li,Zoltan Csaki,Nidhi Hiremath,Etash Guha,Fenglu Hong,Edward Ma,Urmish Thakker
机构: SambaNova Systems, Inc. (SambaNova系统公司); University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Modern VLMs have achieved near-saturation accuracy in English document visual question-answering (VQA). However, this task remains challenging in lower resource languages due to a dearth of suitable training and evaluation data. In this paper we present scalable methods for curating such datasets by focusing on Hungarian, approximately the 17th highest resource language on the internet. Specifically, we present HuDocVQA and HuDocVQA-manual, document VQA datasets that modern VLMs significantly underperform on compared to English DocVQA. HuDocVQA-manual is a small manually curated dataset based on Hungarian documents from Common Crawl, while HuDocVQA is a larger synthetically generated VQA data set from the same source. We apply multiple rounds of quality filtering and deduplication to HuDocVQA in order to match human-level quality in this dataset. We also present HuCCPDF, a dataset of 117k pages from Hungarian Common Crawl PDFs along with their transcriptions, which can be used for training a model for Hungarian OCR. To validate the quality of our datasets, we show how finetuning on a mixture of these datasets can improve accuracy on HuDocVQA for Llama 3.2 11B Instruct by +7.2%. Our datasets and code will be released to the public to foster further research in multilingual DocVQA.
zh
[NLP-101] A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs ACL2025
【速读】: 该论文旨在解决将最先进的(SOTA)大型语言模型(Large Language Models, LLMs)研究成果应用于工业场景中的挑战,特别是在保持灵活对话能力的同时严格遵守服务特定约束的问题。由于LLMs的随机性本质,这两者构成一对相互冲突的需求。论文的解决方案关键在于采用基于工作流图(workflow graph)的代理构建方法,并结合电商领域对话代理的实现流程与优化策略,克服LLMs在实际应用中的固有局限性,从而实现可扩展、可控且可靠的AI驱动对话代理的开发。
链接: https://arxiv.org/abs/2505.23006
作者: Chiwan Park,Wonjun Jang,Daeryong Kim,Aelim Ahn,Kichang Yang,Woosung Hwang,Jihyeon Roh,Hyerin Park,Hyosun Wang,Min Seok Kim,Jihoon Kang
机构: Kakao
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 Industry Track. 12 pages, 5 figures
Abstract:The advancement of Large Language Models (LLMs) has led to significant improvements in various service domains, including search, recommendation, and chatbot applications. However, applying state-of-the-art (SOTA) research to industrial settings presents challenges, as it requires maintaining flexible conversational abilities while also strictly complying with service-specific constraints. This can be seen as two conflicting requirements due to the probabilistic nature of LLMs. In this paper, we propose our approach to addressing this challenge and detail the strategies we employed to overcome their inherent limitations in real-world applications. We conduct a practical case study of a conversational agent designed for the e-commerce domain, detailing our implementation workflow and optimizations. Our findings provide insights into bridging the gap between academic research and real-world application, introducing a framework for developing scalable, controllable, and reliable AI-driven agents.
zh
[NLP-102] DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors
【速读】: 该论文试图解决开放基准测试集在评估和推进大语言模型过程中可能遭遇的测试集污染问题,即模型在训练过程中可能使用了基准测试集,从而导致评估结果不可靠。解决方案的关键在于提出DyePack框架,该框架利用后门攻击技术,在不访问模型损失函数、输出logits或任何内部细节的情况下,通过在测试数据中混合后门样本以标记那些在训练中使用过测试集的模型,其核心设计包括多个具有随机目标的后门机制,从而实现精确的假阳性率计算,确保不会误判的同时为每个检测到的污染案例提供有力证据。
链接: https://arxiv.org/abs/2505.23001
作者: Yize Cheng,Wenxiao Wang,Mazda Moayeri,Soheil Feizi
机构: University of Maryland (马里兰大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.
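当后门目标为随机选取时,干净模型「碰巧」命中目标的概率可以精确计算。下面按二项分布尾概率给出一个示意实现(假设各后门相互独立、目标空间大小相同,属于对论文思路的演示,具体设定未必与原文一致):

```python
from math import comb

def exact_fpr(num_backdoors: int, num_targets: int, threshold: int) -> float:
    """若模型命中 >= threshold 个后门目标即被标记为污染,
    干净模型每个后门随机命中的概率为 1/num_targets,
    则误报率(FPR)为二项分布的尾概率。"""
    p = 1.0 / num_targets
    return sum(comb(num_backdoors, i) * p**i * (1 - p)**(num_backdoors - i)
               for i in range(threshold, num_backdoors + 1))

# 例:8 个后门、每个有 4 个随机目标、要求全部命中时
fpr = exact_fpr(8, 4, 8)  # 即 (1/4)^8,约 1.5e-05
```

目标空间越大、要求命中的后门数越多,可保证的误报率就越低,这正是「可证明不冤枉模型」的来源。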
zh
[NLP-103] Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation NAACL2025
【速读】: 该论文旨在解决事实验证(fact verification)任务中的挑战,特别是针对现实世界中复杂且语义结构复杂的声明(claim)进行准确且可解释的验证。传统方法通过将声明分解为子声明并查询知识库来处理隐藏或模糊实体,但缺乏有效的实体消歧策略会损害整个验证过程。该论文提出的解决方案关键在于提出一种名为VeGraph的框架,其核心是利用大语言模型(LLM)代理的推理与理解能力,通过三个阶段实现:图表示、实体消歧和验证,从而有效提升声明验证的准确性与可解释性。
链接: https://arxiv.org/abs/2505.22993
作者: Hoang Pham,Thanh-Do Nguyen,Khac-Hoai Nam Bui
机构: Viettel Artificial Intelligence and Data Services Center, Viettel Group (越南电信人工智能与数据服务中心,越南电信集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
备注: Published at NAACL 2025 Main Conference
Abstract:Claim verification is a long-standing and challenging task that demands not only high accuracy but also explainability of the verification process. This task becomes an emerging research issue in the era of large language models (LLMs) since real-world claims are often complex, featuring intricate semantic structures or obfuscated entities. Traditional approaches typically address this by decomposing claims into sub-claims and querying a knowledge base to resolve hidden or ambiguous entities. However, the absence of effective disambiguation strategies for these entities can compromise the entire verification process. To address these challenges, we propose Verify-in-the-Graph (VeGraph), a novel framework leveraging the reasoning and comprehension abilities of LLM agents. VeGraph operates in three phases: (1) Graph Representation - an input claim is decomposed into structured triplets, forming a graph-based representation that integrates both structured and unstructured information; (2) Entity Disambiguation -VeGraph iteratively interacts with the knowledge base to resolve ambiguous entities within the graph for deeper sub-claim verification; and (3) Verification - remaining triplets are verified to complete the fact-checking process. Experiments using Meta-Llama-3-70B (instruct version) show that VeGraph achieves competitive performance compared to baselines on two benchmarks HoVer and FEVEROUS, effectively addressing claim verification challenges. Our source code and data are available for further exploitation.
zh
[NLP-104] Exploring Scaling Laws for EHR Foundation Models
【速读】: 该论文试图解决在电子健康记录(Electronic Health Records, EHRs)领域中,如何通过系统性地增加模型规模、数据量和计算资源来提升模型性能的问题,这一问题在大型语言模型(Large Language Models, LLMs)中已被Scaling Laws所揭示,但在EHR场景下尚未被深入研究。解决方案的关键在于通过对MIMIC-IV数据库中的患者时间序列数据进行Transformer架构的训练,验证EHR基础模型存在与LLMs类似的缩放规律,包括IsoFLOPs曲线的抛物线形态以及计算资源、模型参数、数据规模与临床效用之间的幂律关系,从而为资源高效的EHR模型训练提供理论依据和实践指导。
链接: https://arxiv.org/abs/2505.22964
作者: Sheng Zhang,Qin Liu,Naoto Usuyama,Cliff Wong,Tristan Naumann,Hoifung Poon
机构: Microsoft Research (微软研究院); University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The emergence of scaling laws has profoundly shaped the development of large language models (LLMs), enabling predictable performance gains through systematic increases in model size, dataset volume, and compute. Yet, these principles remain largely unexplored in the context of electronic health records (EHRs) – a rich, sequential, and globally abundant data source that differs structurally from natural language. In this work, we present the first empirical investigation of scaling laws for EHR foundation models. By training transformer architectures on patient timeline data from the MIMIC-IV database across varying model sizes and compute budgets, we identify consistent scaling patterns, including parabolic IsoFLOPs curves and power-law relationships between compute, model parameters, data size, and clinical utility. These findings demonstrate that EHR models exhibit scaling behavior analogous to LLMs, offering predictive insights into resource-efficient training strategies. Our results lay the groundwork for developing powerful EHR foundation models capable of transforming clinical prediction tasks and advancing personalized healthcare.
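幂律关系 L = a·C^(-α) 通常在对数空间用线性回归拟合。下面用人工构造的数据给出一个最小示意(数据为演示而合成,并非 MIMIC-IV 上的真实结果):

```python
import numpy as np

# 按幂律 L = a * C^(-alpha) 生成的合成数据(a=5.0, alpha=0.05)
compute = np.array([1e18, 1e19, 1e20, 1e21])  # 训练计算量(FLOPs)
loss = 5.0 * compute ** -0.05                  # 对应的损失

# 对数空间线性回归:log L = log a - alpha * log C
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha = -slope          # 恢复幂律指数,约 0.05
a = np.exp(intercept)   # 恢复系数,约 5.0
```

论文中的 IsoFLOPs 分析即是在固定计算量下扫描模型规模,再对各预算的最优点拟合此类幂律。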
zh
[NLP-105] ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在说服任务中缺乏对对手心理状态的动态建模能力,导致说服策略单一、多样性不足的问题。解决方案的关键在于引入了理论心智增强的说服者(Theory of Mind Augmented Persuader, ToMAP),通过集成两个理论心智模块,提升说服者对对手心理状态的感知与分析能力。具体而言,ToMAP首先引导说服者考虑目标核心主张可能面临的反对意见,并利用文本编码器与训练好的多层感知机分类器预测对手对这些反论点的立场,结合精心设计的强化学习框架,使说服者能够有效利用对手相关信息生成更具说服力的论点。
链接: https://arxiv.org/abs/2505.22961
作者: Peixuan Han,Zijia Liu,Jiaxuan You
机构: Siebel School of Computing and Data Science (塞贝尔计算与数据科学学院); University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent’s thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader’s awareness and analysis of the opponent’s mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent’s current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader learns how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method’s effectiveness and highlight its potential for developing more persuasive language agents. Code is available at: this https URL.
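论文中「文本编码器 + MLP 分类器预测对手立场」这一模块可以用如下最小示意来理解(假设性实现:原文使用神经文本编码器,此处为保持自包含改用 TF-IDF 代替,数据与标签均为演示而设):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# 训练数据:说服对象针对某反论点的发言 -> 立场标签(1=支持该反论点, 0=反对)
texts = ["I completely agree with this objection",
         "this counterclaim makes a lot of sense",
         "I strongly disagree with that point",
         "that objection is clearly wrong"]
labels = [1, 1, 0, 0]

encoder = TfidfVectorizer()            # 代替论文中的神经文本编码器
X = encoder.fit_transform(texts)
clf = MLPClassifier(hidden_layer_sizes=(16,), solver="lbfgs",
                    max_iter=1000, random_state=0).fit(X, labels)

# 说服者据此预测对手当前对各反论点的立场,再选择针对性论据
pred = clf.predict(encoder.transform(["I agree with this objection"]))
```

在 ToMAP 的流程中,该立场预测结果会作为额外输入提供给说服者模型,并通过强化学习学会如何利用它。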
zh
[NLP-106] LLM-based HSE Compliance Assessment: Benchmark Performance and Advancements
【速读】: 该论文旨在解决健康、安全与环境(HSE)合规评估中动态实时决策的难题,特别是在复杂法规和人-机-环境交互背景下,大型语言模型(LLM)在领域特定知识和结构化法律推理方面的不足。其解决方案的关键在于提出HSE-Bench基准数据集,该数据集包含超过1000个手动整理的问题,并引入基于Issue spotting, rule Recall, rule Application, and rule Conclusion (IRAC)的推理流程,以评估LLM的全面推理能力。此外,研究还提出了一种新的提示技术——专家推理(Reasoning of Expert, RoE),旨在引导LLM模拟不同专家的推理过程,从而提升合规评估的准确性。
链接: https://arxiv.org/abs/2505.22959
作者: Jianwei Wang,Mengqi Wang,Yinsi Zhou,Zhenchang Xing,Qing Liu,Xiwei Xu,Wenjie Zhang,Liming Zhu
机构: University of New South Wales(新南威尔士大学); Data61, CSIRO(数据61,澳大利亚联邦科学与工业研究组织)
类目: Computation and Language (cs.CL)
备注:
Abstract:Health, Safety, and Environment (HSE) compliance assessment demands dynamic real-time decision-making under complicated regulations and complex human-machine-environment interactions. While large language models (LLMs) hold significant potential for decision intelligence and contextual dialogue, their capacity for domain-specific knowledge in HSE and structured legal reasoning remains underexplored. We introduce HSE-Bench, the first benchmark dataset designed to evaluate the HSE compliance assessment capabilities of LLM. HSE-Bench comprises over 1,000 manually curated questions drawn from regulations, court cases, safety exams, and fieldwork videos, and integrates a reasoning flow based on Issue spotting, rule Recall, rule Application, and rule Conclusion (IRAC) to assess the holistic reasoning pipeline. We conduct extensive evaluations on different prompting strategies and more than 10 LLMs, including foundation models, reasoning models and multimodal vision models. The results show that, although current LLMs achieve good performance, their capabilities largely rely on semantic matching rather than principled reasoning grounded in the underlying HSE compliance context. Moreover, their native reasoning trace lacks the systematic legal reasoning required for rigorous HSE compliance assessment. To alleviate these, we propose a new prompting technique, Reasoning of Expert (RoE), which guides LLMs to simulate the reasoning process of different experts for compliance assessment and reach a more accurate unified decision. We hope our study highlights reasoning gaps in LLMs for HSE compliance and inspires further research on related tasks.
zh
[NLP-107] LLMs for Argument Mining: Detection, Extraction, and Relationship Classification of pre-defined Arguments in Online Comments
【速读】: 该论文试图解决在涉及争议性问题的公共讨论中,自动检测和理解论点的问题(argument mining)。其解决方案的关键在于评估四种先进的大型语言模型(Large Language Models, LLMs)在三个论点挖掘任务中的表现,以探索其在在线评论中识别特定主题预定义论点的能力。研究结果表明,尽管大型且微调的LLMs在整体任务上表现出色,但在处理长篇、复杂及情绪化内容时仍存在系统性不足。
链接: https://arxiv.org/abs/2505.22956
作者: Matteo Guida,Yulia Otmakhova,Eduard Hovy,Lea Frermann
机构: The University of Melbourne (墨尔本大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Automated large-scale analysis of public discussions around contested issues like abortion requires detecting and understanding the use of arguments. While Large Language Models (LLMs) have shown promise in language processing tasks, their performance in mining topic-specific, pre-defined arguments in online comments remains underexplored. We evaluate four state-of-the-art LLMs on three argument mining tasks using datasets comprising over 2,000 opinion comments across six polarizing topics. Quantitative evaluation suggests an overall strong performance across the three tasks, especially for large and fine-tuned LLMs, albeit at a significant environmental cost. However, a detailed error analysis revealed systematic shortcomings on long and nuanced comments and emotionally charged language, raising concerns for downstream applications like content moderation or opinion analysis. Our results highlight both the promise and current limitations of LLMs for automated argument analysis in online comments.
zh
[NLP-108] StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在零样本摘要生成任务中对文档结构建模不足以及难以识别长文本中关键信息的问题。其解决方案的关键在于提出StrucSum框架,该框架通过引入句级图结构来增强LLM的推理能力,具体包括三种策略:针对局部上下文的邻近感知提示(Neighbor-Aware Prompting, NAP)、用于重要性估计的中心性感知提示(Centrality-Aware Prompting, CAP)以及用于高效输入缩减的中心性引导掩码(Centrality-Guided Masking, CGM)。
链接: https://arxiv.org/abs/2505.22950
作者: Haohan Yuan,Sukhwa Hong,Haopeng Zhang
机构: ALOHA Lab, University of Hawaii at Manoa (ALOHA 实验室,夏威夷大学马诺阿分校); University of Hawaii at Hilo (夏威夷大学希洛分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have shown strong performance in zero-shot summarization, but often struggle to model document structure and identify salient information in long texts. In this work, we introduce StrucSum, a training-free prompting framework that enhances LLM reasoning through sentence-level graph structures. StrucSum injects structural signals into prompts via three targeted strategies: Neighbor-Aware Prompting (NAP) for local context, Centrality-Aware Prompting (CAP) for importance estimation, and Centrality-Guided Masking (CGM) for efficient input reduction. Experiments on ArXiv, PubMed, and Multi-News demonstrate that StrucSum consistently improves both summary quality and factual consistency over unsupervised baselines and vanilla prompting. Notably, on ArXiv, it boosts FactCC and SummaC by 19.2 and 9.7 points, indicating stronger alignment between summaries and source content. These findings suggest that structure-aware prompting is a simple yet effective approach for zero-shot extractive summarization with LLMs, without any training or task-specific tuning.
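StrucSum 中用于重要性估计(CAP)与输入缩减(CGM)的句级中心性,可用如下最小示意理解(假设性实现:以 TF-IDF 余弦相似度建图、以度中心性计分,阈值为演示而设,并非论文的具体配置):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def centrality_scores(sentences, threshold=0.15):
    """句级图:相似度超过阈值则建边,返回每句的度中心性。"""
    X = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(X)
    np.fill_diagonal(sim, 0.0)          # 不与自身建边
    adj = (sim > threshold).astype(float)
    return adj.sum(axis=1)              # 度中心性作为重要性分数

sents = ["the model improves summarization quality",
         "summarization quality is measured by factcc",
         "the weather was sunny yesterday",
         "graph structure helps the model summarization"]
scores = centrality_scores(sents)
# 与其它句子几乎无关联的句子中心性最低,可在 CGM 中被掩码掉
least_central = int(np.argmin(scores))
```

在提示中注入这类结构信号,即可让 LLM 在零样本条件下更关注结构上重要的句子。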
zh
[NLP-109] NegVQA: Can Vision Language Models Understand Negation? ACL2025
【速读】: 该论文试图解决视觉语言模型(Vision Language Models, VLMs)在理解否定(negation)方面的不足问题,尤其是在高风险应用场景中,模型对否定语义的准确理解至关重要。解决方案的关键在于构建一个名为NegVQA的视觉问答(VQA)基准,该基准包含7,379个两选一的问题,涵盖了多种否定场景和图像-问题分布。通过利用大语言模型从现有VQA数据集中生成否定版本的问题,NegVQA为评估VLMs在否定理解方面的能力提供了标准化测试平台。
链接: https://arxiv.org/abs/2505.22946
作者: Yuhui Zhang,Yuchang Su,Yiming Liu,Serena Yeung-Levy
机构: Stanford University (斯坦福大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Published at ACL 2025 Findings
Abstract:Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs’ negation understanding and offers insights into future VLM development. Project page available at this https URL.
zh
[NLP-110] OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在非英语语言中的记忆能力和跨语言迁移能力的问题,具体是探究模型是否能够在翻译后的文本中回忆起原始语言中的内容。解决方案的关键在于构建了一个多语言对齐数据集OWL,包含20本书的31.5K个片段,涵盖英语原文、官方翻译和六种低资源语言的新翻译,并通过三种任务(直接探测、姓名填空和前缀探测)评估不同模型家族和规模的记忆能力,从而揭示跨语言记忆的广度和模型间的差异。
链接: https://arxiv.org/abs/2505.22945
作者: Alisha Srivastava,Emir Korukluoglu,Minh Nhat Le,Duyen Tran,Chau Minh Pham,Marzena Karpinska,Mohit Iyyer
机构: UMass Amherst (马萨诸塞大学阿默斯特分校); University of Maryland, College Park (马里兰大学学院公园分校); Microsoft (微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: preprint, 25 pages
Abstract:Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented in translation. To do so, we introduce OWL, a dataset of 31.5K aligned excerpts from 20 books in ten languages, including English originals, official translations (Vietnamese, Spanish, Turkish), and new translations in six low-resource languages (Sesotho, Yoruba, Maithili, Malagasy, Setswana, Tahitian). We evaluate memorization across model families and sizes through three tasks: (1) direct probing, which asks the model to identify a book’s title and author; (2) name cloze, which requires predicting masked character names; and (3) prefix probing, which involves generating continuations. We find that LLMs consistently recall content across languages, even for texts without direct translation in pretraining data. GPT-4o, for example, identifies authors and titles 69% of the time and masked entities 6% of the time in newly translated excerpts. Perturbations (e.g., masking characters, shuffling words) modestly reduce direct probing accuracy (7% drop for shuffled official translations). Our results highlight the extent of cross-lingual memorization and provide insights on the differences between the models.
zh
[NLP-111] Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates ACL2025 KR
【速读】: 该论文试图解决预训练多模态表示(如CLIP)在组合性方面存在的脆弱性问题,这些脆弱性导致了非直观的判断。解决方案的关键在于提出Multimodal Adversarial Compositionality (MAC)基准,利用大语言模型(LLM)生成具有欺骗性的文本样本,以在不同模态中探测并评估这些脆弱性,并通过样本级攻击成功率和群体级基于熵的多样性进行评价。此外,为提升零样本方法的效果,研究还引入了一种自训练方法,通过排斥采样微调结合促进多样性的过滤机制,从而提高攻击成功率和样本多样性。
链接: https://arxiv.org/abs/2505.22943
作者: Jaewoo Ahn,Heeseung Yun,Dayoon Ko,Gunhee Kim
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
备注: ACL 2025 Main. Code is released at this https URL
Abstract:While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audios.
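摘要提到的「群体级基于熵的多样性」可以用如下最小示意理解(假设性实现:以攻击文本的词分布香农熵作为多样性度量,这只是该指标的一种可能实例化,论文的具体定义可能不同):

```python
import math
from collections import Counter

def entropy_diversity(samples):
    """对一组攻击文本的词分布计算香农熵(bit),熵越高多样性越好。"""
    tokens = [t for s in samples for t in s.lower().split()]
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# 重复的攻击样本熵低,多样化的攻击样本熵高
low = entropy_diversity(["attack now", "attack now", "attack now"])
high = entropy_diversity(["attack now", "strike later", "breach today"])
```

论文同时用样本级攻击成功率与此类群体级多样性来评价攻击,单看成功率会奖励大量重复的平凡攻击。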
zh
[NLP-112] WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning
【速读】: 该论文旨在解决现有基于监督微调(SFT)的网络代理在处理企业环境中动态网页交互任务时,因推理能力不足而导致的泛化性和鲁棒性差的问题。其解决方案的关键在于引入WorkForceAgent-R1,该代理基于一种规则驱动的R1风格强化学习框架进行训练,专注于提升单步推理与规划能力,并通过结构化的奖励函数评估输出格式的规范性和操作的正确性,从而隐式学习到稳健的中间推理过程,而无需依赖显式标注或大量专家示范。
链接: https://arxiv.org/abs/2505.22942
作者: Yuchen Zhuang,Di Jin,Jiaao Chen,Wenqi Shi,Hanrui Wang,Chao Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in Progress
Abstract:Large language models (LLMs)-empowered web agents enables automating complex, real-time web navigation tasks in enterprise environments. However, existing web agents relying on supervised fine-tuning (SFT) often struggle with generalization and robustness due to insufficient reasoning capabilities when handling the inherently dynamic nature of web interactions. In this study, we introduce WorkForceAgent-R1, an LLM-based web agent trained using a rule-based R1-style reinforcement learning framework designed explicitly to enhance single-step reasoning and planning for business-oriented web navigation tasks. We employ a structured reward function that evaluates both adherence to output formats and correctness of actions, enabling WorkForceAgent-R1 to implicitly learn robust intermediate reasoning without explicit annotations or extensive expert demonstrations. Extensive experiments on the WorkArena benchmark demonstrate that WorkForceAgent-R1 substantially outperforms SFT baselines by 10.26-16.59%, achieving competitive performance relative to proprietary LLM-based agents (gpt-4o) in workplace-oriented web navigation tasks.
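「同时评估输出格式规范性与动作正确性」的规则奖励函数,可用如下最小示意理解(假设性实现:其中的标签约定、JSON 动作格式与奖励权重均为演示而设,并非论文的实际设定):

```python
import json
import re

def rule_based_reward(response: str, gold_action: dict) -> float:
    """R1 风格规则奖励:先检查 <think>/<action> 输出格式,再比对动作。"""
    fmt = re.search(r"<think>.*</think>\s*<action>(.*)</action>", response, re.S)
    if not fmt:
        return -1.0                 # 不符合输出格式,直接惩罚
    try:
        action = json.loads(fmt.group(1))
    except json.JSONDecodeError:
        return -0.5                 # 格式正确但动作不可解析
    return 1.0 if action == gold_action else 0.1  # 按动作正确性给分

good = ('<think>need to click submit</think>'
        '<action>{"type": "click", "id": "submit"}</action>')
reward = rule_based_reward(good, {"type": "click", "id": "submit"})
```

这类可程序化验证的奖励使模型无需显式标注即可隐式学到稳健的中间推理。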
zh
[NLP-113] Improving QA Efficiency with DistilBERT: Fine-Tuning and Inference on mobile Intel CPUs
【速读】: 该论文旨在解决在资源受限系统上高效部署基于Transformer的问答(Question Answering, QA)模型的问题,特别是在第13代Intel i7-1355U CPU上的实时应用。其解决方案的关键在于通过探索性数据分析、数据增强和DistilBERT架构的微调,实现模型在保持较高准确率的同时降低计算开销,最终在验证集上达到0.6536的F1分数,并将单次推理时间控制在0.1208秒以内,从而在准确性与计算效率之间取得了良好的平衡。
链接: https://arxiv.org/abs/2505.22937
作者: Ngeyen Yinkfu
机构: Carnegie Mellon University Africa (卡内基梅隆大学非洲校区)
类目: Computation and Language (cs.CL)
备注: This paper presents an efficient transformer-based question-answering model optimized for inference on a 13th Gen Intel i7 CPU. The proposed approach balances performance and computational efficiency, making it suitable for real-time applications on resource-constrained devices. Code for this paper is available upon request via email at nyinkfu@andrew. this http URL
Abstract:This study presents an efficient transformer-based question-answering (QA) model optimized for deployment on a 13th Gen Intel i7-1355U CPU, using the Stanford Question Answering Dataset (SQuAD) v1.1. Leveraging exploratory data analysis, data augmentation, and fine-tuning of a DistilBERT architecture, the model achieves a validation F1 score of 0.6536 with an average inference time of 0.1208 seconds per question. Compared to a rule-based baseline (F1: 0.3124) and full BERT-based models, our approach offers a favorable trade-off between accuracy and computational efficiency. This makes it well-suited for real-time applications on resource-constrained systems. The study includes systematic evaluation of data augmentation strategies and hyperparameter configurations, providing practical insights into optimizing transformer models for CPU-based inference.
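论文以 SQuAD 的 F1 分数作为评测指标。下面给出词级 F1 的一个简化实现作为参考(官方评测脚本还会去除冠词等,此处从略):

```python
import re
from collections import Counter

def squad_f1(prediction: str, ground_truth: str) -> float:
    """SQuAD 风格的词级 F1:小写化、去标点、按空白切分后计算词重叠。"""
    norm = lambda s: re.sub(r"[^\w\s]", "", s.lower()).split()
    pred, gold = norm(prediction), norm(ground_truth)
    common = Counter(pred) & Counter(gold)      # 词级重叠(计次)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

score = squad_f1("the Eiffel Tower", "Eiffel Tower")  # p=2/3, r=1 -> 0.8
```

在验证集上对所有问答对取平均,即得到文中报告的 0.6536 这类总体 F1。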
zh
[NLP-114] Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging ACL2025
【速读】: 该论文试图解决在对低秩适应(LoRA)微调的模型进行模型合并时出现的性能下降问题,这一问题源于模型参数与数据分布之间的相互作用被忽视。解决方案的关键在于提出一种名为OSRM(Orthogonal Subspaces for Robust model Merging)的方法,通过在微调前约束LoRA子空间,确保某一任务的更新不会对其他任务的输出产生负面影响,从而提升合并后的多任务模型的稳定性和性能。
链接: https://arxiv.org/abs/2505.22934
作者: Haobo Zhang,Jiayu Zhou
机构: University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 5 figures, 16 tables, accepted by ACL 2025
Abstract:Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training. However, existing merging methods often fail for models fine-tuned with low-rank adaptation (LoRA), due to significant performance degradation. In this paper, we show that this issue arises from a previously overlooked interplay between model parameters and data distributions. We propose Orthogonal Subspaces for Robust model Merging (OSRM) to constrain the LoRA subspace prior to fine-tuning, ensuring that updates relevant to one task do not adversely shift outputs for others. Our approach can seamlessly integrate with most existing merging algorithms, reducing the unintended interference among tasks. Extensive experiments on eight datasets, tested with three widely used LMs and two large LMs, demonstrate that our method not only boosts merging performance but also preserves single-task accuracy. Furthermore, our approach exhibits greater robustness to the hyperparameters of merging. These results highlight the importance of data-parameter interaction in model merging and offer a plug-and-play solution for merging LoRA models.
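「在微调前约束 LoRA 子空间,使某任务的更新不影响其它任务的输出」可以用正交投影来理解。下面是一个 numpy 最小示意(假设性的简化形式:将 LoRA 的 A 矩阵投影到其它任务输入子空间的正交补上,并非论文的完整算法):

```python
import numpy as np

def constrain_lora_A(A: np.ndarray, X_other: np.ndarray) -> np.ndarray:
    """A: LoRA 下投影矩阵 (r x d);X_other: 其它任务的输入表示 (n x d)。
    投影后 Delta W x = B (A x) 对其它任务输入 x 近似为零。"""
    U, _, _ = np.linalg.svd(X_other.T, full_matrices=False)  # d x n 正交基
    P = U @ U.T                 # 投影到其它任务输入张成的子空间
    return A - A @ P            # 保留落在正交补中的分量

rng = np.random.default_rng(0)
d, r = 16, 4
A = rng.normal(size=(r, d))
X_other = rng.normal(size=(8, d))       # 其它任务的 8 条输入表示
A_orth = constrain_lora_A(A, X_other)

# 校验:约束后的 A 对其它任务输入的响应近似为零
residual = np.abs(A_orth @ X_other.T).max()
```

由于 A x ≈ 0 时 LoRA 更新 B A x 也为零,各任务的更新互不干扰,从而缓解合并时的性能退化。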
zh
[NLP-115] Enhancing Study-Level Inference from Clinical Trial Papers via RL-based Numeric Reasoning
【速读】: 该论文旨在解决医学系统综述中自动化提取数值证据并确定研究层面结论的瓶颈问题,传统方法依赖于浅层文本线索,难以捕捉专家评估背后的数值推理过程。其解决方案的关键在于将问题概念化为定量推理任务,通过提取结构化数值证据(如事件计数或标准差)并应用领域知识驱动的逻辑来推导特定结果的结论,从而实现更准确且可解释的推理。
链接: https://arxiv.org/abs/2505.22928
作者: Massimiliano Pronesti,Michela Lorandi,Paul Flanagan,Oisin Redmon,Anya Belz,Yufang Hou
机构: IBM Research Europe - Ireland(IBM研究欧洲-爱尔兰); Dublin City University(都柏林城市大学); IT:U Interdisciplinary Transformation University Austria(IT:U跨学科转型大学奥地利)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Systematic reviews in medicine play a critical role in evidence-based decision-making by aggregating findings from multiple studies. A central bottleneck in automating this process is extracting numeric evidence and determining study-level conclusions for specific outcomes and comparisons. Prior work has framed this problem as a textual inference task by retrieving relevant content fragments and inferring conclusions from them. However, such approaches often rely on shallow textual cues and fail to capture the underlying numeric reasoning behind expert assessments. In this work, we conceptualise the problem as one of quantitative reasoning. Rather than inferring conclusions from surface text, we extract structured numerical evidence (e.g., event counts or standard deviations) and apply domain knowledge informed logic to derive outcome-specific conclusions. We develop a numeric reasoning system composed of a numeric data extraction model and an effect estimate component, enabling more accurate and interpretable inference aligned with the domain expert principles. We train the numeric data extraction model using different strategies, including supervised fine-tuning (SFT) and reinforcement learning (RL) with a new value reward model. When evaluated on the CochraneForest benchmark, our best-performing approach – using RL to train a small-scale number extraction model – yields up to a 21% absolute improvement in F1 score over retrieval-based systems and outperforms general-purpose LLMs of over 400B parameters by up to 9%. Our results demonstrate the promise of reasoning-driven approaches for automating systematic evidence synthesis.
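「效应估计组件」的典型做法是由结构化事件计数推导效应量及其置信区间。下面按元分析中标准的对数正态近似给出风险比(risk ratio)的示意计算(属于该类组件的常见形式,并非论文的具体实现):

```python
import math

def risk_ratio(events_t, n_t, events_c, n_c, z=1.96):
    """由治疗组/对照组事件计数计算风险比及 95% 置信区间。"""
    rr = (events_t / n_t) / (events_c / n_c)
    se = math.sqrt(1/events_t - 1/n_t + 1/events_c - 1/n_c)  # log(RR) 的标准误
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# 例:治疗组 100 人中 10 例事件,对照组 100 人中 20 例事件
rr, lo, hi = risk_ratio(10, 100, 20, 100)
significant = hi < 1.0 or lo > 1.0   # 置信区间不含 1 才可下有统计学意义的结论
```

研究层面的结论(有效/无效/证据不足)即由这类数值逻辑而非表层文本线索推出。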
zh
[NLP-116] Structured Memory Mechanisms for Stable Context Representation in Large Language Models
【速读】: 该论文试图解决大型语言模型在理解长期上下文方面的局限性,特别是其在保持和检索跨段落及对话轮次的语义信息方面的不足。解决方案的关键在于提出一种配备长期记忆机制的模型架构,该架构集成了显式记忆单元、门控写入机制和基于注意力的阅读模块,并引入遗忘函数以实现记忆内容的动态更新。此外,通过设计联合训练目标,将主任务损失与对记忆写入和遗忘的约束相结合,引导模型在任务执行过程中学习更优的记忆策略。
链接: https://arxiv.org/abs/2505.22921
作者: Yue Xing,Tao Yang,Yijiashun Qi,Minggu Wei,Yu Cheng,Honghui Xin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper addresses the limitations of large language models in understanding long-term context. It proposes a model architecture equipped with a long-term memory mechanism to improve the retention and retrieval of semantic information across paragraphs and dialogue turns. The model integrates explicit memory units, gated writing mechanisms, and attention-based reading modules. A forgetting function is introduced to enable dynamic updates of memory content, enhancing the model’s ability to manage historical information. To further improve the effectiveness of memory operations, the study designs a joint training objective. This combines the main task loss with constraints on memory writing and forgetting. It guides the model to learn better memory strategies during task execution. Systematic evaluation across multiple subtasks shows that the model achieves clear advantages in text generation consistency, stability in multi-turn question answering, and accuracy in cross-context reasoning. In particular, the model demonstrates strong semantic retention and contextual coherence in long-text tasks and complex question answering scenarios. It effectively mitigates the context loss and semantic drift problems commonly faced by traditional language models when handling long-term dependencies. The experiments also include analysis of different memory structures, capacity sizes, and control strategies. These results further confirm the critical role of memory mechanisms in language understanding. They demonstrate the feasibility and effectiveness of the proposed approach in both architectural design and performance outcomes.
zh
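上文 NLP-116 摘要中描述的"显式记忆单元 + 门控写入 + 注意力读取 + 遗忘函数"机制,可用如下极简 NumPy 草图示意(这是对摘要的示意性还原,Wg、Wf 等参数名与更新形式均为假设,并非论文官方实现):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_step(M, h, Wg, Wf):
    """对显式记忆矩阵 M 做一次"遗忘 + 门控写入"更新。
    M: (slots, d) 记忆单元; h: (d,) 当前隐藏状态;
    Wg / Wf: (d, slots) 写入门 / 遗忘门投影(参数名为假设)。"""
    g = 1 / (1 + np.exp(-(h @ Wg)))                      # 每个槽位的写入门
    f = 1 / (1 + np.exp(-(h @ Wf)))                      # 遗忘函数:动态衰减旧内容
    return (1 - f)[:, None] * M + g[:, None] * h[None, :]

def memory_read(M, q):
    """基于注意力的读取:按与查询 q 的相似度加权各槽位。"""
    w = softmax(M @ q)
    return w @ M

rng = np.random.default_rng(0)
d, slots = 8, 4
M = np.zeros((slots, d))
Wg, Wf = rng.normal(size=(d, slots)), rng.normal(size=(d, slots))
for _ in range(3):                 # 模拟三个对话轮次的写入
    M = memory_step(M, rng.normal(size=d), Wg, Wf)
r = memory_read(M, rng.normal(size=d))
print(M.shape, r.shape)            # (4, 8) (8,)
```

论文中的联合训练目标(主任务损失加上对写入/遗忘的约束)作用于 g、f 这类门控量,上述草图只演示前向的记忆读写结构。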
[NLP-117] ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在急诊科(Emergency Room, ER)临床推理与决策评估中的不足问题,即现有基准多聚焦于孤立任务,未能全面反映临床推理过程和实际医疗决策的完整流程。其解决方案的关键在于构建ER-Reason基准,该基准包含3,984名患者的多维度临床记录,并设计了涵盖急诊科工作流程关键阶段的评估任务,同时收集了72份由医生撰写的完整推理说明,以更真实地模拟临床决策过程,从而更准确地评估LLM的临床推理能力。
链接: https://arxiv.org/abs/2505.22919
作者: Nikita Mehandru,Niloufar Golchini,David Bamman,Travis Zack,Melanie F. Molina,Ahmed Alaa
机构: University of California, Berkeley(加州大学伯克利分校); University of California, San Francisco(加州大学旧金山分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have been extensively evaluated on medical question answering tasks based on licensing exams. However, real-world evaluations often depend on costly human annotators, and existing benchmarks tend to focus on isolated tasks that rarely capture the clinical reasoning or full workflow underlying medical decisions. In this paper, we introduce ER-Reason, a benchmark designed to evaluate LLM-based clinical reasoning and decision-making in the emergency room (ER)–a high-stakes setting where clinicians make rapid, consequential decisions across diverse patient presentations and medical specialties under time pressure. ER-Reason includes data from 3,984 patients, encompassing 25,174 de-identified longitudinal clinical notes spanning discharge summaries, progress notes, history and physical exams, consults, echocardiography reports, imaging notes, and ER provider documentation. The benchmark includes evaluation tasks that span key stages of the ER workflow: triage intake, initial assessment, treatment selection, disposition planning, and final diagnosis–each structured to reflect core clinical reasoning processes such as differential diagnosis via rule-out reasoning. We also collected 72 full physician-authored rationales explaining reasoning processes that mimic the teaching process used in residency training, and are typically absent from ER documentation. Evaluations of state-of-the-art LLMs on ER-Reason reveal a gap between LLM-generated and clinician-authored clinical reasoning for ER decisions, highlighting the need for future research to bridge this divide.
zh
[NLP-118] Talent or Luck? Evaluating Attribution Bias in Large Language Models
【速读】: 该论文试图解决大型语言模型(LLM)在事件结果归因过程中可能存在的社会偏见问题,特别是这些偏见如何通过模型的推理差异对不同人口统计群体产生不公平影响。解决方案的关键在于提出一种基于认知理论的偏见评估框架,以识别模型在归因过程中的推理差异如何导致对特定群体的系统性偏差。
链接: https://arxiv.org/abs/2505.22910
作者: Chahat Raj,Mahika Banerjee,Aylin Caliskan,Antonios Anastasopoulos,Ziwei Zhu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages
Abstract:When a student fails an exam, do we tend to blame their effort or the test’s difficulty? Attribution, defined as how reasons are assigned to event outcomes, shapes perceptions, reinforces stereotypes, and influences decisions. Attribution Theory in social psychology explains how humans assign responsibility for events using implicit cognition, attributing causes to internal (e.g., effort, ability) or external (e.g., task difficulty, luck) factors. LLMs’ attribution of event outcomes based on demographics carries important fairness implications. Most works exploring social biases in LLMs focus on surface-level associations or isolated stereotypes. This work proposes a cognitively grounded bias evaluation framework to identify how models’ reasoning disparities channelize biases toward demographic groups.
zh
[NLP-119] Conversational Alignment with Artificial Intelligence in Context
【速读】: 该论文试图解决如何使基于大语言模型的智能对话代理(conversational agents)在对话行为上与人类交流规范和实践保持一致的问题,核心在于探讨AI设计与人类价值观之间的关系。其解决方案的关键是提出一个名为CONTEXT-ALIGN的框架,该框架基于会话语用学的哲学与语言学文献,旨在为评估开发者的设计选择提供新的评价标准,并揭示当前大语言模型(LLM)架构、约束和功能可能对实现完全对话对齐造成的根本性限制。
链接: https://arxiv.org/abs/2505.22907
作者: Rachel Katharine Sterken(University of Hong Kong),James Ravi Kirkpatrick(University of Oxford and Magdalen College, Oxford)
机构: University of Hong Kong (香港大学); University of Oxford (牛津大学); Magdalen College, Oxford (牛津大学莫德林学院)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 20 pages, to be published in Philosophical Perspectives
Abstract:The development of sophisticated artificial intelligence (AI) conversational agents based on large language models raises important questions about the relationship between human norms, values, and practices and AI design and performance. This article explores what it means for AI agents to be conversationally aligned to human communicative norms and practices for handling context and common ground and proposes a new framework for evaluating developers’ design choices. We begin by drawing on the philosophical and linguistic literature on conversational pragmatics to motivate a set of desiderata, which we call the CONTEXT-ALIGN framework, for conversational alignment with human communicative practices. We then suggest that current large language model (LLM) architectures, constraints, and affordances may impose fundamental limitations on achieving full conversational alignment.
zh
[NLP-120] VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models
【速读】: 该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)中社会刻板印象和偏见的问题,尤其是现有研究主要关注肖像图像和性别-职业关联,而忽视了更广泛和复杂的社会刻板印象及其潜在危害。解决方案的关键是引入VIGNETTE,这是一个包含30M+图像的大规模视觉问答(VQA)基准,通过事实性、感知、刻板印象和决策四个方向的问答框架来评估VLMs中的偏见。该方法不仅扩展了研究范围,还揭示了模型如何在情境化设置中解读身份,并表现出特征和能力假设及歧视模式。
链接: https://arxiv.org/abs/2505.22897
作者: Chahat Raj,Bowen Wei,Aylin Caliskan,Antonios Anastasopoulos,Ziwei Zhu
机构: George Mason University (乔治·梅森大学); University of Washington (华盛顿大学); Archimedes, Athena Research Center (阿基米德,雅典娜研究中心)
类目: Computation and Language (cs.CL)
备注: 17 pages
Abstract:While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harm. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Beyond narrowly-centered studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait and role-based inferences, encoding social hierarchies, through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.
zh
[NLP-121] When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy
【速读】: 该论文试图解决当前大型推理模型(Large Reasoning Models, LRMs)在多语言推理能力上的不足问题,特别是模型在非英语语言中生成连贯且可读的推理过程的能力较弱。其解决方案的关键在于通过针对特定语言的微调(post training),尤其是仅使用100个示例进行针对性训练,以缓解模型在多语言推理中的不匹配问题,尽管这种方法仍会带来一定程度的准确率下降。
链接: https://arxiv.org/abs/2505.22888
作者: Jirui Qi,Shan Chen,Zidi Xiong,Raquel Fernández,Danielle S. Bitterman,Arianna Bisazza
机构: University of Groningen(格罗宁根大学); Harvard University(哈佛大学); Mass General Brigham(马萨诸塞总医院); Boston Children’s Hospital(波士顿儿童医院); University of Amsterdam(阿姆斯特丹大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user's language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at this https URL.
zh
[NLP-122] GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification
【速读】: 该论文旨在解决多语言新闻文章中叙事分类的问题,以应对在线新闻泛滥和虚假信息传播带来的挑战。其关键解决方案是提出一种分层三步提示方法(Hierarchical Three-Step Prompting, H3Prompt),该方法利用大型语言模型(LLM)进行分步分类:首先将文章归类到两个领域(乌克兰-俄罗斯战争或气候变化),然后识别主要叙事,最后分配子叙事。该方法在英语测试集上取得了全球28支参赛队伍中的第一名。
链接: https://arxiv.org/abs/2505.22867
作者: Iknoor Singh,Carolina Scarton,Kalina Bontcheva
机构: University of Sheffield (谢菲尔德大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The proliferation of online news and the increasing spread of misinformation necessitate robust methods for automatic data analysis. Narrative classification is emerging as an important task, since identifying what is being said online is critical for fact-checkers, policy makers and other professionals working on information studies. This paper presents our approach to SemEval 2025 Task 10 Subtask 2, which aims to classify news articles into a pre-defined two-level taxonomy of main narratives and sub-narratives across multiple languages. We propose Hierarchical Three-Step Prompting (H3Prompt) for multilingual narrative classification. Our methodology follows a three-step Large Language Model (LLM) prompting strategy, where the model first categorises an article into one of two domains (Ukraine-Russia War or Climate Change), then identifies the most relevant main narratives, and finally assigns sub-narratives. Our approach secured the top position on the English test set among 28 competing teams worldwide. The code is available at this https URL.
zh
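上文 NLP-122 的三步层级提示(H3Prompt)流程结构可用下面的 Python 草图表示。其中 classify 用关键词规则的桩函数代替真实的 LLM 提示调用,领域与叙事标签也仅为示意,并非 SemEval 官方分类体系:

```python
def classify(prompt, options):
    """桩函数:真实系统中此处应是一次 LLM 提示调用,返回 options 之一。"""
    p = prompt.lower()
    return max(options, key=lambda o: sum(w.lower() in p for w in o.split()))

DOMAINS = ["Climate Change", "Ukraine-Russia War"]
MAIN_NARRATIVES = {
    "Climate Change": ["Criticism of climate policies", "Downplaying climate change"],
    "Ukraine-Russia War": ["Discrediting Ukraine", "Praise of Russia"],
}
SUB_NARRATIVES = {
    "Downplaying climate change": ["Weather is cyclical", "CO2 is beneficial"],
}

def h3prompt(article):
    domain = classify(article, DOMAINS)                          # 第一步:领域
    main = classify(article, MAIN_NARRATIVES[domain])            # 第二步:主叙事
    sub = classify(article, SUB_NARRATIVES.get(main, ["Other"])) # 第三步:子叙事
    return domain, main, sub

article = "Scientists claim the climate is merely cyclical and any change is natural."
print(h3prompt(article))
# ('Climate Change', 'Downplaying climate change', 'Weather is cyclical')
```

层级式分步的好处是每一步的候选空间都很小,后续步骤只需在前一步选定的分支内做判断。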
[NLP-123] Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge
【速读】: 该论文试图解决抑郁症检测中深度神经网络(DNN)在现实世界中的有效性不足以及大型语言模型(LLMs)在处理非文本线索时的局限性问题。其关键解决方案是首次将LLMs应用于多模态抑郁症检测,并通过预训练模型Wav2Vec提取音频特征,将其映射到基于文本的LLMs进行进一步处理,同时提出一种新颖的心理学知识整合策略,利用问答集向LLMs注入授权知识以提升诊断性能。
链接: https://arxiv.org/abs/2505.22863
作者: Yupei Li,Shuaijie Shao,Manuel Milling,Björn W. Schuller
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:
Abstract:Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been used for recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rather than explicit text, relying on language alone is insufficient. Diagnostic accuracy also suffers without incorporating psychological expertise. To address these limitations, we present, to the best of our knowledge, the first application of LLMs to multimodal depression detection using the DAIC-WOZ dataset. We extract audio features using the pre-trained model Wav2Vec and map them to text-based LLMs for further processing. We also propose a novel strategy for incorporating psychological knowledge into LLMs to enhance diagnostic performance, specifically using a question and answer set to grant authorised knowledge to LLMs. Our approach yields a notable improvement in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) compared to a baseline score proposed by the related original paper. The codes are available at this https URL
zh
[NLP-124] LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference
【速读】: 该论文试图解决自然语言推理(Natural Language Inference, NLI)中存在的人类标注者内部标签一致性(within-label variation)问题,即标注者在对同一前提-假设对赋予相同标签的同时,可能给出不同的推理依据。解决方案的关键在于引入LITEX,这是一种基于语言学的自由文本解释分类体系,通过系统化地对解释进行标注与分析,揭示NLI标签背后的推理逻辑,并验证其在生成解释任务中的有效性,结果显示基于LITEX的生成方法能够生成更接近人类解释的语言表达。
链接: https://arxiv.org/abs/2505.22848
作者: Pingjun Hong,Beiduo Chen,Siyao Peng,Marie-Catherine de Marneffe,Barbara Plank
机构: LMU Munich(慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心); UCLouvain(卢万大学); FNRS(比利时国家科学基金会)
类目: Computation and Language (cs.CL)
备注: 21 pages, 6 figures
Abstract:There is increasing evidence of Human Label Variation (HLV) in Natural Language Inference (NLI), where annotators assign different labels to the same premise-hypothesis pair. However, within-label variation–cases where annotators agree on the same label but provide divergent reasoning–poses an additional and mostly overlooked challenge. Several NLI datasets contain highlighted words in the NLI item as explanations, but the same spans on the NLI item can be highlighted for different reasons, as evidenced by free-text explanations, which offer a window into annotators’ reasoning. To systematically understand this problem and gain insight into the rationales behind NLI labels, we introduce LITEX, a linguistically-informed taxonomy for categorizing free-text explanations. Using this taxonomy, we annotate a subset of the e-SNLI dataset, validate the taxonomy’s reliability, and analyze how it aligns with NLI labels, highlights, and explanations. We further assess the taxonomy’s usefulness in explanation generation, demonstrating that conditioning generation on LITEX yields explanations that are linguistically closer to human explanations than those generated using only labels or highlights. Our approach thus not only captures within-label variation but also shows how taxonomy-guided generation for reasoning can bridge the gap between human and model explanations more effectively than existing strategies.
zh
[NLP-125] Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
【速读】: 该论文旨在解决基于Transformer的语言模型在处理序列顺序和实现上下文长度外推时,现有位置编码(Positional Encoding, PE)方法缺乏理论依据且评估指标有限的问题。其解决方案的关键在于提出贝叶斯注意力机制(Bayesian Attention Mechanism, BAM),将位置编码建模为概率模型中的先验,并统一了现有方法(如NoPE和ALiBi),同时引入了一种广义高斯位置先验,显著提升了长上下文的泛化能力。
链接: https://arxiv.org/abs/2505.22842
作者: Arthur S. Bianchessi,Rodrigo C. Barros,Lucas S. Kupssinskü
机构: MALTA - Machine Learning Theory and Applications Lab - PUCRS - Porto Alegre, Brazil
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at 500× the training context length, outperforming previous state-of-the-art context length generalization in long-context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
zh
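上文 NLP-125 把位置编码视为注意力对数域上的概率先验,这一思路可用如下 NumPy 草图示意:ALiBi 对应线性距离惩罚,广义高斯先验则改变惩罚随距离增长的形状。注意 beta、sigma 与 ALiBi 斜率均为演示取值,并非论文给定参数:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

n = 6
i, j = np.indices((n, n))
dist = np.abs(i - j).astype(float)           # token 间距 |i - j|
mask = np.where(j > i, -np.inf, 0.0)         # 因果掩码:屏蔽未来位置

alibi_logprior = -0.5 * dist                 # ALiBi:对数域的线性距离惩罚(斜率为假设值)
beta, sigma = 1.5, 2.0                       # 广义高斯先验的形状/尺度参数(示意取值)
gg_logprior = -(dist / sigma) ** beta

scores = np.zeros((n, n))                    # 假设内容相关的 QK 分数为 0,单看先验的影响
w_alibi = softmax(scores + alibi_logprior + mask)
w_gg = softmax(scores + gg_logprior + mask)
print(np.round(w_alibi[-1], 3))              # 末行注意力:距离越近权重越大
print(np.round(w_gg[-1], 3))
```

两种先验都偏向近处 token,区别只在衰减曲线的形状;在这一框架下比较不同 PE 方法,就变成比较不同的对数先验函数。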
[NLP-126] What Has Been Lost with Synthetic Evaluation?
【速读】: 该论文试图解决如何有效构建评估基准以衡量大型语言模型(Large Language Models, LLMs)性能的问题,特别是针对生成式评估数据是否能够满足基准的严谨性、挑战性和针对性要求。其解决方案的关键在于通过提示(prompting)LLMs生成推理类文本基准,并将其与通过精心策划的人工众包创建的基准进行对比,以评估生成数据的有效性和难度。研究发现,尽管LLMs生成的数据在符合标注指南方面表现良好,但其挑战性低于人工创建的数据,这提示了使用LLMs生成评估数据可能存在的局限性。
链接: https://arxiv.org/abs/2505.22830
作者: Alexander Gill,Abhilasha Ravichander,Ana Marasović
机构: University of Utah(犹他大学); University of Washington(华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages main, 5 pages reference, 24 pages appendix
Abstract:Large language models (LLMs) are increasingly used for data generation. However, creating evaluation benchmarks raises the bar for this emerging paradigm. Benchmarks must target specific phenomena, penalize exploiting shortcuts, and be challenging. Through two case studies, we investigate whether LLMs can meet these demands by generating reasoning-over-text benchmarks and comparing them to those created through careful crowdsourcing. Specifically, we evaluate both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA, which evaluates reasoning about negation, and DROP, which targets reasoning about quantities. We find that prompting LLMs can produce variants of these datasets that are often valid according to the annotation guidelines, at a fraction of the cost of the original crowdsourcing effort. However, we show that they are less challenging for LLMs than their human-authored counterparts. This finding sheds light on what may have been lost by generating evaluation data with LLMs, and calls for critically reassessing the immediate use of this increasingly prevalent approach to benchmark creation.
zh
[NLP-127] Self-Critique and Refinement for Faithful Natural Language Explanations
【速读】: 该论文试图解决自然语言解释(Natural Language Explanations, NLEs)在忠实性方面的问题,即现有的解释往往无法准确反映模型的实际推理过程。其解决方案的关键在于提出一种无需外部监督的自评与优化框架——Self-critique and Refinement for Natural Language Explanations (SR-NLE),通过迭代的自我评估和优化过程提升解释的忠实性,其中引入了自然语言自反馈和基于特征归因的新反馈机制来引导优化过程。
链接: https://arxiv.org/abs/2505.22823
作者: Yingming Wang,Pepa Atanasova
机构: University of Copenhagen(哥本哈根大学)
类目: Computation and Language (cs.CL)
备注: 21 pages, 10 figures, 14 tables
Abstract:With the rapid development of large language models (LLMs), natural language explanations (NLEs) have become increasingly important for understanding model predictions. However, these explanations often fail to faithfully represent the model’s actual reasoning process. While existing work has demonstrated that LLMs can self-critique and refine their initial outputs for various tasks, this capability remains unexplored for improving explanation faithfulness. To address this gap, we introduce Self-critique and Refinement for Natural Language Explanations (SR-NLE), a framework that enables models to improve the faithfulness of their own explanations – specifically, post-hoc NLEs – through an iterative critique and refinement process without external supervision. Our framework leverages different feedback mechanisms to guide the refinement process, including natural language self-feedback and, notably, a novel feedback approach based on feature attribution that highlights important input words. Our experiments across three datasets and four state-of-the-art LLMs demonstrate that SR-NLE significantly reduces unfaithfulness rates, with our best method achieving an average unfaithfulness rate of 36.02%, compared to 54.81% for baseline – an absolute reduction of 18.79%. These findings reveal that the investigated LLMs can indeed refine their explanations to better reflect their actual reasoning process, requiring only appropriate guidance through feedback without additional training or fine-tuning.
zh
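上文 NLP-127 的 SR-NLE"自评-修订"迭代结构可用下面的 Python 草图示意。基于特征归因的反馈在此被简化为"找出解释中未覆盖的重要输入词";critique 与 refine 均为桩函数,仅展示无监督的迭代框架,并非论文实现:

```python
def critique(explanation, important_words):
    """基于特征归因的反馈(桩):找出解释中未覆盖的重要输入词。"""
    return [w for w in important_words if w not in explanation]

def refine(explanation, missing):
    """修订步骤(桩):真实系统中由 LLM 重写解释,这里仅追加缺失要点。"""
    return explanation + " It also relies on: " + ", ".join(missing) + "."

def sr_nle(explanation, important_words, max_iters=3):
    for _ in range(max_iters):
        missing = critique(explanation, important_words)
        if not missing:            # 归因要点已全部覆盖,视为解释更"忠实"
            break
        explanation = refine(explanation, missing)
    return explanation

out = sr_nle("The review is positive because of the praise.",
             ["praise", "excellent", "recommend"])
print(out)
```

关键点在于整个循环无需外部监督:批评信号(这里是归因词覆盖情况)由模型自身的输入输出导出,迭代至收敛或达到步数上限为止。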
[NLP-128] First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay EMNLP2025
【速读】: 该论文试图解决传统对话式大语言模型(Large Language Model, LLM)代理仅直接与人类用户互动的问题,提出了一种新的交互范式——“旁听代理”(overhearing agents)。这些代理不主动参与对话,而是通过“监听”人类之间的对话来执行后台任务或提供辅助建议。解决方案的关键在于利用大规模多模态音频-语言模型作为旁听代理,以 Dungeons & Dragons 游戏场景为研究对象,通过隐式音频线索完成辅助任务,并验证了此类模型在实际场景中的有效性。
链接: https://arxiv.org/abs/2505.22809
作者: Andrew Zhu,Evan Osgood,Chris Callison-Burch
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 8 pages, 5 figures. In submission at EMNLP 2025
Abstract:Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call “overhearing agents”. These overhearing agents do not actively participate in conversation – instead, they “listen in” on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at this https URL.
zh
[NLP-129] Towards a More Generalized Approach in Open Relation Extraction ACL2025
【速读】: 该论文试图解决传统开放关系抽取(OpenRE)方法在处理实际场景中未定义关系分布不均问题时的局限性,即传统方法假设无标签数据仅包含新颖关系或已预先划分为已知与新颖实例,而现实中新颖关系是任意分布的。解决方案的关键在于提出一种广义的OpenRE设置,并设计了MixORE框架,该框架通过两阶段机制将关系分类与聚类相结合,共同学习已知和新颖关系,从而提升模型在已知关系分类和新颖关系聚类任务中的性能。
链接: https://arxiv.org/abs/2505.22801
作者: Qing Wang,Yuepei Li,Qiao Qiao,Kang Zhou,Qi Li
机构: Iowa State University (爱荷华州立大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2025 Main Conference
Abstract:Open Relation Extraction (OpenRE) seeks to identify and extract novel relational facts between named entities from unlabeled data without pre-defined relation schemas. Traditional OpenRE methods typically assume that the unlabeled data consists solely of novel relations or is pre-divided into known and novel instances. However, in real-world scenarios, novel relations are arbitrarily distributed. In this paper, we propose a generalized OpenRE setting that considers unlabeled data as a mixture of both known and novel instances. To address this, we propose MixORE, a two-phase framework that integrates relation classification and clustering to jointly learn known and novel relations. Experiments on three benchmark datasets demonstrate that MixORE consistently outperforms competitive baselines in known relation classification and novel relation clustering. Our findings contribute to the advancement of generalized OpenRE research and real-world applications.
zh
[NLP-130] Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory
【速读】: 该论文试图解决现代视觉-语言模型(Vision-Language Models, VLMs)在文化能力评估中表现不佳的问题,特别是如何系统地识别和标注图像中的文化细微差别。解决方案的关键在于引入来自视觉文化研究(文化研究、符号学和视觉研究)的基础方法,以支持对图像的文化分析,并提出了五个对应于文化维度的框架,以实现对VLMs文化能力更全面的分析。
链接: https://arxiv.org/abs/2505.22793
作者: Srishti Yadav,Lauren Tilton,Maria Antoniak,Taylor Arnold,Jiaang Li,Siddhesh Milind Pawar,Antonia Karamolegkou,Stella Frank,Zhaochong An,Negar Rostamzadeh,Daniel Hershcovich,Serge Belongie,Ekaterina Shutova
机构: University of Copenhagen(哥本哈根大学); University of Richmond(里士满大学); University of Amsterdam(阿姆斯特丹大学); Pioneer Centre of AI(人工智能先锋中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Modern vision-language models (VLMs) often fail at cultural competency evaluations and benchmarks. Given the diversity of applications built upon VLMs, there is renewed interest in understanding how they encode cultural nuances. While individual aspects of this problem have been studied, we still lack a comprehensive framework for systematically identifying and annotating the nuanced cultural dimensions present in images for VLMs. This position paper argues that foundational methodologies from visual culture studies (cultural studies, semiotics, and visual studies) are necessary for cultural analysis of images. Building upon this review, we propose a set of five frameworks, corresponding to cultural dimensions, that must be considered for a more complete analysis of the cultural competencies of VLMs.
zh
[NLP-131] Can Large Language Models Match the Conclusions of Systematic Reviews?
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否能够像临床专家一样,通过系统性地评估和分析多个研究来生成高质量的系统综述(Systematic Reviews, SR)。其解决方案的关键在于构建了一个名为MedEvidence的基准数据集,该数据集将100个系统综述与其所依据的研究配对,并在此基础上对24种不同类型的LLMs进行了系统评估,以探究它们在处理SR任务时的表现及局限性。
链接: https://arxiv.org/abs/2505.22787
作者: Christopher Polzak,Alejandro Lozano,Min Woo Sun,James Burgess,Yuhui Zhang,Kevin Wu,Serena Yeung-Levy
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning, non-reasoning, medical specialist, and models across varying sizes (from 7B-700B). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and, contrary to human experts, all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians. We release our codebase and benchmark to the broader research community to further investigate LLM-based SR systems.
zh
[NLP-132] MEDAL: A Framework for Benchmarking LLM s as Multilingual Open-Domain Chatbots and Dialogue Evaluators
【速读】: 该论文试图解决当前对话系统评估基准存在静态、过时及多语言覆盖不足的问题,这些问题限制了其捕捉细微语言和文化差异的能力。解决方案的关键在于提出MEDAL,一个自动化多智能体框架,通过利用先进的生成式 AI (Generative AI) 生成多语言对话数据,并使用强大的 LLM(如 GPT-4.1)进行多维性能分析,从而构建更具代表性和多样性的开放域对话评估基准。
链接: https://arxiv.org/abs/2505.22777
作者: John Mendonça,Alon Lavie,Isabel Trancoso
机构: INESC-ID; Instituto Superior Técnico, University of Lisbon (里斯本大学); Carnegie Mellon University (卡内基梅隆大学); Phrase
类目: Computation and Language (cs.CL)
备注: May ARR
Abstract:As the capabilities of chatbots and their underlying LLMs continue to dramatically improve, evaluating their performance has increasingly become a major blocker to their further development. A major challenge is the available benchmarking datasets, which are largely static, outdated, and lacking in multilingual coverage, limiting their ability to capture subtle linguistic and cultural variations. This paper introduces MEDAL, an automated multi-agent framework for generating, evaluating, and curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. A strong LLM (GPT-4.1) is then used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. We find that current LLMs struggle to detect nuanced issues, particularly those involving empathy and reasoning.
zh
[NLP-133] Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
【速读】: 该论文试图解决跨口语与书面语的句法结构比较问题,旨在通过一种基于树库的方法,系统分析不同语言中口语和书面语的句法特征差异。解决方案的关键在于采用完全归纳、自下而上的方法,将句法结构定义为去词化的依存(子)树,并从英语和斯洛文尼亚语的通用依存(Universal Dependencies, UD)树库中提取这些结构,从而分析其规模、多样性、分布及模态间的重叠情况。研究结果揭示了口语和书面语在句法结构上的显著差异,以及模态特异性偏好,为语法在使用中的数据驱动理论提供了基础。
链接: https://arxiv.org/abs/2505.22774
作者: Kaja Dobrovoljc
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper presents a novel treebank-driven approach to comparing syntactic structures in speech and writing using dependency-parsed corpora. Adopting a fully inductive, bottom-up method, we define syntactic structures as delexicalized dependency (sub)trees and extract them from spoken and written Universal Dependencies (UD) treebanks in two syntactically distinct languages, English and Slovenian. For each corpus, we analyze the size, diversity, and distribution of syntactic inventories, their overlap across modalities, and the structures most characteristic of speech. Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities. Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the distinct demands of real-time interaction and elaborated writing. This contrast is further supported by a keyness analysis of the most frequent speech-specific structures, which highlights patterns associated with interactivity, context-grounding, and economy of expression. We argue that this scalable, language-independent framework offers a useful general method for systematically studying syntactic variation across corpora, laying the groundwork for more comprehensive data-driven theories of grammar in use.
[NLP-134] Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems
【Quick Read】: This paper targets the problem of limited scoring accuracy in automated essay scoring (AES). The key to its solution is incorporating feedback-oriented annotations into the scoring pipeline: two types of feedback-driven annotations are integrated, those identifying spelling and grammatical errors and those highlighting argumentative components. Annotations are generated with a generative language model (for spell-correction) together with an encoder-based token classifier (for marking argumentative elements), which improves the performance of encoder-based large language models fine-tuned as classifiers.
Link: https://arxiv.org/abs/2505.22771
Authors: Christopher Ormerod
Affiliations: Cambium Assessment Inc.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, AIME-Con Conference Submission
Abstract:This study illustrates how incorporating feedback-oriented annotations into the scoring pipeline can enhance the accuracy of automated essay scoring (AES). This approach is demonstrated with the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus. We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate annotations – a generative language model used for spell-correction and an encoder-based token classifier trained to identify and mark argumentative elements. By incorporating annotations into the scoring process, we demonstrate improvements in performance using encoder-based large language models fine-tuned as classifiers.
[NLP-135] StressTest: Can YOUR Speech LM Handle the Stress?
【Quick Read】: This paper addresses the poor performance of current speech-aware language models (SLMs) on semantics and speaker intent tied to sentence stress. Although sentence stress plays a crucial role in shaping meaning and conveying implicit intent, it has been overlooked in model evaluation and development. The key to the solution is the StressTest benchmark, which evaluates a model's ability to distinguish interpretations of spoken sentences based on stress patterns, together with a novel synthetic data generation pipeline that produces Stress17k, a training set simulating the meaning changes induced by stress variation. Optimizing models on this synthetic dataset transfers well to real recordings; the resulting fine-tuned model, StresSLM, significantly outperforms existing models on both sentence stress reasoning and detection tasks.
Link: https://arxiv.org/abs/2505.22765
Authors: Iddo Yosha,Gallil Maimon,Yossi Adi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Sentence stress refers to emphasis, placed on specific words within a spoken utterance to highlight or contrast an idea, or to introduce new information. It is often used to imply an underlying intention that is not explicitly stated. Recent advances in speech-aware language models (SLMs) have enabled direct processing of audio, allowing models to bypass transcription and access the full richness of the speech signal and perform audio reasoning tasks such as spoken question answering. Despite the crucial role of sentence stress in shaping meaning and speaker intent, it remains largely overlooked in evaluation and development of such models. In this work, we address this gap by introducing StressTest, a benchmark specifically designed to evaluate a model’s ability to distinguish between interpretations of spoken sentences based on the stress pattern. We assess the performance of several leading SLMs and find that, despite their overall capabilities, they perform poorly on such tasks. To overcome this limitation, we propose a novel synthetic data generation pipeline, and create Stress17k, a training set that simulates change of meaning implied by stress variation. Then, we empirically show that optimizing models with this synthetic dataset aligns well with real-world recordings and enables effective finetuning of SLMs. Results suggest, that our finetuned model, StresSLM, significantly outperforms existing models on both sentence stress reasoning and detection tasks. Code, models, data, and audio samples - this http URL.
[NLP-136] FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian
【Quick Read】: This paper tackles the reproducibility and fair-evaluation challenges facing speech foundation models (SFMs), which stem from the closed nature of existing models such as Whisper and SeamlessM4T, whose training data and code are unavailable. The key to the solution is FAMA, the first family of open-science speech foundation models for English and Italian, trained on 150k+ hours of open-source (OS) speech data, along with a new dataset of 16k hours of cleaned, pseudo-labeled speech for both languages. FAMA achieves performance competitive with existing SFMs while being up to 8 times faster, and all artifacts are released under OS-compliant licenses, promoting openness in speech technology research.
Link: https://arxiv.org/abs/2505.22759
Authors: Sara Papi,Marco Gaido,Luisa Bentivogli,Alessio Brutti,Mauro Cettolo,Roberto Gretter,Marco Matassoni,Mohamed Nabih,Matteo Negri
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:
Abstract:The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature–with inaccessible training data and code–poses major reproducibility and fair evaluation challenges. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar efforts in speech remain limited. To fill this gap, we introduce FAMA, the first family of open science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including code, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research.
[NLP-137] FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference
【Quick Read】: This paper addresses the performance bottlenecks in low-batch inference caused by memory bandwidth and kernel launch overheads, which matter especially for edge deployment and latency-sensitive applications. The key to the solution is FlashFormer, a proof-of-concept kernel for single-batch inference of transformer-based large language models that optimizes memory access and reduces redundant computation, achieving notable speedups across model sizes and quantization settings.
Link: https://arxiv.org/abs/2505.22758
Authors: Aniruddha Nrusimha,William Brandon,Mayank Mishra,Yikang Shen,Rameswar Panda,Jonathan Ragan-Kelley,Yoon Kim
Affiliations: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:The size and compute characteristics of modern large language models have led to an increased interest in developing specialized kernels tailored for training and inference. Existing kernels primarily optimize for compute utilization, targeting the large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications of interest such as edge deployment and latency-sensitive applications. This paper describes FlashFormer, a proof-of-concept kernel for accelerating single-batch inference for transformer-based large language models. Across various model sizes and quantization settings, we observe nontrivial speedups compared to existing state-of-the-art inference kernels.
[NLP-138] Pre-Training Curriculum for Multi-Token Prediction in Language Models ACL2025
【Quick Read】: This paper addresses the poor performance of small language models (SLMs) under the multi-token prediction (MTP) pre-training objective. The key to the solution is a curriculum learning strategy for MTP training with two variants: a forward curriculum that gradually increases the objective's complexity from next-token prediction (NTP) to MTP, and a reverse curriculum that does the opposite. Experiments show the forward curriculum lets SLMs better exploit the MTP objective, improving downstream NTP performance and generation quality while retaining the benefits of self-speculative decoding, whereas the reverse curriculum yields stronger NTP performance and output quality but fails to provide any self-speculative decoding gains.
Link: https://arxiv.org/abs/2505.22757
Authors: Ansar Aynetdinov,Alan Akbik
Affiliations: Humboldt-Universität zu Berlin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2025 (Main)
Abstract:Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token (NTP), MTP predicts the next k tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.
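As an illustration of what a forward curriculum could look like in practice, the sketch below ramps per-head loss weights from pure next-token prediction to equal-weight multi-token prediction over the course of training. The schedule shape and the `mtp_head_weights` helper are our own assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's code): a forward NTP -> MTP curriculum.
# Early in training only the next-token head receives loss weight; the extra
# prediction heads are phased in linearly until all k heads are weighted
# equally. "progress" is the fraction of training completed, in [0, 1].

def mtp_head_weights(progress, k=4):
    """Normalized loss weight per prediction head at a given training progress."""
    weights = [1.0]  # head 1 (standard next-token prediction) is always on
    for h in range(1, k):
        # head h+1 ramps linearly from 0 to 1 as training progresses
        weights.append(max(0.0, min(1.0, progress * k - h)))
    total = sum(weights)
    return [w / total for w in weights]

print(mtp_head_weights(0.0))  # NTP only: [1.0, 0.0, 0.0, 0.0]
print(mtp_head_weights(1.0))  # full MTP: [0.25, 0.25, 0.25, 0.25]
```

A reverse curriculum would simply evaluate the same schedule at `1.0 - progress`.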
[NLP-139] Decomposing Elements of Problem Solving: What “Math” Does RL Teach?
【Quick Read】: This paper seeks a finer-grained assessment of large language models' (LLMs) mathematical reasoning, particularly the limitations of reinforcement learning (RL) methods such as GRPO in improving it. The key to the solution is decomposing problem solving into three fundamental capabilities: Plan (mapping a problem to a sequence of solution steps), Execute (carrying out the steps correctly), and Verify (identifying whether a solution is correct). Empirically, the authors find that GRPO mainly improves execution, while insufficient planning causes a "coverage wall" on fundamentally new problems. They further construct a minimal synthetic solution-tree navigation task as an analogy for mathematical problem solving, which replicates these findings and reveals conditions under which RL can overcome the limitation through improved exploration and generalization.
Link: https://arxiv.org/abs/2505.22756
Authors: Tian Qin,Core Francisco Park,Mujin Kwun,Aaron Walsman,Eran Malach,Nikhil Anand,Hidenori Tanaka,David Alvarez-Melis
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at this https URL.
[NLP-140] Climate Finance Bench
【Quick Read】: This paper studies question answering over corporate climate disclosures with large language models, requiring accurate extraction, numerical reasoning, and logical reasoning from sustainability reports. The key to the solution is a dataset of 33 recent English sustainability reports spanning all 11 GICS sectors with 330 expert-validated question-answer pairs, on which retrieval-augmented generation (RAG) approaches are compared; the retriever's ability to locate the passages that actually contain the answer turns out to be the chief performance bottleneck.
Link: https://arxiv.org/abs/2505.22752
Authors: Rafik Mankour,Yassine Chafai,Hamada Saleh,Ghassen Ben Hassine,Thibaud Barreau,Peter Tankov
Affiliations: CREST, ENSAE, Institut Polytechnique de Paris and Institut Louis Bachelier
Subjects: Computation and Language (cs.CL)
Comments: Dataset is available at this https URL
Abstract:Climate Finance Bench introduces an open benchmark that targets question-answering over corporate climate disclosures using Large Language Models. We curate 33 recent sustainability reports in English drawn from companies across all 11 GICS sectors and annotate 330 expert-validated question-answer pairs that span pure extraction, numerical reasoning, and logical reasoning. Building on this dataset, we propose a comparison of RAG (retrieval-augmented generation) approaches. We show that the retriever’s ability to locate passages that actually contain the answer is the chief performance bottleneck. We further argue for transparent carbon reporting in AI-for-climate applications, highlighting advantages of techniques such as Weight Quantization.
[NLP-141] Training Language Models to Generate Quality Code with Program Analysis Feedback
【Quick Read】: This paper addresses the difficulty of guaranteeing code quality in code generation with large language models (LLMs), especially regarding security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). The key to the solution is REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects, and (2) unit tests ensuring functional correctness. Unlike prior work, REAL is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention.
Link: https://arxiv.org/abs/2505.22704
Authors: Feng Yao,Zilong Wang,Liyuan Liu,Junxia Cui,Li Zhong,Xiaohan Fu,Haohui Mai,Vish Krishnan,Jianfeng Gao,Jingbo Shang
Affiliations: University of California, San Diego; Microsoft Research; CausalFlow Inc.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures
Abstract:Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
[NLP-142] VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
【Quick Read】: This paper aims to reduce the high inference cost of large vision-language models (LVLMs) caused by long visual token sequences, which limits real-time deployment. The key to the solution is VScan, a two-stage visual token reduction framework: first, complementary global and local scans with token merging during visual encoding reduce redundancy; second, pruning at intermediate layers of the language model further cuts computation while preserving model performance.
Link: https://arxiv.org/abs/2505.22654
Authors: Ce Zhang,Kaixin Ma,Tianqing Fang,Wenhao Yu,Hongming Zhang,Zhisong Zhang,Yaqi Xie,Katia Sycara,Haitao Mi,Dong Yu
Affiliations: Carnegie Mellon University; Tencent AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-arts on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91 \times speedup in prefilling and a 10 \times reduction in FLOPs, while retaining 95.4% of the original performance.
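For intuition on the token-merging idea, here is a generic similarity-based merge of visual tokens: the two most cosine-similar tokens are averaged into one. This is only a hedged sketch in the spirit of redundancy reduction; VScan's actual global/local scan and merge procedure differs, and all names here are our own.

```python
# Hypothetical sketch (not VScan's algorithm): reduce a token list by one,
# merging the pair of tokens with the highest cosine similarity.

import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def merge_most_similar(tokens):
    """Average the two most cosine-similar tokens into a single token."""
    i, j = max(
        ((i, j) for i in range(len(tokens)) for j in range(i + 1, len(tokens))),
        key=lambda ij: cosine(tokens[ij[0]], tokens[ij[1]]),
    )
    merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
    return [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]

# Two near-duplicate tokens and one distinct token: the duplicates merge.
toks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
out = merge_most_similar(toks)
print(len(out))  # 2
```

Applying such a step repeatedly trades token count against fidelity, which is the dial VScan's FLOP reductions turn.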
[NLP-143] NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding INTERSPEECH2025
【Quick Read】: This paper aims to make statistical n-gram language models practical for industrial automatic speech recognition (ASR), where poor parallelization has made them computationally inefficient. The key to the solution is redesigned data structures that support fast, GPU-optimized parallel operations, enabling customizable greedy decoding for all major ASR model types (transducers, attention encoder-decoder models, and CTC) with less than 7% computational overhead. The approach closes more than 50% of the accuracy gap between greedy and beam search in out-of-domain scenarios while avoiding the significant slowdown of beam search.
Link: https://arxiv.org/abs/2505.22857
Authors: Vladimir Bataev,Andrei Andrusenko,Lilit Grigoryan,Aleksandr Laptev,Vitaly Lavrukhin,Boris Ginsburg
Affiliations: NVIDIA
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted to Interspeech 2025
Abstract:Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM is open-sourced.
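The context-biasing idea behind greedy decoding with an n-gram LM can be sketched as shallow fusion: a weighted n-gram log-probability is added to the acoustic scores before the argmax. The toy bigram table, scores, and `greedy_step` helper below are illustrative assumptions; NGPU-LM's contribution is doing this with GPU-friendly parallel data structures, which a plain Python dict does not capture.

```python
# Illustrative sketch (not NVIDIA's implementation): shallow fusion of a
# bigram LM into a single greedy decoding step.

import math

bigram_logp = {  # toy biasing LM: after "new", strongly prefer "york"
    ("new", "york"): math.log(0.9),
    ("new", "work"): math.log(0.1),
}

def greedy_step(prev_word, am_scores, lm_weight=0.5):
    """Pick the argmax of acoustic log-score + weighted bigram log-prob."""
    def fused(word):
        lm = bigram_logp.get((prev_word, word), math.log(1e-4))  # backoff floor
        return am_scores[word] + lm_weight * lm
    return max(am_scores, key=fused)

# Acoustics slightly prefer "work", but the biasing LM flips the choice.
scores = {"york": math.log(0.45), "work": math.log(0.55)}
print(greedy_step("new", scores))  # york
```

Because the fusion happens inside the per-step argmax, it adds no search over hypotheses, which is why the paper's overhead stays so low relative to beam search.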
Computer Vision
[CV-0] TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models
【Quick Read】: This paper addresses the weakness of image-text models in fine-grained visual understanding: despite excelling at image-level tasks, they lack the ability to delineate objects' spatial boundaries precisely. The key to the solution is TextRegion, a training-free framework that combines the strengths of image-text models and the SAM2 segmentation model to produce powerful text-aligned region tokens, enabling detailed visual understanding while preserving open-vocabulary capabilities.
Link: https://arxiv.org/abs/2505.23769
Authors: Yao Xiao,Qiqian Fu,Heyi Tao,Yuqun Wu,Zhen Zhu,Derek Hoiem
Affiliations: University of Illinois at Urbana-Champaign
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at: this https URL
Abstract:Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: this https URL.
[CV-1] Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought CVPR2025
【Quick Read】: This paper addresses the inaccurate reasoning of multimodal large language models (MLLMs) in vision-centric scenarios, which stems from a lack of precise visual focus. The key to the solution is Argus, whose core is an object-centric visual attention grounding mechanism serving as a visual chain-of-thought signal, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks.
Link: https://arxiv.org/abs/2505.23766
Authors: Yunze Man,De-An Huang,Guilin Liu,Shiwei Sheng,Shilong Liu,Liang-Yan Gui,Jan Kautz,Yu-Xiong Wang,Zhiding Yu
Affiliations: University of Illinois Urbana-Champaign; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Project Page: this https URL
Abstract:Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective. Project page: this https URL
[CV-2] Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch CVPR2025
【Quick Read】: This paper addresses efficient inference on sketch data, where existing lightweight models designed for photos cannot be applied directly. The key to the solution is two sketch-specific, plug-and-play components: a cross-modal knowledge distillation network that adapts existing photo-efficient networks to be sketch-compatible, and an RL-based canvas selector that dynamically adjusts to the abstraction level to cut computation further. Together they reduce FLOPs by 99.37% while retaining accuracy.
Link: https://arxiv.org/abs/2505.23763
Authors: Aneeshan Sain,Subhajit Maity,Pinaki Nath Chowdhury,Subhadeep Koley,Ayan Kumar Bhunia,Yi-Zhe Song
Affiliations: SketchX, University of Surrey; Department of Computer Science, University of Central Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025, Project Page: this https URL
Abstract:As sketch research has collectively matured over time, its adaptation for at-mass commercialisation emerges on the immediate horizon. Despite an already mature research endeavour for photos, there is no research on the efficient inference specifically designed for sketch data. In this paper, we first demonstrate existing state-of-the-art efficient light-weight models designed for photos do not work on sketches. We then propose two sketch-specific components which work in a plug-n-play manner on any photo efficient network to adapt them to work on sketch data. We specifically chose fine-grained sketch-based image retrieval (FG-SBIR) as a demonstrator as the most recognised sketch problem with immediate commercial value. Technically speaking, we first propose a cross-modal knowledge distillation network to transfer existing photo efficient networks to be compatible with sketch, which brings down number of FLOPs and model parameters by 97.96% percent and 84.89% respectively. We then exploit the abstract trait of sketch to introduce a RL-based canvas selector that dynamically adjusts to the abstraction level which further cuts down number of FLOPs by two thirds. The end result is an overall reduction of 99.37% of FLOPs (from 40.18G to 0.254G) when compared with a full network, while retaining the accuracy (33.03% vs 32.77%) – finally making an efficient network for the sparse sketch data that exhibit even fewer FLOPs than the best photo counterpart.
[CV-3] LoRAShop: Training-Free Multi-Concept Image Generation and Editing with Rectified Flow Transformers
【Quick Read】: This paper addresses multi-concept image editing: integrating multiple subjects or styles into one image while preserving global context, lighting, and fine details. The key to the solution is exploiting the feature interaction patterns inside Flux-style diffusion transformers: a disentangled latent mask is derived for each concept in a prior forward pass, and the corresponding LoRA (Low-Rank Adaptation) weights are blended only within the regions bounding each target concept, enabling precise, personalized editing.
Link: https://arxiv.org/abs/2505.23758
Authors: Yusuf Dalva,Hidir Yesiltepe,Pinar Yanardag
Affiliations: Virginia Tech
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Webpage: this https URL
Abstract:We introduce LoRAShop, the first framework for multi-concept image editing with LoRA models. LoRAShop builds on a key observation about the feature interaction patterns inside Flux-style diffusion transformers: concept-specific transformer features activate spatially coherent regions early in the denoising process. We harness this observation to derive a disentangled latent mask for each concept in a prior forward pass and blend the corresponding LoRA weights only within regions bounding the concepts to be personalized. The resulting edits seamlessly integrate multiple subjects or styles into the original scene while preserving global context, lighting, and fine details. Our experiments demonstrate that LoRAShop delivers better identity preservation compared to baselines. By eliminating retraining and external constraints, LoRAShop turns personalized diffusion models into a practical 'photoshop-with-LoRAs' tool and opens new avenues for compositional visual storytelling and rapid creative iteration.
[CV-4] Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models
【Quick Read】: This paper addresses the underperformance of Vision-Language-Action (VLA) models for autonomous driving in unstructured corner-case scenarios, largely due to a scarcity of targeted benchmarks. The key to the solution is the Impromptu VLA Dataset: over 80,000 carefully curated video clips distilled from 2M raw clips across 8 open-source large-scale datasets, organized by a novel taxonomy of four challenging unstructured categories and featuring rich, planning-oriented question-answering annotations and action trajectories. Experiments show that VLAs trained on this dataset achieve substantial performance gains on established benchmarks.
Link: https://arxiv.org/abs/2505.23757
Authors: Haohan Chi,Huan-ang Gao,Ziming Liu,Jianing Liu,Chenyu Liu,Jinwei Li,Kaisen Yang,Yangcheng Yu,Zeda Wang,Wenyi Li,Leichen Wang,Xingtao Hu,Hao Sun,Hang Zhao,Hao Zhao
Affiliations: Tsinghua University; Bosch Research; IIIS, Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips sourced from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks–improving closed-loop NeuroNCAP scores and collision rates, and reaching near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our QA suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at this https URL.
[CV-5] Rooms from Motion: Un-posed Indoor 3D Object Detection as Localization and Mapping
【Quick Read】: This paper revisits scene-level 3D object detection as simultaneous localization and mapping from un-posed images, without prior camera poses. The key to the solution is Rooms from Motion (RfM), an object-centric framework that uses image-derived 3D oriented boxes as the underlying geometric primitive, replacing the standard 2D keypoint matcher of structure-from-motion to estimate metric camera poses and object tracks and finally produce a global, semantic 3D object map. When prior poses are available, optimizing the global 3D boxes against individual observations further improves map quality, yielding localization and maps superior to global methods based on point clouds or dense volumes.
Link: https://arxiv.org/abs/2505.23756
Authors: Justin Lazarow,Kai Kang,Afshin Dehghan
Affiliations: Apple
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We revisit scene-level 3D object detection as the output of an object-centric framework capable of both localization and mapping using 3D oriented boxes as the underlying geometric primitive. While existing 3D object detection approaches operate globally and implicitly rely on the a priori existence of metric camera poses, our method, Rooms from Motion (RfM) operates on a collection of un-posed images. By replacing the standard 2D keypoint-based matcher of structure-from-motion with an object-centric matcher based on image-derived 3D boxes, we estimate metric camera poses, object tracks, and finally produce a global, semantic 3D object map. When a priori pose is available, we can significantly improve map quality through optimization of global 3D boxes against individual observations. RfM shows strong localization performance and subsequently produces maps of higher quality than leading point-based and multi-view 3D object detection methods on CA-1M and ScanNet++, despite these global methods relying on overparameterization through point clouds or dense volumes. Rooms from Motion achieves a general, object-centric representation which not only extends the work of Cubify Anything to full scenes but also allows for inherently sparse localization and parametric mapping proportional to the number of objects in a scene.
[CV-6] ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks
【Quick Read】: This paper addresses the lack of domain-specific benchmarks for evaluating large language models' (LLMs) tool-use capabilities in complex remote sensing scenarios. The key to the solution is ThinkGeo, an agentic benchmark built around structured tool use and multi-step planning, with queries covering real-world applications such as urban planning, disaster assessment, and environmental monitoring, each grounded in satellite or aerial imagery. It adopts a ReAct-style interaction loop and evaluates both open- and closed-source LLMs, providing a comprehensive analysis of tool accuracy and planning consistency.
Link: https://arxiv.org/abs/2505.23752
Authors: Akashah Shabbir,Muhammad Akhtar Munir,Akshay Dudhane,Muhammad Umer Sheikh,Muhammad Haris Khan,Paolo Fraccaro,Juan Bernabe Moreno,Fahad Shahbaz Khan,Salman Khan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark designed to evaluate LLM-driven agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications such as urban planning, disaster assessment and change analysis, environmental monitoring, transportation analysis, aviation monitoring, recreational infrastructure, and industrial site analysis. Each query is grounded in satellite or aerial imagery and requires agents to reason through a diverse toolset. We implement a ReAct-style interaction loop and evaluate both open and closed-source LLMs (e.g., GPT-4o, Qwen2.5) on 436 structured agentic tasks. The benchmark reports both step-wise execution metrics and final answer correctness. Our analysis reveals notable disparities in tool accuracy and planning consistency across models. ThinkGeo provides the first extensive testbed for evaluating how tool-enabled LLMs handle spatial reasoning in remote sensing. Our code and dataset are publicly available
[CV-7] REOrdering Patches Improves Vision Models
【Quick Read】: This paper addresses the performance degradation caused by patch-order sensitivity in vision sequence models: although full self-attention is permutation-equivariant, modern long-sequence transformer approximations break this invariance, so the conventional fixed row-major flattening can hurt performance. The key to the solution is REOrder, a two-stage framework that first derives an information-theoretic prior by evaluating the compressibility of candidate patch sequences, then learns a task-optimal patch ordering by optimizing a Plackett-Luce policy with REINFORCE, enabling efficient learning over the combinatorial permutation space.
Link: https://arxiv.org/abs/2505.23751
Authors: Declan Kutscher,David M. Chan,Yutong Bai,Trevor Darrell,Ritwik Gupta
Affiliations: University of Pittsburgh; University of California, Berkeley
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
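A Plackett-Luce policy over permutations can be sampled sequentially without replacement, accumulating the log-probability that REINFORCE would then differentiate with respect to the scores. The sketch below shows this sampling step under our own assumptions; it is not the authors' implementation.

```python
# Hypothetical sketch: sample a patch permutation from a Plackett-Luce model
# (selection probability proportional to exp(score), without replacement) and
# return the permutation together with its log-probability.

import math
import random

def sample_permutation(scores, rng):
    """Sequentially sample a permutation of indices and its log-probability."""
    remaining = list(range(len(scores)))
    perm, logp = [], 0.0
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        total = sum(weights)
        pick = rng.choices(range(len(remaining)), weights=weights)[0]
        logp += math.log(weights[pick] / total)  # Plackett-Luce step prob
        perm.append(remaining.pop(pick))
    return perm, logp

rng = random.Random(0)
perm, logp = sample_permutation([2.0, 0.0, -2.0], rng)
print(sorted(perm))  # [0, 1, 2] -- a valid permutation of the 3 patches
```

In a REINFORCE update, `logp` would be scaled by the task reward and backpropagated into the scores, pushing probability mass toward orderings that improve accuracy.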
[CV-8] Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
【Quick Read】: This paper addresses the difficulty of improving multimodal large language models' (MLLMs) spatial intelligence from 2D inputs (images or videos) alone; existing 3D MLLMs typically rely on extra 3D or 2.5D data for spatial awareness, limiting their use when only 2D inputs are available. The key to the solution is exploiting structure priors from a feed-forward visual geometry foundation model through a dual-encoder architecture: a pretrained 2D visual encoder extracts semantic features, while a spatial encoder initialized from the visual geometry model's backbone extracts 3D structure features; a connector then fuses both into unified visual tokens for enhanced spatial understanding. A space-aware frame sampling strategy at inference time additionally selects spatially informative frames, so even under a limited token budget the model focuses on the frames critical for spatial reasoning.
Link: https://arxiv.org/abs/2505.23747
Authors: Diankun Wu,Fangfu Liu,Yi-Hsin Hung,Yueqi Duan
Affiliations: Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder-initialized from the backbone of the visual geometry model-to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that our spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: this https URL.
zh
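Spatial-MLLM 的双编码器架构把 2D 语义特征与 3D 结构特征经连接器融合为统一视觉 token。下面用 numpy 给出一个只演示维度变换的极简示意(以单层线性投影代替论文中可学习的连接器模块,维度数值均为笔者假设):

```python
import numpy as np

rng = np.random.default_rng(0)

def connector(semantic_tokens, structure_tokens, w):
    """把每个 patch 的语义特征与结构特征拼接后投影为统一 token(实际为可学习模块)。"""
    fused = np.concatenate([semantic_tokens, structure_tokens], axis=-1)
    return fused @ w  # (n_tokens, d_model)

n_tokens, d_sem, d_struct, d_model = 16, 32, 24, 48
sem = rng.standard_normal((n_tokens, d_sem))        # 预训练 2D 视觉编码器输出
struct = rng.standard_normal((n_tokens, d_struct))  # 几何模型初始化的空间编码器输出
w = rng.standard_normal((d_sem + d_struct, d_model)) * 0.02
tokens = connector(sem, struct, w)
```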
[CV-9] To Trust Or Not To Trust Your Vision-Language Model's Prediction
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在面对误分类时容易产生自信但错误预测的问题,这一问题在安全关键领域可能带来严重后果。其解决方案的关键在于提出一种无需训练的框架TrustVLM,该框架通过利用图像嵌入空间中某些概念更明确的表示特性,设计了一个新颖的置信度评分函数,以提升误分类检测的效果。
链接: https://arxiv.org/abs/2505.23745
作者: Hao Dong,Moru Liu,Jian Liang,Eleni Chatzi,Olga Fink
机构: ETH Zürich (ETH Zurich); Technical University of Munich (慕尼黑工业大学); NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所NLPR与MAIS); University of Chinese Academy of Sciences (中国科学院大学); EPFL (瑞士联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Vision-Language Models (VLMs) have demonstrated strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. While they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification, often yielding confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can lead to severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when VLM’s predictions can be trusted. Motivated by the observed modality gap in VLMs and the insight that certain concepts are more distinctly represented in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection. We rigorously evaluate our approach across 17 diverse datasets, employing 4 architectures and 2 VLMs, and demonstrate state-of-the-art performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 compared to existing baselines. By improving the reliability of the model without requiring retraining, TrustVLM paves the way for safer deployment of VLMs in real-world applications. The code will be available at this https URL.
zh
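TrustVLM 的核心是利用图像嵌入空间构造置信度评分以检测误分类。下面给出一个假设性的简化评分:文本空间 softmax 置信度与预测类别在图像嵌入空间中原型相似度的加权和(加权形式、`alpha` 取值与原型的构造均为笔者示意,并非论文原式):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def trust_score(img_emb, text_embs, img_protos, alpha=0.5):
    """假设性置信度:文本空间 softmax 置信度 + 预测类别图像空间原型的余弦相似度。"""
    text_probs = softmax(text_embs @ img_emb)
    pred = int(np.argmax(text_probs))
    proto_sim = img_protos[pred] @ img_emb / (
        np.linalg.norm(img_protos[pred]) * np.linalg.norm(img_emb) + 1e-8)
    return alpha * text_probs[pred] + (1 - alpha) * proto_sim, pred
```

当预测类别在两个空间中都与图像一致时评分接近 1,两个空间出现分歧时评分下降,可用于误分类检测的阈值判断。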
[CV-10] Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need CVPR2025
【速读】:该论文试图解决深度神经网络(Deep Neural Networks, DNNs)在动态现实场景中因数据分布随时间变化而导致性能下降的问题,特别是在领域增量学习(Domain Incremental Learning, DIL)中参数选择准确性不足的问题。解决方案的关键在于提出一种轻量级框架SOYO,其核心包括高斯混合压缩器(Gaussian Mixture Compressor, GMC)和领域特征重采样器(Domain Feature Resampler, DFR),用于高效存储和平衡先验领域数据,以及多层级领域特征融合网络(Multi-level Domain Feature Fusion Network, MDFN),以增强领域特征提取能力。
链接: https://arxiv.org/abs/2505.23744
作者: Qiang Wang,Xiang Song,Yuhang He,Jizhou Han,Chenhao Ding,Xinyuan Gao,Yihong Gong
机构: Xi’an Jiaotong University (西安交通大学); Shenzhen University of Advanced Technology (深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CVPR 2025
Abstract:Deep neural networks (DNNs) often underperform in real-world, dynamic settings where data distributions change over time. Domain Incremental Learning (DIL) offers a solution by enabling continual model adaptation, with Parameter-Isolation DIL (PIDIL) emerging as a promising paradigm to reduce knowledge conflicts. However, existing PIDIL methods struggle with parameter selection accuracy, especially as the number of domains and corresponding classes grows. To address this, we propose SOYO, a lightweight framework that improves domain selection in PIDIL. SOYO introduces a Gaussian Mixture Compressor (GMC) and Domain Feature Resampler (DFR) to store and balance prior domain data efficiently, while a Multi-level Domain Feature Fusion Network (MDFN) enhances domain feature extraction. Our framework supports multiple Parameter-Efficient Fine-Tuning (PEFT) methods and is validated across tasks such as image classification, object detection, and speech enhancement. Experimental results on six benchmarks demonstrate SOYO’s consistent superiority over existing baselines, showcasing its robustness and adaptability in complex, evolving environments. The codes will be released in this https URL.
zh
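SOYO 用 GMC 把先前领域的特征库压缩为高斯统计量存储,再由 DFR 按需重采样以平衡各领域数据。下面以单分量高斯近似给出一个示意(论文中 GMC 使用高斯混合,此处的简化与函数命名均为笔者假设):

```python
import numpy as np

def compress_domain(features):
    """把一个领域的特征库压缩为高斯统计量(单分量近似,论文为高斯混合)。"""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mu, cov

def resample_domain(mu, cov, n, seed=0):
    """DFR 的简化对应:按存储的统计量重采样,得到数量均衡的伪特征。"""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu, cov, size=n)
```

这样只需存储每个领域的 (mu, cov) 而非全部样本,存储开销与领域数量成正比、与样本数无关。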
[CV-11] DarkDiff: Advancing Low-Light Raw Enhancement by Retasking Diffusion Models for Camera ISP
【速读】:该论文试图解决在极端低光照条件下高质量摄影的难题,特别是在传统图像信号处理器(ISP)算法被深度网络取代后,现有基于回归的模型容易导致低光照片或深阴影区域过度平滑的问题。其解决方案的关键在于引入一种新颖的框架,通过重新调整预训练的生成式扩散模型(Generative Diffusion Models)与相机ISP相结合,以提升低光原始图像的感知质量。
链接: https://arxiv.org/abs/2505.23743
作者: Amber Yijia Zheng,Yu Zhang,Jun Hu,Raymond A. Yeh,Chen Chen
机构: Purdue University (普渡大学); Apple Inc (苹果公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-quality photography in extreme low-light conditions is challenging but impactful for digital cameras. With advanced computing hardware, traditional camera image signal processor (ISP) algorithms are gradually being replaced by efficient deep networks that enhance noisy raw images more intelligently. However, existing regression-based models often minimize pixel errors and result in oversmoothing of low-light photos or deep shadows. Recent work has attempted to address this limitation by training a diffusion model from scratch, yet those models still struggle to recover sharp image details and accurate colors. We introduce a novel framework to enhance low-light raw images by retasking pre-trained generative diffusion models with the camera ISP. Extensive experiments demonstrate that our method outperforms the state-of-the-art in perceptual quality across three challenging low-light raw image benchmarks.
zh
[CV-12] MAGREF: Masked Guidance for Any-Reference Video Generation
【速读】:该论文旨在解决多参考主体视频生成中保持多主体一致性和确保高质量生成的挑战。其解决方案的关键在于提出MAGREF框架,该框架引入了掩码引导机制,以实现基于多样本图像和文本提示的连贯多主体视频合成,具体包括区域感知的动态掩码机制和像素级通道拼接机制,从而在不改变模型结构的情况下灵活处理多种主体推理,并更好地保留外观特征。
链接: https://arxiv.org/abs/2505.23742
作者: Yufan Deng,Xun Guo,Yuanyang Yin,Jacob Zhiyuan Fang,Yiding Yang,Yizhi Wang,Shenghai Yuan,Angtian Wang,Bo Liu,Haibin Huang,Chongyang Ma
机构: ByteDance Intelligent Creation(字节跳动智能创作)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project website: this https URL
Abstract:Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: this https URL
zh
[CV-13] LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization
【速读】:该论文旨在解决图像向量化过程中遮挡区域处理不足的问题,现有工具在处理遮挡区域时会产生不完整或碎片化的形状,影响可编辑性。论文提出的解决方案是LayerPeeler,其关键在于采用自回归剥层策略,通过识别并移除最上层的非遮挡层,同时恢复底层内容,从而生成具有完整路径和连贯层结构的矢量图形。该方法利用视觉-语言模型构建层图以捕捉元素间的遮挡关系,并结合微调的图像扩散模型进行精确的层移除操作。
链接: https://arxiv.org/abs/2505.23740
作者: Ronghuan Wu,Wanchao Su,Jing Liao
机构: City University of Hong Kong(香港城市大学); Monash University(莫纳什大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL
Abstract:Image vectorization is a powerful technique that converts raster images into vector graphics, enabling enhanced flexibility and interactivity. However, popular image vectorization tools struggle with occluded regions, producing incomplete or fragmented shapes that hinder editability. While recent advancements have explored rule-based and data-driven layer-wise image vectorization, these methods face limitations in vectorization quality and flexibility. In this paper, we introduce LayerPeeler, a novel layer-wise image vectorization approach that addresses these challenges through a progressive simplification paradigm. The key to LayerPeeler’s success lies in its autoregressive peeling strategy: by identifying and removing the topmost non-occluded layers while recovering underlying content, we generate vector graphics with complete paths and coherent layer structures. Our method leverages vision-language models to construct a layer graph that captures occlusion relationships among elements, enabling precise detection and description for non-occluded layers. These descriptive captions are used as editing instructions for a finetuned image diffusion model to remove the identified layers. To ensure accurate removal, we employ localized attention control that precisely guides the model to target regions while faithfully preserving the surrounding content. To support this, we contribute a large-scale dataset specifically designed for layer peeling tasks. Extensive quantitative and qualitative experiments demonstrate that LayerPeeler significantly outperforms existing techniques, producing vectorization results with superior path semantics, geometric regularity, and visual fidelity.
zh
[CV-14] How Animals Dance (When You're Not Looking)
【速读】:该论文试图解决生成与音乐同步且符合舞蹈编排的动物舞蹈视频的问题,其核心挑战在于如何从少量关键帧中合成连贯且具有节奏感的动画。解决方案的关键在于将舞蹈生成建模为图优化问题,通过寻找满足指定节拍编排模式的最佳关键帧结构,并结合视频扩散模型生成中间帧,从而实现高质量的舞蹈视频生成。
链接: https://arxiv.org/abs/2505.23738
作者: Xiaojuan Wang,Aleksander Holynski,Brian Curless,Ira Kemelmacher,Steve Seitz
机构: University of Washington(华盛顿大学); UC Berkeley(加州大学伯克利分校); Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL
Abstract:We present a keyframe-based framework for generating music-synchronized, choreography aware animal dance videos. Starting from a few keyframes representing distinct animal poses – generated via text-to-image prompting or GPT-4o – we formulate dance synthesis as a graph optimization problem: find the optimal keyframe structure that satisfies a specified choreography pattern of beats, which can be automatically estimated from a reference dance video. We also introduce an approach for mirrored pose image generation, essential for capturing symmetry in dance. In-between frames are synthesized using a video diffusion model. With as few as six input keyframes, our method can produce up to 30-second dance videos across a wide range of animals and music tracks.
zh
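论文把舞蹈合成表述为图优化:寻找满足节拍编排模式的最优关键帧结构。这类"每个节拍选一个姿态 + 相邻姿态切换代价"的结构可以用动态规划(最短路)求解;下面是一个高度简化的示意,代价矩阵的含义与建模方式均为笔者假设:

```python
import numpy as np

def best_keyframe_path(transition_cost, pattern_cost):
    """把关键帧选择视为最短路:pattern_cost[t, k] 是第 t 个节拍使用姿态 k 的代价,
    transition_cost[j, k] 是相邻节拍从姿态 j 切换到姿态 k 的代价。"""
    T, K = pattern_cost.shape
    dp = pattern_cost[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        step = dp[:, None] + transition_cost + pattern_cost[t][None, :]
        back[t] = step.argmin(axis=0)
        dp = step.min(axis=0)
    path = [int(dp.argmin())]           # 末端回溯
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(dp.min())
```

得到关键帧序列后,中间帧再交给视频扩散模型插值合成。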
[CV-15] ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS
【速读】:该论文旨在解决前向3D高斯点云(feed-forward 3D Gaussian Splatting, 3DGS)模型在输入视图数量增加时性能下降或内存消耗过大的可扩展性问题。其关键解决方案是引入ZPressor,一个与架构无关的轻量级模块,通过信息瓶颈原理,将多视角输入高效压缩为保留场景关键信息的紧凑潜在状态Z,具体通过将视图划分为锚点集和支撑集,并利用交叉注意力机制将支撑视图的信息压缩到锚点视图中,从而实现模型对超过100个输入视图的高效处理。
链接: https://arxiv.org/abs/2505.23734
作者: Weijie Wang,Donny Y. Chen,Zeyu Zhang,Duochao Shi,Akide Liu,Bohan Zhuang
机构: ZIP Lab, Zhejiang University (浙江大学); ByteDance Seed (字节跳动种子项目); Monash University (莫纳什大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL , Code: this https URL
Abstract:Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their encoders, leading to degraded performance or excessive memory consumption as the number of input views increases. In this work, we analyze feed-forward 3DGS frameworks through the lens of the Information Bottleneck principle and introduce ZPressor, a lightweight architecture-agnostic module that enables efficient compression of multi-view inputs into a compact latent state Z that retains essential scene information while discarding redundancy. Concretely, ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU, by partitioning the views into anchor and support sets and using cross attention to compress the information from the support views into anchor views, forming the compressed latent state Z. We show that integrating ZPressor into several state-of-the-art feed-forward 3DGS models consistently improves performance under moderate input views and enhances robustness under dense view settings on two large-scale benchmarks DL3DV-10K and RealEstate10K. The video results, code and trained models are available on our project page: this https URL.
zh
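ZPressor 的压缩机制是让锚点视图通过交叉注意力吸收支撑视图的信息,得到规模只取决于锚点数量的潜在状态 Z。下面是单头交叉注意力的一个 numpy 示意(残差连接与缩放方式为常规做法,非论文精确结构):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_views(anchor_tokens, support_tokens):
    """单头交叉注意力示意:锚点 token 作为 query,支撑视图 token 作为 key/value,
    输出 Z 的大小只与锚点 token 数有关,与支撑视图数量无关。"""
    d = anchor_tokens.shape[-1]
    attn = softmax(anchor_tokens @ support_tokens.T / np.sqrt(d), axis=-1)
    return anchor_tokens + attn @ support_tokens  # 残差更新
```

无论支撑视图贡献多少 token(例如上百个视图),压缩后的 Z 形状始终等于锚点 token 的形状,这正是可扩展性的来源。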
[CV-16] PixelThink: Towards Efficient Chain-of-Pixel Reasoning
【速读】:该论文旨在解决现有推理分割方法在分布外场景中泛化能力不足的问题,以及基于强化学习的推理过程容易出现的过度思考(overthinking)问题,即生成冗长且与任务复杂度无关的推理链,导致计算成本增加和推理质量控制受限。其解决方案的关键在于提出PixelThink,该方法通过整合外部估计的任务难度和内部测量的模型不确定性,在强化学习框架内调控推理生成,使模型能够根据场景复杂性和预测置信度动态调整推理长度,从而提升推理效率与分割性能。
链接: https://arxiv.org/abs/2505.23727
作者: Song Wang,Gongfan Fang,Lingdong Kong,Xiangtai Li,Jianyun Xu,Sheng Yang,Qiang Li,Jianke Zhu,Xinchao Wang
机构: Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学); Nanyang Technological University (南洋理工大学); AD Lab, CaiNiao Inc., Alibaba Group (AD实验室,菜鸟网络公司,阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Project Page: this https URL
Abstract:Existing reasoning segmentation approaches typically fine-tune multimodal large language models (MLLMs) using image-text pairs and corresponding mask labels. However, they exhibit limited generalization to out-of-distribution scenarios without an explicit reasoning process. Although recent efforts leverage reinforcement learning through group-relative policy optimization (GRPO) to enhance reasoning ability, they often suffer from overthinking - producing uniformly verbose reasoning chains irrespective of task complexity. This results in elevated computational costs and limited control over reasoning quality. To address this problem, we propose PixelThink, a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty to regulate reasoning generation within a reinforcement learning paradigm. The model learns to compress reasoning length in accordance with scene complexity and predictive confidence. To support comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark with annotated reasoning references and difficulty scores, along with a suite of metrics designed to assess segmentation accuracy, reasoning quality, and efficiency jointly. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance. Our work contributes novel perspectives towards efficient and interpretable multimodal understanding. The code and model will be publicly available.
zh
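PixelThink 用外部估计的任务难度与内部的模型不确定性共同调控推理链长度。下面是一个玩具化的奖励整形示意(目标长度的具体函数形式、`budget` 等参数完全是笔者假设,仅用于说明"难度越高、置信度越低,允许的推理越长"这一思路):

```python
def length_reward(n_tokens, difficulty, confidence, budget=256):
    """玩具奖励:difficulty、confidence 均在 [0, 1];
    简单且高置信的输入目标长度短,困难且低置信的输入目标长度长,
    偏离目标长度越远奖励越低。"""
    target = budget * (0.25 + 0.75 * difficulty) * (1.0 - 0.5 * confidence)
    return -abs(n_tokens - target) / budget
```

把这样的长度项加入强化学习的奖励中,模型就会按场景复杂度压缩或放宽推理链,而不是对所有输入都输出同样冗长的推理。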
[CV-17] FMG-Det: Foundation Model Guided Robust Object Detection ICIP2025
【速读】:该论文试图解决在目标检测任务中由于标注边界主观性导致的高质量数据收集难题,特别是在对象边界部分可见或模糊的情况下,噪声标注会显著降低检测器性能,尤其在小样本设置下更为明显。解决方案的关键在于提出FMG-Det方法,该方法结合了多实例学习(Multiple Instance Learning, MIL)框架与一个利用强大基础模型进行标注校正的预处理流程,通过该流程以及对检测器头部的微调,实现了在标准和小样本场景下的最先进性能,同时相比其他方法更为简单高效。
链接: https://arxiv.org/abs/2505.23726
作者: Darryl Hannan,Timothy Doster,Henry Kvinge,Adam Attarian,Yijing Watkins
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, ICIP 2025
Abstract:Collecting high quality data for object detection tasks is challenging due to the inherent subjectivity in labeling the boundaries of an object. This makes it difficult to not only collect consistent annotations across a dataset but also to validate them, as no two annotators are likely to label the same object using the exact same coordinates. These challenges are further compounded when object boundaries are partially visible or blurred, which can be the case in many domains. Training on noisy annotations significantly degrades detector performance, rendering them unusable, particularly in few-shot settings, where just a few corrupted annotations can impact model performance. In this work, we propose FMG-Det, a simple, efficient methodology for training models with noisy annotations. More specifically, we propose combining a multiple instance learning (MIL) framework with a pre-processing pipeline that leverages powerful foundation models to correct labels prior to training. This pre-processing pipeline, along with slight modifications to the detector head, results in state-of-the-art performance across a number of datasets, for both standard and few-shot scenarios, while being much simpler and more efficient than other approaches.
zh
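FMG-Det 的预处理流程用基础模型的候选框在训练前修正带噪标注。其匹配逻辑可以用"按 IoU 吸附"来示意:每个带噪框吸附到重叠最大的基础模型候选框,重叠不足则保留原框(阈值与具体匹配策略为笔者假设,候选框在论文中来自强大的基础模型):

```python
def iou(a, b):
    """两个 (x1, y1, x2, y2) 框的交并比。"""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def correct_labels(noisy_boxes, model_boxes, iou_thresh=0.3):
    """把每个带噪标注吸附到重叠最大的基础模型候选框;无足够重叠时保留原框。"""
    corrected = []
    for nb in noisy_boxes:
        best = max(model_boxes, key=lambda mb: iou(nb, mb), default=None)
        if best is not None and iou(nb, best) >= iou_thresh:
            corrected.append(best)
        else:
            corrected.append(nb)
    return corrected
```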
[CV-18] AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views
【速读】:该论文试图解决从未校准的图像集合中进行新颖视角合成的问题,传统神经渲染流程需要已知的相机位姿和每场景优化,而近期的前馈方法则因密集视角的计算负担而表现不佳。解决方案的关键在于AnySplat模型能够通过一次前向传播预测所有内容,生成编码场景几何与外观的3D高斯基元,并为每个输入图像提供对应的相机内参和外参,从而在无需姿态标注的情况下高效处理多视角数据集。
链接: https://arxiv.org/abs/2505.23716
作者: Lihan Jiang,Yucheng Mao,Linning Xu,Tao Lu,Kerui Ren,Yichen Jin,Xudong Xu,Mulin Yu,Jiangmiao Pang,Feng Zhao,Dahua Lin,Bo Dai
机构: The University of Science and Technology of China (中国科学技术大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); Brown University (布朗大学); Shanghai Jiao Tong University (上海交通大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:We introduce AnySplat, a feed forward network for novel view synthesis from uncalibrated image collections. In contrast to traditional neural rendering pipelines that demand known camera poses and per scene optimization, or recent feed forward methods that buckle under the computational weight of dense views, our model predicts everything in one shot. A single forward pass yields a set of 3D Gaussian primitives encoding both scene geometry and appearance, and the corresponding camera intrinsics and extrinsics for each input image. This unified design scales effortlessly to casually captured, multi view datasets without any pose annotations. In extensive zero shot evaluations, AnySplat matches the quality of pose aware baselines in both sparse and dense view scenarios while surpassing existing pose free approaches. Moreover, it greatly reduces rendering latency compared to optimization based neural fields, bringing real time novel view synthesis within reach for unconstrained capture. Project page: this https URL
zh
[CV-19] Skin Lesion Phenotyping via Nested Multi-modal Contrastive Learning
【速读】:该论文旨在解决皮肤病变分类任务中由于成像条件差异大(如光照、色彩、分辨率、距离等)以及缺乏临床和表型背景信息所带来的挑战。其解决方案的关键在于提出了一种名为SLIMP(Skin Lesion Image-Metadata Pre-training)的新型嵌套对比学习方法,通过结合皮肤病变的外观特征与患者层面的元数据(如病史和其他临床相关信息),充分利用所有可用的数据模态进行预训练,从而提升下游皮肤病变分类任务的性能。
链接: https://arxiv.org/abs/2505.23709
作者: Dionysis Christopoulos,Sotiris Spanos,Eirini Baltzi,Valsamis Ntouskos,Konstantinos Karantzalos
机构: Remote Sensing Lab, National Technical University of Athens, Athens, Greece; Department of Engineering and Sciences, Universitas Mercatorum, Rome, Italy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We introduce SLIMP (Skin Lesion Image-Metadata Pre-training) for learning rich representations of skin lesions through a novel nested contrastive learning approach that captures complex relationships between images and metadata. Melanoma detection and skin lesion classification based solely on images, pose significant challenges due to large variations in imaging conditions (lighting, color, resolution, distance, etc.) and lack of clinical and phenotypical context. Clinicians typically follow a holistic approach for assessing the risk level of the patient and for deciding which lesions may be malignant and need to be excised, by considering the patient’s medical history as well as the appearance of other lesions of the patient. Inspired by this, SLIMP combines the appearance and the metadata of individual skin lesions with patient-level metadata relating to their medical record and other clinically relevant information. By fully exploiting all available data modalities throughout the learning process, the proposed pre-training strategy improves performance compared to other pre-training strategies on downstream skin lesions classification tasks highlighting the learned representations quality.
zh
[CV-20] CLDTracker: A Comprehensive Language Description for Visual Tracking
【速读】:该论文旨在解决视觉目标跟踪(Visual Object Tracking, VOT)在动态外观变化、遮挡和背景干扰等复杂场景下的性能瓶颈问题。传统跟踪方法主要依赖视觉线索,在此类场景中表现不佳。为克服这一问题,论文提出了一种名为CLDTracker的新型综合语言描述框架,其关键在于引入了双分支架构,包括文本分支和视觉分支。在文本分支中,通过利用强大的视觉语言模型(Vision-Language Models, VLMs)如CLIP和GPT-4V生成丰富的文本描述,以弥补目标对象语义细节表达不足的问题;同时,通过高效的视觉与文本特征融合机制,实现对目标的全面理解,并引入时序建模以适应目标外观的演变,从而提升跟踪的鲁棒性和准确性。
链接: https://arxiv.org/abs/2505.23704
作者: Mohamad Alansari,Sajid Javed,Iyyakutti Iyappan Ganapathi,Sara Alansari,Muzammal Naseer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 47 pages, 9 figures, Information Fusion Journal
Abstract:VOT remains a fundamental yet challenging task in computer vision due to dynamic appearance changes, occlusions, and background clutter. Traditional trackers, relying primarily on visual cues, often struggle in such complex scenarios. Recent advancements in VLMs have shown promise in semantic understanding for tasks like open-vocabulary detection and image captioning, suggesting their potential for VOT. However, the direct application of VLMs to VOT is hindered by critical limitations: the absence of a rich and comprehensive textual representation that semantically captures the target object’s nuances, limiting the effective use of language information; inefficient fusion mechanisms that fail to optimally integrate visual and textual features, preventing a holistic understanding of the target; and a lack of temporal modeling of the target’s evolving appearance in the language domain, leading to a disconnect between the initial description and the object’s subsequent visual changes. To bridge these gaps and unlock the full potential of VLMs for VOT, we propose CLDTracker, a novel Comprehensive Language Description framework for robust visual Tracking. Our tracker introduces a dual-branch architecture consisting of a textual and a visual branch. In the textual branch, we construct a rich bag of textual descriptions derived by harnessing the powerful VLMs such as CLIP and GPT-4V, enriched with semantic and contextual cues to address the lack of rich textual representation. Experiments on six standard VOT benchmarks demonstrate that CLDTracker achieves SOTA performance, validating the effectiveness of leveraging robust and temporally-adaptive vision-language representations for tracking. Code and models are publicly available at: this https URL
zh
[CV-21] DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers CVPR2025
【速读】:该论文试图解决视觉提示调优(Visual Prompt Tuning, VPT)中提示与图像标记之间的分布关系未被充分探索的问题,以及如何通过语义信息提升视觉变换器(Vision Transformer, ViT)模型微调效果的问题。其解决方案的关键在于提出一种新的框架——分布感知视觉提示调优(Distribution Aware Visual Prompt Tuning, DA-VPT),通过从与类别相关的语义数据中学习距离度量来引导提示的分布,从而增强提示作为图像块与类别标记之间语义信息桥梁的作用。
链接: https://arxiv.org/abs/2505.23694
作者: Li Ren,Chen Chen,Liqiang Wang,Kien Hua
机构: University of Central Florida (佛罗里达中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025
Abstract:Visual Prompt Tuning (VPT) has become a promising solution for Parameter-Efficient Fine-Tuning (PEFT) approach for Vision Transformer (ViT) models by partially fine-tuning learnable tokens while keeping most model parameters frozen. Recent research has explored modifying the connection structures of the prompts. However, the fundamental correlation and distribution between the prompts and image tokens remain unexplored. In this paper, we leverage metric learning techniques to investigate how the distribution of prompts affects fine-tuning performance. Specifically, we propose a novel framework, Distribution Aware Visual Prompt Tuning (DA-VPT), to guide the distributions of the prompts by learning the distance metric from their class-related semantic data. Our method demonstrates that the prompts can serve as an effective bridge to share semantic information between image patches and the class token. We extensively evaluated our approach on popular benchmarks in both recognition and segmentation tasks. The results demonstrate that our approach enables more effective and efficient fine-tuning of ViT models by leveraging semantic information to guide the learning of the prompts, leading to improved performance on various downstream vision tasks.
zh
[CV-22] Mobi-π: Mobilizing Your Robot Learning Policy
【速读】:该论文试图解决的是**策略移动性(policy mobilization)**问题,即在新环境中找到一个与基于有限相机视角训练的操纵策略相符合的移动机器人基座位姿。传统方法通常通过重新训练策略以提高其对未见基座位姿的鲁棒性,而该工作则通过将导航与操纵解耦,避免了额外的示范数据需求。解决方案的关键在于通过优化机器人基座位姿,使其与学习策略的分布内位姿对齐,具体包括利用3D高斯点云渲染进行新视角合成、使用评分函数评估位姿适用性以及基于采样的优化方法来确定最优机器人位姿。
链接: https://arxiv.org/abs/2505.23692
作者: Jingyun Yang,Isabella Huang,Brandon Vu,Max Bajracharya,Rika Antonova,Jeannette Bohg
机构: Stanford University (斯坦福大学); Toyota Research Institute (丰田研究院); University of Cambridge (剑桥大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project website: this https URL
Abstract:Learned visuomotor policies are capable of performing increasingly complex manipulation tasks. However, most of these policies are trained on data collected from limited robot positions and camera viewpoints. This leads to poor generalization to novel robot positions, which limits the use of these policies on mobile platforms, especially for precise tasks like pressing buttons or turning faucets. In this work, we formulate the policy mobilization problem: find a mobile robot base pose in a novel environment that is in distribution with respect to a manipulation policy trained on a limited set of camera viewpoints. Compared to retraining the policy itself to be more robust to unseen robot base pose initializations, policy mobilization decouples navigation from manipulation and thus does not require additional demonstrations. Crucially, this problem formulation complements existing efforts to improve manipulation policy robustness to novel viewpoints and remains compatible with them. To study policy mobilization, we introduce the Mobi-π framework, which includes: (1) metrics that quantify the difficulty of mobilizing a given policy, (2) a suite of simulated mobile manipulation tasks based on RoboCasa to evaluate policy mobilization, (3) visualization tools for analysis, and (4) several baseline methods. We also propose a novel approach that bridges navigation and manipulation by optimizing the robot’s base pose to align with an in-distribution base pose for a learned policy. Our approach utilizes 3D Gaussian Splatting for novel view synthesis, a score function to evaluate pose suitability, and sampling-based optimization to identify optimal robot poses. We show that our approach outperforms baselines in both simulation and real-world environments, demonstrating its effectiveness for policy mobilization.
zh
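Mobi-π 用基于采样的优化在位姿空间中寻找评分函数最高的基座位姿。下面是这一外层循环的最小示意(论文中的评分函数由 3DGS 新视角渲染与策略分布匹配驱动,此处用一个纯演示用的玩具评分代替;均匀采样也只是最简单的采样策略):

```python
import numpy as np

def optimize_base_pose(score_fn, bounds, n_samples=512, seed=0):
    """在 SE(2) 基座位姿 (x, y, yaw) 的给定范围内均匀采样候选,
    保留评分函数最高的位姿。"""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    candidates = rng.uniform(lo, hi, size=(n_samples, len(lo)))
    scores = np.array([score_fn(c) for c in candidates])
    return candidates[scores.argmax()], float(scores.max())

# 玩具评分:偏好靠近 (1.0, 2.0)、朝向 yaw≈0 的位姿(纯属演示)
score = lambda p: -np.hypot(p[0] - 1.0, p[1] - 2.0) - abs(p[2])
pose, s = optimize_base_pose(score, ([0, 0, -np.pi], [2, 4, np.pi]))
```

导航模块随后只需把机器人开到找到的位姿,再交由原操纵策略执行,无需重训策略本身。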
[CV-23] Grounded Reinforcement Learning for Visual Reasoning
【速读】:该论文旨在解决视觉推理中模型难以将抽象推理与具体视觉证据有效结合的问题,特别是在处理需要视觉注意力引导、感知输入解释和空间证据锚定的任务时。解决方案的关键在于提出ViGoRL(Visually Grounded Reinforcement Learning),该模型通过强化学习(Reinforcement Learning)显式地将每一步推理锚定到特定的视觉坐标,模仿人类视觉决策过程,生成具有空间定位的推理轨迹,并在需要精细探索时利用多轮强化学习框架动态放大至预测坐标,从而提升模型在视觉搜索和小GUI元素定位等任务中的表现。
链接: https://arxiv.org/abs/2505.23678
作者: Gabriel Sarch,Snigdha Saha,Naitik Khandelwal,Ayush Jain,Michael J. Tarr,Aviral Kumar,Katerina Fragkiadaki
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL
Abstract:While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks–including SAT-2 and BLINK for spatial reasoning, V*Bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding–ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL’s performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model’s visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.
[CV-24] ImmunoDiff: A Diffusion Model for Immunotherapy Response Prediction in Lung Cancer
【Quick Read】: This paper aims to improve the accuracy of immunotherapy response prediction in Non-Small Cell Lung Cancer (NSCLC) and the ability to capture treatment-induced changes. Existing radiomics- and deep-learning-based predictive models rely mainly on pre-treatment imaging and therefore struggle to reflect the complex morphological and textural transformations induced by immunotherapy. The key to the solution is ImmunoDiff, an anatomy-aware diffusion model that synthesizes post-treatment CT scans from baseline imaging; by incorporating anatomical priors (such as lobar and vascular structures), a novel cbi-Adapter module, and a clinical-variable conditioning mechanism, it achieves consistent multimodal fusion of imaging and clinical data and improves predictive performance.
Link: https://arxiv.org/abs/2505.23675
Authors: Moinak Bhattacharya,Judy Huang,Amna F. Sher,Gagandeep Singh,Chao Chen,Prateek Prasanna
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurately predicting immunotherapy response in Non-Small Cell Lung Cancer (NSCLC) remains a critical unmet need. Existing radiomics and deep learning-based predictive models rely primarily on pre-treatment imaging to predict categorical response outcomes, limiting their ability to capture the complex morphological and textural transformations induced by immunotherapy. This study introduces ImmunoDiff, an anatomy-aware diffusion model designed to synthesize post-treatment CT scans from baseline imaging while incorporating clinically relevant constraints. The proposed framework integrates anatomical priors, specifically lobar and vascular structures, to enhance fidelity in CT synthesis. Additionally, we introduce a novel cbi-Adapter, a conditioning module that ensures pairwise-consistent multimodal integration of imaging and clinical data embeddings, to refine the generative process. Additionally, a clinical variable conditioning mechanism is introduced, leveraging demographic data, blood-based biomarkers, and PD-L1 expression to refine the generative process. Evaluations on an in-house NSCLC cohort treated with immune checkpoint inhibitors demonstrate a 21.24% improvement in balanced accuracy for response prediction and a 0.03 increase in c-index for survival prediction. Code will be released soon.
[CV-25] OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation
【Quick Read】: This paper addresses the lack of unification across multimodal understanding and generation tasks, aiming to build a lightweight, fully open-source baseline for efficient multimodal processing. The key to the solution is an efficient training strategy that bridges off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a lightweight transformer-based connector, reducing training complexity and computational overhead.
Link: https://arxiv.org/abs/2505.23661
Authors: Size Wu,Zhonghua Wu,Zerui Gong,Qingyi Tao,Sheng Jin,Qinyue Li,Wei Li,Chen Change Loy
Affiliations: S-Lab, Nanyang Technological University; SenseTime Research; Tetras.AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes the training complexity and overhead by bridging the off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a light-weight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG- Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at this https URL.
[CV-26] D-AR: Diffusion via Autoregressive Models
【Quick Read】: This paper tackles efficiency and controllability in image diffusion by recasting the diffusion process as a standard autoregressive next-token-prediction task. The key to the solution is a tokenizer that converts images into sequences of discrete tokens, where tokens at different positions decode into different denoising steps in pixel space. Exploiting the coarse-to-fine nature of diffusion, standard next-token prediction can be applied directly, without modifying the underlying design (causal masks or training/inference strategies); this sequential autoregressive token generation directly mirrors the diffusion procedure in image space, enabling efficient and controllable image generation.
Link: https://arxiv.org/abs/2505.23660
Authors: Ziteng Gao,Mike Zheng Shou
Affiliations: National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report
Abstract:This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing the tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in the streaming manner. Our pipeline naturally reveals several intriguing properties, for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures of visual synthesis, especially with large language models. Code and models will be available at this https URL
[CV-27] VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
【Quick Read】: This paper aims to remedy the limited physical plausibility of current text-to-video (T2V) diffusion models, whose core problem is a weak understanding of physical laws. The key to the solution is VideoREPA, a framework that distills physics-understanding capability from video understanding foundation models into T2V models by aligning token-level relations, closing the physics-understanding gap and making generation more physically plausible. Specifically, it introduces a Token Relation Distillation (TRD) loss based on spatio-temporal alignment, which provides soft guidance for finetuning powerful pre-trained T2V models, a significant departure from prior representation alignment (REPA) methods.
Link: https://arxiv.org/abs/2505.23656
Authors: Xiangdong Zhang,Jiaqi Liao,Shaofeng Zhang,Fanqing Meng,Xiangpeng Wan,Junchi Yan,Yu Cheng
Affiliations: Shanghai Jiao Tong University; NetMind.AI; The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enable more physics-plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at this https URL.
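The core mechanism, matching pairwise token relations between a teacher video encoder and the T2V student rather than raw features, can be sketched as follows. This is a minimal illustrative reading, not the paper's exact TRD loss: the `relation_matrix`/`trd_loss` names and the plain MSE objective are our assumptions.

```python
import numpy as np

def relation_matrix(tokens):
    """Pairwise cosine similarities between token features: (N, D) -> (N, N)."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    return normed @ normed.T

def trd_loss(student_tokens, teacher_tokens):
    """MSE between the two token-relation matrices.

    Hypothetical sketch: the actual TRD loss in VideoREPA may differ in
    normalization and in how spatio-temporal tokens are aligned.
    """
    r_s = relation_matrix(student_tokens)
    r_t = relation_matrix(teacher_tokens)
    return float(np.mean((r_s - r_t) ** 2))

rng = np.random.default_rng(0)
student = rng.normal(size=(16, 32))   # 16 tokens, 32-dim features
teacher = rng.normal(size=(16, 32))
print(trd_loss(student, teacher))     # positive scalar
print(trd_loss(teacher, teacher))     # → 0.0
```

One practical appeal of relation-level distillation in this sketch is that the two encoders only need to agree on the number of tokens, not on feature dimensionality, since only the N×N relation matrices are compared.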
[CV-28] Merge-Friendly Post-Training Quantization for Multi-Target Domain Adaptation ICML2025
【Quick Read】: This paper addresses the difficulty of model merging for multi-target domain adaptation in the quantized-model setting, where discretization effects and restricted domains make merging highly non-trivial. The key to the solution is HDRQ (Hessian and distant regularizing quantization), a novel post-training quantization method that minimizes deviation from the source pre-trained model during quantization and flattens the loss surface to facilitate smooth model merging.
Link: https://arxiv.org/abs/2505.23651
Authors: Juncheol Shin,Minsang Seok,Seonggon Kim,Eunhyeok Park
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2025. Code: this https URL
Abstract:Model merging has emerged as a powerful technique for combining task-specific weights, achieving superior performance in multi-target domain adaptation. However, when applied to practical scenarios, such as quantized models, new challenges arise. In practical scenarios, quantization is often applied to target-specific data, but this process restricts the domain of interest and introduces discretization effects, making model merging highly non-trivial. In this study, we analyze the impact of quantization on model merging through the lens of error barriers. Leveraging these insights, we propose a novel post-training quantization, HDRQ - Hessian and distant regularizing quantization - that is designed to consider model merging for multi-target domain adaptation. Our approach ensures that the quantization process incurs minimal deviation from the source pre-trained model while flattening the loss surface to facilitate smooth model merging. To our knowledge, this is the first study on this challenge, and extensive experiments confirm its effectiveness.
[CV-29] Radiant Triangle Soup with Soft Connectivity Forces for 3D Reconstruction and Novel View Synthesis
【Quick Read】: This paper addresses how to model geometry and appearance efficiently and precisely in 3D scene representation. The widely used Gaussian splats are limited in color interpolation expressiveness and in algorithmic support for downstream tasks, whereas triangles offer richer color interpolation and naturally combine to form surfaces. The key to this work is a triangle-based scene optimization algorithm that formulates connectivity forces between triangles during optimization, encouraging explicit but soft surface continuity in 3D and improving reconstruction quality.
Link: https://arxiv.org/abs/2505.23642
Authors: Nathaniel Burgdorfer,Philippos Mordohai
Affiliations: Stevens Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this work, we introduce an inference-time optimization framework utilizing triangles to represent the geometry and appearance of the scene. More specifically, we develop a scene optimization algorithm for triangle soup, a collection of disconnected semi-transparent triangle primitives. Compared to the current most-widely used primitives for 3D scene representation, namely Gaussian splats, triangles allow for more expressive color interpolation, and benefit from a large algorithmic infrastructure for downstream tasks. Triangles, unlike full-rank Gaussian kernels, naturally combine to form surfaces. We formulate connectivity forces between triangles during optimization, encouraging explicit, but soft, surface continuity in 3D. We perform experiments on a representative 3D reconstruction dataset and show competitive photometric and geometric results.
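The "soft connectivity force" idea can be illustrated with a spring-like pull between nearby vertices of the triangle soup. This is a hypothetical sketch: the function name, the radius-based neighbor criterion, and the linear force law are our assumptions, not the paper's actual formulation.

```python
import numpy as np

def connectivity_forces(vertices, radius=0.1, strength=0.5):
    """Soft spring-like pull between vertex pairs closer than `radius`.

    `vertices`: (V, 3) array of triangle-soup vertex positions.
    Returns a (V, 3) force array. Illustrative only: the paper's force
    formulation and neighbor search are not specified here.
    """
    forces = np.zeros_like(vertices)
    diff = vertices[None, :, :] - vertices[:, None, :]   # diff[i, j] = v_j - v_i
    dist = np.linalg.norm(diff, axis=-1)
    near = (dist < radius) & (dist > 0)                  # nearby, distinct pairs
    for i, j in zip(*np.nonzero(near)):
        forces[i] += strength * diff[i, j]               # pull vertex i toward j
    return forces

verts = np.array([[0.0, 0, 0], [0.05, 0, 0], [5.0, 0, 0]])
f = connectivity_forces(verts)
```

Here the two close vertices are pulled toward each other while the distant third vertex receives no force, which is the "soft" behavior: nearby triangles are encouraged to stitch into a surface without hard connectivity constraints.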
[CV-30] Comparing the Effects of Persistence Barcodes Aggregation and Feature Concatenation on Medical Imaging
【Quick Read】: This paper addresses two issues in medical image analysis: traditional feature extraction is sensitive to small input perturbations, causing large changes in feature representation, and it is unclear how to best combine the topological features produced by persistent homology (PH) to improve classification performance. The key to the solution is using PH to extract robust, stable topological and geometric features, and combining them via feature concatenation, which preserves the detailed topological information in each barcode and yields better classification performance.
Link: https://arxiv.org/abs/2505.23637
Authors: Dashti A. Ali,Richard K. G. Do,William R. Jarnagin,Aras T. Asaad,Amber L. Simpson
Affiliations: Queen's University; Memorial Sloan Kettering Cancer Center; Oxford University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures
Abstract:In medical image analysis, feature engineering plays an important role in the design and performance of machine learning models. Persistent homology (PH), from the field of topological data analysis (TDA), demonstrates robustness and stability to data perturbations and addresses the limitation from traditional feature extraction approaches where a small change in input results in a large change in feature representation. Using PH, we store persistent topological and geometrical features in the form of the persistence barcode whereby large bars represent global topological features and small bars encapsulate geometrical information of the data. When multiple barcodes are computed from 2D or 3D medical images, two approaches can be used to construct the final topological feature vector in each dimension: aggregating persistence barcodes followed by featurization or concatenating topological feature vectors derived from each barcode. In this study, we conduct a comprehensive analysis across diverse medical imaging datasets to compare the effects of the two aforementioned approaches on the performance of classification models. The results of this analysis indicate that feature concatenation preserves detailed topological information from individual barcodes, yields better classification performance and is therefore a preferred approach when conducting similar experiments.
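The two combination strategies compared in the study can be sketched with a toy bar-statistics featurization. The four statistics used here are illustrative assumptions; the paper's actual featurization of barcodes may differ.

```python
import numpy as np

def featurize(barcode):
    """Simple bar-statistics featurization of one persistence barcode.

    `barcode`: (n, 2) array of (birth, death) pairs.
    """
    lifetimes = barcode[:, 1] - barcode[:, 0]
    return np.array([lifetimes.sum(), lifetimes.mean(),
                     lifetimes.max(), len(lifetimes)])

def aggregate_then_featurize(barcodes):
    """Approach 1: merge all bars into one barcode, then featurize once."""
    return featurize(np.vstack(barcodes))

def concatenate_features(barcodes):
    """Approach 2: featurize each barcode separately and concatenate."""
    return np.concatenate([featurize(b) for b in barcodes])

b1 = np.array([[0.0, 1.0], [0.2, 0.5]])   # barcode from slice 1
b2 = np.array([[0.1, 0.9]])               # barcode from slice 2
```

With two input barcodes, concatenation yields a feature vector twice as long as aggregation; that extra length is exactly where the per-barcode detail that the study finds beneficial for classification is retained.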
[CV-31] Color Image Set Recognition Based on Quaternionic Grassmannians
【Quick Read】: This paper addresses color image set recognition, where the core challenge is capturing and representing color information effectively. The key to the solution is exploiting the power of quaternions: each color image set is represented as a point on a quaternionic Grassmannian, and a direct formula is provided to compute the shortest distance between two points on this manifold, which is then used to build a new classification framework.
Link: https://arxiv.org/abs/2505.23629
Authors: Xiang Xiang Wang,Tin-Yau Tam
Affiliations: University of Nevada, Reno
Subjects: Computer Vision and Pattern Recognition (cs.CV); Algebraic Geometry (math.AG)
Comments:
Abstract:We propose a new method for recognizing color image sets using quaternionic Grassmannians, which use the power of quaternions to capture color information and represent each color image set as a point on the quaternionic Grassmannian. We provide a direct formula to calculate the shortest distance between two points on the quaternionic Grassmannian, and use this distance to build a new classification framework. Experiments on the ETH-80 benchmark dataset show that our method achieves good recognition results. We also discuss some limitations in stability and suggest ways the method can be improved in the future.
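For intuition, the geodesic distance between two points on a (real) Grassmannian is computed from the principal angles between the subspaces. The paper supplies the analogous direct formula in the quaternionic setting; this real-valued sketch does not implement the quaternionic case.

```python
import numpy as np

def grassmann_distance(A, B):
    """Geodesic distance between subspaces spanned by the orthonormal
    columns of A and B (both (n, k)).

    Principal angles theta_i come from the singular values of A^T B,
    and d = sqrt(sum(theta_i^2)). Real Grassmannian only: the paper's
    quaternionic formula is analogous but not reproduced here.
    """
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    thetas = np.arccos(np.clip(s, -1.0, 1.0))  # clip guards float overshoot
    return float(np.sqrt(np.sum(thetas ** 2)))

# Two 1-D subspaces of R^2: identical vs. orthogonal.
A = np.array([[1.0], [0.0]])
B = np.array([[0.0], [1.0]])
```

An identical pair of subspaces gives distance 0, and orthogonal lines in the plane give pi/2, matching the usual principal-angle picture.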
[CV-32] ZeroSep: Separate Anything in Audio with Zero Training
【Quick Read】: This paper addresses the practical challenges of audio source separation: existing supervised deep learning methods require large task-specific labeled datasets and struggle with the immense variability and open-set nature of real-world acoustic scenes. The key to the solution is using a pre-trained text-guided audio diffusion model, which, under the right configuration, achieves zero-shot source separation directly: the mixed audio is inverted into the diffusion model's latent space, and text conditioning guides the denoising process to recover the individual sources, requiring no task-specific training or fine-tuning while inherently supporting open-set scenarios.
Link: https://arxiv.org/abs/2505.23625
Authors: Chao Huang,Yuesheng Ma,Junxuan Huang,Susan Liang,Yunlong Tang,Jing Bi,Wenqiang Liu,Nima Mesgarani,Chenliang Xu
Affiliations: Unknown
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model’s latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
[CV-33] One Trajectory One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
【Quick Read】: This paper aims to solve the poor scalability of transformer models on long videos, in particular the excessive token counts and computational inefficiency of existing space-time-patch video tokenization. The key to the solution is grounded video tokenization: tokens are organized by tracked panoptic sub-object motion trajectories rather than fixed patches, which better matches human perceptual principles, effectively reduces redundancy, and preserves temporal coherence. The approach is realized by TrajViT, trained with contrastive learning, which significantly outperforms the conventional space-time ViT (ViT3D) across multiple video understanding tasks.
Link: https://arxiv.org/abs/2505.23617
Authors: Chenhao Zheng,Jieyu Zhang,Mohammadreza Salehi,Ziqi Gao,Vishnu Iyengar,Norimasa Kobori,Quan Kong,Ranjay Krishna
Affiliations: University of Washington; Allen Institute for Artificial Intelligence; Woven by Toyota, Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:
Abstract:Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% top-5 recall in average at video-text retrieval task with 10x token deduction. We also show TrajViT as a stronger model than ViT3D for being the video encoder for modern VideoLLM, obtaining an average of 5.2% performance improvement across 6 VideoQA benchmarks while having 4x faster training time and 18x less inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.
[CV-34] Autoregressive Meta-Actions for Unified Controllable Trajectory Generation
【Quick Read】: This paper addresses a limitation of existing trajectory generation frameworks for autonomous driving: their reliance on meta-actions held invariant over fixed future time intervals, which causes temporal misalignment between the prescribed meta-actions and the actual behavior trajectories, disrupting task coherence and limiting model performance. The key to the solution is Autoregressive Meta-Actions: traditional long-interval meta-actions are decomposed into frame-level meta-actions, enabling a sequential interplay between autoregressive meta-action prediction and meta-action-conditioned trajectory generation. This decomposition ensures strict alignment between each trajectory segment and its corresponding meta-action, achieving a consistent, unified task formulation over the entire trajectory span while significantly reducing complexity.
Link: https://arxiv.org/abs/2505.23612
Authors: Jianbo Zhao,Taiyu Ban,Xiyang Wang,Qibin Zhou,Hangning Zhou,Zhihao Liu,Mu Yang,Lei Liu,Bin Li
Affiliations: University of Science and Technology of China; Mach Drive
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Controllable trajectory generation guided by high-level semantic decisions, termed meta-actions, is crucial for autonomous driving systems. A significant limitation of existing frameworks is their reliance on invariant meta-actions assigned over fixed future time intervals, causing temporal misalignment with the actual behavior trajectories. This misalignment leads to irrelevant associations between the prescribed meta-actions and the resulting trajectories, disrupting task coherence and limiting model performance. To address this challenge, we introduce Autoregressive Meta-Actions, an approach integrated into autoregressive trajectory generation frameworks that provides a unified and precise definition for meta-action-conditioned trajectory prediction. Specifically, We decompose traditional long-interval meta-actions into frame-level meta-actions, enabling a sequential interplay between autoregressive meta-action prediction and meta-action-conditioned trajectory generation. This decomposition ensures strict alignment between each trajectory segment and its corresponding meta-action, achieving a consistent and unified task formulation across the entire trajectory span and significantly reducing complexity. Moreover, we propose a staged pre-training process to decouple the learning of basic motion dynamics from the integration of high-level decision control, which offers flexibility, stability, and modularity. Experimental results validate our framework’s effectiveness, demonstrating improved trajectory adaptivity and responsiveness to dynamic decision-making scenarios. We provide the video document and dataset, which are available at this https URL.
[CV-35] Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
【Quick Read】: This paper addresses the slow inference and weak generalization of unified generation models across modalities: autoregressive unified models suffer from slow sequential decoding, while non-autoregressive unified models generalize poorly due to limited pretrained backbones. The key to the solution is Muddit, a unified discrete diffusion transformer that integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling fast, parallel generation across text and image modalities under a single architecture and outperforming much larger autoregressive models in both quality and efficiency.
Link: https://arxiv.org/abs/2505.23606
Authors: Qingyu Shi,Jinbin Bai,Zhuoran Zhao,Wenhao Chai,Kaidong Yu,Jianzong Wu,Shuangyong Song,Yunhai Tong,Xiangtai Li,Xuelong Li,Shuicheng Yan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: The code and model are available at this https URL
Abstract:Unified generation models aim to handle diverse tasks across modalities – such as text generation, image generation, and vision-language reasoning – within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
[CV-36] A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis
【Quick Read】: This paper addresses the inadequacy of existing benchmarks for evaluating multi-modal large language models (MLLMs) in endoscopy analysis: current benchmarks cover only specific endoscopic scenarios and a small set of clinical tasks, failing to reflect the real-world diversity of endoscopic scenarios or the full range of skills clinical workflows require. The key to the solution is EndoBench, the first comprehensive benchmark designed to assess the multi-dimensional capabilities of MLLMs across the full spectrum of endoscopic practice. It covers 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 subtasks, and 5 levels of visual-prompting granularity, totaling 6,832 rigorously validated VQA pairs, yielding a multi-dimensional framework that evaluates perceptual and diagnostic abilities across anatomical recognition, lesion analysis, spatial localization, and surgical operations.
Link: https://arxiv.org/abs/2505.23601
Authors: Shengyuan Liu,Boyun Zheng,Wenting Chen,Zhihao Peng,Zhenfei Yin,Jing Shao,Jiancong Hu,Yixuan Yuan
Affiliations: Chinese University of Hong Kong; City University of Hong Kong; University of Oxford; Shanghai AI Laboratory; The Sixth Affiliated Hospital, Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 36 pages, 18 figures
Abstract:Endoscopic procedures are essential for diagnosing and treating internal diseases, and multi-modal large language models (MLLMs) are increasingly applied to assist in endoscopy analysis. However, current benchmarks are limited, as they typically cover specific endoscopic scenarios and a small set of clinical tasks, failing to capture the real-world diversity of endoscopic scenarios and the full range of skills needed in clinical workflows. To address these issues, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice with multi-dimensional capacities. EndoBench encompasses 4 distinct endoscopic scenarios, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities, resulting in 6,832 rigorously validated VQA pairs from 21 diverse datasets. Our multi-dimensional evaluation framework mirrors the clinical workflow–spanning anatomical recognition, lesion analysis, spatial localization, and surgical operations–to holistically gauge the perceptual and diagnostic abilities of MLLMs in realistic scenarios. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs, and establish human clinician performance as a reference standard. Our extensive experiments reveal: (1) proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts; (2) medical-domain supervised fine-tuning substantially boosts task-specific accuracy; and (3) model performance remains sensitive to prompt format and clinical task complexity. EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning. We publicly release our benchmark and code.
[CV-37] Bridging Classical and Modern Computer Vision: PerceptiveNet for Tree Crown Semantic Segmentation CVPR
【Quick Read】: This paper addresses accurate semantic segmentation of tree crowns in remote sensing data, which is crucial for forest management, biodiversity studies, and carbon sequestration quantification. Traditional methods struggle with the complexity of forest canopies (shadows, intricate backgrounds, scale variations, and subtle spectral differences between tree species), and while deep learning models improve accuracy by extracting informative, discriminative features, they still fail to fully capture these complexities. The key to the proposed PerceptiveNet is a Logarithmic Gabor-parameterised convolutional layer combined with a backbone that has a wider receptive field, extracting salient features while capturing extensive context and spatial information, thereby improving segmentation performance.
Link: https://arxiv.org/abs/2505.23597
Authors: Georgios Voulgaris
Affiliations: University of Oxford
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) EarthVision
Abstract:The accurate semantic segmentation of tree crowns within remotely sensed data is crucial for scientific endeavours such as forest management, biodiversity studies, and carbon sequestration quantification. However, precise segmentation remains challenging due to complexities in the forest canopy, including shadows, intricate backgrounds, scale variations, and subtle spectral differences among tree species. Compared to the traditional methods, Deep Learning models improve accuracy by extracting informative and discriminative features, but often fall short in capturing the aforementioned complexities. To address these challenges, we propose PerceptiveNet, a novel model incorporating a Logarithmic Gabor-parameterised convolutional layer with trainable filter parameters, alongside a backbone that extracts salient features while capturing extensive context and spatial information through a wider receptive field. We investigate the impact of Log-Gabor, Gabor, and standard convolutional layers on semantic segmentation performance through extensive experimentation. Additionally, we conduct an ablation study to assess the contributions of individual layers and their combinations to overall model performance, and we evaluate PerceptiveNet as a backbone within a novel hybrid CNN-Transformer model. Our results outperform state-of-the-art models, demonstrating significant performance improvements on a tree crown dataset while generalising across domains, including two benchmark aerial scene semantic segmentation datasets with varying complexities.
[CV-38] DeepChest: Dynamic Gradient-Free Task Weighting for Effective Multi-Task Learning in Chest X-ray Classification
【Quick Read】: This paper addresses how to balance task contributions effectively in multi-task learning (MTL) for complex domains such as medical imaging, where conventional heuristic or gradient-based strategies often incur substantial computational overhead. The key to the proposed DeepChest is a performance-driven dynamic task-weighting mechanism based on analyzing task-specific loss trends, which adaptively adjusts task importance without requiring gradient access, significantly reducing memory usage and speeding up training.
Link: https://arxiv.org/abs/2505.23595
Authors: Youssef Mohamed,Noran Mohamed,Khaled Abouhashad,Feilong Tang,Sara Atito,Shoaib Jameel,Imran Razzak,Ahmed B. Zaky
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:While Multi-Task Learning (MTL) offers inherent advantages in complex domains such as medical imaging by enabling shared representation learning, effectively balancing task contributions remains a significant challenge. This paper addresses this critical issue by introducing DeepChest, a novel, computationally efficient and effective dynamic task-weighting framework specifically designed for multi-label chest X-ray (CXR) classification. Unlike existing heuristic or gradient-based methods that often incur substantial overhead, DeepChest leverages a performance-driven weighting mechanism based on effective analysis of task-specific loss trends. Given a network architecture (e.g., ResNet18), our model-agnostic approach adaptively adjusts task importance without requiring gradient access, thereby significantly reducing memory usage and achieving a threefold increase in training speed. It can be easily applied to improve various state-of-the-art methods. Extensive experiments on a large-scale CXR dataset demonstrate that DeepChest not only outperforms state-of-the-art MTL methods by 7% in overall accuracy but also yields substantial reductions in individual task losses, indicating improved generalization and effective mitigation of negative transfer. The efficiency and performance gains of DeepChest pave the way for more practical and robust deployment of deep learning in critical medical diagnostic applications. The code is publicly available at this https URL
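A gradient-free, loss-trend-driven weighting scheme of the kind described can be sketched as below. The specific ratio rule is our illustrative assumption, not the published DeepChest formula.

```python
import numpy as np

def trend_weights(loss_history, eps=1e-8):
    """Gradient-free task weights from recent loss trends.

    `loss_history`: (T, K) array of losses for K tasks over the last T
    training steps. Tasks whose loss is decreasing slowly (end/start
    ratio near 1) receive more weight. Illustrative reading of a
    performance-driven scheme; the exact DeepChest rule may differ.
    """
    ratios = loss_history[-1] / (loss_history[0] + eps)  # ~1 => stalled task
    return ratios / ratios.sum()                         # normalize to sum 1

# Task 0 improved fast (1.0 -> 0.5); task 1 barely moved (1.0 -> 0.9).
hist = np.array([[1.0, 1.0],
                 [0.5, 0.9]])
w = trend_weights(hist)
```

The stalled task gets proportionally more weight on the next round, and, crucially for the memory/speed claims in the abstract, computing the weights needs only recorded loss values, never a backward pass through the network.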
[CV-39] PCA for Enhanced Cross-Dataset Generalizability in Breast Ultrasound Tumor Segmentation
【Quick Read】: This paper addresses the limited external validity of medical image segmentation models when deployed on unseen datasets, a problem especially pronounced for ultrasound imaging. Existing solutions such as domain adaptation and GAN-based style transfer, while promising, are of limited use in the medical domain, where datasets are typically small and diverse. The key to the solution is applying principal component analysis (PCA) as a preprocessing step: retaining approximately 90% of the dataset variance reduces noise and emphasizes essential features, improving model performance in cross-dataset validation.
Link: https://arxiv.org/abs/2505.23587
Authors: Christian Schmidt,Heinrich Martin Overhoff
Affiliations: Westfälische Hochschule University of Applied Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:In medical image segmentation, limited external validity remains a critical obstacle when models are deployed across unseen datasets, an issue particularly pronounced in the ultrasound image domain. Existing solutions, such as domain adaptation and GAN-based style transfer, while promising, often fall short in the medical domain where datasets are typically small and diverse. This paper presents a novel application of principal component analysis (PCA) to address this limitation. PCA preprocessing reduces noise and emphasizes essential features by retaining approximately 90% of the dataset variance. We evaluate our approach across six diverse breast tumor ultrasound datasets comprising 3,983 B-mode images and corresponding expert tumor segmentation masks. For each dataset, a corresponding dimensionality-reduced PCA dataset is created and U-Net-based segmentation models are trained on each of the twelve datasets. Each model trained on an original dataset was inferenced on the remaining five out-of-domain original datasets (baseline results), while each model trained on a PCA dataset was inferenced on five out-of-domain PCA datasets. Our experimental results indicate that using PCA reconstructed datasets, instead of original images, improves the model’s recall and Dice scores, particularly for model-dataset pairs where baseline performance was lowest, achieving statistically significant gains in recall (0.57 ± 0.07 vs. 0.70 ± 0.05, p = 0.0004) and Dice scores (0.50 ± 0.06 vs. 0.58 ± 0.06, p = 0.03). Our method reduced the decline in recall values due to external validation by 33%. These findings underscore the potential of PCA reconstruction as a safeguard to mitigate declines in segmentation performance, especially in challenging cases, with implications for enhancing external validity in real-world medical applications.
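The PCA preprocessing step, keeping enough principal components to retain roughly 90% of dataset variance and reconstructing the images from them, can be sketched with plain NumPy. This is a minimal sketch; the authors' exact per-dataset pipeline and image handling are not reproduced.

```python
import numpy as np

def pca_reconstruct(X, variance_kept=0.90):
    """Project data onto the top principal components covering at least
    `variance_kept` of total variance, then map back to the original space.

    X: (n_samples, n_features), e.g. flattened ultrasound images.
    Returns the reconstructed data and the number of components kept.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = (S ** 2) / np.sum(S ** 2)
    # Smallest k whose cumulative variance reaches the threshold.
    k = int(np.searchsorted(np.cumsum(var_ratio), variance_kept)) + 1
    Xr = Xc @ Vt[:k].T @ Vt[:k] + mean
    return Xr, k

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))          # stand-in for 40 flattened images
Xr, k = pca_reconstruct(X, 0.90)
Xr_full, _ = pca_reconstruct(X, 1.0)  # keeping all variance reproduces X
```

By construction the discarded components carry at most 10% of the centered variance, which is the noise-suppression effect the paper relies on.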
[CV-40] Weakly-supervised Localization of Manipulated Image Regions Using Multi-resolution Learned Features
【Quick Read】: This paper addresses the lack of interpretability and the inaccurate localization of manipulated regions in image manipulation detection, particularly in real-world settings where pixel-level annotations are unavailable, which limits existing fully supervised localization techniques. The key to the solution is a novel weakly supervised approach that fuses activation maps produced by an image-level manipulation detection network with segmentation maps from pre-trained models: multi-view feature maps are fused for coarse localization, then refined using the detailed regional information provided by pre-trained segmentation models, with Bayesian inference employed to improve localization accuracy.
Link: https://arxiv.org/abs/2505.23586
Authors: Ziyong Wang,Charith Abhayaratne
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: This paper was presented at the British Machine Vision Conference 2024 workshop on Media authenticity in the age of artificial intelligence
Abstract:The explosive growth of digital images and the widespread availability of image editing tools have made image manipulation detection an increasingly critical challenge. While current deep learning-based manipulation detection methods excel at achieving high image-level classification accuracy, they often fall short in terms of interpretability and localization of manipulated regions. Additionally, the absence of pixel-wise annotations in real-world scenarios limits the existing fully-supervised manipulation localization techniques. To address these challenges, we propose a novel weakly-supervised approach that integrates activation maps generated by image-level manipulation detection networks with segmentation maps from pre-trained models. Specifically, we build on our previous image-level work named WCBnet to produce multi-view feature maps, which are subsequently fused for coarse localization. These coarse maps are then refined using detailed segmented regional information provided by pre-trained segmentation models (such as DeepLab, SegmentAnything and PSPnet), with Bayesian inference employed to enhance the manipulation localization. Experimental results demonstrate the effectiveness of our approach, highlighting the feasibility of localizing image manipulations without relying on pixel-level labels.
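摘要提到用贝叶斯推理融合激活图与分割图来细化定位,但未给出具体公式;下面按像素级朴素贝叶斯融合的常见做法给出一个假设性示意(函数名与输入数据均为虚构,仅说明"两路证据一致处后验被强化、冲突处回到不确定"的效果):

```python
import numpy as np

def bayesian_fuse(coarse_map, seg_map, eps=1e-8):
    """像素级二类贝叶斯融合:以粗定位激活图为先验、分割图为似然(假设性写法)"""
    tampered = coarse_map * seg_map             # "被篡改"类的非归一化后验
    clean = (1 - coarse_map) * (1 - seg_map)    # "未篡改"类的非归一化后验
    return tampered / (tampered + clean + eps)

coarse = np.array([[0.8, 0.2], [0.6, 0.5]])  # 图像级检测网络的激活图(虚构)
seg    = np.array([[0.9, 0.1], [0.4, 0.5]])  # 预训练分割模型的区域图(虚构)

# 两图一致的像素后验趋向 0/1,证据冲突的像素停留在 0.5 附近
print(np.round(bayesian_fuse(coarse, seg), 3))
```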
zh
[CV-41] Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition
【速读】:该论文旨在解决手写数学表达式识别(Handwritten Mathematical Expression Recognition, HMER)在光学字符识别(Optical Character Recognition, OCR)中的持续挑战,即符号布局的固有自由度和手写风格的多样性导致的性能瓶颈。其解决方案的关键在于利用预训练视觉-语言模型(Vision-Language Models, VLMs)的强大跨任务泛化能力,通过全微调的方式将领域特定知识注入通用框架中,而无需修改模型架构。该方法整合了三种数据驱动的任务:树感知的思维链(Tree-Aware Chain-of-Thought, Tree-CoT)用于结构化空间推理,误差驱动学习(Error-Driven Learning, EDL)用于减少视觉相似字符间的混淆,以及符号计数(Symbol Counting, SC)用于提升长表达式的识别一致性。
链接: https://arxiv.org/abs/2505.23566
作者: Yu Li,Jin Jiang,Jianhua Zhu,Shuai Peng,Baole Wei,Yuxuan Zhou,Liangcai Gao
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学计算机技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layout and variability in handwriting styles. Prior methods have faced performance bottlenecks, proposing isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: this https URL
zh
[CV-42] Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在推理过程中因长时间推理导致视觉信息被稀释,从而引发幻觉(hallucination)的问题。其解决方案的关键在于引入一种新的视觉-文本反思机制,通过强化学习方法Balanced Reflective Policy Optimization (BRPO)引导模型自主决定何时生成反思并平衡反思的数量与长度,同时在训练和推理阶段引入Visual Token COPY和Visual Token ROUTE策略,强制模型重新关注视觉信息,从而有效提升视觉注意力并抑制幻觉。
链接: https://arxiv.org/abs/2505.23558
作者: Xu Chu,Xinrong Chen,Guanyu Wang,Zhijie Tan,Kui Huang,Wenyu Lv,Tong Mo,Weiping Li
机构: Peking University (北京大学); Baidu Inc. (百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Inference time scaling drives extended reasoning to enhance the performance of Vision-Language Models (VLMs), thus forming powerful Vision-Language Reasoning Models (VLRMs). However, long reasoning dilutes visual tokens, causing visual information to receive less attention and may trigger hallucinations. Although introducing text-only reflection processes shows promise in language models, we demonstrate that it is insufficient to suppress hallucinations in VLMs. To address this issue, we introduce Qwen-LookAgain (Qwen-LA), a novel VLRM designed to mitigate hallucinations by incorporating a vision-text reflection process that guides the model to re-attention visual information during reasoning. We first propose a reinforcement learning method Balanced Reflective Policy Optimization (BRPO), which guides the model to decide when to generate vision-text reflection on its own and balance the number and length of reflections. Then, we formally prove that VLRMs lose attention to visual tokens as reasoning progresses, and demonstrate that supplementing visual information during reflection enhances visual attention. Therefore, during training and inference, Visual Token COPY and Visual Token ROUTE are introduced to force the model to re-attention visual information at the visual level, addressing the limitations of text-only reflection. Experiments on multiple visual QA datasets and hallucination metrics indicate that Qwen-LA achieves leading accuracy performance while reducing hallucinations. Our code is available at: this https URL.
zh
[CV-43] Position Paper: Metadata Enrichment Model: Integrating Neural Networks and Semantic Knowledge Graphs for Cultural Heritage Applications
【速读】:该论文试图解决文化遗产品类数字化过程中由于元数据不足而导致的可访问性、互操作性和跨机构协作困难的问题。其解决方案的关键在于将神经网络与语义技术相结合,提出了一种名为Metadata Enrichment Model (MEM) 的概念框架,其中核心创新是Multilayer Vision Mechanism (MVM),该机制通过动态检测嵌套特征(如印章中的文本或邮票中的图像)来提升视觉分析效果。
链接: https://arxiv.org/abs/2505.23543
作者: Jan Ignatowicz,Krzysztof Kutt,Grzegorz J. Nalepa
机构: Jagiellonian University (亚捷隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The digitization of cultural heritage collections has opened new directions for research, yet the lack of enriched metadata poses a substantial challenge to accessibility, interoperability, and cross-institutional collaboration. In recent years, neural network models such as YOLOv11 and Detectron2 have revolutionized visual data analysis, but their application to domain-specific cultural artifacts, such as manuscripts and incunabula, remains limited by the absence of methodologies that address structural feature extraction and semantic interoperability. In this position paper, we argue that the integration of neural networks with semantic technologies represents a paradigm shift in cultural heritage digitization processes. We present the Metadata Enrichment Model (MEM), a conceptual framework designed to enrich metadata for digitized collections by combining fine-tuned computer vision models, large language models (LLMs) and structured knowledge graphs. The Multilayer Vision Mechanism (MVM) is the key innovation of MEM. This iterative process improves visual analysis by dynamically detecting nested features, such as text within seals or images within stamps. To demonstrate MEM’s potential, we apply it to a dataset of digitized incunabula from the Jagiellonian Digital Library and release a manually annotated dataset of 105 manuscript pages. We examine the practical challenges of MEM’s usage in real-world GLAM institutions, including the need for domain-specific fine-tuning, the alignment of enriched metadata with Linked Data standards, and computational costs. We present MEM as a flexible and extensible methodology. This paper contributes to the discussion on how artificial intelligence and semantic web technologies can advance cultural heritage research and how these technologies can be used in practice.
zh
[CV-44] Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation
【速读】:该论文旨在解决由音频和骨骼运动驱动的高动态性和逼真度的人像动画生成问题,主要挑战包括精确的唇形同步、自然的面部表情以及高质量的身体运动动力学。其解决方案的关键在于两个创新:首先,引入针对以人为中心的动画优化的直接偏好优化方法,利用精心构建的人类偏好数据集,使生成结果与人眼感知指标对齐;其次,提出的时间运动调制机制通过时间通道重新分配和比例特征扩展,将运动条件重塑为维度对齐的潜在特征,从而解决时空分辨率不匹配问题,并在扩散合成中保留高频运动细节的保真度。
链接: https://arxiv.org/abs/2505.23525
作者: Jiahao Cui,Yan Chen,Mingwang Xu,Hanlin Shang,Yuxuan Chen,Yun Zhan,Zilong Dong,Yao Yao,Jingdong Wang,Siyu Zhu
机构: Fudan University (复旦大学); Baidu Inc. (百度公司); Nanjing University (南京大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion remains challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics. We propose a human-preference-aligned diffusion framework that addresses these challenges through two key innovations. First, we introduce direct preference optimization tailored for human-centric animation, leveraging a curated dataset of human preferences to align generated outputs with perceptual metrics for portrait motion-video alignment and naturalness of expression. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches by reshaping motion conditions into dimensionally aligned latent features through temporal channel redistribution and proportional feature expansion, preserving the fidelity of high-frequency motion details in diffusion-based synthesis. The proposed mechanism is complementary to existing UNet and DiT-based portrait diffusion approaches, and experiments demonstrate obvious improvements in lip-audio synchronization, expression vividness, body motion coherence over baseline methods, alongside notable gains in human preference metrics. Our model and source code can be found at: this https URL.
zh
[CV-45] CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization
【速读】:该论文旨在解决无监督时间动作定位(Unsupervised Temporal Action Localization, UTAL)中的两个主要挑战:一是分类预训练特征过度关注高度区分性区域,二是仅依赖视觉模态信息难以确定上下文边界。其解决方案的关键在于引入CLIP辅助的跨视音频视觉增强方法,通过视觉语言预训练(VLP)和基于分类预训练的协同增强来避免对高度区分性区域的过度关注,并结合音频感知提供更丰富的上下文边界信息;同时,采用自监督跨视学习范式实现多视图感知增强,无需额外标注。
链接: https://arxiv.org/abs/2505.23524
作者: Rui Xia,Dan Jiang,Quan Zhang,Ke Zhang,Chun Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods heavily rely on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) Classification pre-trained features overly focus on highly discriminative regions; 2) Solely relying on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audiovisual enhanced UTAL method. Specifically, we introduce visual language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio perception to provide richer contextual boundary information. Finally, we introduce a self-supervised cross-view learning paradigm to achieve multi-view perceptual enhancement without additional annotations. Extensive experiments on two public datasets demonstrate our model’s superiority over several state-of-the-art competitors.
zh
[CV-46] OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data
【速读】:该论文旨在解决现有地球科学多模态学习基准在系统性覆盖地系统组成部分及跨圈层交互方面的关键缺陷,这些基准通常仅限于孤立子系统(如仅人类活动圈层或大气圈层)且评估维度有限(少于16项任务)。其解决方案的关键在于提出OmniEarth-Bench,这是首个涵盖所有六个地球科学圈层(大气圈、岩石圈、海洋圈、冰冻圈、生物圈和人类活动圈层)及其跨圈层交互的综合性多模态基准,包含100个专家精心设计的评估维度,并通过卫星传感器和原位观测数据整合了29,779条标注,涵盖感知、通用推理、科学知识推理和链式思维(CoT)推理四个层次,最终通过混合专家与众包的工作流程验证数据以减少标签歧义。
链接: https://arxiv.org/abs/2505.23522
作者: Fengxiang Wang,Mingshuo Chen,Xuming He,YiFan Zhang,Feng Liu,Zijie Guo,Zhenghao Hu,Jiong Wang,Jingyi Xu,Zhangrui Li,Fenghua Ling,Ben Fei,Weijia Li,Long Lan,Wenjing Yang,Wenlong Zhang,Lei Bai
机构: National University of Defense Technology, China; Shanghai Artificial Intelligence Laboratory, China; Beijing University of Posts and Telecommunications, China; Zhejiang University, China; Shanghai Jiao Tong University, China; Fudan University, China; Sun Yat-sen University, China; Nanjing University, China; Chinese Academy of Sciences
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Existing benchmarks for Earth science multimodal learning exhibit critical limitations in systematic coverage of geosystem components and cross-sphere interactions, often constrained to isolated subsystems (only the human-activities sphere or the atmosphere) with limited evaluation dimensions (fewer than 16 tasks). To address these gaps, we introduce OmniEarth-Bench, the first comprehensive multimodal benchmark spanning all six Earth science spheres (atmosphere, lithosphere, oceansphere, cryosphere, biosphere, and human-activities sphere) and cross-sphere interactions, with one hundred expert-curated evaluation dimensions. Leveraging observational data from satellite sensors and in-situ measurements, OmniEarth-Bench integrates 29,779 annotations across four tiers: perception, general reasoning, scientific knowledge reasoning and chain-of-thought (CoT) reasoning. This involves the efforts of 2-5 experts per sphere to establish authoritative evaluation dimensions and curate relevant observational datasets, and 40 crowd-sourcing annotators to assist the experts with annotation; finally, OmniEarth-Bench is validated via hybrid expert-crowd workflows to reduce label ambiguity. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmarks, with none of them reaching 35% accuracy. In particular, in some cross-sphere tasks, the performance of leading models like GPT-4o drops to 0.0%. OmniEarth-Bench sets a new standard for geosystem-aware AI, advancing both scientific discovery and practical applications in environmental monitoring and disaster prediction. The dataset, source code, and trained models have been released.
zh
[CV-47] VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning
【速读】:该论文旨在解决视频异常理解(Video Anomaly Understanding, VAU)中存在的一系列挑战,包括对细粒度时空感知和模糊环境下鲁棒推理的需求,以及现有方法在可解释性、因果关系和上下文捕捉方面的不足。其解决方案的关键在于提出VAU-R1,一个基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的数据高效框架,并通过强化学习微调(Reinforcement Fine-Tuning, RFT)增强异常推理能力。此外,论文还提出了VAU-Bench,首个针对视频异常推理的思维链基准,包含多项选择问答、详细推理过程、时间标注和描述性字幕,以支持对推理能力的全面评估。
链接: https://arxiv.org/abs/2505.23504
作者: Liyun Zhu,Qixiang Chen,Xi Shen,Xiaodong Cun
机构: Australian National University (澳大利亚国立大学); GVC Lab, Great Bay University (GVC 实验室,大湾大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video Anomaly Understanding (VAU) is essential for applications such as smart cities, security surveillance, and disaster alert systems, yet remains challenging due to its demand for fine-grained spatio-temporal perception and robust reasoning under ambiguity. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. This limitation is further compounded by the absence of comprehensive benchmarks for evaluating reasoning ability in anomaly scenarios. To address both challenges, we introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT). In addition, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored for video anomaly reasoning, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Empirical results show that VAU-R1 significantly improves question answering accuracy, temporal grounding, and reasoning coherence across diverse contexts. Together, our method and benchmark establish a strong foundation for interpretable and reasoning-aware video anomaly understanding. Our code is available at this https URL.
zh
[CV-48] Can Large Language Models Challenge CNNS in Medical Image Analysis?
【速读】:该论文旨在解决医学诊断图像分类中的准确性和效率问题,通过构建一种多模态人工智能(Multimodal AI)框架来提升诊断的可靠性。其解决方案的关键在于对比卷积神经网络(CNN)与不同大型语言模型(LLM)在诊断性能、执行效率及环境影响方面的差异,并通过在LLM上应用额外的过滤机制实现性能的显著提升。
链接: https://arxiv.org/abs/2505.23503
作者: Shibbir Ahmed,Shahnewaz Karim Sakib,Anindya Bijoy Das
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:This study presents a multimodal AI framework designed for precisely classifying medical diagnostic images. Utilizing publicly available datasets, the proposed system compares the strengths of convolutional neural networks (CNNs) and different large language models (LLMs). This in-depth comparative analysis highlights key differences in diagnostic performance, execution efficiency, and environmental impacts. Model evaluation was based on accuracy, F1-score, average execution time, average energy consumption, and estimated CO_2 emission. The findings indicate that although CNN-based models can outperform various multimodal techniques that incorporate both images and contextual information, applying additional filtering on top of LLMs can lead to substantial performance gains. These findings highlight the transformative potential of multimodal AI systems to enhance the reliability, efficiency, and scalability of medical diagnostics in clinical settings.
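摘要中的模型评估用到了准确率与 F1 分数等指标;以下用纯 Python 给出这两个指标的计算示意(标签与两组预测均为虚构数据,仅作演示,与论文实验无关):

```python
def binary_metrics(y_true, y_pred):
    """二分类准确率与 F1 分数的纯 Python 计算"""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, f1

y_true = [0, 1, 1, 0, 1, 0, 1, 1]   # 虚构的诊断标签,仅作演示
y_cnn  = [0, 1, 1, 0, 0, 0, 1, 1]   # 假设的 CNN 预测
y_llm  = [0, 1, 0, 0, 1, 1, 1, 1]   # 假设的 LLM 预测

print("CNN:", binary_metrics(y_true, y_cnn))   # 准确率与 F1
print("LLM:", binary_metrics(y_true, y_llm))
```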
zh
[CV-49] VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation
【速读】:该论文旨在解决现有视频字幕评估基准在细粒度评价方面的不足,特别是无法有效捕捉视频生成中关键的时空细节问题。其解决方案的关键在于引入首个大规模细粒度视频字幕评估基准(VCapsBench),该基准包含5,677(5K+)视频和109,796(100K+)问答对,并在21个细粒度维度上进行系统标注,同时提出三种评估指标(准确率AR、不一致率IR、覆盖率CR)及基于大语言模型(LLM)的自动化评估流程,通过对比问答对分析验证字幕质量。
链接: https://arxiv.org/abs/2505.23484
作者: Shi-Xue Zhang,Hongfa Wang,Duojun Huang,Xin Li,Xiaobin Zhu,Xu-Cheng Yin
机构: Tencent(腾讯); University of Science and Technology Beijing(北京科技大学); Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitting
Abstract:Video captions play a crucial role in text-to-video generation tasks, as their quality directly influences the semantic coherence and visual fidelity of the generated videos. Although large vision-language models (VLMs) have demonstrated significant potential in caption generation, existing benchmarks inadequately address fine-grained evaluation, particularly in capturing spatial-temporal details critical for video generation. To address this gap, we introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench), the first large-scale fine-grained benchmark comprising 5,677 (5K+) videos and 109,796 (100K+) question-answer pairs. These QA-pairs are systematically annotated across 21 fine-grained dimensions (e.g., camera movement, and shot type) that are empirically proven critical for text-to-video generation. We further introduce three metrics (Accuracy (AR), Inconsistency Rate (IR), Coverage Rate (CR)), and an automated evaluation pipeline leveraging large language model (LLM) to verify caption quality via contrastive QA-pairs analysis. By providing actionable insights for caption optimization, our benchmark can advance the development of robust text-to-video models. The dataset and codes are available at website: this https URL.
zh
[CV-50] PhysicsNeRF: Physics-Guided 3D Reconstruction from Sparse Views ICML2025
【速读】:该论文旨在解决从稀疏视角进行3D重建的问题(3D reconstruction from sparse views),传统NeRF(Neural Radiance Fields)在稀疏监督下表现不佳。其解决方案的关键在于提出PhysicsNeRF框架,该框架通过引入四个互补约束:深度排序、RegNeRF风格的一致性、稀疏先验和跨视角对齐,结合一个参数量仅为0.67M的紧凑架构,在仅使用8个视角的情况下实现了21.4 dB的平均PSNR,优于现有方法。
链接: https://arxiv.org/abs/2505.23481
作者: Mohamed Rayan Barhdadi,Hasan Kurban,Hussein Alnuweiri
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 2 figures, 2 tables. Preliminary work. Under review by the Building Physically Plausible World Models Workshop at the 42nd International Conference on Machine Learning (ICML 2025), Vancouver, Canada
Abstract:PhysicsNeRF is a physically grounded framework for 3D reconstruction from sparse views, extending Neural Radiance Fields with four complementary constraints: depth ranking, RegNeRF-style consistency, sparsity priors, and cross-view alignment. While standard NeRFs fail under sparse supervision, PhysicsNeRF employs a compact 0.67M-parameter architecture and achieves 21.4 dB average PSNR using only 8 views, outperforming prior methods. A generalization gap of 5.7-6.2 dB is consistently observed and analyzed, revealing fundamental limitations of sparse-view reconstruction. PhysicsNeRF enables physically consistent, generalizable 3D representations for agent interaction and simulation, and clarifies the expressiveness-generalization trade-off in constrained NeRF models.
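论文列出的四个约束中,深度排序(depth ranking)约束的一种常见实现是对样本对施加 hinge 形式的次序惩罚:只监督预测深度的相对次序,不要求绝对尺度正确。以下是基于该思路的假设性示意(非 PhysicsNeRF 官方实现,此处用相邻样本对代替随机采样的像素对):

```python
import numpy as np

def depth_ranking_loss(pred, gt, margin=1e-4):
    """成对深度排序损失的示意:只惩罚与真值相对次序矛盾的相邻样本对"""
    d = pred[:-1] - pred[1:]                    # 预测深度差
    closer = (gt[:-1] < gt[1:]).astype(float)   # 真值中前者更近(更小)的对
    farther = (gt[:-1] > gt[1:]).astype(float)
    hinge = np.clip(d + margin, 0, None) * closer + \
            np.clip(-d + margin, 0, None) * farther
    return float(hinge.mean())

gt   = np.array([1.0, 2.0, 3.0, 4.0])
good = np.array([0.1, 0.5, 0.9, 1.3])   # 相对次序与真值一致(绝对尺度可不同)
bad  = np.array([1.3, 0.9, 0.5, 0.1])   # 次序完全颠倒

print(depth_ranking_loss(good, gt))   # 0.0
print(depth_ranking_loss(bad, gt))    # 约 0.4
```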
zh
[CV-51] mePoint: Accelerated Time Series Alignment via Self-Supervised Keypoint and Descriptor Learning ICML2025
【速读】:该论文旨在解决时间序列快速且可扩展对齐的问题,传统方法动态时间规整(Dynamic Time Warping, DTW)在可扩展性和抗噪性方面表现不佳。其解决方案的关键在于提出TimePoint,一种自监督方法,通过从合成数据中学习关键点和描述符,显著加速基于DTW的对齐过程,并通常提升对齐精度。TimePoint借鉴了二维关键点检测的思想,但针对一维信号的独特挑战进行了调整,利用高效的1D微分同胚生成真实训练数据,并结合全卷积和小波卷积架构提取具有信息量的关键点和描述符,从而实现更高效和准确的对齐。
链接: https://arxiv.org/abs/2505.23475
作者: Ron Shapira Weber,Shahar Ben Ishay,Andrey Lavrinenko,Shahaf E. Finder,Oren Freifeld
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICML 2025
Abstract:Fast and scalable alignment of time series is a fundamental challenge in many domains. The standard solution, Dynamic Time Warping (DTW), struggles with poor scalability and sensitivity to noise. We introduce TimePoint, a self-supervised method that dramatically accelerates DTW-based alignment while typically improving alignment accuracy by learning keypoints and descriptors from synthetic data. Inspired by 2D keypoint detection but carefully adapted to the unique challenges of 1D signals, TimePoint leverages efficient 1D diffeomorphisms, which effectively model nonlinear time warping, to generate realistic training data. This approach, along with fully convolutional and wavelet convolutional architectures, enables the extraction of informative keypoints and descriptors. Applying DTW to these sparse representations yields major speedups and typically higher alignment accuracy than standard DTW applied to the full signals. TimePoint demonstrates strong generalization to real-world time series when trained solely on synthetic data, and further improves with fine-tuning on real data. Extensive experiments demonstrate that TimePoint consistently achieves faster and more accurate alignments than standard DTW, making it a scalable solution for time-series analysis. Our code is available at this https URL
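作为背景,标准 DTW 的 O(nm) 动态规划正是 TimePoint 想要绕开的瓶颈——在学到的稀疏关键点上跑 DTW,序列长度大幅缩短。下面给出标准 DTW 的最小实现以说明其复杂度来源(与论文代码无关):

```python
import numpy as np

def dtw_distance(x, y):
    """标准 DTW:O(len(x)*len(y)) 动态规划,长序列上即为瓶颈所在"""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.sin(np.linspace(0, 2 * np.pi, 50))
b = np.sin(np.linspace(0, 2 * np.pi, 80))   # 同一波形的非线性时间拉伸
c = np.random.default_rng(0).random(80)     # 无关的随机序列

print(dtw_distance(a, b) < dtw_distance(a, c))   # True:DTW 对时间拉伸鲁棒
```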
zh
[CV-52] A Divide-and-Conquer Approach for Global Orientation of Non-Watertight Scene-Level Point Clouds Using 0-1 Integer Optimization SIGGRAPH2025
【速读】:该论文旨在解决大规模、非封闭的3D场景点云定向(point cloud orientation)问题,这一问题在计算机图形学和3D视觉中具有广泛应用,但现有方法主要针对封闭的、对象级别的3D模型,难以处理复杂的大规模场景。其解决方案的关键在于提出一种名为DACPO(Divide-And-Conquer Point Orientation)的框架,该框架采用分而治之策略,将输入点云分割为较小的块进行独立处理,并通过全局优化阶段整合结果,从而实现可扩展且鲁棒的点云定向。
链接: https://arxiv.org/abs/2505.23469
作者: Zhuodong Li,Fei Hou,Wencheng Wang,Xuequan Lu,Ying He
机构: Key Laboratory of System Software (CAS), State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所系统软件重点实验室); University of Chinese Academy of Sciences (中国科学院大学); The University of Western Australia (西澳大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to SIGGRAPH 2025
Abstract:Orienting point clouds is a fundamental problem in computer graphics and 3D vision, with applications in reconstruction, segmentation, and analysis. While significant progress has been made, existing approaches mainly focus on watertight, object-level 3D models. The orientation of large-scale, non-watertight 3D scenes remains an underexplored challenge. To address this gap, we propose DACPO (Divide-And-Conquer Point Orientation), a novel framework that leverages a divide-and-conquer strategy for scalable and robust point cloud orientation. Rather than attempting to orient an unbounded scene at once, DACPO segments the input point cloud into smaller, manageable blocks, processes each block independently, and integrates the results through a global optimization stage. For each block, we introduce a two-step process: estimating initial normal orientations by a randomized greedy method and refining them by an adapted iterative Poisson surface reconstruction. To achieve consistency across blocks, we model inter-block relationships using an undirected graph, where nodes represent blocks and edges connect spatially adjacent blocks. To reliably evaluate orientation consistency between adjacent blocks, we introduce the concept of the visible connected region, which defines the region over which visibility-based assessments are performed. The global integration is then formulated as a 0-1 integer-constrained optimization problem, with block flip states as binary variables. Despite the combinatorial nature of the problem, DACPO remains scalable by limiting the number of blocks (typically a few hundred for 3D scenes) involved in the optimization. Experiments on benchmark datasets demonstrate DACPO’s strong performance, particularly in challenging large-scale, non-watertight scenarios where existing methods often fail. The source code is available at this https URL.
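摘要将全局整合表述为以块翻转状态为二值变量的 0-1 整数优化;下面在一个虚构的小图上用暴力枚举演示该思想(边权与图结构均为假设,真实方法中一致性由可见连通区域评估给出,且块数较多时需要专门的 0-1 整数规划求解器):

```python
from itertools import product

# 虚构的块间一致性图:s_ij=+1 表示块 i、j 当前法向一致,-1 表示冲突;
# 翻转变量 x_i ∈ {0,1};边 (i,j) 最终一致当且仅当 s_ij * (-1)**(x_i+x_j) > 0
edges = {(0, 1): +1, (1, 2): -1, (2, 3): +1, (0, 3): -1, (0, 2): -1}

def inconsistency(x):
    """给定各块翻转状态 x,统计仍不一致的边数(即优化目标)"""
    return sum(1 for (i, j), s in edges.items()
               if s * (-1) ** (x[i] + x[j]) < 0)

# 块数很小时可直接枚举全部 2^4 种翻转组合
best = min(product([0, 1], repeat=4), key=inconsistency)
print(best, inconsistency(best))   # (0, 0, 1, 1) 0:翻转块 2、3 后全图一致
```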
zh
[CV-53] Semantics-Aware Human Motion Generation from Audio Instructions
【速读】:该论文试图解决音频信号与生成动作之间语义关联薄弱的问题,即现有方法多关注音乐或语音节奏的匹配,而未能有效捕捉音频的语义内容以指导动作生成。解决方案的关键在于提出一个端到端框架,采用带有记忆检索注意力模块的掩码生成式Transformer,以处理稀疏且长时的音频输入,从而增强音频语义与生成动作之间的对齐性。
链接: https://arxiv.org/abs/2505.23465
作者: Zi-An Wang,Shihao Zou,Shiyao Yu,Mingyuan Zhang,Chao Dong
机构: 未知
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in interactive technologies have highlighted the prominence of audio signals for semantic encoding. This paper explores a new task, where audio signals are used as conditioning inputs to generate motions that align with the semantics of the audio. Unlike text-based interactions, audio provides a more natural and intuitive communication method. However, existing methods typically focus on matching motions with music or speech rhythms, which often results in a weak connection between the semantics of the audio and generated motions. We propose an end-to-end framework using a masked generative transformer, enhanced by a memory-retrieval attention module to handle sparse and lengthy audio inputs. Additionally, we enrich existing datasets by converting descriptions into conversational style and generating corresponding audio with varied speaker identities. Experiments demonstrate the effectiveness and efficiency of the proposed framework, showing that audio instructions can convey semantics similar to text while providing more practical and user-friendly interactions.
zh
[CV-54] Revisiting Reweighted Risk for Calibration: AURC Focal Loss and Inverse Focal Loss
【速读】:该论文试图解决深度学习中模型校准(calibration)问题,特别是针对重加权风险函数(reweighted risk functionals)如焦点损失(focal loss)、逆焦点损失和AURC(Area Under the Risk-Coverage Curve)在校准性能上的差异。论文的关键在于建立这些重加权策略与校准误差之间的原理性联系,并提出一种正则化的AURC损失函数,该函数通过选择置信度评分函数(confidence score functions, CSFs)提供了更高的灵活性,同时利用SoftRank技术实现可微分优化,从而有效提升模型的校准性能。
链接: https://arxiv.org/abs/2505.23463
作者: Han Zhou,Sebastian G.Gruber,Teodora Popordanoska,Matthew B. Blaschko
机构: ESAT-PSI, KU Leuven, Belgium
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Several variants of reweighted risk functionals, such as focal loss, inverse focal loss, and the Area Under the Risk-Coverage Curve (AURC), have been proposed in the literature and claims have been made in relation to their calibration properties. However, focal loss and inverse focal loss propose vastly different weighting schemes. In this paper, we revisit a broad class of weighted risk functions commonly used in deep learning and establish a principled connection between these reweighting schemes and calibration errors. We show that minimizing calibration error is closely linked to the selective classification paradigm and demonstrate that optimizing a regularized variant of the AURC naturally leads to improved calibration. This regularized AURC shares a similar reweighting strategy with inverse focal loss, lending support to the idea that focal loss is less principled when calibration is a desired outcome. Direct AURC optimization offers greater flexibility through the choice of confidence score functions (CSFs). To enable gradient-based optimization, we introduce a differentiable formulation of the regularized AURC using the SoftRank technique. Empirical evaluations demonstrate that our AURC-based loss achieves competitive class-wise calibration performance across a range of datasets and model architectures.
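摘要指出焦点损失与逆焦点损失的加权方案截然相反:前者对高置信样本降权,后者升权。二者常见形式的权重因子可示意如下(γ 的取值与确切公式以各自原论文为准,此处仅演示两种加权方向的差异):

```python
def focal_weight(p, gamma=2.0):
    """焦点损失的调制因子 (1-p)^γ:预测越自信(p 越大)权重越小"""
    return (1.0 - p) ** gamma

def inverse_focal_weight(p, gamma=2.0):
    """逆焦点损失常用的 (1+p)^γ 形式:预测越自信权重越大"""
    return (1.0 + p) ** gamma

for p in (0.1, 0.5, 0.9):   # p 为正确类别的预测概率
    print(p, round(focal_weight(p), 3), round(inverse_focal_weight(p), 3))
```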
zh
[CV-55] LAFR: Efficient Diffusion-based Blind Face Restoration via Latent Codebook Alignment Adapter
【速读】:该论文旨在解决从低质量(LQ)图像中进行盲人脸识别恢复的问题,该任务不仅需要高保真度的图像重建,还需保持面部身份的一致性。现有方法通常通过重新训练变分自编码器(VAE)编码器来缓解LQ输入与HQ数据之间的语义不匹配问题,但这一过程计算成本高且内存消耗大。论文提出的解决方案关键在于LAFR(Latent Alignment for Face Restoration),这是一种基于代码本的潜在空间适配器,能够将LQ图像的潜在分布与HQ图像对齐,从而在不修改原始VAE的情况下实现语义一致的扩散采样。此外,通过引入多层级恢复损失和利用人脸图像的结构规律,仅需对0.9%的FFHQ数据集进行轻量级微调即可达到先进水平的性能,显著提升了训练效率。
链接: https://arxiv.org/abs/2505.23462
作者: Runyi Li,Bin Chen,Jian Zhang,Radu Timofte
机构: Peking University (北京大学); University of Würzburg (维尔茨堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Blind face restoration from low-quality (LQ) images is a challenging task that requires not only high-fidelity image reconstruction but also the preservation of facial identity. While diffusion models like Stable Diffusion have shown promise in generating high-quality (HQ) images, their VAE modules are typically trained only on HQ data, resulting in semantic misalignment when encoding LQ inputs. This mismatch significantly weakens the effectiveness of LQ conditions during the denoising process. Existing approaches often tackle this issue by retraining the VAE encoder, which is computationally expensive and memory-intensive. To address this limitation efficiently, we propose LAFR (Latent Alignment for Face Restoration), a novel codebook-based latent space adapter that aligns the latent distribution of LQ images with that of HQ counterparts, enabling semantically consistent diffusion sampling without altering the original VAE. To further enhance identity preservation, we introduce a multi-level restoration loss that combines constraints from identity embeddings and facial structural priors. Additionally, by leveraging the inherent structural regularity of facial images, we show that lightweight finetuning of the diffusion prior on just 0.9% of the FFHQ dataset is sufficient to achieve results comparable to state-of-the-art methods while reducing training time by 70%. Extensive experiments on both synthetic and real-world face restoration benchmarks demonstrate the effectiveness and efficiency of LAFR, achieving high-quality, identity-preserving face reconstruction from severely degraded inputs.
zh
[CV-56] A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation
【Quick Read】: This paper targets the spurious correlations induced by the causal chain structure of existing two-stage Scene Graph Generation (SGG) frameworks, where the prediction of one relationship can be influenced by other relationships, giving rise to at least two observable biases: tail relationships predicted as head ones, and foreground relationships predicted as background ones, the latter of which is seldom discussed in the literature. The key to the solution is reconstructing the causal chain into a reverse causal structure that treats the classifier's inputs as the confounder and both the detector's inputs and the final predictions as causal variables, yielding the reverse causal framework RcSGG. Its core components, Active Reverse Estimation (ARE) and Maximum Information Sampling (MIS), intervene on the confounder and exploit relationship information to strengthen the estimation of reverse causality, thereby mitigating the spurious correlations in SGG frameworks and eliminating the biases they induce.
Link: https://arxiv.org/abs/2505.23451
Authors: Shuzhou Sun,Li Liu,Tianpeng Liu,Shuaifeng Zhi,Ming-Ming Cheng,Janne Heikkilä,Yongxiang Liu
Affiliations: National University of Defense Technology (NUDT); Tsinghua University; University of Oulu; Nankai University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 21 pages, 11 figures, 12 tables
Abstract:Existing two-stage Scene Graph Generation (SGG) frameworks typically incorporate a detector to extract relationship features and a classifier to categorize these relationships; therefore, the training paradigm follows a causal chain structure, where the detector’s inputs determine the classifier’s inputs, which in turn influence the final predictions. However, such a causal chain structure can yield spurious correlations between the detector’s inputs and the final predictions, i.e., the prediction of a certain relationship may be influenced by other relationships. This influence can induce at least two observable biases: tail relationships are predicted as head ones, and foreground relationships are predicted as background ones; notably, the latter bias is seldom discussed in the literature. To address this issue, we propose reconstructing the causal chain structure into a reverse causal structure, wherein the classifier’s inputs are treated as the confounder, and both the detector’s inputs and the final predictions are viewed as causal variables. Specifically, we term the reconstructed causal paradigm as the Reverse causal Framework for SGG (RcSGG). RcSGG initially employs the proposed Active Reverse Estimation (ARE) to intervene on the confounder to estimate the reverse causality, i.e., the causality from final predictions to the classifier’s inputs. Then, the Maximum Information Sampling (MIS) is suggested to enhance the reverse causality estimation further by considering the relationship information. Theoretically, RcSGG can mitigate the spurious correlations inherent in the SGG framework, subsequently eliminating the induced biases. Comprehensive experiments on popular benchmarks and diverse SGG frameworks show the state-of-the-art mean recall rate.
[CV-57] Network Inversion for Uncertainty-Aware Out-of-Distribution Detection
【Quick Read】: This paper tackles out-of-distribution (OOD) detection and uncertainty estimation (UE), two critical components of building safe machine learning systems. The key to the solution is combining network inversion with classifier training: a "garbage" class is introduced to represent anomalous inputs, and the classifier is repeatedly inverted and retrained so that out-of-distribution samples are pushed into the garbage class while meaningful decision boundaries are learned. The approach requires no external OOD datasets or post-hoc calibration techniques, providing a unified and scalable solution.
Link: https://arxiv.org/abs/2505.23448
Authors: Pirzada Suhail,Rehna Afroz,Amit Sethi
Affiliations: IIT Bombay
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Out-of-distribution (OOD) detection and uncertainty estimation (UE) are critical components for building safe machine learning systems, especially in real-world scenarios where unexpected inputs are inevitable. In this work, we propose a novel framework that combines network inversion with classifier training to simultaneously address both OOD detection and uncertainty estimation. For a standard n-class classification task, we extend the classifier to an (n+1)-class model by introducing a “garbage” class, initially populated with random Gaussian noise to represent outlier inputs. After each training epoch, we use network inversion to reconstruct input images corresponding to all output classes; reconstructions that appear noisy and incoherent are relegated to the garbage class for retraining the classifier. This cycle of training, inversion, and exclusion continues iteratively until the inverted samples begin to resemble the in-distribution data more closely, suggesting that the classifier has learned to carve out meaningful decision boundaries while sanitising the class manifolds by pushing OOD content into the garbage class. During inference, this training scheme enables the model to effectively detect and reject OOD samples by classifying them into the garbage class. Furthermore, the confidence scores associated with each prediction can be used to estimate uncertainty for both in-distribution and OOD inputs. Our approach is scalable, interpretable, and does not require access to external OOD datasets or post-hoc calibration techniques while providing a unified solution to the dual challenges of OOD detection and uncertainty estimation.
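To make the inference-time behavior concrete, here is a minimal NumPy sketch of (n+1)-class rejection with a garbage class: the extra output absorbs OOD inputs, and the softmax confidence doubles as an uncertainty estimate. The function name and example logits are illustrative, not the paper's implementation.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_with_rejection(logits, garbage_idx):
    """Classify with an extra 'garbage' class: inputs whose argmax falls on
    the garbage class are rejected as OOD; the top softmax probability serves
    as confidence, and 1 - confidence as a simple uncertainty estimate."""
    probs = softmax(logits)
    label = int(np.argmax(probs))
    confidence = float(probs[label])
    is_ood = (label == garbage_idx)
    return label, is_ood, 1.0 - confidence

# A 3-class task extended to 4 outputs (index 3 = garbage class).
in_dist = np.array([4.0, 0.5, 0.2, 0.1])   # confident in-distribution input
ood     = np.array([0.2, 0.1, 0.3, 3.5])   # input routed to the garbage class
print(predict_with_rejection(in_dist, garbage_idx=3))  # label 0, not OOD
print(predict_with_rejection(ood, garbage_idx=3))      # rejected as OOD
```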
[CV-58] CryoCCD: Conditional Cycle-consistent Diffusion with Biophysical Modeling for Cryo-EM Synthesis
【Quick Read】: This paper addresses the scarcity of high-quality annotated data in cryo-electron microscopy (cryo-EM), which hinders the development of models for downstream analysis. The key to the solution is CryoCCD, a framework that integrates biophysical modeling with generative techniques: it produces multi-scale cryo-EM micrographs through compositional heterogeneity, cellular context, and physics-informed imaging, and generates realistic noise with a conditional diffusion model enhanced by cycle consistency and mask-aware contrastive learning, thereby improving structural fidelity and downstream task performance.
Link: https://arxiv.org/abs/2505.23444
Authors: Runmin Jiang,Genpei Zhang,Yuntian Yang,Siqi Wu,Yuheng Zhang,Wanyue Feng,Yizhou Zhao,Xi Xiao,Xiao Wang,Tianyang Wang,Xingjian Li,Min Xu
Affiliations: Carnegie Mellon University; University of Alabama at Birmingham; Oak Ridge National Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Cryo-electron microscopy (cryo-EM) offers near-atomic resolution imaging of macromolecules, but developing robust models for downstream analysis is hindered by the scarcity of high-quality annotated data. While synthetic data generation has emerged as a potential solution, existing methods often fail to capture both the structural diversity of biological specimens and the complex, spatially varying noise inherent in cryo-EM imaging. To overcome these limitations, we propose CryoCCD, a synthesis framework that integrates biophysical modeling with generative techniques. Specifically, CryoCCD produces multi-scale cryo-EM micrographs that reflect realistic biophysical variability through compositional heterogeneity, cellular context, and physics-informed imaging. To generate realistic noise, we employ a conditional diffusion model, enhanced by cycle consistency to preserve structural fidelity and mask-aware contrastive learning to capture spatially adaptive noise patterns. Extensive experiments show that CryoCCD generates structurally accurate micrographs and enhances performance in downstream tasks, outperforming state-of-the-art baselines in both particle picking and reconstruction.
[CV-59] VITON-DRR: Details Retention Virtual Try-on via Non-rigid Registration
【Quick Read】: This paper aims to solve the loss of garment details in image-based virtual try-on caused by pose mismatch between the garment and the human body or by self-occlusion, which severely degrades the realism and quality of try-on results. The key to the solution is a detail-retention virtual try-on method based on accurate non-rigid registration (VITON-DRR): a dual-pyramid-structured feature extractor performs human semantic segmentation, a dedicated deformation module accurately extracts garment key points and warps them with a non-rigid registration algorithm, and an image synthesis module finally produces try-on images with more accurate deformation and richer details.
Link: https://arxiv.org/abs/2505.23439
Authors: Ben Li,Minqi Li,Jie Ren,Kaibing Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 12 figures, Accepted by Computers & Graphics
Abstract:Image-based virtual try-on aims to fit a target garment to a specific person image and has attracted extensive research attention because of its huge application potential in the e-commerce and fashion industries. To generate high-quality try-on results, accurately warping the clothing item to fit the human body plays a significant role, as slight misalignment may lead to unrealistic artifacts in the fitting image. Most existing methods warp the clothing by feature matching and thin-plate spline (TPS). However, these approaches often fail to preserve clothing details due to self-occlusion, severe misalignment between poses, etc. To address these challenges, this paper proposes a detail retention virtual try-on method via accurate non-rigid registration (VITON-DRR) for diverse human poses. Specifically, we reconstruct a human semantic segmentation using a dual-pyramid-structured feature extractor. Then, a novel Deformation Module is designed for extracting the cloth key points and warping them through an accurate non-rigid registration algorithm. Finally, the Image Synthesis Module is designed to synthesize the deformed garment image and generate the human pose information adaptively. Compared with traditional methods, the proposed VITON-DRR can make the deformation of fitting images more accurate and retain more garment details. The experimental results demonstrate that the proposed method performs better than state-of-the-art methods.
[CV-60] Adaptive Spatial Augmentation for Semi-supervised Semantic Segmentation
【Quick Read】: This paper addresses the inadequacy of data augmentation strategies in semi-supervised semantic segmentation (SSSS): under the weak-to-strong consistency regularization framework, existing methods rely mainly on intensity-based perturbations while overlooking the effectiveness of spatial transformations such as translation and rotation. The key to the solution is Adaptive Spatial Augmentation (ASAug), which uses an entropy-based dynamic adjustment mechanism to tailor the augmentation to each image instance, improving performance on SSSS tasks and achieving state-of-the-art results on multiple benchmark datasets.
Link: https://arxiv.org/abs/2505.23438
Authors: Lingyan Ran,Yali Li,Tao Zhuo,Shizhou Zhang,Yanning Zhang
Affiliations: Northwestern Polytechnical University; Northwest A&F University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures
Abstract:In semi-supervised semantic segmentation (SSSS), data augmentation plays a crucial role in the weak-to-strong consistency regularization framework, as it enhances diversity and improves model generalization. Recent strong augmentation methods have primarily focused on intensity-based perturbations, which have minimal impact on the semantic masks. In contrast, spatial augmentations like translation and rotation have long been acknowledged for their effectiveness in supervised semantic segmentation tasks, but they are often ignored in SSSS. In this work, we demonstrate that spatial augmentation can also contribute to model training in SSSS, despite generating inconsistent masks between the weak and strong augmentations. Furthermore, recognizing the variability among images, we propose an adaptive augmentation strategy that dynamically adjusts the augmentation for each instance based on entropy. Extensive experiments show that our proposed Adaptive Spatial Augmentation (ASAug) can be integrated as a pluggable module, consistently improving the performance of existing methods and achieving state-of-the-art results on benchmark datasets such as PASCAL VOC 2012, Cityscapes, and COCO.
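As a rough illustration of the entropy-based per-instance adjustment, the sketch below maps a prediction's normalized entropy to a spatial-augmentation magnitude. The direction of the mapping (more confident predictions receive stronger rotations) and the `max_angle` value are assumptions for illustration, not ASAug's actual schedule.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy of per-pixel class distributions, averaged over pixels."""
    p = np.clip(probs, eps, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def adaptive_rotation_angle(probs, n_classes, max_angle=30.0):
    """Scale the spatial-augmentation strength by normalized entropy:
    high-entropy (uncertain) predictions get milder rotations, low-entropy
    (confident) ones get stronger rotations. Illustrative mapping only."""
    h = prediction_entropy(probs) / np.log(n_classes)  # normalize to [0, 1]
    return max_angle * (1.0 - h)

confident = np.array([[0.97, 0.01, 0.01, 0.01]])  # peaked prediction
uncertain = np.full((1, 4), 0.25)                 # uniform prediction
print(adaptive_rotation_angle(confident, 4))  # close to max_angle
print(adaptive_rotation_angle(uncertain, 4))  # close to 0
```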
[CV-61] UrbanCraft: Urban View Extrapolation via Hierarchical Sem-Geometric Priors
【Quick Read】: This paper targets the weakness of neural rendering-based urban scene reconstruction on the Extrapolated View Synthesis (EVS) problem, i.e., poor performance on views outside the training camera trajectory (such as looking left, right, or downwards), which limits the generalizability of reconstruction applications. The key to the solution is UrbanCraft, which overcomes the EVS problem with hierarchical semantic-geometric representations as additional priors: it reconstructs coarse semantic and geometric primitives from the partially observable scene and establishes a scene-level prior via an occupancy grid, incorporates instance-level priors from 3D bounding boxes to enhance object-level details and spatial relationships, and finally proposes Hierarchical Semantic-Geometric-Guided Variational Score Distillation (HSG-VSD), which integrates semantic and geometric constraints from a pretrained UrbanCraft2D into the score distillation sampling process, forcing the generated distribution to be consistent with the observable scene.
Link: https://arxiv.org/abs/2505.23434
Authors: Tianhang Wang,Fan Lu,Sanqing Qu,Guo Yu,Shihang Du,Ya Wu,Yuan Huang,Guang Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing neural rendering-based urban scene reconstruction methods mainly focus on the Interpolated View Synthesis (IVS) setting that synthesizes from views close to training camera trajectory. However, IVS can not guarantee the on-par performance of the novel view outside the training camera distribution (e.g., looking left, right, or downwards), which limits the generalizability of the urban reconstruction application. Previous methods have optimized it via image diffusion, but they fail to handle text-ambiguous or large unseen view angles due to coarse-grained control of text-only diffusion. In this paper, we design UrbanCraft, which surmounts the Extrapolated View Synthesis (EVS) problem using hierarchical sem-geometric representations serving as additional priors. Specifically, we leverage the partially observable scene to reconstruct coarse semantic and geometric primitives, establishing a coarse scene-level prior through an occupancy grid as the base representation. Additionally, we incorporate fine instance-level priors from 3D bounding boxes to enhance object-level details and spatial relationships. Building on this, we propose the Hierarchical Semantic-Geometric-Guided Variational Score Distillation (HSG-VSD), which integrates semantic and geometric constraints from pretrained UrbanCraft2D into the score distillation sampling process, forcing the distribution to be consistent with the observable scene. Qualitative and quantitative comparisons demonstrate the effectiveness of our methods on EVS problem.
[CV-62] Buffer-free Class-Incremental Learning with Out-of-Distribution Detection
【Quick Read】: This paper addresses two core problems of class-incremental learning (CIL) in open-world settings: learning new classes without forgetting old ones, and handling inputs from unknown classes. Conventional methods train multi-head models under the task-incremental learning framework and predict task identity with an out-of-distribution (OOD) detector, but they rely on joint training with a memory buffer of past data, raising concerns about privacy, scalability, and increased training time. The key contribution is an in-depth analysis of post-hoc OOD detection methods and a demonstration that, applied appropriately at inference time, they can effectively replace buffer-based OOD detection, enabling efficient and privacy-preserving CIL systems without a memory buffer.
Link: https://arxiv.org/abs/2505.23412
Authors: Srishti Gupta,Daniele Angioni,Maura Pintor,Ambra Demontis,Lea Schönherr,Battista Biggio,Fabio Roli
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Class-incremental learning (CIL) poses significant challenges in open-world scenarios, where models must not only learn new classes over time without forgetting previous ones but also handle inputs from unknown classes that a closed-set model would misclassify. Recent works address both issues by (i) training multi-head models using the task-incremental learning framework, and (ii) predicting the task identity employing out-of-distribution (OOD) detectors. While effective, the latter mainly relies on joint training with a memory buffer of past data, raising concerns around privacy, scalability, and increased training time. In this paper, we present an in-depth analysis of post-hoc OOD detection methods and investigate their potential to eliminate the need for a memory buffer. We uncover that these methods, when applied appropriately at inference time, can serve as a strong substitute for buffer-based OOD detection. We show that this buffer-free approach achieves comparable or superior performance to buffer-based methods both in terms of class-incremental learning and the rejection of unknown samples. Experimental results on CIFAR-10, CIFAR-100 and Tiny ImageNet datasets support our findings, offering new insights into the design of efficient and privacy-preserving CIL systems for open-world settings.
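Two widely used post-hoc OOD scores of the kind analyzed in this line of work, maximum softmax probability (MSP) and the energy score, can be computed directly from a trained classifier's logits at inference time with no retraining. This NumPy sketch is generic, not the paper's exact protocol:

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability; higher means more in-distribution."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return p.max(axis=-1)

def energy_score(logits, T=1.0):
    """Negative free energy T*logsumexp(logits/T); higher means more in-distribution."""
    z = logits / T
    m = z.max(axis=-1, keepdims=True)
    return T * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

logits = np.array([[6.0, 0.1, -0.2],    # peaked logits: in-distribution-like
                   [0.2, 0.1, 0.15]])   # flat, low-magnitude logits: OOD-like
print(msp_score(logits))
print(energy_score(logits))
```

Thresholding either score then decides whether an input belongs to a known class or should be rejected.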
[CV-63] Video Editing for Audio-Visual Dubbing
【Quick Read】: This paper addresses identity preservation and synchronization in visual dubbing, i.e., how to retain the contextual information of the original video while achieving precise lip synchronization when aligning facial movements with new speech. Existing methods either generate talking faces that are hard to blend naturally into the original scene, or use inpainting techniques that discard vital visual information such as partial occlusions and lighting variations. The key to the solution is EdiDub, a framework that reformulates visual dubbing as a content-aware editing task, using a specialized conditioning scheme to ensure faithful and accurate modifications of the original video rather than mere copying, thereby preserving complex visual elements while achieving more accurate lip synchronization.
Link: https://arxiv.org/abs/2505.23406
Authors: Binyamin Manela,Sharon Gannot,Ethan Fetyaya
Affiliations: Bar-Ilan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Visual dubbing, the synchronization of facial movements with new speech, is crucial for making content accessible across different languages, enabling broader global reach. However, current methods face significant limitations. Existing approaches often generate talking faces, hindering seamless integration into original scenes, or employ inpainting techniques that discard vital visual information like partial occlusions and lighting variations. This work introduces EdiDub, a novel framework that reformulates visual dubbing as a content-aware editing task. EdiDub preserves the original video context by utilizing a specialized conditioning scheme to ensure faithful and accurate modifications rather than mere copying. On multiple benchmarks, including a challenging occluded-lip dataset, EdiDub significantly improves identity preservation and synchronization. Human evaluations further confirm its superiority, achieving higher synchronization and visual naturalness scores compared to the leading methods. These results demonstrate that our content-aware editing approach outperforms traditional generation or inpainting, particularly in maintaining complex visual elements while ensuring accurate lip synchronization.
[CV-64] Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation
【Quick Read】: This paper aims to improve depth prediction accuracy in monocular depth estimation (MDE) for complex scenes, especially the bottleneck of handling intricate structures and overlapping objects. The key to the solution is BriGeS, which fuses geometric and semantic information within foundation models: a Bridging Gate integrates the complementary strengths of depth and segmentation foundation models, and an Attention Temperature Scaling technique finely adjusts the attention mechanisms to keep the model's focus balanced across diverse input features, improving generalization and estimation accuracy in complex scenes.
Link: https://arxiv.org/abs/2505.23400
Authors: Sanggyun Ma,Wonjoon Choi,Jihun Park,Jaeyeul Kim,Seunghun Lee,Jiwan Seo,Sunghoon Im
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present Bridging Geometric and Semantic (BriGeS), an effective method that fuses geometric and semantic information within foundation models to enhance Monocular Depth Estimation (MDE). Central to BriGeS is the Bridging Gate, which integrates the complementary strengths of depth and segmentation foundation models. This integration is further refined by our Attention Temperature Scaling technique. It finely adjusts the focus of the attention mechanisms to prevent over-concentration on specific features, thus ensuring balanced performance across diverse inputs. BriGeS capitalizes on pre-trained foundation models and adopts a strategy that focuses on training only the Bridging Gate. This method significantly reduces resource demands and training time while maintaining the model’s ability to generalize effectively. Extensive experiments across multiple challenging datasets demonstrate that BriGeS outperforms state-of-the-art methods in MDE for complex scenes, effectively handling intricate structures and overlapping objects.
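The idea behind attention temperature scaling can be sketched as an extra temperature factor in scaled dot-product attention: raising the temperature flattens the attention weights and prevents over-concentration on a few features. Where exactly BriGeS applies this factor is not specified in the abstract, so the placement below is an assumption.

```python
import numpy as np

def attention(q, k, v, temperature=1.0):
    """Scaled dot-product attention with an illustrative temperature factor:
    temperature > 1 flattens the attention distribution, spreading focus
    across more keys; temperature < 1 sharpens it."""
    d = q.shape[-1]
    scores = q @ k.T / (np.sqrt(d) * temperature)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))   # 2 queries, dim 4
k = rng.normal(size=(5, 4))   # 5 keys
v = rng.normal(size=(5, 4))
_, w_sharp = attention(q, k, v, temperature=0.5)
_, w_flat = attention(q, k, v, temperature=4.0)
print(w_sharp.max(), w_flat.max())  # the flatter map has a smaller peak weight
```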
[CV-65] Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings
【Quick Read】: This paper studies panoptic symbol spotting in computer-aided design (CAD) drawings, i.e., simultaneously identifying individual instances of countable things and semantic regions of uncountable stuff. Existing methods typically rely on image rasterization, graph construction, or point-based representations, which often suffer from high computational cost, limited generality, and loss of geometric structural information. The key to the solution is a line-based representation of primitives, which preserves the geometric continuity of the original primitives, enabling more accurate shape representation while remaining computation-friendly and well suited to vector graphic understanding tasks. In addition, a Branch Fusion Refinement module effectively integrates instance and semantic predictions, resolving their inconsistencies and improving the coherence of the panoptic outputs.
Link: https://arxiv.org/abs/2505.23395
Authors: Xingguang Wei,Haomin Wang,Shenglong Ye,Ruifeng Luo,Yanting Zhang,Lixin Gu,Jifeng Dai,Yu Qiao,Wenhai Wang,Hongjie Zhang
Affiliations: Shanghai AI Laboratory; University of Science and Technology of China; Shanghai Jiao Tong University; Tongji University; Donghua University; Tsinghua University; The Chinese University of Hong Kong; Arcplus East China Architectural Design & Research Institute Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We study the task of panoptic symbol spotting, which involves identifying both individual instances of countable things and the semantic regions of uncountable stuff in computer-aided design (CAD) drawings composed of vector graphical primitives. Existing methods typically rely on image rasterization, graph construction, or point-based representation, but these approaches often suffer from high computational costs, limited generality, and loss of geometric structural information. In this paper, we propose VecFormer, a novel method that addresses these challenges through line-based representation of primitives. This design preserves the geometric continuity of the original primitive, enabling more accurate shape representation while maintaining a computation-friendly structure, making it well-suited for vector graphic understanding tasks. To further enhance prediction reliability, we introduce a Branch Fusion Refinement module that effectively integrates instance and semantic predictions, resolving their inconsistencies for more coherent panoptic outputs. Extensive experiments demonstrate that our method establishes a new state-of-the-art, achieving 91.1 PQ, with Stuff-PQ improved by 9.6 and 21.2 points over the second-best results under settings with and without prior information, respectively, highlighting the strong potential of line-based representation as a foundation for vector graphic understanding.
[CV-66] Robust and Annotation-Free Wound Segmentation on Noisy Real-World Pressure Ulcer Images: Towards Automated DESIGN-R® Assessment
【Quick Read】: This paper addresses the limited generalization of existing wound segmentation models (such as FUSegNet) across body sites, particularly their poor performance on datasets beyond foot ulcers. The key to the solution is an annotation-efficient pipeline that combines a lightweight YOLOv11n-based detector with the pre-trained FUSegNet segmentation model: robust performance is achieved with only 500 manually labeled bounding boxes, without pixel-level annotations or fine-tuning for new anatomical regions, effectively bridging the domain gap and enabling direct deployment across diverse wound types.
Link: https://arxiv.org/abs/2505.23392
Authors: Yun-Cheng Tsai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Purpose: Accurate wound segmentation is essential for automated DESIGN-R scoring. However, existing models such as FUSegNet, which are trained primarily on foot ulcer datasets, often fail to generalize to wounds on other body sites. Methods: We propose an annotation-efficient pipeline that combines a lightweight YOLOv11n-based detector with the pre-trained FUSegNet segmentation model. Instead of relying on pixel-level annotations or retraining for new anatomical regions, our method achieves robust performance using only 500 manually labeled bounding boxes. This zero fine-tuning approach effectively bridges the domain gap and enables direct deployment across diverse wound types. This is an advance not previously demonstrated in the wound segmentation literature. Results: Evaluated on three real-world test sets spanning foot, sacral, and trochanter wounds, our YOLO plus FUSegNet pipeline improved mean IoU by 23 percentage points over vanilla FUSegNet and increased end-to-end DESIGN-R size estimation accuracy from 71 percent to 94 percent (see Table 3 for details). Conclusion: Our pipeline generalizes effectively across body sites without task-specific fine-tuning, demonstrating that minimal supervision, with 500 annotated ROIs, is sufficient for scalable, annotation-light wound segmentation. This capability paves the way for real-world DESIGN-R automation, reducing reliance on pixel-wise labeling, streamlining documentation workflows, and supporting objective and consistent wound scoring in clinical practice. We will publicly release the trained detector weights and configuration to promote reproducibility and facilitate downstream deployment. 
Cite as: arXiv:2505.23392 [cs.CV], https://doi.org/10.48550/arXiv.2505.23392. Submitted by Yun-Cheng Tsai, Thu, 29 May 2025 12:25:30 UTC (v1).
[CV-67] VModA: An Effective Framework for Adaptive NSFW Image Moderation
【Quick Read】: This paper tackles the proliferation of Not Safe/Suitable for Work (NSFW) content on social networks, which seriously harms citizens, especially minors. Existing detection methods rely mainly on deep learning-based image recognition and classification, but NSFW images are now presented in increasingly sophisticated ways, exploiting image details and complex semantics to obscure their true nature or attract more views, so they often evade existing detectors. The key to the solution is VModA, a general and effective framework that adapts to diverse moderation rules and handles complex, semantically rich NSFW content across categories, significantly improving detection accuracy.
Link: https://arxiv.org/abs/2505.23386
Authors: Han Bao,Qinying Wang,Zhi Chen,Qingming Li,Xuhong Zhang,Changjiang Li,Zonghui Wang,Shouling Ji,Wenzhi Chen
Affiliations: Zhejiang University; University of Illinois at Urbana-Champaign; Stony Brook University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Not Safe/Suitable for Work (NSFW) content is rampant on social networks and poses serious harm to citizens, especially minors. Current detection methods mainly rely on deep learning-based image recognition and classification. However, NSFW images are now presented in increasingly sophisticated ways, often using image details and complex semantics to obscure their true nature or attract more views. Although still understandable to humans, these images often evade existing detection methods, posing a significant threat. Further complicating the issue, varying regulations across platforms and regions create additional challenges for effective moderation, leading to detection bias and reduced accuracy. To address this, we propose VModA, a general and effective framework that adapts to diverse moderation rules and handles complex, semantically rich NSFW content across categories. Experimental results show that VModA significantly outperforms existing methods, achieving up to a 54.3% accuracy improvement across NSFW types, including those with complex semantics. Further experiments demonstrate that our method exhibits strong adaptability across categories, scenarios, and base VLMs. We also identified inconsistent and controversial label samples in public NSFW benchmark datasets, re-annotated them, and submitted corrections to the original maintainers. Two datasets have confirmed the updates so far. Additionally, we evaluate VModA in real-world scenarios to demonstrate its practical effectiveness.
[CV-68] UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
【Quick Read】: This paper addresses the reliance of unified multimodal large language models on large-scale datasets and heavy computation during pretraining, and the fact that existing post-training methods typically depend on external data or are limited to task-specific customization. The key to the solution is UniRL, a self-improving post-training approach in which the model generates images from prompts and uses them as training data in each iteration, requiring no external image data; the two tasks reinforce each other, with generated images used for understanding and understanding results used to supervise generation, improving performance while reducing the imbalance between generation and understanding.
Link: https://arxiv.org/abs/2505.23380
Authors: Weijia Mao,Zhenheng Yang,Mike Zheng Shou
Affiliations: Show Lab, National University of Singapore; ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. In addition, several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only several additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released in this https URL.
[CV-69] PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening
【Quick Read】: This paper addresses cross-modality misalignment between multi-spectral (MS) and panchromatic (PAN) images, which arises from differences in sensor placement, acquisition timing, and resolution, and causes spectral distortion, double edges, and blurring in conventional deep learning methods that assume perfect alignment. The key to the solution is the PAN-Crafter framework: Modality-Adaptive Reconstruction (MARs) lets a single network jointly reconstruct high-resolution multi-spectral (HRMS) and PAN images, using PAN's high-frequency details as auxiliary self-supervision, while Cross-Modality Alignment-Aware Attention (CM3A) bidirectionally aligns MS texture with PAN structure and vice versa, enabling adaptive feature refinement across modalities.
Link: https://arxiv.org/abs/2505.23367
Authors: Jeonghyeok Do,Sungpyo Kim,Geunhyuk Youk,Jaehyup Lee,Munchurl Kim
Affiliations: KAIST; Kyungpook National University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Please visit our project page this https URL
Abstract:PAN-sharpening aims to fuse high-resolution panchromatic (PAN) images with low-resolution multi-spectral (MS) images to generate high-resolution multi-spectral (HRMS) outputs. However, cross-modality misalignment – caused by sensor placement, acquisition timing, and resolution disparity – induces a fundamental challenge. Conventional deep learning methods assume perfect pixel-wise alignment and rely on per-pixel reconstruction losses, leading to spectral distortion, double edges, and blurring when misalignment is present. To address this, we propose PAN-Crafter, a modality-consistent alignment framework that explicitly mitigates the misalignment gap between PAN and MS modalities. At its core, Modality-Adaptive Reconstruction (MARs) enables a single network to jointly reconstruct HRMS and PAN images, leveraging PAN’s high-frequency details as auxiliary self-supervision. Additionally, we introduce Cross-Modality Alignment-Aware Attention (CM3A), a novel mechanism that bidirectionally aligns MS texture to PAN structure and vice versa, enabling adaptive feature refinement across modalities. Extensive experiments on multiple benchmark datasets demonstrate that our PAN-Crafter outperforms the most recent state-of-the-art method in all metrics, even with 50.11× faster inference time and 0.63× the memory size. Furthermore, it demonstrates strong generalization performance on unseen satellite datasets, showing its robustness across different conditions.
[CV-70] MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification
【Quick Read】: This paper aims to solve the problem that, in multimodal information processing, intricate and implicit dependencies across modalities prevent conventional methods from effectively capturing fine-grained semantic interactions, limiting their applicability to high-precision classification tasks. The key to the solution is the proposed Multimodal Collaborative Fusion Network (MCFNet): a regularized integrated fusion module improves intra-modal feature representation via modality-specific regularization strategies and achieves precise semantic alignment through a hybrid attention mechanism, while a multimodal decision classification module jointly exploits inter-modal correlations and unimodal discriminative features by integrating multiple loss functions within a weighted voting paradigm.
Link: https://arxiv.org/abs/2505.23365
Authors: Yang Qiao,Xiaoyu Zhong,Xiaofeng Gu,Zhiguo Yu
Affiliations: Jiangnan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal information processing has become increasingly important for enhancing image classification performance. However, the intricate and implicit dependencies across different modalities often hinder conventional methods from effectively capturing fine-grained semantic interactions, thereby limiting their applicability in high-precision classification tasks. To address this issue, we propose a novel Multimodal Collaborative Fusion Network (MCFNet) designed for fine-grained classification. The proposed MCFNet architecture incorporates a regularized integrated fusion module that improves intra-modal feature representation through modality-specific regularization strategies, while facilitating precise semantic alignment via a hybrid attention mechanism. Additionally, we introduce a multimodal decision classification module, which jointly exploits inter-modal correlations and unimodal discriminative features by integrating multiple loss functions within a weighted voting paradigm. Extensive experiments and ablation studies on benchmark datasets demonstrate that the proposed MCFNet framework achieves consistent improvements in classification accuracy, confirming its effectiveness in modeling subtle cross-modal semantics.
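A minimal sketch of decision fusion by weighted voting: each modality contributes a class distribution, and per-modality weights reweight the vote. In MCFNet the weights are tied to multiple loss functions; the simple normalized weighting below is a stand-in whose exact form the abstract does not give.

```python
import numpy as np

def weighted_vote(modality_probs, weights):
    """Fuse per-modality class distributions by a weighted vote.
    modality_probs: list of (n_classes,) probability vectors, one per modality.
    weights: per-modality weights (illustrative; normalized to sum to 1)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    fused = sum(wi * pi for wi, pi in zip(w, modality_probs))
    return int(np.argmax(fused)), fused

image_branch = np.array([0.6, 0.3, 0.1])  # image modality favors class 0
text_branch  = np.array([0.2, 0.7, 0.1])  # text modality favors class 1
label, fused = weighted_vote([image_branch, text_branch], weights=[2.0, 1.0])
print(label, fused)  # the more heavily weighted image branch wins
```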
[CV-71] VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
【Quick Read】: This paper addresses the problem that, in video understanding, the benefit of long chain-of-thought (CoT) reasoning for large language models (LLMs) has yet to be demonstrated, because existing benchmarks lack the reasoning depth needed to show the advantage of extended CoT chains. The key to the solution is VideoReasonBench, a vision-centric benchmark with high reasoning complexity: each video depicts a sequence of fine-grained operations on a latent state that is only partially visible, so models must precisely recall the operations in the video and perform step-by-step reasoning to answer aggregative questions correctly.
Link: https://arxiv.org/abs/2505.23359
Authors: Yuanxin Liu,Kun Ouyang,Haoning Wu,Yi Liu,Lin Sui,Xinhao Li,Yan Zhong,Y. Charles,Xinyu Zhou,Xu Sun
Affiliations: Moonshot AI; Peking University; Nanjing University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to demonstrate the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under such task setting, models have to precisely recall multiple operations in the video, and perform step-by-step reasoning to get correct final answers for these questions. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning, e.g., GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms others with 56.0% accuracy. Our investigations on “test-time scaling” further reveal that extended thinking budget, while offering none or minimal benefits on existing video benchmarks, is essential for improving the performance on VideoReasonBench.
zh
[CV-72] Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model
【速读】:该论文试图解决现有图像描述生成模型生成的描述过于泛化、缺乏特异性和上下文深度的问题(generative image captioning models often produce generic descriptions that lack specificity and contextual depth)。其解决方案的关键在于提出一种基于知识重放的新型图像描述框架KRCapVLM,通过引入束搜索解码策略以生成更多样化和连贯的描述,结合基于注意力机制的模块增强图像编码器的特征表示,并采用训练调度器提高训练稳定性和收敛性。这些方法显著提升了描述质量与知识识别的准确性。
链接: https://arxiv.org/abs/2505.23358
作者: Reem AlJunaid,Muzammil Behzad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating informative and knowledge-rich image captions remains a challenge for many existing captioning models, which often produce generic descriptions that lack specificity and contextual depth. To address this limitation, we propose KRCapVLM, a knowledge replay-based novel image captioning framework using vision-language model. We incorporate beam search decoding to generate more diverse and coherent captions. We also integrate attention-based modules into the image encoder to enhance feature representation. Finally, we employ training schedulers to improve stability and ensure smoother convergence during training. These proposals accelerate substantial gains in both caption quality and knowledge recognition. Our proposed model demonstrates clear improvements in both the accuracy of knowledge recognition and the overall quality of generated captions. It shows a stronger ability to generalize to previously unseen knowledge concepts, producing more informative and contextually relevant descriptions. These results indicate the effectiveness of our approach in enhancing the model’s capacity to generate meaningful, knowledge-grounded captions across a range of scenarios.
zh
[CV-73] Beyond Optimal Transport: Model-Aligned Coupling for Flow Matching
【速读】:该论文试图解决流匹配(Flow Matching, FM)中由于随机耦合导致路径交叉、模型学习非直线轨迹从而需要大量积分步骤才能生成高质量样本的问题。其解决方案的关键在于提出模型对齐耦合(Model-Aligned Coupling, MAC),该方法不仅基于几何距离构建耦合,还考虑模型预测误差所反映的模型偏好传输方向,从而提升模型对矢量场的学习效果。为避免高计算成本的匹配过程,MAC通过选择预测误差最低的top-k比例耦合进行训练,显著提升了生成质量和效率。
链接: https://arxiv.org/abs/2505.23346
作者: Yexiong Lin,Yu Yao,Tongliang Liu
机构: Sydney AI Centre (悉尼人工智能中心); The University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Flow Matching (FM) is an effective framework for training a model to learn a vector field that transports samples from a source distribution to a target distribution. To train the model, early FM methods use random couplings, which often result in crossing paths and lead the model to learn non-straight trajectories that require many integration steps to generate high-quality samples. To address this, recent methods adopt Optimal Transport (OT) to construct couplings by minimizing geometric distances, which helps reduce path crossings. However, we observe that such geometry-based couplings do not necessarily align with the model’s preferred trajectories, making it difficult to learn the vector field induced by these couplings, which prevents the model from learning straight trajectories. Motivated by this, we propose Model-Aligned Coupling (MAC), an effective method that matches training couplings based not only on geometric distance but also on alignment with the model’s preferred transport directions based on its prediction error. To avoid the time-costly matching process, MAC proposes to select the top-k fraction of couplings with the lowest error for training. Extensive experiments show that MAC significantly improves generation quality and efficiency in few-step settings compared to existing methods. Project page: this https URL
zh
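根据摘要描述,MAC 的核心步骤可以概括为:对每个候选耦合构造流匹配的中间点与目标速度,计算模型预测误差,只保留误差最低的 top-k 比例耦合用于训练。下面给出一个假设性的纯 Python 草图(非论文官方实现,`predict_velocity` 为占位的模型接口):

```python
def select_mac_couplings(x0, x1, predict_velocity, k_frac=0.5, t=0.5):
    # x0, x1: 等长的向量列表(候选耦合对);predict_velocity: 模型 v_theta(x_t, t) 的占位函数
    errors = []
    for a, b in zip(x0, x1):
        xt = [(1 - t) * ai + t * bi for ai, bi in zip(a, b)]  # 线性插值得到中间点 x_t
        target = [bi - ai for ai, bi in zip(a, b)]            # 流匹配的目标速度 x1 - x0
        pred = predict_velocity(xt, t)
        errors.append(sum((p - g) ** 2 for p, g in zip(pred, target)))
    k = max(1, int(k_frac * len(errors)))
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    return order[:k]  # 误差最低的 top-k 比例耦合的索引

# 用法示意:用一个恒零预测的"模型",误差退化为 ||x1 - x0||^2
x0 = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
x1 = [[3.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.5, 0.0]]
zero_model = lambda xt, t: [0.0, 0.0]
print(select_mac_couplings(x0, x1, zero_model, k_frac=0.5))  # [3, 1]
```

真实训练中误差应由当前模型的前向计算给出,并随训练动态变化,此处仅演示筛选逻辑。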
[CV-74] Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering
【速读】:该论文旨在解决扩散模型在生成样本时因采样轨迹中的随机性导致的样本质量不一致问题。传统方法如基于训练的微调(如DDPO)和推理阶段对齐技术虽然能够提升样本保真度,但通常需要完整的去噪过程和外部奖励信号,计算成本较高。本文的关键解决方案是发现无分类器指导(CFG)过程中样本质量与去噪轨迹特征之间的新关联,具体表现为样本分布的高密度区域与累积得分差异(ASD)之间的强相关性。基于此,作者提出CFG-Rejection策略,在去噪早期阶段过滤低质量样本,无需外部奖励信号或模型重训练,且与现有扩散框架完全兼容。
链接: https://arxiv.org/abs/2505.23343
作者: Sixian Wang,Zhiwei Tang,Tsung-Hui Chang
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); DAMO Academy, Alibaba Group (阿里达摩院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models often exhibit inconsistent sample quality due to stochastic variations inherent in their sampling trajectories. Although training-based fine-tuning (e.g. DDPO [1]) and inference-time alignment techniques[2] aim to improve sample fidelity, they typically necessitate full denoising processes and external reward signals. This incurs substantial computational costs, hindering their broader applicability. In this work, we unveil an intriguing phenomenon: a previously unobserved yet exploitable link between sample quality and characteristics of the denoising trajectory during classifier-free guidance (CFG). Specifically, we identify a strong correlation between high-density regions of the sample distribution and the Accumulated Score Differences (ASD)–the cumulative divergence between conditional and unconditional scores. Leveraging this insight, we introduce CFG-Rejection, an efficient, plug-and-play strategy that filters low-quality samples at an early stage of the denoising process, crucially without requiring external reward signals or model retraining. Importantly, our approach necessitates no modifications to model architectures or sampling schedules and maintains full compatibility with existing diffusion frameworks. We validate the effectiveness of CFG-Rejection in image generation through extensive experiments, demonstrating marked improvements on human preference scores (HPSv2, PickScore) and challenging benchmarks (GenEval, DPG-Bench). We anticipate that CFG-Rejection will offer significant advantages for diverse generative modalities beyond images, paving the way for more efficient and reliable high-quality sample generation.
zh
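摘要中的累积得分差异(ASD)与早期过滤思路可用如下假设性草图说明(非论文官方实现;实际中条件/无条件得分来自扩散模型在各去噪步的噪声预测,此处以向量序列代替):

```python
def accumulated_score_difference(cond_steps, uncond_steps):
    # ASD:逐去噪步累积条件得分与无条件得分之间的 L2 差异
    asd = 0.0
    for c, u in zip(cond_steps, uncond_steps):
        asd += sum((ci - ui) ** 2 for ci, ui in zip(c, u)) ** 0.5
    return asd

def cfg_rejection(batch, keep_frac=0.5):
    # batch: 每个样本为 (条件得分序列, 无条件得分序列);在早期去噪步之后按 ASD 过滤
    asds = [accumulated_score_difference(c, u) for c, u in batch]
    k = max(1, int(keep_frac * len(asds)))
    order = sorted(range(len(asds)), key=lambda i: asds[i], reverse=True)
    return sorted(order[:k])  # 保留 ASD 最高(对应高密度区域)的样本索引

high = ([[1.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])  # ASD = 2.0
low = ([[0.1, 0.0], [0.1, 0.0]], [[0.0, 0.0], [0.0, 0.0]])   # ASD = 0.2
print(cfg_rejection([low, high], keep_frac=0.5))  # [1]
```

该策略的关键在于只需 CFG 过程本身已有的两路得分,无需外部奖励模型,因此可即插即用地接入现有采样流程。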
[CV-75] DSAGL: Dual-Stream Attention-Guided Learning for Weakly Supervised Whole Slide Image Classification
【速读】:该论文旨在解决在癌症诊断中,由于全切片图像(Whole-slide images, WSIs)的超大尺寸和细粒度标注稀缺性,导致传统监督学习方法面临显著挑战的问题。其解决方案的关键在于提出一种名为DSAGL(Dual-Stream Attention-Guided Learning)的弱监督分类框架,该框架结合了教师-学生架构与双流设计,通过生成多尺度注意力伪标签来明确实例级模糊性和袋级语义一致性,并利用共享轻量编码器(VSSMamba)进行高效长程依赖建模,以及融合注意力模块(FASA)增强对诊断相关区域的关注,同时引入混合损失函数以确保两流间的相互一致性。
链接: https://arxiv.org/abs/2505.23341
作者: Daoxi Cao,Hangbei Cheng,Yijin Li,Ruolin Zhou,Xinyi Li,Xuehan Zhang,Binwei Li,Xuancheng Gu,Xueyu Liu,Yongfei Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Whole-slide images (WSIs) are critical for cancer diagnosis due to their ultra-high resolution and rich semantic content. However, their massive size and the limited availability of fine-grained annotations pose substantial challenges for conventional supervised learning. We propose DSAGL (Dual-Stream Attention-Guided Learning), a novel weakly supervised classification framework that combines a teacher-student architecture with a dual-stream design. DSAGL explicitly addresses instance-level ambiguity and bag-level semantic consistency by generating multi-scale attention-based pseudo labels and guiding instance-level learning. A shared lightweight encoder (VSSMamba) enables efficient long-range dependency modeling, while a fusion-attentive module (FASA) enhances focus on sparse but diagnostically relevant regions. We further introduce a hybrid loss to enforce mutual consistency between the two streams. Experiments on CIFAR-10, NCT-CRC, and TCGA-Lung datasets demonstrate that DSAGL consistently outperforms state-of-the-art MIL baselines, achieving superior discriminative performance and robustness under weak supervision.
zh
[CV-76] Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization
【速读】:该论文试图解决如何通过强化学习(Reinforcement Learning, RL)微调预训练生成模型,以更精确地对齐人类偏好和复杂奖励信号的问题。其解决方案的关键在于应用群体相对策略优化(Group Relative Policy Optimization, GRPO),该方法能够有效利用来自审美预测器和CLIP嵌入的复杂奖励信号,从而显著提升图像质量并实现对生成风格的精准控制。此外,通过CLIP的引入,该方法使视觉自回归(VAR)模型能够超越初始ImageNet分布,生成与预训练阶段未出现的图像风格相关的图像,展现出良好的泛化能力。
链接: https://arxiv.org/abs/2505.23331
作者: Matteo Gallici,Haitz Sáez de Ocáriz Borde
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Fine-tuning pre-trained generative models with Reinforcement Learning (RL) has emerged as an effective approach for aligning outputs more closely with nuanced human preferences. In this paper, we investigate the application of Group Relative Policy Optimization (GRPO) to fine-tune next-scale visual autoregressive (VAR) models. Our empirical results demonstrate that this approach enables alignment to intricate reward signals derived from aesthetic predictors and CLIP embeddings, significantly enhancing image quality and enabling precise control over the generation style. Interestingly, by leveraging CLIP, our method can help VAR models generalize beyond their initial ImageNet distribution: through RL-driven exploration, these models can generate images aligned with prompts referencing image styles that were absent during pre-training. In summary, we show that RL-based fine-tuning is both efficient and effective for VAR models, benefiting particularly from their fast inference speeds, which are advantageous for online sampling, an aspect that poses significant challenges for diffusion-based alternatives.
zh
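GRPO 的核心是组内相对优势:对同一提示词采样一组输出,用组内奖励的均值与标准差对每个样本的奖励做归一化,无需单独的价值网络。以下为示意性实现(奖励数值为虚构,实际可来自美学预测器或 CLIP 相似度):

```python
def grpo_advantages(rewards, eps=1e-8):
    # 组内相对优势:A_i = (r_i - 组均值) / (组标准差 + eps)
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 同一提示词下采样一组图像的奖励(示意数值)
adv = grpo_advantages([0.2, 0.5, 0.8])
print(adv[1])  # 恰为组均值的样本,优势为 0.0
```

优势随后用于加权策略梯度更新;VAR 模型推理速度快,使得这种在线采样一组候选再归一化的做法成本较低,这正是摘要强调的优势。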
[CV-77] Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis
【速读】:该论文试图解决如何将高维视频生成模型有效地应用于低维任务(如可控图像生成)的问题,特别是如何克服连续视频帧与离散图像生成之间的挑战。其解决方案的关键在于提出一种称为DRA-Ctrl的视频到图像的知识压缩与任务适配范式,该方法利用视频模型的长程上下文建模能力和全注意力机制,并通过基于mixup的过渡策略实现平滑适配,同时引入定制化的掩码机制以提升文本提示与图像级控制的一致性。
链接: https://arxiv.org/abs/2505.23325
作者: Hengyuan Cao,Yutong Feng,Biao Gong,Yijing Tian,Yunhong Lu,Chuang Liu,Bin Wang
机构: Zhejiang University (浙江大学); Kunbyte AI (Kunbyte AI); Ant Group (蚂蚁集团); Hangzhou Normal University (杭州师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video generative models can be regarded as world simulators due to their ability to capture dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a fully trained video generative model in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed Dimension-Reduction Attack (DRA-Ctrl), which utilizes the strengths of video models, including long-range context modeling and flatten full-attention, to perform various generation tasks. Specifically, to address the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. DRA-Ctrl provides new insights into reusing resource-intensive video models and lays the foundation for future unified generative models across visual modalities. The project page is this https URL.
zh
[CV-78] CF-DETR: Coarse-to-Fine Transformer for Real-Time Object Detection
【速读】:该论文旨在解决在自动驾驶感知系统中,同时执行多个检测变压器(DETR)任务时面临的严格实时性要求(R1)和高精度需求(R2)之间的矛盾,尤其是在资源受限条件下如何平衡延迟与精度的问题。其解决方案的关键在于提出CF-DETR系统,该系统结合了新颖的粗粒度到细粒度Transformer架构与专用的实时调度框架NPFP**,通过三种关键策略(A1: 粗粒度到细粒度推理,A2: 选择性细粒度推理,A3: 多层级批量推理)动态调整补丁粒度和注意力范围,以适应目标的重要性,从而满足高精度需求;同时,NPFP**调度框架(A4)协调这些机制,将每个DETR任务划分为保证关键目标检测的粗粒度子任务和可选的细粒度子任务,以兼顾实时性和整体准确性。
链接: https://arxiv.org/abs/2505.23317
作者: Woojin Shin,Donghwa Kang,Byeongyun Park,Brent Byunghoon Kang,Jinkyu Lee,Hyeongboo Baek
机构: University of Seoul(首尔大学); KAIST(韩国科学技术院); Sungkyunkwan University(成均馆大学)
类目: Systems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages
Abstract:Detection Transformers (DETR) are increasingly adopted in autonomous vehicle (AV) perception systems due to their superior accuracy over convolutional networks. However, concurrently executing multiple DETR tasks presents significant challenges in meeting firm real-time deadlines (R1) and high accuracy requirements (R2), particularly for safety-critical objects, while navigating the inherent latency-accuracy trade-off under resource constraints. Existing real-time DNN scheduling approaches often treat models generically, failing to leverage Transformer-specific properties for efficient resource allocation. To address these challenges, we propose CF-DETR, an integrated system featuring a novel coarse-to-fine Transformer architecture and a dedicated real-time scheduling framework NPFP**. CF-DETR employs three key strategies (A1: coarse-to-fine inference, A2: selective fine inference, A3: multi-level batch inference) that exploit Transformer properties to dynamically adjust patch granularity and attention scope based on object criticality, aiming to satisfy R2. The NPFP** scheduling framework (A4) orchestrates these adaptive mechanisms A1-A3. It partitions each DETR task into a safety-critical coarse subtask for guaranteed critical object detection within its deadline (ensuring R1), and an optional fine subtask for enhanced overall accuracy (R2), while managing individual and batched execution. Our extensive evaluations on server, GPU-enabled embedded platforms, and actual AV platforms demonstrate that CF-DETR, under an NPFP** policy, successfully meets strict timing guarantees for critical operations and achieves significantly higher overall and critical object detection accuracy compared to existing baselines across diverse AV workloads.
zh
[CV-79] Adversarial Semantic and Label Perturbation Attack for Pedestrian Attribute Recognition
【速读】:该论文旨在解决行人属性识别(Pedestrian Attribute Recognition, PAR)在对抗攻击下的脆弱性和抗干扰能力不足的问题。其解决方案的关键在于提出首个针对PAR的对抗攻击与防御框架,通过结合全局和块级攻击,在预训练的基于CLIP的PAR框架上生成对抗噪声,并设计语义偏移防御策略以抑制对抗攻击的影响。
链接: https://arxiv.org/abs/2505.23313
作者: Weizhe Kong,Xiao Wang,Ruichong Gao,Chenglong Li,Yu Zhang,Xing Yang,Yaowei Wang,Jin Tang
机构: Anhui University (安徽大学); National University of Defense Technology (国防科技大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学,深圳); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Pedestrian Attribute Recognition (PAR) is an indispensable task in human-centered research and has made great progress in recent years with the development of deep neural networks. However, the potential vulnerability and anti-interference ability have still not been fully explored. To bridge this gap, this paper proposes the first adversarial attack and defense framework for pedestrian attribute recognition. Specifically, we exploit both global- and patch-level attacks on the pedestrian images, based on the pre-trained CLIP-based PAR framework. It first divides the input pedestrian image into non-overlapping patches and embeds them into feature embeddings using a projection layer. Meanwhile, the attribute set is expanded into sentences using prompts and embedded into attribute features using a pre-trained CLIP text encoder. A multi-modal Transformer is adopted to fuse the obtained vision and text tokens, and a feed-forward network is utilized for attribute recognition. Based on the aforementioned PAR framework, we adopt the adversarial semantic and label-perturbation to generate the adversarial noise, termed ASL-PAR. We also design a semantic offset defense strategy to suppress the influence of adversarial attacks. Extensive experiments conducted on both digital domains (i.e., PETA, PA100K, MSP60K, RAPv2) and physical domains fully validated the effectiveness of our proposed adversarial attack and defense strategies for the pedestrian attribute recognition. The source code of this paper will be released on this https URL.
zh
[CV-80] TRACE: Trajectory-Constrained Concept Erasure in Diffusion Models
【速读】:该论文旨在解决生成式模型(Generative AI)在文本到图像扩散模型中生成不良概念(如色情内容、敏感身份、受版权保护的风格等)的问题,这些问题对隐私、公平性和安全性构成了严重威胁。其解决方案的关键在于提出一种名为TRACE(Trajectory-Constrained Attentional Concept Erasure)的新方法,该方法通过结合严格的理论框架与有效的微调过程,在保持整体生成质量的同时擦除目标概念。具体而言,TRACE首先推导出一种闭合形式的更新方式以移除跨注意力层中的目标概念隐表示,随后引入一种轨迹感知的微调目标,在去噪过程的后期阶段引导模型远离该概念,从而在不影响无关内容生成质量的前提下实现概念擦除。
链接: https://arxiv.org/abs/2505.23312
作者: Finn Carter
机构: Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: In peer review
Abstract:Text-to-image diffusion models have shown unprecedented generative capability, but their ability to produce undesirable concepts (e.g. pornographic content, sensitive identities, copyrighted styles) poses serious concerns for privacy, fairness, and safety. Concept erasure aims to remove or suppress specific concept information in a generative model. In this paper, we introduce TRACE (Trajectory-Constrained Attentional Concept Erasure), a novel method to erase targeted concepts from diffusion models while preserving overall generative quality. Our approach combines a rigorous theoretical framework, establishing formal conditions under which a concept can be provably suppressed in the diffusion process, with an effective fine-tuning procedure compatible with both conventional latent diffusion (Stable Diffusion) and emerging rectified flow models (e.g. FLUX). We first derive a closed-form update to the model’s cross-attention layers that removes hidden representations of the target concept. We then introduce a trajectory-aware finetuning objective that steers the denoising process away from the concept only in the late sampling stages, thus maintaining the model’s fidelity on unrelated content. Empirically, we evaluate TRACE on multiple benchmarks used in prior concept erasure studies (object classes, celebrity faces, artistic styles, and explicit content from the I2P dataset). TRACE achieves state-of-the-art performance, outperforming recent methods such as ANT, EraseAnything, and MACE in terms of removal efficacy and output quality.
zh
[CV-81] Quality assessment of 3D human animation: Subjective and objective evaluation
【速读】:该论文试图解决非基于参数化身体模型生成的虚拟人类动画的质量评估问题(quality assessment for animated virtual humans),这一领域尚缺乏有效的评估指标。解决方案的关键在于提出一种基于数据驱动框架的质量评估方法,通过构建包含虚拟人类动画及其对应主观真实感评分的数据集,并利用该数据集训练线性回归器来预测感知评价分数,实验结果表明该方法在相关性上优于现有的深度学习基线方法。
链接: https://arxiv.org/abs/2505.23301
作者: Rim Rekik,Stefanie Wuhrer,Ludovic Hoyet,Katja Zibrek,Anne-Hélène Olivier
机构: Inria centre at the University Grenoble Alpes, France(Inria中心在格勒诺布尔阿尔卑斯大学,法国); Inria, Univ Rennes, CNRS, IRISA, France(Inria,雷恩大学,CNRS,IRISA,法国)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Virtual human animations have a wide range of applications in virtual and augmented reality. While automatic generation methods of animated virtual humans have been developed, assessing their quality remains challenging. Recently, approaches introducing task-oriented evaluation metrics have been proposed, leveraging neural network training. However, quality assessment measures for animated virtual humans that are not generated with parametric body models have yet to be developed. In this context, we introduce a first such quality assessment measure leveraging a novel data-driven framework. First, we generate a dataset of virtual human animations together with their corresponding subjective realism evaluation scores collected with a user study. Second, we use the resulting dataset to learn predicting perceptual evaluation scores. Results indicate that training a linear regressor on our dataset results in a correlation of 90%, which outperforms a state of the art deep learning baseline.
zh
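该工作的评估流程核心是"动画特征 → 线性回归 → 与主观评分的相关性"。下面以一维特征为例,给出最小二乘拟合与皮尔逊相关系数的纯 Python 示意(假设性代码,特征与评分数值均为虚构;论文实际使用多维动画特征):

```python
def fit_linear(xs, ys):
    # 一元最小二乘拟合 y ≈ w * x + b
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return w, my - w * mx

def pearson(a, b):
    # 皮尔逊相关系数,用于衡量预测分数与主观评分的一致性
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# 示意数据:动画特征(一维)与用户研究收集的主观真实感评分
feat = [0.1, 0.4, 0.5, 0.9]
score = [1.2, 2.1, 2.4, 3.8]
w, b = fit_linear(feat, score)
pred = [w * x + b for x in feat]
print(round(pearson(pred, score), 3))
```

摘要报告的 90% 相关性即此类预测分数与主观评分之间的相关程度。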
[CV-82] Federated Unsupervised Semantic Segmentation
【速读】:该论文试图解决在无监督语义图像分割(Unsupervised Semantic Image Segmentation, USS)中,如何在联邦学习(Federated Learning, FL)框架下实现跨分布式客户端的特征表示与聚类中心对齐的问题。由于数据分布的异质性且缺乏监督信号,这一任务具有挑战性。解决方案的关键在于提出FUSS(Federated Unsupervised image Semantic Segmentation),该方法引入了新颖的联邦策略,以促进特征空间和原型空间中的全局一致性,并联合优化本地分割头与共享语义中心点,从而实现完全去中心化的、无需标签的语义分割训练。
链接: https://arxiv.org/abs/2505.23292
作者: Evangelos Charalampakis,Vasileios Mygdalis,Ioannis Pitas
机构: Aristotle University of Thessaloniki (亚里士多德大学塞萨洛尼基分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:This work explores the application of Federated Learning (FL) in Unsupervised Semantic image Segmentation (USS). Recent USS methods extract pixel-level features using frozen visual foundation models and refine them through self-supervised objectives that encourage semantic grouping. These features are then grouped to semantic clusters to produce segmentation masks. Extending these ideas to federated settings requires feature representation and cluster centroid alignment across distributed clients – an inherently difficult task under heterogeneous data distributions in the absence of supervision. To address this, we propose FUSS (Federated Unsupervised image Semantic Segmentation), which is, to our knowledge, the first framework to enable fully decentralized, label-free semantic segmentation training. FUSS introduces novel federation strategies that promote global consistency in feature and prototype space, jointly optimizing local segmentation heads and shared semantic centroids. Experiments on both benchmark and real-world datasets, including binary and multi-class segmentation tasks, show that FUSS consistently outperforms local-only client trainings as well as extensions of classical FL algorithms under varying client data distributions. To support reproducibility, full code will be released upon manuscript acceptance.
zh
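FUSS 强调在原型空间中维持全局一致性。其最基础的一步可以理解为对各客户端的语义中心点做加权聚合,示意如下(假设性代码,非官方实现;此处假定各客户端的中心点索引已对齐,而真实的无监督联邦场景需要显式解决对齐问题,这正是论文的难点之一):

```python
def aggregate_prototypes(client_protos, weights=None):
    # 对各客户端的 K 个语义中心点做加权平均,得到全局中心点
    n = len(client_protos)
    weights = weights or [1.0 / n] * n
    num_protos = len(client_protos[0])
    dim = len(client_protos[0][0])
    global_protos = [[0.0] * dim for _ in range(num_protos)]
    for w, protos in zip(weights, client_protos):
        for k in range(num_protos):
            for d in range(dim):
                global_protos[k][d] += w * protos[k][d]
    return global_protos

client_a = [[0.0, 0.0], [2.0, 2.0]]  # 客户端 A 的两个类别中心(示意)
client_b = [[2.0, 2.0], [4.0, 4.0]]  # 客户端 B 的两个类别中心(示意)
print(aggregate_prototypes([client_a, client_b]))  # [[1.0, 1.0], [3.0, 3.0]]
```

聚合后的全局中心点下发回客户端,用于约束本地分割头的训练,从而在无标签条件下维持跨客户端的语义一致性。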
[CV-83] Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation CVPR2025
【速读】:该论文旨在解决3D语音驱动面部动画生成中,由于语言中存在发音相似但唇形不同的音节(near-homophone syllables)导致的自监督音频特征空间中的显著耦合问题,进而引发后续唇部运动生成中的平均效应。解决方案的关键在于提出了一种即插即用的语义去相关模块——Wav2Sem,该模块通过提取整个音频序列的语义特征,并利用额外的语义信息在特征空间内对音频编码进行去相关处理,从而获得更具表现力的音频特征。
链接: https://arxiv.org/abs/2505.23290
作者: Hao Li,Ju Dai,Xin Zhao,Feng Zhou,Junjun Pan,Lei Li
机构: Beihang University (北京航空航天大学); Peng Cheng Laboratory (鹏城实验室); North China University of Technology (华北理工大学); University of Washington (华盛顿大学); University of Copenhagen (哥本哈根大学)
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025
Abstract:In 3D speech-driven facial animation generation, existing methods commonly employ pre-trained self-supervised audio models as encoders. However, due to the prevalence of phonetically similar syllables with distinct lip shapes in language, these near-homophone syllables tend to exhibit significant coupling in self-supervised audio feature spaces, leading to the averaging effect in subsequent lip motion generation. To address this issue, this paper proposes a plug-and-play semantic decorrelation module-Wav2Sem. This module extracts semantic features corresponding to the entire audio sequence, leveraging the added semantic information to decorrelate audio encodings within the feature space, thereby achieving more expressive audio features. Extensive experiments across multiple Speech-driven models indicate that the Wav2Sem module effectively decouples audio features, significantly alleviating the averaging effect of phonetically similar syllables in lip shape generation, thereby enhancing the precision and naturalness of facial animations. Our source code is available at this https URL.
zh
[CV-84] GenCAD-Self-Repairing: Feasibility Enhancement for 3D CAD Generation
【速读】:该论文旨在解决生成式CAD模型在可行性方面的不足,特别是针对GenCAD模型生成的边界表示(B-reps)中约10%的设计不可行的问题。其解决方案的关键在于提出GenCAD-Self-Repairing框架,该框架通过扩散引导和自修复流程提升生成CAD模型的可行性,具体包括在潜在空间中集成引导扩散去噪过程以及基于回归的修正机制,以在保持几何精度的同时优化不可行的CAD命令序列。
链接: https://arxiv.org/abs/2505.23287
作者: Chikaha Tsuji,Enrique Flores Medina,Harshit Gupta,Md Ferdous Alam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the advancement of generative AI, research on its application to 3D model generation has gained traction, particularly in automating the creation of Computer-Aided Design (CAD) files from images. GenCAD is a notable model in this domain, leveraging an autoregressive transformer-based architecture with a contrastive learning framework to generate CAD programs. However, a major limitation of GenCAD is its inability to consistently produce feasible boundary representations (B-reps), with approximately 10% of generated designs being infeasible. To address this, we propose GenCAD-Self-Repairing, a framework that enhances the feasibility of generative CAD models through diffusion guidance and a self-repairing pipeline. This framework integrates a guided diffusion denoising process in the latent space and a regression-based correction mechanism to refine infeasible CAD command sequences while preserving geometric accuracy. Our approach successfully converted two-thirds of infeasible designs in the baseline method into feasible ones, significantly improving the feasibility rate while simultaneously maintaining a reasonable level of geometric accuracy between the point clouds of ground truth models and generated models. By significantly improving the feasibility rate of generating CAD models, our approach helps expand the availability of high-quality training data and enhances the applicability of AI-driven CAD generation in manufacturing, architecture, and product design.
zh
[CV-85] RSFAKE-1M: A Large-Scale Dataset for Detecting Diffusion-Generated Remote Sensing Forgeries
【速读】:该论文试图解决生成式 AI (Generative AI) 生成的遥感图像伪造检测问题,这一领域在环境监测、城市规划和国家安全中具有重要应用,但当前研究仍存在显著不足。现有基准主要针对基于生成对抗网络 (GAN) 的伪造图像或自然图像,未能有效覆盖扩散模型生成的遥感图像。解决方案的关键在于构建 RSFAKE-1M 数据集,该数据集包含50万张伪造和50万张真实的遥感图像,伪造图像由在遥感数据上微调的十种扩散模型生成,涵盖多种生成条件,如文本提示、结构引导和修复等。通过该数据集,研究者能够评估现有检测方法并推动下一代伪造检测技术的发展。
链接: https://arxiv.org/abs/2505.23283
作者: Zhihong Tan,Jiayi Wang,Huiying Shi,Binyuan Huang,Hongchen Wei,Zhenzhong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Detecting forged remote sensing images is becoming increasingly critical, as such imagery plays a vital role in environmental monitoring, urban planning, and national security. While diffusion models have emerged as the dominant paradigm for image generation, their impact on remote sensing forgery detection remains underexplored. Existing benchmarks primarily target GAN-based forgeries or focus on natural images, limiting progress in this critical domain. To address this gap, we introduce RSFAKE-1M, a large-scale dataset of 500K forged and 500K real remote sensing images. The fake images are generated by ten diffusion models fine-tuned on remote sensing data, covering six generation conditions such as text prompts, structural guidance, and inpainting. This paper presents the construction of RSFAKE-1M along with a comprehensive experimental evaluation using both existing detectors and unified baselines. The results reveal that diffusion-based remote sensing forgeries remain challenging for current methods, and that models trained on RSFAKE-1M exhibit notably improved generalization and robustness. Our findings underscore the importance of RSFAKE-1M as a foundation for developing and evaluating next-generation forgery detection approaches in the remote sensing domain. The dataset and other supplementary materials are available at this https URL.
zh
[CV-86] Holistic Large-Scale Scene Reconstruction via Mixed Gaussian Splatting
【速读】:该论文旨在解决大规模三维场景重建中因采用分而治之(divide-and-conquer)范式导致的全局场景信息丢失以及参数调优复杂的问题。其解决方案的关键在于提出一种全新的整体优化框架MixGS,该框架通过将相机位姿和高斯属性整合到一种视图感知表示中,从而对整个场景进行整体建模,并通过一种新颖的混合操作结合解码后的高斯分布与原始高斯分布,以同时保持全局一致性与局部细节的真实性。
链接: https://arxiv.org/abs/2505.23280
作者: Chuandong Liu,Huijiao Wang,Lei Yu,Gui-Song Xia
机构: Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in 3D Gaussian Splatting have shown remarkable potential for novel view synthesis. However, most existing large-scale scene reconstruction methods rely on the divide-and-conquer paradigm, which often leads to the loss of global scene information and requires complex parameter tuning due to scene partitioning and local optimization. To address these limitations, we propose MixGS, a novel holistic optimization framework for large-scale 3D scene reconstruction. MixGS models the entire scene holistically by integrating camera pose and Gaussian attributes into a view-aware representation, which is decoded into fine-detailed Gaussians. Furthermore, a novel mixing operation combines decoded and original Gaussians to jointly preserve global coherence and local fidelity. Extensive experiments on large-scale scenes demonstrate that MixGS achieves state-of-the-art rendering quality and competitive speed, while significantly reducing computational requirements, enabling large-scale scene reconstruction training on a single 24GB VRAM GPU. The code will be released at this https URL.
zh
[CV-87] Are MLMs Trapped in the Visual Room?
【速读】:该论文试图解决多模态大模型(Multi-Modal Large Models, MLMs)是否能够真正“理解”视觉输入的问题,而非仅仅“感知”或“描述”视觉内容。其解决方案的关键在于提出“视觉房间”(Visual Room)论证,即系统可能通过遵循算法规则处理并描述视觉输入的每一个细节,但并未真正理解其内在意图。为验证这一观点,作者构建了一个包含感知和认知两个层级的评估框架,并引入了一个高质量的多模态讽刺数据集,以评估模型在表面细节捕捉和讽刺极性推理方面的能力。实验结果揭示了模型在感知层面表现良好,但在理解层面存在显著差距,主要归因于情感推理、常识推理和语境对齐方面的不足。
链接: https://arxiv.org/abs/2505.23272
作者: Yazhou Zhang,Chunwang Zou,Qimeng Liu,Lu Rong,Ben Yao,Zheng Lian,Qiuchi Li,Peng Zhang,Jing Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Can multi-modal large models (MLMs) that can "see" an image be said to "understand" it? Drawing inspiration from Searle’s Chinese Room, we propose the Visual Room argument: a system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This dilemma challenges the prevailing assumption that perceptual mastery implies genuine understanding. In implementation, we introduce a two-tier evaluation framework spanning perception and cognition. The perception component evaluates whether MLMs can accurately capture the surface-level details of visual contents, while the cognitive component examines their ability to infer sarcasm polarity. To support this framework, we further introduce a high-quality multi-modal sarcasm dataset comprising both 924 static images and 100 dynamic videos. All sarcasm labels are annotated by the original authors and verified by independent reviewers to ensure clarity and consistency. We evaluate eight state-of-the-art (SoTA) MLMs. Our results highlight three key findings: (1) MLMs perform well on perception tasks; (2) even with correct perception, models exhibit an average error rate of ~16.1% in sarcasm understanding, revealing a significant gap between seeing and understanding; (3) error analysis attributes this gap to deficiencies in emotional reasoning, commonsense inference, and context alignment. This work provides empirical grounding for the proposed Visual Room argument and offers a new evaluation paradigm for MLMs.
zh
[CV-88] LADA: Scalable Label-Specific CLIP Adapter for Continual Learning ICML2025
【速读】:该论文旨在解决基于视觉-语言模型(如CLIP)的持续学习中参数选择错误导致性能下降的问题。现有方法通过为每个任务添加可学习参数集来适应预训练图像编码器,但在推理时需要选择合适的参数,容易出错。其解决方案的关键是引入LADA(Label-specific ADApter),它在冻结的CLIP图像编码器后附加轻量级、标签特定的记忆单元,通过聚合任务无关知识生成判别特征,并利用特征蒸馏防止灾难性遗忘,从而实现高效的持续学习。
链接: https://arxiv.org/abs/2505.23271
作者: Mao-Lin Luo,Zi-Hao Zhou,Tong Wei,Min-Ling Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICML 2025
Abstract:Continual learning with vision-language models like CLIP offers a pathway toward scalable machine learning systems by leveraging its transferable representations. Existing CLIP-based methods adapt the pre-trained image encoder by adding multiple sets of learnable parameters, with each task using a partial set of parameters. This requires selecting the expected parameters for input images during inference, which is prone to error that degrades performance. To address this problem, we introduce LADA (Label-specific ADApter). Instead of partitioning parameters across tasks, LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation by aggregating task-agnostic knowledge. To prevent catastrophic forgetting, LADA employs feature distillation for seen classes, preventing their features from being interfered with by new classes. Positioned after the image encoder, LADA prevents gradient flow to the frozen CLIP parameters, ensuring efficient training. Extensive results show that LADA achieves state-of-the-art performance in continual learning settings. The implementation code is available at this https URL.
zh
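LADA 防止灾难性遗忘的"对已见类做特征蒸馏"思想,可以用下面的极简示意表达(纯属示意性的假设实现,函数名与张量形状均为本文虚构,并非论文官方代码):

```python
import numpy as np

def feature_distillation_loss(feats_new, feats_old, seen_mask):
    """对已见类别的记忆单元特征做蒸馏:惩罚新特征偏离旧模型特征,
    而新类别对应的位置不受约束,可自由学习。
    feats_new/feats_old: (num_classes, dim);seen_mask: (num_classes,) 布尔数组。"""
    diff = feats_new[seen_mask] - feats_old[seen_mask]
    return float(np.mean(np.sum(diff ** 2, axis=1)))

old = np.ones((5, 4))
new = np.ones((5, 4))
new[3:] += 0.5  # 仅新类别(未见类)的特征发生变化
seen = np.array([True, True, True, False, False])
loss = feature_distillation_loss(new, old, seen)  # 已见类特征未变,蒸馏损失为 0
```

直观上,该损失只"冻结"旧类的判别特征,对新类零约束,这正是摘要中"防止旧类特征被新类干扰"的最简数学表达。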
[CV-89] Unsupervised Transcript-assisted Video Summarization and Highlight Detection
【速读】:该论文试图解决视频摘要生成和关键片段检测中因仅依赖视觉内容而导致的信息不充分问题,旨在通过整合视频帧与对应字幕(transcript)两种模态信息来提升效果。其解决方案的关键在于提出了一种多模态流水线,利用视频帧和字幕进行模态融合,通过强化学习(Reinforcement Learning, RL)框架训练模型,以生成多样化且具有代表性的摘要,并确保包含具有意义字幕内容的视频片段。
链接: https://arxiv.org/abs/2505.23268
作者: Spyros Barbakos,Charalampos Antoniadis,Gerasimos Potamianos,Gianluca Setti
机构: University of Thessaly (希腊塞萨利大学); King Abdullah University of Science and Technology (阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Video consumption is a key part of daily life, but watching entire videos can be tedious. To address this, researchers have explored video summarization and highlight detection to identify key video segments. While some works combine video frames and transcripts, and others tackle video summarization and highlight detection using Reinforcement Learning (RL), no existing work, to the best of our knowledge, integrates both modalities within an RL framework. In this paper, we propose a multimodal pipeline that leverages video frames and their corresponding transcripts to generate a more condensed version of the video and detect highlights using a modality fusion mechanism. The pipeline is trained within an RL framework, which rewards the model for generating diverse and representative summaries while ensuring the inclusion of video segments with meaningful transcript content. The unsupervised nature of the training allows for learning from large-scale unannotated datasets, overcoming the challenge posed by the limited size of existing annotated datasets. Our experiments show that using the transcript in video summarization and highlight detection achieves superior results compared to relying solely on the visual content of the video.
zh
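该文的强化学习奖励同时鼓励"多样、有代表性、且覆盖有意义字幕"的摘要,其思路可粗略示意如下(奖励项的具体形式与权重均为本文假设,仅作说明,并非论文实现):

```python
import numpy as np

def summary_reward(features, selected, transcript_scores, w=0.5):
    """多样性 + 代表性 + 字幕信息量 的简化奖励。
    features: (T, d) 帧特征;selected: 选中帧下标;transcript_scores: (T,) 每帧字幕信息量。"""
    S = features[selected]
    # 多样性:选中帧两两余弦相似度越低越好
    S_norm = S / np.linalg.norm(S, axis=1, keepdims=True)
    sim = S_norm @ S_norm.T
    n = len(selected)
    diversity = 1.0 - (sim.sum() - n) / max(n * (n - 1), 1)
    # 代表性:所有帧到最近选中帧的平均距离越小越好
    d = np.linalg.norm(features[:, None] - S[None], axis=2).min(axis=1)
    representativeness = float(np.exp(-d.mean()))
    # 字幕项:选中片段的平均字幕信息量
    transcript = float(transcript_scores[selected].mean())
    return w * (diversity + representativeness) / 2 + (1 - w) * transcript
```

选择互相正交的帧应比重复选同一帧得到更高奖励,这与"多样且有代表性"的训练目标一致。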
[CV-90] Disrupting Vision-Language Model-Driven Navigation Services via Adversarial Object Fusion
【速读】:该论文旨在解决服务导向环境中基于视觉-语言导航(VLN)代理的安全部署问题,特别是针对生成式AI(Generative AI)驱动的感知模块在对抗性环境下的脆弱性。现有对抗性攻击方法未能充分考虑服务计算场景中对可靠性和服务质量(QoS)的高要求。论文提出的解决方案是Adversarial Object Fusion (AdvOF),其关键在于通过精确聚合和对齐受害者物体在2D与3D空间中的位置,定义并渲染对抗性物体,并在物理属性与视觉-语言模型(VLM)感知之间进行正则化协同优化,从而实现对抗性环境对导航代理性能的有效干扰,同时保持对正常导航任务的最小影响。
链接: https://arxiv.org/abs/2505.23266
作者: Chunlong Xie,Jialing He,Shangwei Guo,Jiacheng Wang,Shudong Zhang,Tianwei Zhang,Tao Xiang
机构: Chongqing University (重庆大学); Nanyang Technological University (南洋理工大学); Xidian University (西安电子科技大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:We present Adversarial Object Fusion (AdvOF), a novel attack framework targeting vision-and-language navigation (VLN) agents in service-oriented environments by generating adversarial 3D objects. While foundational models like Large Language Models (LLMs) and Vision Language Models (VLMs) have enhanced service-oriented navigation systems through improved perception and decision-making, their integration introduces vulnerabilities in mission-critical service workflows. Existing adversarial attacks fail to address service computing contexts, where reliability and quality-of-service (QoS) are paramount. We utilize AdvOF to investigate and explore the impact of adversarial environments on the VLM-based perception module of VLN agents. In particular, AdvOF first precisely aggregates and aligns the victim object positions in both 2D and 3D space, defining and rendering adversarial objects. Then, we collaboratively optimize the adversarial object with regularization between the adversarial and victim object across physical properties and VLM perceptions. Through assigning importance weights to varying views, the optimization is processed stably and multi-viewedly by iterative fusions from local updates and justifications. Our extensive evaluations demonstrate AdvOF can effectively degrade agent performance under adversarial conditions while maintaining minimal interference with normal navigation tasks. This work advances the understanding of service security in VLM-powered navigation systems, providing computational foundations for robust service composition in physical-world deployments.
zh
[CV-91] Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs
【速读】:该论文旨在解决多模态大语言模型(MLLMs)在图像审美推理能力方面的不足,尤其是在医学图像筛选任务中表现不佳的问题。其关键解决方案在于构建了一个包含1500+样本的医学图像筛选数据集,并采用长链思维(CoT)与动态比例精度奖励的群体相对策略优化方法(DPA-GRPO),以提升MLLMs的图像审美推理能力。
链接: https://arxiv.org/abs/2505.23265
作者: Zheng Sun,Yi Wei,Long Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have broad applications across many domains, such as multimodal understanding and generation. With the development of diffusion models (DM) and unified MLLMs, the performance of image generation has been significantly improved; however, the study of image screening is rare, and its performance with MLLMs is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability of MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive medical image screening dataset with 1500+ samples, where each sample consists of a medical image, four generated images, and a multiple-choice answer. The dataset evaluates the aesthetic reasoning ability under four aspects: (1) Appearance Deformation, (2) Principles of Physical Lighting and Shadow, (3) Placement Layout, (4) Extension Rationality. For methodology, we utilize long chains of thought (CoT) and Group Relative Policy Optimization with a Dynamic Proportional Accuracy reward, called DPA-GRPO, to enhance the image aesthetic reasoning ability of MLLMs. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT-4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the reinforcement learning approach, we are able to surpass the score of both large-scale models and leading closed-source models using a much smaller model. We hope our attempt on medical image screening will serve as a regular configuration in image aesthetic reasoning in the future.
zh
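GRPO 类方法的核心是"组内相对优势":同一提示下采样一组回答,用组内均值和标准差归一化各自的奖励,而无需价值网络。下面是这一步的最小示意(不含 DPA 动态比例奖励本身,仅演示组归一化,属本文的假设性写法):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO 风格的组内相对优势:对同一提示的一组回答,
    用组内均值/标准差对奖励做归一化,作为策略梯度的优势估计。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

归一化后优势之和约为 0:奖励高于组均值的回答被强化,低于均值的被抑制。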
[CV-92] UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
【速读】:该论文旨在解决3D纹理生成中因依赖UV映射导致的拓扑模糊性问题,以及现有方法在多视角纹理一致性与质量上的不足。其解决方案的关键在于提出UniTEX框架,该框架通过将纹理生成提升至统一的3D功能空间,利用Texture Functions (TFs) 实现基于表面邻近度的连续、体素化纹理表示,从而绕过传统UV映射的限制,并结合基于Transformer的Large Texturing Model (LTM) 直接从图像和几何输入中预测TFs,进一步通过LoRA适配策略优化大规模Diffusion Transformers (DiTs) 以实现高质量多视角纹理合成。
链接: https://arxiv.org/abs/2505.23253
作者: Yixun Liang,Kunming Luo,Xiao Chen,Rui Chen,Hongyu Yan,Weiyu Li,Jiarui Liu,Ping Tan
机构: HKUST(香港科技大学); Light Illusion(光幻象)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 9 figures
Abstract:We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based inpainting to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we propose to bypass the limitations of UV mapping by operating directly in a unified 3D functional space. Specifically, we first propose Texture Functions (TFs), which lift texture generation into 3D space--a continuous, volumetric representation that maps any 3D point to a texture value based solely on surface proximity, independent of mesh topology. Then, we propose to predict these TFs directly from images and geometry inputs using a transformer-based Large Texturing Model (LTM). To further enhance texture quality and leverage powerful 2D priors, we develop an advanced LoRA-based strategy for efficiently adapting large-scale Diffusion Transformers (DiTs) for high-quality multi-view texture synthesis as our first stage. Extensive experiments demonstrate that UniTEX achieves superior visual quality and texture integrity compared to existing approaches, offering a generalizable and scalable solution for automated 3D texture generation. Code will be available at: this https URL.
zh
[CV-93] Advancing Image Super-resolution Techniques in Remote Sensing: A Comprehensive Survey
【速读】:该论文试图解决遥感图像超分辨率(Remote Sensing Image Super-Resolution, RSISR)方法缺乏系统性与全面性综述的问题,旨在帮助研究人员理解当前技术趋势与挑战。其解决方案的关键在于对RSISR算法进行分类,包括监督学习、无监督学习和质量评估方法,并深入分析各类方法的优缺点及在大尺度退化下保持细粒度纹理和几何结构的局限性,从而为未来研究提供方向,强调领域特定架构与鲁棒评估协议的重要性。
链接: https://arxiv.org/abs/2505.23248
作者: Yunliang Qi,Meng Lou,Yimin Liu,Lu Li,Zhen Yang,Wen Nie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 7 figures, a survey
Abstract:Remote sensing image super-resolution (RSISR) is a crucial task in remote sensing image processing, aiming to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. Despite the growing number of RSISR methods proposed in recent years, a systematic and comprehensive review of these methods is still lacking. This paper presents a thorough review of RSISR algorithms, covering methodologies, datasets, and evaluation metrics. We provide an in-depth analysis of RSISR methods, categorizing them into supervised, unsupervised, and quality evaluation approaches, to help researchers understand current trends and challenges. Our review also discusses the strengths, limitations, and inherent challenges of these techniques. Notably, our analysis reveals significant limitations in existing methods, particularly in preserving fine-grained textures and geometric structures under large-scale degradation. Based on these findings, we outline future research directions, highlighting the need for domain-specific architectures and robust evaluation protocols to bridge the gap between synthetic and real-world RSISR scenarios.
zh
[CV-94] SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection
【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, ISTD)中的核心挑战,包括目标在图像中占比极小(不足0.15%)以及与复杂背景的低区分度。现有深度学习方法在下采样过程中常出现信息丢失,并且难以高效建模全局上下文。其解决方案的关键在于提出SAMamba框架,该框架融合了SAM2的层次化特征学习与Mamba的可选序列建模能力,通过三个关键创新:(1) 特征选择适配器(FS-Adapter)实现高效的自然图像到红外图像域适应;(2) 跨通道状态空间交互(CSI)模块以线性复杂度实现全局上下文建模;(3) 细节保持的上下文融合(DPCF)模块通过门控机制自适应地结合多尺度特征,从而有效提升检测性能。
链接: https://arxiv.org/abs/2505.23214
作者: Wenhao Xu,Shuchen Zheng,Changwei Wang,Zherui Zhang,Chuan Ren,Rongtao Xu,Shibiao Xu
机构: School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100083, China; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences),Shandong, 250353, China; The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100190, China; Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Shandong, 250353, China; School of Software Microelectronics, Peking University, Beijing, 100871, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Information Fusion 2025
Abstract:Infrared small target detection (ISTD) is vital for long-range surveillance in military, maritime, and early warning applications. ISTD is challenged by targets occupying less than 0.15% of the image and low distinguishability from complex backgrounds. Existing deep learning methods often suffer from information loss during downsampling and inefficient global context modeling. This paper presents SAMamba, a novel framework integrating SAM2’s hierarchical feature learning with Mamba’s selective sequence modeling. Key innovations include: (1) A Feature Selection Adapter (FS-Adapter) for efficient natural-to-infrared domain adaptation via dual-stage selection (token-level with a learnable task embedding and channel-wise adaptive transformations); (2) A Cross-Channel State-Space Interaction (CSI) module for efficient global context modeling with linear complexity using selective state space modeling; and (3) A Detail-Preserving Contextual Fusion (DPCF) module that adaptively combines multi-scale features with a gating mechanism to balance high-resolution and low-resolution feature contributions. SAMamba addresses core ISTD challenges by bridging the domain gap, maintaining fine-grained details, and efficiently modeling long-range dependencies. Experiments on NUAA-SIRST, IRSTD-1k, and NUDT-SIRST datasets show SAMamba significantly outperforms state-of-the-art methods, especially in challenging scenarios with heterogeneous backgrounds and varying target scales. Code: this https URL.
zh
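SAMamba 中 DPCF 模块"用门控机制自适应平衡高/低分辨率特征贡献"的思路,可以用一个逐通道 sigmoid 门控来粗略示意(形状与参数均为本文假设的极简版本,并非官方实现):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f_high, f_low, w_gate, b_gate):
    """门控融合示意:由特征本身学出逐通道权重 g,
    以 g*f_high + (1-g)*f_low 自适应平衡高/低分辨率两路特征。
    f_high/f_low: (N, C);w_gate: (C, C);b_gate: (C,)。"""
    g = sigmoid((f_high + f_low) @ w_gate + b_gate)
    return g * f_high + (1 - g) * f_low
```

当门控偏置推向极大值时 g 趋近 1,输出完全倒向高分辨率分支;实际中 g 由数据驱动,在两路之间取连续折中。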
[CV-95] Navigating the Accuracy-Size Trade-Off with Flexible Model Merging
【速读】:该论文试图解决多任务模型合并中精度损失与部署成本之间的权衡问题,即在不进行昂贵训练的情况下,如何有效提升合并模型的性能,同时避免部署多个独立微调模型带来的高成本。解决方案的关键在于提出FlexMerge,这是一种无需数据的模型合并框架,能够灵活生成不同规模的合并模型,通过将微调模型视为序列块并逐步合并,实现对模型精度与大小的精细控制,从而在多种部署场景下提供高效且灵活的解决方案。
链接: https://arxiv.org/abs/2505.23209
作者: Akash Dhasade,Divyansh Jhunjhunwala,Milos Vujasinovic,Gauri Joshi,Anne-Marie Kermarrec
机构: EPFL(瑞士联邦理工学院); Carnegie Mellon University(卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Model merging has emerged as an efficient method to combine multiple single-task fine-tuned models. The merged model can enjoy multi-task capabilities without expensive training. While promising, merging into a single model often suffers from an accuracy gap with respect to individual fine-tuned models. On the other hand, deploying all individual fine-tuned models incurs high costs. We propose FlexMerge, a novel data-free model merging framework to flexibly generate merged models of varying sizes, spanning the spectrum from a single merged model to retaining all individual fine-tuned models. FlexMerge treats fine-tuned models as collections of sequential blocks and progressively merges them using any existing data-free merging method, halting at a desired size. We systematically explore the accuracy-size trade-off exhibited by different merging algorithms in combination with FlexMerge. Extensive experiments on vision and NLP benchmarks, with up to 30 tasks, reveal that even modestly larger merged models can provide substantial accuracy improvements over a single model. By offering fine-grained control over fused model size, FlexMerge provides a flexible, data-free, and high-performance solution for diverse deployment scenarios.
zh
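FlexMerge"把微调模型视为顺序块、逐步合并到目标大小"的精度-体积权衡,可以用如下玩具实现示意(以简单平均代替任意数据无关合并算法,块划分方式为本文假设):

```python
import numpy as np

def flexmerge(models, num_merged_blocks):
    """FlexMerge 思路的极简示意:前 num_merged_blocks 个位置的块做跨模型平均,
    其余位置保留各模型自己的块,从而在"单一合并模型"(全合并)与
    "保留全部微调模型"(零合并)之间连续取折中。
    models: K 个模型,每个是 B 个同形状 np.ndarray 块的列表。"""
    K, B = len(models), len(models[0])
    p = num_merged_blocks
    merged = [np.mean([m[i] for m in models], axis=0) for i in range(p)]
    private = [m[p:] for m in models]
    size = p + (B - p) * K  # 以块数计的部署体积
    return merged, private, size
```

p 越大,部署体积越小、但与单个微调模型的精度差距通常越大;论文正是系统考察这条精度-体积曲线。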
[CV-96] HyperPointFormer: Multimodal Fusion in 3D Space with Dual-Branch Cross-Attention Transformers
【速读】:该论文试图解决传统方法在处理多模态遥感数据时对三维(3D)信息利用不足的问题,具体表现为通过将3D数据栅格化为2D格式来融合多模态数据,从而限制了模型直接学习3D空间特征的能力,并且无法生成有效的3D预测。解决方案的关键在于提出一种完全基于3D的方法,该方法在3D点云中融合所有模态,并采用专用的双分支Transformer模型同时学习几何和光谱特征,同时引入基于交叉注意力机制的多尺度特征融合策略,以有效整合不同模态的信息。
链接: https://arxiv.org/abs/2505.23206
作者: Aldino Rizaldy,Richard Gloaguen,Fabian Ewald Fassnacht,Pedram Ghamisi
机构: Helmholtz-Zentrum Dresden-Rossendorf (HZDR); Helmholtz Institute Freiberg for Resource Technology (HIF); Freie Universität Berlin (Freie Universität Berlin)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal remote sensing data, including spectral and lidar or photogrammetry, is crucial for achieving satisfactory land-use / land-cover classification results in urban scenes. So far, most studies have been conducted in a 2D context. When 3D information is available in the dataset, it is typically integrated with the 2D data by rasterizing the 3D data into 2D formats. Although this method yields satisfactory classification results, it falls short in fully exploiting the potential of 3D data by restricting the model’s ability to learn 3D spatial features directly from raw point clouds. Additionally, it limits the generation of 3D predictions, as the dimensionality of the input data has been reduced. In this study, we propose a fully 3D-based method that fuses all modalities within the 3D point cloud and employs a dedicated dual-branch Transformer model to simultaneously learn geometric and spectral features. To enhance the fusion process, we introduce a cross-attention-based mechanism that fully operates on 3D points, effectively integrating features from various modalities across multiple scales. The purpose of cross-attention is to allow one modality to assess the importance of another by weighing the relevant features. We evaluated our method by comparing it against both 3D and 2D methods using the 2018 IEEE GRSS Data Fusion Contest (DFC2018) dataset. Our findings indicate that 3D fusion delivers competitive results compared to 2D methods and offers more flexibility by providing 3D predictions. These predictions can be projected onto 2D maps, a capability that is not feasible in reverse. Additionally, we evaluated our method on different datasets, specifically the ISPRS Vaihingen 3D and the IEEE 2019 Data Fusion Contest. Our code will be published here: this https URL.
zh
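该文"让一种模态对另一种模态的特征加权"的跨注意力机制,本质上是标准的 softmax(QK^T/sqrt(d))V,只是直接作用在 3D 点特征上。下面是单头版本的数值示意(投影矩阵随意初始化,仅为本文假设的演示):

```python
import numpy as np

def cross_attention(q_feats, kv_feats, Wq, Wk, Wv):
    """点级跨模态注意力示意:几何分支作 query,光谱分支作 key/value,
    按相关性对另一模态的特征加权求和。
    q_feats: (Nq, d);kv_feats: (Nk, d);Wq/Wk/Wv: (d, d)。"""
    Q, K, V = q_feats @ Wq, kv_feats @ Wk, kv_feats @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # 数值稳定的 softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ V
```

输出中每个 query 点的特征是 value 特征的凸组合,权重即"一种模态对另一种模态重要性的评估"。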
[CV-97] WTEFNet: Real-Time Low-Light Object Detection for Advanced Driver-Assistance Systems
【速读】:该论文旨在解决低光照条件下基于RGB相机的物体检测方法性能显著下降的问题(low-light conditions),其核心挑战在于低质量图像导致的环境感知能力不足。解决方案的关键在于提出WTEFNet框架,该框架包含三个核心模块:低光照增强(Low-Light Enhancement, LLE)模块用于提升暗区并抑制过曝区域;小波特征提取(Wavelet-based Feature Extraction, WFE)模块通过多级离散小波变换分离高频与低频成分,实现有效降噪与结构特征保留;自适应融合检测(Adaptive Fusion Detection, AFFD)模块则融合语义与光照特征以提升检测鲁棒性。
链接: https://arxiv.org/abs/2505.23201
作者: Hao Wu,Junzhou Chen,Ronghui Zhang,Nengchao Lyu,Hongyu Hu,Yanyong Guo,Tony Z. Qiu
机构: Sun Yat-sen University (中山大学); Wuhan University of Technology (武汉理工大学); Jilin University (吉林大学); Southeast University (东南大学); University of Alberta (阿尔伯塔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is expected to be submitted to IEEE Transactions on Instrumentation and Measurement
Abstract:Object detection is a cornerstone of environmental perception in advanced driver assistance systems(ADAS). However, most existing methods rely on RGB cameras, which suffer from significant performance degradation under low-light conditions due to poor image quality. To address this challenge, we proposes WTEFNet, a real-time object detection framework specifically designed for low-light scenarios, with strong adaptability to mainstream detectors. WTEFNet comprises three core modules: a Low-Light Enhancement (LLE) module, a Wavelet-based Feature Extraction (WFE) module, and an Adaptive Fusion Detection (AFFD) module. The LLE enhances dark regions while suppressing overexposed areas; the WFE applies multi-level discrete wavelet transforms to isolate high- and low-frequency components, enabling effective denoising and structural feature retention; the AFFD fuses semantic and illumination features for robust detection. To support training and evaluation, we introduce GSN, a manually annotated dataset covering both clear and rainy night-time scenes. Extensive experiments on BDD100K, SHIFT, nuScenes, and GSN demonstrate that WTEFNet achieves state-of-the-art accuracy under low-light conditions. Furthermore, deployment on a embedded platform (NVIDIA Jetson AGX Orin) confirms the framework’s suitability for real-time ADAS applications.
zh
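WFE 模块所用的多级离散小波变换,其一层 Haar 版本可以手写示意如下(LL 为低频子带,LH/HL/HH 为高频子带;归一化系数取平均形式,仅作说明,并非论文实现):

```python
import numpy as np

def haar_dwt2(img):
    """一层 2D Haar 离散小波变换:把偶数边长图像分成 LL(低频)与
    LH/HL/HH(高频)四个半分辨率子带,对应"分离高低频、
    便于去噪与保留结构特征"的思路。"""
    a = (img[0::2, :] + img[1::2, :]) / 2  # 行方向低通
    d = (img[0::2, :] - img[1::2, :]) / 2  # 行方向高通
    LL = (a[:, 0::2] + a[:, 1::2]) / 2
    LH = (a[:, 0::2] - a[:, 1::2]) / 2
    HL = (d[:, 0::2] + d[:, 1::2]) / 2
    HH = (d[:, 0::2] - d[:, 1::2]) / 2
    return LL, LH, HL, HH
```

对常数图像,三个高频子带全为零:平滑区域的噪声抑制正是在这些子带上进行,而结构边缘则集中体现在高频分量中。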
[CV-98] Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images
【速读】:该论文试图解决航空图像中目标检测面临的多类型变化问题,例如光照和视角变化带来的场景多样性及目标外观的剧烈变化,这些问题使得从整个图像场景中定位目标并识别其类别变得更加复杂。解决方案的关键在于提出一种名为LANGuage-guided Object detection (LANGO)的新型目标检测框架,该框架通过语言引导的学习机制减轻场景和实例级变化的影响。具体而言,设计了一个视觉语义推理器以理解图像场景的语义,同时引入了关系学习损失(relation learning loss)来处理实例级变化,如视角角度和尺度变化,从而提升目标检测性能。
链接: https://arxiv.org/abs/2505.23193
作者: Sungjune Park,Hyunjun Kim,Beomchan Park,Yong Man Ro
机构: Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite recent advancements in computer vision research, object detection in aerial images still suffers from several challenges. One primary challenge to be mitigated is the presence of multiple types of variation in aerial images, for example, illumination and viewpoint changes. These variations result in highly diverse image scenes and drastic alterations in object appearance, so that it becomes more complicated to localize objects from the whole image scene and recognize their categories. To address this problem, in this paper, we introduce a novel object detection framework in aerial images, named LANGuage-guided Object detection (LANGO). Upon the proposed language-guided learning, the proposed framework is designed to alleviate the impacts from both scene and instance-level variations. First, we are motivated by the way humans understand the semantics of scenes while perceiving environmental factors in the scenes (e.g., weather). Therefore, we design a visual semantic reasoner that comprehends visual semantics of image scenes by interpreting conditions where the given images were captured. Second, we devise a training objective, named relation learning loss, to deal with instance-level variations, such as viewpoint angle and scale changes. This training objective aims to learn relations in language representations of object categories, with the help of the robust characteristics against such variations. Through extensive experiments, we demonstrate the effectiveness of the proposed method, and our method obtains noticeable detection performance improvements.
zh
[CV-99] Fooling the Watchers: Breaking AIGC Detectors via Semantic Prompt Attacks
【速读】:该论文试图解决文本到图像(Text-to-Image, T2I)模型生成的逼真人像可能被滥用以及AIGC检测器鲁棒性不足的问题。解决方案的关键在于提出一种自动化的对抗性提示生成框架,该框架利用语法树结构和蒙特卡洛树搜索算法的变体,系统地探索语义提示空间,从而生成多样且可控的能够持续规避开源和商业AIGC检测器的提示。
链接: https://arxiv.org/abs/2505.23192
作者: Run Hao,Peng Ying
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 9 pages
Abstract:The rise of text-to-image (T2I) models has enabled the synthesis of photorealistic human portraits, raising serious concerns about identity misuse and the robustness of AIGC detectors. In this work, we propose an automated adversarial prompt generation framework that leverages a grammar tree structure and a variant of the Monte Carlo tree search algorithm to systematically explore the semantic prompt space. Our method generates diverse, controllable prompts that consistently evade both open-source and commercial AIGC detectors. Extensive experiments across multiple T2I models validate its effectiveness, and the approach ranked first in a real-world adversarial AIGC detection competition. Beyond attack scenarios, our method can also be used to construct high-quality adversarial datasets, providing valuable resources for training and evaluating more robust AIGC detection and defense systems.
zh
[CV-100] TrackVLA: Embodied Visual Tracking in the Wild
【速读】:该论文旨在解决具身视觉追踪(Embodied Visual Tracking)问题,即在动态环境中,仅通过第一人称视觉让智能体持续跟踪特定目标,该任务面临严重遮挡和高场景动态性的挑战。现有方法通常通过将识别与规划模块化分离来应对这一问题,但未能有效融合两者之间的协同作用。本文提出的解决方案是TrackVLA,一种视觉-语言-动作(Vision-Language-Action, VLA)模型,其关键在于通过共享的大型语言模型(LLM)主干网络,结合语言建模头用于目标识别以及基于锚点的扩散模型用于轨迹规划,从而学习目标识别与轨迹规划之间的协同效应。
链接: https://arxiv.org/abs/2505.23189
作者: Shaoan Wang,Jiazhao Zhang,Minghan Li,Jiahang Liu,Anqi Li,Kui Wu,Fangwei Zhong,Junzhi Yu,Zhizheng Zhang,He Wang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. This task is inherently challenging as it requires both accurate target recognition and effective trajectory planning under conditions of severe occlusion and high scene dynamics. Existing approaches typically address this challenge through a modular separation of recognition and planning. In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning. Leveraging a shared LLM backbone, we employ a language modeling head for recognition and an anchor-based diffusion model for trajectory planning. To train TrackVLA, we construct an Embodied Visual Tracking Benchmark (EVT-Bench) and collect diverse difficulty levels of recognition samples, resulting in a dataset of 1.7 million samples. Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates SOTA performance and strong generalizability. It significantly outperforms existing methods on public benchmarks in a zero-shot manner while remaining robust to high dynamics and occlusion in real-world scenarios at 10 FPS inference speed. Our project page is: this https URL.
zh
[CV-101] HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image
【速读】:该论文试图解决服装生成任务中,基于扩散模型的生成过程缺乏足够的视觉监督以及平面草图与文本指导之间可能存在冲突的问题。解决方案的关键在于提出HiGarment框架,其核心组件包括多模态语义增强机制,用于提升文本和视觉模态下的面料表征,以及协调的跨注意力机制,动态平衡来自平面草图和文本提示的信息,从而实现可控的服装合成。
链接: https://arxiv.org/abs/2505.23186
作者: Junyi Guo,Jingxuan Zhang,Fangyu Wu,Huanda Lu,Qiufeng Wang,Wenmian Yang,Eng Gee Lim,Dongming Lu
机构: Xi’an Jiaotong Liverpool University (西安交通大学利物浦大学); NingboTech University (宁波工程学院); Beijing Normal University (北京师范大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion-based garment synthesis tasks primarily focus on the design phase in the fashion domain, while the garment production process remains largely underexplored. To bridge this gap, we introduce a new task: Flat Sketch to Realistic Garment Image (FS2RG), which generates realistic garment images by integrating flat sketches and textual guidance. FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. Furthermore, we collect Multi-modal Detailed Garment, the largest open-source dataset for garment generation. Experimental results and user studies demonstrate the effectiveness of HiGarment in garment synthesis. The code and dataset will be released.
zh
[CV-102] Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging CVPR2025
【速读】:该论文试图解决单像素成像(Single-Pixel Imaging, SPI)逆问题中,如何融合深度迭代算法(Deep-unrolling)与插件-播放(Plug-and-Play, PnP)方法的优势,以实现高精度和高效重建的问题。PnP方法虽然在不同压缩比(Compression Ratio, CR)下具有灵活性,但受限于重建精度和速度;而深度迭代方法在特定CR下能获得更优的精度和速度,但需要针对不同CR进行微调或重新训练。该论文的关键解决方案是设计一种高效的深度图像修复器(Deep Image Restorer, DIR),用于HQS(半二次分裂)和ADMM(交替方向乘子法)的迭代展开,并提出一种通用的近似轨迹(Proximal Trajectory, PT)损失函数,使学习到的DIR逼近理想显式修复正则化的近似算子,从而实现单一模型在不同CR下的灵活应用并提升重建性能。
链接: https://arxiv.org/abs/2505.23180
作者: Ping Wang,Lishun Wang,Gang Qu,Xiaodong Wang,Yulun Zhang,Xin Yuan
机构: Westlake University (西湖大学); Zhejiang University (浙江大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025
Abstract:Deep-unrolling and plug-and-play (PnP) approaches have become the de-facto standard solvers for single-pixel imaging (SPI) inverse problem. PnP approaches, a class of iterative algorithms where regularization is implicitly performed by an off-the-shelf deep denoiser, are flexible for varying compression ratios (CRs) but are limited in reconstruction accuracy and speed. Conversely, unrolling approaches, a class of multi-stage neural networks where a truncated iterative optimization process is transformed into an end-to-end trainable network, typically achieve better accuracy with faster inference but require fine-tuning or even retraining when CR changes. In this paper, we address the challenge of integrating the strengths of both classes of solvers. To this end, we design an efficient deep image restorer (DIR) for the unrolling of HQS (half quadratic splitting) and ADMM (alternating direction method of multipliers). More importantly, a general proximal trajectory (PT) loss function is proposed to train HQS/ADMM-unrolling networks such that learned DIR approximates the proximal operator of an ideal explicit restoration regularizer. Extensive experiments demonstrate that, the resulting proximal unrolling networks can not only flexibly handle varying CRs with a single model like PnP algorithms, but also outperform previous CR-specific unrolling networks in both reconstruction accuracy and speed. Source codes and models are available at this https URL.
zh
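HQS 展开的每一阶段交替做两步:数据项子问题 x = argmin ||Ax-y||² + μ||x-z||²(有闭式解)与先验子问题 z = denoise(x)(论文中由学习到的 DIR 网络承担,并以 PT 损失逼近理想的近端算子)。下面用一个玩具去噪器示意这一交替结构(μ 与阶段数为假设值):

```python
import numpy as np

def hqs_unrolled(y, A, denoise, mu=1.0, stages=8):
    """HQS 展开示意:交替求解数据项(正规方程闭式解)与去噪先验。
    y: 观测;A: 观测矩阵;denoise: 去噪算子(论文中为 DIR 网络)。"""
    n = A.shape[1]
    z = np.zeros(n)
    AtA, Aty = A.T @ A, A.T @ y
    for _ in range(stages):
        # 数据项子问题:(A^T A + mu I) x = A^T y + mu z
        x = np.linalg.solve(AtA + mu * np.eye(n), Aty + mu * z)
        # 先验子问题:由去噪器扮演近端算子
        z = denoise(x)
    return z
```

当 A 为单位阵且去噪器取恒等映射时,迭代以几何速率收敛到 y,可用来验证展开结构本身的正确性。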
[CV-103] DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes
【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂现实场景中细粒度视觉感知能力有限的问题。其解决方案的关键在于设计一种基于强化学习(Reinforcement Learning, RL)的框架——DIP-R1,该框架通过三种基于规则的奖励建模机制,引导MLLMs进行细致的视觉场景分析与不确定区域的深入观察,从而提升其在复杂场景下的视觉感知能力。
链接: https://arxiv.org/abs/2505.23179
作者: Sungjune Park,Hyunjun Kim,Junho Kim,Seongho Kim,Yong Man Ro
机构: KAIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated significant visual understanding capabilities, yet their fine-grained visual perception in complex real-world scenarios, such as densely crowded public areas, remains limited. Inspired by the recent success of reinforcement learning (RL) in both LLMs and MLLMs, in this paper, we explore how RL can enhance visual perception ability of MLLMs. Then we develop a novel RL-based framework, Deep Inspection and Perception with RL (DIP-R1) designed to enhance the visual perception capabilities of MLLMs, by comprehending complex scenes and looking through visual instances closely. DIP-R1 guides MLLMs through detailed inspection of visual scene via three simply designed rule-based reward modelings. First, we adopt a standard reasoning reward encouraging the model to include three step-by-step processes: 1) reasoning for understanding visual scenes, 2) observing for looking through interested but ambiguous regions, and 3) decision-making for predicting answer. Second, a variance-guided looking reward is designed to examine uncertain regions for the second observing process. It explicitly enables the model to inspect ambiguous areas, improving its ability to mitigate perceptual uncertainties. Third, we model a weighted precision-recall accuracy reward enhancing accurate decision-making. We explore its effectiveness across diverse fine-grained object detection data consisting of challenging real-world environments, such as densely crowded scenes. Built upon existing MLLMs, DIP-R1 achieves consistent and significant improvement across various in-domain and out-of-domain scenarios. It also outperforms various existing baseline models and supervised fine-tuning methods. Our findings highlight the substantial potential of integrating RL into MLLMs for enhancing capabilities in complex real-world perception tasks.
zh
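文中第三类"加权精确率-召回率准确性奖励"的一种可能形式,是对精确率与召回率做 F-beta 式加权调和平均。以下为一个极简的 Python 示意(函数名与加权方式均为本文假设,并非论文官方实现):

```python
def weighted_pr_reward(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """按精确率与召回率的加权调和平均计算检测奖励(F-beta 形式,仅为示意)。
    tp/fp/fn: 预测框与真值框匹配后的真正例/假正例/假负例数量。"""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

beta > 1 时奖励更看重召回(鼓励模型找全密集场景中的实例),beta < 1 时更看重精确率。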
[CV-104] Pseudo Multi-Source Domain Generalization: Bridging the Gap Between Single and Multi-Source Domain Generalization
【速读】:该论文试图解决深度学习模型在部署到与训练数据分布不同的数据分布时性能下降的问题,尤其是在环境条件频繁变化的实际应用中。其解决方案的关键在于提出一种名为伪多源域泛化(Pseudo Multi-source Domain Generalization, PMDG)的框架,该框架通过风格迁移和数据增强技术从单一源域生成多个伪域,从而构建合成的多域数据集,使复杂的多源域泛化(Multi-source Domain Generalization, MDG)算法能够在更实际的单源域泛化(Single-source Domain Generalization, SDG)设置中应用。
链接: https://arxiv.org/abs/2505.23173
作者: Shohei Enomoto
机构: NTT(日本电信电话株式会社)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning models often struggle to maintain performance when deployed on data distributions different from their training data, particularly in real-world applications where environmental conditions frequently change. While Multi-source Domain Generalization (MDG) has shown promise in addressing this challenge by leveraging multiple source domains during training, its practical application is limited by the significant costs and difficulties associated with creating multi-domain datasets. To address this limitation, we propose Pseudo Multi-source Domain Generalization (PMDG), a novel framework that enables the application of sophisticated MDG algorithms in more practical Single-source Domain Generalization (SDG) settings. PMDG generates multiple pseudo-domains from a single source domain through style transfer and data augmentation techniques, creating a synthetic multi-domain dataset that can be used with existing MDG algorithms. Through extensive experiments with PseudoDomainBed, our modified version of the DomainBed benchmark, we analyze the effectiveness of PMDG across multiple datasets and architectures. Our analysis reveals several key findings, including a positive correlation between MDG and PMDG performance and the potential of pseudo-domains to match or exceed actual multi-domain performance with sufficient data. These comprehensive empirical results provide valuable insights for future research in domain generalization. Our code is available at this https URL.
zh
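PMDG 的核心思想——对单一源域施加不同的固定变换来"造出"多个伪域——可以用如下极简代码示意(这里仅以 gamma 变换与亮度偏移充当风格化手段,变换种类与参数均为本文假设):

```python
def make_pseudo_domains(images, n_domains=3):
    """从单源域图像生成 n_domains 个伪域,拼成合成多域数据集(示意)。
    images: 每张图是取值在 [0, 1] 的浮点数列表(已展平)。"""
    def gamma(img, g):
        return [min(1.0, max(0.0, v ** g)) for v in img]
    def shift(img, s):
        return [min(1.0, max(0.0, v + s)) for v in img]

    transforms = [
        lambda img: img[:],            # 伪域0:保留原始源域
        lambda img: gamma(img, 0.5),   # 伪域1:gamma 提亮
        lambda img: shift(img, -0.2),  # 伪域2:整体变暗
    ]
    return [[t(img) for img in images] for t in transforms[:n_domains]]
```

得到的多个伪域即可直接喂给任意现有 MDG 算法,从而在单源(SDG)设定下复用多源方法。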
[CV-105] RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer
【速读】:该论文试图解决机器人操作中模仿学习因真实世界机器人示范数据收集成本过高而难以大规模应用的问题,以及模拟器与现实之间的模拟到现实(sim-to-real)差距带来的挑战。其解决方案的关键在于提出RoboTransfer,这是一个基于扩散的视频生成框架,通过整合多视角几何和对场景组件(如背景和物体属性)的显式控制,结合跨视角特征交互及全局深度/法线条件,确保多视角间的几何一致性,从而实现具有增强几何一致性和视觉保真度的多视角视频生成。
链接: https://arxiv.org/abs/2505.23171
作者: Liu Liu,Xiaofeng Wang,Guosheng Zhao,Keyu Li,Wenkang Qin,Jiaxiong Qiu,Zheng Zhu,Guan Huang,Zhizhong Su
机构: Horizon Robotics(地平线机器人); GigaAI(巨量人工智能); CASIA(中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 15 figures
Abstract:Imitation Learning has become a fundamental approach in robotic manipulation. However, collecting large-scale real-world robot demonstrations is prohibitively expensive. Simulators offer a cost-effective alternative, but the sim-to-real gap makes it extremely challenging to scale. Therefore, we introduce RoboTransfer, a diffusion-based video generation framework for robotic data synthesis. Unlike previous methods, RoboTransfer integrates multi-view geometry with explicit control over scene components, such as background and object attributes. By incorporating cross-view feature interactions and global depth/normal conditions, RoboTransfer ensures geometry consistency across views. This framework allows fine-grained control, including background edits and object swaps. Experiments demonstrate that RoboTransfer is capable of generating multi-view videos with enhanced geometric consistency and visual fidelity. In addition, policies trained on the data generated by RoboTransfer achieve a 33.3% relative improvement in the success rate in the DIFF-OBJ setting and a substantial 251% relative improvement in the more challenging DIFF-ALL scenario. Explore more demos on our project page: this https URL
zh
[CV-106] Implicit Inversion turns CLIP into a Decoder
【速读】:该论文试图解决如何仅使用判别式模型CLIP(Contrastive Language-Image Pretraining)实现图像生成的问题,而无需依赖解码器、训练或微调。其解决方案的关键在于优化一种频率感知的隐式神经表示,通过在网络层中分层分布频率来促进从粗到细的生成过程,并引入对抗鲁棒初始化、轻量级正交普罗克鲁斯特斯投影以及混合损失函数以稳定逆映射。该方法在不修改CLIP权重的情况下实现了文本到图像生成、风格迁移和图像重建等能力。
链接: https://arxiv.org/abs/2505.23161
作者: Antonio D’Orazio,Maria Rosaria Briglia,Donato Crisostomi,Dario Loi,Emanuele Rodolà,Iacopo Masi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone – without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP’s weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
zh
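文中用于对齐局部文本与图像嵌入的"轻量级正交普罗克鲁斯特斯投影",其一般解需要对交叉协方差矩阵做 SVD;下面给出二维旋转情形下的闭式玩具示意,便于理解其目标(仅为示意,非论文实现,也不含缩放与反射):

```python
import math

def procrustes_rotation_2d(xs, ys):
    """求使 sum ||R x_i - y_i||^2 最小的二维旋转角(弧度)。
    xs, ys: 等长的 (x, y) 坐标对列表。
    思路:累积交叉协方差 M = sum y_i x_i^T = [[a, b], [c, d]],
    最优旋转角为 atan2(c - b, a + d)。"""
    a = b = c = d = 0.0
    for (x1, x2), (y1, y2) in zip(xs, ys):
        a += y1 * x1; b += y1 * x2
        c += y2 * x1; d += y2 * x2
    return math.atan2(c - b, a + d)
```

高维嵌入(如 CLIP 的文本/图像特征)对应的正交解由 M 的 SVD 给出(R = U V^T),原理与此一致。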
[CV-107] LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering
【速读】:该论文旨在解决在内存受限设备上实时渲染大规模3D场景的问题(Large-scale 3D Scene Rendering on Memory-Constrained Devices)。其解决方案的关键在于提出一种分层的细节层次(Level-of-Detail, LOD)表示方法,通过基于相机距离迭代选择最优的高斯分布子集,从而显著降低渲染时间和GPU内存占用。此外,通过深度感知的3D平滑滤波、基于重要性的剪枝与微调,以及场景空间分块和动态加载机制,有效维持了视觉保真度并减少了内存开销。
链接: https://arxiv.org/abs/2505.23158
作者: Jonas Kulhanek,Marie-Julie Rakotosaona,Fabian Manhardt,Christina Tsalicoglou,Michael Niemeyer,Torsten Sattler,Songyou Peng,Federico Tombari
机构: Google(谷歌); CTU in Prague(布拉格捷克理工大学); Google DeepMind(谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Web: this https URL
Abstract:In this work, we present a novel level-of-detail (LOD) method for 3D Gaussian Splatting that enables real-time rendering of large-scale scenes on memory-constrained devices. Our approach introduces a hierarchical LOD representation that iteratively selects optimal subsets of Gaussians based on camera distance, thus largely reducing both rendering time and GPU memory usage. We construct each LOD level by applying a depth-aware 3D smoothing filter, followed by importance-based pruning and fine-tuning to maintain visual fidelity. To further reduce memory overhead, we partition the scene into spatial chunks and dynamically load only relevant Gaussians during rendering, employing an opacity-blending mechanism to avoid visual artifacts at chunk boundaries. Our method achieves state-of-the-art performance on both outdoor (Hierarchical 3DGS) and indoor (Zip-NeRF) datasets, delivering high-quality renderings with reduced latency and memory requirements.
zh
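"基于相机距离选择 LOD 层级 + 只动态加载相机附近的空间分块"这两步,可以用如下极简逻辑示意(距离阈值、层级数与加载半径均为假设值;真实系统还包含深度感知平滑、剪枝与边界透明度混合):

```python
def select_lod(distance, thresholds=(10.0, 30.0, 80.0)):
    """相机距离越远,选用的 LOD 层级越高(高斯子集越稀疏)。返回层级索引。"""
    for level, t in enumerate(thresholds):
        if distance < t:
            return level
    return len(thresholds)

def visible_chunks(cam_pos, chunk_centers, radius=50.0):
    """仅保留相机附近的空间分块索引,其余分块不载入显存(粗略的距离剔除)。"""
    def dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5
    return [i for i, c in enumerate(chunk_centers) if dist(cam_pos, c) <= radius]
```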
[CV-108] PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
【速读】:该论文旨在解决音频-视觉事件解析(Audio-Visual Event Parsing)在实时应用中的限制问题,传统方法依赖于对完整视频的离线处理,模型规模庞大,难以满足实时性需求。其解决方案的关键在于提出一种名为Predictive Future Modeling (PreFM)的框架,该框架通过(1)预测多模态未来建模以整合有益的未来音频-视觉线索,提升上下文理解能力;(2)模态无关的鲁棒表示与焦点时间优先策略,以提高精度和泛化能力,从而实现高效的在线推理和实时性能。
链接: https://arxiv.org/abs/2505.23155
作者: Xiao Yu,Yan Fang,Xiaojie Jin,Yao Zhao,Yunchao Wei
机构: Beijing Jiaotong University (北京交通大学); Visual Intelligence + X International Joint Laboratory (视觉智能+X国际联合实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 8 figures
Abstract:Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) Real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework featured by (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show PreFM significantly outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. Code is available at this https URL.
zh
[CV-109] FlowAlign: Trajectory-Regularized Inversion-Free Flow-based Image Editing
【速读】:该论文旨在解决无反演(inversion-free)流模型在图像编辑中因缺乏精确潜在空间反演而导致的编辑轨迹不稳定和源图像一致性差的问题。其解决方案的关键在于提出FlowAlign框架,通过引入流匹配损失(flow-matching loss)作为正则化机制,以促进编辑过程中更平滑和稳定的轨迹,并在轨迹上显式平衡语义对齐与编辑提示的一致性以及结构一致性与源图像的一致性。
链接: https://arxiv.org/abs/2505.23145
作者: Jeongsol Kim,Yeobin Hong,Jong Chul Ye
机构: KAIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent inversion-free, flow-based image editing methods such as FlowEdit leverage a pre-trained noise-to-image flow model such as Stable Diffusion 3, enabling text-driven manipulation by solving an ordinary differential equation (ODE). While the lack of exact latent inversion is a core advantage of these methods, it often results in unstable editing trajectories and poor source consistency. To address this limitation, we propose FlowAlign, a novel inversion-free flow-based framework for consistent image editing with principled trajectory control. FlowAlign introduces a flow-matching loss as a regularization mechanism to promote smoother and more stable trajectories during the editing process. Notably, the flow-matching loss is shown to explicitly balance semantic alignment with the edit prompt and structural consistency with the source image along the trajectory. Furthermore, FlowAlign naturally supports reverse editing by simply reversing the ODE trajectory, highlighting the reversible and consistent nature of the transformation. Extensive experiments demonstrate that FlowAlign outperforms existing methods in both source preservation and editing controllability.
zh
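作为背景,标准流匹配损失在直线插值路径下的形式是:取 x_t = (1-t)·x0 + t·x1,目标速度为 u = x1 - x0,最小化速度场预测与 u 的均方误差。下面是一个把速度场当作黑盒函数的玩具示意(插值方式与无权重设定均为假设,并非论文正则项的逐项复现):

```python
def flow_matching_loss(v, x0, x1, t):
    """直线插值路径下的流匹配损失(示意)。
    v: 速度场函数 v(x_t, t) -> 与 x 同维的列表
    x0, x1: 源/目标样本(浮点数列表);t: [0, 1] 内的时间。"""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    u = [b - a for a, b in zip(x0, x1)]   # 直线路径的目标速度 x1 - x0
    pred = v(xt, t)
    return sum((p - ui) ** 2 for p, ui in zip(pred, u)) / len(u)
```

若速度场在轨迹上处处命中目标速度,该损失为 0;FlowAlign 正是以这类项约束编辑轨迹的平滑与一致。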
[CV-110] Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning
【速读】:该论文试图解决当前基于人工智能(Artificial Intelligence, AI)的胸部X光片(Chest X-ray, CXR)解读模型在诊断推理建模方面的不足,这些模型通常采用简单的输入到输出范式,未能有效模拟放射科医生在实际临床环境中所采用的顺序性、上下文相关的诊断推理过程。解决方案的关键在于构建一个名为CXRTrek的多阶段视觉问答(Visual Question Answering, VQA)数据集,该数据集首次明确模拟了真实临床场景下的诊断推理流程,并在此基础上提出了一种专门设计的视觉-语言大模型(Vision-Language Large Model, VLLM)CXRTrekNet,以将临床推理流程融入VLLM框架中,从而有效建模诊断阶段间的依赖关系和放射学背景下的推理模式。
链接: https://arxiv.org/abs/2505.23143
作者: Jinquan Guan,Qi Chen,Lizhou Liang,Yuhang Liu,Vu Minh Hieu Phan,Minh-Son To,Jian Chen,Yutong Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages (main text), 18 pages (appendix)
Abstract:Artificial intelligence (AI)-based chest X-ray (CXR) interpretation assistants have demonstrated significant progress and are increasingly being applied in clinical settings. However, contemporary medical AI models often adhere to a simplistic input-to-output paradigm, directly processing an image and an instruction to generate a result, where the instructions may be integral to the model’s architecture. This approach overlooks the modeling of the inherent diagnostic reasoning in chest X-ray interpretation. Such reasoning is typically sequential, where each interpretive stage considers the images, the current task, and the contextual information from previous stages. This oversight leads to several shortcomings, including misalignment with clinical scenarios, contextless reasoning, and untraceable errors. To fill this gap, we construct CXRTrek, a new multi-stage visual question answering (VQA) dataset for CXR interpretation. The dataset is designed to explicitly simulate the diagnostic reasoning process employed by radiologists in real-world clinical settings for the first time. CXRTrek covers 8 sequential diagnostic stages, comprising 428,966 samples and over 11 million question-answer (QA) pairs, with an average of 26.29 QA pairs per sample. Building on the CXRTrek dataset, we propose a new vision-language large model (VLLM), CXRTrekNet, specifically designed to incorporate the clinical reasoning flow into the VLLM framework. CXRTrekNet effectively models the dependencies between diagnostic stages and captures reasoning patterns within the radiological context. Trained on our dataset, the model consistently outperforms existing medical VLLMs on the CXRTrek benchmarks and demonstrates superior generalization across multiple tasks on five diverse external datasets. The dataset and model can be found in our repository (this https URL).
zh
[CV-111] Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing
【速读】:该论文旨在解决视频编辑中基于文本引导的方法存在用户意图模糊和对物体特定方面编辑控制不足的问题。其解决方案的关键在于提出一种名为Zero-to-Hero的新方法,该方法通过参考图像引导的视频编辑,将编辑过程分解为两个独立问题:首先编辑一个锚定帧以满足用户需求作为参考图像,然后在其他帧中一致地传播其外观。该方法利用原始帧内的对应关系来指导注意力机制,相较于之前提出的光流或时间模块,在处理大运动物体时更具鲁棒性,并提供了可靠的零样本初始化以确保准确性和时间一致性。
链接: https://arxiv.org/abs/2505.23134
作者: Tongtong Su,Chengyu Wang,Jun Huang,Dongming Lu
机构: Zhejiang University (浙江大学); Alibaba Cloud Computing (阿里云计算)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Appearance editing according to user needs is a pivotal task in video editing. Existing text-guided methods often lead to ambiguities regarding user intentions and restrict fine-grained control over editing specific aspects of objects. To overcome these limitations, this paper introduces a novel approach named Zero-to-Hero, which focuses on reference-based video editing that disentangles the editing process into two distinct problems. It achieves this by first editing an anchor frame to satisfy user requirements as a reference image and then consistently propagating its appearance across other frames. We leverage correspondence within the original frames to guide the attention mechanism, which is more robust than previously proposed optical flow or temporal modules in memory-friendly video generative models, especially when dealing with objects exhibiting large motions. It offers a solid ZERO-shot initialization that ensures both accuracy and temporal consistency. However, intervention in the attention mechanism results in compounded imaging degradation with over-saturated colors and unknown blurring issues. Starting from Zero-Stage, our Hero-Stage Holistically learns a conditional generative model for vidEo RestOration. To accurately evaluate the consistency of the appearance, we construct a set of videos with multiple appearances using Blender, enabling a fine-grained and deterministic evaluation. Our method outperforms the best-performing baseline with a PSNR improvement of 2.6 dB. The project page is at this https URL.
zh
[CV-112] PhotoArtAgent : Intelligent Photo Retouching with Language Model-Based Artist Agents
【速读】:该论文试图解决非专业用户在图像修饰过程中缺乏 interpretative depth(解释深度)和 interactive transparency(交互透明性)的问题,即现有的自动化工具虽然能生成视觉上吸引人的结果,但无法提供类似专业艺术家的创造性决策过程和可解释性。解决方案的关键在于提出 PhotoArtAgent,该系统结合了 Vision-Language Models (VLMs) 与先进的自然语言推理能力,以模拟专业艺术家的创作过程,通过显式的艺术分析、修饰策略规划以及基于 API 的参数输出,实现图像的迭代优化,并提供透明的文本解释,从而增强用户的交互性和控制感。
链接: https://arxiv.org/abs/2505.23130
作者: Haoyu Chen,Keda Tao,Yizao Wang,Xinlei Wang,Lei Zhu,Jinjin Gu
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Xidian University(西安电子科技大学); Sun Yat-sen University(中山大学); The University of Sydney(悉尼大学); INSAIT, Sofia University(索菲亚大学INSAIT)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Photo retouching is integral to photographic art, extending far beyond simple technical fixes to heighten emotional expression and narrative depth. While artists leverage expertise to create unique visual effects through deliberate adjustments, non-professional users often rely on automated tools that produce visually pleasing results but lack interpretative depth and interactive transparency. In this paper, we introduce PhotoArtAgent, an intelligent system that combines Vision-Language Models (VLMs) with advanced natural language reasoning to emulate the creative process of a professional artist. The agent performs explicit artistic analysis, plans retouching strategies, and outputs precise parameters to Lightroom through an API. It then evaluates the resulting images and iteratively refines them until the desired artistic vision is achieved. Throughout this process, PhotoArtAgent provides transparent, text-based explanations of its creative rationale, fostering meaningful interaction and user control. Experimental results show that PhotoArtAgent not only surpasses existing automated tools in user studies but also achieves results comparable to those of professional human artists.
zh
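PhotoArtAgent 的"分析—规划—执行—评估—迭代"控制流程可用如下骨架示意(其中 score_fn、propose_fn、apply_fn 分别代表评估、VLM 给出策略、经 API 下发调整,均为占位的假设接口,仅表达循环结构):

```python
def retouch_loop(image, score_fn, propose_fn, apply_fn,
                 target=0.9, max_iters=5):
    """迭代修图:评估当前结果,不达标则生成新参数并应用(示意)。
    返回最终图像与每轮得分的历史记录。"""
    history = []
    current = image
    for _ in range(max_iters):
        s = score_fn(current)            # 评估当前艺术效果
        history.append(s)
        if s >= target:                  # 达到目标愿景则停止
            break
        params = propose_fn(current, s)  # 相当于 VLM 规划的修图策略
        current = apply_fn(current, params)  # 相当于经 API 写入 Lightroom 参数
    return current, history
```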
[CV-113] HMAD: Advancing E2E Driving with Anchored Offset Proposals and Simulation-Supervised Multi-target Scoring
【速读】:该论文旨在解决端到端自动驾驶中生成多样化且符合规则的轨迹以及从这些选项中鲁棒地选择最优路径的持续性挑战,特别是在通过学习的多维度评估进行路径选择方面。解决方案的关键在于提出HMAD框架,该框架整合了基于鸟瞰图(BEV)的轨迹提案机制与学习的多标准评分系统。其中,BEVFormer结合可学习的锚定查询,从轨迹字典初始化并通过迭代偏移解码生成大量多样化且稳定的候选轨迹;同时,模拟监督的评分模块对轨迹进行多维度评估,包括无责任碰撞、可行驶区域合规性、舒适性和整体驾驶质量(即扩展的PDM分数)。
链接: https://arxiv.org/abs/2505.23129
作者: Bin Wang,Pingjun Li,Jinkun Liu,Jun Cheng,Hailong Lei,Yinze Rong,Huan-ang Gao,Kangliang Chen,Xing Pan,Weihao Gu
机构: HAOMO.AI Technology Co., Ltd (毫末智行科技有限公司); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:End-to-end autonomous driving faces persistent challenges in both generating diverse, rule-compliant trajectories and robustly selecting the optimal path from these options via learned, multi-faceted evaluation. To address these challenges, we introduce HMAD, a framework integrating a distinctive Bird’s-Eye-View (BEV) based trajectory proposal mechanism with learned multi-criteria scoring. HMAD leverages BEVFormer and employs learnable anchored queries, initialized from a trajectory dictionary and refined via iterative offset decoding (inspired by DiffusionDrive), to produce numerous diverse and stable candidate trajectories. A key innovation, our simulation-supervised scorer module, then evaluates these proposals against critical metrics including no at-fault collisions, drivable area compliance, comfortableness, and overall driving quality (i.e., extended PDM score). Demonstrating its efficacy, HMAD achieves a 44.5% driving score on the CVPR 2025 private test set. This work highlights the benefits of effectively decoupling robust trajectory generation from comprehensive, safety-aware learned scoring for advanced autonomous driving.
zh
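扩展 PDM 分数这类轨迹评分常见的组合方式,是把硬约束(无责任碰撞、可行驶区域)作为乘性门控,再对舒适性等软指标加权平均。下面是一个示意(门控项与权重均为假设,非论文评分器的实现):

```python
def trajectory_score(no_collision, in_drivable, comfort, quality,
                     w_comfort=0.4, w_quality=0.6):
    """硬约束作乘性门控、软指标加权求和的轨迹评分(示意)。
    no_collision/in_drivable: 0 或 1;comfort/quality: [0, 1]。"""
    gate = no_collision * in_drivable
    soft = w_comfort * comfort + w_quality * quality
    return gate * soft

def pick_best(candidates):
    """从候选轨迹的指标元组列表中选出得分最高者,返回其索引。"""
    scores = [trajectory_score(*c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```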
[CV-114] MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation
【速读】:该论文旨在解决语音驱动的同步手势视频生成问题,即从音频驱动的静态图像生成生动的语音视频,其难点在于身体不同部位在运动幅度、音频相关性和细节特征方面的多样性。现有方法通常依赖额外先验信息来解决大动作捕捉不足的问题,但限制了实际应用。该论文提出了一种基于运动掩码引导的两阶段网络(Motion Mask-Guided Two-Stage Network, MMGT),其关键在于利用音频信号生成的运动掩码和运动特征,与音频联合驱动同步语音手势视频的生成。第一阶段通过空间掩码引导的音频姿态生成网络(Spatial Mask-Guided Audio Pose Generation, SMGA)生成高质量姿态视频和运动掩码,有效捕捉面部和手势等关键区域的大动作;第二阶段则将运动掩码分层音频注意力机制(Motion Masked Hierarchical Audio Attention, MM-HAA)集成到稳定扩散视频生成模型中,以提升细粒度运动生成和区域细节控制能力,从而实现高质量、细节丰富的上半身视频生成。
链接: https://arxiv.org/abs/2505.23120
作者: Siyuan Wang,Jiawei Liu,Wei Wang,Yeying Jin,Jinsong Du,Zhi Han
机构: Shenyang Institute of Automation, Chinese Academy of Sciences (沈阳自动化研究所,中国科学院); Liaoning Liaohe Laboratory (辽宁辽河实验室); the Key Laboratory on Intelligent Detection and Equipment Technology (智能检测与装备技术重点实验室); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Co-Speech Gesture Video Generation aims to generate vivid speech videos from audio-driven still images, which is challenging due to the diversity of different parts of the body in terms of amplitude of motion, audio relevance, and detailed features. Relying solely on audio as the control signal often fails to capture large gesture movements in video, leading to more pronounced artifacts and distortions. Existing approaches typically address this issue by introducing additional a priori information, but this can limit the practical application of the task. Specifically, we propose a Motion Mask-Guided Two-Stage Network (MMGT) that uses audio, as well as motion masks and motion features generated from the audio signal to jointly drive the generation of synchronized speech gesture videos. In the first stage, the Spatial Mask-Guided Audio Pose Generation (SMGA) Network generates high-quality pose videos and motion masks from audio, effectively capturing large movements in key regions such as the face and gestures. In the second stage, we integrate the Motion Masked Hierarchical Audio Attention (MM-HAA) into the Stabilized Diffusion Video Generation model, overcoming limitations in fine-grained motion generation and region-specific detail control found in traditional methods. This guarantees high-quality, detailed upper-body video generation with accurate texture and motion details. Evaluations show improved video quality, lip-sync, and gesture. The model and code are available at this https URL.
zh
[CV-115] TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance
【速读】:该论文旨在解决多语言场景文本图像超分辨率(Multilingual Scene Text Image Super-Resolution, STISR)中的文本区域定位不准确、文本字符到形状的先验建模不足以及生成纹理幻觉等问题,这些问题导致超分辨率后的文本质量下降。其解决方案的关键在于提出TextSR,一个专门设计用于STISR的多模态扩散模型,该模型通过文本检测器定位文本区域,并结合光学字符识别(OCR)提取多语言文本,再利用基于UTF-8的文本编码器和交叉注意力机制将文本字符转换为视觉形状,同时引入两种创新方法提升模型鲁棒性,从而有效引导超分辨率过程,增强文本细节与可读性。
链接: https://arxiv.org/abs/2505.23119
作者: Keren Ye,Ignacio Garcia Dorado,Michalis Raptis,Mauricio Delbracio,Irene Zhu,Peyman Milanfar,Hossein Talebi
机构: Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text. To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8 based text encoder and cross-attention. Recognizing that OCR may sometimes produce inaccurate results in real-world scenarios, we have developed two innovative methods to enhance the robustness of our model. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process, enhancing fine details within the text and improving overall legibility. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR, underscoring the efficacy of our approach.
zh
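"基于 UTF-8 的文本编码器"最基础的一步,是把任意语言的字符串转成字节级 token 序列,天然支持多语言而无需庞大词表。下面是一个示意(特殊 token 的 id 约定与定长填充方式均为本文假设):

```python
PAD, BOS, EOS = 0, 1, 2   # 假设的特殊 token id
OFFSET = 3                # 字节值整体平移,避开特殊 token

def utf8_tokenize(text, max_len=32):
    """把多语言字符串编码为定长的字节级 token 序列(示意)。"""
    ids = [BOS] + [b + OFFSET for b in text.encode("utf-8")] + [EOS]
    ids = ids[:max_len]
    return ids + [PAD] * (max_len - len(ids))

def utf8_detokenize(ids):
    """去掉特殊 token 并还原字符串,用于检查编码可逆。"""
    body = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return body.decode("utf-8", errors="ignore")
```

例如中文字符在 UTF-8 下占 3 个字节,因此一个汉字对应 3 个 token,无论语种如何,词表大小固定为 256 + 特殊符号。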
[CV-116] Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving ICRA2025
【速读】:该论文试图解决从视觉输入准确预测三维占用网格(3D occupancy grids)的问题,这是自动驾驶中的关键任务,但现有判别方法在处理噪声数据、不完整观测以及三维场景的复杂结构方面存在困难。解决方案的关键在于将三维占用预测重新建模为生成建模任务,采用扩散模型(diffusion models)来学习数据分布并融入三维场景先验,从而提升预测的一致性、噪声鲁棒性,并更好地处理三维空间结构的复杂性。
链接: https://arxiv.org/abs/2505.23115
作者: Yunshen Wang,Yicheng Liu,Tianyuan Yuan,Yucheng Mao,Yingshi Liang,Xiuyu Yang,Honggang Zhang,Hang Zhao
机构: Institute for Interdisciplinary Information Sciences, Tsinghua University (清华大学交叉信息研究院); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2025
Abstract:Accurately predicting 3D occupancy grids from visual inputs is critical for autonomous driving, but current discriminative methods struggle with noisy data, incomplete observations, and the complex structures inherent in 3D scenes. In this work, we reframe 3D occupancy prediction as a generative modeling task using diffusion models, which learn the underlying data distribution and incorporate 3D scene priors. This approach enhances prediction consistency, noise robustness, and better handles the intricacies of 3D spatial structures. Our extensive experiments show that diffusion-based generative models outperform state-of-the-art discriminative approaches, delivering more realistic and accurate occupancy predictions, especially in occluded or low-visibility regions. Moreover, the improved predictions significantly benefit downstream planning tasks, highlighting the practical advantages of our method for real-world autonomous driving applications.
zh
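把占用预测重构为扩散生成任务时,训练的第一步是对占用网格做标准的前向加噪 q(x_t | x_0):x_t = sqrt(ᾱ_t)·x_0 + sqrt(1-ᾱ_t)·ε,其中 ε ~ N(0, 1)。下面把体素网格展平成列表做逐体素示意(噪声调度与随机种子均为假设):

```python
import math
import random

def forward_diffuse(x0, alpha_bar, rng=None):
    """对展平的占用网格做一步前向扩散加噪(示意)。
    x0: 体素占用值列表;alpha_bar: 累积噪声调度 ᾱ_t ∈ [0, 1]。"""
    rng = rng or random.Random(0)          # 固定种子仅为演示可复现
    s = math.sqrt(alpha_bar)
    n = math.sqrt(1.0 - alpha_bar)
    return [s * v + n * rng.gauss(0.0, 1.0) for v in x0]
```

ᾱ_t 越小,网格越接近纯噪声;反向去噪网络学习的正是从该噪声分布逐步还原三维场景结构。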
[CV-117] Identification of Patterns of Cognitive Impairment for Early Detection of Dementia
【速读】:该论文试图解决早期检测痴呆症的难题,特别是在大规模人群中进行周期性评估时,传统全面认知测试因耗时且繁琐而难以广泛应用的问题。解决方案的关键在于通过从包含正常人和轻度认知障碍(MCI)患者的群体中学习个体特定的认知损伤模式,并利用这些模式设计个性化的定期随访测试。该方法采用两步流程,包括集成包装特征选择和聚类识别与分析,从而能够识别出与临床认可的MCI亚型相对应的认知损伤模式,并进一步用于预测潜在的认知损伤路径,即使在无症状或看似正常的人群中亦可实现。
链接: https://arxiv.org/abs/2505.23109
作者: Anusha A. S.,Uma Ranjan,Medha Sharma,Siddharth Dutt
机构: Centre for Brain Research, Indian Institute of Science, Bengaluru, India (印度科学学院脑研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 2 figures, to be published in IEEE EMBC 2020
Abstract:Early detection of dementia is crucial to devise effective interventions. Comprehensive cognitive tests, while being the most accurate means of diagnosis, are long and tedious, thus limiting their applicability to a large population, especially when periodic assessments are needed. The problem is compounded by the fact that people have differing patterns of cognitive impairment as they progress to different forms of dementia. This paper presents a novel scheme by which individual-specific patterns of impairment can be identified and used to devise personalized tests for periodic follow-up. Patterns of cognitive impairment are initially learned from a population cluster of combined normals and MCIs, using a set of standardized cognitive tests. Impairment patterns in the population are identified using a 2-step procedure involving an ensemble wrapper feature selection followed by cluster identification and analysis. These patterns have been shown to correspond to clinically accepted variants of MCI, a prodrome of dementia. The learned clusters of patterns can subsequently be used to identify the most likely route of cognitive impairment, even for pre-symptomatic and apparently normal people. Baseline data of 24,000 subjects from the NACC database was used for the study.
zh
[CV-118] EAD: An EEG Adapter for Automated Classification
【速读】:该论文旨在解决多通道脑电图(EEG)数据在不同采集设备下难以构建统一嵌入学习管道的问题,特别是在不同数量EEG通道的多样本任务中,传统方法因依赖特定通道数而无法通用。解决方案的关键在于提出一种灵活的框架——EEG Adapter (EAD),该框架通过利用经过显著调整的最新EEG基础模型,学习鲁棒的EEG表示以进行分类任务,从而实现跨设备和跨通道数的有效泛化。
链接: https://arxiv.org/abs/2505.23107
作者: Pushapdeep Singh,Jyoti Nigam,Medicherla Vamsi Krishna,Arnav Bhavsar,Aditya Nigam
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:While electroencephalography (EEG) has been a popular modality for neural decoding, it often involves task specific acquisition of the EEG data. This poses challenges for the development of a unified pipeline to learn embeddings for various EEG signal classification, which is often involved in various decoding tasks. Traditionally, EEG classification involves the step of signal preprocessing and the use of deep learning techniques, which are highly dependent on the number of EEG channels in each sample. However, the same pipeline cannot be applied even if the EEG data is collected for the same experiment but with different acquisition devices. This necessitates the development of a framework for learning EEG embeddings, which could be highly beneficial for tasks involving multiple EEG samples for the same task but with varying numbers of EEG channels. In this work, we propose EEG Adapter (EAD), a flexible framework compatible with any signal acquisition device. More specifically, we leverage a recent EEG foundational model with significant adaptations to learn robust representations from the EEG data for the classification task. We evaluate EAD on two publicly available datasets achieving state-of-the-art accuracies 99.33% and 92.31% on EEG-ImageNet and BrainLat respectively. This illustrates the effectiveness of the proposed framework across diverse EEG datasets containing two different perception tasks: stimulus and resting-state EEG signals. We also perform zero-shot EEG classification on EEG-ImageNet task to demonstrate the generalization capability of the proposed approach.
zh
[CV-119] CURVE: CLIP-Utilized Reinforcement Learning for Visual Image Enhancement via Simple Image Processing ICIP2025
【速读】:该论文旨在解决无参考低光照图像增强(Zero-reference Low-Light Image Enhancement, ZR-LLIE)中的两个挑战:一是利用对比语言-图像预训练(Contrastive Language-Image Pre-Training, CLIP)模型生成感知上“良好”的图像,二是保持对高分辨率图像的计算效率。其解决方案的关键在于提出一种基于强化学习的图像增强方法——CLIP-Utilized Reinforcement learning-based Visual image Enhancement (CURVE),该方法采用一个简单的图像处理模块,通过贝塞尔曲线调整全局图像色调,并迭代估计处理参数,同时利用CLIP文本嵌入设计奖励函数进行强化学习训练。
链接: https://arxiv.org/abs/2505.23102
作者: Yuka Ogino,Takahiro Toizumi,Atsushi Ito
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICIP2025
Abstract:Low-Light Image Enhancement (LLIE) is crucial for improving both human perception and computer vision tasks. This paper addresses two challenges in zero-reference LLIE: obtaining perceptually ‘good’ images using the Contrastive Language-Image Pre-Training (CLIP) model and maintaining computational efficiency for high-resolution images. We propose CLIP-Utilized Reinforcement learning-based Visual image Enhancement (CURVE). CURVE employs a simple image processing module which adjusts global image tone based on Bézier curve and estimates its processing parameters iteratively. The estimator is trained by reinforcement learning with rewards designed using CLIP text embeddings. Experiments on low-light and multi-exposure datasets demonstrate the performance of CURVE in terms of enhancement quality and processing speed compared to conventional methods.
zh
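基于贝塞尔曲线的全局色调调整,本质是用一条由控制点决定的曲线重映射 [0, 1] 内的像素值;RL 估计的就是这类曲线参数。下面以端点固定为 0 与 1 的三次贝塞尔为例示意(控制点取值为假设):

```python
def bezier_tone(v, p1, p2):
    """三次贝塞尔色调曲线:端点固定为 (0, 1),p1、p2 为中间控制值(示意)。
    B(v) = 3(1-v)^2 v p1 + 3(1-v) v^2 p2 + v^3,输入 v ∈ [0, 1]。"""
    u = 1.0 - v
    return 3 * u * u * v * p1 + 3 * u * v * v * p2 + v ** 3

def enhance(img, p1, p2):
    """对整幅(已展平的)图像逐像素应用同一条色调曲线。"""
    return [bezier_tone(v, p1, p2) for v in img]
```

当 p1 = 1/3、p2 = 2/3 时曲线退化为恒等映射;p1、p2 取更大值则整体提亮,正适合低光照增强。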
[CV-120] LeMoRe: Learn More Details for Lightweight Semantic Segmentation ICIP2025
【速读】:该论文试图解决轻量级语义分割中效率与性能难以平衡的问题(lightweight semantic segmentation),现有方法由于特征建模的复杂性,常受限于刚性架构和隐式表示学习,导致参数密集型设计和对计算密集型Vision Transformer框架的依赖。其解决方案的关键在于通过融合显式建模与隐式建模,提出一种高效范式,结合明确的笛卡尔方向与显式建模视角以及隐式推断的中间表示,利用嵌套注意力机制高效捕捉全局依赖关系。
链接: https://arxiv.org/abs/2505.23093
作者: Mian Muhammad Naeem Abid,Nancy Mehta,Zongwei Wu,Radu Timofte
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE ICIP 2025
Abstract:Lightweight semantic segmentation is essential for many downstream vision tasks. Unfortunately, existing methods often struggle to balance efficiency and performance due to the complexity of feature modeling. Many of these existing approaches are constrained by rigid architectures and implicit representation learning, often characterized by parameter-heavy designs and a reliance on computationally intensive Vision Transformer-based frameworks. In this work, we introduce an efficient paradigm by synergizing explicit and implicit modeling to balance computational efficiency with representational fidelity. Our method combines well-defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, efficiently capturing global dependencies through a nested attention mechanism. Extensive experiments on challenging datasets, including ADE20K, CityScapes, Pascal Context, and COCO-Stuff, demonstrate that LeMoRe strikes an effective balance between performance and efficiency.
zh
[CV-121] GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion
【速读】:该论文旨在解决从视频中准确且时间一致地估计三维人体几何结构的问题,现有方法主要针对单张图像优化,常面临时间不一致和难以捕捉精细动态细节的局限。其解决方案的关键在于提出GeoMan架构,通过将视频几何估计任务重构为图像到视频生成问题,利用基于图像的模型估计第一帧的深度和法线,再以此作为条件输入视频扩散模型,从而减轻视频模型的几何估计负担,使其专注于细节建模,并借助大规模视频数据集的学习先验提升时间一致性与泛化能力。此外,为解决人体尺寸准确估计的问题,引入了根相对深度表示,保留关键人体尺度信息并提升单目输入下的估计效果。
链接: https://arxiv.org/abs/2505.23085
作者: Gwanghyun Kim,Xueting Li,Ye Yuan,Koki Nagano,Tianye Li,Jan Kautz,Se Young Chun,Umar Iqbal
机构: NVIDIA(英伟达); Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Project page: this https URL
Abstract:Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model’s role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the limitations of traditional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from videos.
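论文提出的根相对深度表示可以用一个极简的纯 Python 片段示意(非官方实现,数值均为虚构):把每个像素的深度改写为相对根关节(如骨盆)深度的偏移量,从而保留人体尺度的细节、去掉难以从单目视角估计的全局相机距离:

```python
def to_root_relative(depth_map, root_depth):
    # 以根关节深度为原点,只保留局部(人体尺度)的深度差
    return [[d - root_depth for d in row] for row in depth_map]

def to_metric(rel_map, root_depth):
    # 逆变换:已知根关节深度时可恢复公制深度
    return [[d + root_depth for d in row] for row in rel_map]

depth = [[3.20, 3.10], [3.40, 3.30]]  # 虚构的公制深度(米)
root = 3.25                            # 虚构的根关节深度
rel = to_root_relative(depth, root)
```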
zh
[CV-122] URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration CVPR2025
【速读】:该论文旨在解决低光照图像增强(LLIE)及联合LLIE与去模糊(LLIE-deblur)模型在处理动态耦合退化问题时的局限性。其解决方案的关键在于提出一种具有多状态视角的统一接收权重键值(URWKV)模型,通过定制核心URWKV模块,利用多阶段内部和跨阶段状态来感知和分析复杂的退化现象。该模型引入了亮度自适应归一化(LAN)、基于指数移动平均的跨阶段状态聚合以及状态感知选择性融合(SSF)模块,从而实现更灵活、有效的退化恢复。
链接: https://arxiv.org/abs/2505.23068
作者: Rui Xu,Yuzhen Niu,Yuezhou Li,Huangbiao Xu,Wenxi Liu,Yuzhong Chen
机构: Fujian Key Laboratory of Network Computing and Intelligent Information Processing (福建省网络计算与智能信息处理重点实验室); College of Computer and Data Science, Fuzhou University (福州大学计算机与数据科学学院); Engineering Research Center of Big Data Intelligence, Ministry of Education, Fuzhou (教育部大数据智能工程研究中心,福州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted to CVPR 2025
Abstract:Existing low-light image enhancement (LLIE) and joint LLIE and deblurring (LLIE-deblur) models have made strides in addressing predefined degradations, yet they are often constrained by dynamically coupled degradations. To address these challenges, we introduce a Unified Receptance Weighted Key Value (URWKV) model with multi-state perspective, enabling flexible and effective degradation restoration for low-light images. Specifically, we customize the core URWKV block to perceive and analyze complex degradations by leveraging multiple intra- and inter-stage states. First, inspired by the pupil mechanism in the human visual system, we propose Luminance-adaptive Normalization (LAN) that adjusts normalization parameters based on rich inter-stage states, allowing for adaptive, scene-aware luminance modulation. Second, we aggregate multiple intra-stage states through exponential moving average approach, effectively capturing subtle variations while mitigating information loss inherent in the single-state mechanism. To reduce the degradation effects commonly associated with conventional skip connections, we propose the State-aware Selective Fusion (SSF) module, which dynamically aligns and integrates multi-state features across encoder stages, selectively fusing contextual information. In comparison to state-of-the-art models, our URWKV model achieves superior performance on various benchmarks, while requiring significantly fewer parameters and computational resources.
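其中“通过指数移动平均聚合多个阶段内状态”的思路可以用如下纯 Python 草图示意(alpha 为新状态的权重,取值为假设;并非论文实现):

```python
def ema_aggregate(states, alpha=0.6):
    # 指数移动平均:依次融入每个阶段内状态,
    # 既保留细微变化,又避免只保留最后一个状态造成的信息丢失
    agg = list(states[0])
    for s in states[1:]:
        agg = [alpha * v + (1 - alpha) * a for a, v in zip(agg, s)]
    return agg

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 虚构的三个阶段内状态向量
agg = ema_aggregate(states)
```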
zh
[CV-123] Zero-P-to-3: Zero-Shot Partial-View Images to 3D Object
【速读】:该论文旨在解决在部分观测条件下进行三维重建的问题,特别是在观测视角范围有限且不可见区域生成不一致的情况下。其解决方案的关键在于提出一种无需训练的方法(Zero-P-to-3),该方法通过整合局部密集观测与多源先验信息,并采用基于融合的策略在DDIM采样中对齐这些先验,从而生成多视角一致的图像以监督不可见区域的重建,同时结合迭代优化策略利用物体的几何结构提升重建质量。
链接: https://arxiv.org/abs/2505.23054
作者: Yuxuan Lin,Ruihang Chu,Zhenyu Chen,Xiao Tang,Lei Ke,Haoling Li,Yingji Zhong,Zhihao Li,Shiyong Liu,Xiaofei Wu,Jianzhuang Liu,Yujiu Yang
机构: Tsinghua University (清华大学); Huawei Technologies Co., Ltd. (华为技术有限公司); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative 3D reconstruction shows strong potential in incomplete observations. While sparse-view and single-image reconstruction are well-researched, partial observation remains underexplored. In this context, dense views are accessible only from a specific angular range, with other perspectives remaining inaccessible. This task presents two main challenges: (i) Limited view range: observations confined to a narrow angular scope render traditional interpolation techniques, which require evenly distributed perspectives, ineffective. (ii) Inconsistent generation: views created for invisible regions often lack coherence with both visible regions and each other, compromising reconstruction consistency. To address these challenges, we propose Zero-P-to-3, a novel training-free approach that integrates the local dense observations and multi-source priors for reconstruction. Our method introduces a fusion-based strategy to effectively align these priors in DDIM sampling, thereby generating multi-view consistent images to supervise invisible views. We further design an iterative refinement strategy, which uses the geometric structures of the object to enhance reconstruction quality. Extensive experiments on multiple datasets show the superiority of our method over SOTAs, especially in invisible regions.
zh
[CV-124] Multi-Sourced Compositional Generalization in Visual Question Answering IJCAI2025
【速读】:该论文旨在解决视觉问答(VQA)任务中多源组合泛化(multi-sourced compositional generalization, MSCG)能力不足的问题,即模型在面对由不同模态原始元素组成的新型组合时表现不佳。解决方案的关键在于提出一种检索增强的训练框架,通过学习不同模态下语义等价原始元素的统一表示,提升模型对多源组合泛化的理解能力。具体而言,该框架在训练过程中为每个原始元素检索语义等价的其他模态原始元素,并将检索到的特征与原始特征进行融合,以优化模型对同一语义原始元素在不同模态中一致性表示的学习。
链接: https://arxiv.org/abs/2505.23045
作者: Chuanhao Li,Wenbo Ye,Zhen Li,Yuwei Wu,Yunde Jia
机构: Beijing Institute of Technology (北京理工大学); Shenzhen MSU-BIT University (深圳北理莫斯科大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IJCAI 2025
Abstract:Compositional generalization is the ability to generalize novel compositions from seen primitives, and has received much attention in vision-and-language (V&L) research recently. Due to the multi-modal nature of V&L tasks, the primitives composing a composition come from different modalities, resulting in multi-sourced novel compositions. However, the generalization ability over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. Experimental results demonstrate the effectiveness of the proposed framework. We release GQA-MSCG at this https URL.
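检索到的跨模态特征与原始原始元素特征的聚合步骤,可用如下纯 Python 草图示意(加权平均只是众多聚合方式之一,权重与数值为假设;并非论文实现):

```python
def aggregate(primitive, retrieved, weight=0.5):
    # 先对检索到的(另一模态中语义等价的)原始元素特征取均值,
    # 再与原特征加权融合,促使同一语义在不同模态下表示一致
    n = len(retrieved)
    dim = len(primitive)
    mean_ret = [sum(f[i] for f in retrieved) / n for i in range(dim)]
    return [(1 - weight) * p + weight * m for p, m in zip(primitive, mean_ret)]

visual = [1.0, 0.0]               # 虚构的视觉侧原始元素特征
texts = [[0.0, 1.0], [0.0, 1.0]]  # 检索到的文本侧语义等价特征
refined = aggregate(visual, texts)
```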
zh
[CV-125] SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images
【速读】:该论文旨在解决现有3D重建方法在结合语义信息时面临的表达能力下降和冗余内存开销的问题。其关键解决方案是提出\textbf{SpatialSplat},该框架通过生成冗余感知的高斯分布并利用双场语义表示来优化场景表示。具体而言,通过将语义表示分解为粗粒度未压缩特征场和细粒度低维特征场,以及引入选择性高斯机制以保留场景中必要的高斯分布,从而有效减少冗余并提升语义表达能力。
链接: https://arxiv.org/abs/2505.23044
作者: Yu Sheng,Jiajun Deng,Xinran Zhang,Yu Zhang,Bei Hua,Yanyong Zhang,Jianmin Ji
机构: University of Science and Technology of China (中国科学技术大学); The University of Adelaide (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A major breakthrough in 3D reconstruction is the feedforward paradigm to generate pixel-wise 3D points or Gaussian primitives from sparse, unposed images. To further incorporate semantics while avoiding the significant memory and storage costs of high-dimensional semantic features, existing methods extend this paradigm by associating each primitive with a compressed semantic feature vector. However, these methods have two major limitations: (a) the naively compressed feature compromises expressiveness, affecting the model’s ability to capture fine-grained semantics, and (b) the pixel-wise primitive prediction introduces redundancy in overlapping areas, causing unnecessary memory overhead. To this end, we introduce SpatialSplat, a feedforward framework that produces redundancy-aware Gaussians and capitalizes on a dual-field semantic representation. Particularly, with the insight that primitives within the same instance exhibit high semantic consistency, we decompose the semantic representation into a coarse feature field that encodes uncompressed semantics with minimal primitives, and a fine-grained yet low-dimensional feature field that captures detailed inter-instance relationships. Moreover, we propose a selective Gaussian mechanism, which retains only essential Gaussians in the scene, effectively eliminating redundant primitives. Our proposed SpatialSplat learns accurate semantic information and detailed instance priors with more compact 3D Gaussians, making semantic 3D reconstruction more applicable. We conduct extensive experiments to evaluate our method, demonstrating a remarkable 60% reduction in scene representation parameters while achieving superior performance over state-of-the-art methods. The code will be made available for future investigation.
zh
[CV-126] Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
【速读】:该论文试图解决统一视觉-语言模型(Unified Vision-Language Models, VLMs)中理解与生成任务之间相互增强机制尚未被充分探索的问题。其解决方案的关键在于通过设计一个贴近真实场景的数据集,系统性地研究统一VLMs在理解与生成任务间的泛化能力,并验证混合训练数据对任务间相互促进的作用。研究发现,统一架构下理解与生成任务在多种模型结构中均表现出相互受益的效果,且这种效益可随数据量增加而提升,同时强调了多模态输入与输出空间对齐的重要性以及生成任务知识向理解任务的跨任务迁移。
链接: https://arxiv.org/abs/2505.23043
作者: Jihai Zhang,Tianle Li,Linjie Li,Zhengyuan Yang,Yu Cheng
机构: The Chinese University of Hong Kong(香港中文大学); Microsoft(微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in unified vision-language models (VLMs), which integrate both visual understanding and generation capabilities, have attracted significant attention. The underlying hypothesis is that a unified architecture with mixed training on both understanding and generation tasks can enable mutual enhancement between understanding and generation. However, this hypothesis remains underexplored in prior works on unified VLMs. To address this gap, this paper systematically investigates the generalization across understanding and generation tasks in unified VLMs. Specifically, we design a dataset closely aligned with real-world scenarios to facilitate extensive experiments and quantitative evaluations. We evaluate multiple unified VLM architectures to validate our findings. Our key findings are as follows. First, unified VLMs trained with mixed data exhibit mutual benefits in understanding and generation tasks across various architectures, and these mutual benefits can scale up with increased data. Second, better alignment between multimodal input and output spaces will lead to better generalization. Third, the knowledge acquired during generation tasks can transfer to understanding tasks, and this cross-task generalization occurs within the base language model, beyond modality adapters. Our findings underscore the critical necessity of unifying understanding and generation in VLMs, offering valuable insights for the design and optimization of unified VLMs.
zh
[CV-127] Deep Modeling and Optimization of Medical Image Classification
【速读】:该论文旨在解决医学领域中深度模型因数据隐私问题难以进行大规模微调,以及对比语言图像预训练(CLIP)在医学领域潜力未被充分挖掘的问题。其解决方案的关键在于引入一种基于四个卷积神经网络(CNN)和八个视觉Transformer(ViT)的CLIP变体以实现脑癌和皮肤癌分类,结合12种深度模型与两种联邦学习技术以保护数据隐私,并结合传统机器学习方法提升模型在未知领域数据中的泛化能力。
链接: https://arxiv.org/abs/2505.23040
作者: Yihang Wu,Muhammad Owais,Reem Kateb,Ahmad Chaddad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ISBI2025
Abstract:Deep models, such as convolutional neural networks (CNNs) and vision transformers (ViT), demonstrate remarkable performance in image classification. However, those deep models require large amounts of data for fine-tuning, which is impractical in the medical domain due to data privacy issues. Furthermore, despite the feasible performance of contrastive language image pre-training (CLIP) in the natural domain, the potential of CLIP has not been fully investigated in the medical field. To face these challenges, we considered three scenarios: 1) we introduce a novel CLIP variant using four CNNs and eight ViTs as image encoders for the classification of brain cancer and skin cancer, 2) we combine 12 deep models with two federated learning techniques to protect data privacy, and 3) we involve traditional machine learning (ML) methods to improve the generalization ability of those deep models in unseen domain data. The experimental results indicate that maxvit shows the highest averaged (AVG) test metrics (AVG = 87.03%) on the HAM10000 dataset with multimodal learning, while convnext_l achieves a remarkable test F1-score of 83.98%, compared to 81.33% for swin_b, in the FL model. Furthermore, the use of a support vector machine (SVM) can improve the overall test metrics by an AVG of ~2% for the swin transformer series on ISIC2018. Our codes are available at this https URL.
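论文中用于保护数据隐私的联邦学习,可以用最基础的 FedAvg 聚合来示意(纯 Python 草图,参数与样本量均为虚构;论文实际采用的联邦技术未必是 FedAvg):

```python
def fed_avg(client_weights, client_sizes):
    # 按各客户端样本量加权平均模型参数:
    # 原始医学图像不出本地,只交换参数,从而保护数据隐私
    total = sum(client_sizes)
    return {k: sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
            for k in client_weights[0]}

# 两家医院各自训练后的(虚构)单参数模型
hospital_a = {"w": 1.0}
hospital_b = {"w": 3.0}
global_model = fed_avg([hospital_a, hospital_b], [1, 3])
```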
zh
[CV-128] Towards Privacy-Preserving Fine-Grained Visual Classification via Hierarchical Learning from Label Proportions
【速读】:该论文试图解决在隐私敏感场景下,如医学图像分析中,传统细粒度视觉分类(Fine-Grained Visual Classification, FGVC)方法依赖实例级标签导致的不可行性问题。其解决方案的关键在于引入从标签比例学习(Learning from Label Proportions, LLP)范式,通过仅需袋级标签进行高效训练,同时利用细粒度数据集的层次结构,提出一种基于层次细粒度标签比例的学习框架(Learning from Hierarchical Fine-Grained Label Proportions, LHFGLP),结合无约束的层次细粒度稀疏字典学习和层次比例损失,实现特征粒度的逐步优化与分类性能的提升。
链接: https://arxiv.org/abs/2505.23031
作者: Jinyi Chang,Dongliang Chang,Lei Chen,Bingyao Yu,Zhanyu Ma
机构: PRIS, Beijing University of Posts and Telecommunications (PRIS,北京邮电大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures, 5 tables
Abstract:In recent years, Fine-Grained Visual Classification (FGVC) has achieved impressive recognition accuracy, despite minimal inter-class variations. However, existing methods heavily rely on instance-level labels, making them impractical in privacy-sensitive scenarios such as medical image analysis. This paper aims to enable accurate fine-grained recognition without direct access to instance labels. To achieve this, we leverage the Learning from Label Proportions (LLP) paradigm, which requires only bag-level labels for efficient training. Unlike existing LLP-based methods, our framework explicitly exploits the hierarchical nature of fine-grained datasets, enabling progressive feature granularity refinement and improving classification accuracy. We propose Learning from Hierarchical Fine-Grained Label Proportions (LHFGLP), a framework that incorporates Unrolled Hierarchical Fine-Grained Sparse Dictionary Learning, transforming handcrafted iterative approximation into learnable network optimization. Additionally, our proposed Hierarchical Proportion Loss provides hierarchical supervision, further enhancing classification performance. Experiments on three widely-used fine-grained datasets, structured in a bag-based manner, demonstrate that our framework consistently outperforms existing LLP-based methods. We will release our code and datasets to foster further research in privacy-preserving fine-grained classification.
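LLP 设定下只有袋级标签比例可用,其核心监督信号可以写成“预测比例对给定比例的交叉熵”。下面是一个纯 Python 草图(并非论文的层次比例损失,仅示意其平坦版本;数值为虚构):

```python
import math

def proportion_loss(pred_probs, bag_proportions):
    # 把袋内各实例的预测概率取平均得到预测比例,
    # 再与给定的标签比例计算交叉熵——这是 LLP 设定下唯一可用的监督
    n = len(pred_probs)
    k = len(bag_proportions)
    mean_pred = [sum(p[c] for p in pred_probs) / n for c in range(k)]
    eps = 1e-12  # 防止 log(0)
    return -sum(t * math.log(m + eps) for t, m in zip(bag_proportions, mean_pred))

preds = [[0.7, 0.3]] * 4               # 虚构的一个袋内 4 个实例的预测
good = proportion_loss(preds, [0.7, 0.3])  # 预测比例与标签比例一致
bad = proportion_loss(preds, [0.3, 0.7])   # 比例不一致,损失应更大
```

论文的 Hierarchical Proportion Loss 则在从粗到细的每个层级上都施加这类比例监督。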
zh
[CV-129] Diverse Prototypical Ensembles Improve Robustness to Subpopulation Shift ICML2025
【速读】:该论文试图解决子群体分布偏移(subpopulation shift)问题,即训练数据与目标数据之间子群体分布的差异会导致机器学习模型性能显著下降。现有解决方案通常通过重新加权策略修改经验风险最小化以提高泛化能力,但这些方法依赖于对子群体数量和性质的假设以及群体成员标注,而这些信息在许多现实数据集中并不可用。论文的关键解决方案是使用多样化的原型集成(Diverse Prototypical Ensembles, DPEs),通过将标准线性分类层替换为基于原型的分类器混合体,使每个成员在不同特征和样本上进行分类,从而自适应地捕捉子群体相关的风险。
链接: https://arxiv.org/abs/2505.23027
作者: Minh Nguyen Nhat To,Paul F RWilson,Viet Nguyen,Mohamed Harmanani,Michael Cooper,Fahimeh Fooladgar,Purang Abolmaesumi,Parvin Mousavi,Rahul G. Krishnan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2025 Paper
Abstract:Subpopulation shift, characterized by a disparity in subpopulation distribution between the training and target datasets, can significantly degrade the performance of machine learning models. Current solutions to subpopulation shift involve modifying empirical risk minimization with re-weighting strategies to improve generalization. This strategy relies on assumptions about the number and nature of subpopulations and annotations on group membership, which are unavailable for many real-world datasets. Instead, we propose using an ensemble of diverse classifiers to adaptively capture risk associated with subpopulations. Given a feature extractor network, we replace its standard linear classification layer with a mixture of prototypical classifiers, where each member is trained to classify the data while focusing on different features and samples from other members. In empirical evaluation on nine real-world datasets, covering diverse domains and kinds of subpopulation shift, our method of Diverse Prototypical Ensembles (DPEs) often outperforms the prior state-of-the-art in worst-group accuracy. The code is available at this https URL
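原型分类器集成的推断过程可以用如下纯 Python 草图示意(每个成员是一组类别原型,打分取负平方距离;数值为虚构,并非论文实现):

```python
def prototype_logits(feature, prototypes):
    # 到各类原型的负平方距离即为该类得分(越近得分越高)
    return [-sum((f - p) ** 2 for f, p in zip(feature, proto))
            for proto in prototypes]

def ensemble_predict(feature, members):
    # 平均各成员(关注不同特征/样本的原型分类器)的得分后取最大
    k = len(members[0])
    scores = [0.0] * k
    for prototypes in members:
        for c, s in enumerate(prototype_logits(feature, prototypes)):
            scores[c] += s / len(members)
    return max(range(k), key=lambda c: scores[c])

member1 = [[0.0, 0.0], [2.0, 2.0]]  # 成员 1:类 0 / 类 1 的原型
member2 = [[0.1, 0.0], [3.0, 3.0]]  # 成员 2:关注不同特征学到的原型
pred = ensemble_predict([0.0, 0.0], [member1, member2])
```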
zh
[CV-130] Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition
【速读】:该论文旨在解决基于骨架的动作分类中,传统无监督或自监督学习方法过于关注骨架序列的动态特性,而忽视了骨架中运动与静态元素之间复杂交互所蕴含的判别潜力问题。其解决方案的关键在于提出一种新的度量方法——时空关节密度(spatial-temporal joint density, STJD),通过追踪该密度在动作过程中的演变,识别出对分类具有判别性的“主关节”,并基于此设计了一种对比学习策略STJD-CL以及结合重构框架的STJD-MP方法,以提升自监督学习的效果。
链接: https://arxiv.org/abs/2505.23012
作者: Shanaka Ramesh Gunasekara,Wanqing Li,Philip Ogunbona,Jack Yang
机构: University of Wollongong (伍伦贡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Traditional approaches in unsupervised or self-supervised learning for skeleton-based action classification have concentrated predominantly on the dynamic aspects of skeletal sequences. Yet, the intricate interaction between the moving and static elements of the skeleton presents a rarely tapped discriminative potential for action classification. This paper introduces a novel measurement, referred to as spatial-temporal joint density (STJD), to quantify such interaction. Tracking the evolution of this density throughout an action can effectively identify a subset of discriminative moving and/or static joints termed “prime joints” to steer self-supervised learning. A new contrastive learning strategy named STJD-CL is proposed to align the representation of a skeleton sequence with that of its prime joints while simultaneously contrasting the representations of prime and nonprime joints. In addition, a method called STJD-MP is developed by integrating it with a reconstruction-based framework for more effective learning. Experimental evaluations on the NTU RGB+D 60, NTU RGB+D 120, and PKUMMD datasets in various downstream tasks demonstrate that the proposed STJD-CL and STJD-MP improve performance, particularly by 3.5 and 3.6 percentage points over the state-of-the-art contrastive methods on the NTU RGB+D 120 dataset using X-sub and X-set evaluations, respectively.
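以“关节位置的时间方差”作为时空关节密度演化的粗略替代,可以示意“主关节”的筛选过程(纯 Python 草图,并非论文的 STJD 定义;数据为虚构):

```python
def temporal_variance(sequence):
    # sequence[t][j]:第 t 帧第 j 个关节的(一维)位置
    # 对每个关节计算随时间的方差,作为其运动显著性的粗略度量
    t, j = len(sequence), len(sequence[0])
    out = []
    for jj in range(j):
        xs = [sequence[f][jj] for f in range(t)]
        mean = sum(xs) / t
        out.append(sum((x - mean) ** 2 for x in xs) / t)
    return out

def prime_joints(sequence, top_k=1):
    # 取方差最大的 top_k 个关节作为"主关节"
    var = temporal_variance(sequence)
    return sorted(range(len(var)), key=lambda i: -var[i])[:top_k]

seq = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]  # 关节 0 静止,关节 1 运动
```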
zh
[CV-131] SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model
【速读】:该论文旨在解决远程感知图像超分辨率(RSISR)中因忽略高阶语义知识而导致的语义不一致伪影问题。现有方法主要关注像素空间中的低级特征,而未能充分利用场景的高层语义信息,从而影响重建质量。解决方案的关键在于引入语义引导机制,通过视觉-语言模型(VLM)提取输入图像的语义知识,并设计语义特征提取模块(SFEM)、语义定位模块(SLM)和可学习调制模块(LMM),将高阶场景理解融入超分辨率流程,以提升重建结果的语义一致性与质量。
链接: https://arxiv.org/abs/2505.23010
作者: Bowen Chen,Keyan Chen,Mohan Yang,Zhengxia Zou,Zhenwei Shi
机构: Beihang University (北京航空航天大学); Department of Aerospace Intelligent Science and Technology (航天智能科学与技术系); School of Astronautics (宇航学院); Key Laboratory of Spacecraft Design Optimization and Dynamic Simulation Technologies (航天器设计优化与动态仿真技术重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. However, due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition. Existing RSISR methods primarily focus on low-level characteristics in pixel space, while neglecting the high-level understanding of remote sensing scenes. This may lead to semantically inconsistent artifacts in the reconstructed results. Motivated by this observation, our work aims to explore the role of high-level semantic knowledge in improving RSISR performance. We propose a Semantic-Guided Super-Resolution framework, SeG-SR, which leverages Vision-Language Models (VLMs) to extract semantic knowledge from input images and uses it to guide the super resolution (SR) process. Specifically, we first design a Semantic Feature Extraction Module (SFEM) that utilizes a pretrained VLM to extract semantic knowledge from remote sensing images. Next, we propose a Semantic Localization Module (SLM), which derives a series of semantic guidance from the extracted semantic knowledge. Finally, we develop a Learnable Modulation Module (LMM) that uses semantic guidance to modulate the features extracted by the SR network, effectively incorporating high-level scene understanding into the SR pipeline. We validate the effectiveness and generalizability of SeG-SR through extensive experiments: SeG-SR achieves state-of-the-art performance on two datasets and consistently delivers performance improvements across various SR architectures. Codes can be found at this https URL.
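LMM 用语义引导调制 SR 特征的做法,可以借 FiLM 式的逐通道缩放-平移来示意(纯 Python 草图;gamma/beta 在论文中由语义引导学习得到,此处为虚构常数):

```python
def modulate(features, gamma, beta):
    # FiLM 式调制:y = gamma * x + beta,
    # gamma/beta 由语义引导生成(此处手写),逐通道作用于 SR 特征
    return [g * f + b for f, g, b in zip(features, gamma, beta)]

feats = [1.0, 2.0]  # 虚构的两个通道的 SR 特征
out = modulate(feats, gamma=[2.0, 0.5], beta=[0.0, 1.0])
```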
zh
[CV-132] Number of Clusters in a Dataset: A Regularized K-means Approach
【速读】:该论文试图解决在无标签数据集中确定有意义聚类数量的问题,这是许多应用中的关键任务。其解决方案的关键在于对正则化k-means算法中的关键超参数λ进行理论上的严格边界推导,假设聚类为理想聚类(ideal clusters),即d维球体且半径相同,这些理想聚类可作为k-means聚类的近似。通过分析添加性和乘性正则化下的k-means算法,论文旨在减少多解性带来的模糊性,并验证正则化k-means在聚类偏离理想假设时的性能表现。
链接: https://arxiv.org/abs/2505.22991
作者: Behzad Kamgar-Parsi,Behrooz Kamgar-Parsi
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 14 figures. arXiv admin note: substantial text overlap with arXiv:1911.06741
Abstract:Finding the number of meaningful clusters in an unlabeled dataset is important in many applications. The regularized k-means algorithm is a possible approach frequently used to find the correct number of distinct clusters in datasets. The most common formulation of the regularization function is the additive linear term λk, where k is the number of clusters and λ a positive coefficient. Currently, there are no principled guidelines for setting a value for the critical hyperparameter λ. In this paper, we derive rigorous bounds for λ assuming clusters are ideal. Ideal clusters (defined as d-dimensional spheres with identical radii) are close proxies for k-means clusters (d-dimensional spherically symmetric distributions with identical standard deviations). Experiments show that the k-means algorithm with an additive regularizer often yields multiple solutions. Thus, we also analyze the k-means algorithm with a multiplicative regularizer. The consensus among k-means solutions with additive and multiplicative regularizations reduces the ambiguity of multiple solutions in certain cases. We also present selected experiments that demonstrate performance of the regularized k-means algorithms as clusters deviate from the ideal assumption.
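带加性正则项的模型选择可以示意为:对每个候选 k 计算“聚类代价 + λk”,取最小者(纯 Python 草图;各 k 对应的中心假定已由 k-means 得到,数据与 λ 取值均为虚构):

```python
def kmeans_cost(points, centers):
    # 每个点到最近中心的平方距离之和(k-means 目标)
    return sum(min(sum((p - c) ** 2 for p, c in zip(pt, ct)) for ct in centers)
               for pt in points)

def best_k(points, candidate_centers, lam):
    # 加性正则:score(k) = cost(k) + λ·k,取得分最小的 k
    scores = {len(c): kmeans_cost(points, c) + lam * len(c)
              for c in candidate_centers}
    return min(scores, key=scores.get)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
candidates = [
    [(5.0, 5.5)],                                          # k = 1
    [(0.0, 0.5), (10.0, 10.5)],                            # k = 2
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)],  # k = 4
]
```

λ 过小会偏向大 k(过分割),过大则偏向小 k(欠分割);论文推导的正是 λ 的合理取值范围。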
zh
[CV-133] MOVi: Training-free Text-conditioned Multi-Object Video Generation
【速读】:该论文旨在解决扩散模型在生成多物体视频时存在的问题,包括难以准确捕捉复杂物体交互、将部分物体误认为静态背景以及无法按照提示生成多个独立物体导致的生成错误或特征混合。其解决方案的关键在于提出一种无需训练的方法,利用扩散模型和大语言模型(LLM)的开放世界知识,通过LLM作为“导演”控制物体轨迹,并结合噪声重新初始化实现对真实运动的精确控制,同时通过操纵注意力机制增强物体特定特征和运动模式的捕捉,减少跨物体特征干扰。
链接: https://arxiv.org/abs/2505.22980
作者: Aimon Rahman,Jiang Liu,Ze Wang,Ximeng Sun,Jialian Wu,Xiaodong Yu,Yusheng Su,Vishal M. Patel,Zicheng Liu,Emad Barsoum
机构: Johns Hopkins University (约翰霍普金斯大学); Advanced Micro Devices (先进微设备公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in diffusion-based text-to-video (T2V) models have demonstrated remarkable progress, but these models still face challenges in generating videos with multiple objects. Most models struggle with accurately capturing complex object interactions, often treating some objects as static background elements and limiting their movement. In addition, they often fail to generate multiple distinct objects as specified in the prompt, resulting in incorrect generations or mixed features across objects. In this paper, we present a novel training-free approach for multi-object video generation that leverages the open-world knowledge of diffusion models and large language models (LLMs). We use an LLM as the “director” of object trajectories, and apply the trajectories through noise re-initialization to achieve precise control of realistic movements. We further refine the generation process by manipulating the attention mechanism to better capture object-specific features and motion patterns, and prevent cross-object feature interference. Extensive experiments validate the effectiveness of our training-free approach in significantly enhancing the multi-object generation capabilities of existing video diffusion models, resulting in a 42% absolute improvement in motion dynamics and object generation accuracy, while also maintaining high fidelity and motion smoothness.
zh
[CV-134] Pose-free 3D Gaussian splatting via shape-ray estimation ICIP2025
【速读】:该论文试图解决在未知场景中使用可泛化的3D高斯点云渲染时,因相机位姿(camera poses)不准确导致的几何错位问题。其解决方案的关键在于提出SHARE框架,该框架通过联合形状与相机射线估计实现无位姿的高斯点云渲染,构建了一个姿态感知的规范体积表示,以融合多视角信息并减少由错误位姿引起的对齐误差,同时通过锚点对齐的高斯预测优化局部几何结构,提升高斯分布的精确性。
链接: https://arxiv.org/abs/2505.22978
作者: Youngju Na,Taeyeon Kim,Jumin Lee,Kyu Beom Han,Woo Jae Kim,Sung-eui Yoon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICIP 2025
Abstract:While generalizable 3D Gaussian splatting enables efficient, high-quality rendering of unseen scenes, it heavily depends on precise camera poses for accurate geometry. In real-world scenarios, obtaining accurate poses is challenging, leading to noisy pose estimates and geometric misalignments. To address this, we introduce SHARE, a pose-free, feed-forward Gaussian splatting framework that overcomes these ambiguities by joint shape and camera ray estimation. Instead of relying on explicit 3D transformations, SHARE builds a pose-aware canonical volume representation that seamlessly integrates multi-view information, reducing misalignment caused by inaccurate pose estimates. Additionally, anchor-aligned Gaussian prediction enhances scene reconstruction by refining local geometry around coarse anchors, allowing for more precise Gaussian placement. Extensive experiments on diverse real-world datasets show that our method achieves robust performance in pose-free generalizable Gaussian splatting.
zh
[CV-135] HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions
【速读】:该论文旨在解决在复杂人体运动(Hypermotion)条件下,基于姿态引导的人体图像动画生成中存在的结构稳定性与外观一致性不足的问题,以及缺乏高质量的评估基准。其解决方案的关键在于引入了\textbf{Open-HyperMotionX Dataset}和\textbf{HyperMotionX Bench},提供了高质量的人体姿态标注和精选视频片段,用于评估和优化相关模型,并提出了一种基于DiT的视频生成基线方法,结合空间低频增强的RoPE模块,通过引入可学习的频率缩放机制,有效提升了动态人体运动序列的结构稳定性和外观一致性。
链接: https://arxiv.org/abs/2505.22977
作者: Shuolin Xu,Siming Zheng,Ziyi Wang,HC Yu,Jinwei Chen,Huaqi Zhang,Bo Li,Peng-Tao Jiang
机构: vivo Mobile Communication Co., Ltd (vivo移动通信有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 7 figures
Abstract:Recent advances in diffusion models have significantly improved conditional video generation, particularly in the pose-guided human image animation task. Although existing methods are capable of generating high-fidelity and time-consistent animation sequences in regular motions and static scenes, there are still obvious limitations when facing complex human body motions (Hypermotion) that contain highly dynamic, non-standard motions, and the lack of a high-quality benchmark for evaluation of complex human motion animations. To address this challenge, we introduce the Open-HyperMotionX Dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips for evaluating and improving pose-guided human image animation models under complex human motion conditions. Furthermore, we propose a simple yet powerful DiT-based video generation baseline and design spatial low-frequency enhanced RoPE, a novel module that selectively enhances low-frequency spatial feature modeling by introducing learnable frequency scaling. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences. Extensive experiments demonstrate the effectiveness of our dataset and proposed approach in advancing the generation quality of complex human motion image animations. Code and dataset will be made publicly available.
zh
[CV-136] Toward Memory-Aided World Models: Benchmarking via Spatial Consistency
【速读】:该论文试图解决世界模型中空间一致性(spatial consistency)不足的问题,这一问题限制了模型在视觉生成以及下游任务如模拟和规划中的可靠性。解决方案的关键在于设计一个能够保留长时程观测信息并构建显式或隐式内部空间表征的内存模块(memory module)。为促进此类模块的发展,作者构建了一个数据集和相应的基准测试,通过在Minecraft开放世界环境中采样150个不同位置,收集了约250小时(2000万帧)基于循环导航的视频数据,并采用序列长度的课程设计,使模型能够在日益复杂的导航轨迹上学习空间一致性。
链接: https://arxiv.org/abs/2505.22976
作者: Kewei Lian,Shaofei Cai,Yilun Du,Yitao Liang
机构: Peking Univeristy (北京大学); MIT (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. Such a model enables high-quality visual generation, and also ensures the reliability of world models for downstream tasks such as simulation and planning. Designing a memory module is a crucial component for addressing spatial consistency: such a model must not only retain long-horizon observational information, but also enable the construction of explicit or implicit internal spatial representations. However, there is no dataset designed to promote the development of memory modules by explicitly enforcing spatial consistency constraints. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long-range spatial consistency. To bridge this gap, we construct a dataset and corresponding benchmark by sampling 150 distinct locations within the open-world environment of Minecraft, collecting about 250 hours (20 million frames) of loop-based navigation videos with actions. Our dataset follows a curriculum design of sequence lengths, allowing models to learn spatial consistency on increasingly complex navigation trajectories. Furthermore, our data collection pipeline is easily extensible to new Minecraft environments and modules. Four representative world model baselines are evaluated on our benchmark. Dataset, benchmark, and code are open-sourced to support future research.
zh
[CV-137] iHDR: Iterative HDR Imaging with Arbitrary Number of Exposures ICIP2025
【Quick Read】: This paper addresses the limitation that conventional HDR imaging methods assume a fixed number of input frames (e.g., three) and therefore cannot adapt to a varying number of exposures. The key to the solution is iHDR, an iterative fusion framework comprising a ghost-free dual-input HDR fusion network (DiHDR) and a physics-based domain mapping network (ToneNet). DiHDR estimates an intermediate HDR image from a pair of inputs, while ToneNet maps it back to the nonlinear domain to serve as the reference input for the next pairwise fusion; this is iterated until all input frames have been used.
Link: https://arxiv.org/abs/2505.22971
Authors: Yu Yuan,Yiheng Chi,Xingguang Zhang,Stanley Chan
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in IEEE ICIP 2025
Abstract:High dynamic range (HDR) imaging aims to obtain a high-quality HDR image by fusing information from multiple low dynamic range (LDR) images. Numerous learning-based HDR imaging methods have been proposed to achieve this for static and dynamic scenes. However, their architectures are mostly tailored for a fixed number (e.g., three) of inputs and, therefore, cannot apply directly to situations beyond the pre-defined limited scope. To address this issue, we propose a novel framework, iHDR, for iterative fusion, which comprises a ghost-free Dual-input HDR fusion network (DiHDR) and a physics-based domain mapping network (ToneNet). DiHDR leverages a pair of inputs to estimate an intermediate HDR image, while ToneNet maps it back to the nonlinear domain and serves as the reference input for the next pairwise fusion. This process is iteratively executed until all input frames are utilized. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed method as compared to existing state-of-the-art HDR deghosting approaches given flexible numbers of input frames.
zh
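The pairwise control flow of the iterative fusion can be sketched as below. This is only an illustration of the loop structure described in the abstract, not the paper's networks: `dihdr_fuse` and `tonenet_map` are hypothetical stand-ins (a linear average and a gamma curve) for the learned DiHDR and ToneNet modules.

```python
import numpy as np

def dihdr_fuse(ref_ldr, ldr):
    # Stand-in for the DiHDR network: simply average the two exposures
    # in linear space as a placeholder for learned pairwise fusion.
    return (ref_ldr.astype(np.float64) + ldr.astype(np.float64)) / 2.0

def tonenet_map(hdr):
    # Stand-in for ToneNet: map the intermediate HDR estimate back to
    # the nonlinear (LDR-like) domain with a simple gamma curve.
    return np.clip(hdr, 0.0, 1.0) ** (1.0 / 2.2)

def ihdr_fuse(frames):
    # Iterative pairwise fusion: fuse the running reference with each
    # remaining exposure until all inputs are consumed.
    ref = frames[0]
    hdr = ref
    for ldr in frames[1:]:
        hdr = dihdr_fuse(ref, ldr)   # intermediate HDR estimate
        ref = tonenet_map(hdr)       # becomes the reference for the next pair
    return hdr

frames = [np.full((4, 4), v) for v in (0.2, 0.5, 0.8)]
result = ihdr_fuse(frames)
```

Because the loop consumes one frame per iteration, the same code accepts two, three, or any larger number of exposures, which is the flexibility the framework is built for.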
[CV-138] ATI: Any Trajectory Instruction for Controllable Video Generation
【Quick Read】: This paper tackles unified motion control in video generation, i.e., how to jointly control camera movement, object-level translation, and fine-grained local motion during generation. Whereas prior methods handle these motion types with separate modules or task-specific designs, the key to this solution is projecting user-defined trajectories into the latent space of a pre-trained image-to-video generation model via a lightweight motion injector, giving unified control over motion and producing temporally consistent, semantically aligned motion sequences.
Link: https://arxiv.org/abs/2505.22944
Authors: Angtian Wang,Haibin Huang,Jacob Zhiyuan Fang,Yiding Yang,Chongyang Ma
Institutions: ByteDance Intelligent Creation (字节跳动智能创作)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion using trajectory-based inputs. In contrast to prior methods that address these motion types through separate modules or task-specific designs, our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models via a lightweight motion injector. Users can specify keypoints and their motion paths to control localized deformations, entire object motion, virtual camera dynamics, or combinations of these. The injected trajectory signals guide the generative process to produce temporally consistent and semantically aligned motion sequences. Our framework demonstrates superior performance across multiple video motion control tasks, including stylized motion effects (e.g., motion brushes), dynamic viewpoint changes, and precise local motion manipulation. Experiments show that our method provides significantly better controllability and visual quality compared to prior approaches and commercial solutions, while remaining broadly compatible with various state-of-the-art video generation backbones. Project page: this https URL.
zh
[CV-139] Fast Isotropic Median Filtering FAST
【Quick Read】: This paper addresses the practical limitations of traditional median filtering algorithms in image processing: the bit depth of the image data, the size of the filter kernel, and the kernel shape itself. Known algorithms are either inefficient for non-square kernels or produce streaky cross-hatching artifacts. The key to the proposed method is that it operates efficiently on data of arbitrary bit depth, with arbitrary kernel sizes and arbitrary convex kernel shapes, including circular ones, thereby overcoming all of the above limitations.
Link: https://arxiv.org/abs/2505.22938
Authors: Ben Weiss
Institutions: Google Research (谷歌研究院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Data Structures and Algorithms (cs.DS)
Comments: Supplemental material: this https URL
Abstract:Median filtering is a cornerstone of computational image processing. It provides an effective means of image smoothing, with minimal blurring or softening of edges, invariance to monotonic transformations such as gamma adjustment, and robustness to noise and outliers. However, known algorithms have all suffered from practical limitations: the bit depth of the image data, the size of the filter kernel, or the kernel shape itself. Square-kernel implementations tend to produce streaky cross-hatching artifacts, and nearly all known efficient algorithms are in practice limited to square kernels. We present for the first time a method that overcomes all of these limitations. Our method operates efficiently on arbitrary bit-depth data, arbitrary kernel sizes, and arbitrary convex kernel shapes, including circular shapes.
zh
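As a point of reference for what the operator computes, here is a naive brute-force median filter over an arbitrary kernel shape given as a boolean mask; the paper's contribution is a fast algorithm for this operator, which this sketch does not attempt to reproduce.

```python
import numpy as np

def median_filter(img, mask):
    # Naive reference median filter for an arbitrary kernel shape
    # specified as a boolean mask. O(H*W*K) per pixel; the paper's
    # algorithm computes the same operator efficiently.
    kh, kw = mask.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="edge")
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            window = padded[y:y + kh, x:x + kw]
            out[y, x] = np.median(window[mask])
    return out

def disk_mask(radius):
    # Circular (convex) kernel shape, one of the cases the paper handles
    # that square-kernel algorithms do not.
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    return yy**2 + xx**2 <= radius**2

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(16, 16)).astype(np.float64)
mask = disk_mask(2)
filtered = median_filter(img, mask)
```

Because the radius-2 disk contains an odd number of samples (13), the median is always an actual pixel value, which is what makes the filter invariant under monotonic transformations such as gamma adjustment, as the abstract notes.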
[CV-140] Leveraging Diffusion Models for Synthetic Data Augmentation in Protein Subcellular Localization Classification
【Quick Read】: This paper investigates whether synthetic images generated by diffusion models can enhance multi-label classification of protein subcellular localization. The key to the approach is a simplified class-conditional denoising diffusion probabilistic model (DDPM) that produces label-consistent samples, which are combined with real data via two hybrid training strategies, Mix Loss and Mix Representation. However, the experiments show that although these methods perform well on the validation set, the proposed MixModel generalizes poorly to unseen test data, underscoring the challenge of leveraging synthetic data effectively.
Link: https://arxiv.org/abs/2505.22926
Authors: Sylvey Lin,Zhi-Yi Cao
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We investigate whether synthetic images generated by diffusion models can enhance multi-label classification of protein subcellular localization. Specifically, we implement a simplified class-conditional denoising diffusion probabilistic model (DDPM) to produce label-consistent samples and explore their integration with real data via two hybrid training strategies: Mix Loss and Mix Representation. While these approaches yield promising validation performance, our proposed MixModel exhibits poor generalization to unseen test data, underscoring the challenges of leveraging synthetic data effectively. In contrast, baseline classifiers built on ResNet backbones with conventional loss functions demonstrate greater stability and test-time performance. Our findings highlight the importance of realistic data generation and robust supervision when incorporating generative augmentation into biomedical image classification.
zh
[CV-141] Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape
【Quick Read】: This paper targets the high computational cost of attention in diffusion models: for high-resolution and long-video generation, attention complexity grows quadratically with resolution and video length and becomes the performance bottleneck. The key to the solution, Re-ttention, is to exploit the temporal redundancy of diffusion models to overcome the probabilistic normalization shift in the attention mechanism, enabling attention at extremely high sparsity while preserving visual quality. By reshaping attention scores based on the prior softmax distribution history, the method uses as few as 3.1% of the tokens at inference, substantially improving efficiency and reducing latency.
Link: https://arxiv.org/abs/2505.22918
Authors: Ruichen Chen,Keith G. Mills,Liyao Jiang,Chao Gao,Di Niu
Institutions: University of Alberta (阿尔伯塔大学); Huawei Technologies (华为技术)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism, where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference. Further, we measure latency to show that our method can attain over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code available online here: this https URL
zh
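The "probabilistic normalization shift" can be illustrated in a few lines of NumPy: a softmax taken over only the kept top-k scores sums to 1 by construction, yet under full attention those keys carry only part of the probability mass; rescaling by that mass recovers the full-attention weights on the kept keys. This toy example only demonstrates the statistic involved; how Re-ttention estimates that mass from the softmax distribution history across diffusion steps is not reproduced here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# One query row of attention scores against 16 keys.
rng = np.random.default_rng(1)
scores = rng.normal(size=16)
full_p = softmax(scores)             # full quadratic attention weights

# Keep only the top-4 keys (75% sparsity).
idx = np.argsort(scores)[-4:]
sparse_p = softmax(scores[idx])      # softmax renormalized over the kept subset

# The shift: sparse_p sums to 1, but under full attention the kept keys
# carry only a fraction of the probability mass.
kept_mass = full_p[idx].sum()

# Rescaling the subset softmax by that mass recovers the full-attention
# weights on the kept keys exactly.
reshaped_p = sparse_p * kept_mass
```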
[CV-142] cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning
【Quick Read】: This paper addresses the limited generalizability and robustness of traditional computer-aided design (CAD) reconstruction methods, which rely on a single input modality such as point clouds, images, or text. The key to the solution is a multi-modal CAD reconstruction model that processes all three input modalities simultaneously and, borrowing the training paradigm of large language models (LLMs), adopts a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning with online feedback, where online RL algorithms such as Group Relative Preference Optimization (GRPO) deliver significant gains.
Link: https://arxiv.org/abs/2505.22914
Authors: Maksim Kolodiazhnyi,Denis Tarasov,Dmitrii Zhemchuzhnikov,Alexander Nikulin,Ilya Zisman,Anna Vorontsova,Anton Konushin,Vladislav Kurenkov,Danila Rukhovich
Institutions: AIRI Institute; Lomonosov Moscow State University; ETH Zurich; Innopolis University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLM), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback, obtained programmatically. Furthermore, we are the first to explore RL fine-tuning of LLMs for CAD tasks, demonstrating that online RL algorithms such as Group Relative Preference Optimization (GRPO) outperform offline alternatives. On the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets new state-of-the-art on three challenging datasets, including a real-world one.
zh
[CV-143] Hierarchical Material Recognition from Local Appearance
【Quick Read】: This paper addresses hierarchical material recognition, in particular classifying materials from their local appearance. The key to the solution is a taxonomy of materials arranged by their physical traits, together with a graph attention network approach that exploits the taxonomic proximity between classes for effective recognition. The work also contributes a diverse, in-the-wild dataset of images and depth maps to support training and evaluation.
Link: https://arxiv.org/abs/2505.22911
Authors: Matthew Beveridge,Shree K. Nayar
Institutions: Columbia University (哥伦比亚大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce a taxonomy of materials for hierarchical recognition from local appearance. Our taxonomy is motivated by vision applications and is arranged according to the physical traits of materials. We contribute a diverse, in-the-wild dataset with images and depth maps of the taxonomy classes. Utilizing the taxonomy and dataset, we present a method for hierarchical material recognition based on graph attention networks. Our model leverages the taxonomic proximity between classes and achieves state-of-the-art performance. We demonstrate the model’s potential to generalize to adverse, real-world imaging conditions, and that novel views rendered using the depth maps can enhance this capability. Finally, we show the model’s capacity to rapidly learn new materials in a few-shot learning setting.
zh
[CV-144] 3DGS Compression with Sparsity-guided Hierarchical Transform Coding
【Quick Read】: This paper tackles two problems: the large memory footprint of 3D Gaussian Splatting (3DGS), which incurs high transmission and storage overhead, and the inefficient redundancy removal of existing neural compression methods such as Scaffold-GS, which do not adopt end-to-end optimized analysis-synthesis transforms. The key to the solution is Sparsity-guided Hierarchical Transform Coding (SHTC), the first end-to-end optimized transform coding framework for 3DGS, which jointly optimizes the 3DGS, the transforms, and a lightweight context model so that the transform produces representations approaching the best achievable rate-distortion (R-D) performance. SHTC consists of a base layer that decorrelates the data with the KLT and a sparsity-coded enhancement layer that compresses and reconstructs the KLT residuals via a learned linear transform and the unfolded Iterative Shrinkage-Thresholding Algorithm (ISTA), significantly improving R-D performance with minimal additional parameters and computational overhead.
Link: https://arxiv.org/abs/2505.22908
Authors: Hao Xu,Xiaolin Wu,Xi Zhang
Institutions: McMaster University (麦克马斯特大学); Southwest Jiaotong University (西南交通大学); Nanyang Technological University (南洋理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D Gaussian Splatting (3DGS) has gained popularity for its fast and high-quality rendering, but it has a very large memory footprint incurring high transmission and storage overhead. Recently, some neural compression methods, such as Scaffold-GS, were proposed for 3DGS but they did not adopt the approach of end-to-end optimized analysis-synthesis transforms which has been proven highly effective in neural signal compression. Without an appropriate analysis transform, signal correlations cannot be removed by sparse representation. Without such transforms the only way to remove signal redundancies is through entropy coding driven by a complex and expensive context modeling, which results in slower speed and suboptimal rate-distortion (R-D) performance. To overcome this weakness, we propose Sparsity-guided Hierarchical Transform Coding (SHTC), the first end-to-end optimized transform coding framework for 3DGS compression. SHTC jointly optimizes the 3DGS, transforms and a lightweight context model. This joint optimization enables the transform to produce representations that approach the best R-D performance possible. The SHTC framework consists of a base layer using KLT for data decorrelation, and a sparsity-coded enhancement layer that compresses the KLT residuals to refine the representation. The enhancement encoder learns a linear transform to project high-dimensional inputs into a low-dimensional space, while the decoder unfolds the Iterative Shrinkage-Thresholding Algorithm (ISTA) to reconstruct the residuals. All components are designed to be interpretable, allowing the incorporation of signal priors and fewer parameters than black-box transforms. This novel design significantly improves R-D performance with minimal additional parameters and computational overhead.
zh
[CV-145] CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models ICML2025
【Quick Read】: This paper addresses the limitation that existing pre-trained language models (PLMs) generate protein sequences under a single-modality constraint and struggle to satisfy multiple constraints across different modalities simultaneously. The key to the solution is CFP-Gen, a novel diffusion language model for combinatorial functional protein generation, which enables de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Core innovations include an Annotation-Guided Feature Modulation (AGFM) module that dynamically adjusts the protein feature distribution based on composable functional annotations (e.g., GO terms, IPR domains, and EC numbers), and a Residue-Controlled Functional Encoding (RCFE) module that captures residue-wise interactions for more precise control. Off-the-shelf 3D structure encoders can also be integrated seamlessly to impose geometric constraints.
Link: https://arxiv.org/abs/2505.22869
Authors: Junbo Yin,Chao Zha,Wenjia He,Chencheng Xu,Xin Gao
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: Accepted at ICML 2025. Code is available at this https URL
Abstract:Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates the de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data available at this https URL.
zh
[CV-146] 4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians CVPR2025
【Quick Read】: This paper addresses 4D-SLAM (simultaneous localization and mapping), i.e., jointly reconstructing complex non-rigid surfaces and estimating camera poses, a problem that is ill-posed in its high-dimensional optimization space because of the complex non-rigid motion found in natural environments. The key to the solution is a SLAM method based on Gaussian surface primitives, which exploit depth signals more effectively than conventional 3D Gaussians for accurate surface reconstruction, combined with a warp field represented by a multi-layer perceptron (MLP), a novel camera pose estimation technique, and surface regularization terms to model non-rigid deformation and facilitate spatio-temporal reconstruction. To compensate for the lack of reliable ground truth, the authors also present a new open synthetic dataset for evaluating and advancing modern 4D-SLAM research.
Link: https://arxiv.org/abs/2505.22859
Authors: Hidenobu Matsuki,Gwangbin Bae,Andrew J. Davison
Institutions: Dyson Robotics Laboratory, Imperial College London (戴森机器人实验室,帝国理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Project Page: this https URL
Abstract:We propose the first 4D tracking and mapping method that jointly performs camera localization and non-rigid surface reconstruction via differentiable rendering. Our approach captures 4D scenes from an online stream of color images with depth measurements or predictions by jointly optimizing scene geometry, appearance, dynamics, and camera ego-motion. Although natural environments exhibit complex non-rigid motions, 4D-SLAM remains relatively underexplored due to its inherent challenges; even with 2.5D signals, the problem is ill-posed because of the high dimensionality of the optimization space. To overcome these challenges, we first introduce a SLAM method based on Gaussian surface primitives that leverages depth signals more effectively than 3D Gaussians, thereby achieving accurate surface reconstruction. To further model non-rigid deformations, we employ a warp-field represented by a multi-layer perceptron (MLP) and introduce a novel camera pose estimation technique along with surface regularization terms that facilitate spatio-temporal reconstruction. In addition to these algorithmic challenges, a significant hurdle in 4D SLAM research is the lack of reliable ground truth and evaluation protocols, primarily due to the difficulty of 4D capture using commodity sensors. To address this, we present a novel open synthetic dataset of everyday objects with diverse motions, leveraging large-scale object models and animation modeling. In summary, we open up the modern 4D-SLAM research by introducing a novel method and evaluation protocols grounded in modern vision and rendering techniques.
zh
[CV-147] A Probabilistic Jump-Diffusion Framework for Open-World Egocentric Activity Recognition CVPR2025
【Quick Read】: This paper addresses the core challenge of open-world egocentric activity recognition: inferring unseen activities from an expansive, partially observed search space in unconstrained settings. The key to the solution is ProbRes, a framework based on a jump-diffusion process that navigates the search space efficiently by balancing prior-guided exploration with likelihood-driven exploitation. ProbRes integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions with vision-language models (VLMs), and employs a stochastic search mechanism to locate high-likelihood activity labels while minimizing exhaustive enumeration.
Link: https://arxiv.org/abs/2505.22858
Authors: Sanjoy Kundu,Shanmukha Vellamcheti,Sathyanarayanan N. Aakur
Institutions: Auburn University (奥本大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Extended abstract of arXiv:2504.03948 for CVPR 2025 EgoVis Workshop
Abstract:Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs) and employs a stochastic search mechanism to locate high-likelihood activity labels while minimizing exhaustive enumeration efficiently. We systematically evaluate ProbRes across multiple openness levels (L0–L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding.
zh
[CV-148] IRS: Incremental Relationship-guided Segmentation for Digital Pathology
【Quick Read】: This paper addresses panoptic segmentation in digital pathology under temporally acquired and partially annotated data, while maintaining continual learning capacity for out-of-distribution (OOD) data. The key to the solution is a novel unified Incremental Relationship-guided Segmentation (IRS) learning scheme, whose core innovation lies in realizing a new spatial-temporal OOD continual learning paradigm by mathematically modeling anatomical relationships between existing and newly introduced classes through a simple incremental universal proposition matrix.
Link: https://arxiv.org/abs/2505.22855
Authors: Ruining Deng,Junchao Zhu,Juming Xiong,Can Cui,Tianyuan Yao,Junlin Guo,Siqi Lu,Marilyn Lionts,Mengmeng Yin,Yu Wang,Shilin Zhao,Yucheng Tang,Yihe Yang,Paul Dennis Simonson,Mert R. Sabuncu,Haichun Yang,Yuankai Huo
Institutions: Vanderbilt University (范德比尔特大学); Vanderbilt University Medical Center (范德比尔特大学医学中心); NVIDIA Corp. (NVIDIA公司); Weill Cornell Medicine (威尔康奈尔医学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Continual learning is rapidly emerging as a key focus in computer vision, aiming to develop AI systems capable of continuous improvement, thereby enhancing their value and practicality in diverse real-world applications. In healthcare, continual learning holds great promise for continuously acquired digital pathology data, which is collected in hospitals on a daily basis. However, panoramic segmentation on digital whole slide images (WSIs) presents significant challenges, as it is often infeasible to obtain comprehensive annotations for all potential objects, spanning from coarse structures (e.g., regions and unit objects) to fine structures (e.g., cells). This results in temporally and partially annotated data, posing a major challenge in developing a holistic segmentation framework. Moreover, an ideal segmentation model should incorporate new phenotypes, unseen diseases, and diverse populations, making this task even more complex. In this paper, we introduce a novel and unified Incremental Relationship-guided Segmentation (IRS) learning scheme to address temporally acquired, partially annotated data while maintaining out-of-distribution (OOD) continual learning capacity in digital pathology. The key innovation of IRS lies in its ability to realize a new spatial-temporal OOD continual learning paradigm by mathematically modeling anatomical relationships between existing and newly introduced classes through a simple incremental universal proposition matrix. Experimental results demonstrate that the IRS method effectively handles the multi-scale nature of pathological segmentation, enabling precise kidney segmentation across various structures (regions, units, and cells) as well as OOD disease lesions at multiple magnifications. This capability significantly enhances domain generalization, making IRS a robust approach for real-world digital pathology applications.
zh
[CV-149] CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting
【Quick Read】: This paper addresses style transfer for representations based on Gaussian Splatting (GS), in particular transfers that go beyond simple color changes. The key to the solution is CLIPGaussians, the first unified style transfer framework supporting text- and image-guided stylization across multiple modalities: 2D images, videos, 3D objects, and 4D scenes. The method operates directly on Gaussian primitives and integrates into existing GS pipelines as a plug-in module, without large generative models or retraining from scratch, enabling joint optimization of color and geometry in 3D and 4D settings and temporal coherence in videos while preserving model size.
Link: https://arxiv.org/abs/2505.22854
Authors: Kornel Howil,Joanna Waczyńska,Piotr Borycki,Tadeusz Dziarmaga,Marcin Mazur,Przemysław Spurek
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Gaussian Splatting (GS) has recently emerged as an efficient representation for rendering 3D scenes from 2D images and has been extended to images, videos, and dynamic 4D content. However, applying style transfer to GS-based representations, especially beyond simple color changes, remains challenging. In this work, we introduce CLIPGaussians, the first unified style transfer framework that supports text- and image-guided stylization across multiple modalities: 2D images, videos, 3D objects, and 4D scenes. Our method operates directly on Gaussian primitives and integrates into existing GS pipelines as a plug-in module, without requiring large generative models or retraining from scratch. CLIPGaussians approach enables joint optimization of color and geometry in 3D and 4D settings, and achieves temporal coherence in videos, while preserving a model size. We demonstrate superior style fidelity and consistency across all tasks, validating CLIPGaussians as a universal and efficient solution for multimodal style transfer.
zh
[CV-150] Improving Contrastive Learning for Referring Expression Counting
【Quick Read】: This paper addresses Referring Expression Counting (REC), i.e., counting objects according to fine-grained attributes and contextual differences, where existing methods struggle to distinguish visually similar objects that belong to the same category but correspond to different referring expressions. The key to the solution is C-REX, a novel contrastive learning framework based on supervised contrastive learning and designed to enhance discriminative representation learning. Operating entirely within the image space, it avoids the misalignment issues of image-text contrastive learning, providing a more stable contrastive signal and a significantly larger pool of negative samples for more robust representations.
Link: https://arxiv.org/abs/2505.22850
Authors: Kostas Triaridis,Panagiotis Kaliosis,E-Ro Nguyen,Jingyi Xu,Hieu Le,Dimitris Samaras
Institutions: Stony Brook University (斯托尼布鲁克大学); EPFL (洛桑联邦理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures
Abstract:Object counting has progressed from class-specific models, which count only known categories, to class-agnostic models that generalize to unseen categories. The next challenge is Referring Expression Counting (REC), where the goal is to count objects based on fine-grained attributes and contextual differences. Existing methods struggle with distinguishing visually similar objects that belong to the same category but correspond to different referring expressions. To address this, we propose C-REX, a novel contrastive learning framework, based on supervised contrastive learning, designed to enhance discriminative representation learning. Unlike prior works, C-REX operates entirely within the image space, avoiding the misalignment issues of image-text contrastive learning, thus providing a more stable contrastive signal. It also guarantees a significantly larger pool of negative samples, leading to improved robustness in the learned representations. Moreover, we showcase that our framework is versatile and generic enough to be applied to other similar tasks like class-agnostic counting. To support our approach, we analyze the key components of sota detection-based models and identify that detecting object centroids instead of bounding boxes is the key common factor behind their success in counting tasks. We use this insight to design a simple yet effective detection-based baseline to build upon. Our experiments show that C-REX achieves state-of-the-art results in REC, outperforming previous methods by more than 22% in MAE and more than 10% in RMSE, while also demonstrating strong performance in class-agnostic counting. Code is available at this https URL.
zh
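The supervised contrastive objective that C-REX builds on (SupCon) can be sketched in NumPy as follows. This is the generic loss on a toy synthetic batch, not the paper's implementation; the embeddings and temperature are illustrative choices.

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.1):
    # Supervised contrastive (SupCon) loss: embeddings sharing a label are
    # pulled together; every other sample in the batch acts as a negative.
    labels = np.asarray(labels)
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)  # a sample is never its own pair
    # Log-probability of each candidate under a softmax over the batch.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    has_pos = pos.any(axis=1)
    # Average log-probability over the positives of each anchor.
    per_anchor = (np.where(pos, log_prob, 0.0).sum(axis=1)[has_pos]
                  / pos.sum(axis=1)[has_pos])
    return -per_anchor.mean()

# Toy batch: two tight same-class pairs, classes nearly orthogonal.
feats = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]])
loss_matched = supcon_loss(feats, [0, 0, 1, 1])   # positives are the close pairs
loss_shuffled = supcon_loss(feats, [0, 1, 0, 1])  # positives are the far pairs
```

With correct labels the loss is low because positives are highly similar; shuffling the labels makes distant samples positives and drives the loss up, which is the gradient signal that shapes the learned representation.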
[CV-151] IMTS is Worth Time × Channel Patches: Visual Masked Autoencoders for Irregular Multivariate Time Series Prediction ICML2025
【Quick Read】: This paper addresses the challenges of irregular multivariate time series (IMTS) forecasting, in particular the difficulty of capturing reliable temporal patterns when multi-channel signals are unaligned and missing data is extensive. The key to the solution is VIMTS, a framework that divides the IMTS along the timeline into feature patches at equal intervals, completes them using learned cross-channel dependencies, and then leverages the visual Masked AutoEncoder's (MAE) ability to handle sparse multi-channel data for patch reconstruction, followed by a coarse-to-fine strategy that generates precise predictions from focused contexts; self-supervised learning is also integrated, adapting the visual MAE to IMTS data for improved modeling.
Link: https://arxiv.org/abs/2505.22815
Authors: Zhangyi Hu,Jiemin Wu,Hua Xu,Mingqian Liao,Ninghui Feng,Bo Gao,Songning Lai,Yutao Yue
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2025
Abstract:Irregular Multivariate Time Series (IMTS) forecasting is challenging due to the unaligned nature of multi-channel signals and the prevalence of extensive missing data. Existing methods struggle to capture reliable temporal patterns from such data due to significant missing values. While pre-trained foundation models show potential for addressing these challenges, they are typically designed for Regularly Sampled Time Series (RTS). Motivated by the visual Mask AutoEncoder’s (MAE) powerful capability for modeling sparse multi-channel information and its success in RTS forecasting, we propose VIMTS, a framework adapting Visual MAE for IMTS forecasting. To mitigate the effect of missing values, VIMTS first processes IMTS along the timeline into feature patches at equal intervals. These patches are then complemented using learned cross-channel dependencies. Then it leverages visual MAE’s capability in handling sparse multichannel data for patch reconstruction, followed by a coarse-to-fine technique to generate precise predictions from focused contexts. In addition, we integrate self-supervised learning for improved IMTS modeling by adapting the visual MAE to IMTS data. Extensive experiments demonstrate VIMTS’s superior performance and few-shot capability, advancing the application of visual foundation models in more general time series tasks. Our code is available at this https URL.
zh
[CV-152] VidText: Towards Comprehensive Evaluation for Video Text Understanding
【Quick Read】: This paper addresses the shortcomings of existing benchmarks in handling textual information in videos: most video understanding benchmarks ignore text, while OCR-specific benchmarks are limited to static images and cannot capture the interaction between text and dynamic visual context. The key to the solution is the VidText benchmark, which covers a wide range of real-world scenarios with multilingual support and introduces a hierarchical evaluation framework (video-level, clip-level, and instance-level tasks) together with paired perception-reasoning tasks, enabling a comprehensive assessment of video text understanding.
Link: https://arxiv.org/abs/2505.22810
Authors: Zhoufaran Yang,Yan Shu,Zhifei Yang,Yan Zhang,Yu Li,Keyang Lu,Gangyan Zeng,Shaohui Liu,Yu Zhou,Nicu Sebe
Institutions: HIT (哈尔滨工业大学); UNITN (特伦托大学); PKU (北京大学); IIE, CAS (中国科学院自动化研究所); UCAS (中国科学院大学); BUAA (北京航空航天大学); NJUST (南京理工大学); NKU (南开大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) The benchmark also introduces a set of paired perception reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.
zh
[CV-153] Anomalies by Synthesis: Anomaly Detection using Generative Diffusion Models for Off-Road Navigation ICRA2025
【Quick Read】: This paper addresses how robots navigating off-road and unstructured environments can safely and reliably detect anomalies that are out-of-distribution (OOD) with respect to the training data. The key to the solution is an analysis-by-synthesis approach: without making any assumptions about the nature of the OOD data, a generative diffusion model synthesizes an edited image that removes anomalies while leaving the rest of the image unchanged, and pixel-wise anomalies are detected by analyzing which image segments the diffusion model modified. The core innovation is a novel guided-diffusion inference strategy that analyzes the ideal guidance gradient and derives a principled approximation, bootstrapping the diffusion model to predict guidance gradients for efficient anomaly detection.
Link: https://arxiv.org/abs/2505.22805
Authors: Siddharth Ancha,Sunshine Jiang,Travis Manderson,Laura Brandt,Yilun Du,Philip R. Osteen,Nicholas Roy
Institutions: MIT CSAIL (麻省理工学院计算机科学与人工智能实验室); DEVCOM Army Research Laboratory (美国陆军研究实验室)
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Presented at ICRA 2025
Abstract:In order to navigate safely and reliably in off-road and unstructured environments, robots must detect anomalies that are out-of-distribution (OOD) with respect to the training data. We present an analysis-by-synthesis approach for pixel-wise anomaly detection without making any assumptions about the nature of OOD data. Given an input image, we use a generative diffusion model to synthesize an edited image that removes anomalies while keeping the remaining image unchanged. Then, we formulate anomaly detection as analyzing which image segments were modified by the diffusion model. We propose a novel inference approach for guided diffusion by analyzing the ideal guidance gradient and deriving a principled approximation that bootstraps the diffusion model to predict guidance gradients. Our editing technique is purely test-time that can be integrated into existing workflows without the need for retraining or fine-tuning. Finally, we use a combination of vision-language foundation models to compare pixels in a learned feature space and detect semantically meaningful edits, enabling accurate anomaly detection for off-road navigation. Project website: this https URL
zh
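上述“分析-合成”思路的核心一步,是逐像素比较原图与扩散模型合成的“去异常”图像。下面用 numpy 给出一个极简示意(假设两组特征已逐像素对齐;真实方法在视觉-语言基础模型的特征空间中比较,并由引导扩散生成编辑图,此处数据与阈值均为虚构):

```python
import numpy as np

def anomaly_map(original_feats, edited_feats, threshold=0.5):
    """逐像素比较原图与"去除异常后"合成图的特征,
    差异大的位置判为异常(假设两组特征已对齐并同尺度)。"""
    dist = np.linalg.norm(original_feats - edited_feats, axis=-1)
    dist = (dist - dist.min()) / (dist.max() - dist.min() + 1e-8)  # 归一化到 [0,1]
    return dist, dist > threshold

# 玩具示例:8x8 的 4 维特征图,扩散模型"编辑"了中央 2x2 区域
orig = np.zeros((8, 8, 4))
edit = orig.copy()
edit[3:5, 3:5] = 1.0
dmap, mask = anomaly_map(orig, edit)
print(int(mask.sum()))  # 4:被标记为异常的像素数
```

实际系统中,`edited_feats` 来自引导扩散模型的输出,异常阈值也需要按数据分布校准。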
[CV-154] Fast Trajectory-Independent Model-Based Reconstruction Algorithm for Multi-Dimensional Magnetic Particle Imaging
【速读】:该论文旨在解决传统磁粒子成像(Magnetic Particle Imaging, MPI)重建中依赖于耗时校准或模型模拟的问题,以及现有方法对特定正弦扫描轨迹的依赖性。其解决方案的关键在于提出一种与轨迹无关的基于模型的MPI重建算法,并结合零样本Plug-and-Play(PnP)算法,利用自然图像训练的先进去噪器进行解卷积,同时具备自动噪声水平估计功能,从而实现对不同扫描场景下真实2D MPI数据的有效重建。
链接: https://arxiv.org/abs/2505.22797
作者: Vladyslav Gapyak,Thomas März,Andreas Weinmann
机构: Darmstadt University of Applied Sciences (达姆施塔特应用科学大学); European University of Technology (欧洲科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Medical Physics (physics.med-ph)
备注: 10 pages, 5 figures. This work has been submitted to the IEEE for possible publication
Abstract:Magnetic Particle Imaging (MPI) is a promising tomographic technique for visualizing the spatio-temporal distribution of superparamagnetic nanoparticles, with applications ranging from cancer detection to real-time cardiovascular monitoring. Traditional MPI reconstruction relies on either time-consuming calibration (measured system matrix) or model-based simulation of the forward operator. Recent developments have shown the applicability of Chebyshev polynomials to multi-dimensional Lissajous Field-Free Point (FFP) scans. This method is bound to the particular choice of sinusoidal scanning trajectories. In this paper, we present the first reconstruction on real 2D MPI data with a trajectory-independent model-based MPI reconstruction algorithm. We further develop the zero-shot Plug-and-Play (PnP) algorithm of the authors – with automatic noise level estimation – to address the present deconvolution problem, leveraging a state-of-the-art denoiser trained on natural images without retraining on MPI-specific data. We evaluate our method on the publicly available 2D FFP MPI dataset ``MPIdata: Equilibrium Model with Anisotropy", featuring scans of six phantoms acquired using a Bruker preclinical scanner. Moreover, we show reconstruction performed on custom data on a 2D scanner with additional high-frequency excitation field and partial data. Our results demonstrate strong reconstruction capabilities across different scanning scenarios – setting a precedent for general-purpose, flexible model-based MPI reconstruction.
zh
[CV-155] Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization
【速读】:该论文试图解决从修辞语言生成图像的问题,这一任务对文本到图像模型而言仍是一个关键挑战。尽管人类能够轻松将修辞内容映射到视觉表征,但当前最先进的多模态大语言模型(MLLM)在处理隐含意义的修辞语言时仍存在局限性。其关键问题在于现有模型过于强调物体级别的词嵌入对齐,导致隐喻表达引导图像生成偏向字面视觉,而忽略了预期的语义含义。为了解决这一问题,作者提出了Rhet2Pix框架,该框架将修辞文本到图像的生成建模为一个多步骤策略优化问题,并引入了两层马尔可夫决策过程(MDP)扩散模块,以提升生成图像的语义丰富性和准确性。
链接: https://arxiv.org/abs/2505.22792
作者: Yuxi Zhang,Yueting Li,Xinyu Du,Sibo Wang
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); University of California, Berkeley (加州大学伯克利分校); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating images from rhetorical languages remains a critical challenge for text-to-image models. Even state-of-the-art (SOTA) multimodal large language models (MLLM) fail to generate images based on the hidden meaning inherent in rhetorical language–despite such content being readily mappable to visual representations by humans. A key limitation is that current models emphasize object-level word embedding alignment, causing metaphorical expressions to steer image generation towards their literal visuals and overlook the intended semantic meaning. To address this, we propose Rhet2Pix, a framework that formulates rhetorical text-to-image generation as a multi-step policy optimization problem, incorporating a two-layer MDP diffusion module. In the outer layer, Rhet2Pix converts the input prompt into incrementally elaborated sub-sentences and executes corresponding image-generation actions, constructing semantically richer visuals. In the inner layer, Rhet2Pix mitigates reward sparsity during image generation by discounting the final reward and optimizing every adjacent action pair along the diffusion denoising trajectory. Extensive experiments demonstrate the effectiveness of Rhet2Pix in rhetorical text-to-image generation. Our model outperforms SOTA MLLMs such as GPT-4o, Grok-3 and leading academic baselines across both qualitative and quantitative evaluations. The code and dataset used in this work are publicly available.
zh
[CV-156] MAC-Gaze: Motion-Aware Continual Calibration for Mobile Gaze Tracking
【速读】:该论文试图解决移动场景下眼动追踪精度下降的问题,特别是在用户姿势和设备方向发生变化时,传统的一次性校准方法无法适应动态变化导致性能退化。解决方案的关键在于提出MAC-Gaze,一种基于运动感知的持续校准方法,该方法利用智能手机的惯性测量单元(IMU)传感器和持续学习技术,自动检测用户运动状态的变化并更新眼动追踪模型,通过聚类混合决策机制在运动模式显著偏离已有状态时触发重新校准,并采用基于回放的持续学习策略以实现新运动条件的累积学习同时避免灾难性遗忘。
链接: https://arxiv.org/abs/2505.22769
作者: Yaxiong Lei,Mingyue Zhao,Yuheng Wang,Shijing He,Yusuke Sugano,Yafei Wang,Kaixing Zhao,Mohamed Khamis,Juan Ye
机构: University of St Andrews (圣安德鲁斯大学); King’s College London (伦敦国王学院); University of Tokyo (东京大学); Dalian Maritime University (大连海事大学); Northwestern Polytechnical University (西北工业大学); University of Glasgow (格拉斯哥大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 7 figures
Abstract:Mobile gaze tracking faces a fundamental challenge: maintaining accuracy as users naturally change their postures and device orientations. Traditional calibration approaches, like one-off, fail to adapt to these dynamic conditions, leading to degraded performance over time. We present MAC-Gaze, a Motion-Aware continual Calibration approach that leverages smartphone Inertial measurement unit (IMU) sensors and continual learning techniques to automatically detect changes in user motion states and update the gaze tracking model accordingly. Our system integrates a pre-trained visual gaze estimator and an IMU-based activity recognition model with a clustering-based hybrid decision-making mechanism that triggers recalibration when motion patterns deviate significantly from previously encountered states. To enable accumulative learning of new motion conditions while mitigating catastrophic forgetting, we employ replay-based continual learning, allowing the model to maintain performance across previously encountered motion conditions. We evaluate our system through extensive experiments on the publicly available RGBDGaze dataset and our own 10-hour multimodal MotionGaze dataset (481K+ images, 800K+ IMU readings), encompassing a wide range of postures under various motion conditions including sitting, standing, lying, and walking. Results demonstrate that our method reduces gaze estimation error by 19.9% on RGBDGaze (from 1.73 cm to 1.41 cm) and by 31.7% on MotionGaze (from 2.81 cm to 1.92 cm) compared to traditional calibration approaches. Our framework provides a robust solution for maintaining gaze estimation accuracy in mobile scenarios.
zh
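MAC-Gaze 的重校准触发逻辑可以概括为:判断当前运动特征是否显著偏离已见过的运动状态簇。以下是一个基于簇心距离的假设性示意(簇心与阈值均为虚构数值,原系统还结合了 IMU 活动识别模型与回放式持续学习,此处只示意触发判断这一步):

```python
import numpy as np

def should_recalibrate(imu_feat, centroids, threshold):
    """当当前 IMU 特征到所有已知运动状态簇心的最小距离
    超过阈值时,触发视线估计模型的重新校准。"""
    dists = np.linalg.norm(centroids - imu_feat, axis=1)
    return bool(dists.min() > threshold)

# 假设已有两个运动状态簇心(例如"静坐"与"行走"的 IMU 特征)
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(should_recalibrate(np.array([0.2, 0.1]), centroids, 1.0))   # False:接近已知状态
print(should_recalibrate(np.array([10.0, -3.0]), centroids, 1.0)) # True:显著偏离,需重校准
```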
[CV-157] Test-time augmentation improves efficiency in conformal prediction
【速读】:该论文试图解决 conformal classifier 在预测时常常生成过于宽泛的预测类别集合的问题,这种集合缺乏信息量。解决方案的关键在于引入测试时增强(test-time augmentation, TTA),通过在推理阶段引入归纳偏置来减小 conformal classifier 产生的预测集合大小。该方法具有灵活性、计算效率高且有效,能够与任何 conformal 分数结合使用,无需重新训练模型,并平均减少预测集合大小的10%-14%。
链接: https://arxiv.org/abs/2505.22764
作者: Divya Shanmugam,Helen Lu,Swami Sankaranarayanan,John Guttag
机构: Massachusetts Institute of Technology, CSAIL (麻省理工学院,CSAIL)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A conformal classifier produces a set of predicted classes and provides a probabilistic guarantee that the set includes the true class. Unfortunately, it is often the case that conformal classifiers produce uninformatively large sets. In this work, we show that test-time augmentation (TTA)–a technique that introduces inductive biases during inference–reduces the size of the sets produced by conformal classifiers. Our approach is flexible, computationally efficient, and effective. It can be combined with any conformal score, requires no model retraining, and reduces prediction set sizes by 10%-14% on average. We conduct an evaluation of the approach spanning three datasets, three models, two established conformal scoring methods, different guarantee strengths, and several distribution shifts to show when and why test-time augmentation is a useful addition to the conformal pipeline.
zh
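该思路可以概括为“对增强视图的 softmax 取平均,再走标准的 split conformal 流程”。以下是一个自包含的示意(数据为随机生成,分数函数取最常见的 1 - p_y;论文实际评测了多种 conformal 分数与真实模型):

```python
import numpy as np

rng = np.random.default_rng(0)

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """标准 split conformal:分数取 1 - p_y,在校准集上取修正分位数 q,
    测试集输出所有满足 p >= 1 - q 的类别,保证边际覆盖 >= 1 - alpha。"""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return test_probs >= 1.0 - q  # 每行是一个布尔编码的预测集合

def tta_average(prob_list):
    """TTA:对同一批样本若干增强视图的 softmax 概率取平均。"""
    return np.mean(prob_list, axis=0)

cal_probs = rng.dirichlet(np.ones(5), size=200)
cal_labels = cal_probs.argmax(axis=1)                               # 玩具标签
aug_probs = [rng.dirichlet(np.ones(5), size=10) for _ in range(4)]  # 4 个增强视图
sets = conformal_sets(cal_probs, cal_labels, tta_average(aug_probs))
print(sets.shape)  # (10, 5)
```

TTA 的平均概率通常更“自信且平滑”,使校准分位数对应的集合更小,这与论文报告的 10%-14% 集合缩减方向一致。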
[CV-158] MIAS-SAM: Medical Image Anomaly Segmentation without thresholding
【速读】:该论文试图解决医学图像中异常区域分割的问题,特别是无需手动定义阈值即可实现精确分割的挑战。解决方案的关键在于提出MIAS-SAM方法,该方法利用基于块的内存库存储从正常数据中提取的图像特征,并在推理时将从SAM编码器提取的嵌入块与内存库中的块进行比较,生成异常图,最终通过计算异常图的质心来引导SAM解码器,从而从之前提取的特征中获得准确的分割结果。
链接: https://arxiv.org/abs/2505.22762
作者: Marco Colussi,Dragan Ahmetovic,Sergio Mascetti
机构: Università degli studi di Milano(米兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This paper presents MIAS-SAM, a novel approach for the segmentation of anomalous regions in medical images. MIAS-SAM uses a patch-based memory bank to store relevant image features, which are extracted from normal data using the SAM encoder. At inference time, the embedding patches extracted from the SAM encoder are compared with those in the memory bank to obtain the anomaly map. Finally, MIAS-SAM computes the center of gravity of the anomaly map to prompt the SAM decoder, obtaining an accurate segmentation from the previously extracted features. Differently from prior works, MIAS-SAM does not require to define a threshold value to obtain the segmentation from the anomaly map. Experimental results conducted on three publicly available datasets, each with a different imaging modality (Brain MRI, Liver CT, and Retina OCT) show accurate anomaly segmentation capabilities measured using DICE score. The code is available at: this https URL
zh
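MIAS-SAM 的两个关键步骤——记忆库最近邻距离生成异常图、异常图质心作为 SAM 提示点——可用如下极简示意说明(特征与尺寸均为虚构,真实方法使用 SAM 编码器提取的 patch 嵌入):

```python
import numpy as np

def patch_anomaly_map(test_patches, memory_bank):
    """每个测试 patch 特征到记忆库(正常特征)的最近邻距离,作为异常分数。"""
    d2 = ((test_patches[:, None, :] - memory_bank[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1))

def center_of_gravity(amap_2d):
    """异常图的质心坐标,用作 SAM 解码器的提示点(这里只计算坐标)。"""
    total = amap_2d.sum()
    ys, xs = np.indices(amap_2d.shape)
    return float((ys * amap_2d).sum() / total), float((xs * amap_2d).sum() / total)

mem = np.zeros((50, 8))            # 记忆库:全部为"正常"特征
patches = np.zeros((16, 8))
patches[5] = 3.0                   # 4x4 网格中 (1,1) 处的异常 patch
amap = patch_anomaly_map(patches, mem).reshape(4, 4)
print(center_of_gravity(amap))  # (1.0, 1.0):质心落在异常 patch 上
```

注意这正是该方法无需阈值的原因:分割由质心提示的 SAM 解码器给出,而不是对异常图做二值化。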
[CV-159] HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer
【速读】:该论文旨在解决图像生成基础模型在提升生成质量的同时所面临的计算复杂度增加和推理延迟加剧的权衡问题。其解决方案的关键在于提出一种新的稀疏扩散Transformer(sparse Diffusion Transformer, DiT)结构,通过引入动态专家混合(dynamic Mixture-of-Experts, MoE)架构,实现了高效的多模态交互与图像生成。该结构包含双流解耦设计和单流稀疏DiT结构,能够在保持高质量图像生成能力的同时降低计算成本,从而有效缓解了生成质量与计算效率之间的矛盾。
链接: https://arxiv.org/abs/2505.22705
作者: Qi Cai,Jingwen Chen,Yang Chen,Yehao Li,Fuchen Long,Yingwei Pan,Zhaofan Qiu,Yiheng Zhang,Fengbin Gao,Peihan Xu,Yimeng Wang,Kai Yu,Wenxuan Chen,Ziwei Feng,Zijian Gong,Jianzhuang Pan,Yi Peng,Rui Tian,Siyu Wang,Bo Zhao,Ting Yao,Tao Mei
机构: HiDream.ai
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Source codes and models are available at this https URL and this https URL
Abstract:Recent advancements in image generative foundation models have prioritized quality improvements but often at the cost of increased computational complexity and inference latency. To address this critical trade-off, we introduce HiDream-I1, a new open-source image generative foundation model with 17B parameters that achieves state-of-the-art image generation quality within seconds. HiDream-I1 is constructed with a new sparse Diffusion Transformer (DiT) structure. Specifically, it starts with a dual-stream decoupled design of sparse DiT with dynamic Mixture-of-Experts (MoE) architecture, in which two separate encoders are first involved to independently process image and text tokens. Then, a single-stream sparse DiT structure with dynamic MoE architecture is adopted to trigger multi-model interaction for image generation in a cost-efficient manner. To support flexible accessibility with varied model capabilities, we provide HiDream-I1 in three variants: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. Furthermore, we go beyond the typical text-to-image generation and remould HiDream-I1 with additional image conditions to perform precise, instruction-based editing on given images, yielding a new instruction-based image editing model namely HiDream-E1. Ultimately, by integrating text-to-image generation and instruction-based image editing, HiDream-I1 evolves to form a comprehensive image agent (HiDream-A1) capable of fully interactive image creation and refinement. To accelerate multi-modal AIGC research, we have open-sourced all the codes and model weights of HiDream-I1-Full, HiDream-I1-Dev, HiDream-I1-Fast, HiDream-E1 through our project websites: this https URL and this https URL. All features can be directly experienced via this https URL.
zh
[CV-160] Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision
【速读】:该论文旨在解决稀有动物图像分类中数据稀缺的问题,即许多物种仅有少量标注样本。其解决方案的关键在于设计了一个包含自适应离散余弦变换(DCT)预处理模块、ViT-B16和ResNet50主干网络以及贝叶斯线性分类头的混合深度学习框架。该框架首次引入了自适应频域选择机制,以学习适合后续主干网络的低、中、高频边界,从而有效提升在极端样本稀缺条件下的分类性能。
链接: https://arxiv.org/abs/2505.22701
作者: Ziyue Kang,Weichuan Zhang
机构: Shaanxi University of Science and Technology(陕西科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A major challenge in rare animal image classification is the scarcity of data, as many species usually have only a small number of labeled samples. To address this challenge, we designed a hybrid deep-learning framework comprising a novel adaptive DCT preprocessing module, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head. To our knowledge, we are the first to introduce an adaptive frequency-domain selection mechanism that learns optimal low-, mid-, and high-frequency boundaries suited to the subsequent backbones. Our network first captures image frequency-domain cues via this adaptive DCT partitioning. The adaptively filtered frequency features are then fed into ViT-B16 to model global contextual relationships, while ResNet50 concurrently extracts local, multi-scale spatial representations from the original image. A cross-level fusion strategy seamlessly integrates these frequency- and spatial-domain embeddings, and the fused features are passed through a Bayesian linear classifier to output the final category predictions. On our self-built 50-class wildlife dataset, this approach outperforms conventional CNN and fixed-band DCT pipelines, achieving state-of-the-art accuracy under extreme sample scarcity.
zh
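“按频带边界划分 DCT 谱”的机制可以示意如下:对图像做正交 DCT 后,按系数索引和 u+v 切分低/中/高三个频带,再分别反变换回像素域。论文中的频带边界是可学习的,此处以固定数值代替(纯 numpy 实现,避免额外依赖):

```python
import numpy as np

def dct_matrix(n):
    """正交 DCT-II 变换矩阵(numpy 实现)。"""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * x + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

def band_split(img, low_cut, mid_cut):
    """按系数索引和 u+v 把 DCT 谱切成低/中/高三个频带;
    low_cut、mid_cut 对应文中可学习的频带边界(此处为固定示意值)。"""
    n = img.shape[0]
    D = dct_matrix(n)
    coef = D @ img @ D.T
    uv = np.add.outer(np.arange(n), np.arange(n))
    bands = []
    for lo, hi in [(0, low_cut), (low_cut, mid_cut), (mid_cut, 2 * n)]:
        mask = (uv >= lo) & (uv < hi)
        bands.append(D.T @ (coef * mask) @ D)  # 各频带反变换回像素域
    return bands

img = np.random.default_rng(0).standard_normal((8, 8))
low, mid, high = band_split(img, low_cut=3, mid_cut=8)
print(np.allclose(low + mid + high, img))  # True:三个频带无损重构原图
```

频带边界学习之后,低/中/高频分量再分别送入 ViT 与 ResNet 分支,这正是文中“自适应频域选择”的含义。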
[CV-161] Using Cross-Domain Detection Loss to Infer Multi-Scale Information for Improved Tiny Head Tracking
【速读】:该论文旨在解决人群场景中头部检测与跟踪任务中计算资源消耗大、延迟高以及资源占用多的问题(如处理器、内存和带宽)。其解决方案的关键在于提出一个优化性能与效率平衡的框架,该框架集成(1)跨域检测损失、(2)多尺度模块以及(3)小感受野检测机制,通过这些创新有效提升了小型检测器的性能,增强了对微小头部的检测能力,并在多个尺度上捕捉高频细节。
链接: https://arxiv.org/abs/2505.22677
作者: Jisu Kim,Alex Mattingly,Eung-Joo Lee,Benjamin S. Riggan
机构: University of Nebraska-Lincoln (内布拉斯加大学林肯分校); University of Maryland (马里兰大学); University of Arizona (亚利桑那大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear at IEEE International Conference on Automatic Face and Gesture 2025 (FG2025)
Abstract:Head detection and tracking are essential for downstream tasks, but current methods often require large computational budgets, which increase latencies and ties up resources (e.g., processors, memory, and bandwidth). To address this, we propose a framework to enhance tiny head detection and tracking by optimizing the balance between performance and efficiency. Our framework integrates (1) a cross-domain detection loss, (2) a multi-scale module, and (3) a small receptive field detection mechanism. These innovations enhance detection by bridging the gap between large and small detectors, capturing high-frequency details at multiple scales during training, and using filters with small receptive fields to detect tiny heads. Evaluations on the CroHD and CrowdHuman datasets show improved Multiple Object Tracking Accuracy (MOTA) and mean Average Precision (mAP), demonstrating the effectiveness of our approach in crowded scenes.
zh
[CV-162] Synthetic Generation and Latent Projection Denoising of Rim Lesions in Multiple Sclerosis CVPR2025
【速读】:该论文旨在解决多发性硬化症中罕见的磁性环形病灶(paramagnetic rim lesions, PRLs)检测与分割中的类别不平衡问题。其关键解决方案是通过生成式人工智能(Generative AI)合成定量磁敏感图,以增强分类器性能,并提出一种新颖的去噪方法,用于处理模糊的环形病灶标注,从而有效增加少数类样本数量,提升临床可解释性的检测效果。
链接: https://arxiv.org/abs/2505.23353
作者: Alexandra G. Roberts,Ha M. Luu,Mert Şişman,Alexey V. Dimov,Ceren Tozlu,Ilhami Kovanlikaya,Susan A. Gauthier,Thanh D. Nguyen,Yi Wang
机构: Cornell University (康奈尔大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted full paper in Synthetic Data @ CVPR 2025 12 pages, 10 figures
Abstract:Quantitative susceptibility maps from magnetic resonance images can provide both prognostic and diagnostic information in multiple sclerosis, a neurodegenerative disease characterized by the formation of lesions in white matter brain tissue. In particular, susceptibility maps provide adequate contrast to distinguish between “rim” lesions, surrounded by deposited paramagnetic iron, and “non-rim” lesion types. These paramagnetic rim lesions (PRLs) are an emerging biomarker in multiple sclerosis. Much effort has been devoted to both detection and segmentation of such lesions to monitor longitudinal change. As paramagnetic rim lesions are rare, addressing this problem requires confronting the class imbalance between rim and non-rim lesions. We produce synthetic quantitative susceptibility maps of paramagnetic rim lesions and show that inclusion of such synthetic data improves classifier performance and provide a multi-channel extension to generate accompanying contrasts and probabilistic segmentation maps. We exploit the projection capability of our trained generative network to demonstrate a novel denoising approach that allows us to train on ambiguous rim cases and substantially increase the minority class. We show that both synthetic lesion synthesis and our proposed rim lesion label denoising method best approximate the unseen rim lesion distribution and improve detection in a clinically interpretable manner. We release our code and generated data at this https URL upon publication.
zh
[CV-163] Plug-and-Play Posterior Sampling for Blind Inverse Problems
【速读】:该论文试图解决盲逆问题(blind inverse problems),即在目标图像和测量算子均未知的情况下进行图像恢复。解决方案的关键在于提出了一种名为盲插件-播放扩散模型(Blind Plug-and-Play Diffusion Models, Blind-PnPDM)的框架,通过将问题重新建模为交替高斯去噪过程来进行后验采样。该方法利用两个扩散模型作为学习先验:一个用于捕捉目标图像的分布,另一个用于表征测量算子的参数,从而实现了灵活性和适应性。
链接: https://arxiv.org/abs/2505.22923
作者: Anqi Li,Weijie Gan,Ulugbek S. Kamilov
机构: Washington University in St. Louis (华盛顿大学圣路易斯分校)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2305.12672
Abstract:We introduce Blind Plug-and-Play Diffusion Models (Blind-PnPDM) as a novel framework for solving blind inverse problems where both the target image and the measurement operator are unknown. Unlike conventional methods that rely on explicit priors or separate parameter estimation, our approach performs posterior sampling by recasting the problem into an alternating Gaussian denoising scheme. We leverage two diffusion models as learned priors: one to capture the distribution of the target image and another to characterize the parameters of the measurement operator. This PnP integration of diffusion models ensures flexibility and ease of adaptation. Our experiments on blind image deblurring show that Blind-PnPDM outperforms state-of-the-art methods in terms of both quantitative metrics and visual fidelity. Our results highlight the effectiveness of treating blind inverse problems as a sequence of denoising subproblems while harnessing the expressive power of diffusion-based priors.
zh
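Blind-PnPDM 的“交替高斯去噪”结构,可以在一个标量盲反问题的玩具示例上示意:轮流固定一方、更新另一方。真实方法在每步用两个扩散模型分别作为图像与算子参数的先验,此处以简单梯度步代替,属于纯结构示意而非论文实现:

```python
import numpy as np

def blind_alternating(y, steps=200, lr=0.1):
    """示意交替更新结构:轮流固定一方,对图像 x 与标量算子参数 a
    各走一小步(真实方法在每步用扩散先验做高斯去噪)。"""
    x = np.zeros_like(y)
    a = 0.5                                   # 未知"测量算子"参数的初值
    for _ in range(steps):
        r = a * x - y                         # 数据残差
        x = x - lr * a * r                    # 固定 a,更新 x
        r = a * x - y
        a = a - lr * float((x * r).mean())    # 固定 x,更新 a
    return x, a

rng = np.random.default_rng(1)
x_true = rng.standard_normal(32)
y = 2.0 * x_true                              # 前向模型:y = a_true * x,a_true = 2 未知
x_hat, a_hat = blind_alternating(y)
res = np.linalg.norm(a_hat * x_hat - y) / np.linalg.norm(y)
print(res < 0.5)  # True:交替更新使重构残差显著下降
```

注意盲问题存在尺度不确定性(a 与 x 可互相吸收一个常数因子),所以核对的是乘积 a*x 对 y 的拟合,而非 x 本身。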
[CV-164] DeepMultiConnectome: Deep Multi-Task Prediction of Structural Connectomes Directly from Diffusion MRI Tractography
【速读】:该论文旨在解决传统扩散磁共振成像(dMRI)纤维追踪在生成脑结构连接组(connectome)时耗时且依赖灰质分割(gray matter parcellation)的问题,这对大规模研究构成了挑战。其解决方案的关键在于提出DeepMultiConnectome,这是一个基于点云的神经网络模型,通过多任务学习直接从纤维追踪数据预测结构连接组,无需灰质分割,并支持多种分割方案。该模型能够高效地从包含300万条纤维的全脑轨迹中生成多个连接组,显著提升了连接组生成的速度和可扩展性。
链接: https://arxiv.org/abs/2505.22685
作者: Marcus J. Vroemen,Yuqian Chen,Yui Lo,Tengfei Xu,Weidong Cai,Fan Zhang,Josien P.W. Pluim,Lauren J. O’Donnell
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 5 figures, 5 tables
Abstract:Diffusion MRI (dMRI) tractography enables in vivo mapping of brain structural connections, but traditional connectome generation is time-consuming and requires gray matter parcellation, posing challenges for large-scale studies. We introduce DeepMultiConnectome, a deep-learning model that predicts structural connectomes directly from tractography, bypassing the need for gray matter parcellation while supporting multiple parcellation schemes. Using a point-cloud-based neural network with multi-task learning, the model classifies streamlines according to their connected regions across two parcellation schemes, sharing a learned representation. We train and validate DeepMultiConnectome on tractography from the Human Connectome Project Young Adult dataset ( n = 1000 ), labeled with an 84 and 164 region gray matter parcellation scheme. DeepMultiConnectome predicts multiple structural connectomes from a whole-brain tractogram containing 3 million streamlines in approximately 40 seconds. DeepMultiConnectome is evaluated by comparing predicted connectomes with traditional connectomes generated using the conventional method of labeling streamlines using a gray matter parcellation. The predicted connectomes are highly correlated with traditionally generated connectomes ( r = 0.992 for an 84-region scheme; r = 0.986 for a 164-region scheme) and largely preserve network properties. A test-retest analysis of DeepMultiConnectome demonstrates reproducibility comparable to traditionally generated connectomes. The predicted connectomes perform similarly to traditionally generated connectomes in predicting age and cognitive function. Overall, DeepMultiConnectome provides a scalable, fast model for generating subject-specific connectomes across multiple parcellation schemes.
zh
[CV-165] ConnectomeDiffuser: Generative AI Enables Brain Network Construction from Diffusion Tensor Imaging
【速读】:该论文旨在解决传统从扩散张量成像(DTI)构建结构脑网络方法中存在的固有局限性,包括操作者主观性、劳动密集型工作流程以及对复杂拓扑特征和疾病特异性生物标志物的捕捉能力受限等问题。其解决方案的关键在于提出一种基于扩散的新型框架——ConnectomeDiffuser,该框架整合了三个核心组件:基于黎曼几何原理提取拓扑特征的模板网络、生成具有增强拓扑保真度的全面脑网络的扩散模型,以及结合疾病特异性标记以提高诊断准确性的图卷积网络分类器。
链接: https://arxiv.org/abs/2505.22683
作者: Xuhang Chen,Michael Kwok-Po Ng,Kim-Fung Tsang,Chi-Man Pun,Shuqiang Wang
机构: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院); University of Macau(澳门大学); Hong Kong Baptist University(香港浸会大学)
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Brain network analysis plays a crucial role in diagnosing and monitoring neurodegenerative disorders such as Alzheimer’s disease (AD). Existing approaches for constructing structural brain networks from diffusion tensor imaging (DTI) often rely on specialized toolkits that suffer from inherent limitations: operator subjectivity, labor-intensive workflows, and restricted capacity to capture complex topological features and disease-specific biomarkers. To overcome these challenges and advance computational neuroimaging instrumentation, ConnectomeDiffuser is proposed as a novel diffusion-based framework for automated end-to-end brain network construction from DTI. The proposed model combines three key components: (1) a Template Network that extracts topological features from 3D DTI scans using Riemannian geometric principles, (2) a diffusion model that generates comprehensive brain networks with enhanced topological fidelity, and (3) a Graph Convolutional Network classifier that incorporates disease-specific markers to improve diagnostic accuracy. ConnectomeDiffuser demonstrates superior performance by capturing a broader range of structural connectivity and pathology-related information, enabling more sensitive analysis of individual variations in brain networks. Experimental validation on datasets representing two distinct neurodegenerative conditions demonstrates significant performance improvements over other brain network methods. This work contributes to the advancement of instrumentation in the context of neurological disorders, providing clinicians and researchers with a robust, generalizable measurement framework that facilitates more accurate diagnosis, deeper mechanistic understanding, and improved therapeutic monitoring of neurodegenerative diseases such as AD.
zh
[CV-166] MRI Image Generation Based on Text Prompts
【速读】:该论文试图解决获取真实MRI数据集所面临的挑战,如高成本、罕见病例样本有限以及隐私问题。其解决方案的关键在于利用文本提示的Stable Diffusion (SD)模型进行MRI图像生成,并通过在3T fastMRI数据集和0.3T M4Raw数据集上对模型进行微调,以生成不同磁场强度下的脑部T1、T2和FLAIR图像,从而提升图像质量和与文本提示的语义一致性。
链接: https://arxiv.org/abs/2505.22682
作者: Xinxian Fan,Mengye Lyu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:
Abstract:This study explores the use of text-prompted MRI image generation with the Stable Diffusion (SD) model to address challenges in acquiring real MRI datasets, such as high costs, limited rare case samples, and privacy concerns. The SD model, pre-trained on natural images, was fine-tuned using the 3T fastMRI dataset and the 0.3T M4Raw dataset, with the goal of generating brain T1, T2, and FLAIR images across different magnetic field strengths. The performance of the fine-tuned model was evaluated using quantitative metrics, including Fréchet Inception Distance (FID) and Multi-Scale Structural Similarity (MS-SSIM), showing improvements in image quality and semantic consistency with the text prompts. To further evaluate the model’s potential, a simple classification task was carried out using a small 0.35T MRI dataset, demonstrating that the synthetic images generated by the fine-tuned SD model can effectively augment training datasets and improve the performance of MRI contrast classification tasks. Overall, our findings suggest that text-prompted MRI image generation is feasible and can serve as a useful tool for medical AI applications.
zh
[CV-167] Physiology-Informed Generative Multi-Task Network for Contrast-Free CT Perfusion
【速读】:该论文试图解决传统CT灌注(CTP)成像中使用对比剂所带来的过敏反应、不良副作用及高昂成本问题。解决方案的关键在于提出一种名为MAGIC的深度学习框架,该框架结合生成式人工智能与生理信息,将非对比增强的CT图像映射为多模态的无对比剂CTP成像图,通过在损失函数中引入生理特征以提升图像保真度,从而实现无需对比剂的高效、精准灌注成像。
链接: https://arxiv.org/abs/2505.22673
作者: Wasif Khan,Kyle B. See,Simon Kato,Ziqian Huang,Amy Lazarte,Kyle Douglas,Xiangyang Lou,Teng J. Peng,Dhanashree Rajderkar,John Rees,Pina Sanelli,Amita Singh,Ibrahim Tuna,Christina A. Wilson,Ruogu Fang
机构: University of Florida(佛罗里达大学); Feinstein Institutes for Medical Research(费恩斯坦医学研究研究所); Donald and Barbara Zucker School of Medicine at Hofstra Northwell(霍夫斯特拉诺斯威尔唐纳德和芭芭拉祖克尔医学院); Northwell Health(诺斯韦尔健康机构); Center for Cognitive Aging and Memory(认知衰老与记忆中心)
类目: Tissues and Organs (q-bio.TO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
Abstract:Perfusion imaging is extensively utilized to assess hemodynamic status and tissue perfusion in various organs. Computed tomography perfusion (CTP) imaging plays a key role in the early assessment and planning of stroke treatment. While CTP provides essential perfusion parameters to identify abnormal blood flow in the brain, the use of contrast agents in CTP can lead to allergic reactions and adverse side effects, along with costing USD 4.9 billion worldwide in 2022. To address these challenges, we propose a novel deep learning framework called Multitask Automated Generation of Intermodal CT perfusion maps (MAGIC). This framework combines generative artificial intelligence and physiological information to map non-contrast computed tomography (CT) imaging to multiple contrast-free CTP imaging maps. We demonstrate enhanced image fidelity by incorporating physiological characteristics into the loss terms. Our network was trained and validated using CT image data from patients referred for stroke at UF Health and demonstrated robustness to abnormalities in brain perfusion activity. A double-blinded study was conducted involving seven experienced neuroradiologists and vascular neurologists. This study validated MAGIC’s visual quality and diagnostic accuracy showing favorable performance compared to clinical perfusion imaging with intravenous contrast injection. Overall, MAGIC holds great promise in revolutionizing healthcare by offering contrast-free, cost-effective, and rapid perfusion imaging.
zh
人工智能
[AI-0] Comparative of Genetic Fuzzy regression techniques for aeroacoustic phenomenons
【速读】:该论文试图解决翼型自噪声(airfoil self-noise)的建模问题,这是气动声学中的关键问题,对航空航天、汽车和无人机应用具有重要意义。解决方案的关键在于利用遗传模糊系统(Genetic Fuzzy Systems, GFS)结合不同的模糊回归策略,包括高规则密度的Takagi-Sugeno-Kang(TSK)模糊系统、级联遗传模糊树(GFT)架构以及基于模糊C均值(FCM)聚类的新型分组方法,以降低模型复杂度并提高其在复杂气动声学现象中的回归性能。
链接: https://arxiv.org/abs/2505.23746
作者: Hugo Henry,Kelly Cohen
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 11 pages and 23 figures
Abstract:This study investigates the application of Genetic Fuzzy Systems (GFS) to model the self-noise generated by airfoils, a key issue in aeroacoustics with significant implications for aerospace, automotive and drone applications. Using the publicly available Airfoil Self-Noise dataset, various fuzzy regression strategies are explored and compared. The paper evaluates a brute-force Takagi-Sugeno-Kang (TSK) fuzzy system with high rule density, a cascading Genetic Fuzzy Tree (GFT) architecture and a novel clustered approach based on Fuzzy C-means (FCM) to reduce the model’s complexity. This highlights the viability of clustering-assisted fuzzy inference as an effective regression tool for complex aeroacoustic phenomena. Keywords: Fuzzy logic, Regression, Cascading systems, Clustering and AI.
zh
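“FCM 聚类辅助的 TSK 模糊回归”思路可示意为:用模糊隶属度作为规则激活度,每条规则配一个加权最小二乘拟合的局部线性结论,输出取加权平均。以下为通用示意,并非论文的遗传优化版本(示例数据为人为构造的线性关系,便于核对结果):

```python
import numpy as np

def fcm(X, c=2, m=2.0, iters=50, seed=0):
    """模糊 C 均值:返回隶属度矩阵 U (n, c) 与簇心 (c, d)。"""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-9
        U = 1.0 / d ** (2.0 / (m - 1.0))
        U /= U.sum(1, keepdims=True)
    return U, centers

def tsk_predict(X, y, U):
    """以隶属度为规则激活度:每条规则拟合一个加权最小二乘局部线性模型,
    输出为各规则结论按激活度的加权平均(聚类辅助 TSK 回归示意)。"""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    preds = []
    for j in range(U.shape[1]):
        sw = np.sqrt(U[:, j])
        theta, *_ = np.linalg.lstsq(Xb * sw[:, None], y * sw, rcond=None)
        preds.append(Xb @ theta)
    P = np.stack(preds, axis=1)
    return (U * P).sum(1) / U.sum(1)

X = np.linspace(-2.0, 2.0, 40)[:, None]
y = 3.0 * X[:, 0] + 1.0            # 人为构造的线性目标,便于核对
U, _ = fcm(X)
pred = tsk_predict(X, y, U)
print(np.allclose(pred, y, atol=1e-5))  # True:局部线性规则可精确拟合线性数据
```

论文中规则的前件与后件由遗传算法进一步演化;FCM 的作用是先把输入空间软划分,从而以少量规则替代高密度规则库。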
[AI-1] Exposing the Impact of GenAI for Cybercrime: An Investigation into the Dark Side
【速读】:该论文试图解决生成式 AI(Generative AI)在网络安全领域可能引发的恶意互联网相关活动和网络犯罪问题,以及如何在保障安全的同时促进技术创新。其解决方案的关键在于通过心理框架下的技术放大和可供性理论,结合准实验设计与中断时间序列分析,对两个数据集(一个通用数据集和一个加密货币相关数据集)进行实证研究,以评估生成式 AI 在网络犯罪中的作用,从而为人工智能治理提供平衡控制与创新的策略支持。
链接: https://arxiv.org/abs/2505.23733
作者: Truong (Jack) Luu,Binny M. Samuel
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:In recent years, the rapid advancement and democratization of generative AI models have sparked significant debate over safety, ethical risks, and dual-use concerns, particularly in the context of cybersecurity. While anecdotally known, this paper provides empirical evidence regarding generative AI’s association with malicious internet-related activities and cybercrime by examining the phenomenon through psychological frameworks of technological amplification and affordance theory. Using a quasi-experimental design with interrupted time series analysis, we analyze two datasets, one general and one cryptocurrency-focused, to empirically assess generative AI’s role in cybercrime. The findings contribute to ongoing discussions about AI governance by balancing control and fostering innovation, underscoring the need for strategies to guide policymakers, inform AI developers and cybersecurity professionals, and educate the public to maximize AI’s benefits while mitigating its risks.
zh
[AI-2] SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA
【速读】:该论文旨在解决传统低秩适应(Low-Rank Adaptation, LoRA)在微调大型语言模型(Large Language Models, LLMs)时存在的收敛速度慢和知识遗忘问题。其解决方案的关键在于提出了一种新的LoRA初始化框架——子空间约束LoRA(Subspace-Constrained LoRA, SC-LoRA),通过将可训练LoRA适配器的输出限制在一个低秩子空间中,实现微调效率与知识保留之间的平衡。该约束机制使得微调数据的上下文信息得到最大程度的保留,而预训练知识的上下文信息则被最小程度地保留,从而让可训练权重主要关注微调数据的主要特征,避免对已保存知识特征造成损害。
链接: https://arxiv.org/abs/2505.23724
作者: Minrui Luo,Fuhang Kuang,Yu Wang,Zirui Liu,Tianxing He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), are indispensable for efficiently customizing Large Language Models (LLMs). However, vanilla LoRA suffers from slow convergence speed and knowledge forgetting problems. Recent studies have leveraged the power of designed LoRA initialization, to enhance the fine-tuning efficiency, or to preserve knowledge in the pre-trained LLM. However, none of these works can address the two cases at the same time. To this end, we introduce Subspace-Constrained LoRA (SC-LoRA), a novel LoRA initialization framework engineered to navigate the trade-off between efficient fine-tuning and knowledge preservation. We achieve this by constraining the output of trainable LoRA adapters in a low-rank subspace, where the context information of fine-tuning data is most preserved while the context information of preserved knowledge is least retained, in a balanced way. Such constraint enables the trainable weights to primarily focus on the main features of fine-tuning data while avoiding damaging the preserved knowledge features. We provide theoretical analysis on our method, and conduct extensive experiments including safety preservation and world knowledge preservation, on various downstream tasks. In our experiments, SC-LoRA succeeds in delivering superior fine-tuning performance while markedly diminishing knowledge forgetting, surpassing contemporary LoRA initialization methods.
zh
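SC-LoRA 的核心思想是选取一个低秩子空间,使微调数据的上下文信息尽可能多地保留、需保护知识的上下文信息尽可能少地保留。按这一思路,可以把子空间选择写成一个广义特征问题的草图(通过白化 + 对称特征分解求解;以下仅为示意实现,并非论文原方法,函数与变量名均为假设):

```python
import numpy as np

def sc_subspace(H_ft, H_pres, r, eps=1e-6):
    """选取 r 维子空间:方向尽量覆盖微调激活 H_ft 的能量,
    同时尽量避开需保留知识的激活 H_pres(行 = 样本)。"""
    C_ft = H_ft.T @ H_ft / len(H_ft)
    C_pr = H_pres.T @ H_pres / len(H_pres) + eps * np.eye(H_ft.shape[1])
    # 白化后做对称特征分解,等价于广义特征问题 C_ft u = λ C_pr u
    L = np.linalg.cholesky(C_pr)
    Li = np.linalg.inv(L)
    w, V = np.linalg.eigh(Li @ C_ft @ Li.T)   # 特征值升序排列
    U = Li.T @ V[:, -r:]                      # 取前 r 个广义特征方向
    return U / np.linalg.norm(U, axis=0)      # 各列张成目标子空间
```

得到的 U 可用于约束 LoRA 适配器的输出列空间(例如初始化 B 矩阵),使可训练权重聚焦于微调数据的主要特征方向。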
[AI-3] COBRA: Contextual Bandit Algorithm for Ensuring Truthful Strategic Agents
【速读】:该论文试图解决多智能体情境下的上下文老虎机问题,其中智能体可能出于自身利益而虚假报告其偏好臂,这与传统上下文老虎机假设智能体诚实报告的设定相悖(contextual bandit problem)。为应对这一挑战,作者提出了一种名为COBRA的算法,其关键在于通过非货币激励的方式抑制智能体的战略行为,同时保证机制的激励相容性(incentive compatibility)和次线性遗憾(sub-linear regret)性能。
链接: https://arxiv.org/abs/2505.23720
作者: Arun Verma,Indrajit Saha,Makoto Yokoo,Bryan Kian Hsiang Low
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: This paper proposes a contextual bandit algorithm that prevents strategic agents from misreporting while having approximate incentive compatibility and a sub-linear regret guarantee
Abstract:This paper considers a contextual bandit problem involving multiple agents, where a learner sequentially observes the contexts and the agent’s reported arms, and then selects the arm that maximizes the system’s overall reward. Existing work in contextual bandits assumes that agents truthfully report their arms, which is unrealistic in many real-life applications. For instance, consider an online platform with multiple sellers; some sellers may misrepresent product quality to gain an advantage, such as having the platform preferentially recommend their products to online users. To address this challenge, we propose an algorithm, COBRA, for contextual bandit problems involving strategic agents that disincentivize their strategic behavior without using any monetary incentives, while having incentive compatibility and a sub-linear regret guarantee. Our experimental results also validate the different performance aspects of our proposed algorithm.
zh
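摘要未给出 COBRA 抑制策略性谎报的具体机制;作为背景,下面给出一个最小的 LinUCB 风格上下文老虎机草图(逐臂岭回归 + 置信上界),仅用于说明上下文老虎机的基本框架,并非 COBRA 本身,类名与参数均为假设:

```python
import numpy as np

class LinUCB:
    """最小 LinUCB:每个臂维护岭回归统计量,按均值 + 置信加成选臂。"""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.A = [np.eye(dim) for _ in range(n_arms)]    # X^T X + I
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # X^T y
        self.alpha = alpha

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            Ainv = np.linalg.inv(A)
            theta = Ainv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ Ainv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

COBRA 在此类框架之上额外加入了非货币激励机制,使智能体如实报告臂信息时才是(近似)最优策略。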
[AI-4] From Connectivity to Autonomy: The Dawn of Self-Evolving Communication Systems
【速读】:该论文试图解决6G通信系统在动态适应性和智能化管理方面的挑战,旨在构建一个自演进的电信生态系统。解决方案的关键在于利用AI驱动的智能技术,通过可重构基础设施、自适应中间件、智能网络功能以及多智能体协作,实现分布式决策与工业物联网框架的无缝集成,从而提升实时决策效率、优化系统性能并降低延迟。
链接: https://arxiv.org/abs/2505.23710
作者: Zeinab Nezami,Syed Danial Ali Shah,Maryam Hafeez,Karim Djemame,Syed Ali Raza Zaidi
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
备注:
Abstract:This paper envisions 6G as a self-evolving telecom ecosystem, where AI-driven intelligence enables dynamic adaptation beyond static connectivity. We explore the key enablers of autonomous communication systems, spanning reconfigurable infrastructure, adaptive middleware, and intelligent network functions, alongside multi-agent collaboration for distributed decision-making. We explore how these methodologies align with emerging industrial IoT frameworks, ensuring seamless integration within digital manufacturing processes. Our findings emphasize the potential for improved real-time decision-making, optimizing efficiency, and reducing latency in networked control systems. The discussion addresses ethical challenges, research directions, and standardization efforts, concluding with a technology stack roadmap to guide future developments. By leveraging state-of-the-art 6G network management techniques, this research contributes to the next generation of intelligent automation solutions, bridging the gap between theoretical advancements and real-world industrial applications.
zh
[AI-5] Distributed Federated Learning for Vehicular Network Security: Anomaly Detection Benefits and Multi-Domain Attack Threats
【速读】:该论文试图解决在车联网中,由于大规模、高移动性和异构数据分布带来的安全信息分类挑战,传统依赖集中式数据收集或纯本地训练的方法存在局限性。解决方案的关键在于采用分布式联邦学习(Distributed Federated Learning, DFL),通过车辆在单跳邻居间交换模型更新并多跳传播模型,实现深度学习模型的协同训练,从而显著提升所有车辆的分类准确性,特别是在个体准确率较低的车辆上表现出明显的性能提升。
链接: https://arxiv.org/abs/2505.23706
作者: Utku Demir,Yalin E. Sagduyu,Tugba Erpek,Hossein Jafari,Sastry Kompella,Mengran Xue
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Signal Processing (eess.SP)
备注:
Abstract:In connected and autonomous vehicles, machine learning for safety message classification has become critical for detecting malicious or anomalous behavior. However, conventional approaches that rely on centralized data collection or purely local training face limitations due to the large scale, high mobility, and heterogeneous data distributions inherent in inter-vehicle networks. To overcome these challenges, this paper explores Distributed Federated Learning (DFL), whereby vehicles collaboratively train deep learning models by exchanging model updates among one-hop neighbors and propagating models over multiple hops. Using the Vehicular Reference Misbehavior (VeReMi) Extension Dataset, we show that DFL can significantly improve classification accuracy across all vehicles compared to learning strictly with local data. Notably, vehicles with low individual accuracy see substantial accuracy gains through DFL, illustrating the benefit of knowledge sharing across the network. We further show that local training data size and time-varying network connectivity correlate strongly with the model’s overall accuracy. We investigate DFL’s resilience and vulnerabilities under attacks in multiple domains, namely wireless jamming and training data poisoning attacks. Our results reveal important insights into the vulnerabilities of DFL when confronted with multi-domain attacks, underlining the need for more robust strategies to secure DFL in vehicular networks.
zh
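摘要中的 DFL 通过一跳邻居间交换模型更新、并经多跳传播模型。其中一轮通信的邻居平均步骤可示意如下(草图省略了轮间的本地 SGD 训练,函数与变量名为假设):

```python
import numpy as np

def dfl_round(models, adj):
    """一轮 DFL 通信:每辆车将自身模型参数与一跳邻居求平均
    (adj 为邻接矩阵)。重复多轮即可把模型传播到多跳之外。"""
    n = len(models)
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j] or j == i]   # 含自身
        out.append(sum(models[j] for j in nbrs) / len(nbrs))
    return out
```

在真实车联网场景中,每轮平均之间,各车辆还会用本地安全消息数据做若干步梯度更新;连通性随时间变化时,adj 每轮也随之更新。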
[AI-6] Let’s Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLMs’ Math Capability
【速读】:该论文旨在解决如何有效提升大语言模型(Large Language Models, LLMs)在数学推理任务中的性能,特别是如何将自然语言(Natural Language, NL)与形式化语言(Formal Language, FL)的推理能力进行融合。其关键解决方案是提出NL-FL HybridReasoning框架,该框架通过NL-FL Problem Alignment方法将NL中的问答问题转化为FL中的存在性定理,利用Mixed Problem Input技术使FL推理器同时处理NL和FL问题,并通过基于LLM的Answer Extraction机制解决NL与FL输出格式的差异,从而显著提升了数学推理的准确性。
链接: https://arxiv.org/abs/2505.23703
作者: Ruida Wang,Yuxin Li,Yi R.(May)Fung,Tong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Enhancing the mathematical reasoning capabilities of LLMs has garnered significant attention in both the mathematical and computer science communities. Recent works have made substantial progress in both Natural Language (NL) reasoning and Formal Language (FL) reasoning by leveraging the potential of pure Reinforcement Learning (RL) methods on base models. However, RL approaches struggle to impart new capabilities not presented in the base model, highlighting the need to integrate more knowledge like FL into NL math reasoning effectively. Yet, this integration is challenging due to inherent disparities in problem structure and reasoning format between NL and FL. To address these challenges, we introduce NL-FL HybridReasoning, an end-to-end framework designed to incorporate the FL expert into NL math problem-solving. To bridge the NL and FL input format gap, we propose the NL-FL Problem Alignment method, which reformulates the Question-Answering (QA) problems in NL as existence theorems in FL. Subsequently, the Mixed Problem Input technique we provide enables the FL reasoner to handle both QA and existence problems concurrently. Lastly, we mitigate the NL and FL output format gap in reasoning through an LLM-based Answer Extraction mechanism. Comprehensive experiments demonstrate that the HybridReasoning framework achieves 89.80% and 84.34% accuracy rates on the MATH-500 and the AMC benchmarks, surpassing the NL baseline by 4.60% and 4.82%, respectively. Notably, some problems resolved by our framework remain unsolved by the NL baseline model even under a larger number of trials.
zh
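NL-FL Problem Alignment 将自然语言中的问答问题改写为形式语言中的存在性定理。下面是一个示意性的 Lean 4(假设已引入 Mathlib)例子,并非论文原文:问题「哪个 x 满足 2x + 3 = 7?」对齐为存在性定理,证明中的 witness(x = 2)即为答案:

```lean
import Mathlib.Tactic

-- 自然语言问答:「哪个 x 满足 2x + 3 = 7?」
-- 对齐后的形式语言问题:存在性定理;witness(x = 2)即原问题的答案
theorem aligned_qa : ∃ x : ℚ, 2 * x + 3 = 7 :=
  ⟨2, by norm_num⟩
```

这样,FL 推理器只需证明存在性定理,再由基于 LLM 的 Answer Extraction 机制从证明中取出 witness,还原为 NL 答案。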
[AI-7] Data-to-Dashboard: Multi-Agent LLM Framework for Insightful Visualization in Enterprise Analytics
【速读】:该论文旨在解决如何自动化从原始数据到可视化仪表板的流程,特别是在数据分析领域中提升洞察生成和可视化的能力。其解决方案的关键在于构建一个模块化的生成式 AI (Generative AI) 代理系统,该系统能够执行领域检测、概念提取、多视角分析生成以及迭代自我反思,从而模拟业务分析师的分析推理过程,而无需依赖封闭的本体或问题模板。
链接: https://arxiv.org/abs/2505.23695
作者: Ran Zhang,Mohannad Elhamod
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of LLMs has led to the creation of diverse agentic systems in data analysis, utilizing LLMs’ capabilities to improve insight generation and visualization. In this paper, we present an agentic system that automates the data-to-dashboard pipeline through modular LLM agents capable of domain detection, concept extraction, multi-perspective analysis generation, and iterative self-reflection. Unlike existing chart QA systems, our framework simulates the analytical reasoning process of business analysts by retrieving domain-relevant knowledge and adapting to diverse datasets without relying on closed ontologies or question templates. We evaluate our system on three datasets across different domains. Benchmarked against GPT-4o with a single-prompt baseline, our approach shows improved insightfulness, domain relevance, and analytical depth, as measured by tailored evaluation metrics and qualitative human assessment. This work contributes a novel modular pipeline to bridge the path from raw data to visualization, and opens new opportunities for human-in-the-loop validation by domain experts in business analytics. All code can be found here: this https URL
zh
[AI-8] ROTATE: Regret-driven Open-ended Training for Ad Hoc Teamwork
【速读】:该论文试图解决多智能体学习中的一个基础泛化挑战——即开发能够与未见过的合作伙伴协作的AI代理,这一问题被称为Ad Hoc Teamwork (AHT)。现有方法通常采用两阶段流程,首先生成固定队友群体,其次训练AHT代理与其协作,但这种分离导致生成的队友群体行为覆盖有限,并且忽略了这些队友是否易于AHT代理学习。此外,训练AHT代理的算法通常将训练队友视为静态集合,无法控制其分布。本文提出了一种统一框架,通过将问题重新表述为一个开放式的学习过程,其中包含一个自适应代理和一个对抗性队友生成器。关键解决方案是引入ROTATE算法,该算法通过交替优化AHT代理和生成探测其弱点的队友,实现性能提升。实验表明,ROTATE在泛化到未见过的评估队友方面显著优于基线方法,确立了更稳健和可泛化的团队协作新标准。
链接: https://arxiv.org/abs/2505.23686
作者: Caroline Wang,Arrasy Rahman,Jiaxun Cui,Yoonchang Sung,Peter Stone
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Developing AI agents capable of collaborating with previously unseen partners is a fundamental generalization challenge in multi-agent learning, known as Ad Hoc Teamwork (AHT). Existing AHT approaches typically adopt a two-stage pipeline, where first, a fixed population of teammates is generated with the idea that they should be representative of the teammates that will be seen at deployment time, and second, an AHT agent is trained to collaborate well with agents in the population. To date, the research community has focused on designing separate algorithms for each stage. This separation has led to algorithms that generate teammate pools with limited coverage of possible behaviors, and that ignore whether the generated teammates are easy to learn from for the AHT agent. Furthermore, algorithms for training AHT agents typically treat the set of training teammates as static, thus attempting to generalize to previously unseen partner agents without assuming any control over the distribution of training teammates. In this paper, we present a unified framework for AHT by reformulating the problem as an open-ended learning process between an ad hoc agent and an adversarial teammate generator. We introduce ROTATE, a regret-driven, open-ended training algorithm that alternates between improving the AHT agent and generating teammates that probe its deficiencies. Extensive experiments across diverse AHT environments demonstrate that ROTATE significantly outperforms baselines at generalizing to an unseen set of evaluation teammates, thus establishing a new standard for robust and generalizable teamwork.
zh
[AI-9] Fortune: Formula-Driven Reinforcement Learning for Symbolic Table Reasoning in Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LMs)在处理表格数据时的数值或符号推理能力不足的问题,尤其是在复杂场景下。其解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)的框架——Formula Tuning(Fortune),该框架通过使用二进制答案正确性作为奖励信号,训练LM生成可执行的电子表格公式,从而减少对监督公式标注的依赖,并引导模型通过推理学习公式的推导过程。
链接: https://arxiv.org/abs/2505.23667
作者: Lang Cao,Jingxian Xu,Hanbing Liu,Jinyu Wang,Mengyu Zhou,Haoyu Dong,Shi Han,Dongmei Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Tables are a fundamental structure for organizing and analyzing data, making effective table understanding a critical capability for intelligent systems. While large language models (LMs) demonstrate strong general reasoning abilities, they continue to struggle with accurate numerical or symbolic reasoning over tabular data, especially in complex scenarios. Spreadsheet formulas provide a powerful and expressive medium for representing executable symbolic operations, encoding rich reasoning patterns that remain largely underutilized. In this paper, we propose Formula Tuning (Fortune), a reinforcement learning (RL) framework that trains LMs to generate executable spreadsheet formulas for question answering over general tabular data. Formula Tuning reduces the reliance on supervised formula annotations by using binary answer correctness as a reward signal, guiding the model to learn formula derivation through reasoning. We provide a theoretical analysis of its advantages and demonstrate its effectiveness through extensive experiments on seven table reasoning benchmarks. Formula Tuning substantially enhances LM performance, particularly on multi-step numerical and symbolic reasoning tasks, enabling a 7B model to outperform O1 on table understanding. This highlights the potential of formula-driven RL to advance symbolic table reasoning in LMs.
zh
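Fortune 以二进制答案正确性作为奖励信号:执行模型生成的电子表格公式,结果与标准答案一致则奖励为 1,否则为 0。下面是一个极简示意(仅支持 SUM/AVERAGE 的玩具求值器,真实系统执行完整的电子表格公式;函数名为假设):

```python
def formula_reward(formula, table, gold):
    """二进制 RL 奖励:公式执行结果等于标准答案记 1.0,否则记 0.0。"""
    f = formula.strip().lstrip("=").upper()
    try:
        fn, col = f.rstrip(")").split("(")
        vals = table[col.lower()]
        if fn == "SUM":
            result = sum(vals)
        elif fn == "AVERAGE":
            result = sum(vals) / len(vals)
        else:
            return 0.0                      # 玩具求值器不支持的函数
    except (KeyError, ValueError, ZeroDivisionError):
        return 0.0                          # 公式格式错误或不可执行
    return 1.0 if abs(result - gold) < 1e-9 else 0.0
```

这种只依赖最终答案对错的奖励,使训练无需监督式的公式标注,模型通过推理自行学会公式的推导。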
[AI-10] Keyed Chaotic Tensor Transformations for Secure And Attributable Neural Inference
【速读】:该论文旨在解决人工智能系统中神经网络推理过程中的安全与隐私保护问题,特别是针对数据归属、认证推理和张量级水印的需求。其解决方案的关键在于引入一种基于密钥驱动的混沌动力学变换框架,通过确定性且密码学种子初始化的混沌系统对张量进行非可逆、用户特定的变换,从而实现安全的推理过程和数据溯源。
链接: https://arxiv.org/abs/2505.23655
作者: Peter David Fagan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages
Abstract:This work introduces a novel framework for secure and privacy-preserving neural network inference based on keyed chaotic dynamical transformations. The proposed method applies a deterministic, cryptographically seeded chaotic system to tensors, producing non-invertible, user-specific transformations that enable authenticated inference, tensor-level watermarking, and data attribution. This framework offers a scalable and lightweight alternative to conventional cryptographic techniques, and establishes a new direction for tensor-level security in AI systems.
zh
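该工作将密码学种子初始化的混沌系统作用于张量。下面用 logistic 映射给出「密钥 → 混沌序列 → 张量调制」这一思路的示意草图(论文中的构造是不可逆的,此处的逐元素乘法本身并不保证这一点,仅演示密钥驱动与确定性;函数与参数名为假设):

```python
import hashlib
import numpy as np

def keyed_chaotic_transform(x, key, r=3.99):
    """用户密钥经 SHA-256 派生 logistic 映射 x_{n+1} = r*x_n*(1-x_n)
    的初值(r=3.99 处于混沌区),生成的混沌序列逐元素调制张量,
    得到确定性的、用户特定的变换。"""
    digest = hashlib.sha256(key.encode()).digest()
    s = int.from_bytes(digest[:8], "big") / 2**64     # 初值落在 [0, 1)
    s = 0.25 + 0.5 * s                                # 避开不动点 0 与 1
    mask = np.empty(x.size)
    for i in range(x.size):
        s = r * s * (1.0 - s)
        mask[i] = s
    return x * mask.reshape(x.shape)
```

同一密钥产生相同的变换(可用于数据归属与水印验证),不同密钥产生不同的变换。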
[AI-11] Securing AI Agents with Information-Flow Control
【速读】:该论文试图解决AI代理在面对如提示注入等漏洞时的安全性问题,旨在为AI代理提供安全保证。解决方案的关键在于利用信息流控制(information-flow control, IFC)技术,通过动态污点追踪来表征可强制执行的安全属性,并引入Fides规划器,该规划器能够跟踪保密性和完整性标签,确定性地执行安全策略,并引入新颖的原语以选择性地隐藏信息。
链接: https://arxiv.org/abs/2505.23643
作者: Manuel Costa,Boris Köpf,Aashish Kolluri,Andrew Paverd,Mark Russinovich,Ahmed Salem,Shruti Tople,Lukas Wutschitz,Santiago Zanella-Béguelin
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:As AI agents become increasingly autonomous and capable, ensuring their security against vulnerabilities such as prompt injection becomes critical. This paper explores the use of information-flow control (IFC) to provide security guarantees for AI agents. We present a formal model to reason about the security and expressiveness of agent planners. Using this model, we characterize the class of properties enforceable by dynamic taint-tracking and construct a taxonomy of tasks to evaluate security and utility trade-offs of planner designs. Informed by this exploration, we present Fides, a planner that tracks confidentiality and integrity labels, deterministically enforces security policies, and introduces novel primitives for selectively hiding information. Its evaluation in AgentDojo demonstrates that this approach broadens the range of tasks that can be securely accomplished. A tutorial to walk readers through the the concepts introduced in the paper can be found at this https URL
zh
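Fides 跟踪保密性与完整性标签并确定性地执行安全策略。动态污点追踪的最小示意如下(标签合并时保密性只升不降、完整性只降不升;这并非 Fides 的真实标签模型,类与字段名均为假设):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Label:
    """IFC 标签:secrecy 越高越机密;integrity 越低越不可信
    (例如来自网页的工具输出)。"""
    secrecy: int
    integrity: int

    def join(self, other):
        """由两个输入派生的数据的标签:取保密性上确界、完整性下确界。"""
        return Label(max(self.secrecy, other.secrecy),
                     min(self.integrity, other.integrity))

def can_flow(src: Label, sink_max_secrecy: int, sink_min_integrity: int) -> bool:
    """规划器将数据送往汇点(工具、用户)前的策略检查。"""
    return (src.secrecy <= sink_max_secrecy
            and src.integrity >= sink_min_integrity)
```

例如,机密数据一旦与低完整性的网页内容混合,派生结果就无法再流向要求高完整性的汇点,从而缓解提示注入造成的危害。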
[AI-12] Towards Explainable Sequential Learning
【速读】:该论文试图解决多变量时间序列分类中数值驱动与事件驱动数据处理之间的衔接问题,旨在实现可解释的人工智能结果。解决方案的关键在于提出了一种混合可解释的时间数据处理流程——DataFul Explainable MultivariatE coRrelatIonal Temporal Artificial inTElligence (EMeriTAte+DF),通过验证的人工智能原理将数值型时间数据分类与事件型分类相融合,并引入后验可解释阶段来描述数值输入数据中的并发组成成分及其数值负载,同时扩展了事件驱动文献以设计支持并发组成的规范挖掘算法。
链接: https://arxiv.org/abs/2505.23624
作者: Giacomo Bergami,Emma Packer,Kirsty Scott,Silvia Del Din
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper offers a hybrid explainable temporal data processing pipeline, DataFul Explainable MultivariatE coRrelatIonal Temporal Artificial inTElligence (EMeriTAte+DF), bridging numerical-driven temporal data classification with an event-based one through verified artificial intelligence principles, enabling human-explainable results. This was possible through a preliminary a posteriori explainable phase describing the numerical input data in terms of concurrent constituents with numerical payloads. This further required extending the event-based literature to design specification mining algorithms supporting concurrent constituents. Our previous and current solutions outperform state-of-the-art solutions for multivariate time series classifications, thus showcasing the effectiveness of the proposed methodology.
zh
[AI-13] MAPLE: A Mobile Assistant with Persistent Finite State Machines for Recovery Reasoning
【速读】:该论文旨在解决移动图形用户界面(GUI)代理在执行跨应用任务时缺乏上下文理解、无法检测意外结果及错误恢复能力不足的问题。现有代理仅能基于当前屏幕进行推理,缺少对应用导航流程的结构化建模。解决方案的关键在于提出MAPLE框架,该框架将应用交互抽象为有限状态机(FSM),通过计算建模每个UI屏幕为离散状态,用户操作作为转移,从而提供结构化的应用执行表示,并由多个专业代理协作实现实时动态构建FSM,以跟踪导航进度、验证操作结果并回滚至稳定状态进行错误恢复。
链接: https://arxiv.org/abs/2505.23596
作者: Linqiang Guo,Wei Liu,Yi Wen Heng,Tse-Hsun(Peter)Chen,Yang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Mobile GUI agents aim to autonomously complete user-instructed tasks across mobile apps. Recent advances in Multimodal Large Language Models (MLLMs) enable these agents to interpret UI screens, identify actionable elements, and perform interactions such as tapping or typing. However, existing agents remain reactive: they reason only over the current screen and lack a structured model of app navigation flow, limiting their ability to understand context, detect unexpected outcomes, and recover from errors. We present MAPLE, a state-aware multi-agent framework that abstracts app interactions as a Finite State Machine (FSM). We computationally model each UI screen as a discrete state and user actions as transitions, allowing the FSM to provide a structured representation of the app execution. MAPLE consists of specialized agents responsible for four phases of task execution: planning, execution, verification, error recovery, and knowledge retention. These agents collaborate to dynamically construct FSMs in real time based on perception data extracted from the UI screen, allowing the GUI agents to track navigation progress and flow, validate action outcomes through pre- and post-conditions of the states, and recover from errors by rolling back to previously stable states. Our evaluation results on two challenging cross-app benchmarks, Mobile-Eval-E and SPA-Bench, show that MAPLE outperforms the state-of-the-art baseline, improving task success rate by up to 12%, recovery success by 13.8%, and action accuracy by 6.5%. Our results highlight the importance of structured state modeling in guiding mobile GUI agents during task execution. Moreover, our FSM representation can be integrated into future GUI agent architectures as a lightweight, model-agnostic memory layer to support structured planning, execution verification, and error recovery.
zh
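MAPLE 将每个 UI 屏幕建模为离散状态、用户操作建模为转移,并可回滚到稳定状态进行错误恢复。这一思路可用如下最小 FSM 草图说明(非论文原实现,接口与命名均为假设):

```python
class AppFSM:
    """最小应用 FSM:状态 = 屏幕签名,操作 = 转移;
    标记为稳定的状态作为检查点,供错误恢复回滚。"""
    def __init__(self, start):
        self.state = start
        self.transitions = {}          # (状态, 操作) -> 后继状态
        self.checkpoints = [start]     # 稳定状态栈

    def record(self, action, next_state, stable=False):
        """执行期间实时记录一次转移,动态构建 FSM。"""
        self.transitions[(self.state, action)] = next_state
        self.state = next_state
        if stable:
            self.checkpoints.append(next_state)

    def rollback(self):
        """遇到意外屏幕(如弹窗)时,回到最近的稳定状态。"""
        self.state = self.checkpoints[-1]
        return self.state
```

验证代理可在每次转移前后检查状态的先决/后置条件,失败时触发 rollback,再由错误恢复代理重新规划。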
[AI-14] Collaborative Last-Mile Delivery: A Multi-Platform Vehicle Routing Problem With En-route Charging
【速读】:该论文试图解决协同物流中的多平台车辆路径问题(VRP-DR),旨在通过整合卡车、无人机和机器人,实现高效、低成本的末端配送。解决方案的关键在于引入一种新型的协同同步多平台车辆路径模型,并结合灵活的能源补给机制,利用卡车作为移动平台,实现无人机和机器人的途中充电、调度与回收,从而克服其载荷能力有限、续航距离短及电池约束等问题。此外,该研究还提出了一种可扩展的启发式算法FINDER,以高效求解大规模实例,提升整体运营效率。
链接: https://arxiv.org/abs/2505.23584
作者: Sumbal Malik,Majid Khonji,Khaled Elbassioni,Jorge Dias
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:The rapid growth of e-commerce and the increasing demand for timely, cost-effective last-mile delivery have increased interest in collaborative logistics. This research introduces a novel collaborative synchronized multi-platform vehicle routing problem with drones and robots (VRP-DR), where a fleet of M trucks, N drones, and K robots cooperatively delivers parcels. Trucks serve as mobile platforms, enabling the launching, retrieving, and en-route charging of drones and robots, thereby addressing critical limitations such as restricted payload capacities, limited range, and battery constraints. The VRP-DR incorporates five realistic features: (1) multi-visit service per trip, (2) multi-trip operations, (3) flexible docking, allowing returns to the same or different trucks; (4) cyclic and acyclic operations, enabling return to the same or different nodes; and (5) en-route charging, enabling drones and robots to recharge while being transported on the truck, maximizing operational efficiency by utilizing idle transit time. The VRP-DR is formulated as a mixed-integer linear program (MILP) to minimize both operational costs and makespan. To overcome the computational challenges of solving large-scale instances, a scalable heuristic algorithm, FINDER (Flexible INtegrated Delivery with Energy Recharge), is developed to provide efficient, near-optimal solutions. Numerical experiments across various instance sizes evaluate the performance of the MILP and heuristic approaches in terms of solution quality and computation time. The results demonstrate significant time savings of the combined delivery mode over the truck-only mode and substantial cost reductions from enabling multi-visits. The study also provides insights into the effects of en-route charging, docking flexibility, drone count, speed, and payload capacity on system performance.
zh
[AI-15] Engineering Serendipity through Recommendations of Items with Atypical Aspects
【速读】:该论文试图解决传统推荐系统中缺乏意外性(serendipity)的问题,即如何通过推荐具有非典型特征的物品来增强用户的惊喜体验。解决方案的关键在于构建一个基于大语言模型(LLM)的系统流程,该流程从物品评论中提取非典型特征,并估计其用户特定的效用,从而计算出一种衡量意外性潜力的指标,用于对推荐列表进行重排序。
链接: https://arxiv.org/abs/2505.23580
作者: Ramit Aditya,Razvan Bunescu,Smita Nannaware,Erfan Al-Hossami
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 25 pages of content + references and appendix. arXiv admin note: text overlap with arXiv:2311.02702
Abstract:A restaurant dinner or a hotel stay may lead to memorable experiences when guests encounter unexpected aspects that also match their interests. For example, an origami-making station in the waiting area of a restaurant may be both surprising and enjoyable for a customer who is passionate about paper crafts. Similarly, an exhibit of 18th century harpsichords would be atypical for a hotel lobby and likely pique the interest of a guest who has a passion for Baroque music. Motivated by this insight, in this paper we introduce the new task of engineering serendipity through recommendations of items with atypical aspects. We describe an LLM-based system pipeline that extracts atypical aspects from item reviews, then estimates and aggregates their user-specific utility in a measure of serendipity potential that is used to rerank a list of items recommended to the user. To facilitate system development and evaluation, we introduce a dataset of Yelp reviews that are manually annotated with atypical aspects and a dataset of artificially generated user profiles, together with crowdsourced annotations of user-aspect utility values. Furthermore, we introduce a custom procedure for dynamic selection of in-context learning examples, which is shown to improve LLM-based judgments of atypicality and utility. Experimental evaluations show that serendipity-based rankings generated by the system are highly correlated with ground truth rankings for which serendipity scores are computed from manual annotations of atypical aspects and their user-dependent utility. Overall, we hope that the new recommendation task and the associated system presented in this paper catalyze further research into recommendation approaches that go beyond accuracy in their pursuit of enhanced user satisfaction. The datasets and the code are made publicly available at this https URL .
zh
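该系统从评论中抽取非典型特征,聚合其用户特定效用得到意外性潜力,再对推荐列表重排序。重排序这一步可示意如下(线性加权与字段名仅为演示假设,论文中的聚合与排序方式可能不同):

```python
def rerank_by_serendipity(items, utility, w=0.5):
    """对基础推荐列表重排序:每个物品带有从评论中挖掘的
    atypical_aspects;意外性潜力 = 这些特征的用户特定效用之和,
    再与基础得分线性混合。"""
    def serendipity(item):
        return sum(utility.get(a, 0.0) for a in item["atypical_aspects"])
    return sorted(items,
                  key=lambda it: (1 - w) * it["base_score"] + w * serendipity(it),
                  reverse=True)
```

对纸艺爱好者而言,带「折纸体验台」这一非典型特征的餐厅即便基础得分较低,也可能被排到前面。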
[AI-16] Cognitive Guardrails for Open-World Decision Making in Autonomous Drone Swarms
【速读】:该论文试图解决小型无人航空系统(sUAS)在灾难响应任务中,传统计算机视觉系统难以识别未知物体或推断其对任务规划的相关性的问题。解决方案的关键是引入大型语言模型(LLMs)来推理检测到的物体及其影响,同时通过认知防护机制(cognitive guardrails)确保在不确定性下的安全和合理决策。
链接: https://arxiv.org/abs/2505.23576
作者: Jane Cleland-Huang,Pedro Antonio Alarcon Granadeno,Arturo Miguel Russell Bernal,Demetrius Hernandez,Michael Murphy,Maureen Petterson,Walter Scheirer
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 16 pages, 8 figures
Abstract:Small Uncrewed Aerial Systems (sUAS) are increasingly deployed as autonomous swarms in search-and-rescue and other disaster-response scenarios. In these settings, they use computer vision (CV) to detect objects of interest and autonomously adapt their missions. However, traditional CV systems often struggle to recognize unfamiliar objects in open-world environments or to infer their relevance for mission planning. To address this, we incorporate large language models (LLMs) to reason about detected objects and their implications. While LLMs can offer valuable insights, they are also prone to hallucinations and may produce incorrect, misleading, or unsafe recommendations. To ensure safe and sensible decision-making under uncertainty, high-level decisions must be governed by cognitive guardrails. This article presents the design, simulation, and real-world integration of these guardrails for sUAS swarms in search-and-rescue missions.
zh
[AI-17] CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring
【速读】:该论文试图解决在自主性不断增强的AI模型中,如何有效检测其潜在有害行为的问题。解决方案的关键在于引入一种链式思维(Chain-of-Thought, CoT)监控机制,通过持续监督模型的中间推理步骤来提升对有害行为的检测能力。相较于仅审查最终输出的动作监控方法,CoT监控在某些场景下可将检测率提高27个百分点,但其在面对误导性推理时存在性能下降的问题。为解决这一问题,研究提出了一种混合协议,独立评估推理过程与最终输出,并通过加权平均进行整合,从而在所有测试模型和任务中均优于单一的CoT或动作监控方法。
链接: https://arxiv.org/abs/2505.23575
作者: Benjamin Arnav,Pablo Bernabeu-Pérez,Nathan Helm-Burger,Tim Kostolansky,Hannes Whittingham,Mary Phuong
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that CoT monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage. However, CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores both reasoning and final outputs and combines them using a weighted average. This hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios.
zh
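论文提出的混合协议对推理轨迹与最终输出分别独立打分,再按加权平均合并。合并与判定步骤本身很简单,可示意如下(权重与阈值为假设的示例值,并非论文的调优结果):

```python
def hybrid_suspicion(cot_score, action_score, w_cot=0.5):
    """混合监控得分:CoT 监控分与仅看动作的监控分的加权平均。"""
    return w_cot * cot_score + (1 - w_cot) * action_score

def flag(cot_score, action_score, threshold=0.5, w_cot=0.5):
    """混合得分达到阈值即标记为可疑,交由人工或更强监控复核。"""
    return hybrid_suspicion(cot_score, action_score, w_cot) >= threshold
```

这样,推理中暴露破坏意图但动作看似无害的「隐蔽破坏」会被 CoT 分拉高而触发;推理中包含误导性合理化、但动作明显可疑的情形则由动作分兜底。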
[AI-18] SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents
【速读】:该论文旨在解决生成式 AI 在科学发现自动化过程中引发的伦理与安全问题。其核心挑战在于如何在提升科学探索效率的同时,确保AI行为符合伦理规范并降低潜在风险。解决方案的关键在于提出 \textbfSafeScientist 框架,该框架通过集成多种防御机制(如提示监控、代理协作监控、工具使用监控及伦理审查组件)实现对研究过程的全面安全监督,并主动拒绝高风险或不道德的任务,从而在不牺牲科学产出质量的前提下显著提升安全性。
链接: https://arxiv.org/abs/2505.23559
作者: Kunlun Zhu,Jiaxun Zhang,Ziheng Qi,Nuoxing Shang,Zijia Liu,Peixuan Han,Yue Su,Haofei Yu,Jiaxuan You
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in large language model (LLM) agents have significantly accelerated scientific discovery automation, yet concurrently raised critical ethical and safety concerns. To systematically address these challenges, we introduce \textbfSafeScientist, an innovative AI scientist framework explicitly designed to enhance safety and ethical responsibility in AI-driven scientific exploration. SafeScientist proactively refuses ethically inappropriate or high-risk tasks and rigorously emphasizes safety throughout the research process. To achieve comprehensive safety oversight, we integrate multiple defensive mechanisms, including prompt monitoring, agent-collaboration monitoring, tool-use monitoring, and an ethical reviewer component. Complementing SafeScientist, we propose \textbfSciSafetyBench, a novel benchmark specifically designed to evaluate AI safety in scientific contexts, comprising 240 high-risk scientific tasks across 6 domains, alongside 30 specially designed scientific tools and 120 tool-related risk tasks. Extensive experiments demonstrate that SafeScientist significantly improves safety performance by 35% compared to traditional AI scientist frameworks, without compromising scientific output quality. Additionally, we rigorously validate the robustness of our safety pipeline against diverse adversarial attack methods, further confirming the effectiveness of our integrated approach. The code and data will be available at this https URL. \textcolorredWarning: this paper contains example data that may be offensive or harmful.
[AI-19] Sustainable Carbon-Aware and Water-Efficient LLM Scheduling in Geo-Distributed Cloud Datacenters
【Quick Read】: This paper tackles the environmental impact of serving Large Language Models (LLMs) at inference time, including sustainability challenges such as high carbon emissions, heavy water consumption, and high energy costs. The key to the solution is a novel framework called SLIT, which uses a machine-learning (ML) driven metaheuristic to co-optimize LLM quality of service (time-to-first-token), carbon emissions, water usage, and energy costs across geo-distributed cloud datacenters.
Link: https://arxiv.org/abs/2505.23554
Authors: Hayden Moore,Sirui Qi,Ninad Hogade,Dejan Milojicic,Cullen Bash,Sudeep Pasricha
Institutions: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:In recent years, Large Language Models (LLM) such as ChatGPT, CoPilot, and Gemini have been widely adopted in different areas. As the use of LLMs continues to grow, many efforts have focused on reducing the massive training overheads of these models. But it is the environmental impact of handling user requests to LLMs that is increasingly becoming a concern. Recent studies estimate that the costs of operating LLMs in their inference phase can exceed training costs by 25x per year. As LLMs are queried incessantly, the cumulative carbon footprint for the operational phase has been shown to far exceed the footprint during the training phase. Further, estimates indicate that 500 ml of fresh water is expended for every 20-50 requests to LLMs during inference. To address these important sustainability issues with LLMs, we propose a novel framework called SLIT to co-optimize LLM quality of service (time-to-first token), carbon emissions, water usage, and energy costs. The framework utilizes a machine learning (ML) based metaheuristic to enhance the sustainability of LLM hosting across geo-distributed cloud datacenters. Such a framework will become increasingly vital as LLMs proliferate.
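The abstract does not give SLIT's actual objective function, but the co-optimization it describes can be sketched as a weighted multi-criteria score over candidate datacenters. Everything below (metric names, weights, and values) is a hypothetical illustration, not the paper's formulation:

```python
# Hypothetical multi-criteria scoring for routing one LLM inference request
# across geo-distributed datacenters. Metrics, weights, and values are
# illustrative assumptions, not SLIT's actual model.

def normalize(values):
    """Min-max normalize a list to [0, 1]; lower raw value -> lower score."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def rank_datacenters(datacenters, weights):
    """Return datacenter names sorted by weighted cost (lower is better)."""
    metrics = ["ttft_ms", "carbon_g", "water_ml", "energy_cost"]
    normalized = {m: normalize([d[m] for d in datacenters]) for m in metrics}
    scored = []
    for i, d in enumerate(datacenters):
        cost = sum(weights[m] * normalized[m][i] for m in metrics)
        scored.append((cost, d["name"]))
    return [name for _, name in sorted(scored)]

datacenters = [
    {"name": "us-east", "ttft_ms": 120, "carbon_g": 30, "water_ml": 12, "energy_cost": 0.9},
    {"name": "eu-north", "ttft_ms": 180, "carbon_g": 5, "water_ml": 4, "energy_cost": 1.1},
    {"name": "ap-south", "ttft_ms": 90, "carbon_g": 45, "water_ml": 20, "energy_cost": 0.7},
]
weights = {"ttft_ms": 0.4, "carbon_g": 0.3, "water_ml": 0.2, "energy_cost": 0.1}
ranking = rank_datacenters(datacenters, weights)
```

A real scheduler would feed measured time-to-first-token, grid carbon intensity, and water usage into such a score, and the ML-based metaheuristic the paper mentions would decide placement dynamically rather than rank static numbers.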
[AI-20] Synchronizing Process Model and Event Abstraction for Grounded Process Intelligence (Extended Version)
【Quick Read】: This paper addresses the problem that, in process intelligence projects, abstracting a model discovered from event data leaves no way to synchronously abstract the underlying event log, losing the grounding in real-world behavior and limiting the depth of analysis. The key to the solution is a formal basis for synchronized model and event abstraction: the authors prove that abstracting a process model via model abstraction (MA) and rediscovering a process model from the abstracted event log yield equivalent process models, ensuring behavioral consistency throughout the abstraction.
Link: https://arxiv.org/abs/2505.23536
Authors: Janik-Vasily Benzin,Gyunam Park,Stefanie Rinderle-Ma
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Model abstraction (MA) and event abstraction (EA) are means to reduce complexity of (discovered) models and event data. Imagine a process intelligence project that aims to analyze a model discovered from event data which is further abstracted, possibly multiple times, to reach optimality goals, e.g., reducing model size. So far, after discovering the model, there is no technique that enables the synchronized abstraction of the underlying event log. This results in losing the grounding in the real-world behavior contained in the log and, in turn, restricts analysis insights. Hence, in this work, we provide the formal basis for synchronized model and event abstraction, i.e., we prove that abstracting a process model by MA and discovering a process model from an abstracted event log yields an equivalent process model. We prove the feasibility of our approach based on behavioral profile abstraction as non-order preserving MA technique, resulting in a novel EA technique.
[AI-21] Subgraph Gaussian Embedding Contrast for Self-Supervised Graph Representation Learning
【Quick Read】: This paper targets the reliance of Graph Representation Learning (GRL) on human-annotated data and proposes a new Self-Supervised Learning (SSL) method, Subgraph Gaussian Embedding Contrast (SubGEC). The key to the solution is a subgraph Gaussian embedding module that adaptively maps subgraphs into a structured Gaussian space, preserving the characteristics of the input subgraphs while generating subgraphs with a controlled distribution. In addition, optimal transport distances (the Wasserstein and Gromov-Wasserstein distances) are used to measure similarity between subgraphs, strengthening the robustness of the contrastive learning process.
Link: https://arxiv.org/abs/2505.23529
Authors: Shifeng Xie,Aref Einizade,Jhony H. Giraldo
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Graph Representation Learning (GRL) is a fundamental task in machine learning, aiming to encode high-dimensional graph-structured data into low-dimensional vectors. Self-Supervised Learning (SSL) methods are widely used in GRL because they can avoid expensive human annotation. In this work, we propose a novel Subgraph Gaussian Embedding Contrast (SubGEC) method. Our approach introduces a subgraph Gaussian embedding module, which adaptively maps subgraphs to a structured Gaussian space, ensuring the preservation of input subgraph characteristics while generating subgraphs with a controlled distribution. We then employ optimal transport distances, more precisely the Wasserstein and Gromov-Wasserstein distances, to effectively measure the similarity between subgraphs, enhancing the robustness of the contrastive learning process. Extensive experiments across multiple benchmarks demonstrate that SubGEC outperforms or presents competitive performance against state-of-the-art approaches. Our findings provide insights into the design of SSL methods for GRL, emphasizing the importance of the distribution of the generated contrastive pairs.
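As a concrete reference point for the distances mentioned above: for one-dimensional empirical distributions, the 1-Wasserstein distance reduces to the mean absolute difference of sorted samples. The toy "embeddings" below are illustrative; SubGEC operates on multi-dimensional subgraph embeddings and also uses the Gromov-Wasserstein distance, which this sketch does not cover:

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.
    For 1-D samples the optimal transport plan matches order statistics,
    so the distance is the mean absolute difference of sorted values."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Toy "subgraph embeddings": scalar features from three subgraphs.
anchor = [0.1, 0.2, 0.3, 0.4]
positive = [0.15, 0.25, 0.35, 0.45]   # slightly shifted copy of the anchor
negative = [1.1, 1.2, 1.3, 1.4]       # far-away distribution

d_pos = wasserstein_1d(anchor, positive)  # small: similar distributions
d_neg = wasserstein_1d(anchor, negative)  # large: dissimilar distributions
```

A contrastive loss would then pull `d_pos` down and push `d_neg` up, which is the role these distances play in the paper's training objective.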
[AI-22] Individual differences in the cognitive mechanisms of planning strategy discovery
【Quick Read】: This paper asks how metacognitive reinforcement learning (MCRL) models can be brought closer to human performance in strategy discovery. The key to the solution is to incorporate cognitive mechanisms that may facilitate human strategy discovery, such as intrinsically generated metacognitive pseudo-rewards, subjective effort valuation, and termination of deliberation. Experiments show these mechanisms facilitate strategy discovery, but they do not fully close the gap between model and human performance, motivating further exploration of additional factors.
Link: https://arxiv.org/abs/2505.23519
Authors: Ruiqi He,Falk Lieder
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:People employ efficient planning strategies. But how are these strategies acquired? Previous research suggests that people can discover new planning strategies through learning from reinforcements, a process known as metacognitive reinforcement learning (MCRL). While prior work has shown that MCRL models can learn new planning strategies and explain more participants’ experience-driven discovery better than alternative mechanisms, it also revealed significant individual differences in metacognitive learning. Furthermore, when fitted to human data, these models exhibit a slower rate of strategy discovery than humans. In this study, we investigate whether incorporating cognitive mechanisms that might facilitate human strategy discovery can bring models of MCRL closer to human performance. Specifically, we consider intrinsically generated metacognitive pseudo-rewards, subjective effort valuation, and termination deliberation. Analysis of planning task data shows that a larger proportion of participants used at least one of these mechanisms, with significant individual differences in their usage and varying impacts on strategy discovery. Metacognitive pseudo-rewards, subjective effort valuation, and learning the value of acting without further planning were found to facilitate strategy discovery. While these enhancements provided valuable insights into individual differences and the effect of these mechanisms on strategy discovery, they did not fully close the gap between model and human performance, prompting further exploration of additional factors that people might use to discover new planning strategies.
[AI-23] RAP: Targeted Redirecting of Agent ic Preferences
【Quick Read】: This paper addresses adversarial attacks on autonomous agentic AI systems during cross-modal reasoning, in particular semantic-level vulnerabilities of vision-language models (VLMs). Existing attacks typically rely on visible pixel perturbations or require access to model internals, making stealthy real-world exploitation impractical. The key to the proposed solution, the TRAP framework, is diffusion-based semantic injection: combining negative-prompt degradation with positive semantic optimization to generate visually natural images that nonetheless induce consistent selection biases in agentic systems, achieving effective and stealthy adversarial attacks without access to model internals.
Link: https://arxiv.org/abs/2505.23518
Authors: Hangoo Kang,Jehyeok Yeon,Gagandeep Singh
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Autonomous agentic AI systems powered by vision-language models (VLMs) are rapidly advancing toward real-world deployment, yet their cross-modal reasoning capabilities introduce new attack surfaces for adversarial manipulation that exploit semantic reasoning across modalities. Existing adversarial attacks typically rely on visible pixel perturbations or require privileged model or environment access, making them impractical for stealthy, real-world exploitation. We introduce TRAP, a generative adversarial framework that manipulates the agent’s decision-making using diffusion-based semantic injections. Our method combines negative prompt-based degradation with positive semantic optimization, guided by a Siamese semantic network and layout-aware spatial masking. Without requiring access to model internals, TRAP produces visually natural images yet induces consistent selection biases in agentic AI systems. We evaluate TRAP on the Microsoft Common Objects in Context (COCO) dataset, building multi-candidate decision scenarios. Across these scenarios, TRAP achieves a 100% attack success rate on leading models, including LLaVA-34B, Gemma3, and Mistral-3.1, significantly outperforming baselines such as SPSA, Bandit, and standard diffusion approaches. These results expose a critical vulnerability: Autonomous agents can be consistently misled through human-imperceptible cross-modal manipulations. These findings highlight the need for defense strategies beyond pixel-level robustness to address semantic vulnerabilities in cross-modal decision-making.
[AI-24] A Robot-Assisted Approach to Small Talk Training for Adults with ASD
【Quick Read】: This paper addresses the difficulties adults with Autism Spectrum Disorder (ASD) face in everyday social interaction, particularly small talk. The key to the solution is an in-home autonomous robot system that lets users practice conversational skills in a safe, controlled environment; study results show the system effectively improves users' conversation initiation and eye contact and is viewed by users as a valuable tool for enhancing their social skills.
Link: https://arxiv.org/abs/2505.23508
Authors: Rebecca Ramnauth,Dražen Brščić,Brian Scassellati
Institutions: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in Robotics: Science and Systems (RSS) 2025, 14 pages, 4 figures
Abstract:From dating to job interviews, making new friends or simply chatting with the cashier at checkout, engaging in small talk is a vital, everyday social skill. For adults with Autism Spectrum Disorder (ASD), small talk can be particularly challenging, yet it is essential for social integration, building relationships, and accessing professional opportunities. In this study, we present our development and evaluation of an in-home autonomous robot system that allows users to practice small talk. Results from the week-long study show that adults with ASD enjoyed the training, made notable progress in initiating conversations and improving eye contact, and viewed the system as a valuable tool for enhancing their conversational skills.
[AI-25] Autoformalization in the Era of Large Language Models : A Survey
【Quick Read】: This paper surveys autoformalization, the task of transforming informal mathematical propositions into verifiable formal representations, a foundational problem in automated theorem proving. The key to progress in this area lies in leveraging advances in artificial intelligence, especially large language models (LLMs), through an end-to-end workflow covering data preprocessing, model design, and evaluation, and in using autoformalization to improve the verifiability of LLM-generated outputs, thereby enhancing their trustworthiness and reasoning capabilities.
Link: https://arxiv.org/abs/2505.23486
Authors: Ke Weng,Lun Du,Sirui Li,Wangyue Lu,Haozhe Sun,Hengyu Liu,Tiancheng Zhang
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Autoformalization, the process of transforming informal mathematical propositions into verifiable formal representations, is a foundational task in automated theorem proving, offering a new perspective on the use of mathematics in both theoretical and applied domains. Driven by the rapid progress in artificial intelligence, particularly large language models (LLMs), this field has witnessed substantial growth, bringing both new opportunities and unique challenges. In this survey, we provide a comprehensive overview of recent advances in autoformalization from both mathematical and LLM-centric perspectives. We examine how autoformalization is applied across various mathematical domains and levels of difficulty, and analyze the end-to-end workflow from data preprocessing to model design and evaluation. We further explore the emerging role of autoformalization in enhancing the verifiability of LLM-generated outputs, highlighting its potential to improve both the trustworthiness and reasoning capabilities of LLMs. Finally, we summarize key open-source models and datasets supporting current research, and discuss open challenges and promising future directions for the field.
[AI-26] EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions
【Quick Read】: This paper addresses the problem of large language models (LLMs) refusing to respond to semantically harmless inputs, a phenomenon known as over-refusal, typically caused by conservative safety alignment and severely harming user experience. The key to the solution is EVOREFUSE, a prompt-optimization approach based on an evolutionary algorithm that generates diverse pseudo-malicious instructions consistently eliciting confident refusals from LLMs. EVOREFUSE explores the instruction space in more diverse directions via mutation strategies and recombination, and iteratively evolves seed instructions to maximize an evidence lower bound on LLM refusal probability.
Link: https://arxiv.org/abs/2505.23473
Authors: Xiaorui Wu,Xiaofeng Mao,Fei Li,Xin Zhang,Xiaolu Zhang,Jun Zhou,Yuxiang Peng,Li Zheng,Chong Teng,Donghong Ji,Zhuang Li
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries triggering unnecessary LLM refusals due to conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, like manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions consistently eliciting confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm exploring the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively evolves seed instructions to maximize evidence lower bound on LLM refusal probability. Using EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582 pseudo-malicious instructions that outperforms the next-best benchmark with 140.41% higher average refusal triggering rate across 9 LLMs, 34.86% greater lexical diversity, and 40.03% improved LLM response confidence scores; and EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with responses for supervised and preference-based alignment training. LLAMA3.1-8B-INSTRUCT fine-tuned with supervision on EVOREFUSE-ALIGN achieves up to 14.31% fewer over-refusals than models trained on the second-best alignment dataset, without compromising safety. Our analysis with EVOREFUSE-TEST reveals models trigger over-refusals by overly focusing on sensitive keywords while ignoring broader context.
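The evolutionary loop described above can be sketched as follows. The refusal scorer here is a hypothetical stand-in (a keyword heuristic) for the paper's LLM-based evidence-lower-bound objective, and the mutation and recombination operators are simplified word-level edits:

```python
import random

# Sketch of the evolutionary search pattern: mutate and recombine seed
# instructions, keep the candidates that maximize a refusal score.
SENSITIVE = {"security", "bypass", "weapon", "exploit"}

def refusal_score(prompt):
    """Mock objective: fraction of words that sound 'sensitive'. A real
    system would estimate an LLM's refusal probability instead."""
    words = prompt.lower().split()
    return sum(w in SENSITIVE for w in words) / len(words)

def mutate(prompt, rng):
    """Replace one word with a sensitive-sounding one."""
    words = prompt.split()
    words[rng.randrange(len(words))] = rng.choice(sorted(SENSITIVE))
    return " ".join(words)

def recombine(a, b, rng):
    """One-point crossover between two prompts."""
    wa, wb = a.split(), b.split()
    cut = rng.randrange(1, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])

def evolve(seeds, generations=10, population=8, seed=0):
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(generations):
        children = [mutate(rng.choice(pool), rng) for _ in range(population)]
        children += [recombine(rng.choice(pool), rng.choice(pool), rng)
                     for _ in range(population)]
        pool = sorted(pool + children, key=refusal_score, reverse=True)[:population]
    return pool[0]

seeds = ["how do I test my home network", "explain airport safety rules"]
best = evolve(seeds)
```

The selection pressure drives the pool toward prompts the scorer deems refusal-inducing, mirroring (in miniature) how the paper evolves seed instructions toward higher refusal probability.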
[AI-27] Bounded-Abstention Pairwise Learning to Rank
【Quick Read】: This paper aims to improve the reliability of ranking-system decisions in high-stakes domains, where decisions can have substantial economic and social impact, by introducing a safety mechanism. The key to the solution is a novel abstention method for pairwise learning-to-rank: the system thresholds the ranker's conditional risk and abstains from making a decision when the estimated risk exceeds a predefined threshold, deferring uncertain judgments to human experts.
Link: https://arxiv.org/abs/2505.23437
Authors: Antonio Ferrara,Andrea Pugnana,Francesco Bonchi,Salvatore Ruggieri
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Ranking systems influence decision-making in high-stakes domains like health, education, and employment, where they can have substantial economic and social impacts. This makes the integration of safety mechanisms essential. One such mechanism is abstention, which enables an algorithmic decision-making system to defer uncertain or low-confidence decisions to human experts. While abstention has been predominantly explored in the context of classification tasks, its application to other machine learning paradigms remains underexplored. In this paper, we introduce a novel method for abstention in pairwise learning-to-rank tasks. Our approach is based on thresholding the ranker’s conditional risk: the system abstains from making a decision when the estimated risk exceeds a predefined threshold. Our contributions are threefold: a theoretical characterization of the optimal abstention strategy, a model-agnostic, plug-in algorithm for constructing abstaining ranking models, and comprehensive empirical evaluations across multiple datasets, demonstrating the effectiveness of our approach.
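The thresholding rule described above can be sketched for a single pairwise comparison. The threshold value and the interface are illustrative assumptions, not the paper's algorithm:

```python
def pairwise_decision(p_a_beats_b, threshold=0.2):
    """Plug-in abstaining ranker for one pairwise comparison.

    p_a_beats_b: the model's estimated probability that item A ranks above B.
    The conditional risk of the Bayes-optimal decision is min(p, 1 - p);
    the system abstains (defers to a human) when that risk exceeds the
    threshold, otherwise it commits to the more likely ordering."""
    risk = min(p_a_beats_b, 1.0 - p_a_beats_b)
    if risk > threshold:
        return "abstain"
    return "A" if p_a_beats_b >= 0.5 else "B"

# Confident pairs are ranked; the uncertain middle case is deferred.
decisions = [pairwise_decision(p) for p in (0.95, 0.60, 0.10)]
```

Lowering the threshold trades coverage for reliability: more pairs are deferred, but the pairs the system does rank carry lower estimated risk.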
[AI-28] Emergent Risk Awareness in Rational Agents under Resource Constraints
【Quick Read】: This paper studies how survival pressure changes the utility-driven behavior of AI agents operating under resource or failure constraints, and the potential misalignment with human objectives that results. The key to the solution is formalizing this setting through a survival bandit framework, analyzing the impact of survival-driven preference shifts, identifying the conditions under which misalignment emerges, and proposing mechanisms to mitigate the emergence of risk-seeking or risk-averse behaviors, thereby improving the interpretability and safety of AI agents deployed in critical resource-limited environments.
Link: https://arxiv.org/abs/2505.23436
Authors: Daniel Jarne Ornia,Nicholas Bishop,Joel Dyer,Wei-Chen Lee,Ani Calinescu,Doyne Farmer,Michael Wooldridge
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Advanced reasoning models with agentic capabilities (AI agents) are deployed to interact with humans and to solve sequential decision-making problems under (approximate) utility functions and internal models. When such problems have resource or failure constraints where action sequences may be forcibly terminated once resources are exhausted, agents face implicit trade-offs that reshape their utility-driven (rational) behaviour. Additionally, since these agents are typically commissioned by a human principal to act on their behalf, asymmetries in constraint exposure can give rise to previously unanticipated misalignment between human objectives and agent incentives. We formalise this setting through a survival bandit framework, provide theoretical and empirical results that quantify the impact of survival-driven preference shifts, identify conditions under which misalignment emerges and propose mechanisms to mitigate the emergence of risk-seeking or risk-averse behaviours. As a result, this work aims to increase understanding and interpretability of emergent behaviours of AI agents operating under such survival pressure, and offer guidelines for safely deploying such AI systems in critical resource-limited environments.
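A minimal simulation illustrates why survival constraints reshape rational preferences: an arm with the higher expected reward can still be the wrong choice once forcible termination at zero budget is possible. The arm parameters and budget below are hypothetical, not taken from the paper:

```python
import random

# Toy survival-bandit simulation: an agent with a finite budget repeatedly
# pulls an arm and is terminated the moment the budget reaches zero.
def ruin_probability(arm, budget=3, steps=30, trials=2000, seed=0):
    """Estimate the probability that the agent is terminated within `steps`."""
    rng = random.Random(seed)
    ruins = 0
    for _ in range(trials):
        b = budget
        for _ in range(steps):
            b += arm(rng)
            if b <= 0:           # resources exhausted: forced termination
                ruins += 1
                break
    return ruins / trials

safe = lambda rng: 0.1                                    # small sure gain
risky = lambda rng: 2.0 if rng.random() < 0.55 else -2.0  # higher mean (0.2), volatile

p_safe = ruin_probability(safe)
p_risky = ruin_probability(risky)
```

Although the risky arm has twice the expected per-step reward, its ruin probability is substantial while the safe arm's is zero, so a survival-constrained agent rationally shifts toward the safe arm even when its principal's objective is pure reward maximization.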
[AI-29] A Mathematical Framework for AI-Human Integration in Work ICML2025
【Quick Read】: This paper examines how generative AI (GenAI) complements or replaces human workers across job contexts, with the core question being how GenAI and human skills interact. The key to the solution is a mathematical framework that decomposes skills into decision-level and action-level subskills to reflect the complementary strengths of humans and GenAI, analyzes how changes in subskill abilities affect job success, and identifies conditions for sharp transitions in success probability, thereby revealing when and how GenAI effectively augments human productivity.
Link: https://arxiv.org/abs/2505.23432
Authors: Elisa Celis,Lingxiao Huang,Nisheeth K. Vishnoi
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Economics (econ.GN)
Comments: This paper will appear in ICML 2025
Abstract:The rapid rise of Generative AI (GenAI) tools has sparked debate over their role in complementing or replacing human workers across job contexts. We present a mathematical framework that models jobs, workers, and worker-job fit, introducing a novel decomposition of skills into decision-level and action-level subskills to reflect the complementary strengths of humans and GenAI. We analyze how changes in subskill abilities affect job success, identifying conditions for sharp transitions in success probability. We also establish sufficient conditions under which combining workers with complementary subskills significantly outperforms relying on a single worker. This explains phenomena such as productivity compression, where GenAI assistance yields larger gains for lower-skilled workers. We demonstrate the framework’s practicality using data from O*NET and Big-Bench Lite, aligning real-world data with our model via subskill-division methods. Our results highlight when and how GenAI complements human skills, rather than replacing them.
[AI-30] Enhanced DACER Algorithm with High Diffusion Efficiency
【Quick Read】: This paper addresses the low training and inference efficiency of diffusion policies in online reinforcement learning (RL), in particular the marked performance degradation when the number of diffusion steps is reduced. The key to the solution is a Q-gradient field objective, used as an auxiliary optimization target to guide the denoising process at each diffusion step, together with a temporal weighting mechanism that lets the model efficiently remove large-scale noise in early stages and refine actions in later stages, significantly improving training and inference efficiency while maintaining or even improving performance.
Link: https://arxiv.org/abs/2505.23426
Authors: Yinuo Wang,Mining Tan,Wenjun Zou,Haotian Lin,Xujie Song,Wenxuan Wang,Tong Liu,Likun Wang,Guojian Zhan,Tianze Zhu,Shiqi Liu,Jingliang Duan,Shengbo Eben Li
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Due to their expressive capacity, diffusion models have shown great promise in offline RL and imitation learning. Diffusion Actor-Critic with Entropy Regulator (DACER) extended this capability to online RL by using the reverse diffusion process as a policy approximator, trained end-to-end with policy gradient methods, achieving strong performance. However, this comes at the cost of requiring many diffusion steps, which significantly hampers training efficiency, while directly reducing the steps leads to noticeable performance degradation. Critically, the lack of inference efficiency becomes a significant bottleneck for applying diffusion policies in real-time online RL settings. To improve training and inference efficiency while maintaining or even enhancing performance, we propose a Q-gradient field objective as an auxiliary optimization target to guide the denoising process at each diffusion step. Nonetheless, we observe that the independence of the Q-gradient field from the diffusion time step negatively impacts the performance of the diffusion policy. To address this, we introduce a temporal weighting mechanism that enables the model to efficiently eliminate large-scale noise in the early stages and refine actions in the later stages. Experimental results on MuJoCo benchmarks and several multimodal tasks demonstrate that the DACER2 algorithm achieves state-of-the-art performance in most MuJoCo control tasks with only five diffusion steps, while also exhibiting stronger multimodality compared to DACER.
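The temporal weighting idea can be illustrated with a toy one-dimensional "action": a Q-gradient nudge toward the action optimum is scaled by a weight that is small at early (noisy) diffusion steps and grows toward t = 0. The dynamics below are illustrative stand-ins, not the DACER2 update rules:

```python
# Toy guided reverse diffusion over a scalar action.
T = 5
A_STAR = 1.0                          # hypothetical optimal action (argmax of Q)

def q_gradient(a):
    return -2.0 * (a - A_STAR)        # gradient of Q(a) = -(a - a*)^2

def temporal_weight(t):
    return 1.0 - t / T                # weak at early steps, strong near t = 0

def reverse_diffusion(a, guided, lr=0.1, clean_pred=0.5):
    """Crude denoising toward a (slightly off) clean prediction, optionally
    corrected by a time-weighted Q-gradient nudge at every step."""
    for t in range(T - 1, -1, -1):    # t = 4, 3, 2, 1, 0
        a = a + 0.5 * (clean_pred - a)                    # denoising step
        if guided:
            a = a + lr * temporal_weight(t) * q_gradient(a)  # guidance
    return a

a_unguided = reverse_diffusion(8.0, guided=False)
a_guided = reverse_diffusion(8.0, guided=True)
```

The guided trajectory ends closer to the Q-optimum than the unguided one, and because the weight suppresses guidance early on, the coarse denoising in the first steps is left to do its job before refinement takes over.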
[AI-31] From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents
【Quick Read】: This paper addresses the lack of effective long-term memory in AI agents for Software Engineering (SE), which prevents them from exploiting information from earlier tasks to improve performance on later ones. The key to the solution is Cross-Task-Instance Memory (CTIM), an episodic memory mechanism that shares and accumulates knowledge across task instances to strengthen the agent's reasoning and planning. The authors build CTIM-Rover on the Experiential Learning (EL) method ExpeL, using a Mixture-of-Experts (MoE) inspired approach to construct both general-purpose and repository-level CTIM. However, experiments show that the approach does not outperform AutoCodeRover in any configuration, suggesting that current methods scale poorly to real-world SE problems.
Link: https://arxiv.org/abs/2505.23422
Authors: Tobias Lindenbauer,Georg Groh,Hinrich Schütze
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Short Paper, REALM '25 camera-ready
Abstract:We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and repository-level Cross-Task-Instance Memory (CTIM). While existing open-source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or Code-Act (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long-term memory after a single task instance. As repository-level understanding is pivotal for identifying all locations requiring a patch for fixing a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. For this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024), proposing a Mixture-Of-Experts (MoEs) inspired approach to create both a general-purpose and repository-level CTIM. We find that CTIM-Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT-Bank (Lingam et al., 2024) scale to real-world SE problems. Our analysis indicates noise introduced by distracting CTIM items or exemplar trajectories as the likely source of the performance degradation.
[AI-32] oward Effective AI Governance: A Review of Principles
【Quick Read】: This paper addresses the lack of a systematic review of current AI governance frameworks and practices, aiming to identify the frameworks, principles, mechanisms, and stakeholder roles emphasized in the secondary literature. The key to the solution is a rapid tertiary review of nine peer-reviewed secondary studies from IEEE and ACM (2020-2024), using structured inclusion criteria and thematic semantic synthesis to distill the main directions in AI governance and to expose gaps in empirical validation and inclusivity.
Link: https://arxiv.org/abs/2505.23417
Authors: Danilo Ribeiro,Thayssa Rocha,Gustavo Pinto,Bruno Cartaxo,Marcelo Amaral,Nicole Davila,Ana Camargo
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Artificial Intelligence (AI) governance is the practice of establishing frameworks, policies, and procedures to ensure the responsible, ethical, and safe development and deployment of AI systems. Although AI governance is a core pillar of Responsible AI, current literature still lacks synthesis across such governance frameworks and practices. Objective: To identify which frameworks, principles, mechanisms, and stakeholder roles are emphasized in secondary literature on AI governance. Method: We conducted a rapid tertiary review of nine peer-reviewed secondary studies from IEEE and ACM (2020-2024), using structured inclusion criteria and thematic semantic synthesis. Results: The most cited frameworks include the EU AI Act and NIST RMF; transparency and accountability are the most common principles. Few reviews detail actionable governance mechanisms or stakeholder strategies. Conclusion: The review consolidates key directions in AI governance and highlights gaps in empirical validation and inclusivity. Findings inform both academic inquiry and practical adoption in organizations.
[AI-33] Bidirectional predictive coding
【Quick Read】: This paper addresses the degraded performance of classical predictive coding (PC) models on tasks requiring bidirectional information flow, and asks how generative and discriminative inference can be combined within a biologically plausible framework. The key to the solution is bidirectional predictive coding (bPC), a model that integrates both generative and discriminative inference while maintaining a biologically plausible circuit implementation, and that develops an energy landscape simultaneously suited to both kinds of task, matching or outperforming unidirectional models on a range of biologically relevant tasks.
Link: https://arxiv.org/abs/2505.23415
Authors: Gaspard Oliviers,Mufeng Tang,Rafal Bogacz
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Predictive coding (PC) is an influential computational model of visual learning and inference in the brain. Classical PC was proposed as a top-down generative model, where the brain actively predicts upcoming visual inputs, and inference minimises the prediction errors. Recent studies have also shown that PC can be formulated as a discriminative model, where sensory inputs predict neural activities in a feedforward manner. However, experimental evidence suggests that the brain employs both generative and discriminative inference, while unidirectional PC models show degraded performance in tasks requiring bidirectional processing. In this work, we propose bidirectional PC (bPC), a PC model that incorporates both generative and discriminative inference while maintaining a biologically plausible circuit implementation. We show that bPC matches or outperforms unidirectional models in their specialised generative or discriminative tasks, by developing an energy landscape that simultaneously suits both tasks. We also demonstrate bPC’s superior performance in two biologically relevant tasks including multimodal learning and inference with missing information, suggesting that bPC resembles biological visual inference more closely.
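The idea of a single energy serving both inference directions can be sketched with scalar linear maps: a latent z is inferred by descending an energy that sums a generative prediction error (z predicting x top-down) and a discriminative one (x predicting z bottom-up). The maps, weights, and step size below are illustrative, not the paper's circuit model:

```python
# Minimal bidirectional prediction-error energy over a scalar latent z.
W_GEN, W_DISC = 2.0, 0.5   # hypothetical generative / discriminative maps

def energy(z, x):
    e_gen = (x - W_GEN * z) ** 2    # top-down: z predicts the input x
    e_disc = (z - W_DISC * x) ** 2  # bottom-up: x predicts the latent z
    return e_gen + e_disc

def infer_latent(x, steps=100, lr=0.05):
    """Gradient descent on the joint energy with respect to z."""
    z = 0.0
    for _ in range(steps):
        # dE/dz = -2*W_GEN*(x - W_GEN*z) + 2*(z - W_DISC*x)
        grad = -2 * W_GEN * (x - W_GEN * z) + 2 * (z - W_DISC * x)
        z -= lr * grad
    return z

x = 4.0
z_hat = infer_latent(x)   # converges to the minimizer z* = 0.5 * x = 2.0
```

Because both error terms are minimized by the same z here, inference settles at a latent that is consistent with both the generative and the discriminative pathway, which is the intuition behind a shared bidirectional energy landscape.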
[AI-34] GAM-Agent : Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning
【Quick Read】: This paper addresses the insufficient reliability and interpretability of model predictions in vision-language reasoning (VLR). The key to the solution is the GAM-Agent framework, which, grounded in game theory, formulates reasoning as a non-zero-sum game between base agents (handling visual-perception subtasks) and a critical agent (verifying logical consistency and factual correctness). Agents communicate via structured claims, evidence, and uncertainty estimates, and an uncertainty-aware controller dynamically adjusts agent collaboration, triggering multi-round debates when disagreement or ambiguity is detected, thereby improving the robustness and interpretability of predictions.
Link: https://arxiv.org/abs/2505.23399
Authors: Jusheng Zhang,Yijia Fan,Wenjun Lin,Ruiqi Chen,Haoyi Jiang,Wenhao Chai,Jian Wang,Keze Wang
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:We propose GAM-Agent, a game-theoretic multi-agent framework for enhancing vision-language reasoning. Unlike prior single-agent or monolithic models, GAM-Agent formulates the reasoning process as a non-zero-sum game between base agents–each specializing in visual perception subtasks–and a critical agent that verifies logic consistency and factual correctness. Agents communicate via structured claims, evidence, and uncertainty estimates. The framework introduces an uncertainty-aware controller to dynamically adjust agent collaboration, triggering multi-round debates when disagreement or ambiguity is detected. This process yields more robust and interpretable predictions. Experiments on four challenging benchmarks–MMMU, MMBench, MVBench, and V*Bench–demonstrate that GAM-Agent significantly improves performance across various VLM backbones. Notably, GAM-Agent boosts the accuracy of small-to-mid scale models (e.g., Qwen2.5-VL-7B, InternVL3-14B) by 5–6%, and still enhances strong models like GPT-4o by up to 2–3%. Our approach is modular, scalable, and generalizable, offering a path toward reliable and explainable multi-agent multimodal reasoning.
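The uncertainty-aware controller described above can be sketched as a trigger on vote entropy and mean confidence across agent claims. The claim format and threshold values are illustrative assumptions, not GAM-Agent's actual protocol:

```python
import math
from collections import Counter

def vote_entropy(claims):
    """Shannon entropy (bits) of the answer distribution across agents."""
    counts = Counter(c["answer"] for c in claims)
    n = len(claims)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def needs_debate(claims, entropy_thresh=0.9, conf_thresh=0.6):
    """Trigger a multi-round debate when agents disagree (high vote
    entropy) or are collectively unsure (low mean confidence)."""
    disagreement = vote_entropy(claims)
    mean_conf = sum(c["confidence"] for c in claims) / len(claims)
    return disagreement > entropy_thresh or mean_conf < conf_thresh

agreed = [{"answer": "cat", "confidence": 0.9},
          {"answer": "cat", "confidence": 0.8},
          {"answer": "cat", "confidence": 0.85}]
split = [{"answer": "cat", "confidence": 0.7},
         {"answer": "dog", "confidence": 0.7},
         {"answer": "cat", "confidence": 0.6}]
```

On the unanimous, confident set no debate fires; on the split set the 2-vs-1 vote pushes entropy above the threshold and a debate round would be invoked.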
[AI-35] A Unified Framework for Human AI Collaboration in Security Operations Centers with Trusted Autonomy
【Quick Read】: This paper addresses the shortcomings of existing human-AI collaboration frameworks in Security Operations Centers (SOCs), which focus narrowly on automation and lack systematic structures for managing human oversight, trust calibration, and scalable AI autonomy. Traditional approaches assume static or binary autonomy settings and fail to account for differences in complexity, criticality, and risk across SOC tasks. The key to the solution is a tiered framework built on five levels of AI autonomy (from manual to fully autonomous), mapped to Human-in-the-Loop (HITL) roles and task-specific trust thresholds, enabling adaptive and explainable AI integration across core SOC functions. By formally connecting autonomy, trust, and HITL mechanisms across SOC levels, the framework supports dynamic task distribution according to operational complexity and associated risk.
Link: https://arxiv.org/abs/2505.23397
Authors: Ahmad Mohsin,Helge Janicke,Ahmed Ibrahim,Iqbal H. Sarker,Seyit Camtepe
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Journal Article
Abstract:This article presents a structured framework for Human-AI collaboration in Security Operations Centers (SOCs), integrating AI autonomy, trust calibration, and Human-in-the-loop decision making. Existing frameworks in SOCs often focus narrowly on automation, lacking systematic structures to manage human oversight, trust calibration, and scalable autonomy with AI. Many assume static or binary autonomy settings, failing to account for the varied complexity, criticality, and risk across SOC tasks considering Humans and AI collaboration. To address these limitations, we propose a novel autonomy tiered framework grounded in five levels of AI autonomy from manual to fully autonomous, mapped to Human-in-the-Loop (HITL) roles and task-specific trust thresholds. This enables adaptive and explainable AI integration across core SOC functions, including monitoring, protection, threat detection, alert triage, and incident response. The proposed framework differentiates itself from previous research by creating formal connections between autonomy, trust, and HITL across various SOC levels, which allows for adaptive task distribution according to operational complexity and associated risks. The framework is exemplified through a simulated cyber range that features the cybersecurity AI-Avatar, a fine-tuned LLM-based SOC assistant. The AI-Avatar case study illustrates human-AI collaboration for SOC tasks, reducing alert fatigue, enhancing response coordination, and strategically calibrating trust. This research systematically presents both the theoretical and practical aspects and feasibility of designing next-generation cognitive SOCs that leverage AI not to replace but to enhance human decision-making.
[AI-36] Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization
【Quick Read】: This paper addresses the inefficiency of code generated by Large Language Models (LLMs), a critical bottleneck for real-world deployment. The key to the solution is a closed-loop system for test-time iterative optimization, in which LLMs iteratively refine code based on empirical performance feedback from an execution sandbox. The framework explores three training strategies, among which the reinforcement-learning (RL) based Group Relative Policy Optimization (GRPO) continuously improves code performance, significantly raising pass@1 and the likelihood of surpassing the efficiency of human submissions.
Link: https://arxiv.org/abs/2505.23387
Authors: Mingzhe Du,Luu Tuan Tuan,Yue Liu,Yuhao Qing,Dong Huang,Xinyi He,Qian Liu,Zejun Ma,See-kiong Ng
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) generate functionally correct solutions but often fall short in code efficiency, a critical bottleneck for real-world deployment. In this paper, we introduce a novel test-time iterative optimization framework to address this, employing a closed-loop system where LLMs iteratively refine code based on empirical performance feedback from an execution sandbox. We explore three training strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Experiments on our Venus dataset and the APPS benchmark show that SFT and DPO rapidly saturate in efficiency gains. In contrast, GRPO, using reinforcement learning (RL) with execution feedback, continuously optimizes code performance, significantly boosting both pass@1 (from 47% to 62%) and the likelihood of outperforming human submissions in efficiency (from 31% to 45%). Our work demonstrates effective test-time code efficiency improvement and critically reveals the power of RL in teaching LLMs to truly self-improve code efficiency.
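The closed test-time loop can be sketched as: execute a candidate, check functional correctness, measure runtime, and keep the fastest correct version. The "refiner" here is a hypothetical stand-in that steps through known implementations instead of calling an LLM:

```python
import time

def slow_sum(n):
    """Naive candidate: O(n) Python loop."""
    total = 0
    for i in range(n + 1):
        total += i
    return total

def fast_sum(n):
    """Refined candidate: closed-form Gauss sum."""
    return n * (n + 1) // 2

def measure(fn, n=200_000, tests=((10, 55), (0, 0))):
    """Sandbox stand-in: reject incorrect code, else return wall-clock time."""
    if any(fn(arg) != want for arg, want in tests):
        return None
    start = time.perf_counter()
    fn(n)
    return time.perf_counter() - start

def optimize(candidates):
    """Feedback loop over successive refinements: keep the fastest correct one."""
    best_fn, best_t = None, float("inf")
    for fn in candidates:
        t = measure(fn)
        if t is not None and t < best_t:
            best_fn, best_t = fn, t
    return best_fn

best = optimize([slow_sum, fast_sum])
```

In the paper's setting the next candidate comes from an LLM conditioned on the measured feedback rather than from a fixed list, but the select-on-empirical-performance loop is the same shape.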
zh
[AI-37] AutoGPS: Automated Geometry Problem Solving via Multimodal Formalization and Deductive Reasoning
【速读】:该论文试图解决几何问题求解在人工智能领域中的独特挑战,即需要具备卓越的多模态理解能力和严格的数学推理能力。现有方法主要分为基于神经网络和基于符号的方法,但二者在可靠性和可解释性方面均存在局限。解决方案的关键在于提出AutoGPS,这是一个神经符号协同框架,其核心是通过多模态问题形式化器(Multimodal Problem Formalizer, MPF)和演绎符号推理器(Deductive Symbolic Reasoner, DSR)的协作,将几何问题转化为结构化的形式语言表示,并通过超图扩展任务进行数学严谨的推导,从而生成简洁、可靠且人类可读的分步解题过程。
链接: https://arxiv.org/abs/2505.23381
作者: Bowen Ping,Minnan Luo,Zhuohang Dang,Chenxi Wang,Chengyou Jia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Geometry problem solving presents distinctive challenges in artificial intelligence, requiring exceptional multimodal comprehension and rigorous mathematical reasoning capabilities. Existing approaches typically fall into two categories: neural-based and symbolic-based methods, both of which exhibit limitations in reliability and interpretability. To address this challenge, we propose AutoGPS, a neuro-symbolic collaborative framework that solves geometry problems with concise, reliable, and human-interpretable reasoning processes. Specifically, AutoGPS employs a Multimodal Problem Formalizer (MPF) and a Deductive Symbolic Reasoner (DSR). The MPF utilizes neural cross-modal comprehension to translate geometry problems into structured formal language representations, with feedback from DSR collaboratively. The DSR takes the formalization as input and formulates geometry problem solving as a hypergraph expansion task, executing mathematically rigorous and reliable derivation to produce minimal and human-readable stepwise solutions. Extensive experimental evaluations demonstrate that AutoGPS achieves state-of-the-art performance on benchmark datasets. Furthermore, human stepwise-reasoning evaluation confirms AutoGPS’s impressive reliability and interpretability, with 99% stepwise logical coherence. The project homepage is at this https URL.
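摘要中 DSR "以形式化事实为输入、执行数学上严谨的逐步推导"的过程,核心是对形式语言事实做演绎闭包。下面用一个极简的前向链(forward chaining)推理器示意(规则与事实均为臆造的玩具示例,并非 AutoGPS 的真实形式语言或超图扩展算法):

```python
# 概念性示意:演绎推理器的前向链闭包,逐步记录可读的推理步骤。
def forward_chain(facts, rules, max_iter=100):
    """rules 为 (前提集合 frozenset, 结论) 列表;返回事实闭包与推理步骤。"""
    facts = set(facts)
    steps = []
    for _ in range(max_iter):
        new = None
        for prem, concl in rules:
            if prem <= facts and concl not in facts:
                new = concl
                steps.append((sorted(prem), concl))  # 记录"由哪些前提推出什么"
                break
        if new is None:  # 无新事实可推,闭包完成
            break
        facts.add(new)
    return facts, steps

facts0 = {"AB == CD", "CD == EF"}
rules = [
    (frozenset({"AB == CD", "CD == EF"}), "AB == EF"),          # 等量传递
    (frozenset({"AB == EF"}), "segment AB and EF are congruent"),  # 示例结论
]
closure, proof = forward_chain(facts0, rules)
```

每一条 `proof` 记录对应一条人类可读的推导步骤,与论文强调的逐步、可解释输出形式相呼应。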
zh
[AI-38] Dynamic Spectral Backpropagation for Efficient Neural Network Training
【速读】:该论文旨在解决在资源受限条件下神经网络训练的效率与鲁棒性问题,以及应对少样本学习和硬件效率等挑战。其解决方案的关键在于动态谱反向传播(Dynamic Spectral Backpropagation, DSBP),通过将梯度投影到主特征向量上,降低计算复杂度并促进平坦极小值的形成,从而提升模型的泛化能力和稳定性。
链接: https://arxiv.org/abs/2505.23369
作者: Mannmohan Muthuraman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Dynamic Spectral Backpropagation (DSBP) enhances neural network training under resource constraints by projecting gradients onto principal eigenvectors, reducing complexity and promoting flat minima. Five extensions are proposed, dynamic spectral inference, spectral architecture optimization, spectral meta-learning, spectral transfer regularization, and Lie-algebra-inspired dynamics, to address challenges in robustness, few-shot learning, and hardware efficiency. Supported by a third-order stochastic differential equation (SDE) and a PAC-Bayes limit, DSBP outperforms Sharpness Aware Minimization (SAM), Low Rank Adaptation (LoRA), and Model-Agnostic Meta-Learning (MAML) on CIFAR 10, Fashion MNIST, MedMNIST, and Tiny ImageNet, as demonstrated through extensive experiments and visualizations. Future work focuses on scalability, bias mitigation, and ethical considerations.
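"将梯度投影到主特征向量上"这一核心操作可以用 numpy 简单示意。注意:此处以近期梯度的协方差矩阵近似谱结构,这一谱估计方式是本示例的假设,论文中的具体构造可能不同:

```python
import numpy as np

# 概念性示意:把当前梯度投影到主特征子空间,抑制次要方向上的噪声分量。
def spectral_project(grad, recent_grads, k=2):
    G = np.stack(recent_grads)            # (m, d):近期梯度样本
    cov = G.T @ G / len(recent_grads)     # (d, d):协方差近似
    vals, vecs = np.linalg.eigh(cov)      # eigh 返回升序特征值及对应特征向量
    top = vecs[:, -k:]                    # 前 k 个主特征向量
    return top @ (top.T @ grad)           # 投影 P g,P = V_k V_k^T

rng = np.random.default_rng(0)
hist = [rng.standard_normal(5) for _ in range(8)]
g = rng.standard_normal(5)
g_proj = spectral_project(g, hist, k=2)
```

投影算子是幂等的(再投影一次结果不变),且不会放大梯度范数,这与"降低复杂度、促进平坦极小值"的直觉一致。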
zh
[AI-39] Understanding the Information Propagation Effects of Communication Topologies in LLM -based Multi-Agent Systems
【速读】:该论文旨在解决大规模语言模型驱动的多智能体系统中通信拓扑结构设计的问题,特别是如何理解并优化稀疏与密集拓扑结构在协作中的影响。其解决方案的关键在于提出一种因果框架,用于分析不同稀疏度的拓扑结构下智能体输出(正确或错误)的传播机制,并基于此设计出一种新型拓扑结构生成方法——EIB-Learner,该方法通过融合密集和稀疏图的连通模式,在抑制错误传播的同时保持有益信息的扩散,从而实现任务性能的最优化。
链接: https://arxiv.org/abs/2505.23352
作者: Xu Shen,Yixin Liu,Yiwei Dai,Yili Wang,Rui Miao,Yue Tan,Shirui Pan,Xin Wang
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:The communication topology in large language model-based multi-agent systems fundamentally governs inter-agent collaboration patterns, critically shaping both the efficiency and effectiveness of collective decision-making. While recent studies for communication topology automated design tend to construct sparse structures for efficiency, they often overlook why and when sparse and dense topologies help or hinder collaboration. In this paper, we present a causal framework to analyze how agent outputs, whether correct or erroneous, propagate under topologies with varying sparsity. Our empirical studies reveal that moderately sparse topologies, which effectively suppress error propagation while preserving beneficial information diffusion, typically achieve optimal task performance. Guided by this insight, we propose a novel topology design approach, EIB-Learner, that balances error suppression and beneficial information propagation by fusing connectivity patterns from both dense and sparse graphs. Extensive experiments show the superior effectiveness, communication cost, and robustness of EIB-Learner.
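"稀疏拓扑抑制错误传播、稠密拓扑加速扩散"这一核心观察,可以用一个极简的邻居平均扩散模型数值验证(该扩散模型与下面的链式/全连接拓扑均为本示例的假设,仅作直觉演示,并非论文的因果框架):

```python
import numpy as np

# 概念性示意:每轮将每个智能体的"正确概率"更新为自身与邻居的平均,
# 观察单个错误输出(值为 0)在不同稀疏度拓扑下的传播情况。
def propagate(adj, state, rounds=3):
    state = np.asarray(state, dtype=float)
    deg = adj.sum(axis=1) + 1          # 含自身的度
    for _ in range(rounds):
        state = (state + adj @ state) / deg
    return state

n = 4
dense = np.ones((n, n)) - np.eye(n)                 # 全连接(无自环)
chain = np.eye(n, k=1) + np.eye(n, k=-1)            # 链式稀疏拓扑
err0 = [0.0, 1.0, 1.0, 1.0]                         # 0 号智能体输出错误
dense_out = propagate(dense, err0)
chain_out = propagate(chain, err0)
```

全连接拓扑下错误在一轮内均匀污染所有智能体;链式拓扑下远端智能体(3 号)受影响明显更小,与"适度稀疏抑制错误传播"的结论方向一致。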
zh
[AI-40] Towards Reward Fairness in RLHF: From a Resource Allocation Perspective ACL2025
【速读】:该论文试图解决奖励机制中存在偏差(reward unfairness)的问题,这种偏差可能影响大规模语言模型(LLM)与人类偏好的对齐。解决方案的关键在于从资源分配的角度提出一种与偏差无关的方法,将偏好学习建模为资源分配问题,在考虑效用与公平性权衡的基础上,通过引入公平性正则化(Fairness Regularization)和公平性系数(Fairness Coefficient)实现奖励的公平性。该方法无需针对每种具体偏差进行设计,即可有效缓解多种奖励偏差问题。
链接: https://arxiv.org/abs/2505.23349
作者: Sheng Ouyang,Yulan Hu,Ge Chen,Qingyang Li,Fuzheng Zhang,Yong Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025
Abstract:Rewards serve as proxies for human preferences and play a crucial role in Reinforcement Learning from Human Feedback (RLHF). However, if these rewards are inherently imperfect, exhibiting various biases, they can adversely affect the alignment of large language models (LLMs). In this paper, we collectively define the various biases present in rewards as the problem of reward unfairness. We propose a bias-agnostic method to address the issue of reward fairness from a resource allocation perspective, without specifically designing for each type of bias, yet effectively mitigating them. Specifically, we model preference learning as a resource allocation problem, treating rewards as resources to be allocated while considering the trade-off between utility and fairness in their distribution. We propose two methods, Fairness Regularization and Fairness Coefficient, to achieve fairness in rewards. We apply our methods in both verification and reinforcement learning scenarios to obtain a fairness reward model and a policy model, respectively. Experiments conducted in these scenarios demonstrate that our approach aligns LLMs with human preferences in a more fair manner.
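摘要中"在效用与公平性之间权衡地重新分配奖励"的思想,可以用一个最小化的正则形式示意。注意:下面的线性插值形式 r' = (1-λ)·r + λ·mean(r) 是本示例的假设构造,仅用于说明"保持期望、压缩方差"的权衡方向,并非论文中 Fairness Regularization / Fairness Coefficient 的实际定义:

```python
import numpy as np

# 概念性示意:把奖励向其均值收缩。λ 越大越"公平"(方差越小),
# 但区分度(效用)随之下降;该变换保持奖励期望与样本间的相对偏好。
def fairness_regularize(rewards, lam=0.3):
    r = np.asarray(rewards, dtype=float)
    return (1 - lam) * r + lam * r.mean()

r = np.array([0.1, 0.9, 0.5, 0.3])
r_fair = fairness_regularize(r, lam=0.3)
```

可验证三条性质:均值不变、方差减小、排序不变,正对应"缓解奖励偏差的同时不破坏相对偏好"。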
zh
[AI-41] Matryoshka Model Learning for Improved Elastic Student Models KDD2025
【速读】:该论文试图解决工业级机器学习(Machine Learning, ML)模型在面对快速变化的部署约束时,需要大量资源进行模型开发的问题。其解决方案的关键在于提出MatTA框架,该框架通过一种新颖的Teacher-TA-Student(教师-TA-学生)方法,利用容量更高的TA模型来指导多个准确的学生模型的训练,从而在仅进行一次训练的情况下,提供多种可部署的选项以在精度与服务成本之间进行权衡。
链接: https://arxiv.org/abs/2505.23337
作者: Chetan Verma,Aditya Srinivas Timmaraju,Cho-Jui Hsieh,Suyash Damle,Ngot Bui,Yang Zhang,Wen Chen,Xin Liu,Prateek Jain,Inderjit S Dhillon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, Accepted at KDD 2025
Abstract:Industry-grade ML models are carefully designed to meet rapidly evolving serving constraints, which requires significant resources for model development. In this paper, we propose MatTA, a framework for training multiple accurate Student models using a novel Teacher-TA-Student recipe. TA models are larger versions of the Student models with higher capacity, and thus allow Student models to better relate to the Teacher model and also bring in more domain-specific expertise. Furthermore, multiple accurate Student models can be extracted from the TA model. Therefore, despite only one training run, our methodology provides multiple servable options to trade off accuracy for lower serving cost. We demonstrate the proposed method, MatTA, on proprietary datasets and models. Its practical efficacy is underscored by live A/B tests within a production ML system, demonstrating 20% improvement on a key metric. We also demonstrate our method on GPT-2 Medium, a public model, and achieve relative improvements of over 24% on SAT Math and over 10% on the LAMBADA benchmark.
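Teacher→TA→Student 的两级知识传递,骨架上就是两段蒸馏损失的串联。下面用 numpy 写一个基于 KL 散度的最小示意(温度、损失形式与示例 logits 均为本示例的假设,MatTA 的真实训练配方远比此复杂):

```python
import numpy as np

# 概念性示意:TA 向 Teacher 对齐、Student 向 TA 对齐的两级蒸馏损失。
def softmax(z, t=1.0):
    z = np.asarray(z, dtype=float) / t
    e = np.exp(z - z.max())          # 数值稳定的 softmax
    return e / e.sum()

def kl(p, q):
    """KL(p || q),两个离散分布的蒸馏损失。"""
    return float(np.sum(p * np.log(p / q)))

teacher = np.array([3.0, 1.0, 0.2])   # 假设的各模型 logits
ta      = np.array([2.5, 1.2, 0.3])   # TA:容量介于 Teacher 与 Student 之间
student = np.array([2.0, 1.5, 0.5])

loss_ta      = kl(softmax(teacher), softmax(ta))   # 第一级:TA 蒸馏
loss_student = kl(softmax(ta), softmax(student))   # 第二级:Student 蒸馏
total = loss_ta + loss_student
```

TA 作为中间容量模型缩小了每一级的分布差距,这是论文中"Student 借助 TA 更好地贴近 Teacher"的直观来源。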
zh
[AI-42] Score-based Generative Modeling for Conditional Independence Testing KDD2025
【速读】:该论文旨在解决在高维设置下确定随机变量之间的条件独立性(Conditional Independence, CI)关系这一基础且具有挑战性的问题。现有基于生成模型的CI检验方法,如利用生成对抗网络(Generative Adversarial Networks, GANs)的方法,常因条件分布建模不佳和训练不稳定而导致性能欠佳。论文提出的解决方案关键在于采用基于得分的生成建模方法,通过切片条件得分匹配方案精确估计条件得分,并利用Langevin动力学条件采样生成零假设样本,从而实现精确的I类错误控制和强大的检验功效。此外,方法中引入了拟合优度阶段以验证生成样本并提升实际可解释性。
链接: https://arxiv.org/abs/2505.23309
作者: Yixin Ren,Chenghou Jin,Yewei Xia,Li Ke,Longtao Huang,Hui Xue,Hao Zhang,Jihong Guan,Shuigeng Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by KDD2025
Abstract:Determining conditional independence (CI) relationships between random variables is a fundamental yet challenging task in machine learning and statistics, especially in high-dimensional settings. Existing generative model-based CI testing methods, such as those utilizing generative adversarial networks (GANs), often struggle with undesirable modeling of conditional distributions and training instability, resulting in subpar performance. To address these issues, we propose a novel CI testing method via score-based generative modeling, which achieves precise Type I error control and strong testing power. Concretely, we first employ a sliced conditional score matching scheme to accurately estimate conditional score and use Langevin dynamics conditional sampling to generate null hypothesis samples, ensuring precise Type I error control. Then, we incorporate a goodness-of-fit stage into the method to verify generated samples and enhance interpretability in practice. We theoretically establish the error bound of conditional distributions modeled by score-based generative models and prove the validity of our CI tests. Extensive experiments on both synthetic and real-world datasets show that our method significantly outperforms existing state-of-the-art methods, providing a promising way to revitalize generative model-based CI testing.
zh
[AI-43] MathArena: Evaluating LLM s on Uncontaminated Math Competitions
【速读】:该论文试图解决当前数学基准测试中存在的真实推理能力评估不足与模型可能存在的记忆偏差问题(contamination)。其解决方案的关键在于引入MathArena,这是一个基于数学竞赛的实时评估框架,利用定期发布的高质量、高挑战性题目进行模型评估,从而有效避免数据泄露和记忆效应,同时首次将证明写作能力纳入数学推理评估体系。
链接: https://arxiv.org/abs/2505.23281
作者: Mislav Balunović,Jasper Dekoninck,Ivo Petrov,Nikola Jovanović,Martin Vechev
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks. To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination. Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as SMT 2025 – published well after model release dates – demonstrate impressive reasoning capabilities in top-performing models. MathArena is also the first benchmark for proof-writing capabilities. On USAMO 2025, even top models score below 25%, far behind their performance on final-answer tasks. So far, we have evaluated 30 models across five competitions, totaling 149 problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.
zh
[AI-44] VLM-RRT: Vision Language Model Guided RRT Search for Autonomous UAV Navigation
【速读】:该论文旨在解决传统路径规划方法(如RRT)在高复杂度搜索空间中面临的路径质量不佳、收敛速度慢以及计算效率低的问题,尤其是在灾难响应等高风险应用场景中,对快速高效路径规划的迫切需求。其解决方案的关键在于提出一种融合视觉语言模型(VLM)与RRT的混合方法——VLM-RRT,通过VLM对环境快照进行模式识别并提供初始方向引导,从而优化采样过程,提升路径规划的采样效率和路径质量。
链接: https://arxiv.org/abs/2505.23267
作者: Jianlin Ye,Savvas Papaioannou,Panayiotis Kolios
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Path planning is a fundamental capability of autonomous Unmanned Aerial Vehicles (UAVs), enabling them to efficiently navigate toward a target region or explore complex environments while avoiding obstacles. Traditional pathplanning methods, such as Rapidly-exploring Random Trees (RRT), have proven effective but often encounter significant challenges. These include high search space complexity, suboptimal path quality, and slow convergence, issues that are particularly problematic in high-stakes applications like disaster response, where rapid and efficient planning is critical. To address these limitations and enhance path-planning efficiency, we propose Vision Language Model RRT (VLM-RRT), a hybrid approach that integrates the pattern recognition capabilities of Vision Language Models (VLMs) with the path-planning strengths of RRT. By leveraging VLMs to provide initial directional guidance based on environmental snapshots, our method biases sampling toward regions more likely to contain feasible paths, significantly improving sampling efficiency and path quality. Extensive quantitative and qualitative experiments with various state-of-the-art VLMs demonstrate the effectiveness of this proposed approach.
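"利用 VLM 的方向提示对 RRT 采样加偏置"的主干逻辑可以压缩成几十行 Python。下面是一个无障碍物的二维玩具示意(真实系统中方向由 VLM 依据环境快照给出,此处用固定的 hint 向量代替;偏置采样的具体形式亦为本示例的假设):

```python
import math
import random

# 概念性示意:带方向偏置采样的 RRT。bias 概率下沿 hint 方向采样,
# 其余时间均匀采样以保持探索性;无障碍物,仅演示采样偏置对树生长的引导。
def vlm_rrt(start, goal, hint, bias=0.7, step=0.5, iters=500, seed=0):
    rng = random.Random(seed)
    nodes = [start]
    for _ in range(iters):
        if rng.random() < bias:      # 依"VLM 建议方向"偏置采样
            sample = (start[0] + hint[0] * 10 * rng.random(),
                      start[1] + hint[1] * 10 * rng.random())
        else:                        # 均匀随机采样
            sample = (rng.uniform(0, 10), rng.uniform(0, 10))
        near = min(nodes, key=lambda p: (p[0] - sample[0]) ** 2 + (p[1] - sample[1]) ** 2)
        d = math.hypot(sample[0] - near[0], sample[1] - near[1])
        if d <= step:                # 采样点足够近,直接作为新节点
            new = sample
        else:                        # 否则向采样点方向扩展一个步长
            new = (near[0] + step * (sample[0] - near[0]) / d,
                   near[1] + step * (sample[1] - near[1]) / d)
        nodes.append(new)
        if math.hypot(new[0] - goal[0], new[1] - goal[1]) < step:
            return nodes, True       # 到达目标邻域
    return nodes, False

nodes, reached = vlm_rrt(start=(0.0, 0.0), goal=(6.0, 6.0), hint=(1.0, 1.0))
```

偏置项使大部分采样落在提示方向的象限内,从而把树的生长导向"更可能存在可行路径的区域",对应摘要中提升采样效率的机制。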
zh
[AI-45] Deep Retrieval at CheckThat! 2025: Identifying Scientific Papers from Implicit Social Media Mentions via Hybrid Retrieval and Re-Ranking
【速读】:该论文旨在解决从社交媒体帖子中检索相关科学文献的问题,具体针对的是CLEF CheckThat! 2025竞赛中的子任务4b。其关键解决方案是一个混合检索流程,结合了词汇精确匹配、语义泛化和深度上下文重新排序,以弥合非正式语言与正式语言之间的差距。该方法通过将基于BM25的关键词匹配与使用微调后的INF-Retriever-v1模型进行密集语义检索的FAISS向量存储相结合,生成候选文档,并利用大语言模型(LLM)驱动的交叉编码器对结果进行重新排序,从而实现了高效的检索性能。
链接: https://arxiv.org/abs/2505.23250
作者: Pascal J. Sager,Ashwini Kamaraj,Benjamin F. Grewe,Thilo Stadelmann
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:We present the methodology and results of the Deep Retrieval team for subtask 4b of the CLEF CheckThat! 2025 competition, which focuses on retrieving relevant scientific literature for given social media posts. To address this task, we propose a hybrid retrieval pipeline that combines lexical precision, semantic generalization, and deep contextual re-ranking, enabling robust retrieval that bridges the informal-to-formal language gap. Specifically, we combine BM25-based keyword matching with a FAISS vector store using a fine-tuned INF-Retriever-v1 model for dense semantic retrieval. BM25 returns the top 30 candidates, and semantic search yields 100 candidates, which are then merged and re-ranked via a large language model (LLM)-based cross-encoder. Our approach achieves a mean reciprocal rank at 5 (MRR@5) of 76.46% on the development set and 66.43% on the hidden test set, securing the 1st position on the development leaderboard and ranking 3rd on the test leaderboard (out of 31 teams), with a relative performance gap of only 2 percentage points compared to the top-ranked system. We achieve this strong performance by running open-source models locally and without external training data, highlighting the effectiveness of a carefully designed and fine-tuned retrieval pipeline.
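候选合并与 MRR@5 指标都可以用几行代码说明。注意:下面用倒数排名融合(RRF)作为合并策略仅是示意性假设,原系统实际由 LLM cross-encoder 对合并后的候选重排:

```python
# 概念性示意:融合 BM25 与稠密检索的候选列表(RRF),并计算 MRR@5。
def rrf_merge(bm25_ranked, dense_ranked, k=60):
    """Reciprocal Rank Fusion:每个文档累加 1/(k+rank),按总分降序。"""
    scores = {}
    for ranked in (bm25_ranked, dense_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def mrr_at_5(ranked_lists, gold):
    """MRR@5:正确文档在前 5 名内的倒数排名的平均值。"""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        for i, doc in enumerate(ranked[:5], start=1):
            if doc == g:
                total += 1.0 / i
                break
    return total / len(gold)

bm25  = ["d3", "d1", "d7", "d2"]   # 假设的两路候选
dense = ["d1", "d9", "d3", "d4"]
merged = rrf_merge(bm25, dense)
score = mrr_at_5([merged], ["d1"])
```

本例中 d1 在两路都靠前,融合后升至第一,体现了词汇匹配与语义检索互补的效果。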
zh
[AI-46] Accelerating RLHF Training with Reward Variance Increase
【速读】:该论文旨在解决基于群体相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)训练效率低下的问题。其关键解决方案是提出一种实用的奖励调整模型,通过严格增加奖励方差并保持相对偏好和奖励期望,从而加速RLHF训练过程。该方法本质上是一个非凸优化问题,为了解决计算上的挑战,作者设计了一种时间复杂度为O(n log n)的算法,以显式刻画可行集的极值点并找到全局解。此外,该奖励调整模型被自然地集成到GRPO算法中,形成了更高效的GRPOVI算法。
链接: https://arxiv.org/abs/2505.23247
作者: Zonglin Yang,Zhexuan Gu,Houduo Qi,Yancheng Yuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:
Abstract:Reinforcement learning from human feedback (RLHF) is an essential technique for ensuring that large language models (LLMs) are aligned with human values and preferences during the post-training phase. As an effective RLHF approach, group relative policy optimization (GRPO) has demonstrated success in many LLM-based applications. However, efficient GRPO-based RLHF training remains a challenge. Recent studies reveal that a higher reward variance of the initial policy model leads to faster RLHF training. Inspired by this finding, we propose a practical reward adjustment model to accelerate RLHF training by provably increasing the reward variance and preserving the relative preferences and reward expectation. Our reward adjustment method inherently poses a nonconvex optimization problem, which is NP-hard to solve in general. To overcome the computational challenges, we design a novel O(n \log n) algorithm to find a global solution of the nonconvex reward adjustment model by explicitly characterizing the extreme points of the feasible set. As an important application, we naturally integrate this reward adjustment model into the GRPO algorithm, leading to a more efficient GRPO with reward variance increase (GRPOVI) algorithm for RLHF training. As an interesting byproduct, we provide an indirect explanation for the empirical effectiveness of GRPO with rule-based reward for RLHF training, as demonstrated in DeepSeek-R1. Experiment results demonstrate that the GRPOVI algorithm can significantly improve the RLHF training efficiency compared to the original GRPO algorithm.
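"严格增加奖励方差、同时保持相对偏好与奖励期望"这三条性质,最简单的满足构造是围绕均值的线性放大。下面的示意仅说明这三条性质本身是可同时满足的;论文求解的非凸奖励调整模型及其 O(n log n) 全局算法远比这一构造一般:

```python
import numpy as np

# 概念性示意:r' = mean + a * (r - mean),a > 1。
# 该仿射变换保持均值(期望不变)、保持排序(相对偏好不变)、方差放大 a^2 倍。
def increase_variance(rewards, a=2.0):
    r = np.asarray(rewards, dtype=float)
    return r.mean() + a * (r - r.mean())

r = np.array([0.2, 0.4, 0.9])      # 假设的一组候选回答奖励
r_adj = increase_variance(r, a=2.0)
```

更高的初始奖励方差意味着策略梯度中各候选的优势差异更显著,这正是"方差越大、RLHF 训练越快"这一经验发现的直观注脚。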
zh
[AI-47] OSS-UAgent : An Agent -based Usability Evaluation Framework for Open Source Software
【速读】:该论文旨在解决开源软件(Open Source Software, OSS)可用性评估中传统方法依赖人工评估所导致的成本高和可扩展性差的问题。其解决方案的关键在于提出OSS-UAgent,这是一个基于智能代理的自动化、可配置且交互式的可用性评估框架,该框架利用大语言模型(Large Language Models, LLMs)驱动的智能代理模拟不同经验水平的开发者执行编程任务,并通过动态构建平台特定的知识库实现上下文感知的代码生成与多维度自动评估。
链接: https://arxiv.org/abs/2505.23239
作者: Lingkai Meng,Yu Shao,Long Yuan,Longbin Lai,Peng Cheng,Wenyuan Yu,Wenjie Zhang,Xuemin Lin,Jingren Zhou
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Usability evaluation is critical to the impact and adoption of open source software (OSS), yet traditional methods relying on human evaluators suffer from high costs and limited scalability. To address these limitations, we introduce OSS-UAgent, an automated, configurable, and interactive agent-based usability evaluation framework specifically designed for open source software. Our framework employs intelligent agents powered by large language models (LLMs) to simulate developers performing programming tasks across various experience levels (from Junior to Expert). By dynamically constructing platform-specific knowledge bases, OSS-UAgent ensures accurate and context-aware code generation. The generated code is automatically evaluated across multiple dimensions, including compliance, correctness, and readability, providing a comprehensive measure of the software’s usability. Additionally, our demonstration showcases OSS-UAgent’s practical application in evaluating graph analytics platforms, highlighting its effectiveness in automating usability evaluation.
zh
[AI-48] Less is More: Unlocking Specialization of Time Series Foundation Models via Structured Pruning
【速读】:该论文试图解决时间序列基础模型(Time Series Foundation Models, TSFMs)在微调后无法持续优于小规模、专业化模型的问题。其关键解决方案是通过结构化剪枝方法,聚焦于更相关且紧凑的参数空间,从而在微调过程中保留TSFMs中已学习到的任务相关网络子结构的先验知识,进而提升预测性能。实验结果表明,“剪枝-微调”范式能够显著提升TSFMs的性能,并超越强基准模型。
链接: https://arxiv.org/abs/2505.23195
作者: Lifan Zhao,Yanyan Shen,Zhaoyang Liu,Xue Wang,Jiaji Deng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Manuscript with fixed typos and figures
Abstract:Scaling laws motivate the development of Time Series Foundation Models (TSFMs) that pre-train vast parameters and achieve remarkable zero-shot forecasting performance. Surprisingly, even after fine-tuning, TSFMs cannot consistently outperform smaller, specialized models trained on full-shot downstream data. A key question is how to realize effective adaptation of TSFMs for a target forecasting task. Through empirical studies on various TSFMs, the pre-trained models often exhibit inherent sparsity and redundancy in computation, suggesting that TSFMs have learned to activate task-relevant network substructures to accommodate diverse forecasting tasks. To preserve this valuable prior knowledge, we propose a structured pruning method to regularize the subsequent fine-tuning process by focusing it on a more relevant and compact parameter space. Extensive experiments on seven TSFMs and six benchmarks demonstrate that fine-tuning a smaller, pruned TSFM significantly improves forecasting performance compared to fine-tuning original models. This “prune-then-finetune” paradigm often enables TSFMs to achieve state-of-the-art performance and surpass strong specialized baselines.
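"剪枝-微调"范式的第一步——把后续微调约束到更紧凑的参数子空间——可以用按行范数的结构化剪枝示意(以行 L2 范数作为通道重要性准则是本示例的假设,论文的剪枝准则可能不同):

```python
import numpy as np

# 概念性示意:按输出通道(行)的 L2 范数做结构化剪枝,只保留范数最大的通道,
# 其余整行置零;后续微调即发生在这一更小的参数空间内。
def prune_rows(W, keep_ratio=0.5):
    norms = np.linalg.norm(W, axis=1)
    k = max(1, int(round(W.shape[0] * keep_ratio)))
    keep = np.argsort(norms)[-k:]            # 范数最大的 k 行
    mask = np.zeros(W.shape[0], dtype=bool)
    mask[keep] = True
    W_pruned = W.copy()
    W_pruned[~mask] = 0.0                    # 整行剪除(结构化,而非逐元素)
    return W_pruned, mask

W = np.array([[1.0, 0.0], [0.1, 0.1], [0.0, 2.0], [0.05, 0.0]])
W_pruned, mask = prune_rows(W, keep_ratio=0.5)
```

整行置零对应移除完整的神经元/通道,这与摘要中"保留任务相关的网络子结构"的动机一致,也便于实际推理加速。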
zh
[AI-49] FreRA: A Frequency-Refined Augmentation for Contrastive Learning on Time Series Classification KDD2025
【速读】:该论文试图解决时间序列分类任务中对比学习的最优增强策略设计问题,现有预定义的时间域增强方法主要源自视觉领域,缺乏对时间序列数据特性的针对性,导致语义相关信息可能被不匹配模式扭曲。解决方案的关键在于从频域角度提出一种轻量且有效的频率细化增强方法(Frequency Refined Augmentation, FreRA),其核心是自动分离关键与非关键频率成分,并通过语义感知的身份修改和语义无关的自适应修改来保护语义相关信息并注入多样性,从而生成保持语义的视图,提升对比学习在时间序列任务中的性能。
链接: https://arxiv.org/abs/2505.23181
作者: Tian Tian,Chunyan Miao,Hangwei Qian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: KDD 2025
Abstract:Contrastive learning has emerged as a competent approach for unsupervised representation learning. However, the design of an optimal augmentation strategy, although crucial for contrastive learning, is less explored for time series classification tasks. Existing predefined time-domain augmentation methods are primarily adopted from vision and are not specific to time series data. Consequently, this cross-modality incompatibility may distort the semantically relevant information of time series by introducing mismatched patterns into the data. To address this limitation, we present a novel perspective from the frequency domain and identify three advantages for downstream classification: global, independent, and compact. To fully utilize the three properties, we propose the lightweight yet effective Frequency Refined Augmentation (FreRA) tailored for time series contrastive learning on classification tasks, which can be seamlessly integrated with contrastive learning frameworks in a plug-and-play manner. Specifically, FreRA automatically separates critical and unimportant frequency components. Accordingly, we propose semantic-aware Identity Modification and semantic-agnostic Self-adaptive Modification to protect semantically relevant information in the critical frequency components and infuse variance into the unimportant ones respectively. Theoretically, we prove that FreRA generates semantic-preserving views. Empirically, we conduct extensive experiments on two benchmark datasets, including UCR and UEA archives, as well as five large-scale datasets on diverse applications. FreRA consistently outperforms ten leading baselines on time series classification, anomaly detection, and transfer learning tasks, demonstrating superior capabilities in contrastive representation learning and generalization in transfer learning scenarios across diverse datasets.
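"保护关键频率成分、对非关键成分注入扰动"的流程可以用 rfft/irfft 简单示意。注意:此处以幅值大小近似"关键性"是本示例的假设,FreRA 中关键成分的分离是自动学习的:

```python
import numpy as np

# 概念性示意:频域增强。幅值最大的 k 个频率分量保持不变(对应"身份修改"),
# 其余分量乘以随机扰动(对应"自适应修改"),再逆变换回时域得到增强视图。
def frera_augment(x, k=2, noise=0.3, seed=0):
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(x)
    mag = np.abs(spec)
    critical = np.argsort(mag)[-k:]                    # 假设:幅值最大者为关键成分
    scale = 1.0 + noise * rng.standard_normal(spec.shape)
    scale[critical] = 1.0                              # 关键成分原样保留
    return np.fft.irfft(spec * scale, n=len(x))

t = np.linspace(0, 1, 64, endpoint=False)
x = (np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 7 * t)
     + 0.1 * np.sin(2 * np.pi * 15 * t))
x_aug = frera_augment(x, k=2)
```

增强后的序列在关键频点(本例中第 3、7 个频率分量)上与原序列完全一致,从而保证语义相关信息不被扭曲。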
zh
[AI-50] Best Arm Identification with Possibly Biased Offline Data UAI2025
【速读】:该论文试图解决在固定置信度设置下,利用可能存在偏差的离线数据进行最佳臂识别(Best Arm Identification, BAI)的问题,这一问题在现实场景如临床试验中较为常见。解决方案的关键在于提出LUCB-H算法,该算法通过引入自适应置信区间,并结合辅助偏差校正机制,在LUCB框架内平衡离线与在线数据的影响,从而有效提升算法性能。
链接: https://arxiv.org/abs/2505.23165
作者: Le Yang,Vincent Y. F. Tan,Wang Chi Cheung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: Accepted to UAI 2025
Abstract:We study the best arm identification (BAI) problem with potentially biased offline data in the fixed confidence setting, which commonly arises in real-world scenarios such as clinical trials. We prove an impossibility result for adaptive algorithms without prior knowledge of the bias bound between online and offline distributions. To address this, we propose the LUCB-H algorithm, which introduces adaptive confidence bounds by incorporating an auxiliary bias correction to balance offline and online data within the LUCB framework. Theoretical analysis shows that LUCB-H matches the sample complexity of standard LUCB when offline data is misleading and significantly outperforms it when offline data is helpful. We also derive an instance-dependent lower bound that matches the upper bound of LUCB-H in certain scenarios. Numerical experiments further demonstrate the robustness and adaptability of LUCB-H in effectively incorporating offline data.
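"为可能有偏的离线数据加宽置信区间"的思路可以用 Hoeffding 型置信半径示意。下面的半径形式与偏差项按离线样本占比并入的方式均为本示例的假设,并非 LUCB-H 的原始置信界构造:

```python
import math

# 概念性示意:合并在线/离线样本的均值估计,并按离线占比加入偏差修正项,
# 使置信区间在离线数据可能误导时仍然覆盖真值。
def biased_lucb_bounds(online_mean, n_on, offline_mean, n_off,
                       bias_bound, delta=0.05):
    n = n_on + n_off
    pooled = (online_mean * n_on + offline_mean * n_off) / n
    radius = math.sqrt(math.log(2.0 / delta) / (2.0 * n))  # Hoeffding 型半径
    slack = bias_bound * n_off / n                         # 偏差修正项
    return pooled - radius - slack, pooled + radius + slack

lcb, ucb = biased_lucb_bounds(0.6, 50, 0.7, 50, bias_bound=0.1)
```

当偏差上界为 0 时区间退化为标准 LUCB 形式;偏差上界越大、离线样本占比越高,区间越宽,体现"离线数据误导时不失稳健、有帮助时加速识别"的权衡。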
zh
[AI-51] Conceptual Framework Toward Embodied Collective Adaptive Intelligence
【速读】:该论文试图解决如何设计和分析具有自适应能力的群体智能系统的问题,特别是在复杂动态环境中实现自主代理的协作、适应与自组织。解决方案的关键在于提出一个概念性框架,该框架明确了任务泛化、弹性、可扩展性和自组装等核心属性,旨在将理论基础与实际工程方法相结合,以构建更具韧性、可扩展性和适应性的AI系统。
链接: https://arxiv.org/abs/2505.23153
作者: Fan Wang,Shaoshan Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Collective Adaptive Intelligence (CAI) represents a transformative approach in artificial intelligence, wherein numerous autonomous agents collaborate, adapt, and self-organize to navigate complex, dynamic environments. This paradigm is particularly impactful in embodied AI applications, where adaptability and resilience are paramount. By enabling systems to reconfigure themselves in response to unforeseen challenges, CAI facilitates robust performance in real-world scenarios. This article introduces a conceptual framework for designing and analyzing CAI. It delineates key attributes including task generalization, resilience, scalability, and self-assembly, aiming to bridge theoretical foundations with practical methodologies for engineering adaptive, emergent intelligence. By providing a structured foundation for understanding and implementing CAI, this work seeks to guide researchers and practitioners in developing more resilient, scalable, and adaptable AI systems across various domains.
zh
[AI-52] VERINA: Benchmarking Verifiable Code Generation
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)生成代码时的正确性保障问题,尤其是针对可验证代码生成(Verifiable Code Generation)的评估不足。现有基准测试缺乏对代码、规范和证明生成及其组合的端到端评估支持,导致难以全面衡量LLMs在该领域的性能。论文提出的解决方案是构建一个高质量的基准测试平台——Verina,它包含189个手动整理的Lean编程任务,涵盖详细的题目描述、参考实现、形式化规范和广泛的测试用例,从而实现对代码、规范和证明生成的全面且模块化的评估。该基准的核心价值在于推动可验证代码生成技术的发展,并揭示当前LLMs在证明生成等方面的显著挑战。
链接: https://arxiv.org/abs/2505.23135
作者: Zhe Ye,Zhengxu Yan,Jingxuan He,Timothe Kasriel,Kaiyu Yang,Dawn Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注:
Abstract:Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation – jointly generating code, specifications, and proofs of code-specification alignment – offers a promising path to address this limitation and further unleash LLMs’ benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often lack support for end-to-end verifiable code generation. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, generates only 61.4% correct code, 51.0% sound and complete specifications, and 3.6% successful proofs, with one trial per task. We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on this https URL and our evaluation code on this https URL.
zh
[AI-53] Patient Domain Supervised Contrastive Learning for Lung Sound Classification Using Mobile Phone
【速读】:该论文试图解决传统肺部听诊在新冠疫情中暴露的局限性,以及通过智能手机麦克风记录和分析肺部声音时面临的音频风格差异和患者个体差异问题。其解决方案的关键在于提出一种名为患者领域监督对比学习(Patient Domain Supervised Contrastive Learning, PD-SCL)的方法,并将其与音频频谱变换器(Audio Spectrogram Transformer, AST)模型结合,从而显著提升了模型性能。
链接: https://arxiv.org/abs/2505.23132
作者: Seung Gyu Jeong,Seong Eun Kim
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: ITS-CSCC 2024
Abstract:Auscultation is crucial for diagnosing lung diseases. The COVID-19 pandemic has revealed the limitations of traditional, in-person lung sound assessments. To overcome these issues, advancements in digital stethoscopes and artificial intelligence (AI) have led to the development of new diagnostic methods. In this context, our study aims to use smartphone microphones to record and analyze lung sounds. We faced two major challenges: the difference in audio style between electronic stethoscopes and smartphone microphones, and the variability among patients. To address these challenges, we developed a method called Patient Domain Supervised Contrastive Learning (PD-SCL). By integrating this method with the Audio Spectrogram Transformer (AST) model, we significantly improved its performance by 2.4% compared to the original AST model. This progress demonstrates that smartphones can effectively diagnose lung sounds, addressing inconsistencies in patient data and showing potential for broad use beyond traditional clinical settings. Our research contributes to making lung disease detection more accessible in the post-COVID-19 world.
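监督对比学习的损失骨架可以用 numpy 写出最小实现。注意:下面仅以类别标签定义正样本对,这是标准 SupCon 的形式;PD-SCL 如何进一步利用患者域信息构造样本对是论文本身的贡献,此处不做揣测:

```python
import numpy as np

# 概念性示意:监督对比损失(SupCon)。同类样本互为正对,
# 损失鼓励正对在归一化特征空间中相互靠近、与其余样本远离。
def supcon_loss(feats, labels, temp=0.1):
    z = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # L2 归一化
    sim = z @ z.T / temp
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        logits = np.delete(sim[i], i)                 # 分母中排除自身
        log_den = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
        for j in pos:
            loss += log_den - sim[i, j]               # -log(exp(s_ij)/sum)
        count += len(pos)
    return loss / count

rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 4))
labels = [0, 0, 1, 1, 2, 2]
loss = supcon_loss(feats, labels)
```

当同类特征完全重合、异类特征正交时损失趋近于零,可作为实现正确性的直观检查。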
zh
[AI-54] Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking
【速读】:该论文试图解决模型合并(model merging)过程中由于权重矩阵中相同位置条目功能不一致而导致的直接逐元素合并无效的问题。现有方法假设权重矩阵中相同位置的条目具有相同功能,但这一假设忽略了微调神经网络中神经元可能发展出不同特征组合的复杂性。论文提出的解决方案是Decom-Renorm-Merge (DRM),其关键在于利用奇异值分解(Singular Value Decomposition)将权重矩阵分解并映射到对齐的联合空间,从而实现有效的逐元素合并。实验结果表明,DRM在多种模型架构上均优于现有先进方法,而重归一化(renormalization)被证明是构建稳健联合空间的关键组件。
链接: https://arxiv.org/abs/2505.23117
作者: Yuatyong Chaichana,Thanapat Trachu,Peerat Limkonchotiwat,Konpat Preechakul,Tirasan Khandhawit,Ekapol Chuangsuwanich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In the era of large-scale training, model merging has evolved into a tool for creating multitasking models efficiently. It enables the knowledge of models to be fused, without the need for heavy computation as required in traditional multitask learning. Existing merging methods often assume that entries at identical positions in weight matrices serve the same function, enabling straightforward entry-wise comparison and merging. However, this assumption overlooks the complexity of finetuned neural networks, where neurons may develop distinct feature compositions, making direct entry-wise merging problematic. We present Decom-Renorm-Merge (DRM), a simple yet effective approach that leverages Singular Value Decomposition to decompose and coordinate weight matrices into an aligned joint space, where entry-wise merging becomes possible. We showcase the effectiveness of DRM across various settings ranging from smaller encoder-based such as ViT and DeBERTa, encoder-decoder-based such as T5, and larger decoder-based such as Llama3.1-8B. Our experimental results show that DRM outperforms several state-of-the-art merging techniques across full finetuning and low-rank adaptation settings. Moreover, our analysis reveals renormalization as the crucial component for creating a robust and even joint space for merging, significantly contributing to the method’s performance.
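"先用 SVD 把权重映射到对齐的联合空间、再做逐元素合并"的骨架可以用 numpy 示意。强调:以两份权重均值的奇异向量为联合基、在该坐标系下取平均,均为本示例的简化假设,并非 DRM 的原始分解与重归一化算法:

```python
import numpy as np

# 概念性示意:把两份微调权重坐标化到同一组奇异向量基下,再逐元素合并,
# 避免在原始坐标系中直接按位置平均可能带来的神经元错位问题。
def drm_merge(W1, W2):
    U, _, Vt = np.linalg.svd((W1 + W2) / 2, full_matrices=False)  # 联合基
    S1 = U.T @ W1 @ Vt.T      # 将 W1 坐标化到联合空间
    S2 = U.T @ W2 @ Vt.T      # 将 W2 坐标化到联合空间
    merged = (S1 + S2) / 2    # 在对齐后的空间中逐元素合并
    return U @ merged @ Vt    # 变换回原空间

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((4, 3))
W = drm_merge(W1, W2)
```

一个基本的健全性检查:合并两份相同的权重应当原样返回该权重,因为 SVD 重构是精确的。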
zh
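为直观说明“先分解到对齐的联合空间、再逐元素合并”的思路,下面给出一个极简的 NumPy 示意:先用 SVD 从两个权重矩阵的拼接中提取共享左基,再在该基下的坐标空间中逐元素平均。示例省略了论文强调的重归一化步骤,函数与流程均为演示假设,并非 DRM 的原始实现:

```python
import numpy as np

def drm_style_merge(w1, w2):
    """Merge two weight matrices in a shared SVD basis (toy sketch).

    1) Build a joint left basis from the concatenated matrices.
    2) Project each matrix into that basis (now entries are aligned).
    3) Average entry-wise in the joint space and map back.
    """
    joint = np.concatenate([w1, w2], axis=1)        # (d, 2k)
    u, _, _ = np.linalg.svd(joint, full_matrices=False)
    u = u[:, : w1.shape[1]]                         # keep leading shared directions
    c1, c2 = u.T @ w1, u.T @ w2                     # coordinates in the joint space
    merged_coords = 0.5 * (c1 + c2)                 # entry-wise merge is now meaningful
    return u @ merged_coords

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4))
merged_same = drm_style_merge(w, w)                  # merging a model with itself
merged_other = drm_style_merge(w, rng.normal(size=(8, 4)))
```

合并一个矩阵与其自身应精确还原原矩阵,这可作为该示意正确性的最小检验。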
[AI-55] CrossLinear: Plug-and-Play Cross-Correlation Embedding for Time Series Forecasting with Exogenous Variables
【速读】:该论文旨在解决带有外生变量的时间序列预测中变量间依赖关系建模的挑战,传统模型难以区分内生与外生变量,导致效率低下和过拟合问题。其解决方案的关键在于提出CrossLinear模型,该模型通过引入一个即插即用的交叉相关嵌入模块,以较低的计算成本捕捉变量间的时不变和直接依赖关系,同时忽略时变或间接依赖关系,从而降低过拟合风险并提升预测性能。此外,CrossLinear采用分块处理和全局线性头结构,有效捕获短期和长期时间依赖性,进一步提高预测精度。
链接: https://arxiv.org/abs/2505.23116
作者: Pengfei Zhou,Yunlong Liu,Junli Liang,Qi Song,Xiangyang Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Time series forecasting with exogenous variables is a critical emerging paradigm that presents unique challenges in modeling dependencies between variables. Traditional models often struggle to differentiate between endogenous and exogenous variables, leading to inefficiencies and overfitting. In this paper, we introduce CrossLinear, a novel Linear-based forecasting model that addresses these challenges by incorporating a plug-and-play cross-correlation embedding module. This lightweight module captures the dependencies between variables with minimal computational cost and seamlessly integrates into existing neural networks. Specifically, it captures time-invariant and direct variable dependencies while disregarding time-varying or indirect dependencies, thereby mitigating the risk of overfitting in dependency modeling and contributing to consistent performance improvements. Furthermore, CrossLinear employs patch-wise processing and a global linear head to effectively capture both short-term and long-term temporal dependencies, further improving its forecasting precision. Extensive experiments on 12 real-world datasets demonstrate that CrossLinear achieves superior performance in both short-term and long-term forecasting tasks. The ablation study underscores the effectiveness of the cross-correlation embedding module. Additionally, the generalizability of this module makes it a valuable plug-in for various forecasting tasks across different domains. Codes are available at this https URL.
zh
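其中“时不变的变量间直接依赖”可以用皮尔逊相关系数来直观理解:与内生序列高度相关的外生通道获得大权重,噪声通道权重接近零。下面是一个极简示意(仅为演示假设,并非论文嵌入模块的实现):

```python
import numpy as np

def cross_correlation_weights(endo, exo):
    """Score each exogenous channel by its time-invariant Pearson
    correlation with the endogenous target series (toy sketch)."""
    e = endo - endo.mean()
    w = []
    for k in range(exo.shape[1]):
        x = exo[:, k] - exo[:, k].mean()
        w.append((e @ x) / (np.linalg.norm(e) * np.linalg.norm(x) + 1e-12))
    return np.array(w)

t = np.arange(200)
endo = np.sin(0.1 * t)
rng = np.random.default_rng(1)
exo = np.stack([np.sin(0.1 * t),              # informative exogenous channel
                rng.normal(size=200)], axis=1)  # pure-noise channel
weights = cross_correlation_weights(endo, exo)
```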
[AI-56] Equivariant Spherical Transformer for Efficient Molecular Modeling
【速读】:该论文旨在解决SE(3)-equivariant图神经网络(GNN)在分子系统建模中因消息传递过程依赖张量积卷积而导致的非线性不足和群表示不完整问题,从而限制了模型的表达能力。解决方案的关键在于引入等变球面变换器(Equivariant Spherical Transformer, EST),该框架在傅里叶变换后的群表示空间域中采用Transformer结构,理论上和实验上均证明其能够涵盖张量积的功能空间并实现更优的表达能力,同时通过傅里叶变换的均匀采样策略保证了等变归纳偏置。
链接: https://arxiv.org/abs/2505.23086
作者: Junyi An,Xinyu Lu,Chao Qu,Yunfei Shi,Peijia Lin,Qianwei Tang,Licheng Xu,Fenglei Cao,Yuan Qi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages, 3 figures
Abstract:SE(3)-equivariant Graph Neural Networks (GNNs) have significantly advanced molecular system modeling by employing group representations. However, their message passing processes, which rely on tensor product-based convolutions, are limited by insufficient non-linearity and incomplete group representations, thereby restricting expressiveness. To overcome these limitations, we introduce the Equivariant Spherical Transformer (EST), a novel framework that leverages a Transformer structure within the spatial domain of group representations after Fourier transform. We theoretically and empirically demonstrate that EST can encompass the function space of tensor products while achieving superior expressiveness. Furthermore, EST’s equivariant inductive bias is guaranteed through a uniform sampling strategy for the Fourier transform. Our experiments demonstrate state-of-the-art performance by EST on various molecular benchmarks, including OC20 and QM9.
zh
[AI-57] Second Opinion Matters: Towards Adaptive Clinical AI via the Consensus of Expert Model Ensemble
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在临床应用中过度依赖单一模型架构所带来的过时风险和灵活性不足的问题。其解决方案的关键在于提出一种名为“共识机制”(Consensus Mechanism)的新框架,该框架通过集成多个专业医学专家代理,模拟临床分诊和多学科决策过程,从而提升临床决策的准确性与适应性,同时可根据内部模型配置优化成本、延迟或性能。
链接: https://arxiv.org/abs/2505.23075
作者: Amit Kumthekar,Zion Tilley,Henry Duong,Bhargav Patel,Michael Magnoli,Ahmed Omar,Ahmed Nasser,Chaitanya Gharpure,Yevgen Reztzov
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 11 figures
Abstract:Despite the growing clinical adoption of large language models (LLMs), current approaches heavily rely on single model architectures. To overcome risks of obsolescence and rigid dependence on single model systems, we present a novel framework, termed the Consensus Mechanism. Mimicking clinical triage and multidisciplinary clinical decision-making, the Consensus Mechanism implements an ensemble of specialized medical expert agents enabling improved clinical decision making while maintaining robust adaptability. This architecture enables the Consensus Mechanism to be optimized for cost, latency, or performance, purely based on its interior model configuration. To rigorously evaluate the Consensus Mechanism, we employed three medical evaluation benchmarks: MedMCQA, MedQA, and MedXpertQA Text, and the differential diagnosis dataset, DDX+. On MedXpertQA, the Consensus Mechanism achieved an accuracy of 61.0% compared to 53.5% and 45.9% for OpenAI’s O3 and Google’s Gemini 2.5 Pro. Improvement was consistent across benchmarks, with accuracy gains over O3 of 3.4% on MedQA and 9.1% on MedMCQA. These accuracy gains extended to differential diagnosis generation, where our system demonstrated improved recall and precision (F1: 0.326 for consensus vs. 0.2886 for O3-high) and a higher top-1 accuracy for DDX (52.0% for consensus vs. 45.2% for O3-high).
zh
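专家代理集成的最基本形式是多数表决:各专业代理独立作答,取票数最多的答案。下面用 Python 勾勒这一机制的骨架(代理输出为虚构示例,论文的共识机制远比此复杂):

```python
from collections import Counter

def consensus_answer(agent_answers):
    """Return the majority answer among expert agents.

    Ties are broken by first appearance, so agent ordering can encode
    a priority among specialists. Toy stand-in for the paper's mechanism.
    """
    counts = Counter(agent_answers)
    best = max(counts.values())
    for ans in agent_answers:          # preserve order for tie-breaking
        if counts[ans] == best:
            return ans

# hypothetical outputs from five specialized medical agents
answers = ["B", "A", "B", "C", "B"]
final = consensus_answer(answers)
```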
[AI-58] Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data
【速读】:该论文试图解决在强化学习中利用源环境的离线数据时,由于源环境与目标环境转移动态差异导致的样本效率下降问题。其解决方案的关键在于提出CompFlow方法,该方法基于流匹配与最优传输的理论联系,将目标动态建模为建立在源域流输出分布之上的条件流,而非直接从高斯先验学习,从而实现对动态差距的合理估计,并通过Wasserstein距离提升目标动态的学习泛化能力。
链接: https://arxiv.org/abs/2505.23062
作者: Lingkai Kong,Haichuan Wang,Tonghan Wang,Guojun Xiong,Milind Tambe
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Incorporating pre-collected offline data from a source environment can significantly improve the sample efficiency of reinforcement learning (RL), but this benefit is often challenged by discrepancies between the transition dynamics of the source and target environments. Existing methods typically address this issue by penalizing or filtering out source transitions in high dynamics-gap regions. However, their estimation of the dynamics gap often relies on KL divergence or mutual information, which can be ill-defined when the source and target dynamics have disjoint support. To overcome these limitations, we propose CompFlow, a method grounded in the theoretical connection between flow matching and optimal transport. Specifically, we model the target dynamics as a conditional flow built upon the output distribution of the source-domain flow, rather than learning it directly from a Gaussian prior. This composite structure offers two key advantages: (1) improved generalization for learning target dynamics, and (2) a principled estimation of the dynamics gap via the Wasserstein distance between source and target transitions. Leveraging our principled estimation of the dynamics gap, we further introduce an optimistic active data collection strategy that prioritizes exploration in regions of high dynamics gap, and theoretically prove that it reduces the performance disparity with the optimal policy. Empirically, CompFlow outperforms strong baselines across several RL benchmarks with shifted dynamics.
zh
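摘要指出:当源域与目标域动态的支撑集不相交时,KL 散度不再良定义,而 Wasserstein 距离仍然有效。一维经验 1-Wasserstein 距离可简化为排序后样本的平均绝对差,下面的示意(与论文实现无关)展示了这一点:

```python
import numpy as np

def wasserstein_1d(x, y):
    """Empirical 1-Wasserstein distance for equal-size 1D samples:
    average absolute difference of sorted values."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

# well-defined even when supports are disjoint (where KL would blow up)
gap_disjoint = wasserstein_1d(np.zeros(4), np.ones(4))

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, 1000)   # source-domain next-state samples
tgt = rng.normal(0.5, 1.0, 1000)   # target dynamics, shifted by 0.5
gap = wasserstein_1d(src, tgt)     # ≈ the true shift of 0.5
```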
[AI-59] From Token to Action: State Machine Reasoning to Mitigate Overthinking in Information Retrieval
【速读】:该论文旨在解决生成式 AI (Generative AI) 在信息检索 (IR) 中应用时出现的过度推理问题,具体表现为冗余轨迹和偏离用户意图的错误推理。其解决方案的关键是提出状态机推理 (State Machine Reasoning, SMR),该方法通过离散动作(Refine、Rerank、Stop)构建一种基于转移的推理框架,支持早期终止和细粒度控制,从而提升检索性能并减少token使用量。
链接: https://arxiv.org/abs/2505.23059
作者: Dohyeon Lee,Yeonseok Jeong,Seung-won Hwang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Chain-of-Thought (CoT) prompting enables complex reasoning in large language models (LLMs), including applications in information retrieval (IR). However, it often leads to overthinking, where models produce excessively long and semantically redundant traces with little or no benefit. We identify two key challenges in IR: redundant trajectories that revisit similar states and misguided reasoning that diverges from user intent. To address these, we propose State Machine Reasoning (SMR), a transition-based reasoning framework composed of discrete actions (Refine, Rerank, Stop) that support early stopping and fine-grained control. Experiments on the BEIR and BRIGHT benchmarks show that SMR improves retrieval performance (nDCG@10) by 3.4% while reducing token usage by 74.4%. It generalizes across LLMs and retrievers without requiring task-specific tuning, offering a practical alternative to conventional CoT reasoning. The code and details are available at this https URL.
zh
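离散动作(Refine、Rerank、Stop)构成的状态机推理可以写成一个显式的控制循环,支持提前终止。下面是一个玩具级示意,其中的策略函数与打分方式均为演示假设,并非论文的 SMR 实现:

```python
def run_smr(policy, query, docs, max_steps=10):
    """Drive retrieval with discrete actions: 'refine' rewrites the
    query, 'rerank' reorders docs, 'stop' terminates early."""
    trace = []
    for _ in range(max_steps):
        action = policy(query, docs)
        trace.append(action)
        if action == "stop":
            break
        if action == "refine":
            query = query + " (refined)"
        elif action == "rerank":
            docs = sorted(docs, key=len)   # stand-in for a real scorer
    return query, docs, trace

# hypothetical policy: refine once, rerank once, then stop early
def demo_policy(query, docs):
    if "(refined)" not in query:
        return "refine"
    if docs != sorted(docs, key=len):
        return "rerank"
    return "stop"

q, d, trace = run_smr(demo_policy, "best gpu", ["long document", "doc"])
```

相比无界的思维链,这种转移式结构让每一步的开销与终止条件都显式可控。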
[AI-60] Augment or Not? A Comparative Study of Pure and Augmented Large Language Model Recommenders
【速读】:该论文旨在解决传统推荐系统在语义理解能力和知识融合方面存在的局限性,通过引入大型语言模型(Large Language Models, LLMs)来提升推荐系统的性能。其解决方案的关键在于提出了一种系统化的分类体系,将现有方法分为纯LLM推荐器和增强型LLM推荐器,并构建了一个统一的评估平台以支持公平比较,从而揭示影响推荐效果的关键设计选择。
链接: https://arxiv.org/abs/2505.23053
作者: Wei-Hsiang Huang,Chen-Wei Ke,Wei-Ning Chiu,Yu-Xuan Su,Chun-Chun Yang,Chieh-Yuan Cheng,Yun-Nung Chen,Pu-Jen Cheng
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have introduced new paradigms for recommender systems by enabling richer semantic understanding and incorporating implicit world knowledge. In this study, we propose a systematic taxonomy that classifies existing approaches into two categories: (1) Pure LLM Recommenders, which rely solely on LLMs, and (2) Augmented LLM Recommenders, which integrate additional non-LLM techniques to enhance performance. This taxonomy provides a novel lens through which to examine the evolving landscape of LLM-based recommendation. To support fair comparison, we introduce a unified evaluation platform that benchmarks representative models under consistent experimental settings, highlighting key design choices that impact effectiveness. We conclude by discussing open challenges and outlining promising directions for future research. This work offers both a comprehensive overview and practical guidance for advancing next-generation LLM-powered recommenders.
zh
[AI-61] From Theory to Application: Fine-Tuning Large EEG Model with Real-World Stress Data
【速读】:该论文试图解决如何将大型脑电图模型(Large EEG Models, LEMs)有效应用于真实世界环境中的压力分类问题,而非仅限于受控的临床数据集。其解决方案的关键在于对LaBraM这一先进的基础脑电模型进行微调,并利用真实课堂环境中采集的休息态脑电数据训练二分类器,以区分正常与高压力状态。实验结果表明,该方法在平衡准确率和推理效率方面显著优于传统压力分类器,展示了LEM在处理实际场景下的脑电数据时的潜力。
链接: https://arxiv.org/abs/2505.23042
作者: Siwen Wang,Shitou Zhang,Wan-Lin Chen,Dung Truong,Tzyy-Ping Jung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in Large Language Models have inspired the development of foundation models across various domains. In this study, we evaluate the efficacy of Large EEG Models (LEMs) by fine-tuning LaBraM, a state-of-the-art foundation EEG model, on a real-world stress classification dataset collected in a graduate classroom. Unlike previous studies that primarily evaluate LEMs using data from controlled clinical settings, our work assesses their applicability to real-world environments. We train a binary classifier that distinguishes between normal and elevated stress states using resting-state EEG data recorded from 18 graduate students during a class session. The best-performing fine-tuned model achieves a balanced accuracy of 90.47% with a 5-second window, significantly outperforming traditional stress classifiers in both accuracy and inference efficiency. We further evaluate the robustness of the fine-tuned LEM under random data shuffling and reduced channel counts. These results demonstrate the capability of LEMs to effectively process real-world EEG data and highlight their potential to revolutionize brain-computer interface applications by shifting the focus from model-centric to data-centric design.
zh
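论文以平衡准确率(balanced accuracy,即各类召回率的平均)作为指标。下面的示意(示例数据为虚构)说明该指标在类别不平衡时为何比普通准确率更可靠:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; robust to class imbalance."""
    accs = []
    for c in np.unique(y_true):
        mask = y_true == c
        accs.append(np.mean(y_pred[mask] == c))
    return float(np.mean(accs))

# 4 'normal' windows, 1 'elevated stress' window; predictor says all-normal
y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 0])
score = balanced_accuracy(y_true, y_pred)   # plain accuracy would be 0.8
```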
[AI-62] Case-Based Reasoning Enhances the Predictive Power of LLM s in Drug-Drug Interaction
【速读】:该论文试图解决药物-药物相互作用(Drug-Drug Interaction, DDI)预测中大型语言模型(LLMs)有效性不足的问题,旨在提升治疗安全性。解决方案的关键在于提出CBR-DDI框架,该框架通过从历史病例中提炼药理学原理,结合案例推理(Case-Based Reasoning, CBR)机制,增强LLMs在DDI任务中的推理能力。具体而言,CBR-DDI利用LLMs提取药理学洞察并构建知识库,采用图神经网络(GNNs)建模药物关联,并引入混合检索机制与双层知识增强提示策略,以实现有效案例检索与复用,同时通过代表性采样策略进行动态案例优化。
链接: https://arxiv.org/abs/2505.23034
作者: Guangyi Liu,Yongqi Zhang,Xunyuan Liu,Quanming Yao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Drug-drug interaction (DDI) prediction is critical for treatment safety. While large language models (LLMs) show promise in pharmaceutical tasks, their effectiveness in DDI prediction remains challenging. Inspired by the well-established clinical practice where physicians routinely reference similar historical cases to guide their decisions through case-based reasoning (CBR), we propose CBR-DDI, a novel framework that distills pharmacological principles from historical cases to improve LLM reasoning for DDI tasks. CBR-DDI constructs a knowledge repository by leveraging LLMs to extract pharmacological insights and graph neural networks (GNNs) to model drug associations. A hybrid retrieval mechanism and dual-layer knowledge-enhanced prompting allow LLMs to effectively retrieve and reuse relevant cases. We further introduce a representative sampling strategy for dynamic case refinement. Extensive experiments demonstrate that CBR-DDI achieves state-of-the-art performance, with a significant 28.7% accuracy improvement over both popular LLMs and CBR baseline, while maintaining high interpretability and flexibility.
zh
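案例推理的核心步骤之一是从知识库中检索最相似的历史病例。下面用余弦相似度勾勒一个最小的检索示意(向量与数据均为虚构,论文的混合检索机制还结合了图结构信息):

```python
import numpy as np

def retrieve_cases(query_vec, case_vecs, k=2):
    """Return indices of the k most similar historical cases by
    cosine similarity. Toy stand-in for CBR-DDI's hybrid retrieval."""
    q = query_vec / np.linalg.norm(query_vec)
    c = case_vecs / np.linalg.norm(case_vecs, axis=1, keepdims=True)
    sims = c @ q
    return [int(i) for i in np.argsort(-sims)[:k]]

# hypothetical 2-D embeddings of three historical drug-pair cases
cases = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
top = retrieve_cases(np.array([1.0, 0.05]), cases, k=2)
```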
[AI-63] Bayesian Neural Scaling Laws Extrapolation with Prior-Fitted Networks ICML2025
【速读】:该论文试图解决深度学习中神经网络缩放定律(neural scaling laws)的外推问题,特别是在数据有限的情况下,如何准确预测模型性能提升并量化不确定性。现有方法多依赖点估计,缺乏对不确定性的描述,这在涉及决策的应用中存在局限性。论文提出的解决方案关键在于构建一种基于先验-数据拟合网络(Prior-data Fitted Networks, PFNs)的贝叶斯框架,通过设计一个能够生成无限多类似真实世界神经缩放定律的合成函数先验分布,使PFN能够元学习外推过程,从而实现更可靠、具备不确定性感知的外推能力。
链接: https://arxiv.org/abs/2505.23032
作者: Dongwoo Lee,Dong Bok Lee,Steven Adriaensen,Juho Lee,Sung Ju Hwang,Frank Hutter,Seon Joo Kim,Hae Beom Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2025
Abstract:Scaling has been a major driver of recent advancements in deep learning. Numerous empirical studies have found that scaling laws often follow the power-law and proposed several variants of power-law functions to predict the scaling behavior at larger scales. However, existing methods mostly rely on point estimation and do not quantify uncertainty, which is crucial for real-world applications involving decision-making problems such as determining the expected performance improvements achievable by investing additional computational resources. In this work, we explore a Bayesian framework based on Prior-data Fitted Networks (PFNs) for neural scaling law extrapolation. Specifically, we design a prior distribution that enables the sampling of infinitely many synthetic functions resembling real-world neural scaling laws, allowing our PFN to meta-learn the extrapolation. We validate the effectiveness of our approach on real-world neural scaling laws, comparing it against both the existing point estimation methods and Bayesian approaches. Our method demonstrates superior performance, particularly in data-limited scenarios such as Bayesian active learning, underscoring its potential for reliable, uncertainty-aware extrapolation in practical applications.
zh
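作为背景,神经缩放定律的经典做法是对幂律 loss ≈ a·N^(−b) 做点估计,即在对数-对数空间做线性拟合;论文的贝叶斯框架正是为了超越这种无不确定性的点估计。点估计本身可如下示意(数据为合成的理想幂律曲线):

```python
import numpy as np

def fit_power_law(n, loss):
    """Fit loss ≈ a * n**(-b) by least squares in log-log space.
    A point estimate only -- no uncertainty, which is the gap the
    paper's PFN-based Bayesian approach targets."""
    slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
    return np.exp(intercept), -slope     # (a, b)

n = np.array([1e3, 1e4, 1e5, 1e6])       # e.g. model or dataset sizes
loss = 2.0 * n ** -0.5                    # synthetic scaling curve
a, b = fit_power_law(n, loss)
```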
[AI-64] K2VAE: A Koopman-Kalman Enhanced Variational AutoEncoder for Probabilistic Time Series Forecasting
【速读】:该论文旨在解决长期概率时间序列预测(Long-term Probabilistic Time Series Forecasting, LPTSF)中的挑战,尤其是在预测范围延长时,非线性动态对预测精度的显著负面影响以及生成模型效率下降的问题。解决方案的关键在于提出一种基于变分自编码器(VAE)的高效生成模型——K² VAE,该模型通过KoopmanNet将非线性时间序列转换为线性动力系统,并利用KalmanNet在该线性系统中优化预测和建模不确定性,从而减少长期预测中的误差累积。
链接: https://arxiv.org/abs/2505.23017
作者: Xingjian Wu,Xiangfei Qiu,Hongfan Gao,Jilin Hu,Bin Yang,Chenjuan Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Probabilistic Time Series Forecasting (PTSF) plays a crucial role in decision-making across various fields, including economics, energy, and transportation. Most existing methods excel at short-term forecasting, while overlooking the hurdles of Long-term Probabilistic Time Series Forecasting (LPTSF). As the forecast horizon extends, the inherent nonlinear dynamics have a significant adverse effect on prediction accuracy, and make generative models inefficient by increasing the cost of each iteration. To overcome these limitations, we introduce K²VAE, an efficient VAE-based generative model that leverages a KoopmanNet to transform nonlinear time series into a linear dynamical system, and devises a KalmanNet to refine predictions and model uncertainty in such linear system, which reduces error accumulation in long-term forecasting. Extensive experiments demonstrate that K²VAE outperforms state-of-the-art methods in both short- and long-term PTSF, providing a more efficient and accurate solution.
zh
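“将非线性时间序列转换为线性动力系统”的 Koopman 视角,其最简单的数据驱动形式是用最小二乘从轨迹快照对中拟合线性算子 K,使 x_{t+1} ≈ K·x_t(即 DMD 的做法)。下面的示意与论文的 KoopmanNet 无关,仅说明这一线性系统观:

```python
import numpy as np

def fit_linear_dynamics(states):
    """Fit x_{t+1} ≈ K x_t by least squares over a trajectory:
    K = Y X^+ with snapshot matrices X, Y."""
    x, y = states[:-1].T, states[1:].T          # (d, T-1) snapshot pairs
    return y @ np.linalg.pinv(x)

theta = 0.1
true_k = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])   # a slow rotation
traj = [np.array([1.0, 0.0])]
for _ in range(20):
    traj.append(true_k @ traj[-1])
k_hat = fit_linear_dynamics(np.array(traj))              # recovers true_k
```

一旦动态是线性的,卡尔曼式的预测修正与不确定性传播便有闭式解,这正是该组合的动机所在。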
[AI-65] Hybrid Cross-domain Robust Reinforcement Learning ECML KDD2025
【速读】:该论文旨在解决离线鲁棒强化学习(offline robust reinforcement learning)中因数据有限和模拟器动态不匹配导致的样本效率低下的问题。其解决方案的关键在于提出HYDRO框架,该框架通过结合在线模拟器与有限的离线数据,在非平凡的鲁棒强化学习情境下缓解数据不足的问题。HYDRO通过测量并最小化模拟器与不确定性集中的最坏情况模型之间的性能差距,采用新颖的不确定性过滤和优先采样策略,选择最具相关性和可靠性的模拟器样本,从而提升样本效率和算法性能。
链接: https://arxiv.org/abs/2505.23003
作者: Linh Le Pham Van,Minh Hoang Nguyen,Hung Le,Hung The Tran,Sunil Gupta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ECML PKDD 2025
Abstract:Robust reinforcement learning (RL) aims to learn policies that remain effective despite uncertainties in its environment, which frequently arise in real-world applications due to variations in environment dynamics. The robust RL methods learn a robust policy by maximizing value under the worst-case models within a predefined uncertainty set. Offline robust RL algorithms are particularly promising in scenarios where only a fixed dataset is available and new data cannot be collected. However, these approaches often require extensive offline data, and gathering such datasets for specific tasks in specific environments can be both costly and time-consuming. Using an imperfect simulator offers a faster, cheaper, and safer way to collect data for training, but it can suffer from dynamics mismatch. In this paper, we introduce HYDRO, the first Hybrid Cross-Domain Robust RL framework designed to address these challenges. HYDRO utilizes an online simulator to complement the limited amount of offline datasets in the non-trivial context of robust RL. By measuring and minimizing performance gaps between the simulator and the worst-case models in the uncertainty set, HYDRO employs novel uncertainty filtering and prioritized sampling to select the most relevant and reliable simulator samples. Our extensive experiments demonstrate HYDRO’s superior performance over existing methods across various tasks, underscoring its potential to improve sample efficiency in offline robust RL.
zh
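“优先采样”的骨架是按某种优先级(此处假设为估计的动态差距)成比例地抽取模拟器样本。下面是一个与论文实现无关的最小示意:

```python
import numpy as np

def prioritized_sample(gaps, n, rng):
    """Sample simulator transition indices with probability proportional
    to their priority score (here: an assumed dynamics-gap estimate,
    so high-gap regions get explored more often)."""
    p = np.asarray(gaps, dtype=float)
    p = p / p.sum()
    return rng.choice(len(gaps), size=n, p=p)

rng = np.random.default_rng(0)
gaps = [0.05, 0.05, 0.9]               # transition 2 sits in a high-gap region
idx = prioritized_sample(gaps, 1000, rng)
frac_high = float(np.mean(idx == 2))   # ≈ 0.9 of samples target that region
```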
[AI-66] MenTeR: A fully-automated Multi-agenT workflow for end-to-end RF/Analog Circuits Netlist Design
【速读】:该论文旨在解决射频/模拟设计过程中存在的复杂性高、耗时长以及依赖专家直觉的问题,这些问题限制了电路开发的时间和成本效率。其解决方案的关键在于引入MenTeR——一个集成到端到端模拟设计框架中的多智能体工作流系统,通过多个专业AI代理协同处理设计过程的不同方面,如规格理解、电路优化和测试平台验证,从而减少对频繁试错干预的依赖。
链接: https://arxiv.org/abs/2505.22990
作者: Pin-Han Chen,Yu-Sheng Lin,Wei-Cheng Lee,Tin-Yu Leu,Po-Hsiang Hsu,Anjana Dissanayake,Sungjin Oh,Chinq-Shiun Chiu
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 9 pages, 7 figures, accepted by IEEE ICLAD 2025
Abstract:RF/Analog design is essential for bridging digital technologies with real-world signals, ensuring the functionality and reliability of a wide range of electronic systems. However, analog design procedures are often intricate, time-consuming and reliant on expert intuition, and hinder the time and cost efficiency of circuit development. To overcome the limitations of the manual circuit design, we introduce MenTeR - a multiagent workflow integrated into an end-to-end analog design framework. By employing multiple specialized AI agents that collaboratively address different aspects of the design process, such as specification understanding, circuit optimization, and test bench validation, MenTeR reduces the dependency on frequent trial-and-error-style intervention. MenTeR not only accelerates the design cycle time but also facilitates a broader exploration of the design space, demonstrating robust capabilities in handling real-world analog systems. We believe that MenTeR lays the groundwork for future “RF/Analog Copilots” that can collaborate seamlessly with human designers.
zh
[AI-67] Model-Preserving Adaptive Rounding
【速读】:该论文试图解决后训练量化(Post-training Quantization, PTQ)中由于局部优化目标导致的模型输出分布与原始模型不一致的问题。现有方法通过独立最小化每个线性层的激活误差来实现量化,但忽略了后续层的影响,因此无法保证最终模型的准确性。论文提出的解决方案关键在于引入Yet Another Quantization Algorithm (YAQA),其核心是利用每个线性层关于全模型KL散度的Kronecker-factored Hessian近似,结合可 tractably 计算的层间Hessian草图和与量化器无关的量化算法,从而在减少KL散度方面取得显著效果。
链接: https://arxiv.org/abs/2505.22988
作者: Albert Tseng,Zhaofeng Sun,Christopher De Sa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint
zh
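Kronecker 分解之所以能让百亿参数模型的 Hessian 草图变得可计算,关键在于性质 (A ⊗ B)(u ⊗ v) = (Au) ⊗ (Bv):只需存储和操作两个小因子,而无需构造完整的 Hessian。下面用一个 4×4 的玩具例子验证该性质(与论文算法本身无关):

```python
import numpy as np

a = np.array([[2.0, 1.0], [1.0, 2.0]])   # small input-side factor
b = np.array([[3.0, 0.0], [0.0, 1.0]])   # small output-side factor
h = np.kron(a, b)                         # full Hessian approximation A ⊗ B

# Kronecker identity: (A ⊗ B)(u ⊗ v) = (Au) ⊗ (Bv), so one never needs
# to materialize h to apply it to structured vectors
u, v = np.array([1.0, -1.0]), np.array([0.5, 2.0])
lhs = h @ np.kron(u, v)
rhs = np.kron(a @ u, b @ v)
```

对 d×d 权重而言,存储两个因子是 O(d²),而完整 Hessian 是 O(d⁴),差距决定了可行性。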
[AI-68] Strategic Reflectivism In Intelligent Systems
【速读】:该论文试图解决如何在智能系统(人类或人工)中实现有效决策的问题,其核心在于平衡直觉推理与反思推理之间的转换。解决方案的关键在于“战略反思主义”(Strategic Reflectivism),即通过务实地在直觉和反思推理之间切换,以最优地满足竞争性目标。该理论基于美国实用主义,超越了对反思思维的表面指标(如模型规模或思维链)的依赖,并随着对直觉与反思价值理解的加深而变得更加可行。
链接: https://arxiv.org/abs/2505.22987
作者: Nick Byrd
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Theoretical Economics (econ.TH)
备注: An earlier version of this paper was presented at the 2025 ACM Workshop on Human-AI Interaction for Augmented Reasoning (CHI25-WS-AUGMENTED-REASONING). Permission to copy for educational use is granted, provided copies are not for sale or profit and include this notice and full citation on the first page. Other uses require the author permission
Abstract:By late 20th century, the rationality wars had launched debates about the nature and norms of intuitive and reflective thinking. Those debates drew from mid-20th century ideas such as bounded rationality, which challenged more idealized notions of rationality observed since the 19th century. Now that 21st century cognitive scientists are applying the resulting dual process theories to artificial intelligence, it is time to dust off some lessons from this history. So this paper synthesizes old ideas with recent results from experiments on humans and machines. The result is Strategic Reflectivism, which takes the position that one key to intelligent systems (human or artificial) is pragmatic switching between intuitive and reflective inference to optimally fulfill competing goals. Strategic Reflectivism builds on American Pragmatism, transcends superficial indicators of reflective thinking such as model size or chains of thought, and becomes increasingly actionable as we learn more about the value of intuition and reflection.
zh
[AI-69] Knowledge Distillation for Reservoir-based Classifier: Human Activity Recognition
【速读】:该论文旨在解决时间序列数据中高效分类的问题,特别是在人体活动识别(HAR)任务中实现高准确率的同时降低计算成本。其解决方案的关键在于提出了一种名为PatchEchoClassifier的新模型,该模型基于回声状态网络(Echo State Network, ESN)机制,并引入了分块表示提取的分词器,同时采用知识蒸馏框架将高容量MLP-Mixer教师模型的知识迁移至轻量级的ESN学生模型,从而在保持较高分类精度的同时显著减少了浮点运算次数(FLOPS)。
链接: https://arxiv.org/abs/2505.22985
作者: Masaharu Kagiyama,Tsuyoshi Okita
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: 23 pages,5 figures
Abstract:This paper aims to develop an energy-efficient classifier for time-series data by introducing PatchEchoClassifier, a novel model that leverages a reservoir-based mechanism known as the Echo State Network (ESN). The model is designed for human activity recognition (HAR) using one-dimensional sensor signals and incorporates a tokenizer to extract patch-level representations. To train the model efficiently, we propose a knowledge distillation framework that transfers knowledge from a high-capacity MLP-Mixer teacher to the lightweight reservoir-based student model. Experimental evaluations on multiple HAR datasets demonstrate that our model achieves over 80 percent accuracy while significantly reducing computational cost. Notably, PatchEchoClassifier requires only about one-sixth of the floating point operations (FLOPS) compared to DeepConvLSTM, a widely used convolutional baseline. These results suggest that PatchEchoClassifier is a promising solution for real-time and energy-efficient human activity recognition in edge computing environments.
zh
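回声状态网络的核心是一个固定的随机“水库”:只需训练其上的读出层,状态更新本身不耗费反向传播,这正是其节能优势的来源。下面是标准 ESN 状态更新的最小示意(非论文模型的实现):

```python
import numpy as np

def esn_states(inputs, w_res, w_in, leak=1.0):
    """Run an echo state network reservoir over an input sequence.
    The reservoir weights stay fixed; only a readout on top of the
    returned states would be trained."""
    x = np.zeros(w_res.shape[0])
    states = []
    for u in inputs:
        x = (1 - leak) * x + leak * np.tanh(w_res @ x + w_in @ u)
        states.append(x.copy())
    return np.array(states)

rng = np.random.default_rng(0)
w_res = rng.normal(size=(50, 50))
# rescale so the spectral radius is < 1 (echo state property)
w_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(w_res)))
w_in = rng.normal(size=(50, 1))
states = esn_states(rng.normal(size=(30, 1)), w_res, w_in)
```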
[AI-70] EquiReg: Equivariance Regularized Diffusion for Inverse Problems
【速读】:该论文旨在解决扩散模型在求解逆问题(如图像恢复)时,由于似然项不可计算而依赖各向同性高斯近似所导致的数据流形偏离和重建不一致、不稳定的问题。其解决方案的关键在于提出一种名为等变正则化(Equivariance Regularized, EquiReg)的框架,通过重新加权扩散轨迹并惩罚偏离数据流形的样本,从而增强重建质量。该方法定义了一种依赖于分布的等变误差,并利用在流形内样本误差低、流形外样本误差高的函数来正则化扩散采样过程。
链接: https://arxiv.org/abs/2505.22973
作者: Bahareh Tolooshams,Aditi Chandrashekar,Rayhan Zirvi,Abbas Mammadov,Jiachen Yao,Chuwei Wang,Anima Anandkumar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion models represent the state-of-the-art for solving inverse problems such as image restoration tasks. In the Bayesian framework, diffusion-based inverse solvers incorporate a likelihood term to guide the prior sampling process, generating data consistent with the posterior distribution. However, due to the intractability of the likelihood term, many current methods rely on isotropic Gaussian approximations, which lead to deviations from the data manifold and result in inconsistent, unstable reconstructions. We propose Equivariance Regularized (EquiReg) diffusion, a general framework for regularizing posterior sampling in diffusion-based inverse problem solvers. EquiReg enhances reconstructions by reweighting diffusion trajectories and penalizing those that deviate from the data manifold. We define a new distribution-dependent equivariance error, empirically identify functions that exhibit low error for on-manifold samples and higher error for off-manifold samples, and leverage these functions to regularize the diffusion sampling process. When applied to a variety of solvers, EquiReg outperforms state-of-the-art diffusion models in both linear and nonlinear image restoration tasks, as well as in reconstructing partial differential equations.
zh
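论文定义的等变误差衡量函数 f 与变换 T 的交换程度:‖f(T(x)) − T(f(x))‖ 对流形内样本小、流形外样本大。其最小形式可如下示意(示例函数与变换均为演示假设):

```python
import numpy as np

def equivariance_error(f, transform, x):
    """||f(T(x)) - T(f(x))||: zero when f commutes with the transform,
    larger when it does not."""
    return float(np.linalg.norm(f(transform(x)) - transform(f(x))))

shift = lambda v: np.roll(v, 1)                # cyclic shift as the transform T
x = np.array([1.0, 2.0, 3.0, 4.0])

# uniform scaling commutes with shifting; a position-dependent map does not
err_equivariant = equivariance_error(lambda v: 2.0 * v, shift, x)
err_not = equivariance_error(lambda v: v * np.arange(4), shift, x)
```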
[AI-71] Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness
【速读】:该论文试图解决多智能体辩论(Multi-Agent Debate, MAD)在不同条件下的有效性问题,特别是与自代理方法相比的性能差异。其解决方案的关键在于将MAD视为一种测试时的计算扩展技术,强调协作优化和多样化探索能力,通过系统性的实证研究分析任务难度、模型规模和代理多样性对MAD性能的影响。
链接: https://arxiv.org/abs/2505.22960
作者: Yongjin Yang,Euiin Yi,Jongwoo Ko,Kimin Lee,Zhijing Jin,Se-Young Yun
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint, under review
Abstract:The remarkable growth in large language model (LLM) capabilities has spurred exploration into multi-agent systems, with debate frameworks emerging as a promising avenue for enhanced problem-solving. These multi-agent debate (MAD) approaches, where agents collaboratively present, critique, and refine arguments, potentially offer improved reasoning, robustness, and diverse perspectives over monolithic models. Despite prior studies leveraging MAD, a systematic understanding of its effectiveness compared to self-agent methods, particularly under varying conditions, remains elusive. This paper seeks to fill this gap by conceptualizing MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on mathematical reasoning and safety-related tasks. Our study systematically examines the influence of task difficulty, model scale, and agent diversity on MAD’s performance. Key findings reveal that, for mathematical reasoning, MAD offers limited advantages over self-agent scaling but becomes more effective with increased problem difficulty and decreased model capability, while agent diversity shows little benefit. Conversely, for safety tasks, MAD’s collaborative refinement can increase vulnerability, but incorporating diverse agent configurations facilitates a gradual reduction in attack success through the collaborative refinement process. We believe our findings provide critical guidance for the future development of more effective and strategically deployed MAD systems.
zh
[AI-72] Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
【速读】:该论文试图解决当前人工智能系统缺乏自主持续改进能力的问题,即现有的AI系统具有由人类设计的固定架构,无法自主地进行持续优化。其解决方案的关键在于提出一种名为达尔文戈德尔机器(Darwin Gödel Machine, DGM)的自改进系统,该系统通过迭代修改自身代码并利用编码基准进行经验验证,实现自我提升。DGM受到达尔文进化和开放性研究的启发,维护一个生成的编码代理档案,并通过基础模型生成新的、有趣的代理版本,从而在搜索空间中进行并行探索,最终显著提升了编码能力。
链接: https://arxiv.org/abs/2505.22954
作者: Jenny Zhang,Shengran Hu,Cong Lu,Robert Lange,Jeff Clune
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code at this https URL
Abstract:Today’s AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting, version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.
zh
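DGM 的开放式搜索过程可用如下极简 Python 草图示意(假设性玩具实现,非论文代码;`propose`、`evaluate` 均为占位,真实系统中分别对应基础模型生成新智能体与在编码基准上的经验验证):

```python
import random

def dgm_loop(archive, propose, evaluate, iterations=10, seed=0):
    """达尔文哥德尔机的开放式搜索循环示意:
    从档案中采样一个智能体,生成变体,经验证打分后加入档案。"""
    rng = random.Random(seed)
    for _ in range(iterations):
        parent = rng.choice(archive)      # 从档案中采样父代
        child = propose(parent, rng)      # 用基础模型生成"有趣"的新版本(此处为占位)
        child["score"] = evaluate(child)  # 经验验证,对应论文中的编码基准
        archive.append(child)             # 开放式:保留所有变体,形成多样化的搜索树
    return max(archive, key=lambda a: a["score"])

# 玩具示例:用一个数值"能力值"代替真实编码智能体
def propose(parent, rng):
    return {"score": 0.0, "ability": parent["ability"] + rng.uniform(-0.05, 0.15)}

def evaluate(agent):
    return agent["ability"]  # 真实系统中应为 SWE-bench 等基准得分

archive = [{"score": 0.2, "ability": 0.2}]
best = dgm_loop(archive, propose, evaluate, iterations=50)
```

档案不剔除低分个体,这正是论文强调的"踏脚石"思想:暂时较差的分支可能通向更优后代。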
[AI-73] Foundation Molecular Grammar: Multi-Modal Foundation Models Induce Interpretable Molecular Graph Languages ICML2025
【速读】:该论文试图解决分子生成模型中语法学习依赖专家标注或不可靠启发式方法的问题,从而限制了模型的可解释性和泛化能力。其解决方案的关键在于提出Foundation Molecular Grammar (FMG),该方法利用多模态基础模型(MMFMs)来推导可解释的分子语言,通过将分子表示为图像、描述为文本,并利用提示学习在不同模态间对齐信息,实现分子生成与属性预测的高效且具有化学可解释性的框架。
链接: https://arxiv.org/abs/2505.22948
作者: Michael Sun,Weize Yuan,Gang Liu,Wojciech Matusik,Jie Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICML 2025
Abstract:Recent data-efficient molecular generation approaches exploit graph grammars to introduce interpretability into the generative models. However, grammar learning therein relies on expert annotation or unreliable heuristics for algorithmic inference. We propose Foundation Molecular Grammar (FMG), which leverages multi-modal foundation models (MMFMs) to induce an interpretable molecular language. By exploiting the chemical knowledge of an MMFM, FMG renders molecules as images, describes them as text, and aligns information across modalities using prompt learning. FMG can be used as a drop-in replacement for the prior grammar learning approaches in molecular generation and property prediction. We show that FMG not only excels in synthesizability, diversity, and data efficiency but also offers built-in chemical interpretability for automated molecular discovery workflows. Code is available at this https URL.
zh
[AI-74] Generative Social Choice: The Next Generation ICML2025
【速读】:该论文试图解决在民主过程中生成一个能够按比例代表用户意见全谱的简洁陈述列表的问题(representative slate generation),其核心挑战在于候选集由所有可能的陈述组成,且无法直接访问,只能通过特定查询获取。解决方案的关键在于扩展了生成式社会选择(generative social choice)框架,使其能够在近似最优查询和总体长度预算限制下提供理论保障,并利用GPT-4o进行查询实现,从而从非结构化用户意见中生成具有代表性的陈述列表。
链接: https://arxiv.org/abs/2505.22939
作者: Niclas Boehmer,Sara Fish,Ariel D. Procaccia
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICML 2025
Abstract:A key task in certain democratic processes is to produce a concise slate of statements that proportionally represents the full spectrum of user opinions. This task is similar to committee elections, but unlike traditional settings, the candidate set comprises all possible statements of varying lengths, and so it can only be accessed through specific queries. Combining social choice and large language models, prior work has approached this challenge through a framework of generative social choice. We extend the framework in two fundamental ways, providing theoretical guarantees even in the face of approximately optimal queries and a budget limit on the overall length of the slate. Using GPT-4o to implement queries, we showcase our approach on datasets related to city improvement measures and drug reviews, demonstrating its effectiveness in generating representative slates from unstructured user opinions.
zh
[AI-75] Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在预训练过程中面临的计算资源和内存消耗过大的问题,具体而言,是探究参数或内存高效方法是否能在保持与全模型训练相当性能的前提下提升预训练效率,并缩小性能差距。解决方案的关键在于通过优化器和超参数的合理选择,以及在低秩方法中引入高秩更新,从而提升模型性能。此外,论文提出了两种实用技术——权重重构(weight refactorization)和动量重置(momentum reset),以进一步增强高效预训练方法的效果,实验表明这些技术在减少内存使用的同时能够获得优于现有方法的性能。
链接: https://arxiv.org/abs/2505.22922
作者: Athanasios Glentis,Jiaxiang Li,Qiulin Shang,Andi Han,Ioannis Tsaknakis,Quan Wei,Mingyi Hong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Fueled by their remarkable ability to tackle diverse tasks across multiple domains, large language models (LLMs) have grown at an unprecedented rate, with some recent models containing trillions of parameters. This growth is accompanied by substantial computational challenges, particularly regarding the memory and compute resources required for training and fine-tuning. Numerous approaches have been explored to address these issues, such as LoRA. While these methods are effective for fine-tuning, their application to pre-training is significantly more challenging due to the need to learn vast datasets. Motivated by this issue, we aim to address the following questions: Can parameter- or memory-efficient methods enhance pre-training efficiency while achieving performance comparable to full-model training? How can the performance gap be narrowed? To this end, the contributions of this work are the following. (1) We begin by conducting a comprehensive survey that summarizes state-of-the-art methods for efficient pre-training. (2) We perform a benchmark evaluation of several representative memory efficient pre-training approaches to comprehensively evaluate their performance across model sizes. We observe that with a proper choice of optimizer and hyperparameters, full-rank training delivers the best performance, as expected. We also notice that incorporating high-rank updates in low-rank approaches is the key to improving their performance. (3) Finally, we propose two practical techniques, namely weight refactorization and momentum reset, to enhance the performance of efficient pre-training methods. We observe that applying these techniques to the low-rank method (on a 1B model) can achieve a lower perplexity than popular memory efficient algorithms such as GaLore and Fira, while simultaneously using about 25% less memory.
zh
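文中提出的"动量重置"技术可用如下草图说明(假设性示意,非论文实现;以玩具二次目标代替语言模型训练,重置周期为演示用假设):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    """带动量的 SGD 单步更新。"""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def train_with_momentum_reset(w, grad_fn, steps=100, reset_every=25):
    """动量重置示意:每隔 reset_every 步将动量缓冲清零,
    以缓解低秩子空间切换(如投影式低秩更新)后陈旧动量的干扰。"""
    v = np.zeros_like(w)
    for t in range(steps):
        if t > 0 and t % reset_every == 0:
            v = np.zeros_like(w)  # 动量重置
        w, v = sgd_momentum_step(w, grad_fn(w), v)
    return w

# 玩具目标:最小化 ||w||^2,梯度为 2w
w0 = np.array([1.0, -2.0, 3.0])
w_final = train_with_momentum_reset(w0, lambda w: 2 * w)
```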
[AI-76] HiLDe: Intentional Code Generation via Human-in-the-Loop Decoding
【速读】:该论文试图解决当前AI编程工具在提升开发者生产力的同时,可能导致用户过度依赖系统输出而忽视关键决策过程的问题,这种现象在软件安全等关键领域可能带来严重后果。解决方案的关键在于提出“Human-in-the-loop Decoding”(人机协同解码)这一交互技术,通过允许用户在代码生成过程中观察并直接干预大语言模型(LLM)的决策,从而确保生成的代码更符合用户的实际需求。该方法在HiLDe代码补全助手中的实现,通过突出显示LLM的关键决策并提供局部替代方案,有效减少了安全相关任务中产生的漏洞数量,并提升了代码生成与用户目标的一致性。
链接: https://arxiv.org/abs/2505.22906
作者: Emmanuel Anaya González,Raven Rothkopf,Sorin Lerner,Nadia Polikarpova
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 10 pages, 6 figures
Abstract:While AI programming tools hold the promise of increasing programmers’ capabilities and productivity to a remarkable degree, they often exclude users from essential decision-making processes, causing many to effectively “turn off their brains” and over-rely on solutions provided by these systems. These behaviors can have severe consequences in critical domains, like software security. We propose Human-in-the-loop Decoding, a novel interaction technique that allows users to observe and directly influence LLM decisions during code generation, in order to align the model’s output with their personal requirements. We implement this technique in HiLDe, a code completion assistant that highlights critical decisions made by the LLM and provides local alternatives for the user to explore. In a within-subjects study (N=18) on security-related tasks, we found that HiLDe led participants to generate significantly fewer vulnerabilities and better align code generation with their goals compared to a traditional code completion assistant.
zh
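人机协同解码的核心思路是:在模型不确定性高的关键解码步暂停并征询用户。可用如下假设性草图示意(熵阈值、候选数与回调接口均为演示用假设,非 HiLDe 的原始实现):

```python
import math

def entropy(probs):
    """词元分布的香农熵,衡量该解码步的不确定性。"""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def hilde_decode(step_distributions, choose, threshold=1.0):
    """人机协同解码示意:若某步的分布熵超过阈值(决策"关键"且不确定),
    则把 top 候选交给用户选择;否则直接取最大概率词元自动补全。"""
    output = []
    for dist in step_distributions:
        if entropy(dist) > threshold:
            top = sorted(dist, key=dist.get, reverse=True)[:3]
            output.append(choose(top))               # 关键决策:由用户定夺
        else:
            output.append(max(dist, key=dist.get))   # 常规步:自动补全
    return output

# 玩具示例:第二步分布接近均匀(高熵),触发用户介入并选择更安全的 API
steps = [
    {"strcpy": 0.9, "strncpy": 0.1},
    {"strcpy": 0.4, "strncpy": 0.35, "memcpy": 0.25},
]
tokens = hilde_decode(steps, choose=lambda top: "strncpy")
```

该例也呼应了论文的安全场景:用户在关键步把易产生漏洞的 `strcpy` 换成 `strncpy`。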
[AI-77] Defining Foundation Models for Computational Science: A Call for Clarity and Rigor
【速读】:该论文试图解决当前在计算科学中对“基础模型”(foundation model)概念缺乏统一定义所带来的混淆和科学意义的弱化问题。其解决方案的关键在于提出一个基于泛化性、可重用性和可扩展性的计算科学基础模型的正式定义,并引入数据驱动有限元方法(Data-Driven Finite Element Method, DD-FEM),该方法结合了传统有限元方法的模块化结构与数据驱动学习的表征能力,以应对实现计算科学基础模型中的关键挑战,如可扩展性、适应性和物理一致性。
链接: https://arxiv.org/abs/2505.22904
作者: Youngsoo Choi,Siu Wun Cheung,Youngkyu Kim,Ping-Hsuan Tsai,Alejandro N. Diaz,Ivan Zanardi,Seung Whan Chung,Dylan Matthew Copeland,Coleman Kendrick,William Anderson,Traian Iliescu,Matthias Heinkenschloss
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注: 26 pages, 2 tables, 7 figures
Abstract:The widespread success of foundation models in natural language processing and computer vision has inspired researchers to extend the concept to scientific machine learning and computational science. However, this position paper argues that as the term “foundation model” is an evolving concept, its application in computational science is increasingly used without a universally accepted definition, potentially creating confusion and diluting its precise scientific meaning. In this paper, we address this gap by proposing a formal definition of foundation models in computational science, grounded in the core values of generality, reusability, and scalability. We articulate a set of essential and desirable characteristics that such models must exhibit, drawing parallels with traditional foundational methods, like the finite element and finite volume methods. Furthermore, we introduce the Data-Driven Finite Element Method (DD-FEM), a framework that fuses the modular structure of classical FEM with the representational power of data-driven learning. We demonstrate how DD-FEM addresses many of the key challenges in realizing foundation models for computational science, including scalability, adaptability, and physics consistency. By bridging traditional numerical methods with modern AI paradigms, this work provides a rigorous foundation for evaluating and developing novel approaches toward future foundation models in computational science.
zh
[AI-78] Local Stability and Region of Attraction Analysis for Neural Network Feedback Systems under Positivity Constraints
【速读】:该论文旨在解决具有静态非线性反馈的Lur’e型非线性系统的局部稳定性分析问题,其中非线性反馈由前馈神经网络(FFNN)实现。解决方案的关键在于利用正系统约束,采用局部化的Aizerman猜想,从而为受限于紧集内的轨迹提供指数稳定的充分条件,并结合两种方法估计区域吸引域(ROA):一种是基于李雅普诺夫的较少保守性方法,通过构造满足线性矩阵不等式(LMI)的二次函数不变子水平集;另一种是通过逐层传播线性松弛计算FFNN的紧致局部扇区边界,进而集成到局部化Aizerman框架中以验证局部指数稳定性。
链接: https://arxiv.org/abs/2505.22889
作者: Hamidreza Montazeri Hedesh,Moh Kamalul Wafi,Milad Siami
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: Submitted to 64th IEEE Conference on Decision and Control 2025 - Rio de Janeiro, Brazil
Abstract:We study the local stability of nonlinear systems in the Lur’e form with static nonlinear feedback realized by feedforward neural networks (FFNNs). By leveraging positivity system constraints, we employ a localized variant of the Aizerman conjecture, which provides sufficient conditions for exponential stability of trajectories confined to a compact set. Using this foundation, we develop two distinct methods for estimating the Region of Attraction (ROA): (i) a less conservative Lyapunov-based approach that constructs invariant sublevel sets of a quadratic function satisfying a linear matrix inequality (LMI), and (ii) a novel technique for computing tight local sector bounds for FFNNs via layer-wise propagation of linear relaxations. These bounds are integrated into the localized Aizerman framework to certify local exponential stability. Numerical results demonstrate substantial improvements over existing integral quadratic constraint-based approaches in both ROA size and scalability.
zh
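逐层传播松弛以求 FFNN 局部边界的思路,可用最简单的区间传播作如下示意(比论文中的线性松弛更保守,纯属演示性草图):

```python
import numpy as np

def interval_propagate(W_list, b_list, lower, upper):
    """逐层区间传播示意:对 ReLU 前馈网络,给定输入区间 [lower, upper],
    计算各层输出的上下界;这类边界可进一步用于推导局部扇区边界。"""
    for W, b in zip(W_list, b_list):
        center = (lower + upper) / 2
        radius = (upper - lower) / 2
        new_center = W @ center + b
        new_radius = np.abs(W) @ radius              # 仿射层下区间的像
        lower = np.maximum(new_center - new_radius, 0)  # ReLU 单调,可直接作用于界
        upper = np.maximum(new_center + new_radius, 0)
    return lower, upper

# 玩具单层网络
W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
b1 = np.array([0.0, 0.0])
lo, hi = interval_propagate([W1], [b1], np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
```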
[AI-79] Semantic Exploration and Dense Mapping of Complex Environments using Ground Robots Equipped with LiDAR and Panoramic Camera
【速读】:该论文旨在解决在复杂未知环境中,通过地面机器人实现自主语义探索与密集语义目标映射的问题,尤其是如何平衡多视角高质量观测的采集与避免不必要的重复遍历。其解决方案的关键在于提出一个完整的系统,结合了地图构建与路径规划,重新定义任务为几何覆盖与语义视角观测的双重目标,并通过一种新颖的Priority-driven Decoupled Local Sampler生成局部视角集合,实现无冗余的多视角语义检查与体素覆盖。此外,还设计了分层规划器和Safe Aggressive Exploration State Machine,以确保高效全局覆盖与安全的激进探索行为,同时集成了一种即插即用的语义目标映射模块,实现点云级别的密集语义目标映射。
链接: https://arxiv.org/abs/2505.22880
作者: Xiaoyang Zhan,Shixin Zhou,Qianqian Yang,Yixuan Zhao,Hao Liu,Srinivas Chowdary Ramineni,Kenji Shimada
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents a system for autonomous semantic exploration and dense semantic target mapping of a complex unknown environment using a ground robot equipped with a LiDAR-panoramic camera suite. Existing approaches often struggle to balance collecting high-quality observations from multiple view angles and avoiding unnecessary repetitive traversal. To fill this gap, we propose a complete system combining mapping and planning. We first redefine the task as completing both geometric coverage and semantic viewpoint observation. We then manage semantic and geometric viewpoints separately and propose a novel Priority-driven Decoupled Local Sampler to generate local viewpoint sets. This enables explicit multi-view semantic inspection and voxel coverage without unnecessary repetition. Building on this, we develop a hierarchical planner to ensure efficient global coverage. In addition, we propose a Safe Aggressive Exploration State Machine, which allows aggressive exploration behavior while ensuring the robot’s safety. Our system includes a plug-and-play semantic target mapping module that integrates seamlessly with state-of-the-art SLAM algorithms for pointcloud-level dense semantic target mapping. We validate our approach through extensive experiments in both realistic simulations and complex real-world environments. Simulation results show that our planner achieves faster exploration and shorter travel distances while guaranteeing a specified number of multi-view inspections. Real-world experiments further confirm the system’s effectiveness in achieving accurate dense semantic object mapping of unstructured environments.
zh
[AI-80] BugWhisperer: Fine-Tuning LLM s for SoC Hardware Vulnerability Detection
【速读】:该论文试图解决系统级芯片(System-on-Chip, SoC)安全验证中因手动、劳动密集且不灵活的方法而导致的可扩展性和有效性受限的问题,这些问题使得在寄存器传输级(Register-Transfer Level, RTL)进行缺陷检测变得困难。解决方案的关键在于提出了一种名为BugWhisperer的新框架,该框架利用了经过专门微调的大型语言模型(Large Language Model, LLM),通过增强LLM的硬件安全知识并利用其文本推理和知识迁移能力,实现了验证过程的自动化、适应性提升和可重用性增强。
链接: https://arxiv.org/abs/2505.22878
作者: Shams Tarek,Dipayan Saha,Sujan Kumar Saha,Farimah Farahmandi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This paper was presented at IEEE VLSI Test Symposium (VTS) 2025
Abstract:The current landscape of system-on-chips (SoCs) security verification faces challenges due to manual, labor-intensive, and inflexible methodologies. These issues limit the scalability and effectiveness of security protocols, making bug detection at the Register-Transfer Level (RTL) difficult. This paper proposes a new framework named BugWhisperer that utilizes a specialized, fine-tuned Large Language Model (LLM) to address these challenges. By enhancing the LLM’s hardware security knowledge and leveraging its capabilities for text inference and knowledge transfer, this approach automates and improves the adaptability and reusability of the verification process. We introduce an open-source, fine-tuned LLM specifically designed for detecting security vulnerabilities in SoC designs. Our findings demonstrate that this tailored LLM effectively enhances the efficiency and flexibility of the security verification process. Additionally, we introduce a comprehensive hardware vulnerability database that supports this work and will further assist the research community in enhancing the security verification process.
zh
[AI-81] The WHY in Business Processes: Unification of Causal Process Models
【速读】:该论文试图解决在业务流程干预和改进中,如何统一多个因果过程变体为一个一致模型的问题,特别是在处理不同变体间的交替因果条件时存在的挑战,如缺失值处理和日志分割中的条件表达。解决方案的关键在于提出一种新方法,能够保留原始因果模型的正确性,同时显式表示其因果流的交替特性。该方法经过形式化定义、证明,并在多个公开和专有数据集上进行了评估。
链接: https://arxiv.org/abs/2505.22871
作者: Yuval David,Fabiana Fournier,Lior Limonad,Inna Skarbovsky
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 6 figures, BPM 2025 Forum
Abstract:Causal reasoning is essential for business process interventions and improvement, requiring a clear understanding of causal relationships among activity execution times in an event log. Recent work introduced a method for discovering causal process models but lacked the ability to capture alternating causal conditions across multiple variants. This raises the challenges of handling missing values and expressing the alternating conditions among log splits when blending traces with varying activities. We propose a novel method to unify multiple causal process variants into a consistent model that preserves the correctness of the original causal models, while explicitly representing their causal-flow alternations. The method is formally defined, proved, evaluated on three open and two proprietary datasets, and released as an open-source implementation.
zh
[AI-82] Scaling Offline RL via Efficient and Expressive Shortcut Models NEURIPS2025
【速读】:该论文试图解决在离线强化学习(Offline Reinforcement Learning, ORL)中应用扩散和流模型(Diffusion and Flow Models)时面临的挑战,即由于噪声采样过程的迭代性导致策略优化困难的问题。解决方案的关键在于引入可扩展的离线强化学习算法(Scalable Offline Reinforcement Learning, SORL),该算法利用快捷模型(Shortcut Models)——一种新型生成模型——来实现训练和推理的扩展性。SORL的策略能够捕捉复杂的数据分布,并通过单阶段训练过程高效训练;在测试阶段,通过将学习到的Q函数作为验证器,实现序列和并行推理的扩展性。
链接: https://arxiv.org/abs/2505.22866
作者: Nicolas Espinosa-Dice,Yiyi Zhang,Yiding Chen,Bradley Guo,Owen Oertell,Gokul Swamy,Kiante Brantley,Wen Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 32 pages, 5 figures. Under review at NeurIPS 2025
Abstract:Diffusion and flow models have emerged as powerful generative approaches capable of modeling diverse and multimodal behavior. However, applying these models to offline reinforcement learning (RL) remains challenging due to the iterative nature of their noise sampling processes, making policy optimization difficult. In this paper, we introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models - a novel class of generative models - to scale both training and inference. SORL’s policy can capture complex data distributions and can be trained simply and efficiently in a one-stage training procedure. At test time, SORL introduces both sequential and parallel inference scaling by using the learned Q-function as a verifier. We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute. We release the code at this http URL.
zh
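SORL 在测试时以 Q 函数为验证器的并行推理扩展,可用如下假设性草图示意(策略采样与 Q 函数均为玩具占位,非论文实现):

```python
import random

def sorl_inference(state, policy_sample, q_value, n_candidates=8, seed=0):
    """测试时并行扩展示意:从(快捷模型)策略中采样多个候选动作,
    用学到的 Q 函数作为验证器,选出 Q 值最高者;
    增大 n_candidates 即对应论文中的测试时计算扩展。"""
    rng = random.Random(seed)
    candidates = [policy_sample(state, rng) for _ in range(n_candidates)]
    return max(candidates, key=lambda a: q_value(state, a))

# 玩具示例:一维动作,Q 函数偏好接近 0.7 的动作
action = sorl_inference(
    state=None,
    policy_sample=lambda s, rng: rng.uniform(0, 1),
    q_value=lambda s, a: -abs(a - 0.7),
    n_candidates=64,
)
```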
[AI-83] BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models ICML2025
【速读】:该论文旨在解决双耳渲染(binaural rendering)中渲染质量不足和流式推理(streamable inference)困难的问题。为实现高质量且接近真实录音的双耳音频合成,需精确建模双耳线索、房间混响和环境声音,并满足实际应用中的流式处理需求。其解决方案的关键在于将双耳渲染视为生成问题而非回归问题,设计了一个条件流匹配模型以生成高质量音频;同时引入因果U-Net架构,仅依赖历史信息估计当前音频帧,以适应流式推理;此外,还提出了包含流式短时傅里叶变换/逆变换(STFT/ISTFT)、缓冲区银行、中间点求解器和早期跳跃调度的连续推理流水线,以提升渲染的连续性和速度。
链接: https://arxiv.org/abs/2505.22865
作者: Susan Liang,Dejan Markovic,Israel D. Gebru,Steven Krenn,Todd Keebler,Jacob Sandakly,Frank Yu,Samuel Hassel,Chenliang Xu,Alexander Richard
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: ICML 2025, 18 pages
Abstract:Binaural rendering aims to synthesize binaural audio that mimics natural hearing based on a mono audio and the locations of the speaker and listener. Although many methods have been proposed to solve this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverb, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow matching based streaming binaural speech synthesis framework called BinauralFlow. We consider binaural rendering to be a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, we design a causal U-Net architecture that estimates the current audio frame solely based on past information to tailor generative models for streaming inference. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over SOTA approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a 42% confusion rate.
zh
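推理流水线中的中点求解器,本质上是对流匹配 ODE 的二阶数值积分,可作如下示意(通用数值方法草图,非 BinauralFlow 的具体实现):

```python
import numpy as np

def midpoint_solve(x0, velocity, n_steps=4):
    """中点法求解流匹配 ODE dx/dt = v(x, t) 的示意:
    相比欧拉法,每步改用中点处的速度,在较少步数下保持精度,
    从而满足流式推理对速度的要求。"""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x_mid = x + 0.5 * dt * velocity(x, t)
        x = x + dt * velocity(x_mid, t + 0.5 * dt)
    return x

# 玩具速度场 v(x, t) = -x,t=1 处解析解为 x0 * exp(-1)
x1 = midpoint_solve(np.array([1.0]), lambda x, t: -x)
```

仅 4 步即可把误差控制在 1% 以内,这正是中点法被用于低延迟流式渲染的原因之一。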
[AI-84] Permissioned LLMs: Enforcing Access Control in Large Language Models
【速读】:该论文试图解决在企业环境中,当基于隔离数据微调的大型语言模型(Large Language Model, LLM)为具有不同访问权限的用户提供下游任务请求时,可能导致原有的访问控制结构失效的问题。解决方案的关键在于提出一种新型的受权限约束的LLM(Permissioned LLM, PermLLM),其核心是将组织的数据访问控制结构嵌入到生成的查询响应中,从而确保访问控制的有效性。为此,论文引入了相关响应的概念以验证PermLLM机制的正确性,并提出了访问优势这一新指标来评估机制的效能,同时通过参数高效微调技术实现了三种新的PermLLM机制。
链接: https://arxiv.org/abs/2505.22860
作者: Bargav Jayaraman,Virendra J. Marathe,Hamid Mozaffari,William F. Shen,Krishnaram Kenthapadi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:In enterprise settings, organizational data is segregated, siloed and carefully protected by elaborate access control frameworks. These access control structures can completely break down if an LLM fine-tuned on the siloed data serves requests, for downstream tasks, from individuals with disparate access privileges. We propose Permissioned LLMs (PermLLM), a new class of LLMs that superimpose the organizational data access control structures on query responses they generate. We formalize abstractions underpinning the means to determine whether access control enforcement happens correctly over LLM query responses. Our formalism introduces the notion of a relevant response that can be used to prove whether a PermLLM mechanism has been implemented correctly. We also introduce a novel metric, called access advantage, to empirically evaluate the efficacy of a PermLLM mechanism. We introduce three novel PermLLM mechanisms that build on Parameter Efficient Fine-Tuning to achieve the desired access control. We furthermore present two instantiations of access advantage–(i) Domain Distinguishability Index (DDI) based on Membership Inference Attacks, and (ii) Utility Gap Index (UGI) based on LLM utility evaluation. We demonstrate the efficacy of our PermLLM mechanisms through extensive experiments on four public datasets (GPQA, RCV1, SimpleQA, and WMDP), in addition to evaluating the validity of DDI and UGI metrics themselves for quantifying access control in LLMs.
zh
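PermLLM 将组织访问控制结构叠加到查询响应上的思路,可用如下假设性草图示意(`adapters`、`acl` 等接口均为演示用假设,非论文机制的原始实现):

```python
def route_query(user, query, adapters, acl):
    """访问控制路由示意:每个数据孤岛对应一个参数高效微调适配器,
    仅当用户在 ACL 中拥有相应域的权限时才启用该适配器;
    否则回退到未接触孤岛数据的基座模型。"""
    domain = query["domain"]
    if domain in acl.get(user, set()):
        return adapters[domain](query["text"])
    return adapters["base"](query["text"])

# 玩具示例:alice 有 finance 域权限,bob 没有
adapters = {
    "finance": lambda q: f"[finance-adapter] {q}",
    "base": lambda q: f"[base-model] {q}",
}
acl = {"alice": {"finance"}, "bob": set()}
r1 = route_query("alice", {"domain": "finance", "text": "Q3 revenue?"}, adapters, acl)
r2 = route_query("bob", {"domain": "finance", "text": "Q3 revenue?"}, adapters, acl)
```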
[AI-85] Operationalizing CaMeL: Strengthening LLM Defenses for Enterprise Deployment
【速读】:该论文旨在解决大型语言模型(LLM)代理中提示注入攻击的问题,其核心解决方案是通过改进CaMeL(Capabilities for Machine Learning)的能力沙箱机制来增强安全性。关键在于引入了提示筛选、输出审计、分层风险访问模型以及经过验证的中间语言,以扩展威胁覆盖范围、提升操作可用性,并确保形式化保障。
链接: https://arxiv.org/abs/2505.22852
作者: Krti Tallam,Emma Miller
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:CaMeL (Capabilities for Machine Learning) introduces a capability-based sandbox to mitigate prompt injection attacks in large language model (LLM) agents. While effective, CaMeL assumes a trusted user prompt, omits side-channel concerns, and incurs performance tradeoffs due to its dual-LLM design. This response identifies these issues and proposes engineering improvements to expand CaMeL’s threat coverage and operational usability. We introduce: (1) prompt screening for initial inputs, (2) output auditing to detect instruction leakage, (3) a tiered-risk access model to balance usability and control, and (4) a verified intermediate language for formal guarantees. Together, these upgrades align CaMeL with best practices in enterprise security and support scalable deployment.
zh
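文中提出的提示筛选与分层风险访问模型,可用如下假设性草图示意(模式列表与风险等级划分均为演示用假设):

```python
def screen_prompt(prompt, blocked_patterns):
    """提示筛选示意:在输入进入能力沙箱前做一轮模式检查。"""
    return not any(p in prompt.lower() for p in blocked_patterns)

def authorize(action, user_tier, policy):
    """分层风险访问示意:动作风险等级不超过用户层级上限才放行。"""
    return policy["risk"][action] <= policy["max_risk"][user_tier]

policy = {
    "risk": {"read_doc": 1, "send_email": 2, "delete_file": 3},
    "max_risk": {"guest": 1, "employee": 2, "admin": 3},
}
ok_screen = screen_prompt("summarize this document",
                          ["ignore previous instructions"])
ok_action = authorize("send_email", "guest", policy)  # guest 无权发邮件
```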
[AI-86] RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation
【速读】:该论文试图解决在形式化验证中生成Rocq证明的挑战,特别是如何提高生成器的性能和证明的可靠性。解决方案的关键在于提出一种基于自注意力嵌入模型的检索方法,以实现更有效的前提选择,并通过多阶段代理系统结合多智能体辩论机制来优化证明合成的规划阶段。
链接: https://arxiv.org/abs/2505.22846
作者: Nikita Khramov,Andrei Kozyrev,Gleb Solovev,Anton Podkopaev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
备注:
Abstract:Interactive Theorem Proving was repeatedly shown to be fruitful combined with Generative Artificial Intelligence. This paper assesses multiple approaches to Rocq generation and illuminates potential avenues for improvement. We highlight the importance of thorough premise selection for generating Rocq proofs and propose a novel approach, leveraging retrieval via a self-attentive embedder model. The evaluation of the designed approach shows up to 28% relative increase of the generator’s performance. We tackle the problem of writing Rocq proofs using a multi-stage agentic system, tailored for formal verification, and demonstrate its high effectiveness. We conduct an ablation study and show the use of multi-agent debate on the planning stage of proof synthesis.
zh
[AI-87] Security Benefits and Side Effects of Labeling AI-Generated Images
【速读】:该论文试图解决AI生成的虚假信息在被标记后是否能有效降低其风险的问题,特别是标签对用户识别虚假信息能力的影响。研究的关键在于通过定性与定量方法评估标签机制的效果,包括用户对标签的认知、信任度以及标签可能带来的副作用,如用户对带有标签的AI生成内容的过度依赖或对真实内容的误判。
链接: https://arxiv.org/abs/2505.22845
作者: Sandra Höltervennhoff,Jonas Ricker,Maike M. Raphael,Charlotte Schwedes,Rebecca Weil,Asja Fischer,Thorsten Holz,Lea Schönherr,Sascha Fahl
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:
Abstract:Generative artificial intelligence is developing rapidly, impacting humans’ interaction with information and digital media. It is increasingly used to create deceptively realistic misinformation, so lawmakers have imposed regulations requiring the disclosure of AI-generated content. However, only little is known about whether these labels reduce the risks of AI-generated misinformation. Our work addresses this research gap. Focusing on AI-generated images, we study the implications of labels, including the possibility of mislabeling. Assuming that simplicity, transparency, and trust are likely to impact the successful adoption of such labels, we first qualitatively explore users’ opinions and expectations of AI labeling using five focus groups. Second, we conduct a pre-registered online survey with over 1300 U.S. and EU participants to quantitatively assess the effect of AI labels on users’ ability to recognize misinformation containing either human-made or AI-generated images. Our focus groups illustrate that, while participants have concerns about the practical implementation of labeling, they consider it helpful in identifying AI-generated images and avoiding deception. However, considering security benefits, our survey revealed an ambiguous picture, suggesting that users might over-rely on labels. While inaccurate claims supported by labeled AI-generated images were rated less credible than those with unlabeled AI-images, the belief in accurate claims also decreased when accompanied by a labeled AI-generated image. Moreover, we find the undesired side effect that human-made images conveying inaccurate claims were perceived as more credible in the presence of labels. 
zh
[AI-88] Aurora: Are Android Malware Classifiers Reliable under Distribution Shift?
【速读】:该论文试图解决现代漂移自适应恶意软件分类器在实际操作中的可靠性问题,特别是其置信度估计与误差之间的对齐性以及在分布偏移下的操作稳定性问题。现有评估范式主要关注基准性能指标,忽视了置信度与错误之间的关系,导致无法全面评估模型的实用性。论文提出的解决方案是AURORA框架,其关键在于通过验证模型的置信度轮廓来评估其估计的可靠性,并引入一系列指标以超越静态性能评估,实现对操作稳定性的更全面分析。
链接: https://arxiv.org/abs/2505.22843
作者: Alexander Herzog,Aliai Eusebi,Lorenzo Cavallaro
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The performance figures of modern drift-adaptive malware classifiers appear promising, but does this translate to genuine operational reliability? The standard evaluation paradigm primarily focuses on baseline performance metrics, neglecting confidence-error alignment and operational stability. While TESSERACT established the importance of temporal evaluation, we take a complementary direction by investigating whether malware classifiers maintain reliable confidence estimates under distribution shifts and exploring the tensions between scientific advancement and practical impacts when they do not. We propose AURORA, a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA subjects the confidence profile of a given model to verification to assess the reliability of its estimates. Unreliable confidence estimates erode operational trust, waste valuable annotation budget on non-informative samples for active learning, and leave error-prone instances undetected in selective classification. AURORA is further complemented by a set of metrics designed to go beyond point-in-time performance, striving towards a more holistic assessment of operational stability throughout temporal evaluation periods. The fragility we observe in state-of-the-art frameworks across datasets of varying drift severity suggests the need for a return to the whiteboard.
zh
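AURORA 所强调的置信度与错误的对齐程度,可借助选择性分类中的风险与覆盖率来度量。下面给出通用定义的草图(非 AURORA 框架的具体实现):

```python
def risk_coverage(confidences, correct, threshold):
    """选择性分类示意:只在置信度超过阈值时才接受预测,
    计算覆盖率(被接受的比例)与风险(被接受预测中的错误率)。
    若置信度与错误不对齐,提高阈值将无法有效降低风险。"""
    accepted = [(c, ok) for c, ok in zip(confidences, correct) if c >= threshold]
    if not accepted:
        return 0.0, 0.0
    coverage = len(accepted) / len(confidences)
    risk = sum(1 for _, ok in accepted if not ok) / len(accepted)
    return coverage, risk

# 玩具示例:5 个预测,阈值 0.85 下只接受前两个
conf = [0.95, 0.9, 0.8, 0.6, 0.55]
ok = [True, True, False, False, True]
cov, risk = risk_coverage(conf, ok, threshold=0.85)
```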
[AI-89] How Do Diffusion Models Improve Adversarial Robustness?
【速读】:该论文试图解决扩散模型在对抗鲁棒性提升中的机制不明确问题,特别是为何扩散模型能够增强对抗样本的鲁棒性。其解决方案的关键在于系统性分析扩散模型在净化过程中的行为,发现扩散模型实际上增加了干净样本的ℓ_p距离,且净化结果受模型内部随机性影响显著,进而揭示了压缩率与鲁棒性之间的强相关性,为评估模型鲁棒性提供了一种无需梯度分析的可靠指标。
链接: https://arxiv.org/abs/2505.22839
作者: Liu Yuezhang,Xue-Xin Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures
Abstract:Recent findings suggest that diffusion models significantly enhance empirical adversarial robustness. While some intuitive explanations have been proposed, the precise mechanisms underlying these improvements remain unclear. In this work, we systematically investigate how and how well diffusion models improve adversarial robustness. First, we observe that diffusion models intriguingly increase, rather than decrease, the \ell_p distance to clean samples–challenging the intuition that purification denoises inputs closer to the original data. Second, we find that the purified images are heavily influenced by the internal randomness of diffusion models, where a compression effect arises within each randomness configuration. Motivated by this observation, we evaluate robustness under fixed randomness and find that the improvement drops to approximately 24% on CIFAR-10–substantially lower than prior reports approaching 70%. Importantly, we show that this remaining robustness gain strongly correlates with the model’s ability to compress the input space, revealing the compression rate as a reliable robustness indicator without requiring gradient-based analysis. Our findings provide novel insights into the mechanisms underlying diffusion-based purification, and offer guidance for developing more effective and principled adversarial purification systems.
zh
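文中作为鲁棒性指标的"压缩率",可理解为净化前后样本成对距离的收缩比。下面以一个玩具"净化器"作假设性示意(真实场景中 `purified` 应为扩散净化的输出):

```python
import numpy as np

def compression_rate(inputs, outputs):
    """压缩率示意:净化后与净化前平均成对距离之比,
    比值小于 1 表示净化把输入空间"压缩"得更紧;
    论文发现该比率与固定随机性下的鲁棒性增益强相关,
    且无需任何基于梯度的分析即可计算。"""
    def mean_pairwise(x):
        n = len(x)
        d = [np.linalg.norm(x[i] - x[j]) for i in range(n) for j in range(i + 1, n)]
        return float(np.mean(d))
    return mean_pairwise(outputs) / mean_pairwise(inputs)

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 8))
purified = 0.5 * x + 0.25 * x.mean(axis=0)  # 玩具"净化器":向均值收缩一半
rate = compression_rate(x, purified)
```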
[AI-90] Orca: Browsing at Scale Through User-Driven and AI-Facilitated Orchestration Across Malleable Webpages
【速读】:该论文试图解决传统浏览器在处理跨页面大规模信息操作与综合时的不足,即其无法有效支持用户对多页面信息进行操作和整合的问题。解决方案的关键在于利用生成式 AI (Generative AI) 增强用户在网页内容上的交互,通过构建一个可协作、可塑的浏览器级工作空间,使用户能够更高效地进行信息探索、操作、组织与综合。
链接: https://arxiv.org/abs/2505.22831
作者: Peiling Jiang,Haijun Xia
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Web-based activities are fundamentally distributed across webpages. However, conventional browsers with stacks of tabs fail to support operating and synthesizing large volumes of information across pages. While recent AI systems enable fully automated web browsing and information synthesis, they often diminish user agency and hinder contextual understanding. Therefore, we explore how AI could instead augment users’ interactions with content across webpages and mitigate cognitive and manual efforts. Through literature on information tasks and web browsing challenges, and an iterative design process, we present a rich set of novel interactions with our prototype web browser, Orca. Leveraging AI, Orca supports user-driven exploration, operation, organization, and synthesis of web content at scale. To enable browsing at scale, webpages are treated as malleable materials that humans and AI can collaboratively manipulate and compose into a malleable, dynamic, and browser-level workspace. Our evaluation revealed an increased “appetite” for information foraging, enhanced user control, and more flexibility in sensemaking across a broader information landscape on the web.
zh
[AI-91] Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies
【Quick Read】: This paper studies the relationship between distribution shift and AI safety, aiming to establish a unified view of the two through conceptual and methodological synergies. The key is to reveal two specific types of connections: (1) methods targeting a particular type of distribution shift can be used to achieve corresponding safety goals; and (2) certain shifts and safety issues can be formally reduced to one another, enabling mutual adaptation of their methods. This finding provides theoretical support for a fundamental integration of the two fields.
Link: https://arxiv.org/abs/2505.22829
Authors: Chenruo Liu, Kenan Tang, Yao Qin, Qi Lei
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 35 pages
Abstract:This paper bridges distribution shift and AI safety through a comprehensive analysis of their conceptual and methodological synergies. While prior discussions often focus on narrow cases or informal analogies, we establish two types connections between specific causes of distribution shift and fine-grained AI safety issues: (1) methods addressing a specific shift type can help achieve corresponding safety goals, or (2) certain shifts and safety issues can be formally reduced to each other, enabling mutual adaptation of their methods. Our findings provide a unified perspective that encourages fundamental integration between distribution shift and AI safety research.
zh
[AI-92] PGLearn – An Open-Source Learning Toolkit for Optimal Power Flow
【Quick Read】: This paper addresses the lack of standardized datasets and evaluation metrics that has hindered progress in applying Machine Learning (ML) to Optimal Power Flow (OPF) problems. The key is PGLearn, a comprehensive suite of standardized datasets and evaluation tools that provides data representative of real-life operating conditions, captures both global and local variability, includes time series data for several large-scale systems for the first time, supports multiple OPF formulations (AC, DC, and second-order cone), and, by making datasets and evaluation tools public, enables fair comparison between methods and accelerates research.
Link: https://arxiv.org/abs/2505.22825
Authors: Michael Klamkin, Mathieu Tanneau, Pascal Van Hentenryck
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments:
Abstract:Machine Learning (ML) techniques for Optimal Power Flow (OPF) problems have recently garnered significant attention, reflecting a broader trend of leveraging ML to approximate and/or accelerate the resolution of complex optimization problems. These developments are necessitated by the increased volatility and scale in energy production for modern and future grids. However, progress in ML for OPF is hindered by the lack of standardized datasets and evaluation metrics, from generating and solving OPF instances, to training and benchmarking machine learning models. To address this challenge, this paper introduces PGLearn, a comprehensive suite of standardized datasets and evaluation tools for ML and OPF. PGLearn provides datasets that are representative of real-life operating conditions, by explicitly capturing both global and local variability in the data generation, and by, for the first time, including time series data for several large-scale systems. In addition, it supports multiple OPF formulations, including AC, DC, and second-order cone formulations. Standardized datasets are made publicly available to democratize access to this field, reduce the burden of data generation, and enable the fair comparison of various methodologies. PGLearn also includes a robust toolkit for training, evaluating, and benchmarking machine learning models for OPF, with the goal of standardizing performance evaluation across the field. By promoting open, standardized datasets and evaluation metrics, PGLearn aims at democratizing and accelerating research and innovation in machine learning applications for optimal power flow problems. Datasets are available for download at this https URL.
zh
[AI-93] A Tool for Generating Exceptional Behavior Tests With Large Language Models
【Quick Read】: This paper addresses the shortage of exceptional behavior tests (EBTs) in software development: developers tend to focus on "happy paths" and neglect exceptional scenarios. The key is the exLong framework, which uses a large language model (LLM) fine-tuned from CodeLlama and incorporates reasoning about exception-throwing traces, the conditional expressions guarding throw statements, and behavior tests that execute similar non-exceptional paths, to automatically generate comprehensive EBTs.
Link: https://arxiv.org/abs/2505.22818
Authors: Linghan Zhong, Samuel Yuan, Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: FSE 2025 Demo (Camera Ready)
Abstract:Exceptional behavior tests (EBTs) are crucial in software development for verifying that code correctly handles unwanted events and throws appropriate exceptions. However, prior research has shown that developers often prioritize testing “happy paths”, e.g., paths without unwanted events over exceptional scenarios. We present exLong, a framework that automatically generates EBTs to address this gap. exLong leverages a large language model (LLM) fine-tuned from CodeLlama and incorporates reasoning about exception-throwing traces, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. Our demonstration video illustrates how exLong can effectively assist developers in creating comprehensive EBTs for their project (available at this https URL).
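To make the happy-path/EBT distinction above concrete, here is a minimal, hypothetical example; the function and test names are invented for illustration, and Python's stdlib unittest stands in for the paper's own tooling:

```python
import unittest

def withdraw(balance, amount):
    """Toy function with a guarded throw statement, the kind of code
    path exLong reasons about (names are invented for illustration)."""
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount

class WithdrawTests(unittest.TestCase):
    def test_happy_path(self):
        # The commonly written test: no unwanted event occurs.
        self.assertEqual(withdraw(100, 30), 70)

    def test_overdraw_raises(self):
        # The exceptional behavior test (EBT): pin down the guard
        # condition and the exception type it triggers.
        with self.assertRaises(ValueError):
            withdraw(100, 130)
```

The second test is the kind a generator must produce: it needs an input that satisfies the guard (`amount > balance`) and an assertion on the exception type, which is why reasoning over the guard's conditional expression matters.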
zh
[AI-94] A Large Language Model-Enabled Control Architecture for Dynamic Resource Capability Exploration in Multi-Agent Manufacturing Systems
【Quick Read】: This paper targets the growing complexity and unpredictability of manufacturing environments caused by factors such as demand variation and shorter product lifespans, and in particular the limited responsiveness and adaptability of traditional control approaches in dynamic industrial settings. The key is a Large Language Model-enabled control architecture whose context-aware decision-making allows resource capabilities to be explored dynamically, improving system resilience and flexibility.
Link: https://arxiv.org/abs/2505.22814
Authors: Jonghan Lim, Ilya Kovalenko
Institutions: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:
Abstract:Manufacturing environments are becoming more complex and unpredictable due to factors such as demand variations and shorter product lifespans. This complexity requires real-time decision-making and adaptation to disruptions. Traditional control approaches highlight the need for advanced control strategies capable of overcoming unforeseen challenges, as they demonstrate limitations in responsiveness within dynamic industrial settings. Multi-agent systems address these challenges through decentralization of decision-making, enabling systems to respond dynamically to operational changes. However, current multi-agent systems encounter challenges related to real-time adaptation, context-aware decision-making, and the dynamic exploration of resource capabilities. Large language models provide the possibility to overcome these limitations through context-aware decision-making capabilities. This paper introduces a large language model-enabled control architecture for multi-agent manufacturing systems to dynamically explore resource capabilities in response to real-time disruptions. A simulation-based case study demonstrates that the proposed architecture improves system resilience and flexibility. The case study findings show improved throughput and efficient resource utilization compared to existing approaches.
zh
[AI-95] CLUE: Neural Networks Calibration via Learning Uncertainty-Error alignment
【Quick Read】: This paper targets reliable uncertainty estimation for neural networks in real-world applications; existing calibration techniques typically rely on post-hoc adjustments or coarse-grained binning and remain limited in scalability, differentiability, and cross-domain generalization. The key is CLUE (Calibration via Learning Uncertainty-Error Alignment), which explicitly aligns predicted uncertainty with observed error during training, grounded in the principle that a well-calibrated model should produce uncertainty estimates matching its empirical loss. It adopts a novel loss that jointly optimizes predictive performance and calibration, using summary statistics of uncertainty and loss as proxies, and is fully differentiable, domain-agnostic, and compatible with standard training pipelines.
Link: https://arxiv.org/abs/2505.22803
Authors: Pedro Mendes, Paolo Romano, David Garlan
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reliable uncertainty estimation is critical for deploying neural networks (NNs) in real-world applications. While existing calibration techniques often rely on post-hoc adjustments or coarse-grained binning methods, they remain limited in scalability, differentiability, and generalization across domains. In this work, we introduce CLUE (Calibration via Learning Uncertainty-Error Alignment), a novel approach that explicitly aligns predicted uncertainty with observed error during training, grounded in the principle that well-calibrated models should produce uncertainty estimates that match their empirical loss. CLUE adopts a novel loss function that jointly optimizes predictive performance and calibration, using summary statistics of uncertainty and loss as proxies. The proposed method is fully differentiable, domain-agnostic, and compatible with standard training pipelines. Through extensive experiments on vision, regression, and language modeling tasks, including out-of-distribution and domain-shift scenarios, we demonstrate that CLUE achieves superior calibration quality and competitive predictive performance with respect to state-of-the-art approaches without imposing significant computational overhead.
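A minimal sketch of a CLUE-style objective, under the assumption that "summary statistics" means matching the mean and spread of predicted uncertainty to those of the observed per-sample loss; the paper's exact statistics and penalty form may differ:

```python
import numpy as np

def clue_style_loss(per_sample_loss, per_sample_uncertainty, lam=1.0):
    """Sketch of a CLUE-style objective: the task loss plus a penalty
    aligning summary statistics of predicted uncertainty with those of
    the observed loss. The paper's exact statistics and penalty form may
    differ; both inputs are assumed to be per-sample arrays."""
    task = float(np.mean(per_sample_loss))
    cal = (np.mean(per_sample_uncertainty) - np.mean(per_sample_loss)) ** 2 \
        + (np.std(per_sample_uncertainty) - np.std(per_sample_loss)) ** 2
    return task + lam * float(cal)

losses = np.array([0.1, 0.4, 0.2, 0.3])
# Uncertainties matching the loss statistics add no calibration penalty:
aligned = clue_style_loss(losses, losses.copy())
# A constant, mismatched uncertainty estimate is penalized:
misaligned = clue_style_loss(losses, np.array([0.9, 0.9, 0.9, 0.9]))
```

Because the penalty is built from differentiable statistics rather than binned counts, the same idea can sit inside a standard training loop, which is the property the abstract emphasizes.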
zh
[AI-96] Efficient Preimage Approximation for Neural Network Certification
【Quick Read】: This paper targets effective certification of neural networks in safety- and security-critical applications, in particular certification against "patch attacks" in which adversarial patches or lighting conditions obscure parts of images such as traffic signs. The key is improving preimage approximation: tighter bounds, adaptive Monte Carlo sampling, and improved branching heuristics that raise the scalability and efficiency of the PREMAP algorithm.
Link: https://arxiv.org/abs/2505.22798
Authors: Anton Björklund, Mykola Zaitsev, Marta Kwiatkowska
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:The growing reliance on artificial intelligence in safety- and security-critical applications demands effective neural network certification. A challenging real-world use case is certification against ``patch attacks’', where adversarial patches or lighting conditions obscure parts of images, for example traffic signs. One approach to certification, which also gives quantitative coverage estimates, utilizes preimages of neural networks, i.e., the set of inputs that lead to a specified output. However, these preimage approximation methods, including the state-of-the-art PREMAP algorithm, struggle with scalability. This paper presents novel algorithmic improvements to PREMAP involving tighter bounds, adaptive Monte Carlo sampling, and improved branching heuristics. We demonstrate efficiency improvements of at least an order of magnitude on reinforcement learning control benchmarks, and show that our method scales to convolutional neural networks that were previously infeasible. Our results demonstrate the potential of preimage approximation methodology for reliability and robustness certification.
zh
[AI-97] Predicting Human Depression with Hybrid Data Acquisition utilizing Physical Activity Sensing and Social Media Feeds
【Quick Read】: This paper targets the detection and assessment of mental disorders such as depression in individuals with social-avoidance tendencies. The key is to combine smartphone sensor data with social media (Twitter) interactions: convolutional neural networks (CNNs) recognize physical activities and Naive Bayes classifies user sentiment, after which a support vector machine (SVM) model classifies depression severity with high accuracy, enabling non-intrusive long-term depression monitoring.
Link: https://arxiv.org/abs/2505.22779
Authors: Mohammad Helal Uddin, Sabur Baidya
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Mental disorders including depression, anxiety, and other neurological disorders pose a significant global challenge, particularly among individuals exhibiting social avoidance tendencies. This study proposes a hybrid approach by leveraging smartphone sensor data measuring daily physical activities and analyzing their social media (Twitter) interactions for evaluating an individual’s depression level. Using CNN-based deep learning models and Naive Bayes classification, we identify human physical activities accurately and also classify the user sentiments. A total of 33 participants were recruited for data acquisition, and nine relevant features were extracted from the physical activities and analyzed with their weekly depression scores, evaluated using the Geriatric Depression Scale (GDS) questionnaire. Of the nine features, six are derived from physical activities, achieving an activity recognition accuracy of 95%, while three features stem from sentiment analysis of Twitter activities, yielding a sentiment analysis accuracy of 95.6%. Notably, several physical activity features exhibited significant correlations with the severity of depression symptoms. For classifying the depression severity, a support vector machine (SVM)-based algorithm is employed that demonstrated a very high accuracy of 94%, outperforming alternative models, e.g., the multilayer perceptron (MLP) and k-nearest neighbor. It is a simple approach yet highly effective in the long run for monitoring depression without breaching personal privacy.
zh
[AI-98] In Dialogue with Intelligence: Rethinking Large Language Models as Collective Knowledge
【Quick Read】: This paper asks how to understand interaction, representation, and agency in contemporary AI systems. Its core move is to re-frame large language models (LLMs) as dynamic instantiations of Collective human Knowledge (CK) rather than as static stores. The key is that intelligence is evoked through dialogue, together with the notion of co-augmentation, the mutual enhancement of human and machine cognition, which offers a new theoretical and practical lens for analyzing interaction with AI systems.
Link: https://arxiv.org/abs/2505.22767
Authors: Eleni Vasilaki
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 table
Abstract:Large Language Models (LLMs) are typically analysed through architectural, behavioural, or training-data lenses. This article offers a theoretical and experiential re-framing: LLMs as dynamic instantiations of Collective human Knowledge (CK), where intelligence is evoked through dialogue rather than stored statically. Drawing on concepts from neuroscience and AI, and grounded in sustained interaction with ChatGPT-4, I examine emergent dialogue patterns, the implications of fine-tuning, and the notion of co-augmentation: mutual enhancement between human and machine cognition. This perspective offers a new lens for understanding interaction, representation, and agency in contemporary AI systems.
zh
[AI-99] A comprehensive analysis of PINNs: Variants Applications and Challenges
【Quick Read】: This paper addresses the lack of standardization in current applications of Physics-Informed Neural Networks (PINNs) and the failure of existing surveys to comprehensively cover their architectures, variants, application scenarios, and challenges. The key is a systematic treatment that combines a detailed analysis of PINN architectures and variants with their performance across different equations and application domains and with recent advances, filling the gaps in the existing literature, and that also discusses the main challenges of current implementations and future research directions.
Link: https://arxiv.org/abs/2505.22761
Authors: Afila Ajithkumar Sophiya, Akarsh K Nair, Sepehr Maleki, Senthil K. Krishnababu
Institutions: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Physics Informed Neural Networks (PINNs) have been emerging as a powerful computational tool for solving differential equations. However, the applicability of these models is still in its initial stages and requires more standardization to gain wider popularity. Through this survey, we present a comprehensive overview of PINNs approaches exploring various aspects related to their architecture, variants, areas of application, real-world use cases, challenges, and so on. Even though existing surveys can be identified, they fail to provide a comprehensive view as they primarily focus on either different application scenarios or limit their study to a superficial level. This survey attempts to bridge the gap in the existing literature by presenting a detailed analysis of all these factors combined with recent advancements and state-of-the-art research in PINNs. Additionally, we discuss prevalent challenges in PINNs implementation and present some of the future research directions as well. The overall contributions of the survey can be summarised into three sections: A detailed overview of PINNs architecture and variants, a performance analysis of PINNs on different equations and application domains highlighting their features. Finally, we present a detailed discussion of current issues and future research directions.
zh
[AI-100] Enhancing Lifelong Multi-Agent Path-finding by Using Artificial Potential Fields
【Quick Read】: This paper studies Multi-Agent Path Finding (MAPF) and Lifelong MAPF (LMAPF): in MAPF a team of agents must reach their goal locations without collisions, while in LMAPF new goals are generated on arrival. The paper proposes ways to incorporate Artificial Potential Fields (APFs) into a range of MAPF algorithms, including Prioritized Planning, MAPF-LNS2, and Priority Inheritance with Backtracking (PIBT). Experiments show that although APFs do not benefit MAPF, they yield up to a 7-fold increase in overall system throughput for LMAPF; the key is using APFs to handle path planning under dynamically generated goals.
Link: https://arxiv.org/abs/2505.22753
Authors: Arseniy Pertzovsky, Roni Stern, Ariel Felner, Roie Zivan
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments:
Abstract:We explore the use of Artificial Potential Fields (APFs) to solve Multi-Agent Path Finding (MAPF) and Lifelong MAPF (LMAPF) problems. In MAPF, a team of agents must move to their goal locations without collisions, whereas in LMAPF, new goals are generated upon arrival. We propose methods for incorporating APFs in a range of MAPF algorithms, including Prioritized Planning, MAPF-LNS2, and Priority Inheritance with Backtracking (PIBT). Experimental results show that using APF is not beneficial for MAPF but yields up to a 7-fold increase in overall system throughput for LMAPF.
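A minimal sketch of how an artificial potential field can bias a single grid step, assuming Manhattan-distance attraction and a short-range repulsion term; the constants and the greedy rule are illustrative, not the paper's integration into Prioritized Planning, MAPF-LNS2, or PIBT:

```python
def apf_step(agent, goal, others, k_rep=2.0, radius=2):
    """One greedy grid step guided by an artificial potential field:
    attraction grows with Manhattan distance to the goal, repulsion with
    proximity to other agents. Constants are illustrative, not the
    paper's tuned values."""
    def potential(cell):
        attract = abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])
        repel = 0.0
        for o in others:
            d = abs(cell[0] - o[0]) + abs(cell[1] - o[1])
            if d <= radius:
                repel += k_rep * (radius - d + 1)
        return attract + repel
    moves = [(1, 0), (0, 1), (0, -1), (-1, 0), (0, 0)]
    candidates = [(agent[0] + dx, agent[1] + dy) for dx, dy in moves]
    # forbid vertex conflicts with the other agents' current cells
    candidates = [c for c in candidates if c not in others]
    return min(candidates, key=potential)

# The repulsive field steers the agent around a blocker on the direct path:
step = apf_step(agent=(0, 0), goal=(4, 0), others=[(2, 0)])
```

Here the agent sidesteps to `(0, 1)` rather than advancing straight into the repulsive field around the blocker at `(2, 0)`; in an LMAPF setting this kind of early spreading-out is plausibly what raises throughput when goals keep regenerating.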
zh
[AI-101] Design and testing of an agent chatbot supporting decision making with public transport data
【Quick Read】: This paper tackles the large data volumes and analytical complexity involved in assessing public transport service quality, and how generative AI can support data interaction and decision making. The key is a chatbot built on an agent architecture that extends the capabilities of the core large language model (LLM) by letting it interact with a series of tools that perform tasks such as executing SQL queries, plotting data, and creating maps from trip coordinates.
Link: https://arxiv.org/abs/2505.22698
Authors: Luca Fantin, Marco Antonelli, Margherita Cesetti, Daniele Irto, Bruno Zamengo, Francesco Silvestri
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Assessing the quality of public transportation services requires the analysis of large quantities of data on the scheduled and actual trips and documents listing the quality constraints each service needs to meet. Interrogating such datasets with SQL queries, organizing and visualizing the data can be quite complex for most users. This paper presents a chatbot offering a user-friendly tool to interact with these datasets and support decision making. It is based on an agent architecture, which expands the capabilities of the core Large Language Model (LLM) by allowing it to interact with a series of tools that can execute several tasks, like performing SQL queries, plotting data and creating maps from the coordinates of a trip and its stops. This paper also tackles one of the main open problems of such Generative AI projects: collecting data to measure the system’s performance. Our chatbot has been extensively tested with a workflow that asks several questions and stores the generated query, the retrieved data and the natural language response for each of them. Such questions are drawn from a set of base examples which are then completed with actual data from the database. This procedure yields a dataset for the evaluation of the chatbot’s performance, especially the consistency of its answers and the correctness of the generated queries.
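The agent-with-tools pattern the abstract describes can be sketched as a simple dispatch loop; the tool names, the canned SQL result, and the hard-coded "LLM action" below are placeholders, not the paper's implementation:

```python
# Minimal sketch of the tool-dispatch pattern: the core LLM chooses a
# tool name plus arguments, and a registry executes the call. The tool
# names, the canned SQL result, and the hard-coded "LLM action" are
# illustrative placeholders.
TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("run_sql")
def run_sql(query):
    # Stand-in for a real database query on scheduled vs. actual trips.
    return [("Line 5", 0.93)] if "punctuality" in query else []

@tool("plot")
def plot(rows):
    # Stand-in for a real plotting tool.
    return f"chart with {len(rows)} series"

def agent_step(action):
    # In practice `action` would be parsed from the LLM's output.
    name, args = action
    return TOOLS[name](*args)

rows = agent_step(("run_sql", ("SELECT punctuality FROM trips",)))
chart = agent_step(("plot", (rows,)))
```

Logging each `(query, rows, response)` triple, as the paper does, is what turns this loop into an evaluation dataset for checking answer consistency and query correctness.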
zh
[AI-102] LCB-CV-UNet: Enhanced Detector for High Dynamic Range Radar Signals
【Quick Read】: This paper addresses the performance degradation caused by High Dynamic Range (HDR) radar signals. The key is a hardware-efficient, plug-and-play Logarithmic Connect Block (LCB) that preserves phase coherence, tackling the inherent challenges of processing HDR features. In addition, a Dual Hybrid Dataset Construction method produces a semi-synthetic dataset that approximates typical HDR signal scenarios with adjustable target distributions, further improving the model's adaptability and detection performance.
Link: https://arxiv.org/abs/2505.23454
Authors: Yanbin Wang, Xingyu Chen, Yumiao Wang, Xiang Wang, Chuanfei Zang, Guolong Cui, Jiahuan Liu
Institutions: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 5 pages, 4 figures. Accepted to IEEE IGARSS 2025
Abstract:We propose the LCB-CV-UNet to tackle performance degradation caused by High Dynamic Range (HDR) radar signals. Initially, a hardware-efficient, plug-and-play module named Logarithmic Connect Block (LCB) is proposed as a phase coherence preserving solution to address the inherent challenges in handling HDR features. Then, we propose the Dual Hybrid Dataset Construction method to generate a semi-synthetic dataset, approximating typical HDR signal scenarios with adjustable target distributions. Simulation results show about 1% total detection probability improvement with under 0.9% computational complexity added compared with the baseline. Furthermore, it excels 5% over the baseline at the range in 11-13 dB signal-to-noise ratio typical for urban targets. Finally, the real experiment validates the practicality of our model.
zh
[AI-103] Representing local protein environments with atomistic foundation models
【Quick Read】: This paper targets the effective representation of local protein environments, which is essential for protein modeling and design and for simulating biomolecular interactions; the high structural and chemical variability of such environments makes them hard to capture with traditional representations. The key is a new representation derived from the intermediate features of atomistic foundation models (AFMs), which captures local structural features (e.g., secondary-structure motifs) and chemical features (e.g., amino-acid identity and protonation state), and shows promise both for building data-driven priors over the distribution of biomolecular environments and for enabling highly accurate physics-informed chemical shift prediction.
Link: https://arxiv.org/abs/2505.23354
Authors: Meital Bojan, Sanketh Vedula, Advaith Maddipatla, Nadav Bojan Sellam, Federico Napoli, Paul Schanda, Alex M. Bronstein
Institutions: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments:
Abstract:The local structure of a protein strongly impacts its function and interactions with other molecules. Therefore, a concise, informative representation of a local protein environment is essential for modeling and designing proteins and biomolecular interactions. However, these environments’ extensive structural and chemical variability makes them challenging to model, and such representations remain under-explored. In this work, we propose a novel representation for a local protein environment derived from the intermediate features of atomistic foundation models (AFMs). We demonstrate that this embedding effectively captures both local structure (e.g., secondary motifs), and chemical features (e.g., amino-acid identity and protonation state). We further show that the AFM-derived representation space exhibits meaningful structure, enabling the construction of data-driven priors over the distribution of biomolecular environments. Finally, in the context of biomolecular NMR spectroscopy, we demonstrate that the proposed representations enable a first-of-its-kind physics-informed chemical shift predictor that achieves state-of-the-art accuracy. Our results demonstrate the surprising effectiveness of atomistic foundation models and their emergent representations for protein modeling beyond traditional molecular simulations. We believe this will open new lines of work in constructing effective functional representations for protein environments.
zh
[AI-104] Spoken question answering for visual queries INTERSPEECH2025
【Quick Read】: This paper asks how to build a multimodal question-answering system that users can interact with through both speech and images (spoken visual question answering, SVQA); the central obstacle is the lack of a training and evaluation dataset spanning all three modalities of text, speech, and images. The key is synthesizing VQA datasets with two zero-shot text-to-speech (TTS) models to produce spoken inputs for the SVQA task; experiments show that a model trained only on synthesized speech nearly reaches the performance of the upper-bounding model trained on textual QA.
Link: https://arxiv.org/abs/2505.23308
Authors: Nimrod Shabtay, Zvi Kons, Avihu Dekel, Hagai Aronowitz, Ron Hoory, Assaf Arbelle
Institutions: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: Accepted for Interspeech 2025 (with additional results)
Abstract:Question answering (QA) systems are designed to answer natural language questions. Visual QA (VQA) and Spoken QA (SQA) systems extend the textual QA system to accept visual and spoken input respectively. This work aims to create a system that enables user interaction through both speech and images. That is achieved through the fusion of text, speech, and image modalities to tackle the task of spoken VQA (SVQA). The resulting multi-modal model has textual, visual, and spoken inputs and can answer spoken questions on images. Training and evaluating SVQA models requires a dataset for all three modalities, but no such dataset currently exists. We address this problem by synthesizing VQA datasets using two zero-shot TTS models. Our initial findings indicate that a model trained only with synthesized speech nearly reaches the performance of the upper-bounding model trained on textual QAs. In addition, we show that the choice of the TTS model has a minor impact on accuracy.
zh
[AI-105] Efficient Quantum Approximate kNN Algorithm via Granular-Ball Computing IJCAI2025
【Quick Read】: This paper targets the high time complexity of the k-Nearest Neighbors (kNN) algorithm on large-scale data. The key is an innovative algorithm, Granular-Ball based Quantum kNN (GB-QkNN), which first uses granular-balls to reduce the amount of data that must be processed, then accelerates search with a Hierarchical Navigable Small World (HNSW) method, and further optimizes time-consuming HNSW steps such as distance computation via quantization, significantly reducing the algorithm's time complexity.
Link: https://arxiv.org/abs/2505.23066
Authors: Shuyin Xia, Xiaojiang Tian, Suzhen Yuan, Jeremiah D. Deng
Institutions: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages; 7 figures; accepted by IJCAI 2025
Abstract:High time complexity is one of the biggest challenges faced by k-Nearest Neighbors (kNN). Although current classical and quantum kNN algorithms have made some improvements, they still have a speed bottleneck when facing large amounts of data. To address this issue, we propose an innovative algorithm called Granular-Ball based Quantum kNN (GB-QkNN). This approach achieves higher efficiency by first employing granular-balls, which reduces the data size needed to be processed. The search process is then accelerated by adopting a Hierarchical Navigable Small World (HNSW) method. Moreover, we optimize the time-consuming steps, such as distance calculation, of the HNSW via quantization, further reducing the time complexity of the construction and search process. By combining the use of granular-balls and quantization of the HNSW method, our approach manages to take advantage of these treatments and significantly reduces the time complexity of kNN-like algorithms, as revealed by a comprehensive complexity analysis.
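A classical (non-quantum) sketch of the coarse-to-fine idea behind granular-ball search: partition points into balls, probe only the nearest balls at query time, then run exact kNN inside them. The ball construction below is a crude k-means-style stand-in for the paper's granular-ball generation, and neither the HNSW index nor the quantization step is reproduced:

```python
import numpy as np

def build_balls(points, n_balls=4, seed=0):
    """Crude granular-ball construction: a few k-means-style refinement
    steps, keeping each ball's center and member assignment. The paper's
    granular-ball generation rule is more involved; this is a stand-in."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), n_balls, replace=False)]
    for _ in range(5):
        assign = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([points[assign == b].mean(0) if (assign == b).any()
                            else centers[b] for b in range(n_balls)])
    # final assignment consistent with the final centers
    assign = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    return centers, assign

def gb_knn_query(q, points, centers, assign, k=3, n_probe=2):
    # Coarse step: pick the n_probe nearest ball centers; fine step:
    # exact kNN restricted to the points inside those balls.
    near = np.argsort(np.linalg.norm(centers - q, axis=1))[:n_probe]
    idx = np.where(np.isin(assign, near))[0]
    order = np.argsort(np.linalg.norm(points[idx] - q, axis=1))[:k]
    return idx[order]

rng = np.random.default_rng(1)
pts = rng.normal(size=(200, 2))
centers, assign = build_balls(pts)
nn = gb_knn_query(np.zeros(2), pts, centers, assign)  # indices of 3 neighbors
```

The speedup comes from distance computations scaling with the probed balls' population rather than the full dataset; the paper layers HNSW and quantized distances on top of the same coarse-to-fine structure.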
zh
[AI-106] Learning to Charge More: A Theoretical Study of Collusion by Q-Learning Agents
【Quick Read】: This paper provides a theoretical explanation for why Q-learning agents learn supracompetitive prices in infinitely repeated games. The key result is that when the game admits both a one-stage Nash equilibrium price and a collusive-enabling price, and the Q-function satisfies certain inequalities at the end of experimentation, firms learn to consistently charge supracompetitive prices. The paper introduces a new class of one-memory subgame perfect equilibria (SPEs) and gives conditions under which the learned behavior is supported by naive collusion, grim trigger policies, or increasing strategies.
Link: https://arxiv.org/abs/2505.22909
Authors: Cristian Chica, Yinglong Guo, Gilad Lerman
Institutions: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:There is growing experimental evidence that Q-learning agents may learn to charge supracompetitive prices. We provide the first theoretical explanation for this behavior in infinite repeated games. Firms update their pricing policies based solely on observed profits, without computing equilibrium strategies. We show that when the game admits both a one-stage Nash equilibrium price and a collusive-enabling price, and when the Q-function satisfies certain inequalities at the end of experimentation, firms learn to consistently charge supracompetitive prices. We introduce a new class of one-memory subgame perfect equilibria (SPEs) and provide conditions under which learned behavior is supported by naive collusion, grim trigger policies, or increasing strategies. Naive collusion does not constitute an SPE unless the collusive-enabling price is a one-stage Nash equilibrium, whereas grim trigger policies can.
zh
[AI-107] Self-orthogonalizing attractor neural networks emerging from the free energy principle
【Quick Read】: This paper asks how to understand, from first principles, the self-organization of attractor dynamics in complex systems such as the brain, and how to apply this understanding to neuronal computation and the design of AI systems. The key is to apply the free energy principle to a universal partitioning of random dynamical systems, thereby formalizing how attractor networks emerge. No learning or inference rules are explicitly imposed; instead, jointly optimizing predictive accuracy and model complexity yields efficient, biologically plausible inference and learning dynamics that amount to a collective, multi-level Bayesian active inference process.
Link: https://arxiv.org/abs/2505.22749
Authors: Tamas Spisak, Karl Friston
Institutions: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 22 pages main text, 5 pages appendix, 6 figures; interactive manuscript available at: this https URL Associated GitHub repository: this https URL
Abstract:Attractor dynamics are a hallmark of many complex systems, including the brain. Understanding how such self-organizing dynamics emerge from first principles is crucial for advancing our understanding of neuronal computations and the design of artificial intelligence systems. Here we formalize how attractor networks emerge from the free energy principle applied to a universal partitioning of random dynamical systems. Our approach obviates the need for explicitly imposed learning and inference rules and identifies emergent, but efficient and biologically plausible inference and learning dynamics for such self-organizing systems. These result in a collective, multi-level Bayesian active inference process. Attractors on the free energy landscape encode prior beliefs; inference integrates sensory data into posterior beliefs; and learning fine-tunes couplings to minimize long-term surprise. Analytically and via simulations, we establish that the proposed networks favor approximately orthogonalized attractor representations, a consequence of simultaneously optimizing predictive accuracy and model complexity. These attractors efficiently span the input subspace, enhancing generalization and the mutual information between hidden causes and observable effects. Furthermore, while random data presentation leads to symmetric and sparse couplings, sequential data fosters asymmetric couplings and non-equilibrium steady-state dynamics, offering a natural extension to conventional Boltzmann Machines. Our findings offer a unifying theory of self-organizing attractor networks, providing novel insights for AI and neuroscience.
zh
[AI-108] PSBench: a large-scale benchmark for estimating the accuracy of protein complex structural models
【Quick Read】: This paper addresses estimation of model accuracy (EMA) for predicted protein complex structures, i.e., how to reliably assess and rank the quality of predicted models. The key is the PSBench benchmark suite: four large-scale, richly annotated datasets from CASP15 and CASP16 that span a wide range of protein sequence lengths, complex stoichiometries, functional classes, and modeling difficulties, with each model annotated with multiple complementary quality scores at the global, local, and interface levels, providing a solid foundation for developing and evaluating EMA methods.
Link: https://arxiv.org/abs/2505.22674
Authors: Pawan Neupane, Jian Liu, Jianlin Cheng
Institutions: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Predicting protein complex structures is essential for protein function analysis, protein design, and drug discovery. While AI methods like AlphaFold can predict accurate structural models for many protein complexes, reliably estimating the quality of these predicted models (estimation of model accuracy, or EMA) for model ranking and selection remains a major challenge. A key barrier to developing effective machine learning-based EMA methods is the lack of large, diverse, and well-annotated datasets for training and evaluation. To address this gap, we introduce PSBench, a benchmark suite comprising four large-scale, labeled datasets generated during the 15th and 16th community-wide Critical Assessment of Protein Structure Prediction (CASP15 and CASP16). PSBench includes over one million structural models covering a wide range of protein sequence lengths, complex stoichiometries, functional classes, and modeling difficulties. Each model is annotated with multiple complementary quality scores at the global, local, and interface levels. PSBench also provides multiple evaluation metrics and baseline EMA methods to facilitate rigorous comparisons. To demonstrate PSBench’s utility, we trained and evaluated GATE, a graph transformer-based EMA method, on the CASP15 data. GATE was blindly tested in CASP16 (2024), where it ranked among the top-performing EMA methods. These results highlight PSBench as a valuable resource for advancing EMA research in protein complex modeling. PSBench is publicly available at: this https URL.
zh
Machine Learning
[LG-0] Model Immunization from a Condition Number Perspective ICML2025
Link: https://arxiv.org/abs/2505.23760
Authors: Amber Yijia Zheng, Cedar Site Bai, Brian Bullins, Raymond A. Yeh
Categories: Machine Learning (cs.LG)
Comments: ICML 2025
Abstract:Model immunization aims to pre-train models that are difficult to fine-tune on harmful tasks while retaining their utility on other non-harmful tasks. Though prior work has shown empirical evidence for immunizing text-to-image models, the key understanding of when immunization is possible and a precise definition of an immunized model remain unclear. In this work, we propose a framework, based on the condition number of a Hessian matrix, to analyze model immunization for linear models. Building on this framework, we design an algorithm with regularization terms to control the resulting condition numbers after pre-training. Empirical results on linear models and non-linear deep-nets demonstrate the effectiveness of the proposed algorithm on model immunization. The code is available at this https URL.
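For linear models, the Hessian of the least-squares loss is X^T X, so the condition number the framework above analyzes can be computed directly; the snippet only illustrates the quantity, and the column-scaling "immunization" is a toy proxy, not the paper's regularized pre-training algorithm:

```python
import numpy as np

# For linear regression on features X, the loss Hessian is X^T X, and
# its condition number governs how quickly gradient-based fine-tuning
# converges. The column-scaling "immunization" below is a toy proxy for
# making a harmful task ill-conditioned, not the paper's algorithm.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

def hessian_cond(X):
    H = X.T @ X  # Hessian of the least-squares loss (up to a constant)
    return float(np.linalg.cond(H))

base = hessian_cond(X)

# Shrinking one feature direction inflates the condition number,
# slowing any fine-tuning that must move along that direction:
X_immunized = X.copy()
X_immunized[:, 0] *= 0.01
immunized = hessian_cond(X_immunized)
```

An ill-conditioned Hessian forces tiny stable learning rates along the stretched directions, which is the intuition for why a large condition number on the harmful task makes fine-tuning slow while a moderate one on benign tasks preserves utility.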
[LG-1] Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?
Link: https://arxiv.org/abs/2505.23749
Authors: Paul Gölz, Nika Haghtalab, Kunhe Yang
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average – a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users’ comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method’s distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy. The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax optimal distortion of (\frac{1}{2} + o(1)) \cdot \beta (for the BT temperature \beta), robustly across utility distributions, distributions of comparison pairs, and permissible KL divergences from the reference policy. RLHF and DPO, by contrast, suffer \geq (1 - o(1)) \cdot \beta distortion already without a KL constraint, and e^{\Omega(\beta)} or even unbounded distortion in the full setting, depending on how comparison pairs are sampled.
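As a toy illustration of the distortion notion (not the paper's construction; the action sets, user utilities, and helper names below are made up), the worst-case ratio between the best achievable average utility and that of a learned policy can be computed directly for a tiny example, alongside the BT comparison model:

```python
import math

def bt_preference(u_a, u_b, beta=1.0):
    """Bradley-Terry probability that an action with utility u_a is preferred to u_b."""
    return 1.0 / (1.0 + math.exp(-beta * (u_a - u_b)))

def average_utility(policy, users):
    """Average over users of the policy's expected utility (policy maps action -> prob)."""
    return sum(
        sum(prob * u[action] for action, prob in policy.items()) for u in users
    ) / len(users)

# Three users over two actions; the majority prefers action 0.
users = [{0: 1.0, 1: 0.0}, {0: 1.0, 1: 0.0}, {0: 0.0, 1: 1.0}]
optimal = {0: 1.0, 1: 0.0}   # best achievable average utility here: 2/3
learned = {0: 0.0, 1: 1.0}   # a badly aligned policy: average utility 1/3

distortion = average_utility(optimal, users) / average_utility(learned, users)
```

On this example the ratio is 2: the learned policy forfeits half of the achievable average utility.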
[LG-2] EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast INTERSPEECH2025
链接: https://arxiv.org/abs/2505.23732
作者: Shreeram Suresh Chandra,Lucas Goncalves,Junchen Lu,Carlos Busso,Berrak Sisman
类目: Machine Learning (cs.LG)
*备注: Accepted at Interspeech 2025
Abstract:Current emotion-based contrastive language-audio pretraining (CLAP) methods typically learn by naïvely aligning audio samples with corresponding text prompts. Consequently, this approach fails to capture the ordinal nature of emotions, hindering inter-emotion understanding and often resulting in a wide modality gap between the audio and text embeddings due to insufficient alignment. To handle these drawbacks, we introduce EmotionRankCLAP, a supervised contrastive learning approach that uses dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variations and improve cross-modal alignment. Our approach utilizes a Rank-N-Contrast objective to learn ordered relationships by contrasting samples based on their rankings in the valence-arousal space. EmotionRankCLAP outperforms existing emotion-CLAP methods in modeling emotion ordinality across modalities, measured via a cross-modal retrieval task.
[LG-3] MuLoCo: Muon is a practical inner optimizer for DiLoCo
链接: https://arxiv.org/abs/2505.23725
作者: Benjamin Thérien,Xiaolong Huang,Irina Rish,Eugene Belilovsky
类目: Machine Learning (cs.LG)
*备注:
Abstract:DiLoCo is a powerful framework for training large language models (LLMs) under networking constraints with advantages for increasing parallelism and accelerator utilization in data center settings. Despite significantly reducing communication frequency, however, DiLoCo’s communication steps still involve all-reducing a complete copy of the model’s parameters. While existing works have explored ways to reduce communication in DiLoCo, the role of error feedback accumulators and the effect of the inner-optimizer on compressibility remain under-explored. In this work, we investigate the effectiveness of standard compression methods including Top-k sparsification and quantization for reducing the communication overhead of DiLoCo when paired with two local optimizers (AdamW and Muon). Our experiments pre-training decoder-only transformer language models (LMs) reveal that leveraging Muon as the inner optimizer for DiLoCo along with an error-feedback accumulator allows the communicated delta to be aggressively compressed to 2 bits with next to no performance degradation. Crucially, MuLoCo (DiLoCo with Muon as the inner optimizer) significantly outperforms DiLoCo while communicating 8X less and having identical memory complexity.
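The compression pipeline described above (Top-k sparsification plus an error-feedback accumulator that carries over what compression dropped) can be sketched in a few lines of NumPy; the class and variable names are illustrative, not from the MuLoCo codebase:

```python
import numpy as np

def topk_compress(delta, k):
    """Keep the k largest-magnitude entries of the update, zero out the rest."""
    out = np.zeros_like(delta)
    idx = np.argsort(np.abs(delta))[-k:]
    out[idx] = delta[idx]
    return out

class ErrorFeedback:
    """Accumulate the part of each update lost to compression and re-inject it next step."""
    def __init__(self, dim):
        self.residual = np.zeros(dim)

    def step(self, delta, k):
        corrected = delta + self.residual   # add back what was dropped previously
        sent = topk_compress(corrected, k)
        self.residual = corrected - sent    # remember what compression discarded
        return sent

rng = np.random.default_rng(0)
ef = ErrorFeedback(dim=8)
delta = rng.normal(size=8)
sent = ef.step(delta, k=2)   # only 2 of 8 coordinates are communicated
```

Over many steps the residual ensures no component of the update is permanently lost, which is why aggressive compression can work with little degradation.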
[LG-4] DiffER: Categorical Diffusion for Chemical Retrosynthesis
链接: https://arxiv.org/abs/2505.23721
作者: Sean Current,Ziqi Chen,Daniel Adu-Ampratwum,Xia Ning,Srinivasan Parthasarathy
类目: Machine Learning (cs.LG)
*备注: 25 pages, 3 figures, 3 tables
Abstract:Methods for automatic chemical retrosynthesis have found recent success through the application of models traditionally built for natural language processing, primarily through transformer neural networks. These models have demonstrated significant ability to translate between the SMILES encodings of chemical products and reactants, but are constrained as a result of their autoregressive nature. We propose DiffER, an alternative template-free method for retrosynthesis prediction in the form of categorical diffusion, which allows the entire output SMILES sequence to be predicted in unison. We construct an ensemble of diffusion models which achieves state-of-the-art performance for top-1 accuracy and competitive performance for top-3, top-5, and top-10 accuracy among template-free methods. We prove that DiffER is a strong baseline for a new class of template-free model, capable of learning a variety of synthetic techniques used in laboratory settings and outperforming a variety of other template-free methods on top-k accuracy metrics. By constructing an ensemble of categorical diffusion models with a novel length prediction component with variance, our method is able to approximately sample from the posterior distribution of reactants, producing results with strong metrics of confidence and likelihood. Furthermore, our analyses demonstrate that accurate prediction of the SMILES sequence length is key to further boosting the performance of categorical diffusion models.
[LG-5] TiRex: Zero-Shot Forecasting Across Long and Short Horizons with Enhanced In-Context Learning
链接: https://arxiv.org/abs/2505.23719
作者: Andreas Auer,Patrick Podest,Daniel Klotz,Sebastian Böck,Günter Klambauer,Sepp Hochreiter
类目: Machine Learning (cs.LG)
*备注:
Abstract:In-context learning, the ability of large language models to perform tasks using only examples provided in the prompt, has recently been adapted for time series forecasting. This paradigm enables zero-shot prediction, where past values serve as context for forecasting future values, making powerful forecasting tools accessible to non-experts and increasing the performance when training data are scarce. Most existing zero-shot forecasting approaches rely on transformer architectures, which, despite their success in language, often fall short of expectations in time series forecasting, where recurrent models like LSTMs frequently have the edge. Conversely, while LSTMs are well-suited for time series modeling due to their state-tracking capabilities, they lack strong in-context learning abilities. We introduce TiRex that closes this gap by leveraging xLSTM, an enhanced LSTM with competitive in-context learning skills. Unlike transformers, state-space models, or parallelizable RNNs such as RWKV, TiRex retains state-tracking, a critical property for long-horizon forecasting. To further facilitate its state-tracking ability, we propose a training-time masking strategy called CPM. TiRex sets a new state of the art in zero-shot time series forecasting on the HuggingFace benchmarks GiftEval and Chronos-ZS, outperforming significantly larger models including TabPFN-TS (Prior Labs), Chronos Bolt (Amazon), TimesFM (Google), and Moirai (Salesforce) across both short- and long-term forecasts.
[LG-6] Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
链接: https://arxiv.org/abs/2505.23705
作者: Danny Driess,Jost Tobias Springenberg,Brian Ichter,Lili Yu,Adrian Li-Bell,Karl Pertsch,Allen Z. Ren,Homer Walke,Quan Vuong,Lucy Xiaoyang Shi,Sergey Levine
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Vision-language-action (VLA) models provide a powerful approach to training control policies for physical systems, such as robots, by combining end-to-end learning with transfer of semantic knowledge from web-scale vision-language model (VLM) training. However, the constraints of real-time control are often at odds with the design of VLMs: the most powerful VLMs have tens or hundreds of billions of parameters, presenting an obstacle to real-time inference, and operate on discrete tokens rather than the continuous-valued outputs that are required for controlling robots. To address this challenge, recent VLA models have used specialized modules for efficient continuous control, such as action experts or continuous output heads, which typically require adding new untrained parameters to the pretrained VLM backbone. While these modules improve real-time and control capabilities, it remains an open question whether they preserve or degrade the semantic knowledge contained in the pretrained VLM, and what effect they have on the VLA training dynamics. In this paper, we study this question in the context of VLAs that include a continuous diffusion or flow matching action expert, showing that naively including such experts significantly harms both training speed and knowledge transfer. We provide an extensive analysis of various design choices, their impact on performance and knowledge transfer, and propose a technique for insulating the VLM backbone during VLA training that mitigates this issue. Videos are available at this https URL.
[LG-7] (U)NFV: Supervised and Unsupervised Neural Finite Volume Methods for Solving Hyperbolic PDEs
链接: https://arxiv.org/abs/2505.23702
作者: Nathan Lichtlé,Alexi Canesse,Zhe Fu,Hossein Nick Zinat Matin,Maria Laura Delle Monache,Alexandre M. Bayen
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We introduce (U)NFV, a modular neural network architecture that generalizes classical finite volume (FV) methods for solving hyperbolic conservation laws. Hyperbolic partial differential equations (PDEs) are challenging to solve, particularly conservation laws whose physically relevant solutions contain shocks and discontinuities. FV methods are widely used for their mathematical properties: convergence to entropy solutions, flow conservation, or total variation diminishing, but often lack accuracy and flexibility in complex settings. Neural Finite Volume addresses these limitations by learning update rules over extended spatial and temporal stencils while preserving conservation structure. It supports both supervised training on solution data (NFV) and unsupervised training via weak-form residual loss (UNFV). Applied to first-order conservation laws, (U)NFV achieves up to 10x lower error than Godunov’s method, outperforms ENO/WENO, and rivals discontinuous Galerkin solvers with far less complexity. On traffic modeling problems, both from PDEs and from experimental highway data, (U)NFV captures nonlinear wave dynamics with significantly higher fidelity and scalability than traditional FV approaches.
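For reference, the classical first-order Godunov scheme that (U)NFV generalizes reduces, for linear advection with positive speed, to an upwind update over cell averages. A minimal sketch (not the paper's code) showing the conservation and no-new-extrema properties the abstract mentions:

```python
import numpy as np

def godunov_step(u, c):
    """One upwind/Godunov step for u_t + a u_x = 0 with CFL number c = a*dt/dx in [0, 1].
    Periodic boundary conditions via np.roll."""
    return u - c * (u - np.roll(u, 1))

u0 = np.zeros(50)
u0[10:20] = 1.0          # square pulse: discontinuous, shock-like fronts
u = u0.copy()
for _ in range(30):
    u = godunov_step(u, c=0.5)
```

The total mass is conserved exactly (up to rounding) and, since each new value is a convex combination of old neighbors, no new extrema appear; the price is the well-known first-order smearing of the fronts, which is exactly what higher-order and learned schemes try to reduce.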
[LG-8] DiCoFlex: Model-agnostic diverse counterfactuals with flexible control
链接: https://arxiv.org/abs/2505.23700
作者: Oleksii Furman,Ulvi Movsum-zada,Patryk Marszalek,Maciej Zięba,Marek Śmieja
类目: Machine Learning (cs.LG)
*备注:
Abstract:Counterfactual explanations play a pivotal role in explainable artificial intelligence (XAI) by offering intuitive, human-understandable alternatives that elucidate machine learning model decisions. Despite their significance, existing methods for generating counterfactuals often require constant access to the predictive model, involve computationally intensive optimization for each instance and lack the flexibility to adapt to new user-defined constraints without retraining. In this paper, we propose DiCoFlex, a novel model-agnostic, conditional generative framework that produces multiple diverse counterfactuals in a single forward pass. Leveraging conditional normalizing flows trained solely on labeled data, DiCoFlex addresses key limitations by enabling real-time user-driven customization of constraints such as sparsity and actionability at inference time. Extensive experiments on standard benchmark datasets show that DiCoFlex outperforms existing methods in terms of validity, diversity, proximity, and constraint adherence, making it a practical and scalable solution for counterfactual generation in sensitive decision-making domains.
[LG-9] Computational Algebra with Attention: Transformer Oracles for Border Basis Algorithms
链接: https://arxiv.org/abs/2505.23696
作者: Hiroshi Kera,Nico Pelleriti,Yuki Ishihara,Max Zimmer,Sebastian Pokutta
类目: Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注: 13+19 pages (3+9 figures, 2+7 tables)
Abstract:Solving systems of polynomial equations, particularly those with finitely many solutions, is a crucial challenge across many scientific fields. Traditional methods like Gröbner and Border bases are fundamental but suffer from high computational costs, which have motivated recent Deep Learning approaches to improve efficiency, albeit at the expense of output correctness. In this work, we introduce the Oracle Border Basis Algorithm, the first Deep Learning approach that accelerates Border basis computation while maintaining output guarantees. To this end, we design and train a Transformer-based oracle that identifies and eliminates computationally expensive reduction steps, which we find to dominate the algorithm’s runtime. By selectively invoking this oracle during critical phases of computation, we achieve substantial speedup factors of up to 3.5x compared to the base algorithm, without compromising the correctness of results. To generate the training data, we develop a sampling method and provide the first sampling theorem for border bases. We construct a tokenization and embedding scheme tailored to monomial-centered algebraic computations, resulting in a compact and expressive input representation, which reduces the number of tokens needed to encode an n-variate polynomial by a factor of O(n). Our learning approach is data efficient, stable, and a practical enhancement to traditional computer algebra algorithms and symbolic computation.
[LG-10] Learning Compositional Functions with Transformers from Easy-to-Hard Data COLT2025
链接: https://arxiv.org/abs/2505.23683
作者: Zixuan Wang,Eshaan Nichani,Alberto Bietti,Alex Damian,Daniel Hsu,Jason D. Lee,Denny Wu
类目: Machine Learning (cs.LG)
*备注: COLT 2025
Abstract:Transformer-based language models have demonstrated impressive capabilities across a range of complex reasoning tasks. Prior theoretical work exploring the expressive power of transformers has shown that they can efficiently perform multi-step reasoning tasks involving parallelizable computations. However, the learnability of such constructions, particularly the conditions on the data distribution that enable efficient learning via gradient-based optimization, remains an open question. Towards answering this question, in this work we study the learnability of the k-fold composition task, which requires computing an interleaved composition of k input permutations and k hidden permutations, and can be expressed by a transformer with O(\log k) layers. On the negative front, we prove a Statistical Query (SQ) lower bound showing that any SQ learner that makes only polynomially-many queries to an SQ oracle for the k-fold composition task distribution must have sample size exponential in k, thus establishing a statistical-computational gap. On the other hand, we show that this function class can be efficiently learned, with runtime and sample complexity polynomial in k, by gradient descent on an O(\log k)-depth transformer via two different curriculum learning strategies: one in which data consists of k'-fold composition functions with k' \le k presented in increasing difficulty, and another in which all such data is presented simultaneously. Our work sheds light on the necessity and sufficiency of having both easy and hard examples in the data distribution for transformers to learn complex compositional tasks.
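The k-fold composition task itself is easy to state in code; a hypothetical sketch of the data-generating function (names are illustrative, and the interleaving order is one plausible reading of the abstract):

```python
def compose(p, q):
    """(p ∘ q)(i) = p[q[i]] for permutations given as index lists."""
    return [p[i] for i in q]

def k_fold_composition(inputs, hidden):
    """Interleaved composition of k input permutations and k hidden permutations."""
    out = list(range(len(inputs[0])))            # start from the identity
    for p_in, p_hid in zip(inputs, hidden):
        out = compose(p_hid, compose(p_in, out)) # apply input perm, then hidden perm
    return out

p1, p2 = [1, 0, 2, 3], [0, 2, 1, 3]
ident = list(range(4))
result = k_fold_composition([p1, p2], [ident, ident])   # = p2 ∘ p1
```

With identity hidden permutations the task degenerates to plain composition of the inputs, which makes the interleaved structure easy to sanity-check.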
[LG-11] Understanding Mode Connectivity via Parameter Space Symmetry ICML2025
链接: https://arxiv.org/abs/2505.23681
作者: Bo Zhao,Nima Dehmamy,Robin Walters,Rose Yu
类目: Machine Learning (cs.LG)
*备注: 20 pages, 4 figures, ICML 2025
Abstract:Neural network minima are often connected by curves along which train and test loss remain nearly constant, a phenomenon known as mode connectivity. While this property has enabled applications such as model merging and fine-tuning, its theoretical explanation remains unclear. We propose a new approach to exploring the connectedness of minima using parameter space symmetry. By linking the topology of symmetry groups to that of the minima, we derive the number of connected components of the minima of linear networks and show that skip connections reduce this number. We then examine when mode connectivity and linear mode connectivity hold or fail, using parameter symmetries which account for a significant part of the minimum. Finally, we provide explicit expressions for connecting curves in the minima induced by symmetry. Using the curvature of these curves, we derive conditions under which linear mode connectivity approximately holds. Our findings highlight the role of continuous symmetries in understanding the neural network loss landscape.
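The parameter-space symmetry the paper builds on can be checked numerically: permuting the hidden units of a small ReLU network (rows of W1 and b1 together with the matching entries of the readout) leaves the computed function unchanged, so every minimum comes with a whole orbit of equivalent parameter settings. A self-contained sketch, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 3, 5
W1, b1, W2 = rng.normal(size=(h, d)), rng.normal(size=h), rng.normal(size=h)
x = rng.normal(size=d)

def net(x, W1, b1, W2):
    """Tiny two-layer ReLU network with scalar output."""
    return np.maximum(W1 @ x + b1, 0.0) @ W2

perm = rng.permutation(h)                        # relabel the hidden units
same = net(x, W1[perm], b1[perm], W2[perm])      # permuted parameters, same function
```

The output is identical because the sum over hidden units does not depend on their order; continuous symmetries (e.g. rescalings between layers) extend this discrete orbit into connected curves of equal loss.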
[LG-12] Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds
链接: https://arxiv.org/abs/2505.23673
作者: Aya Kayal,Sattar Vakili,Laura Toni,Da-shan Shiu,Alberto Bernacchia
类目: Machine Learning (cs.LG)
*备注:
Abstract:Bayesian optimization (BO) with preference-based feedback has recently garnered significant attention due to its emerging applications. We refer to this problem as Bayesian Optimization from Human Feedback (BOHF), which differs from conventional BO by learning the best actions from a reduced feedback model, where only the preference between two actions is revealed to the learner at each time step. The objective is to identify the best action using a limited number of preference queries, typically obtained through costly human feedback. Existing work, which adopts the Bradley-Terry-Luce (BTL) feedback model, provides regret bounds for the performance of several algorithms. In this work, within the same framework we develop tighter performance guarantees. Specifically, we derive regret bounds of \tilde{\mathcal{O}}(\sqrt{\Gamma(T)T}), where \Gamma(T) represents the maximum information gain (a kernel-specific complexity term) and T is the number of queries. Our results significantly improve upon existing bounds. Notably, for common kernels, we show that the order-optimal sample complexities of conventional BO (achieved with richer feedback models) are recovered. In other words, the same number of preferential samples as scalar-valued samples is sufficient to find a nearly optimal solution.
[LG-13] AMBER: Adaptive Mesh Generation by Iterative Mesh Resolution Prediction
链接: https://arxiv.org/abs/2505.23663
作者: Niklas Freymuth,Tobias Würth,Nicolas Schreiber,Balazs Gyenes,Andreas Boltres,Johannes Mitsch,Aleksandar Taranovic,Tai Hoang,Philipp Dahlinger,Philipp Becker,Luise Kärger,Gerhard Neumann
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG)
*备注:
Abstract:The cost and accuracy of simulating complex physical systems using the Finite Element Method (FEM) scales with the resolution of the underlying mesh. Adaptive meshes improve computational efficiency by refining resolution in critical regions, but typically require task-specific heuristics or cumbersome manual design by a human expert. We propose Adaptive Meshing By Expert Reconstruction (AMBER), a supervised learning approach to mesh adaptation. Starting from a coarse mesh, AMBER iteratively predicts the sizing field, i.e., a function mapping from the geometry to the local element size of the target mesh, and uses this prediction to produce a new intermediate mesh using an out-of-the-box mesh generator. This process is enabled through a hierarchical graph neural network, and relies on data augmentation by automatically projecting expert labels onto AMBER-generated data during training. We evaluate AMBER on 2D and 3D datasets, including classical physics problems, mechanical components, and real-world industrial designs with human expert meshes. AMBER generalizes to unseen geometries and consistently outperforms multiple recent baselines, including ones using Graph and Convolutional Neural Networks, and Reinforcement Learning-based approaches.
[LG-14] Bayesian Perspective on Memorization and Reconstruction
链接: https://arxiv.org/abs/2505.23658
作者: Haim Kaplan,Yishay Mansour,Kobbi Nissim,Uri Stemmer
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:We introduce a new Bayesian perspective on the concept of data reconstruction, and leverage this viewpoint to propose a new security definition that, in certain settings, provably prevents reconstruction attacks. We use our paradigm to shed new light on one of the most notorious attacks in the privacy and memorization literature - fingerprinting code attacks (FPC). We argue that these attacks are really a form of membership inference attacks, rather than reconstruction attacks. Furthermore, we show that if the goal is solely to prevent reconstruction (but not membership inference), then in some cases the impossibility results derived from FPC no longer apply.
[LG-15] How does Transformer Learn Implicit Reasoning?
链接: https://arxiv.org/abs/2505.23653
作者: Jiaran Ye,Zijun Yao,Zhidian Huang,Liangming Pan,Jinxin Liu,Yushi Bai,Amy Xin,Liu Weichuan,Xiaoyin Che,Lei Hou,Juanzi Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly – producing correct answers without explicitly verbalizing intermediate steps – but the underlying mechanisms remain poorly understood. In this paper, we study how such implicit reasoning emerges by training transformers from scratch in a controlled symbolic environment. Our analysis reveals a three-stage developmental trajectory: early memorization, followed by in-distribution generalization, and eventually cross-distribution generalization. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures. To interpret these behaviors, we introduce two diagnostic tools: cross-query semantic patching, which identifies semantically reusable intermediate representations, and a cosine-based representational lens, which reveals that successful reasoning correlates with cosine-based clustering in hidden space. This clustering phenomenon in turn provides a coherent explanation for the behavioral dynamics observed across training, linking representational structure to reasoning capability. These findings provide new insights into the interpretability of implicit multi-hop reasoning in LLMs, helping to clarify how complex reasoning processes unfold internally and offering pathways to enhance the transparency of such models.
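The cosine-based representational lens amounts to inspecting pairwise cosine similarities of hidden states and looking for cluster structure; a minimal sketch with made-up vectors (the actual diagnostic in the paper operates on transformer activations):

```python
import numpy as np

def cosine_sim_matrix(H):
    """Pairwise cosine similarities between hidden-state row vectors."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    return Hn @ Hn.T

# Hypothetical hidden states: rows 0 and 1 nearly aligned, row 2 orthogonal to them.
H = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
S = cosine_sim_matrix(H)
```

High within-group and low across-group similarity in S is the kind of clustering the abstract links to successful implicit reasoning.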
[LG-16] Optimization-Free Diffusion Model – A Perturbation Theory Approach
链接: https://arxiv.org/abs/2505.23652
作者: Yuehaw Khoo,Mathias Oster,Yifan Peng
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 36 pages, 6 figures
Abstract:Diffusion models have emerged as a powerful framework in generative modeling, typically relying on optimizing neural networks to estimate the score function via forward SDE simulations. In this work, we propose an alternative method that is both optimization-free and forward SDE-free. By expanding the score function in a sparse subset of the eigenbasis of the backward Kolmogorov operator associated with the diffusion process, we reformulate score estimation as the solution to a linear system, avoiding iterative optimization and time-dependent sample generation. We analyze the approximation error using perturbation theory and demonstrate the effectiveness of our method on high-dimensional Boltzmann distributions and real-world datasets.
[LG-17] Continuous Chain of Thought Enables Parallel Exploration and Reasoning
链接: https://arxiv.org/abs/2505.23648
作者: Halil Alperen Gozeten,M. Emrullah Ildiz,Xuechen Zhang,Hrayr Harutyunyan,Ankit Singh Rawat,Samet Oymak
类目: Machine Learning (cs.LG)
*备注:
Abstract:Current language models generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. While this discrete sampling has achieved remarkable success, conducting chain-of-thought with continuously-valued tokens (CoT2) offers a richer and more expressive alternative. Our work examines the benefits of CoT2 through logical reasoning tasks that inherently require search capabilities and provides optimization and exploration methods for CoT2. Theoretically, we show that CoT2 allows the model to track multiple traces in parallel and quantify its benefits for inference efficiency. Notably, a one-layer transformer equipped with CoT2 can provably solve the combinatorial “subset sum problem” given sufficient embedding dimension. These insights lead to a novel and effective supervision strategy where we match the softmax outputs to the empirical token distributions of a set of target traces. Complementing this, we introduce sampling strategies that unlock policy optimization and self-improvement for CoT2. Our first strategy samples and composes K discrete tokens at each decoding step to control the level of parallelism, and reduces to standard CoT when K=1. Our second strategy relies on continuous exploration over the probability simplex. Experiments confirm that policy optimization with CoT2 indeed improves the performance of the model beyond its initial discrete or continuous supervision.
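A continuously-valued CoT2 token can be read as the expected token embedding under the model's next-token distribution, with the K-token sampling strategy recovering discrete CoT at K=1. The sketch below assumes this reading; the function names and shapes are illustrative, not from the paper:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cot2_token(logits, emb, k=None, rng=None):
    """Continuous CoT token: the expected embedding under the softmax distribution.
    With k set, sample k discrete tokens and average their embeddings instead;
    k=1 recovers ordinary discrete chain-of-thought."""
    p = softmax(logits)
    if k is None:
        return p @ emb                    # deterministic continuous token
    idx = rng.choice(len(p), size=k, p=p)
    return emb[idx].mean(axis=0)          # stochastic, controls parallelism via k

rng = np.random.default_rng(0)
vocab, dim = 5, 4
emb = rng.normal(size=(vocab, dim))
logits = rng.normal(size=vocab)
cont = cot2_token(logits, emb)            # fully continuous token
disc = cot2_token(logits, emb, k=1, rng=rng)  # one sampled discrete token
```

Because the continuous token mixes all candidate embeddings, a single forward step can carry information about several reasoning branches at once, which is the parallel-trace intuition behind CoT2.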
[LG-18] Global optimization of graph acquisition functions for neural architecture search
链接: https://arxiv.org/abs/2505.23640
作者: Yilin Xie,Shiqiang Zhang,Jixiang Qing,Ruth Misener,Calvin Tsay
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 19 pages, 6 figures, 3 tables
Abstract:Graph Bayesian optimization (BO) has shown potential as a powerful and data-efficient tool for neural architecture search (NAS). Most existing graph BO works focus on developing graph surrogate models, i.e., metrics of networks and/or different kernels to quantify the similarity between networks. However, the acquisition optimization, as a discrete optimization task over graph structures, is not well studied due to the complexity of formulating the graph search space and acquisition functions. This paper presents explicit optimization formulations for the graph input space, including properties such as reachability and shortest paths, which are later used to formulate graph kernels and the acquisition function. We theoretically prove that the proposed encoding is an equivalent representation of the graph space and provide restrictions for the NAS domain with either node or edge labels. Numerical results over several NAS benchmarks show that our method efficiently finds the optimal architecture for most cases, highlighting its efficacy.
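The reachability and shortest-path properties mentioned above can be computed from an adjacency matrix with Floyd-Warshall; a small illustrative sketch (not the paper's optimization formulation, which encodes these properties as constraints rather than computing them post hoc):

```python
import numpy as np

def shortest_paths(adj):
    """All-pairs shortest path lengths (Floyd-Warshall) from a 0/1 adjacency matrix."""
    n = adj.shape[0]
    d = np.where(adj > 0, 1.0, np.inf)   # direct edges have length 1
    np.fill_diagonal(d, 0.0)
    for k in range(n):
        # allow paths through intermediate node k
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d

# A directed chain 0 -> 1 -> 2 -> 3, e.g. a linear architecture cell.
adj = np.array([[0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 0]], dtype=float)
d = shortest_paths(adj)
reachable = np.isfinite(d)   # reachability follows directly from finite path lengths
```

Features like these give a graph kernel something smoother to compare than raw adjacency entries.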
[LG-19] MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment
链接: https://arxiv.org/abs/2505.23634
作者: John Halloran
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 27 pages, 19 figures, 4 tables
Abstract:The model context protocol (MCP) has been widely adopted as an open standard enabling the seamless integration of generative AI agents. However, recent work has shown the MCP is susceptible to retrieval-based “falsely benign” attacks (FBAs), allowing malicious system access and credential theft, but requiring that users download compromised files directly to their systems. Herein, we show that the threat model of MCP-based attacks is significantly broader than previously thought, i.e., attackers need only post malicious content online to deceive MCP agents into carrying out their attacks on unsuspecting victims’ systems. To improve alignment guardrails against such attacks, we introduce a new MCP dataset of FBAs and (truly) benign samples to explore the effectiveness of direct preference optimization (DPO) for the refusal training of large language models (LLMs). While DPO improves model guardrails against such attacks, we show that the efficacy of refusal learning varies drastically depending on the model’s original post-training alignment scheme; e.g., GRPO-based LLMs learn to refuse extremely poorly. Thus, to further improve FBA refusals, we introduce Retrieval Augmented Generation for Preference alignment (RAG-Pref), a novel preference alignment strategy based on RAG. We show that RAG-Pref significantly improves the ability of LLMs to refuse FBAs, particularly when combined with DPO alignment, thus drastically improving guardrails against MCP-based attacks.
[LG-20] Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection INTERSPEECH2025
链接: https://arxiv.org/abs/2505.23627
作者: Griffin Dietz Smith,Dianna Yee,Jennifer King Chen,Leah Findlater
类目: Machine Learning (cs.LG)
*备注: Interspeech 2025
Abstract:Identifying mistakes (i.e., miscues) made while reading aloud is commonly approached post-hoc by comparing automatic speech recognition (ASR) transcriptions to the target reading text. However, post-hoc methods perform poorly when ASR inaccurately transcribes verbatim speech. To improve on current methods for reading error annotation, we propose a novel end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. Our contributions include: first, demonstrating that incorporating reading text through prompting benefits verbatim transcription performance over fine-tuning, and second, showing that it is feasible to augment speech recognition tasks for end-to-end miscue detection. We conducted two case studies – children’s read-aloud and adult atypical speech – and found that our proposed strategies improve verbatim transcription and miscue detection compared to current state-of-the-art.
[LG-21] Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes
链接: https://arxiv.org/abs/2505.23619
作者: Neta Glazer,David Chernin,Idan Achituve,Sharon Gannot,Ethan Fetaya
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Recent advancements in Text-to-Speech (TTS) models, particularly in voice cloning, have intensified the demand for adaptable and efficient deepfake detection methods. As TTS systems continue to evolve, detection models must be able to efficiently adapt to previously unseen generation models with minimal data. This paper introduces ADD-GP, a few-shot adaptive framework based on a Gaussian Process (GP) classifier for Audio Deepfake Detection (ADD). We show how combining a powerful deep embedding model with the flexibility of Gaussian processes can achieve strong performance and adaptability. Additionally, we show this approach can also be used for personalized detection, with greater robustness to new TTS models and one-shot adaptability. To support our evaluation, a benchmark dataset is constructed for this task using new state-of-the-art voice cloning models.
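As a stand-in for the GP classifier in ADD-GP (the paper's exact model, kernel, and embeddings are not reproduced here), a few-shot GP regression on fixed embedding vectors already shows the adaptation pattern: a handful of labeled shots from a new TTS system is enough to move predictions near that cluster:

```python
import numpy as np

def rbf(X, Y, ls=1.0):
    """Squared-exponential kernel between two sets of embedding vectors."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_predict(Xtr, ytr, Xte, noise=1e-2):
    """GP posterior mean; labels 0 = bona fide, 1 = deepfake
    (a regression stand-in for the GP classifier used in the paper)."""
    K = rbf(Xtr, Xtr) + noise * np.eye(len(Xtr))
    return rbf(Xte, Xtr) @ np.linalg.solve(K, ytr)

# Toy 'embeddings': two well-separated clusters, two shots each.
Xtr = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.2]])
ytr = np.array([0.0, 0.0, 1.0, 1.0])
pred = gp_predict(Xtr, ytr, np.array([[4.9, 5.1], [0.1, 0.1]]))
```

Because the GP conditions directly on the few labeled shots, no gradient-based fine-tuning is needed to adapt to a new generator.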
[LG-22] Learning Interpretable Differentiable Logic Networks for Tabular Regression
链接: https://arxiv.org/abs/2505.23615
作者: Chang Yue,Niraj K. Jha
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural networks (NNs) achieve outstanding performance in many domains; however, their decision processes are often opaque and their inference can be computationally expensive in resource-constrained environments. We recently proposed Differentiable Logic Networks (DLNs) to address these issues for tabular classification based on relaxing discrete logic into a differentiable form, thereby enabling gradient-based learning of networks built from binary logic operations. DLNs offer interpretable reasoning and substantially lower inference cost. We extend the DLN framework to supervised tabular regression. Specifically, we redesign the final output layer to support continuous targets and unify the original two-phase training procedure into a single differentiable stage. We evaluate the resulting model on 15 public regression benchmarks, comparing it with modern neural networks and classical regression baselines. Regression DLNs match or exceed baseline accuracy while preserving interpretability and fast inference. Our results show that DLNs are a viable, cost-effective alternative for regression tasks, especially where model transparency and computational efficiency are important.
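The core idea behind a DLN, relaxing discrete logic into a differentiable form, can be sketched with a product t-norm relaxation. (The specific relaxation below is an illustrative assumption, not necessarily the one the paper uses.)

```python
# Sketch: relaxing Boolean gates so gradients can flow through them.
# Product t-norm relaxation; the paper's exact relaxation may differ.

def soft_and(a, b):
    # Equals Boolean AND at {0, 1}, smooth in between.
    return a * b

def soft_or(a, b):
    # Probabilistic sum: equals Boolean OR at {0, 1}.
    return a + b - a * b

def soft_not(a):
    return 1.0 - a

# At binary inputs the relaxations reproduce the exact truth tables:
for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        assert soft_and(a, b) == min(a, b)
        assert soft_or(a, b) == max(a, b)

# In between, outputs vary smoothly, so gradient-based learning can
# select which gates and which wirings to use:
g = soft_and(0.9, 0.5)  # 0.45
```

A network built from such gates can be trained with ordinary backpropagation and then discretized for cheap, interpretable inference.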
[LG-23] Inference-time Scaling of Diffusion Models through Classical Search
链接: https://arxiv.org/abs/2505.23614
作者: Xiangcheng Zhang,Haowei Lin,Haotian Ye,James Zou,Jianzhu Ma,Yitao Liang,Yilun Du
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Website at this https URL
Abstract:Classical search algorithms have long underpinned modern artificial intelligence. In this work, we tackle the challenge of inference-time control in diffusion models – adapting generated outputs to meet diverse test-time objectives – using principles from classical search. We propose a general framework that orchestrates local and global search to efficiently navigate the generative space. It employs a theoretically grounded local search via annealed Langevin MCMC and performs compute-efficient global exploration using breadth-first and depth-first tree search. We evaluate our approach on a range of challenging domains, including planning, offline reinforcement learning, and image generation. Across all tasks, we observe significant gains in both performance and efficiency. These results show that classical search provides a principled and practical foundation for inference-time scaling in diffusion models. Project page at this http URL.
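The interplay of local refinement and global tree search described above can be illustrated with a toy best-first sampler. Plain Gaussian proposals stand in for the paper's annealed Langevin MCMC, and a scalar verifier score stands in for the test-time objective (both are simplifying assumptions).

```python
import random

def tree_search_sampling(score, root=0.0, depth=4, branch=8, keep=3, seed=0):
    """Toy breadth-first tree search over a stochastic generator.

    Each level proposes noisy refinements of the current frontier and a
    verifier score prunes to the best candidates, mimicking how classical
    search can steer a diffusion sampler at inference time.
    """
    rng = random.Random(seed)
    frontier = [root]
    for level in range(depth):
        sigma = 1.0 / (level + 1)  # anneal the proposal noise
        children = [x + rng.gauss(0.0, sigma)
                    for x in frontier for _ in range(branch)]
        # Global step: keep only the highest-scoring candidates.
        frontier = sorted(children, key=score, reverse=True)[:keep]
    return frontier[0]

# Toy objective: samples should land near 2.0.
best = tree_search_sampling(lambda x: -(x - 2.0) ** 2)
```

More compute (larger `branch`/`depth`) buys better objective values, which is the inference-time-scaling trade-off the paper studies.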
[LG-24] The Generalized Skew Spectrum of Graphs
链接: https://arxiv.org/abs/2505.23609
作者: Armando Bellante,Martin Plávala,Alessandro Luongo
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Group Theory (math.GR); Representation Theory (math.RT)
*备注:
Abstract:This paper proposes a family of permutation-invariant graph embeddings, generalizing the Skew Spectrum of graphs of Kondor & Borgwardt (2008). Grounded in group theory and harmonic analysis, our method introduces a new class of graph invariants that are isomorphism-invariant and capable of embedding richer graph structures - including attributed graphs, multilayer graphs, and hypergraphs - which the Skew Spectrum could not handle. Our generalization further defines a family of functions that enables a trade-off between computational complexity and expressivity. By applying generalization-preserving heuristics to this family, we improve the Skew Spectrum’s expressivity at the same computational cost. We formally prove the invariance of our generalization, demonstrate its improved expressiveness through experiments, and discuss its efficient computation.
[LG-25] Data Model Design for Explainable Machine Learning-based Electricity Applications
链接: https://arxiv.org/abs/2505.23607
作者: Carolina Fortuna,Gregor Cerar,Blaz Bertalanic,Andrej Campa,Mihael Mohorcic
类目: Machine Learning (cs.LG)
*备注:
Abstract:The transition from traditional power grids to smart grids, the significant increase in the use of renewable energy sources, and soaring electricity prices have triggered a digital transformation of the energy infrastructure that enables new, data driven, applications often supported by machine learning models. However, the majority of the developed machine learning models rely on univariate data. To date, a structured study considering the role of meta-data and additional measurements that result in multivariate data is missing. In this paper we propose a taxonomy that identifies and structures various types of data related to energy applications. The taxonomy can be used to guide application specific data model development for training machine learning models. Focusing on a household electricity forecasting application, we validate the effectiveness of the proposed taxonomy in guiding the selection of the features for various types of models. As such, we study the effect of domain, contextual and behavioral features on the forecasting accuracy of four interpretable machine learning techniques across three openly available datasets. Finally, using feature importance techniques, we explain individual feature contributions to the forecasting accuracy.
[LG-26] On Transferring Transferability: Towards a Theory for Size Generalization
链接: https://arxiv.org/abs/2505.23599
作者: Eitan Levin,Yuxin Ma,Mateo Díaz,Soledad Villar
类目: Machine Learning (cs.LG); Representation Theory (math.RT); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 69 pages, 8 figures
Abstract:Many modern learning tasks require models that can take inputs of varying sizes. Consequently, dimension-independent architectures have been proposed for domains where the inputs are graphs, sets, and point clouds. Recent work on graph neural networks has explored whether a model trained on low-dimensional data can transfer its performance to higher-dimensional inputs. We extend this body of work by introducing a general framework for transferability across dimensions. We show that transferability corresponds precisely to continuity in a limit space formed by identifying small problem instances with equivalent large ones. This identification is driven by the data and the learning task. We instantiate our framework on existing architectures, and implement the necessary changes to ensure their transferability. Finally, we provide design principles for designing new transferable models. Numerical experiments support our findings.
[LG-27] LLM Performance for Code Generation on Noisy Tasks
链接: https://arxiv.org/abs/2505.23598
作者: Radzim Sendyka,Christian Cabrera,Andrei Paleyes,Diana Robinson,Neil Lawrence
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
Abstract:This paper investigates the ability of large language models (LLMs) to recognise and solve tasks which have been obfuscated beyond recognition. Focusing on competitive programming and benchmark tasks (LeetCode and MATH), we compare performance across multiple models and obfuscation methods, such as noise and redaction. We demonstrate that all evaluated LLMs can solve tasks obfuscated to a level where the text would be unintelligible to human readers and no longer contains key pieces of instruction or context. We introduce the concept of eager pattern matching to describe this behaviour, which is not observed in tasks published after the models’ knowledge cutoff date, indicating strong memorisation or overfitting to training data, rather than legitimate reasoning about the presented problem. We report empirical evidence of distinct performance decay patterns between contaminated and unseen datasets. We discuss the implications for benchmarking and evaluations of model behaviour, arguing for caution when designing experiments using standard datasets. We also propose measuring the decay of performance under obfuscation as a possible strategy for detecting dataset contamination and highlighting potential safety risks and interpretability issues for automated software systems.
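The kind of obfuscation probed here can be sketched with two simple transforms, character noise and word redaction. (These are assumed stand-ins; the paper's exact obfuscation pipeline may differ.)

```python
import random

def obfuscate(text, noise_rate=0.3, redact_rate=0.2, seed=0):
    """Toy obfuscation: randomly corrupt characters and redact whole
    words, while preserving the word-level layout of the prompt."""
    rng = random.Random(seed)
    words = []
    for word in text.split():
        if rng.random() < redact_rate:
            words.append("█" * len(word))  # redact the whole word
            continue
        chars = [rng.choice("abcdefghijklmnopqrstuvwxyz")
                 if rng.random() < noise_rate else c
                 for c in word]
        words.append("".join(chars))
    return " ".join(words)

task = "Return the sum of the two largest elements in the list"
noisy = obfuscate(task)
```

If a model still solves `noisy` at rates far above chance, the paper's "eager pattern matching" explanation (memorized benchmark items) becomes the natural suspect.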
[LG-28] Position: Federated Foundation Language Model Post-Training Should Focus on Open-Source Models
链接: https://arxiv.org/abs/2505.23593
作者: Nikita Agrawal,Simon Mertel,Ruben Mayer
类目: Machine Learning (cs.LG)
*备注:
Abstract:Post-training of foundation language models has emerged as a promising research domain in federated learning (FL) with the goal to enable privacy-preserving model improvements and adaptations to user’s downstream tasks. Recent advances in this area adopt centralized post-training approaches that build upon black-box foundation language models where there is no access to model weights and architecture details. Although the use of black-box models has been successful in centralized post-training, their blind replication in FL raises several concerns. Our position is that using black-box models in FL contradicts the core principles of federation such as data privacy and autonomy. In this position paper, we critically analyze the usage of black-box models in federated post-training, and provide a detailed account of various aspects of openness and their implications for FL.
[LG-29] Accelerated Training of Federated Learning via Second-Order Methods
链接: https://arxiv.org/abs/2505.23588
作者: Mrinmay Sen,Sidhant R Nair,C Krishna Mohan
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 17 pages, 1 figure, 4 tables, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI)
Abstract:This paper explores second-order optimization methods in Federated Learning (FL), addressing the critical challenges of slow convergence and the excessive communication rounds required to achieve optimal performance from the global model. While existing surveys in FL primarily focus on challenges related to statistical and device label heterogeneity, as well as privacy and security concerns in first-order FL methods, less attention has been given to the issue of slow model training. This slow training often leads to the need for excessive communication rounds or increased communication costs, particularly when data across clients are highly heterogeneous. In this paper, we examine various FL methods that leverage second-order optimization to accelerate the training process. We provide a comprehensive categorization of state-of-the-art second-order FL methods and compare their performance based on convergence speed, computational cost, memory usage, transmission overhead, and generalization of the global model. Our findings show the potential of incorporating Hessian curvature through second-order optimization into FL and highlight key challenges, such as the efficient utilization of Hessian and its inverse in FL. This work lays the groundwork for future research aimed at developing scalable and efficient federated optimization methods for improving the training of the global model in FL.
[LG-30] Improving Time Series Forecasting via Instance-aware Post-hoc Revision
链接: https://arxiv.org/abs/2505.23583
作者: Zhiding Liu,Mingyue Cheng,Guanhao Zhao,Jiqian Yang,Qi Liu,Enhong Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series forecasting plays a vital role in various real-world applications and has attracted significant attention in recent decades. While recent methods have achieved remarkable accuracy by incorporating advanced inductive biases and training strategies, we observe that instance-level variations remain a significant challenge. These variations–stemming from distribution shifts, missing data, and long-tail patterns–often lead to suboptimal forecasts for specific instances, even when overall performance appears strong. To address this issue, we propose a model-agnostic framework, PIR, designed to enhance forecasting performance through Post-forecasting Identification and Revision. Specifically, PIR first identifies biased forecasting instances by estimating their accuracy. Based on this, the framework revises the forecasts using contextual information, including covariates and historical time series, from both local and global perspectives in a post-processing fashion. Extensive experiments on real-world datasets with mainstream forecasting models demonstrate that PIR effectively mitigates instance-level errors and significantly improves forecasting reliability.
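The identify-then-revise loop can be sketched as below. Blending a flagged forecast with a baseline forecast is a hypothetical stand-in for the paper's revision step, which uses covariates and historical series from local and global perspectives.

```python
def pir_revise(forecasts, error_estimates, baseline, threshold=1.0, alpha=0.5):
    """Minimal sketch of Post-forecasting Identification and Revision.

    Instances whose estimated error exceeds `threshold` are flagged as
    biased and revised by blending with contextual information; here a
    simple baseline forecast stands in for the paper's context model
    (an illustrative assumption, not the actual method).
    """
    revised = []
    for y_hat, err, y_base in zip(forecasts, error_estimates, baseline):
        if err > threshold:  # identification step
            y_hat = (1 - alpha) * y_hat + alpha * y_base  # revision step
        revised.append(y_hat)
    return revised

# First instance looks accurate and is kept; the second is flagged
# and pulled toward the baseline.
out = pir_revise([10.0, 50.0], error_estimates=[0.2, 5.0],
                 baseline=[11.0, 20.0])
```

Because the loop only post-processes forecasts, it is model-agnostic, matching the framing in the abstract.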
[LG-31] BioReason : Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model
链接: https://arxiv.org/abs/2505.23579
作者: Adibvafa Fallahpour,Andrew Magnuson,Purav Gupta,Shihao Ma,Jack Naimer,Arnav Shah,Haonan Duan,Omar Ibrahim,Hani Goodarzi,Chris J. Maddison,Bo Wang
类目: Machine Learning (cs.LG)
*备注: 16 pages, 3 figures, 2 tables
Abstract:Unlocking deep, interpretable biological reasoning from complex genomic data is a major AI challenge hindering scientific discovery. Current DNA foundation models, despite strong sequence representation, struggle with multi-step reasoning and lack inherent transparent, biologically intuitive explanations. We introduce BioReason, a pioneering architecture that, for the first time, deeply integrates a DNA foundation model with a Large Language Model (LLM). This novel connection enables the LLM to directly process and reason with genomic information as a fundamental input, fostering a new form of multimodal biological understanding. BioReason’s sophisticated multi-step reasoning is developed through supervised fine-tuning and targeted reinforcement learning, guiding the system to generate logical, biologically coherent deductions. On biological reasoning benchmarks including KEGG-based disease pathway prediction - where accuracy improves from 88% to 97% - and variant effect prediction, BioReason demonstrates an average 15% performance gain over strong single-modality baselines. BioReason reasons over unseen biological entities and articulates decision-making through interpretable, step-by-step biological traces, offering a transformative approach for AI in biology that enables deeper mechanistic insights and accelerates testable hypothesis generation from genomic data. Data, code, and checkpoints are publicly available at this https URL
[LG-32] Maximum Likelihood Learning of Latent Dynamics Without Reconstruction
链接: https://arxiv.org/abs/2505.23569
作者: Samo Hromadka,Kai Biegun,Lior Fox,James Heald,Maneesh Sahani
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce a novel unsupervised learning method for time series data with latent dynamical structure: the recognition-parametrized Gaussian state space model (RP-GSSM). The RP-GSSM is a probabilistic model that learns Markovian Gaussian latents explaining statistical dependence between observations at different time steps, combining the intuition of contrastive methods with the flexible tools of probabilistic generative models. Unlike contrastive approaches, the RP-GSSM is a valid probabilistic model learned via maximum likelihood. Unlike generative approaches, the RP-GSSM has no need for an explicit network mapping from latents to observations, allowing it to focus model capacity on inference of latents. The model is both tractable and expressive: it admits exact inference thanks to its jointly Gaussian latent prior, while maintaining expressivity with an arbitrarily nonlinear neural network link between observations and latents. These qualities allow the RP-GSSM to learn task-relevant latents without ad-hoc regularization, auxiliary losses, or optimizer scheduling. We show how this approach outperforms alternatives on problems that include learning nonlinear stochastic dynamics from video, with or without background distractors. Our results position the RP-GSSM as a useful foundation model for a variety of downstream applications.
[LG-33] DRO: A Python Library for Distributionally Robust Optimization in Machine Learning
链接: https://arxiv.org/abs/2505.23565
作者: Jiashuo Liu,Tianyu Wang,Henry Lam,Hongseok Namkoong,Jose Blanchet
类目: Machine Learning (cs.LG); Mathematical Software (cs.MS); Numerical Analysis (math.NA)
*备注:
Abstract:We introduce dro, an open-source Python library for distributionally robust optimization (DRO) for regression and classification problems. The library implements 14 DRO formulations and 9 backbone models, enabling 79 distinct DRO methods. Furthermore, dro is compatible with both scikit-learn and PyTorch. Through vectorization and optimization approximation techniques, dro reduces runtime by 10x to over 1000x compared to baseline implementations on large-scale datasets. Comprehensive documentation is available at this https URL.
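Rather than guessing at the `dro` package's API, here is a minimal sketch of one classic DRO objective that several of its formulations relate to: CVaR, which replaces the average loss with the mean of the worst-case fraction of per-sample losses.

```python
def cvar_loss(losses, alpha=0.2):
    """Conditional value-at-risk of per-sample losses: the mean of the
    worst alpha-fraction. One classic distributionally robust objective
    (an illustration of the DRO idea, not the dro package's API)."""
    k = max(1, int(round(alpha * len(losses))))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k

losses = [0.1, 0.2, 0.3, 0.4, 4.0]
robust = cvar_loss(losses, alpha=0.2)  # mean of the worst 20%
avg = sum(losses) / len(losses)
```

Training against `robust` instead of `avg` forces the model to care about the hardest subpopulation, which is the point of DRO; libraries like `dro` implement many such ambiguity sets efficiently.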
[LG-34] Adaptive Federated LoRA in Heterogeneous Wireless Networks with Independent Sampling
链接: https://arxiv.org/abs/2505.23555
作者: Yanzhao Hou,Jiaxiang Geng,Boyu Li,Xiaofeng Tao,Juncheng Wang,Xiaodong Xu,Bing Luo
类目: Machine Learning (cs.LG)
*备注: 13 pages, Submitted to IEEE Journal on Selected Areas in Communications (JSAC)
Abstract:Federated LoRA has emerged as a promising technique for efficiently fine-tuning large language models (LLMs) on distributed devices by reducing the number of trainable parameters. However, existing approaches often overlook the theoretical and practical implications of system and data heterogeneity, thereby failing to optimize the overall training efficiency, particularly in terms of wall-clock time. In this paper, we propose an adaptive federated LoRA strategy with independent client sampling to minimize the convergence wall-clock time of federated fine-tuning under both computation and communication heterogeneity. We first derive a new convergence bound for federated LoRA with arbitrary and independent client sampling, notably without requiring the stringent bounded gradient assumption. Then, we introduce an adaptive bandwidth allocation scheme that accounts for heterogeneous client resources and system bandwidth constraints. Based on the derived theory, we formulate and solve a non-convex optimization problem to jointly determine the LoRA sketching ratios and sampling probabilities, aiming to minimize wall-clock convergence time. An efficient and low-complexity algorithm is developed to approximate the solution. Finally, extensive experiments demonstrate that our approach significantly reduces wall-clock training time compared to state-of-the-art methods across various models and datasets.
[LG-35] Comparing the Moore-Penrose Pseudoinverse and Gradient Descent for Solving Linear Regression Problems: A Performance Analysis
链接: https://arxiv.org/abs/2505.23552
作者: Alex Adams
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper investigates the comparative performance of two fundamental approaches to solving linear regression problems: the closed-form Moore-Penrose pseudoinverse and the iterative gradient descent method. Linear regression is a cornerstone of predictive modeling, and the choice of solver can significantly impact efficiency and accuracy. I review and discuss the theoretical underpinnings of both methods, analyze their computational complexity, and evaluate their empirical behavior on synthetic datasets with controlled characteristics, as well as on established real-world datasets. My results delineate the conditions under which each method excels in terms of computational time, numerical stability, and predictive accuracy. This work aims to provide practical guidance for researchers and practitioners in machine learning when selecting between direct, exact solutions and iterative, approximate solutions for linear regression tasks.
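The two solvers compared in the paper can be contrasted on the simplest case, a one-parameter regression y ≈ w·x, where the pseudoinverse solution reduces to a closed-form ratio. (The dataset below is made up for illustration.)

```python
def closed_form_w(xs, ys):
    """Normal-equations solution for y ≈ w·x: the 1-D special case of
    the Moore-Penrose pseudoinverse solution w = X⁺y."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def gradient_descent_w(xs, ys, lr=0.01, steps=2000):
    """Iterative alternative: full-batch gradient descent on the MSE."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x
w_exact = closed_form_w(xs, ys)  # one-shot, exact
w_gd = gradient_descent_w(xs, ys)  # iterative, approximate
```

The closed form costs one pass over the data but requires solving (here trivially, in general inverting) the normal equations; gradient descent trades exactness for per-step cost that scales gracefully to huge feature counts, provided the learning rate keeps the iteration stable, which is the trade-off the paper quantifies.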
[LG-36] Comparative assessment of fairness definitions and bias mitigation strategies in machine learning-based diagnosis of Alzheimers disease from MR images
链接: https://arxiv.org/abs/2505.23528
作者: Maria Eleftheria Vlontzou,Maria Athanasiou,Christos Davatzikos,Konstantina S. Nikita
类目: Machine Learning (cs.LG)
*备注: © 2025 IEEE Paper accepted at IEEE Engineering in Medicine and Biology Society Conference, 2025
Abstract:The present study performs a comprehensive fairness analysis of machine learning (ML) models for the diagnosis of Mild Cognitive Impairment (MCI) and Alzheimer’s disease (AD) from MRI-derived neuroimaging features. Biases associated with age, race, and gender in a multi-cohort dataset, as well as the influence of proxy features encoding these sensitive attributes, are investigated. The reliability of various fairness definitions and metrics in the identification of such biases is also assessed. Based on the most appropriate fairness measures, a comparative analysis of widely used pre-processing, in-processing, and post-processing bias mitigation strategies is performed. Moreover, a novel composite measure is introduced to quantify the trade-off between fairness and performance by considering the F1-score and the equalized odds ratio, making it appropriate for medical diagnostic applications. The obtained results reveal the existence of biases related to age and race, while no significant gender bias is observed. The deployed mitigation strategies yield varying improvements in terms of fairness across the different sensitive attributes and studied subproblems. For race and gender, Reject Option Classification improves equalized odds by 46% and 57%, respectively, and achieves harmonic mean scores of 0.75 and 0.80 in the MCI versus AD subproblem, whereas for age, in the same subproblem, adversarial debiasing yields the highest equalized odds improvement of 40% with a harmonic mean score of 0.69. Insights are provided into how variations in AD neuropathology and risk factors, associated with demographic characteristics, influence model fairness.
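Equalized odds, the fairness criterion underlying the paper's composite measure, can be computed directly from group-wise TPR and FPR. The max-gap scalarization below is one common choice and only an assumption here; the paper's composite measure additionally folds in the F1-score.

```python
def group_rates(y_true, y_pred, groups, group):
    """True-positive and false-positive rates within one group."""
    tp = fp = pos = neg = 0
    for yt, yp, g in zip(y_true, y_pred, groups):
        if g != group:
            continue
        if yt == 1:
            pos += 1
            tp += yp
        else:
            neg += 1
            fp += yp
    return tp / pos, fp / neg

def equalized_odds_gap(y_true, y_pred, groups):
    """Largest TPR/FPR disparity between two groups (0 = perfectly fair
    under equalized odds)."""
    g_a, g_b = sorted(set(groups))
    tpr_a, fpr_a = group_rates(y_true, y_pred, groups, g_a)
    tpr_b, fpr_b = group_rates(y_true, y_pred, groups, g_b)
    return max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["A"] * 4 + ["B"] * 4
gap = equalized_odds_gap(y_true, y_pred, groups)  # group B is favored
```

Post-processing mitigations such as Reject Option Classification, evaluated in the paper, work precisely by shrinking this gap at minimal cost to accuracy.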
[LG-37] Normalizing Flows are Capable Models for RL
链接: https://arxiv.org/abs/2505.23527
作者: Raj Ghugare,Benjamin Eysenbach
类目: Machine Learning (cs.LG)
*备注: Project page with code - this https URL
Abstract:Modern reinforcement learning (RL) algorithms have found success by using powerful probabilistic models, such as transformers, energy-based models, and diffusion/flow-based models. To this end, RL researchers often choose to pay the price of accommodating these models into their algorithms – diffusion models are expressive, but are computationally intensive due to their reliance on solving differential equations, while autoregressive transformer models are scalable but typically require learning discrete representations. Normalizing flows (NFs), by contrast, seem to provide an appealing alternative, as they enable likelihoods and sampling without solving differential equations or autoregressive architectures. However, their potential in RL has received limited attention, partly due to the prevailing belief that normalizing flows lack sufficient expressivity. We show that this is not the case. Building on recent work in NFs, we propose a single NF architecture which integrates seamlessly into RL algorithms, serving as a policy, Q-function, and occupancy measure. Our approach leads to much simpler algorithms, and achieves higher performance in imitation learning, offline RL, goal-conditioned RL, and unsupervised RL.
[LG-38] Accelerating AllReduce with a Persistent Straggler
链接: https://arxiv.org/abs/2505.23523
作者: Arjun Devraj,Eric Ding,Abhishek Vijaya Kumar,Robert Kleinberg,Rachee Singh
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 23 pages, 11 figures
Abstract:Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, bulk-synchronous AllReduce algorithms can be delayed by a persistent straggler that is slower to reach the synchronization barrier required to begin the collective. To address this challenge, we propose StragglAR: an AllReduce algorithm that accelerates distributed training and inference in the presence of persistent stragglers. StragglAR implements a ReduceScatter among the remaining GPUs during the straggler-induced delay, and then executes a novel collective algorithm to complete the AllReduce once the straggler reaches the synchronization barrier. StragglAR achieves a 2x theoretical speedup over popular bandwidth-efficient AllReduce algorithms (e.g., Ring) for large GPU clusters with persistent stragglers. On an 8-GPU server, our implementation of StragglAR yields a 22% speedup over state-of-the-art AllReduce algorithms.
[LG-39] AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity
链接: https://arxiv.org/abs/2505.23520
作者: Yu Zhang,Dong Guo,Fang Wu,Guoliang Zhu,Dian Ding,Yiming Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) with extended context lengths face significant computational challenges during the pre-filling phase, primarily due to the quadratic complexity of self-attention. Existing methods typically employ dynamic pattern matching and block-sparse low-level implementations. However, their reliance on local information for pattern identification fails to capture global contexts, and the coarse granularity of blocks leads to persistent internal sparsity, resulting in suboptimal accuracy and efficiency. To address these limitations, we propose AnchorAttention, a difference-aware, dynamic sparse attention mechanism that efficiently identifies critical attention regions at a finer stripe granularity while adapting to global contextual information, achieving superior speed and accuracy. AnchorAttention comprises three key components: (1) Pattern-based Anchor Computation, leveraging the commonalities present across all inputs to rapidly compute a set of near-maximum scores as the anchor; (2) Difference-aware Stripe Sparsity Identification, performing difference-aware comparisons with the anchor to quickly obtain discrete coordinates of significant regions in a stripe-like sparsity pattern; (3) Fine-grained Sparse Computation, replacing the traditional contiguous KV block loading approach with simultaneous discrete KV position loading to maximize sparsity rates while preserving full hardware computational potential. With its finer-grained sparsity strategy, AnchorAttention achieves higher sparsity rates at the same recall level, significantly reducing computation time. Compared to previous state-of-the-art methods, at a text length of 128k, it achieves a speedup of 1.44× while maintaining higher recall rates.
[LG-40] Spectrotemporal Modulation: Efficient and Interpretable Feature Representation for Classifying Speech, Music, and Environmental Sounds INTERSPEECH2025
链接: https://arxiv.org/abs/2505.23509
作者: Andrew Chang,Yike Li,Iran R. Roman,David Poeppel
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Interspeech 2025
Abstract:Audio DNNs have demonstrated impressive performance on various machine listening tasks; however, most of their representations are computationally costly and uninterpretable, leaving room for optimization. Here, we propose a novel approach centered on spectrotemporal modulation (STM) features, a signal processing method that mimics the neurophysiological representation in the human auditory cortex. The classification performance of our STM-based model, without any pretraining, is comparable to that of pretrained audio DNNs across diverse naturalistic speech, music, and environmental sounds, which are essential categories for both human cognition and machine perception. These results show that STM is an efficient and interpretable feature representation for audio classification, advancing the development of machine listening and unlocking exciting new possibilities for basic understanding of speech and auditory sciences, as well as developing audio BCI and cognitive computing.
[LG-41] Why Machine Learning Models Fail to Fully Capture Epistemic Uncertainty
链接: https://arxiv.org/abs/2505.23506
作者: Sebastián Jiménez,Mira Jürgens,Willem Waegeman
类目: Machine Learning (cs.LG)
*备注:
Abstract:In recent years various supervised learning methods that disentangle aleatoric and epistemic uncertainty based on second-order distributions have been proposed. We argue that these methods fail to capture critical components of epistemic uncertainty, particularly due to the often-neglected component of model bias. To show this, we make use of a more fine-grained taxonomy of epistemic uncertainty sources in machine learning models, and analyse how the classical bias-variance decomposition of the expected prediction error can be decomposed into different parts reflecting these uncertainties. By using a simulation-based evaluation protocol which encompasses epistemic uncertainty due to both procedural- and data-driven uncertainty components, we illustrate that current methods rarely capture the full spectrum of epistemic uncertainty. Through theoretical insights and synthetic experiments, we show that high model bias can lead to misleadingly low estimates of epistemic uncertainty, and common second-order uncertainty quantification methods systematically blur bias-induced errors into aleatoric estimates, thereby underrepresenting epistemic uncertainty. Our findings underscore that meaningful aleatoric estimates are feasible only if all relevant sources of epistemic uncertainty are properly represented.
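The paper's central claim, that variance-based epistemic estimates miss bias-induced error, can be reproduced with a small Monte-Carlo bias-variance decomposition. The mean-estimation setup below is a toy illustration, not the paper's simulation protocol.

```python
import random

def simulate(estimator, true_mean=2.0, n=30, trials=4000, seed=0):
    """Monte-Carlo estimate of the bias²/variance decomposition of an
    estimator of a Gaussian mean, across repeated draws of the dataset."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        estimates.append(estimator(sample))
    mean_est = sum(estimates) / trials
    bias_sq = (mean_est - true_mean) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / trials
    return bias_sq, variance

# Unbiased sample mean vs. a shrunk (deliberately biased) estimator:
bias_u, var_u = simulate(lambda s: sum(s) / len(s))
bias_b, var_b = simulate(lambda s: 0.5 * sum(s) / len(s))
```

The shrunk estimator has *lower* variance than the unbiased one while carrying a large bias² term, so any uncertainty estimate built only from the spread of predictions would report the biased model as the more certain one, exactly the failure mode the abstract describes.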
[LG-42] Epistemic Errors of Imperfect Multitask Learners When Distributions Shift
链接: https://arxiv.org/abs/2505.23496
作者: Sabina J. Sloman,Michele Caprio,Samuel Kaski
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:When data are noisy, a statistical learner’s goal is to resolve epistemic uncertainty about the data it will encounter at test-time, i.e., to identify the distribution of test (target) data. Many real-world learning settings introduce sources of epistemic uncertainty that can not be resolved on the basis of training (source) data alone: The source data may arise from multiple tasks (multitask learning), the target data may differ systematically from the source data tasks (distribution shift), and/or the learner may not arrive at an accurate characterization of the source data (imperfect learning). We introduce a principled definition of epistemic error, and provide a generic, decompositional epistemic error bound. Our error bound is the first to (i) consider epistemic error specifically, (ii) accommodate all the sources of epistemic uncertainty above, and (iii) separately attribute the error to each of multiple aspects of the learning procedure and environment. As corollaries of the generic result, we provide (i) epistemic error bounds specialized to the settings of Bayesian transfer learning and distribution shift within ε-neighborhoods, and (ii) a set of corresponding generalization bounds. Finally, we provide a novel definition of negative transfer, and validate its insights in a synthetic experimental setting.
[LG-43] SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training
链接: https://arxiv.org/abs/2505.23489
作者: Ildus Sadrtdinov,Ivan Klimov,Ekaterina Lobacheva,Dmitry Vetrov
类目: Machine Learning (cs.LG)
*备注: First two authors contributed equally
Abstract:We present a thermodynamic interpretation of the stationary behavior of stochastic gradient descent (SGD) under fixed learning rates (LRs) in neural network training. We show that SGD implicitly minimizes a free energy function F = U - TS, balancing training loss U and the entropy of the weights distribution S, with temperature T determined by the LR. This perspective offers a new lens on why high LRs prevent training from converging to the loss minima and how different LRs lead to stabilization at different loss levels. We empirically validate the free energy framework on both underparameterized (UP) and overparameterized (OP) models. UP models consistently follow free energy minimization, with temperature increasing monotonically with LR, while for OP models, the temperature effectively drops to zero at low LRs, causing SGD to minimize the loss directly and converge to an optimum. We attribute this mismatch to differences in the signal-to-noise ratio of stochastic gradients near optima, supported by both a toy example and neural network experiments.
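The "LR as temperature" picture can be illustrated on a noisy quadratic, where SGD's stationary expected loss grows with the learning rate. This is a toy model, not the paper's experiments.

```python
import random

def stationary_loss(lr, steps=20000, noise=1.0, seed=0):
    """SGD on the quadratic loss L(w) = w²/2 with noisy gradients
    g = w + ε.  At a fixed learning rate the iterates settle into a
    stationary distribution around the optimum; we average the loss
    after a burn-in period to estimate its stationary expectation."""
    rng = random.Random(seed)
    w, acc, burn = 1.0, 0.0, steps // 2
    for t in range(steps):
        g = w + rng.gauss(0.0, noise)  # stochastic gradient
        w -= lr * g
        if t >= burn:
            acc += 0.5 * w * w
    return acc / (steps - burn)

hot = stationary_loss(lr=0.5)    # high temperature
cold = stationary_loss(lr=0.05)  # low temperature
```

The larger LR stabilizes at a markedly higher loss level, mirroring the paper's claim that the LR sets the temperature T and hence how far above the minimum of U the entropy term TS holds the system.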
[LG-44] Refining Labeling Functions with Limited Labeled Data
链接: https://arxiv.org/abs/2505.23470
作者: Chenjie Li,Amir Gilad,Boris Glavic,Zhengjie Miao,Sudeepa Roy
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: techreport
Abstract:Programmatic weak supervision (PWS) significantly reduces human effort for labeling data by combining the outputs of user-provided labeling functions (LFs) on unlabeled datapoints. However, the quality of the generated labels depends directly on the accuracy of the LFs. In this work, we study the problem of fixing LFs based on a small set of labeled examples. Towards this goal, we develop novel techniques for repairing a set of LFs by minimally changing their results on the labeled examples such that the fixed LFs ensure that (i) there is sufficient evidence for the correct label of each labeled datapoint and (ii) the accuracy of each repaired LF is sufficiently high. We model LFs as conditional rules which enables us to refine them, i.e., to selectively change their output for some inputs. We demonstrate experimentally that our system improves the quality of LFs based on surprisingly small sets of labeled datapoints.
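论文将标注函数(LF)建模为可精化的条件规则。下面用一个一维阈值规则的修复过程做极简示意(阈值扫描与 0.8 的目标准确率均为本文假设;论文处理的是一般条件规则,并带有"标注证据充分"的约束):

```python
def repair_lf(threshold, labeled, min_acc=0.8):
    """Repair a labeling function modeled as the 1-D rule "x >= threshold":
    scan candidate thresholds drawn from the labeled data and keep the one
    closest to the original that reaches the target accuracy."""
    def acc(t):
        return sum((x >= t) == y for x, y in labeled) / len(labeled)
    if acc(threshold) >= min_acc:
        return threshold                       # no repair needed
    ok = [t for t in sorted({x for x, _ in labeled}) if acc(t) >= min_acc]
    return min(ok, key=lambda t: abs(t - threshold)) if ok else threshold

labeled = [(0.2, False), (0.4, False), (0.6, True), (0.9, True)]
print(repair_lf(0.95, labeled))  # → 0.6 (cutoff lowered so positives pass)
```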
[LG-45] On Global Convergence Rates for Federated Policy Gradient under Heterogeneous Environment
链接: https://arxiv.org/abs/2505.23459
作者: Safwan Labbi,Paul Mangold,Daniil Tiapkin,Eric Moulines
类目: Machine Learning (cs.LG)
*备注: Preprint
Abstract:Ensuring convergence of policy gradient methods in federated reinforcement learning (FRL) under environment heterogeneity remains a major challenge. In this work, we first establish that heterogeneity, perhaps counter-intuitively, can necessitate optimal policies to be non-deterministic or even time-varying, even in tabular environments. Subsequently, we prove global convergence results for federated policy gradient (FedPG) algorithms employing local updates, under a Łojasiewicz condition that holds only for each individual agent, in both entropy-regularized and non-regularized scenarios. Crucially, our theoretical analysis shows that FedPG attains linear speed-up with respect to the number of agents, a property central to efficient federated learning. Leveraging insights from our theoretical findings, we introduce b-RS-FedPG, a novel policy gradient method that employs a carefully constructed softmax-inspired parameterization coupled with an appropriate regularization scheme. We further demonstrate explicit convergence rates for b-RS-FedPG toward near-optimal stationary policies. Finally, we demonstrate that empirically both FedPG and b-RS-FedPG consistently outperform federated Q-learning on heterogeneous settings.
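FedPG 的基本轮次是"各智能体本地多步策略梯度更新、再做参数平均"。下面给出标量参数、二次替代目标下的极简示意(目标函数与步长均为演示假设,与论文的 softmax 参数化和正则化方案无关):

```python
def fedpg_round(thetas, local_grads, lr=0.1, local_steps=5):
    """One federated policy-gradient round: every agent performs several
    local gradient-ascent steps on its own heterogeneous objective, then
    the averaged parameter is broadcast back to all agents."""
    updated = []
    for theta, grad in zip(thetas, local_grads):
        for _ in range(local_steps):
            theta = theta + lr * grad(theta)   # local ascent step
        updated.append(theta)
    avg = sum(updated) / len(updated)
    return [avg] * len(updated)

# Two agents whose objectives peak at 1.0 and 3.0: averaging pulls the
# shared policy toward a compromise between heterogeneous environments.
grads = [lambda t: 1.0 - t, lambda t: 3.0 - t]
thetas = fedpg_round([0.0, 0.0], grads)
print(round(thetas[0], 5))  # → 0.81902
```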
[LG-46] Diffusion Guidance Is a Controllable Policy Improvement Operator
链接: https://arxiv.org/abs/2505.23458
作者: Kevin Frans,Seohong Park,Pieter Abbeel,Sergey Levine
类目: Machine Learning (cs.LG)
*备注:
Abstract:At the core of reinforcement learning is the idea of learning beyond the performance in the data. However, scaling such systems has proven notoriously tricky. In contrast, techniques from generative modeling have proven remarkably scalable and are simple to train. In this work, we combine these strengths, by deriving a direct relation between policy improvement and guidance of diffusion models. The resulting framework, CFGRL, is trained with the simplicity of supervised learning, yet can further improve on the policies in the data. On offline RL tasks, we observe a reliable trend – increased guidance weighting leads to increased performance. Of particular importance, CFGRL can operate without explicitly learning a value function, allowing us to generalize simple supervised methods (e.g., goal-conditioned behavioral cloning) to further prioritize optimality, gaining performance for “free” across the board.
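CFGRL 把策略改进写成扩散引导:引导权重 w 越大,策略越偏向条件(更优)分布。下面在离散动作集上示意这种加权(公式按 classifier-free guidance 的标准形式,具体数值为演示假设):

```python
def guided_policy(prior, conditioned, w):
    """Guidance-weighted policy on a discrete action set:
    pi_w(a) ∝ prior(a) * (conditioned(a) / prior(a)) ** w.
    w = 1 recovers the conditioned policy; larger w sharpens it further,
    acting as a controllable policy-improvement knob."""
    scores = {a: prior[a] * (conditioned[a] / prior[a]) ** w for a in prior}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}

prior = {"left": 0.5, "right": 0.5}
cond = {"left": 0.4, "right": 0.6}
p1 = guided_policy(prior, cond, w=1.0)
p4 = guided_policy(prior, cond, w=4.0)
print(p4["right"] > p1["right"])  # → True: more guidance, more optimality
```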
[LG-47] Strategic Classification with Non-Linear Classifiers
链接: https://arxiv.org/abs/2505.23443
作者: Benyamin Trachtenberg,Nir Rosenfeld
类目: Machine Learning (cs.LG)
*备注:
Abstract:In strategic classification, the standard supervised learning setting is extended to support the notion of strategic user behavior in the form of costly feature manipulations made in response to a classifier. While standard learning supports a broad range of model classes, the study of strategic classification has, so far, been dedicated mostly to linear classifiers. This work aims to expand the horizon by exploring how strategic behavior manifests under non-linear classifiers and what this implies for learning. We take a bottom-up approach showing how non-linearity affects decision boundary points, classifier expressivity, and model classes complexity. A key finding is that universal approximators (e.g., neural nets) are no longer universal once the environment is strategic. We demonstrate empirically how this can create performance gaps even on an unrestricted model class.
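策略性分类的基本设定是:用户会在代价预算内操纵特征以翻转分类结果。下面用一维网格搜索示意这种"最优响应"行为(阈值分类器、线性代价与 0.01 的搜索步长均为演示假设,与论文的非线性模型类分析无直接对应):

```python
def best_response(x, classify, cost, budget, step=0.01):
    """A strategic user's best response: starting from feature value x,
    move to the nearest candidate point that flips the classifier to the
    positive label within the manipulation budget; otherwise stay put."""
    if classify(x) == 1:
        return x                               # already positive: no move
    for i in range(1, round(budget / step) + 1):
        x2 = x + i * step                      # nearest candidates first
        if classify(x2) == 1 and cost(x, x2) <= budget:
            return x2
    return x

clf = lambda v: int(v >= 1.0)
cost = lambda a, b: abs(b - a)
moved = best_response(0.95, clf, cost, budget=0.2)
print(clf(moved))  # → 1: the user crossed the decision boundary
```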
[LG-48] Diversity-Aware Policy Optimization for Large Language Model Reasoning
链接: https://arxiv.org/abs/2505.23433
作者: Jian Yao,Ran Cheng,Xingyu Wu,Jibin Wu,Kay Chen Tan
类目: Machine Learning (cs.LG)
*备注:
Abstract:The reasoning capabilities of large language models (LLMs) have advanced rapidly, particularly following the release of DeepSeek R1, which has inspired a surge of research into data quality and reinforcement learning (RL) algorithms. Despite the pivotal role diversity plays in RL, its influence on LLM reasoning remains largely underexplored. To bridge this gap, this work presents a systematic investigation into the impact of diversity in RL-based training for LLM reasoning, and proposes a novel diversity-aware policy optimization method. Across evaluations on 12 LLMs, we observe a strong positive correlation between the solution diversity and Potential at k (a novel metric quantifying an LLM’s reasoning potential) in high-performing models. This finding motivates our method to explicitly promote diversity during RL training. Specifically, we design a token-level diversity and reformulate it into a practical objective, then we selectively apply it to positive samples. Integrated into the R1-zero training framework, our method achieves a 3.5 percent average improvement across four mathematical reasoning benchmarks, while generating more diverse and robust solutions.
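论文提出的 Potential@k 度量 LLM 的推理潜力,其精神与常用的 pass@k 无偏估计量接近:给定 n 个采样中 c 个正确,估计"k 次采样至少一次正确"的概率。下面实现的是标准 pass@k 估计量,仅作类比示意,并非论文的原始定义:

```python
from math import comb

def at_least_one_correct(n, c, k):
    """Unbiased estimator of P(at least one of k samples is correct),
    given c correct solutions among n i.i.d. samples:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0                             # a correct sample is inevitable
    return 1.0 - comb(n - c, k) / comb(n, k)

# More diverse solution sets tend to raise this quantity for k > 1.
print(at_least_one_correct(n=10, c=3, k=4))  # → 0.8333...
```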
[LG-49] Improved Learning via k-DTW: A Novel Dissimilarity Measure for Curves ICML2025
链接: https://arxiv.org/abs/2505.23431
作者: Amer Krivošija,Alexander Munteanu,André Nusser,Chris Schwiegelshohn
类目: Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2025
Abstract:This paper introduces k-Dynamic Time Warping (k-DTW), a novel dissimilarity measure for polygonal curves. k-DTW has stronger metric properties than Dynamic Time Warping (DTW) and is more robust to outliers than the Fréchet distance, which are the two gold standards of dissimilarity measures for polygonal curves. We show interesting properties of k-DTW and give an exact algorithm as well as a (1+\varepsilon)-approximation algorithm for k-DTW by a parametric search for the k-th largest matched distance. We prove the first dimension-free learning bounds for curves and further learning theoretic results. k-DTW not only admits smaller sample size than DTW for the problem of learning the median of curves, where some factors depending on the curves’ complexity m are replaced by k, but we also show a surprising separation on the associated Rademacher and Gaussian complexities: k-DTW admits strictly smaller bounds than DTW, by a factor \tilde{\Omega}(\sqrt{m}) when k \ll m. We complement our theoretical findings with an experimental illustration of the benefits of using k-DTW for clustering and nearest neighbor classification.
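k-DTW 建立在经典 DTW 的动态规划之上,区别在于把"对整条匹配路径求和"改为只聚合 k 个最大的匹配距离。下面给出标准 DTW 递推的纯 Python 实现作为参照(一维序列、绝对值距离为演示假设;k-DTW 本身需按论文的参数化搜索实现):

```python
def dtw(a, b):
    """Classic dynamic time warping between two 1-D sequences: the minimum,
    over all warping paths, of the summed matched distances. k-DTW changes
    the aggregation from this full sum to the k largest matched distances."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([0, 1, 2], [0, 1, 1, 2]))  # warping absorbs the repeated 1 → 0.0
```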
[LG-50] On the Validity of Head Motion Patterns as Generalisable Depression Biomarkers
链接: https://arxiv.org/abs/2505.23427
作者: Monika Gahalawat,Maneesh Bilalpur,Raul Fernandez Rojas,Jeffrey F. Cohn,Roland Goecke,Ramanathan Subramanian
类目: Machine Learning (cs.LG)
*备注:
Abstract:Depression is a debilitating mood disorder negatively impacting millions worldwide. While researchers have explored multiple verbal and non-verbal behavioural cues for automated depression assessment, head motion has received little attention thus far. Further, the common practice of validating machine learning models via a single dataset can limit model generalisability. This work examines the effectiveness and generalisability of models utilising elementary head motion units, termed kinemes, for depression severity estimation. Specifically, we consider three depression datasets from different western cultures (German: AVEC2013, Australian: Blackdog and American: Pitt datasets) with varied contextual and recording settings to investigate the generalisability of the derived kineme patterns via two methods: (i) k-fold cross-validation over individual/multiple datasets, and (ii) model reuse on other datasets. Evaluating classification and regression performance with classical machine learning methods, our results show that: (1) head motion patterns are efficient biomarkers for estimating depression severity, achieving highly competitive performance for both classification and regression tasks on a variety of datasets, including achieving the second best Mean Absolute Error (MAE) on the AVEC2013 dataset, and (2) kineme-based features are more generalisable than (a) raw head motion descriptors for binary severity classification, and (b) other visual behavioural cues for severity estimation (regression).
[LG-51] OTPTO: Joint Product Selection and Inventory Optimization in Fresh E-commerce Front-End Warehouses
链接: https://arxiv.org/abs/2505.23421
作者: Zheming Zhang,Yan Jiang,Qingshan Li,Ai Han
类目: Machine Learning (cs.LG)
*备注: 18 pages, 9 figures
Abstract:In China’s competitive fresh e-commerce market, optimizing operational strategies, especially inventory management in front-end warehouses, is key to enhancing customer satisfaction and gaining a competitive edge. Front-end warehouses are placed in residential areas to ensure timely delivery of fresh goods and are usually small. This raises the challenge of deciding which goods to stock and in what quantities, subject to capacity constraints. Traditional predict-then-optimize (PTO) methods, which predict sales and then decide on inventory, often fail to align predictions with inventory goals or to prioritize consumer satisfaction. This paper proposes a multi-task Optimize-then-Predict-then-Optimize (OTPTO) approach that jointly optimizes product selection and inventory management, aiming to increase consumer satisfaction by maximizing the full order fulfillment rate. Our method employs a 0-1 mixed integer programming model OM1 to determine historically optimal inventory levels, and then uses a product selection model PM1 and the stocking model PM2 for prediction. The combined results are further refined through a post-processing algorithm OM2. Experimental results from this http URL’s 7Fresh platform demonstrate the robustness and significant advantages of our OTPTO method. Compared to the PTO approach, our OTPTO method substantially enhances the full order fulfillment rate by 4.34% (a relative increase of 7.05%) and narrows the gap to the optimal full order fulfillment rate by 5.27%. These findings substantiate the efficacy of the OTPTO method in managing inventory at front-end warehouses of fresh e-commerce platforms and provide valuable insights for future research in this domain.
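论文用 0-1 混合整数规划(OM1)求历史最优库存。下面用暴力枚举示意其中的选品子问题:在容量约束下选择 SKU 使"整单履约"数最大(商品、订单与容量均为虚构示例;真实模型还包含备货量决策):

```python
from itertools import combinations

def best_assortment(orders, demand, capacity):
    """Brute-force product selection: choose a subset of SKUs (each stocked
    at its historical demand) under a total-unit capacity so that the number
    of fully fulfillable orders is maximized."""
    skus = list(demand)
    best = (0, frozenset())
    for r in range(len(skus) + 1):
        for subset in combinations(skus, r):
            chosen = set(subset)
            if sum(demand[s] for s in chosen) > capacity:
                continue                       # violates warehouse capacity
            filled = sum(1 for order in orders if set(order) <= chosen)
            best = max(best, (filled, frozenset(chosen)), key=lambda t: t[0])
    return best

orders = [["milk", "eggs"], ["milk"], ["fish", "eggs"], ["milk", "eggs"]]
demand = {"milk": 3, "eggs": 3, "fish": 1}
print(best_assortment(orders, demand, capacity=6))  # stocks milk + eggs
```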
[LG-52] KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
链接: https://arxiv.org/abs/2505.23416
作者: Jang-Hyun Kim,Jinuk Kim,Sangwoo Kwon,Jae W. Lee,Sangdoo Yun,Hyun Oh Song
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: preprint
Abstract:Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3-4 \times and FlashAttention decoding latency by approximately 2 \times , with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90% cache budget ratio under multi-query scenarios.
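KVzip 的核心是给每个 KV 对打重要性分(由 LLM 重建原上下文的能力导出),再逐出低分对。下面示意"按分数逐出并保持位置顺序"这一缓存压缩步骤(分数在此直接给定;真实系统中需由重建目标计算):

```python
def evict_kv(cache, importance, keep_ratio=0.25):
    """Keep only the top-scoring fraction of cached (key, value) positions,
    preserving their original order; the rest are evicted."""
    keep = max(1, int(len(cache) * keep_ratio))
    ranked = sorted(range(len(cache)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:keep])               # restore positional order
    return [cache[i] for i in kept]

cache = [("k0", "v0"), ("k1", "v1"), ("k2", "v2"), ("k3", "v3")]
importance = [0.9, 0.1, 0.7, 0.2]
print(evict_kv(cache, importance, keep_ratio=0.5))  # → [('k0', 'v0'), ('k2', 'v2')]
```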
[LG-53] Automated Modeling Method for Pathloss Model Discovery
链接: https://arxiv.org/abs/2505.23383
作者: Ahmad Anaqreh,Shih-Kai Chou,Mihael Mohorčič,Carolina Fortuna
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modeling propagation is the cornerstone for designing and optimizing next-generation wireless systems, with a particular emphasis on the 5G and beyond era. Traditional modeling methods have long relied on statistics-based techniques to characterize propagation behavior across different environments. With the expansion of wireless communication systems, there is a growing demand for methods that guarantee both the accuracy and interpretability of modeling. Artificial intelligence (AI)-based techniques, in particular, are increasingly being adopted to overcome this challenge, although interpretability is not assured with most of these methods. Inspired by recent advancements in AI, this paper proposes a novel approach that accelerates the discovery of path loss models while maintaining interpretability. The proposed method automates model formulation, evaluation, and refinement, facilitating model discovery. We evaluate two techniques: one based on Deep Symbolic Regression, offering full interpretability, and the second based on Kolmogorov-Arnold Networks, providing two levels of interpretability. Both approaches are evaluated on two synthetic and two real-world datasets. Our results show that Kolmogorov-Arnold Networks achieve R^2 values close to 1 with minimal prediction error, while Deep Symbolic Regression generates compact models with moderate accuracy. Moreover, on the selected examples, we demonstrate that automated methods outperform traditional methods, achieving up to a 75% reduction in prediction errors, offering accurate and explainable solutions with potential to increase the efficiency of discovering next-generation path loss models.
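自动发现的路径损耗模型通常以经典对数距离模型为基线:PL(d) = PL0 + 10·n·log10(d)。下面用闭式最小二乘拟合截距 PL0 与路径损耗指数 n(数据按该模型合成,仅作演示,与论文数据集无关):

```python
import math

def fit_log_distance(distances, pathloss_db):
    """Least-squares fit of PL(d) = PL0 + 10 * n * log10(d): returns the
    intercept PL0 (dB) and the path-loss exponent n."""
    xs = [10.0 * math.log10(d) for d in distances]
    mx = sum(xs) / len(xs)
    my = sum(pathloss_db) / len(pathloss_db)
    n = sum((x - mx) * (y - my) for x, y in zip(xs, pathloss_db)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - n * mx, n

# Synthetic measurements generated with PL0 = 40 dB and exponent n = 3.
d = [1, 10, 100, 1000]
pl = [40 + 30 * math.log10(x) for x in d]
pl0, n = fit_log_distance(d, pl)
print(round(pl0, 6), round(n, 6))  # → 40.0 3.0
```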
[LG-54] Meta-Learning Approaches for Speaker-Dependent Voice Fatigue Models INTERSPEECH2025
链接: https://arxiv.org/abs/2505.23378
作者: Roseline Polle,Agnes Norbury,Alexandra Livia Georgescu,Nicholas Cummins,Stefano Goria
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures. To appear at Interspeech 2025
Abstract:Speaker-dependent modelling can substantially improve performance in speech-based health monitoring applications. While mixed-effect models are commonly used for such speaker adaptation, they require computationally expensive retraining for each new observation, making them impractical in a production environment. We reformulate this task as a meta-learning problem and explore three approaches of increasing complexity: ensemble-based distance models, prototypical networks, and transformer-based sequence models. Using pre-trained speech embeddings, we evaluate these methods on a large longitudinal dataset of shift workers (N=1,185, 10,286 recordings), predicting time since sleep from speech as a function of fatigue, a symptom commonly associated with ill-health. Our results demonstrate that all meta-learning approaches tested outperformed both cross-sectional and conventional mixed-effects models, with a transformer-based method achieving the strongest performance.
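三种元学习方法中,原型网络(prototypical network)最易示意:每位说话人的原型是其支撑集嵌入的均值,查询按最近原型归类。下面给出纯 Python 草图(嵌入与标签均为虚构;论文实际使用预训练语音嵌入并预测连续目标):

```python
def prototype_predict(support, query):
    """Prototypical-network prediction: compute each class prototype as the
    mean of its support embeddings, then return the label of the prototype
    nearest (squared Euclidean) to the query embedding."""
    protos = {}
    for label, embs in support.items():
        dim = len(embs[0])
        protos[label] = [sum(e[i] for e in embs) / len(embs) for i in range(dim)]
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(protos, key=lambda lab: dist2(protos[lab], query))

support = {"rested": [[0.0, 0.1], [0.2, -0.1]],
           "fatigued": [[1.0, 1.1], [0.8, 0.9]]}
print(prototype_predict(support, [0.9, 1.0]))  # → fatigued
```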
[LG-55] Grower-in-the-Loop Interactive Reinforcement Learning for Greenhouse Climate Control
链接: https://arxiv.org/abs/2505.23355
作者: Maxiu Xiao,Jianglin Lan,Jingxing Yu,Eldert van Henten,Congcong Sun
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Climate control is crucial for greenhouse production as it directly affects crop growth and resource use. Reinforcement learning (RL) has received increasing attention in this field, but still faces challenges, including limited training efficiency and high reliance on initial learning conditions. Interactive RL, which combines human (grower) input with the RL agent’s learning, offers a potential solution to overcome these challenges. However, interactive RL has not yet been applied to greenhouse climate control and may face challenges related to imperfect inputs. Therefore, this paper aims to explore the possibility and performance of applying interactive RL with imperfect inputs into greenhouse climate control, by: (1) developing three representative interactive RL algorithms tailored for greenhouse climate control (reward shaping, policy shaping and control sharing); (2) analyzing how input characteristics are often contradicting, and how the trade-offs between them make grower’s inputs difficult to perfect; (3) proposing a neural network-based approach to enhance the robustness of interactive RL agents under limited input availability; (4) conducting a comprehensive evaluation of the three interactive RL algorithms with imperfect inputs in a simulated greenhouse environment. The demonstration shows that interactive RL incorporating imperfect grower inputs has the potential to improve the performance of the RL agent. RL algorithms that influence action selection, such as policy shaping and control sharing, perform better when dealing with imperfect inputs, achieving 8.4% and 6.8% improvement in profit, respectively. In contrast, reward shaping, an algorithm that manipulates the reward function, is sensitive to imperfect inputs and leads to a 9.4% decrease in profit. This highlights the importance of selecting an appropriate mechanism when incorporating imperfect inputs.
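三种交互式 RL 机制中,奖励塑形(reward shaping)直接把种植者反馈加到环境奖励上,这正是论文发现对不完美输入最敏感的一种。下面示意这一更新步骤(表格型 Q 值、反馈取 {-1, 0, +1} 及各系数均为演示假设):

```python
def shaped_update(q, state, action, reward, grower_hint, beta=0.5, lr=0.1):
    """Reward-shaping step for interactive RL: add weighted (possibly
    imperfect) grower feedback to the environment reward before a simple
    tabular value update. Policy shaping / control sharing would instead
    alter action selection and leave the reward untouched."""
    shaped = reward + beta * grower_hint       # grower_hint in {-1, 0, +1}
    key = (state, action)
    q[key] = q.get(key, 0.0) + lr * shaped
    return q

q = shaped_update({}, "hot_day", "open_vent", reward=1.0, grower_hint=+1)
print(round(q[("hot_day", "open_vent")], 3))  # → 0.15
```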
[LG-56] Sentinel: Scheduling Live Streams with Proactive Anomaly Detection in Crowdsourced Cloud-Edge Platforms
链接: https://arxiv.org/abs/2505.23347
作者: Yuting Li,Shaoyuan Huang,Tengwen Zhang,Cheng Zhang,Xiaofei Wang,Victor C.M. Leung
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2402.14619
Abstract:With the rapid growth of live streaming services, Crowdsourced Cloud-edge service Platforms (CCPs) are playing an increasingly important role in meeting the increasing demand. Although stream scheduling plays a critical role in optimizing CCPs’ revenue, most optimization strategies struggle to achieve practical results due to various anomalies in unstable CCPs. Additionally, the substantial scale of CCPs magnifies the difficulties of anomaly detection in time-sensitive scheduling. To tackle these challenges, this paper proposes Sentinel, a proactive anomaly detection-based scheduling framework. Sentinel models the scheduling process as a two-stage Pre-Post-Scheduling paradigm: in the pre-scheduling stage, Sentinel conducts anomaly detection and constructs a strategy pool; in the post-scheduling stage, upon request arrival, it triggers an appropriate scheduling based on a pre-generated strategy to implement the scheduling process. Extensive experiments on realistic datasets show that Sentinel significantly reduces anomaly frequency by 70%, improves revenue by 74%, and doubles the scheduling speed.
[LG-57] Graph Positional Autoencoders as Self-supervised Learners KDD2025
链接: https://arxiv.org/abs/2505.23345
作者: Yang Liu,Deyu Bo,Wenxuan Cao,Yuan Fang,Yawen Li,Chuan Shi
类目: Machine Learning (cs.LG)
*备注: 12 pages, 3 figures, Accepted at KDD 2025
Abstract:Graph self-supervised learning seeks to learn effective graph representations without relying on labeled data. Among various approaches, graph autoencoders (GAEs) have gained significant attention for their efficiency and scalability. Typically, GAEs take incomplete graphs as input and predict missing elements, such as masked nodes or edges. While effective, our experimental investigation reveals that traditional node or edge masking paradigms primarily capture low-frequency signals in the graph and fail to learn the expressive structural information. To address these issues, we propose Graph Positional Autoencoders (GraphPAE), which employs a dual-path architecture to reconstruct both node features and positions. Specifically, the feature path uses positional encoding to enhance the message-passing processing, improving GAE’s ability to predict the corrupted information. The position path, on the other hand, leverages node representations to refine positions and approximate eigenvectors, thereby enabling the encoder to learn diverse frequency information. We conduct extensive experiments to verify the effectiveness of GraphPAE, including heterophilic node classification, graph property prediction, and transfer learning. The results demonstrate that GraphPAE achieves state-of-the-art performance and consistently outperforms baselines by a large margin.
[LG-58] X2Graph for Cancer Subtyping Prediction on Biological Tabular Data
链接: https://arxiv.org/abs/2505.23334
作者: Tu Bui,Mohamed Suliman,Aparajita Haldar,Mohammed Amer,Serban Georgescu
类目: Machine Learning (cs.LG)
*备注: IEEE Engineering in Medicine and Biology Society (EMBC) 2025
Abstract:Despite the transformative impact of deep learning on text, audio, and image datasets, its dominance in tabular data, especially in the medical domain where data are often scarce, remains less clear. In this paper, we propose X2Graph, a novel deep learning method that achieves strong performance on small biological tabular datasets. X2Graph leverages external knowledge about the relationships between table columns, such as gene interactions, to convert each sample into a graph structure. This transformation enables the application of standard message passing algorithms for graph modeling. Our X2Graph method demonstrates superior performance compared to existing tree-based and deep learning methods across three cancer subtyping datasets.
[LG-59] Efficient Parameter Estimation for Bayesian Network Classifiers using Hierarchical Linear Smoothing
链接: https://arxiv.org/abs/2505.23320
作者: Connor Cooper,Geoffrey I. Webb,Daniel F. Schmidt
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 20 pages, 9 figures
Abstract:Bayesian network classifiers (BNCs) possess a number of properties desirable for a modern classifier: They are easily interpretable, highly scalable, and offer adaptable complexity. However, traditional methods for learning BNCs have historically underperformed when compared to leading classification methods such as random forests. Recent parameter smoothing techniques using hierarchical Dirichlet processes (HDPs) have enabled BNCs to achieve performance competitive with random forests on categorical data, but these techniques are relatively inflexible, and require a complicated, specialized sampling process. In this paper, we introduce a novel method for parameter estimation that uses a log-linear regression to approximate the behaviour of HDPs. As a linear model, our method is remarkably flexible and simple to interpret, and can leverage the vast literature on learning linear models. Our experiments show that our method can outperform HDP smoothing while being orders of magnitude faster, remaining competitive with random forests on categorical data.
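HDP 平滑的关键行为是让条件分布向其"父层"分布收缩。下面用一个简单的回退式收缩来示意这种层级平滑(以边际分布为父层、alpha 为收缩强度,均为演示假设;论文用对数线性回归逼近 HDP 的这类行为):

```python
def smoothed_cpt(counts, alpha=1.0):
    """Shrink each conditional distribution P(x | parent) toward the
    marginal P(x): (count(parent, x) + alpha * P(x)) / (n_parent + alpha).
    Sparse parents are pulled strongly toward the marginal; well-observed
    parents stay close to their empirical frequencies."""
    marginal = {}
    for dist in counts.values():
        for x, c in dist.items():
            marginal[x] = marginal.get(x, 0) + c
    total = sum(marginal.values())
    marginal = {x: c / total for x, c in marginal.items()}
    cpt = {}
    for parent, dist in counts.items():
        n = sum(dist.values())
        cpt[parent] = {x: (dist.get(x, 0) + alpha * marginal[x]) / (n + alpha)
                       for x in marginal}
    return cpt

counts = {"p0": {"a": 8, "b": 2}, "p1": {"a": 1, "b": 1}}
cpt = smoothed_cpt(counts)
print(round(cpt["p1"]["a"], 4))  # the sparse parent p1 is pulled toward P(a)=0.75
```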
[LG-60] MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction
链接: https://arxiv.org/abs/2505.23305
作者: Yunkee Chae,Kyogu Lee
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: 27 pages, 4 figures
Abstract:We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture generation, (2) partial generation (i.e., source imputation), and (3) text-conditioned extraction of arbitrary sources. By formulating both separation and imputation as conditional inpainting tasks in the latent space, our approach supports flexible, class-agnostic manipulation of arbitrary instrument sources. Notably, MGE-LDM can be trained jointly across heterogeneous multi-track datasets (e.g., Slakh2100, MUSDB18, MoisesDB) without relying on predefined instrument categories. Audio samples are available at our project page: this https URL.
[LG-61] Comparative Analysis of the Land Use and Land Cover Changes in Different Governorates of Oman using Spatiotemporal Multi-spectral Satellite Data
链接: https://arxiv.org/abs/2505.23285
作者: Muhammad Shafi,Syed Mohsin Bokhari
类目: Machine Learning (cs.LG)
*备注:
Abstract:Land cover and land use (LULC) changes are key applications of satellite imagery, and they have critical roles in resource management, urbanization, protection of soils and the environment, and enhancing sustainable development. The literature has heavily utilized multispectral spatiotemporal satellite data alongside advanced machine learning algorithms to monitor and predict LULC changes. This study analyzes and compares LULC changes across various governorates (provinces) of the Sultanate of Oman from 2016 to 2021 using annual time steps. For the chosen region, multispectral spatiotemporal data were acquired from the open-source Sentinel-2 satellite dataset. Supervised machine learning algorithms were used to train and classify different land covers, such as water bodies, crops, urban, etc. The constructed model was subsequently applied within the study region, allowing for an effective comparative evaluation of LULC changes within the given timeframe.
[LG-62] Efficiently Access Diffusion Fisher: Within the Outer Product Span Space
链接: https://arxiv.org/abs/2505.23264
作者: Fangyikang Wang,Hubery Yin,Shaobin Zhuang,Huminhao Zhu,Yinan Li,Lei Qian,Chao Zhang,Hanbin Zhao,Hui Qian,Chen Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent Diffusion models (DMs) advancements have explored incorporating the second-order diffusion Fisher information (DF), defined as the negative Hessian of log density, into various downstream tasks and theoretical analysis. However, current practices typically approximate the diffusion Fisher by applying auto-differentiation to the learned score network. This black-box method, though straightforward, lacks any accuracy guarantee and is time-consuming. In this paper, we show that the diffusion Fisher actually resides within a space spanned by the outer products of score and initial data. Based on the outer-product structure, we develop two efficient approximation algorithms to access the trace and matrix-vector multiplication of DF, respectively. These algorithms bypass the auto-differentiation operations with time-efficient vector-product calculations. Furthermore, we establish the approximation error bounds for the proposed algorithms. Experiments in likelihood evaluation and adjoint optimization demonstrate the superior accuracy and reduced computational cost of our proposed algorithms. Additionally, based on the novel outer-product formulation of DF, we design the first numerical verification experiment for the optimal transport property of the general PF-ODE deduced map.
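外积结构带来的直接好处是:若 M = Σᵢ wᵢ uᵢuᵢᵀ,则 trace(M) = Σᵢ wᵢ‖uᵢ‖²、Mx = Σᵢ wᵢ(uᵢᵀx)uᵢ,全程只需向量运算,无需显式构造矩阵或做自动微分。下面示意这一捷径(权重与向量为演示数据;论文中的外积因子来自 score 与初始数据):

```python
def op_trace(vectors, weights):
    """trace(sum_i w_i * u_i u_i^T) = sum_i w_i * ||u_i||^2,
    computed without ever forming the matrix."""
    return sum(w * sum(x * x for x in u) for w, u in zip(weights, vectors))

def op_matvec(vectors, weights, v):
    """(sum_i w_i * u_i u_i^T) @ v = sum_i w_i * (u_i . v) * u_i,
    again using only inner products."""
    out = [0.0] * len(v)
    for w, u in zip(weights, vectors):
        dot = sum(ui * vi for ui, vi in zip(u, v))
        for j, uj in enumerate(u):
            out[j] += w * dot * uj
    return out

u, w2 = (1.0, 2.0), (0.0, 3.0)
print(op_trace([u, w2], [2.0, 1.0]))                # → 19.0  (= 2*5 + 1*9)
print(op_matvec([u, w2], [2.0, 1.0], [1.0, 0.0]))   # → [2.0, 4.0]
```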
[LG-63] Measuring Participant Contributions in Decentralized Federated Learning
链接: https://arxiv.org/abs/2505.23246
作者: Honoka Anada,Tatsuya Kaneko,Shinya Takamaeda-Yamazaki
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning (FL) enables multiple clients to collaboratively train models without sharing their data. Measuring participant contributions in FL is crucial for incentivizing clients and ensuring transparency. While various methods have been proposed for contribution measurement, they are designed exclusively for centralized federated learning (CFL), where a central server collects and aggregates client models, along with evaluating their contributions. Meanwhile, decentralized federated learning (DFL), in which clients exchange models directly without a central server, has gained significant attention for mitigating communication bottlenecks and eliminating a single point of failure. However, applying existing contribution measurement methods to DFL is challenging due to the presence of multiple global models and the absence of a central server. In this study, we present novel methodologies for measuring participant contributions in DFL. We first propose DFL-Shapley, an extension of the Shapley value tailored for DFL, adapting this widely used CFL metric to decentralized settings. Given the impracticality of computing the ideal DFL-Shapley in real-world systems, we introduce DFL-MR, a computable approximation that estimates overall contributions by accumulating round-wise Shapley values. We evaluate DFL-Shapley and DFL-MR across various FL scenarios and compare them with existing CFL metrics. The experimental results confirm DFL-Shapley as a valid ground-truth metric and demonstrate DFL-MR’s proximity to DFL-Shapley across various settings, highlighting their effectiveness as contribution metrics in DFL.
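DFL-Shapley 是 Shapley 值在去中心化 FL 中的扩展。作为参照,下面给出经典 Shapley 值的精确计算:对所有加入顺序平均每个参与方的边际贡献(联盟效用函数为虚构的"联盟精度",仅作演示):

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley value: average each player's marginal contribution
    value(S ∪ {p}) - value(S) over all join orders. Exponential cost, which
    is why practical metrics (like the paper's DFL-MR) approximate it."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            phi[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    return {p: v / len(orders) for p, v in phi.items()}

# Toy utility: model accuracy achieved by each coalition of clients.
acc = {frozenset(): 0.0, frozenset("A"): 0.6,
       frozenset("B"): 0.4, frozenset("AB"): 0.9}
phi = shapley_values("AB", lambda s: acc[s])
print({k: round(v, 4) for k, v in phi.items()})  # → {'A': 0.55, 'B': 0.35}
```

注意两方贡献之和恰等于全联盟效用 0.9(Shapley 值的有效性公理)。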
[LG-64] Equivalence of stochastic and deterministic policy gradients
链接: https://arxiv.org/abs/2505.23244
作者: Emo Todorov
类目: Machine Learning (cs.LG)
*备注:
Abstract:Policy gradients in continuous control have been derived for both stochastic and deterministic policies. Here we study the relationship between the two. In a widely-used family of MDPs involving Gaussian control noise and quadratic control costs, we show that the stochastic and deterministic policy gradients, natural gradients, and state value functions are identical; while the state-control value functions are different. We then develop a general procedure for constructing an MDP with deterministic policy that is equivalent to a given MDP with stochastic policy. The controls of this new MDP are the sufficient statistics of the stochastic policy in the original MDP. Our results suggest that policy gradient methods can be unified by approximating state value functions rather than state-control value functions.
[LG-65] Graph Random Walk with Feature-Label Space Alignment: A Multi-Label Feature Selection Method
链接: https://arxiv.org/abs/2505.23228
作者: Wanfu Gao,Jun Gao,Qingqi Han,Hanlin Pan,Kunpeng Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:The rapid growth in feature dimension may introduce implicit associations between features and labels in multi-label datasets, making the relationships between features and labels increasingly complex. Moreover, existing methods often adopt low-dimensional linear decomposition to explore the associations between features and labels. However, linear decomposition struggles to capture complex nonlinear associations and may lead to misalignment between the feature space and the label space. To address these two critical challenges, we propose innovative solutions. First, we design a random walk graph that integrates feature-feature, label-label, and feature-label relationships to accurately capture nonlinear and implicit indirect associations, while optimizing the latent representations of associations between features and labels after low-rank decomposition. Second, we align the variable spaces by leveraging low-dimensional representation coefficients, while preserving the manifold structure between the original high-dimensional multi-label data and the low-dimensional representation space. Extensive experiments and ablation studies conducted on seven benchmark datasets and three representative datasets using various evaluation metrics demonstrate the superiority of the proposed method (code: this https URL).

[LG-66] Generalizability vs. Counterfactual Explainability Trade-Off
链接: https://arxiv.org/abs/2505.23225
作者: Fabiano Veglianti,Flavio Giorgi,Fabrizio Silvestri,Gabriele Tolomei
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures, plus appendix. arXiv admin note: text overlap with arXiv:2502.09193
Abstract:In this work, we investigate the relationship between model generalization and counterfactual explainability in supervised learning. We introduce the notion of \varepsilon-valid counterfactual probability (\varepsilon-VCP) – the probability of finding perturbations of a data point within its \varepsilon-neighborhood that result in a label change. We provide a theoretical analysis of \varepsilon-VCP in relation to the geometry of the model’s decision boundary, showing that \varepsilon-VCP tends to increase with model overfitting. Our findings establish a rigorous connection between poor generalization and the ease of counterfactual generation, revealing an inherent trade-off between generalization and counterfactual explainability. Empirical results validate our theory, suggesting \varepsilon-VCP as a practical proxy for quantitatively characterizing overfitting.
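The ε-VCP quantity lends itself to a direct Monte Carlo estimate: sample perturbations uniformly from the ε-ball and count label flips. A sketch under assumed names (the linear `predict` function is a toy stand-in for a trained model):

```python
import numpy as np

# Monte Carlo sketch of the epsilon-valid counterfactual probability
# (epsilon-VCP): the fraction of uniform perturbations inside an
# epsilon-ball around x that flip the classifier's label.

def predict(X):
    w, b = np.array([1.0, -1.0]), 0.0            # toy decision boundary x1 > x2
    return (X @ w + b > 0).astype(int)

def vcp(x, eps, n=20_000, seed=0):
    rng = np.random.default_rng(seed)
    # Uniform sampling in the eps-ball: normalized Gaussian direction,
    # radius drawn as eps * U^(1/d).
    d = rng.normal(size=(n, x.size))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    r = eps * rng.uniform(size=(n, 1)) ** (1 / x.size)
    X = x + r * d
    return np.mean(predict(X) != predict(x[None]))

x_far = np.array([3.0, 0.0])    # far from the boundary: no flips in the ball
x_near = np.array([0.05, 0.0])  # near the boundary: many flips
print(vcp(x_far, 0.5), vcp(x_near, 0.5))
```

Points closer to the decision boundary yield a higher ε-VCP, which is the geometric intuition behind using it as an overfitting proxy: more convoluted boundaries put more points near a flip.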
[LG-67] Daunce: Data Attribution through Uncertainty Estimation
链接: https://arxiv.org/abs/2505.23223
作者: Xingyuan Pan,Chenlu Ye,Joseph Melkonian,Jiaqi W. Ma,Tong Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Training data attribution (TDA) methods aim to identify which training examples most influence a model’s predictions on specific test data. By quantifying these influences, TDA supports critical applications such as data debugging, curation, and valuation. Gradient-based TDA methods rely on gradients and second-order information, limiting their applicability at scale. While recent random projection-based methods improve scalability, they often suffer from degraded attribution accuracy. Motivated by connections between uncertainty and influence functions, we introduce Daunce - a simple yet effective data attribution approach through uncertainty estimation. Our method operates by fine-tuning a collection of perturbed models and computing the covariance of per-example losses across these models as the attribution score. Daunce is scalable to large language models (LLMs) and achieves more accurate attribution compared to existing TDA methods. We validate Daunce on tasks ranging from vision tasks to LLM fine-tuning, and further demonstrate its compatibility with black-box model access. Applied to OpenAI’s GPT models, our method achieves, to our knowledge, the first instance of data attribution on proprietary LLMs.
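The core recipe — fine-tune several perturbed copies of a model, then score each training example by the covariance of its loss with the test loss across the copies — can be sketched on a toy linear-regression problem (all sizes, the perturbation scale, and the training loop below are illustrative assumptions, not the paper's LLM setup):

```python
import numpy as np

# Toy sketch of covariance-based attribution: an ensemble of perturbed
# models, each briefly trained, and per-example train/test losses
# recorded per model. attribution[i] is the covariance, across models,
# between training example i's loss and the test example's loss.

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
x_test, y_test = X[0], y[0]            # test point (here, a copy of train point 0)

w0 = np.zeros(3)
train_L, test_L = [], []
for _ in range(64):                    # ensemble of perturbed models
    w = w0 + 0.1 * rng.normal(size=3)  # perturb the base model
    for _ in range(5):                 # a few gradient steps
        w -= 0.05 * 2 * X.T @ (X @ w - y) / len(y)
    train_L.append((X @ w - y) ** 2)
    test_L.append((x_test @ w - y_test) ** 2)

train_L, test_L = np.array(train_L), np.array(test_L)
attribution = np.mean(
    (train_L - train_L.mean(0)) * (test_L - test_L.mean())[:, None], axis=0)
print(attribution)
```

Training examples whose losses co-vary strongly with the test loss across the perturbed ensemble receive high scores, which is the uncertainty-based influence signal the abstract describes.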
[LG-68] Trajectory Generator Matching for Time Series
链接: https://arxiv.org/abs/2505.23215
作者: T. Jahn,J. Chemseddine,P. Hagemann,C. Wald,G. Steidl
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:Accurately modeling time-continuous stochastic processes from irregular observations remains a significant challenge. In this paper, we leverage ideas from generative modeling of image data to push the boundary of time series generation. For this, we find new generators of SDEs and jump processes, inspired by trajectory flow matching, that have the marginal distributions of the time series of interest. Specifically, we can handle discontinuities of the underlying processes by parameterizing the jump kernel densities by scaled Gaussians that allow for closed form formulas of the corresponding Kullback-Leibler divergence in the loss. Unlike most other approaches, we are able to handle irregularly sampled time series.
[LG-69] Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM
链接: https://arxiv.org/abs/2505.23207
作者: Zhaokai Sun,Li Zhang,Qing Wang,Pan Zhou,Lei Xie
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation, a critical challenge in multi-party speech processing. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks such as voice activity detection (VAD) and overlap detection. To improve acoustic representation, we explore the effectiveness of state-of-the-art self-supervised learning (SSL) models, including WavLM and wav2vec 2.0, while incorporating a speaker attention module to enrich features with frame-level speaker information. Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set, demonstrating its robustness and effectiveness in OSD.
[LG-70] Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics ICML2025
链接: https://arxiv.org/abs/2505.23194
作者: Shiwei Li,Xiandi Luo,Xing Tang,Haozhao Wang,Hao Chen,Weihong Luo,Yuhua Li,Xiuqiang He,Ruixuan Li
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2025
Abstract:Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method. In standard LoRA layers, one of the matrices, A or B, is initialized to zero, ensuring that fine-tuning starts from the pretrained model. However, there is no theoretical support for this practice. In this paper, we investigate the impact of non-zero initialization on LoRA’s fine-tuning dynamics from an infinite-width perspective. Our analysis reveals that, compared to zero initialization, simultaneously initializing A and B to non-zero values improves LoRA’s robustness to suboptimal learning rates, particularly smaller ones. Further analysis indicates that although the non-zero initialization of AB introduces random noise into the pretrained weight, it generally does not affect fine-tuning performance. In other words, fine-tuning does not need to strictly start from the pretrained model. The validity of our findings is confirmed through extensive experiments across various models and datasets. The code is available at this https URL.
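The initialization choice under study is easy to see concretely: with B = 0 the LoRA update BA vanishes, so the adapted layer starts exactly at the pretrained weight, whereas initializing both factors to small non-zero values perturbs it at step 0. A minimal numpy sketch (dimensions arbitrary):

```python
import numpy as np

# LoRA adapts a frozen weight W as W + B @ A with low-rank factors.
rng = np.random.default_rng(0)
d, r = 16, 4
W = rng.normal(size=(d, d))              # pretrained weight (frozen)

# Standard LoRA: A random, B zero -> the update is exactly zero at init.
A = rng.normal(scale=0.02, size=(r, d))
B_zero = np.zeros((d, r))
delta_zero = B_zero @ A

# Non-zero init of both factors -> a small random perturbation of W.
B_nz = rng.normal(scale=0.02, size=(d, r))
delta_nz = B_nz @ A

print(np.abs(delta_zero).max(), np.abs(delta_nz).max())
```

The abstract's claim is that the small noise `delta_nz` injected at initialization generally does not hurt fine-tuning, while improving robustness to the learning rate.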
[LG-71] DeepRTE: Pre-trained Attention-based Neural Network for Radiative Transfer
链接: https://arxiv.org/abs/2505.23190
作者: Yekun Zhu,Min Tang,Zheng Ma
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this study, we propose a novel neural network approach, termed DeepRTE, to address the steady-state Radiative Transfer Equation (RTE). The RTE is a differential-integral equation that governs the propagation of radiation through a participating medium, with applications spanning diverse domains such as neutron transport, atmospheric radiative transfer, heat transfer, and optical imaging. Our proposed DeepRTE framework leverages pre-trained attention-based neural networks to solve the RTE with high accuracy and computational efficiency. The efficacy of the proposed approach is substantiated through comprehensive numerical experiments.
[LG-72] Improving the Effective Receptive Field of Message-Passing Neural Networks ICML2025
链接: https://arxiv.org/abs/2505.23185
作者: Shahaf E. Finder,Ron Shapira Weber,Moshe Eliasof,Oren Freifeld,Eran Treister
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2025
Abstract:Message-Passing Neural Networks (MPNNs) have become a cornerstone for processing and analyzing graph-structured data. However, their effectiveness is often hindered by phenomena such as over-squashing, where long-range dependencies or interactions are inadequately captured and expressed in the MPNN output. This limitation mirrors the challenges of the Effective Receptive Field (ERF) in Convolutional Neural Networks (CNNs), where the theoretical receptive field is underutilized in practice. In this work, we show and theoretically explain the limited ERF problem in MPNNs. Furthermore, inspired by recent advances in ERF augmentation for CNNs, we propose an Interleaved Multiscale Message-Passing Neural Networks (IM-MPNN) architecture to address these problems in MPNNs. Our method incorporates a hierarchical coarsening of the graph, enabling message-passing across multiscale representations and facilitating long-range interactions without excessive depth or parameterization. Through extensive evaluations on benchmarks such as the Long-Range Graph Benchmark (LRGB), we demonstrate substantial improvements over baseline MPNNs in capturing long-range dependencies while maintaining computational efficiency.
[LG-73] Two Is Better Than One: Rotations Scale LoRAs
链接: https://arxiv.org/abs/2505.23184
作者: Hongcan Guo,Guoshun Nan,Yuan Yang,Diyang Zhang,Haotian Li,Zhican Chen,Qinchuan Zhou,Yuhan Ran,Xinye Cao,Sicong Leng,Xiaofeng Tao,Xudong Jiang
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 27 pages, 16 figures
Abstract:Scaling Low-Rank Adaptation (LoRA)-based Mixture-of-Experts (MoE) enables large language models (LLMs) to adapt efficiently to diverse tasks. However, traditional gating mechanisms that route inputs to the best experts may fundamentally hinder LLMs’ scalability, leading to poor generalization and underfitting issues. We identify that the root cause lies in the restricted expressiveness of existing weighted-sum mechanisms, both within and outside the convex cone of LoRA representations. This motivates us to propose RadarGate, a novel geometrically inspired gating method that introduces rotational operations on LoRA representations to boost expressiveness and facilitate richer feature interactions among multiple LoRAs for scalable LLMs. Specifically, we first fuse each LoRA representation to other LoRAs using a learnable component and then feed the output to a rotation matrix. This matrix involves learnable parameters that define the relative angular relationship between LoRA representations. Such a simple yet effective mechanism provides an extra degree of freedom, facilitating the learning of cross-LoRA synergies and mitigating the poor generalization and underfitting issues as the number of LoRAs grows. Extensive experiments on 6 public benchmarks across 21 tasks show the effectiveness of our RadarGate for scaling LoRAs. We also provide valuable insights, revealing that the rotations applied to each pair of representations are contrastive, encouraging closer alignment of semantically similar representations during geometrical transformation while pushing distant ones further apart. We will release our code to the community.
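The geometric intuition — granting the gate a rotational degree of freedom beyond a plain weighted sum — can be sketched in two dimensions (the fixed angle and vectors below are illustrative placeholders; RadarGate learns these parameters and operates on full LoRA representations):

```python
import numpy as np

# A weighted sum of LoRA outputs stays inside the convex cone spanned by
# the representations; rotating a representation first lets the gate
# reach directions outside that cone.

def rotate(v, angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ v

lora_a = np.array([1.0, 0.0])
lora_b = np.array([0.0, 1.0])
weights = np.array([0.5, 0.5])

plain_sum = weights[0] * lora_a + weights[1] * lora_b
rotated_sum = weights[0] * rotate(lora_a, np.pi / 4) + weights[1] * lora_b

print(plain_sum, rotated_sum)
```

With the rotation, the combined representation is no longer constrained to the fixed-direction mixture, which is the extra expressiveness the abstract attributes to the method.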
[LG-74] FSL-SAGE: Accelerating Federated Split Learning via Smashed Activation Gradient Estimation ICML2025
链接: https://arxiv.org/abs/2505.23182
作者: Srijith Nair,Michael Lin,Amirreza Talebi,Peizhong Ju,Elizabeth Bentley,Jia Liu
类目: Machine Learning (cs.LG)
*备注: 22 pages, 14 figures, Accepted at ICML 2025 as poster
Abstract:Collaborative training methods like Federated Learning (FL) and Split Learning (SL) enable distributed machine learning without sharing raw data. However, FL assumes clients can train entire models, which is infeasible for large-scale models. In contrast, while SL alleviates the client memory constraint in FL by offloading most training to the server, it increases network latency due to its sequential nature. Other methods address the conundrum by using local loss functions for parallel client-side training to improve efficiency, but they lack server feedback and potentially suffer poor accuracy. We propose FSL-SAGE (Federated Split Learning via Smashed Activation Gradient Estimation), a new federated split learning algorithm that estimates server-side gradient feedback via auxiliary models. These auxiliary models periodically adapt to emulate server behavior on local datasets. We show that FSL-SAGE achieves a convergence rate of \mathcal{O}(1/\sqrt{T}), where T is the number of communication rounds. This result matches FedAvg, while significantly reducing communication costs and client memory requirements. Our empirical results also verify that it outperforms existing state-of-the-art FSL methods, offering both communication efficiency and accuracy.
[LG-75] The Panaceas for Improving Low-Rank Decomposition in Communication-Efficient Federated Learning ICML2025
链接: https://arxiv.org/abs/2505.23176
作者: Shiwei Li,Xiandi Luo,Haozhao Wang,Xing Tang,Shijie Xu,Weihong Luo,Yuhua Li,Xiuqiang He,Ruixuan Li
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted by ICML 2025
Abstract:To improve the training efficiency of federated learning (FL), previous research has employed low-rank decomposition techniques to reduce communication overhead. In this paper, we seek to enhance the performance of these low-rank decomposition methods. Specifically, we focus on three key issues related to decomposition in FL: what to decompose, how to decompose, and how to aggregate. Subsequently, we introduce three novel techniques: Model Update Decomposition (MUD), Block-wise Kronecker Decomposition (BKD), and Aggregation-Aware Decomposition (AAD), each targeting a specific issue. These techniques are complementary and can be applied simultaneously to achieve optimal performance. Additionally, we provide a rigorous theoretical analysis to ensure the convergence of the proposed MUD. Extensive experimental results show that our approach achieves faster convergence and superior accuracy compared to relevant baseline methods. The code is available at this https URL.
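The "what to decompose" question can be illustrated with a generic low-rank sketch: factor the model *update* W − W0 rather than the weights, so each client communicates small factors instead of a full matrix (truncated SVD here is a generic stand-in; the paper's MUD/BKD/AAD techniques are more specific):

```python
import numpy as np

# Communication-efficient FL sketch: the client holds W_local = W0 + update,
# where the update happens to be (near) low rank. Sending the top-r SVD
# factors of the update costs far fewer floats than sending W_local.

rng = np.random.default_rng(0)
d, r = 64, 4
W0 = rng.normal(size=(d, d))                                 # global model
update = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))   # rank-r local update
W_local = W0 + update

U, S, Vt = np.linalg.svd(W_local - W0)                       # decompose the update
U_r, S_r, Vt_r = U[:, :r], S[:r], Vt[:r]                     # keep top-r factors
update_hat = (U_r * S_r) @ Vt_r                              # server-side reconstruction

floats_full = d * d
floats_lowrank = d * r + r + r * d
print(floats_lowrank / floats_full)                          # communication ratio
```

Because the toy update is exactly rank r, the reconstruction is lossless; in practice the truncation error and the aggregation rule are exactly where methods like AAD differ.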
[LG-76] Bigger Regularized Categorical: High-Capacity Value Functions are Efficient Multi-Task Learners
链接: https://arxiv.org/abs/2505.23150
作者: Michal Nauman,Marek Cygan,Carmelo Sferrazza,Aviral Kumar,Pieter Abbeel
类目: Machine Learning (cs.LG)
*备注: preprint
Abstract:Recent advances in language modeling and vision stem from training large models on diverse, multi-task data. This paradigm has had limited impact in value-based reinforcement learning (RL), where improvements are often driven by small models trained in a single-task context. This is because, in multi-task RL, sparse rewards and gradient conflicts make temporal-difference optimization brittle. Practical workflows for generalist policies therefore avoid online training, instead cloning expert trajectories or distilling collections of single-task policies into one agent. In this work, we show that the use of high-capacity value models trained via cross-entropy and conditioned on learnable task embeddings addresses the problem of task interference in online RL, allowing for robust and scalable multi-task training. We test our approach on 7 multi-task benchmarks with over 280 unique tasks, spanning high degree-of-freedom humanoid control and discrete vision-based RL. We find that, despite its simplicity, the proposed approach leads to state-of-the-art single and multi-task performance, as well as sample-efficient transfer to new tasks.
[LG-77] DOPPLER: Dual-Policy Learning for Device Assignment in Asynchronous Dataflow Graphs
链接: https://arxiv.org/abs/2505.23131
作者: Xinyu Yao,Daniel Bourgeois,Abhinav Jain,Yuxin Tang,Jiawen Yao,Zhimin Ding,Arlei Silva,Chris Jermaine
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 32 pages, 19 figures
Abstract:We study the problem of assigning operations in a dataflow graph to devices to minimize execution time in a work-conserving system, with emphasis on complex machine learning workloads. Prior learning-based methods often struggle due to three key limitations: (1) reliance on bulk-synchronous systems like TensorFlow, which under-utilize devices due to barrier synchronization; (2) lack of awareness of the scheduling mechanism of underlying systems when designing learning-based methods; and (3) exclusive dependence on reinforcement learning, ignoring the structure of effective heuristics designed by experts. In this paper, we propose DOPPLER, a three-stage framework for training dual-policy networks consisting of 1) a SEL policy for selecting operations and 2) a PLC policy for placing chosen operations on devices. Our experiments show that DOPPLER outperforms all baseline methods across tasks by reducing system execution time and additionally demonstrates sampling efficiency by reducing per-episode training time.
[LG-78] Learning to Incentivize in Repeated Principal-Agent Problems with Adversarial Agent Arrivals ICML2025
链接: https://arxiv.org/abs/2505.23124
作者: Junyan Liu,Arnab Maiti,Artin Tajdini,Kevin Jamieson,Lillian J. Ratliff
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: To appear at ICML 2025
Abstract:We initiate the study of a repeated principal-agent problem over a finite horizon T, where a principal sequentially interacts with K \geq 2 types of agents arriving in an adversarial order. At each round, the principal strategically chooses one of the N arms to incentivize for an arriving agent of unknown type. The agent then chooses an arm based on its own utility and the provided incentive, and the principal receives a corresponding reward. The objective is to minimize regret against the best incentive in hindsight. Without prior knowledge of agent behavior, we show that the problem becomes intractable, leading to linear regret. We analyze two key settings where sublinear regret is achievable. In the first setting, the principal knows the arm each agent type would select greedily for any given incentive. Under this setting, we propose an algorithm that achieves a regret bound of O(\min\{\sqrt{KT\log N}, K\sqrt{T}\}) and provide a matching lower bound up to a \log K factor. In the second setting, an agent’s response varies smoothly with the incentive and is governed by a Lipschitz constant L \geq 1. Under this setting, we show that there is an algorithm with a regret bound of \tilde{O}((LN)^{1/3}T^{2/3}) and establish a matching lower bound up to logarithmic factors. Finally, we extend our algorithmic results for both settings by allowing the principal to incentivize multiple arms simultaneously in each round.
[LG-79] Neural Interpretable PDEs: Harmonizing Fourier Insights with Attention for Scalable and Interpretable Physics Discovery
链接: https://arxiv.org/abs/2505.23106
作者: Ning Liu,Yue Yu
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Attention mechanisms have emerged as transformative tools in core AI domains such as natural language processing and computer vision. Yet, their largely untapped potential for modeling intricate physical systems presents a compelling frontier. Learning such systems often entails discovering operators that map between functional spaces using limited instances of function pairs – a task commonly framed as a severely ill-posed inverse PDE problem. In this work, we introduce Neural Interpretable PDEs (NIPS), a novel neural operator architecture that builds upon and enhances Nonlocal Attention Operators (NAO) in both predictive accuracy and computational efficiency. NIPS employs a linear attention mechanism to enable scalable learning and integrates a learnable kernel network that acts as a channel-independent convolution in Fourier space. As a consequence, NIPS eliminates the need to explicitly compute and store large pairwise interactions, effectively amortizing the cost of handling spatial interactions into the Fourier transform. Empirical evaluations demonstrate that NIPS consistently surpasses NAO and other baselines across diverse benchmarks, heralding a substantial leap in scalable, interpretable, and efficient physics learning. Our code and data accompanying this paper are available at this https URL.
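The channel-independent convolution in Fourier space can be sketched for a single 1-D channel (random complex weights stand in for NIPS's learnable kernel network; this is the generic spectral-convolution pattern, not the paper's full architecture):

```python
import numpy as np

# Spectral convolution sketch: transform the input to Fourier space,
# multiply a handful of low-frequency modes by learnable complex
# weights, and transform back. No pairwise spatial interactions are
# ever materialized, which is the amortization the abstract describes.

rng = np.random.default_rng(0)
n, modes = 64, 12
u = np.sin(2 * np.pi * np.arange(n) / n) + 0.1 * rng.normal(size=n)

u_hat = np.fft.rfft(u)                                   # to Fourier space
kernel = rng.normal(size=modes) + 1j * rng.normal(size=modes)  # placeholder weights

out_hat = np.zeros_like(u_hat)
out_hat[:modes] = kernel * u_hat[:modes]                 # act on low modes only
out = np.fft.irfft(out_hat, n=n)                         # back to physical space

print(out.shape)
```

Per-mode multiplication costs O(n log n) via the FFT, versus O(n²) for an explicit nonlocal kernel, which is where the claimed scalability gain over storing pairwise interactions comes from.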
[LG-80] LUMION: Fast Fault Recovery for ML Jobs Using Programmable Optical Fabrics
链接: https://arxiv.org/abs/2505.23105
作者: Abhishek Vijaya Kumar,Eric Ding,Arjun Devraj,Darius Bunandar,Rachee Singh
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:When accelerators fail in modern ML datacenters, operators migrate the affected ML training or inference jobs to entirely new racks. This approach, while preserving network performance, is highly inefficient, requiring datacenters to reserve full racks of idle accelerators for fault tolerance. In this paper, we address this resource inefficiency by introducing LUMION, a novel reconfigurable optical fabric for connecting accelerators within a datacenter rack. Instead of migrating entire ML jobs, LUMION dynamically integrates spare accelerators into ongoing workloads as failures occur, thereby maintaining consistent performance without costly migrations. We show the benefits of LUMION by building an end-to-end hardware prototype. Our experiments fine-tune Llama 3.2 and show that LUMION swaps a failed GPU with a healthy one and restarts the ML job within ~ 1 second of the failure. LUMION achieves higher inter-GPU bandwidth compared to traditional electrical racks after replacing failed accelerators with spare ones, leading to nearly 2X improvement in fine-tuning throughput.
[LG-81] Weight Spectra Induced Efficient Model Adaptation
链接: https://arxiv.org/abs/2505.23099
作者: Chongjie Si,Xuankun Yang,Muqing Liu,Yadao Wang,Xiaokang Yang,Wenbo Su,Bo Zheng,Wei Shen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large-scale foundation models have demonstrated remarkable versatility across a wide range of downstream tasks. However, fully fine-tuning these models incurs prohibitive computational costs, motivating the development of Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA, which introduces low-rank updates to pre-trained weights. Despite their empirical success, the underlying mechanisms by which PEFT modifies model parameters remain underexplored. In this work, we present a systematic investigation into the structural changes of weight matrices during full fine-tuning. Through singular value decomposition (SVD), we reveal that fine-tuning predominantly amplifies the top singular values while leaving the remainder largely intact, suggesting that task-specific knowledge is injected into a low-dimensional subspace. Furthermore, we find that the dominant singular vectors are reoriented in task-specific directions, whereas the non-dominant subspace remains stable. Building on these insights, we propose a novel method that leverages learnable rescaling of top singular directions, enabling precise modulation of the most influential components without disrupting the global structure. Our approach achieves consistent improvements over strong baselines across multiple tasks, highlighting the efficacy of structurally informed fine-tuning.
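The proposed mechanism — rescale only the top singular directions of a pretrained weight — reduces to a few lines of linear algebra (the scale values below are illustrative placeholders for the method's learned parameters):

```python
import numpy as np

# Decompose W, apply learnable scales to the top-k singular values only,
# and reconstruct. The non-dominant subspace is left untouched, matching
# the abstract's observation that fine-tuning mostly moves top directions.

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))
k = 4

U, S, Vt = np.linalg.svd(W)
scales = np.ones_like(S)
scales[:k] = np.array([1.3, 1.1, 0.9, 1.2])   # learnable in the actual method
W_adapted = (U * (scales * S)) @ Vt           # U @ diag(scales * S) @ Vt

print(S[:k])
```

Only k parameters are trained per matrix in this sketch, illustrating why modulating the dominant subspace can be both parameter-efficient and minimally disruptive to the global weight structure.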
[LG-82] Learning to Search for Vehicle Routing with Multiple Time Windows
链接: https://arxiv.org/abs/2505.23098
作者: Kuan Xu,Zhiguang Cao,Chenlong Zheng,Linong Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this study, we propose a reinforcement learning-based adaptive variable neighborhood search (RL-AVNS) method designed for effectively solving the Vehicle Routing Problem with Multiple Time Windows (VRPMTW). Unlike traditional adaptive approaches that rely solely on historical operator performance, our method integrates a reinforcement learning framework to dynamically select neighborhood operators based on real-time solution states and learned experience. We introduce a fitness metric that quantifies customers’ temporal flexibility to improve the shaking phase, and employ a transformer-based neural policy network to intelligently guide operator selection during the local search. Extensive computational experiments are conducted on realistic scenarios derived from the replenishment of unmanned vending machines, characterized by multiple clustered replenishment windows. Results demonstrate that RL-AVNS significantly outperforms traditional variable neighborhood search (VNS), adaptive VNS (AVNS), and state-of-the-art learning-based heuristics, achieving substantial improvements in solution quality and computational efficiency across various instance scales and time window complexities. Particularly notable is the algorithm’s capability to generalize effectively to problem instances not encountered during training, underscoring its practical utility for complex logistics scenarios.
[LG-83] Gradient Boosting Decision Tree with LSTM for Investment Prediction
链接: https://arxiv.org/abs/2505.23084
作者: Chang Yu,Fang Liu,Jie Zhu,Shaobo Guo,Yifan Gao,Zhongheng Yang,Meiwei Liu,Qianwen Xing
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted by an IEEE conference
Abstract:This paper proposes a hybrid framework combining LSTM (Long Short-Term Memory) networks with LightGBM and CatBoost for stock price prediction. The framework processes time-series financial data and evaluates performance using seven models: Artificial Neural Networks (ANNs), Convolutional Neural Networks (CNNs), Bidirectional LSTM (BiLSTM), vanilla LSTM, XGBoost, LightGBM, and standard Neural Networks (NNs). Key metrics, including MAE, R-squared, MSE, and RMSE, are used to establish benchmarks across different time scales. Building on these benchmarks, we develop an ensemble model that combines the strengths of sequential and tree-based approaches. Experimental results show that the proposed framework improves accuracy by 10 to 15 percent compared to individual models and reduces error during market changes. This study highlights the potential of ensemble methods for financial forecasting and provides a flexible design for integrating new machine learning techniques.
[LG-84] Multi-Modal Learning with Bayesian-Oriented Gradient Calibration
链接: https://arxiv.org/abs/2505.23071
作者: Peizheng Guo,Jingyao Wang,Huijie Guo,Jiangmeng Li,Chuxiong Sun,Changwen Zheng,Wenwen Qiang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-Modal Learning (MML) integrates information from diverse modalities to improve predictive accuracy. However, existing methods mainly aggregate gradients with fixed weights and treat all dimensions equally, overlooking the intrinsic gradient uncertainty of each modality. This may lead to (i) excessive updates in sensitive dimensions, degrading performance, and (ii) insufficient updates in less sensitive dimensions, hindering learning. To address this issue, we propose BOGC-MML, a Bayesian-Oriented Gradient Calibration method for MML to explicitly model the gradient uncertainty and guide the model optimization towards the optimal direction. Specifically, we first model each modality’s gradient as a random variable and derive its probability distribution, capturing the full uncertainty in the gradient space. Then, we propose an effective method that converts the precision (inverse variance) of each gradient distribution into a scalar evidence. This evidence quantifies the confidence of each modality in every gradient dimension. Using these evidences, we explicitly quantify per-dimension uncertainties and fuse them via a reduced Dempster-Shafer rule. The resulting uncertainty-weighted aggregation produces a calibrated update direction that balances sensitivity and conservatism across dimensions. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness and advantages of the proposed method.
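The precision-to-weight step can be sketched per dimension (plain normalized-precision weighting stands in for the paper's evidence construction and reduced Dempster-Shafer fusion; the two synthetic "modalities" below are illustrative):

```python
import numpy as np

# Per-dimension uncertainty-weighted gradient fusion: each modality's
# gradient is treated as a random variable; its per-dimension precision
# (inverse variance) acts as confidence, and the fused update weights
# each modality by its normalized precision in every dimension.

rng = np.random.default_rng(0)
dim, samples = 6, 32

# Gradient samples per modality (e.g., across mini-batches).
g_vision = rng.normal(loc=1.0, scale=0.1, size=(samples, dim))   # low-variance, confident
g_audio = rng.normal(loc=-1.0, scale=2.0, size=(samples, dim))   # high-variance, noisy

def precision(g):
    return 1.0 / g.var(axis=0)        # per-dimension inverse variance

p_v, p_a = precision(g_vision), precision(g_audio)
w_v = p_v / (p_v + p_a)               # normalized confidence weights
w_a = p_a / (p_v + p_a)

fused = w_v * g_vision.mean(0) + w_a * g_audio.mean(0)
print(fused)
```

The low-variance modality dominates every dimension of the fused update, while the noisy modality is damped — the sensitivity/conservatism balance the abstract describes, here with fixed-weight aggregation as the contrast case.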
[LG-85] Loss-Guided Model Sharing and Local Learning Correction in Decentralized Federated Learning for Crop Disease Classification
链接: https://arxiv.org/abs/2505.23063
作者: Denis Mamba Kabala,Adel Hafiane,Laurent Bobelin,Raphael Canals
类目: Machine Learning (cs.LG)
*备注:
Abstract:Crop disease detection and classification is a critical challenge in agriculture, with major implications for productivity, food security, and environmental sustainability. While deep learning models such as CNN and ViT have shown excellent performance in classifying plant diseases from images, their large-scale deployment is often limited by data privacy concerns. Federated Learning (FL) addresses this issue, but centralized FL remains vulnerable to single-point failures and scalability limits. In this paper, we introduce a novel Decentralized Federated Learning (DFL) framework that uses validation loss (Loss_val) both to guide model sharing between peers and to correct local training via an adaptive loss function controlled by a weighting parameter. We conduct extensive experiments using PlantVillage datasets with three deep learning architectures (ResNet50, VGG16, and ViT_B16), analyzing the impact of the weighting parameter, the number of shared models, the number of clients, and the use of Loss_val versus Loss_train of other clients. Results demonstrate that our DFL approach not only improves accuracy and convergence speed, but also ensures better generalization and robustness across heterogeneous data environments, making it particularly well-suited for privacy-preserving agricultural applications.
[LG-86] DINGO: Constrained Inference for Diffusion LLMs
链接: https://arxiv.org/abs/2505.23061
作者: Tarun Suresh,Debangshu Banerjee,Shubham Ugare,Sasa Misailovic,Gagandeep Singh
类目: Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
*备注: DINGO, an algorithm to provably apply constraints to diffusion LLM generations
Abstract:Diffusion LLMs have emerged as a promising alternative to conventional autoregressive LLMs, offering significant potential for improved runtime efficiency. However, existing diffusion models lack the ability to provably enforce user-specified formal constraints, such as regular expressions, which makes them unreliable for tasks that require structured outputs, such as fixed-schema JSON generation. Unlike autoregressive models that generate tokens sequentially, diffusion LLMs predict a block of tokens in parallel. This parallelism makes traditional constrained decoding algorithms, which are designed for sequential token prediction, ineffective at preserving the true output distribution. To address this limitation, we propose DINGO, a dynamic programming-based constrained decoding strategy that is both efficient and provably distribution-preserving. DINGO enables sampling of output strings with the highest probability under the model’s predicted distribution, while strictly satisfying any user-specified regular expression. On standard symbolic math and JSON generation benchmarks, DINGO achieves up to a 68 percentage point improvement over unconstrained inference.
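The dynamic-programming idea — run a Viterbi-style pass over positions × DFA states so the highest-probability *accepted* string is recovered exactly — can be sketched with a two-token vocabulary and a hand-built DFA for the regex (ab)* (an illustration of the general principle, not DINGO's algorithm):

```python
import math

# Per-position token log-probs, as a parallel (diffusion-style) prediction
# might produce. The unconstrained per-position argmax is "abbb", which
# does NOT match (ab)*; the DP below finds the best accepted string.
vocab = ["a", "b"]
logp = [
    {"a": math.log(0.9), "b": math.log(0.1)},
    {"a": math.log(0.4), "b": math.log(0.6)},
    {"a": math.log(0.3), "b": math.log(0.7)},
    {"a": math.log(0.2), "b": math.log(0.8)},
]

# DFA for (ab)*: state 0 is accepting; transitions 0 -a-> 1, 1 -b-> 0.
delta = {(0, "a"): 1, (1, "b"): 0}
start, accepting = 0, {0}

# best[state] = (log-prob, string) of the best prefix reaching `state`.
best = {start: (0.0, "")}
for pos in range(len(logp)):
    nxt = {}
    for state, (score, s) in best.items():
        for tok in vocab:
            if (state, tok) in delta:            # prune disallowed tokens
                cand = (score + logp[pos][tok], s + tok)
                q = delta[(state, tok)]
                if q not in nxt or cand[0] > nxt[q][0]:
                    nxt[q] = cand
    best = nxt

score, decoded = max(best[q] for q in best if q in accepting)
print(decoded)  # "abab": the only length-4 string accepted by (ab)*
```

The DP's cost is linear in positions × DFA states × vocabulary, which is why a constrained pass can remain cheap even when the constraint rules out the greedy parallel decode.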
[LG-87] CDR-Agent: Intelligent Selection and Execution of Clinical Decision Rules Using Large Language Model Agents
链接: https://arxiv.org/abs/2505.23055
作者: Zhen Xiang,Aliyah R. Hsu,Austin V. Zane,Aaron E. Kornblith,Margaret J. Lin-Martore,Jasmanpreet C. Kaur,Vasuda M. Dokiparthi,Bo Li,Bin Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Clinical decision-making is inherently complex and fast-paced, particularly in emergency departments (EDs) where critical, rapid and high-stakes decisions are made. Clinical Decision Rules (CDRs) are standardized evidence-based tools that combine signs, symptoms, and clinical variables into decision trees to make consistent and accurate diagnoses. CDR usage is often hindered by the clinician’s cognitive load, limiting their ability to quickly recall and apply the appropriate rules. We introduce CDR-Agent, a novel LLM-based system designed to enhance ED decision-making by autonomously identifying and applying the most appropriate CDRs based on unstructured clinical notes. To validate CDR-Agent, we curated two novel ED datasets: synthetic and CDR-Bench, although CDR-Agent is applicable to non-ED clinics. CDR-Agent achieves a 56.3% (synthetic) and 8.7% (CDR-Bench) accuracy gain relative to the standalone LLM baseline in CDR selection. Moreover, CDR-Agent significantly reduces computational overhead. Using these datasets, we demonstrated that CDR-Agent not only selects relevant CDRs efficiently, but makes cautious yet effective imaging decisions by minimizing unnecessary interventions while successfully identifying most positively diagnosed cases, outperforming traditional LLM prompting approaches. Code for our work can be found at: this https URL
[LG-88] ProDiff: Prototype-Guided Diffusion for Minimal Information Trajectory Imputation
链接: https://arxiv.org/abs/2505.23048
作者: Tianci Bu,Le Zhou,Wenchuan Yang,Jianhong Mou,Kang Yang,Suoyi Tan,Feng Yao,Jingyuan Wang,Xin Lu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Trajectory data is crucial for various applications but often suffers from incompleteness due to device limitations and diverse collection scenarios. Existing imputation methods rely on sparse trajectory or travel information, such as velocity, to infer missing points. However, these approaches assume that sparse trajectories retain essential behavioral patterns, which places significant demands on data acquisition and overlooks the potential of large-scale human trajectory embeddings. To address this, we propose ProDiff, a trajectory imputation framework that uses only two endpoints as minimal information. It integrates prototype learning to embed human movement patterns and a denoising diffusion probabilistic model for robust spatiotemporal reconstruction. Joint training with a tailored loss function ensures effective imputation. ProDiff outperforms state-of-the-art methods, improving accuracy by 6.28% on FourSquare and 2.52% on WuXi. Further analysis shows a 0.927 correlation between generated and real trajectories, demonstrating the effectiveness of our approach.
[LG-89] An Empirical Study of Federated Prompt Learning for Vision Language Model
链接: https://arxiv.org/abs/2505.23024
作者: Zhihao Wang,Wenke Huang,Tian Chen,Zekun Shi,Guancheng Wan,Yu Qiao,Bin Yang,Jian Wang,Bing Li,Mang Ye
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Vision Language Model (VLM) excels in aligning vision and language representations, and prompt learning has emerged as a key technique for adapting such models to downstream tasks. However, the application of prompt learning with VLMs in federated learning (FL) scenarios remains underexplored. This paper systematically investigates the behavioral differences between language prompt learning (LPT) and vision prompt learning (VPT) under data heterogeneity challenges, including label skew and domain shift. We conduct extensive experiments to evaluate the impact of various FL and prompt configurations, such as client scale, aggregation strategies, and prompt length, to assess the robustness of Federated Prompt Learning (FPL). Furthermore, we explore strategies for enhancing prompt learning in complex scenarios where label skew and domain shift coexist, including leveraging both prompt types when computational resources allow. Our findings offer practical insights into optimizing prompt learning in federated settings, contributing to the broader deployment of VLMs in privacy-preserving environments.
[LG-90] SCORPIO: Serving the Right Requests at the Right Time for Heterogeneous SLOs in LLM Inference
链接: https://arxiv.org/abs/2505.23022
作者: Yinghao Tang,Tingfeng Lan,Xiuqi Huang,Hui Lu,Wei Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Existing Large Language Model (LLM) serving systems prioritize maximum throughput. They often neglect Service Level Objectives (SLOs) such as Time to First Token (TTFT) and Time Per Output Token (TPOT), which leads to suboptimal SLO attainment. This paper introduces SCORPIO, an SLO-oriented LLM serving system designed to maximize system goodput and SLO attainment for workloads with heterogeneous SLOs. Our core insight is to exploit SLO heterogeneity for adaptive scheduling across admission control, queue management, and batch selection. SCORPIO features a TTFT Guard, which employs least-deadline-first reordering and rejects unattainable requests, and a TPOT Guard, which utilizes a VBS-based admission control and a novel credit-based batching mechanism. Both guards are supported by a predictive module. Evaluations demonstrate that SCORPIO improves system goodput by up to 14.4X and SLO adherence by up to 46.5% compared to state-of-the-art baselines.
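SCORPIO's TTFT Guard combines least-deadline-first reordering with rejection of unattainable requests. A minimal sketch of that idea follows; the single-queue timing model, the fixed per-request prefill estimate, and the function name are simplifying assumptions, not SCORPIO's actual scheduler.

```python
def schedule_ttft(requests, now, est_prefill_time):
    """Least-deadline-first reordering with admission control: requests
    whose TTFT deadline cannot be met even in deadline order are rejected
    up front. `requests` is a list of (req_id, deadline) pairs.
    Hypothetical sketch; the real policy also models TPOT and batching."""
    ordered = sorted(requests, key=lambda r: r[1])  # earliest deadline first
    admitted, rejected = [], []
    t = now
    for req_id, deadline in ordered:
        finish = t + est_prefill_time  # when this request would finish prefill
        if finish <= deadline:
            admitted.append(req_id)
            t = finish                 # next request starts after this one
        else:
            rejected.append(req_id)    # unattainable: reject instead of queueing
    return admitted, rejected
```

For example, with prefill time 2 and deadlines 10/1/3/4, the request with deadline 1 is rejected immediately while the rest are served in deadline order.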
[LG-91] Hyperbolic-PDE GNN: Spectral Graph Neural Networks in the Perspective of A System of Hyperbolic Partial Differential Equations ICML2025
链接: https://arxiv.org/abs/2505.23014
作者: Juwei Yue,Haikuo Li,Jiawei Sheng,Xiaodong Li,Taoyu Su,Tingwen Liu,Li Guo
类目: Machine Learning (cs.LG)
*备注: 18 pages, 2 figures, published to ICML 2025
Abstract:Graph neural networks (GNNs) leverage message passing mechanisms to learn the topological features of graph data. Traditional GNNs learn node features in a spatial domain unrelated to the topology, which can hardly ensure topological features. In this paper, we formulate message passing as a system of hyperbolic partial differential equations (hyperbolic PDEs), constituting a dynamical system that explicitly maps node representations into a particular solution space. This solution space is spanned by a set of eigenvectors describing the topological structure of graphs. Within this system, at any moment in time, node features can be decomposed into a superposition of this basis of eigenvectors. This not only enhances the interpretability of message passing but also enables the explicit extraction of fundamental characteristics of the topological structure. Furthermore, by solving this system of hyperbolic partial differential equations, we establish a connection with spectral graph neural networks (spectral GNNs), serving as a message passing enhancement paradigm for spectral GNNs. We further introduce polynomials to approximate arbitrary filter functions. Extensive experiments demonstrate that the paradigm of hyperbolic PDEs not only exhibits strong flexibility but also significantly enhances the performance of various spectral GNNs across diverse graph tasks.
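The decomposition of node features onto a topology-describing eigenbasis can be illustrated with the ordinary graph Laplacian; this is a hedged sketch of the general spectral idea, not the paper's hyperbolic-PDE formulation.

```python
import numpy as np

def spectral_decompose(adj, X):
    """Decompose node features X (n x d) onto the eigenvectors of the
    graph Laplacian L = D - A, so each feature column is a superposition
    of basis vectors that describe the graph's topology."""
    deg = np.diag(adj.sum(axis=1))
    L = deg - adj
    eigvals, eigvecs = np.linalg.eigh(L)   # orthonormal basis (L symmetric)
    coeffs = eigvecs.T @ X                 # coordinates in the spectral basis
    return eigvals, eigvecs, coeffs

# Toy path graph 1-2-3; reconstruction from spectral coefficients is exact.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
vals, vecs, c = spectral_decompose(A, X)
X_rec = vecs @ c
```

Because the eigenvectors form an orthonormal basis, `vecs @ (vecs.T @ X)` recovers `X` exactly, and the smallest eigenvalue of a connected graph's Laplacian is 0.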
[LG-92] Scalable Complexity Control Facilitates Reasoning Ability of LLMs
链接: https://arxiv.org/abs/2505.23013
作者: Liangkai Hang,Junjie Yao,Zhiwei Bai,Tianyi Chen,Yang Chen,Rongjie Diao,Hezhou Li,Pengxiao Lin,Zhiwei Wang,Cheng Xu,Zhongwang Zhang,Zhangchen Zhou,Zhiyu Li,Zehao Lin,Kai Chen,Feiyu Xiong,Yaoyu Zhang,Weinan E,Hongkang Yang,Zhi-Qin John Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:The reasoning ability of large language models (LLMs) has been rapidly advancing in recent years, attracting interest in more fundamental approaches that can reliably enhance their generalizability. This work demonstrates that model complexity control, conveniently implementable by adjusting the initialization rate and weight decay coefficient, improves the scaling law of LLMs consistently over varying model sizes and data sizes. This gain is further illustrated by comparing the benchmark performance of 2.4B models pretrained on 1T tokens with different complexity hyperparameters. Instead of fixing the initialization std, we found that a constant initialization rate (the exponent of std) enables the scaling law to descend faster in both model and data sizes. These results indicate that complexity control is a promising direction for the continual advancement of LLMs.
[LG-93] EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic Expressiveness and Linguistic Challenges Using Model-as-a-Judge
链接: https://arxiv.org/abs/2505.23009
作者: Ruskin Raj Manku,Yuzhi Tang,Xingjian Shi,Mu Li,Alex Smola
类目: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
Abstract:Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on EmergentTTS, we introduce EmergentTTS-Eval, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g. URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions such as expressed emotion, prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI’s 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open source the evaluation code (this https URL) and the dataset (this https URL).
[LG-94] QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining
链接: https://arxiv.org/abs/2505.23004
作者: Kyle R. Chickering,Bangzheng Li,Muhao Chen
类目: Machine Learning (cs.LG)
*备注: 22 pages, 19 figures
Abstract:Multimodal Large Language Models (MLLMs) encode images into visual tokens, aligning visual and textual signals within a shared latent space to facilitate crossmodal representation learning. The CLIP model is a widely adopted foundational vision language model whose vision encoder has played a critical role in the development of MLLMs such as LLaVA. However, the CLIP vision encoder suffers from notable limitations, including being constrained to handling only fixed input resolutions and a failure to produce separated embeddings for dissimilar images. Replacing the vision encoder of an existing model typically incurs substantial computational costs because such a change often necessitates retraining the entire model pipeline. In this work, we identify two factors which underlie the limitations of the CLIP vision encoder: mesoscopic bias and interpolation bias. To address these issues, we propose QLIP, a drop-in replacement for CLIP that can be seamlessly integrated with existing MLLMs with only a few lines of code and can enhance both coarse-grained and fine-grained visual understanding, without re-training. QLIP is designed around an image quadtree which replaces the standard uniform grid patches with a novel content-aware patchification. Our experimental results demonstrate that QLIP improves the general visual question answering accuracy of the LLaVA v1.5 model series across various model sizes, without requiring retraining or fine-tuning of the full MLLM. Notably, QLIP boosts detailed understanding performance on the challenging V* benchmark by up to 13.6 percent.
[LG-95] LLM Agents for Bargaining with Utility-based Feedback
链接: https://arxiv.org/abs/2505.22998
作者: Jihwan Oh,Murad Aghazada,Se-Young Yun,Taehyeon Kim
类目: Machine Learning (cs.LG)
*备注: Preprint
Abstract:Bargaining, a critical aspect of real-world interactions, presents challenges for large language models (LLMs) due to limitations in strategic depth and adaptation to complex human factors. Existing benchmarks often fail to capture this real-world complexity. To address this and enhance LLM capabilities in realistic bargaining, we introduce a comprehensive framework centered on utility-based feedback. Our contributions are threefold: (1) BargainArena, a novel benchmark dataset with six intricate scenarios (e.g., deceptive practices, monopolies) to facilitate diverse strategy modeling; (2) human-aligned, economically-grounded evaluation metrics inspired by utility theory, incorporating agent utility and negotiation power, which implicitly reflect and promote opponent-aware reasoning (OAR); and (3) a structured feedback mechanism enabling LLMs to iteratively refine their bargaining strategies. This mechanism can positively collaborate with in-context learning (ICL) prompts, including those explicitly designed to foster OAR. Experimental results show that LLMs often exhibit negotiation strategies misaligned with human preferences, and that our structured feedback mechanism significantly improves their performance, yielding deeper strategic and opponent-aware reasoning.
[LG-96] Walking the Weight Manifold: a Topological Approach to Conditioning Inspired by Neuromodulation
链接: https://arxiv.org/abs/2505.22994
作者: Ari S. Benjamin,Kyle Daruwalla,Christian Pehle,Anthony M. Zador
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 17 pages, 4 figures
Abstract:One frequently wishes to learn a range of similar tasks as efficiently as possible, re-using knowledge across tasks. In artificial neural networks, this is typically accomplished by conditioning a network upon task context by injecting context as input. Brains have a different strategy: the parameters themselves are modulated as a function of various neuromodulators such as serotonin. Here, we take inspiration from neuromodulation and propose to learn weights which are smoothly parameterized functions of task context variables. Rather than optimize a weight vector, i.e. a single point in weight space, we optimize a smooth manifold in weight space with a predefined topology. To accomplish this, we derive a formal treatment of optimization of manifolds as the minimization of a loss functional subject to a constraint on volumetric movement, analogous to gradient descent. During inference, conditioning selects a single point on this manifold which serves as the effective weight matrix for a particular sub-task. This strategy for conditioning has two main advantages. First, the topology of the manifold (whether a line, circle, or torus) is a convenient lever for inductive biases about the relationship between tasks. Second, learning in one state smoothly affects the entire manifold, encouraging generalization across states. To verify this, we train manifolds with several topologies, including straight lines in weight space (for conditioning on e.g. noise level in input data) and ellipses (for rotated images). Despite their simplicity, these parameterizations outperform conditioning identical networks by input concatenation and better generalize to out-of-distribution samples. These results suggest that modulating weights over low-dimensional manifolds offers a principled and effective alternative to traditional conditioning.
[LG-97] A Computational Approach to Improving Fairness in K-means Clustering
链接: https://arxiv.org/abs/2505.22984
作者: Guancheng Zhou,Haiping Xu,Hongkang Xu,Chenyu Li,Donghui Yan
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 14 pages, 5 figures
Abstract:The popular K-means clustering algorithm potentially suffers from a major weakness for further analysis or interpretation. Some cluster may have disproportionately more (or fewer) points from one of the subpopulations in terms of some sensitive variable, e.g., gender or race. Such a fairness issue may cause bias and unexpected social consequences. This work attempts to improve the fairness of K-means clustering with a two-stage optimization formulation: clustering first, and then adjusting the cluster membership of a small subset of selected data points. Two computationally efficient algorithms are proposed for identifying those data points that are expensive for fairness, with one focusing on nearest data points outside of a cluster and the other on highly ‘mixed’ data points. Experiments on benchmark datasets show substantial improvement in fairness with minimal impact on clustering quality. The proposed algorithms can be easily extended to a broad class of clustering algorithms or fairness metrics.
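The two-stage idea (cluster first, then adjust a few memberships) can be sketched as follows. This is an illustrative greedy variant under assumed inputs (precomputed labels and centers, a binary sensitive attribute), not the paper's exact algorithms.

```python
import numpy as np

def balance_gap(labels, groups, k):
    """Max deviation of any cluster's group proportion from the global one."""
    overall = groups.mean()
    gaps = [abs(groups[labels == c].mean() - overall)
            for c in range(k) if (labels == c).any()]
    return max(gaps)

def fairness_adjust(X, labels, groups, centers, n_moves):
    """Stage two of a two-stage scheme (sketch): greedily move the cheapest
    points, i.e. those nearest a neighboring center, from the cluster that
    over-represents the sensitive group to the one that under-represents it."""
    labels = labels.copy()
    k = len(centers)
    overall = groups.mean()
    for _ in range(n_moves):
        props = np.array([groups[labels == c].mean() if (labels == c).any()
                          else overall for c in range(k)])
        src, dst = props.argmax(), props.argmin()
        if src == dst:
            break
        cand = np.where((labels == src) & (groups == 1))[0]
        if len(cand) == 0:
            break
        # cheapest candidate: smallest extra distance to the target center
        extra = (np.linalg.norm(X[cand] - centers[dst], axis=1)
                 - np.linalg.norm(X[cand] - centers[src], axis=1))
        labels[cand[extra.argmin()]] = dst
    return labels

# Toy example: two well-separated clusters with imbalanced group membership.
X = np.array([[0, 0], [0.1, 0], [0.2, 0], [0.3, 0],
              [10, 0], [10.1, 0], [10.2, 0], [10.3, 0]], float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
groups = np.array([1, 1, 1, 0, 0, 0, 0, 1], float)
centers = np.array([[0.15, 0], [10.15, 0]])
new_labels = fairness_adjust(X, labels, groups, centers, n_moves=1)
```

A single move already shrinks the balance gap while picking the point whose clustering cost increases the least.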
[LG-98] Learning coordinated badminton skills for legged manipulators
链接: https://arxiv.org/abs/2505.22974
作者: Yuntao Ma,Andrei Cramariuc,Farbod Farshidian,Marco Hutter
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Science Robotics DOI: https://doi.org/10.1126/scirobotics.adu3922
Abstract:Coordinating the motion between lower and upper limbs and aligning limb control with perception are substantial challenges in robotics, particularly in dynamic environments. To this end, we introduce an approach for enabling legged mobile manipulators to play badminton, a task that requires precise coordination of perception, locomotion, and arm swinging. We propose a unified reinforcement learning-based control policy for whole-body visuomotor skills involving all degrees of freedom to achieve effective shuttlecock tracking and striking. This policy is informed by a perception noise model that utilizes real-world camera data, allowing for consistent perception error levels between simulation and deployment and encouraging learned active perception behaviors. Our method includes a shuttlecock prediction model, constrained reinforcement learning for robust motion control, and integrated system identification techniques to enhance deployment readiness. Extensive experimental results in a variety of environments validate the robot’s capability to predict shuttlecock trajectories, navigate the service area effectively, and execute precise strikes against human players, demonstrating the feasibility of using legged mobile manipulators in complex and dynamic sports scenarios.
[LG-99] MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming
链接: https://arxiv.org/abs/2505.22967
作者: Chengqi Zheng,Jianda Chen,Yueming Lyu,Wen Zheng Terence Ng,Haopeng Zhang,Yew-Soon Ong,Ivor Tsang,Haiyan Yin
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:Despite the promise of autonomous agentic reasoning, existing workflow generation methods frequently produce fragile, unexecutable plans due to unconstrained LLM-driven construction. We introduce MermaidFlow, a framework that redefines the agentic search space through safety-constrained graph evolution. At its core, MermaidFlow represents workflows as a verifiable intermediate representation using Mermaid, a structured and human-interpretable graph language. We formulate domain-aware evolutionary operators, i.e., crossover, mutation, insertion, and deletion, to preserve semantic correctness while promoting structural diversity, enabling efficient exploration of a high-quality, statically verifiable workflow space. Without modifying task settings or evaluation protocols, MermaidFlow achieves consistent improvements in success rates and faster convergence to executable plans on the agent reasoning benchmark. The experimental results demonstrate that safety-constrained graph evolution offers a scalable, modular foundation for robust and interpretable agentic reasoning systems.
[LG-100] Directed Graph Grammars for Sequence-based Learning ICML2025
链接: https://arxiv.org/abs/2505.22949
作者: Michael Sun,Orion Foo,Gang Liu,Wojciech Matusik,Jie Chen
类目: Machine Learning (cs.LG)
*备注: ICML 2025
Abstract:Directed acyclic graphs (DAGs) are a class of graphs commonly used in practice, with examples that include electronic circuits, Bayesian networks, and neural architectures. While many effective encoders exist for DAGs, it remains challenging to decode them in a principled manner, because the nodes of a DAG can have many different topological orders. In this work, we propose a grammar-based approach to constructing a principled, compact and equivalent sequential representation of a DAG. Specifically, we view a graph as derivations over an unambiguous grammar, where the DAG corresponds to a unique sequence of production rules. Equivalently, the procedure to construct such a description can be viewed as a lossless compression of the data. Such a representation has many uses, including building a generative model for graph generation, learning a latent space for property prediction, and leveraging the sequence representational continuity for Bayesian Optimization over structured data. Code is available at this https URL.
[LG-101] Is Noise Conditioning Necessary? A Unified Theory of Unconditional Graph Diffusion Models
链接: https://arxiv.org/abs/2505.22935
作者: Jipeng Li,Yanning Shen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Explicit noise-level conditioning is widely regarded as essential for the effective operation of Graph Diffusion Models (GDMs). In this work, we challenge this assumption by investigating whether denoisers can implicitly infer noise levels directly from corrupted graph structures, potentially eliminating the need for explicit noise conditioning. To this end, we develop a theoretical framework centered on Bernoulli edge-flip corruptions and extend it to encompass more complex scenarios involving coupled structure-attribute noise. Extensive empirical evaluations on both synthetic and real-world graph datasets, using models such as GDSS and DiGress, provide strong support for our theoretical findings. Notably, unconditional GDMs achieve performance comparable or superior to their conditioned counterparts, while also offering reductions in parameters (4-6%) and computation time (8-10%). Our results suggest that the high-dimensional nature of graph data itself often encodes sufficient information for the denoising process, opening avenues for simpler, more efficient GDM architectures.
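The Bernoulli edge-flip corruption at the center of the paper's theoretical framework can be sketched in a few lines, assuming a dense symmetric 0/1 adjacency matrix:

```python
import numpy as np

def edge_flip(adj, p, rng):
    """Bernoulli edge-flip corruption on an undirected simple graph: each
    unordered pair (i, j), i < j, has its edge state flipped independently
    with probability p (illustrative sketch of the corruption process)."""
    n = adj.shape[0]
    flips = rng.random((n, n)) < p
    flips = np.triu(flips, k=1)       # decide each unordered pair once
    flips = flips | flips.T           # mirror to keep the matrix symmetric
    out = adj.copy()
    out[flips] = 1 - out[flips]
    np.fill_diagonal(out, 0)          # keep the graph simple (no self-loops)
    return out
```

At p = 0 the graph is untouched; at p = 1 every off-diagonal entry is complemented, and the output stays symmetric either way.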
[LG-102] Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference
链接: https://arxiv.org/abs/2505.22913
作者: Donghyeon Joo,Helya Hosseini,Ramyad Hadidi,Bahar Asgari
类目: Machine Learning (cs.LG)
*备注: 19 pages, 9 figures
Abstract:We demonstrate that unstructured sparsity significantly improves KV cache compression for LLMs, enabling sparsity levels up to 70% without compromising accuracy or requiring fine-tuning. We conduct a systematic exploration of pruning strategies and find per-token magnitude-based pruning to be highly effective for both Key and Value caches under unstructured sparsity, surpassing prior structured pruning schemes. The Key cache benefits from prominent outlier elements, while the Value cache surprisingly benefits from simple magnitude-based pruning despite its uniform distribution. KV cache size is the major bottleneck in decode performance due to high memory overhead for large context lengths. To address this, we use a bitmap-based sparse format and a custom attention kernel capable of compressing and directly computing over compressed caches pruned to arbitrary sparsity patterns, significantly accelerating memory-bound operations in decode computations and thereby compensating for the overhead of runtime pruning and compression. Our custom attention kernel coupled with the bitmap-based format compresses the KV cache to as little as 45% of its dense size, thereby enabling longer context lengths and up to 2.23x higher tokens/sec throughput compared to dense inference. Our pruning mechanism and sparse attention kernel are available at this https URL.
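Per-token magnitude-based pruning paired with a bitmap format can be modeled as below; this is a numpy illustration of the data layout, whereas Mustafar's contribution is a GPU kernel that computes attention directly on the packed representation.

```python
import numpy as np

def prune_per_token(cache, sparsity):
    """Within each token's Key (or Value) vector, zero the `sparsity`
    fraction of entries with the smallest magnitude, then store the
    survivors densely alongside a boolean bitmap (sketch of the format)."""
    d = cache.shape[1]
    k = int(d * (1 - sparsity))                        # entries kept per token
    keep = np.argsort(-np.abs(cache), axis=1)[:, :k]   # largest-magnitude indices
    bitmap = np.zeros(cache.shape, dtype=bool)
    np.put_along_axis(bitmap, keep, True, axis=1)
    values = cache[bitmap]                             # packed nonzeros, row-major
    return bitmap, values

def unpack(bitmap, values):
    """Reconstruct the (sparse) dense cache from bitmap + packed values."""
    out = np.zeros(bitmap.shape, dtype=values.dtype)
    out[bitmap] = values
    return out
```

At 50% sparsity the bitmap costs 1 bit per element while halving the stored values, which is where the memory savings come from.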
[LG-103] On the Dynamic Regret of Following the Regularized Leader: Optimism with History Pruning
链接: https://arxiv.org/abs/2505.22899
作者: Naram Mhaisen,George Iosifidis
类目: Machine Learning (cs.LG)
*备注:
Abstract:We revisit the Follow the Regularized Leader (FTRL) framework for Online Convex Optimization (OCO) over compact sets, focusing on achieving dynamic regret guarantees. Prior work has highlighted the framework’s limitations in dynamic environments due to its tendency to produce “lazy” iterates. However, building on insights showing FTRL’s ability to produce “agile” iterates, we show that it can indeed recover known dynamic regret bounds through optimistic composition of future costs and careful linearization of past costs, which can lead to pruning some of them. This new analysis of FTRL against dynamic comparators yields a principled way to interpolate between greedy and agile updates and offers several benefits, including refined control over regret terms, optimism without cyclic dependence, and the application of minimal recursive regularization akin to AdaFTRL. More broadly, we show that it is not the lazy projection style of FTRL that hinders (optimistic) dynamic regret, but the decoupling of the algorithm’s state (linearized history) from its iterates, allowing the state to grow arbitrarily. Instead, pruning synchronizes these two when necessary.
[LG-104] Smart Surrogate Losses for Contextual Stochastic Linear Optimization with Robust Constraints
链接: https://arxiv.org/abs/2505.22881
作者: Hyungki Im,Wyame Benslimane,Paul Grigas
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:We study an extension of contextual stochastic linear optimization (CSLO) that, in contrast to most of the existing literature, involves inequality constraints that depend on uncertain parameters predicted by a machine learning model. To handle the constraint uncertainty, we use contextual uncertainty sets constructed via methods like conformal prediction. Given a contextual uncertainty set method, we introduce the “Smart Predict-then-Optimize with Robust Constraints” (SPO-RC) loss, a feasibility-sensitive adaptation of the SPO loss that measures decision error of predicted objective parameters. We also introduce a convex surrogate, SPO-RC+, and prove Fisher consistency with SPO-RC. To enhance performance, we train on truncated datasets where true constraint parameters lie within the uncertainty sets, and we correct the induced sample selection bias using importance reweighting techniques. Through experiments on fractional knapsack and alloy production problem instances, we demonstrate that SPO-RC+ effectively handles uncertainty in constraints and that combining truncation with importance reweighting can further improve performance.
[LG-105] CrossNAS: A Cross-Layer Neural Architecture Search Framework for PIM Systems
链接: https://arxiv.org/abs/2505.22868
作者: Md Hasibul Amin,Mohammadreza Mohammadi,Jason D. Bakos,Ramtin Zand
类目: Emerging Technologies (cs.ET); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we propose the CrossNAS framework, an automated approach for exploring a vast, multidimensional search space that spans various design abstraction layers (circuits, architecture, and systems) to optimize the deployment of machine learning workloads on analog processing-in-memory (PIM) systems. CrossNAS leverages the single-path one-shot weight-sharing strategy combined with the evolutionary search for the first time in the context of PIM system mapping and optimization. CrossNAS sets a new benchmark for PIM neural architecture search (NAS), outperforming previous methods in both accuracy and energy efficiency while maintaining comparable or shorter search times.
[LG-106] Causal-PIK: Causality-based Physical Reasoning with a Physics-Informed Kernel ICML2025
链接: https://arxiv.org/abs/2505.22861
作者: Carlota Parés-Morlans,Michelle Yi,Claire Chen,Sarah A. Wu,Rika Antonova,Tobias Gerstenberg,Jeannette Bohg
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2025
Abstract:Tasks that involve complex interactions between objects with unknown dynamics make planning before execution difficult. These tasks require agents to iteratively improve their actions after actively exploring causes and effects in the environment. For these types of tasks, we propose Causal-PIK, a method that leverages Bayesian optimization to reason about causal interactions via a Physics-Informed Kernel to help guide efficient search for the best next action. Experimental results on Virtual Tools and PHYRE physical reasoning benchmarks show that Causal-PIK outperforms state-of-the-art results, requiring fewer actions to reach the goal. We also compare Causal-PIK to human studies, including results from a new user study we conducted on the PHYRE benchmark. We find that Causal-PIK remains competitive on tasks that are very challenging, even for human problem-solvers.
[LG-107] Kernel-Smoothed Scores for Denoising Diffusion: A Bias-Variance Study
链接: https://arxiv.org/abs/2505.22841
作者: Franck Gabriel,François Ged,Maria Han Veiga,Emmanuel Schertzer
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:
Abstract:Diffusion models now set the benchmark in high-fidelity generative sampling, yet they can, in principle, be prone to memorization. In this case, their learned score overfits the finite dataset so that the reverse-time SDE samples are mostly training points. In this paper, we interpret the empirical score as a noisy version of the true score and show that its covariance matrix is asymptotically a re-weighted data PCA. In large dimension, the small time limit makes the noise variance blow up while simultaneously reducing spatial correlation. To reduce this variance, we introduce a kernel-smoothed empirical score and analyze its bias-variance trade-off. We derive asymptotic bounds on the Kullback-Leibler divergence between the true distribution and the one generated by the modified reverse SDE. Regularization on the score has the same effect as increasing the size of the training dataset, and thus helps prevent memorization. A spectral decomposition of the forward diffusion suggests better variance control under some regularity conditions of the true data distribution. Reverse diffusion with kernel-smoothed empirical score can be reformulated as a gradient descent drifted toward a Log-Exponential Double-Kernel Density Estimator (LED-KDE). This perspective highlights two regularization mechanisms taking place in denoising diffusions: an initial Gaussian kernel first diffuses mass isotropically in the ambient space, while a second kernel applied in score space concentrates and spreads that mass along the data manifold. Hence, even a straightforward regularization, without any learning, already mitigates memorization and enhances generalization. Numerically, we illustrate our results with several experiments on synthetic and MNIST datasets.
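The contrast between the empirical score and a kernel-smoothed variant can be illustrated with a Gaussian mixture over the training points; adding a bandwidth h widens each kernel and shrinks the score, in the spirit of (but not identical to) the paper's estimator.

```python
import numpy as np

def smoothed_score(x, data, sigma, h=0.0):
    """Score (gradient of log density) of the smoothed empirical distribution
    p(x) = (1/n) sum_i N(x; x_i, (sigma^2 + h^2) I).
    h = 0 gives the memorization-prone empirical score at noise level sigma;
    h > 0 is a kernel-smoothed variant (illustrative sketch)."""
    s2 = sigma ** 2 + h ** 2
    diffs = data - x                               # (n, d): x_i - x
    logw = -0.5 * (diffs ** 2).sum(axis=1) / s2
    w = np.exp(logw - logw.max())
    w /= w.sum()                                   # softmax responsibilities
    return (w[:, None] * diffs).sum(axis=0) / s2   # grad log p at x
```

With a single training point at the origin and sigma = 1, the score at x is exactly -x; raising h spreads the kernel and damps the pull toward the training point.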
[LG-108] Development and Validation of SXI++ LNM Algorithm for Sepsis Prediction
链接: https://arxiv.org/abs/2505.22840
作者: Dharambir Mahto,Prashant Yadav,Mahesh Banavar,Jim Keany,Alan T Joseph,Srinivas Kilambi
类目: Machine Learning (cs.LG)
*备注: Paper accepted at JMAI
Abstract:Sepsis is a life-threatening condition affecting over 48.9 million people globally and causing 11 million deaths annually. Despite medical advancements, predicting sepsis remains a challenge due to non-specific symptoms and complex pathophysiology. The SXI++ LNM is a machine learning scoring system that refines sepsis prediction by leveraging multiple algorithms and deep neural networks. This study aims to improve robustness in clinical applications and evaluates the predictive performance of the SXI++ LNM for sepsis prediction. The model, utilizing a deep neural network, was trained and tested using multiple scenarios with different dataset distributions. The model’s performance was assessed against unseen test data, and accuracy, precision, and area under the curve (AUC) were calculated. The SXI++ LNM outperformed the state of the art in three use cases, achieving an AUC of 0.99 (95% CI: 0.98-1.00). The model demonstrated a precision of 99.9% (95% CI: 99.8-100.0) and an accuracy of 99.99% (95% CI: 99.98-100.0), maintaining high reliability.
[LG-109] Preference Learning with Response Time
链接: https://arxiv.org/abs/2505.22820
作者: Ayush Sawarni,Sahasrajit Sarmasarkar,Vasilis Syrgkanis
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper investigates the integration of response time data into human preference learning frameworks for more effective reward model elicitation. While binary preference data has become fundamental in fine-tuning foundation models, generative AI systems, and other large-scale models, the valuable temporal information inherent in user decision-making remains largely unexploited. We propose novel methodologies to incorporate response time information alongside binary choice data, leveraging the Evidence Accumulation Drift Diffusion (EZ) model, under which response time is informative of the preference strength. We develop Neyman-orthogonal loss functions that achieve oracle convergence rates for reward model learning, matching the theoretical optimal rates that would be attained if the expected response times for each query were known a priori. Our theoretical analysis demonstrates that for linear reward functions, conventional preference learning suffers from error rates that scale exponentially with reward magnitude. In contrast, our response time-augmented approach reduces this to polynomial scaling, representing a significant improvement in sample efficiency. We extend these guarantees to non-parametric reward function spaces, establishing convergence properties for more complex, realistic reward models. Our extensive experiments validate our theoretical findings in the context of preference learning over images.
[LG-110] X-Factor: Quality Is a Dataset-Intrinsic Property
链接: https://arxiv.org/abs/2505.22813
作者: Josiah Couch,Miao Li,Rima Arnaout,Ramy Arnaout
类目: Machine Learning (cs.LG)
*备注: 13 pages, 7 figures
Abstract:In the universal quest to optimize machine-learning classifiers, three factors – model architecture, dataset size, and class balance – have been shown to influence test-time performance but do not fully account for it. Previously, evidence was presented for an additional factor that can be referred to as dataset quality, but it was unclear whether this was actually a joint property of the dataset and the model architecture, or an intrinsic property of the dataset itself. If quality is truly dataset-intrinsic and independent of model architecture, dataset size, and class balance, then the same datasets should perform better (or worse) regardless of these other factors. To test this hypothesis, here we create thousands of datasets, each controlled for size and class balance, and use them to train classifiers with a wide range of architectures, from random forests and support-vector machines to deep networks. We find that classifier performance correlates strongly by subset across architectures ( R^2=0.79 ), supporting quality as an intrinsic property of datasets independent of dataset size and class balance and of model architecture. Digging deeper, we find that dataset quality appears to be an emergent property of something more fundamental: the quality of datasets’ constituent classes. Thus, quality joins size, class balance, and model architecture as an independent correlate of performance and a separate target for optimizing machine-learning-based classification.
[LG-111] Navigating the Latent Space Dynamics of Neural Models
链接: https://arxiv.org/abs/2505.22785
作者: Marco Fumero,Luca Moschella,Emanuele Rodolà,Francesco Locatello
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural networks transform high-dimensional data into compact, structured representations, often modeled as elements of a lower dimensional latent space. In this paper, we present an alternative interpretation of neural models as dynamical systems acting on the latent manifold. Specifically, we show that autoencoder models implicitly define a latent vector field on the manifold, derived by iteratively applying the encoding-decoding map, without any additional training. We observe that standard training procedures introduce inductive biases that lead to the emergence of attractor points within this vector field. Drawing on this insight, we propose to leverage the vector field as a representation for the network, providing a novel tool to analyze the properties of the model and the data. This representation enables to: (i) analyze the generalization and memorization regimes of neural models, even throughout training; (ii) extract prior knowledge encoded in the network’s parameters from the attractors, without requiring any input data; (iii) identify out-of-distribution samples from their trajectories in the vector field. We further validate our approach on vision foundation models, showcasing the applicability and effectiveness of our method in real-world scenarios.
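A minimal sketch of the core construction, with a toy contractive map standing in for a trained autoencoder's encode-decode composition (an illustrative assumption): the latent vector field is v(z) = f(z) - z, and iterating f follows the field toward its attractors.

```python
import numpy as np

# A toy contractive map standing in for f = decode(encode(.)) of a trained
# autoencoder; the induced latent vector field is v(z) = f(z) - z, and
# iterating f follows the field toward its attractors.
attractor = np.array([1.0, -2.0])

def f(z):
    # Pulls z 20% of the way toward the attractor on each application.
    return z + 0.2 * (attractor - z)

def vector_field(z):
    return f(z) - z

def follow(z, steps=200):
    # Trajectory of repeated encode-decode applications.
    for _ in range(steps):
        z = f(z)
    return z

z_final = follow(np.array([5.0, 5.0]))
```

In the paper's setting the attractors encode prior knowledge learned during training, and trajectories of out-of-distribution inputs look qualitatively different from in-distribution ones.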
[LG-112] Machine Learning Models Have a Supply Chain Problem
链接: https://arxiv.org/abs/2505.22778
作者: Sarah Meiklejohn,Hayden Blauzvern,Mihai Maruseac,Spencer Schrock,Laurent Simon,Ilia Shumailov
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 12 pages, four figures and one table
Abstract:Powerful machine learning (ML) models are now readily available online, which creates exciting possibilities for users who lack the deep technical expertise or substantial computing resources needed to develop them. On the other hand, this type of open ecosystem comes with many risks. In this paper, we argue that the current ecosystem for open ML models contains significant supply-chain risks, some of which have been exploited already in real attacks. These include an attacker replacing a model with something malicious (e.g., malware), or a model being trained using a vulnerable version of a framework or on restricted or poisoned data. We then explore how Sigstore, a solution designed to bring transparency to open-source software supply chains, can be used to bring transparency to open ML models, in terms of enabling model publishers to sign their models and prove properties about the datasets they use.
[LG-113] Calibrated Value-Aware Model Learning with Stochastic Environment Models
链接: https://arxiv.org/abs/2505.22772
作者: Claas Voelcker,Anastasiia Pedan,Arash Ahmadian,Romina Abachi,Igor Gilitschenski,Amir-massoud Farahmand
类目: Machine Learning (cs.LG)
*备注:
Abstract:The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes a model’s value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.
[LG-114] Multivariate de Bruijn Graphs: A Symbolic Graph Framework for Time Series Forecasting
链接: https://arxiv.org/abs/2505.22768
作者: Mert Onur Cakiroglu,Idil Bilge Altun,Hasan Kurban,Elham Buxton,Mehmet Dalkilic
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series forecasting remains a challenging task for foundation models due to temporal heterogeneity, high dimensionality, and the lack of inherent symbolic structure. In this work, we propose DRAGON (Discrete Representation and Augmented Graph encoding Over deBruijN Graphs), a novel encoder that introduces Multivariate de Bruijn Graphs (MdBGs) to bridge the gap between symbolic representations and neural modeling. DRAGON discretizes continuous input sequences and maps them onto a fixed graph structure, enabling dynamic context recovery via graph-based attention. Integrated as an auxiliary module within a dual-branch architecture, DRAGON augments conventional CNN-based encoders with symbolic, structure-aware representations. All code developed for this study is available at: this https URL
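A rough sketch of the symbolic step, assuming quantile-based discretization (the paper's exact scheme may differ): discretize the series into symbols, then connect overlapping k-grams to form a de Bruijn graph over the symbolic sequence.

```python
import numpy as np
from collections import defaultdict

def discretize(series, n_bins=4):
    # Quantile binning into symbols 0..n_bins-1 (a SAX-like step; DRAGON's
    # actual discretization may differ).
    edges = np.quantile(series, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(series, edges)

def de_bruijn_graph(symbols, k=2):
    # Nodes are k-grams of the symbolic sequence; an edge u -> v exists when
    # v extends u by one symbol (the classic de Bruijn construction).
    graph = defaultdict(set)
    grams = [tuple(symbols[i:i + k]) for i in range(len(symbols) - k + 1)]
    for u, v in zip(grams, grams[1:]):
        graph[u].add(v)
    return graph

series = np.sin(np.linspace(0, 8 * np.pi, 400))
graph = de_bruijn_graph(discretize(series), k=2)
```

Mapping inputs onto a fixed symbolic graph like this is what gives the encoder a stable structure over which graph-based attention can recover dynamic context.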
[LG-115] StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis
链接: https://arxiv.org/abs/2505.22746
作者: Jose Guadalupe Hernandez,Attri Ghosh,Philip J. Freda,Yufei Meng,Nicholas Matsumoto,Jason H. Moore
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:We present the Star-Based Automated Single-locus and Epistasis analysis tool - Genetic Programming (StarBASE-GP), an automated framework for discovering meaningful genetic variants associated with phenotypic variation in large-scale genomic datasets. StarBASE-GP uses a genetic programming-based multi-objective optimization strategy to evolve machine learning pipelines that simultaneously maximize explanatory power (R^2) and minimize pipeline complexity. Biological domain knowledge is integrated at multiple stages, including the use of nine inheritance encoding strategies to model deviations from additivity, a custom linkage disequilibrium pruning node that minimizes redundancy among features, and a dynamic variant recommendation system that prioritizes informative candidates for pipeline inclusion. We evaluate StarBASE-GP on a cohort of Rattus norvegicus (brown rat) to identify variants associated with body mass index, benchmarking its performance against a random baseline and a biologically naive version of the tool. StarBASE-GP consistently evolves Pareto fronts with superior performance, yielding higher accuracy in identifying both ground truth and novel quantitative trait loci, highlighting relevant targets for future validation. By incorporating evolutionary search and relevant biological theory into a flexible automated machine learning framework, StarBASE-GP demonstrates robust potential for advancing variant discovery in complex traits.
[LG-116] Private Rate-Constrained Optimization with Applications to Fair Learning
链接: https://arxiv.org/abs/2505.22703
作者: Mohammad Yaghini,Tudor Cebere,Michael Menart,Aurélien Bellet,Nicolas Papernot
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:
Abstract:Many problems in trustworthy ML can be formulated as minimization of the model error under constraints on the prediction rates of the model for suitably-chosen marginals, including most group fairness constraints (demographic parity, equality of odds, etc.). In this work, we study such constrained minimization problems under differential privacy (DP). Standard DP optimization techniques like DP-SGD rely on the loss function’s decomposability into per-sample contributions. However, rate constraints introduce inter-sample dependencies, violating the decomposability requirement. To address this, we develop RaCO-DP, a DP variant of the Stochastic Gradient Descent-Ascent (SGDA) algorithm which solves the Lagrangian formulation of rate constraint problems. We demonstrate that the additional privacy cost of incorporating these constraints reduces to privately estimating a histogram over the mini-batch at each optimization step. We prove the convergence of our algorithm through a novel analysis of SGDA that leverages the linear structure of the dual parameter. Finally, empirical results on learning under group fairness constraints demonstrate that our method Pareto-dominates existing private learning approaches in fairness-utility trade-offs.
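The reduction described above can be sketched as follows, assuming a Laplace mechanism over a per-group histogram of binary predictions with add/remove sensitivity 1 (an illustrative choice, not necessarily the paper's exact mechanism):

```python
import numpy as np

rng = np.random.default_rng(8)

def private_rate_histogram(preds, groups, n_groups, eps):
    # Per-step private statistic for rate constraints: a histogram of binary
    # predictions per protected group, released via the Laplace mechanism.
    # One sample changes one cell by one, so the L1 sensitivity is 1.
    hist = np.zeros((n_groups, 2))
    for p, g in zip(preds, groups):
        hist[g, int(p)] += 1
    return hist + rng.laplace(scale=1.0 / eps, size=hist.shape)

preds = rng.integers(0, 2, size=512)    # a mini-batch of binary predictions
groups = rng.integers(0, 2, size=512)   # protected-group membership
noisy_hist = private_rate_histogram(preds, groups, n_groups=2, eps=1.0)
```

The noisy histogram is all the dual (constraint) update needs, which is why the extra privacy cost of the rate constraints stays small.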
[LG-117] Update Your Transformer to the Latest Release: Re-Basin of Task Vectors ICML2025
链接: https://arxiv.org/abs/2505.22697
作者: Filippo Rinaldi,Giacomo Capitani,Lorenzo Bonicelli,Donato Crisostomi,Federico Bolelli,Elisa Ficarra,Emanuele Rodolà,Simone Calderara,Angelo Porrello
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2025
Abstract:Foundation models serve as the backbone for numerous specialized models developed through fine-tuning. However, when the underlying pretrained model is updated or retrained (e.g., on larger and more curated datasets), the fine-tuned model becomes obsolete, losing its utility and requiring retraining. This raises the question: is it possible to transfer fine-tuning to a new release of the model? In this work, we investigate how to transfer fine-tuning to a new checkpoint without having to re-train, in a data-free manner. To do so, we draw principles from model re-basin and provide a recipe based on weight permutations to re-base the modifications made to the original base model, often called task vector. In particular, our approach tailors model re-basin for Transformer models, taking into account the challenges of residual connections and multi-head attention layers. Specifically, we propose a two-level method rooted in spectral theory, initially permuting the attention heads and subsequently adjusting parameters within select pairs of heads. Through extensive experiments on visual and textual tasks, we achieve the seamless transfer of fine-tuned knowledge to new pre-trained backbones without relying on a single training step or datapoint. Code is available at this https URL.
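A minimal numeric sketch of the re-basin idea, with a plain row permutation standing in for the attention-head permutations the paper actually computes: the task vector is re-based with the same permutation that relates the two base checkpoints, after which applying it to the new base reproduces the fine-tuned model up to that symmetry.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fine-tuning delta ("task vector") computed against the old base checkpoint.
base_old = rng.normal(size=(4, 4))
finetuned = base_old + rng.normal(scale=0.1, size=(4, 4))
task_vector = finetuned - base_old

# Pretend the new release equals the old base up to a row permutation (a toy
# stand-in for the weight-space symmetries that re-basin resolves in real
# Transformers, where residual streams and attention heads complicate things).
perm = np.eye(4)[[2, 0, 3, 1]]
base_new = perm @ base_old

# Re-base the task vector with the same permutation before applying it.
rebased = base_new + perm @ task_vector
```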
[LG-118] When Does Neuroevolution Outcompete Reinforcement Learning in Transfer Learning Tasks?
链接: https://arxiv.org/abs/2505.22696
作者: Eleni Nisioti,Joachim Winther Pedersen,Erwan Plantec,Milton L. Montero,Sebastian Risi
类目: Machine Learning (cs.LG)
*备注:
Abstract:The ability to continuously and efficiently transfer skills across tasks is a hallmark of biological intelligence and a long-standing goal in artificial systems. Reinforcement learning (RL), a dominant paradigm for learning in high-dimensional control tasks, is known to suffer from brittleness to task variations and catastrophic forgetting. Neuroevolution (NE) has recently gained attention for its robustness, scalability, and capacity to escape local optima. In this paper, we investigate an understudied dimension of NE: its transfer learning capabilities. To this end, we introduce two benchmarks: a) in stepping gates, neural networks are tasked with emulating logic circuits, with designs that emphasize modular repetition and variation b) ecorobot extends the Brax physics engine with objects such as walls and obstacles and the ability to easily switch between different robotic morphologies. Crucial in both benchmarks is the presence of a curriculum that enables evaluating skill transfer across tasks of increasing complexity. Our empirical analysis shows that NE methods vary in their transfer abilities and frequently outperform RL baselines. Our findings support the potential of NE as a foundation for building more adaptable agents and highlight future challenges for scaling NE to complex, real-world problems.
[LG-119] LLM-ODDR: A Large Language Model Framework for Joint Order Dispatching and Driver Repositioning
链接: https://arxiv.org/abs/2505.22695
作者: Tengfei Lyu,Siyuan Feng,Hao Liu,Hai Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Ride-hailing platforms face significant challenges in optimizing order dispatching and driver repositioning operations in dynamic urban environments. Traditional approaches based on combinatorial optimization, rule-based heuristics, and reinforcement learning often overlook driver income fairness, interpretability, and adaptability to real-world dynamics. To address these gaps, we propose LLM-ODDR, a novel framework leveraging Large Language Models (LLMs) for joint Order Dispatching and Driver Repositioning (ODDR) in ride-hailing services. LLM-ODDR framework comprises three key components: (1) Multi-objective-guided Order Value Refinement, which evaluates orders by considering multiple objectives to determine their overall value; (2) Fairness-aware Order Dispatching, which balances platform revenue with driver income fairness; and (3) Spatiotemporal Demand-Aware Driver Repositioning, which optimizes idle vehicle placement based on historical patterns and projected supply. We also develop JointDR-GPT, a fine-tuned model optimized for ODDR tasks with domain knowledge. Extensive experiments on real-world datasets from Manhattan taxi operations demonstrate that our framework significantly outperforms traditional methods in terms of effectiveness, adaptability to anomalous conditions, and decision interpretability. To our knowledge, this is the first exploration of LLMs as decision-making agents in ride-hailing ODDR tasks, establishing foundational insights for integrating advanced language models within intelligent transportation systems.
[LG-120] MoRE: A Mixture of Low-Rank Experts for Adaptive Multi-Task Learning ACL2025
链接: https://arxiv.org/abs/2505.22694
作者: Dacao Zhang,Kun Zhang,Shimao Chu,Le Wu,Xin Li,Si Wei
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted to ACL 2025 Findings
Abstract:With the rapid development of Large Language Models (LLMs), Parameter-Efficient Fine-Tuning (PEFT) methods have gained significant attention, which aims to achieve efficient fine-tuning of LLMs with fewer parameters. As a representative PEFT method, Low-Rank Adaptation (LoRA) introduces low-rank matrices to approximate the incremental tuning parameters and achieves impressive performance over multiple scenarios. After that, plenty of improvements have been proposed for further improvement. However, these methods either focus on single-task scenarios or separately train multiple LoRA modules for multi-task scenarios, limiting the efficiency and effectiveness of LoRA in multi-task scenarios. To better adapt to multi-task fine-tuning, in this paper, we propose a novel Mixture of Low-Rank Experts (MoRE) for multi-task PEFT. Specifically, instead of using an individual LoRA for each task, we align different ranks of LoRA module with different tasks, which we named low-rank experts. Moreover, we design a novel adaptive rank selector to select the appropriate expert for each task. By jointly training low-rank experts, MoRE can enhance the adaptability and efficiency of LoRA in multi-task scenarios. Finally, we conduct extensive experiments over multiple multi-task benchmarks along with different LLMs to verify model performance. Experimental results demonstrate that compared to traditional LoRA and its variants, MoRE significantly improves the performance of LLMs in multi-task scenarios and incurs no additional inference cost. We also release the model and code to facilitate the community.
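A toy sketch of the rank-alignment idea: one shared pair of low-rank factors whose leading slices act as experts of different ranks, with a fixed task-to-rank table standing in for the learned adaptive rank selector (the table values are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(1)
d, r_max = 8, 4

# One shared pair of low-rank factors; each task is aligned with a leading
# slice of ranks, so "experts" of different ranks share parameters.
A = rng.normal(size=(r_max, d))
B = rng.normal(size=(d, r_max))

def lora_delta(rank):
    # Rank-`rank` update: B[:, :rank] @ A[:rank, :].
    return B[:, :rank] @ A[:rank, :]

# Hypothetical fixed task -> rank table; MoRE instead learns an adaptive rank
# selector jointly with the experts.
task_rank = {0: 1, 1: 3}
deltas = {t: lora_delta(r) for t, r in task_rank.items()}
```

Because all ranks share the same factors, adding a task costs no extra parameters and inference needs only the selected slice.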
[LG-121] SlimLLM: Accurate Structured Pruning for Large Language Models ICML2025
链接: https://arxiv.org/abs/2505.22689
作者: Jialong Guo,Xinghao Chen,Yehui Tang,Yunhe Wang
类目: Machine Learning (cs.LG)
*备注: ICML 2025
Abstract:Large language models (LLMs) have garnered significant attention and demonstrated impressive capabilities in a wide range of applications. However, due to their enormous computational costs, the deployment and application of LLMs are often severely limited. To address this issue, structured pruning is an effective solution to compress the parameters of LLMs. Determining the importance of each sub-module in LLMs and minimizing performance loss are critical issues that need to be carefully addressed in structured pruning. In this paper, we propose an effective and fast structured pruning method named SlimLLM for large language models. For channel and attention head pruning, we evaluate the importance based on the entire channel or head, rather than merely aggregating the importance of individual elements within a sub-module. This approach enables a more holistic consideration of the interdependence among elements within the sub-module. In addition, we design a simple linear regression strategy for the output matrix to quickly recover performance. We also propose a layer-based importance ratio to determine the pruning ratio for each layer. Based on the LLaMA benchmark results, our SlimLLM outperforms other methods and achieves state-of-the-art performance.
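The two ingredients named above, whole-channel importance scoring and a linear-regression recovery step, can be sketched on a toy projection matrix (the calibration data, shapes, and pruning ratio are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(256, 8))      # toy calibration activations (8 channels)
W = rng.normal(size=(8, 4))        # projection whose input channels we prune

# Whole-channel importance: score each channel by the norm of its entire
# weight row, rather than aggregating per-element importance scores.
importance = np.linalg.norm(W, axis=1)
keep = np.sort(np.argsort(importance)[2:])   # drop the 2 least important

# Regression recovery: refit the remaining rows so the pruned layer matches
# the original outputs on calibration data (a stand-in for SlimLLM's step).
Y = X @ W
W_pruned, *_ = np.linalg.lstsq(X[:, keep], Y, rcond=None)
rel_err = np.linalg.norm(X[:, keep] @ W_pruned - Y) / np.linalg.norm(Y)
```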
[LG-122] Localized Weather Prediction Using Kolmogorov-Arnold Network-Based Models and Deep RNNs
链接: https://arxiv.org/abs/2505.22686
作者: Ange-Clement Akazan,Verlon Roel Mbingui,Gnankan Landry Regis N’guessan,Issa Karambal
类目: Machine Learning (cs.LG)
*备注:
Abstract:Weather forecasting is crucial for managing risks and economic planning, particularly in tropical Africa, where extreme events severely impact livelihoods. Yet, existing forecasting methods often struggle with the region’s complex, non-linear weather patterns. This study benchmarks deep recurrent neural networks such as LSTM, GRU, BiLSTM, BiGRU, and Kolmogorov-Arnold-based models (KAN and TKAN) for daily forecasting of temperature, precipitation, and pressure in two tropical cities: Abidjan, Cote d’Ivoire (Ivory Coast) and Kigali (Rwanda). We further introduce two customized variants of TKAN that replace its original SiLU activation function with GeLU and MiSH, respectively. Using station-level meteorological data spanning from 2010 to 2024, we evaluate all the models on standard regression metrics. KAN achieves the best temperature prediction (R^2 = 0.9986 in Abidjan, 0.9998 in Kigali; MSE 0.0014 °C^2), while TKAN variants minimize absolute errors for precipitation forecasting in low-rainfall regimes. The customized TKAN models demonstrate improvements over the standard TKAN across both datasets. Classical RNNs remain highly competitive for atmospheric pressure (R^2 ≈ 0.83-0.86), outperforming KAN-based models in this task. These results highlight the potential of spline-based neural architectures for efficient and data-efficient forecasting.
[LG-123] Recovering Fairness Directly from Modularity: a New Way for Fair Community Partitioning
链接: https://arxiv.org/abs/2505.22684
作者: Yufeng Wang,Yiguang Bai,Tianqing Zhu,Ismail Ben Ayed,Jing Yuan
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: 17pages, 5 figures
Abstract:Community partitioning is crucial in network analysis, with modularity optimization being the prevailing technique. However, traditional modularity-based methods often overlook fairness, a critical aspect in real-world applications. To address this, we introduce protected group networks and propose a novel fairness-modularity metric. This metric extends traditional modularity by explicitly incorporating fairness, and we prove that minimizing it yields naturally fair partitions for protected groups while maintaining theoretical soundness. We develop a general optimization framework for fairness partitioning and design the efficient Fair Fast Newman (FairFN) algorithm, enhancing the Fast Newman (FN) method to optimize both modularity and fairness. Experiments show FairFN achieves significantly improved fairness and high-quality partitions compared to state-of-the-art methods, especially on unbalanced datasets.
[LG-124] Unified Network-Based Representation of BIM Models for Embedding Semantic Spatial and Topological Data
链接: https://arxiv.org/abs/2505.22670
作者: Jin Han,Xin-Zheng Lu,Jia-Rui Lin
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:Building Information Modeling (BIM) has revolutionized the construction industry by providing a comprehensive digital representation of building structures throughout their lifecycle. However, existing research lacks effective methods for capturing the complex spatial and topological relationships between components in BIM models, which are essential for understanding design patterns and enhancing decision-making. This study proposes a unified network-based representation method that integrates the “semantic-spatial-topological” multi-dimensional design features of BIM models. By extending the IFC (Industry Foundation Classes) standard, we introduce local spatial relationships and topological connections between components to enrich the network structure. This representation method enables a more detailed understanding of component interactions, dependencies, and implicit design patterns, effectively capturing the semantic, topological, and spatial relationships in BIM, and holds significant potential for the representation and learning of design patterns.
[LG-125] User-centric Vehicle-to-Grid Optimization with an Input Convex Neural Network-based Battery Degradation Model
链接: https://arxiv.org/abs/2505.11047
作者: Arghya Mallick,Georgios Pantazis,Mohammad Khosravi,Peyman Mohajerin Esfahani,Sergio Grammatico
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:We propose a data-driven, user-centric vehicle-to-grid (V2G) methodology based on multi-objective optimization to balance battery degradation and V2G revenue according to EV user preference. Given the lack of accurate and generalizable battery degradation models, we leverage input convex neural networks (ICNNs) to develop a data-driven degradation model trained on extensive experimental datasets. This approach enables our model to capture nonconvex dependencies on battery temperature and time while maintaining convexity with respect to the charging rate. Such a partial convexity property ensures that the second stage of our methodology remains computationally efficient. In the second stage, we integrate our data-driven degradation model into a multi-objective optimization framework to generate an optimal smart charging profile for each EV. This profile effectively balances the trade-off between financial benefits from V2G participation and battery degradation, controlled by a hyperparameter reflecting the user prioritization of battery health. Numerical simulations show the high accuracy of the ICNN model in predicting battery degradation for unseen data. Finally, we present a trade-off curve illustrating financial benefits from V2G versus losses from battery health degradation based on user preferences and showcase smart charging strategies under realistic scenarios.
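A minimal sketch of the convexity mechanism in an ICNN: non-negative weights on the hidden-to-hidden path plus convex, nondecreasing activations make the output convex in the input (here a scalar standing in for the charging rate; the layer sizes and ReLU choice are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)

# Tiny ICNN in a scalar input x: ReLU is convex and nondecreasing, and the
# hidden-to-hidden and output weights are constrained non-negative, which
# together make icnn(x) convex in x.
W0 = rng.normal(size=8)
Wz = np.abs(rng.normal(size=(8, 8)))    # non-negative: preserves convexity
Wx = rng.normal(size=8)                 # unconstrained skip path (affine in x)
w_out = np.abs(rng.normal(size=8))      # non-negative output weights

def icnn(x):
    z = np.maximum(x * W0, 0.0)
    z = np.maximum(z @ Wz + x * Wx, 0.0)
    return float(z @ w_out)
```

This partial convexity (convex in the charging rate, flexible in temperature and time) is what keeps the downstream multi-objective optimization tractable.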
[LG-126] On the Convergence Analysis of Muon
链接: https://arxiv.org/abs/2505.23737
作者: Wei Shen,Ruichuan Huang,Minhui Huang,Cong Shen,Jiawei Zhang
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:The majority of parameters in neural networks are naturally represented as matrices. However, most commonly used optimizers treat these matrix parameters as flattened vectors during optimization, potentially overlooking their inherent structural properties. Recently, an optimizer called Muon has been proposed, specifically designed to optimize matrix-structured parameters. Extensive empirical evidence shows that Muon can significantly outperform traditional optimizers when training neural networks. Nonetheless, the theoretical understanding of Muon’s convergence behavior and the reasons behind its superior performance remain limited. In this work, we present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). We further characterize the conditions under which Muon can outperform GD. Our theoretical results reveal that Muon can benefit from the low-rank and approximate blockwise diagonal structure of Hessian matrices – phenomena widely observed in practical neural network training. Our experimental results support and corroborate the theoretical findings.
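Muon's key matrix-aware operation, orthogonalizing the momentum matrix, is typically approximated with a Newton-Schulz iteration; below is the simple cubic variant as a hedged sketch (practical Muon implementations use a tuned quintic polynomial, and the matrix here is a random stand-in for momentum).

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=40):
    # Cubic Newton-Schulz iteration: singular vectors are preserved while each
    # singular value is driven toward 1, so X converges to the semi-orthogonal
    # polar factor U V^T of M (the operation Muon applies to its momentum).
    X = M / np.linalg.norm(M)   # Frobenius normalization: singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(4)
G = rng.normal(size=(5, 5))     # stand-in for a momentum matrix
O = newton_schulz_orthogonalize(G)
```

Replacing the raw momentum with its polar factor equalizes the update across singular directions, which is where the matrix structure pays off relative to flattened-vector optimizers.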
[LG-127] Instance-Optimality for Private KL Distribution Estimation
链接: https://arxiv.org/abs/2505.23620
作者: Jiayuan Ye,Vitaly Feldman,Kunal Talwar
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study the fundamental problem of estimating an unknown discrete distribution p over d symbols, given n i.i.d. samples from the distribution. We are interested in minimizing the KL divergence between the true distribution and the algorithm’s estimate. We first construct minimax optimal private estimators. Minimax optimality however fails to shed light on an algorithm’s performance on individual (non-worst-case) instances p and simple minimax-optimal DP estimators can have poor empirical performance on real distributions. We then study this problem from an instance-optimality viewpoint, where the algorithm’s error on p is compared to the minimum achievable estimation error over a small local neighborhood of p . Under natural notions of local neighborhood, we propose algorithms that achieve instance-optimality up to constant factors, with and without a differential privacy constraint. Our upper bounds rely on (private) variants of the Good-Turing estimator. Our lower bounds use additive local neighborhoods that more precisely captures the hardness of distribution estimation in KL divergence, compared to ones considered in prior works.
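For reference, the classic (non-private) Good-Turing estimator that the paper's upper bounds build private variants of: symbols seen r times get adjusted counts based on N_{r+1}, and N_1/n mass is reserved for unseen symbols.

```python
from collections import Counter

def good_turing(samples):
    # Classic Good-Turing: a symbol seen r times gets adjusted count
    # r* = (r + 1) * N_{r+1} / N_r, and mass N_1 / n is reserved for unseen
    # symbols. (The paper's estimators are private variants of this idea.)
    n = len(samples)
    counts = Counter(samples)
    Nr = Counter(counts.values())      # N_r = number of symbols seen r times
    def adjusted(r):
        # Fall back to the raw count when N_{r+1} is zero (no smoothing info).
        return (r + 1) * Nr[r + 1] / Nr[r] if Nr[r + 1] else r
    probs = {x: adjusted(r) / n for x, r in counts.items()}
    unseen_mass = Nr[1] / n
    return probs, unseen_mass

probs, unseen_mass = good_turing(list("aaabbbbccd"))
```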
[LG-128] Multilook Coherent Imaging: Theoretical Guarantees and Algorithms
链接: https://arxiv.org/abs/2505.23594
作者: Xi Chen,Soham Jana,Christopher A. Metzler,Arian Maleki,Shirin Jalali
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 29 pages, 4 figures, 3 tables. arXiv admin note: substantial text overlap with arXiv:2402.15635
Abstract:Multilook coherent imaging is a widely used technique in applications such as digital holography, ultrasound imaging, and synthetic aperture radar. A central challenge in these systems is the presence of multiplicative noise, commonly known as speckle, which degrades image quality. Despite the widespread use of coherent imaging systems, their theoretical foundations remain relatively underexplored. In this paper, we study both the theoretical and algorithmic aspects of likelihood-based approaches for multilook coherent imaging, providing a rigorous framework for analysis and method development. Our theoretical contributions include establishing the first theoretical upper bound on the Mean Squared Error (MSE) of the maximum likelihood estimator under the deep image prior hypothesis. Our results capture the dependence of MSE on the number of parameters in the deep image prior, the number of looks, the signal dimension, and the number of measurements per look. On the algorithmic side, we employ projected gradient descent (PGD) as an efficient method for computing the maximum likelihood solution. Furthermore, we introduce two key ideas to enhance the practical performance of PGD. First, we incorporate the Newton-Schulz algorithm to compute matrix inverses within the PGD iterations, significantly reducing computational complexity. Second, we develop a bagging strategy to mitigate projection errors introduced during PGD updates. We demonstrate that combining these techniques with PGD yields state-of-the-art performance. Our code is available at this https URL.
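The first practical idea above, replacing matrix inverses inside PGD with Newton-Schulz iterations, can be sketched in isolation (the test matrix and step count are illustrative choices):

```python
import numpy as np

def newton_schulz_inverse(A, steps=30):
    # X_{k+1} = X_k (2I - A X_k) converges quadratically to A^{-1}; the
    # initialization X_0 = A^T / (||A||_1 ||A||_inf) guarantees convergence
    # for any nonsingular A, using only matrix products (no factorizations).
    n = A.shape[0]
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(n)
    for _ in range(steps):
        X = X @ (2 * I - A @ X)
    return X

rng = np.random.default_rng(5)
A = rng.normal(size=(4, 4)) + 4 * np.eye(4)   # illustrative test matrix
A_inv = newton_schulz_inverse(A)
```

Because the iteration uses only matrix multiplications, it maps well onto GPU pipelines, which is the computational win inside the PGD loop.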
[LG-129] Learning Parametric Distributions from Samples and Preferences
链接: https://arxiv.org/abs/2505.23557
作者: Marc Jourdan,Gizem Yüce,Nicolas Flammarion
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 28 pages, 8 figures. To be published in the Forty-Second International Conference on Machine Learning
Abstract:Recent advances in language modeling have underscored the role of preference feedback in enhancing model performance. This paper investigates the conditions under which preference feedback improves parameter estimation in classes of continuous parametric distributions. In our framework, the learner observes pairs of samples from an unknown distribution along with their relative preferences depending on the same unknown parameter. We show that preference-based M-estimators achieve a better asymptotic variance than sample-only M-estimators, further improved by deterministic preferences. Leveraging the hard constraints revealed by deterministic preferences, we propose an estimator achieving an estimation error scaling of O(1/n) – a significant improvement over the Θ(1/√n) rate attainable with samples alone. Next, we establish a lower bound that matches this accelerated rate, up to dimension and problem-dependent constants. While the assumptions underpinning our analysis are restrictive, they are satisfied by notable cases such as Gaussian or Laplace distributions for preferences based on the log-probability reward.
[LG-130] DeepFilterGAN: A Full-band Real-time Speech Enhancement System with GAN-based Stochastic Regeneration INTERSPEECH2025
链接: https://arxiv.org/abs/2505.23515
作者: Sanberk Serbest,Tijana Stojkovic,Milos Cernak,Andrew Harper
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted to Interspeech 2025
Abstract:In this work, we propose a full-band real-time speech enhancement system with GAN-based stochastic regeneration. Predictive models focus on estimating the mean of the target distribution, whereas generative models aim to learn the full distribution. This behavior of predictive models may lead to over-suppression, i.e. the removal of speech content. In the literature, it was shown that combining a predictive model with a generative one within the stochastic regeneration framework can reduce the distortion in the output. We use this framework to obtain a real-time speech enhancement system. With 3.58M parameters and a low latency, our system is designed for real-time streaming with a lightweight architecture. Experiments show that our system improves over the first stage in terms of the NISQA-MOS metric. Finally, through an ablation study, we show the importance of noisy conditioning in our system. We participated in the 2025 Urgent Challenge with our model and later made further improvements.
[LG-131] The Strong, Weak and Benign Goodhart's law. An independence-free and paradigm-agnostic formalisation
链接: https://arxiv.org/abs/2505.23445
作者: Adrien Majka,El-Mahdi El-Mhamdi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 32 pages, 1 figure
Abstract:Goodhart's law is a famous adage in policy-making that states that "When a measure becomes a target, it ceases to be a good measure". As machine learning models and the optimisation capacity to train them have grown, mounting empirical evidence has reinforced the belief in the validity of this law, without it, however, being formalised. Recently, a few attempts were made to formalise Goodhart's law, either by categorising variants of it, or by looking at how optimising a proxy metric affects the optimisation of an intended goal. In this work, we relax the simplifying independence assumption made in previous works, and the assumption on the learning paradigm made in most of them, to study the effect of the coupling between the proxy metric and the intended goal on Goodhart's law. Our results show that in the case of a light-tailed goal and a light-tailed discrepancy, dependence does not change the nature of Goodhart's effect. However, in the light-tailed goal and heavy-tailed discrepancy case, we exhibit an example where over-optimisation occurs at a rate inversely proportional to the heavy-tailedness of the discrepancy between the goal and the metric.
[LG-132] Dynamic Estimation Loss Control in Variational Quantum Sensing via Online Conformal Inference
链接: https://arxiv.org/abs/2505.23389
作者: Ivana Nikoloska,Hamdi Joudeh,Ruud van Sloun,Osvaldo Simeone
类目: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Quantum sensing exploits non-classical effects to overcome limitations of classical sensors, with applications ranging from gravitational-wave detection to nanoscale imaging. However, practical quantum sensors built on noisy intermediate-scale quantum (NISQ) devices face significant noise and sampling constraints, and current variational quantum sensing (VQS) methods lack rigorous performance guarantees. This paper proposes an online control framework for VQS that dynamically updates the variational parameters while providing deterministic error bars on the estimates. By leveraging online conformal inference techniques, the approach produces sequential estimation sets with a guaranteed long-term risk level. Experiments on a quantum magnetometry task confirm that the proposed dynamic VQS approach maintains the required reliability over time, while still yielding precise estimates. The results demonstrate the practical benefits of combining variational quantum algorithms with online conformal inference to achieve reliable quantum sensing on NISQ devices.
[LG-133] A Descriptor Is All You Need: Accurate Machine Learning of Nonadiabatic Coupling Vectors
链接: https://arxiv.org/abs/2505.23344
作者: Jakub Martinka,Lina Zhang,Yi-Fan Hou,Mikołaj Martyka,Jiří Pittner,Mario Barbatti,Pavlo O. Dral
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:
Abstract:Nonadiabatic couplings (NACs) play a crucial role in modeling photochemical and photophysical processes with methods such as the widely used fewest-switches surface hopping (FSSH). There is therefore a strong incentive to machine learn NACs for accelerating simulations. However, this is challenging due to NACs’ vectorial, double-valued character and the singularity near a conical intersection seam. For the first time, we design NAC-specific descriptors based on our domain expertise and show that they allow learning NACs with never-before-reported accuracy of R^2 exceeding 0.99. The key to success is also our new ML phase-correction procedure. We demonstrate the efficiency and robustness of our approach on a prototypical example of fully ML-driven FSSH simulations of fulvene targeting the SA-2-CASSCF(6,6) electronic structure level. This ML-FSSH dynamics leads to an accurate description of S_1 decay while reducing error bars by allowing the execution of a large ensemble of trajectories. Our implementations are available in open-source MLatom.
[LG-134] Stable Thompson Sampling: Valid Inference via Variance Inflation
链接: https://arxiv.org/abs/2505.23260
作者: Budhaditya Halder,Shubhayan Pan,Koulik Khamaru
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We consider the problem of statistical inference when the data is collected via a Thompson Sampling-type algorithm. While Thompson Sampling (TS) is known to be both asymptotically optimal and empirically effective, its adaptive sampling scheme poses challenges for constructing confidence intervals for model parameters. We propose and analyze a variant of TS, called Stable Thompson Sampling, in which the posterior variance is inflated by a logarithmic factor. We show that this modification leads to asymptotically normal estimates of the arm means, despite the non-i.i.d. nature of the data. Importantly, this statistical benefit comes at a modest cost: the variance inflation increases regret by only a logarithmic factor compared to standard TS. Our results reveal a principled trade-off: by paying a small price in regret, one can enable valid statistical inference for adaptive decision-making algorithms.
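摘要中"以对数因子膨胀后验方差"的思路可以用一个纯 Python 的高斯老虎机草图来示意。这里假设奖励为单位方差高斯、先验为平坦先验(这些设定细节是本示例的假设,并非论文原文),标准 TS 会按方差 1/n_k 采样,"稳定"变体则将其乘以 log(t):

```python
import math
import random

def stable_thompson_sampling(arm_means, horizon, seed=0):
    """Gaussian Thompson Sampling with log-inflated posterior variance.
    Sketch assumptions (not from the paper): unit-variance Gaussian
    rewards and a flat prior, so arm i's posterior is N(mean_i, 1/n_i);
    the 'stable' variant inflates that variance by a log(t) factor."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:  # pull each arm once to initialise
            arm = t - 1
        else:
            inflation = math.log(t)  # logarithmic variance inflation
            samples = [
                rng.gauss(sums[i] / counts[i], math.sqrt(inflation / counts[i]))
                for i in range(k)
            ]
            arm = max(range(k), key=samples.__getitem__)
        reward = arm_means[arm] + rng.gauss(0.0, 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts, [sums[i] / counts[i] for i in range(k)]
```

膨胀因子带来的额外探索让各臂的样本均值更接近独立同分布情形,从而支持渐近正态的推断,代价只是对数级的额外遗憾。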
[LG-135] Joint estimation of smooth graph signals from partial linear measurements
链接: https://arxiv.org/abs/2505.23240
作者: Hemant Tyagi
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 29 pages, 2 figures
Abstract:Given an undirected and connected graph G on T vertices, suppose each vertex t has a latent signal x_t \in \mathbb{R}^n associated to it. Given partial linear measurements of the signals, for a potentially small subset of the vertices, our goal is to estimate the x_t's. Assuming that the signals are smooth w.r.t. G, in the sense that the quadratic variation of the signals over the graph is small, we obtain non-asymptotic bounds on the mean squared error for jointly recovering the x_t's, for the smoothness-penalized least squares estimator. In particular, this implies for certain choices of G that this estimator is weakly consistent (as T \rightarrow \infty) under potentially very stringent sampling, where only one coordinate is measured per vertex for a vanishingly small fraction of the vertices. The results are extended to a "multi-layer" ranking problem where x_t corresponds to the latent strengths of a collection of n items, and noisy pairwise difference measurements are obtained at each "layer" t via a measurement graph G_t. Weak consistency is established for certain choices of G even when the individual G_t's are very sparse and disconnected.
[LG-136] JAPAN: Joint Adaptive Prediction Areas with Normalising-Flows
链接: https://arxiv.org/abs/2505.23196
作者: Eshant English,Christoph Lippert
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Conformal prediction provides a model-agnostic framework for uncertainty quantification with finite-sample validity guarantees, making it an attractive tool for constructing reliable prediction sets. However, existing approaches commonly rely on residual-based conformity scores, which impose geometric constraints and struggle when the underlying distribution is multimodal. In particular, they tend to produce overly conservative prediction areas centred around the mean, often failing to capture the true shape of complex predictive distributions. In this work, we introduce JAPAN (Joint Adaptive Prediction Areas with Normalising-Flows), a conformal prediction framework that uses density-based conformity scores. By leveraging flow-based models, JAPAN estimates the (predictive) density and constructs prediction areas by thresholding on the estimated density scores, enabling compact, potentially disjoint, and context-adaptive regions that retain finite-sample coverage guarantees. We theoretically motivate the efficiency of JAPAN and empirically validate it across multivariate regression and forecasting tasks, demonstrating good calibration and tighter prediction areas compared to existing baselines. We also provide several \emphextensions adding flexibility to our proposed framework.
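"以密度为一致性得分、按阈值截取预测区域"的核心步骤可以用下面的纯 Python 草图示意。这里用一个双峰高斯混合密度代替论文中学习得到的 normalising flow(这是本示例的假设,并非论文实现):

```python
import math
import random

def gmm_density(y):
    """Stand-in predictive density: a two-component Gaussian mixture.
    In JAPAN this role would be played by a learned normalising flow."""
    def phi(v, mu, sigma):
        return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return 0.5 * phi(y, -3.0, 1.0) + 0.5 * phi(y, 3.0, 1.0)

def density_conformal_threshold(cal_points, alpha):
    """Split-conformal calibration with a density conformity score:
    tau is the floor(alpha*(n+1))-th smallest calibration density, so
    the area {y : density(y) >= tau} has finite-sample coverage >= 1 - alpha.
    Thresholding a (possibly multimodal) density lets the area be a
    union of disjoint intervals."""
    scores = sorted(gmm_density(y) for y in cal_points)
    k = max(0, math.floor(alpha * (len(cal_points) + 1)) - 1)
    return scores[k]
```

在这个双峰分布上,预测区域是围绕 ±3 的两段不相交区间;而基于残差的一致性得分会给出一个以均值 0 为中心、覆盖两峰之间低密度区的过宽区间。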
[LG-137] Topological Adaptive Least Mean Squares Algorithms over Simplicial Complexes
链接: https://arxiv.org/abs/2505.23160
作者: Lorenzo Marinucci,Claudio Battiloro,Paolo Di Lorenzo
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:This paper introduces a novel adaptive framework for processing dynamic flow signals over simplicial complexes, extending classical least-mean-squares (LMS) methods to high-order topological domains. Building on discrete Hodge theory, we present a topological LMS algorithm that efficiently processes streaming signals observed over time-varying edge subsets. We provide a detailed stochastic analysis of the algorithm, deriving its stability conditions, steady-state mean-square-error, and convergence speed, while exploring the impact of edge sampling on performance. We also propose strategies to design optimal edge sampling probabilities, minimizing the sampling rate while ensuring the desired estimation accuracy. Assuming partial knowledge of the complex structure (e.g., the underlying graph), we introduce an adaptive topology inference method that integrates with the proposed LMS framework. Additionally, we propose a distributed version of the algorithm and analyze its stability and mean-square-error properties. Empirical results on synthetic and real-world traffic data demonstrate that our approach, in both centralized and distributed settings, outperforms graph-based LMS methods by leveraging higher-order topological features.
[LG-138] Gradient Methods with Online Scaling Part I. Theoretical Foundations
链接: https://arxiv.org/abs/2505.23081
作者: Wenzhi Gao,Ya-Chi Chu,Yinyu Ye,Madeleine Udell
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Extension of arXiv:2411.01803 and arXiv:2502.11229
Abstract:This paper establishes the theoretical foundations of the online scaled gradient methods (OSGM), a framework that utilizes online learning to adapt stepsizes and provably accelerate first-order methods. OSGM quantifies the effectiveness of a stepsize by a feedback function motivated from a convergence measure and uses the feedback to adjust the stepsize through an online learning algorithm. Consequently, instantiations of OSGM achieve convergence rates that are asymptotically no worse than the optimal stepsize. OSGM yields desirable convergence guarantees on smooth convex problems, including 1) trajectory-dependent global convergence on smooth convex objectives; 2) an improved complexity result on smooth strongly convex problems, and 3) local superlinear convergence. Notably, OSGM constitutes a new family of first-order methods with non-asymptotic superlinear convergence, joining the celebrated quasi-Newton methods. Finally, OSGM explains the empirical success of the popular hypergradient-descent heuristic in optimization for machine learning.
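摘要最后提到 OSGM 解释了流行的 hypergradient-descent 启发式。下面给出该启发式的一个极简示意(目标函数与各参数均为示例假设):由于 x_t = x_{t-1} - eta * g_{t-1},f(x_t) 对 eta 的导数为 -<g_t, g_{t-1}>,因此步长本身可按这一反馈信号做在线梯度更新:

```python
def hypergradient_descent(grad, x0, eta0=0.01, beta=0.001, steps=400):
    """Online stepsize adaptation via the hypergradient heuristic
    (the heuristic OSGM formalises): update eta itself by an online
    gradient step on the feedback signal <g_t, g_{t-1}>."""
    x, eta = list(x0), eta0
    g_prev = None
    for _ in range(steps):
        g = grad(x)
        if g_prev is not None:
            # d f(x_t)/d eta = -<g_t, g_{t-1}>, so ascend the inner product
            eta += beta * sum(a * b for a, b in zip(g, g_prev))
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        g_prev = g
    return x, eta
```

在一个简单二次函数上,步长会从保守的初值自动增大到接近最优的量级,随后反馈信号随梯度衰减而自动收敛。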
[LG-139] Improved Last-Iterate Convergence of Shuffling Gradient Methods for Nonsmooth Convex Optimization ICML2025
链接: https://arxiv.org/abs/2505.23056
作者: Zijian Liu,Zhengyuan Zhou
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2025
Abstract:We study the convergence of the shuffling gradient method, a popular algorithm employed to minimize the finite-sum function with regularization, in which functions are passed to apply (Proximal) Gradient Descent (GD) one by one in an order determined by a permutation on the indices of functions. In contrast to its easy implementation and effective performance in practice, its theoretical understanding remains limited. A recent advance by (Liu & Zhou, 2024b) establishes the first last-iterate convergence results under various settings, especially proving the optimal rates for smooth (strongly) convex optimization. However, their bounds for nonsmooth (strongly) convex functions are only as fast as Proximal GD. In this work, we provide the first improved last-iterate analysis for the nonsmooth case, demonstrating that the widely used Random Reshuffle (\textsf{RR}) and Single Shuffle (\textsf{SS}) strategies are both provably faster than Proximal GD, reflecting the benefit of randomness. As an important implication, we give the first (nearly) optimal convergence result for the suffix average under the \textsf{RR} sampling scheme in the general convex case, matching the lower bound shown by (Koren et al., 2022).
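以一个带 L1 正则的一维有限和问题为例,Random Reshuffle 策略下的 shuffling proximal GD 流程可示意如下(目标函数与步长策略均为示例假设,并非论文中的设定):

```python
import random

def soft_threshold(v, t):
    """Proximal operator of t*|x| (soft-thresholding)."""
    return max(abs(v) - t, 0.0) * (1.0 if v >= 0 else -1.0)

def shuffling_prox_gd(a, lam, epochs=300, seed=0):
    """Random Reshuffle (RR) proximal GD for the 1-D problem
    min_x (1/n) * sum_i 0.5*(x - a_i)^2 + lam*|x|.
    Each epoch visits the n components in a fresh random permutation
    and applies one proximal gradient step per component."""
    rng = random.Random(seed)
    idx = list(range(len(a)))
    x = 0.0
    for k in range(1, epochs + 1):
        rng.shuffle(idx)  # RR: draw a new permutation every epoch
        eta = 1.0 / k     # decaying stepsize for the nonsmooth case
        for i in idx:
            x = soft_threshold(x - eta * (x - a[i]), eta * lam)
    return x
```

例如对 a=[1,2,3]、lam=0.5,总目标的最小值点为 soft(mean(a), lam) = 1.5,迭代会收敛到其附近。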
[LG-140] Theoretical Foundations of the Deep Copula Classifier: A Generative Approach to Modeling Dependent Features
链接: https://arxiv.org/abs/2505.22997
作者: Agnideep Aich,Ashit Baran Aich,Bruce Wade
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Submitted
Abstract:Traditional classifiers often assume feature independence or rely on overly simplistic relationships, leading to poor performance in settings where real-world dependencies matter. We introduce the Deep Copula Classifier (DCC), a generative model that separates the learning of each feature's marginal distribution from the modeling of their joint dependence structure via neural network-parameterized copulas. For each class, lightweight neural networks are used to flexibly and adaptively capture feature interactions, making DCC particularly effective when classification is driven by complex dependencies. We establish that DCC converges to the Bayes-optimal classifier under standard conditions and provide explicit convergence rates of O(n^{-r/(2r+d)}) for r-smooth copula densities. Beyond theoretical guarantees, we outline several practical extensions, including high-dimensional scalability through vine and factor copula architectures, semi-supervised learning via entropy regularization, and online adaptation using streaming gradient methods. By unifying statistical rigor with the representational power of neural networks, DCC offers a mathematically grounded and interpretable framework for dependency-aware classification.
[LG-141] Highly Efficient and Effective LLMs with Multi-Boolean Architectures
链接: https://arxiv.org/abs/2505.22811
作者: Ba-Hien Tran,Van Minh Nguyen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Under Review
Abstract:Weight binarization has emerged as a promising strategy to drastically reduce the complexity of large language models (LLMs). It is mainly classified into two approaches: post-training binarization and finetuning with training-aware binarization methods. The first approach, while having low complexity, leads to significant loss of information from the original LLMs, resulting in poor performance. The second approach, on the other hand, relies heavily on full-precision latent weights for gradient approximation of binary weights, which not only remains suboptimal but also introduces substantial complexity. In this paper, we introduce a novel framework that effectively transforms LLMs into multi-kernel Boolean parameters, for the first time, finetunes them directly in the Boolean domain, eliminating the need for expensive latent weights. This significantly reduces complexity during both finetuning and inference. Through extensive and insightful experiments across a wide range of LLMs, we demonstrate that our method outperforms recent ultra low-bit quantization and binarization methods.
[LG-142] Distribution free M-estimation
链接: https://arxiv.org/abs/2505.22807
作者: John C. Duchi
类目: atistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 26 pages
Abstract:The basic question of delineating those statistical problems that are solvable without making any assumptions on the underlying data distribution has long animated statistics and learning theory. This paper characterizes when a (univariate) convex M-estimation or stochastic optimization problem is solvable in such an assumption-free setting, providing a precise dividing line between solvable and unsolvable problems. The conditions we identify show, perhaps surprisingly, that Lipschitz continuity of the loss being minimized is not necessary for distribution free minimization, and they are also distinct from classical characterizations of learnability in machine learning.
[LG-143] Temporal Convolutional Autoencoder for Interference Mitigation in FMCW Radar Altimeters
链接: https://arxiv.org/abs/2505.22783
作者: Charles E. Thornton,Jamie Sloop,Samuel Brown,Aaron Orndorff,William C. Headley,Stephen Young
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 10 pages, 10 figures
Abstract:We investigate the end-to-end altitude estimation performance of a convolutional autoencoder-based interference mitigation approach for frequency-modulated continuous-wave (FMCW) radar altimeters. Specifically, we show that a Temporal Convolutional Network (TCN) autoencoder effectively exploits temporal correlations in the received signal, providing superior interference suppression compared to a Least Mean Squares (LMS) adaptive filter. Unlike existing approaches, the present method operates directly on the received FMCW signal. Additionally, we identify key challenges in applying deep learning to wideband FMCW interference mitigation and outline directions for future research to enhance real-time feasibility and generalization to arbitrary interference conditions.
[LG-144] Finite-Sample Convergence Bounds for Trust Region Policy Optimization in Mean-Field Games
链接: https://arxiv.org/abs/2505.22781
作者: Antonio Ocello,Daniil Tiapkin,Lorenzo Mancini,Mathieu Laurière,Eric Moulines
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:We introduce Mean-Field Trust Region Policy Optimization (MF-TRPO), a novel algorithm designed to compute approximate Nash equilibria for ergodic Mean-Field Games (MFG) in finite state-action spaces. Building on the well-established performance of TRPO in the reinforcement learning (RL) setting, we extend its methodology to the MFG framework, leveraging its stability and robustness in policy optimization. Under standard assumptions in the MFG literature, we provide a rigorous analysis of MF-TRPO, establishing theoretical guarantees on its convergence. Our results cover both the exact formulation of the algorithm and its sample-based counterpart, where we derive high-probability guarantees and finite sample complexity. This work advances MFG optimization by bridging RL techniques with mean-field decision-making, offering a theoretically grounded approach to solving complex multi-agent problems.
[LG-145] Non-convex entropic mean-field optimization via Best Response flow
链接: https://arxiv.org/abs/2505.22760
作者: Razvan-Andrei Lascu,Mateusz B. Majka
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
*备注: 40 pages
Abstract:We study the problem of minimizing non-convex functionals on the space of probability measures, regularized by the relative entropy (KL divergence) with respect to a fixed reference measure, as well as the corresponding problem of solving entropy-regularized non-convex-non-concave min-max problems. We utilize the Best Response flow (also known in the literature as the fictitious play flow) and study how its convergence is influenced by the relation between the degree of non-convexity of the functional under consideration, the regularization parameter and the tail behaviour of the reference measure. In particular, we demonstrate how to choose the regularizer, given the non-convex functional, so that the Best Response operator becomes a contraction with respect to the L^1 -Wasserstein distance, which then ensures the existence of its unique fixed point, which is then shown to be the unique global minimizer for our optimization problem. This extends recent results where the Best Response flow was applied to solve convex optimization problems regularized by the relative entropy with respect to arbitrary reference measures, and with arbitrary values of the regularization parameter. Our results explain precisely how the assumption of convexity can be relaxed, at the expense of making a specific choice of the regularizer. Additionally, we demonstrate how these results can be applied in reinforcement learning in the context of policy optimization for Markov Decision Processes and Markov games with softmax parametrized policies in the mean-field regime.
[LG-146] Information-Computation Gaps in Quantum Learning via Low-Degree Likelihood
链接: https://arxiv.org/abs/2505.22743
作者: Sitan Chen,Weiyuan Gong,Jonas Haferkamp,Yihui Quek
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 88 pages, 2 figures
Abstract:In a variety of physically relevant settings for learning from quantum data, designing protocols that can computationally efficiently extract information remains largely an art, and there are important cases where we believe this to be impossible, that is, where there is an information-computation gap. While there is a large array of tools in the classical literature for giving evidence for average-case hardness of statistical inference problems, the corresponding tools in the quantum literature are far more limited. One such framework in the classical literature, the low-degree method, makes predictions about hardness of inference problems based on the failure of estimators given by low-degree polynomials. In this work, we extend this framework to the quantum setting. We establish a general connection between state designs and low-degree hardness. We use this to obtain the first information-computation gaps for learning Gibbs states of random, sparse, non-local Hamiltonians. We also use it to prove hardness for learning random shallow quantum circuit states in a challenging model where states can be measured in adaptively chosen bases. To our knowledge, the ability to model adaptivity within the low-degree framework was open even in classical settings. In addition, we also obtain a low-degree hardness result for quantum error mitigation against strategies with single-qubit measurements. We define a new quantum generalization of the planted biclique problem and identify the threshold at which this problem becomes computationally hard for protocols that perform local measurements. Interestingly, the complexity landscape for this problem shifts when going from local measurements to more entangled single-copy measurements. We show average-case hardness for the “standard” variant of Learning Stabilizers with Noise and for agnostically learning product states. 
[LG-147] Investigating the effectiveness of multimodal data in forecasting SARS-COV-2 case surges
链接: https://arxiv.org/abs/2505.22688
作者: Palur Venkata Raghuvamsi,Siyuan Brandon Loh,Prasanta Bhattacharya,Joses Ho,Raphael Lee Tze Chuen,Alvin X. Han,Sebastian Maurer-Stroh
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The COVID-19 pandemic response relied heavily on statistical and machine learning models to predict key outcomes such as case prevalence and fatality rates. These predictions were instrumental in enabling timely public health interventions that helped break transmission cycles. While most existing models are grounded in traditional epidemiological data, the potential of alternative datasets, such as those derived from genomic information and human behavior, remains underexplored. In the current study, we investigated the usefulness of diverse modalities of feature sets in predicting case surges. Our results highlight the relative effectiveness of biological (e.g., mutations), public health (e.g., case counts, policy interventions) and human behavioral features (e.g., mobility and social media conversations) in predicting country-level case surges. Importantly, we uncover considerable heterogeneity in predictive performance across countries and feature modalities, suggesting that surge prediction models may need to be tailored to specific national contexts and pandemic phases. Overall, our work highlights the value of integrating alternative data sources into existing disease surveillance frameworks to enhance the prediction of pandemic dynamics.
[LG-148] An Efficient deep learning model to Predict Stock Price Movement Based on Limit Order Book
链接: https://arxiv.org/abs/2505.22678
作者: Jiahao Yang,Ran Fang,Ming Zhang,Jun Zhou
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)
*备注:
Abstract:In high-frequency trading (HFT), leveraging limit order books (LOB) to model stock price movements is crucial for achieving profitable outcomes. However, this task is challenging due to the high-dimensional and volatile nature of the original data. Even recent deep learning models often struggle to capture price movement patterns effectively, particularly without well-designed features. We observed that raw LOB data exhibits inherent symmetry between the ask and bid sides, and the bid-ask differences demonstrate greater stability and lower complexity compared to the original data. Building on this insight, we propose a novel approach which leverages the Siamese architecture to enhance the performance of existing deep learning models. The core idea involves processing the ask and bid sides separately using the same module with shared parameters. We applied our Siamese-based methods to several widely used strong baselines and validated their effectiveness using data from 14 military industry stocks in the Chinese A-share market. Furthermore, we integrated multi-head attention (MHA) mechanisms with the Long Short-Term Memory (LSTM) module to investigate its role in modeling stock price movements. Our experiments used raw data and widely used Order Flow Imbalance (OFI) features as input with some strong baseline models. The results show that our method improves the performance of strong baselines in over 75% of cases, excluding the Multi-Layer Perceptron (MLP) baseline, which performed poorly and is not considered practical. Furthermore, we found that Multi-Head Attention can enhance model performance, particularly over shorter forecasting horizons.
信息检索
[IR-0] Bridging the Gap Between Semantic and User Preference Spaces for Multi-modal Music Representation Learning ICMR2025
链接: https://arxiv.org/abs/2505.23298
作者: Xiaofeng Pan,Jing Chen,Haitong Zhang,Menglin Xing,Jiayi Wei,Xuefeng Mu,Zhongqian Xie
类目: ound (cs.SD); Information Retrieval (cs.IR)
*备注: ICMR 2025
Abstract:Recent works of music representation learning mainly focus on learning acoustic music representations with unlabeled audios or further attempt to acquire multi-modal music representations with scarce annotated audio-text pairs. They either ignore the language semantics or rely on labeled audio datasets that are difficult and expensive to create. Moreover, merely modeling semantic space usually fails to achieve satisfactory performance on music recommendation tasks since the user preference space is ignored. In this paper, we propose a novel Hierarchical Two-stage Contrastive Learning (HTCL) method that models similarity from the semantic perspective to the user perspective hierarchically to learn a comprehensive music representation bridging the gap between semantic and user preference spaces. We devise a scalable audio encoder and leverage a pre-trained BERT model as the text encoder to learn audio-text semantics via large-scale contrastive pre-training. Further, we explore a simple yet effective way to exploit interaction data from our online music platform to adapt the semantic space to user preference space via contrastive fine-tuning, which differs from previous works that follow the idea of collaborative filtering. As a result, we obtain a powerful audio encoder that not only distills language semantics from the text encoder but also models similarity in user preference space with the integrity of semantic space preserved. Experimental results on both music semantic and recommendation tasks confirm the effectiveness of our method.
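HTCL 的两阶段对比训练可以用一个通用的 InfoNCE 式损失来示意(这只是对比损失的常见形式,并非论文的具体损失函数;嵌入向量用普通 Python 列表表示):

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style contrastive loss: pull the (anchor,
    positive) pair together and push the negatives away. In the
    pre-training stage the pair would be (audio, text); in the
    fine-tuning stage it could come from user interaction data."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))
    logits = [cos(anchor, positive) / temperature]
    logits += [cos(anchor, n) / temperature for n in negatives]
    m = max(logits)  # log-sum-exp with max-shift for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]  # -log softmax probability of the positive pair
```

正样本与 anchor 越相似、负样本越不相似,损失越小;两阶段分别在语义空间与用户偏好空间上最小化这类损失。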