This blog post presents the latest paper list retrieved from Arxiv.org on 2025-05-19. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive it by scheduled email, please leave your email address in the comments.
Note: The daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 each day.
Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Table of Contents
Overview (2025-05-19)
539 papers are updated today, including:
- Natural Language Processing: 83 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 175 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 119 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 196 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Modeling cognitive processes of natural reading with transformer-based Language Models
[Quick Read]: This paper examines how effectively language models explain eye-movement behavior during reading (e.g., Gaze Duration), and in particular how accurately they capture the predictability of language for human readers. The key to the approach is evaluating transformer-based language models (GPT2, LLaMA-7B, and LLaMA2-7B) on eye-tracking data from Spanish readers and comparing them with earlier models such as N-grams and LSTM networks. The results show that the transformer models outperform the traditional models in explaining the variance in Gaze Duration, yet still fall short of fully emulating human predictability.
Link: https://arxiv.org/abs/2505.11485
Authors: Bruno Bianchi, Fermín Travi, Juan E. Kamienkowski
Affiliations: Universidad de Buenos Aires; CONICET-Universidad de Buenos Aires; Instituto de Ciencias de la Computación
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in Natural Language Processing (NLP) have led to the development of highly sophisticated language models for text generation. In parallel, neuroscience has increasingly employed these models to explore cognitive processes involved in language comprehension. Previous research has shown that models such as N-grams and LSTM networks can partially account for predictability effects in explaining eye movement behaviors, specifically Gaze Duration, during reading. In this study, we extend these findings by evaluating transformer-based models (GPT2, LLaMA-7B, and LLaMA2-7B) to further investigate this relationship. Our results indicate that these architectures outperform earlier models in explaining the variance in Gaze Durations recorded from Rioplatense Spanish readers. However, similar to previous studies, these models still fail to account for the entirety of the variance captured by human predictability. These findings suggest that, despite their advancements, state-of-the-art language models continue to predict language in ways that differ from human readers.
[NLP-1] SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning
[Quick Read]: This paper tackles the information loss and limited diversity of reasoning paths that arise when test-time compute is scaled in the discrete token space. The key to the solution is SoftCoT++, which perturbs latent thoughts in the continuous latent space and applies contrastive learning to promote diversity among soft thought representations, enabling richer exploration of reasoning paths.
Link: https://arxiv.org/abs/2505.11484
Authors: Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
Affiliations: Nanyang Technological University; Alibaba-NTU Global e-Sustainability CorpLab; KTH Royal Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments: 14 pages
Abstract:Test-Time Scaling (TTS) refers to approaches that improve reasoning performance by allocating extra computation during inference, without altering the model’s parameters. While existing TTS methods operate in a discrete token space by generating more intermediate steps, recent studies in Coconut and SoftCoT have demonstrated that thinking in the continuous latent space can further enhance the reasoning performance. Such latent thoughts encode informative thinking without the information loss associated with autoregressive token generation, sparking increased interest in continuous-space reasoning. Unlike discrete decoding, where repeated sampling enables exploring diverse reasoning paths, latent representations in continuous space are fixed for a given input, which limits diverse exploration, as all decoded paths originate from the same latent thought. To overcome this limitation, we introduce SoftCoT++ to extend SoftCoT to the Test-Time Scaling paradigm by enabling diverse exploration of thinking paths. Specifically, we perturb latent thoughts via multiple specialized initial tokens and apply contrastive learning to promote diversity among soft thought representations. Experiments across five reasoning benchmarks and two distinct LLM architectures demonstrate that SoftCoT++ significantly boosts SoftCoT and also outperforms SoftCoT with self-consistency scaling. Moreover, it shows strong compatibility with conventional scaling techniques such as self-consistency. Source code is available at this https URL.
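To make the mechanism concrete, here is a minimal sketch of a contrastive diversity objective over K perturbed soft-thought vectors, in the spirit of the abstract; the shapes, pooling, and exact loss form are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def soft_thought_diversity_loss(soft_thoughts: torch.Tensor,
                                temperature: float = 0.5) -> torch.Tensor:
    """Contrastive-style penalty that pushes K perturbed soft-thought
    vectors apart so that decoded reasoning paths diversify.

    soft_thoughts: [K, D] pooled latent thoughts, one per specialized
    initial token (shapes and pooling are assumptions for illustration).
    """
    z = F.normalize(soft_thoughts, dim=-1)            # unit-norm vectors
    sim = z @ z.T / temperature                       # [K, K] similarities
    mask = ~torch.eye(z.size(0), dtype=torch.bool)    # drop self-similarity
    return torch.logsumexp(sim[mask], dim=0)          # high when thoughts collapse
```

Minimizing this term alongside the task loss keeps the K representations mutually dissimilar, which is the diversity effect the abstract attributes to contrastive learning.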
[NLP-2] Improving Assembly Code Performance with Large Language Models via Reinforcement Learning
[Quick Read]: This paper addresses the underexplored potential of Large Language Models (LLMs) for optimizing the performance of assembly code. The key solution is a reinforcement learning framework based on Proximal Policy Optimization (PPO) that trains LLMs with a reward combining functional correctness verification and execution performance relative to the industry-standard compiler baseline gcc -O3, enabling effective optimization of assembly code.
Link: https://arxiv.org/abs/2505.11480
Authors: Anjiang Wei, Tarun Suresh, Huanmi Tan, Yinglun Xu, Gagandeep Singh, Ke Wang, Alex Aiken
Affiliations: Stanford University; University of Illinois Urbana-Champaign; Carnegie Mellon University; Visa Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Performance (cs.PF); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments:
Abstract:Large language models (LLMs) have demonstrated strong performance across a wide range of programming tasks, yet their potential for code optimization remains underexplored. This work investigates whether LLMs can optimize the performance of assembly code, where fine-grained control over execution enables improvements that are difficult to express in high-level languages. We present a reinforcement learning framework that trains LLMs using Proximal Policy Optimization (PPO), guided by a reward function that considers both functional correctness, validated through test cases, and execution performance relative to the industry-standard compiler gcc -O3. To support this study, we introduce a benchmark of 8,072 real-world programs. Our model, Qwen2.5-Coder-7B-PPO, achieves 96.0% test pass rates and an average speedup of 1.47x over the gcc -O3 baseline, outperforming all 20 other models evaluated, including Claude-3.7-sonnet. These results indicate that reinforcement learning can unlock the potential of LLMs to serve as effective optimizers for assembly code performance.
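A minimal sketch of a reward of the kind the abstract describes, with correctness gated by test cases and performance measured as speedup over gcc -O3; the exact shaping and weights are assumptions:

```python
def assembly_reward(passed: int, total: int,
                    candidate_secs: float, gcc_o3_secs: float) -> float:
    """Correctness-gated reward: a candidate gets performance credit only
    if it passes every test case; correct candidates are then ranked by
    their speedup over the gcc -O3 baseline."""
    correctness = passed / total
    if passed < total:
        return correctness                  # partial credit, no speed bonus
    speedup = gcc_o3_secs / candidate_secs  # >1.0 means faster than -O3
    return correctness + speedup
```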
[NLP-3] HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
[Quick Read]: This paper aims to supply the high-quality, diverse preference data that reinforcement learning needs to train general-domain, instruction-following language models. The key contribution is HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset of over 40,000 samples covering diverse real-world applications, including STEM, coding, and multilingual scenarios, which substantially improves Reward Model (RM) performance on RM-Bench and JudgeBench.
Link: https://arxiv.org/abs/2505.11475
Authors: Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Hoo-Chang Shin, Felipe Soares, Alexander Bukharin, Ellie Evans, Yi Dong, Oleksii Kuchaiev
Affiliations: NVIDIA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 38 pages, 2 figures
Abstract:Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each subsequent data release raises expectations for future data collection, meaning there is a constant need to advance the quality and diversity of openly available preference data. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We demonstrate HelpSteer3-Preference can also be applied to train Generative RMs and how policy models can be aligned with RLHF using our RMs. Dataset (CC-BY-4.0): this https URL
[NLP-4] No Gold Standard No Problem: Reference-Free Evaluation of Taxonomies
[Quick Read]: This paper addresses reference-free quality evaluation of taxonomies, in particular a type of error existing metrics do not handle: inconsistencies between semantic and taxonomic similarity. The key is two reference-free metrics: the first evaluates robustness by computing the correlation between semantic and taxonomic similarity; the second uses Natural Language Inference to assess logical adequacy. Both are tested on five taxonomies and shown to correlate well with F1 against gold-standard taxonomies.
Link: https://arxiv.org/abs/2505.11470
Authors: Pascal Wullschleger, Majid Zarharan, Donnacha Daly, Marc Pouly, Jennifer Foster
Affiliations: Dublin City University; Lucerne School of Computer Science and Information Technology
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We introduce two reference-free metrics for quality evaluation of taxonomies. The first metric evaluates robustness by calculating the correlation between semantic and taxonomic similarity, covering a type of error not handled by existing metrics. The second uses Natural Language Inference to assess logical adequacy. Both metrics are tested on five taxonomies and are shown to correlate well with F1 against gold-standard taxonomies.
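The first metric can be sketched directly from the abstract: correlate semantic similarity of concept pairs (embedding cosine) with taxonomic similarity (inverse tree distance). The encoder choice and similarity definitions below are assumptions, not the authors' implementation:

```python
from itertools import combinations
import networkx as nx
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

def robustness_score(taxonomy: nx.DiGraph) -> float:
    """Rank correlation between semantic and taxonomic similarity
    over all concept pairs of a (connected) taxonomy."""
    nodes = list(taxonomy.nodes)
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    emb = encoder.encode(nodes, normalize_embeddings=True)
    tree = taxonomy.to_undirected()
    semantic, taxonomic = [], []
    for (i, a), (j, b) in combinations(enumerate(nodes), 2):
        semantic.append(float(emb[i] @ emb[j]))        # cosine similarity
        dist = nx.shortest_path_length(tree, a, b)     # edges between concepts
        taxonomic.append(1.0 / (1.0 + dist))           # distance -> similarity
    return spearmanr(semantic, taxonomic).correlation
```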
[NLP-5] Disentangling Reasoning and Knowledge in Medical Large Language Models
[Quick Read]: This paper addresses the conflation of reasoning and factual recall in current medical QA benchmarks, aiming for a more accurate assessment of medical reasoning in large language models (LLMs). The key is a PubMedBERT classifier, reaching 81% accuracy (close to human performance), that separates 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets. This separation reveals that only 32.8% of questions require complex reasoning and exposes gaps between the knowledge and reasoning performance of both biomedical and general-domain models.
Link: https://arxiv.org/abs/2505.11462
Authors: Rahul Thapa, Qingyang Wu, Kevin Wu, Harrison Zhang, Angela Zhang, Eric Wu, Haotian Ye, Suhana Bedi, Nevin Aresh, Joseph Boen, Shriya Reddy, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou
Affiliations: Stanford University; Together AI; University of California, San Francisco
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Medical reasoning in large language models (LLMs) aims to emulate clinicians’ diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, m1 scores 60.5 on knowledge but only 47.1 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models show more robustness. To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples. It achieves the strongest performance among similarly sized models. Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.
[NLP-6] Is Compression Really Linear with Code Intelligence?
[Quick Read]: This paper asks how the relationship between data compression and the code-intelligence capabilities of Large Language Models (LLMs) can be assessed fairly and effectively on multi-language, multi-task code benchmarks. Prior work posited a linear relationship between compression and general intelligence, but overlooked the multifaceted nature of code and struggled with fair evaluation of modern Code LLMs. The key to the solution is Format Annealing, a lightweight, transparent training methodology designed to assess the intrinsic code intelligence of pre-trained models equitably. Using a novel, large-scale, previously unseen code validation set, the paper finds a fundamentally logarithmic relationship between code intelligence and compression efficacy measured in bits-per-character (BPC), refining the earlier assumption of linearity.
Link: https://arxiv.org/abs/2505.11441
Authors: Xianzhen Luo, Shijie Xuyang, Tianhao Cheng, Zheng Chu, Houyi Li, Ziqi Wang, Siming Huang, Qingfu Zhu, Qiufeng Wang, Xiangyu Zhang, Shuigeng Zhou, Wanxiang Che
Affiliations: Harbin Institute of Technology; Fudan University; StepFun; Megvii
Subjects: Computation and Language (cs.CL)
Comments: work in progress
Abstract:Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs’ code intelligence, we introduce Format Annealing, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC), is determined using a novel, large-scale, and previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest are likely observations of the logarithmic curve’s tail under specific, limited conditions. Our work provides a more nuanced understanding of compression’s role in developing code intelligence and contributes a robust evaluation framework in the code domain.
[NLP-7] GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art ACL2025
[Quick Read]: This paper addresses the shortcomings of Multimodal Large Language Models (MLLMs) in generating creative Video Comment Art, in particular their weakness at grasping cultural and contextual nuance and at producing emotionally resonant or satirical content. Existing benchmarks, constrained in modality and category coverage, cannot comprehensively evaluate such creativity. The key is GODBench, a benchmark that integrates video and text modalities to systematically evaluate MLLMs' ability to compose Comment Art, together with Ripple of Thought (RoT), a multi-step reasoning framework inspired by wave propagation in physics and designed to enhance MLLM creativity.
Link: https://arxiv.org/abs/2505.11436
Authors: Chenkai Zhang, Yiming Lei, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 69 pages, 66 figures, accepted by ACL 2025
Abstract:Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g. mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs’ abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improve creative composing, highlighting its potential to drive meaningful advancements in MLLM-based creativity. GODBench is publicly available at this https URL.
[NLP-8] When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
[Quick Read]: This paper addresses the performance degradation that explicit chain-of-thought (CoT) reasoning can cause on instruction-following tasks. Although CoT excels at complex reasoning, in some cases it leads models to neglect simple constraints or introduce unnecessary content, lowering instruction-following accuracy. The key to the mitigation is selective reasoning, in particular classifier-selective reasoning, which recovers the performance lost to CoT; at its core is a constraint-attention metric that quantifies how much the model attends to instruction-relevant tokens during generation.
Link: https://arxiv.org/abs/2505.11423
Authors: Xiaomin Li, Zhou Yu, Zhiwei Zhang, Xupeng Chen, Ziji Zhang, Yingying Zhuang, Narayanan Sadagopan, Anurag Beniwal
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Reasoning-enhanced large language models (RLLMs), whether explicitly trained for reasoning or prompted via chain-of-thought (CoT), have achieved state-of-the-art performance on many complex reasoning tasks. However, we uncover a surprising and previously overlooked phenomenon: explicit CoT reasoning can significantly degrade instruction-following accuracy. Evaluating 15 models on two benchmarks: IFEval (with simple, rule-verifiable constraints) and ComplexBench (with complex, compositional constraints), we consistently observe performance drops when CoT prompting is applied. Through large-scale case studies and an attention-based analysis, we identify common patterns where reasoning either helps (e.g., with formatting or lexical precision) or hurts (e.g., by neglecting simple constraints or introducing unnecessary content). We propose a metric, constraint attention, to quantify model focus during generation and show that CoT reasoning often diverts attention away from instruction-relevant tokens. To mitigate these effects, we introduce and evaluate four strategies: in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning. Our results demonstrate that selective reasoning strategies, particularly classifier-selective reasoning, can substantially recover lost performance. To our knowledge, this is the first work to systematically expose reasoning-induced failures in instruction-following and offer practical mitigation strategies.
[NLP-9] Towards Cultural Bridge by Bahnaric-Vietnamese Translation Using Transfer Learning of Sequence-To-Sequence Pre-training Language Model
[Quick Read]: This paper addresses the severe resource scarcity in Bahnaric-Vietnamese translation, especially the lack of Bahnaric corpora, including vocabulary, grammar, dialogue patterns, and bilingual text, which greatly hinders the collection of training data. The key to the solution is a transfer learning approach based on a sequence-to-sequence pre-trained language model: a pre-trained Vietnamese language model is further trained on the limited Vietnamese-Bahnaric bilingual resources, transferring from language modeling to machine translation. This mitigates the resource imbalance between the two languages while optimizing the training and computation processes.
Link: https://arxiv.org/abs/2505.11421
Authors: Phan Tran Minh Dat, Vo Hoang Nhat Khang, Quan Thanh Tho
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This work explores the journey towards achieving Bahnaric-Vietnamese translation for the sake of culturally bridging the two ethnic groups in Vietnam. However, translating from Bahnaric to Vietnamese also encounters some difficulties. The most prominent challenge is the lack of available original Bahnaric source-language resources, including vocabulary, grammar, dialogue patterns and bilingual corpus, which hinders the data collection process for training. To address this, we leverage a transfer learning approach using sequence-to-sequence pre-training language model. First of all, we leverage a pre-trained Vietnamese language model to capture the characteristics of this language. Especially, to further serve the purpose of machine translation, we aim for a sequence-to-sequence model, not encoder-only like BERT or decoder-only like GPT. Taking advantage of significant similarity between the two languages, we continue training the model with the currently limited bilingual resources of Vietnamese-Bahnaric text to perform the transfer learning from language model to machine translation. Thus, this approach can help to handle the problem of imbalanced resources between two languages, while also optimizing the training and computational processes. Additionally, we also enhanced the datasets using data augmentation to generate additional resources and defined some heuristic methods to make the translation more precise. Our approach has been validated to be highly effective for the Bahnaric-Vietnamese translation model, contributing to the expansion and preservation of languages, and facilitating better mutual understanding between the two ethnic people.
[NLP-10] CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs
[Quick Read]: This paper addresses the safety, alignment, and susceptibility to adversarial manipulation of large language models (LLMs) in medical settings. Existing benchmarks fall short on clinical specificity, graded harmfulness levels, and coverage of jailbreak-style attacks. The key to the solution is CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark of over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles (direct, indirect, obfuscated, and role-play) that simulate both malicious and benign use cases. It introduces a three-way response evaluation protocol (Accept, Caution, Refuse) and a fine-grained Safety Score metric for assessing model behavior, and additionally proposes a lightweight classifier that detects jailbreak attempts and steers models toward safer behavior via reminder-based conditioning.
Link: https://arxiv.org/abs/2505.11413
Authors: Sijia Chen, Xiaomin Li, Mengxue Zhang, Eric Hanchen Jiang, Qingcheng Zeng, Chen-Hsiang Yu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) are increasingly deployed in medical contexts, raising critical concerns about safety, alignment, and susceptibility to adversarial manipulation. While prior benchmarks assess model refusal capabilities for harmful prompts, they often lack clinical specificity, graded harmfulness levels, and coverage of jailbreak-style attacks. We introduce CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for evaluating LLM safety in healthcare. CARES includes over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles: direct, indirect, obfuscated, and role-play, to simulate both malicious and benign use cases. We propose a three-way response evaluation protocol (Accept, Caution, Refuse) and a fine-grained Safety Score metric to assess model behavior. Our analysis reveals that many state-of-the-art LLMs remain vulnerable to jailbreaks that subtly rephrase harmful prompts, while also over-refusing safe but atypically phrased queries. Finally, we propose a mitigation strategy using a lightweight classifier to detect jailbreak attempts and steer models toward safer behavior via reminder-based conditioning. CARES provides a rigorous framework for testing and improving medical LLM safety under adversarial and ambiguous conditions.
[NLP-11] Visual Planning: Let's Think Only with Images
[Quick Read]: This paper addresses the limitations of relying on pure text as the reasoning medium in current Large Language Models (LLMs) and their multimodal extensions (MLLMs), especially for tasks involving spatial and geometric information. The key to the solution is a new paradigm, Visual Planning, which plans through purely visual representations, independent of text: sequences of images encode step-by-step inference in the visual domain, enabling more natural and effective machine reasoning.
Link: https://arxiv.org/abs/2505.11409
Authors: Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić
Affiliations: University of Cambridge; University College London; Google
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, 1 table (26 pages, 12 figures, 8 tables including references and appendices)
Abstract:Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
[NLP-12] Large Language Model Use Impacts Locus of Control
[Quick Read]: This paper explores the psychological impact of co-writing with AI on users' self-perception and locus of control. The key finding is that employment status moderates both reliance on AI and shifts in locus of control: employed participants showed higher reliance on AI and a shift toward internal control, whereas unemployed users tended to perceive a reduction in personal agency. These results provide empirical grounding for understanding AI's role in shaping personal agency and identity.
Link: https://arxiv.org/abs/2505.11406
Authors: Jenny Xiyu Fu, Brennan Antone, Kowe Kadoma, Malte Jung
Affiliations: Cornell University
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:As AI tools increasingly shape how we write, they may also quietly reshape how we perceive ourselves. This paper explores the psychological impact of co-writing with AI on people’s locus of control. Through an empirical study with 462 participants, we found that employment status plays a critical role in shaping users’ reliance on AI and their locus of control. Current results demonstrated that employed participants displayed higher reliance on AI and a shift toward internal control, while unemployed users tended to experience a reduction in personal agency. Through quantitative results and qualitative observations, this study opens a broader conversation about AI’s role in shaping personal agency and identity.
[NLP-13] EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models
[Quick Read]: This paper addresses the hallucinations that Multimodal Large Language Models (MLLMs) produce in emotion-understanding tasks, i.e., generating emotion-irrelevant or nonsensical content. The key to the solution is EmotionHallucer, a benchmark for detecting and analyzing emotion hallucinations in MLLMs. It assesses models along two dimensions, emotion psychology knowledge and real-world multimodal perception, using an adversarial binary question-answer (QA) framework with carefully crafted basic and hallucinated pairs to probe hallucination tendencies. The study additionally proposes the PEP-MEK framework to improve emotion hallucination detection.
Link: https://arxiv.org/abs/2505.11405
Authors: Bohao Xing, Xin Liu, Guoying Zhao, Chengyu Liu, Xiaolan Fu, Heikki Kälviäinen
Affiliations: Lappeenranta-Lahti University of Technology LUT; University of Oulu; Southeast University; Shanghai Jiao Tong University; Brno University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Emotion understanding is a critical yet challenging task. Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities in this area. However, MLLMs often suffer from hallucinations, generating irrelevant or nonsensical content. To the best of our knowledge, despite the importance of this issue, there has been no dedicated effort to evaluate emotion-related hallucinations in MLLMs. In this work, we introduce EmotionHallucer, the first benchmark for detecting and analyzing emotion hallucinations in MLLMs. Unlike humans, whose emotion understanding stems from the interplay of biology and social learning, MLLMs rely solely on data-driven learning and lack innate emotional instincts. Fortunately, emotion psychology provides a solid foundation of knowledge about human emotions. Building on this, we assess emotion hallucinations from two dimensions: emotion psychology knowledge and real-world multimodal perception. To support robust evaluation, we utilize an adversarial binary question-answer (QA) framework, which employs carefully crafted basic and hallucinated pairs to assess the emotion hallucination tendencies of MLLMs. By evaluating 38 LLMs and MLLMs on EmotionHallucer, we reveal that: i) most current models exhibit substantial issues with emotion hallucinations; ii) closed-source models outperform open-source ones in detecting emotion hallucinations, and reasoning capability provides additional advantages; iii) existing models perform better in emotion psychology knowledge than in multimodal emotion perception. As a byproduct, these findings inspire us to propose the PEP-MEK framework, which yields an average improvement of 9.90% in emotion hallucination detection across selected models. Resources will be available at this https URL.
[NLP-14] A computational system to handle the orthographic layer of tajwid in contemporary Quranic Orthography
[Quick Read]: This paper addresses how to systematically analyze and process the recitation rules of the Quran (tajwid) as they appear in Contemporary Quranic Orthography (CQO), so that its phonetic and prosodic processes can be studied precisely. The key to the solution is a Python module that can accurately remove or add the tajwid orthographic layer of a text in CQO, enabling systematic analysis of the complete text of the Cairo Quran and providing a computational basis for aligning and comparing Quranic manuscripts.
Link: https://arxiv.org/abs/2505.11379
Authors: Alicia González Martínez
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Contemporary Quranic Orthography (CQO) relies on a precise system of phonetic notation that can be traced back to the early stages of Islam, when the Quran was mainly oral in nature and the first written renderings of it served as memory aids for this oral tradition. The early systems of diacritical marks created on top of the Quranic Consonantal Text (QCT) motivated the creation and further development of a fine-grained system of phonetic notation that represented tajwid-the rules of recitation. We explored the systematicity of the rules of tajwid, as they are encountered in the Cairo Quran, using a fully and accurately encoded digital edition of the Quranic text. For this purpose, we developed a python module that can remove or add the orthographic layer of tajwid from a Quranic text in CQO. The interesting characteristic of these two sets of rules is that they address the complete Quranic text of the Cairo Quran, so they can be used as precise witnesses to study its phonetic and prosodic processes. From a computational point of view, the text of the Cairo Quran can be used as a linchpin to align and compare Quranic manuscripts, due to its richness and completeness. This will let us create a very powerful framework to work with the Arabic script, not just within an isolated text, but automatically exploring a specific textual phenomenon in other connected manuscripts. Having all the texts mapped among each other can serve as a powerful tool to study the nature of the notation systems of diacritics added to the consonantal skeleton.
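The paper's module implements the full tajwid rule set; purely as an illustration of the "remove the orthographic layer" direction, a crude approximation strips combining diacritics from Arabic text with the standard library (a simplification, not the module's actual logic):

```python
import unicodedata

def strip_diacritic_layer(text: str) -> str:
    """Remove combining marks (harakat and tajwid annotations) from Arabic
    text, leaving the consonantal skeleton."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritic_layer("بِسْمِ اللَّهِ"))  # -> بسم الله
```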
[NLP-15] GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents ACL2025
[Quick Read]: This paper addresses the lack of a comprehensive benchmark for evaluating how well large language models (LLMs) follow domain-oriented guidelines. Although LLMs have made notable progress on general-domain instruction following, real-world deployments often require following domain rules that may conflict with commonsense knowledge, and existing work offers no effective way to assess this. The key is GuideBench, a comprehensive benchmark that evaluates guideline-following performance on three critical aspects: adherence to diverse rules, robustness to rule updates, and alignment with human preferences, providing a basis for improving LLMs on domain-oriented tasks.
Link: https://arxiv.org/abs/2505.11368
Authors: Lingxiao Diao, Xinyue Xu, Wanxuan Sun, Cheng Yang, Zhuosheng Zhang
Affiliations: Shanghai Jiao Tong University; ByteDance
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 Main Conference
Abstract:Large language models (LLMs) have been widely deployed as autonomous agents capable of following user instructions and making decisions in real-world applications. Previous studies have made notable progress in benchmarking the instruction following capabilities of LLMs in general domains, with a primary focus on their inherent commonsense knowledge. Recently, LLMs have been increasingly deployed as domain-oriented agents, which rely on domain-oriented guidelines that may conflict with their commonsense knowledge. These guidelines exhibit two key characteristics: they consist of a wide range of domain-oriented rules and are subject to frequent updates. Despite these challenges, the absence of comprehensive benchmarks for evaluating the domain-oriented guideline following capabilities of LLMs presents a significant obstacle to their effective assessment and further development. In this paper, we introduce GuideBench, a comprehensive benchmark designed to evaluate guideline following performance of LLMs. GuideBench evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences. Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain-oriented guidelines.
[NLP-16] Phare: A Safety Probe for Large Language Models
[Quick Read]: This paper addresses safety and reliability risks of large language models (LLMs), noting that existing evaluations emphasize performance over identifying failure modes. The key is Phare, a multilingual diagnostic framework that probes and evaluates LLM behavior along three critical dimensions: hallucination and reliability, social biases, and harmful content generation, revealing systematic vulnerabilities and providing actionable insights for building more robust, aligned, and trustworthy language systems.
Link: https://arxiv.org/abs/2505.11365
Authors: Pierre Le Jeune, Benoît Malésieux, Weixuan Xiao, Matteo Dora
Affiliations: Giskard AI
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:
Abstract:Ensuring the safety of large language models (LLMs) is critical for responsible deployment, yet existing evaluations often prioritize performance over identifying failure modes. We introduce Phare, a multilingual diagnostic framework to probe and evaluate LLM behavior across three critical dimensions: hallucination and reliability, social biases, and harmful content generation. Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic vulnerabilities across all safety dimensions, including sycophancy, prompt sensitivity, and stereotype reproduction. By highlighting these specific failure modes rather than simply ranking models, Phare provides researchers and practitioners with actionable insights to build more robust, aligned, and trustworthy language systems.
[NLP-17] LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors
[Quick Read]: This paper addresses how to effectively combine large pre-trained speech encoders with large language models (LLMs) for spoken language processing, especially automatic speech recognition (ASR) and speech translation, where existing approaches based on continuous speech prompts or ASR error correction are either suboptimal or inflexible. The key is LegoSLM, a framework that bridges speech encoders and LLMs through ASR posterior matrices: the speech encoder produces CTC posteriors over the LLM vocabulary, which are used to reconstruct pseudo-audio embeddings as weighted sums of the LLM's input embeddings; these are then concatenated with text embeddings in the LLM input space, fusing the cross-modal information.
Link: https://arxiv.org/abs/2505.11352
Authors: Rao Ma, Tongzhou Chen, Kartik Audhkhasi, Bhuvana Ramabhadran
Affiliations: University of Cambridge; Google Deepmind
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Recently, large-scale pre-trained speech encoders and Large Language Models (LLMs) have been released, which show state-of-the-art performance on a range of spoken language processing tasks including Automatic Speech Recognition (ASR). To effectively combine both models for better performance, continuous speech prompts, and ASR error correction have been adopted. However, these methods are prone to suboptimal performance or are inflexible. In this paper, we propose a new paradigm, LegoSLM, that bridges speech encoders and LLMs using the ASR posterior matrices. The speech encoder is trained to generate Connectionist Temporal Classification (CTC) posteriors over the LLM vocabulary, which are used to reconstruct pseudo-audio embeddings by computing a weighted sum of the LLM input embeddings. These embeddings are concatenated with text embeddings in the LLM input space. Using the well-performing USM and Gemma models as an example, we demonstrate that our proposed LegoSLM method yields good performance on both ASR and speech translation tasks. By connecting USM with Gemma models, we can get an average of 49% WERR over the USM-CTC baseline on 8 MLS testsets. The trained model also exhibits modularity in a range of settings – after fine-tuning the Gemma model weights, the speech encoder can be switched and combined with the LLM in a zero-shot fashion. Additionally, we propose to control the decode-time influence of the USM and LLM using a softmax temperature, which shows effectiveness in domain adaptation.
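The bridging step is simple enough to state in a few lines. The sketch below follows the abstract directly (CTC posteriors over the LLM vocabulary, a temperature-scaled softmax, and a weighted sum of the LLM's input embeddings); the tensor shapes are assumptions:

```python
import torch

def pseudo_audio_embeddings(ctc_logits: torch.Tensor,
                            llm_embedding_table: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """LegoSLM-style bridging: CTC posteriors over the LLM vocabulary
    weight the LLM's input embeddings.

    ctc_logits:          [T, V] frame-level logits over the LLM vocabulary
    llm_embedding_table: [V, D] LLM input embedding matrix
    returns:             [T, D] pseudo-audio embeddings, to be concatenated
                         with text embeddings in the LLM input space
    """
    posteriors = torch.softmax(ctc_logits / temperature, dim=-1)  # [T, V]
    return posteriors @ llm_embedding_table                       # [T, D]
```

The `temperature` argument mirrors the abstract's decode-time control of how strongly the speech encoder's posteriors influence the LLM.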
[NLP-18] Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models
[Quick Read]: This paper addresses the lack of suitable datasets and automatic evaluation standards for Critical Questions Generation (CQs-Gen), which has hindered progress in this area. The key to the solution is the construction of the first large-scale manually annotated dataset, together with an investigation of automatic evaluation methods that identifies a reference-based technique using large language models (LLMs) as the strategy that correlates best with human judgments.
Link: https://arxiv.org/abs/2505.11341
Authors: Banca Calvo Figueras, Rodrigo Agerri
Affiliations: HiTZ Center - Ixa; University of the Basque Country UPV/EHU
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The task of Critical Questions Generation (CQs-Gen) aims to foster critical thinking by enabling systems to generate questions that expose assumptions and challenge the reasoning in arguments. Despite growing interest in this area, progress has been hindered by the lack of suitable datasets and automatic evaluation standards. This work presents a comprehensive approach to support the development and benchmarking of systems for this task. We construct the first large-scale manually-annotated dataset. We also investigate automatic evaluation methods and identify a reference-based technique using large language models (LLMs) as the strategy that best correlates with human judgments. Our zero-shot evaluation of 11 LLMs establishes a strong baseline while showcasing the difficulty of the task. Data, code, and a public leaderboard are provided to encourage further research not only in terms of model performance, but also to explore the practical benefits of CQs-Gen for both automated reasoning and human critical thinking.
[NLP-19] XtraGPT: LLMs for Human-AI Collaboration on Controllable Academic Paper Revision
[Quick Read]: This paper addresses the limitations of large language models (LLMs) in supporting academic writing, particularly their inability to meet sophisticated demands of research communication such as conceptual coherence across sections; most existing systems target general-purpose scientific text generation and do not support the inherently iterative, revision-driven nature of academic writing. The key is a human-AI collaboration framework for academic paper revision: a dataset of 7,040 research papers from top-tier venues annotated with over 140,000 instruction-response pairs reflecting realistic, section-level scientific revisions, on which the authors build XtraGPT, the first suite of open-source LLMs (1.5B to 14B parameters) for context-aware, instruction-guided writing assistance.
Link: https://arxiv.org/abs/2505.11336
Authors: Nuo Chen, Andre Lin HuiKai, Jiaying Wu, Junyi Hou, Zining Zhang, Qian Wang, Xidong Wang, Bingsheng He
Affiliations: National University of Singapore; The Chinese University of Hong Kong, Shenzhen
Subjects: Computation and Language (cs.CL)
Comments: preprint
Abstract:Despite the growing adoption of large language models (LLMs) in academic workflows, their capabilities remain limited when it comes to supporting high-quality scientific writing. Most existing systems are designed for general-purpose scientific text generation and fail to meet the sophisticated demands of research communication beyond surface-level polishing, such as conceptual coherence across sections. Furthermore, academic writing is inherently iterative and revision-driven, a process not well supported by direct prompting-based paradigms. To address these scenarios, we propose a human-AI collaboration framework for academic paper revision. We first introduce a comprehensive dataset of 7,040 research papers from top-tier venues annotated with over 140,000 instruction-response pairs that reflect realistic, section-level scientific revisions. Building on the dataset, we develop XtraGPT, the first suite of open-source LLMs, designed to provide context-aware, instruction-guided writing assistance, ranging from 1.5B to 14B parameters. Extensive experiments validate that XtraGPT significantly outperforms same-scale baselines and approaches the quality of proprietary systems. Both automated preference assessments and human evaluations confirm the effectiveness of our models in improving scientific drafts.
[NLP-20] CROC: Evaluating and Training T2I Metrics with Pseudo- and Human-Labeled Contrastive Robustness Checks
[Quick Read]: This paper addresses meta-evaluation of evaluation metrics for text-to-image (T2I) generation, in particular their lack of robustness on complex image properties; human meta-evaluation is costly and slow, and automated alternatives are scarce. The key is CROC (Contrastive Robustness Checks), a scalable framework that synthesizes contrastive test cases to systematically probe and quantify metric robustness. It yields a large pseudo-labeled dataset (CROC^syn) for fine-grained comparison of evaluation metrics, on which the state-of-the-art CROCScore metric is trained, and is complemented by a human-supervised benchmark (CROC^hum) targeting especially challenging categories, which exposes shortcomings of existing metrics.
Link: https://arxiv.org/abs/2505.11314
Authors: Christoph Leiter, Yuki M. Asano, Margret Keuper, Steffen Eger
Affiliations: University of Mannheim; University of Technology Nuremberg; Max Planck Institute for Informatics
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: preprint
Abstract:The assessment of evaluation metrics (meta-evaluation) is crucial for determining the suitability of existing metrics in text-to-image (T2I) generation tasks. Human-based meta-evaluation is costly and time-intensive, and automated alternatives are scarce. We address this gap and propose CROC: a scalable framework for automated Contrastive Robustness Checks that systematically probes and quantifies metric robustness by synthesizing contrastive test cases across a comprehensive taxonomy of image properties. With CROC, we generate a pseudo-labeled dataset (CROC^syn) of over one million contrastive prompt-image pairs to enable a fine-grained comparison of evaluation metrics. We also use the dataset to train CROCScore, a new metric that achieves state-of-the-art performance among open-source methods, demonstrating an additional key application of our framework. To complement this dataset, we introduce a human-supervised benchmark (CROC^hum) targeting especially challenging categories. Our results highlight robustness issues in existing metrics: for example, many fail on prompts involving negation, and all tested open-source metrics fail on at least 25% of cases involving correct identification of body parts.
[NLP-21] Probing Subphonemes in Morphology Models
[Quick Read]: This paper investigates why transformer models for morphological inflection generalize only to a limited extent across languages and morphological rules, probing how well they capture implicit phenomena at the phonological and subphonemic levels. The key is a language-agnostic probing method for phonological feature encoding in transformers trained directly on phonemes, applied across seven morphologically diverse languages, revealing how different phonological features are represented in the model and what this implies for training morphological models.
Link: https://arxiv.org/abs/2505.11297
Authors: Gal Astrach, Yuval Pinter
Affiliations: Ben-Gurion University of the Negev
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Transformers have achieved state-of-the-art performance in morphological inflection tasks, yet their ability to generalize across languages and morphological rules remains limited. One possible explanation for this behavior can be the degree to which these models are able to capture implicit phenomena at the phonological and subphonemic levels. We introduce a language-agnostic probing method to investigate phonological feature encoding in transformers trained directly on phonemes, and perform it across seven morphologically diverse languages. We show that phonological features which are local, such as final-obstruent devoicing in Turkish, are captured well in phoneme embeddings, whereas long-distance dependencies like vowel harmony are better represented in the transformer’s encoder. Finally, we discuss how these findings inform empirical strategies for training morphological models, particularly regarding the role of subphonemic feature acquisition.
[NLP-22] Temporal fine-tuning for early risk detection
[Quick Read]: This paper addresses Early Risk Detection (ERD) on the Web, i.e., promptly identifying users facing social and health issues. Conventional approaches struggle to jointly optimize classification precision and detection delay in critical scenarios, where standard classification metrics may not suffice. The key is temporal fine-tuning, which tunes transformer-based models by explicitly incorporating time into the learning process: it analyzes complete user post histories, tunes models under different contexts, and evaluates training with temporal metrics, unifying precision and speed in a single objective.
Link: https://arxiv.org/abs/2505.11280
Authors: Horacio Thompson, Esaú Villatoro-Tello, Manuel Montes-y-Gómez, Marcelo Errecalde
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: In: Proceedings of the 53rd JAIIO / 50th CLEI - ASAID, 2024, p. 137. ISSN: 2451-7496
Abstract:Early Risk Detection (ERD) on the Web aims to identify promptly users facing social and health issues. Users are analyzed post-by-post, and it is necessary to guarantee correct and quick answers, which is particularly challenging in critical scenarios. ERD involves optimizing classification precision and minimizing detection delay. Standard classification metrics may not suffice, resorting to specific metrics such as ERDE(theta) that explicitly consider precision and delay. The current research focuses on applying a multi-objective approach, prioritizing classification performance and establishing a separate criterion for decision time. In this work, we propose a completely different strategy, temporal fine-tuning, which allows tuning transformer-based models by explicitly incorporating time within the learning process. Our method allows us to analyze complete user post histories, tune models considering different contexts, and evaluate training performance using temporal metrics. We evaluated our proposal in the depression and eating disorders tasks for the Spanish language, achieving competitive results compared to the best models of MentalRiskES 2023. We found that temporal fine-tuning optimized decisions considering context and time progress. In this way, by properly taking advantage of the power of transformers, it is possible to address ERD by combining precision and speed as a single objective.
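For context, the delay-aware metric ERDE(theta) cited in the abstract is usually defined as below in the eRisk literature; the default costs here are typical settings given as assumptions, not necessarily the paper's exact configuration:

```python
import math

def erde(decision: bool, truth: bool, k: int, theta: int,
         c_fp: float = 0.1296, c_fn: float = 1.0, c_tp: float = 1.0) -> float:
    """Per-user ERDE(theta): late true positives pay a sigmoid latency cost
    that grows with k, the number of posts read before deciding."""
    if decision and not truth:
        return c_fp                                    # false positive
    if not decision and truth:
        return c_fn                                    # missed at-risk user
    if decision and truth:
        latency = 1.0 - 1.0 / (1.0 + math.exp(k - theta))
        return latency * c_tp                          # true positive, delayed
    return 0.0                                         # true negative
```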
[NLP-23] Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLM s
[Quick Read]: This paper addresses the limits that a fixed knowledge reservoir places on the reasoning of large language models (LLMs): existing retrieval-augmented methods often fetch irrelevant or noisy information that hinders accurate reasoning. The key is AutoRefine, a reinforcement learning post-training framework adopting a new "search-and-refine-during-think" paradigm. AutoRefine inserts explicit knowledge refinement steps between successive search calls, letting the model iteratively filter, distill, and organize evidence before generating an answer, and further combines tailored retrieval-specific rewards with answer correctness rewards to improve reasoning.
Link: https://arxiv.org/abs/2505.11277
Authors: Yaorui Shi, Shihan Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, Xiang Wang
Affiliations: University of Science and Technology of China; National University of Singapore; DP Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models have demonstrated impressive reasoning capabilities but are inherently limited by their knowledge reservoir. Retrieval-augmented reasoning mitigates this limitation by allowing LLMs to query external resources, but existing methods often retrieve irrelevant or noisy information, hindering accurate reasoning. In this paper, we propose AutoRefine, a reinforcement learning post-training framework that adopts a new "search-and-refine-during-think" paradigm. AutoRefine introduces explicit knowledge refinement steps between successive search calls, enabling the model to iteratively filter, distill, and organize evidence before generating an answer. Furthermore, we incorporate tailored retrieval-specific rewards alongside answer correctness rewards using group relative policy optimization. Experiments on single-hop and multi-hop QA benchmarks demonstrate that AutoRefine significantly outperforms existing approaches, particularly in complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine issues frequent, higher-quality searches and synthesizes evidence effectively.
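Schematically, the loop interleaves searching with an explicit refinement step. The sketch below illustrates the paradigm only; `llm` and `search` are caller-supplied stand-ins and the prompts are invented, not the paper's templates:

```python
def answer_with_refinement(question: str, llm, search, max_rounds: int = 3) -> str:
    """Alternate search calls with an explicit refinement step that filters
    and distills the retrieved evidence before the final answer."""
    notes = ""
    for _ in range(max_rounds):
        query = llm(f"Question: {question}\nEvidence so far:\n{notes}\n"
                    "Reply with a search query, or 'DONE' if evidence suffices.")
        if query.strip() == "DONE":
            break
        docs = search(query)
        # Refinement between successive search calls: keep only what matters.
        notes += llm(f"From these documents:\n{docs}\n"
                     f"Distill only the evidence relevant to: {question}") + "\n"
    return llm(f"Question: {question}\nRefined evidence:\n{notes}\nFinal answer:")
```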
[NLP-24] SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning
[Quick Read]: This paper addresses the inefficiency of large reasoning models that over-process both trivial and complex queries, wasting resources and prolonging user latency. The key is SelfBudgeter, a self-adaptive, controllable reasoning strategy trained in two phases: first, the model learns to pre-estimate the reasoning cost from the difficulty of the query; then, budget-guided GRPO reinforcement learning reduces output length while effectively maintaining accuracy. The method lets users anticipate generation time and decide whether to continue or interrupt the process, and supports direct control of reasoning length by pre-filling the token budget.
Link: https://arxiv.org/abs/2505.11274
Authors: Zheng Li, Qingxiu Dong, Jingyuan Ma, Di Zhang, Zhifang Sui
Affiliations: Peking University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Recently, large reasoning models demonstrate exceptional performance on various tasks. However, reasoning models inefficiently over-process both trivial and complex queries, leading to resource waste and prolonged user latency. To address this challenge, we propose SelfBudgeter - a self-adaptive controllable reasoning strategy for efficient reasoning. Our approach adopts a dual-phase training paradigm: first, the model learns to pre-estimate the reasoning cost based on the difficulty of the query. Then, we introduce budget-guided GRPO for reinforcement learning, which effectively maintains accuracy while reducing output length. SelfBudgeter allows users to anticipate generation time and make informed decisions about continuing or interrupting the process. Furthermore, our method enables direct manipulation of reasoning length via pre-filling token budget. Experimental results demonstrate that SelfBudgeter can rationally allocate budgets according to problem complexity, achieving up to 74.47% response length compression on the MATH benchmark while maintaining nearly undiminished accuracy.
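A minimal sketch of a budget-guided reward of the kind the abstract describes, with the penalty weight as an assumption rather than the paper's setting:

```python
def budget_guided_reward(correct: bool, response_len: int, budget: int) -> float:
    """Keep accuracy primary; penalize tokens spent beyond the model's own
    pre-estimated budget so that easy queries stay short."""
    accuracy = 1.0 if correct else -1.0
    overshoot = max(0, response_len - budget) / max(budget, 1)
    return accuracy - 0.5 * overshoot   # 0.5 is an assumed penalty weight
```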
[NLP-25] Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models
[Quick Read]: This paper addresses the high computational overhead, memory usage, and network bandwidth incurred when processing long contexts in distributed systems. The key is a novel semantic caching approach that stores and reuses intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based question-answering workflows. By cutting redundant computation, the method reduces computational overhead by 50-60% while maintaining answer accuracy comparable to full document processing.
Link: https://arxiv.org/abs/2505.11271
Authors: Camille Couturier, Spyros Mastorakis, Haiying Shen, Saravan Rajmohan, Victor Rühle
Affiliations: Microsoft 365 Research; University of Virginia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Preprint. Paper accepted at ICCCN 2025, the final version will appear in the proceedings
Abstract:Large Language Models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question-answering and retrieval-augmented generation. However, processing lengthy contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth. This paper introduces a novel semantic caching approach for storing and reusing intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based QA workflows. Our method reduces redundant computations by up to 50-60% while maintaining answer accuracy comparable to full document processing, as demonstrated on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset. This approach balances computational cost and response quality, critical for real-time AI assistants.
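The core idea, reusing a cached summary when a new query is semantically close to one already answered, fits in a small class. The encoder choice and threshold below are assumptions for illustration, not the paper's configuration:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

class SummaryCache:
    """Semantic cache of contextual summaries keyed by query embeddings."""

    def __init__(self, threshold: float = 0.85):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.keys, self.summaries = [], []
        self.threshold = threshold

    def lookup(self, query: str):
        """Return a cached summary if a semantically similar query exists."""
        if not self.keys:
            return None
        q = self.encoder.encode(query, normalize_embeddings=True)
        sims = np.stack(self.keys) @ q          # cosine similarities
        best = int(np.argmax(sims))
        return self.summaries[best] if sims[best] >= self.threshold else None

    def store(self, query: str, summary: str):
        self.keys.append(self.encoder.encode(query, normalize_embeddings=True))
        self.summaries.append(summary)
```

On a cache hit, the QA system can answer from the compact summary instead of re-processing the full document, which is where the reported 50-60% savings would come from.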
[NLP-26] HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization
[Quick Read]: This paper addresses the verbosity and inference cost that arise when large language models (LLMs) scale response length at test time to boost reasoning performance. The key is History-Aware Policy Optimization (HAPO), which tracks a per-problem history state (e.g., the minimum length over previously generated correct responses) and uses a length reward based on that state to incentivize correct solutions more concise than those found before, jointly optimizing correctness and efficiency.
Link: https://arxiv.org/abs/2505.11225
Authors: Chengyu Huang, Zhengxin Zhang, Claire Cardie
Affiliations: Cornell University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:While scaling the length of responses at test-time has been shown to markedly improve the reasoning abilities and performance of large language models (LLMs), it often results in verbose outputs and increases inference cost. Prior approaches for efficient test-time scaling, typically using universal budget constraints or query-level length optimization, do not leverage historical information from previous encounters with the same problem during training. We hypothesize that this limits their ability to progressively make solutions more concise over time. To address this, we present History-Aware Policy Optimization (HAPO), which keeps track of a history state (e.g., the minimum length over previously generated correct responses) for each problem. HAPO employs a novel length reward function based on this history state to incentivize the discovery of correct solutions that are more concise than those previously found. Crucially, this reward structure avoids overly penalizing shorter incorrect responses with the goal of facilitating exploration towards more efficient solutions. By combining this length reward with a correctness reward, HAPO jointly optimizes for correctness and efficiency. We use HAPO to train DeepSeek-R1-Distill-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Qwen-2.5-1.5B-Instruct, and evaluate HAPO on several math benchmarks that span various difficulty levels. Experiment results demonstrate that HAPO effectively induces LLMs’ concise reasoning abilities, producing length reductions of 33-59% with accuracy drops of only 2-5%.
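A sketch of a history-aware length reward in the spirit of the abstract (reward correct answers that beat the shortest previous correct answer, keep penalties mild for short incorrect ones); the exact shaping and constants are assumptions:

```python
def history_aware_reward(correct: bool, length: int,
                         history_min: int | None) -> float:
    """Reward correct answers shorter than the best previous correct answer;
    keep penalties mild for short incorrect attempts to preserve exploration."""
    if correct:
        if history_min is None or length < history_min:
            return 1.0                                   # new concise record
        return 1.0 - 0.5 * min(1.0, (length - history_min) / history_min)
    return -0.1 if history_min is not None and length < history_min else -0.5
```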
[NLP-27] Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese
[Quick Read]: This paper addresses the subjectivity, environmental inconsistency, and limited interpretability of existing text-to-speech (TTS) evaluation, as well as the lack of multi-dimensional evaluation datasets, which in Chinese TTS neglect key factors such as speaking styles, context diversity, and trap utterances. The key is the Audio Turing Test (ATT), a multi-dimensional Chinese corpus (ATT-Corpus) paired with a simple, Turing-Test-inspired protocol in which evaluators simply judge whether a voice sounds human, reducing rating bias and improving robustness, together with Auto-ATT, a Qwen2-Audio-Instruct model fine-tuned on human judgment data for automatic evaluation.
Link: https://arxiv.org/abs/2505.11200
Authors: Xihuai Wang, Ziyi Zhao, Siyu Ren, Shao Zhang, Song Li, Xiaoyu Li, Ziwen Wang, Lin Qiu, Guanglu Wan, Xuezhi Cao, Xunliang Cai, Weinan Zhang
Affiliations: Meituan
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Under Review
Abstract:Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, which brings TTS Systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS System evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, which is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus dataset ATT-Corpus paired with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also finetune Qwen2-Audio-Instruct with human judgment data as Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions using its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT can be found in ATT Hugging Face Collection (this https URL).
[NLP-28] NoPE: The Counting Power of Transformers with No Positional Encodings
[Quick Read]: This paper studies the expressive power of transformers with no positional encodings (NoPE), in particular their ability to handle counting languages and complex logical reasoning. It shows that NoPE transformers with average hard attention (NoPE-AHATs) can express counting languages corresponding to nonnegative integer solutions of multivariate polynomial equations; these correspond to semi-algebraic sets, i.e., finite unions of sets of nonnegative integer solutions to systems of multivariate polynomial inequalities. The key contribution is a precise characterization of the languages expressible by NoPE-AHATs, along with consequences for their counting power and decidability, offering a new perspective on the theoretical limits of transformers without PEs.
Link: https://arxiv.org/abs/2505.11199
Authors: Chris Köcher, Alexander Kozachinskiy, Anthony Widjaja Lin, Marco Sälzer, Georg Zetzsche
Affiliations: MPI-SWS; Centro Nacional de Inteligencia Artificial; RPTU Kaiserslautern-Landau
Subjects: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments:
Abstract:Positional Encodings (PEs) seem to be indispensable for ensuring expressiveness of transformers; without them attention transformers reduce to a bag-of-word model. NoPE-transformers (i.e. with No PEs) with unique hard attention mechanisms were very recently shown to only be able to express regular languages, i.e., with limited counting ability. This paper shows that, with average hard attention mechanisms, NoPE-transformers are still surprisingly expressive: they can express counting languages corresponding to nonnegative integer solutions to multivariate polynomial equations (i.e. Diophantine equations), reasoning about which is well-known to be undecidable. In fact, we provide a precise characterization of languages expressible by Average Hard Attention NoPE-Transformers (NoPE-AHATs): they correspond precisely to what we call semi-algebraic sets, i.e., finite unions of sets of nonnegative integer solutions to systems of multivariate polynomial inequations. We obtain several interesting consequences of our characterization. Firstly, NoPE-transformers can express counting properties that are far more complex than established models like simplified counter machines and Petri nets, but cannot express a very simple counting property of PARITY. Secondly, the problem of analyzing NoPE-transformers is undecidable, e.g., whether a given NoPE transformer classifies all input strings in one class. To complement our results, we exhibit a counting language that is not expressible by average hard attention transformers even with arbitrary PEs but is expressible in the circuit complexity class TC^0, answering an open problem.
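For readers, the characterization can be written out as follows; the notation is our paraphrase of the abstract (over an alphabet Σ = {a_1, ..., a_k}, |w|_a denotes the number of occurrences of letter a in w, and the p_{i,j} are multivariate polynomials with integer coefficients), not the paper's exact statement:

```latex
L = \{\, w \in \Sigma^{*} \;:\; (|w|_{a_1}, \dots, |w|_{a_k}) \in S \,\},
\qquad
S = \bigcup_{i=1}^{m} \bigl\{\, x \in \mathbb{N}^{k} \;:\;
      p_{i,1}(x) \ge 0,\; \dots,\; p_{i,n_i}(x) \ge 0 \,\bigr\}.
```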
[NLP-29] CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback
[Quick Read]: This paper addresses compositional image generation: accurately depicting multiple objects, attributes, and 3D-spatial relationships. The key is the CompAlign benchmark together with the CompQuest evaluation framework: CompAlign provides 900 complex multi-subject prompts combining numerical and 3D-spatial relationships with varied attribute bindings, for evaluating and improving compositional generation; CompQuest decomposes complex prompts into atomic sub-questions and uses an MLLM to give fine-grained binary feedback on each aspect of the generated image, precisely quantifying alignment between generated images and compositional prompts. The paper further proposes an alignment framework that uses CompQuest feedback as preference signals to improve diffusion models' compositional image generation.
Link: https://arxiv.org/abs/2505.11178
Authors: Yixin Wan, Kai-Wei Chang
Affiliations: University of California, Los Angeles
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:State-of-the-art T2I models are capable of generating high-resolution images given textual prompts. However, they still struggle with accurately depicting compositional scenes that specify multiple objects, attributes, and spatial relations. We present CompAlign, a challenging benchmark with an emphasis on assessing the depiction of 3D-spatial relationships, for evaluating and improving models on compositional image generation. CompAlign consists of 900 complex multi-subject image generation prompts that combine numerical and 3D-spatial relationships with varied attribute bindings. Our benchmark is remarkably challenging, incorporating generation tasks with 3+ generation subjects with complex 3D-spatial relationships. Additionally, we propose CompQuest, an interpretable and accurate evaluation framework that decomposes complex prompts into atomic sub-questions, then utilizes a MLLM to provide fine-grained binary feedback on the correctness of each aspect of generation elements in model-generated images. This enables precise quantification of alignment between generated images and compositional prompts. Furthermore, we propose an alignment framework that uses CompQuest’s feedback as preference signals to improve diffusion models’ compositional image generation abilities. Using adjustable per-image preferences, our method is easily scalable and flexible for different tasks. Evaluation of 9 T2I models reveals that: (1) models struggle remarkably more with compositional tasks with more complex 3D-spatial configurations, and (2) a noticeable performance gap exists between open-source accessible models and closed-source commercial models. Further empirical study on using CompAlign for model alignment yields promising results: post-alignment diffusion models achieve remarkable improvements in compositional accuracy, especially on complex generation tasks, outperforming previous approaches.
zh
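下面给出 CompQuest 风格评估流程的一个最小 Python 示意:把组合性提示拆成原子子问题,逐一向 MLLM 索取二分类反馈,再取平均作为对齐分数。其中 decompose 的拆分规则与 ask_mllm 接口均为演示而设的假设桩,并非论文官方实现。

```python
# CompQuest 风格评估的示意(decompose 规则与 ask_mllm 接口均为假设的桩)
from typing import Callable, List

def decompose(prompt: str) -> List[str]:
    # 假设:针对本例手工列出原子子问题;论文中的自动拆分方式会更复杂
    return [
        "Is there a cat in the image?",
        "Is there a dog in the image?",
        "Is the cat in front of the dog?",  # 三维空间关系子问题
    ]

def compquest_score(image_path: str, prompt: str,
                    ask_mllm: Callable[[str, str], bool]) -> float:
    """对每个原子子问题收集二分类反馈,取平均作为图文对齐分数。"""
    questions = decompose(prompt)
    answers = [ask_mllm(image_path, q) for q in questions]
    return sum(answers) / len(answers)

# 用法:用桩函数代替真实 MLLM 调用
score = compquest_score("cat_dog.png", "a photo of a cat in front of a dog",
                        ask_mllm=lambda img, q: True)
print(f"alignment score = {score:.2f}")  # 1.00
```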
[NLP-30] Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline
【速读】: 该论文试图解决多语言信息提取与处理的问题,特别是在基于图像的文档中实现跨语言的信息理解和访问。解决方案的关键在于构建一个端到端的系统,结合光学字符识别(Optical Character Recognition, OCR)技术提取文本,并利用大型语言模型应用编程接口(Large Language Model APIs, LLM APIs)进行跨语言翻译、摘要生成和再翻译,同时集成情感分析、主题分类和日期提取等模块以提升文档理解能力。
链接: https://arxiv.org/abs/2505.11177
作者: Hrishit Madhavi,Jacob Cherian,Yuvraj Khamkar,Dhananjay Bhagat
机构: Dr Vishwanath Karad MIT World Peace University, Pune, India
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, direct arXiv submission
Abstract:This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, and then a pipeline involving large language model APIs (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules add sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (Regex) for better document comprehension. Made available through an accessible Gradio interface, the current research demonstrates a real-world application of libraries, models, and APIs to close the language gap and enhance access to information in image media across different linguistic environments.
zh
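按照摘要描述的流程,可以用 pytesseract(真实库,需本地安装 Tesseract 及相应语言包)加一个可注入的 LLM 调用函数,搭出"OCR → 翻译 → 摘要"的最小骨架;Gemini 的具体 API 签名此处不做假设,仅以占位函数示意。

```python
# OCR → 翻译 → 摘要 的最小流水线示意(LLM 调用以可注入函数代替)
from typing import Callable
import pytesseract
from PIL import Image

def process_document(image_path: str, ocr_lang: str,
                     llm: Callable[[str], str]) -> str:
    # 1) OCR:Tesseract 支持 eng/hin/tam 等语言包
    text = pytesseract.image_to_string(Image.open(image_path), lang=ocr_lang)
    # 2) 跨语言翻译 + 摘要,由同一个 LLM 接口完成
    english = llm(f"Translate to English:\n{text}")
    summary = llm(f"Summarize in 3 sentences:\n{english}")
    return summary

# 用法:llm 参数可替换为任意大模型 API 的封装
print(process_document("page.png", "hin", llm=lambda p: p[:80]))
```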
[NLP-31] SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在有效利用现实世界长上下文信息时面临的挑战,这些问题主要源于数据质量缺陷、训练效率不足以及缺乏合理的优化目标。解决方案的关键在于提出一种名为SoLoPO(Short-to-Long Preference Optimization)的框架,该框架将长上下文偏好优化(Long-context Preference Optimization, Long-context PO)分解为两个组件:短上下文偏好优化(Short-context PO)和短到长奖励对齐(Short-to-Long Reward Alignment, SoLo-RA),并通过理论和实证证据加以支持。该方法通过增强模型在短上下文中的知识利用能力,并促进在包含相同任务相关信息的短长上下文条件下的奖励分数一致性,从而提升模型在长上下文场景中的表现。
链接: https://arxiv.org/abs/2505.11166
作者: Huashan Sun,Shengyi Liao,Yansen Han,Yu Bai,Yang Gao,Cheng Fu,Weizhou Shen,Fanqi Wan,Ming Yan,Ji Zhang,Fei Huang
机构: Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named Short-to-Long Preference Optimization (SoLoPO), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model’s contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model’s ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.
zh
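SoLo-RA 的核心是在短、长两种上下文条件下约束同一回复的奖励分数一致。下面用 PyTorch 写一个极简的一致性损失示意;reward_model 接口、平方惩罚形式与系数 lam 均为假设,论文中的具体损失形式请以原文为准。

```python
# SoLo-RA 对齐项的极简示意(接口与损失形式均为假设)
import torch

def solo_ra_loss(reward_model, short_ctx, long_ctx, response, lam=1.0):
    r_short = reward_model(short_ctx, response)  # 短上下文下的奖励分数
    r_long = reward_model(long_ctx, response)    # 含相同任务信息的长上下文下的分数
    # 鼓励两者一致,从而把短上下文能力迁移到长上下文场景
    return lam * (r_long - r_short).pow(2).mean()

# 桩测试:用随机打分函数验证损失可以计算
rm = lambda ctx, resp: torch.randn(4)
print(solo_ra_loss(rm, "short context", "long context", "response"))
```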
[NLP-32] Maximizing Asynchronicity in Event-based Neural Networks
【速读】: 该论文旨在解决事件相机(event camera)数据的异步、稀疏序列特性与传统基于张量的机器学习(ML)方法之间的不兼容问题。现有异步到同步(A2S)方法在表达能力和泛化性方面通常不如密集同步方法。论文提出的EVA(EVent Asynchronous representation learning)框架通过借鉴语言建模中的线性注意力和自监督学习技术,生成具有高表达性和泛化性的逐事件表示,从而克服上述挑战。其解决方案的关键在于将自然语言处理中的先进方法迁移至事件数据的表示学习中,实现了在识别和检测任务上的显著性能提升。
链接: https://arxiv.org/abs/2505.11165
作者: Haiqing Hao,Nikola Zubić,Weihua He,Zhipeng Sui,Davide Scaramuzza,Wenhui Wang
机构: Tsinghua University (清华大学); University of Zurich (苏黎世大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 5 figures, 9 tables
Abstract:Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML). While the recent asynchronous-to-synchronous (A2S) paradigm aims to bridge this gap by asynchronously encoding events into learned representations for ML pipelines, existing A2S approaches often sacrifice representation expressivity and generalizability compared to dense, synchronous methods. This paper introduces EVA (EVent Asynchronous representation learning), a novel A2S framework to generate highly expressive and generalizable event-by-event representations. Inspired by the analogy between events and language, EVA uniquely adapts advances from language modeling in linear attention and self-supervised learning for its construction. In demonstration, EVA outperforms prior A2S methods on recognition tasks (DVS128-Gesture and N-Cars), and represents the first A2S framework to successfully master demanding detection tasks, achieving a remarkable 47.7 mAP on the Gen1 dataset. These results underscore EVA’s transformative potential for advancing real-time event-based vision applications.
zh
[NLP-33] MPMA: Preference Manipulation Attack Against Model Context Protocol
【速读】: 该论文旨在解决Model Context Protocol (MCP)在开放生态系统中因第三方定制化服务器而暴露的安全漏洞问题,特别是针对MCP Preference Manipulation Attack (MPMA)这一新型安全威胁。论文提出的解决方案关键在于设计两种攻击方法:Direct Preference Manipulation Attack (DPMA)和Genetic-based Advertising Preference Manipulation Attack (GAPMA),其中GAPMA通过引入遗传算法(GA)优化描述信息,实现了攻击效果与隐蔽性的平衡,从而揭示了MCP体系在公平性方面的重大缺陷,并强调了构建有效防御机制的紧迫性。
链接: https://arxiv.org/abs/2505.11154
作者: Zihan Wang,Hongwei Li,Rui Zhang,Yu Liu,Wenbo Jiang,Wenshu Fan,Qingchuan Zhao,Guowen Xu
机构: University of Electronic Science and Technology of China (中国电子科技大学); City University of Hong Kong (香港城市大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Model Context Protocol (MCP) standardizes interface mapping for large language models (LLMs) to access external data and tools, which revolutionizes the paradigm of tool selection and facilitates the rapid expansion of the LLM agent tool ecosystem. However, as the MCP is increasingly adopted, third-party customized versions of the MCP server expose potential security vulnerabilities. In this paper, we first introduce a novel security threat, which we term the MCP Preference Manipulation Attack (MPMA). An attacker deploys a customized MCP server to manipulate LLMs, causing them to prioritize it over other competing MCP servers. This can result in economic benefits for attackers, such as revenue from paid MCP services or advertising income generated from free servers. To achieve MPMA, we first design a Direct Preference Manipulation Attack (DPMA) that achieves significant effectiveness by inserting manipulative words and phrases into the tool name and description. However, such a direct modification is obvious to users and lacks stealthiness. To address these limitations, we further propose the Genetic-based Advertising Preference Manipulation Attack (GAPMA). GAPMA employs four commonly used strategies to initialize descriptions and integrates a Genetic Algorithm (GA) to enhance stealthiness. The experiment results demonstrate that GAPMA balances high effectiveness and stealthiness. Our study reveals a critical vulnerability of the MCP in open ecosystems, highlighting an urgent need for robust defense mechanisms to ensure the fairness of the MCP ecosystem.
zh
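GAPMA 用遗传算法在保持隐蔽性的同时优化服务器描述文案。下面是一个通用 GA 骨架的示意:选择、交叉、变异都取最朴素的实现,fitness 以可注入函数代替;论文中"有效性 + 隐蔽性"的具体度量方式未在此复现。

```python
# 遗传算法优化描述文案的骨架示意(fitness 为占位函数,非论文度量)
import random
from typing import Callable, List

def mutate(desc: str, vocab: List[str]) -> str:
    words = desc.split()
    words[random.randrange(len(words))] = random.choice(vocab)  # 随机替换一个词
    return " ".join(words)

def crossover(a: str, b: str) -> str:
    wa, wb = a.split(), b.split()
    cut = random.randrange(1, min(len(wa), len(wb)))  # 单点交叉
    return " ".join(wa[:cut] + wb[cut:])

def ga_optimize(seeds: List[str], vocab: List[str],
                fitness: Callable[[str], float],
                generations: int = 20, pop_size: int = 16) -> str:
    pop = seeds[:pop_size]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # 选择
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)), vocab)
                    for _ in range(pop_size - len(parents))]  # 交叉 + 变异
        pop = parents + children
    return max(pop, key=fitness)

best = ga_optimize(["a reliable weather tool"] * 4,
                   vocab=["fast", "trusted", "official", "free"],
                   fitness=lambda d: len(set(d.split())))  # 占位 fitness
print(best)
```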
[NLP-34] Scaling Reasoning can Improve Factuality in Large Language Models
【速读】: 该论文试图解决在开放域问答(open-domain QA)任务中,大型语言模型(LLM)的推理能力是否能通过更长的推理链和额外计算资源提升事实准确性的问题。其解决方案的关键在于从先进的大规模推理模型中提炼推理轨迹,并通过引入知识图谱中的事实信息来丰富这些轨迹,同时对不同规模的模型进行微调,以评估推理准确性的改进效果。此外,研究还验证了在测试阶段增加计算资源和令牌预算能够显著提升事实准确性。
链接: https://arxiv.org/abs/2505.11140
作者: Mike Zhang,Johannes Bjerva,Russa Biswas
机构: Aalborg University (奥尔堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent studies on large language model (LLM) reasoning capabilities have demonstrated promising improvements in model performance by leveraging a lengthy thinking process and additional computational resources during inference, primarily in tasks involving mathematical reasoning (Muennighoff et al., 2025). However, it remains uncertain if longer reasoning chains inherently enhance factual accuracy, particularly beyond mathematical contexts. In this work, we thoroughly examine LLM reasoning within complex open-domain question-answering (QA) scenarios. We initially distill reasoning traces from advanced, large-scale reasoning models (QwQ-32B and DeepSeek-R1-671B), then fine-tune a variety of models ranging from smaller, instruction-tuned variants to larger architectures based on Qwen2.5. To enrich reasoning traces, we introduce factual information from knowledge graphs in the form of paths into our reasoning traces. Our experimental setup includes four baseline approaches and six different instruction-tuned models evaluated across a benchmark of six datasets, encompassing over 22.6K questions. Overall, we carry out 168 experimental runs and analyze approximately 1.7 million reasoning traces. Our findings indicate that, within a single run, smaller reasoning models achieve noticeable improvements in factual accuracy compared to their original instruction-tuned counterparts. Moreover, our analysis demonstrates that when test-time compute and token budgets are added, factual accuracy consistently improves by 2-8%, further confirming the effectiveness of test-time scaling for enhancing performance and consequently improving reasoning accuracy in open-domain QA tasks. We release all the experimental artifacts for further research.
zh
[NLP-35] Towards Better Evaluation for Generated Patent Claims ACL2025
【速读】: 该论文试图解决自动化专利权利要求生成系统在评估过程中存在的问题,即现有评估指标与人类专家评估之间存在不一致性。解决方案的关键在于引入Patent-CE,这是首个针对专利权利要求的全面评估基准,并提出了一种名为PatClaimEval的多维评估方法,该方法在所有评估标准中与人类专家评估的相关性最高,从而为更准确地评估自动化专利权利要求生成系统提供了基础。
链接: https://arxiv.org/abs/2505.11095
作者: Lekang Jiang,Pascal A Scherz,Stephan Goetz
机构: University of Cambridge (剑桥大学); PSPB Patent Law
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025. 14 pages, 8 tables
Abstract:Patent claims define the scope of protection and establish the legal boundaries of an invention. Drafting these claims is a complex and time-consuming process that usually requires the expertise of skilled patent attorneys, which can form a large access barrier for many small enterprises. To solve these challenges, researchers have investigated the use of large language models (LLMs) for automating patent claim generation. However, existing studies highlight inconsistencies between automated evaluation metrics and human expert assessments. To bridge this gap, we introduce Patent-CE, the first comprehensive benchmark for evaluating patent claims. Patent-CE includes comparative claim evaluations annotated by patent experts, focusing on five key criteria: feature completeness, conceptual clarity, terminology consistency, logical linkage, and overall quality. Additionally, we propose PatClaimEval, a novel multi-dimensional evaluation method specifically designed for patent claims. Our experiments demonstrate that PatClaimEval achieves the highest correlation with human expert evaluations across all assessment criteria among all tested metrics. This research provides the groundwork for more accurate evaluations of automated patent claim generation systems.
zh
[NLP-36] BLEUBERI: BLEU is a surprisingly effective reward for instruction following
【速读】: 该论文试图解决在基于强化学习(RL)的对齐过程中,传统奖励模型(reward models)训练成本高、依赖大规模人工标注偏好数据的问题。其解决方案的关键在于利用基于字符串匹配的简单指标(如BLEU)作为奖励函数,替代传统的奖励模型。研究发现,BLEU在通用指令遵循数据集上与人类偏好具有较高的一致性,并基于此提出了BLEUBERI方法,通过识别挑战性指令并结合Group Relative Policy Optimization(GRPO)进行优化,实现了与基于奖励模型的强化学习方法相当的性能。
链接: https://arxiv.org/abs/2505.11080
作者: Yapei Chang,Yekyung Kim,Michael Krumdick,Amir Zadeh,Chuan Li,Chris Tanner,Mohit Iyyer
机构: KiwiBird( KiwiBird); Apple(苹果); Lemon(柠檬); University of Maryland, College Park(马里兰大学学院公园分校); Kensho( Kensho); Lambda AI( Lambda AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, 11 figures, 15 tables
Abstract:Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at this https URL.
zh
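BLEUBERI 的关键一步是把 BLEU 直接当作 GRPO 的奖励信号。下面用 sacrebleu(真实库)演示奖励计算与组内相对优势的"减去组均值"简化版;GRPO 完整算法(含标准差归一化等细节)以原论文为准。

```python
# 用 BLEU 作为 GRPO 奖励的示意(仅演示奖励与组内相对优势的计算)
import sacrebleu

def bleu_reward(candidate: str, reference: str) -> float:
    # sentence_bleu 返回带 .score 字段的 BLEUScore 对象
    return sacrebleu.sentence_bleu(candidate, [reference]).score

def group_relative_advantages(candidates, reference):
    rewards = [bleu_reward(c, reference) for c in candidates]
    mean_r = sum(rewards) / len(rewards)
    # 同一指令下的一组采样互为基线(简化:只减均值,未除以标准差)
    return [r - mean_r for r in rewards]

ref = "The cat sat on the mat."
cands = ["The cat sat on the mat.", "A cat is on a mat.", "Dogs bark."]
print(group_relative_advantages(cands, ref))
```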
[NLP-37] ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection
【速读】: 该论文旨在解决音频深度伪造检测(Audio Deepfake Detection, ADD)问题,特别是在高保真音频生成模型日益普及背景下,如何有效识别伪造音频。其解决方案的关键在于提出一种基于音频大语言模型(ALLM)的框架 ALLM4ADD,通过将ADD任务重新定义为音频问答问题,并利用监督微调使模型能够评估音频的真实性,从而在数据稀缺场景下实现更优的检测性能。
链接: https://arxiv.org/abs/2505.11079
作者: Hao Gu,Jiangyan Yi,Chenglong Wang,Jianhua Tao,Zheng Lian,Jiayi He,Yong Ren,Yujie Chen,Zhengqi Wen
机构: Institute of Automation, Chinese Academy of Sciences(自动化研究所,中国科学院); Tsinghua University(清华大学); Taizhou University(台州大学); Anhui University(安徽大学); Beijing National Research Center for Information Science and Technology, Tsinghua University(北京信息科学与技术国家研究中心,清华大学)
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. Given that audio large language models (ALLMs) have made significant progress in various audio processing tasks, a heuristic question arises: Can ALLMs be leveraged to solve ADD? In this paper, we first conduct a comprehensive zero-shot evaluation of ALLMs on ADD, revealing their ineffectiveness in detecting fake audio. To enhance their performance, we propose ALLM4ADD, an ALLM-driven framework for ADD. Specifically, we reformulate the ADD task as an audio question answering problem, prompting the model with the question: “Is this audio fake or real?”. We then perform supervised fine-tuning to enable the ALLM to assess the authenticity of query audio. Extensive experiments are conducted to demonstrate that our ALLM-based method can achieve superior performance in fake audio detection, particularly in data-scarce scenarios. As a pioneering study, we anticipate that this work will inspire the research community to leverage ALLMs to develop more effective ADD systems.
zh
[NLP-38] CAMEO: Collection of Multilingual Emotional Speech Corpora NEURIPS
【速读】: 该论文试图解决跨语言和跨情感状态的语音情感识别(Speech Emotion Recognition, SER)研究中数据获取困难、结果可复现性差以及缺乏标准化评估基准的问题。解决方案的关键在于构建一个经过精心筛选和标准化处理的多语言情感语音数据集,即CAMEO,以促进相关研究的开展,并通过公开数据集、元数据和排行榜(leaderboard)提高研究的透明度和可比性。
链接: https://arxiv.org/abs/2505.11051
作者: Iwona Christop,Maciej Czajka
机构: Adam Mickiewicz University (亚当·密茨凯维奇大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Under review at NeurIPS
Abstract:This paper presents CAMEO – a curated collection of multilingual emotional speech datasets designed to facilitate research in emotion recognition and other speech-related tasks. The main objectives were to ensure easy access to the data, to allow reproducibility of the results, and to provide a standardized benchmark for evaluating speech emotion recognition (SER) systems across different emotional states and languages. The paper describes the dataset selection criteria, the curation and normalization process, and provides performance results for several models. The collection, along with metadata and a leaderboard, is publicly available via the Hugging Face platform.
zh
[NLP-39] OntoURL: A Benchmark for Evaluating Large Language Models on Symbolic Ontological Understanding Reasoning and Learning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理结构化符号知识方面能力不足的问题,特别是其在本体(ontology)相关任务中的表现尚未得到充分研究。解决方案的关键在于提出一个LLMs本体能力的分类体系,并引入OntoURL,这是首个全面的基准测试工具,用于系统评估LLMs在理解、推理和学习本体知识方面的性能。通过15项任务和58,981个问题,OntoURL从多个维度对LLMs进行了评估,揭示了当前模型在符号知识处理上的局限性,并为未来研究提供了重要的基准。
链接: https://arxiv.org/abs/2505.11031
作者: Xiao Zhang,Huiyuan Lai,Qianru Meng,Johan Bos
机构: University of Groningen (格罗宁根大学); Leiden University (莱顿大学)
类目: Computation and Language (cs.CL)
备注: Paper submitted to NeurIPS 2025 dataset and benchmark track
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across a range of natural language processing tasks, yet their ability to process structured symbolic knowledge remains underexplored. To address this gap, we propose a taxonomy of LLMs’ ontological capabilities and introduce OntoURL, the first comprehensive benchmark designed to systematically evaluate LLMs’ proficiency in handling ontologies – formal, symbolic representations of domain knowledge through concepts, relationships, and instances. Based on the proposed taxonomy, OntoURL systematically assesses three dimensions: understanding, reasoning, and learning through 15 distinct tasks comprising 58,981 questions derived from 40 ontologies across 8 domains. Experiments with 20 open-source LLMs reveal significant performance differences across models, tasks, and domains, with current LLMs showing proficiency in understanding ontological knowledge but substantial weaknesses in reasoning and learning tasks. These findings highlight fundamental limitations in LLMs’ capability to process symbolic knowledge and establish OntoURL as a critical benchmark for advancing the integration of LLMs with formal knowledge representations.
zh
[NLP-40] StRuCom: A Novel Dataset of Structured Code Comments in Russian
【速读】: 该论文试图解决生成式 AI (Generative AI) 在俄语代码注释生成任务中表现不佳的问题,尤其是相较于英语,现有模型在俄语上的性能较差。解决方案的关键在于构建了 StRuCom——首个针对俄语代码文档的大规模数据集(包含153K个示例),该数据集结合了来自俄罗斯 GitHub 仓库的人工编写注释与合成生成的注释,并通过自动化验证确保其符合 Python、Java、JavaScript、C# 和 Go 等语言的标准,从而有效提升了模型在俄语代码文档生成任务中的表现。
链接: https://arxiv.org/abs/2505.11026
作者: Maria Dziuba,Valentin Malykh
机构: MTS AI; ITMO University; IITU University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:
Abstract:Structured code comments in docstring format are essential for code comprehension and maintenance, but existing machine learning models for their generation perform poorly for Russian compared to English. To bridge this gap, we present StRuCom - the first large-scale dataset (153K examples) specifically designed for Russian code documentation. Unlike machine-translated English datasets that distort terminology (e.g., technical loanwords vs. literal translations) and docstring structures, StRuCom combines human-written comments from Russian GitHub repositories with synthetically generated ones, ensuring compliance with Python, Java, JavaScript, C#, and Go standards through automated validation. Fine-tuning Qwen2.5-Coder models (0.5B-7B) on StRuCom shows statistically significant improvements in chrF++ and BERTScore over baseline models.
zh
[NLP-41] Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models ACL2025
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在对话AI中因依赖单轮监督微调(Single-turn Supervised Fine-tuning, SFT)数据而导致的多轮对话上下文连贯性不足的问题。其解决方案的关键在于提出了一种名为Review-Instruct的新型框架,该框架通过迭代的“提问-回应-评审”流程,由候选者、多个评审者和主席三种代理角色协作生成高质量的多轮对话数据,从而提升指令的多样性和难度。
链接: https://arxiv.org/abs/2505.11010
作者: Jiangxu Wu,Cong Wang,TianHuang Su,Jun Yang,Haozhi Lin,Chao Zhang,Ming Peng,Kai Shi,SongPan Yang,BinQing Pan,ZiXian Li,Ni Yang,ZhenYu Yang
机构: OPPO AI Center(OPPO人工智能中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL2025 Accepted
Abstract:The effectiveness of large language models (LLMs) in conversational AI is hindered by their reliance on single-turn supervised fine-tuning (SFT) data, which limits contextual coherence in multi-turn dialogues. Existing methods for generating multi-turn dialogue data struggle to ensure both diversity and quality in instructions. To address this, we propose Review-Instruct, a novel framework that synthesizes multi-turn conversations through an iterative “Ask-Respond-Review” process involving three agent roles: a Candidate, multiple Reviewers, and a Chairman. The framework iteratively refines instructions by incorporating Reviewer feedback, enhancing dialogue diversity and difficulty. We construct a multi-turn dataset using the Alpaca dataset and fine-tune the LLaMA2-13B model. Evaluations on MT-Bench, MMLU-Pro, and Auto-Arena demonstrate significant improvements, achieving absolute gains of 2.9% on MMLU-Pro and 2% on MT-Bench compared to prior state-of-the-art models based on LLaMA2-13B. Ablation studies confirm the critical role of the Review stage and the use of multiple Reviewers in boosting instruction diversity and difficulty. Our work highlights the potential of review-driven, multi-agent frameworks for generating high-quality conversational data at scale.
zh
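下面是"提问-回应-评审"一轮迭代的骨架示意:Candidate、Reviewer、Chairman 三个角色都抽象成可注入的函数,提示词与反馈聚合方式均为演示用的假设,非论文原始实现。

```python
# Review-Instruct 一轮 "Ask-Respond-Review" 的骨架示意(角色均为可注入的桩)
from typing import Callable, List

def review_instruct_round(instruction: str,
                          candidate: Callable[[str], str],
                          reviewers: List[Callable[[str, str], str]],
                          chairman: Callable[[str, List[str]], str]) -> str:
    answer = candidate(instruction)                         # Ask-Respond
    feedback = [r(instruction, answer) for r in reviewers]  # 多评审者给出意见
    # Chairman 汇总反馈,产出更难、更多样的下一轮指令
    return chairman(instruction, feedback)

inst = "Explain photosynthesis."
for _ in range(3):  # 多轮迭代,逐步提升指令难度与多样性
    inst = review_instruct_round(
        inst,
        candidate=lambda q: f"[answer to] {q}",
        reviewers=[lambda q, a: "add quantitative detail"] * 2,
        chairman=lambda q, fb: q + " Include " + fb[0] + ".")
print(inst)
```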
[NLP-42] Reconstructing Syllable Sequences in Abugida Scripts with Incomplete Inputs
【速读】: 该论文旨在解决Abugida语言中音节序列预测的问题,具体是通过Transformer模型从不同类型的不完整输入(如辅音序列、元音序列、部分音节和遮蔽音节)中重建完整的音节序列。其解决方案的关键在于利用辅音序列在音节预测中的重要作用,这使得模型能够取得较高的BLEU分数,同时在处理部分音节和遮蔽音节的重建任务中表现出稳健的性能。
链接: https://arxiv.org/abs/2505.11008
作者: Ye Kyaw Thu,Thazin Myint Oo
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages, 2 figures, 6 tables, 1 listing
Abstract:This paper explores syllable sequence prediction in Abugida languages using Transformer-based models, focusing on six languages: Bengali, Hindi, Khmer, Lao, Myanmar, and Thai, from the Asian Language Treebank (ALT) dataset. We investigate the reconstruction of complete syllable sequences from various incomplete input types, including consonant sequences, vowel sequences, partial syllables (with random character deletions), and masked syllables (with fixed syllable deletions). Our experiments reveal that consonant sequences play a critical role in accurate syllable prediction, achieving high BLEU scores, while vowel sequences present a significantly greater challenge. The model demonstrates robust performance across tasks, particularly in handling partial and masked syllable reconstruction, with strong results for tasks involving consonant information and syllable masking. This study advances the understanding of sequence prediction for Abugida languages and provides practical insights for applications such as text prediction, spelling correction, and data augmentation in these scripts.
zh
[NLP-43] Illusion or Algorithm? Investigating Memorization Emergence and Symbolic Processing in In-Context Learning
【速读】: 该论文试图解决生成式 AI (Generative AI) 中的上下文学习(in-context learning, ICL)机制的理解问题,具体是探讨 ICL 是源于对训练语料的“记忆”还是体现了模型内部的符号算法能力。论文的关键解决方案是引入一套系统性的探究任务和一种新方法,利用完整的 Pythia 缩放套件(包括中间检查点)来分析 ICL 表现,并结合对残差流子空间的机制性分析,以揭示 ICL 的本质特性。
链接: https://arxiv.org/abs/2505.11004
作者: Jingcheng Niu,Subhabrata Dutta,Ahmed Elshabrawy,Harish Tayyar Madabushi,Iryna Gurevych
机构: Technical University of Darmstadt(达姆施塔特工业大学); Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学); The University of Bath(巴斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks after seeing just a few examples. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood. Some studies argue that it is merely the result of memorizing vast amounts of data, while others contend that it reflects a fundamental, symbolic algorithmic development in LMs. In this work, we introduce a suite of investigative tasks and a novel method to systematically investigate ICL by leveraging the full Pythia scaling suite, including interim checkpoints that capture progressively larger amount of training data. By carefully exploring ICL performance on downstream tasks and simultaneously conducting a mechanistic analysis of the residual stream’s subspace, we demonstrate that ICL extends beyond mere “memorization” of the training corpus, yet does not amount to the implementation of an independent symbolic algorithm. Our results also clarify several aspects of ICL, including the influence of training dynamics, model capabilities, and elements of mechanistic interpretability. Overall, our work advances the understanding of ICL and its implications, offering model developers insights into potential improvements and providing AI security practitioners with a basis for more informed guidelines.
zh
[NLP-44] Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory ACL2025
【速读】: 该论文试图解决在大规模语言模型(Large Language Models, LLM)中,不同推理提示策略在测试时计算资源扩展下的性能表现问题。其关键解决方案是通过概率理论提出一种方法,能够在不进行额外资源密集型推理的情况下,快速且准确地预测扩展性能并选择最优策略,从而为多数投票(majority voting)提供测试时的扩展规律。
链接: https://arxiv.org/abs/2505.10981
作者: Yexiang Liu,Zekun Li,Zhi Fang,Nan Xu,Ran He,Tieniu Tan
机构: MAIS, Institute of Automation, Chinese Academy of Sciences (MAIS,自动化研究所,中国科学院); School of Artificial Intelligence, University of Chinese Academy of Sciences (人工智能学院,中国科学院大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校); Beijing Wenge Technology Co., Ltd (北京文戈科技有限公司); Nanjing University (南京大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ACL 2025 Main
Abstract:Recently, scaling test-time compute on Large Language Models (LLM) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as scaling. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs × 8 prompting strategies × 6 benchmarks. Experiment results consistently show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a method according to probability theory to quickly and accurately predict the scaling performance and select the best strategy under large sampling times without extra resource-intensive inference in practice. It can serve as the test-time scaling law for majority voting. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can prompt a re-examination of the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance.
zh
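作为摘要中"用概率论预测多数投票扩展性能"的入门近似:假设各次采样相互独立、单次正确率为 p,且错误答案彼此分散,则多数投票正确率是二项分布的尾和。真实场景中错误会聚集相关,论文给出的理论更精细,这里只演示最简单的情形。

```python
# 多数投票扩展性能的二项分布近似(独立采样假设下的简化)
from math import comb

def majority_vote_acc(p: float, n: int) -> float:
    """n 次独立采样中,正确答案获得严格多数的概率。"""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# 单次正确率 0.6 时,正确率随采样数 n 单调上升
for n in (1, 5, 17, 65):
    print(n, round(majority_vote_acc(0.6, n), 3))
```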
[NLP-45] Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio
【速读】: 该论文旨在解决单通道多说话人自动语音识别(Monaural multi-speaker ASR)中的数据稀缺性和重叠语音中语音识别与说话人归属的固有难度问题。其解决方案的关键在于采用端到端(end-to-end, E2E)架构,以减少误差传播并更好地利用语音内容与说话人身份之间的协同效应。论文通过系统分类E2E神经方法,分析了不同架构范式(SIMO与SISO)的特点及权衡,并探讨了针对长时语音的扩展策略,旨在推动更鲁棒和可扩展的多说话人ASR系统的发展。
链接: https://arxiv.org/abs/2505.10975
作者: Xinlu He,Jacob Whitehill
机构: Worcester Polytechnic Institute (伍斯特理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 13 pages. Submitted to IEEE/ACM Transaction on Audio Speech and Language Processing (TASLP)
Abstract:Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.
zh
[NLP-46] The Way We Prompt: Conceptual Blending Neural Dynamics and Prompt-Induced Transitions in LLMs
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)表现出类似人类智能和个性的行为背后的机制问题,其核心在于揭示这些模型如何通过提示(prompt)进行意义的融合与压缩。解决方案的关键在于将概念融合理论(Conceptual Blending Theory, CBT)作为实验框架,利用基于提示的方法系统研究提示诱导过渡(Prompt-Induced Transitions, PIT)和提示诱导幻觉(Prompt-Induced Hallucinations, PIH),从而揭示人工与生物认知之间的结构相似性与差异性。
链接: https://arxiv.org/abs/2505.10948
作者: Makoto Sato
机构: 未知
类目: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Large language models (LLMs), inspired by neuroscience, exhibit behaviors that often evoke a sense of personality and intelligence, yet the mechanisms behind these effects remain elusive. Here, we operationalize Conceptual Blending Theory (CBT) as an experimental framework, using prompt-based methods to reveal how LLMs blend and compress meaning. By systematically investigating Prompt-Induced Transitions (PIT) and Prompt-Induced Hallucinations (PIH), we uncover structural parallels and divergences between artificial and biological cognition. Our approach bridges linguistics, neuroscience, and empirical AI research, demonstrating that human-AI collaboration can serve as a living prototype for the future of cognitive science. This work proposes prompt engineering not just as a technical tool, but as a scientific method for probing the deep structure of meaning itself.
zh
[NLP-47] Semantic Aware Linear Transfer by Recycling Pre-trained Language Models for Cross-lingual Transfer ACL2025
【速读】: 该论文试图解决多语言大语言模型(Large Language Models, LLMs)在跨语言迁移过程中因源模型嵌入空间与目标语言词汇不匹配而导致的表达能力受限问题。解决方案的关键在于提出一种名为语义感知线性迁移(Semantic Aware Linear Transfer, SALT)的新方法,该方法通过利用目标语言预训练语言模型(Pre-trained Language Models, PLMs)的嵌入,基于源语言和目标语言词汇重叠部分的相似性构建独特的回归线,以处理非重叠词的嵌入空间,从而有效传递PLM的深层表示优势至LLMs。
链接: https://arxiv.org/abs/2505.10945
作者: Seungyoon Lee,Seongtae Hong,Hyeonseok Moon,Heuiseok Lim
机构: Korea University (高丽大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 Findings
Abstract:Large Language Models (LLMs) increasingly incorporate multilingual capabilities, fueling the demand to transfer them into target language-specific models. However, most approaches, which blend the source model’s embedding by replacing the source vocabulary with the target language-specific vocabulary, may constrain expressive capacity in the target language since the source model is predominantly trained on English data. In this paper, we propose Semantic Aware Linear Transfer (SALT), a novel cross-lingual transfer technique that recycles embeddings from target language Pre-trained Language Models (PLMs) to transmit the deep representational strengths of PLM-derived embedding to LLMs. SALT derives unique regression lines based on the similarity in the overlap of the source and target vocabularies, to handle each non-overlapping token’s embedding space. Our extensive experiments show that SALT significantly outperforms other transfer methods and achieves lower loss with accelerating faster convergence during language adaptation. Notably, SALT obtains remarkable performance in cross-lingual understanding setups compared to other methods. Furthermore, we highlight the scalable use of PLMs to enhance the functionality of contemporary LLMs by conducting experiments with varying architectures.
zh
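SALT 的骨干操作是:在源/目标词表的重叠词上拟合回归映射,再把目标语言 PLM 中非重叠词的嵌入投影进 LLM 的嵌入空间。下面用 NumPy 最小二乘给出单条回归线的示意(论文按相似度分组拟合多条回归线,此处从简,嵌入也用随机数代替):

```python
# SALT 核心步骤的示意:重叠词上拟合线性映射,再迁移非重叠词嵌入
import numpy as np

rng = np.random.default_rng(0)
d_plm, d_llm, n_overlap = 64, 128, 500  # 维度纯属演示假设

E_plm_overlap = rng.normal(size=(n_overlap, d_plm))  # 目标语言 PLM 中重叠词的嵌入
E_llm_overlap = rng.normal(size=(n_overlap, d_llm))  # LLM 中同一批词的嵌入

# 最小二乘拟合映射 W: PLM 空间 -> LLM 空间
W, *_ = np.linalg.lstsq(E_plm_overlap, E_llm_overlap, rcond=None)

# 非重叠词:只存在于目标语言 PLM 词表中,用 W 投影得到其在 LLM 空间的初始化
E_plm_new = rng.normal(size=(10, d_plm))
E_llm_new = E_plm_new @ W
print(E_llm_new.shape)  # (10, 128)
```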
[NLP-48] GenKnowSub: Improving Modularity and Reusability of LLM s through General Knowledge Subtraction ACL2025
【速读】: 该论文试图解决大型语言模型在零样本泛化能力上的不足问题,其核心挑战在于通用知识与任务特定适应之间的纠缠。解决方案的关键是提出一种模块化框架,通过构建任务特定的LoRA模块库和通用领域LoRA,实现两者的解耦。具体而言,通过从每个任务特定模块中减去通用知识组件,得到专注于任务相关信息的残差模块,这一方法称为通用知识减法(GenKnowSub)。
链接: https://arxiv.org/abs/2505.10939
作者: Mohammadtaha Bagherifard,Sahar Rajabi,Ali Edalat,Yadollah Yaghoobzadeh
机构: University of Tehran(德黑兰大学); Tehran Institute for Advanced Studies(德黑兰高级研究所); Khatam University(卡塔姆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 (main conference, short paper), 10 pages
Abstract:Large language models often struggle with zero-shot generalization, and several modular approaches have been proposed to address this challenge. Yet, we hypothesize that a key limitation remains: the entanglement of general knowledge and task-specific adaptations. To overcome this, we propose a modular framework that disentangles these components by constructing a library of task-specific LoRA modules alongside a general-domain LoRA. By subtracting this general knowledge component from each task-specific module, we obtain residual modules that focus more exclusively on task-relevant information, a method we call general knowledge subtraction (GenKnowSub). Leveraging the refined task-specific modules and the Arrow routing algorithm (Ostapenko et al., 2024), we dynamically select and combine modules for new inputs without additional training. Our studies on the Phi-3 model and standard Arrow as baselines reveal that using general knowledge LoRAs derived from diverse languages, including English, French, and German, yields consistent performance gains in both monolingual and cross-lingual settings across a wide set of benchmarks. Further experiments on Phi-2 demonstrate how GenKnowSub generalizes to weaker LLMs. The complete code and data are available at this https URL.
zh
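GenKnowSub 的"通用知识减法"在概念上就是从任务特定 LoRA 中减去通用领域 LoRA。下面在权重增量 ΔW = B·A 的层面演示这步减法;论文在 LoRA 参数层面的具体操作可能不同,维度也纯属演示假设:

```python
# 通用知识减法的示意:在 ΔW = B @ A 的层面做减法
import torch

def delta_w(lora):
    A, B = lora  # A: (r, d), B: (d, r)
    return B @ A  # 权重增量 ΔW: (d, d)

def genknowsub(task_lora, general_lora):
    """从任务 LoRA 的权重增量中减去通用领域 LoRA 的增量,得到残差模块。"""
    return delta_w(task_lora) - delta_w(general_lora)

d, r = 16, 4  # 演示用的维度与秩
make = lambda: (torch.randn(r, d), torch.randn(d, r))
residual = genknowsub(make(), make())
print(residual.shape)  # torch.Size([16, 16])
```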
[NLP-49] Accurate KV Cache Quantization with Outlier Tokens Tracing ACL2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在部署过程中因计算资源消耗大而带来的挑战,特别是针对KV Cache量化过程中内存使用与精度之间的平衡问题。其解决方案的关键在于识别并排除在量化过程中表现异常的token,这些token由于具有独特的特征而偏离了常规的通道量化和标记量化模式,通过将它们作为离群值排除在量化之外,显著提升了整体的量化精度。
链接: https://arxiv.org/abs/2505.10938
作者: Yi Su,Yuechi Zhou,Quantong Qiu,Juntao Li,Qingrong Xia,Ping Li,Xinyu Duan,Zhefeng Wang,Min Zhang
机构: Soochow University (苏州大学); Huawei Cloud (华为云)
类目: Computation and Language (cs.CL)
备注: ACL2025 Main
Abstract:The impressive capabilities of Large Language Models (LLMs) come at the cost of substantial computational resources during deployment. While KV Cache can significantly reduce recomputation during inference, it also introduces additional memory overhead. KV Cache quantization presents a promising solution, striking a good balance between memory usage and accuracy. Previous research has shown that the Keys are distributed by channel, while the Values are distributed by token. Consequently, the common practice is to apply channel-wise quantization to the Keys and token-wise quantization to the Values. However, our further investigation reveals that a small subset of unusual tokens exhibit unique characteristics that deviate from this pattern, which can substantially impact quantization accuracy. To address this, we develop a simple yet effective method to identify these tokens accurately during the decoding process and exclude them from quantization as outlier tokens, significantly improving overall accuracy. Extensive experiments show that our method achieves significant accuracy improvements under 2-bit quantization and can deliver a 6.4 times reduction in memory usage and a 2.3 times increase in throughput.
zh
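论文思路可概括为:逐 token 量化 Values,但把少数特征异常的 token 识别为离群值并保留全精度。下面的示意用"范数偏离均值 k 倍标准差"这一简化规则代替论文的识别方法,量化也只是朴素的 4 级(约 2-bit)舍入,均非原文实现:

```python
# 带离群 token 追踪的逐 token 量化示意(离群判定与量化方式均为简化假设)
import torch

def quantize_values(V: torch.Tensor, k: float = 3.0):
    """V: (seq_len, d)。返回反量化结果与保留全精度的离群 token 下标。"""
    norms = V.norm(dim=-1)
    outliers = (norms - norms.mean()).abs() > k * norms.std()  # 简化的离群判定
    scale = V.abs().amax(dim=-1, keepdim=True) / 2             # 4 个量化级,约 2-bit
    q = torch.clamp((V / scale).round(), -2, 1)                # 朴素舍入量化
    deq = q * scale
    deq[outliers] = V[outliers]  # 离群 token 不量化,直接保留原值
    return deq, outliers.nonzero().flatten()

V = torch.randn(128, 64)
V[7] *= 20  # 人为制造一个离群 token
deq, idx = quantize_values(V)
print(idx.tolist())  # 期望包含 7
```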
[NLP-50] Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在训练过程中缺乏全面的链式思维(Chain-of-Thought, CoT)数据集的问题,当前资源难以提供多教师模型提炼出的连贯CoT过程以及描述CoT内部特性的多维属性。其解决方案的关键在于引入OmniThought数据集,该数据集包含由两个强大LRMs作为教师模型生成并验证的200万条CoT过程,并通过新颖的推理冗余度(Reasoning Verbosity, RV)和认知难度(Cognitive Difficulty, CD)评分对每个CoT过程进行标注,从而提升LRM训练的有效性。
链接: https://arxiv.org/abs/2505.10937
作者: Wenrui Cai,Chengyu Wang,Junbing Yan,Jun Huang,Xiangzhong Fang
机构: Shanghai Jiao Tong University (上海交通大学); Alibaba Cloud Computing (阿里云计算)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The emergence of large reasoning models (LRMs) has transformed Natural Language Processing by excelling in complex tasks such as mathematical problem-solving and code generation. These models leverage chain-of-thought (CoT) processes, enabling them to emulate human-like reasoning strategies. However, the advancement of LRMs is hindered by the lack of comprehensive CoT datasets. Current resources often fail to provide extensive reasoning problems with coherent CoT processes distilled from multiple teacher models and do not account for multifaceted properties describing the internal characteristics of CoTs. To address these challenges, we introduce OmniThought, a large-scale dataset featuring 2 million CoT processes generated and validated by two powerful LRMs as teacher models. Each CoT process in OmniThought is annotated with novel Reasoning Verbosity (RV) and Cognitive Difficulty (CD) scores, which describe the appropriateness of CoT verbosity and cognitive difficulty level for models to comprehend these reasoning processes. We further establish a self-reliant pipeline to curate this dataset. Extensive experiments using Qwen2.5 models of various sizes demonstrate the positive impact of our proposed scores on LRM training effectiveness. Based on the proposed OmniThought dataset, we further train and release a series of high-performing LRMs, specifically equipped with stronger reasoning abilities and optimal CoT output length and difficulty level. Our contributions significantly enhance the development and training of LRMs for solving complex tasks.
zh
[NLP-51] Connecting the Dots: A Chain-of-Collaboration Prompting Framework for LLM Agents
【速读】: 该论文旨在解决业务流程协作中存在的一系列挑战,包括单智能体链式思维方法在跨领域提示设计上的复杂性以及多智能体系统在消耗大量token和稀释核心问题方面的不足。其解决方案的关键在于提出Cochain协作提示框架,该框架通过在较低成本下整合知识和提示,有效解决了业务流程协作问题。具体而言,Cochain构建了一个集成的知识图谱,并通过维护和检索提示树来获取与业务流程其他阶段相关的提示信息。
链接: https://arxiv.org/abs/2505.10936
作者: Jiaxing Zhao,Hongbin Xie,Yuzhen Lei,Xuan Song,Zhuoran Shi,Lianxin Li,Shuangxue Liu,Haoran Zhang
机构: School of Artificial Intelligence, Jilin University (人工智能学院,吉林大学); Department of Computer Science and Engineering, Southern University of Science and Technology (计算机科学与工程系,南方科技大学); School of Urban Planning and Design, Peking University (城市规划与设计学院,北京大学)
类目: Computation and Language (cs.CL)
备注: 34 pages, 20 figures
Abstract:Large Language Models (LLMs) have demonstrated impressive performance in executing complex reasoning tasks. Chain-of-thought effectively enhances reasoning capabilities by unlocking the potential of large models, while multi-agent systems provide more comprehensive solutions by integrating collective intelligence of multiple agents. However, both approaches face significant limitations. Single-agent with chain-of-thought, due to the inherent complexity of designing cross-domain prompts, faces collaboration challenges. Meanwhile, multi-agent systems consume substantial tokens and inevitably dilute the primary problem, which is particularly problematic in business workflow tasks. To address these challenges, we propose Cochain, a collaboration prompting framework that effectively solves business workflow collaboration problem by combining knowledge and prompts at a reduced cost. Specifically, we construct an integrated knowledge graph that incorporates knowledge from multiple stages. Furthermore, by maintaining and retrieving a prompts tree, we can obtain prompt information relevant to other stages of the business workflow. We perform extensive evaluations of Cochain across multiple datasets, demonstrating that Cochain outperforms all baselines in both prompt engineering and multi-agent LLMs. Additionally, expert evaluation results indicate that the use of a small model in combination with Cochain outperforms GPT-4.
zh
[NLP-52] A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
【速读】: 该论文旨在系统化分析计算机使用代理(Computer-Using Agents, CUAs)在安全与可靠性方面面临的问题,并提出相应的防御策略。其核心问题在于随着CUAs能力的增强,其引入的安全风险也日益复杂,尤其是在大型语言模型(Large Language Models, LLMs)驱动的推理过程中,结合多软件组件和多模态输入所带来的挑战。解决方案的关键在于通过全面的文献综述,从四个研究目标出发:定义适用于安全分析的CUA、分类当前的安全威胁、构建现有的防御策略分类体系,以及总结评估CUA安全性与性能的基准、数据集和评价指标,从而为未来研究提供结构化基础并为实践者提供可操作的安全设计指导。
链接: https://arxiv.org/abs/2505.10924
作者: Ada Chen,Yongjiang Wu,Junyuan Zhang,Shu Yang,Jen-tse Huang,Kun Wang,Wenxuan Wang,Shuai Wang
机构: Carnegie Mellon University (卡内基梅隆大学); The Chinese University of Hong Kong (香港中文大学); KAUST (沙特阿美基础科学研究院); Johns Hopkins University (约翰霍普金斯大学); Nanyang Technological University (南洋理工大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注:
Abstract:Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of Computer-Using Agents (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: (i) define the CUA that suits safety analysis; (ii) categorize current safety threats among CUAs; (iii) propose a comprehensive taxonomy of existing defensive strategies; (iv) summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.
zh
[NLP-53] REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?
【速读】: 该论文试图解决人类指令中指代表达(Referencing Expressions, REs)的模糊性对基于大语言模型(Large Language Model, LLM)的机器人任务规划性能造成的影响问题。研究发现,REs的模糊性会导致机器人任务规划成功率显著下降,最高可达77.9%。解决方案的关键在于提出一种面向任务的上下文认知方法,该方法能够生成清晰的指令以供机器人执行,从而有效缓解REs带来的问题,并在性能上优于现有的提示感知和思维链方法。
链接: https://arxiv.org/abs/2505.10872
作者: Chenxi Jiang,Chuhao Zhou,Jianfei Yang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to CoRL 2025, under review
Abstract:Robot task planning decomposes human instructions into executable action sequences that enable robots to complete a series of complex tasks. Although recent large language model (LLM)-based task planners achieve amazing performance, they assume that human instructions are clear and straightforward. However, real-world users are not experts, and their instructions to robots often contain significant vagueness. Linguists suggest that such vagueness frequently arises from referring expressions (REs), whose meanings depend heavily on dialogue context and environment. This vagueness is even more prevalent among the elderly and children, whom robots should serve more. This paper studies how such vagueness in REs within human instructions affects LLM-based robot task planning and how to overcome this issue. To this end, we propose the first robot task planning benchmark with vague REs (REI-Bench), where we discover that the vagueness of REs can severely degrade robot planning performance, leading to success rate drops of up to 77.9%. We also observe that most failure cases stem from missing objects in planners. To mitigate the REs issue, we propose a simple yet effective approach: task-oriented context cognition, which generates clear instructions for robots, achieving state-of-the-art performance compared to aware prompting and chain-of-thought baselines. This work contributes to the research community of human-robot interaction (HRI) by making robot task planning more practical, particularly for non-expert users, e.g., the elderly and children.
zh
[NLP-54] Improve Rule Retrieval and Reasoning with Self-Induction and Relevance ReEstimate ACL2025
【速读】: 该论文试图解决规则检索(rule retrieval)中存在的挑战,这一领域虽关键但研究不足。传统检索方法使用稀疏或稠密检索器直接搜索相关规则以支持下游推理,但由于查询中的具体事实与规则的抽象表示之间存在显著语义差距,导致检索准确性较低,进而影响推理性能。解决方案的关键在于提出Self-Induction Augmented Retrieval (SIAR),该方法利用大型语言模型(Large Language Models, LLMs)从查询中抽象出潜在的推理规则,用于查询增强以提升检索效果;同时引入Rule Relevance ReEstimate (R^3),通过评估检索到的规则所包含的抽象知识是否能够实例化以匹配查询中的事实并有助于推理,从而重新估计规则的相关性。
链接: https://arxiv.org/abs/2505.10870
作者: Ziyang Huang,Wangtao Sun,Jun Zhao,Kang Liu
机构: Institute of Automation, Chinese Academy of Sciences (自动化研究所,中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: ACL 2025
Abstract:This paper systematically addresses the challenges of rule retrieval, a crucial yet underexplored area. Vanilla retrieval methods, which use sparse or dense retrievers to directly search for relevant rules to support downstream reasoning, often suffer from low accuracy. This is primarily due to a significant semantic gap between the instantiated facts in the queries and the abstract representations of the rules. Such misalignment results in suboptimal retrieval quality, which in turn negatively impacts reasoning performance. To overcome these challenges, we propose Self-Induction Augmented Retrieval (SIAR), a novel approach that utilizes Large Language Models (LLMs) to induce potential inferential rules that might offer benefits for reasoning by abstracting the underlying knowledge and logical structure in queries. These induced rules are then used for query augmentation to improve retrieval effectiveness. Additionally, we introduce Rule Relevance ReEstimate (R^3), a method that re-estimates the relevance of retrieved rules by assessing whether the abstract knowledge they contain can be instantiated to align with the facts in the queries and whether they are helpful for reasoning. Extensive experiments across various settings demonstrate the effectiveness and versatility of our proposed methods.
zh
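SIAR 的流程是:先让 LLM 从查询中归纳出潜在的抽象规则,再把规则拼回查询做检索增强,以弥合"具体事实—抽象规则"的语义差距。下面用可注入的桩函数示意这一流程;提示词与示例输出均为演示假设:

```python
# SIAR 式查询增强的流程示意(LLM 与检索器均为可注入的桩)
from typing import Callable, List

def siar_retrieve(query: str,
                  induce: Callable[[str], str],
                  retriever: Callable[[str], List[str]]) -> List[str]:
    rule = induce(f"Abstract the underlying general rule from: {query}")
    augmented = f"{query}\n[Induced rule] {rule}"  # 缩小事实与抽象规则的语义差距
    return retriever(augmented)

rules = siar_retrieve(
    "Tom is Mary's father and Mary is Ann's mother.",
    induce=lambda p: "the parent of one's parent is one's grandparent",
    retriever=lambda q: ["grandparent(X,Z) :- parent(X,Y), parent(Y,Z)"])
print(rules)
```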
[NLP-55] Have Multimodal Large Language Models (MLLMs) Really Learned to Tell the Time on Analog Clocks?
【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在识别模拟时钟时间方面存在的困难。研究认为,这一问题可能源于训练数据集中缺乏不同时间点的时钟图像。解决方案的关键在于通过微调(fine-tuning)来改善模型对时钟时间的识别能力,并通过测试不同类型的时钟来评估模型是否真正掌握了该技能,而非仅仅依赖训练数据中的模式。
链接: https://arxiv.org/abs/2505.10862
作者: Tairan Fu,Miguel González,Javier Conde,Elena Merino-Gómez,Pedro Reviriego
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Universidad Politécnica de Madrid (马德里理工大学); Universidad de Valladolid (瓦拉多利德大学)
类目: Computation and Language (cs.CL)
备注: 6 pages, 5 figures, 2 tables
Abstract:Multimodal Large Language Models which can answer complex questions on an image struggle to tell the time on analog clocks. This is probably due to the lack of images with clocks at different times in their training set. In this work we explore this issue with one of the latest MLLMs: GPT-4.1 to understand why MLLMs fail to tell the time and whether fine-tuning can solve the problem. The results show how models are making progress in reading the time on analog clocks. But have they really learned to do it, or have they only learned patterns in their training datasets? In this work we put the models to the test with different clocks to illustrate the limitations of MLLMs to abstract and generalize.
zh
[NLP-56] Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models
【速读】: 该论文试图解决当前AI系统评估中仅依赖准确率(accuracy)作为指标的局限性,旨在更深入地探究模型在解决问题时所采用的推理策略。其解决方案的关键在于引入一个基于长篇叙述形式的谜题(brainteasers)基准,通过该基准分析模型在多个推理层次上的表现,包括语义解析、解题生成、自我修正、步骤分解及利用提示等过程,从而评估模型解题的正确性、质量和创造性。
链接: https://arxiv.org/abs/2505.10844
作者: Simeng Han,Stephen Xia,Grant Zhang,Howard Dai,Chen Liu,Lichang Chen,Hoang Huy Nguyen,Hongyuan Mei,Jiayuan Mao,R. Thomas McCoy
机构: Yale University (耶鲁大学); University of Maryland, College Park (马里兰大学学院公园分校); Georgia Institute of Technology (佐治亚理工学院); xAI; Massachusetts Institute of Technology (麻省理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 Tables; 5 Figures
Abstract:Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We investigate many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise mathematical competition style formats; (2) generating solutions from these mathematical forms; (3) self-correcting solutions based on gold solutions; (4) producing step-by-step sketches of solutions; and (5) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems in creative ways. Nonetheless, there also remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improvement in the reasoning abilities of LLMs.
zh
[NLP-57] LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs
【速读】: 该论文试图解决如何高效地对大型语言模型(Large Language Models, LLMs)进行红队测试以发现其漏洞的问题。传统方法在攻击中常使用LLMs作为优化器,但由于离散的语言空间导致基于梯度的方法难以有效应用。该研究提出的解决方案是LARGO(Latent Adversarial Reflection through Gradient Optimization),其关键在于在LLM的连续潜在空间内操作,通过优化对抗性潜在向量并递归调用同一LLM将其解码为自然语言,从而生成流畅且隐蔽的越狱提示。这种方法实现了快速、有效且可迁移的攻击效果。
链接: https://arxiv.org/abs/2505.10838
作者: Ran Li,Hao Wang,Chengzhi Mao
机构: Columbia University (哥伦比亚大学); Rutgers University (罗格斯大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Efficient red-teaming methods to uncover vulnerabilities in Large Language Models (LLMs) are crucial. While recent attacks often use LLMs as optimizers, the discrete language space makes gradient-based methods struggle. We introduce LARGO (Latent Adversarial Reflection through Gradient Optimization), a novel latent self-reflection attack that reasserts the power of gradient-based optimization for generating fluent jailbreaking prompts. By operating within the LLM’s continuous latent space, LARGO first optimizes an adversarial latent vector and then recursively calls the same LLM to decode the latent into natural language. This methodology yields a fast, effective, and transferable attack that produces fluent and stealthy prompts. On standard benchmarks like AdvBench and JailbreakBench, LARGO surpasses leading jailbreaking techniques, including AutoDAN, by 44 points in attack success rate. Our findings demonstrate a potent alternative to agentic LLM prompting, highlighting the efficacy of interpreting and attacking LLM internals through gradient optimization.
zh
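LARGO 将攻击拉回梯度优化:在连续潜在空间里直接对一个潜在向量做基于梯度的更新。下面仅示意这个优化循环本身,目标函数用"余弦相似度拉向目标方向"的无害桩代替;真实方法需要可微地接入 LLM 内部并递归解码,细节见原文:

```python
# 连续潜在空间中梯度优化的骨架示意(loss_fn 为演示用的桩)
import torch

def optimize_latent(loss_fn, d_model=256, steps=200, lr=0.05):
    z = (0.1 * torch.randn(d_model)).requires_grad_(True)  # 待优化的潜在向量
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(z)  # 真实场景中例如目标回复的负对数似然
        loss.backward()
        opt.step()
    return z.detach()

# 桩目标:把 z 拉向某个"目标方向",示意优化循环能收敛
target = torch.randn(256)
z = optimize_latent(lambda z: -torch.cosine_similarity(z, target, dim=0))
print(torch.cosine_similarity(z, target, dim=0).item())  # 应接近 1
```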
[NLP-58] Multimodal Event Detection: Current Approaches and Defining the New Playground through LLMs and VLMs
【速读】: 该论文试图解决在社交媒体上检测事件的挑战,传统单模态系统由于数据传播的快速性和多模态特性而表现不佳。解决方案的关键在于采用多种模型,包括单模态模型(如ModernBERT和ConvNeXt-V2)、多模态融合技术以及先进的生成式模型(如GPT-4o和LLaVA),并通过评估多模态生成式模型在仅提供单一模态输入时的表现来优化其效果。
链接: https://arxiv.org/abs/2505.10836
作者: Abhishek Dey,Aabha Bothera,Samhita Sarikonda,Rishav Aryan,Sanjay Kumar Podishetty,Akshay Havalgi,Gaurav Singh,Saurabh Srivastava
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at NLDB 2025
Abstract:In this paper, we study the challenges of detecting events on social media, where traditional unimodal systems struggle due to the rapid and multimodal nature of data dissemination. We employ a range of models, including unimodal ModernBERT and ConvNeXt-V2, multimodal fusion techniques, and advanced generative models like GPT-4o and LLaVA. Additionally, we also study the effect of providing multimodal generative models (such as GPT-4o) with a single modality to assess their efficacy. Our results indicate that while multimodal approaches notably outperform unimodal counterparts, generative approaches, despite having a large number of parameters, lag behind supervised methods in precision. Furthermore, we also found that they lag behind instruction-tuned models because of their inability to generate event classes correctly. During our error analysis, we discovered that common social media issues such as leet speak, text elongation, etc. are effectively handled by generative approaches but are hard to tackle using supervised approaches.
zh
[NLP-59] Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL
【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在处理简单问题时因过度进行显式推理而产生的计算开销和延迟问题。其解决方案的关键在于赋予LRMs自适应的思考能力,使其能够根据问题复杂度动态决定是否进行显式推理。研究发现,在提示中插入简单的省略号(“…”)可以随机触发思考或非思考模式,从而揭示出推理行为的潜在可控性。基于此特性,作者提出了AutoThink,一个通过分阶段强化学习(multi-stage reinforcement learning)优化推理策略的框架,使模型仅在必要时调用显式推理,而在简单任务中默认生成简洁响应,从而实现了准确率与效率之间的良好平衡。
链接: https://arxiv.org/abs/2505.10832
作者: Songjun Tu,Jiahao Lin,Qichao Zhang,Xiangyu Tian,Linjing Li,Xiangyuan Lan,Dongbin Zhao
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Pengcheng Laboratory (鹏城实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis (“…”) into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs.
zh
[NLP-60] Creating General User Models from Computer Use
[Quick Read]: This paper addresses the problem that current user models are fragmented, narrowly tailored to specific applications, and incapable of flexible reasoning, which has held back long-standing visions in human-computer interaction (HCI). The key to the solution is a general user model (GUM) that learns a user's knowledge and preferences by observing any interaction the user has with their computer: it infers confidence-weighted propositions from multimodal observations, retrieves related propositions for context, and continuously revises existing propositions, enabling proactive anticipation of and adaptation to user needs.
Link: https://arxiv.org/abs/2505.10831
Authors: Omar Shaikh, Shardul Sapkota, Shan Rizvi, Eric Horvitz, Joon Sung Park, Diyi Yang, Michael S. Bernstein
Institutions: Stanford University; Microsoft Research
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages, 6 figures, 1 table; see this https URL
Abstract:Human-computer interaction has long imagined technology that understands us-from our preferences and habits, to the timing and purpose of our everyday actions. Yet current user models remain fragmented, narrowly tailored to specific apps, and incapable of the flexible reasoning required to fulfill these visions. This paper presents an architecture for a general user model (GUM) that learns about you by observing any interaction you have with your computer. The GUM takes as input any unstructured observation of a user (e.g., device screenshots) and constructs confidence-weighted propositions that capture that user knowledge and preferences. GUMs can infer that a user is preparing for a wedding they’re attending from messages with a friend. Or recognize that a user is struggling with a collaborator’s feedback on a draft by observing multiple stalled edits and a switch to reading related work. GUMs introduce an architecture that infers new propositions about a user from multimodal observations, retrieves related propositions for context, and continuously revises existing propositions. To illustrate the breadth of applications that GUMs enable, we demonstrate how they augment chat-based assistants with context, manage OS notifications to selectively surface important information, and enable interactive agents that adapt to preferences across apps. We also instantiate proactive assistants (GUMBOs) that discover and execute useful suggestions on a user’s behalf using their GUM. In our evaluations, we find that GUMs make calibrated and accurate inferences about users, and that assistants built on GUMs proactively identify and perform actions that users wouldn’t think to request explicitly. Altogether, GUMs introduce methods that leverage multimodal models to understand unstructured context, enabling long-standing visions of HCI and entirely new interactive systems that anticipate user needs.
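As a rough illustration of the proposition store the abstract describes (confidence-weighted propositions that are retrieved and revised), the sketch below uses hypothetical class and method names; a real system would use embedding-based retrieval rather than the substring match shown here.

```python
# Illustrative data structure for GUM-style confidence-weighted propositions.
# Class and method names are hypothetical, not taken from the paper's code.
from dataclasses import dataclass, field

@dataclass
class Proposition:
    text: str            # e.g. "User is preparing for a friend's wedding"
    confidence: float    # calibrated belief in [0, 1]
    evidence: list = field(default_factory=list)  # supporting observation ids

class PropositionStore:
    def __init__(self):
        self.props: list[Proposition] = []

    def retrieve(self, query: str, k: int = 5) -> list[Proposition]:
        # Substring match keeps the sketch self-contained; a real system
        # would rank by embedding similarity.
        hits = [p for p in self.props if query.lower() in p.text.lower()]
        return sorted(hits, key=lambda p: p.confidence, reverse=True)[:k]

    def revise(self, new: Proposition):
        # Merge with an existing proposition on exact text match, nudging
        # confidence toward the new evidence; otherwise add it.
        for p in self.props:
            if p.text == new.text:
                p.confidence = 0.5 * (p.confidence + new.confidence)
                p.evidence += new.evidence
                return
        self.props.append(new)

store = PropositionStore()
store.revise(Proposition("User is preparing for a friend's wedding", 0.7, ["msg-123"]))
print(store.retrieve("wedding")[0])
```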
zh
[NLP-61] Enhancing Low-Resource Minority Language Translation with LLMs and Retrieval-Augmented Generation for Cultural Nuances
[Quick Read]: This paper aims to improve translation for low-resource languages by integrating Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG). The key to the solution is combining a retrieval mechanism with strong language modeling to improve lexical coverage and grammatical coherence, especially for specialized or culturally nuanced terms. The study also highlights the value of iterative correction and the difficulty of domain-specific expressions, showing the limitations of relying solely on static dictionaries.
Link: https://arxiv.org/abs/2505.10829
Authors: Chen-Chi Chang, Chong-Fu Li, Chu-Hsuan Lee, Hung-Shin Lee
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to IntelliSys 2025
Abstract:This study investigates the challenges of translating low-resource languages by integrating Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG). Various model configurations were tested on Hakka translations, with BLEU scores ranging from 12% (dictionary-only) to 31% (RAG with Gemini 2.0). The best-performing model (Model 4) combined retrieval and advanced language modeling, improving lexical coverage, particularly for specialized or culturally nuanced terms, and enhancing grammatical coherence. A two-stage method (Model 3) using dictionary outputs refined by Gemini 2.0 achieved a BLEU score of 26%, highlighting iterative correction’s value and the challenges of domain-specific expressions. Static dictionary-based approaches struggled with context-sensitive content, demonstrating the limitations of relying solely on predefined resources. These results emphasize the need for curated resources, domain knowledge, and ethical collaboration with local communities, offering a framework that improves translation accuracy and fluency while supporting cultural preservation.
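A minimal sketch of the two-stage idea behind Model 3: a dictionary pass drafts a translation, which an LLM then refines. The dictionary entries are toy placeholders (not authentic Hakka), and `refine_with_llm` stands in for the Gemini 2.0 call.

```python
# Two-stage dictionary-then-LLM pipeline, sketched under stated assumptions.
hakka_dictionary = {"hello": "你好", "thank you": "多謝"}  # toy entries, not authentic Hakka

def dictionary_translate(sentence: str) -> str:
    draft = sentence.lower()
    for src, tgt in hakka_dictionary.items():
        draft = draft.replace(src, tgt)
    return draft

def refine_with_llm(draft: str, source: str) -> str:
    # Placeholder for a Gemini 2.0 API call that would correct grammar
    # and culturally nuanced terms; returns the draft so the sketch runs.
    _prompt = (f"Source: {source}\nDraft Hakka translation: {draft}\n"
               f"Refine the draft for grammatical coherence.")
    return draft  # substitute with a real LLM call

source = "hello, thank you"
print(refine_with_llm(dictionary_translate(source), source))
```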
zh
[NLP-62] Relation Extraction Across Entire Books to Reconstruct Community Networks: The AffilKG Datasets
[Quick Read]: This paper addresses the accuracy of knowledge graphs (KGs) automatically extracted from text, a question that existing annotated datasets cannot evaluate because their KGs are poorly connected, too small, or overly complex. The key contribution is AffilKG, a collection of six datasets that are the first to pair complete book scans with large, labeled knowledge graphs. Each dataset features affiliation graphs, simple KGs capturing Member relationships between Person and Organization entities, useful for studying migration, community interactions, and other social phenomena; three datasets additionally include expanded KGs with a wider variety of relation types, enabling more comprehensive evaluation and analysis.
Link: https://arxiv.org/abs/2505.10798
Authors: Erica Cai, Sean McQuade, Kevin Young, Brendan O’Connor
Institutions: University of Massachusetts Amherst
Categories: Computation and Language (cs.CL)
Comments:
Abstract:When knowledge graphs (KGs) are automatically extracted from text, are they accurate enough for downstream analysis? Unfortunately, current annotated datasets can not be used to evaluate this question, since their KGs are highly disconnected, too small, or overly complex. To address this gap, we introduce AffilKG (this https URL), which is a collection of six datasets that are the first to pair complete book scans with large, labeled knowledge graphs. Each dataset features affiliation graphs, which are simple KGs that capture Member relationships between Person and Organization entities – useful in studies of migration, community interactions, and other social phenomena. In addition, three datasets include expanded KGs with a wider variety of relation types. Our preliminary experiments demonstrate significant variability in model performance across datasets, underscoring AffilKG’s ability to enable two critical advances: (1) benchmarking how extraction errors propagate to graph-level analyses (e.g., community structure), and (2) validating KG extraction methods for real-world social science research.
zh
[NLP-63] Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation
[Quick Read]: This paper addresses hallucinations in the Retrieval-Augmented Generation (RAG) framework that arise when irrelevant retrieved content is passed to a large language model (LLM). The key to the solution is Finetune-RAG, a simple and effective fine-tuning approach built on the first RAG training dataset constructed to mimic real-world retrieval imperfections, which improves factual accuracy.
Link: https://arxiv.org/abs/2505.10792
Authors: Zhan Peng Lee, Andre Lin, Calvin Tan
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to improve factuality in large language models (LLMs) by grounding their outputs in retrieved documents. However, ensuring perfect retrieval of relevant information remains challenging, and when irrelevant content is passed downstream to an LLM, it can lead to hallucinations. In this work, we propose Finetune-RAG, a simple and effective fine-tuning approach that features the first-of-its-kind RAG training dataset constructed to mimic real-world imperfections. Experimental results show that Finetune-RAG improves factual accuracy by 21.2% over the base model. We also propose Bench-RAG, an LLM-as-a-judge evaluation pipeline that stress-tests models under realistic imperfect retrieval scenarios. Our codebase and dataset are fully open sourced for community use.
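For intuition, here is a hedged sketch of how a Finetune-RAG-style training instance might mix a gold passage with a distractor so the model learns to ground its answer only in the correct source. The field names and prompt format are assumptions, not the released dataset's schema.

```python
# Constructing a training instance that mimics imperfect retrieval.
import json
import random

def build_example(question, gold_passage, distractor_passage, answer):
    passages = [gold_passage, distractor_passage]
    random.shuffle(passages)  # position should not signal relevance
    context = "\n\n".join(f"[Doc {i+1}] {p}" for i, p in enumerate(passages))
    return {
        "prompt": f"Context:\n{context}\n\nQuestion: {question}",
        "completion": answer,  # grounded only in the gold passage
    }

ex = build_example(
    question="When was the bridge completed?",
    gold_passage="The bridge was completed in 1932 after six years of work.",
    distractor_passage="The tunnel opened in 1927 and closed for repairs in 1931.",
    answer="The bridge was completed in 1932.",
)
print(json.dumps(ex, indent=2))
```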
zh
[NLP-64] A Systematic Analysis of Base Model Choice for Reward Modeling
[Quick Read]: This paper addresses the often-overlooked question of how the choice of base model affects reward modeling performance when training high-quality reward models (RMs). The key is a systematic analysis of base model selection, showing that performance can improve by up to 14% over the default choice and that results from a small set of benchmarks can be combined to improve model selection (+18% on average in the top 5-10). The paper also examines how different post-training steps affect final performance and explores using estimated data distributions to reduce performance prediction error.
Link: https://arxiv.org/abs/2505.10775
Authors: Kian Ahrabian, Pegah Jandaghi, Negar Mokhberian, Sai Praneeth Karimireddy, Jay Pujara
Institutions: University of Southern California; Information Sciences Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 13 figures, 5 tables
Abstract:Reinforcement learning from human feedback (RLHF) and, at its core, reward modeling have become a crucial part of training powerful large language models (LLMs). One commonly overlooked factor in training high-quality reward models (RMs) is the effect of the base model, which is becoming more challenging to choose given the rapidly growing pool of LLMs. In this work, we present a systematic analysis of the effect of base model selection on reward modeling performance. Our results show that the performance can be improved by up to 14% compared to the most common (i.e., default) choice. Moreover, we showcase the strong statistical relation between some existing benchmarks and downstream performances. We also demonstrate that the results from a small set of benchmarks could be combined to boost the model selection (+18% on average in the top 5-10). Lastly, we illustrate the impact of different post-training steps on the final performance and explore using estimated data distributions to reduce performance prediction error.
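As a toy illustration of combining a few benchmark scores to rank candidate base models, the sketch below z-normalizes per-benchmark scores and averages them; the model names, scores, and aggregation rule are invented and may differ from the paper's method.

```python
# Rank base-model candidates by a combined, normalized benchmark score.
import numpy as np

models = ["base-A", "base-B", "base-C"]
# rows: models; cols: benchmark scores (higher is better); all synthetic
scores = np.array([
    [0.62, 0.71, 0.55],
    [0.68, 0.64, 0.60],
    [0.59, 0.69, 0.66],
])

# z-normalize each benchmark so no single scale dominates, then average
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
combined = z.mean(axis=1)

for name, s in sorted(zip(models, combined), key=lambda t: -t[1]):
    print(f"{name}: combined score {s:+.2f}")
```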
zh
[NLP-65] Ranked Voting based Self-Consistency of Large Language Models ACL2025
[Quick Read]: This paper addresses a limitation of conventional chain-of-thought reasoning: each trial produces only a single answer, so potential alternative answers are discarded and ignored in the subsequent self-consistency voting. The key to the solution is to generate ranked answers in each reasoning pass and perform ranked voting across the ranked answers from multiple responses, making the overall self-consistency assessment more reliable. Three ranked voting rules are used: instant-runoff voting, Borda count voting, and mean reciprocal rank voting.
Link: https://arxiv.org/abs/2505.10772
Authors: Weiqin Wang, Yile Wang, Hui Huang
Institutions: Shenzhen University
Categories: Computation and Language (cs.CL)
Comments: ACL 2025 Findings
Abstract:Majority voting is considered an effective method to enhance chain-of-thought reasoning, as it selects the answer with the highest “self-consistency” among different reasoning paths (Wang et al., 2023). However, previous chain-of-thought reasoning methods typically generate only a single answer in each trial, thereby ignoring the possibility of other potential answers. As a result, these alternative answers are often overlooked in subsequent voting processes. In this work, we propose to generate ranked answers in each reasoning process and conduct ranked voting among multiple ranked answers from different responses, thereby making the overall self-consistency more reliable. Specifically, we use three ranked voting methods: Instant-runoff voting, Borda count voting, and mean reciprocal rank voting. We validate our methods on six datasets, including three multiple-choice and three open-ended question-answering tasks, using both advanced open-source and closed-source large language models. Extensive experimental results indicate that our proposed method outperforms the baselines, showcasing the potential of leveraging the information of ranked answers and using ranked voting to improve reasoning performance. The code is available at this https URL.
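The Borda count rule, one of the three voting methods named above, is easy to sketch; the candidate answers below are synthetic.

```python
# Borda count over ranked answers from multiple reasoning paths.
from collections import defaultdict

def borda_count(rankings: list[list[str]]) -> str:
    """Each ranking lists candidate answers from best to worst."""
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, answer in enumerate(ranking):
            scores[answer] += n - 1 - position  # top answer earns n-1 points
    return max(scores, key=scores.get)

# Three sampled responses, each returning a ranked list of answers.
rankings = [
    ["42", "41", "40"],
    ["41", "42", "40"],
    ["42", "40", "41"],
]
print(borda_count(rankings))  # -> "42"
```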
zh
[NLP-66] SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval
[Quick Read]: This paper addresses the global challenge posed by the spread of multilingual online disinformation, especially the lack of effective countermeasures for low-resource languages. The key contribution is a shared task on multilingual claim retrieval at SemEval 2025 that explores and evaluates methods for matching fact-checked claims to newly encountered claims in social media posts, in both monolingual and crosslingual settings. The task, its dataset, and the participating systems provide valuable benchmarks for multilingual claim retrieval and automated fact-checking, supporting further research in the field.
Link: https://arxiv.org/abs/2505.10740
Authors: Qiwei Peng, Robert Moro, Michal Gregor, Ivan Srba, Simon Ostermann, Marian Simko, Juraj Podroužek, Matúš Mesarčík, Jaroslav Kopčan, Anders Søgaard
Institutions: University of Copenhagen; Kempelen Institute of Intelligent Technologies; German Research Institute for Artificial Intelligence (DFKI)
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:The rapid spread of online disinformation presents a global challenge, and machine learning has been widely explored as a potential solution. However, multilingual settings and low-resource languages are often neglected in this field. To address this gap, we conducted a shared task on multilingual claim retrieval at SemEval 2025, aimed at identifying fact-checked claims that match newly encountered claims expressed in social media posts across different languages. The task includes two subtracks: (1) a monolingual track, where social posts and claims are in the same language, and (2) a crosslingual track, where social posts and claims might be in different languages. A total of 179 participants registered for the task, contributing 52 test submissions, and 23 out of 31 teams submitted system papers. In this paper, we report the best-performing systems as well as the most common and the most effective approaches across both subtracks. This shared task, along with its dataset and participating systems, provides valuable insights into multilingual claim retrieval and automated fact-checking, supporting future research in this field.
zh
[NLP-67] Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization
[Quick Read]: This paper addresses two problems in optimizing large language model (LLM) performance: manual prompt engineering is inefficient, and automated prompt optimization techniques rely on randomly selected evaluation subsets, leading to unreliable evaluations. The key to the solution is IPOMP (Iterative evaluation data selection for effective Prompt Optimization using real-time Model Performance), which selects representative and diverse samples via semantic clustering and boundary analysis, then iteratively refines the selection with real-time model performance data to replace redundant samples, improving the effectiveness and stability of prompt optimization.
Link: https://arxiv.org/abs/2505.10736
Authors: Ximing Dong, Shaowei Wang, Dayi Lin, Ahmed E. Hassan
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Optimizing Large Language Model (LLM) performance requires well-crafted prompts, but manual prompt engineering is labor-intensive and often ineffective. Automated prompt optimization techniques address this challenge but the majority of them rely on randomly selected evaluation subsets, which fail to represent the full dataset, leading to unreliable evaluations and suboptimal prompts. Existing coreset selection methods, designed for LLM benchmarking, are unsuitable for prompt optimization due to challenges in clustering similar samples, high data collection costs, and the unavailability of performance data for new or private datasets. To overcome these issues, we propose IPOMP, an Iterative evaluation data selection for effective Prompt Optimization using real-time Model Performance. IPOMP is a two-stage approach that selects representative and diverse samples using semantic clustering and boundary analysis, followed by iterative refinement with real-time model performance data to replace redundant samples. Evaluations on the BIG-bench dataset show that IPOMP improves effectiveness by 1.6% to 5.3% and stability by at least 57% compared with SOTA baselines, with minimal computational overhead below 1%. Furthermore, the results demonstrate that our real-time performance-guided refinement approach can be universally applied to enhance existing coreset selection methods.
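A simplified sketch of the first IPOMP stage, selecting representative samples (cluster centers) and boundary samples from embeddings. The embedding source and cluster count are illustrative, and the second stage (iterative refinement with real-time model performance) is omitted.

```python
# Stage-one selection: cluster the pool, keep centers and boundary points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pool = rng.normal(size=(500, 32))  # stand-in for sentence embeddings

k = 8
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pool)
dists = np.linalg.norm(pool - km.cluster_centers_[km.labels_], axis=1)

representative, boundary = [], []
for c in range(k):
    idx = np.where(km.labels_ == c)[0]
    representative.append(idx[np.argmin(dists[idx])])  # closest to center
    boundary.append(idx[np.argmax(dists[idx])])        # farthest: boundary case

eval_subset = sorted(set(representative) | set(boundary))
print(f"selected {len(eval_subset)} of {len(pool)} samples")
```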
zh
[NLP-68] Tracr-Injection: Distilling Algorithms into Pre-trained Language Models ACL
[Quick Read]: This paper addresses a mismatch: although the transformer architecture is theoretically capable of symbolic processing, those capabilities are hard to learn from unsupervised data in practice, so theoretical capacity and practical learnability diverge. The key to the solution is tracr-injection, a method that distills algorithms written in RASP (a programming language that compiles directly into transformer weights) into a pre-trained language model, creating an interpretable subspace in the model's residual stream and improving out-of-distribution performance.
Link: https://arxiv.org/abs/2505.10719
Authors: Tomás Vergara-Browne, Álvaro Soto
Institutions: Pontificia Universidad Católica de Chile; CENIA; Mila; McGill University
Categories: Computation and Language (cs.CL)
Comments: ACL Findings 2025
Abstract:Motivated by the surge of large language models, there has been a push to formally characterize the symbolic abilities intrinsic to the transformer architecture. A programming language, called RASP, has been proposed, which can be directly compiled into transformer weights to implement these algorithms. However, the tasks that can be implemented in RASP are often uncommon to learn from natural unsupervised data, showing a mismatch between the theoretical capabilities of the transformer architecture and the practical learnability of these capabilities from unsupervised data. We propose tracr-injection, a method that allows us to distill algorithms written in RASP directly into a pre-trained language model. We showcase our method by injecting 3 different algorithms into a language model. We show how our method creates an interpretable subspace within the model’s residual stream, which can be decoded into the variables present in the code of the RASP algorithm. Additionally, we found that the proposed method can improve out-of-distribution performance compared to our baseline, indicating that a more symbolic mechanism is indeed taking place in the inner workings of the model. We release the code used to run our experiments.
zh
[NLP-69] AI-enhanced semantic feature norms for 786 concepts
[Quick Read]: This paper addresses the trade-off in traditional semantic feature norms between concept/feature coverage and verifiable quality, a consequence of how labor-intensive norming studies are. The key to the solution is augmenting a human-generated feature norm dataset with responses from large language models (LLMs) while validating norm quality against reliable human judgments. The resulting AI-enhanced dataset, NOVA, shows much higher feature density and overlap among concepts and outperforms a comparable human-only norm dataset and word-embedding models in predicting human semantic similarity judgments.
Link: https://arxiv.org/abs/2505.10718
Authors: Siddharth Suresh, Kushin Mukherjee, Tyler Giallanza, Xizheng Yu, Mia Patil, Jonathan D. Cohen, Timothy T. Rogers
Institutions: University of Wisconsin-Madison; Stanford University; Princeton University; Brown University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 8 pages, 5 figures
Abstract:Semantic feature norms have been foundational in the study of human conceptual knowledge, yet traditional methods face trade-offs between concept/feature coverage and verifiability of quality due to the labor-intensive nature of norming studies. Here, we introduce a novel approach that augments a dataset of human-generated feature norms with responses from large language models (LLMs) while verifying the quality of norms against reliable human judgments. We find that our AI-enhanced feature norm dataset, NOVA: Norms Optimized Via AI, shows much higher feature density and overlap among concepts while outperforming a comparable human-only norm dataset and word-embedding models in predicting people’s semantic similarity judgments. Taken together, we demonstrate that human conceptual knowledge is richer than captured in previous norm datasets and show that, with proper validation, LLMs can serve as powerful tools for cognitive science research.
zh
[NLP-70] A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment
[Quick Read]: This paper targets the high computational cost and latency that hinder deploying large language models in clinical settings, along with two obstacles for small language models (SLMs): their limited capacity requires biomedical domain adaptation, and clinical data is scarce and highly sensitive. The key to the solution is a novel framework that pre-instruction-tunes expert models on relevant medical and clinical corpora, merges the experts into one model, and aligns the result to clinical tasks, improving SLM performance on clinical workloads. The authors also build the MediFlow synthetic dataset to support further alignment.
Link: https://arxiv.org/abs/2505.10717
Authors: Jean-Philippe Corbeil, Amin Dada, Jean-Michel Attendu, Asma Ben Abacha, Alessandro Sordoni, Lucas Caccia, François Beaulieu, Thomas Lin, Jens Kleesiek, Paul Vozila
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation, which remains challenging. An additional bottleneck is the unavailability and high sensitivity of clinical data. To address these challenges, we propose a novel framework for adapting SLMs into high-performing clinical models. We introduce the MediPhi collection of 3.8B-parameter SLMs developed with our novel framework: pre-instruction tuning of experts on relevant medical and clinical corpora (PMC, Medical Guideline, MedWiki, etc.), model merging, and clinical-tasks alignment. To cover most clinical tasks, we extended the CLUE benchmark to CLUE+, doubling its size. Our expert models deliver relative improvements on this benchmark over the base model without any task-specific fine-tuning: 64.3% on medical entities, 49.5% on radiology reports, and 44% on ICD-10 coding (outperforming GPT-4-0125 by 14%). We unify the expert models into MediPhi via model merging, preserving gains across benchmarks. Furthermore, we built the MediFlow collection, a synthetic dataset of 2.5 million high-quality instructions on 14 medical NLP tasks, 98 fine-grained document types, and JSON format support. Alignment of MediPhi using supervised fine-tuning and direct preference optimization achieves further gains of 18.9% on average.
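The abstract does not specify MediPhi's exact merging algorithm, so the sketch below shows uniform weight averaging, one common realization of model merging, on toy modules.

```python
# Merge expert models by averaging their parameters (a common baseline
# for "model merging"; MediPhi's actual method may differ).
import torch

def merge_state_dicts(state_dicts: list[dict]) -> dict:
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Two tiny "experts" with identical architecture, for demonstration.
a = torch.nn.Linear(4, 2)
b = torch.nn.Linear(4, 2)
merged = merge_state_dicts([a.state_dict(), b.state_dict()])

model = torch.nn.Linear(4, 2)
model.load_state_dict(merged)
```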
zh
[NLP-71] GeoGrid-Bench: Can Foundation Models Understand Multimodal Gridded Geo-Spatial Data?
[Quick Read]: This paper addresses how to evaluate foundation models' ability to understand gridded geo-spatial data, such as climate variables with dense numerical values, strong spatiotemporal dependencies, and multimodal representations (tables, heatmaps, and geographic visualizations). The key to the solution is the GeoGrid-Bench benchmark, built from large-scale real-world data covering 16 climate variables at 150 locations over extended time frames, with roughly 3,200 question-answer pairs systematically generated from 8 domain-expert-curated templates that reflect tasks scientists encounter in practice.
Link: https://arxiv.org/abs/2505.10714
Authors: Bowen Jiang, Yangxinyu Xie, Xiaomeng Wang, Jiashu He, Joshua Bergerson, John K Hutchison, Jordan Branham, Camillo J Taylor, Tanwi Mallick
Institutions: University of Pennsylvania; Argonne National Laboratory
Categories: Computation and Language (cs.CL)
Comments:
Abstract:We present GeoGrid-Bench, a benchmark designed to evaluate the ability of foundation models to understand geo-spatial data in the grid structure. Geo-spatial datasets pose distinct challenges due to their dense numerical values, strong spatial and temporal dependencies, and unique multimodal representations including tabular data, heatmaps, and geographic visualizations. To assess how foundation models can support scientific research in this domain, GeoGrid-Bench features large-scale, real-world data covering 16 climate variables across 150 locations and extended time frames. The benchmark includes approximately 3,200 question-answer pairs, systematically generated from 8 domain expert-curated templates to reflect practical tasks encountered by human scientists. These range from basic queries at a single location and time to complex spatiotemporal comparisons across regions and periods. Our evaluation reveals that vision-language models perform best overall, and we provide a fine-grained analysis of the strengths and limitations of different foundation models in different geo-spatial tasks. This benchmark offers clearer insights into how foundation models can be effectively applied to geo-spatial data analysis and used to support scientific research.
zh
[NLP-72] Artificial Intelligence Bias on English Language Learners in Automatic Scoring
[Quick Read]: This paper investigates potential scoring bias and disparities toward English Language Learners (ELLs) when automatic scoring systems grade middle school students' written responses to science assessments. The key to the solution is examining how the balance of the training data affects bias by fine-tuning BERT on four datasets: ELL responses, non-ELL responses, an unbalanced mix reflecting real-world proportions, and a balanced mix. The results show no AI bias or scoring disparities when the training data is large enough (e.g., 30,000 or 1,000 ELL responses), but potential concerns remain with small samples (e.g., 200 ELL responses).
Link: https://arxiv.org/abs/2505.10643
Authors: Shuchen Guo, Yun Wang, Jichao Yu, Xuansheng Wu, Bilgehan Ayik, Field M. Watts, Ehsan Latif, Ninghao Liu, Lei Liu, Xiaoming Zhai
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:This study investigated potential scoring biases and disparities toward English Language Learners (ELLs) when using automatic scoring systems for middle school students’ written responses to science assessments. We specifically focus on examining how unbalanced training data with ELLs contributes to scoring bias and disparities. We fine-tuned BERT with four datasets: responses from (1) ELLs, (2) non-ELLs, (3) a mixed dataset reflecting the real-world proportion of ELLs and non-ELLs (unbalanced), and (4) a balanced mixed dataset with equal representation of both groups. The study analyzed 21 assessment items: 10 items with about 30,000 ELL responses, five items with about 1,000 ELL responses, and six items with about 200 ELL responses. Scoring accuracy (Acc) was calculated and compared to identify bias using Friedman tests. We measured the Mean Score Gaps (MSGs) between ELLs and non-ELLs and then calculated the differences in MSGs generated through both the human and AI models to identify the scoring disparities. We found no AI bias or distorted disparities between ELLs and non-ELLs when the training dataset was large enough (ELL = 30,000 and ELL = 1,000), but concerns could exist if the sample size is limited (ELL = 200).
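A sketch of the statistical machinery described above: a Friedman test comparing scoring accuracy across the four training conditions, plus the disparity between human and AI mean score gaps (MSGs). All numbers are synthetic.

```python
# Friedman test across conditions and an MSG disparity check.
import numpy as np
from scipy.stats import friedmanchisquare

# Accuracy of the four fine-tuned models on the same 21 items (synthetic).
acc_ell_only   = np.random.default_rng(1).uniform(0.75, 0.9, size=21)
acc_non_ell    = np.random.default_rng(2).uniform(0.75, 0.9, size=21)
acc_unbalanced = np.random.default_rng(3).uniform(0.75, 0.9, size=21)
acc_balanced   = np.random.default_rng(4).uniform(0.75, 0.9, size=21)

stat, p = friedmanchisquare(acc_ell_only, acc_non_ell, acc_unbalanced, acc_balanced)
print(f"Friedman chi2={stat:.2f}, p={p:.3f}")  # large p suggests no condition effect

# Scoring disparity: difference between human and AI mean score gaps.
human_msg = 0.42   # mean(non-ELL scores) - mean(ELL scores), human raters (synthetic)
ai_msg    = 0.45   # same gap under the AI model (synthetic)
print(f"disparity (AI - human MSG): {ai_msg - human_msg:+.2f}")
```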
zh
[NLP-73] MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
[Quick Read]: This paper addresses the lack of thorough evaluation of long-context vision-language models (LCVLMs) on tasks interleaving many images with text tokens, and proposes the comprehensive benchmark MMLongBench. The key to the solution is a benchmark of 13,331 examples spanning five categories of downstream tasks and a broad range of image types, delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme, enabling systematic assessment of long-context understanding and reasoning.
Link: https://arxiv.org/abs/2505.10610
Authors: Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman
Institutions: HKUST; Tencent AI Seattle Lab; University of Edinburgh; Miniml.AI; NVIDIA AI Technology Center (NVAITC), NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Work in progress
Abstract:The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models’ vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.
zh
[NLP-74] UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech
[Quick Read]: This paper tackles challenges in controllable emotional text-to-speech (TTS): traditional methods rely on predefined discrete emotion labels that cannot capture the complexity and continuity of human emotional perception and expression, and the lack of large-scale emotional speech datasets with balanced emotion distributions and fine-grained annotations causes synthesis models to overfit and hampers effective emotion control. The key to the solution is UDDETTS, a neural codec language model that unifies discrete and dimensional emotions: it introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description, supports emotion control driven by either discrete labels or nonlinearly quantized ADV values, and uses a semi-supervised training strategy to exploit diverse speech datasets with different types of emotion annotations.
Link: https://arxiv.org/abs/2505.10599
Authors: Jiaxuan Liu, Zhenhua Ling
Institutions: NERCSLIP, University of Science and Technology of China
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under review
Abstract:Recent neural codec language models have made great progress in the field of text-to-speech (TTS), but controllable emotional TTS still faces many challenges. Traditional methods rely on predefined discrete emotion labels to control emotion categories and intensities, which can’t capture the complexity and continuity of human emotional perception and expression. The lack of large-scale emotional speech datasets with balanced emotion distributions and fine-grained emotion annotations often causes overfitting in synthesis models and impedes effective emotion control. To address these issues, we propose UDDETTS, a neural codec language model unifying discrete and dimensional emotions for controllable emotional TTS. This model introduces the interpretable Arousal-Dominance-Valence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values. Furthermore, a semi-supervised training strategy is designed to comprehensively utilize diverse speech datasets with different types of emotion annotations to train the UDDETTS. Experiments show that UDDETTS achieves linear emotion control along the three dimensions of ADV space, and exhibits superior end-to-end emotional speech synthesis capabilities.
zh
[NLP-75] Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment
[Quick Read]: This paper addresses reward misgeneralization when aligning large language models (LLMs) with human values: noisy preferences in human feedback cause reward models (RMs) to overfit spurious patterns and provide misleading signals during policy optimization. The key to the solution is Collaborative Reward Modeling (CRM), which trains two reward models in parallel that evaluate each other's data selections to filter out noise, combined with curriculum learning that orders preference data from easy to hard, improving robustness and generalization.
Link: https://arxiv.org/abs/2505.10597
Authors: Jiazheng Zhang, Wenqing Jing, Zizhuo Zhang, Zhiheng Xi, Shihan Dou, Rongxiang Weng, Jiahuan Li, Jingang Wang, MingXu Cai, Shibo Hong, Tao Gui, Qi Zhang
Institutions: Fudan University; Hong Kong Baptist University; Meituan Inc
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Reward models (RMs) are essential for aligning large language models (LLMs) with human values. However, noisy preferences in human feedback often lead to reward misgeneralization, where RMs overfit to spurious patterns and provide misleading signals during policy optimization. We systematically analyze the training dynamics of preference pairs and identify that noisy examples are harder to fit and introduce instability. Empirical evidence shows that LLMs optimized using reward models trained on full noisy datasets perform worse than those trained on filtered, high-quality preferences. To address this, we propose Collaborative Reward Modeling (CRM), an online framework that enhances robustness by combining peer review and curriculum learning. Two reward models are trained in parallel and assess each other’s data selections to filter out potential noise. Curriculum learning structures the preference data from easy to hard, ensuring synchronized training and stable feedback. Extensive experiments demonstrate that CRM improves generalization, with up to 9.94 points of accuracy gain on RewardBench under 40 percent label noise. CRM is also compatible with implicit-reward alignment methods, offering a practical and versatile strategy for robust alignment.
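A toy sketch of CRM's peer-review filtering: each reward model keeps only the preference pairs its peer agrees with. The reward functions and margin are stand-ins, and the curriculum-learning component is omitted.

```python
# Peer-review filtering between two reward models (toy stand-ins).
def agrees(rm, chosen: str, rejected: str, margin: float = 0.0) -> bool:
    return rm(chosen) - rm(rejected) > margin

# Stand-in reward models: crude length-based heuristics for illustration.
rm_a = lambda text: len(text)
rm_b = lambda text: len(set(text.split()))

pairs = [
    ("a detailed, correct answer", "short"),
    ("ok", "a rambling but preferred-labeled answer"),  # likely noisy label
]

train_for_a = [p for p in pairs if agrees(rm_b, *p)]  # B reviews A's data
train_for_b = [p for p in pairs if agrees(rm_a, *p)]  # A reviews B's data
print(len(train_for_a), len(train_for_b))  # the noisy pair is filtered out
```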
zh
[NLP-76] Understanding Gen Alpha Digital Language: Evaluation of LLM Safety Systems for Content Moderation
[Quick Read]: This paper addresses the failure of current AI systems to detect masked online harassment and manipulation within the discourse of Generation Alpha (Gen Alpha, born 2010-2024), whose distinct digital language, shaped by gaming, memes, and AI-driven trends, conceals harmful interactions from both traditional safety tools and automated moderation. The key contributions are a first-of-its-kind dataset of Gen Alpha expressions and an improved AI content-moderation framework for youth protection, complemented by a multi-perspective evaluation (AI systems, human moderators, and parents) with direct input from Gen Alpha co-researchers.
Link: https://arxiv.org/abs/2505.10588
Authors: Manisha Mehta, Fausto Giunchiglia
Institutions: Warren E Hyde Middle School; University of Trento
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted to ACM FAccT 2025. To be presented in Athens, June 2025, and published in the conference proceedings. Preprint version; final version will appear in the ACM Digital Library
Abstract:This research offers a unique evaluation of how AI systems interpret the digital language of Generation Alpha (Gen Alpha, born 2010-2024). As the first cohort raised alongside AI, Gen Alpha faces new forms of online risk due to immersive digital engagement and a growing mismatch between their evolving communication and existing safety tools. Their distinct language, shaped by gaming, memes, and AI-driven trends, often conceals harmful interactions from both human moderators and automated systems. We assess four leading AI models (GPT-4, Claude, Gemini, and Llama 3) on their ability to detect masked harassment and manipulation within Gen Alpha discourse. Using a dataset of 100 recent expressions from gaming platforms, social media, and video content, the study reveals critical comprehension failures with direct implications for online safety. This work contributes: (1) a first-of-its-kind dataset capturing Gen Alpha expressions; (2) a framework to improve AI moderation systems for youth protection; (3) a multi-perspective evaluation including AI systems, human moderators, and parents, with direct input from Gen Alpha co-researchers; and (4) an analysis of how linguistic divergence increases youth vulnerability. Findings highlight the urgent need to redesign safety systems attuned to youth communication, especially given Gen Alpha’s reluctance to seek help when adults fail to understand their digital world. This study combines the insight of a Gen Alpha researcher with systematic academic analysis to address critical digital safety challenges.
zh
[NLP-77] Towards Automated Situation Awareness: A RAG-Based Framework for Peacebuilding Reports
[Quick Read]: This paper addresses the delays and accuracy limits of manually analyzing large volumes of heterogeneous data for humanitarian response, conflict monitoring, and early warning and early action. The key to the solution is a dynamic Retrieval-Augmented Generation (RAG) system that autonomously generates situation awareness reports by integrating real-time sources (news articles, conflict event databases, and economic indicators) and constructing query-specific knowledge bases on demand, ensuring reports are timely, relevant, and accurate.
Link: https://arxiv.org/abs/2505.10586
Authors: Poli A. Nemkova, Suleyman O. Polat, Rafid I. Jahan, Sagnik Ray Choudhury, Sun-joo Lee, Shouryadipta Sarkar, Mark V. Albert
Institutions: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:
Abstract:Timely and accurate situation awareness is vital for decision-making in humanitarian response, conflict monitoring, and early warning and early action. However, the manual analysis of vast and heterogeneous data sources often results in delays, limiting the effectiveness of interventions. This paper introduces a dynamic Retrieval-Augmented Generation (RAG) system that autonomously generates situation awareness reports by integrating real-time data from diverse sources, including news articles, conflict event databases, and economic indicators. Our system constructs query-specific knowledge bases on demand, ensuring timely, relevant, and accurate insights. To ensure the quality of generated reports, we propose a three-level evaluation framework that combines semantic similarity metrics, factual consistency checks, and expert feedback. The first level employs automated NLP metrics to assess coherence and factual accuracy. The second level involves human expert evaluation to verify the relevance and completeness of the reports. The third level utilizes LLM-as-a-Judge, where large language models provide an additional layer of assessment to ensure robustness. The system is tested across multiple real-world scenarios, demonstrating its effectiveness in producing coherent, insightful, and actionable reports. By automating report generation, our approach reduces the burden on human analysts and accelerates decision-making processes. To promote reproducibility and further research, we openly share our code and evaluation tools with the community via GitHub.
zh
[NLP-78] Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models
[Quick Read]: This paper asks whether multimodal large language models integrate their modalities through shared representations; specifically, whether images and textual descriptions map to similar regions of the latent space. The key to the solution is machine teaching, the theory of the minimal example set a teacher must choose for a learner to acquire a concept: the authors measure the teaching complexity of a subset of Quick, Draw! objects for vision-language models under two presentations, raw bitmap images and stroke coordinates in TikZ format, and compare the results across modalities.
Link: https://arxiv.org/abs/2505.10583
Authors: Diogo Freitas, Brigt Håvardstun, Cèsar Ferri, Darío Garigliotti, Jan Arne Telle, José Hernández-Orallo
Institutions: Interactive Technologies Institute; NOVA LINCS; University of Madeira; Department of Informatics, University of Bergen; Valencian Research Institute for Artificial Intelligence; Universitat Politècnica de València; Leverhulme Centre for the Future of Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 54 pages (42 pages of appendix)
Abstract:Large language models have become multimodal, and many of them are said to integrate their modalities using common representations. If this were true, a drawing of a car as an image, for instance, should map to the similar area in the latent space as a textual description of the strokes that conform the drawing. To explore this in a black-box access regime to these models, we propose the use of machine teaching, a theory that studies the minimal set of examples a teacher needs to choose so that the learner captures the concept. In this paper we evaluate the complexity of teaching visual-language models a subset of objects in the Quick, Draw! dataset using two presentations: raw images as bitmaps and trace coordinates in TikZ format. The results indicate that image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based representations. But, surprisingly, the teaching size usually ranks concepts similarly across both modalities, even when controlling for (a human proxy of) concept priors, suggesting that the simplicity of concepts may be an inherent property that transcends modality representations.
zh
[NLP-79] Large Language Models for Cancer Communication: Evaluating Linguistic Quality Safety and Accessibility in Generative AI
[Quick Read]: This paper addresses the effectiveness and safety of health communication about breast and cervical cancers, evaluating the capabilities and limitations of Large Language Models (LLMs) in generating accurate, safe, and accessible cancer-related information. The key to the solution is a mixed-methods evaluation framework covering linguistic quality, safety and trustworthiness, and communication accessibility and affectiveness, combining quantitative metrics, qualitative expert ratings, and statistical analysis. The findings reveal a duality between general-purpose and medical LLMs across these dimensions and point to mitigating harm and bias and improving safety as priorities for model design.
Link: https://arxiv.org/abs/2505.10472
Authors: Agnik Saha, Victoria Churchill, Anny D. Rodriguez, Ugur Kursuncu, Muhammed Y. Idris
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:Effective communication about breast and cervical cancers remains a persistent health challenge, with significant gaps in public understanding of cancer prevention, screening, and treatment, potentially leading to delayed diagnoses and inadequate treatments. This study evaluates the capabilities and limitations of Large Language Models (LLMs) in generating accurate, safe, and accessible cancer-related information to support patient understanding. We evaluated five general-purpose and three medical LLMs using a mixed-methods evaluation framework across linguistic quality, safety and trustworthiness, and communication accessibility and affectiveness. Our approach utilized quantitative metrics, qualitative expert ratings, and statistical analysis using Welch’s ANOVA, Games-Howell, and Hedges’ g. Our results show that general-purpose LLMs produced outputs of higher linguistic quality and affectiveness, while medical LLMs demonstrate greater communication accessibility. However, medical LLMs tend to exhibit higher levels of potential harm, toxicity, and bias, reducing their performance in safety and trustworthiness. Our findings indicate a duality between domain-specific knowledge and safety in health communications. The results highlight the need for intentional model design with targeted improvements, particularly in mitigating harm and bias, and improving safety and affectiveness. This study provides a comprehensive evaluation of LLMs for cancer communication, offering critical insights for improving AI-generated health content and informing future development of accurate, safe, and accessible digital health tools.
zh
[NLP-80] DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models
[Quick Read]: This paper addresses the diversity-quality inconsistency in language model post-training: solution-level, scalar reward signals cannot distinguish semantically different reasoning paths, so distinct completions can receive identical rewards. The key to the solution is Diversity-aware Reward Adjustment (DRA), which incorporates semantic diversity into the reward computation using Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones, encouraging better exploration during learning while maintaining stable exploitation of high-quality samples.
Link: https://arxiv.org/abs/2505.09655
Authors: Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi
Institutions: Clemson University; Arizona State University; Washington University in St. Louis; University of Notre Dame; University of Arizona
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose Diversity-aware Reward Adjustment (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR. GRPO, resulting in DRA-GRPO and DGA-DR. GRPO. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at this https URL.
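A heavily simplified sketch of diversity-aware reward adjustment: the paper uses Submodular Mutual Information, while the stand-in below downweights each completion's reward by its average cosine similarity to the rest of the group, computed over bag-of-words vectors.

```python
# Downweight rewards of near-duplicate completions (simplified proxy for SMI).
import numpy as np

def adjust_rewards(completions: list[str], rewards: np.ndarray) -> np.ndarray:
    vocab = sorted({w for c in completions for w in c.split()})
    vecs = np.array([[c.split().count(w) for w in vocab] for c in completions], float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9
    sim = vecs @ vecs.T
    np.fill_diagonal(sim, 0.0)
    redundancy = sim.mean(axis=1)              # high = similar to the group
    return rewards * (1.0 - 0.5 * redundancy)  # damp redundant completions

completions = ["add 2 and 3 to get 5", "add 2 and 3 to get 5", "2 + 3 = 5 by counting"]
rewards = np.array([1.0, 1.0, 1.0])
print(adjust_rewards(completions, rewards))  # duplicates receive lower reward
```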
zh
[NLP-81] On Next-Token Prediction in LLMs: How End Goals Determine the Consistency of Decoding Algorithms
[Quick Read]: This paper studies the consistency of decoding algorithms in large language models (LLMs) with respect to different end goals, asking whether these algorithms effectively approximate the true probability distribution or optimize particular loss functions when outputting sequences. The key is a theoretical analysis of several decoding algorithms (greedy, lookahead, random sampling, and temperature-scaled random sampling) under different goals. It shows that when next-token prediction converges to the true distribution, random sampling consistently produces sequences that mimic sampling from that distribution, while for other goals, such as minimizing the 0-1 loss on the entire sequence, no polynomial-time algorithm is optimal for all probability distributions; choosing a decoding algorithm must therefore be driven by the intended goal, and many current choices lack theoretical grounding.
Link: https://arxiv.org/abs/2505.11183
Authors: Jacob Trauger, Ambuj Tewari
Institutions: University of Michigan
Categories: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages
Abstract:Probabilistic next-token prediction trained using cross-entropy loss is the basis of most large language models. Given a sequence of previous values, next-token prediction assigns a probability to each possible next value in the vocabulary. There are many ways to use next-token prediction to output token sequences. This paper examines a few of these algorithms (greedy, lookahead, random sampling, and temperature-scaled random sampling) and studies their consistency with respect to various goals encoded as loss functions. Although consistency of surrogate losses with respect to a target loss function is a well researched topic, we are the first to study it in the context of LLMs (to the best of our knowledge). We find that, so long as next-token prediction converges to its true probability distribution, random sampling is consistent with outputting sequences that mimic sampling from the true probability distribution. For the other goals, such as minimizing the 0-1 loss on the entire sequence, we show no polynomial-time algorithm is optimal for all probability distributions and all decoding algorithms studied are only optimal for a subset of probability distributions. When analyzing these results, we see that there is a dichotomy created between the goals of information retrieval and creative generation for the decoding algorithms. This shows that choosing the correct decoding algorithm based on the desired goal is extremely important and many of the ones used are lacking theoretical grounding in numerous scenarios.
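Three of the decoding rules under study can be sketched on a single synthetic next-token distribution (lookahead, which scores multi-token continuations, is omitted for brevity):

```python
# Greedy, random-sampling, and temperature-scaled decoding on one step.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "a", "cat", "dog"]
probs = np.array([0.5, 0.2, 0.2, 0.1])  # model's next-token distribution

greedy = vocab[int(np.argmax(probs))]   # deterministic argmax

sampled = rng.choice(vocab, p=probs)    # random sampling: consistent with
                                        # mimicking the true distribution

def temperature_scale(p: np.ndarray, T: float) -> np.ndarray:
    logits = np.log(p) / T
    e = np.exp(logits - logits.max())
    return e / e.sum()

temp_sampled = rng.choice(vocab, p=temperature_scale(probs, T=0.7))
print(greedy, sampled, temp_sampled)
```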
zh
[NLP-82] MatTools: Benchmarking Large Language Models for Materials Science Tools
[Quick Read]: This paper asks how to evaluate large language models' (LLMs) understanding and use of materials simulation tools in materials science. The key to the solution is MatTools, a benchmark with two complementary components: a materials simulation tool question-answer (QA) benchmark derived from the pymatgen codebase and documentation, and a real-world tool-usage benchmark built with an automated pipeline for collecting genuine tool-use examples. Together they evaluate LLMs' ability to generate and safely execute code based on physics-based computational materials science packages, providing a standardized path for assessing and improving LLMs in materials science.
Link: https://arxiv.org/abs/2505.10852
Authors: Siyu Liu, Jiamin Xu, Beilin Ye, Bo Hu, David J. Srolovitz, Tongqi Wen
Institutions: The University of Hong Kong, Hong Kong SAR, China; Materials Innovation Institute for Life Sciences and Energy, HKU-SIRI
Categories: Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Databases (cs.DB)
Comments: 27 pages, 23 figures
Abstract:Large language models (LLMs) are increasingly applied to materials science questions, including literature comprehension, property prediction, materials discovery and alloy design. At the same time, a wide range of physics-based computational approaches have been developed in which materials properties can be calculated. Here, we propose a benchmark application to evaluate the proficiency of LLMs to answer materials science questions through the generation and safe execution of codes based on such physics-based computational materials science packages. MatTools is built on two complementary components: a materials simulation tool question-answer (QA) benchmark and a real-world tool-usage benchmark. We designed an automated methodology to efficiently collect real-world materials science tool-use examples. The QA benchmark, derived from the pymatgen (Python Materials Genomics) codebase and documentation, comprises 69,225 QA pairs that assess the ability of an LLM to understand materials science tools. The real-world benchmark contains 49 tasks (138 subtasks) requiring the generation of functional Python code for materials property calculations. Our evaluation of diverse LLMs yields three key insights: (1) generalists outshine specialists; (2) AI knows AI; and (3) simpler is better. MatTools provides a standardized framework for assessing and improving LLM capabilities for materials science tool applications, facilitating the development of more effective AI systems for materials science and general scientific research.
zh
Computer Vision
[CV-0] QVGen: Pushing the Limit of Quantized Video Generative Models
[Quick Read]: This paper addresses the heavy computational and memory demands that block real-world deployment of video diffusion models (DMs), and in particular the observation that directly applying quantization to video DMs is ineffective under extremely low-bit settings (4-bit or below). The key to the solution is QVGen, a quantization-aware training (QAT) framework that introduces auxiliary modules (Φ) to mitigate large quantization errors and improve convergence, together with a rank-decay strategy based on singular value decomposition (SVD) and a rank-based regularization (γ) that progressively eliminates Φ, preserving performance while zeroing out inference overhead.
Link: https://arxiv.org/abs/2505.11497
Authors: Yushi Huang, Ruihao Gong, Jing Liu, Yifu Ding, Chengtao Lv, Haotong Qin, Jun Zhang
Institutions: Hong Kong University of Science and Technology; SenseTime Research; Beihang University; Monash University; ETH Zürich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our code will be released upon acceptance
Abstract:Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has proven notable success in reducing cost for image DMs, while its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules (Φ) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of Φ, we propose a rank-decay strategy that progressively eliminates Φ. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization γ to identify and decay low-contributing components. This strategy retains performance while zeroing out inference overhead. Extensive experiments across 4 state-of-the-art (SOTA) video DMs, with parameter sizes ranging from 1.3B to 14B, show that QVGen is the first to reach full-precision comparable quality under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of +25.28 in Dynamic Degree and +8.43 in Scene Consistency on VBench.
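A toy sketch of SVD-based rank decay, zeroing the smallest singular values of an auxiliary weight on a schedule; the schedule and matrix are invented placeholders, not QVGen's exact rule.

```python
# Progressive SVD rank decay of an auxiliary module's weight matrix.
import torch

def rank_decay_step(weight: torch.Tensor, keep_rank: int) -> torch.Tensor:
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    S[keep_rank:] = 0.0  # drop low-contributing components
    return U @ torch.diag(S) @ Vh

phi = torch.randn(64, 64)  # stand-in for an auxiliary module weight
for step, rank in enumerate([48, 32, 16, 0]):  # invented decay schedule
    phi = rank_decay_step(phi, rank)
    print(f"step {step}: effective rank <= {rank}")
# at rank 0 the module contributes nothing and can be removed at inference
```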
zh
[CV-1] GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing
[Quick Read]: This paper addresses the difficulty of evaluating text-guided image editing models, since existing approaches rely on imprecise image-text similarity metrics such as CLIP. The key to the solution is the GIE-Bench benchmark, which evaluates along two dimensions: (i) functional correctness, verified by automatically generated multiple-choice questions that check whether the intended edit was applied, and (ii) image content preservation, which uses an object-aware masking technique and a preservation score to ensure non-targeted regions remain visually consistent. The benchmark includes over 1,000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks, providing a more reliable and scalable framework for evaluating text-guided image editing models.
Link: https://arxiv.org/abs/2505.11493
Authors: Yusu Qian, Jiasen Lu, Tsu-Jui Fu, Xinze Wang, Chen Chen, Yinfei Yang, Wenze Hu, Zhe Gan
Institutions: Apple
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Editing images using natural language instructions has become a natural and expressive way to modify visual content; yet, evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics like CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which ensures that non-targeted regions of the image remain visually consistent using an object-aware masking technique and preservation scoring. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We conduct a large-scale study comparing GPT-Image-1, the latest flagship in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy, but often over-modifies irrelevant image regions, highlighting a key trade-off in the current model behavior. GIE-Bench provides a scalable, reproducible framework for advancing more accurate evaluation of text-guided image editing.
zh
[CV-2] Unsupervised Detection of Distribution Shift in Inverse Problems using Diffusion Models
[Quick Read]: This paper addresses the degradation of diffusion models used as priors in imaging inverse problems when the training and test distributions shift; existing methods for identifying and quantifying such shifts require access to clean test images, which are rarely available when solving inverse problems. The key to the solution is a fully unsupervised metric that estimates distribution shift using only score functions from diffusion models trained on different datasets and indirect (corrupted) measurements. The authors show theoretically that this metric estimates the KL divergence between the training and test image distributions, and empirically that it closely approximates the KL divergence computed from clean images. Building on this, aligning the out-of-distribution score with the in-distribution score, again using only corrupted measurements, reduces the KL divergence and improves reconstruction quality across multiple inverse problems.
Link: https://arxiv.org/abs/2505.11482
Authors: Shirin Shoushtari, Edward P. Chandler, Yuanhao Wang, M. Salman Asif, Ulugbek S. Kamilov
Institutions: Washington University in St. Louis; University of California, Riverside
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion models are widely used as priors in imaging inverse problems. However, their performance often degrades under distribution shifts between the training and test-time images. Existing methods for identifying and quantifying distribution shifts typically require access to clean test images, which are almost never available while solving inverse problems (at test time). We propose a fully unsupervised metric for estimating distribution shifts using only indirect (corrupted) measurements and score functions from diffusion models trained on different datasets. We theoretically show that this metric estimates the KL divergence between the training and test image distributions. Empirically, we show that our score-based metric, using only corrupted measurements, closely approximates the KL divergence computed from clean images. Motivated by this result, we show that aligning the out-of-distribution score with the in-distribution score – using only corrupted measurements – reduces the KL divergence and leads to improved reconstruction quality across multiple inverse problems.
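A heavily simplified sketch of computing an unsupervised shift signal from two score functions on corrupted measurements only. Using the mean squared score difference is a stand-in; the paper's KL divergence estimator is more involved.

```python
# Score-difference shift signal on corrupted measurements (simplified proxy).
import torch

def shift_metric(score_in, score_out, measurements, sigma=0.1):
    t = torch.full((measurements.shape[0],), sigma)  # noise level (unused by toy scores)
    diff = score_in(measurements, t) - score_out(measurements, t)
    return diff.pow(2).sum(dim=1).mean().item()

# Stand-in "score networks" for two training distributions.
score_in  = lambda x, t: -x              # score of N(0, I)
score_out = lambda x, t: -(x - 2.0)      # score of N(2, I)

y = torch.randn(1024, 8)                 # synthetic corrupted measurements
print(f"estimated shift: {shift_metric(score_in, score_out, y):.2f}")
```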
zh
[CV-3] PSDiffusion: Harmonized Multi-Layer Image Generation via Layout and Appearance Alignment
[Quick Read]: This paper addresses the failure of existing multi-layer image generation methods to handle interactions among layers, such as a rational global layout, physically plausible contacts, and visual effects like shadows and reflections, while maintaining high alpha quality. The key to the solution is PSDiffusion, a unified diffusion framework that automatically generates multi-layer images with one RGB background and multiple RGBA foregrounds in a single feed-forward pass, introducing a global-layer interaction mechanism so that layers are generated concurrently and collaboratively, ensuring both the quality and completeness of each layer and the spatial and visual coherence across layers.
Link: https://arxiv.org/abs/2505.11468
Authors: Dingbang Huang, Wenbo Li, Yifei Zhao, Xinyu Pan, Yanhong Zeng, Bo Dai
Institutions: Shanghai Jiao Tong University; The Chinese University of Hong Kong; Shanghai AI Laboratory; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Diffusion models have made remarkable advancements in generating high-quality images from textual descriptions. Recent works like LayerDiffuse have extended the previous single-layer, unified image generation paradigm to transparent image layer generation. However, existing multi-layer generation methods fail to handle the interactions among multiple layers such as rational global layout, physics-plausible contacts and visual effects like shadows and reflections while maintaining high alpha quality. To solve this problem, we propose PSDiffusion, a unified diffusion framework for simultaneous multi-layer text-to-image generation. Our model can automatically generate multi-layer images with one RGB background and multiple RGBA foregrounds through a single feed-forward process. Unlike existing methods that combine multiple tools for post-decomposition or generate layers sequentially and separately, our method introduces a global-layer interactive mechanism that generates layered-images concurrently and collaboratively, ensuring not only high quality and completeness for each layer, but also spatial and visual interactions among layers for global coherence.
zh
[CV-4] Exploiting Radiance Fields for Grasp Generation on Novel Synthetic Views
[Quick Read]: This paper addresses inaccurate grasp poses in vision-guided robotic grasping caused by limited viewpoints, particularly when objects are occluded or too few real views can be captured. The key to the solution is novel view synthesis: scene representations such as Gaussian Splatting can render high-fidelity virtual images from user-specified novel viewpoints, providing extra context that improves grasp-pose accuracy and coverage. Experiments show that synthesized novel views contribute force-closure grasps beyond those obtained from sparsely sampled real views while also improving grasp coverage.
Link: https://arxiv.org/abs/2505.11467
Authors: Abhishek Kashyap, Henrik Andreasson, Todor Stoyanov
Institutions: Örebro University; Centre for Applied Autonomous Sensor Systems (AASS)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages
Abstract:Vision based robot manipulation uses cameras to capture one or more images of a scene containing the objects to be manipulated. Taking multiple images can help if any object is occluded from one viewpoint but more visible from another viewpoint. However, the camera has to be moved to a sequence of suitable positions for capturing multiple images, which requires time and may not always be possible, due to reachability constraints. So while additional images can produce more accurate grasp poses due to the extra information available, the time-cost goes up with the number of additional views sampled. Scene representations like Gaussian Splatting are capable of rendering accurate photorealistic virtual images from user-specified novel viewpoints. In this work, we show initial results which indicate that novel view synthesis can provide additional context in generating grasp poses. Our experiments on the Graspnet-1billion dataset show that novel views contributed force-closure grasps in addition to the force-closure grasps obtained from sparsely sampled real views while also improving grasp coverage. In the future we hope this work can be extended to improve grasp extraction from radiance fields constructed with a single input image, using for example diffusion models or generalizable radiance fields.
zh
[CV-5] HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
[Quick Read]: This paper addresses the shortfall of large multimodal models (LMMs) on human-centered criteria (Human Centered AI, HCAI) such as fairness, ethics, empathy, and inclusivity, which are essential for alignment with human values. The key to the solution is HumaniBench, a holistic benchmark of 32K real-world image-question pairs annotated via a scalable GPT-4o-assisted pipeline and exhaustively verified by domain experts, designed to evaluate LMMs against seven HCAI principles.
Link: https://arxiv.org/abs/2505.11454
Authors: Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ashmal Vayani, Mukund S. Chettiar, Amandeep Singh, Mubarak Shah, Deval Pandya
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large multimodal models (LMMs) now excel on many vision language benchmarks, however, they still struggle with human centered criteria such as fairness, ethics, empathy, and inclusivity, key to aligning with human values. We introduce HumaniBench, a holistic benchmark of 32K real-world image question pairs, annotated via a scalable GPT4o assisted pipeline and exhaustively verified by domain experts. HumaniBench evaluates seven Human Centered AI (HCAI) principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness, across seven diverse tasks, including open and closed ended visual question answering (VQA), multilingual QA, visual grounding, empathetic captioning, and robustness tests. Benchmarking 15 state of the art LMMs (open and closed source) reveals that proprietary models generally lead, though robustness and visual grounding remain weak points. Some open-source models also struggle to balance accuracy with adherence to human-aligned principles. HumaniBench is the first benchmark purpose built around HCAI principles. It provides a rigorous testbed for diagnosing alignment gaps and guiding LMMs toward behavior that is both accurate and socially responsible. Dataset, annotation prompts, and evaluation code are available at: this https URL
zh
[CV-6] SurgPose: Generalisable Surgical Instrument Pose Estimation using Zero-Shot Learning and Stereo Vision ICRA
[Quick Read]: This paper targets the generalization problem in surgical-instrument pose estimation for robot-assisted minimally invasive surgery (RMIS), in particular accurate pose estimation for unseen instruments. Traditional marker-based methods suffer from occlusions, reflections, and tool-specific designs, while supervised learning requires extensive annotated data and adapts poorly to new tools. The paper proposes a 6 Degrees of Freedom (DoF) pose-estimation pipeline built on zero-shot RGB-D models: vision-based depth estimation with RAFT-Stereo improves robustness in reflective and textureless scenes, and replacing the instance-segmentation module of SAM-6D with a fine-tuned Mask R-CNN markedly improves segmentation under occlusion and clutter, yielding better pose estimates for unseen surgical instruments.
Link: https://arxiv.org/abs/2505.11439
Authors: Utsav Rai, Haozheng Xu, Stamatia Giannarou
Affiliations: Imperial College London; Hamlyn Centre for Robotic Surgery
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: To be published in 2025 International Conference on Robotics and Automation (ICRA)
Abstract:Accurate pose estimation of surgical tools in Robot-assisted Minimally Invasive Surgery (RMIS) is essential for surgical navigation and robot control. While traditional marker-based methods offer accuracy, they face challenges with occlusions, reflections, and tool-specific designs. Similarly, supervised learning methods require extensive training on annotated datasets, limiting their adaptability to new tools. Despite their success in other domains, zero-shot pose estimation models remain unexplored in RMIS for pose estimation of surgical instruments, creating a gap in generalising to unseen surgical tools. This paper presents a novel 6 Degrees of Freedom (DoF) pose estimation pipeline for surgical instruments, leveraging state-of-the-art zero-shot RGB-D models like the FoundationPose and SAM-6D. We advanced these models by incorporating vision-based depth estimation using the RAFT-Stereo method, for robust depth estimation in reflective and textureless environments. Additionally, we enhanced SAM-6D by replacing its instance segmentation module, Segment Anything Model (SAM), with a fine-tuned Mask R-CNN, significantly boosting segmentation accuracy in occluded and complex conditions. Extensive validation reveals that our enhanced SAM-6D surpasses FoundationPose in zero-shot pose estimation of unseen surgical instruments, setting a new benchmark for zero-shot RGB-D pose estimation in RMIS. This work enhances the generalisability of pose estimation for unseen objects and pioneers the application of RGB-D zero-shot methods in RMIS.
zh
[CV-7] Face Consistency Benchmark for GenAI Video
[Quick Read]: This paper addresses the difficulty of maintaining character consistency in AI-generated video, where current models show clear deficiencies in the coherence of appearance and attributes. The key contribution is the Face Consistency Benchmark (FCB), a framework for evaluating and comparing character consistency in AI-generated videos; its standardized metrics expose the gaps in existing solutions and encourage the development of more reliable approaches.
Link: https://arxiv.org/abs/2505.11425
Authors: Michal Podstawski, Malgorzata Kudelska, Haohong Wang
Affiliations: TCL Research Europe; TCL Research America
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:Video generation driven by artificial intelligence has advanced significantly, enabling the creation of dynamic and realistic content. However, maintaining character consistency across video sequences remains a major challenge, with current models struggling to ensure coherence in appearance and attributes. This paper introduces the Face Consistency Benchmark (FCB), a framework for evaluating and comparing the consistency of characters in AI-generated videos. By providing standardized metrics, the benchmark highlights gaps in existing solutions and promotes the development of more reliable approaches. This work represents a crucial step toward improving character consistency in AI video generation technologies.
zh
[CV-8] Improving Object Detection Performance through YOLOv8: A Comprehensive Training and Evaluation Study
[Quick Read]: This paper tackles the detection and segmentation of wrinkles in facial images; the core of the solution is a YOLOv8-based segmentation model.
Link: https://arxiv.org/abs/2505.11424
Authors: Rana Poureskandar, Shiva Razzagzadeh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:This study evaluated the performance of a YOLOv8-based segmentation model for detecting and segmenting wrinkles in facial images.
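The abstract gives no implementation detail, so the following is only a minimal sketch of what training and evaluating a YOLOv8 segmentation model typically looks like with the `ultralytics` package; it is not the authors' code, and the dataset config `wrinkles.yaml` and image path are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): fine-tuning a YOLOv8 segmentation
# model with the `ultralytics` package. "wrinkles.yaml" is a hypothetical
# dataset definition pointing at annotated facial-wrinkle masks.
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")          # pretrained segmentation checkpoint
model.train(data="wrinkles.yaml",       # hypothetical dataset config
            epochs=100, imgsz=640)
metrics = model.val()                   # reports mAP for boxes and masks
results = model.predict("face.jpg")     # per-pixel wrinkle masks for one image
```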
zh
[CV-9] Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner
[Quick Read]: This paper addresses the limitations of existing pathology-specific vision-language models in diagnostic accuracy and reasoning plausibility. These shortcomings stem largely from current pathology datasets, which are insufficiently structured and lack the depth and systematic diagnostic paradigms employed by real-world pathologists. The key is to build high-quality, reasoning-oriented datasets from pathology textbooks and practicing pathology experts, and to introduce Patho-R1, a multimodal reinforcement-learning-based pathology reasoner trained with a three-stage pipeline (continued pretraining, supervised fine-tuning, and reinforcement-learning refinement) to improve multimodal reasoning quality.
Link: https://arxiv.org/abs/2505.11404
Authors: Wenchuan Zhang, Penghao Zhang, Jingru Guo, Tao Cheng, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, Hong Bu
Affiliations: West China Hospital, Sichuan University; University of Toronto; Sichuan University; China Medical University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in vision language models (VLMs) have enabled broad progress in the general medical field. However, pathology still remains a more challenging subdomain, with current pathology specific VLMs exhibiting limitations in both diagnostic accuracy and reasoning plausibility. Such shortcomings are largely attributable to the nature of current pathology datasets, which are primarily composed of image description pairs that lack the depth and structured diagnostic paradigms employed by real world pathologists. In this study, we leverage pathology textbooks and real world pathology experts to construct high-quality, reasoning-oriented datasets. Building on this, we introduce Patho-R1, a multimodal RL-based pathology Reasoner, trained through a three-stage pipeline: (1) continued pretraining on 3.5 million image-text pairs for knowledge infusion; (2) supervised fine-tuning on 500k high-quality Chain-of-Thought samples for reasoning incentivizing; (3) reinforcement learning using Group Relative Policy Optimization and Decoupled Clip and Dynamic sAmpling Policy Optimization strategies for multimodal reasoning quality refinement. To further assess the alignment quality of our dataset, we propose PathoCLIP, trained on the same figure-caption corpus used for continued pretraining. Comprehensive experimental results demonstrate that both PathoCLIP and Patho-R1 achieve robust performance across a wide range of pathology-related tasks, including zero-shot classification, cross-modal retrieval, Visual Question Answering, and Multiple Choice Question. Our project is available at the Patho-R1 repository: this https URL.
zh
[CV-10] MutualNeRF: Improve the Performance of NeRF under Limited Samples with Mutual Information Theory
[Quick Read]: This paper aims to improve Neural Radiance Field (NeRF) performance under limited samples; existing methods introduce prior knowledge but lack theoretical support within a unified framework. The key is to adopt Mutual Information as a metric that uniformly measures correlation between images at both the semantic (macro) and pixel (micro) levels: for sparse view sampling, additional viewpoints carrying more non-overlapping scene information are selected by minimizing mutual information, while for few-shot view synthesis the mutual information between inferred and ground-truth images is maximized via efficient, plug-and-play regularization terms.
Link: https://arxiv.org/abs/2505.11386
Authors: Zifan Wang, Jingwei Li, Yitang Li, Yunze Liu
Affiliations: IIIS, Tsinghua University; Shanghai Qizhi Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper introduces MutualNeRF, a framework enhancing Neural Radiance Field (NeRF) performance under limited samples using Mutual Information Theory. While NeRF excels in 3D scene synthesis, challenges arise with limited data and existing methods that aim to introduce prior knowledge lack theoretical support in a unified framework. We introduce a simple but theoretically robust concept, Mutual Information, as a metric to uniformly measure the correlation between images, considering both macro (semantic) and micro (pixel) levels. For sparse view sampling, we strategically select additional viewpoints containing more non-overlapping scene information by minimizing mutual information without knowing ground truth images beforehand. Our framework employs a greedy algorithm, offering a near-optimal solution. For few-shot view synthesis, we maximize the mutual information between inferred images and ground truth, expecting inferred images to gain more relevant information from known images. This is achieved by incorporating efficient, plug-and-play regularization terms. Experiments under limited samples show consistent improvement over state-of-the-art baselines in different settings, affirming the efficacy of our framework.
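The greedy view-selection loop described above can be sketched as follows. This is only an illustration under assumptions: mutual information is estimated here from joint gray-level histograms (the paper measures correlation at both semantic and pixel levels), and the loop seeds from an arbitrary first view.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based MI between two grayscale images with values in [0, 1]."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins,
                                 range=[[0, 1], [0, 1]])
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])).sum())

def greedy_view_selection(views, k):
    """Greedily add the view sharing the least information with those chosen."""
    chosen = [0]                                  # seed from an arbitrary view
    while len(chosen) < k:
        rest = [i for i in range(len(views)) if i not in chosen]
        best = min(rest, key=lambda i: sum(mutual_information(views[i], views[j])
                                           for j in chosen))
        chosen.append(best)
    return chosen
```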
zh
[CV-11] Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation
[Quick Read]: This paper addresses three challenges that video-language large models (Video-VLMs) face in real-world 3D navigation: insufficient understanding of 3D geometry and spatial semantics, limited capacity for large-scale exploration and long-term environmental memory, and poor adaptability to dynamic, changing environments. The key is Dynam3D, a dynamic layered 3D representation model that feeds language-aligned, generalizable, hierarchical 3D representations to a 3D-VLM for navigation action prediction. Dynam3D projects 2D CLIP features into 3D space, builds multi-level patch-instance-zone representations with a dynamic, layer-wise update strategy, and supports online encoding and localization of 3D instances that are updated as the environment changes, providing large-scale exploration and long-term memory for navigation.
Link: https://arxiv.org/abs/2505.11383
Authors: Zihan Wang, Seungjun Lee, Gim Hee Lee
Affiliations: School of Computing, National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Vision-and-Language Navigation (VLN) is a core task where embodied agents leverage their spatial mobility to navigate in 3D environments toward designated destinations based on natural language instructions. Recently, video-language large models (Video-VLMs) with strong generalization capabilities and rich commonsense knowledge have shown remarkable performance when applied to VLN tasks. However, these models still encounter the following challenges when applied to real-world 3D navigation: 1) Insufficient understanding of 3D geometry and spatial semantics; 2) Limited capacity for large-scale exploration and long-term environmental memory; 3) Poor adaptability to dynamic and changing environments. To address these limitations, we propose Dynam3D, a dynamic layered 3D representation model that leverages language-aligned, generalizable, and hierarchical 3D representations as visual input to train 3D-VLM in navigation action prediction. Given posed RGB-D images, our Dynam3D projects 2D CLIP features into 3D space and constructs multi-level 3D patch-instance-zone representations for 3D geometric and semantic understanding with a dynamic and layer-wise update strategy. Our Dynam3D is capable of online encoding and localization of 3D instances, and dynamically updates them in changing environments to provide large-scale exploration and long-term memory capabilities for navigation. By leveraging large-scale 3D-language pretraining and task-specific adaptation, our Dynam3D sets new state-of-the-art performance on VLN benchmarks including R2R-CE, REVERIE-CE and NavRAG-CE under monocular settings. Furthermore, experiments for pre-exploration, lifelong memory, and real-world robot validate the effectiveness of practical deployment.
zh
[CV-12] Dynamic Base model Shift for Delta Compression
[Quick Read]: This paper targets the high storage and deployment cost of finetuned Transformer models across many tasks, and the sharp performance drop of delta compression (compressing the difference between finetuned and pretrained weights) when the pretrained model is used as the base by default. The key is Dynamic Base Model Shift (DBMS), which dynamically adapts the base model to the target task before delta compression by tuning two parameters that respectively control the magnitude of the base-model shift and the overall scale of the compression, preserving most of the finetuned model's performance even at extremely high compression ratios.
Link: https://arxiv.org/abs/2505.11344
Authors: Chenyu Huang, Peng Ye, Shenghe Zheng, Xiaohui Wang, Lei Bai, Tao Chen, Wanli Ouyang
Affiliations: Fudan University; Shanghai AI Laboratory; The Chinese University of Hong Kong; Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 7 figures
Abstract:Transformer-based models with the pretrain-finetune paradigm bring about significant progress, along with the heavy storage and deployment costs of finetuned models on multiple tasks. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights) through pruning or quantization. However, existing methods by default employ the pretrained model as the base model and compress the delta parameters for every task, which may cause significant performance degradation, especially when the compression rate is extremely high. To tackle this issue, we investigate the impact of different base models on the performance of delta compression and find that the pre-trained base model can hardly be optimal. To this end, we propose Dynamic Base Model Shift (DBMS), which dynamically adapts the base model to the target task before performing delta compression. Specifically, we adjust two parameters, which respectively determine the magnitude of the base model shift and the overall scale of delta compression, to boost the compression performance on each task. Through low-cost learning of these two parameters, our DBMS can maintain most of the finetuned model’s performance even under an extremely high compression ratio setting, significantly surpassing existing methods. Moreover, our DBMS is orthogonal and can be integrated with a variety of other methods, and it has been evaluated across different types of models including language, vision transformer, and multi-modal models.
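One plausible reading of DBMS, sketched below under assumptions: the base is shifted toward the finetuned weights by an interpolation factor `alpha`, the residual delta is compressed (simple magnitude pruning stands in for the paper's pruning/quantization), and `scale` rescales the compressed delta. The two parameters mirror the paper's shift-magnitude and compression-scale knobs; the exact parameterization may differ.

```python
import torch

def dbms_compress(w_pre, w_ft, alpha, scale, keep_ratio=0.01):
    """Shift the base toward the finetuned weights, prune the residual delta,
    and rescale it (a sketch; `alpha` and `scale` are the two tuned knobs)."""
    base = (1 - alpha) * w_pre + alpha * w_ft      # shifted base model
    delta = w_ft - base                            # smaller residual to compress
    k = max(1, int(keep_ratio * delta.numel()))
    thresh = delta.abs().flatten().topk(k).values.min()
    sparse = torch.where(delta.abs() >= thresh, delta, torch.zeros_like(delta))
    return base + scale * sparse                   # reconstructed task weights
```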
zh
[CV-13] MARRS: Masked Autoregressive Unit-based Reaction Synthesis
[Quick Read]: This paper tackles human action-reaction synthesis: generating a person's reaction conditioned on the other's action sequence. Autoregressive approaches perform well in motion generation, but the vector quantization (VQ) they rely on has inherent drawbacks, including loss of quantization information and low codebook utilization; moreover, unlike text-to-motion, which only concerns body-joint movement, action-reaction synthesis must also cover fine-grained hand motion. The proposed MARRS framework generates coordinated, fine-grained reactions in continuous representations: a Unit-distinguished Motion Variational AutoEncoder (UD-VAE) encodes body and hand units independently, Action-Conditioned Fusion (ACF) masks a subset of reactive tokens and extracts body and hand information from the active ones, Adaptive Unit Modulation (AUM) lets body and hand units adaptively modulate each other, and a compact MLP serves as the per-unit noise predictor in a diffusion model whose loss models each token's probability distribution.
Link: https://arxiv.org/abs/2505.11334
Authors: Y.B. Wang, S Wang, J.N. Zhang, J.F. Wu, Q.D. He, C.C. Fu, C.J. Wang, Y. Liu
Affiliations: Zhejiang University; Youtu Lab, Tencent
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This work aims at a challenging task: human action-reaction synthesis, i.e., generating human reactions based on the action sequence of the other as conditions. Currently, autoregressive modeling approaches have achieved remarkable performance in motion generation tasks, e.g. text-to-motion. However, vector quantization (VQ) accompanying autoregressive generation has inherent disadvantages, including loss of quantization information, low codebook utilization, etc. Moreover, unlike text-to-motion, which focuses solely on the movement of body joints, human action-reaction synthesis also encompasses fine-grained hand movements. In this work, we propose MARRS, a novel framework designed to generate coordinated and fine-grained reaction motions in continuous representations. Initially, we present the Unit-distinguished Motion Variational AutoEncoder (UD-VAE), which segments the entire body into distinct body and hand units, encoding them independently. Subsequently, we propose Action-Conditioned Fusion (ACF), which involves randomly masking a subset of reactive tokens and extracting specific information about the body and hands from the active tokens. Furthermore, we introduce Adaptive Unit Modulation (AUM) to facilitate interaction between body and hand units by using the information from one unit to adaptively modulate the other. Finally, for the diffusion model, we employ a compact MLP as a noise predictor for each distinct body unit and incorporate the diffusion loss to model the probability distribution of each token. Quantitative and qualitative results demonstrate that our method achieves superior performance. The code will be released upon acceptance.
zh
[CV-14] Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models
[Quick Read]: This paper addresses the problem of generating semantically accurate and precisely timed responses with vision-language models (VLMs) in real-time interactive settings; conventional VLMs are optimized for offline tasks such as image captioning and video question answering, whereas real-time scenarios require generating content from streaming video while staying temporally aligned. The paper introduces a new benchmark task, Temporally-Grounded Language Generation (TGLG), with curated evaluation datasets and the TRACE metric, which jointly measures semantic similarity and temporal alignment. The key model contribution is VLM-TSI, a Vision-Language Model with Time-Synchronized Interleaving that interleaves visual and language tokens in a time-synchronized manner, enabling real-time language generation without turn-based assumptions.
Link: https://arxiv.org/abs/2505.11326
Authors: Keunwoo Peter Yu, Joyce Chai
Affiliations: University of Michigan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages
Abstract:Vision-language models (VLMs) have shown remarkable progress in offline tasks such as image captioning and video question answering. However, real-time interactive environments impose new demands on VLMs, requiring them to generate utterances that are not only semantically accurate but also precisely timed. We identify two core capabilities necessary for such settings – perceptual updating and contingency awareness – and propose a new benchmark task, Temporally-Grounded Language Generation (TGLG), to evaluate them. TGLG requires models to generate utterances in response to streaming video such that both content and timing align with dynamic visual input. To support this benchmark, we curate evaluation datasets from sports broadcasting and egocentric human interaction domains, and introduce a new metric, TRACE, to evaluate TGLG by jointly measuring semantic similarity and temporal alignment. Finally, we present Vision-Language Model with Time-Synchronized Interleaving (VLM-TSI), a model that interleaves visual and linguistic tokens in a time-synchronized manner, enabling real-time language generation without relying on turn-based assumptions. Experimental results show that VLM-TSI significantly outperforms a strong baseline, yet overall performance remains modest – highlighting the difficulty of TGLG and motivating further research in real-time VLMs. Code and data are available here: this https URL.
zh
[CV-15] Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining
[Quick Read]: This paper studies how the quality and size of training batches affect contrastive learning (CL), and in particular how to construct high-quality batches for embedding models. The key is Breaking the Batch Barrier (B3), a batch-construction strategy that ranks all examples in the dataset with a pretrained teacher embedding model, builds a sparse similarity graph, and applies a community-detection algorithm to find clusters whose members serve as strong negatives for one another; batches assembled from these clusters are rich in in-batch negatives.
Link: https://arxiv.org/abs/2505.11293
Authors: Raghuveer Thirukovalluru, Rui Meng, Ye Liu, Karthikeyan K, Mingyi Su, Ping Nie, Semih Yavuz, Yingbo Zhou, Wenhu Chen, Bhuwan Dhingra
Affiliations: Duke University; Salesforce AI Research; Independent; University of Waterloo; Google
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 4 figures
Abstract:Contrastive learning (CL) is a prevalent technique for training embedding models, which pulls semantically similar examples (positives) closer in the representation space while pushing dissimilar ones (negatives) further apart. A key source of negatives are ‘in-batch’ examples, i.e., positives from other examples in the batch. Effectiveness of such models is hence strongly influenced by the size and quality of training batches. In this work, we propose ‘Breaking the Batch Barrier’ (B3), a novel batch construction strategy designed to curate high-quality batches for CL. Our approach begins by using a pretrained teacher embedding model to rank all examples in the dataset, from which a sparse similarity graph is constructed. A community detection algorithm is then applied to this graph to identify clusters of examples that serve as strong negatives for one another. The clusters are then used to construct batches that are rich in in-batch negatives. Empirical results on the MMEB multimodal embedding benchmark (36 tasks) demonstrate that our method sets a new state of the art, outperforming previous best methods by +1.3 and +2.9 points at the 7B and 2B model scales, respectively. Notably, models trained with B3 surpass existing state-of-the-art results even with a batch size as small as 64, which is 4-16x smaller than that required by other methods.
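A rough sketch of the B3 pipeline as described: teacher-model similarities define a sparse kNN graph, a community-detection algorithm groups mutually hard negatives, and batches are drawn within communities. NetworkX's greedy modularity maximization is used here only as a stand-in, since the abstract does not name the specific community-detection algorithm.

```python
import numpy as np
from networkx import Graph
from networkx.algorithms.community import greedy_modularity_communities

def b3_batches(embeddings, batch_size, knn=10):
    """Build batches whose members are strong negatives for one another.
    Assumes `embeddings` (N, D) from the teacher model are L2-normalized,
    so each row's top similarity is the example itself."""
    sims = embeddings @ embeddings.T
    g = Graph()
    for i, row in enumerate(sims):
        for j in np.argsort(row)[-knn - 1:-1]:     # top-k neighbors, skip self
            g.add_edge(i, int(j), weight=float(row[j]))
    batches = []
    for com in greedy_modularity_communities(g, weight="weight"):
        members = np.random.permutation(list(com))
        for s in range(0, len(members), batch_size):
            batches.append([int(m) for m in members[s:s + batch_size]])
    return batches
```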
zh
[CV-16] MTevent: A Multi-Task Event Camera Dataset for 6D Pose Estimation and Moving Object Detection CVPR2025
[Quick Read]: This paper addresses accurate 6D pose estimation and moving-object detection for high-speed mobile robots in complex dynamic environments; RGB-camera perception struggles to exploit such speeds due to motion blur and limited real-time responsiveness. The key contribution is the MTevent dataset, which combines high-speed motion, long-range perception, and real-world object interactions, captured with a stereo event camera and an RGB camera, providing a new benchmark for event-based vision toward high-accuracy, low-latency robotic perception.
Link: https://arxiv.org/abs/2505.11282
Authors: Shrutarv Awasthi, Anas Gouda, Sven Franke, Jérôme Rutinowski, Frank Hoffmann, Moritz Roidl
Affiliations: TU Dortmund; Lamarr Institute for Machine Learning and Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted to CVPR 2025 Workshop on Event-based Vision
Abstract:Mobile robots are reaching unprecedented speeds, with platforms like Unitree B2, and Fraunhofer O3dyn achieving maximum speeds between 5 and 10 m/s. However, effectively utilizing such speeds remains a challenge due to the limitations of RGB cameras, which suffer from motion blur and fail to provide real-time responsiveness. Event cameras, with their asynchronous operation, and low-latency sensing, offer a promising alternative for high-speed robotic perception. In this work, we introduce MTevent, a dataset designed for 6D pose estimation and moving object detection in highly dynamic environments with large detection distances. Our setup consists of a stereo-event camera and an RGB camera, capturing 75 scenes, each on average 16 seconds, and featuring 16 unique objects under challenging conditions such as extreme viewing angles, varying lighting, and occlusions. MTevent is the first dataset to combine high-speed motion, long-range perception, and real-world object interactions, making it a valuable resource for advancing event-based vision in robotics. To establish a baseline, we evaluate the task of 6D pose estimation using NVIDIA’s FoundationPose on RGB images, achieving an Average Recall of 0.22 with ground-truth masks, highlighting the limitations of RGB-based approaches in such dynamic settings. With MTevent, we provide a novel resource to improve perception models and foster further research in high-speed robotic vision. The dataset is available for download at this https URL
zh
[CV-17] Equal is Not Always Fair: A New Perspective on Hyperspectral Representation Non-Uniformity
[Quick Read]: This paper addresses the pervasive non-uniformity in hyperspectral image (HSI) representation, which shows up as complex and often conflicting behavior across spectral dependencies, spatial continuity, and feature efficiency; existing models assume homogeneity across dimensions, yielding suboptimal and biased representations. The key is FairHyp, a fairness-directed framework that explicitly disentangles and resolves the threefold non-uniformity with cooperative yet specialized modules: a Runge-Kutta-inspired spatial-variability adapter, a multi-receptive-field convolution module with sparse-aware refinement, and a spectral-context state-space model built on bidirectional Mamba scanning, achieving dimension-specific adaptation while preserving global consistency and mutual reinforcement.
Link: https://arxiv.org/abs/2505.11267
Authors: Wuzhou Quan, Mingqiang Wei, Jinhui Tang
Affiliations: Nanjing University of Aeronautics and Astronautics; Nanjing Forestry University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Hyperspectral image (HSI) representation is fundamentally challenged by pervasive non-uniformity, where spectral dependencies, spatial continuity, and feature efficiency exhibit complex and often conflicting behaviors. Most existing models rely on a unified processing paradigm that assumes homogeneity across dimensions, leading to suboptimal performance and biased representations. To address this, we propose FairHyp, a fairness-directed framework that explicitly disentangles and resolves the threefold non-uniformity through cooperative yet specialized modules. We introduce a Runge-Kutta-inspired spatial variability adapter to restore spatial coherence under resolution discrepancies, a multi-receptive field convolution module with sparse-aware refinement to enhance discriminative features while respecting inherent sparsity, and a spectral-context state space model that captures stable and long-range spectral dependencies via bidirectional Mamba scanning and statistical aggregation. Unlike one-size-fits-all solutions, FairHyp achieves dimension-specific adaptation while preserving global consistency and mutual reinforcement. This design is grounded in the view that non-uniformity arises from the intrinsic structure of HSI representations, rather than any particular task setting. To validate this, we apply FairHyp across four representative tasks including classification, denoising, super-resolution, and inpainting, demonstrating its effectiveness in modeling a shared structural flaw. Extensive experiments show that FairHyp consistently outperforms state-of-the-art methods under varied imaging conditions. Our findings redefine fairness as a structural necessity in HSI modeling and offer a new paradigm for balancing adaptability, efficiency, and fidelity in high-dimensional vision tasks.
zh
[CV-18] Multi-view dense image matching with similarity learning and geometry priors
[Quick Read]: This paper tackles multi-view similarity learning, specifically how to achieve efficient multi-view reconstruction without building laborious multi-view training datasets. The key is MV-DeepSimNets, trained with epipolar geometry and equipped with an online geometry prior (along the epipolar line or via homography rectification) to produce geometry-aware features from native images; these features are projected across candidate depth hypotheses by plane sweeping, and aggregating the learned similarities builds and regularizes the cost volume, improving multi-view surface reconstruction over conventional dense matching.
Link: https://arxiv.org/abs/2505.11264
Authors: Mohamed Ali Chebbi, Ewelina Rupnik, Paul Lopes, Marc Pierrot-Deseilligny
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce MV-DeepSimNets, a comprehensive suite of deep neural networks designed for multi-view similarity learning, leveraging epipolar geometry for training. Our approach incorporates an online geometry prior to characterize pixel relationships, either along the epipolar line or through homography rectification. This enables the generation of geometry-aware features from native images, which are then projected across candidate depth hypotheses using plane sweeping. Our method’s geometric preconditioning effectively adapts epipolar-based features for enhanced multi-view reconstruction, without requiring the laborious multi-view training dataset creation. By aggregating learned similarities, we construct and regularize the cost volume, leading to improved multi-view surface reconstruction over traditional dense matching approaches. MV-DeepSimNets demonstrates superior performance against leading similarity learning networks and end-to-end regression models, especially in terms of generalization capabilities across both aerial and satellite imagery with varied ground sampling distances. Our pipeline is integrated into MicMac software and can be readily adopted in standard multi-resolution image matching pipelines.
zh
[CV-19] DRAGON: A Large-Scale Dataset of Realistic Images Generated by Diffusion Models
[Quick Read]: This paper addresses the fact that synthetic-image detectors struggle to keep pace with rapidly evolving diffusion models, since existing datasets cover a limited range of models and quickly become outdated. The key is DRAGON, a comprehensive dataset of images from 25 diffusion models spanning both recent advances and older, well-established architectures, together with a simple but effective pipeline that uses a large language model to expand input prompts, improving the diversity and quality of generated images and supporting detection and attribution research for synthetic content.
Link: https://arxiv.org/abs/2505.11257
Authors: Giulia Bertazzini, Daniele Baracchi, Dasara Shullani, Isao Echizen, Alessandro Piva
Affiliations: University of Florence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:The remarkable ease of use of diffusion models for image generation has led to a proliferation of synthetic content online. While these models are often employed for legitimate purposes, they are also used to generate fake images that support misinformation and hate speech. Consequently, it is crucial to develop robust tools capable of detecting whether an image has been generated by such models. Many current detection methods, however, require large volumes of sample images for training. Unfortunately, due to the rapid evolution of the field, existing datasets often cover only a limited range of models and quickly become outdated. In this work, we introduce DRAGON, a comprehensive dataset comprising images from 25 diffusion models, spanning both recent advancements and older, well-established architectures. The dataset contains a broad variety of images representing diverse subjects. To enhance image realism, we propose a simple yet effective pipeline that leverages a large language model to expand input prompts, thereby generating more diverse and higher-quality outputs, as evidenced by improvements in standard quality metrics. The dataset is provided in multiple sizes (ranging from extra-small to extra-large) to accommodate different research scenarios. DRAGON is designed to support the forensic community in developing and evaluating detection and attribution techniques for synthetic content. Additionally, the dataset is accompanied by a dedicated test set, intended to serve as a benchmark for assessing the performance of newly developed methods.
zh
[CV-20] Entropy-Driven Genetic Optimization for Deep-Feature-Guided Low-Light Image Enhancement
[Quick Read]: This paper addresses the tendency of conventional enhancement methods to focus on pixel-level information while neglecting semantic features, making it hard to balance visual quality against semantic consistency. The key is an unsupervised, fuzzy-inspired enhancement framework driven by NSGA-II that optimizes brightness, contrast, and gamma to balance visual quality and semantic fidelity; it uses a pretrained deep network as the feature extractor, GPU-accelerated multi-objective optimization, and a local-search phase that fine-tunes the top candidates from the genetic algorithm.
Link: https://arxiv.org/abs/2505.11246
Authors: Nirjhor Datta, Afroza Akther, M. Sohel Rahman
Affiliations: Bangladesh University of Engineering and Technology; BRAC University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image enhancement methods often prioritize pixel level information, overlooking the semantic features. We propose a novel, unsupervised, fuzzy-inspired image enhancement framework guided by NSGA-II algorithm that optimizes image brightness, contrast, and gamma parameters to achieve a balance between visual quality and semantic fidelity. Central to our proposed method is the use of a pre trained deep neural network as a feature extractor. To find the best enhancement settings, we use a GPU-accelerated NSGA-II algorithm that balances multiple objectives, namely, increasing image entropy, improving perceptual similarity, and maintaining appropriate brightness. We further improve the results by applying a local search phase to fine-tune the top candidates from the genetic algorithm. Our approach operates entirely without paired training data making it broadly applicable across domains with limited or noisy labels. Quantitatively, our model achieves excellent performance with average BRISQUE and NIQE scores of 19.82 and 3.652, respectively, in all unpaired datasets. Qualitatively, enhanced images by our model exhibit significantly improved visibility in shadowed regions, a natural balance of contrast, and also preserve richer fine detail without introducing noticeable artifacts. This work opens new directions for unsupervised image enhancement where semantic consistency is critical.
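The decision variables and two of the objectives might be encoded as below; this is a sketch under assumptions, omitting the perceptual-similarity term from the pretrained feature extractor and the local-search phase. A multi-objective optimizer such as NSGA-II (e.g., via the `pymoo` package) would minimize the returned objective tuple.

```python
import numpy as np

def enhance(img, brightness, contrast, gamma):
    """Apply the three decision variables to an image with values in [0, 1]."""
    out = np.clip(contrast * img + brightness, 0.0, 1.0)
    return np.clip(out ** gamma, 0.0, 1.0)

def entropy(img, bins=256):
    """Shannon entropy of the image's gray-level histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 1))
    p = hist[hist > 0] / hist.sum()
    return -np.sum(p * np.log2(p))

def objectives(params, img):
    """Objective vector for the optimizer (both terms to be minimized)."""
    b, c, g = params
    out = enhance(img, b, c, g)
    return (-entropy(out),              # maximize image entropy
            abs(out.mean() - 0.5))      # keep mean brightness moderate
```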
zh
[CV-21] Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models ICLR2025
[Quick Read]: This paper addresses a gap in aligning generative model outputs with human preferences: existing preference-alignment methods neglect the critical handling of unconditional/negative-conditional outputs, which limits the effectiveness of classifier-free guidance (CFG). The key is a simple but effective strategy of training a model specifically attuned to negative preferences; it requires no new training strategies or datasets, only minor modifications to existing techniques, and integrates seamlessly with a range of diffusion models to improve their alignment with human preferences.
Link: https://arxiv.org/abs/2505.11245
Authors: Fu-Yun Wang, Yunhao Shui, Jingtan Piao, Keqiang Sun, Hongsheng Li
Affiliations: MMLab, CUHK, Hong Kong; Shanghai Jiao Tong University, Shanghai; CPII under InnoHK, Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025
Abstract:Diffusion models have made substantial advances in image generation, yet models trained on large, unfiltered datasets often yield outputs misaligned with human preferences. Numerous methods have been proposed to fine-tune pre-trained diffusion models, achieving notable improvements in aligning generated outputs with human preferences. However, we argue that existing preference alignment methods neglect the critical role of handling unconditional/negative-conditional outputs, leading to a diminished capacity to avoid generating undesirable outcomes. This oversight limits the efficacy of classifier-free guidance~(CFG), which relies on the contrast between conditional generation and unconditional/negative-conditional generation to optimize output quality. In response, we propose a straightforward yet versatile and effective approach that involves training a model specifically attuned to negative preferences. This method does not require new training strategies or datasets but rather involves minor modifications to existing techniques. Our approach integrates seamlessly with models such as SD1.5, SDXL, video diffusion models and models that have undergone preference optimization, consistently enhancing their alignment with human preferences.
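One way to read the proposal: the negative-preference-trained model supplies the unconditional/negative branch of standard classifier-free guidance. The sketch below is an interpretation, not the authors' code; `eps_pref` and `eps_npo` are hypothetical denoiser callables.

```python
def cfg_with_npo(eps_pref, eps_npo, x_t, t, cond, scale=7.5):
    """Classifier-free guidance where the unconditional branch comes from a
    model finetuned on negative preferences (hypothetical denoisers)."""
    e_cond = eps_pref(x_t, t, cond)   # preference-aligned, conditional prediction
    e_neg = eps_npo(x_t, t, None)     # negative-preference, unconditional prediction
    return e_neg + scale * (e_cond - e_neg)
```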
zh
[CV-22] AW-GATCN: Adaptive Weighted Graph Attention Convolutional Network for Event Camera Data Joint Denoising and Object Recognition
[Quick Read]: This paper addresses the redundant and noisy data produced by event cameras, which interferes with extracting the key spatio-temporal information needed for event-based object recognition. The key is an adaptive graph-based noisy-data-removal framework comprising adaptive event segmentation, a multifactor edge-weighting mechanism, and adaptive graph-based denoising, which integrates spatio-temporal information to filter noise while preserving the critical structural features required for robust, accurate recognition.
Link: https://arxiv.org/abs/2505.11232
Authors: Haiyu Li, Charith Abhayaratne
Affiliations: The University of Sheffield; School of Electronic and Electrical Engineering; Centre for Machine Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Event cameras, which capture brightness changes with high temporal resolution, inherently generate a significant amount of redundant and noisy data beyond essential object structures. The primary challenge in event-based object recognition lies in effectively removing this noise without losing critical spatial-temporal information. To address this, we propose an Adaptive Graph-based Noisy Data Removal framework for Event-based Object Recognition. Specifically, our approach integrates adaptive event segmentation based on normalized density analysis, a multifactorial edge-weighting mechanism, and adaptive graph-based denoising strategies. These innovations significantly enhance the integration of spatiotemporal information, effectively filtering noise while preserving critical structural features for robust recognition. Experimental evaluations on four challenging datasets demonstrate that our method achieves superior recognition accuracies of 83.77%, 76.79%, 99.30%, and 96.89%, surpassing existing graph-based methods by up to 8.79%, and improving noise reduction performance by up to 19.57%, with an additional accuracy gain of 6.26% compared to traditional Euclidean-based techniques.
zh
[CV-23] Seeing Sound Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization
[Quick Read]: This paper investigates modality bias in multimodal AI: when visual and auditory information conflict, do systems over-rely on one modality and ignore the other? The study finds that current AI models perform poorly with conflicting or missing visual cues, often defaulting to visual input and degrading toward chance. The key remedy is finetuning a state-of-the-art model on a stereo audio-image dataset generated via 3D simulation, which surpasses existing benchmarks under cross-modal conflict and even reproduces a human-like horizontal localization bias, plausibly because the stereo audio structure mirrors human ear placement.
Link: https://arxiv.org/abs/2505.11217
Authors: Yanhao Jia, Ji Xie, S Jivaganesh, Hao Li, Xu Wu, Mengmi Zhang
Affiliations: Nanyang Technological University; Peking University; Shenzhen University
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 16 pages, 14 figures
Abstract:Imagine hearing a dog bark and turning toward the sound only to see a parked car, while the real, silent dog sits elsewhere. Such sensory conflicts test perception, yet humans reliably resolve them by prioritizing sound over misleading visuals. Despite advances in multimodal AI integrating vision and audio, little is known about how these systems handle cross-modal conflicts or whether they favor one modality. In this study, we systematically examine modality bias and conflict resolution in AI sound localization. We assess leading multimodal models and benchmark them against human performance in psychophysics experiments across six audiovisual conditions, including congruent, conflicting, and absent cues. Humans consistently outperform AI, demonstrating superior resilience to conflicting or missing visuals by relying on auditory information. In contrast, AI models often default to visual input, degrading performance to near chance levels. To address this, we finetune a state-of-the-art model using a stereo audio-image dataset generated via 3D simulations. Even with limited training data, the refined model surpasses existing benchmarks. Notably, it also mirrors the human-like horizontal localization bias favoring left-right precision, likely due to the stereo audio structure reflecting human ear placement. These findings underscore how sensory input quality and system architecture shape multimodal representation accuracy.
zh
[CV-24] GeoMM: On Geodesic Perspective for Multi-modal Learning CVPR2025
[Quick Read]: This paper addresses the inadequacy of conventional distance metrics in multimodal learning for separating positive and negative samples on nonlinear manifolds, where some samples are highly similar yet semantically different. The key is to introduce geodesic distance into multimodal learning for the first time: a graph is built by thresholding pairwise distances between samples, shortest-path search within this graph yields geodesic distances that capture complex relationships between samples, and a hierarchical graph structure with incremental update strategies keeps the computation efficient under dynamic status updates.
Link: https://arxiv.org/abs/2505.11216
Authors: Shibin Mei, Hang Wang, Bingbing Ni
Affiliations: Huawei; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 3 figures, accepted by CVPR 2025
Abstract:Geodesic distance serves as a reliable means of measuring distance in nonlinear spaces, and such nonlinear manifolds are prevalent in the current multimodal learning. In these scenarios, some samples may exhibit high similarity, yet they convey different semantics, making traditional distance metrics inadequate for distinguishing between positive and negative samples. This paper introduces geodesic distance as a novel distance metric in multi-modal learning for the first time, to mine correlations between samples, aiming to address the limitations of common distance metric. Our approach incorporates a comprehensive series of strategies to adapt geodesic distance for the current multimodal learning. Specifically, we construct a graph structure to represent the adjacency relationships among samples by thresholding distances between them and then apply the shortest-path algorithm to obtain geodesic distance within this graph. To facilitate efficient computation, we further propose a hierarchical graph structure through clustering and combined with incremental update strategies for dynamic status updates. Extensive experiments across various downstream tasks validate the effectiveness of our proposed method, demonstrating its capability to capture complex relationships between samples and improve the performance of multimodal learning models.
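The basic construction in the abstract (threshold pairwise distances into a graph, then take graph shortest paths as geodesics) can be sketched directly with SciPy; the hierarchical graph structure and incremental-update machinery are omitted here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(features, threshold):
    """Connect samples whose Euclidean distance is below `threshold`, then use
    graph shortest paths (Dijkstra) as the geodesic metric between all pairs."""
    diff = features[:, None, :] - features[None, :, :]
    euclid = np.linalg.norm(diff, axis=-1)           # dense pairwise distances
    adj = np.where(euclid < threshold, euclid, 0.0)  # zeros are treated as no edge
    return shortest_path(csr_matrix(adj), method="D", directed=False)
```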
zh
[CV-25] DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling
[Quick Read]: This paper seeks more efficient alternatives to the computationally heavy Diffusion Transformer (DiT), motivated by the observation that global self-attention in pretrained DiTs is largely redundant and predominantly captures local patterns. The key is to revisit convolution as the building block: since naively swapping self-attention for convolution degrades quality due to the higher channel redundancy of ConvNets, a compact channel-attention mechanism is introduced to activate more diverse channels and enhance feature diversity, yielding Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules with strong generative quality and significant efficiency gains.
Link: https://arxiv.org/abs/2505.11196
Authors: Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang
Affiliations: CASIA; UCAS; ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 29 figures, 9 tables
Abstract:Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns-highlighting the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet benchmarks, DiCo outperforms previous diffusion models in both image quality and generation speed. Notably, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53 at 512x512, with a 2.7x and 3.1x speedup over DiT-XL/2, respectively. Furthermore, our largest model, DiCo-H, scaled to 1B parameters, reaches an FID of 1.90 on ImageNet 256x256-without any additional supervision during training. Code: this https URL.
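The paper's compact channel attention is not specified in the abstract; a squeeze-and-excitation-style gate is one minimal realization of a block that reweights, and thereby diversifies, channel activations. The sketch below is an assumption, not DiCo's exact module.

```python
import torch
import torch.nn as nn

class CompactChannelAttention(nn.Module):
    """Squeeze-and-excitation-style gate: global context produces per-channel
    weights, encouraging more diverse channel activations (an assumed form)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze
            nn.Conv2d(channels, channels // reduction, 1),  # excite (down)
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1),  # excite (up)
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)    # reweight channels of the feature map
```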
zh
[CV-26] FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Pretraining
[Quick Read]: This paper addresses false negatives in vision-language pretraining (VLP), which arise from the many-to-many correspondence between images and texts in large-scale datasets and introduce conflicting supervision that degrades the embedding space and weakens hard-negative sampling. The key is FALCON (False-negative Aware Learning of COntrastive Negatives), a learning-based mini-batch construction strategy whose negative-mining scheduler dynamically selects negatives of appropriate hardness for each anchor instance, guided by a proxy for cross-modal alignment improvement, adaptively balancing the trade-off between hard and false negatives.
Link: https://arxiv.org/abs/2505.11192
Authors: Myunsoo Kim, Seong-Woong Shim, Byung-Jun Lee
Affiliations: Korea University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:False negatives pose a critical challenge in vision-language pretraining (VLP) due to the many-to-many correspondence between images and texts in large-scale datasets. These false negatives introduce conflicting supervision signals that degrade the learned embedding space and diminish the effectiveness of hard negative sampling. In this paper, we propose FALCON (False-negative Aware Learning of COntrastive Negatives), a learning-based mini-batch construction strategy that adaptively balances the trade-off between hard and false negatives during VLP. Rather than relying on fixed heuristics, FALCON employs a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction, guided by a proxy for cross-modal alignment improvement. Experimental results demonstrate that FALCON significantly improves performance across two widely adopted VLP frameworks (ALBEF, BLIP-2) and a broad range of downstream tasks and evaluation settings, underscoring its effectiveness and robustness in mitigating the impact of false negatives.
zh
[CV-27] Imputation-free and Alignment-free: Incomplete Multi-view Clustering Driven by Consensus Semantic Learning CVPR2025
[Quick Read]: This paper addresses intra-view prototype shift and cross-view semantic inconsistency caused by missing data in incomplete multi-view clustering (IMVC). The key is FreeCSL, an imputation-free and alignment-free framework for consensus semantics learning: consensus prototypes learned from the available data establish a shared semantic space that bridges the semantic gap across all observations, and a modularity-based heuristic graph clustering recovers cluster structure with intra-cluster compactness and inter-cluster separation to enhance cluster semantics.
Link: https://arxiv.org/abs/2505.11182
Authors: Yuzhuo Dai, Jiaqi Jin, Zhibin Dong, Siwei Wang, Xinwang Liu, En Zhu, Xihong Yang, Xinbiao Gan, Yu Feng
Affiliations: National University of Defense Technology; Intelligent Game and Decision Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by the 42nd CVPR 2025. The main text has 9 pages, including 8 figures and 4 tables; the appendix has 8 pages, with 10 figures and 6 tables; the reference list has 3 pages.
Abstract:In incomplete multi-view clustering (IMVC), missing data induce prototype shifts within views and semantic inconsistencies across views. A feasible solution is to explore cross-view consistency in paired complete observations, further imputing and aligning the similarity relationships inherently shared across views. Nevertheless, existing methods are constrained by two-tiered limitations: (1) Neither instance- nor cluster-level consistency learning construct a semantic space shared across views to learn consensus semantics. The former enforces cross-view instances alignment, and wrongly regards unpaired observations with semantic consistency as negative pairs; the latter focuses on cross-view cluster counterparts while coarsely handling fine-grained intra-cluster relationships within views. (2) Excessive reliance on consistency results in unreliable imputation and alignment without incorporating view-specific cluster information. Thus, we propose an IMVC framework, imputation- and alignment-free for consensus semantics learning (FreeCSL). To bridge semantic gaps across all observations, we learn consensus prototypes from available data to discover a shared space, where semantically similar observations are pulled closer for consensus semantics learning. To capture semantic relationships within specific views, we design a heuristic graph clustering based on modularity to recover cluster structure with intra-cluster compactness and inter-cluster separation for cluster semantics enhancement. Extensive experiments demonstrate, compared to state-of-the-art competitors, FreeCSL achieves more confident and robust assignments on IMVC task.
zh
[CV-28] CheX-DS: Improving Chest X-ray Image Classification with Ensemble Learning Based on DenseNet and Swin Transformer
[Quick Read]: This paper addresses long-tailed multi-label classification for automatic chest-disease diagnosis; existing methods rely mainly on convolutional neural networks (CNNs), which emphasize local features while neglecting global ones. The key is CheX-DS, which ensembles DenseNet, a strong CNN for medical imaging, with the newly popular Swin Transformer to combine the advantages of CNNs and Transformers, and pairs weighted binary cross-entropy with an asymmetric loss to handle data imbalance effectively.
Link: https://arxiv.org/abs/2505.11168
Authors: Xinran Li, Yu Liu, Xiujuan Xu, Xiaowei Zhao
Affiliations: Dalian University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: BIBM
Abstract:The automatic diagnosis of chest diseases is a popular and challenging task. Most current methods are based on convolutional neural networks (CNNs), which focus on local features while neglecting global features. Recently, self-attention mechanisms have been introduced into the field of computer vision, demonstrating superior performance. Therefore, this paper proposes an effective model, CheX-DS, for classifying long-tail multi-label data in the medical field of chest X-rays. The model is based on the excellent CNN model DenseNet for medical imaging and the newly popular Swin Transformer model, utilizing ensemble deep learning techniques to combine the two models and leverage the advantages of both CNNs and Transformers. The loss function of CheX-DS combines weighted binary cross-entropy loss with asymmetric loss, effectively addressing the issue of data imbalance. The NIH ChestX-ray14 dataset is selected to evaluate the model’s effectiveness. The model outperforms previous studies with an excellent average AUC score of 83.76%, demonstrating its superior performance.
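The loss combination can be sketched as below: standard weighted BCE plus a simplified asymmetric loss in the style of Ridnik et al.; the mixing weight `lam` and the ASL hyperparameters are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def chex_ds_loss(logits, targets, pos_weight, lam=0.5,
                 gamma_neg=4.0, gamma_pos=0.0, clip=0.05):
    """Weighted BCE mixed with a simplified asymmetric loss (ASL)."""
    wbce = F.binary_cross_entropy_with_logits(logits, targets,
                                              pos_weight=pos_weight)
    p = torch.sigmoid(logits)
    p_neg = (p - clip).clamp(min=0)                 # shift easy negatives to zero
    loss_pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=1e-8))
    loss_neg = (1 - targets) * p_neg.pow(gamma_neg) \
               * torch.log((1 - p_neg).clamp(min=1e-8))
    asl = -(loss_pos + loss_neg).mean()
    return lam * wbce + (1 - lam) * asl
```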
zh
[CV-29] Learning Dense Hand Contact Estimation from Imbalanced Data
[Quick Read]: This paper addresses how to learn dense hand-contact estimation effectively from imbalanced data. The key components are: balanced contact sampling, which builds and draws from multiple sampling groups that fairly represent the diverse contact statistics of contact and non-contact samples, resolving class imbalance; and a vertex-level class-balanced (VCB) loss that reweights each vertex's loss contribution by its contact frequency across the dataset, incorporating the spatially varying contact distribution and resolving spatial imbalance.
Link: https://arxiv.org/abs/2505.11152
Authors: Daniel Sungho Jung, Kyoung Mu Lee
Affiliations: Seoul National University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this http URL
Abstract:Hands are essential to human interaction, and understanding contact between hands and the world can promote comprehensive understanding of their function. Recently, there have been growing number of hand interaction datasets that cover interaction with object, other hand, scene, and body. Despite the significance of the task and increasing high-quality data, how to effectively learn dense hand contact estimation remains largely underexplored. There are two major challenges for learning dense hand contact estimation. First, there exists class imbalance issue from hand contact datasets where majority of samples are not in contact. Second, hand contact datasets contain spatial imbalance issue with most of hand contact exhibited in finger tips, resulting in challenges for generalization towards contacts in other hand regions. To tackle these issues, we present a framework that learns dense HAnd COntact estimation (HACO) from imbalanced data. To resolve the class imbalance issue, we introduce balanced contact sampling, which builds and samples from multiple sampling groups that fairly represent diverse contact statistics for both contact and non-contact samples. Moreover, to address the spatial imbalance issue, we propose vertex-level class-balanced (VCB) loss, which incorporates spatially varying contact distribution by separately reweighting loss contribution of each vertex based on its contact frequency across dataset. As a result, we effectively learn to predict dense hand contact estimation with large-scale hand contact data without suffering from class and spatial imbalance issue. The codes will be released.
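The VCB loss admits a straightforward sketch: per-vertex BCE whose positive and negative terms are reweighted by each vertex's contact frequency across the dataset. The inverse-frequency weighting below is an assumed form; the paper's exact scheme may differ.

```python
import torch
import torch.nn.functional as F

def vcb_loss(pred_logits, target, contact_freq, eps=1e-6):
    """Vertex-level class-balanced BCE.
    pred_logits, target: (B, V); contact_freq: (V,) in [0, 1], the fraction of
    training samples in which each vertex is in contact."""
    w_pos = 1.0 / (contact_freq + eps)        # rare-contact vertices weigh more
    w_neg = 1.0 / (1.0 - contact_freq + eps)  # rarely contact-free ones likewise
    weights = target * w_pos + (1 - target) * w_neg
    bce = F.binary_cross_entropy_with_logits(pred_logits, target,
                                             reduction="none")
    return (weights * bce).mean()
```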
zh
[CV-30] Open-Source Multi-Viewpoint Surgical Telerobotics ICRA
[Quick Read]: This paper rethinks the visualization and control paradigms of conventional minimally invasive surgery (MIS) teleoperation, arguing that one or more additional adjustable viewpoints would improve surgical collaboration and make machine perception more robust. The key is a synchronized multi-viewpoint, multi-sensor robotic surgery system that integrates high-performance vision components and upgrades the da Vinci Research Kit control logic, enabling accurate real-time intraoperative 3D perception and algorithm-assisted robotic manipulation.
Link: https://arxiv.org/abs/2505.11142
Authors: Guido Caccianiga, Yarden Sharon, Bernard Javot, Senya Polikovsky, Gökce Ergün, Ivan Capobianco, André L. Mihaljevic, Anton Deguet, Katherine J. Kuchenbecker
Affiliations: Max Planck Institute for Intelligent Systems; University Hospital Tübingen; Johns Hopkins University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 2 pages, 2 figures, ICRA-RAMI workshop long abstract
Abstract:As robots for minimally invasive surgery (MIS) gradually become more accessible and modular, we believe there is a great opportunity to rethink and expand the visualization and control paradigms that have characterized surgical teleoperation since its inception. We conjecture that introducing one or more additional adjustable viewpoints in the abdominal cavity would not only unlock novel visualization and collaboration strategies for surgeons but also substantially boost the robustness of machine perception toward shared autonomy. Immediate advantages include controlling a second viewpoint and teleoperating surgical tools from a different perspective, which would allow collaborating surgeons to adjust their views independently and still maneuver their robotic instruments intuitively. Furthermore, we believe that capturing synchronized multi-view 3D measurements of the patient’s anatomy would unlock advanced scene representations. Accurate real-time intraoperative 3D perception will allow algorithmic assistants to directly control one or more robotic instruments and/or robotic cameras. Toward these goals, we are building a synchronized multi-viewpoint, multi-sensor robotic surgery system by integrating high-performance vision components and upgrading the da Vinci Research Kit control logic. This short paper reports a functional summary of our setup and elaborates on its potential impacts in research and future clinical practice. By fully open-sourcing our system, we will enable the research community to reproduce our setup, improve it, and develop powerful algorithms, effectively boosting clinical translation of cutting-edge research.
zh
[CV-31] Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLM s vs. Humans
[Quick Read]: This paper addresses the gap between current multimodal large language models (MLLMs) and human performance on multimodal reasoning. The key is Human-Aligned Bench, a benchmark for fine-grained alignment of multimodal reasoning with human performance: 9,794 multimodal questions that rely solely on contextual reasoning, each annotated with human success rates and the options humans tend to choose incorrectly, providing a quantitative basis for evaluating and improving model reasoning.
Link: https://arxiv.org/abs/2505.11141
Authors: Yansheng Qiu, Li Xiao, Zhaopan Xu, Pengfei Zhou, Zheng Wang, Kaipeng Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The goal of achieving Artificial General Intelligence (AGI) is to imitate humans and surpass them. Models such as OpenAI’s o1, o3, and DeepSeek’s R1 have demonstrated that large language models (LLMs) with human-like reasoning capabilities exhibit exceptional performance and are being gradually integrated into multimodal large language models (MLLMs). However, whether these models possess capabilities comparable to humans in handling reasoning tasks remains unclear at present. In this paper, we propose Human-Aligned Bench, a benchmark for fine-grained alignment of multimodal reasoning with human performance. Specifically, we collected 9,794 multimodal questions that solely rely on contextual reasoning, including bilingual (Chinese and English) multimodal questions and pure text-based questions, encompassing four question types: visual reasoning, definition judgment, analogical reasoning, and logical judgment. More importantly, each question is accompanied by human success rates and options that humans are prone to choosing incorrectly. Extensive experiments on the Human-Aligned Bench reveal notable differences between the performance of current MLLMs in multimodal reasoning and human performance. The findings on our benchmark provide insights into the development of the next-generation models.
zh
[CV-32] Towards Robust Spiking Neural Networks: Mitigating Heterogeneous Training Vulnerability via Dominant Eigencomponent Projection
[Quick Read]: This paper addresses a striking vulnerability of spiking neural networks (SNNs) trained with the mainstream recipe of direct encoding plus backpropagation through time (BPTT): even a single backward pass on data from a slightly different distribution can collapse the network. The key is Dominant Eigencomponent Projection (DEP), a hyperparameter-free method that orthogonally projects gradients to remove their dominant components, effectively reducing the Hessian spectral radius, preventing SNNs from settling into sharp minima, and improving robustness to heterogeneous data poisoning.
Link: https://arxiv.org/abs/2505.11134
Authors: Desong Zhang, Jia Hu, Geyong Min
Affiliations: University of Exeter
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Spiking Neural Networks (SNNs) process information via discrete spikes, enabling them to operate at remarkably low energy levels. However, our experimental observations reveal a striking vulnerability when SNNs are trained using the mainstream method–direct encoding combined with backpropagation through time (BPTT): even a single backward pass on data drawn from a slightly different distribution can lead to catastrophic network collapse. Our theoretical analysis attributes this vulnerability to the repeated inputs inherent in direct encoding and the gradient accumulation characteristic of BPTT, which together produce an exceptional large Hessian spectral radius. To address this challenge, we develop a hyperparameter-free method called Dominant Eigencomponent Projection (DEP). By orthogonally projecting gradients to precisely remove their dominant components, DEP effectively reduces the Hessian spectral radius, thereby preventing SNNs from settling into sharp minima. Extensive experiments demonstrate that DEP not only mitigates the vulnerability of SNNs to heterogeneous data poisoning, but also significantly enhances overall robustness compared to key baselines, providing strong support for safer and more reliable SNN deployment.
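One plausible realization of DEP, sketched under assumptions: the dominant direction of the flattened gradient is tracked with an exponential moving average (a stand-in for whatever estimator the paper uses), and that component is projected out before the optimizer step.

```python
import torch

@torch.no_grad()
def dep_step(params, v_state, beta=0.9):
    """Remove the (estimated) dominant component from the current gradients.
    `v_state` is a running estimate of the dominant gradient direction."""
    g = torch.cat([p.grad.flatten() for p in params])
    v = beta * v_state + (1 - beta) * g            # update dominant direction
    v_hat = v / (v.norm() + 1e-12)
    g_proj = g - (g @ v_hat) * v_hat               # orthogonal projection
    offset = 0
    for p in params:                               # scatter back into .grad
        n = p.grad.numel()
        p.grad.copy_(g_proj[offset:offset + n].view_as(p.grad))
        offset += n
    return v
```

In this reading, the function would be called between `loss.backward()` and `optimizer.step()`, with `v_state` initialized to zeros of the total parameter count.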
zh
[CV-33] One Image is Worth a Thousand Words: A Usability Preservable Text-Image Collaborative Erasing Framework ICML2025
[Quick Read]: This paper addresses controllability when text-to-image diffusion models produce visually undesirable or even harmful content: existing removal methods depend on manually crafted text prompts, making it hard to achieve high erasure efficacy while minimizing the impact on other benign concepts (usability). The key is to bring visual supervision directly into the erasure process with Co-Erasing, the first text-image Collaborative Concept Erasing framework: the concept is described jointly by text prompts and the undesirable images they induce, and negative guidance lowers the generation probability of the target concept, bridging the knowledge gap between text and image to improve efficacy while limiting interference with benign concepts.
Link: https://arxiv.org/abs/2505.11131
Authors: Feiran Li, Qianqian Xu, Shilong Bao, Zhiyong Yang, Xiaochun Cao, Qingming Huang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted to ICML 2025. Not the final version.
Abstract:Concept erasing has recently emerged as an effective paradigm to prevent text-to-image diffusion models from generating visually undesirable or even harmful content. However, current removal methods heavily rely on manually crafted text prompts, making it challenging to achieve a high erasure (efficacy) while minimizing the impact on other benign concepts (usability). In this paper, we attribute the limitations to the inherent gap between the text and image modalities, which makes it hard to transfer the intricately entangled concept knowledge from text prompts to the image generation process. To address this, we propose a novel solution by directly integrating visual supervision into the erasure process, introducing the first text-image Collaborative Concept Erasing (Co-Erasing) framework. Specifically, Co-Erasing describes the concept jointly by text prompts and the corresponding undesirable images induced by the prompts, and then reduces the generating probability of the target concept through negative guidance. This approach effectively bypasses the knowledge gap between text and image, significantly enhancing erasure efficacy. Additionally, we design a text-guided image concept refinement strategy that directs the model to focus on visual features most relevant to the specified text concept, minimizing disruption to other benign concepts. Finally, comprehensive experiments suggest that Co-Erasing outperforms state-of-the-art erasure approaches significantly with a better trade-off between efficacy and usability. Codes are available at this https URL.
zh
[CV-34] PhiNet v2: A Mask-Free Brain-Inspired Vision Foundation Model from Video
[Quick Read]: This paper addresses the fact that conventional self-supervised learning (SSL) models underuse properties of biological visual processing systems, in particular their reliance on strong data augmentation when handling temporal visual input. The key is PhiNet v2, a Transformer-based architecture that uses variational inference to learn robust visual representations from continuous input streams, mimicking human visual processing and handling sequential image input without strong data augmentation.
Link: https://arxiv.org/abs/2505.11129
Authors: Makoto Yamada, Kian Ming A. Chai, Ayoub Rhim, Satoki Ishikawa, Mohammad Sabokrou, Yao-Hung Hubert Tsai
Affiliations: Okinawa Institute of Science and Technology; DSO National Laboratories; Institute of Science Tokyo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2405.14650
Abstract:Recent advances in self-supervised learning (SSL) have revolutionized computer vision through innovative architectures and learning objectives, yet they have not fully leveraged insights from biological visual processing systems. Recently, a brain-inspired SSL model named PhiNet was proposed; it is based on a ResNet backbone and operates on static image inputs with strong augmentation. In this paper, we introduce PhiNet v2, a novel Transformer-based architecture that processes temporal visual input (that is, sequences of images) without relying on strong augmentation. Our model leverages variational inference to learn robust visual representations from continuous input streams, similar to human visual processing. Through extensive experimentation, we demonstrate that PhiNet v2 achieves competitive performance compared to state-of-the-art vision foundation models, while maintaining the ability to learn from sequential input without strong data augmentation. This work represents a significant step toward more biologically plausible computer vision systems that process visual information in a manner more closely aligned with human cognitive processes.
zh
[CV-35] What's Inside Your Diffusion Model? A Score-Based Riemannian Metric to Explore the Data Manifold
【速读】: This paper addresses the lack of a clear understanding of the geometric properties of the data manifold learned by diffusion models. The key to the solution is a score-based Riemannian metric that uses the Stein score function from diffusion models to characterize the intrinsic geometry of the data manifold without explicitly parameterizing it. The method defines a metric tensor in the ambient space that stretches distances perpendicular to the manifold while preserving them along tangential directions, yielding a geometry in which geodesics naturally follow the manifold's contours.
链接: https://arxiv.org/abs/2505.11128
作者: Simone Azeglio,Arianna Di Bernardo
机构: Institut de la Vision & Laboratoire des Systèmes Perceptifs (视觉研究所与感知系统实验室); Sorbonne Université (索邦大学); CNRS (法国国家科学研究中心); INSERM (法国国家医学研究院); École Normale Supérieure (巴黎高等师范学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in diffusion models have demonstrated their remarkable ability to capture complex image distributions, but the geometric properties of the learned data manifold remain poorly understood. We address this gap by introducing a score-based Riemannian metric that leverages the Stein score function from diffusion models to characterize the intrinsic geometry of the data manifold without requiring explicit parameterization. Our approach defines a metric tensor in the ambient space that stretches distances perpendicular to the manifold while preserving them along tangential directions, effectively creating a geometry where geodesics naturally follow the manifold’s contours. We develop efficient algorithms for computing these geodesics and demonstrate their utility for both interpolation between data points and extrapolation beyond the observed data distribution. Through experiments on synthetic data with known geometry, Rotated MNIST, and complex natural images via Stable Diffusion, we show that our score-based geodesics capture meaningful transformations that respect the underlying data distribution. Our method consistently outperforms baseline approaches on perceptual metrics (LPIPS) and distribution-level metrics (FID, KID), producing smoother, more realistic image transitions. These results reveal the implicit geometric structure learned by diffusion models and provide a principled way to navigate the manifold of natural images through the lens of Riemannian geometry.
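A minimal sketch of what such a metric tensor could look like, assuming distances are stretched along the normalized score direction n(x) as G(x) = I + lam * n(x) n(x)^T; the paper's actual construction may be parameterized differently.

```python
import torch

def score_metric(x: torch.Tensor, score_fn, lam: float = 10.0) -> torch.Tensor:
    # G(x) = I + lam * n(x) n(x)^T, where n(x) is the unit score direction.
    # Intended for low-dimensional x; image-scale use would apply G(x) to
    # directions implicitly rather than materializing the matrix.
    s = score_fn(x).reshape(-1)
    n = s / (s.norm() + 1e-8)
    return torch.eye(n.numel()) + lam * torch.outer(n, n)
```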
zh
[CV-36] Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing
【速读】: This paper addresses the increased pretraining and inference time in remote sensing (RS) vision-language model (VLM) pretraining caused by pairing each image with multiple captions that contain redundant information. The key to the solution is a weighted feature aggregation (WFA) strategy that extracts complementary information from the multiple captions of each image while reducing redundancy, improving pretraining efficiency and effectiveness. The strategy computes importance weights for the different captions of each image using two techniques: non-parametric uniqueness (based on BLEU scores) and a learning-based attention mechanism.
链接: https://arxiv.org/abs/2505.11121
作者: Mathis Jürgen Adler,Leonard Hackel,Gencer Sumbul,Begüm Demir
机构: TU Berlin (柏林工业大学); BIFOLD (BIFOLD); EPFL (瑞士联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 2025. Our code is available at this https URL
Abstract:The development of foundation models through pretraining of vision-language models (VLMs) has recently attracted great attention in remote sensing (RS). VLM pretraining aims to learn image and language alignments from a large number of image-text pairs. Each pretraining image is often associated with multiple captions containing redundant information due to repeated or semantically similar phrases, resulting in increased pretraining and inference time. To overcome this, we introduce a weighted feature aggregation (WFA) strategy for VLM pretraining in RS. Our strategy aims to extract and exploit complementary information from multiple captions per image while reducing redundancies through feature aggregation with importance weighting. To calculate adaptive importance weights for different captions of each image, we propose two techniques: (i) non-parametric uniqueness and (ii) learning-based attention. In the first technique, importance weights are calculated based on the bilingual evaluation understudy (BLEU) scores of the captions to emphasize unique sentences and reduce the influence of repetitive ones. In the second technique, importance weights are learned through an attention mechanism instead of relying on hand-crafted features. The effectiveness of the proposed WFA strategy with the two techniques is analyzed in terms of downstream performance on text-to-image retrieval in RS. Experimental results show that the proposed strategy enables efficient and effective pretraining of VLMs in RS. Based on the experimental analysis, we derive guidelines for selecting appropriate techniques depending on downstream task requirements and resource constraints. The code of this work is publicly available at this https URL.
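The non-parametric uniqueness weighting can be sketched as follows: each caption is BLEU-scored against the image's other captions, and highly redundant captions are down-weighted before feature aggregation. The softmax normalization here is an assumption; the paper may normalize differently.

```python
import torch
from nltk.translate.bleu_score import sentence_bleu

def weighted_feature_aggregation(caption_feats: torch.Tensor,
                                 captions: list) -> torch.Tensor:
    tokenized = [c.split() for c in captions]
    redundancy = []
    for i, cand in enumerate(tokenized):
        refs = [t for j, t in enumerate(tokenized) if j != i]
        redundancy.append(sentence_bleu(refs, cand))   # high = repetitive
    w = torch.softmax(1.0 - torch.tensor(redundancy), dim=0)
    return (w[:, None] * caption_feats).sum(dim=0)     # (D,) aggregated feature
```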
zh
[CV-37] Planar Velocity Estimation for Fast-Moving Mobile Robots Using Event-Based Optical Flow
【速读】: This paper aims to improve the accuracy of velocity estimation for mobile robots, especially in driver assistance systems and autonomous driving, where the conventional fusion of wheel odometry with inertial measurement unit (IMU) data relies on strong assumptions such as non-slip steering, or on complex vehicle dynamics models that can fail under varying environmental conditions (e.g., slippery surfaces). The key to the solution is combining planar kinematics with optical flow from an event camera pointed perpendicularly at the ground, decoupling the estimate from wheel-to-surface traction. The asynchronous microsecond-level latency and high dynamic range of event cameras make them highly robust to motion blur, improving the reliability of visual perception for autonomous driving.
链接: https://arxiv.org/abs/2505.11116
作者: Liam Boyle,Jonas Kühne,Nicolas Baumann,Niklas Bastuck,Michele Magno
机构: ETH Zurich(苏黎世联邦理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate velocity estimation is critical in mobile robotics, particularly for driver assistance systems and autonomous driving. Wheel odometry fused with Inertial Measurement Unit (IMU) data is a widely used method for velocity estimation; however, it typically requires strong assumptions, such as non-slip steering, or complex vehicle dynamics models that do not hold under varying environmental conditions like slippery surfaces. We introduce an approach to velocity estimation that is decoupled from wheel-to-surface traction assumptions by leveraging planar kinematics in combination with optical flow from event cameras pointed perpendicularly at the ground. The asynchronous micro-second latency and high dynamic range of event cameras make them highly robust to motion blur, a common challenge in vision-based perception techniques for autonomous driving. The proposed method is evaluated through in-field experiments on a 1:10 scale autonomous racing platform and compared to precise motion capture data, demonstrating not only performance on par with the state-of-the-art Event-VIO method but also a 38.3% improvement in lateral error. Qualitative experiments at highway speeds of up to 32 m/s further confirm the effectiveness of our approach, indicating significant potential for real-world deployment.
zh
[CV-38] Deepfake Forensic Analysis: Source Dataset Attribution and Legal Implications of Synthetic Media Manipulation
【速读】: This paper addresses the challenges that AI-generated synthetic media pose to authenticity verification and dataset attribution, which matter for copyright enforcement, privacy protection, and legal compliance. The key to the solution is a forensic framework based on interpretable feature analysis: it integrates spectral transforms (Fourier/DCT), color distribution metrics, and local feature descriptors (SIFT) to extract discriminative statistical signatures from synthetic outputs, and uses supervised classifiers (Random Forest, SVM, XGBoost) to achieve highly accurate binary classification of real versus synthetic images and multi-class attribution to training datasets (e.g., CelebA or FFHQ). Experiments show that frequency-domain features dominate in capturing dataset-specific artifacts, while color histograms reveal implicit regularization strategies in GAN training.
链接: https://arxiv.org/abs/2505.11110
作者: Massimiliano Cassia,Luca Guarnera,Mirko Casu,Ignazio Zangara,Sebastiano Battiato
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Synthetic media generated by Generative Adversarial Networks (GANs) pose significant challenges in verifying authenticity and tracing dataset origins, raising critical concerns in copyright enforcement, privacy protection, and legal compliance. This paper introduces a novel forensic framework for identifying the training dataset (e.g., CelebA or FFHQ) of GAN-generated images through interpretable feature analysis. By integrating spectral transforms (Fourier/DCT), color distribution metrics, and local feature descriptors (SIFT), our pipeline extracts discriminative statistical signatures embedded in synthetic outputs. Supervised classifiers (Random Forest, SVM, XGBoost) achieve 98-99% accuracy in binary classification (real vs. synthetic) and multi-class dataset attribution across diverse GAN architectures (StyleGAN, AttGAN, GDWCT, StarGAN, and StyleGAN2). Experimental results highlight the dominance of frequency-domain features (DCT/FFT) in capturing dataset-specific artifacts, such as upsampling patterns and spectral irregularities, while color histograms reveal implicit regularization strategies in GAN training. We further examine legal and ethical implications, showing how dataset attribution can address copyright infringement, unauthorized use of personal data, and regulatory compliance under frameworks like GDPR and California’s AB 602. Our framework advances accountability and governance in generative modeling, with applications in digital forensics, content moderation, and intellectual property litigation.
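The kind of feature pipeline the abstract describes can be sketched as below; feature dimensions and classifier hyperparameters here are illustrative, not the paper's configuration.

```python
import numpy as np
import cv2
from scipy.fftpack import dct
from sklearn.ensemble import RandomForestClassifier

def forensic_features(img_bgr: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    fft_mag = np.abs(np.fft.fft2(gray))                   # spectral artifacts
    dct_mag = np.abs(dct(dct(gray, axis=0, norm="ortho"),
                         axis=1, norm="ortho"))
    hist = cv2.calcHist([img_bgr], [0, 1, 2], None,
                        [8, 8, 8], [0, 256] * 3).ravel()  # color distribution
    return np.concatenate([
        np.log1p(fft_mag).mean(axis=0)[:64],
        np.log1p(dct_mag).mean(axis=0)[:64],
        hist / (hist.sum() + 1e-8),
    ])

# X = np.stack([forensic_features(cv2.imread(p)) for p in image_paths])
# clf = RandomForestClassifier(n_estimators=300).fit(X, y)  # y: source dataset
```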
zh
[CV-39] MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark
【速读】: This paper addresses the open-set detection problem in multilingual audio-video deepfake detection, i.e., the performance drop under combinations of generative models and languages unseen during training. The key to the solution is a large-scale multilingual open-set benchmark comprising over 250 hours of real and fake videos, 60% of which are generated; for each language the fake data are produced with seven distinct deepfake generation models, and the training, validation, and test splits are organized in an open-set setting to simulate models facing unknown generation techniques in practice.
链接: https://arxiv.org/abs/2505.11109
作者: Florinel-Alin Croitoru,Vlad Hondru,Marius Popescu,Radu Tudor Ionescu,Fahad Shahbaz Khan,Mubarak Shah
机构: University of Bucharest(布加勒斯特大学); MBZ University of Artificial Intelligence(MBZ人工智能大学); Linköping University(林雪平大学); University of Central Florida(中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 15 pages
Abstract:We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of data being generated. For each language, the fake videos are generated with seven distinct deepfake generation models, selected based on the quality of the generated content. We organize the training, validation and test splits such that only a subset of the chosen generative models and languages are available during training, thus creating several challenging open-set evaluation setups. We perform experiments with various pre-trained and fine-tuned deepfake detectors proposed in recent literature. Our results show that state-of-the-art detectors are not currently able to maintain their performance levels when tested in our open-set scenarios. We publicly release our data and code at: this https URL.
zh
[CV-40] Hybrid-Emba3D: Geometry-Aware and Cross-Path Feature Hybrid Enhanced State Space Model for Point Cloud Classification
【速读】: This paper tackles the dual challenge in point cloud classification of efficiently extracting local geometric features while keeping model complexity in check. The key to the solution is Hybrid-Emba3D, a bidirectional Mamba model enhanced by geometry-feature coupling and cross-path feature hybridization. Local geometric pooling with a geometry-feature coupling mechanism significantly strengthens the discriminative power of local features, via coordinated propagation and dynamic aggregation of geometric information between local center points and their neighborhoods, without introducing additional parameters; a dual-path hybrid design effectively handles local mutations and sparse key signals, overcoming the long-range modeling limitations of traditional state space models (SSMs).
链接: https://arxiv.org/abs/2505.11099
作者: Bin Liu,Chunyang Wang,Xuelian Liu,Guan Xi,Ge Zhang,Ziteng Yao,Mengxue Dong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Point cloud classification faces the dual challenge of efficiently extracting local geometric features while keeping model complexity low. The Mamba architecture utilizes the linear complexity advantage of state space models (SSMs) to overcome the computational bottleneck of Transformers while balancing global modeling capabilities. However, the inherent contradiction between its unidirectional dependency and the unordered nature of point clouds impedes modeling spatial correlation in local neighborhoods, thus constraining geometric feature extraction. This paper proposes Hybrid-Emba3D, a bidirectional Mamba model enhanced by geometry-feature coupling and cross-path feature hybridization. The local geometric pooling with geometry-feature coupling mechanism significantly enhances local feature discriminative power via coordinated propagation and dynamic aggregation of geometric information between local center points and their neighborhoods, without introducing additional parameters. The designed collaborative feature enhancer adopts dual-path hybridization, effectively handling local mutations and sparse key signals, breaking through the limitations of traditional SSM long-range modeling. Experimental results demonstrate that the proposed model achieves a new SOTA classification accuracy of 95.99% on ModelNet40 with only 0.03M additional parameters.
zh
[CV-41] Pseudo-Label Quality Decoupling and Correction for Semi-Supervised Instance Segmentation
【速读】: This paper aims to address the unstable performance in semi-supervised instance segmentation (SSIS) caused by noisy pseudo-labels of instance categories and pixel masks. The key to the solution is a Pseudo-Label Quality Decoupling and Correction (PL-DC) framework: at the instance level, a decoupled dual-threshold filtering mechanism separates class and mask quality estimation for instance-level pseudo-labels, independently controlling classification and grouping quality; at the category level, a dynamic instance category correction module corrects the pseudo-labels of instance categories and alleviates category confusion; and at the pixel level, a mask uncertainty-aware mechanism re-weights the mask loss to reduce the impact of noise introduced by pixel-level mask pseudo-labels, thereby improving model performance.
链接: https://arxiv.org/abs/2505.11075
作者: Jianghang Lin,Yilin Lu,Yunhang Shen,Chaoyang Zhu,Shengchuan Zhang,Liujuan Cao,Rongrong Ji
机构: Xiamen University (厦门大学); Tencent Youtu Lab (腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semi-Supervised Instance Segmentation (SSIS) involves classifying and grouping image pixels into distinct object instances using limited labeled data. This learning paradigm usually faces a significant challenge of unstable performance caused by noisy pseudo-labels of instance categories and pixel masks. We find that the prevalent practice of filtering instance pseudo-labels assessing both class and mask quality with a single score threshold, frequently leads to compromises in the trade-off between the qualities of class and mask labels. In this paper, we introduce a novel Pseudo-Label Quality Decoupling and Correction (PL-DC) framework for SSIS to tackle the above challenges. Firstly, at the instance level, a decoupled dual-threshold filtering mechanism is designed to decouple class and mask quality estimations for instance-level pseudo-labels, thereby independently controlling pixel classifying and grouping qualities. Secondly, at the category level, we introduce a dynamic instance category correction module to dynamically correct the pseudo-labels of instance categories, effectively alleviating category confusion. Lastly, we introduce a pixel-level mask uncertainty-aware mechanism at the pixel level to re-weight the mask loss for different pixels, thereby reducing the impact of noise introduced by pixel-level mask pseudo-labels. Extensive experiments on the COCO and Cityscapes datasets demonstrate that the proposed PL-DC achieves significant performance improvements, setting new state-of-the-art results for SSIS. Notably, our PL-DC shows substantial gains even with minimal labeled data, achieving an improvement of +11.6 mAP with just 1% COCO labeled data and +15.5 mAP with 5% Cityscapes labeled data. The code will be public.
zh
[CV-42] Towards Self-Improvement of Diffusion Models via Group Preference Optimization
【速读】: This paper addresses two main problems when applying Direct Preference Optimization (DPO) to text-to-image (T2I) diffusion models: DPO's sensitivity to preference pairs and the labor-intensive collection and annotation of high-quality data. The key to the solution is Group Preference Optimization (GPO), which extends DPO from pairwise to group-level preference learning and incorporates reward standardization for reweighting, improving performance without explicit data selection. Moreover, GPO exploits the model's own generative capability for self-improvement without external data, making it broadly generalizable and practical.
链接: https://arxiv.org/abs/2505.11070
作者: Renjie Chen,Wenfeng Lin,Yichen Zhang,Jiangchuan Wei,Boyuan Liu,Chao Feng,Jiao Ran,Mingyu Guo
机构: ByteDance Douyin Content Group (字节跳动抖音内容组)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Aligning text-to-image (T2I) diffusion models with Direct Preference Optimization (DPO) has shown notable improvements in generation quality. However, applying DPO to T2I faces two challenges: the sensitivity of DPO to preference pairs and the labor-intensive process of collecting and annotating high-quality data. In this work, we demonstrate that preference pairs with marginal differences can degrade DPO performance. Since DPO relies exclusively on relative ranking while disregarding the absolute difference of pairs, it may misclassify losing samples as wins, or vice versa. We empirically show that extending the DPO from pairwise to groupwise and incorporating reward standardization for reweighting leads to performance gains without explicit data selection. Furthermore, we propose Group Preference Optimization (GPO), an effective self-improvement method that enhances performance by leveraging the model’s own capabilities without requiring external data. Extensive experiments demonstrate that GPO is effective across various diffusion models and tasks. Specifically, combining with widely used computer vision models, such as YOLO and OCR, the GPO improves the accurate counting and text rendering capabilities of the Stable Diffusion 3.5 Medium by 20 percentage points. Notably, as a plug-and-play method, no extra overhead is introduced during inference.
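The groupwise reward standardization can be pictured in a few lines; how these standardized scores are folded into the DPO-style loss is a detail of the paper, so this sketch covers only the reweighting step.

```python
import torch

def group_standardized_rewards(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (G,) scores for a group of images generated from one prompt.
    # Standardizing within the group preserves absolute reward gaps, so
    # near-tied pairs receive near-zero weight instead of a hard win/lose.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```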
zh
[CV-43] Assessing the Performance of Analog Training for Transfer Learning
【速读】: This paper addresses the poor training outcomes of analog in-memory computing for deep learning training and transfer learning (TL), caused by device non-linearity, asymmetric switching behavior, and device-to-device variation. Existing training algorithms cannot accommodate these characteristics, whereas the newly proposed c-TTv2 algorithm introduces a chopped technique to overcome them; its key lies in using this technique to improve the algorithm's robustness to device non-idealities.
链接: https://arxiv.org/abs/2505.11067
作者: Omobayode Fagbohungbe,Corey Lammie,Malte J. Rasch,Takashi Ando,Tayfun Gokmen,Vijay Narayanan
机构: IBM Research - Yorktown Heights, NY USA(IBM研究-约克镇高地, 纽约州, 美国); IBM Research Europe(IBM研究欧洲)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Analog in-memory computing is a next-generation computing paradigm that promises fast, parallel, and energy-efficient deep learning training and transfer learning (TL). However, achieving this promise has remained elusive due to a lack of suitable training algorithms. Analog memory devices exhibit asymmetric and non-linear switching behavior in addition to device-to-device variation, meaning that most, if not all, of the current off-the-shelf training algorithms cannot achieve good training outcomes. Also, recently introduced algorithms have enjoyed limited attention, as they require bi-directionally switching devices of unrealistically high symmetry and precision and are highly sensitive. A new algorithm, chopped TTv2 (c-TTv2), has been introduced, which leverages the chopped technique to address many of the challenges mentioned above. In this paper, we assess the performance of the c-TTv2 algorithm for analog TL using a Swin-ViT model on a subset of the CIFAR100 dataset. We also investigate the robustness of our algorithm to changes in some device specifications, including weight transfer noise, symmetry point skew, and symmetry point variability.
zh
[CV-44] HSRMamba: Efficient Wavelet Stripe State Space Model for Hyperspectral Image Super-Resolution
【速读】: This paper aims to address the artifacts that can arise during image generation in single hyperspectral image super-resolution (SHSR) due to the 1D scanning paradigm of existing models. The key to the solution is a strip-based scanning scheme that effectively reduces artifacts caused by global unidirectional scanning, combined with wavelet decomposition to alleviate modal conflicts between high-frequency spatial features and low-frequency spectral features, improving super-resolution performance while maintaining computational efficiency.
链接: https://arxiv.org/abs/2505.11062
作者: Baisong Li,Xingwang Wang,Haixiao Xu
机构: Jilin University (吉林大学); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (教育部符号计算与知识工程重点实验室,吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Single hyperspectral image super-resolution (SHSR) aims to restore high-resolution images from low-resolution hyperspectral images. Recently, the Visual Mamba model has achieved an impressive balance between performance and computational efficiency. However, due to its 1D scanning paradigm, the model may suffer from potential artifacts during image generation. To address this issue, we propose HSRMamba. While maintaining the computational efficiency of Visual Mamba, we introduce a strip-based scanning scheme to effectively reduce artifacts from global unidirectional scanning. Additionally, HSRMamba uses wavelet decomposition to alleviate modal conflicts between high-frequency spatial features and low-frequency spectral features, further improving super-resolution performance. Extensive experiments show that HSRMamba not only excels in reducing computational load and model size but also outperforms existing methods, achieving state-of-the-art results.
zh
[CV-45] CUBIC: Concept Embeddings for Unsupervised Bias Identification using VLMs IJCNN2025
【速读】: This paper aims to identify biases in deep vision models that stem from spurious correlations in datasets, typically caused by non-robust features the model has learned. Methods based on high-level, human-understandable concepts are more effective at exposing such biases than low-level explanations like heatmaps, but they are limited by the lack of images annotated with potentially bias-inducing concepts. The key to the proposed CUBIC (Concept embeddings for Unsupervised Bias IdentifiCation) is that it requires neither predefined bias candidates nor examples of bias-specific model failures; instead, it uses the image-text latent space and linear classifier probes to analyze how the latent representation of a superclass label is influenced by a given concept, thereby identifying concepts that significantly affect model predictions.
链接: https://arxiv.org/abs/2505.11060
作者: David Méndez,Gianpaolo Bontempo,Elisa Ficarra,Roberto Confalonieri,Natalia Díaz-Rodríguez
机构: University of Granada(格拉纳达大学); University of Modena and Reggio Emilia(摩德纳和雷焦艾米利亚大学); University of Padova(帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures, 5 tables. Accepted at IJCNN 2025; to appear in IEEE Xplore
Abstract:Deep vision models often rely on biases learned from spurious correlations in datasets. To identify these biases, methods that interpret high-level, human-understandable concepts are more effective than those relying primarily on low-level features like heatmaps. A major challenge for these concept-based methods is the lack of image annotations indicating potentially bias-inducing concepts, since creating such annotations requires detailed labeling for each dataset and concept, which is highly labor-intensive. We present CUBIC (Concept embeddings for Unsupervised Bias IdentifiCation), a novel method that automatically discovers interpretable concepts that may bias classifier behavior. Unlike existing approaches, CUBIC does not rely on predefined bias candidates or examples of model failures tied to specific biases, as such information is not always available. Instead, it leverages image-text latent space and linear classifier probes to examine how the latent representation of a superclass label, shared by all instances in the dataset, is influenced by the presence of a given concept. By measuring these shifts against the normal vector to the classifier's decision boundary, CUBIC identifies concepts that significantly influence model predictions. Our experiments demonstrate that CUBIC effectively uncovers previously unknown biases using Vision-Language Models (VLMs) without requiring the samples in the dataset where the classifier underperforms or prior knowledge of potential biases.
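The probing mechanism can be sketched as a projection of the concept-induced shift onto the probe's decision-boundary normal; the variable names and prompt construction below are assumptions for illustration.

```python
import torch

def concept_bias_score(z_super: torch.Tensor,
                       z_super_with_concept: torch.Tensor,
                       w_probe: torch.Tensor) -> torch.Tensor:
    # z_super: latent embedding of the superclass label text.
    # z_super_with_concept: embedding of the label combined with the concept.
    # w_probe: normal vector of the linear probe's decision boundary.
    shift = z_super_with_concept - z_super
    return torch.dot(shift, w_probe) / (w_probe.norm() + 1e-8)
```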
zh
[CV-46] Artifacts of Idiosyncracy in Global Street View Data
【速读】: This paper examines biases in the urban coverage of street view data: even when densely sampled, coverage across 28 cities worldwide remains uneven due to idiosyncrasies such as city layout. The key to the solution is a quantitative analysis of biases in the distribution of street view coverage, together with an evaluation method that provides deeper insight into the idiosyncrasies of a city's coverage, so that such biases can be identified and addressed at their source.
链接: https://arxiv.org/abs/2505.11046
作者: Tim Alpherts,Sennay Ghebreab,Nanne van Noord
机构: University of Amsterdam(阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at FAccT '25
Abstract:Street view data is increasingly being used in computer vision applications in recent years. Machine learning datasets are collected for these applications using simple sampling techniques. These datasets are assumed to be a systematic representation of cities, especially when densely sampled. Prior works, however, show that there are clear gaps in coverage, with certain cities or regions being covered poorly or not at all. Here we demonstrate that a city's idiosyncrasies, such as city layout, may lead to biases in street view data for 28 cities across the globe, even when they are densely covered. We quantitatively uncover biases in the distribution of coverage of street view data and propose a method for evaluation of such distributions to gain better insight into idiosyncrasies in a city's coverage. In addition, we perform a case study of Amsterdam with semi-structured interviews, showing how idiosyncrasies of the collection process impact the representation of cities and regions, allowing us to address biases at their source.
zh
[CV-47] CleanPatrick: A Benchmark for Image Data Cleaning
【速读】: This paper addresses the inadequacy of current benchmarks for image data cleaning, which rely on synthetic noise or narrow human studies, limiting comparability and real-world relevance. The key to the solution is CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built on the publicly available Fitzpatrick17k dermatology dataset. It collects 496,377 binary annotations from 933 medical crowd workers, identifies off-topic samples (4%), near-duplicates (21%), and label errors (22%), and derives high-quality ground truth through an aggregation model inspired by item-response theory followed by expert review. CleanPatrick further formalizes issue detection as a ranking task and adopts ranking metrics that mirror real audit workflows.
链接: https://arxiv.org/abs/2505.11034
作者: Fabian Gröger,Simone Lionetti,Philippe Gottfrois,Alvaro Gonzalez-Jimenez,Ludovic Amruthalingam,Elisabeth Victoria Goessinger,Hanna Lindemann,Marie Bargiela,Marie Hofbauer,Omar Badri,Philipp Tschandl,Arash Koochek,Matthew Groh,Alexander A. Navarini,Marc Pouly
机构: University of Basel (巴塞尔大学); Lucerne University of Applied Sciences and Arts (卢塞恩应用科学大学); University Hospital of Basel (巴塞尔大学医院); Northwestern University (西北大学); Northeast Dermatology Associates (东北皮肤病协会); Medical University of Vienna (维也纳医科大学); Banner Health (巴纳健康)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (22%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and adopts typical ranking metrics mirroring real audit workflows. Benchmarking classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, and SelfClean, we find that, on CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and label-error detection remains an open challenge for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies and paves the way for more reliable data-centric artificial intelligence.
zh
[CV-48] DexGarmentLab: Dexterous Garment Manipulation Environment with Generalizable Policy
【速读】: This paper targets the complexity of garment manipulation, especially the challenges posed by the diversity of garment categories, geometries, and deformations. Existing research struggles to replicate the dexterity with which human hands handle garments, largely due to the lack of realistic simulations of dexterous garment manipulation. To address this, the authors present DexGarmentLab, an environment designed specifically for dexterous (especially bimanual) garment manipulation, with simulation techniques tailored to garment modeling that narrow the sim-to-real gap. A key element is leveraging garment structural correspondence to automatically generate a dataset with diverse trajectories from a single expert demonstration, greatly reducing manual intervention; the further proposed Hierarchical gArment-manipuLation pOlicy (HALO) identifies transferable affordance points and generates generalizable trajectories, improving generalization across garment shapes and deformations.
链接: https://arxiv.org/abs/2505.11032
作者: Yuran Wang,Ruihai Wu,Yue Chen,Jiarui Wang,Jiaqi Liang,Ziyu Zhu,Haoran Geng,Jitendra Malik,Pieter Abbeel,Hao Dong
机构: Peking University (北京大学); University of California, Berkeley (加利福尼亚大学伯克利分校)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Garment manipulation is a critical challenge due to the diversity in garment categories, geometries, and deformations. Despite this, humans can effortlessly handle garments, thanks to the dexterity of our hands. However, existing research in the field has struggled to replicate this level of dexterity, primarily hindered by the lack of realistic simulations of dexterous garment manipulation. Therefore, we propose DexGarmentLab, the first environment specifically designed for dexterous (especially bimanual) garment manipulation, which features large-scale high-quality 3D assets for 15 task scenarios, and refines simulation techniques tailored for garment modeling to reduce the sim-to-real gap. Previous data collection typically relies on teleoperation or training expert reinforcement learning (RL) policies, which are labor-intensive and inefficient. In this paper, we leverage garment structural correspondence to automatically generate a dataset with diverse trajectories using only a single expert demonstration, significantly reducing manual intervention. However, even extensive demonstrations cannot cover the infinite states of garments, which necessitates the exploration of new algorithms. To improve generalization across diverse garment shapes and deformations, we propose a Hierarchical gArment-manipuLation pOlicy (HALO). It first identifies transferable affordance points to accurately locate the manipulation area, then generates generalizable trajectories to complete the task. Through extensive experiments and detailed analysis of our method and baseline, we demonstrate that HALO consistently outperforms existing methods, successfully generalizing to previously unseen instances even with significant variations in shape and deformation where others fail. Our project page is available at: this https URL.
zh
[CV-49] Classifying Shelf Life Quality of Pineapples by Combining Audio and Visual Features
【速读】: This paper aims to determine the shelf-life quality of pineapples with non-destructive methods, reducing waste and increasing income. The key to the solution is a multimodal, multi-view classification model built on audio and visual features, together with the introduced PQC500 dataset of 500 pineapples covering two modalities: tapping sounds recorded by multiple microphones and images captured from different positions. To train the cross-modal classification model, the authors modify the contrastive audiovisual masked autoencoder over abundant combinations of audio-visual pairs and adopt an audio-major sampling strategy, achieving 84% accuracy in experiments and clearly outperforming audio-only and visual-only unimodal models.
链接: https://arxiv.org/abs/2505.11020
作者: Yi-Lu Jiang,Wen-Chang Chang,Ching-Lin Wang,Kung-Liang Hsu,Chih-Yi Chiu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Determining the shelf life quality of pineapples using non-destructive methods is a crucial step to reduce waste and increase income. In this paper, a multimodal and multiview classification model was constructed to classify pineapples into four quality levels based on audio and visual characteristics. For research purposes, we compiled and released the PQC500 dataset consisting of 500 pineapples with two modalities: tapping sounds recorded by multiple microphones and pictures taken by multiple cameras at different locations, providing multimodal and multi-view audiovisual features. We modified the contrastive audiovisual masked autoencoder to train the cross-modal classification model on abundant combinations of audio and visual pairs. In addition, we proposed sampling a compact subset of training data for efficient computation. The experiments were evaluated under various data and model configurations, and the results demonstrated that the proposed cross-modal model trained using audio-major sampling can yield 84% accuracy, outperforming the audio-only and visual-only unimodal models by 6% and 18%, respectively.
zh
[CV-50] Rethinking the Mean Teacher Strategy from the Perspective of Self-paced Learning
【速读】: This paper aims to reduce the high cost of manual annotation in semi-supervised medical image segmentation. The key to the solution is reinterpreting the mean teacher (MT) strategy as a form of self-paced learning regulated by the agreement between the outputs of a temporally lagged teacher model and the ground-truth labels, and extending this idea to agreement between cross-architecture models, which offers greater flexibility in pacing the learning and applies to unlabeled data. The proposed dual teacher-student learning (DTSL) framework introduces two groups of teacher-student models with different architectures and uses a Jensen-Shannon divergence-based consensus label generator (CLG) to produce pseudo-labels from their cross-group agreement, improving model performance.
链接: https://arxiv.org/abs/2505.11018
作者: Pengchen Zhang,Alan J.X. Guo,Sipin Luo,Zhe Han,Lin Guo
机构: Tianjin University(天津大学); Tianjin Hospital of Tianjin University(天津大学天津医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semi-supervised medical image segmentation has attracted significant attention due to its potential to reduce manual annotation costs. The mean teacher (MT) strategy, commonly understood as introducing smoothed, temporally lagged consistency regularization, has demonstrated strong performance across various tasks in this field. In this work, we reinterpret the MT strategy on supervised data as a form of self-paced learning, regulated by the output agreement between the temporally lagged teacher model and the ground truth labels. This idea is further extended to incorporate agreement between a temporally lagged model and a cross-architectural model, which offers greater flexibility in regulating the learning pace and enables application to unlabeled data. Specifically, we propose dual teacher-student learning (DTSL), a framework that introduces two groups of teacher-student models with different architectures. The output agreement between the cross-group teacher and student models is used as pseudo-labels, generated via a Jensen-Shannon divergence-based consensus label generator (CLG). Extensive experiments on popular datasets demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches. Ablation studies further validate the effectiveness of the proposed modules.
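A minimal sketch of a JS-divergence-based consensus label generator: keep pixels where the cross-group teacher and student agree and mask the rest. The hard threshold `tau` is an assumption; the paper may weight the loss instead.

```python
import torch

def js_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # p, q: (B, C, H, W) softmax probability maps from the two models.
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(1e-8).log()
                            - b.clamp_min(1e-8).log())).sum(dim=1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)              # (B, H, W)

def consensus_labels(p_teacher, p_student, tau: float = 0.1):
    agree = js_divergence(p_teacher, p_student) < tau   # reliable-pixel mask
    pseudo = 0.5 * (p_teacher + p_student)
    return pseudo.argmax(dim=1), agree
```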
zh
[CV-51] WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
【速读】: This paper addresses the fact that current document understanding benchmarks (such as DocVQA and ChartQA) are built mainly from scanned or digital documents and thus fail to reflect the intricate challenges of real-world scenarios, such as variable illumination and physical distortions. The key to the solution is WildDoc, the first benchmark designed specifically to assess document understanding in natural environments: it contains manually captured document images under diverse real-world conditions, reuses document sources from established benchmarks to enable comprehensive comparison with digital or scanned documents, and captures each document four times under different conditions to rigorously evaluate model robustness.
链接: https://arxiv.org/abs/2505.11015
作者: An-Lan Wang,Jingqun Tang,Liao Lei,Hao Feng,Qi Liu,Xiang Fei,Jinghui Lu,Han Wang,Weiwei Liu,Hao Liu,Yuliang Liu,Xiang Bai,Can Huang
机构: ByteDance(字节跳动); Huazhong University of Science and Technology(华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced capabilities in Document Understanding. However, prevailing benchmarks like DocVQA and ChartQA predominantly comprise scanned or digital documents, inadequately reflecting the intricate challenges posed by diverse real-world scenarios, such as variable illumination and physical distortions. This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions. Evaluations of state-of-the-art MLLMs on WildDoc expose substantial performance declines and underscore the models' inadequate robustness compared to traditional benchmarks, highlighting the unique challenges posed by real-world document understanding. Our project homepage is available at this https URL.
zh
[CV-52] Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion
【速读】: This paper addresses the challenge of generating 3D human motion from text descriptions, in particular the poor performance of existing models on out-of-distribution motions. Existing methods based on vector-quantized variational autoencoders (VQVAE) struggle to represent novel motions faithfully, while diffusion-based methods over continuous representations lack fine-grained control over individual frames. The key to the proposed MoMADiff is combining masked modeling with a diffusion process, generating motion with frame-level continuous representations and supporting user-provided keyframe specification, which enables precise control over the spatial and temporal properties of motion synthesis and improves the diversity and controllability of the generated motion.
链接: https://arxiv.org/abs/2505.11013
作者: Zongye Zhang,Bohan Kong,Qingjie Liu,Yunhong Wang
机构: Beijing China; State Key Laboratory of Virtual Reality Technology and Systems (国家重点实验室虚拟现实技术与系统)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 10 pages, 6 figures, 5 tables
Abstract:Generating 3D human motion from text descriptions remains challenging due to the diverse and complex nature of human motion. While existing methods excel within the training distribution, they often struggle with out-of-distribution motions, limiting their applicability in real-world scenarios. Existing VQVAE-based methods often fail to represent novel motions faithfully using discrete tokens, which hampers their ability to generalize beyond seen data. Meanwhile, diffusion-based methods operating on continuous representations often lack fine-grained control over individual frames. To address these challenges, we propose a robust motion generation framework MoMADiff, which combines masked modeling with diffusion processes to generate motion using frame-level continuous representations. Our model supports flexible user-provided keyframe specification, enabling precise control over both spatial and temporal aspects of motion synthesis. MoMADiff demonstrates strong generalization capability on novel text-to-motion datasets with sparse keyframes as motion prompts. Extensive experiments on two held-out datasets and two standard benchmarks show that our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and keyframe adherence.
zh
[CV-53] ForensicHub: A Unified Benchmark Codebase for All-Domain Fake Image Detection and Localization
【速读】: This paper addresses the lack of a unified benchmark in the Fake Image Detection and Localization (FIDL) field, which is currently fragmented into four independent sub-domains whose datasets, models, and evaluation protocols are mutually incompatible, hindering cross-domain comparison and the field's overall development. The key to the solution is ForensicHub, the first unified benchmark codebase covering all FIDL domains. Its core is a modular, configuration-driven architecture that decomposes forensic pipelines into interchangeable components, supporting flexible composition across datasets, transforms, models, and evaluators, and integrating multiple benchmarks and baseline models to foster cross-domain synergy and progress.
链接: https://arxiv.org/abs/2505.11003
作者: Bo Du,Xuekang Zhu,Xiaochen Ma,Chenfan Qu,Kaiwen Feng,Zhe Yang,Chi-Man Pun,Jian Liu,Jizhe Zhou
机构: Sichuan University (四川大学); Ant Group (蚂蚁集团); MBZUAI (MBZUAI); Peking University (北京大学); South China University of Technology (华南理工大学); University of Macao (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report. Code available at: this https URL
Abstract:The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark for all domains in FIDL remains absent. The absence of a unified benchmark results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To close the domain silo barrier, we propose ForensicHub, the first unified benchmark codebase for all-domain fake image detection and localization. Considering drastic variations in dataset, model, and evaluation configurations across all domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models, 6 backbones, 2 new benchmarks for AIGC and Doc, and integrates 2 existing benchmarks of DeepfakeBench and IMDLBenCo through an adapter-based design; iii) conducts in-depth analysis based on ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. ForensicHub represents a significant leap forward in breaking the domain silos in the FIDL field and inspiring future breakthroughs.
zh
[CV-54] DDAE: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning
【速读】: This paper tackles two key questions: 1) can the representations obtained from generative pre-training be used to improve the training of diffusion models themselves, rather than only benefiting downstream tasks; and 2) can feature quality be raised to rival or even surpass modern self-supervised learners without sacrificing generative capability. The key to the solution is self-conditioning, a simple yet effective internal mechanism that exploits the rich semantics within the denoising network to guide its own decoding layers, forming a tighter bottleneck that condenses high-level semantics to improve generation.
链接: https://arxiv.org/abs/2505.10999
作者: Weilai Xiang,Hongyu Yang,Di Huang,Yunhong Wang
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While diffusion models have gained prominence in image synthesis, their generative pre-training has been shown to yield discriminative representations, paving the way towards unified visual generation and understanding. However, two key questions remain: 1) Can these representations be leveraged to improve the training of diffusion models themselves, rather than solely benefiting downstream tasks? 2) Can the feature quality be enhanced to rival or even surpass modern self-supervised learners, without compromising generative capability? This work addresses these questions by introducing self-conditioning, a straightforward yet effective mechanism that internally leverages the rich semantics inherent in denoising network to guide its own decoding layers, forming a tighter bottleneck that condenses high-level semantics to improve generation. Results are compelling: our method boosts both generation FID and recognition accuracy with 1% computational overhead and generalizes across diverse diffusion architectures. Crucially, self-conditioning facilitates an effective integration of discriminative techniques, such as contrastive self-distillation, directly into diffusion models without sacrificing generation quality. Extensive experiments on pixel-space and latent-space datasets show that in linear evaluations, our enhanced diffusion models, particularly UViT and DiT, serve as strong representation learners, surpassing various self-supervised models.
zh
[CV-55] Visual Anomaly Detection under Complex View-Illumination Interplay: A Large-Scale Benchmark
【速读】: This paper addresses the sensitivity of visual anomaly detection (VAD) systems to real-world imaging variations in practical deployment, in particular the complex interplay between viewpoint and illumination that drastically alters defect visibility, a critical challenge that existing benchmarks largely overlook. The key to the solution is the large-scale Multi-View Multi-Illumination Anomaly Detection (M2AD) benchmark of 119,880 high-resolution images, systematically capturing 999 specimens across 10 categories under 12 synchronized views and 10 illumination settings (120 configurations in total) to rigorously evaluate VAD robustness under view-illumination interplay. Two evaluation protocols are designed: M2AD-Synergy tests the ability to fuse information across diverse configurations, and M2AD-Invariant measures single-image robustness under realistic view-illumination effects. Experiments show that state-of-the-art VAD methods struggle significantly on M2AD, underscoring the profound challenge posed by this interplay.
链接: https://arxiv.org/abs/2505.10996
作者: Yunkang Cao,Yuqi Cheng,Xiaohao Xu,Yiheng Zhang,Yihan Sun,Yuxiang Tan,Yuxin Zhang,Xiaonan Huang,Weiming Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Homepage: this https URL . Yunkang Cao and Yuqi Cheng contribute equally to this work
Abstract:The practical deployment of Visual Anomaly Detection (VAD) systems is hindered by their sensitivity to real-world imaging variations, particularly the complex interplay between viewpoint and illumination which drastically alters defect visibility. Current benchmarks largely overlook this critical challenge. We introduce Multi-View Multi-Illumination Anomaly Detection (M2AD), a new large-scale benchmark comprising 119,880 high-resolution images designed explicitly to probe VAD robustness under such interacting conditions. By systematically capturing 999 specimens across 10 categories using 12 synchronized views and 10 illumination settings (120 configurations total), M2AD enables rigorous evaluation. We establish two evaluation protocols: M2AD-Synergy tests the ability to fuse information across diverse configurations, and M2AD-Invariant measures single-image robustness against realistic view-illumination effects. Our extensive benchmarking shows that state-of-the-art VAD methods struggle significantly on M2AD, demonstrating the profound challenge posed by view-illumination interplay. This benchmark serves as an essential tool for developing and validating VAD methods capable of overcoming real-world complexities. Our full dataset and test suite will be released at this https URL to facilitate the field.
zh
[CV-56] M4-SAR: A Multi-Resolution Multi-Polarization Multi-Scene Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection
【速读】: This paper addresses the limited performance of single-source remote sensing object detection in complex environments, where optical images are affected by illumination, cloud cover, or low resolution and SAR images suffer from speckle noise and weak semantic expressiveness, degrading detection accuracy. The key to the solution is M4-SAR, the first comprehensive dataset for optical-SAR fusion object detection, together with a unified benchmarking toolkit and E2E-OSDet, an end-to-end multi-source fusion detection framework that mitigates cross-domain discrepancies and improves detection performance. Experiments show that fusing optical and SAR data improves mAP by 5.7% over single-source inputs, with especially significant gains in complex environments.
链接: https://arxiv.org/abs/2505.10931
作者: Chao Wang,Wei Lu,Xiang Li,Jian Yang,Lei Luo
机构: Nanjing University of Science and Technology(南京理工大学); Anhui University(安徽大学); Nankai University(南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single-source remote sensing object detection using optical or SAR images struggles in complex environments. Optical images offer rich textural details but are often affected by low-light, cloud-obscured, or low-resolution conditions, reducing the detection performance. SAR images are robust to weather, but suffer from speckle noise and limited semantic expressiveness. Optical and SAR images provide complementary advantages, and fusing them can significantly improve the detection accuracy. However, progress in this field is hindered by the lack of large-scale, standardized datasets. To address these challenges, we propose the first comprehensive dataset for optical-SAR fusion object detection, named Multi-resolution, Multi-polarization, Multi-scene, Multi-source SAR dataset (M4-SAR). It contains 112,184 precisely aligned image pairs and nearly one million labeled instances with arbitrary orientations, spanning six key categories. To enable standardized evaluation, we develop a unified benchmarking toolkit that integrates six state-of-the-art multi-source fusion methods. Furthermore, we propose E2E-OSDet, a novel end-to-end multi-source fusion detection framework that mitigates cross-domain discrepancies and establishes a robust baseline for future studies. Extensive experiments on M4-SAR demonstrate that fusing optical and SAR data can improve mAP by 5.7% over single-source inputs, with particularly significant gains in complex environments. The dataset and code are publicly available at this https URL.
zh
[CV-57] GrowSplat: Constructing Temporal Digital Twins of Plants with Gaussian Splats
【速读】: This paper addresses the accurate temporal reconstruction of plant growth, a key challenge in plant phenotyping and breeding, mainly due to the complex geometry, occlusions, and non-rigid deformations of plants. The key to the solution is combining 3D Gaussian Splatting with a robust sample alignment pipeline: Gaussian splats are reconstructed from multi-view camera data, then aligned with a two-stage registration approach that first performs coarse alignment via feature-based matching and Fast Global Registration, followed by fine alignment with the Iterative Closest Point algorithm, producing a consistent 4D model of plant development.
链接: https://arxiv.org/abs/2505.10923
作者: Simeon Adebola,Shuangyu Xie,Chung Min Kim,Justin Kerr,Bart M. van Marrewijk,Mieke van Vlaardingen,Tim van Daalen,Robert van Loo,Jose Luis Susa Rincon,Eugen Solowjow,Rick van de Zedde,Ken Goldberg
机构: UC Berkeley (加州大学伯克利分校); Siemens Research Lab (西门子研究实验室); Netherlands Plant Eco-phenotyping Centre (荷兰植物生态表型中心); Wageningen University and Research (瓦赫宁根大学与研究中心)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate temporal reconstructions of plant growth are essential for plant phenotyping and breeding, yet remain challenging due to complex geometries, occlusions, and non-rigid deformations of plants. We present a novel framework for building temporal digital twins of plants by combining 3D Gaussian Splatting with a robust sample alignment pipeline. Our method begins by reconstructing Gaussian Splats from multi-view camera data, then leverages a two-stage registration approach: coarse alignment through feature-based matching and Fast Global Registration, followed by fine alignment with Iterative Closest Point. This pipeline yields a consistent 4D model of plant development in discrete time steps. We evaluate the approach on data from the Netherlands Plant Eco-phenotyping Center, demonstrating detailed temporal reconstructions of Sequoia and Quinoa species. Videos and Images can be seen at this https URL
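The two-stage registration maps naturally onto Open3D primitives; below is a minimal sketch under assumed voxel sizes and thresholds (the paper's actual parameters are not given in the abstract).

```python
import open3d as o3d

def two_stage_register(source, target, voxel: float = 0.01):
    def prep(pcd):
        down = pcd.voxel_down_sample(voxel)
        down.estimate_normals(
            o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 2, max_nn=30))
        fpfh = o3d.pipelines.registration.compute_fpfh_feature(
            down, o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 5, max_nn=100))
        return down, fpfh

    src, src_f = prep(source)
    tgt, tgt_f = prep(target)
    # Stage 1: coarse alignment via feature matching + Fast Global Registration.
    coarse = o3d.pipelines.registration.registration_fgr_based_on_feature_matching(
        src, tgt, src_f, tgt_f,
        o3d.pipelines.registration.FastGlobalRegistrationOption(
            maximum_correspondence_distance=voxel * 1.5))
    # Stage 2: fine alignment with Iterative Closest Point.
    fine = o3d.pipelines.registration.registration_icp(
        source, target, voxel, coarse.transformation,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return fine.transformation
```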
zh
[CV-58] Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution
【速读】: This paper addresses the challenge of cross-modal retrieval in the Chinese cultural heritage domain, in particular the local alignment between intricate decorative motifs and specialized textual descriptions. Existing general-purpose multimodal datasets cannot meet the needs of this domain, which has limited the development and evaluation of cross-modal learning models for it. The key to the solution is LACLIP, a training-free local alignment strategy built on a fine-tuned Chinese-CLIP, which enhances the alignment between global textual descriptions and local visual regions by computing weighted similarity scores during inference.
链接: https://arxiv.org/abs/2505.10921
作者: Junyi Yuan,Jian Zhang,Fangyu Wu,Dongming Lu,Huanda Lu,Qiufeng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:China has a long and rich history, encompassing a vast cultural heritage that includes diverse multimodal information, such as silk patterns, Dunhuang murals, and their associated historical narratives. Cross-modal retrieval plays a pivotal role in understanding and interpreting Chinese cultural heritage by bridging visual and textual modalities to enable accurate text-to-image and image-to-text retrieval. However, despite the growing interest in multimodal research, there is a lack of specialized datasets dedicated to Chinese cultural heritage, limiting the development and evaluation of cross-modal learning models in this domain. To address this gap, we propose a multimodal dataset named CulTi, which contains 5,726 image-text pairs extracted from two series of professional documents, respectively related to ancient Chinese silk and Dunhuang murals. Compared to existing general-domain multimodal datasets, CulTi presents a challenge for cross-modal retrieval: the difficulty of local alignment between intricate decorative motifs and specialized textual descriptions. To address this challenge, we propose LACLIP, a training-free local alignment strategy built upon a fine-tuned Chinese-CLIP. LACLIP enhances the alignment of global textual descriptions with local visual regions by computing weighted similarity scores during inference. Experimental results on CulTi demonstrate that LACLIP significantly outperforms existing models in cross-modal retrieval, particularly in handling fine-grained semantic associations within Chinese cultural heritage.
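One way to picture the training-free weighted similarity at inference: blend the global image-text score with region-text scores that are softmax-weighted toward the best-matching regions. The blend factor and temperature below are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def weighted_similarity(text_feat, global_feat, region_feats,
                        alpha: float = 0.5, tau: float = 0.07):
    t = F.normalize(text_feat, dim=-1)       # (D,) text embedding
    g = F.normalize(global_feat, dim=-1)     # (D,) whole-image embedding
    r = F.normalize(region_feats, dim=-1)    # (N, D) local region embeddings
    local = r @ t                            # (N,) region-text similarities
    w = torch.softmax(local / tau, dim=0)    # emphasize matching motifs
    return alpha * (g @ t) + (1 - alpha) * (w * local).sum()
```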
zh
[CV-59] VISTA: Enhancing Vision-Text Alignment in MLLM s via Cross-Modal Mutual Information Maximization
【速读】: This paper addresses the modality alignment problem in multimodal large language models (MLLMs), in particular the degraded alignment of visual and other modalities as text sequences grow longer. The key to the solution is VISTA (Vision-Text Alignment), which, guided by a theoretical analysis, introduces an explicit alignment objective that maximizes cross-modal mutual information, effectively improving visual understanding without any additional trainable modules or training data, making it both efficient and practical.
链接: https://arxiv.org/abs/2505.10917
作者: Mingxiao Li,Na Su,Fang Qu,Zhizhou Zhong,Ziyang Chen,Zhaopeng Tu,Xiaolong Li
机构: Tencent Hunyuan(腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current multimodal large language models (MLLMs) face a critical challenge in modality alignment, often exhibiting a bias towards textual information at the expense of other modalities like vision. This paper conducts a systematic information-theoretic analysis of the widely used cross-entropy loss in MLLMs, uncovering its implicit alignment objective. Our theoretical investigation reveals that this implicit objective has inherent limitations, leading to a degradation of cross-modal alignment as text sequence length increases, thereby hindering effective multimodal information fusion. To overcome these drawbacks, we propose Vision-Text Alignment (VISTA), a novel approach guided by our theoretical insights. VISTA introduces an explicit alignment objective designed to maximize cross-modal mutual information, preventing the degradation of visual alignment. Notably, VISTA enhances the visual understanding capabilities of existing MLLMs without requiring any additional trainable modules or extra training data, making it both efficient and practical. Our method significantly outperforms baseline models across more than a dozen benchmark datasets, including VQAv2, MMStar, and MME, paving the way for new directions in MLLM modal alignment research.
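The mutual-information-maximization principle is commonly realized with an InfoNCE bound; the sketch below shows that generic form, not VISTA's derived objective.

```python
import torch
import torch.nn.functional as F

def infonce(vision_emb: torch.Tensor, text_emb: torch.Tensor,
            tau: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE over a batch of paired embeddings: a standard
    # lower bound on the cross-modal mutual information I(V; T).
    v = F.normalize(vision_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / tau
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))
```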
zh
[CV-60] Patient-Specific Dynamic Digital-Physical Twin for Coronary Intervention Training: An Integrated Mixed Reality Approach
【速读】: This paper addresses the imprecision of preoperative planning for coronary interventions and the lack of physician training systems that accurately simulate cardiac physiological dynamics. The key to the solution is a comprehensive dynamic cardiac model research framework based on 4D-CTA that integrates digital twin technology, computer vision, and physical model manufacturing, faithfully reproducing the anatomy and dynamic characteristics of the coronary arteries and providing a precise, personalized tool for education and clinical planning with both visual and tactile feedback.
链接: https://arxiv.org/abs/2505.10902
作者: Shuo Wang,Tong Ren,Nan Cheng,Rong Wang,Li Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 34 pages, 24 figures
Abstract:Background and Objective: Precise preoperative planning and effective physician training for coronary interventions are increasingly important. Despite advances in medical imaging technologies, transforming static or limited dynamic imaging data into comprehensive dynamic cardiac models remains challenging. Existing training systems lack accurate simulation of cardiac physiological dynamics. This study develops a comprehensive dynamic cardiac model research framework based on 4D-CTA, integrating digital twin technology, computer vision, and physical model manufacturing to provide precise, personalized tools for interventional cardiology. Methods: Using 4D-CTA data from a 60-year-old female with three-vessel coronary stenosis, we segmented cardiac chambers and coronary arteries, constructed dynamic models, and implemented skeletal skinning weight computation to simulate vessel deformation across 20 cardiac phases. Transparent vascular physical models were manufactured using medical-grade silicone. We developed cardiac output analysis and virtual angiography systems, implemented guidewire 3D reconstruction using binocular stereo vision, and evaluated the system through angiography validation and CABG training applications. Results: Morphological consistency between virtual and real angiography reached 80.9%. Dice similarity coefficients for guidewire motion ranged from 0.741 to 0.812, with mean trajectory errors below 1.1 mm. The transparent model demonstrated advantages in CABG training, allowing direct visualization while simulating the challenges of a beating heart. Conclusion: Our patient-specific digital-physical twin approach effectively reproduces both anatomical structures and dynamic characteristics of coronary vasculature, offering a dynamic environment with visual and tactile feedback valuable for education and clinical planning.
zh
[CV-61] CTP: A hybrid CNN-Transformer-PINN model for ocean front forecasting
【速读】: This paper addresses the difficulty of maintaining spatial continuity and physical consistency in multi-step ocean front forecasting, where existing methods such as LSTM, ConvLSTM, and AttentionConv lose accuracy over long horizons and lack physical constraints. The key to the solution is the CTP framework, which integrates a convolutional neural network (CNN), a Transformer architecture, and a physics-informed neural network (PINN), combining localized spatial encoding, long-range temporal attention, and physical constraint enforcement to improve prediction accuracy and temporal stability.
链接: https://arxiv.org/abs/2505.10894
作者: Yishuo Wang,Feng Zhou,Muping Zhou,Qicheng Meng,Zhijun Hu,Yi Wang
机构: Shanghai Jiao Tong University (上海交通大学); Second Institute of Oceanography, MNR (国家海洋局第二海洋研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper proposes CTP, a novel deep learning framework that integrates convolutional neural networks (CNN), Transformer architectures, and physics-informed neural networks (PINN) for ocean front prediction. Ocean fronts, as dynamic interfaces between distinct water masses, play critical roles in marine biogeochemical and physical processes. Existing methods such as LSTM, ConvLSTM, and AttentionConv often struggle to maintain spatial continuity and physical consistency over multi-step forecasts. CTP addresses these challenges by combining localized spatial encoding, long-range temporal attention, and physical constraint enforcement. Experimental results across the South China Sea (SCS) and Kuroshio (KUR) regions from 1993 to 2020 demonstrate that CTP achieves state-of-the-art (SOTA) performance in both single-step and multi-step predictions, significantly outperforming baseline models in accuracy, F_1 score, and temporal stability.
zh
[CV-62] PoseBench3D: A Cross-Dataset Analysis Framework for 3D Human Pose Estimation
【速读】: This paper addresses the insufficient generalization of 3D human pose estimation models in practice: existing work focuses on performance within a single dataset while neglecting adaptability to different viewpoints, environments, and camera setups. The key to the solution is PoseBench3D, a unified evaluation framework that systematically re-evaluates existing and future models across several of the most widely used datasets, enabling consistent and fair cross-dataset comparison and supporting new datasets as the field progresses. Through a unified interface it provides datasets in a pre-configured yet easily adjustable format compatible with diverse model architectures, and its extensive experiments analyze how different pre-processing techniques and dataset preparation parameters affect model generalization.
链接: https://arxiv.org/abs/2505.10888
作者: Saad Manzur,Bryan Vela,Brandon Vela,Aditya Agrawal,Lan-Anh Dang-Vu,David Li,Wayne Hayes
机构: University of California, Irvine (加州大学欧文分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
Abstract:Reliable three-dimensional human pose estimation is becoming increasingly important for real-world applications, yet much of prior work has focused solely on the performance within a single dataset. In practice, however, systems must adapt to diverse viewpoints, environments, and camera setups – conditions that differ significantly from those encountered during training, which is often the case in real-world scenarios. To address these challenges, we present a standardized testing environment in which each method is evaluated on a variety of datasets, ensuring consistent and fair cross-dataset comparisons – allowing for the analysis of methods on previously unseen data. Therefore, we propose PoseBench3D, a unified framework designed to systematically re-evaluate prior and future models across four of the most widely used datasets for human pose estimation – with the framework able to support novel and future datasets as the field progresses. Through a unified interface, our framework provides datasets in a pre-configured yet easily modifiable format, ensuring compatibility with diverse model architectures. We re-evaluated the work of 18 methods, either trained or gathered from existing literature, and reported results using both Mean Per Joint Position Error (MPJPE) and Procrustes Aligned Mean Per Joint Position Error (PA-MPJPE) metrics, yielding more than 100 novel cross-dataset evaluation results. Additionally, we analyze performance differences resulting from various pre-processing techniques and dataset preparation parameters – offering further insight into model generalization capabilities.
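For reference, the two reported metrics are standard and easy to state; below is a compact NumPy version (joints as (J, 3) arrays, errors in the input units, typically mm).

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    # Mean Per Joint Position Error: average Euclidean joint distance.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    # Procrustes-aligned MPJPE: remove rotation, scale, and translation
    # before measuring the error (similarity Procrustes / Kabsch).
    p, g = pred - pred.mean(0), gt - gt.mean(0)
    u, s, vt = np.linalg.svd(p.T @ g)
    if np.linalg.det(u @ vt) < 0:   # avoid reflections
        vt[-1] *= -1
        s[-1] *= -1
    rot = u @ vt                    # p @ rot maps p onto g
    scale = s.sum() / (p ** 2).sum()
    return mpjpe(scale * p @ rot + gt.mean(0), gt)
```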
zh
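上文摘要中用于评测的 MPJPE 与 PA-MPJPE 是 3D 人体姿态估计的两个标准指标,下面用 NumPy 给出其通用计算方式的示意(与 PoseBench3D 的具体实现无关):PA-MPJPE 先通过正交 Procrustes 求最优相似变换(旋转、缩放、平移),再计算逐关节误差。

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: 逐关节欧氏距离的均值。
    pred, gt: (J, 3) 的关节三维坐标。"""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes 对齐后的 MPJPE: 先求最优相似变换再计算误差,
    以消除全局姿态差异的影响。"""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # 经典正交 Procrustes: 对互协方差矩阵做 SVD 求最优旋转
    u, s, vt = np.linalg.svd(p.T @ g)
    r = vt.T @ u.T
    if np.linalg.det(r) < 0:          # 防止出现反射
        vt[-1] *= -1
        s[-1] *= -1
        r = vt.T @ u.T
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ r.T + mu_g
    return mpjpe(aligned, gt)

joints_pred = np.random.rand(17, 3)
joints_gt = np.random.rand(17, 3)
print(mpjpe(joints_pred, joints_gt), pa_mpjpe(joints_pred, joints_gt))
```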
[CV-63] Preference Isolation Forest for Structure-based Anomaly Detection
【速读】:该论文试图解决的是检测不符合由低维流形表示的结构化模式的异常样本问题。其解决方案的关键在于提出一种称为Preference Isolation Forest (PIF) 的通用异常检测框架,该框架结合了基于自适应隔离的方法与偏好嵌入的灵活性。核心思想是通过拟合低维流形将数据嵌入高维偏好空间,并将异常识别为孤立点。这一"偏好嵌入 + 隔离"流程的极简示意见本条目末尾的代码。
链接: https://arxiv.org/abs/2505.10876
作者: Filippo Leveni,Luca Magri,Cesare Alippi,Giacomo Boracchi
机构: Politecnico di Milano (米兰理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Submitted to Pattern Recognition
Abstract:We address the problem of detecting anomalies as samples that do not conform to structured patterns represented by low-dimensional manifolds. To this end, we conceive a general anomaly detection framework called Preference Isolation Forest (PIF), that combines the benefits of adaptive isolation-based methods with the flexibility of preference embedding. The key intuition is to embed the data into a high-dimensional preference space by fitting low-dimensional manifolds, and to identify anomalies as isolated points. We propose three isolation approaches to identify anomalies: i ) Voronoi-iForest, the most general solution, ii ) RuzHash-iForest, that avoids explicit computation of distances via Local Sensitive Hashing, and iii ) Sliding-PIF, that leverages a locality prior to improve efficiency and effectiveness.
zh
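为说明 PIF 中"偏好嵌入 + 隔离"的基本直觉,下面给出一个极简示意:用随机采样的直线假设构造偏好空间,再以 sklearn 的 IsolationForest 代替论文中的 Voronoi-iForest / RuzHash-iForest 打分。模型类型(直线)与各超参数均为演示用假设,并非论文原始实现。

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 构造数据: 大部分点落在一条直线(低维流形)附近, 少数为离群点
inliers = np.c_[np.linspace(0, 1, 200), np.linspace(0, 1, 200)] + rng.normal(0, 0.01, (200, 2))
outliers = rng.uniform(0, 1, (10, 2))
x = np.vstack([inliers, outliers])

def preference_embedding(points, n_models=64, eps=0.05):
    """用随机采样的直线假设构造偏好空间:
    每一维是点对某条候选直线的偏好(残差越小偏好越高)。"""
    pref = np.zeros((len(points), n_models))
    for j in range(n_models):
        i1, i2 = rng.choice(len(points), 2, replace=False)
        p1, p2 = points[i1], points[i2]
        d = p2 - p1
        n = np.array([-d[1], d[0]])
        n = n / (np.linalg.norm(n) + 1e-12)
        res = np.abs((points - p1) @ n)          # 点到直线的距离
        pref[:, j] = np.exp(-res / eps) * (res < 3 * eps)
    return pref

emb = preference_embedding(x)
scores = IsolationForest(random_state=0).fit(emb).score_samples(emb)
print("离群点平均得分:", scores[-10:].mean(), " 内点平均得分:", scores[:-10].mean())
```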
[CV-64] A Light and Smart Wearable Platform with Multimodal Foundation Model for Enhanced Spatial Reasoning in People with Blindness and Low Vision
【速读】:该论文旨在解决盲人和低视力人群(pBLV)在环境导航和物体定位中面临的挑战,特别是当前多模态大语言模型(MLLM)缺乏必要的空间推理能力,以及缺乏轻量级、易用的辅助系统。解决方案的关键在于提出一种增强空间推理能力的多模态大语言模型方法,并结合一种可作为眼镜附件的硬件组件,以提升用户的环境感知与交互能力。该方法通过微调MLLM以增强对环境上下文的理解,并利用先进的视觉语言模型(VLM)提供实时的空间感知反馈,从而实现更高效、独立的导航体验。
链接: https://arxiv.org/abs/2505.10875
作者: Alexey Magay,Dhurba Tripathi,Yu Hao,Yi Fang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website and code: this https URL
Abstract:People with blindness and low vision (pBLV) face significant challenges, struggling to navigate environments and locate objects due to limited visual cues. Spatial reasoning is crucial for these individuals, as it enables them to understand and interpret the spatial relationships in their surroundings, enhancing their ability to navigate and interact more safely and independently. Current multi-modal large language (MLLM) models for low vision people lack the spatial reasoning capabilities needed to effectively assist in these tasks. Moreover, there is a notable absence of lightweight, easy-to-use systems that allow pBLV to effectively perceive and interact with their surrounding environment. In this paper, we propose a novel spatial enhanced multi-modal large language model based approach for visually impaired individuals. By fine-tuning the MLLM to incorporate spatial reasoning capabilities, our method significantly improves the understanding of environmental context, which is critical for navigation and object recognition. The innovation extends to a hardware component, designed as an attachment for glasses, ensuring increased accessibility and ease of use. This integration leverages advanced VLMs to interpret visual data and provide real-time, spatially aware feedback to the user. Our approach aims to bridge the gap between advanced machine learning models and practical, user-friendly assistive devices, offering a robust solution for visually impaired users to navigate their surroundings more effectively and independently. The paper includes an in-depth evaluation using the VizWiz dataset, demonstrating substantial improvements in accuracy and user experience. Additionally, we design a comprehensive dataset to evaluate our method’s effectiveness in real-world situations, demonstrating substantial improvements in accuracy and user experience.
zh
[CV-65] MultiLink: Multi-class Structure Recovery via Agglomerative Clustering and Model Selection CVPR2021
【速读】:该论文试图解决在噪声和异常值污染的数据集中恢复多个不同类别的结构的问题,特别是针对由混合基础参数模型(如平面和圆柱体、单应性和基础矩阵)定义的几何结构。其解决方案的关键在于提出了一种名为MultiLink的新算法,该算法通过偏好分析和聚类处理鲁棒拟合问题,并结合了实时模型拟合与模型选择的新型链接机制,以确定两个聚类是否应合并,从而实现了对多类别模型的同步处理。
链接: https://arxiv.org/abs/2505.10874
作者: Luca Magri,Filippo Leveni,Giacomo Boracchi
机构: Politecnico di Milano(米兰理工大学); DEIB(电子、信息和生物工程部)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at Computer Vision and Pattern Recognition (CVPR 2021)
Abstract:We address the problem of recovering multiple structures of different classes in a dataset contaminated by noise and outliers. In particular, we consider geometric structures defined by a mixture of underlying parametric models (e.g. planes and cylinders, homographies and fundamental matrices), and we tackle the robust fitting problem by preference analysis and clustering. We present a new algorithm, termed MultiLink, that simultaneously deals with multiple classes of models. MultiLink combines on-the-fly model fitting and model selection in a novel linkage scheme that determines whether two clusters are to be merged. The resulting method features many practical advantages with respect to methods based on preference analysis, being faster, less sensitive to the inlier threshold, and able to compensate limitations deriving from hypotheses sampling. Experiments on several public datasets demonstrate that MultiLink favourably compares with state-of-the-art alternatives, both in multi-class and single-class problems. Code is publicly made available for download.
zh
[CV-66] Hashing for Structure-based Anomaly Detection
【速读】:该论文试图解决在数据集中识别不符合由低维流形表示的结构化模式的异常样本的问题。其解决方案的关键在于将数据嵌入到一个高维空间——称为偏好空间(Preference Space),在此空间中,异常可以被识别为最孤立的点。为了提高异常检测的效率,该工作采用局部敏感哈希(Locality Sensitive Hashing)来避免在高维空间中显式计算距离,从而实现一种基于隔离的异常检测技术,在降低计算成本的同时达到了最先进的性能。
链接: https://arxiv.org/abs/2505.10873
作者: Filippo Leveni,Luca Magri,Cesare Alippi,Giacomo Boracchi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Accepted at International Conference on Image Analysis and Processing (ICIAP 2023)
Abstract:We focus on the problem of identifying samples in a set that do not conform to structured patterns represented by low-dimensional manifolds. An effective way to solve this problem is to embed data in a high dimensional space, called Preference Space, where anomalies can be identified as the most isolated points. In this work, we employ Locality Sensitive Hashing to avoid explicit computation of distances in high dimensions and thus improve Anomaly Detection efficiency. Specifically, we present an isolation-based anomaly detection technique designed to work in the Preference Space which achieves state-of-the-art performance at a lower computational cost. Code is publicly available at this https URL.
zh
[CV-67] A Convolution-Based Gait Asymmetry Metric for Inter-Limb Synergistic Coordination
【速读】:该论文试图解决步态对称性评估问题:传统运动分析方法主要依据左右两侧肌电图(EMG)信号或加速度的差异进行评估,而本文改以线性时不变(LTI)系统对肢段间协调进行建模,并引入一种差异度量来评价步态对称性。解决方案的关键在于通过LTI系统建模捕捉肢体各部分之间的协调关系,利用由该模型估得的脉冲响应(卷积核)之间的差异更准确地评价步态对称性,方法在五名具有对称与不对称步态的受试者上得到了验证。核估计与差异度计算的示意见本条目末尾的代码。
链接: https://arxiv.org/abs/2505.10869
作者: Go Fukino,Kanta Tachibana
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 7 pages, 13 figures, 3 tables
Abstract:This study focuses on the velocity patterns of various body parts during walking and proposes a method for evaluating gait symmetry. Traditional motion analysis studies have assessed gait symmetry based on differences in electromyographic (EMG) signals or acceleration between the left and right sides. In contrast, this paper models intersegmental coordination using an LTI system and proposes a dissimilarity metric to evaluate symmetry. The method was tested on five subjects with both symmetric and asymmetric gait.
zh
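针对上文以 LTI 系统建模肢段间协调的思路,下面给出一个示意:把一个肢段的速度信号视为另一肢段信号经 FIR 卷积核(脉冲响应)作用后的输出,用最小二乘估计该核,再比较左右两侧估得核的差异作为不对称度。信号、核长等均为演示假设,并非论文原始实现。

```python
import numpy as np

def fit_fir_kernel(u, y, k=16):
    """最小二乘估计长度为 k 的 FIR 核 h, 使 y ≈ conv(u, h)。"""
    n = len(u)
    # 构造 Toeplitz 式回归矩阵: 每行是 u 的一个滞后窗口
    rows = [u[t - k + 1:t + 1][::-1] for t in range(k - 1, n)]
    A = np.stack(rows)
    h, *_ = np.linalg.lstsq(A, y[k - 1:], rcond=None)
    return h

def kernel_dissimilarity(h_left, h_right):
    """归一化核之间的欧氏距离, 作为步态不对称度的一个简单度量。"""
    a = h_left / (np.linalg.norm(h_left) + 1e-12)
    b = h_right / (np.linalg.norm(h_right) + 1e-12)
    return np.linalg.norm(a - b)

t = np.linspace(0, 4 * np.pi, 400)
thigh = np.sin(t)                                            # 大腿速度信号(示意)
shank_l = np.convolve(thigh, [0.5, 0.3, 0.2], mode="same")   # 左小腿
shank_r = np.convolve(thigh, [0.6, 0.3, 0.1], mode="same")   # 右小腿
h_l = fit_fir_kernel(thigh, shank_l)
h_r = fit_fir_kernel(thigh, shank_r)
print("不对称度:", kernel_dissimilarity(h_l, h_r))
```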
[CV-68] RefPose: Leveraging Reference Geometric Correspondences for Accurate 6D Pose Estimation of Unseen Objects CVPR2025
【速读】:该论文试图解决从单目RGB图像中估计未见过物体的6D位姿这一具有挑战性的问题,尤其针对缺乏先验对象特定知识的情况。其解决方案的关键在于提出RefPose方法,该方法通过参考图像和几何对应关系作为指导,首先利用物体模板渲染参考图像并建立精炼阶段所需的几何对应关系,随后在精炼阶段基于生成的参考图像估计查询图像的几何对应关系,并通过“渲染与比较”方法迭代优化位姿。此外,引入了相关体积引导的注意力机制以有效捕捉查询图像与参考图像之间的相关性,从而使方法能够动态适应新物体形状,实现对未见过物体的鲁棒位姿估计。
链接: https://arxiv.org/abs/2505.10841
作者: Jaeguk Kim,Jaewoo Park,Keuntek Lee,Nam Ik Cho
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025
Abstract:Estimating the 6D pose of unseen objects from monocular RGB images remains a challenging problem, especially due to the lack of prior object-specific knowledge. To tackle this issue, we propose RefPose, an innovative approach to object pose estimation that leverages a reference image and geometric correspondence as guidance. RefPose first predicts an initial pose by using object templates to render the reference image and establish the geometric correspondence needed for the refinement stage. During the refinement stage, RefPose estimates the geometric correspondence of the query based on the generated references and iteratively refines the pose through a render-and-compare approach. To enhance this estimation, we introduce a correlation volume-guided attention mechanism that effectively captures correlations between the query and reference images. Unlike traditional methods that depend on pre-defined object models, RefPose dynamically adapts to new object shapes by leveraging a reference image and geometric correspondence. This results in robust performance across previously unseen objects. Extensive evaluation on the BOP benchmark datasets shows that RefPose achieves state-of-the-art results while maintaining a competitive runtime.
zh
[CV-69] NeuSEditor: From Multi-View Images to Text-Guided Neural Surface Edits
【速读】:该论文旨在解决隐式表面表示在编辑过程中难以保持身份一致性和几何一致性的问题。现有方法在编辑时往往无法有效保留场景的特定元素,导致视觉质量下降。其解决方案的关键在于提出NeuSEditor,这是一种基于文本引导的神经隐式表面编辑方法,通过引入保持身份的架构,高效地将场景分为前景和背景,从而实现对场景的精确修改而不影响特定元素,并结合几何感知的蒸馏损失来提升渲染和几何质量。
链接: https://arxiv.org/abs/2505.10827
作者: Nail Ibrahimli,Julian F. P. Kooij,Liangliang Nan
机构: Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Implicit surface representations are valued for their compactness and continuity, but they pose significant challenges for editing. Despite recent advancements, existing methods often fail to preserve identity and maintain geometric consistency during editing. To address these challenges, we present NeuSEditor, a novel method for text-guided editing of neural implicit surfaces derived from multi-view images. NeuSEditor introduces an identity-preserving architecture that efficiently separates scenes into foreground and background, enabling precise modifications without altering the scene-specific elements. Our geometry-aware distillation loss significantly enhances rendering and geometric quality. Our method simplifies the editing workflow by eliminating the need for continuous dataset updates and source prompting. NeuSEditor outperforms recent state-of-the-art methods like PDS and InstructNeRF2NeRF, delivering superior quantitative and qualitative results. For more visual results, visit: this http URL.
zh
[CV-70] A High-Performance Thermal Infrared Object Detection Framework with Centralized Regulation
【速读】:该论文旨在解决热红外(Thermal Infrared, TIR)图像中目标检测方法在提取和融合局部-全局信息方面的不足,这一问题限制了TIR领域特征注意力的有效性。其解决方案的关键在于提出一种基于集中特征调节的新型高效热红外目标检测框架CRT-YOLO,该框架通过集成高效的多尺度注意力(Efficient Multi-Scale Attention, EMA)模块和集中特征金字塔(Centralized Feature Pyramid, CFP)网络,实现了对TIR信息的全局范围交互与特征调控,从而显著提升了检测性能。
链接: https://arxiv.org/abs/2505.10825
作者: Jinke Li,Yue Wu,Xiaoyan Yang
机构: Hangzhou Dianzi University (杭州电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This manuscript has been accepted for publication in the International Journal for Housing Science and Its Applications (IJHSA), 2025
Abstract:Thermal Infrared (TIR) technology involves the use of sensors to detect and measure infrared radiation emitted by objects, and it is widely utilized across a broad spectrum of applications. The advancements in object detection methods utilizing TIR images have sparked significant research interest. However, most traditional methods lack the capability to effectively extract and fuse local-global information, which is crucial for TIR-domain feature attention. In this study, we present a novel and efficient thermal infrared object detection framework, known as CRT-YOLO, that is based on centralized feature regulation, enabling the establishment of global-range interaction on TIR information. Our proposed model integrates efficient multi-scale attention (EMA) modules, which adeptly capture long-range dependencies while incurring minimal computational overhead. Additionally, it leverages the Centralized Feature Pyramid (CFP) network, which offers global regulation of TIR features. Extensive experiments conducted on two benchmark datasets demonstrate that our CRT-YOLO model significantly outperforms conventional methods for TIR image object detection. Furthermore, the ablation study provides compelling evidence of the effectiveness of our proposed modules, reinforcing the potential impact of our approach on advancing the field of thermal infrared object detection.
zh
[CV-71] Textured mesh Quality Assessment using Geometry and Color Field Similarity
【速读】:该论文试图解决纹理网格质量评估(Textured Mesh Quality Assessment, TMQA)在现有方法中难以提供准确且鲁棒评估的问题。解决方案的关键在于提出一种基于点的TMQA方法,称为场网格质量度量(Field Mesh Quality Metric, FMQM),该方法利用符号距离场和一种新提出的颜色场——最近表面点颜色场,实现有效的网格特征描述,并从几何和颜色场中提取与视觉感知相关的四个特征,从而提升评估的准确性与效率。
链接: https://arxiv.org/abs/2505.10824
作者: Kaifa Yang,Qi Yang,Zhu Li,Yiling Xu
机构: Shanghai Jiao Tong University (上海交通大学); University of Missouri–Kansas City (密苏里大学堪萨斯城分校)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 15 pages main content, 4 pages supplementary material. Submitted to IEEE Transactions on Visualization and Computer Graphics (IEEE TVCG) for review
Abstract:Textured mesh quality assessment (TMQA) is critical for various 3D mesh applications. However, existing TMQA methods often struggle to provide accurate and robust evaluations. Motivated by the effectiveness of fields in representing both 3D geometry and color information, we propose a novel point-based TMQA method called field mesh quality metric (FMQM). FMQM utilizes signed distance fields and a newly proposed color field named nearest surface point color field to realize effective mesh feature description. Four features related to visual perception are extracted from the geometry and color fields: geometry similarity, geometry gradient similarity, space color distribution similarity, and space color gradient similarity. Experimental results on three benchmark datasets demonstrate that FMQM outperforms state-of-the-art (SOTA) TMQA metrics. Furthermore, FMQM exhibits low computational complexity, making it a practical and efficient solution for real-world applications in 3D graphics and visualization. Our code is publicly available at: this https URL.
zh
[CV-72] From Embeddings to Accuracy: Comparing Foundation Models for Radiographic Classification
【速读】:该论文旨在解决医学影像诊断中多类别放射摄影分类的准确性与计算效率问题,特别是在导管位置评估中的应用。其解决方案的关键在于利用基础模型(foundation models)生成的嵌入(embeddings)来训练轻量级适配器模型(adapter models),以实现高效且准确的分类性能。研究对比了多种基础模型生成的嵌入与不同机器学习算法结合的效果,发现MedImageInsight嵌入与支持向量机适配器组合表现最佳,显示出较高的分类准确率和计算效率,同时具备良好的公平性。
链接: https://arxiv.org/abs/2505.10823
作者: Xue Li,Jameson Merkow,Noel C. F. Codella,Alberto Santamaria-Pang,Naiteek Sangani,Alexander Ersoy,Christopher Burt,John W. Garrett,Richard J. Bruce,Joshua D. Warner,Tyler Bradshaw,Ivan Tarapov,Matthew P. Lungren,Alan B. McMillan
机构: University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Microsoft Health and Life Sciences (微软健康与生命科学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 11 pages, 5 figures, 4 tables
Abstract:Foundation models, pretrained on extensive datasets, have significantly advanced machine learning by providing robust and transferable embeddings applicable to various domains, including medical imaging diagnostics. This study evaluates the utility of embeddings derived from both general-purpose and medical domain-specific foundation models for training lightweight adapter models in multi-class radiography classification, focusing specifically on tube placement assessment. A dataset comprising 8842 radiographs classified into seven distinct categories was employed to extract embeddings using six foundation models: DenseNet121, BiomedCLIP, Med-Flamingo, MedImageInsight, Rad-DINO, and CXR-Foundation. Adapter models were subsequently trained using classical machine learning algorithms. Among these combinations, MedImageInsight embeddings paired with an support vector machine adapter yielded the highest mean area under the curve (mAUC) at 93.8%, followed closely by Rad-DINO (91.1%) and CXR-Foundation (89.0%). In comparison, BiomedCLIP and DenseNet121 exhibited moderate performance with mAUC scores of 83.0% and 81.8%, respectively, whereas Med-Flamingo delivered the lowest performance at 75.1%. Notably, most adapter models demonstrated computational efficiency, achieving training within one minute and inference within seconds on CPU, underscoring their practicality for clinical applications. Furthermore, fairness analyses on adapters trained on MedImageInsight-derived embeddings indicated minimal disparities, with gender differences in performance within 2% and standard deviations across age groups not exceeding 3%. These findings confirm that foundation model embeddings-especially those from MedImageInsight-facilitate accurate, computationally efficient, and equitable diagnostic classification using lightweight adapters for radiographic image analysis.
zh
[CV-73] MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation CVPR2025
【速读】:该论文旨在解决文本到运动生成中运动动态建模不足的问题,即现有方法依赖于基于对比语言-图像预训练(CLIP)的文本编码器,但由于其在文本-图像对上的训练,限制了其对运动中固有时序和运动学结构的理解能力。解决方案的关键在于引入MoCLIP,一个经过微调的CLIP模型,其额外添加了运动编码头,并通过对比学习和牵引损失在运动序列上进行训练,从而显式地融入运动感知表示,提升运动保真度并保持与现有CLIP基础流程的兼容性。两项损失的通用形式见本条目末尾的示意代码。
链接: https://arxiv.org/abs/2505.10810
作者: Gabriel Maldonado,Armin Danesh Pazho,Ghazal Alinezhad Noghre,Vinit Katariya,Hamed Tabkhi
机构: University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures, 2 tables. Presented at the CVPR 2025 Human Motion Generation (HuMoGen) Workshop. Introduces MoCLIP, a CLIP-based fine-tuning strategy for motion generation, with results on HumanML3D dataset and ablation studies
Abstract:Human motion generation is essential for fields such as animation, robotics, and virtual reality, requiring models that effectively capture motion dynamics from text descriptions. Existing approaches often rely on Contrastive Language-Image Pretraining (CLIP)-based text encoders, but their training on text-image pairs constrains their ability to understand temporal and kinematic structures inherent in motion and motion generation. This work introduces MoCLIP, a fine-tuned CLIP model with an additional motion encoding head, trained on motion sequences using contrastive learning and tethering loss. By explicitly incorporating motion-aware representations, MoCLIP enhances motion fidelity while remaining compatible with existing CLIP-based pipelines and seamlessly integrating into various CLIP-based methods. Experiments demonstrate that MoCLIP improves Top-1, Top-2, and Top-3 accuracy while maintaining competitive FID, leading to improved text-to-motion alignment results. These results highlight MoCLIP’s versatility and effectiveness, establishing it as a robust framework for enhancing motion generation.
zh
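MoCLIP 训练涉及的两项损失可写成如下通用形式:批内对称 InfoNCE 对齐运动嵌入与文本嵌入,外加一项把微调后的特征"拴"在冻结 CLIP 特征附近的牵引损失(tethering loss)。牵引项取 L2 以及 0.1 的权重均为假设,仅示意损失结构,未必与论文实现一致。

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(motion_emb, text_emb, temperature=0.07):
    """对称 InfoNCE: 批内 motion-text 配对为正样本, 其余为负样本。"""
    m = F.normalize(motion_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = m @ t.T / temperature
    labels = torch.arange(len(m), device=m.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

def tethering_loss(text_emb_finetuned, text_emb_frozen):
    """牵引项: 约束微调后的文本特征靠近冻结 CLIP 的原始特征,
    以缓解灾难性遗忘(具体形式为假设, 此处取 L2)。"""
    return F.mse_loss(text_emb_finetuned, text_emb_frozen)

B, D = 8, 512
motion = torch.randn(B, D, requires_grad=True)
text_ft = torch.randn(B, D, requires_grad=True)
text_frozen = torch.randn(B, D)
loss = clip_contrastive_loss(motion, text_ft) + 0.1 * tethering_loss(text_ft, text_frozen)
loss.backward()
print(float(loss))
```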
[CV-74] EA-3DGS: Efficient and Adaptive 3D Gaussians with Highly Enhanced Quality for outdoor scenes
【速读】:该论文旨在解决基于NeRF的方法在重建室外场景时存在的训练和推理速度慢、点云表示缺乏有效调整机制以及高内存消耗的问题。其关键解决方案是提出EA-3DGS方法,通过引入自适应四面体网格结构来规范高斯组件的初始化,从而有效捕捉低纹理区域的几何结构;同时采用高效的高斯剪枝策略与结构感知的密集化策略,以保留几何关键点并优化点云分布;此外,还利用向量量化技术对高斯组件参数进行量化,显著降低存储需求且对渲染质量影响较小。
链接: https://arxiv.org/abs/2505.10787
作者: Jianlin Guo,Haihong Xiao,Wenxiong Kang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Efficient scene representations are essential for many real-world applications, especially those involving spatial measurement. Although current NeRF-based methods have achieved impressive results in reconstructing building-scale scenes, they still suffer from slow training and inference speeds due to time-consuming stochastic sampling. Recently, 3D Gaussian Splatting (3DGS) has demonstrated excellent performance with its high-quality rendering and real-time speed, especially for objects and small-scale scenes. However, in outdoor scenes, its point-based explicit representation lacks an effective adjustment mechanism, and the millions of Gaussian points required often lead to memory constraints during training. To address these challenges, we propose EA-3DGS, a high-quality real-time rendering method designed for outdoor scenes. First, we introduce a mesh structure to regulate the initialization of Gaussian components by leveraging an adaptive tetrahedral mesh that partitions the grid and initializes Gaussian components on each face, effectively capturing geometric structures in low-texture regions. Second, we propose an efficient Gaussian pruning strategy that evaluates each 3D Gaussian’s contribution to the view and prunes accordingly. To retain geometry-critical Gaussian points, we also present a structure-aware densification strategy that densifies Gaussian points in low-curvature regions. Additionally, we employ vector quantization for parameter quantization of Gaussian components, significantly reducing disk space requirements with only a minimal impact on rendering quality. Extensive experiments on 13 scenes, including eight from four public datasets (MatrixCity-Aerial, Mill-19, Tanks & Temples, WHU) and five self-collected scenes acquired through UAV photogrammetry measurement from SCUT-CA and plateau regions, further demonstrate the superiority of our method.
zh
[CV-75] SynRailObs: A Synthetic Dataset for Obstacle Detection in Railway Scenarios
【速读】:该论文旨在解决铁路环境中障碍物检测的挑战,特别是由于现有公开数据集无法满足复杂条件下大规模、高精度标注图像的需求,从而阻碍了铁路安全研究的进步。其解决方案的关键在于引入SynRailObs,这是一个高保真合成数据集,能够代表多种天气条件和地理特征,并利用扩散模型生成现实中难以获取的罕见和复杂障碍物,以提升障碍物检测模型的泛化能力和适用性。
链接: https://arxiv.org/abs/2505.10784
作者: Qiushi Guo,Jason Rambach
机构: DFKI(德国人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Detecting potential obstacles in railway environments is critical for preventing serious accidents. Identifying a broad range of obstacle categories under complex conditions requires large-scale datasets with precisely annotated, high-quality images. However, existing publicly available datasets fail to meet these requirements, thereby hindering progress in railway safety research. To address this gap, we introduce SynRailObs, a high-fidelity synthetic dataset designed to represent a diverse range of weather conditions and geographical features. Furthermore, diffusion models are employed to generate rare and difficult-to-capture obstacles that are typically challenging to obtain in real-world scenarios. To evaluate the effectiveness of SynRailObs, we perform experiments in real-world railway environments, testing on both ballasted and ballastless tracks across various weather conditions. The results demonstrate that SynRailObs holds substantial potential for advancing obstacle detection in railway safety applications. Models trained on this dataset show consistent performance across different distances and environmental conditions. Moreover, the model trained on SynRailObs exhibits zero-shot capabilities, which are essential for applications in security-sensitive domains. The data is available in this https URL.
zh
[CV-76] Completely Weakly Supervised Class-Incremental Learning for Semantic Segmentation
【速读】:该论文试图解决完全弱监督类增量语义分割(completely weakly supervised class-incremental learning for semantic segmentation)的问题,即在仅使用图像级标签的情况下,学习对基础类和新增的新颖类进行分割。传统类增量语义分割(CISS)方法需要昂贵的像素级标注,而现有的弱监督方法也只做到部分弱监督。该工作的关键在于融合定位器与一系列基础模型的输出,并依据各自的不确定性生成鲁棒的伪标签,同时引入示例引导的数据增强方法,生成同时包含旧类与新类的多样化图像,以缓解灾难性遗忘。
链接: https://arxiv.org/abs/2505.10781
作者: David Minkwan Kim,Soeun Lee,Byeongkeun Kang
机构: Hanyang University (汉阳大学); Chung-Ang University (中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages
Abstract:This work addresses the task of completely weakly supervised class-incremental learning for semantic segmentation to learn segmentation for both base and additional novel classes using only image-level labels. While class-incremental semantic segmentation (CISS) is crucial for handling diverse and newly emerging objects in the real world, traditional CISS methods require expensive pixel-level annotations for training. To overcome this limitation, partially weakly-supervised approaches have recently been proposed. However, to the best of our knowledge, this is the first work to introduce a completely weakly-supervised method for CISS. To achieve this, we propose to generate robust pseudo-labels by combining pseudo-labels from a localizer and a sequence of foundation models based on their uncertainty. Moreover, to mitigate catastrophic forgetting, we introduce an exemplar-guided data augmentation method that generates diverse images containing both previous and novel classes with guidance. Finally, we conduct experiments in three common experimental settings: 15-5 VOC, 10-10 VOC, and COCO-to-VOC, and in two scenarios: disjoint and overlap. The experimental results demonstrate that our completely weakly supervised method outperforms even partially weakly supervised methods in the 15-5 VOC and 10-10 VOC settings while achieving competitive accuracy in the COCO-to-VOC setting.
zh
[CV-77] Unifying Segment Anything in Microscopy with Multimodal Large Language Model
【速读】:该论文旨在解决生物医学图像中感兴趣区域分割在跨域数据上表现不佳的问题,其核心挑战在于现有基础模型缺乏视觉-语言知识(VLK)导致的泛化能力不足。解决方案的关键在于利用多模态大语言模型(MLLM)注入VLK,通过引入视觉-语言语义对齐(VLSA)模块和语义边界正则化(SBR)机制,提升Segment Anything Model(SAM)在显微镜跨域数据上的分割性能与边界感知能力。
链接: https://arxiv.org/abs/2505.10769
作者: Manyu Li,Ruian He,Zixian Zhang,Weimin Tan,Bo Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 9 figures
Abstract:Accurate segmentation of regions of interest in biomedical images holds substantial value in image analysis. Although several foundation models for biomedical segmentation have currently achieved excellent performance on certain datasets, they typically demonstrate sub-optimal performance on unseen domain data. We owe the deficiency to lack of vision-language knowledge before segmentation. Multimodal Large Language Models (MLLMs) bring outstanding understanding and reasoning capabilities to multimodal tasks, which inspires us to leverage MLLMs to inject Vision-Language Knowledge (VLK), thereby enabling vision models to demonstrate superior generalization capabilities on cross-domain datasets. In this paper, we propose using MLLMs to guide SAM in learning microscopy cross-domain data, unifying Segment Anything in Microscopy, named uLLSAM. Specifically, we propose the Vision-Language Semantic Alignment (VLSA) module, which injects VLK into Segment Anything Model (SAM). We find that after SAM receives global VLK prompts, its performance improves significantly, but there are deficiencies in boundary contour perception. Therefore, we further propose Semantic Boundary Regularization (SBR) to prompt SAM. Our method achieves performance improvements of 7.71% in Dice and 12.10% in SA across 9 in-domain microscopy datasets, achieving state-of-the-art performance. Our method also demonstrates improvements of 6.79% in Dice and 10.08% in SA across 10 out-of-domain datasets, exhibiting strong generalization capabilities. Code is available at this https URL.
zh
[CV-78] Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities
【速读】:该论文试图解决微创手术(Minimally Invasive Surgery, MIS)中视觉和操作挑战,特别是手术器械分类以及理解涉及器械、动词和解剖目标的手术动作问题。传统方法依赖于特定程序和任务的模型,这些模型通常在小规模手动标注数据集上进行训练。本文的关键解决方案是利用大规模图像-文本对预训练的视觉语言模型(Vision-Language Models, VLMs),评估其在多种外科数据集上的表现,以探索其在手术领域中的适应性和局限性。
链接: https://arxiv.org/abs/2505.10764
作者: Jiajun Cheng,Xianwu Zhao,Shan Lin
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Minimally invasive surgery (MIS) presents significant visual and technical challenges, including surgical instrument classification and understanding surgical action involving instruments, verbs, and anatomical targets. While many machine learning-based methods have been developed for surgical understanding, they typically rely on procedure- and task-specific models trained on small, manually annotated datasets. In contrast, the recent success of vision-language models (VLMs) trained on large volumes of raw image-text pairs has demonstrated strong adaptability to diverse visual data and a range of downstream tasks. This opens meaningful research questions: how well do these general-purpose VLMs perform in the surgical domain? In this work, we explore those questions by benchmarking several VLMs across diverse surgical datasets, including general laparoscopic procedures and endoscopic submucosal dissection, to assess their current capabilities and limitations. Our benchmark reveals key gaps in the models’ ability to consistently link language to the correct regions in surgical scenes.
zh
[CV-79] Mapping Semantic Segmentation to Point Clouds Using Structure from Motion for Forest Analysis ICRA2025
【速读】:该论文试图解决森林环境中缺乏公开的语义分割点云数据集的问题,这一问题主要源于遥感技术获取点云数据的成本高、传感器要求严格且获取过程耗时。此外,目前尚无通过Structure From Motion (SfM)算法生成的公开标注数据集,这可能是因为缺乏能够将语义分割信息映射到精确点云的SfM算法,尤其是在像森林这样具有挑战性的环境中。解决方案的关键在于提出一种新的管道,利用自建的森林模拟器生成多样化森林场景的逼真RGB图像及其对应的语义分割掩码,并通过修改后的开源SfM软件进行处理,以在三维重建过程中保留语义信息,从而生成包含几何和语义细节的点云数据。
链接: https://arxiv.org/abs/2505.10751
作者: Francisco Raverta Capua,Pablo De Cristoforis
机构: Universidad de Buenos Aires (布宜诺斯艾利斯大学); Facultad de Ciencias Exactas y Naturales (自然科学精确科学学院); CONICET-Universidad de Buenos Aires (阿根廷国家科学技术研究委员会-布宜诺斯艾利斯大学); Instituto de Ciencias de la Computación (ICC) (计算机科学研究所(ICC))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress, accepted in Novel Approaches for Precision Agriculture and Forestry with Autonomous Robots, ICRA 2025 Workshop - May 23, 2025 - Atlanta, GA
Abstract:Although the use of remote sensing technologies for monitoring forested environments has gained increasing attention, publicly available point cloud datasets remain scarce due to the high costs, sensor requirements, and time-intensive nature of their acquisition. Moreover, as far as we are aware, there are no public annotated datasets generated through Structure From Motion (SfM) algorithms applied to imagery, which may be due to the lack of SfM algorithms that can map semantic segmentation information into an accurate point cloud, especially in a challenging environment like forests. In this work, we present a novel pipeline for generating semantically segmented point clouds of forest environments. Using a custom-built forest simulator, we generate realistic RGB images of diverse forest scenes along with their corresponding semantic segmentation masks. These labeled images are then processed using modified open-source SfM software capable of preserving semantic information during 3D reconstruction. The resulting point clouds provide both geometric and semantic detail, offering a valuable resource for training and evaluating deep learning models aimed at segmenting real forest point clouds obtained via SfM.
zh
[CV-80] IMAGE-ALCHEMY: Advancing subject fidelity in personalised text-to-image generation
【速读】:该论文试图解决在文本到图像扩散模型(如Stable Diffusion)中,基于少量参考图像对模型进行个性化以表示新主体时所面临的灾难性遗忘、过拟合以及计算成本过高的问题。解决方案的关键在于提出一种两阶段流程,通过在Stable Diffusion XL (SDXL) 模型的U-Net结构中的注意力权重上应用LoRA(Low-Rank Adaptation)微调,从而实现对新主体的高保真整合,同时保持SDXL的整体生成能力。LoRA 低秩增量的基本形式见本条目末尾的示意代码。
链接: https://arxiv.org/abs/2505.10743
作者: Amritanshu Tiwari,Cherish Puniani,Kaustubh Sharma,Ojasva Nema
机构: Indian Institute of Technology Roorkee (印度理工学院鲁尔基分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages
Abstract:Recent advances in text-to-image diffusion models, particularly Stable Diffusion, have enabled the generation of highly detailed and semantically rich images. However, personalizing these models to represent novel subjects based on a few reference images remains challenging. This often leads to catastrophic forgetting, overfitting, or large computational costs. We propose a two-stage pipeline that addresses these limitations by leveraging LoRA-based fine-tuning on the attention weights within the U-Net of the Stable Diffusion XL (SDXL) model. First, we use the unmodified SDXL to generate a generic scene by replacing the subject with its class label. Then, we selectively insert the personalized subject through a segmentation-driven image-to-image (Img2Img) pipeline that uses the trained LoRA weights. This framework isolates the subject encoding from the overall composition, thus preserving SDXL’s broader generative capabilities while integrating the new subject in a high-fidelity manner. Our method achieves a DINO similarity score of 0.789 on SDXL, outperforming existing personalized text-to-image approaches.
zh
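上文 IMAGE-ALCHEMY 对 SDXL U-Net 注意力权重所做的 LoRA 微调,其核心是冻结原权重 W、只学习低秩增量 BA。以下为 LoRA 线性层的最小示意(维度、秩 r 与 alpha 均为演示假设,与 diffusers/SDXL 的实际接口无关)。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """W x + (alpha/r) * B(A x): 仅训练低秩矩阵 A、B, 冻结原权重。"""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # 冻结预训练权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # 零初始化, 初始等价于原模型
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

attn_proj = nn.Linear(64, 64)     # 假想的注意力投影层
lora = LoRALinear(attn_proj)
y = lora(torch.randn(2, 64))
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
print(y.shape, "可训练参数:", trainable)
```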
[CV-81] Automated Detection of Salvin’s Albatrosses: Improving Deep Learning Tools for Aerial Wildlife Surveys CVPR2025
【速读】:该论文旨在解决在偏远和复杂环境中对特定物种进行种群监测的问题,具体针对萨利文信天翁(Thalassarche salvini)的繁殖种群数量估算。解决方案的关键在于利用预训练的通用鸟类检测模型BirdDetector,并通过目标领域标注数据进行微调以及结合更强的数据增强方法,以提升检测精度。研究结果表明,与零样本设置相比,微调和增强技术显著提高了模型的性能。
链接: https://arxiv.org/abs/2505.10737
作者: Mitchell Rogers,Theo Thompson,Isla Duporge,Johannes Fischer,Klemens Pütz,Thomas Mattern,Bing Xue,Mengjie Zhang
机构: Centre for Data Science and Artificial Intelligence, Victoria University of Wellington, New Zealand; Department of Zoology, University of Otago, New Zealand; Department of Ecology and Evolutionary Biology, Princeton University, U.S.A; Marine Bycatch and Threats – Department of Conservation, New Zealand; Antarctic Research Trust, Bremervörde, Germany; The Tawaki Trust, Dunedin, New Zealand; Global Penguin Society, Puerto Madryn, Chubut, Argentina
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the CV4Animals workshop at CVPR 2025
Abstract:Recent advancements in deep learning and aerial imaging have transformed wildlife monitoring, enabling researchers to survey wildlife populations at unprecedented scales. Unmanned Aerial Vehicles (UAVs) provide a cost-effective means of capturing high-resolution imagery, particularly for monitoring densely populated seabird colonies. In this study, we assess the performance of a general-purpose avian detection model, BirdDetector, in estimating the breeding population of Salvin’s albatross (Thalassarche salvini) on the Bounty Islands, New Zealand. Using drone-derived imagery, we evaluate the model’s effectiveness in both zero-shot and fine-tuned settings, incorporating enhanced inference techniques and stronger augmentation methods. Our findings indicate that while applying the model in a zero-shot setting offers a strong baseline, fine-tuning with annotations from the target domain and stronger image augmentation leads to marked improvements in detection accuracy. These results highlight the potential of leveraging pre-trained deep-learning models for species-specific monitoring in remote and challenging environments.
zh
[CV-82] TartanGround: A Large-Scale Dataset for Ground Robot Perception and Navigation
【速读】:该论文旨在解决地面机器人在多样化环境中实现更精准的感知与自主性的挑战,其核心问题是现有方法在不同场景下的泛化能力不足。解决方案的关键在于构建一个大规模、多模态的数据集TartanGround,该数据集包含多种传感器数据(如RGB立体相机、深度图、光流、立体差异、LiDAR点云、真实位姿、语义分割图像和带语义标签的占用图),并通过自动化流水线生成模拟不同地面机器人平台(轮式和腿式)运动模式的轨迹,从而提供丰富的训练与评估数据,推动学习型任务(如占用预测、同步定位与地图构建、神经场景表示等)的发展。
链接: https://arxiv.org/abs/2505.10696
作者: Manthan Patel,Fan Yang,Yuheng Qiu,Cesar Cadena,Sebastian Scherer,Marco Hutter,Wenshan Wang
机构: ETH Zurich (苏黎世联邦理工学院); Carnegie Mellon University (卡内基梅隆大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review for IEEE conference
Abstract:We present TartanGround, a large-scale, multi-modal dataset to advance the perception and autonomy of ground robots operating in diverse environments. This dataset, collected in various photorealistic simulation environments includes multiple RGB stereo cameras for 360-degree coverage, along with depth, optical flow, stereo disparity, LiDAR point clouds, ground truth poses, semantic segmented images, and occupancy maps with semantic labels. Data is collected using an integrated automatic pipeline, which generates trajectories mimicking the motion patterns of various ground robot platforms, including wheeled and legged robots. We collect 910 trajectories across 70 environments, resulting in 1.5 million samples. Evaluations on occupancy prediction and SLAM tasks reveal that state-of-the-art methods trained on existing datasets struggle to generalize across diverse scenes. TartanGround can serve as a testbed for training and evaluation of a broad range of learning-based tasks, including occupancy prediction, SLAM, neural scene representation, perception-based navigation, and more, enabling advancements in robotic perception and autonomy towards achieving robust models generalizable to more diverse scenarios. The dataset and codebase for data collection will be made publicly available upon acceptance. Webpage: this https URL
zh
[CV-83] A probabilistic framework for dynamic quantization
【速读】:该论文试图解决神经网络动态量化中的计算效率与量化参数自适应调整之间的平衡问题,旨在实现输入自适应的量化参数缩放。解决方案的关键在于提出一种概率框架,通过轻量级代理模型对网络的预激活值应用概率模型,从而在不显著增加内存开销的情况下,实现针对每个输入的量化参数自适应调整。该思路的一个简化示意见本条目末尾的代码。
链接: https://arxiv.org/abs/2505.10689
作者: Gabriele Santini,Francesco Paissan,Elisabetta Farella
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose a probabilistic framework for dynamic quantization of neural networks that allows for a computationally efficient input-adaptive rescaling of the quantization parameters. Our framework applies a probabilistic model to the network’s pre-activations through a lightweight surrogate, enabling the adaptive adjustment of the quantization parameters on a per-input basis without significant memory overhead. We validate our approach on a set of popular computer vision tasks and models, observing only a negligible loss in performance. Our method strikes the best performance and computational overhead tradeoff compared to standard quantization strategies.
zh
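按照上文"轻量代理模型预测预激活分布、逐输入调整量化参数"的思路,下面给出一个示意:假设代理头预测预激活的标准差,并以 3σ 作为该输入的对称量化满量程。这一具体策略是为演示而设的假设,并非论文原文。

```python
import torch
import torch.nn as nn

def quantize_sym(x, scale, bits=8):
    """对称均匀量化: x -> round(x/scale) 截断到整数范围再反量化。"""
    qmax = 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return q * scale

class DynamicQuantLayer(nn.Module):
    """示意: 轻量代理头根据输入预测预激活的尺度, 逐输入调整量化参数。"""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.surrogate = nn.Linear(dim, 1)        # 轻量代理: 预测 log-std

    def forward(self, x):
        pre_act = self.fc(x)
        sigma = torch.exp(self.surrogate(x))      # (B, 1), 逐输入的尺度估计
        scale = 3.0 * sigma / (2 ** 7 - 1)        # 以 3*sigma 为满量程(假设策略)
        return torch.relu(quantize_sym(pre_act, scale))

layer = DynamicQuantLayer(32)
out = layer(torch.randn(4, 32))
print(out.shape)
```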
[CV-84] GaussianFormer3D: Multi-Modal Gaussian-based Semantic Occupancy Prediction with 3D Deformable Attention
【速读】:该论文旨在解决自动驾驶中3D语义占据预测的准确性与效率问题,特别是在多模态感知系统中如何有效融合LiDAR与相机数据以提升预测性能。其解决方案的关键在于提出一种基于3D高斯(3D Gaussians)的多模态语义占据预测框架GaussianFormer3D,通过引入体素到高斯的初始化策略和LiDAR引导的3D可变形注意力机制,在提升预测精度的同时降低了内存消耗并提高了计算效率。
链接: https://arxiv.org/abs/2505.10685
作者: Lingjun Zhao,Sizhe Wei,James Hays,Lu Gan
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D semantic occupancy prediction is critical for achieving safe and reliable autonomous driving. Compared to camera-only perception systems, multi-modal pipelines, especially LiDAR-camera fusion methods, can produce more accurate and detailed predictions. Although most existing works utilize a dense grid-based representation, in which the entire 3D space is uniformly divided into discrete voxels, the emergence of 3D Gaussians provides a compact and continuous object-centric representation. In this work, we propose a multi-modal Gaussian-based semantic occupancy prediction framework utilizing 3D deformable attention, named as GaussianFormer3D. We introduce a voxel-to-Gaussian initialization strategy to provide 3D Gaussians with geometry priors from LiDAR data, and design a LiDAR-guided 3D deformable attention mechanism for refining 3D Gaussians with LiDAR-camera fusion features in a lifted 3D space. We conducted extensive experiments on both on-road and off-road datasets, demonstrating that our GaussianFormer3D achieves high prediction accuracy that is comparable to state-of-the-art multi-modal fusion-based methods with reduced memory consumption and improved efficiency.
zh
[CV-85] Are Spatial-Temporal Graph Convolution Networks for Human Action Recognition Over-Parameterized?
【速读】:该论文试图解决骨架动作识别(skeleton-based human action recognition, HAR)中空间-时间图卷积网络(ST-GCNs)的模型过参数化问题,即尽管输入设置一致,不同模型的识别性能差异不大。解决方案的关键在于通过彩票假设(lottery ticket hypothesis)验证了ST-GCNs在HAR任务中的过参数化特性,并提出了一种新型稀疏ST-GCNs生成器,该生成器从随机初始化的稠密网络中训练出稀疏结构,同时保持与稠密组件相当的性能;此外,通过整合多级稀疏结构,构建了多级稀疏ST-GCNs,显著提升了HAR性能。与之相关的幅值剪枝与权重回卷的基本操作见本条目末尾的示意代码。
链接: https://arxiv.org/abs/2505.10679
作者: Jianyang Xie,Yitian Zhao,Yanda Meng,He Zhao,Anh Nguyen,Yalin Zheng
机构: University of Liverpool (利物浦大学); Ningbo Institute of Materials Technology and Engineering (宁波材料技术与工程研究所); CAS (中国科学院); University of Exeter (埃克塞特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Spatial-temporal graph convolutional networks (ST-GCNs) showcase impressive performance in skeleton-based human action recognition (HAR). However, despite the development of numerous models, their recognition performance does not differ significantly after aligning the input settings. With this observation, we hypothesize that ST-GCNs are over-parameterized for HAR, a conjecture subsequently confirmed through experiments employing the lottery ticket hypothesis. Additionally, a novel sparse ST-GCNs generator is proposed, which trains a sparse architecture from a randomly initialized dense network while maintaining comparable performance levels to the dense components. Moreover, we generate multi-level sparsity ST-GCNs by integrating sparse structures at various sparsity levels and demonstrate that the assembled model yields a significant enhancement in HAR performance. Thorough experiments on four datasets, including NTU-RGB+D 60(120), Kinetics-400, and FineGYM, demonstrate that the proposed sparse ST-GCNs can achieve comparable performance to their dense components. Even with 95% fewer parameters, the sparse ST-GCNs exhibit a degradation of 1% in top-1 accuracy. Meanwhile, the multi-level sparsity ST-GCNs, which require only 66% of the parameters of the dense ST-GCNs, demonstrate an improvement of 1% in top-1 accuracy. The code is available at this https URL.
zh
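上文借助彩票假设验证过参数化并训练稀疏网络,其中常用的基础操作是幅值剪枝加权重回卷(rewind)。下面以单个全连接层为例给出示意(图卷积层的权重同理);95% 稀疏率对应摘要中的设置,其余细节为演示假设,并非论文的稀疏生成器实现。

```python
import torch
import torch.nn as nn

def magnitude_prune_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """返回 0/1 掩码: 幅值最小的 sparsity 比例的权重被置零。"""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

layer = nn.Linear(128, 128)
init_weight = layer.weight.detach().clone()       # 记录初始权重用于回卷

# ……(此处省略若干轮训练)……

mask = magnitude_prune_mask(layer.weight.detach(), sparsity=0.95)
with torch.no_grad():
    layer.weight.copy_(init_weight * mask)        # 回卷到初始值并应用掩码
print("实际稀疏率:", 1 - mask.mean().item())
```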
[CV-86] GA3CE: Unconstrained 3D Gaze Estimation with Gaze-Aware 3D Context Encoding CVPR2025
【速读】:该论文试图解决在非受限场景下从2D观测中估计3D gaze direction(注视方向)的问题,特别是在无法获取被试者眼部的近距离视图时,如被试者距离较远或背对摄像头的情况。传统方法通常依赖于2D外观信息或在不可学习的后处理步骤中结合有限的空间线索(如深度图),难以有效建模复杂的3D空间关系。该论文提出的解决方案关键在于GA3CE(Gaze-Aware 3D Context Encoding)方法,通过将主体和场景表示为3D姿态和物体位置,并将其作为3D上下文来学习空间关系,同时在以自我为中心的空间中对齐该上下文以降低空间复杂度,并引入D^3(direction-distance-decomposed)位置编码以更精确地捕捉3D上下文与注视方向之间的方向和距离关系。
链接: https://arxiv.org/abs/2505.10671
作者: Yuki Kawana,Shintaro Shiba,Quan Kong,Norimasa Kobori
机构: Woven by Toyota(丰田织物)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2025. Project page: this https URL
Abstract:We propose a novel 3D gaze estimation approach that learns spatial relationships between the subject and objects in the scene, and outputs 3D gaze direction. Our method targets unconstrained settings, including cases where close-up views of the subject’s eyes are unavailable, such as when the subject is distant or facing away. Previous approaches typically rely on either 2D appearance alone or incorporate limited spatial cues using depth maps in the non-learnable post-processing step. Estimating 3D gaze direction from 2D observations in these scenarios is challenging; variations in subject pose, scene layout, and gaze direction, combined with differing camera poses, yield diverse 2D appearances and 3D gaze directions even when targeting the same 3D scene. To address this issue, we propose GA3CE: Gaze-Aware 3D Context Encoding. Our method represents subject and scene using 3D poses and object positions, treating them as 3D context to learn spatial relationships in 3D space. Inspired by human vision, we align this context in an egocentric space, significantly reducing spatial complexity. Furthermore, we propose D^3 (direction-distance-decomposed) positional encoding to better capture the spatial relationship between 3D context and gaze direction in direction and distance space. Experiments demonstrate substantial improvements, reducing mean angle error by 13%-37% compared to leading baselines on benchmark datasets in single-frame settings.
zh
[CV-87] CLIP Embeddings for AI-Generated Image Detection: A Few-Shot Study with Lightweight Classifier
【速读】:该论文试图解决在社交媒体平台上验证AI生成图像真实性的挑战,特别是针对视觉-语言模型(VLM)如CLIP在AI生成图像分类任务中的能力尚未被充分探索的问题。其解决方案的关键在于利用冻结的CLIP模型提取视觉嵌入,并将其输入轻量级网络进行微调,仅调整最终分类器,从而实现高效的图像真实性检测。实验表明,该方法在CIFAKE基准数据集上达到了95%的准确率,且在少量样本适应中仍能保持85%的性能。"冻结 CLIP + 轻量分类头"的流程示意见本条目末尾的代码。
链接: https://arxiv.org/abs/2505.10664
作者: Ziyang Ou
机构: University of Rochester (罗切斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures, not submitted to any conference
Abstract:Verifying the authenticity of AI-generated images presents a growing challenge on social media platforms these days. While vision-language models (VLMs) like CLIP excel at multimodal representation, their capacity for AI-generated image classification is underexplored due to the absence of such labels during the pre-training process. This work investigates whether CLIP embeddings inherently contain information indicative of AI generation. A proposed pipeline extracts visual embeddings using a frozen CLIP model, feeds its embeddings to lightweight networks, and fine-tunes only the final classifier. Experiments on the public CIFAKE benchmark show the performance reaches 95% accuracy without language reasoning. Few-shot adaptation to a curated custom dataset with 20% of the data brings performance to 85%. A closed-source baseline (Gemini-2.0) has the best zero-shot accuracy yet fails on specific styles. Notably, some specific image types, such as wide-angle photographs and oil paintings, pose significant challenges to classification. These results indicate previously unexplored difficulties in classifying certain types of AI-generated images, revealing new and more specific questions in this domain that are worth further investigation.
zh
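上文流水线的核心是"冻结 CLIP 抽特征 + 只训练轻量分类头"。下面用随机数组代替真实的 CLIP 图像嵌入(ViT-B/32 输出为 512 维,实际可用 open_clip 等库的图像编码器导出),分类头简化为逻辑回归;标签与人为注入的可分性均为演示假设。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# 假设: 已用冻结的 CLIP 图像编码器批量导出好的嵌入(此处以随机数代替)
n, d = 2000, 512
X = rng.normal(size=(n, d)).astype(np.float32)
y = rng.integers(0, 2, n)                  # 0=真实图像, 1=AI 生成(示意标签)
X[y == 1] += 0.15                          # 人为注入可分性, 仅为演示

split = int(0.8 * n)
clf = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
print("验证准确率:", accuracy_score(y[split:], clf.predict(X[split:])))
```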
[CV-88] Advancing Multiple Instance Learning with Continual Learning for Whole Slide Imaging CVPR2025
【速读】:该论文旨在解决传统多实例学习(Multiple Instance Learning, MIL)模型在面对动态变化的数据集时适应性不足的问题,即模型在持续学习过程中容易发生灾难性遗忘。其解决方案的关键在于分析并缓解注意力机制在MIL模型中的遗忘问题,通过提出两种核心组件:注意力知识蒸馏(Attention Knowledge Distillation, AKD)和伪袋记忆池(Pseudo-Bag Memory Pool, PMP)。AKD通过保留注意力层的知识来减轻遗忘,而PMP则通过选择性存储最具信息量的图像区域来提高内存效率。AKD 蒸馏损失的一种常见写法见本条目末尾的示意代码。
链接: https://arxiv.org/abs/2505.10649
作者: Xianrui Li,Yufei Cui,Jun Li,Antoni B. Chan
机构: City University of Hong Kong (香港城市大学); Huawei Canada (华为加拿大); Guangzhou Bingli Technology Co., Ltd. (广州秉力科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025
Abstract:Advances in medical imaging and deep learning have propelled progress in whole slide image (WSI) analysis, with multiple instance learning (MIL) showing promise for efficient and accurate diagnostics. However, conventional MIL models often lack adaptability to evolving datasets, as they rely on static training that cannot incorporate new information without extensive retraining. Applying continual learning (CL) to MIL models is a possible solution, but often sees limited improvements. In this paper, we analyze CL in the context of attention MIL models and find that the model forgetting is mainly concentrated in the attention layers of the MIL model. Using the results of this analysis we propose two components for improving CL on MIL: Attention Knowledge Distillation (AKD) and the Pseudo-Bag Memory Pool (PMP). AKD mitigates catastrophic forgetting by focusing on retaining attention layer knowledge between learning sessions, while PMP reduces the memory footprint by selectively storing only the most informative patches, or "pseudo-bags" from WSIs. Experimental evaluations demonstrate that our method significantly improves both accuracy and memory efficiency on diverse WSI datasets, outperforming current state-of-the-art CL methods. This work provides a foundation for CL in large-scale, weakly annotated clinical datasets, paving the way for more adaptable and resilient diagnostic models.
zh
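AKD 的核心是让新会话模型的注意力分布贴近冻结的旧会话模型。以下以"对 bag 内 patch 注意力 logits 做温度软化的 KL 蒸馏"为假设,给出该损失的一种常见写法(论文的具体形式可能不同)。

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(attn_new, attn_old, tau=2.0):
    """对注意力 logits 做温度软化后的 KL 蒸馏(具体形式为假设)。
    attn_new/attn_old: (B, N), 每个 bag 内 N 个 patch 的注意力 logits。"""
    p_old = F.softmax(attn_old / tau, dim=-1)
    log_p_new = F.log_softmax(attn_new / tau, dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * tau ** 2

attn_new = torch.randn(4, 100, requires_grad=True)   # 当前会话模型的注意力
attn_old = torch.randn(4, 100)                        # 冻结的旧会话模型的注意力
loss = attention_distillation_loss(attn_new, attn_old)
loss.backward()
print(float(loss))
```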
[CV-89] Mitigate Language Priors in Large Vision-Language Models by Cross-Images Contrastive Decoding
【速读】:该论文试图解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中由于语言先验(language priors)导致的幻觉问题,即模型生成在语言上合理但与视觉内容不一致的文本。解决方案的关键在于提出一种无需训练的跨图像对比解码(Cross-Image Contrastive Decoding, CICD)方法,通过识别并消除有害的语言先验,同时保持文本的流畅性和连贯性,从而有效缓解幻觉现象。对比解码中 logits 组合的通用形式见本条目末尾的示意代码。
链接: https://arxiv.org/abs/2505.10634
作者: Jianfei Zhao,Feng Zhang,Xin Sun,Chong Feng
机构: Beijing Institute of Technology (北京理工大学); Zhongguancun Academy (中关村学院); Southeast Academy of Information Technology (东南信息科技学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Language priors constitute one of the primary causes of hallucinations in Large Vision-Language Models (LVLMs), driving the models to generate linguistically plausible yet visually inconsistent content. The language priors in LVLMs originate from the linguistic knowledge inherited from their pre-trained Large Language Model (LLM) backbone. Consequently, this characteristic is an intrinsic property of the model that remains independent of visual inputs. Inspired by the finding that language priors are consistent across images, we propose Cross-Image Contrastive Decoding (CICD), a simple yet effective training-free method to alleviate language priors in LVLMs. CICD first identifies essential and detrimental priors, and then employs contrastive decoding to eliminate the detrimental ones. This approach simultaneously prevents LVLMs from generating hallucinated content while maintaining textual fluency and coherence. Furthermore, the limited information overlap between images helps prevent visual information loss during contrastive decoding. We validate the effectiveness of CICD on four benchmarks with six LVLMs. Our experiments demonstrate that CICD performs remarkably well in mitigating language priors, especially in the image captioning task, where such priors are most pronounced. Code will be released once accepted.
zh
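对比解码的通用做法是用目标条件与参照条件的 logits 之差放大视觉相关信号、抑制两者共享的语言先验;CICD 的特点是以另一张图像作为参照。下面按这一通用形式给出单步解码示意(α、β 与置信度截断策略均为此类方法的常见假设,未必与论文实现一致)。

```python
import torch
import torch.nn.functional as F

def contrastive_decode_step(logits_target, logits_reference, alpha=1.0, beta=0.1):
    """单步对比解码:
    - 放大目标图与参照图 logits 的差异, 抑制两者共享的语言先验;
    - 用置信度截断限制候选集, 以保持文本流畅性。"""
    probs = F.softmax(logits_target, dim=-1)
    keep = probs >= beta * probs.max(dim=-1, keepdim=True).values
    scores = (1 + alpha) * logits_target - alpha * logits_reference
    scores = scores.masked_fill(~keep, float("-inf"))
    return torch.argmax(scores, dim=-1)

vocab = 32000
logits_t = torch.randn(1, vocab)    # 以目标图像为条件的下一词 logits
logits_r = torch.randn(1, vocab)    # 以参照图像为条件的下一词 logits
print(contrastive_decode_step(logits_t, logits_r))
```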
[CV-90] MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence
【速读】:该论文试图解决当前模型在空间感知与推理能力上的不足,特别是在物体属性识别和空间关系推理方面的缺陷,这些问题限制了动态推理的准确性。解决方案的关键在于提出MIRAGE,一个跨模态基准,用于评估模型在计数(物体属性识别)、关系(空间关系推理)以及结合计数与关系的能力,通过复杂且细致的场景揭示现有模型的局限性,并推动更优表示和推理框架的发展。
链接: https://arxiv.org/abs/2505.10604
作者: Chonghan Liu,Haoran Wang,Felix Henry,Pu Miao,Yajie Zhang,Yu Zhao,Peiran Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Spatial perception and reasoning are core components of human cognition, encompassing object recognition, spatial relational understanding, and dynamic reasoning. Despite progress in computer vision, existing benchmarks reveal significant gaps in models’ abilities to accurately recognize object attributes and reason about spatial relationships, both essential for dynamic reasoning. To address these limitations, we propose MIRAGE, a multi-modal benchmark designed to evaluate models’ capabilities in Counting (object attribute recognition), Relation (spatial relational reasoning), and Counting with Relation. Through diverse and complex scenarios requiring fine-grained recognition and reasoning, MIRAGE highlights critical limitations in state-of-the-art models, underscoring the need for improved representations and reasoning frameworks. By targeting these foundational abilities, MIRAGE provides a pathway toward spatiotemporal reasoning in future research.
zh
[CV-91] SRMamba: Mamba for Super-Resolution of LiDAR Point Clouds
【速读】:该论文旨在解决LiDAR点云超分辨率问题,特别是在稀疏场景下从新视角恢复点云的三维空间结构这一关键挑战。其解决方案的关键在于采用基于霍夫投票(Hough Voting)的投影技术和孔洞补偿策略,以消除距离图像中的水平线性孔洞;同时引入视觉状态空间模型和多方向扫描机制,以增强长距离依赖关系并关注垂直三维空间中的潜在几何特征;此外,还设计了一个非对称U-Net网络,以适应不同激光束数量的LiDAR输入特性,从而实现多激光束点云的超分辨率重建。
链接: https://arxiv.org/abs/2505.10601
作者: Chuang Chen,Wenyi Ge
机构: Chengdu University of Information Technology(成都信息工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:In recent years, range-view-based LiDAR point cloud super-resolution techniques attract significant attention as a low-cost method for generating higher-resolution point cloud data. However, due to the sparsity and irregular structure of LiDAR point clouds, the point cloud super-resolution problem remains a challenging topic, especially for point cloud upsampling under novel views. In this paper, we propose SRMamba, a novel method for super-resolution of LiDAR point clouds in sparse scenes, addressing the key challenge of recovering the 3D spatial structure of point clouds from novel views. Specifically, we implement projection technique based on Hough Voting and Hole Compensation strategy to eliminate horizontally linear holes in range image. To improve the establishment of long-distance dependencies and to focus on potential geometric features in vertical 3D space, we employ Visual State Space model and Multi-Directional Scanning mechanism to mitigate the loss of 3D spatial structural information due to the range image. Additionally, an asymmetric U-Net network adapts to the input characteristics of LiDARs with different beam counts, enabling super-resolution reconstruction for multi-beam point clouds. We conduct a series of experiments on multiple challenging public LiDAR datasets (SemanticKITTI and nuScenes), and SRMamba demonstrates significant superiority over other algorithms in both qualitative and quantitative evaluations.
zh
[CV-92] ARFC-WAHNet: Adaptive Receptive Field Convolution and Wavelet-Attentive Hierarchical Network for Infrared Small Target Detection
【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, ISTD)中因红外图像纹理和结构信息有限而导致的检测精度不足问题,以及现有深度学习方法在复杂场景和多样化目标下的适应性差与特征丢失问题。其解决方案的关键在于提出一种自适应感受野卷积与小波注意力分层网络(ARFC-WAHNet),该网络通过多感受野特征交互卷积(MRFFIConv)模块实现判别特征的自适应提取,利用小波频率增强下采样(WFED)模块提升目标特征并抑制背景噪声,结合高低级特征融合(HLFF)模块和全局中值增强注意力(GMEA)模块,以增强特征的多样性和表达能力。WFED 所依赖的 Haar 小波下采样的基本原理见本条目末尾的示意代码。
链接: https://arxiv.org/abs/2505.10595
作者: Xingye Cui,Junhai Luo,Jiakun Deng,Kexuan Li,Xiangyu Qiu,Zhenming Peng
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Infrared small target detection (ISTD) is critical in both civilian and military applications. However, the limited texture and structural information in infrared images makes accurate detection particularly challenging. Although recent deep learning-based methods have improved performance, their use of conventional convolution kernels limits adaptability to complex scenes and diverse targets. Moreover, pooling operations often cause feature loss and insufficient exploitation of image information. To address these issues, we propose an adaptive receptive field convolution and wavelet-attentive hierarchical network for infrared small target detection (ARFC-WAHNet). This network incorporates a multi-receptive field feature interaction convolution (MRFFIConv) module to adaptively extract discriminative features by integrating multiple convolutional branches with a gated unit. A wavelet frequency enhancement downsampling (WFED) module leverages Haar wavelet transform and frequency-domain reconstruction to enhance target features and suppress background noise. Additionally, we introduce a high-low feature fusion (HLFF) module for integrating low-level details with high-level semantics, and a global median enhancement attention (GMEA) module to improve feature diversity and expressiveness via global attention. Experiments on public datasets SIRST, NUDT-SIRST, and IRSTD-1k demonstrate that ARFC-WAHNet outperforms recent state-of-the-art methods in both detection accuracy and robustness, particularly under complex backgrounds. The code is available at this https URL.
zh
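WFED 模块借助 Haar 小波变换实现不丢失信息的下采样:一次分解把特征图变为 LL/LH/HL/HH 四个半分辨率子带。以下用固定卷积核实现单级 2D Haar 分解,仅示意这一下采样方式,并非论文模块的完整实现。

```python
import torch
import torch.nn.functional as F

def haar_downsample(x):
    """单级 2D Haar 小波分解: (B, C, H, W) -> (B, 4C, H/2, W/2)。
    四个输出子带依次为 LL(低频)与 LH/HL/HH(高频细节)。"""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)   # (4, 1, 2, 2)
    b, c, h, w = x.shape
    x = x.reshape(b * c, 1, h, w)
    out = F.conv2d(x, kernels.to(x), stride=2)             # 逐通道分解
    return out.reshape(b, c * 4, h // 2, w // 2)

feat = torch.randn(1, 16, 32, 32)
print(haar_downsample(feat).shape)   # torch.Size([1, 64, 16, 16])
```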
[CV-93] Super-Resolution Generative Adversarial Networks based Video Enhancement
【速读】:该论文试图解决视频超分辨率(Video Super-Resolution)中由于传统单图像超分辨率(Single-Image Super-Resolution, SISR)方法无法处理时序连续性而导致的动态内容质量下降问题。解决方案的关键在于将标准的生成对抗网络(Generative Adversarial Network, GAN)结构扩展为能够处理时空数据的框架,引入了3D非局部块(3D Non-Local Blocks)以捕捉空间与时间维度上的关联性,并通过基于块的学习和先进的数据退化技术构建实验训练流程,从而提升模型在不同视频内容下的泛化能力和稳定性。
链接: https://arxiv.org/abs/2505.10589
作者: Kağan ÇETİN
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
Abstract:This study introduces an enhanced approach to video super-resolution by extending the ordinary Single-Image Super-Resolution (SISR) Super-Resolution Generative Adversarial Network (SRGAN) structure to handle spatio-temporal data. While SRGAN has proven effective for single-image enhancement, its design does not account for the temporal continuity required in video processing. To address this, a modified framework that incorporates 3D Non-Local Blocks is proposed, enabling the model to capture relationships across both spatial and temporal dimensions. An experimental training pipeline is developed, based on patch-wise learning and advanced data degradation techniques, to simulate real-world video conditions and learn from both local and global structures and details. This helps the model generalize better and maintain stability across varying video content, preserving the general structure in addition to pixel-wise correctness. Two model variants, one larger and one more lightweight, are presented to explore the trade-offs between performance and efficiency. The results demonstrate improved temporal coherence, sharper textures, and fewer visual artifacts compared to traditional single-image methods. This work contributes to the development of practical, learning-based solutions for video enhancement tasks, with potential applications in streaming, gaming, and digital restoration.
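The 3D Non-Local Block can be sketched compactly. The version below follows the standard non-local formulation (Wang et al., 2018) lifted to 5D video tensors; where exactly it sits inside the SRGAN-style generator is an assumption, and the quadratic attention cost means it is typically applied on small feature maps.

```python
import torch
import torch.nn as nn

class NonLocalBlock3D(nn.Module):
    """Standard non-local block over (B, C, T, H, W) video features: every
    spatio-temporal position attends to every other one, then a residual add."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = max(channels // reduction, 1)
        self.theta = nn.Conv3d(channels, inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter, kernel_size=1)
        self.g = nn.Conv3d(channels, inter, kernel_size=1)
        self.out = nn.Conv3d(inter, channels, kernel_size=1)

    def forward(self, x):
        b, c, t, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, THW, C')
        k = self.phi(x).flatten(2)                     # (B, C', THW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, THW, C')
        attn = torch.softmax(q @ k, dim=-1)            # pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, t, h, w)
        return x + self.out(y)                         # residual connection
```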
zh
[CV-94] Efficient Malicious UAV Detection Using Autoencoder-TSMamba Integration DATE WWW
【Quick Read】: This paper addresses the security threats posed by malicious unmanned aerial vehicles (UAVs) in next-generation networks (NGNs), such as unauthorized surveillance, data theft, and the delivery of hazardous materials. The key to the solution is an integrated autoencoder (AE)-classifier system, in which the AE is built on a four-layer Tri-orientated Spatial Mamba (TSMamba) architecture that effectively captures the complex spatial relationships needed to identify malicious UAV behavior. A ResNet-based classifier then processes the residual values produced by the AE, achieving lower complexity and higher classification accuracy.
链接: https://arxiv.org/abs/2505.10585
作者: Azim Akhtarshenas,Ramin Toosi,David López-Pérez,Tohid Alizadeh,Alireza Hosseini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: 12 pages, 6 figures and 3 tables, accepted in IbPRIA 2025, this https URL
Abstract:Malicious Unmanned Aerial Vehicles (UAVs) present a significant threat to next-generation networks (NGNs), posing risks such as unauthorized surveillance, data theft, and the delivery of hazardous materials. This paper proposes an integrated autoencoder (AE)-classifier system to detect malicious UAVs. The proposed AE, based on a 4-layer Tri-orientated Spatial Mamba (TSMamba) architecture, effectively captures complex spatial relationships crucial for identifying malicious UAV activities. The first phase involves generating residual values through the AE, which are subsequently processed by a ResNet-based classifier. This classifier leverages the residual values to achieve lower complexity and higher accuracy. Our experiments demonstrate significant improvements in both binary and multi-class classification scenarios, achieving up to 99.8% recall compared to 96.7% in the benchmark. Additionally, our method reduces computational complexity, making it more suitable for large-scale deployment. These results highlight the robustness and scalability of our approach, offering an effective solution for malicious UAV detection in NGN environments.
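The two-phase detection idea reduces to a few lines: reconstruct the input with the autoencoder and hand the reconstruction residual to the classifier. The sketch below uses a generic autoencoder as a stand-in, since the TSMamba architecture itself is not reproduced here.

```python
import torch

def residual_features(autoencoder, x):
    """Phase 1 of the pipeline: residuals between the input and its AE
    reconstruction, which grow large for behavior the AE has not modeled."""
    with torch.no_grad():
        recon = autoencoder(x)
    return x - recon

# Phase 2 (illustrative): logits = resnet_classifier(residual_features(ae, batch))
```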
zh
[CV-95] Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios
【Quick Read】: This paper addresses the challenges industry-level video generation systems face in high-fidelity, multi-aspect-ratio, and long-duration video synthesis, particularly efficient deployment and optimization on large compute clusters. The key to the solution is Aquarius, a complete framework comprising a distributed graph and video data processing pipeline, model architectures for different scales, high-performance training infrastructure, multi-xPU parallel inference acceleration, and applications for a range of marketing scenarios; efficient engineering architecture and algorithmic innovation are the core factors behind the performance of this large-scale video generation system.
链接: https://arxiv.org/abs/2505.10584
作者: Huafeng Shi,Jianzhong Liang,Rongchang Xie,Xian Wu,Cheng Chen,Chang Liu
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This report introduces Aquarius, a family of industry-level video generation models for marketing scenarios designed for thousands-xPU clusters and models with hundreds of billions of parameters. Leveraging efficient engineering architecture and algorithmic innovation, Aquarius demonstrates exceptional performance in high-fidelity, multi-aspect-ratio, and long-duration video synthesis. By disclosing the framework's design details, we aim to demystify industrial-scale video generation systems and catalyze advancements in the generative video community. The Aquarius framework consists of five components:
- Distributed graph and video data processing pipeline: manages tens of thousands of CPUs and thousands of xPUs via automated task distribution, enabling efficient video data processing. Additionally, we are about to open-source the entire data processing framework, named "Aquarius-Datapipe".
- Model architectures for different scales: include a Single-DiT architecture for 2B models and a Multimodal-DiT architecture for 13.4B models, supporting multi-aspect-ratio, multi-resolution, and multi-duration video generation.
- High-performance infrastructure designed for video generation model training: incorporating hybrid parallelism and fine-grained memory optimization strategies, this infrastructure achieves 36% MFU at large scale.
- Multi-xPU parallel inference acceleration: utilizes a diffusion cache and attention optimization to achieve a 2.35x inference speedup.
- Multiple marketing-scenario applications: including image-to-video, text-to-video (avatar), video inpainting, and video personalization, among others.
More downstream applications and multi-dimensional evaluation metrics will be added in the upcoming version updates.
zh
[CV-96] Bias and Generalizability of Foundation Models across Datasets in Breast Mammography MICCAI
【Quick Read】: This paper examines the fairness and bias of foundation models (FMs) in breast mammography classification, where data variability and inherent biases limit model performance. The key to the solution lies in combining multi-source dataset aggregation with fairness-aware techniques to improve generalizability and fairness: although aggregating datasets improves overall performance, it alone cannot fully eliminate bias, whereas fairness-aware techniques yield more stable and equitable performance across subgroups.
链接: https://arxiv.org/abs/2505.10579
作者: Germani Elodie,Selin Türk Ilayda,Zeineddine Fatima,Mourad Charbel,Albarqouni Shadi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2025
Abstract:Over the past decades, computer-aided diagnosis tools for breast cancer have been developed to enhance screening procedures, yet their clinical adoption remains challenged by data variability and inherent biases. Although foundation models (FMs) have recently demonstrated impressive generalizability and transfer learning capabilities by leveraging vast and diverse datasets, their performance can be undermined by spurious correlations that arise from variations in image quality, labeling uncertainty, and sensitive patient attributes. In this work, we explore the fairness and bias of FMs for breast mammography classification by leveraging a large pool of datasets from diverse sources, including data from underrepresented regions and an in-house dataset. Our extensive experiments show that while modality-specific pre-training of FMs enhances performance, classifiers trained on features from individual datasets fail to generalize across domains. Aggregating datasets improves overall performance, yet does not fully mitigate biases, leading to significant disparities across under-represented subgroups such as extreme breast densities and age groups. Furthermore, while domain-adaptation strategies can reduce these disparities, they often incur a performance trade-off. In contrast, fairness-aware techniques yield more stable and equitable performance across subgroups. These findings underscore the necessity of incorporating rigorous fairness evaluations and mitigation strategies into FM-based models to foster inclusive and generalizable AI.
zh
[CV-97] Robust Emotion Recognition via Bi-Level Self-Supervised Continual Learning
【Quick Read】: This paper addresses the cross-subject variability and noisy labels that hinder emotion recognition from physiological signals such as electroencephalography (EEG). Existing domain adaptation and continual learning methods fall short on these challenges, especially in realistic settings where data arrives as a continuous, unlabeled stream. The proposed solution is SSOCL, a novel bi-level self-supervised continual learning framework built on a dynamic memory buffer; its key idea is to iteratively refine the dynamic buffer and pseudo-label assignments so that representative samples are retained, enabling generalization from continuous, unlabeled physiological data streams for emotion recognition. A fast adaptation module and a cluster-mapping module further provide robust learning and effective handling of evolving data streams.
链接: https://arxiv.org/abs/2505.10575
作者: Adnan Ahmad,Bahareh Nakisa,Mohammad Naim Rastgoo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Emotion recognition through physiological signals such as electroencephalogram (EEG) has become an essential aspect of affective computing and provides an objective way to capture human emotions. However, physiological data characterized by cross-subject variability and noisy labels hinder the performance of emotion recognition models. Existing domain adaptation and continual learning methods struggle to address these issues, especially under realistic conditions where data is continuously streamed and unlabeled. To overcome these limitations, we propose a novel bi-level self-supervised continual learning framework, SSOCL, based on a dynamic memory buffer. This bi-level architecture iteratively refines the dynamic buffer and pseudo-label assignments to effectively retain representative samples, enabling generalization from continuous, unlabeled physiological data streams for emotion recognition. The assigned pseudo-labels are subsequently leveraged for accurate emotion prediction. Key components of the framework, including a fast adaptation module and a cluster-mapping module, enable robust learning and effective handling of evolving data streams. Experimental validation on two mainstream EEG tasks demonstrates the framework’s ability to adapt to continuous data streams while maintaining strong generalization across subjects, outperforming existing approaches.
zh
[CV-98] GarmentPile: Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation
【Quick Read】: This paper tackles the challenges of cluttered garments manipulation, which stem from the complex, deformable nature of garments and the intricate relations among them. Unlike single-garment manipulation, cluttered scenes require handling complex garment entanglements and interactions while preserving garment cleanliness and manipulation stability. The key to the solution is learning point-level affordance, a dense representation that models the complex spatial and multi-modal manipulation candidates while being aware of garment geometry, structure, and inter-object relations. In addition, since garments in extremely entangled clutters are hard to retrieve directly, an adaptation module guided by the learned affordance is introduced to reorganize highly entangled garments into states suitable for manipulation.
链接: https://arxiv.org/abs/2503.09243
作者: Ruihai Wu,Ziyu Zhu,Yuran Wang,Yue Chen,Jiarui Wang,Hao Dong
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cluttered garments manipulation poses significant challenges due to the complex, deformable nature of garments and intricate garment relations. Unlike single-garment manipulation, cluttered scenarios require managing complex garment entanglements and interactions, while maintaining garment cleanliness and manipulation stability. To address these demands, we propose to learn point-level affordance, the dense representation modeling the complex space and multi-modal manipulation candidates, while being aware of garment geometry, structure, and inter-object relations. Additionally, as it is difficult to directly retrieve a garment in some extremely entangled clutters, we introduce an adaptation module, guided by learned affordance, to reorganize highly-entangled garments into states plausible for manipulation. Our framework demonstrates effectiveness over environments featuring diverse garment types and pile configurations in both simulation and the real world. Project page: this https URL.
zh
[CV-99] From Fibers to Cells: Fourier-Based Registration Enables Virtual Cresyl Violet Staining From 3D Polarized Light Imaging
【Quick Read】: This paper addresses the problem of precisely aligning and relating cell bodies (cytoarchitecture) and nerve fibers (myeloarchitecture) within the same sections for a comprehensive assessment of brain microstructure. Because staining inevitably introduces distortions, a nonlinear, cross-modal registration is required to study the detailed relationships between cells and fibers, yet the complexity of post-staining processing limits the number of samples. The key to the solution is using deep learning for image-to-image translation to generate virtual staining that is spatially aligned with 3D-PLI at the cellular level, and applying Fourier-based registration to ensure high correspondence between target and predicted staining in the training data, thereby enabling accurate prediction of Cresyl violet staining.
链接: https://arxiv.org/abs/2505.11394
作者: Alexander Oberstrass,Esteban Vaca,Eric Upschulte,Meiqi Niu,Nicola Palomero-Gallagher,David Graessel,Christian Schiffer,Markus Axer,Katrin Amunts,Timo Dickscheid
机构: Institute of Neuroscience and Medicine (INM-1), Research Centre Jülich, Germany; Helmholtz AI, Research Centre Jülich, Germany; Cécile & Oskar Vogt Institute of Brain Research, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Germany; Department of Physics, University of Wuppertal, Germany; Institute of Computer Science, Faculty of Mathematics and Natural Sciences, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Comprehensive assessment of the various aspects of the brain’s microstructure requires the use of complementary imaging techniques. This includes measuring the spatial distribution of cell bodies (cytoarchitecture) and nerve fibers (myeloarchitecture). The gold standard for cytoarchitectonic analysis is light microscopic imaging of cell-body stained tissue sections. To reveal the 3D orientations of nerve fibers, 3D Polarized Light Imaging (3D-PLI) has been introduced as a reliable technique providing a resolution in the micrometer range while allowing processing of series of complete brain sections. 3D-PLI acquisition is label-free and allows subsequent staining of sections after measurement. By post-staining for cell bodies, a direct link between fiber- and cytoarchitecture can potentially be established within the same section. However, inevitable distortions introduced during the staining process make a nonlinear and cross-modal registration necessary in order to study the detailed relationships between cells and fibers in the images. In addition, the complexity of processing histological sections for post-staining only allows for a limited number of samples. In this work, we take advantage of deep learning methods for image-to-image translation to generate a virtual staining of 3D-PLI that is spatially aligned at the cellular level. In a supervised setting, we build on a unique dataset of brain sections, to which Cresyl violet staining has been applied after 3D-PLI measurement. To ensure high correspondence between both modalities, we address the misalignment of training data using Fourier-based registration methods. In this way, registration can be efficiently calculated during training for local image patches of target and predicted staining. We demonstrate that the proposed method enables prediction of a Cresyl violet staining from 3D-PLI, matching individual cell instances.
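The Fourier-based registration computed on local training patches can be illustrated with classic phase correlation, which recovers the translation between two patches from the phase of their cross-power spectrum. This sketch handles the translation-only case with integer-pixel precision and is not the authors' full registration pipeline.

```python
import numpy as np

def phase_correlation_shift(a, b):
    """Estimate the (dy, dx) translation between two equally sized grayscale
    patches via Fourier-domain phase correlation."""
    fa, fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = fa * np.conj(fb)
    cross /= np.maximum(np.abs(cross), 1e-12)   # keep phase information only
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    if dy > a.shape[0] // 2:                    # wrap indices to signed shifts
        dy -= a.shape[0]
    if dx > a.shape[1] // 2:
        dx -= a.shape[1]
    return dy, dx
```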
zh
[CV-100] A Fourier Space Perspective on Diffusion Models
【Quick Read】: This paper investigates the degraded generation quality of high-frequency components in diffusion models, which stems from the standard additive-white-noise forward process: the signal-to-noise ratio (SNR) of high-frequency components deteriorates faster, violating the normality assumption in the reverse process. The key to the solution is studying an alternative forward process in Fourier space that corrupts all frequencies at the same rate, removing the typical frequency hierarchy during generation and improving performance on datasets where high frequencies are primary.
链接: https://arxiv.org/abs/2505.11278
作者: Fabian Falck,Teodora Pandeva,Kiarash Zahirnia,Rachel Lawrence,Richard Turner,Edward Meeds,Javier Zazo,Sushrut Karmalkar
机构: 未知
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Methodology (stat.ME)
备注:
Abstract:Diffusion models are state-of-the-art generative models on data modalities such as images, audio, proteins and materials. These modalities share the property of exponentially decaying variance and magnitude in the Fourier domain. Under the standard Denoising Diffusion Probabilistic Models (DDPM) forward process of additive white noise, this property results in high-frequency components being corrupted faster and earlier in terms of their Signal-to-Noise Ratio (SNR) than low-frequency ones. The reverse process then generates low-frequency information before high-frequency details. In this work, we study the inductive bias of the forward process of diffusion models in Fourier space. We theoretically analyse and empirically demonstrate that the faster noising of high-frequency components in DDPM results in violations of the normality assumption in the reverse process. Our experiments show that this leads to degraded generation quality of high-frequency components. We then study an alternate forward process in Fourier space which corrupts all frequencies at the same rate, removing the typical frequency hierarchy during generation, and demonstrate marked performance improvements on datasets where high frequencies are primary, while performing on par with DDPM on standard imaging benchmarks.
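The asymmetry the paper analyzes follows directly from the DDPM forward process: additive white Gaussian noise is white in the Fourier domain, so every frequency obeys the same closed form, while natural-image spectra decay with frequency. In standard DDPM notation:

```latex
% DDPM forward process; the DFT of x_t inherits the same closed form per
% coefficient because white Gaussian noise stays white under the Fourier transform.
\[
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I),
\qquad
\hat{x}_t(f) = \sqrt{\bar{\alpha}_t}\, \hat{x}_0(f) + \sqrt{1-\bar{\alpha}_t}\,\hat{\epsilon}(f).
\]
% Per-frequency signal-to-noise ratio: since |\hat{x}_0(f)|^2 decays with |f|
% for natural images, high frequencies reach low SNR at earlier timesteps t.
\[
\mathrm{SNR}_t(f) = \frac{\bar{\alpha}_t\,\lvert\hat{x}_0(f)\rvert^2}{1-\bar{\alpha}_t}.
\]
```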
zh
[CV-101] Diffusion Model in Hyperspectral Image Processing and Analysis: A Review
【Quick Read】: This paper surveys how to address the high dimensionality, data redundancy, and noise interference in hyperspectral image processing and analysis, which pose great challenges to traditional models and make it difficult to meet growing analytical demands. The key to the solution is the Diffusion Model, which simulates the diffusion process of data over time to effectively handle high-dimensional data, generate high-quality samples, and excel at denoising and data augmentation, thereby substantially improving the accuracy and efficiency of hyperspectral image analysis.
链接: https://arxiv.org/abs/2505.11158
作者: Xing Hu,Xiangcheng Liu,Qianqian Duan,Danfeng Hong,Dawei Zhang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages,20 figures
Abstract:Hyperspectral image processing and analysis has important application value in remote sensing, agriculture, and environmental monitoring, but its high dimensionality, data redundancy, and noise interference pose great challenges to the analysis. Traditional models have limitations in dealing with these complex data and struggle to meet the increasing demand for analysis. In recent years, the Diffusion Model, as an emerging generative model, has shown unique advantages in hyperspectral image processing. By simulating the diffusion process of data over time, the Diffusion Model can effectively process high-dimensional data, generate high-quality samples, and perform well in denoising and data enhancement. In this paper, we review the recent research advances in diffusion modeling for hyperspectral image processing and analysis, and discuss its applications in tasks such as high-dimensional data processing, noise removal, classification, and anomaly detection. The performance of diffusion-based models on image processing is compared and the remaining challenges are summarized. It is shown that the diffusion model can significantly improve the accuracy and efficiency of hyperspectral image analysis, providing a new direction for future research.
zh
[CV-102] Generative Models in Computational Pathology: A Comprehensive Survey on Methods Applications and Challenges
【Quick Read】: This paper surveys the applications and challenges of generative models in computational pathology, focusing on data-efficient learning, synthetic data augmentation, and multimodal representation. The key to the solution is a systematic review that traces the evolution of generative architectures from early generative adversarial networks to recent diffusion models and foundation models with generative capabilities, covering advances in image generation, text generation, multimodal image-text generation, and other generative applications such as spatial simulation and molecular inference. The paper also analyzes commonly used datasets and evaluation protocols, points out current limitations in generating high-fidelity whole slide images, clinical interpretability, and the ethical and legal concerns around synthetic data, and charts research directions toward unified, multimodal, and clinically deployable generative systems.
链接: https://arxiv.org/abs/2505.10993
作者: Yuan Zhang,Xinfeng Zhang,Xiaoming Qi,Xinyu Wu,Feng Chen,Guanyu Yang,Huazhu Fu
机构: Southeast University (东南大学); Tsinghua University (清华大学); National University of Singapore (新加坡国立大学); School of Biomedical Engineering (生物医学工程学院); Department of Biomedical Engineering (生物医学工程系); Department of Electrical and Computer Engineering (电气与计算机工程系); School of Information Science and Engineering (信息科学与工程学院); Department of Biostatistics (生物统计学系); Center for Global Health (全球健康中心); School of Public Health (公共卫生学院); Nanjing Medical University (南京医科大学); Institute of High-Performance Computing (高性能计算研究所); Agency for Science, Technology and Research (科技研究局)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages,9 figures
Abstract:Generative modeling has emerged as a promising direction in computational pathology, offering capabilities such as data-efficient learning, synthetic data augmentation, and multimodal representation across diverse diagnostic tasks. This review provides a comprehensive synthesis of recent progress in the field, organized into four key domains: image generation, text generation, multimodal image-text generation, and other generative applications, including spatial simulation and molecular inference. By analyzing over 150 representative studies, we trace the evolution of generative architectures from early generative adversarial networks to recent advances in diffusion models and foundation models with generative capabilities. We further examine the datasets and evaluation protocols commonly used in this domain and highlight ongoing limitations, including challenges in generating high-fidelity whole slide images, clinical interpretability, and concerns related to the ethical and legal implications of synthetic data. The review concludes with a discussion of open challenges and prospective research directions, with an emphasis on developing unified, multimodal, and clinically deployable generative systems. This work aims to provide a foundational reference for researchers and practitioners developing and applying generative models in computational pathology.
zh
[CV-103] Pretrained hybrid transformer for generalizable cardiac substructures segmentation from contrast and non-contrast CTs in lung and breast cancers
【Quick Read】: This paper addresses the performance degradation of AI-automated cardiac substructure segmentation in radiation treatment planning (RTP) when clinical cases differ in characteristics from the training dataset. The key to the solution is refining a pretrained transformer into a hybrid transformer convolutional network (HTN) and fine-tuning it on data balanced across imaging contrasts (contrast-enhanced and non-contrast CT) and patient scan positions, thereby improving the model's generalizability and robustness.
链接: https://arxiv.org/abs/2505.10855
作者: Aneesh Rangnekar,Nikhil Mankuzhy,Jonas Willmann,Chloe Choi,Abraham Wu,Maria Thor,Andreas Rimner,Harini Veeraraghavan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:AI automated segmentations for radiation treatment planning (RTP) can deteriorate when applied to clinical cases with different characteristics than the training dataset. Hence, we refined a pretrained transformer into a hybrid transformer convolutional network (HTN) to segment cardiac substructures in lung and breast cancer patients imaged with varying contrasts and patient scan positions. Cohort I, consisting of 56 contrast-enhanced (CECT) and 124 non-contrast CT (NCCT) scans from patients with non-small cell lung cancers acquired in supine position, was used to create an oracle model with all 180 training cases and a balanced (CECT: 32, NCCT: 32 training) HTN model. Models were evaluated on a held-out validation set of 60 cohort I patients and 66 patients with breast cancer from cohort II acquired in supine (n=45) and prone (n=21) positions. Accuracy was measured using DSC, HD95, and dose metrics. The publicly available TotalSegmentator served as the benchmark. The oracle and balanced models were similarly accurate (DSC Cohort I: 0.80 ± 0.10 versus 0.81 ± 0.10; Cohort II: 0.77 ± 0.13 versus 0.80 ± 0.12), outperforming TotalSegmentator. The balanced model, using half the training cases of the oracle, produced dose metrics similar to manual delineations for all cardiac substructures. This model was robust to CT contrast in 6 out of 8 substructures and to patient scan position variations in 5 out of 8 substructures, and showed low correlations of accuracy with patient size and age. The HTN demonstrated robustly accurate (geometric and dose metrics) cardiac substructure segmentation from CTs with varying imaging and patient characteristics, one key requirement for clinical use. Moreover, the model combining pretraining with a balanced distribution of NCCT and CECT scans provided reliably accurate segmentations under varied conditions with far fewer labeled datasets compared to the oracle model.
zh
[CV-104] Adaptive Spatial Transcriptomics Interpolation via Cross-modal Cross-slice Modeling MICCAI2025
【Quick Read】: This paper addresses the limited feasibility of generating multi-slice spatial transcriptomics (ST) data caused by missing intermediate tissue sections and high costs, proposing C2-STi to interpolate missing slices at arbitrary intermediate positions between adjacent ST slices. The key to the solution lies in three core modules: 1) a distance-aware local structural modulation module that adaptively captures cross-slice deformations and strengthens positional correlations between ST slices; 2) a pyramid gene co-expression correlation module that captures multi-scale biological associations among genes; and 3) a cross-modal alignment module that integrates ST-paired hematoxylin and eosin (H&E)-stained images to filter and align essential cellular features across ST and H&E images.
链接: https://arxiv.org/abs/2505.10729
作者: NingFeng Que,Xiaofei Wang,Jingjing Chen,Yixuan Jiang,Chao Li
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: Early accepted by MICCAI 2025
Abstract:Spatial transcriptomics (ST) is a promising technique that characterizes the spatial gene profiling patterns within the tissue context. Comprehensive ST analysis depends on consecutive slices for 3D spatial insights, whereas missing intermediate tissue sections and high costs limit the practical feasibility of generating multi-slice ST. In this paper, we propose C2-STi, the first attempt to interpolate missing ST slices at arbitrary intermediate positions between adjacent ST slices. Although intuitive, effective ST interpolation presents significant challenges, including 1) limited continuity across heterogeneous tissue sections, 2) complex intrinsic correlation across genes, and 3) intricate cellular structures and biological semantics within each tissue section. To mitigate these challenges, in C2-STi, we design 1) a distance-aware local structural modulation module to adaptively capture cross-slice deformations and enhance positional correlations between ST slices, 2) a pyramid gene co-expression correlation module to capture multi-scale biological associations among genes, and 3) a cross-modal alignment module that integrates the ST-paired hematoxylin and eosin (H&E)-stained images to filter and align the essential cellular features across ST and H&E images. Extensive experiments on the public dataset demonstrate our superiority over state-of-the-art approaches on both single-slice and multi-slice ST interpolation. Codes are available at this https URL.
zh
[CV-105] Predicting Risk of Pulmonary Fibrosis Formation in PASC Patients
【Quick Read】: This paper targets early detection and risk assessment of pulmonary fibrosis related to Post-Acute Sequelae of COVID-19 (PASC), the scarring lung damage that can develop after long-term infection. The key to the solution is a multi-center chest CT analysis framework that combines deep learning with radiomics, using convolutional neural networks (CNNs) and interpretable feature extraction to predict fibrosis effectively, achieving 82.2% accuracy and an AUC of 85.5%, while Grad-CAM visualization and radiomics-based feature analysis provide clinically relevant insights.
链接: https://arxiv.org/abs/2505.10691
作者: Wanying Dou,Gorkem Durak,Koushik Biswas,Ziliang Hong,Andrea Mia Bejar,Elif Keles,Kaan Akin,Sukru Mehmet Erturk,Alpay Medetalibeyoglu,Marc Sala,Alexander Misharin,Hatice Savas,Mary Salvatore,Sachin Jambawalikar,Drew Torigian,Jayaram K. Udupa,Ulas Bagci
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While the acute phase of the COVID-19 pandemic has subsided, its long-term effects persist through Post-Acute Sequelae of COVID-19 (PASC), commonly known as Long COVID. There remains substantial uncertainty regarding both its duration and optimal management strategies. PASC manifests as a diverse array of persistent or newly emerging symptoms–ranging from fatigue, dyspnea, and neurologic impairments (e.g., brain fog), to cardiovascular, pulmonary, and musculoskeletal abnormalities–that extend beyond the acute infection phase. This heterogeneous presentation poses substantial challenges for clinical assessment, diagnosis, and treatment planning. In this paper, we focus on imaging findings that may suggest fibrotic damage in the lungs, a critical manifestation characterized by scarring of lung tissue, which can potentially affect long-term respiratory function in patients with PASC. This study introduces a novel multi-center chest CT analysis framework that combines deep learning and radiomics for fibrosis prediction. Our approach leverages convolutional neural networks (CNNs) and interpretable feature extraction, achieving 82.2% accuracy and 85.5% AUC in classification tasks. We demonstrate the effectiveness of Grad-CAM visualization and radiomics-based feature analysis in providing clinically relevant insights for PASC-related lung fibrosis prediction. Our findings highlight the potential of deep learning-driven computational methods for early detection and risk assessment of PASC-related lung fibrosis–presented for the first time in the literature.
zh
[CV-106] ROIsGAN: A Region Guided Generative Adversarial Framework for Murine Hippocampal Subregion Segmentation
【Quick Read】: This paper aims to automate the segmentation of key hippocampal subregions (the dentate gyrus DG, CA1, and CA3) from histological tissue images, a task of great importance for research on neurodegenerative and psychiatric disorders for which no existing method handles immunohistochemistry (IHC) images. The key to the solution is ROIsGAN, a region-guided U-Net-based generative adversarial network whose region-guided discriminator loss combines Dice and binary cross-entropy losses to sharpen boundary delineation and structural detail, yielding gains of 1-10% in Dice score and up to 11% in IoU across the DG, CA1, and CA3 subregions.
链接: https://arxiv.org/abs/2505.10687
作者: Sayed Mehedi Azim,Brian Corbett,Iman Dehzangi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:
Abstract:The hippocampus, a critical brain structure involved in memory processing and various neurodegenerative and psychiatric disorders, comprises three key subregions: the dentate gyrus (DG), Cornu Ammonis 1 (CA1), and Cornu Ammonis 3 (CA3). Accurate segmentation of these subregions from histological tissue images is essential for advancing our understanding of disease mechanisms, developmental dynamics, and therapeutic interventions. However, no existing methods address the automated segmentation of hippocampal subregions from tissue images, particularly from immunohistochemistry (IHC) images. To bridge this gap, we introduce a novel set of four comprehensive murine hippocampal IHC datasets featuring distinct staining modalities: cFos, NeuN, and multiplexed stains combining cFos, NeuN, and either ΔFosB or GAD67, capturing structural, neuronal activity, and plasticity associated information. Additionally, we propose ROIsGAN, a region-guided U-Net-based generative adversarial network tailored for hippocampal subregion segmentation. By leveraging adversarial learning, ROIsGAN enhances boundary delineation and structural detail refinement through a novel region-guided discriminator loss combining Dice and binary cross-entropy loss. Evaluated across DG, CA1, and CA3 subregions, ROIsGAN consistently outperforms conventional segmentation models, achieving performance gains ranging from 1-10% in Dice score and up to 11% in Intersection over Union (IoU), particularly under challenging staining conditions. Our work establishes foundational datasets and methods for automated hippocampal segmentation, enabling scalable, high-precision analysis of tissue images in neuroscience research. Our generated datasets, proposed model as a standalone tool, and its corresponding source code are publicly available at: this https URL
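The region-guided discriminator loss mixes Dice and binary cross-entropy terms. A minimal sketch of such a combined loss follows; the equal weighting is an assumption, since the paper's exact mixing coefficient is not given in the abstract.

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits, target, w_dice=0.5, eps=1e-6):
    """Combined Dice + BCE over (B, C, H, W) logits and binary masks; the
    0.5/0.5 weighting is an illustrative assumption."""
    prob = torch.sigmoid(logits)
    dims = (1, 2, 3)
    inter = (prob * target).sum(dims)
    dice = 1 - (2 * inter + eps) / (prob.sum(dims) + target.sum(dims) + eps)
    bce = F.binary_cross_entropy_with_logits(
        logits, target, reduction="none"
    ).mean(dims)
    return (w_dice * dice + (1 - w_dice) * bce).mean()
```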
zh
[CV-107] MOSAIC: A Multi-View 2.5D Organ Slice Selector with Cross-Attentional Reasoning for Anatomically-Aware CT Localization in Medical Organ Segmentation
【Quick Read】: This paper addresses efficient and accurate multi-organ segmentation from abdominal CT: existing 3D methods are computationally and memory intensive and process entire volumes full of anatomically irrelevant slices, while 2D methods suffer from class imbalance and lack cross-view contextual awareness. The key to the solution is a novel anatomically-aware slice selection pipeline that reduces the input volume prior to segmentation, introducing a vision-language model (VLM) that performs cross-view organ presence detection over fused tri-slice (2.5D) representations from the axial, sagittal, and coronal planes, enabling slice selection with high structural relevance while ensuring spatial consistency and preserving contextual cues.
链接: https://arxiv.org/abs/2505.10672
作者: Hania Ghouse,Muzammil Behzad
机构: King Fahd University of Petroleum and Minerals (法赫德国王石油矿产大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Efficient and accurate multi-organ segmentation from abdominal CT volumes is a fundamental challenge in medical image analysis. Existing 3D segmentation approaches are computationally and memory intensive, often processing entire volumes that contain many anatomically irrelevant slices. Meanwhile, 2D methods suffer from class imbalance and lack cross-view contextual awareness. To address these limitations, we propose a novel, anatomically-aware slice selector pipeline that reduces input volume prior to segmentation. Our unified framework introduces a vision-language model (VLM) for cross-view organ presence detection using fused tri-slice (2.5D) representations from axial, sagittal, and coronal planes. Our proposed model acts as an “expert” in anatomical localization, reasoning over multi-view representations to selectively retain slices with high structural relevance. This enables spatially consistent filtering across orientations while preserving contextual cues. More importantly, since standard segmentation metrics such as Dice or IoU fail to measure the spatial precision of such slice selection, we introduce a novel metric, Slice Localization Concordance (SLC), which jointly captures anatomical coverage and spatial alignment with organ-centric reference slices. Unlike segmentation-specific metrics, SLC provides a model-agnostic evaluation of localization fidelity. Our model offers substantial improvement gains against several baselines across all organs, demonstrating both accurate and reliable organ-focused slice filtering. These results show that our method enables efficient and spatially consistent organ filtering, thereby significantly reducing downstream segmentation cost while maintaining high anatomical fidelity.
zh
[CV-108] ExploreGS: a vision-based low overhead framework for 3D scene reconstruction
【Quick Read】: This paper tackles efficient, high-quality 3D scene reconstruction for drones on resource-constrained devices; traditional methods rely on costly lidar to acquire point clouds, whereas this work proposes ExploreGS, a vision-based low-overhead framework. The key to the solution is replacing the traditional point cloud acquisition process with RGB images and combining a Bag-of-Words (BoW) model for real-time processing, so that 3D Gaussian Splatting (3DGS) training can be executed on board, balancing reconstruction quality and computational efficiency.
链接: https://arxiv.org/abs/2505.10578
作者: Yunji Feng,Chengpu Yu,Fengrui Ran,Zhi Yang,Yinni Liu
机构: Beijing Institute of Technology (北京理工大学); Chongqing innovation Center, Beijing Institute of Technology (重庆创新中心,北京理工大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper proposes a low-overhead, vision-based 3D scene reconstruction framework for drones, named ExploreGS. By using RGB images, ExploreGS replaces the traditional lidar-based point cloud acquisition process with a vision model, achieving high-quality reconstruction at a lower cost. The framework integrates scene exploration and model reconstruction, and leverages a Bag-of-Words (BoW) model to enable real-time processing capabilities; therefore, 3D Gaussian Splatting (3DGS) training can be executed on-board. Comprehensive experiments in both simulation and real-world environments demonstrate the efficiency and applicability of the ExploreGS framework on resource-constrained devices, while maintaining reconstruction quality comparable to state-of-the-art methods.
zh
[CV-109] GRNN:Recurrent Neural Network based on Ghost Features for Video Super-Resolution ICME2023
【Quick Read】: This paper aims to reduce the high computational cost that convolutional neural networks (CNNs) impose on video super-resolution (VSR) systems, in particular the problem of feature redundancy. The key to the solution is introducing "Ghost features" to reduce this redundancy and, informed by an analysis of the "gradient disappearance" phenomenon in conventional recurrent convolutional networks (RNNs), combining the Ghost module with an RNN to better model temporal information. By feeding the current frame into the model together with the next frame, the previous frame's output, and the hidden state, the method improves PSNR and SSIM on several benchmark datasets and better preserves texture details in videos.
链接: https://arxiv.org/abs/2505.10577
作者: Yutong Guo
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by 2023 IEEE International Conference on Multimedia and Expo (ICME 2023)
Abstract:Modern video super-resolution (VSR) systems based on convolutional neural networks (CNNs) require huge computational costs. The problem of feature redundancy is present in most models in many domains, but is rarely discussed in VSR. We experimentally observe that many features in VSR models are similar to each other, so we propose to use "Ghost features" to reduce this redundancy. We also analyze the so-called "gradient disappearance" phenomenon produced by the conventional recurrent convolutional network (RNN) model, and combine the Ghost module with an RNN to model the time series. The current frame is used as input to the model together with the next frame, the output of the previous frame, and the hidden state. Extensive experiments on several benchmark models and datasets show that the PSNR and SSIM of our proposed model are improved to some extent. Some texture details in the video are also better preserved.
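The Ghost module referenced above (from GhostNet) generates a small set of intrinsic feature maps with an ordinary convolution and derives the remaining "ghost" maps with a cheap depthwise convolution, exploiting exactly the feature redundancy the paper observes. A minimal sketch:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost module: a regular conv produces ceil(out_ch / ratio) intrinsic
    maps, and a depthwise conv derives the cheap 'ghost' maps from them."""
    def __init__(self, in_ch, out_ch, ratio=2, kernel_size=1, dw_size=3):
        super().__init__()
        init_ch = -(-out_ch // ratio)                  # ceiling division
        self.primary = nn.Conv2d(in_ch, init_ch, kernel_size,
                                 padding=kernel_size // 2)
        self.cheap = nn.Conv2d(init_ch, init_ch * (ratio - 1), dw_size,
                               padding=dw_size // 2, groups=init_ch)
        self.out_ch = out_ch

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)[:, :self.out_ch]
```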
zh
Artificial Intelligence
[AI-0] MOSAAIC: Managing Optimization towards Shared Autonomy Authority and Initiative in Co-creation
【Quick Read】: This paper addresses the open question of how to strike an appropriate balance between humans and co-creative AI in computational creativity. The key to the solution is MOSAAIC (Managing Optimization towards Shared Autonomy, Authority, and Initiative in Co-creation), a framework that characterizes and balances control in co-creation by identifying three core dimensions of control (autonomy, initiative, and authority) and pairing them with control optimization strategies for achieving a dynamic balance.
链接: https://arxiv.org/abs/2505.11481
作者: Alayt Issak,Jeba Rezwana,Casper Harteveld
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Striking the appropriate balance between humans and co-creative AI is an open research question in computational creativity. Co-creativity, a form of hybrid intelligence where both humans and AI take action proactively, is a process that leads to shared creative artifacts and ideas. Achieving a balanced dynamic in co-creativity requires characterizing control and identifying strategies to distribute control between humans and AI. We define control as the power to determine, initiate, and direct the process of co-creation. Informed by a systematic literature review of 172 full-length papers, we introduce MOSAAIC (Managing Optimization towards Shared Autonomy, Authority, and Initiative in Co-creation), a novel framework for characterizing and balancing control in co-creation. MOSAAIC identifies three key dimensions of control: autonomy, initiative, and authority. We supplement our framework with control optimization strategies in co-creation. To demonstrate MOSAAIC’s applicability, we analyze the distribution of control in six existing co-creative AI case studies and present the implications of using this framework.
zh
[AI-1] Automatic Reward Shaping from Confounded Offline Data ICML2025
【Quick Read】: This paper studies off-policy learning from biased data in complex, high-dimensional domains where unobserved confounding cannot be ruled out. The key to the solution is a deep reinforcement learning algorithm robust to confounding bias, which mitigates that bias by seeking a safe policy for the worst-case environment compatible with the observed data.
链接: https://arxiv.org/abs/2505.11478
作者: Mingxuan Li,Junzhe Zhang,Elias Bareinboim
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICML 2025
Abstract:A key task in Artificial Intelligence is learning effective policies for controlling agents in unknown environments to optimize performance measures. Off-policy learning methods, like Q-learning, allow learners to make optimal decisions based on past experiences. This paper studies off-policy learning from biased data in complex and high-dimensional domains where unobserved confounding cannot be ruled out a priori. Building on the well-celebrated Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm robust to confounding biases in observed data. Specifically, our algorithm attempts to find a safe policy for the worst-case environment compatible with the observations. We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed input to the behavioral and target policies mismatch and unobserved confounders exist.
zh
[AI-2] Extracting Explainable Dates From Medical Images By Reverse-Engineering UNIX Timestamps
【Quick Read】: This paper addresses the problem of accurately extracting date information from medical documents, especially identifying complex dates and date ranges when AI is used to transcribe text. The key to the solution is reverse-engineering UNIX timestamps and applying regular expression synthesis to automatically generate regular expressions that match dates precisely; compared with manually created regular expressions, this approach detects far fewer date-like false positives at the cost of only a slight increase in missed dates, improving the accuracy and reliability of date extraction.
链接: https://arxiv.org/abs/2505.11451
作者: Lee Harris,James Bentham,Philippe De Wilde
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Dates often contribute towards highly impactful medical decisions, but it is rarely clear how to extract this data. AI has only just begun to be used to transcribe such documents, and common methods are either to trust the output produced by a complex AI model or to parse the text using regular expressions. Recent work has established that regular expressions are an explainable form of logic, but it is difficult to decompose these into the component parts that are required to construct precise UNIX timestamps. First, we tested publicly available regular expressions and found that they were unable to capture a significant number of our dates. Next, we manually created easily decomposable regular expressions, and we found that these were able to detect the majority of real dates, but also many sequences of text that merely look like dates. Finally, we used regular expression synthesis to automatically identify regular expressions from the reverse-engineered UNIX timestamps that we created. We find that regular expressions created by regular expression synthesis detect far fewer date-like sequences of text than those that were manually created, at the cost of a slight increase in the number of missed dates. Overall, our results show that regular expression synthesis can produce regular expressions that identify complex dates and date ranges in text transcriptions. To our knowledge, our proposed way of learning deterministic logic by reverse-engineering several many-one mappings and feeding these into a regular expression synthesiser is a new approach.
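The appeal of easily decomposable regular expressions is that each named group maps directly onto one field of a UNIX timestamp. The pattern below is a hypothetical hand-written example in that style, not one of the paper's synthesized expressions; note how calendar validation rejects strings that merely look like dates, such as "31/02/2021".

```python
import re
from datetime import datetime, timezone

# Hypothetical decomposable pattern for dates such as "12/03/2021" (day first).
DATE_RE = re.compile(r"(?P<day>\d{1,2})[/-](?P<month>\d{1,2})[/-](?P<year>\d{4})")

def to_unix_timestamp(text):
    """Return the UNIX timestamp of the first date found in `text`, or None."""
    m = DATE_RE.search(text)
    if m is None:
        return None
    try:
        dt = datetime(int(m["year"]), int(m["month"]), int(m["day"]),
                      tzinfo=timezone.utc)
    except ValueError:          # e.g. "31/02/2021" matches but is not a real date
        return None
    return int(dt.timestamp())

print(to_unix_timestamp("Scan performed on 12/03/2021"))  # 1615507200
```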
zh
[AI-3] LLM s unlock new paths to monetizing exploits
【Quick Read】: This paper examines how large language models (LLMs) may transform the economics of cyberattacks, shifting adversaries from generic attacks aimed at broad targets toward highly personalized, per-user attacks. The key to the argument is demonstrating that state-of-the-art LLMs already make such targeted, low-cost attacks technically practical, for example by automatically identifying exploitable sensitive information and by tailoring ransom demands to the contents of each compromised device.
链接: https://arxiv.org/abs/2505.11449
作者: Nicholas Carlini,Milad Nasr,Edoardo Debenedetti,Barry Wang,Christopher A. Choquette-Choo,Daphne Ippolito,Florian Tramèr,Matthew Jagielski
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:We argue that Large language models (LLMs) will soon alter the economics of cyberattacks. Instead of attacking the most commonly used software and monetizing exploits by targeting the lowest common denominator among victims, LLMs enable adversaries to launch tailored attacks on a user-by-user basis. On the exploitation front, instead of human attackers manually searching for one difficult-to-identify bug in a product with millions of users, LLMs can find thousands of easy-to-identify bugs in products with thousands of users. And on the monetization front, instead of generic ransomware that always performs the same attack (encrypt all your data and request payment to decrypt), an LLM-driven ransomware attack could tailor the ransom demand based on the particular content of each exploited device. We show that these two attacks (and several others) are imminently practical using state-of-the-art LLMs. For example, we show that without any human intervention, an LLM finds highly sensitive personal information in the Enron email dataset (e.g., an executive having an affair with another employee) that could be used for blackmail. While some of our attacks are still too expensive to scale widely today, the incentives to implement these attacks will only increase as LLMs get cheaper. Thus, we argue that LLMs create a need for new defense-in-depth approaches.
zh
[AI-4] Mergenetic: a Simple Evolutionary Model Merging Library
【Quick Read】: This paper addresses how to efficiently merge the capabilities of existing models into a new one, notably without any additional training. The key to the solution is Mergenetic, an open-source library for evolutionary model merging that supports flexible combinations of merging methods and evolutionary algorithms and uses lightweight fitness estimators to reduce evaluation costs, achieving competitive merged models across tasks and languages on modest hardware.
链接: https://arxiv.org/abs/2505.11427
作者: Adrian Robert Minut,Tommaso Mencattini,Andrea Santilli,Donato Crisostomi,Emanuele Rodolà
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Link: this https URL
Abstract:Model merging allows combining the capabilities of existing models into a new one - post hoc, without additional training. This has made it increasingly popular thanks to its low cost and the availability of libraries that support merging on consumer GPUs. Recent work shows that pairing merging with evolutionary algorithms can boost performance, but no framework currently supports flexible experimentation with such strategies in language models. We introduce Mergenetic, an open-source library for evolutionary model merging. Mergenetic enables easy composition of merging methods and evolutionary algorithms while incorporating lightweight fitness estimators to reduce evaluation costs. We describe its design and demonstrate that Mergenetic produces competitive results across tasks and languages using modest hardware.
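Evolutionary merging boils down to two pieces: a parameterized merge (here, per-tensor linear interpolation between two checkpoints) and a search loop that mutates the merge coefficients against a fitness estimate. This is a minimal sketch of the concept under those assumptions, not Mergenetic's actual API.

```python
import random

def merge(sd_a, sd_b, alphas):
    """Per-tensor linear interpolation of two state dicts (name -> tensor);
    the list of alphas is the genome the evolutionary search mutates."""
    return {k: a * sd_a[k] + (1 - a) * sd_b[k]
            for k, a in zip(sd_a, alphas)}

def evolve(sd_a, sd_b, fitness, pop=8, gens=10, sigma=0.1):
    """(1+lambda)-style search: keep the coefficient vector whose merged
    model scores best under a (cheap) fitness estimator."""
    best = [0.5] * len(sd_a)
    best_fit = fitness(merge(sd_a, sd_b, best))
    for _ in range(gens):
        for _ in range(pop):
            cand = [min(1.0, max(0.0, a + random.gauss(0, sigma))) for a in best]
            f = fitness(merge(sd_a, sd_b, cand))
            if f > best_fit:
                best, best_fit = cand, f
    return merge(sd_a, sd_b, best)
```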
zh
[AI-5] EdgeWisePersona: A Dataset for On-Device User Profiling from Natural Language Interactions
【Quick Read】: This paper addresses the underperformance of small language models (SLMs) deployed on edge devices at user behavior modeling, specifically reconstructing user profiles from multi-session natural language interactions. The key to the solution is a structured dataset in which user profiles are defined by routines, with large language models (LLMs) generating realistic, diverse, and context-aware simulated interaction sessions, providing a training and evaluation benchmark that helps small models improve behavior modeling under constraints such as privacy preservation, low latency, and personalized experience.
链接: https://arxiv.org/abs/2505.11417
作者: Patryk Bartkowiak,Michal Podstawski
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This paper introduces a novel dataset and evaluation benchmark designed to assess and improve small language models deployable on edge devices, with a focus on user profiling from multi-session natural language interactions in smart home environments. At the core of the dataset are structured user profiles, each defined by a set of routines - context-triggered, repeatable patterns of behavior that govern how users interact with their home systems. Using these profiles as input, a large language model (LLM) generates corresponding interaction sessions that simulate realistic, diverse, and context-aware dialogues between users and their devices. The primary task supported by this dataset is profile reconstruction: inferring user routines and preferences solely from interactions history. To assess how well current models can perform this task under realistic conditions, we benchmarked several state-of-the-art compact language models and compared their performance against large foundation models. Our results show that while small models demonstrate some capability in reconstructing profiles, they still fall significantly short of large models in accurately capturing user behavior. This performance gap poses a major challenge - particularly because on-device processing offers critical advantages, such as preserving user privacy, minimizing latency, and enabling personalized experiences without reliance on the cloud. By providing a realistic, structured testbed for developing and evaluating behavioral modeling under these constraints, our dataset represents a key step toward enabling intelligent, privacy-respecting AI systems that learn and adapt directly on user-owned devices.
zh
[AI-6] MID-L: Matrix-Interpolated Dropout Layer with Layer-wise Neuron Selection
【Quick Read】: This paper addresses the computational redundancy and inefficiency of modern neural networks, which activate all neurons for every input. The key to the solution is a new module, the Matrix-Interpolated Dropout Layer (MID-L), which learns an input-dependent gating vector to interpolate between two transformation paths and dynamically selects and activates only the most informative neurons. MID-L adopts a differentiable Top-k masking strategy, enabling per-input adaptive computation while remaining end-to-end differentiable, thereby reducing computation while maintaining or improving model performance.
链接: https://arxiv.org/abs/2505.11416
作者: Pouya Shaeri,Ariane Middel
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted in a Computer Science Conference, currently in Review
Abstract:Modern neural networks often activate all neurons for every input, leading to unnecessary computation and inefficiency. We introduce Matrix-Interpolated Dropout Layer (MID-L), a novel module that dynamically selects and activates only the most informative neurons by interpolating between two transformation paths via a learned, input-dependent gating vector. Unlike conventional dropout or static sparsity methods, MID-L employs a differentiable Top-k masking strategy, enabling per-input adaptive computation while maintaining end-to-end differentiability. MID-L is model-agnostic and integrates seamlessly into existing architectures. Extensive experiments on six benchmarks, including MNIST, CIFAR-10, CIFAR-100, SVHN, UCI Adult, and IMDB, show that MID-L achieves an average reduction of up to 55% in active neurons and 1.7× FLOPs savings, while maintaining or exceeding baseline accuracy. We further validate the informativeness and selectivity of the learned neurons via Sliced Mutual Information (SMI) and observe improved robustness under overfitting and noisy data conditions. Additionally, MID-L demonstrates favorable inference latency and memory usage profiles, making it suitable for both research exploration and deployment on compute-constrained systems. These results position MID-L as a general-purpose, plug-and-play dynamic computation layer, bridging the gap between dropout regularization and efficient inference.
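A hedged sketch of the mechanism: a learned, input-dependent gate interpolates two linear paths, and a Top-k mask keeps only the most active units. The straight-through trick used for differentiability below is an assumption; the paper's exact masking may differ.

```python
import torch
import torch.nn as nn

class MIDLayerSketch(nn.Module):
    """Illustrative MID-L-style layer: gated interpolation of two paths plus a
    differentiable Top-k mask (straight-through estimator assumed)."""
    def __init__(self, dim, k):
        super().__init__()
        self.path_a = nn.Linear(dim, dim)
        self.path_b = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        self.k = k

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))                  # input-dependent gates
        h = g * self.path_a(x) + (1 - g) * self.path_b(x)
        idx = torch.topk(g, self.k, dim=-1).indices
        hard = torch.zeros_like(g).scatter(-1, idx, torch.ones_like(g[..., :self.k]))
        mask = hard + g - g.detach()                     # straight-through gradient
        return h * mask                                  # only top-k units stay active
```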
zh
[AI-7] DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios
【Quick Read】: This paper addresses the inadequacy of existing decompiler evaluation methods regarding semantic fidelity and analyst usability: traditional approaches rely on syntactic correctness or subjective human ratings and fail to meet the demands of real-world reverse engineering tasks. The key to the solution is the DecompileBench framework, comprising three core components: real-world function extraction, runtime-aware validation, and automated human-centric evaluation based on LLM-as-Judge, enabling effective assessment of decompilers within reverse engineering workflows.
链接: https://arxiv.org/abs/2505.11340
作者: Zeyu Gao,Yuxin Cui,Hao Wang,Siliang Qin,Yuanda Wang,Bolun Zhang,Chao Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Decompilers are fundamental tools for critical security tasks, from vulnerability discovery to malware analysis, yet their evaluation remains fragmented. Existing approaches primarily focus on syntactic correctness through synthetic micro-benchmarks or subjective human ratings, failing to address real-world requirements for semantic fidelity and analyst usability. We present DecompileBench, the first comprehensive framework that enables effective evaluation of decompilers in reverse engineering workflows through three key components: real-world function extraction (comprising 23,400 functions from 130 real-world programs), runtime-aware validation, and automated human-centric assessment using LLM-as-Judge to quantify the effectiveness of decompilers in reverse engineering workflows. Through a systematic comparison between six industrial-strength decompilers and six recent LLM-powered approaches, we demonstrate that LLM-based methods surpass commercial tools in code understandability despite 52.2% lower functionality correctness. These findings highlight the potential of LLM-based approaches to transform human-centric reverse engineering. We open source DecompileBench (this https URL) to provide a framework to advance research on decompilers and assist security experts in making informed tool selections based on their specific requirements.
zh
[AI-8] Explaining Strategic Decisions in Multi-Agent Reinforcement Learning for Aerial Combat Tactics
【Quick Read】: This paper addresses the trust, safety, and human-strategy alignment issues caused by the lack of explainability when deploying multi-agent reinforcement learning (MARL) in sensitive military scenarios. The key to the solution is adapting a variety of explainability techniques to different air combat scenarios to obtain interpretive insights into model behavior and to link AI-generated tactics with human-understandable reasoning, improving transparency to enable reliable deployment and meaningful human-machine interaction.
链接: https://arxiv.org/abs/2505.11311
作者: Ardian Selmonaj,Alessandro Antonucci,Adrian Schneider,Michael Rüegsegger,Matthias Sommer
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published as a journal chapter in NATO Journal of Science and Technology
Abstract:Artificial intelligence (AI) is reshaping strategic planning, with Multi-Agent Reinforcement Learning (MARL) enabling coordination among autonomous agents in complex scenarios. However, its practical deployment in sensitive military contexts is constrained by the lack of explainability, which is an essential factor for trust, safety, and alignment with human strategies. This work reviews and assesses current advances in explainability methods for MARL with a focus on simulated air combat scenarios. We proceed by adapting various explainability techniques to different aerial combat scenarios to gain explanatory insights about the model behavior. By linking AI-generated tactics with human-understandable reasoning, we emphasize the need for transparency to ensure reliable deployment and meaningful human-machine interaction. By illuminating the crucial importance of explainability in advancing MARL for operational defense, our work supports not only strategic planning but also the training of military personnel with insightful and comprehensible analyses.
zh
[AI-9] Heterogeneity-Aware Client Sampling: A Unified Solution for Consistent Federated Learning
【Quick Read】: This paper addresses the distorted optimization dynamics and objective inconsistency in federated learning (FL) caused by communication and computation heterogeneity, which can make the global model converge to an incorrect stationary point far from the optimum, a problem largely unexplored in prior work. The key contribution is revealing the fundamentally different mechanisms by which communication and computation heterogeneity drive inconsistency, and proposing FedACS (Federated Heterogeneity-Aware Client Sampling), a universal method that eliminates all types of objective inconsistency. Theoretical analysis proves that FedACS converges to the correct optimum at a rate of $ O(1/\sqrt{R}) $ even in dynamic heterogeneous environments.
链接: https://arxiv.org/abs/2505.11304
作者: Shudi Weng,Chao Ren,Ming Xiao,Mikael Skoglund
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated learning (FL) commonly involves clients with diverse communication and computational capabilities. Such heterogeneity can significantly distort the optimization dynamics and lead to objective inconsistency, where the global model converges to an incorrect stationary point potentially far from the pursued optimum. Despite its critical impact, the joint effect of communication and computation heterogeneity has remained largely unexplored, due to the intrinsic complexity of their interaction. In this paper, we reveal the fundamentally distinct mechanisms through which heterogeneous communication and computation drive inconsistency in FL. To the best of our knowledge, this is the first unified theoretical analysis of general heterogeneous FL, offering a principled understanding of how these two forms of heterogeneity jointly distort the optimization trajectory under arbitrary choices of local solvers. Motivated by these insights, we propose Federated Heterogeneity-Aware Client Sampling, FedACS, a universal method to eliminate all types of objective inconsistency. We theoretically prove that FedACS converges to the correct optimum at a rate of O(1/\sqrt{R}), even in dynamic heterogeneous environments. Extensive experiments across multiple datasets show that FedACS outperforms state-of-the-art and category-specific baselines by 4.3%-36%, while reducing communication costs by 22%-89% and computation loads by 14%-105%, respectively.
zh
[AI-10] Meta-World: An Improved Standardized RL Benchmark
【Quick Read】: This paper addresses the unfair algorithm comparisons in the Meta-World benchmark caused by undocumented changes, which hinder accurate evaluation of multi-task and meta-reinforcement learning agents. The key to the solution is disambiguating results in the literature, leveraging earlier versions of Meta-World to provide insights into multi-task and meta-reinforcement learning benchmark design, and releasing a new open-source version that achieves full reproducibility of past results, is more technically ergonomic, and gives users greater control over the tasks included in a task set.
链接: https://arxiv.org/abs/2505.11289
作者: Reginald McLean,Evangelos Chatzaroulas,Luc McCutcheon,Frank Röder,Tianhe Yu,Zhanpeng He,K.R. Zentner,Ryan Julian,J K Terry,Isaac Woungang,Nariman Farsad,Pablo Samuel Castro
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Meta-World is widely used for evaluating multi-task and meta-reinforcement learning agents, which are challenged to master diverse skills simultaneously. Since its introduction however, there have been numerous undocumented changes which inhibit a fair comparison of algorithms. This work strives to disambiguate these results from the literature, while also leveraging the past versions of Meta-World to provide insights into multi-task and meta-reinforcement learning benchmark design. Through this process we release a new open-source version of Meta-World (this https URL) that has full reproducibility of past results, is more technically ergonomic, and gives users more control over the tasks that are included in a task set.
zh
[AI-11] CC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLM s
【Quick Read】: This paper addresses the limited performance of multimodal large language models (MLLMs) in non-Western cultural contexts, which constrains their broader applicability. The key solution is TCC-Bench, a bilingual (Chinese and English) visual question answering (VQA) benchmark designed to evaluate MLLMs' understanding of traditional Chinese culture. The benchmark contains culturally rich and visually diverse data and adopts a semi-automated pipeline to generate candidate questions, followed by human curation to ensure data quality and prevent data leakage, while language bias is reduced by never disclosing cultural concepts directly in the question text.
链接: https://arxiv.org/abs/2505.11275
作者: Pengju Xu,Yan Wang,Shuyuan Zhang,Xuan Zhou,Xin Li,Yue Yuan,Fengzhao Li,Shunyuan Zhou,Xingyu Wang,Yi Zhang,Haiying Zhao
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Preprint
Abstract:Recent progress in Multimodal Large Language Models (MLLMs) have significantly enhanced the ability of artificial intelligence systems to understand and generate multimodal content. However, these models often exhibit limited effectiveness when applied to non-Western cultural contexts, which raises concerns about their wider applicability. To address this limitation, we propose the Traditional Chinese Culture understanding Benchmark (TCC-Bench), a bilingual (i.e., Chinese and English) Visual Question Answering (VQA) benchmark specifically designed for assessing the understanding of traditional Chinese culture by MLLMs. TCC-Bench comprises culturally rich and visually diverse data, incorporating images from museum artifacts, everyday life scenes, comics, and other culturally significant contexts. We adopt a semi-automated pipeline that utilizes GPT-4o in text-only mode to generate candidate questions, followed by human curation to ensure data quality and avoid potential data leakage. The benchmark also avoids language bias by preventing direct disclosure of cultural concepts within question texts. Experimental evaluations across a wide range of MLLMs demonstrate that current models still face significant challenges when reasoning about culturally grounded visual content. The results highlight the need for further research in developing culturally inclusive and context-aware multimodal systems. The code and data can be found at: this https URL.
[AI-12] TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes
【速读】: This paper targets the accuracy, efficiency, and data-freshness problems of multi-modal data analytics over data lakes. Existing large language models (LLMs) fall short when handling structured, semi-structured, and unstructured data: query languages struggle to precisely express users' analytical intent, a single model processing multiple data modalities incurs heavy inference overhead, and data in the lake may be incomplete or outdated. The key to the solution is a new architecture built on the Model Context Protocol (MCP): a semantic operator hierarchy and an AI-agent-powered natural-language-to-operator translator bridge user intent and analytical execution; an MCP-based execution framework hosts specialized foundation models optimized for specific data modalities, improving accuracy and efficiency while supporting high scalability through modular deployment; and an updating mechanism combining deep research and machine unlearning techniques balances data freshness against inference efficiency.
链接: https://arxiv.org/abs/2505.11270
作者: Chao Zhang,Shaolei Zhang,Quehuan Liu,Sibei Chen,Tong Li,Ju Fan
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:The variety of data in data lakes presents significant challenges for data analytics, as data scientists must simultaneously analyze multi-modal data, including structured, semi-structured, and unstructured data. While Large Language Models (LLMs) have demonstrated promising capabilities, they still remain inadequate for multi-modal data analytics in terms of accuracy, efficiency, and freshness. First, current natural language (NL) or SQL-like query languages may struggle to precisely and comprehensively capture users' analytical intent. Second, relying on a single unified LLM to process diverse data modalities often leads to substantial inference overhead. Third, data stored in data lakes may be incomplete or outdated, making it essential to integrate external open-domain knowledge to generate timely and relevant analytics results. In this paper, we envision a new multi-modal data analytics system. Specifically, we propose a novel architecture built upon the Model Context Protocol (MCP), an emerging paradigm that enables LLMs to collaborate with knowledgeable agents. First, we define a semantic operator hierarchy tailored for querying multi-modal data in data lakes and develop an AI-agent-powered NL2Operator translator to bridge user intent and analytical execution. Next, we introduce an MCP-based execution framework, in which each MCP server hosts specialized foundation models optimized for specific data modalities. This design enhances both accuracy and efficiency, while supporting high scalability through modular deployment. Finally, we propose a updating mechanism by harnessing the deep research and machine unlearning techniques to refresh the data lakes and LLM knowledges, with the goal of balancing the data freshness and inference efficiency.
[AI-13] LD-Scene: LLM -Guided Diffusion for Controllable Generation of Adversarial Safety-Critical Driving Scenarios
【速读】: This paper tackles the difficulty of evaluating autonomous driving systems in safety-critical scenarios, which are rare and hard to collect from real-world driving data, posing a major challenge to effectively assessing autonomous vehicle performance. Existing methods typically suffer from limited controllability and poor user-friendliness, since extensive expert knowledge is required. The key to the solution is the LD-Scene framework, which integrates Large Language Models (LLMs) with Latent Diffusion Models (LDMs) to enable user-controllable adversarial scenario generation through natural language; its core components are an LDM that captures realistic driving-trajectory distributions and an LLM-based guidance module that translates user queries into adversarial loss functions.
链接: https://arxiv.org/abs/2505.11247
作者: Mingxing Peng,Yuting Xie,Xusen Guo,Ruoyu Yao,Hai Yang,Jun Ma
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 13 pages, 5 figures
Abstract:Ensuring the safety and robustness of autonomous driving systems necessitates a comprehensive evaluation in safety-critical scenarios. However, these safety-critical scenarios are rare and difficult to collect from real-world driving data, posing significant challenges to effectively assessing the performance of autonomous vehicles. Typical existing methods often suffer from limited controllability and lack user-friendliness, as extensive expert knowledge is essentially required. To address these challenges, we propose LD-Scene, a novel framework that integrates Large Language Models (LLMs) with Latent Diffusion Models (LDMs) for user-controllable adversarial scenario generation through natural language. Our approach comprises an LDM that captures realistic driving trajectory distributions and an LLM-based guidance module that translates user queries into adversarial loss functions, facilitating the generation of scenarios aligned with user queries. The guidance module integrates an LLM-based Chain-of-Thought (CoT) code generator and an LLM-based code debugger, enhancing the controllability and robustness in generating guidance functions. Extensive experiments conducted on the nuScenes dataset demonstrate that LD-Scene achieves state-of-the-art performance in generating realistic, diverse, and effective adversarial scenarios. Furthermore, our framework provides fine-grained control over adversarial behaviors, thereby facilitating more effective testing tailored to specific driving scenarios.
[AI-14] A Set-Sequence Model for Time Series ICLR2025
【速读】: This paper addresses financial prediction problems in which the behavior of individual units (such as loans, bonds, or stocks) is influenced by observable unit-level factors, macroeconomic variables, and latent cross-sectional effects. Traditional approaches capture these latent effects through handcrafted summary features, whereas the proposed Set-Sequence model removes that dependence: it learns a shared cross-sectional summary at each period and combines the summary with each unit's time series to predict that unit's outcome independently. The method exploits the set nature of the cross-section, is computationally efficient (set summaries are produced in time linear in the number of units), and is flexible, allowing existing sequence models to be used and a variable number of units at inference.
链接: https://arxiv.org/abs/2505.11243
作者: Elliot L. Epstein,Apaar Sadhwani,Kay Giesecke
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
备注: Presented at the Workshop on Financial AI at ICLR 2025
Abstract:In many financial prediction problems, the behavior of individual units (such as loans, bonds, or stocks) is influenced by observable unit-level factors and macroeconomic variables, as well as by latent cross-sectional effects. Traditional approaches attempt to capture these latent effects via handcrafted summary features. We propose a Set-Sequence model that eliminates the need for handcrafted features. The Set model first learns a shared cross-sectional summary at each period. The Sequence model then ingests the summary-augmented time series for each unit independently to predict its outcome. Both components are learned jointly over arbitrary sets sampled during training. Our approach harnesses the set nature of the cross-section and is computationally efficient, generating set summaries in linear time relative to the number of units. It is also flexible, allowing the use of existing sequence models and accommodating a variable number of units at inference. Empirical evaluations demonstrate that our Set-Sequence model significantly outperforms benchmarks on stock return prediction and mortgage behavior tasks. Code will be released.
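A minimal PyTorch sketch of the described two-stage structure (a shared per-period set summary, concatenated into each unit's sequence before an independent sequence model); the mean-pooled set encoder and all dimensions are illustrative assumptions rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class SetSequence(nn.Module):
    def __init__(self, n_features=8, summary_dim=16, hidden=32):
        super().__init__()
        # Set model: embeds each unit's features, then mean-pools across
        # units at every period (linear in the number of units).
        self.set_enc = nn.Linear(n_features, summary_dim)
        # Sequence model: consumes [unit features, shared summary].
        self.seq = nn.GRU(n_features + summary_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (units, time, features)
        summary = self.set_enc(x).mean(dim=0)  # (time, summary_dim), shared
        summary = summary.unsqueeze(0).expand(x.size(0), -1, -1)
        h, _ = self.seq(torch.cat([x, summary], dim=-1))
        return self.head(h[:, -1]).squeeze(-1)  # one prediction per unit

units, time, feats = 50, 24, 8                 # unit count may vary at inference
model = SetSequence(n_features=feats)
print(model(torch.randn(units, time, feats)).shape)  # torch.Size([50])
```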
[AI-15] Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs
【速读】: This paper examines whether complex reasoning in generative AI really depends on Process Reward Models (PRMs), asking whether pure reinforcement learning (RL) can improve reasoning on its own and implicitly build effective PRM capability. The key to the solution is the Self-PRM framework, in which the model autonomously evaluates and reranks its own generated solutions through self-reward mechanisms, improving reasoning accuracy while reducing dependence on external PRMs. The study also shows that Self-PRM still has low precision on hard problems, indicating that further RL training is needed to improve reward alignment and introspective accuracy.
链接: https://arxiv.org/abs/2505.11227
作者: Zhangying Feng,Qianglong Chen,Ning Lu,Yongqian Li,Siqi Cheng,Shuangmu Peng,Duyu Tang,Shengcai Liu,Zhirui Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The development of reasoning capabilities represents a critical frontier in large language models (LLMs) research, where reinforcement learning (RL) and process reward models (PRMs) have emerged as predominant methodological frameworks. Contrary to conventional wisdom, empirical evidence from DeepSeek-R1 demonstrates that pure RL training focused on mathematical problem-solving can progressively enhance reasoning abilities without PRM integration, challenging the perceived necessity of process supervision. In this study, we conduct a systematic investigation of the relationship between RL training and PRM capabilities. Our findings demonstrate that problem-solving proficiency and process supervision capabilities represent complementary dimensions of reasoning that co-evolve synergistically during pure RL training. In particular, current PRMs underperform simple baselines like majority voting when applied to state-of-the-art models such as DeepSeek-R1 and QwQ-32B. To address this limitation, we propose Self-PRM, an introspective framework in which models autonomously evaluate and rerank their generated solutions through self-reward mechanisms. Although Self-PRM consistently improves the accuracy of the benchmark (particularly with larger sample sizes), analysis exposes persistent challenges: The approach exhibits low precision (10%) on difficult problems, frequently misclassifying flawed solutions as valid. These analyses underscore the need for continued RL scaling to improve reward alignment and introspective accuracy. Overall, our findings suggest that PRM may not be essential for enhancing complex reasoning, as pure RL not only improves problem-solving skills but also inherently fosters robust PRM capabilities. We hope these findings provide actionable insights for building more reliable and self-aware complex reasoning models.
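The Self-PRM loop reduces to sample, self-score, rerank. A schematic sketch follows, with `generate` and `self_score` as placeholders for calls to the same underlying model (both are assumptions for illustration, not the paper's prompts):

```python
import random

random.seed(0)

def generate(problem, n=8):
    # Placeholder: sample n candidate solutions from the model.
    return [f"solution-{i} to {problem}" for i in range(n)]

def self_score(problem, solution):
    # Placeholder self-reward: the same model judges its own solution,
    # e.g. via a prompt like "Rate this solution's correctness from 0 to 1".
    return random.random()

def self_prm(problem, n=8):
    candidates = generate(problem, n)
    scored = [(self_score(problem, s), s) for s in candidates]
    scored.sort(reverse=True)          # rerank by self-reward
    return scored[0][1]                # return the best-rated solution

print(self_prm("integrate x^2 from 0 to 1"))
```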
[AI-16] Bayesian Hierarchical Invariant Prediction
【速读】: This paper addresses the challenge of identifying causal features under heterogeneous data, in particular improving on the computational scalability and reliability of causal-feature identification of the existing Invariant Causal Prediction (ICP) method. The key to the solution is a Bayesian hierarchical framework, Bayesian Hierarchical Invariant Prediction (BHIP), which uses the hierarchical structure to explicitly test the invariance of causal mechanisms, combines sparsity-inducing priors (horseshoe and spike-and-slab) to identify causal features more reliably, and additionally enables the use of prior information.
链接: https://arxiv.org/abs/2505.11211
作者: Francisco Madaleno,Pernille Julie Viuff Sand,Francisco C. Pereira,Sergio Hernan Garrido Mejia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注:
Abstract:We propose Bayesian Hierarchical Invariant Prediction (BHIP) reframing Invariant Causal Prediction (ICP) through the lens of Hierarchical Bayes. We leverage the hierarchical structure to explicitly test invariance of causal mechanisms under heterogeneous data, resulting in improved computational scalability for a larger number of predictors compared to ICP. Moreover, given its Bayesian nature BHIP enables the use of prior information. In this paper, we test two sparsity inducing priors: horseshoe and spike-and-slab, both of which allow us a more reliable identification of causal features. We test BHIP in synthetic and real-world data showing its potential as an alternative inference method to ICP.
[AI-17] GLOVA: Global and Local Variation-Aware Analog Circuit Design with Risk-Sensitive Reinforcement Learning
【速读】: This paper aims to mitigate the performance degradation caused by process, voltage, and temperature (PVT) variations in analog/mixed-signal circuit design, and the inability of existing automated design methods to cope with the substantial mismatches found on real wafers. The key to the solution is the GLOVA framework, which leverages risk-sensitive reinforcement learning to account for the reliability bound affected by PVT variations and introduces an ensemble-based critic for sample-efficient training. The paper further proposes μ-σ evaluation and a simulation-reordering method to lower the simulation cost of identifying failed designs, improving design robustness while substantially raising sample efficiency and reducing design time.
链接: https://arxiv.org/abs/2505.11208
作者: Dongjun Kim,Junwoo Park,Chaehyeon Shin,Jaeheon Jung,Kyungho Shin,Seungheon Baek,Sanghyuk Heo,Woongrae Kim,Inchul Jeong,Joohwan Cho,Jongsun Park
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: Accepted for DAC 2025
Abstract:Analog/mixed-signal circuit design encounters significant challenges due to performance degradation from process, voltage, and temperature (PVT) variations. To achieve commercial-grade reliability, iterative manual design revisions and extensive statistical simulations are required. While several studies have aimed to automate variation aware analog design to reduce time-to-market, the substantial mismatches in real-world wafers have not been thoroughly addressed. In this paper, we present GLOVA, an analog circuit sizing framework that effectively manages the impact of diverse random mismatches to improve robustness against PVT variations. In the proposed approach, risk-sensitive reinforcement learning is leveraged to account for the reliability bound affected by PVT variations, and an ensemble-based critic is introduced to achieve sample-efficient learning. For design verification, we also propose μ-σ evaluation and a simulation reordering method to reduce simulation costs of identifying failed designs. GLOVA supports verification through industrial-level PVT variation evaluation methods, including corner simulation as well as global and local Monte Carlo (MC) simulations. Compared to previous state-of-the-art variation-aware analog sizing frameworks, GLOVA achieves up to 80.5× improvement in sample efficiency and 76.0× reduction in time.
[AI-18] RanDeS: Randomized Delta Superposition for Multi-Model Compression
【速读】: This paper addresses the performance degradation in model merging for multi-model compression caused by interference among task-specific parameter adjustments (deltas). The key to the solution is to reformulate model merging as a compress-and-retrieve scheme and to decorrelate the delta vectors with random orthogonal transformations so that they self-cancel, markedly reducing task interference and improving performance on both vision and language tasks. Because the transformations are fully defined by random seeds, new models can be added without extra memory, and the data- and model-agnostic design supports efficient and flexible addition or removal of models with minimal compute overhead.
链接: https://arxiv.org/abs/2505.11204
作者: Hangyu Zhou,Aaron Gokaslan,Volodymyr Kuleshov,Bharath Hariharan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: this https URL
Abstract:From a multi-model compression perspective, model merging enables memory-efficient serving of multiple models fine-tuned from the same base, but suffers from degraded performance due to interference among their task-specific parameter adjustments (i.e., deltas). In this paper, we reformulate model merging as a compress-and-retrieve scheme, revealing that the task interference arises from the summation of irrelevant deltas during model retrieval. To address this issue, we use random orthogonal transformations to decorrelate these vectors into self-cancellation. We show that this approach drastically reduces interference, improving performance across both vision and language tasks. Since these transformations are fully defined by random seeds, adding new models requires no extra memory. Further, their data- and model-agnostic nature enables easy addition or removal of models with minimal compute overhead, supporting efficient and flexible multi-model serving.
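The decorrelation effect can be checked numerically. In this sketch (an illustration of the general idea, not the paper's exact procedure), correlated task deltas interfere under plain summation, while seed-defined random orthogonal rotations turn the cross-task terms into noise that largely cancels at retrieval:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_tasks = 512, 5

def orthogonal(seed, d):
    # Random orthogonal matrix, fully determined by its seed
    # (so storing it costs nothing beyond the seed).
    q, _ = np.linalg.qr(np.random.default_rng(seed).normal(size=(d, d)))
    return q

# Correlated task deltas: a shared component creates the interference
# that plain delta summation suffers from.
common = rng.normal(size=dim)
deltas = [common + 0.3 * rng.normal(size=dim) for _ in range(n_tasks)]
Qs = [orthogonal(seed, dim) for seed in range(n_tasks)]

naive = sum(deltas)                               # plain superposition
rotated = sum(Q @ d for Q, d in zip(Qs, deltas))  # decorrelated superposition

i = 0
err_naive = np.linalg.norm(naive - deltas[i])
# Retrieval for task i: undo its rotation; other deltas remain rotated
# into near-orthogonal directions and mostly cancel.
err_rot = np.linalg.norm(Qs[i].T @ rotated - deltas[i])
print(f"retrieval error, naive: {err_naive:.1f}, with rotations: {err_rot:.1f}")
```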
[AI-19] User-centric Music Recommendations UAI2022
【速读】: This paper addresses the lack of personalization and explainability in music recommender systems, aiming to improve the transparency of recommendations and user engagement through a user-centric recommendation framework. The key to the solution is to use a single user's long-term playback history, together with community-contributed tags and Spotify audio features, to build a dataset of the user's temporal contexts; based on this, the framework predicts the audio features that match the user's preferences at a given moment and recommends music that fits the user's current context. The core novelty lies in learning musical habits from a single user's history, with the potential to extend to multi-user settings.
链接: https://arxiv.org/abs/2505.11198
作者: Jaime Ramirez Castillo,M. Julia Flores,Ann E. Nicholson
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted for the 16th Bayesian Modelling Applications Workshop (@UAI2022) (BMAW 2022)
Abstract:This work presents a user-centric recommendation framework, designed as a pipeline with four distinct, connected, and customizable phases. These phases are intended to improve explainability and boost user engagement. We have collected the historical this http URL track playback records of a single user over approximately 15 years. The collected dataset includes more than 90,000 playbacks and approximately 14,000 unique tracks. From track playback records, we have created a dataset of user temporal contexts (each row is a specific moment when the user listened to certain music descriptors). As music descriptors, we have used community-contributed this http URL tags and Spotify audio features. They represent the music that, throughout years, the user has been listening to. Next, given the most relevant this http URL tags of a moment (e.g. the hour of the day), we predict the Spotify audio features that best fit the user preferences in that particular moment. Finally, we use the predicted audio features to find tracks similar to these features. The final aim is to recommend (and discover) tracks that the user may feel like listening to at a particular moment. For our initial study case, we have chosen to predict only a single audio feature target: danceability. The framework, however, allows to include more target variables. The ability to learn the musical habits from a single user can be quite powerful, and this framework could be extended to other users.
[AI-20] Multi-Modal Multi-Task (M3T) Federated Foundation Models for Embodied AI: Potentials and Challenges for Edge Integration
【速读】: This paper addresses how embodied AI systems, amid growing multi-modality, personalization, and interactivity, can learn effectively from diverse sensory inputs, adapt continually to user preferences, and operate safely under resource and privacy constraints. The key to the solution is Federated Foundation Models (FFMs), which fuse the generalization capabilities of multi-modal multi-task (M3T) foundation models with the distributed, privacy-preserving mechanisms of federated learning (FL), enabling efficient deployment and personalized service for intelligent systems at the wireless edge.
链接: https://arxiv.org/abs/2505.11191
作者: Kasra Borazjani,Payam Abdisarabshali,Fardis Nadimi,Naji Khosravan,Minghui Liwang,Xianbin Wang,Yiguang Hong,Seyyedali Hosseinalipour
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 10 pages, 3 figures, 3 tables
Abstract:As embodied AI systems become increasingly multi-modal, personalized, and interactive, they must learn effectively from diverse sensory inputs, adapt continually to user preferences, and operate safely under resource and privacy constraints. These challenges expose a pressing need for machine learning models capable of swift, context-aware adaptation while balancing model generalization and personalization. Here, two methods emerge as suitable candidates, each offering parts of these capabilities: Foundation Models (FMs) provide a pathway toward generalization across tasks and modalities, whereas Federated Learning (FL) offers the infrastructure for distributed, privacy-preserving model updates and user-level model personalization. However, when used in isolation, each of these approaches falls short of meeting the complex and diverse capability requirements of real-world embodied environments. In this vision paper, we introduce Federated Foundation Models (FFMs) for embodied AI, a new paradigm that unifies the strengths of multi-modal multi-task (M3T) FMs with the privacy-preserving distributed nature of FL, enabling intelligent systems at the wireless edge. We collect critical deployment dimensions of FFMs in embodied AI ecosystems under a unified framework, which we name “EMBODY”: Embodiment heterogeneity, Modality richness and imbalance, Bandwidth and compute constraints, On-device continual learning, Distributed control and autonomy, and Yielding safety, privacy, and personalization. For each, we identify concrete challenges and envision actionable research directions. We also present an evaluation framework for deploying FFMs in embodied AI systems, along with the associated trade-offs.
[AI-21] Can Global XAI Methods Reveal Injected Bias in LLMs? SHAP vs Rule Extraction vs RuleSHAP
【速读】: This paper addresses the misinformation and biases that generative AI may spread when disseminating information, which can negatively affect the UN Sustainable Development Goals (SDGs). Existing explainable AI (XAI) tools, built for simpler models, struggle with the non-numerical nature of large language models (LLMs) and are thus limited in detecting bias in LLMs. The key to the solution is a text-to-ordinal mapping strategy that converts non-numerical inputs/outputs into numerical features, allowing XAI tools to identify some misinformation-related biases in LLM-generated content, together with the RuleSHAP algorithm, which combines SHAP and RuleFit to detect more non-univariate biases, improving injected-bias detection over RuleFit by an average of +94% (MRR@1).
链接: https://arxiv.org/abs/2505.11189
作者: Francesco Sovrano
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Generative AI systems can help spread information but also misinformation and biases, potentially undermining the UN Sustainable Development Goals (SDGs). Explainable AI (XAI) aims to reveal the inner workings of AI systems and expose misbehaviours or biases. However, current XAI tools, built for simpler models, struggle to handle the non-numerical nature of large language models (LLMs). This paper examines the effectiveness of global XAI methods, such as rule-extraction algorithms and SHAP, in detecting bias in LLMs. To do so, we first show a text-to-ordinal mapping strategy to convert non-numerical inputs/outputs into numerical features, enabling these tools to identify (some) misinformation-related biases in LLM-generated content. Then, we inject non-linear biases of varying complexity (univariate, conjunctive, and non-convex) into widespread LLMs like ChatGPT and Llama via system instructions, using global XAI methods to detect them. This way, we found that RuleFit struggles with conjunctive and non-convex biases, while SHAP can approximate conjunctive biases but cannot express them as actionable rules. Hence, we introduce RuleSHAP, a global rule extraction algorithm combining SHAP and RuleFit to detect more non-univariate biases, improving injected bias detection over RuleFit by +94% (MRR@1) on average.
[AI-22] Feasibility with Language Models for Open-World Compositional Zero-Shot Learning ECCV
【速读】: This paper addresses the performance drop of zero-shot predictors in Open-World Compositional Zero-Shot Learning (OW-CZSL), where all possible state-object combinations are treated as unseen classes. The key to the solution is to use external auxiliary knowledge to judge the feasibility of state-object combinations, realized by the generative-AI-driven Feasibility with Language Model (FLM) approach: it leverages Large Language Models (LLMs) to understand the semantic relationships between states and objects, queries the LLM about the feasibility of a given combination, and retrieves the output logit of the positive answer, thereby improving OW-CZSL performance.
链接: https://arxiv.org/abs/2505.11181
作者: Jae Myung Kim,Stephan Alaniz,Cordelia Schmid,Zeynep Akata
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ECCV Workshop in OOD-CV, 2024
Abstract:Humans can easily tell if an attribute (also called state) is realistic, i.e., feasible, for an object, e.g. fire can be hot, but it cannot be wet. In Open-World Compositional Zero-Shot Learning, when all possible state-object combinations are considered as unseen classes, zero-shot predictors tend to perform poorly. Our work focuses on using external auxiliary knowledge to determine the feasibility of state-object combinations. Our Feasibility with Language Model (FLM) is a simple and effective approach that leverages Large Language Models (LLMs) to better comprehend the semantic relationships between states and objects. FLM involves querying an LLM about the feasibility of a given pair and retrieving the output logit for the positive answer. To mitigate potential misguidance of the LLM given that many of the state-object compositions are rare or completely infeasible, we observe that the in-context learning ability of LLMs is essential. We present an extensive study identifying Vicuna and ChatGPT as best performing, and we demonstrate that our FLM consistently improves OW-CZSL performance across all three benchmarks.
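The core query step, reading the logit of the positive answer, can be sketched with Hugging Face transformers. Using GPT-2 as a stand-in model and this particular prompt wording are assumptions for illustration (the paper reports Vicuna and ChatGPT with in-context examples performing best):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def feasibility_score(state, obj):
    # Ask a yes/no feasibility question and read the next-token logits.
    prompt = f"Question: can a {obj} be {state}? Answer:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    yes_id = tok.encode(" yes")[0]
    no_id = tok.encode(" no")[0]
    # Difference of the two answer logits as a feasibility score.
    return (logits[yes_id] - logits[no_id]).item()

print(feasibility_score("hot", "fire"))   # expected: higher
print(feasibility_score("wet", "fire"))   # expected: lower
```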
[AI-23] From Intent Discovery to Recognition with Topic Modeling and Synthetic Data
【速读】: This paper aims at customer intent understanding and recognition in AI systems under short utterances and the cold-start problem, in particular when recommender systems must introduce new products or services without sufficient real user data. The key to the solution is an agentic large language model (LLM) framework for topic modeling and synthetic query generation: hierarchical topic modeling and intent discovery expand a human-curated intent taxonomy, and synthetic user query data is generated to augment real utterances, lowering the reliance on human annotation and improving the diversity and coverage of intent recognition.
链接: https://arxiv.org/abs/2505.11176
作者: Aaron Rodrigues,Mahmood Hegazy,Azzam Naeem
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding and recognizing customer intents in AI systems is crucial, particularly in domains characterized by short utterances and the cold start problem, where recommender systems must include new products or services without sufficient real user data. Customer utterances are characterized by infrequent word co-occurrences and high term variability, which poses significant challenges for traditional methods in specifying distinct user needs and preparing synthetic queries. To address this, we propose an agentic LLM framework for topic modeling and synthetic query generation, which accelerates the discovery and recognition of customer intents. We first apply hierarchical topic modeling and intent discovery to expand a human-curated taxonomy from 36 generic user intents to 278 granular intents, demonstrating the potential of LLMs to significantly enhance topic specificity and diversity. Next, to support newly discovered intents and address the cold start problem, we generate synthetic user query data, which augments real utterances and reduces dependency on human annotation, especially in low-resource settings. Topic model experiments show substantial improvements in coherence and relevance after topic expansion, while synthetic data experiments indicate that in-class few-shot prompting significantly improves the quality and utility of synthetic queries without compromising diversity. We also show that LLM-generated intent descriptions and keywords can effectively substitute for human-curated versions when used as context for synthetic query generation. Our research underscores the scalability and utility of LLM agents in topic modeling and highlights the strategic use of synthetic utterances to enhance dataset variability and coverage for intent recognition. We present a comprehensive and robust framework for online discovery and recognition of new customer intents in dynamic domains.
[AI-24] Real-Time Verification of Embodied Reasoning for Generative Skill Acquisition
【速读】: This paper addresses the inefficiency of skill learning in agent systems based on generative skill acquisition in complex 3D environments, which stems from relying on supervision signals provided by generalist agents (such as large language models). The key to the solution is the VERGSA framework, which systematically integrates real-time verification principles into embodied skill learning, seamlessly extends verification from mathematical reasoning to embodied learning, and introduces an automated, scalable reward-labeling scheme, thereby markedly improving the success rate and efficiency of skill learning.
链接: https://arxiv.org/abs/2505.11175
作者: Bo Yue,Shuqi Guo,Kaiyu Hu,Chujiao Wang,Benyou Wang,Kui Jia,Guiliang Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative skill acquisition enables embodied agents to actively learn a scalable and evolving repertoire of control skills, crucial for the advancement of large decision models. While prior approaches often rely on supervision signals from generalist agents (e.g., LLMs), their effectiveness in complex 3D environments remains unclear; exhaustive evaluation incurs substantial computational costs, significantly hindering the efficiency of skill learning. Inspired by recent successes in verification models for mathematical reasoning, we propose VERGSA (Verifying Embodied Reasoning in Generative Skill Acquisition), a framework that systematically integrates real-time verification principles into embodied skill learning. VERGSA establishes 1) a seamless extension from verification of mathematical reasoning into embodied learning by dynamically incorporating contextually relevant tasks into prompts and defining success metrics for both subtasks and overall tasks, and 2) an automated, scalable reward labeling scheme that synthesizes dense reward signals by iteratively finalizing the contribution of scene configuration and subtask learning to overall skill acquisition. To the best of our knowledge, this approach constitutes the first comprehensive training dataset for verification-driven generative skill acquisition, eliminating arduous manual reward engineering. Experiments validate the efficacy of our approach: 1) the exemplar task pool improves the average task success rates by 21%, 2) our verification model boosts success rates by 24% for novel tasks and 36% for encountered tasks, and 3) outperforms LLM-as-a-Judge baselines in verification quality.
[AI-25] Attention on the Sphere
【速读】: This paper addresses how to let Transformer architectures natively process data defined on the two-dimensional sphere, particularly in domains that require preserving spherical symmetries and topology, such as atmospheric physics, cosmology, and robotics. The key to the solution is a generalized attention mechanism that integrates numerical quadrature weights into the attention computation, yielding a geometrically faithful spherical attention that is approximately rotationally equivariant, providing strong inductive biases and outperforming Cartesian approaches. In addition, introducing neighborhood attention on the sphere, which restricts interactions to geodesic neighborhoods, further improves scalability and performance while retaining the symmetry properties of the method.
链接: https://arxiv.org/abs/2505.11157
作者: Boris Bonev,Max Rietmann,Andrea Paris,Alberto Carpentieri,Thorsten Kurth
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce a generalized attention mechanism for spherical domains, enabling Transformer architectures to natively process data defined on the two-dimensional sphere - a critical need in fields such as atmospheric physics, cosmology, and robotics, where preserving spherical symmetries and topology is essential for physical accuracy. By integrating numerical quadrature weights into the attention mechanism, we obtain a geometrically faithful spherical attention that is approximately rotationally equivariant, providing strong inductive biases and leading to better performance than Cartesian approaches. To further enhance both scalability and model performance, we propose neighborhood attention on the sphere, which confines interactions to geodesic neighborhoods. This approach reduces computational complexity and introduces the additional inductive bias for locality, while retaining the symmetry properties of our method. We provide optimized CUDA kernels and memory-efficient implementations to ensure practical applicability. The method is validated on three diverse tasks: simulating shallow water equations on the rotating sphere, spherical image segmentation, and spherical depth estimation. Across all tasks, our spherical Transformers consistently outperform their planar counterparts, highlighting the advantage of geometric priors for learning on spherical domains.
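One reading of "integrating numerical quadrature weights into the attention mechanism" is attention as a discretized integral over the sphere, with every key weighted by its quadrature weight. The sketch below follows that reading with the sin(θ) weights of an equiangular grid; it is an interpretation with assumed dimensions, not the authors' implementation:

```python
import torch

def spherical_attention(q, k, v, quad_w):
    """q, k, v: (n_points, d); quad_w: (n_points,) quadrature weights.
    alpha_ij = w_j exp(q_i.k_j) / sum_l w_l exp(q_i.k_l)."""
    scores = q @ k.T / q.shape[-1] ** 0.5
    weighted = scores + torch.log(quad_w)[None, :]  # fold w_j into the softmax
    return torch.softmax(weighted, dim=-1) @ v

# Equiangular grid: quadrature weight proportional to sin(colatitude).
nlat, nlon, d = 16, 32, 8
theta = torch.linspace(0.5, nlat - 0.5, nlat) * torch.pi / nlat
quad_w = torch.sin(theta).repeat_interleave(nlon)   # (nlat*nlon,)
quad_w = quad_w / quad_w.sum()

n = nlat * nlon
q, k, v = (torch.randn(n, d) for _ in range(3))
print(spherical_attention(q, k, v, quad_w).shape)   # torch.Size([512, 8])
```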
[AI-26] X2C: A Dataset Featuring Nuanced Facial Expressions for Realistic Humanoid Imitation
【速读】: This paper addresses realistic facial-expression imitation for humanoid robots in affective human-robot interaction, whose main obstacle is the lack of datasets containing diverse humanoid facial expressions with accurate annotations. The key to the solution is the X2C (Anything to Control) dataset of 100,000 (image, control value) pairs, in which each image shows a humanoid robot displaying a range of facial expressions annotated with 30 control values representing the ground-truth expression configuration, together with the X2CNet framework, which learns the correspondence between nuanced humanoid expressions and their underlying control values, enables in-the-wild facial-expression imitation across different human performers, and is demonstrated on a physical humanoid robot, validating its effectiveness for realistic humanoid facial-expression imitation.
链接: https://arxiv.org/abs/2505.11146
作者: Peizhen Li,Longbing Cao,Xiao-Ming Wu,Runze Yang,Xiaohan Yu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:The ability to imitate realistic facial expressions is essential for humanoid robots engaged in affective human-robot communication. However, the lack of datasets containing diverse humanoid facial expressions with proper annotations hinders progress in realistic humanoid facial expression imitation. To address these challenges, we introduce X2C (Anything to Control), a dataset featuring nuanced facial expressions for realistic humanoid imitation. With X2C, we contribute: 1) a high-quality, high-diversity, large-scale dataset comprising 100,000 (image, control value) pairs. Each image depicts a humanoid robot displaying a diverse range of facial expressions, annotated with 30 control values representing the ground-truth expression configuration; 2) X2CNet, a novel human-to-humanoid facial expression imitation framework that learns the correspondence between nuanced humanoid expressions and their underlying control values from X2C. It enables facial expression imitation in the wild for different human performers, providing a baseline for the imitation task, showcasing the potential value of our dataset; 3) real-world demonstrations on a physical humanoid robot, highlighting its capability to advance realistic humanoid facial expression imitation. Code and Data: this https URL
[AI-27] Reinforcement Learning for AMR Charging Decisions: The Impact of Reward and Action Space Design
【速读】: This paper aims to optimize charging strategies for autonomous mobile robots in large-scale block-stacking warehouses, where the core challenge is designing an effective reinforcement learning (RL) setup that improves service efficiency. The key to the solution is to study how different reward and action-space configurations, ranging from flexible setups to more guided, domain-informed designs, affect agent performance: flexible RL approaches outperform traditional heuristic strategies in service times, while the results reveal a trade-off between open-ended and guided configurations.
链接: https://arxiv.org/abs/2505.11136
作者: Janik Bischoff,Alexandru Rinciog,Anne Meyer
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Under review LION19: The 19th Learning and Intelligent OptimizatioN Conference
Abstract:We propose a novel reinforcement learning (RL) design to optimize the charging strategy for autonomous mobile robots in large-scale block stacking warehouses. RL design involves a wide array of choices that can mostly only be evaluated through lengthy experimentation. Our study focuses on how different reward and action space configurations, ranging from flexible setups to more guided, domain-informed design configurations, affect the agent performance. Using heuristic charging strategies as a baseline, we demonstrate the superiority of flexible, RL-based approaches in terms of service times. Furthermore, our findings highlight a trade-off: While more open-ended designs are able to discover well-performing strategies on their own, they may require longer convergence times and are less stable, whereas guided configurations lead to a more stable learning process but display a more limited generalization potential. Our contributions are threefold. First, we extend SLAPStack, an open-source, RL-compatible simulation-framework to accommodate charging strategies. Second, we introduce a novel RL design for tackling the charging strategy problem. Finally, we introduce several novel adaptive baseline heuristics and reproducibly evaluate the design using a Proximal Policy Optimization agent and varying different design configurations, with a focus on reward.
[AI-28] Scalability of Reinforcement Learning Methods for Dispatching in Semiconductor Frontend Fabs: A Comparison of Open-Source Models with Real Industry Datasets
【速读】: This paper addresses the shortage of adequate benchmark datasets for evaluating scheduling or dispatching methods in the semiconductor industry under realistic conditions, since commonly used benchmarks such as Minifab or SMT2020 lack the complex details and constraints of real-world scenarios. The key to the solution is to compare open-source simulation models against a real industry dataset and evaluate how optimization methods scale with complexity, focusing on reinforcement learning based on policy gradients and Evolution Strategies. The study finds that the Evolution Strategies-based method scales much better with complexity than the policy-gradient approach, that selecting and combining the key bottleneck tools for the agent to control is crucial for efficient optimization, and that a diverse training dataset generalizes better across loading scenarios and stochastic tool-failure patterns.
链接: https://arxiv.org/abs/2505.11135
作者: Patrick Stöckermann,Henning Südfeld,Alessandro Immordino,Thomas Altenmüller,Marc Wegmann,Martin Gebser,Konstantin Schekotihin,Georg Seidel,Chew Wye Chan,Fei Fei Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Benchmark datasets are crucial for evaluating approaches to scheduling or dispatching in the semiconductor industry during the development and deployment phases. However, commonly used benchmark datasets like the Minifab or SMT2020 lack the complex details and constraints found in real-world scenarios. To mitigate this shortcoming, we compare open-source simulation models with a real industry dataset to evaluate how optimization methods scale with different levels of complexity. Specifically, we focus on Reinforcement Learning methods, performing optimization based on policy-gradient and Evolution Strategies. Our research provides insights into the effectiveness of these optimization methods and their applicability to realistic semiconductor frontend fab simulations. We show that our proposed Evolution Strategies-based method scales much better than a comparable policy-gradient-based approach. Moreover, we identify the selection and combination of relevant bottleneck tools to control by the agent as crucial for an efficient optimization. For the generalization across different loading scenarios and stochastic tool failure patterns, we achieve advantages when utilizing a diverse training dataset. While the overall approach is computationally expensive, it manages to scale well with the number of CPU cores used for training. For the real industry dataset, we achieve an improvement of up to 4% regarding tardiness and up to 1% regarding throughput. For the less complex open-source models Minifab and SMT2020, we observe double-digit percentage improvement in tardiness and single digit percentage improvement in throughput by use of Evolution Strategies.
[AI-29] Conditioning Matters: Training Diffusion Policies is Faster Than You Think
【速读】: This paper addresses a fundamental problem in training conditional diffusion policies: when generative conditions are hard to distinguish, the training objective degenerates into modeling the marginal action distribution, a phenomenon termed loss collapse. To overcome this, the authors propose Cocos, whose key idea is to modify the source distribution in conditional flow matching to be condition-dependent: by anchoring the source distribution on semantic information extracted from the condition inputs, it strengthens condition integration and prevents loss collapse.
链接: https://arxiv.org/abs/2505.11123
作者: Zibin Dong,Yicheng Liu,Yinchuan Li,Hang Zhao,Jianye Hao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2505.10105
Abstract:Diffusion policies have emerged as a mainstream paradigm for building vision-language-action (VLA) models. Although they demonstrate strong robot control capabilities, their training efficiency remains suboptimal. In this work, we identify a fundamental challenge in conditional diffusion policy training: when generative conditions are hard to distinguish, the training objective degenerates into modeling the marginal action distribution, a phenomenon we term loss collapse. To overcome this, we propose Cocos, a simple yet general solution that modifies the source distribution in the conditional flow matching to be condition-dependent. By anchoring the source distribution around semantics extracted from condition inputs, Cocos encourages stronger condition integration and prevents the loss collapse. We provide theoretical justification and extensive empirical results across simulation and real-world benchmarks. Our method achieves faster convergence and higher success rates than existing approaches, matching the performance of large-scale pre-trained VLAs using significantly fewer gradient steps and parameters. Cocos is lightweight, easy to implement, and compatible with diverse policy architectures, offering a general-purpose improvement to diffusion policy training.
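A minimal sketch of the stated idea, conditional flow matching whose source distribution is anchored on semantics of the condition. The shapes, the linear probability path, and the linear anchor network `g` are assumptions for illustration, not the actual Cocos parameterization:

```python
import torch
import torch.nn as nn

dim, cond_dim = 4, 6
g = nn.Linear(cond_dim, dim)      # semantic anchor: condition -> source mean
v_net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 64), nn.ReLU(),
                      nn.Linear(64, dim))
opt = torch.optim.Adam(list(g.parameters()) + list(v_net.parameters()), 1e-3)

def cocos_style_step(actions, cond, sigma=0.1):
    # Condition-dependent source: x0 ~ N(g(cond), sigma^2 I),
    # instead of the usual condition-independent N(0, I).
    x0 = g(cond) + sigma * torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    xt = (1 - t) * x0 + t * actions           # linear probability path
    target_v = actions - x0                   # flow-matching velocity target
    pred_v = v_net(torch.cat([xt, cond, t], dim=-1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

for _ in range(3):
    print(cocos_style_step(torch.randn(32, dim), torch.randn(32, cond_dim)))
```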
[AI-30] Navigating the Alpha Jungle: An LLM-Powered MCTS Framework for Formulaic Factor Mining
【速读】: This paper addresses alpha factor mining in quantitative investment, i.e., identifying predictive signals from complex financial data. Traditional approaches rely on human expertise, while modern automated methods such as those based on genetic programming or reinforcement learning often suffer from inefficient search or poorly interpretable alpha factors. The key to the solution is to combine Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS): the LLM's instruction-following and reasoning capabilities iteratively generate and refine symbolic alpha formulas within MCTS-driven exploration, and rich quantitative feedback from financial backtesting of each candidate factor guides the MCTS exploration, enabling efficient navigation of the vast search space toward alpha factors that perform well and remain interpretable.
链接: https://arxiv.org/abs/2505.11122
作者: Yu Shi,Yitong Duan,Jian Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 30 pages
Abstract:Alpha factor mining is pivotal in quantitative investment for identifying predictive signals from complex financial data. While traditional formulaic alpha mining relies on human expertise, contemporary automated methods, such as those based on genetic programming or reinforcement learning, often suffer from search inefficiency or yield poorly interpretable alpha factors. This paper introduces a novel framework that integrates Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to overcome these limitations. Our approach leverages the LLM’s instruction-following and reasoning capability to iteratively generate and refine symbolic alpha formulas within an MCTS-driven exploration. A key innovation is the guidance of MCTS exploration by rich, quantitative feedback from financial backtesting of each candidate factor, enabling efficient navigation of the vast search space. Furthermore, a frequent subtree avoidance mechanism is introduced to bolster search efficiency and alpha factor performance. Experimental results on real-world stock market data demonstrate that our LLM-based framework outperforms existing methods by mining alphas with superior predictive accuracy, trading performance, and improved interpretability, while offering a more efficient solution for formulaic alpha mining.
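The outer loop can be sketched as MCTS whose expansion asks an LLM to refine a formula and whose rollout reward is a backtest score. Both calls are stubbed below as assumptions, and the paper's frequent-subtree avoidance mechanism is omitted:

```python
import math, random

random.seed(0)

def llm_refine(formula):
    # Stub for an LLM call that proposes a refined symbolic alpha.
    return formula + random.choice(["+ret_5d", "*vol_20d", "-rank(close)"])

def backtest(formula):
    # Stub for a backtest returning a quantitative score (e.g., an IC).
    return random.random() - 0.02 * formula.count("*")

class Node:
    def __init__(self, formula, parent=None):
        self.formula, self.parent = formula, parent
        self.children, self.visits, self.value = [], 0, 0.0
    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

root = Node("zscore(close)")
for _ in range(200):
    node = root
    while node.children:                           # selection
        node = max(node.children, key=Node.ucb)
    child = Node(llm_refine(node.formula), node)   # expansion via the LLM
    node.children.append(child)
    reward = backtest(child.formula)               # backtest as rollout reward
    while child:                                   # backpropagation
        child.visits += 1; child.value += reward; child = child.parent

best = max(root.children, key=lambda n: n.value / n.visits)
print(best.formula)
```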
[AI-31] Predicting Student Dropout Risk With A Dual-Modal Abrupt Behavioral Changes Approach
【速读】: This paper addresses timely prediction of student dropout risk in educational settings, aiming to improve educational outcomes through early intervention. Traditional machine learning models face poor data quality, limited scale, and high heterogeneity in offline education, and although educational theories offer valuable insights, the lack of quantifiable key indicators limits their use in data-driven modeling. The key to the solution is the Dual-Modal Multiscale Sliding Window (DMSW) model, which integrates academic performance and behavioral data to dynamically capture behavior patterns from minimal data, improving prediction accuracy by 15% over traditional methods and helping educators identify high-risk students earlier and provide timely support.
链接: https://arxiv.org/abs/2505.11119
作者: Jiabei Cheng,Zhen-Qun Yang,Jiannong Cao,Yu Yang,Xinzhe Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 14 pages, 5 figures
Abstract:Timely prediction of students at high risk of dropout is critical for early intervention and improving educational outcomes. However, in offline educational settings, poor data quality, limited scale, and high heterogeneity often hinder the application of advanced machine learning models. Furthermore, while educational theories provide valuable insights into dropout phenomena, the lack of quantifiable metrics for key indicators limits their use in data-driven modeling. Through data analysis and a review of educational literature, we identified abrupt changes in student behavior as key early signals of dropout risk. To address this, we propose the Dual-Modal Multiscale Sliding Window (DMSW) Model, which integrates academic performance and behavioral data to dynamically capture behavior patterns using minimal data. The DMSW model improves prediction accuracy by 15% compared to traditional methods, enabling educators to identify high-risk students earlier, provide timely support, and foster a more inclusive learning environment. Our analysis highlights key behavior patterns, offering practical insights for preventive strategies and tailored support. These findings bridge the gap between theory and practice in dropout prediction, giving educators an innovative tool to enhance student retention and outcomes.
[AI-32] FairSHAP: Preprocessing for Fairness Through Attribution-Based Data Augmentation
【速读】: This paper addresses fairness in machine learning models, particularly in high-stakes domains where biased decisions can lead to serious societal consequences. Existing preprocessing methods typically lack transparent mechanisms for identifying the features or instances that cause unfairness, obscuring the rationale behind data modifications. The key to the solution is FairSHAP, a preprocessing framework based on Shapley-value attribution that identifies fairness-critical instances in the training data via an interpretable measure of feature importance and systematically modifies them through instance-level matching across sensitive groups, reducing discriminative risk (an individual fairness metric) while preserving data integrity and model accuracy.
链接: https://arxiv.org/abs/2505.11111
作者: Lin Zhu,Yijun Bian,Lei You
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 3 figures, 15 pages
Abstract:Ensuring fairness in machine learning models is critical, particularly in high-stakes domains where biased decisions can lead to serious societal consequences. Existing preprocessing approaches generally lack transparent mechanisms for identifying which features or instances are responsible for unfairness. This obscures the rationale behind data modifications. We introduce FairSHAP, a novel pre-processing framework that leverages Shapley value attribution to improve both individual and group fairness. FairSHAP identifies fairness-critical instances in the training data using an interpretable measure of feature importance, and systematically modifies them through instance-level matching across sensitive groups. This process reduces discriminative risk - an individual fairness metric - while preserving data integrity and model accuracy. We demonstrate that FairSHAP significantly improves demographic parity and equality of opportunity across diverse tabular datasets, achieving fairness gains with minimal data perturbation and, in some cases, improved predictive performance. As a model-agnostic and transparent method, FairSHAP integrates seamlessly into existing machine learning pipelines and provides actionable insights into the sources of unfairness. Our code is on this https URL.
[AI-33] PARSEC: Preference Adaptation for Robotic Object Rearrangement from Scene Context
【速读】: This paper addresses personalized object rearrangement for household robots without explicit instructions, including meaningful object placement in environments already occupied by objects and generalization to unseen objects and new environments. The key to the solution is the PARSEC benchmark, built from 110K rearrangement examples crowdsourced from 72 users and covering 93 object categories and 15 environments, together with the proposed ContextSortLM model, which places objects in partially arranged environments by adapting to user preferences from prior and current scene context while accounting for multiple valid placements.
链接: https://arxiv.org/abs/2505.11108
作者: Kartik Ramachandruni,Sonia Chernova
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Under review at ROMAN 2025
Abstract:Object rearrangement is a key task for household robots requiring personalization without explicit instructions, meaningful object placement in environments occupied with objects, and generalization to unseen objects and new environments. To facilitate research addressing these challenges, we introduce PARSEC, an object rearrangement benchmark for learning user organizational preferences from observed scene context to place objects in a partially arranged environment. PARSEC is built upon a novel dataset of 110K rearrangement examples crowdsourced from 72 users, featuring 93 object categories and 15 environments. We also propose ContextSortLM, an LLM-based rearrangement model that places objects in partially arranged environments by adapting to user preferences from prior and current scene context while accounting for multiple valid placements. We evaluate ContextSortLM and existing personalized rearrangement approaches on the PARSEC benchmark and complement these findings with a crowdsourced evaluation of 108 online raters ranking model predictions based on alignment with user preferences. Our results indicate that personalized rearrangement models leveraging multiple scene context sources perform better than models relying on a single context source. Moreover, ContextSortLM outperforms other models in placing objects to replicate the target user’s arrangement and ranks among the top two in all three environment categories, as rated by online evaluators. Importantly, our evaluation highlights challenges associated with modeling environment semantics across different environment categories and provides recommendations for future work.
[AI-34] Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity
【速读】: This paper addresses the increased latency caused by turn-based interaction in traditional multi-agent reasoning systems, while simultaneously seeking to raise reasoning quality. The key to the solution is the proposed Group Think: a mechanism that lets a single large language model (LLM) act as multiple concurrent reasoning agents (thinkers). By sharing visibility into each other's partial generation progress, Group Think achieves dynamic, collaborative reasoning at the token level, reducing redundant reasoning, improving quality, and significantly lowering latency.
链接: https://arxiv.org/abs/2505.11107
作者: Chan-Jan Hsu,Davide Buffelli,Jamie McGowan,Feng-Ting Liao,Yi-Chang Chen,Sattar Vakili,Da-shan Shiu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large language models (LLMs) have demonstrated the power of reasoning through self-generated chains of thought. Multiple reasoning agents can collaborate to raise joint reasoning quality above individual outcomes. However, such agents typically interact in a turn-based manner, trading increased latency for improved quality. In this paper, we propose Group Think–a single LLM that acts as multiple concurrent reasoning agents, or thinkers. With shared visibility into each other’s partial generation progress, Group Think introduces a new concurrent-reasoning paradigm in which multiple reasoning trajectories adapt dynamically to one another at the token level. For example, a reasoning thread may shift its generation mid-sentence upon detecting that another thread is better positioned to continue. This fine-grained, token-level collaboration enables Group Think to reduce redundant reasoning and improve quality while achieving significantly lower latency. Moreover, its concurrent nature allows for efficient utilization of idle computational resources, making it especially suitable for edge inference, where very small batch size often underutilizes local GPUs. We give a simple and generalizable modification that enables any existing LLM to perform Group Think on a local GPU. We also present an evaluation strategy to benchmark reasoning latency and empirically demonstrate latency improvements using open-source LLMs that were not explicitly trained for Group Think. We hope this work paves the way for future LLMs to exhibit more sophisticated and more efficient collaborative behavior for higher quality generation.
[AI-35] Inferring the Most Similar Variable-length Subsequences between Multidimensional Time Series
【速读】: This paper addresses the problem of finding the most similar subsequences between two multidimensional time series, particularly when both the time series and the subsequences differ in length. The key to the solution is an algorithm that, with theoretical guarantees of correctness and efficiency, exactly finds the most similar subsequences between time series of different lengths. Experimental results show that the method not only produces correct solutions but also runs significantly faster than baseline methods.
链接: https://arxiv.org/abs/2505.11106
作者: Thanadej Rattanakornphan,Piyanon Charoenpoonpanich,Chainarong Amornbunchornvej
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Methodology (stat.ME)
备注: Under review
Abstract:Finding the most similar subsequences between two multidimensional time series has many applications: e.g. capturing dependency in stock markets or discovering coordinated movement of baboons. Considering one pattern occurring in one time series, we might wonder whether the same pattern occurs in another time series with some distortion and possibly a different length. Nevertheless, to the best of our knowledge, there is no efficient framework that deals with this problem yet. In this work, we propose an algorithm that provides the exact solution for finding the most similar multidimensional subsequences between time series where there is a difference in length both between the time series and between the subsequences. The algorithm is built on theoretical guarantees of correctness and efficiency. Results on simulated datasets illustrate that our approach not only provides the correct solution but also uses only about a quarter of the running time of baseline approaches. On real-world datasets, it extracted the most similar subsequences even faster (up to 20 times faster than baseline methods) and provided insights regarding the situation in the stock market and following relations in multidimensional time series of baboon movement. Our approach can be used for any time series. The code and datasets of this work are provided for public use.
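For intuition about the problem (not the paper's algorithm), here is the naive brute-force baseline: enumerate all variable-length subsequence pairs and compare them under length-normalized DTW. Its cost is exactly what the exact algorithm is designed to avoid:

```python
import numpy as np

def dtw(a, b):
    # Classic DTW between two multidimensional sequences (len_a, d), (len_b, d).
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)        # length-normalized distance

def most_similar_subsequences(x, y, min_len=5):
    # Brute force over all variable-length subsequence pairs: O(n^2 m^2) DTWs.
    best = (np.inf, None, None)
    for i in range(len(x) - min_len + 1):
        for j in range(i + min_len, len(x) + 1):
            for k in range(len(y) - min_len + 1):
                for l in range(k + min_len, len(y) + 1):
                    d = dtw(x[i:j], y[k:l])
                    if d < best[0]:
                        best = (d, (i, j), (k, l))
    return best

rng = np.random.default_rng(0)
x, y = rng.normal(size=(12, 2)), rng.normal(size=(10, 2))
print(most_similar_subsequences(x, y))
```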
[AI-36] Bidirectional Distillation: A Mixed-Play Framework for Multi-Agent Generalizable Behaviors
【速读】: This paper aims at the population-population generalization problem in multi-agent reinforcement learning (MARL), especially agents' limited generalization when encountering unseen co-players. Existing self-play-based methods are constrained by inside-space generalization, whereas the proposed Bidirectional Distillation (BiDist) is a novel mixed-play framework whose key idea is bidirectional knowledge distillation for broader policy-space generalization: forward distillation emulates the historical policy space to create an implicit self-play, while reverse distillation systematically drives agents toward novel distributions outside the known policy space in a non-self-play manner. BiDist requires no storage of historical policies, remaining concise and efficient, and its effectiveness is verified by both theoretical analysis and experiments.
链接: https://arxiv.org/abs/2505.11100
作者: Lang Feng,Jiahao Lin,Dong Xing,Li Zhang,De Ma,Gang Pan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Population-population generalization is a challenging problem in multi-agent reinforcement learning (MARL), particularly when agents encounter unseen co-players. However, existing self-play-based methods are constrained by the limitation of inside-space generalization. In this study, we propose Bidirectional Distillation (BiDist), a novel mixed-play framework, to overcome this limitation in MARL. BiDist leverages knowledge distillation in two alternating directions: forward distillation, which emulates the historical policies’ space and creates an implicit self-play, and reverse distillation, which systematically drives agents towards novel distributions outside the known policy space in a non-self-play manner. In addition, BiDist operates as a concise and efficient solution without the need for the complex and costly storage of past policies. We provide both theoretical analysis and empirical evidence to support BiDist’s effectiveness. Our results highlight its remarkable generalization ability across a variety of cooperative, competitive, and social dilemma tasks, and reveal that BiDist significantly diversifies the policy distribution space. We also present comprehensive ablation studies to reinforce BiDist’s effectiveness and key success factors. Source codes are available in the supplementary material.
[AI-37] Analysis of Customer Journeys Using Prototype Detection and Counterfactual Explanations for Sequential Data
【速读】: This paper addresses the quantitative study and comprehensive analysis of customer journeys on multi-channel platforms, where the sequential nature of the data and the complexity of analysis pose challenges. The key to the solution is a novel three-step approach: first, a distance between sequential data is defined and used to identify and visualize representative sequences; second, purchase likelihood is predicted based on this distance; and third, if a sequence suggests no purchase, counterfactual sequences are recommended to raise the purchase probability, using a proposed method that extracts counterfactual explanations for sequential data.
链接: https://arxiv.org/abs/2505.11086
作者: Keita Kinjo
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 19 pages, 7 figures
Abstract:Recently, the proliferation of omni-channel platforms has attracted interest in customer journeys, particularly regarding their role in developing marketing strategies. However, few efforts have been taken to quantitatively study or comprehensively analyze them owing to the sequential nature of their data and the complexity involved in analysis. In this study, we propose a novel approach comprising three steps for analyzing customer journeys. First, the distance between sequential data is defined and used to identify and visualize representative sequences. Second, the likelihood of purchase is predicted based on this distance. Third, if a sequence suggests no purchase, counterfactual sequences are recommended to increase the probability of a purchase using a proposed method, which extracts counterfactual explanations for sequential data. A survey was conducted, and the data were analyzed; the results revealed that typical sequences could be extracted, and the parts of those sequences important for purchase could be detected. We believe that the proposed approach can support improvements in various marketing activities.
[AI-38] A Fast Kernel-based Conditional Independence test with Application to Causal Discovery
【速读】: This paper addresses the application bottleneck of the kernel-based conditional independence (KCI) test on large-scale data due to its cubic computational complexity. The key to the solution is FastKCI, a scalable and parallelizable kernel-based conditional independence test whose core idea is to partition the dataset with a Gaussian mixture model over the conditioning variables, run local KCI tests in parallel on the subsets in a mixture-of-experts fashion, and aggregate the results via an importance-weighted sampling scheme, retaining the statistical power of the original test while greatly improving computational efficiency.
链接: https://arxiv.org/abs/2505.11085
作者: Oliver Schacht,Biwei Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 9 pages, 5 figures
Abstract:Kernel-based conditional independence (KCI) testing is a powerful nonparametric method commonly employed in causal discovery tasks. Despite its flexibility and statistical reliability, cubic computational complexity limits its application to large datasets. To address this computational bottleneck, we propose \textitFastKCI, a scalable and parallelizable kernel-based conditional independence test that utilizes a mixture-of-experts approach inspired by embarrassingly parallel inference techniques for Gaussian processes. By partitioning the dataset based on a Gaussian mixture model over the conditioning variables, FastKCI conducts local KCI tests in parallel, aggregating the results using an importance-weighted sampling scheme. Experiments on synthetic datasets and benchmarks on real-world production data validate that FastKCI maintains the statistical power of the original KCI test while achieving substantial computational speedups. FastKCI thus represents a practical and efficient solution for conditional independence testing in causal inference on large-scale data.
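The partition-parallelize-aggregate skeleton can be sketched as follows. As stand-ins for illustration, a partial-correlation test replaces the local KCI test and Fisher's method replaces the paper's importance-weighted aggregation:

```python
import numpy as np
from joblib import Parallel, delayed
from scipy import stats
from sklearn.mixture import GaussianMixture

def local_ci_pvalue(x, y, z):
    # Stand-in local test: partial correlation of x and y given z.
    zx = np.c_[np.ones_like(z), z]
    rx = x - zx @ np.linalg.lstsq(zx, x, rcond=None)[0]
    ry = y - zx @ np.linalg.lstsq(zx, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    n = len(x)
    t = r * np.sqrt((n - 3) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t), df=n - 3)

rng = np.random.default_rng(0)
n = 3000
z = rng.normal(size=(n, 1))
x = z[:, 0] + 0.5 * rng.normal(size=n)
y = z[:, 0] + 0.5 * rng.normal(size=n)   # x is independent of y given z

# Partition on the conditioning variable, test each block in parallel.
labels = GaussianMixture(n_components=4, random_state=0).fit_predict(z)
pvals = Parallel(n_jobs=-1)(
    delayed(local_ci_pvalue)(x[labels == k], y[labels == k], z[labels == k])
    for k in range(4))

# Fisher combination of the per-partition p-values.
chi2 = -2 * np.sum(np.log(pvals))
print(stats.chi2.sf(chi2, df=2 * len(pvals)))   # large p => keep independence
```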
[AI-39] Fault Diagnosis across Heterogeneous Domains via Self-Adaptive Temporal-Spatial Attention and Sample Generation
【速读】: This paper addresses fault diagnosis in multimode processes: in real industrial scenarios the health-state categories under different operating modes only partially overlap, so existing fault-diagnosis methods face incomplete data and large distributional differences across modes. The key to the solution is the proposed self-adaptive temporal-spatial attention network (TSA-SAN): inter-mode mappings built from healthy-category data generate multimode samples, and interpolation between healthy and fault samples enriches the diversity of the fault data; the diagnosis model is then trained on both real and generated data; self-adaptive instance normalization suppresses irrelevant information while retaining the statistical features essential for diagnosis; and a temporal-spatial attention mechanism focuses on key features, enhancing the model's generalization ability.
链接: https://arxiv.org/abs/2505.11083
作者: Guangqiang Li,M. Amine Atoui,Xiangshun Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 31 pages, 11 figures
Abstract:Deep learning methods have shown promising performance in fault diagnosis for multimode process. Most existing studies assume that the collected health state categories from different operating modes are identical. However, in real industrial scenarios, these categories typically exhibit only partial overlap. The incompleteness of the available data and the large distributional differences between the operating modes pose a significant challenge to existing fault diagnosis methods. To address this problem, a novel fault diagnosis model named self-adaptive temporal-spatial attention network (TSA-SAN) is proposed. First, inter-mode mappings are constructed using healthy category data to generate multimode samples. To enrich the diversity of the fault data, interpolation is performed between healthy and fault samples. Subsequently, the fault diagnosis model is trained using real and generated data. The self-adaptive instance normalization is established to suppress irrelevant information while retaining essential statistical features for diagnosis. In addition, a temporal-spatial attention mechanism is constructed to focus on the key features, thus enhancing the generalization ability of the model. The extensive experiments demonstrate that the proposed model significantly outperforms the state-of-the-art methods. The code will be available on Github at this https URL.
[AI-40] A Multi-modal Fusion Network for Terrain Perception Based on Illumination Aware
【Quick Read】: This paper aims to solve the difficulty autonomous vehicles (AVs) have in perceiving road terrain in real time under complex lighting and weather, since existing sensors such as cameras and LiDAR are sensitive to illumination changes, which degrades perception. The key to the solution is the proposed illumination-aware multi-modal fusion network (IMF), which combines exteroceptive and proprioceptive perception and optimizes the fusion process based on illumination features. Its core innovations are an illumination-perception sub-network that estimates illumination features accurately, and a multi-modal fusion network that dynamically adjusts the weights of the different modalities according to those features; the optimization is further strengthened by pre-training the illumination-perception sub-network and adding an illumination loss as a training constraint.
Link: https://arxiv.org/abs/2505.11066
Authors: Rui Wang, Shichun Yang, Yuyi Chen, Zhuoyang Li, Zexiang Tong, Jianyi Xu, Jiayi Lu, Xinjie Feng, Yaoguang Cao
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:
Abstract:Road terrains play a crucial role in ensuring the driving safety of autonomous vehicles (AVs). However, existing sensors of AVs, including cameras and Lidars, are susceptible to variations in lighting and weather conditions, making it challenging to achieve real-time perception of road conditions. In this paper, we propose an illumination-aware multi-modal fusion network (IMF), which leverages both exteroceptive and proprioceptive perception and optimizes the fusion process based on illumination features. We introduce an illumination-perception sub-network to accurately estimate illumination features. Moreover, we design a multi-modal fusion network which is able to dynamically adjust weights of different modalities according to illumination features. We enhance the optimization process by pre-training of the illumination-perception sub-network and incorporating illumination loss as one of the training constraints. Extensive experiments demonstrate that the IMF shows a superior performance compared to state-of-the-art methods. The comparison results with single modality perception methods highlight the comprehensive advantages of multi-modal fusion in accurately perceiving road terrains under varying lighting conditions. Our dataset is available at: this https URL.
[AI-41] Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking
【Quick Read】: This paper addresses the inadequate assessment of how effective large language models (LLMs) really are in complex, real-world fund investment, and in particular the "time travel" problem of existing benchmarks that rely on historical back-testing: a model may gain an unfair advantage from future information embedded in its training corpus, causing information leakage and overly optimistic performance estimates. The key to the solution is DeepFund, a live fund benchmarking tool with a multi-agent architecture that connects directly to real-time stock market data, using only information published after each model's pretraining cutoff, to guarantee fair, leakage-free evaluation.
Link: https://arxiv.org/abs/2505.11065
Authors: Changlun Li, Yao Shi, Chen Wang, Qiqi Duan, Runke Ruan, Weijie Huang, Haonan Long, Lijun Huang, Yuyu Luo, Nan Tang
Institutions: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 21 pages, 9 figures
Abstract:Large Language Models (LLMs) have demonstrated notable capabilities across financial tasks, including financial report summarization, earnings call transcript analysis, and asset classification. However, their real-world effectiveness in managing complex fund investment remains inadequately assessed. A fundamental limitation of existing benchmarks for evaluating LLM-driven trading strategies is their reliance on historical back-testing, inadvertently enabling LLMs to “time travel”-leveraging future information embedded in their training corpora, thus resulting in possible information leakage and overly optimistic performance estimates. To address this issue, we introduce DeepFund, a live fund benchmark tool designed to rigorously evaluate LLM in real-time market conditions. Utilizing a multi-agent architecture, DeepFund connects directly with real-time stock market data-specifically data published after each model pretraining cutoff-to ensure fair and leakage-free evaluations. Empirical tests on nine flagship LLMs from leading global institutions across multiple investment dimensions-including ticker-level analysis, investment decision-making, portfolio management, and risk control-reveal significant practical challenges. Notably, even cutting-edge models such as DeepSeek-V3 and Claude-3.7-Sonnet incur net trading losses within DeepFund real-time evaluation environment, underscoring the present limitations of LLMs for active fund management. Our code is available at this https URL.
[AI-42] Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction
【Quick Read】: This paper aims to solve the safety-alignment problem that LLM-based autonomous agents face over long-horizon behavioral trajectories, especially the potential risks introduced in the agent's internal reasoning process (its thought). The key to the solution is Thought-Aligner, a lightweight, resource-efficient plug-in module for dynamic thought correction that rewrites each high-risk thought on the fly before every action execution, thereby keeping subsequent decisions and tool interactions safe. Because the method modifies only the reasoning phase and leaves the underlying agent framework untouched, it is easy to deploy and broadly applicable across agent frameworks.
Link: https://arxiv.org/abs/2505.11063
Authors: Changyue Jiang, Xudong Pan, Min Yang
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:LLM-based autonomous agents possess capabilities such as reasoning, tool invocation, and environment interaction, enabling the execution of complex multi-step tasks. The internal reasoning process, i.e., thought, of behavioral trajectory significantly influences tool usage and subsequent actions but can introduce potential risks. Even minor deviations in the agent’s thought may trigger cascading effects leading to irreversible safety incidents. To address the safety alignment challenges in long-horizon behavioral trajectories, we propose Thought-Aligner, a plug-in dynamic thought correction module. Utilizing a lightweight and resource-efficient model, Thought-Aligner corrects each high-risk thought on the fly before each action execution. The corrected thought is then reintroduced to the agent, ensuring safer subsequent decisions and tool interactions. Importantly, Thought-Aligner modifies only the reasoning phase without altering the underlying agent framework, making it easy to deploy and widely applicable to various agent frameworks. To train the Thought-Aligner model, we construct an instruction dataset across ten representative scenarios and simulate ReAct execution trajectories, generating 5,000 diverse instructions and more than 11,400 safe and unsafe thought pairs. The model is fine-tuned using contrastive learning techniques. Experiments across three agent safety benchmarks involving 12 different LLMs demonstrate that Thought-Aligner raises agent behavioral safety from approximately 50% in the unprotected setting to 90% on average. Additionally, Thought-Aligner maintains response latency below 100ms with minimal resource usage, demonstrating its capability for efficient deployment, broad applicability, and timely responsiveness. This method thus provides a practical dynamic safety solution for the LLM-based agents.
[AI-43] Halting Recurrent GNNs and the Graded μ-Calculus
【Quick Read】: This paper addresses the limitations of recurrent graph neural networks (GNNs) regarding expressive power and termination guarantees: current proposals either assume the graph size is given to the model or lack any guarantee of termination. The key to the solution is a halting mechanism under which recurrent GNNs can express every node classifier definable in graded modal mu-calculus without depending on the graph size. To prove this, the authors develop a new approximate semantics for the graded mu-calculus, build on it a graph-size-oblivious model-checking procedure called the counting algorithm, and finally show that the counting algorithm can be implemented on a halting recurrent GNN. (A generic halting-mechanism sketch follows this entry.)
Link: https://arxiv.org/abs/2505.11050
Authors: Jeroen Bollen, Jan Van den Bussche, Stijn Vansummeren, Jonni Virtema
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:
Abstract:Graph Neural Networks (GNNs) are a class of machine-learning models that operate on graph-structured data. Their expressive power is intimately related to logics that are invariant under graded bisimilarity. Current proposals for recurrent GNNs either assume that the graph size is given to the model, or suffer from a lack of termination guarantees. In this paper, we propose a halting mechanism for recurrent GNNs. We prove that our halting model can express all node classifiers definable in graded modal mu-calculus, even for the standard GNN variant that is oblivious to the graph size. A recent breakthrough in the study of the expressivity of graded modal mu-calculus in the finite suggests that conversely, restricted to node classifiers definable in monadic second-order logic, recurrent GNNs can express only node classifiers definable in graded modal mu-calculus. To prove our main result, we develop a new approximate semantics for graded mu-calculus, which we believe to be of independent interest. We leverage this new semantics into a new model-checking algorithm, called the counting algorithm, which is oblivious to the graph size. In a final step we show that the counting algorithm can be implemented on a halting recurrent GNN.
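The paper's construction is logical rather than code, so the sketch below shows only the generic idea of a recurrent GNN with per-node halting (in the style of adaptive computation time), not the authors' counting algorithm. All names, shapes, and the sigmoid halting rule are assumptions.

```python
import numpy as np

def halting_recurrent_gnn(adj, h, W_self, W_msg, w_halt, b_halt,
                          eps=0.01, max_steps=50):
    """Generic ACT-style halting for a recurrent GNN (a sketch only).

    Each node accumulates a halting probability; once it reaches
    1 - eps the node's state is frozen, so the recurrence terminates
    without knowing the graph size in advance."""
    n = h.shape[0]
    halt_acc = np.zeros(n)            # accumulated halting probability
    frozen = np.zeros(n, dtype=bool)
    for _ in range(max_steps):
        msgs = adj @ h                # sum of neighbor states
        new_h = np.tanh(h @ W_self + msgs @ W_msg)
        p = 1.0 / (1.0 + np.exp(-(new_h @ w_halt + b_halt)))  # halt prob.
        h = np.where(frozen[:, None], h, new_h)
        halt_acc = np.where(frozen, halt_acc, halt_acc + p)
        frozen |= halt_acc >= 1 - eps
        if frozen.all():
            break
    return h

rng = np.random.default_rng(0)
n, d = 6, 4
adj = (rng.random((n, n)) < 0.3).astype(float)
out = halting_recurrent_gnn(adj, rng.standard_normal((n, d)),
                            rng.standard_normal((d, d)) * 0.1,
                            rng.standard_normal((d, d)) * 0.1,
                            rng.standard_normal(d), 0.5)
```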
[AI-44] GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
【Quick Read】: This paper aims to improve the safety of vision-language models (VLMs) by introducing a reasoning-based VLM guard model, GuardReasoner-VL. The key to the solution is using online reinforcement learning (online RL) to incentivize the guard model to reason deliberately before making moderation decisions. Concretely, the authors first construct a multimodal reasoning corpus with 123K samples and 631K reasoning steps, cold-start the model's reasoning ability via supervised fine-tuning (SFT), and then further strengthen moderation-oriented reasoning through online RL, with data augmentation via rejection sampling and the proposed safety-aware data concatenation, plus a dynamic clipping parameter that favors exploration early and exploitation later. To balance performance and token efficiency, they design a length-aware safety reward integrating accuracy, format, and token cost. (A hedged sketch of such a reward follows this entry.)
Link: https://arxiv.org/abs/2505.11049
Authors: Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model’s reasoning ability via SFT. In addition, we further enhance reasoning regarding moderation through online RL. Concretely, to enhance diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. Besides, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average. We release data, code, and models (3B/7B) of GuardReasoner-VL at this https URL
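The abstract names the three components of the length-aware safety reward (accuracy, format, token cost) but not its functional form, so the additive shape, coefficients, and token budget below are illustrative assumptions, not the published formula.

```python
def length_aware_safety_reward(correct, well_formatted, n_tokens,
                               alpha=1.0, beta=0.1, gamma=0.001, budget=512):
    """Hedged sketch of a length-aware safety reward in the spirit of
    GuardReasoner-VL; alpha, beta, gamma, and budget are made-up values."""
    r = alpha * float(correct)               # moderation-accuracy term
    r += beta * float(well_formatted)        # output-format term
    r -= gamma * max(0, n_tokens - budget)   # penalize excess token cost
    return r

print(length_aware_safety_reward(correct=True, well_formatted=True, n_tokens=700))
```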
[AI-45] he heteronomy of algorithms: Traditional knowledge and computational knowledge
【Quick Read】: This essay asks how, as computational systems increasingly mediate social life, citizens can be equipped to think critically about computational forms ("critiquing the computal"). The key lies in a new notion of digital literacy the author calls digital Bildung: a critical, interdisciplinary program drawing on concepts and methods from philosophy, politics, history, anthropology, sociology, media studies, computer science, and the humanities more generally, in order to understand how software and data penetrate everyday life and to confront the epistemic challenges that computation raises.
Link: https://arxiv.org/abs/2505.11030
Authors: David M. Berry
Institutions: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:
Abstract:If an active citizen should increasingly be a computationally enlightened one, replacing the autonomy of reason with the heteronomy of algorithms, then I argue in this article that we must begin teaching the principles of critiquing the computal through new notions of what we might call digital Bildung. Indeed, if civil society itself is mediated by computational systems and media, the public use of reason must also be complemented by skills for negotiating and using these computal forms to articulate such critique. Not only is there a need to raise the intellectual tone regarding computation and its related softwarization processes, but there is an urgent need to attend to the likely epistemic challenges from computation which, as presently constituted, tends towards justification through a philosophy of utility rather than through a philosophy of care for the territory of the intellect. We therefore need to develop an approach to this field that uses concepts and methods drawn from philosophy, politics, history, anthropology, sociology, media studies, computer science, and the humanities more generally, to try to understand these issues - particularly the way in which software and data increasingly penetrate our everyday life and the pressures and fissures that are created. We must, in other words, move to undertake a critical interdisciplinary research program to understand the way in which these systems are created, instantiated, and normatively engendered in both specific and general contexts.
[AI-46] Most General Explanations of Tree Ensembles IJCAI2025
【Quick Read】: This paper asks how to find the most general abductive explanation of an AI decision. The key lies in using a formal model of the AI system to identify the explanation that covers as large a region of the input space as possible while still correctly explaining the model's behaviour. Such a most general explanation has the broadest applicability and is therefore the one most likely to seem sensible to a human. (A brute-force toy sketch of inflated explanations follows this entry.)
Link: https://arxiv.org/abs/2505.10991
Authors: Yacine Izza, Alexey Ignatiev, Joao Marques-Silva, Peter J. Stuckey
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: Restricted version of this paper was accepted at IJCAI 2025
Abstract:Explainable Artificial Intelligence (XAI) is critical for attaining trust in the operation of AI systems. A key question of an AI system is "why was this decision made this way". Formal approaches to XAI use a formal model of the AI system to identify abductive explanations. While abductive explanations may be applicable to a large number of inputs sharing the same concrete values, more general explanations may be preferred for numeric inputs. So-called inflated abductive explanations give intervals for each feature ensuring that any input whose values fall within these intervals is still guaranteed to make the same prediction. Inflated explanations cover a larger portion of the input space, and hence are deemed more general explanations. But there can be many (inflated) abductive explanations for an instance. Which is the best? In this paper, we show how to find a most general abductive explanation for an AI decision. This explanation covers as much of the input space as possible, while still being a correct formal explanation of the model's behaviour. Given that we only want to give a human one explanation for a decision, the most general explanation gives us the explanation with the broadest applicability, and hence the one most likely to seem sensible. (The paper has been accepted at IJCAI2025 conference.)
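As a toy illustration of the inflated-explanation idea (per-feature intervals within which the prediction stays the same), here is a greedy sketch. It checks invariance only by random sampling, whereas the paper's formal approach verifies it exactly; `predict`, the step size, and the sampling budget are all illustrative assumptions.

```python
import numpy as np

def inflate_explanation(predict, x, lo, hi, step=0.05, n_check=200, seed=0):
    """Greedily widen each feature's interval around instance x while the
    prediction stays constant on points sampled inside the box."""
    rng = np.random.default_rng(seed)
    target = predict(x[None, :])[0]
    box = np.stack([x.copy(), x.copy()])   # row 0: lower bounds, row 1: upper
    for j in range(len(x)):
        for side, bound, sgn in ((0, lo[j], -1), (1, hi[j], +1)):
            while sgn * (bound - box[side, j]) > 0:
                trial = box.copy()
                trial[side, j] = np.clip(
                    box[side, j] + sgn * step * (hi[j] - lo[j]), lo[j], hi[j])
                pts = rng.uniform(trial[0], trial[1], size=(n_check, len(x)))
                if np.all(predict(pts) == target):
                    box = trial             # widened box still empirically safe
                else:
                    break
    return box

# Toy classifier: class 1 iff the features sum above 1.
predict = lambda X: (X.sum(axis=1) > 1.0).astype(int)
x = np.array([1.0, 0.8])
print(inflate_explanation(predict, x, lo=np.zeros(2), hi=np.full(2, 2.0)))
```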
[AI-47] RAGSynth: Synthetic Data for Robust and Faithful RAG Component Optimization
【Quick Read】: This paper aims to fix two weaknesses of existing retrieval-augmented generation (RAG) frameworks: retrievers that are not robust to queries of varying logical complexity, and generators whose responses suffer fidelity problems. The key to the solution is RAGSynth, a framework consisting of a data-construction model and a corresponding synthetic-data generation implementation, designed to optimize retriever robustness and generator fidelity. Large-scale synthetic data generated with RAGSynth substantially improves RAG system performance and generalizes well across domains.
Link: https://arxiv.org/abs/2505.10989
Authors: Haiyang Shen, Hang Yan, Zhongshi Xing, Mugeng Liu, Yue Li, Zhiyang Chen, Yuxiang Wang, Jiuzheng Wang, Yun Ma
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:RAG can enhance the performance of LLMs on knowledge-intensive tasks. Various RAG paradigms, including vanilla, planning-based, and iterative RAG, are built upon 2 cores: the retriever, which should robustly select relevant documents across complex queries, and the generator, which should faithfully synthesize responses. However, existing retrievers rely heavily on public knowledge and struggle with queries of varying logical complexity and clue completeness, while generators frequently face fidelity problems. In this work, we introduce RAGSynth, a framework that includes a data construction modeling and a corresponding synthetic data generation implementation, designed to optimize retriever robustness and generator fidelity. Additionally, we present SynthBench, a benchmark encompassing 8 domain-specific documents across 4 domains, featuring diverse query complexities, clue completeness, and fine-grained citation granularity. Leveraging RAGSynth, we generate a large-scale synthetic dataset, including single and multi-hop. Extensive experiments demonstrate that the synthetic data significantly improves the robustness of the retrievers and the fidelity of the generators. Additional evaluations confirm that RAGSynth can also generalize well across different domains. By integrating the optimized retrievers into various RAG paradigms, we consistently observe enhanced RAG system performance. We have open-sourced the implementation on this https URL.
[AI-48] DRL-Based Injection Molding Process Parameter Optimization for Adaptive and Profitable Production
【Quick Read】: This paper addresses the persistent challenge of balancing product quality and profitability in plastic injection molding under dynamic environmental and economic conditions. The key to the solution is a deep reinforcement learning (DRL) framework for real-time process optimization that folds both product quality and profitability into the control objective. A profit function is built to reflect real manufacturing costs, covering resin, mold wear, and electricity prices including time-of-use variation; surrogate models predict product quality and cycle time so that DRL agents can be trained efficiently offline with soft actor-critic (SAC) and proximal policy optimization (PPO), and the resulting policies adapt dynamically to seasonal and operational variation while maintaining quality and maximizing profit. (A toy profit-function sketch follows this entry.)
Link: https://arxiv.org/abs/2505.10988
Authors: Joon-Young Kim, Jecheon Yu, Heekyu Kim, Seunghwa Ryu
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 50 pages, 10 figures
Abstract:Plastic injection molding remains essential to modern manufacturing. However, optimizing process parameters to balance product quality and profitability under dynamic environmental and economic conditions remains a persistent challenge. This study presents a novel deep reinforcement learning (DRL)-based framework for real-time process optimization in injection molding, integrating product quality and profitability into the control objective. A profit function was developed to reflect real-world manufacturing costs, incorporating resin, mold wear, and electricity prices, including time-of-use variations. Surrogate models were constructed to predict product quality and cycle time, enabling efficient offline training of DRL agents using soft actor-critic (SAC) and proximal policy optimization (PPO) algorithms. Experimental results demonstrate that the proposed DRL framework can dynamically adapt to seasonal and operational variations, consistently maintaining product quality while maximizing profit. Compared to traditional optimization methods such as genetic algorithms, the DRL models achieved comparable economic performance with up to 135x faster inference speeds, making them well-suited for real-time applications. The framework’s scalability and adaptability highlight its potential as a foundation for intelligent, data-driven decision-making in modern manufacturing environments.
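The abstract says the profit function incorporates resin, mold wear, and time-of-use electricity prices but does not give the formula, so the per-cycle objective below is a hedged reconstruction; every coefficient and the tariff schedule are illustrative assumptions.

```python
def cycle_profit(units, price, resin_kg, resin_price,
                 cycle_s, power_kw, hour, mold_wear_cost):
    """Hedged per-cycle profit sketch: revenue minus resin, electricity
    (with a toy time-of-use tariff), and mold-wear costs."""
    def tou_tariff(h):                         # $/kWh, made-up schedule
        return 0.25 if 8 <= h < 20 else 0.10   # peak vs. off-peak
    revenue = units * price
    resin_cost = resin_kg * resin_price
    energy_cost = power_kw * (cycle_s / 3600.0) * tou_tariff(hour)
    return revenue - resin_cost - energy_cost - mold_wear_cost

print(cycle_profit(units=8, price=1.2, resin_kg=0.5, resin_price=2.0,
                   cycle_s=30.0, power_kw=15.0, hour=14, mold_wear_cost=0.05))
```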
[AI-49] GenoArmory: A Unified Evaluation Framework for Adversarial Attacks on Genomic Foundation Models
【Quick Read】: This paper addresses the lack of a unified adversarial-attack benchmark for assessing the vulnerability of genomic foundation models (GFMs). The key to the solution is GenoArmory, the first comprehensive evaluation framework for GFMs, which systematically evaluates the robustness of five state-of-the-art GFMs under four widely adopted attack algorithms and three defense strategies, analyzes how model architecture, quantization schemes, and training datasets affect security, and introduces the GenoAdv dataset to improve GFM safety.
Link: https://arxiv.org/abs/2505.10983
Authors: Haozheng Luo, Chenghao Qiu, Yimin Wang, Shang Wu, Jiahao Yu, Han Liu, Binghui Wang, Yan Chen
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:We propose the first unified adversarial attack benchmark for Genomic Foundation Models (GFMs), named GenoArmory. Unlike existing GFM benchmarks, GenoArmory offers the first comprehensive evaluation framework to systematically assess the vulnerability of GFMs to adversarial attacks. Methodologically, we evaluate the adversarial robustness of five state-of-the-art GFMs using four widely adopted attack algorithms and three defense strategies. Importantly, our benchmark provides an accessible and comprehensive framework to analyze GFM vulnerabilities with respect to model architecture, quantization schemes, and training datasets. Additionally, we introduce GenoAdv, a new adversarial sample dataset designed to improve GFM safety. Empirically, classification models exhibit greater robustness to adversarial perturbations compared to generative models, highlighting the impact of task type on model vulnerability. Moreover, adversarial attacks frequently target biologically significant genomic regions, suggesting that these models effectively capture meaningful sequence features.
[AI-50] Facets in Argumentation: A Formal Approach to Argument Significance
【Quick Read】: This paper targets the fine-grained reasoning that lies between decision problems and extension counting/enumeration in abstract argumentation frameworks (AFs), which currently requires expensive computation. The key is a new concept, the facet: an argument that belongs to some extensions (is credulously accepted) but not to all extensions (is not skeptically accepted). Facets let users navigate, filter, or gauge the significance of specific arguments without costly counting or enumeration of extensions; the authors prove that facet-related tasks are much easier than counting extensions, and back this up with an implementation and experiments demonstrating feasibility. (A brute-force toy sketch follows this entry.)
Link: https://arxiv.org/abs/2505.10982
Authors: Johannes Fichte, Nicolas Fröhlich, Markus Hecher, Victor Lagerkvist, Yasir Mahmood, Arne Meier, Jonathan Persson
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Argumentation is a central subarea of Artificial Intelligence (AI) for modeling and reasoning about arguments. The semantics of abstract argumentation frameworks (AFs) is given by sets of arguments (extensions) and conditions on the relationship between them, such as stable or admissible. Today’s solvers implement tasks such as finding extensions, deciding credulous or skeptical acceptance, counting, or enumerating extensions. While these tasks are well charted, the area between decision, counting/enumeration and fine-grained reasoning requires expensive reasoning so far. We introduce a novel concept (facets) for reasoning between decision and enumeration. Facets are arguments that belong to some extensions (credulous) but not to all extensions (skeptical). They are most natural when a user aims to navigate, filter, or comprehend the significance of specific arguments, according to their needs. We study the complexity and show that tasks involving facets are much easier than counting extensions. Finally, we provide an implementation, and conduct experiments to demonstrate feasibility.
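To pin down the definition, the sketch below enumerates the stable extensions of a tiny AF by brute force and reads off the facets as the credulously-but-not-skeptically accepted arguments. The exponential enumeration is purely didactic and is not the paper's algorithm.

```python
from itertools import combinations

def stable_extensions(args, attacks):
    """Enumerate stable extensions of a tiny AF by brute force."""
    exts = []
    for r in range(len(args) + 1):
        for S in map(set, combinations(args, r)):
            conflict_free = not any((a, b) in attacks for a in S for b in S)
            # Stability: S attacks every argument outside S.
            attacks_rest = all(any((a, b) in attacks for a in S)
                               for b in set(args) - S)
            if conflict_free and attacks_rest:
                exts.append(S)
    return exts

def facets(args, attacks):
    """Facets = arguments in some extension (credulous) but not in all
    extensions (skeptical)."""
    exts = stable_extensions(args, attacks)
    credulous = set().union(*exts) if exts else set()
    skeptical = set(args).intersection(*exts) if exts else set()
    return credulous - skeptical

# Example: a and b attack each other; both attack c.
args = {"a", "b", "c"}
attacks = {("a", "b"), ("b", "a"), ("a", "c"), ("b", "c")}
print(facets(args, attacks))  # {'a', 'b'}: each is in some but not all extensions
```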
[AI-51] Group-in-Group Policy Optimization for LLM Agent Training
【Quick Read】: This paper tackles the credit-assignment difficulty in reinforcement learning (RL) for large language model (LLM) agents over long horizons, where multi-step interactions with sparse or delayed rewards make per-step credit hard to attribute. The key to the solution is Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that introduces a two-level relative-advantage structure: at the episode level, macro relative advantages are computed over groups of complete trajectories; at the step level, an anchor-state grouping mechanism retroactively builds step-level groups from repeated environment states across trajectories, enabling micro relative-advantage estimation. This delivers fine-grained per-step credit signals without auxiliary models or extra rollouts. (A minimal sketch of the two-level advantage follows this entry.)
Link: https://arxiv.org/abs/2505.10978
Authors: Lang Feng, Zhenghai Xue, Tingcong Liu, Bo An
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to long-horizon LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on two challenging agent benchmarks, ALFWorld and WebShop, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals and achieves performance gains of 12% on ALFWorld and 9% on WebShop over the GRPO baseline: all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.
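A minimal sketch of the two-level advantage computation, assuming each step carries a scalar step return and that macro and micro signals are simply summed; the normalization details and the combination rule are assumptions, not the paper's exact equations.

```python
import numpy as np
from collections import defaultdict

def gigpo_advantages(trajectories, eps=1e-8):
    """GiGPO-style two-level relative advantages (sketch).

    trajectories: list of dicts {"steps": [(state, step_return), ...],
                                 "R": float episode return}
    Returns a dict mapping (traj_idx, step_idx) -> advantage."""
    # Level 1: episode-level macro advantage, normalized within the group.
    Rs = np.array([t["R"] for t in trajectories])
    macro = (Rs - Rs.mean()) / (Rs.std() + eps)

    # Level 2: group steps across trajectories by identical (anchor) state,
    # then compute each step's advantage relative to its state group.
    groups = defaultdict(list)
    for i, t in enumerate(trajectories):
        for j, (state, g) in enumerate(t["steps"]):
            groups[state].append((i, j, g))
    micro = {}
    for members in groups.values():
        gs = np.array([g for _, _, g in members])
        for (i, j, g) in members:
            micro[(i, j)] = (g - gs.mean()) / (gs.std() + eps)

    # Combine macro and micro signals per step (equal weighting assumed).
    return {(i, j): macro[i] + micro[(i, j)]
            for i, t in enumerate(trajectories)
            for j, _ in enumerate(t["steps"])}

trajs = [{"steps": [("s0", 1.0), ("s1", 0.5)], "R": 1.5},
         {"steps": [("s0", 0.2), ("s2", 0.1)], "R": 0.3}]
print(gigpo_advantages(trajs))
```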
[AI-52] GROQLoco: Generalist and RObot-agnostic Quadruped Locomotion Control using Offline Datasets
【Quick Read】: This paper addresses the challenge of learning a generalist policy for complex legged locomotion, in particular real-time adaptation across diverse terrains and robot morphologies under continuous dynamics. The key to the solution is GROQLoco, a scalable attention-based framework that learns a single generalist locomotion policy across multiple quadruped robots and terrains relying solely on offline datasets. It trains on expert demonstrations of two distinct locomotion behaviors, stair traversal (non-periodic gaits) and flat-terrain walking (periodic gaits), collected across multiple quadrupeds, enabling fusion of both behaviors; crucially, the framework operates directly on proprioceptive data from all robots without any robot-specific encoding, produces low-latency control outputs, and shows strong zero-shot transfer across robots and terrains.
Link: https://arxiv.org/abs/2505.10973
Authors: Narayanan PP, Sarvesh Prasanth Venkatesan, Srinivas Kantha Reddy, Shishir Kolathaya
Institutions: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 16 figures, 6 tables
Abstract:Recent advancements in large-scale offline training have demonstrated the potential of generalist policy learning for complex robotic tasks. However, applying these principles to legged locomotion remains a challenge due to continuous dynamics and the need for real-time adaptation across diverse terrains and robot morphologies. In this work, we propose GROQLoco, a scalable, attention-based framework that learns a single generalist locomotion policy across multiple quadruped robots and terrains, relying solely on offline datasets. Our approach leverages expert demonstrations from two distinct locomotion behaviors - stair traversal (non-periodic gaits) and flat terrain traversal (periodic gaits) - collected across multiple quadruped robots, to train a generalist model that enables behavior fusion for both behaviors. Crucially, our framework operates directly on proprioceptive data from all robots without incorporating any robot-specific encodings. The policy is directly deployable on an Intel i7 nuc, producing low-latency control outputs without any test-time optimization. Our extensive experiments demonstrate strong zero-shot transfer across highly diverse quadruped robots and terrains, including hardware deployment on the Unitree Go1, a commercially available 12kg robot. Notably, we evaluate challenging cross-robot training setups where different locomotion skills are unevenly distributed across robots, yet observe successful transfer of both flat walking and stair traversal behaviors to all robots at test time. We also show preliminary walking on Stoch 5, a 70kg quadruped, on flat and outdoor terrains without requiring any fine tuning. These results highlight the potential for robust generalist locomotion across diverse robots and terrains.
[AI-53] MPS-Prover: Advancing Stepwise Theorem Proving by Multi-Perspective Search and Data Curation
【Quick Read】: This paper addresses the biased search guidance from which existing stepwise automated theorem proving (ATP) systems suffer, leading to inefficiency and suboptimal proof strategies. The key to the solution is the Multi-Perspective Search Prover (MPS-Prover), which contributes two core innovations: a highly effective post-training data-curation strategy that prunes roughly 40% of redundant training data without sacrificing performance, and a multi-perspective tree search that integrates a learned critic model with strategically designed heuristic rules to diversify tactic selection, avoid getting trapped in unproductive states, and enhance search robustness.
Link: https://arxiv.org/abs/2505.10962
Authors: Zhenwen Liang, Linfeng Song, Yang Li, Tao Yang, Feng Zhang, Haitao Mi, Dong Yu
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Work in Progress
Abstract:Automated Theorem Proving (ATP) in formal languages remains a formidable challenge in AI, demanding rigorous logical deduction and navigating vast search spaces. While large language models (LLMs) have shown promising performance, existing stepwise provers often suffer from biased search guidance, leading to inefficiencies and suboptimal proof strategies. This paper introduces the Multi-Perspective Search Prover (MPS-Prover), a novel stepwise ATP system designed to overcome these limitations. MPS-Prover incorporates two key innovations: a highly effective post-training data curation strategy that prunes approximately 40% of redundant training data without sacrificing performance, and a multi-perspective tree search mechanism. This search integrates a learned critic model with strategically designed heuristic rules to diversify tactic selection, prevent getting trapped in unproductive states, and enhance search robustness. Extensive evaluations demonstrate that MPS-Prover achieves state-of-the-art performance on multiple challenging benchmarks, including miniF2F and ProofNet, outperforming prior 7B parameter models. Furthermore, our analyses reveal that MPS-Prover generates significantly shorter and more diverse proofs compared to existing stepwise and whole-proof methods, highlighting its efficiency and efficacy. Our work advances the capabilities of LLM-based formal reasoning and offers a robust framework and a comprehensive analysis for developing more powerful theorem provers.
[AI-54] Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents
【Quick Read】: This paper addresses the difficulty of detecting vulnerabilities in source code, especially when benign and vulnerable functions share significant similarities. The key to the solution is VulTrial, a courtroom-inspired multi-agent framework with four role-specific agents (security researcher, code author, moderator, and review board) whose coordinated interaction enhances automated vulnerability detection.
Link: https://arxiv.org/abs/2505.10961
Authors: Ratnadira Widyasari, Martin Weyssow, Ivana Clairine Irsan, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, Hong Jin Kang, David Lo
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Detecting vulnerabilities in source code remains a critical yet challenging task, especially when benign and vulnerable functions share significant similarities. In this work, we introduce VulTrial, a courtroom-inspired multi-agent framework designed to enhance automated vulnerability detection. It employs four role-specific agents, which are security researcher, code author, moderator, and review board. Through extensive experiments using GPT-3.5 and GPT-4o, we demonstrate that VulTrial outperforms single-agent and multi-agent baselines. Using GPT-4o, VulTrial improves the performance by 102.39% and 84.17% over its respective baseline. Additionally, we show that role-specific instruction tuning in multi-agent with small data (50 pair samples) improves the performance of VulTrial further by 139.89% and 118.30%. Furthermore, we analyze the impact of increasing the number of agent interactions on VulTrial's overall performance. While multi-agent setups inherently incur higher costs due to increased token usage, our findings reveal that applying VulTrial to a cost-effective model like GPT-3.5 can improve its performance by 69.89% compared to GPT-4o in a single-agent setting, at a lower overall cost.
[AI-55] Relational Graph Transformer
【Quick Read】: This paper addresses the limitations of graph neural networks (GNNs) on relational data in capturing complex structural patterns and long-range dependencies, as well as the unique challenges conventional graph Transformers face on massive heterogeneous relational entity graphs. The key to the solution is the Relational Graph Transformer (RelGT), the first graph Transformer architecture designed specifically for relational tables. RelGT adopts a novel multi-element tokenization strategy that decomposes each node into five components (features, type, hop distance, time, and local structure), efficiently encoding heterogeneity, temporality, and topology without expensive precomputation, and it combines local attention over sampled subgraphs with global attention to learnable centroids, fusing local and database-wide representations. (A sketch of the five-part tokenization follows this entry.)
Link: https://arxiv.org/abs/2505.10960
Authors: Vijay Prakash Dwivedi, Sri Jaladi, Yangyi Shen, Federico López, Charilaos I. Kanatsoulis, Rishi Puri, Matthias Fey, Jure Leskovec
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: Code: this https URL
Abstract:Relational Deep Learning (RDL) is a promising approach for building state-of-the-art predictive models on multi-table relational data by representing it as a heterogeneous temporal graph. However, commonly used Graph Neural Network models suffer from fundamental limitations in capturing complex structural patterns and long-range dependencies that are inherent in relational data. While Graph Transformers have emerged as powerful alternatives to GNNs on general graphs, applying them to relational entity graphs presents unique challenges: (i) Traditional positional encodings fail to generalize to massive, heterogeneous graphs; (ii) existing architectures cannot model the temporal dynamics and schema constraints of relational data; (iii) existing tokenization schemes lose critical structural information. Here we introduce the Relational Graph Transformer (RelGT), the first graph transformer architecture designed specifically for relational tables. RelGT employs a novel multi-element tokenization strategy that decomposes each node into five components (features, type, hop distance, time, and local structure), enabling efficient encoding of heterogeneity, temporality, and topology without expensive precomputation. Our architecture combines local attention over sampled subgraphs with global attention to learnable centroids, incorporating both local and database-wide representations. Across 21 tasks from the RelBench benchmark, RelGT consistently matches or outperforms GNN baselines by up to 18%, establishing Graph Transformers as a powerful architecture for Relational Deep Learning.
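The five token components are named in the paper; everything else below (summing embeddings as the combination rule, the dimensions, and the bucketing of hops, time, and structure codes) is an illustrative assumption.

```python
import torch
import torch.nn as nn

class FivePartNodeTokenizer(nn.Module):
    """Hedged sketch of RelGT-style multi-element tokenization: each node
    token is the sum of five learned embeddings (features, node type, hop
    distance, time bucket, local-structure code)."""
    def __init__(self, feat_dim, n_types, max_hops, n_time_buckets,
                 n_struct_codes, d_model=128):
        super().__init__()
        self.feat = nn.Linear(feat_dim, d_model)        # continuous features
        self.type = nn.Embedding(n_types, d_model)      # table / node type
        self.hop = nn.Embedding(max_hops + 1, d_model)  # distance to seed node
        self.time = nn.Embedding(n_time_buckets, d_model)
        self.struct = nn.Embedding(n_struct_codes, d_model)

    def forward(self, x, node_type, hop_dist, time_bucket, struct_code):
        return (self.feat(x) + self.type(node_type) + self.hop(hop_dist)
                + self.time(time_bucket) + self.struct(struct_code))

tok = FivePartNodeTokenizer(feat_dim=16, n_types=5, max_hops=3,
                            n_time_buckets=12, n_struct_codes=64)
tokens = tok(torch.randn(10, 16), torch.zeros(10, dtype=torch.long),
             torch.ones(10, dtype=torch.long), torch.zeros(10, dtype=torch.long),
             torch.zeros(10, dtype=torch.long))   # (10, 128) node tokens
```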
[AI-56] Constrained Preferential Bayesian Optimization and Its Application in Banner Ad Design
【速读】:该论文试图解决在存在不等式约束条件下的偏好贝叶斯优化(Preferential Bayesian Optimization, PBO)问题,现有PBO方法尚未考虑实际优化任务中的约束条件。解决方案的关键在于提出一种名为约束偏好贝叶斯优化(Constrained Preferential Bayesian Optimization, CPBO)的方法,首次将不等式约束引入PBO框架,并设计了一种新的采集函数以聚焦于可行区域的探索,从而有效识别最优解。
链接: https://arxiv.org/abs/2505.10954
作者: Koki Iwai,Yusuke Kumagae,Yuki Koyama,Masahiro Hamasaki,Masataka Goto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: 17 pages, 15 figures
Abstract:Preferential Bayesian optimization (PBO) is a variant of Bayesian optimization that observes relative preferences (e.g., pairwise comparisons) instead of direct objective values, making it especially suitable for human-in-the-loop scenarios. However, real-world optimization tasks often involve inequality constraints, which existing PBO methods have not yet addressed. To fill this gap, we propose constrained preferential Bayesian optimization (CPBO), an extension of PBO that incorporates inequality constraints for the first time. Specifically, we present a novel acquisition function for this purpose. Our technical evaluation shows that our CPBO method successfully identifies optimal solutions by focusing on exploring feasible regions. As a practical application, we also present a designer-in-the-loop system for banner ad design using CPBO, where the objective is the designer’s subjective preference, and the constraint ensures a target predicted click-through rate. We conducted a user study with professional ad designers, demonstrating the potential benefits of our approach in guiding creative design under real-world constraints.
[AI-57] Who You Are Matters: Bridging Topics and Social Roles via LLM -Enhanced Logical Recommendation
【Quick Read】: This paper addresses a blind spot of mainstream recommender systems: they neglect to model user characteristics and social roles, which act as logical confounders behind correlated interests and the evolution of user preferences. The key to the solution is introducing a user-role identification task and a behavioral-logic modeling task, solved through an efficient integration framework of a large language model (LLM) and the recommender system, named TagCF. On one hand, exploiting the LLM's world knowledge and logical reasoning produces a virtual logic graph that reveals dynamic, expressive knowledge about users and augments recommendation performance; on the other hand, the user role aligns user behavioral logic with observed feedback, refining the understanding of user behaviors.
Link: https://arxiv.org/abs/2505.10940
Authors: Qing Yu, Xiaobei Wang, Shuchang Liu, Yandong Bai, Xiaoyu Yang, Xueliang Wang, Chang Meng, Shanshan Wu, Hailan Yang, Huihui Xiao, Xiang Li, Fan Yang, Xiaoqiang Feng, Lantao Hu, Han Li, Kun Gai, Lixin Zou
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recommender systems filter contents/items valuable to users by inferring preferences from user features and historical behaviors. Mainstream approaches follow the learning-to-rank paradigm, which focus on discovering and modeling item topics (e.g., categories), and capturing user preferences on these topics based on historical interactions. However, this paradigm often neglects the modeling of user characteristics and their social roles, which are logical confounders influencing the correlated interest and user preference transition. To bridge this gap, we introduce the user role identification task and the behavioral logic modeling task that aim to explicitly model user roles and learn the logical relations between item topics and user social roles. We show that it is possible to explicitly solve these tasks through an efficient integration framework of Large Language Model (LLM) and recommendation systems, for which we propose TagCF. On the one hand, the exploitation of the LLM’s world knowledge and logic inference ability produces a virtual logic graph that reveals dynamic and expressive knowledge of users, augmenting the recommendation performance. On the other hand, the user role aligns the user behavioral logic with the observed user feedback, refining our understanding of user behaviors. Additionally, we also show that the extracted user-item logic graph is empirically a general knowledge that can benefit a wide range of recommendation tasks, and conduct experiments on industrial and several public datasets as verification.
[AI-58] Vaiage: A Multi-Agent Solution to Personalized Travel Planning
【Quick Read】: This paper addresses the challenges of travel planning as a cognitively intensive task involving conflicting user preferences, dynamic external information, and multi-step temporal-spatial optimization; traditional platforms return static results, lack contextual adaptation, and do not support real-time interaction or intent refinement. The key to the solution is Vaiage, a graph-structured multi-agent framework built around large language models (LLMs) that act both as goal-conditioned recommenders and as sequential planners, combining natural-language interaction, structured tool use, and map-based feedback loops to deliver adaptive, explainable, end-to-end travel planning.
Link: https://arxiv.org/abs/2505.10922
Authors: Binwen Liu, Jiexi Ge, Jiamin Wang
Institutions: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:
Abstract:Planning trips is a cognitively intensive task involving conflicting user preferences, dynamic external information, and multi-step temporal-spatial optimization. Traditional platforms often fall short - they provide static results, lack contextual adaptation, and fail to support real-time interaction or intent refinement. Our approach, Vaiage, addresses these challenges through a graph-structured multi-agent framework built around large language models (LLMs) that serve as both goal-conditioned recommenders and sequential planners. LLMs infer user intent, suggest personalized destinations and activities, and synthesize itineraries that align with contextual constraints such as budget, timing, group size, and weather. Through natural language interaction, structured tool use, and map-based feedback loops, Vaiage enables adaptive, explainable, and end-to-end travel planning grounded in both symbolic reasoning and conversational understanding. To evaluate Vaiage, we conducted human-in-the-loop experiments using rubric-based GPT-4 assessments and qualitative feedback. The full system achieved an average score of 8.5 out of 10, outperforming the no-strategy (7.2) and no-external-API (6.8) variants, particularly in feasibility. Qualitative analysis indicated that agent coordination - especially the Strategy and Information Agents - significantly improved itinerary quality by optimizing time use and integrating real-time context. These results demonstrate the effectiveness of combining LLM reasoning with symbolic agent coordination in open-ended, real-world planning tasks.
[AI-59] Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks ISCA2025
【Quick Read】: This paper targets the computational bottleneck of spiking neural networks (SNNs): existing SNN accelerators exploit activation sparsity to skip zero-valued computation but overlook the distribution patterns inherent in binary activations. The key to the solution is Phi, a pattern-based hierarchical sparsity framework with a two-level structure: level 1 provides vector-wise sparsity by representing activations with pre-defined patterns, enabling offline precomputation of their products with the weights and removing most runtime computation; level 2 provides element-wise sparsity via a highly sparse matrix that complements the level-1 matrix, further reducing computation while preserving accuracy. The method is an algorithm-hardware co-design: k-means-based pattern selection and pattern-aware fine-tuning on the algorithm side, and a dedicated architecture that processes both sparsity levels on the fly on the hardware side. (A numpy sketch of the two-level decomposition follows this entry.)
Link: https://arxiv.org/abs/2505.10909
Authors: Chiyue Wei, Bowen Duan, Cong Guo, Jingyang Zhang, Qingyue Song, Hai "Helen" Li, Yiran Chen
Institutions: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: ISCA 2025
Abstract:Spiking Neural Networks (SNNs) are gaining attention for their energy efficiency and biological plausibility, utilizing 0-1 activation sparsity through spike-driven computation. While existing SNN accelerators exploit this sparsity to skip zero computations, they often overlook the unique distribution patterns inherent in binary activations. In this work, we observe that particular patterns exist in spike activations, which we can utilize to reduce the substantial computation of SNN models. Based on these findings, we propose a novel pattern-based hierarchical sparsity framework, termed Phi, to optimize computation. Phi introduces a two-level sparsity hierarchy: Level 1 exhibits vector-wise sparsity by representing activations with pre-defined patterns, allowing for offline pre-computation with weights and significantly reducing most runtime computation. Level 2 features element-wise sparsity by complementing the Level 1 matrix, using a highly sparse matrix to further reduce computation while maintaining accuracy. We present an algorithm-hardware co-design approach. Algorithmically, we employ a k-means-based pattern selection method to identify representative patterns and introduce a pattern-aware fine-tuning technique to enhance Level 2 sparsity. Architecturally, we design Phi, a dedicated hardware architecture that efficiently processes the two levels of Phi sparsity on the fly. Extensive experiments demonstrate that Phi achieves a 3.45x speedup and a 4.93x improvement in energy efficiency compared to state-of-the-art SNN accelerators, showcasing the effectiveness of our framework in optimizing SNN computation.
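To make the two-level decomposition concrete: pick the nearest pre-defined pattern (level 1, whose product with the weights is precomputed offline) and correct it with the sparse residual (level 2). Pattern selection via k-means, the fine-tuning, and the hardware dataflow are omitted; the Hamming-nearest lookup below is an assumption.

```python
import numpy as np

def phi_two_level_matmul(a, patterns, pattern_products, W):
    """Numpy sketch of Phi-style two-level sparsity (illustration only).

    a:                binary spike activation vector, shape (n,)
    patterns:         pre-defined binary patterns, shape (P, n)
    pattern_products: precomputed patterns @ W, shape (P, m)  -- "offline"
    W:                weight matrix, shape (n, m)
    """
    # Level 1: nearest pre-defined pattern (Hamming distance); reuse
    # its precomputed product with W instead of multiplying at runtime.
    k = np.argmin(np.count_nonzero(patterns != a, axis=1))
    level1 = pattern_products[k]

    # Level 2: the residual a - pattern is highly sparse; only its
    # nonzero entries touch W at runtime.
    resid = a.astype(np.int8) - patterns[k].astype(np.int8)
    nz = np.nonzero(resid)[0]
    level2 = resid[nz] @ W[nz]
    return level1 + level2          # equals a @ W exactly

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 16)
patterns = rng.integers(0, 2, (4, 16))
W = rng.standard_normal((16, 8))
out = phi_two_level_matmul(a, patterns, patterns @ W, W)
assert np.allclose(out, a @ W)       # decomposition is lossless
```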
[AI-60] On the Security Risks of ML-based Malware Detection Systems: A Survey
【Quick Read】: This paper addresses the lack of a comprehensive analysis of the practical security risks facing machine learning-based (ML-based) malware detection (MD) systems. The key lies in using the CIA principles to define the scope of the security risks and deconstructing ML-based MD systems into distinct operational stages, yielding a stage-based taxonomy under which the technical progress and gaps of the attack and defense proposals at each stage are systematically summarized; two case studies with inter-stage and intra-stage analyses then provide new empirical insights and suggest future directions.
Link: https://arxiv.org/abs/2505.10903
Authors: Ping He, Yuhao Mao, Changjiang Li, Lorenzo Cavallaro, Ting Wang, Shouling Ji
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Malware presents a persistent threat to user privacy and data integrity. To combat this, machine learning-based (ML-based) malware detection (MD) systems have been developed. However, these systems have increasingly been attacked in recent years, undermining their effectiveness in practice. While the security risks associated with ML-based MD systems have garnered considerable attention, the majority of prior works is limited to adversarial malware examples, lacking a comprehensive analysis of practical security risks. This paper addresses this gap by utilizing the CIA principles to define the scope of security risks. We then deconstruct ML-based MD systems into distinct operational stages, thus developing a stage-based taxonomy. Utilizing this taxonomy, we summarize the technical progress and discuss the gaps in the attack and defense proposals related to the ML-based MD systems within each stage. Subsequently, we conduct two case studies, using both inter-stage and intra-stage analyses according to the stage-based taxonomy to provide new empirical insights. Based on these analyses and insights, we suggest potential future directions from both inter-stage and intra-stage perspectives.
[AI-61] Explain What You Mean: Intent Augmented Knowledge Graph Recommender Built With LLM
【Quick Read】: This paper addresses interaction sparsity in recommender systems, which is especially pronounced in environments with disproportionate cardinality between entity groups, such as users and products in an online marketplace, and for newly introduced entities (the cold-start problem). The key to the solution is constructing and densifying a knowledge graph through retrieval-augmented generation and an encoding approach: the LLM-based Intent Knowledge Graph Recommender (IKGR) learns latent user-item affinities from an interaction knowledge graph and densifies it further through mutual intent connectivity, which mitigates the sparsity issue and enables intent-grounded recommendations with an interpretable embedding-translation layer.
Link: https://arxiv.org/abs/2505.10900
Authors: Wenqing Zheng, Noah Fatsi, Daniel Barcklow, Dmitri Kalaev, Steven Yao, Owen Reinert, C. Bayan Bruss, Daniele Rosa
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Interaction sparsity is the primary obstacle for recommendation systems. Sparsity manifests in environments with disproportional cardinality of groupings of entities, such as users and products in an online marketplace. It also is found for newly introduced entities, described as the cold-start problem. Recent efforts to mitigate this sparsity issue shifts the performance bottleneck to other areas in the computational pipeline. Those that focus on enriching sparse representations with connectivity data from other external sources propose methods that are resource demanding and require careful domain expert aided addition of this newly introduced data. Others that turn to Large Language Model (LLM) based recommenders will quickly encounter limitations surrounding data quality and availability. In this work, we propose LLM-based Intent Knowledge Graph Recommender (IKGR), a novel framework that leverages retrieval-augmented generation and an encoding approach to construct and densify a knowledge graph. IKGR learns latent user-item affinities from an interaction knowledge graph and further densifies it through mutual intent connectivity. This addresses sparsity issues and allows the model to make intent-grounded recommendations with an interpretable embedding translation layer. Through extensive experiments on real-world datasets, we demonstrate that IKGR overcomes knowledge gaps and achieves substantial gains over state-of-the-art baselines on both publicly available and our internal recommendation datasets.
[AI-62] InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
【Quick Read】: This paper addresses the limitations of existing approaches to multimodal computer interaction, which either build intricate workflows around a single large model or offer only workflow modularity without flexibility. The key to the solution is a highly modular architecture that integrates tool-based and pure vision agents, allowing different models to collaboratively solve decoupled tasks in a step-by-step manner, which improves the system's generality and adaptability.
Link: https://arxiv.org/abs/2505.10887
Authors: Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. Our generality is demonstrated by our ability to evaluate not only pure vision-based real-world benchmarks (i.e., OSWorld), but also more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Codes and evaluation scripts are open-sourced at this https URL.
[AI-63] BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset
【Quick Read】: This paper addresses the challenge of deepfake audio detection in low-resource languages such as Bengali, which stems mainly from limited datasets and subtle acoustic features. The key to the solution is introducing BangalFake, a Bengali deepfake audio dataset with 12,260 real and 13,260 deepfake utterances, in which the synthetic speech is generated with state-of-the-art text-to-speech (TTS) models to ensure high naturalness and quality.
Link: https://arxiv.org/abs/2505.10885
Authors: Istiaq Ahmed Fahad, Kamruzzaman Asif, Sifat Sikder
Institutions: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages
Abstract:Deepfake audio detection is challenging for low-resource languages like Bengali due to limited datasets and subtle acoustic features. To address this, we introduce BangalFake, a Bengali Deepfake Audio Dataset with 12,260 real and 13,260 deepfake utterances. Synthetic speech is generated using SOTA Text-to-Speech (TTS) models, ensuring high naturalness and quality. We evaluate the dataset through both qualitative and quantitative analyses. Mean Opinion Score (MOS) from 30 native speakers shows Robust-MOS of 3.40 (naturalness) and 4.01 (intelligibility). t-SNE visualization of MFCCs highlights real vs. fake differentiation challenges. This dataset serves as a crucial resource for advancing deepfake detection in Bengali, addressing the limitations of low-resource language research.
[AI-64] Graph and Simplicial Complex Prediction Gaussian Process via the Hodgelet Representations
【Quick Read】: This paper addresses the tendency of graph neural networks (GNNs) to overfit, and hence perform poorly, when data is scarce. The key to the solution is extending the Gaussian process (GP) framework to simplicial complexes (SCs), enabling the handling of edge-level attributes and attributes supported on higher-order simplices, and further augmenting the resulting SC representations with their Hodge decompositions, so that homological information such as the number of holes is taken into account, which enhances predictions across various applications. (A small Hodge-decomposition sketch follows this entry.)
Link: https://arxiv.org/abs/2505.10877
Authors: Mathieu Alain, So Takao, Xiaowen Dong, Bastian Rieck, Emmanuel Noutahi
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Predicting the labels of graph-structured data is crucial in scientific applications and is often achieved using graph neural networks (GNNs). However, when data is scarce, GNNs suffer from overfitting, leading to poor performance. Recently, Gaussian processes (GPs) with graph-level inputs have been proposed as an alternative. In this work, we extend the Gaussian process framework to simplicial complexes (SCs), enabling the handling of edge-level attributes and attributes supported on higher-order simplices. We further augment the resulting SC representations by considering their Hodge decompositions, allowing us to account for homological information, such as the number of holes, in the SC. We demonstrate that our framework enhances the predictions across various applications, paving the way for GPs to be more widely used for graph and SC-level predictions.
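The Hodge decomposition the paper builds on splits an edge flow into gradient, curl, and harmonic parts via the boundary matrices. The sketch below computes that standard decomposition on a toy filled triangle; it is background machinery, not the paper's GP kernel.

```python
import numpy as np

def hodge_decompose(flow, B1, B2):
    """Hodge decomposition of an edge flow: flow = grad + curl + harmonic,
    with B1 the node-edge and B2 the edge-triangle boundary matrix."""
    grad = B1.T @ np.linalg.lstsq(B1 @ B1.T, B1 @ flow, rcond=None)[0]
    curl = B2 @ np.linalg.lstsq(B2.T @ B2, B2.T @ flow, rcond=None)[0]
    harmonic = flow - grad - curl
    return grad, curl, harmonic

# Toy SC: triangle on nodes {0,1,2}, edges (0,1), (1,2), (0,2), one 2-simplex.
B1 = np.array([[-1.0,  0.0, -1.0],
               [ 1.0, -1.0,  0.0],
               [ 0.0,  1.0,  1.0]])
B2 = np.array([[1.0], [1.0], [-1.0]])   # oriented boundary of the triangle
flow = np.array([1.0, 2.0, 0.5])
g, c, h = hodge_decompose(flow, B1, B2)
assert np.allclose(g + c + h, flow)      # the three parts recompose the flow
```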
[AI-65] Optimal Allocation of Privacy Budget on Hierarchical Data Release
【Quick Read】: This paper addresses the challenge of releasing useful information from hierarchically structured datasets while preserving individual privacy, and in particular how to allocate a finite privacy budget optimally across the levels and components of the hierarchy. The key to the solution is formulating the allocation as a constrained optimization problem that maximizes data utility subject to the total privacy budget, while accounting for the inherent trade-off between data granularity and privacy loss. (A closed-form toy allocation follows this entry.)
Link: https://arxiv.org/abs/2505.10871
Authors: Joonhyuk Ko, Juba Ziani, Ferdinando Fioretto
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:Releasing useful information from datasets with hierarchical structures while preserving individual privacy presents a significant challenge. Standard privacy-preserving mechanisms, and in particular Differential Privacy, often require careful allocation of a finite privacy budget across different levels and components of the hierarchy. Sub-optimal allocation can lead to either excessive noise, rendering the data useless, or to insufficient protections for sensitive information. This paper addresses the critical problem of optimal privacy budget allocation for hierarchical data release. It formulates this challenge as a constrained optimization problem, aiming to maximize data utility subject to a total privacy budget while considering the inherent trade-offs between data granularity and privacy loss. The proposed approach is supported by theoretical analysis and validated through comprehensive experiments on real hierarchical datasets. These experiments demonstrate that optimal privacy budget allocation significantly enhances the utility of the released data and improves the performance of downstream tasks.
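As a worked toy instance of "utility maximization under a total budget" (not the paper's formulation): minimize the weighted Laplace noise variance sum_i 2 w_i (Delta_i / eps_i)^2 subject to sum_i eps_i = eps_total. The Lagrange conditions give eps_i proportional to (w_i Delta_i^2)^(1/3). The weights and sensitivities below are made up.

```python
import numpy as np

def allocate_budget(weights, sensitivities, eps_total):
    """Closed-form toy allocation: eps_i proportional to (w_i * Delta_i^2)^(1/3),
    which minimizes sum_i 2*w_i*(Delta_i/eps_i)^2 under sum_i eps_i = eps_total."""
    w = np.asarray(weights, float)
    d = np.asarray(sensitivities, float)
    score = (w * d**2) ** (1.0 / 3.0)
    return eps_total * score / score.sum()

# Three hierarchy levels: national, state, county, with county queries
# weighted most heavily.
eps = allocate_budget(weights=[1.0, 2.0, 4.0],
                      sensitivities=[1.0, 1.0, 1.0], eps_total=1.0)
print(eps.round(3))   # more budget flows to the higher-weighted level
```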
[AI-66] MCU: Improving Machine Unlearning through Mode Connectivity
【Quick Read】: This paper addresses the weight-entanglement problem suffered by machine unlearning (MU) methods based on linear parameter updates via task arithmetic, with the goal of removing the information of specific training data from a trained model more effectively. The key to the solution is Mode Connectivity Unlearning (MCU), a framework that leverages mode connectivity to find an unlearning pathway in a nonlinear manner; it further introduces a parameter-mask strategy that improves unlearning effectiveness while cutting computational overhead, and an adaptive adjustment strategy for the unlearning penalty coefficient that balances forgetting quality and predictive performance during training, eliminating empirical hyperparameter tuning. Unlike traditional MU methods that identify a single unlearning model, MCU uncovers a spectrum of unlearned models along the pathway. (A sketch of scanning along a connecting curve follows this entry.)
Link: https://arxiv.org/abs/2505.10859
Authors: Yingdan Shi, Ren Wang
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Machine Unlearning (MU) aims to remove the information of specific training data from a trained model, ensuring compliance with privacy regulations and user requests. While one line of existing MU methods relies on linear parameter updates via task arithmetic, they suffer from weight entanglement. In this work, we propose a novel MU framework called Mode Connectivity Unlearning (MCU) that leverages mode connectivity to find an unlearning pathway in a nonlinear manner. To further enhance performance and efficiency, we introduce a parameter mask strategy that not only improves unlearning effectiveness but also reduces computational overhead. Moreover, we propose an adaptive adjustment strategy for our unlearning penalty coefficient to adaptively balance forgetting quality and predictive performance during training, eliminating the need for empirical hyperparameter tuning. Unlike traditional MU methods that identify only a single unlearning model, MCU uncovers a spectrum of unlearning models along the pathway. Overall, MCU serves as a plug-and-play framework that seamlessly integrates with any existing MU methods, consistently improving unlearning efficacy. Extensive experiments on the image classification task demonstrate that MCU achieves superior performance.
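To illustrate "a spectrum of models along a pathway", the sketch below walks a quadratic Bezier curve between two weight vectors (the standard mode-connectivity parametrization) and scores each point with user-supplied forgetting and retention callbacks. The additive trade-off rule is an assumption, not MCU's adaptive penalty or its parameter mask.

```python
import numpy as np

def bezier_point(theta_a, theta_c, theta_b, t):
    """Quadratic Bezier curve between weight vectors theta_a and theta_b
    with learnable control point theta_c."""
    return (1 - t)**2 * theta_a + 2 * t * (1 - t) * theta_c + t**2 * theta_b

def scan_unlearning_path(theta_a, theta_c, theta_b,
                         forget_err, retain_acc, n=21):
    """Pick the point on the curve with the best forgetting/retention
    trade-off (higher forget_err and higher retain_acc are both better)."""
    best_t, best_score = None, -np.inf
    for t in np.linspace(0.0, 1.0, n):
        theta = bezier_point(theta_a, theta_c, theta_b, t)
        score = forget_err(theta) + retain_acc(theta)  # assumed trade-off
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

theta_a, theta_b = np.zeros(4), np.ones(4)
theta_c = 0.5 * np.ones(4) + 0.3
t_star, _ = scan_unlearning_path(
    theta_a, theta_c, theta_b,
    forget_err=lambda th: float(np.linalg.norm(th)),          # toy metric
    retain_acc=lambda th: -float(np.linalg.norm(th - 1.0)))   # toy metric
print(t_star)
```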
[AI-67] ImputeINR: Time Series Imputation via Implicit Neural Representations for Disease Diagnosis with Missing Data IJCAI2025
【Quick Read】: This paper addresses the degradation of downstream disease-diagnosis performance caused by the large proportion of missing values in medical time series: existing imputation methods operate on discrete data points and struggle to model sparse data, performing especially poorly when many values are missing. The key to the solution is ImputeINR, which employs implicit neural representations (INR) to learn continuous functions for the time series; because these continuous functions are decoupled from the sampling frequency (effectively having infinite sampling frequency), ImputeINR produces fine-grained imputations even from extremely sparse observations. (A minimal INR-fitting sketch follows this entry.)
Link: https://arxiv.org/abs/2505.10856
Authors: Mengxuan Li, Ke Liu, Jialong Guo, Jiajun Bu, Hongwei Wang, Haishuai Wang
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by IJCAI 2025
Abstract:Healthcare data frequently contain a substantial proportion of missing values, necessitating effective time series imputation to support downstream disease diagnosis tasks. However, existing imputation methods focus on discrete data points and are unable to effectively model sparse data, resulting in particularly poor performance for imputing substantial missing values. In this paper, we propose a novel approach, ImputeINR, for time series imputation by employing implicit neural representations (INR) to learn continuous functions for time series. ImputeINR leverages the merits of INR in that the continuous functions are not coupled to sampling frequency and have infinite sampling frequency, allowing ImputeINR to generate fine-grained imputations even on extremely sparse observed values. Extensive experiments conducted on eight datasets with five ratios of masked values show the superior imputation performance of ImputeINR, especially for high missing ratios in time series data. Furthermore, we validate that applying ImputeINR to impute missing values in healthcare data enhances the performance of downstream disease diagnosis tasks. Codes are available.
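The essence of INR-based imputation is to fit a continuous function f(t) to the observed (time, value) pairs and then query it at the missing timestamps. The tiny PyTorch model below illustrates only that idea; the Fourier-feature encoding, layer sizes, and training recipe are assumptions, not ImputeINR's architecture.

```python
import torch
import torch.nn as nn

class TinyINR(nn.Module):
    """Minimal implicit neural representation f(t) -> x for one series."""
    def __init__(self, n_freq=16, hidden=64):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(n_freq)   # Fourier feature frequencies
        self.net = nn.Sequential(nn.Linear(2 * n_freq, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, t):                          # t: (N, 1), scaled to [0, 1]
        z = t * self.freqs
        feats = torch.cat([torch.sin(z), torch.cos(z)], dim=-1)
        return self.net(feats)

def impute(t_obs, x_obs, t_query, steps=2000, lr=1e-3):
    """Fit the INR on the observed pairs, then query at any timestamps:
    the fitted function is continuous, so the query grid is arbitrary."""
    model = TinyINR()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(t_obs) - x_obs) ** 2).mean()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(t_query)

t_obs = torch.rand(20, 1)                          # sparse observations
x_obs = torch.sin(6.28 * t_obs)
x_missing = impute(t_obs, x_obs, torch.linspace(0, 1, 100).unsqueeze(1))
```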
[AI-68] Ready2Unlearn: A Learning-Time Approach for Preparing Models with Future Unlearning Readiness
【Quick Read】: This paper addresses the inefficiency and performance loss of handling data-deletion requests reactively, where unlearning algorithms are executed only after a deployed model receives a request. The key to the solution is Ready2Unlearn, a learning-time optimization method that, building on well-established meta-learning principles, proactively trains models with unlearning readiness during the training phase, so that future unlearning requests can be handled more efficiently and in a more principled manner, with reduced unlearning time, better retention of overall model capability, and stronger resistance to inadvertent recovery of forgotten data.
Link: https://arxiv.org/abs/2505.10845
Authors: Hanyu Duan, Yi Yang, Ahmed Abbasi, Kar Yan Tam
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper introduces Ready2Unlearn, a learning-time optimization approach designed to facilitate future unlearning processes. Unlike the majority of existing unlearning efforts that focus on designing unlearning algorithms, which are typically implemented reactively when an unlearning request is made during the model deployment phase, Ready2Unlearn shifts the focus to the training phase, adopting a “forward-looking” perspective. Building upon well-established meta-learning principles, Ready2Unlearn proactively trains machine learning models with unlearning readiness, such that they are well prepared and can handle future unlearning requests in a more efficient and principled manner. Ready2Unlearn is model-agnostic and compatible with any gradient ascent-based machine unlearning algorithms. We evaluate the method on both vision and language tasks under various unlearning settings, including class-wise unlearning and random data unlearning. Experimental results show that by incorporating such preparedness at training time, Ready2Unlearn produces an unlearning-ready model state, which offers several key advantages when future unlearning is required, including reduced unlearning time, improved retention of overall model capability, and enhanced resistance to the inadvertent recovery of forgotten data. We hope this work could inspire future efforts to explore more proactive strategies for equipping machine learning models with built-in readiness towards more reliable and principled machine unlearning.
[AI-69] TACO: Rethinking Semantic Communications with Task Adaptation and Context Embedding
【Quick Read】: This paper addresses a fundamental challenge in semantic communication: accurately identifying and extracting the most critical semantic information without degrading performance, while adapting to downstream tasks at the receiver that may evolve over time. The key to the solution is a novel semantic communication framework capable of jointly capturing task-specific information, which boosts downstream task performance, and contextual information, which improves the system's adaptability.
Link: https://arxiv.org/abs/2505.10834
Authors: Achintha Wijesinghe, Weiwei Wang, Suchinthaka Wanninayaka, Songyang Zhang, Zhi Ding
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: Submitted to the IEEE GlobeCom 2025
Abstract:Recent advancements in generative artificial intelligence have introduced groundbreaking approaches to innovating next-generation semantic communication, which prioritizes conveying the meaning of a message rather than merely transmitting raw data. A fundamental challenge in semantic communication lies in accurately identifying and extracting the most critical semantic information while adapting to downstream tasks without degrading performance, particularly when the objective at the receiver may evolve over time. To enable flexible adaptation to multiple tasks at the receiver, this work introduces a novel semantic communication framework, which is capable of jointly capturing task-specific information to enhance downstream task performance and contextual information. Through rigorous experiments on popular image datasets and computer vision tasks, our framework shows promising improvement compared to existing work, including superior performance in downstream tasks, better generalizability, ultra-high bandwidth efficiency, and low reconstruction latency.
[AI-70] PoE-World: Compositional World Modeling with Products of Programmatic Experts
【Quick Read】: This paper addresses two weaknesses of traditional deep-learning world models: they demand vast amounts of training data in complex environments and do not flexibly update their knowledge from sparse observations. The key to the solution is a new program-synthesis method that represents a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by large language models (LLMs), enabling effective modeling of complex, non-gridworld domains from just a few observations. (A sketch of the product-of-experts combination follows this entry.)
Link: https://arxiv.org/abs/2505.10819
Authors: Wasu Top Piriyakulkij, Yichao Liang, Hao Tang, Adrian Weller, Marta Kryven, Kevin Ellis
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) give an alternate approach which learns world models represented as source code, supporting strong generalization from little data. To date, application of program-structured world models remains limited to natural language and grid-world domains. We introduce a novel program synthesis method for effectively modeling complex, non-gridworld domains by representing a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari’s Pong and Montezuma’s Revenge. We release our code and display the learned world models and videos of the agent’s gameplay at this https URL.
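The core modeling idea, an exponentially weighted product of LLM-synthesized programmatic experts, can be sketched in a few lines. Each expert here is simply a Python callable returning a probability; the log-space combination and normalization below illustrate the mechanism, not the paper's implementation.

```python
import math

def poe_predict(experts, weights, state, action, candidates):
    """Score candidate next states under a weighted product of programmatic experts.

    experts:  list of callables (state, action, next_state) -> probability
    weights:  per-expert exponents; a larger weight means a more influential expert
    """
    log_scores = []
    for ns in candidates:
        log_p = sum(w * math.log(max(expert(state, action, ns), 1e-9))
                    for expert, w in zip(experts, weights))
        log_scores.append(log_p)
    # Normalize with the log-sum-exp trick for numerical stability.
    m = max(log_scores)
    unnorm = [math.exp(s - m) for s in log_scores]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```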
zh
[AI-71] Developing and Integrating Trust Modeling into Multi-Objective Reinforcement Learning for Intelligent Agricultural Management
【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)在农业领域广泛应用中面临的算法建议与农民实践经验、地方知识和传统做法之间的差距问题。其解决方案的关键在于强调人机交互(Human-AI Interaction, HAII),通过引入可信度框架(包括能力、善意和正直三个维度)构建一个量化农民对基于强化学习(Reinforcement Learning, RL)的施肥策略信任程度的数学模型,并将该模型嵌入多目标RL框架中,从而在策略优化过程中直接融合信任因素,使AI推荐结果在技术上可靠、经济上可行、情境上贴合且社会上可接受。
链接: https://arxiv.org/abs/2505.10803
作者: Zhaoan Wang,Wonseok Jang,Bowen Ruan,Jun Wang,Shaoping Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Precision agriculture, enhanced by artificial intelligence (AI), offers promising tools such as remote sensing, intelligent irrigation, fertilization management, and crop simulation to improve agricultural efficiency and sustainability. Reinforcement learning (RL), in particular, has outperformed traditional methods in optimizing yields and resource management. However, widespread AI adoption is limited by gaps between algorithmic recommendations and farmers’ practical experience, local knowledge, and traditional practices. To address this, our study emphasizes Human-AI Interaction (HAII), focusing on transparency, usability, and trust in RL-based farm management. We employ a well-established trust framework - comprising ability, benevolence, and integrity - to develop a novel mathematical model quantifying farmers’ confidence in AI-based fertilization strategies. Surveys conducted with farmers for this research reveal critical misalignments, which are integrated into our trust model and incorporated into a multi-objective RL framework. Unlike prior methods, our approach embeds trust directly into policy optimization, ensuring AI recommendations are technically robust, economically feasible, context-aware, and socially acceptable. By aligning technical performance with human-centered trust, this research supports broader AI adoption in agriculture.
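The abstract names the three trust dimensions (ability, benevolence, integrity) but not the functional form of the model. A linear scalarization like the sketch below is one simple way such a trust term could enter a multi-objective reward; all weights here are illustrative assumptions.

```python
def trust_score(ability, benevolence, integrity, w=(0.4, 0.3, 0.3)):
    """Combine the three trust dimensions (each in [0, 1], e.g. survey-derived)."""
    return w[0] * ability + w[1] * benevolence + w[2] * integrity

def scalarized_reward(yield_gain, env_cost, trust, lam=(1.0, 0.5, 0.8)):
    """Multi-objective fertilization reward with trust embedded in the optimization."""
    return lam[0] * yield_gain - lam[1] * env_cost + lam[2] * trust
```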
zh
[AI-72] Attention-Based Reward Shaping for Sparse and Delayed Rewards
【速读】:该论文旨在解决现实世界中强化学习(Reinforcement Learning, RL)应用面临的稀疏且延迟的奖励函数问题。其解决方案的关键在于提出了一种通用且鲁棒的算法——基于注意力机制的奖励塑造(Attention-based REward Shaping, ARES),该算法利用Transformer的注意力机制生成塑造后的奖励,从而为任何环境创建密集奖励函数。ARES通过一组轨迹及其最终回报作为输入,能够在离线状态下训练,并在小数据集或由随机策略生成的轨迹上仍能生成有意义的塑造奖励,具有广泛的适用性和鲁棒性。
链接: https://arxiv.org/abs/2505.10802
作者: Ian Holmes,Min Chi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 17 tables, 2 figures. Code available online at this https URL
Abstract:Sparse and delayed reward functions pose a significant obstacle for real-world Reinforcement Learning (RL) applications. In this work, we propose Attention-based REward Shaping (ARES), a general and robust algorithm which uses a transformer’s attention mechanism to generate shaped rewards and create a dense reward function for any environment. ARES requires a set of episodes and their final returns as input. It can be trained entirely offline and is able to generate meaningful shaped rewards even when using small datasets or episodes produced by agents taking random actions. ARES is compatible with any RL algorithm and can handle any level of reward sparsity. In our experiments, we focus on the most challenging case where rewards are fully delayed until the end of each episode. We evaluate ARES across a diverse range of environments, widely used RL algorithms, and baseline methods to assess the effectiveness of the shaped rewards it produces. Our results show that ARES can significantly improve learning in delayed reward settings, enabling RL agents to train in scenarios that would otherwise require impractical amounts of data or even be unlearnable. To our knowledge, ARES is the first approach that works fully offline, remains robust to extreme reward delays and low-quality data, and is not limited to goal-based tasks.
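ARES takes whole episodes plus their final returns and outputs dense per-step rewards. A minimal sketch of that interface, assuming a return-redistribution objective (predicted step rewards should sum to the observed episode return), could look like this in PyTorch; the architecture details and loss are assumptions, and positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class RewardShaper(nn.Module):
    """Transformer that redistributes a delayed episode return over timesteps."""
    def __init__(self, obs_dim, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, obs_seq):                  # (batch, T, obs_dim)
        h = self.encoder(self.embed(obs_seq))    # attention over the episode
        return self.head(h).squeeze(-1)          # shaped reward per step: (batch, T)

def redistribution_loss(model, obs_seq, episode_return):
    """Shaped rewards should sum to the (fully delayed) episode return."""
    return ((model(obs_seq).sum(dim=1) - episode_return) ** 2).mean()
```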
zh
[AI-73] Analyzing Patterns and Influence of Advertising in Print Newspapers
【速读】:该论文试图解决印刷报纸中广告实践的系统性分析问题,旨在揭示广告主、广告内容、广告时间、广告位置及广告方式等关键特征。解决方案的关键在于构建了一个基于图像处理和光学字符识别(OCR)技术的数据驱动流程,以高精度从数字版印刷报纸中提取文章和广告信息,从而形成了包含超过12,000个版面、数以万计广告的大规模数据集。
链接: https://arxiv.org/abs/2505.10791
作者: N Harsha Vardhan,Ponnurangam Kumaraguru,Kiran Garimella
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: Accepted at COMPASS 2025
Abstract:This paper investigates advertising practices in print newspapers across India using a novel data-driven approach. We develop a pipeline employing image processing and OCR techniques to extract articles and advertisements from digital versions of print newspapers with high accuracy. Applying this methodology to five popular newspapers that span multiple regions and three languages, English, Hindi, and Telugu, we assembled a dataset of more than 12,000 editions containing several hundred thousand advertisements. Collectively, these newspapers reach a readership of over 100 million people. Using this extensive dataset, we conduct a comprehensive analysis to answer key questions about print advertising: who advertises, what they advertise, when they advertise, where they place their ads, and how they advertise. Our findings reveal significant patterns, including the consistent level of print advertising over the past six years despite declining print circulation, the overrepresentation of company ads on prominent pages, and the disproportionate revenue contributed by government ads. Furthermore, we examine whether advertising in a newspaper influences the coverage an advertiser receives. Through regression analyses on coverage volume and sentiment, we find strong evidence supporting this hypothesis for corporate advertisers. The results indicate a clear trend where increased advertising correlates with more favorable and extensive media coverage, a relationship that remains robust over time and across different levels of advertiser popularity.
zh
[AI-74] Neural-Inspired Advances in Integral Cryptanalysis
【速读】:该论文旨在解决传统密码分析方法在寻找最优积分区分器(integral distinguisher)时效率不足的问题,特别是针对轻量级分组密码SKINNY的密钥恢复攻击。其解决方案的关键在于利用神经网络学习与积分性质相关的特征,并将其整合到优化的搜索框架中,从而提升区分器的发现能力。通过引入中间相遇搜索框架,平衡了模型精度与计算效率,最终实现了对更多轮数的密钥恢复攻击,显著优于现有自动化搜索模型的性能。
链接: https://arxiv.org/abs/2505.10790
作者: Liu Zhang,Yiran Yao,Danping Shi,Dongchen Chai,Jian Guo,Zilong Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The study by Gohr at CRYPTO 2019 and subsequent related works have shown that neural networks can uncover previously unused features, offering novel insights into cryptanalysis. Motivated by these findings, we employ neural networks to learn features specifically related to integral properties and integrate the corresponding insights into optimized search frameworks. These findings validate the framework of using neural networks for feature exploration, providing researchers with novel insights that advance established cryptanalysis methods. Neural networks have inspired the development of more precise integral search models. By comparing the integral distinguishers obtained via neural networks with those identified by classical methods, we observe that existing automated search models often fail to find optimal distinguishers. To address this issue, we develop a meet-in-the-middle search framework that balances model accuracy and computational efficiency. As a result, we reduce the number of active plaintext bits required for an 11-round integral distinguisher on SKINNY64/64, and further identify a 12-round key-dependent integral distinguisher, achieving one additional round over the previous best-known result. The integral distinguishers discovered by neural networks enable key recovery attacks on more rounds. We identify a 7-round key-independent integral distinguisher from neural networks with only one active plaintext cell, which is based on linear combinations of bits. This distinguisher enables a 15-round key recovery attack on SKINNYn/n, improving upon the previous record by one round. Additionally, we discover an 8-round key-dependent integral distinguisher using neural networks that further reduces the time complexity of key recovery attacks against SKINNY.
zh
[AI-75] SECRET: Semi-supervised Clinical Trial Document Similarity Search
【速读】:该论文试图解决临床试验设计中因设计缺陷、疗效不足和安全性事件导致的延迟、经济损失和声誉损害等问题,其核心在于通过识别相似的历史临床试验来提供参考以优化新试验的设计。解决方案的关键是提出一种新的方法,通过总结临床试验方案并基于查询试验的方案搜索相似试验,从而有效提升试验相似性检索的准确性和召回率,该方法在召回率@1和精确率@1上分别比最佳基线提升了78%和53%。
链接: https://arxiv.org/abs/2505.10780
作者: Trisha Das,Afrah Shafquat,Mandis Beigi,Jacob Aptekar,Jimeng Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Clinical trials are vital for evaluation of safety and efficacy of new treatments. However, clinical trials are resource-intensive, time-consuming and expensive to conduct, where errors in trial design, reduced efficacy, and safety events can result in significant delays, financial losses, and damage to reputation. These risks underline the importance of informed and strategic decisions in trial design to mitigate these risks and improve the chances of a successful trial. Identifying similar historical trials is critical as these trials can provide an important reference for potential pitfalls and challenges including serious adverse events, dosage inaccuracies, recruitment difficulties, patient adherence issues, etc. Addressing these challenges in trial design can lead to development of more effective study protocols with optimized patient safety and trial efficiency. In this paper, we present a novel method to identify similar historical trials by summarizing clinical trial protocols and searching for similar trials based on a query trial’s protocol. Our approach significantly outperforms all baselines, achieving up to a 78% improvement in recall@1 and a 53% improvement in precision@1 over the best baseline. We also show that our method outperforms all other baselines in partial trial similarity search and zero-shot patient-trial matching, highlighting its superior utility in these tasks.
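Once trial protocols are summarized and embedded (the embedding model is not specified in the abstract and is assumed here), similarity search and the reported recall@1 metric reduce to a few lines of numpy:

```python
import numpy as np

def top1_similar(query_vec, corpus_vecs):
    """Index of the most similar historical trial by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

def recall_at_1(query_vecs, corpus_vecs, gold_indices):
    """Fraction of queries whose top-1 retrieved trial is the annotated match."""
    hits = sum(top1_similar(q, corpus_vecs) == g
               for q, g in zip(query_vecs, gold_indices))
    return hits / len(gold_indices)
```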
zh
[AI-76] Qualia Optimization
【速读】:该论文试图探讨当前或未来人工智能系统是否可能具有感受质(qualia),如疼痛或愉悦,并提出在评估AI系统时应将这些主观体验的质量与性能指标一同考虑。其解决方案的关键在于基于强化学习框架和心灵哲学理论,构建具体的数学问题设定,并通过初步方法和属性分析来优化问题设置,最终提出能够促进强化的策略。
链接: https://arxiv.org/abs/2505.10779
作者: Philip S. Thomas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Technical Report, College of Information and Computer Science, University of Massachusetts
Abstract:This report explores the speculative question: what if current or future AI systems have qualia, such as pain or pleasure? It does so by assuming that AI systems might someday possess qualia – and that the quality of these subjective experiences should be considered alongside performance metrics. Concrete mathematical problem settings, inspired by reinforcement learning formulations and theories from philosophy of mind, are then proposed and initial approaches and properties are presented. These properties enable refinement of the problem setting, culminating with the proposal of methods that promote reinforcement.
zh
[AI-77] Context-Aware Probabilistic Modeling with LLM for Multimodal Time Series Forecasting
【速读】:该论文旨在解决现有时间序列预测方法在有效整合外生文本与大语言模型(LLM)的概率特性方面的不足。当前方法要么通过简单的提示进行浅层文本-时间序列融合,要么依赖与LLM的token生成范式冲突的确定性数值解码,从而限制了上下文感知能力和分布建模。其解决方案的关键在于提出CAPTime,一种基于上下文感知的概率多模态时间序列预测方法,该方法利用文本引导的抽象和自回归LLM解码,通过预训练的时间序列编码器提取时序模式,并通过可学习的交互将它们与文本上下文对齐,生成联合多模态表示,结合分布专家混合与冻结的LLM,实现上下文感知的概率预测。
链接: https://arxiv.org/abs/2505.10774
作者: Yueyang Yao,Jiajun Li,Xingyuan Dai,MengMeng Zhang,Xiaoyan Gong,Fei-Yue Wang,Yisheng Lv
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 2 figures
Abstract:Time series forecasting is important for applications spanning energy markets, climate analysis, and traffic management. However, existing methods struggle to effectively integrate exogenous texts and align them with the probabilistic nature of large language models (LLMs). Current approaches either employ shallow text-time series fusion via basic prompts or rely on deterministic numerical decoding that conflicts with LLMs’ token-generation paradigm, which limits contextual awareness and distribution modeling. To address these limitations, we propose CAPTime, a context-aware probabilistic multimodal time series forecasting method that leverages text-informed abstraction and autoregressive LLM decoding. Our method first encodes temporal patterns using a pretrained time series encoder, then aligns them with textual contexts via learnable interactions to produce joint multimodal representations. By combining a mixture of distribution experts with frozen LLMs, we enable context-aware probabilistic forecasting while preserving LLMs’ inherent distribution modeling capabilities. Experiments on diverse time series forecasting tasks demonstrate the superior accuracy and generalization of CAPTime, particularly in multimodal scenarios. Additional analysis highlights its robustness in data-scarce scenarios through hybrid probabilistic decoding.
zh
[AI-78] Geofenced Unmanned Aerial Robotic Defender for Deer Detection and Deterrence (GUARD) ICRA
【速读】:该论文试图解决野生动物(尤其是鹿类)对农作物造成的损害问题,传统驱赶方法在可扩展性、响应速度和适应不同农田环境方面存在不足。解决方案的关键在于开发一种集成的无人飞行器(UAV)系统,该系统结合了基于YOLO的实时计算机视觉模块用于鹿类检测、节能的覆盖路径规划算法以提高田间监测效率,并配备自主充电站以实现UAV的持续运行。
链接: https://arxiv.org/abs/2505.10770
作者: Ebasa Temesgen,Mario Jerez,Greta Brown,Graham Wilson,Sree Ganesh Lalitaditya Divakarla,Sarah Boelter,Oscar Nelson,Robert McPherson,Maria Gini
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Accepted to the Novel Approaches for Precision Agriculture and Forestry with Autonomous Robots IEEE ICRA Workshop - 2025
Abstract:Wildlife-induced crop damage, particularly from deer, threatens agricultural productivity. Traditional deterrence methods often fall short in scalability, responsiveness, and adaptability to diverse farmland environments. This paper presents an integrated unmanned aerial vehicle (UAV) system designed for autonomous wildlife deterrence, developed as part of the Farm Robotics Challenge. Our system combines a YOLO-based real-time computer vision module for deer detection, an energy-efficient coverage path planning algorithm for efficient field monitoring, and an autonomous charging station for continuous operation of the UAV. In collaboration with a local Minnesota farmer, the system is tailored to address practical constraints such as terrain, infrastructure limitations, and animal behavior. The solution is evaluated through a combination of simulation and field testing, demonstrating robust detection accuracy, efficient coverage, and extended operational time. The results highlight the feasibility and effectiveness of drone-based wildlife deterrence in precision agriculture, offering a scalable framework for future deployment and extension.
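The detection module is YOLO-based; with the ultralytics package, a single-class deer detector can be queried as below. The weight file name is hypothetical (a model fine-tuned on deer imagery is assumed):

```python
from ultralytics import YOLO

# Hypothetical weights: a YOLOv8 model fine-tuned on deer imagery.
model = YOLO("deer_yolov8n.pt")

def detect_deer(frame, conf=0.5):
    """Return [x1, y1, x2, y2] boxes for detections above the confidence cut."""
    results = model.predict(frame, conf=conf, verbose=False)
    return [box.xyxy[0].tolist() for box in results[0].boxes]
```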
zh
[AI-79] Code-Driven Planning in Grid Worlds with Large Language Models
【速读】:该论文旨在解决基于网格的任务中生成可解释智能体策略的问题,其核心挑战在于如何有效地合成能够映射环境状态到动作序列的代码形式策略。解决方案的关键在于提出一种迭代程序规划(Iterative Programmatic Planning, IPP)框架,该框架利用大型语言模型(Large Language Models, LLMs)进行代码生成作为策略合成手段,并结合多种提示策略与迭代优化机制,通过任务执行反馈不断改进代码,从而提升任务求解性能。
链接: https://arxiv.org/abs/2505.10749
作者: Ashwath Vaithinathan Aravindan,Zhisheng Tang,Mayank Kejriwal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We propose an iterative programmatic planning (IPP) framework for solving grid-based tasks by synthesizing interpretable agent policies expressed in code using large language models (LLMs). Instead of relying on traditional search or reinforcement learning, our approach uses code generation as policy synthesis, where the LLM outputs executable programs that map environment states to action sequences. Our proposed architecture incorporates several prompting strategies, including direct code generation, pseudocode-conditioned refinement, and curriculum-based prompting, but also includes an iterative refinement mechanism that updates code based on task performance feedback. We evaluate our approach using six leading LLMs and two challenging grid-based benchmarks (GRASP and MiniGrid). Our IPP framework demonstrates improvements over direct code generation ranging from 10% to as much as 10x across five of the six models and establishes a new state-of-the-art result for GRASP. IPP is found to significantly outperform direct elicitation of a solution from GPT-o3-mini (by 63% on MiniGrid to 116% on GRASP), demonstrating the viability of the overall approach. Computational costs of all code generation approaches are similar. While code generation has a higher initial prompting cost compared to direct solution elicitation ($0.08 per task vs. $0.002 per instance for GPT-o3-mini), the code can be reused for any number of instances, making the amortized cost significantly lower (by 400x on GPT-o3-mini across the complete GRASP benchmark).
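The iterative refinement loop, generate a policy as code, execute it, and fold performance feedback into the next prompt, can be sketched as follows. `llm_generate` and `task.run` are hypothetical stand-ins for the LLM call and the benchmark harness:

```python
def iterative_programmatic_planning(llm_generate, tasks, max_iters=5):
    """Sketch of IPP: code generation as policy synthesis, refined on feedback."""
    prompt = "Write a Python function policy(state) returning a list of actions."
    best_code, best_score = None, float("-inf")
    for _ in range(max_iters):
        code = llm_generate(prompt)
        namespace = {}
        try:
            exec(code, namespace)                       # the policy IS the program
            policy = namespace["policy"]
            score = sum(task.run(policy) for task in tasks) / len(tasks)
        except Exception as err:                        # broken program: ask for a fix
            prompt += f"\nThe previous program failed with: {err}. Fix it."
            continue
        if score > best_score:
            best_code, best_score = code, score
        prompt += f"\nThe previous program scored {score:.2f}. Improve it."
    return best_code, best_score
```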
zh
[AI-80] ChestyBot: Detecting and Disrupting Chinese Communist Party Influence Stratagems
【速读】:该论文试图解决外国信息操作(foreign information operations)在实时检测与缓解方面的不足,特别是针对俄罗斯和中国行为体利用美国宽松的信息环境进行的恶意影响活动。解决方案的关键在于提出ChestyBot,这是一个基于语用学(pragmatics-based)的语言模型,能够以高达98.34%的准确率检测未标记的外国恶意影响推文,并支持一种新型框架,在外国影响活动形成初期进行干扰。
链接: https://arxiv.org/abs/2505.10746
作者: Matthew Stoffolano,Ayush Rout,Justin M. Pelletier
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
备注: Presented at USCYBERCOM Cyber Recon Symposium 2023 at DreamPort in Columbia, MD on April 20, 2023
Abstract:Foreign information operations conducted by Russian and Chinese actors exploit the United States’ permissive information environment. These campaigns threaten democratic institutions and the broader Westphalian model. Yet, existing detection and mitigation strategies often fail to identify active information campaigns in real time. This paper introduces ChestyBot, a pragmatics-based language model that detects unlabeled foreign malign influence tweets with up to 98.34% accuracy. The model supports a novel framework to disrupt foreign influence operations in their formative stages.
zh
[AI-81] Evaluations at Work: Measuring the Capabilities of GenAI in Use
【速读】:该论文试图解决当前人工智能评估基准未能反映人机协作中复杂、多轮交互特性的问题。其解决方案的关键在于构建一个将实际任务分解为相互依赖子任务的评估框架,从而能够追踪大型语言模型(LLM)的表现及用户策略在对话中的演变。该框架还结合了一套综合指标,包括语义相似性、词重叠和数值匹配的复合使用度、结构连贯性、单轮内多样性以及反映AI输出与用户工作知识对齐程度的“信息前沿”新度量标准。通过在财务估值任务中的实证研究,论文揭示了LLM生成内容整合对输出质量的影响因素,并提出了对人机协作评估的更全面方法。
链接: https://arxiv.org/abs/2505.10742
作者: Brandon Lepine,Gawesha Weerantunga,Juho Kim,Pamela Mishkin,Matthew Beane
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Current AI benchmarks miss the messy, multi-turn nature of human-AI collaboration. We present an evaluation framework that decomposes real-world tasks into interdependent subtasks, letting us track both LLM performance and users’ strategies across a dialogue. Complementing this framework, we develop a suite of metrics, including a composite usage measure derived from semantic similarity, word overlap, and numerical matches; structural coherence; intra-turn diversity; and a novel measure of the “information frontier” reflecting the alignment between AI outputs and users’ working knowledge. We demonstrate our methodology in a financial valuation task that mirrors real-world complexity. Our empirical findings reveal that while greater integration of LLM-generated content generally enhances output quality, its benefits are moderated by factors such as response incoherence, excessive subtask diversity, and the distance of provided information from users’ existing knowledge. These results suggest that proactive dialogue strategies designed to inject novelty may inadvertently undermine task performance. Our work thus advances a more holistic evaluation of human-AI collaboration, offering both a robust methodological framework and actionable insights for developing more effective AI-augmented work processes.
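Two of the named metric components, word overlap and numerical matches, are directly computable; the semantic-similarity component would come from an embedding model and is passed in here as a given. The combination weights are illustrative, not the paper's:

```python
import re

def word_overlap(a, b):
    """Jaccard overlap between the word sets of two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def numerical_match(a, b):
    """Share of numbers in the user's text that also appear in the AI output."""
    na = set(re.findall(r"\d+(?:\.\d+)?", a))
    nb = set(re.findall(r"\d+(?:\.\d+)?", b))
    return len(na & nb) / len(na) if na else 0.0

def composite_usage(user_text, ai_text, semantic_sim, w=(0.4, 0.3, 0.3)):
    """Composite usage measure over the three signals (weights are assumptions)."""
    return (w[0] * semantic_sim
            + w[1] * word_overlap(user_text, ai_text)
            + w[2] * numerical_match(user_text, ai_text))
```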
zh
[AI-82] Automating Security Audit Using Large Language Model based Agent : An Exploration Experiment
【速读】:该论文试图解决传统安全审计过程中存在的效率低下和成本高昂的问题,尤其是在Windows操作系统中检查密码策略合规性方面。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)作为自主代理,通过构建一个框架来自动化执行部分安全审计任务,如密码策略合规性检查,从而提高审计效率和准确性。
链接: https://arxiv.org/abs/2505.10732
作者: Jia Hui Chin,Pu Zhang,Yu Xin Cheong,Jonathan Pan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:In the current rapidly changing digital environment, businesses are under constant stress to ensure that their systems are secure. Security audits help to maintain a strong security posture by ensuring that policies are in place, controls are implemented, and gaps are identified for cybersecurity risk mitigation. However, audits are usually manual, requiring much time and cost. This paper looks at the possibility of developing a framework to leverage Large Language Models (LLMs) as an autonomous agent to execute part of the security audit, namely the field audit of password policy compliance for the Windows operating system. Through an exploration experiment using GPT-4 with LangChain, the agent executed the audit tasks by accurately flagging password policy violations and appeared to be more efficient than traditional manual audits. Despite its potential limitations in operational consistency in complex and dynamic environments, the framework suggests possibilities to extend further to real-time threat monitoring and compliance checks.
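The deterministic core that such an agent would orchestrate, reading Windows password settings and flagging violations, can be expressed without any LLM. The thresholds below are illustrative policy values, and the parsing assumes the usual `net accounts` output layout:

```python
import subprocess

# Illustrative policy thresholds; a real audit loads these from the org's baseline.
POLICY = {"minimum password length": 12, "maximum password age (days)": 90}

def audit_password_policy():
    """Run `net accounts` (Windows) and report settings violating POLICY."""
    out = subprocess.run(["net", "accounts"], capture_output=True, text=True).stdout
    violations = []
    for line in out.splitlines():
        lowered = line.lower()
        for setting, required in POLICY.items():
            if lowered.startswith(setting):
                try:
                    value = int(line.split()[-1])
                except ValueError:        # e.g. "Unlimited" or "Never"
                    continue
                bad = value < required if "length" in setting else value > required
                if bad:
                    violations.append(f"{setting}: {value} (required: {required})")
    return violations
```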
zh
[AI-83] Learning Repetition-Invariant Representations for Polymer Informatics
【速读】:该论文试图解决现有图神经网络方法在处理聚合物时仅建模单个重复单元,无法生成与重复单元数量无关的一致向量表示的问题(Graph Neural Network methods fail to produce consistent vector representations for true polymer structures with varying numbers of units)。解决方案的关键在于引入一种新的方法——图重复不变性(Graph Repetition Invariance, GRIN),该方法通过结合基于图的最大生成树对齐与重复单元增强,确保结构一致性,并从模型和数据角度提供重复不变性的理论保障。
链接: https://arxiv.org/abs/2505.10726
作者: Yihan Zhu,Gang Liu,Eric Inae,Tengfei Luo,Meng Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 3 figures
Abstract:Polymers are large macromolecules composed of repeating structural units known as monomers and are widely applied in fields such as energy storage, construction, medicine, and aerospace. However, existing graph neural network methods, though effective for small molecules, only model the single unit of polymers and fail to produce consistent vector representations for the true polymer structure with varying numbers of units. To address this challenge, we introduce Graph Repetition Invariance (GRIN), a novel method to learn polymer representations that are invariant to the number of repeating units in their graph representations. GRIN integrates a graph-based maximum spanning tree alignment with repeat-unit augmentation to ensure structural consistency. We provide theoretical guarantees for repetition-invariance from both model and data perspectives, demonstrating that three repeating units are the minimal augmentation required for optimal invariant representation learning. GRIN outperforms state-of-the-art baselines on both homopolymer and copolymer benchmarks, learning stable, repetition-invariant representations that generalize effectively to polymer chains of unseen sizes.
zh
[AI-84] GNN-Suite: a Graph Neural Network Benchmarking Framework for Biomedical Informatics
【速读】:该论文旨在解决在计算生物学中构建和评估图神经网络(Graph Neural Network, GNN)架构的标准化与可重复性问题,其核心挑战在于如何有效整合多源生物数据并公平比较不同GNN模型的性能。解决方案的关键是提出GNN-Suite,一个基于Nextflow工作流的模块化框架,通过统一的实验设置和评价标准,实现对多种GNN架构(如GAT、GCN2、GraphSAGE等)以及逻辑回归(Logistic Regression, LR)基线模型的系统性评估,从而促进网络结构学习在识别癌症驱动基因等任务中的应用。
链接: https://arxiv.org/abs/2505.10711
作者: Sebestyén Kamp,Giovanni Stracquadanio,T. Ian Simpson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Main article 8 pages (20 in total with supplementary information included), 3 main article figures and 3 supplemental figures
Abstract:We present GNN-Suite, a robust modular framework for constructing and benchmarking Graph Neural Network (GNN) architectures in computational biology. GNN-Suite standardises experimentation and reproducibility using the Nextflow workflow to evaluate GNN performance. We demonstrate its utility in identifying cancer-driver genes by constructing molecular networks from protein-protein interaction (PPI) data from STRING and BioGRID and annotating nodes with features from the PCAWG, PID, and COSMIC-CGC repositories. Our design enables fair comparisons among diverse GNN architectures including GAT, GAT3H, GCN, GCN2, GIN, GTN, HGCN, PHGCN, and GraphSAGE and a baseline Logistic Regression (LR) model. All GNNs were configured as standardised two-layer models and trained with uniform hyperparameters (dropout = 0.2; Adam optimiser with learning rate = 0.01; and an adjusted binary cross-entropy loss to address class imbalance) over an 80/20 train-test split for 300 epochs. Each model was evaluated over 10 independent runs with different random seeds to yield statistically robust performance metrics, with balanced accuracy (BACC) as the primary measure. Notably, GCN2 achieved the highest BACC (0.807 +/- 0.035) on a STRING-based network, although all GNN types outperformed the LR baseline, highlighting the advantage of network-based learning over feature-only approaches. Our results show that a common framework for implementing and evaluating GNN architectures aids in identifying not only the best model but also the most effective means of incorporating complementary data. By making GNN-Suite publicly available, we aim to foster reproducible research and promote improved benchmarking standards in computational biology. Future work will explore additional omics datasets and further refine network architectures to enhance predictive accuracy and interpretability in biomedical applications.
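The standardized setup reported in the abstract (two layers, dropout 0.2, Adam at lr 0.01, class-weighted binary cross-entropy, 300 epochs) maps directly onto a PyTorch Geometric model; the hidden width of 64 is an assumption, as it is not stated:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class TwoLayerGCN(torch.nn.Module):
    """Two-layer GCN in the GNN-Suite standard configuration (hidden size assumed)."""
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, 1)      # one driver-gene logit per node

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        h = F.dropout(h, p=0.2, training=self.training)   # dropout = 0.2
        return self.conv2(h, edge_index).squeeze(-1)

def train(model, data, pos_weight, epochs=300):
    opt = torch.optim.Adam(model.parameters(), lr=0.01)   # Adam, lr = 0.01
    loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)  # imbalance-adjusted
    for _ in range(epochs):
        opt.zero_grad()
        logits = model(data.x, data.edge_index)
        loss = loss_fn(logits[data.train_mask], data.y[data.train_mask].float())
        loss.backward()
        opt.step()
```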
zh
[AI-85] Embodied AI in Machine Learning – is it Really Embodied?
【速读】:该论文试图解决当前人工智能驱动的机器人在具身性(embodied)方面存在的局限性,即这些机器人仅具备较弱的具身性,并继承了传统“经典人工智能”(GOFAI)的一些问题。论文的关键解决方案在于探讨跨具身学习(cross-embodiment learning)的可能性,并识别其根本障碍,进而提出推动该领域发展的方向。
链接: https://arxiv.org/abs/2505.10705
作者: Matej Hoffmann,Shubhan Parag Patni
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
备注: 16 pages, 3 figures
Abstract:Embodied Artificial Intelligence (Embodied AI) is gaining momentum in the machine learning communities with the goal of leveraging current progress in AI (deep learning, transformers, large language and visual-language models) to empower robots. In this chapter we put this work in the context of “Good Old-Fashioned Artificial Intelligence” (GOFAI) (Haugeland, 1989) and the behavior-based or embodied alternatives (R. A. Brooks 1991; Pfeifer and Scheier 2001). We claim that the AI-powered robots are only weakly embodied and inherit some of the problems of GOFAI. Moreover, we review and critically discuss the possibility of cross-embodiment learning (Padalkar et al. 2024). We identify fundamental roadblocks and propose directions on how to make progress.
zh
[AI-86] Predicting Human Behavior in Autonomous Systems: A Collaborative Machine Teaching Approach for Reducing Transfer of Control Events
【速读】:该论文试图解决自主系统在故障处理中因不必要的控制权转移(Transfer of Control, ToC)导致的可靠性与效率下降问题。解决方案的关键在于利用人类交互数据训练AI模型,使其能够提前识别问题或协助用户解决,从而减少非关键情况下的ToC事件。研究采用基于长短期记忆网络(LSTM)的模型,通过模拟工业吸尘器的交互工具收集数据,验证了非专家数据在训练模型中的有效性,展示了AI从人类问题解决行为中学习的潜力,以补充传感器数据提升工业自动化与人机协作性能。
链接: https://arxiv.org/abs/2505.10695
作者: Julian Wolter,Amr Gomaa
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:As autonomous systems become integral to various industries, effective strategies for fault handling are essential to ensure reliability and efficiency. Transfer of Control (ToC), a traditional approach for interrupting automated processes during faults, is often triggered unnecessarily in non-critical situations. To address this, we propose a data-driven method that uses human interaction data to train AI models capable of preemptively identifying and addressing issues or assisting users in resolution. Using an interactive tool simulating an industrial vacuum cleaner, we collected data and developed an LSTM-based model to predict user behavior. Our findings reveal that even data from non-experts can effectively train models to reduce unnecessary ToC events, enhancing the system’s robustness. This approach highlights the potential of AI to learn directly from human problem-solving behaviors, complementing sensor data to improve industrial automation and human-AI collaboration.
zh
[AI-87] Towards an LLM-powered Social Digital Twinning Platform
【速读】:该论文试图解决复杂适应性社会系统中对潜在“如果…会怎样”情景进行探索的问题,特别是在社会干预措施设计与评估方面的挑战。解决方案的关键在于构建一个名为Social Digital Twinner的创新社会仿真工具,其核心由三部分组成:包含真实世界数据和多维代表性合成人口的数据基础设施、基于大语言模型(LLM)的基于代理的仿真引擎,以及支持自然语言交互的用户界面。该工具通过实时互动和协作,使利益相关者能够共同设计、测试和优化干预措施,从而推动数据驱动和证据为基础的社会问题解决方法。
链接: https://arxiv.org/abs/2505.10681
作者: Önder Gürcan,Vanja Falck,Markus G. Rousseau,Larissa L. Lima
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 13 pages, 3 figures, 23rd International Conference on Practical applications of Agents and Multi-Agent Systems (PAAMS 2025)
Abstract:We present Social Digital Twinner, an innovative social simulation tool for exploring plausible effects of what-if scenarios in complex adaptive social systems. The architecture is composed of three seamlessly integrated parts: a data infrastructure featuring real-world data and a multi-dimensionally representative synthetic population of citizens, an LLM-enabled agent-based simulation engine, and a user interface that enables intuitive, natural language interactions with the simulation engine and the artificial agents (i.e. citizens). Social Digital Twinner facilitates real-time engagement and empowers stakeholders to collaboratively design, test, and refine intervention measures. The approach promotes a data-driven and evidence-based approach to societal problem-solving. We demonstrate the tool’s interactive capabilities by addressing the critical issue of youth school dropouts in Kragerø, Norway, showcasing its ability to create and execute a dedicated social digital twin using natural language.
zh
[AI-88] A Conformal Predictive Measure for Assessing Catastrophic Forgetting
【速读】:该论文试图解决持续学习中的灾难性遗忘(catastrophic forgetting, CF)评估问题,旨在提供一种有效量化和评估CF的方法。解决方案的关键在于提出了一种基于共形预测(conformal prediction, CP)的度量指标——共形预测置信因子(Conformal Prediction Confidence Factor, CPCF),通过自适应共形预测动态监测模型对先前任务的置信度变化,从而实现对CF的实时监控与测量。
链接: https://arxiv.org/abs/2505.10677
作者: Ioannis Pitsiorlas,Nour Jamoussi,Marios Kountouris
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This work introduces a novel methodology for assessing catastrophic forgetting (CF) in continual learning. We propose a new conformal prediction (CP)-based metric, termed the Conformal Prediction Confidence Factor (CPCF), to quantify and evaluate CF effectively. Our framework leverages adaptive CP to estimate forgetting by monitoring the model’s confidence on previously learned tasks. This approach provides a dynamic and practical solution for monitoring and measuring CF of previous tasks as new ones are introduced, offering greater suitability for real-world applications. Experimental results on four benchmark datasets demonstrate a strong correlation between CPCF and the accuracy of previous tasks, validating the reliability and interpretability of the proposed metric. Our results highlight the potential of CPCF as a robust and effective tool for assessing and understanding CF in dynamic learning environments.
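The abstract does not define CPCF precisely, but split conformal prediction supplies the building blocks: calibrate a nonconformity threshold on a task, then track how many of that task's samples still conform as training continues. The confidence-factor definition below is therefore an illustrative stand-in:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile of nonconformity scores (higher = less conforming)."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q)

def confidence_factor(task_scores, threshold):
    """Fraction of a previous task's samples still conforming after new training.

    A drop in this value over time signals growing catastrophic forgetting.
    """
    return float(np.mean(np.asarray(task_scores) <= threshold))
```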
zh
[AI-89] Interpretable Risk Mitigation in LLM Agent Systems
【速读】:该论文试图解决由大型语言模型(Large Language Models, LLMs)驱动的自主代理在需要负责任行为的领域中因模型固有的不可预测性而导致的安全性和可靠性问题。其解决方案的关键在于通过一种与具体游戏和提示无关的策略修改方法,利用从稀疏自编码器潜在空间中提取的可解释特征来引导残差流,从而调整代理的行为。实验结果显示,使用善意协商特征进行引导可降低平均背叛概率28个百分点,并识别出多个开源LLM代理的可行引导范围。
链接: https://arxiv.org/abs/2505.10670
作者: Jan Chojnacki
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
备注:
Abstract:Autonomous agents powered by large language models (LLMs) enable novel use cases in domains where responsible action is increasingly important. Yet the inherent unpredictability of LLMs raises safety concerns about agent reliability. In this work, we explore agent behaviour in a toy, game-theoretic environment based on a variation of the Iterated Prisoner’s Dilemma. We introduce a strategy-modification method, independent of both the game and the prompt, by steering the residual stream with interpretable features extracted from a sparse autoencoder latent space. Steering with the good-faith negotiation feature lowers the average defection probability by 28 percentage points. We also identify feasible steering ranges for several open-source LLM agents. Finally, we hypothesise that game-theoretic evaluation of LLM agents, combined with representation-steering alignment, can generalise to real-world applications on end-user devices and embodied platforms.
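Residual-stream steering with an SAE feature amounts to adding a scaled feature direction inside a forward hook. The sketch below assumes a decoder-only HuggingFace-style model whose layer outputs are tuples; the layer index and scale are placeholders:

```python
import torch

def make_steering_hook(direction, alpha=4.0):
    """Forward hook adding a scaled, normalized SAE feature to the residual stream."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        if isinstance(output, tuple):            # decoder layers return tuples
            steered = output[0] + alpha * direction.to(output[0].dtype)
            return (steered,) + output[1:]
        return output + alpha * direction.to(output.dtype)
    return hook

# Hypothetical usage on a mid-network layer of an open-source LLM agent:
# handle = model.model.layers[12].register_forward_hook(
#     make_steering_hook(good_faith_direction))
# ...generate as usual, then handle.remove() to restore default behaviour.
```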
zh
[AI-90] Seasonal Forecasting of Pan-Arctic Sea Ice with State Space Model WWW
【速读】:该论文旨在解决北极海冰浓度季节性预测的准确性问题,特别是在长期预测中面临动态模型计算成本高和深度学习模型难以处理复杂海冰动力学及不确定性的问题。解决方案的关键在于提出一种名为IceMamba的深度学习架构,该架构在状态空间模型中集成了先进的注意力机制,从而提升了对季节性变化和不确定性的建模能力。
链接: https://arxiv.org/abs/2505.10665
作者: Wei Wang,Weidong Yang,Lei Wang,Guihua Wang,Ruibo Lei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper is published in npj Climate and Atmospheric Science: this https URL Supplementary information: this https URL
Abstract:The rapid decline of Arctic sea ice resulting from anthropogenic climate change poses significant risks to indigenous communities, ecosystems, and the global climate system. This situation emphasizes the immediate necessity for precise seasonal sea ice forecasts. While dynamical models perform well for short-term forecasts, they encounter limitations in long-term forecasts and are computationally intensive. Deep learning models, while more computationally efficient, often have difficulty managing seasonal variations and uncertainties when dealing with complex sea ice dynamics. In this research, we introduce IceMamba, a deep learning architecture that integrates sophisticated attention mechanisms within the state space model. Through comparative analysis of 25 renowned forecast models, including dynamical, statistical, and deep learning approaches, our experimental results indicate that IceMamba delivers excellent seasonal forecasting capabilities for Pan-Arctic sea ice concentration. Specifically, IceMamba outperforms all tested models regarding average RMSE and anomaly correlation coefficient (ACC) and ranks second in Integrated Ice Edge Error (IIEE). This innovative approach enhances our ability to foresee and alleviate the effects of sea ice variability, offering essential insights for strategies aimed at climate adaptation.
zh
[AI-91] On the Evaluation of Engineering Artificial General Intelligence
【速读】:该论文试图解决如何评估工程人工智能通用智能(eAGI)代理的问题,其核心挑战在于如何构建一个能够全面衡量eAGI代理在物理系统及其控制器工程中表现的评价框架。解决方案的关键在于提出一个可扩展的评价框架,该框架将布鲁姆分类法(Bloom’s Taxonomy)专门化并嵌入到工程设计语境中,从而实现从方法论知识到实际设计问题的丰富评价维度,并支持对文本响应及结构化设计成果(如CAD模型和SysML模型)的评估,同时提供一种可自动化的程序以适应不同的工程场景。
链接: https://arxiv.org/abs/2505.10653
作者: Sandeep Neema,Susmit Jha,Adam Nagel,Ethan Lew,Chandrasekar Sureshkumar,Aleksa Gordic,Chase Shimmin,Hieu Nguyen,Paul Eremenko
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages
Abstract:We discuss the challenges and propose a framework for evaluating engineering artificial general intelligence (eAGI) agents. We consider eAGI as a specialization of artificial general intelligence (AGI), deemed capable of addressing a broad range of problems in the engineering of physical systems and associated controllers. We exclude software engineering for a tractable scoping of eAGI and expect dedicated software engineering AI agents to address the software implementation challenges. Similar to human engineers, eAGI agents should possess a unique blend of background knowledge (recall and retrieve) of facts and methods, demonstrate familiarity with tools and processes, exhibit deep understanding of industrial components and well-known design families, and be able to engage in creative problem solving (analyze and synthesize), transferring ideas acquired in one context to another. Given this broad mandate, evaluating and qualifying the performance of eAGI agents is a challenge in itself and, arguably, a critical enabler to developing eAGI agents. In this paper, we address this challenge by proposing an extensible evaluation framework that specializes and grounds Bloom’s taxonomy - a framework for evaluating human learning that has also been recently used for evaluating LLMs - in an engineering design context. Our proposed framework advances the state of the art in benchmarking and evaluation of AI agents in terms of the following: (a) developing a rich taxonomy of evaluation questions spanning from methodological knowledge to real-world design problems; (b) motivating a pluggable evaluation framework that can evaluate not only textual responses but also evaluate structured design artifacts such as CAD models and SysML models; and (c) outlining an automatable procedure to customize the evaluation benchmark to different engineering contexts.
zh
[AI-92] The Hitchhiker’s Guide to Production-ready Trustworthy Foundation Model powered Software (FMware)
【速读】:该论文旨在解决如何构建可靠且可部署的生成式 AI (Generative AI) 软件系统(即 FMware)所面临的挑战,包括模型选择、领域数据对齐、提示工程以及自主代理编排等问题。其解决方案的关键在于结合理论研究与工业实践经验,提出可操作的见解和技术路线图,以应对从原型演示到生产系统的复杂过渡过程中的系统测试、优化、部署及与遗留软件的集成问题。
链接: https://arxiv.org/abs/2505.10640
作者: Kirill Vasilevski,Benjamin Rombaut,Gopi Krishnan Rajbahadur,Gustavo A. Oliva,Keheliya Gallaba,Filipe R. Cogo,Jiahuei (Justina) Lin,Dayi Lin,Haoxiang Zhang,Bouyan Chen,Kishanthan Thangarajah,Ahmed E. Hassan,Zhen Ming (Jack) Jiang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Foundation Models (FMs) such as Large Language Models (LLMs) are reshaping the software industry by enabling FMware, systems that integrate these FMs as core components. In this KDD 2025 tutorial, we present a comprehensive exploration of FMware that combines a curated catalogue of challenges with real-world production concerns. We first discuss the state of research and practice in building FMware. We further examine the difficulties in selecting suitable models, aligning high-quality domain-specific data, engineering robust prompts, and orchestrating autonomous agents. We then address the complex journey from impressive demos to production-ready systems by outlining issues in system testing, optimization, deployment, and integration with legacy software. Drawing on our industrial experience and recent research in the area, we provide actionable insights and a technology roadmap for overcoming these challenges. Attendees will gain practical strategies to enable the creation of trustworthy FMware in the evolving technology landscape.
zh
[AI-93] Agent Name Service (ANS): A Universal Directory for Secure AI Agent Discovery and Interoperability
【速读】:该论文试图解决多智能体系统中安全发现机制不足的问题,具体表现为缺乏公开的智能体发现框架。其解决方案的关键在于提出了一种基于DNS的新型架构——代理名称服务(Agent Name Service, ANS),该架构通过公钥基础设施(Public Key Infrastructure, PKI)证书实现可验证的代理身份与信任机制,并结合协议无关的注册基础设施,支持多种通信标准,从而确保智能体的安全发现与交互。
链接: https://arxiv.org/abs/2505.10609
作者: Ken Huang,Vineeth Sai Narajala,Idan Habler,Akram Sheriff
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
备注: 15 pages, 6 figures, 6 code listings, Supported and endorsed by OWASP GenAI ASI Project
Abstract:The proliferation of AI agents requires robust mechanisms for secure discovery. This paper introduces the Agent Name Service (ANS), a novel architecture based on DNS addressing the lack of a public agent discovery framework. ANS provides a protocol-agnostic registry infrastructure that leverages Public Key Infrastructure (PKI) certificates for verifiable agent identity and trust. The architecture features several key innovations: a formalized agent registration and renewal mechanism for lifecycle management; DNS-inspired naming conventions with capability-aware resolution; a modular Protocol Adapter Layer supporting diverse communication standards (A2A, MCP, ACP etc.); and precisely defined algorithms for secure resolution. We implement structured communication using JSON Schema and conduct a comprehensive threat analysis of our proposal. The result is a foundational directory service addressing the core challenges of secured discovery and interaction in multi-agent systems, paving the way for future interoperable, trustworthy, and scalable agent ecosystems.
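A registration record plus a capability-aware resolver conveys the flavor of ANS; all field names below are illustrative, not the specification's JSON Schema, and the PKI certificate check is elided:

```python
import json

# Hypothetical ANS record: DNS-inspired name, declared protocol and capabilities,
# and a PKI certificate for verifiable identity (verification elided here).
record = {
    "name": "translator.text.example-agents",
    "protocol": "A2A",
    "capabilities": ["translate:en-fr", "translate:fr-en"],
    "endpoint": "https://agents.example.com/translator",
    "certificate_pem": "-----BEGIN CERTIFICATE-----...-----END CERTIFICATE-----",
    "ttl_seconds": 86400,
}

def resolve(registry, capability, protocol):
    """Capability-aware resolution: return records matching capability + protocol."""
    return [r for r in registry
            if capability in r["capabilities"] and r["protocol"] == protocol]

print(json.dumps(resolve([record], "translate:en-fr", "A2A"), indent=2))
```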
zh
[AI-94] MONAQ: Multi-Objective Neural Architecture Querying for Time-Series Analysis on Resource-Constrained Devices
【速读】:该论文旨在解决在资源受限硬件上进行高效时间序列分析的问题,特别是在边缘部署场景下的通用时间序列分析任务。现有研究虽已关注硬件感知的神经网络架构搜索(Hardware-aware Neural Architecture Search, NAS),但尚未针对此类任务提出有效方案。论文提出的解决方案关键在于引入MONAQ框架,该框架将NAS重构为多目标神经网络架构查询(Multi-Objective Neural Architecture Querying)任务,结合多模态查询生成与基于大语言模型(Large Language Model, LLM)的多目标搜索策略,通过代码生成实现可部署模型,从而在保持高效性的同时提升模型性能。
链接: https://arxiv.org/abs/2505.10607
作者: Patara Trirat,Jae-Gil Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Code will be available at this https URL
Abstract:The growing use of smartphones and IoT devices necessitates efficient time-series analysis on resource-constrained hardware, which is critical for sensing applications such as human activity recognition and air quality prediction. Recent efforts in hardware-aware neural architecture search (NAS) automate architecture discovery for specific platforms; however, none focus on general time-series analysis with edge deployment. Leveraging the problem-solving and reasoning capabilities of large language models (LLM), we propose MONAQ, a novel framework that reformulates NAS into Multi-Objective Neural Architecture Querying tasks. MONAQ is equipped with multimodal query generation for processing multimodal time-series inputs and hardware constraints, alongside an LLM agent-based multi-objective search to achieve deployment-ready models via code generation. By integrating numerical data, time-series images, and textual descriptions, MONAQ improves an LLM’s understanding of time-series data. Experiments on fifteen datasets demonstrate that MONAQ-discovered models outperform both handcrafted models and NAS baselines while being more efficient.
zh
[AI-95] Continuity and Isolation Lead to Doubts or Dilemmas in Large Language Models
【速读】:该论文试图解决Transformer模型在处理简单模式序列时存在的理论与实践限制问题,具体表现为隔离(isolation)和连续性(continuity)现象。隔离现象表明任何可学习的序列必须与其他可学习序列保持隔离,导致单个Transformer无法同时学习多个序列;连续性现象则表明,围绕已学习序列会形成吸引子盆地,使得落入该区域的序列会坍缩至已学习序列。解决方案的关键在于数学证明了这些现象在使用紧凑位置编码(compact positional encoding)的Transformer中普遍存在,并通过严谨实验验证了这些理论限制在实际应用中的表现。
链接: https://arxiv.org/abs/2505.10606
作者: Hector Pasten,Felipe Urrutia,Hector Jimenez,Cristian B. Calderon,Cristóbal Rojas,Alexander Kozachinskiy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding how Transformers work and how they process information is key to the theoretical and empirical advancement of these machines. In this work, we demonstrate the existence of two phenomena in Transformers, namely isolation and continuity. Both of these phenomena hinder Transformers to learn even simple pattern sequences. Isolation expresses that any learnable sequence must be isolated from another learnable sequence, and hence some sequences cannot be learned by a single Transformer at the same time. Continuity entails that an attractor basin forms around a learned sequence, such that any sequence falling in that basin will collapse towards the learned sequence. Here, we mathematically prove these phenomena emerge in all Transformers that use compact positional encoding, and design rigorous experiments, demonstrating that the theoretical limitations we shed light on occur on the practical scale.
zh
[AI-96] Toward a Public and Secure Generative AI: A Comparative Analysis of Open and Closed LLMs
【速读】:该论文试图解决开放源代码与专有(封闭)生成式人工智能(Generative AI)系统在特性、机遇与挑战方面的系统性比较缺乏问题,以及构建一个开放、公共和安全的生成式AI框架的基础要素。其解决方案的关键在于采用文献综述、批判性分析和比较分析相结合的方法,提出以开放性、公共治理和安全性为核心维度的框架,以促进可信且包容的生成式AI未来发展。
链接: https://arxiv.org/abs/2505.10603
作者: Jorge Machado
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative artificial intelligence (Gen AI) systems represent a critical technology with far-reaching implications across multiple domains of society. However, their deployment entails a range of risks and challenges that require careful evaluation. To date, there has been a lack of comprehensive, interdisciplinary studies offering a systematic comparison between open-source and proprietary (closed) generative AI systems, particularly regarding their respective advantages and drawbacks. This study aims to: i) critically evaluate and compare the characteristics, opportunities, and challenges of open and closed generative AI models; and ii) propose foundational elements for the development of an Open, Public, and Safe Gen AI framework. As a methodology, we adopted a combined approach that integrates three methods: literature review, critical analysis, and comparative analysis. The proposed framework outlines key dimensions (openness, public governance, and security) as essential pillars for shaping the future of trustworthy and inclusive Gen AI. Our findings reveal that open models offer greater transparency, auditability, and flexibility, enabling independent scrutiny and bias mitigation. In contrast, closed systems often provide better technical support and ease of implementation, but at the cost of unequal access, accountability, and ethical oversight. The research also highlights the importance of multi-stakeholder governance, environmental sustainability, and regulatory frameworks in ensuring responsible development.
zh
[AI-97] Enhancing IoT Cyber Attack Detection in the Presence of Highly Imbalanced Data
【速读】:该论文试图解决物联网(IoT)网络中由于数据集高度不平衡导致的入侵检测系统(IDS)性能下降问题,这种不平衡使得传统机器学习模型难以准确识别少数类攻击。解决方案的关键在于采用混合采样技术以提高数据不平衡环境下的检测准确性,并结合强大的模型和特征选择方法来显著提升物联网安全防护能力。
链接: https://arxiv.org/abs/2505.10600
作者: Md. Ehsanul Haque,Md. Saymon Hosen Polash,Md Al-Imran,Sanjida Simla,Md Alomgir Hossain,Sarwar Jahan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Published in CSNT 2025
Abstract:Due to the rapid growth in the number of Internet of Things (IoT) networks, the cyber risk has increased exponentially, and therefore, we have to develop effective intrusion detection systems (IDS) that can work well with highly imbalanced datasets. A high rate of missed threats can be the result, as traditional machine learning models tend to struggle in identifying attacks when normal data volume is much higher than the volume of attacks. For example, the dataset used in this study reveals a strong class imbalance with 94,659 instances of the majority class and only 28 instances of the minority class, making it quite challenging to determine rare attacks accurately. The challenges presented in this research are addressed by hybrid sampling techniques designed to improve data imbalance detection accuracy in IoT domains. After applying these techniques, we evaluate the performance of several machine learning models such as Random Forest, Soft Voting, Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), and Logistic Regression with respect to the classification of cyber-attacks. The obtained results indicate that the Random Forest model achieved the best performance with a Kappa score of 0.9903, test accuracy of 0.9961, and AUC of 0.9994. Strong performance is also shown by the Soft Voting model, with an accuracy of 0.9952 and AUC of 0.9997, indicating the benefits of combining model predictions. Overall, this work demonstrates the value of hybrid sampling combined with robust model and feature selection for significantly improving IoT security against cyber-attacks, especially in highly imbalanced data environments.
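With the stated 94,659:28 imbalance, a hybrid resampling pipeline in imbalanced-learn looks like the sketch below; the two sampling ratios are illustrative, since the paper's exact configuration is not given in the abstract:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Hybrid sampling: synthesize minority-class attacks up to 10% of the majority,
# then undersample the majority down to a 2:1 ratio, then fit the classifier.
pipeline = Pipeline(steps=[
    ("smote", SMOTE(sampling_strategy=0.1, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
# pipeline.fit(X_train, y_train); y_pred = pipeline.predict(X_test)
```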
zh
[AI-98] Inclusivity of AI Speech in Healthcare: A Decade Look Back
【速读】:该论文试图解决AI语音识别技术在医疗领域应用中存在的包容性不足问题(inclusivity gaps),具体表现为数据集和研究过度偏向高资源语言、标准化口音及狭窄的人口群体,这可能导致对边缘化群体的语音识别错误,从而加剧医疗不平等。解决方案的关键在于设计更具包容性的数据集、开展偏差缓解研究以及建立政策框架,以确保AI语音技术在医疗领域的公平可及性。
链接: https://arxiv.org/abs/2505.10596
作者: Retno Larasati
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of AI speech recognition technologies into healthcare has the potential to revolutionize clinical workflows and patient-provider communication. However, this study reveals significant gaps in inclusivity, with datasets and research disproportionately favouring high-resource languages, standardized accents, and narrow demographic groups. These biases risk perpetuating healthcare disparities, as AI systems may misinterpret speech from marginalized groups. This paper highlights the urgent need for inclusive dataset design, bias mitigation research, and policy frameworks to ensure equitable access to AI speech technologies in healthcare.
zh
[AI-99] CRPE: Expanding The Reasoning Capability of Large Language Model for Code Generation
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在代码生成任务中分析与逻辑处理能力不足的问题,特别是提升其代码推理(Code Reasoning)能力。解决方案的关键在于提出一种创新的三阶段框架CRPE(Code Reasoning Process Enhancer),该框架通过数据合成与模型训练的系统性优化,显著增强了语言模型的代码推理能力。CRPE的核心在于构建从指令数据获取到专家级代码推理数据合成,最终实现自主推理增强的完整流程,从而有效提升了代码生成任务的性能。
链接: https://arxiv.org/abs/2505.10594
作者: Ningxin Gui,Qianghuai Jia,Feijun Jiang,Yuling Jiao,Dechun Wang,Jerry Zhijian Yang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce CRPE (Code Reasoning Process Enhancer), an innovative three-stage framework for data synthesis and model training that advances the development of sophisticated code reasoning capabilities in large language models (LLMs). Building upon existing system-1 models, CRPE addresses the fundamental challenge of enhancing LLMs’ analytical and logical processing in code generation tasks. Our framework presents a methodologically rigorous yet implementable approach to cultivating advanced code reasoning abilities in language models. Through the implementation of CRPE, we successfully develop an enhanced COT-Coder that demonstrates marked improvements in code generation tasks. Evaluation results on LiveCodeBench (20240701-20240901) demonstrate that our COT-Coder-7B-StepDPO, derived from Qwen2.5-Coder-7B-Base, with a pass@1 accuracy of 21.88, exceeds all models with similar or even larger sizes. Furthermore, our COT-Coder-32B-StepDPO, based on Qwen2.5-Coder-32B-Base, exhibits superior performance with a pass@1 accuracy of 35.08, outperforming GPT4O on the benchmark. Overall, CRPE represents a comprehensive, open-source method that encompasses the complete pipeline from instruction data acquisition through expert code reasoning data synthesis, culminating in an autonomous reasoning enhancement mechanism.
zh
[AI-100] LLM-Explorer: Towards Efficient and Affordable LLM-based Exploration for Mobile Apps
【速读】:该论文旨在解决自动化移动应用探索中因过度依赖生成式 AI (Generative AI) 生成每一步操作而导致的高计算成本和资源消耗问题。其解决方案的关键在于提出 LLM-Explorer,该方法将生成式 AI 主要用于维护精确且紧凑的知识库,而非直接生成所有操作,从而通过知识引导无生成式 AI 的操作生成,实现更高效和经济的探索过程。
链接: https://arxiv.org/abs/2505.10593
作者: Shanhui Zhao,Hao Wen,Wenjie Du,Cheng Liang,Yunxin Liu,Xiaozhou Ye,Ye Ouyang,Yuanchun Li
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted by MobiCom 2025
Abstract:Large language models (LLMs) have opened new opportunities for automated mobile app exploration, an important and challenging problem that used to suffer from the difficulty of generating meaningful UI interactions. However, existing LLM-based exploration approaches rely heavily on LLMs to generate actions in almost every step, leading to a huge cost of token fees and computational resources. We argue that such extensive usage of LLMs is neither necessary nor effective, since many actions during exploration do not require, or may even be biased by the abilities of LLMs. Further, based on the insight that a precise and compact knowledge plays the central role for effective exploration, we introduce LLM-Explorer, a new exploration agent designed for efficiency and affordability. LLM-Explorer uses LLMs primarily for maintaining the knowledge instead of generating actions, and knowledge is used to guide action generation in a LLM-less manner. Based on a comparison with 5 strong baselines on 20 typical apps, LLM-Explorer was able to achieve the fastest and highest coverage among all automated app explorers, with over 148x lower cost than the state-of-the-art LLM-based approach.
zh
[AI-101] Anchoring AI Capabilities in Market Valuations: The Capability Realization Rate Model and Valuation Misalignment Risk NEURIPS
【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)能力与股票估值之间存在的脱节问题,即市场估值往往过度反映AI的潜在能力而非实际实现的性能。解决方案的关键在于提出一种能力实现率(Capability Realization Rate, CRR)模型,用于量化AI潜力与实际表现之间的差距,并据此识别估值错配风险。
链接: https://arxiv.org/abs/2505.10590
作者: Xinmin Fang,Lingfeng Tao,Zhengxiong Li
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, NeurIPS
Abstract:Recent breakthroughs in artificial intelligence (AI) have triggered surges in market valuations for AI-related companies, often outpacing the realization of underlying capabilities. We examine the anchoring effect of AI capabilities on equity valuations and propose a Capability Realization Rate (CRR) model to quantify the gap between AI potential and realized performance. Using data from the 2023–2025 generative AI boom, we analyze sector-level sensitivity and conduct case studies (OpenAI, Adobe, NVIDIA, Meta, Microsoft, Goldman Sachs) to illustrate patterns of valuation premium and misalignment. Our findings indicate that AI-native firms commanded outsized valuation premiums anchored to future potential, while traditional companies integrating AI experienced re-ratings subject to proof of tangible returns. We argue that CRR can help identify valuation misalignment risk, where market prices diverge from realized AI-driven value. We conclude with policy recommendations to improve transparency, mitigate speculative bubbles, and align AI innovation with sustainable market value.
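The abstract frames CRR as the gap between anchored potential and realized performance; a ratio form is the most natural reading, though the paper's exact operationalization is not given here. The sketch below labels that assumption explicitly:

```python
def capability_realization_rate(realized_value, anchored_potential):
    """CRR as realized / anchored AI-driven value (an assumed ratio form)."""
    return realized_value / anchored_potential

def misalignment_risk(valuation_premium, crr, tolerance=0.2):
    """Flag valuations whose premium outruns realized capability by > tolerance.

    `valuation_premium` and `tolerance` are on the same normalized scale as CRR;
    both the scale and the threshold are illustrative.
    """
    return (valuation_premium - crr) > tolerance
```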
zh
[AI-102] A Multi-Task Foundation Model for Wireless Channel Representation Using Contrastive and Masked Autoencoder Learning
【速读】:该论文旨在解决当前自监督学习在无线信道表示中的应用未能充分考虑无线通信的独特特性与约束的问题。其解决方案的关键在于提出一种基于Transformer的编码器-解码器基础模型WiMAE(Wireless Masked Autoencoder),并在其基础上开发了ContraWiMAE,通过在统一多任务框架中引入对比学习目标与重建任务,增强模型对结构化和判别性特征的捕捉能力,从而提升表示质量。
链接: https://arxiv.org/abs/2505.09160
作者: Berkay Guler,Giovanni Geraci,Hamid Jafarkhani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
备注:
Abstract:Current applications of self-supervised learning to wireless channel representation often borrow paradigms developed for text and image processing, without fully addressing the unique characteristics and constraints of wireless communications. Aiming to fill this gap, we first propose WiMAE (Wireless Masked Autoencoder), a transformer-based encoder-decoder foundation model pretrained on a realistic open-source multi-antenna wireless channel dataset. Building upon this foundation, we develop ContraWiMAE, which enhances WiMAE by incorporating a contrastive learning objective alongside the reconstruction task in a unified multi-task framework. By warm-starting from pretrained WiMAE weights and generating positive pairs via noise injection, the contrastive component enables the model to capture both structural and discriminative features, enhancing representation quality beyond what reconstruction alone can achieve. Through extensive evaluation on unseen scenarios, we demonstrate the effectiveness of both approaches across multiple downstream tasks, with ContraWiMAE showing further improvements in linear separability and adaptability in diverse wireless environments. Comparative evaluations against a state-of-the-art wireless channel foundation model confirm the superior performance and data efficiency of our models, highlighting their potential as powerful baselines for future research in self-supervised wireless channel representation learning.
zh
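下面给出重建损失与 InfoNCE 对比损失组合的 PyTorch 示意草图(编码器/解码器结构、噪声强度与温度系数均为演示假设,并非论文的官方实现),核心是用噪声注入构造正样本对、批内其余样本充当负样本:

```python
# 示意性草图:掩码重建 + 对比(InfoNCE)的统一多任务损失
import torch
import torch.nn.functional as F

def contrastive_wimae_loss(encoder, decoder, x, mask, noise_std=0.1, tau=0.2, alpha=0.5):
    z = encoder(x * mask)                      # 掩码后编码
    recon = decoder(z)
    loss_rec = F.mse_loss(recon * (1 - mask), x * (1 - mask))  # 仅在被掩码处计算重建

    # 通过噪声注入构造正样本对
    z_pos = encoder((x + noise_std * torch.randn_like(x)) * mask)
    z1 = F.normalize(z.flatten(1), dim=1)
    z2 = F.normalize(z_pos.flatten(1), dim=1)
    logits = z1 @ z2.t() / tau                 # 批内其余样本作为负样本
    labels = torch.arange(z1.size(0), device=x.device)
    loss_con = F.cross_entropy(logits, labels)
    return loss_rec + alpha * loss_con
```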
[AI-103] Humans expect rationality and cooperation from LLM opponents in strategic games
【速读】:该论文试图解决在多参与者同时选择博弈中,人类对作为对手的大型语言模型(Large Language Models, LLMs)行为反应的差异问题。其解决方案的关键在于通过一项受金钱激励的控制实验室实验,对比人类与LLMs在p-Beauty Contest任务中的行为表现,并采用被试内设计从个体层面分析行为差异。研究发现,人类在与LLMs对弈时更倾向于选择“零”纳什均衡策略,这一现象主要由高战略推理能力的个体驱动,且其策略选择基于对LLMs推理能力和合作倾向的感知。
链接: https://arxiv.org/abs/2505.11011
作者: Darija Barak,Miguel Costa-Gomes
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:As Large Language Models (LLMs) integrate into our social and economic interactions, we need to deepen our understanding of how humans respond to LLM opponents in strategic settings. We present the results of the first controlled, monetarily-incentivised laboratory experiment looking at differences in human behaviour in a multi-player p-beauty contest against other humans and LLMs. We use a within-subject design in order to compare behaviour at the individual level. We show that, in this environment, human subjects choose significantly lower numbers when playing against LLMs than against humans, which is mainly driven by the increased prevalence of `zero' Nash-equilibrium choices. This shift is mainly driven by subjects with high strategic reasoning ability. Subjects who play the zero Nash-equilibrium choice motivate their strategy by appealing to the LLM's perceived reasoning ability and, unexpectedly, its propensity towards cooperation. Our findings provide foundational insights into multi-player human-LLM interaction in simultaneous choice games, uncover heterogeneities in both subjects' behaviour and their beliefs about LLMs' play when playing against them, and suggest important implications for mechanism design in mixed human-LLM systems.
zh
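p-Beauty Contest 的规则可以用几行代码说明(p=2/3 为文献中的常见设定,具体实验参数以论文为准):

```python
# 示意性草图:多人 p-Beauty Contest 的胜者判定与纳什均衡直觉
import numpy as np

def p_beauty_winner(choices, p=2/3):
    """所有人在 [0,100] 内报数,最接近 p×平均值者获胜。"""
    target = p * np.mean(choices)
    dist = np.abs(np.array(choices) - target)
    return int(np.argmin(dist)), target

# 若所有人都完全理性地迭代推理,报数会收敛到 0(纳什均衡)
print(p_beauty_winner([33, 22, 0, 50]))  # 目标约 17.5,报 22 者(索引 1)获胜
```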
[AI-104] Space Group Equivariant Crystal Diffusion
【速读】:该论文试图解决晶体材料逆向设计的加速问题,旨在通过生成模型高效生成具有特定性质的晶体结构。解决方案的关键在于引入SGEquiDiff,该模型通过空间群不变似然函数自然处理空间群约束,并结合SE(3)-不变的晶格离散采样、置换不变的Transformer自回归采样以及原子坐标的空间群等变扩散机制,从而有效捕捉晶体结构的对称性特征及其对材料性能的影响。
链接: https://arxiv.org/abs/2505.10994
作者: Rees Chang,Angela Pak,Alex Guerra,Ni Zhan,Nick Richardson,Elif Ertekin,Ryan P. Adams
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:
Abstract:Accelerating inverse design of crystalline materials with generative models has significant implications for a range of technologies. Unlike other atomic systems, 3D crystals are invariant to discrete groups of isometries called the space groups. Crucially, these space group symmetries are known to heavily influence materials properties. We propose SGEquiDiff, a crystal generative model which naturally handles space group constraints with space group invariant likelihoods. SGEquiDiff consists of an SE(3)-invariant, telescoping discrete sampler of crystal lattices; permutation-invariant, transformer-based autoregressive sampling of Wyckoff positions, elements, and numbers of symmetrically unique atoms; and space group equivariant diffusion of atomic coordinates. We show that space group equivariant vector fields automatically live in the tangent spaces of the Wyckoff positions. SGEquiDiff achieves state-of-the-art performance on standard benchmark datasets as assessed by quantitative proxy metrics and quantum mechanical calculations.
zh
机器学习
[LG-0] Potential failures of physics-informed machine learning in traffic flow modeling: theoretical and experimental analysis
链接: https://arxiv.org/abs/2505.11491
作者: Yuan-Zheng Lei,Yaobang Gong,Dianwei Chen,Yao Cheng,Xianfeng Terry Yang
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:This study critically examines the performance of physics-informed machine learning (PIML) approaches for traffic flow modeling, defining the failure of a PIML model as the scenario where it underperforms both its purely data-driven and purely physics-based counterparts. We analyze the loss landscape by perturbing trained models along the principal eigenvectors of the Hessian matrix and evaluating corresponding loss values. Our results suggest that physics residuals in PIML do not inherently hinder optimization, contrary to a commonly assumed failure cause. Instead, successful parameter updates require both ML and physics gradients to form acute angles with the quasi-true gradient and lie within a conical region. Given inaccuracies in both the physics models and the training data, satisfying this condition is often difficult. Experiments reveal that physical residuals can degrade the performance of LWR- and ARZ-based PIML models, especially under highly physics-driven settings. Moreover, sparse sampling and the use of temporally averaged traffic data can produce misleadingly small physics residuals that fail to capture actual physical dynamics, contributing to model failure. We also identify the Courant-Friedrichs-Lewy (CFL) condition as a key indicator of dataset suitability for PIML, where successful applications consistently adhere to this criterion. Lastly, we observe that higher-order models like ARZ tend to have larger error lower bounds than lower-order models like LWR, which is consistent with the experimental findings of existing studies.
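论文指出,只有当 ML 梯度与物理残差梯度都与"准真实梯度"成锐角(且落在某个锥形区域内)时,参数更新才有效。下面的 NumPy 草图演示其中的锐角检验(梯度数值为虚构示例,quasi-true gradient 在实践中需用有限样本近似):

```python
# 示意性草图:检验数据梯度与物理残差梯度是否都与准真实梯度成锐角
import numpy as np

def acute_with(g, g_ref):
    """两向量夹角为锐角 <=> 内积为正。"""
    return float(np.dot(g, g_ref)) > 0.0

g_ml      = np.array([ 0.8, 0.1])   # 数据项梯度(示例数值)
g_physics = np.array([-0.2, 0.9])   # 物理残差梯度(示例数值)
g_quasi   = np.array([ 0.7, 0.7])   # 准真实梯度的近似

print(acute_with(g_ml, g_quasi), acute_with(g_physics, g_quasi))
# 若任一为 False,按论文的分析,这一步参数更新可能反而损害 PIML 性能
```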
[LG-1] msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML
链接: https://arxiv.org/abs/2505.11483
作者: Zhaolan Huang,Emmanuel Baccelli
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注:
Abstract:AI spans from large language models to tiny models running on microcontrollers (MCUs). Extremely memory-efficient model architectures are decisive to fit within an MCU’s tiny memory budget e.g., 128kB of RAM. However, inference latency must remain small to fit real-time constraints. An approach to tackle this is patch-based fusion, which aims to optimize data flows across neural network layers. In this paper, we introduce msf-CNN, a novel technique that efficiently finds optimal fusion settings for convolutional neural networks (CNNs) by walking through the fusion solution space represented as a directed acyclic graph. Compared to previous work on CNN fusion for MCUs, msf-CNN identifies a wider set of solutions. We published an implementation of msf-CNN running on various microcontrollers (ARM Cortex-M, RISC-V, ESP32). We show that msf-CNN can achieve inference using 50% less RAM compared to the prior art (MCUNetV2 and StreamNet). We thus demonstrate how msf-CNN offers additional flexibility for system designers.
[LG-2] Signal attenuation enables scalable decentralized multi-agent reinforcement learning over networks
链接: https://arxiv.org/abs/2505.11461
作者: Wesley A Suttle,Vipul K Sharma,Brian M Sadler
类目: Machine Learning (cs.LG)
*备注: 7 pages, 1 figure
Abstract:Classic multi-agent reinforcement learning (MARL) methods require that agents enjoy global state observability, preventing development of decentralized algorithms and limiting scalability. Recent work has shown that, under assumptions on decaying inter-agent influence, global observability can be replaced by local neighborhood observability at each agent, enabling decentralization and scalability. Real-world applications enjoying such decay properties remain underexplored, however, despite the fact that signal power decay, or signal attenuation, due to path loss is an intrinsic feature of many problems in wireless communications and radar networks. In this paper, we show that signal attenuation enables decentralization in MARL by considering the illustrative special case of performing power allocation for target detection in a radar network. To achieve this, we propose two new constrained multi-agent Markov decision process formulations of this power allocation problem, derive local neighborhood approximations for global value function and gradient estimates and establish corresponding error bounds, and develop decentralized saddle point policy gradient algorithms for solving the proposed problems. Our approach, though oriented towards the specific radar network problem we consider, provides a useful model for future extensions to additional problems in wireless communications and radar networks.
[LG-3] A Generative Framework for Causal Estimation via Importance-Weighted Diffusion Distillation
链接: https://arxiv.org/abs/2505.11444
作者: Xinran Song,Tianyu Chen,Mingyuan Zhou
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:Estimating individualized treatment effects from observational data is a central challenge in causal inference, largely due to covariate imbalance and confounding bias from non-randomized treatment assignment. While inverse probability weighting (IPW) is a well-established solution to this problem, its integration into modern deep learning frameworks remains limited. In this work, we propose Importance-Weighted Diffusion Distillation (IWDD), a novel generative framework that combines the pretraining of diffusion models with importance-weighted score distillation to enable accurate and fast causal estimation, including potential outcome prediction and treatment effect estimation. We demonstrate how IPW can be naturally incorporated into the distillation of pretrained diffusion models, and further introduce a randomization-based adjustment that eliminates the need to compute IPW explicitly, thereby simplifying computation and, more importantly, provably reducing the variance of gradient estimates. Empirical results show that IWDD achieves state-of-the-art out-of-sample prediction performance, with the highest win rates compared to other baselines, significantly improving causal estimation and supporting the development of individualized treatment strategies. We will release our PyTorch code for reproducibility and future research.
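作为背景,经典 IPW 权重的计算可以用 sklearn 示意如下(论文是将 IPW 融入扩散蒸馏,此处仅演示权重本身;数据为合成示例):

```python
# 示意性草图:用逻辑回归估计倾向得分并计算逆概率权重(IPW)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                           # 协变量
t = (X[:, 0] + rng.normal(size=500) > 0).astype(int)    # 非随机的处理分配

e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]  # 倾向得分 e(x)
w = np.where(t == 1, 1.0 / e, 1.0 / (1.0 - e))             # IPW 权重
print(w[:5])
```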
[LG-4] MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
链接: https://arxiv.org/abs/2505.11432
作者: Chao Jin,Ziheng Jiang,Zhihao Bai,Zheng Zhong,Juncai Liu,Xiang Li,Ningxin Zheng,Xi Wang,Cong Xie,Wen Heng,Yiyuan Ma,Wenlei Bao,Size Zheng,Yanghua Peng,Haibin Lin,Xuanzhe Liu,Xin Jin,Xin Liu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88× compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.
[LG-5] MoE-CAP: Benchmarking Cost Accuracy and Performance of Sparse Mixture-of-Experts Systems
链接: https://arxiv.org/abs/2505.11415
作者: Yinsicheng Jiang,Yao Fu,Yeqi Huang,Ping Nie,Zhan Lu,Leyang Xue,Congjie He,Man-Kit Sit,Jilong Xue,Li Dong,Ziming Miao,Dayou Du,Tairan Xu,Kai Zou,Edoardo Ponti,Luo Mai
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: arXiv admin note: substantial text overlap with arXiv:2412.07067
Abstract:The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third, a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce sparsity-aware performance metrics, Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU), to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.
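下面是 S-MFU 思路的一个朴素估算草图(指标的精确定义以论文为准;此处假设其核心是只统计被激活专家的 FLOPs,峰值算力数值亦为假设):

```python
# 示意性草图:稀疏感知的 FLOPS 利用率(S-MFU)的朴素估算
def sparse_mfu(tokens_per_s, flops_per_token_active, peak_flops):
    """S-MFU = 实际达到的稀疏 FLOPS / 硬件峰值 FLOPS(假设性定义)。"""
    return tokens_per_s * flops_per_token_active / peak_flops

# 例:每 token 只激活 16 个专家中的 2 个,每个专家 2e9 FLOPs,另有 1e9 共享计算
flops_active = 2 * 2e9 + 1e9
print(sparse_mfu(tokens_per_s=5e4,
                 flops_per_token_active=flops_active,
                 peak_flops=989e12))   # 峰值按约 989 TFLOPS 假设,结果约 0.25
```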
[LG-6] Uncertainty quantification with approximate variational learning for wearable photoplethysmography prediction tasks
链接: https://arxiv.org/abs/2505.11412
作者: Ciaran Bench,Vivek Desai,Mohammad Moulaeifard,Nils Strodthoff,Philip Aston,Andrew Thompson
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Photoplethysmography (PPG) signals encode information about relative changes in blood volume that can be used to assess various aspects of cardiac health non-invasively, e.g., to detect atrial fibrillation (AF) or predict blood pressure (BP). Deep networks are well-equipped to handle the large quantities of data acquired from wearable measurement devices. However, they lack interpretability and are prone to overfitting, leaving considerable risk for poor performance on unseen data and misdiagnosis. Here, we describe the use of two scalable uncertainty quantification techniques: Monte Carlo Dropout and the recently proposed Improved Variational Online Newton. These techniques are used to assess the trustworthiness of models trained to perform AF classification and BP regression from raw PPG time series. We find that the choice of hyperparameters has a considerable effect on the predictive performance of the models and on the quality and composition of predicted uncertainties. For example, the stochasticity of the model parameter sampling determines the proportion of the total uncertainty that is aleatoric, and has varying effects on predictive performance and calibration quality dependent on the chosen uncertainty quantification technique and the chosen expression of uncertainty. We find significant discrepancies in the quality of uncertainties over the predicted classes, emphasising the need for a thorough evaluation protocol that assesses local and adaptive calibration. This work suggests that the choice of hyperparameters must be carefully tuned to balance predictive performance and calibration quality, and that the optimal parameterisation may vary depending on the chosen expression of uncertainty.
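Monte Carlo Dropout 是文中两种不确定性量化技术之一,其核心做法可以用几行 PyTorch 示意(模型与 PPG 数据为占位假设):

```python
# 示意性草图:Monte Carlo Dropout——推理时保持 dropout 开启并多次前向采样
import torch

def enable_dropout(model):
    """仅把 Dropout 子模块切到训练态,保持 BatchNorm 等统计量不变。"""
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=50):
    model.eval()
    enable_dropout(model)
    preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)   # 均值作预测,标准差近似预测不确定性

# 用法(占位):mean, std = mc_dropout_predict(ppg_model, ppg_batch)
```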
[LG-7] Is Grokking a Computational Glass Relaxation?
链接: https://arxiv.org/abs/2505.11411
作者: Xiaotian Zhang,Yue Shang,Entao Yang,Ge Zhang
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*备注:
Abstract:Understanding neural networks' (NNs) generalizability remains a central question in deep learning research. The special phenomenon of grokking, where NNs abruptly generalize long after the training performance reaches a near-perfect level, offers a unique window to investigate the underlying mechanisms of NNs' generalizability. Here we propose an interpretation of grokking by framing it as a computational glass relaxation: viewing NNs as a physical system where parameters are the degrees of freedom and train loss is the system energy, we find that the memorization process resembles a rapid cooling of liquid into a non-equilibrium glassy state at low temperature, and the later generalization is like a slow relaxation towards a more stable configuration. This mapping enables us to sample NNs' Boltzmann entropy (density of states) landscape as a function of training loss and test accuracy. Our experiments on transformers for arithmetic tasks suggest that there is NO entropy barrier in the memorization-to-generalization transition of grokking, challenging previous theory that defines grokking as a first-order phase transition. We identify a high-entropy advantage under grokking, an extension of prior work linking entropy to generalizability but much more significant. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer WanD based on Wang-Landau molecular dynamics, which can eliminate grokking without any constraints and find high-norm generalizing solutions. This provides strictly-defined counterexamples to theory attributing grokking solely to weight norm evolution towards the Goldilocks zone and also suggests new potential ways for optimizer design.
[LG-8] Finding Counterfactual Evidences for Node Classification KDD2025
链接: https://arxiv.org/abs/2505.11396
作者: Dazhuo Qiu,Jinwen Chen,Arijit Khan,Yan Zhao,Francesco Bonchi
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: Accepted by KDD 2025
Abstract:Counterfactual learning is emerging as an important paradigm, rooted in causality, which promises to alleviate common issues of graph neural networks (GNNs), such as fairness and interpretability. However, as in many real-world application domains where conducting randomized controlled trials is impractical, one has to rely on available observational (factual) data to detect counterfactuals. In this paper, we introduce and tackle the problem of searching for counterfactual evidences for the GNN-based node classification task. A counterfactual evidence is a pair of nodes such that, despite exhibiting great similarity both in their features and in their neighborhood subgraph structures, they are classified differently by the GNN. We develop effective and efficient search algorithms and a novel indexing solution that leverages both node features and structural information to identify counterfactual evidences, and generalizes beyond any specific GNN. Through various downstream applications, we demonstrate the potential of counterfactual evidences to enhance fairness and accuracy of GNNs.
[LG-9] IISE PG&E Energy Analytics Challenge 2025: Hourly-Binned Regression Models Beat Transformers in Load Forecasting
链接: https://arxiv.org/abs/2505.11390
作者: Millend Roy,Vladimir Pyltsov,Yinbo Hu
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Systems and Control (eess.SY)
*备注:
Abstract:Accurate electricity load forecasting is essential for grid stability, resource optimization, and renewable energy integration. While transformer-based deep learning models like TimeGPT have gained traction in time-series forecasting, their effectiveness in long-term electricity load prediction remains uncertain. This study evaluates forecasting models ranging from classical regression techniques to advanced deep learning architectures using data from the ESD 2025 competition. The dataset includes two years of historical electricity load data, alongside temperature and global horizontal irradiance (GHI) across five sites, with a one-day-ahead forecasting horizon. Since actual test set load values remain undisclosed, leveraging predicted values would accumulate errors, making this a long-term forecasting challenge. We employ (i) Principal Component Analysis (PCA) for dimensionality reduction and (ii) frame the task as a regression problem, using temperature and GHI as covariates to predict load for each hour, (iii) ultimately stacking 24 models to generate yearly forecasts. Our results reveal that deep learning models, including TimeGPT, fail to consistently outperform simpler statistical and machine learning approaches due to the limited availability of training data and exogenous variables. In contrast, XGBoost, with minimal feature engineering, delivers the lowest error rates across all test cases while maintaining computational efficiency. This highlights the limitations of deep learning in long-term electricity forecasting and reinforces the importance of model selection based on dataset characteristics rather than complexity. Our study provides insights into practical forecasting applications and contributes to the ongoing discussion on the trade-offs between traditional and modern forecasting methods.
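按小时分桶、堆叠 24 个回归器的做法可以用如下草图说明(XGBRegressor 为 xgboost 的真实接口,特征列与超参数为演示假设):

```python
# 示意性草图:为 0-23 每个小时各训练一个 XGBoost 回归器
import numpy as np
from xgboost import XGBRegressor

def fit_hourly_models(X, y, hours):
    """hours: 每条样本对应的小时 (0-23);为每个小时各训练一个模型。"""
    models = {}
    for h in range(24):
        idx = hours == h
        models[h] = XGBRegressor(n_estimators=300, max_depth=4).fit(X[idx], y[idx])
    return models

def predict_hourly(models, X, hours):
    out = np.empty(len(X))
    for h in range(24):
        idx = hours == h
        if idx.any():
            out[idx] = models[h].predict(X[idx])
    return out
```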
[LG-10] The Future is Sparse: Embedding Compression for Scalable Retrieval in Recommender Systems
链接: https://arxiv.org/abs/2505.11388
作者: Petr Kasalický,Martin Spišák,Vojtěch Vančura,Daniel Bohuněk,Rodrigo Alves,Pavel Kordík
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Industry-scale recommender systems face a core challenge: representing entities with high cardinality, such as users or items, using dense embeddings that must be accessible during both training and inference. However, as embedding sizes grow, memory constraints make storage and access increasingly difficult. We describe a lightweight, learnable embedding compression technique that projects dense embeddings into a high-dimensional, sparsely activated space. Designed for retrieval tasks, our method reduces memory requirements while preserving retrieval performance, enabling scalable deployment under strict resource constraints. Our results demonstrate that leveraging sparsity is a promising approach for improving the efficiency of large-scale recommenders. We release our code at this https URL.
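将稠密嵌入投影到高维稀疏空间的思路,可以用一个 top-k 激活的最小草图示意(投影结构与 k 值均为演示假设,非该论文的具体设计):

```python
# 示意性草图:稠密嵌入 -> 高维稀疏表示(top-k 激活)
import torch

class SparseProjector(torch.nn.Module):
    def __init__(self, dense_dim=128, sparse_dim=8192, k=32):
        super().__init__()
        self.proj = torch.nn.Linear(dense_dim, sparse_dim)
        self.k = k

    def forward(self, e):                      # e: (batch, dense_dim)
        z = torch.relu(self.proj(e))
        topk = torch.topk(z, self.k, dim=-1)   # 仅保留 k 个最大激活
        out = torch.zeros_like(z)
        # 稀疏表示便于用倒排索引做大规模检索
        return out.scatter(-1, topk.indices, topk.values)

x = torch.randn(4, 128)
print((SparseProjector()(x) != 0).sum(dim=-1))  # 每行非零元素数不超过 32
```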
[LG-11] On the Interconnections of Calibration, Quantification, and Classifier Accuracy Prediction under Dataset Shift
链接: https://arxiv.org/abs/2505.11380
作者: Alejandro Moreo
类目: Machine Learning (cs.LG)
*备注:
Abstract:When the distribution of the data used to train a classifier differs from that of the test data, i.e., under dataset shift, well-established routines for calibrating the decision scores of the classifier, estimating the proportion of positives in a test sample, or estimating the accuracy of the classifier, become particularly challenging. This paper investigates the interconnections among three fundamental problems, calibration, quantification, and classifier accuracy prediction, under dataset shift conditions. Specifically, we prove their equivalence through mutual reduction, i.e., we show that access to an oracle for any one of these tasks enables the resolution of the other two. Based on these proofs, we propose new methods for each problem based on direct adaptations of well-established methods borrowed from the other disciplines. Our results show that such methods are often competitive, and sometimes even surpass the performance of dedicated approaches from each discipline. The main goal of this paper is to foster cross-fertilization among these research areas, encouraging the development of unified approaches and promoting synergies across the fields.
[LG-12] Machine Learning Approaches to Vocal Register Classification in Contemporary Male Pop Music
链接: https://arxiv.org/abs/2505.11378
作者: Alexander Kim,Charlotte Botha
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 8 pages, 8 figures
Abstract:For singers of all experience levels, one of the most daunting challenges in learning technical repertoire is navigating placement and vocal register in and around the passaggio (the passage between chest voice and head voice registers). Particularly in pop music, where a single artist may use a variety of timbres and textures to achieve a desired quality, it can be difficult to identify which vocal register within the vocal range a singer is using. This paper presents two methods for classifying vocal registers in an audio signal of male pop music through the analysis of textural features of mel-spectrogram images. Additionally, we discuss the practical integration of these models into vocal analysis tools, and introduce a concurrently developed software package called AVRA, which stands for Automatic Vocal Register Analysis. Our proposed methods achieved consistent classification of vocal register through both Support Vector Machine (SVM) and Convolutional Neural Network (CNN) models, which supports the promise of more robust classification across more voice types and genres of singing.
[LG-13] Understanding Nonlinear Implicit Bias via Region Counts in Input Space
链接: https://arxiv.org/abs/2505.11370
作者: Jingwei Li,Jing Xu,Zifan Wang,Huishuai Zhang,Jingzhao Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:One explanation for the strong generalization ability of neural networks is implicit bias. Yet, the definition and mechanism of implicit bias in non-linear contexts remains little understood. In this work, we propose to characterize implicit bias by the count of connected regions in the input space with the same predicted label. Compared with parameter-dependent metrics (e.g., norm or normalized margin), region count can be better adapted to nonlinear, overparameterized models, because it is determined by the function mapping and is invariant to reparametrization. Empirically, we found that small region counts align with geometrically simple decision boundaries and correlate well with good generalization performance. We also observe that good hyper-parameter choices such as larger learning rates and smaller batch sizes can induce small region counts. We further establish the theoretical connections and explain how larger learning rate can induce small region counts in neural networks.
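在低维输入上,"同预测标签连通区域计数"可以直接用网格采样加连通域标记来实现,下面是一个二维示意(分类器为占位;高维情形需改用采样近似):

```python
# 示意性草图:在二维输入网格上统计"同预测标签连通区域"的个数
import numpy as np
from scipy import ndimage

def region_count(predict, lo=-3, hi=3, n=200):
    xs = np.linspace(lo, hi, n)
    grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
    labels = predict(grid).reshape(n, n)
    total = 0
    for c in np.unique(labels):
        _, num = ndimage.label(labels == c)    # 每个类别分别做连通域标记
        total += num
    return total

# 例:线性决策边界只切出 2 个区域;区域数越小通常对应越简单的边界
print(region_count(lambda X: (X[:, 0] + X[:, 1] > 0).astype(int)))  # 2
```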
[LG-14] Learning Multimodal AI Algorithms for Amplifying Limited User Input into High-dimensional Control Space
链接: https://arxiv.org/abs/2505.11366
作者: Ali Rabiee,Sima Ghafoori,MH Farhadi,Robert Beyer,Xiangyu Bai,David J Lin,Sarah Ostadabbas,Reza Abiri
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Current invasive assistive technologies are designed to infer high-dimensional motor control signals from severely paralyzed patients. However, they face significant challenges, including public acceptance, limited longevity, and barriers to commercialization. Meanwhile, noninvasive alternatives often rely on artifact-prone signals, require lengthy user training, and struggle to deliver robust high-dimensional control for dexterous tasks. To address these issues, this study introduces a novel human-centered multimodal AI approach as intelligent compensatory mechanisms for lost motor functions that could potentially enable patients with severe paralysis to control high-dimensional assistive devices, such as dexterous robotic arms, using limited and noninvasive inputs. In contrast to the current state-of-the-art (SoTA) noninvasive approaches, our context-aware, multimodal shared-autonomy framework integrates deep reinforcement learning algorithms to blend limited low-dimensional user input with real-time environmental perception, enabling adaptive, dynamic, and intelligent interpretation of human intent for complex dexterous manipulation tasks, such as pick-and-place. The results from our ARAS (Adaptive Reinforcement learning for Amplification of limited inputs in Shared autonomy) trained with synthetic users over 50,000 computer simulation episodes demonstrated the first successful implementation of the proposed closed-loop human-in-the-loop paradigm, outperforming the SoTA shared autonomy algorithms. Following a zero-shot sim-to-real transfer, ARAS was evaluated on 23 human subjects, demonstrating high accuracy in dynamic intent detection and smooth, stable 3D trajectory control for dexterous pick-and-place tasks. ARAS user study achieved a high task success rate of 92.88%, with short completion times comparable to those of SoTA invasive assistive technologies.
[LG-15] Efficient End-to-End Learning for Decision-Making: A Meta-Optimization Approach
链接: https://arxiv.org/abs/2505.11360
作者: Rares Cristian,Pavithra Harsha,Georgia Perakis,Brian Quanz
类目: Machine Learning (cs.LG)
*备注:
Abstract:End-to-end learning has become a widely applicable and studied problem in training predictive ML models to be aware of their impact on downstream decision-making tasks. These end-to-end models often outperform traditional methods that separate training from the optimization and only myopically focus on prediction error. However, the computational complexity of end-to-end frameworks poses a significant challenge, particularly for large-scale problems. While training an ML model using gradient descent, each time we need to compute a gradient we must solve an expensive optimization problem. We present a meta-optimization method that learns efficient algorithms to approximate optimization problems, dramatically reducing computational overhead of solving the decision problem in general, an aspect we leverage in the training within the end-to-end framework. Our approach introduces a neural network architecture that near-optimally solves optimization problems while ensuring feasibility constraints through alternate projections. We prove exponential convergence, approximation guarantees, and generalization bounds for our learning method. This method offers superior computational efficiency, producing high-quality approximations faster and scaling better with problem size compared to existing techniques. Our approach applies to a wide range of optimization problems including deterministic, single-stage as well as two-stage stochastic optimization problems. We illustrate how our proposed method applies to (1) an electricity generation problem using real data from an electricity routing company coordinating the movement of electricity throughout 13 states, (2) a shortest path problem with a computer vision task of predicting edge costs from terrain maps, (3) a two-stage multi-warehouse cross-fulfillment newsvendor problem, as well as a variety of other newsvendor-like problems.
[LG-16] LGBQPC: Local Granular-Ball Quality Peaks Clustering
链接: https://arxiv.org/abs/2505.11359
作者: Zihang Jia,Zhen Zhang,Witold Pedrycz
类目: Machine Learning (cs.LG)
*备注:
Abstract:The density peaks clustering (DPC) algorithm has attracted considerable attention for its ability to detect arbitrarily shaped clusters based on a simple yet effective assumption. Recent advancements integrating granular-ball (GB) computing with DPC have led to the GB-based DPC (GBDPC) algorithm, which improves computational efficiency. However, GBDPC demonstrates limitations when handling complex clustering tasks, particularly those involving data with complex manifold structures or non-uniform density distributions. To overcome these challenges, this paper proposes the local GB quality peaks clustering (LGBQPC) algorithm, which offers comprehensive improvements to GBDPC in both GB generation and clustering processes based on the principle of justifiable granularity (POJG). Firstly, an improved GB generation method, termed GB-POJG+, is developed, which systematically refines the original GB-POJG in four key aspects: the objective function, termination criterion for GB division, definition of abnormal GB, and granularity level adaptation strategy. GB-POJG+ simplifies parameter configuration by requiring only a single penalty coefficient and ensures high-quality GB generation while maintaining the number of generated GBs within an acceptable range. In the clustering phase, two key innovations are introduced based on the GB k-nearest neighbor graph: relative GB quality for density estimation and geodesic distance for GB distance metric. These modifications substantially improve the performance of GBDPC on datasets with complex manifold structures or non-uniform density distributions. Extensive numerical experiments on 40 benchmark datasets, including both synthetic and publicly available datasets, validate the superior performance of the proposed LGBQPC algorithm.
[LG-17] Fractal Graph Contrastive Learning
链接: https://arxiv.org/abs/2505.11356
作者: Nero Z. Li,Xuehao Zhai,Zhichao Shi,Boshen Shi,Xuhui Jiang
类目: Machine Learning (cs.LG)
*备注:
Abstract:While Graph Contrastive Learning (GCL) has attracted considerable attention in the field of graph self-supervised learning, its performance heavily relies on data augmentations that are expected to generate semantically consistent positive pairs. Existing strategies typically resort to random perturbations or local structure preservation, yet lack explicit control over global structural consistency between augmented views. To address this limitation, we propose Fractal Graph Contrastive Learning (FractalGCL), a theory-driven framework that leverages fractal self-similarity to enforce global topological coherence. FractalGCL introduces two key innovations: a renormalisation-based augmentation that generates structurally aligned positive views via box coverings; and a fractal-dimension-aware contrastive loss that aligns graph embeddings according to their fractal dimensions. While combining the two innovations markedly boosts graph-representation quality, it also adds non-trivial computational overhead. To mitigate the computational overhead of fractal dimension estimation, we derive a one-shot estimator by proving that the dimension discrepancy between original and renormalised graphs converges weakly to a centred Gaussian distribution. This theoretical insight enables a reduction in dimension computation cost by an order of magnitude, cutting overall training time by approximately 61%. The experiments show that FractalGCL not only delivers state-of-the-art results on standard benchmarks but also outperforms traditional baselines on traffic networks by a remarkable average margin of about 7%. Codes are available at (this https URL).
[LG-18] Context parroting: A simple but tough-to-beat baseline for foundation models in scientific machine learning
链接: https://arxiv.org/abs/2505.11349
作者: Yuanzhao Zhang,William Gilpin
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Computational Physics (physics.comp-ph)
*备注:
Abstract:Recently-developed time series foundation models for scientific machine learning exhibit emergent abilities to predict physical systems. These abilities include zero-shot forecasting, in which a model forecasts future states of a system given only a short trajectory as context. Here, we show that foundation models applied to physical systems can give accurate predictions, but that they fail to develop meaningful representations of the underlying physics. Instead, foundation models often forecast by context parroting, a simple zero-shot forecasting strategy that copies directly from the context. As a result, a naive direct context parroting model scores higher than state-of-the-art time-series foundation models on predicting a diverse range of dynamical systems, at a tiny fraction of the computational cost. We draw a parallel between context parroting and induction heads, which explains why large language models trained on text can be repurposed for time series forecasting. Our dynamical systems perspective also ties the scaling between forecast accuracy and context length to the fractal dimension of the attractor, providing insight into the previously observed in-context neural scaling laws. Context parroting thus serves as a simple but tough-to-beat baseline for future time-series foundation models and can help identify in-context learning strategies beyond parroting.
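"上下文复读"基线的一个最简实现如下(论文中的具体复制策略可能不同,此处以"循环重放上下文末尾一个周期"作演示):

```python
# 示意性草图:"上下文复读"零样本预测基线——直接从上下文中复制片段
import numpy as np

def context_parrot(context, horizon, period):
    """把上下文末尾长度为 period 的片段循环铺满预测窗口。"""
    last_cycle = np.asarray(context)[-period:]
    reps = int(np.ceil(horizon / period))
    return np.tile(last_cycle, reps)[:horizon]

t = np.linspace(0, 20 * np.pi, 1000)
ctx = np.sin(t[:800])                     # 前 800 点作为上下文
print(context_parrot(ctx, horizon=5, period=100)[:3])  # 对周期信号几乎零误差
```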
[LG-19] Training NTK to Generalize with KARE
链接: https://arxiv.org/abs/2505.11347
作者: Johannes Schwab,Bryan Kelly,Semyon Malamud,Teng Andrea Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:The performance of the data-dependent neural tangent kernel (NTK; Jacot et al. (2018)) associated with a trained deep neural network (DNN) often matches or exceeds that of the full network. This implies that DNN training via gradient descent implicitly performs kernel learning by optimizing the NTK. In this paper, we propose instead to optimize the NTK explicitly. Rather than minimizing empirical risk, we train the NTK to minimize its generalization error using the recently developed Kernel Alignment Risk Estimator (KARE; Jacot et al. (2020)). Our simulations and real data experiments show that NTKs trained with KARE consistently match or significantly outperform the original DNN and the DNN-induced NTK (the after-kernel). These results suggest that explicitly trained kernels can outperform traditional end-to-end DNN optimization in certain settings, challenging the conventional dominance of DNNs. We argue that explicit training of the NTK is a form of over-parametrized feature learning.
[LG-20] What Can We Learn From MIMO Graph Convolutions? IJCAI2025
链接: https://arxiv.org/abs/2505.11346
作者: Andreas Roth,Thomas Liebig
类目: Machine Learning (cs.LG)
*备注: IJCAI 2025
Abstract:Most graph neural networks (GNNs) utilize approximations of the general graph convolution derived in the graph Fourier domain. While GNNs are typically applied in the multi-input multi-output (MIMO) case, the approximations are performed in the single-input single-output (SISO) case. In this work, we first derive the MIMO graph convolution through the convolution theorem and approximate it directly in the MIMO case. We find the key MIMO-specific property of the graph convolution to be operating on multiple computational graphs, or equivalently, applying distinct feature transformations for each pair of nodes. As a localized approximation, we introduce localized MIMO graph convolutions (LMGCs), which generalize many linear message-passing neural networks. For almost every choice of edge weights, we prove that LMGCs with a single computational graph are injective on multisets, and the resulting representations are linearly independent when more than one computational graph is used. Our experimental results confirm that an LMGC can combine the benefits of various methods.
[LG-21] Sobolev Training of End-to-End Optimization Proxies
链接: https://arxiv.org/abs/2505.11342
作者: Andrew W. Rosemberg,Joaquim Dias Garcia,Russell Bent,Pascal Van Hentenryck
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 9 Pages, 4 Figures, 5 Tables
Abstract:Optimization proxies, machine learning models trained to approximate the solution mapping of parametric optimization problems in a single forward pass, offer dramatic reductions in inference time compared to traditional iterative solvers. This work investigates the integration of solver sensitivities into such end-to-end proxies via a Sobolev training paradigm and does so in two distinct settings: (i) fully supervised proxies, where exact solver outputs and sensitivities are available, and (ii) self-supervised proxies that rely only on the objective and constraint structure of the underlying optimization problem. By augmenting the standard training loss with directional derivative information extracted from the solver, the proxy aligns both its predicted solutions and local derivatives with those of the optimizer. Under Lipschitz continuity assumptions on the true solution mapping, matching first-order sensitivities is shown to yield uniform approximation error proportional to the training set covering radius. Empirically, different impacts are observed in each studied setting. On three large Alternating Current Optimal Power Flow benchmarks, supervised Sobolev training cuts mean squared error by up to 56 percent and the median worst-case constraint violation by up to 400 percent while keeping the optimality gap below 0.22 percent. For a mean-variance portfolio task trained without labeled solutions, self-supervised Sobolev training halves the average optimality gap in the medium-risk region (standard deviation above 10 percent of budget) and matches the baseline elsewhere. Together, these results highlight Sobolev training, whether supervised or self-supervised, as a path to fast, reliable surrogates for safety-critical, large-scale optimization workloads.
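Sobolev 训练的核心是让代理网络同时拟合解及其方向导数,下面用 PyTorch 的 jvp 给出损失函数草图(jvp 为 PyTorch 真实 API;代理网络、数据与方向向量均为演示假设):

```python
# 示意性草图:Sobolev 训练——同时拟合解映射及其方向导数
import torch
from torch.autograd.functional import jvp

def sobolev_loss(proxy, params, y_true, dy_true, v, beta=1.0):
    """params: 优化问题参数;dy_true: 求解器在方向 v 上的灵敏度。"""
    # jvp 同时返回网络输出与其在方向 v 上的雅可比-向量积
    y_pred, dy_pred = jvp(proxy, (params,), (v,), create_graph=True)
    return torch.nn.functional.mse_loss(y_pred, y_true) + \
           beta * torch.nn.functional.mse_loss(dy_pred, dy_true)
```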
[LG-22] The Final Layer Holds the Key: A Unified and Efficient GNN Calibration Framework
链接: https://arxiv.org/abs/2505.11335
作者: Jincheng Huang,Jie Xu,Xiaoshuang Shi,Ping Hu,Lei Feng,Xiaofeng Zhu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness on graph-based tasks. However, their predictive confidence is often miscalibrated, typically exhibiting under-confidence, which harms the reliability of their decisions. Existing calibration methods for GNNs normally introduce additional calibration components, which fail to capture the intrinsic relationship between the model and the prediction confidence, resulting in limited theoretical guarantees and increased computational overhead. To address this issue, we propose a simple yet efficient graph calibration method. We establish a unified theoretical framework revealing that model confidence is jointly governed by class-centroid-level and node-level calibration at the final layer. Based on this insight, we theoretically show that reducing the weight decay of the final-layer parameters alleviates GNN under-confidence by acting on the class-centroid level, while node-level calibration acts as a finer-grained complement to class-centroid level calibration, which encourages each test node to be closer to its predicted class centroid at the final-layer representations. Extensive experiments validate the superiority of our method.
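论文的理论结论之一是减小最终层参数的 weight decay 可缓解 GNN 欠自信,对应的 PyTorch 参数组写法如下(模型属性名 final_layer 为演示假设):

```python
# 示意性草图:仅对 GNN 最终层调低/归零 weight decay(PyTorch 参数组)
import torch

def make_optimizer(model, lr=1e-2, wd=5e-4, wd_final=0.0):
    final_params = list(model.final_layer.parameters())
    final_ids = {id(p) for p in final_params}
    others = [p for p in model.parameters() if id(p) not in final_ids]
    return torch.optim.Adam([
        {"params": others, "weight_decay": wd},
        {"params": final_params, "weight_decay": wd_final},  # 减小最终层衰减以缓解欠自信
    ], lr=lr)
```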
[LG-23] TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
链接: https://arxiv.org/abs/2505.11329
作者: Raja Gond,Nipun Kwatra,Ramachandran Ramjee
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 13 pages, 15 figures
Abstract:Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLINK. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Further, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The computation of one subset is then overlapped with the communication of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce-RMSNorm kernel carefully leveraging Multimem instruction support available on NVIDIA Hopper GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other batch's computation, providing additional gains. Our evaluations demonstrate up to 29% latency gains and up to 26% throughput gains across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.
[LG-24] Anomaly Detection for Non-stationary Time Series using Recurrent Wavelet Probabilistic Neural Network
链接: https://arxiv.org/abs/2505.11321
作者: Pu Yang,J. A. Barria
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:In this paper, an unsupervised Recurrent Wavelet Probabilistic Neural Network (RWPNN) is proposed, which aims at detecting anomalies in non-stationary environments by modelling the temporal features using a nonparametric density estimation network. The novel framework consists of two components, a Stacked Recurrent Encoder-Decoder (SREnc-Dec) module that captures temporal features in a latent space, and a Multi-Receptive-field Wavelet Probabilistic Network (MRWPN) that creates an ensemble probabilistic model to characterise the latent space. This formulation extends the standard wavelet probabilistic networks to wavelet deep probabilistic networks, which can handle higher data dimensionality. The MRWPN module can adapt to different rates of data variation in different datasets without imposing strong distribution assumptions, resulting in a more robust and accurate detection for Time Series Anomaly Detection (TSAD) tasks in the non-stationary environment. We carry out the assessment on 45 real-world time series datasets from various domains, verify the performance of RWPNN in TSAD tasks with several constraints, and show its ability to provide early warnings for anomalous events.
[LG-25] On the Role of Weight Decay in Collaborative Filtering: A Popularity Perspective KDD2025
链接: https://arxiv.org/abs/2505.11318
作者: Donald Loveland,Mingxuan Ju,Tong Zhao,Neil Shah,Danai Koutra
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted at SIGKDD 2025
Abstract:Collaborative filtering (CF) enables large-scale recommendation systems by encoding information from historical user-item interactions into dense ID-embedding tables. However, as embedding tables grow, closed-form solutions become impractical, often necessitating the use of mini-batch gradient descent for training. Despite extensive work on designing loss functions to train CF models, we argue that one core component of these pipelines is heavily overlooked: weight decay. Attaining high-performing models typically requires careful tuning of weight decay, regardless of loss, yet its necessity is not well understood. In this work, we question why weight decay is crucial in CF pipelines and how it impacts training. Through theoretical and empirical analysis, we surprisingly uncover that weight decay’s primary function is to encode popularity information into the magnitudes of the embedding vectors. Moreover, we find that tuning weight decay acts as a coarse, non-linear knob to influence preference towards popular or unpopular items. Based on these findings, we propose PRISM (Popularity-awaRe Initialization Strategy for embedding Magnitudes), a straightforward yet effective solution to simplify the training of high-performing CF models. PRISM pre-encodes the popularity information typically learned through weight decay, eliminating its necessity. Our experiments show that PRISM improves performance by up to 4.77% and reduces training times by 38.48%, compared to state-of-the-art training strategies. Additionally, we parameterize PRISM to modulate the initialization strength, offering a cost-effective and meaningful strategy to mitigate popularity bias.
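PRISM 的思想是把流行度信息预先编码进嵌入模长,下面是一个朴素示意(log 缩放与系数 alpha 为演示假设,具体参数化见论文):

```python
# 示意性草图:按流行度预设嵌入向量的模长(方向随机、模长随流行度增长)
import torch

def prism_init(num_items, dim, popularity, alpha=1.0):
    """方向随机、模长与 log(1+流行度) 成比例的嵌入初始化。"""
    directions = torch.nn.functional.normalize(torch.randn(num_items, dim), dim=1)
    magnitudes = alpha * torch.log1p(torch.as_tensor(popularity, dtype=torch.float))
    return directions * magnitudes.unsqueeze(1)

emb = prism_init(5, 8, popularity=[1, 10, 100, 1000, 10000])
print(emb.norm(dim=1))  # 模长随流行度单调上升
```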
[LG-26] Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior
链接: https://arxiv.org/abs/2505.11315
作者: Chin-Yun Yu,Marco A. Martínez-Ramírez,Junghyun Koo,Wei-Hsiang Liao,Yuki Mitsufuji,György Fazekas
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Submitted to WASPAA 2025
Abstract:Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to a raw audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which can lead to unrealistic or biased results. We address this pitfall by introducing a Gaussian prior derived from a vocal preset dataset, DiffVox, over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation. Evaluations on vocal effects transfer on the MedleyDB dataset show significant improvements across metrics compared to baselines, including a blind audio effects estimator, nearest-neighbour approaches, and uncalibrated ST-ITO. The proposed calibration reduces parameter mean squared error by up to 33% and matches the reference style better. Subjective evaluations with 16 participants confirm our method's superiority, especially in limited-data regimes. This work demonstrates how incorporating prior knowledge at inference time enhances audio effects transfer, paving the way for more effective and realistic audio processing systems.
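在推理期优化中加入高斯先验后,目标函数等价于 MAP 估计,可示意如下(style_distance 与先验均值/协方差为占位假设,先验参数设想来自预设数据集):

```python
# 示意性草图:风格嵌入距离 + 高斯先验的负对数似然 = MAP 目标
import torch

def map_objective(theta, style_distance, mu, sigma_inv, lam=1.0):
    """目标 = 风格嵌入距离 + λ·负对数高斯先验(省略常数项)。"""
    diff = theta - mu
    neg_log_prior = 0.5 * diff @ sigma_inv @ diff
    return style_distance(theta) + lam * neg_log_prior
```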
[LG-27] Where You Place the Norm Matters: From Prejudiced to Neutral Initializations
链接: https://arxiv.org/abs/2505.11312
作者: Emanuele Francazi,Francesco Pinto,Aurelien Lucchi,Marco Baity-Jesi
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*备注:
Abstract:Normalization layers, such as Batch Normalization and Layer Normalization, are central components in modern neural networks, widely adopted to improve training stability and generalization. While their practical effectiveness is well documented, a detailed theoretical understanding of how normalization affects model behavior, starting from initialization, remains an important open question. In this work, we investigate how both the presence and placement of normalization within hidden layers influence the statistical properties of network predictions before training begins. In particular, we study how these choices shape the distribution of class predictions at initialization, which can range from unbiased (Neutral) to highly concentrated (Prejudiced) toward a subset of classes. Our analysis shows that normalization placement induces systematic differences in the initial prediction behavior of neural networks, which in turn shape the dynamics of learning. By linking architectural choices to prediction statistics at initialization, our work provides a principled understanding of how normalization can influence early training behavior and offers guidance for more controlled and interpretable network design.
[LG-28] Reinforcement Learning Closures for Underresolved Partial Differential Equations using Synthetic Data
链接: https://arxiv.org/abs/2505.11308
作者: Lothar Heimbach,Sebastian Kaltenbach,Petr Karnakov,Francis J. Alexander,Petros Koumoutsakos
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Partial Differential Equations (PDEs) describe phenomena ranging from turbulence and epidemics to quantum mechanics and financial markets. Despite recent advances in computational science, solving such PDEs for real-world applications remains prohibitively expensive because of the necessity of resolving a broad range of spatiotemporal scales. In turn, practitioners often rely on coarse-grained approximations of the original PDEs, trading off accuracy for reduced computational resources. To mitigate the loss of detail inherent in such approximations, closure models are employed to represent unresolved spatiotemporal interactions. We present a framework for developing closure models for PDEs using synthetic data acquired through the method of manufactured solutions. These data are used in conjunction with reinforcement learning to provide closures for coarse-grained PDEs. We illustrate the efficacy of our method using the one-dimensional and two-dimensional Burgers’ equations and the two-dimensional advection equation. Moreover, we demonstrate that closure models trained for inhomogeneous PDEs can be effectively generalized to homogeneous PDEs. The results demonstrate the potential for developing accurate and computationally efficient closure models for systems with scarce data.
[LG-29] Diffusion Learning with Partial Agent Participation and Local Updates
链接: https://arxiv.org/abs/2505.11307
作者: Elsa Rizk,Kun Yuan,Ali H. Sayed
类目: Machine Learning (cs.LG)
*备注: 17 pages
Abstract:Diffusion learning is a framework that endows edge devices with advanced intelligence. By processing and analyzing data locally and allowing each agent to communicate with its immediate neighbors, diffusion effectively protects the privacy of edge devices, enables real-time response, and reduces reliance on central servers. However, traditional diffusion learning relies on communication at every iteration, leading to communication overhead, especially with large learning models. Furthermore, the inherent volatility of edge devices, stemming from power outages or signal loss, poses challenges to reliable communication between neighboring agents. To mitigate these issues, this paper investigates an enhanced diffusion learning approach incorporating local updates and partial agent participation. Local updates will curtail communication frequency, while partial agent participation will allow for the inclusion of agents based on their availability. We prove that the resulting algorithm is stable in the mean-square error sense and provide a tight analysis of its Mean-Square-Deviation (MSD) performance. Various numerical experiments are conducted to illustrate our theoretical findings.
[LG-30] Effective Probabilistic Time Series Forecasting with Fourier Adaptive Noise-Separated Diffusion
链接: https://arxiv.org/abs/2505.11306
作者: Xinyan Wang,Rui Dai,Kaikui Liu,Xiangxiang Chu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose the Fourier Adaptive Lite Diffusion Architecture (FALDA), a novel probabilistic framework for time series forecasting. First, we introduce the Diffusion Model for Residual Regression (DMRR) framework, which unifies diffusion-based probabilistic regression methods. Within this framework, FALDA leverages Fourier-based decomposition to incorporate a component-specific architecture, enabling tailored modeling of individual temporal components. A conditional diffusion model is utilized to estimate the future noise term, while our proposed lightweight denoiser, DEMA (Decomposition MLP with AdaLN), conditions on the historical noise term to enhance denoising performance. Through mathematical analysis and empirical validation, we demonstrate that FALDA effectively reduces epistemic uncertainty, allowing probabilistic learning to primarily focus on aleatoric uncertainty. Experiments on six real-world benchmarks demonstrate that FALDA consistently outperforms existing probabilistic forecasting approaches across most datasets for long-term time series forecasting while achieving enhanced computational efficiency without compromising accuracy. Notably, FALDA also achieves superior overall performance compared to state-of-the-art (SOTA) point forecasting approaches, with improvements of up to 9%.
[LG-31] Graph Representational Learning: When Does More Expressivity Hurt Generalization?
链接: https://arxiv.org/abs/2505.11298
作者: Sohir Maskey,Raffaele Paolino,Fabian Jogl,Gitta Kutyniok,Johannes F. Lutzeyer
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph Neural Networks (GNNs) are powerful tools for learning on structured data, yet the relationship between their expressivity and predictive performance remains unclear. We introduce a family of premetrics that capture different degrees of structural similarity between graphs and relate these similarities to generalization, and consequently, the performance of expressive GNNs. By considering a setting where graph labels are correlated with structural features, we derive generalization bounds that depend on the distance between training and test graphs, model complexity, and training set size. These bounds reveal that more expressive GNNs may generalize worse unless their increased complexity is balanced by a sufficiently large training set or reduced distance between training and test graphs. Our findings relate expressivity and generalization, offering theoretical insights supported by empirical results.
[LG-32] Bidirectional Information Flow (BIF) – A Sample Efficient Hierarchical Gaussian Process for Bayesian Optimization
链接: https://arxiv.org/abs/2505.11294
作者: Juan D. Guerra(1 and 3),Thomas Garbay(1 and 3),Guillaume Lajoie(2 and 3),Marco Bonizzato(1, 2 and 3) ((1) Polytechnique Montréal, (2) Université de Montréal, (3) Mila - Quebec Artificial Intelligence Institute)
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hierarchical Gaussian Process (H-GP) models divide problems into different subtasks, allowing for different models to address each part, making them well-suited for problems with inherent hierarchical structure. However, typical H-GP models do not fully take advantage of this structure, only sending information up or down the hierarchy. This one-way coupling limits sample efficiency and slows convergence. We propose Bidirectional Information Flow (BIF), an efficient H-GP framework that establishes bidirectional information exchange between parent and child models in H-GPs for online training. BIF retains the modular structure of hierarchical models - the parent combines subtask knowledge from children GPs - while introducing top-down feedback to continually refine children models during online learning. This mutual exchange improves sample efficiency, enables robust training, and allows modular reuse of learned subtask models. BIF outperforms conventional H-GP Bayesian Optimization methods, achieving up to 85% and 5× higher R² scores for the parent and children, respectively, on synthetic and real-world neurostimulation optimization tasks.
[LG-33] SubROC: AUC-Based Discovery of Exceptional Subgroup Performance for Binary Classifiers
链接: https://arxiv.org/abs/2505.11283
作者: Tom Siegl,Kutalmış Coşkun,Bjarne Hiller,Amin Mirzaei,Florian Lemmerich,Martin Becker
类目: Machine Learning (cs.LG)
*备注: 49 pages, 8 figures
Abstract:Machine learning (ML) is increasingly employed in real-world applications like medicine or economics, thus potentially affecting large populations. However, ML models often do not perform homogeneously across such populations, resulting in subgroups of the population (e.g., sex=female AND marital_status=married) where the model underperforms or, conversely, is particularly accurate. Identifying and describing such subgroups can support practical decisions on which subpopulation a model is safe to deploy or where more training data is required. The potential of identifying and analyzing such subgroups has been recognized; however, an efficient and coherent framework for effective search is missing. Consequently, we introduce SubROC, an open-source, easy-to-use framework based on Exceptional Model Mining for reliably and efficiently finding strengths and weaknesses of classification models in the form of interpretable population subgroups. SubROC incorporates common evaluation measures (ROC and PR AUC), efficient search space pruning for fast exhaustive subgroup search, control for class imbalance, adjustment for redundant patterns, and significance testing. We illustrate the practical benefits of SubROC in case studies as well as in comparative analyses across multiple datasets.
[LG-34] Multiclass threshold-based classification
链接: https://arxiv.org/abs/2505.11276
作者: Francesco Marchetti,Edoardo Legnaro,Sabrina Guastavino
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we introduce a threshold-based framework for multiclass classification that generalizes the standard argmax rule. This is done by replacing the probabilistic interpretation of softmax outputs with a geometric one on the multidimensional simplex, where the classification depends on a multidimensional threshold. This change of perspective enables, for any trained classification network, an a posteriori optimization of the classification score by means of threshold tuning, as is usually carried out in the binary setting. This allows a further refinement of the prediction capability of any network. Moreover, this multidimensional threshold-based setting makes it possible to define score-oriented losses, which are based on the interpretation of the threshold as a random variable. Our experiments show that the multidimensional threshold tuning yields consistent performance improvements across various networks and datasets, and that the proposed multiclass score-oriented losses are competitive with standard loss functions, resembling the advantages observed in the binary case.
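To make the a posteriori tuning concrete, here is a minimal sketch (my own illustration, not the authors' code) that grid-searches a per-class threshold vector on held-out softmax outputs and classifies by the argmax of the shifted scores; the grid, the macro-F1 objective, and the random data are all placeholder assumptions:

```python
# Hypothetical illustration of multiclass threshold tuning; not the paper's code.
import numpy as np
from itertools import product
from sklearn.metrics import f1_score

def thresholded_predict(probs, thresholds):
    # Shift each class score by its threshold, then take the argmax.
    return np.argmax(probs - thresholds, axis=1)

def tune_thresholds(probs, y_true, grid=np.linspace(0.0, 0.5, 6)):
    n_classes = probs.shape[1]
    best_t, best_f1 = np.zeros(n_classes), -1.0
    for combo in product(grid, repeat=n_classes):   # coarse grid search
        t = np.array(combo)
        f1 = f1_score(y_true, thresholded_predict(probs, t), average="macro")
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=200)   # stand-in softmax outputs
y = rng.integers(0, 3, size=200)
t, f1 = tune_thresholds(probs, y)
print("tuned thresholds:", t, "macro-F1:", round(f1, 3))
```

In practice the thresholds would be tuned on a validation split of real model outputs rather than random data.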
[LG-35] Driving Mechanisms and Forecasting of China's Pet Population – An ARIMA-RF-HW Hybrid Approach
链接: https://arxiv.org/abs/2505.11269
作者: Shengjia Chang,Xianshuo Yue
类目: Machine Learning (cs.LG)
*备注: 10 pages, 6 figures, 7 tables
Abstract:This study proposes a dynamically weighted ARIMA-RF-HW hybrid model integrating ARIMA for seasonality and trends, Random Forest for nonlinear features, and Holt-Winters smoothing for seasonal adjustment to improve China’s pet population forecasting accuracy. Using 2005-2023 data with nine economic, social, and policy indicators (urban income, consumption, aging ratio, policy quantity, new veterinary drug approvals), data were preprocessed via Z-score normalization and missing value imputation. The results show that key drivers of pet populations include urban income (19.48% for cats, 17.15% for dogs), consumption (17.99% for cats), and policy quantity (13.33% for cats, 14.02% for dogs), with aging (12.81% for cats, 13.27% for dogs) and urbanization amplifying the demand for pets. Forecasts show steady cat growth and fluctuating dog numbers, reflecting cats’ adaptability to urban environments. This research supports policymakers in optimizing pet health management and guides enterprises in developing differentiated services, advancing sustainable industry growth.
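The abstract's dynamic weighting can be pictured with a toy combination rule (my assumption of how the three components might be blended; the paper's actual scheme is more involved): weight each model's forecast inversely to its recent validation error.

```python
# Toy dynamic-weighting sketch for an ARIMA + Random Forest + Holt-Winters hybrid.
import numpy as np

def dynamic_weights(errors):
    inv = 1.0 / (np.asarray(errors) + 1e-8)   # smaller error -> larger weight
    return inv / inv.sum()

component_fc = np.array([102.0, 98.5, 100.3])   # ARIMA, RF, HW forecasts (made up)
recent_mae = [3.2, 1.9, 2.5]                    # each model's recent validation MAE
w = dynamic_weights(recent_mae)
print("weights:", w.round(3), "hybrid forecast:", round(float(w @ component_fc), 2))
```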
[LG-36] Fourier Low-rank and Sparse Tensor for Efficient Tensor Completion
链接: https://arxiv.org/abs/2505.11261
作者: Jingyang Li,Jiuqian Shang,Yang Chen
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Tensor completion is crucial in many scientific domains with missing data problems. Traditional low-rank tensor models, including CP, Tucker, and Tensor-Train, exploit low-dimensional structures to recover missing data. However, these methods often treat all tensor modes symmetrically, failing to capture the unique spatiotemporal patterns inherent in scientific data, where the temporal component exhibits both low-frequency stability and high-frequency variations. To address this, we propose a novel model, Fourier Low-rank and Sparse Tensor (FLoST), which decomposes the tensor along the temporal dimension using a Fourier transform. This approach captures low-frequency components with low-rank matrices and high-frequency fluctuations with sparsity, resulting in a hybrid structure that efficiently models both smooth and localized variations. Compared to the well-known tubal-rank model, which assumes low-rankness across all frequency components, FLoST requires significantly fewer parameters, making it computationally more efficient, particularly when the time dimension is large. Through theoretical analysis and empirical experiments, we demonstrate that FLoST outperforms existing tensor completion models in terms of both accuracy and computational efficiency, offering a more interpretable solution for spatiotemporal data reconstruction.
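A rough sketch of the decomposition idea (the fitting procedure below is my simplification, not the paper's estimator): FFT along the time axis, fit the low-frequency block with a truncated SVD, keep only the largest high-frequency entries as the sparse part, then invert.

```python
# FLoST-style reconstruction sketch with assumed hyperparameters.
import numpy as np

def flost_like_reconstruct(X, n_low=4, rank=3, sparse_keep=0.05):
    F = np.fft.rfft(X, axis=1)                  # frequency domain along time
    low, high = F[:, :n_low], F[:, n_low:].copy()
    U, s, Vh = np.linalg.svd(low, full_matrices=False)
    low_lr = U[:, :rank] @ np.diag(s[:rank]) @ Vh[:rank]   # low-rank low frequencies
    thresh = np.quantile(np.abs(high), 1 - sparse_keep)
    high[np.abs(high) < thresh] = 0             # sparse high frequencies
    return np.fft.irfft(np.concatenate([low_lr, high], axis=1), n=X.shape[1], axis=1)

t = np.linspace(0, 4 * np.pi, 128)
X = np.outer(np.arange(1, 11), np.sin(t)) \
    + 0.1 * np.random.default_rng(0).standard_normal((10, 128))
X_hat = flost_like_reconstruct(X)
print("relative error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```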
[LG-37] Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction
链接: https://arxiv.org/abs/2505.11254
作者: Jeffrey Willette,Heejun Lee,Sung Ju Hwang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average 36 percentage-point performance increase, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens while only adding a small overhead. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
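One plausible reading of the delta correction, sketched under heavy assumptions (the probe-based mean correction below is my simplification, not the released method): estimate the output shift of sparse attention against full attention on a small probe of queries and add that delta back.

```python
# Toy distributional-shift correction for sparse attention; illustrative only.
import torch

def attn(q, k, v, mask=None):
    s = q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5
    if mask is not None:
        s = s.masked_fill(~mask, float("-inf"))
    return torch.softmax(s, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(64, 32) for _ in range(3))
idx = torch.arange(64)
mask = (idx[:, None] - idx[None, :]).abs() < 16     # sliding-window sparse mask

sparse_out = attn(q, k, v, mask)
probe = torch.randperm(64)[:8]                      # small probe of queries
delta = (attn(q[probe], k, v) - sparse_out[probe]).mean(0)
corrected = sparse_out + delta                      # "delta" correction
full = attn(q, k, v)
print("err before:", (sparse_out - full).norm().item(),
      "after:", (corrected - full).norm().item())
```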
[LG-38] Rethinking Irregular Time Series Forecasting: A Simple yet Effective Baseline
链接: https://arxiv.org/abs/2505.11250
作者: Xvyuan Liu,Xiangfei Qiu,Xingjian Wu,Zhengyu Li,Chenjuan Guo,Jilin Hu,Bin Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The forecasting of irregular multivariate time series (IMTS) is crucial in key areas such as healthcare, biomechanics, climate science, and astronomy. However, achieving accurate and practical predictions is challenging due to two main factors. First, the inherent irregularity and data missingness in irregular time series make modeling difficult. Second, most existing methods are typically complex and resource-intensive. In this study, we propose a general framework called APN to address these challenges. Specifically, we design a novel Time-Aware Patch Aggregation (TAPA) module that achieves adaptive patching. By learning dynamically adjustable patch boundaries and a time-aware weighted averaging strategy, TAPA transforms the original irregular sequences into high-quality, regularized representations in a channel-independent manner. Additionally, we use a simple query module to effectively integrate historical information while maintaining the model’s efficiency. Finally, predictions are made by a shallow MLP. Experimental results on multiple real-world datasets show that APN outperforms existing state-of-the-art methods in both efficiency and accuracy.
[LG-39] Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins – Dataset and Benchmarks
链接: https://arxiv.org/abs/2505.11239
作者: Wilson Wongso,Hao Xue,Flora D. Salim
类目: Machine Learning (cs.LG)
*备注:
Abstract:Understanding human mobility through Point-of-Interest (POI) recommendation is increasingly important for applications such as urban planning, personalized services, and generative agent simulation. However, progress in this field is hindered by two key challenges: the over-reliance on older datasets from 2012-2013 and the lack of reproducible, city-level check-in datasets that reflect diverse global regions. To address these gaps, we present Massive-STEPS (Massive Semantic Trajectories for Understanding POI Check-ins), a large-scale, publicly available benchmark dataset built upon the Semantic Trails dataset and enriched with semantic POI metadata. Massive-STEPS spans 12 geographically and culturally diverse cities and features more recent (2017-2018) and longer-duration (24 months) check-in data than prior datasets. We benchmarked a wide range of POI recommendation models on Massive-STEPS using both supervised and zero-shot approaches, and evaluated their performance across multiple urban contexts. By releasing Massive-STEPS, we aim to facilitate reproducible and equitable research in human mobility and POI recommendation. The dataset and benchmarking code are available at: this https URL
[LG-40] Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification ICMR’25
链接: https://arxiv.org/abs/2505.11237
作者: Wenhao Qian,Zhenzhen Hu,Zijie Song,Jia Li
类目: Multimedia (cs.MM); Machine Learning (cs.LG)
*备注: ICMR’25, June 30-July 3, 2025, Chicago, IL, USA
Abstract:Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to alleviate the gap between literal features and the figurative task. (2) A prompt construction strategy, that adapts the method of feature extraction and fusion using pre-trained language models for the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available: this https URL.
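The SLERP operation at the heart of Concept Drift is standard and easy to reproduce; below is a minimal version (the tensor sizes and the t=0.5 midpoint are illustrative, not the paper's settings):

```python
# Spherical Linear Interpolation between two embeddings, e.g. CLIP image/text.
import torch
import torch.nn.functional as F

def slerp(a, b, t):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    omega = torch.acos((a * b).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

img_emb, txt_emb = torch.randn(512), torch.randn(512)
drifted = slerp(img_emb, txt_emb, t=0.5)      # "drifted" concept embedding
print(drifted.shape, drifted.norm().item())   # stays (approximately) unit-norm
```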
[LG-41] Memory-Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation
链接: https://arxiv.org/abs/2505.11235
作者: Fei Wu,Jia Hu,Geyong Min,Shiqiang Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Driven by the relentless growth in model parameters, which renders full fine-tuning prohibitively expensive for large-scale deployment, parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for rapidly adapting large models to a wide range of downstream tasks. Among the PEFT family, orthogonal fine-tuning and its variants have demonstrated remarkable performance by preserving hyperspherical energy, which encodes pairwise angular similarity between neurons. However, these methods are inherently memory-inefficient due to the need to store intermediate activations from multiple full-dimensional sparse matrices. To address this limitation, we propose Memory-efficient Orthogonal Fine-Tuning (MOFT) with principal subspace adaptation. Specifically, we first establish a theoretical condition under which orthogonal transformations within a low-rank subspace preserve hyperspherical energy. Based on this insight, we constrain orthogonal fine-tuning to the principal subspace defined by the top-r components obtained through singular value decomposition and impose an additional constraint on the projection matrix to satisfy the preservation condition. To enhance MOFT’s flexibility across tasks, we relax strict orthogonality by introducing two learnable scaling vectors. Extensive experiments on 37 diverse tasks and four models across NLP and CV demonstrate that MOFT consistently outperforms key baselines while significantly reducing the memory footprint of orthogonal fine-tuning.
[LG-42] Learning traffic flows: Graph Neural Networks for Metamodelling Traffic Assignment
链接: https://arxiv.org/abs/2505.11230
作者: Oskar Bohn Lassen,Serio Agriesti,Mohamed Eldafrawi,Daniele Gammelli,Guido Cantelmo,Guido Gentile,Francisco Camara Pereira
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Traffic Assignment Problem is a fundamental, yet computationally expensive, task in transportation modeling, especially for large-scale networks. Traditional methods require iterative simulations to reach equilibrium, making real-time or large-scale scenario analysis challenging. In this paper, we propose a learning-based approach using Message-Passing Neural Networks as a metamodel to approximate the equilibrium flow of the Stochastic User Equilibrium assignment. Our model is designed to mimic the algorithmic structure used in conventional traffic simulators allowing it to better capture the underlying process rather than just the data. We benchmark it against other conventional deep learning techniques and evaluate the model’s robustness by testing its ability to predict traffic flows on input data outside the domain on which it was trained. This approach offers a promising solution for accelerating out-of-distribution scenario assessments, reducing computational costs in large-scale transportation planning, and enabling real-time decision-making.
[LG-43] Learning hidden cascades via classification
链接: https://arxiv.org/abs/2505.11228
作者: Derrick Gilchrist Edward Manoharan,Anubha Goel,Alexandros Iosifidis,Henri Hansen,Juho Kanniainen
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:
Abstract:The spreading dynamics in social networks are often studied under the assumption that individuals’ statuses, whether informed or infected, are fully observable. However, in many real-world situations, such statuses remain unobservable, which is crucial for determining an individual’s potential to further spread the infection. While this final status is hidden, intermediate indicators such as symptoms of infection are observable and provide important insights into the spread process. We propose a partial observability-aware Machine Learning framework to learn the characteristics of the spreading model. We term the method Distribution Classification, which utilizes the power of classifiers to infer the underlying transmission dynamics. We evaluate our method on two types of synthetic networks and extend the study to a real-world insider trading network. Results show that the method performs well, especially on complex networks with high cyclic connectivity, supporting its utility in analyzing real-world spreading phenomena where direct observation of individual statuses is not possible.
[LG-44] Sample Efficient Reinforcement Learning via Large Vision Language Model Distillation ICASSP2025
链接: https://arxiv.org/abs/2505.11221
作者: Donghoon Lee,Tung M. Luu,Younghwan Lee,Chang D. Yoo
类目: Machine Learning (cs.LG)
*备注: 5 pages, ICASSP 2025. The first two authors contributed equally
Abstract:Recent research highlights the potential of multimodal foundation models in tackling complex decision-making challenges. However, their large parameters make real-world deployment resource-intensive and often impractical for constrained systems. Reinforcement learning (RL) shows promise for task-specific agents but suffers from high sample complexity, limiting practical applications. To address these challenges, we introduce LVLM to Policy (LVLM2P), a novel framework that distills knowledge from large vision-language models (LVLM) into more efficient RL agents. Our approach leverages the LVLM as a teacher, providing instructional actions based on trajectories collected by the RL agent, which helps reduce less meaningful exploration in the early stages of learning, thereby significantly accelerating the agent’s learning progress. Additionally, by leveraging the LVLM to suggest actions directly from visual observations, we eliminate the need for manual textual descriptors of the environment, enhancing applicability across diverse tasks. Experiments show that LVLM2P significantly enhances the sample efficiency of baseline RL algorithms.
[LG-45] Minimizing False-Positive Attributions in Explanations of Non-Linear Models
链接: https://arxiv.org/abs/2505.11210
作者: Anders Gjølbye,Stefan Haufe,Lars Kai Hansen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint. Under review
Abstract:Suppressor variables can influence model predictions without being dependent on the target outcome and they pose a significant challenge for Explainable AI (XAI) methods. These variables may cause false-positive feature attributions, undermining the utility of explanations. Although effective remedies exist for linear models, their extension to non-linear models and to instance-based explanations has remained limited. We introduce PatternLocal, a novel XAI technique that addresses this gap. PatternLocal begins with a locally linear surrogate, e.g. LIME, KernelSHAP, or gradient-based methods, and transforms the resulting discriminative model weights into a generative representation, thereby suppressing the influence of suppressor variables while preserving local fidelity. In extensive hyperparameter optimization on the XAI-TRIS benchmark, PatternLocal consistently outperformed other XAI methods and reduced false-positive attributions when explaining non-linear tasks, thereby enabling more reliable and actionable insights.
[LG-46] Modeling Cell Dynamics and Interactions with Unbalanced Mean Field Schrödinger Bridge
链接: https://arxiv.org/abs/2505.11197
作者: Zhenyi Zhang,Zihan Wang,Yuhao Sun,Tiejun Li,Peijie Zhou
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Modeling the dynamics from sparsely time-resolved snapshot data is crucial for understanding complex cellular processes and behavior. Existing methods leverage optimal transport, Schrödinger bridge theory, or their variants to simultaneously infer stochastic, unbalanced dynamics from snapshot data. However, these approaches remain limited in their ability to account for cell-cell interactions. This integration is essential in real-world scenarios since intercellular communications are fundamental life processes and can influence cell state-transition dynamics. To address this challenge, we formulate the Unbalanced Mean-Field Schrödinger Bridge (UMFSB) framework to model unbalanced stochastic interaction dynamics from snapshot data. Inspired by this framework, we further propose CytoBridge, a deep learning algorithm designed to approximate the UMFSB problem. By explicitly modeling cellular transitions, proliferation, and interactions through neural networks, CytoBridge offers the flexibility to learn these processes directly from data. The effectiveness of our method has been extensively validated using both synthetic gene regulatory data and real scRNA-seq datasets. Compared to existing methods, CytoBridge identifies growth, transition, and interaction patterns, eliminates false transitions, and reconstructs the developmental landscape with greater accuracy.
[LG-47] VitaGraph: Building a Knowledge Graph for Biologically Relevant Learning Tasks
链接: https://arxiv.org/abs/2505.11185
作者: Francesco Madeddu,Lucia Testa,Gianluca De Carlo,Michele Pieroni,Andrea Mastropietro,Aris Anagnostopoulos,Paolo Tieri,Sergio Barbarossa
类目: Machine Learning (cs.LG)
*备注: 9 pages of main text, 4 figures
Abstract:The intrinsic complexity of human biology presents ongoing challenges to scientific understanding. Researchers collaborate across disciplines to expand our knowledge of the biological interactions that define human life. AI methodologies have emerged as powerful tools across scientific domains, particularly in computational biology, where graph data structures effectively model biological entities such as protein-protein interaction (PPI) networks and gene functional networks. Those networks are used as datasets for paramount network medicine tasks, such as gene-disease association prediction, drug repurposing, and polypharmacy side effect studies. Reliable predictions from machine learning models require high-quality foundational data. In this work, we present a comprehensive multi-purpose biological knowledge graph constructed by integrating and refining multiple publicly available datasets. Building upon the Drug Repurposing Knowledge Graph (DRKG), we define a pipeline tasked with a) cleaning inconsistencies and redundancies present in DRKG, b) coalescing information from the main available public data sources, and c) enriching the graph nodes with expressive feature vectors such as molecular fingerprints and gene ontologies. Biologically and chemically relevant features improve the capacity of machine learning models to generate accurate and well-structured embedding spaces. The resulting resource represents a coherent and reliable biological knowledge graph that serves as a state-of-the-art platform to advance research in computational biology and precision medicine. Moreover, it offers the opportunity to benchmark graph-based machine learning and network medicine models on relevant tasks. We demonstrate the effectiveness of the proposed dataset by benchmarking it against the task of drug repurposing, PPI prediction, and side-effect prediction, modeled as link prediction problems.
[LG-48] Gaussian Weight Sampling for Scalable Efficient and Stable Pseudo-Quantization Training
链接: https://arxiv.org/abs/2505.11170
作者: Myeonghwan Ahn,Sungjoo Yoo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Ever-growing scale of large language models (LLMs) is pushing for improved efficiency, favoring fully quantized training (FQT) over BF16. While FQT accelerates training, it faces consistency challenges and requires searching over an exponential number of cases, each needing over 200B tokens to ensure stability. Pseudo-quantization training (PQT) addresses the issues of FQT, although it is not well-studied. We explore the practical implications of PQT in detail and propose a noise distribution R that is floating-point (FP)-friendly, with ideal properties including stochastic precision annealing. As a result, the proposed method serves as an effective theoretical foundation for low-precision FP parameters through PQT, utilizing efficient fake quantization via an addition and subsequent FP casting. We demonstrate that Gaussian weight sampling is (1) scalable: supports low-precision FP parameters down to FP6 and high-precision noise up to 9-bit with BF16 operator. The proposed method is (2) efficient: incurring computational overhead as low as 1.40% on the A100 GPU in terms of Llama2 training tokens per second, and requiring 2 bytes per parameter in GPU memory. We demonstrate that PQT with Gaussian weight sampling is (3) stable: closely following or even surpassing performance of the BF16 baseline while pre-training GPT2 and Llama2 models with up to 1B parameters and 300B tokens.
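The "fake quantization via an addition and subsequent FP casting" can be pictured in a few lines (the step size and noise scale below are assumptions, not the paper's settings):

```python
# Pseudo-quantization sketch: Gaussian weight sampling + FP cast.
import torch

def pqt_sample(w, step=2 ** -4, noise_scale=0.5):
    noise = torch.randn_like(w) * step * noise_scale   # Gaussian weight sampling
    return (w + noise).to(torch.bfloat16)              # fake quantize via FP casting

w = torch.randn(4, 4)
print(pqt_sample(w).dtype, (pqt_sample(w).float() - w).abs().mean().item())
```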
[LG-49] Bi-directional Recurrence Improves Transformer in Partially Observable Markov Decision Processes
链接: https://arxiv.org/abs/2505.11153
作者: Ashok Arora,Neetesh Kumar
类目: Machine Learning (cs.LG)
*备注:
Abstract:In real-world reinforcement learning (RL) scenarios, agents often encounter partial observability, where incomplete or noisy information obscures the true state of the environment. Partially Observable Markov Decision Processes (POMDPs) are commonly used to model these environments, but effective performance requires memory mechanisms to utilise past observations. While recurrent networks have traditionally addressed this need, transformer-based models have recently shown improved sample efficiency in RL tasks. However, their application to POMDPs remains underdeveloped, and their real-world deployment is constrained due to the high parameter count. This work introduces a novel bi-recurrent model architecture that improves sample efficiency and reduces model parameter count in POMDP scenarios. The architecture replaces the multiple feed-forward layers with a single bi-directional recurrence layer to better capture and utilize sequential dependencies and contextual information. This approach improves the model’s ability to handle partial observability and increases sample efficiency, enabling effective learning from comparatively fewer interactions. To evaluate the performance of the proposed model architecture, experiments were conducted on a total of 23 POMDP environments. The proposed model architecture outperforms existing transformer-based, attention-based, and recurrence-based methods by a margin ranging from 87.39% to 482.04% on average across the 23 POMDP environments.
[LG-50] Covariance Density Neural Networks
链接: https://arxiv.org/abs/2505.11139
作者: Om Roy,Yashar Moshfeghi,Keith Smith
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph neural networks have re-defined how we model and predict on network data, but there is no consensus on how to choose the underlying graph structure on which to model signals. CoVariance Neural Networks (VNN) address this issue by using the sample covariance matrix as a Graph Shift Operator (GSO). Here, we improve on the performance of VNNs by constructing a Density Matrix, in which the sample covariance matrix is treated as a quasi-Hamiltonian of the system in the space of random variables. Crucially, using this density matrix as the GSO allows components of the data to be extracted at different scales, allowing enhanced discriminability and performance. We show that this approach allows explicit control of the stability-discriminability trade-off of the network, provides enhanced robustness to noise compared to VNNs, and outperforms them in useful real-life applications where the underlying covariance matrix is informative. In particular, we show that our model can achieve strong performance in subject-independent Brain Computer Interface EEG motor imagery classification, outperforming EEGnet while being faster. This shows how covariance density neural networks provide a basis for the notoriously difficult task of transferability of BCIs when evaluated on unseen individuals.
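A sketch of the density-matrix GSO (the matrix-exponential normalization is my reading of "quasi-Hamiltonian"; beta is an assumed scale parameter):

```python
# Build a density matrix from a sample covariance and use it as a graph shift.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))      # 200 samples, 8 variables
C = np.cov(X, rowvar=False)
beta = 1.0                             # controls which scales of C dominate
rho = expm(-beta * C)
rho /= np.trace(rho)                   # density matrix: PSD with unit trace
z = rho @ rng.standard_normal(8)       # one graph shift of a random signal
print("trace:", round(np.trace(rho), 6), "shifted signal:", z.round(3))
```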
[LG-51] Fairness-aware Anomaly Detection via Fair Projection
链接: https://arxiv.org/abs/2505.11132
作者: Feng Xiao,Xiaoying Tang,Jicong Fan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Unsupervised anomaly detection is a critical task in many high-social-impact applications such as finance, healthcare, social media, and cybersecurity, where demographics involving age, gender, race, disease, etc, are used frequently. In these scenarios, possible bias from anomaly detection systems can lead to unfair treatment for different groups and even exacerbate social bias. In this work, first, we thoroughly analyze the feasibility and necessary assumptions for ensuring group fairness in unsupervised anomaly detection. Second, we propose a novel fairness-aware anomaly detection method FairAD. From the normal training data, FairAD learns a projection to map data of different demographic groups to a common target distribution that is simple and compact, and hence provides a reliable base to estimate the density of the data. The density can be directly used to identify anomalies while the common target distribution ensures fairness between different groups. Furthermore, we propose a threshold-free fairness metric that provides a global view for model’s fairness, eliminating dependence on manual threshold selection. Experiments on real-world benchmarks demonstrate that our method achieves an improved trade-off between detection accuracy and fairness under both balanced and skewed data across different groups.
[LG-52] FedDuA: Doubly Adaptive Federated Learning
链接: https://arxiv.org/abs/2505.11126
作者: Shokichi Takakura,Seng Pei Liew,Satoshi Hasegawa
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Federated learning is a distributed learning framework where clients collaboratively train a global model without sharing their raw data. FedAvg is a popular algorithm for federated learning, but it often suffers from slow convergence due to the heterogeneity of local datasets and anisotropy in the parameter space. In this work, we formalize the central server optimization procedure through the lens of mirror descent and propose a novel framework, called FedDuA, which adaptively selects the global learning rate based on both inter-client and coordinate-wise heterogeneity in the local updates. We prove that our proposed doubly adaptive step-size rule is minimax optimal and provide a convergence analysis for convex objectives. Although the proposed method does not require additional communication or computational cost on clients, extensive numerical experiments show that our proposed framework outperforms baselines in various settings and is robust to the choice of hyperparameters.
[LG-53] GraphOracle: A Foundation Model for Knowledge Graph Reasoning
链接: https://arxiv.org/abs/2505.11125
作者: Enjun Du,Siyi Liu,Yongqi Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Foundation models have demonstrated remarkable capabilities across various domains, but developing analogous models for knowledge graphs presents unique challenges due to their dynamic nature and the need for cross-domain reasoning. To address these issues, we introduce GraphOracle, a relation-centric foundation model that unifies reasoning across knowledge graphs by converting them into Relation-Dependency Graphs (RDG), explicitly encoding compositional patterns with fewer edges than prior methods. A query-dependent attention mechanism is further developed to learn inductive representations for both relations and entities. Pre-training on diverse knowledge graphs, followed by minutes-level fine-tuning, enables effective generalization to unseen entities, relations, and entire graphs. Through comprehensive experiments on 31 diverse benchmarks spanning transductive, inductive, and cross-domain settings, we demonstrate consistent state-of-the-art performance with minimal adaptation, improving the prediction performance by up to 35% compared to the strongest baselines.
[LG-54] Dual-Balancing for Physics-Informed Neural Networks IJCAI2025
链接: https://arxiv.org/abs/2505.11117
作者: Chenhong Zhou,Jie Chen,Zaifeng Yang,Ching Eng Png
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: Accepted at IJCAI 2025 (34th International Joint Conference on Artificial Intelligence)
Abstract:Physics-informed neural networks (PINNs) have emerged as a new learning paradigm for solving partial differential equations (PDEs) by enforcing the constraints of physical equations, boundary conditions (BCs), and initial conditions (ICs) into the loss function. Despite their successes, vanilla PINNs still suffer from poor accuracy and slow convergence due to the intractable multi-objective optimization issue. In this paper, we propose a novel Dual-Balanced PINN (DB-PINN), which dynamically adjusts loss weights by integrating inter-balancing and intra-balancing to alleviate two imbalance issues in PINNs. Inter-balancing aims to mitigate the gradient imbalance between PDE residual loss and condition-fitting losses by determining an aggregated weight that offsets their gradient distribution discrepancies. Intra-balancing acts on condition-fitting losses to tackle the imbalance in fitting difficulty across diverse conditions. By evaluating the fitting difficulty based on the loss records, intra-balancing can allocate the aggregated weight proportionally to each condition loss according to its fitting difficulty levels. We further introduce a robust weight update strategy to prevent abrupt spikes and arithmetic overflow in instantaneous weight values caused by large loss variances, enabling smooth weight updating and stable training. Extensive experiments demonstrate that DB-PINN achieves significantly superior performance than those popular gradient-based weighting methods in terms of convergence speed and prediction accuracy. Our code and supplementary material are available at this https URL.
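The inter-balancing step can be approximated with a standard gradient-norm weighting (a simplified sketch; DB-PINN's actual rule adds intra-balancing and a robust update strategy on top of this):

```python
# Gradient-based loss balancing sketch for a PINN-style objective.
import torch

net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x = torch.rand(64, 2)
loss_pde = (net(x) ** 2).mean()         # stand-in PDE residual loss
loss_bc = ((net(x) - 1) ** 2).mean()    # stand-in boundary-condition loss

def grad_norm(loss):
    g = torch.autograd.grad(loss, net.parameters(), retain_graph=True)
    return torch.cat([gi.flatten() for gi in g]).norm()

w_bc = (grad_norm(loss_pde) / grad_norm(loss_bc)).detach()  # offset gradient imbalance
total = loss_pde + w_bc * loss_bc
total.backward()
print("balancing weight:", w_bc.item())
```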
[LG-55] ShiQ: Bringing Back Bellman to LLMs
链接: https://arxiv.org/abs/2505.11081
作者: Pierre Clavier,Nathan Grinsztajn,Raphael Avalos,Yannis Flet-Berliac,Irem Ergun,Omar D. Domingues,Eugene Tarassov,Olivier Pietquin,Pierre H. Richemond,Florian Strub,Matthieu Geist
类目: Machine Learning (cs.LG)
*备注:
Abstract:The fine-tuning of pre-trained large language models (LLMs) using reinforcement learning (RL) is generally formulated as direct policy optimization. This approach was naturally favored as it efficiently improves a pretrained LLM, seen as an initial policy. Another RL paradigm, Q-learning methods, has received far less attention in the LLM community while demonstrating major success in various non-LLM RL tasks. In particular, Q-learning’s effectiveness comes from its sample efficiency and ability to learn offline, which is particularly valuable given the high computational cost of sampling with LLMs. However, naively applying a Q-learning-style update to the model’s logits is ineffective due to the specificity of LLMs. Our core contribution is to derive theoretically grounded loss functions from Bellman equations to adapt Q-learning methods to LLMs. To do so, we carefully adapt insights from the RL literature to account for LLM-specific characteristics, ensuring that the logits become reliable Q-value estimates. We then use this loss to build a practical algorithm, ShiQ for Shifted-Q, that supports off-policy, token-wise learning while remaining simple to implement. Finally, we evaluate ShiQ on both synthetic data and real-world benchmarks, e.g., UltraFeedback and BFCL-V3, demonstrating its effectiveness in both single-turn and multi-turn LLM settings.
[LG-56] Addition is almost all you need: Compressing neural networks with double binary factorization
链接: https://arxiv.org/abs/2505.11076
作者: Vladimír Boža,Vladimír Macko
类目: Machine Learning (cs.LG)
*备注:
Abstract:Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient approach to address the increasing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint (±1) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. Specifically, in a 1-bit per weight range, DBF is better than existing binarization approaches. In a 2-bit per weight range, DBF is competitive with the best quantization methods like QuIP# and QTIP. Unlike most existing compression techniques, which offer limited compression level choices, DBF allows fine-grained control over compression ratios by adjusting the factorization’s intermediate dimension. Based on this advantage, we further introduce an algorithm for estimating non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria. Code available at: this https URL
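A naive alternating scheme conveys the shape of the factorization W ≈ diag(a)·sign(A)·sign(B) (one scaling vector is used here for brevity, whereas DBF attaches a scaling vector to each binary factor; the refinement loop is my simplification, not the paper's algorithm):

```python
# Double-binary-factorization sketch: two sign matrices plus a scaling vector.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))
k = 64                                           # intermediate dimension
A = np.sign(rng.standard_normal((32, k)))
B = np.sign(rng.standard_normal((k, 32)))
a = np.ones(32)

for _ in range(50):
    Q = A @ B
    a = (Q * W).sum(1) / (Q * Q).sum(1)          # least-squares row scales
    R = W - np.diag(a) @ A @ B                   # current residual
    A = np.sign(A + 0.1 * np.diag(a) @ R @ B.T)  # projected gradient step on A
    B = np.sign(B + 0.1 * A.T @ np.diag(a) @ R)  # and on B

err = np.linalg.norm(W - np.diag(a) @ A @ B) / np.linalg.norm(W)
print("relative reconstruction error:", round(err, 3))
```

Making k larger trades compression for fidelity, which mirrors the paper's point about fine-grained control via the intermediate dimension.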
[LG-57] NeuralSurv: Deep Survival Analysis with Bayesian Uncertainty Quantification
链接: https://arxiv.org/abs/2505.11054
作者: Mélodie Monod,Alessandro Micheli,Samir Bhatt
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We introduce NeuralSurv, the first deep survival model to incorporate Bayesian uncertainty quantification. Our non-parametric, architecture-agnostic framework flexibly captures time-varying covariate-risk relationships in continuous time via a novel two-stage data-augmentation scheme, for which we establish theoretical guarantees. For efficient posterior inference, we introduce a mean-field variational algorithm with coordinate-ascent updates that scale linearly in model size. By locally linearizing the Bayesian neural network, we obtain full conjugacy and derive all coordinate updates in closed form. In experiments, NeuralSurv delivers superior calibration compared to state-of-the-art deep survival models, while matching or exceeding their discriminative performance across both synthetic benchmarks and real-world datasets. Our results demonstrate the value of Bayesian principles in data-scarce regimes by enhancing model calibration and providing robust, well-calibrated uncertainty estimates for the survival function.
[LG-58] Exploration by Random Distribution Distillation
链接: https://arxiv.org/abs/2505.11044
作者: Zhirui Fang,Kai Yang,Jian Tao,Jiafei Lyu,Lusong Li,Li Shen,Xiu Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Exploration remains a critical challenge in online reinforcement learning, as an agent must effectively explore unknown environments to achieve high returns. Currently, the main exploration algorithms are primarily count-based methods and curiosity-based methods, with prediction-error methods being a prominent example. In this paper, we propose a novel method called Random Distribution Distillation (RDD), which samples the output of a target network from a normal distribution. RDD facilitates a more extensive exploration by explicitly treating the difference between the prediction network and the target network as an intrinsic reward. Furthermore, by introducing randomness into the output of the target network for a given state and modeling it as a sample from a normal distribution, intrinsic rewards are bounded by two key components: a pseudo-count term ensuring proper exploration decay and a discrepancy term accounting for predictor convergence. We demonstrate that RDD effectively unifies both count-based and prediction-error approaches. It retains the advantages of prediction-error methods in high-dimensional spaces, while also implementing an intrinsic reward decay mode akin to the pseudo-count method. In the experimental section, RDD is compared with more advanced methods in a series of environments. Both theoretical analysis and experimental results confirm the effectiveness of our approach in improving online exploration for reinforcement learning tasks.
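A sketch of the idea as described in the abstract (the network architectures and shapes are assumed): a frozen random target maps a state to a Gaussian, and the intrinsic reward is the predictor's error against a sample from that Gaussian.

```python
# RDD-style intrinsic reward sketch, in the spirit of RND with a stochastic target.
import torch

target = torch.nn.Linear(8, 2 * 16)      # frozen random net: outputs mean, log-std
for p in target.parameters():
    p.requires_grad_(False)
predictor = torch.nn.Linear(8, 16)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def intrinsic_reward(state):
    mu, log_std = target(state).chunk(2, dim=-1)
    sample = mu + log_std.exp() * torch.randn_like(mu)   # stochastic target output
    err = ((predictor(state) - sample) ** 2).mean(-1)    # discrepancy = novelty
    opt.zero_grad(); err.mean().backward(); opt.step()   # distill online
    return err.detach()

print(intrinsic_reward(torch.randn(4, 8)))
```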
[LG-59] Efficient Attention via Pre-Scoring: Prioritizing Informative Keys in Transformers
链接: https://arxiv.org/abs/2505.11040
作者: Zhexiang Li,Haoyu Wang,Yutong Bao,David Woodruff
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in transformer architectures deeply enhance long-context language modeling. Among them, HyperAttention achieves competitive efficiency by combining a single-level LSH-based clustering with uniform residual sampling. However, such sampling can fail to capture crucial keys, which in turn raises the overall perplexity. In this paper, we propose a pre-scoring mechanism to assist HyperAttention to prioritize significant keys. Specifically, we introduce three scoring methods: K-means clustering, K-median clustering, and leverage score-based ranking (inspired by LevAttention) to filter keys effectively. We further replace HyperAttention’s original uniform residual sampling entirely, relying exclusively on our pre-scoring mechanism. Experiments on ChatGLM2 (131k token context) reduce perplexity from 12 to 8.3, which outperforms standard HyperAttention. Moreover, when running on the Vision-Transformer (ViT), our method achieves accuracy similar to LevAttention and surpasses it for specific parameter settings. Although this method introduces computational overhead, its combination with HyperAttention remains 20 times faster than FlashAttention, providing a balanced trade-off between speed and modeling accuracy. Our results highlight the effectiveness of integrating pre-scoring into hierarchical attention mechanisms, significantly improving Transformer’s efficiency.
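A small sketch of the K-means variant of pre-scoring (the selection rule below, keeping keys closest to their centroids, is my assumption of how the scores are used):

```python
# Pre-score keys by clustering, then attend only over the top-scored subset.
import torch

def kmeans(x, k=4, iters=10):
    c = x[torch.randperm(len(x))[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(x, c).argmin(1)
        for j in range(k):
            if (assign == j).any():
                c[j] = x[assign == j].mean(0)
    return c

torch.manual_seed(0)
q, k_mat, v = (torch.randn(128, 32) for _ in range(3))
c = kmeans(k_mat)
score = -torch.cdist(k_mat, c).min(1).values   # closer to a centroid = higher score
keep = score.topk(32).indices                  # pre-scored key subset
out = torch.softmax(q @ k_mat[keep].T / 32 ** 0.5, -1) @ v[keep]
print(out.shape)
```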
[LG-60] Deep Latent Variable Model based Vertical Federated Learning with Flexible Alignment and Labeling Scenarios
链接: https://arxiv.org/abs/2505.11035
作者: Kihun Hong,Sejun Park,Ganguk Hwang
类目: Machine Learning (cs.LG)
*备注: 9 pages + appendix, 8 figures, 18 tables
Abstract:Federated learning (FL) has attracted significant attention for enabling collaborative learning without exposing private data. Among the primary variants of FL, vertical federated learning (VFL) addresses feature-partitioned data held by multiple institutions, each holding complementary information for the same set of users. However, existing VFL methods often impose restrictive assumptions such as a small number of participating parties, fully aligned data, or only using labeled data. In this work, we reinterpret alignment gaps in VFL as missing data problems and propose a unified framework that accommodates both training and inference under arbitrary alignment and labeling scenarios, while supporting diverse missingness mechanisms. In the experiments on 168 configurations spanning four benchmark datasets, six training-time missingness patterns, and seven testing-time missingness patterns, our method outperforms all baselines in 160 cases with an average gap of 9.6 percentage points over the next-best competitors. To the best of our knowledge, this is the first VFL framework to jointly handle arbitrary data alignment, unlabeled data, and multi-party collaboration all at once.
[LG-61] Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere
链接: https://arxiv.org/abs/2505.11029
作者: Li Ju,Max Andersson,Stina Fredriksson,Edward Glöckner,Andreas Hellander,Ekta Vats,Prashant Singh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Vision-language models (VLMs) as foundation models have significantly enhanced performance across a wide range of visual and textual tasks, without requiring large-scale training from scratch for downstream tasks. However, these deterministic VLMs fail to capture the inherent ambiguity and uncertainty in natural language and visual data. Recent probabilistic post-hoc adaptation methods address this by mapping deterministic embeddings onto probability distributions; however, existing approaches do not account for the asymmetric uncertainty structure of the modalities, and the constraint that meaningful deterministic embeddings reside on a unit hypersphere, potentially leading to suboptimal performance. In this paper, we address the asymmetric uncertainty structure inherent in textual and visual data, and propose AsymVLM to build probabilistic embeddings from pre-trained VLMs on the unit hypersphere, enabling uncertainty quantification. We validate the effectiveness of the probabilistic embeddings on established benchmarks, and present comprehensive ablation studies demonstrating the inherent nature of asymmetry in the uncertainty structure of textual and visual data.
[LG-62] Leveraging Real-Time Data Analysis and Multiple Kernel Learning for Manufacturing of Innovative Steels
链接: https://arxiv.org/abs/2505.11024
作者: Wolfgang Rannetbauer,Simon Hubmer,Carina Hambrock,Ronny Ramlau
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 29 pages, 7 figures
Abstract:The implementation of thermally sprayed components in steel manufacturing presents challenges for production and plant maintenance. While enhancing performance through specialized surface properties, these components may encounter difficulties in meeting modified requirements due to standardization in the refurbishment process. This article proposes updating the established coating process for thermally spray coated components for steel manufacturing (TCCSM) by integrating real-time data analytics and predictive quality management. Two essential components–the data aggregator and the quality predictor–are designed through continuous process monitoring and the application of data-driven methodologies to meet the dynamic demands of the evolving steel landscape. The quality predictor is powered by the simple and effective multiple kernel learning strategy with the goal of realizing predictive quality. The data aggregator, designed with sensors, flow meters, and intelligent data processing for the thermal spray coating process, is proposed to facilitate real-time analytics. The performance of this combination was verified using small-scale tests that enabled not only the accurate prediction of coating quality based on the collected data but also proactive notification to the operator as soon as significant deviations are identified.
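A compact sketch of the "simple and effective multiple kernel learning" ingredient (the kernel-target-alignment weighting below is an assumed stand-in for the learned weights; the data are synthetic):

```python
# Multiple kernel learning sketch: weighted kernel mixture + kernel ridge fit.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((80, 5)), rng.standard_normal(80)

def rbf(X, gamma):
    d = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

kernels = [rbf(X, g) for g in (0.1, 1.0, 10.0)]
# heuristic weights from each kernel's alignment with the target
align = np.array([y @ K @ y / np.linalg.norm(K) for K in kernels]).clip(min=0)
w = align / align.sum()
K = sum(wi * Ki for wi, Ki in zip(w, kernels))
alpha = np.linalg.solve(K + 1e-2 * np.eye(80), y)   # kernel ridge regression
print("kernel weights:", w.round(3), "train MSE:", ((K @ alpha - y) ** 2).mean().round(4))
```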
[LG-63] Informed but Not Always Improved: Challenging the Benefit of Background Knowledge in GNNs
链接: https://arxiv.org/abs/2505.11023
作者: Kutalmış Coşkun,Ivo Kavisanczki,Amin Mirzaei,Tom Siegl,Bjarne C. Hiller,Stefan Lüdtke,Martin Becker
类目: Machine Learning (cs.LG)
*备注: 10 pages, 7 figures
Abstract:In complex and low-data domains such as biomedical research, incorporating background knowledge (BK) graphs, such as protein-protein interaction (PPI) networks, into graph-based machine learning pipelines is a promising research direction. However, while BK is often assumed to improve model performance, its actual contribution and the impact of imperfect knowledge remain poorly understood. In this work, we investigate the role of BK in an important real-world task: cancer subtype classification. Surprisingly, we find that (i) state-of-the-art GNNs using BK perform no better than uninformed models like linear regression, and (ii) their performance remains largely unchanged even when the BK graph is heavily perturbed. To understand these unexpected results, we introduce an evaluation framework, which employs (i) a synthetic setting where the BK is clearly informative and (ii) a set of perturbations that simulate various imperfections in BK graphs. With this, we test the robustness of BK-aware models in both synthetic and real-world biomedical settings. Our findings reveal that careful alignment of GNN architectures and BK characteristics is necessary but holds the potential for significant performance improvements.
[LG-64] Logo-LLM: Local and Global Modeling with Large Language Models for Time Series Forecasting
链接: https://arxiv.org/abs/2505.11017
作者: Wenjie Ou,Zhishuo Zhao,Dongyue Guo,Yi Lin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series forecasting is critical across multiple domains, where time series data exhibits both local patterns and global dependencies. While Transformer-based methods effectively capture global dependencies, they often overlook short-term local variations in time series. Recent methods that adapt large language models (LLMs) into time series forecasting inherit this limitation by treating LLMs as black-box encoders, relying solely on the final-layer output and underutilizing hierarchical representations. To address this limitation, we propose Logo-LLM, a novel LLM-based framework that explicitly extracts and models multi-scale temporal features from different layers of a pre-trained LLM. Through empirical analysis, we show that shallow layers of LLMs capture local dynamics in time series, while deeper layers encode global trends. Moreover, Logo-LLM introduces lightweight Local-Mixer and Global-Mixer modules to align and integrate features with the temporal input across layers. Extensive experiments demonstrate that Logo-LLM achieves superior performance across diverse benchmarks, with strong generalization in few-shot and zero-shot settings while maintaining low computational overhead.
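The layer-wise idea can be pictured schematically (toy dimensions; the real Local-Mixer and Global-Mixer modules are more elaborate than the linear stand-ins here): take hidden states from a shallow and a deep layer of a frozen LLM and fuse them for forecasting.

```python
# Schematic Logo-LLM-style fusion of shallow (local) and deep (global) features.
import torch

hidden = [torch.randn(8, 96, 768) for _ in range(12)]   # stand-in layer outputs
local_mixer = torch.nn.Linear(768, 64)    # shallow layer -> local dynamics
global_mixer = torch.nn.Linear(768, 64)   # deep layer -> global trends
head = torch.nn.Linear(128, 24)           # 24-step forecast head

local_feat = local_mixer(hidden[1]).mean(dim=1)     # pool over time
global_feat = global_mixer(hidden[-1]).mean(dim=1)
forecast = head(torch.cat([local_feat, global_feat], dim=-1))
print(forecast.shape)  # (8, 24)
```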
[LG-65] ReaCritic: Large Reasoning Transformer-based DRL Critic-model Scaling For Heterogeneous Networks
链接: https://arxiv.org/abs/2505.10992
作者: Feiran You,Hongyang Du
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Heterogeneous Networks (HetNets) pose critical challenges for intelligent management due to the diverse user requirements and time-varying wireless conditions. These factors introduce significant decision complexity, which limits the adaptability of existing Deep Reinforcement Learning (DRL) methods. In many DRL algorithms, especially those involving value-based or actor-critic structures, the critic component plays a key role in guiding policy learning by estimating value functions. However, conventional critic models often use shallow architectures that map observations directly to scalar estimates, limiting their ability to handle multi-task complexity. In contrast, recent progress in inference-time scaling of Large Language Models (LLMs) has shown that generating intermediate reasoning steps can significantly improve decision quality. Motivated by this, we propose ReaCritic, a large reasoning transformer-based critic-model scaling scheme that brings reasoning ability into DRL. ReaCritic performs horizontal reasoning over parallel state-action inputs and vertical reasoning through deep transformer stacks. It is compatible with a broad range of value-based and actor-critic DRL algorithms and enhances generalization in dynamic wireless environments. Extensive experiments demonstrate that ReaCritic improves convergence speed and final performance across various HetNet settings and standard OpenAI Gym control tasks.
[LG-66] SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache
链接: https://arxiv.org/abs/2505.10951
作者: Qiuyu Zhu,Liang Zhang,Qianxiong Xu,Cheng Long,Jie Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to incorporate structured knowledge via graph retrieval as contextual input, enhancing more accurate and context-aware reasoning. We observe that for different queries, it could retrieve similar subgraphs as prompts, and thus we propose SubGCache, which aims to reduce inference latency by reusing computation across queries with similar structural prompts (i.e., subgraphs). Specifically, SubGCache clusters queries based on subgraph embeddings, constructs a representative subgraph for each cluster, and pre-computes the key-value (KV) cache of the representative subgraph. For each query with its retrieved subgraph within a cluster, it reuses the pre-computed KV cache of the representative subgraph of the cluster without computing the KV tensors again for saving computation. Experiments on two new datasets across multiple LLM backbones and graph-based RAG frameworks demonstrate that SubGCache consistently reduces inference latency with comparable and even improved generation quality, achieving up to 6.68× reduction in time-to-first-token (TTFT).
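The reuse pattern itself is simple and can be sketched with toy stand-ins for the LLM prefill and decode (everything below is schematic; a real system would store per-layer KV tensors):

```python
# SubGCache-style reuse: prefill once per subgraph cluster, then reuse the cache.
import numpy as np

def expensive_prefix_kv(subgraph_emb):     # stand-in for LLM prefill
    return np.tanh(subgraph_emb)           # pretend this is the prefix KV cache

centroids = np.random.randn(4, 16)         # representative subgraph embeddings
kv_cache = {}

def answer(query_emb, subgraph_emb):
    cid = int(np.linalg.norm(centroids - subgraph_emb, axis=1).argmin())
    if cid not in kv_cache:                # compute KV once per cluster
        kv_cache[cid] = expensive_prefix_kv(centroids[cid])
    return kv_cache[cid] @ np.ones(16) + query_emb.sum()   # toy decode step

print(answer(np.random.randn(8), np.random.randn(16)))
print("cached clusters:", list(kv_cache))
```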
[LG-67] Shackled Dancing: A Bit-Locked Diffusion Algorithm for Lossless and Controllable Image Steganography
链接: https://arxiv.org/abs/2505.10950
作者: Tianshuo Zhang,Gao Jia,Wenzhe Zhai,Rui Yann,Xianglei Xing
类目: Machine Learning (cs.LG)
*备注:
Abstract:Data steganography aims to conceal information within visual content, yet existing spatial- and frequency-domain approaches suffer from trade-offs between security, capacity, and perceptual quality. Recent advances in generative models, particularly diffusion models, offer new avenues for adaptive image synthesis, but integrating precise information embedding into the generative process remains challenging. We introduce Shackled Dancing Diffusion, or SD², a plug-and-play generative steganography method that combines bit-position locking with diffusion sampling injection to enable controllable information embedding within the generative trajectory. SD² leverages the expressive power of diffusion models to synthesize diverse carrier images while maintaining full message recovery with 100% accuracy. Our method achieves a favorable balance between randomness and constraint, enhancing robustness against steganalysis without compromising image fidelity. Extensive experiments show that SD² substantially outperforms prior methods in security, embedding capacity, and stability. This algorithm offers new insights into controllable generation and opens promising directions for secure visual communication.
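A highly simplified sketch of bit-position locking (the mechanism is inferred from the abstract; the "denoising loop" is a toy stand-in): at every sampling step, the signs of a fixed set of latent coordinates are forced to encode the message, which is why recovery stays exact.

```python
# Bit-locking sketch: message bits survive a (toy) iterative sampling loop.
import torch

def lock_bits(z, positions, bits):
    signs = bits.float() * 2 - 1                   # bit 1 -> +, bit 0 -> -
    z[..., positions] = z[..., positions].abs() * signs
    return z

bits = torch.randint(0, 2, (16,))
positions = torch.arange(16)
z = torch.randn(1, 64)
for _ in range(50):                                # stand-in for denoising steps
    z = z - 0.05 * z + 0.01 * torch.randn_like(z)
    z = lock_bits(z, positions, bits)              # re-lock after every step

recovered = (z[0, positions] > 0).long()
print("message recovered:", bool((recovered == bits).all()))
```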
[LG-68] FP64 is All You Need: Rethinking Failure Modes in Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2505.10949
作者: Chenhui Xu,Dancheng Liu,Amir Nassereldine,Jinjun Xiong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Physics-Informed Neural Networks (PINNs) often exhibit failure modes in which the PDE residual loss converges while the solution error stays large, a phenomenon traditionally blamed on local optima separated from the true solution by steep loss barriers. We challenge this understanding by demonstrating that the real culprit is insufficient arithmetic precision: with standard FP32, the LBFGS optimizer prematurely satisfies its convergence test, freezing the network in a spurious failure phase. Simply upgrading to FP64 rescues optimization, enabling vanilla PINNs to solve PDEs without any failure modes. These results reframe PINN failure modes as precision-induced stalls rather than inescapable local minima and expose a three-stage training dynamic (unconverged, failure, success) whose boundaries shift with numerical precision. Our findings emphasize that rigorous arithmetic precision is the key to dependable PDE solving with neural networks.
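【示例代码】The fix the abstract advocates is essentially a one-line precision change. Below is a hedged, self-contained toy (a 1-D Poisson PINN, our choice of example) showing where FP64 and the LBFGS tolerances enter; the architecture and tolerance values are illustrative assumptions.

```python
import torch

torch.set_default_dtype(torch.float64)  # the central intervention: FP64, not FP32

# Toy PINN for u''(x) = f(x) on [0, 1] with u(0) = u(1) = 0, where
# f = -pi^2 sin(pi x), so the true solution is u(x) = sin(pi x).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
x = torch.linspace(0.0, 1.0, 256).unsqueeze(1).requires_grad_(True)
f = -torch.pi**2 * torch.sin(torch.pi * x)

def residual_loss():
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    interior = ((d2u - f) ** 2).mean()
    boundary = (net(torch.zeros(1, 1)) ** 2 + net(torch.ones(1, 1)) ** 2).squeeze()
    return interior + boundary

# With FP32 arithmetic, gradient round-off can make LBFGS's line search and
# convergence tests terminate prematurely (the failure mode the paper
# describes); in FP64 these tight tolerances remain meaningful.
opt = torch.optim.LBFGS(net.parameters(), max_iter=500,
                        tolerance_grad=1e-12, tolerance_change=1e-14)

def closure():
    opt.zero_grad()
    loss = residual_loss()
    loss.backward()
    return loss

opt.step(closure)
```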
[LG-69] Certifying Stability of Reinforcement Learning Policies using Generalized Lyapunov Functions
链接: https://arxiv.org/abs/2505.10947
作者: Kehan Long,Jorge Cortés,Nikolay Atanasov
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:
Abstract:We study the problem of certifying the stability of closed-loop systems under control policies derived from optimal control or reinforcement learning (RL). Classical Lyapunov methods require a strict step-wise decrease in the Lyapunov function but such a certificate is difficult to construct for a learned control policy. The value function associated with an RL policy is a natural Lyapunov function candidate but it is not clear how it should be modified. To gain intuition, we first study the linear quadratic regulator (LQR) problem and make two key observations. First, a Lyapunov function can be obtained from the value function of an LQR policy by augmenting it with a residual term related to the system dynamics and stage cost. Second, the classical Lyapunov decrease requirement can be relaxed to a generalized Lyapunov condition requiring only decrease on average over multiple time steps. Using this intuition, we consider the nonlinear setting and formulate an approach to learn generalized Lyapunov functions by augmenting RL value functions with neural network residual terms. Our approach successfully certifies the stability of RL policies trained on Gymnasium and DeepMind Control benchmarks. We also extend our method to jointly train neural controllers and stability certificates using a multi-step Lyapunov loss, resulting in larger certified inner approximations of the region of attraction compared to the classical Lyapunov approach. Overall, our formulation enables stability certification for a broad class of systems with learned policies by making certificates easier to construct, thereby bridging classical control theory and modern learning-based methods.
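【示例代码】A worked statement of the relaxation the abstract describes, written in our own notation (the paper's exact multi-step condition may be weighted differently): instead of a strict per-step decrease, the certificate only requires decrease on average over a window of k steps.

```latex
\begin{align*}
\text{classical Lyapunov decrease:} \quad & V(x_{t+1}) - V(x_t) < 0 \quad \forall t, \\
\text{generalized ($k$-step average):} \quad &
\frac{1}{k}\sum_{i=1}^{k}\bigl(V(x_{t+i}) - V(x_{t+i-1})\bigr)
= \frac{V(x_{t+k}) - V(x_t)}{k} < 0,
\end{align*}
```

so V may transiently increase along the trajectory as long as it decreases over the window, which is easier to certify for RL value functions augmented with a learned residual term.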
[LG-70] Nosy Layers, Noisy Fixes: Tackling DRAs in Federated Learning Systems using Explainable AI ASIACCS2025
链接: https://arxiv.org/abs/2505.10942
作者: Meghali Nandi,Arash Shaghaghi,Nazatul Haque Sultan,Gustavo Batista,Raymond K. Zhao,Sanjay Jha
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted to AsiaCCS 2025
Abstract:Federated Learning (FL) has emerged as a powerful paradigm for collaborative model training while keeping client data decentralized and private. However, it is vulnerable to Data Reconstruction Attacks (DRA) such as “LoKI” and “Robbing the Fed”, where malicious models sent from the server to the client can reconstruct sensitive user data. To counter this, we introduce DRArmor, a novel defense mechanism that integrates Explainable AI with targeted detection and mitigation strategies for DRA. Unlike existing defenses that focus on the entire model, DRArmor identifies and addresses the root cause (i.e., malicious layers within the model that send gradients with malicious intent) by analyzing their contribution to the output and detecting inconsistencies in gradient values. Once these malicious layers are identified, DRArmor applies defense techniques such as noise injection, pixelation, and pruning to these layers rather than the whole model, minimizing the attack surface and preserving client data privacy. We evaluate DRArmor’s performance against the advanced LoKI attack across diverse datasets, including MNIST, CIFAR-10, CIFAR-100, and ImageNet, in a 200-client FL setup. Our results demonstrate DRArmor’s effectiveness in mitigating data leakage, achieving high True Positive and True Negative Rates of 0.910 and 0.890, respectively. Additionally, DRArmor maintains an average accuracy of 87%, effectively protecting client privacy without compromising model performance. Compared to existing defense mechanisms, DRArmor reduces the data leakage rate by 62.5% with datasets containing 500 samples per client.
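【示例代码】A rough sketch of the layer-level defense idea: score per-layer gradients, flag outliers, and perturb only the flagged layers before sharing the update. The z-score test below is a simplified stand-in for the paper's XAI-based attribution, and all names are our assumptions.

```python
import torch

def flag_suspicious_layers(grads_per_layer, z_thresh=3.0):
    """grads_per_layer: list of gradient tensors, one per layer."""
    norms = torch.stack([g.norm() for g in grads_per_layer])
    z = (norms - norms.mean()) / (norms.std() + 1e-8)
    return [i for i, zi in enumerate(z.tolist()) if abs(zi) > z_thresh]

def defend_update(grads_per_layer, flagged, noise_scale=0.1):
    """Apply noise injection only to flagged layers, leaving the rest intact."""
    return [g + noise_scale * torch.randn_like(g) if i in flagged else g
            for i, g in enumerate(grads_per_layer)]

# Usage sketch: grads = [p.grad for p in model.parameters()]
# safe_grads = defend_update(grads, flag_suspicious_layers(grads))
```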
[LG-71] Privacy-Aware Lifelong Learning
链接: https://arxiv.org/abs/2505.10941
作者: Ozan Özdenizci,Elmar Rueckert,Robert Legenstein
类目: Machine Learning (cs.LG)
*备注:
Abstract:Lifelong learning algorithms enable models to incrementally acquire new knowledge without forgetting previously learned information. Contrarily, the field of machine unlearning focuses on explicitly forgetting certain previous knowledge from pretrained models when requested, in order to comply with data privacy regulations on the right-to-be-forgotten. Enabling efficient lifelong learning with the capability to selectively unlearn sensitive information from models presents a critical and largely unaddressed challenge with contradicting objectives. We address this problem from the perspective of simultaneously preventing catastrophic forgetting and allowing forward knowledge transfer during task-incremental learning, while ensuring exact task unlearning and minimizing memory requirements, based on a single neural network model to be adapted. Our proposed solution, privacy-aware lifelong learning (PALL), involves optimization of task-specific sparse subnetworks with parameter sharing within a single architecture. We additionally utilize an episodic memory rehearsal mechanism to facilitate exact unlearning without performance degradations. We empirically demonstrate the scalability of PALL across various architectures in image classification, and provide a state-of-the-art solution that uniquely integrates lifelong learning and privacy-aware unlearning mechanisms for responsible AI applications.
[LG-72] Physics-informed Temporal Alignment for Auto-regressive PDE Foundation Models ICML2025
链接: https://arxiv.org/abs/2505.10930
作者: Congcong Zhu,Xiaoyan Xu,Jiayue Han,Jingrun Chen
类目: Machine Learning (cs.LG)
*备注: Accepted as a conference paper in ICML2025
Abstract:Auto-regressive partial differential equation (PDE) foundation models have shown great potential in handling time-dependent data. However, these models suffer from the shortcut problem deeply rooted in auto-regressive prediction, causing error accumulation. The challenge becomes particularly evident for out-of-distribution data, as the pretraining performance may approach random model initialization for downstream tasks with long-term dynamics. To deal with this problem, we propose physics-informed temporal alignment (PITA), a self-supervised learning framework inspired by inverse problem solving. Specifically, PITA aligns the physical dynamics discovered at different time steps on each given PDE trajectory by integrating physics-informed constraints into the self-supervision signal. The alignment is derived from observation data without relying on known physics priors, indicating strong generalization ability to the out-of-distribution data. Extensive experiments show that PITA significantly enhances the accuracy and robustness of existing foundation models on diverse time-dependent PDE data. The code is available at this https URL.
[LG-73] A Dataset for Spatiotemporal-Sensitive POI Question Answering
链接: https://arxiv.org/abs/2505.10928
作者: Xiao Han,Dayan Pan,Xiangyu Zhao,Xuyuan Hu,Zhaolin Deng,Xiangjie Kong,Guojiang Shen
类目: Machine Learning (cs.LG)
*备注: Under Review
Abstract:Spatiotemporal relationships are critical in data science, as many prediction and reasoning tasks require analysis across both spatial and temporal dimensions; for instance, navigating an unfamiliar city involves planning itineraries that sequence locations and time cultural experiences. However, existing Question-Answering (QA) datasets lack sufficient spatiotemporal-sensitive questions, making them inadequate benchmarks for evaluating models' spatiotemporal reasoning capabilities. To address this gap, we introduce POI-QA, a novel spatiotemporal-sensitive QA dataset centered on Points of Interest (POI), constructed through three key steps: mining and aligning open-source vehicle trajectory data from GAIA with high-precision geographic POI data, rigorous manual validation of noisy spatiotemporal facts, and generating bilingual (Chinese/English) QA pairs that reflect human-understandable spatiotemporal reasoning tasks. Our dataset challenges models to parse complex spatiotemporal dependencies, and evaluations of state-of-the-art multilingual LLMs (e.g., Qwen2.5-7B, Llama3.1-8B) reveal stark limitations: even the top-performing model (Qwen2.5-7B fine-tuned with RAG+LoRA) achieves a top-10 Hit Ratio (HR@10) of only 0.41 on the easiest task, far below human performance of 0.56. This underscores persistent weaknesses in LLMs' ability to perform consistent spatiotemporal reasoning, while highlighting POI-QA as a robust benchmark to advance algorithms sensitive to spatiotemporal dynamics. The dataset is publicly available at this https URL.
[LG-74] Automated Identification of Logical Errors in Programs: Advancing Scalable Analysis of Student Misconceptions
链接: https://arxiv.org/abs/2505.10913
作者: Muntasir Hoq,Ananya Rao,Reisha Jaishankar,Krish Piryani,Nithya Janapati,Jessica Vandenberg,Bradford Mott,Narges Norouzi,James Lester,Bita Akram
类目: Machine Learning (cs.LG)
*备注: Accepted for publication at the 18th International Conference on Educational Data Mining (EDM), 2025
Abstract:In Computer Science (CS) education, understanding factors contributing to students’ programming difficulties is crucial for effective learning support. By identifying specific issues students face, educators can provide targeted assistance to help them overcome obstacles and improve learning outcomes. While identifying sources of struggle, such as misconceptions, in real-time can be challenging in current educational practices, analyzing logical errors in students’ code can offer valuable insights. This paper presents a scalable framework for automatically detecting logical errors in students’ programming solutions. Our framework is based on an explainable Abstract Syntax Tree (AST) embedding model, the Subtree-based Attention Neural Network (SANN), that identifies the structural components of programs containing logical errors. We conducted a series of experiments to evaluate its effectiveness, and the results suggest that our framework can accurately capture students’ logical errors and, more importantly, provide us with deeper insights into their learning processes, offering a valuable tool for enhancing programming education.
[LG-75] Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models
链接: https://arxiv.org/abs/2505.10892
作者: Akhil Agnihotri,Rahul Jain,Deepak Ramachandran,Zheng Wen
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2406.18853 by other authors
Abstract:Post-training of LLMs with RLHF, followed by preference optimization algorithms such as DPO and IPO, has substantially improved human alignment. However, all such techniques can only work with a single (human) objective. In practice, human users have multiple objectives, such as helpfulness and harmlessness, and there is no natural way to aggregate them into a single objective. In this paper, we address the multi-objective preference-alignment problem, where a policy must optimize several, potentially conflicting, objectives. We introduce the Multi-Objective Preference Optimization (MOPO) algorithm, which frames alignment as a constrained KL-regularized optimization: the primary objective is maximized while secondary objectives are lower-bounded by tunable safety thresholds. Unlike prior work, MOPO operates directly on pairwise preference data, requires no point-wise reward assumption, and avoids heuristic prompt-context engineering. The method recovers policies on the Pareto front whenever the front is attainable; practically, it reduces to simple closed-form iterative updates suitable for large-scale training. On synthetic benchmarks with diverse canonical preference structures, we show that MOPO approximates the Pareto front. When fine-tuning a 1.3B-parameter language model on real-world human-preference datasets, MOPO attains higher rewards and yields policies that Pareto-dominate baselines; ablation studies confirm optimization stability and robustness to hyperparameters.
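【示例代码】A schematic of the constrained, KL-regularized problem the abstract describes, in our own notation: the primary alignment objective J_1 is maximized while each secondary objective J_i is kept above a tunable threshold b_i.

```latex
\begin{align*}
\max_{\pi} \quad & J_1(\pi) \;-\; \beta\,\mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right) \\
\text{s.t.} \quad & J_i(\pi) \;\ge\; b_i, \qquad i = 2, \dots, m,
\end{align*}
```

where \pi_{\mathrm{ref}} is the reference policy, \beta weights the KL regularizer, and, per the abstract, each J_i is defined directly from pairwise preference data rather than from a point-wise reward model.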
[LG-76] Global Convergence of Adaptive Sensing for Principal Eigenvector Estimation
链接: https://arxiv.org/abs/2505.10882
作者: Alex Saad-Falcon,Brighton Ancelin,Justin Romberg
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:This paper addresses the challenge of efficient principal component analysis (PCA) in high-dimensional spaces by analyzing a compressively sampled variant of Oja's algorithm with adaptive sensing. Traditional PCA methods incur substantial computational costs that scale poorly with data dimensionality, whereas subspace tracking algorithms like Oja's offer more efficient alternatives but typically require full-dimensional observations. We analyze a variant where, at each iteration, only two compressed measurements are taken: one in the direction of the current estimate and one in a random orthogonal direction. We prove that this adaptive sensing approach achieves global convergence in the presence of noise when tracking the leading eigenvector of a data stream with eigengap \Delta = \lambda_1 - \lambda_2 . Our theoretical analysis demonstrates that the algorithm experiences two phases: (1) a warmup phase requiring O(\lambda_1\lambda_2 d^2/\Delta^2) iterations to achieve a constant-level alignment with the true eigenvector, followed by (2) a local convergence phase where the sine alignment error decays at a rate of O(\lambda_1\lambda_2 d^2/(\Delta^2 t)) at iteration t . The guarantee aligns with existing minimax lower bounds with an added factor of d due to the compressive sampling. This work provides the first convergence guarantees in adaptive sensing for subspace tracking with noise. Our proof technique is also considerably simpler than those in prior works. The results have important implications for applications where acquiring full-dimensional samples is challenging or costly.
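【示例代码】A toy reconstruction of the two-measurement scheme from the abstract: per data vector, sense once along the current estimate and once along a random orthogonal direction, then run an Oja-style update on the compressed surrogate. Step sizes and the exact estimator in the paper may differ.

```python
import numpy as np

def adaptive_oja(stream, d, eta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for x in stream:                     # x is never observed in full
        y1 = w @ x                       # measurement along current estimate
        v = rng.standard_normal(d)       # random direction, orthogonalized to w
        v -= (v @ w) * w
        v /= np.linalg.norm(v)
        y2 = v @ x                       # measurement along orthogonal probe
        x_hat = y1 * w + y2 * v          # projection of x onto span{w, v}
        w += eta * (w @ x_hat) * x_hat   # standard Oja step on the surrogate
        w /= np.linalg.norm(w)
    return w

# Toy stream: leading eigenvector e_1 with an eigengap, plus isotropic noise.
rng = np.random.default_rng(1)
d, u = 50, np.eye(50)[0]
stream = (2.0 * rng.standard_normal() * u + 0.5 * rng.standard_normal(d)
          for _ in range(20000))
print(abs(adaptive_oja(stream, d) @ u))  # alignment with the true eigenvector
```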
[LG-77] Prior-Guided Diffusion Planning for Offline Reinforcement Learning
链接: https://arxiv.org/abs/2505.10881
作者: Donghyeon Ki,JunHyeok Oh,Seong-Woong Shim,Byung-Jun Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion models have recently gained prominence in offline reinforcement learning due to their ability to effectively learn high-performing, generalizable policies from static datasets. Diffusion-based planners facilitate long-horizon decision-making by generating high-quality trajectories through iterative denoising, guided by return-maximizing objectives. However, existing guided sampling strategies such as Classifier Guidance, Classifier-Free Guidance, and Monte Carlo Sample Selection either produce suboptimal multi-modal actions, struggle with distributional drift, or incur prohibitive inference-time costs. To address these challenges, we propose Prior Guidance (PG), a novel guided sampling framework that replaces the standard Gaussian prior of a behavior-cloned diffusion model with a learnable distribution, optimized via a behavior-regularized objective. PG directly generates high-value trajectories without costly reward optimization of the diffusion model itself, and eliminates the need to sample multiple candidates at inference for sample selection. We present an efficient training strategy that applies behavior regularization in latent space, and empirically demonstrate that PG outperforms state-of-the-art diffusion policies and planners across diverse long-horizon offline RL benchmarks.
[LG-78] Approximation and Generalization Abilities of Score-based Neural Network Generative Models for Sub-Gaussian Distributions
链接: https://arxiv.org/abs/2505.10880
作者: Guoji Fu,Wee Sun Lee
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 94 pages
Abstract:This paper studies the approximation and generalization abilities of score-based neural network generative models (SGMs) in estimating an unknown distribution P_0 from n i.i.d. observations in d dimensions. Assuming merely that P_0 is \alpha-sub-Gaussian, we prove that for any time step t \in [t_0, n^{O(1)}] , where t_0 \geq O(\alpha^2 n^{-2/d}\log n) , there exists a deep ReLU neural network with width \leq O(\log^3 n) and depth \leq O(n^{3/d}\log_2 n) that can approximate the scores with \tilde{O}(n^{-1}) mean square error and achieve a nearly optimal rate of \tilde{O}(n^{-1}t_0^{-d/2}) for score estimation, as measured by the score matching loss. Our framework is universal and can be used to establish convergence rates for SGMs under milder assumptions than previous work. For example, assuming further that the target density function p_0 lies in Sobolev or Besov classes, with an appropriate early-stopping strategy, we demonstrate that neural network-based SGMs can attain nearly minimax convergence rates up to logarithmic factors. Our analysis removes several crucial assumptions, such as Lipschitz continuity of the score function or a strictly positive lower bound on the target density.
[LG-79] Multi-Stage Speaker Diarization for Noisy Classrooms
链接: https://arxiv.org/abs/2505.10879
作者: Ali Sartaz Khan,Tolulope Ogunremi,Ahmed Attia,Dorottya Demszky
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
Abstract:Speaker diarization, the process of identifying "who spoke when" in audio recordings, is essential for understanding classroom dynamics. However, classroom settings present distinct challenges, including poor recording quality, high levels of background noise, overlapping speech, and the difficulty of accurately capturing children's voices. This study investigates the effectiveness of multi-stage diarization models using Nvidia's NeMo diarization pipeline. We assess the impact of denoising on diarization accuracy and compare various voice activity detection (VAD) models, including self-supervised transformer-based frame-wise VAD models. We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions. We conduct experiments using two datasets from English-speaking classrooms to separate teacher vs. student speech and to separate all speakers. Our results show that denoising significantly improves the Diarization Error Rate (DER) by reducing the rate of missed speech. Additionally, training on both denoised and noisy datasets leads to substantial performance gains in noisy conditions. The hybrid VAD model leads to further improvements in speech detection, achieving a DER as low as 17% in teacher-student experiments and 45% in all-speaker experiments. However, we also identified trade-offs between voice activity detection and speaker confusion. Overall, our study highlights the effectiveness of multi-stage diarization models and of integrating ASR-based information for enhancing speaker diarization in noisy classroom environments.
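【示例代码】A minimal sketch of the hybrid VAD fusion described above: frames the VAD posterior marks as speech stay speech, and frames covered by ASR word timestamps are forced to speech so missed detections are recovered. The actual fusion rule in the NeMo pipeline may differ; this OR-style rule is our assumption.

```python
import numpy as np

def hybrid_vad(frame_probs, frame_dur, asr_words, threshold=0.5):
    """frame_probs: per-frame speech posteriors from a frame-wise VAD model.
    frame_dur: frame duration in seconds.
    asr_words: list of (start_sec, end_sec) word-level ASR timestamps."""
    speech = frame_probs >= threshold
    for start, end in asr_words:
        lo = int(start / frame_dur)
        hi = int(np.ceil(end / frame_dur))
        speech[lo:hi] = True   # ASR evidence overrides a missed detection
    return speech

# Example: 10 s of audio at 10 ms frames, with two recognized words.
probs = np.random.default_rng(0).uniform(size=1000)
mask = hybrid_vad(probs, 0.01, [(1.2, 1.6), (4.0, 4.5)])
```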
[LG-80] Improving the Data-efficiency of Reinforcement Learning by Warm-starting with LLM NEURIPS25
链接: https://arxiv.org/abs/2505.10861
作者: Thang Duong,Minglai Yang,Chicheng Zhang
类目: Machine Learning (cs.LG)
*备注: 31 pages (9 for the main paper), 27 figures, NeurIPS 25 submission
Abstract:We investigate the use of a Large Language Model (LLM) to collect high-quality data for warm-starting Reinforcement Learning (RL) algorithms in classical Markov Decision Process (MDP) environments. In this work, we focus on using an LLM to generate an off-policy dataset that sufficiently covers the state-actions visited by optimal policies, then using an RL algorithm to explore the environment and improve upon the policy suggested by the LLM. Our algorithm, LORO, can both converge to an optimal policy and achieve high sample efficiency thanks to the LLM's good starting policy. On multiple OpenAI Gym environments, such as CartPole and Pendulum, we empirically demonstrate that LORO outperforms baseline algorithms such as pure LLM-based policies, pure RL, and a naive combination of the two, achieving up to 4\times the cumulative rewards of the pure RL baseline.
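【示例代码】A schematic of the two-stage recipe in the abstract: first roll out an LLM-suggested policy to build an off-policy warm-start dataset, then hand it to a standard RL learner. The gymnasium loop is concrete; `llm_policy` and the replay-buffer seeding are illustrative assumptions.

```python
import gymnasium as gym

def collect_llm_dataset(env_name, llm_policy, episodes=50):
    """llm_policy: assumed callable mapping an observation to an action,
    e.g. by prompting an LLM with a textual description of the state."""
    env = gym.make(env_name)
    dataset = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = llm_policy(obs)
            next_obs, reward, term, trunc, _ = env.step(action)
            dataset.append((obs, action, reward, next_obs, term))
            obs, done = next_obs, term or trunc
    return dataset

# Warm start (sketch): seed the agent's replay buffer with this dataset, then
# run the usual online RL loop so exploration improves on the LLM's policy.
```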
[LG-81] On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating
链接: https://arxiv.org/abs/2505.10860
作者: Huy Nguyen,Thong T. Doan,Quang Pham,Nghi D. Q. Bui,Nhat Ho,Alessandro Rinaldo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 100 pages
Abstract:Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to theoretically justify the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency from both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify our theoretical findings empirically, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of router behaviors, ranging from router saturation and router change rate to expert utilization.
[LG-82] Foundation model for mass spectrometry proteomics
链接: https://arxiv.org/abs/2505.10848
作者: Justin Sanders,Melih Yilmaz,Jacob H. Russell,Wout Bittremieux,William E. Fondrie,Nicholas M. Riley,Sewoong Oh,William Stafford Noble
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mass spectrometry is the dominant technology in the field of proteomics, enabling high-throughput analysis of the protein content of complex biological samples. Due to the complexity of the instrumentation and resulting data, sophisticated computational methods are required for the processing and interpretation of acquired mass spectra. Machine learning has shown great promise to improve the analysis of mass spectrometry data, with numerous purpose-built methods for improving specific steps in the data acquisition and analysis pipeline reaching widespread adoption. Here, we propose unifying various spectrum prediction tasks under a single foundation model for mass spectra. To this end, we pre-train a spectrum encoder using de novo sequencing as a pre-training task. We then show that using these pre-trained spectrum representations improves our performance on the four downstream tasks of spectrum quality prediction, chimericity prediction, phosphorylation prediction, and glycosylation status prediction. Finally, we perform multi-task fine-tuning and find that this approach improves the performance on each task individually. Overall, our work demonstrates that a foundation model for tandem mass spectrometry proteomics trained on de novo sequencing learns generalizable representations of spectra, improves performance on downstream tasks where training data is limited, and can ultimately enhance data acquisition and analysis in proteomics experiments.
[LG-83] AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models
链接: https://arxiv.org/abs/2505.10846
作者: Jiacheng Liang,Tanqiu Jiang,Yuhui Wang,Rongyi Zhu,Fenglong Ma,Ting Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 9 pages
Abstract:This paper presents AutoRAN, the first automated, weak-to-strong jailbreak attack framework targeting large reasoning models (LRMs). At its core, AutoRAN leverages a weak, less-aligned reasoning model to simulate the target model’s high-level reasoning structures, generates narrative prompts, and iteratively refines candidate prompts by incorporating the target model’s intermediate reasoning steps. We evaluate AutoRAN against state-of-the-art LRMs including GPT-o3/o4-mini and Gemini-2.5-Flash across multiple benchmark datasets (AdvBench, HarmBench, and StrongReject). Results demonstrate that AutoRAN achieves remarkable success rates (approaching 100%) within one or a few turns across different LRMs, even when judged by a robustly aligned external model. This work reveals that leveraging weak reasoning models can effectively exploit the critical vulnerabilities of much more capable reasoning models, highlighting the need for improved safety measures specifically designed for reasoning-based models. The code for replicating AutoRAN and running records are available at: (this https URL). (warning: this paper contains potentially harmful content generated by LRMs.)
[LG-84] MergeBench: A Benchmark for Merging Domain-Specialized LLMs
链接: https://arxiv.org/abs/2505.10833
作者: Yifei He,Siqi Zeng,Yuzheng Hu,Rui Yang,Tong Zhang,Han Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap for in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging. We open-source our code at this https URL.
[LG-85] Distilled Circuits: A Mechanistic Study of Internal Restructuring in Knowledge Distillation
链接: https://arxiv.org/abs/2505.10822
作者: Reilly Haskins,Benjamin Adams
类目: Machine Learning (cs.LG)
*备注:
Abstract:Knowledge distillation compresses a larger neural model (teacher) into smaller, faster student models by training them to match teacher outputs. However, the internal computational transformations that occur during this process remain poorly understood. We apply techniques from mechanistic interpretability to analyze how internal circuits, representations, and activation patterns differ between teacher and student. Focusing on GPT2-small and its distilled counterpart DistilGPT2, we find that student models reorganize, compress, and discard teacher components, often resulting in stronger reliance on fewer individual components. To quantify functional alignment beyond output similarity, we introduce an alignment metric based on influence-weighted component similarity, validated across multiple tasks. Our findings reveal that while knowledge distillation preserves broad functional behaviors, it also causes significant shifts in internal computation, with important implications for the robustness and generalization capacity of distilled models.
[LG-86] Cell Library Characterization for Composite Current Source Models Based on Gaussian Process Regression and Active Learning
链接: https://arxiv.org/abs/2505.10799
作者: Tao Bai,Junzhuo Zhou,Zeyuan Deng,Peng Cao
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:
Abstract:The composite current source (CCS) model has been adopted as an advanced timing model that represents the current behavior of cells, offering improved accuracy and a better capability than traditional non-linear delay models (NLDM) to capture complex dynamic effects and interactions under advanced process nodes. However, the high accuracy requirement, the large amount of data, and the extensive simulation cost pose severe challenges to CCS characterization. To address these challenges, we introduce a novel Gaussian Process Regression (GPR) model with active learning (AL) to establish the characterization framework efficiently and accurately. Our approach significantly outperforms conventional commercial tools as well as learning-based approaches, achieving an average absolute error of 2.05 ps and a relative error of 2.27% for the current waveforms of 57 cells under 9 process, voltage, temperature (PVT) corners with the TSMC 22nm process. Additionally, our model drastically reduces the runtime to 27% of that required by commercial tools, and the storage by up to 19.5\times.
[LG-87] Deep Symbolic Optimization: Reinforcement Learning for Symbolic Mathematics
链接: https://arxiv.org/abs/2505.10762
作者: Conor F. Hayes,Felipe Leno Da Silva,Jiachen Yang,T. Nathan Mundhenk,Chak Shing Lee,Jacob F. Pettit,Claudio Santiago,Sookyung Kim,Joanne T. Kim,Ignacio Aravena Solis,Ruben Glatt,Andre R. Goncalves,Alexander Ladd,Ahmet Can Solak,Thomas Desautels,Daniel Faissol,Brenden K. Petersen,Mikel Landajuela
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Symbolic Computation (cs.SC)
*备注: Under review in LNCS Computational Approaches to Scientific Discovery
Abstract:Deep Symbolic Optimization (DSO) is a novel computational framework that enables symbolic optimization for scientific discovery, particularly in applications involving the search for intricate symbolic structures. One notable example is equation discovery, which aims to automatically derive mathematical models expressed in symbolic form. In DSO, the discovery process is formulated as a sequential decision-making task. A generative neural network learns a probabilistic model over a vast space of candidate symbolic expressions, while reinforcement learning strategies guide the search toward the most promising regions. This approach integrates gradient-based optimization with evolutionary and local search techniques, and it incorporates in-situ constraints, domain-specific priors, and advanced policy optimization methods. The result is a robust framework capable of efficiently exploring extensive search spaces to identify interpretable and physically meaningful models. Extensive evaluations on benchmark problems have demonstrated that DSO achieves state-of-the-art performance in both accuracy and interpretability. In this chapter, we provide a comprehensive overview of the DSO framework and illustrate its transformative potential for automating symbolic optimization in scientific discovery.
[LG-88] Random Client Selection on Contrastive Federated Learning for Tabular Data
链接: https://arxiv.org/abs/2505.10759
作者: Achmad Ginanjar,Xue Li,Priyanka Singh,Wen Hua
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Vertical Federated Learning (VFL) has revolutionised collaborative machine learning by enabling privacy-preserving model training across multiple parties. However, it remains vulnerable to information leakage during intermediate computation sharing. While Contrastive Federated Learning (CFL) was introduced to mitigate these privacy concerns through representation learning, it still faces challenges from gradient-based attacks. This paper presents a comprehensive experimental analysis of gradient-based attacks in CFL environments and evaluates random client selection as a defensive strategy. Through extensive experimentation, we demonstrate that random client selection proves particularly effective in defending against gradient attacks in CFL networks. Our findings provide valuable insights for implementing robust security measures in contrastive federated learning systems, contributing to the development of more secure collaborative learning frameworks.
[LG-89] ZEUS: Zero-shot Embeddings for Unsupervised Separation of Tabular Data
链接: https://arxiv.org/abs/2505.10704
作者: Patryk Marszałek,Tomasz Kuśmierczyk,Witold Wydmański,Jacek Tabor,Marek Śmieja
类目: Machine Learning (cs.LG)
*备注:
Abstract:Clustering tabular data remains a significant open challenge in data analysis and machine learning. Unlike for image data, similarity between tabular records often varies across datasets, making the definition of clusters highly dataset-dependent. Furthermore, the absence of supervised signals complicates hyperparameter tuning in deep learning clustering methods, frequently resulting in unstable performance. To address these issues and reduce the need for per-dataset tuning, we adopt an emerging approach in deep learning: zero-shot learning. We propose ZEUS, a self-contained model capable of clustering new datasets without any additional training or fine-tuning. It operates by decomposing complex datasets into meaningful components that can then be clustered effectively. Thanks to pre-training on synthetic datasets generated from a latent-variable prior, it generalizes across various datasets without requiring user intervention. To the best of our knowledge, ZEUS is the first zero-shot method capable of generating embeddings for tabular data in a fully unsupervised manner. Experimental results demonstrate that it performs on par with or better than traditional clustering algorithms and recent deep learning-based methods, while being significantly faster and more user-friendly.
[LG-90] Clustering Rooftop PV Systems via Probabilistic Embeddings
链接: https://arxiv.org/abs/2505.10699
作者: Kutay Bölat,Tarek Alskaif,Peter Palensky,Simon Tindemans
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:As the number of rooftop photovoltaic (PV) installations increases, aggregators and system operators are required to monitor and analyze these systems, raising the challenge of integration and management of large, spatially distributed time-series data that are both high-dimensional and affected by missing values. In this work, a probabilistic entity embedding-based clustering framework is proposed to address these problems. This method encodes each PV system’s characteristic power generation patterns and uncertainty as a probability distribution, then groups systems by their statistical distances and agglomerative clustering. Applied to a multi-year residential PV dataset, it produces concise, uncertainty-aware cluster profiles that outperform a physics-based baseline in representativeness and robustness, and support reliable missing-value imputation. A systematic hyperparameter study further offers practical guidance for balancing model performance and robustness.
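【示例代码】A hedged sketch of the clustering step: if each PV system's embedding is summarized as a diagonal Gaussian, systems can be grouped by pairwise 2-Wasserstein distances under agglomerative clustering. The Gaussian summary and the specific distance are our assumptions, not necessarily the paper's choices (requires scikit-learn >= 1.2 for the `metric` argument).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def wasserstein2_diag(mu1, s1, mu2, s2):
    # Closed form for Gaussians with diagonal covariances (s1, s2 are std devs).
    return np.sqrt(np.sum((mu1 - mu2) ** 2) + np.sum((s1 - s2) ** 2))

def cluster_pv_systems(mus, stds, n_clusters):
    n = len(mus)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = wasserstein2_diag(mus[i], stds[i], mus[j], stds[j])
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed", linkage="average")
    return model.fit_predict(D)

rng = np.random.default_rng(0)
mus = rng.standard_normal((30, 8))      # per-system embedding means
stds = rng.uniform(0.1, 1.0, (30, 8))   # per-system embedding uncertainties
labels = cluster_pv_systems(mus, stds, n_clusters=4)
```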
[LG-91] Asymptotically-Optimal Gaussian Bandits with Side Observations ICML’22
链接: https://arxiv.org/abs/2505.10698
作者: Alexia Atsidakou,Orestis Papadigenopoulos,Constantine Caramanis,Sujay Sanghavi,Sanjay Shakkottai
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: International Conference on Machine Learning, ICML '22
Abstract:We study the problem of Gaussian bandits with general side information, as first introduced by Wu, Szepesvari, and Gyorgy. In this setting, the play of an arm reveals information about other arms, according to an arbitrary a priori known side information matrix: each element of this matrix encodes the fidelity of the information that the "row" arm reveals about the "column" arm. In the case of Gaussian noise, this model subsumes standard bandits, full-feedback, and graph-structured feedback as special cases. In this work, we first construct an LP-based asymptotic instance-dependent lower bound on the regret. The LP optimizes the cost (regret) required to reliably estimate the suboptimality gap of each arm. This LP lower bound motivates our main contribution: the first known asymptotically optimal algorithm for this general setting.
[LG-92] System Identification and Control Using Lyapunov-Based Deep Neural Networks without Persistent Excitation: A Concurrent Learning Approach
链接: https://arxiv.org/abs/2505.10678
作者: Rebecca G. Hart,Omkar Sudhir Patil,Zachary I. Bell,Warren E. Dixon
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:Deep Neural Networks (DNNs) are increasingly used in control applications due to their powerful function approximation capabilities. However, many existing formulations focus primarily on tracking error convergence, often neglecting the challenge of identifying the system dynamics using the DNN. This paper presents the first result on simultaneous trajectory tracking and online system identification using a DNN-based controller, without requiring persistent excitation. Two new concurrent learning adaptation laws are constructed for the weights of all the layers of the DNN, achieving convergence of the DNN’s parameter estimates to a neighborhood of their ideal values, provided the DNN’s Jacobian satisfies a finite-time excitation condition. A Lyapunov-based stability analysis is conducted to ensure convergence of the tracking error, weight estimation errors, and observer errors to a neighborhood of the origin. Simulations performed on a range of systems and trajectories, with the same initial and operating conditions, demonstrated 40.5% to 73.6% improvement in function approximation performance compared to the baseline, while maintaining a similar tracking error and control effort. Simulations evaluating function approximation capabilities on data points outside of the trajectory resulted in 58.88% and 74.75% improvement in function approximation compared to the baseline.
[LG-93] Accelerating Visual-Policy Learning through Parallel Differentiable Simulation
链接: https://arxiv.org/abs/2505.10646
作者: Haoxiang You,Yilang Liu,Ian Abraham
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:In this work, we propose a computationally efficient algorithm for visual policy learning that leverages differentiable simulation and first-order analytical policy gradients. Our approach decouples the rendering process from the computation graph, enabling seamless integration with existing differentiable simulation ecosystems without the need for specialized differentiable rendering software. This decoupling not only reduces computational and memory overhead but also effectively attenuates the policy gradient norm, leading to more stable and smoother optimization. We evaluate our method on standard visual control benchmarks using modern GPU-accelerated simulation. Experiments show that our approach significantly reduces wall-clock training time and consistently outperforms all baseline methods in terms of final returns. Notably, on complex tasks such as humanoid locomotion, our method achieves a 4\times improvement in final return, and successfully learns a humanoid running policy within 4 hours on a single GPU.
[LG-94] FRET: Feature Redundancy Elimination for Test Time Adaptation
链接: https://arxiv.org/abs/2505.10641
作者: Linjing You,Jiabao Lu,Xiayuan Huang,Xiangli Nie
类目: Machine Learning (cs.LG)
*备注:
Abstract:Test-Time Adaptation (TTA) aims to enhance the generalization of deep learning models when faced with test data that exhibits distribution shifts from the training data. In this context, only a pre-trained model and unlabeled test data are available, making it particularly relevant for privacy-sensitive applications. In practice, we observe that feature redundancy in embeddings tends to increase as domain shifts intensify in TTA. However, existing TTA methods often overlook this redundancy, which can hinder the model’s adaptability to new data. To address this issue, we introduce Feature Redundancy Elimination for Test-time Adaptation (FRET), a novel perspective for TTA. A straightforward approach (S-FRET) is to directly minimize the feature redundancy score as an optimization objective to improve adaptation. Despite its simplicity and effectiveness, S-FRET struggles with label shifts, limiting its robustness in real-world scenarios. To mitigate this limitation, we further propose Graph-based FRET (G-FRET), which integrates a Graph Convolutional Network (GCN) with contrastive learning. This design not only reduces feature redundancy but also enhances feature discriminability in both the representation and prediction layers. Extensive experiments across multiple model architectures, tasks, and datasets demonstrate the effectiveness of S-FRET and show that G-FRET achieves state-of-the-art performance. Further analysis reveals that G-FRET enables the model to extract non-redundant and highly discriminative features during inference, thereby facilitating more robust test-time adaptation.
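【示例代码】The abstract does not spell out the redundancy score, so the sketch below uses a common instantiation (our assumption): penalize the off-diagonal entries of the batch feature-correlation matrix and minimize that penalty at test time, in the spirit of S-FRET.

```python
import torch

def feature_redundancy_score(feats):
    """feats: (batch, dim) embeddings, e.g. from the penultimate layer."""
    z = (feats - feats.mean(0)) / (feats.std(0) + 1e-6)  # standardize per dim
    corr = (z.T @ z) / z.shape[0]                        # (dim, dim) correlation
    off_diag = corr - torch.diag(torch.diag(corr))
    return (off_diag ** 2).mean()

def sfret_step(model, x, optimizer):
    """One S-FRET-style adaptation step; updating only normalization-layer
    affine parameters is common TTA practice (an assumption here, not a
    prescription from the abstract)."""
    feats = model(x)                      # assume the model exposes embeddings
    loss = feature_redundancy_score(feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```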
[LG-95] How many measurements are enough? Bayesian recovery in inverse problems with general distributions
链接: https://arxiv.org/abs/2505.10630
作者: Ben Adcock,Nick Huang
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:We study the sample complexity of Bayesian recovery for solving inverse problems with general prior, forward operator and noise distributions. We consider posterior sampling according to an approximate prior \mathcal{P} , and establish sufficient conditions for stable and accurate recovery with high probability. Our main result is a non-asymptotic bound that shows that the sample complexity depends on (i) the intrinsic complexity of \mathcal{P} , quantified by its so-called approximate covering number, and (ii) concentration bounds for the forward operator and noise distributions. As a key application, we specialize to generative priors, where \mathcal{P} is the pushforward of a latent distribution via a Deep Neural Network (DNN). We show that the sample complexity scales log-linearly with the latent dimension k , thus establishing the efficacy of DNN-based priors. Generalizing existing results on deterministic (i.e., non-Bayesian) recovery for the important problem of random sampling with an orthogonal matrix U , we show how the sample complexity is determined by the coherence of U with respect to the support of \mathcal{P} . Hence, we establish that coherence plays a fundamental role in Bayesian recovery as well. Overall, our framework unifies and extends prior work, providing rigorous guarantees for the sample complexity of solving Bayesian inverse problems with arbitrary distributions.
[LG-96] Measurement to Meaning: A Validity-Centered Framework for AI Evaluation
链接: https://arxiv.org/abs/2505.10573
作者: Olawale Salaudeen,Anka Reuel,Ahmed Ahmed,Suhana Bedi,Zachary Robertson,Sudharsan Sundar,Ben Domingue,Angelina Wang,Sanmi Koyejo
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Corresponding author: olawale@mit.edu
Abstract:While the capabilities and utility of AI systems have advanced, rigorous norms for evaluating these systems have lagged. Grand claims, such as models achieving general reasoning capabilities, are supported with model performance on narrow benchmarks, like performance on graduate-level exam questions, which provide a limited and potentially misleading assessment. We provide a structured approach for reasoning about the types of evaluative claims that can be made given the available evidence. For instance, our framework helps determine whether performance on a mathematical benchmark is an indication of the ability to solve problems on math tests or instead indicates a broader ability to reason. Our framework is well-suited for the contemporary paradigm in machine learning, where various stakeholders provide measurements and evaluations that downstream users use to validate their claims and decisions. At the same time, our framework also informs the construction of evaluations designed to speak to the validity of the relevant claims. By leveraging psychometrics’ breakdown of validity, evaluations can prioritize the most critical facets for a given claim, improving empirical utility and decision-making efficacy. We illustrate our framework through detailed case studies of vision and language model evaluations, highlighting how explicitly considering validity strengthens the connection between evaluation evidence and the claims being made.
[LG-97] Anti-aliasing of neural distortion effects via model fine tuning
链接: https://arxiv.org/abs/2505.11375
作者: Alistair Carson,Alec Wright,Stefan Bilbao
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted for DAFx25
Abstract:Neural networks have become ubiquitous in guitar distortion effects modelling in recent years. Despite their ability to yield perceptually convincing models, they are susceptible to frequency aliasing when driven by high-frequency and high-gain inputs. Nonlinear activation functions create both the desired harmonic distortion and unwanted aliasing distortion as the bandwidth of the signal is expanded beyond the Nyquist frequency. Here, we present a method for reducing aliasing in neural models via a teacher-student fine-tuning approach, where the teacher is a pre-trained model with its weights frozen, and the student is a copy of this with learnable parameters. The student is fine-tuned against an aliasing-free dataset generated by passing sinusoids through the original model and removing non-harmonic components from the output spectra. Our results show that this method significantly suppresses aliasing for both long short-term memory (LSTM) networks and temporal convolutional networks (TCNs). In the majority of our case studies, the reduction in aliasing was greater than that achieved by two-times oversampling. One side-effect of the proposed method is that harmonic distortion components are also affected. This adverse effect was found to be model-dependent, with the LSTM models giving the best balance between anti-aliasing and preserving the perceived similarity to an analog reference device.
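【示例代码】A sketch of the dataset-generation step described above: drive the frozen teacher with a sinusoid and keep only the spectral bins at harmonics of the input frequency, discarding aliasing components. Frame length, windowing, and the bin tolerance are our choices, not the paper's stated settings.

```python
import numpy as np

def harmonic_only_target(teacher, f0, sr, n=32768, tol_bins=1):
    """teacher: callable, mono audio in -> mono audio out (same length)."""
    t = np.arange(n) / sr
    x = np.sin(2 * np.pi * f0 * t).astype(np.float32)
    y = teacher(x)
    Y = np.fft.rfft(y * np.hanning(n))           # windowed output spectrum
    freqs = np.fft.rfftfreq(n, 1 / sr)
    keep = np.zeros_like(Y)
    for k in range(1, int((sr / 2) // f0) + 1):  # harmonics below Nyquist
        idx = int(np.argmin(np.abs(freqs - k * f0)))
        keep[idx - tol_bins: idx + tol_bins + 1] = Y[idx - tol_bins: idx + tol_bins + 1]
    return np.fft.irfft(keep, n)                 # aliasing-free (windowed) target
```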
[LG-98] STRIDE: Sparse Techniques for Regression in Deep Gaussian Processes
链接: https://arxiv.org/abs/2505.11355
作者: Simon Urbainczyk,Aretha L. Teckentrup,Jonas Latz
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:
Abstract:Gaussian processes (GPs) have gained popularity as flexible machine learning models for regression and function approximation with an in-built method for uncertainty quantification. However, GPs suffer when the amount of training data is large or when the underlying function contains multi-scale features that are difficult to represent by a stationary kernel. To address the former, training of GPs with large-scale data is often performed through inducing point approximations (also known as sparse GP regression (GPR)), where the size of the covariance matrices in GPR is reduced considerably through a greedy search on the data set. To address the latter, deep GPs have gained traction as hierarchical models that resolve multi-scale features by combining multiple GPs. Posterior inference in deep GPs requires sampling or, more commonly, a variational approximation. Variational approximations lead to large-scale stochastic, non-convex optimisation problems, and the resulting approximation tends to represent uncertainty incorrectly. In this work, we combine variational learning with MCMC to develop a particle-based expectation-maximisation method that simultaneously finds inducing points within the large-scale data (variationally) and accurately trains the GPs (via sampling). The result is a highly efficient and accurate methodology for deep GP training on large-scale data. We test our method on standard benchmark problems.
[LG-99] Revisiting Stochastic Approximation and Stochastic Gradient Descent
链接: https://arxiv.org/abs/2505.11343
作者: Rajeeva Laxman Karandikar,Bhamidi Visweswara Rao,Mathukumalli Vidyasagar
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 22 pages
Abstract:In this paper, we take a fresh look at stochastic approximation (SA) and Stochastic Gradient Descent (SGD). We derive new sufficient conditions for the convergence of SA. In particular, the “noise” or measurement error need not have a finite second moment, and under suitable conditions, not even a finite mean. By adapting this method of proof, we also derive sufficient conditions for the convergence of zero-order SGD, wherein the stochastic gradient is computed using only two function evaluations, and no gradient computations. The sufficient conditions derived here are the weakest to date, thus leading to a considerable expansion of the applicability of SA and SGD theory.
[LG-100] Convergence Rates of Constrained Expected Improvement
链接: https://arxiv.org/abs/2505.11323
作者: Haowei Wang,Jingyi Wang,Zhongxiang Dai,Nai-Yuan Chiang,Szu Hui Ng,Cosmin G. Petra
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Constrained Bayesian optimization (CBO) methods have seen significant success in black-box optimization with constraints, and one of the most commonly used CBO methods is the constrained expected improvement (CEI) algorithm. CEI is a natural extension of the expected improvement (EI) when constraints are incorporated. However, the theoretical convergence rate of CEI has not been established. In this work, we study the convergence rate of CEI by analyzing its simple regret upper bound. First, we show that when the objective function f and constraint function c are assumed to each lie in a reproducing kernel Hilbert space (RKHS), CEI achieves the convergence rates of \mathcal{O}\left(t^{-\frac{1}{2}}\log^{\frac{d+1}{2}}(t)\right) and \mathcal{O}\left(t^{-\frac{\nu}{2\nu+d}} \log^{\frac{\nu}{2\nu+d}}(t)\right) for the commonly used squared exponential and Matérn kernels, respectively. Second, we show that when f and c are assumed to be sampled from Gaussian processes (GPs), CEI achieves the same convergence rates with high probability. Numerical experiments are performed to validate the theoretical analysis.
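【示例代码】For reference, the constrained EI acquisition analyzed above, in its standard form: expected improvement over the best feasible observation, weighted by the posterior probability of feasibility under the constraint model.

```latex
\begin{equation*}
\mathrm{CEI}(x) \;=\;
\mathbb{E}\bigl[\max\bigl(f^{+} - f(x),\, 0\bigr)\bigr]
\cdot \Pr\bigl(c(x) \le 0\bigr),
\end{equation*}
```

where f^{+} denotes the best feasible value observed so far; both factors have closed forms under Gaussian process posteriors on f and c.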
[LG-101] Adaptive Linear Embedding for Nonstationary High-Dimensional Optimization
链接: https://arxiv.org/abs/2505.11281
作者: Yuejiang Wen,Paul D. Franzon
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: working, to be submitted
Abstract:Bayesian Optimization (BO) in high-dimensional spaces remains fundamentally limited by the curse of dimensionality and the rigidity of global low-dimensional assumptions. While Random EMbedding Bayesian Optimization (REMBO) mitigates this via linear projections into low-dimensional subspaces, it typically assumes a single global embedding and a stationary objective. In this work, we introduce Self-Adaptive embedding REMBO (SA-REMBO), a novel framework that generalizes REMBO to support multiple random Gaussian embeddings, each capturing a different local subspace structure of the high-dimensional objective. An index variable governs the embedding choice and is jointly modeled with the latent optimization variable via a product kernel in a Gaussian Process surrogate. This enables the optimizer to adaptively select embeddings conditioned on location, effectively capturing locally varying effective dimensionality, nonstationarity, and heteroscedasticity in the objective landscape. We theoretically analyze the expressiveness and stability of the index-conditioned product kernel and empirically demonstrate the advantage of our method across synthetic and real-world high-dimensional benchmarks, where traditional REMBO and other low-rank BO methods fail. Our results establish SA-REMBO as a powerful and flexible extension for scalable BO in complex, structured design spaces.
[LG-102] Linear Convergence of the Frank-Wolfe Algorithm over Product Polytopes
链接: https://arxiv.org/abs/2505.11259
作者: Gabriele Iommazzo,David Martínez-Rubio,Francisco Criado,Elias Wirth,Sebastian Pokutta
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:We study the linear convergence of Frank-Wolfe algorithms over product polytopes. We analyze two condition numbers for the product polytope, namely the pyramidal width and the vertex-facet distance, based on the condition numbers of the individual polytope components. As a result, for convex objectives that are \mu-Polyak-Łojasiewicz, we show linear convergence rates quantified in terms of the resulting condition numbers. We apply our results to the problem of approximately finding a feasible point in a polytope intersection in high dimensions, and demonstrate the practical efficiency of our algorithms through empirical results.
[LG-103] Nash: Neural Adaptive Shrinkage for Structured High-Dimensional Regression
链接: https://arxiv.org/abs/2505.11143
作者: William R.P. Denault
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Sparse linear regression is a fundamental tool in data analysis. However, traditional approaches often fall short when covariates exhibit structure or arise from heterogeneous sources. In biomedical applications, covariates may stem from distinct modalities or be structured according to an underlying graph. We introduce Neural Adaptive Shrinkage (Nash), a unified framework that integrates covariate-specific side information into sparse regression via neural networks. Nash adaptively modulates penalties on a per-covariate basis, learning to tailor regularization without cross-validation. We develop a variational inference algorithm for efficient training and establish connections to empirical Bayes regression. Experiments on real data demonstrate that Nash can improve accuracy and adaptability over existing methods.
[LG-104] Inexact Column Generation for Bayesian Network Structure Learning via Difference-of-Submodular Optimization
链接: https://arxiv.org/abs/2505.11089
作者: Yiran Yang,Rui Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:In this paper, we consider a score-based Integer Programming (IP) approach for solving the Bayesian Network Structure Learning (BNSL) problem. State-of-the-art BNSL IP formulations suffer from an exponentially large number of variables and constraints. A standard approach in IP to address such challenges is to employ row and column generation techniques, which dynamically generate rows and columns, but the complex pricing problem remains a computational bottleneck for BNSL. For the general class of \ell_0-penalized likelihood scores, we show how the pricing problem can be reformulated as a difference-of-submodular optimization problem, and how the Difference of Convex Algorithm (DCA) can be applied as an inexact method to efficiently solve the pricing problems. Empirically, we show that, for continuous Gaussian data, our row and column generation approach yields solutions with higher quality than state-of-the-art score-based approaches, especially as graph density increases, and achieves comparable performance against benchmark constraint-based and hybrid approaches, even as graph size increases.
[LG-105] Conceptual framework for the application of deep neural networks to surface composition reconstruction from Mercurys exospheric data
Link: https://arxiv.org/abs/2505.11053
Authors: Adrian Kazakov (1), Anna Milillo (1), Alessandro Mura (1), Stavro Ivanovski (2), Valeria Mangano (1), Alessandro Aronica (1), Elisabetta De Angelis (1), Pier Paolo Di Bartolomeo (1), Alessandro Brin (1), Luca Colasanti (1), Miguel Escalona-Moran (3), Francesco Lazzarotto (4), Stefano Massetti (1), Martina Moroni (1), Raffaella Noschese (1), Fabrizio Nuccilli (1), Stefano Orsini (1), Christina Plainaki (5), Rosanna Rispoli (1), Roberto Sordini (1), Mirko Stumpo (1), Nello Vertolli (1) ((1) INAF-IAPS, Rome, Italy, (2) INAF-Osservatorio Astronomico di Trieste, Trieste, Italy, (3) Augmented Intelligence Lab, Salceda de Caselas, Spain, (4) INAF-Osservatorio Astronomico di Padova, Padova, Italy, (5) ASI - Italian Space Agency, Rome, Italy)
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: All versions of this article can be explored in the collection: DOI this https URL. This article is identical to v2.5 of the aforementioned collection: DOI this https URL
Abstract:Surface information derived from exospheric measurements at planetary bodies complements surface mapping provided by dedicated imagers, offering critical insights into surface release processes, interactions within the planetary environment, space weathering, and planetary evolution. This study explores the feasibility of deriving Mercury’s regolith elemental composition from in-situ measurements of its neutral exosphere using deep neural networks (DNNs). We present a supervised feed-forward DNN architecture - a multilayer perceptron (MLP) - that, starting from exospheric densities and proton precipitation fluxes, predicts the chemical elements of the surface regolith below. It serves as an estimator for the surface-exosphere interaction and the processes leading to exosphere formation. Because the DNN requires a comprehensive exospheric dataset not available from previous missions, this study uses simulated exosphere components and simulated drivers. Extensive training and testing campaigns demonstrate the MLP’s ability to accurately predict and reconstruct surface composition maps from these simulated measurements. Although this initial version does not aim to reproduce Mercury’s actual surface composition, it provides a proof of concept, showcasing the algorithm’s robustness and capacity for handling complex datasets to create estimators for exospheric generation models. Moreover, our tests reveal substantial potential for further development, suggesting that this method could significantly enhance the analysis of complex surface-exosphere interactions and complement planetary exosphere models. This work anticipates applying the approach to data from the BepiColombo mission, specifically the SERENA package, whose nominal phase begins in 2027.
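As a rough illustration of the estimator's shape, the sketch below defines a feed-forward MLP mapping concatenated exospheric densities and proton precipitation fluxes to elemental abundance fractions; all dimensions and layer sizes are assumptions, not the architecture from the paper.

```python
# A minimal sketch of a supervised MLP from exospheric measurements to
# surface composition fractions; shapes are illustrative placeholders.
import torch
import torch.nn as nn

n_exosphere, n_drivers, n_elements = 16, 4, 8

mlp = nn.Sequential(
    nn.Linear(n_exosphere + n_drivers, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, n_elements),
    nn.Softmax(dim=-1),            # predicted abundances sum to one
)

x = torch.randn(32, n_exosphere + n_drivers)   # a batch of simulated measurements
composition = mlp(x)                            # predicted surface composition
```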
[LG-106] Generalization Bounds for Quantum Learning via Rényi Divergences
Link: https://arxiv.org/abs/2505.11025
Authors: Naqueeb Ahmad Warsi, Ayanava Dasgupta, Masahito Hayashi
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 36 pages, 2 figures
Abstract:This work advances the theoretical understanding of quantum learning by establishing a new family of upper bounds on the expected generalization error of quantum learning algorithms, leveraging the framework introduced by Caro et al. (2024) and a new definition for the expected true loss. Our primary contribution is the derivation of these bounds in terms of quantum and classical Rényi divergences, utilizing a variational approach for evaluating quantum Rényi divergences, specifically the Petz and a newly introduced modified sandwich quantum Rényi divergence. Analytically and numerically, we demonstrate the superior performance of the bounds derived using the modified sandwich quantum Rényi divergence compared to those based on the Petz divergence. Furthermore, we provide probabilistic generalization error bounds using two distinct techniques: one based on the modified sandwich quantum Rényi divergence and classical Rényi divergence, and another employing smooth max Rényi divergence.
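For orientation, the standard divergences named above are defined as follows (the paper's modified sandwiched divergence is a new variant and is not reproduced here):

```latex
% Classical Rényi divergence of order \alpha \in (0, 1) \cup (1, \infty):
D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \sum_x P(x)^{\alpha} Q(x)^{1-\alpha}

% Petz quantum Rényi divergence:
\bar{D}_\alpha(\rho \,\|\, \sigma) = \frac{1}{\alpha - 1} \log \operatorname{Tr}\!\left[ \rho^{\alpha} \sigma^{1-\alpha} \right]

% Sandwiched quantum Rényi divergence:
\widetilde{D}_\alpha(\rho \,\|\, \sigma) = \frac{1}{\alpha - 1} \log \operatorname{Tr}\!\left[ \left( \sigma^{\frac{1-\alpha}{2\alpha}} \, \rho \, \sigma^{\frac{1-\alpha}{2\alpha}} \right)^{\!\alpha} \right]
```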
[LG-107] A Cautionary Tale on Integrating Studies with Disparate Outcome Measures for Causal Inference
Link: https://arxiv.org/abs/2505.11014
Authors: Harsh Parikh, Trang Quynh Nguyen, Elizabeth A. Stuart, Kara E. Rudolph, Caleb H. Miles
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM)
Comments:
Abstract:Data integration approaches are increasingly used to enhance the efficiency and generalizability of studies. However, a key limitation of these methods is the assumption that outcome measures are identical across datasets – an assumption that often does not hold in practice. Consider the following opioid use disorder (OUD) studies: the XBOT trial and the POAT study, both evaluating the effect of medications for OUD on withdrawal symptom severity (not the primary outcome of either trial). While XBOT measures withdrawal severity using the subjective opiate withdrawal (SOW) scale, POAT uses the clinical opiate withdrawal (COW) scale. We analyze this realistic yet challenging setting where outcome measures differ across studies and where neither study records both types of outcomes. Our paper studies whether and when integrating studies with disparate outcome measures leads to efficiency gains. We introduce three sets of assumptions – with varying degrees of strength – linking both outcome measures. Our theoretical and empirical results highlight a cautionary tale: integration can improve asymptotic efficiency only under the strongest assumption linking the outcomes. However, misspecification of this assumption leads to bias. In contrast, a milder assumption may yield finite-sample efficiency gains, yet these benefits diminish as sample size increases. We illustrate these trade-offs via a case study integrating the XBOT and POAT datasets to estimate the comparative effect of two medications for opioid use disorder on withdrawal symptoms. By systematically varying the assumptions linking the SOW and COW scales, we show potential efficiency gains and the risks of bias. Our findings emphasize the need for careful assumption selection when fusing datasets with differing outcome measures, offering guidance for researchers navigating this common challenge in modern data integration.
[LG-108] Supervised Models Can Generalize Also When Trained on Random Labels
Link: https://arxiv.org/abs/2505.11006
Authors: Oskar Allerbo, Thomas B. Schön
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:The success of unsupervised learning raises the question of whether also supervised models can be trained without using the information in the output $y$. In this paper, we demonstrate that this is indeed possible. The key step is to formulate the model as a smoother, i.e. of the form $\hat{f} = Sy$, and to construct the smoother matrix $S$ independently of $y$, e.g. by training on random labels. We present a simple model selection criterion based on the distribution of the out-of-sample predictions and show that, in contrast to cross-validation, this criterion can be used also without access to $y$. We demonstrate on real and synthetic data that $y$-free trained versions of linear and kernel ridge regression, smoothing splines, and neural networks perform similarly to their standard, $y$-based versions and, most importantly, significantly better than random guessing.
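The smoother formulation is easy to see for kernel ridge regression, where $S = K(K + \lambda I)^{-1}$ depends only on the inputs; below is a minimal sketch, with the penalty $\lambda$ treated as a placeholder for the paper's $y$-free selection criterion.

```python
# A minimal sketch of the smoother view of kernel ridge regression: the
# smoother matrix S is built from the inputs X alone, so the labels y
# enter only through the final product S @ y.
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

K = rbf_kernel(X, X)
lam = 0.1                                          # in the paper: selected y-free
S = K @ np.linalg.inv(K + lam * np.eye(len(X)))    # smoother matrix, no y involved

y = X[:, 0] + 0.1 * rng.normal(size=len(X))        # labels used only at the end
y_hat = S @ y
```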
[LG-109] A Physics-Informed Convolutional Long Short Term Memory Statistical Model for Fluid Thermodynamics Simulations
Link: https://arxiv.org/abs/2505.10919
Authors: Luca Menicali, Andrew Grace, David H. Richter, Stefano Castruccio
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Fluid thermodynamics underpins atmospheric dynamics, climate science, industrial applications, and energy systems. However, direct numerical simulations (DNS) of such systems are computationally prohibitive. To address this, we present a novel physics-informed spatio-temporal surrogate model for Rayleigh-Bénard convection (RBC), a canonical example of convective fluid flow. Our approach combines convolutional neural networks for spatial feature extraction with an innovative recurrent architecture inspired by large language models, comprising a context builder and a sequence generator to capture temporal dynamics. Inference is penalized with respect to the governing partial differential equations to ensure physical interpretability. Given the sensitivity of turbulent convection to initial conditions, we quantify uncertainty using a conformal prediction framework. This model replicates key features of RBC dynamics while significantly reducing computational cost, offering a scalable alternative to DNS for long-term simulations.
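A highly simplified sketch of the physics-informed penalty: the training loss adds the residual of the governing equations, evaluated on the prediction, to the usual data term. The residual function below is a placeholder, not the discretized Rayleigh-Bénard operator.

```python
# A conceptual sketch of a physics-informed loss: data fit plus the squared
# residual of the governing PDEs on the prediction. The residual here is a
# stand-in; the paper penalizes the actual Rayleigh-Bénard equations.
import torch

def pde_residual(pred):
    """Placeholder residual (zero when the prediction is physically
    consistent); a real version would apply finite differences of the
    governing equations to the predicted fields."""
    return pred.diff(dim=0)        # stand-in for the true operator

def physics_informed_loss(pred, target, weight=0.1):
    data_loss = torch.mean((pred - target) ** 2)
    phys_loss = torch.mean(pde_residual(pred) ** 2)
    return data_loss + weight * phys_loss

pred = torch.randn(8, 16, 16, requires_grad=True)   # (time, x, y) field
target = torch.randn(8, 16, 16)
physics_informed_loss(pred, target).backward()
```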
[LG-110] Comparative Analysis of Black-Box Optimization Methods for Weather Intervention Design
Link: https://arxiv.org/abs/2505.10843
Authors: Yuta Higuchi, Rikuto Nagai, Atsushi Okazaki, Masaki Ogura, Naoki Wakamiya
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: 15 pages, 11 figures
Abstract:As climate change increases the threat of weather-related disasters, research on weather control is gaining importance. The objective of weather control is to mitigate disaster risks by administering interventions with optimal timing, location, and intensity. However, the optimization process is highly challenging due to the vast scale and complexity of weather phenomena, which introduces two major challenges. First, obtaining accurate gradient information for optimization is difficult. In addition, numerical weather prediction (NWP) models demand enormous computational resources, necessitating parameter optimization with minimal function evaluations. To address these challenges, this study proposes a method for designing weather interventions based on black-box optimization, which enables efficient exploration without requiring gradient information. The proposed method is evaluated in two distinct control scenarios: one-shot initial value intervention and sequential intervention based on model predictive control. Furthermore, a comparative analysis is conducted among four representative black-box optimization methods in terms of total rainfall reduction. Experimental results show that Bayesian optimization achieves higher control effectiveness than the others, particularly in high-dimensional search spaces. These findings suggest that Bayesian optimization is a highly effective approach for weather intervention computation.
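The one-shot intervention setting maps directly onto off-the-shelf Bayesian optimization; below is a minimal sketch using scikit-optimize's gp_minimize, with a cheap stand-in objective in place of an expensive NWP run and with intervention parameters (timing, location, intensity) and their ranges chosen purely for illustration.

```python
# A minimal sketch of gradient-free intervention design via Bayesian
# optimization; the objective is a toy stand-in for an NWP run returning
# total rainfall under the given intervention.
from skopt import gp_minimize

def total_rainfall(params):
    timing, location, intensity = params
    # Placeholder for running the weather model with this intervention
    # and accumulating rainfall over the forecast window.
    return (timing - 3.0) ** 2 + (location - 0.5) ** 2 + intensity ** 2

result = gp_minimize(
    total_rainfall,
    dimensions=[(0.0, 6.0), (0.0, 1.0), (0.0, 10.0)],  # illustrative ranges
    n_calls=30,                     # few evaluations, as the setting demands
    random_state=0,
)
print(result.x, result.fun)
```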
[LG-111] Minimax learning rates for estimating binary classifiers under margin conditions
Link: https://arxiv.org/abs/2505.10628
Authors: Jonathan García, Philipp Petersen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments:
Abstract:We study classification problems using binary estimators where the decision boundary is described by horizon functions and where the data distribution satisfies a geometric margin condition. We establish upper and lower bounds for the minimax learning rate over broad function classes with bounded Kolmogorov entropy in Lebesgue norms. A key novelty of our work is the derivation of lower bounds on the worst-case learning rates under a geometric margin condition – a setting that is almost universally satisfied in practice but remains theoretically challenging. Moreover, our results deal with the noiseless setting, where lower bounds are particularly hard to establish. We apply our general results to classification problems with decision boundaries belonging to several function classes: for Barron-regular functions, and for Hölder-continuous functions with strong margins, we identify optimal rates close to the fast learning rates of $\mathcal{O}(n^{-1})$ for $n \in \mathbb{N}$ samples. For merely convex decision boundaries, optimal rates near $\mathcal{O}(n^{-1/2})$ can be achieved in the strong-margin case.
[LG-112] An Exponential Averaging Process with Strong Convergence Properties
Link: https://arxiv.org/abs/2505.10605
Authors: Frederik Köhne, Anton Schiela
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
Comments:
Abstract:Averaging, or smoothing, is a fundamental approach to obtain stable, de-noised estimates from noisy observations. In certain scenarios, observations made along trajectories of random dynamical systems are of particular interest. One popular smoothing technique for such a scenario is exponential moving averaging (EMA), which assigns observations a weight that decreases exponentially in their age, thus giving younger observations a larger weight. However, EMA fails to enjoy strong stochastic convergence properties, which stems from the fact that the weight assigned to the youngest observation is constant over time, preventing the noise in the averaged quantity from decreasing to zero. In this work, we consider an adaptation to EMA, which we call $p$-EMA, where the weights assigned to the last observations decrease to zero at a subharmonic rate. We provide stochastic convergence guarantees for this kind of averaging under mild assumptions on the autocorrelations of the underlying random dynamical system. We further discuss the implications of our results for a recently introduced adaptive step size control for Stochastic Gradient Descent (SGD), which uses $p$-EMA for averaging noisy observations.
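The contrast with standard EMA can be sketched in a few lines: EMA gives the newest observation a constant weight, so the averaged noise never vanishes, whereas a $p$-EMA-style scheme lets that weight decay at a subharmonic rate $t^{-p}$ with $p \in (0, 1)$. The exact weight schedule below is an assumption for illustration, not the paper's definition.

```python
# A minimal sketch contrasting EMA (constant weight on the newest point)
# with a p-EMA-style scheme whose newest-point weight decays like t^{-p}.
import numpy as np

def ema(xs, alpha=0.1):
    m = xs[0]
    for x in xs[1:]:
        m = (1 - alpha) * m + alpha * x
    return m

def p_ema(xs, p=0.7):
    m = xs[0]
    for t, x in enumerate(xs[1:], start=2):
        alpha_t = t ** (-p)            # subharmonic weight on the newest point
        m = (1 - alpha_t) * m + alpha_t * x
    return m

rng = np.random.default_rng(0)
noisy = 1.0 + rng.normal(scale=0.5, size=10_000)
print(ema(noisy), p_ema(noisy))        # p-EMA concentrates around the mean 1.0
```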
[LG-113] Quantum thermodynamics and semi-definite optimization
Link: https://arxiv.org/abs/2505.04514
Authors: Nana Liu, Michele Minervini, Dhrumil Patel, Mark M. Wilde
Subjects: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: v2: 16 pages of main text, 15 pages of appendices, 3 figures, corrections introduced
Abstract:In quantum thermodynamics, a system is described by a Hamiltonian and a list of non-commuting charges representing conserved quantities like particle number or electric charge, and an important goal is to determine the system’s minimum energy in the presence of these conserved charges. In optimization theory, a semi-definite program (SDP) involves a linear objective function optimized over the cone of positive semi-definite operators intersected with an affine space. These problems arise from differing motivations in the physics and optimization communities and are phrased using very different terminology, yet they are essentially identical mathematically. By adopting Jaynes’ mindset motivated by quantum thermodynamics, we observe that minimizing free energy in the aforementioned thermodynamics problem, instead of energy, leads to an elegant solution in terms of a dual chemical potential maximization problem that is concave in the chemical potential parameters. As such, one can employ standard (stochastic) gradient ascent methods to find the optimal values of these parameters, and these methods are guaranteed to converge quickly. At low temperature, the minimum free energy provides an excellent approximation for the minimum energy. We then show how this Jaynes-inspired gradient-ascent approach can be used in both first- and second-order classical and hybrid quantum-classical algorithms for minimizing energy, and equivalently, how it can be used for solving SDPs, with guarantees on the runtimes of the algorithms. The approach discussed here is well grounded in quantum thermodynamics and, as such, provides physical motivation underpinning why algorithms published fifty years after Jaynes’ seminal work, including the matrix multiplicative weights update method, the matrix exponentiated gradient update method, and their quantum algorithmic generalizations, perform well at solving SDPs.
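A small numerical sketch of the dual chemical-potential ascent: for Hamiltonian $H$, charges $Q_i$ with target values $q_i$, and temperature $T$, the Gibbs state $\rho(\mu) \propto \exp(-(H - \sum_i \mu_i Q_i)/T)$ gives a concave dual whose gradient in $\mu_i$ is $q_i - \mathrm{Tr}[\rho(\mu) Q_i]$, so plain gradient ascent applies. Matrices, targets, and the step size below are illustrative placeholders.

```python
# A toy sketch of chemical-potential gradient ascent on the concave dual
# of the free-energy minimization; all quantities are small random
# placeholders, not a physical system.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
dim, T = 4, 1.0
H = rng.normal(size=(dim, dim))
H = (H + H.T) / 2                                       # toy Hamiltonian
Q = [np.diag(rng.normal(size=dim)) for _ in range(2)]   # conserved charges
q = np.array([0.1, -0.1])                               # target charge values

mu = np.zeros(2)
for _ in range(100):
    G = expm(-(H - sum(m * Qi for m, Qi in zip(mu, Q))) / T)
    rho = G / np.trace(G)                               # Gibbs state at mu
    grad = q - np.array([np.trace(rho @ Qi) for Qi in Q])
    mu += 0.2 * grad                                    # ascent on the concave dual
print(mu)
```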
Information Retrieval
[IR-0] CRISP: Clustering Multi-Vector Representations for Denoising and Pruning
Link: https://arxiv.org/abs/2505.11471
Authors: João Veneroso, Rajesh Jayaram, Jinmeng Rao, Gustavo Hernández Ábrego, Majid Hadian, Daniel Cer
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Multi-vector models, such as ColBERT, are a significant advancement in neural information retrieval (IR), delivering state-of-the-art performance by representing queries and documents by multiple contextualized token-level embeddings. However, this increased representation size introduces considerable storage and computational overheads which have hindered widespread adoption in practice. A common approach to mitigate this overhead is to cluster the model’s frozen vectors, but this strategy’s effectiveness is fundamentally limited by the intrinsic clusterability of these embeddings. In this work, we introduce CRISP (Clustered Representations with Intrinsic Structure Pruning), a novel multi-vector training method which learns inherently clusterable representations directly within the end-to-end training process. By integrating clustering into the training phase rather than imposing it post-hoc, CRISP significantly outperforms post-hoc clustering at all representation sizes, as well as other token pruning methods. On the BEIR retrieval benchmarks, CRISP achieves a significant rate of ~3x reduction in the number of vectors while outperforming the original unpruned model. This indicates that learned clustering effectively denoises the model by filtering irrelevant information, thereby generating more robust multi-vector representations. With more aggressive clustering, CRISP achieves an 11x reduction in the number of vectors with only a 3.6% quality loss.
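For context, here is a minimal sketch of the post-hoc baseline that CRISP improves upon: cluster a document's frozen token-level embeddings with k-means and keep only the centroids, which late-interaction scoring then consumes in place of the full vector set. Sizes and the value of k are illustrative.

```python
# A minimal sketch of post-hoc clustering of multi-vector representations,
# the baseline CRISP outperforms by learning clusterable vectors in
# training. Centroids replace the original token vectors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(120, 128))   # one document, 120 token vectors

k = 8                                            # keep 8 centroids (~15x fewer vectors)
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
kmeans.fit(token_embeddings)
pruned_doc = kmeans.cluster_centers_             # (k, 128) pruned representation

# ColBERT-style late interaction against the pruned set: sum over query
# tokens of the max similarity to any centroid.
query = rng.normal(size=(16, 128))
score = np.sum(np.max(query @ pruned_doc.T, axis=1))
```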
[IR-1] mmRAG: A Modular Benchmark for Retrieval-Augmented Generation over Text, Tables, and Knowledge Graphs
Link: https://arxiv.org/abs/2505.11180
Authors: Chuan Xu, Qiaosheng Chen, Yutong Feng, Gong Cheng
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of large language models. However, existing RAG evaluation predominantly focuses on text retrieval and relies on opaque, end-to-end assessments of generated outputs. To address these limitations, we introduce mmRAG, a modular benchmark designed for evaluating multi-modal RAG systems. Our benchmark integrates queries from six diverse question-answering datasets spanning text, tables, and knowledge graphs, which we uniformly convert into retrievable documents. To enable direct, granular evaluation of individual RAG components – such as the accuracy of retrieval and query routing – beyond end-to-end generation quality, we follow standard information retrieval procedures to annotate document relevance and derive dataset relevance. We establish baseline performance by evaluating a wide range of RAG implementations on mmRAG.
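Granular component evaluation of the kind mmRAG enables reduces to scoring retrieval output directly against relevance annotations; below is a minimal sketch of recall@k over annotated queries (data structures are hypothetical).

```python
# A minimal sketch of component-level retrieval evaluation: with gold
# relevance annotations per query, retrieval quality is scored directly
# (here recall@k), independently of the generated answers.
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

annotations = {"q1": ["doc3", "doc7"], "q2": ["doc1"]}        # gold relevance
retrieved = {"q1": ["doc7", "doc2", "doc3"], "q2": ["doc4"]}  # system output

scores = [recall_at_k(retrieved[q], rels) for q, rels in annotations.items()]
print(sum(scores) / len(scores))    # mean recall@10 over queries
```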