This post contains the latest paper listings retrieved from arxiv.org on 2025-05-16, updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the digest by scheduled email, please leave your email address in the comments.

Note: paper data is fetched from arxiv.org and updated automatically around 12:00 every day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-05-16)

A total of 454 papers were updated today, including:

  • Natural Language Processing: 61 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 128 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 75 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 148 papers (Machine Learning, cs.LG)

Natural Language Processing

[NLP-0] MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning ACL2025

[Quick Read]: This paper addresses the limitations of current Large Multimodal Models (LMMs) in multimodal mathematical reasoning, in particular the fact that existing natural-language image-caption datasets pay too little attention to the details of mathematical figures. The key to the solution is using code as the supervision signal for cross-modal alignment, since code precisely encodes all the information needed to generate the corresponding figure, establishing an exact image-code correspondence. With this strategy, the authors build the image-to-code model FigCodifier and the large-scale image-code dataset ImgCode-8.6M, further construct MM-MathInstruct-3M, a high-quality dataset for multimodal math instruction fine-tuning, and finally train MathCoder-VL, which sets a new open-source SOTA across multiple metrics.

Link: https://arxiv.org/abs/2505.10557
Authors: Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li
Affiliations: The Chinese University of Hong Kong; Multimedia Laboratory (MMLab); CPII under InnoHK
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Findings

Abstract:Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with model-in-the-loop approach, resulting in an image-to-code model, FigCodifier and ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at this https URL.

[NLP-1] Beyond Aha!: Toward Systematic Meta-Abilities Alignment in Large Reasoning Models

[Quick Read]: This paper tackles the problem that the "aha" behaviors of large reasoning models (LRMs) in long chain-of-thought reasoning (such as self-correction, backtracking, and verification) are unpredictable in timing and consistency and hard to control, limiting the scalability and reliability of LRM reasoning. The key to the solution is to explicitly align the model with three meta-abilities (deduction, induction, and abduction) using automatically generated, self-verifiable tasks, via a three-stage pipeline of individual alignment, parameter-space merging, and domain-specific reinforcement learning (RL) to boost performance.

Link: https://arxiv.org/abs/2505.10554
Authors: Zhiyuan Hu, Yibo Wang, Hanze Dong, Yuhui Xu, Amrita Saha, Caiming Xiong, Bryan Hooi, Junnan Li
Affiliations: National University of Singapore; Tsinghua University; Salesforce AI Research
Subjects: Computation and Language (cs.CL)
Comments: In Progress

Abstract:Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model’s “aha moment”. However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs’ reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental “aha moments”. Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three-stage pipeline of individual alignment, parameter-space merging, and domain-specific reinforcement learning boosts performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional 2% average gain in the performance ceiling across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: this https URL
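The parameter-space merging step is easy to picture in code. A minimal sketch (an illustration, not the paper's implementation; the uniform weights and checkpoint names are assumptions):

    import torch

    def merge_state_dicts(state_dicts, weights):
        """Weighted parameter-space merge of checkpoints sharing one
        architecture, e.g. deduction-, induction-, and abduction-aligned models."""
        merged = {}
        for key in state_dicts[0]:
            merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        return merged

    # Hypothetical usage: equal-weight merge of three aligned checkpoints.
    # merged = merge_state_dicts([sd_deduction, sd_induction, sd_abduction],
    #                            [1 / 3, 1 / 3, 1 / 3])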

[NLP-2] Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models

[Quick Read]: This paper examines how well large language models adapt as self-learning and reasoning agents in dynamic environments, evaluating the effectiveness of prompting techniques such as self-reflection, heuristic mutation, and planning. The key is to analyze, through experiments in dynamic environments, how models of different sizes differ, how strategic prompting can narrow the performance gap between larger and smaller models, and what the potential and limits of advanced reasoning methods are for improving performance on complex tasks.

Link: https://arxiv.org/abs/2505.10543
Authors: Annie Wong, Thomas Bäck, Aske Plaat, Niki van Stein, Anna V. Kononova
Affiliations: Leiden Institute of Advanced Computer Science
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:While large language models demonstrate impressive performance on static benchmarks, the true potential of large language models as self-learning and reasoning agents in dynamic environments remains unclear. This study systematically evaluates the efficacy of self-reflection, heuristic mutation, and planning as prompting techniques to test the adaptive capabilities of agents. We conduct experiments with various open-source language models in dynamic environments and find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, a too-long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision-making align, they also introduce instability and can lead to big performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in crucial areas such as planning, reasoning, and spatial coordination, suggesting that current-generation large language models still suffer fundamental shortcomings that may not be fully overcome through self-reflective prompting alone. Reasoning is a multi-faceted task, and while reasoning methods like Chain of thought improves multi-step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.

[NLP-3] WorldPM: Scaling Human Preference Modeling

[Quick Read]: This paper asks how to model and scale human preferences effectively so that language models can better adapt to them. The key to the solution is World Preference Modeling (WorldPM), which builds a unified representation of human preferences and trains it with large-scale data and models, making preference modeling scalable. The study finds that WorldPM scales well on adversarial and objective metrics, with especially clear gains for larger models, validating it as an effective foundation for preference fine-tuning.

Link: https://arxiv.org/abs/2505.10527
Authors: Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, An Yang, Binyuan Hui, Dayiheng Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Bowen Yu, Jingren Zhou, Junyang Lin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling. We propose World Preference Modeling (WorldPM) to emphasize this scaling potential, where World Preference embodies a unified representation of human preferences. In this paper, we collect preference data from public forums covering diverse user communities, and conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters. We observe distinct patterns across different evaluation metrics: (1) Adversarial metrics (ability to identify deceptive features) consistently scale up with increased training data and base model size; (2) Objective metrics (objective knowledge with well-defined answers) show emergent behavior in larger language models, highlighting WorldPM’s scalability potential; (3) Subjective metrics (subjective preferences from a limited number of humans or AI) do not demonstrate scaling trends. Further experiments validate the effectiveness of WorldPM as a foundation for preference fine-tuning. Through evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly improves the generalization performance across human preference datasets of varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5% on many key subtasks. Integrating WorldPM into our internal RLHF pipeline, we observe significant improvements on both in-house and public evaluation sets, with notable gains of 4% to 8% in our in-house evaluations.

[NLP-4] MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models

[Quick Read]: This paper addresses two core obstacles to applying speculative decoding to vision-language models (VLMs): small language models lack the architectural components to process visual inputs, and their token predictions fail to match those of a VLM target model that conditions on visual context. The key is MASSV, a two-phase method that turns existing small language models into effective multimodal drafters: it first connects the target VLM's vision encoder to the draft model through a lightweight trainable projector, then applies self-distilled visual instruction tuning on responses generated by the target VLM to align token predictions.

Link: https://arxiv.org/abs/2505.10526
Authors: Mugilan Ganesan, Shane Segal, Ankur Aggarwal, Nish Sinnadurai, Sean Lie, Vithursan Thangarasa
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Main paper: 11 pp., 4 figs., 3 tabs.; Supplementary: 2 pp

Abstract:Speculative decoding significantly accelerates language model inference by enabling a lightweight draft model to propose multiple tokens that a larger target model verifies simultaneously. However, applying this technique to vision-language models (VLMs) presents two fundamental challenges: small language models that could serve as efficient drafters lack the architectural components to process visual inputs, and their token predictions fail to match those of VLM target models that consider visual context. We introduce Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (MASSV), which transforms existing small language models into effective multimodal drafters through a two-phase approach. MASSV first connects the target VLM’s vision encoder to the draft model via a lightweight trainable projector, then applies self-distilled visual instruction tuning using responses generated by the target VLM to align token predictions. Comprehensive experiments across the Qwen2.5-VL and Gemma3 model families demonstrate that MASSV increases accepted length by up to 30% and delivers end-to-end inference speedups of up to 1.46x on visually-grounded tasks. MASSV provides a scalable, architecture-compatible method for accelerating both current and future VLMs.
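To make the projector idea concrete, here is a minimal sketch (an illustration, not the paper's implementation; the two-layer MLP shape is an assumption) of mapping frozen vision features into the draft model's embedding space:

    import torch
    import torch.nn as nn

    class VisionProjector(nn.Module):
        """Lightweight trainable bridge from the target VLM's vision encoder
        to the draft model's embedding space."""
        def __init__(self, vision_dim: int, draft_dim: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, draft_dim),
                nn.GELU(),
                nn.Linear(draft_dim, draft_dim),
            )

        def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
            # vision_feats: [batch, num_patches, vision_dim]
            return self.proj(vision_feats)  # [batch, num_patches, draft_dim]

    # The projected visual tokens would then be prepended to the draft
    # model's text embeddings before drafting tokens for verification.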

[NLP-5] Multi-Token Prediction Needs Registers

[Quick Read]: This paper addresses the fact that the benefits of multi-token prediction in language model pretraining have not consistently carried over to other settings such as fine-tuning. The key is MuToR, which interleaves learnable register tokens into the input sequence, each tasked with predicting a future target token. Compared with existing methods, MuToR has clear advantages: it adds only a negligible number of parameters, requires no architectural changes (so it stays compatible with off-the-shelf pretrained language models), and remains aligned with the next-token pretraining objective, making it especially well suited to supervised fine-tuning. It also naturally supports scalable prediction horizons.

Link: https://arxiv.org/abs/2505.10518
Authors: Anastasios Gerontopoulos, Spyros Gidaris, Nikos Komodakis
Affiliations: Archimedes, Athena Research Center; University of Crete; valeo.ai; IACM-Forth
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes–ensuring compatibility with off-the-shelf pretrained language models–and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: this https URL.
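A minimal sketch of the interleaving idea (an illustration under assumptions: the stride, horizon, and omitted attention-masking details are not from the paper):

    def interleave_registers(tokens, reg_id, stride=4, horizon=2):
        """Insert a register token every `stride` positions; each register's
        training target is the token `horizon` steps ahead, while regular
        positions keep the usual next-token target. Attention masking that
        hides registers from regular tokens is omitted here."""
        IGNORE = -100  # label ignored by the loss
        input_ids, labels = [], []
        for i, tok in enumerate(tokens):
            input_ids.append(tok)
            labels.append(tokens[i + 1] if i + 1 < len(tokens) else IGNORE)
            if (i + 1) % stride == 0:
                input_ids.append(reg_id)
                target = i + horizon
                labels.append(tokens[target] if target < len(tokens) else IGNORE)
        return input_ids, labels

    ids, labels = interleave_registers(list(range(100, 112)), reg_id=0)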

[NLP-6] The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks

[Quick Read]: This paper targets the challenge of label projection in translation-based cross-lingual transfer (XLT): accurately mapping the labels of each token in a source sentence onto its counterpart(s) in the translation. The key is a systematic study of how low-level design decisions affect label projection, including the algorithm for projecting labels across (multi-)token spans, filtering strategies that reduce noisily mapped labels, and pre-tokenization of the translated sentences. With optimized choices, the paper shows that word-alignment (WA) based label projection can match marker-based methods, and it further introduces a projection strategy that ensembles translate-train and translate-test predictions, which clearly outperforms marker-based projection while reducing sensitivity to low-level WA design choices, yielding more robust XLT for token classification.

Link: https://arxiv.org/abs/2505.10507
Authors: Benedikt Ebing, Goran Glavaš
Affiliations: University of Würzburg; Center for Artificial Intelligence and Data Science
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Translation-based strategies for cross-lingual transfer (XLT), such as translate-train – training on noisy target language data translated from the source language – and translate-test – evaluating on noisy source language data translated from the target language – are competitive XLT baselines. In XLT for token classification tasks, however, these strategies include label projection, the challenging step of mapping the labels from each token in the original sentence to its counterpart(s) in the translation. Although word aligners (WAs) are commonly used for label projection, the low-level design decisions for applying them to translation-based XLT have not been systematically investigated. Moreover, recent marker-based methods, which project labeled spans by inserting tags around them before (or after) translation, claim to outperform WAs in label projection for XLT. In this work, we revisit WAs for label projection, systematically investigating the effects of low-level design decisions on token-level XLT: (i) the algorithm for projecting labels between (multi-)token spans, (ii) filtering strategies to reduce the number of noisily mapped labels, and (iii) the pre-tokenization of the translated sentences. We find that all of these substantially impact translation-based XLT performance and show that, with optimized choices, XLT with WA offers performance at least comparable to that of marker-based methods. We then introduce a new projection strategy that ensembles translate-train and translate-test predictions and demonstrate that it substantially outperforms the marker-based projection. Crucially, we show that our proposed ensembling also reduces sensitivity to low-level WA design choices, resulting in more robust XLT for token classification tasks.
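To illustrate what label projection through word alignments involves, a toy sketch (not the paper's algorithm; the simple first-come, one-to-one rule is an assumption, and real systems add span repair and filtering of noisy links):

    def project_labels(src_labels, alignments, tgt_len):
        """Project token-level BIO labels from source to target through word
        alignments given as (src_idx, tgt_idx) pairs."""
        tgt_labels = ["O"] * tgt_len
        for s, t in sorted(alignments, key=lambda pair: pair[1]):
            if src_labels[s] != "O" and tgt_labels[t] == "O":
                tgt_labels[t] = src_labels[s]
        return tgt_labels

    src = ["B-PER", "I-PER", "O", "O"]
    align = [(0, 1), (1, 2), (2, 0), (3, 3)]
    print(project_labels(src, align, 4))  # ['O', 'B-PER', 'I-PER', 'O']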

[NLP-7] RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs

[Quick Read]: This paper addresses fine-tuning large language models (LLMs) for function calling when real user interaction data is unavailable. In digital content creation tools, user needs are expressed as natural language queries that must be mapped to API calls, and the lack of real task-specific data plus privacy constraints make conventional training difficult. The key is a router-based architecture that leverages domain resources such as content metadata and structured knowledge graphs, together with text-to-text and vision-to-text language models, to generate high-quality synthetic training data; its flexible routing mechanism produces synthetic data matching real-world distributions, substantially improving function classification accuracy and API parameter selection.

Link: https://arxiv.org/abs/2505.10495
Authors: Vibha Belavadi, Tushar Vatsa, Dewang Sultania, Suhas Suresha, Ishita Verma, Cheng Chen, Tracy Holloway King, Michael Friedrich
Affiliations: Adobe Inc.
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing

Abstract:This paper addresses fine-tuning Large Language Models (LLMs) for function calling tasks when real user interaction data is unavailable. In digital content creation tools, where users express their needs through natural language queries that must be mapped to API calls, the lack of real-world task-specific data and privacy constraints for training on it necessitate synthetic data generation. Existing approaches to synthetic data generation fall short in diversity and complexity, failing to replicate real-world data distributions and leading to suboptimal performance after LLM fine-tuning. We present a novel router-based architecture that leverages domain resources like content metadata and structured knowledge graphs, along with text-to-text and vision-to-text language models to generate high-quality synthetic training data. Our architecture’s flexible routing mechanism enables synthetic data generation that matches observed real-world distributions, addressing a fundamental limitation of traditional approaches. Evaluation on a comprehensive set of real user queries demonstrates significant improvements in both function classification accuracy and API parameter selection. Models fine-tuned with our synthetic data consistently outperform traditional approaches, establishing new benchmarks for function calling tasks.

[NLP-8] Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective ACL2025

[Quick Read]: This paper targets the lack of comprehensive evaluation of LLMs on code security and usability: existing code security benchmarks cover only a single task and paradigm, such as code completion and generation, without assessing dimensions like secure code generation, vulnerability repair, and vulnerability discrimination. The paper proposes CoV-Eval, a multi-task benchmark spanning code completion, vulnerability repair, and vulnerability detection and classification, for a comprehensive assessment of LLM code security. A key part of the solution is VC-Judge, an improved judgment model closely aligned with human experts that can review LLM-generated programs for vulnerabilities more efficiently and reliably.

Link: https://arxiv.org/abs/2505.10494
Authors: Yutao Mou, Xiao Deng, Yuxiao Luo, Shikun Zhang, Wei Ye
Affiliations: National Engineering Research Center for Software Engineering, Peking University
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ACL 2025 Main Conference

Abstract:Code security and usability are both essential for various coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on single evaluation task and paradigm, such as code completion and generation, lacking comprehensive assessment across dimensions like secure code generation, vulnerability repair and discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering various tasks such as code completion, vulnerability repair, vulnerability detection and classification, for comprehensive evaluation of LLM code security. Besides, we developed VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities in a more efficient and reliable way. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable codes well, they still tend to generate insecure codes and struggle with recognizing specific vulnerability types and performing repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.

[NLP-9] CL-RAG: Bridging the Gap in Retrieval-Augmented Generation with Curriculum Learning

[Quick Read]: This paper addresses the fact that in existing retrieval-augmented generation (RAG) systems, the effectiveness of retrieved documents varies widely across queries during training, which hurts the model's adaptation. The key is a multi-stage curriculum learning (CL) framework: training data is constructed at multiple difficulty levels and the model is trained progressively from easy to hard, improving the overall performance and generalization of the RAG system.

Link: https://arxiv.org/abs/2505.10493
Authors: Shaohan Wang, Licheng Zhang, Zheren Fu, Zhendong Mao
Affiliations: University of Science and Technology of China
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods focus on optimizing the retriever or generator in the RAG system by directly utilizing the top-k retrieved documents. However, the documents effectiveness are various significantly across user queries, i.e. some documents provide valuable knowledge while others totally lack critical information. It hinders the retriever and generator’s adaptation during training. Inspired by human cognitive learning, curriculum learning trains models using samples progressing from easy to difficult, thus enhancing their generalization ability, and we integrate this effective paradigm to the training of the RAG system. In this paper, we propose a multi-stage Curriculum Learning based RAG system training framework, named CL-RAG. We first construct training data with multiple difficulty levels for the retriever and generator separately through sample evolution. Then, we train the model in stages based on the curriculum learning approach, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our CL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.

[NLP-10] Parallel Scaling Law for Language Models

[Quick Read]: This paper targets the significant space or time costs of conventional ways of scaling language models (parameter scaling or inference-time scaling). The key is a new paradigm, parallel scaling (ParScale), which increases the model's parallel computation during both training and inference instead of adding parameters or output tokens: it applies several learnable transformations to the input, executes forward passes of the model in parallel, and dynamically aggregates the multiple outputs, achieving efficient parallel-compute scaling.

Link: https://arxiv.org/abs/2505.10475
Authors: Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:It is commonly believed that scaling language models should commit a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more inference-efficient scaling paradigm: increasing the model’s parallel computation during both training and inference time. We apply P diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the P outputs. This method, namely parallel scaling (ParScale), scales parallel computation by reusing existing parameters and can be applied to any model structure, optimization procedure, data, or task. We theoretically propose a new scaling law and validate it through large-scale pre-training, which shows that a model with P parallel streams is similar to scaling the parameters by O(log P) while showing superior inference efficiency. For example, ParScale can use up to 22× less memory increase and 6× less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget. The new scaling law we discovered potentially facilitates the deployment of more powerful models in low-resource scenarios, and provides an alternative perspective for the role of computation in machine learning.
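A minimal sketch of the idea (an illustration only; prefix-style input transformations and softmax-weighted aggregation are assumptions, not necessarily the paper's exact design):

    import torch
    import torch.nn as nn

    class ParScaleWrapper(nn.Module):
        """Run P forward passes with stream-specific learnable prefixes and
        aggregate the P logit streams with learned weights."""
        def __init__(self, model, d_model, num_streams=4, prefix_len=8):
            super().__init__()
            self.model = model  # assumed: maps embeddings [B, L, d] -> logits
            self.prefixes = nn.Parameter(
                0.02 * torch.randn(num_streams, prefix_len, d_model))
            self.gate = nn.Parameter(torch.zeros(num_streams))

        def forward(self, embeds):  # embeds: [B, L, d_model]
            streams = []
            plen = self.prefixes.shape[1]
            for p in range(self.prefixes.shape[0]):
                prefix = self.prefixes[p].expand(embeds.size(0), -1, -1)
                logits = self.model(torch.cat([prefix, embeds], dim=1))
                streams.append(logits[:, plen:])  # drop prefix positions
            weights = torch.softmax(self.gate, dim=0)
            return sum(w * s for w, s in zip(weights, streams))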

[NLP-11] Superposition Yields Robust Neural Scaling

[Quick Read]: This paper seeks to explain the origin of the neural scaling law observed in large language models (LLMs), i.e., why loss falls as a power law with model size. The key is a toy model based on representation superposition: by analyzing how the feature frequency distribution interacts with model size, it shows that under strong superposition the loss becomes inversely proportional to model dimension, and empirical analysis confirms that the toy model's predictions match several open-source LLMs.

Link: https://arxiv.org/abs/2505.10465
Authors: Yizhou Liu, Ziming Liu, Jeff Gore
Affiliations: Massachusetts Institute of Technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 30 pages, 23 figures

Abstract:The success of today’s large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law – the finding that loss decreases as a power law with model size – remains unclear. Starting from two empirical principles – that LLMs represent more things than the model dimensions (widths) they have (i.e., representations are superposed), and that words or concepts in language occur with varying frequencies – we constructed a toy model to study the loss scaling with model size. We found that when superposition is weak, meaning only the most frequent features are represented without interference, the scaling of loss with model size depends on the underlying feature frequency; if feature frequencies follow a power law, so does the loss. In contrast, under strong superposition, where all features are represented but overlap with each other, the loss becomes inversely proportional to the model dimension across a wide range of feature frequency distributions. This robust scaling behavior is explained geometrically: when many more vectors are packed into a lower dimensional space, the interference (squared overlaps) between vectors scales inversely with that dimension. We then analyzed four families of open-sourced LLMs and found that they exhibit strong superposition and quantitatively match the predictions of our toy model. The Chinchilla scaling law turned out to also agree with our results. We conclude that representation superposition is an important mechanism underlying the observed neural scaling laws. We anticipate that these insights will inspire new training strategies and model architectures to achieve better performance with less computation and fewer parameters.
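The geometric claim (the mean squared overlap between random unit vectors packed into dimension d scales as 1/d) is easy to check numerically; a small illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (64, 128, 256):
        n = 4 * d  # pack many more vectors than the dimension
        v = rng.normal(size=(n, d))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        gram = v @ v.T
        off_diag = gram[~np.eye(n, dtype=bool)]
        print(f"d={d:4d}  mean squared overlap={np.mean(off_diag**2):.5f}  1/d={1/d:.5f}")

    # The interference term tracks 1/d, mirroring the loss ~ 1/width regime
    # reported under strong superposition.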

[NLP-12] Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models

[Quick Read]: This paper aims to improve the accuracy and efficiency of diffusion language models (DLMs) on reasoning tasks, in particular by optimizing the reasoning trajectory toward correct final answers. The key is the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework that treats each intermediate step of the reverse diffusion process as a latent "thinking" action and optimizes the whole trajectory with outcome-based reinforcement learning (RL) to maximize a correctness reward on the final answer. Unlike conventional Chain-of-Thought (CoT), DCoLT allows bidirectional, non-linear reasoning with no strict grammatical-correctness constraints on intermediate steps.

Link: https://arxiv.org/abs/2505.10446
Authors: Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, Guo-Jun Qi
Affiliations: Westlake University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent “thinking” action and optimizes the entire reasoning trajectory to maximize the reward on the correctness of the final answer with outcome-based Reinforcement Learning (RL). Unlike traditional Chain-of-Thought (CoT) methods that follow a causal, linear thinking process, DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We implement DCoLT on two representative Diffusion Language Models (DLMs). First, we choose SEDD as a representative continuous-time discrete diffusion model, where its concrete score derives a probabilistic policy to maximize the RL reward over the entire sequence of intermediate diffusion steps. We further consider the discrete-time masked diffusion language model – LLaDA, and find that the order to predict and unmask tokens plays an essential role to optimize its RL action resulting from the ranking-based Unmasking Policy Module (UPM) defined by the Plackett-Luce model. Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, +19.5% on GSM8K, MATH, MBPP, and HumanEval.
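For intuition about the ranking-based unmasking policy: sampling an unmask order from a Plackett-Luce distribution can be done with the Gumbel trick. A small illustration (not the paper's code; the toy scores are invented):

    import torch

    def sample_unmask_order(scores: torch.Tensor) -> torch.Tensor:
        """Sample a permutation of masked positions from the Plackett-Luce
        distribution with logits `scores` by perturbing with Gumbel noise
        and sorting (higher perturbed score = unmasked earlier)."""
        gumbel = -torch.log(-torch.log(torch.rand_like(scores)))
        return torch.argsort(scores + gumbel, descending=True)

    order = sample_unmask_order(torch.tensor([2.0, 0.5, 1.0, 3.0]))
    print(order)  # a random permutation; high-score positions tend to come first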

[NLP-13] Hierarchical Document Refinement for Long-context Retrieval-augmented Generation

[Quick Read]: This paper addresses the redundancy and noise that real-world retrieval-augmented generation (RAG) applications face with long-context inputs, which raise inference cost and hurt performance. The key is LongRefiner, an efficient plug-and-play refinement module that exploits the inherent structure of long documents through dual-level query analysis, hierarchical document structuring, and adaptive refinement via multi-task learning on a single foundation model, keeping performance high while substantially cutting computational cost and latency.

Link: https://arxiv.org/abs/2505.10413
Authors: Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Huawei Poisson Lab
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Real-world RAG applications often encounter long-context input scenarios, where redundant information and noise results in higher inference costs and reduced performance. To address these challenges, we propose LongRefiner, an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. LongRefiner employs dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning on a single foundation model. Experiments on seven QA datasets demonstrate that LongRefiner achieves competitive performance in various scenarios while using 10x fewer computational costs and latency compared to the best baseline. Further analysis validates that LongRefiner is scalable, efficient, and effective, providing practical insights for real-world long-text RAG applications. Our code is available at this https URL.

[NLP-14] Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation

[Quick Read]: This paper asks how effective large language models (LLMs) really are at generating plain language summaries (PLSs) of medical information for patients, since their actual impact on health-information comprehension remains unclear. The key is a large-scale crowdsourced evaluation (150 participants on Amazon Mechanical Turk) that combines subjective Likert-scale ratings with objective multiple-choice comprehension and recall tests to measure PLS quality and reader understanding more accurately, while also exposing misalignment between automated evaluation metrics and human judgment.

Link: https://arxiv.org/abs/2505.10409
Authors: Yue Guo, Jae Ho Sohn, Gondy Leroy, Trevor Cohen
Affiliations: University of Illinois Urbana-Champaign; University of California, San Francisco; University of Arizona; University of Washington
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Plain language summaries (PLSs) are essential for facilitating effective communication between clinicians and patients by making complex medical information easier for laypeople to understand and act upon. Large language models (LLMs) have recently shown promise in automating PLS generation, but their effectiveness in supporting health information comprehension remains unclear. Prior evaluations have generally relied on automated scores that do not measure understandability directly, or subjective Likert-scale ratings from convenience samples with limited generalizability. To address these gaps, we conducted a large-scale crowdsourced evaluation of LLM-generated PLSs using Amazon Mechanical Turk with 150 participants. We assessed PLS quality through subjective Likert-scale ratings focusing on simplicity, informativeness, coherence, and faithfulness; and objective multiple-choice comprehension and recall measures of reader understanding. Additionally, we examined the alignment between 10 automated evaluation metrics and human judgments. Our findings indicate that while LLMs can generate PLSs that appear indistinguishable from human-written ones in subjective evaluations, human-written PLSs lead to significantly better comprehension. Furthermore, automated evaluation metrics fail to reflect human judgment, calling into question their suitability for evaluating PLSs. This is the first study to systematically evaluate LLM-generated PLSs based on both reader preferences and comprehension outcomes. Our findings highlight the need for evaluation frameworks that move beyond surface-level quality and for generation methods that explicitly optimize for layperson comprehension.

[NLP-15] Rethinking Repetition Problems of LLMs in Code Generation ACL2025

[Quick Read]: This paper tackles structural repetition in code generation, a problem more prevalent and challenging than content repetition: the repeated code appears in varied surface patterns but has a fixed structure that grammar can capture. The key is an efficient decoding method, RPG (Repetition Penalization based on Grammar), which uses grammar rules to identify repetition during generation and strategically decays the likelihood of the critical tokens that drive it, mitigating repetition and improving the quality of generated code.

Link: https://arxiv.org/abs/2505.10402
Authors: Yihong Dong, Yuchen Liu, Xue Jiang, Zhi Jin, Ge Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Accepted to ACL 2025 (main)

Abstract:With the advent of neural language models, the performance of code generation has been significantly boosted. However, the problem of repetitions during the generation process continues to linger. Previous work has primarily focused on content repetition, which is merely a fraction of the broader repetition problem in code generation. A more prevalent and challenging problem is structural repetition. In structural repetition, the repeated code appears in various patterns but possesses a fixed structure, which can be inherently reflected in grammar. In this paper, we formally define structural repetition and propose an efficient decoding approach called RPG, which stands for Repetition Penalization based on Grammar, to alleviate the repetition problems in code generation for LLMs. Specifically, RPG first leverages grammar rules to identify repetition problems during code generation, and then strategically decays the likelihood of critical tokens that contribute to repetitions, thereby mitigating them in code generation. To facilitate this study, we construct a new dataset CodeRepetEval to comprehensively evaluate approaches for mitigating the repetition problems in code generation. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on CodeRepetEval dataset as well as HumanEval and MBPP benchmarks, effectively reducing repetitions and enhancing the quality of generated code.
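To show the flavor of penalizing repetition-driving tokens at decode time, a toy sketch (an illustration only; a naive suffix check stands in for RPG's grammar-based detection, and all constants are assumptions):

    import math

    def penalize_repetition(logits, generated, window=32, min_run=2, decay=0.7):
        """If the tail of `generated` ends with a block repeated twice, decay
        the logit of the token that would extend the loop."""
        tail = generated[-window:]
        for run in range(len(tail) // 2, min_run - 1, -1):
            if tail[-run:] == tail[-2 * run:-run]:
                next_tok = tail[-run]  # first token of the repeating block
                logits[next_tok] += math.log(decay)  # decay its likelihood
                break
        return logits

    logits = [0.0] * 10
    print(penalize_repetition(logits, [1, 2, 3, 4, 3, 4]))  # token 3 penalized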

[NLP-16] Multi-domain Multilingual Sentiment Analysis in Industry: Predicting Aspect-based Opinion Quadruples

[Quick Read]: This paper addresses aspect-based sentiment analysis across domains and languages, specifically quadruple opinion extraction with large language models (LLMs): identifying aspect categories, sentiment polarity, targets, and opinion expressions from text. The key is a single fine-tuned model that handles multiple domain-specific taxonomies simultaneously, matching the performance of specialized single-domain models while reducing operational complexity.

Link: https://arxiv.org/abs/2505.10389
Authors: Benjamin White, Anastasia Shimorina
Affiliations: Orange Innovation
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper explores the design of an aspect-based sentiment analysis system using large language models (LLMs) for real-world use. We focus on quadruple opinion extraction – identifying aspect categories, sentiment polarity, targets, and opinion expressions from text data across different domains and languages. Using internal datasets, we investigate whether a single fine-tuned model can effectively handle multiple domain-specific taxonomies simultaneously. We demonstrate that a combined multi-domain model achieves performance comparable to specialized single-domain models while reducing operational complexity. We also share lessons learned for handling non-extractive predictions and evaluating various failure modes when developing LLM-based systems for structured prediction tasks.

[NLP-17] Coherent Language Reconstruction from Brain Recordings with Flexible Multi-Modal Input Stimuli

[Quick Read]: This paper addresses decoding human thought from brain activity, in particular the limitation of prior work to single-modality inputs (e.g., images or audio) even though thought is inherently multimodal. The key is a unified and flexible framework that uses visual-language models (VLMs) with modality-specific experts to jointly interpret brain recordings elicited by visual, auditory, and textual stimuli, reconstructing coherent language.

Link: https://arxiv.org/abs/2505.10356
Authors: Chunyu Ye, Shaonan Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Decoding thoughts from brain activity offers valuable insights into human cognition and enables promising applications in brain-computer interaction. While prior studies have explored language reconstruction from fMRI data, they are typically limited to single-modality inputs such as images or audio. In contrast, human thought is inherently multimodal. To bridge this gap, we propose a unified and flexible framework for reconstructing coherent language from brain recordings elicited by diverse input modalities-visual, auditory, and textual. Our approach leverages visual-language models (VLMs), using modality-specific experts to jointly interpret information across modalities. Experiments demonstrate that our method achieves performance comparable to state-of-the-art systems while remaining adaptable and extensible. This work advances toward more ecologically valid and generalizable mind decoding.

[NLP-18] LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations ACL2025

[Quick Read]: This paper addresses the trade-off between semantic quality and interpretability in text embeddings: existing embeddings (e.g., SimCSE and LLM2Vec) perform well but their dimension values are hard to trace and interpret, while classic bag-of-words is interpretable but weak. The proposed solution is LDIR: Low-dimensional (fewer than 500 dimensions) Dense and Interpretable text embeddings with Relative representations. Its key idea is that, with anchor texts chosen by farthest point sampling, each dimension's value reflects semantic relatedness to a different anchor text, giving dense, low-dimensional semantic representations that retain a degree of traceability and interpretability.

Link: https://arxiv.org/abs/2505.10354
Authors: Yile Wang, Zhanyu Shen, Hui Huang
Affiliations: Shenzhen University
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 Findings

Abstract:Semantic text representation is a fundamental task in the field of natural language processing. Existing text embedding (e.g., SimCSE and LLM2Vec) have demonstrated excellent performance, but the values of each dimension are difficult to trace and interpret. Bag-of-words, as classic sparse interpretable embeddings, suffers from poor performance. Recently, Benara et al. (2024) propose interpretable text embeddings using large language models, which forms “0/1” embeddings based on responses to a series of questions. These interpretable text embeddings are typically high-dimensional (larger than 10,000). In this work, we propose Low-dimensional (lower than 500) Dense and Interpretable text embeddings with Relative representations (LDIR). The numerical values of its dimensions indicate semantic relatedness to different anchor texts through farthest point sampling, offering both semantic representation as well as a certain level of traceability and interpretability. We validate LDIR on multiple semantic textual similarity, retrieval, and clustering tasks. Extensive experimental results show that LDIR performs close to the black-box baseline models and outperforms the interpretable embeddings baselines with much fewer dimensions. Code is available at this https URL.
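A rough sketch of the two ingredients (farthest point sampling over base embeddings, then relatedness-to-anchor features); cosine similarity as the relatedness measure and the random stand-in data are assumptions:

    import numpy as np

    def farthest_point_sampling(X, k, seed=0):
        """Greedily pick k anchor indices from base embeddings X [n, d]."""
        rng = np.random.default_rng(seed)
        chosen = [int(rng.integers(len(X)))]
        dist = np.linalg.norm(X - X[chosen[0]], axis=1)
        for _ in range(k - 1):
            nxt = int(dist.argmax())
            chosen.append(nxt)
            dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
        return np.array(chosen)

    def ldir_embed(X, anchor_idx):
        """Each output dimension is the cosine relatedness to one anchor text."""
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        return Xn @ Xn[anchor_idx].T  # [n, k]: low-dimensional, traceable dims

    X = np.random.default_rng(1).normal(size=(1000, 768))  # stand-in embeddings
    Z = ldir_embed(X, farthest_point_sampling(X, k=256))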

[NLP-19] J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

[Quick Read]: This paper targets the evaluation-quality bottleneck in AI development, in particular how to improve the judgment ability of LLM-as-a-Judge models. The key is J1, a reinforcement-learning recipe that converts both verifiable and non-verifiable prompts into judgment tasks with verifiable rewards, incentivizing thinking and mitigating judgment bias. J1's strength comes from learning to outline evaluation criteria, compare against self-generated reference answers, and re-evaluate the correctness of model responses, yielding more accurate judgments.

Link: https://arxiv.org/abs/2505.10320
Authors: Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, Swarnadeep Saha
Affiliations: Meta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 8 tables, 11 figures

Abstract:The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought reasoning, motivating the need to find the best recipes for training such models to think. In this work we introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. In particular, our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model. We provide analysis and ablations comparing Pairwise-J1 vs Pointwise-J1 models, offline vs online training recipes, reward strategies, seed prompts, and variations in thought length and content. We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.
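The "verifiable reward" for a pairwise judgment task reduces to checking the judge's final verdict against the response known to be better. A toy sketch (the [[A]]/[[B]] verdict format is an assumption, not necessarily the paper's):

    import re

    def extract_verdict(judge_output: str):
        """Pull the final verdict out of the judge's free-form reasoning,
        assuming it ends with a [[A]] or [[B]] tag."""
        match = re.search(r"\[\[(A|B)\]\]", judge_output)
        return match.group(1) if match else None

    def judgment_reward(judge_output: str, gold_preferred: str) -> float:
        """1.0 if the judge picked the known-better response, else 0.0. The
        chain of thought itself is unconstrained; only the verdict is scored."""
        return 1.0 if extract_verdict(judge_output) == gold_preferred else 0.0

    print(judgment_reward("B is more faithful to the source ... [[B]]", "B"))  # 1.0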

[NLP-20] StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

[Quick Read]: This paper addresses visual storytelling systems' difficulty in keeping character identity consistent across frames and attaching actions to the right subjects, which often causes referential hallucinations. The key is grounding characters, objects, and other entities in the visual elements, concretely via cross-frame object re-identification (using visual similarity and face recognition), chain-of-thought reasoning to explicitly model narrative relationships, and a grounding scheme linking textual elements to visual entities across multiple frames.

Link: https://arxiv.org/abs/2505.10292
Authors: Daniel A. P. Oliveira, David Martins de Matos
Affiliations: Cranberry-Lemon University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 31 pages, 14 figures

Abstract:Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed through grounding of characters, objects, and other entities on the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction from 4.06 to 3.56 (-12.3%) hallucinations on average per story when compared to a non-fine-tuned model.

[NLP-21] From Questions to Clinical Recommendations: Large Language Models Driving Evidence-Based Clinical Decision Making

[Quick Read]: This paper addresses the challenge of integrating evidence-based medicine into real-time clinical decision-making, given heavy workloads, complex professional processes, and time constraints. The key is Quicker, an LLM-powered evidence-based clinical decision support system that automates evidence synthesis and generates clinical recommendations following standard guideline development processes, implementing a fully automated chain from questions to recommendations and enabling customized decision support through integrated tools and an interactive user interface.

Link: https://arxiv.org/abs/2505.10282
Authors: Dubai Li, Nan Jiang, Kangping Huang, Ruiqi Tu, Shuyu Ouyang, Huayu Yu, Lin Qiao, Chen Yu, Tianshu Zhou, Danyang Tong, Qian Wang, Mengtao Li, Xiaofeng Zeng, Yu Tian, Xinping Tian, Jingsong Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Clinical evidence, derived from rigorous research and data analysis, provides healthcare professionals with reliable scientific foundations for informed decision-making. Integrating clinical evidence into real-time practice is challenging due to the enormous workload, complex professional processes, and time constraints. This highlights the need for tools that automate evidence synthesis to support more efficient and accurate decision making in clinical settings. This study introduces Quicker, an evidence-based clinical decision support system powered by large language models (LLMs), designed to automate evidence synthesis and generate clinical recommendations modeled after standard clinical guideline development processes. Quicker implements a fully automated chain that covers all phases, from questions to clinical recommendations, and further enables customized decision-making through integrated tools and interactive user interfaces. To evaluate Quicker’s capabilities, we developed the Q2CRBench-3 benchmark dataset, based on clinical guideline development records for three different diseases. Experimental results highlighted Quicker’s strong performance, with fine-grained question decomposition tailored to user preferences, retrieval sensitivities comparable to human experts, and literature screening performance approaching comprehensive inclusion of relevant studies. In addition, Quicker-assisted evidence assessment effectively supported human reviewers, while Quicker’s recommendations were more comprehensive and logically coherent than those of clinicians. In system-level testing, collaboration between a single reviewer and Quicker reduced the time required for recommendation development to 20-40 minutes. In general, our findings affirm the potential of Quicker to help physicians make quicker and more reliable evidence-based clinical decisions.

[NLP-22] he Evolving Landscape of Generative Large Language Models and Traditional Natural Language Processing in Medicine

[Quick Read]: This paper examines the under-explored performance differences between generative large language models (LLMs) and traditional natural language processing (NLP) across medical tasks. The key is an analysis of 19,123 studies, which shows that generative LLMs hold the advantage on open-ended tasks while traditional NLP dominates information extraction and analysis, providing a basis for the appropriate use of both in healthcare applications.

Link: https://arxiv.org/abs/2505.10261
Authors: Rui Yang, Huitao Li, Matthew Yu Heng Wong, Yuhe Ke, Xin Li, Kunyu Yu, Jingchi Liao, Jonathan Chong Kai Liew, Sabarinath Vinod Nair, Jasmine Chiat Ling Ong, Irene Li, Douglas Teodoro, Chuan Hong, Daniel Shu Wei Ting, Nan Liu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Natural language processing (NLP) has been traditionally applied to medicine, and generative large language models (LLMs) have become prominent recently. However, the differences between them across different medical tasks remain underexplored. We analyzed 19,123 studies, finding that generative LLMs demonstrate advantages in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks. As these technologies advance, ethical use of them is essential to ensure their potential in medical applications.

[NLP-23] Comparing LLM Text Annotation Skills: A Study on Human Rights Violations in Social Media Data

[Quick Read]: This paper studies zero-shot and few-shot text annotation with large language models (LLMs) in a multilingual setting, specifically the binary classification of references to human rights violations in Russian- and Ukrainian-language social media posts. The key is to evaluate several state-of-the-art LLMs (GPT-3.5, GPT-4, LLAMA3, Mistral 7B, and Claude-2) under different prompting conditions against a gold standard of human double-annotated labels, analyzing cross-lingual adaptability, error patterns, and the handling of subjective judgments.

Link: https://arxiv.org/abs/2505.10260
Authors: Poli Apollinaire Nemkova, Solomon Ubani, Mark V. Albert
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the era of increasingly sophisticated natural language processing (NLP) systems, large language models (LLMs) have demonstrated remarkable potential for diverse applications, including tasks requiring nuanced textual understanding and contextual reasoning. This study investigates the capabilities of multiple state-of-the-art LLMs - GPT-3.5, GPT-4, LLAMA3, Mistral 7B, and Claude-2 - for zero-shot and few-shot annotation of a complex textual dataset comprising social media posts in Russian and Ukrainian. Specifically, the focus is on the binary classification task of identifying references to human rights violations within the dataset. To evaluate the effectiveness of these models, their annotations are compared against a gold standard set of human double-annotated labels across 1000 samples. The analysis includes assessing annotation performance under different prompting conditions, with prompts provided in both English and Russian. Additionally, the study explores the unique patterns of errors and disagreements exhibited by each model, offering insights into their strengths, limitations, and cross-linguistic adaptability. By juxtaposing LLM outputs with human annotations, this research contributes to understanding the reliability and applicability of LLMs for sensitive, domain-specific tasks in multilingual contexts. It also sheds light on how language models handle inherently subjective and context-dependent judgments, a critical consideration for their deployment in real-world scenarios.

[NLP-24] On the Interplay of Human-AI Alignment, Fairness and Performance Trade-offs in Medical Imaging

[Quick Read]: This paper addresses the biases of deep neural networks in medical imaging, which create fairness gaps across demographic groups. The key is Human-AI alignment: incorporating human insights systematically reduces fairness gaps and improves out-of-domain generalization, while the degree of alignment must be calibrated to avoid performance trade-offs.

Link: https://arxiv.org/abs/2505.10231
Authors: Haozhe Luo, Ziyu Zhou, Zixin Shu, Aurélie Pahud de Mortanges, Robert Berke, Mauricio Reyes
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Deep neural networks excel in medical imaging but remain prone to biases, leading to fairness gaps across demographic groups. We provide the first systematic exploration of Human-AI alignment and fairness in this domain. Our results show that incorporating human insights consistently reduces fairness gaps and enhances out-of-domain generalization, though excessive alignment can introduce performance trade-offs, emphasizing the need for calibrated strategies. These findings highlight Human-AI alignment as a promising approach for developing fair, robust, and generalizable medical AI systems, striking a balance between expert guidance and automated efficiency. Our code is available at this https URL.

[NLP-25] ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention

[Quick Read]: This paper addresses the challenge Transformers face in integrating positional information with the flexibility of multi-head attention (MHA): prior methods either model semantic and positional differences separately or apply uniform positional adjustments across all heads, which can limit representational capacity. The key is ComplexFormer, whose core is Complex Multi-Head Attention (CMHA), letting each head independently model semantic and positional differences unified within the complex plane, with interactions represented as rotations and scaling.

Link: https://arxiv.org/abs/2505.10222
Authors: Jintian Shao, Hongyi Huang, Jiayi Wu, Beiwen Zhang, ZhiYu Wu, You Shan, MingKai Zheng
Affiliations: Southern University of Science and Technology; Fudan University; Sun Yat-sen University; SenseTime Research
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Transformer models rely on self-attention to capture token dependencies but face challenges in effectively integrating positional information while allowing multi-head attention (MHA) flexibility. Prior methods often model semantic and positional differences disparately or apply uniform positional adjustments across heads, potentially limiting representational capacity. This paper introduces ComplexFormer, featuring Complex Multi-Head Attention (CMHA). CMHA empowers each head to independently model semantic and positional differences unified within the complex plane, representing interactions as rotations and scaling. ComplexFormer incorporates two key improvements: (1) a per-head Euler transformation, converting real-valued query/key projections into polar-form complex vectors for head-specific complex subspace operation; and (2) a per-head adaptive differential rotation mechanism, exp[i(Adapt(AS_{mn,i}) + Δ(P_{mn,i}))], allowing each head to learn distinct strategies for integrating semantic angle differences (AS_{mn,i}) with relative positional encodings (Δ(P_{mn,i})). Extensive experiments on language modeling, text generation, code generation, and mathematical reasoning show ComplexFormer achieves superior performance, significantly lower generation perplexity, and improved long-context coherence compared to strong baselines like RoPE-Transformers. ComplexFormer demonstrates strong parameter efficiency, offering a more expressive, adaptable attention mechanism.
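A minimal sketch of attention scores computed in the complex plane (an illustration only; the paper's adaptive, position-dependent rotation is simplified here to a single learned phase per head, and the magnitude/phase split of the projections is an assumption):

    import torch

    def cmha_scores(q, k, head_phase):
        """q, k: [batch, heads, seq, 2*d]; halves are read as magnitude and
        phase (an Euler transform r*exp(i*theta)). `head_phase` is a learnable
        per-head rotation applied to the complex scores."""
        d = q.shape[-1] // 2
        qc = torch.polar(q[..., :d].abs(), q[..., d:])
        kc = torch.polar(k[..., :d].abs(), k[..., d:])
        scores = torch.einsum("bhqd,bhkd->bhqk", qc, kc.conj())
        scores = scores * torch.exp(1j * head_phase)[None, :, None, None]
        return scores.real / d ** 0.5  # rotation shifts phase; real part scores

    q = torch.randn(1, 4, 6, 32)
    k = torch.randn(1, 4, 6, 32)
    print(cmha_scores(q, k, torch.zeros(4)).shape)  # torch.Size([1, 4, 6, 6])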

[NLP-26] RAIDEN-R1: Improving Role-awareness of LLMs via GRPO with Verifiable Reward

[Quick Read]: This paper targets the persistent difficulty role-playing conversational agents (RPCAs) have in maintaining role consistency. The key is RAIDEN-R1, a reinforcement learning framework that integrates a Verifiable Role-Awareness Reward (VRAR): singular and multi-term mining strategies generate quantifiable rewards by assessing role-specific keys, thereby improving role consistency. The work also builds a high-quality role-aware chain-of-thought dataset through multi-LLM collaboration and runs experiments to strengthen reasoning coherence.

Link: https://arxiv.org/abs/2505.10218
Authors: Zongsheng Wang, Kaili Sun, Bowen Wu, Qun Yu, Ying Li, Baoxun Wang
Affiliations: Tencent; Peking University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Role-playing conversational agents (RPCAs) face persistent challenges in maintaining role consistency. To address this, we propose RAIDEN-R1, a novel reinforcement learning framework that integrates Verifiable Role-Awareness Reward (VRAR). The method introduces both singular and multi-term mining strategies to generate quantifiable rewards by assessing role-specific keys. Additionally, we construct a high-quality, role-aware Chain-of-Thought dataset through multi-LLM collaboration, and implement experiments to enhance reasoning coherence. Experiments on the RAIDEN benchmark demonstrate RAIDEN-R1’s superiority: our 14B-GRPO model achieves 88.04% and 88.65% accuracy on Script-Based Knowledge and Conversation Memory metrics, respectively, outperforming baseline models while maintaining robustness. Case analyses further reveal the model’s enhanced ability to resolve conflicting contextual cues and sustain first-person narrative consistency. This work bridges the non-quantifiability gap in RPCA training and provides insights into role-aware reasoning patterns, advancing the development of RPCAs.

[NLP-27] VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits

[Quick Read]: This paper addresses the computational and memory burden that very large output vocabularies impose on large language model (LLM) inference, especially the parameter count and cost of the final linear projection layer. The key is VQ-Logits, which uses Vector Quantization (VQ) to replace the huge vocabulary embedding matrix with a small shared codebook: every vocabulary token is mapped to one of the codebook vectors, drastically reducing parameters and computation.

Link: https://arxiv.org/abs/2505.10202
Authors: Jintian Shao, Hongyi Huang, Jiayi Wu, YiMing Cheng, ZhiYu Wu, You Shan, MingKai Zheng
Affiliations: Southern University of Science and Technology; Fudan University; Tsinghua University; SenseTime Research
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have achieved remarkable success but face significant computational and memory challenges, particularly due to their extensive output vocabularies. The final linear projection layer, mapping hidden states to vocabulary-sized logits, often constitutes a substantial portion of the model’s parameters and computational cost during inference. Existing methods like adaptive softmax or hierarchical softmax introduce structural complexities. In this paper, we propose VQ-Logits, a novel approach that leverages Vector Quantization (VQ) to drastically reduce the parameter count and computational load of the LLM output layer. VQ-Logits replaces the large V × d_model output embedding matrix with a small, shared codebook of K embedding vectors (K ≪ V). Each token in the vocabulary is mapped to one of these K codebook vectors. The LLM predicts logits over this compact codebook, which are then efficiently “scattered” to the full vocabulary space using the learned or preassigned mapping. We demonstrate through extensive experiments on standard language modeling benchmarks (e.g., WikiText-103, C4) that VQ-Logits can achieve up to 99% parameter reduction in the output layer and 6x speedup in logit computation, with only a marginal 4% increase in perplexity compared to full softmax baselines. We further provide detailed ablation studies on codebook size, initialization, and learning strategies, showcasing the robustness and effectiveness of our approach.
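The predict-then-scatter step is compact enough to sketch (an illustration under assumptions; the token-to-code mapping is taken as given, here random):

    import torch
    import torch.nn as nn

    class VQLogitsHead(nn.Module):
        """Predict logits over a small codebook of K vectors, then scatter
        them to the full vocabulary via a token->code index mapping."""
        def __init__(self, d_model, codebook_size, token_to_code):
            super().__init__()
            self.codebook = nn.Linear(d_model, codebook_size, bias=False)  # K << V
            self.register_buffer("token_to_code", token_to_code)  # [V], long

        def forward(self, hidden):  # hidden: [B, L, d_model]
            code_logits = self.codebook(hidden)  # [B, L, K]
            # gather: each vocabulary slot takes the logit of its code
            return code_logits[..., self.token_to_code]  # [B, L, V]

    vocab, K, d = 50_000, 512, 1024
    head = VQLogitsHead(d, K, torch.randint(0, K, (vocab,)))
    print(head(torch.randn(2, 3, d)).shape)  # torch.Size([2, 3, 50000])

Note that tokens assigned to the same code tie in the raw logits, which is why the quality of the learned or preassigned mapping matters.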

[NLP-28] he CoT Encyclopedia: Analyzing Predicting and Controlling how a Reasoning Model will Think

[Quick Read]: This paper addresses the difficulty of analyzing and interpreting long chain-of-thought (CoT) reasoning strategies in modern large language models: current approaches are bounded by human intuition and cannot capture the full diversity of model behavior. The key is the bottom-up CoT Encyclopedia framework, which automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them in a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior, enabling more interpretable and comprehensive analysis.

Link: https://arxiv.org/abs/2505.10185
Authors: Seongyun Lee, Seungone Kim, Minju Seo, Yongrae Jo, Dongyoung Go, Hyeonbin Hwang, Jinho Park, Xiang Yue, Sean Welleck, Graham Neubig, Moontae Lee, Minjoon Seo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we demonstrate that this understanding enables performance gains: we can predict which strategy a model is likely to use and guide it toward more effective alternatives. Finally, we provide practical insights, such as that training data format (e.g., free-form vs. multiple-choice) has a far greater impact on reasoning behavior than data domain, underscoring the importance of format-aware model design.
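The embed-and-cluster step can be sketched with off-the-shelf tools (a toy illustration: the paper embeds criteria in a semantic space, for which TF-IDF is a crude stand-in, and the criteria strings below are invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Toy strings standing in for reasoning criteria extracted from CoTs.
    criteria = [
        "decomposes the problem into smaller subgoals",
        "breaks the task into intermediate steps",
        "verifies each intermediate result before continuing",
        "double-checks the computed answer against the question",
    ]

    X = TfidfVectorizer().fit_transform(criteria)  # embed the criteria
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # cluster assignment for each of the four criteria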

[NLP-29] Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning

【Quick Read】: This paper tackles the limited breadth and scalability of training data that arises because supervised fine-tuning and reinforcement learning for reasoning models are restricted to specific domains such as mathematics and programming. The key to its solution is Reasoning CPT, a continual-pretraining method based on synthetic data that reconstructs the hidden thought processes behind texts to generate reasoning-oriented training data, improving reasoning ability across domains.

Link: https://arxiv.org/abs/2505.10182
Authors: Yoichi Ishibashi,Taro Yano,Masafumi Oyamada
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Large Language Models (LLMs) have demonstrated significant improvements in reasoning capabilities through supervised fine-tuning and reinforcement learning. However, when training reasoning models, these approaches are primarily applicable to specific domains such as mathematics and programming, which imposes fundamental constraints on the breadth and scalability of training data. In contrast, continual pretraining (CPT) offers the advantage of not requiring task-specific signals. Nevertheless, how to effectively synthesize training data for reasoning and how such data affect a wide range of domains remain largely unexplored. This study provides a detailed evaluation of Reasoning CPT, a form of CPT that uses synthetic data to reconstruct the hidden thought processes underlying texts, based on the premise that texts are the result of the author’s thinking process. Specifically, we apply Reasoning CPT to Gemma2-9B using synthetic data with hidden thoughts derived from STEM and Law corpora, and compare it to standard CPT on the MMLU benchmark. Our analysis reveals that Reasoning CPT consistently improves performance across all evaluated domains. Notably, reasoning skills acquired in one domain transfer effectively to others; the performance gap with conventional methods widens as problem difficulty increases, with gains of up to 8 points on the most challenging problems. Furthermore, models trained with hidden thoughts learn to adjust the depth of their reasoning according to problem difficulty.

[NLP-30] GE-Chat: A Graph Enhanced RAG Framework for Evidential Response Generation of LLMs IJCAI2025

【Quick Read】: This paper addresses the hallucination problem of Large Language Models (LLMs): generated content may contradict the facts while appearing persuasive, undermining user trust. The key to its solution is the GE-Chat framework, a knowledge-graph-enhanced retrieval-augmented generation mechanism that combines Chain-of-Thought (CoT) logic generation, multi-hop subgraph search, and entailment-based sentence generation to achieve accurate evidence retrieval and improve the reliability and verifiability of model outputs.

Link: https://arxiv.org/abs/2505.10143
Authors: Longchao Da,Parth Mitesh Shah,Kuan-Ru Liou,Jiaxing Zhang,Hua Wei
Affiliations: Arizona State University; New Jersey Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 4 figures, accepted to IJCAI2025 demo track

View abstract

Abstract:Large Language Models are now key assistants in human decision-making processes. However, a common note always seems to follow: “LLMs can make mistakes. Be careful with important info.” This points to the reality that not all outputs from LLMs are dependable, and users must evaluate them manually. The challenge deepens as hallucinated responses, often presented with seemingly plausible explanations, create complications and raise trust issues among users. To tackle such issue, this paper proposes GE-Chat, a knowledge Graph enhanced retrieval-augmented generation framework to provide Evidence-based response generation. Specifically, when the user uploads a material document, a knowledge graph will be created, which helps construct a retrieval-augmented agent, enhancing the agent’s responses with additional knowledge beyond its training corpus. Then we leverage Chain-of-Thought (CoT) logic generation, n-hop sub-graph searching, and entailment-based sentence generation to realize accurate evidence retrieval. We demonstrate that our method improves the existing models’ performance in terms of identifying the exact evidence in a free-form context, providing a reliable way to examine the resources of LLM’s conclusion and help with the judgment of the trustworthiness.
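
The n-hop sub-graph search at the core of the evidence-retrieval step can be illustrated with networkx. The triples and hop count below are made up for the example; the sketch shows the general mechanism only, not GE-Chat's actual implementation.

```python
import networkx as nx

# Toy knowledge graph built from (subject, relation, object) triples
# extracted from an uploaded document (triples are illustrative).
triples = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "is_a", "anticoagulant"),
    ("headache", "symptom_of", "migraine"),
]
graph = nx.DiGraph()
for subj, rel, obj in triples:
    graph.add_edge(subj, obj, relation=rel)

def n_hop_subgraph(g: nx.DiGraph, seed: str, hops: int) -> nx.DiGraph:
    """Return the subgraph reachable from `seed` within `hops` steps,
    ignoring edge direction, as candidate evidence for the answer."""
    nodes = nx.single_source_shortest_path_length(
        g.to_undirected(as_view=True), seed, cutoff=hops
    )
    return g.subgraph(nodes)

evidence = n_hop_subgraph(graph, "aspirin", hops=2)
for u, v, data in evidence.edges(data=True):
    print(u, data["relation"], v)
```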

[NLP-31] Why 1+1<2 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

【Quick Read】: This paper addresses the inconsistent performance of existing visual token pruning methods, which fail to account for how the relative importance of prompt alignment and visual fidelity varies across tasks. The key to its solution is the first closed-form error bound for visual token pruning, derived from the Hausdorff distance, which uniformly characterizes the contributions of both objectives; ε-covering theory then reveals the intrinsic trade-off between them and quantifies the optimal attainment levels under a fixed budget. Building on this, the paper proposes Multi-Objective Balanced Covering (MoB), which recasts visual token pruning as a bi-objective covering problem and performs budget allocation via greedy radius trading, achieving provable performance bounds with linear scalability.

Link: https://arxiv.org/abs/2505.10118
Authors: Yangfu Li,Hongjian Zhan,Tianyi Chen,Qi Liu,Yue Lu
Affiliations: Shanghai Key Laboratory of Multidimensional Information Processing; School of Communications and Electronic Engineering, East China Normal University; School of Mathematical Sciences, Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 31 pages, 9 figures, conference

View abstract

Abstract:Existing visual token pruning methods target prompt alignment and visual preservation with static strategies, overlooking the varying relative importance of these objectives across tasks, which leads to inconsistent performance. To address this, we derive the first closed-form error bound for visual token pruning based on the Hausdorff distance, uniformly characterizing the contributions of both objectives. Moreover, leveraging ε-covering theory, we reveal an intrinsic trade-off between these objectives and quantify their optimal attainment levels under a fixed budget. To practically handle this trade-off, we propose Multi-Objective Balanced Covering (MoB), which reformulates visual token pruning as a bi-objective covering problem. In this framework, the attainment trade-off reduces to budget allocation via greedy radius trading. MoB offers a provable performance bound and linear scalability with respect to the number of input visual tokens, enabling adaptation to challenging pruning scenarios. Extensive experiments show that MoB preserves 96.4% of performance for LLaVA-1.5-7B using only 11.1% of the original visual tokens and accelerates LLaVA-Next-7B by 1.3-1.5× with negligible performance loss. Additionally, evaluations on Qwen2-VL and Video-LLaVA confirm that MoB integrates seamlessly into advanced MLLMs and diverse vision-language tasks.
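
A toy rendering of the bi-objective idea: score tokens by prompt alignment, approximate visual coverage with greedy farthest-point distances, and trade the two off under a fixed budget. The mixing weight and scoring details below are illustrative assumptions and are far simpler than MoB's radius-trading scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, budget = 200, 64, 20
tokens = rng.normal(size=(N, d))   # visual token features
prompt = rng.normal(size=d)        # pooled prompt embedding

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

# Objective 1 (prompt alignment): keep tokens similar to the prompt.
align = np.array([cosine(t, prompt) for t in tokens])

# Objective 2 (visual preservation): keep tokens that cover the token set
# well, approximated here by greedy farthest-point selection.
selected = [int(np.argmax(align))]
for _ in range(budget - 1):
    dists = np.min(
        [np.linalg.norm(tokens - tokens[s], axis=1) for s in selected], axis=0
    )
    # Trade off the two objectives with a fixed mixing weight (illustrative).
    score = 0.5 * align + 0.5 * dists / dists.max()
    score[selected] = -np.inf
    selected.append(int(np.argmax(score)))

print(sorted(selected))
```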

[NLP-32] Learning Virtual Machine Scheduling in Cloud Computing through Language Agents

【Quick Read】: This paper targets virtual machine (VM) scheduling in cloud services, an Online Dynamic Multidimensional Bin Packing (ODMBP) problem marked by large-scale complexity and fluctuating demand. Traditional optimization struggles to adapt to real-time changes, expert-designed heuristics are rigid, and existing learning-based methods often lack generalizability and interpretability. The key to the proposed solution is MiCo, a hierarchical language-agent framework offering an LLM-driven heuristic design paradigm: ODMBP is formulated as a Semi-Markov Decision Process with Options (SMDP-Option) and solved through a two-stage architecture, an Option Miner and an Option Composer, enabling dynamic scheduling and strong performance in large-scale cloud environments.

Link: https://arxiv.org/abs/2505.10117
Authors: JieHao Wu,Ziwei Wang,Junjie Sheng,Wenhao Li,Xiangfei Wang,Jun Luo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

View abstract

Abstract:In cloud services, virtual machine (VM) scheduling is a typical Online Dynamic Multidimensional Bin Packing (ODMBP) problem, characterized by large-scale complexity and fluctuating demands. Traditional optimization methods struggle to adapt to real-time changes, domain-expert-designed heuristic approaches suffer from rigid strategies, and existing learning-based methods often lack generalizability and interpretability. To address these limitations, this paper proposes a hierarchical language agent framework named MiCo, which provides a large language model (LLM)-driven heuristic design paradigm for solving ODMBP. Specifically, ODMBP is formulated as a Semi-Markov Decision Process with Options (SMDP-Option), enabling dynamic scheduling through a two-stage architecture, i.e., Option Miner and Option Composer. Option Miner utilizes LLMs to discover diverse and useful non-context-aware strategies by interacting with constructed environments. Option Composer employs LLMs to discover a composing strategy that integrates the non-context-aware strategies with the contextual ones. Extensive experiments on real-world enterprise datasets demonstrate that MiCo achieves a 96.9% competitive ratio in large-scale scenarios involving more than 10,000 virtual machines. It maintains high performance even under nonstationary request flows and diverse configurations, thus validating its effectiveness in complex and large-scale cloud environments.
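
One kind of non-context-aware strategy an Option Miner might surface is a classic best-fit heuristic for multidimensional bin packing. The sketch below, with made-up CPU/memory capacities, shows what such a mined option could look like; it is not MiCo's code.

```python
from typing import List, Optional, Tuple

Vector = Tuple[float, float]  # (cpu, memory) demand or capacity

def best_fit(hosts: List[List[Vector]], capacity: Vector,
             vm: Vector) -> Optional[int]:
    """Place `vm` on the feasible host that leaves the least slack:
    the kind of non-context-aware option a miner might discover."""
    best, best_slack = None, float("inf")
    for i, placed in enumerate(hosts):
        used = [sum(dim) for dim in zip(*placed)] if placed else [0.0, 0.0]
        free = [c - u for c, u in zip(capacity, used)]
        if all(f >= d for f, d in zip(free, vm)):
            slack = sum(f - d for f, d in zip(free, vm))
            if slack < best_slack:
                best, best_slack = i, slack
    return best

hosts: List[List[Vector]] = [[], [], []]
for vm in [(4.0, 8.0), (2.0, 4.0), (8.0, 16.0), (1.0, 2.0)]:
    idx = best_fit(hosts, capacity=(16.0, 32.0), vm=vm)
    if idx is not None:
        hosts[idx].append(vm)
print(hosts)  # in this toy run every VM fits onto the first host
```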

[NLP-33] What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs

【Quick Read】: This paper examines how fine-tuning data affects large language models on medical question answering (Medical QA), in particular whether knowledge injection is effective in this knowledge-intensive setting. The key to its solution is the S-MedQA dataset, with which the authors show that fine-tuning on a given clinical specialty does not necessarily yield the best performance on that specialty, and that token probabilities of clinically relevant terms increase consistently across all specialties regardless of which specialty is fine-tuned on. This suggests the gains come mainly from domain shifting (e.g., general to medical) rather than knowledge injection, prompting a rethink of the role of fine-tuning data in the medical domain.

Link: https://arxiv.org/abs/2505.10113
Authors: Xinlan Yan,Di Wu,Yibin Lei,Christof Monz,Iacer Calixto
Affiliations: Amsterdam UMC; University of Amsterdam
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:In this paper, we introduce S-MedQA, an English medical question-answering (QA) dataset for benchmarking large language models in fine-grained clinical specialties. We use S-MedQA to check the applicability of a popular hypothesis related to knowledge injection in the knowledge-intense scenario of medical QA, and show that: 1) training on data from a speciality does not necessarily lead to best performance on that specialty and 2) regardless of the specialty fine-tuned on, token probabilities of clinically relevant terms for all specialties increase consistently. Thus, we believe improvement gains come mostly from domain shifting (e.g., general to medical) rather than knowledge injection and suggest rethinking the role of fine-tuning data in the medical domain. We release S-MedQA and all code needed to reproduce all our experiments to the research community.

[NLP-34] From Text to Network: Constructing a Knowledge Graph of Taiwan-Based China Studies Using Generative AI

【Quick Read】: This paper addresses the difficulty of systematically organizing and efficiently using the accumulated scholarship of Taiwanese China Studies (CS), using AI to turn large volumes of unstructured academic text into a domain-specific knowledge graph and vector database. The key to the solution is applying generative AI (GAI) techniques and large language models (LLMs) to extract and standardize entity-relation triples from 1,367 peer-reviewed articles published between 1996 and 2019, and visualizing them through a lightweight system that supports network-based navigation over conceptual nodes and semantic relationships.

Link: https://arxiv.org/abs/2505.10093
Authors: Hsuan-Lei Shao
Affiliations: Taipei Medical University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 4 pages, 4 figures

View abstract

Abstract:Taiwanese China Studies (CS) has developed into a rich, interdisciplinary research field shaped by the unique geopolitical position and long standing academic engagement with Mainland China. This study responds to the growing need to systematically revisit and reorganize decades of Taiwan based CS scholarship by proposing an AI assisted approach that transforms unstructured academic texts into structured, interactive knowledge representations. We apply generative AI (GAI) techniques and large language models (LLMs) to extract and standardize entity relation triples from 1,367 peer reviewed CS articles published between 1996 and 2019. These triples are then visualized through a lightweight this http URL based system, forming the foundation of a domain specific knowledge graph and vector database for the field. This infrastructure allows users to explore conceptual nodes and semantic relationships across the corpus, revealing previously uncharted intellectual trajectories, thematic clusters, and research gaps. By decomposing textual content into graph structured knowledge units, our system enables a paradigm shift from linear text consumption to network based knowledge navigation. In doing so, it enhances scholarly access to CS literature while offering a scalable, data driven alternative to traditional ontology construction. This work not only demonstrates how generative AI can augment area studies and digital humanities but also highlights its potential to support a reimagined scholarly infrastructure for regional knowledge systems.

[NLP-35] XRAG: Cross-lingual Retrieval-Augmented Generation

【Quick Read】: This paper addresses the evaluation of LLM generation in cross-lingual Retrieval-Augmented Generation (RAG) settings, where the user's language does not match the language of the retrieval results. The key to the solution is the XRAG benchmark, built from recent news articles so that questions require external knowledge to answer, covering realistic monolingual and multilingual retrieval scenarios and providing relevance annotations for each retrieved document. Using this dataset, the study uncovers two new challenges in cross-lingual RAG: response-language correctness in the monolingual retrieval setting, and the difficulty of reasoning over cross-lingual information in the multilingual retrieval setting.

Link: https://arxiv.org/abs/2505.10089
Authors: Wei Liu,Sony Trenous,Leonardo F. R. Ribeiro,Bill Byrne,Felix Hieber
Affiliations: Heidelberg Institute for Theoretical Studies gGmbH; Amazon AGI
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:We propose XRAG, a novel benchmark designed to evaluate the generation abilities of LLMs in cross-lingual Retrieval-Augmented Generation (RAG) settings where the user language does not match the retrieval results. XRAG is constructed from recent news articles to ensure that its questions require external knowledge to be answered. It covers the real-world scenarios of monolingual and multilingual retrieval, and provides relevancy annotations for each retrieved document. Our novel dataset construction pipeline results in questions that require complex reasoning, as evidenced by the significant gap between human and LLM performance. Consequently, XRAG serves as a valuable benchmark for studying LLM reasoning abilities, even before considering the additional cross-lingual complexity. Experimental results on five LLMs uncover two previously unreported challenges in cross-lingual RAG: 1) in the monolingual retrieval setting, all evaluated models struggle with response language correctness; 2) in the multilingual retrieval setting, the main challenge lies in reasoning over retrieved information across languages rather than generation of non-English text.

[NLP-36] Designing and Contextualising Probes for African Languages

【Quick Read】: This paper investigates how pretrained language models (PLMs) represent linguistic knowledge for African languages and how to assess the knowledge they internally encode about target languages. The key to the solution is a systematic analysis of six typologically diverse African languages using layer-wise probes, together with control tasks designed to interpret probe performance and distinguish the model's internal knowledge from probe memorization. The results show that PLMs adapted for African languages encode more linguistic information about target languages than massively multilingual PLMs, that syntactic information concentrates in middle-to-late layers, and that semantic information is distributed across all layers.

Link: https://arxiv.org/abs/2505.10081
Authors: Wisdom Aduah,Francois Meyer
Affiliations: African Institute for Mathematical Sciences; University of Cape Town
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Pretrained language models (PLMs) for African languages are continually improving, but the reasons behind these advances remain unclear. This paper presents the first systematic investigation into probing PLMs for linguistic knowledge about African languages. We train layer-wise probes for six typologically diverse African languages to analyse how linguistic features are distributed. We also design control tasks, a way to interpret probe performance, for the MasakhaPOS dataset. We find PLMs adapted for African languages to encode more linguistic information about target languages than massively multilingual PLMs. Our results reaffirm previous findings that token-level syntactic information concentrates in middle-to-last layers, while sentence-level semantic information is distributed across all layers. Through control tasks and probing baselines, we confirm that performance reflects the internal knowledge of PLMs rather than probe memorisation. Our study applies established interpretability techniques to African-language PLMs. In doing so, we highlight the internal mechanisms underlying the success of strategies like active learning and multilingual adaptation.

[NLP-37] Dark LLMs: The Growing Threat of Unaligned AI Models

【Quick Read】: This paper addresses vulnerabilities in the safety controls of Large Language Models (LLMs), in particular their susceptibility to jailbreak attacks. The study notes that if LLM training data contains unfiltered harmful or "dark" content, models can learn undesirable patterns or weaknesses that let users bypass their safety mechanisms. The key finding is a universal jailbreak attack that effectively compromises multiple state-of-the-art LLMs, enabling them to answer almost any question and produce harmful outputs on request; the attack exploits weaknesses that models may acquire during training rather than relying on a specific model's architecture or configuration.

Link: https://arxiv.org/abs/2505.10066
Authors: Michael Fire,Yitzhak Elbazis,Adi Wasenstein,Lior Rokach
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Large Language Models (LLMs) rapidly reshape modern life, advancing fields from healthcare to education and beyond. However, alongside their remarkable capabilities lies a significant threat: the susceptibility of these models to jailbreaking. The fundamental vulnerability of LLMs to jailbreak attacks stems from the very data they learn from. As long as this training data includes unfiltered, problematic, or 'dark' content, the models can inherently learn undesirable patterns or weaknesses that allow users to circumvent their intended safety controls. Our research identifies the growing threat posed by dark LLMs: models deliberately designed without ethical guardrails or modified through jailbreak techniques. In our research, we uncovered a universal jailbreak attack that effectively compromises multiple state-of-the-art models, enabling them to answer almost any question and produce harmful outputs upon request. The main idea of our attack was published online over seven months ago. However, many of the tested LLMs were still vulnerable to this attack. Despite our responsible disclosure efforts, responses from major LLM providers were often inadequate, highlighting a concerning gap in industry practices regarding AI safety. As model training becomes more accessible and cheaper, and as open-source LLMs proliferate, the risk of widespread misuse escalates. Without decisive intervention, LLMs may continue democratizing access to dangerous knowledge, posing greater risks than anticipated.

[NLP-38] CAFE: Retrieval Head-based Coarse-to-Fine Information Seeking to Enhance Multi-Document QA Capability

【Quick Read】: This paper addresses the insufficient retrieval and reasoning ability of Large Language Models (LLMs) over long-context inputs. Existing approaches based on prompt strategies and retrieval heads struggle to balance retrieval precision and recall, which hurts question answering. The proposed solution is CAFE, a two-stage coarse-to-fine method that progressively eliminates the negative impact of background and distracting documents so the model relies more on the evidence documents: a coarse-grained filtering step uses retrieval heads to identify and rank relevant documents, and a fine-grained steering step focuses attention on the most relevant content. Experiments show CAFE outperforms baselines, with SubEM gains of up to 22.1% and 13.7% over SFT and RAG methods on the Mistral model.

Link: https://arxiv.org/abs/2505.10063
Authors: Han Peng,Jinhao Jiang,Zican Dong,Wayne Xin Zhao,Lei Fang
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; DataCanvas Alaya NeW
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Advancements in Large Language Models (LLMs) have extended their input context length, yet they still struggle with retrieval and reasoning in long-context inputs. Existing methods propose to utilize the prompt strategy and retrieval head to alleviate this limitation. However, they still face challenges in balancing retrieval precision and recall, impacting their efficacy in answering questions. To address this, we introduce CAFE, a two-stage coarse-to-fine method to enhance multi-document question-answering capacities. By gradually eliminating the negative impacts of background and distracting documents, CAFE makes the responses more reliant on the evidence documents. Initially, a coarse-grained filtering method leverages retrieval heads to identify and rank relevant documents. Then, a fine-grained steering method guides attention to the most relevant content. Experiments across benchmarks show CAFE outperforms baselines, achieving up to 22.1% and 13.7% SubEM improvement over SFT and RAG methods on the Mistral model, respectively.
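
The coarse filtering stage can be pictured as scoring each candidate document by the attention mass that designated retrieval heads place on its token span. The sketch below uses random toy attention weights and an assumed two retrieval heads purely for illustration; it is not CAFE's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_docs, doc_len = 5, 50
# Toy attention weights from one decoding step for two hypothetical
# retrieval heads, over a context of num_docs concatenated documents.
attn = rng.dirichlet(np.ones(num_docs * doc_len), size=2)  # (heads, tokens)

# Coarse stage: score each document by the attention mass the retrieval
# heads assign to its token span, then keep only the top-k documents.
doc_scores = attn.reshape(2, num_docs, doc_len).sum(axis=(0, 2))
top_k = 2
kept = np.argsort(doc_scores)[::-1][:top_k]
print("document scores:", np.round(doc_scores, 3))
print("kept documents:", sorted(kept.tolist()))
```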

[NLP-39] DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs

【Quick Read】: This paper addresses the evaluation of implicit bias in Large Language Models (LLMs), an ethical and technical problem inherited from training data. Prior work shows LLM responses shift across social contexts, but there is no standardized way to measure this specific type of bias. The key to the solution is DIF (Demographic Implicit Fairness), an easily interpretable benchmark computed by evaluating preexisting LLM logic and math problem datasets with sociodemographic personas; it statistically validates the presence of implicit bias and reveals an inverse trend between question-answering accuracy and implicit bias.

Link: https://arxiv.org/abs/2505.10013
Authors: Lake Yin,Fan Huang
Affiliations: Indiana University, Bloomington
Subjects: Computation and Language (cs.CL)
Comments: 7 pages, 1 figure

View abstract

Abstract:As Large Language Models (LLMs) have risen in prominence over the past few years, there has been concern over the potential biases in LLMs inherited from the training data. Previous studies have examined how LLMs exhibit implicit bias, such as when response generation changes when different social contexts are introduced. We argue that this implicit bias is not only an ethical, but also a technical issue, as it reveals an inability of LLMs to accommodate extraneous information. However, unlike other measures of LLM intelligence, there are no standard methods to benchmark this specific subset of LLM bias. To bridge this gap, we developed a method for calculating an easily interpretable benchmark, DIF (Demographic Implicit Fairness), by evaluating preexisting LLM logic and math problem datasets with sociodemographic personas. We demonstrate that this method can statistically validate the presence of implicit bias in LLM behavior and find an inverse trend between question answering accuracy and implicit bias, supporting our argument.

[NLP-40] Advanced Crash Causation Analysis for Freeway Safety: A Large Language Model Approach to Identifying Key Contributing Factors

【Quick Read】: This paper addresses the difficulty that traditional statistical methods and machine learning models have in capturing the complex interactions among factors and the unique characteristics of individual crashes in traffic crash causation analysis. The key to the solution is using a large language model (LLM) to analyze freeway crash data: a training dataset covering environmental, driver, traffic, and geometric design factors was compiled from 226 related studies, the Llama3 8B model was fine-tuned with QLoRA to strengthen its understanding of freeway crashes and their contributing factors, and crash causes were then identified via zero-shot classification with comprehensive explanations consistent with existing research.

Link: https://arxiv.org/abs/2505.09949
Authors: Ahmed S. Abdelrahman,Mohamed Abdel-Aty,Samgyu Yang,Abdulrahman Faden
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Applications (stat.AP)
Comments:

View abstract

Abstract:Understanding the factors contributing to traffic crashes and developing strategies to mitigate their severity is essential. Traditional statistical methods and machine learning models often struggle to capture the complex interactions between various factors and the unique characteristics of each crash. This research leverages large language model (LLM) to analyze freeway crash data and provide crash causation analysis accordingly. By compiling 226 traffic safety studies related to freeway crashes, a training dataset encompassing environmental, driver, traffic, and geometric design factors was created. The Llama3 8B model was fine-tuned using QLoRA to enhance its understanding of freeway crashes and their contributing factors, as covered in these studies. The fine-tuned Llama3 8B model was then used to identify crash causation without pre-labeled data through zero-shot classification, providing comprehensive explanations to ensure that the identified causes were reasonable and aligned with existing research. Results demonstrate that LLMs effectively identify primary crash causes such as alcohol-impaired driving, speeding, aggressive driving, and driver inattention. Incorporating event data, such as road maintenance, offers more profound insights. The model’s practical applicability and potential to improve traffic safety measures were validated by a high level of agreement among researchers in the field of traffic safety, as reflected in questionnaire results with 88.89%. This research highlights the complex nature of traffic crashes and how LLMs can be used for comprehensive analysis of crash causation and other contributing factors. Moreover, it provides valuable insights and potential countermeasures to aid planners and policymakers in developing more effective and efficient traffic safety practices.

[NLP-41] Personalizing Large Language Models using Retrieval Augmented Generation and Knowledge Graph WWW

【Quick Read】: This paper addresses hallucinations in Large Language Models (LLMs) caused by overfitting during response generation, rooted in the lack of timely, factual, and personalized input. The key to the solution is retrieval-augmented generation (RAG) with knowledge graphs (KGs) to help the LLM produce accurate, personalized responses. KGs store continuously updated factual information in a structured way; this work focuses on calendar data and shows the approach substantially outperforms baseline LLMs that take personal data as text input, both in understanding personal information and in generating accurate responses, with only a moderate increase in response time.

Link: https://arxiv.org/abs/2505.09945
Authors: Deeksha Prahlad,Chanhee Lee,Dongha Kim,Hokeun Kim
Affiliations: Arizona State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in the Companion Proceedings of the ACM Web Conference 2025 (WWW Companion '25)

View abstract

Abstract:The advent of large language models (LLMs) has allowed numerous applications, including the generation of queried responses, to be leveraged in chatbots and other conversational assistants. Being trained on a plethora of data, LLMs often undergo high levels of over-fitting, resulting in the generation of extra and incorrect data, thus causing hallucinations in output generation. One of the root causes of such problems is the lack of timely, factual, and personalized information fed to the LLM. In this paper, we propose an approach to address these problems by introducing retrieval augmented generation (RAG) using knowledge graphs (KGs) to assist the LLM in personalized response generation tailored to the users. KGs have the advantage of storing continuously updated factual information in a structured way. While our KGs can be used for a variety of frequently updated personal data, such as calendar, contact, and location data, we focus on calendar data in this paper. Our experimental results show that our approach works significantly better in understanding personal information and generating accurate responses compared to the baseline LLMs using personal data as text inputs, with a moderate reduction in response time.
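
A minimal sketch of the calendar-grounded RAG loop: retrieve triples about a date from a toy personal knowledge graph and splice them into the prompt. All triples and the prompt template below are invented for illustration and are not the paper's system.

```python
# Toy personal knowledge graph: (subject, predicate, object) triples
# for calendar data (all entries are illustrative).
kg = [
    ("2025-05-20", "has_event", "dentist appointment"),
    ("2025-05-20", "event_time", "14:00"),
    ("2025-05-21", "has_event", "team standup"),
]

def retrieve(day: str) -> list:
    """Fetch facts about `day` to ground the LLM prompt."""
    return [f"{s} {p} {o}" for s, p, o in kg if s == day]

facts = retrieve("2025-05-20")
prompt = (
    "Answer using only these facts:\n"
    + "\n".join(facts)
    + "\n\nQuestion: What do I have scheduled on 2025-05-20?"
)
print(prompt)  # this grounded prompt is then passed to the LLM
```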

[NLP-42] Rethinking Prompt Optimizers: From Prompt Merits to Optimization

【Quick Read】: This paper addresses the poor compatibility and degraded response quality that arise when traditional prompt optimization (PO) relies on powerful large language models (LLMs) to generate complex prompts. The key to the solution is rethinking prompt optimization through interpretable design: MePO, a lightweight, locally deployable optimizer trained to learn clear, interpretable prompt-quality merits, generalizes effectively to both large-scale and lightweight inference models while reducing cost and privacy risks and improving performance and stability in real-world deployment.

Link: https://arxiv.org/abs/2505.09930
Authors: Zixiao Zhu,Hanzhang Zhou,Zijian Feng,Tianjiao Li,Chua Jia Jim Deryl,Mak Lee Onn,Gee Wah Ng,Kezhi Mao
Affiliations: Nanyang Technological University; Home Team Science and Technology Agency
Subjects: Computation and Language (cs.CL)
Comments: 20 pages, 14 figures

View abstract

Abstract:Prompt optimization (PO) offers a practical alternative to fine-tuning large language models (LLMs), enabling performance improvements without altering model weights. Existing methods typically rely on advanced, large-scale LLMs like GPT-4 to generate optimized prompts. However, due to limited downward compatibility, verbose, instruction-heavy prompts from advanced LLMs can overwhelm lightweight inference models and degrade response quality. In this work, we rethink prompt optimization through the lens of interpretable design. We first identify a set of model-agnostic prompt quality merits and empirically validate their effectiveness in enhancing prompt and response quality. We then introduce MePO, a merit-guided, lightweight, and locally deployable prompt optimizer trained on our preference dataset built from merit-aligned prompts generated by a lightweight LLM. Unlike prior work, MePO avoids online optimization reliance, reduces cost and privacy concerns, and, by learning clear, interpretable merits, generalizes effectively to both large-scale and lightweight inference models. Experiments demonstrate that MePO achieves better results across diverse tasks and model types, offering a scalable and robust solution for real-world deployment. Our model and dataset are available at: this https URL

[NLP-43] From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models

【Quick Read】: This paper targets the misuse of LLM-generated text, using watermarking for provenance and detection of AI-generated content. Existing watermarking schemes trade off robustness, text quality, and security and cannot satisfy all three. The key to the solution is combining logits-based and sampling-based watermarking to harness their respective strengths, with three strategies: serial, parallel, and hybrid. The hybrid framework adaptively embeds watermarks using token entropy and semantic entropy, optimizing the balance between detectability, robustness, text quality, and security.

Link: https://arxiv.org/abs/2505.09924
Authors: Yidan Wang,Yubing Ren,Yanan Cao,Binxing Fang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Cyberspace Institute of Advanced Technology, Guangzhou University
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

View abstract

Abstract:The rise of Large Language Models (LLMs) has heightened concerns about the misuse of AI-generated text, making watermarking a promising solution. Mainstream watermarking schemes for LLMs fall into two categories: logits-based and sampling-based. However, current schemes entail trade-offs among robustness, text quality, and security. To mitigate this, we integrate logits-based and sampling-based schemes, harnessing their respective strengths to achieve synergy. In this paper, we propose a versatile symbiotic watermarking framework with three strategies: serial, parallel, and hybrid. The hybrid framework adaptively embeds watermarks using token entropy and semantic entropy, optimizing the balance between detectability, robustness, text quality, and security. Furthermore, we validate our approach through comprehensive experiments on various datasets and models. Experimental results indicate that our method outperforms existing baselines and achieves state-of-the-art (SOTA) performance. We believe this framework provides novel insights into diverse watermarking paradigms. Our code is available at this https URL.
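
The entropy-adaptive embedding idea can be sketched as a green-list logit bias whose strength scales with token entropy, so that low-entropy (nearly forced) positions are barely perturbed. Everything below (the hashing scheme, green-list fraction, and bias scale) is an illustrative assumption, not the paper's hybrid framework.

```python
import hashlib
import numpy as np

def token_entropy(probs: np.ndarray) -> float:
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def green_list(prev_token: int, vocab: int, frac: float = 0.5) -> np.ndarray:
    """Pseudo-random 'green' token set seeded by the previous token."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    return rng.permutation(vocab)[: int(frac * vocab)]

def watermarked_logits(logits: np.ndarray, prev_token: int,
                       delta_max: float = 2.0) -> np.ndarray:
    """Boost green-list logits, scaling the bias with token entropy so
    low-entropy (forced) positions are left nearly untouched."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    strength = delta_max * token_entropy(probs) / np.log(len(logits))
    out = logits.copy()
    out[green_list(prev_token, len(logits))] += strength
    return out

logits = np.random.default_rng(0).normal(size=100)
print(watermarked_logits(logits, prev_token=42)[:5])
```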

[NLP-44] PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization

【Quick Read】: This paper addresses privacy leakage in Large Language Models (LLMs), in particular the risk of extracting sensitive information via jailbreak attacks; existing evaluation methods fail to simulate realistic attacks. The key to the solution is the PIG framework, which identifies Personally Identifiable Information (PII) entities and their types in privacy queries, builds a privacy context via in-context learning, and iteratively updates that context with three gradient-based strategies to elicit the target PII, more effectively exposing privacy vulnerabilities in LLMs.

Link: https://arxiv.org/abs/2505.09921
Authors: Yidan Wang,Yanan Cao,Yubing Ren,Fang Fang,Zheng Lin,Binxing Fang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Cyberspace Institute of Advanced Technology, Guangzhou University
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Large Language Models (LLMs) excel in various domains but pose inherent privacy risks. Existing methods to evaluate privacy leakage in LLMs often use memorized prefixes or simple instructions to extract data, both of which well-alignment models can easily block. Meanwhile, Jailbreak attacks bypass LLM safety mechanisms to generate harmful content, but their role in privacy scenarios remains underexplored. In this paper, we examine the effectiveness of jailbreak attacks in extracting sensitive information, bridging privacy leakage and jailbreak attacks in LLMs. Moreover, we propose PIG, a novel framework targeting Personally Identifiable Information (PII) and addressing the limitations of current jailbreak methods. Specifically, PIG identifies PII entities and their types in privacy queries, uses in-context learning to build a privacy context, and iteratively updates it with three gradient-based strategies to elicit target PII. We evaluate PIG and existing jailbreak methods using two privacy-related datasets. Experiments on four white-box and two black-box LLMs show that PIG outperforms baseline methods and achieves state-of-the-art (SoTA) results. The results underscore significant privacy risks in LLMs, emphasizing the need for stronger safeguards. Our code is available at this https URL.

[NLP-45] Crossing Borders Without Crossing Boundaries: How Sociolinguistic Awareness Can Optimize User Engagement with Localized Spanish AI Models Across Hispanophone Countries

【Quick Read】: This paper addresses the everyday sociolinguistic dissonances created by dialectal differences in Spanish between Latin America and Spain, which motivate regionally localized language models. The key to the solution is developing at least five proposed sub-variants of Spanish to strengthen user trust in and reliance on AI language models while demonstrating cultural, historical, and sociolinguistic awareness, thereby supporting more effective localization and internationalization strategies.

Link: https://arxiv.org/abs/2505.09902
Authors: Martin Capdevila,Esteban Villa Turek,Ellen Karina Chumbe Fernandez,Luis Felipe Polo Galvez,Luis Cadavid,Andrea Marroquin,Rebeca Vargas Quesada,Johanna Crew,Nicole Vallejo Galarraga,Christopher Rodriguez,Diego Gutierrez,Radhi Datla
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Large language models are, by definition, based on language. In an effort to underscore the critical need for regional localized models, this paper examines primary differences between variants of written Spanish across Latin America and Spain, with an in-depth sociocultural and linguistic contextualization therein. We argue that these differences effectively constitute significant gaps in the quotidian use of Spanish among dialectal groups by creating sociolinguistic dissonances, to the extent that locale-sensitive AI models would play a pivotal role in bridging these divides. In doing so, this approach informs better and more efficient localization strategies that also serve to more adequately meet inclusivity goals, while securing sustainable active daily user growth in a major low-risk investment geographic area. Therefore, implementing at least the proposed five sub variants of Spanish addresses two lines of action: to foment user trust and reliance on AI language models while also demonstrating a level of cultural, historical, and sociolinguistic awareness that reflects positively on any internationalization strategy.

[NLP-46] Comparing Exploration-Exploitation Strategies of LLM s and Humans: Insights from Standard Multi-armed Bandit Tasks

【Quick Read】: This paper asks whether generative AI can exhibit human-like decision-making behavior and achieve comparable (or superior) performance in complex sequential decision-making tasks. The key to the solution is a comparative study of the exploration-exploitation (EE) strategies of LLMs, humans, and MAB algorithms on canonical multi-armed bandit (MAB) tasks, using interpretable choice models to capture the agents' EE strategies and examining how explicit reasoning shapes LLM decision-making. The study finds that reasoning shifts LLMs toward more human-like behavior, but their adaptability remains limited in complex non-stationary environments.

Link: https://arxiv.org/abs/2505.09901
Authors: Ziyuan Zhang,Darcy Wang,Ningyuan Chen,Rodrigo Mansur,Vahid Sarhangian
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

View abstract

Abstract:Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making tasks. A natural question is then whether LLMs exhibit similar decision-making behavior to humans, and can achieve comparable (or superior) performance. In this work, we focus on the exploration-exploitation (EE) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We employ canonical multi-armed bandit (MAB) tasks introduced in the cognitive science and psychiatry literature to conduct a comparative study of the EE strategies of LLMs, humans, and MAB algorithms. We use interpretable choice models to capture the EE strategies of the agents and investigate how explicit reasoning, through both prompting strategies and reasoning-enhanced models, shapes LLM decision-making. We find that reasoning shifts LLMs toward more human-like behavior, characterized by a mix of random and directed exploration. In simple stationary tasks, reasoning-enabled LLMs exhibit similar levels of random and directed exploration compared to humans. However, in more complex, non-stationary environments, LLMs struggle to match human adaptability, particularly in effective directed exploration, despite achieving similar regret in certain scenarios. Our findings highlight both the promise and limits of LLMs as simulators of human behavior and tools for automated decision-making and point to potential areas of improvements.
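
For readers unfamiliar with the task setup, a minimal epsilon-greedy agent on a Bernoulli bandit illustrates the mix of random and directed behavior that the interpretable choice models are fit to. The arm means and exploration rate below are arbitrary, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hidden arm reward rates
counts = np.zeros(3)
values = np.zeros(3)
epsilon = 0.1                            # random-exploration rate

total = 0.0
for t in range(1000):
    if rng.random() < epsilon:
        arm = int(rng.integers(3))       # random exploration
    else:
        arm = int(np.argmax(values))     # exploitation
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # running mean
    total += reward

print("estimated arm values:", np.round(values, 2), "| total reward:", total)
```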

[NLP-47] Predictability Shapes Adaptation: An Evolutionary Perspective on Modes of Learning in Transformers

【Quick Read】: This paper studies the interplay between two learning modes in Transformers, in-weights learning (IWL) and in-context learning (ICL), and how environmental predictability shapes the balance between them. The key to the solution is drawing on adaptive strategies from evolutionary biology, treating IWL as analogous to genetic encoding and ICL to phenotypic plasticity, experimentally operationalizing two dimensions of predictability (environmental stability and cue reliability), and systematically analyzing their influence on the ICL/IWL balance, leading to a relative-cost hypothesis that explains transitions between learning modes.

Link: https://arxiv.org/abs/2505.09855
Authors: Alexander Y. Ku,Thomas L. Griffiths,Stephanie C.Y. Chan
Affiliations: Google DeepMind; Princeton University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Transformer models learn in two distinct modes: in-weights learning (IWL), encoding knowledge into model weights, and in-context learning (ICL), adapting flexibly to context without weight modification. To better understand the interplay between these learning modes, we draw inspiration from evolutionary biology’s analogous adaptive strategies: genetic encoding (akin to IWL, adapting over generations and fixed within an individual’s lifetime) and phenotypic plasticity (akin to ICL, enabling flexible behavioral responses to environmental cues). In evolutionary biology, environmental predictability dictates the balance between these strategies: stability favors genetic encoding, while reliable predictive cues promote phenotypic plasticity. We experimentally operationalize these dimensions of predictability and systematically investigate their influence on the ICL/IWL balance in Transformers. Using regression and classification tasks, we show that high environmental stability decisively favors IWL, as predicted, with a sharp transition at maximal stability. Conversely, high cue reliability enhances ICL efficacy, particularly when stability is low. Furthermore, learning dynamics reveal task-contingent temporal evolution: while a canonical ICL-to-IWL shift occurs in some settings (e.g., classification with many classes), we demonstrate that scenarios with easier IWL (e.g., fewer classes) or slower ICL acquisition (e.g., regression) can exhibit an initial IWL phase later yielding to ICL dominance. These findings support a relative-cost hypothesis for explaining these learning mode transitions, establishing predictability as a critical factor governing adaptive strategies in Transformers, and offering novel insights for understanding ICL and guiding training methodologies.

[NLP-48] Do Large Language Models Know Conflict? Investigating Parametric vs. Non-Parametric Knowledge of LLM s for Conflict Forecasting

【Quick Read】: This paper asks whether Large Language Models (LLMs) can forecast violent conflict, in particular whether the parametric knowledge encoded in pretrained weights suffices to predict conflict escalation and fatalities without external data. The key to the solution is a two-part evaluation framework that compares LLM conflict-forecasting performance when relying only on pretrained knowledge (the parametric setting) versus when given external structured and unstructured context (the non-parametric setting), to assess the benefit of external knowledge augmentation.

Link: https://arxiv.org/abs/2505.09852
Authors: Apollinaire Poli Nemkova,Sarath Chandra Lingareddy,Sagnik Ray Choudhury,Mark V. Albert
Affiliations: University of North Texas
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Large Language Models (LLMs) have shown impressive performance across natural language tasks, but their ability to forecast violent conflict remains underexplored. We investigate whether LLMs possess meaningful parametric knowledge, encoded in their pretrained weights, to predict conflict escalation and fatalities without external data. This is critical for early warning systems, humanitarian planning, and policy-making. We compare this parametric knowledge with non-parametric capabilities, where LLMs access structured and unstructured context from conflict datasets (e.g., ACLED, GDELT) and recent news reports via Retrieval-Augmented Generation (RAG). Incorporating external information could enhance model performance by providing up-to-date context otherwise missing from pretrained weights. Our two-part evaluation framework spans 2020-2024 across conflict-prone regions in the Horn of Africa and the Middle East. In the parametric setting, LLMs predict conflict trends and fatalities relying only on pretrained knowledge. In the non-parametric setting, models receive summaries of recent conflict events, indicators, and geopolitical developments. We compare predicted conflict trend labels (e.g., Escalate, Stable Conflict, De-escalate, Peace) and fatalities against historical data. Our findings highlight the strengths and limitations of LLMs for conflict forecasting and the benefits of augmenting them with structured external knowledge.

[NLP-49] KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning

【Quick Read】: This paper addresses the gap in evaluating large language models (LLMs) on close reading of literary texts, covering tasks such as extracting stylistic features, retrieving contextual information, and multi-hop reasoning between style and external contexts. The key to the solution is KRISTEVA, the first close-reading benchmark for evaluating interpretive reasoning, consisting of 1,331 multiple-choice questions adapted from classroom data and organized into three progressively harder task sets that approximate different elements of the close-reading process.

Link: https://arxiv.org/abs/2505.09825
Authors: Peiqi Sui,Juan Diego Rodriguez,Philippe Laban,Dean Murphy,Joseph P. Dexter,Richard Jean So,Samuel Baker,Pramit Chaudhuri
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Each year, tens of millions of essays are written and graded in college-level English courses. Students are asked to analyze literary and cultural texts through a process known as close reading, in which they gather textual details to formulate evidence-based arguments. Despite being viewed as a basis for critical thinking and widely adopted as a required element of university coursework, close reading has never been evaluated on large language models (LLMs), and multi-discipline benchmarks like MMLU do not include literature as a subject. To fill this gap, we present KRISTEVA, the first close reading benchmark for evaluating interpretive reasoning, consisting of 1331 multiple-choice questions adapted from classroom data. With KRISTEVA, we propose three progressively more difficult sets of tasks to approximate different elements of the close reading process, which we use to test how well LLMs may seem to understand and reason about literary works: 1) extracting stylistic features, 2) retrieving relevant contextual information from parametric knowledge, and 3) multi-hop reasoning between style and external contexts. Our baseline results find that, while state-of-the-art LLMs possess some college-level close reading competency (accuracy 49.7% - 69.7%), their performances still trail those of experienced human evaluators on 10 out of our 11 tasks.

[NLP-50] Exploring the generalization of LLM truth directions on conversational formats

【Quick Read】: This paper studies lie detection in LLMs, specifically whether a universal "truth direction" exists in the model's activation space along which true and false statements are linearly separable, and how well that direction generalizes across conversational formats. The authors find the truth direction generalizes well for short conversations ending on a lie, but poorly for longer formats where the lie appears early in the input prompt. The key to the solution is appending a fixed key phrase at the end of each conversation, which significantly improves generalization across formats.

Link: https://arxiv.org/abs/2505.09807
Authors: Timour Ichmoukhamedov,David Martens
Affiliations: ADM, Universiteit Antwerpen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Several recent works argue that LLMs have a universal truth direction where true and false statements are linearly separable in the activation space of the model. It has been demonstrated that linear probes trained on a single hidden state of the model already generalize across a range of topics and might even be used for lie detection in LLM conversations. In this work we explore how this truth direction generalizes between various conversational formats. We find good generalization between short conversations that end on a lie, but poor generalization to longer formats where the lie appears earlier in the input prompt. We propose a solution that significantly improves this type of generalization by adding a fixed key phrase at the end of each conversation. Our results highlight the challenges towards reliable LLM lie detectors that generalize to new settings.
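
A linear probe on hidden states is simple to reproduce on synthetic data. The sketch below fabricates hidden states shifted along a known "truth direction" and checks that a logistic-regression probe recovers it; real experiments would of course use actual LLM activations, and all dimensions here are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 64, 400
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

# Synthetic hidden states: true statements shifted along the truth
# direction, false ones shifted the opposite way, plus noise.
labels = rng.integers(0, 2, size=n)
hidden = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, truth_dir) * 1.5

probe = LogisticRegression(max_iter=1000).fit(hidden, labels)
print("probe accuracy:", probe.score(hidden, labels))

# The learned weight vector approximates the planted truth direction.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("cosine with true direction:", float(w @ truth_dir))
```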

[NLP-51] Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques

【Quick Read】: This paper addresses the inefficiency, time cost, and error-proneness of manually extracting information from clinical reports, which limits data-driven approaches in healthcare. The key to the solution is Natural Language Processing (NLP), specifically Named Entity Recognition (NER), to automatically extract key clinical information from electronic health records (EHRs). The study uses GMV's NLP tool uQuery and fine-tunes bsc-bio-ehr-en3, a RoBERTa-based biomedical language model pre-trained in Spanish, to improve recognition of clinical entities related to lung and breast cancer.

Link: https://arxiv.org/abs/2505.09794
Authors: J. Moreno-Casanova,J.M. Auñón,A. Mártinez-Pérez,M.E. Pérez-Martínez,M.E. Gas-López
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Research projects, including those focused on cancer, rely on the manual extraction of information from clinical reports. This process is time-consuming and prone to errors, limiting the efficiency of data-driven approaches in healthcare. To address these challenges, Natural Language Processing (NLP) offers an alternative for automating the extraction of relevant data from electronic health records (EHRs). In this study, we focus on lung and breast cancer due to their high incidence and the significant impact they have on public health. Early detection and effective data management in both types of cancer are crucial for improving patient outcomes. To enhance the accuracy and efficiency of data extraction, we utilized GMV’s NLP tool uQuery, which excels at identifying relevant entities in clinical texts and converting them into standardized formats such as SNOMED and OMOP. uQuery not only detects and classifies entities but also associates them with contextual information, including negated entities, temporal aspects, and patient-related details. In this work, we explore the use of NLP techniques, specifically Named Entity Recognition (NER), to automatically identify and extract key clinical information from EHRs related to these two cancers. A dataset from Health Research Institute Hospital La Fe (IIS La Fe), comprising 200 annotated breast cancer and 400 lung cancer reports, was used, with eight clinical entities manually labeled using the Doccano platform. To perform NER, we fine-tuned the bsc-bio-ehr-en3 model, a RoBERTa-based biomedical linguistic model pre-trained in Spanish. Fine-tuning was performed using the Transformers architecture, enabling accurate recognition of clinical entities in these cancer types. Our results demonstrate strong overall performance, particularly in identifying entities like MET and PAT, although challenges remain with less frequent entities like EVOL.

[NLP-52] A Survey on Large Language Models in Multimodal Recommender Systems

【Quick Read】: This paper surveys how to effectively integrate large language models (LLMs) into multimodal recommender systems (MRS) to improve recommendation performance. The key to the solution is examining prompting strategies, fine-tuning methods, and data adaptation techniques, proposing a new taxonomy to characterize LLM-MRS integration patterns, identifying transferable techniques from related recommendation domains, reviewing evaluation metrics and datasets, and pointing to future research directions.

Link: https://arxiv.org/abs/2505.09777
Authors: Alejo Lopez-Avila,Jinhua Du
Affiliations: Huawei London Research Centre
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 30 pages, 6 figures

View abstract

Abstract:Multimodal recommender systems (MRS) integrate heterogeneous user and item data, such as text, images, and structured information, to enhance recommendation performance. The emergence of large language models (LLMs) introduces new opportunities for MRS by enabling semantic reasoning, in-context learning, and dynamic input handling. Compared to earlier pre-trained language models (PLMs), LLMs offer greater flexibility and generalisation capabilities but also introduce challenges related to scalability and model accessibility. This survey presents a comprehensive review of recent work at the intersection of LLMs and MRS, focusing on prompting strategies, fine-tuning methods, and data adaptation techniques. We propose a novel taxonomy to characterise integration patterns, identify transferable techniques from related recommendation domains, provide an overview of evaluation metrics and datasets, and point to possible future directions. We aim to clarify the emerging role of LLMs in multimodal recommendation and support future research in this rapidly evolving field.

[NLP-53] Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

【Quick Read】: This paper addresses the inefficiency and performance limitations caused by the fixed tokenization schemes of pretrained language models (LLMs), especially in multilingual or specialized applications. The key contributions are TokenAdapt, a model-agnostic tokenizer transplantation method, and a pre-tokenization learning mechanism for multi-word Supertokens. TokenAdapt initializes new unique token embeddings via a hybrid heuristic combining a local estimate based on subword decomposition with a global estimate based on semantically similar tokens, preserving semantics while greatly reducing fine-tuning needs and mitigating tokenizer lock-in.

Link: https://arxiv.org/abs/2505.09738
Authors: Shaurya Sharthak,Vinayak Pahalwan,Adithya Kamath,Adarsh Shirawalmath
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges: standard methods to overcome it often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word Supertokens to enhance compression and reduce fragmentation. TokenAdapt initializes new unique token embeddings via a hybrid heuristic that combines two methods: a local estimate based on subword decomposition using the old tokenizer, and a global estimate utilizing the top-k semantically similar tokens from the original vocabulary. This methodology aims to preserve semantics while significantly minimizing retraining requirements. Empirical investigations validate both contributions: the transplantation heuristic successfully initializes unique tokens, markedly outperforming conventional baselines and sophisticated methods including Transtokenizer and ReTok, while our Supertokens achieve notable compression gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid initialization consistently yields lower perplexity ratios compared to both ReTok and TransTokenizer baselines across different base models and newly trained target tokenizers. TokenAdapt typically reduced the overall perplexity ratio significantly compared to ReTok, yielding at least a 2-fold improvement in these aggregate scores.
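
The hybrid initialization heuristic is easy to sketch: average the old embeddings of a new token's subword pieces (local estimate), then blend in the mean of the most similar old embeddings (global estimate). The toy vocabulary, the subword decomposition, and the 0.5/0.5 blend below are assumptions for illustration only, not TokenAdapt's actual weighting.

```python
import numpy as np

rng = np.random.default_rng(0)
old_vocab = ["un", "happi", "ness", "joy", "sad"]
old_emb = {tok: rng.normal(size=8) for tok in old_vocab}

def local_estimate(new_token: str, old_tokenize) -> np.ndarray:
    """Average the old embeddings of the new token's subword pieces."""
    pieces = old_tokenize(new_token)
    return np.mean([old_emb[p] for p in pieces], axis=0)

def global_estimate(local: np.ndarray, k: int = 2) -> np.ndarray:
    """Average the top-k old embeddings most similar to the local guess."""
    sims = {
        tok: float(v @ local / (np.linalg.norm(v) * np.linalg.norm(local)))
        for tok, v in old_emb.items()
    }
    top = sorted(sims, key=sims.get, reverse=True)[:k]
    return np.mean([old_emb[t] for t in top], axis=0)

# Hypothetical subword decomposition of the new token "unhappiness".
local = local_estimate("unhappiness", lambda t: ["un", "happi", "ness"])
new_embedding = 0.5 * local + 0.5 * global_estimate(local)
print(new_embedding.round(3))
```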

[NLP-54] An AI-Powered Research Assistant in the Lab: A Practical Guide for Text Analysis Through Iterative Collaboration with LLM s

【Quick Read】: This paper addresses the time cost, labor intensity, and susceptibility to bias of traditional text analysis for unstructured data such as open-ended responses, headlines, or social media posts. The key to the solution is using large language models (LLMs) to build and apply taxonomies, either predefined (top-down) or data-driven (bottom-up), through an efficient, iterative, and collaborative process between researchers and LLMs. The tutorial shows how to integrate LLMs into the text-analysis pipeline via prompt engineering, taxonomy evaluation and refinement, and intercoder agreement checks, improving both efficiency and reliability.

Link: https://arxiv.org/abs/2505.09724
Authors: Gino Carmona-Díaz,William Jiménez-Leal,María Alejandra Grisales,Chandra Sripada,Santiago Amaya,Michael Inzlicht,Juan Pablo Bermúdez
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 31 pages, 1 figure

View abstract

Abstract:Analyzing texts such as open-ended responses, headlines, or social media posts is a time- and labor-intensive process highly susceptible to bias. LLMs are promising tools for text analysis, using either a predefined (top-down) or a data-driven (bottom-up) taxonomy, without sacrificing quality. Here we present a step-by-step tutorial to efficiently develop, test, and apply taxonomies for analyzing unstructured data through an iterative and collaborative process between researchers and LLMs. Using personal goals provided by participants as an example, we demonstrate how to write prompts to review datasets and generate a taxonomy of life domains, evaluate and refine the taxonomy through prompt and direct modifications, test the taxonomy and assess intercoder agreements, and apply the taxonomy to categorize an entire dataset with high intercoder reliability. We discuss the possibilities and limitations of using LLMs for text analysis.

[NLP-55] VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

【Quick Read】: This paper addresses the evaluation of factual accuracy in content generated by large language models (LLMs), in particular the incomplete facts and missing key relational facts caused by complex inter-sentence dependencies. The key to the solution is the VeriFact framework, which improves fact extraction by identifying and resolving incomplete and missing facts, enabling more reliable verification. The paper also introduces FactRBench, a benchmark that evaluates both precision and recall of model outputs, giving a more complete picture of factuality than prior precision-only work.

Link: https://arxiv.org/abs/2505.09701
Authors: Xin Liu,Lechen Zhang,Sheza Munir,Yiyang Gu,Lu Wang
Affiliations: University of Michigan
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Large language models (LLMs) excel at generating long-form responses, but evaluating their factuality remains challenging due to complex inter-sentence dependencies within the generated facts. Prior solutions predominantly follow a decompose-decontextualize-verify pipeline but often fail to capture essential context and miss key relational facts. In this paper, we introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction by identifying and resolving incomplete and missing facts to support more accurate verification results. Moreover, we introduce FactRBench, a benchmark that evaluates both precision and recall in long-form model responses, whereas prior work primarily focuses on precision. FactRBench provides reference fact sets from advanced LLMs and human-written answers, enabling recall assessment. Empirical evaluations show that VeriFact significantly enhances fact completeness and preserves complex facts with critical relational information, resulting in more accurate factuality evaluation. Benchmarking various open- and close-weight LLMs on FactRBench indicates that larger models within the same model family improve precision and recall, but high precision does not always correlate with high recall, underscoring the importance of comprehensive factuality assessment.

[NLP-56] System Prompt Optimization with Meta-Learning

【Quick Read】: This paper addresses the limitation that traditional prompt optimization focuses on task-specific user prompts while neglecting the system prompt, which, once optimized, can generalize across tasks and domains. The core challenge is designing a system prompt that is robust to diverse user prompts and transferable to unseen tasks. The key to the solution is a meta-learning framework that meta-learns the system prompt by optimizing it over many user prompts across multiple datasets while iteratively updating the user prompts, ensuring synergy between the two and improving the system prompt's generalization and adaptability.

Link: https://arxiv.org/abs/2505.09666
Authors: Yumin Choi,Jinheon Baek,Sung Ju Hwang
Affiliations: KAIST; DeepAuto.ai
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.

[NLP-57] Tales of the 2025 Los Angeles Fire: Hotwash for Public Health Concerns in Reddit via LLM-Enhanced Topic Modeling

【Quick Read】: This paper seeks to understand how affected populations perceive and respond during wildfire crises, to support timely and empathetic disaster response. The key to the solution is collecting public discourse from social media and applying topic modeling enhanced by large language models (LLMs) with human-in-the-loop (HITL) refinement, organized in a hierarchical framework that categorizes latent topics into Situational Awareness (SA) and Crisis Narratives (CN). The approach tracks actual fire progression and surfaces frequently co-occurring topics such as public health and safety, loss and damage, and emergency resources, informing disaster-response strategy.

Link: https://arxiv.org/abs/2505.09665
Authors: Sulong Zhou,Qunying Huang,Shaoheng Zhou,Yun Hang,Xinyue Ye,Aodong Mei,Kathryn Phung,Yuning Ye,Uma Govindswamy,Zehan Li
Affiliations: Texas A&M University; University of Wisconsin-Madison; Google; University of Texas Health Science Center at Houston; McWilliams School of Biomedical Informatics
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Wildfires have become increasingly frequent, irregular, and severe in recent years. Understanding how affected populations perceive and respond during wildfire crises is critical for timely and empathetic disaster response. Social media platforms offer a crowd-sourced channel to capture evolving public discourse, providing hyperlocal information and insight into public sentiment. This study analyzes Reddit discourse during the 2025 Los Angeles wildfires, spanning from the onset of the disaster to full containment. We collect 385 posts and 114,879 comments related to the Palisades and Eaton fires. We adopt topic modeling methods to identify the latent topics, enhanced by large language models (LLMs) and human-in-the-loop (HITL) refinement. Furthermore, we develop a hierarchical framework to categorize latent topics, consisting of two main categories, Situational Awareness (SA) and Crisis Narratives (CN). The volume of the SA category closely aligns with real-world fire progressions, peaking within the first 2-5 days as the fires reach their maximum extent. The most frequent co-occurring category set of public health and safety, loss and damage, and emergency resources expands on a wide range of health-related latent topics, including environmental health, occupational health, and one health. Grief signals and mental health risks consistently accounted for 60% and 40% of CN instances, respectively, with the highest total volume occurring at night. This study contributes the first annotated social media dataset on the 2025 LA fires, and introduces a scalable multi-layer framework that leverages topic modeling for crisis discourse analysis. By identifying persistent public health concerns, our results can inform more empathetic and adaptive strategies for disaster response, public health communication, and future research in comparable climate-related disaster events.
zh

[NLP-58] Large Language Models Are More Persuasive Than Incentivized Human Persuaders

[Quick Read]: This paper investigates whether artificial intelligence (AI) has surpassed humans in persuasion, specifically in interactive conversational settings with real monetary incentives. The study compares a frontier large language model (LLM) with incentivized human persuaders in a real-time conversational quiz task, confirming AI's advantage in persuasive effectiveness. The key to the solution is a preregistered, large-scale, monetarily incentivized experiment in which LLM and human persuaders each try to steer quiz takers toward correct or incorrect answers, allowing an objective assessment of persuasive ability in a realistic setting. The results show that the LLM outperforms human persuaders both at improving quiz takers' accuracy and at steering them toward wrong answers, highlighting AI's marked advantage in persuasion.

Link: https://arxiv.org/abs/2505.09662
Authors: Philipp Schoenegger,Francesco Salvi,Jiacheng Liu,Xiaoli Nan,Ramit Debnath,Barbara Fasolo,Evelina Leivada,Gabriel Recchia,Fritz Günther,Ali Zarifhonarvar,Joe Kwon,Zahoor Ul Islam,Marco Dehnert,Daryl Y. H. Lee,Madeline G. Reinecke,David G. Kamper,Mert Kobaş,Adam Sandford,Jonas Kgomo,Luke Hewitt,Shreya Kapoor,Kerem Oktar,Eyup Engin Kucuk,Bo Feng,Cameron R. Jones,Izzy Gainsburg,Sebastian Olschewski,Nora Heinzelmann,Francisco Cruz,Ben M. Tappin,Tao Ma,Peter S. Park,Rayan Onyonka,Arthur Hjorth,Peter Slattery,Qingcheng Zeng,Lennart Finke,Igor Grossmann,Alessandro Salatiello,Ezra Karger
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants (quiz takers) completed an online quiz where persuaders (either humans or LLMs) attempted to persuade quiz takers toward correct or incorrect answers. We find that LLM persuaders achieved significantly higher compliance with their directional persuasion attempts than incentivized human persuaders, demonstrating superior persuasive capabilities in both truthful (toward correct answers) and deceptive (toward incorrect answers) contexts. We also find that LLM persuaders significantly increased quiz takers’ accuracy, leading to higher earnings, when steering quiz takers toward correct answers, and significantly decreased their accuracy, leading to lower earnings, when steering them toward incorrect answers. Overall, our findings suggest that AI’s persuasion capabilities already exceed those of humans that have real-money bonuses tied to performance. Our findings of increasingly capable AI persuaders thus underscore the urgency of emerging alignment and governance frameworks.

[NLP-59] DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models

[Quick Read]: This paper aims to address the diversity-quality inconsistency that arises in language model post-training from relying on solution-level, scalar reward signals, whereby distinct reasoning paths can receive identical rewards. The key to the solution is Diversity-aware Reward Adjustment (DRA), which incorporates Submodular Mutual Information (SMI) into the reward computation to downweight redundant completions and amplify rewards for diverse ones, encouraging better exploration during learning while maintaining stable exploitation of high-quality samples.

Link: https://arxiv.org/abs/2505.09655
Authors: Xiwen Chen,Wenhui Zhu,Peijie Qiu,Xuanzhao Dong,Hao Wang,Haiyu Wu,Huayu Li,Aristeidis Sotiras,Yalin Wang,Abolfazl Razi
Institutions: Clemson University; Arizona State University; Washington University in St. Louis; University of Notre Dame; University of Arizona
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose Diversity-aware Reward Adjustment (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR. GRPO, resulting in DRA-GRPO and DGA-DR. GRPO. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at this https URL.
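
To give a concrete feel for the reward adjustment described above, here is a minimal sketch that downweights rewards of redundant completions in a sampled group. The cosine-similarity redundancy term is a simple stand-in for the paper's submodular mutual information; the names and the `alpha` weight are illustrative assumptions.

```python
import numpy as np

def diversity_adjusted_rewards(embeddings, rewards, alpha=0.5):
    """Downweight rewards of completions that are redundant with the rest of
    the sampled group; cosine redundancy stands in for submodular MI."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)                    # ignore self-similarity
    redundancy = np.clip(sim.max(axis=1), 0.0, 1.0)   # strongest match to a peer
    return rewards * (1.0 - alpha * redundancy)

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
emb[1] = emb[0] + 0.01 * rng.normal(size=8)  # a near-duplicate reasoning path
print(diversity_adjusted_rewards(emb, np.ones(4)))
# The two near-duplicates receive noticeably reduced rewards.
```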

[NLP-60] Next Word Suggestion using Graph Neural Network

[Quick Read]: This paper tackles a key sub-task in language modeling: context embedding. The key to its solution is using the Graph Convolution operation to encode context and combining it with an LSTM to predict the next word given the local context of preceding words.

Link: https://arxiv.org/abs/2505.09649
Authors: Abisha Thapa Magar,Anup Shakya
Institutions: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Language Modeling is a prevalent task in Natural Language Processing. The currently existing most recent and most successful language models often tend to build a massive model with billions of parameters, feed in a tremendous amount of text data, and train with enormous computation resources which require millions of dollars. In this project, we aim to address an important sub-task in language modeling, i.e., context embedding. We propose an approach to exploit the Graph Convolution operation in GNNs to encode the context and use it in coalition with LSTMs to predict the next word given a local context of preceding words. We test this on the custom Wikipedia text corpus using a very limited amount of resources and show that this approach works fairly well to predict the next word.
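
A minimal sketch of the context-encoding idea: one graph-convolution step over a word co-occurrence graph produces per-word context embeddings, which would then feed an LSTM predictor. The adjacency matrix, features, and sizes are toy assumptions, not the paper's setup.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

vocab = ["the", "cat", "sat", "on", "mat"]
# Hypothetical co-occurrence adjacency over a preceding-word window.
A = np.array([[0, 1, 0, 1, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [1, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
H = np.eye(5)                                    # one-hot node features
W = np.random.default_rng(0).normal(size=(5, 3))
context = gcn_layer(A, H, W)                     # per-word context embeddings
print(context.shape)  # (5, 3); these would then feed an LSTM next-word predictor
```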

Computer Vision

[CV-0] 3D-Fixup: Advancing Photo Editing with 3D Priors SIGGRAPH2025

[Quick Read]: This paper addresses 3D-aware image editing when the object is specified by only a single image, which is a challenging task. The key to its solution is 3D-Fixup, a new framework that guides 2D image editing with learned 3D priors. Specifically, the method uses video data to generate training pairs and incorporates 3D guidance from an Image-to-3D model, explicitly projecting 2D information into 3D space to enable complex, identity-consistent 3D-aware edits.

Link: https://arxiv.org/abs/2505.10566
Authors: Yen-Chi Cheng,Krishna Kumar Singh,Jae Shin Yoon,Alex Schwing,Liangyan Gui,Matheus Gadelha,Paul Guerrero,Nanxuan Zhao
Institutions: University of Illinois Urbana-Champaign (Urbana, Illinois, USA); Adobe Research (San Jose, California, USA); Adobe Research (London, UK)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH 2025. Project page: this https URL

Abstract:Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is only specified via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To achieve this, we leverage a training-based approach that harnesses the generative power of diffusion models. As video data naturally encodes real-world physical dynamics, we turn to video data for generating training data pairs, i.e., a source and a target frame. Rather than relying solely on a single trained model to infer transformations between source and target frames, we incorporate 3D guidance from an Image-to-3D model, which bridges this challenging task by explicitly projecting 2D information into 3D space. We design a data generation pipeline to ensure high-quality 3D guidance throughout training. Results show that by integrating these 3D priors, 3D-Fixup effectively supports complex, identity coherent 3D-aware edits, achieving high-quality results and advancing the application of diffusion models in realistic image manipulation. The code is provided at this https URL

[CV-1] Depth Anything with Any Prior

[Quick Read]: This paper addresses how to generate accurate, dense, and detailed metric depth maps, in particular how to fuse incomplete but precise metric information with relatively complete but less precise geometric structure. The key to the solution is a coarse-to-fine pipeline that pre-fills diverse metric priors via pixel-level metric alignment and distance-aware weighting, and introduces a conditioned monocular depth estimation (MDE) model to refine the inherent noise of the depth priors, effectively fusing the two complementary depth sources.

Link: https://arxiv.org/abs/2505.10565
Authors: Zehan Wang,Siyu Chen,Lihe Yang,Jialei Wang,Ziang Zhang,Hengshuang Zhao,Zhou Zhao
Institutions: Zhejiang University; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Home page: this https URL

Abstract:This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we introduce pixel-level metric alignment and distance-aware weighting to pre-fill diverse metric priors by explicitly using depth prediction. It effectively narrows the domain gap between prior patterns, enhancing generalization across varying scenarios. Second, we develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. By conditioning on the normalized pre-filled prior and prediction, the model further implicitly merges the two complementary depth sources. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets, matching or even surpassing previous task-specific methods. More importantly, it performs well on challenging, unseen mixed priors and enables test-time improvements by switching prediction models, providing a flexible accuracy-efficiency trade-off while evolving with advancements in MDE models.
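
As a rough sketch of the pre-filling step described above: align the relative prediction to the sparse metric measurements with a global least-squares scale and shift, then fill the unmeasured pixels with the aligned prediction. The paper's distance-aware weighting and conditioned MDE refinement are omitted here, and all names are illustrative.

```python
import numpy as np

def prefill_metric_depth(pred_rel, sparse_metric, mask):
    """Fit a global scale/shift mapping the relative prediction onto the
    sparse metric samples (least squares), then fill unmeasured pixels with
    the aligned prediction while keeping exact metric pixels untouched."""
    p, m = pred_rel[mask], sparse_metric[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, m, rcond=None)
    aligned = scale * pred_rel + shift
    return np.where(mask, sparse_metric, aligned)

rng = np.random.default_rng(1)
rel = rng.uniform(0.1, 1.0, size=(4, 4))        # relative (scale-free) depth
mask = np.zeros((4, 4), dtype=bool)
mask[::2, ::2] = True                            # sparse metric measurements
metric = np.where(mask, 2.0 * rel + 0.5, 0.0)    # metric depth at sampled pixels
print(prefill_metric_depth(rel, metric, mask))   # recovers 2*rel + 0.5 everywhere
```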

[CV-2] End-to-End Vision Tokenizer Tuning

[Quick Read]: This paper addresses a representation bottleneck in existing vision tokenization: the vision tokenizer is optimized separately from downstream training, so the visual tokens struggle to adapt to the needs of different tasks such as image generation and visual question answering. The key to the solution is ETT (End-to-End Vision Tokenizer Tuning), which jointly optimizes vision tokenization with the target autoregressive task, leveraging the visual embeddings of the tokenizer codebook and training end-to-end with both reconstruction and captioning objectives, thereby improving multimodal understanding and visual generation.

Link: https://arxiv.org/abs/2505.10562
Authors: Wenxuan Wang,Fan Zhang,Yufeng Cui,Haiwen Diao,Zhuoyan Luo,Huchuan Lu,Jing Liu,Xinlong Wang
Institutions: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence; Dalian University of Technology; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering. The vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics. This decoupled paradigm introduces a critical misalignment: The loss of the vision tokenization can be the representation bottleneck for target tasks. For example, errors in tokenizing text in a given image lead to poor results when recognizing or generating them. To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks. Unlike prior autoregressive models that use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook, and optimizes the vision tokenizers end-to-end with both reconstruction and caption objectives. ETT can be seamlessly integrated into existing training pipelines with minimal architecture modifications. Our ETT is simple to implement and integrate, without the need to adjust the original codebooks or architectures of the employed large language models. Extensive experiments demonstrate that our proposed end-to-end vision tokenizer tuning unlocks significant performance gains, i.e., 2-6% for multimodal understanding and visual generation tasks compared to frozen tokenizer baselines, while preserving the original reconstruction capability. We hope this very simple and strong method can empower multimodal foundation models besides image generation and understanding.

[CV-3] Style Customization of Text-to-Vector Generation with Image Diffusion Priors SIGGRAPH2025

[Quick Read]: This paper addresses the lack of style customization in existing text-to-vector (T2V) generation methods, which is essential for producing collections of vector graphics with a consistent visual appearance and coherent aesthetics. Existing methods face two main challenges: optimization-based T2V models can exploit text-to-image (T2I) priors for customization but struggle to maintain structural regularity, while feed-forward T2V models ensure structural regularity but have difficulty disentangling content and style due to limited SVG training data. The key to the proposed solution is a two-stage style customization pipeline that combines the strengths of feed-forward T2V models and T2I image priors: the first stage trains a T2V diffusion model with a path-level representation to guarantee the structural regularity of SVGs while preserving expressive capability, and the second stage customizes the model to different styles by distilling customized T2I models, efficiently generating high-quality, stylistically diverse SVGs.

Link: https://arxiv.org/abs/2505.10558
Authors: Peiying Zhang,Nanxuan Zhao,Jing Liao
Institutions: City University of Hong Kong; Adobe Research
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by SIGGRAPH 2025 (Conference Paper). Project page: this https URL

Abstract:Scalable Vector Graphics (SVGs) are highly favored by designers due to their resolution independence and well-organized layer structure. Although existing text-to-vector (T2V) generation methods can create SVGs from text prompts, they often overlook an important need in practical applications: style customization, which is vital for producing a collection of vector graphics with consistent visual appearance and coherent aesthetics. Extending existing T2V methods for style customization poses certain challenges. Optimization-based T2V models can utilize the priors of text-to-image (T2I) models for customization, but struggle with maintaining structural regularity. On the other hand, feed-forward T2V models can ensure structural regularity, yet they encounter difficulties in disentangling content and style due to limited SVG training data. To address these challenges, we propose a novel two-stage style customization pipeline for SVG generation, making use of the advantages of both feed-forward T2V models and T2I image priors. In the first stage, we train a T2V diffusion model with a path-level representation to ensure the structural regularity of SVGs while preserving diverse expressive capabilities. In the second stage, we customize the T2V diffusion model to different styles by distilling customized T2I models. By integrating these techniques, our pipeline can generate high-quality and diverse SVGs in custom styles based on text prompts in an efficient feed-forward manner. The effectiveness of our method has been validated through extensive experiments. The project page is this https URL.

[CV-4] Does Feasibility Matter? Understanding the Impact of Feasibility on Synthetic Training Data CVPR

[Quick Read]: This paper asks whether feasibility must be enforced when generating synthetic training data for CLIP-based classifiers. The core question is whether infeasible images should be excluded to improve generalization to real-world data. The key to the solution is the VariReal pipeline, which minimally edits a source image to introduce feasible or infeasible attributes specified by text prompts generated by a large language model, enabling a systematic evaluation of how feasibility affects model performance.

Link: https://arxiv.org/abs/2505.10551
Authors: Yiwen Liu,Jessica Bader,Jae Myung Kim
Institutions: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPRW 2025

Abstract:With the development of photorealistic diffusion models, models trained in part or fully on synthetic data achieve progressively better results. However, diffusion models still routinely generate images that would not exist in reality, such as a dog floating above the ground or with unrealistic texture artifacts. We define the concept of feasibility as whether attributes in a synthetic image could realistically exist in the real-world domain; synthetic images containing attributes that violate this criterion are considered infeasible. Intuitively, infeasible images are typically considered out-of-distribution; thus, training on such images is expected to hinder a model's ability to generalize to real-world data, and they should therefore be excluded from the training set whenever possible. However, does feasibility really matter? In this paper, we investigate whether enforcing feasibility is necessary when generating synthetic training data for CLIP-based classifiers, focusing on three target attributes: background, color, and texture. We introduce VariReal, a pipeline that minimally edits a given source image to include feasible or infeasible attributes given by the textual prompt generated by a large language model. Our experiments show that feasibility minimally affects LoRA-fine-tuned CLIP performance, with mostly less than 0.3% difference in top-1 accuracy across three fine-grained datasets. Also, which attribute is edited matters in determining whether feasible or infeasible images adversely influence classification performance. Finally, mixing feasible and infeasible images in training datasets does not significantly impact performance compared to using purely feasible or infeasible datasets.

[CV-5] Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis

[Quick Read]: This paper addresses implicit visual misunderstanding (IVM) in current multimodal large language models (MLLMs), where a model gives the correct answer without genuinely understanding the visual input. The key to the solution is decoupling the visual and textual modalities within the causal attention module, revealing that as layers deepen, the attention distribution increasingly converges on the image associated with the correct answer. This insight motivates a scale-agnostic metric, attention accuracy, which evaluates visual understanding directly through the model's internal mechanisms and remains robust to positional biases, yielding more reliable assessment.

Link: https://arxiv.org/abs/2505.10541
Authors: Pengfei Wang,Guohai Xu,Weinong Wang,Junjie Yang,Jie Lou,Yunhua Xue
Institutions: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements have enhanced the capability of Multimodal Large Language Models (MLLMs) to comprehend multi-image information. However, existing benchmarks primarily evaluate answer correctness, overlooking whether models genuinely comprehend the visual input. To address this, we define implicit visual misunderstanding (IVM), where MLLMs provide correct answers without fully comprehending the visual input. Through our analysis, we decouple the visual and textual modalities within the causal attention module, revealing that attention distribution increasingly converges on the image associated with the correct answer as the network layers deepen. This insight leads to the introduction of a scale-agnostic metric, attention accuracy, and a novel benchmark for quantifying IVMs. Attention accuracy directly evaluates the model's visual understanding via internal mechanisms, remaining robust to positional biases for more reliable assessments. Furthermore, we extend our approach to finer granularities and demonstrate its effectiveness in unimodal scenarios, underscoring its versatility and generalizability.
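
A minimal sketch of how an attention-accuracy-style metric could be computed, assuming the attention mass over candidate images has already been aggregated per sample (that aggregation inside the model is the part this sketch does not attempt to reproduce):

```python
import numpy as np

def attention_accuracy(attn_per_image, correct_idx):
    """attn_per_image: (num_samples, num_images) attention mass that answer
    tokens place on each candidate image; a sample counts as 'attended
    correctly' when the ground-truth image receives the most mass."""
    return float(np.mean(np.argmax(attn_per_image, axis=1) == correct_idx))

attn = np.array([[0.6, 0.3, 0.1],   # attends to image 0 (correct)
                 [0.2, 0.5, 0.3],   # attends to image 1 (correct)
                 [0.1, 0.7, 0.2]])  # attends to image 1 (incorrect)
print(attention_accuracy(attn, correct_idx=np.array([0, 1, 0])))  # 0.666...
```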

[CV-6] Enhancing Multi-Image Question Answering via Submodular Subset Selection

[Quick Read]: This paper addresses scalability issues and degraded retrieval performance when large multimodal models (LMMs) handle many images in the Multiple Image Question Answering setting. The key to the solution is an improvement to the retrieval framework of the MIRAGE model: subset selection with submodular functions, using query-aware submodular functions such as GraphCut to pre-select a semantically relevant subset of images before the main retrieval component, improving the effectiveness of the retrieval pipeline, especially at large haystack sizes.

Link: https://arxiv.org/abs/2505.10533
Authors: Aaryan Sharma,Shivansh Gupta,Samar Agarwal,Vishak Prasad C.,Ganesh Ramakrishnan
Institutions: Indian Institute of Technology Bombay
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Large multimodal models (LMMs) have achieved high performance in vision-language tasks involving a single image, but they struggle when presented with a collection of multiple images (the Multiple Image Question Answering scenario). These tasks, which involve reasoning over large numbers of images, present issues in scalability (with an increasing number of images) and retrieval performance. In this work, we propose an enhancement for the retriever framework introduced in the MIRAGE model using submodular subset selection techniques. Our method leverages query-aware submodular functions, such as GraphCut, to pre-select a subset of semantically relevant images before the main retrieval component. We demonstrate that using anchor-based queries and augmenting the data improves the effectiveness of the submodular-retriever pipeline, particularly at large haystack sizes.
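
To illustrate the submodular pre-selection step, here is a small greedy sketch of a query-aware GraphCut-style objective: relevance to the query minus a penalty for similarity to already selected images. The exact objective and its weighting are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def graphcut_select(img_emb, query_emb, k, lam=0.5):
    """Greedily pick k images maximizing query relevance while penalizing
    redundancy with images already selected (GraphCut-style trade-off)."""
    sim_q = img_emb @ query_emb        # relevance of each image to the query
    sim_ii = img_emb @ img_emb.T       # image-image similarity (redundancy)
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(img_emb)):
            if i in selected:
                continue
            redundancy = max((sim_ii[i, j] for j in selected), default=0.0)
            gain = sim_q[i] - lam * redundancy
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 16)); E /= np.linalg.norm(E, axis=1, keepdims=True)
q = E[0] + 0.1 * rng.normal(size=16); q /= np.linalg.norm(q)
print(graphcut_select(E, q, k=3))  # a small, diverse, query-relevant subset
```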

[CV-7] MorphGuard: Morph Specific Margin Loss for Enhancing Robustness to Face Morphing Attacks

[Quick Read]: This paper aims to improve the security of deep-learning face recognition systems against face morphing attacks, in which face images of two or more identities are blended so that one identity can impersonate another. The key to the solution is a dual-branch classification strategy that adapts the classification task to handle the label ambiguity of face morphs, allowing the model to use morph images effectively during training and improving its ability to distinguish morphs from bona fide samples.

Link: https://arxiv.org/abs/2505.10497
Authors: Iurii Medvedev,Nuno Goncalves
Institutions: Institute of Systems and Robotics, University of Coimbra; Portuguese Mint and Official Printing Office (INCM)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face recognition has evolved significantly with the advancement of deep learning techniques, enabling its widespread adoption in various applications requiring secure authentication. However, this progress has also increased its exposure to presentation attacks, including face morphing, which poses a serious security threat by allowing one identity to impersonate another. Therefore, modern face recognition systems must be robust against such attacks. In this work, we propose a novel approach for training deep networks for face recognition with enhanced robustness to face morphing attacks. Our method modifies the classification task by introducing a dual-branch classification strategy that effectively handles the ambiguity in the labeling of face morphs. This adaptation allows the model to incorporate morph images into the training process, improving its ability to distinguish them from bona fide samples. Our strategy has been validated on public benchmarks, demonstrating its effectiveness in enhancing robustness against face morphing attacks. Furthermore, our approach is universally applicable and can be integrated into existing face recognition training pipelines to improve classification-based recognition methods.
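
A minimal PyTorch sketch of what a dual-branch head could look like: one branch classifies identity, the other separates bona fide images from morphs, so ambiguously labeled morphs still contribute a training signal. Layer sizes, the loss weighting, and the masking rule are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DualBranchFaceHead(nn.Module):
    """Shared features feed two heads: identity logits and a bona fide vs.
    morph branch that lets label-ambiguous morphs still provide a signal."""
    def __init__(self, feat_dim=512, num_ids=1000):
        super().__init__()
        self.id_head = nn.Linear(feat_dim, num_ids)   # identity classification
        self.morph_head = nn.Linear(feat_dim, 2)      # bona fide / morph

    def forward(self, feats):
        return self.id_head(feats), self.morph_head(feats)

def loss_fn(id_logits, morph_logits, id_label, is_morph, w=0.5):
    ce = nn.functional.cross_entropy
    # Skip identity loss for morphs, whose identity label is ambiguous.
    id_loss = ce(id_logits[~is_morph], id_label[~is_morph])
    morph_loss = ce(morph_logits, is_morph.long())
    return id_loss + w * morph_loss

head = DualBranchFaceHead()
feats = torch.randn(8, 512)                    # stand-in backbone features
id_logits, morph_logits = head(feats)
is_morph = torch.tensor([0, 0, 1, 0, 1, 0, 0, 1], dtype=torch.bool)
loss = loss_fn(id_logits, morph_logits, torch.randint(0, 1000, (8,)), is_morph)
loss.backward()
```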

[CV-8] CheXGenBench: A Unified Benchmark For Fidelity Privacy and Utility of Synthetic Chest Radiographs

[Quick Read]: This paper addresses inconsistent evaluation methodology, outdated architectural comparisons, and evaluation criteria disconnected from clinical practice in generative AI for medical image synthesis. The key to the solution is CheXGenBench, a standardized, multifaceted evaluation framework that, through a unified protocol and more than 20 quantitative metrics, systematically analyzes generation quality, potential privacy risks, and downstream clinical applicability, establishing a benchmark for the medical AI community that enables objective, reproducible model comparison and seamless integration of existing and future generative models.

Link: https://arxiv.org/abs/2505.10496
Authors: Raman Dutt,Pedro Sanchez,Yongchen Yao,Steven McDonagh,Sotirios A. Tsaftaris,Timothy Hospedales
Institutions: The University of Edinburgh; Sinkove; Samsung AI Center, Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce CheXGenBench, a rigorous and multifaceted evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and clinical utility across state-of-the-art text-to-image generative models. Despite rapid advancements in generative AI for real-world imagery, medical domain evaluations have been hindered by methodological inconsistencies, outdated architectural comparisons, and disconnected assessment criteria that rarely address the practical clinical value of synthetic samples. CheXGenBench overcomes these limitations through standardised data partitioning and a unified evaluation protocol comprising over 20 quantitative metrics that systematically analyse generation quality, potential privacy vulnerabilities, and downstream clinical applicability across 11 leading text-to-image architectures. Our results reveal critical inefficiencies in the existing evaluation protocols, particularly in assessing generative fidelity, leading to inconsistent and uninformative comparisons. Our framework establishes a standardised benchmark for the medical AI community, enabling objective and reproducible comparisons while facilitating seamless integration of both existing and future generative models. Additionally, we release a high-quality, synthetic dataset, SynthCheX-75K, comprising 75K radiographs generated by the top-performing model (Sana 0.6B) in our benchmark to support further research in this critical domain. Through CheXGenBench, we establish a new state-of-the-art and release our framework, models, and SynthCheX-75K dataset at this https URL

[CV-9] UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation

[Quick Read]: This paper addresses the lack of a unified evaluation framework for unified multimodal understanding and generation models; existing evaluations lack overall results, depend on extra evaluation models and large numbers of labeled images, use benchmarks with limited diversity, and have limited ability to assess instruction following. The key to the solution is UniEval, the first unified evaluation framework that requires no extra models, images, or annotations; its core comprises the holistic UniBench benchmark and the corresponding UniScore metric, enabling a simplified and unified evaluation process.

Link: https://arxiv.org/abs/2505.10483
Authors: Yi Li,Haonan Wang,Qixiang Zhang,Boyu Xiao,Chenchang Hu,Hualiang Wang,Xiaomeng Li
Institutions: Cranberry-Lemon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: UniEval is the first evaluation framework designed for unified multimodal models, including a holistic benchmark UniBench and the UniScore metric

Abstract:The emergence of unified multimodal understanding and generation models is rapidly attracting attention because of their ability to enhance instruction-following capabilities while minimizing model redundancy. However, there is a lack of a unified evaluation framework for these models, which would enable an elegant, simplified, and overall evaluation. Current models conduct evaluations on multiple task-specific benchmarks, but there are significant limitations, such as the lack of overall results, errors from extra evaluation models, reliance on extensive labeled images, benchmarks that lack diversity, and metrics with limited capacity for instruction-following evaluation. To tackle these challenges, we introduce UniEval, the first evaluation framework designed for unified multimodal models without extra models, images, or annotations. This facilitates a simplified and unified evaluation process. The UniEval framework contains a holistic benchmark, UniBench (supports both unified and visual generation models), along with the corresponding UniScore metric. UniBench includes 81 fine-grained tags contributing to high diversity. Experimental results indicate that UniBench is more challenging than existing benchmarks, and UniScore aligns closely with human evaluations, surpassing current metrics. Moreover, we extensively evaluated SoTA unified and visual generation models, uncovering new insights into UniEval's unique values.

[CV-10] Logos as a Well-Tempered Pre-train for Sign Language Recognition

[Quick Read]: This paper addresses two key issues in isolated sign language recognition (ISLR): first, most sign languages have limited data, making cross-language ISLR model training (including transfer learning) challenging; second, similar signs can carry different semantic meanings, creating labeling ambiguity and raising the question of the best annotation policy. To address these issues, the paper presents Logos, the largest Russian Sign Language (RSL) dataset to date, with many signers and a rich vocabulary. A key feature of Logos is explicit annotation of groups of visually similar signs; experiments show that this annotation improves the model's quality as a visual encoder for downstream tasks. Models pre-trained on the dataset also perform well on SLR tasks in other languages, particularly in few-shot settings, demonstrating their potential as universal encoders.

Link: https://arxiv.org/abs/2505.10481
Authors: Ilya Ovodov,Petr Surovtsev,Karina Kvanchiani,Alexander Kapitanov,Alexander Nagaev
Institutions: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper examines two aspects of the isolated sign language recognition (ISLR) task. First, despite the availability of a number of datasets, the amount of data for most individual sign languages is limited. It poses the challenge of cross-language ISLR model training, including transfer learning. Second, similar signs can have different semantic meanings. It leads to ambiguity in dataset labeling and raises the question of the best policy for annotating such signs. To address these issues, this study presents Logos, a novel Russian Sign Language (RSL) dataset, the most extensive ISLR dataset by the number of signers and one of the largest available datasets while also the largest RSL dataset in size and vocabulary. It is shown that a model pre-trained on the Logos dataset can be used as a universal encoder for other language SLR tasks, including few-shot learning. We explore cross-language transfer learning approaches and find that joint training using multiple classification heads benefits accuracy for the target low-resource datasets the most. The key feature of the Logos dataset is explicitly annotated visually similar sign groups. We show that explicitly labeling visually similar signs improves trained model quality as a visual encoder for downstream tasks. Based on the proposed contributions, we outperform current state-of-the-art results for the WLASL dataset and get competitive results for the AUTSL dataset, with a single stream model processing solely RGB video. The source code, dataset, and pre-trained models are publicly available.

[CV-11] Consistent Quantity-Quality Control across Scenes for Deployment-Aware Gaussian Splatting

[Quick Read]: This paper addresses the inherent trade-off in 3D Gaussian splatting (3DGS) between high rendering quality and a reduced number of Gaussians. Existing methods pursue better quantity-quality performance but do not let users intuitively adjust this trade-off to practical needs such as deployment under diverse hardware and communication constraints. The key to the solution is ControlGS, a 3DGS optimization method that achieves semantically meaningful, cross-scene consistent quantity-quality control: with a single training run under a fixed setup and one user-specified hyperparameter, it automatically finds desirable quantity-quality trade-off points across diverse scenes, supports a continuously adjustable trade-off range, and maintains strong performance.

Link: https://arxiv.org/abs/2505.10473
Authors: Fengdi Zhang,Hongkun Cao,Ruqi Huang
Institutions: Shenzhen International Graduate School, Tsinghua University; Pengcheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:To reduce storage and computational costs, 3D Gaussian splatting (3DGS) seeks to minimize the number of Gaussians used while preserving high rendering quality, introducing an inherent trade-off between Gaussian quantity and rendering quality. Existing methods strive for better quantity-quality performance, but lack the ability for users to intuitively adjust this trade-off to suit practical needs such as model deployment under diverse hardware and communication constraints. Here, we present ControlGS, a 3DGS optimization method that achieves semantically meaningful and cross-scene consistent quantity-quality control while maintaining strong quantity-quality performance. Through a single training run using a fixed setup and a user-specified hyperparameter reflecting quantity-quality preference, ControlGS can automatically find desirable quantity-quality trade-off points across diverse scenes, from compact objects to large outdoor scenes. It also outperforms baselines by achieving higher rendering quality with fewer Gaussians, and supports a broad adjustment range with stepless control over the trade-off.

[CV-12] SEAL: Searching Expandable Architectures for Incremental Learning

[Quick Read]: This paper addresses how to balance plasticity and stability when a model learns new tasks in incremental learning. Existing NAS-based incremental learning methods typically expand the model at every task, which is impractical in resource-constrained environments. The key to the proposed SEAL framework is to adapt the model structure dynamically, expanding only when necessary based on a capacity estimation metric, while preserving stability through cross-distillation after each expansion. The NAS component jointly searches for the architecture and the optimal expansion policy, reducing forgetting and improving accuracy while keeping the model small.

Link: https://arxiv.org/abs/2505.10457
Authors: Matteo Gambella,Vicente Javier Castro Solar,Manuel Roveri
Institutions: Politecnico di Milano
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures

Abstract:Incremental learning is a machine learning paradigm where a model learns from a sequential stream of tasks. This setting poses a key challenge: balancing plasticity (learning new tasks) and stability (preserving past knowledge). Neural Architecture Search (NAS), a branch of AutoML, automates the design of the architecture of Deep Neural Networks and has shown success in static settings. However, existing NAS-based approaches to incremental learning often rely on expanding the model at every task, making them impractical in resource-constrained environments. In this work, we introduce SEAL, a NAS-based framework tailored for data-incremental learning, a scenario where disjoint data samples arrive sequentially and are not stored for future access. SEAL adapts the model structure dynamically by expanding it only when necessary, based on a capacity estimation metric. Stability is preserved through cross-distillation training after each expansion step. The NAS component jointly searches for both the architecture and the optimal expansion policy. Experiments across multiple benchmarks demonstrate that SEAL effectively reduces forgetting and enhances accuracy while maintaining a lower model size compared to prior methods. These results highlight the promise of combining NAS and selective expansion for efficient, adaptive learning in incremental scenarios.

[CV-13] Vision language models have difficulty recognizing virtual objects

[Quick Read]: This paper addresses the shortcomings of vision language models (VLMs) in understanding the visuospatial properties of scenes in images. The study argues that testing how models handle virtual objects can effectively probe scene comprehension. The key to the solution is designing prompts that contain virtual objects; for example, given an image of a person standing under a tree, introducing the hypothetical "a kite is stuck in the tree" tests whether the model can update its scene representation and reason sensibly about the spatial relations among all three objects.

Link: https://arxiv.org/abs/2505.10453
Authors: Tyler Tran,Sangeet Khemlani,J.G. Trafton
Institutions: US Naval Research Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question about how well they comprehend the visuospatial properties of scenes depicted in the images they process. We argue that descriptions of virtual objects – objects that are not visually represented in an image – can help test scene comprehension in these AI systems. For example, an image that depicts a person standing under a tree can be paired with the following prompt: imagine that a kite is stuck in the tree. VLMs that comprehend the scene should update their representations and reason sensibly about the spatial relations between all three objects. We describe systematic evaluations of state-of-the-art VLMs and show that their ability to process virtual objects is inadequate.

[CV-14] PIF: Anomaly detection via preference embedding ICPR2020

[Quick Read]: This paper addresses anomaly detection with respect to structured patterns. The key to its solution is PIF, a novel method that combines the advantages of adaptive isolation methods with the flexibility of preference embedding. Concretely, the data are embedded in a high-dimensional space, and an efficient tree-based method, PI-Forest, computes anomaly scores, enabling measurement of arbitrary distances and isolation of points in the preference space.

Link: https://arxiv.org/abs/2505.10441
Authors: Filippo Leveni,Luca Magri,Giacomo Boracchi,Cesare Alippi
Institutions: Politecnico di Milano; Università della Svizzera italiana
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: Accepted at International Conference on Pattern Recognition (ICPR 2020)

Abstract:We address the problem of detecting anomalies with respect to structured patterns. To this end, we conceive a novel anomaly detection method called PIF, that combines the advantages of adaptive isolation methods with the flexibility of preference embedding. Specifically, we propose to embed the data in a high dimensional space where an efficient tree-based method, PI-Forest, is employed to compute an anomaly score. Experiments on synthetic and real datasets demonstrate that PIF favorably compares with state-of-the-art anomaly detection techniques, and confirm that PI-Forest is better at measuring arbitrary distances and isolate points in the preference space.
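
A rough sketch of the pipeline's two stages under strong simplifications: embed points by their soft preferences toward a pool of candidate models, then score anomalies in that preference space. A standard Isolation Forest (scikit-learn) stands in for the paper's PI-Forest, and the line models are toy assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest  # requires scikit-learn

def preference_embedding(points, line_models, tau=0.1):
    """Embed each 2D point by its soft preference (residual-based vote)
    toward a pool of candidate line models y = a*x + b."""
    prefs = [np.exp(-np.abs(points[:, 1] - (a * points[:, 0] + b)) / tau)
             for a, b in line_models]
    return np.stack(prefs, axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
inliers = np.stack([x, 2 * x + 0.02 * rng.normal(size=200)], axis=1)  # one structure
outliers = rng.uniform(-1, 3, size=(10, 2))
X = np.vstack([inliers, outliers])

emb = preference_embedding(X, line_models=[(2.0, 0.0), (0.5, 1.0), (-1.0, 2.0)])
scores = IsolationForest(random_state=0).fit(emb).score_samples(emb)
print("inliers:", scores[:200].mean(), "outliers:", scores[200:].mean())
# Outliers get clearly lower scores: they are easy to isolate in preference space.
```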

[CV-15] Learned Lightweight Smartphone ISP with Unpaired Data CVPR

[Quick Read]: This paper addresses the difficulty and cost of acquiring pixel-wise aligned paired data (i.e., correspondences between raw images captured by a smartphone camera sensor and high-quality reference images) when developing a learnable image signal processor (ISP). The key to the solution is a novel training method that removes the need for direct correspondences between raw images and ground truth with matching content: it uses a multi-term loss function guided by adversarial training with multiple discriminators that process feature maps from pre-trained networks, learning the color and texture characteristics of the target RGB dataset while preserving content structure.

Link: https://arxiv.org/abs/2505.10420
Authors: Andrei Arhire,Radu Timofte
Institutions: Alexandru Ioan Cuza University of Iasi; University of Würzburg
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at CVPRW 2025

Abstract:The Image Signal Processor (ISP) is a fundamental component in modern smartphone cameras responsible for conversion of RAW sensor image data to RGB images with a strong focus on perceptual quality. Recent work highlights the potential of deep learning approaches and their ability to capture details with a quality increasingly close to that of professional cameras. A difficult and costly step when developing a learned ISP is the acquisition of pixel-wise aligned paired data that maps the raw captured by a smartphone camera sensor to high-quality reference images. In this work, we address this challenge by proposing a novel training method for a learnable ISP that eliminates the need for direct correspondences between raw images and ground-truth data with matching content. Our unpaired approach employs a multi-term loss function guided by adversarial training with multiple discriminators processing feature maps from pre-trained networks to maintain content structure while learning color and texture characteristics from the target RGB dataset. Using lightweight neural network architectures suitable for mobile devices as backbones, we evaluated our method on the Zurich RAW to RGB and Fujifilm UltraISP datasets. Compared to paired training methods, our unpaired learning strategy shows strong potential and achieves high fidelity across multiple evaluation metrics. The code and pre-trained models are available at this https URL .

[CV-16] SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and O(T) Complexity ICML2025

[Quick Read]: This paper addresses the fact that existing SNN-based Transformers fail to exploit the efficiency of spiking neural networks (SNNs) in video vision tasks, especially when handling temporal sequences. The key to the solution is a spike-driven Hamming attention (SDHA), which provides a theoretically guided adaptation from conventional real-valued attention to spike-driven attention; on this basis, various spike-driven space-time attention designs are analyzed and an optimal scheme is identified that improves performance on video tasks while keeping linear temporal complexity.

Link: https://arxiv.org/abs/2505.10352
Authors: Shihao Zou,Qingfeng Li,Wei Ji,Jingjing Li,Yongkui Yang,Guoqi Li,Chao Dong
Institutions: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICML 2025

Abstract:Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity O(T). Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving ×16, ×10 and ×5 improvements on the three tasks. this https URL
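
To make the Hamming-attention idea tangible, here is a toy sketch with binary spike tensors: query-key scores come from normalized Hamming similarity rather than real-valued dot products, and values are aggregated with a softmax-free normalized sum. This is an illustrative reading of the abstract, not the paper's SDHA or its linear-time space-time scheme.

```python
import numpy as np

def hamming_attention(Q, K, V, eps=1e-9):
    """Score each query-key pair by normalized Hamming similarity between
    binary spike vectors, then aggregate values with a softmax-free
    normalized sum."""
    sim = 1.0 - np.abs(Q[:, None, :] - K[None, :, :]).mean(axis=-1)
    attn = sim / (sim.sum(axis=1, keepdims=True) + eps)
    return attn @ V

rng = np.random.default_rng(0)
Q = (rng.uniform(size=(5, 8)) > 0.5).astype(float)  # binary spike tensors
K = (rng.uniform(size=(5, 8)) > 0.5).astype(float)
V = (rng.uniform(size=(5, 8)) > 0.5).astype(float)
print(hamming_attention(Q, K, V).shape)  # (5, 8)
```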

[CV-17] A Unified and Scalable Membership Inference Method for Visual Self-supervised Encoder via Part-aware Capability CCS2024

[Quick Read]: This paper addresses privacy leakage in self-supervised vision models, specifically membership inference attacks in the realistic setting where the adversary has no access to the training method or its details. The key to the solution is PartCrop, a unified membership inference method whose core idea is to exploit the shared part-aware capability among models and the stronger part response on training data: it crops object parts from an image and queries their responses within the image in representation space.

Link: https://arxiv.org/abs/2505.10351
Authors: Jie Zhu,Jirong Zha,Ding Li,Leye Wang
Institutions: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: An extension of our ACM CCS 2024 conference paper (arXiv:2404.02462). We show the impacts of scaling from both data and model aspects on membership inference for self-supervised visual encoders

Abstract:Self-supervised learning shows promise in harnessing extensive unlabeled data, but it also confronts significant privacy concerns, especially in vision. In this paper, we perform membership inference on visual self-supervised models in a more realistic setting: self-supervised training method and details are unknown for an adversary when attacking as he usually faces a black-box system in practice. In this setting, considering that self-supervised model could be trained by completely different self-supervised paradigms, e.g., masked image modeling and contrastive learning, with complex training details, we propose a unified membership inference method called PartCrop. It is motivated by the shared part-aware capability among models and stronger part response on the training data. Specifically, PartCrop crops parts of objects in an image to query responses within the image in representation space. We conduct extensive attacks on self-supervised models with different training protocols and structures using three widely used image datasets. The results verify the effectiveness and generalization of PartCrop. Moreover, to defend against PartCrop, we evaluate two common approaches, i.e., early stop and differential privacy, and propose a tailored method called shrinking crop scale range. The defense experiments indicate that all of them are effective. Finally, besides prototype testing on toy visual encoders and small-scale image datasets, we quantitatively study the impacts of scaling from both data and model aspects in a realistic scenario and propose a scalable PartCrop-v2 by introducing two structural improvements to PartCrop. Our code is at this https URL.
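
A minimal sketch of the crop-and-query intuition: embed cropped object parts and measure their response against the full-image embedding, expecting stronger responses for training (member) images. The embeddings below are random stand-ins for a self-supervised encoder's outputs, so the whole example is illustrative only.

```python
import numpy as np

def partcrop_score(image_feat, part_feats):
    """Average similarity between part-crop embeddings and the full-image
    embedding; higher responses suggest the image was in the training set."""
    img = image_feat / np.linalg.norm(image_feat)
    parts = part_feats / np.linalg.norm(part_feats, axis=1, keepdims=True)
    return float((parts @ img).mean())

rng = np.random.default_rng(0)
img = rng.normal(size=64)
member_parts = img + 0.3 * rng.normal(size=(8, 64))   # parts echo the image
nonmember_parts = rng.normal(size=(8, 64))            # unrelated responses
print(partcrop_score(img, member_parts), partcrop_score(img, nonmember_parts))
```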

[CV-18] SOS: A Shuffle Order Strategy for Data Augmentation in Industrial Human Activity Recognition

[Quick Read]: This paper aims to address the difficulty of obtaining high-quality and diverse data for Human Activity Recognition (HAR), a persistent problem due to the high cost of real-world scenarios and the inherent variability of activities. The key to the solution is generating datasets with deep learning methods (an attention autoencoder and conditional generative adversarial networks) and applying a random sequence shuffling strategy that breaks temporal dependencies, improving the model's robustness to activity transitions. Experimental results show that this approach significantly improves classification performance, reaching an accuracy of 0.70 ± 0.03 and a macro F1 score of 0.64 ± 0.01.

Link: https://arxiv.org/abs/2505.10312
Authors: Anh Tuan Ha,Hoang Khang Phan,Thai Minh Tien Ngo,Anh Phan Truong,Nhat Tan Le
Institutions: Ho Chi Minh City University of Technology - Vietnam National University
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the realm of Human Activity Recognition (HAR), obtaining high-quality and varied data is still a persistent challenge due to high costs and the inherent variability of real-world activities. This study introduces a dataset generated by deep learning approaches (an attention autoencoder and conditional generative adversarial networks). Another critical challenge is data heterogeneity; one of the solutions is to shuffle the data to homogenize the distribution. Experimental results demonstrate that the random sequence strategy significantly improves classification performance, achieving an accuracy of up to 0.70 ± 0.03 and a macro F1 score of 0.64 ± 0.01. Disrupting temporal dependencies through random sequence reordering compels the model to focus on instantaneous recognition, thereby improving robustness against activity transitions. This approach not only broadens the effective training dataset but also offers promising avenues for enhancing HAR systems in complex, real-world scenarios.
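
The shuffle strategy itself is simple to sketch: permute the order of pre-segmented activity windows (leaving each window intact) so that the classifier cannot rely on temporal continuity between consecutive windows. The shapes and labels below are hypothetical.

```python
import numpy as np

def shuffle_order(windows, labels, seed=0):
    """Randomly reorder pre-segmented activity windows (each window is left
    intact), breaking temporal dependencies between consecutive windows."""
    idx = np.random.default_rng(seed).permutation(len(windows))
    return windows[idx], labels[idx]

# Hypothetical HAR stream: 6 windows of 50 timesteps x 3 accelerometer axes.
windows = np.arange(6 * 50 * 3, dtype=float).reshape(6, 50, 3)
labels = np.array([0, 0, 1, 1, 2, 2])
_, shuffled_labels = shuffle_order(windows, labels)
print(shuffled_labels)  # activity transitions are scrambled across the stream
```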

[CV-19] MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models

[Quick Read]: This paper addresses the limitations of routine Hematoxylin and Eosin (H&E) staining for cell-type identification in cancer diagnosis, and the fact that multiplex immunofluorescence (mIF) has not seen wide clinical adoption due to cost and logistical constraints. The key to the solution is MIPHEI, a U-Net-style model that uses state-of-the-art Vision Transformer (ViT) foundation models as encoders to predict mIF signals from H&E images, achieving accurate classification across a range of cell-type markers.

Link: https://arxiv.org/abs/2505.10294
Authors: Guillaume Balezo,Roger Trullo,Albert Pla Planas,Etienne Decenciere,Thomas Walter
Institutions: Sanofi; InstaDeep; Centre de Morphologie Mathématique, Mines Paris PSL; Center for Computational Biology, Mines Paris, PSL; Institut Curie; INSERM U1331
Categories: Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)
Comments:

Abstract:Histopathological analysis is a cornerstone of cancer diagnosis, with Hematoxylin and Eosin (H&E) staining routinely acquired for every patient to visualize cell morphology and tissue architecture. On the other hand, multiplex immunofluorescence (mIF) enables more precise cell type identification via proteomic markers, but has yet to achieve widespread clinical adoption due to cost and logistical constraints. To bridge this gap, we introduce MIPHEI (Multiplex Immunofluorescence Prediction from H&E), a U-Net-inspired architecture that integrates state-of-the-art ViT foundation models as encoders to predict mIF signals from H&E images. MIPHEI targets a comprehensive panel of markers spanning nuclear content, immune lineages (T cells, B cells, myeloid), epithelium, stroma, vasculature, and proliferation. We train our model using the publicly available ORION dataset of restained H&E and mIF images from colorectal cancer tissue, and validate it on two independent datasets. MIPHEI achieves accurate cell-type classification from H&E alone, with F1 scores of 0.88 for Pan-CK, 0.57 for CD3e, 0.56 for SMA, 0.36 for CD68, and 0.30 for CD20, substantially outperforming both a state-of-the-art baseline and a random classifier for most markers. Our results indicate that our model effectively captures the complex relationships between nuclear morphologies in their tissue context, as visible in H&E images, and molecular markers defining specific cell types. MIPHEI offers a promising step toward enabling cell-type-aware analysis of large-scale H&E datasets, in view of uncovering relationships between spatial cellular organization and patient outcomes.

[CV-20] MSCI: Addressing CLIPs Inherent Limitations for Compositional Zero-Shot Learning

[Quick Read]: This paper addresses recognition of unseen state-object compositions in Compositional Zero-Shot Learning (CZSL), in particular the fact that existing methods rely on CLIP's cross-modal alignment while overlooking its limitations in capturing fine-grained local features. The key to the solution is a Multi-Stage Cross-modal Interaction (MSCI) model that effectively exploits intermediate-layer information from CLIP's visual encoder: two self-adaptive aggregators extract local information from low-level visual features and global information from high-level visual features, respectively, and a stage-by-stage interaction mechanism progressively incorporates this key information into the textual representations, significantly improving the model's perception of fine-grained local visual information.

Link: https://arxiv.org/abs/2505.10289
Authors: Yue Wang,Shuai Xu,Xuelin Zhu,Yicong Li
Institutions: Nanjing University of Aeronautics and Astronautics; Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology); The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures

Abstract:Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies basically rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architectural and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP’s visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. These key information are progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model’s perception capability for fine-grained local visual information. Additionally, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, as well as different elements within the same combination, allowing it to flexibly adapt to diverse scenarios. Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model. Data and code are available at this https URL.

[CV-21] MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting

[Quick Read]: This paper addresses the poor generalization of marine fog detection and forecasting models caused by the scarcity of public datasets. Existing datasets are usually limited to a single region or satellite, restricting evaluation under diverse conditions and the exploration of intrinsic marine fog characteristics. The key to the solution is MFogHub, the first multi-regional, multi-satellite annotated marine fog observation dataset, integrating data from 15 fog-prone coastal regions and six geostationary satellites with more than 68,000 high-resolution samples, providing diverse conditions for rigorous evaluation of detection and forecasting methods.

Link: https://arxiv.org/abs/2505.10281
Authors: Mengqiu Xu,Kaixin Chen,Heng Guo,Yixiang Huang,Ming Wu,Zhenwei Shi,Chuang Zhang,Jun Guo
Institutions: Beijing University of Posts and Telecommunications; Beihang University; Beijing Wuzi University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning approaches for marine fog detection and forecasting have outperformed traditional methods, demonstrating significant scientific and practical importance. However, the limited availability of open-source datasets remains a major challenge. Existing datasets, often focused on a single region or satellite, restrict the ability to evaluate model performance across diverse conditions and hinder the exploration of intrinsic marine fog characteristics. To address these limitations, we introduce MFogHub, the first multi-regional and multi-satellite dataset to integrate annotated marine fog observations from 15 coastal fog-prone regions and six geostationary satellites, comprising over 68,000 high-resolution samples. By encompassing diverse regions and satellite perspectives, MFogHub facilitates rigorous evaluation of both detection and forecasting methods under varying conditions. Extensive experiments with 16 baseline models demonstrate that MFogHub can reveal generalization fluctuations due to regional and satellite discrepancy, while also serving as a valuable resource for the development of targeted and scalable fog prediction techniques. Through MFogHub, we aim to advance both the practical monitoring and scientific understanding of marine fog dynamics on a global scale. The dataset and code are at this https URL.

[CV-22] RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours

[Quick Read]: This paper addresses accuracy and uncertainty quantification for high-resolution probabilistic precipitation forecasting over Europe, in particular the limitations of radar-only deep learning models at short lead times. The key to the solution is a deep learning model that efficiently fuses multiple data sources (radar, satellite, and physics-based numerical weather prediction, NWP) and captures long-range interactions, producing accurate probabilistic forecasts with robust uncertainty quantification while keeping a compact architecture that improves training efficiency and inference speed.

Link: https://arxiv.org/abs/2505.10271
Authors: Rafael Pablos Sarabia,Joachim Nyborg,Morten Birk,Jeppe Liborius Sjørup,Anders Lillevang Vesterholt,Ira Assent
Institutions: Aarhus University; Cordulus
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a deep learning model for high-resolution probabilistic precipitation forecasting over an 8-hour horizon in Europe, overcoming the limitations of radar-only deep learning models with short forecast lead times. Our model efficiently integrates multiple data sources - including radar, satellite, and physics-based numerical weather prediction (NWP) - while capturing long-range interactions, resulting in accurate forecasts with robust uncertainty quantification through consistent probabilistic maps. Featuring a compact architecture, it enables more efficient training and faster inference than existing models. Extensive experiments demonstrate that our model surpasses current operational NWP systems, extrapolation-based methods, and deep-learning nowcasting models, setting a new standard for high-resolution precipitation forecasting in Europe, ensuring a balance between accuracy, interpretability, and computational efficiency.

[CV-23] HandReader: Advanced Techniques for Efficient Fingerspelling Recognition

[Quick Read]: This paper aims to improve the accuracy of fingerspelling recognition in sign language, particularly addressing the limitations of existing approaches in handling the temporal dimension of video sequences. The key to the solution is the HandReader family of architectures, comprising HandReader_RGB, HandReader_KP, and HandReader_RGB+KP, which fuses the RGB and keypoint modalities. These architectures introduce the Temporal Shift-Adaptive Module (TSAM) and the Temporal Pose Encoder (TPE), respectively, to effectively extract spatio-temporal information from video while preserving key sequential features, improving fingerspelling recognition.

Link: https://arxiv.org/abs/2505.10267
Authors: Pavel Korotaev,Petr Surovtsev,Alexander Kapitanov,Karina Kvanchiani,Aleksandr Nagaev
Institutions: SberDevices
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: this https URL

Abstract:Fingerspelling is a significant component of Sign Language (SL), allowing the interpretation of proper names, characterized by fast hand movements during signing. Although previous works on fingerspelling recognition have focused on processing the temporal dimension of videos, there remains room for improving the accuracy of these approaches. This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader_RGB employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths while preserving important sequential information. HandReader_KP is built on the proposed Temporal Pose Encoder (TPE) operated on keypoints as tensors. Such keypoint composition in a batch allows the encoder to pass them through 2D and 3D convolution layers, utilizing temporal and spatial information and accumulating keypoint coordinates. We also introduce HandReader_RGB+KP, an architecture with a joint encoder that benefits from both the RGB and keypoint modalities. Each HandReader model possesses distinct advantages and achieves state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets. Moreover, the models demonstrate high performance on the first open dataset for Russian fingerspelling, Znaki, presented in this paper. The Znaki dataset and HandReader pre-trained models are publicly available.

[CV-24] Inferring Driving Maps by Deep Learning-based Trail Map Extraction CVPR

[Quick Read]: This paper addresses the challenges that online mapping faces in temporal consistency, sensor occlusion, runtime, and generalization. The key to the solution is a novel offline mapping approach that integrates trails, the informal routes used by drivers, into map creation: it aggregates trail data from the ego vehicle and other traffic participants and builds a comprehensive global map with transformer-based deep learning models, enabling continuously updated, sensor-agnostic mapping with efficient data transfer and better generalization to unseen environments and sensor configurations.

Link: https://arxiv.org/abs/2505.10258
Authors: Michael Hubbertz,Pascal Colling,Qi Han,Tobias Meisen
Institutions: University of Wuppertal; Aptiv Services Deutschland GmbH
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: This paper was accepted at the CVPR WAD 2025 Workshop

Abstract:High-definition (HD) maps offer extensive and accurate environmental information about the driving scene, making them a crucial and essential element for planning within autonomous driving systems. To avoid extensive efforts from manual labeling, methods for automating the map creation have emerged. Recent trends have moved from offline mapping to online mapping, ensuring availability and actuality of the utilized maps. While the performance has increased in recent years, online mapping still faces challenges regarding temporal consistency, sensor occlusion, runtime, and generalization. We propose a novel offline mapping approach that integrates trails - informal routes used by drivers - into the map creation process. Our method aggregates trail data from the ego vehicle and other traffic participants to construct a comprehensive global map using transformer-based deep learning models. Unlike traditional offline mapping, our approach enables continuous updates while remaining sensor-agnostic, facilitating efficient data transfer. Our method demonstrates superior performance compared to state-of-the-art online mapping approaches, achieving improved generalization to previously unseen environments and sensor configurations. We validate our approach on two benchmark datasets, highlighting its robustness and applicability in autonomous driving systems.

[CV-25] Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot

[Quick Read]: This paper aims to address how an intelligent cockpit can meet different users' needs for comfort, interaction, and safety. The key to the solution is SAGE DeeR, a super-aligned and generalist driving agent built around three capabilities: super alignment, generalist ability, and self-eliciting. These allow it to adapt to different users' preferences and behaviors and to reason and decide over physiological indicators, emotions, actions, and driving scenarios from multimodal, multi-view inputs.

Link: https://arxiv.org/abs/2505.10257
Authors: Hao Lu,Jiaqi Tang,Jiyao Wang,Yunfan LU,Xu Cao,Qingyong Hu,Yin Wang,Yuting Zhang,Tianxin Xie,Yunpeng Zhang,Yong Chen,Jiayu.Gao,Bin Huang,Dengbo He,Shuiguang Deng,Hao Chen,Ying-Cong Chen
Institutions: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The intelligent driving cockpit, an important part of intelligent driving, needs to match different users’ comfort, interaction, and safety needs. This paper aims to build a Super-Aligned and GEneralist DRiving agent, SAGE DeeR. Sage Deer achieves three highlights: (1) Super alignment: It achieves different reactions according to different people’s preferences and biases. (2) Generalist: It can understand the multi-view and multi-mode inputs to reason the user’s physiological indicators, facial emotions, hand movements, body movements, driving scenarios, and behavioral decisions. (3) Self-Eliciting: It can elicit implicit thought chains in the language space to further increase generalist and super-aligned abilities. Besides, we collected multiple data sets and built a large-scale benchmark. This benchmark measures the deer’s perceptual decision-making ability and the super alignment’s accuracy.

[CV-26] ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization ICML2025

【Quick Read】: This paper addresses the inherently ill-posed problem of human mesh recovery (HMR) from a single image, where depth ambiguity and occlusions create prediction uncertainty. Existing probabilistic methods generate multiple plausible 3D human mesh predictions but are often misaligned with the 2D image observations and lack robustness on in-the-wild images. The key to the solution is ADHMR, a framework that aligns a diffusion-based HMR model via preference optimization: it trains HMR-Scorer, a model able to assess mesh predictions even for in-the-wild images without 3D annotations, and uses it to build a preference dataset for fine-tuning, improving both accuracy and robustness.

Link: https://arxiv.org/abs/2505.10250
Authors: Wenhao Shen,Wanqi Yin,Xiaofeng Yang,Cheng Chen,Chaoyue Song,Zhongang Cai,Lei Yang,Hao Wang,Guosheng Lin
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2025. Code: this https URL

Click to view abstract

Abstract:Human mesh recovery (HMR) from a single image is inherently ill-posed due to depth ambiguity and occlusions. Probabilistic methods have tried to solve this by generating numerous plausible 3D human mesh predictions, but they often exhibit misalignment with 2D image observations and weak robustness to in-the-wild images. To address these issues, we propose ADHMR, a framework that Aligns a Diffusion-based HMR model in a preference optimization manner. First, we train a human mesh prediction assessment model, HMR-Scorer, capable of evaluating predictions even for in-the-wild images without 3D annotations. We then use HMR-Scorer to create a preference dataset, where each input image has a pair of winner and loser mesh predictions. This dataset is used to finetune the base model using direct preference optimization. Moreover, HMR-Scorer also helps improve existing HMR models by data cleaning, even with fewer training samples. Extensive experiments show that ADHMR outperforms current state-of-the-art methods. Code is available at: this https URL.

[CV-27] MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

【Quick Read】: This paper aims to overcome the limitations of existing human image animation methods, which rely on 2D-rendered pose images for motion guidance, restricting generalization and discarding essential 3D information. The key to the solution is the MTVCrafter framework, the first to directly model raw 4D motion sequences: its 4DMoT (4D motion tokenizer) quantizes 3D motion sequences into 4D motion tokens, which provide more robust spatio-temporal cues and avoid strict pixel-level alignment between pose image and character, enabling more flexible and disentangled control.

Link: https://arxiv.org/abs/2505.10238
Authors: Yanbo Ding
Institutions: Cranberry-Lemon University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Human image animation has gained increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information for open-world animation. To tackle this problem, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for human image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatio-temporal cues and avoid strict pixel-level alignment between pose image and character, enabling more flexible and disentangled control. Then, we introduce MV-DiT (Motion-aware Video DiT). By designing unique motion attention with 4D positional encodings, MV-DiT can effectively leverage motion tokens as 4D compact yet expressive context for human image animation in the complex 3D world. Hence, it marks a significant step forward in this field and opens a new direction for pose-guided human video generation. Experiments show that our MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98, surpassing the second-best by 65%. Powered by robust motion tokens, MTVCrafter also generalizes well to diverse open-world characters (single/multiple, full/half-body) across various styles and scenarios. Our video demos and code are provided in the supplementary material and at this anonymous GitHub link: this https URL.
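To make the tokenization step concrete: 4DMoT quantizes continuous motion features against a learned codebook, in the spirit of standard vector quantization. The sketch below shows only that nearest-neighbor lookup; the codebook size, feature dimension, and joint count are illustrative assumptions, not the paper's configuration.

```python
import torch

def quantize_motion(motion_feats: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbor vector quantization of per-frame motion features.

    motion_feats: (T, J, D) features for T frames and J joints.
    codebook:     (K, D) learned embedding vectors.
    Returns token ids (T, J) and the quantized features (T, J, D).
    """
    T, J, D = motion_feats.shape
    flat = motion_feats.reshape(-1, D)          # (T*J, D)
    dists = torch.cdist(flat, codebook)         # distance to every codebook entry
    ids = dists.argmin(dim=-1)                  # (T*J,)
    quantized = codebook[ids].reshape(T, J, D)
    return ids.reshape(T, J), quantized

codebook = torch.randn(8192, 64)   # hypothetical codebook size
feats = torch.randn(16, 24, 64)    # 16 frames, 24 joints
tokens, q = quantize_motion(feats, codebook)
print(tokens.shape, q.shape)       # torch.Size([16, 24]) torch.Size([16, 24, 64])
```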

[CV-28] Data-Agnostic Augmentations for Unknown Variations: Out-of-Distribution Generalisation in MRI Segmentation

【Quick Read】: This paper addresses the performance drop that medical image segmentation models suffer in real-world clinical settings when training and test distributions do not match. The key to the solution is a systematic evaluation of alternative augmentation strategies, in particular MixUp and Auxiliary Fourier Augmentation, which mitigate the effects of many kinds of variation without explicitly targeting specific sources of distribution shift, thereby improving out-of-distribution generalization and robustness to diverse imaging transformations.

Link: https://arxiv.org/abs/2505.10223
Authors: Puru Vaish,Felix Meister,Tobias Heimann,Christoph Brune,Jelmer M. Wolterink
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at MIDL 2025

Click to view abstract

Abstract:Medical image segmentation models are often trained on curated datasets, leading to performance degradation when deployed in real-world clinical settings due to mismatches between training and test distributions. While data augmentation techniques are widely used to address these challenges, traditional visually consistent augmentation strategies lack the robustness needed for diverse real-world scenarios. In this work, we systematically evaluate alternative augmentation strategies, focusing on MixUp and Auxiliary Fourier Augmentation. These methods mitigate the effects of multiple variations without explicitly targeting specific sources of distribution shifts. We demonstrate how these techniques significantly improve out-of-distribution generalization and robustness to imaging variations across a wide range of transformations in cardiac cine MRI and prostate MRI segmentation. We quantitatively find that these augmentation methods enhance learned feature representations by promoting separability and compactness. Additionally, we highlight how their integration into nnU-Net training pipelines provides an easy-to-implement, effective solution for enhancing the reliability of medical segmentation models in real-world applications.
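For readers unfamiliar with the first of the two augmentations, MixUp blends a pair of images and their label maps with a Beta-distributed coefficient. A minimal sketch for segmentation follows; the alpha value and soft one-hot label handling are common defaults, not necessarily the paper's exact settings.

```python
import numpy as np

def mixup_pair(img_a, img_b, mask_a, mask_b, alpha=0.4, rng=None):
    """Classic MixUp for segmentation: blend two images and their
    one-hot label maps with a Beta-distributed coefficient."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    img = lam * img_a + (1.0 - lam) * img_b
    mask = lam * mask_a + (1.0 - lam) * mask_b   # soft labels
    return img, mask, lam

# Toy example: 1-channel 64x64 images with 3-class one-hot masks.
a, b = np.random.rand(1, 64, 64), np.random.rand(1, 64, 64)
ma = np.eye(3)[np.random.randint(0, 3, (64, 64))].transpose(2, 0, 1)
mb = np.eye(3)[np.random.randint(0, 3, (64, 64))].transpose(2, 0, 1)
img, mask, lam = mixup_pair(a, b, ma, mb)
```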

[CV-29] VolE: A Point-cloud Framework for Food 3D Reconstruction and Volume Estimation

【Quick Read】: This paper targets accuracy in food volume estimation, which matters for medical nutrition management and health monitoring. Existing methods are typically limited to single-source data, special-purpose hardware (such as 3D scanners), or camera calibration that depends on a reference object. The key to the solution is the VolE framework, which uses mobile-device-driven 3D reconstruction: images and camera positions captured in free motion yield precise 3D models, and food video segmentation produces food masks, enabling real-world measurement without references or depth information.

Link: https://arxiv.org/abs/2505.10205
Authors: Umair Haroon,Ahmad AlMughrabi,Thanasis Zoumpekas,Ricardo Marques,Petia Radeva
Institutions: Universitat de Barcelona; Universitat Pompeu Fabra
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Accurate food volume estimation is crucial for medical nutrition management and health monitoring applications, but current food volume estimation methods are often limited by single-source data, leveraging single-purpose hardware such as 3D scanners, gathering sensor-oriented information such as depth information, or relying on camera calibration using a reference object. In this paper, we present VolE, a novel framework that leverages mobile device-driven 3D reconstruction to estimate food volume. VolE captures images and camera locations in free motion to generate precise 3D models, thanks to AR-capable mobile devices. To achieve real-world measurement, VolE is a reference- and depth-free framework that leverages food video segmentation for food mask generation. We also introduce a new food dataset encompassing the challenging scenarios absent in the previous benchmarks. Our experiments demonstrate that VolE outperforms the existing volume estimation techniques across multiple datasets by achieving 2.22% MAPE, highlighting its superior performance in food volume estimation.
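The abstract does not spell out the final volume computation, but once a watertight food mesh is reconstructed, a standard way to obtain its volume is the signed-tetrahedron (divergence theorem) formula sketched below; treat it as one plausible last step rather than VolE's documented implementation.

```python
import numpy as np

def mesh_volume(vertices: np.ndarray, faces: np.ndarray) -> float:
    """Volume of a closed, consistently oriented triangle mesh via the
    divergence theorem: V = (1/6) * |sum over faces of v0 . (v1 x v2)|."""
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    signed = np.einsum('ij,ij->i', v0, np.cross(v1, v2)) / 6.0
    return abs(signed.sum())

# A regular tetrahedron with these vertices has volume 1/6.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
faces = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])
print(mesh_volume(verts, faces))  # ~0.1667
```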

[CV-30] Modeling Saliency Dataset Bias

【Quick Read】: This paper addresses fixation prediction across multiple saliency datasets: models that approach gold-standard performance on one dataset generalize poorly to others. When a model trained on one dataset is applied to another, performance drops by around 40%, and increasing dataset diversity does not close this inter-dataset gap, with close to 60% of it attributable to dataset-specific biases. The key to the solution is a novel architecture that extends a mostly dataset-agnostic encoder-decoder with fewer than 20 dataset-specific parameters governing interpretable mechanisms such as multi-scale structure, center bias, and fixation spread. Adapting only these parameters closes more than 75% of the generalization gap, with substantial gains from as few as 50 samples.

Link: https://arxiv.org/abs/2505.10169
Authors: Matthias Kümmerer,Harneet Khanuja,Matthias Bethge
Institutions: Tübingen AI Center; University of Tübingen; Georgia Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent advances in image-based saliency prediction are approaching gold standard performance levels on existing benchmarks. Despite this success, we show that predicting fixations across multiple saliency datasets remains challenging due to dataset bias. We find a significant performance drop (around 40%) when models trained on one dataset are applied to another. Surprisingly, increasing dataset diversity does not resolve this inter-dataset gap, with close to 60% attributed to dataset-specific biases. To address this remaining generalization gap, we propose a novel architecture extending a mostly dataset-agnostic encoder-decoder structure with fewer than 20 dataset-specific parameters that govern interpretable mechanisms such as multi-scale structure, center bias, and fixation spread. Adapting only these parameters to new data accounts for more than 75% of the generalization gap, with a large fraction of the improvement achieved with as few as 50 samples. Our model sets a new state-of-the-art on all three datasets of the MIT/Tuebingen Saliency Benchmark (MIT300, CAT2000, and COCO-Freeview), even when purely generalizing from unrelated datasets, but with a substantial boost when adapting to the respective training datasets. The model also provides valuable insights into spatial saliency properties, revealing complex multi-scale effects that combine both absolute and relative sizes.
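As a toy illustration of what "fewer than 20 interpretable dataset-specific parameters" can look like, the sketch below adapts only a two-parameter anisotropic Gaussian center-bias prior on top of a frozen, dataset-agnostic saliency map; the paper's actual parameterization also covers multi-scale structure and fixation spread.

```python
import torch

class CenterBias(torch.nn.Module):
    """Dataset-specific anisotropic Gaussian center-bias prior with just
    two learnable parameters (log-sigmas), applied to a saliency map
    produced by a dataset-agnostic backbone."""
    def __init__(self):
        super().__init__()
        self.log_sigma = torch.nn.Parameter(torch.zeros(2))  # (y, x)

    def forward(self, saliency: torch.Tensor) -> torch.Tensor:
        h, w = saliency.shape[-2:]
        ys = torch.linspace(-1, 1, h).view(h, 1)
        xs = torch.linspace(-1, 1, w).view(1, w)
        sy, sx = self.log_sigma.exp()
        prior = torch.exp(-0.5 * ((ys / sy) ** 2 + (xs / sx) ** 2))
        logits = saliency + prior.log()          # combine in log-space
        return logits - logits.logsumexp(dim=(-2, -1), keepdim=True)

bias = CenterBias()
log_density = bias(torch.randn(1, 1, 48, 64))   # normalized fixation density
```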

[CV-31] Multi-Source Collaborative Style Augmentation and Domain-Invariant Learning for Federated Domain Generalization IJCAI2025

【Quick Read】: This paper tackles the restricted style space in federated domain generalization caused by data decentralization. Existing style augmentation methods either explore styles only within isolated source domains or interpolate style information across existing source domains, which limits style diversity. The key to the solution is a Multi-source Collaborative Style Augmentation and Domain-invariant learning method (MCSAD): a multi-source collaborative style augmentation module generates data in a broader style space, while domain-invariant learning, via cross-domain feature alignment within the same class and inter-class relation ensemble distillation, yields a domain-invariant model that generalizes well to unseen target domains.

Link: https://arxiv.org/abs/2505.10152
Authors: Yikang Wei
Institutions: Tianjin University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: IJCAI 2025

Click to view abstract

Abstract:Federated domain generalization aims to learn a generalizable model from multiple decentralized source domains for deploying on the unseen target domain. The style augmentation methods have achieved great progress on domain generalization. However, the existing style augmentation methods either explore the data styles within isolated source domain or interpolate the style information across existing source domains under the data decentralization scenario, which leads to limited style space. To address this issue, we propose a Multi-source Collaborative Style Augmentation and Domain-invariant learning method (MCSAD) for federated domain generalization. Specifically, we propose a multi-source collaborative style augmentation module to generate data in the broader style space. Furthermore, we conduct domain-invariant learning between the original data and augmented data by cross-domain feature alignment within the same class and classes relation ensemble distillation between different classes to learn a domain-invariant model. By alternatively conducting collaborative style augmentation and domain-invariant learning, the model can generalize well on unseen target domain. Extensive experiments on multiple domain generalization datasets indicate that our method significantly outperforms the state-of-the-art federated domain generalization methods.

[CV-32] VRSplat: Fast and Robust Gaussian Splatting for Virtual Reality

【Quick Read】: This paper aims to resolve the key obstacles that 3D Gaussian Splatting (3DGS) faces in virtual reality (VR): temporal artifacts such as popping during head movements, projection-based distortions that produce view-inconsistent floaters, and framerate drops when rendering large numbers of Gaussians. The key to the solution is VRSplat, which combines and extends recent 3DGS advances, Mini-Splatting, StopThePop, and Optimal Projection, modifying the individual techniques and the core 3DGS rasterizer. It also introduces an efficient foveated rasterizer that handles the focus and peripheral regions in a single GPU launch to improve GPU utilization, together with a fine-tuning step that optimizes Gaussian parameters based on StopThePop depth evaluations and Optimal Projection.

Link: https://arxiv.org/abs/2505.10144
Authors: Xuechang Tu,Lukas Radl,Michael Steiner,Markus Steinberger,Bernhard Kerbl,Fernando de la Torre
Institutions: Peking University; Graz University of Technology; Huawei Technologies; Carnegie Mellon University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: I3D'25 (PACMCGIT); Project Page: this https URL

Click to view abstract

Abstract:3D Gaussian Splatting (3DGS) has rapidly become a leading technique for novel-view synthesis, providing exceptional performance through efficient software-based GPU rasterization. Its versatility enables real-time applications, including on mobile and lower-powered devices. However, 3DGS faces key challenges in virtual reality (VR): (1) temporal artifacts, such as popping during head movements, (2) projection-based distortions that result in disturbing and view-inconsistent floaters, and (3) reduced framerates when rendering large numbers of Gaussians, falling below the critical threshold for VR. Compared to desktop environments, these issues are drastically amplified by large field-of-view, constant head movements, and high resolution of head-mounted displays (HMDs). In this work, we introduce VRSplat: we combine and extend several recent advancements in 3DGS to address challenges of VR holistically. We show how the ideas of Mini-Splatting, StopThePop, and Optimal Projection can complement each other, by modifying the individual techniques and core 3DGS rasterizer. Additionally, we propose an efficient foveated rasterizer that handles focus and peripheral areas in a single GPU launch, avoiding redundant computations and improving GPU utilization. Our method also incorporates a fine-tuning step that optimizes Gaussian parameters based on StopThePop depth evaluations and Optimal Projection. We validate our method through a controlled user study with 25 participants, showing a strong preference for VRSplat over other configurations of Mini-Splatting. VRSplat is the first, systematically evaluated 3DGS approach capable of supporting modern VR applications, achieving 72+ FPS while eliminating popping and stereo-disrupting floaters.

[CV-33] IMITATE: Image Registration with Context for unknown time frame recovery

【Quick Read】: This paper addresses image registration for tumors that move with patient breathing during radiotherapy, in particular registering thoracoabdominal 4D-CT (3D+t) scans across different breathing amplitudes. The key to the solution is a new conditional U-Net architecture that fully exploits the conditional information and needs no fixed image, enabling accurate estimation of unknown condition-related images while reducing the reconstruction artefacts caused by irregular breathing, hysteresis, and the poor correlation between breathing signals and internal motion.

Link: https://arxiv.org/abs/2505.10124
Authors: Ziad Kheil,Lucas Robinet,Laurent Risser,Soleakhena Ken
Institutions: Centre de Recherches en Cancérologie de Toulouse, INSERM UMR1037, Team RADOPT; Université de Toulouse; Institut Universitaire du Cancer – Oncopole Claudius Régaud; Institut de Mathématiques de Toulouse (UMR 5219), CNRS, Université de Toulouse
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: IEEE ISBI 2025

Click to view abstract

Abstract:In this paper, we formulate a novel image registration formalism dedicated to the estimation of unknown condition-related images, based on two or more known images and their associated conditions. We show how to practically model this formalism by using a new conditional U-Net architecture, which fully takes into account the conditional information and does not need any fixed image. Our formalism is then applied to image moving tumors for radiotherapy treatment at different breathing amplitudes using 4D-CT (3D+t) scans in thoracoabdominal regions. This driving application is particularly complex as it requires stitching a collection of sequential 2D slices into several 3D volumes at different organ positions. Movement interpolation with standard methods then generates well-known reconstruction artefacts in the assembled volumes due to irregular patient breathing, hysteresis, and poor correlation of the breathing signal to internal motion. Results obtained on 4D-CT clinical data showcase artefact-free volumes achieved at real-time latencies. The code is publicly available at this https URL.

[CV-34] MMRL: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models

【Quick Read】: This paper addresses the overfitting that occurs when large-scale pre-trained Vision-Language Models (VLMs) are adapted to new tasks with limited few-shot data, which undermines generalization. The key to the solution is Multi-Modal Representation Learning (MMRL), which introduces a shared, learnable, modality-agnostic representation space whose space tokens are projected into both the text and image encoders as representation tokens to strengthen cross-modal interaction. MMRL inserts these tokens into higher encoder layers to capture task-specific features while preserving general knowledge in the lower layers, jointly optimizes class tokens and representation features to balance task adaptation against pre-trained knowledge, and adds a regularization term aligning class tokens with text features to further promote generalization.

Link: https://arxiv.org/abs/2505.10088
Authors: Yuncheng Guo,Xiaodong Gu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Due to the limitation “The abstract field cannot be longer than 1,920 characters”, the abstract appearing here is slightly shorter than that in the PDF file

Click to view abstract

Abstract:Large-scale pre-trained Vision-Language Models (VLMs) have significantly advanced transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, undermining their ability to generalize to new tasks. To address this, we propose Multi-Modal Representation Learning (MMRL), which introduces a shared, learnable, modality-agnostic representation space. MMRL generates space tokens projected into both text and image encoders as representation tokens, enabling more effective cross-modal interactions. Unlike prior methods that mainly optimize class token features, MMRL inserts representation tokens into higher encoder layers–where task-specific features are more prominent–while preserving general knowledge in the lower layers. During training, both class and representation features are jointly optimized: a trainable projection layer is applied to representation tokens for task adaptation, while the projection layer for class token remains frozen to retain pre-trained knowledge. To further promote generalization, we introduce a regularization term aligning class and text features with the frozen VLM’s zero-shot features. At inference, a decoupling strategy uses both class and representation features for base tasks, but only class features for novel tasks due to their stronger generalization. Building upon this, we propose MMRL++, a parameter-efficient and interaction-aware extension that significantly reduces trainable parameters and enhances intra-modal interactions–particularly across the layers of representation tokens–allowing gradient sharing and instance-specific information to propagate more effectively through the network. Extensive experiments on 15 datasets demonstrate that MMRL and MMRL++ consistently outperform state-of-the-art methods, achieving a strong balance between task-specific adaptation and generalization.

[CV-35] FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation

【Quick Read】: This paper studies how to train better visual world models for robot manipulation, i.e., models that predict future visual observations from past frames and robot actions. The key to the solution is FlowDreamer, which adopts explicit 3D scene flow as its motion representation: a U-Net first predicts 3D scene flow from past frames and action conditions, and a diffusion model then predicts the future frame from that scene flow, yielding more accurate visual prediction.

Link: https://arxiv.org/abs/2505.10075
Authors: Jun Guo,Xiaojian Ma,Yikai Wang,Min Yang,Huaping Liu,Qing Li
Institutions: State Key Laboratory of General Artificial Intelligence (BIGAI); Department of Computer Science and Technology, Tsinghua University; School of Artificial Intelligence, Beijing Normal University; School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: see this https URL

Click to view abstract

Abstract:This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as explicit motion representations. FlowDreamer first predicts 3D scene flow from past frame and action conditions with a U-Net, and then a diffusion model will predict the future frame utilizing the scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer achieves better performance compared to other baseline RGB-D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate in various robot manipulation domains.
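A minimal skeleton of the two-stage interface described above, with single convolutions standing in for the real U-Net and diffusion model; the action dimension and channel counts are assumptions, and only the data flow (past RGB-D + action → scene flow → next frame) mirrors the paper.

```python
import torch
from torch import nn

class FlowDreamerSketch(nn.Module):
    """Minimal two-stage skeleton: a flow net predicts per-pixel 3D scene
    flow from past RGB-D frames and the action; a denoiser then predicts
    the next RGB-D frame conditioned on that flow. The real modules are a
    U-Net and a diffusion model; these stubs only fix the interfaces."""
    def __init__(self, ch=4):  # RGB-D = 4 channels
        super().__init__()
        self.flow_net = nn.Conv2d(ch + 7, 3, 3, padding=1)   # action dim 7 assumed
        self.denoiser = nn.Conv2d(ch + 3, ch, 3, padding=1)

    def forward(self, past_rgbd, action):
        b, _, h, w = past_rgbd.shape
        a = action.view(b, -1, 1, 1).expand(b, action.shape[1], h, w)
        flow = self.flow_net(torch.cat([past_rgbd, a], dim=1))       # (B,3,H,W)
        next_rgbd = self.denoiser(torch.cat([past_rgbd, flow], dim=1))
        return flow, next_rgbd

model = FlowDreamerSketch()
flow, nxt = model(torch.randn(2, 4, 64, 64), torch.randn(2, 7))
```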

[CV-36] ToonifyGB: StyleGAN-based Gaussian Blendshapes for 3D Stylized Head Avatars

【Quick Read】: This paper addresses real-time reconstruction of animatable head avatars from monocular video and extends the Toonify framework to synthesize diverse stylized 3D head avatars with Gaussian blendshapes. The key to the solution is ToonifyGB, an efficient two-stage framework: Stage 1 (stylized video generation) uses an improved StyleGAN, overcoming standard StyleGAN's restriction to fixed-resolution aligned face crops and producing a more stable video sequence; Stage 2 (Gaussian blendshapes synthesis) learns a stylized neutral head model and a set of expression blendshapes, so that stylized avatars can be rendered with arbitrary expressions.

Link: https://arxiv.org/abs/2505.10072
Authors: Rui-Yang Ju,Sheng-Yen Huang,Yi-Ping Hung
Institutions: National Taiwan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The introduction of 3D Gaussian blendshapes has enabled the real-time reconstruction of animatable head avatars from monocular video. Toonify, a StyleGAN-based framework, has become widely used for facial image stylization. To extend Toonify for synthesizing diverse stylized 3D head avatars using Gaussian blendshapes, we propose an efficient two-stage framework, ToonifyGB. In Stage 1 (stylized video generation), we employ an improved StyleGAN to generate the stylized video from the input video frames, which addresses the limitation of cropping aligned faces at a fixed resolution as preprocessing for normal StyleGAN. This process provides a more stable video, which enables Gaussian blendshapes to better capture the high-frequency details of the video frames, and efficiently generate high-quality animation in the next stage. In Stage 2 (Gaussian blendshapes synthesis), we learn a stylized neutral head model and a set of expression blendshapes from the generated video. By combining the neutral head model with expression blendshapes, ToonifyGB can efficiently render stylized avatars with arbitrary expressions. We validate the effectiveness of ToonifyGB on the benchmark dataset using two styles: Arcane and Pixar.

[CV-37] PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language

【Quick Read】: This paper targets optical character recognition (OCR) for the low-resource Pashto language, whose cursive script and scarcity of structured datasets pose many challenges for natural language processing (NLP). The key to the solution is PsOCR, a large-scale synthetic Pashto OCR dataset of one million images annotated with bounding boxes at the word, line, and document levels, covering 1,000 unique font families, colors, image sizes, and layouts, suitable for training and evaluating models of different architectures.

Link: https://arxiv.org/abs/2505.10055
Authors: Ijazul Haq,Yingjie Zhang,Irfan Ali Khan
Institutions: South China University of Technology; Zirak.ai; University of Missouri-Kansas City
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek’s Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at this https URL.
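Synthetic OCR corpora of this kind are typically built by rendering text with varied fonts and recording the resulting boxes; the sketch below shows that pattern with Pillow for word-level boxes. The font path is hypothetical, and faithful Pashto rendering additionally needs right-to-left shaping (e.g., Pillow built with libraqm), which this sketch omits.

```python
from PIL import Image, ImageDraw, ImageFont

def render_sample(text, font_path, size=48, img_size=(640, 160)):
    """Render one synthetic OCR sample and return the image plus a
    word-level bounding-box list, mirroring PsOCR's word/line/document
    annotation levels (font_path is a placeholder for one of the
    1,000 font families)."""
    img = Image.new("RGB", img_size, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, size)
    boxes, x = [], 10
    for word in text.split():
        draw.text((x, 40), word, font=font, fill="black")
        l, t, r, b = draw.textbbox((x, 40), word, font=font)
        boxes.append((word, (l, t, r, b)))
        x = r + 20
    return img, boxes

# img, boxes = render_sample("ستړی مه شې", "fonts/pashto.ttf")  # hypothetical path
```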

[CV-38] Advances in Radiance Field for Dynamic Scene: From Neural Field to Gaussian Field

【Quick Read】: This survey addresses dynamic scene representation and reconstruction, in particular the complexities of 4D dynamic scenes. The key lies in combining neural radiance fields and 3D Gaussian splatting with innovations such as differentiable volumetric rendering to improve motion representation and reconstruction quality. The paper systematically analyzes more than 200 related works, spanning implicit neural representations to explicit Gaussian primitives, proposes a unified representational framework, and categorizes and evaluates existing work through the lenses of motion representation paradigms, reconstruction techniques, auxiliary information integration strategies, and regularization approaches.

Link: https://arxiv.org/abs/2505.10049
Authors: Jinlong Fan,Xuepu Zeng,Jing Zhang,Mingming Gong,Yuxiang Yang,Dacheng Tao
Institutions: Hangzhou Dianzi University; Wuhan University; University of Melbourne; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Dynamic scene representation and reconstruction have undergone transformative advances in recent years, catalyzed by breakthroughs in neural radiance fields and 3D Gaussian splatting techniques. While initially developed for static environments, these methodologies have rapidly evolved to address the complexities inherent in 4D dynamic scenes through an expansive body of research. Coupled with innovations in differentiable volumetric rendering, these approaches have significantly enhanced the quality of motion representation and dynamic scene reconstruction, thereby garnering substantial attention from the computer vision and graphics communities. This survey presents a systematic analysis of over 200 papers focused on dynamic scene representation using radiance field, spanning the spectrum from implicit neural representations to explicit Gaussian primitives. We categorize and evaluate these works through multiple critical lenses: motion representation paradigms, reconstruction techniques for varied scene dynamics, auxiliary information integration strategies, and regularization approaches that ensure temporal consistency and physical plausibility. We organize diverse methodological approaches under a unified representational framework, concluding with a critical examination of persistent challenges and promising research directions. By providing this comprehensive overview, we aim to establish a definitive reference for researchers entering this rapidly evolving field while offering experienced practitioners a systematic understanding of both conceptual principles and practical frontiers in dynamic scene reconstruction.

[CV-39] Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

【Quick Read】: This paper examines an understudied design space in text-to-image generation: the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multimodal generation. Prior studies mostly report overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes often go undisclosed, leaving the real potential of this approach uncertain. The key to the solution is an empirical study that performs controlled comparisons against established baselines, analyzes important design choices, and provides a clear, reproducible recipe for training at scale, filling these gaps.

Link: https://arxiv.org/abs/2505.10046
Authors: Bingda Tang,Boyang Zheng,Xichen Pan,Sayak Paul,Saining Xie
Institutions: New York University; Hugging Face
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis – specifically, the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.

[CV-40] DeepSeqCoco: A Robust Mobile Friendly Deep Learning Model for Detection of Diseases in Cocos nucifera

【Quick Read】: This paper addresses the inefficiency, reliance on manual labor, and poor scalability of coconut tree disease identification. The key to the solution is DeepSeqCoco, a deep learning model for accurate, automatic disease identification from coconut tree images, tested under various optimizer settings (SGD, Adam, and hybrid configurations) to find the best balance among accuracy, loss minimization, and computational cost.

Link: https://arxiv.org/abs/2505.10030
Authors: Miit Daga,Dhriti Parikh,Swarna Priya Ramu
Institutions: VIT, Vellore, India
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: This paper is accepted for publication in IEEE Access journal and is currently pending revisions before publication

Click to view abstract

Abstract:Coconut tree diseases are a serious risk to agricultural yield, particularly in developing countries where conventional farming practices restrict early diagnosis and intervention. Current disease identification methods are manual, labor-intensive, and non-scalable. In response to these limitations, we come up with DeepSeqCoco, a deep learning based model for accurate and automatic disease identification from coconut tree images. The model was tested under various optimizer settings, such as SGD, Adam, and hybrid configurations, to identify the optimal balance between accuracy, minimization of loss, and computational cost. Results from experiments indicate that DeepSeqCoco can achieve as much as 99.5% accuracy (achieving up to 5% higher accuracy than existing models) with the hybrid SGD-Adam showing the lowest validation loss of 2.81%. It also shows a drop of up to 18% in training time and up to 85% in prediction time for input images. The results point out the promise of the model to improve precision agriculture through an AI-based, scalable, and efficient disease monitoring system.

[CV-41] ORL-LDM: Offline Reinforcement Learning Guided Latent Diffusion Model Super-Resolution Reconstruction

【Quick Read】: This paper targets the handling of complex scenes and the preservation of image detail in remote sensing image super-resolution, where existing deep learning methods still fall short. The key to the solution is a reinforcement learning-based fine-tuning method for a latent diffusion model (LDM): a reinforcement learning environment with states, actions, and rewards is constructed, and proximal policy optimization (PPO) optimizes the decision objective during the LDM's reverse denoising process, improving reconstruction quality and adaptability across scenes.

Link: https://arxiv.org/abs/2505.10027
Authors: Shijie Lyu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by the 4th International Conference on Computing Innovation and Applied Physics (CONF-CIAP 2025), and will be published in EAI Community Research Series-CORE or Theoretical and Natural Science (TNS)

Click to view abstract

Abstract:With the rapid advancement of remote sensing technology, super-resolution image reconstruction is of great research and practical significance. Existing deep learning methods have made progress but still face limitations in handling complex scenes and preserving image details. This paper proposes a reinforcement learning-based latent diffusion model (LDM) fine-tuning method for remote sensing image super-resolution. The method constructs a reinforcement learning environment with states, actions, and rewards, optimizing decision objectives through proximal policy optimization (PPO) during the reverse denoising process of the LDM model. Experiments on the RESISC45 dataset show significant improvements over the baseline model in PSNR, SSIM, and LPIPS, with PSNR increasing by 3-4dB, SSIM improving by 0.08-0.11, and LPIPS reducing by 0.06-0.10, particularly in structured and complex natural scenes. The results demonstrate the method’s effectiveness in enhancing super-resolution quality and adaptability across scenes.

[CV-42] Application of YOLOv8 in monocular downward multiple Car Target detection

【Quick Read】: This paper addresses the challenges facing object detection in autonomous driving, including high cost, sensitivity to weather and lighting conditions, and insufficient capability for multi-scale, small, and distant targets. The key to the solution is an improved YOLOv8 framework that integrates structural reparameterization, a bidirectional pyramid network model, and a new detection pipeline, raising the efficiency and precision of multi-scale, small, and remote object detection.

Link: https://arxiv.org/abs/2505.10016
Authors: Shijie Lyu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by the 5th International Conference on Signal Processing and Machine Learning (CONF-SPML 2025), to appear in Applied and Computational Engineering

Click to view abstract

Abstract:Autonomous driving technology is progressively transforming traditional car driving methods, marking a significant milestone in modern transportation. Object detection serves as a cornerstone of autonomous systems, playing a vital role in enhancing driving safety, enabling autonomous functionality, improving traffic efficiency, and facilitating effective emergency responses. However, current technologies such as radar for environmental perception, cameras for road perception, and vehicle sensor networks face notable challenges, including high costs, vulnerability to weather and lighting conditions, and limited detection capability. To address these limitations, this paper presents an improved autonomous target detection network based on YOLOv8. By integrating structural reparameterization technology, a bidirectional pyramid structure network model, and a novel detection pipeline into the YOLOv8 framework, the proposed approach achieves highly efficient and precise detection of multi-scale, small, and remote objects. Experimental results demonstrate that the enhanced model can effectively detect both large and small objects with a detection accuracy of 65%, showcasing significant advancements over traditional methods. The improved model holds substantial potential for real-world applications and is well-suited for autonomous driving competitions, such as the Formula Student Autonomous China (FSAC), particularly excelling in scenarios involving single-target and small-object detection.

[CV-43] From Air to Wear: Personalized 3D Digital Fashion with AR/VR Immersive 3D Sketching

【Quick Read】: This paper addresses the high technical barriers and limited data that keep ordinary users away from 3D garment design tools. The key to the solution is a 3D sketch-driven garment generation framework that combines a conditional diffusion model, a sketch encoder trained in a shared latent space, and an adaptive curriculum learning strategy, so the system can interpret imprecise free-hand input and produce realistic, personalized garments. To offset the scarcity of training data, the authors also introduce KO3DClothes, a new dataset of paired 3D garments and user-created sketches, further improving the method's effectiveness and usability.

Link: https://arxiv.org/abs/2505.09998
Authors: Ying Zang,Yuanqi Hu,Xinyu Chen,Yuxia Xu,Suhui Wang,Chunan Yu,Lanyun Zhu,Deyi Ji,Xin Xu,Tianrun Chen
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures

Click to view abstract

Abstract:In the era of immersive consumer electronics, such as AR/VR headsets and smart devices, people increasingly seek ways to express their identity through virtual fashion. However, existing 3D garment design tools remain inaccessible to everyday users due to steep technical barriers and limited data. In this work, we introduce a 3D sketch-driven 3D garment generation framework that empowers ordinary users - even those without design experience - to create high-quality digital clothing through simple 3D sketches in AR/VR environments. By combining a conditional diffusion model, a sketch encoder trained in a shared latent space, and an adaptive curriculum learning strategy, our system interprets imprecise, free-hand input and produces realistic, personalized garments. To address the scarcity of training data, we also introduce KO3DClothes, a new dataset of paired 3D garments and user-created sketches. Extensive experiments and user studies confirm that our method significantly outperforms existing baselines in both fidelity and usability, demonstrating its promise for democratized fashion design on next-generation consumer platforms.

[CV-44] Descriptive Image-Text Matching with Graded Contextual Similarity

【Quick Read】: This paper addresses a limitation of conventional image-text matching: sparse binary supervision ignores the inherent many-to-many correspondences between images and texts and misses the implicit links from general to specific descriptions. The key to the solution is descriptive image-text matching (DITM), which exploits the descriptive flexibility of language to learn graded contextual similarity between image and text. It models each sentence's descriptiveness score with cumulative term frequency-inverse document frequency (TF-IDF) to balance the keywords' contribution to pairwise similarity, and uses descriptiveness in two key ways: dynamically relaxing the connectivity between positive and negative pairs to refine false-negative labeling, and aligning related sentences in generic-to-specific order, improving the model's ability to discover both optimal matches and potential positive pairs.

Link: https://arxiv.org/abs/2505.09997
Authors: Jinhyun Jang,Jiyeong Lee,Kwanghoon Sohn
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Image-text matching aims to build correspondences between visual and textual data by learning their pairwise similarities. Most existing approaches have adopted sparse binary supervision, indicating whether a pair of images and sentences matches or not. However, such sparse supervision covers a limited subset of image-text relationships, neglecting their inherent many-to-many correspondences; an image can be described in numerous texts at different descriptive levels. Moreover, existing approaches overlook the implicit connections from general to specific descriptions, which form the underlying rationale for the many-to-many relationships between vision and language. In this work, we propose descriptive image-text matching, called DITM, to learn the graded contextual similarity between image and text by exploring the descriptive flexibility of language. We formulate the descriptiveness score of each sentence with cumulative term frequency-inverse document frequency (TF-IDF) to balance the pairwise similarity according to the keywords in the sentence. Our method leverages sentence descriptiveness to learn robust image-text matching in two key ways: (1) to refine the false negative labeling, dynamically relaxing the connectivity between positive and negative pairs, and (2) to build more precise matching, aligning a set of relevant sentences in a generic-to-specific order. By moving beyond rigid binary supervision, DITM enhances the discovery of both optimal matches and potential positive pairs. Extensive experiments on MS-COCO, Flickr30K, and CxC datasets demonstrate the effectiveness of our method in representing complex image-text relationships compared to state-of-the-art approaches. In addition, DITM enhances the hierarchical reasoning ability of the model, supported by the extensive analysis on HierarCaps benchmark.
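To ground the descriptiveness score: a cumulative TF-IDF over a caption's words rewards captions with many specific, corpus-rare keywords. The sketch below is one straightforward reading of that formulation; DITM's exact weighting and normalization may differ.

```python
import math
from collections import Counter

def descriptiveness_scores(captions):
    """Score each caption by cumulative TF-IDF over its words, a proxy
    for how specific its keywords are across the corpus."""
    n = len(captions)
    docs = [c.lower().split() for c in captions]
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(sum(tf[w] * math.log(n / df[w]) for w in tf))
    return scores

caps = ["a dog", "a small brown dog chasing a red frisbee", "an animal outside"]
print(descriptiveness_scores(caps))  # the most specific caption scores highest
```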

[CV-45] PointArena: Probing Multimodal Grounding Through Language-Guided Pointing

【Quick Read】: This paper addresses the evaluation and improvement of pointing in multimodal vision-language grounding: existing benchmarks focus mainly on referential object localization and lack a comprehensive assessment of pointing across diverse reasoning scenarios. The key to the solution is the PointArena platform with three core components: the curated Point-Bench dataset, the interactive Point-Battle evaluation arena, and the Point-Act real-world robotic manipulation system, forming a multi-faceted evaluation pipeline from benchmark to practice.

Link: https://arxiv.org/abs/2505.09990
Authors: Long Cheng,Jiafei Duan,Yi Ru Wang,Haoquan Fang,Boyang Li,Yushan Huang,Elvis Wang,Ainaz Eftekhar,Jason Lee,Wentao Yuan,Rose Hendrix,Noah A. Smith,Fei Xia,Dieter Fox,Ranjay Krishna
Institutions: University of Washington; Allen Institute for Artificial Intelligence; Anderson Collegiate Vocational Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 Pages, Dataset and code: this https URL

Click to view abstract

Abstract:Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to support pointing capabilities, existing benchmarks typically focus only on referential object localization tasks. We introduce PointArena, a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. PointArena comprises three components: (1) Point-Bench, a curated dataset containing approximately 1,000 pointing tasks across five reasoning categories; (2) Point-Battle, an interactive, web-based arena facilitating blind, pairwise model comparisons, which has already gathered over 4,500 anonymized votes; and (3) Point-Act, a real-world robotic manipulation system allowing users to directly evaluate multimodal model pointing capabilities in practical settings. We conducted extensive evaluations of both state-of-the-art open-source and proprietary multimodal models. Results indicate that Molmo-72B consistently outperforms other models, though proprietary models increasingly demonstrate comparable performance. Additionally, we find that supervised training specifically targeting pointing tasks significantly enhances model performance. Across our multi-stage evaluation pipeline, we also observe strong correlations, underscoring the critical role of precise pointing capabilities in enabling multimodal models to effectively bridge abstract reasoning with concrete, real-world actions. Project page: this https URL

[CV-46] High Quality Underwater Image Compression with Adaptive Correction and Codebook-based Augmentation

【Quick Read】: This paper addresses the suboptimal performance of current underwater image compression algorithms, which fail to exploit the characteristics that distinguish underwater scenes from terrestrial images. The key to the solution is HQUIC: its ALTC module adaptively predicts the images' attenuation coefficients and global light information, mitigating the lighting and tone differences found in underwater images; a codebook serves as an auxiliary branch that extracts objects commonly found in underwater images to boost the main branch; and multi-scale frequency components are dynamically weighted to prioritize information critical to distortion quality while discarding redundant detail, achieving more efficient compression.

Link: https://arxiv.org/abs/2505.09986
Authors: Yimin Zhou,Yichong Xia,Sicheng Pan,Bin Chen,Baoyi An,Haoqian Wang,Zhi Wang,Yaowei Wang,Zikun Zhou
Institutions: Tsinghua Shenzhen International Graduate School; Peng Cheng Laboratory; Harbin Institute of Technology, Shenzhen; Huawei Technologies Company Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:With the increasing exploration and exploitation of the underwater world, underwater images have become a critical medium for human interaction with marine environments, driving extensive research into their efficient transmission and storage. However, contemporary underwater image compression algorithms fail to fully leverage the unique characteristics distinguishing underwater scenes from terrestrial images, resulting in suboptimal performance. To address this limitation, we introduce HQUIC, designed to exploit underwater-image-specific features for enhanced compression efficiency. HQUIC employs an ALTC module to adaptively predict the attenuation coefficients and global light information of the images, which effectively mitigates the issues caused by the differences in lighting and tone existing in underwater images. Subsequently, HQUIC employs a codebook as an auxiliary branch to extract the common objects within underwater images and enhances the performance of the main branch. Furthermore, HQUIC dynamically weights multi-scale frequency components, prioritizing information critical for distortion quality while discarding redundant details. Extensive evaluations on diverse underwater datasets demonstrate that HQUIC outperforms state-of-the-art compression methods.
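The attenuation coefficients and global light that ALTC predicts are the two unknowns of the classic underwater image-formation model I = J·e^(−βd) + B·(1 − e^(−βd)). The sketch below inverts that model for intuition; HQUIC itself uses these quantities inside a learned compression pipeline, and the numbers here are purely illustrative.

```python
import numpy as np

def restore(image, beta, background, depth):
    """Invert the classic underwater image-formation model
    I = J * exp(-beta * d) + B * (1 - exp(-beta * d)),
    given per-channel attenuation beta, global light B, and depth d."""
    t = np.exp(-beta[None, None, :] * depth[..., None])   # transmission map
    J = (image - background * (1.0 - t)) / np.clip(t, 1e-3, None)
    return np.clip(J, 0.0, 1.0)

img = np.random.rand(120, 160, 3)
beta = np.array([0.4, 0.1, 0.05])   # red attenuates fastest underwater
B = np.array([0.1, 0.35, 0.45])     # greenish-blue veiling light
depth = np.full((120, 160), 2.0)    # meters, assumed constant here
clean = restore(img, beta, B, depth)
```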

[CV-47] APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds

【Quick Read】: This paper addresses the performance degradation caused by domain shift in semantic segmentation of airborne laser scanning (ALS) point clouds, in particular the challenge of continual test-time adaptation (CTTA) on unlabeled target domains. The key to the solution is APCoTTA, which combines a dynamic trainable-layer selection module, an entropy-based consistency loss, and a random parameter interpolation mechanism to mitigate catastrophic forgetting and error accumulation while balancing target-domain adaptation against source knowledge retention.

Link: https://arxiv.org/abs/2505.09971
Authors: Yuan Gao,Shaobo Xia,Sheng Nie,Cheng Wang,Xiaohuan Xi,Bisheng Yang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 12 figures

Click to view abstract

Abstract:Airborne laser scanning (ALS) point cloud segmentation is a fundamental task for large-scale 3D scene understanding. In real-world applications, models are typically fixed after training. However, domain shifts caused by changes in the environment, sensor types, or sensor degradation often lead to a decline in model performance. Continuous Test-Time Adaptation (CTTA) offers a solution by adapting a source-pretrained model to evolving, unlabeled target domains. Despite its potential, research on ALS point clouds remains limited, facing challenges such as the absence of standardized datasets and the risk of catastrophic forgetting and error accumulation during prolonged adaptation. To tackle these challenges, we propose APCoTTA, the first CTTA method tailored for ALS point cloud semantic segmentation. We propose a dynamic trainable layer selection module. This module utilizes gradient information to select low-confidence layers for training, and the remaining layers are kept frozen, mitigating catastrophic forgetting. To further reduce error accumulation, we propose an entropy-based consistency loss. By filtering samples based on entropy, we apply the consistency loss only to reliable samples, enhancing model stability. In addition, we propose a random parameter interpolation mechanism, which randomly blends parameters from the selected trainable layers with those of the source model. This approach helps balance target adaptation and source knowledge retention, further alleviating forgetting. Finally, we construct two benchmarks, ISPRSC and H3DC, to address the lack of CTTA benchmarks for ALS point cloud segmentation. Experimental results demonstrate that APCoTTA achieves the best performance on two benchmarks, with mIoU improvements of approximately 9% and 14% over direct inference. The new benchmarks and code are available at this https URL.
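Two of the ingredients above are easy to state in code: entropy-based filtering of unreliable points and random interpolation back toward the frozen source weights. The sketch below assumes a quantile threshold and a blend range, neither of which is specified in the abstract.

```python
import torch

def reliable_mask(logits: torch.Tensor, quantile: float = 0.5) -> torch.Tensor:
    """Keep only low-entropy (confident) points for the consistency loss."""
    p = logits.softmax(dim=-1)
    ent = -(p * p.clamp_min(1e-8).log()).sum(dim=-1)   # per-point entropy
    return ent <= ent.quantile(quantile)

@torch.no_grad()
def interpolate_params(target, source, low: float = 0.0, high: float = 0.3):
    """Randomly blend selected trainable layers back toward the frozen
    source model to balance adaptation and knowledge retention."""
    for p_t, p_s in zip(target.parameters(), source.parameters()):
        alpha = torch.rand(()) * (high - low) + low
        p_t.mul_(1 - alpha).add_(alpha * p_s)

logits = torch.randn(4096, 8)          # per-point class logits
mask = reliable_mask(logits)
print(mask.float().mean())             # ~0.5 of points retained
```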

[CV-48] TKFNet: Learning Texture Key Factor Driven Feature for Facial Expression Recognition

【Quick Read】: This paper aims to handle the subtle, localized nature of expression-related features under complex variations in facial appearance for facial expression recognition (FER) in the wild. The key to the solution is introducing Texture Key Driver Factors (TKDF), localized texture regions with strong discriminative power, and capturing and exploiting these texture cues through a Texture-Aware Feature Extractor (TAFE) and Dual Contextual Information Filtering (DCIF), improving FER performance and robustness.

Link: https://arxiv.org/abs/2505.09967
Authors: Liqian Deng
Institutions: Central China Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Facial expression recognition (FER) in the wild remains a challenging task due to the subtle and localized nature of expression-related features, as well as the complex variations in facial appearance. In this paper, we introduce a novel framework that explicitly focuses on Texture Key Driver Factors (TKDF), localized texture regions that exhibit strong discriminative power across emotional categories. By carefully observing facial image patterns, we identify that certain texture cues, such as micro-changes in skin around the brows, eyes, and mouth, serve as primary indicators of emotional dynamics. To effectively capture and leverage these cues, we propose a FER architecture comprising a Texture-Aware Feature Extractor (TAFE) and Dual Contextual Information Filtering (DCIF). TAFE employs a ResNet-based backbone enhanced with multi-branch attention to extract fine-grained texture representations, while DCIF refines these features by filtering context through adaptive pooling and attention mechanisms. Experimental results on RAF-DB and KDEF datasets demonstrate that our method achieves state-of-the-art performance, verifying the effectiveness and robustness of incorporating TKDFs into FER pipelines.

[CV-49] MambaControl: Anatomy Graph-Enhanced Mamba ControlNet with Fourier Refinement for Diffusion-Based Disease Trajectory Prediction

【Quick Read】: This paper addresses disease progression modeling in precision medicine, in particular capturing complex spatio-temporal dynamics while preserving anatomical integrity; existing methods struggle with longitudinal dependencies and structural consistency in progressive disorders. The key to the solution is MambaControl, which couples selective state-space modeling with diffusion processes: it fuses Mamba-based long-range modeling with graph-guided anatomical control to better represent anatomical correlations, and introduces Fourier-enhanced spectral graph representations to capture spatial coherence and multiscale detail, achieving state-of-the-art performance in Alzheimer's disease prediction.

Link: https://arxiv.org/abs/2505.09965
Authors: Hao Yang,Tao Tan,Shuai Tan,Weiqin Yang,Kunyan Cai,Calvin Chen,Yue Sun
Institutions: Macao Polytechnic University; Zhejiang University; The University of Adelaide; University of Birmingham
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Modelling disease progression in precision medicine requires capturing complex spatio-temporal dynamics while preserving anatomical integrity. Existing methods often struggle with longitudinal dependencies and structural consistency in progressive disorders. To address these limitations, we introduce MambaControl, a novel framework that integrates selective state-space modelling with diffusion processes for high-fidelity prediction of medical image trajectories. To better capture subtle structural changes over time while maintaining anatomical consistency, MambaControl combines Mamba-based long-range modelling with graph-guided anatomical control to more effectively represent anatomical correlations. Furthermore, we introduce Fourier-enhanced spectral graph representations to capture spatial coherence and multiscale detail, enabling MambaControl to achieve state-of-the-art performance in Alzheimer’s disease prediction. Quantitative and regional evaluations demonstrate improved progression prediction quality and anatomical fidelity, highlighting its potential for personalised prognosis and clinical decision support.

[CV-50] CSPENet: Contour-Aware and Saliency Priors Embedding Network for Infrared Small Target Detection

【Quick Read】: This paper addresses inaccurate target localization in dense clutter and insufficient perception of contour information in infrared small target detection (ISTD), which severely limit the performance of existing methods. The key to the solution is a contour-aware and saliency priors embedding network (CSPENet): a surround-convergent prior extraction module (SCPEM) captures the intrinsic property that gradients of target contour pixels converge toward the center, while concurrently extracting a boosted saliency prior for accurate localization and multi-scale structural priors for richer contour detail. A dual-branch priors embedding architecture (DBPEA) and an attention-guided feature enhancement module (AGFEM) then refine feature fusion and representation, raising detection performance.

Link: https://arxiv.org/abs/2505.09943
Authors: Jiakun Deng,Kexuan Li,Xingye Cui,Jiaxuan Li,Chang Long,Tian Pu,Zhenming Peng
Institutions: University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Infrared small target detection (ISTD) plays a critical role in a wide range of civilian and military applications. Existing methods suffer from deficiencies in the localization of dim targets and the perception of contour information under dense clutter environments, severely limiting their detection performance. To tackle these issues, we propose a contour-aware and saliency priors embedding network (CSPENet) for ISTD. We first design a surround-convergent prior extraction module (SCPEM) that effectively captures the intrinsic characteristic of target contour pixel gradients converging toward their center. This module concurrently extracts two collaborative priors: a boosted saliency prior for accurate target localization and multi-scale structural priors for comprehensively enriching contour detail representation. Building upon this, we propose a dual-branch priors embedding architecture (DBPEA) that establishes differentiated feature fusion pathways, embedding these two priors at optimal network positions to achieve performance enhancement. Finally, we develop an attention-guided feature enhancement module (AGFEM) to refine feature representations and improve saliency estimation accuracy. Experimental results on public datasets NUDT-SIRST, IRSTD-1k, and NUAA-SIRST demonstrate that our CSPENet outperforms other state-of-the-art methods in detection performance. The code is available at this https URL.

[CV-51] Non-Registration Change Detection: A Novel Change Detection Task and Benchmark Dataset

【Quick Read】: This paper introduces non-registration change detection to cope with the growing number of emergencies, such as natural disasters, anthropogenic accidents, and military strikes, where change detection fails because imagery is not registered. The key to the solution is systematically proposing eight real-world scenarios that can give rise to non-registration problems, and designing image transformation schemes tailored to each scenario that convert existing registered change detection datasets into non-registration versions, demonstrating the catastrophic damage non-registration causes to state-of-the-art methods.

Link: https://arxiv.org/abs/2505.09939
Authors: Zhe Shan,Lei Zhou,Liu Mao,Shaofan Chen,Chuanqiu Ren,Xia Xie
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted to IGARSS 2025

Click to view abstract

Abstract:In this study, we propose a novel remote sensing change detection task, non-registration change detection, to address the increasing number of emergencies such as natural disasters, anthropogenic accidents, and military strikes. First, in light of the limited discourse on the issue of non-registration change detection, we systematically propose eight scenarios that could arise in the real world and potentially contribute to the occurrence of non-registration problems. Second, we develop distinct image transformation schemes tailored to various scenarios to convert the available registration change detection dataset into a non-registration version. Finally, we demonstrate that non-registration change detection can cause catastrophic damage to the state-of-the-art methods. Our code and dataset are available at this https URL.

[CV-52] VRU-CIPI: Crossing Intention Prediction at Intersections for Improving Vulnerable Road Users Safety

【Quick Read】: This paper addresses crossing intention prediction for Vulnerable Road Users (VRUs) at urban intersections to improve interaction safety among road users. The key to the solution is the VRU-CIPI framework, a sequential attention-based model that combines a Gated Recurrent Unit (GRU) capturing the temporal dynamics of VRU movements with multi-head Transformer self-attention encoding the contextual and spatial dependencies critical for predicting crossing direction.

Link: https://arxiv.org/abs/2505.09935
Authors: Ahmed S. Abdelrahman,Mohamed Abdel-Aty,Quoc Dai Tran
Institutions: University of Central Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Understanding and predicting human behavior in the wild, particularly at urban intersections, remains crucial for enhancing interaction safety between road users. Among the most critical behaviors are crossing intentions of Vulnerable Road Users (VRUs), where misinterpretation may result in dangerous conflicts with oncoming vehicles. In this work, we propose the VRU-CIPI framework with a sequential attention-based model designed to predict VRU crossing intentions at intersections. VRU-CIPI employs a Gated Recurrent Unit (GRU) to capture temporal dynamics in VRU movements, combined with a multi-head Transformer self-attention mechanism to encode contextual and spatial dependencies critical for predicting crossing direction. Evaluated on the UCF-VRU dataset, our proposed framework achieves state-of-the-art performance with an accuracy of 96.45% and real-time inference speed reaching 33 frames per second. Furthermore, by integrating with Infrastructure-to-Vehicles (I2V) communication, our approach can proactively enhance intersection safety through timely activation of crossing signals and providing early warnings to connected vehicles, ensuring smoother and safer interactions for all road users.
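A minimal sketch of the described architecture, a GRU over the motion sequence followed by multi-head self-attention and a classifier; hidden sizes, head counts, and the keypoint feature dimension are assumptions rather than the paper's configuration.

```python
import torch
from torch import nn

class CrossingIntentNet(nn.Module):
    """GRU over a VRU's motion sequence followed by multi-head
    self-attention and a classification head."""
    def __init__(self, feat_dim=34, hidden=128, heads=4, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (B, T, feat_dim) motion features
        h, _ = self.gru(x)                # temporal dynamics
        a, _ = self.attn(h, h, h)         # contextual/spatial dependencies
        return self.head(a.mean(dim=1))   # pooled crossing-intent logits

net = CrossingIntentNet()
logits = net(torch.randn(8, 30, 34))      # 30 frames of 17 (x, y) keypoints
```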

[CV-53] DDFP: Data-dependent Frequency Prompt for Source Free Domain Adaptation of Medical Image Segmentation

【Quick Read】: This paper addresses the performance degradation in unsupervised domain adaptation (UDA) caused by the domain gap between source and target, especially when privacy policies, as with medical datasets, forbid access to labeled source data and make conventional methods inapplicable. The key to the solution is a new source-free domain adaptation (SFDA) framework: preadaptation produces a preadapted model that initializes the target model and yields high-quality enhanced pseudo-labels without introducing extra parameters, mitigating the domain gap early in training; a data-dependent frequency prompt translates target domain images into a source-like style more effectively; and a style-related layer fine-tuning strategy, designed specifically for SFDA, trains the target model on the prompted target images and pseudo-labels, further improving adaptation.

Link: https://arxiv.org/abs/2505.09927
Authors: Siqi Yin,Shaolei Liu,Manning Wang
Institutions: Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Domain adaptation addresses the challenge of model performance degradation caused by domain gaps. In the typical setup for unsupervised domain adaptation, labeled data from a source domain and unlabeled data from a target domain are used to train a target model. However, access to labeled source domain data, particularly in medical datasets, can be restricted due to privacy policies. As a result, research has increasingly shifted to source-free domain adaptation (SFDA), which requires only a pretrained model from the source domain and unlabeled data from the target domain data for adaptation. Existing SFDA methods often rely on domain-specific image style translation and self-supervision techniques to bridge the domain gap and train the target domain model. However, the quality of domain-specific style-translated images and pseudo-labels produced by these methods still leaves room for improvement. Moreover, training the entire model during adaptation can be inefficient under limited supervision. In this paper, we propose a novel SFDA framework to address these challenges. Specifically, to effectively mitigate the impact of domain gap in the initial training phase, we introduce preadaptation to generate a preadapted model, which serves as an initialization of target model and allows for the generation of high-quality enhanced pseudo-labels without introducing extra parameters. Additionally, we propose a data-dependent frequency prompt to more effectively translate target domain images into a source-like style. To further enhance adaptation, we employ a style-related layer fine-tuning strategy, specifically designed for SFDA, to train the target model using the prompted target domain images and pseudo-labels. Extensive experiments on cross-modality abdominal and cardiac SFDA segmentation tasks demonstrate that our proposed method outperforms existing state-of-the-art methods.
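The "frequency prompt" idea builds on translating style in the Fourier domain; the classic operation it refines is swapping low-frequency amplitudes while keeping the target phase, as sketched below. Here the source amplitude stands in for the learned, data-dependent prompt, and the band size is an assumption.

```python
import numpy as np

def fourier_style_transfer(target, source, band=0.1):
    """Replace the low-frequency amplitude of a target image with the
    source amplitude while keeping the target phase, a common way to
    translate style without altering content."""
    ft = np.fft.fft2(target, axes=(0, 1))
    fs = np.fft.fft2(source, axes=(0, 1))
    amp_t, pha_t, amp_s = np.abs(ft), np.angle(ft), np.abs(fs)
    h, w = target.shape[:2]
    bh, bw = int(h * band), int(w * band)
    amp_t = np.fft.fftshift(amp_t, axes=(0, 1))
    amp_s = np.fft.fftshift(amp_s, axes=(0, 1))
    cy, cx = h // 2, w // 2
    amp_t[cy-bh:cy+bh, cx-bw:cx+bw] = amp_s[cy-bh:cy+bh, cx-bw:cx+bw]
    amp_t = np.fft.ifftshift(amp_t, axes=(0, 1))
    out = np.fft.ifft2(amp_t * np.exp(1j * pha_t), axes=(0, 1))
    return np.real(out)

tgt, src = np.random.rand(128, 128), np.random.rand(128, 128)
styled = fourier_style_transfer(tgt, src)
```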

[CV-54] AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection

【Quick Read】: This paper targets universal visual anomaly detection: identifying anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Existing methods struggle with prompt template design, complex token interactions, or required fine-tuning, leaving limited flexibility. The key to the solution, AdaptCLIP, rests on two insights: adaptive visual and textual representations should be learned alternately rather than jointly, and comparative learning between the query and the normal image prompt should combine contextual and aligned residual features rather than relying on residual features alone. AdaptCLIP treats the CLIP model as a foundational service and adds only three simple adapters (visual, textual, and prompt-query) at its input or output ends, supporting zero-/few-shot generalization across domains in a training-free manner on target domains once trained on a base dataset.

Link: https://arxiv.org/abs/2505.09926
Authors: Bin-Bin Gao,Yue Zhu,Jiangtao Yan,Yuezhi Cai,Weixi Zhang,Meng Wang,Jun Liu,Yong Liu,Lei Wang,Chengjie Wang
Institutions: Tencent YouTu Lab; Siemens Corporate Research; Technical University of Munich; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 27 pages, 15 figures, 22 tables

Click to view abstract

Abstract:Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle with designing prompt templates, complex token interactions, or requiring additional fine-tuning, resulting in limited flexibility. In this work, we present a simple yet effective method called AdaptCLIP based on two key insights. First, adaptive visual and textual representations should be learned alternately rather than jointly. Second, comparative learning between query and normal image prompt should incorporate both contextual and aligned residual features, rather than relying solely on residual features. AdaptCLIP treats CLIP models as a foundational service, adding only three simple adapters, visual adapter, textual adapter, and prompt-query adapter, at its input or output ends. AdaptCLIP supports zero-/few-shot generalization across domains and possesses a training-free manner on target domains once trained on a base dataset. AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing competitive methods. We will make the code and model of AdaptCLIP available at this https URL.

[CV-55] Large-Scale Gaussian Splatting SLAM

【Quick Read】: This paper addresses the limited robustness of NeRF- and 3D Gaussian Splatting (3DGS)-based visual SLAM in large-scale outdoor scenes. Existing methods typically rely on RGBD sensors and are restricted to indoor environments, leaving large-scale outdoor adaptation underexplored. The proposed LSG-SLAM is a large-scale 3DGS-based visual SLAM system with stereo cameras. Its key components are a multi-modality strategy for estimating prior poses under large view changes; feature-alignment warping constraints that mitigate the adverse effects of appearance similarity in rendering losses; continuous Gaussian Splatting submaps for scalable reconstruction of unbounded scenes; loop closure that optimizes the relative poses between looped keyframes with rendering and feature-warping losses after relocalization; and a final global optimization plus structure refinement module that improves reconstruction quality.

Link: https://arxiv.org/abs/2505.09915
Authors: Zhe Xin, Chenyang Wu, Penghui Huang, Yanyong Zhang, Yinian Mao, Guoquan Huang
Affiliations: Meituan UAV; University of Science and Technology of China; University of Delaware
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:The recently developed Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown encouraging and impressive results for visual SLAM. However, most representative methods require RGBD sensors and are only available for indoor environments. The robustness of reconstruction in large-scale outdoor scenarios remains unexplored. This paper introduces a large-scale 3DGS-based visual SLAM with stereo cameras, termed LSG-SLAM. The proposed LSG-SLAM employs a multi-modality strategy to estimate prior poses under large view changes. In tracking, we introduce feature-alignment warping constraints to alleviate the adverse effects of appearance similarity in rendering losses. For the scalability of large-scale scenarios, we introduce continuous Gaussian Splatting submaps to tackle unbounded scenes with limited memory. Loops are detected between GS submaps by place recognition and the relative pose between looped keyframes is optimized utilizing rendering and feature warping losses. After the global optimization of camera poses and Gaussian points, a structure refinement module enhances the reconstruction quality. With extensive evaluations on the EuRoc and KITTI datasets, LSG-SLAM achieves superior performance over existing Neural, 3DGS-based, and even traditional approaches. Project page: this https URL.

[CV-56] Few-Shot Learning of Visual Compositional Concepts through Probabilistic Schema Induction

【Quick Read】: This paper asks how human-like compositional visual concept learning can be achieved from limited examples, i.e., rapidly acquiring visual concepts with structured relations from only a handful of instances. The key to the solution is the Probabilistic Schema Induction (PSI) model, which uses deep learning to perform analogical mapping over structured representations, forming compositional concepts called schemas. PSI's core innovations are a similarity measure that combines object-level similarity with relational similarity, and a mechanism for amplifying relations relevant to classification, analogous to selective-attention parameters in traditional models.

Link: https://arxiv.org/abs/2505.09859
Authors: Andrew Jun Lee, Taylor Webb, Trevor Bihl, Keith Holyoak, Hongjing Lu
Affiliations: not listed
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Lee, A. J., Webb, T., Bihl, T., Holyoak, K. J., Lu, H. (2025). Few-shot learning of visual compositional concepts through probabilistic schema induction. In A. Ruggeri, D. Barner, C. Walker, N. Bramley (Eds.), Proceedings of the 47th Annual Conference of the Cognitive Science Society. Cognitive Science Society

Abstract:The ability to learn new visual concepts from limited examples is a hallmark of human cognition. While traditional category learning models represent each example as an unstructured feature vector, compositional concept learning is thought to depend on (1) structured representations of examples (e.g., directed graphs consisting of objects and their relations) and (2) the identification of shared relational structure across examples through analogical mapping. Here, we introduce Probabilistic Schema Induction (PSI), a prototype model that employs deep learning to perform analogical mapping over structured representations of only a handful of examples, forming a compositional concept called a schema. In doing so, PSI relies on a novel conception of similarity that weighs object-level similarity and relational similarity, as well as a mechanism for amplifying relations relevant to classification, analogous to selective attention parameters in traditional models. We show that PSI produces human-like learning performance and outperforms two controls: a prototype model that uses unstructured feature vectors extracted from a deep learning model, and a variant of PSI with weaker structured representations. Notably, we find that PSI’s human-like performance is driven by an adaptive strategy that increases relational similarity over object-level similarity and upweights the contribution of relations that distinguish classes. These findings suggest that structured representations and analogical mapping are critical to modeling rapid human-like learning of compositional visual concepts, and demonstrate how deep learning can be leveraged to create psychological models.
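As a toy illustration of the similarity conception described above, which weighs object-level against relational similarity and amplifies diagnostic relations, consider the following sketch; `rel_weight` and `attention` are hypothetical stand-ins for the model's learned parameters, not PSI's actual formulation.

```python
import numpy as np

def psi_similarity(obj_sim, rel_sims, rel_weight, attention):
    """Toy PSI-style similarity: mix object-level and relational similarity,
    with an attention vector amplifying classification-relevant relations."""
    weighted_rel = np.sum(attention * rel_sims) / np.sum(attention)
    return (1 - rel_weight) * obj_sim + rel_weight * weighted_rel

# Upweighting the first (diagnostic) relation pulls the score toward it.
score = psi_similarity(obj_sim=0.4, rel_sims=np.array([0.9, 0.2]),
                       rel_weight=0.7, attention=np.array([2.0, 0.5]))
```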

[CV-57] Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models MICCAI2025

【速读】:该论文试图解决手术视频数据集中常见的数据不平衡问题,这一问题限制了高性能模型的开发。其解决方案的关键在于提出一种基于文本条件的两阶段扩散方法,通过生成高质量的手术视频来增强数据集。该方法利用2D潜在扩散模型捕捉空间内容,并结合时间注意力层确保时间一致性,同时引入拒绝采样策略选择最合适的合成样本,从而有效缓解类别不平衡问题。

Link: https://arxiv.org/abs/2505.09858
Authors: Danush Kumar Venkatesh, Isabel Funke, Micha Pfeiffer, Fiona Kolbinger, Hanna Maria Schmeiser, Juergen Weitz, Marius Distler, Stefanie Speidel
Affiliations: not listed
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Early accept at MICCAI 2025

Abstract:Computer-assisted interventions can improve intra-operative guidance, particularly through deep learning methods that harness the spatiotemporal information in surgical videos. However, the severe data imbalance often found in surgical video datasets hinders the development of high-performing models. In this work, we aim to overcome the data imbalance by synthesizing surgical videos. We propose a unique two-stage, text-conditioned diffusion-based method to generate high-fidelity surgical videos for under-represented classes. Our approach conditions the generation process on text prompts and decouples spatial and temporal modeling by utilizing a 2D latent diffusion model to capture spatial content and then integrating temporal attention layers to ensure temporal consistency. Furthermore, we introduce a rejection sampling strategy to select the most suitable synthetic samples, effectively augmenting existing datasets to address class imbalance. We evaluate our method on two downstream tasks-surgical action recognition and intra-operative event prediction-demonstrating that incorporating synthetic videos from our approach substantially enhances model performance. We open-source our implementation at this https URL.
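The rejection sampling step can be pictured as follows: a hedged sketch in which a pretrained recognition model filters generated clips, keeping only those confidently assigned to the intended under-represented class. The threshold `tau`, the classifier interface, and the `budget` are assumptions, not the paper's exact criterion.

```python
import torch

def rejection_sample(candidates, classifier, target_class, tau=0.8, budget=100):
    """Sketch: keep only synthetic clips that a pretrained recognition model
    confidently assigns to the intended under-represented class."""
    kept = []
    for clip in candidates:                      # clip: (T, C, H, W) tensor
        with torch.no_grad():
            probs = classifier(clip.unsqueeze(0)).softmax(dim=-1)
        if probs[0, target_class] >= tau:        # accept confident samples
            kept.append(clip)
        if len(kept) == budget:
            break
    return kept
```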

[CV-58] BoundarySeg: An Embarrassingly Simple Method To Boost Medical Image Segmentation Performance for Low Data Regimes

【Quick Read】: This paper addresses two problems in medical image segmentation: large annotated or unannotated datasets are hard to obtain under privacy regulations and data-protection policies, and conventional semi-supervised methods depend on unlabeled data and degrade when such data is scarce. The key to the solution is BoundarySeg, a multi-task framework that adds organ boundary prediction as an auxiliary task to full-organ segmentation and uses the consistency between the two tasks' predictions as additional supervision, improving segmentation accuracy without relying on any unlabeled data.

Link: https://arxiv.org/abs/2505.09829
Authors: Tushar Kataria, Shireen Y. Elhabian
Affiliations: not listed
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Obtaining large-scale medical data, annotated or unannotated, is challenging due to stringent privacy regulations and data protection policies. In addition, annotating medical images requires that domain experts manually delineate anatomical structures, making the process both time-consuming and costly. As a result, semi-supervised methods have gained popularity for reducing annotation costs. However, the performance of semi-supervised methods is heavily dependent on the availability of unannotated data, and their effectiveness declines when such data are scarce or absent. To overcome this limitation, we propose a simple, yet effective and computationally efficient approach for medical image segmentation that leverages only existing annotations. We propose BoundarySeg, a multi-task framework that incorporates organ boundary prediction as an auxiliary task to full organ segmentation, leveraging consistency between the two task predictions to provide additional supervision. This strategy improves segmentation accuracy, especially in low data regimes, allowing our method to achieve performance comparable to or exceeding state-of-the-art semi-supervised approaches, all without relying on unannotated data or increasing computational demands. Code will be released upon acceptance.
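A minimal sketch of what a BoundarySeg-style objective could look like: a segmentation loss, an auxiliary boundary loss, and a consistency term tying the boundary implied by the predicted mask to the boundary head. The morphological-gradient boundary extraction and the equal loss weights are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def to_boundary(mask):
    """Boundary map from a soft mask via a morphological gradient."""
    dilated = F.max_pool2d(mask, 3, stride=1, padding=1)
    eroded = -F.max_pool2d(-mask, 3, stride=1, padding=1)
    return (dilated - eroded).clamp(0, 1)

def boundaryseg_loss(seg_logits, bnd_logits, seg_gt):
    """Segmentation + auxiliary boundary task + cross-task consistency."""
    seg_loss = F.binary_cross_entropy_with_logits(seg_logits, seg_gt)
    bnd_loss = F.binary_cross_entropy_with_logits(bnd_logits, to_boundary(seg_gt))
    consistency = F.mse_loss(torch.sigmoid(bnd_logits),
                             to_boundary(torch.sigmoid(seg_logits)))
    return seg_loss + bnd_loss + consistency
```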

[CV-59] Dyadic Mamba: Long-term Dyadic Human Motion Synthesis CVPR2025

【Quick Read】: This paper addresses the problem of generating high-quality dyadic (two-person) human motion of arbitrary length from text descriptions, particularly long-term interactions that exceed typical training sequence lengths. Existing transformer-based methods perform well on short-term dyadic motion synthesis but struggle on longer sequences due to inherent limitations of positional encoding schemes. The key to the solution is Dyadic Mamba, a novel approach that leverages State-Space Models (SSMs): a simple architecture lets information flow between the individual motion sequences through concatenation, eliminating the need for complex cross-attention mechanisms and improving the quality and stability of long-sequence generation.

Link: https://arxiv.org/abs/2505.09827
Authors: Julian Tanke, Takashi Shibuya, Kengo Uchida, Koichi Saito, Yuki Mitsufuji
Affiliations: Sony AI; Sony Group Corporation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 HuMoGen Workshop

Abstract:Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths. While recent transformer-based approaches have shown promising results for short-term dyadic motion synthesis, they struggle with longer sequences due to inherent limitations in positional encoding schemes. In this paper, we introduce Dyadic Mamba, a novel approach that leverages State-Space Models (SSMs) to generate high-quality dyadic human motion of arbitrary length. Our method employs a simple yet effective architecture that facilitates information flow between individual motion sequences through concatenation, eliminating the need for complex cross-attention mechanisms. We demonstrate that Dyadic Mamba achieves competitive performance on standard short-term benchmarks while significantly outperforming transformer-based approaches on longer sequences. Additionally, we propose a new benchmark for evaluating long-term motion synthesis quality, providing a standardized framework for future research. Our results demonstrate that SSM-based architectures offer a promising direction for addressing the challenging task of long-term dyadic human motion synthesis from text descriptions.

[CV-60] Visual Feedback of Pattern Separability Improves Myoelectric Decoding Performance of Upper Limb Prostheses

【Quick Read】: This paper addresses the difficulty users of pattern recognition (PR)-controlled upper-limb myoelectric prostheses face in producing EMG signals distinct enough for reliable classification. Conventional training relies on heuristic, trial-and-error user adjustments against static decoder boundaries and lacks effective real-time feedback. The key to the solution is the Reviewer, a 3D visual interface that projects EMG signals directly into the decoder's classification space, giving intuitive real-time insight into PR algorithm behavior; this reduces cognitive load and fosters data-driven co-adaptation between user-generated EMG patterns and the decoder boundaries.

Link: https://arxiv.org/abs/2505.09819
Authors: Ruichen Yang, György M. Lévay, Christopher L. Hunt, Dániel Czeiner, Megan C. Hodgson, Damini Agarwal, Rahul R. Kaliki, Nitish V. Thakor
Affiliations: The Johns Hopkins University; Infinite Biomedical Technologies, LLC
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:State-of-the-art upper limb myoelectric prostheses often use pattern recognition (PR) control systems that translate electromyography (EMG) signals into desired movements. As prosthesis movement complexity increases, users often struggle to produce sufficiently distinct EMG patterns for reliable classification. Existing training typically involves heuristic, trial-and-error user adjustments to static decoder boundaries. Goal: We introduce the Reviewer, a 3D visual interface projecting EMG signals directly into the decoder’s classification space, providing intuitive, real-time insight into PR algorithm behavior. This structured feedback reduces cognitive load and fosters mutual, data-driven adaptation between user-generated EMG patterns and decoder boundaries. Methods: A 10-session study with 12 able-bodied participants compared PR performance after motor-based training and updating using the Reviewer versus conventional virtual arm visualization. Performance was assessed using a Fitts law task that involved the aperture of the cursor and the control of orientation. Results: Participants trained with the Reviewer achieved higher completion rates, reduced overshoot, and improved path efficiency and throughput compared to the standard visualization group. Significance: The Reviewer introduces decoder-informed motor training, facilitating immediate and consistent PR-based myoelectric control improvements. By iteratively refining control through real-time feedback, this approach reduces reliance on trial-and-error recalibration, enabling a more adaptive, self-correcting training framework. Conclusion: The 3D visual feedback significantly improves PR control in novice operators through structured training, enabling feedback-driven adaptation and reducing reliance on extensive heuristic adjustments.
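The Fitts' law task mentioned above is conventionally scored with the Shannon formulation of the index of difficulty and throughput; a small reference computation follows (the study's exact protocol, with aperture and orientation control, may differ).

```python
import math

def fitts_throughput(distance, width, movement_time_s):
    """Shannon formulation: index of difficulty (bits) over movement time."""
    index_of_difficulty = math.log2(distance / width + 1.0)
    return index_of_difficulty / movement_time_s

# e.g. a 20 cm reach to a 2 cm target completed in 1.5 s -> ~2.3 bits/s
tp = fitts_throughput(distance=20.0, width=2.0, movement_time_s=1.5)
```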

[CV-61] A Computational Pipeline for Advanced Analysis of 4D Flow MRI in the Left Atrium

【Quick Read】: This paper addresses the limitations of conventional ultrasound analysis for studying left atrium (LA) hemodynamics, as well as the challenges that low flow velocities and limited spatial resolution pose for LA analysis with 4D Flow MRI. The key to the solution is the first open-source computational framework dedicated to the LA, which enables comprehensive qualitative and quantitative analysis of 4D Flow MRI data, is robust to data of varying quality from different centers, and achieves highly accurate automated segmentation (Dice ≥ 0.9, Hausdorff 95 ≤ 3 mm), providing a reliable basis for exploring energy, vorticity, and pressure parameters within the LA as prognostic biomarkers.

Link: https://arxiv.org/abs/2505.09746
Authors: Xabier Morales, Ayah Elsayed, Debbie Zhao, Filip Loncaric, Ainhoa Aguado, Mireia Masias, Gina Quill, Marc Ramos, Ada Doltra, Ana Garcia, Marta Sitges, David Marlevi, Alistair Young, Martyn Nash, Bart Bijnens, Oscar Camara
Affiliations: PhySense, Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Spain; Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand; Faculty of Health and Environmental Sciences, Auckland University of Technology, Auckland, New Zealand; University Hospital Centre Zagreb, Zagreb, Croatia; Cardiovascular Institute, Hospital Clínic, Universitat de Barcelona, Barcelona, Spain; Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona, Spain; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA; Karolinska Institutet, Department of Molecular Medicine and Surgery, Stockholm, Sweden; Department of Engineering Science and Biomedical Engineering, University of Auckland, Auckland, New Zealand
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The left atrium (LA) plays a pivotal role in modulating left ventricular filling, but our comprehension of its hemodynamics is significantly limited by the constraints of conventional ultrasound analysis. 4D flow magnetic resonance imaging (4D Flow MRI) holds promise for enhancing our understanding of atrial hemodynamics. However, the low velocities within the LA and the limited spatial resolution of 4D Flow MRI make analyzing this chamber challenging. Furthermore, the absence of dedicated computational frameworks, combined with diverse acquisition protocols and vendors, complicates gathering large cohorts for studying the prognostic value of hemodynamic parameters provided by 4D Flow MRI. In this study, we introduce the first open-source computational framework tailored for the analysis of 4D Flow MRI in the LA, enabling comprehensive qualitative and quantitative analysis of advanced hemodynamic parameters. Our framework proves robust to data from different centers of varying quality, producing high-accuracy automated segmentations (Dice ≥ 0.9 and Hausdorff 95 ≤ 3 mm), even with limited training data. Additionally, we conducted the first comprehensive assessment of energy, vorticity, and pressure parameters in the LA across a spectrum of disorders to investigate their potential as prognostic biomarkers.

[CV-62] UOD: Universal One-shot Detection of Anatomical Landmarks MICCAI2023

【Quick Read】: This paper addresses the domain preference exhibited by one-shot medical landmark detection on multi-domain unlabeled data, together with the insufficient robustness of existing models: methods that excel in a single domain degrade markedly in multi-domain settings and are sensitive to annotation quality. The key to the solution is Universal One-shot Detection (UOD), a domain-adaptive one-shot landmark detection framework with two stages and two corresponding universal models; by combining domain-specific modules with domain-shared modules, it achieves cross-domain feature learning and global context modeling, improving generalization and detection accuracy on multi-domain data.

Link: https://arxiv.org/abs/2306.07615
Authors: Heqin Zhu, Quan Quan, Qingsong Yao, Zaiyi Liu, S. Kevin Zhou
Affiliations: not listed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Early accepted by MICCAI 2023. 11 pages, 4 figures, 2 tables. arXiv admin note: text overlap with arXiv:2203.06433

Abstract:One-shot medical landmark detection gains much attention and achieves great success for its label-efficient training process. However, existing one-shot learning methods are highly specialized in a single domain and suffer domain preference heavily in the situation of multi-domain unlabeled data. Moreover, one-shot learning is not robust, in that it faces a performance drop when a sub-optimal image is annotated. To tackle these issues, we resort to developing a domain-adaptive one-shot landmark detection framework for handling multi-domain medical images, named Universal One-shot Detection (UOD). UOD consists of two stages and two corresponding universal models which are designed as combinations of domain-specific modules and domain-shared modules. In the first stage, a domain-adaptive convolution model is self-supervised learned to generate pseudo landmark labels. In the second stage, we design a domain-adaptive transformer to eliminate domain preference and build the global context for multi-domain data. Even though only one annotated sample from each domain is available for training, the domain-shared modules help UOD aggregate all one-shot samples to detect more robust and accurate landmarks. We investigated both qualitatively and quantitatively the proposed UOD on three widely-used public X-ray datasets in different anatomical domains (i.e., head, hand, chest) and obtained state-of-the-art performances in each domain. The code is available at this https URL.

[CV-63] Multi-contrast laser endoscopy for in vivo gastrointestinal imaging

【Quick Read】: This paper addresses the problem that white light endoscopy misses clinically relevant gastrointestinal disease because the contrast of tissue color, texture, and morphology is often subtle. The key to the solution is Multi-contrast Laser Endoscopy (MLE), a platform for widefield clinical imaging with rapidly tunable spectral, coherent, and directional illumination, offering three core capabilities: enhancing tissue chromophore contrast with multispectral diffuse reflectance, quantifying blood flow with laser speckle contrast imaging, and characterizing mucosal topography with photometric stereo. MLE is validated in benchtop models and in vivo during clinical colonoscopies; its images show roughly three-fold higher contrast and five-fold higher color difference than white light and narrow-band imaging.

Link: https://arxiv.org/abs/2505.10492
Authors: Taylor L. Bobrow, Mayank Golhar, Suchapa Arayakarnkul, Anthony A. Song, Saowanee Ngamruengphong, Nicholas J. Durr
Affiliations: Johns Hopkins University; Johns Hopkins Hospital
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments:

Abstract:White light endoscopy is the clinical gold standard for detecting diseases in the gastrointestinal tract. Most applications involve identifying visual abnormalities in tissue color, texture, and shape. Unfortunately, the contrast of these features is often subtle, causing many clinically relevant cases to go undetected. To overcome this challenge, we introduce Multi-contrast Laser Endoscopy (MLE): a platform for widefield clinical imaging with rapidly tunable spectral, coherent, and directional illumination. We demonstrate three capabilities of MLE: enhancing tissue chromophore contrast with multispectral diffuse reflectance, quantifying blood flow using laser speckle contrast imaging, and characterizing mucosal topography using photometric stereo. We validate MLE with benchtop models, then demonstrate MLE in vivo during clinical colonoscopies. MLE images from 31 polyps demonstrate an approximate three-fold improvement in contrast and a five-fold improvement in color difference compared to white light and narrow band imaging. With the ability to reveal multiple complementary types of tissue contrast while seamlessly integrating into the clinical environment, MLE shows promise as an investigative tool to improve gastrointestinal imaging.
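The blood-flow channel relies on laser speckle contrast imaging (LSCI); the textbook spatial speckle-contrast computation, K = sigma/mean over a sliding window, looks like this (shown for illustration only, not the authors' code):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def speckle_contrast(raw, window=7):
    """Spatial speckle contrast K = sigma / mean over a sliding window.
    Lower K corresponds to faster flow (more blurring of the speckle)."""
    raw = raw.astype(np.float64)
    mean = uniform_filter(raw, size=window)
    mean_sq = uniform_filter(raw ** 2, size=window)
    var = np.clip(mean_sq - mean ** 2, 0.0, None)
    return np.sqrt(var) / (mean + 1e-12)
```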

[CV-64] HWA-UNETR: Hierarchical Window Aggregate UNETR for 3D Multimodal Gastric Lesion Segmentation MICCAI2025

【Quick Read】: This paper addresses the challenges multimodal medical image segmentation faces in gastric cancer lesion analysis, especially the scarcity of independent multimodal datasets and the need to fuse inherently misaligned modalities, which force algorithms to train on approximate data and rely on application migration, raising resource costs and degrading accuracy. The key to the solution is the public release of the GCM 2025 dataset, the first large-scale open-source collection of multimodal gastric cancer MRI scans, together with HWA-UNETR, a framework whose learnable window aggregation layers build dynamic feature correspondences between the anatomical structures of different modalities and whose tri-orientated fusion Mamba mechanism performs context modeling and captures long-range spatial dependencies, improving segmentation performance.

Link: https://arxiv.org/abs/2505.10464
Authors: Jiaming Liang, Lihuan Dai, Xiaoqi Sheng, Xiangguang Chen, Chun Yao, Guihua Tao, Qibin Leng, Honming Cai, Xi Zhong
Affiliations: not listed
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been provisionally accepted for MICCAI 2025

Abstract:Multimodal medical image segmentation faces significant challenges in the context of gastric cancer lesion analysis. This clinical context is defined by the scarcity of independent multimodal datasets and the imperative to amalgamate inherently misaligned modalities. As a result, algorithms are constrained to train on approximate data and depend on application migration, leading to substantial resource expenditure and a potential decline in analysis accuracy. To address those challenges, we have made two major contributions: First, we publicly disseminate the GCM 2025 dataset, which serves as the first large-scale, open-source collection of gastric cancer multimodal MRI scans, featuring professionally annotated FS-T2W, CE-T1W, and ADC images from 500 patients. Second, we introduce HWA-UNETR, a novel 3D segmentation framework that employs an original HWA block with learnable window aggregation layers to establish dynamic feature correspondences between different modalities' anatomical structures, and leverages the innovative tri-orientated fusion mamba mechanism for context modeling and capturing long-range spatial dependencies. Extensive experiments on our GCM 2025 dataset and the publicly available BraTS 2021 dataset validate the performance of our framework, demonstrating that the new approach surpasses existing methods by up to 1.68% in the Dice score while maintaining solid robustness. The dataset and code are public via this https URL.

[CV-65] Visual Fidelity Index for Generative Semantic Communications with Critical Information Embedding

【Quick Read】: This paper addresses the loss of fine-grained detail in generative semantic communication (Gen-SemCom) systems that rely on text prompts alone, as well as the lack of systematic performance metrics. The key to the solution is a hybrid Gen-SemCom system with a critical information embedding (CIE) framework that transmits both text prompts and semantically critical features to improve image reconstruction quality. A diffusion-based generative model performs high-fidelity reconstruction, and a generative visual information fidelity (GVIF) metric quantifies the mutual information between distorted features and their originals, enabling channel-adaptive system optimization by maximizing GVIF.

Link: https://arxiv.org/abs/2505.10405
Authors: Jianhao Huang, Qunsong Zeng, Kaibin Huang
Affiliations: The University of Hong Kong
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Generative semantic communication (Gen-SemCom) with large artificial intelligence (AI) model promises a transformative paradigm for 6G networks, which reduces communication costs by transmitting low-dimensional prompts rather than raw data. However, purely prompt-driven generation loses fine-grained visual details. Additionally, there is a lack of systematic metrics to evaluate the performance of Gen-SemCom systems. To address these issues, we develop a hybrid Gen-SemCom system with a critical information embedding (CIE) framework, where both text prompts and semantically critical features are extracted for transmissions. First, a novel approach of semantic filtering is proposed to select and transmit the semantically critical features of images relevant to semantic label. By integrating the text prompt and critical features, the receiver reconstructs high-fidelity images using a diffusion-based generative model. Next, we propose the generative visual information fidelity (GVIF) metric to evaluate the visual quality of the generated image. By characterizing the statistical models of image features, the GVIF metric quantifies the mutual information between the distorted features and their original counterparts. By maximizing the GVIF metric, we design a channel-adaptive Gen-SemCom system that adaptively controls the volume of features and the compression rate according to the channel state. Experimental results validate the GVIF metric's sensitivity to visual fidelity, correlating with both the PSNR and critical information volume. In addition, the optimized system achieves superior performance over benchmarking schemes in terms of higher PSNR and lower FID scores.
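GVIF builds on the classical visual information fidelity (VIF) idea of mutual-information ratios. For reference, classical VIF has the form below, where C is the natural-scene reference model, E and F are the reference and distorted image features, s is the model state, and k indexes subbands; the paper's exact GVIF definition generalizes this style of quantity to the Gen-SemCom feature models.

```latex
\mathrm{VIF} \,=\, \frac{\sum_{k} I\left(C^{k};\, F^{k} \mid s^{k}\right)}
                        {\sum_{k} I\left(C^{k};\, E^{k} \mid s^{k}\right)}
```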

[CV-66] Ordered-subsets Multi-diffusion Model for Sparse-view CT Reconstruction

【Quick Read】: This paper addresses the problem that, in sparse-view CT reconstruction, the large and highly redundant projection data makes diffusion models learn poorly and reconstruct insufficient fine detail. The key to the solution is the ordered-subsets multi-diffusion model (OSMM), which divides the CT projection data into equal subsets and learns each independently with a multi-subsets diffusion model (MSDM), reducing complexity and improving fine-detail reconstruction; in addition, a one-whole diffusion model (OWDM) over the complete sinogram acts as a global information constraint, reducing the generation of erroneous or inconsistent sinogram information and yielding higher-quality reconstructions.

Link: https://arxiv.org/abs/2505.09985
Authors: Pengfei Yu, Bin Huang, Minghui Zhang, Weiwen Wu, Shaoyu Wang, Qiegen Liu
Affiliations: not listed
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Score-based diffusion models have shown significant promise in the field of sparse-view CT reconstruction. However, the projection dataset is large and riddled with redundancy. Consequently, applying the diffusion model to unprocessed data results in lower learning effectiveness and higher learning difficulty, frequently leading to reconstructed images that lack fine details. To address these issues, we propose the ordered-subsets multi-diffusion model (OSMM) for sparse-view CT reconstruction. The OSMM innovatively divides the CT projection data into equal subsets and employs multi-subsets diffusion model (MSDM) to learn from each subset independently. This targeted learning approach reduces complexity and enhances the reconstruction of fine details. Furthermore, the integration of one-whole diffusion model (OWDM) with complete sinogram data acts as a global information constraint, which can reduce the possibility of generating erroneous or inconsistent sinogram information. Moreover, the OSMM’s unsupervised learning framework provides strong robustness and generalizability, adapting seamlessly to varying sparsity levels of CT sinograms. This ensures consistent and reliable performance across different clinical scenarios. Experimental results demonstrate that OSMM outperforms traditional diffusion models in terms of image quality and noise resilience, offering a powerful and versatile solution for advanced CT imaging in sparse-view scenarios.
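The ordered-subsets split itself is simple; below is a sketch under the assumption that subsets interleave projection angles so each one still spans the full angular range (the paper may partition differently).

```python
import numpy as np

def ordered_subsets(sinogram, num_subsets):
    """Partition projection angles into equal interleaved subsets, so each
    subset still spans the full angular range (sinogram: [n_angles, n_bins])."""
    assert sinogram.shape[0] % num_subsets == 0, "angles must divide evenly"
    return [sinogram[i::num_subsets] for i in range(num_subsets)]

subsets = ordered_subsets(np.zeros((180, 512)), num_subsets=6)  # six 30-view subsets
```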

[CV-67] ImplicitStainer: Data-Efficient Medical Image Translation for Virtual Antibody-based Tissue Staining Using Local Implicit Functions

【Quick Read】: This paper addresses the treatment delays caused by the long acquisition times and reliance on specialized centers of conventional immunohistochemical (IHC) staining in pathology, proposing deep learning-based image translation as an alternative that virtually generates IHC-stained images from hematoxylin and eosin (H&E)-stained images. The key to the solution is ImplicitStainer, a novel method that leverages local implicit functions for pixel-level prediction, improving image translation performance and producing high-quality results even with limited data.

Link: https://arxiv.org/abs/2505.09831
Authors: Tushar Kataria, Beatrice Knudsen, Shireen Y. Elhabian
Affiliations: not listed
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hematoxylin and eosin (H&E) staining is a gold standard for microscopic diagnosis in pathology. However, H&E staining does not capture all the diagnostic information that may be needed. To obtain additional molecular information, immunohistochemical (IHC) stains highlight proteins that mark specific cell types, such as CD3 for T-cells or CK8/18 for epithelial cells. While IHC stains are vital for prognosis and treatment guidance, they are typically only available at specialized centers and time consuming to acquire, leading to treatment delays for patients. Virtual staining, enabled by deep learning-based image translation models, provides a promising alternative by computationally generating IHC stains from H&E-stained images. Although many GAN- and diffusion-based image-to-image (I2I) translation methods have been used for virtual staining, these models treat image patches as independent data points, which results in increased and more diverse data requirements for effective generation. We present ImplicitStainer, a novel approach that leverages local implicit functions to improve image translation, specifically virtual staining performance, by focusing on pixel-level predictions. This method enhances robustness to variations in dataset sizes, delivering high-quality results even with limited data. We validate our approach on two datasets using a comprehensive set of metrics and benchmark it against over fifteen state-of-the-art GAN- and diffusion-based models. Full code and trained models will be released publicly via GitHub upon acceptance.
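A minimal sketch of a local implicit function in this setting: an MLP maps a local H&E feature vector plus the continuous in-cell coordinate offset of the queried pixel to the virtual stain's RGB value. The module name, dimensions, and the two-dimensional offset encoding are assumptions, not ImplicitStainer's exact design.

```python
import torch
import torch.nn as nn

class LocalImplicitStain(nn.Module):
    """Map a local H&E feature vector plus the continuous offset of the queried
    pixel inside its feature cell to the RGB value of the virtual IHC stain."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, local_feat, offset):  # (N, feat_dim), (N, 2) in [-1, 1]
        return self.mlp(torch.cat([local_feat, offset], dim=-1))
```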

[CV-68] Generative diffusion model surrogates for mechanistic agent-based biological models

【Quick Read】: This paper addresses the problem that, when generative AI serves as a surrogate for complex stochastic biological models, the models' stochasticity yields diverse configurations per parameter set, complicating surrogate development and verification. The key to the solution is training a generative surrogate of a Cellular-Potts Model (CPM) with denoising diffusion probabilistic models (DDPMs), combined with an image classifier that learns the characteristics defining unique regions of a 2D parameter space to aid surrogate model selection and verification. The approach achieves roughly a 22x reduction in computation time and can generate model configurations 20,000 timesteps ahead of a reference configuration.

Link: https://arxiv.org/abs/2505.09630
Authors: Tien Comlekoglu, J. Quetzalcóatl Toledo-Marín, Douglas W. DeSimone, Shayn M. Peirce, Geoffrey Fox, James A. Glazier
Affiliations: not listed
Subjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Performance (cs.PF)
Comments:

Abstract:Mechanistic, multicellular, agent-based models are commonly used to investigate tissue, organ, and organism-scale biology at single-cell resolution. The Cellular-Potts Model (CPM) is a powerful and popular framework for developing and interrogating these models. CPMs become computationally expensive at large space- and time-scales, making application and investigation of developed models difficult. Surrogate models may allow for the accelerated evaluation of CPMs of complex biological systems. However, the stochastic nature of these models means each set of parameters may give rise to different model configurations, complicating surrogate model development. In this work, we leverage denoising diffusion probabilistic models to train a generative AI surrogate of a CPM used to investigate in vitro vasculogenesis. We describe the use of an image classifier to learn the characteristics that define unique areas of a 2-dimensional parameter space. We then apply this classifier to aid in surrogate model selection and verification. Our CPM model surrogate generates model configurations 20,000 timesteps ahead of a reference configuration and demonstrates approximately a 22x reduction in computational time as compared to native code execution. Our work represents a step towards the implementation of DDPMs to develop digital twins of stochastic biological systems.

Artificial Intelligence

[AI-0] Neural Thermodynamic Laws for Large Language Model Training

【Quick Read】: This paper addresses the lack of established laws governing the training dynamics of large language models (LLMs) beyond neural scaling laws. The key to the solution is a new theoretical framework, Neural Thermodynamic Laws (NTL), which shows that, under river-valley loss landscape assumptions, key thermodynamic quantities (e.g., temperature, entropy, heat capacity, thermal conduction) and classical thermodynamic principles (e.g., the three laws of thermodynamics and the equipartition theorem) naturally emerge, yielding intuitive, scientific guidance for designing learning-rate schedules.

Link: https://arxiv.org/abs/2505.10559
Authors: Ziming Liu, Yizhou Liu, Jeff Gore, Max Tegmark
Affiliations: not listed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Comments: 18 pages, 10 figures

Abstract:Beyond neural scaling laws, little is known about the laws underlying large language models (LLMs). We introduce Neural Thermodynamic Laws (NTL) – a new framework that offers fresh insights into LLM training dynamics. On the theoretical side, we demonstrate that key thermodynamic quantities (e.g., temperature, entropy, heat capacity, thermal conduction) and classical thermodynamic principles (e.g., the three laws of thermodynamics and the equipartition theorem) naturally emerge under river-valley loss landscape assumptions. On the practical side, this scientific perspective yields intuitive guidelines for designing learning rate schedules.

[AI-1] Real-Time Out-of-Distribution Failure Prevention via Multi-Modal Reasoning

【Quick Read】: This paper addresses insufficient safety interventions when robots face hazardous scenarios that lie out of distribution (OOD) relative to their training data. Because of the high inference latency of large vision-language models, existing methods rely on manually defined intervention policies and cannot plan generalizable, semantically safe motions. The key to the proposed FORTRESS is generating and reasoning about semantically safe fallback strategies in real time: at low frequency it uses multi-modal reasoners to identify goals and anticipate failure modes, and when a runtime monitor triggers a fallback response it rapidly synthesizes fallback plans while inferring and avoiding semantically unsafe regions in real time, removing the need for hard-coded fallbacks and human safety interventions.

Link: https://arxiv.org/abs/2505.10547
Authors: Milan Ganai, Rohan Sinha, Christopher Agia, Daniel Morton, Marco Pavone
Affiliations: not listed
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Website: this https URL

Abstract:Foundation models can provide robust high-level reasoning on appropriate safety interventions in hazardous scenarios beyond a robot’s training data, i.e. out-of-distribution (OOD) failures. However, due to the high inference latency of Large Vision and Language Models, current methods rely on manually defined intervention policies to enact fallbacks, thereby lacking the ability to plan generalizable, semantically safe motions. To overcome these challenges we present FORTRESS, a framework that generates and reasons about semantically safe fallback strategies in real time to prevent OOD failures. At a low frequency in nominal operations, FORTRESS uses multi-modal reasoners to identify goals and anticipate failure modes. When a runtime monitor triggers a fallback response, FORTRESS rapidly synthesizes plans to fallback goals while inferring and avoiding semantically unsafe regions in real time. By bridging open-world, multi-modal reasoning with dynamics-aware planning, we eliminate the need for hard-coded fallbacks and human safety interventions. FORTRESS outperforms on-the-fly prompting of slow reasoning models in safety classification accuracy on synthetic benchmarks and real-world ANYmal robot data, and further improves system safety and planning success in simulation and on quadrotor hardware for urban navigation.

[AI-2] LibIQ: Toward Real-Time Spectrum Classification in O-RAN dApps

【Quick Read】: This paper addresses the excessive latency of traditional RAN real-time monitoring and control and the inability to access user-plane data, which block use cases such as beamforming and spectrum classification. The key to the solution is the dApps (decentralized applications) concept together with LibIQ, a novel library for efficient RF signal processing: it reads I/Q samples as time series, builds datasets, and visualizes data through plots and spectrograms, enabling real-time RF spectrum classification. A CNN integrated in LibIQ accurately identifies external RF signal types, improving the efficiency and precision of network management.

Link: https://arxiv.org/abs/2505.10537
Authors: Filippo Olimpieri, Noemi Giustini, Andrea Lacava, Salvatore D'Oro, Tommaso Melodia, Francesca Cuomo
Affiliations: not listed
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 6 pages, 5 figures, 2 tables

Abstract:The O-RAN architecture is transforming cellular networks by adopting RAN softwarization and disaggregation concepts to enable data-driven monitoring and control of the network. Such management is enabled by RICs, which facilitate near-real-time and non-real-time network control through xApps and rApps. However, they face limitations, including latency overhead in data exchange between the RAN and RIC, restricting real-time monitoring, and the inability to access user plane data due to privacy and security constraints, hindering use cases like beamforming and spectrum classification. In this paper, we leverage the dApps concept to enable real-time RF spectrum classification with LibIQ, a novel library for RF signals that facilitates efficient spectrum monitoring and signal classification by providing functionalities to read I/Q samples as time-series, create datasets and visualize time-series data through plots and spectrograms. Thanks to LibIQ, I/Q samples can be efficiently processed to detect external RF signals, which are subsequently classified using a CNN inside the library. To achieve accurate spectrum analysis, we created an extensive dataset of time-series-based I/Q samples, representing distinct signal types captured using a custom dApp running on a 5G deployment over the Colosseum network emulator and an OTA testbed. We evaluate our model by deploying LibIQ in heterogeneous scenarios with varying center frequencies, time windows, and external RF signals. In real-time analysis, the model classifies the processed I/Q samples, achieving an average accuracy of approximately 97.8% in identifying signal types across all scenarios. We pledge to release both LibIQ and the dataset created as a publicly available framework upon acceptance.
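To make the pipeline concrete (read I/Q samples as a time series, then visualize and classify via spectrograms), here is a generic sketch assuming interleaved float32 I/Q recordings, a common SDR layout; LibIQ's actual file format and API may differ.

```python
import numpy as np
from scipy.signal import spectrogram

def load_iq(path, count=-1):
    """Read interleaved float32 I/Q samples into a complex time series."""
    raw = np.fromfile(path, dtype=np.float32, count=count)
    return raw[0::2] + 1j * raw[1::2]

def iq_spectrogram(iq, fs, nfft=1024):
    """Two-sided spectrogram of complex baseband samples, in dB, as the kind
    of time-frequency input a CNN classifier could consume."""
    f, t, sxx = spectrogram(iq, fs=fs, nperseg=nfft, return_onesided=False)
    return f, t, 10.0 * np.log10(np.abs(sxx) + 1e-12)
```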

[AI-3] Knowledge capture, adaptation, and composition (KCAC): A framework for cross-task curriculum learning in robotic manipulation

【Quick Read】: This paper addresses the sample inefficiency and limited interpretability of reinforcement learning (RL) in robotic manipulation, which restrict its real-world applicability. The key to the solution is the Knowledge Capture, Adaptation, and Composition (KCAC) framework, which systematically integrates knowledge transfer into RL through cross-task curriculum learning, improving the agent's understanding of, and adaptation to, diverse working scenarios.

Link: https://arxiv.org/abs/2505.10522
Authors: Xinrui Wang, Yan Jin
Affiliations: not listed
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement learning (RL) has demonstrated remarkable potential in robotic manipulation but faces challenges in sample inefficiency and lack of interpretability, limiting its applicability in real world scenarios. Enabling the agent to gain a deeper understanding and adapt more efficiently to diverse working scenarios is crucial, and strategic knowledge utilization is a key factor in this process. This paper proposes a Knowledge Capture, Adaptation, and Composition (KCAC) framework to systematically integrate knowledge transfer into RL through cross-task curriculum learning. KCAC is evaluated using a two-block stacking task in the CausalWorld benchmark, a complex robotic manipulation environment. To our knowledge, existing RL approaches fail to solve this task effectively, reflecting deficiencies in knowledge capture. In this work, we redesign the benchmark reward function by removing rigid constraints and strict ordering, allowing the agent to maximize total rewards concurrently and enabling flexible task completion. Furthermore, we define two self-designed sub-tasks and implement a structured cross-task curriculum to facilitate efficient learning. As a result, our KCAC approach achieves a 40 percent reduction in training time while improving task success rates by 10 percent compared to traditional RL methods. Through extensive evaluation, we identify key curriculum design parameters (subtask selection, transition timing, and learning rate) that optimize learning efficiency and provide conceptual guidance for curriculum-based RL frameworks. This work offers valuable insights into curriculum design in RL and robotic learning.

[AI-4] PnPXAI: A Universal XAI Framework Providing Automatic Explanations Across Diverse Modalities and Models

【Quick Read】: This paper addresses the limitations of existing explainable AI (XAI) frameworks in flexibility, the number of supported explanation methods, and the quality of explanation recommendations, which stem from hard-coded implementations, reliance on layer-specific operations, and the absence of evaluation and optimization phases. The key to the solution is PnPXAI, a universal XAI framework that supports diverse data modalities and neural network models in a plug-and-play (PnP) manner: it automatically detects model architectures, recommends applicable explanation methods, and optimizes hyperparameters for the best explanations.

Link: https://arxiv.org/abs/2505.10515
Authors: Seongun Kim, Sol A Kim, Geonhyeong Kim, Enver Menadjiev, Chanwoo Lee, Seongwook Chung, Nari Kim, Jaesik Choi
Affiliations: not listed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recently, post hoc explanation methods have emerged to enhance model transparency by attributing model outputs to input features. However, these methods face challenges due to their specificity to certain neural network architectures and data modalities. Existing explainable artificial intelligence (XAI) frameworks have attempted to address these challenges but suffer from several limitations. These include limited flexibility to diverse model architectures and data modalities due to hard-coded implementations, a restricted number of supported XAI methods because of the requirements for layer-specific operations of attribution methods, and sub-optimal recommendations of explanations due to the lack of evaluation and optimization phases. Consequently, these limitations impede the adoption of XAI technology in real-world applications, making it difficult for practitioners to select the optimal explanation method for their domain. To address these limitations, we introduce PnPXAI, a universal XAI framework that supports diverse data modalities and neural network models in a Plug-and-Play (PnP) manner. PnPXAI automatically detects model architectures, recommends applicable explanation methods, and optimizes hyperparameters for optimal explanations. We validate the framework's effectiveness through user surveys and showcase its versatility across various domains, including medicine and finance.

[AI-5] Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps NEURIPS

【Quick Read】: This paper addresses the problem that diffusion policies trained on demonstrations with insufficient or sub-optimal coverage generate low-quality trajectories and can even fail catastrophically, while existing RL-based fine-tuning struggles to adapt Proximal Policy Optimization (PPO) to diffusion models because action-likelihood estimation during denoising is computationally intractable, complicating the optimization objective. The key to the proposed NCDPO framework is reformulating the diffusion policy as a noise-conditioned deterministic policy: by treating each denoising step as a differentiable transformation conditioned on pre-sampled noise, it makes likelihood evaluation and gradient backpropagation through all diffusion timesteps tractable, substantially improving sample efficiency and final performance.

Link: https://arxiv.org/abs/2505.10482
Authors: Ningyuan Yang, Jiaxuan Gao, Feng Gao, Yi Wu, Chao Yu
Affiliations: not listed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages for main text, 23 pages in total, submitted to NeurIPS, 13 figures

Abstract:Diffusion policies, widely adopted in decision-making scenarios such as robotics, gaming and autonomous driving, are capable of learning diverse skills from demonstration data due to their high representation power. However, the sub-optimal and limited coverage of demonstration data could lead to diffusion policies that generate sub-optimal trajectories and even catastrophic failures. While reinforcement learning (RL)-based fine-tuning has emerged as a promising solution to address these limitations, existing approaches struggle to effectively adapt Proximal Policy Optimization (PPO) to diffusion models. This challenge stems from the computational intractability of action likelihood estimation during the denoising process, which leads to complicated optimization objectives. In our experiments starting from randomly initialized policies, we find that online tuning of Diffusion Policies demonstrates much lower sample efficiency compared to directly applying PPO on MLP policies (MLP+PPO). To address these challenges, we introduce NCDPO, a novel framework that reformulates Diffusion Policy as a noise-conditioned deterministic policy. By treating each denoising step as a differentiable transformation conditioned on pre-sampled noise, NCDPO enables tractable likelihood evaluation and gradient backpropagation through all diffusion timesteps. Our experiments demonstrate that NCDPO achieves sample efficiency comparable to MLP+PPO when training from scratch, outperforming existing methods in both sample efficiency and final performance across diverse benchmarks, including continuous robot control and multi-agent game scenarios. Furthermore, our experimental results show that our method is robust to the number of denoising timesteps in the Diffusion Policy.

[AI-6] AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges

【Quick Read】: This paper addresses the need to distinguish AI Agents from Agentic AI in design philosophy, capability scope, and application scenarios, clarifying the technical differences and building a systematic taxonomy. The key to the solution is a comparative analysis across architectural evolution, operational mechanisms, interaction styles, and levels of autonomy, together with targeted strategies (such as ReAct loops, RAG, orchestration layers, and causal modeling) for the distinct challenges of each paradigm (hallucination, brittleness, emergent behavior, and coordination failure), aiming toward more robust, scalable, and explainable AI agent systems and Agentic AI-driven applications.

Link: https://arxiv.org/abs/2505.10468
Authors: Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee
Affiliations: not listed
Subjects: Artificial Intelligence (cs.AI)
Comments: 32 pages, 14 figures, 11 tables

Abstract:This study critically distinguishes between AI Agents and Agentic AI, offering a structured conceptual taxonomy, application mapping, and challenge analysis to clarify their divergent design philosophies and capabilities. We begin by outlining the search strategy and foundational definitions, characterizing AI Agents as modular systems driven by Large Language Models (LLMs) and Large Image Models (LIMs) for narrow, task-specific automation. Generative AI is positioned as a precursor, with AI Agents advancing through tool integration, prompt engineering, and reasoning enhancements. In contrast, Agentic AI systems represent a paradigmatic shift marked by multi-agent collaboration, dynamic task decomposition, persistent memory, and orchestrated autonomy. Through a sequential evaluation of architectural evolution, operational mechanisms, interaction styles, and autonomy levels, we present a comparative analysis across both paradigms. Application domains such as customer support, scheduling, and data summarization are contrasted with Agentic AI deployments in research automation, robotic coordination, and medical decision support. We further examine unique challenges in each paradigm, including hallucination, brittleness, emergent behavior, and coordination failure, and propose targeted solutions such as ReAct loops, RAG, orchestration layers, and causal modeling. This work aims to provide a definitive roadmap for developing robust, scalable, and explainable AI agent and Agentic AI-driven systems. Keywords: AI Agents, Agent-driven, Vision-Language-Models, Agentic AI Decision Support System, Agentic-AI Applications

[AI-7] Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

【Quick Read】: This paper evaluates the reasoning capability and robustness of large language models (LLMs) on programming tasks, asking whether their understanding of Python programs rests on sound logical reasoning rather than lucky guessing. The key to the solution is applying five semantics-preserving code mutations (renaming variables, mirroring comparison expressions, swapping if-else branches, converting for loops to while, and loop unrolling) to probe the stability and consistency of LLM predictions under syntactic changes that leave semantics intact, combined with human expert analysis to verify whether correct predictions rest on sound reasoning.

Link: https://arxiv.org/abs/2505.10443
Authors: Pedro Orvalho, Marta Kwiatkowska
Affiliations: not listed
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 tables, 1 figure

Abstract:Understanding the reasoning and robustness of Large Language Models (LLMs) is critical for their reliable use in programming tasks. While recent studies have assessed LLMs' ability to predict program outputs, most focus solely on the accuracy of those predictions, without evaluating the reasoning behind them. Moreover, it has been observed on mathematical reasoning tasks that LLMs can arrive at correct answers through flawed logic, raising concerns about similar issues in code understanding. In this work, we evaluate whether state-of-the-art LLMs with up to 8B parameters can reason about Python programs or are simply guessing. We apply five semantics-preserving code mutations: renaming variables, mirroring comparison expressions, swapping if-else branches, converting for loops to while, and loop unrolling. These mutations maintain program semantics while altering its syntax. We evaluated six LLMs and performed a human expert analysis using LiveCodeBench to assess whether the correct predictions are based on sound reasoning. We also evaluated prediction stability across different code mutations on LiveCodeBench and CruxEval. Our findings show that some LLMs, such as Llama3.2, produce correct predictions based on flawed reasoning in up to 61% of cases. Furthermore, LLMs often change predictions in response to our code mutations, indicating limited robustness in their semantic understanding.
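One of the five mutations is easy to write out by hand; the example below shows the for-to-while conversion, where both versions are semantically identical, so a model that truly understands the code should predict the same output for either one.

```python
# For-to-while conversion, one of the paper's five mutations. Both functions
# compute the same value, so an LLM that reasons correctly about the code
# should give identical output predictions for both versions.

def sum_sq_for(xs):
    total = 0
    for x in xs:
        total += x * x
    return total

def sum_sq_while(xs):
    total = 0
    i = 0
    while i < len(xs):
        total += xs[i] * xs[i]
        i += 1
    return total

assert sum_sq_for([1, 2, 3]) == sum_sq_while([1, 2, 3]) == 14
```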

[AI-8] IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning

【Quick Read】: This paper addresses the instability and poor sample efficiency of the fine-tuning phase in robot policy learning pipelines that pretrain with imitation learning (IL) and then fine-tune with reinforcement learning (RL). The key to the proposed IN-RIL (INterleaved Reinforcement learning and Imitation Learning) is to periodically inject IL updates after multiple RL updates, combining the stability of IL with the guidance of expert data to improve exploration efficiency throughout fine-tuning. Because IL and RL pursue different optimization objectives, gradient separation mechanisms are introduced that place potentially conflicting gradient updates in orthogonal subspaces to prevent destructive interference.

Link: https://arxiv.org/abs/2505.10442
Authors: Dechen Gao, Hang Wang, Hanchu Zhou, Nejib Ammar, Shatadal Mishra, Ahmadreza Moradipari, Iman Soltani, Junshan Zhang
Affiliations: not listed
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Imitation learning (IL) and reinforcement learning (RL) each offer distinct advantages for robotics policy learning: IL provides stable learning from demonstrations, and RL promotes generalization through exploration. While existing robot learning approaches using IL-based pre-training followed by RL-based fine-tuning are promising, this two-step learning paradigm often suffers from instability and poor sample efficiency during the RL fine-tuning phase. In this work, we introduce IN-RIL, INterleaved Reinforcement learning and Imitation Learning, for policy fine-tuning, which periodically injects IL updates after multiple RL updates and hence can benefit from the stability of IL and the guidance of expert data for more efficient exploration throughout the entire fine-tuning process. Since IL and RL involve different optimization objectives, we develop gradient separation mechanisms to prevent destructive interference during IN-RIL fine-tuning, by separating possibly conflicting gradient updates in orthogonal subspaces. Furthermore, we conduct rigorous analysis, and our findings shed light on why interleaving IL with RL stabilizes learning and improves sample-efficiency. Extensive experiments on 14 robot manipulation and locomotion tasks across 3 benchmarks, including FurnitureBench, OpenAI Gym, and Robomimic, demonstrate that IN-RIL can significantly improve sample efficiency and mitigate performance collapse during online finetuning in both long- and short-horizon tasks with either sparse or dense rewards. IN-RIL, as a general plug-in compatible with various state-of-the-art RL algorithms, can significantly improve RL fine-tuning, e.g., from 12% to 88% with 6.3x improvement in the success rate on Robomimic Transport. Project page: this https URL.
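The gradient-separation idea can be sketched as an orthogonal projection in the style of PCGrad: when the IL and RL gradients conflict, remove the conflicting component before applying the update. This is a hedged illustration of the mechanism, not IN-RIL's exact procedure.

```python
import torch

def separate_gradients(g_il, g_rl):
    """PCGrad-style sketch: if the flattened 1-D IL and RL gradient vectors
    conflict (negative inner product), project the IL gradient onto the
    orthogonal complement of the RL gradient before applying it."""
    dot = torch.dot(g_il, g_rl)
    if dot < 0:  # destructive interference between the two objectives
        g_il = g_il - dot / (g_rl.norm() ** 2 + 1e-12) * g_rl
    return g_il
```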

[AI-9] Evaluating Model Explanations without Ground Truth

【Quick Read】: This paper addresses the problem of evaluating model-prediction explanations: faced with many competing and contradictory explanations of a single prediction, it is hard to choose the right one. Existing frameworks measure quality by comparing against ideal "ground-truth" explanations or by verifying model sensitivity to important inputs, both of which have limitations. The key to the proposed ground-truth Agnostic eXplanation Evaluation framework (AXE) is that it requires neither ideal ground-truth explanations nor model sensitivity, providing an independent measure of explanation quality that can be used to evaluate and compare model explanations and to detect explanation fairwashing.

Link: https://arxiv.org/abs/2505.10399
Authors: Kaivalya Rawal, Zihao Fu, Eoin Delaney, Chris Russell
Affiliations: not listed
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: this https URL

Abstract:There can be many competing and contradictory explanations for a single model prediction, making it difficult to select which one to use. Current explanation evaluation frameworks measure quality by comparing against ideal “ground-truth” explanations, or by verifying model sensitivity to important inputs. We outline the limitations of these approaches, and propose three desirable principles to ground the future development of explanation evaluation strategies for local feature importance explanations. We propose a ground-truth Agnostic eXplanation Evaluation framework (AXE) for evaluating and comparing model explanations that satisfies these principles. Unlike prior approaches, AXE does not require access to ideal ground-truth explanations for comparison, or rely on model sensitivity - providing an independent measure of explanation quality. We verify AXE by comparing with baselines, and show how it can be used to detect explanation fairwashing. Our code is available at this https URL.

[AI-10] Inconsistency Handling in DatalogMTL IJCAI2025

【Quick Read】: This paper addresses inconsistency handling in DatalogMTL, an extension of Datalog with metric temporal operators. Because facts are associated with time intervals, consistency can be restored in several ways when facts contradict the rules, e.g., deleting facts or modifying their intervals. The key contribution is defining suitable notions of conflicts (minimal explanations for inconsistency) and repairs (possible ways of restoring consistency) for this setting, studying their properties and the associated inconsistency-tolerant semantics, and analyzing the data complexity of generating a single conflict/repair and of query entailment under repair-based semantics.

Link: https://arxiv.org/abs/2505.10394
Authors: Meghyn Bienvenu, Camille Bourgaux, Atefe Khodadaditaghanaki
Affiliations: not listed
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: This is an extended version of a paper appearing at the 34th International Joint Conference on Artificial Intelligence (IJCAI 2025). 24 pages

Abstract:In this paper, we explore the issue of inconsistency handling in DatalogMTL, an extension of Datalog with metric temporal operators. Since facts are associated with time intervals, there are different manners to restore consistency when they contradict the rules, such as removing facts or modifying their time intervals. Our first contribution is the definition of relevant notions of conflicts (minimal explanations for inconsistency) and repairs (possible ways of restoring consistency) for this setting and the study of the properties of these notions and the associated inconsistency-tolerant semantics. Our second contribution is a data complexity analysis of the tasks of generating a single conflict / repair and query entailment under repair-based semantics.

[AI-11] Schreier-Coset Graph Propagation

【Quick Read】: This paper addresses over-squashing in graph neural networks (GNNs), where information from distant nodes is compressed into fixed-size vectors, limiting expressiveness. Existing remedies such as graph rewiring and bottleneck-resistant constructions (e.g., Cayley and expander graphs) mitigate the issue but hit scalability bottlenecks. The key to the proposed Schreier-Coset Graph Propagation (SCGP) is a group-theoretic augmentation that enriches node features through Schreier-coset embeddings without altering the input graph topology, embedding bottleneck-free connectivity patterns into a compact feature space and thereby improving long-range message passing while remaining computationally efficient.

Link: https://arxiv.org/abs/2505.10392
Authors: Aryan Mishra, Lizhen Lin
Affiliations: not listed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 figure, preprint

Abstract:Graph Neural Networks (GNNs) offer a principled framework for learning over graph-structured data, yet their expressive capacity is often hindered by over-squashing, wherein information from distant nodes is compressed into fixed-size vectors. Existing solutions, including graph rewiring and bottleneck-resistant architectures such as Cayley and expander graphs, avoid this problem but introduce scalability bottlenecks. In particular, the Cayley graphs constructed over SL(2, \mathbb{Z}_n) exhibit strong theoretical properties, yet suffer from cubic node growth O(n^3), leading to high memory usage. To address this, this work introduces Schreier-Coset Graph Propagation (SCGP), a group-theoretic augmentation method that enriches node features through Schreier-coset embeddings without altering the input graph topology. SCGP embeds bottleneck-free connectivity patterns into a compact feature space, improving long-range message passing while maintaining computational efficiency. Empirical evaluations across standard node and graph classification benchmarks demonstrate that SCGP achieves performance comparable to, or exceeding, expander graph and rewired GNN baselines. Furthermore, SCGP exhibits particular advantages in processing hierarchical and modular graph structures, offering reduced inference latency, improved scalability, and a low memory footprint, making it suitable for real-time and resource-constrained applications.

[AI-12] Multi-Agent Path Finding For Large Agents Is Intractable

【Quick Read】: This paper addresses multi-agent path finding (MAPF) with large agents. Classical MAPF neglects agent sizes and handles only vertex and edge conflicts, yet in practice (e.g., robotics) agent sizes must be considered to guarantee safe execution of the planned paths. The key contribution is the first proof that this problem is NP-hard, i.e., assuming P != NP, no complete polynomial-time algorithm exists. The proof reduces the classical 3SAT problem (known to be NP-complete) to the MAPF problem at hand by constructing a dedicated graph and showing that a 3SAT formula is satisfiable iff the corresponding path-finding instance has a solution.

Link: https://arxiv.org/abs/2505.10387
Authors: Artem Agafonov, Konstantin Yakovlev
Affiliations: not listed
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
Comments:

Abstract:The multi-agent path finding (MAPF) problem asks to find a set of paths on a graph such that when synchronously following these paths the agents never encounter a conflict. In the most widespread MAPF formulation, the so-called Classical MAPF, the agents' sizes are neglected and two types of conflicts are considered: occupying the same vertex or using the same edge at the same time step. Meanwhile in numerous practical applications, e.g. in robotics, taking into account the agents' sizes is vital to ensure that the MAPF solutions can be safely executed. Introducing large agents yields an additional type of conflict arising when one agent follows an edge and its body overlaps with the body of another agent that is actually not using this same edge (e.g. staying still at some distinct vertex of the graph). Until now it was not clear how much harder the problem gets when such conflicts are to be considered while planning. Specifically, it was known that Classical MAPF problem on an undirected graph can be solved in polynomial time, however no complete polynomial-time algorithm was presented to solve MAPF with large agents. In this paper we, for the first time, establish that the latter problem is NP-hard and, thus, if P!=NP no polynomial algorithm for it can, unfortunately, be presented. Our proof is based on the prevalent in the field technique of reducing the seminal 3SAT problem (which is known to be an NP-complete problem) to the problem at hand. In particular, for an arbitrary 3SAT formula we procedurally construct a dedicated graph with specific start and goal vertices and show that the given 3SAT formula is satisfiable iff the corresponding path finding instance has a solution.

[AI-13] Are Sparse Autoencoders Useful for Java Function Bug Detection?

【速读】:该论文试图解决软件漏洞检测中传统方法存在的误报率高、可扩展性差以及依赖人工干预的问题,同时探索生成式 AI(Generative AI)在自动化漏洞检测与安全代码生成中的应用潜力。论文提出的解决方案关键在于利用稀疏自编码器(Sparse Autoencoder, SAE)作为轻量级且可解释的替代方案,通过分析预训练大型语言模型(Large Language Model, LLM)内部表示来检测 Java 函数中的缺陷,无需对底层模型进行微调或任务特定监督,从而实现了高达 89% 的 F1 分数的漏洞检测效果。

Link: https://arxiv.org/abs/2505.10375
Authors: Rui Melo, Claudia Mamede, Andre Catarino, Rui Abreu, Henrique Lopes Cardoso
Affiliations: not listed
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 10 figures

Abstract:Software vulnerabilities such as buffer overflows and SQL injections are a major source of security breaches. Traditional methods for vulnerability detection remain essential but are limited by high false positive rates, scalability issues, and reliance on manual effort. These constraints have driven interest in AI-based approaches to automated vulnerability detection and secure code generation. While Large Language Models (LLMs) have opened new avenues for classification tasks, their complexity and opacity pose challenges for interpretability and deployment. Sparse Autoencoders (SAEs) offer a promising solution to this problem. We explore whether SAEs can serve as a lightweight, interpretable alternative for bug detection in Java functions. We evaluate the effectiveness of SAEs when applied to representations from GPT-2 Small and Gemma 2B, examining their capacity to highlight buggy behaviour without fine-tuning the underlying LLMs. We found that SAE-derived features enable bug detection with an F1 score of up to 89%, consistently outperforming fine-tuned transformer encoder baselines. Our work provides the first empirical evidence that SAEs can be used to detect software bugs directly from the internal representations of pretrained LLMs, without any fine-tuning or task-specific supervision.
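To make the idea concrete, here is a minimal, hypothetical sketch of the SAE-probe pattern the abstract describes: train a sparse autoencoder on hidden states collected from a frozen LLM, then fit a lightweight probe on the sparse codes. All dimensions, hyperparameters, and the linear probe are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 sparsity penalty on the code."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse feature code
        return self.decoder(z), z

def train_sae(acts, d_hidden=4096, l1=1e-3, epochs=10, lr=1e-3):
    """acts: (N, d_model) hidden states collected from a frozen LLM."""
    sae = SparseAutoencoder(acts.shape[1], d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, z = sae(acts)
        loss = ((recon - acts) ** 2).mean() + l1 * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

# Downstream: a linear probe on SAE features flags buggy functions.
# (Activations and labels here are random placeholders.)
acts = torch.randn(256, 768)          # e.g. GPT-2 Small residual stream
labels = torch.randint(0, 2, (256,))  # 1 = buggy Java function
sae = train_sae(acts)
with torch.no_grad():
    _, feats = sae(acts)
probe = nn.Linear(feats.shape[1], 2)
# ... train `probe` with cross-entropy on `feats` against `labels`
```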

[AI-14] ILIF: Temporal Inhibitory Leaky Integrate-and-Fire Neuron for Overactivation in Spiking Neural Networks

【Quick Read】: This paper addresses the gamma dilemma faced when training Spiking Neural Networks (SNNs) via backpropagation: a large gamma causes overactivation and higher energy consumption, while a small gamma causes vanishing gradients and weakened temporal dependencies. The key to the solution is a temporal Inhibitory Leaky Integrate-and-Fire (ILIF) neuron model inspired by biological inhibitory mechanisms, which incorporates interconnected inhibitory units that regulate membrane potential and current, effectively mitigating overactivation while preserving gradient propagation.

Link: https://arxiv.org/abs/2505.10371
Authors: Kai Sun, Peibo Duan, Levin Kuhlmann, Beilun Wang, Bin Zhang
Institution: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The Spiking Neural Network (SNN) has drawn increasing attention for its energy-efficient, event-driven processing and biological plausibility. To train SNNs via backpropagation, surrogate gradients are used to approximate the non-differentiable spike function, but they only maintain nonzero derivatives within a narrow range of membrane potentials near the firing threshold, referred to as the surrogate gradient support width gamma. We identify a major challenge, termed the dilemma of gamma: a relatively large gamma leads to overactivation, characterized by excessive neuron firing, which in turn increases energy consumption, whereas a small gamma causes vanishing gradients and weakens temporal dependencies. To address this, we propose a temporal Inhibitory Leaky Integrate-and-Fire (ILIF) neuron model, inspired by biological inhibitory mechanisms. This model incorporates interconnected inhibitory units for membrane potential and current, effectively mitigating overactivation while preserving gradient propagation. Theoretical analysis demonstrates the effectiveness of ILIF in overcoming the gamma dilemma, and extensive experiments on multiple datasets show that ILIF improves energy efficiency by reducing firing rates, stabilizes training, and enhances accuracy. The code is available at this http URL.
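The gamma dilemma is easiest to see in code. Below is a minimal sketch of a standard LIF neuron with a rectangular surrogate gradient of support width gamma; the interconnected inhibitory units that distinguish ILIF are not reproduced here, and all constants are illustrative.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike with a rectangular surrogate gradient of width gamma."""
    gamma = 0.5  # support width of the surrogate gradient

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v >= 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        # Nonzero only within gamma of the threshold: a large gamma passes
        # gradient to many neurons (risking overactivation), while a small
        # gamma starves most neurons of gradient (vanishing gradients).
        mask = (v.abs() < SurrogateSpike.gamma).float() / (2 * SurrogateSpike.gamma)
        return grad_out * mask

def lif_step(v, x, tau=2.0, v_th=1.0):
    """One leaky integrate-and-fire step; returns (new potential, spikes)."""
    v = v + (x - v) / tau                 # leaky integration of the input
    s = SurrogateSpike.apply(v - v_th)    # spike if v crosses the threshold
    v = v * (1 - s)                       # hard reset after a spike
    return v, s

v = torch.zeros(8)
x = torch.randn(8, requires_grad=True)
v, s = lif_step(v, x)
s.sum().backward()  # gradients flow only through the gamma window
print(x.grad)
```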

[AI-15] Plasticity as the Mirror of Empowerment

【Quick Read】: This paper addresses how to quantify the extent to which an agent is influenced by its observations and how to relate this to the agent's empowerment. The key to the solution is a universal agent-centric measure called plasticity, defined via a new information-theoretic quantity, the generalized directed information, which strictly generalizes Massey's (1990) directed information while preserving all of its desirable properties. The study finds that plasticity is the mirror of empowerment and that a design trade-off exists between the two, with important implications for understanding the nature of agency.

Link: https://arxiv.org/abs/2505.10361
Authors: David Abel, Michael Bowling, André Barreto, Will Dabney, Shi Dong, Steven Hansen, Anna Harutyunyan, Khimya Khetarpal, Clare Lyle, Razvan Pascanu, Georgios Piliouras, Doina Precup, Jonathan Richens, Mark Rowland, Tom Schaul, Satinder Singh
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Agents are minimally entities that are influenced by their past observations and act to influence future observations. This latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. This former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Our first finding is that plasticity is the mirror of empowerment: The agent’s plasticity is identical to the empowerment of the environment, and vice versa. Our second finding establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency.
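For reference, this is the classical directed information of Massey (1990), which the paper's generalized directed information strictly extends (the generalized quantity itself is not reproduced here):

```latex
% Directed information (Massey, 1990) from a sequence X^n to Y^n:
I(X^n \to Y^n) = \sum_{t=1}^{n} I\!\left(X^t ; Y_t \,\middle|\, Y^{t-1}\right)
% where X^t = (X_1, \dots, X_t). Unlike mutual information, the sum
% conditions causally on the past, which is what makes the quantity
% suitable for capturing influence over time.
```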

[AI-16] FactsR: A Safer Method for Producing High Quality Healthcare Documentation

【Quick Read】: This paper addresses the problems of current AI-scribing solutions in healthcare that generate clinical notes with single- or few-shot prompts after the consultation ends: long notes, hallucinations, misrepresentation of the clinician's intent, and reliance on clinician proofreading to catch errors. The key to the solution is a method that extracts salient clinical information, called Facts, in real time during the consultation and uses it recursively to generate the final note, improving accuracy and conciseness while placing the clinician in the loop of note generation and opening up potential for real-time decision support.

Link: https://arxiv.org/abs/2505.10360
Authors: Victor Petrén Bach Hansen, Lasse Krogsbøll, Jonas Lyngsø, Mathias Baltzersen, Andreas Motzfeldt, Kevin Pelgrims, Lars Maaløe
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:

Click to view abstract

Abstract:There are now a multitude of AI-scribing solutions for healthcare promising the utilization of large language models for ambient documentation. However, these AI scribes still rely on one-shot, or few-shot prompts for generating notes after the consultation has ended, employing little to no reasoning. This risks long notes with an increase in hallucinations, misrepresentation of the intent of the clinician, and reliance on the proofreading of the clinician to catch errors. A dangerous combination for patient safety if vigilance is compromised by workload and fatigue. In this paper, we introduce a method for extracting salient clinical information in real-time alongside the healthcare consultation, denoted Facts, and use that information recursively to generate the final note. The FactsR method results in more accurate and concise notes by placing the clinician-in-the-loop of note generation, while opening up new use cases within real-time decision support.

[AI-17] Uniform Loss vs. Specialized Optimization: A Comparative Analysis in Multi-Task Learning

【Quick Read】: This paper addresses the problem of balancing task learning in Multi-Task Learning (MTL), re-evaluating the role of Specialized Multi-Task Optimizers (SMTOs) in handling conflicting gradients and differing gradient norms. The key to the solution is an extensive empirical evaluation of SMTOs, showing that they perform well, while also revealing that fixed weights and uniform loss can in some cases achieve performance competitive with SMTOs, thereby questioning the necessity of SMTOs in all scenarios.

Link: https://arxiv.org/abs/2505.10347
Authors: Gabriel S. Gama, Valdir Grassi Jr
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Specialized Multi-Task Optimizers (SMTOs) balance task learning in Multi-Task Learning by addressing issues like conflicting gradients and differing gradient norms, which hinder equal-weighted task training. However, recent critiques suggest that equally weighted tasks can achieve competitive results compared to SMTOs, arguing that previous SMTO results were influenced by poor hyperparameter optimization and lack of regularization. In this work, we evaluate these claims through an extensive empirical evaluation of SMTOs, including some of the latest methods, on more complex multi-task problems to clarify this behavior. Our findings indicate that SMTOs perform well compared to uniform loss and that fixed weights can achieve competitive performance compared to SMTOs. Furthermore, we demonstrate why uniform loss performs similarly to SMTOs in some instances. The code will be made publicly available.

[AI-18] Emergence of Structure in Ensembles of Random Neural Networks

【Quick Read】: This paper addresses the theoretical mechanism behind the emergence of collective behavior in ensembles of random classifiers, in particular how weighting the ensemble by a Gibbs measure from statistical physics yields optimal classification performance. The key to the solution is a finite temperature parameter at which classification is optimal with respect to the classification loss (or energy); this temperature depends neither on the teacher classifier nor on the number of random classifiers, highlighting the universal nature of the observed phenomenon.

Link: https://arxiv.org/abs/2505.10331
Authors: Luca Muscarnera, Luigi Loreti, Giovanni Todeschini, Alessio Fumagalli, Francesco Regazzoni
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Randomness is ubiquitous in many applications across data science and machine learning. Remarkably, systems composed of random components often display emergent global behaviors that appear deterministic, manifesting a transition from microscopic disorder to macroscopic organization. In this work, we introduce a theoretical model for studying the emergence of collective behaviors in ensembles of random classifiers. We argue that, if the ensemble is weighted through the Gibbs measure defined by adopting the classification loss as an energy, then there exists a finite temperature parameter for the distribution such that the classification is optimal, with respect to the loss (or the energy). Interestingly, for the case in which samples are generated by a Gaussian distribution and labels are constructed by employing a teacher perceptron, we analytically prove and numerically confirm that such an optimal temperature depends neither on the teacher classifier (which is, by construction of the learning problem, unknown) nor on the number of random classifiers, highlighting the universal nature of the observed behavior. Experiments on the MNIST dataset underline the relevance of this phenomenon in high-quality, noiseless, datasets. Finally, a physical analogy allows us to shed light on the self-organizing nature of the studied phenomenon.
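A small numerical sketch of the setup described above, under simplifying assumptions (random hyperplane classifiers, 0/1 training loss as the energy): weighting the ensemble by exp(-beta * energy) and sweeping the inverse temperature beta illustrates how a finite temperature can beat both the uniform vote (beta = 0) and near-greedy selection (large beta).

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher perceptron generates labels for Gaussian samples.
d, n, m = 20, 500, 200                 # dim, samples, random classifiers
X = rng.standard_normal((n, d))
w_teacher = rng.standard_normal(d)
y = np.sign(X @ w_teacher)

# Ensemble of random hyperplane classifiers.
W = rng.standard_normal((m, d))
preds = np.sign(X @ W.T)               # (n, m) individual predictions

# Energy of each classifier = its 0/1 training loss.
energy = (preds != y[:, None]).mean(axis=0)

def gibbs_predict(beta: float) -> np.ndarray:
    """Weight classifiers by exp(-beta * energy) and take a weighted vote."""
    w = np.exp(-beta * energy)
    w /= w.sum()
    return np.sign(preds @ w)

# Sweep the inverse temperature; some finite beta minimizes ensemble error.
for beta in [0.0, 1.0, 5.0, 20.0, 100.0]:
    err = (gibbs_predict(beta) != y).mean()
    print(f"beta={beta:6.1f}  ensemble error={err:.3f}")
```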

[AI-19] Efficient Adaptation of Reinforcement Learning Agents to Sudden Environmental Change

【Quick Read】: This paper addresses how deep reinforcement learning (RL) agents deployed in dynamic environments can adapt efficiently to novel environmental changes while avoiding catastrophic forgetting. The key to the solution lies in two core capabilities: (1) prioritized exploration and sampling strategies that identify and learn from relevant experiences; and (2) selective preservation of prior knowledge through structured representations that can be updated without disrupting reusable components.

Link: https://arxiv.org/abs/2505.10330
Authors: Jonathan Clifford Balloch
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: PhD Dissertation, 131 pages

Click to view abstract

Abstract:Real-world autonomous decision-making systems, from robots to recommendation engines, must operate in environments that change over time. While deep reinforcement learning (RL) has shown an impressive ability to learn optimal policies in stationary environments, most methods are data intensive and assume a world that does not change between training and test time. As a result, conventional RL methods struggle to adapt when conditions change. This poses a fundamental challenge: how can RL agents efficiently adapt their behavior when encountering novel environmental changes during deployment without catastrophically forgetting useful prior knowledge? This dissertation demonstrates that efficient online adaptation requires two key capabilities: (1) prioritized exploration and sampling strategies that help identify and learn from relevant experiences, and (2) selective preservation of prior knowledge through structured representations that can be updated without disruption to reusable components.

[AI-20] A Comparative Study of SMT and MILP for the Nurse Rostering Problem

【Quick Read】: This paper addresses healthcare personnel rostering, which is highly challenging due to ever-present demand and complex constraints. The key to the solution is to propose generic constraint formulations that can model a wide range of real-world scheduling constraints and to cast them as Satisfiability Modulo Theories (SMT) and Mixed-Integer Linear Programming (MILP) problems, comparing state-of-the-art solvers such as Z3 and Gurobi. Experimental results show that the SMT solver excels on realistic problems with varied shifts and personnel, although its performance is more sensitive to how the generic constraints are formulated.

Link: https://arxiv.org/abs/2505.10328
Authors: Alvin Combrink, Stephie Do, Kristofer Bengtsson, Sabino Francesco Roselli, Martin Fabian
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 6 pages, 3 figures

Click to view abstract

Abstract:The effects of personnel scheduling on the quality of care and working conditions for healthcare personnel have been thoroughly documented. However, the ever-present demand and large variation of constraints make healthcare scheduling particularly challenging. This problem has been studied for decades, with limited research aimed at applying Satisfiability Modulo Theories (SMT). SMT has gained momentum within the formal verification community in the last decades, leading to the advancement of SMT solvers that have been shown to outperform standard mathematical programming techniques. In this work, we propose generic constraint formulations that can model a wide range of real-world scheduling constraints. Then, the generic constraints are formulated as SMT and MILP problems and used to compare the respective state-of-the-art solvers, Z3 and Gurobi, on academic and real-world inspired rostering problems. Experimental results show how each solver excels for certain types of problems; the MILP solver generally performs better when the problem is highly constrained or infeasible, while the SMT solver performs better otherwise. On real-world inspired problems containing a more varied set of shifts and personnel, the SMT solver excels. Additionally, it was noted during experimentation that the SMT solver was more sensitive to the way the generic constraints were formulated, requiring careful consideration and experimentation to achieve better performance. We conclude that SMT-based methods present a promising avenue for future research within the domain of personnel scheduling.
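As a flavor of the SMT side of the comparison, here is a minimal, hypothetical rostering fragment in Z3's Python API; the coverage, one-shift-per-day, and rest-rule constraints are illustrative stand-ins for the paper's generic constraint formulations.

```python
from z3 import Bool, Solver, AtMost, AtLeast, Implies, Not, is_true, sat

nurses, days = 4, 3
shifts = ["morning", "evening", "night"]
x = [[[Bool(f"x_{n}_{d}_{s}") for s in shifts]
      for d in range(days)] for n in range(nurses)]

solver = Solver()
for d in range(days):
    for s in range(len(shifts)):
        # Coverage: every shift on every day is staffed by at least 1 nurse.
        solver.add(AtLeast(*[x[n][d][s] for n in range(nurses)], 1))
for n in range(nurses):
    for d in range(days):
        # Each nurse works at most one shift per day.
        solver.add(AtMost(*x[n][d], 1))
    for d in range(days - 1):
        # Rest rule: a night shift may not be followed by a morning shift.
        solver.add(Implies(x[n][d][2], Not(x[n][d + 1][0])))

if solver.check() == sat:
    m = solver.model()
    for n in range(nurses):
        plan = [(d, shifts[s]) for d in range(days)
                for s in range(len(shifts))
                if is_true(m.evaluate(x[n][d][s], model_completion=True))]
        print(f"nurse {n}:", plan)
```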

[AI-21] AutoPentest: Enhancing Vulnerability Management With Autonomous LLM Agents

【Quick Read】: This paper aims to solve the problem that traditional penetration testing is costly and therefore infrequent, by using Large Language Models (LLMs) to raise the level of automation and efficiency. The key to the solution is the AutoPentest system, built on OpenAI's GPT-4o model and the LangChain agent framework, which can perform complex multi-step tasks augmented by external tools and knowledge bases, enabling highly autonomous black-box penetration tests.

Link: https://arxiv.org/abs/2505.10321
Authors: Julius Henke
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 24 pages, 1 figure, for implementation, see this https URL

Click to view abstract

Abstract:A recent area of increasing research is the use of Large Language Models (LLMs) in penetration testing, which promises to reduce costs and thus allow for higher frequency. We conduct a review of related work, identifying best practices and common evaluation issues. We then present AutoPentest, an application for performing black-box penetration tests with a high degree of autonomy. AutoPentest is based on the LLM GPT-4o from OpenAI and the LLM agent framework LangChain. It can perform complex multi-step tasks, augmented by external tools and knowledge bases. We conduct a study on three capture-the-flag style Hack The Box (HTB) machines, comparing our implementation AutoPentest with the baseline approach of manually using the ChatGPT-4o user interface. Both approaches are able to complete 15-25% of the subtasks on the HTB machines, with AutoPentest slightly outperforming ChatGPT. We measure a total cost of $96.20 USD when using AutoPentest across all experiments, while a one-month subscription to ChatGPT Plus costs $20. The results show that further implementation efforts and the use of more powerful LLMs released in the future are likely to make this a viable part of vulnerability management.

[AI-22] Private Transformer Inference in MLaaS: A Survey

【Quick Read】: This paper aims to address the privacy concerns raised by deploying Transformer models in Machine Learning as a Service (MLaaS), in particular the risk of leaking sensitive user data during centralized processing. The key to the solution is Private Transformer Inference (PTI), which uses cryptographic techniques such as secure Multi-Party Computation (MPC) and Homomorphic Encryption to protect both data and model privacy during inference.

Link: https://arxiv.org/abs/2505.10315
Authors: Yang Li, Xinyu Zhou, Yitong Wang, Liangxin Qian, Jun Zhao
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Transformer models have revolutionized AI, powering applications like content generation and sentiment analysis. However, their deployment in Machine Learning as a Service (MLaaS) raises significant privacy concerns, primarily due to the centralized processing of sensitive user data. Private Transformer Inference (PTI) offers a solution by utilizing cryptographic techniques such as secure multi-party computation and homomorphic encryption, enabling inference while preserving both user data and model privacy. This paper reviews recent PTI advancements, highlighting state-of-the-art solutions and challenges. We also introduce a structured taxonomy and evaluation framework for PTI, focusing on balancing resource efficiency with privacy and bridging the gap between high-performance inference and data privacy.

[AI-23] Empirically evaluating commonsense intelligence in large language models with large-scale human judgments

【Quick Read】: This paper addresses the limitations of traditional static benchmarks for evaluating machine commonsense intelligence, which assume that human common sense is homogeneous when humans in fact vary significantly in their commonsense judgments. The key to the solution is a new evaluation method that measures the correspondence between a model's judgments and those of a human population, thereby reflecting the heterogeneity of human common sense. The approach ties commonsense intelligence to its cultural basis, emphasizing the need to adapt to different social stocks of knowledge.

Link: https://arxiv.org/abs/2505.10309
Authors: Tuan Dung Nguyen, Duncan J. Watts, Mark E. Whiting
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments:

Click to view abstract

Abstract:Commonsense intelligence in machines is often assessed by static benchmarks that compare a model’s output against human-prescribed correct labels. An important, albeit implicit, assumption of these labels is that they accurately capture what any human would think, effectively treating human common sense as homogeneous. However, recent empirical work has shown that humans vary enormously in what they consider commonsensical; thus what appears self-evident to one benchmark designer may not be so to another. Here, we propose a novel method for evaluating common sense in artificial intelligence (AI), specifically in large language models (LLMs), that incorporates empirically observed heterogeneity among humans by measuring the correspondence between a model’s judgment and that of a human population. We first find that, when treated as independent survey respondents, most LLMs remain below the human median in their individual commonsense competence. Second, when used as simulators of a hypothetical population, LLMs correlate with real humans only modestly in the extent to which they agree on the same set of statements. In both cases, smaller, open-weight models are surprisingly more competitive than larger, proprietary frontier models. Our evaluation framework, which ties commonsense intelligence to its cultural basis, contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.

[AI-24] AI LEGO: Scaffolding Cross-Functional Collaboration in Industrial Responsible AI Practices during Early Design Stages

【Quick Read】: This paper addresses the knowledge handoff challenge in cross-functional teams, where barriers to transferring early-stage technical design rationales hinder ethical evaluation and the identification of potential harms in AI development. The key to the solution is AI LEGO, a web-based prototype that supports effective knowledge handoff between technical and non-technical roles through an interactive, modular structure, using stage-specific checklists and LLM-driven persona simulations to support the systematic identification of harmful design choices.

Link: https://arxiv.org/abs/2505.10300
Authors: Muzhe Wu, Yanzhi Zhao, Shuyi Han, Michael Xieyang Liu, Hong Shen
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Responsible AI (RAI) efforts increasingly emphasize the importance of addressing potential harms early in the AI development lifecycle through social-technical lenses. However, in cross-functional industry teams, this work is often stalled by a persistent knowledge handoff challenge: the difficulty of transferring high-level, early-stage technical design rationales from technical experts to non-technical or user-facing roles for ethical evaluation and harm identification. Through literature review and a co-design study with 8 practitioners, we unpack how this challenge manifests – technical design choices are rarely handed off in ways that support meaningful engagement by non-technical roles; collaborative workflows lack shared, visual structures to support mutual understanding; and non-technical practitioners are left without scaffolds for systematic harm evaluation. Existing tools like JIRA or Google Docs, while useful for product tracking, are ill-suited for supporting joint harm identification across roles, often requiring significant extra effort to align understanding. To address this, we developed AI LEGO, a web-based prototype that supports cross-functional AI practitioners in effectively facilitating knowledge handoff and identifying harmful design choices in the early design stages. Technical roles use interactive blocks to draft development plans, while non-technical roles engage with those blocks through stage-specific checklists and LLM-driven persona simulations to surface potential harms. In a study with 18 cross-functional practitioners, AI LEGO increased the volume and likelihood of harms identified compared to baseline worksheets. Participants found that its modular structure and persona prompts made harm identification more accessible, fostering clearer and more collaborative RAI practices in early design.

[AI-25] Defending the Edge: Representative-Attention for Mitigating Backdoor Attacks in Federated Learning ESORICS2025

【Quick Read】: This paper addresses the difficulty of detecting backdoor attacks in Federated Learning (FL) caused by non-independent and identically distributed (non-IID) data arising from the heterogeneity of edge devices. The key to the solution is FeRA, a federated representative-attention-based defense that distinguishes benign from malicious clients via attention over internal feature representations across clients, computing an anomaly score from representation reconstruction errors to effectively identify clients whose activations deviate significantly from the group consensus.

Link: https://arxiv.org/abs/2505.10297
Authors: Chibueze Peace Obioma, Youcheng Sun, Mustafa A. Mustafa
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Submitted to ESORICS 2025

Click to view abstract

Abstract:Federated learning (FL) enhances privacy and reduces communication cost for resource-constrained edge clients by supporting distributed model training at the edge. However, the heterogeneous nature of such devices produces diverse, non-independent, and identically distributed (non-IID) data, making the detection of backdoor attacks more challenging. In this paper, we propose a novel federated representative-attention-based defense mechanism, named FeRA, that leverages cross-client attention over internal feature representations to distinguish benign from malicious clients. FeRA computes an anomaly score based on representation reconstruction errors, effectively identifying clients whose internal activations significantly deviate from the group consensus. Our evaluation demonstrates FeRA’s robustness across various FL scenarios, including challenging non-IID data distributions typical of edge devices. Experimental results show that it effectively reduces backdoor attack success rates while maintaining high accuracy on the main task. The method is model-agnostic, attack-agnostic, and does not require labeled reference data, making it well suited to heterogeneous and resource-limited edge deployments.
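A highly simplified sketch of the scoring idea (not FeRA's actual cross-client attention): reconstruct each client's representation from the leave-one-out consensus of the others and flag clients whose reconstruction error deviates strongly.

```python
import numpy as np

def anomaly_scores(reps: np.ndarray) -> np.ndarray:
    """reps: (num_clients, dim) internal feature representations.
    Score each client by the error of reconstructing its representation
    from the leave-one-out consensus (mean) of the remaining clients."""
    k = reps.shape[0]
    scores = np.empty(k)
    for i in range(k):
        consensus = np.delete(reps, i, axis=0).mean(axis=0)
        scores[i] = np.linalg.norm(reps[i] - consensus)
    return scores

rng = np.random.default_rng(1)
benign = rng.normal(0.0, 0.1, size=(9, 64))      # near a shared pattern
backdoored = rng.normal(1.5, 0.1, size=(1, 64))  # deviates from consensus
scores = anomaly_scores(np.vstack([benign, backdoored]))
threshold = scores.mean() + 2 * scores.std()
print("flagged clients:", np.where(scores > threshold)[0])  # -> [9]
```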

[AI-26] MASS: Multi-Agent Simulation Scaling for Portfolio Construction

【Quick Read】: This paper aims to solve the limitations of existing Large Language Model (LLM)-based multi-agent systems when applied to portfolio construction: current methods are restricted to pure simulation or constrained by predefined workflows, which limits their applicability and effectiveness. The key to the solution is Multi-Agent Scaling Simulation (MASS), which progressively increases the number of agents in large-scale simulations to gain a deeper understanding of the market and optimizes the agent distribution end-to-end through a reverse optimization process rather than relying on a fixed workflow.

Link: https://arxiv.org/abs/2505.10278
Authors: Taian Guo, Haiyang Shen, Jinsheng Huang, Zhengyang Mao, Junyu Luo, Zhuoru Chen, Xuhui Liu, Bingyu Xia, Luchen Liu, Yun Ma, Ming Zhang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:LLM-based multi-agent has gained significant attention for their potential in simulation and enhancing performance. However, existing works are limited to pure simulations or are constrained by predefined workflows, restricting their applicability and effectiveness. In this paper, we introduce the Multi-Agent Scaling Simulation (MASS) for portfolio construction. MASS achieves stable and continuous excess returns by progressively increasing the number of agents for large-scale simulations to gain a superior understanding of the market and optimizing agent distribution end-to-end through a reverse optimization process, rather than relying on a fixed workflow. We demonstrate its superiority through performance experiments, ablation studies, backtesting experiments, experiments on updated data and stock pools, scaling experiments, parameter sensitivity experiments, and visualization experiments, conducted in comparison with 6 state-of-the-art baselines on 3 challenging A-share stock pools. We expect the paradigm established by MASS to expand to other tasks with similar characteristics. The implementation of MASS has been open-sourced at this https URL.

[AI-27] AttentionGuard: Transformer-based Misbehavior Detection for Secure Vehicular Platoons

【Quick Read】: This paper addresses misbehavior detection in vehicle platooning systems, in particular the safety risks posed by sophisticated falsification attacks launched by authenticated insiders. The key to the solution is AttentionGuard, a transformer-based framework that uses the self-attention mechanism to identify anomalous patterns in mobility data: a multi-head transformer encoder processes sequential kinematic information, effectively distinguishing normal mobility patterns from falsification attacks across diverse platooning scenarios.

Link: https://arxiv.org/abs/2505.10273
Authors: Hexu Li, Konstantinos Kalogiannis, Ahmed Mohamed Hussain, Panos Papadimitratos
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: Author's version; Accepted for presentation at the ACM Workshop on Wireless Security and Machine Learning (WiseML 2025)

Click to view abstract

Abstract:Vehicle platooning, with vehicles traveling in close formation coordinated through Vehicle-to-Everything (V2X) communications, offers significant benefits in fuel efficiency and road utilization. However, it is vulnerable to sophisticated falsification attacks by authenticated insiders that can destabilize the formation and potentially cause catastrophic collisions. This paper addresses this challenge: misbehavior detection in vehicle platooning systems. We present AttentionGuard, a transformer-based framework for misbehavior detection that leverages the self-attention mechanism to identify anomalous patterns in mobility data. Our proposal employs a multi-head transformer-encoder to process sequential kinematic information, enabling effective differentiation between normal mobility patterns and falsification attacks across diverse platooning scenarios, including steady-state (no-maneuver) operation, join, and exit maneuvers. Our evaluation uses an extensive simulation dataset featuring various attack vectors (constant, gradual, and combined falsifications) and operational parameters (controller types, vehicle speeds, and attacker positions). Experimental results demonstrate that AttentionGuard achieves up to 0.95 F1-score in attack detection, with robust performance maintained during complex maneuvers. Notably, our system performs effectively with minimal latency (100ms decision intervals), making it suitable for real-time transportation safety applications. Comparative analysis reveals superior detection capabilities and establishes the transformer-encoder as a promising approach for securing Cooperative Intelligent Transport Systems (C-ITS) against sophisticated insider threats.
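A minimal sketch of the general architecture family the abstract describes: a multi-head transformer encoder over windows of kinematic features, pooled into a binary normal-vs-falsified classifier. Feature count, window length, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MisbehaviorDetector(nn.Module):
    """Multi-head transformer encoder over kinematic sequences,
    followed by a binary (normal vs. falsified) classification head."""
    def __init__(self, n_features: int = 4, d_model: int = 64,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 2)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, time, features), e.g. position, speed, accel, gap
        h = self.encoder(self.proj(seq))
        return self.head(h.mean(dim=1))  # pool over time, then classify

model = MisbehaviorDetector()
batch = torch.randn(8, 50, 4)  # 8 windows of 50 timesteps, 4 features
logits = model(batch)
print(logits.shape)  # torch.Size([8, 2])
```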

[AI-28] Cutting Through Privacy: A Hyperplane-Based Data Reconstruction Attack in Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中存在的一类数据重建攻击问题,即恶意中央服务器通过操控模型更新来重构客户端的私有训练数据。现有数据重建攻击方法存在重要局限性,例如依赖于对客户端数据分布的假设,或在批量大小超过几十个样本时性能显著下降。本文提出的解决方案的关键在于引入一种新的几何视角来分析全连接层,从而构造恶意模型参数,实现了在无需任何客户端数据先验知识的情况下,完美恢复任意大规模数据批次的分类任务数据。实验结果表明,该方法在图像和表格数据集上均优于现有方法,能够实现比现有技术大两个数量级的数据批次重建。

链接: https://arxiv.org/abs/2505.10264
作者: Francesco Diana,André Nusser,Chuan Xu,Giovanni Neglia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative training of machine learning models across distributed clients without sharing raw data, ostensibly preserving data privacy. Nevertheless, recent studies have revealed critical vulnerabilities in FL, showing that a malicious central server can manipulate model updates to reconstruct clients’ private training data. Existing data reconstruction attacks have important limitations: they often rely on assumptions about the clients’ data distribution or their efficiency significantly degrades when batch sizes exceed just a few tens of samples. In this work, we introduce a novel data reconstruction attack that overcomes these limitations. Our method leverages a new geometric perspective on fully connected layers to craft malicious model parameters, enabling the perfect recovery of arbitrarily large data batches in classification tasks without any prior knowledge of clients’ data. Through extensive experiments on both image and tabular datasets, we demonstrate that our attack outperforms existing methods and achieves perfect reconstruction of data batches two orders of magnitude larger than the state of the art.

[AI-29] Do LLM s Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M

【Quick Read】: This paper investigates whether Large Language Models (LLMs) have memorized public recommendation datasets during training, which could undermine the generalizability of research findings and amplify biases in recommender systems. The key to the solution is a set of designed experiments that verify the extent to which LLMs memorize the MovieLens-1M dataset, analyze the impact of memorization on recommendation performance, and examine how memorization varies across model families and model sizes.

Link: https://arxiv.org/abs/2505.10212
Authors: Dario Di Palma, Felice Antonio Merra, Maurizio Sfilio, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have become increasingly central to recommendation scenarios due to their remarkable natural language understanding and generation capabilities. Although significant research has explored the use of LLMs for various recommendation tasks, little effort has been dedicated to verifying whether they have memorized public recommendation dataset as part of their training data. This is undesirable because memorization reduces the generalizability of research findings, as benchmarking on memorized datasets does not guarantee generalization to unseen datasets. Furthermore, memorization can amplify biases, for example, some popular items may be recommended more frequently than others. In this work, we investigate whether LLMs have memorized public recommendation datasets. Specifically, we examine two model families (GPT and Llama) across multiple sizes, focusing on one of the most widely used dataset in recommender systems: MovieLens-1M. First, we define dataset memorization as the extent to which item attributes, user profiles, and user-item interactions can be retrieved by prompting the LLMs. Second, we analyze the impact of memorization on recommendation performance. Lastly, we examine whether memorization varies across model families and model sizes. Our results reveal that all models exhibit some degree of memorization of MovieLens-1M, and that recommendation performance is related to the extent of memorization. We have made all the code publicly available at: this https URL.

[AI-30] A Fine-Grained Complexity View on Propositional Abduction – Algorithms and Lower Bounds

【Quick Read】: This paper addresses the complexity of abductive reasoning in non-monotonic reasoning, specifically its computational hardness under the parameter n, the number of variables in the knowledge base. The key to the solution is an analysis of intractable abduction problems under this natural parameter, obtaining, through careful algorithm design, positive results for Σ²P-complete as well as NP- and coNP-complete fragments; this marks (to the authors' knowledge) the first example of beating exhaustive search for a Σ²P-complete problem. The study complements this with lower bounds and, for many fragments, proofs that no improvement is possible under the (strong) exponential-time hypothesis, yielding a comprehensive complexity picture.

Link: https://arxiv.org/abs/2505.10201
Authors: Victor Lagerkvist, Mohamed Maizia, Johannes Schmidt
Institution: Unknown
Subjects: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The Boolean satisfiability problem (SAT) is a well-known example of monotonic reasoning, of intense practical interest due to fast solvers, complemented by rigorous fine-grained complexity results. However, for non-monotonic reasoning, e.g., abductive reasoning, comparably little is known outside classic complexity theory. In this paper we take a first step toward bridging the gap between monotonic and non-monotonic reasoning by analyzing the complexity of intractable abduction problems under the seemingly overlooked but natural parameter n: the number of variables in the knowledge base. We obtain several positive results for \Sigma^P_2- as well as NP- and coNP-complete fragments, which implies the first example of beating exhaustive search for a \Sigma^P_2-complete problem (to the best of our knowledge). We complement this with lower bounds and for many fragments rule out improvements under the (strong) exponential-time hypothesis.

[AI-31] Advancing Community Detection with Graph Convolutional Neural Networks: Bridging Topological and Attributive Cohesion IJCAI

【Quick Read】: This paper aims to fix two issues in community detection: existing Graph Convolutional Networks (GCNs) trained to maximize modularity often converge to suboptimal solutions, and directly using human-labeled communities for training can group disconnected nodes based solely on node attributes, undermining intra-community connectivity. The key to the solution is the Topological and Attributive Similarity-based Community detection (TAS-Com) method, which introduces a novel loss function that exploits the efficient and scalable Leiden algorithm to detect community structures with globally optimal modularity, and further uses Leiden to refine human-labeled communities so that every community remains internally connected.

Link: https://arxiv.org/abs/2505.10197
Authors: Anjali de Silva, Gang Chen, Hui Ma, Seyed Mohammad Nekooei, Xingquan Zuo
Institution: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by IJCAI (International Joint Conference on Artificial Intelligence) 2025

Click to view abstract

Abstract:Community detection, a vital technology for real-world applications, uncovers cohesive node groups (communities) by leveraging both topological and attribute similarities in social networks. However, existing Graph Convolutional Networks (GCNs) trained to maximize modularity often converge to suboptimal solutions. Additionally, directly using human-labeled communities for training can undermine topological cohesiveness by grouping disconnected nodes based solely on node attributes. We address these issues by proposing a novel Topological and Attributive Similarity-based Community detection (TAS-Com) method. TAS-Com introduces a novel loss function that exploits the highly effective and scalable Leiden algorithm to detect community structures with global optimal modularity. Leiden is further utilized to refine human-labeled communities to ensure connectivity within each community, enabling TAS-Com to detect community structures with desirable trade-offs between modularity and compliance with human labels. Experimental results on multiple benchmark networks confirm that TAS-Com can significantly outperform several state-of-the-art algorithms.
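The Leiden building block is available off the shelf; a minimal sketch with igraph and leidenalg follows (TAS-Com's novel loss function is not reproduced, and the label-refinement step is only indicated).

```python
import igraph as ig
import leidenalg

# Toy network: Zachary's karate club as a stand-in for a social network.
g = ig.Graph.Famous("Zachary")

# Leiden with modularity as the quality function, as used by TAS-Com.
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
print("communities:", len(partition))
print("modularity:", partition.quality())

# Refining human labels: detect any labeled group that is internally
# disconnected, since such a group should be split to keep every
# community connected. (Labels here are hypothetical.)
labels = [0] * 17 + [1] * 17
for c in set(labels):
    members = [v for v, l in enumerate(labels) if l == c]
    sub = g.induced_subgraph(members)
    if not sub.is_connected():
        print(f"label {c} is internally disconnected; refine it with Leiden")
```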

[AI-32] A User Study Evaluating Argumentative Explanations in Diagnostic Decision Support ECAI2024

【Quick Read】: This paper addresses how explainable AI (XAI) can increase physicians' trust and confidence in the predictions of machine learning (ML) systems in the medical domain. The key to the solution is to explore different types of AI-generated explanations and, through a user study and post-survey interviews, make medical experts' requirements for explanations explicit, identifying the most effective and useful explanation types for enhancing the diagnostic process.

Link: https://arxiv.org/abs/2505.10188
Authors: Felix Liedeker, Olivia Sanchez-Graillet, Moana Seidler, Christian Brandt, Jörg Wellmer, Philipp Cimiano
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Presented at 'The First Workshop on Natural Language Argument-Based Explanations', co-located with ECAI 2024

Click to view abstract

Abstract:As the field of healthcare increasingly adopts artificial intelligence, it becomes important to understand which types of explanations increase transparency and empower users to develop confidence and trust in the predictions made by machine learning (ML) systems. In shared decision-making scenarios where doctors cooperate with ML systems to reach an appropriate decision, establishing mutual trust is crucial. In this paper, we explore different approaches to generating explanations in eXplainable AI (XAI) and make their underlying arguments explicit so that they can be evaluated by medical experts. In particular, we present the findings of a user study conducted with physicians to investigate their perceptions of various types of AI-generated explanations in the context of diagnostic decision support. The study aims to identify the most effective and useful explanations that enhance the diagnostic process. In the study, medical doctors filled out a survey to assess different types of explanations. Further, an interview was carried out post-survey to gain qualitative insights on the requirements of explanations incorporated in diagnostic decision support. Overall, the insights gained from this study contribute to understanding the types of explanations that are most effective.

[AI-33] KAITIAN: A Unified Communication Framework for Enabling Efficient Collaboration Across Heterogeneous Accelerators in Embodied AI Systems

【Quick Read】: This paper aims to solve the interoperability barriers that embodied AI systems face when using diverse heterogeneous accelerators (e.g., GPGPUs, NPUs, FPGAs), which lead to underutilized resources and performance bottlenecks in distributed AI workloads. The key to the solution is the KAITIAN framework, whose unified abstraction layer integrates vendor-optimized communication libraries for intra-group efficiency with general-purpose communication protocols for inter-group interoperability, and whose load-adaptive scheduling mechanism dynamically balances computational tasks across devices according to their real-time performance characteristics, significantly improving resource utilization and the scalability of distributed training.

Link: https://arxiv.org/abs/2505.10183
Authors: Jieke Lin, Wanyu Wang, Longxiang Yin, Yinhe Han
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures. Jieke Lin and Wanyu Wang contributed equally to this work

Click to view abstract

Abstract:Embodied Artificial Intelligence (AI) systems, such as autonomous robots and intelligent vehicles, are increasingly reliant on diverse heterogeneous accelerators (e.g., GPGPUs, NPUs, FPGAs) to meet stringent real-time processing and energy-efficiency demands. However, the proliferation of vendor-specific proprietary communication libraries creates significant interoperability barriers, hindering seamless collaboration between different accelerator types and leading to suboptimal resource utilization and performance bottlenecks in distributed AI workloads. This paper introduces KAITIAN, a novel distributed communication framework designed to bridge this gap. KAITIAN provides a unified abstraction layer that intelligently integrates vendor-optimized communication libraries for intra-group efficiency with general-purpose communication protocols for inter-group interoperability. Crucially, it incorporates a load-adaptive scheduling mechanism that dynamically balances computational tasks across heterogeneous devices based on their real-time performance characteristics. Implemented as an extension to PyTorch and rigorously evaluated on a testbed featuring NVIDIA GPUs and Cambricon MLUs, KAITIAN demonstrates significant improvements in resource utilization and scalability for distributed training tasks. Experimental results show that KAITIAN can accelerate training time by up to 42% compared to baseline homogeneous systems, while incurring minimal communication overhead (2.8–4.3%) and maintaining model accuracy. KAITIAN paves the way for more flexible and powerful heterogeneous computing in complex embodied AI applications.

[AI-34] Does Scaling Law Apply in Time Series Forecasting?

【Quick Read】: This paper addresses the challenge posed by rapidly growing model sizes in time series forecasting, where performance gains have typically relied on exponential growth in parameter counts, and questions whether such scaling is truly necessary. The key to the solution is Alinear, which introduces a horizon-aware adaptive decomposition mechanism that dynamically rebalances component emphasis across different forecast lengths, combined with a progressive frequency attenuation strategy that achieves stable prediction across forecasting horizons without the computational overhead of attention mechanisms. Using only k-level parameters, the model matches the performance of large-scale models and performs strongly across multiple benchmark datasets.

Link: https://arxiv.org/abs/2505.10172
Authors: Zeyan Li, Libing Chen, Yin Tang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Rapid expansion of model size has emerged as a key challenge in time series forecasting. From the early Transformer with tens of megabytes to recent architectures like TimesNet with thousands of megabytes, performance gains have often come at the cost of exponentially increasing parameter counts. But is this scaling truly necessary? To question the applicability of the scaling law in time series forecasting, we propose Alinear, an ultra-lightweight forecasting model that achieves competitive performance using only k-level parameters. We introduce a horizon-aware adaptive decomposition mechanism that dynamically rebalances component emphasis across different forecast lengths, alongside a progressive frequency attenuation strategy that achieves stable prediction in various forecasting horizons without incurring the computational overhead of attention mechanisms. Extensive experiments on seven benchmark datasets demonstrate that Alinear consistently outperforms large-scale models while using less than 1% of their parameters, maintaining strong accuracy across both short and ultra-long forecasting horizons. Moreover, to more fairly evaluate model efficiency, we propose a new parameter-aware evaluation metric that highlights the superiority of Alinear under constrained model budgets. Our analysis reveals that the relative importance of trend and seasonal components varies depending on data characteristics rather than following a fixed pattern, validating the necessity of our adaptive design. This work challenges the prevailing belief that larger models are inherently better and suggests a paradigm shift toward more efficient time series modeling.
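The abstract does not spell out Alinear's internals, but the general pattern of a horizon-aware, decomposition-based linear forecaster can be sketched as follows; the moving-average split, the per-horizon mixing weights, and all sizes are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class HorizonAwareLinear(nn.Module):
    """Decompose input into trend (moving average) and seasonal (residual)
    parts, forecast each with a linear head, and blend them with a
    learned weight per forecast-horizon step."""
    def __init__(self, lookback: int, horizon: int, kernel: int = 25):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2,
                                count_include_pad=False)
        self.trend_head = nn.Linear(lookback, horizon)
        self.season_head = nn.Linear(lookback, horizon)
        # One mixing weight per horizon step (horizon-aware rebalancing).
        self.alpha = nn.Parameter(torch.zeros(horizon))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, lookback) univariate series windows
        trend = self.avg(x.unsqueeze(1)).squeeze(1)[..., :x.shape[-1]]
        season = x - trend
        w = torch.sigmoid(self.alpha)  # in (0, 1), per forecast step
        return w * self.trend_head(trend) + (1 - w) * self.season_head(season)

model = HorizonAwareLinear(lookback=96, horizon=24)
y_hat = model(torch.randn(32, 96))
print(y_hat.shape)  # torch.Size([32, 24])
```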

[AI-35] QuXAI: Explainers for Hybrid Quantum Machine Learning Models

【Quick Read】: This paper addresses a key interpretability gap for hybrid quantum-classical machine learning (HQML) models, in particular the lack of robust global and local explainability methods for architectures that combine quantized feature encoding with classical learning. The key to the solution is the QuXAI framework, centered on Q-MEDLEY, an explainer for feature importance in such hybrid systems. Q-MEDLEY preserves the quantum transformation stage, visualizes the resulting attributions, and combines feature-based inferences, effectively identifying influential classical factors in HQML models while separating out noise, improving the models' interpretability and reliability.

Link: https://arxiv.org/abs/2505.10167
Authors: Saikat Barua, Mostafizur Rahman, Shehenaz Khaled, Md Jafor Sadek, Rafiul Islam, Shahnewaz Siddique
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: 16 pages, 6 figures, 7 equations

Click to view abstract

Abstract:The emergence of hybrid quantum-classical machine learning (HQML) models opens new horizons of computational intelligence, but their fundamental complexity frequently leads to black-box behavior that undermines transparency and reliability in their application. Although XAI for quantum systems is still in its infancy, a major research gap is evident in robust global and local explainability approaches designed for HQML architectures that employ quantized feature encoding followed by classical learning. This gap is the focus of this work, which introduces QuXAI, a framework based upon Q-MEDLEY, an explainer for feature importance in these hybrid systems. Our pipeline entails creating HQML models that incorporate quantum feature maps and applying Q-MEDLEY, which combines feature-based inferences, preserves the quantum transformation stage, and visualizes the resulting attributions. Our results show that Q-MEDLEY delineates influential classical aspects in HQML models, separates out noise, and competes well against established XAI techniques in classical validation settings. Ablation studies further expose the virtues of the composite structure used in Q-MEDLEY. The implications of this work are critically important, as it provides a route to improve the interpretability and reliability of HQML models, thus promoting greater confidence and enabling safer, more responsible use of quantum-enhanced AI technology.

[AI-36] Robust Federated Learning on Edge Devices with Domain Heterogeneity

【Quick Read】: This paper aims to solve the difficulty of global model convergence in Federated Learning (FL) under domain heterogeneity. The key to the solution is the FedAPC (Federated Augmented Prototype Contrastive Learning) framework, which improves the generalization ability of the global model via prototype augmentation: prototypes are derived from the mean features of augmented data to capture richer representations, and local features are aligned with global prototypes, enhancing feature diversity and model robustness.

Link: https://arxiv.org/abs/2505.10128
Authors: Huy Q. Le, Latif U. Khan, Choong Seon Hong
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: IWCMC 2025

Click to view abstract

Abstract:Federated Learning (FL) allows collaborative training while ensuring data privacy across distributed edge devices, making it a popular solution for privacy-sensitive applications. However, FL faces significant challenges due to statistical heterogeneity, particularly domain heterogeneity, which impedes the global model's convergence. In this study, we introduce a new framework to address this challenge by improving the generalization ability of the FL global model under domain heterogeneity, using prototype augmentation. Specifically, we introduce FedAPC (Federated Augmented Prototype Contrastive Learning), a prototype-based FL framework designed to enhance feature diversity and model robustness. FedAPC leverages prototypes derived from the mean features of augmented data to capture richer representations. By aligning local features with global prototypes, we enable the model to learn meaningful semantic features while reducing overfitting to any specific domain. Experimental results on the Office-10 and Digits datasets illustrate that our framework outperforms SOTA baselines, demonstrating superior performance.
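A minimal sketch of the prototype-alignment idea under simplifying assumptions: prototypes are class-wise mean features, and an InfoNCE-style loss pulls local features toward their global prototype. The exact FedAPC objective and its augmentation pipeline are not reproduced.

```python
import torch
import torch.nn.functional as F

def global_prototypes(feats, labels, num_classes):
    """Class prototypes = mean feature of (augmented) samples per class."""
    protos = torch.stack([feats[labels == c].mean(dim=0)
                          for c in range(num_classes)])
    return F.normalize(protos, dim=1)

def proto_contrastive_loss(feats, labels, protos, tau=0.1):
    """InfoNCE over prototypes: pull each local feature toward its class
    prototype and away from the other prototypes."""
    feats = F.normalize(feats, dim=1)
    logits = feats @ protos.T / tau        # (batch, num_classes)
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
feats = torch.randn(64, 128, requires_grad=True)   # placeholder local features
labels = torch.arange(64) % 10                     # every class appears
protos = global_prototypes(feats.detach(), labels, num_classes=10)
loss = proto_contrastive_loss(feats, labels, protos)
loss.backward()  # gradients would flow into the local feature extractor
print(round(loss.item(), 3))
```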

[AI-37] All You Need Is Synthetic Task Augmentation

【Quick Read】: This paper addresses the challenge of injecting rule-based models (such as random forests) into differentiable neural network frameworks, in particular improving neural model performance on molecular property prediction. The key to the solution is a joint training strategy: a single Graph Transformer network is trained on both sparse multitask molecular property experimental targets and synthetic targets produced by XGBoost models trained on Osmordred molecular descriptors. These synthetic tasks serve as independent auxiliary tasks and yield significant improvements across all 19 molecular property prediction tasks, with the multitask Graph Transformer outperforming the XGBoost single-task learner on 16 of them.

Link: https://arxiv.org/abs/2505.10120
Authors: Guillaume Godin
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 Figures, 6 tables

Click to view abstract

Abstract:Injecting rule-based models like Random Forests into differentiable neural network frameworks remains an open challenge in machine learning. Recent advancements have demonstrated that pretrained models can generate efficient molecular embeddings. However, these approaches often require extensive pretraining and additional techniques, such as incorporating posterior probabilities, to boost performance. In our study, we propose a novel strategy that jointly trains a single Graph Transformer neural network on both sparse multitask molecular property experimental targets and synthetic targets derived from XGBoost models trained on Osmordred molecular descriptors. These synthetic tasks serve as independent auxiliary tasks. Our results show consistent and significant performance improvement across all 19 molecular property prediction tasks. For 16 out of 19 targets, the multitask Graph Transformer outperforms the XGBoost single-task learner. This demonstrates that synthetic task augmentation is an effective method for enhancing neural model performance in multitask molecular property prediction without the need for feature injection or pretraining.
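The augmentation pattern itself is easy to reproduce in miniature. The sketch below swaps the paper's Graph Transformer and Osmordred descriptors for random features and a plain MLP, purely to show the mechanics: fit XGBoost, reuse its predictions as a synthetic auxiliary target, and train one multitask learner on both targets.

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 32))                # stand-in descriptors
y_exp = X[:, 0] * 2.0 + rng.normal(0, 0.1, 500)   # experimental target

# Step 1: fit a tree-based learner and reuse its predictions as a
# synthetic auxiliary target.
xgb = XGBRegressor(n_estimators=100, max_depth=4).fit(X, y_exp)
y_syn = xgb.predict(X)

# Step 2: train one multitask network on [experimental, synthetic] targets.
Y_multi = np.column_stack([y_exp, y_syn])
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
net.fit(X, Y_multi)
print("multitask R^2:", net.score(X, Y_multi))
```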

[AI-38] EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation

【Quick Read】: This paper aims to solve the significant domain gap in robot manipulation and the lack of model architectures that effectively incorporate 3D information. The key to the solution is EmbodiedMAE, a multi-modal masked autoencoder that simultaneously learns representations across RGB, depth, and point cloud modalities through stochastic masking and cross-modal fusion, trained on the augmented DROID-3D dataset, which improves both training efficiency and final performance.

Link: https://arxiv.org/abs/2505.10105
Authors: Zibin Dong, Fei Ni, Yifu Yuan, Yinchuan Li, Jianye Hao
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We present EmbodiedMAE, a unified 3D multi-modal representation for robot manipulation. Current approaches suffer from significant domain gaps between training datasets and robot manipulation tasks, while also lacking model architectures that can effectively incorporate 3D information. To overcome these limitations, we enhance the DROID dataset with high-quality depth maps and point clouds, constructing DROID-3D as a valuable supplement for 3D embodied vision research. Then we develop EmbodiedMAE, a multi-modal masked autoencoder that simultaneously learns representations across RGB, depth, and point cloud modalities through stochastic masking and cross-modal fusion. Trained on DROID-3D, EmbodiedMAE consistently outperforms state-of-the-art vision foundation models (VFMs) in both training efficiency and final performance across 70 simulation tasks and 20 real-world robot manipulation tasks on two robot platforms. The model exhibits strong scaling behavior with size and promotes effective policy learning from 3D inputs. Experimental results establish EmbodiedMAE as a reliable unified 3D multi-modal VFM for embodied AI systems, particularly in precise tabletop manipulation settings where spatial perception is critical.

[AI-39] LAV: Audio-Driven Dynamic Visual Generation with Neural Compression and StyleGAN2

【Quick Read】: This paper addresses how to turn audio signals into visually dynamic output while maintaining semantic coherence and richness of detail. The key to the solution is to use EnCodec's latent representations as audio features and transform them directly into StyleGAN2's style latent space via a randomly initialized linear mapping, avoiding reliance on explicit feature mappings while preserving the audio's semantic information and enabling high-quality, synchronized audio-visual generation.

Link: https://arxiv.org/abs/2505.10101
Authors: Jongmin Jung, Dasaem Jeong
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Graphics (cs.GR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Paper accepted at ISEA 2025, The 30th International Symposium on Electronic/Emerging Art, Seoul, Republic of Korea, 23 - 29 May 2025

Click to view abstract

Abstract:This paper introduces LAV (Latent Audio-Visual), a system that integrates EnCodec’s neural audio compression with StyleGAN2’s generative capabilities to produce visually dynamic outputs driven by pre-recorded audio. Unlike previous works that rely on explicit feature mappings, LAV uses EnCodec embeddings as latent representations, directly transformed into StyleGAN2’s style latent space via randomly initialized linear mapping. This approach preserves semantic richness in the transformation, enabling nuanced and semantically coherent audio-visual translations. The framework demonstrates the potential of using pretrained audio compression models for artistic and computational applications.
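The core trick, a randomly initialized and untrained linear map from audio latents into the style space, fits in a few lines; the dimensions below are assumptions, and the EnCodec embeddings are replaced by placeholders.

```python
import torch
import torch.nn as nn

# EnCodec latents (e.g. 128-dim frames) -> StyleGAN2 style space (512-dim).
AUDIO_DIM, STYLE_DIM = 128, 512  # dimensions assumed for illustration

# Randomly initialized, untrained linear map, as in LAV: structure in the
# audio latents carries over to structured trajectories in style space.
torch.manual_seed(0)
audio_to_style = nn.Linear(AUDIO_DIM, STYLE_DIM, bias=False)

frames = torch.randn(250, AUDIO_DIM)   # placeholder EnCodec embeddings
with torch.no_grad():
    styles = audio_to_style(frames)    # one style vector per audio frame

# Each row of `styles` would be fed to a pretrained StyleGAN2 synthesis
# network (not shown) to render one video frame.
print(styles.shape)  # torch.Size([250, 512])
```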

[AI-40] Leverag ing Graph Retrieval-Augmented Generation to Support Learners Understanding of Knowledge Concepts in MOOCs

【Quick Read】: This paper addresses two problems: learners in Massive Open Online Courses (MOOCs) lack direct interaction with instructors and therefore struggle to understand new knowledge concepts, and Large Language Models (LLMs) are prone to hallucinations that limit their reliability. The key to the solution is a Graph-based Retrieval-Augmented Generation (Graph RAG) pipeline that leverages Educational Knowledge Graphs (EduKGs) and Personal Knowledge Graphs (PKGs) to guide learners in understanding knowledge concepts on the MOOC platform CourseMapper.

Link: https://arxiv.org/abs/2505.10074
Authors: Mohamed Abdelmagied, Mohamed Amine Chatti, Shoeb Joarder, Qurat Ul Ain, Rawaa Alatrash
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted at EMOOCs 2025

Click to view abstract

Abstract:Massive Open Online Courses (MOOCs) lack direct interaction between learners and instructors, making it challenging for learners to understand new knowledge concepts. Recently, learners have increasingly used Large Language Models (LLMs) to support them in acquiring new knowledge. However, LLMs are prone to hallucinations which limits their reliability. Retrieval-Augmented Generation (RAG) addresses this issue by retrieving relevant documents before generating a response. However, the application of RAG across different MOOCs is limited by unstructured learning material. Furthermore, current RAG systems do not actively guide learners toward their learning needs. To address these challenges, we propose a Graph RAG pipeline that leverages Educational Knowledge Graphs (EduKGs) and Personal Knowledge Graphs (PKGs) to guide learners to understand knowledge concepts in the MOOC platform CourseMapper. Specifically, we implement (1) a PKG-based Question Generation method to recommend personalized questions for learners in context, and (2) an EduKG-based Question Answering method that leverages the relationships between knowledge concepts in the EduKG to answer learner selected questions. To evaluate both methods, we conducted a study with 3 expert instructors on 3 different MOOCs in the MOOC platform CourseMapper. The results of the evaluation show the potential of Graph RAG to empower learners to understand new knowledge concepts in a personalized learning experience.

[AI-41] Multi-Robot Task Allocation for Homogeneous Tasks with Collision Avoidance via Spatial Clustering

【Quick Read】: This paper aims to solve Multi-Robot Task Allocation (MRTA) with collision avoidance for homogeneous tasks in industrial environments. The key to the solution is a spatial-clustering-based framework that partitions the workspace into distinguishable operational zones, addressing task allocation and collision risk simultaneously and enabling efficient robot routing and scheduling.

Link: https://arxiv.org/abs/2505.10073
Authors: Rathin Chandra Shit, Sharmila Subudhi
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 5 pages, 4 figures, Scheduled for presentation at an upcoming conference

Click to view abstract

Abstract:In this paper, a novel framework is presented that provides a combined solution to Multi-Robot Task Allocation (MRTA) and collision avoidance for homogeneous measurement tasks in industrial environments. The proposed spatial clustering simultaneously solves the task allocation problem and handles collision risks by cutting the workspace into distinguishable operational zones, one per robot. To divide task sites and to schedule robot routes within the corresponding clusters, we use K-means clustering and the 2-Opt algorithm. The presented framework performs well, demonstrating up to a 93% reduction in computation time (1.24s against 17.62s) together with a solution-quality improvement of up to 7% over the best-performing comparative method. Our method also completely eliminates the collision points that persist in comparative methods. Theoretical analysis supports the claim that spatial partitioning unifies the apparently disjoint task allocation and collision avoidance problems when many identical tasks are distributed over sparse geographical areas. Ultimately, the findings of this work are of substantial importance for real-world applications where both computational efficiency and collision-free operation are of paramount importance.
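Both components named in the abstract are standard; a compact sketch with scikit-learn's K-means and a plain 2-opt route refinement follows (task coordinates and fleet size are illustrative placeholders).

```python
import numpy as np
from sklearn.cluster import KMeans

def two_opt(route: list, pts: np.ndarray) -> list:
    """Classic 2-opt: reverse segments while the tour length improves."""
    def length(r):
        return sum(np.linalg.norm(pts[r[i]] - pts[r[i + 1]])
                   for i in range(len(r) - 1))
    improved = True
    while improved:
        improved = False
        for i in range(1, len(route) - 2):
            for j in range(i + 1, len(route) - 1):
                cand = route[:i] + route[i:j + 1][::-1] + route[j + 1:]
                if length(cand) < length(route):
                    route, improved = cand, True
    return route

rng = np.random.default_rng(0)
tasks = rng.uniform(0, 100, size=(30, 2))   # 30 measurement sites
n_robots = 3

# Spatial clustering assigns each robot a disjoint operational zone,
# which settles allocation and removes inter-robot collision points.
zones = KMeans(n_clusters=n_robots, n_init=10, random_state=0).fit_predict(tasks)
for r in range(n_robots):
    idx = np.where(zones == r)[0].tolist()
    route = two_opt(idx, tasks)
    print(f"robot {r} visits sites {route}")
```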

[AI-42] Financial Fraud Detection Using Explainable AI and Stacking Ensemble Methods

【Quick Read】: This paper addresses the trade-off in traditional machine learning models between predictive accuracy and model transparency and interpretability, which makes it hard for organizations to meet regulatory requirements and earn stakeholder trust. The key to the solution is a fraud detection framework that combines a stacking ensemble of gradient boosting models with explainable AI (XAI) techniques such as SHAP, LIME, PDP, and PFI, improving the transparency and interpretability of the model's decisions.

Link: https://arxiv.org/abs/2505.10050
Authors: Fahad Almalki, Mehedi Masud
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Traditional machine learning models often prioritize predictive accuracy, often at the expense of model transparency and interpretability. The lack of transparency makes it difficult for organizations to comply with regulatory requirements and gain stakeholders trust. In this research, we propose a fraud detection framework that combines a stacking ensemble of well-known gradient boosting models: XGBoost, LightGBM, and CatBoost. In addition, explainable artificial intelligence (XAI) techniques are used to enhance the transparency and interpretability of the model’s decisions. We used SHAP (SHapley Additive Explanations) for feature selection to identify the most important features. Further efforts were made to explain the model’s predictions using Local Interpretable Model-Agnostic Explanation (LIME), Partial Dependence Plots (PDP), and Permutation Feature Importance (PFI). The IEEE-CIS Fraud Detection dataset, which includes more than 590,000 real transaction records, was used to evaluate the proposed model. The model achieved a high performance with an accuracy of 99% and an AUC-ROC score of 0.99, outperforming several recent related approaches. These results indicate that combining high prediction accuracy with transparent interpretability is possible and could lead to a more ethical and trustworthy solution in financial fraud detection.
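A condensed sketch of the described pipeline with scikit-learn, XGBoost, LightGBM, CatBoost, and SHAP; the dataset and hyperparameters are placeholders, and the LIME, PDP, and PFI views mentioned above are omitted for brevity.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Imbalanced, fraud-like synthetic data (about 3% positives).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.97],
                           random_state=0)

stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=200)),
        ("lgbm", LGBMClassifier(n_estimators=200)),
        ("cat", CatBoostClassifier(iterations=200, verbose=0)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)

# SHAP values from one fitted base learner guide feature selection;
# the other XAI views can be layered on the same fitted ensemble.
explainer = shap.TreeExplainer(stack.named_estimators_["xgb"])
shap_values = explainer.shap_values(X[:100])
top = np.argsort(np.abs(shap_values).mean(axis=0))[::-1][:5]
print("top-5 features by mean |SHAP|:", top)
```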

[AI-43] Boosting Text-to-Chart Retrieval through Training with Synthesized Semantic Insights

【Quick Read】: This paper aims to solve the insufficient semantic understanding and context capture in text-to-chart retrieval, especially in Business Intelligence (BI), where users need to find charts matching precise or fuzzy queries. Existing methods struggle to capture charts' semantic content and contextual information, mainly due to the lack of comprehensive metadata or semantic insights. The key to the solution is a training data development pipeline that automatically synthesizes hierarchical semantic insights for charts, covering visual patterns (visual-oriented), statistical properties (statistics-oriented), and practical applications (task-oriented), producing 207,498 semantic insights for 69,166 charts, and using this rich semantic information to train ChartFinder, a CLIP-based model that learns better chart representations.

Link: https://arxiv.org/abs/2505.10043
Authors: Yifan Wu, Lutao Yan, Yizhang Zhu, Yinan Mei, Jiannan Wang, Nan Tang, Yuyu Luo
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Charts are crucial for data analysis and visualization. Text-to-chart retrieval systems have become increasingly important for Business Intelligence (BI), where users need to find relevant charts that match their analytical needs. These needs can be categorized into precise queries that are well-specified and fuzzy queries that are more exploratory – both require understanding the semantics and context of the charts. However, existing text-to-chart retrieval solutions often fail to capture the semantic content and contextual information of charts, primarily due to the lack of comprehensive metadata (or semantic insights). To address this limitation, we propose a training data development pipeline that automatically synthesizes hierarchical semantic insights for charts, covering visual patterns (visual-oriented), statistical properties (statistics-oriented), and practical applications (task-oriented), which produces 207,498 semantic insights for 69,166 charts. Based on these, we train a CLIP-based model named ChartFinder to learn better representations of charts for text-to-chart retrieval. Our method leverages rich semantic insights during the training phase to develop a model that understands both visual and semantic aspects of charts. To evaluate text-to-chart retrieval performance, we curate the first benchmark, CRBench, for this task with 21,862 charts and 326 text queries from real-world BI applications, with ground-truth labels verified by crowd workers. Experiments show that ChartFinder significantly outperforms existing methods in text-to-chart retrieval tasks across various settings. For precise queries, ChartFinder achieves up to 66.9% NDCG@10, which is 11.58% higher than state-of-the-art models. In fuzzy query tasks, our method also demonstrates consistent improvements, with an average increase of 5% across nearly all metrics.

[AI-44] Optimal normalization in quantum-classical hybrid models for anti-cancer drug response prediction

【Quick Read】: This paper addresses the stability problems that quantum-classical hybrid machine learning (QHML) models exhibit in anti-cancer drug response prediction when the data encoding is poorly chosen. The key to the solution is a normalization function based on a moderated-gradient version of tanh, which transforms the neural network outputs without concentrating them at the extreme ends of the value range, improving both stability and performance.

Link: https://arxiv.org/abs/2505.10037
Authors: Takafumi Ito,Lysenko Artem,Tatsuhiko Tsunoda
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
Comments: 10 pages, 3 figures

Abstract:Quantum-classical Hybrid Machine Learning (QHML) models are recognized for their robust performance and high generalization ability even for relatively small datasets. These qualities offer unique advantages for anti-cancer drug response prediction, where the number of available samples is typically small. However, such hybrid models appear to be very sensitive to the data encoding used at the interface of a neural network and a quantum circuit, with suboptimal choices leading to stability issues. To address this problem, we propose a novel strategy that uses a normalization function based on a moderated gradient version of the tanh. This method transforms the outputs of the neural networks without concentrating them at the extreme value ranges. Our idea was evaluated on a dataset of gene expression and drug response measurements for various cancer cell lines, where we compared the prediction performance of a classical deep learning model and several QHML models. These results confirmed that QHML performed better than the classical models when data was optimally normalized. This study opens up new possibilities for biomedical data analysis using quantum computers.
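
The abstract names a "moderated gradient version of the tanh" but does not give its exact form, so the following is a speculative illustration only: tanh(x / k) with k > 1 is one plausible family that flattens the slope and keeps encoded values away from the saturated ends.

```python
# Illustrative sketch only: the paper's exact normalization is not specified;
# tanh(x / k) with k > 1 is one plausible "moderated gradient" variant.
import numpy as np

def moderated_tanh(x, k=3.0):
    # k > 1 reduces the gradient (1/k at the origin), so network outputs are
    # squashed into (-1, 1) without piling up near the +/-1 extremes.
    return np.tanh(x / k)

x = np.linspace(-5, 5, 5)
print(np.tanh(x))           # standard tanh: saturates quickly
print(moderated_tanh(x))    # moderated variant: stays in the mid-range longer
```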

[AI-45] The First MPDD Challenge: Multimodal Personality-aware Depression Detection

【Quick Read】: This paper addresses the fact that existing depression detection methods mainly target young adults, overlooking the broader age range and the individual differences that shape how depression manifests. The key to the solution is a framework that combines multimodal data with individual-difference factors: by fusing audio and video modalities with individual characteristics, it makes depression detection across diverse populations more personalized and accurate.

Link: https://arxiv.org/abs/2505.10034
Authors: Changzeng Fu,Zelin Fu,Xinhe Kuang,Jiacheng Dong,Qi Zhang,Kaifeng Su,Yikai Su,Wenbo Shi,Junfeng Yao,Yuliang Zhao,Shiqi Zhao,Jiadong Wang,Siyang Song,Chaoran Liu,Yuichiro Yoshikawa,Björn Schuller,Hiroshi Ishiguro
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: This paper has been accepted as part of the MPDD Challenge in the ACMMM 2025 Grand Challenge

Abstract:Depression is a widespread mental health issue affecting diverse age groups, with notable prevalence among college students and the elderly. However, existing datasets and detection methods primarily focus on young adults, neglecting the broader age spectrum and individual differences that influence depression manifestation. Current approaches often establish a direct mapping between multimodal data and depression indicators, failing to capture the complexity and diversity of depression across individuals. This challenge includes two tracks based on age-specific subsets: Track 1 uses the MPDD-Elderly dataset for detecting depression in older adults, and Track 2 uses the MPDD-Young dataset for detecting depression in younger participants. The Multimodal Personality-aware Depression Detection (MPDD) Challenge aims to address this gap by incorporating multimodal data alongside individual difference factors. We provide a baseline model that fuses audio and video modalities with individual difference information to detect depression manifestations in diverse populations. This challenge aims to promote the development of more personalized and accurate depression detection methods, advancing mental health research and fostering inclusive detection systems. More details are available on the official challenge website: this https URL.

[AI-46] AI Greenferencing: Routing AI Inferencing to Green Modular Data Centers with Heron

【Quick Read】: This paper addresses the mismatch between surging AI compute demand and renewable energy supply, namely the high power density of AI compute versus the grid-interconnection delays of wind power. The key to the solution is deploying AI workloads on modular compute clusters co-located with wind farms, and using Heron, a cross-site software router, to exploit the complementarity of power generation across wind farms by routing AI inferencing around power drops, thereby raising the aggregate goodput of AI compute.

Link: https://arxiv.org/abs/2505.09989
Authors: Tella Rajashekhar Reddy,Palak,Rohan Gandhi,Anjaly Parayil,Chaojie Zhang,Mike Shepperd,Liangcheng Yu,Jayashree Mohan,Srinivasan Iyengar,Shivkumar Kalyanaraman,Debopam Bhattacherjee
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:AI power demand is growing unprecedentedly thanks to the high power density of AI compute and the emerging inferencing workload. On the supply side, abundant wind power is waiting for grid access in interconnection queues. In this light, this paper argues for bringing AI workloads to modular compute clusters co-located in wind farms. Our deployment right-sizing strategy makes it economically viable to deploy more than 6 million high-end GPUs today that could consume cheap, green power at its source. We built Heron, a cross-site software router, that could efficiently leverage the complementarity of power generation across wind farms by routing AI inferencing workload around power drops. Using 1-week of coding and conversation production traces from Azure and (real) variable wind power traces, we show how Heron improves aggregate goodput of AI compute by up to 80% compared to the state-of-the-art.
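
The abstract does not spell out Heron's routing policy, so the toy router below only conveys the general idea: send each inference request to the co-located cluster with the most spare wind power, deferring work when every site is in a power drop.

```python
# Toy illustration of the routing idea (not Heron's actual policy): pick the
# site whose current wind power leaves the most headroom for the request.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    power_mw: float       # current wind power available
    load_mw: float = 0.0  # power already committed to running requests

    def headroom(self) -> float:
        return self.power_mw - self.load_mw

def route(request_mw: float, sites: list[Site]) -> Site | None:
    """Greedy cross-site routing: choose the site with the most spare power."""
    candidates = [s for s in sites if s.headroom() >= request_mw]
    if not candidates:
        return None  # every site is in a power drop; queue or shed the request
    best = max(candidates, key=Site.headroom)
    best.load_mw += request_mw
    return best

farms = [Site("farm-A", 12.0), Site("farm-B", 4.0), Site("farm-C", 9.0)]
for _ in range(5):
    chosen = route(2.5, farms)
    print(chosen.name if chosen else "deferred")
```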

[AI-47] Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data

【Quick Read】: This paper addresses the safety risks of fine-tuning large language models (LLMs) for cyber security applications, including personal data leakage and the automated generation of new malware. The key to the solution is a safety alignment approach that carefully rewords instruction-response pairs to include explicit safety precautions and ethical considerations, improving model safety while preserving technical utility.

Link: https://arxiv.org/abs/2505.09974
Authors: Adel ElZemity,Budi Arief,Shujun Li
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The integration of large language models (LLMs) into cyber security applications presents significant opportunities, such as enhancing threat analysis and malware detection, but can also introduce critical risks and safety concerns, including personal data leakage and automated generation of new malware. We present a systematic evaluation of safety risks in fine-tuned LLMs for cyber security applications. Using the OWASP Top 10 for LLM Applications framework, we assessed seven open-source LLMs: Phi 3 Mini 3.8B, Mistral 7B, Qwen 2.5 7B, Llama 3 8B, Llama 3.1 8B, Gemma 2 9B, and Llama 2 70B. Our evaluation shows that fine-tuning reduces safety resilience across all tested LLMs (e.g., the safety score of Llama 3.1 8B against prompt injection drops from 0.95 to 0.15). We propose and evaluate a safety alignment approach that carefully rewords instruction-response pairs to include explicit safety precautions and ethical considerations. This approach demonstrates that it is possible to maintain or even improve model safety while preserving technical utility, offering a practical path forward for developing safer fine-tuning methodologies. This work offers a systematic evaluation for safety risks in LLMs, enabling safer adoption of generative AI in sensitive domains, and contributing towards the development of secure, trustworthy, and ethically aligned LLMs.

[AI-48] Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents

【Quick Read】: This paper addresses the limited efficiency and performance of traditional ReAct (Reasoning + Action) agents on complex tasks, in particular the difficulty smaller models have with the complex reasoning agentic systems require. The key to the solution is Pre-Act, which generates a multi-step execution plan with detailed reasoning, incrementally incorporates previous steps and tool outputs, and refines the plan after each executed step until a more accurate final response is reached. The approach applies to both conversational and non-conversational agents and is validated with a two-level evaluation framework.

Link: https://arxiv.org/abs/2505.09970
Authors: Mrinal Rawat,Ambuje Gupta,Rushil Goomer,Alessandro Di Bari,Neha Gupta,Roberto Pieraccini
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The ReAct (Reasoning + Action) capability in large language models (LLMs) has become the foundation of modern agentic systems. Recent LLMs, such as DeepSeek-R1 and OpenAI o1/o3, exemplify this by emphasizing reasoning through the generation of ample intermediate tokens, which help build a strong premise before producing the final output tokens. In this paper, we introduce Pre-Act, a novel approach that enhances the agent’s performance by creating a multi-step execution plan along with the detailed reasoning for the given user input. This plan incrementally incorporates previous steps and tool outputs, refining itself after each step execution until the final response is obtained. Our approach is applicable to both conversational and non-conversational agents. To measure the performance of task-oriented agents comprehensively, we propose a two-level evaluation framework: (1) turn level and (2) end-to-end. Our turn-level evaluation, averaged across five models, shows that our approach, Pre-Act, outperforms ReAct by 70% in Action Recall on the Almita dataset. While this approach is effective for larger models, smaller models crucial for practical applications, where latency and cost are key constraints, often struggle with complex reasoning tasks required for agentic systems. To address this limitation, we fine-tune relatively small models such as Llama 3.1 (8B and 70B) using the proposed Pre-Act approach. Our experiments show that the fine-tuned 70B model outperforms GPT-4, achieving a 69.5% improvement in action accuracy (turn-level) and a 28% improvement in goal completion rate (end-to-end) on the Almita (out-of-domain) dataset.
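
As a rough sketch of the Pre-Act control flow described above: draft a multi-step plan with reasoning, execute one step (possibly a tool call), then revise the remaining plan with the new observation. The `llm` callable and the prompt wording are placeholders, not the paper's actual prompts.

```python
# Minimal sketch of a plan-then-act-then-refine loop in the spirit of Pre-Act.
# `llm` is any text-in/text-out callable; prompts here are illustrative only.
def pre_act(llm, tools: dict, user_input: str, max_steps: int = 8) -> str:
    plan = llm(f"Draft a numbered multi-step plan, with reasoning, for: {user_input}")
    history = []
    for _ in range(max_steps):
        step = llm(f"Plan:\n{plan}\nHistory:\n{history}\n"
                   "Return the next action as 'tool:args', or 'FINAL:<answer>'.")
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
        tool_name, _, args = step.partition(":")
        observation = tools[tool_name.strip()](args.strip())
        history.append((step, observation))
        # Key Pre-Act ingredient: the plan is revised after every executed step.
        plan = llm(f"Revise the remaining plan given the new observation:\n"
                   f"{plan}\nLast step: {step}\nObservation: {observation}")
    return "Step budget exhausted."
```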

[AI-49] A Comprehensive Machine Learning Framework for Heart Disease Prediction: Performance Evaluation and Future Perspectives

【Quick Read】: This paper addresses heart disease prediction, aiming to improve diagnostic accuracy through machine learning. The key to the solution is a framework combining data preprocessing with model training and evaluation, comparing three classifiers: Logistic Regression, K-Nearest Neighbors (KNN), and Random Forest. Hyperparameter tuning via GridSearchCV and RandomizedSearchCV substantially improves performance; the Random Forest classifier performs best, reaching 91% accuracy and an F1-score of 0.89 and showing strong potential for clinical decision support.

Link: https://arxiv.org/abs/2505.09969
Authors: Ali Azimi Lamir,Shiva Razzagzadeh,Zeynab Rezaei
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study presents a machine learning-based framework for heart disease prediction using the heart-disease dataset, comprising 303 samples with 14 features. The methodology involves data preprocessing, model training, and evaluation using three classifiers: Logistic Regression, K-Nearest Neighbors (KNN), and Random Forest. Hyperparameter tuning with GridSearchCV and RandomizedSearchCV was employed to enhance model performance. The Random Forest classifier outperformed other models, achieving an accuracy of 91% and an F1-score of 0.89. Evaluation metrics, including precision, recall, and confusion matrix, revealed balanced performance across classes. The proposed model demonstrates strong potential for aiding clinical decision-making by effectively predicting heart disease. Limitations such as dataset size and generalizability underscore the need for future studies using larger and more diverse datasets. This work highlights the utility of machine learning in healthcare, offering insights for further advancements in predictive diagnostics.
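
The tuning-plus-evaluation loop the abstract describes maps directly onto standard scikit-learn components; here is a minimal sketch using a random stand-in for the 303-sample heart-disease dataset, with an illustrative parameter grid.

```python
# Minimal sketch of GridSearchCV-tuned Random Forest evaluation; the data is a
# random stand-in and the parameter grid is an illustrative choice.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

X = np.random.rand(303, 13)                # 13 predictors + 1 target = 14 features
y = np.random.randint(0, 2, 303)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    scoring="f1", cv=5,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_)
print(classification_report(y_te, grid.best_estimator_.predict(X_te)))
```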

[AI-50] TransPL: VQ-Code Transition Matrices for Pseudo-Labeling of Time Series Unsupervised Domain Adaptation ICML2025

【Quick Read】: This paper addresses unsupervised domain adaptation (UDA) for time series, where traditional pseudo-labeling strategies fail to capture temporal patterns and channel-wise shifts across domains and therefore produce poor pseudo-labels. The key to the solution is TransPL, which models the source-domain joint distribution P(X, y) with code transition matrices, where the codes come from vector quantization (VQ) of time series patches, and applies Bayes' rule for target-domain adaptation, generating pseudo-labels from channel-wise weighted class-conditional likelihoods.

Link: https://arxiv.org/abs/2505.09955
Authors: Jaeho Kim,Seulki Lee
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2025

Abstract:Unsupervised domain adaptation (UDA) for time series data remains a critical challenge in deep learning, with traditional pseudo-labeling strategies failing to capture temporal patterns and channel-wise shifts between domains, producing sub-optimal pseudo-labels. As such, we introduce TransPL, a novel approach that addresses these limitations by modeling the joint distribution P(X, y) of the source domain through code transition matrices, where the codes are derived from vector quantization (VQ) of time series patches. Our method constructs class- and channel-wise code transition matrices from the source domain and employs Bayes’ rule for target domain adaptation, generating pseudo-labels based on channel-wise weighted class-conditional likelihoods. TransPL offers three key advantages: explicit modeling of temporal transitions and channel-wise shifts between different domains, versatility towards different UDA scenarios (e.g., weakly-supervised UDA), and explainable pseudo-label generation. We validate TransPL’s effectiveness through extensive analysis on four time series UDA benchmarks and confirm that it consistently outperforms state-of-the-art pseudo-labeling methods by a strong margin (6.1% accuracy improvement, 4.9% F1 improvement), while providing interpretable insights into the domain adaptation process through its learned code transition matrices.
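
For intuition, here is a heavily simplified, single-channel sketch of the core TransPL idea: estimate a code transition matrix per class from source-domain VQ code sequences, then pseudo-label a target sequence by its class-conditional likelihood. The actual method is channel-wise and weighted; the toy code streams below are made up.

```python
# Simplified single-channel sketch of code-transition pseudo-labeling.
# The real TransPL is channel-wise and weighted; this is for intuition only.
import numpy as np

def transition_matrix(sequences, n_codes, smoothing=1.0):
    counts = np.full((n_codes, n_codes), smoothing)  # Laplace smoothing
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # row-stochastic

def pseudo_label(target_seq, class_matrices):
    scores = {
        c: sum(np.log(T[a, b]) for a, b in zip(target_seq[:-1], target_seq[1:]))
        for c, T in class_matrices.items()
    }
    return max(scores, key=scores.get)  # class with highest log-likelihood

n_codes = 8
source = {0: [[0, 1, 2, 1, 0]] * 5, 1: [[4, 5, 6, 7, 6]] * 5}  # toy VQ code streams
mats = {c: transition_matrix(seqs, n_codes) for c, seqs in source.items()}
print(pseudo_label([4, 5, 6, 5, 6], mats))  # -> 1
```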

[AI-51] Task-Core Memory Management and Consolidation for Long-term Continual Learning NEURIPS2025

【Quick Read】: This paper addresses catastrophic forgetting in long-term continual learning (CL), where a model learning a vast sequence of tasks struggles to retain earlier knowledge. Since existing CL methods degrade on large task streams, the paper proposes Long-CL, a framework inspired by human memory mechanisms. Its key components are a task-core memory management strategy that efficiently indexes crucial memories and updates them dynamically, and a long-term memory consolidation mechanism that selectively retains hard and discriminative samples to strengthen knowledge retention.

Link: https://arxiv.org/abs/2505.09952
Authors: Tianyu Huai,Jie Zhou,Yuxuan Cai,Qin Chen,Wen Wu,Xingjiao Wu,Xipeng Qiu,Liang He
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to NeurIPS 2025

Abstract:In this paper, we focus on a long-term continual learning (CL) task, where a model learns sequentially from a stream of vast tasks over time, acquiring new knowledge while retaining previously learned information in a manner akin to human learning. Unlike traditional CL settings, long-term CL involves handling a significantly larger number of tasks, which exacerbates the issue of catastrophic forgetting. Our work seeks to address two critical questions: 1) How do existing CL methods perform in the context of long-term CL? and 2) How can we mitigate the catastrophic forgetting that arises from prolonged sequential updates? To tackle these challenges, we propose a novel framework inspired by human memory mechanisms for long-term continual learning (Long-CL). Specifically, we introduce a task-core memory management strategy to efficiently index crucial memories and adaptively update them as learning progresses. Additionally, we develop a long-term memory consolidation mechanism that selectively retains hard and discriminative samples, ensuring robust knowledge retention. To facilitate research in this area, we construct and release two multi-modal and textual benchmarks, MMLongCL-Bench and TextLongCL-Bench, providing a valuable resource for evaluating long-term CL approaches. Experimental results show that Long-CL outperforms the previous state-of-the-art by 7.4% and 6.5% AP on the two benchmarks, respectively, demonstrating the effectiveness of our approach.

[AI-52] Demystifying AI Agents: The Final Generation of Intelligence

【Quick Read】: This paper examines the trajectory of artificial intelligence (AI) and its current stage, in particular how key technological breakthroughs (prompting techniques, training methodologies, hardware capabilities, and architectural innovations) have converged to produce AI agents, and analyzes their potential as the "final generation" of intelligence as currently conceived. The central questions concern the technological foundations and societal impact of AI agents, and the paper stresses that, with intelligence growth accelerating exponentially, wisdom and foresight are needed to navigate the resulting opportunities and challenges. The key lies in a synthesis of AI's evolutionary path with a close assessment of its practical capabilities and potential impact.

Link: https://arxiv.org/abs/2505.09932
Authors: Kevin J McNamara,Rhea Pritham Marpu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:The trajectory of artificial intelligence (AI) has been one of relentless acceleration, evolving from rudimentary rule-based systems to sophisticated, autonomous agents capable of complex reasoning and interaction. This whitepaper chronicles this remarkable journey, charting the key technological milestones–advancements in prompting, training methodologies, hardware capabilities, and architectural innovations–that have converged to create the AI agents of today. We argue that these agents, exemplified by systems like OpenAI’s ChatGPT with plugins and xAI’s Grok, represent a culminating phase in AI development, potentially constituting the “final generation” of intelligence as we currently conceive it. We explore the capabilities and underlying technologies of these agents, grounded in practical examples, while also examining the profound societal implications and the unprecedented pace of progress that suggests intelligence is now doubling approximately every six months. The paper concludes by underscoring the critical need for wisdom and foresight in navigating the opportunities and challenges presented by this powerful new era of intelligence.

[AI-53] Reinforced Interactive Continual Learning via Real-time Noisy Human Feedback

【Quick Read】: This paper addresses two main limitations of traditional continual learning: models should be updated from dynamic, real-time streams of human-annotated data rather than static datasets, and labels cannot be assumed clean, since real interactions carry noisy feedback. The key to the solution is the RiCL framework, whose core components are a temporal consistency-aware purifier that separates clean from noisy samples in the data stream, an interaction-aware direct preference optimization strategy that aligns model behavior with human intent, and a noise-resistant contrastive learning module that exploits relationships within the data to learn robust representations, reducing reliance on unreliable labels.

Link: https://arxiv.org/abs/2505.09925
Authors: Yutao Yang,Jie Zhou,Junsong Li,Qianjun Pan,Bihao Zhan,Qin Chen,Xipeng Qiu,Liang He
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces an interactive continual learning paradigm where AI models dynamically learn new skills from real-time human feedback while retaining prior knowledge. This paradigm distinctively addresses two major limitations of traditional continual learning: (1) dynamic model updates using streaming, real-time human-annotated data, rather than static datasets with fixed labels, and (2) the assumption of clean labels, by explicitly handling the noisy feedback common in real-world interactions. To tackle these problems, we propose RiCL, a Reinforced interactive Continual Learning framework leveraging Large Language Models (LLMs) to learn new skills effectively from dynamic feedback. RiCL incorporates three key components: a temporal consistency-aware purifier to automatically discern clean from noisy samples in data streams; an interaction-aware direct preference optimization strategy to align model behavior with human intent by reconciling AI-generated and human-provided feedback; and a noise-resistant contrastive learning module that captures robust representations by exploiting inherent data relationships, thus avoiding reliance on potentially unreliable labels. Extensive experiments on two benchmark datasets (FewRel and TACRED), contaminated with realistic noise patterns, demonstrate that our RiCL approach substantially outperforms existing combinations of state-of-the-art online continual learning and noisy-label learning methods.

[AI-54] “There Is No Such Thing as a Dumb Question” But There Are Good Ones

【Quick Read】: This paper addresses the problem of comprehensively assessing question quality in humans and AI; despite the growing importance of questioning, research remains limited. The key to the solution is a systematic evaluation framework built on two core dimensions, appropriateness (sociolinguistic competence in context) and effectiveness (strategic competence in goal achievement), implemented as a rubric-based scoring system that balances structure and flexibility and incorporates dynamic contextual variables to adapt to varied situations.

Link: https://arxiv.org/abs/2505.09923
Authors: Minjung Shin,Donghyun Kim,Jeh-Kwang Ryu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures and 4 tables. This work has been accepted for presentation as a poster with full paper publication at CogSci 2025. This is the final submission

Abstract:Questioning has become increasingly crucial for both humans and artificial intelligence, yet there remains limited research comprehensively assessing question quality. In response, this study defines good questions and presents a systematic evaluation framework. We propose two key evaluation dimensions: appropriateness (sociolinguistic competence in context) and effectiveness (strategic competence in goal achievement). Based on these foundational dimensions, a rubric-based scoring system was developed. By incorporating dynamic contextual variables, our evaluation framework achieves structure and flexibility through semi-adaptive criteria. The methodology was validated using the CAUS and SQUARE datasets, demonstrating the ability of the framework to assess both well-formed and problematic questions while adapting to varied contexts. As we establish a flexible and comprehensive framework for question evaluation, this study takes a significant step toward integrating questioning behavior with structured analytical methods grounded in the intrinsic nature of questioning.

[AI-55] Offline Reinforcement Learning for Microgrid Voltage Regulation ICLR2025

【Quick Read】: This paper addresses microgrid voltage regulation when environment interaction is unviable (for technical or safety reasons), in particular control-policy optimization under high photovoltaic penetration. The key to the solution is offline reinforcement learning: training on previously collected datasets yields applicable control models, reducing dependence on online environment interaction and improving the feasibility and effectiveness of control when real-time interaction is unavailable.

Link: https://arxiv.org/abs/2505.09920
Authors: Shan Yang,Yongli Zhu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: This paper has been accepted and presented at ICLR 2025 in Singapore, Apr. 28, 2025

Abstract:This paper presents a study on using different offline reinforcement learning algorithms for microgrid voltage regulation with solar power penetration. When environment interaction is unviable due to technical or safety reasons, the proposed approach can still obtain an applicable model through offline-style training on a previously collected dataset, lowering the negative impact of lacking online environment interactions. Experiment results on the IEEE 33-bus system demonstrate the feasibility and effectiveness of the proposed approach on different offline datasets, including the one with merely low-quality experience.

[AI-56] Avocado Price Prediction Using a Hybrid Deep Learning Model: TCN-MLP-Attention Architecture

【Quick Read】: This paper addresses the complex price fluctuations in agricultural product forecasting, particularly for high-value crops such as Hass avocados, whose prices are strongly shaped by seasonality, region, and weather; traditional models handle such highly nonlinear, dynamic data poorly. The key is the proposed hybrid deep learning model, the TCN-MLP-Attention architecture, which combines a Temporal Convolutional Network (TCN) for sequential feature extraction, a Multi-Layer Perceptron (MLP) for modeling nonlinear interactions, and an Attention mechanism for dynamic feature weighting to improve predictive performance.

Link: https://arxiv.org/abs/2505.09907
Authors: Linwei Zhang,LuFeng,Ruijia Liang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:With the growing demand for healthy foods, agricultural product price forecasting has become increasingly important. Hass avocados, as a high-value crop, exhibit complex price fluctuations influenced by factors such as seasonality, region, and weather. Traditional prediction models often struggle with highly nonlinear and dynamic data. To address this, we propose a hybrid deep learning model, TCN-MLP-Attention Architecture, combining Temporal Convolutional Networks (TCN) for sequential feature extraction, Multi-Layer Perceptrons (MLP) for nonlinear interactions, and an Attention mechanism for dynamic feature weighting. The dataset used covers over 50,000 records of Hass avocado sales across the U.S. from 2015 to 2018, including variables such as sales volume, average price, time, region, weather, and variety type, collected from point-of-sale systems and the Hass Avocado Board. After systematic preprocessing, including missing value imputation and feature normalization, the proposed model was trained and evaluated. Experimental results demonstrate that the TCN-MLP-Attention model achieves excellent predictive performance, with an RMSE of 1.23 and an MSE of 1.51, outperforming traditional methods. This research provides a scalable and effective approach for time series forecasting in agricultural markets and offers valuable insights for intelligent supply chain management and price strategy optimization.
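
An architectural sketch of a TCN-MLP-Attention hybrid of the kind described above is given below; the layer sizes, dilation schedule, and single-head attention are illustrative choices rather than the paper's configuration.

```python
# Architectural sketch of a TCN-MLP-Attention hybrid; all sizes are illustrative.
import torch
import torch.nn as nn

class TCNMLPAttention(nn.Module):
    def __init__(self, in_ch: int, hidden: int = 64):
        super().__init__()
        self.tcn = nn.Sequential(  # dilated convolutions for temporal features
            nn.Conv1d(in_ch, hidden, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):                # x: (batch, time, features)
        h = self.tcn(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, hidden)
        h, _ = self.attn(h, h, h)        # dynamic feature weighting over time
        return self.mlp(h.mean(dim=1))   # pooled representation -> price

model = TCNMLPAttention(in_ch=6)
print(model(torch.randn(8, 30, 6)).shape)  # torch.Size([8, 1])
```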

[AI-57] Which Demographic Features Are Relevant for Individual Fairness Evaluation of U.S. Recidivism Risk Assessment Tools?

【Quick Read】: This paper asks how the technical "individual fairness" criterion can be operationalized in recidivism risk assessment (RRA) tools; despite its foundation in the U.S. Constitution, it has not been implemented in state or federal law. The key to the solution is a human subjects experiment that determines which demographic features are relevant to individual fairness evaluation; the results indicate that an individual similarity function should consider age and sex but ignore race.

Link: https://arxiv.org/abs/2505.09868
Authors: Tin Trung Nguyen,Jiannan Xu,Phuong-Anh Nguyen-Le,Jonathan Lazar,Donald Braman,Hal Daumé III,Zubin Jelveh
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Despite its U.S. constitutional foundation, the technical "individual fairness" criterion has not been operationalized in state or federal statutes/regulations. We conduct a human subjects experiment to address this gap, evaluating which demographic features are relevant for individual fairness evaluation of recidivism risk assessment (RRA) tools. Our analyses conclude that the individual similarity function should consider age and sex, but it should ignore race.
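
The paper's conclusion translates into a very small worked example: an individual-similarity function for RRA that uses age and sex and deliberately ignores race. The weighting and age normalization below are arbitrary illustrative choices.

```python
# Worked toy example of a similarity function that follows the paper's finding:
# use age and sex, ignore race. Weights and scaling are illustrative only.
def similarity(person_a: dict, person_b: dict) -> float:
    age_gap = abs(person_a["age"] - person_b["age"]) / 50.0   # normalize to ~[0, 1]
    sex_gap = 0.0 if person_a["sex"] == person_b["sex"] else 1.0
    # Note: person_x["race"] is intentionally unused.
    return 1.0 - min(1.0, 0.5 * age_gap + 0.5 * sex_gap)

a = {"age": 25, "sex": "M", "race": "A"}
b = {"age": 30, "sex": "M", "race": "B"}
print(similarity(a, b))  # high similarity despite differing race
```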

[AI-58] LiDDA: Data Driven Attribution at LinkedIn

【Quick Read】: This paper addresses the complexity of marketing attribution, i.e., assigning conversion credit to different marketing interactions based on causal patterns learned from data. The key to the solution is a unified transformer-based attribution approach that can handle member-level data, aggregate-level data, and the integration of external macro factors, improving the accuracy and applicability of marketing intelligence.

Link: https://arxiv.org/abs/2505.09861
Authors: John Bencina,Erkut Aykutlug,Yue Chen,Zerui Zhang,Stephanie Sorenson,Shao Tang,Changshuai Wei
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Methodology (stat.ME)
Comments:

Abstract:Data Driven Attribution, which assigns conversion credits to marketing interactions based on causal patterns learned from data, is the foundation of modern marketing intelligence and vital to any marketing businesses and advertising platform. In this paper, we introduce a unified transformer-based attribution approach that can handle member-level data, aggregate-level data, and integration of external macro factors. We detail the large scale implementation of the approach at LinkedIn, showcasing significant impact. We also share learning and insights that are broadly applicable to the marketing and ad tech fields.

[AI-59] Causal Predictive Optimization and Generation for Business AI

【Quick Read】: This paper addresses sales-process optimization in B2B businesses, aiming to lift conversion rates and customer value with advanced AI. The key is the proposed Causal Predictive Optimization and Generation system, with three layers: 1) a prediction layer based on causal machine learning; 2) an optimization layer combining constraint optimization with contextual bandits; and 3) a serving layer that uses Generative AI and a feedback loop for system enhancement. Its deployment at LinkedIn shows significant wins over legacy systems and yields lessons broadly applicable to the field.

Link: https://arxiv.org/abs/2505.09847
Authors: Liyang Zhao,Olurotimi Seton,Himadeep Reddy Reddivari,Suvendu Jena,Shadow Zhao,Rachit Kumar,Changshuai Wei
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (stat.ML)
Comments:

Abstract:The sales process involves sales functions converting leads or opportunities to customers and selling more products to existing customers. The optimization of the sales process thus is key to the success of any B2B business. In this work, we introduce a principled approach to sales optimization and business AI, namely Causal Predictive Optimization and Generation, which includes three layers: 1) a prediction layer with causal ML; 2) an optimization layer with constraint optimization and contextual bandits; and 3) a serving layer with Generative AI and a feedback loop for system enhancement. We detail the implementation and deployment of the system in LinkedIn, showcasing significant wins over legacy systems and sharing learnings and insights broadly applicable to this field.

[AI-60] Evaluating Large Language Models for the Generation of Unit Tests with Equivalence Partitions and Boundary Values

【Quick Read】: This paper addresses the complexity of designing and implementing unit tests, a task programmers often neglect, and evaluates the potential of large language models (LLMs) to generate test cases automatically compared with manual tests. The key to the solution is an optimized prompt that integrates code and requirements and covers critical test scenarios such as equivalence partitions and boundary values, improving the effectiveness of LLM-generated test cases.

Link: https://arxiv.org/abs/2505.09830
Authors: Martín Rodríguez,Gustavo Rossi,Alejandro Fernandez
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Under revision at Jornadas de Cloud Computing, Big Data & Emerging Topics (JCC-BDET) 2025

Abstract:The design and implementation of unit tests is a complex task many programmers neglect. This research evaluates the potential of Large Language Models (LLMs) in automatically generating test cases, comparing them with manual tests. An optimized prompt was developed, that integrates code and requirements, covering critical cases such as equivalence partitions and boundary values. The strengths and weaknesses of LLMs versus trained programmers were compared through quantitative metrics and manual qualitative analysis. The results show that the effectiveness of LLMs depends on well-designed prompts, robust implementation, and precise requirements. Although flexible and promising, LLMs still require human supervision. This work highlights the importance of manual qualitative analysis as an essential complement to automation in unit test evaluation.

[AI-61] XX^t Can Be Faster

【Quick Read】: This paper addresses the efficiency of computing the product of a matrix with its transpose, XX^t, aiming to reduce the number of multiplications and additions required. The key is the new algorithm RXTX, discovered by combining machine learning-based search with combinatorial optimization, which uses 5% fewer operations than the state of the art and achieves speedups even for small matrices.

Link: https://arxiv.org/abs/2505.09814
Authors: Dmitry Rybin,Yushun Zhang,Zhi-Quan Luo
Institution: Unknown
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments:

Abstract:We present a new algorithm, RXTX, that computes the product of a matrix with its transpose, XX^t. RXTX uses 5% fewer multiplications and additions than the state of the art and achieves accelerations even for small sizes of the matrix X. The algorithm was discovered by combining Machine Learning-based search methods with Combinatorial Optimization.
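
The abstract does not disclose RXTX's bilinear scheme, but the baseline it improves on already exploits the symmetry of XX^t: only the upper triangle of the result needs computing before mirroring, roughly halving the multiplications of a generic matrix product. A sketch of that baseline:

```python
# Baseline symmetric computation of XX^t (not RXTX itself): compute the upper
# triangle only, then mirror it, which avoids about half the row-row products.
import numpy as np

def xxt_symmetric(X: np.ndarray) -> np.ndarray:
    n = X.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):            # upper triangle only
            out[i, j] = X[i] @ X[j]      # dot product of row i and row j
            out[j, i] = out[i, j]        # fill by symmetry, no extra multiplies
    return out

X = np.random.rand(4, 6)
assert np.allclose(xxt_symmetric(X), X @ X.T)
```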

[AI-62] A Multimodal Multi-Agent Framework for Radiology Report Generation

【Quick Read】: This paper addresses factual inconsistency, hallucination, and cross-modal misalignment in radiology report generation (RRG). The key to the solution is a multimodal multi-agent framework aligned with the clinical reasoning workflow, in which task-specific agents handle retrieval, draft generation, visual analysis, refinement, and synthesis, improving the accuracy, structure, and interpretability of the generated reports.

Link: https://arxiv.org/abs/2505.09787
Authors: Ziruo Yi,Ting Xiao,Mark V. Albert
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Radiology report generation (RRG) aims to automatically produce diagnostic reports from medical images, with the potential to enhance clinical workflows and reduce radiologists’ workload. While recent approaches leveraging multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) have achieved strong results, they continue to face challenges such as factual inconsistency, hallucination, and cross-modal misalignment. We propose a multimodal multi-agent framework for RRG that aligns with the stepwise clinical reasoning workflow, where task-specific agents handle retrieval, draft generation, visual analysis, refinement, and synthesis. Experimental results demonstrate that our approach outperforms a strong baseline in both automatic metrics and LLM-based evaluations, producing more accurate, structured, and interpretable reports. This work highlights the potential of clinically aligned multi-agent frameworks to support explainable and trustworthy clinical AI applications.

[AI-63] On the Well-Posedness of Green's Function Reconstruction via the Kirchhoff-Helmholtz Equation for One-Speed Neutron Diffusion

【Quick Read】: This paper addresses reconstructing the spatial distribution of the neutron flux in a nuclear reactor from real-time ex-core detector measurements. The key is a data-driven approximation of the Green's function based on the Kirchhoff-Helmholtz (K-H) equation, formulated and solved as an inverse problem, enabling accurate flux reconstruction in complex, heterogeneous reactor geometries. By establishing the symmetry properties the Green's function must satisfy and proving the existence and uniqueness of the Green's function inferred from sampled data, the method's reliability and effectiveness are ensured.

Link: https://arxiv.org/abs/2505.09766
Authors: Roberto Ponciroli
Institution: Unknown
Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work presents a methodology for reconstructing the spatial distribution of the neutron flux in a nuclear reactor, leveraging real-time measurements obtained from ex-core detectors. The Kirchhoff-Helmholtz (K-H) equation inherently defines the problem of estimating a scalar field within a domain based on boundary data, making it a natural mathematical framework for this task. The main challenge lies in deriving the Green's function specific to the domain and the neutron diffusion process. While analytical solutions for Green's functions exist for simplified geometries, their derivation for complex, heterogeneous domains, such as a nuclear reactor, requires a numerical approach. The objective of this work is to demonstrate the well-posedness of the data-driven Green's function approximation by formulating and solving the K-H equation as an inverse problem. After establishing the symmetry properties that the Green's function must satisfy, the K-H equation is derived from the one-speed neutron diffusion model. This is followed by a comprehensive description of the procedure for interpreting sensor readings and implementing the neutron flux reconstruction algorithm. Finally, the existence and uniqueness of the Green's function inferred from the sampled data are demonstrated, ensuring the reliability of the proposed method and its predictions.
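
For reference, the two equations the abstract builds on can be written explicitly. This is the standard textbook form of the one-speed diffusion model and the Kirchhoff-Helmholtz boundary representation; the notation (D, Sigma_a, G) is the conventional one, assumed here rather than taken from the paper.

```latex
% One-speed neutron diffusion for the flux phi, and the Kirchhoff-Helmholtz
% boundary representation that reconstructs phi inside the domain Omega from
% boundary data via the Green's function G.
\begin{align}
  -D\,\nabla^2 \phi(\mathbf{r}) + \Sigma_a\,\phi(\mathbf{r}) &= S(\mathbf{r}),
     \qquad \mathbf{r} \in \Omega, \\
  \phi(\mathbf{r}) &= \oint_{\partial\Omega}
     \left[ G(\mathbf{r},\mathbf{r}')\,\frac{\partial \phi(\mathbf{r}')}{\partial n'}
          - \phi(\mathbf{r}')\,\frac{\partial G(\mathbf{r},\mathbf{r}')}{\partial n'} \right]
     \mathrm{d}S'.
\end{align}
```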

[AI-64] Trustless Autonomy: Understanding Motivations, Benefits, and Governance Dilemma in Self-Sovereign Decentralized AI Agents

【Quick Read】: This paper addresses the paradoxical tension between trustlessness and unreliable autonomy in decentralized AI agents (DeAgents): how to ensure the reliability of LLM-based agent systems in the absence of central control. The key lies in interviews with DeAgent stakeholders (experts, founders, and developers) that probe their motivations, benefits, and governance dilemmas, guiding future DeAgent system and protocol design and informing the discussion of governance for sociotechnical AI systems.

Link: https://arxiv.org/abs/2505.09757
Authors: Botao Amber Hu,Yuhan Liu,Helena Rong
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Submitted to CSCW 2026

Abstract:The recent trend of self-sovereign Decentralized AI Agents (DeAgents) combines Large Language Model (LLM)-based AI agents with decentralization technologies such as blockchain smart contracts and trusted execution environments (TEEs). These tamper-resistant trustless substrates allow agents to achieve self-sovereignty through ownership of cryptowallet private keys and control of digital assets and social media accounts. DeAgent eliminates centralized control and reduces human intervention, addressing key trust concerns inherent in centralized AI systems. However, given ongoing challenges in LLM reliability such as hallucinations, this creates paradoxical tension between trustlessness and unreliable autonomy. This study addresses this empirical research gap through interviews with DeAgents stakeholders-experts, founders, and developers-to examine their motivations, benefits, and governance dilemmas. The findings will guide future DeAgents system and protocol design and inform discussions about governance in sociotechnical AI systems in the future agentic web.

[AI-65] Explainability Through Human-Centric Design for XAI in Lung Cancer Detection

【Quick Read】: This paper addresses the interpretability problem of deep learning models that detect lung pathologies from chest X-rays, whose opaque decision-making limits broad clinical adoption. The key is XpertXAI, a generalizable expert-driven model that preserves human-interpretable clinical concepts while scaling to multiple lung pathologies. Pairing a high-performing InceptionV3-based classifier with a public chest X-ray dataset, the model achieves better predictive accuracy than existing post-hoc explainability methods and the unsupervised concept bottleneck model XCBs, and produces concept-level explanations that align more closely with expert reasoning.

Link: https://arxiv.org/abs/2505.09755
Authors: Amy Rafferty,Rishi Ramaesh,Ajitha Rajan
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning models have shown promise in lung pathology detection from chest X-rays, but widespread clinical adoption remains limited due to opaque model decision-making. In prior work, we introduced ClinicXAI, a human-centric, expert-guided concept bottleneck model (CBM) designed for interpretable lung cancer diagnosis. We now extend that approach and present XpertXAI, a generalizable expert-driven model that preserves human-interpretable clinical concepts while scaling to detect multiple lung pathologies. Using a high-performing InceptionV3-based classifier and a public dataset of chest X-rays with radiology reports, we compare XpertXAI against leading post-hoc explainability methods and an unsupervised CBM, XCBs. We assess explanations through comparison with expert radiologist annotations and medical ground truth. Although XpertXAI is trained for multiple pathologies, our expert validation focuses on lung cancer. We find that existing techniques frequently fail to produce clinically meaningful explanations, omitting key diagnostic features and disagreeing with radiologist judgments. XpertXAI not only outperforms these baselines in predictive accuracy but also delivers concept-level explanations that better align with expert reasoning. While our focus remains on explainability in lung cancer detection, this work illustrates how human-centric model design can be effectively extended to broader diagnostic contexts - offering a scalable path toward clinically meaningful explainable AI in medical diagnostics.

[AI-66] Healthy Distrust in AI systems

【Quick Read】: This paper challenges the tendency of trustworthy-AI research to focus on designing trustworthy systems that drive user adoption, while neglecting social contexts in which users' distrust of an AI system is justified and necessary. The key to the solution is the proposed concept of "healthy distrust", a justified, careful stance toward certain AI usage practices, enabling meaningful trust to be built in the first place while respecting human autonomy.

Link: https://arxiv.org/abs/2505.09747
Authors: Benjamin Paaßen,Suzana Alpsancar,Tobias Matzner,Ingrid Scharlau
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Under the slogan of trustworthy AI, much of contemporary AI research is focused on designing AI systems and usage practices that inspire human trust and, thus, enhance adoption of AI systems. However, a person affected by an AI system may not be convinced by AI system design alone – neither should they, if the AI system is embedded in a social context that gives good reason to believe that it is used in tension with a person’s interest. In such cases, distrust in the system may be justified and necessary to build meaningful trust in the first place. We propose the term “healthy distrust” to describe such a justified, careful stance towards certain AI usage practices. We investigate prior notions of trust and distrust in computer science, sociology, history, psychology, and philosophy, outline a remaining gap that healthy distrust might fill and conceptualize healthy distrust as a crucial part for AI usage that respects human autonomy.

[AI-67] A Generative Neural Annealer for Black-Box Combinatorial Optimization

【Quick Read】: This paper addresses black-box combinatorial optimization, in particular improving sample efficiency and solution quality on NP-hard problems. The key is to treat the black-box objective as an energy function and train a neural network to model the associated Boltzmann distribution. By conditioning on temperature, the network captures a continuum of distributions, from near-uniform at high temperatures to sharply peaked around the global optima at low temperatures, learning the structure of the energy landscape and facilitating global optimization.

Link: https://arxiv.org/abs/2505.09742
Authors: Yuan-Hang Zhang,Massimiliano Di Ventra
Institution: Unknown
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 15 pages, 3 figures

Abstract:We propose a generative, end-to-end solver for black-box combinatorial optimization that emphasizes both sample efficiency and solution quality on NP problems. Drawing inspiration from annealing-based algorithms, we treat the black-box objective as an energy function and train a neural network to model the associated Boltzmann distribution. By conditioning on temperature, the network captures a continuum of distributions–from near-uniform at high temperatures to sharply peaked around global optima at low temperatures–thereby learning the structure of the energy landscape and facilitating global optimization. When queries are expensive, the temperature-dependent distributions naturally enable data augmentation and improve sample efficiency. When queries are cheap but the problem remains hard, the model learns implicit variable interactions, effectively “opening” the black box. We validate our approach on challenging combinatorial tasks under both limited and unlimited query budgets, showing competitive performance against state-of-the-art black-box optimizers.
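
One standard way to fit a temperature-conditioned Boltzmann model is a variational-annealing style loss; the paper's exact training procedure may differ, and the `model.sample` API below is an assumption. For samples x ~ p_theta(.|T), minimizing the free energy F = E[E(x)] + T * E[log p_theta(x|T)] drives p_theta(x|T) toward exp(-E(x)/T)/Z.

```python
# Sketch of a variational free-energy training signal for a temperature-
# conditioned generative model; `model.sample(batch, T)` returning samples and
# their log-probabilities is an assumed interface, not the paper's API.
import torch

def free_energy_loss(model, energy_fn, T: float, batch: int = 64) -> torch.Tensor:
    x, log_p = model.sample(batch, T)   # samples and their log-probs under p_theta
    e = energy_fn(x)                    # black-box objective treated as energy
    f = e + T * log_p                   # per-sample free energy
    # Baseline-subtracted score-function (REINFORCE) estimator to cut variance.
    return ((f - f.mean()).detach() * log_p).mean()
```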

[AI-68] General Dynamic Goal Recognition AAAI2025

【Quick Read】: This paper addresses real-time goal recognition (GR) in dynamic environments, where traditional methods, which rely on a predefined goal set, adapt poorly to numerous and constantly changing goals. The key to the solution is introducing the General Dynamic GR problem and adopting a model-free goal-conditioned reinforcement learning approach that enables fast adaptation across varied, changing tasks.

Link: https://arxiv.org/abs/2505.09737
Authors: Osher Elhadad,Reuth Mirsky
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted for publication at Generalization in Planning (GenPlan) as part of AAAI 2025 workshops

Abstract:Understanding an agent’s intent through its behavior is essential in human-robot interaction, interactive AI systems, and multi-agent collaborations. This task, known as Goal Recognition (GR), poses significant challenges in dynamic environments where goals are numerous and constantly evolving. Traditional GR methods, designed for a predefined set of goals, often struggle to adapt to these dynamic scenarios. To address this limitation, we introduce the General Dynamic GR problem - a broader definition of GR - aimed at enabling real-time GR systems and fostering further research in this area. Expanding on this foundation, this paper employs a model-free goal-conditioned RL approach to enable fast adaptation for GR across various changing tasks.

[AI-69] Robust Federated Learning with Confidence-Weighted Filtering and GAN-Based Completion under Noisy and Incomplete Data

【Quick Read】: This paper addresses the performance degradation of federated learning (FL) caused by data quality issues such as noisy labels, missing classes, and imbalanced distributions. The key is to systematically improve data integrity and model performance through adaptive noise cleaning, collaborative conditional GAN-based synthetic data generation, and robust federated model training.

Link: https://arxiv.org/abs/2505.09733
Authors: Alpaslan Gokcen,Ali Boyaci
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Federated learning (FL) presents an effective solution for collaborative model training while maintaining data privacy across decentralized client datasets. However, data quality issues such as noisy labels, missing classes, and imbalanced distributions significantly challenge its effectiveness. This study proposes a federated learning methodology that systematically addresses data quality issues, including noise, class imbalance, and missing labels. The proposed approach systematically enhances data integrity through adaptive noise cleaning, collaborative conditional GAN-based synthetic data generation, and robust federated model training. Experimental evaluations conducted on benchmark datasets (MNIST and Fashion-MNIST) demonstrate significant improvements in federated model performance, particularly macro-F1 Score, under varying noise and class imbalance conditions. Additionally, the proposed framework carefully balances computational feasibility and substantial performance gains, ensuring practicality for resource constrained edge devices while rigorously maintaining data privacy. Our results indicate that this method effectively mitigates common data quality challenges, providing a robust, scalable, and privacy compliant solution suitable for diverse real-world federated learning scenarios.

[AI-70] Out-of-distribution generalisation is hard: evidence from ARC-like tasks NEURIPS2025

【Quick Read】: This paper addresses the weak generalization of models in out-of-distribution (OOD) settings, in particular how to ensure an algorithm genuinely learns composable structure for effective OOD generalization. The key to the solution is that testing an algorithm in an OOD setup is not enough: one must also verify that the identified features are compositional, i.e., task-invariant, composable input features, so that the model reasons by composing those features rather than interpolating between learned data points.

Link: https://arxiv.org/abs/2505.09716
Authors: George Dimitriadis,Spyridon Samothrakis
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submission to NeurIPS 2025

Abstract:Out-of-distribution (OOD) generalisation is considered a hallmark of human and animal intelligence. To achieve OOD through composition, a system must discover the environment-invariant properties of experienced input-output mappings and transfer them to novel inputs. This can be realised if an intelligent system can identify appropriate, task-invariant, and composable input features, as well as the composition methods, thus allowing it to act based not on the interpolation between learnt data points but on the task-invariant composition of those features. We propose that in order to confirm that an algorithm does indeed learn compositional structures from data, it is not enough to just test on an OOD setup, but one also needs to confirm that the features identified are indeed compositional. We showcase this by exploring two tasks with clearly defined OOD metrics that are not OOD solvable by three commonly used neural networks: a Multi-Layer Perceptron (MLP), a Convolutional Neural Network (CNN), and a Transformer. In addition, we develop two novel network architectures imbued with biases that allow them to be successful in OOD scenarios. We show that even with correct biases and almost perfect OOD performance, an algorithm can still fail to learn the correct features for compositional generalisation.

[AI-71] Energy-Efficient Federated Learning for AIoT using Clustering Methods

【Quick Read】: This paper addresses the energy consumption of federated learning (FL) in Artificial Intelligence of Things (AIoT) scenarios, an aspect the literature often overlooks, analyzing and optimizing the three most energy-intensive stages: pre-processing, communication, and local learning. The key is clustering AIoT devices by the similarity of their label distributions, forming nearly heterogeneous clusters that mitigate the data heterogeneity common in real-world distributed learning, speeding up model convergence while keeping energy consumption low.

Link: https://arxiv.org/abs/2505.09704
Authors: Roberto Pereira,Fernanda Famá,Charalampos Kalalas,Paolo Dini
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:While substantial research has been devoted to optimizing model performance, convergence rates, and communication efficiency, the energy implications of federated learning (FL) within Artificial Intelligence of Things (AIoT) scenarios are often overlooked in the existing literature. This study examines the energy consumed during the FL process, focusing on three main energy-intensive processes: pre-processing, communication, and local learning, all contributing to the overall energy footprint. We rely on the observation that device/client selection is crucial for speeding up the convergence of model training in a distributed AIoT setting and propose two clustering-informed methods. These clustering solutions are designed to group AIoT devices with similar label distributions, resulting in clusters composed of nearly heterogeneous devices. Hence, our methods alleviate the heterogeneity often encountered in real-world distributed learning applications. Throughout extensive numerical experimentation, we demonstrate that our clustering strategies typically achieve high convergence rates while maintaining low energy consumption when compared to other recent approaches available in the literature.
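
The clustering step described above reduces to grouping clients by their label histograms; the abstract does not name the clustering algorithm, so k-means is used here purely for illustration.

```python
# Minimal sketch: group AIoT clients by label-distribution similarity.
# k-means is an illustrative choice; the paper's exact criteria are not given.
import numpy as np
from sklearn.cluster import KMeans

def label_histogram(labels: np.ndarray, n_classes: int) -> np.ndarray:
    hist = np.bincount(labels, minlength=n_classes).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
client_labels = [rng.integers(0, 10, size=rng.integers(50, 200)) for _ in range(30)]
H = np.stack([label_histogram(l, 10) for l in client_labels])  # 30 x 10 matrix
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(H)
print(clusters)  # cluster id per client; sample clients cluster-wise each round
```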

[AI-72] ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation

【Quick Read】: This paper addresses the absence of a unified benchmark for evaluating the low-level manipulation reasoning of vision-language models (VLMs). Prior work has explored VLMs' high-level planning and some low-level reasoning in robotic manipulation, but no comprehensive, standardized evaluation framework exists for low-level manipulation tasks. The key is ManipBench, a new benchmark that evaluates VLMs' low-level robot manipulation reasoning along multiple dimensions, including object-object interaction understanding and deformable object manipulation; extensive tests of 33 representative VLMs on this benchmark reveal large performance differences across models and their correlation with real-world manipulation tasks.

Link: https://arxiv.org/abs/2505.09698
Authors: Enyu Zhao,Vedant Raval,Hejia Zhang,Jiageng Mao,Zeyu Shangguan,Stefanos Nikolaidis,Yue Wang,Daniel Seita
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 47 pages, 29 figures. Under review

Abstract:Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common benchmark that can evaluate how well VLMs can aid low-level reasoning in robotics. Consequently, we propose a novel benchmark, ManipBench, to evaluate the low-level robot manipulation reasoning capabilities of VLMs across various dimensions, including how well they understand object-object interactions and deformable object manipulation. We extensively test 33 representative VLMs across 10 model families on our benchmark, including variants to test different model sizes. Our evaluation shows that the performance of VLMs significantly varies across tasks, and there is a strong correlation between this performance and trends in our real-world manipulation tasks. It also shows that there remains a significant gap between these models and human-level understanding. See our website at: this https URL.

[AI-73] Introducing voice timbre attribute detection

【Quick Read】: This paper addresses the representation of timbre in speech signals and introduces a task called voice timbre attribute detection (vTAD), which explains timbre with a set of sensory attributes describing human perception. The key to the solution is a framework built on speaker embeddings extracted from speech corpora, comparing the intensity of a pair of utterances along a designated timbre descriptor; the results show how different encoders perform across seen and unseen scenarios, underscoring the importance of generalization capability.

Link: https://arxiv.org/abs/2505.09661
Authors: Jinghao He,Zhengyan Sheng,Liping Chen,Kong Aik Lee,Zhen-Hua Ling
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:This paper focuses on explaining the timbre conveyed by speech signals and introduces a task termed voice timbre attribute detection (vTAD). In this task, voice timbre is explained with a set of sensory attributes describing its human perception. A pair of speech utterances is processed, and their intensity is compared in a designated timbre descriptor. Moreover, a framework is proposed, which is built upon the speaker embeddings extracted from the speech utterances. The investigation is conducted on the VCTK-RVA dataset. Experimental examinations on the ECAPA-TDNN and FACodec speaker encoders demonstrated that: 1) the ECAPA-TDNN speaker encoder was more capable in the seen scenario, where the testing speakers were included in the training set; 2) the FACodec speaker encoder was superior in the unseen scenario, where the testing speakers were not part of the training, indicating enhanced generalization capability. The VCTK-RVA dataset and open-source code are available on the website this https URL.

[AI-74] Unlocking Location Intelligence: A Survey from Deep Learning to The LLM Era

【Quick Read】: This paper systematically reviews the state of geospatial representation learning under the deep learning breakthrough and the large language model (LLM) paradigm, to drive further innovation in Location Intelligence (LI). The key to the solution is a structured taxonomy covering data, methodological, and application perspectives, a comprehensive review of geospatial representation learning across both technological eras, and, on that basis, an analysis of current progress, remaining limitations, and future research directions.

Link: https://arxiv.org/abs/2505.09651
Authors: Xixuan Hao,Yutian Jiang,Xingchen Zou,Jiabo Liu,Yifang Yin,Yuxuan Liang
Institution: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Location Intelligence (LI), the science of transforming location-centric geospatial data into actionable knowledge, has become a cornerstone of modern spatial decision-making. The rapid evolution of Geospatial Representation Learning is fundamentally reshaping LI development through two successive technological revolutions: the deep learning breakthrough and the emerging large language model (LLM) paradigm. While deep neural networks (DNNs) have demonstrated remarkable success in automated feature extraction from structured geospatial data (e.g., satellite imagery, GPS trajectories), the recent integration of LLMs introduces transformative capabilities for cross-modal geospatial reasoning and unstructured geo-textual data processing. This survey presents a comprehensive review of geospatial representation learning across both technological eras, organizing them into a structured taxonomy based on the complete pipeline comprising: (1) data perspective, (2) methodological perspective and (3) application perspective. We also highlight current advancements, discuss existing limitations, and propose potential future research directions in the LLM era. This work offers a thorough exploration of the field and provides a roadmap for further innovation in LI. The summary of the up-to-date paper list can be found in this https URL and will undergo continuous updates.

[AI-75] Feature Relevancy Necessity and Usefulness: Complexity and Algorithms

【Quick Read】: This paper addresses how to accurately identify the features that are relevant or necessary to a classification model's prediction, and how to define and detect a feature's general importance to the model's overall behavior. The key lies in improving existing techniques and algorithms to decide relevancy and necessity efficiently, and in introducing a new global notion, "usefulness", which measures a feature's importance to the model's overall behavior rather than to a particular input. The paper proves that usefulness relates to relevancy and necessity and develops efficient detection algorithms for decision trees and other, more complex models.

Link: https://arxiv.org/abs/2505.09640
Authors: Tomás Capdevielle,Santiago Cifuentes
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 22 pages, 7 figures

Abstract:Given a classification model and a prediction for some input, there are heuristic strategies for ranking features according to their importance in regard to the prediction. One common approach to this task is rooted in propositional logic and the notion of "sufficient reason". Through this concept, the categories of relevant and necessary features were proposed in order to identify the crucial aspects of the input. This paper improves the existing techniques and algorithms for deciding which are the relevant and/or necessary features, showing in particular that necessity can be detected efficiently in complex models such as neural networks. We also generalize the notion of relevancy and study associated problems. Moreover, we present a new global notion (i.e. one that intends to explain whether a feature is important for the behavior of the model in general, not depending on a particular input) of "usefulness" and prove that it is related to relevancy and necessity. Furthermore, we develop efficient algorithms for detecting it in decision trees and other more complex models, and experiment on three datasets to analyze its practical utility.

[AI-76] Study and improvement of search algorithms in two-players perfect information games

【Quick Read】: This paper addresses the lack of systematic evaluation of how general the performance of existing game search algorithms is. The key to the solution is a new search algorithm validated through large-scale experiments: for short search times it outperforms all studied algorithms on all games, and for medium search times it outperforms them on most of the studied games.

Link: https://arxiv.org/abs/2505.09639
Authors: Quentin Cohen-Solal
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:Games, in their mathematical sense, are everywhere (game industries, economics, defense, education, chemistry, biology, …). Search algorithms in games are artificial intelligence methods for playing such games. Unfortunately, there is no study on these algorithms that evaluates the generality of their performance. We propose to address this gap in the case of two-player zero-sum games with perfect information. Furthermore, we propose a new search algorithm and we show that, for a short search time, it outperforms all studied algorithms on all games in this large experiment and that, for a medium search time, it outperforms all studied algorithms on 17 of the 22 studied games.

[AI-77] SpecWav-Attack: Leveraging Spectrogram Resizing and Wav2Vec 2.0 for Attacking Anonymized Speech

【Quick Read】: This paper addresses the problem of detecting speakers in anonymized speech systems, exposing security vulnerabilities in such systems. The key to the solution is the adversarial model SpecWav-Attack, which uses Wav2Vec2 for feature extraction and combines spectrogram resizing with incremental training to improve performance.

Link: https://arxiv.org/abs/2505.09616
Authors: Yuqi Li,Yuanzhong Zheng,Zhongtian Guo,Yaoxuan Wang,Jianjun Yin,Haojun Fei
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 2 pages, 3 figures, 1 chart

Abstract:This paper presents SpecWav-Attack, an adversarial model for detecting speakers in anonymized speech. It leverages Wav2Vec2 for feature extraction and incorporates spectrogram resizing and incremental training for improved performance. Evaluated on librispeech-dev and librispeech-test, SpecWav-Attack outperforms conventional attacks, revealing vulnerabilities in anonymized speech systems and emphasizing the need for stronger defenses, benchmarked against the ICASSP 2025 Attacker Challenge.

[AI-78] Online Isolation Forest ICML2024

【Quick Read】: This paper addresses the shortcomings of traditional offline anomaly detection methods in streaming settings: they need repeated access to in-memory data and impose unrealistic assumptions, while existing online methods also fall short and typically rely on periodic retraining to adapt. The key is Online-iForest, designed specifically for streaming conditions, which seamlessly tracks the evolving data-generating process, matching online alternatives and closely rivaling state-of-the-art offline techniques while showing a clear advantage in efficiency.

Link: https://arxiv.org/abs/2505.09593
Authors: Filippo Leveni,Guilherme Weigert Cassales,Bernhard Pfahringer,Albert Bifet,Giacomo Boracchi
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at International Conference on Machine Learning (ICML 2024)

Abstract:The anomaly detection literature is abundant with offline methods, which require repeated access to data in memory, and impose impractical assumptions when applied to a streaming context. Existing online anomaly detection methods also generally fail to address these constraints, resorting to periodic retraining to adapt to the online context. We propose Online-iForest, a novel method explicitly designed for streaming conditions that seamlessly tracks the data generating process as it evolves over time. Experimental validation on real-world datasets demonstrated that Online-iForest is on par with online alternatives and closely rivals state-of-the-art offline anomaly detection techniques that undergo periodic retraining. Notably, Online-iForest consistently outperforms all competitors in terms of efficiency, making it a promising solution in applications where fast identification of anomalies is of primary importance such as cybersecurity, fraud and fault detection.

[AI-79] AI and Generative AI Transforming Disaster Management: A Survey of Damage Assessment and Response Techniques

【Quick Read】: This paper addresses the problem of rapidly and efficiently assessing disaster intensity for natural disasters such as earthquakes, wildfires, and hurricanes, to improve the effectiveness of disaster response. The key is leveraging Artificial Intelligence (AI) and Generative AI (GenAI) to integrate heterogeneous multi-source data, simulate realistic disaster scenarios, and rapidly identify emerging trends, enabling intelligent damage assessment.

Link: https://arxiv.org/abs/2505.08202
Authors: Aman Raj,Lakshit Arora,Sanjay Surendranath Girija,Shashank Kapoor,Dipen Pradhan,Ankit Shetgaonkar
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in IEEE Compsac 2025

Abstract:Natural disasters, including earthquakes, wildfires and cyclones, bear a huge risk on human lives as well as infrastructure assets. An effective response to disaster depends on the ability to rapidly and efficiently assess the intensity of damage. Artificial Intelligence (AI) and Generative Artificial Intelligence (GenAI) presents a breakthrough solution, capable of combining knowledge from multiple types and sources of data, simulating realistic scenarios of disaster, and identifying emerging trends at a speed previously unimaginable. In this paper, we present a comprehensive review on the prospects of AI and GenAI in damage assessment for various natural disasters, highlighting both its strengths and limitations. We talk about its application to multimodal data such as text, image, video, and audio, and also cover major issues of data privacy, security, and ethical use of the technology during crises. The paper also recognizes the threat of Generative AI misuse, in the form of dissemination of misinformation and for adversarial attacks. Finally, we outline avenues of future research, emphasizing the need for secure, reliable, and ethical Generative AI systems for disaster management in general. We believe that this work represents the first comprehensive survey of Gen-AI techniques being used in the field of Disaster Assessment and Response.

[AI-80] Adversarial Attacks in Multimodal Systems: A Practitioner's Survey

【Quick Read】: This paper addresses the lack of a systematic analysis of, and practical guidance on, the threats that adversarial attacks pose to multimodal AI models. The key to its solution is a comprehensive survey of adversarial attacks targeting four modalities: text, image, video, and audio. By tracing how multimodal adversarial threats have evolved, it gives machine learning practitioners a view of the threat landscape and of the preventive measures relevant to real-world applications.

Link: https://arxiv.org/abs/2505.03084
Authors: Shashank Kapoor,Sanjay Surendranath Girija,Lakshit Arora,Dipen Pradhan,Ankit Shetgaonkar,Aman Raj
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE COMPSAC 2025

Click to view abstract

Abstract:The introduction of multimodal models is a huge step forward in Artificial Intelligence. A single model is trained to understand multiple modalities: text, image, video, and audio. Open-source multimodal models have made these breakthroughs more accessible. However, considering the vast landscape of adversarial attacks across these modalities, these models also inherit vulnerabilities of all the modalities, and ultimately, the adversarial threat amplifies. While broad research is available on possible attacks within or across these modalities, a practitioner-focused view that outlines attack types remains absent in the multimodal world. As more Machine Learning Practitioners adopt, fine-tune, and deploy open-source models in real-world applications, it’s crucial that they can view the threat landscape and take the preventive actions necessary. This paper addresses the gap by surveying adversarial attacks targeting all four modalities: text, image, video, and audio. This survey provides a view of the adversarial attack landscape and presents how multimodal adversarial threats have evolved. To the best of our knowledge, this survey is the first comprehensive summarization of the threat landscape in the multimodal world.

[AI-81] Change Detection in Multivariate Data Streams: Online Analysis with Kernel-QuantTree AALTD ECML2024

【Quick Read】: This paper addresses change detection for online monitoring of multivariate data streams, in particular controlling the false-alarm rate (the Average Run Length, ARL_0) under non-parametric conditions while still detecting anomalies effectively. The key to the solution is combining Kernel-QuantTree (KQT) histograms with an Exponentially Weighted Moving Average (EWMA) statistic into a flexible and practical monitoring framework. KQT histograms can model any stationary distribution, and since the distribution of the EWMA statistic does not depend on the stream's distribution in stationary conditions, the ARL_0 can be set a priori, keeping false alarms under control without sacrificing detection sensitivity.

Link: https://arxiv.org/abs/2410.13778
Authors: Michelangelo Olmo Nogara Notarianni,Filippo Leveni,Diego Stucchi,Luca Frittoli,Giacomo Boracchi
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: AALTD workshop at ECML 2024 (this https URL)

Click to view abstract

Abstract:We present Kernel-QuantTree Exponentially Weighted Moving Average (KQT-EWMA), a non-parametric change-detection algorithm that combines the Kernel-QuantTree (KQT) histogram and the EWMA statistic to monitor multivariate data streams online. The resulting monitoring scheme is very flexible, since histograms can be used to model any stationary distribution, and practical, since the distribution of test statistics does not depend on the distribution of datastream in stationary conditions (non-parametric monitoring). KQT-EWMA enables controlling false alarms by operating at a pre-determined Average Run Length ( ARL_0 ), which measures the expected number of stationary samples to be monitored before triggering a false alarm. The latter peculiarity is in contrast with most non-parametric change-detection tests, which rarely can control the ARL_0 a priori. Our experiments on synthetic and real-world datasets demonstrate that KQT-EWMA can control ARL_0 while achieving detection delays comparable to or lower than state-of-the-art methods designed to work in the same conditions.
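As a minimal illustration of the EWMA half of the method, the sketch below runs an EWMA control chart on a univariate stream with a mean shift. KQT-EWMA applies the same statistic to Kernel-QuantTree histogram statistics and calibrates the control limit to a target ARL_0; the 4-sigma limit used here is just a conventional choice, not the paper's calibration.

```python
# EWMA control chart on a univariate stream with a mean shift at t = 500.
import numpy as np

lam = 0.1                                  # EWMA smoothing weight
sigma_z = np.sqrt(lam / (2 - lam))         # steady-state std of the EWMA statistic
h = 4.0 * sigma_z                          # control limit; 4-sigma is illustrative

rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(0.0, 1, 500),   # stationary regime
                         rng.normal(1.0, 1, 200)])  # mean shift at t = 500

z = 0.0
for t, x in enumerate(stream):
    z = lam * x + (1 - lam) * z            # update the EWMA statistic
    if abs(z) > h:
        print(f"change detected at t={t} (z={z:.3f}, limit={h:.3f})")
        break
```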

[AI-82] Uncovering Magnetic Phases with Synthetic Data and Physics-Informed Training

【Quick Read】: This paper addresses the problem of efficiently learning magnetic phase transitions in complex systems that lack exact analytical solutions, focusing on the diluted Ising model. The key to the solution is combining computational simplicity with physics-informed strategies, training artificial neural networks on synthetic data: a supervised classification approach using simple dense networks, and unsupervised phase-transition detection using convolutional autoencoders trained solely on idealized spin configurations. Two physics-informed guidance mechanisms are introduced: architectural biases that preferentially amplify features related to symmetry breaking, and training configurations that explicitly break the \mathbb{Z}_2 symmetry to reinforce the network's ability to recognize ordered phases. Acting in tandem, these mechanisms increase the network's sensitivity to phase structure even without explicit labels.

Link: https://arxiv.org/abs/2505.10393
Authors: Agustin Medina,Marcelo Arlego,Carlos A. Lamas
Institutions: Unknown
Subjects: Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI)
Comments: 25 pages, 14 figures

Click to view abstract

Abstract:We investigate the efficient learning of magnetic phases using artificial neural networks trained on synthetic data, combining computational simplicity with physics-informed strategies. Focusing on the diluted Ising model, which lacks an exact analytical solution, we explore two complementary approaches: a supervised classification using simple dense neural networks, and an unsupervised detection of phase transitions using convolutional autoencoders trained solely on idealized spin configurations. To enhance model performance, we incorporate two key forms of physics-informed guidance. First, we exploit architectural biases which preferentially amplify features related to symmetry breaking. Second, we include training configurations that explicitly break \mathbb{Z}_2 symmetry, reinforcing the network’s ability to detect ordered phases. These mechanisms, acting in tandem, increase the network’s sensitivity to phase structure even in the absence of explicit labels. We validate the machine learning predictions through comparison with direct numerical estimates of critical temperatures and percolation thresholds. Our results show that synthetic, structured, and computationally efficient training schemes can reveal physically meaningful phase boundaries, even in complex systems. This framework offers a low-cost and robust alternative to conventional methods, with potential applications in broader condensed matter and statistical physics contexts.

[AI-83] LanTu: Dynamics-Enhanced Deep Learning for Eddy-Resolving Ocean Forecasting

【Quick Read】: This paper addresses the scientific challenges and high computational cost that traditional numerical models face in mesoscale-eddy-resolving ocean forecasting, as well as the limited predictive skill of AI models in complex multiscale ocean dynamical systems. The key to the solution is LanTu, a regional eddy-resolving ocean forecasting system based on dynamics-enhanced deep learning, which incorporates cross-scale interactions and constructs multiscale physical constraints to improve the prediction of mesoscale eddy evolution.

Link: https://arxiv.org/abs/2505.10191
Authors: Qingyu Zheng,Qi Shao,Guijun Han,Wei Li,Hong Li,Xuan Wang
Institutions: Unknown
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
Comments: 22 pages, 6 figures

Click to view abstract

Abstract:Mesoscale eddies dominate the spatiotemporal multiscale variability of the ocean, and their impact on the energy cascade of the global ocean cannot be ignored. Eddy-resolving ocean forecasting is providing more reliable protection for fisheries and navigational safety, but also presents significant scientific challenges and high computational costs for traditional numerical models. Artificial intelligence (AI)-based weather and ocean forecasting systems are becoming powerful tools that balance forecast performance with computational efficiency. However, the complex multiscale features in the ocean dynamical system make AI models still face many challenges in mesoscale eddy forecasting (especially regional modelling). Here, we develop LanTu, a regional eddy-resolving ocean forecasting system based on dynamics-enhanced deep learning. We incorporate cross-scale interactions into LanTu and construct multiscale physical constraint for optimising LanTu guided by knowledge of eddy dynamics in order to improve the forecasting skill of LanTu for mesoscale evolution. The results show that LanTu outperforms the existing advanced operational numerical ocean forecasting system (NOFS) and AI-based ocean forecasting system (AI-OFS) in temperature, salinity, sea level anomaly and current prediction, with a lead time of more than 10 days. Our study highlights that dynamics-enhanced deep learning (LanTu) can be a powerful paradigm for eddy-resolving ocean forecasting.

[AI-84] Large Wireless Localization Model (LWLM): A Foundation Model for Positioning in 6G Networks

【Quick Read】: This paper addresses the reliance of data-driven wireless localization methods on large amounts of labeled data and their limited generalization across deployment scenarios and wireless configurations. The key to the solution is a foundation-model-based wireless localization framework: guided by information bottleneck (IB) theory, a pretraining method built on self-supervised learning (SSL) tasks jointly optimizes three complementary objectives: spatial-frequency masked channel modeling, domain-transformation invariance, and position-invariant contrastive learning, capturing the latent semantics of the wireless channel from multiple perspectives. Lightweight decoders are then designed to adapt the model to downstream tasks; experiments show that the resulting model outperforms conventional models across a range of localization tasks and exhibits strong generalization.

Link: https://arxiv.org/abs/2505.10134
Authors: Guangjin Pan,Kaixuan Huang,Hui Chen,Shunqing Zhang,Christian Häger,Henk Wymeersch
Institutions: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 16 figures. This work has been submitted to the IEEE for possible publication

Abstract:Accurate and robust localization is a critical enabler for emerging 5G and 6G applications, including autonomous driving, extended reality (XR), and smart manufacturing. While data-driven approaches have shown promise, most existing models require large amounts of labeled data and struggle to generalize across deployment scenarios and wireless configurations. To address these limitations, we propose a foundation-model-based solution tailored for wireless localization. We first analyze how different self-supervised learning (SSL) tasks acquire general-purpose and task-specific semantic features based on information bottleneck (IB) theory. Building on this foundation, we design a pretraining methodology for the proposed Large Wireless Localization Model (LWLM). Specifically, we propose an SSL framework that jointly optimizes three complementary objectives: (i) spatial-frequency masked channel modeling (SF-MCM), (ii) domain-transformation invariance (DTI), and (iii) position-invariant contrastive learning (PICL). These objectives jointly capture the underlying semantics of wireless channel from multiple perspectives. We further design lightweight decoders for key downstream tasks, including time-of-arrival (ToA) estimation, angle-of-arrival (AoA) estimation, single base station (BS) localization, and multiple BS localization. Comprehensive experimental results confirm that LWLM consistently surpasses both model-based and supervised learning baselines across all localization tasks. In particular, LWLM achieves 26.0%–87.5% improvement over transformer models without pretraining, and exhibits strong generalization under label-limited fine-tuning and unseen BS configurations, confirming its potential as a foundation model for wireless localization.
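Of the three pretraining objectives, position-invariant contrastive learning is the most standard, and the generic InfoNCE loss such objectives build on is easy to sketch. The encoder outputs, batch size, and temperature below are illustrative assumptions, not details taken from the paper.

```python
# Generic InfoNCE contrastive loss: two views of the same channel snapshot are
# positives (the diagonal of the similarity matrix); everything else is negative.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, d) embeddings of two augmented views of N channel snapshots."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature       # (N, N) cosine similarities
    labels = torch.arange(z1.size(0))        # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)  # stand-in encoder outputs
loss = info_nce(z1, z2)
```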

[AI-85] Quantum Computing and AI: Perspectives on Advanced Automation in Science and Engineering

【Quick Read】: This paper examines how to combine Artificial Intelligence (AI) with quantum computing to automate scientific and engineering processes, the core challenge being the effective integration of quantum algorithms with established engineering design practices. The key to the solution is the proposed Quantum CAE framework, which leverages quantum algorithms for simulation, optimization, and machine learning to improve the efficiency and precision of engineering design, with case studies on combinatorial optimization problems demonstrating its feasibility. The paper also emphasizes the critical role that specialized AI agents proficient in quantum algorithm design play in reaching higher levels of automation.

Link: https://arxiv.org/abs/2505.10012
Authors: Tadashi Kadowaki
Institutions: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures

Click to view abstract

Abstract:Recent advances in artificial intelligence (AI) and quantum computing are accelerating automation in scientific and engineering processes, fundamentally reshaping research methodologies. This perspective highlights parallels between scientific automation and established Computer-Aided Engineering (CAE) practices, introducing Quantum CAE as a framework that leverages quantum algorithms for simulation, optimization, and machine learning within engineering design. Practical implementations of Quantum CAE are illustrated through case studies for combinatorial optimization problems. Further discussions include advancements toward higher automation levels, highlighting the critical role of specialized AI agents proficient in quantum algorithm design. The integration of quantum computing with AI raises significant questions about the collaborative dynamics among human scientists and engineers, AI systems, and quantum computational resources, underscoring a transformative future for automated discovery and innovation.

[AI-86] Contextual Phenotyping of Pediatric Sepsis Cohort Using Large Language Models

【Quick Read】: This paper addresses the problem of effectively clustering patient subgroups within high-dimensional, heterogeneous healthcare data to support personalized care and resource optimization. The key to the solution is clustering on embedding representations produced by generative AI models, which capture contextual information and highlight key features better than traditional methods, improving cluster quality and subgroup distinctiveness.

Link: https://arxiv.org/abs/2505.09805
Authors: Aditya Nagori,Ayush Gautam,Matthew O. Wiens,Vuong Nguyen,Nathan Kenya Mugisha,Jerome Kabakyenga,Niranjan Kissoon,John Mark Ansermino,Rishikesan Kamaleswaran
Institutions: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 11 pages, 2 Figures, 1 Table

Click to view abstract

Abstract:Clustering patient subgroups is essential for personalized care and efficient resource use. Traditional clustering methods struggle with high-dimensional, heterogeneous healthcare data and lack contextual understanding. This study evaluates Large Language Model (LLM) based clustering against classical methods using a pediatric sepsis dataset from a low-income country (LIC), containing 2,686 records with 28 numerical and 119 categorical variables. Patient records were serialized into text with and without a clustering objective. Embeddings were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation(LoRA), and Stella-En-400M-V5 models. K-means clustering was applied to these embeddings. Classical comparisons included K-Medoids clustering on UMAP and FAMD-reduced mixed data. Silhouette scores and statistical tests evaluated cluster quality and distinctiveness. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B with the clustering objective performed better with higher number of clusters, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed classical techniques by capturing richer context and prioritizing key features. These results highlight potential of LLMs for contextual phenotyping and informed decision-making in resource-limited settings.
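The pipeline in the abstract reduces to a few steps that are easy to sketch: serialize each record to text, embed it, cluster the embeddings with K-means, and score the result with the Silhouette coefficient. In the sketch below, embed_texts() is a hypothetical placeholder for whichever embedding model is used (the study compares LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B, and Stella-En-400M-V5), and the toy records are invented.

```python
# Serialize -> embed -> cluster -> score, with a placeholder embedder.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def serialize(record: dict) -> str:
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def embed_texts(texts):                        # placeholder embedder (assumption)
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))  # stand-in for model embeddings

records = [{"age_months": 14, "weight_kg": 8.1, "lactate": 3.2},
           {"age_months": 30, "weight_kg": 12.5, "lactate": 1.1}] * 50
X = embed_texts([serialize(r) for r in records])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))
```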

[AI-87] Virtual Dosimetrists: A Radiotherapy Training “Flight Simulator”

【Quick Read】: This paper addresses two gaps in radiotherapy plan quality review education: the lack of a rich, regularly updated set of examples, and the inability to demonstrate multiple possible planning approaches and their consequences. The key to the solution is the development of "Virtual Dosimetrist" models that can generate training examples of suboptimal treatment plans and then let trainees improve plan quality through simple natural language prompts, as if communicating with a dosimetrist. This is the first work to combine dose distribution prediction with natural language processing, addressing the limitations of the current clinic-based paradigm.

Link: https://arxiv.org/abs/2505.09796
Authors: Skylar S. Gay,Tucker Netherton,Barbara Marquez,Raymond Mumme,Mary Gronberg,Brent Parker,Chelsea Pinnix,Sanjay Shete,Carlos Cardenas,Laurence Court
Institutions: Unknown
Subjects: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Effective education in radiotherapy plan quality review requires a robust, regularly updated set of examples and the flexibility to demonstrate multiple possible planning approaches and their consequences. However, the current clinic-based paradigm does not support these needs. To address this, we have developed ‘Virtual Dosimetrist’ models that can both generate training examples of suboptimal treatment plans and then allow trainees to improve the plan quality through simple natural language prompts, as if communicating with a dosimetrist. The dose generation and modification process is accurate, rapid, and requires only modest resources. This work is the first to combine dose distribution prediction with natural language processing; providing a robust pipeline for both generating suboptimal training plans and allowing trainees to practice their critical plan review and improvement skills that addresses the challenges of the current clinic-based paradigm.

[AI-88] Differentiable Quantum Architecture Search in Quantum-Enhanced Neural Network Parameter Generation

【Quick Read】: This paper addresses the practical limitations that quantum neural networks (QNNs) face from depending on quantum hardware at inference time, including hardware imperfections and restricted access to quantum devices. The key to the solution is an automated method based on differentiable optimization that jointly optimizes conventional circuit parameters and architectural parameters end to end, generating parameters for classical neural networks so that inference requires no quantum hardware while improving the efficiency and performance of QNN design.

Link: https://arxiv.org/abs/2505.09653
Authors: Samuel Yen-Chi Chen,Chen-Yu Liu,Kuan-Cheng Chen,Wei-Jia Huang,Yen-Jui Chang,Wei-Hao Huang
Institutions: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:The rapid advancements in quantum computing (QC) and machine learning (ML) have led to the emergence of quantum machine learning (QML), which integrates the strengths of both fields. Among QML approaches, variational quantum circuits (VQCs), also known as quantum neural networks (QNNs), have shown promise both empirically and theoretically. However, their broader adoption is hindered by reliance on quantum hardware during inference. Hardware imperfections and limited access to quantum devices pose practical challenges. To address this, the Quantum-Train (QT) framework leverages the exponential scaling of quantum amplitudes to generate classical neural network parameters, enabling inference without quantum hardware and achieving significant parameter compression. Yet, designing effective quantum circuit architectures for such quantum-enhanced neural programmers remains non-trivial and often requires expertise in quantum information science. In this paper, we propose an automated solution using differentiable optimization. Our method jointly optimizes both conventional circuit parameters and architectural parameters in an end-to-end manner via automatic differentiation. We evaluate the proposed framework on classification, time-series prediction, and reinforcement learning tasks. Simulation results show that our method matches or outperforms manually designed QNN architectures. This work offers a scalable and automated pathway for designing QNNs that can generate classical neural network parameters across diverse applications.

[AI-89] Temporal Interception and Present Reconstruction: A Cognitive-Signal Model for Human and AI Decision Making

【Quick Read】: This paper asks how the human mind and artificial intelligence can approach real-time awareness by reducing perceptual delays. The key to its solution is a theoretical model that treats the present moment not as a linear timestamp but as an interference zone where early-arriving cosmic signals interact with human reaction delays. The model combines physical and cognitive analyses, proposes experimental approaches for validation, and aims to guide AI systems toward temporally efficient, ethically sound, and internally conscious decision-making processes.

Link: https://arxiv.org/abs/2505.09646
Authors: Carmel Mary Esther A
Institutions: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); History and Philosophy of Physics (physics.hist-ph)
Comments: 8 pages, 3 figures

Click to view abstract

Abstract:This paper proposes a novel theoretical model to explain how the human mind and artificial intelligence can approach real-time awareness by reducing perceptual delays. By investigating cosmic signal delay, neurological reaction times, and the ancient cognitive state of stillness, we explore how one may shift from reactive perception to a conscious interface with the near future. This paper introduces both a physical and cognitive model for perceiving the present not as a linear timestamp, but as an interference zone where early-arriving cosmic signals and reactive human delays intersect. We propose experimental approaches to test these ideas using human neural observation and neuro-receptive extensions. Finally, we propose a mathematical framework to guide the evolution of AI systems toward temporally efficient, ethically sound, and internally conscious decision-making processes

[AI-90] Neurophysiologically Realistic Environment for Comparing Adaptive Deep Brain Stimulation Algorithms in Parkinson Disease KDD

【Quick Read】: This paper addresses the data-collection constraints that adaptive deep brain stimulation (aDBS) faces when optimizing control strategies: conventional approaches rely on an invasive device, making it hard to gather enough data for offline optimization. The key to the solution is the first neurophysiologically realistic benchmark for comparing such models. Beyond basal ganglia circuit dynamics and pathological oscillations, it captures 15 previously dismissed physiological attributes, such as signal instabilities, noise, neural drift, electrode conductance changes, and individual variability, modeled as spatially distributed and temporally registered features via beta-band activity and a feedback mechanism. The framework is purpose-built as a structured environment for training and evaluating deep reinforcement learning (RL) algorithms, opening new possibilities for optimizing aDBS control strategies.

Link: https://arxiv.org/abs/2505.09624
Authors: Ekaterina Kuzmina,Dmitrii Kriukov,Mikhail Lebedev,Dmitry V. Dylov
Institutions: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, submission to KDD

Click to view abstract

Abstract:Adaptive deep brain stimulation (aDBS) has emerged as a promising treatment for Parkinson disease (PD). In aDBS, a surgically placed electrode sends dynamically altered stimuli to the brain based on neurophysiological feedback: an invasive gadget that limits the amount of data one could collect for optimizing the control offline. As a consequence, a plethora of synthetic models of PD and those of the control algorithms have been proposed. Herein, we introduce the first neurophysiologically realistic benchmark for comparing said models. Specifically, our methodology covers not only conventional basal ganglia circuit dynamics and pathological oscillations, but also captures 15 previously dismissed physiological attributes, such as signal instabilities and noise, neural drift, electrode conductance changes and individual variability - all modeled as spatially distributed and temporally registered features via beta-band activity in the brain and a feedback. Furthermore, we purposely built our framework as a structured environment for training and evaluating deep reinforcement learning (RL) algorithms, opening new possibilities for optimizing aDBS control strategies and inviting the machine learning community to contribute to the emerging field of intelligent neurostimulation interfaces.

[AI-91] Predictive Models for Chronic Heart Failure

【Quick Read】: This paper addresses risk stratification and early identification of patients with chronic heart failure (HF) to support personalized treatment and continuous monitoring. The key to the solution is a machine learning (ML) ensemble built on a modified stacking technique: two specialized models, leveraging clinical and echocardiographic features respectively, feed their predictions into a meta-model, improving the identification of high-risk patients. Experiments on a real-world dataset show that the model achieves high sensitivity (95%) and acceptable accuracy (84%), making it an effective aid for medical decision-making and early intervention.

Link: https://arxiv.org/abs/2505.09619
Authors: Pietro Cassieri,Aiman Faiz,Anna Maria De Roberto,Claudio Pascarelli,Gianvito Mitrano,Gianluca Fimiani,Marina Garofano,Christiancarmine Esposito,Genoveffa Tortora,Alessia Bramanti,Giuseppe Scanniello
Institutions: Unknown
Subjects: Other Statistics (stat.OT); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The management of chronic Heart Failure (HF) presents significant challenges in modern healthcare, requiring continuous monitoring, early detection of exacerbations, and personalized treatment strategies. In this paper, we present a predictive model founded on Machine Learning (ML) techniques to identify patients at HF risk. This model is an ensemble learning approach, a modified stacking technique, that uses two specialized models leveraging clinical and echocardiographic features and then a meta-model to combine the predictions of these two models. We initially assess the model on a real dataset and the obtained results suggest that it performs well in the stratification of patients at HF risk. Specifically, we obtained high sensitivity (95%), ensuring that nearly all high-risk patients are identified. As for accuracy, we obtained 84%, which can be considered moderate in some ML contexts. However, it is acceptable given our priority of identifying patients at risk of HF because they will be asked to participate in the telemonitoring program of the PrediHealth research project on which some of the authors of this paper are working. The initial findings also suggest that ML-based risk stratification models can serve as valuable decision-support tools not only in the PrediHealth project but also for healthcare professionals, aiding in early intervention and personalized patient management. To have a better understanding of the value and of potentiality of our predictive model, we also contrasted its results with those obtained by using three baseline models. The preliminary results indicate that our predictive model outperforms these baselines that flatly consider features, i.e., not grouping them in clinical and echocardiographic features.
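The modified stacking design described above can be sketched with off-the-shelf components: two experts trained on disjoint feature groups and a logistic-regression meta-model over their predicted probabilities. The feature splits, estimators, and synthetic data below are illustrative assumptions; a faithful implementation would also train the meta-model on out-of-fold predictions.

```python
# Two-expert stacking: clinical expert + echocardiographic expert + meta-model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_clin, X_echo = rng.normal(size=(500, 10)), rng.normal(size=(500, 6))
y = rng.integers(0, 2, size=500)

clin = RandomForestClassifier(random_state=0).fit(X_clin, y)
echo = RandomForestClassifier(random_state=0).fit(X_echo, y)

# Meta-features: each expert's predicted probability of the positive class.
Z = np.column_stack([clin.predict_proba(X_clin)[:, 1],
                     echo.predict_proba(X_echo)[:, 1]])
meta = LogisticRegression().fit(Z, y)   # in practice, use out-of-fold predictions
print(meta.predict_proba(Z)[:5, 1])     # ensemble risk scores for 5 patients
```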

Machine Learning

[LG-0] An AI-driven framework for the prediction of personalised health response to air pollution

Link: https://arxiv.org/abs/2505.10556
Authors: Nazanin Zounemat Kermani,Sadjad Naderi,Claire H. Dilliway,Claire E. Heaney,Shrreya Behll,Boyang Chen,Hisham Abubakar-Waziri,Alexandra E. Porter,Marc Chadeau-Hyam,Fangxin Fang,Ian M. Adcock,Kian Fan Chung,Christopher C. Pain
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: Kermani and Naderi share first authorship. 20 pages, 6 figures and 1 table

Click to view abstract

Abstract:Air pollution poses a significant threat to public health, causing or exacerbating many respiratory and cardiovascular diseases. In addition, climate change is bringing about more extreme weather events such as wildfires and heatwaves, which can increase levels of pollution and worsen the effects of pollution exposure. Recent advances in personal sensing have transformed the collection of behavioural and physiological data, leading to the potential for new improvements in healthcare. We wish to capitalise on this data, alongside new capabilities in AI for making time series predictions, in order to monitor and predict health outcomes for an individual. Thus, we present a novel workflow for predicting personalised health responses to pollution by integrating physiological data from wearable fitness devices with real-time environmental exposures. The data is collected from various sources in a secure and ethical manner, and is used to train an AI model to predict individual health responses to pollution exposure within a cloud-based, modular framework. We demonstrate that the AI model – an Adversarial Autoencoder neural network in this case – accurately reconstructs time-dependent health signals and captures nonlinear responses to pollution. Transfer learning is applied using data from a personal smartwatch, which increases the generalisation abilities of the AI model and illustrates the adaptability of the approach to real-world, user-generated data.

[LG-1] Pharmacophore-Conditioned Diffusion Model for Ligand-Based De Novo Drug Design

Link: https://arxiv.org/abs/2505.10545
Authors: Amira Alakhdar,Barnabas Poczos,Newell Washburn
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Developing bioactive molecules remains a central, time- and cost-heavy challenge in drug discovery, particularly for novel targets lacking structural or functional data. Pharmacophore modeling presents an alternative for capturing the key features required for molecular bioactivity against a biological target. In this work, we present PharmaDiff, a pharmacophore-conditioned diffusion model for 3D molecular generation. PharmaDiff employs a transformer-based architecture to integrate an atom-based representation of the 3D pharmacophore into the generative process, enabling the precise generation of 3D molecular graphs that align with predefined pharmacophore hypotheses. Through comprehensive testing, PharmaDiff demonstrates superior performance in matching 3D pharmacophore constraints compared to ligand-based drug design methods. Additionally, it achieves higher docking scores across a range of proteins in structure-based drug design, without the need for target protein structures. By integrating pharmacophore modeling with 3D generative techniques, PharmaDiff offers a powerful and flexible framework for rational drug design.

[LG-2] Learning Nonlinear Dynamics in Physical Modelling Synthesis using Neural Ordinary Differential Equations

Link: https://arxiv.org/abs/2505.10511
Authors: Victor Zheleznov,Stefan Bilbao,Alec Wright,Simon King
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Computational Physics (physics.comp-ph)
Comments: Accepted for publication in Proceedings of the 28th International Conference on Digital Audio Effects (DAFx25), Ancona, Italy, September 2025

Click to view abstract

Abstract:Modal synthesis methods are a long-standing approach for modelling distributed musical systems. In some cases extensions are possible in order to handle geometric nonlinearities. One such case is the high-amplitude vibration of a string, where geometric nonlinear effects lead to perceptually important effects including pitch glides and a dependence of brightness on striking amplitude. A modal decomposition leads to a coupled nonlinear system of ordinary differential equations. Recent work in applied machine learning approaches (in particular neural ordinary differential equations) has been used to model lumped dynamic systems such as electronic circuits automatically from data. In this work, we examine how modal decomposition can be combined with neural ordinary differential equations for modelling distributed musical systems. The proposed model leverages the analytical solution for linear vibration of system’s modes and employs a neural network to account for nonlinear dynamic behaviour. Physical parameters of a system remain easily accessible after the training without the need for a parameter encoder in the network architecture. As an initial proof of concept, we generate synthetic data for a nonlinear transverse string and show that the model can be trained to reproduce the nonlinear dynamics of the system. Sound examples are presented.
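The modelling idea, an analytic linear modal core plus a learned nonlinear coupling term, can be sketched for a single mode as follows. Here g() is a fixed cubic stand-in for the neural network the paper would train, and the frequency, damping, and amplitude values are illustrative.

```python
# One string mode as an ODE: analytic linear part + learned nonlinear term,
# integrated with a classical RK4 step at audio rate.
import numpy as np

omega, sigma = 2 * np.pi * 220.0, 3.0     # modal frequency (rad/s) and damping

def g(state):                             # stand-in for the trained neural network
    q, _ = state
    return -1e6 * q ** 3                  # cubic (Duffing-style) restoring force

def f(state):                             # q' = p, p' = linear terms + learned term
    q, p = state
    return np.array([p, -omega ** 2 * q - 2 * sigma * p + g(state)])

def rk4_step(state, dt):
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

state, dt = np.array([1e-3, 0.0]), 1.0 / 44100   # initial pluck, audio sample period
for _ in range(1000):
    state = rk4_step(state, dt)
print(state)
```

In a neural-ODE treatment, g() would be a small network trained by backpropagating through the integrator, while the linear modal parameters stay physically interpretable, which is the accessibility property the abstract highlights.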

[LG-3] Fixing Incomplete Value Function Decomposition for Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2505.10484
Authors: Andrea Baisero,Rupali Bhati,Shuo Liu,Aathira Pillai,Christopher Amato
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Value function decomposition methods for cooperative multi-agent reinforcement learning compose joint values from individual per-agent utilities, and train them using a joint objective. To ensure that the action selection process between individual utilities and joint values remains consistent, it is imperative for the composition to satisfy the individual-global max (IGM) property. Although satisfying IGM itself is straightforward, most existing methods (e.g., VDN, QMIX) have limited representation capabilities and are unable to represent the full class of IGM values, and the one exception that has no such limitation (QPLEX) is unnecessarily complex. In this work, we present a simple formulation of the full class of IGM values that naturally leads to the derivation of QFIX, a novel family of value function decomposition models that expand the representation capabilities of prior models by means of a thin “fixing” layer. We derive multiple variants of QFIX, and implement three variants in two well-known multi-agent frameworks. We perform an empirical evaluation on multiple SMACv2 and Overcooked environments, which confirms that QFIX (i) succeeds in enhancing the performance of prior methods, (ii) learns more stably and performs better than its main competitor QPLEX, and (iii) achieves this while employing the simplest and smallest mixing models.
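The IGM property at the heart of this line of work is easy to state in code: the joint action obtained by letting each agent greedily maximize its own utility must also maximize the mixed joint value. The sketch below verifies this for a VDN-style additive mixer; QFIX's contribution is a thin "fixing" layer that enlarges the class of representable joint values while preserving exactly this property.

```python
# Checking the individual-global-max (IGM) property for additive (VDN-style) mixing.
import itertools
import numpy as np

rng = np.random.default_rng(0)
utilities = [rng.normal(size=4) for _ in range(3)]   # 3 agents, 4 actions each

def q_joint(actions):                                # additive mixing of utilities
    return sum(u[a] for u, a in zip(utilities, actions))

greedy = tuple(int(np.argmax(u)) for u in utilities) # decentralized per-agent argmax
best = max(itertools.product(range(4), repeat=3), key=q_joint)
assert q_joint(greedy) == q_joint(best)              # IGM holds for monotone mixers
print(greedy)
```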

[LG-4] Large Language Models for Cancer Communication: Evaluating Linguistic Quality, Safety, and Accessibility in Generative AI

Link: https://arxiv.org/abs/2505.10472
Authors: Agnik Saha,Victoria Churchill,Anny D. Rodriguez,Ugur Kursuncu,Muhammed Y. Idris
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Effective communication about breast and cervical cancers remains a persistent health challenge, with significant gaps in public understanding of cancer prevention, screening, and treatment, potentially leading to delayed diagnoses and inadequate treatments. This study evaluates the capabilities and limitations of Large Language Models (LLMs) in generating accurate, safe, and accessible cancer-related information to support patient understanding. We evaluated five general-purpose and three medical LLMs using a mixed-methods evaluation framework across linguistic quality, safety and trustworthiness, and communication accessibility and affectiveness. Our approach utilized quantitative metrics, qualitative expert ratings, and statistical analysis using Welch’s ANOVA, Games-Howell, and Hedges’ g. Our results show that general-purpose LLMs produced outputs of higher linguistic quality and affectiveness, while medical LLMs demonstrate greater communication accessibility. However, medical LLMs tend to exhibit higher levels of potential harm, toxicity, and bias, reducing their performance in safety and trustworthiness. Our findings indicate a duality between domain-specific knowledge and safety in health communications. The results highlight the need for intentional model design with targeted improvements, particularly in mitigating harm and bias, and improving safety and affectiveness. This study provides a comprehensive evaluation of LLMs for cancer communication, offering critical insights for improving AI-generated health content and informing future development of accurate, safe, and accessible digital health tools.

[LG-5] Identification and Optimal Nonlinear Control of Turbojet Engine Using Koopman Eigenfunction Model

Link: https://arxiv.org/abs/2505.10438
Authors: David Grasev
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 51 pages, 28 figures

Click to view abstract

Abstract:Gas turbine engines represent complex highly nonlinear dynamical systems. Deriving their physics-based models can be challenging as it requires performance characteristics, that are not always available, and one often has to make many simplifying assumptions. In this paper, the limitations of conventional experimental methods used to derive component-level and locally linear parameter-varying models are discussed and addressed by employing identification techniques based on data collected from standard engine operation under closed-loop control. The rotor dynamics were estimated using the sparse identification of nonlinear dynamics. Subsequently, the autonomous part of the dynamics was mapped into an optimally constructed Koopman eigenfunction space. The process included eigenvalue optimization using metaheuristic algorithms and temporal projection, followed by gradient-based eigenfunction identification. The resulting Koopman model was validated against an in-house reference component-level model. A globally optimal nonlinear feedback controller and a Kalman estimator were then designed in the eigenfunction space and compared to the classical and gain-scheduled proportional-integral controllers, as well as a proposed internal model control approach. The eigenmode structure allowed targeting individual modes during the optimization process, resulting in a better performance tuning. The results showed that the Koopman-based controller outperformed the other benchmark controllers in both reference tracking and disturbance rejection, under sea-level and varying flight conditions, due to its global nature.

[LG-6] Score-based diffusion nowcasting of GOES imagery

Link: https://arxiv.org/abs/2505.10432
Authors: Randy J. Chase,Katherine Haynes,Lander Ver Hoef,Imme Ebert-Uphoff
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Click to view abstract

Abstract:Clouds and precipitation are important for understanding weather and climate. Simulating clouds and precipitation with traditional numerical weather prediction is challenging because of the sub-grid parameterizations required. Machine learning has been explored for forecasting clouds and precipitation, but early machine learning methods often created blurry forecasts. In this paper we explore a newer method, named score-based diffusion, to nowcast (zero to three hour forecast) clouds and precipitation. We discuss the background and intuition of score-based diffusion models - thus providing a starting point for the community - while exploring the methodology’s use for nowcasting geostationary infrared imagery. We experiment with three main types of diffusion models: a standard score-based diffusion model (Diff); a residual correction diffusion model (CorrDiff); and a latent diffusion model (LDM). Our results show that the diffusion models are able to not only advect existing clouds, but also generate and decay clouds, including convective initiation. These results are surprising because the forecasts are initiated with only the past 20 mins of infrared satellite imagery. A case study qualitatively shows the preservation of high resolution features longer into the forecast than a conventional mean-squared error trained U-Net. The best of the three diffusion models tested was the CorrDiff approach, outperforming all other diffusion models, the traditional U-Net, and a persistence forecast by one to two kelvin on root mean squared error. The diffusion models also enable out-of-the-box ensemble generation, which shows skillful calibration, with the spread of the ensemble correlating well to the error.
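For readers new to score-based diffusion, the core training loop is compact enough to sketch: corrupt a clean field with Gaussian noise at a random level and train a network to predict the noise, which is equivalent to learning the score up to scaling. The tiny MLP, noise-level range, and flattened 64-pixel "image" below are stand-ins for the U-Net and satellite fields a real nowcasting model would use.

```python
# Denoising score matching in its epsilon-prediction form.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(65, 128), nn.SiLU(), nn.Linear(128, 64))

def loss_fn(x0):                                   # x0: (B, 64) clean "imagery"
    sigma = torch.exp(torch.empty(x0.size(0), 1).uniform_(-3, 1))  # noise scale
    eps = torch.randn_like(x0)
    x_noisy = x0 + sigma * eps
    eps_hat = net(torch.cat([x_noisy, sigma.log()], dim=1))  # condition on sigma
    return ((eps_hat - eps) ** 2).mean()           # predict the injected noise

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(10):
    loss = loss_fn(torch.randn(32, 64))            # stand-in training batch
    opt.zero_grad(); loss.backward(); opt.step()
```

Sampling then runs the learned score backwards from noise, and repeating it with different noise seeds yields the out-of-the-box ensembles the abstract mentions.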

[LG-7] Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs

Link: https://arxiv.org/abs/2505.10425
Authors: Jingyao Wang,Wenwen Qiang,Zeen Song,Changwen Zheng,Hui Xiong
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) excel at complex tasks thanks to advances in reasoning abilities. However, existing methods overlook the trade-off between reasoning effectiveness and computational efficiency, often encouraging unnecessarily long reasoning chains and wasting tokens. To address this, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework for LLMs to make the models achieve optimal reasoning with fewer tokens. Specifically, L2T treats each query-response interaction as a hierarchical session of multiple episodes and proposes a universal dense process reward, i.e., quantifies the episode-wise information gain in parameters, requiring no extra annotations or task-specific evaluators. We propose a method to quickly estimate this reward based on PAC-Bayes bounds and the Fisher information matrix. Theoretical analyses show that it significantly reduces computational complexity with high estimation accuracy. By immediately rewarding each episode’s contribution and penalizing excessive updates, L2T optimizes the model via reinforcement learning to maximize the use of each episode and achieve effective updates. Empirical results on various reasoning benchmarks and base models demonstrate the advantage of L2T across different tasks, boosting both reasoning effectiveness and efficiency.

[LG-8] The Power of Random Features and the Limits of Distribution-Free Gradient Descent

Link: https://arxiv.org/abs/2505.10423
Authors: Ari Karchmer,Eran Malach
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We study the relationship between gradient-based optimization of parametric models (e.g., neural networks) and optimization of linear combinations of random features. Our main result shows that if a parametric model can be learned using mini-batch stochastic gradient descent (bSGD) without making assumptions about the data distribution, then with high probability, the target function can also be approximated using a polynomial-sized combination of random features. The size of this combination depends on the number of gradient steps and numerical precision used in the bSGD process. This finding reveals fundamental limitations of distribution-free learning in neural networks trained by gradient descent, highlighting why making assumptions about data distributions is often crucial in practice. Along the way, we also introduce a new theoretical framework called average probabilistic dimension complexity (adc), which extends the probabilistic dimension complexity developed by Kamath et al. (2020). We prove that adc has a polynomial relationship with statistical query dimension, and use this relationship to demonstrate an infinite separation between adc and standard dimension complexity.
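The construction at the heart of the result, approximating a target with a polynomial-sized linear combination of random features, is the classical random-features recipe sketched below: a fixed random projection followed by a trained linear readout. The cosine (random Fourier) features and the toy target are one standard instantiation, not the paper's specific construction.

```python
# Random Fourier features + linear readout: only the last layer is learned.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1])              # toy target function

D = 500                                            # number of random features
W, b = rng.normal(0, 1, size=(2, D)), rng.uniform(0, 2 * np.pi, D)
Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)         # fixed random feature map

lin = Ridge(alpha=1e-3).fit(Phi, y)                # trained linear combination
print("train MSE:", np.mean((lin.predict(Phi) - y) ** 2))
```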

[LG-9] Decomposed Inductive Procedure Learning: Learning Academic Tasks with Human-Like Data Efficiency

Link: https://arxiv.org/abs/2505.10422
Authors: Daniel Weitekamp,Christopher MacLellan,Erik Harpstead,Kenneth Koedinger
Subjects: Machine Learning (cs.LG)
Comments: To appear in CogSci 2025

Click to view abstract

Abstract:Human learning relies on specialization – distinct cognitive mechanisms working together to enable rapid learning. In contrast, most modern neural networks rely on a single mechanism: gradient descent over an objective function. This raises the question: might human learners’ relatively rapid learning from just tens of examples instead of tens of thousands in data-driven deep learning arise from our ability to use multiple specialized mechanisms of learning in combination? We investigate this question through an ablation analysis of inductive human learning simulations in online tutoring environments. Comparing reinforcement learning to a more data-efficient 3-mechanism symbolic rule induction approach, we find that decomposing learning into multiple distinct mechanisms significantly improves data efficiency, bringing it in line with human learning. Furthermore, we show that this decomposition has a greater impact on efficiency than the distinction between symbolic and subsymbolic learning alone. Efforts to align data-driven machine learning with human learning often overlook the stark difference in learning efficiency. Our findings suggest that integrating multiple specialized learning mechanisms may be key to bridging this gap.

[LG-10] Two-Stage Generative Model for Intracranial Aneurysm Meshes with Morphological Marker Conditioning

Link: https://arxiv.org/abs/2505.10407
Authors: Wenhao Ding,Choon Hwai Yap,Kangjun Ji,Simão Castro
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 2 figures

Click to view abstract

Abstract:A generative model for the mesh geometry of intracranial aneurysms (IA) is crucial for training networks to predict blood flow forces in real time, which is a key factor affecting disease progression. This need is necessitated by the absence of a large IA image datasets. Existing shape generation methods struggle to capture realistic IA features and ignore the relationship between IA pouches and parent vessels, limiting physiological realism and their generation cannot be controlled to have specific morphological measurements. We propose AneuG, a two-stage Variational Autoencoder (VAE)-based IA mesh generator. In the first stage, AneuG generates low-dimensional Graph Harmonic Deformation (GHD) tokens to encode and reconstruct aneurysm pouch shapes, constrained to morphing energy statistics truths. GHD enables more accurate shape encoding than alternatives. In the second stage, AneuG generates parent vessels conditioned on GHD tokens, by generating vascular centreline and propagating the cross-section. AneuG’s IA shape generation can further be conditioned to have specific clinically relevant morphological measurements. This is useful for studies to understand shape variations represented by clinical measurements, and for flow simulation studies to understand effects of specific clinical shape parameters on fluid dynamics. Source code and implementation details are available at this https URL.

[LG-11] AutoCam: Hierarchical Path Planning for an Autonomous Auxiliary Camera in Surgical Robotics

Link: https://arxiv.org/abs/2505.10398
Authors: Alexandre Banks,Randy Moore,Sayem Nazmuz Zaman,Alaa Eldin Abdelaal,Septimiu E. Salcudean
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments: 13 pages, 9 figures

Click to view abstract

Abstract:Incorporating an autonomous auxiliary camera into robot-assisted minimally invasive surgery (RAMIS) enhances spatial awareness and eliminates manual viewpoint control. Existing path planning methods for auxiliary cameras track two-dimensional surgical features but do not simultaneously account for camera orientation, workspace constraints, and robot joint limits. This study presents AutoCam: an automatic auxiliary camera placement method to improve visualization in RAMIS. Implemented on the da Vinci Research Kit, the system uses a priority-based, workspace-constrained control algorithm that combines heuristic geometric placement with nonlinear optimization to ensure robust camera tracking. A user study (N=6) demonstrated that the system maintained 99.84% visibility of a salient feature and achieved a pose error of 4.36 \pm 2.11 degrees and 1.95 \pm 5.66 mm. The controller was computationally efficient, with a loop time of 6.8 \pm 12.8 ms. An additional pilot study (N=6), where novices completed a Fundamentals of Laparoscopic Surgery training task, suggests that users can teleoperate just as effectively from AutoCam’s viewpoint as from the endoscope’s while still benefiting from AutoCam’s improved visual coverage of the scene. These results indicate that an auxiliary camera can be autonomously controlled using the da Vinci patient-side manipulators to track a salient feature, laying the groundwork for new multi-camera visualization methods in RAMIS.

[LG-12] A Hybrid Strategy for Aggregated Probabilistic Forecasting and Energy Trading in HEFTCom2024

Link: https://arxiv.org/abs/2505.10367
Authors: Chuanqing Pu,Feilong Fan,Nengling Tai,Songyuan Liu,Jinming Yu
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: Solution description of IEEE Hybrid Energy Forecasting and Trading Competition (HEFTCom)

Click to view abstract

Abstract:Obtaining accurate probabilistic energy forecasts and making effective decisions amid diverse uncertainties are routine challenges in future energy systems. This paper presents the solution of team GEB, which ranked 3rd in trading, 4th in forecasting, and 1st among student teams in the IEEE Hybrid Energy Forecasting and Trading Competition 2024 (HEFTCom2024). The solution provides accurate probabilistic forecasts for a wind-solar hybrid system, and achieves substantial trading revenue in the day-ahead electricity market. Key components include: (1) a stacking-based approach combining sister forecasts from various Numerical Weather Predictions (NWPs) to provide wind power forecasts, (2) an online solar post-processing model to address the distribution shift in the online test set caused by increased solar capacity, (3) a probabilistic aggregation method for accurate quantile forecasts of hybrid generation, and (4) a stochastic trading strategy to maximize expected trading revenue considering uncertainties in electricity prices. This paper also explores the potential of end-to-end learning to further enhance the trading revenue by adjusting the distribution of forecast errors. Detailed case studies are provided to validate the effectiveness of these proposed methods. Code for all mentioned methods is available for reproduction and further research in both industry and academia.
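Two pieces of the solution are standard enough to sketch: the pinball loss that scores quantile forecasts, and the aggregation of wind and solar quantile curves into hybrid-generation quantiles. Summing quantile curves, as below, is exact only under perfect dependence between the two sources; the paper's probabilistic aggregation method is more careful, and all numbers here are invented.

```python
# Pinball (quantile) loss plus a naive quantile aggregation of wind and solar.
import numpy as np

def pinball(y_true, y_pred, q):
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

quantiles = np.arange(0.1, 1.0, 0.1)
wind_q  = np.array([200, 250, 290, 320, 350, 380, 410, 450, 500], float)  # MW
solar_q = np.array([ 10,  20,  30,  40,  50,  60,  70,  85, 100], float)  # MW
hybrid_q = wind_q + solar_q      # valid only under comonotonic wind and solar

y_actual = 430.0                 # realized hybrid generation (MW)
scores = [pinball(y_actual, hybrid_q[i], q) for i, q in enumerate(quantiles)]
print("mean pinball:", np.mean(scores))
```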

[LG-13] An Introduction to Discrete Variational Autoencoders

Link: https://arxiv.org/abs/2505.10344
Authors: Alan Jeffares,Liyuan Liu
Subjects: Machine Learning (cs.LG)
Comments: Tutorial paper

Click to view abstract

Abstract:Variational Autoencoders (VAEs) are well-established as a principled approach to probabilistic unsupervised learning with neural networks. Typically, an encoder network defines the parameters of a Gaussian distributed latent space from which we can sample and pass realizations to a decoder network. This model is trained to reconstruct its inputs and is optimized through the evidence lower bound. In recent years, discrete latent spaces have grown in popularity, suggesting that they may be a natural choice for many data modalities (e.g. text). In this tutorial, we provide a rigorous, yet practical, introduction to discrete variational autoencoders – specifically, VAEs in which the latent space is made up of latent variables that follow a categorical distribution. We assume only a basic mathematical background with which we carefully derive each step from first principles. From there, we develop a concrete training recipe and provide an example implementation, hosted at this https URL.
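As a concrete example of the kind of machinery such a tutorial covers, the sketch below samples a categorical latent with the Gumbel-softmax relaxation, one standard way to pass gradients through discrete latent variables; the tutorial may of course favor other estimators, and the sizes here are arbitrary.

```python
# Differentiable sampling of a categorical latent via the Gumbel-softmax trick.
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=0.5):
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)   # relaxed one-hot sample

logits = torch.randn(8, 16, requires_grad=True)   # batch of 8, 16 categories
z = gumbel_softmax_sample(logits)                 # differentiable w.r.t. logits
z.sum().backward()                                # gradients flow to the encoder
print(z.argmax(dim=-1))                           # hard category per sample
```

Lowering the temperature tau makes samples closer to one-hot at the cost of higher-variance gradients, which is exactly the trade-off such tutorials discuss.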

[LG-14] A Representation Learning Approach to Feature Drift Detection in Wireless Networks

Link: https://arxiv.org/abs/2505.10325
Authors: Athanasios Tziouvaras,Blaz Bertalanic,George Floros,Kostas Kolomvatsos,Panagiotis Sarigiannidis,Carolina Fortuna
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:AI is foreseen to be a centerpiece in next generation wireless networks enabling enabling ubiquitous communication as well as new services. However, in real deployment, feature distribution changes may degrade the performance of AI models and lead to undesired behaviors. To counter for undetected model degradation, we propose ALERT; a method that can detect feature distribution changes and trigger model re-training that works well on two wireless network use cases: wireless fingerprinting and link anomaly detection. ALERT includes three components: representation learning, statistical testing and utility assessment. We rely on MLP for designing the representation learning component, on Kolmogorov-Smirnov and Population Stability Index tests for designing the statistical testing and a new function for utility assessment. We show the superiority of the proposed method against ten standard drift detection methods available in the literature on two wireless network use cases.
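The statistical-testing component names two classical tools worth seeing in code: the two-sample Kolmogorov-Smirnov test and the Population Stability Index, here applied to a single learned feature. The alpha level and the PSI > 0.2 rule of thumb are common conventions, not thresholds taken from the paper.

```python
# Drift detection on one feature: KS test + Population Stability Index (PSI).
import numpy as np
from scipy.stats import ks_2samp

def psi(ref, cur, bins=10):
    edges = np.quantile(ref, np.linspace(0, 1, bins + 1))  # bins from reference
    p = np.histogram(ref, edges)[0] / len(ref) + 1e-6
    q = np.histogram(cur, edges)[0] / len(cur) + 1e-6
    return np.sum((p - q) * np.log(p / q))

rng = np.random.default_rng(0)
ref, cur = rng.normal(0, 1, 2000), rng.normal(0.5, 1.2, 2000)  # shifted feature

stat, pval = ks_2samp(ref, cur)
psi_val = psi(ref, cur)
print(f"KS p-value = {pval:.2e}, PSI = {psi_val:.3f}")
if pval < 0.01 or psi_val > 0.2:
    print("drift detected -> trigger model retraining")
```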

[LG-15] Asynchronous Decentralized SGD under Non-Convexity: A Block-Coordinate Descent Framework

Link: https://arxiv.org/abs/2505.10322
Authors: Yijie Zhou,Shi Pu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:Decentralized optimization has become vital for leveraging distributed data without central control, enhancing scalability and privacy. However, practical deployments face fundamental challenges due to heterogeneous computation speeds and unpredictable communication delays. This paper introduces a refined model of Asynchronous Decentralized Stochastic Gradient Descent (ADSGD) under practical assumptions of bounded computation and communication times. To understand the convergence of ADSGD, we first analyze Asynchronous Stochastic Block Coordinate Descent (ASBCD) as a tool, and then show that ADSGD converges under computation-delay-independent step sizes. The convergence result is established without assuming bounded data heterogeneity. Empirical experiments reveal that ADSGD outperforms existing methods in wall-clock convergence time across various scenarios. With its simplicity, efficiency in memory and communication, and resilience to communication and computation delays, ADSGD is well-suited for real-world decentralized learning tasks.

[LG-16] Deconstructing Subset Construction – Reducing While Determinizing

Link: https://arxiv.org/abs/2505.10319
Authors: John Nicol,Markus Frohme
Subjects: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments: 19 pages, 2 figures

Click to view abstract

Abstract:We present a novel perspective on the NFA canonization problem, which introduces intermediate minimization steps to reduce the exploration space on-the-fly. Essential to our approach are so-called equivalence registries which manage information about equivalent states and allow for incorporating further optimization techniques such as convexity closures or simulation to boost performance. Due to the generality of our approach, these concepts can be embedded in classic subset construction or Brzozowski’s approach. We evaluate our approach on a set of real-world examples from automatic sequences and observe that we are able to improve especially worst-case scenarios. We implement our approach in an open-source library for users to experiment with.

[LG-17] Negative Metric Learning for Graphs

Link: https://arxiv.org/abs/2505.10307
Authors: Yiyang Zhao,Chengpei Wu,Lilin Zhang,Ning Yang
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Graph contrastive learning (GCL) often suffers from false negatives, which degrades the performance on downstream tasks. The existing methods addressing the false negative issue usually rely on human prior knowledge, still leading GCL to suboptimal results. In this paper, we propose a novel Negative Metric Learning (NML) enhanced GCL (NML-GCL). NML-GCL employs a learnable Negative Metric Network (NMN) to build a negative metric space, in which false negatives can be distinguished better from true negatives based on their distance to anchor node. To overcome the lack of explicit supervision signals for NML, we propose a joint training scheme with bi-level optimization objective, which implicitly utilizes the self-supervision signals to iteratively optimize the encoder and the negative metric network. The solid theoretical analysis and the extensive experiments conducted on widely used benchmarks verify the superiority of the proposed method.

[LG-18] Optimizing Electric Bus Charging Scheduling with Uncertainties Using Hierarchical Deep Reinforcement Learning

Link: https://arxiv.org/abs/2505.10296
Authors: Jiaju Qi,Lei Lei,Thorsteinn Jonsson,Dusit Niyato
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The growing adoption of Electric Buses (EBs) represents a significant step toward sustainable development. By utilizing Internet of Things (IoT) systems, charging stations can autonomously determine charging schedules based on real-time data. However, optimizing EB charging schedules remains a critical challenge due to uncertainties in travel time, energy consumption, and fluctuating electricity prices. Moreover, to address real-world complexities, charging policies must make decisions efficiently across multiple time scales and remain scalable for large EB fleets. In this paper, we propose a Hierarchical Deep Reinforcement Learning (HDRL) approach that reformulates the original Markov Decision Process (MDP) into two augmented MDPs. To solve these MDPs and enable multi-timescale decision-making, we introduce a novel HDRL algorithm, namely Double Actor-Critic Multi-Agent Proximal Policy Optimization Enhancement (DAC-MAPPO-E). Scalability challenges of the Double Actor-Critic (DAC) algorithm for large-scale EB fleets are addressed through enhancements at both decision levels. At the high level, we redesign the decentralized actor network and integrate an attention mechanism to extract relevant global state information for each EB, decreasing the size of neural networks. At the low level, the Multi-Agent Proximal Policy Optimization (MAPPO) algorithm is incorporated into the DAC framework, enabling decentralized and coordinated charging power decisions, reducing computational complexity and enhancing convergence speed. Extensive experiments with real-world data demonstrate the superior performance and scalability of DAC-MAPPO-E in optimizing EB fleet charging schedules.

[LG-19] Spike-timing-dependent Hebbian learning as noisy gradient descent

Link: https://arxiv.org/abs/2505.10272
Authors: Niklas Dexheimer,Sascha Gaudlitz,Johannes Schmidt-Hieber
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Click to view abstract

Abstract:Hebbian learning is a key principle underlying learning in biological neural networks. It postulates that synaptic changes occur locally, depending on the activities of pre- and postsynaptic neurons. While Hebbian learning based on neuronal firing rates is well explored, much less is known about learning rules that account for precise spike-timing. We relate a Hebbian spike-timing-dependent plasticity rule to noisy gradient descent with respect to a natural loss function on the probability simplex. This connection allows us to prove that the learning rule eventually identifies the presynaptic neuron with the highest activity. We also discover an intrinsic connection to noisy mirror descent.
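The paper's conclusion, that such a Hebbian rule eventually selects the most active presynaptic neuron, can be illustrated with a toy multiplicative update that keeps weights on the probability simplex. The rule below is an illustrative stand-in, not the paper's exact spike-timing-dependent plasticity rule.

```python
# Toy spike-timing-dependent Hebbian update on the probability simplex.
import numpy as np

rng = np.random.default_rng(0)
rates = np.array([5.0, 8.0, 20.0, 3.0])   # presynaptic rates (Hz); index 2 is most active
w = np.full(4, 0.25)                      # synaptic weights start uniform on the simplex
eta = 0.05                                # learning rate (illustrative)

for _ in range(20000):                    # 1 ms time bins
    pre = (rng.random(4) < rates * 1e-3).astype(float)   # presynaptic spikes
    if rng.random() < w @ pre:            # postsynaptic spike given summed drive
        w *= np.exp(eta * pre)            # potentiate synapses active at the spike
        w /= w.sum()                      # renormalize onto the simplex

print(w)  # weight mass tends to concentrate on the most active input (index 2)
```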

[LG-20] Electric Bus Charging Schedules Relying on Real Data-Driven Targets Based on Hierarchical Deep Reinforcement Learning

Link: https://arxiv.org/abs/2505.10262
Authors: Jiaju Qi,Lei Lei,Thorsteinn Jonsson,Lajos Hanzo
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The charging scheduling problem of Electric Buses (EBs) is investigated based on Deep Reinforcement Learning (DRL). A Markov Decision Process (MDP) is conceived, where the time horizon includes multiple charging and operating periods in a day, while each period is further divided into multiple time steps. To overcome the challenge of long-range multi-phase planning with sparse reward, we conceive Hierarchical DRL (HDRL) for decoupling the original MDP into a high-level Semi-MDP (SMDP) and multiple low-level MDPs. The Hierarchical Double Deep Q-Network (HDDQN)-Hindsight Experience Replay (HER) algorithm is proposed for simultaneously solving the decision problems arising at different temporal resolutions. As a result, the high-level agent learns an effective policy for prescribing the charging targets for every charging period, while the low-level agent learns an optimal policy for setting the charging power of every time step within a single charging period, with the aim of minimizing the charging costs while meeting the charging target. It is proved that the flat policy constructed by superimposing the optimal high-level policy and the optimal low-level policy performs as well as the optimal policy of the original MDP. Since jointly learning both levels of policies is challenging due to the non-stationarity of the high-level agent and the sampling inefficiency of the low-level agent, we divide the joint learning process into two phases and exploit our new HER algorithm to manipulate the experience replay buffers for both levels of agents. Numerical experiments are performed with the aid of real-world data to evaluate the performance of the proposed algorithm.

[LG-21] SpecOffload: Unlocking Latent GPU Capacity for LLM Inference on Resource-Constrained Devices

链接: https://arxiv.org/abs/2505.10259
作者: Xiangwen Zhuge,Xu Shen,Zeyu Wang,Fan Dang,Xuan Ding,Danyang Li,Yahui Han,Tianxiang Hao,Zheng Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Efficient LLM inference on resource-constrained devices presents significant challenges in compute and memory utilization. Due to limited GPU memory, existing systems offload model weights to CPU memory, incurring substantial I/O overhead between the CPU and GPU. This leads to two major inefficiencies: (1) GPU cores are underutilized, often remaining idle while waiting for data to be loaded; and (2) GPU memory has low impact on performance, as reducing its capacity has minimal effect on overall throughput. In this paper, we propose SpecOffload, a high-throughput inference engine that embeds speculative decoding into offloading. Our key idea is to unlock latent GPU resources for storing and executing a draft model used for speculative decoding, thus accelerating inference at near-zero additional cost. To support this, we carefully orchestrate the interleaved execution of target and draft models in speculative decoding within the offloading pipeline, and propose a planner to manage tensor placement and select optimal parameters. Compared to the best baseline, SpecOffload improves GPU core utilization by 4.49x and boosts inference throughput by 2.54x. Our code is available at this https URL .

[LG-22] Informed Forecasting: Leveraging Auxiliary Knowledge to Boost LLM Performance on Time Series Forecasting

链接: https://arxiv.org/abs/2505.10213
作者: Mohammadmahdi Ghasemloo,Alireza Moradi
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:With the widespread adoption of Large Language Models (LLMs), there is a growing need to establish best practices for leveraging their capabilities beyond traditional natural language tasks. In this paper, a novel cross-domain knowledge transfer framework is proposed to enhance the performance of LLMs in time series forecasting – a task of increasing relevance in fields such as energy systems, finance, and healthcare. The approach systematically infuses LLMs with structured temporal information to improve their forecasting accuracy. This study evaluates the proposed method on a real-world time series dataset and compares it to a naive baseline where the LLM receives no auxiliary information. Results show that knowledge-informed forecasting significantly outperforms the uninformed baseline in terms of predictive accuracy and generalization. These findings highlight the potential of knowledge transfer strategies to bridge the gap between LLMs and domain-specific forecasting tasks.

[LG-23] A multi-head deep fusion model for recognition of cattle foraging events using sound and movement signals

链接: https://arxiv.org/abs/2505.10198
作者: Mariano Ferrero,José Omar Chelotti,Luciano Sebastián Martinez-Rau,Leandro Vignolo,Martín Pires,Julio Ricardo Galli,Leonardo Luis Giovanini,Hugo Leonardo Rufiner
类目: Machine Learning (cs.LG)
*备注: Preprint submitted to Engineering Applications of Artificial Intelligence

点击查看摘要

Abstract:Monitoring feeding behaviour is a relevant task for efficient herd management and the effective use of available resources in grazing cattle. The ability to automatically recognise animals’ feeding activities through the identification of specific jaw movements allows for the improvement of diet formulation, as well as early detection of metabolic problems and symptoms of animal discomfort, among other benefits. The use of sensors to obtain signals for such monitoring has become popular in the last two decades. The most frequently employed sensors include accelerometers, microphones, and cameras, each with its own set of advantages and drawbacks. An unexplored aspect is the simultaneous use of multiple sensors with the aim of combining signals in order to enhance the precision of the estimations. In this direction, this work introduces a deep neural network based on the fusion of acoustic and inertial signals, composed of convolutional, recurrent, and dense layers. The main advantage of this model is the combination of signals through the automatic extraction of features independently from each of them. The model has emerged from an exploration and comparison of different neural network architectures proposed in this work, which carry out information fusion at different levels. Feature-level fusion has outperformed data-level and decision-level fusion by at least 0.14 in terms of the F1-score. Moreover, a comparison with state-of-the-art machine learning methods is presented, including traditional and deep learning approaches. The proposed model yielded an F1-score value of 0.802, representing a 14% increase compared to previous methods. Finally, results from an ablation study and post-training quantization evaluation are also reported.
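
A minimal PyTorch sketch of the feature-level fusion idea (the best-performing variant above): each modality passes through its own convolutional branch, and the extracted features are concatenated before classification. Branch sizes, window lengths, and the six-class output are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Sketch: two modality branches fused at the feature level."""
    def __init__(self, n_classes: int = 6):
        super().__init__()
        # 1-D conv branch over the acoustic signal
        self.audio = nn.Sequential(nn.Conv1d(1, 16, 5), nn.ReLU(), nn.AdaptiveAvgPool1d(1))
        # 1-D conv branch over 3-axis inertial (accelerometer) signals
        self.imu = nn.Sequential(nn.Conv1d(3, 16, 5), nn.ReLU(), nn.AdaptiveAvgPool1d(1))
        # features are concatenated before the classifier (feature-level fusion)
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32, n_classes))

    def forward(self, audio, imu):
        feats = torch.cat([self.audio(audio), self.imu(imu)], dim=1)
        return self.head(feats)

model = FeatureLevelFusion()
logits = model(torch.randn(8, 1, 400), torch.randn(8, 3, 400))
print(logits.shape)  # torch.Size([8, 6])
```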

[LG-24] Defect Detection in Photolithographic Patterns Using Deep Learning Models Trained on Synthetic Data

链接: https://arxiv.org/abs/2505.10192
作者: Prashant P. Shinde,Priyadarshini P. Pai,Shashishekar P. Adiga,K. Subramanya Mayya,Yongbeom Seo,Myungsoo Hwang,Heeyoung Go,Changmin Park
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the photolithographic process vital to semiconductor manufacturing, various types of defects appear during EUV patterning. Due to ever-shrinking pattern size, these defects are extremely small and cause false or missed detection during inspection. Specifically, the lack of defect-annotated quality data with good representation of smaller defects has prohibited deployment of deep learning based defect detection models in fabrication lines. To resolve the problem of data unavailability, we artificially generate scanning electron microscopy (SEM) images of line patterns with a known distribution of defects and autonomously annotate them. We then employ state-of-the-art object detection models to investigate defect detection performance as a function of defect size, much smaller than the pitch width. We find that the real-time object detector YOLOv8 has the best mean average precision of 96% as compared to EfficientNet, 83%, and SSD, 77%, with the ability to detect smaller defects. We report the smallest defect size that can be detected reliably. When tested on real SEM data, the YOLOv8 model correctly detected 84.6% of Bridge defects and 78.3% of Break defects across all relevant instances. These promising results suggest that synthetic data can be used as an alternative to real-world data in order to develop robust machine-learning models.
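
For context, fine-tuning an off-the-shelf YOLOv8 detector on such synthetic data could look like the sketch below, using the ultralytics package. The dataset config `sem_defects.yaml` (pointing at synthetic training images and real SEM validation images, with classes such as Bridge and Break) is a hypothetical placeholder, not an artifact of the paper.

```python
# Minimal fine-tuning sketch, assuming the `ultralytics` package and a
# hypothetical dataset config `sem_defects.yaml`.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                        # pretrained nano model as a starting point
model.train(data="sem_defects.yaml", epochs=100, imgsz=640)

# evaluate on the held-out split, then run inference on a real SEM image
metrics = model.val()
results = model.predict("real_sem_line_pattern.png", conf=0.25)
```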

[LG-25] Near Optimal Best Arm Identification for Clustered Bandits ICML2025

链接: https://arxiv.org/abs/2505.10147
作者: Yash,Nikhil Karamchandani,Avishek Ghosh
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: To be published in ICML 2025

点击查看摘要

Abstract:This work investigates the problem of best arm identification for multi-agent multi-armed bandits. We consider $N$ agents grouped into $M$ clusters, where each cluster solves a stochastic bandit problem. The mapping between agents and bandits is a priori unknown. Each bandit is associated with $K$ arms, and the goal is to identify the best arm for each agent under a $\delta$-probably correct ($\delta$-PC) framework, while minimizing sample complexity and communication overhead. We propose two novel algorithms: Clustering then Best Arm Identification (Cl-BAI) and Best Arm Identification then Clustering (BAI-Cl). Cl-BAI uses a two-phase approach that first clusters agents based on the bandit problems they are learning, followed by identifying the best arm for each cluster. BAI-Cl reverses the sequence by identifying the best arms first and then clustering agents accordingly. Both algorithms leverage the successive elimination framework to ensure computational efficiency and high accuracy. We establish $\delta$-PC guarantees for both methods, derive bounds on their sample complexity, and provide a lower bound for this problem class. Moreover, when $M$ is small (a constant), we show that the sample complexity of a variant of BAI-Cl is minimax optimal in an order-wise sense. Experiments on synthetic and real-world datasets (MovieLens, Yelp) demonstrate the superior performance of the proposed algorithms in terms of sample and communication efficiency, particularly in settings where $M \ll N$.
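
Both Cl-BAI and BAI-Cl build on successive elimination; the sketch below shows that primitive for a single bandit in its textbook form. The Hoeffding-style confidence radius and constants are generic choices, not necessarily those used in the paper.

```python
import numpy as np

def successive_elimination(pull, n_arms, delta, max_rounds=10_000):
    """Textbook successive elimination for delta-PC best-arm identification.
    `pull(a)` returns a stochastic reward in [0, 1]."""
    active = list(range(n_arms))
    means = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    for t in range(1, max_rounds + 1):
        for a in active:  # sample every surviving arm once per round
            means[a] = (means[a] * counts[a] + pull(a)) / (counts[a] + 1)
            counts[a] += 1
        # Hoeffding-style confidence radius shared by all active arms
        rad = np.sqrt(np.log(4 * n_arms * t**2 / delta) / (2 * t))
        best = max(active, key=lambda a: means[a])
        # eliminate arms whose upper confidence bound falls below the
        # leader's lower confidence bound
        active = [a for a in active if means[a] + rad >= means[best] - rad]
        if len(active) == 1:
            return active[0]
    return max(active, key=lambda a: means[a])

rng = np.random.default_rng(1)
true = np.array([0.3, 0.5, 0.7, 0.4])
print(successive_elimination(lambda a: rng.binomial(1, true[a]), 4, delta=0.05))  # -> 2
```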

[LG-26] Enhancing the Performance of Global Model by Improving the Adaptability of Local Models in Federated Learning

链接: https://arxiv.org/abs/2505.10125
作者: Wujun Zhou,Shu Ding,ZeLin Li,Wei Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning enables the clients to collaboratively train a global model, which is aggregated from local models. Due to the heterogeneous data distributions over clients and data privacy in federated learning, it is difficult to train local models to achieve a well-performing global model. In this paper, we introduce the adaptability of local models, i.e., the average performance of local models on data distributions over clients, and enhance the performance of the global model by improving the adaptability of local models. Since each client does not know the data distributions over other clients, the adaptability of the local model cannot be directly optimized. First, we provide the property of an appropriate local model which has good adaptability on the data distributions over clients. Then, we formalize the property into the local training objective with a constraint and propose a feasible solution to train the local model. Extensive experiments on federated learning benchmarks demonstrate that our method significantly improves the adaptability of local models and achieves a well-performing global model that consistently outperforms the baseline methods.

[LG-27] ChronoSteer: Bridging Large Language Model and Time Series Foundation Model via Synthetic Data

链接: https://arxiv.org/abs/2505.10083
作者: Chengsen Wang,Qi Qi,Zhongwen Rao,Lujia Pan,Jingyu Wang,Jianxin Liao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conventional forecasting methods rely on unimodal time series data, limiting their ability to exploit rich textual information. Recently, large language models (LLMs) and time series foundation models (TSFMs) have demonstrated powerful capability in textual reasoning and temporal modeling, respectively. Integrating the strengths of both to construct a multimodal model that concurrently leverages both temporal and textual information for future inference has emerged as a critical research challenge. To address the scarcity of event-series paired data, we propose a decoupled framework: an LLM is employed to transform textual events into revision instructions, which are then used to steer the output of TSFM. To implement this framework, we introduce ChronoSteer, a multimodal TSFM that can be steered through textual revision instructions, effectively bridging LLM and TSFM. Moreover, to mitigate the shortage of cross-modal instruction-series paired data, we devise a two-stage training strategy based on synthetic data. In addition, we also construct a high-quality multimodal time series forecasting benchmark to address the information leakage concerns during evaluation. After integrating with an LLM, ChronoSteer, which is trained exclusively on synthetic data, achieves a 25.7% improvement in prediction accuracy compared to the unimodal backbone and a 22.5% gain over the previous state-of-the-art multimodal method.

[LG-28] JointDistill: Adaptive Multi-Task Distillation for Joint Depth Estimation and Scene Segmentation

链接: https://arxiv.org/abs/2505.10057
作者: Tiancong Cheng,Ying Zhang,Yuxuan Liang,Roger Zimmermann,Zhiwen Yu,Bin Guo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Depth estimation and scene segmentation are two important tasks in intelligent transportation systems. Jointly modeling these two tasks reduces both the storage requirements and the training effort. This work explores how multi-task distillation can be used to improve such unified modeling. While existing solutions transfer multiple teachers’ knowledge in a static way, we propose a self-adaptive distillation method that can dynamically adjust the knowledge amount from each teacher according to the student’s current learning ability. Furthermore, as multiple teachers exist, the student’s gradient update direction in the distillation is more prone to be erroneous, and knowledge forgetting may occur. To avoid this, we propose a knowledge trajectory to record the most essential information that a model has learnt in the past, based on which a trajectory-based distillation loss is designed to guide the student to follow the learning curve similarly in a cost-effective way. We evaluate our method on multiple benchmarking datasets including Cityscapes and NYU-v2. Compared to the state-of-the-art solutions, our method achieves a clear improvement. The code is provided in the supplementary materials.

[LG-29] Instance-Prototype Affinity Learning for Non-Exemplar Continual Graph Learning

链接: https://arxiv.org/abs/2505.10040
作者: Lei Song,Jiaxing Li,Shihan Guan,Youyong Kong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNN) endure catastrophic forgetting, undermining their capacity to preserve previously acquired knowledge amid the assimilation of novel information. Rehearsal-based techniques revisit historical examples, adopted as a principal strategy to alleviate this phenomenon. However, memory explosion and privacy infringements impose significant constraints on their utility. Non-Exemplar methods circumvent the prior issues through Prototype Replay (PR), yet feature drift presents new challenges. In this paper, our empirical findings reveal that Prototype Contrastive Learning (PCL) exhibits less pronounced drift than conventional PR. Drawing upon PCL, we propose Instance-Prototype Affinity Learning (IPAL), a novel paradigm for Non-Exemplar Continual Graph Learning (NECGL). Exploiting graph structural information, we formulate Topology-Integrated Gaussian Prototypes (TIGP), guiding feature distributions towards high-impact nodes to augment the model’s capacity for assimilating new knowledge. Instance-Prototype Affinity Distillation (IPAD) safeguards task memory by regularizing discontinuities in class relationships. Moreover, we embed a Decision Boundary Perception (DBP) mechanism within PCL, fostering greater inter-class discriminability. Evaluations on four node classification benchmark datasets demonstrate that our method outperforms existing state-of-the-art methods, achieving a better trade-off between plasticity and stability.

[LG-30] Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

链接: https://arxiv.org/abs/2505.10039
作者: Hang Chen,Jiaying Zhu,Xinyu Yang,Wenya Wang
类目: Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Circuit discovery has gradually become one of the prominent methods for mechanistic interpretability, and research on circuit completeness has also garnered increasing attention. Methods of circuit discovery that do not guarantee completeness not only result in circuits that are not fixed across different runs but also cause key mechanisms to be omitted. The nature of incompleteness arises from the presence of OR gates within the circuit, which are often only partially detected in standard circuit discovery methods. To this end, we systematically introduce three types of logic gates: AND, OR, and ADDER gates, and decompose the circuit into combinations of these logical gates. Through the concept of these gates, we derive the minimum requirements necessary to achieve faithfulness and completeness. Furthermore, we propose a framework that combines noising-based and denoising-based interventions, which can be easily integrated into existing circuit discovery methods without significantly increasing computational complexity. This framework is capable of fully identifying the logic gates and distinguishing them within the circuit. In addition to the extensive experimental validation of the framework’s ability to restore the faithfulness, completeness, and sparsity of circuits, using this framework, we uncover fundamental properties of the three logic gates, such as their proportions and contributions to the output, and explore how they behave among the functionalities of language models.

[LG-31] Evaluating Robustness of Deep Reinforcement Learning for Autonomous Surface Vehicle Control in Field Tests ICRA2025

链接: https://arxiv.org/abs/2505.10033
作者: Luis F. W. Batista,Stéphanie Aravecchia,Seth Hutchinson,Cédric Pradalier
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Workshop on Field Robotics at ICRA 2025

点击查看摘要

Abstract:Despite significant advancements in Deep Reinforcement Learning (DRL) for Autonomous Surface Vehicles (ASVs), their robustness in real-world conditions, particularly under external disturbances, remains insufficiently explored. In this paper, we evaluate the resilience of a DRL-based agent designed to capture floating waste under various perturbations. We train the agent using domain randomization and evaluate its performance in real-world field tests, assessing its ability to handle unexpected disturbances such as asymmetric drag and an off-center payload. We assess the agent’s performance under these perturbations in both simulation and real-world experiments, quantifying performance degradation and benchmarking it against an MPC baseline. Results indicate that the DRL agent performs reliably despite significant disturbances. Along with the open-source release of our implementation, we provide insights into effective training strategies, real-world challenges, and practical considerations for deploying DRL-based ASV controllers.

[LG-32] ImagineBench: Evaluating Reinforcement Learning with Large Language Model Rollouts

链接: https://arxiv.org/abs/2505.10010
作者: Jing-Cheng Pang,Kaiyuan Li,Yidi Wang,Si-Hang Yang,Shengyi Jiang,Yang Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A central challenge in reinforcement learning (RL) is its dependence on extensive real-world interaction data to learn task-specific policies. While recent work demonstrates that large language models (LLMs) can mitigate this limitation by generating synthetic experience (noted as imaginary rollouts) for mastering novel tasks, progress in this emerging field is hindered due to the lack of a standard benchmark. To bridge this gap, we introduce ImagineBench, the first comprehensive benchmark for evaluating offline RL algorithms that leverage both real rollouts and LLM-imaginary rollouts. The key features of ImagineBench include: (1) datasets comprising environment-collected and LLM-imaginary rollouts; (2) diverse domains of environments covering locomotion, robotic manipulation, and navigation tasks; and (3) natural language task instructions with varying complexity levels to facilitate language-conditioned policy learning. Through systematic evaluation of state-of-the-art offline RL algorithms, we observe that simply applying existing offline RL algorithms leads to suboptimal performance on unseen tasks, achieving a 35.44% success rate on hard tasks, in contrast to 64.37% for methods trained on real rollouts. This result highlights the need for algorithm advancements to better leverage LLM-imaginary rollouts. Additionally, we identify key opportunities for future research: including better utilization of imaginary rollouts, fast online adaptation and continual learning, and extension to multi-modal tasks. Our code is publicly available at this https URL.

[LG-33] Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning

链接: https://arxiv.org/abs/2505.10007
作者: Zijun Chen,Shengbo Wang,Nian Si
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Motivated by practical applications where stable long-term performance is critical-such as robotics, operations research, and healthcare-we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}|\, t_{\mathrm{mix}}^{2} \varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.

[LG-34] AI2MMUM: AI-AI Oriented Multi-Modal Universal Model Leveraging Telecom Domain Large Model

链接: https://arxiv.org/abs/2505.10003
作者: Tianyu Jiao,Zhuoran Xiao,Yihang Huang,Chenhui Ye,Yijia Feng,Liyu Cai,Jiang Chang,Fangkun Liu,Yin Xu,Dazhi He,Yunfeng Guan,Wenjun Zhang
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Designing a 6G-oriented universal model capable of processing multi-modal data and executing diverse air interface tasks has emerged as a common goal in future wireless systems. Building on our prior work in communication multi-modal alignment and telecom large language model (LLM), we propose a scalable, task-aware artificial intelligence-air interface multi-modal universal model (AI2MMUM), which flexibly and effectively performs various physical layer tasks according to subtle task instructions. The LLM backbone provides robust contextual comprehension and generalization capabilities, while a fine-tuning approach is adopted to incorporate domain-specific knowledge. To enhance task adaptability, task instructions consist of fixed task keywords and learnable, implicit prefix prompts. Frozen radio modality encoders extract universal representations and adapter layers subsequently bridge radio and language modalities. Moreover, lightweight task-specific heads are designed to directly output task objectives. Comprehensive evaluations demonstrate that AI2MMUM achieves SOTA performance across five representative physical environment/wireless channel-based downstream tasks using the WAIR-D and DeepMIMO datasets.

[LG-35] Sybil-based Virtual Data Poisoning Attacks in Federated Learning

链接: https://arxiv.org/abs/2505.09983
作者: Changxun Zhu,Qilong Wu,Lingjuan Lyu,Shibei Xue
类目: Machine Learning (cs.LG)
*备注: 7 pages, 6 figures, accepted by IEEE Codit 2025

点击查看摘要

Abstract:Federated learning is vulnerable to poisoning attacks by malicious adversaries. Existing methods often involve high costs to achieve effective attacks. To address this challenge, we propose a sybil-based virtual data poisoning attack, where a malicious client generates sybil nodes to amplify the poisoning model’s impact. To reduce neural network computational complexity, we develop a virtual data generation method based on gradient matching. We also design three schemes for target model acquisition, applicable to online local, online global, and offline scenarios. In simulations, our method outperforms other attack algorithms, since it can obtain a global target model under non-independent uniformly distributed data.
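
A minimal sketch of the gradient-matching idea: synthetic "virtual" samples are optimized so that their gradient on the current model aligns with a target gradient direction. The model, batch shapes, and the source of the target gradient are illustrative assumptions, not the paper's exact construction.

```python
import torch
import torch.nn.functional as F

# Toy setup: derive a target gradient from some poisoned batch, then learn a
# much smaller virtual batch that reproduces it (all shapes illustrative).
model = torch.nn.Linear(10, 2)
x_poison, y_poison = torch.randn(32, 10), torch.randint(0, 2, (32,))
target_grads = torch.autograd.grad(
    F.cross_entropy(model(x_poison), y_poison), model.parameters())

x_virt = torch.randn(8, 10, requires_grad=True)   # virtual data to optimize
y_virt = torch.randint(0, 2, (8,))
opt = torch.optim.Adam([x_virt], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    grads = torch.autograd.grad(
        F.cross_entropy(model(x_virt), y_virt), model.parameters(),
        create_graph=True)
    # cosine-distance gradient-matching loss, summed over parameter tensors
    loss = sum(1 - F.cosine_similarity(g.flatten(), t.flatten(), dim=0)
               for g, t in zip(grads, target_grads))
    loss.backward()
    opt.step()
```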

[LG-36] Approximated Behavioral Metric-based State Projection for Federated Reinforcement Learning

链接: https://arxiv.org/abs/2505.09959
作者: Zengxia Guo,Bohui An,Zhongqi Lu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated reinforcement learning (FRL) methods usually share the encrypted local state or policy information and help each client to learn from others while preserving everyone’s privacy. In this work, we propose that sharing the approximated behavior metric-based state projection function is a promising way to enhance the performance of FRL and concurrently provides effective protection of sensitive information. We introduce FedRAG, an FRL framework that learns a computationally practical projection function of states for each client and aggregates the parameters of the projection functions at a central server. The FedRAG approach shares no sensitive task-specific information, yet provides information gain for each client. We conduct extensive experiments on the DeepMind Control Suite to demonstrate insightful results.

[LG-37] Improving the Euclidean Diffusion Generation of Manifold Data by Mitigating Score Function Singularity

链接: https://arxiv.org/abs/2505.09922
作者: Zichen Liu,Wei Zhang,Tiejun Li
类目: Machine Learning (cs.LG)
*备注: 22 pages

点击查看摘要

Abstract:Euclidean diffusion models have achieved remarkable success in generative modeling across diverse domains, and they have been extended to manifold case in recent advances. Instead of explicitly utilizing the structure of special manifolds as studied in previous works, we investigate direct sampling of the Euclidean diffusion models for general manifold-constrained data in this paper. We reveal the multiscale singularity of the score function in the embedded space of manifold, which hinders the accuracy of diffusion-generated samples. We then present an elaborate theoretical analysis of the singularity structure of the score function by separating it along the tangential and normal directions of the manifold. To mitigate the singularity and improve the sampling accuracy, we propose two novel methods: (1) Niso-DM, which introduces non-isotropic noise along the normal direction to reduce scale discrepancies, and (2) Tango-DM, which trains only the tangential component of the score function using a tangential-only loss function. Numerical experiments demonstrate that our methods achieve superior performance on distributions over various manifolds with complex geometries.

[LG-38] BINGO: A Novel Pruning Mechanism to Reduce the Size of Neural Networks

链接: https://arxiv.org/abs/2505.09864
作者: Aditya Panangat
类目: Machine Learning (cs.LG)
*备注: 6 pages, 0 figures, 2 tables

点击查看摘要

Abstract:Over the past decade, the use of machine learning has increased exponentially. Models are far more complex than ever before, growing to gargantuan sizes and housing millions of weights. Unfortunately, the fact that large models have become the state of the art means that it often costs millions of dollars to train and operate them. These expenses not only hurt companies but also bar non-wealthy individuals from contributing to new developments and force consumers to pay greater prices for AI. Current methods used to prune models, such as iterative magnitude pruning, have shown great accuracy but require an iterative training sequence that is incredibly computationally and environmentally taxing. To solve this problem, BINGO is introduced. BINGO, during the training pass, studies specific subsets of a neural network one at a time to gauge how significant of a role each weight plays in contributing to a network’s accuracy. By the time training is done, BINGO generates a significance score for each weight, allowing for insignificant weights to be pruned in one shot. BINGO provides an accuracy-preserving pruning technique that is less computationally intensive than current methods, allowing for a world where AI growth does not have to mean model growth.
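
The one-shot pruning step can be pictured as below, assuming per-weight significance scores are already available from training. The scores here are random stand-ins, since the scoring statistic itself is BINGO's contribution and is not reproduced.

```python
import torch

def one_shot_prune(model, scores, sparsity=0.8):
    """Zero out the weights with the lowest significance scores in one shot.
    `scores` maps parameter name -> tensor of per-weight significance."""
    all_scores = torch.cat([s.flatten() for s in scores.values()])
    threshold = torch.quantile(all_scores, sparsity)   # global cutoff
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in scores:
                p.mul_((scores[name] >= threshold).float())

model = torch.nn.Sequential(torch.nn.Linear(100, 50), torch.nn.ReLU(),
                            torch.nn.Linear(50, 10))
scores = {n: torch.rand_like(p) for n, p in model.named_parameters()}  # stand-in scores
one_shot_prune(model, scores, sparsity=0.8)
frac = sum((p == 0).sum().item() for p in model.parameters()) / \
       sum(p.numel() for p in model.parameters())
print(f"zeroed fraction: {frac:.2f}")
```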

[LG-39] Chisme: Fully Decentralized Differentiated Deep Learning for Edge Intelligence

链接: https://arxiv.org/abs/2505.09854
作者: Harikrishna Kuttivelil,Katia Obraczka
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:As demand for intelligent services rises and edge devices become more capable, distributed learning at the network edge has emerged as a key enabling technology. While existing paradigms like federated learning (FL) and decentralized FL (DFL) enable privacy-preserving distributed learning in many scenarios, they face potential challenges in connectivity and synchronization imposed by resource-constrained and infrastructure-less environments. While more robust, gossip learning (GL) algorithms have generally been designed for homogeneous data distributions and may not suit all contexts. This paper introduces Chisme, a novel suite of protocols designed to address the challenges of implementing robust intelligence in the network edge, characterized by heterogeneous data distributions, episodic connectivity, and lack of infrastructure. Chisme includes both synchronous DFL (Chisme-DFL) and asynchronous GL (Chisme-GL) variants that enable collaborative yet decentralized model training that considers underlying data heterogeneity. We introduce a data similarity heuristic that allows agents to opportunistically infer affinity with each other using the existing communication of model updates in decentralized FL and GL. We leverage the heuristic to extend DFL’s model aggregation and GL’s model merge mechanisms for better personalized training while maintaining collaboration. While Chisme-DFL is a synchronous decentralized approach whose resource utilization scales linearly with network size, Chisme-GL is fully asynchronous and has a lower, constant resource requirement independent of network size. We demonstrate that Chisme methods outperform their standard counterparts in model training over distributed and heterogeneous data in network scenarios ranging from less connected and reliable networks to fully connected and lossless networks.

[LG-40] ZENN: A Thermodynamics-Inspired Computational Framework for Heterogeneous Data-Driven Modeling

链接: https://arxiv.org/abs/2505.09851
作者: Shun Wang,Shun-Li Shang,Zi-Kui Liu,Wenrui Hao
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:Traditional entropy-based methods - such as cross-entropy loss in classification problems - have long been essential tools for quantifying uncertainty and disorder in data and developing artificial intelligence algorithms. However, the rapid growth of data across various domains has introduced new challenges, particularly the integration of heterogeneous datasets with intrinsic disparities. In this paper, we extend zentropy theory into the data science domain by introducing intrinsic entropy, enabling more effective learning from heterogeneous data sources. We propose a zentropy-enhanced neural network (ZENN) that simultaneously learns both energy and intrinsic entropy components, capturing the underlying structure of multi-source data. To support this, we redesign the neural network architecture to better reflect the intrinsic properties and variability inherent in diverse datasets. We demonstrate the effectiveness of ZENN on classification tasks and energy landscape reconstructions, showing its superior generalization capabilities and robustness, particularly in predicting high-order derivatives. As a practical application, we employ ZENN to reconstruct the Helmholtz energy landscape of Fe3Pt using data generated from DFT and capture key material behaviors, including negative thermal expansion and the critical point in the temperature-pressure space. Overall, our study introduces a novel approach for data-driven machine learning grounded in zentropy theory, highlighting ZENN as a versatile and robust deep learning framework for scientific problems involving complex, heterogeneous datasets.

[LG-41] Radiogenomic Bipartite Graph Representation Learning for Alzheimer's Disease Detection

链接: https://arxiv.org/abs/2505.09848
作者: Aditya Raj,Golrokh Mirzaei
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 11 pages

点击查看摘要

Abstract:Imaging and genomic data offer distinct and rich features, and their integration can unveil new insights into the complex landscape of diseases. In this study, we present a novel approach utilizing radiogenomic data, including structural MRI images and gene expression data, for Alzheimer’s disease detection. Our framework introduces a novel heterogeneous bipartite graph representation learning featuring two distinct node types: genes and images. The network can effectively classify Alzheimer’s disease into three distinct classes: AD, Mild Cognitive Impairment (MCI), and Cognitive Normal (CN), utilizing a small dataset. Additionally, it identifies which genes play a significant role in each of these classification groups. We evaluate the performance of our approach using metrics including classification accuracy, recall, precision, and F1 score. The proposed technique holds potential for extending radiogenomic-based classification to other diseases.

[LG-42] Automated Alert Classification and Triage (AACT): An Intelligent System for the Prioritisation of Cybersecurity Alerts

链接: https://arxiv.org/abs/2505.09843
作者: Melissa Turcotte,François Labrèche,Serge-Olivier Paquette
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Enterprise networks are growing ever larger with a rapidly expanding attack surface, increasing the volume of security alerts generated from security controls. Security Operations Centre (SOC) analysts triage these alerts to identify malicious activity, but they struggle with alert fatigue due to the overwhelming number of benign alerts. Organisations are turning to managed SOC providers, where the problem is amplified by context switching and limited visibility into business processes. A novel system, named AACT, is introduced that automates SOC workflows by learning from analysts’ triage actions on cybersecurity alerts. It accurately predicts triage decisions in real time, allowing benign alerts to be closed automatically and critical ones prioritised. This reduces the SOC queue allowing analysts to focus on the most severe, relevant or ambiguous threats. The system has been trained and evaluated on both real SOC data and an open dataset, obtaining high performance in identifying malicious alerts from benign alerts. Additionally, the system has demonstrated high accuracy in a real SOC environment, reducing alerts shown to analysts by 61% over six months, with a low false negative rate of 1.36% over millions of alerts.

[LG-43] Learning Rock Pushability on Rough Planetary Terrain ICRA2025

链接: https://arxiv.org/abs/2505.09833
作者: Tuba Girgin,Emre Girgin,Cagri Kilic
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Paper presented at the Workshop on Field Robotics, ICRA 2025, Atlanta, GA, United States

点击查看摘要

Abstract:In the context of mobile navigation in unstructured environments, the predominant approach entails the avoidance of obstacles. The prevailing path planning algorithms are contingent upon deviating from the intended path for an indefinite duration and returning to the closest point on the route after the obstacle is left behind spatially. However, avoiding an obstacle on a path that will be used repeatedly by multiple agents can hinder long-term efficiency and lead to a lasting reliance on an active path planning system. In this study, we propose an alternative approach to mobile navigation in unstructured environments by leveraging the manipulation capabilities of a robotic manipulator mounted on top of a mobile robot. Our proposed framework integrates exteroceptive and proprioceptive feedback to assess the push affordance of obstacles, facilitating their repositioning rather than avoidance. While our preliminary visual estimation takes into account the characteristics of both the obstacle and the surface it relies on, the push affordance estimation module exploits the force feedback obtained by interacting with the obstacle via a robotic manipulator as the guidance signal. The objective of our navigation approach is to enhance the efficiency of routes utilized by multiple agents over extended periods by reducing the overall time spent by a fleet in environments where autonomous infrastructure development is imperative, such as lunar or Martian surfaces.

[LG-44] Learning Kronecker-Structured Graphs from Smooth Signals

链接: https://arxiv.org/abs/2505.09822
作者: Changhao Shi,Gal Mishne
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Graph learning, or network inference, is a prominent problem in graph signal processing (GSP). GSP generalizes the Fourier transform to non-Euclidean domains, and graph learning is pivotal to applying GSP when these domains are unknown. With the recent prevalence of multi-way data, there has been growing interest in product graphs that naturally factorize dependencies across different ways. However, the types of graph products that can be learned are still limited for modeling diverse dependency structures. In this paper, we study the problem of learning a Kronecker-structured product graph from smooth signals. Unlike the more commonly used Cartesian product, the Kronecker product models dependencies in a more intricate, non-separable way, but posits harder constraints on the graph learning problem. To tackle this non-convex problem, we propose an alternating scheme to optimize each factor graph and provide theoretical guarantees for its asymptotic convergence. The proposed algorithm is also modified to learn factor graphs of the strong product. We conduct experiments on synthetic and real-world graphs and demonstrate our approach’s efficacy and superior performance compared to existing methods.

[LG-45] Adversarial Attack on Large Language Models using Exponentiated Gradient Descent IJCNN

链接: https://arxiv.org/abs/2505.09820
作者: Sajib Biswas,Mao Nishino,Samuel Jacob Chacko,Xiuwen Liu
类目: Machine Learning (cs.LG)
*备注: Accepted to International Joint Conference on Neural Networks (IJCNN) 2025

点击查看摘要

Abstract:As Large Language Models (LLMs) are widely used, understanding them systematically is key to improving their safety and realizing their full potential. Although many models are aligned using techniques such as reinforcement learning from human feedback (RLHF), they are still vulnerable to jailbreaking attacks. Some of the existing adversarial attack methods search for discrete tokens that may jailbreak a target model while others try to optimize the continuous space represented by the tokens of the model’s vocabulary. While techniques based on the discrete space may prove to be inefficient, optimization of continuous token embeddings requires projections to produce discrete tokens, which might render them ineffective. To fully utilize the constraints and the structures of the space, we develop an intrinsic optimization technique using exponentiated gradient descent with the Bregman projection method to ensure that the optimized one-hot encoding always stays within the probability simplex. We prove the convergence of the technique and implement an efficient algorithm that is effective in jailbreaking several widely used LLMs. We demonstrate the efficacy of the proposed technique using five open-source LLMs on four openly available datasets. The results show that the technique achieves a higher success rate with great efficiency compared to three other state-of-the-art jailbreaking techniques. The source code for our implementation is available at: this https URL
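
The core update is easy to state: exponentiated gradient descent is mirror descent under the KL divergence, so the Bregman projection onto the probability simplex reduces to renormalization. The toy sketch below shows the generic update on a linear objective, not the paper's full jailbreaking pipeline; the objective and step size are assumptions.

```python
import numpy as np

def egd_step(x, grad, eta=0.1):
    """One exponentiated-gradient step: a mirror-descent update whose
    Bregman (KL) projection is just renormalization, so `x` always stays
    a valid probability vector (e.g., a relaxed one-hot token encoding)."""
    x = x * np.exp(-eta * grad)
    return x / x.sum()

# Toy objective: minimize <c, x> over the simplex; the optimum puts all
# probability mass on the smallest coordinate of c (index 1 here).
c = np.array([3.0, 1.0, 2.0, 4.0])
x = np.full(4, 0.25)
for _ in range(500):
    x = egd_step(x, c)
print(x.round(3))  # approximately one-hot on index 1
```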

[LG-46] Comparative Analysis of Stroke Prediction Models Using Machine Learning

链接: https://arxiv.org/abs/2505.09812
作者: Anastasija Tashkova,Stefan Eftimov,Bojan Ristov,Slobodan Kalajdziski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stroke remains one of the most critical global health challenges, ranking as the second leading cause of death and the third leading cause of disability worldwide. This study explores the effectiveness of machine learning algorithms in predicting stroke risk using demographic, clinical, and lifestyle data from the Stroke Prediction Dataset. By addressing key methodological challenges such as class imbalance and missing data, we evaluated the performance of multiple models, including Logistic Regression, Random Forest, and XGBoost. Our results demonstrate that while these models achieve high accuracy, sensitivity remains a limiting factor for real-world clinical applications. In addition, we identify the most influential predictive features and propose strategies to improve machine learning-based stroke prediction. These findings contribute to the development of more reliable and interpretable models for the early assessment of stroke risk.
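
A baseline scikit-learn pipeline addressing the two methodological challenges mentioned (missing values and class imbalance) might look like the following sketch. The CSV path and column names follow the public Stroke Prediction Dataset and are assumptions here, not artifacts from the paper.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("healthcare-dataset-stroke-data.csv")   # hypothetical local path
num = ["age", "avg_glucose_level", "bmi"]
cat = ["gender", "hypertension", "heart_disease", "smoking_status", "work_type"]

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),   # fills missing BMI
                      ("sc", StandardScaler())]), num),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat),
])
# class_weight="balanced" is one simple answer to the heavy class imbalance
clf = Pipeline([("pre", pre),
                ("lr", LogisticRegression(max_iter=1000, class_weight="balanced"))])

X_tr, X_te, y_tr, y_te = train_test_split(df[num + cat], df["stroke"],
                                          stratify=df["stroke"], random_state=0)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```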

[LG-47] Lossless Compression for LLM Tensor Incremental Snapshots

链接: https://arxiv.org/abs/2505.09810
作者: Daniel Waddington,Cornel Constantinescu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:During the training of Large Language Models (LLMs), tensor data is periodically “checkpointed” to persistent storage to allow recovery of work done in the event of failure. The volume of data that must be copied during each checkpoint, even when using reduced-precision representations such as bfloat16, often reaches hundreds of gigabytes. Furthermore, the data must be moved across a network and written to a storage system before the next epoch occurs. With a view to ultimately building an optimized checkpointing solution, this paper presents experimental analysis of checkpoint data used to derive a design that maximizes the use of lossless compression to reduce the volume of data. We examine how tensor data and its compressibility evolve during model training and evaluate the efficacy of existing common off-the-shelf general purpose compression engines combined with known data optimization techniques such as byte-grouping and incremental delta compression. Leveraging our analysis we have built an effective compression solution, known as Language Model Compressor (LMC), which is based on byte-grouping and Huffman encoding. LMC offers more compression performance than the best alternative (BZ2) but with an order-of-magnitude reduction in the time needed to perform the compression. We show that a 16-core parallel implementation of LMC can attain compression and decompression throughput of 2.78 GiB/s and 3.76 GiB/s respectively. This increase in performance ultimately reduces the CPU resources needed and provides more time to copy the data to the storage system before the next epoch thus allowing for higher-frequency checkpoints.
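
The byte-grouping idea can be illustrated in a few lines of NumPy: splitting each float into separate byte streams places the highly repetitive sign/exponent bytes next to each other, which general-purpose entropy coders exploit. Here zlib stands in for LMC's Huffman stage and the tensor is synthetic; this is a sketch of the technique, not the LMC implementation.

```python
import numpy as np
import zlib

# Synthetic stand-in for a checkpoint tensor: small float32 weights.
w = (np.random.default_rng(0).standard_normal(1_000_000) * 0.02).astype(np.float32)
raw = w.tobytes()

# Byte-grouping: split the 4-byte little-endian floats into four streams so
# the repetitive exponent/sign bytes are stored contiguously.
b = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 4)
grouped = b"".join(b[:, i].tobytes() for i in range(4))

print("plain   ratio:", len(zlib.compress(raw, 6)) / len(raw))
print("grouped ratio:", len(zlib.compress(grouped, 6)) / len(raw))  # typically smaller
```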

[LG-48] Ontology-Based Structuring and Analysis of North Macedonian Public Procurement Contracts

链接: https://arxiv.org/abs/2505.09798
作者: Bojan Ristov,Stefan Eftimov,Milena Trajanoska,Dimitar Trajanov
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Public procurement plays a critical role in government operations, ensuring the efficient allocation of resources and fostering economic growth. However, traditional procurement data is often stored in rigid, tabular formats, limiting its analytical potential and hindering transparency. This research presents a methodological framework for transforming structured procurement data into a semantic knowledge graph, leveraging ontological modeling and automated data transformation techniques. By integrating RDF and SPARQL-based querying, the system enhances the accessibility and interpretability of procurement records, enabling complex semantic queries and advanced analytics. Furthermore, by incorporating machine learning-driven predictive modeling, the system extends beyond conventional data analysis, offering insights into procurement trends and risk assessment. This work contributes to the broader field of public procurement intelligence by improving data transparency, supporting evidence-based decision-making, and enabling in-depth analysis of procurement activities in North Macedonia.
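
A minimal rdflib sketch of the RDF + SPARQL workflow is given below; the ontology IRIs, property names, and the value filter are hypothetical illustrations, not the paper's actual schema.

```python
from rdflib import Graph, Literal, Namespace, RDF

# Hypothetical procurement ontology; IRIs and properties are illustrative.
PROC = Namespace("http://example.org/procurement#")
g = Graph()
g.add((PROC.c001, RDF.type, PROC.Contract))
g.add((PROC.c001, PROC.awardedTo, Literal("Acme d.o.o.")))
g.add((PROC.c001, PROC.valueMKD, Literal(1_250_000)))

# SPARQL makes cross-cutting questions direct, e.g. high-value awards:
q = """
PREFIX proc: <http://example.org/procurement#>
SELECT ?contract ?supplier ?value WHERE {
  ?contract a proc:Contract ;
            proc:awardedTo ?supplier ;
            proc:valueMKD ?value .
  FILTER(?value > 1000000)
}
"""
for row in g.query(q):
    print(row.contract, row.supplier, row.value)
```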

[LG-49] Interim Report on Human-Guided Adaptive Hyperparameter Optimization with Multi-Fidelity Sprints

链接: https://arxiv.org/abs/2505.09792
作者: Michael Kamfonas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This case study applies a phased hyperparameter optimization process to compare multitask natural language model variants that utilize multiphase learning rate scheduling and optimizer parameter grouping. We employ short, Bayesian optimization sessions that leverage multi-fidelity, hyperparameter space pruning, progressive halving, and a degree of human guidance. We utilize the Optuna TPE sampler and Hyperband pruner, as well as the Scikit-Learn Gaussian process minimization. Initially, we use efficient low-fidelity sprints to prune the hyperparameter space. Subsequent sprints progressively increase their model fidelity and employ hyperband pruning for efficiency. A second aspect of our approach is using a meta-learner to tune threshold values to resolve classification probabilities during inference. We demonstrate our method on a collection of variants of the 2021 Joint Entity and Relation Extraction model proposed by Eberts and Ulges.
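
A skeletal version of such a sprint with the Optuna TPE sampler and Hyperband pruner could look like this; the objective is a dummy learning curve standing in for a real multi-fidelity training run, and the search space is an assumption.

```python
import math
import optuna

def objective(trial):
    # Stand-in for one training sprint: sample hyperparameters, then report
    # an intermediate score each epoch so the pruner can stop weak trials.
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    warmup = trial.suggest_float("warmup_frac", 0.0, 0.2)
    score = 0.0
    for epoch in range(30):
        # dummy learning curve peaking near lr=3e-4, warmup=0.1
        score = -(math.log10(lr) + 3.5) ** 2 - (warmup - 0.1) ** 2 + 0.01 * epoch
        trial.report(score, epoch)
        if trial.should_prune():          # Hyperband halts unpromising sprints
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=0),
    pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource=30),
)
study.optimize(objective, n_trials=50)
print(study.best_params)
```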

[LG-50] Self-Consuming Generative Models with Adversarially Curated Data

链接: https://arxiv.org/abs/2505.09768
作者: Xiukun Wei,Xueru Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in generative models have made it increasingly difficult to distinguish real data from model-generated synthetic data. Using synthetic data for successive training of future model generations creates “self-consuming loops”, which may lead to model collapse or training instability. Furthermore, synthetic data is often subject to human feedback and curated by users based on their preferences. Ferbach et al. (2024) recently showed that when data is curated according to user preferences, the self-consuming retraining loop drives the model to converge toward a distribution that optimizes those preferences. However, in practice, data curation is often noisy or adversarially manipulated. For example, competing platforms may recruit malicious users to adversarially curate data and disrupt rival models. In this paper, we study how generative models evolve under self-consuming retraining loops with noisy and adversarially curated data. We theoretically analyze the impact of such noisy data curation on generative models and identify conditions for the robustness of the retraining process. Building on this analysis, we design attack algorithms for competitive adversarial scenarios, where a platform with a limited budget employs malicious users to misalign a rival’s model from actual user preferences. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed algorithms.

[LG-51] Community-based Multi-Agent Reinforcement Learning with Transfer and Active Exploration

链接: https://arxiv.org/abs/2505.09756
作者: Zhaoyang Shi
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a new framework for multi-agent reinforcement learning (MARL), where the agents cooperate in a time-evolving network with latent community structures and mixed memberships. Unlike traditional neighbor-based or fixed interaction graphs, our community-based framework captures flexible and abstract coordination patterns by allowing each agent to belong to multiple overlapping communities. Each community maintains shared policy and value functions, which are aggregated by individual agents according to personalized membership weights. We also design actor-critic algorithms that exploit this structure: agents inherit community-level estimates for policy updates and value learning, enabling structured information sharing without requiring access to other agents’ policies. Importantly, our approach supports both transfer learning by adapting to new agents or tasks via membership estimation, and active learning by prioritizing uncertain communities during exploration. Theoretically, we establish convergence guarantees under linear function approximation for both actor and critic updates. To our knowledge, this is the first MARL framework that integrates community structure, transferability, and active learning with provable guarantees.

[LG-52] Risk-Aware Safe Reinforcement Learning for Control of Stochastic Linear Systems

链接: https://arxiv.org/abs/2505.09734
作者: Babak Esmaeili,Nariman Niknejad,Hamidreza Modares
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)
*备注: Submitted to Asian Journal of Control

点击查看摘要

Abstract:This paper presents a risk-aware safe reinforcement learning (RL) control design for stochastic discrete-time linear systems. Rather than using a safety certifier to myopically intervene with the RL controller, a risk-informed safe controller is also learned besides the RL controller, and the RL and safe controllers are combined together. Several advantages come along with this approach: 1) High-confidence safety can be certified without relying on a high-fidelity system model and using limited data available, 2) Myopic interventions and convergence to an undesired equilibrium can be avoided by deciding on the contribution of two stabilizing controllers, and 3) Highly efficient and computationally tractable solutions can be provided by optimizing over a scalar decision variable and linear programming polyhedral sets. To learn safe controllers with a large invariant set, piecewise affine controllers are learned instead of linear controllers. To this end, the closed-loop system is first represented using collected data, a decision variable, and noise. The effect of the decision variable on the variance of safety violations of the closed-loop system is formalized. The decision variable is then designed such that the probability of safety violation for the learned closed-loop system is minimized. It is shown that this control-oriented approach reduces the data requirements and can also reduce the variance of safety violations. Finally, to integrate the safe and RL controllers, a new data-driven interpolation technique is introduced. This method aims to maintain the RL agent’s optimal implementation while ensuring its safety within environments characterized by noise. The study concludes with a simulation example that serves to validate the theoretical results.

[LG-53] Training Deep Morphological Neural Networks as Universal Approximators

链接: https://arxiv.org/abs/2505.09710
作者: Konstantinos Fotopoulos,Petros Maragos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate deep morphological neural networks (DMNNs). We demonstrate that despite their inherent non-linearity, activations between layers are essential for DMNNs. We then propose several new architectures for DMNNs, each with a different constraint on their parameters. For the first (resp. second) architecture, we work under the constraint that the majority of parameters (resp. learnable parameters) should be part of morphological operations. We empirically show that our proposed networks can be successfully trained, and are more prunable than linear networks. To the best of our knowledge, we are the first to successfully train DMNNs under such constraints, although the generalization capabilities of our networks remain limited. Finally, we propose a hybrid network architecture combining linear and morphological layers, showing empirically that the inclusion of morphological layers significantly accelerates the convergence of gradient descent with large batches.
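
For readers unfamiliar with morphological layers, the basic building block is a max-plus analogue of a linear layer. The sketch below implements dilation and, following the paper's observation, keeps an activation between layers; the layer sizes and surrounding architecture are illustrative assumptions, not the authors' proposed networks.

```python
import torch
import torch.nn as nn

class Dilation(nn.Module):
    """Max-plus 'linear' layer: y_j = max_i (x_i + w_ij).
    This is the basic morphological dilation used as a DMNN building block."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w = nn.Parameter(0.01 * torch.randn(in_features, out_features))

    def forward(self, x):
        # broadcast to (batch, in, out), then reduce with max over inputs;
        # the max is subdifferentiable, so standard backprop applies
        return (x.unsqueeze(-1) + self.w).amax(dim=1)

# activations between morphological layers, per the paper's finding
net = nn.Sequential(Dilation(784, 128), nn.ReLU(), Dilation(128, 10))
print(net(torch.randn(32, 784)).shape)  # torch.Size([32, 10])
```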

[LG-54] Enabling Group Fairness in Graph Unlearning via Bi-level Debiasing

链接: https://arxiv.org/abs/2505.09702
作者: Yezi Liu,Prathyush Poduval,Wenjun Huang,Yang Ni,Hanning Chen,Mohsen Imani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph unlearning is a crucial approach for protecting user privacy by erasing the influence of user data on trained graph models. Recent developments in graph unlearning methods have primarily focused on maintaining model prediction performance while removing user information. However, we have observed that when user information is deleted from the model, the prediction distribution across different sensitive groups often changes. Furthermore, graph models are shown to be prone to amplifying biases, making the study of fairness in graph unlearning particularly important. This raises the question: Does graph unlearning actually introduce bias? Our findings indicate that the predictions of post-unlearning models become highly correlated with sensitive attributes, confirming the introduction of bias in the graph unlearning process. To address this issue, we propose a fair graph unlearning method, FGU. To guarantee privacy, FGU trains shard models on partitioned subgraphs, unlearns the requested data from the corresponding subgraphs, and retrains the shard models on the modified subgraphs. To ensure fairness, FGU employs a bi-level debiasing process: it first enables shard-level fairness by incorporating a fairness regularizer in the shard model retraining, and then achieves global-level fairness by aligning all shard models to minimize global disparity. Our experiments demonstrate that FGU achieves superior fairness while maintaining privacy and accuracy. Additionally, FGU is robust to diverse unlearning requests, ensuring fairness and utility performance across various data distributions.

[LG-55] Analog Foundation Models

链接: https://arxiv.org/abs/2505.09663
作者: Julian Büchel,Iason Chalas,Giovanni Acampa,An Chen,Omobayode Fagbohungbe,Sidney Tsai,Kaoutar El Maghraoui,Manuel Le Gallo,Abbas Rahimi,Abu Sebastian
类目: Machine Learning (cs.LG)
*备注: 43 pages, 8 figures, under review

点击查看摘要

Abstract:Analog in-memory computing (AIMC) is a promising compute paradigm to improve speed and power efficiency of neural network inference beyond the limits of conventional von Neumann-based architectures. However, AIMC introduces fundamental challenges such as noisy computations and strict constraints on input and output quantization. Because of these constraints and imprecisions, off-the-shelf LLMs are not able to achieve 4-bit-level performance when deployed on AIMC-based hardware. While researchers previously investigated recovering this accuracy gap on small, mostly vision-based models, a generic method applicable to LLMs pre-trained on trillions of tokens does not yet exist. In this work, we introduce a general and scalable method to robustly adapt LLMs for execution on noisy, low-precision analog hardware. Our approach enables state-of-the-art models, including Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct, to retain performance comparable to 4-bit weight, 8-bit activation baselines, despite the presence of analog noise and quantization constraints. Additionally, we show that as a byproduct of our training methodology, analog foundation models can be quantized for inference on low-precision digital hardware. Finally, we show that our models also benefit from test-time compute scaling, showing better scaling behavior than models trained with 4-bit weight and 8-bit static input quantization. Our work bridges the gap between high-capacity LLMs and efficient analog hardware, offering a path toward energy-efficient foundation models. Code is available at this https URL .
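
In the spirit of such hardware-aware training, a toy noisy layer might inject multiplicative Gaussian weight noise and quantize inputs and outputs with a straight-through estimator. The 5% noise level and 8-bit I/O below are assumptions for illustration, not a calibrated model of the paper's AIMC hardware.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant(x, bits=8, max_val=3.0):
    """Clip + uniform quantization with a straight-through estimator,
    mimicking the input/output (DAC/ADC) limits of an analog tile."""
    qmax = 2 ** (bits - 1) - 1
    scale = max_val / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    return x + (q - x).detach()   # identity gradient through the rounding

class NoisyAnalogLinear(nn.Linear):
    """Linear layer with multiplicative Gaussian weight noise during training."""
    def forward(self, x):
        w = self.weight
        if self.training:
            w = w * (1 + 0.05 * torch.randn_like(w))   # illustrative 5% noise
        return fake_quant(F.linear(fake_quant(x), w, self.bias))

layer = NoisyAnalogLinear(64, 64)
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```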

[LG-56] LAS: Loss-less ANN-SNN Conversion for Fully Spike-Driven Large Language Models

链接: https://arxiv.org/abs/2505.09659
作者: Long Chen,Xiaotian Song,Yanan Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spiking Large Language Models (LLMs) have emerged as an energy-efficient alternative to conventional LLMs through their event-driven computation. To effectively obtain spiking LLMs, researchers develop different ANN-to-SNN conversion methods by leveraging pre-trained ANN parameters while inheriting the energy efficiency of SNN. However, existing conversion methods struggle with extreme activation outliers and incompatible nonlinear operations of ANN-based LLMs. To address this, we propose a loss-less ANN-SNN conversion for fully spike-driven LLMs, termed LAS. Specifically, LAS introduces two novel neurons to convert the activation outlier and nonlinear operation of ANN-based LLMs. Moreover, LAS tailors the spike-equivalent Transformer components for spiking LLMs, which can ensure full spiking conversion without any loss of performance. Experimental results on six language models and two vision-language models demonstrate that LAS achieves loss-less conversion. Notably, on OPT-66B, LAS even improves accuracy by 2% on the WSC task. In addition, the parameter and ablation studies further verify the effectiveness of LAS. The source code is available at this https URL

[LG-57] On Unbiased Low-Rank Approximation with Minimum Distortion

Link: https://arxiv.org/abs/2505.09647
Authors: Leighton Pate Barnes,Stephen Cameron,Benjamin Howard
Subjects: Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
Comments:

Click to view abstract

Abstract:We describe an algorithm for sampling a low-rank random matrix $Q$ that best approximates a fixed target matrix $P \in \mathbb{C}^{n \times m}$ in the following sense: $Q$ is unbiased, i.e., $\mathbb{E}[Q] = P$; $\mathrm{rank}(Q) \leq r$; and $Q$ minimizes the expected Frobenius norm error $\mathbb{E}\|P - Q\|_F^2$. Our algorithm mirrors the solution to the efficient unbiased sparsification problem for vectors, except applied to the singular components of the matrix $P$. Optimality is proven by showing that our algorithm matches the error from an existing lower bound.
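
To make the vector-to-matrix analogy concrete, here is a hedged NumPy sketch: take the SVD and apply Bernoulli "keep with probability $p_i$, rescale by $1/p_i$" sparsification to the singular values. This preserves unbiasedness but only enforces the rank budget in expectation; the paper's algorithm enforces $\mathrm{rank}(Q) \leq r$ exactly and with minimum distortion:

```python
import numpy as np

def unbiased_low_rank(P, r, rng=None):
    """Sketch: unbiased sparsification applied to the singular values of P.
    Component i is kept with probability p_i and rescaled by 1/p_i, so
    E[Q] = P. The inclusion probabilities sum to roughly r, so the rank
    constraint holds only in expectation here (the paper's algorithm is
    exact; this simplified version is not)."""
    rng = rng or np.random.default_rng()
    U, s, Vh = np.linalg.svd(P, full_matrices=False)
    # Inclusion probabilities proportional to singular values, capped at 1.
    p = np.minimum(1.0, r * s / s.sum())
    keep = rng.random(s.shape) < p
    s_unbiased = np.where(keep, s / np.maximum(p, 1e-12), 0.0)
    return (U * s_unbiased) @ Vh
```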

[LG-58] Detecting Musical Deepfakes

Link: https://arxiv.org/abs/2505.09633
Authors: Nick Sunday
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Submitted as part of coursework at UT Austin. Accompanying code available at: this https URL

Click to view abstract

Abstract:The proliferation of Text-to-Music (TTM) platforms has democratized music creation, enabling users to effortlessly generate high-quality compositions. However, this innovation also presents new challenges to musicians and the broader music industry. This study investigates the detection of AI-generated songs using the FakeMusicCaps dataset by classifying audio as either deepfake or human. To simulate real-world adversarial conditions, tempo stretching and pitch shifting were applied to the dataset. Mel spectrograms were generated from the modified audio, then used to train and evaluate a convolutional neural network. In addition to presenting technical results, this work explores the ethical and societal implications of TTM platforms, arguing that carefully designed detection systems are essential to both protecting artists and unlocking the positive potential of generative AI in music.
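
A hedged sketch of the pipeline described: librosa for the tempo/pitch perturbations and log-mel features, and a small PyTorch CNN for the human-vs-deepfake decision. Architecture and augmentation parameters are illustrative, not the paper's exact configuration:

```python
import librosa
import numpy as np
import torch.nn as nn

def log_mel(path, sr=16000, n_mels=128):
    """Load audio, apply the abstract's adversarial-style augmentations,
    and return a log-mel spectrogram 'image'."""
    y, _ = librosa.load(path, sr=sr)
    y = librosa.effects.time_stretch(y, rate=1.1)          # tempo stretch
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # pitch shift
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

class DeepfakeCNN(nn.Module):
    """Minimal binary classifier over log-mel images (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 2),   # logits: human vs. deepfake
        )
    def forward(self, x):
        return self.net(x)
```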

[LG-59] Batched Nonparametric Bandits via k-Nearest Neighbor UCB

Link: https://arxiv.org/abs/2505.10498
Authors: Sakshi Arya
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: 25 pages, 6 figures

Click to view abstract

Abstract:We study sequential decision-making in batched nonparametric contextual bandits, where actions are selected over a finite horizon divided into a small number of batches. Motivated by constraints in domains such as medicine and marketing – where online feedback is limited – we propose a nonparametric algorithm that combines adaptive k-nearest neighbor (k-NN) regression with the upper confidence bound (UCB) principle. Our method, BaNk-UCB, is fully nonparametric, adapts to the context dimension, and is simple to implement. Unlike prior work relying on parametric or binning-based estimators, BaNk-UCB uses local geometry to estimate rewards and adaptively balances exploration and exploitation. We provide near-optimal regret guarantees under standard Lipschitz smoothness and margin assumptions, using a theoretically motivated batch schedule that balances regret across batches and achieves minimax-optimal rates. Empirical evaluations on synthetic and real-world datasets demonstrate that BaNk-UCB consistently outperforms binning-based baselines.
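
The core single-step rule is easy to sketch: estimate each arm's reward from the k nearest past contexts played with that arm, then add an exploration bonus that shrinks with the local sample size. A hedged NumPy illustration (BaNk-UCB additionally adapts k and runs on a batched schedule; the bonus form below is an assumption for illustration):

```python
import numpy as np

def knn_ucb_choose(context, history, k=10, alpha=1.0, n_arms=2):
    """Pick an arm by k-NN mean reward plus a UCB-style bonus.
    `history` maps arm -> (X: past contexts, y: observed rewards)."""
    ucb = np.full(n_arms, np.inf)          # unexplored arms get priority
    for a in range(n_arms):
        X, y = history.get(a, (None, None))
        if X is None or len(y) == 0:
            continue
        d = np.linalg.norm(X - context, axis=1)
        nn = np.argsort(d)[: min(k, len(y))]
        mean = y[nn].mean()
        # Bonus: sampling uncertainty from few neighbors, plus a bias term
        # from the neighborhood radius (illustrative, not the paper's form).
        ucb[a] = mean + alpha * np.sqrt(np.log(len(y) + 1) / len(nn)) + d[nn].max()
    return int(np.argmax(ucb))
```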

[LG-60] FlowVAT: Normalizing Flow Variational Inference with Affine-Invariant Tempering

Link: https://arxiv.org/abs/2505.10466
Authors: Juehang Qin,Shixiao Liang,Christopher Tunnell
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments: 10 pages, 5 figures, and 2 tables in main text, two appendices

Click to view abstract

Abstract:Multi-modal and high-dimensional posteriors present significant challenges for variational inference, causing mode-seeking behavior and collapse despite the theoretical expressiveness of normalizing flows. Traditional annealing methods require temperature schedules and hyperparameter tuning, falling short of the goal of truly black-box variational inference. We introduce FlowVAT, a conditional tempering approach for normalizing flow variational inference that addresses these limitations. Our method tempers both the base and target distributions simultaneously, maintaining affine-invariance under tempering. By conditioning the normalizing flow on temperature, we leverage overparameterized neural networks’ generalization capabilities to train a single flow representing the posterior across a range of temperatures. This preserves modes identified at higher temperatures when sampling from the variational posterior at T = 1 , mitigating standard variational methods’ mode-seeking behavior. In experiments with 2, 10, and 20 dimensional multi-modal distributions, FlowVAT outperforms traditional and adaptive annealing methods, finding more modes and achieving better ELBO values, particularly in higher dimensions where existing approaches fail. Our method requires minimal hyperparameter tuning and does not require an annealing schedule, advancing toward fully-automatic black-box variational inference for complicated posteriors.
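
For reference, the usual power-tempering convention (which the abstract's recovery at $T=1$ suggests) flattens the posterior as below; FlowVAT's twist, per the abstract, is to temper the base distribution as well and to condition the flow $q_\phi(x \mid T)$ on $T$, so one network covers a range of temperatures:

```latex
p_T(x) \;\propto\; p(x)^{1/T},
\qquad
\log p_T(x) = \tfrac{1}{T}\,\log p(x) + \mathrm{const},
\qquad T \ge 1 .
```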

[LG-61] Efficient MCMC Sampling with Expensive-to-Compute and Irregular Likelihoods

Link: https://arxiv.org/abs/2505.10448
Authors: Conor Rosato,Harvinder Lehal,Simon Maskell,Lee Devlin,Malcolm Strens
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 45 pages

Click to view abstract

Abstract:Bayesian inference with Markov Chain Monte Carlo (MCMC) is challenging when the likelihood function is irregular and expensive to compute. We explore several sampling algorithms that make use of subset evaluations to reduce computational overhead. We adapt the subset samplers for this setting where gradient information is not available or is unreliable. To achieve this, we introduce data-driven proxies in place of Taylor expansions and define a novel computation-cost aware adaptive controller. We undertake an extensive evaluation for a challenging disease modelling task and a configurable task with similar irregularity in the likelihood surface. We find our improved version of Hierarchical Importance with Nested Training Samples (HINTS), with adaptive proposals and a data-driven proxy, obtains the best sampling error in a fixed computational budget. We conclude that subset evaluations can provide cheap and naturally-tempered exploration, while a data-driven proxy can pre-screen proposals successfully in explored regions of the state space. These two elements combine through hierarchical delayed acceptance to achieve efficient, exact sampling.
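
The exactness claim rests on delayed acceptance: a cheap proxy screens proposals, and the expensive posterior corrects only the survivors. A minimal sketch for a symmetric proposal (HINTS layers a subset hierarchy and adaptive proposals on top of this; the proxy here is a stand-in for their data-driven one):

```python
import numpy as np

def delayed_acceptance_mh(x0, log_post, log_proxy, propose, n_steps, seed=0):
    """Two-stage (delayed-acceptance) Metropolis-Hastings with a symmetric
    proposal: stage 1 accepts/rejects with the cheap proxy; stage 2 corrects
    with the expensive posterior, so the chain targets exp(log_post) exactly."""
    rng = np.random.default_rng(seed)
    x, lp, lq = x0, log_post(x0), log_proxy(x0)
    chain = [x]
    for _ in range(n_steps):
        y = propose(x, rng)
        lq_y = log_proxy(y)                      # cheap evaluation
        if np.log(rng.random()) < lq_y - lq:     # stage-1 screen
            lp_y = log_post(y)                   # expensive evaluation
            # Stage-2 ratio: (pi(y)/pi(x)) / (proxy(y)/proxy(x)).
            if np.log(rng.random()) < (lp_y - lp) - (lq_y - lq):
                x, lp, lq = y, lp_y, lq_y
        chain.append(x)
    return chain
```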

[LG-62] Inferring entropy production in many-body systems using nonequilibrium MaxEnt

Link: https://arxiv.org/abs/2505.10444
Authors: Miguel Aguilera,Sosuke Ito,Artemy Kolchinsky
Subjects: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Neurons and Cognition (q-bio.NC)
Comments:

Click to view abstract

Abstract:We propose a method for inferring entropy production (EP) in high-dimensional stochastic systems, including many-body systems and non-Markovian systems with long memory. Standard techniques for estimating EP become intractable in such systems due to computational and statistical limitations. We infer trajectory-level EP and lower bounds on average EP by exploiting a nonequilibrium analogue of the Maximum Entropy principle, along with convex duality. Our approach uses only samples of trajectory observables (such as spatiotemporal correlation functions). It does not require reconstruction of high-dimensional probability distributions or rate matrices, nor any special assumptions such as discrete states or multipartite dynamics. It may be used to compute a hierarchical decomposition of EP, reflecting contributions from different kinds of interactions, and it has an intuitive physical interpretation as a thermodynamic uncertainty relation. We demonstrate its numerical performance on a disordered nonequilibrium spin model with 1000 spins and a large neural spike-train dataset.
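
The "thermodynamic uncertainty relation" the abstract alludes to is, in its textbook form, the following bound; one reading (an interpretation, not a quotation from the paper) is that their MaxEnt approach optimizes bounds of this flavor over trajectory observables:

```latex
\langle \sigma \rangle \;\ge\; \frac{2\,\langle J \rangle^{2}}{\operatorname{Var}(J)}
\qquad \text{for any time-integrated current } J,
```

so a current with small relative fluctuations certifies a minimum average entropy production.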

[LG-63] Estimating the number of household TV profiles based in customer behaviour using Gaussian mixture model averaging

Link: https://arxiv.org/abs/2505.10279
Authors: Gabriel R. Palma,Sally McClean,Brahim Allan,Zeeshan Tariq,Rafael A. Moral
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
Comments: 21 pages

Click to view abstract

Abstract:TV customers today face many choices from many live channels and on-demand services. Providing a personalised experience that saves customers time when discovering content is essential for TV providers. However, a reliable understanding of their behaviour and preferences is key. When creating personalised recommendations for TV, the biggest challenge is understanding viewing behaviour within households when multiple people are watching. The objective is to detect and combine individual profiles to make better-personalised recommendations for group viewing. Our challenge is that we have little explicit information about who is watching the devices at any time (individuals or groups). Also, we do not have a way to combine more than one individual profile to make better recommendations for group viewing. We propose a novel framework using a Gaussian mixture model averaging to obtain point estimates for the number of household TV profiles and a Bayesian random walk model to introduce uncertainty. We applied our approach using data from real customers whose TV-watching data totalled approximately half a million observations. Our results indicate that combining our framework with the selected features provides a means to estimate the number of household TV profiles and their characteristics, including shifts over time and quantification of uncertainty.
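
A hedged scikit-learn sketch of the model-averaging step: fit GMMs over a range of component counts, turn BIC scores into Akaike-style weights, and average the component counts. The paper additionally adds a Bayesian random-walk model for uncertainty over time, which is not reproduced here:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_n_profiles(X, k_max=8, seed=0):
    """Fit GMMs with 1..k_max components and return a BIC-weighted
    (model-averaged) estimate of the number of household profiles."""
    ks = np.arange(1, k_max + 1)
    bics = np.array([
        GaussianMixture(n_components=k, random_state=seed).fit(X).bic(X)
        for k in ks
    ])
    delta = bics - bics.min()
    w = np.exp(-0.5 * delta)          # standard BIC-to-weight conversion
    w /= w.sum()
    return float((w * ks).sum()), dict(zip(ks.tolist(), w.round(3).tolist()))
```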

[LG-64] One-Stage Top-k Learning-to-Defer: Score-Based Surrogates with Theoretical Guarantees

Link: https://arxiv.org/abs/2505.10160
Authors: Yannis Montreuil,Axel Carlier,Lai Xing Ng,Wei Tsang Ooi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We introduce the first one-stage Top-$k$ Learning-to-Defer framework, which unifies prediction and deferral by learning a shared score-based model that selects the $k$ most cost-effective entities (labels or experts) per input. While existing one-stage L2D methods are limited to deferring to a single expert, our approach jointly optimizes prediction and deferral across multiple entities through a single end-to-end objective. We define a cost-sensitive loss and derive a novel convex surrogate that is independent of the cardinality parameter $k$, enabling generalization across Top-$k$ regimes without retraining. Our formulation recovers the Top-1 deferral policy of prior score-based methods as a special case, and we prove that our surrogate is both Bayes-consistent and $\mathcal{H}$-consistent under mild assumptions. We further introduce an adaptive variant, Top-$k(x)$, which dynamically selects the number of consulted entities per input to balance predictive accuracy and consultation cost. Experiments on CIFAR-10 and SVHN confirm that our one-stage Top-$k$ method strictly outperforms Top-1 deferral, while Top-$k(x)$ achieves superior accuracy-cost trade-offs by tailoring allocations to input complexity.
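
The deferral rule itself is easy to picture. A hedged sketch where class labels and experts share one score vector and consultation costs are subtracted before taking the top $k$; the function and cost model are illustrative, not the paper's surrogate loss:

```python
import torch

def top_k_defer(scores, costs, k):
    """Consult the k entities with the best cost-adjusted scores.
    Entities 0..C-1 are class labels; the rest are experts."""
    adjusted = scores - costs           # trade predicted utility against cost
    return torch.topk(adjusted, k).indices

# Example: 3 class labels + 2 experts, consult the top 2 entities.
scores = torch.tensor([0.2, 1.5, 0.7, 1.1, 0.9])
costs  = torch.tensor([0.0, 0.0, 0.0, 0.5, 0.3])   # experts cost extra
print(top_k_defer(scores, costs, k=2))
```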

[LG-65] Path Gradients after Flow Matching

Link: https://arxiv.org/abs/2505.10139
Authors: Lorenz Vaitl,Leon Klein
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Boltzmann Generators have emerged as a promising machine learning tool for generating samples from equilibrium distributions of molecular systems using Normalizing Flows and importance weighting. Recently, Flow Matching has helped speed up Continuous Normalizing Flows (CNFs), scale them to more complex molecular systems, and minimize the length of the flow integration trajectories. We investigate the benefits of using path gradients to fine-tune CNFs initially trained by Flow Matching, in the setting where a target energy is known. Our experiments show that this hybrid approach yields up to a threefold increase in sampling efficiency for molecular systems, all while using the same model, a similar computational budget and without the need for additional sampling. Furthermore, by measuring the length of the flow trajectories during fine-tuning, we show that path gradients largely preserve the learned structure of the flow.

[LG-66] A Scalable Gradient-Based Optimization Framework for Sparse Minimum-Variance Portfolio Selection

Link: https://arxiv.org/abs/2505.10099
Authors: Sarat Moka,Matias Quiroz,Vali Asimit,Samuel Muller
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Portfolio Management (q-fin.PM)
Comments:

Click to view abstract

Abstract:Portfolio optimization involves selecting asset weights to minimize a risk-reward objective, such as the portfolio variance in the classical minimum-variance framework. Sparse portfolio selection extends this by imposing a cardinality constraint: only $k$ assets from a universe of $p$ may be included. The standard approach models this problem as a mixed-integer quadratic program and relies on commercial solvers to find the optimal solution. However, the computational costs of such methods increase exponentially with $k$ and $p$, making them too slow for problems of even moderate size. We propose a fast and scalable gradient-based approach that transforms the combinatorial sparse selection problem into a constrained continuous optimization task via Boolean relaxation, while preserving equivalence with the original problem on the set of binary points. Our algorithm employs a tunable parameter that transmutes the auxiliary objective from a convex to a concave function. This allows a stable convex starting point, followed by a controlled path toward a sparse binary solution as the tuning parameter increases and the objective moves toward concavity. In practice, our method matches commercial solvers in asset selection for most instances and, in rare instances, the solution differs by a few assets whilst showing a negligible error in portfolio variance.
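
A rough PyTorch rendering of the Boolean-relaxation idea, for intuition only: relaxed selection variables $s \in (0,1)$ gate the portfolio, a budget penalty keeps $\sum_i s_i \approx k$, and an annealed $\gamma \sum_i s_i(1-s_i)$ term pushes $s$ toward binary values as $\gamma$ grows, mimicking the convex-to-concave transmutation. Penalty weights, schedule, and parameterization are all assumptions, not the paper's algorithm:

```python
import torch

def sparse_min_variance(Sigma, k, steps=2000, lr=0.05, gamma_max=5.0):
    """Gradient-based sketch of sparse minimum-variance selection via
    Boolean relaxation. Returns indices of the k selected assets."""
    p = Sigma.shape[0]
    logits = torch.zeros(p, requires_grad=True)   # selection variables
    raw_w = torch.zeros(p, requires_grad=True)    # weight parameters
    opt = torch.optim.Adam([logits, raw_w], lr=lr)
    for t in range(steps):
        gamma = gamma_max * t / steps             # anneal toward concavity
        s = torch.sigmoid(logits)                 # relaxed selection in (0,1)
        w = torch.softmax(raw_w, dim=0) * s       # gated, nonnegative weights
        w = w / w.sum().clamp(min=1e-8)
        var = w @ Sigma @ w                       # portfolio variance
        loss = var + (s.sum() - k) ** 2 + gamma * (s * (1 - s)).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.topk(torch.sigmoid(logits), k).indices
```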

[LG-67] Role of scrambling and noise in temporal information processing with quantum systems

Link: https://arxiv.org/abs/2505.10080
Authors: Weijie Xiong,Zoë Holmes,Armando Angrisani,Yudai Suzuki,Thiparat Chotibut,Supanut Thanasilp
Subjects: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Comments: 14+35 pages, 6+5 figures, 1 table

Click to view abstract

Abstract:Scrambling quantum systems have been demonstrated as effective substrates for temporal information processing. While their role in providing rich feature maps has been widely studied, a theoretical understanding of their performance in temporal tasks is still lacking. Here we consider a general quantum reservoir processing framework that captures a broad range of physical computing models with quantum systems. We examine the scalability and memory retention of the model with scrambling reservoirs modelled by high-order unitary designs in both noiseless and noisy settings. In the former regime, we show that measurement readouts become exponentially concentrated with increasing reservoir size, yet strikingly do not worsen with the reservoir iterations. Thus, while repeatedly reusing a small scrambling reservoir with quantum data might be viable, scaling up the problem size deteriorates generalization unless one can afford an exponential shot overhead. In contrast, the memory of early inputs and initial states decays exponentially in both reservoir size and reservoir iterations. In the noisy regime, we also prove exponential memory decays with iterations for local noisy channels. Proving these results required us to introduce new proof techniques for bounding concentration in temporal quantum learning models.

[LG-68] Who Said What WSW 2.0? Enhanced Automated Analysis of Preschool Classroom Speech

Link: https://arxiv.org/abs/2505.09972
Authors: Anchen Sun,Tiantian Feng,Gabriela Gutierrez,Juan J Londono,Anfeng Xu,Batya Elbaum,Shrikanth Narayanan,Lynn K Perry,Daniel S Messinger
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures, 5 tables

Click to view abstract

Abstract:This paper introduces an automated framework WSW2.0 for analyzing vocal interactions in preschool classrooms, enhancing both accuracy and scalability through the integration of wav2vec2-based speaker classification and Whisper (large-v2 and large-v3) speech transcription. A total of 235 minutes of audio recordings (160 minutes from 12 children and 75 minutes from 5 teachers), were used to compare system outputs to expert human annotations. WSW2.0 achieves a weighted F1 score of .845, accuracy of .846, and an error-corrected kappa of .672 for speaker classification (child vs. teacher). Transcription quality is moderate to high with word error rates of .119 for teachers and .238 for children. WSW2.0 exhibits relatively high absolute agreement intraclass correlations (ICC) with expert transcriptions for a range of classroom language features. These include teacher and child mean utterance length, lexical diversity, question asking, and responses to questions and other utterances, which show absolute agreement intraclass correlations between .64 and .98. To establish scalability, we apply the framework to an extensive dataset spanning two years and over 1,592 hours of classroom audio recordings, demonstrating the framework’s robustness for broad real-world applications. These findings highlight the potential of deep learning and natural language processing techniques to revolutionize educational research by providing accurate measures of key features of preschool classroom speech, ultimately guiding more effective intervention strategies and supporting early childhood language development.
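
In Hugging Face terms, the two stages look roughly like the sketch below. The Whisper checkpoint is a real released model; the speaker-classification checkpoint name is a placeholder for the authors' fine-tuned wav2vec2 classifier, which the abstract does not name:

```python
from transformers import pipeline

# Stage 1 -- speaker classification (child vs. teacher). The checkpoint
# path is hypothetical, standing in for the paper's fine-tuned model.
speaker_clf = pipeline("audio-classification",
                       model="path/to/wav2vec2-child-teacher")  # hypothetical

# Stage 2 -- transcription with Whisper (large-v2 is a real checkpoint).
asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-large-v2")

segment = "classroom_segment.wav"
speaker = speaker_clf(segment)[0]["label"]   # e.g. "child" or "teacher"
text = asr(segment)["text"]
print(f"{speaker}: {text}")
```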

[LG-69] LatticeVision: Image to Image Networks for Modeling Non-Stationary Spatial Data

Link: https://arxiv.org/abs/2505.09803
Authors: Antony Sikorski,Michael Ivanitskiy,Nathan Lenssen,Douglas Nychka,Daniel McKenzie
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In many scientific and industrial applications, we are given a handful of instances (a ‘small ensemble’) of a spatially distributed quantity (a ‘field’) but would like to acquire many more. For example, a large ensemble of global temperature sensitivity fields from a climate model can help farmers, insurers, and governments plan appropriately. When acquiring more data is prohibitively expensive – as is the case with climate models – statistical emulation offers an efficient alternative for simulating synthetic yet realistic fields. However, parameter inference using maximum likelihood estimation (MLE) is computationally prohibitive, especially for large, non-stationary fields. Thus, many recent works train neural networks to estimate parameters given spatial fields as input, sidestepping MLE completely. In this work we focus on a popular class of parametric, spatially autoregressive (SAR) models. We make a simple yet impactful observation; because the SAR parameters can be arranged on a regular grid, both inputs (spatial fields) and outputs (model parameters) can be viewed as images. Using this insight, we demonstrate that image-to-image (I2I) networks enable faster and more accurate parameter estimation for a class of non-stationary SAR models with unprecedented complexity.

[LG-70] Pure Component Property Estimation Framework Using Explainable Machine Learning Methods

Link: https://arxiv.org/abs/2505.09783
Authors: Jianfeng Jiao,Xi Gao,Jie Li
Subjects: Applications (stat.AP); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Accurate prediction of pure component physicochemical properties is crucial for process integration, multiscale modeling, and optimization. In this work, an enhanced framework for pure component property prediction using explainable machine learning methods is proposed. In this framework, the molecular representation method based on the connectivity matrix effectively considers atomic bonding relationships to automatically generate features. The supervised machine learning model random forest is applied for feature ranking and pooling. The adjusted $R^2$ is introduced to penalize the inclusion of additional features, providing an assessment of the true contribution of features. The prediction results for normal boiling point (Tb), liquid molar volume, critical temperature (Tc), and critical pressure (Pc) obtained using Artificial Neural Network and Gaussian Process Regression models confirm the accuracy of the molecular representation method. Comparison with GC-based models shows that the root-mean-square error on the test set can be reduced by up to 83.8%. To enhance the interpretability of the model, a feature analysis method based on Shapley values is employed to determine the contribution of each feature to the property predictions. The results indicate that using the feature pooling method reduces the number of features from 13,316 to 100 without compromising model accuracy. The feature analysis results for Tb, Tc, and Pc confirm that different molecular properties are influenced by different structural features, aligning with mechanistic interpretations. In conclusion, the proposed framework is demonstrated to be feasible and provides a solid foundation for mixture component reconstruction and process integration modelling.
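
Two of the framework's ingredients are standard and easy to state: the adjusted $R^2$ used to penalize extra features, and random-forest importance ranking for feature pooling. A hedged scikit-learn sketch (hyperparameters assumed):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1): it penalizes
    adding features (p) relative to the sample size (n)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def pool_features(X, y, n_keep=100, seed=0):
    """Rank automatically generated features by random-forest importance
    and keep the top n_keep (the paper reports 13,316 -> 100)."""
    rf = RandomForestRegressor(n_estimators=300, random_state=seed).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]
    return order[:n_keep]
```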

[LG-71] Learning Multi-Attribute Differential Graphs with Non-Convex Penalties

Link: https://arxiv.org/abs/2505.09748
Authors: Jitendra K Tugnait
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 14 pages, 1 figure, 2 tables, published in IEEE Access, pp. 67065-67078, 2025

Click to view abstract

Abstract:We consider the problem of estimating differences in two multi-attribute Gaussian graphical models (GGMs) which are known to have similar structure, using a penalized D-trace loss function with non-convex penalties. The GGM structure is encoded in its precision (inverse covariance) matrix. Existing methods for multi-attribute differential graph estimation are based on a group lasso penalized loss function. In this paper, we consider a penalized D-trace loss function with non-convex (log-sum and smoothly clipped absolute deviation (SCAD)) penalties. Two proximal gradient descent methods are presented to optimize the objective function. Theoretical analysis establishing sufficient conditions for consistency in support recovery, convexity and estimation in high-dimensional settings is provided. We illustrate our approaches with numerical examples based on synthetic and real data.
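
For reference, the two non-convex penalties named in the abstract in their standard forms (applied elementwise to entries of the difference matrix; $\epsilon$, $a$, and $\lambda$ are the usual tuning constants):

```latex
\text{Log-sum: } p_{\lambda}(t) = \lambda \log\!\left(1 + \frac{|t|}{\epsilon}\right),
\qquad \epsilon > 0;
```

```latex
\text{SCAD (via its derivative, } t \ge 0,\ a > 2,\ a = 3.7 \text{ customary):}
\quad
p'_{\lambda}(t) = \lambda\left[\mathbf{1}(t \le \lambda)
  + \frac{(a\lambda - t)_{+}}{(a-1)\lambda}\,\mathbf{1}(t > \lambda)\right].
```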

[LG-72] Neural models for prediction of spatially patterned phase transitions: methods and challenges

Link: https://arxiv.org/abs/2505.09718
Authors: Daniel Dylewsky,Sonia Kéfi,Madhur Anand,Chris T. Bauch
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Dryland vegetation ecosystems are known to be susceptible to critical transitions between alternative stable states when subjected to external forcing. Such transitions are often discussed through the framework of bifurcation theory, but the spatial patterning of vegetation, which is characteristic of drylands, leads to dynamics that are much more complex and diverse than local bifurcations. Recent methodological developments in Early Warning Signal (EWS) detection have shown promise in identifying dynamical signatures of oncoming critical transitions, with particularly strong predictive capabilities being demonstrated by deep neural networks. However, a machine learning model trained on synthetic examples is only useful if it can effectively transfer to a test case of practical interest. These models’ capacity to generalize in this manner has been demonstrated for bifurcation transitions, but it is not as well characterized for high-dimensional phase transitions. This paper explores the successes and shortcomings of neural EWS detection for spatially patterned phase transitions, and shows how these models can be used to gain insight into where and how EWS-relevant information is encoded in spatiotemporal dynamics. A few paradigmatic test systems are used to illustrate how the capabilities of such models can be probed in a number of ways, with particular attention to the performances of a number of proposed statistical indicators for EWS and to the supplementary task of distinguishing between abrupt and continuous transitions. Results reveal that model performance often changes dramatically when training and test data sources are interchanged, which offers new insight into the criteria for model generalization.

[LG-73] Forests for Differences: Robust Causal Inference Beyond Parametric DiD

Link: https://arxiv.org/abs/2505.09706
Authors: Hugo Gobato Souto,Francisco Louzada Neto
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:This paper introduces the Difference-in-Differences Bayesian Causal Forest (DiD-BCF), a novel non-parametric model addressing key challenges in DiD estimation, such as staggered adoption and heterogeneous treatment effects. DiD-BCF provides a unified framework for estimating Average (ATE), Group-Average (GATE), and Conditional Average Treatment Effects (CATE). A core innovation, its Parallel Trends Assumption (PTA)-based reparameterization, enhances estimation accuracy and stability in complex panel data settings. Extensive simulations demonstrate DiD-BCF’s superior performance over established benchmarks, particularly under non-linearity, selection biases, and effect heterogeneity. Applied to U.S. minimum wage policy, the model uncovers significant conditional treatment effect heterogeneity related to county population, insights obscured by traditional methods. DiD-BCF offers a robust and versatile tool for more nuanced causal inference in modern DiD applications.

[LG-74] On Measuring Intrinsic Causal Attributions in Deep Neural Networks

Link: https://arxiv.org/abs/2505.09660
Authors: Saptarshi Saha,Dhruv Vansraj Rathore,Soumadeep Saha,Utpal Garain,David Doermann
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Quantifying the causal influence of input features within neural networks has become a topic of increasing interest. Existing approaches typically assess direct, indirect, and total causal effects. This work treats NNs as structural causal models (SCMs) and extends our focus to include intrinsic causal contributions (ICC). We propose an identifiable generative post-hoc framework for quantifying ICC. We also draw a relationship between ICC and Sobol’ indices. Our experiments on synthetic and real-world datasets demonstrate that ICC generates more intuitive and reliable explanations compared to existing global explanation techniques.
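
The Sobol' connection is to the classical first-order index, which attributes a share of the output variance to a single input:

```latex
S_i \;=\; \frac{\operatorname{Var}_{X_i}\!\left(\mathbb{E}\left[\,Y \mid X_i\,\right]\right)}{\operatorname{Var}(Y)},
\qquad 0 \le S_i \le 1 .
```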

[LG-75] A Computational Approach to Epilepsy Treatment: An AI-optimized Global Natural Product Prescription System

Link: https://arxiv.org/abs/2505.09643
Authors: Zhixuan Wang
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Epilepsy is a prevalent neurological disease with millions of patients worldwide. Many patients have turned to alternative medicine due to the limited efficacy and side effects of conventional antiepileptic drugs. In this study, we developed a computational approach to optimize herbal epilepsy treatment through AI-driven analysis of global natural products and statistically validated randomized controlled trials (RCTs). Our intelligent prescription system combines machine learning (ML) algorithms for herb-efficacy characterization, Bayesian optimization for personalized dosing, and meta-analysis of RCTs for evidence-based recommendations. The system analyzed 1,872 natural compounds from traditional Chinese medicine (TCM), Ayurveda, and ethnopharmacological databases, integrating their bioactive properties with clinical outcomes from 48 RCTs covering 48 epilepsy conditions (n=5,216). Using LASSO regression and SHAP value analysis, we identified 17 high-efficacy herbs (e.g., Gastrodia elata, Withania somnifera), showing significant seizure reduction (p < 0.01, Cohen's d = 0.89) with statistical significance confirmed by multiple testing (p < 0.001). A randomized double-blind validation trial (n=120) demonstrated 28.5% greater seizure frequency reduction with AI-optimized herbal prescriptions compared to conventional protocols (95% CI: 18.7-37.3%, p=0.003).

Information Retrieval

[IR-0] Beyond Pairwise Learning-To-Rank At Airbnb

Link: https://arxiv.org/abs/2505.09795
Authors: Malay Haldar,Daochen Zha,Huiji Gao,Liwei He,Sanjeev Katariya
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:There are three fundamental asks from a ranking algorithm: it should scale to handle a large number of items, sort items accurately by their utility, and impose a total order on the items for logical consistency. But here's the catch: no algorithm can achieve all three at the same time. We call this limitation the SAT theorem for ranking algorithms. Given the dilemma, how can we design a practical system that meets user needs? Our current work at Airbnb provides an answer, with a working solution deployed at scale. We start with pairwise learning-to-rank (LTR) models, the bedrock of search ranking tech stacks today. They scale linearly with the number of items ranked and perform strongly on metrics like NDCG by learning from pairwise comparisons. They are at a sweet spot of performance vs. cost, making them an ideal choice for several industrial applications. However, they have a drawback: by ignoring interactions between items, they compromise on accuracy. To improve accuracy, we create a "true" pairwise LTR model, one that captures interactions between items during pairwise comparisons. But accuracy comes at the expense of scalability and total order, and we discuss strategies to counter these challenges. For greater accuracy, we take each item in the search result and compare it against the rest of the items along two dimensions: (1) Superiority: How strongly do searchers prefer the given item over the remaining ones? (2) Similarity: How similar is the given item to all the other items? This forms the basis of our "all-pairwise" LTR framework, which factors in interactions across all items at once. Looking at items on the search result page all together, superiority and similarity combined, gives us a deeper understanding of what searchers truly want. We quantify the resulting improvements in searcher experience through offline and online experiments at Airbnb.
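
For readers outside ranking, the pairwise LTR baseline the post builds on reduces to a logistic loss on score differences. A minimal PyTorch sketch (names illustrative; `score_fn` scores each item independently, which is exactly the scalability-for-accuracy trade the post describes):

```python
import torch.nn.functional as F

def pairwise_ltr_loss(score_fn, booked, others):
    """Classic pairwise LTR objective: push the score of the booked/engaged
    listing above each competing listing in the same search result.
    softplus(s_neg - s_pos) equals -log sigmoid(s_pos - s_neg)."""
    s_pos = score_fn(booked)    # shape: (1,)
    s_neg = score_fn(others)    # shape: (n_others,)
    return F.softplus(s_neg - s_pos).mean()
```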

[IR-1] The Impact of International Collaborations with Highly Publishing Countries in Computer Science

Link: https://arxiv.org/abs/2505.09776
Authors: Alberto Gomez Espes,Michael Faerber,Adam Jatowt
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:This paper analyzes international collaborations in Computer Science, focusing on three major players: China, the European Union, and the United States. Drawing from a comprehensive literature review, we examine collaboration patterns, research impact, retraction rates, and the role of the Development Index in shaping research outcomes. Our findings show that while China, the EU, and the US lead global research efforts, other regions are narrowing the gap in publication volume. Collaborations involving these key regions tend to have lower retraction rates, reflecting stronger adherence to scientific standards. We also find that countries with a Very High Development Index contribute to research with higher citation rates and fewer retractions. Overall, this study highlights the value of international collaboration and the importance of inclusive, ethical practices in advancing global research in Computer Science.

Attachment Download

Click to download today's complete paper list