This blog post presents the latest paper list retrieved from Arxiv.org on 2025-05-12. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.

Note: The daily paper data is retrieved from Arxiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-05-12)

A total of 376 papers are updated today, including:

  • Natural Language Processing: 38 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 88 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 96 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 128 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Neuro-Symbolic Concepts

【Quick Read】: This paper addresses the limitations of conventional agents in continual learning and flexible reasoning, particularly their lack of data efficiency, compositional generalization, and zero-shot transfer across cross-domain tasks. The key to the solution is a concept-centric paradigm that uses neuro-symbolic concepts as fundamental building blocks; these concepts are compositional and are typed and modeled with a combination of symbolic programs and neural network representations, enabling efficient learning, concept recombination, and flexible problem solving across multiple domains.

Link: https://arxiv.org/abs/2505.06191
Authors: Jiayuan Mao, Joshua B. Tenenbaum, Jiajun Wu
Affiliations: Massachusetts Institute of Technology; Stanford University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: To appear in Communications of the ACM

Abstract:This article presents a concept-centric paradigm for building agents that can learn continually and reason flexibly. The concept-centric agent utilizes a vocabulary of neuro-symbolic concepts. These concepts, such as object, relation, and action concepts, are grounded on sensory inputs and actuation outputs. They are also compositional, allowing for the creation of novel concepts through their structural combination. To facilitate learning and reasoning, the concepts are typed and represented using a combination of symbolic programs and neural network representations. Leveraging such neuro-symbolic concepts, the agent can efficiently learn and recombine them to solve various tasks across different domains, ranging from 2D images, videos, 3D scenes, and robotic manipulation tasks. This concept-centric framework offers several advantages, including data efficiency, compositional generalization, continual learning, and zero-shot transfer.

[NLP-1] Query-driven Document-level Scientific Evidence Extraction from Biomedical Studies

【Quick Read】: This paper targets the task of extracting scientific evidence from biomedical studies to support clinical research questions, especially clinical questions with conflicting evidence. The key to the solution is the construction of a dataset named CochraneForest and a retrieval-augmented generation framework called URCA (Uniform Retrieval Clustered Augmentation), which combines retrieval and generation to substantially improve evidence extraction.

Link: https://arxiv.org/abs/2505.06186
Authors: Massimiliano Pronesti, Joao Bettencourt-Silva, Paul Flanagan, Alessandra Pascale, Oisin Redmond, Anya Belz, Yufang Hou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Extracting scientific evidence from biomedical studies for clinical research questions (e.g., Does stem cell transplantation improve quality of life in patients with medically refractory Crohn’s disease compared to placebo?) is a crucial step in synthesising biomedical evidence. In this paper, we focus on the task of document-level scientific evidence extraction for clinical questions with conflicting evidence. To support this task, we create a dataset called CochraneForest, leveraging forest plots from Cochrane systematic reviews. It comprises 202 annotated forest plots, associated clinical research questions, full texts of studies, and study-specific conclusions. Building on CochraneForest, we propose URCA (Uniform Retrieval Clustered Augmentation), a retrieval-augmented generation framework designed to tackle the unique challenges of evidence extraction. Our experiments show that URCA outperforms the best existing methods by up to 10.3% in F1 score on this task. However, the results also underscore the complexity of CochraneForest, establishing it as a challenging testbed for advancing automated evidence synthesis systems.

[NLP-2] From Millions of Tweets to Actionable Insights: Leveraging LLMs for User Profiling AAAI

【Quick Read】: This paper addresses the shortcomings of existing social media user profiling techniques in transferability, interpretability, dependence on large labeled datasets, and the constraints of predefined categories. The key to the solution is an LLM-based method that uses domain-defining statements as the foundation for profiling: a two-stage pipeline performs semi-supervised filtering and then generates abstractive (synthesized descriptions) and extractive (representative tweet selection) user profiles, reducing the reliance on large labeled datasets while improving interpretability and cross-domain adaptability.

Link: https://arxiv.org/abs/2505.06184
Authors: Vahid Rahimzadeh, Ali Hamzehpour, Azadeh Shakery, Masoud Asadpour
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at MisD @ AAAI ICWSM 2025

Abstract:Social media user profiling through content analysis is crucial for tasks like misinformation detection, engagement prediction, hate speech monitoring, and user behavior modeling. However, existing profiling techniques, including tweet summarization, attribute-based profiling, and latent representation learning, face significant limitations: they often lack transferability, produce non-interpretable features, require large labeled datasets, or rely on rigid predefined categories that limit adaptability. We introduce a novel large language model (LLM)-based approach that leverages domain-defining statements, which serve as key characteristics outlining the important pillars of a domain as foundations for profiling. Our two-stage method first employs semi-supervised filtering with a domain-specific knowledge base, then generates both abstractive (synthesized descriptions) and extractive (representative tweet selections) user profiles. By harnessing LLMs’ inherent knowledge with minimal human validation, our approach is adaptable across domains while reducing the need for large labeled datasets. Our method generates interpretable natural language user profiles, condensing extensive user data into a scale that unlocks LLMs’ reasoning and knowledge capabilities for downstream social network tasks. We contribute a Persian political Twitter (X) dataset and an LLM-based evaluation framework with human validation. Experimental results show our method significantly outperforms state-of-the-art LLM-based and traditional methods by 9.8%, demonstrating its effectiveness in creating flexible, adaptable, and interpretable user profiles.

[NLP-3] Estimating Quality in Therapeutic Conversations: A Multi-Dimensional Natural Language Processing Framework

【Quick Read】: This paper aims to make the assessment of client-therapist engagement quality in counseling objective, using natural language processing (NLP) to perform multi-dimensional analysis of session transcripts and accurately classify engagement quality. The key to the solution is a multi-dimensional NLP framework over textual transcripts that extracts 42 features covering conversational dynamics, semantic similarity, sentiment classification, and question detection, and then trains and tunes several classifiers (such as Random Forest, CatBoost, and Support Vector Machines). The framework achieves strong classification performance on both balanced data and SMOTE-Tomek-augmented data, indicating good generalization and potential for future scaling.

Link: https://arxiv.org/abs/2505.06151
Authors: Alice Rueda, Argyrios Perivolaris, Niloy Roy, Dylan Weston, Sarmed Shaya, Zachary Cote, Martin Ivanov, Bazen G. Teferra, Yuqi Wu, Sirisha Rambhatla, Divya Sharma, Andrew Greenshaw, Rakesh Jetly, Yanbo Zhang, Bo Cao, Reza Samavi, Sridhar Krishnan, Venkat Bhat
Affiliations: St. Michael’s Hospital, Unity Health Toronto; Toronto Metropolitan University; University of Waterloo; York University; University of Alberta; Institute of Mental Health Research, University of Ottawa
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 4 figures, 7 tables

Abstract:Engagement between client and therapist is a critical determinant of therapeutic success. We propose a multi-dimensional natural language processing (NLP) framework that objectively classifies engagement quality in counseling sessions based on textual transcripts. Using 253 motivational interviewing transcripts (150 high-quality, 103 low-quality), we extracted 42 features across four domains: conversational dynamics, semantic similarity as topic alignment, sentiment classification, and question detection. Classifiers, including Random Forest (RF), Cat-Boost, and Support Vector Machines (SVM), were hyperparameter tuned and trained using a stratified 5-fold cross-validation and evaluated on a holdout test set. On balanced (non-augmented) data, RF achieved the highest classification accuracy (76.7%), and SVM achieved the highest AUC (85.4%). After SMOTE-Tomek augmentation, performance improved significantly: RF achieved up to 88.9% accuracy, 90.0% F1-score, and 94.6% AUC, while SVM reached 81.1% accuracy, 83.1% F1-score, and 93.6% AUC. The augmented data results reflect the potential of the framework in future larger-scale applications. Feature contribution revealed conversational dynamics and semantic similarity between clients and therapists were among the top contributors, led by words uttered by the client (mean and standard deviation). The framework was robust across the original and augmented datasets and demonstrated consistent improvements in F1 scores and recall. While currently text-based, the framework supports future multimodal extensions (e.g., vocal tone, facial affect) for more holistic assessments. This work introduces a scalable, data-driven method for evaluating engagement quality of the therapy session, offering clinicians real-time feedback to enhance the quality of both virtual and in-person therapeutic interactions.
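
As a concrete illustration of the evaluation recipe described above (stratified 5-fold cross-validation, SMOTE-Tomek applied to the training folds only, Random Forest classifier), here is a minimal Python sketch; the feature matrix is a random placeholder standing in for the paper's 42 extracted features.

```python
# Minimal sketch of the evaluation recipe: stratified 5-fold CV, SMOTE-Tomek
# augmentation of the training folds only, and a Random Forest classifier.
# X is a placeholder for the 42 conversational/semantic/sentiment features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score
from imblearn.combine import SMOTETomek

rng = np.random.default_rng(0)
X = rng.normal(size=(253, 42))            # placeholder for the 42 features
y = np.array([1] * 150 + [0] * 103)       # 150 high- / 103 low-quality sessions

aucs, f1s = [], []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    # Resample only the training split so the test split stays untouched.
    X_res, y_res = SMOTETomek(random_state=0).fit_resample(X[train_idx], y[train_idx])
    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_res, y_res)
    proba = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], proba))
    f1s.append(f1_score(y[test_idx], clf.predict(X[test_idx])))
print(f"AUC={np.mean(aucs):.3f}  F1={np.mean(f1s):.3f}")
```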

[NLP-4] A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets

【Quick Read】: This paper addresses how to more accurately assess and optimize the composition of training data when fine-tuning large language models (LLMs) under a fixed compute budget. Conventional approaches measure training data only by total token count, ignoring the number of examples and their average token length, the so-called dataset volume, which plays a key role in model performance. The key to the solution is a scaling law that accounts for data composition; experiments confirm that composition significantly affects token efficiency, providing finer-grained guidance for LLM fine-tuning in resource-constrained settings.

Link: https://arxiv.org/abs/2505.06150
Authors: Ryan Lagasse, Aidan Kiernans, Avijit Ghosh, Shiri Dori-Hacohen
Affiliations: University of Connecticut
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce a scaling law for fine-tuning large language models (LLMs) under fixed compute budgets that explicitly accounts for data composition. Conventional approaches measure training data solely by total tokens, yet the number of examples and their average token length – what we term "dataset volume" – play a decisive role in model performance. Our formulation is tuned following established procedures. Experiments on the BRICC dataset (Salavati et al., 2024) and subsets of the MMLU dataset (Hendrycks et al., 2021), evaluated under multiple subsampling strategies, reveal that data composition significantly affects token efficiency. These results motivate refined scaling laws for practical LLM fine-tuning in resource-constrained settings.
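
The abstract does not give the law's exact parameterization, but the idea of regressing loss on "dataset volume" (number of examples × average token length) can be sketched as follows; the saturating power-law form and all numbers are our assumptions for illustration.

```python
# Illustrative only: fitting a saturating power law loss(V) = a * V**(-b) + c,
# where the dataset volume V = (number of examples) * (average token length).
# The functional form is an assumption made for illustration, not the paper's.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(V, a, b, c):
    return a * np.power(V, -b) + c

n_examples = np.array([500, 1000, 2000, 4000, 8000])
avg_len    = np.array([128, 128, 256, 256, 512])
V = (n_examples * avg_len).astype(float)       # dataset volume per run
loss = 2.0 * V**-0.3 + 1.1 + np.random.default_rng(0).normal(0, 0.01, V.size)

(a, b, c), _ = curve_fit(scaling_law, V, loss, p0=(1.0, 0.5, 1.0))
print(f"fitted: loss ~ {a:.2f} * V^-{b:.2f} + {c:.2f}")
```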

[NLP-5] Can Prompting LLMs Unlock Hate Speech Detection across Languages? A Zero-shot and Few-shot Study

【Quick Read】: This paper addresses the neglect of linguistic diversity in multilingual automated hate speech detection, in particular the limited ability to identify hate speech in non-English languages. The key to the solution is zero-shot and few-shot prompting of instruction-tuned large language models (LLMs), with prompt design optimized to improve generalization across languages. The study finds that although such methods still lag behind fine-tuned encoder models on real-world evaluations, they generalize better on functional tests, and that prompt design plays a critical role in performance gains across languages.

Link: https://arxiv.org/abs/2505.06149
Authors: Faeze Ghorbanpour, Daryna Dementieva, Alexander Fraser
Affiliations: TU Munich; LMU Munich; Munich Center for Machine Learning
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Multimedia (cs.MM)
Comments:

Abstract:Despite growing interest in automated hate speech detection, most existing approaches overlook the linguistic diversity of online content. Multilingual instruction-tuned large language models such as LLaMA, Aya, Qwen, and BloomZ offer promising capabilities across languages, but their effectiveness in identifying hate speech through zero-shot and few-shot prompting remains underexplored. This work evaluates LLM prompting-based detection across eight non-English languages, utilizing several prompting techniques and comparing them to fine-tuned encoder models. We show that while zero-shot and few-shot prompting lag behind fine-tuned encoder models on most of the real-world evaluation sets, they achieve better generalization on functional tests for hate speech detection. Our study also reveals that prompt design plays a critical role, with each language often requiring customized prompting techniques to maximize performance.
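
A minimal zero-shot prompting setup of the kind the paper evaluates might look like the following; the model checkpoint and the prompt wording are our choices, not the paper's.

```python
# A minimal zero-shot hate speech prompt. The checkpoint and exact wording
# are assumptions for illustration; the paper tests several prompt designs.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def zero_shot_hate_check(text: str, language: str) -> str:
    prompt = (
        f"You are a content moderator. The following {language} text may "
        f"contain hate speech.\nText: {text}\n"
        "Answer with exactly one word, HATE or NOT-HATE.\nAnswer:"
    )
    out = generator(prompt, max_new_tokens=5, do_sample=False)
    # the text-generation pipeline returns the prompt plus the continuation
    return out[0]["generated_text"][len(prompt):].strip()

print(zero_shot_hate_check("Beispieltext ...", "German"))
```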

[NLP-6] Towards Robust Few-Shot Text Classification Using Transformer Architectures and Dual Loss Strategies

【Quick Read】: This paper aims to improve text classification in low-resource settings, particularly addressing insufficient generalization and overfitting in few-shot scenarios. The key to the solution is a strategy combining adaptive fine-tuning, contrastive learning, and regularization optimization: introducing contrastive and regularization losses strengthens generalization, effectively alleviates overfitting in few-shot settings, and improves classification accuracy. The study also shows that Transformer models with stronger self-attention mechanisms, or generative architectures, help improve the stability and accuracy of few-shot classification.

Link: https://arxiv.org/abs/2505.06145
Authors: Xu Han, Yumeng Sun, Weiqiang Huang, Hongye Zheng, Junliang Du
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Few-shot text classification has important application value in low-resource environments. This paper proposes a strategy that combines adaptive fine-tuning, contrastive learning, and regularization optimization to improve the classification performance of Transformer-based models. Experiments on the FewRel 2.0 dataset show that T5-small, DeBERTa-v3, and RoBERTa-base perform well in few-shot tasks, especially in the 5-shot setting, which can more effectively capture text features and improve classification accuracy. The experiment also found that there are significant differences in the classification difficulty of different relationship categories. Some categories have fuzzy semantic boundaries or complex feature distributions, making it difficult for the standard cross entropy loss to learn the discriminative information required to distinguish categories. By introducing contrastive loss and regularization loss, the generalization ability of the model is enhanced, effectively alleviating the overfitting problem in few-shot environments. In addition, the research results show that the use of Transformer models or generative architectures with stronger self-attention mechanisms can help improve the stability and accuracy of few-shot classification.

[NLP-7] LLMs Get Lost In Multi-Turn Conversation

【Quick Read】: This paper addresses the substantial performance drop of large language models (LLMs) in multi-turn conversation, especially when user instructions are underspecified, compared with single-turn, fully specified tasks. The key to the solution is large-scale simulation experiments contrasting LLM performance in single- and multi-turn settings, which reveal the main cause of the degradation: LLMs tend to make wrong assumptions in early turns and prematurely attempt final answers, and then fail to correct course and recover later in the conversation.

Link: https://arxiv.org/abs/2505.06120
Authors: Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, Jennifer Neville
Affiliations: Microsoft Research; Salesforce Research
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.

[NLP-8] Multimodal Sentiment Analysis on CMU-MOSEI Dataset using Transformer-based Models

【Quick Read】: This paper targets multimodal sentiment analysis, improving sentiment classification accuracy by effectively fusing information from the text, audio, and visual modalities. The key to the solution is a Transformer-based model with an early-fusion strategy that concatenates the embeddings of all modalities before classification, capturing cross-modal interactions. Experiments on the CMU-MOSEI dataset show excellent performance, confirming the effectiveness of early fusion for multimodal sentiment modeling.

Link: https://arxiv.org/abs/2505.06110
Authors: Jugal Gajjar, Kaustik Ranaware
Affiliations: George Washington University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures, 5 tables, and 19 references

Abstract:This project performs multimodal sentiment analysis using the CMU-MOSEI dataset, using transformer-based models with early fusion to integrate text, audio, and visual modalities. We employ BERT-based encoders for each modality, extracting embeddings that are concatenated before classification. The model achieves strong performance, with 97.87% 7-class accuracy and a 0.9682 F1-score on the test set, demonstrating the effectiveness of early fusion in capturing cross-modal interactions. The training utilized Adam optimization (lr=1e-4), dropout (0.3), and early stopping to ensure generalization and robustness. Results highlight the superiority of transformer architectures in modeling multimodal sentiment, with a low MAE (0.1060) indicating precise sentiment intensity prediction. Future work may compare fusion strategies or enhance interpretability. This approach utilizes multimodal learning by effectively combining linguistic, acoustic, and visual cues for sentiment analysis.
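
Early fusion as described (concatenate the per-modality embeddings, then classify) is easy to sketch in PyTorch; the embedding dimensions and head size below are illustrative, not the paper's exact architecture.

```python
# Sketch of early fusion: per-modality embeddings are concatenated and passed
# through a classification head. Dimensions are illustrative placeholders.
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_visual=35, n_classes=7, p_drop=0.3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_text + d_audio + d_visual, 256),
            nn.ReLU(),
            nn.Dropout(p_drop),          # dropout 0.3, as in the paper
            nn.Linear(256, n_classes),   # 7-class sentiment on CMU-MOSEI
        )

    def forward(self, text_emb, audio_emb, visual_emb):
        fused = torch.cat([text_emb, audio_emb, visual_emb], dim=-1)  # early fusion
        return self.head(fused)

model = EarlyFusionClassifier()
logits = model(torch.randn(8, 768), torch.randn(8, 74), torch.randn(8, 35))
print(logits.shape)  # torch.Size([8, 7])
```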

[NLP-9] Differentiating Emigration from Return Migration of Scholars Using Name-Based Nationality Detection Models

【Quick Read】: This paper addresses the left-censoring problem in migration research caused by the absence of nationality information in digital trace data. The key to the solution is detecting nationality from full names and comparing it with the country of academic origin, allowing emigration to be distinguished from return migration more accurately. Using 2.6 million unique name-nationality pairs from Wikipedia as training data, the authors build character-based machine learning models that classify nationality with high accuracy, improving the analysis of migration patterns.

Link: https://arxiv.org/abs/2505.06107
Authors: Faeze Ghorbanpour, Thiago Zordan Malaguth, Aliakbar Akbaritabar
Affiliations: Max Planck Institute for Demographic Research
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Accepted to appear @ ICWSM 2025. The link to the camera-ready paper will be added soon

Abstract:Most web and digital trace data do not include information about an individual’s nationality due to privacy concerns. The lack of data on nationality can create challenges for migration research. It can lead to a left-censoring issue since we are uncertain about the migrant’s country of origin. Once we observe an emigration event, if we know the nationality, we can differentiate it from return migration. We propose methods to detect the nationality with the least available data, i.e., full names. We use the detected nationality in comparison with the country of academic origin, which is a common approach in studying the migration of researchers. We gathered 2.6 million unique name-nationality pairs from Wikipedia and categorized them into families of nationalities with three granularity levels to use as our training data. Using a character-based machine learning model, we achieved a weighted F1 score of 84% for the broadest and 67% for the most granular, country-level categorization. In our empirical study, we used the trained and tested model to assign nationality to 8+ million scholars’ full names in Scopus data. Our results show that using the country of first publication as a proxy for nationality underestimates the size of return flows, especially for countries with a more diverse academic workforce, such as the USA, Australia, and Canada. We found that around 48% of emigration from the USA was return migration once we used the country of name origin, in contrast to 33% based on academic origin. In the most recent period, 79% of scholars whose affiliation has consistently changed from the USA to China, and are considered emigrants, have Chinese names in contrast to 41% with a Chinese academic origin. Our proposed methods for addressing left-censoring issues are beneficial for other research that uses digital trace data to study migration.
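
The abstract does not specify the exact character-based model, but a character n-gram baseline in the same spirit can be sketched with scikit-learn; the names and labels below are toy stand-ins.

```python
# A character n-gram baseline for name-based nationality classification,
# in the spirit of the paper's model. Names and labels are toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

names  = ["Wei Zhang", "Giulia Rossi", "Hans Müller",
          "Li Wang", "Marco Bianchi", "Anna Schmidt"]
labels = ["chinese", "italian", "german", "chinese", "italian", "german"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams
    LogisticRegression(max_iter=1000),
)
clf.fit(names, labels)
print(clf.predict(["Jürgen Weber"]))
```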

[NLP-10] Attention on Multiword Expressions: A Multilingual Study of BERT-based Models with Regard to Idiomaticity and Microsyntax

【Quick Read】: This paper investigates how fine-tuning changes the attention patterns of pretrained BERT-based encoder models toward two distinct types of multiword expressions (MWEs), idioms and microsyntactic units (MSUs), and how this attention differs between semantic and syntactic tasks. The key to the solution is analyzing the attention scores of pretrained and fine-tuned BERT models across six Indo-European languages, revealing that fine-tuning reshapes how models attend to MWEs: models fine-tuned on semantic tasks distribute attention to idioms more evenly across layers, while models fine-tuned on syntactic tasks attend more to MSUs in the lower layers, consistent with syntactic processing requirements.

Link: https://arxiv.org/abs/2505.06062
Authors: Iuliia Zaitova, Vitalii Hirak, Badr M. Abdullah, Dietrich Klakow, Bernd Möbius, Tania Avgustinova
Affiliations: Saarland University
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 3 figures. Findings 2025

Abstract:This study analyzes the attention patterns of fine-tuned encoder-only models based on the BERT architecture (BERT-based models) towards two distinct types of Multiword Expressions (MWEs): idioms and microsyntactic units (MSUs). Idioms present challenges in semantic non-compositionality, whereas MSUs demonstrate unconventional syntactic behavior that does not conform to standard grammatical categorizations. We aim to understand whether fine-tuning BERT-based models on specific tasks influences their attention to MWEs, and how this attention differs between semantic and syntactic tasks. We examine attention scores to MWEs in both pre-trained and fine-tuned BERT-based models. We utilize monolingual models and datasets in six Indo-European languages - English, German, Dutch, Polish, Russian, and Ukrainian. Our results show that fine-tuning significantly influences how models allocate attention to MWEs. Specifically, models fine-tuned on semantic tasks tend to distribute attention to idiomatic expressions more evenly across layers. Models fine-tuned on syntactic tasks show an increase in attention to MSUs in the lower layers, corresponding with syntactic processing requirements.
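
Reading out the attention a BERT-based model assigns to an MWE span, layer by layer, can be done with the transformers library as sketched below; the idiom's token indices are assumed known rather than detected automatically.

```python
# Sketch: per-layer attention that a BERT model assigns to the tokens of a
# multiword expression. The idiom span indices are assumed known.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_attentions=True)

sentence = "He finally kicked the bucket last year."
enc = tok(sentence, return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc).attentions  # tuple: layers x (batch, heads, seq, seq)

mwe_positions = [3, 4, 5]  # token indices of "kicked the bucket" (assumed)
for layer, att in enumerate(attentions):
    # mean attention received by the MWE tokens, averaged over heads and queries
    score = att[0, :, :, mwe_positions].mean().item()
    print(f"layer {layer:2d}: attention to MWE = {score:.4f}")
```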

[NLP-11] Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information

【Quick Read】: This paper addresses the limited understanding of what large language models (LLMs) know about UK Government public health information, which matters for ensuring accurate and current answers in real-world use. The key to the solution is PubHealthBench, a new benchmark with over 8,000 questions evaluating LLMs on Multiple Choice Question Answering (MCQA) and free-form responses, together with a dataset of the UK Government public health guidance documents used to build it. Evaluating 24 LLMs on PubHealthBench shows that the latest private LLMs perform strongly on MCQA, while free-form answers still need improvement.

Link: https://arxiv.org/abs/2505.06046
Authors: Joshua Harris, Fan Grayson, Felix Feldman, Timothy Laurence, Toby Nonnenmacher, Oliver Higgins, Leo Loman, Selina Patel, Thomas Finnie, Samuel Collins, Michael Borowitz
Affiliations: UK Health Security Agency
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 pages, 10 pages main text

Abstract:As Large Language Models (LLMs) become widely accessible, a detailed understanding of their knowledge within specific domains becomes necessary for successful real world use. This is particularly critical in public health, where failure to retrieve relevant, accurate, and current information could significantly impact UK residents. However, currently little is known about LLM knowledge of UK Government public health information. To address this issue, this paper introduces a new benchmark, PubHealthBench, with over 8000 questions for evaluating LLMs’ Multiple Choice Question Answering (MCQA) and free form responses to public health queries, created via an automated pipeline. We also release a new dataset of the extracted UK Government public health guidance documents used as source text for PubHealthBench. Assessing 24 LLMs on PubHealthBench we find the latest private LLMs (GPT-4.5, GPT-4.1 and o1) have a high degree of knowledge, achieving over 90% in the MCQA setup, and outperform humans with cursory search engine use. However, in the free form setup we see lower performance with no model scoring above 75%. Therefore, whilst there are promising signs that state of the art (SOTA) LLMs are an increasingly accurate source of public health information, additional safeguards or tools may still be needed when providing free form responses on public health topics.

[NLP-12] Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification

【Quick Read】: This paper addresses decision biases in language models caused by reliance on spurious correlations (shortcuts), which can hurt generalization and reliability. The key to the solution is using mechanistic interpretability to identify the specific attention heads that focus on these spurious correlations, and building on this to propose Head-based Token Attribution (HTA), which enables both detection of shortcuts and targeted mitigation.

Link: https://arxiv.org/abs/2505.06032
Authors: Leon Eshuijs, Shihan Wang, Antske Fokkens
Affiliations: Vrije Universiteit Amsterdam; Utrecht University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model’s decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.
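
The mitigation step, selectively deactivating attention heads, maps onto the head_mask argument that transformers models expose; which heads to disable would come from an HTA-style analysis, so the model name and head indices below are placeholders.

```python
# Sketch of the mitigation step: zero out selected attention heads during a
# forward pass via head_mask. The checkpoint and head indices are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "textattack/bert-base-uncased-SST-2"   # any BERT-based classifier works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

cfg = model.config
head_mask = torch.ones(cfg.num_hidden_layers, cfg.num_attention_heads)
for layer, head in [(9, 3), (10, 7)]:         # hypothetical shortcut heads
    head_mask[layer, head] = 0.0              # remove their contribution

enc = tok("A movie starring Tom Hanks.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc, head_mask=head_mask).logits
print(logits.softmax(-1))
```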

[NLP-13] Unilogit: Robust Machine Unlearning for LLMs Using Uniform-Target Self-Distillation ACL

【Quick Read】: This paper addresses machine unlearning in large language models: selectively forgetting specific information while preserving overall model utility, as required by data privacy regulations such as GDPR. The key to the solution is Unilogit, a novel self-distillation method that dynamically adjusts the target logits so the target token receives a uniform probability, leveraging the current model's outputs to obtain more accurate self-distillation targets; this removes the need for extra hyperparameters and better approximates the golden targets.

Link: https://arxiv.org/abs/2505.06027
Authors: Stefan Vasilev, Christian Herold, Baohao Liao, Seyyed Hadi Hashemi, Shahram Khadivi, Christof Monz
Affiliations: eBay Inc.; University of Amsterdam
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures, 5 tables, under review at ACL

Abstract:This paper introduces Unilogit, a novel self-distillation method for machine unlearning in Large Language Models. Unilogit addresses the challenge of selectively forgetting specific information while maintaining overall model utility, a critical task in compliance with data privacy regulations like GDPR. Unlike prior methods that rely on static hyperparameters or starting model outputs, Unilogit dynamically adjusts target logits to achieve a uniform probability for the target token, leveraging the current model’s outputs for more accurate self-distillation targets. This approach not only eliminates the need for additional hyperparameters but also enhances the model’s ability to approximate the golden targets. Extensive experiments on public benchmarks and an in-house e-commerce dataset demonstrate Unilogit’s superior performance in balancing forget and retain objectives, outperforming state-of-the-art methods such as NPO and UnDIAL. Our analysis further reveals Unilogit’s robustness across various scenarios, highlighting its practical applicability and effectiveness in achieving efficacious machine unlearning.
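
One way to read the abstract's target construction: pin the token to be forgotten to the uniform probability 1/|V| and redistribute the remaining mass in proportion to the current model's own distribution. The sketch below is our interpretation for illustration, not the authors' reference implementation.

```python
# Sketch of a Unilogit-style self-distillation target (our reading of the
# abstract): the forget token gets probability 1/|V|; the rest of the mass is
# rescaled from the current model's distribution over the other tokens.
import torch
import torch.nn.functional as F

def unilogit_style_target(logits: torch.Tensor, forget_id: int) -> torch.Tensor:
    probs = logits.softmax(-1)
    vocab = probs.size(-1)
    target = probs.clone()
    target[..., forget_id] = 1.0 / vocab                  # uniform prob for the target token
    other = 1.0 - probs[..., forget_id:forget_id + 1]     # mass on all other tokens
    scale = (1.0 - 1.0 / vocab) / other.clamp_min(1e-12)  # renormalize the rest
    mask = torch.ones(vocab, dtype=torch.bool)
    mask[forget_id] = False
    target[..., mask] = probs[..., mask] * scale
    return target

logits = torch.randn(2, 50257)                 # (batch, vocab)
target = unilogit_style_target(logits, forget_id=123)
loss = F.kl_div(logits.log_softmax(-1), target, reduction="batchmean")
print(loss)
```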

[NLP-14] Do Not Change Me: On Transferring Entities Without Modification in Neural Machine Translation – a Multilingual Perspective

【Quick Read】: This paper addresses machine translation models failing to correctly preserve certain entities, such as URL addresses, IBAN numbers, or emails, during translation. The key to the solution is analyzing how well popular neural machine translation (NMT) models transfer entities across languages, identifying the types of errors they make and their causes, and contributing a multilingual synthetic dataset of 36,000 sentences for assessing the transfer quality of nine entity categories across four languages (English, German, Polish, and Ukrainian).

Link: https://arxiv.org/abs/2505.06010
Authors: Dawid Wisniewski, Mikolaj Pokrywka, Zofia Rostek
Affiliations: Poznan University of Technology, Poland; Adam Mickiewicz University, Poland
Categories: Computation and Language (cs.CL)
Comments: Accepted at MTSummit 2025 (The 20th Machine Translation Summit)

Abstract:Current machine translation models provide us with high-quality outputs in most scenarios. However, they still face some specific problems, such as detecting which entities should not be changed during translation. In this paper, we explore the abilities of popular NMT models, including models from the OPUS project, Google Translate, MADLAD, and EuroLLM, to preserve entities such as URL addresses, IBAN numbers, or emails when producing translations between four languages: English, German, Polish, and Ukrainian. We investigate the quality of popular NMT models in terms of accuracy, discuss errors made by the models, and examine the reasons for errors. Our analysis highlights specific categories, such as emojis, that pose significant challenges for many models considered. In addition to the analysis, we propose a new multilingual synthetic dataset of 36,000 sentences that can help assess the quality of entity transfer across nine categories and four aforementioned languages.
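
A simple checker in the spirit of this evaluation verifies that entities survive translation unchanged; the regexes below are simplified illustrations rather than the paper's exact matching rules.

```python
# A small entity-preservation checker: URLs, emails, and IBAN-like strings
# found in the source must appear verbatim in the translation.
import re

PATTERNS = {
    "url":   re.compile(r"https?://\S+"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "iban":  re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b"),
}

def entities_preserved(source: str, translation: str) -> dict:
    report = {}
    for kind, pat in PATTERNS.items():
        missing = [e for e in pat.findall(source) if e not in translation]
        report[kind] = missing            # entities lost or altered in output
    return report

src = "Contact me at jan.kowalski@example.com or see https://example.com/faq"
hyp = "Kontakt unter jan.kowalski@example.de oder siehe https://example.com/faq"
print(entities_preserved(src, hyp))       # {'url': [], 'email': ['jan.kowalski@example.com'], 'iban': []}
```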

[NLP-15] Exploring the Feasibility of Multilingual Grammatical Error Correction with a Single LLM up to 9B parameters: A Comparative Study of 17 Models

【Quick Read】: This paper studies multilingual grammatical error correction: using a single model to correct texts in English, German, Italian, and Swedish. The key to the solution is evaluating 17 popular models on their ability to reduce grammatical errors while keeping edits small, analyzing the outputs to identify strong performers, and recommending models suitable for multilingual grammatical error correction.

Link: https://arxiv.org/abs/2505.06004
Authors: Dawid Wisniewski, Antoni Solarski, Artur Nowakowski
Affiliations: Laniqo.com; Poznan University of Technology; Adam Mickiewicz University
Categories: Computation and Language (cs.CL)
Comments: Accepted at MTSummit 2025 (The 20th Machine Translation Summit)

Abstract:Recent language models can successfully solve various language-related tasks, and many understand inputs stated in different languages. In this paper, we explore the performance of 17 popular models used to correct grammatical issues in texts stated in English, German, Italian, and Swedish when using a single model to correct texts in all those languages. We analyze the outputs generated by these models, focusing on decreasing the number of grammatical errors while keeping the changes small. The conclusions drawn help us understand what problems occur among those models and which models can be recommended for multilingual grammatical error correction tasks. We list six models that improve grammatical correctness in all four languages and show that Gemma 9B is currently the best performing one for the languages considered.

[NLP-16] An Exploratory Analysis on the Explanatory Potential of Embedding-Based Measures of Semantic Transparency for Malay Word Recognition

【Quick Read】: This paper explores the role of semantic transparency in morphological processing and assesses the impact of embedding-based measures of semantic transparency on reading. The key to the solution is analyzing the geometry of complex words in semantic space, deriving several measures, and validating them as predictors of lexical decision latencies. Specifically, the authors use t-distributed Stochastic Neighbor Embedding clustering to analyze the semantic structure of complex words, build classification models over word embeddings and shift vectors, and obtain further transparency measures by comparing each word's embedding with the centroid of words sharing the same affix, with its shift from the base word, and with the word predicted by the Functional Representations of Affixes in Compositional Semantic Space model; Generalized Additive Mixed Models then confirm that these measures predict reaction times.

Link: https://arxiv.org/abs/2505.05973
Authors: M. Maziyah Mohamed (1), R. H. Baayen (1) ((1) University of Tuebingen)
Affiliations: University of Tübingen
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 5 figures, and 9 tables. Submitted to the Journal of Morphology

Abstract:Studies of morphological processing have shown that semantic transparency is crucial for word recognition. Its computational operationalization is still under discussion. Our primary objectives are to explore embedding-based measures of semantic transparency, and assess their impact on reading. First, we explored the geometry of complex words in semantic space. To do so, we conducted a t-distributed Stochastic Neighbor Embedding clustering analysis on 4,226 Malay prefixed words. Several clusters were observed for complex words varied by their prefix class. Then, we derived five simple measures, and investigated whether they were significant predictors of lexical decision latencies. Two sets of Linear Discriminant Analyses were run in which the prefix of a word is predicted from either word embeddings or shift vectors (i.e., a vector subtraction of the base word from the derived word). The accuracy with which the model predicts the prefix of a word indicates the degree of transparency of the prefix. Three further measures were obtained by comparing embeddings between each word and all other words containing the same prefix (i.e., centroid), between each word and the shift from their base word, and between each word and the predicted word of the Functional Representations of Affixes in Compositional Semantic Space model. In a series of Generalized Additive Mixed Models, all measures predicted decision latencies after accounting for word frequency, word length, and morphological family size. The model that included the correlation between each word and their centroid as a predictor provided the best fit to the data.

[NLP-17] Towards Developmentally Plausible Rewards: Communicative Success as a Learning Signal for Interactive Language Models

【Quick Read】: This paper explores how interactivity can support language learning in computational cognitive models, training language models in a setting that mimics the interactive character of child language acquisition. The key to the solution is a single-turn dialogue setup in which a speaker tries to convey information to a listener and is rewarded for communicative success, operationalized as abstract semantic alignment in a language-only question-answering scenario. Language models are then fine-tuned with reinforcement learning, and the study examines how cognitively plausible constraints on the communication channel shape speaker behavior.

Link: https://arxiv.org/abs/2505.05970
Authors: Lennart Stöpler, Rufat Asadli, Mitja Nikolaus, Ryan Cotterell, Alex Warstadt
Affiliations: Heidelberg University; ETH Zürich; CerCo, CNRS, Toulouse; University of California San Diego
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We propose a method for training language models in an interactive setting inspired by child language acquisition. In our setting, a speaker attempts to communicate some information to a listener in a single-turn dialogue and receives a reward if communicative success is achieved. Unlike earlier related work using image–caption data for interactive reference games, we operationalize communicative success in a more abstract language-only question–answering setting. First, we present a feasibility study demonstrating that our reward provides an indirect signal about grammaticality. Second, we conduct experiments using reinforcement learning to fine-tune language models. We observe that cognitively plausible constraints on the communication channel lead to interpretable changes in speaker behavior. However, we do not yet see improvements on linguistic evaluations from our training regime. We outline potential modifications to the task design and training configuration that could better position future work to use our methodology to observe the benefits of interaction on language learning in computational cognitive models.

[NLP-18] NeoQA: Evidence-based Question Answering with Generated News Events

【Quick Read】: This paper addresses the challenge of evaluating Retrieval-Augmented Generation (RAG) in large language models (LLMs): existing benchmarks quickly become stale, since questions that once required retrieval may become answerable from pretraining knowledge as newer models ingest more recent data, blurring the line between evidence-based reasoning and recall. The key to the solution is the NeoQA (News Events for Out-of-training Question Answering) benchmark, which generates timelines, knowledge bases, news articles, and Q&A pairs for fictional news events, guaranteeing that models cannot rely on pretraining knowledge and must answer exclusively from retrieved evidence, enabling controlled evaluation of evidence-based question answering.

Link: https://arxiv.org/abs/2505.05949
Authors: Max Glockner, Xiang Jiang, Leonardo F. R. Ribeiro, Iryna Gurevych, Markus Dreyer
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Evaluating Retrieval-Augmented Generation (RAG) in large language models (LLMs) is challenging because benchmarks can quickly become stale. Questions initially requiring retrieval may become answerable from pretraining knowledge as newer models incorporate more recent information during pretraining, making it difficult to distinguish evidence-based reasoning from recall. We introduce NeoQA (News Events for Out-of-training Question Answering), a benchmark designed to address this issue. To construct NeoQA, we generated timelines and knowledge bases of fictional news events and entities along with news articles and Q&A pairs to prevent LLMs from leveraging pretraining knowledge, ensuring that no prior evidence exists in their training data. We propose our dataset as a new platform for evaluating evidence-based question answering, as it requires LLMs to generate responses exclusively from retrieved evidence and only when sufficient evidence is available. NeoQA enables controlled evaluation across various evidence scenarios, including cases with missing or misleading details. Our findings indicate that LLMs struggle to distinguish subtle mismatches between questions and evidence, and suffer from short-cut reasoning when key information required to answer a question is missing from the evidence, underscoring key limitations in evidence-based reasoning.

[NLP-19] Summarisation of German Judgments in conjunction with a Class-based Evaluation

【Quick Read】: This paper addresses automatic summarization of long legal documents, aiming to provide legal experts with a tool that improves their daily work. The key to the solution is fine-tuning a decoder-based large language model to generate summaries (guiding principles) of German judgments, enriching the judgments with legal entity information before training so that the generative model can better locate the relevant content.

Link: https://arxiv.org/abs/2505.05947
Authors: Bianca Steffes, Nils Torben Wiedemann, Alexander Gratz, Pamela Hochreither, Jana Elina Meyer, Katharina Luise Schilke
Affiliations: Saarland University; Saarland Informatics Campus; Rechtsanwälte Zimmer-Gratz
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The automated summarisation of long legal documents can be a great aid for legal experts in their daily work. We automatically create summaries (guiding principles) of German judgments by fine-tuning a decoder-based large language model. We enrich the judgments with information about legal entities before the training. For the evaluation of the created summaries, we define a set of evaluation classes which allows us to measure their language, pertinence, completeness and correctness. Our results show that employing legal entities helps the generative model to find the relevant content, but the quality of the created summaries is not yet sufficient for a use in practice.

[NLP-20] Elastic Weight Consolidation for Full-Parameter Continual Pre-Training of Gemma2

【Quick Read】: This paper addresses catastrophic forgetting in the continual learning of a large language model (LLM). The key to the solution is applying Elastic Weight Consolidation (EWC) to regularize the full set of model parameters, preserving existing knowledge while learning a new task; this effectively mitigates catastrophic forgetting and may even benefit learning of the new task.

Link: https://arxiv.org/abs/2505.05946
Authors: Vytenis Šliogeris, Povilas Daniušis, Artūras Nakvosas
Affiliations: Neurotechnology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures

Abstract:This technical report describes an experiment on autoregressive pre-training of Gemma2 2 billion parameter large language model (LLM) with 10% on the Lithuanian language component of CulturaX from the point of view of continual learning. We apply elastic weight consolidation (EWC) to the full set of the model’s parameters and investigate language understanding benchmarks, consisting of Arc, Belebele, Gsm8K, Hellaswag, MMLU, TruthfulQA, and Winogrande sets (both in English and Lithuanian versions), and perplexity benchmarks. We empirically demonstrate that EWC regularisation allows us not only to mitigate catastrophic forgetting effects but also that it is potentially beneficial for learning of the new task with LLMs.
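
The EWC penalty the report applies to the full parameter set has a compact form: loss = task_loss + (lam/2) * sum_i F_i * (theta_i - theta_star_i)^2. A minimal PyTorch sketch, assuming the diagonal Fisher estimate and the old-task parameter snapshot are precomputed dictionaries keyed by parameter name:

```python
# Minimal EWC sketch. `fisher` and `old_params` are assumed to be dicts of
# tensors keyed by parameter name, computed on the previous task before
# continual pre-training starts.
import torch

def ewc_penalty(model, fisher, old_params, lam=1000.0):
    """Returns (lam/2) * sum_i F_i * (theta_i - theta_star_i)**2."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# inside the training loop on the new-task data:
#   loss = lm_loss + ewc_penalty(model, fisher, old_params)
#   loss.backward(); optimizer.step()
```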

[NLP-21] Symbol-based entity marker highlighting for enhanced text mining in materials science with generative AI

【Quick Read】: This paper targets the automatic extraction of structured data from unstructured scientific literature, a process essential to data-driven scientific discovery. Existing approaches, both multi-step and direct, have limitations when applied independently. The key to the solution is a hybrid text-mining framework that integrates the strengths of both, together with an entity marker technique that improves entity recognition performance and yields higher-quality structured data.

Link: https://arxiv.org/abs/2505.05864
Authors: Junhyeong Lee, Jong Min Yuk, Chan-Woo Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 29 pages

Abstract:The construction of experimental datasets is essential for expanding the scope of data-driven scientific discovery. Recent advances in natural language processing (NLP) have facilitated automatic extraction of structured data from unstructured scientific literature. While existing approaches (multi-step and direct methods) offer valuable capabilities, they also come with limitations when applied independently. Here, we propose a novel hybrid text-mining framework that integrates the advantages of both methods to convert unstructured scientific text into structured data. Our approach first transforms raw text into entity-recognized text, and subsequently into structured form. Furthermore, beyond the overall data structuring framework, we also enhance entity recognition performance by introducing an entity marker, a simple yet effective technique that uses symbolic annotations to highlight target entities. Specifically, our entity marker-based hybrid approach not only consistently outperforms previous entity recognition approaches across three benchmark datasets (MatScholar, SOFC, and SOFC slot NER) but also improves the quality of the final structured data, yielding up to a 58% improvement in entity-level F1 score and up to an 83% improvement in relation-level F1 score compared to the direct approach.
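
The entity-marker idea, wrapping target entities in symbolic annotations before handing the text to the extraction model, is easy to illustrate; the marker symbols and the example sentence are our choices, not the paper's.

```python
# Illustration of symbol-based entity marking: wrap known entity spans in
# symbolic annotations before extraction. Marker syntax is an assumption.
def mark_entities(text: str, entities: list[tuple[str, str]]) -> str:
    for surface, label in entities:
        text = text.replace(surface, f"@{label}@ {surface} @/{label}@")
    return text

sentence = "The LiFePO4 cathode was sintered at 700 C for 10 h."
entities = [("LiFePO4", "MAT"), ("700 C", "TEMP"), ("10 h", "TIME")]
print(mark_entities(sentence, entities))
# The @MAT@ LiFePO4 @/MAT@ cathode was sintered at @TEMP@ 700 C @/TEMP@ for @TIME@ 10 h @/TIME@.
```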

[NLP-22] An empathic GPT-based chatbot to talk about mental disorders with Spanish teenagers

【Quick Read】: This paper addresses teenagers' limited awareness of certain mental disorders, aiming to raise awareness through a chatbot-based system. The key to the solution is a self-disclosure technique combining closed and open conversation: controlled messages keep the chat focused on a specific disorder, topics are initiated according to the user's sensitivity to that disorder in order to establish more empathetic communication, and an open conversation based on the GPT-3 language model then lets users express themselves more freely.

Link: https://arxiv.org/abs/2505.05828
Authors: Alba María Mármol-Romero, Manuel García-Vega, Miguel Ángel García-Cumbreras, Arturo Montejo-Ráez
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: This is an Accepted Manuscript version of the following article, accepted for publication in International Journal of Human-Computer Interaction. It is deposited under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License

Abstract:This paper presents a chatbot-based system to engage young Spanish people in the awareness of certain mental disorders through a self-disclosure technique. The study was carried out in a population of teenagers aged between 12 and 18 years. The dialogue engine mixes closed and open conversations, so certain controlled messages are sent to focus the chat on a specific disorder, which will change over time. Once a set of trial questions is answered, the system can initiate the conversation on the disorder under the focus according to the user’s sensibility to that disorder, in an attempt to establish a more empathetic communication. Then, an open conversation based on the GPT-3 language model is initiated, allowing the user to express themselves with more freedom. The results show that these systems are of interest to young people and could help them become aware of certain mental disorders.

[NLP-23] Tell Me Who Your Students Are: GPT Can Generate Valid Multiple-Choice Questions When Students (Mis)Understanding Is Hinted

【Quick Read】: This paper addresses how to generate high-quality multiple-choice questions (MCQs) with a pretrained large language model, in particular how to ensure that the generated items approach human-crafted ones in difficulty and discrimination. The key to the solution is AnaQuest, an innovative prompting technique that integrates formative and summative assessment: students first answer open-ended questions about target concepts in free text, and correct and incorrect assertions are then generated from those responses as answer options, improving the quality and validity of the generated items.

Link: https://arxiv.org/abs/2505.05815
Authors: Machi Shimmei, Masaki Uto, Yuichiroh Matsubayashi, Kentaro Inui, Aditi Mallavarapu, Noboru Matsuda
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: This is a pre-print version of a paper to appear in AIED2025

Abstract:The primary goal of this study is to develop and evaluate an innovative prompting technique, AnaQuest, for generating multiple-choice questions (MCQs) using a pre-trained large language model. In AnaQuest, the choice items are sentence-level assertions about complex concepts. The technique integrates formative and summative assessments. In the formative phase, students answer open-ended questions for target concepts in free text. For summative assessment, AnaQuest analyzes these responses to generate both correct and incorrect assertions. To evaluate the validity of the generated MCQs, Item Response Theory (IRT) was applied to compare item characteristics between MCQs generated by AnaQuest, a baseline ChatGPT prompt, and human-crafted items. An empirical study found that expert instructors rated MCQs generated by both AI models to be as valid as those created by human instructors. However, IRT-based analysis revealed that AnaQuest-generated questions - particularly those with incorrect assertions (foils) - more closely resembled human-crafted items in terms of difficulty and discrimination than those produced by ChatGPT.

[NLP-24] Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM

【Quick Read】: This paper addresses the memory bandwidth bottleneck in autoregressive decoding of Transformer-based models, caused by frequent memory accesses and an ever-growing key-value (KV) cache. The key to the solution is STARC, a sparsity-optimized data mapping scheme designed for efficient large language model (LLM) decoding on PIM architectures. STARC clusters KV pairs by semantic similarity and maps them to contiguous memory regions aligned with PIM bank structures, so that during decoding queries match precomputed centroids at cluster granularity, avoiding frequent reclustering and data movement and improving throughput and resource utilization.

Link: https://arxiv.org/abs/2505.05772
Authors: Zehao Fan, Garrett Gagnon, Zhenyu Liu, Liu Liu
Affiliations: Rensselaer Polytechnic Institute
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Transformer-based models are the foundation of modern machine learning, but their execution, particularly during autoregressive decoding in large language models (LLMs), places significant pressure on memory systems due to frequent memory accesses and growing key-value (KV) caches. This creates a bottleneck in memory bandwidth, especially as context lengths increase. Processing-in-memory (PIM) architectures are a promising solution, offering high internal bandwidth and compute parallelism near memory. However, current PIM designs are primarily optimized for dense attention and struggle with the dynamic, irregular access patterns introduced by modern KV cache sparsity techniques. Consequently, they suffer from workload imbalance, reducing throughput and resource utilization. In this work, we propose STARC, a novel sparsity-optimized data mapping scheme tailored specifically for efficient LLM decoding on PIM architectures. STARC clusters KV pairs by semantic similarity and maps them to contiguous memory regions aligned with PIM bank structures. During decoding, queries retrieve relevant tokens at cluster granularity by matching against precomputed centroids, enabling selective attention and parallel processing without frequent reclustering or data movement overhead. Experiments on the HBM-PIM system show that, compared to common token-wise sparsity methods, STARC reduces attention-layer latency by 19%–31% and energy consumption by 19%–27%. Under a KV cache budget of 1024, it achieves up to 54%–74% latency reduction and 45%–67% energy reduction compared to full KV cache retrieval. Meanwhile, STARC maintains model accuracy comparable to state-of-the-art sparse attention methods, demonstrating its effectiveness in enabling efficient and hardware-friendly long-context LLM inference on PIM architectures.

[NLP-25] BMMDetect: A Multimodal Deep Learning Framework for Comprehensive Biomedical Misconduct Detection

【Quick Read】: This paper addresses the challenges of detecting academic misconduct in biomedical research, where existing methods suffer from algorithmic narrowness and fragmented analytical pipelines. The key to the solution is BMMDetect, a multimodal deep learning framework that integrates journal metadata (e.g., SJR, institutional data), semantic embeddings (PubMedBERT), and GPT-4o-mined textual attributes (methodological statistics, data anomalies) for holistic manuscript evaluation. Its core innovations are: multimodal fusion of domain-specific features to reduce detection bias; quantitative evaluation of feature importance to identify dominant predictors (such as journal authority metrics and textual anomalies); and the BioMCD dataset as a large-scale benchmark.

Link: https://arxiv.org/abs/2505.05763
Authors: Yize Zhou, Jie Zhang, Meijie Wang, Lun Yu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Academic misconduct detection in biomedical research remains challenging due to algorithmic narrowness in existing methods and fragmented analytical pipelines. We present BMMDetect, a multimodal deep learning framework that integrates journal metadata (SJR, institutional data), semantic embeddings (PubMedBERT), and GPT-4o-mined textual attributes (methodological statistics, data anomalies) for holistic manuscript evaluation. Key innovations include: (1) multimodal fusion of domain-specific features to reduce detection bias; (2) quantitative evaluation of feature importance, identifying journal authority metrics (e.g., SJR-index) and textual anomalies (e.g., statistical outliers) as dominant predictors; and (3) the BioMCD dataset, a large-scale benchmark with 13,160 retracted articles and 53,411 controls. BMMDetect achieves 74.33% AUC, outperforming single-modality baselines by 8.6%, and demonstrates transferability across biomedical subfields. This work advances scalable, interpretable tools for safeguarding research integrity.

[NLP-26] Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions

【Quick Read】: This paper addresses the limitations of traditional autoregressive models (ARMs) on sequence generation tasks that must satisfy sophisticated constraints or whose dependencies are better served by out-of-order generation, as well as the incoherence that masked diffusion models (MDMs) can introduce when unmasking multiple tokens simultaneously and their inability to handle infilling when the number of tokens to fill is unknown. The key to the solution is Insertion Language Models (ILMs), which jointly select both the position and the vocabulary element to insert, enabling token insertion at arbitrary positions in a sequence, effectively modeling strong dependencies, and supporting generation in arbitrary order.

Link: https://arxiv.org/abs/2505.05755
Authors: Dhruvesh Patel, Aishwarya Sahoo, Avinash Amballa, Tahira Naseem, Tim G. J. Rudner, Andrew McCallum
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Autoregressive models (ARMs), which predict subsequent tokens one-by-one "from left to right," have achieved significant success across a wide range of sequence generation tasks. However, they struggle to accurately represent sequences that require satisfying sophisticated constraints or whose sequential dependencies are better addressed by out-of-order generation. Masked Diffusion Models (MDMs) address some of these limitations, but the process of unmasking multiple tokens simultaneously in MDMs can introduce incoherences, and MDMs cannot handle arbitrary infilling constraints when the number of tokens to be filled in is not known in advance. In this work, we introduce Insertion Language Models (ILMs), which learn to insert tokens at arbitrary positions in a sequence – that is, they select jointly both the position and the vocabulary element to be inserted. By inserting tokens one at a time, ILMs can represent strong dependencies between tokens, and their ability to generate sequences in arbitrary order allows them to accurately model sequences where token dependencies do not follow a left-to-right sequential structure. To train ILMs, we propose a tailored network parameterization and use a simple denoising objective. Our empirical evaluation demonstrates that ILMs outperform both ARMs and MDMs on common planning tasks. Furthermore, we show that ILMs outperform MDMs and perform on par with ARMs in an unconditional text generation task while offering greater flexibility than MDMs in arbitrary-length text infilling.
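
Insertion-based decoding can be sketched as a loop that repeatedly picks the best (position, token) pair; the model interface below is hypothetical, shown only to make the joint selection concrete, and greedy argmax is one of several possible decoding rules.

```python
# Sketch of insertion-based decoding with a hypothetical model that scores
# every (insertion slot, vocabulary token) pair at each step.
import torch

def insertion_decode(model, seq: list[int], stop_id: int, max_steps: int = 64):
    for _ in range(max_steps):
        # hypothetical call: logits over (len(seq)+1) insertion slots x vocab
        slot_logits = model(torch.tensor(seq))   # shape: (len(seq)+1, vocab)
        best = int(slot_logits.flatten().argmax())
        pos, tok = divmod(best, slot_logits.size(-1))
        if tok == stop_id:                       # model chose to stop inserting
            break
        seq.insert(pos, tok)                     # insert token at chosen slot
    return seq
```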

[NLP-27] Harnessing LLMs Explanations to Boost Surrogate Models in Tabular Data Classification

【Quick Read】: This paper addresses the high resource requirements, suboptimal demonstration selection, and limited interpretability of LLM-based tabular prediction methods, which constrain their predictive performance and real-world application. The key to the solution is an in-context learning framework that uses explanations generated by LLMs to guide a smaller, locally deployable Surrogate Language Model (SLM) toward interpretable tabular predictions, improving performance while making the outputs more interpretable.

Link: https://arxiv.org/abs/2505.05744
Authors: Ruxue Shi, Hengrui Gu, Xu Shen, Xin Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have shown remarkable ability in solving complex tasks, making them a promising tool for enhancing tabular learning. However, existing LLM-based methods suffer from high resource requirements, suboptimal demonstration selection, and limited interpretability, which largely hinder their prediction performance and application in the real world. To overcome these problems, we propose a novel in-context learning framework for tabular prediction. The core idea is to leverage the explanations generated by LLMs to guide a smaller, locally deployable Surrogate Language Model (SLM) to make interpretable tabular predictions. Specifically, our framework mainly involves three stages: (i) Post Hoc Explanation Generation, where LLMs are utilized to generate explanations for question-answer pairs in candidate demonstrations, providing insights into the reasoning behind the answer. (ii) Post Hoc Explanation-Guided Demonstrations Selection, which utilizes explanations generated by LLMs to guide the process of demonstration selection from candidate demonstrations. (iii) Post Hoc Explanation-Guided Interpretable SLM Prediction, which utilizes the demonstrations obtained in step (ii) as in-context and merges corresponding explanations as rationales to improve the performance of SLM and guide the model to generate interpretable outputs. Experimental results highlight the framework’s effectiveness, with an average accuracy improvement of 5.31% across various tabular datasets in diverse domains.

[NLP-28] TopicVD: A Topic-Based Dataset of Video-Guided Multimodal Machine Translation for Documentaries

【Quick Read】: This paper addresses the lack of long-form, cross-domain, topic-diverse video data in existing multimodal machine translation (MMT) datasets, which prevents them from meeting real-world needs such as documentary translation. The key to the solution is TopicVD, a topic-based dataset for video-supported multimodal machine translation of documentaries, together with an MMT model built on a cross-modal bidirectional attention module. By capturing the shared semantics between text and video, the model improves documentary translation, and the experiments also confirm that global context benefits translation quality.

Link: https://arxiv.org/abs/2505.05714
Authors: Jinze Lv, Jian Chen, Zi Long, Xianghua Fu, Yin Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: NLDB 2025

Abstract:Most existing multimodal machine translation (MMT) datasets are predominantly composed of static images or short video clips, lacking extensive video data across diverse domains and topics. As a result, they fail to meet the demands of real-world MMT tasks, such as documentary translation. In this study, we developed TopicVD, a topic-based dataset for video-supported multimodal machine translation of documentaries, aiming to advance research in this field. We collected video-subtitle pairs from documentaries and categorized them into eight topics, such as economy and nature, to facilitate research on domain adaptation in video-guided MMT. Additionally, we preserved their contextual information to support research on leveraging the global context of documentaries in video-guided MMT. To better capture the shared semantics between text and video, we propose an MMT model based on a cross-modal bidirectional attention module. Extensive experiments on the TopicVD dataset demonstrate that visual information consistently improves the performance of the NMT model in documentary translation. However, the MMT model’s performance significantly declines in out-of-domain scenarios, highlighting the need for effective domain adaptation methods. Additionally, experiments demonstrate that global context can effectively improve translation performance. Dataset and our implementations are available at this https URL.

[NLP-29] Assessing Robustness to Spurious Correlations in Post-Training Language Models ICLR’25

【Quick Read】: This paper addresses the degradation in performance or generalization that arises when large language models (LLMs) are aligned to user intent and correctness criteria on training data containing spurious correlations. The key to the solution is a systematic evaluation of three post-training algorithms, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO), across diverse synthetic tasks and spuriousness conditions, revealing the relative robustness and suitability of each method under different types of spurious features.

Link: https://arxiv.org/abs/2505.05704
Authors: Julia Shuieh, Prasann Singhal, Apaar Shanker, John Heyer, George Pu, Samuel Denton
Affiliations: Scale AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICLR '25 Workshop on Spurious Correlation and Shortcut Learning

Abstract:Supervised and preference-based fine-tuning techniques have become popular for aligning large language models (LLMs) with user intent and correctness criteria. However, real-world training data often exhibits spurious correlations – arising from biases, dataset artifacts, or other “shortcut” features – that can compromise a model’s performance or generalization. In this paper, we systematically evaluate three post-training algorithms – Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and KTO (Kahneman-Tversky Optimization) – across a diverse set of synthetic tasks and spuriousness conditions. Our tasks span mathematical reasoning, constrained instruction-following, and document-grounded question answering. We vary the degree of spurious correlation (10% vs. 90%) and investigate two forms of artifacts: “Feature Ambiguity” and “Distributional Narrowness.” Our results show that the models often but not always degrade under higher spuriousness. The preference-based methods (DPO/KTO) can demonstrate relative robustness in mathematical reasoning tasks. By contrast, SFT maintains stronger performance in complex, context-intensive tasks. These findings highlight that no single post-training strategy universally outperforms in all scenarios; the best choice depends on the type of target task and the nature of spurious correlations.

[NLP-30] Exploration of COVID-19 Discourse on Twitter: American Politician Edition

【Quick Read】: This paper examines the partisan differences expressed by American politicians on Twitter during the COVID-19 pandemic, and how the keywords, topics, and sentiment in their posts can be used to predict their political leaning. The key to the solution is extracting features with bag-of-words, bigram, and TF-IDF models and combining these language models with different classification algorithms to predict and distinguish the political stance of tweets.

Link: https://arxiv.org/abs/2505.05687
Authors: Cindy Kim, Daniela Puchall, Jiangyi Liang, Jiwon Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The advent of the COVID-19 pandemic has undoubtedly affected the political scene worldwide and the introduction of new terminology and public opinions regarding the virus has further polarized partisan stances. Using a collection of tweets gathered from leading American political figures online (Republican and Democratic), we explored the partisan differences in approach, response, and attitude towards handling the international crisis. Implementation of the bag-of-words, bigram, and TF-IDF models was used to identify and analyze keywords, topics, and overall sentiments from each party. Results suggest that Democrats are more concerned with the casualties of the pandemic, and give more medical precautions and recommendations to the public whereas Republicans are more invested in political responsibilities such as keeping the public updated through media and carefully watching the progress of the virus. We propose a systematic approach to predict and distinguish a tweet’s political stance (left or right leaning) based on its COVID-19 related terms using different classification algorithms on different language models.

[NLP-31] Prompted Meta-Learning for Few-shot Knowledge Graph Completion

[Quick Read]: This paper addresses the performance degradation in few-shot knowledge graph completion (KGC) caused by data scarcity, in particular the difficulty of transferring and adapting to newly emerging relations. Existing methods mostly exploit relational information while overlooking the rich semantics in knowledge graphs. The key to the solution is a novel prompted meta-learning framework (PromptMeta) that combines meta-semantics with relational information through two core innovations: (1) a meta-semantic prompt pool that captures and consolidates high-level meta-semantics, supporting effective knowledge transfer to rare and newly emerging relations; and (2) a learnable fusion prompt that dynamically combines meta-semantic information with the task-specific relational information of different few-shot tasks. Both components are optimized jointly with the model parameters within a meta-learning framework, improving few-shot KGC performance.

Link: https://arxiv.org/abs/2505.05684
Authors: Han Wu, Jie Yin
Institutions: The University of Sydney
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Few-shot knowledge graph completion (KGC) has obtained significant attention due to its practical applications in real-world scenarios, where new knowledge often emerges with limited available data. While most existing methods for few-shot KGC have predominantly focused on leveraging relational information, rich semantics inherent in KGs have been largely overlooked. To address this gap, we propose a novel prompted meta-learning (PromptMeta) framework that seamlessly integrates meta-semantics with relational information for few-shot KGC. PromptMeta has two key innovations: (1) a meta-semantic prompt pool that captures and consolidates high-level meta-semantics, enabling effective knowledge transfer and adaptation to rare and newly emerging relations. (2) a learnable fusion prompt that dynamically combines meta-semantic information with task-specific relational information tailored to different few-shot tasks. Both components are optimized together with model parameters within a meta-learning framework. Extensive experiments on two benchmark datasets demonstrate the effectiveness of our approach.

[NLP-32] Adaptive Stress Testing Black-Box LLM Planners

[Quick Read]: This paper addresses the tendency of large language models (LLMs) to produce unsafe and unreliable outputs in decision-making tasks, the so-called hallucination problem. The key to the solution is a novel method that searches the space of prompt perturbations with Adaptive Stress Testing (AST) and Monte-Carlo Tree Search (MCTS) to efficiently discover scenarios and prompts that cause high model uncertainty, so that uncertainty-inducing prompts can be generated automatically at runtime and used for real-time trust assessment.

Link: https://arxiv.org/abs/2505.05665
Authors: Neeloy Chakraborty, John Pohovey, Melkior Ornik, Katherine Driggs-Campbell
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages, 16 figures, 4 tables

Abstract:Large language models (LLMs) have recently demonstrated success in generalizing across decision-making tasks including planning, control and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. We argue that detecting such failures is necessary, especially in safety-critical scenarios. Existing black-box methods often detect hallucinations by identifying inconsistencies across multiple samples. Many of these approaches typically introduce prompt perturbations like randomizing detail order or generating adversarial inputs, with the intuition that a confident model should produce stable outputs. We first perform a manual case study showing that other forms of perturbations (e.g., adding noise, removing sensor details) cause LLMs to hallucinate in a driving environment. We then propose a novel method for efficiently searching the space of prompt perturbations using Adaptive Stress Testing (AST) with Monte-Carlo Tree Search (MCTS). Our AST formulation enables discovery of scenarios and prompts that cause language models to act with high uncertainty. By generating MCTS prompt perturbation trees across diverse scenarios, we show that offline analyses can be used at runtime to automatically generate prompts that influence model uncertainty, and to inform real-time trust assessments of an LLM.
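The paper couples AST with full MCTS over perturbation trees; as a much simpler stand-in, the sketch below runs a one-level UCB1 search over a flat set of perturbation actions against a stubbed uncertainty oracle. All names here (`PERTURBATIONS`, `model_uncertainty`, `apply`) are hypothetical placeholders for queries to the black-box LLM planner.

```python
import math
import random

PERTURBATIONS = ["drop_sensor_detail", "shuffle_detail_order",
                 "add_noise_tokens", "remove_distance_units"]

def model_uncertainty(prompt: str) -> float:
    """Stand-in for sampling the black-box LLM planner several times and
    measuring disagreement across its outputs (higher = more uncertain)."""
    random.seed(hash(prompt) % (2 ** 32))
    return random.random()

def apply(perturbation: str, prompt: str) -> str:
    return f"{prompt} <{perturbation}>"  # placeholder edit

def ucb_search(base_prompt: str, iters: int = 100, c: float = 1.4) -> str:
    counts = {p: 0 for p in PERTURBATIONS}
    values = {p: 0.0 for p in PERTURBATIONS}
    for t in range(1, iters + 1):
        # UCB1: exploit perturbations that induced uncertainty, explore rare ones.
        arm = max(PERTURBATIONS, key=lambda p: float("inf") if counts[p] == 0
                  else values[p] / counts[p] + c * math.sqrt(math.log(t) / counts[p]))
        reward = model_uncertainty(apply(arm, base_prompt))
        counts[arm] += 1
        values[arm] += reward
    return max(PERTURBATIONS, key=lambda p: values[p] / max(counts[p], 1))

print(ucb_search("Drive to the intersection and stop."))
```

The actual method expands these choices into a tree so that sequences of perturbations can be composed, which a flat bandit cannot do.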

[NLP-33] Privacy-Preserving Transformers: SwiftKey's Differential Privacy Implementation

[Quick Read]: This paper addresses how to balance model size, runtime speed, and accuracy when using a transformer for language modeling in SwiftKey. The key to the solution is scaling a GPT2 architecture down to the required size and adopting a two-stage training process: the first stage builds a seed model on general data, and the second performs differentially private (DP) fine-tuning on typing data, keeping the growth in memory and speed graceful while maintaining accuracy. The model is integrated via ONNX for both flexibility and efficiency.

Link: https://arxiv.org/abs/2505.05648
Authors: Abdelrahman Abouelenin, Mohamed Abdelrehim, Raffy Fahim, Amr Hendy, Mohamed Afify
Institutions: Microsoft
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:In this paper we train a transformer using differential privacy (DP) for language modeling in SwiftKey. We run multiple experiments to balance the trade-off between the model size, run-time speed and accuracy. We show that we get small and consistent gains in the next-word-prediction and accuracy with graceful increase in memory and speed compared to the production GRU. This is obtained by scaling down a GPT2 architecture to fit the required size and a two stage training process that builds a seed model on general data and DP finetunes it on typing data. The transformer is integrated using ONNX offering both flexibility and efficiency.
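The abstract does not spell out the DP mechanism; the usual choice for DP fine-tuning is DP-SGD (per-example gradient clipping followed by calibrated Gaussian noise, Abadi et al. 2016), sketched below on a toy model. All hyperparameters are arbitrary, and the production setup (a scaled-down GPT2 on typing data, exported to ONNX) is of course far larger.

```python
import torch
from torch import nn

def dp_sgd_step(model, batch_x, batch_y, lr=0.1, clip=1.0, noise_mult=1.0):
    """One DP-SGD step: clip each example's gradient to norm `clip`,
    sum, add Gaussian noise with std `noise_mult * clip`, then update."""
    loss_fn = nn.CrossEntropyLoss()
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):          # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
        scale = min(1.0, clip / (norm + 1e-12))  # clip this example's gradient
        for g, p in zip(grads, model.parameters()):
            g += p.grad * scale
    n = len(batch_x)
    with torch.no_grad():
        for g, p in zip(grads, model.parameters()):
            g += torch.randn_like(g) * noise_mult * clip  # privacy noise
            p -= lr * g / n

model = nn.Linear(8, 3)
x, y = torch.randn(16, 8), torch.randint(0, 3, (16,))
dp_sgd_step(model, x, y)
print("DP-SGD step done")
```

In practice a library such as Opacus would handle per-sample gradients and privacy accounting far more efficiently than this explicit loop.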

[NLP-34] KG-HTC: Integrating Knowledge Graphs into LLMs for Effective Zero-shot Hierarchical Text Classification

[Quick Read]: This paper addresses the practical challenges of Hierarchical Text Classification (HTC), including insufficient annotated data, huge label spaces, and long-tail distributions. The key to the solution is combining knowledge graphs with large language models (LLMs): a Retrieval-Augmented Generation (RAG) approach retrieves subgraphs relevant to the input text from the knowledge graph, supplying structured semantic context for the classification task and improving the model's understanding of label semantics across hierarchy levels.

Link: https://arxiv.org/abs/2505.05583
Authors: Qianbo Zang, Christophe Zgrzendek, Igor Tchappi, Afshin Khadangi, Johannes Sedlmeir
Institutions: Interdisciplinary Centre for Security, Reliability and Trust (SnT), Université du Luxembourg; Enovos Luxembourg S.A.; Universität Münster
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Hierarchical Text Classification (HTC) involves assigning documents to labels organized within a taxonomy. Most previous research on HTC has focused on supervised methods. However, in real-world scenarios, employing supervised HTC can be challenging due to a lack of annotated data. Moreover, HTC often faces issues with large label spaces and long-tail distributions. In this work, we present Knowledge Graphs for zero-shot Hierarchical Text Classification (KG-HTC), which aims to address these challenges of HTC in applications by integrating knowledge graphs with Large Language Models (LLMs) to provide structured semantic context during classification. Our method retrieves relevant subgraphs from knowledge graphs related to the input text using a Retrieval-Augmented Generation (RAG) approach. Our KG-HTC can enhance LLMs to understand label semantics at various hierarchy levels. We evaluate KG-HTC on three open-source HTC datasets: WoS, DBpedia, and Amazon. Our experimental results show that KG-HTC significantly outperforms three baselines in the strict zero-shot setting, particularly achieving substantial improvements at deeper levels of the hierarchy. This evaluation demonstrates the effectiveness of incorporating structured knowledge into LLMs to address HTC’s challenges in large label spaces and long-tailed label distributions. Our code is available at: this https URL.
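To make the retrieval idea concrete, here is a toy version: a small label taxonomy, a stand-in relevance score (token overlap in place of the paper's embedding-based RAG retrieval), and a prompt assembled from the retrieved subgraph (top-scoring labels plus their ancestors). Everything here, the taxonomy, the scoring, and the prompt wording, is an illustrative assumption.

```python
# Toy taxonomy: parent -> children. Retrieval scores leaf labels against the
# document, then walks up to include ancestors, forming the "subgraph".
TAXONOMY = {
    "Science": ["Physics", "Biology"],
    "Physics": ["Quantum mechanics", "Optics"],
    "Biology": ["Genetics", "Ecology"],
}
PARENT = {child: parent for parent, kids in TAXONOMY.items() for child in kids}

def score(label: str, doc: str) -> int:
    # Stand-in for embedding similarity between a label and the document.
    return len(set(label.lower().split()) & set(doc.lower().split()))

def retrieve_subgraph(doc: str, k: int = 2) -> set:
    leaves = [n for n in PARENT if n not in TAXONOMY]
    top = sorted(leaves, key=lambda l: score(l, doc), reverse=True)[:k]
    sub = set()
    for node in top:                 # include each hit plus all its ancestors
        while node is not None:
            sub.add(node)
            node = PARENT.get(node)
    return sub

def build_prompt(doc: str, subgraph: set) -> str:
    context = "; ".join(sorted(subgraph))
    return (f"Candidate taxonomy context: {context}\n"
            f"Document: {doc}\nAnswer with the best label path.")

doc = "photon optics experiment with laser interference"
print(build_prompt(doc, retrieve_subgraph(doc)))
```

Feeding only the retrieved sub-taxonomy to the LLM is what keeps the prompt tractable when the full label space is large and long-tailed.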

[NLP-35] X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP ICML2025

[Quick Read]: This paper addresses the vulnerability of Contrastive Language-Image Pre-training (CLIP) models to adversarial perturbations, in particular the lack of universal adversarial transferability across data, domains, models, and tasks. The key to the solution is a new attack, X-Transfer, whose core innovation is surrogate scaling: dynamically selecting a small number of suitable surrogate models from a large search space, which makes it possible to generate Universal Adversarial Perturbations (UAPs) efficiently with strong cross-domain, cross-model, and cross-task transferability.

Link: https://arxiv.org/abs/2505.05528
Authors: Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICML 2025

Abstract:As Contrastive Language-Image Pre-training (CLIP) models are increasingly adopted for diverse downstream tasks and integrated into large vision-language models (VLMs), their susceptibility to adversarial perturbations has emerged as a critical concern. In this work, we introduce X-Transfer, a novel attack method that exposes a universal adversarial vulnerability in CLIP. X-Transfer generates a Universal Adversarial Perturbation (UAP) capable of deceiving various CLIP encoders and downstream VLMs across different samples, tasks, and domains. We refer to this property as super transferability – a single perturbation achieving cross-data, cross-domain, cross-model, and cross-task adversarial transferability simultaneously. This is achieved through surrogate scaling, a key innovation of our approach. Unlike existing methods that rely on fixed surrogate models, which are computationally intensive to scale, X-Transfer employs an efficient surrogate scaling strategy that dynamically selects a small subset of suitable surrogates from a large search space. Extensive evaluations demonstrate that X-Transfer significantly outperforms previous state-of-the-art UAP methods, establishing a new benchmark for adversarial transferability across CLIP models. The code is publicly available in our GitHub repository (this https URL).

[NLP-36] Evolutionary ecology of words

[Quick Read]: This paper asks how evolutionary game theory and agent-based models can be extended to exploit the rich linguistic expressions of large language models (LLMs) in order to simulate an evolutionary ecology of words. The key to the solution is a system of agents carrying LLM-generated short phrases that compete and evolve through interactions with spatial neighbors: the outcome of each interaction is decided by the LLM from the relationship between the agents' words, and word mutations, also based on LLM outputs, are introduced, allowing diverse and effectively unlimited interaction options to emerge and evolve.

Link: https://arxiv.org/abs/2505.05863
Authors: Reiji Suzuki, Takaya Arita
Institutions: Nagoya University
Subjects: Populations and Evolution (q-bio.PE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 5 figures. Preprint of the paper published in Proceedings of 2025 IEEE Symposium on Computational Intelligence in Artificial Life and Cooperative Intelligent Systems (ALIFE-CIS)

Abstract:We propose a model for the evolutionary ecology of words as one attempt to extend evolutionary game theory and agent-based models by utilizing the rich linguistic expressions of Large Language Models (LLMs). Our model enables the emergence and evolution of diverse and infinite options for interactions among agents. Within the population, each agent possesses a short word (or phrase) generated by an LLM and moves within a spatial environment. When agents become adjacent, the outcome of their interaction is determined by the LLM based on the relationship between their words, with the loser's word being replaced by the winner's. Word mutations, also based on LLM outputs, may occur. We conducted preliminary experiments assuming that "strong animal species" would survive. The results showed that from an initial population consisting of well-known species, many species emerged both gradually and in a punctuated equilibrium manner. Each trial demonstrated the unique evolution of diverse populations, with one type of large species becoming dominant, such as terrestrial animals, marine life, or extinct species, which were ecologically specialized and adapted ones across diverse extreme habitats. We also conducted a long-term experiment with a large population, demonstrating the emergence and coexistence of diverse species.

[NLP-37] Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications

[Quick Read]: This paper addresses the scarcity of high-quality multimodal biomedical data, which limits effective fine-tuning of pretrained large language models (LLMs) for specialized biomedical tasks. The key to the solution is MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with the domain-specific decision patterns in multimodal biomedical data through preference optimization. MINT uses an upstream multimodal machine learning (MML) model trained on high-quality multimodal data to transfer domain knowledge into downstream text-only or image-only LLMs, improving their performance on specialized tasks.

Link: https://arxiv.org/abs/2505.05736
Authors: Da Wu, Zhanliang Wang, Quan Nguyen, Zhuoran Xu, Kai Wang
Institutions: Unknown
Subjects: Quantitative Methods (q-bio.QM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: First Draft

Abstract:The scarcity of high-quality multimodal biomedical data limits the ability to effectively fine-tune pretrained Large Language Models (LLMs) for specialized biomedical tasks. To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data through preference optimization. While MINT supports different optimization techniques, we primarily implement it with the Odds Ratio Preference Optimization (ORPO) framework as its backbone. This strategy enables the aligned LLMs to perform predictive tasks using text-only or image-only inputs while retaining knowledge learnt from multimodal data. MINT leverages an upstream multimodal machine learning (MML) model trained on high-quality multimodal data to transfer domain-specific insights to downstream text-only or image-only LLMs. We demonstrate its effectiveness through two key applications: (1) Rare genetic disease prediction from texts, where MINT uses a multimodal encoder model, trained on facial photos and clinical notes, to generate a preference dataset for aligning a lightweight Llama 3.2-3B-Instruct. Despite relying on text input only, the MINT-derived model outperforms models trained with SFT, RAG, or DPO, and even outperforms Llama 3.1-405B-Instruct. (2) Tissue type classification using cell nucleus images, where MINT uses a vision-language foundation model as the preference generator, containing knowledge learnt from both text and histopathological images to align downstream image-only models. The resulting MINT-derived model significantly improves the performance of Llama 3.2-Vision-11B-Instruct on tissue type classification. In summary, MINT provides an effective strategy to align unimodal LLMs with high-quality multimodal expertise through preference optimization.
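MINT's primary backbone is ORPO, whose published objective augments the supervised loss with a log-odds-ratio preference term. A simplified PyTorch rendering is sketched below, operating on length-normalized sequence log-probabilities; the lambda weight and the toy inputs are placeholders, and real use would compute the log-probabilities from the decoder model.

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """Simplified ORPO objective: supervised NLL on the chosen response plus
    a log-odds-ratio penalty favouring chosen over rejected responses.
    logp_* are length-normalized sequence log-probabilities in (-inf, 0)."""
    def log_odds(logp):
        # log( p / (1 - p) ) computed stably from log p
        return logp - torch.log1p(-torch.exp(logp))

    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -F.logsigmoid(ratio)   # preference (odds-ratio) term
    l_sft = -logp_chosen          # standard fine-tuning term
    return (l_sft + lam * l_or).mean()

logp_c = torch.tensor([-0.4, -0.7])   # toy per-sequence average log-probs
logp_r = torch.tensor([-1.2, -0.9])
print(orpo_loss(logp_c, logp_r))
```

Because the preference signal here comes from an upstream multimodal model rather than human raters, the same loss lets a text-only or image-only LLM absorb multimodal expertise.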

Computer Vision

[CV-0] Anymate: A Dataset and Baselines for Learning 3D Object Rigging SIGGRAPH2025

[Quick Read]: This paper aims to automate the rigging and skinning steps of 3D animation, which traditionally demand substantial expertise and manual effort. Heuristics-based methods struggle with complex geometry, and existing data-driven approaches are limited by the amount of training data. The paper introduces the Anymate Dataset, 230K 3D assets paired with expert-crafted rigging and skinning information, 70 times larger than existing datasets. The key to the solution is a learning-based auto-rigging framework with three sequential modules for joint, connectivity, and skinning-weight prediction; through systematic architecture design and comprehensive experiments, its models significantly outperform existing methods.

Link: https://arxiv.org/abs/2505.06227
Authors: Yufan Deng, Yuhao Zhang, Chen Geng, Shangzhe Wu, Jiajun Wu
Institutions: Stanford University; University of Cambridge
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH 2025. Project page: this https URL

Abstract:Rigging and skinning are essential steps to create realistic 3D animations, often requiring significant expertise and manual effort. Traditional attempts at automating these processes rely heavily on geometric heuristics and often struggle with objects of complex geometry. Recent data-driven approaches show potential for better generality, but are often constrained by limited training data. We present the Anymate Dataset, a large-scale dataset of 230K 3D assets paired with expert-crafted rigging and skinning information – 70 times larger than existing datasets. Using this dataset, we propose a learning-based auto-rigging framework with three sequential modules for joint, connectivity, and skinning weight prediction. We systematically design and experiment with various architectures as baselines for each module and conduct comprehensive evaluations on our dataset to compare their performance. Our models significantly outperform existing methods, providing a foundation for comparing future methods in automated rigging and skinning. Code and dataset can be found at this https URL.

[CV-1] VIN-NBV: A View Introspection Network for Next-Best-View Selection for Resource-Efficient 3D Reconstruction

[Quick Read]: This paper addresses the fact that for real scenes with complex geometry and self-occlusion, traditional Next Best View (NBV) algorithms that simply maximize coverage do not directly improve 3D reconstruction quality. The key to the solution is the View Introspection Network (VIN), trained to predict the reconstruction-quality improvement of a view directly, together with the VIN-NBV policy, a greedy sequential sampling strategy that picks the query view with the highest predicted improvement, markedly raising reconstruction quality under constraints on the number of acquisitions or the time in motion.

Link: https://arxiv.org/abs/2505.06219
Authors: Noah Frahm, Dongxu Zhao, Andrea Dunn Beltran, Ron Alterovitz, Jan-Michael Frahm, Junier Oliva, Roni Sengupta
Institutions: UNC Chapel Hill
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 19 pages, 11 figures

Abstract:Next Best View (NBV) algorithms aim to acquire an optimal set of images using minimal resources, time, or number of captures to enable efficient 3D reconstruction of a scene. Existing approaches often rely on prior scene knowledge or additional image captures and often develop policies that maximize coverage. Yet, for many real scenes with complex geometry and self-occlusions, coverage maximization does not lead to better reconstruction quality directly. In this paper, we propose the View Introspection Network (VIN), which is trained to predict the reconstruction quality improvement of views directly, and the VIN-NBV policy. A greedy sequential sampling-based policy, where at each acquisition step, we sample multiple query views and choose the one with the highest VIN predicted improvement score. We design the VIN to perform 3D-aware featurization of the reconstruction built from prior acquisitions, and for each query view create a feature that can be decoded into an improvement score. We then train the VIN using imitation learning to predict the reconstruction improvement score. We show that VIN-NBV improves reconstruction quality by ~30% over a coverage maximization baseline when operating with constraints on the number of acquisitions or the time in motion.

[CV-2] Let Humanoids Hike! Integrative Skill Development on Complex Trails CVPR2025

[Quick Read]: This paper addresses autonomous hiking on complex trails for humanoid robots: current humanoid locomotion research lacks long-term goals and situational awareness, while semantic navigation overlooks real-world embodiment and local terrain variability. The key to the solution is the LEGO-H learning framework, with two core technical innovations: 1) a temporal vision transformer variant, embedded in a Hierarchical Reinforcement Learning framework, that anticipates future local goals to guide movement, seamlessly integrating locomotion with goal-directed navigation; and 2) a privileged learning scheme that combines latent representations of joint movement patterns with hierarchical metric learning, improving policy transfer from privileged training to onboard execution. These components allow LEGO-H to handle diverse physical and environmental challenges without relying on predefined motion patterns.

Link: https://arxiv.org/abs/2505.06218
Authors: Kwan-Yee Lin, Stella X. Yu
Institutions: University of Michigan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Project page: this https URL

Abstract:Hiking on complex trails demands balance, agility, and adaptive decision-making over unpredictable terrain. Current humanoid research remains fragmented and inadequate for hiking: locomotion focuses on motor skills without long-term goals or situational awareness, while semantic navigation overlooks real-world embodiment and local terrain variability. We propose training humanoids to hike on complex trails, driving integrative skill development across visual perception, decision making, and motor execution. We develop a learning framework, LEGO-H, that enables a vision-equipped humanoid robot to hike complex trails autonomously. We introduce two technical innovations: 1) A temporal vision transformer variant - tailored into Hierarchical Reinforcement Learning framework - anticipates future local goals to guide movement, seamlessly integrating locomotion with goal-directed navigation. 2) Latent representations of joint movement patterns, combined with hierarchical metric learning - enhance Privileged Learning scheme - enable smooth policy transfer from privileged training to onboard execution. These components allow LEGO-H to handle diverse physical and environmental challenges without relying on predefined motion patterns. Experiments across varied simulated trails and robot morphologies highlight LEGO-H’s versatility and robustness, positioning hiking as a compelling testbed for embodied autonomy and LEGO-H as a baseline for future humanoid development.

[CV-3] Adapting a Segmentation Foundation Model for Medical Image Classification

[Quick Read]: This paper investigates how to effectively adapt segmentation foundation models such as the Segment Anything Model (SAM) to medical image classification, a topic that remains underexplored: SAM excels at zero-shot segmentation, but its use for medical image classification has received little study. The key to the solution is using the frozen SAM image encoder as a feature extractor to capture segmentation-based features carrying important spatial and contextual information, together with a novel Spatially Localized Channel Attention (SLCA) mechanism that computes spatially localized attention weights over the feature maps, steering deep-learning classification models toward spatially relevant and meaningful regions of the image and thereby improving classification performance.

Link: https://arxiv.org/abs/2505.06217
Authors: Pengfei Gu, Haoteng Tang, Islam A. Ebeid, Jose A. Nunez, Fabian Vazquez, Diego Adame, Marcus Zhan, Huimin Li, Bin Fu, Danny Z. Chen
Institutions: University of Texas Rio Grande Valley; Texas Woman's University; Sewickley Academy; The University of Texas at Dallas; University of Notre Dame
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in foundation models, such as the Segment Anything Model (SAM), have shown strong performance in various vision tasks, particularly image segmentation, due to their impressive zero-shot segmentation capabilities. However, effectively adapting such models for medical image classification is still a less explored topic. In this paper, we introduce a new framework to adapt SAM for medical image classification. First, we utilize the SAM image encoder as a feature extractor to capture segmentation-based features that convey important spatial and contextual details of the image, while freezing its weights to avoid unnecessary overhead during training. Next, we propose a novel Spatially Localized Channel Attention (SLCA) mechanism to compute spatially localized attention weights for the feature maps. The features extracted from SAM’s image encoder are processed through SLCA to compute attention weights, which are then integrated into deep learning classification models to enhance their focus on spatially relevant or meaningful regions of the image, thus improving classification performance. Experimental results on three public medical image classification datasets demonstrate the effectiveness and data-efficiency of our approach.

[CV-4] Brain Hematoma Marker Recognition Using Multitask Learning: SwinTransformer and Swin-Unet

[Quick Read]: This paper addresses the limited generalization of models in medical image analysis caused by potential spurious correlations. The key to the solution is a multi-task learning framework built on transformers that enhances the original image representation with two further representations, one obtained from semantic segmentation and one from image reconstruction, improving performance under different test conditions.

Link: https://arxiv.org/abs/2505.06185
Authors: Kodai Hirata, Tsuyoshi Okita
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures

Abstract:This paper proposes MTL-Swin-Unet, a multi-task learning method that uses transformers for classification and semantic segmentation. To counter spurious-correlation problems, the method enhances the image representation with two other image representations: a representation obtained by semantic segmentation and a representation obtained by image reconstruction. In our experiments, the proposed method outperformed the other classifiers in F-measure when the test data included slices from the same patient (no covariate shift); likewise, it outperformed them in AUC when the test data did not include slices from the same patient (covariate-shift setting).

[CV-5] MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills SIGGRAPH2025

[Quick Read]: This paper addresses the difficulty that users face in planning the many fine-grained procedural editing operations needed for professional-quality retouching, while also avoiding the identity drift of generative editing. The key to the solution is a multimodal large language model (MLLM) that is first taught to understand the underlying image-processing operations by solving specially designed visual puzzles; reasoning data synthesized from expert edits then trains the model to critique a raw photo, propose suitable remedies, and finally realize them with a predefined set of procedural image operations. This keeps the edits interpretable, preserves object detail and resolution, and lets users selectively override the results.

Link: https://arxiv.org/abs/2505.06176
Authors: Niladri Shekhar Dutt, Duygu Ceylan, Niloy J. Mitra
Institutions: University College London; Adobe Research
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at SIGGRAPH 2025 [ACM Transactions on Graphics]; Project website: this https URL

Abstract:Retouching is an essential task in post-manipulation of raw photographs. Generative editing, guided by text or strokes, provides a new tool accessible to users but can easily change the identity of the original objects in unacceptable and unpredictable ways. In contrast, although traditional procedural edits, as commonly supported by photoediting tools (e.g., Gimp, Lightroom), are conservative, they are still preferred by professionals. Unfortunately, professional quality retouching involves many individual procedural editing operations that is challenging to plan for most novices. In this paper, we ask if a multimodal large language model (MLLM) can be taught to critique raw photographs, suggest suitable remedies, and finally realize them with a given set of pre-authored procedural image operations. We demonstrate that MLLMs can be first made aware of the underlying image processing operations, by training them to solve specially designed visual puzzles. Subsequently, such an operation-aware MLLM can both plan and propose edit sequences. To facilitate training, given a set of expert-edited photos, we synthesize a reasoning dataset by procedurally manipulating the expert edits and then grounding a pretrained LLM on the visual adjustments, to synthesize reasoning for finetuning. The proposed retouching operations are, by construction, understandable by the users, preserve object details and resolution, and can be optionally overridden. We evaluate our setup on a variety of test examples and show advantages, in terms of explainability and identity preservation, over existing generative and other procedural alternatives. Code, data, models, and supplementary results can be found via our project website at this https URL.

[CV-6] DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models CVPR2025

[Quick Read]: This paper tackles generating 3D hair geometry from a single image, a task made hard by the diversity of hairstyles and the absence of paired image-to-3D hair data. Earlier methods, trained mainly on synthetic data, cope with the limited data through low-dimensional intermediate representations (guide strands, scalp-level embeddings) that must be decoded, upsampled, and made realistic in post-processing; they fail to reconstruct detailed hair, struggle with curly hair, or handle only a few hairstyles. The key to the solution is the DiffLocks framework: it automates creation of the largest synthetic hair dataset to date (40K hairstyles) and uses it to train an image-conditioned diffusion-transformer model that generates accurate 3D strands from a single frontal image. A pretrained image backbone lets the method generalize to in-the-wild images despite synthetic-only training. The diffusion model predicts a scalp texture map in which every point carries the latent code of an individual strand; these codes are decoded directly to 3D strands without post-processing, which makes it possible to recover complex curly hairstyles.

Link: https://arxiv.org/abs/2505.06166
Authors: Radu Alexandru Rosu, Keyu Wu, Yao Feng, Youyi Zheng, Michael J. Black
Institutions: Meshcapade; Zhejiang University; Stanford University; Max Planck Institute for Intelligent Systems
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:We address the task of generating 3D hair geometry from a single image, which is challenging due to the diversity of hairstyles and the lack of paired image-to-3D hair data. Previous methods are primarily trained on synthetic data and cope with the limited amount of such data by using low-dimensional intermediate representations, such as guide strands and scalp-level embeddings, that require post-processing to decode, upsample, and add realism. These approaches fail to reconstruct detailed hair, struggle with curly hair, or are limited to handling only a few hairstyles. To overcome these limitations, we propose DiffLocks, a novel framework that enables detailed reconstruction of a wide variety of hairstyles directly from a single image. First, we address the lack of 3D hair data by automating the creation of the largest synthetic hair dataset to date, containing 40K hairstyles. Second, we leverage the synthetic hair dataset to learn an image-conditioned diffusion-transfomer model that generates accurate 3D strands from a single frontal image. By using a pretrained image backbone, our method generalizes to in-the-wild images despite being trained only on synthetic data. Our diffusion model predicts a scalp texture map in which any point in the map contains the latent code for an individual hair strand. These codes are directly decoded to 3D strands without post-processing techniques. Representing individual strands, instead of guide strands, enables the transformer to model the detailed spatial structure of complex hairstyles. With this, DiffLocks can recover highly curled hair, like afro hairstyles, from a single image for the first time. Data and code is available at this https URL

[CV-7] MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks

[Quick Read]: This paper addresses the lack of professional, detailed diagnostic analysis in current medical vision-language models (VLMs) for dermatology, which stems mainly from the insufficiently specialized text descriptions in existing dermatology multimodal datasets. The key to the solution is MM-Skin, the first large-scale multimodal dermatology dataset, covering three imaging modalities (clinical, dermoscopic, and pathological) with nearly 10k high-quality image-text pairs collected from professional textbooks, plus over 27k diverse instruction-following visual question answering (VQA) samples, providing the data foundation for building dermatology-specific VLMs.

Link: https://arxiv.org/abs/2505.06152
Authors: Wenqi Zeng, Yuqi Sun, Chenxi Ma, Weimin Tan, Bo Yan
Institutions: Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Medical vision-language models (VLMs) have shown promise as clinical assistants across various medical fields. However, specialized dermatology VLM capable of delivering professional and detailed diagnostic analysis remains underdeveloped, primarily due to less specialized text descriptions in current dermatology multimodal datasets. To address this issue, we propose MM-Skin, the first large-scale multimodal dermatology dataset that encompasses 3 imaging modalities, including clinical, dermoscopic, and pathological and nearly 10k high-quality image-text pairs collected from professional textbooks. In addition, we generate over 27k diverse, instruction-following vision question answering (VQA) samples (9 times the size of current largest dermatology VQA dataset). Leveraging public datasets and MM-Skin, we developed SkinVL, a dermatology-specific VLM designed for precise and nuanced skin disease interpretation. Comprehensive benchmark evaluations of SkinVL on VQA, supervised fine-tuning (SFT) and zero-shot classification tasks across 8 datasets, reveal its exceptional performance for skin diseases in comparison to both general and medical VLM models. The introduction of MM-Skin and SkinVL offers a meaningful contribution to advancing the development of clinical dermatology VLM assistants. MM-Skin is available at this https URL

[CV-8] BrainSegDMlF: A Dynamic Fusion-enhanced SAM for Brain Lesion Segmentation

[Quick Read]: This paper targets the challenges of brain lesion segmentation, including indistinct boundaries between lesion regions and normal brain tissue, the difficulty of identifying small lesions, and the limitations of existing methods in exploiting multimodal information, their dependence on data volume, and their lack of automatic segmentation. The key to the solution is BrainSegDMLF, a large-scale fully automatic segmentation model comprising a Dynamic Modal Interactive Fusion (DMIF) module that integrates multimodal data, a layer-by-layer upsampling decoder that extracts rich low-level and high-level features, and automatic segmentation masks that remove the need for manual prompts.

Link: https://arxiv.org/abs/2505.06133
Authors: Hongming Wang, Yifeng Wu, Huimin Huang, Hongtao Wu, Jia-Xuan Jiang, Xiaodong Zhang, Hao Zheng, Xian Wu, Yefeng Zheng, Jinping Xu, Jing Cheng
Institutions: Southern University of Science and Technology; Tencent; Westlake University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The segmentation of substantial brain lesions is a significant and challenging task in the field of medical image segmentation. Substantial brain lesions in brain imaging exhibit high heterogeneity, with indistinct boundaries between lesion regions and normal brain tissue. Small lesions in single slices are difficult to identify, making the accurate and reproducible segmentation of abnormal regions, as well as their feature description, highly complex. Existing methods have the following limitations: 1) They rely solely on single-modal information for learning, neglecting the multi-modal information commonly used in diagnosis. This hampers the ability to comprehensively acquire brain lesion information from multiple perspectives and prevents the effective integration and utilization of multi-modal data inputs, thereby limiting a holistic understanding of lesions. 2) They are constrained by the amount of data available, leading to low sensitivity to small lesions and difficulty in detecting subtle pathological changes. 3) Current SAM-based models rely on external prompts, which cannot achieve automatic segmentation and, to some extent, affect diagnostic efficiency. To address these issues, we have developed a large-scale fully automated segmentation model specifically designed for brain lesion segmentation, named BrainSegDMLF. This model has the following features: 1) Dynamic Modal Interactive Fusion (DMIF) module that processes and integrates multi-modal data during the encoding process, providing the SAM encoder with more comprehensive modal information. 2) Layer-by-Layer Upsampling Decoder, enabling the model to extract rich low-level and high-level features even with limited data, thereby detecting the presence of small lesions. 3) Automatic segmentation masks, allowing the model to generate lesion masks automatically without requiring manual prompts.

[CV-9] Wasserstein Distances Made Explainable: Insights into Dataset Shifts and Transport Phenomena

[Quick Read]: This paper addresses the problem of accurately identifying the factors behind a high or low Wasserstein distance: merely computing the distance or inspecting the corresponding optimal transport map (or coupling) may not reveal how different data components contribute to it. The key to the solution is bringing in Explainable AI, which makes it possible to attribute Wasserstein distances efficiently and accurately to data components such as data subgroups, input features, or interpretable subspaces.

Link: https://arxiv.org/abs/2505.06123
Authors: Philip Naumann, Jacob Kauffmann, Grégoire Montavon
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Wasserstein distances provide a powerful framework for comparing data distributions. They can be used to analyze processes over time or to detect inhomogeneities within data. However, simply calculating the Wasserstein distance or analyzing the corresponding transport map (or coupling) may not be sufficient for understanding what factors contribute to a high or low Wasserstein distance. In this work, we propose a novel solution based on Explainable AI that allows us to efficiently and accurately attribute Wasserstein distances to various data components, including data subgroups, input features, or interpretable subspaces. Our method achieves high accuracy across diverse datasets and Wasserstein distance specifications, and its practical utility is demonstrated in two use cases.
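As a rough point of comparison for the paper's attribution idea, the snippet below computes the 1D Wasserstein distance of each feature's marginal between two datasets that differ in a single feature (requires scipy). This naive per-marginal view is only a first-order diagnostic: the paper's contribution is attributing the full multivariate distance via Explainable AI, which marginals cannot capture, for example shifts in correlations.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# Two datasets that differ mainly in feature 1 (a simulated dataset shift).
X = rng.normal(0.0, 1.0, size=(2000, 3))
Y = rng.normal(0.0, 1.0, size=(2000, 3))
Y[:, 1] += 1.5

# Naive per-feature attribution: 1D Wasserstein distance per marginal.
for j in range(X.shape[1]):
    d = wasserstein_distance(X[:, j], Y[:, j])
    print(f"feature {j}: W1 = {d:.3f}")
```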

[CV-10] Photovoltaic Defect Image Generator with Boundary Alignment Smoothing Constraint for Domain Shift Mitigation

[Quick Read]: This paper addresses the difficulty of training defect-detection models for photovoltaic (PV) cells under scarce defect data. Existing attempts to augment datasets with generative models commonly suffer from instability, limited diversity, and domain shift. The key to the solution is PDIG, a Photovoltaic Defect Image Generator built on the Stable Diffusion (SD) architecture: a Semantic Concept Embedding (SCE) module strengthens the modeling of the relationship between defect types and their appearance, a Lightweight Industrial Style Adaptor (LISA) injects industrial defect characteristics into the SD model, and at inference a Text-Image Dual-Space Constraints (TIDSC) module improves the quality and consistency of the generated images.

Link: https://arxiv.org/abs/2505.06117
Authors: Dongying Li, Binyi Su, Hua Zhang, Yong Li, Haiyong Chen
Institutions: Hebei University of Technology; China Xiongan Group Digital City Technology Company Ltd.; Chinese Academy of Sciences; Beihang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate defect detection of photovoltaic (PV) cells is critical for ensuring quality and efficiency in intelligent PV manufacturing systems. However, the scarcity of rich defect data poses substantial challenges for effective model training. While existing methods have explored generative models to augment datasets, they often suffer from instability, limited diversity, and domain shifts. To address these issues, we propose PDIG, a Photovoltaic Defect Image Generator based on Stable Diffusion (SD). PDIG leverages the strong priors learned from large-scale datasets to enhance generation quality under limited data. Specifically, we introduce a Semantic Concept Embedding (SCE) module that incorporates text-conditioned priors to capture the relational concepts between defect types and their appearances. To further enrich the domain distribution, we design a Lightweight Industrial Style Adaptor (LISA), which injects industrial defect characteristics into the SD model through cross-disentangled attention. At inference, we propose a Text-Image Dual-Space Constraints (TIDSC) module, enforcing the quality of generated images via positional consistency and spatial smoothing alignment. Extensive experiments demonstrate that PDIG achieves superior realism and diversity compared to state-of-the-art methods. Specifically, our approach improves Frechet Inception Distance (FID) by 19.16 points over the second-best method and significantly enhances the performance of downstream defect detection tasks.

[CV-11] Camera-Only Bird's Eye View Perception: A Neural Approach to LiDAR-Free Environmental Mapping for Autonomous Vehicles

[Quick Read]: This paper aims to remove the dependence of autonomous-driving perception systems on costly LiDAR sensors; the core challenge is achieving accurate environment modeling and scene understanding from cheaper camera inputs. The key to the solution is a camera-only perception framework that produces Bird's Eye View (BEV) maps by extending the Lift-Splat-Shoot architecture, combining YOLOv11-based object detection with DepthAnythingV2 monocular depth estimation over multi-camera inputs for comprehensive 360-degree scene understanding.

Link: https://arxiv.org/abs/2505.06113
Authors: Anupkumar Bochare
Institutions: Northeastern University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autonomous vehicle perception systems have traditionally relied on costly LiDAR sensors to generate precise environmental representations. In this paper, we propose a camera-only perception framework that produces Bird’s Eye View (BEV) maps by extending the Lift-Splat-Shoot architecture. Our method combines YOLOv11-based object detection with DepthAnythingV2 monocular depth estimation across multi-camera inputs to achieve comprehensive 360-degree scene understanding. We evaluate our approach on the OpenLane-V2 and NuScenes datasets, achieving up to 85% road segmentation accuracy and 85-90% vehicle detection rates when compared against LiDAR ground truth, with average positional errors limited to 1.2 meters. These results highlight the potential of deep learning to extract rich spatial information using only camera inputs, enabling cost-efficient autonomous navigation without sacrificing accuracy.
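The core of a Lift-Splat-style pipeline is lifting per-pixel depth into 3D and splatting it onto a top-down grid. The numpy sketch below shows that geometric step in a heavily simplified form: a single camera at the grid's bottom-center, no extrinsics, and occupancy counts instead of learned features. The intrinsics and grid parameters are made up for the example.

```python
import numpy as np

def pixels_to_bev(depth, K, bev_size=100, cell=0.5):
    """Project per-pixel depth into a top-down occupancy grid: the 'lift'
    and 'splat' steps of Lift-Splat-Shoot, greatly simplified (camera at
    the grid's bottom-center, looking along +z)."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    fx, cx = K[0, 0], K[0, 2]
    z = depth.ravel()
    x = (us.ravel() - cx) * z / fx                     # lateral offset, metres
    bev = np.zeros((bev_size, bev_size))
    col = np.clip((x / cell + bev_size / 2).astype(int), 0, bev_size - 1)
    row = np.clip((bev_size - 1 - z / cell).astype(int), 0, bev_size - 1)
    np.add.at(bev, (row, col), 1.0)                    # splat: count hits per cell
    return bev

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
depth = np.full((480, 640), 10.0)                      # flat wall 10 m ahead
print(pixels_to_bev(depth, K).sum())
```

In the full system the depth comes from DepthAnythingV2, multiple cameras are fused with their extrinsics, and the splatted values are learned features rather than counts.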

[CV-12] TREND: Tri-teaching for Robust Preference-based Reinforcement Learning with Demonstrations ICRA2025

[Quick Read]: This paper addresses the challenge posed by noisy preference labels collected from human or vision-language model (VLM) annotators in preference-based reinforcement learning. The key to the solution is TREND, a framework that combines few-shot expert demonstrations with a tri-teaching strategy: three reward models are trained simultaneously, and each treats its small-loss preference pairs as useful knowledge and teaches those pairs to its peer networks for their parameter updates, effectively mitigating the influence of noise.

Link: https://arxiv.org/abs/2505.06079
Authors: Shuaiyi Huang, Mara Levy, Anubhav Gupta, Daniel Ekpo, Ruijie Zheng, Abhinav Shrivastava
Institutions: University of Maryland, College Park
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICRA 2025

Abstract:Preference feedback collected by human or VLM annotators is often noisy, presenting a significant challenge for preference-based reinforcement learning that relies on accurate preference labels. To address this challenge, we propose TREND, a novel framework that integrates few-shot expert demonstrations with a tri-teaching strategy for effective noise mitigation. Our method trains three reward models simultaneously, where each model views its small-loss preference pairs as useful knowledge and teaches such useful pairs to its peer network for updating the parameters. Remarkably, our approach requires as few as one to three expert demonstrations to achieve high performance. We evaluate TREND on various robotic manipulation tasks, achieving up to 90% success rates even with noise levels as high as 40%, highlighting its effective robustness in handling noisy preference feedback. Project page: this https URL.
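The tri-teaching idea, each of three reward models keeping its small-loss preference pairs and using them to update its peers, can be sketched compactly. The version below assumes precomputed trajectory features and a Bradley-Terry-style pairwise loss; the `keep_ratio` and the toy linear reward models are illustrative choices, not the paper's configuration.

```python
import torch
from torch import nn

def tri_teaching_step(models, opts, feats_w, feats_l, keep_ratio=0.6):
    """One tri-teaching update: each reward model ranks preference pairs by
    its own loss, keeps the smallest-loss fraction (likely clean labels),
    and that subset is used to update its two peer models."""
    losses = []
    for m in models:   # pairwise loss: the preferred trajectory should score higher
        margin = m(feats_w).squeeze(-1) - m(feats_l).squeeze(-1)
        losses.append(-torch.nn.functional.logsigmoid(margin))
    k = max(1, int(keep_ratio * feats_w.shape[0]))
    for i in range(len(models)):
        # Pairs judged clean (small-loss) by model i train the other two.
        idx = torch.topk(-losses[i].detach(), k).indices
        for j, (peer, opt) in enumerate(zip(models, opts)):
            if j == i:
                continue
            margin = peer(feats_w[idx]).squeeze(-1) - peer(feats_l[idx]).squeeze(-1)
            loss = -torch.nn.functional.logsigmoid(margin).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

models = [nn.Linear(16, 1) for _ in range(3)]
opts = [torch.optim.Adam(m.parameters(), lr=1e-3) for m in models]
tri_teaching_step(models, opts, torch.randn(32, 16), torch.randn(32, 16))
print("tri-teaching step done")
```

Cross-updating on peer-selected pairs is what prevents each network from reinforcing its own mistakes on noisy labels.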

[CV-13] Noise-Consistent Siamese-Diffusion for Medical Image Synthesis and Segmentation CVPR2025

[Quick Read]: This paper aims to relieve the constraint that scarce annotated datasets place on deep-learning models for medical image segmentation. The key to the solution is Siamese-Diffusion, a novel dual-component model comprising Mask-Diffusion and Image-Diffusion, with a Noise Consistency Loss introduced between the two components during training to raise the morphological fidelity of Mask-Diffusion in parameter space, yielding higher-quality synthetic image-mask pairs that improve the diversity and scalability of the datasets.

Link: https://arxiv.org/abs/2505.06068
Authors: Kunpeng Qiu, Zhiqiang Gao, Zhiying Zhou, Mingjie Sun, Yongxin Guo
Institutions: National University of Singapore; National University of Singapore Suzhou Research Institute; Wenzhou-Kean University; Soochow University; City University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025

Abstract:Deep learning has revolutionized medical image segmentation, yet its full potential remains constrained by the paucity of annotated datasets. While diffusion models have emerged as a promising approach for generating synthetic image-mask pairs to augment these datasets, they paradoxically suffer from the same data scarcity challenges they aim to mitigate. Traditional mask-only models frequently yield low-fidelity images due to their inability to adequately capture morphological intricacies, which can critically compromise the robustness and reliability of segmentation models. To alleviate this limitation, we introduce Siamese-Diffusion, a novel dual-component model comprising Mask-Diffusion and Image-Diffusion. During training, a Noise Consistency Loss is introduced between these components to enhance the morphological fidelity of Mask-Diffusion in the parameter space. During sampling, only Mask-Diffusion is used, ensuring diversity and scalability. Comprehensive experiments demonstrate the superiority of our method. Siamese-Diffusion boosts SANet’s mDice and mIoU by 3.6% and 4.4% on the Polyps, while UNet improves by 1.52% and 1.64% on the ISIC2018. Code is available at GitHub.
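A minimal sketch of the training-time coupling is given below: two denoising branches each predict their own noise, and an extra consistency term ties the predictions together. The tiny conv nets (with no timestep conditioning), the channel-averaging trick used to match shapes, and the 0.1 weight are all placeholder assumptions; the paper's Noise Consistency Loss acts on full diffusion models and is more involved than this.

```python
import torch
from torch import nn

class TinyDenoiser(nn.Module):
    """Stand-in for a diffusion U-Net; predicts the noise added to its input."""
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, ch, 3, padding=1))
    def forward(self, x):
        return self.net(x)

mask_diff, image_diff = TinyDenoiser(1), TinyDenoiser(3)
mask, image = torch.rand(4, 1, 32, 32), torch.rand(4, 3, 32, 32)
eps_m, eps_i = torch.randn_like(mask), torch.randn_like(image)

pred_m = mask_diff(mask + eps_m)     # Mask-Diffusion noise prediction
pred_i = image_diff(image + eps_i)   # Image-Diffusion noise prediction

# Standard denoising losses for each branch...
l_denoise = nn.functional.mse_loss(pred_m, eps_m) + nn.functional.mse_loss(pred_i, eps_i)
# ...plus a noise-consistency term tying the two branches together (the image
# branch is reduced to one channel here only so the shapes match).
l_consistency = nn.functional.mse_loss(pred_m, pred_i.mean(dim=1, keepdim=True))
loss = l_denoise + 0.1 * l_consistency
print(float(loss))
```

At sampling time only the Mask-Diffusion branch is kept, which is why the consistency term must do its work during training.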

[CV-14] Towards Better Cephalometric Landmark Detection with Diffusion Data Generation

[Quick Read]: This paper addresses the scarcity of samples and the time-consuming manual annotation that hold back cephalometric landmark detection for orthodontic diagnosis and treatment planning, limitations that particularly hurt deep-learning detectors based on large-scale vision models. The key to the solution is an innovative data-generation method that produces diverse cephalometric X-ray images with corresponding annotations without human intervention: it first constructs new landmark annotations from anatomical priors, then uses a diffusion-based generator to create realistic X-ray images matching those annotations, and introduces a novel prompt cephalometric X-ray image dataset pairing real images with detailed medical text prompts, enabling precise control over the attributes of the generated samples.

Link: https://arxiv.org/abs/2505.06055
Authors: Dongqian Guo, Wencheng Han, Pang Lyu, Yuxi Zhou, Jianbing Shen
Institutions: University of Macau; Zhongshan Hospital, Fudan University; Justus-Liebig-University of Giessen; Stomatology Hospital of Guangzhou Medical University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cephalometric landmark detection is essential for orthodontic diagnostics and treatment planning. Nevertheless, the scarcity of samples in data collection and the extensive effort required for manual annotation have significantly impeded the availability of diverse datasets. This limitation has restricted the effectiveness of deep learning-based detection methods, particularly those based on large-scale vision models. To address these challenges, we have developed an innovative data generation method capable of producing diverse cephalometric X-ray images along with corresponding annotations without human intervention. To achieve this, our approach initiates by constructing new cephalometric landmark annotations using anatomical priors. Then, we employ a diffusion-based generator to create realistic X-ray images that correspond closely with these annotations. To achieve precise control in producing samples with different attributes, we introduce a novel prompt cephalometric X-ray image dataset. This dataset includes real cephalometric X-ray images and detailed medical text prompts describing the images. By leveraging these detailed prompts, our method improves the generation process to control different styles and attributes. Facilitated by the large, diverse generated data, we introduce large-scale vision detection models into the cephalometric landmark detection task to improve accuracy. Experimental results demonstrate that training with the generated data substantially enhances the performance. Compared to methods without using the generated data, our approach improves the Success Detection Rate (SDR) by 6.5%, attaining a notable 82.2%. All code and data are available at: this https URL

[CV-15] Document Image Rectification Bases on Self-Adaptive Multitask Fusion

[Quick Read]: This paper addresses multi-task cooperation in rectifying deformed document images: current multi-task methods (background removal, 3D coordinate prediction, text-line segmentation) often overlook the complementary features between tasks and their interactions. The key to the solution is SalmRec, a self-adaptive learnable multi-task fusion rectification network whose inter-task feature aggregation module adaptively improves the perception of geometric distortion, enhances feature complementarity, and reduces negative interference, while a gating mechanism effectively balances features both within global tasks and between local tasks.

Link: https://arxiv.org/abs/2505.06038
Authors: Heng Li, Xiangping Wu, Qingcai Chen
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deformed document image rectification is essential for real-world document understanding tasks, such as layout analysis and text recognition. However, current multi-task methods – such as background removal, 3D coordinate prediction, and text line segmentation – often overlook the complementary features between tasks and their interactions. To address this gap, we propose a self-adaptive learnable multi-task fusion rectification network named SalmRec. This network incorporates an inter-task feature aggregation module that adaptively improves the perception of geometric distortions, enhances feature complementarity, and reduces negative interference. We also introduce a gating mechanism to balance features both within global tasks and between local tasks effectively. Experimental results on two English benchmarks (DIR300 and DocUNet) and one Chinese benchmark (DocReal) demonstrate that our method significantly improves rectification performance. Ablation studies further highlight the positive impact of different tasks on dewarping and the effectiveness of our proposed module.

[CV-16] Why Are You Wrong? Counterfactual Explanations for Language Grounding with 3D Objects IJCNN2025

[Quick Read]: This paper addresses model prediction errors in object referent identification, a task that joins natural language with geometric shapes: when a model errs despite a seemingly correct textual description, it is hard to understand why. The key to the solution is generating counterfactual examples: starting from a misclassified sample, the description is adjusted into a structurally similar, semantically plausible alternative formulation that would have led the model to a correct prediction, revealing weaknesses in the description and model bias while deepening the understanding of model behavior.

Link: https://arxiv.org/abs/2505.06030
Authors: Tobias Preintner, Weixuan Yuan, Qi Huang, Adrian König, Thomas Bäck, Elena Raponi, Niki van Stein
Institutions: Leiden University; BMW Group
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at IJCNN 2025

Abstract:Combining natural language and geometric shapes is an emerging research area with multiple applications in robotics and language-assisted design. A crucial task in this domain is object referent identification, which involves selecting a 3D object given a textual description of the target. Variability in language descriptions and spatial relationships of 3D objects makes this a complex task, increasing the need to better understand the behavior of neural network models in this domain. However, limited research has been conducted in this area. Specifically, when a model makes an incorrect prediction despite being provided with a seemingly correct object description, practitioners are left wondering: "Why is the model wrong?". In this work, we present a method answering this question by generating counterfactual examples. Our method takes a misclassified sample, which includes two objects and a text description, and generates an alternative yet similar formulation that would have resulted in a correct prediction by the model. We have evaluated our approach with data from the ShapeTalk dataset along with three distinct models. Our counterfactual examples maintain the structure of the original description, are semantically similar and meaningful. They reveal weaknesses in the description, model bias and enhance the understanding of the model's behavior. These insights help practitioners to better interact with systems as well as engineers to improve models.

[CV-17] ArtRAG : Retrieval-Augmented Generation with Structured Context for Visual Art Understanding

[Quick Read]: This paper addresses the lack of multi-perspective understanding in current multimodal large language models (MLLMs) when interpreting artworks, especially nuanced cultural, historical, and stylistic readings. The key to the solution is the ArtRAG framework, which combines structured knowledge with retrieval-augmented generation (RAG): it automatically constructs an Art Context Knowledge Graph (ACKG), and at inference a multi-granular structured retriever selects semantically and topologically relevant subgraphs to guide generation, producing contextually grounded and culturally informed art descriptions.

Link: https://arxiv.org/abs/2505.06020
Authors: Shuai Wang, Ivona Najdenkoska, Hongyi Zhu, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding visual art requires reasoning across multiple perspectives – cultural, historical, and stylistic – beyond mere object recognition. While recent multimodal large language models (MLLMs) perform well on general image captioning, they often fail to capture the nuanced interpretations that fine art demands. We propose ArtRAG, a novel, training-free framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources, organizing entities such as artists, movements, themes, and historical events into a rich, interpretable graph. At inference time, a multi-granular structured retriever selects semantically and topologically relevant subgraphs to guide generation. This enables MLLMs to produce contextually grounded, culturally informed art descriptions. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms several heavily trained baselines. Human evaluations further confirm that ArtRAG generates coherent, insightful, and culturally enriched interpretations.

[CV-18] From Pixels to Perception: Interpretable Predictions via Instance-wise Grouped Feature Selection

[Quick Read]: This paper addresses the interpretability of the decision-making process of machine-learning models, aiming for predictions whose reasoning is understandable to humans. The key to the solution is inherently interpretable prediction via instance-wise sparsification of the input image: the masking is learned in the space of semantically meaningful pixel regions rather than at the pixel level, and an explicit mechanism dynamically determines the required level of sparsity for each instance.

Link: https://arxiv.org/abs/2505.06003
Authors: Moritz Vandenhirtz, Julia E. Vogt
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: International Conference on Machine Learning

Abstract:Understanding the decision-making process of machine learning models provides valuable insights into the task, the data, and the reasons behind a model’s failures. In this work, we propose a method that performs inherently interpretable predictions through the instance-wise sparsification of input images. To align the sparsification with human perception, we learn the masking in the space of semantically meaningful pixel regions rather than on pixel-level. Additionally, we introduce an explicit way to dynamically determine the required level of sparsity for each instance. We show empirically on semi-synthetic and natural image datasets that our inherently interpretable classifier produces more meaningful, human-understandable predictions than state-of-the-art benchmarks.

[CV-19] Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition

[Quick Read]: This paper addresses several problems in few-shot action recognition (FSAR) with pretrained image models: direct fine-tuning degrades generalization, task-specific information is under-explored on the visual side, semantic order information is overlooked in text modeling, and existing cross-modal alignment techniques ignore the temporal coupling of multimodal information. The key to the solution is Task-Adapter++, a parameter-efficient dual adaptation method for the image and text encoders: a task-specific adaptation of the image encoder extracts the most discriminative information during feature extraction; large language models (LLMs) generate detailed sequential sub-action descriptions for each action class, and semantic order adapters in the text encoder model the sequential relationships among these sub-actions; finally, a fine-grained cross-modal alignment strategy maps visual features into the same temporal stage as the semantic descriptions.

Link: https://arxiv.org/abs/2505.06002
Authors: Congqi Cao, Peiheng Han, Yueran Zhang, Yating Yu, Qinyi Lv, Lingtong Min, Yanning Zhang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2408.00249

Abstract:Large-scale pre-trained models have achieved remarkable success in language and image tasks, leading an increasing number of studies to explore the application of pre-trained image models, such as CLIP, in the domain of few-shot action recognition (FSAR). However, current methods generally suffer from several problems: 1) Direct fine-tuning often undermines the generalization capability of the pre-trained model; 2) The exploration of task-specific information is insufficient in the visual tasks; 3) The semantic order information is typically overlooked during text modeling; 4) Existing cross-modal alignment techniques ignore the temporal coupling of multimodal information. To address these, we propose Task-Adapter++, a parameter-efficient dual adaptation method for both image and text encoders. Specifically, to make full use of the variations across different few-shot learning tasks, we design a task-specific adaptation for the image encoder so that the most discriminative information can be well noticed during feature extraction. Furthermore, we leverage large language models (LLMs) to generate detailed sequential sub-action descriptions for each action class, and introduce semantic order adapters into the text encoder to effectively model the sequential relationships between these sub-actions. Finally, we develop an innovative fine-grained cross-modal alignment strategy that actively maps visual features to reside in the same temporal stage as semantic descriptions. Extensive experiments fully demonstrate the effectiveness and superiority of the proposed method, which achieves state-of-the-art performance on 5 benchmarks consistently. The code is open-sourced at this https URL.

[CV-20] Achieving 3D Attention via Triplet Squeeze and Excitation Block

[Quick Read]: This paper aims to raise the performance of convolutional neural networks (CNNs) on image classification, particularly facial expression recognition (FER). The key to the solution is a new attention module, TripSE, that combines Triplet attention with Squeeze-and-Excitation (SE) in four variants; integrating it into mainstream CNN architectures such as ResNet18, DenseNet, and ConvNeXt validates its versatility and effectiveness. Experiments show that TripSE delivers clear gains, with ConvNeXt plus TripSE reaching a state-of-the-art 78.27% accuracy on the popular FER2013 dataset.

Link: https://arxiv.org/abs/2505.05943
Authors: Maan Alhazmi, Abdulrahman Altahhan
Institutions: University of Leeds
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The emergence of ConvNeXt and its variants has reaffirmed the conceptual and structural suitability of CNN-based models for vision tasks, re-establishing them as key players in image classification in general, and in facial expression recognition (FER) in particular. In this paper, we propose a new set of models that build on these advancements by incorporating a new set of attention mechanisms that combines Triplet attention with Squeeze-and-Excitation (TripSE) in four different variants. We demonstrate the effectiveness of these variants by applying them to the ResNet18, DenseNet and ConvNext architectures to validate their versatility and impact. Our study shows that incorporating a TripSE block in these CNN models boosts their performances, particularly for the ConvNeXt architecture, indicating its utility. We evaluate the proposed mechanisms and associated models across four datasets, namely CIFAR100, ImageNet, FER2013 and AffectNet datasets, where ConvNext with TripSE achieves state-of-the-art results with an accuracy of 78.27% on the popular FER2013 dataset, a new feat for this dataset.
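Both ingredients of TripSE are published mechanisms: Squeeze-and-Excitation channel gating and triplet attention's three rotated branch attentions. The sketch below composes them in one plausible order (triplet branches first, then an SE gate); the paper evaluates four variants, so this is only an illustrative guess at one of them.

```python
import torch
from torch import nn

class SEBlock(nn.Module):
    """Standard Squeeze-and-Excitation: global pooling, bottleneck MLP, channel gates."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                nn.Linear(ch // r, ch), nn.Sigmoid())
    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))      # squeeze over spatial dims
        return x * w[:, :, None, None]       # excite per channel

class TripletBranch(nn.Module):
    """One branch of triplet attention: spatial gate from max/mean pooling."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3)
    def forward(self, x):
        pooled = torch.cat([x.max(dim=1, keepdim=True).values,
                            x.mean(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class TripSE(nn.Module):
    """Illustrative TripSE variant: triplet attention over three axis
    permutations, followed by an SE channel gate."""
    def __init__(self, ch):
        super().__init__()
        self.branches = nn.ModuleList([TripletBranch() for _ in range(3)])
        self.se = SEBlock(ch)
    def forward(self, x):
        b0 = self.branches[0](x)                                          # (C,H,W)
        b1 = self.branches[1](x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # rotate C<->H
        b2 = self.branches[2](x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # rotate C<->W
        return self.se((b0 + b1 + b2) / 3)

y = TripSE(64)(torch.randn(2, 64, 32, 32))
print(y.shape)
```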

[CV-21] CGTrack: Cascade Gating Network with Hierarchical Feature Aggregation for UAV Tracking ICRA2025

[Quick Read]: This paper addresses the drop in network capacity caused by introducing hierarchical lightweight networks in UAV tracking, a problem aggravated in difficult scenarios such as frequent occlusion and extreme viewpoint changes. The key to the solution is CGTrack, a novel UAV tracker that expands network capacity with explicit and implicit techniques inside a coarse-to-fine framework. A Hierarchical Feature Cascade (HFC) module, in the spirit of feature reuse, fuses deep semantic cues with rich spatial information to strengthen the feature representation at little computational cost; on top of it, a Lightweight Gated Center Head (LGCH) uses gating mechanisms to decouple target-oriented coordinates from the expanded features, which carry dense local discriminative information.

Link: https://arxiv.org/abs/2505.05936
Authors: Weihong Li, Xiaoqiong Liu, Heng Fan, Libo Zhang
Institutions: Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences; Dept. of Computer Science and Engineering, University of North Texas; Institute of Software, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICRA 2025

Abstract:Recent advancements in visual object tracking have markedly improved the capabilities of unmanned aerial vehicle (UAV) tracking, which is a critical component in real-world robotics applications. While the integration of hierarchical lightweight networks has become a prevalent strategy for enhancing efficiency in UAV tracking, it often results in a significant drop in network capacity, which further exacerbates challenges in UAV scenarios, such as frequent occlusions and extreme changes in viewing angles. To address these issues, we introduce a novel family of UAV trackers, termed CGTrack, which combines explicit and implicit techniques to expand network capacity within a coarse-to-fine framework. Specifically, we first introduce a Hierarchical Feature Cascade (HFC) module that leverages the spirit of feature reuse to increase network capacity by integrating the deep semantic cues with the rich spatial information, incurring minimal computational costs while enhancing feature representation. Based on this, we design a novel Lightweight Gated Center Head (LGCH) that utilizes gating mechanisms to decouple target-oriented coordinates from previously expanded features, which contain dense local discriminative information. Extensive experiments on three challenging UAV tracking benchmarks demonstrate that CGTrack achieves state-of-the-art performance while running fast. Code will be available at this https URL.

[CV-22] DFEN: Dual Feature Equalization Network for Medical Image Segmentation

[Quick Read]: This paper addresses misclassification in medical image segmentation caused by unequal contextual information: pixels at boundaries and in regions with few class pixels absorb more contextual features from other classes. The key to the solution is a dual feature equalization network built on a hybrid Swin Transformer and convolutional neural network architecture: an image-level feature equalization module equalizes the contextual information of pixels within the image, a class-level module aggregates same-class regions to equalize the pixel feature representations of each class, and the pixel representations are then enhanced by learning weights for the two kinds of equalized information. Using the Swin Transformer as both encoder and decoder further strengthens the model's ability to capture long-range dependencies and spatial correlations.

Link: https://arxiv.org/abs/2505.05913
Authors: Jianjian Yin, Yi Chen, Chengyu Li, Zhichao Zheng, Yanhui Gu, Junsheng Zhou
Institutions: Nanjing Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current methods for medical image segmentation primarily focus on extracting contextual feature information from the perspective of the whole image. While these methods have shown effective performance, none of them take into account the fact that pixels at the boundary and regions with a low number of class pixels capture more contextual feature information from other classes, leading to misclassification of pixels by unequal contextual feature information. In this paper, we propose a dual feature equalization network based on the hybrid architecture of Swin Transformer and Convolutional Neural Network, aiming to augment the pixel feature representations by image-level equalization feature information and class-level equalization feature information. Firstly, the image-level feature equalization module is designed to equalize the contextual information of pixels within the image. Secondly, we aggregate regions of the same class to equalize the pixel feature representations of the corresponding class by class-level feature equalization module. Finally, the pixel feature representations are enhanced by learning weights for image-level equalization feature information and class-level equalization feature information. In addition, Swin Transformer is utilized as both the encoder and decoder, thereby bolstering the ability of the model to capture long-range dependencies and spatial correlations. We conducted extensive experiments on Breast Ultrasound Images (BUSI), International Skin Imaging Collaboration (ISIC2017), Automated Cardiac Diagnosis Challenge (ACDC) and PH2 datasets. The experimental results demonstrate that our method has achieved state-of-the-art performance. Our code is publicly available at this https URL.
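One way to read "class-level feature equalization" is pulling each pixel's feature toward the mean feature of its class region, so that minority-class and boundary pixels are less dominated by context from other classes. The sketch below implements that reading with a given class map; the blending weight alpha and the loop over classes are illustrative, and the paper's module is learned rather than fixed.

```python
import torch

def class_level_equalize(feats, labels, num_classes, alpha=0.5):
    """Blend each pixel's feature with the mean feature of its class region.
    feats: (B, C, H, W); labels: (B, H, W) integer class map."""
    b, c, h, w = feats.shape
    flat = feats.permute(0, 2, 3, 1).reshape(-1, c)   # (B*H*W, C)
    lab = labels.reshape(-1)                          # (B*H*W,)
    out = flat.clone()
    for cls in range(num_classes):
        m = lab == cls
        if m.any():
            # Equalize within the class: pull members toward the class mean.
            out[m] = (1 - alpha) * flat[m] + alpha * flat[m].mean(dim=0)
    return out.reshape(b, h, w, c).permute(0, 3, 1, 2)

feats = torch.randn(1, 8, 16, 16)
labels = torch.randint(0, 3, (1, 16, 16))
print(class_level_equalize(feats, labels, num_classes=3).shape)
```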

[CV-23] Examining the Source of Defects from a Mechanical Perspective for 3D Anomaly Detection

[Quick Read]: This paper targets 3D anomaly detection in industrial scenarios; its core idea is to improve detection accuracy not only by identifying anomalies structurally but also by analyzing their causes. The key is the Mechanics Complementary framework for 3D anomaly detection (MC4AD), which generates internal and external corrective forces for each point: a Diverse Anomaly-Generation (DA-Gen) module simulates various anomalies, while the Corrective Force Prediction Network (CFP-Net) models point-level features with complementary representations to capture the different contributions of internal and external corrective forces. The paper also proposes a combined loss, comprising a new symmetric loss and an overall loss, to properly constrain the corrective forces.

Link: https://arxiv.org/abs/2505.05901
Authors: Hanzhe Liang, Aoran Wang, Jie Zhou, Xin Jin, Can Gao, Jinbao Wang
Affiliations: Shenzhen University; Shanghai AI Lab; Ningbo EIT
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26 pages

Abstract:In this paper, we go beyond identifying anomalies only in structural terms and think about better anomaly detection motivated by anomaly causes. Most anomalies are regarded as the result of unpredictable defective forces from internal and external sources, and their opposite forces are sought to correct the anomalies. We introduce a Mechanics Complementary framework for 3D anomaly detection (MC4AD) to generate internal and external corrective forces for each point. A Diverse Anomaly-Generation (DA-Gen) module is first proposed to simulate various anomalies. Then, we present a Corrective Force Prediction Network (CFP-Net) with complementary representations for point-level features to simulate the different contributions of internal and external corrective forces. A combined loss is proposed, including a new symmetric loss and an overall loss, to properly constrain the corrective forces. As a highlight, we consider 3D anomaly detection in industry more comprehensively, creating a hierarchical quality control strategy based on a three-way decision and contributing a dataset named Anomaly-IntraVariance with intraclass variance to evaluate the model. On five datasets, both proposed and existing, we obtain nine state-of-the-art results with the fewest parameters and the fastest inference speed. The source code is available at this https URL

[CV-24] Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI

[Quick Read]: This paper addresses the lack of intelligence and adaptivity in modern automotive infotainment systems caused by frequent User Interface (UI) updates and diverse design variations. The key solution is a vision-language framework that adapts seamlessly to different UI designs: the authors introduce the AutomotiveUI-Bench-4K dataset and a synthetic data generation pipeline, and fine-tune a Molmo-7B model with Low-Rank Adaptation (LoRA) to strengthen its visual grounding and evaluation capabilities, ultimately outperforming the baseline model on ScreenSpot.

Link: https://arxiv.org/abs/2505.05895
Authors: Benjamin Raphael Ernhofer, Daniil Prokhorov, Jannica Langner, Dominik Bollmann
Affiliations: SPARKS Solutions GmbH
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern automotive infotainment systems require intelligent and adaptive solutions to handle frequent User Interface (UI) updates and diverse design variations. We introduce a vision-language framework for understanding and interacting with automotive infotainment systems, enabling seamless adaptation across different UI designs. To further support research in this field, we release AutomotiveUI-Bench-4K, an open-source dataset of 998 images with 4,208 annotations. Additionally, we present a synthetic data pipeline to generate training data. We fine-tune a Molmo-7B-based model using Low-Rank Adaptation (LoRA), incorporating reasoning generated by our pipeline, along with visual grounding and evaluation capabilities. The fine-tuned Evaluative Large Action Model (ELAM) achieves strong performance on AutomotiveUI-Bench-4K (model and dataset are available on Hugging Face) and demonstrates strong cross-domain generalization, including a +5.2% improvement on ScreenSpot over the baseline model. Notably, our approach achieves 80.4% average accuracy on ScreenSpot, closely matching or even surpassing specialized models for desktop, mobile, and web, such as ShowUI, despite being trained for the infotainment domain. This research investigates how data collection and subsequent fine-tuning can lead to AI-driven progress within automotive UI understanding and interaction. The applied method is cost-efficient and fine-tuned models can be deployed on consumer-grade GPUs.
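Since the fine-tuning relies on Low-Rank Adaptation, a minimal sketch of a LoRA-augmented linear layer is shown below; the rank, scaling, and layer sizes are illustrative assumptions, not the configuration used for Molmo-7B.

```python
# A minimal LoRA sketch: the frozen base weight is augmented with a
# trainable low-rank update B @ A, so only a small number of parameters
# are learned during fine-tuning.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze pretrained weight
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))  # zero init: no update at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(768, 768)
print(layer(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```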

[CV-25] Register and CLS tokens yield a decoupling of local and global features in large ViTs

[Quick Read]: This paper examines the artifacts that appear in the attention maps of DINOv2, which hurt both interpretability and performance on dense image tasks. The key remedy is adding extra register tokens to take over from the patch tokens with redundant local information that the model otherwise repurposes to store global image information. Analyzing how these register tokens affect the relationship between global and local image features, the study finds that although registers yield cleaner attention maps, those maps do not accurately reflect how local information is integrated in large models; instead, information extracted by the register tokens dominates the global representation, producing a disconnect between local and global features.

Link: https://arxiv.org/abs/2505.05892
Authors: Alexander Lappe, Martin A. Giese
Affiliations: Hertie Institute; University Clinics Tübingen; IMPRS-IS
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Recent work has shown that the attention maps of the widely popular DINOv2 model exhibit artifacts, which hurt both model interpretability and performance on dense image tasks. These artifacts emerge due to the model repurposing patch tokens with redundant local information for the storage of global image information. To address this problem, additional register tokens have been incorporated in which the model can store such information instead. We carefully examine the influence of these register tokens on the relationship between global and local image features, showing that while register tokens yield cleaner attention maps, these maps do not accurately reflect the integration of local image information in large models. Instead, global information is dominated by information extracted from register tokens, leading to a disconnect between local and global features. Inspired by these findings, we show that the CLS token itself, which can be interpreted as a register, leads to a very similar phenomenon in models without explicit register tokens. Our work shows that care must be taken when interpreting attention maps of large ViTs. Further, by clearly attributing the faulty behaviour to register and CLS tokens, we show a path towards more interpretable vision models.
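To illustrate the mechanism under discussion, here is a minimal sketch of how register tokens are typically prepended alongside the CLS token and discarded after the encoder; dimensions and layer counts are illustrative, not DINOv2's.

```python
# A minimal sketch of a ViT with register tokens: extra learnable tokens
# are concatenated to the sequence so the model can store global
# information there instead of in patch tokens, then dropped afterwards.
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    def __init__(self, dim: int = 384, num_registers: int = 4, depth: int = 2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        reg = self.registers.expand(b, -1, -1)
        x = torch.cat([cls, reg, patch_tokens], dim=1)  # [CLS | registers | patches]
        x = self.encoder(x)
        # Registers absorb global information; only CLS and patch tokens are kept.
        return torch.cat([x[:, :1], x[:, 1 + self.num_registers:]], dim=1)

tokens = torch.randn(2, 196, 384)                 # 14x14 patch grid
print(ViTWithRegisters()(tokens).shape)           # torch.Size([2, 197, 384])
```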

[CV-26] Towards Facial Image Compression with Consistency Preserving Diffusion Prior

[Quick Read]: This paper addresses the poor reconstruction quality of facial image compression at low bit rates and the resulting performance drop in downstream applications. Existing learned methods struggle to reconstruct high-quality faces at low bit rates, and directly applying diffusion-based compression to faces preserves too little high-frequency information, hurting downstream tasks. The key solution, Facial Image Compression with a Stable Diffusion Prior (FaSDiff), preserves consistency through frequency enhancement: a high-frequency-sensitive compressor captures fine image details, and a hybrid low-frequency enhancement module disentangles low-frequency facial semantics and stably modulates the diffusion prior, improving human visual quality while limiting the loss in machine vision performance caused by semantic inconsistency.

Link: https://arxiv.org/abs/2505.05870
Authors: Yimin Zhou, Yichong Xia, Bin Chen, Baoyi An, Haoqian Wang, Zhi Wang, Yaowei Wang, Zikun Zhou
Affiliations: Tsinghua Shenzhen International Graduate School; Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory; Huawei Technologies Company Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:With the widespread application of facial image data across various domains, the efficient storage and transmission of facial images has garnered significant attention. However, the existing learned face image compression methods often produce unsatisfactory reconstructed image quality at low bit rates. Simply adapting diffusion-based compression methods to facial compression tasks results in reconstructed images that perform poorly in downstream applications due to insufficient preservation of high-frequency information. To further explore the diffusion prior in facial image compression, we propose Facial Image Compression with a Stable Diffusion Prior (FaSDiff), a method that preserves consistency through frequency enhancement. FaSDiff employs a high-frequency-sensitive compressor in an end-to-end framework to capture fine image details and produce robust visual prompts. Additionally, we introduce a hybrid low-frequency enhancement module that disentangles low-frequency facial semantics and stably modulates the diffusion prior alongside visual prompts. The proposed modules allow FaSDiff to leverage diffusion priors for superior human visual perception while minimizing performance loss in machine vision due to semantic inconsistency. Extensive experiments show that FaSDiff outperforms state-of-the-art methods in balancing human visual quality and machine vision accuracy. The code will be released after the paper is accepted.

[CV-27] Decoupling Multi-Contrast Super-Resolution: Pairing Unpaired Synthesis with Implicit Representations

[Quick Read]: This paper addresses the limited image quality of MRI caused by long acquisition times and low signal-to-noise ratios, particularly the cross-modal enhancement challenge in multi-contrast imaging. Existing methods typically assume fixed resolution settings and require large, perfectly paired training data, conditions rarely met in real clinical environments. The key is a modular multi-contrast super-resolution (MCSR) framework whose two stages, unpaired cross-modal synthesis (U-CMS) and unsupervised super-resolution (U-SR), enable arbitrary-scale upsampling without paired data while improving image quality and anatomical consistency.

Link: https://arxiv.org/abs/2505.05855
Authors: Hongyu Rui, Yinzhe Wu, Fanwen Wang, Jiahao Huang, Liutao Yang, Zi Wang, Guang Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Magnetic Resonance Imaging (MRI) is critical for clinical diagnostics but is often limited by long acquisition times and low signal-to-noise ratios, especially in modalities like diffusion and functional MRI. The multi-contrast nature of MRI presents a valuable opportunity for cross-modal enhancement, where high-resolution (HR) modalities can serve as references to boost the quality of their low-resolution (LR) counterparts, motivating the development of Multi-Contrast Super-Resolution (MCSR) techniques. Prior work has shown that leveraging complementary contrasts can improve SR performance; however, effective feature extraction and fusion across modalities with varying resolutions remains a major challenge. Moreover, existing MCSR methods often assume fixed resolution settings and all require large, perfectly paired training datasets, conditions rarely met in real-world clinical environments. To address these challenges, we propose a novel Modular Multi-Contrast Super-Resolution (MCSR) framework that eliminates the need for paired training data and supports arbitrary upscaling. Our method decouples the MCSR task into two stages: (1) Unpaired Cross-Modal Synthesis (U-CMS), which translates a high-resolution reference modality into a synthesized version of the target contrast, and (2) Unsupervised Super-Resolution (U-SR), which reconstructs the final output using implicit neural representations (INRs) conditioned on spatial coordinates. This design enables scale-agnostic and anatomically faithful reconstruction by bridging unpaired cross-modal synthesis with unsupervised resolution enhancement. Experiments show that our method achieves superior performance at 4x and 8x upscaling, with improved fidelity and anatomical consistency over existing baselines. Our framework demonstrates strong potential for scalable, subject-specific, and data-efficient MCSR in real-world clinical settings.
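To make the U-SR stage concrete, below is a minimal sketch of a coordinate-conditioned implicit neural representation with Fourier feature encoding, queried on an arbitrary-resolution grid; the encoding scheme and MLP sizes are illustrative assumptions rather than the paper's architecture.

```python
# A minimal coordinate-conditioned INR sketch: an MLP over Fourier-encoded
# (y, x) coordinates predicts intensities, so the output can be sampled at
# any resolution.
import torch
import torch.nn as nn

class FourierINR(nn.Module):
    def __init__(self, num_freqs: int = 16, hidden: int = 128):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(num_freqs)   # frequency bands
        self.mlp = nn.Sequential(
            nn.Linear(4 * num_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2) in [-1, 1]; encode each axis with sin/cos per frequency.
        x = coords.unsqueeze(-1) * self.freqs         # (N, 2, F)
        enc = torch.cat([x.sin(), x.cos()], dim=-1)   # (N, 2, 2F)
        return self.mlp(enc.flatten(1))               # (N, 1) predicted intensity

# Query on an arbitrary-resolution grid (here 128x128).
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 128), torch.linspace(-1, 1, 128), indexing='ij')
coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)
img = FourierINR()(coords).reshape(128, 128)
print(img.shape)
```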

[CV-28] PICD: Versatile Perceptual Image Compression with Diffusion Rendering CVPR2025

[Quick Read]: This paper targets the visible artifacts that screen-content compression introduces in text regions, especially the difficulty of preserving high visual quality at low bit rates. The key is versatile Perceptual Image Compression with Diffusion rendering (PICD), which encodes text and image separately and fuses them into one image with a diffusion model, integrating conditional information at three levels (domain, adaptor, and instance) to improve compression and decoding quality.

Link: https://arxiv.org/abs/2505.05853
Authors: Tongda Xu, Jiahao Li, Bin Li, Yan Wang, Ya-Qin Zhang, Yan Lu
Affiliations: Tsinghua University; Microsoft Research Asia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:Recently, perceptual image compression has achieved significant advancements, delivering high visual quality at low bitrates for natural images. However, for screen content, existing methods often produce noticeable artifacts when compressing text. To tackle this challenge, we propose versatile perceptual screen image compression with diffusion rendering (PICD), a codec that works well for both screen and natural images. More specifically, we propose a compression framework that encodes the text and image separately, and renders them into one image using diffusion model. For this diffusion rendering, we integrate conditional information into diffusion models at three distinct levels: 1). Domain level: We fine-tune the base diffusion model using text content prompts with screen content. 2). Adaptor level: We develop an efficient adaptor to control the diffusion model using compressed image and text as input. 3). Instance level: We apply instance-wise guidance to further enhance the decoding process. Empirically, our PICD surpasses existing perceptual codecs in terms of both text accuracy and perceptual quality. Additionally, without text conditions, our approach serves effectively as a perceptual codec for natural images.

[CV-29] RefRef: A Synthetic Dataset and Benchmark for Reconstructing Refractive and Reflective Objects

[Quick Read]: This paper addresses 3D reconstruction and novel view synthesis for scenes containing refractive and reflective materials; current methods usually assume straight light paths and therefore cannot handle such materials. The key contributions are RefRef, a synthetic dataset and benchmark with 50 refractive and reflective objects of varying complexity, an oracle method that computes accurate light paths from the object geometry and refractive indices for neural rendering, and an approach based on it that avoids these assumptions.

Link: https://arxiv.org/abs/2505.05848
Authors: Yue Yin, Enze Tao, Weijian Deng, Dylan Campbell
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modern 3D reconstruction and novel view synthesis approaches have demonstrated strong performance on scenes with opaque Lambertian objects. However, most assume straight light paths and therefore cannot properly handle refractive and reflective materials. Moreover, datasets specialized for these effects are limited, stymieing efforts to evaluate performance and develop suitable techniques. In this work, we introduce a synthetic RefRef dataset and benchmark for reconstructing scenes with refractive and reflective objects from posed images. Our dataset has 50 such objects of varying complexity, from single-material convex shapes to multi-material non-convex shapes, each placed in three different background types, resulting in 150 scenes. We also propose an oracle method that, given the object geometry and refractive indices, calculates accurate light paths for neural rendering, and an approach based on this that avoids these assumptions. We benchmark these against several state-of-the-art methods and show that all methods lag significantly behind the oracle, highlighting the challenges of the task and dataset.
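The core computation an oracle of this kind needs is bending rays at material interfaces. A minimal sketch of vector-form Snell refraction is shown below; it is a textbook formulation, not the paper's implementation.

```python
# Refract a ray at an interface via Snell's law (vector form).
# eta is the ratio of refractive indices n1/n2; inputs are unit vectors.
import numpy as np

def refract(direction, normal, eta):
    """Return the refracted unit direction, or None on total internal reflection."""
    cos_i = -np.dot(normal, direction)
    sin2_t = eta**2 * (1.0 - cos_i**2)
    if sin2_t > 1.0:
        return None  # total internal reflection
    cos_t = np.sqrt(1.0 - sin2_t)
    return eta * direction + (eta * cos_i - cos_t) * normal

d = np.array([0.0, -np.sqrt(0.5), -np.sqrt(0.5)])  # incoming ray at 45 degrees
n = np.array([0.0, 0.0, 1.0])                      # surface normal on the air side
print(refract(d, n, 1.0 / 1.5))                    # ray bends toward the normal in glass
```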

[CV-30] Automated Knot Detection and Pairing for Wood Analysis in the Timber Industry

[Quick Read]: This paper tackles knot detection and pairing in wood, which matter for both aesthetics and structural integrity in timber processing. Traditional manual annotation is labor-intensive and inefficient, so automation is needed. The key is a lightweight, fully automated pipeline combining machine learning techniques: in the detection stage, industrial-grade cameras capture high-resolution surface images and a transfer-learned YOLOv8l model reaches 0.887 mAP@0.5; in the pairing stage, a triplet neural network maps multidimensional knot features into a latent space where clustering algorithms accurately pair corresponding knots, achieving a pairing accuracy of 0.85.

Link: https://arxiv.org/abs/2505.05845
Authors: Guohao Lin, Shidong Pan, Rasul Khanbayov, Changxi Yang, Ani Khaloian-Sarnaghi, Andriy Kovryga
Affiliations: Technical University of Munich; Australian National University; CSIRO's Data61; Hochschule für Musik und Theater München
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Knots in wood are critical to both aesthetics and structural integrity, making their detection and pairing essential in timber processing. However, traditional manual annotation is labor-intensive and inefficient, necessitating automation. This paper proposes a lightweight and fully automated pipeline for knot detection and pairing based on machine learning techniques. In the detection stage, high-resolution surface images of wooden boards were collected using industrial-grade cameras, and a large-scale dataset was manually annotated and preprocessed. After transfer learning, the YOLOv8l model achieves an mAP@0.5 of 0.887. In the pairing stage, detected knots were analyzed and paired based on multidimensional feature extraction. A triplet neural network was used to map the features into a latent space, enabling clustering algorithms to identify and pair corresponding knots. The triplet network with learnable weights achieved a pairing accuracy of 0.85. Further analysis revealed that the distances from the knot's start and end points to the bottom of the wooden board, and the longitudinal coordinates, play crucial roles in achieving high pairing accuracy. Our experiments validate the effectiveness of the proposed solution, demonstrating the potential of AI in advancing wood science and industry.
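As a rough sketch of the pairing stage, the snippet below trains a small embedding network with a triplet margin loss and uses latent-space distances for matching; the feature dimensionality and network sizes are illustrative assumptions.

```python
# A minimal triplet-embedding sketch for knot pairing: matching knots on
# opposite board faces should map close together in the latent space.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 16))
triplet_loss = nn.TripletMarginLoss(margin=1.0)

# Anchor/positive: the same physical knot seen on two faces; negative: a different knot.
anchor, positive, negative = (torch.randn(8, 6) for _ in range(3))
loss = triplet_loss(embed(anchor), embed(positive), embed(negative))
loss.backward()

# At inference, pairwise latent distances between all detected knots feed a
# clustering or nearest-neighbour step that forms the final pairs.
with torch.no_grad():
    z = embed(torch.randn(10, 6))
    dists = torch.cdist(z, z)
print(loss.item(), dists.shape)
```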

[CV-31] Dual-level Fuzzy Learning with Patch Guidance for Image Ordinal Regression IJCAI2025

[Quick Read]: This paper addresses the neglect of fine-grained patch-level features in image ordinal regression, where only image-level ordinal labels are provided. The key is the Dual-level Fuzzy Learning with Patch Guidance framework (DFPG), which introduces patch-level supervision and fuzzy-logic mechanisms to learn precise feature-based grading boundaries from ambiguous ordinal labels, improving the model's ability to distinguish hard-to-classify samples.

Link: https://arxiv.org/abs/2505.05834
Authors: Chunlai Dong, Haochao Ying, Qibo Qiu, Jinhong Wang, Danny Chen, Jian Wu
Affiliations: Zhejiang University; The Second Affiliated Hospital, Zhejiang University School of Medicine; China Mobile (Zhejiang) Research & Innovation Institute; University of Notre Dame
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCAI 2025

Abstract:Ordinal regression bridges regression and classification by assigning objects to ordered classes. While human experts rely on discriminative patch-level features for decisions, current approaches are limited by the availability of only image-level ordinal labels, overlooking fine-grained patch-level characteristics. In this paper, we propose a Dual-level Fuzzy Learning with Patch Guidance framework, named DFPG, which learns precise feature-based grading boundaries from ambiguous ordinal labels with patch-level supervision. Specifically, we propose patch-labeling and filtering strategies to enable the model to focus on patch-level features exclusively with only image-level ordinal labels available. We further design a dual-level fuzzy learning module, which leverages fuzzy logic to quantitatively capture and handle label ambiguity from both patch-wise and channel-wise perspectives. Extensive experiments on various image ordinal regression datasets demonstrate the superiority of our proposed method, further confirming its ability in distinguishing samples from difficult-to-classify categories. The code is available at this https URL.

[CV-32] Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition CVPR2025

[Quick Read]: This paper aims to reduce the high computational complexity of diffusion transformers (DiT) for image generation, which stems from their iterative nature, and thereby ease deployment. The key is a training-free increment-calibrated caching method whose calibration parameters are generated from the pre-trained model itself via low-rank approximation, cutting redundant computation. To handle calibration failures caused by outlier activations, channel-aware singular value decomposition (SVD) is further introduced to strengthen the calibration effect.

Link: https://arxiv.org/abs/2505.05829
Authors: Zhiyuan Chen, Keyi Li, Yifan Jia, Le Ye, Yufei Ma
Affiliations: Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Accepted by CVPR 2025

Abstract:Diffusion transformer (DiT) models have achieved remarkable success in image generation, thanks to their exceptional generative capabilities and scalability. Nonetheless, the iterative nature of diffusion models (DMs) results in high computation complexity, posing challenges for deployment. Although existing cache-based acceleration methods try to utilize the inherent temporal similarity to skip redundant computations of DiT, the lack of correction may induce potential quality degradation. In this paper, we propose increment-calibrated caching, a training-free method for DiT acceleration, where the calibration parameters are generated from the pre-trained model itself with low-rank approximation. To deal with the possible correction failure arising from outlier activations, we introduce channel-aware Singular Value Decomposition (SVD), which further strengthens the calibration effect. Experimental results show that our method consistently achieves better performance than existing naive caching methods with a similar computation resource budget. When compared with 35-step DDIM, our method eliminates more than 45% of computation and improves IS by 12 at the cost of less than a 0.06 FID increase. Code is available at this https URL.
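A minimal sketch of deriving low-rank calibration parameters from a pre-trained weight with truncated SVD is shown below; the rank and the channel rescaling used to illustrate the "channel-aware" idea are assumptions, not the paper's exact procedure.

```python
# Truncated-SVD low-rank approximation of a weight, with an illustrative
# channel rescaling applied before factorization (e.g., to tame outliers)
# and folded back in afterwards.
import torch

def low_rank_calibration(weight: torch.Tensor, rank: int, channel_scale: torch.Tensor):
    w = weight * channel_scale.unsqueeze(0)            # scale input channels
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    w_r = u[:, :rank] @ torch.diag(s[:rank]) @ vh[:rank]
    return w_r / channel_scale.unsqueeze(0)            # undo the scaling

w = torch.randn(256, 256)
scale = torch.rand(256) + 0.5
approx = low_rank_calibration(w, rank=32, channel_scale=scale)
rel_err = (w - approx).norm() / w.norm()
print(f"rank-32 relative error: {rel_err:.3f}")
```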

[CV-33] Image Segmentation via Variational Model Based Tailored UNet: A Deep Variational Framework

[Quick Read]: This paper targets the parameter sensitivity and computational cost of traditional variational image segmentation, as well as the limited interpretability and annotation dependence of deep models. The key is a hybrid framework, Variational Model Based Tailored UNet (VM_TUNet), which couples a fourth-order modified Cahn-Hilliard equation with a UNet backbone, combining the interpretability and edge-preserving properties of variational methods with the adaptive feature learning of neural networks. A data-driven operator replaces manual parameter tuning, and the tailored finite point method (TFPM) is incorporated to enforce high-precision boundary preservation, yielding better segmentation, particularly of fine boundaries.

Link: https://arxiv.org/abs/2505.05806
Authors: Kaili Qi, Wenli Yang, Ye Li, Zhongyi Huang
Affiliations: Tsinghua University; China University of Mining and Technology; Nanjing University of Aeronautics and Astronautics; Nanjing University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Traditional image segmentation methods, such as variational models based on partial differential equations (PDEs), offer strong mathematical interpretability and precise boundary modeling, but often suffer from sensitivity to parameter settings and high computational costs. In contrast, deep learning models such as UNet, which are relatively lightweight in parameters, excel in automatic feature extraction but lack theoretical interpretability and require extensive labeled data. To harness the complementary strengths of both paradigms, we propose Variational Model Based Tailored UNet (VM_TUNet), a novel hybrid framework that integrates the fourth-order modified Cahn-Hilliard equation with the deep learning backbone of UNet, which combines the interpretability and edge-preserving properties of variational methods with the adaptive feature learning of neural networks. Specifically, a data-driven operator is introduced to replace manual parameter tuning, and we incorporate the tailored finite point method (TFPM) to enforce high-precision boundary preservation. Experimental results on benchmark datasets demonstrate that VM_TUNet achieves superior segmentation performance compared to existing approaches, especially for fine boundary delineation.
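For orientation, the classical fourth-order Cahn-Hilliard equation that such frameworks build on is reproduced below; the paper's modified variant replaces hand-tuned terms with a data-driven operator, so this is only a reference form.

```latex
% Classical Cahn-Hilliard equation: u is the phase field, F a double-well
% potential, and \epsilon the interface-width parameter. The nested
% Laplacians make the PDE fourth order.
\[
\frac{\partial u}{\partial t} = \Delta\left(F'(u) - \epsilon^{2}\Delta u\right),
\qquad F(u) = \tfrac{1}{4}\left(u^{2}-1\right)^{2},\quad F'(u) = u^{3}-u.
\]
```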

[CV-34] Describe Anything in Medical Images

[Quick Read]: This paper addresses the limited adoption of localized image captioning in medical imaging, where diagnosis hinges on subtle regional findings and existing models such as DAM have not been applied. The key is MedDAM, the first comprehensive framework using large vision-language models for region-specific captioning of medical images: it employs prompts designed by medical experts for specific imaging modalities, builds a benchmark comprising a customized evaluation protocol, a data pre-processing pipeline, and a specialized QA template library, and assesses clinical factuality through attribute-level verification tasks, sidestepping the lack of ground-truth region-caption pairs in medical datasets.

Link: https://arxiv.org/abs/2505.05804
Authors: Xi Xiao, Yunbei Zhang, Thanh-Huy Nguyen, Ba-Thinh Lam, Janet Wang, Jihun Hamm, Tianyang Wang, Xingjian Li, Xiao Wang, Hao Xu, Tianming Liu, Min Xu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Localized image captioning has made significant progress with models like the Describe Anything Model (DAM), which can generate detailed region-specific descriptions without explicit region-text supervision. However, such capabilities have yet to be widely applied to specialized domains like medical imaging, where diagnostic interpretation relies on subtle regional findings rather than global understanding. To mitigate this gap, we propose MedDAM, the first comprehensive framework leveraging large vision-language models for region-specific captioning in medical images. MedDAM employs medical expert-designed prompts tailored to specific imaging modalities and establishes a robust evaluation benchmark comprising a customized assessment protocol, data pre-processing pipeline, and specialized QA template library. This benchmark evaluates both MedDAM and other adaptable large vision-language models, focusing on clinical factuality through attribute-level verification tasks, thereby circumventing the absence of ground-truth region-caption pairs in medical datasets. Extensive experiments on the VinDr-CXR, LIDC-IDRI, and SkinCon datasets demonstrate MedDAM's superiority over leading peers (including GPT-4o, Claude 3.7 Sonnet, LLaMA-3.2 Vision, Qwen2.5-VL, GPT-4RoI, and OMG-LLaVA) in the task, revealing the importance of region-level semantic alignment in medical image understanding and establishing MedDAM as a promising foundation for clinical vision-language integration.

[CV-35] 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks CVPR2025

[Quick Read]: This paper addresses how robots manipulating objects in 3D can understand scene context and generate precise joint-space trajectories, i.e., how to translate real-world visual and semantic information into low-level control. The key is stronger scene-context awareness, achieved by integrating chain-of-thought reasoning, depth perception, and task-oriented region-of-interest detection. Experiments show that the proposed 3D-CAVLA model markedly improves success rates across task suites and generalizes more robustly to entirely unseen, zero-shot tasks.

Link: https://arxiv.org/abs/2505.05800
Authors: Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami
Affiliations: New York University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 1st Workshop on 3D LLM/VLA, CVPR 2025

Abstract:Robotic manipulation in 3D requires learning an N degree-of-freedom joint space trajectory of a robot manipulator. Robots must possess semantic and visual perception abilities to transform real-world mappings of their workspace into the low-level control necessary for object manipulation. Recent work has demonstrated the capabilities of fine-tuning large Vision-Language Models (VLMs) to learn the mapping between RGB images, language instructions, and joint space control. These models typically take as input RGB images of the workspace and language instructions, and are trained on large datasets of teleoperated robot demonstrations. In this work, we explore methods to improve the scene context awareness of a popular recent Vision-Language-Action model by integrating chain-of-thought reasoning, depth perception, and task-oriented region of interest detection. Our experiments in the LIBERO simulation environment show that our proposed model, 3D-CAVLA, improves the success rate across various LIBERO task suites, achieving an average success rate of 98.1%. We also evaluate the zero-shot capabilities of our method, demonstrating that 3D scene awareness leads to robust learning and adaptation for completely unseen tasks. 3D-CAVLA achieves an absolute improvement of 8.8% on unseen tasks. We will open-source our code and the unseen tasks dataset to promote community-driven research here: this https URL

[CV-36] Improving Generalizability of Kolmogorov-Arnold Networks via Error-Correcting Output Codes

[Quick Read]: This paper addresses insufficient generalization in multi-class classification, with a focus on medical image classification. The key is integrating Error-Correcting Output Codes (ECOC) into the Kolmogorov-Arnold Network (KAN) framework, turning multi-class classification into multiple binary tasks and decoding with Hamming distance to improve robustness. The integration yields higher accuracy than vanilla KAN on a blood cell classification dataset under diverse hyperparameter settings, and consistently improves FastKAN and FasterKAN variants as well.

Link: https://arxiv.org/abs/2505.05798
Authors: Youngjoon Lee, Jinu Gong, Joonhyuk Kang
Affiliations: KAIST; Hansung University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: 4 pages

Abstract:Kolmogorov-Arnold Networks (KAN) offer universal function approximation using univariate spline compositions without nonlinear activations. In this work, we integrate Error-Correcting Output Codes (ECOC) into the KAN framework to transform multi-class classification into multiple binary tasks, improving robustness via Hamming-distance decoding. Our proposed KAN with ECOC method outperforms vanilla KAN on a challenging blood cell classification dataset, achieving higher accuracy under diverse hyperparameter settings. Ablation studies further confirm that ECOC consistently enhances performance across FastKAN and FasterKAN variants. These results demonstrate that ECOC integration significantly boosts KAN generalizability in critical healthcare AI applications. To the best of our knowledge, this is the first integration of ECOC with KAN for enhancing multi-class medical image classification performance.
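The ECOC mechanism is easy to see in isolation: each class receives a binary codeword, one binary classifier is trained per bit, and decoding picks the class with the smallest Hamming distance. The toy codebook below (minimum pairwise distance 4, so any single-bit error is corrected) is illustrative, not the paper's.

```python
# ECOC decoding by Hamming distance: map the hard outputs of the per-bit
# binary classifiers to the class whose codeword is nearest.
import numpy as np

codebook = np.array([   # 4 classes x 6-bit codewords
    [0, 0, 0, 1, 1, 1],
    [0, 1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
])

def ecoc_decode(bit_predictions: np.ndarray) -> np.ndarray:
    """bit_predictions: (N, 6) hard 0/1 outputs of the six binary classifiers."""
    dists = (bit_predictions[:, None, :] != codebook[None, :, :]).sum(-1)
    return dists.argmin(axis=1)   # nearest-codeword class

preds = np.array([[0, 1, 1, 0, 0, 1],    # exact match for class 1
                  [1, 0, 1, 1, 1, 0]])   # one bit flipped from class 2, still decoded correctly
print(ecoc_decode(preds))  # [1 2]
```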

[CV-37] A review of advancements in low-light image enhancement using deep learning

[Quick Read]: This paper addresses the sharp degradation of computer vision algorithms in low-light environments, which hurts key vision tasks such as segmentation, detection, and classification. The key contribution is a systematic examination of how recent deep-learning-based low-light image enhancement methods (from 2020 onward) operate and how effective they are at improving downstream vision tasks, with detailed explanations of their enhancement mechanisms, clear illustrations, a critical analysis of their strengths and limitations, and proposed future research directions, serving as a reference for choosing enhancement techniques and optimizing vision task performance under low light.

Link: https://arxiv.org/abs/2505.05759
Authors: Fangxue Liu, Lei Fan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In low-light environments, the performance of computer vision algorithms often deteriorates significantly, adversely affecting key vision tasks such as segmentation, detection, and classification. With the rapid advancement of deep learning, its application to low-light image processing has attracted widespread attention and seen significant progress in recent years. However, there remains a lack of comprehensive surveys that systematically examine how recent deep-learning-based low-light image enhancement methods function and evaluate their effectiveness in enhancing downstream vision tasks. To address this gap, this review provides a detailed elaboration on how various recent approaches (from 2020) operate and their enhancement mechanisms, supplemented with clear illustrations. It also investigates the impact of different enhancement techniques on subsequent vision tasks, critically analyzing their strengths and limitations. Additionally, it proposes future research directions. This review serves as a useful reference for determining low-light image enhancement techniques and optimizing vision task performance in low-light conditions.

[CV-38] Automating Infrastructure Surveying: A Framework for Geometric Measurements and Compliance Assessment Using Point Cloud Data

[Quick Read]: This paper addresses the inefficiency and inconsistency of manual infrastructure surveying and compliance assessment, specifically the automated evaluation of curb ramps against Americans with Disabilities Act (ADA) standards. The key is an automated framework built on point cloud data that combines deep-learning-based detection and segmentation with geometric and signal processing techniques to automate geometric measurement and compliance assessment. The work also contributes a large, publicly released annotated curb ramp dataset to support model training and validation, improving accuracy and reliability.

Link: https://arxiv.org/abs/2505.05752
Authors: Amin Ghafourian, Andrew Lee, Dechen Gao, Tyler Beer, Kin Yen, Iman Soltani
Affiliations: LARA Research; University of California, Davis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: 19 pages, 15 figures, 4 tables

Abstract:Automation can play a prominent role in improving efficiency, accuracy, and scalability in infrastructure surveying and assessing construction and compliance standards. This paper presents a framework for automation of geometric measurements and compliance assessment using point cloud data. The proposed approach integrates deep learning-based detection and segmentation, in conjunction with geometric and signal processing techniques, to automate surveying tasks. As a proof of concept, we apply this framework to automatically evaluate the compliance of curb ramps with the Americans with Disabilities Act (ADA), demonstrating the utility of point cloud data in survey automation. The method leverages a newly collected, large annotated dataset of curb ramps, made publicly available as part of this work, to facilitate robust model training and evaluation. Experimental results, including comparison with manual field measurements of several ramps, validate the accuracy and reliability of the proposed method, highlighting its potential to significantly reduce manual effort and improve consistency in infrastructure assessment. Beyond ADA compliance, the proposed framework lays the groundwork for broader applications in infrastructure surveying and automated construction evaluation, promoting wider adoption of point cloud data in these domains. The annotated database, manual ramp survey data, and developed algorithms are publicly available on the project’s GitHub page: this https URL.
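As a sketch of one geometric measurement such a pipeline automates, the snippet below fits a plane to segmented ramp points by least squares and compares the running slope against the commonly cited ADA limit of 1:12 (about 8.33%); the threshold handling and segmented input are assumptions, not the paper's exact procedure.

```python
# Estimate a curb ramp's slope from segmented point cloud data by fitting
# the plane z = a*x + b*y + c and taking the gradient magnitude.
import numpy as np

def ramp_slope_percent(points: np.ndarray) -> float:
    """points: (N, 3) xyz of the ramp surface, z up. Returns slope in percent."""
    xy1 = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    coeffs, *_ = np.linalg.lstsq(xy1, points[:, 2], rcond=None)
    return float(np.hypot(coeffs[0], coeffs[1]) * 100.0)

rng = np.random.default_rng(0)
xy = rng.uniform(0, 1.5, size=(500, 2))
z = 0.07 * xy[:, 0] + rng.normal(0, 0.002, 500)    # ~7% slope along x, plus sensor noise
slope = ramp_slope_percent(np.column_stack([xy, z]))
print(f"slope = {slope:.1f}% -> {'compliant' if slope <= 8.33 else 'non-compliant'}")
```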

[CV-39] kFuse: A novel density based agglomerative clustering

[Quick Read]: This paper addresses the reliance of traditional agglomerative clustering on extra parameters for sub-cluster partitioning and inter-cluster similarity assessment, its lack of adaptivity across datasets, and the unstable results caused by how connection distances are computed. The key is kFuse, a density-based agglomerative method that (1) partitions sub-clusters via natural neighbors, (2) determines boundary connectivity between sub-clusters from adjacent samples and shortest distances, (3) assesses density similarity via mean density and variance, and (4) merges sub-clusters based on boundary connectivity and density similarity. kFuse requires the number of clusters only at the final merging stage and, by jointly considering adjacent samples, distances, and densities, markedly improves merging accuracy and identification capability.

Link: https://arxiv.org/abs/2505.05748
Authors: Huan Yan, Junjie Hu
Affiliations: Shaanxi Normal University; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 11 figures

Abstract:Agglomerative clustering has emerged as a vital tool in data analysis due to its intuitive and flexible characteristics. However, existing agglomerative clustering methods often involve additional parameters for sub-cluster partitioning and inter-cluster similarity assessment. This necessitates different parameter settings across various datasets, which is undoubtedly challenging in the absence of prior knowledge. Moreover, existing agglomerative clustering techniques are constrained by the calculation method of connection distance, leading to unstable clustering results. To address these issues, this paper introduces a novel density-based agglomerative clustering method, termed kFuse. kFuse comprises four key components: (1) sub-cluster partitioning based on natural neighbors; (2) determination of boundary connectivity between sub-clusters through the computation of adjacent samples and shortest distances; (3) assessment of density similarity between sub-clusters via the calculation of mean density and variance; and (4) establishment of merging rules between sub-clusters based on boundary connectivity and density similarity. kFuse requires the specification of the number of clusters only at the final merging stage. Additionally, by comprehensively considering adjacent samples, distances, and densities among different sub-clusters, kFuse significantly enhances accuracy during the merging phase, thereby greatly improving its identification capability. Experimental results on both synthetic and real-world datasets validate the effectiveness of kFuse.

[CV-40] Dome-DETR: DETR with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection

[Quick Read]: This paper addresses inefficient feature use and high computational cost in tiny object detection, which stem from redundant feature processing and rigid query allocation. The key is Dome-DETR, which performs density-oriented feature-query manipulation within an efficient framework: a lightweight Density-Focal Extractor (DeFE) produces clustered compact foreground masks, Masked Window Attention Sparsification (MWAS) uses these masks to focus computation on the most informative regions via sparse attention, and Progressive Adaptive Query Initialization (PAQI) adaptively modulates query density across spatial areas for better query allocation.

Link: https://arxiv.org/abs/2505.05741
Authors: Zhangchi Hu, Peixi Wu, Jie Chen, Huyue Zhu, Yijun Wang, Yansong Peng, Hebei Li, Xiaoyan Sun
Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Tiny object detection plays a vital role in drone surveillance, remote sensing, and autonomous systems, enabling the identification of small targets across vast landscapes. However, existing methods suffer from inefficient feature leverage and high computational costs due to redundant feature processing and rigid query allocation. To address these challenges, we propose Dome-DETR, a novel framework with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection. To reduce feature redundancies, we introduce a lightweight Density-Focal Extractor (DeFE) to produce clustered compact foreground masks. Leveraging these masks, we incorporate Masked Window Attention Sparsification (MWAS) to focus computational resources on the most informative regions via sparse attention. Besides, we propose Progressive Adaptive Query Initialization (PAQI), which adaptively modulates query density across spatial areas for better query allocation. Extensive experiments demonstrate that Dome-DETR achieves state-of-the-art performance (+3.3 AP on AI-TOD-V2 and +2.5 AP on VisDrone) while maintaining low computational complexity and a compact model size. Code will be released upon acceptance.

[CV-41] Automated Learning of Semantic Embedding Representations for Diffusion Models SDM25

[Quick Read]: This paper addresses the shortage of efficient representation learning for generative models, particularly the limited ability of denoising diffusion models (DDMs) to acquire semantically rich representations. The key is a multi-level denoising autoencoder framework that expands DDM representation capacity with sequentially consistent Diffusion Transformers and an additional timestep-dependent encoder, which acquires embedding representations along the denoising Markov chain through self-conditional diffusion learning. Conditioned on the whole diffusion process, the encoder compresses high-dimensional data into directional latent vectors under different noise levels, enabling image embeddings to be learned across all timesteps.

Link: https://arxiv.org/abs/2505.05732
Authors: Limai Jiang, Yunpeng Cai
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Extended version of the paper published in SDM25

Abstract:Generative models capture the true distribution of data, yielding semantically rich representations. Denoising diffusion models (DDMs) exhibit superior generative capabilities, though efficient representation learning for them is lacking. In this work, we employ a multi-level denoising autoencoder framework to expand the representation capacity of DDMs, which introduces sequentially consistent Diffusion Transformers and an additional timestep-dependent encoder to acquire embedding representations on the denoising Markov chain through self-conditional diffusion learning. Intuitively, the encoder, conditioned on the entire diffusion process, compresses high-dimensional data into directional vectors in the latent space under different noise levels, facilitating the learning of image embeddings across all timesteps. To verify the semantic adequacy of embeddings generated through this approach, extensive experiments are conducted on various datasets, demonstrating that optimally learned embeddings by DDMs surpass state-of-the-art self-supervised representation learning methods in most cases, achieving remarkable discriminative semantic representation quality. Our work justifies that DDMs are not only suitable for generative tasks, but also potentially advantageous for general-purpose deep learning applications.

[CV-42] You Are Your Best Teacher: Semi-Supervised Surgical Point Tracking with Cycle-Consistent Self-Distillation CVPR2025

[Quick Read]: This paper addresses the domain shift and lack of labeled data faced when deploying point trackers trained on synthetic data in real-world settings, especially surgical video. The key is SurgTracker, a semi-supervised framework that adapts a synthetic-trained tracker to surgical video via filtered self-distillation: a fixed teacher network generates pseudo-labels online, and a cycle consistency constraint filters out temporally inconsistent trajectories, enforcing geometric consistency and providing stable supervision signals throughout training.

Link: https://arxiv.org/abs/2505.05722
Authors: Valay Bundele, Mehran Hosseinzadeh, Hendrik Lensch
Affiliations: University of Tübingen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025 SynData4CV Workshop

Abstract:Synthetic datasets have enabled significant progress in point tracking by providing large-scale, densely annotated supervision. However, deploying these models in real-world domains remains challenging due to domain shift and a lack of labeled data, issues that are especially severe in surgical videos, where scenes exhibit complex tissue deformation, occlusion, and lighting variation. While recent approaches adapt synthetic-trained trackers to natural videos using teacher ensembles or augmentation-heavy pseudo-labeling pipelines, their effectiveness in high-shift domains like surgery remains unexplored. This work presents SurgTracker, a semi-supervised framework for adapting synthetic-trained point trackers to surgical video using filtered self-distillation. Pseudo-labels are generated online by a fixed teacher, identical in architecture and initialization to the student, and are filtered using a cycle consistency constraint to discard temporally inconsistent trajectories. This simple yet effective design enforces geometric consistency and provides stable supervision throughout training, without the computational overhead of maintaining multiple teachers. Experiments on the STIR benchmark show that SurgTracker improves tracking performance using only 80 unlabeled videos, demonstrating its potential for robust adaptation in high-shift, data-scarce domains.
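The filtering rule is simple enough to sketch directly: track forward with the teacher, track the endpoints backward, and keep only trajectories that return near their start. The `track_fwd`/`track_bwd` callables below stand in for the teacher model and are assumptions.

```python
# Cycle-consistency filtering for pseudo-labels: a trajectory is kept only
# if backward tracking from its endpoint returns close to the start point.
import numpy as np

def filter_by_cycle_consistency(starts, track_fwd, track_bwd, thresh=2.0):
    """starts: (N, 2) query points; returns a boolean mask of reliable tracks."""
    ends = track_fwd(starts)              # forward pass through the video
    returns = track_bwd(ends)             # backward pass from the endpoints
    err = np.linalg.norm(returns - starts, axis=1)
    return err < thresh                   # pseudo-labels kept for self-distillation

# Toy stand-ins: a consistent tracker and a drifting one.
pts = np.random.rand(5, 2) * 100
good = filter_by_cycle_consistency(pts, lambda p: p + 10, lambda p: p - 10)
bad = filter_by_cycle_consistency(pts, lambda p: p + 10, lambda p: p - 5)
print(good, bad)  # all True, then all False
```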

[CV-43] Semantic-Space-Intervened Diffusive Alignment for Visual Classification

[Quick Read]: This paper addresses the difficulty of projecting visual features onto textual ones in cross-modal alignment, caused by differences between the two modalities in class-wise sample distributions and feature value ranges. The key is SeDA (Semantic-Space-Intervened Diffusive Alignment), which models a semantic space as a bridge in the visual-to-textual projection, since both feature types share the same class-level information, and uses a bi-stage diffusion framework for progressive alignment: a Diffusion-Controlled Semantic Learner first models the semantic space of visual features by constraining the diffusion model's interactive features and the category centers of visual features, and a Diffusion-Controlled Semantic Translator then learns the distribution of textual features from the semantic space, while a Progressive Feature Interaction Network injects textual information into the mapped features step by step at each alignment step.

Link: https://arxiv.org/abs/2505.05721
Authors: Zixuan Li, Lei Meng, Guoqing Chao, Wei Wu, Xiaoshuo Yan, Yimeng Yang, Zhuang Qi, Xiangxu Meng
Affiliations: School of Software, Shandong University, Jinan, China; School of Computer Science and Technology, Harbin Institute of Technology, Weihai, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cross-modal alignment is an effective approach to improving visual classification. Existing studies typically enforce a one-step mapping that uses deep neural networks to project the visual features to mimic the distribution of textual features. However, they typically face difficulties in finding such a projection due to the differences between the two modalities in both the distribution of class-wise samples and the range of their feature values. To address this issue, this paper proposes a novel Semantic-Space-Intervened Diffusive Alignment method, termed SeDA, which models a semantic space as a bridge in the visual-to-textual projection, considering that both types of features share the same class-level information in classification. More importantly, a bi-stage diffusion framework is developed to enable the progressive alignment between the two modalities. Specifically, SeDA first employs a Diffusion-Controlled Semantic Learner to model the semantic space of visual features by constraining the interactive features of the diffusion model and the category centers of visual features. In the later stage of SeDA, the Diffusion-Controlled Semantic Translator focuses on learning the distribution of textual features from the semantic space. Meanwhile, the Progressive Feature Interaction Network introduces stepwise feature interactions at each alignment step, progressively integrating textual information into mapped features. Experimental results show that SeDA achieves stronger cross-modal feature alignment, leading to superior performance over existing methods across multiple scenarios.

[CV-44] DiGIT: Multi-Dilated Gated Encoder and Central-Adjacent Region Integrated Decoder for Temporal Action Detection Transformer CVPR2025

[Quick Read]: This paper examines problems in query-based temporal action detection (TAD) methods that directly inherit architectures designed for object detection and thus fail to handle TAD-specific challenges, such as redundancy in multi-scale features and limited ability to capture sufficient temporal context. The key is DiGIT, a multi-dilated gated encoder and central-adjacent region integrated decoder: the encoder reduces the redundant information introduced by multi-level features while retaining the ability to capture fine-grained and long-range temporal information, and the decoder adopts a more comprehensive sampling strategy to extract essential information.

Link: https://arxiv.org/abs/2505.05711
Authors: Ho-Joong Kim, Yearang Lee, Jung-Ho Hong, Seong-Whan Lee
Affiliations: Korea University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:In this paper, we examine a key limitation in query-based detectors for temporal action detection (TAD), which arises from their direct adaptation of originally designed architectures for object detection. Despite the effectiveness of the existing models, they struggle to fully address the unique challenges of TAD, such as the redundancy in multi-scale features and the limited ability to capture sufficient temporal context. To address these issues, we propose a multi-dilated gated encoder and central-adjacent region integrated decoder for temporal action detection transformer (DiGIT). Our approach replaces the existing encoder that consists of multi-scale deformable attention and feedforward network with our multi-dilated gated encoder. Our proposed encoder reduces the redundant information caused by multi-level features while maintaining the ability to capture fine-grained and long-range temporal information. Furthermore, we introduce a central-adjacent region integrated decoder that leverages a more comprehensive sampling strategy for deformable cross-attention to capture the essential information. Extensive experiments demonstrate that DiGIT achieves state-of-the-art performance on THUMOS14, ActivityNet v1.3, and HACS-Segment. Code is available at: this https URL
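To give a feel for the encoder design, here is a minimal multi-dilated gated block over 1D temporal features; the dilation rates, fusion rule, and gating are illustrative stand-ins for DiGIT's actual modules.

```python
# A multi-dilated gated block sketch: parallel dilated convolutions widen
# the temporal receptive field, and a sigmoid gate modulates the fused
# output before a residual connection.
import torch
import torch.nn as nn

class MultiDilatedGatedBlock(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.gate = nn.Sequential(nn.Conv1d(channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) temporal feature sequence.
        fused = sum(branch(x) for branch in self.branches) / len(self.branches)
        return x + self.gate(x) * fused   # gated residual fusion

feats = torch.randn(2, 256, 128)          # batch of 128-snippet sequences
print(MultiDilatedGatedBlock(256)(feats).shape)  # torch.Size([2, 256, 128])
```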

[CV-45] HyperspectralMAE: The Hyperspectral Imagery Classification Model using Fourier-Encoded Dual-Branch Masked Autoencoder

[Quick Read]: This paper addresses the unique challenges posed by the high spatial and spectral dimensionality of hyperspectral imagery, in particular learning robust spectral-spatial representations for downstream tasks. The key is HyperspectralMAE, a Transformer-based pre-trained model with a dual masking strategy that randomly occludes 50% of spatial patches and 50% of spectral bands during pre-training, forcing the model to reconstruct missing information across both dimensions. Learnable wavelength-based harmonic Fourier positional embeddings encode spectral order, and a reconstruction objective combining mean-squared error (MSE) with the spectral angle mapper (SAM) balances pixel-level accuracy and spectral-shape fidelity.

Link: https://arxiv.org/abs/2505.05710
Authors: Wooyoung Jeong, Hyun Jae Park, Seonghun Jeong, Jong Wook Jang, Tae Hoon Lim, Dae Seoung Kim
Affiliations: GNewSoft Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:Hyperspectral imagery provides rich spectral detail but poses unique challenges because of its high dimensionality in both spatial and spectral domains. We propose HyperspectralMAE, a Transformer-based foundation model for hyperspectral data that employs a dual masking strategy: during pre-training we randomly occlude 50% of spatial patches and 50% of spectral bands. This forces the model to learn representations capable of reconstructing missing information across both dimensions. To encode spectral order, we introduce learnable harmonic Fourier positional embeddings based on wavelength. The reconstruction objective combines mean-squared error (MSE) with the spectral angle mapper (SAM) to balance pixel-level accuracy and spectral-shape fidelity. The resulting model contains about 1.8×10^8 parameters and produces 768-dimensional embeddings, giving it sufficient capacity for transfer learning. We pre-trained HyperspectralMAE on two large hyperspectral corpora, NASA EO-1 Hyperion (~1,600 scenes, ~3×10^11 pixel spectra) and DLR EnMAP Level-0 (~1,300 scenes, ~3×10^11 pixel spectra), and fine-tuned it for land-cover classification on the Indian Pines benchmark. HyperspectralMAE achieves state-of-the-art transfer-learning accuracy on Indian Pines, confirming that masked dual-dimensional pre-training yields robust spectral-spatial representations. These results demonstrate that dual masking and wavelength-aware embeddings advance hyperspectral image reconstruction and downstream analysis.
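The combined reconstruction objective is straightforward to write down; the sketch below implements MSE plus the mean spectral angle, with the weighting factor as an illustrative assumption.

```python
# Combined MSE + spectral angle mapper (SAM) loss: MSE penalizes pixel
# errors, while SAM measures the angle between predicted and true spectra
# of each pixel, encouraging correct spectral shape.
import torch
import torch.nn.functional as F

def mse_sam_loss(pred: torch.Tensor, target: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """pred, target: (B, C, H, W) hyperspectral cubes with C spectral bands."""
    mse = F.mse_loss(pred, target)
    # Cosine similarity along the spectral axis -> angle per pixel.
    cos = F.cosine_similarity(pred, target, dim=1).clamp(-1 + 1e-7, 1 - 1e-7)
    sam = torch.acos(cos).mean()          # mean spectral angle in radians
    return mse + lam * sam

x = torch.rand(2, 200, 16, 16)            # 200-band cube
noisy = x + 0.05 * torch.randn_like(x)
print(mse_sam_loss(noisy, x).item())
```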

[CV-46] Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos

[Quick Read]: This paper addresses the automatic retrieval of useful behavior clips from videos of nonhuman primates (capuchin monkeys) in their natural habitat, where the core difficulty is training a model solely on raw, unlabeled video with only limited weak audio descriptions. The key is a two-part approach: an agentic data-processing pipeline that extracts clean, semantically aligned video-text pairs from the raw footage, and Low-Rank Adaptation (LoRA) fine-tuning of a pre-trained Microsoft X-CLIP model for domain adaptation. The method lifts Hits@5 by 167% (16-frame model) and 114% (8-frame model) on domain data, and by NDCG@K it ranks most of the considered behaviors well, whereas the raw pre-trained models cannot rank them at all.

Link: https://arxiv.org/abs/2505.05681
Authors: Giulio Cesare Mastrocinque Santo, Patrícia Izar, Irene Delval, Victor de Napole Gregolin, Nina S. T. Hirata
Affiliations: Institute of Mathematics and Statistics, University of São Paulo (IME-USP); Department of Experimental Psychology, Institute of Psychology, University of São Paulo (IP-USP); Institute of Biosciences, University of São Paulo (IB-USP)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video recordings of nonhuman primates in their natural habitat are a common source for studying their behavior in the wild. We fine-tune pre-trained video-text foundational models for the specific domain of capuchin monkeys, with the goal of developing useful computational models to help researchers retrieve useful clips from videos. We focus on the challenging problem of training a model based solely on raw, unlabeled video footage, using weak audio descriptions sometimes provided by field collaborators. We leverage recent advances in Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs) to address the extremely noisy nature of both video and audio content. Specifically, we propose a twofold approach: an agentic data treatment pipeline and a fine-tuning process. The data processing pipeline automatically extracts clean and semantically aligned video-text pairs from the raw videos, which are subsequently used to fine-tune a pre-trained Microsoft X-CLIP model through Low-Rank Adaptation (LoRA). We obtained an uplift in Hits@5 of 167% for the 16-frame model and an uplift of 114% for the 8-frame model on our domain data. Moreover, based on NDCG@K results, our model is able to rank most of the considered behaviors well, while the tested raw pre-trained models are not able to rank them at all. The code will be made available upon acceptance.

[CV-47] InstanceGen: Image Generation with Instance-level Instructions

[Quick Read]: This paper addresses the weakness of pretrained text-to-image models in capturing the semantics of complex prompts that compound multiple objects and instance-level attributes. The key is coupling image-based fine-grained structural initialization with LLM-based instance-level instructions, so that the generated images adhere to every part of the text prompt, including object counts, instance-level attributes, and spatial relations between instances.

Link: https://arxiv.org/abs/2505.05678
Authors: Etai Sella, Yanir Kleiman, Hadar Averbuch-Elor
Affiliations: Tel Aviv University; Meta AI; Cornell University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Despite rapid advancements in the capabilities of generative models, pretrained text-to-image models still struggle in capturing the semantics conveyed by complex prompts that compound multiple objects and instance-level attributes. Consequently, we are witnessing growing interest in integrating additional structural constraints, typically in the form of coarse bounding boxes, to better guide the generation process in such challenging cases. In this work, we take the idea of structural guidance a step further by making the observation that contemporary image generation models can directly provide a plausible fine-grained structural initialization. We propose a technique that couples this image-based structural guidance with LLM-based instance-level instructions, yielding output images that adhere to all parts of the text prompt, including object counts, instance-level attributes, and spatial relations between instances.

[CV-48] TeGA: Texture Space Gaussian Avatars for High-Resolution Dynamic Head Modeling SIGGRAPH2025

[Quick Read]: This paper addresses the inaccurate motion estimation and memory-limited loss of detail in existing animatable 3D head avatars built on 3D Gaussian splatting. The key is a new high-detail 3D head avatar model that substantially increases the number of 3D Gaussians and the modeling quality, enabling high-quality rendering at 4K resolution. The model is reconstructed from multiview input video and built on top of a mesh-based 3D morphable model; appearance is modeled by 3D Gaussians embedded in the mesh's continuous UVD tangent space, allowing densification where it is most needed, and a novel UVD deformation field warps the Gaussians to capture subtle, localized motion, preserving appearance detail while accurately capturing facial motion and other high-frequency features.

Link: https://arxiv.org/abs/2505.05672
Authors: Gengyan Li, Paulo Gotardo, Timo Bolkart, Stephan Garbin, Kripasindhu Sarkar, Abhimitra Meka, Alexandros Lattas, Thabo Beeler
Affiliations: Google; ETH Zurich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures, supplementary results found at: this https URL, to be published in SIGGRAPH 2025

Abstract:Sparse volumetric reconstruction and rendering via 3D Gaussian splatting have recently enabled animatable 3D head avatars that are rendered under arbitrary viewpoints with impressive photorealism. Today, such photoreal avatars are seen as a key component in emerging applications in telepresence, extended reality, and entertainment. Building a photoreal avatar requires estimating the complex non-rigid motion of different facial components as seen in input video images; due to inaccurate motion estimation, animatable models typically present a loss of fidelity and detail when compared to their non-animatable counterparts, built from an individual facial expression. Also, recent state-of-the-art models are often affected by memory limitations that reduce the number of 3D Gaussians used for modeling, leading to lower detail and quality. To address these problems, we present a new high-detail 3D head avatar model that improves upon the state of the art, largely increasing the number of 3D Gaussians and modeling quality for rendering at 4K resolution. Our high-quality model is reconstructed from multiview input video and builds on top of a mesh-based 3D morphable model, which provides a coarse deformation layer for the head. Photoreal appearance is modelled by 3D Gaussians embedded within the continuous UVD tangent space of this mesh, allowing for more effective densification where most needed. Additionally, these Gaussians are warped by a novel UVD deformation field to capture subtle, localized motion. Our key contribution is the novel deformable Gaussian encoding and overall fitting procedure that allows our head model to preserve appearance detail, while capturing facial motion and other transient high-frequency features such as skin wrinkling.

[CV-49] Lost in OCR Translation? Vision-Based Approaches to Robust Document Retrieval

[Quick Read]: This paper examines the errors that traditional OCR-based retrieval-augmented generation (RAG) introduces on degraded or complex documents, and how vision-based RAG systems perform for document retrieval and question answering. The key is a systematic comparison between a vision-based RAG system (ColPali) and traditional OCR-based pipelines (Llama 3.2 (90B) with Nougat OCR) across varying document qualities, together with a semantic answer evaluation benchmark for end-to-end QA. The results reveal a trade-off: vision-based RAG performs well on documents it has been fine-tuned on, while OCR-based RAG generalizes better to unseen documents of varying quality.

Link: https://arxiv.org/abs/2505.05666
Authors: Alexander Most, Joseph Winjum, Ayan Biswas, Shawn Jones, Nishath Rajiv Ranasinghe, Dan O'Malley, Manish Bhattarai
Affiliations: Los Alamos National Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) has become a popular technique for enhancing the reliability and utility of Large Language Models (LLMs) by grounding responses in external documents. Traditional RAG systems rely on Optical Character Recognition (OCR) to first process scanned documents into text. However, even state-of-the-art OCRs can introduce errors, especially in degraded or complex documents. Recent vision-language approaches, such as ColPali, propose direct visual embedding of documents, eliminating the need for OCR. This study presents a systematic comparison between a vision-based RAG system (ColPali) and more traditional OCR-based pipelines utilizing Llama 3.2 (90B) and Nougat OCR across varying document qualities. Beyond conventional retrieval accuracy metrics, we introduce a semantic answer evaluation benchmark to assess end-to-end question-answering performance. Our findings indicate that while vision-based RAG performs well on documents it has been fine-tuned on, OCR-based RAG is better able to generalize to unseen documents of varying quality. We highlight the key trade-offs between computational efficiency and semantic accuracy, offering practical guidance for RAG practitioners in selecting between OCR-dependent and vision-based document retrieval systems in production environments.

[CV-50] The Moon's Many Faces: A Single Unified Transformer for Multimodal Lunar Reconstruction

[Quick Read]: This paper frames reflectance parameter estimation and image-based 3D reconstruction of planetary surfaces as a multimodal learning problem. The key is a single unified Transformer architecture trained to learn shared representations across input modalities such as grayscale images, digital elevation models, surface normals, and albedo maps, supporting flexible translation from any input modality to any target modality. Predicting DEMs and albedo maps simultaneously from grayscale images solves the 3D reconstruction of planetary surfaces while disentangling photometric parameters from height information.

Link: https://arxiv.org/abs/2505.05644
Authors: Tom Sander, Moritz Tenthoff, Kay Wohlfarth, Christian Wöhler
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 14 pages

Abstract:Multimodal learning is an emerging research topic across multiple disciplines but has rarely been applied to planetary science. In this contribution, we identify that reflectance parameter estimation and image-based 3D reconstruction of lunar images can be formulated as a multimodal learning problem. We propose a single, unified transformer architecture trained to learn shared representations between multiple sources like grayscale images, digital elevation models, surface normals, and albedo maps. The architecture supports flexible translation from any input modality to any target modality. Predicting DEMs and albedo maps from grayscale images simultaneously solves the task of 3D reconstruction of planetary surfaces and disentangles photometric parameters and height information. Our results demonstrate that our foundation model learns physically plausible relations across these four modalities. Adding more input modalities in the future will enable tasks such as photometric normalization and co-registration.

[CV-51] Semantic Style Transfer for Enhancing Animal Facial Landmark Detection

[Quick Read]: This paper investigates semantic style transfer as an augmentation strategy to address limited data diversity and structural consistency when training animal facial landmark detectors. The key is Supervised Style Transfer (SST), which selects style sources based on landmark detection accuracy, effectively mitigating annotation misalignment while retaining up to 98% of baseline accuracy; augmenting the dataset with style-transferred images further improves the model's robustness.

Link: https://arxiv.org/abs/2505.05640
Authors: Anadil Hussein, Anna Zamansky, George Martvel
Affiliations: University of Haifa
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Neural Style Transfer (NST) is a technique for applying the visual characteristics of one image onto another while preserving structural content. Traditionally used for artistic transformations, NST has recently been adapted, e.g., for domain adaptation and data augmentation. This study investigates the use of this technique for enhancing the training of animal facial landmark detectors. As a case study, we use a recently introduced Ensemble Landmark Detector for 48 anatomical cat facial landmarks and the CatFLW dataset it was trained on, making three main contributions. First, we demonstrate that applying style transfer to cropped facial images rather than full-body images enhances structural consistency, improving the quality of generated images. Second, replacing training images with style-transferred versions raised challenges of annotation misalignment, but Supervised Style Transfer (SST), which selects style sources based on landmark accuracy, retained up to 98% of baseline accuracy. Finally, augmenting the dataset with style-transferred images further improved robustness, outperforming traditional augmentation methods. These findings establish semantic style transfer as an effective augmentation strategy for enhancing the performance of facial landmark detection models for animals and beyond. While this study focuses on cat facial landmarks, the proposed method can be generalized to other species and landmark detection models.
zh

[CV-52] VR-RAG: Open-vocabulary Species Recognition with RAG-Assisted Large Multi-Modal Models

【速读】:该论文试图解决开放词汇识别(open-vocabulary recognition)在计算机视觉中的挑战,特别是在自然环境中不断出现新物种的背景下,如何准确识别未见过的鸟类物种。传统基准测试如CUB-200-2011和Birdsnap在封闭词汇范式下评估,限制了其在现实场景中的适用性。解决方案的关键在于提出一种可扩展的框架,该框架整合了通过GPT-4o提炼的11,202种鸟类物种的维基百科文章的结构化文本知识,并引入了视觉重排序检索增强生成(Visual Re-ranking Retrieval-Augmented Generation, VR-RAG)方法,利用视觉相似性对多模态视觉语言编码器检索的前m个候选进行重新排序,从而实现对未见分类单元的识别。

链接: https://arxiv.org/abs/2505.05635
作者: Faizan Farooq Khan,Jun Chen,Youssef Mohamed,Chun-Mei Feng,Mohamed Elhoseiny
机构: King Abdullah University of Science and Technology (阿卜杜拉国王科技大学); IHPC, A*STAR (IHPC,新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 figures

点击查看摘要

Abstract:Open-vocabulary recognition remains a challenging problem in computer vision, as it requires identifying objects from an unbounded set of categories. This is particularly relevant in nature, where new species are discovered every year. In this work, we focus on open-vocabulary bird species recognition, where the goal is to classify species based on their descriptions without being constrained to a predefined set of taxonomic categories. Traditional benchmarks like CUB-200-2011 and Birdsnap have been evaluated in a closed-vocabulary paradigm, limiting their applicability to real-world scenarios where novel species continually emerge. We show that the performance of current systems drops by a huge margin when evaluated under settings closely aligned with the open-vocabulary paradigm. To address this gap, we propose a scalable framework integrating structured textual knowledge from Wikipedia articles of 11,202 bird species distilled via GPT-4o into concise, discriminative summaries. We propose Visual Re-ranking Retrieval-Augmented Generation (VR-RAG), a novel, retrieval-augmented generation framework that uses visual similarities to rerank the top m candidates retrieved by a set of multimodal vision language encoders. This allows for the recognition of unseen taxa. Extensive experiments across five established classification benchmarks show that our approach is highly effective. By integrating VR-RAG, we improve the average performance of state-of-the-art Large Multi-Modal Model QWEN2.5-VL by 15.4% across five benchmarks. Our approach outperforms conventional VLM-based approaches, which struggle with unseen species. By bridging the gap between encyclopedic knowledge and visual recognition, our work advances open-vocabulary recognition, offering a flexible, scalable solution for biodiversity monitoring and ecological research.
zh
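
下面用纯 NumPy 勾勒 VR-RAG "先召回、后视觉重排"两阶段流程的核心思路(草图:嵌入向量为随机占位,融合权重 alpha 为假设取值,并非论文的官方实现):

```python
import numpy as np

def vr_rag_rerank(query_img_emb, cand_text_embs, cand_img_embs, m=10, alpha=0.5):
    """VR-RAG 两阶段流程的示意(草图,权重与细节为假设):
    1) 用图文嵌入相似度召回 top-m 候选物种;
    2) 用候选参考图的视觉相似度对 top-m 重排。"""
    def cos(a, B):
        a = a / np.linalg.norm(a)
        B = B / np.linalg.norm(B, axis=1, keepdims=True)
        return B @ a

    retrieval = cos(query_img_emb, cand_text_embs)      # 图像-文本召回分数
    top_m = np.argsort(-retrieval)[:m]
    visual = cos(query_img_emb, cand_img_embs[top_m])   # 图像-图像视觉相似度
    final = alpha * retrieval[top_m] + (1 - alpha) * visual
    return top_m[np.argsort(-final)]                    # 重排后的候选索引

rng = np.random.default_rng(0)
order = vr_rag_rerank(rng.normal(size=512),
                      rng.normal(size=(100, 512)),
                      rng.normal(size=(100, 512)))
print(order[:3])
```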

[CV-53] Looking Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中视觉与语言深度对齐的问题,这类模型往往无法充分利用视觉输入,而倾向于依赖强语言先验。论文的关键解决方案是深入分析MLLMs如何内部构建图像区域的视觉理解,并引入技术来增强这一能力,具体包括加深模型对视觉内容的理解以及确保这些视觉洞察能够主动引导语言生成。

链接: https://arxiv.org/abs/2505.05626
作者: Aarti Ghatkesar,Uddeshya Upadhyay,Ganesh Venkatesh
机构: Cerebras
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Achieving deep alignment between vision and language remains a central challenge for Multimodal Large Language Models (MLLMs). These models often fail to fully leverage visual input, defaulting to strong language priors. Our approach first provides insights into how MLLMs internally build visual understanding of image regions and then introduces techniques to amplify this capability. Specifically, we explore techniques designed both to deepen the model’s understanding of visual content and to ensure that these visual insights actively guide language generation. We demonstrate the superior multimodal understanding of our resultant model through a detailed upstream analysis quantifying its ability to predict visually-dependent tokens as well as a 10-point boost on visually challenging tasks.
zh

[CV-54] A Preliminary Study for GPT -4o on Image Restoration

【速读】:该论文旨在探讨OpenAI的GPT-4o模型在图像修复领域的潜在影响,特别是其在多模态输入输出下的表现及其对传统图像修复任务的适用性。研究发现,尽管GPT-4o生成的图像在视觉上具有吸引力,但在像素级结构保真度方面存在不足,例如图像比例变化、物体位置和数量偏移等问题。然而,GPT-4o的输出可以作为强大的视觉先验,显著提升现有去雾网络的性能。解决方案的关键在于利用GPT-4o生成的图像作为先验知识,为未来的图像修复流程提供实用指导和基准框架。

链接: https://arxiv.org/abs/2505.05621
作者: Hao Yang,Yan Yang,Ruikun Zhang,Liyuan Pan
机构: Beijing Institute of Technology (北京理工大学); Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:OpenAI’s GPT-4o model, integrating multi-modal inputs and outputs within an autoregressive architecture, has demonstrated unprecedented performance in image generation. In this work, we investigate its potential impact on the image restoration community. We present the first systematic evaluation of GPT-4o across diverse restoration tasks. Our experiments reveal that, although restoration outputs from GPT-4o are visually appealing, they often suffer from a loss of pixel-level structural fidelity when compared to ground-truth images. Common issues are variations in image proportions, and shifts in object positions and quantities, among other changes. To address this, taking image dehazing, deraining, and low-light enhancement as representative case studies, we show that GPT-4o’s outputs can serve as powerful visual priors, substantially enhancing the performance of existing dehazing networks. It offers practical guidelines and a baseline framework to facilitate the integration of GPT-4o into future image restoration pipelines. We hope this study on GPT-4o image restoration will accelerate innovation in the broader field of image generation. To support further research, we will release GPT-4o-restored images from over 10 widely used image restoration datasets.
zh

[CV-55] Enhancing Satellite Object Localization with Dilated Convolutions and Attention-aided Spatial Pooling

【速读】:该论文旨在解决卫星图像中目标定位的挑战,这些问题包括目标的高变异性、低空间分辨率以及噪声和显著特征(如云层和城市灯光)的干扰。该研究针对上层大气重力波(Gravity Waves, GW)、中间层 bore(Bore)和海洋涡旋(Ocean Eddies, OE)三个卫星数据集进行分析,这些数据集各自具有尺度和外观变化大的主要目标模式问题。解决方案的关键在于提出一种改进的YOLOv5模型——YOLO-DCAP,其核心组件包括多尺度膨胀残差卷积(Multi-scale Dilated Residual Convolution, MDRC)块,用于捕捉不同尺度的特征,以及注意力辅助空间池化(Attention-aided Spatial Pooling, AaSP)模块,用于聚焦全局相关空间区域,从而提升特征选择能力。

链接: https://arxiv.org/abs/2505.05599
作者: Seraj Al Mahmud Mostafa,Chenxi Wang,Jia Yue,Yuta Hozumi,Jianwu Wang
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校); Catholic University of America (美国天主教大学); NASA Goddard Space Flight Center (美国国家航空航天局戈达德航天中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper has been accepted to International conference on Advanced Machine Learning and Data Science (AMLDS) 2025

点击查看摘要

Abstract:Object localization in satellite imagery is particularly challenging due to the high variability of objects, low spatial resolution, and interference from noise and dominant features such as clouds and city lights. In this research, we focus on three satellite datasets: upper atmospheric Gravity Waves (GW), mesospheric Bores (Bore), and Ocean Eddies (OE), each presenting its own unique challenges. These challenges include the variability in the scale and appearance of the main object patterns, where the size, shape, and feature extent of objects of interest can differ significantly. To address these challenges, we introduce YOLO-DCAP, a novel enhanced version of YOLOv5 designed to improve object localization in these complex scenarios. YOLO-DCAP incorporates a Multi-scale Dilated Residual Convolution (MDRC) block to capture multi-scale features with varying dilation rates, and an Attention-aided Spatial Pooling (AaSP) module to focus on globally relevant spatial regions, enhancing feature selection. These structural improvements help to better localize objects in satellite imagery. Experimental results demonstrate that YOLO-DCAP significantly outperforms both the YOLO base model and state-of-the-art approaches, achieving an average improvement of 20.95% in mAP50 and 32.23% in IoU over the base model, and 7.35% and 9.84% respectively over state-of-the-art alternatives, consistently across all three satellite datasets. These consistent gains highlight the robustness and generalizability of the proposed approach. Our code is open sourced at this https URL.
zh
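
摘要中的 MDRC 块可以用 PyTorch 按如下方式勾勒(草图:分支数、膨胀率与激活函数均为假设,并非论文的原始配置):

```python
import torch
import torch.nn as nn

class MDRC(nn.Module):
    """多尺度膨胀残差卷积块的示意实现(基于摘要的假设性草图,
    分支数与膨胀率为假设,并非论文官方配置)。"""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            )
            for d in dilations
        ])
        # 1x1 卷积把各尺度分支融合回原通道数
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        multi_scale = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.fuse(multi_scale)   # 残差连接

x = torch.randn(1, 64, 80, 80)
print(MDRC(64)(x).shape)   # torch.Size([1, 64, 80, 80])
```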

[CV-56] Learning to Drive Anywhere with Model-Based Reannotation

【速读】:该论文试图解决机器人在视觉导航中泛化能力不足的问题,这一问题主要受限于大规模、多样化训练数据的缺乏。论文提出的解决方案关键在于Model-Based ReAnnotation (MBRA)框架,该框架利用一个学习到的短时域、基于模型的专家模型,对被动收集的低质量或无标签数据进行重新标注或生成高质量动作,从而提升数据质量,进而训练出具有长时域导航能力的LogoNav策略。

链接: https://arxiv.org/abs/2505.05592
作者: Noriaki Hirose,Lydia Ignatova,Kyle Stachowicz,Catherine Glossop,Sergey Levine,Dhruv Shah
机构: University of California Berkeley (加州大学伯克利分校); Toyota Motor North America (丰田汽车北美公司); Princeton University (普林斯顿大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 19 pages, 11 figures, 8 tables

点击查看摘要

Abstract:Developing broadly generalizable visual navigation policies for robots is a significant challenge, primarily constrained by the availability of large-scale, diverse training data. While curated datasets collected by researchers offer high quality, their limited size restricts policy generalization. To overcome this, we explore leveraging abundant, passively collected data sources, including large volumes of crowd-sourced teleoperation data and unlabeled YouTube videos, despite their potential for lower quality or missing action labels. We propose Model-Based ReAnnotation (MBRA), a framework that utilizes a learned short-horizon, model-based expert model to relabel or generate high-quality actions for these passive datasets. This relabeled data is then distilled into LogoNav, a long-horizon navigation policy conditioned on visual goals or GPS waypoints. We demonstrate that LogoNav, trained using MBRA-processed data, achieves state-of-the-art performance, enabling robust navigation over distances exceeding 300 meters in previously unseen indoor and outdoor environments. Our extensive real-world evaluations, conducted across a fleet of robots (including quadrupeds) in six cities on three continents, validate the policy’s ability to generalize and navigate effectively even amidst pedestrians in crowded settings.
zh
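
MBRA 的重标注思想可以浓缩为下面的循环(草图:DummyExpert 是占位的短时程专家模型,真实系统中应替换为论文学习得到的模型,轨迹数据结构亦为假设):

```python
class DummyExpert:
    """占位专家模型:真实 MBRA 中这里是一个学习到的短时程模型(接口为假设)。"""
    def predict_action(self, obs, goal):
        return [g - o for o, g in zip(obs, goal)]   # 朝目标方向的位移作为动作

def reannotate(trajectories, expert):
    """用专家模型为被动收集的无标签轨迹重新生成高质量动作标签。"""
    relabeled = []
    for traj in trajectories:                  # 每条轨迹是一段观测序列
        for t in range(len(traj) - 1):
            obs, goal = traj[t], traj[t + 1]   # 相邻观测构成短时程子目标
            relabeled.append((obs, expert.predict_action(obs, goal)))
    return relabeled                           # 用于蒸馏训练长时程策略 LogoNav

demo = [[(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]]  # 一条无动作标签的轨迹
print(reannotate(demo, DummyExpert()))
```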

[CV-57] QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization

【速读】:该论文旨在解决大规模室内场景表面重建中优化速度慢以及难以建模欠观测或无纹理区域的问题。其解决方案的关键在于引入QuickSplat,通过学习数据驱动的先验知识为2D高斯点云(Gaussian Splatting)优化生成密集初始化,从而加速优化收敛并提升平面墙体结构的几何精度。此外,该方法在每次迭代中联合估计场景参数的稀疏化与更新,通过提出的稀疏化网络根据现有高斯分布的渲染梯度预测新的高斯分布,消除了对启发式稀疏化策略的依赖。

链接: https://arxiv.org/abs/2505.05591
作者: Yueh-Cheng Liu,Lukas Höllein,Matthias Nießner,Angela Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL , Video: this https URL

点击查看摘要

Abstract:Surface reconstruction is fundamental to computer vision and graphics, enabling applications in 3D modeling, mixed reality, robotics, and more. Existing approaches based on volumetric rendering obtain promising results, but optimize on a per-scene basis, resulting in a slow optimization that can struggle to model under-observed or textureless regions. We introduce QuickSplat, which learns data-driven priors to generate dense initializations for 2D Gaussian splatting optimization of large-scale indoor scenes. This provides a strong starting point for the reconstruction, which accelerates the convergence of the optimization and improves the geometry of flat wall structures. We further learn to jointly estimate the densification and update of the scene parameters during each iteration; our proposed densifier network predicts new Gaussians based on the rendering gradients of existing ones, removing the need for densification heuristics. Extensive experiments on large-scale indoor scene reconstruction demonstrate the superiority of our data-driven optimization. Concretely, we accelerate runtime by 8x, while decreasing depth errors by up to 48% in comparison to state-of-the-art methods.
zh

[CV-58] ReactDance: Progressive-Granular Representation for Long-Term Coherent Reactive Dance Generation

【速读】:该论文旨在解决反应式舞蹈生成(Reactive Dance Generation, RDG)中存在的一致性与控制性不足的问题,特别是在双人舞蹈合成中面临交互真实性、同步性和时间一致性方面的挑战。其解决方案的关键在于提出了一种基于扩散模型的框架ReactDance,该框架引入了两个核心创新:一是多尺度解耦的动作表示方法Group Residual Finite Scalar Quantization (GRFSQ),能够从粗粒度的身体节奏到细粒度的关节动态捕捉交互语义;二是基于局部块因果掩码和周期性位置编码的Blockwise Local Context (BLC)采样策略,有效避免了长序列生成中的误差累积问题。此外,通过构建解耦的多尺度GRFSQ表示,并结合Layer-Decoupled Classifier-free Guidance (LDCFG)实现对运动语义的跨尺度精细控制。

链接: https://arxiv.org/abs/2505.05589
作者: Jingzhong Lin,Yuanyuan Qi,Xinru Li,Wenxuan Huang,Xiangfeng Xu,Bangyan Li,Xuejiao Wang,Gaoqi He
机构: East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reactive dance generation (RDG) produces follower movements conditioned on a guiding dancer and music while ensuring spatial coordination and temporal coherence. However, existing methods overemphasize global constraints and optimization, overlooking local information, such as fine-grained spatial interactions and localized temporal context. Therefore, we present ReactDance, a novel diffusion-based framework for high-fidelity RDG with long-term coherence and multi-scale controllability. Unlike existing methods that struggle with interaction fidelity, synchronization, and temporal consistency in duet synthesis, our approach introduces two key innovations: 1) Group Residual Finite Scalar Quantization (GRFSQ), a multi-scale disentangled motion representation that captures interaction semantics from coarse body rhythms to fine-grained joint dynamics, and 2) Blockwise Local Context (BLC), a sampling strategy eliminating error accumulation in long sequence generation via local block causal masking and periodic positional encoding. Built on the decoupled multi-scale GRFSQ representation, we implement a diffusion model with Layer-Decoupled Classifier-free Guidance (LDCFG), allowing granular control over motion semantics across scales. Extensive experiments on standard benchmarks demonstrate that ReactDance surpasses existing methods, achieving state-of-the-art performance.
zh
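
BLC 中"局部块因果掩码"的一种可能形式如下(草图:块大小与局部窗口宽度为假设,块内双向可见是本示意的简化选择):

```python
import torch

def blockwise_local_causal_mask(seq_len, block=4, local_blocks=2):
    """局部块因果掩码的示意(草图,块大小与局部窗口为假设):
    同块内的位置互相可见,且每个块只回看最近 local_blocks 个块,
    以局部因果约束缓解长序列生成中的误差累积。"""
    blk = torch.arange(seq_len) // block            # 每个位置所属的块编号
    q, k = blk[:, None], blk[None, :]
    visible = (k <= q) & (k >= q - (local_blocks - 1))
    return visible                                   # True 表示允许注意

print(blockwise_local_causal_mask(8, block=2, local_blocks=2).int())

# 周期性位置编码的最简形式:位置索引在块内周期复用(同为示意)
periodic_pos = torch.arange(8) % 2
print(periodic_pos)
```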

[CV-59] Steepest Descent Density Control for Compact 3D Gaussian Splatting CVPR2025

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 中由于密集化算法导致的冗余点云问题,该问题造成了内存占用过高、性能下降和存储需求大的挑战。解决方案的关键在于提出一种理论框架,通过优化理论方法确定密集化所需的最小后代高斯分布数量、最优参数更新方向以及后代不透明度的解析归一化方法,并引入SteepGS,其核心是梯度密度控制策略,能够在保持渲染质量的同时显著减少高斯点数量,从而提升效率和可扩展性。

链接: https://arxiv.org/abs/2505.05587
作者: Peihao Wang,Yuehao Wang,Dilin Wang,Sreyas Mohan,Zhiwen Fan,Lemeng Wu,Ruisi Cai,Yu-Ying Yeh,Zhangyang Wang,Qiang Liu,Rakesh Ranjan
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Meta Reality Labs (Meta 现实实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025, Project page: this https URL

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time, high-resolution novel view synthesis. By representing scenes as a mixture of Gaussian primitives, 3DGS leverages GPU rasterization pipelines for efficient rendering and reconstruction. To optimize scene coverage and capture fine details, 3DGS employs a densification algorithm to generate additional points. However, this process often leads to redundant point clouds, resulting in excessive memory usage, slower performance, and substantial storage demands - posing significant challenges for deployment on resource-constrained devices. To address this limitation, we propose a theoretical framework that demystifies and improves density control in 3DGS. Our analysis reveals that splitting is crucial for escaping saddle points. Through an optimization-theoretic approach, we establish the necessary conditions for densification, determine the minimal number of offspring Gaussians, identify the optimal parameter update direction, and provide an analytical solution for normalizing offspring opacity. Building on these insights, we introduce SteepGS, incorporating steepest density control, a principled strategy that minimizes loss while maintaining a compact point cloud. SteepGS achieves a ~50% reduction in Gaussian points without compromising rendering quality, significantly enhancing both efficiency and scalability.
zh

[CV-60] Prompt to Polyp: Clinically-Aware Medical Image Synthesis with Diffusion Models

【速读】:该论文试图解决医疗AI领域中数据稀缺问题,同时保护患者隐私,通过从文本描述生成逼真的医学图像来实现这一目标。其解决方案的关键在于提出一种名为MSDM的新型模型,该模型基于Stable Diffusion优化架构,集成了临床文本编码器、变分自编码器和交叉注意力机制,以更好地对齐医学文本提示与生成图像。

链接: https://arxiv.org/abs/2505.05573
作者: Mikhail Chaichuk,Sushant Gautam,Steven Hicks,Elena Tutubalina
机构: AIRI; HSE University (HSE大学); Simula Metropolitan Center for Digital Engineering (Simula大都会数字工程中心); Oslo Metropolitan University (奥斯陆大都会大学); Kazan Federal University (喀山联邦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: code available at this https URL

点击查看摘要

Abstract:The generation of realistic medical images from text descriptions has significant potential to address data scarcity challenges in healthcare AI while preserving patient privacy. This paper presents a comprehensive study of text-to-image synthesis in the medical domain, comparing two distinct approaches: (1) fine-tuning large pre-trained latent diffusion models and (2) training small, domain-specific models. We introduce a novel model named MSDM, an optimized architecture based on Stable Diffusion that integrates a clinical text encoder, variational autoencoder, and cross-attention mechanisms to better align medical text prompts with generated images. Our study compares two approaches: fine-tuning large pre-trained models (FLUX, Kandinsky) versus training compact domain-specific models (MSDM). Evaluation across colonoscopy (MedVQA-GI) and radiology (ROCOv2) datasets reveals that while large models achieve higher fidelity, our optimized MSDM delivers comparable quality with lower computational costs. Quantitative metrics and qualitative evaluations by medical experts reveal strengths and limitations of each approach.
zh

[CV-61] Benchmarking Vision Language Action Models in Procedurally Generated Open Ended Action Environments

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在分布外(out-of-distribution, OOD)环境中的零样本泛化能力不足的问题。其解决方案的关键在于构建一个全面的基准测试框架——MultiNet v0.2,用于评估和分析当前最先进的视觉-语言模型(VLM)和VLA模型在Procgen基准下的泛化性能,从而揭示模型在不同任务复杂度和动作表示下的表现差异,并探索有效提升模型泛化能力的方法。

链接: https://arxiv.org/abs/2505.05540
作者: Pranav Guruprasad,Yangyue Wang,Sudipta Chowdhury,Harshvardhan Sikka
机构: Manifold Research; Metarch.ai; Georgia Tech
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages, 26 figures

点击查看摘要

Abstract:Vision-language-action (VLA) models represent an important step toward general-purpose robotic systems by integrating visual perception, language understanding, and action execution. However, systematic evaluation of these models, particularly their zero-shot generalization capabilities in out-of-distribution (OOD) environments, remains limited. In this paper, we introduce MultiNet v0.2, a comprehensive benchmark designed to evaluate and analyze the generalization performance of state-of-the-art VLM and VLA models, including GPT-4o, GPT-4.1, OpenVLA, Pi0 Base, and Pi0 FAST, on diverse procedural tasks from the Procgen benchmark. Our analysis reveals several critical insights: (1) all evaluated models exhibit significant limitations in zero-shot generalization to OOD tasks, with performance heavily influenced by factors such as action representation and task complexity; (2) VLAs generally outperform other models due to their robust architectural design; and (3) VLM variants demonstrate substantial improvements when constrained appropriately, highlighting the sensitivity of model performance to precise prompt engineering.
zh

[CV-62] OXSeg: Multidimensional attention UNet-based lip segmentation using semi-supervised lip contours

【速读】:该论文旨在解决唇部分割在监督训练中因缺乏唇轮廓标注而效果受限,以及受图像质量、光照和肤色影响导致检测边界不准确的问题。其解决方案的关键在于提出一种结合注意力UNet和多维输入的顺序唇部分割方法,通过局部二值模式提取面部图像中的微结构特征以构建多维输入,并引入基于少量解剖学关键点估计完整唇轮廓的掩码生成方法,从而提升分割精度。

链接: https://arxiv.org/abs/2505.05531
作者: Hanie Moghaddasi,Christina Chambers,Sarah N. Mattson,Jeffrey R. Wozniak,Claire D. Coles,Raja Mukherjee,Michael Suttie
机构: Nuffield Department of Women’s & Reproductive Health, University of Oxford(牛津大学妇女与生殖健康系); Big Data Institute, University of Oxford(大数据研究所,牛津大学); Department of Pediatrics, University of California San Diego(加州大学圣地亚哥分校儿科系); Department of Psychology, Center for Behavioral Teratology, San Diego State University(圣迭戈州立大学心理学系,行为毒理中心); University of Minnesota Twin Cities(明尼苏达大学双城分校); Department of Psychiatry and Behavioral Sciences, Emory University School of Medicine(埃默里大学医学院精神医学与行为科学系); Faculty of Health and Medical Science, University of Surrey Medical School(萨里大学健康与医学科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Lip segmentation plays a crucial role in various domains, such as lip synchronization, lipreading, and diagnostics. However, the effectiveness of supervised lip segmentation is constrained by the availability of lip contour annotations in the training phase. A further challenge with lip segmentation is its reliance on image quality, lighting, and skin tone, leading to inaccuracies in the detected boundaries. To address these challenges, we propose a sequential lip segmentation method that integrates attention UNet and multidimensional input. We unravel the micro-patterns in facial images using local binary patterns to build multidimensional inputs. Subsequently, the multidimensional inputs are fed into sequential attention UNets, where the lip contour is reconstructed. We introduce a mask generation method that uses a few anatomical landmarks and estimates the complete lip contour to improve segmentation accuracy. This mask has been utilized in the training phase for lip segmentation. To evaluate the proposed method, we use facial images to segment the upper lips and subsequently assess lip-related facial anomalies in subjects with fetal alcohol syndrome (FAS). Using the proposed lip segmentation method, we achieved a mean Dice score of 84.75%, and a mean pixel accuracy of 99.77% in upper lip segmentation. To further evaluate the method, we implemented classifiers to identify those with FAS. Using a generative adversarial network (GAN), we reached an accuracy of 98.55% in identifying FAS in one of the study populations. This method could be used to improve lip segmentation accuracy, especially around Cupid’s bow, and shed light on distinct lip-related characteristics of FAS.
zh
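
摘要中"用局部二值模式构建多维输入"的一步可以这样勾勒(草图:半径与采样点配置为假设取值,基于 scikit-image 的 local_binary_pattern):

```python
import numpy as np
from skimage.feature import local_binary_pattern

def multidim_lbp_input(gray, radii=(1, 2, 3)):
    """用多半径 LBP 提取面部微结构,并与原图堆叠成多维输入
    (示意草图,半径/采样点数为假设,非论文原配置)。"""
    channels = [gray.astype(np.float32) / 255.0]
    for r in radii:
        lbp = local_binary_pattern(gray, P=8 * r, R=r, method="uniform")
        # uniform 模式的取值范围是 [0, P+1],据此归一化到 [0,1]
        channels.append(lbp.astype(np.float32) / (8 * r + 1))
    return np.stack(channels, axis=-1)    # (H, W, 1+len(radii)),供 UNet 使用

demo = (np.random.rand(64, 64) * 255).astype(np.uint8)
print(multidim_lbp_input(demo).shape)     # (64, 64, 4)
```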

[CV-63] GaMNet: A Hybrid Network with Gabor Fusion and NMamba for Efficient 3D Glioma Segmentation

【速读】:该论文旨在解决胶质瘤(gliomas)分割中传统深度学习模型在上下文建模能力不足或计算复杂度高导致难以在移动医疗设备上实时应用的问题。其解决方案的关键在于提出GaMNet架构,该架构融合了NMamba模块以实现全局建模,并结合多尺度卷积神经网络(CNN)进行高效的局部特征提取,同时引入多尺度Gabor滤波器以提升可解释性并模拟人类视觉系统,从而在减少参数量和计算时间的同时实现高精度分割。

链接: https://arxiv.org/abs/2505.05520
作者: Chengwei Ye,Huanzhen Zhang,Yufei Lin,Kangsheng Wang,Linuo Xu,Shuyan Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Gliomas are aggressive brain tumors that pose serious health risks. Deep learning aids in lesion segmentation, but CNN and Transformer-based models often lack context modeling or demand heavy computation, limiting real-time use on mobile medical devices. We propose GaMNet, integrating the NMamba module for global modeling and a multi-scale CNN for efficient local feature extraction. To improve interpretability and mimic the human visual system, we apply Gabor filters at multiple scales. Our method achieves high segmentation accuracy with fewer parameters and faster computation. Extensive experiments show GaMNet outperforms existing methods, notably reducing false positives and negatives, which enhances the reliability of clinical diagnosis.
zh
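
多尺度 Gabor 滤波可以直接用 OpenCV 的 getGaborKernel 勾勒(草图:尺度、方向数与核参数均为假设取值,非论文原始配置):

```python
import cv2
import numpy as np

def gabor_bank_responses(gray, scales=(4.0, 8.0), orientations=4):
    """多尺度多方向 Gabor 滤波响应的示意(草图,参数为假设):
    GaMNet 用类似的多尺度 Gabor 滤波模拟人类视觉并提升可解释性。"""
    responses = []
    for lambd in scales:                         # 波长控制尺度
        for k in range(orientations):            # 均匀分布的方向
            theta = k * np.pi / orientations
            kern = cv2.getGaborKernel((31, 31), sigma=lambd / 2,
                                      theta=theta, lambd=lambd,
                                      gamma=0.5, psi=0)
            responses.append(cv2.filter2D(gray, cv2.CV_32F, kern))
    return np.stack(responses, axis=0)           # (尺度x方向, H, W)

demo = (np.random.rand(64, 64) * 255).astype(np.float32)
print(gabor_bank_responses(demo).shape)          # (8, 64, 64)
```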

[CV-64] Real-Time Privacy Preservation for Robot Visual Perception

【速读】:该论文试图解决机器人在基于实时视频流操作时,隐私敏感对象(如个人标识符)可能被无意中暴露的问题。现有方法在保证所有敏感对象完全隐藏方面存在不足,且缺乏对实时视频流的有效处理能力。解决方案的关键在于提出一种隐私约束视频流方法(Privacy-Constrained Video Streaming, PCVS),该方法通过逻辑规范约束隐私敏感对象的存在,并利用检测模型评估每一帧中的敏感对象,随后对部分对象进行模糊处理以满足规范。此外,该方法引入了置信预测技术,以建立并动态更新敏感对象存在的理论下限概率,从而确保隐私保护的可靠性与有效性。

链接: https://arxiv.org/abs/2505.05519
作者: Minkyu Choi,Yunhao Yang,Neel P. Bhatt,Kushagra Gupta,Sahil Shah,Aditya Rai,David Fridovich-Keil,Ufuk Topcu,Sandeep P. Chinchali
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Many robots (e.g., iRobot’s Roomba) operate based on visual observations from live video streams, and such observations may inadvertently include privacy-sensitive objects, such as personal identifiers. Existing approaches for preserving privacy rely on deep learning models, differential privacy, or cryptography. They lack guarantees for the complete concealment of all sensitive objects. Guaranteeing concealment requires post-processing techniques and thus is inadequate for real-time video streams. We develop a method for privacy-constrained video streaming, PCVS, that conceals sensitive objects within real-time video streams. PCVS takes a logical specification constraining the existence of privacy-sensitive objects, e.g., never show faces when a person exists. It uses a detection model to evaluate the existence of these objects in each incoming frame. Then, it blurs out a subset of objects such that the existence of the remaining objects satisfies the specification. We then propose a conformal prediction approach to (i) establish a theoretical lower bound on the probability of the existence of these objects in a sequence of frames satisfying the specification and (ii) update the bound with the arrival of each subsequent frame. Quantitative evaluations show that PCVS achieves over 95 percent specification satisfaction rate in multiple datasets, significantly outperforming other methods. The satisfaction rate is consistently above the theoretical bounds across all datasets, indicating that the established bounds hold. Additionally, we deploy PCVS on robots in real-time operation and show that the robots operate normally without being compromised when PCVS conceals objects.
zh
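
PCVS 中"逐帧检测后按规范模糊"的执行逻辑可以简化为如下草图(规范内容、detections 的数据格式与模糊核大小均为示意性假设;真实系统还包含基于保形预测的概率下界):

```python
import cv2
import numpy as np

def enforce_spec(frame, detections):
    """隐私规范 "存在行人则绝不显示人脸" 的最简执行草图
    (detections 的格式与规范内容均为示意性假设)。"""
    person_exists = any(d["label"] == "person" for d in detections)
    if person_exists:
        for d in detections:
            if d["label"] == "face":
                x, y, w, h = d["box"]
                roi = frame[y:y + h, x:x + w]
                frame[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (15, 15), 0)
    return frame

frame = np.zeros((120, 160, 3), dtype=np.uint8)
dets = [{"label": "person", "box": (10, 10, 60, 100)},
        {"label": "face", "box": (20, 15, 24, 24)}]
out = enforce_spec(frame, dets)
print(out.shape)
```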

[CV-65] Web2Grasp: Learning Functional Grasps from Web Images of Hand-Object Interactions

【速读】:该论文旨在解决多指机器人手进行功能性抓取(functional grasp)的难题,传统方法要么专注于力量抓取(power grasp),即仅保持物体静止,要么依赖昂贵的遥操作机器人演示来教授机器人如何功能性抓取。解决方案的关键在于从网络图像中提取人类抓取信息,这些图像展示了自然且功能性的物体交互,从而避免了对精心策划演示数据的依赖。通过从RGB图像重建人机交互(HOI)的3D网格,并将人类手部动作迁移至多指机器人手,同时对噪声物体网格进行校准,利用低成本网络来源的相对低质量HOI数据训练功能性抓取模型。此外,结合IsaacGym模拟器生成物理可行且保留功能性的抓取策略,进一步扩展了已见与未见物体的抓取数据集。

链接: https://arxiv.org/abs/2505.05517
作者: Hongyi Chen,Yunchao Yao,Yufei Ye,Zhixuan Xu,Homanga Bharadhwaj,Jiashun Wang,Shubham Tulsiani,Zackory Erickson,Jeffrey Ichnowski
机构: Carnegie Mellon University (卡内基梅隆大学); Stanford University (斯坦福大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Functional grasp is essential for enabling dexterous multi-finger robot hands to manipulate objects effectively. However, most prior work either focuses on power grasping, which simply involves holding an object still, or relies on costly teleoperated robot demonstrations to teach robots how to grasp each object functionally. Instead, we propose extracting human grasp information from web images since they depict natural and functional object interactions, thereby bypassing the need for curated demonstrations. We reconstruct human hand-object interaction (HOI) 3D meshes from RGB images, retarget the human hand to multi-finger robot hands, and align the noisy object mesh with its accurate 3D shape. We show that these relatively low-quality HOI data from inexpensive web sources can effectively train a functional grasping model. To further expand the grasp dataset for seen and unseen objects, we use the initially-trained grasping policy with web data in the IsaacGym simulator to generate physically feasible grasps while preserving functionality. We train the grasping model on 10 object categories and evaluate it on 9 unseen objects, including challenging items such as syringes, pens, spray bottles, and tongs, which are underrepresented in existing datasets. The model trained on the web HOI dataset achieves a 75.8% success rate on seen objects and 61.8% across all objects in simulation, with a 6.7% improvement in success rate and a 1.8x increase in functionality ratings over baselines. Simulator-augmented data further boosts performance from 61.8% to 83.4%. The sim-to-real transfer to the LEAP Hand achieves an 85% success rate. Project website is at: this https URL.
zh

[CV-66] Exploring Convolutional Neural Networks for Rice Grain Classification: An Explainable AI Approach

【速读】:该论文试图解决人工检测和分类水稻粒质量的劳动密集型、耗时且易出错的问题,以实现高效、准确的水稻品种分类。解决方案的关键在于提出一种基于卷积神经网络(Convolutional Neural Network, CNN)的自动框架,通过严格的训练与验证,实现了高精度的分类性能,并结合LIME(Local Interpretable Model-agnostic Explanations)和SHAP(SHapley Additive exPlanations)等可解释性技术,提升了模型决策过程的透明度与可信度。

链接: https://arxiv.org/abs/2505.05513
作者: Muhammad Junaid Asif,Hamza Khan,Rabia Tehseen,Syed Tahir Hussain Rizvi,Mujtaba Asad,Shazia Saqib,Rana Fayyaz Ahmad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Rice is an essential staple food worldwide that is important in promoting international trade, economic growth, and nutrition. Asian countries such as China, India, Pakistan, Thailand, Vietnam, and Indonesia are notable for their significant contribution to the cultivation and utilization of rice. These nations are also known for cultivating different rice grains, including short and long grains. These sizes are further classified as basmati, jasmine, kainat saila, ipsala, arborio, etc., catering to diverse culinary preferences and cultural traditions. For both local and international trade, inspecting and maintaining the quality of rice grains to satisfy customers and preserve a country’s reputation is necessary. Manual quality checking and classification is a laborious and time-consuming process that is also highly prone to mistakes. Therefore, an automatic solution must be proposed for the effective and efficient classification of different varieties of rice grains. This research paper presents an automatic framework based on a convolutional neural network (CNN) for classifying different varieties of rice grains. We evaluated the proposed model based on performance metrics such as accuracy, recall, precision, and F1-Score. The CNN model underwent rigorous training and validation, achieving a remarkable accuracy rate and a perfect area under each class’s Receiver Operating Characteristic (ROC) curve. The confusion matrix analysis confirmed the model’s effectiveness in distinguishing between the different rice varieties, indicating minimal misclassifications. Additionally, the integration of explainability techniques such as LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) provided valuable insights into the model’s decision-making process, revealing how specific features of the rice grains influenced classification outcomes.
zh
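
摘要中提到的 LIME 解释流程大致如下(草图:classifier_fn 用一个可运行的占位函数代替训练好的 CNN,图像与采样数等参数仅为演示取值):

```python
import numpy as np
from lime import lime_image

def classifier_fn(images):
    """占位分类器:实际应为训练好的 CNN 的预测函数
    (此处用图像均值构造两类伪概率,仅保证示例可运行)。"""
    m = images.mean(axis=(1, 2, 3)) / 255.0
    return np.stack([m, 1 - m], axis=1)

explainer = lime_image.LimeImageExplainer()
rice_img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
explanation = explainer.explain_instance(
    rice_img, classifier_fn, top_labels=2, hide_color=0, num_samples=200)
# 取对预测类贡献最大的超像素区域,可视化模型依据的米粒特征
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5)
print(mask.shape)
```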

[CV-67] Occupancy World Model for Robots

【速读】:该论文旨在解决室内场景中3D占用场景演化预测的问题,现有方法主要聚焦于室外结构化道路场景,而忽略了对室内机器人场景的探索。其解决方案的关键在于提出一种基于结合时空感受野和引导自回归Transformer的占用世界模型(RoboOccWorld),其中包含条件因果状态注意力(CCSA)机制,利用下一状态的相机位姿作为条件来指导自回归Transformer适应和理解室内机器人场景,并通过混合时空聚合(HSTA)有效融合历史观测中的时空线索,从而提升室内3D占用场景演化预测的性能。

链接: https://arxiv.org/abs/2505.05512
作者: Zhang Zhang,Qiang Zhang,Wei Cui,Shuai Shi,Yijie Guo,Gang Han,Wen Zhao,Jingkai Sun,Jiahang Cao,Jiaxu Wang,Hao Cheng,Xiaozhu Ju,Zhengping Che,Renjing Xu,Jian Tang
机构: Beijing Innovation Center of Humanoid Robotics (北京人形机器人创新中心); Beijing Institute of Technology (北京理工大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Understanding and forecasting scene evolutions deeply affects the exploration and decisions of embodied agents. While traditional methods simulate scene evolutions through trajectory prediction of potential instances, current works use the occupancy world model as a generative framework for describing fine-grained overall scene dynamics. However, existing methods concentrate on outdoor structured road scenes, leaving the forecasting of 3D occupancy scene evolutions for robots in indoor scenes largely unexplored. In this work, we explore a new framework for learning the scene evolutions of observed fine-grained occupancy and propose an occupancy world model based on a combined spatio-temporal receptive field and a guided autoregressive transformer to forecast scene evolutions, called RoboOccWorld. We propose the Conditional Causal State Attention (CCSA), which utilizes camera poses of the next state as conditions to guide the autoregressive transformer to adapt to and understand indoor robotics scenarios. In order to effectively exploit the spatio-temporal cues from historical observations, Hybrid Spatio-Temporal Aggregation (HSTA) is proposed to obtain the combined spatio-temporal receptive field based on multi-scale spatio-temporal windows. In addition, we restructure the OccWorld-ScanNet benchmark based on local annotations to facilitate the evaluation of the indoor 3D occupancy scene evolution prediction task. Experimental results demonstrate that our RoboOccWorld outperforms state-of-the-art methods in the indoor 3D occupancy scene evolution prediction task. The code will be released soon.
zh

[CV-68] How to Train Your Metamorphic Deep Neural Network

【速读】:该论文旨在解决神经网络压缩中模型宽度和深度可变性不足的问题,特别是原始NeuMeta方法仅适用于底层模型的最后几层,限制了其在更广泛场景中的应用。解决方案的关键在于提出一种训练算法,通过分块增量训练、隐式神经表示(Implicit Neural Representation, INR)初始化以及批量归一化替换策略,实现对整个网络的形态变换,从而在保持较高准确率的前提下支持不同压缩比的模型生成。

链接: https://arxiv.org/abs/2505.05510
作者: Thomas Sommariva,Simone Calderara,Angelo Porrello
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:Neural Metamorphosis (NeuMeta) is a recent paradigm for generating neural networks of varying width and depth. Based on Implicit Neural Representation (INR), NeuMeta learns a continuous weight manifold, enabling the direct generation of compressed models, including those with configurations not seen during training. While promising, the original formulation of NeuMeta proves effective only for the final layers of the underlying model, limiting its broader applicability. In this work, we propose a training algorithm that extends the capabilities of NeuMeta to enable full-network metamorphosis with minimal accuracy degradation. Our approach follows a structured recipe comprising block-wise incremental training, INR initialization, and strategies for replacing batch normalization. The resulting metamorphic networks maintain competitive accuracy across a wide range of compression ratios, offering a scalable solution for adaptable and efficient deployment of deep models. The code is available at: this https URL.
zh
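
NeuMeta"以 INR 表示连续权重流形"的核心想法可以最简化为下面的草图(坐标设计与网络结构均为假设,仅示意从配置坐标生成任意宽度层的权重):

```python
import torch
import torch.nn as nn

class WeightINR(nn.Module):
    """NeuMeta 式权重流形的极简示意(草图):一个 INR 把
    (目标宽度比例, 层号, 行坐标, 列坐标) 映射为对应位置的权重值。
    坐标归一化方式与网络规模均为假设。"""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def generate_layer(self, width_ratio, layer_id, out_dim, in_dim):
        r = torch.linspace(0, 1, out_dim)
        c = torch.linspace(0, 1, in_dim)
        rr, cc = torch.meshgrid(r, c, indexing="ij")
        coords = torch.stack([
            torch.full_like(rr, width_ratio),
            torch.full_like(rr, float(layer_id)),
            rr, cc], dim=-1)
        return self.net(coords).squeeze(-1)    # (out_dim, in_dim) 的权重矩阵

inr = WeightINR()
print(inr.generate_layer(0.5, 0, 8, 16).shape)  # 生成压缩到一半宽度的层权重
```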

[CV-69] Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation

【速读】:该论文试图解决文本到3D生成模型在处理具有复杂属性物体时存在的语义理解偏差和部分遮挡生成问题。其关键解决方案是提出一种自动化方法Hierarchical-Chain-of-Generation (HCoG),该方法利用大语言模型将长描述分解为表示不同物体部分的块,并根据遮挡情况从内向外排序形成层次化生成链,通过目标区域定位与3D高斯核优化实现属性精确绑定,同时引入高斯扩展与标签消除机制,确保新部分的无缝生成且不干扰已有优化部分。

链接: https://arxiv.org/abs/2505.05505
作者: Yiming Qin,Zhu Xu,Yang Liu
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Project page here: this https URL

点击查看摘要

Abstract:Recent text-to-3D models can render high-quality assets, yet they still stumble on objects with complex attributes. The key obstacles are: (1) existing text-to-3D approaches typically lift text-to-image models to extract semantics via text encoders, while the text encoder exhibits limited comprehension ability for long descriptions, leading to deviated cross-attention focus and subsequently wrong attribute binding in generated results. (2) Occluded object parts demand a disciplined generation order and explicit part disentanglement. Though some works introduce manual efforts to alleviate the above issues, their quality is unstable and highly reliant on manual information. To tackle the above problems, we propose an automated method, Hierarchical-Chain-of-Generation (HCoG). It leverages a large language model to decompose the long description into blocks representing different object parts, and orders them from inside out according to occlusions, forming a hierarchical chain. Within each block we first coarsely create components, then precisely bind attributes via target-region localization and corresponding 3D Gaussian kernel optimization. Between blocks, we introduce Gaussian Extension and Label Elimination to seamlessly generate new parts by extending new Gaussian kernels, re-assigning semantic labels, and eliminating unnecessary kernels, ensuring that only relevant parts are added without disrupting previously optimized parts. Experiments confirm that HCoG yields structurally coherent, attribute-faithful 3D objects with complex attributes. The code is available at this https URL.
zh

[CV-70] Preliminary Explorations with GPT-4o(mni) Native Image Generation

【速读】:该论文试图评估GPT-4o在多种任务中的表现,以探索其在视觉生成方面的能力。解决方案的关键在于构建一个任务分类体系,并设计了一组精心挑选的测试样本,以进行全面的定性测试。通过这一方法,研究者能够系统地评估GPT-4o在六个任务类别中的性能,包括传统图像生成、判别任务、基于知识的生成、基于常识的生成、空间感知图像生成以及时间感知图像生成,从而深入分析其对现实世界概念的理解能力。

链接: https://arxiv.org/abs/2505.05501
作者: Pu Cao,Feng Zhou,Junyi Ji,Qingye Kong,Zhixiang Lv,Mingjian Zhang,Xuekun Zhao,Siqi Wu,Yinghui Lin,Qing Song,Lu Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Recently, the visual generation ability of GPT-4o(mni) has been unlocked by OpenAI. It demonstrates a very remarkable generation capability with excellent multimodal condition understanding and varied task instructions. In this paper, we aim to explore the capabilities of GPT-4o across various tasks. Inspired by previous studies, we constructed a task taxonomy along with a carefully curated set of test samples to conduct a comprehensive qualitative test. Benefiting from GPT-4o’s powerful multimodal comprehension, its image-generation process demonstrates abilities surpassing those of traditional image-generation tasks. Thus, regarding the dimensions of model capabilities, we evaluate its performance across six task categories: traditional image generation tasks, discriminative tasks, knowledge-based generation, commonsense-based generation, spatially-aware image generation, and temporally-aware image generation. These tasks not only assess the quality and conditional alignment of the model’s outputs but also probe deeper into GPT-4o’s understanding of real-world concepts. Our results reveal that GPT-4o performs impressively well in general-purpose synthesis tasks, showing strong capabilities in text-to-image generation, visual stylization, and low-level image processing. However, significant limitations remain in its ability to perform precise spatial reasoning, instruction-grounded generation, and consistent temporal prediction. Furthermore, when faced with knowledge-intensive or domain-specific scenarios, such as scientific illustrations or mathematical plots, the model often exhibits hallucinations, factual errors, or structural inconsistencies. These findings suggest that while GPT-4o marks a substantial advancement in unified multimodal generation, there is still a long way to go before it can be reliably applied to professional or safety-critical domains.
zh

[CV-71] Learning 3D Persistent Embodied World Models

【速读】:该论文试图解决智能具身代理在复杂环境中进行长期一致性规划的问题,现有方法由于缺乏对未被当前观测图像覆盖场景的记忆,导致其模拟能力具有短视性。解决方案的关键在于引入一种具有显式记忆的持久性具身世界模型,通过将生成的视频内容聚合到一个持久的3D环境地图中,并利用该空间地图作为视频模型的条件输入,从而实现对已见和未见区域的准确模拟,提升代理的长期规划与策略学习能力。

链接: https://arxiv.org/abs/2505.05495
作者: Siyuan Zhou,Yilun Du,Yuncong Yang,Lei Han,Peihao Chen,Dit-Yan Yeung,Chuang Gan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The ability to simulate the effects of future actions on the world is a crucial ability of intelligent embodied agents, enabling agents to anticipate the effects of their actions and make plans accordingly. While a large body of existing work has explored how to construct such world models using video models, they are often myopic in nature, without any memory of a scene not captured by currently observed images, preventing agents from making consistent long-horizon plans in complex environments where many parts of the scene are partially observed. We introduce a new persistent embodied world model with an explicit memory of previously generated content, enabling much more consistent long-horizon simulation. During generation time, our video diffusion model predicts RGB-D video of the future observations of the agent. This generation is then aggregated into a persistent 3D map of the environment. By conditioning the video model on this 3D spatial map, we illustrate how this enables video world models to faithfully simulate both seen and unseen parts of the world. Finally, we illustrate the efficacy of such a world model in downstream embodied applications, enabling effective planning and policy learning.
zh
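
"把生成的 RGB-D 观测累计进持久化 3D 地图"这一步可以用反投影加体素计数来勾勒(草图:相机内参、位姿与体素大小均为假设,真实方法作用于视频扩散模型输出的 RGB-D 帧):

```python
import numpy as np

def accumulate_depth(depth, K, T_wc, voxel_map, voxel=0.1):
    """将一帧(生成的)深度图反投影并累计进持久化体素地图的示意
    (草图:相机内参 K、位姿 T_wc 与体素大小均为假设)。"""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]        # 像素 -> 相机坐标
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)
    pts_world = (T_wc @ pts_cam)[:3].T              # 相机 -> 世界坐标
    for key in map(tuple, np.floor(pts_world / voxel).astype(int)):
        voxel_map[key] = voxel_map.get(key, 0) + 1  # 计数式占用累计
    return voxel_map

K = np.array([[100.0, 0, 32], [0, 100.0, 24], [0, 0, 1]])
vm = accumulate_depth(np.full((48, 64), 2.0), K, np.eye(4), {})
print(len(vm), "occupied voxels")
```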

[CV-72] DetoxAI: a Python Toolkit for Debiasing Deep Learning Models in Computer Vision

【速读】:该论文试图解决深度学习视觉分类任务中机器学习公平性不足的问题(machine learning fairness),尤其是在现有解决方案主要针对表格数据而对视觉任务适应性较差的背景下。其关键解决方案是提出DetoxAI,一个开源的Python库,通过后处理去偏技术(post-hoc debiasing)提升深度学习视觉分类器的公平性,该库实现了先进的去偏算法、公平性度量和可视化工具,支持通过干预内部表示进行去偏,并提供基于属性的可视化和量化公平性指标以展示偏见的缓解效果。

链接: https://arxiv.org/abs/2505.05492
作者: Ignacy Stępka,Lukasz Sztukiewicz,Michał Wiliński,Jerzy Stefanowski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While machine learning fairness has made significant progress in recent years, most existing solutions focus on tabular data and are poorly suited for vision-based classification tasks, which rely heavily on deep learning. To bridge this gap, we introduce DetoxAI, an open-source Python library for improving fairness in deep learning vision classifiers through post-hoc debiasing. DetoxAI implements state-of-the-art debiasing algorithms, fairness metrics, and visualization tools. It supports debiasing via interventions in internal representations and includes attribution-based visualization tools and quantitative algorithmic fairness metrics to show how bias is mitigated. This paper presents the motivation, design, and use cases of DetoxAI, demonstrating its tangible value to engineers and researchers.
zh

[CV-73] MDDFNet: Mamba-based Dynamic Dual Fusion Network for Traffic Sign Detection

【速读】:该论文旨在解决小目标检测,尤其是交通标志检测中的两个主要问题:特征提取过于单一以及检测过程难以有效处理不同尺寸或尺度的目标。解决方案的关键在于提出一种新型的目标检测网络——基于Mamba的动态双融合网络(MDDFNet),该网络通过集成动态双融合模块和基于Mamba的主干网络来同时应对上述问题。动态双融合模块利用多分支结构整合多种空间和语义信息,从而增强特征多样性;而基于Mamba的主干网络则通过全局特征融合与局部特征交互,以自适应方式结合特征,生成独特的分类特性。

链接: https://arxiv.org/abs/2505.05491
作者: TianYi Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The detection of small objects, especially traffic signs, is a critical sub-task in object detection and autonomous driving. Despite significant progress in previous research, two main challenges remain. First, feature extraction tends to be overly homogeneous. Second, the detection process struggles to effectively handle objects of varying sizes or scales. These problems are also prevalent in general object detection tasks. To address these challenges, we propose a novel object detection network, Mamba-based Dynamic Dual Fusion Network (MDDFNet), for traffic sign detection. The network integrates a dynamic dual fusion module and a Mamba-based backbone to simultaneously tackle the aforementioned issues. Specifically, the dynamic dual fusion module utilizes multiple branches to consolidate various spatial and semantic information, thus enhancing feature diversity. The Mamba-based backbone leverages global feature fusion and local feature interaction, combining features in an adaptive manner to generate unique classification characteristics. Extensive experiments conducted on the TT100K (Tsinghua-Tencent 100K) datasets demonstrate that MDDFNet outperforms other state-of-the-art detectors, maintaining real-time processing capabilities of single-stage models while achieving superior performance. This confirms the effectiveness of MDDFNet in detecting small traffic signs.
zh

[CV-74] From Events to Enhancement: A Survey on Event-Based Imaging Technologies

【速读】:该论文试图解决事件相机(event camera)在通用成像应用中如何被有效利用的问题,当前对事件相机最新进展和挑战的系统性研究仍较为缺乏。其解决方案的关键在于首先建立事件传感器的物理模型及其特性基础,随后分析事件数据在图像/视频增强任务中的进展与交互,并进一步探讨通过事件捕捉更丰富光信息的高级任务,如光场估计、多视角生成和测光等,最后提出该领域面临的新挑战与开放性问题,以推动该快速发展的研究方向。

链接: https://arxiv.org/abs/2505.05488
作者: Yunfan Lu,Xiaogang Xu,Pengteng Li,Yusheng Wang,Yi Cui,Huizai Yao,Hui Xiong
机构: HKUST(GZ)(香港科技大学(广州)); CUHK(香港中文大学); University of Tokyo(东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event cameras offering high dynamic range and low latency have emerged as disruptive technologies in imaging. Despite growing research on leveraging these benefits for different imaging tasks, a comprehensive study of recent advances and challenges is still lacking. This limits the broader understanding of how to utilize events in universal imaging applications. In this survey, we first introduce a physical model and the characteristics of different event sensors as the foundation. Following this, we highlight the advancement and interaction of image/video enhancement tasks with events. Additionally, we explore advanced tasks, which capture richer light information with events, e.g., light field estimation, multi-view generation, and photometry. Finally, we discuss new challenges and open questions offering a perspective for this rapidly evolving field. More continuously updated resources are at this link: this https URL
zh
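
综述开篇的事件相机物理模型通常写作:当某像素的对数亮度相对参考值的变化越过对比度阈值 C 时,触发一个带极性的事件。下面给出该标准模型的最简模拟(草图,阈值取值为假设):

```python
import numpy as np

def simulate_events(log_intensity, contrast_threshold=0.2):
    """事件相机物理模型的标准形式(示意):像素对数亮度相对参考值
    的变化每越过一次阈值 C 就触发一个带极性的事件。阈值取值为假设。"""
    T, H, W = log_intensity.shape
    ref = log_intensity[0].copy()        # 每个像素的参考亮度
    events = []                          # (t, y, x, polarity)
    for t in range(1, T):
        diff = log_intensity[t] - ref
        fired = np.abs(diff) >= contrast_threshold
        ys, xs = np.nonzero(fired)
        for y, x in zip(ys, xs):
            events.append((t, y, x, int(np.sign(diff[y, x]))))
            ref[y, x] = log_intensity[t, y, x]   # 触发后更新参考值
    return events

frames = np.log(1 + np.random.rand(5, 8, 8))
print(len(simulate_events(frames)), "events")
```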

[CV-75] Data extraction and processing methods to aid the study of driving behaviors at intersections in naturalistic driving

【速读】:该论文旨在解决自然驾驶研究中从大量车辆数据中自动提取和表征驾驶员在交叉口的头部扫描行为的问题。其关键解决方案是开发定制工具以标记交叉口、同步位置与视频数据,并裁剪出交叉口前后100米范围内的舱内和场景视频;同时,利用自定义的广角头部姿态检测 AI 模型估计驾驶员头部姿态并提取头部扫视,结合YOLO目标检测模型识别交通信号灯、停车标志、行人及其他车辆,并通过规则算法推断交叉口类型、操作及边界。

链接: https://arxiv.org/abs/2505.05487
作者: Shrinivas Pundlik,Seonggyu Choe,Patrick Baker,Chen-Yuan Lee,Naser Al-Madi,Alex R. Bowers,Gang Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 19 pages, 11 figures

点击查看摘要

Abstract:Naturalistic driving studies use devices in participants’ own vehicles to record daily driving over many months. Due to diverse and extensive amounts of data recorded, automated processing is necessary. This report describes methods to extract and characterize driver head scans at intersections from data collected from an in-car recording system that logged vehicle speed, GPS location, scene videos, and cabin videos. Custom tools were developed to mark the intersections, synchronize location and video data, and clip the cabin and scene videos for +/-100 meters from the intersection location. A custom-developed head pose detection AI model for wide-angle head turns was run on the cabin videos to estimate the driver head pose, from which head scans exceeding 20 deg were computed in the horizontal direction. The scene videos were processed using a YOLO object detection model to detect traffic lights, stop signs, pedestrians, and other vehicles on the road. Turning maneuvers were independently detected using vehicle self-motion patterns. Stop lines on the road surface were detected using changing intensity patterns over time as the vehicle moved. The information obtained from processing the scene videos, along with the speed data, was used in a rule-based algorithm to infer the intersection type, maneuver, and bounds. We processed 190 intersections from 3 vehicles driven in cities and suburban areas from Massachusetts and California. The automated video processing algorithm correctly detected intersection signage and maneuvers in 100% and 94% of instances, respectively. The median [IQR] error in detecting vehicle entry into the intersection was 1.1[0.4-4.9] meters and 0.2[0.1-0.54] seconds. The median overlap between ground truth and estimated intersection bounds was 0.88[0.82-0.93].
zh
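
从头部偏航角序列中提取超过 20° 的水平扫视,可以用"单调转动段累计幅度"这一简化规则勾勒(草图:论文实际流程基于专用头部姿态模型,细节更复杂):

```python
import numpy as np

def detect_head_scans(yaw_deg, threshold=20.0):
    """从逐帧头部偏航角序列检测水平方向超过 20 度的头部扫视的示意
    (草图:以单调转动段的累计幅度判定,非论文原始算法)。"""
    scans = []
    start = 0
    for i in range(1, len(yaw_deg)):
        # 转动方向发生反转或序列结束时,结束当前单调段
        if i == len(yaw_deg) - 1 or \
           (yaw_deg[i + 1] - yaw_deg[i]) * (yaw_deg[i] - yaw_deg[i - 1]) < 0:
            if abs(yaw_deg[i] - yaw_deg[start]) > threshold:
                scans.append((start, i))
            start = i
    return scans   # 每个扫视的 (起始帧, 结束帧)

yaw = np.array([0, 5, 12, 25, 30, 22, 10, -15, -25, -20], dtype=float)
print(detect_head_scans(yaw))   # 输出 [(0, 4), (4, 8)]
```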

[CV-76] Topo-VM-UNetV2: Encoding Topology into Vision Mamba UNet for Polyp Segmentation

【速读】:该论文试图解决基于Mamba的结肠息肉分割模型在捕捉拓扑特征(如连通组件、环和空洞)方面的不足,从而导致边界划分不准确的问题。解决方案的关键在于提出一种名为Topo-VM-UNetV2的方法,该方法通过将拓扑特征编码到Mamba-based的先进分割模型VM-UNetV2中,具体包括两个阶段:第一阶段利用概率图生成拓扑注意力图,第二阶段将这些拓扑注意力图整合到语义与细节注入模块中,形成拓扑引导的语义与细节注入模块,以提升分割效果。

链接: https://arxiv.org/abs/2505.06210
作者: Diego Adame,Jose A. Nunez,Fabian Vazquez,Nayeli Gurrola,Huimin Li,Haoteng Tang,Bin Fu,Pengfei Gu
机构: University of Texas Rio Grande Valley (得克萨斯大学里奥格兰德河谷分校); The University of Texas at Dallas (得克萨斯大学达拉斯分校)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Convolutional neural network (CNN) and Transformer-based architectures are two dominant deep learning models for polyp segmentation. However, CNNs have limited capability for modeling long-range dependencies, while Transformers incur quadratic computational complexity. Recently, State Space Models such as Mamba have been recognized as a promising approach for polyp segmentation because they not only model long-range interactions effectively but also maintain linear computational complexity. However, Mamba-based architectures still struggle to capture topological features (e.g., connected components, loops, voids), leading to inaccurate boundary delineation and polyp segmentation. To address these limitations, we propose a new approach called Topo-VM-UNetV2, which encodes topological features into the Mamba-based state-of-the-art polyp segmentation model, VM-UNetV2. Our method consists of two stages: Stage 1: VM-UNetV2 is used to generate probability maps (PMs) for the training and test images, which are then used to compute topology attention maps. Specifically, we first compute persistence diagrams of the PMs, then we generate persistence score maps by assigning persistence values (i.e., the difference between death and birth times) of each topological feature to its birth location, finally we transform persistence scores into attention weights using the sigmoid function. Stage 2: These topology attention maps are integrated into the semantics and detail infusion (SDI) module of VM-UNetV2 to form a topology-guided semantics and detail infusion (Topo-SDI) module for enhancing the segmentation results. Extensive experiments on five public polyp segmentation datasets demonstrate the effectiveness of our proposed method. The code will be made publicly available.
zh
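
Stage 1 的"持续性得分图 → sigmoid 注意力"可以在 0 维(连通分量)情形下用并查集自洽地实现(草图:只演示超水平集过滤下连通分量的生-灭,忽略环/空洞等高维特征;永不消亡的全局分量在此简化为持续性 0):

```python
import numpy as np

def topo_attention(prob_map):
    """对概率图做超水平集过滤,用并查集计算 0 维持续性,把持续值
    (出生-死亡之差)写回出生像素,再经 sigmoid 得到注意力图(示意草图)。"""
    h, w = prob_map.shape
    vals = prob_map.ravel()
    parent = -np.ones(h * w, dtype=int)          # -1 表示尚未进入过滤

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # 路径压缩
            i = parent[i]
        return i

    persistence = np.zeros(h * w)
    for idx in np.argsort(-vals):                # 按概率从高到低加入像素
        parent[idx] = idx
        y, x = divmod(idx, w)
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and parent[ny * w + nx] != -1:
                ra, rb = find(idx), find(ny * w + nx)
                if ra == rb:
                    continue
                # elder rule:出生值较低(较年轻)的分量在当前水平死亡
                young, old = (ra, rb) if vals[ra] <= vals[rb] else (rb, ra)
                persistence[young] = vals[young] - vals[idx]
                parent[young] = old
    score_map = persistence.reshape(h, w)
    return 1.0 / (1.0 + np.exp(-score_map))      # sigmoid -> 注意力权重

pm = np.random.rand(16, 16)
print(topo_attention(pm).shape)
```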

[CV-77] he Application of Deep Learning for Lymph Node Segmentation: A Systematic Review

【速读】:该论文试图解决淋巴结自动分割在癌症早期检测和分期中的准确性不足问题,传统分割方法受限于人工勾画和操作者技术水平的差异,难以实现高精度。其解决方案的关键在于引入深度学习技术,特别是卷积神经网络、编码器-解码器网络和Transformer等架构,以提升医学影像数据的分析能力。研究还探讨了多模态融合、迁移学习及大规模预训练模型等未来方向,旨在克服当前在淋巴结形状多样性、标注数据稀缺以及跨模态泛化能力不足等方面的挑战。

链接: https://arxiv.org/abs/2505.06118
作者: Jingguo Qu,Xinyang Han,Man-Lik Chui,Yao Pu,Simon Takadiyi Gunda,Ziman Chen,Jing Qin,Ann Dorothy King,Winnie Chiu-Wing Chu,Jing Cai,Michael Tin-Cheung Ying
机构: The Hong Kong Polytechnic University (香港理工大学); The Chinese University of Hong Kong (香港中文大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automatic lymph node segmentation is the cornerstone for advances in computer vision tasks for early detection and staging of cancer. Traditional segmentation methods are constrained by manual delineation and variability in operator proficiency, limiting their ability to achieve high accuracy. The introduction of deep learning technologies offers new possibilities for improving the accuracy of lymph node image analysis. This study evaluates the application of deep learning in lymph node segmentation and discusses the methodologies of various deep learning architectures such as convolutional neural networks, encoder-decoder networks, and transformers in analyzing medical imaging data across different modalities. Despite these advancements, the field still confronts challenges such as the shape diversity of lymph nodes, the scarcity of accurately labeled datasets, and the inadequate development of methods that are robust and generalizable across different imaging modalities. To the best of our knowledge, this is the first study that provides a comprehensive overview of the application of deep learning techniques in the lymph node segmentation task. Furthermore, this study also explores potential future research directions, including multimodal fusion techniques, transfer learning, and the use of large-scale pre-trained models to overcome current limitations while enhancing cancer diagnosis and treatment planning strategies.
zh

[CV-78] S2MNet: Speckle-To-Mesh Net for Three-Dimensional Cardiac Morphology Reconstruction via Echocardiogram

【速读】:该论文试图解决传统二维超声心动图在心脏解剖和功能评估中无法提供完整三维信息的问题,以及现有三维超声心动图在分辨率、可用性和成本方面的局限性。其解决方案的关键在于提出一种基于深度学习的框架S2MNet,通过整合常规获取的六张二维超声心动图切片,重建连续且高保真的三维心脏模型,从而克服了训练数据获取困难,并通过引入基于形变场的方法避免了三维重建中的空间不连续或结构伪影问题。

链接: https://arxiv.org/abs/2505.06105
作者: Xilin Gong,Yongkai Chen,Shushan Wu,Fang Wang,Ping Ma,Wenxuan Zhong
机构: University of Georgia (佐治亚大学); Harvard University (哈佛大学); Chinese Academy of Medical Sciences (中国医学科学院); Beijing Hospital (北京医院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Echocardiography is the most commonly used imaging modality in cardiac assessment due to its non-invasive nature, real-time capability, and cost-effectiveness. Despite its advantages, most clinical echocardiograms provide only two-dimensional views, limiting the ability to fully assess cardiac anatomy and function in three dimensions. While three-dimensional echocardiography exists, it often suffers from reduced resolution, limited availability, and higher acquisition costs. To overcome these challenges, we propose a deep learning framework, S2MNet, that reconstructs continuous and high-fidelity 3D heart models by integrating six slices of routinely acquired 2D echocardiogram views. Our method has three advantages. First, our method avoids the difficulty of training data acquisition by simulating six 2D echocardiogram images from the corresponding slices of a given 3D heart mesh. Second, we introduce a deformation field-based method, which avoids spatial discontinuities or structural artifacts in 3D echocardiogram reconstructions. We validate our method using clinically collected echocardiograms and demonstrate that our estimated left ventricular volume, a key clinical indicator of cardiac function, is strongly correlated with the doctor-measured GLPS, a clinical measurement that should demonstrate a negative correlation with LVE in medical theory. This association confirms the reliability of our proposed 3D construction method.
zh

[CV-79] Efficient Quantum Convolutional Neural Networks for Image Classification: Overcoming Hardware Constraints

【速读】:该论文旨在解决在当前噪声中等规模量子(NISQ)设备上实现量子卷积神经网络(QCNN)的挑战,特别是由于硬件限制导致的量子电路复杂性和可扩展性问题。其解决方案的关键在于提出一种编码方案,显著降低输入数据的维度,使得49量子比特的QCNN能够直接处理28×28像素的MNIST图像,无需依赖传统的降维预处理。此外,研究还引入了一个基于表达能力、纠缠度和复杂度特性的自动化框架,用于识别参数化量子电路(PQCs)的构建模块,从而提升模型的准确性和收敛速度。

链接: https://arxiv.org/abs/2505.05957
作者: Peter Röseler,Oliver Schaudt,Helmut Berg,Christian Bauckhage,Matthias Koch
机构: 未知
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While classical convolutional neural networks (CNNs) have revolutionized image classification, the emergence of quantum computing presents new opportunities for enhancing neural network architectures. Quantum CNNs (QCNNs) leverage quantum mechanical properties and hold potential to outperform classical approaches. However, their implementation on current noisy intermediate-scale quantum (NISQ) devices remains challenging due to hardware limitations. In our research, we address this challenge by introducing an encoding scheme that significantly reduces the input dimensionality. We demonstrate that a primitive QCNN architecture with 49 qubits is sufficient to directly process 28\times 28 pixel MNIST images, eliminating the need for classical dimensionality reduction pre-processing. Additionally, we propose an automated framework based on expressibility, entanglement, and complexity characteristics to identify the building blocks of QCNNs, parameterized quantum circuits (PQCs). Our approach demonstrates advantages in accuracy and convergence speed with a similar parameter count compared to both hybrid QCNNs and classical CNNs. We validated our experiments on IBM’s Heron r2 quantum processor, achieving 96.08% classification accuracy, surpassing the 71.74% benchmark of traditional approaches under identical training conditions. These results represent one of the first implementations of image classifications on real quantum hardware and validate the potential of quantum computing in this area.
zh
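The abstract does not spell out the encoding scheme, but the qubit count is suggestive: a 28×28 image tiles exactly into 7×7 = 49 non-overlapping 4×4 patches. A hypothetical patch-averaging angle encoding along those lines might look like the following; the paper's actual scheme may differ.

```python
import numpy as np

def patch_angle_encoding(image: np.ndarray, patch: int = 4) -> np.ndarray:
    """Map a 28x28 image to 49 rotation angles, one per qubit.

    Each non-overlapping 4x4 patch is averaged and rescaled to [0, pi],
    so a 7x7 = 49-qubit register can take the full image directly,
    without a classical dimensionality-reduction step.
    """
    h, w = image.shape
    means = image.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    return means.flatten() * np.pi  # assumes pixel values already in [0, 1]

angles = patch_angle_encoding(np.random.rand(28, 28))
assert angles.shape == (49,)
# each angle would parameterize e.g. an RY rotation on its qubit
```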

[CV-80] Towards order of magnitude X-ray dose reduction in breast cancer imaging using phase contrast and deep denoising

【速读】:该论文试图解决传统乳腺癌筛查方法(如X射线乳腺摄影和数字乳腺断层扫描)在灵敏度、特异性以及患者舒适度方面的局限性,同时探索一种更安全、更高效的成像方式。其解决方案的关键在于采用相位对比计算机断层扫描(Phase-contrast computed tomography, PCT),该技术能够在无需乳腺压缩的情况下提供更高质量的图像,并通过基于深度学习的图像去噪方法进一步将辐射剂量降低至少16倍,而不会影响图像质量。

链接: https://arxiv.org/abs/2505.05812
作者: Ashkan Pakzad,Robert Turnbull,Simon J. Mutch,Thomas A. Leatham,Darren Lockie,Jane Fox,Beena Kumar,Daniel Häsermann,Christopher J. Hall,Anton Maksimenko,Benedicta D. Arhatari,Yakov I. Nesterets,Amir Entezam,Seyedamir T. Taba,Patrick C. Brennan,Timur E. Gureyev,Harry M. Quiney
机构: 未知
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 3 figures, 1 table

点击查看摘要

Abstract:Breast cancer is the most frequently diagnosed human cancer in the United States at present. Early detection is crucial for its successful treatment. X-ray mammography and digital breast tomosynthesis are currently the main methods for breast cancer screening. However, both have known limitations in terms of their sensitivity and specificity to breast cancers, while also frequently causing patient discomfort due to the requirement for breast compression. Breast computed tomography is a promising alternative, however, to obtain high-quality images, the X-ray dose needs to be sufficiently high. As the breast is highly radiosensitive, dose reduction is particularly important. Phase-contrast computed tomography (PCT) has been shown to produce higher-quality images at lower doses and has no need for breast compression. It is demonstrated in the present study that, when imaging full fresh mastectomy samples with PCT, deep learning-based image denoising can further reduce the radiation dose by a factor of 16 or more, without any loss of image quality. The image quality has been assessed both in terms of objective metrics, such as spatial resolution and contrast-to-noise ratio, as well as in an observer study by experienced medical imaging specialists and radiologists. This work was carried out in preparation for live patient PCT breast cancer imaging, initially at specialized synchrotron facilities.
zh

[CV-81] Predicting Diabetic Macular Edema Treatment Responses Using OCT: Dataset and Methods of APTOS Competition

【速读】:该论文旨在解决糖尿病黄斑水肿(Diabetic Macular Edema, DME)患者对玻璃体内治疗反应差异的问题,通过预治疗分层来预测治疗效果,从而实现个性化治疗策略。其解决方案的关键在于利用大规模光学相干断层扫描(OCT)图像数据集,结合人工智能技术,提升抗VEGF治疗响应的预测准确性。

链接: https://arxiv.org/abs/2505.05768
作者: Weiyi Zhang,Peranut Chotcomwongse,Yinwen Li,Pusheng Xu,Ruijie Yao,Lianhao Zhou,Yuxuan Zhou,Hui Feng,Qiping Zhou,Xinyue Wang,Shoujin Huang,Zihao Jin,Florence H.T. Chung,Shujun Wang,Yalin Zheng,Mingguang He,Danli Shi,Paisan Ruamviboonsuk
机构: School of Optometry, The Hong Kong Polytechnic University, Hong Kong; Department of Ophthalmology, College of Medicine, Rangsit University, Rajavithi Hospital, Thailand; Asia Pacific Tele-Ophthalmology Society, c/o State Key Laboratory (Ophthalmology), Zhongshan Ophthalmic Center, Sun Yat-Sen University, China; Duke University, USA; Texas A&M University, USA; Department of Electronic Engineering, Tsinghua University, China; Xidian University, China; Svision Imaging Ltd., China; Research Centre for SHARP Vision (RCSV), The Hong Kong Polytechnic University, Hong Kong; Centre for Eye and Vision Research (CEVR), 17W Hong Kong Science Park, Hong Kong; Department of Ophthalmology and Visual Sciences, the Chinese University of Hong Kong, Hong Kong; Faculty of Engineering, The University of Hong Kong, Hong Kong; Faculty of Engineering, The Hong Kong Polytechnic University, Hong Kong; Faculty of Health and Life Sciences, University of Liverpool, United Kingdom; Chien-Shiung Wu College, Southeast University, China; Shenzhen Technology University, China
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 42 pages,5 tables, 12 figures, challenge report

点击查看摘要

Abstract:Diabetic macular edema (DME) significantly contributes to visual impairment in diabetic patients. Treatment responses to intravitreal therapies vary, highlighting the need for patient stratification to predict therapeutic benefits and enable personalized strategies. To our knowledge, this study is the first to explore pre-treatment stratification for predicting DME treatment responses. To advance this research, we organized the 2nd Asia-Pacific Tele-Ophthalmology Society (APTOS) Big Data Competition in 2021. The competition focused on improving predictive accuracy for anti-VEGF therapy responses using ophthalmic OCT images. We provided a dataset containing tens of thousands of OCT images from 2,000 patients with labels across four sub-tasks. This paper details the competition’s structure, dataset, leading methods, and evaluation metrics. The competition attracted strong scientific community participation, with 170 teams initially registering and 41 reaching the final round. The top-performing team achieved an AUC of 80.06%, highlighting the potential of AI in personalized DME treatment and clinical decision-making.
zh

[CV-82] Hybrid Learning: A Novel Combination of Self-Supervised and Supervised Learning for MRI Reconstruction without High-Quality Training Reference

【速读】:该论文旨在解决深度学习在MRI重建中面临的挑战,即传统监督学习方法需要高质量参考图像,而实际应用中此类图像往往不可用,同时自监督学习在高加速率下性能下降的问题。其解决方案的关键在于提出了一种混合学习(hybrid learning)框架,该框架采用两阶段训练策略,首先利用自监督学习从噪声或欠采样数据中生成改进的图像作为伪真实标签,随后通过监督学习进一步优化重建效果,从而提升图像质量和定量映射的准确性。

链接: https://arxiv.org/abs/2505.05703
作者: Haoyang Pei,Ding Xia,Xiang Xu,William Moore,Yao Wang,Hersh Chandarana,Li Feng
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: Deep learning has demonstrated strong potential for MRI reconstruction, but conventional supervised learning methods require high-quality reference images, which are often unavailable in practice. Self-supervised learning offers an alternative, yet its performance degrades at high acceleration rates. To overcome these limitations, we propose hybrid learning, a novel two-stage training framework that combines self-supervised and supervised learning for robust image reconstruction. Methods: Hybrid learning is implemented in two sequential stages. In the first stage, self-supervised learning is employed to generate improved images from noisy or undersampled reference data. These enhanced images then serve as pseudo-ground truths for the second stage, which uses supervised learning to refine reconstruction performance and support higher acceleration rates. We evaluated hybrid learning in two representative applications: (1) accelerated 0.55T spiral-UTE lung MRI using noisy reference data, and (2) 3D T1 mapping of the brain without access to fully sampled ground truth. Results: For spiral-UTE lung MRI, hybrid learning consistently improved image quality over both self-supervised and conventional supervised methods across different acceleration rates, as measured by SSIM and NMSE. For 3D T1 mapping, hybrid learning achieved superior T1 quantification accuracy across a wide dynamic range, outperforming self-supervised learning in all tested conditions. Conclusions: Hybrid learning provides a practical and effective solution for training deep MRI reconstruction networks when only low-quality or incomplete reference data are available. It enables improved image quality and accurate quantitative mapping across different applications and field strengths, representing a promising technique toward broader clinical deployment of deep learning-based MRI.
zh
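A minimal sketch of the two-stage recipe, with tiny stand-in conv networks and a Noise2Noise-style placeholder for the self-supervised stage (the paper's actual backbones and self-supervised loss are not given in the abstract):

```python
import torch
import torch.nn as nn

# tiny stand-in networks; the paper's reconstruction backbones are unspecified here
stage1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(16, 1, 3, padding=1))
stage2 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(16, 1, 3, padding=1))

noisy_refs = torch.randn(8, 1, 64, 64)                     # low-quality references
undersampled = noisy_refs + 0.3 * torch.randn_like(noisy_refs)

# Stage 1: self-supervised cleanup (Noise2Noise-style pairing as a placeholder)
opt1 = torch.optim.Adam(stage1.parameters(), 1e-3)
for _ in range(50):
    loss = nn.functional.mse_loss(stage1(undersampled), noisy_refs)
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: supervised refinement against stage-1 pseudo-ground truths
pseudo_gt = stage1(noisy_refs).detach()
opt2 = torch.optim.Adam(stage2.parameters(), 1e-3)
for _ in range(50):
    loss = nn.functional.mse_loss(stage2(undersampled), pseudo_gt)
    opt2.zero_grad(); loss.backward(); opt2.step()
```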

[CV-83] Equivariant Imaging Biomarkers for Robust Unsupervised Segmentation of Histopathology

【速读】:该论文旨在解决传统病理学分析中存在的时间消耗大、劳动强度高、成本效率低以及诊断一致性与准确性受限的问题,同时针对现有基于机器学习(ML)模型在处理组织病理图像时对旋转和反射不具备不变性,从而影响其泛化能力的局限性。解决方案的关键在于通过一种新颖的对称卷积核实现无监督分割,从而构建出具有鲁棒性和等变性的组织病理生物标志物,该方法在前列腺组织微阵列(TMA)图像上的验证表明其在旋转对抗性方面优于标准卷积核模型,有望提升数字病理学中机器学习模型的准确性、一致性和鲁棒性。

链接: https://arxiv.org/abs/2505.05689
作者: Fuyao Chen,Yuexi Du,Tal Zeevi,Nicha C. Dvornek,John A. Onofrey
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by MIDL 2025

点击查看摘要

Abstract:Histopathology evaluation of tissue specimens through microscopic examination is essential for accurate disease diagnosis and prognosis. However, traditional manual analysis by specially trained pathologists is time-consuming, labor-intensive, cost-inefficient, and prone to inter-rater variability, potentially affecting diagnostic consistency and accuracy. As digital pathology images continue to proliferate, there is a pressing need for automated analysis to address these challenges. Recent advancements in artificial intelligence-based tools such as machine learning (ML) models, have significantly enhanced the precision and efficiency of analyzing histopathological slides. However, despite their impressive performance, ML models are invariant only to translation, lacking invariance to rotation and reflection. This limitation restricts their ability to generalize effectively, particularly in histopathology, where images intrinsically lack meaningful orientation. In this study, we develop robust, equivariant histopathological biomarkers through a novel symmetric convolutional kernel via unsupervised segmentation. The approach is validated using prostate tissue micro-array (TMA) images from 50 patients in the Gleason 2019 Challenge public dataset. The biomarkers extracted through this approach demonstrate enhanced robustness and generalizability against rotation compared to models using standard convolution kernels, holding promise for enhancing the accuracy, consistency, and robustness of ML models in digital pathology. Ultimately, this work aims to improve diagnostic and prognostic capabilities of histopathology beyond prostate cancer through equivariant imaging.
zh
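The abstract names a "novel symmetric convolutional kernel" without detailing its construction; one standard way to obtain kernels invariant to 90-degree rotations and reflections is to average over the dihedral group D4, sketched below.

```python
import torch

def dihedral_symmetrize(kernel: torch.Tensor) -> torch.Tensor:
    """Average a conv kernel over the 8 elements of the dihedral group D4.

    kernel: (out_ch, in_ch, k, k). The result is invariant to 90-degree
    rotations and reflections, so features computed with it do not depend
    on slide orientation -- one way to build equivariant biomarkers.
    """
    views = []
    for flip in (False, True):
        base = kernel.flip(-1) if flip else kernel
        for r in range(4):
            views.append(torch.rot90(base, r, dims=(2, 3)))
    return torch.stack(views).mean(0)

sym = dihedral_symmetrize(torch.randn(16, 3, 5, 5))
# sanity check: rotating the symmetrized kernel leaves it unchanged
assert torch.allclose(sym, torch.rot90(sym, 1, dims=(2, 3)), atol=1e-6)
```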

[CV-84] V-EfficientNets: Vector-Valued Efficiently Scaled Convolutional Neural Network Models IJCNN2025

【速读】:该论文旨在解决传统卷积神经网络在处理多维向量数据时对通道间关系建模不足的问题,以及在保持高精度的同时减少模型参数的需求。其解决方案的关键在于提出一种新型的向量值EfficientNet(V-EfficientNets),该模型将多维数据视为连贯的整体,从而更好地捕捉通道间的相互关系,并通过优化网络宽度、深度和分辨率来实现高效的参数分配。

链接: https://arxiv.org/abs/2505.05659
作者: Guilherme Vieira Neto,Marcos Eduardo Valle
机构: Universidade Estadual de Campinas (UNICAMP)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at International Joint Conference on Neural Networks (IJCNN 2025)

点击查看摘要

Abstract:EfficientNet models are convolutional neural networks optimized for parameter allocation by jointly balancing network width, depth, and resolution. Renowned for their exceptional accuracy, these models have become a standard for image classification tasks across diverse computer vision benchmarks. While traditional neural networks learn correlations between feature channels during training, vector-valued neural networks inherently treat multidimensional data as coherent entities, taking for granted the inter-channel relationships. This paper introduces vector-valued EfficientNets (V-EfficientNets), a novel extension of EfficientNet designed to process arbitrary vector-valued data. The proposed models are evaluated on a medical image classification task, achieving an average accuracy of 99.46% on the ALL-IDB2 dataset for detecting acute lymphoblastic leukemia. V-EfficientNets demonstrate remarkable efficiency, significantly reducing parameters while outperforming state-of-the-art models, including the original EfficientNet. The source code is available at this https URL.
zh

[CV-85] A New k-Space Model for Non-Cartesian Fourier Imaging

【速读】:该论文试图解决传统基于体素(voxel)的傅里叶成像数据重建方法所面临的计算成本高、收敛速度慢及伪影易出现等问题。其解决方案的关键在于从新的视角重新审视该模型,提出一种基于频域(Fourier-domain)基函数展开的新模型,而非传统的图像域体素基函数方法,从而提升对旧有和新发现限制的鲁棒性,实现更优的图像质量和更低的计算复杂度。

链接: https://arxiv.org/abs/2505.05647
作者: Chin-Cheng Chan,Justin P. Haldar
机构: University of Southern California (南加州大学)
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:For the past several decades, it has been popular to reconstruct Fourier imaging data using model-based approaches that can easily incorporate physical constraints and advanced regularization/machine learning priors. The most common modeling approach is to represent the continuous image as a linear combination of shifted “voxel” basis functions. Although well-studied and widely-deployed, this voxel-based model is associated with longstanding limitations, including high computational costs, slow convergence, and a propensity for artifacts. In this work, we reexamine this model from a fresh perspective, identifying new issues that may have been previously overlooked (including undesirable approximation, periodicity, and nullspace characteristics). Our insights motivate us to propose a new model that is more resilient to the limitations (old and new) of the previous approach. Specifically, the new model is based on a Fourier-domain basis expansion rather than the standard image-domain voxel-based approach. Illustrative results, which are presented in the context of non-Cartesian MRI reconstruction, demonstrate that the new model enables improved image quality (reduced artifacts) and/or reduced computational complexity (faster computations and improved convergence).
zh

[CV-86] UltraGauss: Ultrafast Gaussian Reconstruction of 3D Ultrasound Volumes

【速读】:该论文旨在解决超声成像中二维图像解释依赖操作者、导致结果变异性和认知负担增加的问题,以及现有2D到3D重建方法在计算成本、内存消耗或与超声物理兼容性方面的不足。其解决方案的关键在于提出UltraGauss:首个针对超声的高斯点云渲染框架,通过扩展视图合成技术以适应超声波传播特性,采用三维探头平面交点建模,结合高效的光栅化边界公式和数值稳定的协方差参数化,从而提升计算效率与重建精度。

链接: https://arxiv.org/abs/2505.05643
作者: Mark C. Eid,Ana I.L. Namburete,João F. Henriques
机构: University of Oxford (牛津大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:Ultrasound imaging is widely used due to its safety, affordability, and real-time capabilities, but its 2D interpretation is highly operator-dependent, leading to variability and increased cognitive demand. 2D-to-3D reconstruction mitigates these challenges by providing standardized volumetric views, yet existing methods are often computationally expensive, memory-intensive, or incompatible with ultrasound physics. We introduce UltraGauss: the first ultrasound-specific Gaussian Splatting framework, extending view synthesis techniques to ultrasound wave propagation. Unlike conventional perspective-based splatting, UltraGauss models probe-plane intersections in 3D, aligning with acoustic image formation. We derive an efficient rasterization boundary formulation for GPU parallelization and introduce a numerically stable covariance parametrization, improving computational efficiency and reconstruction accuracy. On real clinical ultrasound data, UltraGauss achieves state-of-the-art reconstructions in 5 minutes, and reaching 0.99 SSIM within 20 minutes on a single GPU. A survey of expert clinicians confirms UltraGauss’ reconstructions are the most realistic among competing methods. Our CUDA implementation will be released upon publication.
zh
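The probe-plane intersection at the heart of the method can be made concrete: substituting the plane parameterization x(t) = o + U t into a 3D Gaussian's quadratic form yields an unnormalized 2D Gaussian in plane coordinates. A minimal numpy sketch follows; the paper's full rasterization boundary formulation and stable covariance parameterization are more involved.

```python
import numpy as np

def gaussian_on_plane(mu, Sigma, origin, U):
    """Restrict a 3D Gaussian to the probe imaging plane.

    The plane is parameterized as x(t) = origin + U @ t with U a (3, 2)
    orthonormal basis. Substituting into the 3D quadratic form gives an
    unnormalized 2D Gaussian: precision A and centre t_star in plane coords.
    """
    P = np.linalg.inv(Sigma)          # 3D precision matrix
    A = U.T @ P @ U                   # 2D precision on the plane
    b = U.T @ P @ (mu - origin)
    t_star = np.linalg.solve(A, b)    # in-plane centre of the splat
    return A, t_star

mu = np.array([0.2, 0.1, 0.5])
Sigma = np.diag([0.04, 0.02, 0.09])
origin = np.zeros(3)
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # the z = 0 probe plane
A, t_star = gaussian_on_plane(mu, Sigma, origin, U)
```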

[CV-87] Score-based Self-supervised MRI Denoising

【速读】:该论文旨在解决磁共振成像(MRI)中由于加速和/或低场采集导致的噪声污染问题,该问题会显著降低图像质量和诊断准确性。传统基于监督学习的去噪方法虽然效果显著,但需要高信噪比(SNR)标签,而这些标签通常难以获取。为应对标签稀缺问题,研究提出了一种基于分数的自监督框架——Corruption2Self (C2S),其关键在于广义去噪分数匹配(GDSM)损失,该损失通过建模更高SNR图像在进一步污染观测下的条件期望,使模型能够直接从噪声数据中学习多噪声级别的去噪能力。此外,C2S通过引入噪声水平的重参数化和细节细化扩展,提升了训练稳定性与细部特征的保留能力。

链接: https://arxiv.org/abs/2505.05631
作者: Jiachen Tu,Yaokun Shi,Fan Lam
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Magnetic resonance imaging (MRI) is a powerful noninvasive diagnostic imaging tool that provides unparalleled soft tissue contrast and anatomical detail. Noise contamination, especially in accelerated and/or low-field acquisitions, can significantly degrade image quality and diagnostic accuracy. Supervised learning based denoising approaches have achieved impressive performance but require high signal-to-noise ratio (SNR) labels, which are often unavailable. Self-supervised learning holds promise to address the label scarcity issue, but existing self-supervised denoising methods tend to oversmooth fine spatial features and often yield inferior performance than supervised methods. We introduce Corruption2Self (C2S), a novel score-based self-supervised framework for MRI denoising. At the core of C2S is a generalized denoising score matching (GDSM) loss, which extends denoising score matching to work directly with noisy observations by modeling the conditional expectation of higher-SNR images given further corrupted observations. This allows the model to effectively learn denoising across multiple noise levels directly from noisy data. Additionally, we incorporate a reparameterization of noise levels to stabilize training and enhance convergence, and introduce a detail refinement extension to balance noise reduction with the preservation of fine spatial features. Moreover, C2S can be extended to multi-contrast denoising by leveraging complementary information across different MRI contrasts. We demonstrate that our method achieves state-of-the-art performance among self-supervised methods and competitive results compared to supervised counterparts across varying noise conditions and MRI contrasts on the M4Raw and fastMRI dataset.
zh
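Stripped of the paper's noise-level reparameterization and detail-refinement extension, the core training signal is "corrupt the corrupted data further and regress back." A Noisier2Noise-flavored sketch under a simple Gaussian-noise assumption:

```python
import torch
import torch.nn as nn

def c2s_step(model: nn.Module, noisy: torch.Tensor, sigma_extra: float):
    """One Corruption2Self-style training step (simplified sketch).

    The noisy MRI observation is corrupted further with Gaussian noise;
    the network is trained so that, given the *more* corrupted input, it
    predicts the conditional expectation of the higher-SNR image -- here
    approximated by regressing back to the original observation.
    """
    further = noisy + sigma_extra * torch.randn_like(noisy)
    return nn.functional.mse_loss(model(further), noisy)

model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 3, padding=1))
opt = torch.optim.Adam(model.parameters(), 1e-3)
noisy_batch = torch.randn(4, 1, 64, 64)
for _ in range(50):
    loss = c2s_step(model, noisy_batch, sigma_extra=0.2)
    opt.zero_grad(); loss.backward(); opt.step()
```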

[CV-88] Guidance for Intra-cardiac Echocardiography Manipulation to Maintain Continuous Therapy Device Tip Visibility

【速读】:该论文旨在解决在经导管心脏介入手术中,由于手动调整ICE导管导致治疗设备尖端难以保持持续可视化的技术难题。其解决方案的关键在于提出一种基于人工智能的跟踪模型,该模型能够估计设备尖端的入射角度和通过点,从而确保连续可视化并支持机器人控制ICE导管。该方法的核心创新是采用混合数据集生成策略,结合临床ICE序列与合成数据增强,以提高模型的鲁棒性。

链接: https://arxiv.org/abs/2505.05518
作者: Jaeyoung Huh,Ankur Kapoor,Young-Ho Kim
机构: Siemens Healthineers(西门子医疗); Princeton, NJ, USA(美国新泽西州普林斯顿)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Intra-cardiac Echocardiography (ICE) plays a critical role in Electrophysiology (EP) and Structural Heart Disease (SHD) interventions by providing real-time visualization of intracardiac structures. However, maintaining continuous visibility of the therapy device tip remains a challenge due to frequent adjustments required during manual ICE catheter manipulation. To address this, we propose an AI-driven tracking model that estimates the device tip incident angle and passing point within the ICE imaging plane, ensuring continuous visibility and facilitating robotic ICE catheter control. A key innovation of our approach is the hybrid dataset generation strategy, which combines clinical ICE sequences with synthetic data augmentation to enhance model robustness. We collected ICE images in a water chamber setup, equipping both the ICE catheter and device tip with electromagnetic (EM) sensors to establish precise ground-truth locations. Synthetic sequences were created by overlaying catheter tips onto real ICE images, preserving motion continuity while simulating diverse anatomical scenarios. The final dataset consists of 5,698 ICE-tip image pairs, ensuring comprehensive training coverage. Our model architecture integrates a pretrained ultrasound (US) foundation model, trained on 37.4M echocardiography images, for feature extraction. A transformer-based network processes sequential ICE frames, leveraging historical passing points and incident angles to improve prediction accuracy. Experimental results demonstrate that our method achieves 3.32 degree entry angle error, 12.76 degree rotation angle error. This AI-driven framework lays the foundation for real-time robotic ICE catheter adjustments, minimizing operator workload while ensuring consistent therapy device visibility. Future work will focus on expanding clinical datasets to further enhance model generalization.
zh


[CV-89] StereoINR: Cross-View Geometry Consistent Stereo Super Resolution with Implicit Neural Representation

【速读】:该论文旨在解决立体图像超分辨率(Stereo Image Super-Resolution, SSR)中现有上采样方法(如像素洗牌)忽视跨视角几何一致性以及仅限于固定尺度上采样的问题。其关键解决方案是提出 Stereo Implicit Neural Representation (StereoINR),通过将立体图像对建模为连续隐式表示,突破了尺度限制,并实现了任意尺度的立体图像超分辨率重建。同时,通过引入空间变形和跨注意力机制,StereoINR增强了跨视角信息融合能力,显著提升了像素级几何一致性。

链接: https://arxiv.org/abs/2505.05509
作者: Yi Liu,Xinyi Liu,Panwang Xia,Qiong Wu,Yi Wan,Yongjun Zhang
机构: Wuhan University (武汉大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Stereo image super-resolution (SSR) aims to enhance high-resolution details by leveraging information from stereo image pairs. However, existing stereo super-resolution (SSR) upsampling methods (e.g., pixel shuffle) often overlook cross-view geometric consistency and are limited to fixed-scale upsampling. The key issue is that previous upsampling methods use convolution to independently process deep features of different views, lacking cross-view and non-local information perception, making it difficult to select beneficial information from multi-view scenes adaptively. In this work, we propose Stereo Implicit Neural Representation (StereoINR), which innovatively models stereo image pairs as continuous implicit representations. This continuous representation breaks through the scale limitations, providing a unified solution for arbitrary-scale stereo super-resolution reconstruction of left-right views. Furthermore, by incorporating spatial warping and cross-attention mechanisms, StereoINR enables effective cross-view information fusion and achieves significant improvements in pixel-level geometric consistency. Extensive experiments across multiple datasets show that StereoINR outperforms existing methods on out-of-training-distribution scale upsampling and matches state-of-the-art SSR methods within training-distribution scales.
zh
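The continuous-representation core resembles LIIF-style decoding: encoder features are sampled at arbitrary continuous coordinates and an MLP maps (feature, coordinate) to RGB, so any output scale is just a denser coordinate grid. A sketch with cross-view warping and cross-attention omitted; all module names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitDecoder(nn.Module):
    """Arbitrary-scale decoding sketch: sample encoder features at continuous
    coordinates and let an MLP map (feature, coordinate) to RGB, so the same
    model can render any target resolution for either stereo view."""
    def __init__(self, feat_ch: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_ch + 2, 64), nn.ReLU(),
                                 nn.Linear(64, 3))

    def forward(self, feat: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); coords: (B, N, 2) in [-1, 1]
        sampled = F.grid_sample(feat, coords.unsqueeze(1), align_corners=False)
        sampled = sampled.squeeze(2).permute(0, 2, 1)          # (B, N, C)
        return self.mlp(torch.cat([sampled, coords], dim=-1))  # (B, N, 3)

dec = ImplicitDecoder()
rgb = dec(torch.randn(1, 32, 16, 16), torch.rand(1, 1024, 2) * 2 - 1)
```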

[CV-90] Image Restoration via Multi-domain Learning

【速读】:该论文旨在解决自然图像在复杂大气和成像条件下所面临的多种退化问题,传统方法通常针对特定退化类型进行优化,而缺乏对不同退化之间共性先验的深入研究。其解决方案的关键在于提出一种融合多域学习的Transformer框架,通过在Token Mixer中引入空间-小波-傅里叶多域结构以实现局部-全局多感受野建模,替代传统的自注意力机制,并在Feed-Forward Network中引入多尺度学习以融合不同分辨率下的多域特征,从而提升图像恢复性能并平衡模型参数量、计算成本与推理延迟。

链接: https://arxiv.org/abs/2505.05504
作者: Xingyu Jiang,Ning Gao,Xiuhui Zhang,Hongkun Dou,Shaowen Fu,Xiaoqing Zhong,Hongjue Li,Yue Deng
机构: Beihang University (北京航空航天大学); Beijing Aerospace Automatic Control Institute (北京航天自动控制研究所); China Academy of Space Technology (中国空间技术研究院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Due to adverse atmospheric and imaging conditions, natural images suffer from various degradation phenomena. Consequently, image restoration has emerged as a key solution and garnered substantial attention. Although recent Transformer architectures have demonstrated impressive success across various restoration tasks, their considerable model complexity poses significant challenges for both training and real-time deployment. Furthermore, instead of investigating the commonalities among different degradations, most existing restoration methods focus on modifying Transformer under limited restoration priors. In this work, we first review various degradation phenomena under multi-domain perspective, identifying common priors. Then, we introduce a novel restoration framework, which integrates multi-domain learning into Transformer. Specifically, in Token Mixer, we propose a Spatial-Wavelet-Fourier multi-domain structure that facilitates local-region-global multi-receptive field modeling to replace vanilla self-attention. Additionally, in Feed-Forward Network, we incorporate multi-scale learning to fuse multi-domain features at different resolutions. Comprehensive experimental results across ten restoration tasks, such as dehazing, desnowing, motion deblurring, defocus deblurring, rain streak/raindrop removal, cloud removal, shadow removal, underwater enhancement and low-light enhancement, demonstrate that our proposed model outperforms state-of-the-art methods and achieves a favorable trade-off among restoration performance, parameter size, computational cost and inference latency. The code is available at: this https URL.
zh
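A reduced sketch of the multi-domain token-mixer idea, with only a spatial branch (depthwise conv, local receptive field) and a Fourier branch (learned frequency filter, global receptive field); the paper's mixer adds a wavelet branch and multi-scale fusion in the feed-forward network, which are omitted here.

```python
import torch
import torch.nn as nn

class SpatialFourierMixer(nn.Module):
    """Reduced sketch of a multi-domain token mixer: channels are split
    between a local spatial branch (depthwise conv) and a global Fourier
    branch (learned filter applied in the frequency domain)."""
    def __init__(self, ch: int, size: int):
        super().__init__()
        self.half = ch // 2
        self.spatial = nn.Conv2d(self.half, self.half, 3, padding=1,
                                 groups=self.half)   # local receptive field
        # learned real-valued filter over rfft2 frequencies
        self.filt = nn.Parameter(torch.ones(self.half, size, size // 2 + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x[:, :self.half], x[:, self.half:]
        b = torch.fft.irfft2(torch.fft.rfft2(b) * self.filt, s=b.shape[-2:])
        return torch.cat([self.spatial(a), b], dim=1)

y = SpatialFourierMixer(32, 64)(torch.randn(1, 32, 64, 64))
```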

[CV-91] ECGDeDRDNet: A deep learning-based method for Electrocardiogram noise removal using a double recurrent dense network

【速读】:该论文旨在解决心电图(Electrocardiogram, ECG)信号中常见噪声(如基线漂移、肌肉伪影和电极运动噪声)导致诊断价值下降的问题。其解决方案的关键在于提出一种基于深度学习的ECG去噪框架ECGDeDRDNet,该框架采用双循环密集网络(Double Recurrent Dense Network)结构,通过引入双循环机制,增强从ECG波形和估计的干净图像中信息的复用,从而更有效地抑制噪声。

链接: https://arxiv.org/abs/2505.05477
作者: Sainan xiao,Wangdong Yang,Buwen Cao,Jintao Wu
机构: 未知
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Electrocardiogram (ECG) signals are frequently corrupted by noise, such as baseline wander (BW), muscle artifacts (MA), and electrode motion (EM), which significantly degrade their diagnostic utility. To address this issue, we propose ECGDeDRDNet, a deep learning-based ECG Denoising framework leveraging a Double Recurrent Dense Network architecture. In contrast to traditional approaches, we introduce a double recurrent scheme to enhance information reuse from both ECG waveforms and the estimated clean image. For ECG waveform processing, our basic model employs LSTM layers cascaded with DenseNet blocks. The estimated clean ECG image, obtained by subtracting predicted noise components from the noisy input, is iteratively fed back into the model. This dual recurrent architecture enables comprehensive utilization of both temporal waveform features and spatial image details, leading to more effective noise suppression. Experimental results on the MIT-BIH dataset demonstrate that our method achieves superior performance compared to conventional image denoising methods in terms of PSNR and SSIM while also surpassing classical ECG denoising techniques in both SNR and RMSE.
zh
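The double-recurrent scheme reduces to a feedback loop: the network always sees the raw noisy ECG together with its current clean estimate, predicts the noise, and the refined estimate is fed back for the next pass. A sketch with a plain 1D conv standing in for the paper's LSTM + DenseNet blocks:

```python
import torch
import torch.nn as nn

class IterativeDenoiser(nn.Module):
    """Sketch of the double-recurrent idea: the model sees both the noisy
    ECG and its current clean estimate, predicts the residual noise, and
    the refined estimate is fed back for the next iteration."""
    def __init__(self, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.net = nn.Sequential(nn.Conv1d(2, 16, 5, padding=2), nn.ReLU(),
                                 nn.Conv1d(16, 1, 5, padding=2))

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        clean = noisy.clone()
        for _ in range(self.steps):
            noise_pred = self.net(torch.cat([noisy, clean], dim=1))
            clean = noisy - noise_pred   # refined estimate, fed back next pass
        return clean

out = IterativeDenoiser()(torch.randn(2, 1, 512))  # (batch, lead, samples)
```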

[CV-92] MAISY: Motion-Aware Image SYnthesis for Medical Image Motion Correction

【速读】:该论文旨在解决医学图像采集过程中患者运动导致的图像模糊、伪影及器官变形问题,这些问题使得图像解读变得困难。其解决方案的关键在于提出Motion-Aware Image SYnthesis (MAISY)模型,该模型通过两个核心机制实现运动特征的表征与校正:一是利用Segment Anything Model (SAM)动态学习解剖边界处的空间模式,以捕捉运动伪影最显著的区域;二是引入Variance-Selective SSIM (VS-SSIM)损失函数,自适应强调像素方差较高的区域,从而在伪影校正过程中保留关键解剖细节。

链接: https://arxiv.org/abs/2505.04105
作者: Andrew Zhang,Hao Wang,Shuchang Ye,Michael Fulham,Jinman Kim
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Patient motion during medical image acquisition causes blurring, ghosting, and distorts organs, which makes image interpretation challenging. Current state-of-the-art algorithms using Generative Adversarial Network (GAN)-based methods with their ability to learn the mappings between corrupted images and their ground truth via Structural Similarity Index Measure (SSIM) loss effectively generate motion-free images. However, we identified the following limitations: (i) they mainly focus on global structural characteristics and therefore overlook localized features that often carry critical pathological information, and (ii) the SSIM loss function struggles to handle images with varying pixel intensities, luminance factors, and variance. In this study, we propose Motion-Aware Image SYnthesis (MAISY) which initially characterize motion and then uses it for correction by: (a) leveraging the foundation model Segment Anything Model (SAM), to dynamically learn spatial patterns along anatomical boundaries where motion artifacts are most pronounced and, (b) introducing the Variance-Selective SSIM (VS-SSIM) loss which adaptively emphasizes spatial regions with high pixel variance to preserve essential anatomical details during artifact correction. Experiments on chest and head CT datasets demonstrate that our model outperformed the state-of-the-art counterparts, with Peak Signal-to-Noise Ratio (PSNR) increasing by 40%, SSIM by 10%, and Dice by 16%.
zh
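The VS-SSIM idea, emphasizing high-variance regions, can be sketched as a variance-derived weight map multiplied into a per-pixel SSIM map. The normalization below is an illustrative choice, not the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def variance_selective_weight(img: torch.Tensor, k: int = 7) -> torch.Tensor:
    """Per-pixel weight map emphasizing high-variance regions.

    Local variance is estimated with box filters; normalizing it to [0, 1]
    gives a mask that can multiply a per-pixel SSIM map, so the loss focuses
    on detail-rich anatomy where motion artifacts matter most.
    """
    mean = F.avg_pool2d(img, k, stride=1, padding=k // 2)
    mean_sq = F.avg_pool2d(img * img, k, stride=1, padding=k // 2)
    var = (mean_sq - mean * mean).clamp(min=0)
    return var / (var.amax(dim=(-2, -1), keepdim=True) + 1e-8)

w = variance_selective_weight(torch.rand(1, 1, 128, 128))
# vs_ssim_loss = 1 - (w * ssim_map).sum() / w.sum()   # given an SSIM map
```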

人工智能

[AI-0] Efficient Sensorimotor Learning for Open-world Robot Manipulation

【速读】:该论文旨在解决开放世界机器人操作(Open-world Robot Manipulation)问题,即机器人需要在未预先编程或预训练的情况下,泛化或快速适应新物体、场景或任务。解决方案的关键在于利用有限演示数据中存在的规律性(regularity),以实现高效的感觉运动学习(sensorimotor learning)。通过挖掘这些规律性,该研究能够促进机器人学习可泛化的操作技能,从而提升其在复杂和未知环境中的适应能力。

链接: https://arxiv.org/abs/2505.06136
作者: Yifeng Zhu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Ph.D. Dissertation

点击查看摘要

Abstract:This dissertation considers Open-world Robot Manipulation, a manipulation problem where a robot must generalize or quickly adapt to new objects, scenes, or tasks for which it has not been pre-programmed or pre-trained. This dissertation tackles the problem using a methodology of efficient sensorimotor learning. The key to enabling efficient sensorimotor learning lies in leveraging regular patterns that exist in limited amounts of demonstration data. These patterns, referred to as "regularity," enable the data-efficient learning of generalizable manipulation skills. This dissertation offers a new perspective on formulating manipulation problems through the lens of regularity. Building upon this notion, we introduce three major contributions. First, we introduce methods that endow robots with object-centric priors, allowing them to learn generalizable, closed-loop sensorimotor policies from a small number of teleoperation demonstrations. Second, we introduce methods that constitute robots' spatial understanding, unlocking their ability to imitate manipulation skills from in-the-wild video observations. Last but not least, we introduce methods that enable robots to identify reusable skills from their past experiences, resulting in systems that can continually imitate multiple tasks in a sequential manner. Altogether, the contributions of this dissertation help lay the groundwork for building general-purpose personal robots that can quickly adapt to new situations or tasks with low-cost data collection and interact easily with humans. By enabling robots to learn and generalize from limited data, this dissertation takes a step toward realizing the vision of intelligent robotic assistants that can be seamlessly integrated into everyday scenarios.
zh

[AI-1] UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

【速读】:该论文旨在解决通用机器人在不同环境和身体结构间迁移学习能力不足的问题,现有方法通常依赖于动作标注数据的扩展,导致其局限于单一物理规格且难以获取可迁移的知识。解决方案的关键在于提出UniVLA框架,通过从视频中提取以任务为中心的动作表示(task-centric action representations),结合语言指令和DINO特征空间中的潜在动作模型,实现跨身体结构的视觉-语言-动作(VLA)策略学习,从而有效利用多样化数据并提升泛化能力。

链接: https://arxiv.org/abs/2505.06111
作者: Qingwen Bu,Yanting Yang,Jisong Cai,Shenyuan Gao,Guanghui Ren,Maoqing Yao,Ping Luo,Hongyang Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to RSS 2025. Code is available at this https URL

点击查看摘要

Abstract:A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA’s potential to facilitate scalable and efficient robot policy learning.
zh

[AI-2] LLMs Outperform Experts on Challenging Biology Benchmarks

【速读】:该论文试图解决当前大型语言模型在生物学领域性能评估的系统性不足问题,通过多维度、跨领域的基准测试来全面衡量模型的生物医学能力。其解决方案的关键在于构建涵盖分子生物学、遗传学、克隆、病毒学和生物安全等不同方向的多样化基准测试,并对来自主要AI开发者的27个前沿模型进行多次独立运行评估,以准确反映模型在实际生物科学任务中的表现。

链接: https://arxiv.org/abs/2505.06108
作者: Lennart Justen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:This study systematically evaluates 27 frontier Large Language Models on eight diverse biology benchmarks spanning molecular biology, genetics, cloning, virology, and biosecurity. Models from major AI developers released between November 2022 and April 2025 were assessed through ten independent runs per benchmark. The findings reveal dramatic improvements in biological capabilities. Top model performance increased more than 4-fold on the challenging text-only subset of the Virology Capabilities Test over the study period, with the top model now performing twice as well as expert virologists. Several models now match or exceed expert-level performance on other challenging benchmarks, including LAB-Bench CloningScenarios and the biology subsets of GPQA and WMDP. Contrary to expectations, chain-of-thought did not substantially improve performance over zero-shot evaluation, while extended reasoning features in o3-mini and Claude 3.7 Sonnet typically improved performance as predicted by inference scaling. Benchmarks such as PubMedQA and the MMLU and WMDP biology subsets exhibited performance plateaus well below 100%, suggesting benchmark saturation and errors in the underlying benchmark data. The analysis highlights the need for more sophisticated evaluation methodologies as AI systems continue to advance.
zh

[AI-3] Free and Fair Hardware: A Pathway to Copyright Infringement-Free Verilog Generation using LLM s

【速读】:该论文试图解决大型语言模型(Large Language Model, LLM)在硬件设计任务中生成功能性Verilog代码时存在的版权侵权风险问题。现有基于开源仓库的硬件数据集规模有限且缺乏对再利用许可的严格检查,导致微调后的LLM可能产生受版权保护的代码。解决方案的关键在于提出一个评估基准以量化风险,并构建了一个包含超过220k文件的开源Verilog数据集FreeSet,结合自动化数据集清理框架以确保公平使用,最终通过持续预训练微调出一个名为FreeV的Llama模型,其在Verilog生成任务中表现出较低的版权侵权风险(3%的违规率)及优于基线模型的性能。

链接: https://arxiv.org/abs/2505.06096
作者: Sam Bush,Matthew DeLorenzo,Phat Tieu,Jeyavijayan Rajendran
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at DAC 2025

点击查看摘要

Abstract:Limitations in Large Language Model (LLM) capabilities for hardware design tasks, such as generating functional Verilog codes, have motivated various fine-tuning optimizations utilizing curated hardware datasets from open-source repositories. However, these datasets remain limited in size and contain minimal checks on licensing for reuse, resulting in potential copyright violations by fine-tuned LLMs. Therefore, we propose an evaluation benchmark to estimate the risk of Verilog-trained LLMs to generate copyright-protected codes. To minimize this risk, we present an open-source Verilog dataset, FreeSet, containing over 220k files, along with the automated dataset curation framework utilized to provide additional guarantees of fair-use Verilog data. We then execute an LLM fine-tuning framework consisting of continual pre-training, resulting in a fine-tuned Llama model for Verilog, FreeV. Our results indicate that FreeV demonstrates the smallest risk of copyright-infringement among prior works, with only a 3% violation rate. Furthermore, experimental results demonstrate improvements in Verilog generation functionality over its baseline model, improving VerilogEval pass@10 rates by over 10%.
zh

[AI-4] UniSymNet: A Unified Symbolic Network Guided by Transformer

【速读】:该论文旨在解决传统符号回归(Symbolic Regression, SR)算法在处理复杂树结构时性能受限的问题,以及现有符号网络在扩展二元非线性运算符和避免过拟合方面的挑战。其解决方案的关键在于提出一种统一的符号网络(UniSymNet),通过将非线性二元运算符统一为嵌套的一元运算符,从而降低表达式的复杂度,并定义了简化复杂度的条件。此外,通过预训练Transformer模型并采用目标特定的优化策略,提升了符号网络的拟合精度和符号解的生成能力。

链接: https://arxiv.org/abs/2505.06091
作者: Xinxin Li,Juan Zhang,Da Li,Xingyu Liu,Jin Xu,Junping Yin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注:

点击查看摘要

Abstract:Symbolic Regression (SR) is a powerful technique for automatically discovering mathematical expressions from input data. Mainstream SR algorithms search for the optimal symbolic tree in a vast function space, but the increasing complexity of the tree structure limits their performance. Inspired by neural networks, symbolic networks have emerged as a promising new paradigm. However, most existing symbolic networks still face certain challenges: binary nonlinear operators {×, ÷} cannot be naturally extended to multivariate operators, and training with fixed architecture often leads to higher complexity and overfitting. In this work, we propose a Unified Symbolic Network that unifies nonlinear binary operators into nested unary operators and define the conditions under which UniSymNet can reduce complexity. Moreover, we pre-train a Transformer model with a novel label encoding method to guide structural selection, and adopt objective-specific optimization strategies to learn the parameters of the symbolic network. UniSymNet shows high fitting accuracy, excellent symbolic solution rate, and relatively low expression complexity, achieving competitive performance on low-dimensional Standard Benchmarks and high-dimensional SRBench.
zh
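The unification trick is the classical log-exp identity: for positive operands, a product (or quotient) becomes nested unary operators, so the network needs no dedicated binary multiply/divide nodes. A one-line check (sign handling, which the full method must also address, is ignored here):

```python
import numpy as np

def mul_as_unary(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Rewrite the binary product as nested unary operators:
    x * y = exp(log x + log y), valid for positive inputs.
    Division follows the same pattern with a minus sign, which is what
    lets a symbolic network avoid dedicated binary multiply/divide nodes.
    """
    return np.exp(np.log(x) + np.log(y))

x, y = np.array([2.0, 3.0]), np.array([4.0, 5.0])
assert np.allclose(mul_as_unary(x, y), x * y)
```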

[AI-5] Assessing Tenstorrents RISC-V MatMul Acceleration Capabilities

【速读】:该论文旨在解决生成式 AI(Generative AI)服务中对高效计算架构的需求,特别是针对大型语言模型(LLM)中的基础线性代数运算在低精度数值下的优化问题。其解决方案的关键在于评估 Tenstorrent Grayskull e75 RISC-V 加速器在不同网格尺寸、矩阵维度、数据格式及数值精度下的计算效率,并通过与英特尔 Sapphire Rapids 处理器以及 NVIDIA V100 和 A100 GPU 的对比,展示其在功耗与计算吞吐量之间的竞争优势,最终实现了 1.55 TFLOPs/Watt 的峰值能效比。

链接: https://arxiv.org/abs/2505.06085
作者: Hiari Pizzini Cavagna,Daniele Cesarini,Andrea Bartolini
机构: 未知
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Accepted to the Computational Aspects of Deep Learning Workshop at ISC High Performance 2025. To appear in the ISC High Performance 2025 Workshop Proceedings

点击查看摘要

Abstract:The increasing demand for generative AI services such as Large Language Models (LLMs) has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, a fundamental operation in LLM computations. We present a detailed characterization of Grayskull's execution model, analyzing how grid size, matrix dimensions, data formats, and numerical precision impact computational efficiency. Furthermore, we compare Grayskull's performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.
zh

[AI-6] Seqret: Mining Rule Sets from Event Sequences

【速读】:该论文试图解决从事件序列数据中同时发现条件依赖和无条件依赖的问题(conditional and unconditional dependencies),而现有方法通常仅关注无条件的序列模式。解决方案的关键在于通过发现形式为 $ X \rightarrow Y $ 的规则,其中 $ X $ 和 $ Y $ 是序列模式,从而捕捉事件之间的关系。为了获得简洁且非冗余的规则集,作者基于最小描述长度原则(Minimum Description Length principle)对问题进行形式化,并提出了 Seqret 方法以在实际中高效发现高质量的规则集。

链接: https://arxiv.org/abs/2505.06049
作者: Aleena Siji,Joscha Cüppers,Osman Ali Mian,Jilles Vreeken
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Summarizing event sequences is a key aspect of data mining. Most existing methods neglect conditional dependencies and focus on discovering sequential patterns only. In this paper, we study the problem of discovering both conditional and unconditional dependencies from event sequence data. We do so by discovering rules of the form X \rightarrow Y where X and Y are sequential patterns. Rules like these are simple to understand and provide a clear description of the relation between the antecedent and the consequent. To discover succinct and non-redundant sets of rules we formalize the problem in terms of the Minimum Description Length principle. As the search space is enormous and does not exhibit helpful structure, we propose the Seqret method to discover high-quality rule sets in practice. Through extensive empirical evaluation we show that unlike the state of the art, Seqret ably recovers the ground truth on synthetic datasets and finds useful rules from real datasets.
zh

[AI-7] PYRREGULAR: A Unified Framework for Irregular Time Series with Classification Benchmarks

【速读】:该论文旨在解决不规则时间序列(irregular time series)分类中由于记录频率不一致、观测时长不同以及存在缺失值所带来的挑战。现有研究社区通常孤立地处理这些问题,导致工具和方法分散。该论文提出了一种统一框架,并构建了首个标准化数据集仓库,以增强不同数据之间的互操作性,其关键在于采用统一的数组格式来整合多种数据源,从而为不规则时间序列分析方法提供更全面和可靠的评估基础。

链接: https://arxiv.org/abs/2505.06047
作者: Francesco Spinnato,Cristiano Landi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Irregular temporal data, characterized by varying recording frequencies, differing observation durations, and missing values, presents significant challenges across fields like mobility, healthcare, and environmental science. Existing research communities often overlook or address these challenges in isolation, leading to fragmented tools and methods. To bridge this gap, we introduce a unified framework, and the first standardized dataset repository for irregular time series classification, built on a common array format to enhance interoperability. This repository comprises 34 datasets on which we benchmark 12 classifier models from diverse domains and communities. This work aims to centralize research efforts and enable a more robust evaluation of irregular temporal data analysis methods.
zh
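A common-array format for irregular series typically means padded values plus timestamps plus an observation mask. A minimal sketch of such a packing follows; the repository's exact layout is an assumption here.

```python
import numpy as np

def to_common_array(series):
    """Pack irregular time series into one padded array format.

    series: list of (timestamps, values) pairs of arbitrary length.
    Returns a (n_series, max_len) values array, a matching timestamps
    array, and a boolean mask marking observed entries -- a common-array
    layout that lets classifiers from different communities consume the
    same data.
    """
    max_len = max(len(t) for t, _ in series)
    n = len(series)
    values = np.full((n, max_len), np.nan)
    times = np.full((n, max_len), np.nan)
    mask = np.zeros((n, max_len), dtype=bool)
    for i, (t, v) in enumerate(series):
        values[i, :len(v)] = v
        times[i, :len(t)] = t
        mask[i, :len(t)] = True
    return values, times, mask

vals, ts, m = to_common_array([(np.array([0.0, 1.5]), np.array([3.1, 2.7])),
                               (np.array([0.2]), np.array([5.0]))])
```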

[AI-8] Universal Approximation Theorem for Deep Q-Learning via FBSDE System

【速读】:该论文试图解决深度Q网络(Deep Q-Networks, DQNs)的近似能力问题,传统方法依赖于不考虑最优Q函数内在结构特性的通用近似定理(Universal Approximation Theorems, UATs)。论文提出的关键解决方案是建立一种针对特定结构DQNs的UAT,其架构设计旨在模拟贝尔曼更新中的迭代精炼过程。该方案的核心在于正则性传播分析,即从有限时域动态规划原理推导出价值迭代序列的整体一致正则性,并证明深度残差网络的层可以作为函数空间上的神经算子来近似贝尔曼算子的作用,从而将网络深度与价值函数精炼的迭代次数直接关联,实现可控误差传播。

链接: https://arxiv.org/abs/2505.06023
作者: Qian Qi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:The approximation capabilities of Deep Q-Networks (DQNs) are commonly justified by general Universal Approximation Theorems (UATs) that do not leverage the intrinsic structural properties of the optimal Q-function, the solution to a Bellman equation. This paper establishes a UAT for a class of DQNs whose architecture is designed to emulate the iterative refinement process inherent in Bellman updates. A central element of our analysis is the propagation of regularity: while the transformation induced by a single Bellman operator application exhibits regularity, for which Backward Stochastic Differential Equations (BSDEs) theory provides analytical tools, the uniform regularity of the entire sequence of value iteration iterates–specifically, their uniform Lipschitz continuity on compact domains under standard Lipschitz assumptions on the problem data–is derived from finite-horizon dynamic programming principles. We demonstrate that layers of a deep residual network, conceived as neural operators acting on function spaces, can approximate the action of the Bellman operator. The resulting approximation theorem is thus intrinsically linked to the control problem’s structure, offering a proof technique wherein network depth directly corresponds to iterations of value function refinement, accompanied by controlled error propagation. This perspective reveals a dynamic systems view of the network’s operation on a space of value functions.
zh
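The "depth equals iterations of value-function refinement" picture is easiest to see on a finite MDP, where each residual layer is meant to emulate one application of the Bellman operator T. A toy value-iteration loop making that correspondence explicit:

```python
import numpy as np

# Toy finite MDP: the depth-as-iterations view of the Bellman operator.
# Each application of T refines the value function; the paper's DQN
# architecture mirrors one such application per residual layer.
n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transitions
R = rng.uniform(size=(n_states, n_actions))                       # rewards

V = np.zeros(n_states)
for _ in range(50):                       # "depth" = number of refinements
    Q = R + gamma * P @ V                 # (n_states, n_actions)
    V_new = Q.max(axis=1)                 # greedy Bellman backup
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new
```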

[AI-9] Minimal Sequent Calculus for Teaching First-Order Logic: Lessons Learned

【速读】:该论文试图解决如何有效教学一阶逻辑(first-order logic)的问题,其解决方案的关键在于开发了一个基于最小序式演算(minimal sequent calculus)的网页应用MiniCalc,并允许通过Isabelle证明助手对证明进行验证。

链接: https://arxiv.org/abs/2505.05988
作者: Jørgen Villadsen(Technical University of Denmark)
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: In Proceedings ThEdu24, arXiv:2505.04677

点击查看摘要

Abstract:MiniCalc is a web app for teaching first-order logic based on a minimal sequent calculus. As an option the proofs can be verified in the Isabelle proof assistant. We present the lessons learned using the tool in recent years at our university.
zh

[AI-10] Pseudo-Boolean d-DNNF Compilation for Expressive Feature Modeling Constructs

【速读】:该论文试图解决现有可配置系统中特征模型(feature model)与自动化推理工具之间存在的不匹配问题,特别是现代特征建模语言中包含的表达性构造(如基数约束)难以转换为合取范式(CNF)的问题,这限制了这些构造的应用。解决方案的关键在于提出一种伪布尔编码(pseudo-Boolean encoding)来表示特征模型,相较于传统的布尔编码能够提供更紧凑的表示,并引入一种将伪布尔公式编译为布尔D-DNNF(d-DNNF)的新方法,从而支持高效的分析。

链接: https://arxiv.org/abs/2505.05976
作者: Chico Sundermann,Stefan Vill,Elias Kuiter,Sebastian Krieter,Thomas Thüm,Matthias Tichy
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Configurable systems typically consist of reusable assets that have dependencies between each other. To specify such dependencies, feature models are commonly used. As feature models in practice are often complex, automated reasoning is typically employed to analyze the dependencies. Here, the de facto standard is translating the feature model to conjunctive normal form (CNF) to enable employing off-the-shelf tools, such as SAT or #SAT solvers. However, modern feature-modeling dialects often contain constructs, such as cardinality constraints, that are ill-suited for conversion to CNF. This mismatch between the input of reasoning engines and the available feature-modeling dialects limits the applicability of the more expressive constructs. In this work, we shorten this gap between expressive constructs and scalable automated reasoning. Our contribution is twofold: First, we provide a pseudo-Boolean encoding for feature models, which facilitates smaller representations of commonly employed constructs compared to Boolean encoding. Second, we propose a novel method to compile pseudo-Boolean formulas to Boolean d-DNNF. With the compiled d-DNNFs, we can resort to a plethora of efficient analyses already used in feature modeling. Our empirical evaluation shows that our proposal substantially outperforms the state-of-the-art based on CNF inputs for expressive constructs. For every considered dataset representing different feature models and feature-modeling constructs, the feature models can be significantly faster translated to pseudo-Boolean than to CNF. Overall, deriving d-DNNFs from a feature model with the targeted expressive constraints can be substantially accelerated using our pseudo-Boolean approach. Furthermore, our approach is competitive on feature models with only basic constructs.
zh
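Why pseudo-Boolean encodings stay compact for cardinality constraints: "at most k of n" is a single linear inequality over literals, while even the naive pairwise CNF for k = 1 needs O(n²) clauses. A small string-level illustration (no solver involved; the rendering format is illustrative):

```python
from itertools import combinations

def pb_atmost(variables, k):
    """Render an at-most-k cardinality constraint as one PB inequality."""
    return " + ".join(variables) + f" <= {k}"

def cnf_atmost_one(variables):
    """Naive pairwise CNF for at-most-1: one clause per variable pair."""
    return [f"(-{a} | -{b})" for a, b in combinations(variables, 2)]

feats = [f"f{i}" for i in range(10)]
print(pb_atmost(feats, 1))          # one constraint, regardless of n
print(len(cnf_atmost_one(feats)))   # 45 clauses for only 10 features
```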

[AI-11] A Noise-Resilient Semi-Supervised Graph Autoencoder for Overlapping Semantic Community Detection

【速读】:该论文旨在解决在存在噪声的现实网络中,如何有效检测重叠社区的问题,特别是在需要融合拓扑结构、节点属性和先验信息的情况下。其解决方案的关键在于提出一种半监督图自编码器,该模型结合了图多头注意力机制和模块度最大化,通过融合结构、属性和先验知识来学习语义表示,并显式处理节点特征中的噪声。关键创新包括抗噪声架构和基于模块度约束优化的语义半监督设计,从而提升了重叠社区检测的准确性和鲁棒性。

链接: https://arxiv.org/abs/2505.05965
作者: Abdelfateh Bekkair,Slimane Bellaouar,Slimane Oulad-Naoui
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Community detection in networks with overlapping structures remains a significant challenge, particularly in noisy real-world environments where integrating topology, node attributes, and prior information is critical. To address this, we propose a semi-supervised graph autoencoder that combines graph multi-head attention and modularity maximization to robustly detect overlapping communities. The model learns semantic representations by fusing structural, attribute, and prior knowledge while explicitly addressing noise in node features. Key innovations include a noise-resistant architecture and a semantic semi-supervised design optimized for community quality through modularity constraints. Experiments demonstrate superior performance: the model outperforms state-of-the-art methods in overlapping community detection (improvements in NMI and F1-score) and exhibits exceptional robustness to attribute noise, maintaining stable performance under 60% feature corruption. These results highlight the importance of integrating attribute semantics and structural patterns for accurate community discovery in complex networks.
zh
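The modularity-maximization component can be written as a differentiable loss over soft community assignments: with modularity matrix B = A − d dᵀ / 2m, maximizing trace(CᵀBC) / 2m rewards assignments that beat the random-graph baseline. A sketch (the paper's full objective also includes reconstruction and semi-supervised terms):

```python
import torch

def soft_modularity_loss(adj: torch.Tensor, assign: torch.Tensor) -> torch.Tensor:
    """Negative soft modularity for overlapping community assignments.

    adj:    (n, n) adjacency matrix.
    assign: (n, k) soft (possibly overlapping) community memberships.
    Maximizing trace(C^T B C) / (2m), with B = A - d d^T / (2m), pushes the
    learned embeddings toward high-modularity partitions.
    """
    deg = adj.sum(1, keepdim=True)
    two_m = adj.sum().clamp(min=1e-8)
    B = adj - deg @ deg.T / two_m
    return -torch.trace(assign.T @ B @ assign) / two_m

A = (torch.rand(20, 20) > 0.7).float()
A = ((A + A.T) > 0).float().fill_diagonal_(0)      # symmetric, no self-loops
C = torch.softmax(torch.randn(20, 3), dim=1)       # random soft assignments
loss = soft_modularity_loss(A, C)
```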

[AI-12] IRNN: Innovation-driven Recurrent Neural Network for Time-Series Data Modeling and Prediction

【速读】:该论文旨在解决时间序列数据建模与预测中动态特性捕捉不足的问题,特别是通过改进传统循环神经网络(Recurrent Neural Network, RNN)的结构以提升预测性能。其解决方案的关键在于引入“创新”(Innovation)概念,将历史预测误差作为额外输入信号,用于更新RNN的隐藏状态,从而增强模型对时间依赖性的建模能力。为适配该架构,论文进一步提出了一种定制的训练算法——基于输入更新的反向传播通过时间(Input Updating-based Back-Propagation Through Time, IU-BPTT),该算法通过交替更新创新数据和优化网络参数来实现有效训练。

链接: https://arxiv.org/abs/2505.05916
作者: Yifan Zhou,Yibo Wang,Chao Shang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many real-world datasets are time series that are sequentially collected and contain rich temporal information. Thus, a common interest in practice is to capture dynamics of time series and predict their future evolutions. To this end, the recurrent neural network (RNN) has been a prevalent and effective machine learning option, which admits a nonlinear state-space model representation. Motivated by the resemblance between RNN and Kalman filter (KF) for linear state-space models, we propose in this paper Innovation-driven RNN (IRNN), a novel RNN architecture tailored to time-series data modeling and prediction tasks. By adapting the concept of “innovation” from KF to RNN, past prediction errors are adopted as additional input signals to update hidden states of RNN and boost prediction performance. Since innovation data depend on network parameters, existing training algorithms for RNN do not apply to IRNN straightforwardly. Thus, a tailored training algorithm dubbed input updating-based back-propagation through time (IU-BPTT) is further proposed, which alternates between updating innovations and optimizing network parameters via gradient descent. Experiments on real-world benchmark datasets show that the integration of innovations into various forms of RNN leads to remarkably improved prediction accuracy of IRNN without increasing the training cost substantially.
zh
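The "innovation" mechanism is simply: append the previous step's prediction error to the next input, as a Kalman filter does. A forward-pass sketch with a GRU cell standing in for the paper's RNN body (the tailored IU-BPTT training alternation is not shown):

```python
import torch
import torch.nn as nn

class IRNNCell(nn.Module):
    """Innovation-driven RNN sketch: the previous prediction error (the
    'innovation', as in a Kalman filter) is appended to the input when
    updating the hidden state."""
    def __init__(self, in_dim: int, hidden: int):
        super().__init__()
        self.cell = nn.GRUCell(in_dim + 1, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (B, T, in_dim) inputs; y: (B, T) targets used to form errors
        B, T, _ = x.shape
        h = x.new_zeros(B, self.cell.hidden_size)
        innov = x.new_zeros(B, 1)
        preds = []
        for t in range(T):
            h = self.cell(torch.cat([x[:, t], innov], dim=1), h)
            y_hat = self.head(h)
            innov = y[:, t:t + 1] - y_hat   # innovation for the next step
            preds.append(y_hat)
        return torch.cat(preds, dim=1)      # (B, T)

out = IRNNCell(4, 32)(torch.randn(2, 10, 4), torch.randn(2, 10))
```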

[AI-13] LightNobel: Improving Sequence Length Limitation in Protein Structure Prediction Model via Adaptive Activation Quantization ISCA2025

【速读】:该论文旨在解决蛋白质结构预测模型(PPM)在处理长氨基酸序列时的可扩展性问题,特别是由于激活尺寸指数增长带来的内存和计算需求激增。其关键解决方案是提出一种软硬件协同设计的加速器LightNobel,其中软件层面引入了基于令牌的自适应激活量化(AAQ)技术,利用PPM激活中的独特令牌特征实现细粒度量化而不牺牲精度;硬件层面则通过集成多精度可重构矩阵处理单元(RMPU)和通用向量处理单元(VVPU)来高效执行AAQ,从而显著提升性能并降低功耗。

链接: https://arxiv.org/abs/2505.05893
作者: Seunghee Han,Soongyu Choi,Joo-Young Kim
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: To appear in the Proceedings of the 52nd IEEE/ACM International Symposium on Computer Architecture (ISCA 2025)

点击查看摘要

Abstract:Recent advances in Protein Structure Prediction Models (PPMs), such as AlphaFold2 and ESMFold, have revolutionized computational biology by achieving unprecedented accuracy in predicting three-dimensional protein folding structures. However, these models face significant scalability challenges, particularly when processing proteins with long amino acid sequences (e.g., sequence length ≥ 1,000). The primary bottleneck that arises from the exponential growth in activation sizes is driven by the unique data structure in PPM, which introduces an additional dimension that leads to substantial memory and computational demands. These limitations have hindered the effective scaling of PPM for real-world applications, such as analyzing large proteins or complex multimers with critical biological and pharmaceutical relevance. In this paper, we present LightNobel, the first hardware-software co-designed accelerator developed to overcome scalability limitations on the sequence length in PPM. At the software level, we propose Token-wise Adaptive Activation Quantization (AAQ), which leverages unique token-wise characteristics, such as distogram patterns in PPM activations, to enable fine-grained quantization techniques without compromising accuracy. At the hardware level, LightNobel integrates the multi-precision reconfigurable matrix processing unit (RMPU) and versatile vector processing unit (VVPU) to enable the efficient execution of AAQ. Through these innovations, LightNobel achieves up to 8.44x, 8.41x speedup and 37.29x, 43.35x higher power efficiency over the latest NVIDIA A100 and H100 GPUs, respectively, while maintaining negligible accuracy loss. It also reduces the peak memory requirement up to 120.05x in PPM, enabling scalable processing for proteins with long sequences.
zh
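Token-wise quantization at its simplest gives every token its own int8 scale, so tokens with wildly different magnitudes (the distogram-like patterns the paper exploits) do not share one global range. AAQ's adaptivity goes further; this sketch is only the per-token baseline.

```python
import torch

def tokenwise_quant(act: torch.Tensor, bits: int = 8):
    """Token-wise activation quantization sketch.

    act: (tokens, channels) activations. Each token gets its own scale, so
    tokens with very different magnitudes are quantized over their own
    range instead of one global one, preserving fine-grained accuracy.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = act.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(act / scale).clamp(-qmax, qmax).to(torch.int8)
    return q, scale

# tokens spanning three orders of magnitude still quantize cleanly
act = torch.randn(16, 64) * torch.logspace(-2, 1, 16).unsqueeze(1)
q, scale = tokenwise_quant(act)
dequant = q.float() * scale          # reconstruction for error checking
```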

[AI-14] Combining Abstract Argumentation and Machine Learning for Efficiently Analyzing Low-Level Process Event Streams

【速读】:该论文试图解决在过程追踪分析中,当追踪事件与参考业务活动之间存在差距时,如何将每个事件映射到对应活动实例步骤的解释问题。其解决方案的关键在于提出一种数据/计算高效的神经符号方法,该方法结合了基于序列标注器生成高概率候选事件解释与基于抽象论证框架(Abstract Argumentation Framework, AAF)的推理机对候选解释进行精炼,从而在减少人工标注数据依赖的同时,利用先验知识提升解释的准确性和可靠性。

链接: https://arxiv.org/abs/2505.05880
作者: Bettina Fazzinga,Sergio Flesca,Filippo Furfaro,Luigi Pontieri,Francesco Scala
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Monitoring and analyzing process traces is a critical task for modern companies and organizations. In scenarios where there is a gap between trace events and reference business activities, this entails an interpretation problem, amounting to translating each event of any ongoing trace into the corresponding step of the activity instance. Building on a recent approach that frames the interpretation problem as an acceptance problem within an Abstract Argumentation Framework (AAF), one can elegantly analyze plausible event interpretations (possibly in an aggregated form), as well as offer explanations for those that conflict with prior process knowledge. Since, in settings where event-to-activity mapping is highly uncertain (or simply under-specified), this reasoning-based approach may yield less informative results and heavy computation, one can instead discover a sequence-tagging model, trained to suggest highly probable candidate event interpretations in a context-aware way. However, training such a model optimally may require using a large amount of manually annotated example traces. Considering the urgent need of developing Green AI solutions enabling environmental and societal sustainability (with reduced labor/computational costs and carbon footprint), we propose a data/computation-efficient neuro-symbolic approach to the problem, where the candidate interpretations returned by the example-driven sequence tagger are refined by the AAF-based reasoner. This allows us to also leverage prior knowledge to compensate for the scarcity of example data, as confirmed by experimental results; clearly, this property is particularly useful in settings where data annotation and model optimization costs are subject to stringent constraints.

[AI-15] Multi-Modal Molecular Representation Learning via Structure Awareness

【Quick Read】: This paper addresses the problem that existing multi-modal molecular representation learning methods directly fuse information from different modalities, overlooking inter-modal interactions and failing to fully capture the complex higher-order relationships and invariant features between molecules. The key to the solution is a structure-awareness-based multi-modal self-supervised molecular representation pre-training framework (MMSA) with two main modules: a multi-modal molecular representation learning module that collaboratively processes the different modalities of the same molecule to produce a unified molecular embedding, and a structure-awareness module that models higher-order correlations between molecules by constructing a hypergraph and introduces a memory mechanism that stores typical molecular representations and aligns them with memory anchors to integrate invariant knowledge, thereby improving the model's generalization ability.

Link: https://arxiv.org/abs/2505.05877
Authors: Rong Yin, Ruyue Liu, Xiaoshuai Hao, Xingrui Zhou, Yong Liu, Can Ma, Weiping Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE Transactions on Image Processing (TIP) 2025

Abstract:Accurate extraction of molecular representations is a critical step in the drug discovery process. In recent years, significant progress has been made in molecular representation learning methods, among which multi-modal molecular representation methods based on images and 2D/3D topologies have become increasingly mainstream. However, these existing multi-modal approaches often directly fuse information from different modalities, overlooking the potential of intermodal interactions and failing to adequately capture the complex higher-order relationships and invariant features between molecules. To overcome these challenges, we propose a structure-awareness-based multi-modal self-supervised molecular representation pre-training framework (MMSA) designed to enhance molecular graph representations by leveraging invariant knowledge between molecules. The framework consists of two main modules: the multi-modal molecular representation learning module and the structure-awareness module. The multi-modal molecular representation learning module collaboratively processes information from different modalities of the same molecule to overcome intermodal differences and generate a unified molecular embedding. Subsequently, the structure-awareness module enhances the molecular representation by constructing a hypergraph structure to model higher-order correlations between molecules. This module also introduces a memory mechanism for storing typical molecular representations, aligning them with memory anchors in the memory bank to integrate invariant knowledge, thereby improving the model's generalization ability. Extensive experiments have demonstrated the effectiveness of MMSA, which achieves state-of-the-art performance on the MoleculeNet benchmark, with average ROC-AUC improvements ranging from 1.8% to 9.6% over baseline methods.

[AI-16] Generative Discovery of Partial Differential Equations by Learning from Math Handbooks

【Quick Read】: This paper addresses the trade-off between search-space size and optimization efficiency in data-driven discovery of partial differential equations (PDEs). The key to the solution is a knowledge-guided approach that encodes PDEs documented in mathematical handbooks as sentence-like structures and uses them to train a generative model, EqGPT, which can generate free-form PDEs. A generation-evaluation-optimization loop then autonomously identifies the most suitable PDE, achieving high accuracy and computational efficiency in cases involving complex temporal derivatives or intricate spatial terms, and generalizing to irregular spatial domains and higher-dimensional settings.

Link: https://arxiv.org/abs/2505.05869
Authors: Hao Xu, Yuntian Chen, Rui Cao, Tianning Tang, Mengge Du, Jian Li, Adrian H. Callaghan, Dongxiao Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments:

Abstract:Data-driven discovery of partial differential equations (PDEs) is a promising approach for uncovering the underlying laws governing complex systems. However, purely data-driven techniques face the dilemma of balancing search space with optimization efficiency. This study introduces a knowledge-guided approach that incorporates existing PDEs documented in a mathematical handbook to facilitate the discovery process. These PDEs are encoded as sentence-like structures composed of operators and basic terms, and used to train a generative model, called EqGPT, which enables the generation of free-form PDEs. A generation-evaluation-optimization loop is constructed to autonomously identify the most suitable PDE. Experimental results demonstrate that this framework can recover a variety of PDE forms with high accuracy and computational efficiency, particularly in cases involving complex temporal derivatives or intricate spatial terms, which are often beyond the reach of conventional methods. The approach also exhibits generalizability to irregular spatial domains and higher-dimensional settings. Notably, it succeeds in discovering a previously unreported PDE governing strongly nonlinear surface gravity waves propagating toward breaking, based on real-world experimental data, highlighting its applicability to practical scenarios and its potential to support scientific discovery.
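The generation-evaluation-optimization loop can be caricatured with a sparse-regression evaluator: candidate right-hand sides (which EqGPT would generate with a trained language model; here they are hard-coded) are scored by how well least-squares coefficients explain the observed u_t. Everything below is an illustrative assumption, not the paper's pipeline.

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)
t_steps, dt, nu = 50, 1e-3, 0.3
dx = x[1] - x[0]

def d_dx(f):   return np.gradient(f, dx)
def d2_dx2(f): return np.gradient(np.gradient(f, dx), dx)

# Simulate heat-equation data: u_t = nu * u_xx.
u, snaps = np.sin(x), []
for _ in range(t_steps + 1):
    snaps.append(u.copy())
    u = u + dt * nu * d2_dx2(u)
U = np.array(snaps)
U_t = np.gradient(U, dt, axis=0).ravel()

library = {"u": U,
           "u_x": np.apply_along_axis(d_dx, 1, U),
           "u_xx": np.apply_along_axis(d2_dx2, 1, U)}
candidates = [["u_xx"], ["u", "u_x"], ["u_x"], ["u", "u_xx"]]  # "generated" PDEs

for terms in candidates:   # evaluation step: fit coefficients, score residual
    A = np.stack([library[t].ravel() for t in terms], axis=1)
    coef, *_ = np.linalg.lstsq(A, U_t, rcond=None)
    err = np.linalg.norm(A @ coef - U_t) / np.linalg.norm(U_t)
    print(terms, np.round(coef, 3), f"rel. residual={err:.3f}")
# ["u_xx"] should recover a coefficient close to nu with the lowest residual.
```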

[AI-17] AgentXploit: End-to-End Redteaming of Black-Box AI Agents

【Quick Read】: This paper targets a critical security risk in agent systems built on Large Language Models (LLMs): indirect prompt injection, an attack that compromises the agent's core component, the LLM, by manipulating contextual information rather than direct user prompts. The proposed solution is a generic black-box fuzzing framework, AgentXploit, whose key idea is to construct a high-quality initial seed corpus and then apply a seed selection algorithm based on Monte Carlo Tree Search (MCTS) to iteratively refine inputs, maximizing the likelihood of uncovering agent weaknesses.

Link: https://arxiv.org/abs/2505.05849
Authors: Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, Dawn Song
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The strong planning and reasoning capabilities of Large Language Models (LLMs) have fostered the development of agent-based systems capable of leveraging external tools and interacting with increasingly complex environments. However, these powerful features also introduce a critical security risk: indirect prompt injection, a sophisticated attack vector that compromises the core of these agents, the LLM, by manipulating contextual information rather than direct user prompts. In this work, we propose a generic black-box fuzzing framework, AgentXploit, designed to automatically discover and exploit indirect prompt injection vulnerabilities across diverse LLM agents. Our approach starts by constructing a high-quality initial seed corpus, then employs a seed selection algorithm based on Monte Carlo Tree Search (MCTS) to iteratively refine inputs, thereby maximizing the likelihood of uncovering agent weaknesses. We evaluate AgentXploit on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o, respectively, nearly doubling the performance of baseline attacks. Moreover, AgentXploit exhibits strong transferability across unseen tasks and internal LLMs, as well as promising results against defenses. Beyond benchmark evaluations, we apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
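A minimal stand-in for MCTS-guided seed selection is a bandit-style scheduler: seeds that previously produced successful injections are revisited more often, with an exploration bonus. The mutation operator and reward function below are placeholders, not AgentXploit's actual components.

```python
import math
import random

class SeedScheduler:
    """UCB1-style seed selection, a simplified stand-in for the MCTS-based
    scheduler described in the abstract (all details are assumptions)."""

    def __init__(self, seeds, c=1.4):
        self.stats = {s: [0, 0.0] for s in seeds}  # seed -> [visits, total reward]
        self.c, self.total = c, 0

    def select(self):
        self.total += 1
        def ucb(s):
            n, r = self.stats[s]
            if n == 0:
                return float("inf")                # try unvisited seeds first
            return r / n + self.c * math.sqrt(math.log(self.total) / n)
        return max(self.stats, key=ucb)

    def update(self, seed, reward):
        self.stats[seed][0] += 1
        self.stats[seed][1] += reward

def mutate(prompt):  # placeholder mutation operator
    return prompt + random.choice([" Ignore prior instructions.", " URGENT:"])

def run_agent(prompt):  # placeholder target; returns 1.0 if the attack "lands"
    return 1.0 if "Ignore" in prompt else random.random() * 0.3

sched = SeedScheduler(["Please summarize this page.", "Translate the text."])
for _ in range(20):
    seed = sched.select()
    sched.update(seed, run_agent(mutate(seed)))
print(sched.stats)
```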

[AI-18] MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design

【Quick Read】: This paper addresses the large parameter counts and high computational demands that complicate the deployment of Mixture-of-Experts (MoE) models. The key to the solution is MxMoE, a mixed-precision optimization framework that considers parameter sensitivity, expert activation dynamics, and hardware resources from both algorithmic and system perspectives to derive efficient mixed-precision configurations, and that automatically generates optimized mixed-precision GroupGEMM kernels enabling parallel execution of GEMMs at different precisions, significantly improving model efficiency and performance.

Link: https://arxiv.org/abs/2505.05799
Authors: Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, Dahua Lin
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE, a mixed-precision optimization framework for MoE models that considers both algorithmic and system perspectives. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations. Additionally, MxMoE automatically generates optimized mixed-precision GroupGEMM kernels, enabling parallel execution of GEMMs with different precisions. Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25-bit and delivering up to 3.4x speedup over full precision, as well as up to 29.4% speedup over uniform quantization at equivalent accuracy with 5-bit weight-activation quantization. Our code is available at this https URL.
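One way to picture the mixed-precision search space is a greedy allocator that gives more bits to experts that are more quantization-sensitive or more frequently activated, under an average-bit budget. The scoring rule and bit choices below are illustrative assumptions; MxMoE's actual optimization is richer and also generates matching GroupGEMM kernels.

```python
# Illustrative greedy bit allocation for MoE experts.
experts = {            # name: (quantization sensitivity, activation frequency)
    "e0": (0.9, 0.40), "e1": (0.2, 0.05),
    "e2": (0.6, 0.30), "e3": (0.3, 0.25),
}
choices, budget_avg_bits = [2, 4, 8], 4.0

bits = {e: min(choices) for e in experts}          # start everything at 2-bit

# Greedily upgrade the highest-scoring expert until the budget would be exceeded.
while True:
    upgradable = [e for e in experts if bits[e] < max(choices)]
    if not upgradable:
        break
    best = max(upgradable, key=lambda e: experts[e][0] * experts[e][1])
    nxt = choices[choices.index(bits[best]) + 1]
    if (sum(bits.values()) - bits[best] + nxt) / len(bits) > budget_avg_bits:
        break
    bits[best] = nxt
print(bits)   # {'e0': 8, 'e1': 2, 'e2': 4, 'e3': 2} under a 4-bit average
```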

[AI-19] Human-in-the-Loop AI for HVAC Management Enhancing Comfort and Energy Efficiency

【Quick Read】: This paper addresses the inability of traditional heating, ventilation, and air conditioning (HVAC) systems to adapt dynamically to real-time electricity market prices and changing individual comfort preferences, which increases energy costs and reduces comfort. The key to the solution is a Human-in-the-Loop (HITL) AI framework that integrates an occupancy prediction model with reinforcement learning to learn from and adapt to user feedback in real time, optimizing HVAC operation to improve energy efficiency and lower costs while safeguarding occupant comfort.

Link: https://arxiv.org/abs/2505.05796
Authors: Xinyu Liang, Frits de Nijs, Buser Say, Hao Wang
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: ACM e-Energy 2025

Abstract:Heating, Ventilation, and Air Conditioning (HVAC) systems account for approximately 38% of building energy consumption globally, making them one of the most energy-intensive services. The increasing emphasis on energy efficiency and sustainability, combined with the need for enhanced occupant comfort, presents a significant challenge for traditional HVAC systems. These systems often fail to dynamically adjust to real-time changes in electricity market rates or individual comfort preferences, leading to increased energy costs and reduced comfort. In response, we propose a Human-in-the-Loop (HITL) Artificial Intelligence framework that optimizes HVAC performance by incorporating real-time user feedback and responding to fluctuating electricity prices. Unlike conventional systems that require predefined information about occupancy or comfort levels, our approach learns and adapts based on ongoing user input. By integrating the occupancy prediction model with reinforcement learning, the system improves operational efficiency and reduces energy costs in line with electricity market dynamics, thereby contributing to demand response initiatives. Through simulations, we demonstrate that our method achieves significant cost reductions compared to baseline approaches while maintaining or enhancing occupant comfort. This feedback-driven approach ensures personalized comfort control without the need for predefined settings, offering a scalable solution that balances individual preferences with economic and environmental goals.
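A sketch of the kind of reward signal such a HITL controller could optimize appears below: the energy term tracks the real-time price, while the comfort term adapts a learned setpoint from occupant feedback. All constants and the update rule are illustrative assumptions, not the paper's formulation.

```python
class ComfortModel:
    """Learns an occupant's preferred temperature from simple feedback votes."""

    def __init__(self, setpoint=22.0, lr=0.3):
        self.setpoint, self.lr = setpoint, lr

    def feedback(self, vote):             # vote: +1 "too cold", -1 "too hot"
        self.setpoint += self.lr * vote   # nudge the learned preference

    def penalty(self, indoor_temp):
        return (indoor_temp - self.setpoint) ** 2

def reward(price_per_kwh, hvac_kwh, indoor_temp, comfort, w_comfort=0.5):
    # Energy cost follows the market price; discomfort is penalized quadratically.
    return -(price_per_kwh * hvac_kwh) - w_comfort * comfort.penalty(indoor_temp)

comfort = ComfortModel()
comfort.feedback(+1)                      # occupant says it is too cold
print(reward(price_per_kwh=0.42, hvac_kwh=1.8, indoor_temp=21.0, comfort=comfort))
```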

[AI-20] What Is Next for LLMs? Next-Generation AI Computing Hardware Using Photonic Chips

【Quick Read】: This paper addresses the high energy consumption and low efficiency of the computing hardware behind large language models (LLMs), particularly the next-generation computing demands of generative AI. The key solution direction is to explore novel photonics-based computing architectures, such as integrated photonic neural networks, ultrafast matrix-operation structures, and tunable modulators and on-chip synaptic elements built with two-dimensional materials, aiming to improve computational throughput and energy efficiency while tackling challenges such as long context windows and the storage of ultra-large datasets.

Link: https://arxiv.org/abs/2505.05794
Authors: Renjie Li, Wenjie Wei, Qi Xin, Xiaoli Liu, Sixuan Mao, Erik Ma, Zijian Chen, Malu Zhang, Haizhou Li, Zhaoyu Zhang
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 36 pages, 22 figures

Abstract:Large language models (LLMs) are rapidly pushing the limits of contemporary computing hardware. For example, training GPT-3 has been estimated to consume around 1300 MWh of electricity, and projections suggest future models may require city-scale (gigawatt) power budgets. These demands motivate exploration of computing paradigms beyond conventional von Neumann architectures. This review surveys emerging photonic hardware optimized for next-generation generative AI computing. We discuss integrated photonic neural network architectures (e.g., Mach-Zehnder interferometer meshes, lasers, wavelength-multiplexed microring resonators) that perform ultrafast matrix operations. We also examine promising alternative neuromorphic devices, including spiking neural network circuits and hybrid spintronic-photonic synapses, which combine memory and processing. The integration of two-dimensional materials (graphene, TMDCs) into silicon photonic platforms is reviewed for tunable modulators and on-chip synaptic elements. Transformer-based LLM architectures (self-attention and feed-forward layers) are analyzed in this context, identifying strategies and challenges for mapping dynamic matrix multiplications onto these novel hardware substrates. We then dissect the mechanisms of mainstream LLMs, such as ChatGPT, DeepSeek, and LLaMA, highlighting their architectural similarities and differences. We synthesize state-of-the-art components, algorithms, and integration methods, highlighting key advances and open issues in scaling such systems to mega-sized LLM models. We find that photonic computing systems could potentially surpass electronic processors by orders of magnitude in throughput and energy efficiency, but require breakthroughs in memory, especially for long-context windows and long token sequences, and in storage of ultra-large datasets.

[AI-21] PyResBugs: A Dataset of Residual Python Bugs for Natural Language-Driven Fault Injection

【Quick Read】: This paper targets residual bugs, defects that escape traditional testing but surface in production. The key to the solution is PyResBugs, a carefully annotated dataset that pairs residual bugs with their corresponding fault-free (fixed) versions and provides multi-level natural language (NL) descriptions, thereby supporting natural language-driven fault injection and offering a new way to simulate real-world faults in software systems.

Link: https://arxiv.org/abs/2505.05777
Authors: Domenico Cotroneo, Giuseppe De Rosa, Pietro Liguori
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents PyResBugs, a curated dataset of residual bugs, i.e., defects that persist undetected during traditional testing but later surface in production, collected from major Python frameworks. Each bug in the dataset is paired with its corresponding fault-free (fixed) version and annotated with multi-level natural language (NL) descriptions. These NL descriptions enable natural language-driven fault injection, offering a novel approach to simulating real-world faults in software systems. By bridging the gap between software fault injection techniques and real-world representativeness, PyResBugs provides researchers with a high-quality resource for advancing AI-driven automated testing in Python systems.

[AI-22] Multi-Agent Systems for Robotic Autonomy with LLMs

【Quick Read】: This paper addresses the low efficiency, high entry barriers, and difficulty of cross-domain collaboration in traditional robotic system development. The key to the solution is a multi-agent framework powered by large language models (LLMs) that integrates three core modules (task analysis, mechanical design, and reinforcement learning policy generation) and produces multimodal outputs to improve understandability and usability, thereby significantly increasing the efficiency and accessibility of robotic system development.

Link: https://arxiv.org/abs/2505.05762
Authors: Junhong Chen, Ziqi Yang, Haoyuan G Xu, Dandan Zhang, George Mylonas
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures, 5 tables, submitted for publication

Abstract:Since the advent of Large Language Models (LLMs), various research efforts based on such models have attracted significant academic attention and impact, especially in AI and robotics. In this paper, we propose a multi-agent framework with LLMs to construct an integrated system for robotic task analysis, mechanical design, and path generation. The framework includes three core agents: Task Analyst, Robot Designer, and Reinforcement Learning Designer. Outputs are formatted as multimodal results, such as code files or technical reports, for stronger understandability and usability. To evaluate generalizability comparatively, we conducted experiments with models from both GPT and DeepSeek. Results demonstrate that the proposed system can design feasible robots with control strategies when appropriate task inputs are provided, exhibiting substantial potential for enhancing the efficiency and accessibility of robotic system development in research and industrial applications.

[AI-23] APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning

【Quick Read】: This paper tackles the difficulty of generating fully correct formal proofs with large language models (LLMs) in formal verification systems. Existing approaches typically prompt an LLM repeatedly until a generated proof passes the verifier, which is inefficient. The key to the proposed APOLLO (Automated PrOof repair via LLM and Lean cOllaboration) framework is to combine the Lean compiler's strengths with the LLM's reasoning abilities: an automated pipeline repairs syntax errors, identifies mistakes in proofs via Lean, isolates failing sub-lemmas, and uses automated solvers plus LLM calls with a low top-K budget to resolve remaining goals, markedly improving the efficiency and accuracy of proof generation under a limited sampling budget.

Link: https://arxiv.org/abs/2505.05758
Authors: Azim Ospanov, Roozbeh Yousefzadeh
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages like Lean. A formal verification system can check whether a formal proof is correct or not almost instantaneously, but generating a completely correct formal proof with large language models (LLMs) remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousands) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model-agnostic pipeline that combines the strengths of the Lean compiler with an LLM’s reasoning abilities to achieve better proof-generation results at a low sampling budget. Apollo directs a fully automated process in which the LLM generates proofs for theorems, a set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sub-lemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low top-K budget. The repaired sub-proofs are recombined and reverified, iterating up to a user-controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state-of-the-art accuracy of 75.0% among 7B-parameter models while keeping the sampling budget below one thousand. Moreover, Apollo raises the state-of-the-art accuracy for Goedel-Prover-SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred. General-purpose models (o3-mini, o4-mini) jump from 3-7% to over 40% accuracy. Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness, suggesting a general paradigm for scalable automated theorem proving.
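The repair loop can be sketched as follows; `lean_check`, `llm_prove`, `auto_solver`, and the other helpers below are trivial stubs standing in for the Lean compiler, the LLM, and automated tactics, not APOLLO's actual interfaces.

```python
def lean_check(theorem, proof):
    """Stub verifier: treat any proof containing `sorry` as incomplete."""
    ok = "sorry" not in proof
    return ok, [], ([] if ok else ["goal_1"])

def fix_syntax(proof, errors):          # cheap mechanical repairs (stub)
    return proof

def auto_solver(goal):                  # automated tactic attempt (stub)
    return None

def llm_prove(goal, n_samples=1):       # pretend the LLM answers (stub)
    return "by simp"

def splice(proof, goal, sub_proof):     # recombine the repaired sub-lemma (stub)
    return proof.replace("sorry", sub_proof)

def prove(theorem, proof, max_rounds=5, top_k=4):
    for _ in range(max_rounds):
        ok, errors, failing_goals = lean_check(theorem, proof)
        if ok:
            return proof
        proof = fix_syntax(proof, errors)
        for goal in failing_goals:                  # isolate failing sub-lemmas
            sub = auto_solver(goal) or llm_prove(goal, n_samples=top_k)
            if sub:
                proof = splice(proof, goal, sub)    # recombine and reverify
    return None                                     # give up within budget

print(prove("n + 0 = n", "sorry"))                  # -> "by simp"
```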

[AI-24] Evolutionary thoughts: integration of large language models and evolutionary algorithms

【Quick Read】: This paper addresses two problems: large language models (LLMs) are prone to hallucination and struggle to find valid solutions when reasoning over complex, novel scenarios, while evolutionary algorithms (EAs) face a computational bottleneck because of the vast search space they must explore on complex problems. The key to the solution is a highly efficient evaluation framework that reduces the cost of evaluating large populations, combined with the LLMs' ability to generate superior candidate solutions, enabling a more effective exploration of expansive solution spaces.

Link: https://arxiv.org/abs/2505.05756
Authors: Antonio Jimeno Yepes, Pieter Barnard
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have unveiled remarkable capabilities in understanding and generating both natural language and code, but LLM reasoning is prone to hallucination and struggles with complex, novel scenarios, often getting stuck on partial or incorrect solutions. The inherent ability of Evolutionary Algorithms (EAs) to explore extensive and complex search spaces makes them particularly effective in scenarios where traditional optimization methodologies may falter. However, EAs must explore a vast search space when applied to complex problems. To address the computational bottleneck of evaluating large populations, particularly crucial for complex evolutionary tasks, we introduce a highly efficient evaluation framework. This implementation maintains compatibility with existing primitive definitions, ensuring the generation of valid individuals. Using LLMs, we propose an enhanced evolutionary search strategy that enables a more focused exploration of expansive solution spaces. LLMs facilitate the generation of superior candidate solutions, as evidenced by empirical results demonstrating their efficacy in producing improved outcomes.

Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Cite as: arXiv:2505.05756 [cs.NE] (or arXiv:2505.05756v1 [cs.NE] for this version)
DOI: https://doi.org/10.48550/arXiv.2505.05756
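The basic integration pattern, an LLM acting as the variation operator inside a cheaply evaluated EA loop, can be sketched as follows; the numeric genome and Gaussian perturbation merely keep the example runnable and stand in for LLM-generated candidates.

```python
import random

def llm_propose(parent):
    """Placeholder for an LLM call that rewrites a candidate solution;
    here it just perturbs a numeric genome so the loop is runnable."""
    return [g + random.gauss(0, 0.1) for g in parent]

def fitness(genome):                     # toy objective: maximize -||x - 1||^2
    return -sum((g - 1.0) ** 2 for g in genome)

population = [[random.uniform(-2, 2) for _ in range(5)] for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    elites = population[:5]              # cheap evaluation keeps this scalable
    population = elites + [llm_propose(random.choice(elites)) for _ in range(15)]
print(round(fitness(max(population, key=fitness)), 4))
```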

[AI-25] Towards Embodiment Scaling Laws in Robot Locomotion

【Quick Read】: This paper addresses the grand challenge of developing generalist agents that operate across diverse tasks, environments, and physical embodiments, focusing on improving generalization to unseen embodiments. The key to the solution is the study of embodiment scaling laws: the hypothesis that increasing the number of distinct embodiments used during training improves zero-shot transfer to new ones. The authors procedurally generate a dataset of roughly 1,000 varied robot-locomotion embodiments and train generalist policies that handle diverse observation and action spaces on random subsets, showing that scaling the number of training embodiments is crucial for embodiment-level generalization.

Link: https://arxiv.org/abs/2505.05753
Authors: Bo Ai, Liu Dai, Nico Bohlinger, Dichen Li, Tongzhou Mu, Zhanxin Wu, K. Fay, Henrik I. Christensen, Jan Peters, Hao Su
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 32 pages. Project website: this https URL

Abstract:Developing generalist agents that can operate across diverse tasks, environments, and physical embodiments is a grand challenge in robotics and artificial intelligence. In this work, we focus on the axis of embodiment and investigate embodiment scaling laws: the hypothesis that increasing the number of training embodiments improves generalization to unseen ones. Using robot locomotion as a test bed, we procedurally generate a dataset of roughly 1,000 varied embodiments, spanning humanoids, quadrupeds, and hexapods, and train generalist policies capable of handling diverse observation and action spaces on random subsets. We find that increasing the number of training embodiments improves generalization to unseen ones, and scaling embodiments is more effective in enabling embodiment-level generalization than scaling data on small, fixed sets of embodiments. Notably, our best policy, trained on the full dataset, zero-shot transfers to novel embodiments in the real world, such as Unitree Go2 and H1. These results represent a step toward general embodied intelligence, with potential relevance to adaptive control for configurable robots, co-design of morphology and control, and beyond.

[AI-26] Accurate and Efficient Multivariate Time Series Forecasting via Offline Clustering

【Quick Read】: This paper addresses the computational complexity of modeling long-range temporal dependencies and inter-entity interactions in multivariate time series (MTS) forecasting, where Transformer-based methods scale quadratically with input length. The key to the solution is FOCUS (Forecaster with Offline Clustering Using Segments), which extracts prototypes via offline clustering that summarize high-level events in the underlying real-world system; in the online phase it dynamically adapts these prototypes to the current input and captures dependencies between input segments and high-level events, reducing the cost of long-range dependency modeling to linear scaling.

Link: https://arxiv.org/abs/2505.05738
Authors: Yiming Niu, Jinliang Deng, Lulu Zhang, Zimu Zhou, Yongxin Tong
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate and efficient multivariate time series (MTS) forecasting is essential for applications such as traffic management and weather prediction, which depend on capturing long-range temporal dependencies and interactions between entities. Existing methods, particularly those based on Transformer architectures, compute pairwise dependencies across all time steps, leading to a computational complexity that scales quadratically with the length of the input. To overcome these challenges, we introduce the Forecaster with Offline Clustering Using Segments (FOCUS), a novel approach to MTS forecasting that simplifies long-range dependency modeling through the use of prototypes extracted via offline clustering. These prototypes encapsulate high-level events in the real-world system underlying the data, summarizing the key characteristics of similar time segments. In the online phase, FOCUS dynamically adapts these patterns to the current input and captures dependencies between the input segment and high-level events, enabling both accurate and efficient forecasting. By identifying prototypes during the offline clustering phase, FOCUS reduces the computational complexity of modeling long-range dependencies in the online phase to linear scaling. Extensive experiments across diverse benchmarks demonstrate that FOCUS achieves state-of-the-art accuracy while significantly reducing computational costs.
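A deliberately simplified caricature of the offline/online split: cluster historical segments into prototypes offline, then forecast online by matching the current context to its nearest prototype. FOCUS adapts prototypes with learned components rather than raw nearest-neighbor lookup, so treat this only as a sketch of the idea.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 60, 1200)) + 0.1 * rng.standard_normal(1200)

seg_len, horizon = 24, 8
# Offline phase: cluster historical windows (context + continuation) into prototypes.
windows = np.stack([series[i:i + seg_len + horizon]
                    for i in range(len(series) - seg_len - horizon)])
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(windows)
prototypes = km.cluster_centers_            # each encodes a "high-level event"

# Online phase: match the current context to the nearest prototype and read off
# its continuation, O(K) per forecast instead of attention over all time steps.
context = series[-seg_len:]
dists = np.linalg.norm(prototypes[:, :seg_len] - context, axis=1)
forecast = prototypes[dists.argmin(), seg_len:]
print(np.round(forecast, 3))
```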

[AI-27] Pretraining a Shared Q-Network for Data-Efficient Offline Reinforcement Learning

【Quick Read】: This paper addresses data inefficiency in offline reinforcement learning (RL), i.e., how to learn the best policy from a limited static dataset. The key to the solution is a simple yet effective pretraining method that uses a shared Q-network structure to predict both the next state and the Q-value, pretraining it with a supervised regression task that predicts the next state, thereby improving data efficiency in offline RL.

Link: https://arxiv.org/abs/2505.05701
Authors: Jongchan Park, Mingyu Park, Donghwan Lee
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Offline reinforcement learning (RL) aims to learn a policy from a static dataset without further interactions with the environment. Collecting sufficiently large datasets for offline RL is exhausting since this data collection requires colossal interactions with environments and becomes tricky when the interaction with the environment is restricted. Hence, how an agent learns the best policy with a minimal static dataset is a crucial issue in offline RL, similar to the sample efficiency problem in online RL. In this paper, we propose a simple yet effective plug-and-play pretraining method to initialize a feature of a Q-network to enhance data efficiency in offline RL. Specifically, we introduce a shared Q-network structure that outputs predictions of the next state and Q-value. We pretrain the shared Q-network through a supervised regression task that predicts a next state, and then train the shared Q-network using diverse offline RL methods. Through extensive experiments, we empirically demonstrate that our method enhances the performance of existing popular offline RL methods on the D4RL, Robomimic and V-D4RL benchmarks. Furthermore, we show that our method significantly boosts data-efficient offline RL across various data qualities and data distributions through the D4RL and ExoRL benchmarks. Notably, our method adapted with only 10% of the dataset outperforms standard algorithms even with full datasets.
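A minimal PyTorch sketch of the shared-torso idea, assuming an MLP backbone and MuJoCo-like dimensions: the same features feed a next-state head (pretrained by regression on the static dataset) and a Q-value head (trained later by the offline RL method of choice).

```python
import torch
import torch.nn as nn

class SharedQNet(nn.Module):
    """Shared torso with two heads: next-state prediction (for the supervised
    pretraining stage) and Q-value estimation (for offline RL afterwards)."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.next_state_head = nn.Linear(hidden, obs_dim)
        self.q_head = nn.Linear(hidden, 1)

    def forward(self, obs, act):
        h = self.torso(torch.cat([obs, act], dim=-1))
        return self.next_state_head(h), self.q_head(h)

# Pretraining step: plain regression onto next states from the static dataset.
net = SharedQNet(obs_dim=17, act_dim=6)
opt = torch.optim.Adam(net.parameters(), lr=3e-4)
obs, act, next_obs = torch.randn(64, 17), torch.randn(64, 6), torch.randn(64, 17)
pred_next, _ = net(obs, act)
loss = nn.functional.mse_loss(pred_next, next_obs)
opt.zero_grad(); loss.backward(); opt.step()
# The pretrained torso then initializes the Q-network of any offline RL method.
```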

[AI-28] Interactive Diabetes Risk Prediction Using Explainable Machine Learning: A Dash-Based Approach with SHAP LIME and Comorbidity Insights

【Quick Read】: This paper targets accurate and interpretable diabetes risk prediction by building a web-based interactive health-risk prediction tool. The key to the solution is comparing multiple machine learning models (Logistic Regression, Random Forest, XGBoost, LightGBM, K-Nearest Neighbors, and Neural Networks) under original-data, SMOTE-oversampling, and undersampling strategies, ultimately selecting LightGBM with undersampling for the best recall and hence stronger risk detection; SHAP and LIME explanations together with Pearson analysis improve the interpretability of predictions, and a Dash-based user interface provides personalized suggestions and feature insights interactively.

Link: https://arxiv.org/abs/2505.05683
Authors: Udaya Allani
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 21 figures, submitted as a preprint for academic dissemination

Abstract:This study presents a web-based interactive health risk prediction tool designed to assess diabetes risk using machine learning models. Built on the 2015 CDC BRFSS dataset, the study evaluates models including Logistic Regression, Random Forest, XGBoost, LightGBM, KNN, and Neural Networks under original, SMOTE, and undersampling strategies. LightGBM with undersampling achieved the best recall, making it ideal for risk detection. The tool integrates SHAP and LIME to explain predictions and highlights comorbidity correlations using Pearson analysis. A Dash-based UI enables user-friendly interaction with model predictions, personalized suggestions, and feature insights, supporting data-driven health awareness.

[AI-29] Closing the Loop: Motion Prediction Models beyond Open-Loop Benchmarks

【Quick Read】: This paper questions whether current learning-based motion prediction models, which keep improving open-loop prediction accuracy, actually translate into better performance inside closed-loop autonomous driving systems. The key of the study is a systematic evaluation of the interplay between state-of-the-art motion predictors and motion planners, finding that higher open-loop accuracy does not always yield better closed-loop driving behavior and that factors such as the temporal consistency of predictions and planner compatibility also matter. The study further shows that, in some cases, downsized models with up to 86% fewer parameters achieve comparable or even superior closed-loop driving performance.

Link: https://arxiv.org/abs/2505.05638
Authors: Mohamed-Khalil Bouzidi, Christian Schlauch, Nicole Scheuerer, Yue Yao, Nadja Klein, Daniel Göhring, Jörg Reichardt
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:Fueled by motion prediction competitions and benchmarks, recent years have seen the emergence of increasingly large learning based prediction models, many with millions of parameters, focused on improving open-loop prediction accuracy by mere centimeters. However, these benchmarks fail to assess whether such improvements translate to better performance when integrated into an autonomous driving stack. In this work, we systematically evaluate the interplay between state-of-the-art motion predictors and motion planners. Our results show that higher open-loop accuracy does not always correlate with better closed-loop driving behavior and that other factors, such as temporal consistency of predictions and planner compatibility, also play a critical role. Furthermore, we investigate downsized variants of these models, and, surprisingly, find that in some cases models with up to 86% fewer parameters yield comparable or even superior closed-loop driving performance. Our code is available at this https URL.

[AI-30] SPIN-ODE: Stiff Physics-Informed Neural ODE for Chemical Reaction Rate Estimation

【Quick Read】: This paper addresses the estimation of rate constants from complex chemical reactions, which is important for advancing detailed chemistry; however, the stiffness inherent in real-world atmospheric chemistry systems causes training instability and poor convergence for learning-based approaches. The key to the proposed Stiff Physics-Informed Neural ODE framework (SPIN-ODE) is a three-stage optimization process: first, a latent neural ODE learns a continuous, differentiable trajectory between chemical concentrations and their time derivatives; second, an explicit Chemical Reaction Neural Network (CRNN) extracts the underlying rate coefficients from the learned dynamics; and third, the CRNN is fine-tuned with a neural ODE solver to further improve the accuracy of the rate-coefficient estimates.

Link: https://arxiv.org/abs/2505.05625
Authors: Wenqing Peng, Zhi-Song Liu, Michael Boy
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Estimating rate constants from complex chemical reactions is essential for advancing detailed chemistry. However, the stiffness inherent in real-world atmospheric chemistry systems poses severe challenges, leading to training instability and poor convergence that hinder effective rate constant estimation using learning-based approaches. To address this, we propose a Stiff Physics-Informed Neural ODE framework (SPIN-ODE) for chemical reaction modelling. Our method introduces a three-stage optimisation process: first, a latent neural ODE learns the continuous and differentiable trajectory between chemical concentrations and their time derivatives; second, an explicit Chemical Reaction Neural Network (CRNN) extracts the underlying rate coefficients based on the learned dynamics; and third, the CRNN is fine-tuned using a neural ODE solver to further improve rate coefficient estimation. Extensive experiments on both synthetic and newly proposed real-world datasets validate the effectiveness and robustness of our approach. As the first work on stiff Neural ODEs for chemical rate coefficient discovery, our study opens promising directions for integrating neural networks with detailed chemistry.

[AI-31] CityNavAgent : Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory

【Quick Read】: This paper addresses the challenges of urban aerial vision-and-language navigation (aerial VLN), i.e., enabling drones to navigate complex urban environments from natural language instructions despite the absence of predefined navigation graphs and the exponentially expanding action space in long-horizon exploration. The key to the solution is CityNavAgent, a large language model (LLM)-powered agent whose core is a hierarchical semantic planning module (HSPM) that decomposes the long-horizon task into sub-goals at different semantic levels and achieves them progressively, complemented by a global memory module that stores historical trajectories in a topological graph to simplify navigation toward previously visited targets.

Link: https://arxiv.org/abs/2505.05622
Authors: Weichen Zhang, Chen Gao, Shiquan Yu, Ruiying Peng, Baining Zhao, Qian Zhang, Jinqiang Cui, Xinlei Chen, Yong Li
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Aerial vision-and-language navigation (VLN), requiring drones to interpret natural language instructions and navigate complex urban environments, emerges as a critical embodied AI challenge that bridges human-robot interaction, 3D spatial reasoning, and real-world deployment. Although existing ground VLN agents achieved notable results in indoor and outdoor settings, they struggle in aerial VLN due to the absence of predefined navigation graphs and the exponentially expanding action space in long-horizon exploration. In this work, we propose CityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity for urban aerial VLN. Specifically, we design a hierarchical semantic planning module (HSPM) that decomposes the long-horizon task into sub-goals with different semantic levels. The agent reaches the target progressively by achieving sub-goals with different capacities of the LLM. Additionally, a global memory module storing historical trajectories into a topological graph is developed to simplify navigation for visited targets. Extensive benchmark experiments show that our method achieves state-of-the-art performance with significant improvement. Further experiments demonstrate the effectiveness of different modules of CityNavAgent for aerial VLN in continuous city environments. The code is available at this https URL.

[AI-32] Leveraging Large Language Models for enzymatic reaction prediction and characterization

【Quick Read】: This paper addresses enzymatic reaction prediction, which has important applications in biocatalysis, metabolic engineering, and drug discovery but remains a complex and resource-intensive task. The key to the solution is leveraging large language models (LLMs), specifically the Llama-3.1 family (8B and 70B), with parameter-efficient fine-tuning (LoRA adapters) and a systematic evaluation of single-task versus multitask learning strategies, in order to capture biochemical knowledge and improve forward-synthesis and retrosynthesis prediction.

Link: https://arxiv.org/abs/2505.05616
Authors: Lorenzo Di Fruscia, Jana Marie Weber
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:

Abstract:Predicting enzymatic reactions is crucial for applications in biocatalysis, metabolic engineering, and drug discovery, yet it remains a complex and resource-intensive task. Large Language Models (LLMs) have recently demonstrated remarkable success in various scientific domains, e.g., through their ability to generalize knowledge, reason over complex structures, and leverage in-context learning strategies. In this study, we systematically evaluate the capability of LLMs, particularly the Llama-3.1 family (8B and 70B), across three core biochemical tasks: Enzyme Commission number prediction, forward synthesis, and retrosynthesis. We compare single-task and multitask learning strategies, employing parameter-efficient fine-tuning via LoRA adapters. Additionally, we assess performance across different data regimes to explore their adaptability in low-data settings. Our results demonstrate that fine-tuned LLMs capture biochemical knowledge, with multitask learning enhancing forward- and retrosynthesis predictions by leveraging shared enzymatic information. We also identify key limitations, for example challenges in hierarchical EC classification schemes, highlighting areas for further improvement in LLM-driven biochemical modeling.
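A hedged sketch of the LoRA fine-tuning setup described, using Hugging Face `transformers` and `peft`; the rank, target modules, and prompt format are assumptions for illustration, not the paper's reported hyperparameters.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.1-8B"          # the 70B model follows the same recipe
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],        # a common choice for Llama models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()              # only the adapter weights train

# One multitask-style training example (the exact format is an assumption):
example = "Task: forward synthesis | EC 1.1.1.1 | Substrate: CCO >> Product:"
```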

[AI-33] scDrugMap: Benchmarking Large Foundation Models for Drug Response Prediction

【Quick Read】: This paper targets drug resistance in cancer therapy: single-cell profiling reveals cellular heterogeneity, yet the application of large foundation models to drug response prediction on single-cell data remains underexplored. The key to the solution is scDrugMap, an integrated framework with both a Python command-line interface and a web server for drug response prediction, which benchmarks a broad set of foundation models on large-scale single-cell datasets under both layer-freezing and Low-Rank Adaptation (LoRA) fine-tuning strategies, providing the first large-scale benchmarking platform of its kind for drug discovery and translational research.

Link: https://arxiv.org/abs/2505.05612
Authors: Qing Wang, Yining Pan, Minghao Zhou, Zijia Tang, Yanfei Wang, Guangyu Wang, Qianqian Song
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 14 pages, 7 figures

Abstract:Drug resistance presents a major challenge in cancer therapy. Single cell profiling offers insights into cellular heterogeneity, yet the application of large-scale foundation models for predicting drug response in single cell data remains underexplored. To address this, we developed scDrugMap, an integrated framework featuring both a Python command-line interface and a web server for drug response prediction. scDrugMap evaluates a wide range of foundation models, including eight single-cell models and two large language models, using a curated dataset of over 326,000 cells in the primary collection and 18,800 cells in the validation set, spanning 36 datasets and diverse tissue and cancer types. We benchmarked model performance under pooled-data and cross-data evaluation settings, employing both layer freezing and Low-Rank Adaptation (LoRA) fine-tuning strategies. In the pooled-data scenario, scFoundation achieved the best performance, with mean F1 scores of 0.971 (layer freezing) and 0.947 (fine-tuning), outperforming the lowest-performing model by over 50%. In the cross-data setting, UCE excelled post fine-tuning (mean F1: 0.774), while scGPT led in zero-shot learning (mean F1: 0.858). Overall, scDrugMap provides the first large-scale benchmark of foundation models for drug response prediction in single-cell data and serves as a user-friendly, flexible platform for advancing drug discovery and translational research.

[AI-34] HiBayES: A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics

【Quick Read】: This paper addresses how to robustly estimate the capabilities of AI systems such as large language models (LLMs) from inherently stochastic outputs while systematically quantifying the uncertainty of those estimates, and it targets the nested hierarchical structure, high complexity, and high testing costs typical of advanced AI evaluations. The key to the solution is HiBayES, a generalizable hierarchical Bayesian modeling framework for AI evaluation statistics built on Generalized Linear Models (GLMs), Bayesian data analysis, and formal model comparison, which delivers principled uncertainty quantification and robust parameter estimation, especially in low-data scenarios.

Link: https://arxiv.org/abs/2505.05602
Authors: Lennart Luettgau, Harry Coppock, Magda Dubois, Christopher Summerfield, Cozmin Ududec
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 23 pages, 9 figures

Abstract:As Large Language Models (LLMs) and other AI systems evolve, robustly estimating their capabilities from inherently stochastic outputs while systematically quantifying uncertainty in these estimates becomes increasingly important. Further, advanced AI evaluations often have a nested hierarchical structure, exhibit high levels of complexity, and come with high costs in testing the most advanced AI systems. To address these challenges, we introduce HiBayES, a generalizable Hierarchical Bayesian modeling framework for AI Evaluation Statistics. HiBayES supports robust inferences in classical question-answer benchmarks and advanced agentic evaluations, particularly in low-data scenarios (e.g., 20 data points per evaluation). Built on Generalized Linear Models (GLMs), Bayesian data analysis, and formal model comparison, HiBayES provides principled uncertainty quantification and robust parameter estimation. This paper offers a comprehensive introduction to HiBayES, including illustrative examples, comparisons to conventional statistical methods, and practical guidance for implementing multilevel Bayesian GLMs. Additionally, we provide a HiBayES software package [4] (Beta version) for out-of-the-box implementation.
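The modeling core can be illustrated with a small hierarchical binomial GLM in PyMC: per-task success counts share partially pooled log-odds drawn from a population-level distribution. Priors and structure here are illustrative, not HiBayES's actual defaults.

```python
import numpy as np
import pymc as pm

# Per-task eval results: successes out of n_trials for one model on 5 tasks.
successes = np.array([17, 12, 19, 8, 14])
n_trials = np.array([20, 20, 20, 20, 20])

with pm.Model() as hier_glm:
    mu = pm.Normal("mu", 0.0, 1.5)                  # population-level ability
    sigma = pm.HalfNormal("sigma", 1.0)             # between-task spread
    theta = pm.Normal("theta", mu, sigma, shape=5)  # task-level log-odds
    pm.Binomial("y", n=n_trials, p=pm.math.invlogit(theta), observed=successes)
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

print(idata.posterior["mu"].mean().item())          # pooled capability estimate
```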

[AI-35] Flight Validation of Learning-Based Trajectory Optimization for the Astrobee Free-Flyer

【Quick Read】: This paper addresses the limited use of trajectory optimization in space applications due to its high computational demands. The key to the solution is using machine learning to accelerate on-board trajectory optimization while preserving theoretical solver guarantees. Concretely, the method builds on the GuSTO sequential convex programming framework and uses a neural network, trained offline, to map problem parameters to effective initial "warm-start" trajectories, paving the way for faster real-time optimization on resource-constrained space platforms.

Link: https://arxiv.org/abs/2505.05588
Authors: Somrita Banerjee, Abhishek Cauligi, Marco Pavone
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to RSS 2025 Workshop on Space Robotics

Abstract:Although widely used in commercial and industrial robotics, trajectory optimization has seen limited use in space applications due to its high computational demands. In this work, we present flight results from experiments with the Astrobee free-flying robot on board the International Space Station (ISS), that demonstrate how machine learning can accelerate on-board trajectory optimization while preserving theoretical solver guarantees. To the best of the authors' knowledge, this is the first-ever demonstration of learning-based control on the ISS. Our approach leverages the GuSTO sequential convex programming framework and uses a neural network, trained offline, to map problem parameters to effective initial "warm-start" trajectories, paving the way for faster real-time optimization on resource-constrained space platforms.

[AI-36] PyTDC: A multimodal machine learning training evaluation and inference platform for biomedical foundation models

【Quick Read】: This paper addresses the fact that existing biomedical benchmarks provide no end-to-end infrastructure for training, evaluating, and running inference with models that integrate multimodal biological data across a broad range of machine learning tasks. The key to the solution is PyTDC, an open-source machine-learning platform that unifies distributed, heterogeneous, continuously updated data sources and model weights and standardizes benchmarking and inference endpoints, providing streamlined training, evaluation, and inference for multimodal biological AI models.

Link: https://arxiv.org/abs/2505.05577
Authors: Alejandro Velez-Arce, Marinka Zitnik
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025

Abstract:Existing biomedical benchmarks do not provide end-to-end infrastructure for training, evaluation, and inference of models that integrate multimodal biological data and a broad range of machine learning tasks in therapeutics. We present PyTDC, an open-source machine-learning platform providing streamlined training, evaluation, and inference software for multimodal biological AI models. PyTDC unifies distributed, heterogeneous, continuously updated data sources and model weights and standardizes benchmarking and inference endpoints. This paper discusses the components of PyTDC’s architecture and, to our knowledge, the first-of-its-kind case study on the introduced single-cell drug-target nomination ML task. We find state-of-the-art methods in graph representation learning and domain-specific methods from graph theory perform poorly on this task. Though we find a context-aware geometric deep learning method that outperforms the evaluated SoTA and domain-specific baseline methods, the model is unable to generalize to unseen cell types or incorporate additional modalities, highlighting PyTDC’s capacity to facilitate an exciting avenue of research developing multimodal, context-aware, foundation models for open problems in biomedical AI.

[AI-37] Griffin: Towards a Graph-Centric Relational Database Foundation Model

【Quick Read】: This paper addresses the lack of multi-task capability and model generalization in relational databases (RDBs), where traditional methods are typically designed for a single RDB task and cannot model complex relational data in a unified way. The key to the solution is Griffin, a foundation model designed for RDBs that unifies the data encoder and task decoder to handle diverse tasks, adds a cross-attention module and a novel aggregator to better capture the complexity of relational data, and is pretrained on both single-table and RDB datasets with advanced encoders and innovative components such as enhanced message-passing neural networks (MPNNs), improving performance in low-data scenarios and transferability across new datasets and tasks.

Link: https://arxiv.org/abs/2505.05568
Authors: Yanbo Wang, Xiyuan Wang, Quan Gan, Minjie Wang, Qibin Yang, David Wipf, Muhan Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Abstract:We introduce Griffin, the first foundation model attempt designed specifically for Relational Databases (RDBs). Unlike previous smaller models focused on single RDB tasks, Griffin unifies the data encoder and task decoder to handle diverse tasks. Additionally, we enhance the architecture by incorporating a cross-attention module and a novel aggregator. Griffin utilizes pretraining on both single-table and RDB datasets, employing advanced encoders for categorical, numerical, and metadata features, along with innovative components such as cross-attention modules and enhanced message-passing neural networks (MPNNs) to capture the complexities of relational data. Evaluated on large-scale, heterogeneous, and temporal graphs extracted from RDBs across various domains (spanning over 150 million nodes), Griffin demonstrates superior or comparable performance to individually trained models, excels in low-data scenarios, and shows strong transferability with similarity and diversity in pretraining across new datasets and tasks, highlighting its potential as a universally applicable foundation model for RDBs. Code available at this https URL.

[AI-38] Would You Rely on an Eerie Agent? A Systematic Review of the Impact of the Uncanny Valley Effect on Trust in Human-Agent Interaction

【Quick Read】: This paper addresses the lack of a systematic understanding of the relationship between human trust in artificial agents and the Uncanny Valley Effect (UVE): existing studies differ widely in how these concepts are defined and operationalized, so it remains unclear under what conditions the UVE influences trust in agents. The key to the solution is a systematic literature review, following PRISMA guidelines, of 53 empirical studies, identifying methodological patterns, limitations, and research gaps, and proposing a novel framework for classifying trust measurement approaches to foster a deeper understanding of the interplay between the UVE and trust.

Link: https://arxiv.org/abs/2505.05543
Authors: Ahdiyeh Alipour, Tilo Hartmann, Maryam Alimardani
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 75 pages, Figure 11, Table 5

Abstract:Trust is a fundamental component of human-agent interaction. With the increasing presence of artificial agents in daily life, it is essential to understand how people perceive and trust these agents. One of the key challenges affecting this perception is the Uncanny Valley Effect (UVE), where increasingly human-like artificial beings can be perceived as eerie or repelling. Despite growing interest in trust and the UVE, existing research varies widely in terms of how these concepts are defined and operationalized. This inconsistency raises important questions about how and under what conditions the UVE influences trust in agents. A systematic understanding of their relationship is currently lacking. This review aims to examine the impact of the UVE on human trust in agents and to identify methodological patterns, limitations, and gaps in the existing empirical literature. Following PRISMA guidelines, a systematic search identified 53 empirical studies that investigated both UVE-related constructs and trust or trust-related outcomes. Studies were analyzed based on a structured set of categories, including types of agents and interactions, methodological and measurement approaches, and key findings. The results of our systematic review reveal that most studies rely on static images or hypothetical scenarios with limited real-time interaction, and the majority use subjective trust measures. This review offers a novel framework for classifying trust measurement approaches with regard to the best-practice criteria for empirically investigating the UVE. As the first systematic attempt to map the intersection of UVE and trust, this review contributes to a deeper understanding of their interplay and offers a foundation for future research. Keywords: the uncanny valley effect, trust, human-likeness, affinity response, human-agent interaction

[AI-39] Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods

【Quick Read】: This paper argues that current AI safety evaluation methods fall short of comprehensively measuring the potential risks of frontier AI systems, particularly for assuring safety and informing governance decisions. The key to the solution is a systematic taxonomy organized around three dimensions: what properties we measure, how we measure them, and how these measurements integrate into evaluation frameworks. Going beyond benchmarks, the review emphasizes measuring what models can do when pushed to the limit (capabilities), their default behavioral tendencies (propensities), and whether safety measures remain effective against subversive adversarial AI (control), combining behavioral techniques with internal analysis methods.

Link: https://arxiv.org/abs/2505.05541
Authors: Markov Grey, Charbel-Raphaël Segerie
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:As frontier AI systems advance toward transformative capabilities, we need a parallel transformation in how we measure and evaluate these systems to ensure safety and inform governance. While benchmarks have been the primary method for estimating model capabilities, they often fail to establish true upper bounds or predict deployment behavior. This literature review consolidates the rapidly evolving field of AI safety evaluations, proposing a systematic taxonomy around three dimensions: what properties we measure, how we measure them, and how these measurements integrate into frameworks. We show how evaluations go beyond benchmarks by measuring what models can do when pushed to the limit (capabilities), the behavioral tendencies exhibited by default (propensities), and whether our safety measures remain effective even when faced with subversive adversarial AI (control). These properties are measured through behavioral techniques like scaffolding, red teaming and supervised fine-tuning, alongside internal techniques such as representation analysis and mechanistic interpretability. We provide deeper explanations of some safety-critical capabilities like cybersecurity exploitation, deception, autonomous replication, and situational awareness, alongside concerning propensities like power-seeking and scheming. The review explores how these evaluation methods integrate into governance frameworks to translate results into concrete development decisions. We also highlight challenges to safety evaluations - proving absence of capabilities, potential model sandbagging, and incentives for “safetywashing” - while identifying promising research directions. By synthesizing scattered resources, this literature review aims to provide a central reference point for understanding AI safety evaluations.

[AI-40] Cardioformer: Advancing AI in ECG Analysis with Multi-Granularity Patching and ResNet

【Quick Read】: This paper addresses the difficulty of simultaneously capturing local morphological details and long-range temporal dependencies in electrocardiogram (ECG) classification. The key to the solution is Cardioformer, a multi-granularity hybrid model that combines cross-channel patching, hierarchical residual learning, and a two-stage self-attention mechanism: multi-scale token embeddings encode fine-grained local features and global contextual information, and intra- and inter-granularity self-attention selectively fuses these representations, improving ECG classification performance and generalization.

Link: https://arxiv.org/abs/2505.05538
Authors: Md Kamrujjaman Mobin, Md Saiful Islam, Sadik Al Barid, Md Masum
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Abstract:Electrocardiogram (ECG) classification is crucial for automated cardiac disease diagnosis, yet existing methods often struggle to capture local morphological details and long-range temporal dependencies simultaneously. To address these challenges, we propose Cardioformer, a novel multi-granularity hybrid model that integrates cross-channel patching, hierarchical residual learning, and a two-stage self-attention mechanism. Cardioformer first encodes multi-scale token embeddings to capture fine-grained local features and global contextual information and then selectively fuses these representations through intra- and inter-granularity self-attention. Extensive evaluations on three benchmark ECG datasets under subject-independent settings demonstrate that the model consistently outperforms four state-of-the-art baselines. Our Cardioformer model achieves AUROC of 96.34±0.11, 89.99±0.12, and 95.59±1.66 on the MIMIC-IV, PTB-XL, and PTB datasets, respectively, outperforming the PatchTST, Reformer, Transformer, and Medformer models. It also demonstrates strong cross-dataset generalization, achieving 49.18% AUROC on PTB and 68.41% on PTB-XL when trained on MIMIC-IV. These findings underscore the potential of Cardioformer to advance automated ECG analysis, paving the way for more accurate and robust cardiovascular disease diagnosis. We release the source code at this https URL.

[AI-41] Rethinking Graph Contrastive Learning through Relative Similarity Preservation IJCAI2025

【Quick Read】: This paper addresses a fundamental challenge of graph contrastive learning (GCL) on graph-structured data: conventional methods preserve absolute similarity between augmented views, but given the discrete, non-Euclidean nature of graphs, view generation often breaks semantic validity and similarity verification becomes unreliable. The key to the solution builds on a universal pattern discovered across graphs, namely that label consistency systematically diminishes as structural distance increases: the proposed RELGCL framework preserves the relative similarity patterns inherent in graphs through collective similarity objectives, rather than relying on artificially constructed absolute similarity.

Link: https://arxiv.org/abs/2505.05533
Authors: Zhiyuan Ning, Pengfei Wang, Ziyue Qiao, Pengyang Wang, Yuanchun Zhou
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by IJCAI2025; full version including appendix

Abstract:Graph contrastive learning (GCL) has achieved remarkable success by following the computer vision paradigm of preserving absolute similarity between augmented views. However, this approach faces fundamental challenges in graphs due to their discrete, non-Euclidean nature – view generation often breaks semantic validity and similarity verification becomes unreliable. Through analyzing 11 real-world graphs, we discover a universal pattern transcending the homophily-heterophily dichotomy: label consistency systematically diminishes as structural distance increases, manifesting as smooth decay in homophily graphs and oscillatory decay in heterophily graphs. We establish theoretical guarantees for this pattern through random walk theory, proving label distribution convergence and characterizing the mechanisms behind different decay behaviors. This discovery reveals that graphs naturally encode relative similarity patterns, where structurally closer nodes exhibit collectively stronger semantic relationships. Leveraging this insight, we propose RELGCL, a novel GCL framework with complementary pairwise and listwise implementations that preserve these inherent patterns through collective similarity objectives. Extensive experiments demonstrate that our method consistently outperforms 20 existing approaches across both homophily and heterophily graphs, validating the effectiveness of leveraging natural relative similarity over artificial absolute similarity.

[AI-42] Low-bit Model Quantization for Deep Neural Networks: A Survey

【Quick Read】: This paper addresses the problem that deep neural networks (DNNs) are hard to deploy in practice because of their high computational cost and large model size. The core solution is model quantization, which converts continuous floating-point numbers into discrete integers, significantly improving memory I/O and computational efficiency. Since this conversion loses precision, the key research question is how to perform the conversion and compensate for the information loss so as to minimize performance degradation.

Link: https://arxiv.org/abs/2505.05530
Authors: Kai Liu, Qian Zheng, Kaiwen Tao, Zhiteng Li, Haotong Qin, Wenbo Li, Yong Guo, Xianglong Liu, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: We have systematically collected and reviewed the state-of-the-art quantization methods from the past five years, categorizing them into eight distinct groups. A curated list of model quantization is provided at this https URL

Abstract:With unprecedented rapid development, deep neural networks (DNNs) have deeply influenced almost all fields. However, their heavy computation costs and model sizes are usually unacceptable in real-world deployment. Model quantization, an effective weight-lightening technique, has become an indispensable procedure in the whole deployment pipeline. The essence of quantization acceleration is the conversion from continuous floating-point numbers to discrete integer ones, which significantly speeds up the memory I/O and calculation, i.e., addition and multiplication. However, performance degradation also comes with the conversion because of the loss of precision. Therefore, it has become increasingly popular and critical to investigate how to perform the conversion and how to compensate for the information loss. This article surveys the recent five-year progress towards low-bit quantization on DNNs. We discuss and compare the state-of-the-art quantization methods and classify them into 8 main categories and 24 sub-categories according to their core techniques. Furthermore, we shed light on the potential research opportunities in the field of model quantization. A curated list of model quantization is provided at this https URL.
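The float-to-integer conversion at the heart of quantization is easy to state concretely; below is a standard uniform affine (asymmetric) quantize/dequantize round trip, a common baseline among the surveyed methods.

```python
import numpy as np

def affine_quantize(x, n_bits=8):
    """Uniform affine quantization: scale and zero-point from the tensor range."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(256).astype(np.float32)
q, s, z = affine_quantize(w, n_bits=4)
print("max abs error:", np.abs(affine_dequantize(q, s, z) - w).max())
```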

[AI-43] ADMM-Based Training for Spiking Neural Networks

【Quick Read】: This paper addresses the lack of dedicated, efficient training algorithms for spiking neural networks (SNNs), in particular the poor scalability and numerical imprecision exhibited by surrogate-gradient backpropagation when used as an SNN optimizer. The key to the solution is a novel SNN training method based on the alternating direction method of multipliers (ADMM), which targets the non-differentiability of the SNN step function; the authors formulate the problem, derive closed-form updates, and empirically analyze the optimizer's convergence properties, potential, and possible research directions for improving the method.

Link: https://arxiv.org/abs/2505.05527
Authors: Giovanni Perin, Cesare Bidini, Riccardo Mazzieri, Michele Rossi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP); Optimization and Control (math.OC)
Comments: 6 pages, 4 figures. Preprint submitted to IEEE MLSP 2025

Abstract:In recent years, spiking neural networks (SNNs) have gained momentum due to their high potential in time-series processing combined with minimal energy consumption. However, they still lack a dedicated and efficient training algorithm. The popular backpropagation with surrogate gradients, adapted from stochastic gradient descent (SGD)-derived algorithms, has several drawbacks when used as an optimizer for SNNs. Specifically, it suffers from low scalability and numerical imprecision. In this paper, we propose a novel SNN training method based on the alternating direction method of multipliers (ADMM). Our ADMM-based training aims to solve the problem of the SNN step function’s non-differentiability. We formulate the problem, derive closed-form updates, and empirically show the optimizer’s convergence properties, great potential, and possible new research directions to improve the method in a simulated proof-of-concept.
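For readers unfamiliar with ADMM, the generic scaled-form updates for a split objective are shown below; this is the textbook template such a method builds on, with the non-differentiable part isolated in g so its proximal step can have a closed form. It is not the paper's exact derivation.

```latex
% Scaled-form ADMM for: min_{W,a} f(W) + g(a)  subject to  h(W) = a
\begin{aligned}
W^{k+1} &= \arg\min_{W} \; f(W) + \tfrac{\rho}{2}\,\lVert h(W) - a^{k} + u^{k} \rVert_2^2, \\
a^{k+1} &= \arg\min_{a} \; g(a) + \tfrac{\rho}{2}\,\lVert h(W^{k+1}) - a + u^{k} \rVert_2^2, \\
u^{k+1} &= u^{k} + h(W^{k+1}) - a^{k+1}.
\end{aligned}
```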

[AI-44] Continuous Thought Machines

【Quick Read】: This paper tackles the loss of temporal dynamics caused by the simplified abstraction of neural activity in conventional deep learning architectures, i.e., how to reintroduce neural timing into deep models so that information processing more closely resembles biological brains. The key to the proposed Continuous Thought Machine (CTM) lies in two core innovations: neuron-level temporal processing, in which each neuron uses unique weight parameters to process a history of incoming signals, and neural synchronization employed as a latent representation, striking a balance between computational efficiency and biological realism while effectively capturing essential temporal dynamics.

Link: https://arxiv.org/abs/2505.05522
Authors: Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, Llion Jones
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Technical report accompanied by online project page

Abstract:Biological brains demonstrate complex neural activity, where the timing and interplay between neurons is critical to how brains process information. Most deep learning architectures simplify neural activity by abstracting away temporal dynamics. In this paper we challenge that paradigm. By incorporating neuron-level processing and synchronization, we can effectively reintroduce neural timing as a foundational element. We present the Continuous Thought Machine (CTM), a model designed to leverage neural dynamics as its core representation. The CTM has two core innovations: (1) neuron-level temporal processing, where each neuron uses unique weight parameters to process a history of incoming signals; and (2) neural synchronization employed as a latent representation. The CTM aims to strike a balance between oversimplified neuron abstractions that improve computational efficiency, and biological realism. It operates at a level of abstraction that effectively captures essential temporal dynamics while remaining computationally tractable for deep learning. We demonstrate the CTM’s strong performance and versatility across a range of challenging tasks, including ImageNet-1K classification, solving 2D mazes, sorting, parity computation, question-answering, and RL tasks. Beyond displaying rich internal representations and offering a natural avenue for interpretation owing to its internal process, the CTM is able to perform tasks that require complex sequential reasoning. The CTM can also leverage adaptive compute, where it can stop earlier for simpler tasks, or keep computing when faced with more challenging instances. The goal of this work is to share the CTM and its associated innovations, rather than pushing for new state-of-the-art results. To that end, we believe the CTM represents a significant step toward developing more biologically plausible and powerful artificial intelligence systems.
zh
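
为直观理解上述两个核心创新,下面用 NumPy 写一个极简示意(非官方实现,张量形状与命名均为本文假设):(1) 每个神经元对自身输入历史使用一组私有权重;(2) 以神经元激活轨迹的两两同步矩阵作为潜在表示。

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 16, 8                              # 内部 "思考步" 数、神经元数(假设)
history = rng.normal(size=(T, N))         # 每个神经元的输入信号历史(占位数据)

# (1) 神经元级时间处理:每个神经元用私有权重加权自身历史,而非共享激活函数
W = rng.normal(size=(N, T))
post = np.einsum('nt,tn->n', W, history)  # post[n] = sum_t W[n, t] * history[t, n]

# (2) 神经同步作为潜在表示:神经元激活轨迹的两两内积矩阵
sync = (history.T @ history) / T          # N x N 同步矩阵
latent = sync[np.triu_indices(N)]         # 上三角展平,供下游读出
print(post.shape, latent.shape)
```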

[AI-45] An Overview of the Prospects and Challenges of Using Artificial Intelligence for Energy Management Systems in Microgrids

【速读】:该论文试图解决微电网在能源管理中面临的一系列挑战,包括可再生能源需求与生产的可靠预测、抵御网络攻击、控制运行成本、优化功率流动以及调节能源管理系统(EMS)的性能。解决方案的关键在于引入基于人工智能(AI)的方法,以提高微电网能源管理的效率和可靠性,通过AI技术实现特定的技术和经济目标,并探索未来研究方向,如自愈微电网、与区块链技术的集成、物联网(IoT)的应用,以及解决可解释性、数据隐私、可扩展性等问题,同时关注生成式AI在未来的应用前景。

链接: https://arxiv.org/abs/2505.05498
作者: Noor ul Misbah Khanum,Hayssam Dahrouj,Ramesh C. Bansal,Hissam Mouayad Tawfik
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: 70 pages, 7 figures

点击查看摘要

Abstract:Microgrids have emerged as a pivotal solution in the quest for a sustainable and energy-efficient future. While microgrids offer numerous advantages, they are also prone to issues related to reliably forecasting renewable energy demand and production, protecting against cyberattacks, controlling operational costs, optimizing power flow, and regulating the performance of energy management systems (EMS). Tackling these energy management challenges is essential to facilitate microgrid applications and seamlessly incorporate renewable energy resources. Artificial intelligence (AI) has recently demonstrated immense potential for optimizing energy management in microgrids, providing efficient and reliable solutions. This paper highlights the combined benefits of enabling AI-based methodologies in the energy management systems of microgrids by examining the applicability and efficiency of AI-based EMS in achieving specific technical and economic objectives. The paper also points out several future research directions that promise to spearhead AI-driven EMS, namely the development of self-healing microgrids, integration with blockchain technology, use of Internet of things (IoT), and addressing interpretability, data privacy, scalability, and the prospects to generative AI in the context of future AI-based EMS.
zh

[AI-46] An Automated LLM-based Pipeline for Asset-Level Database Creation to Assess Deforestation Impact ACL

【速读】:该论文试图解决由于《欧盟森林砍伐法规》(European Union Deforestation Regulation, EUDR)要求企业证明其产品不导致森林砍伐,从而产生的对精确、资产级环境影响数据的迫切需求问题。现有数据库缺乏必要的细节,主要依赖于广泛的财务指标和人工数据收集,这限制了监管合规性和准确的环境建模。解决方案的关键在于提出一种自动化、端到端的数据提取流程,利用大语言模型(LLMs)创建、清理和验证结构化数据库,并引入基于指令、角色的零样本思维链(IRZ-CoT)提示方法以提高数据提取准确性,以及结合实时网络搜索的检索增强验证(RAV)过程以提升数据可靠性。

链接: https://arxiv.org/abs/2505.05494
作者: Avanija Menon,Ovidiu Serban
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted to ACL ClimateNLP 2025

点击查看摘要

Abstract:The European Union Deforestation Regulation (EUDR) requires companies to prove their products do not contribute to deforestation, creating a critical demand for precise, asset-level environmental impact data. Current databases lack the necessary detail, relying heavily on broad financial metrics and manual data collection, which limits regulatory compliance and accurate environmental modeling. This study presents an automated, end-to-end data extraction pipeline that uses LLMs to create, clean, and validate structured databases, specifically targeting sectors with a high risk of deforestation. The pipeline introduces Instructional, Role-Based, Zero-Shot Chain-of-Thought (IRZ-CoT) prompting to enhance data extraction accuracy and a Retrieval-Augmented Validation (RAV) process that integrates real-time web searches for improved data reliability. Applied to SEC EDGAR filings in the Mining, Oil & Gas, and Utilities sectors, the pipeline demonstrates significant improvements over traditional zero-shot prompting approaches, particularly in extraction accuracy and validation coverage. This work advances NLP-driven automation for regulatory compliance, CSR (Corporate Social Responsibility), and ESG, with broad sectoral applicability.
zh
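
下面是一个示意性的 IRZ-CoT 提示词组装函数(提示词措辞与输出字段均为本文举例假设,并非论文附录原文),展示"指令 + 角色 + 零样本思维链"三要素如何拼接:

```python
# 按 "角色 + 指令 + 零样本思维链" 组装提示词;措辞与字段名均为假设
def build_irz_cot_prompt(filing_text: str) -> str:
    role = "You are an ESG data analyst extracting asset-level facts from SEC filings."
    instruction = (
        "Extract every physical asset mentioned below as JSON with fields "
        "{name, location, sector, commodity}."
    )
    zero_shot_cot = "Let's think step by step before producing the final JSON."
    return f"{role}\n\n{instruction}\n\n{filing_text}\n\n{zero_shot_cot}"

print(build_irz_cot_prompt("Example filing: the company operates a copper mine in Chile."))
```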

[AI-47] FedAvgen: Metadata for Model Aggregation In Communication Systems

【速读】:该论文旨在解决联邦学习中由于设备配置多样性导致的模型聚合效率与泛化能力不足的问题。其解决方案的关键在于引入一种元启发式算法(FedAvgen),通过将预训练模型与其权重空间分别视为表型和基因,模拟父代-子代的遗传进化过程,以优化全局模型的平均步骤。

链接: https://arxiv.org/abs/2505.05486
作者: Anthony Kiggundu,Dennis Krummacker,Hans D. Schotten
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted in IEEE NetSoft 2025

点击查看摘要

Abstract:To improve business efficiency and minimize costs, Artificial Intelligence (AI) practitioners have adopted a shift from formulating models from scratch towards sharing pretrained models. The pretrained models are then aggregated into a global model with higher generalization capabilities, which is afterwards distributed to the client devices. This approach is known as federated learning and inherently utilizes different techniques to select the candidate client models averaged to obtain the global model. This approach, in the case of communication systems, faces challenges arising from the existential diversity in device profiles. The multiplicity in profiles motivates our conceptual assessment of a metaheuristic algorithm (FedAvgen), which relates each pretrained model with its weight space as metadata, to a phenotype and genotype, respectively. This parent-child genetic evolution characterizes the global averaging step in federated learning. We then compare the results of our approach to two widely adopted baseline federated learning algorithms like Federated Averaging (FedAvg) and Federated Stochastic Gradient Descent (FedSGD).
zh
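
作为参照,下面给出论文所比较的 FedAvg 基线中"全局平均步"的极简示意(FedAvgen 正是在这一步上引入表型/基因型式的遗传选择;代码为通用写法,非论文实现):

```python
import numpy as np

# FedAvg:按客户端样本量加权平均各客户端的模型权重
def fedavg(client_weights, client_sizes):
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

clients = [np.random.default_rng(i).normal(size=4) for i in range(3)]  # 3 个客户端的权重向量(占位)
sizes = [100, 50, 150]
print(fedavg(clients, sizes))
```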

[AI-48] CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations

【速读】:该论文旨在解决模仿学习中因需要大量昂贵的动作标注专家示范数据而导致的训练数据规模受限问题。其关键解决方案是设计连续潜在动作模型(Continuous Latent Action Models, CLAM),该模型通过两个核心要素提升从无标签观察数据中学习复杂连续控制任务的能力:一是采用连续潜在动作标签而非离散表示,二是联合训练动作解码器以确保潜在动作空间能够通过少量标注示例轻松映射到实际动作。这一方法使CLAM能够在无需任何动作标注专家数据的情况下,利用非最优游戏数据学习出性能优越的机器人策略。

链接: https://arxiv.org/abs/2505.04999
作者: Anthony Liang,Pavel Czempin,Matthew Hong,Yutai Zhou,Erdem Biyik,Stephen Tu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Latent Action Models, Self-supervised Pretraining, Learning from Videos

点击查看摘要

Abstract:Learning robot policies using imitation learning requires collecting large amounts of costly action-labeled expert demonstrations, which fundamentally limits the scale of training data. A promising approach to address this bottleneck is to harness the abundance of unlabeled observations-e.g., from video demonstrations-to learn latent action labels in an unsupervised way. However, we find that existing methods struggle when applied to complex robot tasks requiring fine-grained motions. We design continuous latent action models (CLAM) which incorporate two key ingredients we find necessary for learning to solve complex continuous control tasks from unlabeled observation data: (a) using continuous latent action labels instead of discrete representations, and (b) jointly training an action decoder to ensure that the latent action space can be easily grounded to real actions with relatively few labeled examples. Importantly, the labeled examples can be collected from non-optimal play data, enabling CLAM to learn performant policies without access to any action-labeled expert data. We demonstrate on continuous control benchmarks in DMControl (locomotion) and MetaWorld (manipulation), as well as on a real WidowX robot arm that CLAM significantly outperforms prior state-of-the-art methods, remarkably with a 2-3x improvement in task success rate compared to the best baseline. Videos and code can be found at this http URL.
zh

[AI-49] L2R: Learning to Reduce Search Space for Generalizable Neural Routing Solver

【速读】:该论文旨在解决构造性神经组合优化(Constructive Neural Combinatorial Optimization, NCO)方法在处理大规模问题时泛化能力不足的问题,主要挑战包括高计算复杂度和对结构模式捕捉效率低下。其解决方案的关键在于提出一种基于学习的搜索空间缩减方法,该方法在构造性NCO过程中自适应地选择少量有潜力的候选节点,通过动态优先级排序机制替代传统固定启发式方法,从而显著缩小搜索空间并保持解的质量。

链接: https://arxiv.org/abs/2503.03137
作者: Changliang Zhou,Xi Lin,Zhenkun Wang,Qingfu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 23 pages, 10 figures

点击查看摘要

Abstract:Constructive neural combinatorial optimization (NCO) has attracted growing research attention due to its ability to solve complex routing problems without relying on handcrafted rules. However, existing NCO methods face significant challenges in generalizing to large-scale problems due to high computational complexity and inefficient capture of structural patterns. To address this issue, we propose a novel learning-based search space reduction method that adaptively selects a small set of promising candidate nodes at each step of the constructive NCO process. Unlike traditional methods that rely on fixed heuristics, our selection model dynamically prioritizes nodes based on learned patterns, significantly reducing the search space while maintaining solution quality. Experimental results demonstrate that our method, trained solely on 100-node instances from uniform distribution, generalizes remarkably well to large-scale Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) instances with up to 1 million nodes from the uniform distribution and over 80K nodes from other distributions.
zh
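
下面用一个贪心构造 TSP 的玩具示意说明"每步只保留少量有潜力的候选节点"的搜索空间缩减思想。打分函数在论文中由模型学习得到,此处以负距离代替,属于示意性假设:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 16
coords = rng.random((n, 2))

def score(cur, cand):                # 论文中由学习模型给出的节点打分;此处以负距离代替(假设)
    return -np.linalg.norm(coords[cand] - coords[cur], axis=1)

tour, visited = [0], {0}
while len(tour) < n:
    cur = tour[-1]
    cand = np.array([i for i in range(n) if i not in visited])
    topk = cand[np.argsort(score(cur, cand))[-k:]]  # 每步只保留 k 个候选节点,大幅缩小搜索空间
    nxt = int(topk[-1])              # 实际方法中由构造式策略在候选内选择;此处取分数最高者
    tour.append(nxt)
    visited.add(nxt)

length = sum(np.linalg.norm(coords[tour[i + 1]] - coords[tour[i]]) for i in range(n - 1))
print("greedy tour length:", round(length, 3))
```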

[AI-50] Instance-Conditioned Adaptation for Large-scale Generalization of Neural Combinatorial Optimization

【速读】:该论文旨在解决神经组合优化(Neural Combinatorial Optimization, NCO)方法在处理大规模实例时的局限性,即现有构造性NCO方法无法直接求解大规模问题,从而限制了其应用前景。解决方案的关键在于提出一种新型的实例条件自适应模型(Instance-Conditioned Adaptation Model, ICAM),该模型通过设计一个强大且轻量的实例条件自适应模块,使NCO模型能够生成适用于不同规模实例的更优解。此外,还开发了一种基于强化学习的三阶段高效训练方案,使模型能够在无需任何最优解标签的情况下学习跨规模特征。

链接: https://arxiv.org/abs/2405.01906
作者: Changliang Zhou,Xi Lin,Zhenkun Wang,Xialiang Tong,Mingxuan Yuan,Qingfu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 17 pages, 6 figures

点击查看摘要

Abstract:The neural combinatorial optimization (NCO) approach has shown great potential for solving routing problems without the requirement of expert knowledge. However, existing constructive NCO methods cannot directly solve large-scale instances, which significantly limits their application prospects. To address these crucial shortcomings, this work proposes a novel Instance-Conditioned Adaptation Model (ICAM) for better large-scale generalization of neural combinatorial optimization. In particular, we design a powerful yet lightweight instance-conditioned adaptation module for the NCO model to generate better solutions for instances across different scales. In addition, we develop an efficient three-stage reinforcement learning-based training scheme that enables the model to learn cross-scale features without any labeled optimal solution. Experimental results show that our proposed method is capable of obtaining excellent results with a very fast inference time in solving Traveling Salesman Problems (TSPs) and Capacitated Vehicle Routing Problems (CVRPs) across different scales. To the best of our knowledge, our model achieves state-of-the-art performance among all RL-based constructive methods for TSP and CVRP with up to 1,000 nodes.
zh

[AI-51] Turbo-ICL: In-Context Learning-Based Turbo Equalization

【速读】:该论文旨在解决在编码多输入多输出(MIMO)系统中,软输入软输出信道均衡的问题,尤其是在传统线性假设失效的情况下,如低分辨率量化存在时的性能瓶颈。其解决方案的关键在于提出一种受大语言模型(LLMs)启发的上下文学习(ICL)框架,该框架通过从导频信号和解码器反馈的提示中直接学习后验符号分布,并利用提示增强技术将解码器输出的外信息作为额外上下文,从而在迭代的Turbo解码过程中不断优化符号估计。

链接: https://arxiv.org/abs/2505.06175
作者: Zihang Song,Matteo Zecchin,Bipin Rajendran,Osvaldo Simeone
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces a novel in-context learning (ICL) framework, inspired by large language models (LLMs), for soft-input soft-output channel equalization in coded multiple-input multiple-output (MIMO) systems. The proposed approach learns to infer posterior symbol distributions directly from a prompt of pilot signals and decoder feedback. A key innovation is the use of prompt augmentation to incorporate extrinsic information from the decoder output as additional context, enabling the ICL model to refine its symbol estimates iteratively across turbo decoding iterations. Two model variants, based on Transformer and state-space architectures, are developed and evaluated. Extensive simulations demonstrate that, when traditional linear assumptions break down, e.g., in the presence of low-resolution quantization, ICL equalizers consistently outperform conventional model-based baselines, even when the latter are provided with perfect channel state information. Results also highlight the advantage of Transformer-based models under limited training diversity, as well as the efficiency of state-space models in resource-constrained scenarios.
zh

[AI-52] FlowHFT: Flow Policy Induced Optimal High-Frequency Trading under Diverse Market Conditions

【速读】:该论文旨在解决高频交易(High-frequency trading, HFT)中传统方法在动态、多变且易受突发波动影响的现实市场环境中表现受限的问题。传统HFT方法依赖历史数据建模,并假设未来市场状态与过去相似,导致其在特定训练条件下效果良好,但在其他市场情境下性能显著下降。论文提出的解决方案是FlowHFT,其关键在于采用基于流匹配策略(flow matching policy)的模仿学习框架,能够同时从多个专家模型中学习不同市场情景下的交易策略,并通过网格搜索微调机制提升策略适应性,从而在复杂或极端市场条件下实现优于单一专家模型的性能。

链接: https://arxiv.org/abs/2505.05784
作者: Yang Li,Zhi Chen,Steve Yang
机构: 未知
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP)
备注: 14 pages, 1 figure, 6 tables, 2 algorithms

点击查看摘要

Abstract:High-frequency trading (HFT) is an investing strategy that continuously monitors market states and places bid and ask orders at millisecond speeds. Traditional HFT approaches fit models with historical data and assume that future market states follow similar patterns. This limits the effectiveness of any single model to the specific conditions it was trained for. Additionally, these models achieve optimal solutions only under specific market conditions, such as assumptions about stock price’s stochastic process, stable order flow, and the absence of sudden volatility. Real-world markets, however, are dynamic, diverse, and frequently volatile. To address these challenges, we propose the FlowHFT, a novel imitation learning framework based on flow matching policy. FlowHFT simultaneously learns strategies from numerous expert models, each proficient in particular market scenarios. As a result, our framework can adaptively adjust investment decisions according to the prevailing market state. Furthermore, FlowHFT incorporates a grid-search fine-tuning mechanism. This allows it to refine strategies and achieve superior performance even in complex or extreme market scenarios where expert strategies may be suboptimal. We test FlowHFT in multiple market environments. We first show that flow matching policy is applicable in stochastic market environments, thus enabling FlowHFT to learn trading strategies under different market conditions. Notably, our single framework consistently achieves performance superior to the best expert for each market condition.
zh

[AI-53] rading Under Uncertainty: A Distribution-Based Strategy for Futures Markets Using FutureQuant Transformer

【速读】:该论文试图解决传统期货交易中由于大量数据和变量(如实时限价订单簿)导致的价格预测复杂性问题。解决方案的关键在于引入FutureQuant Transformer模型,该模型利用注意力机制来解析复杂的市场模式,从而在价格范围和波动率的预测上优于传统模型,实现了更优的风险管理和显著的平均收益提升。

链接: https://arxiv.org/abs/2505.05595
作者: Wenhao Guo,Yuda Wang,Zeqiao Huang,Changjiang Zhang,Shumin ma
机构: 未知
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注: 16 pages, 12 figures

点击查看摘要

Abstract:In the complex landscape of traditional futures trading, where vast data and variables like real-time Limit Order Books (LOB) complicate price predictions, we introduce the FutureQuant Transformer model, leveraging attention mechanisms to navigate these challenges. Unlike conventional models focused on point predictions, the FutureQuant model excels in forecasting the range and volatility of future prices, thus offering richer insights for trading strategies. Its ability to parse and learn from intricate market patterns allows for enhanced decision-making, significantly improving risk management and achieving a notable average gain of 0.1193% per 30-minute trade over state-of-the-art models with a simple algorithm using factors such as RSI, ATR, and Bollinger Bands. This innovation marks a substantial leap forward in predictive analytics within the volatile domain of futures trading.
zh

[AI-54] GenAI in Entrepreneurship: a systematic review of generative artificial intelligence in entrepreneurship research: current issues and future directions

【速读】:该论文试图解决当前关于生成式人工智能(Generative AI)在创业领域影响的研究尚不充分的问题,尤其是缺乏对GenAI作为创业前提条件影响的系统性理解。其解决方案的关键在于通过系统文献综述方法,结合自然语言处理和无监督机器学习技术(如TF-IDF向量化、主成分分析和层次聚类),对83篇来自Web of Science和Scopus的同行评议文章进行分析,从而识别和分析GenAI对创业影响的研究主题演化趋势,并提出未来研究方向、现有文献中的知识空白及伦理问题。

链接: https://arxiv.org/abs/2505.05523
作者: Anna Kusetogullari,Huseyin Kusetogullari,Martin Andersson,Tony Gorschek
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative Artificial Intelligence (GenAI) and Large Language Models (LLMs) are recognized to have significant effects on industry and business dynamics, not least because of their impact on the preconditions for entrepreneurship. There is still a lack of knowledge of GenAI as a theme in entrepreneurship research. This paper presents a systematic literature review aimed at identifying and analyzing the evolving landscape of research on the effects of GenAI on entrepreneurship. We analyze 83 peer-reviewed articles obtained from leading academic databases: Web of Science and Scopus. Using natural language processing and unsupervised machine learning techniques with TF-IDF vectorization, Principal Component Analysis (PCA), and hierarchical clustering, five major thematic clusters are identified: (1) Digital Transformation and Behavioral Models, (2) GenAI-Enhanced Education and Learning Systems, (3) Sustainable Innovation and Strategic AI Impact, (4) Business Models and Market Trends, and (5) Data-Driven Technological Trends in Entrepreneurship. Based on the review, we discuss future research directions, gaps in the current literature, as well as ethical concerns raised in the literature. We highlight the need for more macro-level research on GenAI and LLMs as external enablers for entrepreneurship and for research on effective regulatory frameworks that facilitate business experimentation, innovation, and further technology development.
zh
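
论文方法部分提到的 TF-IDF 向量化 + PCA 降维 + 层次聚类流程,可以用 scikit-learn 快速搭出骨架。下面是一个极简示意(语料为占位样例,聚类数等参数均为假设):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

# 占位摘要语料,非论文真实数据
abstracts = [
    "GenAI transforms entrepreneurial digital business models",
    "LLMs enhance entrepreneurship education and learning systems",
    "sustainable innovation with generative AI strategy",
    "data driven market trends for startups",
]
X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)  # TF-IDF 向量化
X_red = PCA(n_components=2).fit_transform(X.toarray())              # PCA 降维
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X_red)   # 层次聚类
print(labels)
```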

[AI-55] AI-powered virtual eye: perspective challenges and opportunities

【速读】:该论文试图解决如何构建一个高保真、跨尺度的数字眼模型,以模拟眼睛的复杂结构和生物功能,从而推动个性化眼科护理和眼部健康与疾病研究。解决方案的关键在于利用生成式 AI (Generative AI) 和基础模型(foundation models)构建一个统一的多模态、多尺度、动态预测模型,并集成反馈机制,同时依赖大规模多模态数据集、基于代理的架构和交互界面来实现其发展路径。

链接: https://arxiv.org/abs/2505.05516
作者: Yue Wu,Yibo Guo,Yulong Yan,Jiancheng Yang,Xin Zhou,Ching-Yu Cheng,Danli Shi,Mingguang He
机构: 未知
类目: Tissues and Organs (q-bio.TO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 30 Pages, 3 figures, 1 table

点击查看摘要

Abstract:We envision the “virtual eye” as a next-generation, AI-powered platform that uses interconnected foundation models to simulate the eye’s intricate structure and biological function across all scales. Advances in AI, imaging, and multiomics provide a fertile ground for constructing a universal, high-fidelity digital replica of the human eye. This perspective traces the evolution from early mechanistic and rule-based models to contemporary AI-driven approaches, integrating in a unified model with multimodal, multiscale, dynamic predictive capabilities and embedded feedback mechanisms. We propose a development roadmap emphasizing the roles of large-scale multimodal datasets, generative AI, foundation models, agent-based architectures, and interactive interfaces. Despite challenges in interpretability, ethics, data processing and evaluation, the virtual eye holds the potential to revolutionize personalized ophthalmic care and accelerate research into ocular health and disease.
zh

[AI-56] Structure Quality: Conceptual and Formal Foundations for the Mind-Body Problem

【速读】:该论文试图解决意识的“困难问题”(hard problem of consciousness),即解释主观体验(qualia)如何从物理过程中产生。其解决方案的关键在于探讨结构(structure)与质量(quality)之间的基础关系,而非传统上区分物理与心理的二元视角。研究开发了信息论度量方法,以量化结构与质量之间的相互确定性,并提出一种新的Q-S空间用于分析两者之间的保真度。该空间自然地引导出结构与质量属性之间五种可能关系的分类,每种关系均通过概念和形式模型进行说明,进而探讨了每种类别的本体论含义,为功能主义、涌现论、唯心主义、泛心论和中立一元论等哲学争论提供了新的见解。

链接: https://arxiv.org/abs/2505.05481
作者: Ryan Williams
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper explores the hard problem of consciousness from a different perspective. Instead of drawing distinctions between the physical and the mental, an exploration of a more foundational relationship is examined: the relationship between structure and quality. Information-theoretic measures are developed to quantify the mutual determinability between structure and quality, including a novel Q-S space for analyzing fidelity between the two domains. This novel space naturally points toward a five-fold categorization of possible relationships between structural and qualitative properties, illustrating each through conceptual and formal models. The ontological implications of each category are examined, shedding light on debates around functionalism, emergentism, idealism, panpsychism, and neutral monism. This new line of inquiry has established a framework for deriving theoretical constraints on qualitative systems undergoing evolution that is explored in my companion paper, Qualia Natural Selection.
zh

机器学习

[LG-0] Towards a Unified Representation Evaluation Framework Beyond Downstream Tasks IJCNN2025

链接: https://arxiv.org/abs/2505.06224
作者: Christos Plachouras,Julien Guinot,George Fazekas,Elio Quinton,Emmanouil Benetos,Johan Pauwels
类目: Machine Learning (cs.LG)
*备注: Accepted at IJCNN 2025

点击查看摘要

Abstract:Downstream probing has been the dominant method for evaluating model representations, an important process given the increasing prominence of self-supervised learning and foundation models. However, downstream probing primarily assesses the availability of task-relevant information in the model’s latent space, overlooking attributes such as equivariance, invariance, and disentanglement, which contribute to the interpretability, adaptability, and utility of representations in real-world applications. While some attempts have been made to measure these qualities in representations, no unified evaluation framework with modular, generalizable, and interpretable metrics exists. In this paper, we argue for the importance of representation evaluation beyond downstream probing. We introduce a standardized protocol to quantify informativeness, equivariance, invariance, and disentanglement of factors of variation in model representations. We use it to evaluate representations from a variety of models in the image and speech domains using different architectures and pretraining approaches on identified controllable factors of variation. We find that representations from models with similar downstream performance can behave substantially differently with regard to these attributes. This hints that the respective mechanisms underlying their downstream performance are functionally different, prompting new research directions to understand and improve representations.

[LG-1] Leveraging Multi-Task Learning for Multi-Label Power System Security Assessment

链接: https://arxiv.org/abs/2505.06207
作者: Muhy Eddin Za’ter,Amir Sajad,Bri-Mathias Hodge
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel approach to power system security assessment using Multi-Task Learning (MTL), reformulating the problem as a multi-label classification task. The proposed MTL framework simultaneously assesses static, voltage, transient, and small-signal stability, improving both accuracy and interpretability relative to state-of-the-art machine learning methods. It consists of a shared encoder and multiple decoders, enabling knowledge transfer between stability tasks. Experiments on the IEEE 68-bus system demonstrate measurably superior performance of the proposed method compared to existing state-of-the-art approaches.

[LG-2] Auto Tensor Singular Value Thresholding: A Non-Iterative and Rank-Free Framework for Tensor Denoising

链接: https://arxiv.org/abs/2505.06203
作者: Hiroki Hasegawa,Yukihiko Okada
类目: Machine Learning (cs.LG)
*备注: 16 pages, 4 figures

点击查看摘要

Abstract:In modern data-driven tasks such as classification, optimization, and forecasting, mitigating the effects of intrinsic noise is crucial for improving predictive accuracy. While numerous denoising techniques have been developed, the rising dimensionality of real-world datasets limits conventional matrix-based methods in preserving data structure and accuracy. This challenge has led to increasing interest in tensor-based approaches, which naturally capture multi-way data relationships. However, classical tensor decomposition methods (e.g., HOSVD, HOOI) typically require pre-specified ranks and iterative optimization, making them computationally expensive and less practical. In this work, we propose a novel low-rank approximation method for tensor data that avoids these limitations. Our approach applies statistically grounded singular value thresholding to mode-wise matricizations, enabling automatic extraction of significant components without requiring prior rank specification or iterative refinement. Experiments on synthetic and real-world tensors show that our method consistently outperforms existing techniques in terms of estimation accuracy and computational efficiency, especially in noisy high-dimensional settings.
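
下面用 NumPy 给出"逐 mode 矩阵化 + 奇异值阈值化 + 折叠平均"的极简示意。阈值的统计学自动选取是论文的核心贡献,此处将其简化为外部参数 tau,属于本文假设:

```python
import numpy as np

def mode_unfold(T, mode):
    # 把张量沿指定 mode 展开成矩阵
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_fold(M, mode, shape):
    # mode_unfold 的逆操作
    full = list(shape)
    full.insert(0, full.pop(mode))
    return np.moveaxis(M.reshape(full), 0, mode)

def svt_denoise(T, tau):
    # 对每个 mode 的矩阵化做奇异值阈值化,折叠回张量后取平均
    out = np.zeros_like(T, dtype=float)
    for m in range(T.ndim):
        U, s, Vt = np.linalg.svd(mode_unfold(T, m), full_matrices=False)
        s = np.where(s > tau, s, 0.0)   # 论文中 tau 由统计准则自动给出;此处为假设的外部参数
        out += mode_fold((U * s) @ Vt, m, T.shape)
    return out / T.ndim

rng = np.random.default_rng(0)
low_rank = np.einsum('i,j,k->ijk', *[rng.normal(size=s) for s in (20, 20, 20)])  # 秩 1 张量
noisy = low_rank + 0.1 * rng.normal(size=low_rank.shape)
den = svt_denoise(noisy, tau=3.0)
print("noisy error   :", np.linalg.norm(noisy - low_rank))
print("denoised error:", np.linalg.norm(den - low_rank))
```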

[LG-3] Active Perception for Tactile Sensing: A Task-Agnostic Attention-Based Approach

链接: https://arxiv.org/abs/2505.06182
作者: Tim Schneider,Cristiana de Farias,Roberto Calandra,Liming Chen,Jan Peters
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 16 pages; 13 figures

点击查看摘要

Abstract:Humans make extensive use of haptic exploration to map and identify the properties of the objects that we touch. In robotics, active tactile perception has emerged as an important research domain that complements vision for tasks such as object classification, shape reconstruction, and manipulation. This work introduces TAP (Task-agnostic Active Perception) – a novel framework that leverages reinforcement learning (RL) and transformer-based architectures to address the challenges posed by partially observable environments. TAP integrates Soft Actor-Critic (SAC) and CrossQ algorithms within a unified optimization objective, jointly training a perception module and decision-making policy. By design, TAP is completely task-agnostic and can, in principle, generalize to any active perception problem. We evaluate TAP across diverse tasks, including toy examples and realistic applications involving haptic exploration of 3D models from the Tactile MNIST benchmark. Experiments demonstrate the efficacy of TAP, achieving high accuracies on the Tactile MNIST haptic digit recognition task and a tactile pose estimation task. These findings underscore the potential of TAP as a versatile and generalizable framework for advancing active tactile perception in robotics.

[LG-4] A Large Language Model-Enhanced Q-learning for Capacitated Vehicle Routing Problem with Time Windows

链接: https://arxiv.org/abs/2505.06178
作者: Linjiang Cao,Maonan Wang,Xi Xiong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) is a classic NP-hard combinatorial optimization problem widely applied in logistics distribution and transportation management. Its complexity stems from the constraints of vehicle capacity and time windows, which pose significant challenges to traditional approaches. Advances in Large Language Models (LLMs) provide new possibilities for finding approximate solutions to CVRPTW. This paper proposes a novel LLM-enhanced Q-learning framework to address the CVRPTW with real-time emergency constraints. Our solution introduces an adaptive two-phase training mechanism that transitions from the LLM-guided exploration phase to the autonomous optimization phase of Q-network. To ensure reliability, we design a three-tier self-correction mechanism based on the Chain-of-Thought (CoT) for LLMs: syntactic validation, semantic verification, and physical constraint enforcement. In addition, we also prioritized replay of the experience generated by LLMs to amplify the regulatory role of LLMs in the architecture. Experimental results demonstrate that our framework achieves a 7.3% average reduction in cost compared to traditional Q-learning, with fewer training steps required for convergence.

[LG-5] On the Depth of Monotone ReLU Neural Networks and ICNNs

链接: https://arxiv.org/abs/2505.06169
作者: Egor Bakaev,Florestan Brunck,Christoph Hertrich,Daniel Reichman,Amir Yehudayoff
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Neural and Evolutionary Computing (cs.NE); Combinatorics (math.CO)
*备注: 27 pages, 17 figures

点击查看摘要

Abstract:We study two models of ReLU neural networks: monotone networks (ReLU^+) and input convex neural networks (ICNN). Our focus is on expressivity, mostly in terms of depth, and we prove the following lower bounds. For the maximum function MAX_n computing the maximum of n real numbers, we show that ReLU^+ networks cannot compute MAX_n, or even approximate it. We prove a sharp n lower bound on the ICNN depth complexity of MAX_n. We also prove depth separations between ReLU networks and ICNNs; for every k, there is a depth-2 ReLU network of size O(k^2) that cannot be simulated by a depth-k ICNN. The proofs are based on deep connections between neural networks and polyhedral geometry, and also use isoperimetric properties of triangulations.
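
一个直观的注脚:熟知的恒等式 max(x, y) = y + ReLU(x − y) 说明普通两层 ReLU 网络能计算最大值,但其内层 x − y 需要负权重,因而这一构造不属于单调(ReLU^+)网络——这与上述"ReLU^+ 网络无法计算 MAX_n"的结论一致。以下仅为该恒等式的数值验证示意:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# max(x, y) = y + ReLU(x - y):内层 "x - y" 含负权重,故非单调网络
x, y = 0.3, 1.7
print(y + relu(x - y), max(x, y))   # 两者相等
```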

[LG-6] Learning-Augmented Algorithms for Boolean Satisfiability

链接: https://arxiv.org/abs/2505.06146
作者: Idan Attias,Xing Gao,Lev Reyzin
类目: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning-augmented algorithms are a prominent recent development in beyond worst-case analysis. In this framework, a problem instance is provided with a prediction ("advice") from a machine-learning oracle, which provides partial information about an optimal solution, and the goal is to design algorithms that leverage this advice to improve worst-case performance. We study the classic Boolean satisfiability (SAT) decision and optimization problems within this framework using two forms of advice. "Subset advice" provides a random \epsilon fraction of the variables from an optimal assignment, whereas "label advice" provides noisy predictions for all variables in an optimal assignment. For the decision problem k-SAT, by using the subset advice we accelerate the exponential running time of the PPSZ family of algorithms due to Paturi, Pudlak, Saks and Zane, which currently represent the state of the art in the worst case. We accelerate the running time by a multiplicative factor of 2^{-c} in the base of the exponent, where c is a function of \epsilon and k. For the optimization problem, we show how to incorporate subset advice in a black-box fashion with any \alpha-approximation algorithm, improving the approximation ratio to \alpha + (1-\alpha)\epsilon. Specifically, we achieve approximations of 0.94 + \Omega(\epsilon) for MAX-2-SAT, 7/8 + \Omega(\epsilon) for MAX-3-SAT, and 0.79 + \Omega(\epsilon) for MAX-SAT. Moreover, for label advice, we obtain near-optimal approximation for instances with large average degree, thereby generalizing recent results on MAX-CUT and MAX-2-LIN.
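
下面用一个玩具 MAX-3-SAT 例子示意"subset advice 以黑盒方式接入任意近似算法"的做法:被建议的变量直接固定为最优解中的取值,其余变量交给基线算法处理(此处用期望近似比 7/8 的随机赋值;子句生成方式与接口均为本文示意假设):

```python
import random
random.seed(0)

def with_subset_advice(clauses, n_vars, advice):
    # advice: {变量下标: 布尔值},对应最优解中被随机揭示的 epsilon 比例变量
    assign = dict(advice)
    for v in range(n_vars):
        assign.setdefault(v, random.random() < 0.5)  # 其余变量交给基线:随机赋值
    # 统计被满足的子句比例;文字 l > 0 表示正文字,变量下标为 abs(l) - 1
    sat = sum(any(assign[abs(l) - 1] == (l > 0) for l in c) for c in clauses)
    return sat / len(clauses)

clauses = [tuple(random.choice([-1, 1]) * random.randint(1, 20) for _ in range(3))
           for _ in range(200)]
print(with_subset_advice(clauses, 20, advice={0: True, 1: False}))
```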

[LG-7] Realistic Adversarial Attacks for Robustness Evaluation of Trajectory Prediction Models via Future State Perturbation

链接: https://arxiv.org/abs/2505.06134
作者: Julian F. Schumann,Jeroen Hagenus,Frederik Baymler Mathiesen,Arkady Zgonnikov
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: 20 pages, 3 figures

点击查看摘要

Abstract:Trajectory prediction is a key element of autonomous vehicle systems, enabling them to anticipate and react to the movements of other road users. Evaluating the robustness of prediction models against adversarial attacks is essential to ensure their reliability in real-world traffic. However, current approaches tend to focus on perturbing the past positions of surrounding agents, which can generate unrealistic scenarios and overlook critical vulnerabilities. This limitation may result in overly optimistic assessments of model performance in real-world conditions. In this work, we demonstrate that perturbing not just past but also future states of adversarial agents can uncover previously undetected weaknesses and thereby provide a more rigorous evaluation of model robustness. Our novel approach incorporates dynamic constraints and preserves tactical behaviors, enabling more effective and realistic adversarial attacks. We introduce new performance measures to assess the realism and impact of these adversarial trajectories. Testing our method on a state-of-the-art prediction model revealed significant increases in prediction errors and collision rates under adversarial conditions. Qualitative analysis further showed that our attacks can expose critical weaknesses, such as the inability of the model to detect potential collisions in what appear to be safe predictions. These results underscore the need for more comprehensive adversarial testing to better evaluate and improve the reliability of trajectory prediction models for autonomous vehicles.

[LG-8] FIC-TSC: Learning Time Series Classification with Fisher Information Constraint ICML2025

链接: https://arxiv.org/abs/2505.06114
作者: Xiwen Chen,Wenhui Zhu,Peijie Qiu,Hao Wang,Huayu Li,Zihan Li,Yalin Wang,Aristeidis Sotiras,Abolfazl Razi
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML2025. Pre camera-ready version

点击查看摘要

Abstract:Analyzing time series data is crucial to a wide spectrum of applications, including economics, online marketplaces, and human healthcare. In particular, time series classification plays an indispensable role in segmenting different phases in stock markets, predicting customer behavior, and classifying worker actions and engagement levels. These aspects contribute significantly to the advancement of automated decision-making and system optimization in real-world applications. However, there is a large consensus that time series data often suffers from domain shifts between training and test sets, which dramatically degrades the classification performance. Despite the success of (reversible) instance normalization in handling the domain shifts for time series regression tasks, its performance in classification is unsatisfactory. In this paper, we propose FIC-TSC, a training framework for time series classification that leverages Fisher information as the constraint. We theoretically and empirically show this is an efficient and effective solution to guide the model converge toward flatter minima, which enhances its generalizability to distribution shifts. We rigorously evaluate our method on 30 UEA multivariate and 85 UCR univariate datasets. Our empirical results demonstrate the superiority of the proposed method over 14 recent state-of-the-art methods.
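
下面给出一个以经验 Fisher 信息(对数似然梯度的平方范数)作为正则项的 PyTorch 极简示意。论文中约束的具体形式与系数设置以原文为准,此处仅为概念演示,lam 等超参为假设:

```python
import torch

def fic_loss(model, x, y, lam=0.01):
    # 分类损失 + 经验 Fisher 信息的迹近似(梯度平方范数)作为约束项
    nll = torch.nn.functional.cross_entropy(model(x), y)
    grads = torch.autograd.grad(nll, list(model.parameters()), create_graph=True)
    fisher = sum(g.pow(2).sum() for g in grads)
    return nll + lam * fisher

model = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 3))
x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))
loss = fic_loss(model, x, y)
loss.backward()                 # create_graph=True 使正则项可以继续反传
print(float(loss))
```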

[LG-9] Deep Diffusion Maps

链接: https://arxiv.org/abs/2505.06087
作者: Sergio García-Heredia,Ángela Fernández,Carlos M. Alaíz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One of the fundamental problems within the field of machine learning is dimensionality reduction. Dimensionality reduction methods make it possible to combat the so-called curse of dimensionality, visualize high-dimensional data and, in general, improve the efficiency of storing and processing large data sets. One of the best-known nonlinear dimensionality reduction methods is Diffusion Maps. However, despite their virtues, both Diffusion Maps and many other manifold learning methods based on the spectral decomposition of kernel matrices have drawbacks such as the inability to apply them to data outside the initial set, their computational complexity, and high memory costs for large data sets. In this work, we propose to alleviate these problems by resorting to deep learning. Specifically, a new formulation of Diffusion Maps embedding is offered as a solution to a certain unconstrained minimization problem and, based on it, a cost function to train a neural network which computes Diffusion Maps embedding – both inside and outside the training sample – without the need to perform any spectral decomposition. The capabilities of this approach are compared on different data sets, both real and synthetic, with those of Diffusion Maps and the Nyström method.
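
作为背景,经典 Diffusion Maps 的"核矩阵 → 马尔可夫归一化 → 谱分解"流程可以写成如下 NumPy 极简示意(论文正是要用神经网络替代这里的谱分解;eps、嵌入维数等参数均为假设,且省略了常见的对称归一化细节):

```python
import numpy as np

def diffusion_maps(X, eps=1.0, n_components=2, t=1):
    # 高斯核 -> 行归一化得到马尔可夫矩阵 -> 取前几个非平凡特征向量作为嵌入
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / eps)
    P = K / K.sum(axis=1, keepdims=True)
    vals, vecs = np.linalg.eig(P)
    idx = np.argsort(-vals.real)[1:n_components + 1]   # 跳过平凡特征值 1
    return (vals.real[idx] ** t) * vecs.real[:, idx]

X = np.random.default_rng(0).normal(size=(100, 3))
print(diffusion_maps(X).shape)   # (100, 2)
```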

[LG-10] Fault Diagnosis of 3D-Printed Scaled Wind Turbine Blades

链接: https://arxiv.org/abs/2505.06080
作者: Luis Miguel Esquivel-Sancho,Maryam Ghandchi Tehrani,Mauricio Muñoz-Arias,Mahmoud Askari
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study presents an integrated methodology for fault detection in wind turbine blades using 3D-printed scaled models, finite element simulations, experimental modal analysis, and machine learning techniques. A scaled model of the NREL 5MW blade was fabricated using 3D printing, and crack-type damages were introduced at critical locations. Finite Element Analysis was employed to predict the impact of these damages on the natural frequencies, with the results validated through controlled hammer impact tests. Vibration data was processed to extract both time-domain and frequency-domain features, and key discriminative variables were identified using statistical analyses (ANOVA). Machine learning classifiers, including Support Vector Machine and K-Nearest Neighbors, achieved classification accuracies exceeding 94%. The results revealed that vibration modes 3, 4, and 6 are particularly sensitive to structural anomalies for this blade. This integrated approach confirms the feasibility of combining numerical simulations with experimental validations and paves the way for structural health monitoring systems in wind energy applications.

[LG-11] Safe-EF: Error Feedback for Nonsmooth Constrained Optimization

链接: https://arxiv.org/abs/2505.06053
作者: Rustem Islamov,Yarden As,Ilyas Fatkhullin
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Federated learning faces severe communication bottlenecks due to the high dimensionality of model updates. Communication compression with contractive compressors (e.g., Top-K) is often preferable in practice but can degrade performance without proper handling. Error feedback (EF) mitigates such issues but has been largely restricted for smooth, unconstrained problems, limiting its real-world applicability where non-smooth objectives and safety constraints are critical. We advance our understanding of EF in the canonical non-smooth convex setting by establishing new lower complexity bounds for first-order algorithms with contractive compression. Next, we propose Safe-EF, a novel algorithm that matches our lower bound (up to a constant) while enforcing safety constraints essential for practical applications. Extending our approach to the stochastic setting, we bridge the gap between theory and practical implementation. Extensive experiments in a reinforcement learning setup, simulating distributed humanoid robot training, validate the effectiveness of Safe-EF in ensuring safety and reducing communication complexity.
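
下面给出经典 Error Feedback 配合 Top-K 收缩压缩器的单步示意(Safe-EF 在此骨架之上加入了非光滑目标与安全约束的处理,细节以论文为准;此处仅为通用 EF 框架的写法):

```python
import numpy as np

def top_k(v, k):                          # 收缩压缩器:只保留幅值最大的 k 个分量
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef_step(grad, error, k, lr):
    corrected = error + grad              # 把上一轮被压缩掉的残差补回来
    msg = top_k(corrected, k)             # 只通信压缩后的消息
    new_error = corrected - msg           # 记录本轮压缩误差,留到下一轮
    return -lr * msg, new_error

g = np.random.default_rng(0).normal(size=10)
update, err = ef_step(g, np.zeros(10), k=3, lr=0.1)
print(update, err)
```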

[LG-12] Learning Music Audio Representations With Limited Data ICASSP2025

链接: https://arxiv.org/abs/2505.06042
作者: Christos Plachouras,Emmanouil Benetos,Johan Pauwels
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Presented at ICASSP 2025

点击查看摘要

Abstract:Large deep-learning models for music, including those focused on learning general-purpose music audio representations, are often assumed to require substantial training data to achieve high performance. If true, this would pose challenges in scenarios where audio data or annotations are scarce, such as for underrepresented music traditions, non-popular genres, and personalized music creation and listening. Understanding how these models behave in limited-data scenarios could be crucial for developing techniques to tackle them. In this work, we investigate the behavior of several music audio representation models under limited-data learning regimes. We consider music models with various architectures, training paradigms, and input durations, and train them on data collections ranging from 5 to 8,000 minutes long. We evaluate the learned representations on various music information retrieval tasks and analyze their robustness to noise. We show that, under certain conditions, representations from limited-data and even random models perform comparably to ones from large-dataset models, though handcrafted features outperform all learned representations in some tasks.

[LG-13] Fuzzy-UCS Revisited: Self-Adaptation of Rule Representations in Michigan-Style Learning Fuzzy-Classifier Systems GECCO

链接: https://arxiv.org/abs/2505.06017
作者: Hiroki Shiraishi,Yohei Hayamizu,Tomonori Hashiyama
类目: Machine Learning (cs.LG)
*备注: Accepted by the ACM Genetic and Evolutionary Computation Conference (GECCO) 2023

点击查看摘要

Abstract:This paper focuses on the impact of rule representation in Michigan-style Learning Fuzzy-Classifier Systems (LFCSs) on their classification performance. A well-designed rule representation in an LFCS is crucial for improving its performance. However, conventional rule representations often struggle with problems whose data characteristics are unknown. To address this issue, this paper proposes a supervised LFCS (i.e., Fuzzy-UCS) with a self-adaptive rule representation mechanism, entitled Adaptive-UCS. Adaptive-UCS incorporates a fuzzy indicator as a new rule parameter that sets the membership function of a rule as either rectangular (i.e., crisp) or triangular (i.e., fuzzy) shapes. The fuzzy indicator is optimized with evolutionary operators, allowing the system to search for an optimal rule representation. Results from extensive experiments conducted on continuous space problems demonstrate that Adaptive-UCS outperforms other UCSs with conventional crisp-hyperrectangular and fuzzy-hypertrapezoidal rule representations in classification accuracy. Additionally, Adaptive-UCS exhibits robustness in the case of noisy inputs and real-world problems with inherent uncertainty, such as missing values, leading to stable classification performance.

[LG-14] Differentiable Fuzzy Neural Networks for Recommender Systems

链接: https://arxiv.org/abs/2505.06000
作者: Stephan Bartl,Kevin Innerebner,Elisabeth Lex
类目: Machine Learning (cs.LG)
*备注: Accepted for publication at the HyPer workshop, co-located with ACM UMAP 2025

点击查看摘要

Abstract:As recommender systems become increasingly complex, transparency is essential to increase user trust, accountability, and regulatory compliance. Neuro-symbolic approaches that integrate symbolic reasoning with sub-symbolic learning offer a promising approach toward transparent and user-centric systems. In this work-in-progress, we investigate using fuzzy neural networks (FNNs) as a neuro-symbolic approach for recommendations that learn logic-based rules over predefined, human-readable atoms. Each rule corresponds to a fuzzy logic expression, making the recommender’s decision process inherently transparent. In contrast to black-box machine learning methods, our approach reveals the reasoning behind a recommendation while maintaining competitive performance. We evaluate our method on a synthetic dataset and the MovieLens 1M dataset and compare it to state-of-the-art recommendation algorithms. Our results demonstrate that our approach accurately captures user behavior while providing a transparent decision-making process. Finally, the differentiable nature of this approach facilitates an integration with other neural models, enabling the development of hybrid, transparent recommender systems.

[LG-15] Modeling Multi-Hop Semantic Paths for Recommendation in Heterogeneous Information Networks

链接: https://arxiv.org/abs/2505.05989
作者: Hongye Zheng,Yue Xing,Lipeng Zhu,Xu Han,Junliang Du,Wanyu Cui
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study focuses on the problem of path modeling in heterogeneous information networks and proposes a multi-hop path-aware recommendation framework. The method centers on multi-hop paths composed of various types of entities and relations. It models user preferences through three stages: path selection, semantic representation, and attention-based fusion. In the path selection stage, a path filtering mechanism is introduced to remove redundant and noisy information. In the representation learning stage, a sequential modeling structure is used to jointly encode entities and relations, preserving the semantic dependencies within paths. In the fusion stage, an attention mechanism assigns different weights to each path to generate a global user interest representation. Experiments conducted on real-world datasets such as Amazon-Book show that the proposed method significantly outperforms existing recommendation models across multiple evaluation metrics, including HR@10, Recall@10, and Precision@10. The results confirm the effectiveness of multi-hop paths in capturing high-order interaction semantics and demonstrate the expressive modeling capabilities of the framework in heterogeneous recommendation scenarios. This method provides both theoretical and practical value by integrating structural information modeling in heterogeneous networks with recommendation algorithm design. It offers a more expressive and flexible paradigm for learning user preferences in complex data environments.

[LG-16] Architectural Exploration of Hybrid Neural Decoders for Neuromorphic Implantable BMI

链接: https://arxiv.org/abs/2505.05983
作者: Vivek Mohan,Biyan Zhou,Zhou Wang,Anil Bharath,Emmanuel Drakakis,Arindam Basu
类目: Machine Learning (cs.LG)
*备注: The paper has been accepted for lecture presentation at the 2025 IEEE International Symposium on Circuits and Systems in London

点击查看摘要

Abstract:This work presents an efficient decoding pipeline for neuromorphic implantable brain-machine interfaces (Neu-iBMI), leveraging sparse neural event data from an event-based neural sensing scheme. We introduce a tunable event filter (EvFilter), which also functions as a spike detector (EvFilter-SPD), significantly reducing the number of events processed for decoding by 192X and 554X, respectively. The proposed pipeline achieves high decoding performance, up to R^2=0.73, with ANN- and SNN-based decoders, eliminating the need for signal recovery, spike detection, or sorting, commonly performed in conventional iBMI systems. The SNN-Decoder reduces computations and memory required by 5-23X compared to NN-, and LSTM-Decoders, while the ST-NN-Decoder delivers similar performance to an LSTM-Decoder requiring 2.5X fewer resources. This streamlined approach significantly reduces computational and memory demands, making it ideal for low-power, on-implant, or wearable iBMIs.

[LG-17] Offline Multi-agent Reinforcement Learning via Score Decomposition

链接: https://arxiv.org/abs/2505.05968
作者: Dan Qiao,Wenhao Li,Shanchao Yang,Hongyuan Zha,Baoxiang Wang
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Working papers

点击查看摘要

Abstract:Offline multi-agent reinforcement learning (MARL) faces critical challenges due to distributional shifts, further exacerbated by the high dimensionality of joint action spaces and the diversity in coordination strategies and quality among agents. Conventional approaches, including independent learning frameworks and value decomposition methods based on pessimistic principles, remain susceptible to out-of-distribution (OOD) joint actions and often yield suboptimal performance. Through systematic analysis of prevalent offline MARL benchmarks, we identify that this limitation primarily stems from the inherently multimodal nature of joint collaborative policies induced by offline data collection. To address these challenges, we propose a novel two-stage framework: First, we employ a diffusion-based generative model to explicitly capture the complex behavior policy, enabling accurate modeling of diverse multi-agent coordination patterns. Second, we introduce a sequential score function decomposition mechanism to regularize individual policies and enable decentralized execution. Extensive experiments on continuous control tasks demonstrate state-of-the-art performance across multiple standard offline MARL benchmarks, outperforming existing methods by 26.3% in normalized returns. Our approach provides new insights into offline coordination and equilibrium selection in cooperative multi-agent systems.

[LG-18] Learning Power Control Protocol for In-Factory 6G Subnetworks

链接: https://arxiv.org/abs/2505.05967
作者: Uyoata E. Uyoata,Gilberto Berardinelli,Ramoni Adeogun
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: Accepted for presented at IEEE EuCNC 6G Summit 2025

点击查看摘要

Abstract:In-X Subnetworks are envisioned to meet the stringent demands of short-range communication in diverse 6G use cases. In the context of In-Factory scenarios, effective power control is critical to mitigating the impact of interference resulting from potentially high subnetwork density. Existing approaches to power control in this domain have predominantly emphasized the data plane, often overlooking the impact of signaling overhead. Furthermore, prior work has typically adopted a network-centric perspective, relying on the assumption of complete and up-to-date channel state information (CSI) being readily available at the central controller. This paper introduces a novel multi-agent reinforcement learning (MARL) framework designed to enable access points to autonomously learn both signaling and power control protocols in an In-Factory Subnetwork environment. By formulating the problem as a partially observable Markov decision process (POMDP) and leveraging multi-agent proximal policy optimization (MAPPO), the proposed approach achieves significant advantages. The simulation results demonstrate that the learning-based method reduces signaling overhead by a factor of 8 while maintaining a buffer flush rate that lags the ideal “Genie” approach by only 5%.

[LG-19] FloE: On-the-Fly MoE Inference ICML2025

链接: https://arxiv.org/abs/2505.05950
作者: Yuxin Zhou,Zheng Li,Jun Zhang,Jue Wang,Yiping Wang,Zhongle Xie,Ke Chen,Lidan Shou
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2025

点击查看摘要

Abstract:With the widespread adoption of Mixture-of-Experts (MoE) models, there is a growing demand for efficient inference on memory-constrained devices. While offloading expert parameters to CPU memory and loading activated experts on demand has emerged as a potential solution, the large size of activated experts overburdens the limited PCIe bandwidth, hindering the effectiveness in latency-sensitive scenarios. To mitigate this, we propose FloE, an on-the-fly MoE inference system on memory-constrained GPUs. FloE is built on the insight that there exists substantial untapped redundancy within sparsely activated experts. It employs various compression techniques on the expert’s internal parameter matrices to reduce the data movement load, combined with low-cost sparse prediction, achieving perceptible inference acceleration in wall-clock time on resource-constrained devices. Empirically, FloE achieves a 9.3x compression of parameters per expert in Mixtral-8x7B; enables deployment on a GPU with only 11GB VRAM, reducing the memory footprint by up to 8.5x; and delivers a 48.7x inference speedup compared to DeepSpeed-MII on a single GeForce RTX 3090.

[LG-20] Fast Differentiable Modal Simulation of Non-linear Strings Membranes and Plates

链接: https://arxiv.org/abs/2505.05940
作者: Rodrigo Diaz,Mark Sandler
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Computational Physics (physics.comp-ph)
*备注: accepted to DAFx 2025

点击查看摘要

Abstract:Modal methods for simulating vibrations of strings, membranes, and plates are widely used in acoustics and physically informed audio synthesis. However, traditional implementations, particularly for non-linear models like the von Kármán plate, are computationally demanding and lack differentiability, limiting inverse modelling and real-time applications. We introduce a fast, differentiable, GPU-accelerated modal framework built with the JAX library, providing efficient simulations and enabling gradient-based inverse modelling. Benchmarks show that our approach significantly outperforms CPU and GPU-based implementations, particularly for simulations with many modes. Inverse modelling experiments demonstrate that our approach can recover physical parameters, including tension, stiffness, and geometry, from both synthetic and experimental data. Although fitting physical parameters is more sensitive to initialisation compared to other methods, it provides greater interpretability and more compact parameterisation. The code is released as open source to support future research and applications in differentiable physical modelling and sound synthesis.

[LG-21] Autoencoder-Based Hybrid Replay for Class-Incremental Learning ICML2025

链接: https://arxiv.org/abs/2505.05926
作者: Milad Khademi Nori,Il-Min Kim,Guanghui Wang
类目: Machine Learning (cs.LG)
*备注: Accepted ICML 2025

点击查看摘要

Abstract:In class-incremental learning (CIL), effective incremental learning strategies are essential to mitigate task confusion and catastrophic forgetting, especially as the number of tasks t increases. Current exemplar replay strategies impose \mathcal{O}(t) memory/compute complexities. We propose an autoencoder-based hybrid replay (AHR) strategy that leverages our new hybrid autoencoder (HAE) to function as a compressor to alleviate the requirement for large memory, achieving \mathcal{O}(0.1t) at the worst case with the computing complexity of \mathcal{O}(t) while accomplishing state-of-the-art performance. The decoder later recovers the exemplar data stored in the latent space, rather than in raw format. Additionally, HAE is designed for both discriminative and generative modeling, enabling classification and replay capabilities, respectively. HAE adopts the charged particle system energy minimization equations and repulsive force algorithm for the incremental embedding and distribution of new class centroids in its latent space. Our results demonstrate that AHR consistently outperforms recent baselines across multiple benchmarks while operating with the same memory/compute budgets. The source code is included in the supplementary material and will be open-sourced upon publication.

[LG-22] CAPE: Context-Aware Prompt Perturbation Mechanism with Differential Privacy ICML2025

链接: https://arxiv.org/abs/2505.05922
作者: Haoqi Wu,Wei Dai,Li Wang,Qiang Yan
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: to be published in ICML 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have gained significant popularity due to their remarkable capabilities in text understanding and generation. However, despite their widespread deployment in inference services such as ChatGPT, concerns about the potential leakage of sensitive user data have arisen. Existing solutions primarily rely on privacy-enhancing technologies to mitigate such risks, facing the trade-off among efficiency, privacy, and utility. To narrow this gap, we propose Cape, a context-aware prompt perturbation mechanism based on differential privacy, to enable efficient inference with an improved privacy-utility trade-off. Concretely, we introduce a hybrid utility function that better captures the token similarity. Additionally, we propose a bucketized sampling mechanism to handle the large sampling space, which might otherwise lead to long-tail phenomena. Extensive experiments across multiple datasets, along with ablation studies, demonstrate that Cape achieves a better privacy-utility trade-off compared to prior state-of-the-art works.
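下面是该机制的粗略示意:基于指数机制的 token 采样加分桶。其中余弦相似度效用函数与 50 桶网格均为简化假设,并非 Cape 的混合效用函数与采样器本身:

```python
import numpy as np

# Rough sketch of exponential-mechanism token perturbation with bucketized
# sampling; the cosine-similarity utility and bucket grid are simplified
# stand-ins for Cape's actual hybrid utility and sampler.
rng = np.random.default_rng(0)
vocab_size, dim = 1000, 64
emb = rng.normal(size=(vocab_size, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-norm token embeddings

def perturb_token(token_id, epsilon):
    utility = emb @ emb[token_id]                   # similarity to original token
    buckets = np.digitize(utility, np.linspace(-1, 1, 50))
    probs = np.zeros(vocab_size)
    for b in np.unique(buckets):
        members = buckets == b
        # Exponential mechanism at bucket granularity (sensitivity taken as 1).
        probs[members] = np.exp(0.5 * epsilon * utility[members].mean())
    probs /= probs.sum()
    return int(rng.choice(vocab_size, p=probs))

print(perturb_token(42, epsilon=4.0))
```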

[LG-23] A 3D pocket-aware and evolutionary conserved interaction guided diffusion model for molecular optimization

链接: https://arxiv.org/abs/2505.05874
作者: Anjie Qiao,Hao Zhang,Qianmu Yuan,Qirui Deng,Jingtian Su,Weifeng Huang,Huihao Zhou,Guo-Bo Li,Zhen Wang,Jinping Lei
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Generating molecules that bind to specific protein targets via diffusion models has shown great promise for structure-based drug design and molecule optimization. In particular, diffusion models with binding-interaction guidance enable molecule generation with high affinity through forming favorable interactions within the protein pocket. However, the generated molecules may not form interactions with the highly conserved residues, which are important for protein functions and bioactivities of the ligands. Herein, we developed a new 3D target-aware diffusion model DiffDecip, which explicitly incorporates the protein-ligand binding interactions and evolutionary conservation information of protein residues into both the diffusion and sampling process, for molecule optimization through scaffold decoration. The model performance revealed that DiffDecip outperforms the baseline model DiffDec on molecule optimization towards higher affinity through forming more non-covalent interactions with highly conserved residues in the protein pocket.

[LG-24] A Taxonomy of Attacks and Defenses in Split Learning

链接: https://arxiv.org/abs/2505.05872
作者: Aqsa Shabbir,Halil İbrahim Kanpak,Alptekin Küpçü,Sinem Sav
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Split Learning (SL) has emerged as a promising paradigm for distributed deep learning, allowing resource-constrained clients to offload portions of their model computation to servers while maintaining collaborative learning. However, recent research has demonstrated that SL remains vulnerable to a range of privacy and security threats, including information leakage, model inversion, and adversarial attacks. While various defense mechanisms have been proposed, a systematic understanding of the attack landscape and corresponding countermeasures is still lacking. In this study, we present a comprehensive taxonomy of attacks and defenses in SL, categorizing them along three key dimensions: employed strategies, constraints, and effectiveness. Furthermore, we identify key open challenges and research gaps in SL based on our systematization, highlighting potential future directions.

[LG-25] Open Set Label Shift with Test Time Out-of-Distribution Reference CVPR2025

链接: https://arxiv.org/abs/2505.05868
作者: Changkun Ye,Russell Tsuchida,Lars Petersson,Nick Barnes
类目: Machine Learning (cs.LG)
*备注: Accepted at CVPR 2025

点击查看摘要

Abstract:Open set label shift (OSLS) occurs when label distributions change from a source to a target distribution, and the target distribution has an additional out-of-distribution (OOD) class. In this work, we build estimators for both source and target open set label distributions using a source domain in-distribution (ID) classifier and an ID/OOD classifier. With reasonable assumptions on the ID/OOD classifier, the estimators are assembled into a sequence of three stages: 1) an estimate of the source label distribution of the OOD class, 2) an EM algorithm for Maximum Likelihood estimates (MLE) of the target label distribution, and 3) an estimate of the target label distribution of OOD class under relaxed assumptions on the OOD classifier. The sampling errors of estimates in 1) and 3) are quantified with a concentration inequality. The estimation result allows us to correct the ID classifier trained on the source distribution to the target distribution without retraining. Experiments on a variety of open set label shift settings demonstrate the effectiveness of our model. Our code is available at this https URL.
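其中第 2) 阶段是经典的标签偏移 EM 估计;一个最小版本(忽略由第 1)、3) 阶段处理的 OOD 类)如下:

```python
import numpy as np

# Classic EM / maximum-likelihood label-shift estimator (stage 2 of the
# pipeline, with the OOD class omitted): re-estimate target label priors
# from source-classifier posteriors on target data.
def em_label_shift(posteriors_t, source_priors, n_iter=100):
    """posteriors_t: (n, k) source-classifier posteriors on target samples."""
    q = source_priors.copy()
    for _ in range(n_iter):
        # E-step: reweight posteriors by the current prior ratio.
        w = posteriors_t * (q / source_priors)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: new priors are the average responsibilities.
        q = w.mean(axis=0)
    return q

rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(3), size=500)   # dummy posteriors
print(em_label_shift(post, np.array([1/3, 1/3, 1/3])))
```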

[LG-26] Mixed-Integer Optimization for Responsible Machine Learning

链接: https://arxiv.org/abs/2505.05857
作者: Nathan Justin,Qingshi Sun,Andrés Gómez,Phebe Vayanos
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 56 pages, 10 figures

点击查看摘要

Abstract:In the last few decades, Machine Learning (ML) has achieved significant success across domains ranging from healthcare, sustainability, and the social sciences, to criminal justice and finance. But its deployment in increasingly sophisticated, critical, and sensitive areas affecting individuals, the groups they belong to, and society as a whole raises critical concerns around fairness, transparency, robustness, and privacy, among others. As the complexity and scale of ML systems and of the settings in which they are deployed grow, so does the need for responsible ML methods that address these challenges while providing guaranteed performance in deployment. Mixed-integer optimization (MIO) offers a powerful framework for embedding responsible ML considerations directly into the learning process while maintaining performance. For example, it enables learning of inherently transparent models that can conveniently incorporate fairness or other domain specific constraints. This tutorial paper provides an accessible and comprehensive introduction to this topic discussing both theoretical and practical aspects. It outlines some of the core principles of responsible ML, their importance in applications, and the practical utility of MIO for building ML models that align with these principles. Through examples and mathematical formulations, it illustrates practical strategies and available tools for efficiently solving MIO problems for responsible ML. It concludes with a discussion on current limitations and open research questions, providing suggestions for future work.

[LG-27] Collecting Human Motion Data in Large and Occlusion-Prone Environments using Ultra-Wideband Localization ICRA

链接: https://arxiv.org/abs/2505.05851
作者: Janik Kaden,Maximilian Hilger,Tim Schreiter,Marius Schaab,Thomas Graichen,Andrey Rudenko,Ulrich Heinkel,Achim J. Lilienthal
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: accepted for presentation at the 7th Workshop on Long-term Human Motion Prediction (LHMP) at International Conference on Robotics and Automation (ICRA) 2025

点击查看摘要

Abstract:With robots increasingly integrating into human environments, understanding and predicting human motion is essential for safe and efficient interactions. Modern human motion and activity prediction approaches require high quality and quantity of data for training and evaluation, usually collected from motion capture systems, onboard or stationary sensors. Setting up these systems is challenging due to the intricate setup of hardware components, extensive calibration procedures, occlusions, and substantial costs. These constraints make deploying such systems in new and large environments difficult and limit their usability for in-the-wild measurements. In this paper we investigate the possibility of applying the novel Ultra-Wideband (UWB) localization technology as a scalable alternative for human motion capture in crowded and occlusion-prone environments. We include additional sensing modalities such as eye-tracking, onboard robot LiDAR and radar sensors, and record motion capture data as ground truth for evaluation and comparison. The environment imitates a museum setup, with up to four active participants navigating toward random goals in a natural way, and offers more than 130 minutes of multi-modal data. Our investigation provides a step toward scalable and accurate motion data collection beyond vision-based systems, laying a foundation for evaluating sensing modalities like UWB in larger and complex environments like warehouses, airports, or convention centers.

[LG-28] DaringFed: A Dynamic Bayesian Persuasion Pricing for Online Federated Learning under Two-sided Incomplete Information

链接: https://arxiv.org/abs/2505.05842
作者: Yun Xin,Jianfeng Lu,Shuqin Cao,Gang Li,Haozhao Wang,Guanghui Wen
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Online Federated Learning (OFL) is a real-time learning paradigm that sequentially executes parameter aggregation immediately for each randomly arriving client. To motivate clients to participate in OFL, it is crucial to offer appropriate incentives to offset the training resource consumption. However, the design of incentive mechanisms in OFL is constrained by the dynamic variability of Two-sided Incomplete Information (TII) concerning resources, where the server is unaware of the clients' dynamically changing computational resources, while clients lack knowledge of the real-time communication resources allocated by the server. To incentivize clients to participate in training by offering dynamic rewards to each arriving client, we design a novel Dynamic Bayesian persuasion pricing for online Federated learning (DaringFed) under TII. Specifically, we begin by formulating the interaction between the server and clients as a dynamic signaling and pricing allocation problem within a Bayesian persuasion game, and then demonstrate the existence of a unique Bayesian persuasion Nash equilibrium. By deriving the optimal design of DaringFed under one-sided incomplete information, we further analyze the approximate optimal design of DaringFed with a specific bound under TII. Finally, extensive evaluation conducted on real datasets demonstrates that DaringFed improves accuracy and convergence speed by 16.99%, while experiments with synthetic datasets validate the convergence of the estimates of unknown values and the effectiveness of DaringFed in improving the server's utility by up to 12.6%.

[LG-29] New Statistical and Computational Results for Learning Junta Distributions

链接: https://arxiv.org/abs/2505.05819
作者: Lorenzo Beretta
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:We study the problem of learning junta distributions on \{0,1\}^n, where a distribution is a k-junta if its probability mass function depends on a subset of at most k variables. We make two main contributions: (1) We show that learning k-junta distributions is computationally equivalent to learning k-parity functions with noise (LPN), a landmark problem in computational learning theory. (2) We design an algorithm for learning junta distributions whose statistical complexity is optimal, up to polylogarithmic factors. Computationally, our algorithm matches the complexity of previous (non-sample-optimal) algorithms. Combined, our two contributions imply that our algorithm cannot be significantly improved, statistically or computationally, barring a breakthrough for LPN.

[LG-30] On the Price of Differential Privacy for Spectral Clustering over Stochastic Block Models

链接: https://arxiv.org/abs/2505.05816
作者: Antti Koskela,Mohamed Seif,Andrea J. Goldsmith
类目: Social and Information Networks (cs.SI); Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate privacy-preserving spectral clustering for community detection within stochastic block models (SBMs). Specifically, we focus on edge differential privacy (DP) and propose private algorithms for community recovery. Our work explores the fundamental trade-offs between the privacy budget and the accurate recovery of community labels. Furthermore, we establish information-theoretic conditions that guarantee the accuracy of our methods, providing theoretical assurances for successful community recovery under edge DP.
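实现边级差分隐私谱聚类的一种通用做法,是先对邻接矩阵做随机响应,再在加噪图上做标准谱嵌入。以下为教科书式的示意构造(不一定是论文所用的确切机制):

```python
import numpy as np

# Generic edge-DP pipeline sketch: randomized response on the adjacency
# matrix, then plain spectral clustering; a textbook construction, not
# necessarily the paper's exact mechanism.
def randomized_response(A, epsilon, rng):
    p_keep = np.exp(epsilon) / (1 + np.exp(epsilon))   # P(report true edge bit)
    flip = rng.random(A.shape) > p_keep
    A_priv = np.where(flip, 1 - A, A)
    A_priv = np.triu(A_priv, 1)
    return A_priv + A_priv.T                           # symmetric, no self-loops

def spectral_embed(A, k):
    d = np.maximum(A.sum(axis=1), 1e-9)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))   # normalized Laplacian
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :k]                                 # k smallest eigenvectors

rng = np.random.default_rng(0)
n = 60                                                 # toy two-block SBM
A = (rng.random((n, n)) < 0.05).astype(float)
A[:30, :30] = rng.random((30, 30)) < 0.4
A[30:, 30:] = rng.random((30, 30)) < 0.4
A = np.triu(A, 1); A = A + A.T
emb = spectral_embed(randomized_response(A, epsilon=2.0, rng=rng), k=2)
print(emb.shape)  # (60, 2), ready for k-means
```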

[LG-31] BCE vs. CE in Deep Feature Learning ICML2025

链接: https://arxiv.org/abs/2505.05813
作者: Qiufu Li,Huibin Xiao,Linlin Shen
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML2025

点击查看摘要

Abstract:When training classification models, one expects the learned features to be compact within classes and to separate different classes well. As the dominant loss function for training classification models, minimizing the cross-entropy (CE) loss maximizes this compactness and distinctiveness, i.e., it reaches neural collapse (NC). Recent works show that binary CE (BCE) also performs well in multi-class tasks. In this paper, we compare BCE and CE in deep feature learning. For the first time, we prove that BCE can also maximize the intra-class compactness and inter-class distinctiveness when reaching its minimum, i.e., leading to NC. We point out that CE measures the relative values of decision scores in model training, implicitly enhancing the feature properties by classifying samples one-by-one. In contrast, BCE measures the absolute values of decision scores and adjusts the positive/negative decision scores across all samples to uniformly high/low levels. Meanwhile, the classifier biases in BCE impose a substantial constraint on the decision scores, explicitly enhancing the feature properties during training. The experimental results are aligned with the above analysis, and show that BCE can improve classification and lead to better compactness and distinctiveness among sample features. The codes will be released.
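"相对值 vs. 绝对值"这一区别用数值即可看出:CE 对所有决策分数整体平移保持不变,而 BCE 不是。一个小示意:

```python
import numpy as np

# CE vs. BCE on one sample: CE normalizes scores jointly (only relative
# values matter), BCE penalizes each class score independently (absolute
# values matter).
def ce_loss(scores, y):
    z = scores - scores.max()                  # stabilized log-softmax
    return -(z[y] - np.log(np.exp(z).sum()))

def bce_loss(scores, y):
    targets = np.zeros_like(scores); targets[y] = 1.0
    p = 1.0 / (1.0 + np.exp(-scores))          # per-class sigmoid
    return -(targets * np.log(p) + (1 - targets) * np.log(1 - p)).sum()

scores = np.array([2.0, -1.0, 0.5])
shifted = scores + 10.0                        # add a constant to all scores
print(ce_loss(scores, 0), ce_loss(shifted, 0))    # identical: CE is shift-invariant
print(bce_loss(scores, 0), bce_loss(shifted, 0))  # different: BCE is not
```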

[LG-32] A novel Neural-ODE model for the state of health estimation of lithium-ion battery using charging curve

链接: https://arxiv.org/abs/2505.05803
作者: Yiming Li,Man He,Jiapeng Liu
类目: Machine Learning (cs.LG)
*备注: 28 pages, 6 figures

点击查看摘要

Abstract:The state of health (SOH) of lithium-ion batteries (LIBs) is crucial for ensuring the safe and reliable operation of electric vehicles. Nevertheless, the prevailing SOH estimation methods often have limited generalizability. This paper introduces a data-driven approach for estimating the SOH of LIBs, which is designed to improve generalization. We construct a hybrid model named ACLA, which integrates the attention mechanism, convolutional neural network (CNN), and long short-term memory network (LSTM) into the augmented neural ordinary differential equation (ANODE) framework. This model employs the normalized charging time corresponding to specific voltages in the constant current charging phase as input and outputs the SOH as well as the remaining useful life. The model is trained on the NASA and Oxford datasets and validated on the TJU and HUST datasets. Compared to the benchmark models NODE and ANODE, ACLA exhibits higher accuracy with root mean square errors (RMSE) for SOH estimation as low as 1.01% and 2.24% on the TJU and HUST datasets, respectively.

[LG-33] Rethinking Graph Out-Of-Distribution Generalization: A Learnable Random Walk Perspective

链接: https://arxiv.org/abs/2505.05785
作者: Henan Sun,Xunkai Li,Lei Zhu,Junyi Han,Guang Zeng,Ronghua Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:Out-Of-Distribution (OOD) generalization has gained increasing attention in machine learning on graphs, as graph neural networks (GNNs) often exhibit performance degradation under distribution shifts. Existing graph OOD methods tend to follow the basic ideas of invariant risk minimization and structural causal models, interpreting the invariant knowledge across datasets under various distribution shifts as graph topology or graph spectrum. However, these interpretations may be inconsistent with real-world scenarios, as neither invariant topology nor spectrum is assured. In this paper, we advocate the learnable random walk (LRW) perspective as the instantiation of invariant knowledge, and propose LRW-OOD to realize graph OOD generalization learning. Instead of employing a fixed probability transition matrix (i.e., the degree-normalized adjacency matrix), we parameterize the transition matrix with an LRW-sampler and a path encoder. Furthermore, we propose a kernel density estimation (KDE)-based mutual information (MI) loss to generate random walk sequences that adhere to OOD principles. Extensive experiments demonstrate that our model can effectively enhance graph OOD generalization under various types of distribution shifts and yield a significant accuracy improvement of 3.87% over state-of-the-art graph OOD generalization baselines.

[LG-34] Deep-ICE: The first globally optimal algorithm for empirical risk minimization of two-layer maxout and ReLU networks

链接: https://arxiv.org/abs/2505.05740
作者: Xi He,Yi Miao,Max A. Little
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces the first globally optimal algorithm for the empirical risk minimization problem of two-layer maxout and ReLU networks, i.e., minimizing the number of misclassifications. The algorithm has a worst-case time complexity of O(N^{DK+1}), where K denotes the number of hidden neurons and D represents the number of features. It can be generalized to accommodate arbitrary computable loss functions without affecting its computational complexity. Our experiments demonstrate that the proposed algorithm provides provably exact solutions for small-scale datasets. To handle larger datasets, we introduce a novel coreset selection method that reduces the data size to a manageable scale, making it feasible for our algorithm. This extension enables efficient processing of large-scale datasets and achieves significantly improved performance, with a 20-30% reduction in misclassifications for both training and prediction, compared to state-of-the-art approaches (neural networks trained using gradient descent and support vector machines), when applied to the same models (two-layer networks with fixed hidden nodes and linear models).

[LG-35] Understanding Stragglers in Large Model Training Using What-if Analysis

链接: https://arxiv.org/abs/2505.05713
作者: Jinkun Lin,Ziheng Jiang,Zuquan Song,Sida Zhao,Menghan Yu,Zhanghan Wang,Chenyuan Wang,Zuocheng Shi,Xiang Shi,Wei Jia,Zherui Liu,Shuguang Wang,Haibin Lin,Xiu Liu,Aurojit Panda,Jinyang Li
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where training can be stalled by a few slow workers. At ByteDance we find that stragglers are not always simply caused by hardware failures, but can arise from multiple complex factors. This work aims to present a comprehensive study on the straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis that simulates the scenario without any stragglers and contrasts with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes for stragglers?
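what-if 分析本身很容易示意:把可疑 worker 的每步耗时替换为典型值,重算同步作业的总时长。以下用虚构耗时给出玩具版本:

```python
import numpy as np

# Toy what-if analysis: estimate a synchronous job's runtime from observed
# per-worker step durations, then re-simulate with the suspect worker's
# durations replaced by the cluster median (the "no straggler" counterfactual).
rng = np.random.default_rng(0)
n_workers, n_steps = 64, 200
durations = rng.gamma(shape=20, scale=0.05, size=(n_steps, n_workers))
durations[:, 7] *= 1.5                        # inject one persistent straggler

actual = durations.max(axis=1).sum()          # sync step time = slowest worker
what_if = durations.copy()
what_if[:, 7] = np.median(durations, axis=1)  # counterfactual: worker 7 is typical
ideal = what_if.max(axis=1).sum()
print(f"straggler slowdown: {actual / ideal:.2f}x")
```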

[LG-36] Crowding Out The Noise: Algorithmic Collective Action Under Differential Privacy

链接: https://arxiv.org/abs/2505.05707
作者: Rushabh Solanki,Meghana Bhange,Ulrich Aïvodji,Elliot Creager
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The integration of AI into daily life has generated considerable attention and excitement, while also raising concerns about automating algorithmic harms and re-entrenching existing social inequities. While the responsible deployment of trustworthy AI systems is a worthy goal, there are many possible ways to realize it, from policy and regulation to improved algorithm design and evaluation. In fact, since AI trains on social data, there is even a possibility for everyday users, citizens, or workers to directly steer its behavior through Algorithmic Collective Action, by deliberately modifying the data they share with a platform to drive its learning process in their favor. This paper considers how these grassroots efforts to influence AI interact with methods already used by AI firms and governments to improve model trustworthiness. In particular, we focus on the setting where the AI firm deploys a differentially private model, motivated by the growing regulatory focus on privacy and data protection. We investigate how the use of Differentially Private Stochastic Gradient Descent (DPSGD) affects the collective’s ability to influence the learning process. Our findings show that while differential privacy contributes to the protection of individual data, it introduces challenges for effective algorithmic collective action. We characterize lower bounds on the success of algorithmic collective action under differential privacy as a function of the collective’s size and the firm’s privacy parameters, and verify these trends experimentally by simulating collective action during the training of deep neural network classifiers across several datasets.
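作为参考,下面写出一步标准 DPSGD(逐样本裁剪加高斯噪声),即论文所研究的与集体行动交互的机制;参数值为演示用途:

```python
import numpy as np

# One step of standard DPSGD: clip each per-example gradient to norm `clip`,
# sum, add Gaussian noise calibrated to the clip norm, then average.
def dpsgd_step(params, per_example_grads, rng, lr=0.1, clip=1.0, noise_mult=1.0):
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip)   # per-example clip
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        scale=noise_mult * clip, size=params.shape)               # add noise once
    return params - lr * noisy_sum / len(per_example_grads)

rng = np.random.default_rng(0)
params = np.zeros(5)
grads = rng.normal(size=(32, 5))   # stand-in per-example gradients
print(dpsgd_step(params, grads, rng))
```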

[LG-37] Hypergraph Neural Sheaf Diffusion: A Symmetric Simplicial Set Framework for Higher-Order Learning

链接: https://arxiv.org/abs/2505.05702
作者: Seongjin Choi,Gahee Kim,Yong-Geun Oh
类目: Machine Learning (cs.LG)
*备注: This manuscript has been submitted to IEEE Access for publication

点击查看摘要

Abstract:The absence of intrinsic adjacency relations and orientation systems in hypergraphs creates fundamental challenges for constructing sheaf Laplacians of arbitrary degrees. We resolve these limitations through symmetric simplicial sets derived directly from hypergraphs, which encode all possible oriented subrelations within each hyperedge as ordered tuples. This construction canonically defines adjacency via facet maps while inherently preserving hyperedge provenance. We establish that the normalized degree zero sheaf Laplacian on our induced symmetric simplicial set reduces exactly to the traditional graph normalized sheaf Laplacian when restricted to graphs, validating its mathematical consistency with prior graph-based sheaf theory. Furthermore, the induced structure preserves all structural information from the original hypergraph, ensuring that every multi-way relational detail is faithfully retained. Leveraging this framework, we introduce Hypergraph Neural Sheaf Diffusion (HNSD), the first principled extension of Neural Sheaf Diffusion (NSD) to hypergraphs. HNSD operates via normalized degree zero sheaf Laplacians over symmetric simplicial sets, resolving orientation ambiguity and adjacency sparsity inherent to hypergraph learning. Experimental evaluations demonstrate HNSD’s competitive performance across established benchmarks.

[LG-38] Extending Stress Detection Reproducibility to Consumer Wearable Sensors

链接: https://arxiv.org/abs/2505.05694
作者: Ohida Binte Amin,Varun Mishra,Tinashe M. Tapera,Robert Volpe,Aarti Sathyanarayana
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted at IEEE EMBC 2025

点击查看摘要

Abstract:Wearable sensors are widely used to collect physiological data and develop stress detection models. However, most studies focus on a single dataset, rarely evaluating model reproducibility across devices, populations, or study conditions. We previously assessed the reproducibility of stress detection models across multiple studies, testing models trained on one dataset against others using heart rate (with R-R interval) and electrodermal activity (EDA). In this study, we extended our stress detection reproducibility to consumer wearable sensors. We compared validated research-grade devices, to consumer wearables - Biopac MP160, Polar H10, Empatica E4, to the Garmin Forerunner 55s, assessing device-specific stress detection performance by conducting a new stress study on undergraduate students. Thirty-five students completed three standardized stress-induction tasks in a lab setting. Biopac MP160 performed the best, being consistent with our expectations of it as the gold standard, though performance varied across devices and models. Combining heart rate variability (HRV) and EDA enhanced stress prediction across most scenarios. However, Empatica E4 showed variability; while HRV and EDA improved stress detection in leave-one-subject-out (LOSO) evaluations (AUROC up to 0.953), device-specific limitations led to underperformance when tested with our pre-trained stress detection tool (AUROC 0.723), highlighting generalizability challenges related to hardware-model compatibility. Garmin Forerunner 55s demonstrated strong potential for real-world stress monitoring, achieving the best mental arithmetic stress detection performance in LOSO (AUROC up to 0.961) comparable to research-grade devices like Polar H10 (AUROC 0.954), and Empatica E4 (AUROC 0.905 with HRV-only model and AUROC 0.953 with HRV+EDA model), with the added advantage of consumer-friendly wearability for free-living contexts.

[LG-39] Physics-informed Temporal Difference Metric Learning for Robot Motion Planning ICLR2025

链接: https://arxiv.org/abs/2505.05691
作者: Ruiqi Ni,Zherong Pan,Ahmed H Qureshi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted to ICLR 2025

点击查看摘要

Abstract:The motion planning problem involves finding a collision-free path from a robot’s starting to its target configuration. Recently, self-supervised learning methods have emerged to tackle motion planning problems without requiring expensive expert demonstrations. They solve the Eikonal equation for training neural networks and lead to efficient solutions. However, these methods struggle in complex environments because they fail to maintain key properties of the Eikonal equation, such as optimal value functions and geodesic distances. To overcome these limitations, we propose a novel self-supervised temporal difference metric learning approach that solves the Eikonal equation more accurately and enhances performance in solving complex and unseen planning tasks. Our method enforces Bellman’s principle of optimality over finite regions, using temporal difference learning to avoid spurious local minima while incorporating metric learning to preserve the Eikonal equation’s essential geodesic properties. We demonstrate that our approach significantly outperforms existing self-supervised learning methods in handling complex environments and generalizing to unseen environments, with robot configurations ranging from 2 to 12 degrees of freedom (DOF).

[LG-40] Conditional Front-door Adjustment for Heterogeneous Treatment Assignment Effect Estimation Under Non-adherence ALT

链接: https://arxiv.org/abs/2505.05677
作者: Winston Chen,Trenton Chang,Jenna Wiens
类目: Machine Learning (cs.LG)
*备注: Accepted by Conference on Health, Inference, and Learning (CHIL) 2025

点击查看摘要

Abstract:Estimates of heterogeneous treatment assignment effects can inform treatment decisions. Under the presence of non-adherence (e.g., patients do not adhere to their assigned treatment), both the standard backdoor adjustment (SBD) and the conditional front-door adjustment (CFD) can recover unbiased estimates of the treatment assignment effects. However, the estimation variance of these approaches may vary widely across settings, which remains underexplored in the literature. In this work, we demonstrate theoretically and empirically that CFD yields lower-variance estimates than SBD when the true effect of treatment assignment is small (i.e., assigning an intervention leads to small changes in patients’ future outcome). Additionally, since CFD requires estimating multiple nuisance parameters, we introduce LobsterNet, a multi-task neural network that implements CFD with joint modeling of the nuisance parameters. Empirically, LobsterNet reduces estimation error across several semi-synthetic and real-world datasets compared to baselines. Our findings suggest CFD with shared nuisance parameter modeling can improve treatment assignment effect estimation under non-adherence.
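一个(无条件)front-door 估计器的玩具插值版本可以说明它为何在不依从情形下有效;条件版本及 LobsterNet 的多任务网络超出此示意范围,变量与数据均为虚构:

```python
import numpy as np

# Toy plug-in front-door estimator: A = treatment assignment, M = treatment
# received (mediator), Y = outcome; U confounds A and Y but not M given A.
def front_door(a_obs, m_obs, y_obs, a_target):
    p_a = np.bincount(a_obs, minlength=2) / len(a_obs)
    effect = 0.0
    for m in (0, 1):
        p_m_given_a = (m_obs[a_obs == a_target] == m).mean()
        inner = sum(p_a[a2] * y_obs[(m_obs == m) & (a_obs == a2)].mean()
                    for a2 in (0, 1))
        effect += p_m_given_a * inner
    return effect

rng = np.random.default_rng(0)
n = 20000
u = rng.random(n)                                   # unobserved confounder
a = (rng.random(n) < 0.3 + 0.4 * u).astype(int)     # confounded assignment
m = (rng.random(n) < 0.2 + 0.6 * a).astype(int)     # adherence depends only on A
y = 0.5 * m + u + rng.normal(scale=0.1, size=n)     # outcome confounded via U
# True assignment effect = 0.5 * 0.6 = 0.3; naive E[Y|A=1]-E[Y|A=0] is biased.
print(front_door(a, m, y, 1) - front_door(a, m, y, 0))  # ~0.3
```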

[LG-41] EquiHGNN: Scalable Rotationally Equivariant Hypergraph Neural Networks

链接: https://arxiv.org/abs/2505.05650
作者: Tien Dang,Truong-Son Hy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Molecular interactions often involve high-order relationships that cannot be fully captured by traditional graph-based models limited to pairwise connections. Hypergraphs naturally extend graphs by enabling multi-way interactions, making them well-suited for modeling complex molecular systems. In this work, we introduce EquiHGNN, an Equivariant HyperGraph Neural Network framework that integrates symmetry-aware representations to improve molecular modeling. By enforcing the equivariance under relevant transformation groups, our approach preserves geometric and topological properties, leading to more robust and physically meaningful representations. We examine a range of equivariant architectures and demonstrate that integrating symmetry constraints leads to notable performance gains on large-scale molecular datasets. Experiments on both small and large molecules show that high-order interactions offer limited benefits for small molecules but consistently outperform 2D graphs on larger ones. Adding geometric features to these high-order structures further improves the performance, emphasizing the value of spatial information in molecular learning. Our source code is available at this https URL

[LG-42] LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities

链接: https://arxiv.org/abs/2505.05619
作者: Kalyan Nakka,Jimmy Dani,Ausmit Mondal,Nitesh Saxena
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 14 pages, 18 figures, and 4 tables

点击查看摘要

Abstract:The growing adoption of Large Language Models (LLMs) has influenced the development of their lighter counterparts-Small Language Models (SLMs)-to enable on-device deployment across smartphones and edge devices. These SLMs offer enhanced privacy, reduced latency, server-free functionality, and improved user experience. However, due to resource constraints of on-device environment, SLMs undergo size optimization through compression techniques like quantization, which can inadvertently introduce fairness, ethical and privacy risks. Critically, quantized SLMs may respond to harmful queries directly, without requiring adversarial manipulation, raising significant safety and trust concerns. To address this, we propose LiteLMGuard (LLMG), an on-device prompt guard that provides real-time, prompt-level defense for quantized SLMs. Additionally, our prompt guard is designed to be model-agnostic such that it can be seamlessly integrated with any SLM, operating independently of underlying architectures. Our LLMG formalizes prompt filtering as a deep learning (DL)-based prompt answerability classification task, leveraging semantic understanding to determine whether a query should be answered by any SLM. Using our curated dataset, Answerable-or-Not, we trained and fine-tuned several DL models and selected ELECTRA as the candidate, with 97.75% answerability classification accuracy. Our safety effectiveness evaluations demonstrate that LLMG defends against over 87% of harmful prompts, including both direct instruction and jailbreak attack strategies. We further showcase its ability to mitigate the Open Knowledge Attacks, where compromised SLMs provide unsafe responses without adversarial prompting. In terms of prompt filtering effectiveness, LLMG achieves near state-of-the-art filtering accuracy of 94%, with an average latency of 135 ms, incurring negligible overhead for users.

[LG-43] On Corruption-Robustness in Performative Reinforcement Learning

链接: https://arxiv.org/abs/2505.05609
作者: Vasilis Pollatos,Debmalya Mandal,Goran Radanovic
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In performative Reinforcement Learning (RL), an agent faces a policy-dependent environment: the reward and transition functions depend on the agent's policy. Prior work on performative RL has studied the convergence of repeated retraining approaches to a performatively stable policy. In the finite sample regime, these approaches repeatedly solve for a saddle point of a convex-concave objective, which estimates the Lagrangian of a regularized version of the reinforcement learning problem. In this paper, we aim to extend such repeated retraining approaches, enabling them to operate under corrupted data. More specifically, we consider Huber's \epsilon-contamination model, where an \epsilon fraction of data points is corrupted by arbitrary adversarial noise. We propose a repeated retraining approach based on convex-concave optimization under corrupted gradients and a novel problem-specific robust mean estimator for the gradients. We prove that our approach exhibits last-iterate convergence to an approximately stable policy, with the approximation error linear in \sqrt{\epsilon}. We experimentally demonstrate the importance of accounting for corruption in performative RL.

[LG-44] The Evolution of Embedding Table Optimization and Multi-Epoch Training in Pinterest Ads Conversion

链接: https://arxiv.org/abs/2505.05605
作者: Andrew Qiu,Shubham Barhate,Hin Wai Lui,Runze Su,Rafael Rios Müller,Kungang Li,Ling Leng,Han Sun,Shayan Ehsani,Zhifang Liu
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Deep learning for conversion prediction has found widespread applications in online advertising. These models have become more complex as they are trained to jointly predict multiple objectives such as click, add-to-cart, checkout and other conversion types. Additionally, the capacity and performance of these models can often be increased with the use of embedding tables that encode high cardinality categorical features such as advertiser, user, campaign, and product identifiers (IDs). These embedding tables can be pre-trained, but also learned end-to-end jointly with the model to directly optimize the model objectives. Training these large tables is challenging due to: gradient sparsity, the high cardinality of the categorical features, the non-uniform distribution of IDs and the very high label sparsity. These issues make training prone to both slow convergence and overfitting after the first epoch. Previous works addressed the multi-epoch overfitting issue by using: stronger feature hashing to reduce cardinality, filtering of low frequency IDs, regularization of the embedding tables, re-initialization of the embedding tables after each epoch, etc. Some of these techniques reduce overfitting at the expense of reduced model performance if used too aggressively. In this paper, we share key learnings from the development of embedding table optimization and multi-epoch training in Pinterest Ads Conversion models. We showcase how our Sparse Optimizer speeds up convergence, and how multi-epoch overfitting varies in severity between different objectives in a multi-task model depending on label sparsity. We propose a new approach to deal with multi-epoch overfitting: the use of a frequency-adaptive learning rate on the embedding tables and compare it to embedding re-initialization. We evaluate both methods offline using an industrial large-scale production dataset.
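所提出的频率自适应学习率可示意为嵌入表上的逐行衰减;下面的逆平方根调度只是一个说明性选择,并非生产环境所用的调度:

```python
import numpy as np

# Sketch of a frequency-adaptive learning rate on an embedding table: rows for
# frequently seen IDs get damped updates as their counts grow, one way to curb
# multi-epoch overfitting on head IDs. The decay schedule is illustrative.
class FreqAdaptiveEmbedding:
    def __init__(self, n_ids, dim, base_lr=0.1, rng=None):
        rng = rng or np.random.default_rng(0)
        self.table = rng.normal(scale=0.01, size=(n_ids, dim))
        self.counts = np.zeros(n_ids)
        self.base_lr = base_lr

    def update(self, ids, grads):
        np.add.at(self.counts, ids, 1)          # handles repeated IDs in a batch
        row_lr = self.base_lr / np.sqrt(1.0 + self.counts[ids])  # per-row decay
        np.add.at(self.table, ids, -row_lr[:, None] * grads)

emb = FreqAdaptiveEmbedding(n_ids=1000, dim=16)
ids = np.array([3, 3, 3, 999])                  # ID 3 is "head", 999 is "tail"
emb.update(ids, np.ones((4, 16)))
print(emb.counts[3], emb.counts[999])           # 3.0 1.0
```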

[LG-45] Enhancing Large Language Models with Faster Code Preprocessing for Vulnerability Detection

链接: https://arxiv.org/abs/2505.05600
作者: José Gonçalves,Miguel Silva,Eva Maia,Isabel Praça
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 10 pages, 3 tables, DCAI’25: Distributed Computing and Artificial Intelligence 2025

点击查看摘要

Abstract:The application of Artificial Intelligence has become a powerful approach to detecting software vulnerabilities. However, effective vulnerability detection relies on accurately capturing the semantic structure of code and its contextual relationships. Given that the same functionality can be implemented in various forms, a preprocessing tool that standardizes code representation is important. This tool must be efficient, adaptable across programming languages, and capable of supporting new transformations. To address this challenge, we build on the existing SCoPE framework and introduce SCoPE2, an enhanced version with improved performance. We compare both versions in terms of processing time and memory usage and evaluate their impact on a Large Language Model (LLM) for vulnerability detection. Our results show a 97.3% reduction in processing time with SCoPE2, along with an improved F1-score for the LLM, solely due to the refined preprocessing approach.

[LG-46] This part looks alike this: identifying important parts of explained instances and prototypes

链接: https://arxiv.org/abs/2505.05597
作者: Jacek Karolczak,Jerzy Stefanowski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although prototype-based explanations provide a human-understandable way of representing model predictions, they often fail to direct user attention to the most relevant features. We propose a novel approach to identify the most informative features within prototypes, termed alike parts. Using feature importance scores derived from an agnostic explanation method, it emphasizes the most relevant overlapping features between an instance and its nearest prototype. Furthermore, the feature importance score is incorporated into the objective function of the prototype selection algorithms to promote global prototype diversity. Through experiments on six benchmark datasets, we demonstrate that the proposed approach improves user comprehension while maintaining or even increasing predictive accuracy.
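该思路的玩具版本:给定实例与其最近原型各自的归因分数,找出两者都认为重要的特征。其中的"取最小值作为重叠度"与 top-k 规则均为说明性假设:

```python
import numpy as np

# Toy "alike parts": surface the features where the instance's and the
# prototype's attribution scores agree on importance; the min-overlap rule
# and top-k cutoff are assumptions for illustration.
def alike_parts(instance_imp, prototype_imp, k=3):
    overlap = np.minimum(np.abs(instance_imp), np.abs(prototype_imp))
    return np.argsort(overlap)[::-1][:k]    # indices of most-shared features

inst = np.array([0.8, 0.1, 0.5, 0.0, 0.7])
proto = np.array([0.9, 0.0, 0.6, 0.4, 0.1])
print(alike_parts(inst, proto))             # features 0 and 2 rank highest
```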

[LG-47] Anticipating Gaming to Incentivize Improvement: Guiding Agents in (Fair) Strategic Classification

链接: https://arxiv.org/abs/2505.05594
作者: Sura Alhanouti,Parinaz Naghizadeh
类目: Machine Learning (cs.LG)
*备注: 31 pages, 12 figures

点击查看摘要

Abstract:As machine learning algorithms increasingly influence critical decision making in different application areas, understanding human strategic behavior in response to these systems becomes vital. We explore individuals' choice between genuinely improving their qualifications ("improvement") vs. attempting to deceive the algorithm by manipulating their features ("manipulation") in response to an algorithmic decision system. We further investigate an algorithm designer's ability to shape these strategic responses, and its fairness implications. Specifically, we formulate these interactions as a Stackelberg game, where a firm deploys a (fair) classifier, and individuals strategically respond. Our model incorporates both different costs and stochastic efficacy for manipulation and improvement. The analysis reveals different potential classes of agent responses, and characterizes optimal classifiers accordingly. Based on these, we highlight the impact of the firm's anticipation of strategic behavior, identifying when and why a (fair) strategic policy can not only prevent manipulation, but also incentivize agents to opt for improvement.

[LG-48] PRIMG: Efficient LLM-driven Test Generation Using Mutant Prioritization

链接: https://arxiv.org/abs/2505.05584
作者: Mohamed Salah Bouafif,Mohammad Hamdaqa,Edward Zulkoski
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mutation testing is a widely recognized technique for assessing and enhancing the effectiveness of software test suites by introducing deliberate code mutations. However, its application often results in overly large test suites, as developers generate numerous tests to kill specific mutants, increasing computational overhead. This paper introduces PRIMG (Prioritization and Refinement Integrated Mutation-driven Generation), a novel framework for incremental and adaptive test case generation for Solidity smart contracts. PRIMG integrates two core components: a mutation prioritization module, which employs a machine learning model trained on mutant subsumption graphs to predict the usefulness of surviving mutants, and a test case generation module, which utilizes Large Language Models (LLMs) to generate and iteratively refine test cases to achieve syntactic and behavioral correctness. We evaluated PRIMG on real-world Solidity projects from Code4Arena to assess its effectiveness in improving mutation scores and generating high-quality test cases. The experimental results demonstrate that PRIMG significantly reduces test suite size while maintaining high mutation coverage. The prioritization module consistently outperformed random mutant selection, enabling the generation of high-impact tests with reduced computational effort. Furthermore, the refining process enhanced the correctness and utility of LLM-generated tests, addressing their inherent limitations in handling edge cases and complex program logic.

[LG-49] A Common Interface for Automatic Differentiation

链接: https://arxiv.org/abs/2505.05542
作者: Guillaume Dalle,Adrian Hill
类目: Mathematical Software (cs.MS); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 11 pages, 2 figures, 3 listings, 1 table

点击查看摘要

Abstract:For scientific machine learning tasks with a lot of custom code, picking the right Automatic Differentiation (AD) system matters. Our Julia package this http URL provides a common frontend to a dozen AD backends, unlocking easy comparison and modular development. In particular, its built-in preparation mechanism leverages the strengths of each backend by amortizing one-time computations. This is key to enabling sophisticated features like sparsity handling without putting additional burdens on the user.

[LG-50] A critical assessment of reinforcement learning methods for microswimmer navigation in complex flows

链接: https://arxiv.org/abs/2505.05525
作者: Selim Mecanna,Aurore Loisy,Christophe Eloy
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Navigating in a fluid flow while being carried by it, using only information accessible from on-board sensors, is a problem commonly faced by small planktonic organisms. It is also directly relevant to autonomous robots deployed in the oceans. In the last ten years, the fluid mechanics community has widely adopted reinforcement learning, often in the form of its simplest implementations, to address this challenge. But it is unclear how good the strategies learned by these algorithms are. In this paper, we perform a quantitative assessment of reinforcement learning methods applied to navigation in partially observable flows. We first introduce a well-posed problem of directional navigation for which a quasi-optimal policy is known analytically. We then report on the poor performance and robustness of commonly used algorithms (Q-Learning, Advantage Actor Critic) in flows regularly encountered in the literature: Taylor-Green vortices, Arnold-Beltrami-Childress flow, and two-dimensional turbulence. We show that they are vastly surpassed by PPO (Proximal Policy Optimization), a more advanced algorithm that has established dominance across a wide range of benchmarks in the reinforcement learning community. In particular, our custom implementation of PPO matches the theoretical quasi-optimal performance in turbulent flow and does so in a robust manner. Reaching this result required the use of several additional techniques, such as vectorized environments and generalized advantage estimation, as well as hyperparameter optimization. This study demonstrates the importance of algorithm selection, implementation details, and fine-tuning for discovering truly smart autonomous navigation strategies in complex flows.
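上文提到的实现细节之一——广义优势估计(GAE)——篇幅很短,可以直接写出;下例中的 rewards 与 values 均为虚构输入:

```python
import numpy as np

# Generalized advantage estimation (GAE): exponentially weighted sum of
# TD residuals, computed backwards over a trajectory.
def gae(rewards, values, gamma=0.99, lam=0.95):
    """values has length len(rewards) + 1 (bootstrap value appended)."""
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rng = np.random.default_rng(0)
r = rng.normal(size=10)    # dummy rewards
v = rng.normal(size=11)    # dummy value estimates (with bootstrap)
print(gae(r, v))
```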

[LG-51] Economic Analysis and Optimization of Energy Storage Configuration for Park Power Systems Based on Random Forest and Genetic Algorithm

链接: https://arxiv.org/abs/2505.05511
作者: Yanghui Song,Aoqi Li,Lilei Huo
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 8 pages, 8 figures,International Journal of New Developments in Engineering and Society ISSN 2522-3488 Vol. 8, Issue 4: 22-29

点击查看摘要

Abstract:This study aims to analyze the economic performance of various parks under different conditions, particularly focusing on the operational costs and power load balancing before and after the deployment of energy storage systems. Firstly, the economic performance of the parks without energy storage was analyzed using a random forest model. Taking Park A as an example, it was found that the cost had the greatest correlation with electricity purchase, followed by photovoltaic output, indicating that solar and wind power output are key factors affecting economic performance. Subsequently, the operation of the parks after the configuration of a 50kW/100kWh energy storage system was simulated, and the total cost and operation strategy of the energy storage system were calculated. The results showed that after the deployment of energy storage, the amount of wind and solar power curtailment in each park decreased, and the operational costs were reduced. Finally, a genetic algorithm was used to optimize the energy storage configuration of each park. The energy storage operation strategy was optimized through fitness functions, crossover operations, and mutation operations. After optimization, the economic indicators of Parks A, B, and C all improved. The research results indicate that by optimizing energy storage configuration, each park can reduce costs, enhance economic benefits, and achieve sustainable development of the power system.
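遗传算法的主循环(适应度、选择、交叉、变异)可在一个 24 小时充放电调度上示意;下面的电价、负荷与成本模型均为玩具替身,并非论文中的园区数据:

```python
import numpy as np

# Bare-bones genetic algorithm over a daily charge/discharge schedule for the
# 50kW/100kWh system; tariff, load, and cost model are toy stand-ins.
rng = np.random.default_rng(0)
price = 0.5 + 0.4 * np.sin(np.linspace(0, 2 * np.pi, 24))   # hourly tariff
load = 80 + 30 * rng.random(24)                             # park load (kW)

def cost(schedule):
    """schedule: hourly battery power in [-50, 50] kW (positive = charging)."""
    soc = np.clip(np.cumsum(schedule), 0.0, 100.0)          # state of charge (kWh)
    battery = np.diff(np.concatenate(([0.0], soc)))         # realized flows
    grid = np.maximum(load + battery, 0.0)                  # power bought from grid
    return float((price * grid).sum())

pop = rng.uniform(-50, 50, size=(40, 24))
for _ in range(200):
    fit = np.array([cost(ind) for ind in pop])
    parents = pop[np.argsort(fit)[:20]]                     # truncation selection
    mask = rng.random((20, 24)) < 0.5
    children = np.where(mask, parents, parents[::-1])       # uniform crossover
    children += rng.normal(scale=2.0, size=children.shape)  # Gaussian mutation
    pop = np.vstack([parents, np.clip(children, -50, 50)])
print(f"best daily cost: {min(cost(ind) for ind in pop):.1f}")
```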

[LG-52] Akkumula: Evidence accumulation driver models with Spiking Neural Networks

链接: https://arxiv.org/abs/2505.05489
作者: Alberto Morando
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Processes of evidence accumulation for motor control contribute to the ecological validity of driver models. According to established theories of cognition, drivers make control adjustments when a process of accumulation of perceptual inputs reaches a decision boundary. Unfortunately, there is no standard way of building such models, limiting their use. Current implementations are hand-crafted, lack adaptability, and rely on inefficient optimization techniques that do not scale well with large datasets. This paper introduces Akkumula, an evidence accumulation modelling framework built using deep learning techniques to leverage established coding libraries, gradient optimization, and large batch training. The core of the library is based on Spiking Neural Networks, whose operation mimics the evidence accumulation process in the biological brain. The model was tested on data collected during a test-track experiment. Results are promising. The model fits the time course of vehicle control (brake, accelerate, steering) based on vehicle sensor data well. The perceptual inputs are extracted by a dedicated neural network, increasing the context-awareness of the model in dynamic scenarios. Akkumula integrates with existing machine learning architectures, benefits from continuous advancements in deep learning, efficiently processes large datasets, adapts to diverse driving scenarios, and maintains a degree of transparency in its core mechanisms.
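证据累积机制本身(与脉冲网络实现无关)是一个带泄漏的积分器,在越过决策边界时触发控制动作;以下为使用合成感知输入的最小示意,并非论文的脉冲实现:

```python
import numpy as np

# Leaky evidence accumulator: integrate noisy perceptual input until a
# decision boundary is crossed, then trigger a control adjustment. This is
# the abstract mechanism, not the paper's spiking-network implementation.
def accumulate(inputs, leak=0.05, boundary=1.0):
    v = 0.0
    for t, x in enumerate(inputs):
        v = (1 - leak) * v + x          # leaky integration of evidence
        if abs(v) >= boundary:
            return t, np.sign(v)        # decision time and direction
    return None, 0.0                    # boundary never reached

rng = np.random.default_rng(0)
drift = 0.03                            # e.g., looming of a lead vehicle
evidence = drift + rng.normal(scale=0.05, size=500)
print(accumulate(evidence))             # (decision step, +1 = brake, say)
```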

[LG-53] Evolutionary Optimization for the Classification of Small Molecules Regulating the Circadian Rhythm Period: A Reliable Assessment

链接: https://arxiv.org/abs/2505.05485
作者: Antonio Arauzo-Azofra,Jose Molina-Baena,Maria Luque-Rodriguez
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 13 pages, 5 figures, 8 tables. To be published

点击查看摘要

Abstract:The circadian rhythm plays a crucial role in regulating biological processes, and its disruption is linked to various health issues. Identifying small molecules that influence the circadian period is essential for developing targeted therapies. This study explores the use of evolutionary optimization techniques to enhance the classification of these molecules. We applied an evolutionary algorithm to optimize feature selection and classification performance. Several machine learning classifiers were employed, and performance was evaluated using accuracy and generalization ability. The findings demonstrate that the proposed evolutionary optimization method improves classification accuracy and reduces overfitting compared to baseline models. Additionally, the use of variance in accuracy as a penalty factor may enhance the model’s reliability for real-world applications. Our study confirms that evolutionary optimization is an effective strategy for classifying small molecules regulating the circadian rhythm. The proposed approach not only improves predictive performance but also ensures a more robust model.

[LG-54] A Machine-Learning Compositional Study of Exoplanetary Material Accreted Onto Five Helium-Atmosphere White Dwarfs with textttcecilia

链接: https://arxiv.org/abs/2505.06228
作者: Mariona Badenas-Agusti,Siyi Xu,Andrew Vanderburg,Kishalay De,Patrick Dufour,Laura K. Rogers,Susana Hoyos,Simon Blouin,Javier Viaña,Amy Bonsor,Ben Zuckerman
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注: 28 pages, 14 figures, 5 tables. Accepted for publication in MNRAS

点击查看摘要

Abstract:We present the first application of the Machine Learning (ML) pipeline \texttt{cecilia} to determine the physical parameters and photospheric composition of five metal-polluted He-atmosphere white dwarfs without well-characterised elemental abundances. To achieve this, we perform a joint and iterative Bayesian fit to their SDSS (R=2,000) and Keck/ESI (R=4,500) optical spectra, covering the wavelength range from about 3,800Å to 9,000Å. Our analysis measures the abundances of at least two - and up to six - chemical elements in their atmospheres with a predictive accuracy similar to that of conventional WD analysis techniques (\approx 0.20 dex). The white dwarfs with the largest number of detected heavy elements are SDSS J0859+5732 and SDSS J2311-0041, which simultaneously exhibit O, Mg, Si, Ca, and Fe in their Keck/ESI spectra. For all systems, we find that the bulk composition of their pollutants is largely consistent with those of primitive CI chondrites to within 1-2\sigma. We also find evidence of statistically significant (2\sigma) oxygen excesses for SDSS J0859+5732 and SDSS J2311-0041, which could point to the accretion of oxygen-rich exoplanetary material. In the future, as wide-field astronomical surveys deliver millions of public WD spectra to the scientific community, \texttt{cecilia} aspires to unlock population-wide studies of polluted WDs, therefore helping to improve our statistical knowledge of extrasolar compositions.

[LG-55] Multi-User Beamforming with Deep Reinforcement Learning in Sensing-Aided Communication

链接: https://arxiv.org/abs/2505.05956
作者: Xiyu Wang,Gilberto Berardinelli,Hei Victor Cheng,Petar Popovski,Ramoni Adeogun
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: Accepted for Presentation at IEEE EuCNC 6G Summit 2025

点击查看摘要

Abstract:Mobile users are prone to experience beam failure due to beam drifting in millimeter wave (mmWave) communications. Sensing can help alleviate beam drifting with timely beam changes and low overhead since it does not need user feedback. This work studies the problem of optimizing sensing-aided communication by dynamically managing beams allocated to mobile users. A multi-beam scheme is introduced, which allocates multiple beams to the users that need an update on the angle of departure (AoD) estimates and a single beam to the users that have satisfied AoD estimation precision. A deep reinforcement learning (DRL) assisted method is developed to optimize the beam allocation policy, relying only upon the sensing echoes. For comparison, a heuristic AoD-based method using approximated Cramér-Rao lower bound (CRLB) for allocation is also presented. Both methods require neither user feedback nor prior state evolution information. Results show that the DRL-assisted method achieves a considerable gain in throughput than the conventional beam sweeping method and the AoD-based method, and it is robust to different user speeds.

[LG-56] Unsupervised Blind Speech Separation with a Diffusion Prior ICML2025

链接: https://arxiv.org/abs/2505.05657
作者: Zhongweiyang Xu,Xulin Fan,Zhong-Qiu Wang,Xilin Jiang,Romit Roy Choudhury
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Signal Processing (eess.SP)
*备注: Paper Accepted at ICML2025 Demo: this https URL Code: this https URL

点击查看摘要

Abstract:Blind Speech Separation (BSS) aims to separate multiple speech sources from audio mixtures recorded by a microphone array. The problem is challenging because it is a blind inverse problem, i.e., the microphone array geometry, the room impulse response (RIR), and the speech sources, are all unknown. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. The core idea builds on diffusion posterior sampling (DPS), but unlike DPS where the likelihood is tractable, ArrayDPS must approximate the likelihood by formulating a separate optimization problem. The solution to the optimization approximates room acoustics and the relative transfer functions between microphones. These approximations, along with the diffusion priors, iterate through the ArrayDPS sampling process and ultimately yield separated voice sources. We only need a simple single-speaker speech diffusion model as a prior along with the mixtures recorded at the microphones; no microphone array information is necessary. Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos are provided at: this https URL.

[LG-57] Optimal Regret of Bernoulli Bandits under Global Differential Privacy

链接: https://arxiv.org/abs/2505.05613
作者: Achraf Azize,Yulian Wu,Junya Honda,Francesco Orabona,Shinji Ito,Debabrota Basu
类目: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:As sequential learning algorithms are increasingly applied in real life, ensuring data privacy while maintaining their utility emerges as a timely question. In this context, regret minimisation in stochastic bandits under \epsilon-global Differential Privacy (DP) has been widely studied. Unlike bandits without DP, there is a significant gap between the best-known regret lower and upper bounds in this setting, though they “match” in order. Thus, we revisit the regret lower and upper bounds of \epsilon-global DP algorithms for Bernoulli bandits and improve both. First, we prove a tighter regret lower bound involving a novel information-theoretic quantity characterising the hardness of \epsilon-global DP in stochastic bandits. Our lower bound strictly improves on the existing ones across all \epsilon values. Then, we choose two asymptotically optimal bandit algorithms, i.e. KLUCB and IMED, and propose their DP versions, DP-KLUCB and DP-IMED, using a unified blueprint, i.e., (a) running in arm-dependent phases, and (b) adding Laplace noise to achieve privacy. For Bernoulli bandits, we analyse the regrets of these algorithms and show that their regrets asymptotically match our lower bound up to a constant arbitrarily close to 1. This refutes the conjecture that forgetting past rewards is necessary to design optimal bandit algorithms under global DP. At the core of our algorithms lies a new concentration inequality for sums of Bernoulli variables under the Laplace mechanism, which is a new DP version of the Chernoff bound. This result is universally useful, as the DP literature commonly treats the concentrations of Laplace noise and random variables separately, while we couple them to yield a tighter bound.
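A minimal sketch of the unified blueprint, under simplifying assumptions: arms are pulled in doubling, arm-dependent phases, and only a Laplace-noised running mean is released after each phase. A plain UCB index stands in for KLUCB/IMED, and the privacy accounting is glossed over rather than matched to the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_phased_bandit(means, horizon, eps):
    """Sketch of the blueprint: (a) arm-dependent doubling phases,
    (b) Laplace noise on each arm's released mean."""
    k = len(means)
    pulls = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    noisy_mean = np.zeros(k)
    t, regret = 0, 0.0
    while t < horizon:
        # Index = privatized mean + exploration bonus; unpulled arms first.
        bonus = np.sqrt(2 * np.log(max(t, 2)) / np.maximum(pulls, 1))
        index = np.where(pulls == 0, np.inf, noisy_mean + bonus)
        arm = int(np.argmax(index))
        # Pull for a whole phase whose length doubles with each visit.
        phase_len = min(max(1, pulls[arm]), horizon - t)
        rewards = rng.binomial(1, means[arm], size=phase_len)
        sums[arm] += rewards.sum()
        pulls[arm] += phase_len
        t += phase_len
        regret += phase_len * (max(means) - means[arm])
        # Release only a Laplace-noised mean: the mean of n Bernoulli
        # rewards has sensitivity 1/n, so Laplace(1/(eps*n)) noise makes
        # this single release eps-DP (full accounting over phases omitted).
        n = pulls[arm]
        noisy_mean[arm] = sums[arm] / n + rng.laplace(0, 1.0 / (eps * n))
    return regret

print(dp_phased_bandit(np.array([0.5, 0.6]), horizon=10_000, eps=1.0))
```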

[LG-58] Machine learning automorphic forms for black holes

链接: https://arxiv.org/abs/2505.05549
作者: Vishnu Jejjala,Suresh Nampuri,Dumisani Nxumalo,Pratik Roy,Abinash Swain
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); Number Theory (math.NT)
*备注:

点击查看摘要

Abstract:Modular, Jacobi, and mock-modular forms serve as generating functions for BPS black hole degeneracies. By training feed-forward neural networks on Fourier coefficients of automorphic forms derived from the Dedekind eta function, Eisenstein series, and Jacobi theta functions, we demonstrate that machine learning techniques can accurately predict modular weights from truncated expansions. Our results reveal strong performance for negative weight modular and quasi-modular forms, particularly those arising in exact black hole counting formulae, with lower accuracy for positive weights and more complicated combinations of Jacobi theta functions. This study establishes a proof of concept for using machine learning to identify how data is organized in terms of modular symmetries in gravitational systems and suggests a pathway toward automated detection and verification of symmetries in quantum gravity.
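As a toy reconstruction of this setup (not the authors' code): the normalized Eisenstein series E_k has Fourier coefficients proportional to the divisor sums \sigma_{k-1}(n), so one can generate truncated coefficient vectors for several even weights and train a small feed-forward regressor to recover k. The network size, noise level, and normalization below are arbitrary choices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

def sigma(n, power):
    """Divisor sum: sum of d**power over the divisors d of n."""
    return sum(d**power for d in range(1, n + 1) if n % d == 0)

def eisenstein_features(k, n_coeffs=30):
    """Log-scaled truncated Fourier coefficients of E_k: a_n ~ sigma_{k-1}(n)."""
    coeffs = np.array([sigma(n, k - 1) for n in range(1, n_coeffs + 1)], float)
    return np.log1p(coeffs) / k  # crude normalization across weights

weights = [4, 6, 8, 10, 12, 14]
X = np.array([eisenstein_features(k) + np.random.normal(0, 0.01, 30)
              for k in weights for _ in range(200)])
y = np.repeat(weights, 200).astype(float)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
net.fit(X_tr, y_tr)
print("test R^2:", round(net.score(X_te, y_te), 3))
```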

[LG-59] Nature's Insight: A Novel Framework and Comprehensive Analysis of Agentic Reasoning Through the Lens of Neuroscience

链接: https://arxiv.org/abs/2505.05515
作者: Zinan Liu,Haoran Li,Jingyi Lu,Gaoyuan Ma,Xu Hong,Giovanni Iacca,Arvind Kumar,Shaojun Tang,Lin Wang
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 39 pages, 17 figures

点击查看摘要

Abstract:Autonomous AI is no longer a hard-to-reach concept: it enables agents to move beyond executing tasks to independently addressing complex problems, adapting to change while handling the uncertainty of the environment. But what makes agents truly autonomous? It is agentic reasoning, which is crucial for foundation models to develop symbolic logic, statistical correlations, or large-scale pattern recognition in order to process information, draw inferences, and make decisions. However, it remains unclear why and how existing agentic reasoning approaches work, in comparison to biological reasoning, which is instead deeply rooted in neural mechanisms involving hierarchical cognition, multimodal integration, and dynamic interactions. In this work, we propose a novel neuroscience-inspired framework for agentic reasoning. Grounded in three neuroscience-based definitions and supported by mathematical and biological foundations, we propose a unified framework modeling reasoning from perception to action, encompassing four core types (perceptual, dimensional, logical, and interactive), inspired by distinct functional roles observed in the human brain. We apply this framework to systematically classify and analyze existing AI reasoning methods, evaluating their theoretical foundations, computational designs, and practical limitations. We also explore its implications for building more generalizable, cognitively aligned agents in physical and virtual environments. Finally, building on our framework, we outline future directions and propose new neural-inspired reasoning methods, analogous to chain-of-thought prompting. By bridging cognitive neuroscience and AI, this work offers a theoretical foundation and practical roadmap for advancing agentic reasoning in intelligent systems. The associated project can be found at: this https URL .

[LG-60] Improving Local Air Quality Predictions Using Transfer Learning on Satellite Data and Graph Neural Networks

链接: https://arxiv.org/abs/2505.05479
作者: Finn Gueterbock,Raul Santos-Rodriguez,Jeffrey N. Clark
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Air pollution is a significant global health risk, contributing to millions of premature deaths annually. Nitrogen dioxide (NO2), a harmful pollutant, disproportionately affects urban areas, where monitoring networks are often sparse. We propose a novel method for predicting NO2 concentrations at unmonitored locations using transfer learning with satellite and meteorological data. Leveraging the GraphSAGE framework, our approach integrates autoregression and transfer learning to enhance predictive accuracy in data-scarce regions like Bristol. Pre-trained on data from London, UK, our model achieves an 8.6% reduction in Normalised Root Mean Squared Error (NRMSE) and a 32.6% reduction in Gradient RMSE compared to a baseline model. This work demonstrates the potential of virtual sensors for cost-effective air quality monitoring, contributing to actionable insights for climate and health interventions.
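A minimal sketch of the pre-train/fine-tune recipe, assuming hypothetical data: a two-layer GraphSAGE regressor is trained on the source city's monitor graph and then fine-tuned at a lower learning rate on the target city's sparse graph. Graph construction, autoregressive features, and the real datasets are elided.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class NO2Regressor(torch.nn.Module):
    def __init__(self, in_dim, hidden=32):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        return self.head(h).squeeze(-1)

def fit(model, x, edge_index, y, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.mse_loss(model(x, edge_index), y)
        loss.backward()
        opt.step()

# Hypothetical node features (satellite + meteorological + lagged NO2).
torch.manual_seed(0)
x_london = torch.randn(500, 12); e_london = torch.randint(0, 500, (2, 2000))
y_london = torch.randn(500)
x_bristol = torch.randn(60, 12); e_bristol = torch.randint(0, 60, (2, 200))
y_bristol = torch.randn(60)

model = NO2Regressor(in_dim=12)
fit(model, x_london, e_london, y_london, epochs=200, lr=1e-3)   # pre-train
fit(model, x_bristol, e_bristol, y_bristol, epochs=50, lr=1e-4) # fine-tune
```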

[LG-61] OccuEMBED: Occupancy Extraction Merged with Building Energy Disaggregation for Occupant-Responsive Operation at Scale

链接: https://arxiv.org/abs/2505.05478
作者: Yufei Zhang(1),Andrew Sonta(1) ((1) ETHOS Lab, EPFL-ENAC-IIC)
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 33 pages, 16 figures

点击查看摘要

Abstract:Buildings account for a significant share of global energy consumption and emissions, making it critical to operate them efficiently. As electricity grids become more volatile with renewable penetration, buildings must provide flexibility to support grid stability. Building automation plays a key role in enhancing efficiency and flexibility via centralized operations, but it must prioritize occupant-centric strategies to balance energy and comfort targets. However, incorporating occupant information into large-scale, centralized building operations remains challenging due to data limitations. We investigate the potential of using whole-building smart meter data to infer both occupancy and system operations. Integrating these insights into data-driven building energy analysis allows more occupant-centric energy saving and flexibility at scale. Specifically, we propose OccuEMBED, a unified framework for occupancy inference and system-level load analysis. It combines two key components: a probabilistic occupancy profile generator, and a controllable and interpretable load disaggregator supported by Kolmogorov-Arnold Networks (KAN). This design embeds knowledge of occupancy patterns and load-occupancy-weather relationships into deep learning models. We conducted comprehensive evaluations across synthetic and real-world datasets, demonstrating its effectiveness against various occupancy inference baselines. OccuEMBED always achieved average F1 scores above 0.8 in discrete occupancy inference and RMSE within 0.1-0.2 for continuous occupancy ratios. We further demonstrate how OccuEMBED integrates with building load monitoring platforms to display occupancy profiles, analyze system-level operations, and inform occupant-responsive strategies. Our model lays a robust foundation for scaling occupant-centric building management systems to meet the challenges of an evolving energy system.
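A minimal sketch of the disaggregation idea, with loudly labeled assumptions: the whole-building meter signal is modeled as a weather-dependent base load plus an occupancy-driven load, both fit jointly from the total signal alone. A small MLP stands in for the paper's KAN component, and all data below are synthetic.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 24 * 14  # two weeks of hourly smart-meter readings
hour = torch.arange(T) % 24
weather = 10 + 8 * torch.sin(2 * torch.pi * torch.arange(T) / (24 * 7))

# Synthetic ground truth: occupants present 8:00-18:00 drive an extra load.
occ_true = ((hour >= 8) & (hour < 18)).float()
meter = 5 + 0.3 * weather + 4 * occ_true + 0.3 * torch.randn(T)

# Model: per-hour logits -> occupancy probability; base load from weather.
occ_logits = torch.zeros(24, requires_grad=True)
base_net = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(),
                               torch.nn.Linear(16, 1))
occ_gain = torch.tensor(1.0, requires_grad=True)

params = [occ_logits, occ_gain] + list(base_net.parameters())
opt = torch.optim.Adam(params, lr=0.05)
for _ in range(500):
    opt.zero_grad()
    occ = torch.sigmoid(occ_logits)[hour]               # probabilistic profile
    pred = base_net(weather.unsqueeze(1)).squeeze(1) + occ_gain * occ
    loss = F.mse_loss(pred, meter)                      # fit the total signal
    loss.backward()
    opt.step()

print("inferred occupied hours:",
      (torch.sigmoid(occ_logits) > 0.5).nonzero().flatten().tolist())
```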

信息检索

[IR-0] Cost-Effective Low Latency Vector Search with Azure Cosmos DB

链接: https://arxiv.org/abs/2505.05885
作者: Nitish Upreti,Krishnan Sundaram,Hari Sudan Sundar,Samer Boshra,Balachandar Perumalswamy,Shivam Atri,Martin Chisholm,Revti Raman Singh,Greg Yang,Subramanyam Pattipaka,Tamara Hass,Nitesh Dudhey,James Codella,Mark Hildebrand,Magdalen Manohar,Jack Moffitt,Haiyang Xu,Naren Datha,Suryansh Gupta,Ravishankar Krishnaswamy,Prashant Gupta,Abhishek Sahu,Ritika Mor,Santosh Kulkarni,Hemeswari Varada,Sudhanshu Barthwal,Amar Sagare,Dinesh Billa,Zishan Fu,Neil Deshpande,Shaun Cooper,Kevin Pilch,Simon Moreno,Aayush Kataria,Vipul Vishal,Harsha Vardhan Simhadri
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Vector indexing enables semantic search over diverse corpora and has become an important interface to databases for both users and AI agents. Efficient vector search requires deep optimizations in database systems. This has motivated a new class of specialized vector databases that optimize for vector search quality and cost. Instead, we argue that a scalable, high-performance, and cost-efficient vector search system can be built inside a cloud-native operational database like Azure Cosmos DB, while leveraging the benefits of a distributed database such as high availability, durability, and scale. We do this by deeply integrating DiskANN, a state-of-the-art vector indexing library, inside Azure Cosmos DB NoSQL. This system uses a single vector index per partition stored in existing index trees, kept in sync with the underlying data. It supports 20ms query latency over an index spanning 10 million vectors, has stable recall over updates, and offers nearly 15x and 41x lower query cost compared to Zilliz and Pinecone serverless enterprise products. It also scales out to billions of vectors via automatic partitioning. This convergent design presents a point in favor of integrating vector indices into operational databases, in the context of recent debates on specialized vector databases, and offers a template for vector indexing in other databases.
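A minimal sketch of the scale-out pattern described above, not the Cosmos DB API: one vector index per partition, with a query fanned out to every partition and the per-partition top-k merged globally. Brute-force cosine search stands in for DiskANN, and the round-robin partitioning is arbitrary.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)

class PartitionIndex:
    """One index per partition; brute-force cosine stands in for DiskANN."""
    def __init__(self, vectors, ids):
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        self.vectors = vectors / norms
        self.ids = ids

    def search(self, query, k):
        scores = self.vectors @ (query / np.linalg.norm(query))
        top = np.argsort(-scores)[:k]
        return [(float(scores[i]), self.ids[i]) for i in top]

# Build three partitions of a sharded collection.
all_vecs = rng.normal(size=(3000, 64))
partitions = [PartitionIndex(all_vecs[i::3], ids=list(range(i, 3000, 3)))
              for i in range(3)]

def fanout_search(query, k=5):
    # Query every partition, then merge the per-partition top-k globally.
    candidates = [hit for p in partitions for hit in p.search(query, k)]
    return heapq.nlargest(k, candidates)

print(fanout_search(rng.normal(size=64)))
```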

附件下载

点击下载今日全部论文列表