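This post lists the latest papers retrieved from arXiv.org on 2025-05-15. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.
Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-05-15)
407 papers were updated today, including:
- Natural Language Processing: 31 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 108 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 93 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 124 papers (Machine Learning (cs.LG))
Natural Language Processing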
[NLP-0] Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?
[Quick Read]: This paper addresses the systematic biases that language models (LMs) show when exploring and inferring causal relationships, in particular their preference for disjunctive causal relationships and their difficulty with conjunctive ones. The key to the solution is a test-time sampling method that explicitly samples and eliminates hypotheses about causal relationships, which significantly reduces the models' disjunctive bias and moves them closer to scientific, causally rigorous reasoning.
Link: https://arxiv.org/abs/2505.09614
Authors: Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake A. Richards, Rob Fergus, Kenneth Marino
Affiliations: Center for Data Science, New York University, New York, USA; Integrated Program in Neuroscience, McGill University, Montreal, QC, Canada; Mila - Quebec Artificial Intelligence Institute, Montreal, QC, Canada; School of Computer Science, McGill University, Montreal, QC, Canada; Department of Neurology & Neurosurgery, McGill University, Montreal, QC, Canada; Montreal Neurological Institute, McGill University, Montreal, QC, Canada; CIFAR Learning in Machines and Brains Program, Toronto, ON, Canada; The University of Utah, Utah, USA
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Language model (LM) agents are increasingly used as autonomous decision-makers who need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world – key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit systematic biases leading to erroneous conclusions. In this work, we examine LMs’ ability to explore and infer causal relationships, using the well-established “Blicket Test” paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more) evidenced conjunctive ones. This “disjunctive bias” persists across model families, sizes, and prompting strategies, and performance further declines as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To this end, we quantify similarities between LMs and humans, finding that LMs exhibit adult-like inference profiles (but not children-like). Finally, we propose a test-time sampling method which explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.
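The idea of eliminating causal hypotheses can be illustrated with a small sketch. This is not the authors' method: it enumerates every hypothesis exhaustively instead of sampling them from an LM, and it assumes a simplified Blicket setup in which a conjunctive rule requires all blickets to be on the machine. The names `all_hypotheses`, `predicts_on`, and `eliminate` are hypothetical.

```python
from itertools import combinations

def all_hypotheses(objects):
    """Enumerate (rule, blicket_set) pairs for every non-empty subset of objects."""
    hypotheses = []
    for size in range(1, len(objects) + 1):
        for subset in combinations(objects, size):
            for rule in ("disjunctive", "conjunctive"):
                hypotheses.append((rule, frozenset(subset)))
    return hypotheses

def predicts_on(hypothesis, placed):
    """Return True if the hypothesis predicts the machine lights up."""
    rule, blickets = hypothesis
    if rule == "disjunctive":
        return bool(blickets & placed)   # any single blicket is enough
    return blickets <= placed            # simplification: all blickets required

def eliminate(hypotheses, observations):
    """Keep only hypotheses consistent with every (placed_objects, lit) observation."""
    return [
        h for h in hypotheses
        if all(predicts_on(h, frozenset(placed)) == lit for placed, lit in observations)
    ]

objects = ["A", "B", "C"]
# Toy evidence: no single object activates the machine, but A and B together do.
observations = [({"A"}, False), ({"B"}, False), ({"C"}, False), ({"A", "B"}, True)]
remaining = eliminate(all_hypotheses(objects), observations)
print(remaining)
```

With this toy evidence, only the conjunctive hypothesis over A and B survives, which is the kind of conclusion a disjunctive prior would tend to miss.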
[NLP-1] Customizing a Large Language Model for VHDL Design of High-Performance Microprocessors
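[Quick Read]: This paper addresses the challenge of explaining VHDL code in high-performance processor design, in particular improving code comprehension and maintenance in organizations with extensive experience and assets. The key to the solution is developing a large language model (LLM) dedicated to explaining VHDL code, optimized through extended pretraining (EPT) and evaluated with an LLM-as-a-judge setup, which raised the expert evaluation rating from 43% for the base model to 69% for the EPT model.
Link: https://arxiv.org/abs/2505.09610
Authors: Nicolas Dupuis, Ravi Nair, Shyam Ramji, Sean McClintock, Nishant Chauhan, Priyanka Nagpal, Bart Blaner, Ken Valk, Leon Stok, Ruchir Puri
Affiliations: IBM Research; IBM Infrastructure
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: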
Abstract:The use of Large Language Models (LLMs) in hardware design has taken off in recent years, principally through its incorporation in tools that increase chip designer productivity. There has been considerable discussion about the use of LLMs in RTL specifications of chip designs, for which the two most popular languages are Verilog and VHDL. LLMs and their use in Verilog design has received significant attention due to the higher popularity of the language, but little attention so far has been given to VHDL despite its continued popularity in the industry. There has also been little discussion about the unique needs of organizations that engage in high-performance processor design, and techniques to deploy AI solutions in these settings. In this paper, we describe our journey in developing a Large Language Model (LLM) specifically for the purpose of explaining VHDL code, a task that has particular importance in an organization with decades of experience and assets in high-performance processor design. We show how we developed test sets specific to our needs and used them for evaluating models as we performed extended pretraining (EPT) of a base LLM. Expert evaluation of the code explanations produced by the EPT model increased to 69% compared to a base model rating of 43%. We further show how we developed an LLM-as-a-judge to gauge models similar to expert evaluators. This led us to deriving and evaluating a host of new models, including an instruction-tuned version of the EPT model with an expected expert evaluator rating of 71%. Our experiments also indicate that with the potential use of newer base models, this rating can be pushed to 85% and beyond. We conclude with a discussion on further improving the quality of hardware design LLMs using exciting new developments in the Generative AI world.
[NLP-2] WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models
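[Quick Read]: This paper addresses the problem that large language models (LLMs) are trained and aligned in ways that reinforce Western-centric epistemologies and socio-cultural norms, which leads to cultural homogenization and limits their ability to reflect global civilizational plurality. Existing benchmarking frameworks fail to capture this bias because they rely on rigid, closed-form assessments that overlook the complexity of cultural inclusivity. The key to the solution is WorldView-Bench, a benchmark that evaluates Global Cultural Inclusivity (GCI) by analyzing a model's ability to accommodate diverse worldviews. Grounded in the Multiplex Worldview proposed by Senturk et al., it uses two intervention strategies: (1) Contextually-Implemented Multiplex LLMs, where system prompts embed multiplexity principles; and (2) Multi-Agent System (MAS)-Implemented Multiplex LLMs, where multiple LLM agents representing distinct cultural perspectives collaboratively generate responses.
Link: https://arxiv.org/abs/2505.09595
Authors: Abdullah Mushtaq, Imran Taj, Rafay Naeem, Ibrahim Ghaznavi, Junaid Qadir
Affiliations: Information Technology University; Zayed University; Qatar University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: Preprint. Submitted to the Journal of Artificial Intelligence Research (JAIR) on April 29, 2025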
Abstract:Large Language Models (LLMs) are predominantly trained and aligned in ways that reinforce Western-centric epistemologies and socio-cultural norms, leading to cultural homogenization and limiting their ability to reflect global civilizational plurality. Existing benchmarking frameworks fail to adequately capture this bias, as they rely on rigid, closed-form assessments that overlook the complexity of cultural inclusivity. To address this, we introduce WorldView-Bench, a benchmark designed to evaluate Global Cultural Inclusivity (GCI) in LLMs by analyzing their ability to accommodate diverse worldviews. Our approach is grounded in the Multiplex Worldview proposed by Senturk et al., which distinguishes between Uniplex models, reinforcing cultural homogenization, and Multiplex models, which integrate diverse perspectives. WorldView-Bench measures Cultural Polarization, the exclusion of alternative perspectives, through free-form generative evaluation rather than conventional categorical benchmarks. We implement applied multiplexity through two intervention strategies: (1) Contextually-Implemented Multiplex LLMs, where system prompts embed multiplexity principles, and (2) Multi-Agent System (MAS)-Implemented Multiplex LLMs, where multiple LLM agents representing distinct cultural perspectives collaboratively generate responses. Our results demonstrate a significant increase in Perspectives Distribution Score (PDS) entropy from 13% at baseline to 94% with MAS-Implemented Multiplex LLMs, alongside a shift toward positive sentiment (67.7%) and enhanced cultural balance. These findings highlight the potential of multiplex-aware AI evaluation in mitigating cultural bias in LLMs, paving the way for more inclusive and ethically aligned AI systems.
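The abstract reports entropy of the Perspectives Distribution Score as a percentage but does not say how it is computed. One plausible reading, offered here purely as an assumption, is Shannon entropy normalized by its maximum over a fixed number of perspective categories. The function `normalized_entropy` and the category names below are hypothetical.

```python
import math

def normalized_entropy(counts, num_categories):
    """Shannon entropy of a count distribution, scaled to [0, 1].

    counts: mapping from perspective label to how often it appears.
    num_categories: number of possible perspectives (assumed fixed in advance).
    """
    total = sum(counts.values())
    if total == 0 or num_categories < 2:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy / math.log(num_categories)

# Toy distributions over four hypothetical perspective categories.
skewed = {"perspective_a": 9, "perspective_b": 1, "perspective_c": 0, "perspective_d": 0}
balanced = {"perspective_a": 5, "perspective_b": 5, "perspective_c": 5, "perspective_d": 5}

print(normalized_entropy(skewed, 4))
print(normalized_entropy(balanced, 4))
```

Under this reading, a response set dominated by one perspective scores near 0 and an evenly spread one scores near 1.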
[NLP-3] PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning
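[Quick Read]: This paper addresses the imbalance between efficiency and performance in parameter-efficient fine-tuning (PEFT) methods for adapting large language models: adding a router to prompt tuning increases training efficiency but does not improve performance universally, while parameter reduction through matrix decomposition improves performance only in specific domains. The key to the solution is the PT-MoE framework, which combines matrix decomposition with mixture-of-experts (MoE) routing. Decomposition enables efficient parameter sharing across experts, while MoE provides dynamic adaptation, so the framework improves cross-task performance and generalization while using fewer parameters.
Link: https://arxiv.org/abs/2505.09519
Authors: Zongqian Li, Yixuan Su, Nigel Collier
Affiliations: University of Cambridge
Categories: Computation and Language (cs.CL)
Comments: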
Abstract:Parameter-efficient fine-tuning (PEFT) methods have shown promise in adapting large language models, yet existing approaches exhibit counter-intuitive phenomena: integrating router into prompt tuning (PT) increases training efficiency yet does not improve performance universally; parameter reduction through matrix decomposition can improve performance in specific domains. Motivated by these observations and the modular nature of PT, we propose PT-MoE, a novel framework that integrates matrix decomposition with mixture-of-experts (MoE) routing for efficient PT. Results across 17 datasets demonstrate that PT-MoE achieves state-of-the-art performance in both question answering (QA) and mathematical problem solving tasks, improving F1 score by 1.49 points over PT and 2.13 points over LoRA in QA tasks, while enhancing mathematical accuracy by 10.75 points over PT and 0.44 points over LoRA, all while using 25% fewer parameters than LoRA. Our analysis reveals that while PT methods generally excel in QA tasks and LoRA-based methods in math datasets, the integration of matrix decomposition and MoE in PT-MoE yields complementary benefits: decomposition enables efficient parameter sharing across experts while MoE provides dynamic adaptation, collectively enabling PT-MoE to demonstrate cross-task consistency and generalization abilities. These findings, along with ablation studies on routing mechanisms and architectural components, provide insights for future PEFT methods.
[NLP-4] CXMArena: Unified Dataset to benchmark performance in realistic CXM Scenarios
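[Quick Read]: This paper addresses the difficulty of evaluating the practical utility of large language models (LLMs) in Customer Experience Management (CXM), especially in complex operational environments such as contact centers, where data is scarce and existing benchmarks lack realism. The key to the solution is CXMArena, a new large-scale synthetic benchmark dataset that simulates a brand's CXM entities, such as knowledge base articles, issue taxonomies, and contact center conversations, using controlled noise injection and rigorous automated validation to stay close to real-world conditions. The study also designs a scalable LLM-powered pipeline for generating high-quality benchmark data, providing evaluation standards for five key operational tasks.
Link: https://arxiv.org/abs/2505.09436
Authors: Raghav Garg, Kapil Sharma, Karan Gupta
Affiliations: Sprinklr
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: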
Abstract:Large Language Models (LLMs) hold immense potential for revolutionizing Customer Experience Management (CXM), particularly in contact center operations. However, evaluating their practical utility in complex operational environments is hindered by data scarcity (due to privacy concerns) and the limitations of current benchmarks. Existing benchmarks often lack realism, failing to incorporate deep knowledge base (KB) integration, real-world noise, or critical operational tasks beyond conversational fluency. To bridge this gap, we introduce CXMArena, a novel, large-scale synthetic benchmark dataset specifically designed for evaluating AI in operational CXM contexts. Given the diversity in possible contact center features, we have developed a scalable LLM-powered pipeline that simulates the brand’s CXM entities that form the foundation of our datasets-such as knowledge articles including product specifications, issue taxonomies, and contact center conversations. The entities closely represent real-world distribution because of controlled noise injection (informed by domain experts) and rigorous automated validation. Building on this, we release CXMArena, which provides dedicated benchmarks targeting five important operational tasks: Knowledge Base Refinement, Intent Prediction, Agent Quality Adherence, Article Search, and Multi-turn RAG with Integrated Tools. Our baseline experiments underscore the benchmark’s difficulty: even state of the art embedding and generation models achieve only 68% accuracy on article search, while standard embedding methods yield a low F1 score of 0.3 for knowledge base refinement, highlighting significant challenges for current models necessitating complex pipelines and solutions over conventional techniques.
[NLP-5] Multilingual Machine Translation with Quantum Encoder Decoder Attention-based Convolutional Variational Circuits
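[Quick Read]: This paper addresses the limitations of the classical computing architectures that conventional cloud-based multilingual translation services rely on, and explores quantum computing as an alternative for multilingual machine translation. The key to the solution is the QEDACVC (Quantum Encoder Decoder Attention-based Convolutional Variational Circuits) architecture, which uses quantum convolution, quantum pooling, quantum variational circuits, and quantum attention as software alterations to simulate and run an encoder-decoder architecture on quantum computing hardware, achieving 82% accuracy on the OPUS dataset.
Link: https://arxiv.org/abs/2505.09407
Authors: Subrit Dikshit, Ritu Tiwari, Priyank Jain
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 12 pages, 12 figures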
Abstract:Cloud-based multilingual translation services like Google Translate and Microsoft Translator achieve state-of-the-art translation capabilities. These services inherently use large multilingual language models such as GRU, LSTM, BERT, GPT, T5, or similar encoder-decoder architectures with attention mechanisms as the backbone. Also, new age natural language systems, for instance ChatGPT and DeepSeek, have established huge potential in multiple tasks in natural language processing. At the same time, they also possess outstanding multilingual translation capabilities. However, these models use the classical computing realm as a backend. QEDACVC (Quantum Encoder Decoder Attention-based Convolutional Variational Circuits) is an alternate solution that explores the quantum computing realm instead of the classical computing realm to study and demonstrate multilingual machine translation. QEDACVC introduces the quantum encoder-decoder architecture that simulates and runs on quantum computing hardware via quantum convolution, quantum pooling, quantum variational circuit, and quantum attention as software alterations. QEDACVC achieves an Accuracy of 82% when trained on the OPUS dataset for English, French, German, and Hindi corpora for multilingual translations.
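[NLP-6] Qwen3 Technical Report
[Quick Read]: This paper addresses the limitations of large language models (LLMs) in performance, efficiency, and multilingual support, while reducing the complexity and wasted resources caused by switching between models. The key to the solution is a unified framework that integrates a thinking mode (for complex, multi-step reasoning) with a non-thinking mode (for rapid, context-driven responses), allowing dynamic mode switching based on user queries or chat templates without relying on multiple separate models. It also introduces a thinking budget mechanism that lets users allocate computational resources adaptively during inference, balancing latency against performance.
Link: https://arxiv.org/abs/2505.09388
Authors: An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, Zihan Qiu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: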
Abstract:In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models–such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)–and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.
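[NLP-7] Llama See Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs
[Quick Read]: This paper addresses how language models (LMs) are distracted by "irrelevant" contextual information in the input prompt, that is, how unrelated tokens seen earlier can influence the generated output. The key finding is the phenomenon of contextual entrainment: LMs assign significantly higher logits (or probabilities) to any token that has previously appeared in the context, regardless of whether it is relevant to the question or the rest of the sentence. The study further proposes a new method based on differentiable masking to identify the attention heads associated with this phenomenon (entrainment heads). Turning these heads off significantly attenuates contextual entrainment and reduces the degree of distraction. This provides a mechanistic basis for understanding and mitigating the distraction problem in language models.
Link: https://arxiv.org/abs/2505.09338
Authors: Jingcheng Niu, Xingdi Yuan, Tong Wang, Hamidreza Saghir, Amir H. Abdi
Affiliations: University of Toronto; UKP Lab, Technical University of Darmstadt; Microsoft
Categories: Computation and Language (cs.CL)
Comments: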
Abstract:We observe a novel phenomenon, contextual entrainment, across a wide range of language models (LMs) and prompt settings, providing a new mechanistic perspective on how LMs become distracted by "irrelevant" contextual information in the input prompt. Specifically, LMs assign significantly higher logits (or probabilities) to any tokens that have previously appeared in the context prompt, even for random tokens. This suggests that contextual entrainment is a mechanistic phenomenon, occurring independently of the relevance or semantic relation of the tokens to the question or the rest of the sentence. We find statistically significant evidence that the magnitude of contextual entrainment is influenced by semantic factors. Counterfactual prompts have a greater effect compared to factual ones, suggesting that while contextual entrainment is a mechanistic phenomenon, it is modulated by semantic factors. We hypothesise that there is a circuit of attention heads -- the entrainment heads -- that corresponds to the contextual entrainment phenomenon. Using a novel entrainment head discovery method based on differentiable masking, we identify these heads across various settings. When we "turn off" these heads, i.e., set their outputs to zero, the effect of contextual entrainment is significantly attenuated, causing the model to generate output that capitulates to what it would produce if no distracting context were provided. Our discovery of contextual entrainment, along with our investigation into LM distraction via the entrainment heads, marks a key step towards the mechanistic analysis and mitigation of the distraction problem.
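[NLP-8] Scent of Knowledge: Optimizing Search-Enhanced Reasoning with Information Foraging
[Quick Read]: This paper addresses the shortcomings of large language models (LLMs) on complex tasks caused by their inherent knowledge cutoff, especially tasks with ambiguous, multi-step, or evolving information needs. Traditional retrieval-augmented generation methods use static, pre-inference retrieval strategies and cannot adapt to these scenarios. The key to the solution is InForage, a reinforcement learning framework for dynamic information seeking. It formalizes retrieval-augmented reasoning as a dynamic information-seeking process and explicitly rewards intermediate retrieval quality, guiding LLMs to gather and integrate information iteratively and thereby reason adaptively.
Link: https://arxiv.org/abs/2505.09316
Authors: Hongjin Qian, Zheng Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 16 pages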
Abstract:Augmenting large language models (LLMs) with external retrieval has become a standard method to address their inherent knowledge cutoff limitations. However, traditional retrieval-augmented generation methods employ static, pre-inference retrieval strategies, making them inadequate for complex tasks involving ambiguous, multi-step, or evolving information needs. Recent advances in test-time scaling techniques have demonstrated significant potential in enabling LLMs to dynamically interact with external tools, motivating the shift toward adaptive inference-time retrieval. Inspired by Information Foraging Theory (IFT), we propose InForage, a reinforcement learning framework that formalizes retrieval-augmented reasoning as a dynamic information-seeking process. Unlike existing approaches, InForage explicitly rewards intermediate retrieval quality, encouraging LLMs to iteratively gather and integrate information through adaptive search behaviors. To facilitate training, we construct a human-guided dataset capturing iterative search and reasoning trajectories for complex, real-world web tasks. Extensive evaluations across general question answering, multi-hop reasoning tasks, and a newly developed real-time web QA dataset demonstrate InForage’s superior performance over baseline methods. These results highlight InForage’s effectiveness in building robust, adaptive, and efficient reasoning agents.
[NLP-9] A Scalable Unsupervised Framework for multi-aspect labeling of Multilingual and Multi-Domain Review Data
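[Quick Read]: This paper addresses cross-domain aspect detection in online review analysis, in particular the restriction of existing studies to specific domains and languages and their dependence on supervised learning, which requires large-scale labeled data. The key to the solution is a multilingual, scalable, unsupervised framework that extracts aspect category candidates through clustering and uses negative sampling to produce aspect-aware embedding vectors, enabling multi-aspect labeling of multilingual, multi-domain review data. By labeling data automatically and evaluating label quality, the study shows that the labels are suitable for model training and that the framework is more consistent and scalable than publicly available large language models.
Link: https://arxiv.org/abs/2505.09286
Authors: Jiin Park, Misuk Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 36 pages, 3 figures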
Abstract:Effectively analyzing online review data is essential across industries. However, many existing studies are limited to specific domains and languages or depend on supervised learning approaches that require large-scale labeled datasets. To address these limitations, we propose a multilingual, scalable, and unsupervised framework for cross-domain aspect detection. This framework is designed for multi-aspect labeling of multilingual and multi-domain review data. In this study, we apply automatic labeling to Korean and English review datasets spanning various domains and assess the quality of the generated labels through extensive experiments. Aspect category candidates are first extracted through clustering, and each review is then represented as an aspect-aware embedding vector using negative sampling. To evaluate the framework, we conduct multi-aspect labeling and fine-tune several pretrained language models to measure the effectiveness of the automatically generated labels. Results show that these models achieve high performance, demonstrating that the labels are suitable for training. Furthermore, comparisons with publicly available large language models highlight the framework’s superior consistency and scalability when processing large-scale data. A human evaluation also confirms that the quality of the automatic labels is comparable to those created manually. This study demonstrates the potential of a robust multi-aspect labeling approach that overcomes limitations of supervised methods and is adaptable to multilingual, multi-domain environments. Future research will explore automatic review summarization and the integration of artificial intelligence agents to further improve the efficiency and depth of review analysis.
[NLP-10] How an unintended Side Effect of a Research Project led to Boosting the Power of UML
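[Quick Read]: This paper addresses the limitations of conventional UML modeling tools in feature integration and dynamic execution. The key to the solution is integrating class diagrams with object diagrams and supporting the execution of objects. This not only enables software to be integrated with its corresponding object models, but also gives students a more interactive learning experience in teaching.
Link: https://arxiv.org/abs/2505.09269
Authors: Ulrich Frank, Pierre Maier
Affiliations: University of Duisburg-Essen
Categories: Computation and Language (cs.CL)
Comments: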
Abstract:This paper describes the design, implementation and use of a new UML modeling tool that represents a significant advance over conventional tools. Among other things, it allows the integration of class diagrams and object diagrams as well as the execution of objects. This not only enables new software architectures characterized by the integration of software with corresponding object models, but is also ideal for use in teaching, as it provides students with a particularly stimulating learning experience. A special feature of the project is that it has emerged from a long-standing international research project, which is aimed at a comprehensive multi-level architecture. The project is therefore an example of how research can lead to valuable results that arise as a side effect of other work.
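[NLP-11] Focus, Merge, Rank: Improved Question Answering Based on Semi-structured Knowledge Bases
[Quick Read]: This paper addresses how to combine structured knowledge with unstructured text effectively in multi-hop question answering, in order to improve the accuracy of knowledge retrieval and reasoning. The key to the solution is FocusedRetriever, a framework based on Semi-Structured Knowledge Bases (SKBs) that integrates entity search based on vector similarity search (VSS), LLM-based generation of Cypher queries, and pairwise re-ranking. Together these components handle complex queries efficiently and outperform state-of-the-art methods on all three STaRK benchmark test sets.
Link: https://arxiv.org/abs/2505.09246
Authors: Derian Boer, Stephen Roth, Stefan Kramer
Affiliations: Johannes Gutenberg University Mainz
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: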
Abstract:In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. However, most rely on either. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data, thereby enabling new strategies for knowledge access and use. In this work, we present FocusedRetriever, a modular SKB-based framework for multi-hop question answering. It integrates components (VSS-based entity search, LLM-based generation of Cypher queries and pairwise re-ranking) in a way that enables it to outperform state-of-the-art methods across all three STaRK benchmark test sets, covering diverse domains and multiple performance metrics. The average first-hit rate exceeds that of the second-best method by 25.7%. FocusedRetriever leverages (1) the capacity of Large Language Models (LLMs) to extract relational facts and entity attributes from unstructured text, (2) node set joins to filter answer candidates based on these extracted triplets and constraints, (3) vector similarity search to retrieve and rank relevant unstructured content, and (4) the contextual capabilities of LLMs to finally rank the top-k answers. For generality, we only incorporate base LLMs in FocusedRetriever in our evaluation. However, our analysis of intermediate results highlights several opportunities for further upgrades including finetuning. The source code is publicly available at this https URL .
[NLP-12] CEC-Zero: Chinese Error Correction Solution Based on LLM
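[Quick Read]: This paper addresses the insufficient reliability and generalization of large language models (LLMs) in Chinese text processing, especially in Chinese Spelling Correction (CSC). The key to the solution is CEC-Zero, a new reinforcement learning (RL) framework that lets LLMs self-correct by learning error strategies autonomously, without external supervision. By combining RL with the generative power of LLMs, the method removes the dependency on annotated data or auxiliary models and achieves industry-viable accuracy and cross-domain generalization.
Link: https://arxiv.org/abs/2505.09082
Authors: Sophie Zhang, Zhiming Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: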
Abstract:Recent advancements in large language models (LLMs) demonstrate exceptional Chinese text processing capabilities, particularly in Chinese Spelling Correction (CSC). While LLMs outperform traditional BERT-based models in accuracy and robustness, challenges persist in reliability and generalization. This paper proposes CEC-Zero, a novel reinforcement learning (RL) framework enabling LLMs to self-correct through autonomous error strategy learning without external supervision. By integrating RL with LLMs’ generative power, the method eliminates dependency on annotated data or auxiliary models. Experiments reveal RL-enhanced LLMs achieve industry-viable accuracy and superior cross-domain generalization, offering a scalable solution for reliability optimization in Chinese NLP applications. This breakthrough facilitates LLM deployment in practical Chinese text correction scenarios while establishing a new paradigm for self-improving language models.
[NLP-13] S-DAT: A Multilingual GenAI-Driven Framework for Automated Divergent Thinking Assessment
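[Quick Read]: This paper addresses the limited scalability and cross-cultural applicability of traditional creativity assessments, which are labor-intensive, language-specific, and reliant on subjective human ratings. The key to the solution is S-DAT (Synthetic-Divergent Association Task), a framework that uses large language models and advanced multilingual embeddings to compute semantic distance as a language-agnostic proxy for divergent thinking (DT), enabling automated assessment across languages and cultures.
Link: https://arxiv.org/abs/2505.09068
Authors: Jennifer Haase, Paul H. P. Hanel, Sebastian Pokutta
Affiliations: Weizenbaum Institute and Humboldt University; University of Essex; TU Berlin; Zuse Institute Berlin
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: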
Abstract:This paper introduces S-DAT (Synthetic-Divergent Association Task), a scalable, multilingual framework for automated assessment of divergent thinking (DT) -a core component of human creativity. Traditional creativity assessments are often labor-intensive, language-specific, and reliant on subjective human ratings, limiting their scalability and cross-cultural applicability. In contrast, S-DAT leverages large language models and advanced multilingual embeddings to compute semantic distance – a language-agnostic proxy for DT. We evaluate S-DAT across eleven diverse languages, including English, Spanish, German, Russian, Hindi, and Japanese (Kanji, Hiragana, Katakana), demonstrating robust and consistent scoring across linguistic contexts. Unlike prior DAT approaches, the S-DAT shows convergent validity with other DT measures and correct discriminant validity with convergent thinking. This cross-linguistic flexibility allows for more inclusive, global-scale creativity research, addressing key limitations of earlier approaches. S-DAT provides a powerful tool for fairer, more comprehensive evaluation of cognitive flexibility in diverse populations and can be freely assessed online: this https URL.
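Semantic distance as a proxy for divergent thinking can be sketched as the mean pairwise cosine distance between word embeddings, scaled to 0-100 as in earlier Divergent Association Task scoring. This is a minimal sketch, not the S-DAT implementation: the toy two-dimensional vectors stand in for real multilingual embeddings, and the helper names `cosine_distance` and `dat_score` are hypothetical.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dat_score(embeddings):
    """Mean pairwise cosine distance between word vectors, times 100."""
    pairs = list(combinations(embeddings.values(), 2))
    if not pairs:
        raise ValueError("need at least two words")
    return 100.0 * sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings (assumption): "cat" and "dog" coincide, "volcano" is orthogonal.
toy_embeddings = {
    "cat": [1.0, 0.0],
    "dog": [1.0, 0.0],
    "volcano": [0.0, 1.0],
}
print(round(dat_score(toy_embeddings), 2))
```

Because the score depends only on vector geometry, the same function applies to any language for which the embedding model produces comparable vectors.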
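[NLP-14] A Comprehensive Analysis of Large Language Model Outputs: Similarity, Diversity, and Bias
[Quick Read]: This paper addresses open questions about the output similarity, diversity, and ethical standards of large language models (LLMs): how similar texts generated by the same model are, how this compares across models, and how models differ in ethical behavior. The key to the approach is using 5,000 prompts spanning diverse tasks to generate approximately 3 million texts from 12 LLMs, including proprietary and open-source systems from OpenAI, Google, Microsoft, Meta, and Mistral, and analyzing the outputs to reveal differences in writing style, vocabulary, tone, and bias.
Link: https://arxiv.org/abs/2505.09056
Authors: Brandon Smith, Mohamed Reda Bouadjenek, Tahsin Alamgir Kheya, Phillip Dawson, Sunil Aryal
Affiliations: Deakin University; University of New South Wales
Categories: Computation and Language (cs.CL)
Comments: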
Abstract:Large Language Models (LLMs) represent a major step toward artificial general intelligence, significantly advancing our ability to interact with technology. While LLMs perform well on Natural Language Processing tasks – such as translation, generation, code writing, and summarization – questions remain about their output similarity, variability, and ethical implications. For instance, how similar are texts generated by the same model? How does this compare across different models? And which models best uphold ethical standards? To investigate, we used 5,000 prompts spanning diverse tasks like generation, explanation, and rewriting. This resulted in approximately 3 million texts from 12 LLMs, including proprietary and open-source systems from OpenAI, Google, Microsoft, Meta, and Mistral. Key findings include: (1) outputs from the same LLM are more similar to each other than to human-written texts; (2) models like WizardLM-2-8x22b generate highly similar outputs, while GPT-4 produces more varied responses; (3) LLM writing styles differ significantly, with Llama 3 and Mistral showing higher similarity, and GPT-4 standing out for distinctiveness; (4) differences in vocabulary and tone underscore the linguistic uniqueness of LLM-generated content; (5) some LLMs demonstrate greater gender balance and reduced bias. These results offer new insights into the behavior and diversity of LLM outputs, helping guide future development and ethical evaluation.
[NLP-15] Atomic Consistency Preference Optimization for Long-Form Question Answering
[Quick read]: This paper addresses factoid hallucinations in large language models (LLMs), i.e., answers that are plausible but incorrect. The core of the solution is Atomic Consistency Preference Optimization (ACPO), a self-supervised preference-tuning method that uses the consistency of individual facts across multiple stochastic responses to identify high- and low-quality data pairs for alignment, improving factual accuracy without relying on external models or knowledge bases.
Link: https://arxiv.org/abs/2505.09039
Authors: Jingfeng Chen,Raghuveer Thirukovalluru,Junlin Wang,Kaiwei Luo,Bhuwan Dhingra
Affiliations: Duke Kunshan University; Duke University; TeleAI
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 2 figures
Abstract:Large Language Models (LLMs) frequently produce factoid hallucinations - plausible yet incorrect answers. A common mitigation strategy is model alignment, which improves factual accuracy by training on curated factual and non-factual pairs. However, this approach often relies on a stronger model (e.g., GPT-4) or an external knowledge base to assess factual correctness, which may not always be accessible. To address this, we propose Atomic Consistency Preference Optimization (ACPO), a self-supervised preference-tuning method that enhances factual accuracy without external supervision. ACPO leverages atomic consistency signals, i.e., the agreement of individual facts across multiple stochastic responses, to identify high- and low-quality data pairs for model alignment. By eliminating the need for costly GPT calls, ACPO provides a scalable and efficient approach to improving factoid question-answering. Despite being self-supervised, empirical results demonstrate that ACPO outperforms FactAlign, a strong supervised alignment baseline, by 1.95 points on the LongFact and BioGen datasets, highlighting its effectiveness in enhancing factual reliability without relying on external models or knowledge bases.
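The atomic-consistency signal described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each sampled response has already been split into a set of atomic facts (the extraction and matching steps are hypothetical preprocessing), and it scores a response by how often its facts recur in the other samples. The function names and the example facts are made up for illustration.

```python
from collections import Counter


def consistency_scores(responses):
    """Score each sampled response by how often its facts recur in the other samples.

    `responses` is a list of sets of atomic facts. Fact extraction and matching
    are assumed to have happened already (hypothetical preprocessing).
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two sampled responses")
    counts = Counter(fact for facts in responses for fact in facts)
    scores = []
    for facts in responses:
        if not facts:
            scores.append(0.0)
            continue
        # For each fact: fraction of the other samples that also state it.
        support = [(counts[fact] - 1) / (n - 1) for fact in facts]
        scores.append(sum(support) / len(support))
    return scores


def preference_pair(responses):
    """Return (chosen_index, rejected_index) for preference tuning."""
    scores = consistency_scores(responses)
    chosen = max(range(len(scores)), key=scores.__getitem__)
    rejected = min(range(len(scores)), key=scores.__getitem__)
    return chosen, rejected


# Four hypothetical samples for one question, already split into facts.
samples = [
    {"founded in 1901", "based in Oslo", "makes ships"},
    {"founded in 1901", "based in Oslo", "makes trains"},
    {"founded in 1901", "based in Oslo", "makes ships", "has 40 staff"},
    {"founded in 1950", "based in Rome"},
]
chosen, rejected = preference_pair(samples)
```

The most self-consistent sample becomes the preferred response and the least consistent one the rejected response; no external judge model is consulted, which is the point the abstract makes about avoiding GPT calls.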
[NLP-16] Improving the Reliability of LLMs: Combining CoT, RAG, Self-Consistency, and Self-Verification
[Quick read]: This paper addresses hallucination, where large language models (LLMs) generate confident but incorrect or irrelevant information in complex, open-ended tasks. The core of the solution is to combine chain-of-thought (CoT) prompting with retrieval-augmented generation (RAG) and to apply self-consistency and self-verification strategies: external knowledge sources are brought into the reasoning process and the model verifies or revises its own outputs, improving factual accuracy and coherence.
Link: https://arxiv.org/abs/2505.09031
Authors: Adarsh Kumar,Hwiyoon Kim,Jawahar Sai Nathani,Neil Roy
Affiliations: Texas A&M University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Hallucination, where large language models (LLMs) generate confident but incorrect or irrelevant information, remains a key limitation in their application to complex, open-ended tasks. Chain-of-thought (CoT) prompting has emerged as a promising method for improving multistep reasoning by guiding models through intermediate steps. However, CoT alone does not fully address the hallucination problem. In this work, we investigate how combining CoT with retrieval-augmented generation (RAG), as well as applying self-consistency and self-verification strategies, can reduce hallucinations and improve factual accuracy. By incorporating external knowledge sources during reasoning and enabling models to verify or revise their own outputs, we aim to generate more accurate and coherent responses. We present a comparative evaluation of baseline LLMs against CoT, CoT+RAG, self-consistency, and self-verification techniques. Our results highlight the effectiveness of each method and identify the most robust approach for minimizing hallucinations while preserving fluency and reasoning depth.
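Of the techniques compared in the abstract, self-consistency is the simplest to illustrate: sample several reasoning paths and keep the answer that most of them agree on. The sketch below assumes the final answers have already been extracted from the sampled chains of thought; the function name and the normalization (strip and lowercase) are illustrative choices, not taken from the paper.

```python
from collections import Counter


def self_consistent_answer(answers):
    """Majority vote over final answers extracted from sampled reasoning paths.

    Returns the winning answer and the fraction of samples that agree with it.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Three hypothetical samples for the same question.
answer, agreement = self_consistent_answer(["Paris", "paris ", "Lyon"])
```

The agreement fraction can double as a rough confidence signal: a low value suggests the model's reasoning paths disagree and the answer deserves verification.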
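[NLP-17] Automated Meta Prompt Engineering for Alignment with the Theory of Mind
[Quick read]: This paper addresses alignment between large language models (LLMs) and human mental expectations, in particular producing fluent text for complex tasks while optimizing the similarity of neural states. The core of the solution is a meta-prompting method combined with agentic reinforcement learning, in which an LLM as a Judge (LLMaaJ) teaches another LLM, through in-context learning, how to produce content by interpreting the intended and unintended traits of generated text. By analyzing users' edits to AI-generated articles before publication at the US Open 2024, the LLMaaJ anticipates and incorporates human edits, addressing the Theory of Mind (ToM) alignment problem. A geometric interpretation of content traits over a Hilbert vector space lets the method optimize for human ToM and improves content quality.
Link: https://arxiv.org/abs/2505.09024
Authors: Aaron Baughman,Rahul Agarwal,Eduardo Morales,Gozde Akay
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures, 3 tables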
Abstract:We introduce a method of meta-prompting that jointly produces fluent text for complex tasks while optimizing the similarity of neural states between a human’s mental expectation and a Large Language Model’s (LLM) neural processing. A technique of agentic reinforcement learning is applied, in which an LLM as a Judge (LLMaaJ) teaches another LLM, through in-context learning, how to produce content by interpreting the intended and unintended generated text traits. To measure human mental beliefs around content production, users modify long form AI-generated text articles before publication at the US Open 2024 tennis Grand Slam. Now, an LLMaaJ can solve the Theory of Mind (ToM) alignment problem by anticipating and including human edits within the creation of text from an LLM. Throughout experimentation and by interpreting the results of a live production system, the expectations of human content reviewers had 100% of alignment with AI 53.8% of the time with an average iteration count of 4.38. The geometric interpretation of content traits such as factualness, novelty, repetitiveness, and relevancy over a Hilbert vector space combines spatial volume (all trait importance) with vertices alignment (individual trait relevance) enabled the LLMaaJ to optimize on Human ToM. This resulted in an increase in content quality by extending the coverage of tennis action. Our work that was deployed at the US Open 2024 has been used across other live events within sports and entertainment.
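[NLP-18] For GPT-4 as with Humans: Information Structure Predicts Acceptability of Long-Distance Dependencies
[Quick read]: This paper asks whether language models (LMs) understand natural language and produce reliable metalinguistic judgments, and whether they can represent and respect the subtle relationships between form and function proposed by linguists. The core of the approach is to probe GPT-4 on tasks linking information structure and acceptability. GPT-4 shows reliable metalinguistic skill on zero-shot, explicit tasks, and the study confirms that information structure affects the acceptability of long-distance dependency (LDD) constructions, pointing to a tight relationship between natural English and GPT-4-generated English.
Link: https://arxiv.org/abs/2505.09005
Authors: Nicole Cuneo,Eleanor Graves,Supantho Rakshit,Adele E. Goldberg
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: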
Abstract:It remains debated how well any LM understands natural language or generates reliable metalinguistic judgments. Moreover, relatively little work has demonstrated that LMs can represent and respect subtle relationships between form and function proposed by linguists. We here focus on a particular such relationship established in recent work: English speakers’ judgments about the information structure of canonical sentences predicts independently collected acceptability ratings on corresponding ‘long distance dependency’ [LDD] constructions, across a wide array of base constructions and multiple types of LDDs. To determine whether any LM captures this relationship, we probe GPT-4 on the same tasks used with humans and new extensions. Results reveal reliable metalinguistic skill on the information structure and acceptability tasks, replicating a striking interaction between the two, despite the zero-shot, explicit nature of the tasks, and little to no chance of contamination [Studies 1a, 1b]. Study 2 manipulates the information structure of base sentences and confirms a causal relationship: increasing the prominence of a constituent in a context sentence increases the subsequent acceptability ratings on an LDD construction. The findings suggest a tight relationship between natural and GPT-4 generated English, and between information structure and syntax, which begs for further exploration.
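[NLP-19] A suite of LMs comprehend puzzle statements as well as humans
[Quick read]: This paper asks whether current large language models (LLMs) really fall short of humans in comprehending simple English statements, arguing that earlier work overestimated human performance and underestimated LLM abilities. The core of the approach is a preregistered experiment comparing human and model accuracy when rereading is allowed versus restricted. With rereading restricted, human accuracy drops significantly, falling below that of Falcon-180B-Chat and GPT-4, while the newer GPT-o1 model achieves perfect accuracy. The study also shows that humans and models are both challenged by queries involving potentially reciprocal actions, suggesting shared pragmatic sensitivities rather than model-specific deficits.
Link: https://arxiv.org/abs/2505.08996
Authors: Adele E Goldberg,Supantho Rakshit,Jennifer Hu,Kyle Mahowald
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: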
Abstract:Recent claims suggest that large language models (LMs) underperform humans in comprehending minimally complex English statements (Dentella et al., 2024). Here, we revisit those findings and argue that human performance was overestimated, while LLM abilities were underestimated. Using the same stimuli, we report a preregistered study comparing human responses in two conditions: one allowed rereading (replicating the original study), and one that restricted rereading (a more naturalistic comprehension test). Human accuracy dropped significantly when rereading was restricted (73%), falling below that of Falcon-180B-Chat (76%) and GPT-4 (81%). The newer GPT-o1 model achieves perfect accuracy. Results further show that both humans and models are disproportionately challenged by queries involving potentially reciprocal actions (e.g., kissing), suggesting shared pragmatic sensitivities rather than model-specific deficits. Additional analyses using Llama-2-70B log probabilities, a recoding of open-ended model responses, and grammaticality ratings of other sentences reveal systematic underestimation of model performance. We find that GPT-4o can align with either naive or expert grammaticality judgments, depending on prompt framing. These findings underscore the need for more careful experimental design and coding practices in LLM evaluation, and they challenge the assumption that current models are inherently weaker than humans at language comprehension.
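[NLP-20] Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
[Quick read]: This paper targets a weakness of standard large vision-language model (LVLM) pre-training: maximizing the joint probability of the caption conditioned on the image through next-token prediction (NTP) alone fits the model to noise and raises the risk of hallucination. The core of the solution is PRIOR, which weights image-related tokens differentially in the NTP loss. A text-only large language model (LLM) serves as a reference model that assigns each token an importance score based on its text-only probability, and a token-specific re-weighting term based on these scores is applied during training so that the model focuses on visually grounded content.
Link: https://arxiv.org/abs/2505.08971
Authors: Yangyi Chen,Hao Peng,Tong Zhang,Heng Ji
Affiliations: University of Illinois Urbana-Champaign
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The code will be available at this https URL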
Abstract:In standard large vision-language models (LVLMs) pre-training, the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP); however, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model-a text-only large language model (LLM) trained on the captions without image inputs, to weight each token based on its probability for LVLMs training. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and thus receive lower probabilities from the text-only reference LLM. During training, we implement a token-specific re-weighting term based on the importance scores to adjust each token’s loss. We implement PRIOR in two distinct settings: LVLMs with visual encoders and LVLMs without visual encoders. We observe 19% and 8% average relative improvement, respectively, on several vision-language benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling properties, as demonstrated by significantly higher scaling coefficients, indicating greater potential for performance gains compared to NTP given increasing compute and data.
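The re-weighting idea in the abstract can be sketched in a few lines. The abstract does not give PRIOR's exact importance score, so the weight used here, `(1 - p_ref) ** alpha` normalized to mean 1, is an assumed stand-in chosen only because it grows as the text-only reference probability shrinks; `alpha` and the function name are likewise hypothetical.

```python
import math


def reweighted_ntp_loss(p_model, p_ref, alpha=1.0):
    """Token-weighted next-token-prediction loss (illustrative, not PRIOR's formula).

    p_model: probability the vision-language model assigns to each target token.
    p_ref:   probability a text-only reference LLM assigns to the same token.
    Tokens the reference model finds hard (low p_ref) get larger weights.
    """
    raw = [(1.0 - q) ** alpha for q in p_ref]
    mean_raw = sum(raw) / len(raw)
    if mean_raw == 0.0:
        # Every token is fully predictable from text alone: fall back to uniform.
        weights = [1.0] * len(raw)
    else:
        weights = [r / mean_raw for r in raw]  # normalize to mean 1
    nll = [-math.log(p) for p in p_model]
    loss = sum(w * l for w, l in zip(weights, nll)) / len(nll)
    return loss, weights


# A caption token that is easy to guess from text alone ("the", p_ref = 0.9)
# versus one that needs the image ("zebra", p_ref = 0.1).
loss, weights = reweighted_ntp_loss(p_model=[0.5, 0.5], p_ref=[0.9, 0.1])
```

With `alpha=0.0` every weight is 1 and the function reduces to the plain mean negative log-likelihood, which makes the baseline comparison with standard NTP explicit.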
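[NLP-21] ForeCite: Adapting Pre-Trained Language Models to Predict Future Citation Rates of Academic Papers
[Quick read]: This paper addresses the prediction of future citation rates of academic papers, an important step toward automating research evaluation and accelerating scientific progress. The core of the solution is ForeCite, a framework that appends a linear head to a pre-trained causal language model to predict average monthly citation rates. On a curated dataset of more than 900K biomedical papers published between 2000 and 2024, it reaches a test correlation of ρ = 0.826, a 27-point improvement over the previous state of the art.
Link: https://arxiv.org/abs/2505.08941
Authors: Gavin Hull,Alex Bihlo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 16 pages, 13 figures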
Abstract:Predicting the future citation rates of academic papers is an important step toward the automation of research evaluation and the acceleration of scientific progress. We present ForeCite, a simple but powerful framework to append pre-trained causal language models with a linear head for average monthly citation rate prediction. Adapting transformers for regression tasks, ForeCite achieves a test correlation of ρ = 0.826 on a curated dataset of 900K+ biomedical papers published between 2000 and 2024, a 27-point improvement over the previous state-of-the-art. Comprehensive scaling-law analysis reveals consistent gains across model sizes and data volumes, while temporal holdout experiments confirm practical robustness. Gradient-based saliency heatmaps suggest a potentially undue reliance on titles and abstract texts. These results establish a new state-of-the-art in forecasting the long-term influence of academic research and lay the groundwork for the automated, high-fidelity evaluation of scientific contributions.
[NLP-22] Behind Maya: Building a Multilingual Vision Language Model CVPR2025
[Quick read]: This paper addresses the weak performance of current large vision-language models (VLMs) on low-resource languages and in varied cultural contexts. The core of the solution is Maya, an open-source multilingual VLM. It comprises a multilingual image-text pretraining dataset in eight languages, built on the LLaVA pretraining dataset, and a multilingual image-text model supporting these languages, improving cultural and linguistic comprehension in vision-language tasks.
Link: https://arxiv.org/abs/2505.08910
Authors: Nahid Alam,Karthik Reddy Kanjula,Surya Guthikonda,Timothy Chung,Bala Krishna S Vegesna,Abhipsha Das,Anthony Susevski,Ryan Sze-Yin Chan,S M Iftekhar Uddin,Shayekh Bin Islam,Roshan Santhosh,Snegha A,Drishti Sharma,Chen Liu,Isha Chaturvedi,Genta Indra Winata,Ashvanth.S,Snehanshu Mukherjee,Alham Fikri Aji
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted at VLM4ALL CVPR 2025 Workshop
Abstract:In recent times, we have seen a rapid development of large Vision-Language Models (VLMs). They have shown impressive results on academic benchmarks, primarily in widely spoken languages but lack performance on low-resource languages and varied cultural contexts. To address these limitations, we introduce Maya, an open-source Multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at this https URL.
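[NLP-23] Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora
[Quick read]: This paper addresses two problems: the human effort available for building evaluation benchmarks is limited and cannot keep pace with the growing size and scope of models, and building a benchmark by hand for every domain of interest is impractical. The core of the solution is a method for automatically constructing fact-based synthetic-data evaluations: language models (LMs) take only grounding documents (e.g., a textbook) as input and generate benchmarks of multiple-choice and open-ended questions that assess domain-specific knowledge and give diagnostic insight into model capability.
Link: https://arxiv.org/abs/2505.08905
Authors: Michael Majurski,Cynthia Matuszek
Affiliations: University of Maryland Baltimore County
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: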
Abstract:Language Models (LMs) continue to advance, improving response quality and coherence. Given Internet-scale training datasets, LMs have likely encountered much of what users might ask them to generate in some form during their training. A plethora of evaluation benchmarks have been constructed to assess model quality, response appropriateness, and reasoning capabilities. However, the human effort required for benchmark construction is limited and being rapidly outpaced by the size and scope of the models under evaluation. Additionally, having humans build a benchmark for every possible domain of interest is impractical. Therefore, we propose a methodology for automating the construction of fact-based synthetic data model evaluations grounded in document populations. This work leverages those very same LMs to evaluate domain-specific knowledge automatically, using only grounding documents (e.g., a textbook) as input. This synthetic data benchmarking approach corresponds well with human curated questions with a Spearman ranking correlation of 0.96 and a benchmark evaluation Pearson accuracy correlation of 0.79. This novel tool supports generating both multiple choice and open-ended synthetic data questions to gain diagnostic insight of LM capability. We apply this methodology to evaluate model performance on a recent relevant arXiv preprint, discovering a surprisingly strong performance from Gemma3 models.
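The abstract validates the synthetic benchmark by checking that it ranks models the way human-curated questions do, reporting a Spearman ranking correlation. As a reminder of what that statistic measures, here is a minimal pure-Python sketch: it ranks each score list (averaging ranks for ties) and takes the Pearson correlation of the ranks. The accuracy values in the example are invented for illustration and are not from the paper.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical accuracies of four models on human vs. synthetic questions.
human_accuracy = [0.61, 0.72, 0.55, 0.80]
synthetic_accuracy = [0.58, 0.75, 0.60, 0.79]
rho = spearman(human_accuracy, synthetic_accuracy)
```

In this toy case two adjacent models swap places between the two benchmarks, so the rank correlation is high but below 1.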
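[NLP-24] Performance Gains of LLMs With Humans in a World of LLMs Versus Humans
[Quick read]: This paper addresses problems with current research that compares large language models (LLMs) against human experts: the term "expert" is often ill-defined or variable, and without proper safeguards LLMs may threaten the established structure of safe patient care. The key proposal is to redirect research away from comparing LLMs with humans and toward strategies that let humans and LLMs work together efficiently in clinical settings, in an almost symbiotic manner.
Link: https://arxiv.org/abs/2505.08902
Authors: Lucas McCullum,Pelagie Ami Agassi,Leo Anthony Celi,Daniel K. Ebner,Chrystinne Oliveira Fernandes,Rachel S. Hicklen,Mkliwa Koumbia,Lisa Soleymani Lehmann,David Restrepo
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: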
Abstract:Currently, a considerable research effort is devoted to comparing LLMs to a group of human experts, where the term “expert” is often ill-defined or variable, at best, in a state of constantly updating LLM releases. Without proper safeguards in place, LLMs will threaten to cause harm to the established structure of safe delivery of patient care which has been carefully developed throughout history to keep the safety of the patient at the forefront. A key driver of LLM innovation is founded on community research efforts which, if continuing to operate under “humans versus LLMs” principles, will expedite this trend. Therefore, research efforts moving forward must focus on effectively characterizing the safe use of LLMs in clinical settings that persist across the rapid development of novel LLM models. In this communication, we demonstrate that rather than comparing LLMs to humans, there is a need to develop strategies enabling efficient work of humans with LLMs in an almost symbiotic manner.
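[NLP-25] Clicking some of the silly options: Exploring Player Motivation in Static and Dynamic Educational Interactive Narratives
[Quick read]: This paper addresses a research gap on how dynamic narrative affects learner motivation in educational games; the central question is how dynamic narrative compares with traditional static narrative in improving learner engagement and motivation. The core of the approach is to build and compare two versions of the educational interactive narrative game Academical, one with a traditional hand-authored branching plot (static narrative) and one that dynamically sequences plots during play (dynamic narrative), to explore the effect of dynamic narrative on motivation and the challenge of balancing pedagogical goals with narrative dynamism.
Link: https://arxiv.org/abs/2505.08891
Authors: Daeun Hwang,Samuel Shields,Alex Calderwood,Shi Johnson-Bey,Michael Mateas,Noah Wardrip-Fruin,Edward F. Melcer
Affiliations: University of California, Santa Cruz
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 3 figures, 1 table, 1 appendix. Workshop paper, CHI 2025 Augmented Educators and AI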
Abstract:Motivation is an important factor underlying successful learning. Previous research has demonstrated the positive effects that static interactive narrative games can have on motivation. Concurrently, advances in AI have made dynamic and adaptive approaches to interactive narrative increasingly accessible. However, limited work has explored the impact that dynamic narratives can have on learner motivation. In this paper, we compare two versions of Academical, a choice-based educational interactive narrative game about research ethics. One version employs a traditional hand-authored branching plot (i.e., static narrative) while the other dynamically sequences plots during play (i.e., dynamic narrative). Results highlight the importance of responsive content and a variety of choices for player engagement, while also illustrating the challenge of balancing pedagogical goals with the dynamic aspects of narrative. We also discuss design implications that arise from these findings. Ultimately, this work provides initial steps to illuminate the emerging potential of AI-driven dynamic narrative in educational games.
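[NLP-26] LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries
[Quick read]: This paper addresses the significant but underexamined risks that open-source AI libraries pose across security, licensing, maintenance, supply chain integrity, and regulatory compliance. The core of the solution is LibVulnWatch, a graph-based agentic assessment framework that coordinates a directed acyclic graph of specialized agents to extract, verify, and quantify risk using evidence from trusted sources (such as repositories, documentation, and vulnerability databases), producing deep, source-grounded evaluations of these libraries.
Link: https://arxiv.org/abs/2505.08842
Authors: Zekun Wu,Seonglae Cho,Umar Mohammed,Cristian Munoz,Kleyton Costa,Xin Guan,Theo King,Ze Wang,Emre Kazim,Adriano Koshiyama
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: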
Abstract:Open-source AI libraries are foundational to modern AI systems but pose significant, underexamined risks across security, licensing, maintenance, supply chain integrity, and regulatory compliance. We present LibVulnWatch, a graph-based agentic assessment framework that performs deep, source-grounded evaluations of these libraries. Built on LangGraph, the system coordinates a directed acyclic graph of specialized agents to extract, verify, and quantify risk using evidence from trusted sources such as repositories, documentation, and vulnerability databases. LibVulnWatch generates reproducible, governance-aligned scores across five critical domains, publishing them to a public leaderboard for longitudinal ecosystem monitoring. Applied to 20 widely used libraries, including ML frameworks, LLM inference engines, and agent orchestration tools, our system covers up to 88% of OpenSSF Scorecard checks while uncovering up to 19 additional risks per library. These include critical Remote Code Execution (RCE) vulnerabilities, absent Software Bills of Materials (SBOMs), licensing constraints, undocumented telemetry, and widespread gaps in regulatory documentation and auditability. By translating high-level governance principles into practical, verifiable metrics, LibVulnWatch advances technical AI governance with a scalable, transparent mechanism for continuous supply chain risk assessment and informed library selection.
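[NLP-27] Human-AI Collaboration or Academic Misconduct? Measuring AI Use in Student Writing Through Stylometric Evidence
[Quick read]: This paper addresses how to understand and measure the extent and nature of human-AI collaboration as it becomes increasingly common in education. The core of the solution is to use authorship verification (AV) as a tool for quantifying AI assistance in academic writing rather than as a punitive measure, promoting transparency, interpretability, and student development. The study develops an adapted Feature Vector Difference AV method that builds robust academic writing profiles capturing individual characteristics of each student's writing, evaluates it across multiple scenarios, and offers educators a transparent tool to support academic integrity investigations.
Link: https://arxiv.org/abs/2505.08828
Authors: Eduardo Araujo Oliveira,Madhavi Mohoni,Sonsoles López-Pernas,Mohammed Saqr
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 19 pages, 10 figures, 11 tables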
Abstract:As human-AI collaboration becomes increasingly prevalent in educational contexts, understanding and measuring the extent and nature of such interactions pose significant challenges. This research investigates the use of authorship verification (AV) techniques not as a punitive measure, but as a means to quantify AI assistance in academic writing, with a focus on promoting transparency, interpretability, and student development. Building on prior work, we structured our investigation into three stages: dataset selection and expansion, AV method development, and systematic evaluation. Using three datasets - including a public dataset (PAN-14) and two from University of Melbourne students from various courses - we expanded the data to include LLM-generated texts, totalling 1,889 documents and 540 authorship problems from 506 students. We developed an adapted Feature Vector Difference AV methodology to construct robust academic writing profiles for students, designed to capture meaningful, individual characteristics of their writing. The method’s effectiveness was evaluated across multiple scenarios, including distinguishing between student-authored and LLM-generated texts and testing resilience against LLMs’ attempts to mimic student writing styles. Results demonstrate the enhanced AV classifier’s ability to identify stylometric discrepancies and measure human-AI collaboration at word and sentence levels while providing educators with a transparent tool to support academic integrity investigations. This work advances AV technology, offering actionable insights into the dynamics of academic writing in an AI-driven era.
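[NLP-28] An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits
[Quick read]: This paper addresses the accuracy loss and training instability that arise when quantizing large language models (LLMs) to extremely low bit widths (such as the ternary, 2-bit regime). The core of the solution is to insert RMS normalization before every linear projection and to apply a gradual, layer-wise quantization schedule, which stably fine-tunes full-precision checkpoints into ternary LLMs. Without adding model complexity, the approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks, indicating that careful normalization can close much of the accuracy gap between ternary and full-precision LLMs and make ultra-low-bit inference practical.
Link: https://arxiv.org/abs/2505.08823
Authors: Cody Steinmetz,Gavin Childress,Aaron Herbst,Gavin Jones,Jasdeep Singh,Eli Vang,Keagan Weinstock
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: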
Abstract:Large language models (LLMs) have transformed natural-language processing, yet their scale makes real-world deployment costly. Post-training quantization reduces memory and computation but often degrades accuracy, while quantization-aware training can recover performance at the cost of extra training. Pushing quantization to the ternary (2-bit) regime yields even larger savings but is notoriously unstable. Building on recent work showing that a bias-free, RMS-normalized Transformer with straight-through estimation can reach 1.58-bit precision, we demonstrate that simply inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule stably fine-tunes full-precision checkpoints into ternary LLMs. Our approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks without adding model complexity. These results indicate that careful normalization alone can close much of the accuracy gap between ternary and full-precision LLMs, making ultra-low-bit inference practical.
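The two ingredients named in the abstract, RMS normalization before a linear projection and ternary weights, can be sketched on plain Python lists. The ternarization below uses absolute-mean scaling followed by rounding and clipping to {-1, 0, 1}; that specific scheme is an assumption borrowed from common 1.58-bit setups, not something the abstract spells out, and the gradual layer-wise schedule and straight-through estimator are omitted. The "1.58 bits" figure itself is just log2(3), the information content of a three-valued weight.

```python
import math


def rms_norm(x, eps=1e-6):
    """Scale a vector so that its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def ternarize(weights, eps=1e-8):
    """Map weights to {-1, 0, 1} with one shared scale (assumed absmean scheme)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def ternary_linear(x, weight_rows):
    """y = W_q @ RMSNorm(x), with each row of W ternarized independently."""
    normed = rms_norm(x)
    outputs = []
    for row in weight_rows:
        quantized, scale = ternarize(row)
        outputs.append(scale * sum(q * v for q, v in zip(quantized, normed)))
    return outputs


bits_per_weight = math.log2(3)  # about 1.585, hence "1.58 bits"
quantized, scale = ternarize([0.9, -0.05, -1.2, 0.3])
y = ternary_linear([3.0, 4.0], [[0.9, -0.05], [-1.2, 0.3]])
```

Normalizing the input first keeps its scale fixed regardless of upstream activations, which is one plausible reading of why the extra normalization stabilizes training with such coarse weights.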
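[NLP-29] The Geometry of Meaning: Perfect Spacetime Representations of Hierarchical Structures
[Quick read]: This paper addresses how to embed discrete data (such as semantic hierarchies) perfectly in three-dimensional Minkowski spacetime; the central idea is to encode hierarchical relations in the causal structure rather than in a conventional distance metric. The key to the solution is to build the embedding from oriented token pairs, i.e., local hierarchical signals, with no access to global symbolic structure, so that the hierarchy is fully encoded in the geometry and exactly reproduces the ground truth.
Link: https://arxiv.org/abs/2505.08795
Authors: Andres Anabalon,Hugo Garces,Julio Oliva,Jose Cifuentes
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, 3 figures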
Abstract:We show that there is a fast algorithm that embeds hierarchical structures in three-dimensional Minkowski spacetime. The correlation of data ends up purely encoded in the causal structure. Our model relies solely on oriented token pairs – local hierarchical signals – with no access to global symbolic structure. We apply our method to the corpus of WordNet. We provide a perfect embedding of the mammal sub-tree including ambiguities (more than one hierarchy per node) in such a way that the hierarchical structures get completely codified in the geometry and exactly reproduce the ground-truth. We extend this to a perfect embedding of the maximal unambiguous subset of the WordNet with 82,115 noun tokens and a single hierarchy per token. We introduce a novel retrieval mechanism in which causality, not distance, governs hierarchical access. Our results seem to indicate that all discrete data has a perfect geometrical representation that is three-dimensional. The resulting embeddings are nearly conformally invariant, indicating deep connections with general relativity and field theory. These results suggest that concepts, categories, and their interrelations, namely hierarchical meaning itself, is geometric.
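Retrieval governed by causality rather than distance can be illustrated with a light-cone test. In three-dimensional Minkowski spacetime (one time and two space coordinates), event b lies in the causal future of event a when the time difference is positive and at least as large as the spatial separation. The sketch below assumes that descendants sit in the future light cone of their ancestors; that orientation, the coordinates, and the tiny vocabulary are all made up for illustration and are not taken from the paper.

```python
def in_causal_future(a, b):
    """True if event b = (t, x, y) lies in the future light cone of event a."""
    dt = b[0] - a[0]
    dx = b[1] - a[1]
    dy = b[2] - a[2]
    return dt > 0 and dt * dt >= dx * dx + dy * dy


def descendants(events, name):
    """All tokens whose event lies in the causal future of `name`."""
    origin = events[name]
    return {
        other
        for other, point in events.items()
        if other != name and in_causal_future(origin, point)
    }


# Hypothetical (t, x, y) coordinates for a four-node hierarchy.
events = {
    "mammal": (0.0, 0.0, 0.0),
    "dog": (2.0, 1.0, 0.0),
    "cat": (2.0, -1.0, 0.0),
    "puppy": (3.0, 1.5, 0.0),
}
```

Here "puppy" is causally reachable from "dog" but not from "cat", even though all three are close in ordinary distance, which is the kind of distinction a distance-based retrieval rule cannot express directly.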
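[NLP-30] Ornithologist: Towards Trustworthy "Reasoning" about Central Bank Communications
[Quick read]: This paper addresses automatic identification of hawkish and dovish stances in central bank text, in order to inform predictions of the cash rate path and market expectations. The core of the solution is Ornithologist, a system that uses "taxonomy-guided reasoning", guiding a large language model with human-authored decision trees. This increases transparency and explainability, reduces hallucination risk, and lowers the need for supervised data, making the system easier to apply to other sources of text.
Link: https://arxiv.org/abs/2505.09083
Authors: Dominic Zaun Eu Jones
Affiliations: Unknown
Subjects: General Economics (econ.GN); Computation and Language (cs.CL)
Comments: 16 pages, 6 figures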
Abstract:I develop Ornithologist, a weakly-supervised textual classification system and measure the hawkishness and dovishness of central bank text. Ornithologist uses "taxonomy-guided reasoning", guiding a large language model with human-authored decision trees. This increases the transparency and explainability of the system and makes it accessible to non-experts. It also reduces hallucination risk. Since it requires less supervision than traditional classification systems, it can more easily be applied to other problems or sources of text (e.g. news) without much modification. Ornithologist measurements of hawkishness and dovishness of RBA communication carry information about the future of the cash rate path and of market expectations.
Computer Vision
[CV-0] UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing CVPR2025
[Quick read]: This paper addresses the limits that the high cost of annotation places on training models for Audio-Visual Video Parsing (AVVP), in particular the weakly-supervised setting where only modality-agnostic, video-level labels are available. The core of the solution is Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV), which weights pseudo-labels by their uncertainty and adds a feature-mixup-based training regularization, improving model performance.
Link: https://arxiv.org/abs/2505.09615
Authors: Yung-Hsuan Lai,Janek Ebbers,Yu-Chiang Frank Wang,François Germain,Michael Jeffrey Jones,Moitreya Chatterjee
Affiliations: National Taiwan University; NVIDIA; Mitsubishi Electric Research Labs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: CVPR 2025
Abstract:Audio-Visual Video Parsing (AVVP) entails the challenging task of localizing both uni-modal events (i.e., those occurring exclusively in either the visual or acoustic modality of a video) and multi-modal events (i.e., those occurring in both modalities concurrently). Moreover, the prohibitive cost of annotating training data with the class labels of all these events, along with their start and end times, imposes constraints on the scalability of AVVP techniques unless they can be trained in a weakly-supervised setting, where only modality-agnostic, video-level labels are available in the training data. To this end, recently proposed approaches seek to generate segment-level pseudo-labels to better guide model training. However, the absence of inter-segment dependencies when generating these pseudo-labels and the general bias towards predicting labels that are absent in a segment limit their performance. This work proposes a novel approach towards overcoming these weaknesses called Uncertainty-weighted Weakly-supervised Audio-visual Video Parsing (UWAV). Additionally, our innovative approach factors in the uncertainty associated with these estimated pseudo-labels and incorporates a feature mixup based training regularization for improved training. Empirical results show that UWAV outperforms state-of-the-art methods for the AVVP task on multiple metrics, across two different datasets, attesting to its effectiveness and generalizability.
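[CV-1] LightLab: Controlling Light Sources in Images with Diffusion Models
[Quick read]: This paper addresses fine-grained, parametric control over light sources in an image. Existing relighting methods either rely on multiple input views to perform inverse rendering at inference time or fail to provide explicit control over light changes. The core of the solution is to fine-tune a diffusion model on a small set of real raw photograph pairs, supplemented by synthetically rendered images at scale, to elicit its photorealistic prior for relighting and give explicit control over light intensity and color.
Link: https://arxiv.org/abs/2505.09608
Authors: Nadav Magar,Amir Hertz,Eric Tabellion,Yael Pritch,Alex Rav-Acha,Ariel Shamir,Yedid Hoshen
Affiliations: Tel Aviv University; Google Israel; Google United States of America; Reichman University; Hebrew University of Jerusalem
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL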
Abstract:We present a simple, yet effective diffusion-based method for fine-grained, parametric control over light sources in an image. Existing relighting methods either rely on multiple input views to perform inverse rendering at inference time, or fail to provide explicit control over light changes. Our method fine-tunes a diffusion model on a small set of real raw photograph pairs, supplemented by synthetically rendered images at scale, to elicit its photorealistic prior for relighting. We leverage the linearity of light to synthesize image pairs depicting controlled light changes of either a target light source or ambient illumination. Using this data and an appropriate fine-tuning scheme, we train a model for precise illumination changes with explicit control over light intensity and color. Lastly, we show how our method can achieve compelling light editing results, and outperforms existing methods based on user preference.
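The "linearity of light" that the abstract relies on for synthesizing training pairs can be shown on a single pixel. In linear (raw) color space an image is the sum of the contributions of its light sources, so a photo with the target light off and one with it on isolate that light's contribution by subtraction, and new intensities or colors follow by scaling. The sketch below is a toy per-pixel version under that assumption; the function name and the parameters `gamma`, `ambient`, and `color` are illustrative, not the paper's notation.

```python
def relight(off_pixel, on_pixel, gamma=1.0, ambient=1.0, color=(1.0, 1.0, 1.0)):
    """Synthesize a relit linear-RGB pixel from a light-off / light-on pair.

    gamma scales the target light, ambient scales everything else, and color
    tints the target light channel by channel.
    """
    relit = []
    for off, on, tint in zip(off_pixel, on_pixel, color):
        light = max(on - off, 0.0)  # contribution of the target light alone
        relit.append(ambient * off + gamma * light * tint)
    return tuple(relit)


off_pixel = (0.1, 0.1, 0.1)   # target light switched off
on_pixel = (0.5, 0.4, 0.3)    # target light switched on

same = relight(off_pixel, on_pixel)                  # reproduces the "on" photo
dimmed = relight(off_pixel, on_pixel, gamma=0.5)     # target light at half power
switched_off = relight(off_pixel, on_pixel, gamma=0.0)
```

This only holds before tone mapping and gamma encoding, which is presumably why the abstract stresses raw photograph pairs.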
[CV-2] Variational Visual Question Answering ICCV2025
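[Quick read]: This paper addresses the limited reliability of multimodal models on Visual Question Answering (VQA), in particular their tendency to be overconfident and miscalibrated in out-of-distribution (OOD) settings. Prior work has mostly targeted the reliability of unimodal models, with little work on multimodal ones. The key to the solution is a variational approach: fine-tuning with the IVON algorithm, which yields a posterior distribution over model parameters and thereby improves calibration and uncertainty estimation while matching the accuracy of AdamW fine-tuning.
Link: https://arxiv.org/abs/2505.09591
Authors: Tobias Jan Wieczorek,Nathalie Daun,Mohammad Emtiyaz Khan,Marcus Rohrbach
Affiliations: TU Darmstadt & hessian.AI; RIKEN Center for Advanced Intelligence Project
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 16 figures, under review at ICCV 2025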
Abstract:Despite remarkable progress in multimodal models for Visual Question Answering (VQA), there remain major reliability concerns because the models can often be overconfident and miscalibrated, especially in out-of-distribution (OOD) settings. Plenty has been done to address such issues for unimodal models, but little work exists for multimodal cases. Here, we address unreliability in multimodal models by proposing a Variational VQA approach. Specifically, instead of fine-tuning vision-language models by using AdamW, we employ a recently proposed variational algorithm called IVON, which yields a posterior distribution over model parameters. Through extensive experiments, we show that our approach improves calibration and abstentions without sacrificing the accuracy of AdamW. For instance, compared to AdamW fine-tuning, we reduce Expected Calibration Error by more than 50% and raise Coverage by 4% vs. SOTA (for a fixed risk of 1%). In the presence of distribution shifts, the performance gain is even higher, achieving 8% Coverage (@ 1% risk) improvement vs. SOTA when 50% of test cases are OOD. Overall, we present variational learning as a viable option to enhance the reliability of multimodal models.
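For readers unfamiliar with the Expected Calibration Error cited above, here is a minimal sketch of the common equal-width binned estimator. The abstract does not state which binning the paper uses, so the choice of ten bins and the function name are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct: booleans, True where the chosen answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that answers with 90% confidence but is right only 75% of the time scores an ECE of about 0.15; a perfectly calibrated model scores 0.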
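[CV-3] Don't Forget your Inverse DDIM for Image Editing
[Quick read]: This paper tackles the challenges of editing real images, where existing methods are either computationally intensive or produce poor reconstructions. The key to the solution is SAGE (Self-Attention Guidance for image Editing), which leverages pre-trained diffusion models and builds on the DDIM algorithm with a novel guidance mechanism: a reconstruction objective is computed from attention maps that the self-attention layers of the diffusion U-Net generate during the inverse DDIM process. This lets unedited regions be reconstructed efficiently without precisely reconstructing the entire input image.
Link: https://arxiv.org/abs/2505.09571
Authors: Guillermo Gomez-Trenado,Pablo Mesejo,Oscar Cordón,Stéphane Lathuilière
Affiliations: University of Granada and DaSCI Research Institute; Panacea Cooperative Research; Inria at University Grenoble Alpes
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 12 figures, code available at this https URL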
Abstract:The field of text-to-image generation has undergone significant advancements with the introduction of diffusion models. Nevertheless, the challenge of editing real images persists, as most methods are either computationally intensive or produce poor reconstructions. This paper introduces SAGE (Self-Attention Guidance for image Editing) - a novel technique leveraging pre-trained diffusion models for image editing. SAGE builds upon the DDIM algorithm and incorporates a novel guidance mechanism utilizing the self-attention layers of the diffusion U-Net. This mechanism computes a reconstruction objective based on attention maps generated during the inverse DDIM process, enabling efficient reconstruction of unedited regions without the need to precisely reconstruct the entire input image. Thus, SAGE directly addresses the key challenges in image editing. The superiority of SAGE over other methods is demonstrated through quantitative and qualitative evaluations and confirmed by a statistically validated comprehensive user study, in which all 47 surveyed users preferred SAGE over competing methods. Additionally, SAGE ranks as the top-performing method in seven out of 10 quantitative analyses and secures second and third places in the remaining three.
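As background for the "inverse DDIM process" mentioned above, the standard deterministic DDIM update (with \eta = 0) and its inversion can be written as follows. This is the generic formulation, not SAGE's guidance term, and the notation (\bar\alpha_t for the cumulative noise schedule, \epsilon_\theta for the noise predictor) is assumed rather than taken from the paper.

```latex
% Predicted clean image at step t
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}

% Deterministic DDIM sampling step (t -> t-1)
x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t)

% DDIM inversion step (t -> t+1), using the approximation
% \epsilon_\theta(x_{t+1}, t+1) \approx \epsilon_\theta(x_t, t)
x_{t+1} = \sqrt{\bar\alpha_{t+1}}\,\hat{x}_0 + \sqrt{1-\bar\alpha_{t+1}}\,\epsilon_\theta(x_t, t)
```

The approximation in the last step is what makes plain inversion inexact, which is one reason exact reconstruction of a real image is costly.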
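[CV-4] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
[Quick read]: This paper addresses the joint modeling of image understanding and image generation in unified multimodal models, in particular how to design the model architecture and training recipe for high-quality generation and understanding. The key to the solution is a novel approach that uses a diffusion transformer to generate semantically rich CLIP image features, which improves both training efficiency and generative quality compared with conventional VAE-based representations. A sequential pretraining strategy, training first on image understanding and then on image generation, preserves understanding capability while building generation ability.
Link: https://arxiv.org/abs/2505.09568
Authors: Jiuhai Chen,Zhiyang Xu,Xichen Pan,Yushi Hu,Can Qin,Tom Goldstein,Lifu Huang,Tianyi Zhou,Saining Xie,Silvio Savarese,Le Xue,Caiming Xiong,Ran Xu
Affiliations: Salesforce Research; University of Maryland; Virginia Tech; New York University; University of Washington; UC Davis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: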
Abstract:Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding and subsequently on image generation) offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset BLIP3o-60k for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.
[CV-5] Using Foundation Models as Pseudo-Label Generators for Pre-Clinical 4D Cardiac CT Segmentation
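[Quick read]: This paper aims to achieve accurate segmentation of porcine cardiac computed tomography (CT) images, particularly when no manually annotated data are available. The key to the solution is to use foundation models to generate sufficiently accurate pseudo-labels and to refine them iteratively with a simple self-training approach, which improves segmentation quality and reduces temporal inconsistencies across consecutive frames.
Link: https://arxiv.org/abs/2505.09564
Authors: Anne-Marie Rickmann,Stephanie L. Thorn,Shawn S. Ahn,Supum Lee,Selen Uman,Taras Lysyy,Rachel Burns,Nicole Guerrera,Francis G. Spinale,Jason A. Burdick,Albert J. Sinusas,James S. Duncan
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted at FIMH 2025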
Abstract:Cardiac image segmentation is an important step in many cardiac image analysis and modeling tasks such as motion tracking or simulations of cardiac mechanics. While deep learning has greatly advanced segmentation in clinical settings, there is limited work on pre-clinical imaging, notably in porcine models, which are often used due to their anatomical and physiological similarity to humans. However, differences between species create a domain shift that complicates direct model transfer from human to pig data. Recently, foundation models trained on large human datasets have shown promise for robust medical image segmentation; yet their applicability to porcine data remains largely unexplored. In this work, we investigate whether foundation models can generate sufficiently accurate pseudo-labels for pig cardiac CT and propose a simple self-training approach to iteratively refine these labels. Our method requires no manually annotated pig data, relying instead on iterative updates to improve segmentation quality. We demonstrate that this self-training process not only enhances segmentation accuracy but also smooths out temporal inconsistencies across consecutive frames. Although our results are encouraging, there remains room for improvement, for example by incorporating more sophisticated self-training strategies and by exploring additional foundation models and other cardiac imaging technologies.
[CV-6] Camera-Only 3D Panoptic Scene Completion for Autonomous Driving through Differentiable Object Shapes CVPR2025
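[Quick read]: This paper addresses 3D panoptic scene completion, an underexplored extension of 3D semantic scene completion. Existing methods mainly predict occupancy in the scene and do not distinguish object instances within the same class, which limits their use in path planning and decision-making. The work proposes a new framework with an Object Module and a Panoptic Module that can be integrated into existing 3D occupancy and scene completion methods. Its key idea is to leverage the annotations available in occupancy benchmarks so that individual object shapes can be learned as a differentiable problem.
Link: https://arxiv.org/abs/2505.09562
Authors: Nicola Marinello,Simen Cassiman,Jonas Heylen,Marc Proesmans,Luc Van Gool
Affiliations: KU Leuven; TRACE vzw; ETH Zürich; INSAIT
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025 Workshop on Autonomous Driving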
Abstract:Autonomous vehicles need a complete map of their surroundings to plan and act. This has sparked research into the tasks of 3D occupancy prediction, 3D scene completion, and 3D panoptic scene completion, which predict a dense map of the ego vehicle’s surroundings as a voxel grid. Scene completion extends occupancy prediction by predicting occluded regions of the voxel grid, and panoptic scene completion further extends this task by also distinguishing object instances within the same class; both aspects are crucial for path planning and decision-making. However, 3D panoptic scene completion is currently underexplored. This work introduces a novel framework for 3D panoptic scene completion that extends existing 3D semantic scene completion models. We propose an Object Module and Panoptic Module that can easily be integrated with 3D occupancy and scene completion methods presented in the literature. Our approach leverages the available annotations in occupancy benchmarks, allowing individual object shapes to be learned as a differentiable problem. The code is available at this https URL.
[CV-7] Contactless Cardiac Pulse Monitoring Using Event Cameras
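[Quick read]: This paper addresses contactless remote heart rate monitoring by reconstructing an individual's cardiac pulse signal from time event camera recordings of the face. The key to the solution is a supervised convolutional neural network (CNN) trained end to end to extract the cardiac signal from a two-dimensional representation of the event stream; the results confirm that physiological cardiac information is preserved in the event stream. The event-frame model reaches an RMSE of 3.32 bpm against 2.92 bpm for the baseline trained on standard camera frames, while models trained on event frames generated at 60 and 120 FPS outperform the 30 FPS standard-camera result (RMSE of 2.54 and 2.13 bpm, respectively).
Link: https://arxiv.org/abs/2505.09529
Authors: Mohamed Moustafa,Joseph Lemley,Peter Corcoran
Affiliations: University of Galway; C3I Imaging Lab; Autosense Team, FotoNation-Tobii
Subjects: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: This paper is a preprint of a paper submitted to IEEE Access and is currently under review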
Abstract:Time event cameras are a novel technology for recording scene information at extremely low latency and with low power consumption. Event cameras output a stream of events that encapsulate pixel-level light intensity changes within the scene, capturing information with a higher dynamic range and temporal resolution than traditional cameras. This study investigates the contact-free reconstruction of an individual’s cardiac pulse signal from time event recording of their face using a supervised convolutional neural network (CNN) model. An end-to-end model is trained to extract the cardiac signal from a two-dimensional representation of the event stream, with model performance evaluated based on the accuracy of the calculated heart rate. The experimental results confirm that physiological cardiac information in the facial region is effectively preserved within the event stream, showcasing the potential of this novel sensor for remote heart rate monitoring. The model trained on event frames achieves a root mean square error (RMSE) of 3.32 beats per minute (bpm) compared to the RMSE of 2.92 bpm achieved by the baseline model trained on standard camera frames. Furthermore, models trained on event frames generated at 60 and 120 FPS outperformed the 30 FPS standard camera results, achieving an RMSE of 2.54 and 2.13 bpm, respectively.
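The evaluation described above (heart rate computed from a reconstructed pulse signal, then scored by RMSE) can be sketched as follows. The abstract does not say how heart rate is calculated, so the spectral-peak approach, the 0.7 to 3.0 Hz search band, and both function names are assumptions for illustration.

```python
import math

def heart_rate_bpm(signal, fs, lo_hz=0.7, hi_hz=3.0):
    """Estimate heart rate as the strongest DFT component in a plausible band.

    signal: pulse waveform samples; fs: sampling rate in Hz.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC offset
    best_freq, best_power = 0.0, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if not lo_hz <= freq <= hi_hz:
            continue
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_freq, best_power = freq, power
    return 60.0 * best_freq

def rmse(predicted, target):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted))
```

With a 10-second window the frequency resolution is 0.1 Hz, or 6 bpm, so errors of a few bpm like those reported would require longer windows or interpolation around the spectral peak.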
[CV-8] Conformal Bounds on Full-Reference Image Quality for Imaging Inverse Problems
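[Quick read]: This paper addresses how to assess full-reference image quality (FRIQ) between the recovered image and the true image in imaging inverse problems, where metrics such as PSNR, SSIM, and LPIPS matter especially in safety-critical applications like medical imaging. Because the true image is unavailable, FRIQ cannot be computed directly. The key to the solution is to combine conformal prediction with approximate posterior sampling to construct FRIQ bounds that are guaranteed to hold up to a user-specified error probability.
Link: https://arxiv.org/abs/2505.09528
Authors: Jeffrey Wen,Rizwan Ahmad,Philip Schniter
Affiliations: The Ohio State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: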
Abstract:In imaging inverse problems, we would like to know how close the recovered image is to the true image in terms of full-reference image quality (FRIQ) metrics like PSNR, SSIM, LPIPS, etc. This is especially important in safety-critical applications like medical imaging, where knowing that, say, the SSIM was poor could potentially avoid a costly misdiagnosis. But since we don’t know the true image, computing FRIQ is non-trivial. In this work, we combine conformal prediction with approximate posterior sampling to construct bounds on FRIQ that are guaranteed to hold up to a user-specified error probability. We demonstrate our approach on image denoising and accelerated magnetic resonance imaging (MRI) problems. Code is available at this https URL.
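The general idea of a conformal bound can be sketched with plain split conformal prediction. This is a generic one-sided construction, not the paper's exact procedure: it assumes a calibration set where both an estimated metric (for example, a PSNR estimate derived from posterior samples) and the true metric are known, and all names are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def lower_bound(estimate, calib_estimates, calib_truths, alpha=0.1):
    """Lower bound on the true metric holding with probability >= 1 - alpha.

    The score measures how much each calibration estimate overstated the truth.
    """
    scores = [e - t for e, t in zip(calib_estimates, calib_truths)]
    return estimate - conformal_quantile(scores, alpha)
```

The guarantee is marginal over exchangeable calibration and test samples, and a smaller `alpha` needs more calibration data before the bound becomes finite.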
[CV-9] Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput
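[Quick read]: This paper addresses the ultra-low latency and high throughput that Vision-Language Models (VLMs) need for real-time applications, without sacrificing accuracy. The key to the solution is a combination of architectural enhancements and efficient computational strategies: tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching, which together balance computational load against model performance.
Link: https://arxiv.org/abs/2505.09498
Authors: Bo Zhang,Shuo Li,Runhe Tian,Yang Yang,Jixin Tang,Jinhao Zhou,Lin Ma
Affiliations: Meituan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 7 figures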
Abstract:In this paper, we introduce Flash-VL 2B, a novel approach to optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without sacrificing accuracy. Leveraging advanced architectural enhancements and efficient computational strategies, Flash-VL 2B is designed to maximize throughput by reducing processing time while maintaining competitive performance across multiple vision-language benchmarks. Our approach includes tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching that effectively balances computational load and model performance. Through extensive evaluations on 11 standard VLM benchmarks, we demonstrate that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising solution for deployment in resource-constrained environments and large-scale real-time applications.
[CV-10] Denoising and Alignment: Rethinking Domain Generalization for Multimodal Face Anti-Spoofing
[Quick read]: This paper addresses the weak cross-domain generalization of multimodal Face Anti-Spoofing (FAS) methods, which stems mainly from modality-specific biases and domain shifts. The key to the solution is the MultiModal Denoising and Alignment (MMDA) framework, which leverages the zero-shot generalization capability of CLIP and uses denoising and alignment mechanisms to suppress noise in multimodal data, improving the generalization of cross-modal alignment. MMDA also introduces a Modality-Domain Joint Differential Attention (MD2A) module and a Representation Space Soft (RS2) Alignment strategy to better handle domain and modality noise and to improve adaptability to unseen conditions.
Link: https://arxiv.org/abs/2505.09484
Authors: Yingjie Ma,Xun Lin,Zitong Yu,Xin Liu,Xiaochen Yuan,Weicheng Xie,Linlin Shen
Affiliations: Shenzhen University; Great Bay University; Lappeenranta-Lahti University of Technology; Macao Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Face Anti-Spoofing (FAS) is essential for the security of facial recognition systems in diverse scenarios such as payment processing and surveillance. Current multimodal FAS methods often struggle with effective generalization, mainly due to modality-specific biases and domain shifts. To address these challenges, we introduce the Multimodal Denoising and Alignment (MMDA) framework. By leveraging the zero-shot generalization capability of CLIP, the MMDA framework effectively suppresses noise in multimodal data through denoising and alignment mechanisms, thereby significantly enhancing the generalization performance of cross-modal alignment. The Modality-Domain Joint Differential Attention (MD2A) module in MMDA concurrently mitigates the impacts of domain and modality noise by refining the attention mechanism based on extracted common noise features. Furthermore, the Representation Space Soft (RS2) Alignment strategy utilizes the pre-trained CLIP model to align multi-domain multimodal data into a generalized representation space in a flexible manner, preserving intricate representations and enhancing the model’s adaptability to various unseen conditions. We also design a U-shaped Dual Space Adaptation (U-DSA) module to enhance the adaptability of representations while maintaining generalization performance. These improvements not only enhance the framework’s generalization capabilities but also boost its ability to represent complex representations. Our experimental results on four benchmark datasets under different evaluation protocols demonstrate that the MMDA framework outperforms existing state-of-the-art methods in terms of cross-domain generalization and multimodal detection accuracy. The code will be released soon.
[CV-11] A 2D Semantic-Aware Position Encoding for Vision Transformers
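[Quick read]: This paper addresses the failure of existing position encoding techniques in vision transformers to capture semantic-aware positional relationships between image patches. Traditional approaches such as absolute and relative position encoding focus mainly on 1D linear position relationships and neglect the link between distant but semantically similar patches. The key to the solution is 2-Dimensional Semantic-Aware Position Encoding ($\text{SaPE}^2$), which dynamically adapts position representations using local content instead of fixed linear position relationships or spatial coordinates, improving generalization across image resolutions and scales and the handling of repetitive or structured patterns.
Link: https://arxiv.org/abs/2505.09466
Authors: Xi Chen,Shiyang Zhou,Muqi Huang,Jiaxu Feng,Yun Xiong,Kun Zhou,Biao Yang,Yuhui Zhang,Huishuai Bao,Sijia Peng,Chuan Li,Feng Shi
Affiliations: Fudan University; Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures, 3 tables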
Abstract:Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. However, existing position encoding techniques, which are largely borrowed from natural language processing, fail to effectively capture semantic-aware positional relationships between image patches. Traditional approaches like absolute position encoding and relative position encoding primarily focus on 1D linear position relationship, often neglecting the semantic similarity between distant yet contextually related patches. These limitations hinder model generalization, translation equivariance, and the ability to effectively handle repetitive or structured patterns in images. In this paper, we propose 2-Dimensional Semantic-Aware Position Encoding ($\text{SaPE}^2$), a novel position encoding method with semantic awareness that dynamically adapts position representations by leveraging local content instead of fixed linear position relationship or spatial coordinates. Our method enhances the model’s ability to generalize across varying image resolutions and scales, improves translation equivariance, and better aggregates features for visually similar but spatially distant patches. By integrating $\text{SaPE}^2$ into vision transformers, we bridge the gap between position encoding and perceptual similarity, thereby improving performance on computer vision tasks.
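[CV-12] Beyond Pixels: Leveraging the Language of Soccer to Improve Spatio-Temporal Action Detection in Broadcast Videos
[Quick read]: This paper addresses the many false positives that current spatio-temporal action detection (STAD) methods produce, for lack of contextual understanding, when run for exhaustive event coverage in soccer analytics. The key to the solution is to reason at the game level by adding a denoising sequence transduction task: a Transformer-based encoder-decoder model processes sequences of noisy, player-centric predictions together with clean game-state information. By modeling extended temporal context and reasoning jointly over team-level dynamics, it exploits soccer's tactical regularities and inter-player dependencies to generate "denoised" action sequences, improving both precision and recall in low-confidence regimes.
Link: https://arxiv.org/abs/2505.09455
Authors: Jeremie Ochin,Raphael Chekroun,Bogdan Stanciulescu,Sotiris Manitsaris
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, submitted to Advanced Concepts for Intelligent Vision Systems 2025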
Abstract:State-of-the-art spatio-temporal action detection (STAD) methods show promising results for extracting soccer events from broadcast videos. However, when operated in the high-recall, low-precision regime required for exhaustive event coverage in soccer analytics, their lack of contextual understanding becomes apparent: many false positives could be resolved by considering a broader sequence of actions and game-state information. In this work, we address this limitation by reasoning at the game level and improving STAD through the addition of a denoising sequence transduction task. Sequences of noisy, context-free player-centric predictions are processed alongside clean game state information using a Transformer-based encoder-decoder model. By modeling extended temporal context and reasoning jointly over team-level dynamics, our method leverages the “language of soccer” - its tactical regularities and inter-player dependencies - to generate “denoised” sequences of actions. This approach improves both precision and recall in low-confidence regimes, enabling more reliable event extraction from broadcast video and complementing existing pixel-based methods.
[CV-13] MrTrack: Register Mamba for Needle Tracking with Rapid Reciprocating Motion during Ultrasound-Guided Aspiration Biopsy MICCAI2025
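[Quick read]: This paper addresses the lack of a needle tracker that can handle rapid reciprocating motion in ultrasound-guided fine needle aspiration (FNA) biopsy. The key to the solution is MrTrack, which uses a Mamba-based register mechanism: a Mamba-based register extractor sequentially distills global context from each historical search map and stores these temporal cues in a register bank, and a Mamba-based register retriever then retrieves temporal prompts from the bank to provide external cues when current vision features are temporarily unusable due to rapid reciprocating motion and imaging degradation. A self-supervised register diversify loss further encourages feature diversity and dimension independence within the learned register, mitigating feature collapse.
Link: https://arxiv.org/abs/2505.09450
Authors: Yuelin Zhang,Qingpeng Ding,Long Lei,Yongxuan Feng,Raymond Shing-Yan Tang,Shing Shin Cheng
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Early Accepted by MICCAI 2025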
Abstract:Ultrasound-guided fine needle aspiration (FNA) biopsy is a common minimally invasive diagnostic procedure. However, an aspiration needle tracker addressing rapid reciprocating motion is still missing. MrTrack, an aspiration needle tracker with a Mamba-based register mechanism, is proposed. MrTrack leverages a Mamba-based register extractor to sequentially distill global context from each historical search map, storing these temporal cues in a register bank. The Mamba-based register retriever then retrieves temporal prompts from the register bank to provide external cues when current vision features are temporarily unusable due to rapid reciprocating motion and imaging degradation. A self-supervised register diversify loss is proposed to encourage feature diversity and dimension independence within the learned register, mitigating feature collapse. Comprehensive experiments conducted on both motorized and manual aspiration datasets demonstrate that MrTrack not only outperforms state-of-the-art trackers in accuracy and robustness but also achieves superior inference efficiency.
[CV-14] Endo-CLIP: Progressive Self-Supervised Pre-training on Raw Colonoscopy Records MICCAI2025
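[Quick read]: This paper addresses the challenges of pre-training on colonoscopy records for endoscopic image analysis: non-informative background images, complex medical terminology, and ambiguous multi-lesion descriptions. The key to the solution is Endo-CLIP, a self-supervised framework that enhances Contrastive Language-Image Pre-training (CLIP) through three stages (cleansing, attunement, and unification): removing background frames, using large language models to extract clinical attributes for fine-grained contrastive learning, and applying patient-level cross-attention to resolve multi-polyp ambiguities.
Link: https://arxiv.org/abs/2505.09435
Authors: Yili He,Yan Zhu,Peiyao Fu,Ruijie Yang,Tianyi Chen,Zhihua Wang,Quanlin Li,Pinghong Zhou,Xian Yang,Shuo Wang
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Early accepted to MICCAI 2025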
Abstract:Pre-training on image-text colonoscopy records offers substantial potential for improving endoscopic image analysis, but faces challenges including non-informative background images, complex medical terminology, and ambiguous multi-lesion descriptions. We introduce Endo-CLIP, a novel self-supervised framework that enhances Contrastive Language-Image Pre-training (CLIP) for this domain. Endo-CLIP’s three-stage framework–cleansing, attunement, and unification–addresses these challenges by (1) removing background frames, (2) leveraging large language models to extract clinical attributes for fine-grained contrastive learning, and (3) employing patient-level cross-attention to resolve multi-polyp ambiguities. Extensive experiments demonstrate that Endo-CLIP significantly outperforms state-of-the-art pre-training methods in zero-shot and few-shot polyp detection and classification, paving the way for more accurate and clinically relevant endoscopic analysis.
[CV-15] Efficient LiDAR Reflectance Compression via Scanning Serialization
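[Quick read]: This paper addresses the fact that reflectance attributes in LiDAR point clouds remain underexplored in neural compression methods, despite the information they carry for downstream tasks. The key to the solution is the SerLiC framework, which converts 3D LiDAR point clouds into 1D sequences via scan-order serialization, giving a device-centric view of reflectance, and combines a contextual token representation with a dual parallelization scheme for Mamba to model dependencies efficiently and process data quickly, substantially reducing data volume and improving compression performance.
Link: https://arxiv.org/abs/2505.09433
Authors: Jiahao Zhu,Kang You,Dandan Ding,Zhan Ma
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: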
Abstract:Reflectance attributes in LiDAR point clouds provide essential information for downstream tasks but remain underexplored in neural compression methods. To address this, we introduce SerLiC, a serialization-based neural compression framework to fully exploit the intrinsic characteristics of LiDAR reflectance. SerLiC first transforms 3D LiDAR point clouds into 1D sequences via scan-order serialization, offering a device-centric perspective for reflectance analysis. Each point is then tokenized into a contextual representation comprising its sensor scanning index, radial distance, and prior reflectance, for effective dependency exploration. For efficient sequential modeling, Mamba is incorporated with a dual parallelization scheme, enabling simultaneous autoregressive dependency capture and fast processing. Extensive experiments demonstrate that SerLiC attains over 2x volume reduction against the original reflectance data, outperforming the state-of-the-art method by up to 22% reduction of compressed bits while using only 2% of its parameters. Moreover, a lightweight version of SerLiC achieves 10 fps (frames per second) with just 111K parameters, which is attractive for real-world applications.
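The serialization and tokenization steps described above might look roughly like the sketch below. It assumes each point already carries its laser (ring) index, approximates scan order by sorting on ring and then azimuth, and takes "prior reflectance" to mean the previous point's reflectance on the same ring; none of these details are confirmed by the abstract, and the function name is hypothetical.

```python
import math

def serialize_scan_order(points):
    """Order LiDAR points as a 1D sequence and build per-point context tokens.

    points: list of (x, y, z, reflectance, ring) tuples.
    Returns (ordered_points, tokens), where each token is
    (ring, radial_distance, previous_reflectance_on_same_ring).
    """
    ordered = sorted(points, key=lambda p: (p[4], math.atan2(p[1], p[0])))
    tokens = []
    prev_ring, prev_refl = None, 0
    for x, y, z, refl, ring in ordered:
        if ring != prev_ring:
            prev_refl = 0  # no earlier point on this ring yet
        tokens.append((ring, math.sqrt(x * x + y * y + z * z), prev_refl))
        prev_ring, prev_refl = ring, refl
    return ordered, tokens
```

A sequence model can then predict each reflectance from its token, so that an entropy coder spends fewer bits on values the context already makes likely.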
[CV-16] MoRAL: Motion-aware Multi-Frame 4D Radar and LiDAR Fusion for Robust 3D Object Detection
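[Quick read]: This paper addresses limited 3D object detection accuracy in multi-modal fusion for autonomous driving, caused by inter-frame misalignment of radar point clouds and by underuse of the dynamic information in 4D radar. The key to the solution is the MoRAL framework, which includes a Motion-aware Radar Encoder (MRE) that compensates for inter-frame misalignment from moving objects and a Motion Attention Gated Fusion (MAGF) module in which radar motion features guide LiDAR features to focus on dynamic foreground objects, improving detection performance.
Link: https://arxiv.org/abs/2505.09422
Authors: Xiangyuan Peng,Yu Wang,Miao Tang,Bierzynski Kay,Lorenzo Servadei,Robert Wille
Affiliations: Technical University of Munich; Infineon Technologies AG; China University of Geosciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: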
Abstract:Reliable autonomous driving systems require accurate detection of traffic participants. To this end, multi-modal fusion has emerged as an effective strategy. In particular, 4D radar and LiDAR fusion methods based on multi-frame radar point clouds have demonstrated effectiveness in bridging the point density gap. However, they often neglect radar point clouds’ inter-frame misalignment caused by object movement during accumulation and do not fully exploit the object dynamic information from 4D radar. In this paper, we propose MoRAL, a motion-aware multi-frame 4D radar and LiDAR fusion framework for robust 3D object detection. First, a Motion-aware Radar Encoder (MRE) is designed to compensate for inter-frame radar misalignment from moving objects. Then, a Motion Attention Gated Fusion (MAGF) module integrates radar motion features to guide LiDAR features to focus on dynamic foreground objects. Extensive evaluations on the View-of-Delft (VoD) dataset demonstrate that MoRAL outperforms existing methods, achieving the highest mAP of 73.30% in the entire area and 88.68% in the driving corridor. Notably, our method also achieves the best AP of 69.67% for pedestrians in the entire area and 96.25% for cyclists in the driving corridor.
[CV-17] FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models
[Quick read]: This paper addresses the lack of interpretability and reasoning in existing Face Anti-Spoofing (FAS) methods, which treat FAS as a classification problem and give no rationale for their predictions. The authors propose FaceShield, a multimodal large language model (MLLM) for FAS. Its key components are spoof-aware vision perception (SAVP) and a prompt-guided vision token masking (PVTM) strategy, which improve the model's generalization and the interpretability of its judgments.
Link: https://arxiv.org/abs/2505.09415
Authors: Hongyang Wang,Yichen Shi,Zhuofu Tao,Yuhao Gao,Liepiao Zhang,Xun Lin,Jun Feng,Xiaochen Yuan,Zitong Yu,Xiaochun Cao
Affiliations: Shijiazhuang Tiedao University; Shanghai Jiao Tong University; UCLA; GRGBanking; Great Bay University; Macao Polytechnic University; Shenzhen Campus of Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Face anti-spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. Previous methods approached this task as a classification problem, lacking interpretability and reasoning behind the predicted results. Recently, multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and decision-making in visual tasks. However, there is currently no universal and comprehensive MLLM and dataset specifically designed for the FAS task. To address this gap, we propose FaceShield, an MLLM for FAS, along with the corresponding pre-training and supervised fine-tuning (SFT) datasets, FaceShield-pre10K and FaceShield-sft45K. FaceShield is capable of determining the authenticity of faces, identifying types of spoofing attacks, providing reasoning for its judgments, and detecting attack areas. Specifically, we employ spoof-aware vision perception (SAVP) that incorporates both the original image and auxiliary information based on prior knowledge. We then use a prompt-guided vision token masking (PVTM) strategy to randomly mask vision tokens, thereby improving the model’s generalization ability. We conducted extensive experiments on three benchmark datasets, demonstrating that FaceShield significantly outperforms previous deep learning models and general MLLMs on four FAS tasks, i.e., coarse-grained classification, fine-grained classification, reasoning, and attack localization. Our instruction datasets, protocols, and codes will be released soon.
[CV-18] Sparse Point Cloud Patches Rendering via Splitting 2D Gaussians CVPR2025
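[Quick read]: This paper addresses the reliance of current learning-based methods, which predict NeRF or 3D Gaussians from point clouds, on categorical priors, dense point clouds, or additional refinements. The key to the solution is to render point clouds by predicting 2D Gaussians from them. The method uses two identical modules with an entire-patch architecture; the module normalizes and initializes the Gaussians using point cloud information including normals, colors, and distances, and splitting decoders then refine the initial Gaussians by duplicating them and predicting more accurate results, so sparse point clouds are handled as well. Once trained, the method generalizes directly to point clouds of different categories, and the predicted Gaussians are rendered directly without additional refinement on the rendered images.
Link: https://arxiv.org/abs/2505.09413
Authors: Ma Changfeng,Bi Ran,Guo Jie,Wang Chongjun,Guo Yanwen
Affiliations: Nanjing University; School of Software, North University of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 Accepted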
Abstract:Current learning-based methods predict NeRF or 3D Gaussians from point clouds to achieve photo-realistic rendering but still depend on categorical priors, dense point clouds, or additional refinements. Hence, we introduce a novel point cloud rendering method by predicting 2D Gaussians from point clouds. Our method incorporates two identical modules with an entire-patch architecture enabling the network to be generalized to multiple datasets. The module normalizes and initializes the Gaussians utilizing the point cloud information including normals, colors and distances. Then, splitting decoders are employed to refine the initial Gaussians by duplicating them and predicting more accurate results, making our methodology effectively accommodate sparse point clouds as well. Once trained, our approach exhibits direct generalization to point clouds across different categories. The predicted Gaussians are employed directly for rendering without additional refinement on the rendered images, retaining the benefits of 2D Gaussians. We conduct extensive experiments on various datasets, and the results demonstrate the superiority and generalization of our method, which achieves SOTA performance. The code is available at this https URL.
[CV-19] FreeDriveRF: Monocular RGB Dynamic NeRF without Poses for Autonomous Driving via Point-Level Dynamic-Static Decoupling ICRA2025
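[Quick read]: This paper aims to address dynamic scene reconstruction for autonomous driving, specifically how to achieve high-quality dynamic scene modeling from monocular RGB images without relying on accurate pose inputs or multi-sensor data. The key to the solution is FreeDriveRF, which uses semantic supervision to decouple dynamic and static parts at the early sampling stage, reducing image blurring and artifacts. It also introduces an optical-flow-based, warped ray-guided dynamic object rendering consistency loss, and uses estimated dynamic flow to constrain pose optimization, improving the stability and accuracy of unbounded scene reconstruction.
Link: https://arxiv.org/abs/2505.09406
Authors: Yue Wen,Liang Song,Yijia Liu,Siting Zhu,Yanzi Miao,Lijun Han,Hesheng Wang
Affiliations: Shanghai Jiao Tong University; Dimanshen Technology Co., Ltd.; China University of Mining and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 9 figures, accepted by ICRA2025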
Abstract:Dynamic scene reconstruction for autonomous driving enables vehicles to perceive and interpret complex scene changes more precisely. Dynamic Neural Radiance Fields (NeRFs) have recently shown promising capability in scene modeling. However, many existing methods rely heavily on accurate pose inputs and multi-sensor data, leading to increased system complexity. To address this, we propose FreeDriveRF, which reconstructs dynamic driving scenes using only sequential RGB images without requiring pose inputs. We innovatively decouple dynamic and static parts at the early sampling level using semantic supervision, mitigating image blurring and artifacts. To overcome the challenges posed by object motion and occlusion with a monocular camera, we introduce a warped ray-guided dynamic object rendering consistency loss, utilizing optical flow to better constrain the dynamic modeling process. Additionally, we incorporate estimated dynamic flow to constrain the pose optimization process, improving the stability and accuracy of unbounded scene reconstruction. Extensive experiments conducted on the KITTI and Waymo datasets demonstrate the superior performance of our method in dynamic scene modeling for autonomous driving.
[CV-20] UMotion: Uncertainty-driven Human Motion Estimation from Inertial and Ultra-wideband Units CVPR2025
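[Quick read]: This paper aims to address pose ambiguity, data drift, and limited adaptability to diverse body shapes in 3D human motion estimation from sparse wearable inertial measurement units (IMUs). The key to the solution is UMotion, an uncertainty-driven framework for online fusion of six integrated ultra-wideband (UWB) distance sensors with IMUs. Combined with anthropometric data, it uses a tightly coupled Unscented Kalman Filter (UKF) to fuse the uncertainties of sensor data and estimated human motion, correcting IMU drift in real time and mitigating the effect of body occlusion on UWB sensors, which yields more stable and accurate 3D human pose and shape estimation.
Link: https://arxiv.org/abs/2505.09393
Authors: Huakun Liu,Hiroki Ota,Xin Wei,Yutaro Hirao,Monica Perusquia-Hernandez,Hideaki Uchiyama,Kiyoshi Kiyokawa
Affiliations: Nara Institute of Science and Technology
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025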
Abstract:Sparse wearable inertial measurement units (IMUs) have gained popularity for estimating 3D human motion. However, challenges such as pose ambiguity, data drift, and limited adaptability to diverse bodies persist. To address these issues, we propose UMotion, an uncertainty-driven, online fusing-all state estimation framework for 3D human shape and pose estimation, supported by six integrated, body-worn ultra-wideband (UWB) distance sensors with IMUs. UWB sensors measure inter-node distances to infer spatial relationships, aiding in resolving pose ambiguities and body shape variations when combined with anthropometric data. Unfortunately, IMUs are prone to drift, and UWB sensors are affected by body occlusions. Consequently, we develop a tightly coupled Unscented Kalman Filter (UKF) framework that fuses uncertainties from sensor data and estimated human motion based on individual body shape. The UKF iteratively refines IMU and UWB measurements by aligning them with uncertain human motion constraints in real-time, producing optimal estimates for each. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of UMotion in stabilizing sensor data and the improvement over state of the art in pose accuracy.
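The abstract describes fusing IMU and UWB data according to their uncertainties. The paper's tightly coupled UKF is not reproduced here; as a much simpler illustration of the same underlying idea, the sketch below fuses two independent scalar Gaussian estimates by inverse-variance weighting (a scalar Kalman-style update). The function name `fuse` and all numbers are hypothetical and are not taken from the paper.

```python
def fuse(mean_a, var_a, mean_b, var_b):
    """Fuse two independent Gaussian estimates of the same scalar.

    This is inverse-variance weighting: the estimate with the smaller
    variance pulls the fused mean more strongly towards itself.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    gain = var_a / (var_a + var_b)            # weight given to estimate b
    mean = mean_a + gain * (mean_b - mean_a)  # corrected mean
    var = (1.0 - gain) * var_a                # fused variance, below both inputs
    return mean, var


# Hypothetical example: an inter-node distance (in metres) derived from a
# drifting IMU (high variance) and from a UWB range reading (low variance).
imu_dist, imu_var = 0.50, 0.04
uwb_dist, uwb_var = 0.40, 0.01
fused_dist, fused_var = fuse(imu_dist, imu_var, uwb_dist, uwb_var)
print(round(fused_dist, 3), round(fused_var, 4))
```

The fused variance is always smaller than either input variance, which is why combining a drifting sensor with an occlusion-prone one can still stabilize the estimate.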
[CV-21] FedSaaS: Class-Consistency Federated Semantic Segmentation via Global Prototype Supervision and Local Adversarial Harmonization
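[Quick read]: This paper aims to address the neglect of fine-grained class relationships under domain shift in federated semantic segmentation, which causes ambiguity between class representations. The key to the solution is a new federated segmentation framework, FedSaaS, which introduces class exemplars as a criterion for both local- and global-level class representations to ensure class consistency. On the server side, uploaded class exemplars are used to model class prototypes that supervise the global branch of each client, ensuring alignment with the global-level representation. On the client side, an adversarial mechanism harmonizes the contributions of the global and local branches to produce consistent output. Multilevel contrastive losses further enforce consistency between the two levels of representation in the same semantic space.
Link: https://arxiv.org/abs/2505.09385
Authors: Xiaoyang Yu,Xiaoming Wu,Xin Wang,Dongrun Li,Ming Yang,Peng Cheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: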
Abstract:Federated semantic segmentation enables pixel-level classification in images through collaborative learning while maintaining data privacy. However, existing research commonly overlooks the fine-grained class relationships within the semantic space when addressing heterogeneous problems, particularly domain shift. This oversight results in ambiguities between class representations. To overcome this challenge, we propose a novel federated segmentation framework that achieves class consistency, termed FedSaaS. Specifically, we introduce class exemplars as a criterion for both local- and global-level class representations. On the server side, the uploaded class exemplars are leveraged to model class prototypes, which supervise the global branch of clients, ensuring alignment with global-level representation. On the client side, we incorporate an adversarial mechanism to harmonize contributions of global and local branches, leading to consistent output. Moreover, multilevel contrastive losses are employed on both sides to enforce consistency between two-level representations in the same semantic space. Extensive experiments on several driving scene segmentation datasets demonstrate that our framework outperforms state-of-the-art methods, significantly improving average segmentation accuracy and effectively addressing the class-consistency representation problem.
[CV-22] Examining Deployment and Refinement of the VIOLA-AI Intracranial Hemorrhage Model Using an Interactive NeoMedSys Platform
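[Quick read]: This paper aims to address the challenges of deploying AI tools in radiology, in particular how to deploy AI models efficiently and refine them continuously. The key to the solution is a radiology software platform called NeoMedSys, which integrates tools for deploying, testing, and optimizing AI models with a web-based medical image viewer, an annotation system, and hospital-wide radiology information systems, supporting real-time data feedback and iterative model improvement. With this platform, the research team improved the performance of their in-house AI model, VIOLA-AI, on intracranial hemorrhage detection, significantly raising diagnostic accuracy.
Link: https://arxiv.org/abs/2505.09380
Authors: Qinghui Liu,Jon Nesvold,Hanna Raaum,Elakkyen Murugesu,Martin Røvang,Bradley J MacIntosh,Atle Bjørnerud,Karoline Skogen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 11 figures, on submission to BMC Methods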
Abstract:Background: There are many challenges and opportunities in the clinical deployment of AI tools in radiology. The current study describes a radiology software platform called NeoMedSys that can enable efficient deployment and refinements of AI models. We evaluated the feasibility and effectiveness of running NeoMedSys for three months in real-world clinical settings and focused on improving the performance of an in-house developed AI model (VIOLA-AI) designed for intracranial hemorrhage (ICH) detection. Methods: NeoMedSys integrates tools for deploying, testing, and optimizing AI models with a web-based medical image viewer, annotation system, and hospital-wide radiology information systems. A pragmatic investigation was deployed using clinical cases of patients presenting to the largest Emergency Department in Norway (site-1) with suspected traumatic brain injury (TBI) or patients with suspected stroke (site-2). We assessed ICH classification performance as VIOLA-AI encountered new data and underwent pre-planned model retraining. Performance metrics included sensitivity, specificity, accuracy, and the area under the receiver operating characteristic curve (AUC). Results: NeoMedSys facilitated iterative improvements in the AI model, significantly enhancing its diagnostic accuracy. Automated bleed detection and segmentation were reviewed in near real-time to facilitate re-training VIOLA-AI. The iterative refinement process yielded a marked improvement in classification sensitivity, rising to 90.3% (from 79.2%), and specificity that reached 89.3% (from 80.7%). The bleed detection ROC analysis for the entire sample demonstrated a high area-under-the-curve (AUC) of 0.949 (from 0.873). Model refinement stages were associated with notable gains, highlighting the value of real-time radiologist feedback.
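The abstract reports sensitivity, specificity, and accuracy for ICH classification. As a reminder of how these metrics follow from a confusion matrix, here is a minimal sketch. The function name `binary_metrics` and the label lists are made up for illustration and are not data from the study.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels (1 = bleed present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # bleeds correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normals correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed bleeds
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
    }


# Illustrative labels only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Sensitivity counts how many true bleeds were caught, while specificity counts how many bleed-free scans were correctly cleared, so the two can move independently as a model is retrained.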
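[CV-23] Text-driven Motion Generation: Overview, Challenges and Directions
[Quick read]: This paper aims to address how to generate human motion directly from natural language, i.e. text-driven motion generation. Traditional motion synthesis methods usually depend on predefined motion inputs and predict future poses from observed initial sequences, often conditioned on action labels. The key contribution here is a systematic survey of modern text-to-motion generation methods, which need no predefined motion input and generate motion directly from text, organized from two complementary perspectives: model architecture (VAE-based, diffusion-based, and hybrid models) and motion representation (discrete versus continuous motion generation strategies). The paper also reviews widely used datasets, evaluation methods, and benchmarks to support progress in the field.
Link: https://arxiv.org/abs/2505.09379
Authors: Ali Rida Sahili,Najett Neji,Hedi Tabia
Affiliations: IBISC, Univ. Evry, Université Paris-Saclay
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 5 tables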
Abstract:Text-driven motion generation offers a powerful and intuitive way to create human movements directly from natural language. By removing the need for predefined motion inputs, it provides a flexible and accessible approach to controlling animated characters. This makes it especially useful in areas like virtual reality, gaming, human-computer interaction, and robotics. In this review, we first revisit the traditional perspective on motion synthesis, where models focused on predicting future poses from observed initial sequences, often conditioned on action labels. We then provide a comprehensive and structured survey of modern text-to-motion generation approaches, categorizing them from two complementary perspectives: (i) architectural, dividing methods into VAE-based, diffusion-based, and hybrid models; and (ii) motion representation, distinguishing between discrete and continuous motion generation strategies. In addition, we explore the most widely used datasets, evaluation methods, and recent benchmarks that have shaped progress in this area. With this survey, we aim to capture where the field currently stands, bring attention to its key challenges and limitations, and highlight promising directions for future exploration. We hope this work offers a valuable starting point for researchers and practitioners working to push the boundaries of language-driven human motion synthesis.
[CV-24] MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment MICCAI2025
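[Quick read]: This paper aims to address the limited effectiveness of vision-language pretraining (VLP) models in dermatological diagnosis, which stems mainly from text length constraints and the lack of structured texts. The key to the solution is the MAKE framework, which improves performance through three core mechanisms: (1) a multi-aspect contrastive learning strategy that uses large language models to decompose clinical narratives into knowledge-enhanced sub-texts; (2) a fine-grained alignment mechanism that links sub-captions to diagnostically relevant image features; and (3) a diagnosis-guided weighting scheme that adaptively prioritizes sub-captions according to a clinical significance prior.
Link: https://arxiv.org/abs/2505.09372
Authors: Siyuan Yan,Xieji Li,Ming Hu,Yiwen Jiang,Zhen Yu,Zongyuan Ge
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI2025 early acceptance; First two authors contribute equally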
Abstract:Dermatological diagnosis represents a complex multimodal challenge that requires integrating visual features with specialized clinical knowledge. While vision-language pretraining (VLP) has advanced medical AI, its effectiveness in dermatology is limited by text length constraints and the lack of structured texts. In this paper, we introduce MAKE, a Multi-Aspect Knowledge-Enhanced vision-language pretraining framework for zero-shot dermatological tasks. Recognizing that comprehensive dermatological descriptions require multiple knowledge aspects that exceed standard text constraints, our framework introduces: (1) a multi-aspect contrastive learning strategy that decomposes clinical narratives into knowledge-enhanced sub-texts through large language models, (2) a fine-grained alignment mechanism that connects sub-captions with diagnostically relevant image features, and (3) a diagnosis-guided weighting scheme that adaptively prioritizes different sub-captions based on clinical significance prior. Through pretraining on 403,563 dermatological image-text pairs collected from education resources, MAKE significantly outperforms state-of-the-art VLP models on eight datasets across zero-shot skin disease classification, concept annotation, and cross-modal retrieval tasks. Our code will be made publicly available at https://github.com/SiyuanYan1/MAKE.
[CV-25] RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo
[Quick read]: This paper attempts to address the insufficient evaluation of model robustness in existing benchmarks for optical flow, scene flow, and stereo vision algorithms, in particular the lack of quantitative analysis of resilience to real-world image corruptions such as noise or rain. The key to the solution is the RobustSpring dataset and benchmark, which applies 20 different image corruptions to the high-resolution Spring dataset in a time-, stereo-, and depth-consistent manner, producing a suite of 20,000 corrupted images, and introduces a new corruption robustness metric, enabling two-axis evaluation of both accuracy and robustness.
Link: https://arxiv.org/abs/2505.09368
Authors: Jenny Schmalfuss,Victor Oei,Lukas Mehl,Madlen Bartsch,Shashank Agnihotri,Margret Keuper,Andrés Bruhn
Affiliations: University of Stuttgart, SimTech; University of Mannheim; Max-Planck-Institute for Informatics
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Standard benchmarks for optical flow, scene flow, and stereo vision algorithms generally focus on model accuracy rather than robustness to image corruptions like noise or rain. Hence, the resilience of models to such real-world perturbations is largely unquantified. To address this, we present RobustSpring, a comprehensive dataset and benchmark for evaluating robustness to image corruptions for optical flow, scene flow, and stereo models. RobustSpring applies 20 different image corruptions, including noise, blur, color changes, quality degradations, and weather distortions, in a time-, stereo-, and depth-consistent manner to the high-resolution Spring dataset, creating a suite of 20,000 corrupted images that reflect challenging conditions. RobustSpring enables comparisons of model robustness via a new corruption robustness metric. Integration with the Spring benchmark enables public two-axis evaluations of both accuracy and robustness. We benchmark a curated selection of initial models, observing that accurate models are not necessarily robust and that robustness varies widely by corruption type. RobustSpring is a new computer vision benchmark that treats robustness as a first-class citizen to foster models that combine accuracy with resilience. It will be available at this https URL.
[CV-26] Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis CVPR2024
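[Quick read]: This paper aims to address how to use pretrained models effectively for transfer learning in data-scarce settings, in particular for dense image analysis tasks such as monocular depth estimation, surface normals prediction, and intrinsic decomposition. The key to the solution is Marigold, a family of conditional generative models with a fine-tuning protocol, which extracts knowledge from pretrained latent diffusion models such as Stable Diffusion and adapts it to dense image analysis tasks. It requires minimal modification of the original architecture, trains on small synthetic datasets on a single GPU over a few days, and achieves state-of-the-art zero-shot generalization.
Link: https://arxiv.org/abs/2505.09358
Authors: Bingxin Ke,Kevin Qu,Tianfu Wang,Nando Metzger,Shengyu Huang,Bo Li,Anton Obukhov,Konrad Schindler
Affiliations: ETH Zürich
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Journal extension of our CVPR 2024 paper, featuring new tasks, improved efficiency, high-resolution capabilities, and enhanced accessibility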
Abstract:The success of deep learning in computer vision over the past decade has hinged on large labeled datasets and strong pretrained models. In data-scarce settings, the quality of these pretrained models becomes crucial for effective transfer learning. Image classification and self-supervised learning have traditionally been the primary methods for pretraining CNNs and transformer-based architectures. Recently, the rise of text-to-image generative models, particularly those using denoising diffusion in a latent space, has introduced a new class of foundational models trained on massive, captioned image datasets. These models’ ability to generate realistic images of unseen content suggests they possess a deep understanding of the visual world. In this work, we present Marigold, a family of conditional generative models and a fine-tuning protocol that extracts the knowledge from pretrained latent diffusion models like Stable Diffusion and adapts them for dense image analysis tasks, including monocular depth estimation, surface normals prediction, and intrinsic decomposition. Marigold requires minimal modification of the pre-trained latent diffusion model’s architecture, trains with small synthetic datasets on a single GPU over a few days, and demonstrates state-of-the-art zero-shot generalization. Project page: this https URL
[CV-27] APR-Transformer: Initial Pose Estimation for Localization in Complex Environments through Absolute Pose Regression
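[Quick read]: This paper aims to address the loss of localization accuracy caused by inaccurate initial poses, particularly in GNSS-denied (Global Navigation Satellite System) environments. The key to the solution is a model architecture called APR-Transformer, which builds on state-of-the-art deep neural network methods and predicts absolute pose (3D position and 3D orientation) from either image or LiDAR data, improving the accuracy and robustness of estimating complex spatial relationships and orientations.
Link: https://arxiv.org/abs/2505.09356
Authors: Srinivas Ravuri(1),Yuan Xu(1),Martin Ludwig Zehetner(2),Ketan Motlag(1),Sahin Albayrak(1) ((1) Technische Universität Berlin, Berlin, Germany (2) Forschungszentrum Informatik, Berlin, Germany)
Affiliations: Technische Universität Berlin
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages with 6 figures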
Abstract:Precise initialization plays a critical role in the performance of localization algorithms, especially in the context of robotics, autonomous driving, and computer vision. Poor localization accuracy is often a consequence of inaccurate initial poses, particularly noticeable in GNSS-denied environments where GPS signals are primarily relied upon for initialization. Recent advances in leveraging deep neural networks for pose regression have led to significant improvements in both accuracy and robustness, especially in estimating complex spatial relationships and orientations. In this paper, we introduce APR-Transformer, a model architecture inspired by state-of-the-art methods, which predicts absolute pose (3D position and 3D orientation) using either image or LiDAR data. We demonstrate that our proposed method achieves state-of-the-art performance on established benchmark datasets such as the Radar Oxford Robot-Car and DeepLoc datasets. Furthermore, we extend our experiments to include our custom complex APR-BeIntelli dataset. Additionally, we validate the reliability of our approach in GNSS-denied environments by deploying the model in real-time on an autonomous test vehicle. This showcases the practical feasibility and effectiveness of our approach. The source code is available at:this https URL.
[CV-28] GreenFactory: Ensembling Zero-Cost Proxies to Estimate Performance of Neural Networks
[Quick read]: This paper aims to address the large time and resource cost of training and evaluating every network during Neural Architecture Search (NAS). The authors propose GreenFactory, whose key idea is to ensemble multiple zero-cost proxies and use a random forest regressor to combine the strengths of the individual predictors, directly predicting a model's test accuracy rather than only a relative ranking, which improves prediction accuracy and generalization.
Link: https://arxiv.org/abs/2505.09344
Authors: Gabriel Cortês,Nuno Lourenço,Paolo Romano,Penousal Machado
Affiliations: University of Coimbra, CISUC/LASI, DEI; INESC-ID & Instituto Superior Técnico, Universidade de Lisboa
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Determining the performance of a Deep Neural Network during Neural Architecture Search processes is essential for identifying optimal architectures and hyperparameters. Traditionally, this process requires training and evaluation of each network, which is time-consuming and resource-intensive. Zero-cost proxies estimate performance without training, serving as an alternative to traditional training. However, recent proxies often lack generalization across diverse scenarios and provide only relative rankings rather than predicted accuracies. To address these limitations, we propose GreenFactory, an ensemble of zero-cost proxies that leverages a random forest regressor to combine multiple predictors’ strengths and directly predict model test accuracy. We evaluate GreenFactory on NATS-Bench, achieving robust results across multiple datasets. Specifically, GreenFactory achieves high Kendall correlations on NATS-Bench-SSS, indicating substantial agreement between its predicted scores and actual performance: 0.907 for CIFAR-10, 0.945 for CIFAR-100, and 0.920 for ImageNet-16-120. Similarly, on NATS-Bench-TSS, we achieve correlations of 0.921 for CIFAR-10, 0.929 for CIFAR-100, and 0.908 for ImageNet-16-120, showcasing its reliability in both search spaces.
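The abstract measures agreement between predicted and actual accuracies with Kendall correlation. A minimal sketch of that statistic follows, assuming the simple tau-a variant (the abstract does not say which variant the authors use). The function name `kendall_tau` and the accuracy values are made up for illustration.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both rankings order this pair the same way
        elif sign < 0:
            discordant += 1   # the rankings disagree on this pair
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Illustrative values only: predicted vs. measured test accuracy of 5 networks.
predicted = [0.71, 0.65, 0.90, 0.80, 0.55]
actual = [0.70, 0.68, 0.88, 0.91, 0.50]
tau = kendall_tau(predicted, actual)
print(tau)
```

In this toy example only one of the ten pairs is ordered differently by the two lists, so the correlation is high but below 1.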
[CV-29] Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition
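[Quick read]: This paper aims to address unsupervised contrastive multiview representation learning of emotions from 3D/4D data, in particular how to model facial emotions effectively without explicit annotations. The key to the solution is a joint embedding space in which pseudo-labels guide the implicit alignment of emotional semantics, combined with a novel multiview contrastive learning strategy that uses stable positive-negative pair sampling to improve discriminability, and a gradient-friendly loss function for smoother and more stable convergence.
Link: https://arxiv.org/abs/2505.09336
Authors: Muzammil Behzad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: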
Abstract:In this paper, we introduce MultiviewVLM, a vision-language model designed for unsupervised contrastive multiview representation learning of facial emotions from 3D/4D data. Our architecture integrates pseudo-labels derived from generated textual prompts to guide implicit alignment of emotional semantics. To capture shared information across multi-views, we propose a joint embedding space that aligns multiview representations without requiring explicit supervision. We further enhance the discriminability of our model through a novel multiview contrastive learning strategy that leverages stable positive-negative pair sampling. A gradient-friendly loss function is introduced to promote smoother and more stable convergence, and the model is optimized for distributed training to ensure scalability. Extensive experiments demonstrate that MultiviewVLM outperforms existing state-of-the-art methods and can be easily adapted to various real-world applications with minimal modifications.
[CV-30] BioVFM-21M: Benchmarking and Scaling Self-Supervised Vision Foundation Models for Biomedical Image Analysis
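[Quick read]: This paper attempts to address the fact that the key factors in building scalable medical vision foundation models remain unclear, owing to limited understanding of scaling behavior in the medical domain. The key to the solution is to explore, via self-supervised learning, the scaling behavior across model sizes, training algorithms, data sizes, and imaging modalities, and to introduce BioVFM-21M, a large-scale biomedical image dataset covering a wide range of imaging modalities and anatomies, to support scalable pretraining.
Link: https://arxiv.org/abs/2505.09329
Authors: Jiarun Liu,Hong-Yu Zhou,Weijian Huang,Hao Yang,Dongning Song,Tao Tan,Yong Liang,Shanshan Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures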
Abstract:Scaling up model and data size has demonstrated impressive performance improvement over a wide range of tasks. Despite extensive studies on scaling behaviors for general-purpose tasks, medical images exhibit substantial differences from natural data. The key factors in developing medical vision foundation models at scale remain unclear due to the absence of an extensive understanding of scaling behavior in the medical domain. In this paper, we explored the scaling behavior across model sizes, training algorithms, data sizes, and imaging modalities in developing scalable medical vision foundation models by self-supervised learning. To support scalable pretraining, we introduce BioVFM-21M, a large-scale biomedical image dataset encompassing a wide range of biomedical image modalities and anatomies. We observed that scaling up does provide benefits but varies across tasks. Additional analysis reveals several factors correlated with scaling benefits. Finally, we propose BioVFM, a large-scale medical vision foundation model pretrained on 21 million biomedical images, which outperforms the previous state-of-the-art foundation models across 12 medical benchmarks. Our results highlight that while scaling up is beneficial for pursuing better performance, task characteristics, data diversity, pretraining methods, and computational efficiency remain critical considerations for developing scalable medical foundation models.
[CV-31] Neural Video Compression using 2D Gaussian Splatting
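[Quick read]: This paper attempts to address the high computational demands that limit the use of neural video codecs (NVC) in real-time applications, especially scenarios such as video conferencing. The key to the solution is a region-of-interest (ROI) based neural video compression model that uses 2D Gaussian Splatting. With a content-aware initialization strategy and a novel Gaussian inter-frame redundancy-reduction mechanism, it speeds up the encoding time of the previous Gaussian splatting-based image codec by 88%, yielding the first Gaussian splatting-based video codec solution.
Link: https://arxiv.org/abs/2505.09324
Authors: Lakshya Gupta,Imran N. Junejo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 9 pages, 8 figures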
Abstract:The computer vision and image processing research community has been involved in standardizing video data communications for many decades, leading to standards such as AVC, HEVC, VVC, AV1, AV2, etc. However, recent groundbreaking works have focused on employing deep learning-based techniques to replace the traditional video codec pipeline to a greater effect. Neural video codecs (NVC) create an end-to-end ML-based solution that does not rely on any handcrafted features (motion or edge-based) and have the ability to learn content-aware compression strategies, offering better adaptability and higher compression efficiency than traditional methods. This holds a great potential not only for hardware design, but also for various video streaming platforms and applications, especially video conferencing applications such as MS-Teams or Zoom that have found extensive usage in classrooms and workplaces. However, their high computational demands currently limit their use in real-time applications like video conferencing. To address this, we propose a region-of-interest (ROI) based neural video compression model that leverages 2D Gaussian Splatting. Unlike traditional codecs, 2D Gaussian Splatting is capable of real-time decoding and can be optimized using fewer data points, requiring only thousands of Gaussians for decent quality outputs as opposed to millions in 3D scenes. In this work, we designed a video pipeline that speeds up the encoding time of the previous Gaussian splatting-based image codec by 88% by using a content-aware initialization strategy paired with a novel Gaussian inter-frame redundancy-reduction mechanism, enabling Gaussian splatting to be used for a video-codec solution, the first solution of its kind in this neural video codec space.
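[CV-32] TransDiffuser: End-to-end Trajectory Generation with Decorrelated Multi-modal Representation for Autonomous Driving
[Quick read]: This paper aims to address diverse, high-quality trajectory generation for planning in autonomous driving systems, in particular the challenge of mode collapse. The key to the solution is TransDiffuser, an encoder-decoder based generative trajectory planning model that feeds scene information to the denoising decoder as multi-modal conditional input and introduces a simple yet effective multi-modal representation decorrelation optimization mechanism. Without relying on anchor trajectories, it reaches a PDMS of 94.85 on the NAVSIM benchmark, surpassing previous state-of-the-art methods.
Link: https://arxiv.org/abs/2505.09315
Authors: Xuefeng Jiang,Yuan Ma,Pengxiang Li,Leimeng Xu,Xin Wen,Kun Zhan,Zhongpu Xia,Peng Jia,XianPeng Lang,Sheng Sun
Affiliations: LiAuto; Institute of Computing Technology, Chinese Academy of Sciences; Tsinghua University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Under review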
Abstract:In recent years, diffusion model has shown its potential across diverse domains from vision generation to language modeling. Transferring its capabilities to modern autonomous driving systems has also emerged as a promising direction. In this work, we propose TransDiffuser, an encoder-decoder based generative trajectory planning model for end-to-end autonomous driving. The encoded scene information serves as the multi-modal conditional input of the denoising decoder. To tackle the mode collapse dilemma in generating high-quality diverse trajectories, we introduce a simple yet effective multi-modal representation decorrelation optimization mechanism during training. TransDiffuser achieves PDMS of 94.85 on the NAVSIM benchmark, surpassing previous state-of-the-art methods without any anchor-based prior trajectories.
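The abstract mentions a representation decorrelation mechanism but gives no formula. One common way to express decorrelation, shown here purely as an assumption and not as the authors' loss, is to penalize the off-diagonal entries of the correlation matrix between feature dimensions across a batch. The function name `decorrelation_penalty` and the toy batches are hypothetical.

```python
from statistics import mean, pstdev


def decorrelation_penalty(batch):
    """Mean squared off-diagonal correlation between feature dimensions.

    batch: list of equal-length feature vectors (one row per sample).
    Returns 0.0 for uncorrelated dimensions and about 1.0 for fully
    correlated ones. Assumed form for illustration only.
    """
    columns = list(zip(*batch))  # one tuple per feature dimension
    d, n = len(columns), len(batch)
    if d < 2:
        raise ValueError("need at least two feature dimensions")
    standardized = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        if sigma == 0:
            raise ValueError("constant feature dimension")
        standardized.append([(v - mu) / sigma for v in col])
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                corr = sum(a * b for a, b in zip(standardized[i], standardized[j])) / n
                total += corr ** 2
    return total / (d * (d - 1))


# Toy batches: redundant dimensions vs. independent ones.
redundant = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
independent = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
print(decorrelation_penalty(redundant), decorrelation_penalty(independent))
```

Adding a term like this to a training objective discourages feature dimensions from carrying the same information, which is one plausible route to more diverse outputs.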
[CV-33] Predicting butterfly species presence from satellite imagery using soft contrastive regularisation CVPR
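[Quick read]: This paper attempts to address how to predict species biodiversity accurately from remote sensing data, specifically the distribution of butterfly species. The key to the solution is a new dataset for predicting butterfly species presence in the United Kingdom, together with a soft, supervised contrastive regularisation loss tailored to probabilistic labels (such as species-presence data), which improves the accuracy of multi-species presence prediction.
Link: https://arxiv.org/abs/2505.09306
Authors: Thijs L van der Plas,Stephen Law,Michael JO Pocock
Affiliations: The Alan Turing Institute, the UK; Wageningen University & Research, the NL; University College London, the UK; UK Centre for Ecology & Hydrology, the UK
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: To be published in the 2025 CVPR FGVC12 workshop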
Abstract:The growing demand for scalable biodiversity monitoring methods has fuelled interest in remote sensing data, due to its widespread availability and extensive coverage. Traditionally, the application of remote sensing to biodiversity research has focused on mapping and monitoring habitats, but with increasing availability of large-scale citizen-science wildlife observation data, recent methods have started to explore predicting multi-species presence directly from satellite images. This paper presents a new data set for predicting butterfly species presence from satellite data in the United Kingdom. We experimentally optimise a Resnet-based model to predict multi-species presence from 4-band satellite images, and find that this model especially outperforms the mean rate baseline for locations with high species biodiversity. To improve performance, we develop a soft, supervised contrastive regularisation loss that is tailored to probabilistic labels (such as species-presence data), and demonstrate that this improves prediction accuracy. In summary, our new data set and contrastive regularisation method contribute to the open challenge of accurately predicting species biodiversity from remote sensing data, which is key for efficient biodiversity monitoring.
[CV-34] Recent Advances in Medical Imaging Segmentation: A Survey
[Quick read]: This paper aims to address several challenges in medical image segmentation, including data accessibility, annotation complexity, structural variability, variation across medical imaging modalities, and privacy constraints. Its key contribution is a survey of cutting-edge methods such as Generative AI, Few-Shot Learning, Foundation Models, and Universal Models, which offer promising solutions to long-standing segmentation challenges.
Link: https://arxiv.org/abs/2505.09274
Authors: Fares Bougourzi,Abdenour Hadid
Affiliations: Junia, UMR 8520, CNRS, Centrale Lille, University of Polytechnique Hauts-de-France; Sorbonne Center for Artificial Intelligence (SCAI)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Medical imaging is a cornerstone of modern healthcare, driving advancements in diagnosis, treatment planning, and patient care. Among its various tasks, segmentation remains one of the most challenging problems due to factors such as data accessibility, annotation complexity, structural variability, variation in medical imaging modalities, and privacy constraints. Despite recent progress, achieving robust generalization and domain adaptation remains a significant hurdle, particularly given the resource-intensive nature of some proposed models and their reliance on domain expertise. This survey explores cutting-edge advancements in medical image segmentation, focusing on methodologies such as Generative AI, Few-Shot Learning, Foundation Models, and Universal Models. These approaches offer promising solutions to longstanding challenges. We provide a comprehensive overview of the theoretical foundations, state-of-the-art techniques, and recent applications of these methods. Finally, we discuss inherent limitations, unresolved issues, and future research directions aimed at enhancing the practicality and accessibility of segmentation models in medical imaging. We are maintaining a GitHub Repository (this https URL) to continue tracking and updating innovations in this field.
[CV-35] MetaUAS: Universal Anomaly Segmentation with One-Prompt Meta-Learning NEURIPS2024
Quick read: This paper addresses the limitations of zero- and few-shot visual anomaly segmentation methods that depend on vision-language models with manually designed textual prompts, limitations that stem from the inherent independence of visual representations from language. The key to the solution is a paradigm built on a pure visual foundation model that unifies anomaly segmentation into change segmentation and trains on large-scale synthetic image pairs featuring object-level and local-region changes that are independent of the target anomaly datasets. The work also introduces a one-prompt meta-learning framework (MetaUAS) driven by a single normal image prompt, with a soft feature alignment module that handles geometric variations between prompt and query images, enabling universal anomaly segmentation without language guidance.
Link: https://arxiv.org/abs/2505.09265
Authors: Bin-Bin Gao
Affiliations: Tencent YouTu Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS 2024
Abstract:Zero- and few-shot visual anomaly segmentation relies on powerful vision-language models that detect unseen anomalies using manually designed textual prompts. However, visual representations are inherently independent of language. In this paper, we explore the potential of a pure visual foundation model as an alternative to widely used vision-language models for universal visual anomaly segmentation. We present a novel paradigm that unifies anomaly segmentation into change segmentation. This paradigm enables us to leverage large-scale synthetic image pairs, featuring object-level and local region changes, derived from existing image datasets, which are independent of target anomaly datasets. We propose a one-prompt Meta-learning framework for Universal Anomaly Segmentation (MetaUAS) that is trained on this synthetic dataset and then generalizes well to segment any novel or unseen visual anomalies in the real world. To handle geometrical variations between prompt and query images, we propose a soft feature alignment module that bridges paired-image change perception and single-image semantic segmentation. This is the first work to achieve universal anomaly segmentation using a pure vision model without relying on special anomaly detection datasets and pre-trained visual-language models. Our method effectively and efficiently segments any anomalies with only one normal image prompt, and it is training-free and requires no guidance from language. Our MetaUAS significantly outperforms previous zero-shot, few-shot, and even full-shot anomaly segmentation methods. The code and pre-trained models are available at this https URL.
[CV-36] Learning to Detect Multi-class Anomalies with Just One Normal Image Prompt ECCV2024
Quick read: This paper targets two problems in multi-class (unified) anomaly detection. First, self-attention reconstruction models rely only on target features, so both normal and anomalous features can be reconstructed perfectly and anomalies go undetected. Second, reconstruction in a low-spatial-resolution latent space yields inaccurate anomaly segmentation. The key to the solution is a simple yet effective strategy, One Normal Image Prompt (OneNIP), which reconstructs normal features and restores anomalous features using just one normal image prompt, improving unified anomaly detection. A supervised refiner that regresses reconstruction errors using real normal images and synthesized anomalous images further improves pixel-level anomaly segmentation significantly.
Link: https://arxiv.org/abs/2505.09264
Authors: Bin-Bin Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ECCV 2024
Abstract:Unsupervised reconstruction networks using self-attention transformers have achieved state-of-the-art performance for multi-class (unified) anomaly detection with a single model. However, these self-attention reconstruction models primarily operate on target features, which may result in perfect reconstruction for both normal and anomaly features due to high consistency with context, leading to failure in detecting anomalies. Additionally, these models often produce inaccurate anomaly segmentation due to performing reconstruction in a low spatial resolution latent space. To let reconstruction models retain high efficiency while enhancing their generalization for unified anomaly detection, we propose a simple yet effective method that reconstructs normal features and restores anomaly features with just One Normal Image Prompt (OneNIP). In contrast to previous work, OneNIP makes it possible for the first time to reconstruct or restore anomalies with just one normal image prompt, effectively boosting unified anomaly detection performance. Furthermore, we propose a supervised refiner that regresses reconstruction errors by using both real normal and synthesized anomalous images, which significantly improves pixel-level anomaly segmentation. OneNIP outperforms previous methods on three industry anomaly detection benchmarks: MVTec, BTAD, and VisA. The code and pre-trained models are available at this https URL.
[CV-37] Few-Shot Anomaly-Driven Generation for Anomaly Classification and Segmentation ECCV2024
Quick read: This paper addresses the limited anomaly detection performance caused by the scarcity of anomaly samples in industrial inspection. Existing methods synthesize anomalies from noise or external data, but a large semantic gap between synthetic and real anomalies hurts detection. The key to the solution is a few-shot Anomaly-driven Generation (AnoGen) method, which uses a small number of real anomalies to guide a diffusion model to generate realistic and diverse anomalies, thereby improving the training of anomaly detection models.
Link: https://arxiv.org/abs/2505.09263
Authors: Guan Gui,Bin-Bin Gao,Jun Liu,Chengjie Wang,Yunsheng Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ECCV 2024
Abstract:Anomaly detection is a practical and challenging task due to the scarcity of anomaly samples in industrial inspection. Some existing anomaly detection methods address this issue by synthesizing anomalies with noise or external data. However, there is always a large semantic gap between synthetic and real-world anomalies, resulting in weak performance in anomaly detection. To solve the problem, we propose a few-shot Anomaly-driven Generation (AnoGen) method, which guides the diffusion model to generate realistic and diverse anomalies with only a few real anomalies, thereby benefiting training anomaly detection models. Specifically, our work is divided into three stages. In the first stage, we learn the anomaly distribution based on a few given real anomalies and inject the learned knowledge into an embedding. In the second stage, we use the embedding and given bounding boxes to guide the diffusion model to generate realistic and diverse anomalies on specific objects (or textures). In the final stage, we propose a weakly-supervised anomaly detection method to train a more powerful model with generated anomalies. Our method builds upon DRAEM and DesTSeg as the foundation model and conducts experiments on the commonly used industrial anomaly detection dataset, MVTec. The experiments demonstrate that our generated anomalies effectively improve the model performance of both anomaly classification and segmentation tasks simultaneously, e.g., DRAEM and DesTSeg achieved a 5.8% and 1.5% improvement in AU-PR metric on segmentation task, respectively. The code and generated anomalous data are available at this https URL.
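[CV-38] Test-Time Augmentation for Pose-invariant Face Recognition
Quick read: This paper aims to improve face recognition performance by augmenting head poses at test time. Existing methods typically rely on training with frontalised images or on learning pose-invariant representations, both of which usually require re-training and testing for each dataset at considerable effort. In contrast, the proposed Pose-TTA aligns faces at inference time without additional training. A portrait animator transfers the identity of the source image into the pose of a driving image, generating matching side-profile images that reduce identity information loss, and a weighted feature aggregation strategy mitigates distortions or biases introduced by the synthetic data, making the augmented images more reliable.
Link: https://arxiv.org/abs/2505.09256
Authors: Jaemin Jung,Youngjoon Jang,Joon Son Chung
Affiliations: Korea Advanced Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: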
Abstract:The goal of this paper is to enhance face recognition performance by augmenting head poses during the testing phase. Existing methods often rely on training on frontalised images or learning pose-invariant representations, yet both approaches typically require re-training and testing for each dataset, involving a substantial amount of effort. In contrast, this study proposes Pose-TTA, a novel approach that aligns faces at inference time without additional training. To achieve this, we employ a portrait animator that transfers the source image identity into the pose of a driving image. Instead of frontalising a side-profile face – which can introduce distortion – Pose-TTA generates matching side-profile images for comparison, thereby reducing identity information loss. Furthermore, we propose a weighted feature aggregation strategy to address any distortions or biases arising from the synthetic data, thus enhancing the reliability of the augmented images. Extensive experiments on diverse datasets and with various pre-trained face recognition models demonstrate that Pose-TTA consistently improves inference performance. Moreover, our method is straightforward to integrate into existing face recognition pipelines, as it requires no retraining or fine-tuning of the underlying recognition models.
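As a rough illustration of weighted feature aggregation, the sketch below fuses several face embeddings (for example, one from the original image and others from pose-augmented views) into a single unit-length vector. The abstract does not state how Pose-TTA computes its weights, so the `weights` argument, the function names, and the normalise-then-average scheme are assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def aggregate_features(features, weights):
    """Weighted mean of unit-normalised embeddings, re-normalised to unit length.

    `weights` are hypothetical per-view reliabilities (non-negative, not all zero);
    they are an assumption of this sketch, not the paper's weighting rule.
    """
    if not features or len(features) != len(weights):
        raise ValueError("need one weight per feature vector")
    total = float(sum(weights))
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    dim = len(features[0])
    fused = [0.0] * dim
    for feat, w in zip(features, weights):
        unit = l2_normalize(feat)
        for i in range(dim):
            fused[i] += (w / total) * unit[i]
    return l2_normalize(fused)


# Toy example: the original view is trusted more than two synthetic views.
views = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
fused = aggregate_features(views, weights=[2.0, 1.0, 1.0])
```

Giving the original image a larger weight than the synthetic views is one way such a scheme could limit the influence of distorted augmentations.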
[CV-39] Zero-Shot Multi-modal Large Language Model v.s. Supervised Deep Learning: A Comparative Study on CT-Based Intracranial Hemorrhage Subtyping
Quick read: This paper addresses the timely identification of intracranial hemorrhage (ICH) subtypes on non-contrast computed tomography (NCCT), which is critical for prognosis prediction and therapeutic decision-making but challenging because of low contrast and blurred boundaries. The core of the study is an evaluation of how zero-shot multi-modal large language models (MLLMs) compare with traditional deep learning methods on ICH binary classification and subtype classification, with carefully designed prompts guiding the MLLMs through the tasks.
Link: https://arxiv.org/abs/2505.09252
Authors: Yinuo Wang,Yue Zeng,Kai Chen,Cai Meng,Chao Pan,Zhouping Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Introduction: Timely identification of intracranial hemorrhage (ICH) subtypes on non-contrast computed tomography is critical for prognosis prediction and therapeutic decision-making, yet remains challenging due to low contrast and blurred boundaries. This study evaluates the performance of zero-shot multi-modal large language models (MLLMs) compared to traditional deep learning methods in ICH binary classification and subtyping. Methods: We utilized a dataset provided by RSNA, comprising 192 NCCT volumes. The study compares various MLLMs, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet V2, with conventional deep learning models, including ResNet50 and Vision Transformer. Carefully crafted prompts were used to guide MLLMs in tasks such as ICH presence, subtype classification, localization, and volume estimation. Results: The results indicate that in the ICH binary classification task, traditional deep learning models outperform MLLMs comprehensively. For subtype classification, MLLMs also exhibit inferior performance compared to traditional deep learning models, with Gemini 2.0 Flash achieving a macro-averaged precision of 0.41 and a macro-averaged F1 score of 0.31. Conclusion: While MLLMs excel in interactive capabilities, their overall accuracy in ICH subtyping is inferior to deep networks. However, MLLMs enhance interpretability through language interactions, indicating potential in medical imaging analysis. Future efforts will focus on model refinement and developing more precise MLLMs to improve performance in three-dimensional medical image processing.
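For readers unfamiliar with macro-averaging, the minimal sketch below shows how macro-averaged precision and F1 scores of the kind reported above are typically computed: each metric is calculated per class, and the per-class values are averaged with equal weight regardless of class frequency. The function name and the toy labels are illustrative and are not taken from the paper.

```python
def macro_precision_f1(y_true, y_pred, classes):
    """Return (macro precision, macro F1) over the given classes."""
    precisions, f1_scores = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Undefined ratios are treated as 0.0, a common convention.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        precisions.append(precision)
        f1_scores.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(f1_scores) / n


# Toy labels for two hypothetical classes "a" and "b".
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
macro_p, macro_f1 = macro_precision_f1(y_true, y_pred, ["a", "b"])
```

Because every class counts equally, a model that performs poorly on rare classes is penalised more under macro-averaging than under accuracy or micro-averaging.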
[CV-40] A Surrogate Model for the Forward Design of Multi-layered Metasurface-based Radar Absorbing Structures
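Quick read: This paper addresses the high computational cost, long run times, and large design spaces that conventional electromagnetic design and optimization methods face for multi-layered metasurface-based radar absorbing structures (RAS). The key to the solution is a surrogate model that uses a convolutional neural network (CNN) architecture with a Huber loss function to significantly accelerate the prediction of electromagnetic responses, greatly reducing computation time while maintaining high predictive accuracy.
Link: https://arxiv.org/abs/2505.09251
Authors: Vineetha Joy,Aditya Anand,Nidhi,Anshuman Kumar,Amit Sethi,Hema Singh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: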
Abstract:Metasurface-based radar absorbing structures (RAS) are highly preferred for applications like stealth technology, electromagnetic (EM) shielding, etc. due to their capability to achieve frequency selective absorption characteristics with minimal thickness and reduced weight penalty. However, the conventional approach for the EM design and optimization of these structures relies on forward simulations, using full wave simulation tools, to predict the electromagnetic (EM) response of candidate meta atoms. This process is computationally intensive, extremely time consuming and requires exploration of large design spaces. To overcome this challenge, we propose a surrogate model that significantly accelerates the prediction of EM responses of multi-layered metasurface-based RAS. A convolutional neural network (CNN) based architecture with Huber loss function has been employed to estimate the reflection characteristics of the RAS model. The proposed model achieved a cosine similarity of 99.9% and a mean square error of 0.001 within 1000 epochs of training. The efficiency of the model has been established via full wave simulations as well as experiment where it demonstrated significant reduction in computational time while maintaining high predictive accuracy.
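The training loss and the two evaluation metrics named in the abstract can be written down in a few lines. The sketch below is a plain-Python illustration, not the authors' implementation; in particular, the Huber threshold `delta=1.0` is an assumed default, since the abstract does not report the value used.

```python
import math


def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err * err
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)


def mean_squared_error(pred, target):
    """Average of squared element-wise differences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

The linear tail of the Huber loss makes training less sensitive to occasional large errors than a pure squared-error loss, which is one plausible reason to prefer it when predicting reflection curves.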
[CV-41] PDE: Gene Effect Inspired Parameter Dynamic Evolution for Low-light Image Enhancement
Quick read: This paper addresses a performance limitation in low-light image enhancement (LLIE) caused by static parameters, termed the “gene effect”: randomly initialized parameters can sometimes outperform learned ones, which limits further performance gains. The key to the solution is Parameter Dynamic Evolution (PDE), which uses a parameter orthogonal generation technique to simulate gene recombination and mutation, allowing the model to adapt to different image content and mitigate the gene effect.
Link: https://arxiv.org/abs/2505.09196
Authors: Tong Li,Lizhi Wang,Hansen Feng,Lin Zhu,Hua Huang
Affiliations: Beijing Institute of Technology; Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 9 tables, 9 figures
Abstract:Low-light image enhancement (LLIE) is a fundamental task in computational photography, aiming to improve illumination, reduce noise, and enhance image quality. While recent advancements focus on designing increasingly complex neural network models, we observe a peculiar phenomenon: resetting certain parameters to random values unexpectedly improves enhancement performance for some images. Drawing inspiration from biological genes, we term this phenomenon the gene effect. The gene effect limits enhancement performance, as even random parameters can sometimes outperform learned ones, preventing models from fully utilizing their capacity. In this paper, we investigate the reason and propose a solution. Based on our observations, we attribute the gene effect to static parameters, analogous to how fixed genetic configurations become maladaptive when environments change. Inspired by biological evolution, where adaptation to new environments relies on gene mutation and recombination, we propose parameter dynamic evolution (PDE) to adapt to different images and mitigate the gene effect. PDE employs a parameter orthogonal generation technique and the corresponding generated parameters to simulate gene recombination and gene mutation, respectively. Experiments validate the effectiveness of our techniques. The code will be released to the public.
[CV-42] Zero-shot Quantization: A Comprehensive Survey IJCAI2025
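Quick read: This paper addresses the reliance of traditional network quantization methods on training data, which makes them impractical under privacy, security, or regulatory constraints. The key to the solution is Zero-shot Quantization (ZSQ), which quantizes a model without requiring any real data and thus overcomes restricted data access.
Link: https://arxiv.org/abs/2505.09188
Authors: Minjun Kim,Jaehyeon Choi,Jongkeun Lee,Wonjin Cho,U Kang
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IJCAI 2025 Survey Track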
Abstract:Network quantization has proven to be a powerful approach to reduce the memory and computational demands of deep learning models for deployment on resource-constrained devices. However, traditional quantization methods often rely on access to training data, which is impractical in many real-world scenarios due to privacy, security, or regulatory constraints. Zero-shot Quantization (ZSQ) emerges as a promising solution, achieving quantization without requiring any real data. In this paper, we provide a comprehensive overview of ZSQ methods and their recent advancements. First, we provide a formal definition of the ZSQ problem and highlight the key challenges. Then, we categorize the existing ZSQ methods into classes based on data generation strategies, and analyze their motivations, core ideas, and key takeaways. Lastly, we suggest future research directions to address the remaining limitations and advance the field of ZSQ. To the best of our knowledge, this paper is the first in-depth survey on ZSQ.
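To make "network quantization" concrete, the sketch below shows generic uniform affine quantization of a list of weights to 8-bit integers and back. It illustrates only the quantization step itself; it is not a ZSQ method, since ZSQ methods differ in how they obtain calibration data without real samples, which this sketch does not cover. The function names and the choice of an unsigned 8-bit range are assumptions made for illustration.

```python
def quantize(values, num_bits=8):
    """Uniform affine quantization to unsigned integers in [0, 2**num_bits - 1].

    Returns (quantized integers, scale, zero_point).
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = int(round(qmin - lo / scale))
    quantized = [
        max(qmin, min(qmax, int(round(v / scale)) + zero_point))
        for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map integers back to approximate real values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
```

Each restored value differs from the original by at most about one quantization step (`scale`), which is the accuracy cost traded for the smaller memory footprint.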
[CV-43] UniCAD: Efficient and Extendable Architecture for Multi-Task Computer-Aided Diagnosis System
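Quick read: This paper addresses the challenges of developing and deploying multi-task computer-aided diagnosis (CAD) systems, including high model complexity, heavy resource consumption, and the lack of an open-source CAD platform in the medical imaging community. The key to the solution is UniCAD, a unified architecture that leverages pre-trained vision foundation models to handle both 2D and 3D medical images seamlessly while requiring only a small number of task-specific parameters. The architecture introduces two key innovations: a low-rank adaptation strategy for efficiency, and a modular plug-and-play architecture that supports diverse tasks and functional expansion.
Link: https://arxiv.org/abs/2505.09178
Authors: Yitao Zhu,Yuan Yin,Zhenrong Shen,Zihao Zhao,Haiyu Song,Sheng Wang,Dinggang Shen,Qian Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages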
Abstract:The growing complexity and scale of visual model pre-training have made developing and deploying multi-task computer-aided diagnosis (CAD) systems increasingly challenging and resource-intensive. Furthermore, the medical imaging community lacks an open-source CAD platform to enable the rapid creation of efficient and extendable diagnostic models. To address these issues, we propose UniCAD, a unified architecture that leverages the robust capabilities of pre-trained vision foundation models to seamlessly handle both 2D and 3D medical images while requiring only minimal task-specific parameters. UniCAD introduces two key innovations: (1) Efficiency: A low-rank adaptation strategy is employed to adapt a pre-trained visual model to the medical image domain, achieving performance on par with fully fine-tuned counterparts while introducing only 0.17% trainable parameters. (2) Plug-and-Play: A modular architecture that combines a frozen foundation model with multiple plug-and-play experts, enabling diverse tasks and seamless functionality expansion. Building on this unified CAD architecture, we establish an open-source platform where researchers can share and access lightweight CAD experts, fostering a more equitable and efficient research ecosystem. Comprehensive experiments across 12 diverse medical datasets demonstrate that UniCAD consistently outperforms existing methods in both accuracy and deployment efficiency. The source code and project page are available at this https URL.
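The idea behind low-rank adaptation can be sketched in a few lines: a frozen weight matrix `W` is left untouched, and a trainable update is expressed as the product of two small matrices `B` and `A` of rank `r`. The sketch below is a generic illustration of that idea, not UniCAD's code; the layer dimensions, rank, and function names are hypothetical, and the resulting fraction is not meant to reproduce the 0.17% figure quoted in the abstract.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x); W is frozen, A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_fraction(d_in, d_out, rank):
    """Share of trainable parameters for one adapted d_out x d_in layer."""
    frozen = d_in * d_out
    trainable = rank * (d_in + d_out)
    return trainable / (frozen + trainable)


# Tiny example: 2x2 frozen weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # shape (rank, d_in)
B = [[0.5], [0.0]]      # shape (d_out, rank)
y = lora_forward(W, A, B, [2.0, 3.0])

# Hypothetical 768x768 layer adapted with rank 4.
fraction = lora_param_fraction(768, 768, 4)
```

Because the trainable count grows as `rank * (d_in + d_out)` rather than `d_in * d_out`, the adapter stays small even for large layers, which is what keeps the share of trainable parameters low.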
[CV-44] Optimizing Urban Critical Green Space Development Using Machine Learning
Quick read: This paper addresses how to prioritize urban green space development in order to improve the ecological environment and residents’ quality of life in Tehran. The key to the solution is a comprehensive framework that combines socio-economic, environmental, and sensitivity indices and uses a Random Forest (RF) machine learning model for vegetation cover classification and green space development prioritization. The framework is then validated through microclimate simulation, giving urban planners scientific decision support.
Link: https://arxiv.org/abs/2505.09175
Authors: Mohammad Ganjirad,Mahmoud Reza Delavar,Hossein Bagheri,Mohammad Mehdi Azizi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper presents a novel framework for prioritizing urban green space development in Tehran using diverse socio-economic, environmental, and sensitivity indices. The indices were derived from various sources including Google Earth Engine, air pollution measurements, municipal reports and the Weather Research and Forecasting (WRF) model. The WRF model was used to estimate the air temperature at a 1 km resolution due to insufficient meteorological stations, yielding RMSE and MAE values of 0.96°C and 0.92°C, respectively. After data preparation, several machine learning models were used for binary vegetation cover classification including XGBoost, LightGBM, Random Forest (RF) and Extra Trees. RF achieved the highest performance, exceeding 94% in Overall Accuracy, Recall, and F1-score. Then, the probability of areas lacking vegetation cover was assessed using socio-economic, environmental and sensitivity indices. This resulted in the RF generating an urban green space development prioritization map. Feature Importance Analysis revealed that the most significant indices were nightly land surface temperature (LST) and sensitive population. Finally, the framework performance was validated through microclimate simulation to compare the critical areas before and after green space development with green roofs. The simulation showed that green roof technology reduced air temperature in critical areas by up to 0.67°C. As a result, this framework provides a valuable tool for urban planners to develop green spaces.
[CV-45] DRRNet: Macro-Micro Feature Fusion and Dual Reverse Refinement for Camouflaged Object Detection
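Quick read: This paper addresses the core challenge in Camouflage Object Detection (COD): the high similarity between targets and backgrounds in color, texture, and shape causes existing methods either to lose edge details (such as hair-like fine structures) by over-relying on global semantic information, or to be disturbed by similar backgrounds (such as vegetation patterns) by relying only on local features. The key to the solution is DRRNet, a four-stage architecture with a “context-detail-fusion-refinement” pipeline. It introduces an omni-context feature extraction module and a local detail extraction module, fuses panoramic and local features across multiple scales, and adds a reverse refinement module in the decoder that uses spatial edge priors and frequency-domain noise suppression to perform two-stage reverse refinement, effectively suppressing background interference and enhancing the continuity of object boundaries.
Link: https://arxiv.org/abs/2505.09168
Authors: Jianlin Sun,Xiaolin Fang,Juwei Guan,Dongdong Gui,Teqi Wang,Tongxin Zhu
Affiliations: Southeast University; Key Laboratory of Computer Network and Information Integration (Southeast University)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: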
Abstract:The core challenge in Camouflage Object Detection (COD) lies in the indistinguishable similarity between targets and backgrounds in terms of color, texture, and shape. This causes existing methods to either lose edge details (such as hair-like fine structures) due to over-reliance on global semantic information or be disturbed by similar backgrounds (such as vegetation patterns) when relying solely on local features. We propose DRRNet, a four-stage architecture characterized by a “context-detail-fusion-refinement” pipeline to address these issues. Specifically, we introduce an Omni-Context Feature Extraction Module to capture global camouflage patterns and a Local Detail Extraction Module to supplement microstructural information for the full-scene context module. We then design a module for forming dual representations of scene understanding and structural awareness, which fuses panoramic features and local features across various scales. In the decoder, we also introduce a reverse refinement module that leverages spatial edge priors and frequency-domain noise suppression to perform a two-stage inverse refinement of the output. By applying two successive rounds of inverse refinement, the model effectively suppresses background interference and enhances the continuity of object boundaries. Experimental results demonstrate that DRRNet significantly outperforms state-of-the-art methods on benchmark datasets. Our code is available at this https URL.
[CV-46] AMSnet 2.0: A Large AMS Database with AI Segmentation for Net Detection
Quick read: This paper addresses the limited ability of current multimodal large language models (MLLMs) to understand circuit schematics, which stems mainly from the lack of high-quality schematic-netlist training data. Existing methods such as AMSnet generate netlists through schematic parsing, but they rely on hard-coded heuristics and adapt poorly to complex or noisy schematics. The key to the solution is a novel segmentation-based net detection mechanism that is highly robust and recovers positional information, enabling digital reconstruction of schematics. The authors also expand the AMSnet dataset into AMSnet 2.0, which contains more circuits and more detailed structural information.
Link: https://arxiv.org/abs/2505.09155
Authors: Yichen Shi,Zhuofu Tao,Yuhao Gao,Li Huang,Hongyang Wang,Zhiping Yu,Ting-Jung Lin,Lei He
Affiliations: Ningbo Institute of Digital Twin; Eastern Institute of Technology; University of California, Los Angeles; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by LAD25
Abstract:Current multimodal large language models (MLLMs) struggle to understand circuit schematics due to their limited recognition capabilities. This could be attributed to the lack of high-quality schematic-netlist training data. Existing work such as AMSnet applies schematic parsing to generate netlists. However, these methods rely on hard-coded heuristics and are difficult to apply to the complex or noisy schematics considered in this paper. We therefore propose a novel net detection mechanism based on segmentation with high robustness. The proposed method also recovers positional information, allowing digital reconstruction of schematics. We then expand the AMSnet dataset with schematic images from various sources and create AMSnet 2.0. AMSnet 2.0 contains 2,686 circuits with schematic images, Spectre-formatted netlists, OpenAccess digital schematics, and positional information for circuit components and nets, whereas AMSnet only includes 792 circuits with SPICE netlists but no digital schematics.
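[CV-47] TopoDiT-3D: Topology-Aware Diffusion Transformer with Bottleneck Structure for 3D Point Cloud Generation
Quick read: This paper addresses the tendency of existing Diffusion Transformer (DiT) models for 3D point cloud generation to focus on local features while overlooking global topological information (such as voids), which can lead to poor shape consistency and difficulty capturing complex geometries. The key to the solution is TopoDiT-3D, a topology-aware diffusion transformer with a bottleneck structure that uses a Perceiver Resampler to integrate topological information extracted through persistent homology into feature learning while adaptively filtering out redundant local features, improving both training efficiency and generation quality.
Link: https://arxiv.org/abs/2505.09140
Authors: Zechao Guan,Feng Yan,Shuai Du,Lin Ma,Qingshan Liu
Affiliations: Southeast University; Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: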
Abstract:Recent advancements in Diffusion Transformer (DiT) models have significantly improved 3D point cloud generation. However, existing methods primarily focus on local feature extraction while overlooking global topological information, such as voids, which are crucial for maintaining shape consistency and capturing complex geometries. To address this limitation, we propose TopoDiT-3D, a Topology-Aware Diffusion Transformer with a bottleneck structure for 3D point cloud generation. Specifically, we design the bottleneck structure utilizing Perceiver Resampler, which not only offers a mode to integrate topological information extracted through persistent homology into feature learning, but also adaptively filters out redundant local features to improve training efficiency. Experimental results demonstrate that TopoDiT-3D outperforms state-of-the-art models in visual quality, diversity, and training efficiency. Furthermore, TopoDiT-3D demonstrates the importance of rich topological information for 3D point cloud generation and its synergy with conventional local feature learning. Videos and code are available at this https URL.
[CV-48] Beyond General Prompts: Automated Prompt Refinement using Contrastive Class Alignment Scores for Disambiguating Objects in Vision-Language Models
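Quick read: This paper addresses the performance variability that vision-language models (VLMs) exhibit in object detection from natural language prompts when the prompt phrasing changes. The key to the solution is a new metric, the Contrastive Class Alignment Score (CCAS), which ranks prompts by their semantic alignment with the target object class while penalizing similarity to confounding classes. The method generates diverse prompt candidates with a large language model and filters them with CCAS, improving object detection accuracy without additional model training or labeled data.
Link: https://arxiv.org/abs/2505.09139
Authors: Lucas Choi,Ross Greer
Affiliations: Archbishop Mitty; University of California, Merced
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: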
Abstract:Vision-language models (VLMs) offer flexible object detection through natural language prompts but suffer from performance variability depending on prompt phrasing. In this paper, we introduce a method for automated prompt refinement using a novel metric called the Contrastive Class Alignment Score (CCAS), which ranks prompts based on their semantic alignment with a target object class while penalizing similarity to confounding classes. Our method generates diverse prompt candidates via a large language model and filters them through CCAS, computed using prompt embeddings from a sentence transformer. We evaluate our approach on challenging object categories, demonstrating that our automatic selection of high-precision prompts improves object detection accuracy without the need for additional model training or labeled data. This scalable and model-agnostic pipeline offers a principled alternative to manual prompt engineering for VLM-based detection systems.
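The ranking idea can be illustrated with a small sketch. The abstract does not give the exact CCAS formula, so the version below is an assumption: it scores a prompt as its cosine similarity to the target class embedding minus its mean cosine similarity to the confounding class embeddings. The toy two-dimensional vectors stand in for sentence-transformer embeddings, and all names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def contrastive_score(prompt_emb, target_emb, confounder_embs):
    """Assumed CCAS-style score: alignment with target minus mean confounder similarity."""
    alignment = cosine(prompt_emb, target_emb)
    if not confounder_embs:
        return alignment
    penalty = sum(cosine(prompt_emb, c) for c in confounder_embs) / len(confounder_embs)
    return alignment - penalty


def rank_prompts(prompt_embs, target_emb, confounder_embs):
    """Return prompt names sorted from highest to lowest score."""
    scored = [
        (contrastive_score(emb, target_emb, confounder_embs), name)
        for name, emb in prompt_embs.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy embeddings: one prompt aligned with the target, one equally close to a confounder.
target = [1.0, 0.0]
confounders = [[0.0, 1.0]]
prompts = {"specific": [1.0, 0.0], "ambiguous": [1.0, 1.0]}
ranking = rank_prompts(prompts, target, confounders)
```

In this toy case the ambiguous prompt is equally similar to the target and to the confounder, so its score drops to zero and the specific prompt ranks first.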
[CV-49] WSCIF: A Weakly-Supervised Color Intelligence Framework for Tactical Anomaly Detection in Surveillance Keyframes
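Quick read: This paper addresses the challenges of deploying traditional deep learning models for high-risk security tasks in video intelligence environments where data are unlabeled and inaccessible. The key to the solution is a lightweight anomaly detection framework based on color features, which fuses unsupervised KMeans clustering with RGB channel histogram modeling to achieve composite detection of structural anomalies and color mutation signals in key frames, so that potential threat events can be identified quickly under resource-constrained and data-sensitive conditions.
Link: https://arxiv.org/abs/2505.09129
Authors: Wei Meng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 3 figures, 3 tables. The paper proposes a lightweight weakly-supervised color intelligence model for tactical video anomaly detection, tested on anonymized African surveillance data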
Abstract:The deployment of traditional deep learning models in high-risk security tasks in an unlabeled, data-non-exploitable video intelligence environment faces significant challenges. In this paper, we propose a lightweight anomaly detection framework based on color features for surveillance video clips in a high sensitivity tactical mission, aiming to quickly identify and interpret potential threat events under resource-constrained and data-sensitive conditions. The method fuses unsupervised KMeans clustering with RGB channel histogram modeling to achieve composite detection of structural anomalies and color mutation signals in key frames. The experiment takes an operation surveillance video occurring in an African country as a research sample, and successfully identifies multiple highly anomalous frames related to high-energy light sources, target presence, and reflective interference under the condition of no access to the original data. The results show that this method can be effectively used for tactical assassination warning, suspicious object screening and environmental drastic change monitoring with strong deployability and tactical interpretation value. The study emphasizes the importance of color features as low semantic battlefield signal carriers, and its battlefield intelligent perception capability will be further extended by combining graph neural networks and temporal modeling in the future.
[CV-50] Promoting SAM for Camouflaged Object Detection via Selective Key Point-based Guidance
Quick read: This paper tackles Camouflaged Object Detection (COD), aiming to improve detection performance with a large model, the Segment Anything Model (SAM). The key to the solution is a new framework that generates promotion points to guide SAM’s segmentation. Specifically, it first develops the Promotion Point Targeting Network (PPT-net), which fuses multi-scale features to predict the probability that a camouflaged object is present at each candidate point. It then introduces a Key Point Selection (KPS) algorithm that contrastively deploys positive and negative promotion points to guide segmentation. This is the first work to successfully apply a large model to COD, and its experimental results surpass existing methods on multiple datasets.
Link: https://arxiv.org/abs/2505.09123
Authors: Guoying Liang,Su Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Big models have emerged as a new research paradigm that can be applied to various down-stream tasks with only minor effort for domain adaptation. Correspondingly, this study tackles Camouflaged Object Detection (COD) leveraging the Segment Anything Model (SAM). Previous studies declared that SAM is not workable for COD, but this study reveals that SAM works if promoted properly, for which we devise a new framework to render point promotions: First, we develop the Promotion Point Targeting Network (PPT-net) to leverage multi-scale features in predicting the probabilities of camouflaged objects’ presence at given candidate points over the image. Then, we develop a key point selection (KPS) algorithm to deploy both positive and negative point promotions contrastively to SAM to guide the segmentation. It is the first work to apply a big model to COD and achieves plausible results experimentally over the existing methods on 3 data sets under 6 metrics. This study demonstrates an off-the-shelf methodology for COD by leveraging SAM, which gains advantage over designing professional models from scratch, not only in performance, but also in turning the problem to a less challenging task, that is, seeking informative but not exactly precise promotions.
[CV-51] Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning
Quick read: This paper addresses the limited ability of traditional scene graphs to support complex interaction reasoning in vision-language models (VLMs), focusing on two key challenges: conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and existing methods cannot form persistent memories that generalize interaction reasoning to new scenes. The key to the solution is Interaction-augmented Scene Graph Reasoning (ISGR), a framework with three complementary components: a dual-stream graph constructor that combines SAM-powered spatial relation extraction with interaction-aware captioning to build spatially grounded scene graphs; targeted interaction queries that activate VLMs’ latent knowledge of object functionalities, turning passive recognition into active reasoning; and a long-term memory reinforcement learning strategy with a specially designed interaction-focused reward function that turns transient patterns into long-term reasoning heuristics.
Link: https://arxiv.org/abs/2505.09118
Authors: Dayong Liang,Changmeng Zheng,Zhiyuan Wen,Yi Cai,Xiao-Yong Wei,Qing Li
Affiliations: South China University of Technology; Peng Cheng Laboratory; The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models’ (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes. We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs’ interactional reasoning through three complementary components. First, our dual-stream graph constructor combines SAM-powered spatial relation extraction with interaction-aware captioning to generate functionally salient scene graphs with spatial grounding. Second, we employ targeted interaction queries to activate VLMs’ latent knowledge of object functionalities, converting passive recognition into active reasoning about how objects work together. Finally, we introduce a long-term memory reinforcement learning strategy with a specialized interaction-focused reward function that transforms transient patterns into long-term reasoning heuristics. Extensive experiments demonstrate that our approach significantly outperforms baseline methods on interaction-heavy reasoning benchmarks, with particularly strong improvements on complex scene understanding tasks. The source code can be accessed at this https URL.
[CV-52] FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis
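[Quick read]: This paper addresses the difficulty of generating large amounts of high-quality data for robotic garment manipulation, which stems from the deformability of garments. The key to the solution is a synthetic garment dataset: geometric garment templates are built from keypoints, generative models produce realistic texture patterns, the keypoint annotations are used to generate folding demonstrations in simulation, and folding policies are trained via closed-loop imitation learning. To improve robustness, the paper also proposes KG-DAgger, which uses a keypoint-based strategy to generate demonstrations of recovery from failures and significantly improves model performance.
Link: https://arxiv.org/abs/2505.09109
Authors: Yuxing Chen,Bowen Xiao,He Wang
Affiliations: Peking University; Galbot; Beijing Academy of Artificial Intelligence
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: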
Abstract:Due to the deformability of garments, generating a large amount of high-quality data for robotic garment manipulation tasks is highly challenging. In this paper, we present a synthetic garment dataset that can be used for robotic garment folding. We begin by constructing geometric garment templates based on keypoints and applying generative models to generate realistic texture patterns. Leveraging these keypoint annotations, we generate folding demonstrations in simulation and train folding policies via closed-loop imitation learning. To improve robustness, we propose KG-DAgger, which uses a keypoint-based strategy to generate demonstration data for recovering from failures. KG-DAgger significantly improves the model performance, boosting the real-world success rate by 25%. After training with 15K trajectories (about 2M image-action pairs), the model achieves a 75% success rate in the real world. Experiments in both simulation and real-world settings validate the effectiveness of our proposed framework.
[CV-53] OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions
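[Quick read]: This paper addresses the lack of open, large-scale data for evaluating and improving Lane Keeping Assist (LKA) systems under real-world road conditions. The key to the solution is OpenLKA, the first open, large-scale dataset for LKA evaluation, covering 400 hours of driving data from more than 50 production vehicle models. It is multimodal, including CAN bus signals, high-resolution dash-cam video, real-time Openpilot outputs, and scene annotations generated by vision-language models, and so provides a comprehensive platform for benchmarking real-world LKA performance, identifying safety-critical scenarios, and assessing how ready road infrastructure is for autonomous driving.
Link: https://arxiv.org/abs/2505.09092
Authors: Yuhang Wang,Abdulaziz Alhuraish,Shengming Yuan,Hao Zhou
Affiliations: University of South Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: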
Abstract:Lane Keeping Assist (LKA) is widely adopted in modern vehicles, yet its real-world performance remains underexplored due to proprietary systems and limited data access. This paper presents OpenLKA, the first open, large-scale dataset for LKA evaluation and improvement. It includes 400 hours of driving data from 50+ production vehicle models, collected through extensive road testing in Tampa, Florida and global contributions from the this http URL driving community. The dataset spans a wide range of challenging scenarios, including complex road geometries, degraded lane markings, adverse weather, lighting conditions and surrounding traffic. The dataset is multimodal, comprising: i) full CAN bus streams, decoded using custom reverse-engineered DBC files to extract key LKA events (e.g., system disengagements, lane detection failures); ii) synchronized high-resolution dash-cam video; iii) real-time outputs from Openpilot, providing accurate estimates of road curvature and lane positioning; iv) enhanced scene annotations generated by Vision Language Models, describing lane visibility, pavement quality, weather, lighting, and traffic conditions. By integrating vehicle-internal signals with high-fidelity perception and rich semantic context, OpenLKA provides a comprehensive platform for benchmarking the real-world performance of production LKA systems, identifying safety-critical operational scenarios, and assessing the readiness of current road infrastructure for autonomous driving. The dataset is publicly available at: this https URL.
[CV-54] DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis
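[Quick read]: This paper addresses the limited resolution and the mode collapse during conditional generation that arise when Generative Adversarial Networks (GANs) rely on bandwidth-limited mel-spectrograms to generate audio sequences. The key to the solution is the Deformable Periodic Network-based GAN (DPN-GAN), an architecture that uses a kernel-based periodic ReLU activation function to induce a periodic bias in audio generation and uses deformable convolution operations for multi-resolution generation, improving the model's ability to capture and reproduce intricate audio patterns.
Link: https://arxiv.org/abs/2505.09091
Authors: Zeeshan Ahmad,Shudi Bao,Meng Chen
Affiliations: Ningbo Institute of Digital Twin; Eastern Institute of Technology; Ningbo University of Technology
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: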
Abstract:In recent years, generative adversarial networks (GANs) have made significant progress in generating audio sequences. However, these models typically rely on bandwidth-limited mel-spectrograms, which constrain the resolution of generated audio sequences, and lead to mode collapse during conditional generation. To address this issue, we propose Deformable Periodic Network based GAN (DPN-GAN), a novel GAN architecture that incorporates a kernel-based periodic ReLU activation function to induce periodic bias in audio generation. This innovative approach enhances the model’s ability to capture and reproduce intricate audio patterns. In particular, our proposed model features a DPN module for multi-resolution generation utilizing deformable convolution operations, allowing for adaptive receptive fields that improve the quality and fidelity of the synthetic audio. Additionally, we enhance the discriminator network using deformable convolution to better distinguish between real and generated samples, further refining the audio quality. We trained two versions of the model: DPN-GAN small (38.67M parameters) and DPN-GAN large (124M parameters). For evaluation, we use five different datasets, covering both speech synthesis and music generation tasks, to demonstrate the efficiency of the DPN-GAN. The experimental results demonstrate that DPN-GAN delivers superior performance on both out-of-distribution and noisy data, showcasing its robustness and adaptability. Trained across various datasets, DPN-GAN outperforms state-of-the-art GAN architectures on standard evaluation metrics, and exhibits increased robustness in synthesized audio.
[CV-55] 2D-3D Attention and Entropy for Pose Robust 2D Facial Recognition
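[Quick read]: This paper addresses the drop in facial recognition performance caused by large perspective (pose) differences between enrollment and query imagery. The key to the solution is a new domain-adaptive framework that improves pose invariance by enabling image-based (2D) representations to infer properties of inherently pose-invariant point cloud (3D) representations. Its core techniques are (1) a shared (joint) attention mapping that emphasizes the common patterns most correlated between 2D facial images and 3D facial data, and (2) a joint entropy regularizing loss that uses the attention maps to promote better consistency and correlation between the 2D and 3D representations.
Link: https://arxiv.org/abs/2505.09073
Authors: J. Brennan Peace,Shuowen Hu,Benjamin S. Riggan
Affiliations: University of Nebraska-Lincoln; U.S. Army Research Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear at the IEEE International Conference on Automatic Face and Gesture 2025 (FG2025)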
Abstract:Despite recent advances in facial recognition, there remains a fundamental issue concerning degradations in performance due to substantial perspective (pose) differences between enrollment and query (probe) imagery. Therefore, we propose a novel domain adaptive framework to facilitate improved performances across large discrepancies in pose by enabling image-based (2D) representations to infer properties of inherently pose invariant point cloud (3D) representations. Specifically, our proposed framework achieves better pose invariance by using (1) a shared (joint) attention mapping to emphasize common patterns that are most correlated between 2D facial images and 3D facial data and (2) a joint entropy regularizing loss to promote better consistency, enhancing correlations among the intersecting 2D and 3D representations, by leveraging both attention maps. This framework is evaluated on FaceScape and ARL-VTF datasets, where it outperforms competitive methods by achieving profile (90°+) TAR @ 1% FAR improvements of at least 7.1% and 1.57%, respectively.
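The metric quoted above, TAR @ 1% FAR, is the true accept rate at the decision threshold where the false accept rate does not exceed 1%. The following is a minimal sketch of one common way to compute it from similarity scores; the function name, the strict "greater than" acceptance rule, and the toy scores are illustrative assumptions, not the paper's evaluation code.

```python
def tar_at_far(genuine_scores, impostor_scores, target_far):
    """Return (TAR, threshold) for the loosest threshold with FAR <= target_far.

    Assumption: a pair is accepted when its score is strictly above the threshold.
    """
    # Rank impostor (non-matching) scores from highest to lowest.
    impostors = sorted(impostor_scores, reverse=True)
    # Number of impostor pairs that may be accepted without exceeding the target FAR.
    allowed = int(target_far * len(impostors))
    # Placing the threshold at the (allowed + 1)-th highest impostor score
    # accepts at most `allowed` impostor pairs.
    threshold = impostors[allowed] if allowed < len(impostors) else float("-inf")
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores), threshold


# Toy example: 4 impostor pairs, so a 25% FAR allows one false accept.
tar, thr = tar_at_far([0.95, 0.5, 0.25, 0.8], [0.1, 0.2, 0.3, 0.9], 0.25)
```

With real data the impostor set is far larger, so a 1% FAR corresponds to a threshold deep in the tail of the impostor score distribution.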
[CV-56] RT-cache: Efficient Robot Trajectory Retrieval System
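[Quick read]: This paper addresses the significant latency that modern Vision-Language-Action (VLA) models incur in real-world robot inference because of their high per-step inference cost. The key to the solution is RT-cache, a novel trajectory-memory pipeline that accelerates robot inference through big-data retrieval and learning from experience. RT-cache stores a large-scale memory of previously successful robot trajectories and retrieves relevant multi-step motion snippets, which greatly reduces inference overhead. It combines a Memory Builder with a Trajectory Retrieval module to keep retrieval efficient and scalable.
Link: https://arxiv.org/abs/2505.09040
Authors: Owen Kwon,Abraham George,Alison Bartsch,Amir Barati Farimani
Affiliations: Carnegie Mellon University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures. Submitted to an IEEE robotics conference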
Abstract:This paper introduces RT-cache, a novel trajectory-memory pipeline that accelerates real-world robot inference by leveraging big-data retrieval and learning from experience. While modern Vision-Language-Action (VLA) models can handle diverse robotic tasks, they often incur high per-step inference costs, resulting in significant latency, sometimes minutes per task. In contrast, RT-cache stores a large-scale Memory of previously successful robot trajectories and retrieves relevant multistep motion snippets, drastically reducing inference overhead. By integrating a Memory Builder with a Trajectory Retrieval, we develop an efficient retrieval process that remains tractable even for extremely large datasets. RT-cache flexibly accumulates real-world experiences and replays them whenever the current scene matches past states, adapting quickly to new or unseen environments with only a few additional samples. Experiments on the Open-X Embodiment Dataset and other real-world data demonstrate that RT-cache completes tasks both faster and more successfully than a baseline lacking retrieval, suggesting a practical, data-driven solution for real-time manipulation.
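The retrieve-and-replay idea described above can be illustrated with a toy nearest-neighbour cache. This is a minimal sketch only: the class name `TrajectoryMemory`, the cosine-similarity matching, the brute-force search, and the `min_similarity` cutoff are all hypothetical choices made for illustration, and the abstract does not say how RT-cache embeds scenes or indexes its memory.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TrajectoryMemory:
    """Hypothetical store of (scene embedding, multi-step action snippet) pairs."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, actions):
        # Record a snippet from a previously successful trajectory.
        self.entries.append((list(embedding), list(actions)))

    def retrieve(self, query, min_similarity=0.9):
        # Brute-force search; a large memory would need an approximate index.
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is None or cosine(query, best[0]) < min_similarity:
            return None  # no close match: fall back to the slower policy
        return best[1]


memory = TrajectoryMemory()
memory.add([1.0, 0.0], ["reach", "grasp"])
memory.add([0.0, 1.0], ["push"])
snippet = memory.retrieve([0.9, 0.1])   # close to the first stored scene
fallback = memory.retrieve([1.0, 1.0])  # equally far from both stored scenes
```

Returning a whole snippet rather than a single action is what saves per-step model calls in this sketch: one lookup yields several steps to replay.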
[CV-57] Multimodal Fusion of Glucose Monitoring and Food Imagery for Caloric Content Prediction
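[Quick read]: This paper addresses accurate estimation of caloric intake for people with Type 2 diabetes, which matters for effective dietary monitoring. Continuous glucose monitoring (CGM) devices provide valuable physiological data but struggle to capture the full nutritional profile of a meal, mainly because of inter-individual and meal-specific variability. The proposed solution is a multimodal deep learning framework whose key idea is to jointly use CGM time-series data, demographic/microbiome data, and pre-meal food images, combining attention-based encoding, convolutional feature extraction, multi-layer perceptrons, and a late fusion strategy for joint reasoning to improve the accuracy of caloric estimation.
Link: https://arxiv.org/abs/2505.09018
Authors: Adarsh Kumar
Affiliations: Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: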
Abstract:Effective dietary monitoring is critical for managing Type 2 diabetes, yet accurately estimating caloric intake remains a major challenge. While continuous glucose monitors (CGMs) offer valuable physiological data, they often fall short in capturing the full nutritional profile of meals due to inter-individual and meal-specific variability. In this work, we introduce a multimodal deep learning framework that jointly leverages CGM time-series data, demographic/microbiome data, and pre-meal food images to enhance caloric estimation. Our model utilizes attention-based encoding and convolutional feature extraction for meal imagery, and multi-layer perceptrons for CGM and microbiome data, followed by a late fusion strategy for joint reasoning. We evaluate our approach on a curated dataset of over 40 participants, incorporating synchronized CGM, demographic and microbiome data and meal photographs with standardized caloric labels. Our model achieves a Root Mean Squared Relative Error (RMSRE) of 0.2544, outperforming the baseline models by over 50%. These findings demonstrate the potential of multimodal sensing to improve automated dietary assessment tools for chronic disease management.
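The abstract reports RMSRE without defining it. A minimal sketch, assuming the usual definition (the square root of the mean squared error after dividing each error by the true value), looks like this; the function name and the toy calorie values are illustrative.

```python
import math


def rmsre(predicted, actual):
    """Root Mean Squared Relative Error, assuming sqrt(mean(((p - a) / a) ** 2))."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each error is normalised by the true value, so actual values must be non-zero.
    total = sum(((p - a) / a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(total / len(actual))


# Two meals of 100 kcal each, predicted 10% too high and 10% too low.
score = rmsre([110.0, 90.0], [100.0, 100.0])
```

Under this definition a score of 0.2544 would correspond to a typical relative error of roughly 25% per meal.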
[CV-58] Towards Adaptive Meta-Gradient Adversarial Examples for Visual Tracking
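[Quick read]: This paper addresses the way security weaknesses in deep learning models undermine the reliability of visual tracking methods in real-world use. Specifically, it aims to expose the security vulnerabilities of existing visual trackers through effective adversarial attacks. The key to the solution is the adaptive meta-gradient adversarial attack (AMGA), which integrates multi-model ensembles and meta-learning strategies and combines momentum mechanisms with Gaussian smoothing, significantly improving the transferability and attack effectiveness of adversarial examples. AMGA randomly selects models from a large model repository, constructs diverse tracking scenarios, and iteratively performs both white-box and black-box attacks in each scenario, optimizing the gradient direction of each model. This narrows the gap between white-box and black-box attacks and yields strong attack performance in black-box settings.
Link: https://arxiv.org/abs/2505.08999
Authors: Wei-Long Tian,Peng Gao,Xiao Liu,Long Xu,Hamido Fujita,Hanan Aljuai,Mao-Li Wang
Affiliations: Qufu Normal University; Universiti Teknologi Malaysia; Princess Nourah bint Abdulrahman University; Iwate Prefectural University; University of Hradec Kralove
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: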
Abstract:In recent years, visual tracking methods based on convolutional neural networks and Transformers have achieved remarkable performance and have been successfully applied in fields such as autonomous driving. However, the numerous security issues exposed by deep learning models have gradually affected the reliable application of visual tracking methods in real-world scenarios. Therefore, how to reveal the security vulnerabilities of existing visual trackers through effective adversarial attacks has become a critical problem that needs to be addressed. To this end, we propose an adaptive meta-gradient adversarial attack (AMGA) method for visual tracking. This method integrates multi-model ensembles and meta-learning strategies, combining momentum mechanisms and Gaussian smoothing, which can significantly enhance the transferability and attack effectiveness of adversarial examples. AMGA randomly selects models from a large model repository, constructs diverse tracking scenarios, and iteratively performs both white- and black-box adversarial attacks in each scenario, optimizing the gradient directions of each model. This paradigm minimizes the gap between white- and black-box adversarial attacks, thus achieving excellent attack performance in black-box scenarios. Extensive experimental results on large-scale datasets such as OTB2015, LaSOT, and GOT-10k demonstrate that AMGA significantly improves the attack performance, transferability, and deception of adversarial examples. Codes and data are available at this https URL.
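The momentum mechanism mentioned above can be illustrated with the standard momentum iterative sign-gradient update averaged over an ensemble of models. This is a minimal sketch of that well-known baseline on toy linear losses, not AMGA itself: the meta-learning loop, the Gaussian smoothing, and the tracker losses are omitted, and every name and parameter here is illustrative.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def momentum_attack(x, grad_fns, epsilon=0.1, steps=10, decay=1.0):
    """Momentum iterative sign-gradient ascent on the average loss of an ensemble.

    grad_fns: one function per model, each returning the loss gradient at a point.
    """
    alpha = epsilon / steps          # per-step size
    adv = list(x)
    momentum = [0.0] * len(x)
    for _ in range(steps):
        grads = [fn(adv) for fn in grad_fns]
        # Average the gradients across the ensemble.
        avg = [sum(g[i] for g in grads) / len(grads) for i in range(len(x))]
        # Normalise by the L1 norm before accumulating momentum.
        l1 = sum(abs(g) for g in avg) or 1.0
        momentum = [decay * m + g / l1 for m, g in zip(momentum, avg)]
        adv = [a + alpha * sign(m) for a, m in zip(adv, momentum)]
        # Keep the perturbation inside the epsilon-ball around the input.
        adv = [min(max(a, xi - epsilon), xi + epsilon) for a, xi in zip(adv, x)]
    return adv


# Toy ensemble: two linear losses w . x, whose gradients are the constant w.
models = [lambda p: [1.0, -2.0], lambda p: [3.0, -1.0]]
adversarial = momentum_attack([0.0, 0.0], models)
```

Accumulating normalised gradients stabilises the update direction across steps, which is the usual argument for why momentum helps perturbations transfer to unseen models.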
[CV-59] Neural BRDF Importance Sampling by Reparameterization
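[Quick read]: This paper addresses the challenge of importance sampling for neural bidirectional reflectance distribution functions (neural BRDFs), which are used to make material representations more realistic in physically-based rendering. The key to the solution is a reparameterization-based formulation of neural BRDF importance sampling that recasts the distribution learning task as a problem of identifying BRDF integral substitutions. This removes the reliance on invertible networks and multi-step inference, improves flexibility and efficiency, and achieves the best variance reduction while maintaining high inference speed.
Link: https://arxiv.org/abs/2505.08998
Authors: Liwen Wu,Sai Bi,Zexiang Xu,Hao Tan,Kai Zhang,Fujun Luan,Haolin Lu,Ravi Ramamoorthi
Affiliations: University of California San Diego; Adobe Research; Hillbot; Max Planck Institute for Informatics
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: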
Abstract:Neural bidirectional reflectance distribution functions (BRDFs) have emerged as popular material representations for enhancing realism in physically-based rendering. Yet their importance sampling remains a significant challenge. In this paper, we introduce a reparameterization-based formulation of neural BRDF importance sampling that seamlessly integrates into the standard rendering pipeline with precise generation of BRDF samples. The reparameterization-based formulation transfers the distribution learning task to a problem of identifying BRDF integral substitutions. In contrast to previous methods that rely on invertible networks and multi-step inference to reconstruct BRDF distributions, our model removes these constraints, which offers greater flexibility and efficiency. Our variance and performance analysis demonstrates that our reparameterization method achieves the best variance reduction in neural BRDF renderings while maintaining high inference speeds compared to existing baselines.
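Importance sampling by reparameterization means mapping uniform random numbers through a function so that the resulting directions follow a chosen density. The classic analytic case is cosine-weighted hemisphere sampling for a Lambertian BRDF, sketched below. The paper learns such a mapping for neural BRDFs; this sketch uses a closed-form mapping instead, and all names are illustrative.

```python
import math
import random


def sample_cosine_hemisphere(u1, u2):
    """Map uniform (u1, u2) in [0, 1)^2 to a direction with pdf cos(theta) / pi."""
    r = math.sqrt(u1)
    phi = 2.0 * math.pi * u2
    x, y = r * math.cos(phi), r * math.sin(phi)
    z = math.sqrt(max(0.0, 1.0 - u1))  # z equals cos(theta)
    return (x, y, z), z / math.pi


def estimate_reflectance(albedo, n, rng):
    """Monte Carlo estimate of the hemisphere integral of brdf * cos(theta)."""
    total = 0.0
    for _ in range(n):
        (_, _, cos_theta), pdf = sample_cosine_hemisphere(rng.random(), rng.random())
        brdf = albedo / math.pi        # Lambertian BRDF is constant
        total += brdf * cos_theta / pdf
    return total / n


estimate = estimate_reflectance(0.8, 1000, random.Random(0))
```

Because the sampling density is exactly proportional to the integrand here, every sample contributes the same value and the estimator has zero variance. Getting close to that ideal for a learned BRDF is the aim of importance sampling.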
[CV-60] Toward Accessible and Safe Live Streaming Using Distributed Content Filtering with MoQ ICME2025
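[Quick read]: This paper addresses content moderation in live video streaming, in particular detecting and removing dangerous, illegal, or objectionable content under low-latency constraints. The key to the solution is a set of extensions to the in-progress Media Over QUIC Transport protocol that enable real-time content moderation in one-to-many live streams. The scheme removes only the video segments that contain objectionable content, so playback resumes as soon as the stream conforms to content policies again. Content analysis tasks can also be distributed transparently to arbitrary client devices, which improves the flexibility and efficiency of the system.
Link: https://arxiv.org/abs/2505.08990
Authors: Andrew C. Freeman
Affiliations: Baylor University
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
Comments: Accepted to the ICME 2025 LIVES workshop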
Abstract:Live video streaming is increasingly popular on social media platforms. With the growth of live streaming comes an increased need for robust content moderation to remove dangerous, illegal, or otherwise objectionable content. Whereas video on demand distribution enables offline content analysis, live streaming imposes restrictions on latency for both analysis and distribution. In this paper, we present extensions to the in-progress Media Over QUIC Transport protocol that enable real-time content moderation in one-to-many video live streams. Importantly, our solution removes only the video segments that contain objectionable content, allowing playback resumption as soon as the stream conforms to content policies again. Content analysis tasks may be transparently distributed to arbitrary client devices. We implement and evaluate our system in the context of light strobe removal for photosensitive viewers, finding that streaming clients experience an increased latency of only one group-of-pictures duration.
[CV-61] Differentiable Channel Selection in Self-Attention For Person Re-Identification
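[Quick read]: This paper addresses the limited efficiency and discriminative power of feature extraction in person re-identification (Re-ID). The key to the solution is a novel attention mechanism, the Differentiable Channel Selection Attention module (DCS-Attention). The module selects informative feature channels when computing attention weights and performs this selection in a differentiable manner, so it integrates seamlessly with deep neural network (DNN) training. DCS-Attention is motivated by the Information Bottleneck (IB) principle: the paper derives a novel variational upper bound for the IB loss that can be optimized by stochastic gradient descent (SGD), which lets the network select the most discriminative channels for feature extraction and improves Re-ID performance.
Link: https://arxiv.org/abs/2505.08961
Authors: Yancheng Wang,Nebojsa Jojic,Yingzhen Yang
Affiliations: Arizona State University; Microsoft Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: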
Abstract:In this paper, we propose a novel attention module termed the Differentiable Channel Selection Attention module, or the DCS-Attention module. In contrast with conventional self-attention, the DCS-Attention module features selection of informative channels in the computation of the attention weights. The selection of the feature channels is performed in a differentiable manner, enabling seamless integration with DNN training. Our DCS-Attention is compatible with either fixed neural network backbones or learnable backbones with Differentiable Neural Architecture Search (DNAS), leading to DCS with Fixed Backbone (DCS-FB) and DCS-DNAS, respectively. Importantly, our DCS-Attention is motivated by the principle of Information Bottleneck (IB), and a novel variational upper bound for the IB loss, which can be optimized by SGD, is derived and incorporated into the training loss of the networks with the DCS-Attention modules. In this manner, a neural network with DCS-Attention modules is capable of selecting the most informative channels for feature extraction so that it enjoys state-of-the-art performance for the Re-ID task. Extensive experiments on multiple person Re-ID benchmarks using both DCS-FB and DCS-DNAS show that DCS-Attention significantly enhances the prediction accuracy of DNNs for person Re-ID, which demonstrates the effectiveness of DCS-Attention in learning discriminative features critical to identifying person identities. The code of our work is available at this https URL.
[CV-62] Multi-step manipulation task and motion planning guided by video demonstration
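[Quick read]: This paper addresses complex multi-step task-and-motion planning for robots, using instructional video to guide task execution. The key to the solution is an extension of the classic Rapidly-Exploring Random Tree (RRT) planner that simultaneously grows multiple trees around the grasp and release states extracted from the instructional video. Contact states and 3D object poses from the video are combined with a traditional planning algorithm to solve tasks with sequential dependencies.
Link: https://arxiv.org/abs/2505.08949
Authors: Kateryna Zorina,David Kovar,Mederic Fourmy,Florent Lamiraux,Nicolas Mansard,Justin Carpentier,Josef Sivic,Vladimir Petrik
Affiliations: CIIRC, Czech Technical University in Prague; LAAS-CNRS, Universite de Toulouse, CNRS, Toulouse; INRIA, Paris
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: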
Abstract:This work aims to leverage instructional video to solve complex multi-step task-and-motion planning tasks in robotics. Towards this goal, we propose an extension of the well-established Rapidly-Exploring Random Tree (RRT) planner, which simultaneously grows multiple trees around grasp and release states extracted from the guiding video. Our key novelty lies in combining contact states and 3D object poses extracted from the guiding video with a traditional planning algorithm that allows us to solve tasks with sequential dependencies, for example, if an object needs to be placed at a specific location to be grasped later. We also investigate the generalization capabilities of our approach to go beyond the scene depicted in the instructional video. To demonstrate the benefits of the proposed video-guided planning approach, we design a new benchmark with three challenging tasks: (i) 3D re-arrangement of multiple objects between a table and a shelf, (ii) multi-step transfer of an object through a tunnel, and (iii) transferring objects using a tray, similar to how a waiter transfers dishes. We demonstrate the effectiveness of our planning algorithm on several robots, including the Franka Emika Panda and the KUKA KMR iiwa. For a seamless transfer of the obtained plans to the real robot, we develop a trajectory refinement approach formulated as an optimal control problem (OCP).
[CV-63] Parameter-Efficient Fine-Tuning of Vision Foundation Model for Forest Floor Segmentation from UAV Imagery ICRA
[Quick read]: This paper addresses accurate segmentation of the forest floor in complex, rapidly changing forest environments, which is hard because of high natural variability, quickly changing environmental parameters, and ambiguous annotations. The key to the solution is to adapt the Segment Anything Model (SAM), a vision foundation model with strong generalization capabilities, using parameter-efficient fine-tuning (PEFT): only a small number of additional parameters are tuned to fit the dataset categories while the original weights stay fixed, and SAM's mask decoder is adjusted so that segmentation runs automatically without manual prompting.
Link: https://arxiv.org/abs/2505.08932
Authors: Mohammad Wasil,Ahmad Drak,Brennan Penfold,Ludovico Scarton,Maximilian Johenneken,Alexander Asteroth,Sebastian Houben
Affiliations: Bonn-Rhein-Sieg University of Applied Sciences; Department of Computer Science; Institute for Artificial Intelligence and Autonomous Systems; Institute of Technology, Resource and Energy-efficient Engineering
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the Novel Approaches for Precision Agriculture and Forestry with Autonomous Robots IEEE ICRA Workshop - 2025
Abstract:Unmanned Aerial Vehicles (UAVs) are increasingly used for reforestation and forest monitoring, including seed dispersal in hard-to-reach terrains. However, a detailed understanding of the forest floor remains a challenge due to high natural variability, quickly changing environmental parameters, and ambiguous annotations due to unclear definitions. To address this issue, we adapt the Segment Anything Model (SAM), a vision foundation model with strong generalization capabilities, to segment forest floor objects such as tree stumps, vegetation, and woody debris. To this end, we employ parameter-efficient fine-tuning (PEFT) to fine-tune a small subset of additional model parameters while keeping the original weights fixed. We adjust SAM’s mask decoder to generate masks corresponding to our dataset categories, allowing for automatic segmentation without manual prompting. Our results show that the adapter-based PEFT method achieves the highest mean intersection over union (mIoU), while Low-rank Adaptation (LoRA), with fewer parameters, offers a lightweight alternative for resource-constrained UAV platforms.
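The mean intersection over union (mIoU) used to compare the PEFT variants above averages the per-class overlap between predicted and ground-truth labels. The following is a minimal sketch over flattened label lists; skipping classes that appear in neither prediction nor ground truth is an assumption made here (conventions differ between benchmarks), and the class ids are illustrative.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # assumption: classes absent from both are ignored
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four pixels, two classes (e.g. 0 = vegetation, 1 = woody debris).
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.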
[CV-64] Template-Guided Reconstruction of Pulmonary Segments with Neural Implicit Functions
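[Quick read]: This paper addresses high-quality 3D reconstruction of pulmonary segments, which plays a crucial role in segmentectomy and surgical treatment planning for lung cancer. Because of the resolution required for the target reconstruction, conventional deep learning methods often run into computational resource constraints or limited granularity, whereas implicit modeling is attractive for its computational efficiency and continuous representation at any resolution. The paper proposes a neural implicit function-based method that learns a 3D surface by deforming a learnable template, giving anatomy-aware and precise pulmonary segment reconstruction. The key to the solution is to exploit the advantages of implicit modeling, introduce clinically relevant evaluation metrics, and build a shape dataset named Lung3D to support algorithm benchmarking.
Link: https://arxiv.org/abs/2505.08919
Authors: Kangxian Xie,Yufei Zhu,Kaiming Kuang,Li Zhang,Hongwei Bran Li,Mingchen Gao,Jiancheng Yang
Affiliations: Computer Vision Laboratory, Swiss Federal Institute of Technology Lausanne (EPFL); University at Buffalo, SUNY; Dianei Technology; University of California San Diego; Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: In revision process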
Abstract:High-quality 3D reconstruction of pulmonary segments plays a crucial role in segmentectomy and surgical treatment planning for lung cancer. Due to the resolution requirement of the target reconstruction, conventional deep learning-based methods often suffer from computational resource constraints or limited granularity. Conversely, implicit modeling is favored due to its computational efficiency and continuous representation at any resolution. We propose a neural implicit function-based method to learn a 3D surface to achieve anatomy-aware, precise pulmonary segment reconstruction, represented as a shape by deforming a learnable template. Additionally, we introduce two clinically relevant evaluation metrics to assess the reconstruction comprehensively. Further, due to the absence of publicly available shape datasets to benchmark reconstruction algorithms, we developed a shape dataset named Lung3D, including the 3D models of 800 labeled pulmonary segments and the corresponding airways, arteries, veins, and intersegmental veins. We demonstrate that the proposed approach outperforms existing methods, providing a new perspective for pulmonary segment reconstruction. Code and data will be available at this https URL.
[CV-65] Learning Cocoercive Conservative Denoisers via Helmholtz Decomposition for Poisson Inverse Problems
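[Quick read]: This paper addresses the difficulty of applying conventional plug-and-play (PnP) methods with deep denoisers to Poisson inverse problems, where their reliance on a strongly convex or smooth fidelity term and on a non-expansive denoiser breaks down. The key to the solution is a cocoercive conservative (CoCo) denoiser, which may be (residual) expansive and therefore denoises better. Using the generalized Helmholtz decomposition, the training strategy combines Hamiltonian regularization to promote conservativeness with spectral regularization to ensure cocoerciveness. The CoCo denoiser is proved to be a proximal operator of a weakly convex function, which yields a restoration model with an implicit weakly convex prior and guarantees global convergence of PnP methods to a stationary point of that model.
Link: https://arxiv.org/abs/2505.08909
Authors: Deliang Wei,Peng Chen,Haobo Xu,Jiale Yao,Fang Li,Tieyong Zeng
Affiliations: East China Normal University; Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Functional Analysis (math.FA); Optimization and Control (math.OC)
Comments: 31 pages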
Abstract:Plug-and-play (PnP) methods with deep denoisers have shown impressive results in imaging problems. They typically require strong convexity or smoothness of the fidelity term and a (residual) non-expansive denoiser for convergence. These assumptions, however, are violated in Poisson inverse problems, and non-expansiveness can hinder denoising performance. To address these challenges, we propose a cocoercive conservative (CoCo) denoiser, which may be (residual) expansive, leading to improved denoising. By leveraging the generalized Helmholtz decomposition, we introduce a novel training strategy that combines Hamiltonian regularization to promote conservativeness and spectral regularization to ensure cocoerciveness. We prove that CoCo denoiser is a proximal operator of a weakly convex function, enabling a restoration model with an implicit weakly convex prior. The global convergence of PnP methods to a stationary point of this restoration model is established. Extensive experimental results demonstrate that our approach outperforms closely related methods in both visual quality and quantitative metrics.
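For readers unfamiliar with the operator properties named above, the standard textbook definitions are collected below. Here $D$ denotes the denoiser and $\beta > 0$ a constant. The particular constants and norms used in the paper are not given in the abstract, so this is background only.

```latex
% Standard definitions; D is the denoiser, beta > 0 is a constant.
\begin{align*}
\text{non-expansive:}\quad
  & \|D(x) - D(y)\| \le \|x - y\| \\
\beta\text{-cocoercive:}\quad
  & \langle D(x) - D(y),\, x - y \rangle \ge \beta\, \|D(x) - D(y)\|^{2} \\
\text{conservative:}\quad
  & D = \nabla \phi \ \text{ for some scalar potential } \phi \\
\text{proximal operator:}\quad
  & \operatorname{prox}_{g}(x) = \arg\min_{u}\; g(u) + \tfrac{1}{2}\, \|u - x\|^{2}
\end{align*}
```

By the Cauchy-Schwarz inequality, $\beta$-cocoercivity implies $\|D(x) - D(y)\| \le \tfrac{1}{\beta}\|x - y\|$, so a cocoercive operator with $\beta < 1$ is allowed to be expansive. This is consistent with the abstract's remark that the CoCo denoiser need not be non-expansive.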
[CV-66] IntrinsicEdit: Precise generative image manipulation in intrinsic space SIGGRAPH2025
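[Quick read]: This paper addresses the lack of precise control in image editing with generative diffusion models and the tendency of existing methods to handle only a single editing task. The key to the solution is a versatile generative workflow that operates in an intrinsic-image latent space. Using exact diffusion inversion and disentangled channel manipulation, it enables semantic, local, pixel-precise editing while automatically resolving global illumination effects, without additional data collection or model fine-tuning.
Link: https://arxiv.org/abs/2505.08889
Authors: Linjie Lyu,Valentin Deschaintre,Yannick Hold-Geoffroy,Miloš Hašan,Jae Shin Yoon,Thomas Leimkühler,Christian Theobalt,Iliyan Georgiev
Affiliations: Adobe Research London; Adobe Research Quebec; Adobe Research San Jose; Max-Planck-Institute for Informatics, Saarland Informatics Campus
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH 2025 Journal track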
Abstract:Generative diffusion models have advanced image editing with high-quality results and intuitive interfaces such as prompts and semantic drawing. However, these interfaces lack precise control, and the associated methods typically specialize in a single editing task. We introduce a versatile, generative workflow that operates in an intrinsic-image latent space, enabling semantic, local manipulation with pixel precision for a range of editing operations. Building atop the RGB-X diffusion framework, we address key challenges of identity preservation and intrinsic-channel entanglement. By incorporating exact diffusion inversion and disentangled channel manipulation, we enable precise, efficient editing with automatic resolution of global illumination effects, all without additional data collection or model fine-tuning. We demonstrate state-of-the-art performance across a variety of tasks on complex images, including color and texture adjustments, object insertion and removal, global relighting, and their combinations.
[CV-67] Optimizing Neuro-Fuzzy and Colonial Competition Algorithms for Skin Cancer Diagnosis in Dermatoscopic Images
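[Quick read]: This paper addresses insufficient accuracy in skin cancer diagnosis caused by limited public awareness and a shortage of clinical experts, particularly in distinguishing malignant from benign skin lesions. The key to the solution is to combine image processing techniques with machine learning algorithms, specifically neuro-fuzzy and colonial competition approaches. Applied to dermoscopic images from the ISIC database, the method reaches a notable accuracy of 94%, supporting clinicians in the early detection of melanoma.
Link: https://arxiv.org/abs/2505.08886
Authors: Hamideh Khaleghpour,Brett McKinney
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages, 10 figures. Accepted at the 2nd Asia Pacific Computer Systems Conference (APCS 2024), March 15-17, 2024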
Abstract:The rising incidence of skin cancer, coupled with limited public awareness and a shortfall in clinical expertise, underscores an urgent need for advanced diagnostic aids. Artificial Intelligence (AI) has emerged as a promising tool in this domain, particularly for distinguishing malignant from benign skin lesions. Leveraging publicly available datasets of skin lesions, researchers have been developing AI-based diagnostic solutions. However, the integration of such computer systems in clinical settings is still nascent. This study aims to bridge this gap by employing a fusion of image processing techniques and machine learning algorithms, specifically neuro-fuzzy and colonial competition approaches. Applied to dermoscopic images from the ISIC database, our method achieved a notable accuracy of 94% on a dataset of 560 images. These results underscore the potential of our approach in aiding clinicians in the early detection of melanoma, thereby contributing significantly to skin cancer diagnostics.
[CV-68] Intelligent Road Anomaly Detection with Real-time Notification System for Enhanced Road Safety
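[Quick read]: This paper addresses road safety, in particular traffic accidents caused by road damage anomalies such as potholes and cracks. The key to the solution is a comprehensive system built from a Raspberry Pi, a camera module, a deep learning model, and a cloud service that detects and classifies potholes and cracks, counts them in real time, and broadcasts warning signals to nearby vehicles, so that authorities and drivers are notified promptly and accidents caused by road defects can be reduced.
Link: https://arxiv.org/abs/2505.08882
Authors: Ali Almakhluk,Uthman Baroudi,Yasser El-Alfy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
Comments: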
Abstract:This study aims to improve transportation safety, especially traffic safety. Road damage anomalies such as potholes and cracks have emerged as a significant and recurring cause of accidents. To tackle this problem and improve road safety, a comprehensive system has been developed to detect potholes and cracks (e.g., alligator, transverse, longitudinal), classify their sizes, and transmit this data to the cloud for appropriate action by authorities. The system also broadcasts warning signals to nearby vehicles if a severe anomaly is detected on the road. Moreover, the system can count road anomalies in real time. It is emulated using a Raspberry Pi, a camera module, a deep learning model, a laptop, and a cloud service. Deploying this solution aims to proactively enhance road safety by notifying relevant authorities and drivers about the presence of potholes and cracks so that they can take action, thereby mitigating potential accidents arising from this prevalent road hazard and leading to safer road conditions for the whole community.
[CV-69] Generative AI for Autonomous Driving: Frontiers and Opportunities
[Quick read]: This paper addresses the major engineering challenge of achieving fully autonomous driving (Level 5 autonomy), with the core idea of using Generative AI to improve the perception, decision-making, and planning capabilities of autonomous driving systems. The key to the solution is to use modern generative modeling techniques (variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion models, and large language models (LLMs)) for multimodal content generation and reasoning, and to apply them to image, LiDAR, trajectory, occupancy grid, and video generation as well as LLM-based decision making, in order to build high-fidelity digital twin systems, smart transportation networks, and embodied AI applications with cross-domain transfer.
Link: https://arxiv.org/abs/2505.08854
Authors: Yuping Wang,Shuo Xing,Cui Can,Renjie Li,Hongyuan Hua,Kexin Tian,Zhaobin Mo,Xiangbo Gao,Keshu Wu,Sulong Zhou,Hengxu You,Juntong Peng,Junge Zhang,Zehao Wang,Rui Song,Mingxuan Yan,Walter Zimmer,Xingcheng Zhou,Peiran Li,Zhaohan Lu,Chia-Ju Chen,Yue Huang,Ryan A. Rossi,Lichao Sun,Hongkai Yu,Zhiwen Fan,Frank Hao Yang,Yuhao Kang,Ross Greer,Chenxi Liu,Eun Hak Lee,Xuan Di,Xinyue Ye,Liu Ren,Alois Knoll,Xiaopeng Li,Shuiwang Ji,Masayoshi Tomizuka,Marco Pavone,Tianbao Yang,Jing Du,Ming-Hsuan Yang,Hua Wei,Ziran Wang,Yang Zhou,Jiachen Li,Zhengzhong Tu
Affiliations: Texas A&M University; University of California, Riverside; University of Michigan; Purdue University; Columbia University; University of Florida; Technische Universität München; University of Wisconsin-Madison; University of Texas at Austin; University of Notre Dame; Adobe Research; Lehigh University; Cleveland State University; Johns Hopkins University; University of California, Merced; University of Utah; Texas A&M Transportation Institute; Bosch Research North America; Bosch Center for Artificial Intelligence (BCAI); University of California, Berkeley; Stanford University; NVIDIA; Arizona State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering’s grandest challenges: achieving reliable, fully autonomous driving, particularly the pursuit of Level 5 autonomy. This survey delivers a comprehensive and critical synthesis of the emerging role of GenAI across the autonomous driving stack. We begin by distilling the principles and trade-offs of modern generative modeling, encompassing VAEs, GANs, Diffusion Models, and Large Language Models (LLMs). We then map their frontier applications in image, LiDAR, trajectory, occupancy, video generation as well as LLM-guided reasoning and decision making. We categorize practical applications, such as synthetic data workflows, end-to-end driving strategies, high-fidelity digital twin systems, smart transportation networks, and cross-domain transfer to embodied AI. We identify key obstacles and possibilities such as comprehensive generalization across rare cases, evaluation and safety checks, budget-limited implementation, regulatory compliance, ethical concerns, and environmental effects, while proposing research plans across theoretical assurances, trust metrics, transport integration, and socio-technical influence. By unifying these threads, the survey provides a forward-looking reference for researchers, engineers, and policymakers navigating the convergence of generative AI and advanced autonomous mobility. An actively maintained repository of cited works is available at this https URL.
[CV-70] Adaptive Security Policy Management in Cloud Environments Using Reinforcement Learning
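[Quick read]: This paper addresses the shortcomings of traditional static security policies in dynamic cloud environments such as AWS, where threats keep evolving and resources are elastic. The key to the solution is a security policy management framework based on reinforcement learning (RL) that uses deep reinforcement learning algorithms, such as deep Q-networks and proximal policy optimization, to learn and continuously adjust controls such as firewall rules and IAM policies. The framework analyzes cloud telemetry (AWS CloudTrail logs, network traffic data, and threat intelligence feeds) to keep refining security policies, maximizing threat mitigation and compliance while minimizing resource impact.
Link: https://arxiv.org/abs/2505.08837
Authors: Muhammad Saqib,Dipkumar Mehta,Fnu Yashu,Shubham Malhotra
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 10 pages, 6 figures, 1 table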
Abstract:The security of cloud environments, such as Amazon Web Services (AWS), is complex and dynamic. Static security policies have become inadequate as threats evolve and cloud resources exhibit elasticity [1]. This paper addresses the limitations of static policies by proposing a security policy management framework that uses reinforcement learning (RL) to adapt dynamically. Specifically, we employ deep reinforcement learning algorithms, including deep Q Networks and proximal policy optimization, enabling the learning and continuous adjustment of controls such as firewall rules and Identity and Access Management (IAM) policies. The proposed RL based solution leverages cloud telemetry data (AWS Cloud Trail logs, network traffic data, threat intelligence feeds) to continuously refine security policies, maximizing threat mitigation, and compliance while minimizing resource impact. Experimental results demonstrate that our adaptive RL based framework significantly outperforms static policies, achieving higher intrusion detection rates (92% compared to 82% for static policies) and substantially reducing incident detection and response times by 58%. In addition, it maintains high conformity with security requirements and efficient resource usage. These findings validate the effectiveness of adaptive reinforcement learning approaches in improving cloud security policy management.
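As a rough illustration of the value-based learning that deep Q-networks build on, the sketch below shows a single tabular Q-learning update with an epsilon-greedy policy. It is not the paper's implementation: the state names, the two actions, the reward, and the hyperparameters are all hypothetical, and a deep Q-network would replace the dictionary with a neural network.

```python
import random

# Hypothetical toy action set; the paper's real action space (firewall rules,
# IAM policies) is not specified in the abstract.
ACTIONS = ["keep_rule", "tighten_rule"]


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s, a) += alpha * (target - Q(s, a))."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    target = reward + gamma * best_next
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]


def choose_action(q, state, epsilon=0.1, rng=random):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))


q_table = {}
# Hypothetical transition: tightening a rule under suspicious traffic is rewarded.
value = q_update(q_table, "suspicious_traffic", "tighten_rule",
                 reward=1.0, next_state="normal_traffic")
```

After this one update the stored value for the rewarded state-action pair moves a fraction `alpha` of the way toward the target, which is the same temporal-difference idea a deep Q-network approximates at scale.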
[CV-71] Robustness Analysis against Adversarial Patch Attacks in Fully Unmanned Stores
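[Quick read]: This paper addresses adversarial patch attacks against AI-based automated checkout systems in fully unmanned stores. Such attacks can severely disrupt object detection models and lead to theft, inventory discrepancies, and system interference. The key contribution is a novel color histogram similarity loss function that exploits the attacker's knowledge of the color information of target-class objects to make adversarial patches more effective. The study also proposes a new bounding-box-based evaluation metric to analyze the practical impact of attacks more accurately, and verifies in black-box scenarios that shadow attacks raise the attack success rate.
Link: https://arxiv.org/abs/2505.08835
Authors: Hyunsik Na,Wonho Lee,Seungdeok Roh,Sohee Park,Daeseon Choi
Affiliations: Soongsil University
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: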
Abstract:The advent of convenient and efficient fully unmanned stores equipped with artificial intelligence-based automated checkout systems marks a new era in retail. However, these systems have inherent artificial intelligence security vulnerabilities, which are exploited via adversarial patch attacks, particularly in physical environments. This study demonstrated that adversarial patches can severely disrupt object detection models used in unmanned stores, leading to issues such as theft, inventory discrepancies, and interference. We investigated three types of adversarial patch attacks – Hiding, Creating, and Altering attacks – and highlighted their effectiveness. We also introduce the novel color histogram similarity loss function by leveraging attacker knowledge of the color information of a target class object. Besides the traditional confusion-matrix-based attack success rate, we introduce a new bounding-boxes-based metric to analyze the practical impact of these attacks. Starting with attacks on object detection models trained on snack and fruit datasets in a digital environment, we evaluated the effectiveness of adversarial patches in a physical testbed that mimicked a real unmanned store with RGB cameras and realistic conditions. Furthermore, we assessed the robustness of these attacks in black-box scenarios, demonstrating that shadow attacks can enhance success rates of attacks even without direct access to model parameters. Our study underscores the necessity for robust defense strategies to protect unmanned stores from adversarial threats. Highlighting the limitations of the current defense mechanisms in real-time detection systems and discussing various proactive measures, we provide insights into improving the robustness of object detection models and fortifying unmanned retail environments against these attacks.
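The abstract does not define the color histogram similarity loss, so the following is only a minimal sketch of one common way to compare color distributions: histogram intersection over a coarse joint RGB histogram. The function names, the bin count, and the choice of intersection are assumptions, and a real patch-optimization loss would need a differentiable histogram rather than the hard binning used here.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram; pixels are (r, g, b) tuples in 0..255."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
    total = sum(hist) or 1.0  # avoid dividing by zero for an empty pixel list
    return [h / total for h in hist]


def histogram_similarity_loss(patch_pixels, target_pixels, bins=4):
    """1 - histogram intersection: 0 when the two color distributions match."""
    hp = color_histogram(patch_pixels, bins)
    ht = color_histogram(target_pixels, bins)
    intersection = sum(min(a, b) for a, b in zip(hp, ht))
    return 1.0 - intersection


# Hypothetical pixel sets: an all-red target object and a half-red patch.
red_object = [(255, 0, 0)] * 4
mixed_patch = [(255, 0, 0), (0, 255, 0)]
loss = histogram_similarity_loss(mixed_patch, red_object)
```

Under this sketch, lowering the loss pushes the patch's color distribution toward that of the target-class object, which matches the intuition stated in the abstract.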
[CV-72] Crowd Scene Analysis using Deep Learning Techniques
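[Quick read]: This paper addresses two main problems in crowd scene analysis: crowd counting and anomaly detection. For crowd counting, conventional deep learning models depend on large amounts of annotated training data, and annotation is time-consuming and costly, so self-supervised learning is proposed as the remedy. For anomaly detection, to handle challenges such as lighting conditions, environmental changes, unexpected objects, and scalability, the paper proposes a spatio-temporal model based on VGG19 that combines a convolutional neural network (CNN) for spatial features with a long short-term memory network (LSTM) for temporal features to perform binary classification of normal versus abnormal behavior, and improves performance by replacing fully connected layers with dense residual blocks.
Link: https://arxiv.org/abs/2505.08834
Authors: Muhammad Junaid Asif
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MS Graduate Research Thesis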
Abstract:Our research is focused on two main applications of crowd scene analysis: crowd counting and anomaly detection. In recent years, a large number of studies have been presented in the domain of crowd counting. We addressed two main challenges in this domain. (1) Deep learning models are data-hungry paradigms and always need a large amount of annotated data for training. Annotating such a large amount of data is a time-consuming and costly task. Self-supervised training is proposed to deal with this challenge. (2) MCNN consists of multiple columns of CNN with different filter sizes; we present a novel approach based on a combination of self-supervised training and Multi-Column CNN. This enables the model to learn features at different levels and makes it effective in dealing with the challenges of occluded scenes, non-uniform density, complex backgrounds, and scale variation. The proposed model was evaluated on publicly available data sets such as ShanghaiTech and UCF-QNRF by means of MAE and MSE. A spatio-temporal model based on VGG19 is proposed for crowd anomaly detection, addressing challenges like lighting, environmental conditions, unexpected objects, and scalability. The model extracts spatial and temporal features, allowing it to be generalized to real-world scenes. Spatial features are learned using CNN, while temporal features are learned using LSTM blocks. The model works on binary classification and can detect normal or abnormal behavior. The model's performance is improved by replacing fully connected layers with dense residual blocks. Experiments on the Hockey Fight dataset and SCVD dataset show our models outperform other state-of-the-art approaches.
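The abstract reports MAE and MSE for crowd counting without giving formulas. The sketch below computes both over per-image counts, plus the root of the MSE, since crowd-counting papers often report the rooted value under the name "MSE"; which convention this thesis uses is not stated, so treat that as an assumption. The counts are made up for illustration.

```python
import math


def count_errors(predicted, actual):
    """Return (MAE, MSE, RMSE) over per-image crowd counts."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two equal-length, non-empty lists")
    diffs = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    return mae, mse, math.sqrt(mse)


# Hypothetical predicted and ground-truth counts for three images.
mae, mse, rmse = count_errors([105, 98, 210], [100, 100, 200])
```

MAE reflects the typical counting error, while the squared variants penalize large misses more heavily, which is why the two are usually reported together.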
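[CV-73] Generative AI for Urban Planning: Synthesizing Satellite Imagery via Diffusion Models
[Quick read]: This paper addresses the difficulty existing generative AI approaches have in producing realistic and practical urban planning designs at scale. The key to the solution is to combine a state-of-the-art Stable Diffusion model with ControlNet to generate high-fidelity satellite imagery conditioned on land use descriptions, infrastructure, and natural environments. Spatially linking satellite imagery with structured land use and constraint information from OpenStreetMap overcomes data availability limitations, and data from three major U.S. cities is used to verify that the model generates realistic and diverse urban landscapes.
Link: https://arxiv.org/abs/2505.08833
Authors: Qingyi Wang,Yuebing Liang,Yunhan Zheng,Kaiyuan Xu,Jinhua Zhao,Shenhao Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: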
Abstract:Generative AI offers new opportunities for automating urban planning by creating site-specific urban layouts and enabling flexible design exploration. However, existing approaches often struggle to produce realistic and practical designs at scale. Therefore, we adapt a state-of-the-art Stable Diffusion model, extended with ControlNet, to generate high-fidelity satellite imagery conditioned on land use descriptions, infrastructure, and natural environments. To overcome data availability limitations, we spatially link satellite imagery with structured land use and constraint information from OpenStreetMap. Using data from three major U.S. cities, we demonstrate that the proposed diffusion model generates realistic and diverse urban landscapes by varying land-use configurations, road networks, and water bodies, facilitating cross-city learning and design diversity. We also systematically evaluate the impacts of varying language prompts and control imagery on the quality of satellite imagery generation. Our model achieves high FID and KID scores and demonstrates robustness across diverse urban contexts. Qualitative assessments from urban planners and the general public show that generated images align closely with design descriptions and constraints, and are often preferred over real images. This work establishes a benchmark for controlled urban imagery generation and highlights the potential of generative AI as a tool for enhancing planning workflows and public engagement.
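[CV-74] Towards SFW sampling for diffusion models via external conditioning IJCNN2025
[Quick read]: This paper addresses the problem that Generative AI may produce not-safe-for-work (NSFW) content during image synthesis, such as violent images and non-consensual nudity. The key to the solution is a safe-for-work (SFW) sampler whose core is a Conditional Trajectory Correction step that uses multimodal models as the source of conditioning to steer samples away from undesired regions. The method also uses Contrastive Language-Image Pre-training (CLIP) so that users can define their own NSFW classes, which makes the safety control flexible and effective.
Link: https://arxiv.org/abs/2505.08817
Authors: Camilo Carvajal Reyes,Joaquín Fontbona,Felipe Tobar
Affiliations: Imperial College; Universidad de Chile; Imperial-X
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at IJCNN 2025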
Abstract:Score-based generative models (SBM), also known as diffusion models, are the de facto state of the art for image synthesis. Despite their unparalleled performance, SBMs have recently been in the spotlight for being tricked into creating not-safe-for-work (NSFW) content, such as violent images and non-consensual nudity. Current approaches that prevent unsafe generation are based on the models’ own knowledge, and the majority of them require fine-tuning. This article explores the use of external sources for ensuring safe outputs in SBMs. Our safe-for-work (SFW) sampler implements a Conditional Trajectory Correction step that guides the samples away from undesired regions in the ambient space using multimodal models as the source of conditioning. Furthermore, using Contrastive Language Image Pre-training (CLIP), our method admits user-defined NSFW classes, which can vary in different settings. Our experiments on the text-to-image SBM Stable Diffusion validate that the proposed SFW sampler effectively reduces the generation of explicit content while being competitive with other fine-tuning-based approaches, as assessed via independent NSFW detectors. Moreover, we evaluate the impact of the SFW sampler on image quality and show that the proposed correction scheme comes at a minor cost with negligible effect on samples not needing correction. Our study confirms the suitability of the SFW sampler towards aligned SBM models and the potential of using model-agnostic conditioning for the prevention of unwanted images.
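[CV-75] Towards Understanding Deep Learning Model in Image Recognition via Coverage Test
[Quick read]: This paper addresses the lack of empirical research on the relationships and patterns among coverage metrics used in security testing of deep neural networks (DNNs), in particular how model depth and configuration information relate to neural network coverage. The key to the solution is a series of experiments on different DNN architectures (LeNet, VGG, and ResNet) and 10 models of varying depth (5 to 54 layers) that analyze the relationships among four coverage metrics (primary functionality, boundary, hierarchy, and structural coverage), and that further examine the relationship between modified decision/condition coverage and dataset size.
Link: https://arxiv.org/abs/2505.08814
Authors: Wenkai Li,Xiaoqi Li,Yingjie Mao,Yishun Wang
Affiliations: Hainan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: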
Abstract:Deep neural networks (DNNs) play a crucial role in the field of artificial intelligence, and their security-related testing has been a prominent research focus. By inputting test cases, the behavior of models is examined for anomalies, and coverage metrics are utilized to determine the extent of neurons covered by these test cases. With the widespread application and advancement of DNNs, different types of neural behaviors have garnered attention, leading to the emergence of various coverage metrics for neural networks. However, there is currently a lack of empirical research on these coverage metrics, specifically in analyzing the relationships and patterns between model depth, configuration information, and neural network coverage. This paper aims to investigate the relationships and patterns of four coverage metrics: primary functionality, boundary, hierarchy, and structural coverage. A series of empirical experiments were conducted, selecting LeNet, VGG, and ResNet as different DNN architectures, along with 10 models of varying depths ranging from 5 to 54 layers, to compare and study the relationships between different depths, configuration information, and various neural network coverage metrics. Additionally, an investigation was carried out on the relationships between modified decision/condition coverage and dataset size. Finally, three potential future directions are proposed to further contribute to the security testing of DNN Models.
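To make the idea of "the extent of neurons covered by these test cases" concrete, here is a minimal sketch of basic neuron coverage: the fraction of neurons whose activation exceeds a threshold for at least one test input. This is a generic illustration and not necessarily any of the four metrics studied in the paper; the threshold and the activation values are hypothetical.

```python
def neuron_coverage(activations_per_input, threshold=0.5):
    """Fraction of neurons activated above threshold by at least one input.

    activations_per_input: one list of neuron activations per test input,
    all of the same length.
    """
    if not activations_per_input or not activations_per_input[0]:
        return 0.0
    n_neurons = len(activations_per_input[0])
    covered = [False] * n_neurons
    for activations in activations_per_input:
        for i, value in enumerate(activations):
            if value > threshold:
                covered[i] = True
    return sum(covered) / n_neurons


# Hypothetical activations of a 4-neuron layer for two test inputs.
coverage = neuron_coverage([[0.9, 0.1, 0.0, 0.2],
                            [0.3, 0.7, 0.0, 0.1]])
```

Adding test inputs can only keep this value the same or raise it, which is why coverage is used as a signal for how thoroughly a test suite exercises a model.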
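[CV-76] TUGS: Physics-based Compact Representation of Underwater Scenes by Tensorized Gaussian
[Quick read]: This paper addresses the difficulty of modeling the complex interactions among light propagation, the water medium, and object surfaces in underwater 3D scene reconstruction, as well as the high training and rendering costs that limit the practical use of existing methods in underwater robotic systems. The key to the solution is Tensorized Underwater Gaussian Splatting (TUGS), which combines lightweight tensorized higher-order Gaussians with a physics-based underwater Adaptive Medium Estimation (AME) module to simulate light attenuation and backscatter in underwater environments. This yields efficient, high-quality underwater image rendering while significantly reducing the parameter count and memory consumption.
Link: https://arxiv.org/abs/2505.08811
Authors: Shijie Lian,Ziyi Zhang,Laurence Tianruo Yang,Mengyu Ren,Debin Liu,Hua Li
Affiliations: Huazhong University of Science and Technology; The Chinese University of Hong Kong, Shenzhen; Zhengzhou University; Hainan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: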
Abstract:Underwater 3D scene reconstruction is crucial for underwater robotic perception and navigation. However, the task is significantly challenged by the complex interplay between light propagation, water medium, and object surfaces, with existing methods unable to model their interactions accurately. Additionally, expensive training and rendering costs limit their practical application in underwater robotic systems. Therefore, we propose Tensorized Underwater Gaussian Splatting (TUGS), which can effectively solve the modeling challenges of the complex interactions between object geometries and water media while achieving significant parameter reduction. TUGS employs lightweight tensorized higher-order Gaussians with a physics-based underwater Adaptive Medium Estimation (AME) module, enabling accurate simulation of both light attenuation and backscatter effects in underwater environments. Compared to other NeRF-based and GS-based methods designed for underwater, TUGS is able to render high-quality underwater images with faster rendering speeds and less memory usage. Extensive experiments on real-world underwater datasets have demonstrated that TUGS can efficiently achieve superior reconstruction quality using a limited number of parameters, making it particularly suitable for memory-constrained underwater UAV applications.
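The abstract names light attenuation and backscatter but gives no equations. The sketch below uses a widely used simplified underwater image formation model, in which the observed intensity is an attenuated direct signal plus a backscatter term that saturates toward the veiling light. Whether the AME module follows exactly this form is an assumption, and all parameter names and values are hypothetical.

```python
import math


def underwater_pixel(clear_value, depth, beta_attenuation,
                     beta_backscatter, veiling_light):
    """Observed intensity for one color channel of one pixel.

    direct      = J * exp(-beta_attenuation * z)
    backscatter = B_inf * (1 - exp(-beta_backscatter * z))
    """
    direct = clear_value * math.exp(-beta_attenuation * depth)
    backscatter = veiling_light * (1.0 - math.exp(-beta_backscatter * depth))
    return direct + backscatter


# Hypothetical values: a bright surface seen through 2 m of water.
observed = underwater_pixel(clear_value=0.8, depth=2.0,
                            beta_attenuation=0.3, beta_backscatter=0.2,
                            veiling_light=0.1)
```

At zero distance the model returns the clear-water color, and at large distance it converges to the veiling light, which is the washed-out haze typical of underwater images.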
[CV-77] SparseMeXT Unlocking the Potential of Sparse Representations for HD Map Construction
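[Quick read]: This paper addresses the performance gap between sparse and dense representations in online high-definition (HD) map construction. Existing sparse methods lack tailored designs and so cannot realize their potential, which leaves them at a disadvantage in the efficiency-performance trade-off. The key to the solution is a systematic improvement of sparse representation techniques: a dedicated network architecture optimized for sparse map feature extraction, a sparse-dense segmentation auxiliary task that makes better use of geometric and semantic cues, and a denoising module guided by physical priors that refines predictions. These improvements substantially raise the performance of sparse methods, which reach state-of-the-art results on the nuScenes dataset.
Link: https://arxiv.org/abs/2505.08808
Authors: Anqing Jiang,Jinhao Chai,Yu Gao,Yiru Wang,Yuwen Heng,Zhigang Sun,Hao Sun,Zezhong Zhao,Li Sun,Jian Zhou,Lijuan Zhu,Shugong Xu,Hao Zhao
Affiliations: Bosch Corporate Research, Bosch (China) Investment Ltd.; School of Communication and Information Engineering, Shanghai University; Bosch Mobility Solutions, Robert Bosch GmbH; AIR, Tsinghua University; Xi’an Jiaotong-Liverpool University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: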
Abstract:Recent advancements in high-definition (HD) map construction have demonstrated the effectiveness of dense representations, which heavily rely on computationally intensive bird’s-eye view (BEV) features. While sparse representations offer a more efficient alternative by avoiding dense BEV processing, existing methods often lag behind due to the lack of tailored designs. These limitations have hindered the competitiveness of sparse representations in online HD map construction. In this work, we systematically revisit and enhance sparse representation techniques, identifying key architectural and algorithmic improvements that bridge the gap with, and ultimately surpass, dense approaches. We introduce a dedicated network architecture optimized for sparse map feature extraction, a sparse-dense segmentation auxiliary task to better leverage geometric and semantic cues, and a denoising module guided by physical priors to refine predictions. Through these enhancements, our method achieves state-of-the-art performance on the nuScenes dataset, significantly advancing HD map construction and centerline detection. Specifically, SparseMeXt-Tiny reaches a mean average precision (mAP) of 55.5% at 32 frames per second (fps), while SparseMeXt-Base attains 65.2% mAP. Scaling the backbone and decoder further, SparseMeXt-Large achieves an mAP of 68.9% at over 20 fps, establishing a new benchmark for sparse representations in HD map construction. These results underscore the untapped potential of sparse methods, challenging the conventional reliance on dense representations and redefining efficiency-performance trade-offs in the field.
[CV-78] OptiGait-LGBM: An Efficient Approach of Gait-based Person Re-identification in Non-Overlapping Regions
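[Quick read]: This paper addresses the performance drop of video-based gait recognition systems in real-world use in uncontrolled outdoor environments, in particular the core challenges of non-overlapping camera views, varying illumination, and computational efficiency. The key to the solution is OptiGait-LGBM, a model based on a skeletal representation that builds its dataset from landmark positions to reduce memory usage and improve computational efficiency. The paper also introduces the RUET-GAIT benchmark dataset, which represents gait sequences in complex outdoor environments, to enable efficient and accurate person re-identification.
Link: https://arxiv.org/abs/2505.08801
Authors: Md. Sakib Hassan Chowdhury,Md. Hafiz Ahamed,Bishowjit Paul,Sarafat Hussain Abhi,Abu Bakar Siddique,Md. Robius Sany
Affiliations: Rajshahi University of Engineering & Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 12 pages, 17 figures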
Abstract:Gait recognition, known for its ability to identify individuals from a distance, has gained significant attention in recent times due to its non-intrusive verification. While video-based gait identification systems perform well on large public datasets, their performance drops when applied to real-world, unconstrained gait data due to various factors. Among these, uncontrolled outdoor environments, non-overlapping camera views, varying illumination, and computational efficiency are core challenges in gait-based authentication. Currently, no dataset addresses all these challenges simultaneously. In this paper, we propose an OptiGait-LGBM model capable of recognizing person re-identification under these constraints using a skeletal model approach, which helps mitigate inconsistencies in a person’s appearance. The model constructs a dataset from landmark positions, minimizing memory usage by using non-sequential data. A benchmark dataset, RUET-GAIT, is introduced to represent uncontrolled gait sequences in complex outdoor environments. The process involves extracting skeletal joint landmarks, generating numerical datasets, and developing an OptiGait-LGBM gait classification model. Our aim is to address the aforementioned challenges with minimal computational cost compared to existing methods. A comparative analysis with ensemble techniques such as Random Forest and CatBoost demonstrates that the proposed approach outperforms them in terms of accuracy, memory usage, and training time. This method provides a novel, low-cost, and memory-efficient video-based gait recognition solution for real-world scenarios.
[CV-79] Graph-based Online Monitoring of Train Driver States via Facial and Skeletal Features
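[Quick read]: This paper addresses driver fatigue in railway safety, where traditional systems such as the dead-man switch offer only limited alertness checks. The key to the solution is an online behavior-based monitoring system that uses a customized Directed-Graph Neural Network (DGNN) to classify train driver states into three categories: alert, not alert, and pathological. An ablation study compares skeletal-only, facial-only, and combined feature configurations, and shows that combining facial and skeletal features gives the highest accuracy for the three-class model (80.88%) and over 99% accuracy for binary alertness classification. The study also introduces a novel dataset that, for the first time, incorporates simulated pathological conditions into train driver monitoring, broadening the scope for assessing fatigue- and health-related risks.
Link: https://arxiv.org/abs/2505.08800
Authors: Olivia Nocentini,Marta Lagomarsino,Gokhan Solak,Younggeol Cho,Qiyi Tong,Marta Lorenzini,Arash Ajoudani
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: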
Abstract:Driver fatigue poses a significant challenge to railway safety, with traditional systems like the dead-man switch offering limited and basic alertness checks. This study presents an online behavior-based monitoring system utilizing a customised Directed-Graph Neural Network (DGNN) to classify train driver’s states into three categories: alert, not alert, and pathological. To optimize input representations for the model, an ablation study was performed, comparing three feature configurations: skeletal-only, facial-only, and a combination of both. Experimental results show that combining facial and skeletal features yields the highest accuracy (80.88%) in the three-class model, outperforming models using only facial or skeletal features. Furthermore, this combination achieves over 99% accuracy in the binary alertness classification. Additionally, we introduced a novel dataset that, for the first time, incorporates simulated pathological conditions into train driver monitoring, broadening the scope for assessing risks related to fatigue and health. This work represents a step forward in enhancing railway safety through advanced online monitoring using vision-based technologies.
[CV-80] Meta-learning Slice-to-Volume Reconstruction in Fetal Brain MRI using Implicit Neural Representations
[Quick read]: This paper addresses high-resolution slice-to-volume reconstruction (SVR) from multiple motion-corrupted low-resolution 2D slices, which is critical for image-based diagnostics of moving subjects such as fetal brain magnetic resonance imaging. Existing methods struggle with image artifacts and severe subject motion, or require slice pre-alignment to achieve satisfactory reconstruction. The paper proposes a new SVR method whose key idea is to perform motion correction, outlier handling, and super-resolution reconstruction entirely with implicit neural representations, which enables fast and accurate MRI reconstruction even under severe image and motion corruption.
Link: https://arxiv.org/abs/2505.09565
Authors: Maik Dannecker,Thomas Sanchez,Meritxell Bach Cuadra,Özgün Turgut,Anthony N. Price,Lucilio Cordero-Grande,Vanessa Kyriakopoulou,Joseph V. Hajnal,Daniel Rueckert
Affiliations: Technical University Munich; Lausanne University Hospital; University of Lausanne; King’s College London; Universidad Politécnica de Madrid; Imperial College London
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures
Abstract:High-resolution slice-to-volume reconstruction (SVR) from multiple motion-corrupted low-resolution 2D slices constitutes a critical step in image-based diagnostics of moving subjects, such as fetal brain Magnetic Resonance Imaging (MRI). Existing solutions struggle with image artifacts and severe subject motion or require slice pre-alignment to achieve satisfying reconstruction performance. We propose a novel SVR method to enable fast and accurate MRI reconstruction even in cases of severe image and motion corruption. Our approach performs motion correction, outlier handling, and super-resolution reconstruction with all operations being entirely based on implicit neural representations. The model can be initialized with task-specific priors through fully self-supervised meta-learning on either simulated or real-world data. In extensive experiments including over 480 reconstructions of simulated and clinical MRI brain data from different centers, we prove the utility of our method in cases of severe subject motion and image artifacts. Our results demonstrate improvements in reconstruction quality, especially in the presence of severe motion, compared to state-of-the-art methods, and up to 50% reduction in reconstruction time.
[CV-81] Spec2VolCAMU-Net: A Spectrogram-to-Volume Model for EEG-to-fMRI Reconstruction based on Multi-directional Time-Frequency Convolutional Attention Encoder and Vision-Mamba U-Net
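[Quick read]: This paper addresses the high cost and logistical complexity of high-resolution functional magnetic resonance imaging (fMRI) by generating comparable volumes directly from widely available scalp electroencephalography (EEG), which would make advanced neuroimaging more accessible. The key to the solution is Spec2VolCAMU-Net, a lightweight spectrogram-to-volume generator. It uses a Multi-directional Time-Frequency Convolutional Attention Encoder that stacks temporal, spectral, and joint convolutions with self-attention, and a Vision-Mamba U-Net decoder for efficient long-range spatial modeling, which overcomes the weak cross-channel time-frequency feature capture and the heavy computational cost of existing methods.
Link: https://arxiv.org/abs/2505.09521
Authors: Dongyi He,Shiyang Li,Bin Jiang,He Yan
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: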
Abstract:High-resolution functional magnetic resonance imaging (fMRI) is essential for mapping human brain activity; however, it remains costly and logistically challenging. If comparable volumes could be generated directly from widely available scalp electroencephalography (EEG), advanced neuroimaging would become significantly more accessible. Existing EEG-to-fMRI generators rely on plain CNNs that fail to capture cross-channel time-frequency cues or on heavy transformer/GAN decoders that strain memory and stability. We propose Spec2VolCAMU-Net, a lightweight spectrogram-to-volume generator that confronts these issues via a Multi-directional Time-Frequency Convolutional Attention Encoder, stacking temporal, spectral and joint convolutions with self-attention, and a Vision-Mamba U-Net decoder whose linear-time state-space blocks enable efficient long-range spatial modelling. Trained end-to-end with a hybrid SSI-MSE loss, Spec2VolCAMU-Net achieves state-of-the-art fidelity on three public benchmarks, recording SSIMs of 0.693 on NODDI, 0.725 on Oddball and 0.788 on CN-EPFL, representing improvements of 14.5%, 14.9%, and 16.9% respectively over previous best SSIM scores. Furthermore, it achieves competitive PSNR scores, particularly excelling on the CN-EPFL dataset with a 4.6% improvement over the previous best PSNR, thus striking a better balance in reconstruction quality. The proposed model is lightweight and efficient, making it suitable for real-time applications in clinical and research settings. The code is available at this https URL.
[CV-82] DCSNet: A Lightweight Knowledge Distillation-Based Model with Explainable AI for Lung Cancer Diagnosis from Histopathological Images
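[Quick read]: This paper addresses the high computational cost, heavy resource consumption, and lack of transparency that keep deep learning models for early lung cancer detection from being widely adopted in healthcare. The key to the solution is knowledge distillation, which transfers knowledge from complex teacher models (such as ResNet50 and EfficientNetB0) to a lightweight student model, the Distilled Custom Student Network (DCSNet), combined with explainable AI (XAI) techniques that improve model transparency. The approach aims to preserve diagnostic performance while fitting resource-constrained environments and strengthening trust in clinical use.
Link: https://arxiv.org/abs/2505.09334
Authors: Sadman Sakib Alif,Nasim Anzum Promise,Fiaz Al Abid,Aniqua Nusrat Zereen
Affiliations: North South University; Mahidol University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: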
Abstract:Lung cancer is a leading cause of cancer-related deaths globally, where early detection and accurate diagnosis are critical for improving survival rates. While deep learning, particularly convolutional neural networks (CNNs), has revolutionized medical image analysis by detecting subtle patterns indicative of early-stage lung cancer, its adoption faces challenges. These models are often computationally expensive and require significant resources, making them unsuitable for resource constrained environments. Additionally, their lack of transparency hinders trust and broader adoption in sensitive fields like healthcare. Knowledge distillation addresses these challenges by transferring knowledge from large, complex models (teachers) to smaller, lightweight models (students). We propose a knowledge distillation-based approach for lung cancer detection, incorporating explainable AI (XAI) techniques to enhance model transparency. Eight CNNs, including ResNet50, EfficientNetB0, EfficientNetB3, and VGG16, are evaluated as teacher models. We developed and trained a lightweight student model, Distilled Custom Student Network (DCSNet) using ResNet50 as the teacher. This approach not only ensures high diagnostic performance in resource-constrained settings but also addresses transparency concerns, facilitating the adoption of AI-driven diagnostic tools in healthcare.
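The abstract describes teacher-to-student knowledge transfer without stating the training objective. A minimal sketch of the commonly used distillation loss follows: a temperature-softened KL term between teacher and student outputs, blended with the usual cross-entropy on the true label. The temperature, the weighting `alpha`, and the example logits are hypothetical, and the loss actually used to train DCSNet may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher, p_student) if t > 0.0)
    hard = -math.log(softmax(student_logits)[label])
    return alpha * (temperature ** 2) * kl + (1.0 - alpha) * hard


# Hypothetical two-class logits (for example, benign vs. malignant).
matched = distillation_loss([0.0, 0.0], [0.0, 0.0], label=0)
mismatched = distillation_loss([0.0, 0.0], [2.0, 0.0], label=0)
```

When the student already reproduces the teacher's softened distribution, the KL term vanishes and only the label term remains; any disagreement with the teacher adds to the loss.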
[CV-83] Q-space Guided Collaborative Attention Translation Network for Flexible Diffusion-Weighted Images Synthesis MICCAI2025
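[Quick read]: This paper addresses the synthesis of multi-shell, high-angular resolution diffusion-weighted imaging (MS-HARDI) from flexible q-space sampling, using routinely acquired structural MRI (sMRI) data to obtain more accurate parameter maps and fiber tract estimates. The key to the solution is the Q-space Guided Collaborative Attention Translation Network (Q-CATN), which uses a collaborative attention mechanism to extract complementary information from multiple modalities and dynamically adjusts its internal representations based on flexible q-space information, so that no fixed sampling scheme is required. Task-specific constraints are also introduced to preserve the anatomical fidelity of the DWI.
Link: https://arxiv.org/abs/2505.09323
Authors: Pengli Zhu,Yingji Fu,Nanguang Chen,Anqi Qiu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025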
Abstract:In this study, we propose a novel Q-space Guided Collaborative Attention Translation Network (Q-CATN) for multi-shell, high-angular resolution DWI (MS-HARDI) synthesis from flexible q-space sampling, leveraging the commonly acquired structural MRI data. Q-CATN employs a collaborative attention mechanism to effectively extract complementary information from multiple modalities and dynamically adjust its internal representations based on flexible q-space information, eliminating the need for fixed sampling schemes. Additionally, we introduce a range of task-specific constraints to preserve anatomical fidelity in DWI, enabling Q-CATN to accurately learn the intrinsic relationships between directional DWI signal distributions and q-space. Extensive experiments on the Human Connectome Project (HCP) dataset demonstrate that Q-CATN outperforms existing methods, including 1D-qDL, 2D-qDL, MESC-SD, and QGAN, in estimating parameter maps and fiber tracts both quantitatively and qualitatively, while preserving fine-grained details. Notably, its ability to accommodate flexible q-space sampling highlights its potential as a promising toolkit for clinical and research applications. Our code is available at this https URL.
[CV-84] EDBench: Large-Scale Electron Density Data for Molecular Modeling
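[Quick read]: This paper addresses the fact that existing molecular machine learning force fields (MLFFs) overlook electron density (ED) during learning, even though ED is key to accurately understanding molecular force fields (MFFs). ED describes the probability of finding electrons at specific locations around atoms or molecules and, by the Hohenberg-Kohn theorem, uniquely determines the ground-state properties of many-particle systems. Because computing ED relies on time-consuming first-principles density functional theory (DFT), large-scale ED data is scarce, which limits its use in MLFFs. The proposed solution is EDBench, a large-scale, high-quality ED dataset built on PCQM4Mv2 that provides accurate ED data for 3.3 million molecules, together with a suite of ED-centric benchmark tasks for comprehensively evaluating how well models understand and use electronic information. The key is that EDBench strengthens learning-based research at the electronic scale, and that learning-based methods can calculate ED with precision comparable to traditional DFT at a much lower computational cost.
Link: https://arxiv.org/abs/2505.09262
Authors: Hongxin Xiang,Ke Li,Mingquan Liu,Zhixiang Cheng,Bin Yao,Wenjie Du,Jun Xia,Li Zeng,Xin Jin,Xiangxiang Zeng
Affiliations: Hunan University; East China Normal University; University of Science and Technology of China; Westlake University; Eastern Institute of Technology
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: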
Abstract:Existing molecular machine learning force fields (MLFFs) generally focus on the learning of atoms, molecules, and simple quantum chemical properties (such as energy and force), but ignore the importance of electron density (ED) \rho(r) in accurately understanding molecular force fields (MFFs). ED describes the probability of finding electrons at specific locations around atoms or molecules, which uniquely determines all ground state properties (such as energy, molecular structure, etc.) of interactive multi-particle systems according to the Hohenberg-Kohn theorem. However, the calculation of ED relies on the time-consuming first-principles density functional theory (DFT) which leads to the lack of large-scale ED data and limits its application in MLFFs. In this paper, we introduce EDBench, a large-scale, high-quality dataset of ED designed to advance learning-based research at the electronic scale. Built upon the PCQM4Mv2, EDBench provides accurate ED data, covering 3.3 million molecules. To comprehensively evaluate the ability of models to understand and utilize electronic information, we design a suite of ED-centric benchmark tasks spanning prediction, retrieval, and generation. Our evaluation on several state-of-the-art methods demonstrates that learning from EDBench is not only feasible but also achieves high accuracy. Moreover, we show that learning-based methods can efficiently calculate ED with comparable precision while significantly reducing the computational cost relative to traditional DFT calculations. All data and benchmarks from EDBench will be freely available, laying a robust foundation for ED-driven drug discovery and materials science.
[CV-85] BiECVC: Gated Diversification of Bidirectional Contexts for Learned Video Compression
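[Quick read]: This paper addresses the performance gap in learned bidirectional video compression (BVC), namely its weaker context modeling and adaptability compared with forward-only methods. The key to its solution is the BiECVC framework, which combines diversified local and non-local context modeling with adaptive context gating. Local context is enhanced by reusing high-quality features from lower layers and aligning them with decoded motion vectors, while a linear attention mechanism models non-local dependencies efficiently. Bidirectional Context Gating, inspired by data-dependent decay in recent autoregressive language models, dynamically filters contextual information based on conditional coding results, mitigating the impact of inaccurate context prediction.
Link: https://arxiv.org/abs/2505.09193
Authors: Wei Jiang,Junru Li,Kai Zhang,Li Zhang
Affiliations: Bytedance; Bytedance San Diego; Bytedance Shenzhen
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: The first learned video codec that surpasses VTM 13.2 RA across all standard test datasets. Code will be available at this https URL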
Abstract:Recent forward prediction-based learned video compression (LVC) methods have achieved impressive results, even surpassing VVC reference software VTM under the Low Delay B (LDB) configuration. In contrast, learned bidirectional video compression (BVC) remains underexplored and still lags behind its forward-only counterparts. This performance gap is mainly due to the limited ability to extract diverse and accurate contexts: most existing BVCs primarily exploit temporal motion while neglecting non-local correlations across frames. Moreover, they lack the adaptability to dynamically suppress harmful contexts arising from fast motion or occlusion. To tackle these challenges, we propose BiECVC, a BVC framework that incorporates diversified local and non-local context modeling along with adaptive context gating. For local context enhancement, BiECVC reuses high-quality features from lower layers and aligns them using decoded motion vectors without introducing extra motion this http URL model non-local dependencies efficiently, we adopt a linear attention mechanism that balances performance and complexity. To further mitigate the impact of inaccurate context prediction, we introduce Bidirectional Context Gating, inspired by data-dependent decay in recent autoregressive language models, to dynamically filter contextual information based on conditional coding results. Extensive experiments demonstrate that BiECVC achieves state-of-the-art performance, reducing the bit-rate by 13.4% and 15.7% compared to VTM 13.2 under the Random Access (RA) configuration with intra periods of 32 and 64, respectively. To our knowledge, BiECVC is the first learned video codec to surpass VTM 13.2 RA across all standard test datasets. Code will be available at this https URL.
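The abstract mentions a linear attention mechanism for non-local dependencies without giving its form. As a rough illustration of why such mechanisms scale linearly with the number of tokens, here is a minimal sketch of one well-known variant (kernelized attention with the feature map elu(x) + 1). The feature map, the non-causal form, and the toy inputs are assumptions; BiECVC's actual module may differ.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1, applied elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Non-causal linear attention over lists of query, key, and value vectors."""
    d_k, d_v = len(K[0]), len(V[0])
    # Accumulate S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j) once,
    # so no token-by-token attention matrix is ever formed.
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d_k))
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out

# Toy inputs: two tokens sharing the same value vector.
result = linear_attention([[0.1, -0.2]], [[1.0, 0.0], [0.0, 1.0]],
                          [[2.0, 3.0], [2.0, 3.0]])
```

Each output is a weighted average of the value vectors, so when all values are identical the output equals that value regardless of the queries and keys.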
[CV-86] Validation of Conformal Prediction in Cervical Atypia Classification
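[Quick read]: This paper addresses the problem that deep learning models for cervical cancer classification are overconfident, do not reliably reflect diagnostic uncertainty, and produce maximum-likelihood predictions that fail to convey uncertainty or ambiguity. The key to its solution is conformal prediction, a model-agnostic framework that generates prediction sets of likely classes for a trained deep learning model, where the set size reflects model uncertainty. The study evaluates prediction sets by their agreement with annotations from multiple human experts rather than only by whether they cover the true class, so that the predictions are more trustworthy and useful to end users.
Link: https://arxiv.org/abs/2505.08845
Authors: Misgina Tsighe Hagos,Antti Suutala,Dmitrii Bychkov,Hakan Kücükel,Joar von Bahr,Milda Poceviciute,Johan Lundin,Nina Linder,Claes Lundström
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: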
Abstract:Deep learning based cervical cancer classification can potentially increase access to screening in low-resource regions. However, deep learning models are often overconfident and do not reliably reflect diagnostic uncertainty. Moreover, they are typically optimized to generate maximum-likelihood predictions, which fail to convey uncertainty or ambiguity in their results. Such challenges can be addressed using conformal prediction, a model-agnostic framework for generating prediction sets that contain likely classes for trained deep-learning models. The size of these prediction sets indicates model uncertainty, contracting as model confidence increases. However, existing conformal prediction evaluation primarily focuses on whether the prediction set includes or covers the true class, often overlooking the presence of extraneous classes. We argue that prediction sets should be truthful and valuable to end users, ensuring that the listed likely classes align with human expectations rather than being overly relaxed and including false positives or unlikely classes. In this study, we comprehensively validate conformal prediction sets using expert annotation sets collected from multiple annotators. We evaluate three conformal prediction approaches applied to three deep-learning models trained for cervical atypia classification. Our expert annotation-based analysis reveals that conventional coverage-based evaluations overestimate performance and that current conformal prediction methods often produce prediction sets that are not well aligned with human labels. Additionally, we explore the capabilities of the conformal prediction methods in identifying ambiguous and out-of-distribution data.
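To make the notion of a prediction set concrete, the following is a minimal sketch of split conformal prediction with the simple score 1 - p(true class). It is only one of several conformal approaches and is not claimed to be any of the three evaluated in the paper; the calibration probabilities and the miscoverage level `alpha` are hypothetical.

```python
import math

def conformal_threshold(true_class_probs, alpha=0.1):
    """Conformal quantile of the scores 1 - p(true class) on calibration data."""
    scores = sorted(1.0 - p for p in true_class_probs)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return 1.0  # too few calibration points: include every class
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """All classes whose score 1 - p is within the calibrated threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]

# Hypothetical softmax probabilities assigned to the true class on 9 calibration images.
calibration = [0.9, 0.8, 0.35, 0.95, 0.6, 0.85, 0.5, 0.3, 0.45]
threshold = conformal_threshold(calibration, alpha=0.25)

confident = prediction_set([0.9, 0.05, 0.05], threshold)  # one clear class
ambiguous = prediction_set([0.5, 0.4, 0.1], threshold)    # two plausible classes
```

The ambiguous input yields a larger set than the confident one, which is the behaviour the abstract describes: set size tracks model uncertainty. Coverage-based evaluation would only ask whether the true class is in the set, whereas the paper also asks whether the extra classes match expert annotations.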
[CV-87] Total Variation-Based Image Decomposition and Denoising for Microscopy Images
[Quick read]: This paper addresses the degradation of microscopy images by noise and other unwanted signals, which can hide relevant features. The key to its solution is a total variation (TV) based workflow for image decomposition and denoising, which either extracts the unwanted signal components and subtracts them from the raw image or denoises the image directly. The study evaluates TV-L^1, Huber-ROF, and TGV-L^1 on distinct cases; Huber-ROF proved the most flexible, while TGV-L^1 is the most suitable for denoising.
Link: https://arxiv.org/abs/2505.08843
Authors: Marco Corrias,Giada Franceschi,Michele Riva,Alberto Tampieri,Karin Föttinger,Ulrike Diebold,Thomas Pock,Cesare Franchini
Affiliations: University of Vienna; TU Wien; TU Graz; University of Bologna
Categories: Image and Video Processing (eess.IV); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Experimentally acquired microscopy images are unavoidably affected by the presence of noise and other unwanted signals, which degrade their quality and might hide relevant features. With the recent increase in image acquisition rate, modern denoising and restoration solutions become necessary. This study focuses on image decomposition and denoising of microscopy images through a workflow based on total variation (TV), addressing images obtained from various microscopy techniques, including atomic force microscopy (AFM), scanning tunneling microscopy (STM), and scanning electron microscopy (SEM). Our approach consists of restoring an image by extracting its unwanted signal components and subtracting them from the raw one, or by denoising it. We evaluate the performance of TV-L^1, Huber-ROF, and TGV-L^1 in achieving this goal in distinct study cases. Huber-ROF proved to be the most flexible one, while TGV-L^1 is the most suitable for denoising. Our results suggest a wider applicability of this method in microscopy, not restricted to STM, AFM, and SEM images. The Python code used for this study is publicly available as part of AiSurf. It is designed to be integrated into experimental workflows for image acquisition or can be used to denoise previously acquired images.
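As an illustration of the Huber-ROF model named above, the sketch below minimizes a quadratic data term plus a Huber-smoothed total variation penalty on a 1D signal. It uses plain gradient descent for readability, which is a swapped-in solver and not the AiSurf implementation; the regularization weight `lam`, the Huber parameter `delta`, and the test signal are assumptions.

```python
def huber_grad(d, delta):
    """Derivative of the Huber function: linear near zero, sign(d) beyond delta."""
    if abs(d) <= delta:
        return d / delta
    return 1.0 if d > 0 else -1.0

def total_variation(u):
    """Sum of absolute differences between neighbouring samples."""
    return sum(abs(u[i + 1] - u[i]) for i in range(len(u) - 1))

def huber_rof_1d(f, lam=0.2, delta=0.05, iters=2000):
    """Minimise 0.5*sum((u-f)^2) + lam*sum(huber(u[i+1]-u[i])) by gradient descent."""
    u = list(f)
    n = len(u)
    step = 1.0 / (1.0 + 4.0 * lam / delta)  # inverse of a Lipschitz bound on the gradient
    for _ in range(iters):
        g = [u[i] - f[i] for i in range(n)]          # data-term gradient
        for i in range(n - 1):
            h = lam * huber_grad(u[i + 1] - u[i], delta)
            g[i] -= h                                # d/du[i] of the penalty on u[i+1]-u[i]
            g[i + 1] += h                            # d/du[i+1] of the same penalty
        u = [u[i] - step * g[i] for i in range(n)]
    return u

# Hypothetical noisy step edge, standing in for one scan line of an image.
noisy = [0.0, 0.1, -0.1, 0.05, 0.0, 1.0, 0.9, 1.1, 1.0, 0.95]
denoised = huber_rof_1d(noisy)
```

In the decomposition use case the abstract describes, the quantity of interest would be the residual `noisy[i] - denoised[i]` (or the smooth part itself), which is subtracted from the raw data rather than kept.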
[CV-88] Ultrasound Report Generation with Multimodal Large Language Models for Standardized Texts
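[Quick read]: This paper addresses the challenges of ultrasound (US) report generation, including the variability of US images, operator dependence, and the need for standardized text. Because US imaging lacks consistent datasets, automation is particularly difficult. The key to its solution is a unified framework for multi-organ and multilingual US report generation that uses fragment-based multilingual training and exploits the standardized nature of US reports to produce consistent, clinically accurate text across organs and languages. Fine-tuning with selective unfreezing of the vision transformer (ViT) further improves text-image alignment.
Link: https://arxiv.org/abs/2505.08838
Authors: Peixuan Ge,Tongkun Su,Faqin Lv,Baoliang Zhao,Peng Zhang,Chi Hong Wong,Liang Yao,Yu Sun,Zenan Wang,Pak Kin Wong,Ying Hu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: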
Abstract:Ultrasound (US) report generation is a challenging task due to the variability of US images, operator dependence, and the need for standardized text. Unlike X-ray and CT, US imaging lacks consistent datasets, making automation difficult. In this study, we propose a unified framework for multi-organ and multilingual US report generation, integrating fragment-based multilingual training and leveraging the standardized nature of US reports. By aligning modular text fragments with diverse imaging data and curating a bilingual English-Chinese dataset, the method achieves consistent and clinically accurate text generation across organ sites and languages. Fine-tuning with selective unfreezing of the vision transformer (ViT) further improves text-image alignment. Compared to the previous state-of-the-art KMVE method, our approach achieves relative gains of about 2% in BLEU scores, approximately 3% in ROUGE-L, and about 15% in CIDEr, while significantly reducing errors such as missing or incorrect content. By unifying multi-organ and multi-language report generation into a single, scalable framework, this work demonstrates strong potential for real-world clinical workflows.
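[CV-89] Thoughts on Objectives of Sparse and Hierarchical Masked Image Model
[Quick read]: This paper examines how the masking strategy used in image pre-training affects model performance in self-supervised learning. The key to its solution is a new mask pattern, the Mesh Mask, applied to image masking in the SparK model; the paper reports the effect of this pattern on performance.
Link: https://arxiv.org/abs/2505.08819
Authors: Asahi Miyazaki,Tsuyoshi Okita
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 11 figures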
Abstract:Masked image modeling is one of the most popular objectives of training. Recently, the SparK model has been proposed with superior performance among self-supervised learning models. This paper proposes a new mask pattern for this SparK model, proposing it as the Mesh Mask-ed SparK model. We report the effect of the mask pattern used for image masking in pre-training on performance.
[CV-90] In-Context Learning for Label-Efficient Cancer Image Classification in Oncology
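[Quick read]: This paper addresses the limited adoption of AI in oncology, which stems from its reliance on large annotated datasets and the need to retrain models for domain-specific diagnostic tasks. The key to its solution is in-context learning (ICL) as a practical alternative to model fine-tuning: the model adapts to a new diagnostic task from only a few labeled examples supplied at inference time, with no parameter updates.
Link: https://arxiv.org/abs/2505.08798
Authors: Mobina Shrestha,Bishwas Mandal,Vishal Mandal,Asis Shrestha
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: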
Abstract:The application of AI in oncology has been limited by its reliance on large, annotated datasets and the need for retraining models for domain-specific diagnostic tasks. Taking heed of these limitations, we investigated in-context learning as a pragmatic alternative to model retraining by allowing models to adapt to new diagnostic tasks using only a few labeled examples at inference, without the need for retraining. Using four vision-language models (VLMs)-Paligemma, CLIP, ALIGN and GPT-4o, we evaluated the performance across three oncology datasets: MHIST, PatchCamelyon and HAM10000. To the best of our knowledge, this is the first study to compare the performance of multiple VLMs on different oncology classification tasks. Without any parameter updates, all models showed significant gains with few-shot prompting, with GPT-4o reaching an F1 score of 0.81 in binary classification and 0.60 in multi-class classification settings. While these results remain below the ceiling of fully fine-tuned systems, they highlight the potential of ICL to approximate task-specific behavior using only a handful of examples, reflecting how clinicians often reason from prior cases. Notably, open-source models like Paligemma and CLIP demonstrated competitive gains despite their smaller size, suggesting feasibility for deployment in computing constrained clinical environments. Overall, these findings highlight the potential of ICL as a practical solution in oncology, particularly for rare cancers and resource-limited contexts where fine-tuning is infeasible and annotated data is difficult to obtain.
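Artificial Intelligence
[AI-0] How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference
[Quick read]: This paper addresses the lack of assessment of the environmental footprint of large language model (LLM) inference: existing studies often exclude proprietary models, overlook infrastructural variability and overhead, or focus solely on training, even as inference accounts for a growing share of the environmental impact. The key to its solution is a new infrastructure-aware benchmarking framework that combines public API performance data, region-specific environmental multipliers, and statistical inference of hardware configurations, and that uses cross-efficiency Data Envelopment Analysis (DEA) to rank models by performance relative to environmental cost, thereby quantifying the footprint of LLM inference.
Link: https://arxiv.org/abs/2505.09598
Authors: Nidhal Jegham,Marwen Abdelatti,Lassad Elmoubarki,Abdeltawab Hendawi
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: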
Abstract:As large language models (LLMs) spread across industries, understanding their environmental footprint at the inference level is no longer optional; it is essential. However, most existing studies exclude proprietary models, overlook infrastructural variability and overhead, or focus solely on training, even as inference increasingly dominates AI’s environmental impact. To bridge this gap, this paper introduces a novel infrastructure-aware benchmarking framework for quantifying the environmental footprint of LLM inference across 30 state-of-the-art models as deployed in commercial data centers. Our framework combines public API performance data with region-specific environmental multipliers and statistical inference of hardware configurations. We additionally utilize cross-efficiency Data Envelopment Analysis (DEA) to rank models by performance relative to environmental cost. Our results show that o3 and DeepSeek-R1 emerge as the most energy-intensive models, consuming over 33 Wh per long prompt, more than 70 times the consumption of GPT-4.1 nano, and that Claude-3.7 Sonnet ranks highest in eco-efficiency. While a single short GPT-4o query consumes 0.43 Wh, scaling this to 700 million queries/day results in substantial annual environmental impacts. These include electricity use comparable to 35,000 U.S. homes, freshwater evaporation matching the annual drinking needs of 1.2 million people, and carbon emissions requiring a Chicago-sized forest to offset. These findings illustrate a growing paradox: although individual queries are efficient, their global scale drives disproportionate resource consumption. Our study provides a standardized, empirically grounded methodology for benchmarking the sustainability of LLM deployments, laying a foundation for future environmental accountability in AI development and sustainability standards.
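The scaling argument in this abstract can be followed with simple arithmetic. The sketch below multiplies the two figures quoted in the abstract (0.43 Wh per short GPT-4o query and 700 million queries per day); the constant daily volume and the 365-day year are assumptions. This naive product is not the paper's full estimate and should not be expected to reproduce its household, water, or carbon comparisons, which rest on modeling not described in the abstract.

```python
WH_PER_QUERY = 0.43       # short GPT-4o query, as quoted in the abstract
QUERIES_PER_DAY = 700e6   # as quoted in the abstract
DAYS_PER_YEAR = 365       # assumption: constant volume all year

daily_mwh = WH_PER_QUERY * QUERIES_PER_DAY / 1e6   # Wh -> MWh
annual_gwh = daily_mwh * DAYS_PER_YEAR / 1e3       # MWh -> GWh

print(f"{daily_mwh:.0f} MWh per day, {annual_gwh:.1f} GWh per year")
```

Even under these simplifying assumptions, a sub-watt-hour query adds up to roughly 301 MWh per day, which is the paradox the abstract points to: small per-query costs, large aggregate consumption.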
[AI-1] Ethics and Persuasion in Reinforcement Learning from Human Feedback: A Procedural Rhetorical Approach
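[Quick read]: This paper addresses the ethical, sociotechnical, and pedagogical issues raised by generative AI chatbots whose outputs are optimized with Reinforcement Learning from Human Feedback (RLHF), in particular their effect on language conventions, information-seeking practices, and expectations for social relationships. The key to its approach is Ian Bogost's concept of procedural rhetoric, which shifts the focus of rhetorical analysis from the persuasiveness of generated content to the persuasive mechanisms built into RLHF-enhanced large language models (LLMs), revealing how AI-driven technologies might reinforce hegemonic language use, perpetuate biases, decontextualize learning, and encroach upon human relationships.
Link: https://arxiv.org/abs/2505.09576
Authors: Shannon Lodoen,Alexi Orchard
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 10 pages, 1 figure, Accepted version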
Abstract:Since 2022, versions of generative AI chatbots such as ChatGPT and Claude have been trained using a specialized technique called Reinforcement Learning from Human Feedback (RLHF) to fine-tune language model output using feedback from human annotators. As a result, the integration of RLHF has greatly enhanced the outputs of these large language models (LLMs) and made the interactions and responses appear more “human-like” than those of previous versions using only supervised learning. The increasing convergence of human and machine-written text has potentially severe ethical, sociotechnical, and pedagogical implications relating to transparency, trust, bias, and interpersonal relations. To highlight these implications, this paper presents a rhetorical analysis of some of the central procedures and processes currently being reshaped by RLHF-enhanced generative AI chatbots: upholding language conventions, information seeking practices, and expectations for social relationships. Rhetorical investigations of generative AI and LLMs have, to this point, focused largely on the persuasiveness of the content generated. Using Ian Bogost’s concept of procedural rhetoric, this paper shifts the site of rhetorical investigation from content analysis to the underlying mechanisms of persuasion built into RLHF-enhanced LLMs. In doing so, this theoretical investigation opens a new direction for further inquiry in AI ethics that considers how procedures rerouted through AI-driven technologies might reinforce hegemonic language use, perpetuate biases, decontextualize learning, and encroach upon human relationships. It will therefore be of interest to educators, researchers, scholars, and the growing number of users of generative AI chatbots.
[AI-2] Learning Long-Context Diffusion Policies via Past-Token Prediction
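[Quick read]: This paper addresses the challenge of learning long-context policies for reasoning over long sequences of observations and actions, in particular the rising training cost and degraded policy performance as context length grows. The key to its solution is an auxiliary task called Past-Token Prediction (PTP), in which the policy predicts past action tokens alongside future ones, explicitly regularizing the retention of past information and improving temporal modeling in the policy head. The paper also proposes a multistage training strategy that keeps the benefits of PTP while greatly reducing memory and computational overhead.
Link: https://arxiv.org/abs/2505.09561
Authors: Marcel Torne,Andy Tang,Yuejiang Liu,Chelsea Finn
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Videos are available at this https URL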
Abstract:Reasoning over long sequences of observations and actions is essential for many robotic tasks. Yet, learning effective long-context policies from demonstrations remains challenging. As context length increases, training becomes increasingly expensive due to rising memory demands, and policy performance often degrades as a result of spurious correlations. Recent methods typically sidestep these issues by truncating context length, discarding historical information that may be critical for subsequent decisions. In this paper, we propose an alternative approach that explicitly regularizes the retention of past information. We first revisit the copycat problem in imitation learning and identify an opposite challenge in recent diffusion policies: rather than over-relying on prior actions, they often fail to capture essential dependencies between past and future actions. To address this, we introduce Past-Token Prediction (PTP), an auxiliary task in which the policy learns to predict past action tokens alongside future ones. This regularization significantly improves temporal modeling in the policy head, with minimal reliance on visual representations. Building on this observation, we further introduce a multistage training strategy: pre-train the visual encoder with short contexts, and fine-tune the policy head using cached long-context embeddings. This strategy preserves the benefits of PTP while greatly reducing memory and computational overhead. Finally, we extend PTP into a self-verification mechanism at test time, enabling the policy to score and select candidates consistent with past actions during inference. Experiments across four real-world and six simulated tasks demonstrate that our proposed method improves the performance of long-context diffusion policies by 3x and accelerates policy training by more than 10x.
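Two ideas from this abstract lend themselves to a small sketch: training on past tokens as well as future ones, and using the predicted past to pick among sampled candidates at test time. The code below is a deliberately simplified illustration in which action tokens are scalars and a mean squared error stands in for the diffusion training objective; the function names, the `past_weight` parameter, and the candidate format are hypothetical.

```python
def ptp_loss(pred_past, pred_future, true_past, true_future, past_weight=1.0):
    """Error on future tokens plus a weighted auxiliary error on past tokens."""
    def mse(pred, true):
        return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)
    return mse(pred_future, true_future) + past_weight * mse(pred_past, true_past)

def select_candidate(candidates, executed_past):
    """Self-verification: keep the sample whose predicted past best matches history."""
    def past_error(cand):
        return sum((p - a) ** 2 for p, a in zip(cand["past"], executed_past))
    return min(candidates, key=past_error)

# Hypothetical data: actions already executed and two sampled plans.
executed = [0.1, 0.2, 0.3]
candidates = [
    {"past": [0.5, 0.5, 0.5], "future": [0.9, 1.0]},
    {"past": [0.1, 0.2, 0.35], "future": [0.4, 0.5]},
]
chosen = select_candidate(candidates, executed)
```

The second candidate is selected because its reconstruction of the executed actions is closer, so its future actions are treated as more consistent with what the robot has actually done.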
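[AI-3] rfPG: Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs IJCAI2025
[Quick read]: This paper addresses the lack of robustness of policies for partially observable Markov decision processes (POMDPs) under perturbations of the environment: a policy that is optimal for one POMDP may fail when the environment model changes. To capture this, the authors propose hidden-model POMDPs (HM-POMDPs), which describe a set of different environment models sharing one action and observation space. The key to the solution is combining two orthogonal techniques: (1) a formal verification technique that computes a worst-case POMDP within the HM-POMDP, enabling scalable robust policy evaluation; and (2) subgradient ascent that optimizes the candidate policy for that worst-case POMDP, yielding robust policies that perform well across the environment models.
Link: https://arxiv.org/abs/2505.09518
Authors: Maris F. L. Galesloot,Roman Andriushchenko,Milan Češka,Sebastian Junges,Nils Jansen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication at IJCAI 2025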
Abstract:Partially observable Markov decision processes (POMDPs) model specific environments in sequential decision-making under uncertainty. Critically, optimal policies for POMDPs may not be robust against perturbations in the environment. Hidden-model POMDPs (HM-POMDPs) capture sets of different environment models, that is, POMDPs with a shared action and observation space. The intuition is that the true model is hidden among a set of potential models, and it is unknown which model will be the environment at execution time. A policy is robust for a given HM-POMDP if it achieves sufficient performance for each of its POMDPs. We compute such robust policies by combining two orthogonal techniques: (1) a deductive formal verification technique that supports tractable robust policy evaluation by computing a worst-case POMDP within the HM-POMDP and (2) subgradient ascent to optimize the candidate policy for a worst-case POMDP. The empirical evaluation shows that, compared to various baselines, our approach (1) produces policies that are more robust and generalize better to unseen POMDPs and (2) scales to HM-POMDPs that consist of over a hundred thousand environments.
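The alternation between worst-case evaluation and a gradient step can be shown on a toy problem. The sketch below strips away partial observability and memory entirely: each hypothetical "model" is just a reward vector over three actions, the worst case is found by enumeration instead of formal verification, and the policy is a softmax. It illustrates only the max-min structure, not the rfPG algorithm itself.

```python
import math

# Hypothetical environment models: each assigns a reward to the three actions.
MODELS = [[1.0, 0.0, 0.6], [0.0, 1.0, 0.6]]

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def value(policy, rewards):
    """Expected reward of a stochastic policy under one model."""
    return sum(p * r for p, r in zip(policy, rewards))

def worst_case(policy):
    """Robust evaluation: the lowest value over all models, with its model."""
    return min(((value(policy, r), r) for r in MODELS), key=lambda pair: pair[0])

def robust_policy(steps=2000, lr=0.5):
    theta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        pi = softmax(theta)
        v, rewards = worst_case(pi)
        # Gradient of the worst-case model's value w.r.t. the softmax parameters;
        # this serves as an ascent direction for the minimum over models.
        theta = [t + lr * p * (r - v) for t, p, r in zip(theta, pi, rewards)]
    return softmax(theta)

pi = robust_policy()
robust_value, _ = worst_case(pi)
```

The policy that is optimal for either single model (always action 0 or always action 1) earns nothing under the other model, whereas the robust policy concentrates on action 2, which is acceptable in both. This mirrors the abstract's point that per-model optimal policies need not be robust.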
[AI-4] Preserving Plasticity in Continual Learning with Adaptive Linearity Injection
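[Quick read]: This paper addresses loss of plasticity in deep neural networks, the gradual reduction in a model's capacity to learn incrementally in non-stationary settings. The key to its solution is Adaptive Linearization (AdaLin), which mitigates plasticity loss by dynamically adapting each neuron's activation function. AdaLin equips every neuron with a learnable parameter and a gating mechanism that injects linearity into the activation function based on its gradient flow, ensuring sufficient gradient signal and sustaining continual learning without additional hyperparameters or explicit task boundaries.
Link: https://arxiv.org/abs/2505.09486
Authors: Seyed Roozbeh Razavi Rohani,Khashayar Khajavi,Wesley Chung,Mo Chen,Sharan Vaswani
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in 4th Conference on Lifelong Learning Agents (CoLLAs), 2025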
Abstract:Loss of plasticity in deep neural networks is the gradual reduction in a model’s capacity to incrementally learn and has been identified as a key obstacle to learning in non-stationary problem settings. Recent work has shown that deep linear networks tend to be resilient towards loss of plasticity. Motivated by this observation, we propose Adaptive Linearization (AdaLin), a general approach that dynamically adapts each neuron’s activation function to mitigate plasticity loss. Unlike prior methods that rely on regularization or periodic resets, AdaLin equips every neuron with a learnable parameter and a gating mechanism that injects linearity into the activation function based on its gradient flow. This adaptive modulation ensures sufficient gradient signal and sustains continual learning without introducing additional hyperparameters or requiring explicit task boundaries. When used with conventional activation functions like ReLU, Tanh, and GeLU, we demonstrate that AdaLin can significantly improve performance on standard benchmarks, including Random Label and Permuted MNIST, Random Label and Shuffled CIFAR-10, and Class-Split CIFAR-100. Furthermore, its efficacy is shown in more complex scenarios, such as class-incremental learning on CIFAR-100 with a ResNet-18 backbone, and in mitigating plasticity loss in off-policy reinforcement learning agents. We perform a systematic set of ablations that show that neuron-level adaptation is crucial for good performance and analyze a number of metrics in the network that might be correlated to loss of plasticity.
[AI-5] Deploying Foundation Model-Enabled Air and Ground Robots in the Field: Challenges and Opportunities ICRA
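[Quick read]: This paper addresses the deployment of foundation model (FM) enabled robots in large-scale, unstructured environments, where a full prior map or a full view of the workspace is typically unavailable. Existing FM-enabled robots operate mainly in closed-world settings, whereas field missions require robots to actively explore, navigate obstacle-cluttered terrain, handle unexpected sensor inputs, and operate under compute constraints. The key to the solution is SPINE, an LLM-enabled autonomy framework that is agnostic to the particular LLM, which allows small language models to be distilled for onboard use on size, weight, and power (SWaP) limited platforms and enables language-driven UAV planning with on-device models.
Link: https://arxiv.org/abs/2505.09477
Authors: Zachary Ravichandran,Fernando Cladera,Jason Hughes,Varun Murali,M. Ani Hsieh,George J. Pappas,Camillo J. Taylor,Vijay Kumar
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted to the IEEE ICRA Workshop on Field Robotics 2025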
Abstract:The integration of foundation models (FMs) into robotics has enabled robots to understand natural language and reason about the semantics in their environments. However, existing FM-enabled robots primarily operate in closed-world settings, where the robot is given a full prior map or has a full view of its workspace. This paper addresses the deployment of FM-enabled robots in the field, where missions often require a robot to operate in large-scale and unstructured environments. To effectively accomplish these missions, robots must actively explore their environments, navigate obstacle-cluttered terrain, handle unexpected sensor inputs, and operate with compute constraints. We discuss recent deployments of SPINE, our LLM-enabled autonomy framework, in field robotic settings. To the best of our knowledge, we present the first demonstration of large-scale LLM-enabled robot planning in unstructured environments with several kilometers of missions. SPINE is agnostic to a particular LLM, which allows us to distill small language models capable of running onboard size, weight and power (SWaP) limited platforms. Via preliminary model distillation work, we then present the first language-driven UAV planner using on-device language models. We conclude our paper by proposing several promising directions for future research.
[AI-6] Counterfactual Strategies for Markov Decision Processes
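[Quick read]: This paper addresses the generation of counterfactual strategies for sequential decision-making tasks: finding minimal changes to an initial strategy that reduce the probability of reaching an undesired outcome below a given limit. Existing methods typically target one-step decisions and do not apply directly to Markov decision processes (MDPs). The key is encoding counterfactual strategies as solutions to non-linear optimization problems and extending the encoding to synthesize diverse counterfactual strategies, enabling effective intervention in, and explanation of, complex sequential decision-making tasks.
Link: https://arxiv.org/abs/2505.09412
Authors: Paul Kobialka,Lina Gerlach,Francesco Leofante,Erika Ábrahám,Silvia Lizeth Tapia Tarifa,Einar Broch Johnsen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: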
Abstract:Counterfactuals are widely used in AI to explain how minimal changes to a model’s input can lead to a different output. However, established methods for computing counterfactuals typically focus on one-step decision-making, and are not directly applicable to sequential decision-making tasks. This paper fills this gap by introducing counterfactual strategies for Markov Decision Processes (MDPs). During MDP execution, a strategy decides which of the enabled actions (with known probabilistic effects) to execute next. Given an initial strategy that reaches an undesired outcome with a probability above some limit, we identify minimal changes to the initial strategy to reduce that probability below the limit. We encode such counterfactual strategies as solutions to non-linear optimization problems, and further extend our encoding to synthesize diverse counterfactual strategies. We evaluate our approach on four real-world datasets and demonstrate its practical viability in sophisticated sequential decision-making tasks.
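The problem statement in this abstract can be made concrete on a tiny example. The sketch below evaluates the probability of reaching a bad outcome under a strategy and then finds the strategy closest to the initial one (fewest changed states) that brings this probability below a limit. It uses brute-force enumeration on a hypothetical acyclic MDP, which is a stand-in for the paper's non-linear optimization encoding; the states, actions, probabilities, and distance measure are all assumptions.

```python
from itertools import product

# Hypothetical acyclic MDP: state -> action -> list of (probability, successor).
MDP = {
    "s0": {"a": [(0.5, "s1"), (0.5, "s2")], "b": [(0.9, "s1"), (0.1, "s2")]},
    "s1": {"a": [(0.8, "good"), (0.2, "bad")], "b": [(0.6, "good"), (0.4, "bad")]},
    "s2": {"a": [(0.3, "good"), (0.7, "bad")], "b": [(0.5, "good"), (0.5, "bad")]},
}

def reach_bad(strategy, state="s0"):
    """Probability of eventually reaching 'bad' (recursion relies on acyclicity)."""
    if state == "bad":
        return 1.0
    if state == "good":
        return 0.0
    return sum(p * reach_bad(strategy, nxt)
               for p, nxt in MDP[state][strategy[state]])

def counterfactual(initial, limit):
    """Closest strategy, by number of changed states, with reach_bad below limit."""
    states = sorted(MDP)
    best = None
    for choice in product(*(sorted(MDP[s]) for s in states)):
        cand = dict(zip(states, choice))
        if reach_bad(cand) >= limit:
            continue
        dist = sum(cand[s] != initial[s] for s in states)
        if best is None or dist < best[0]:
            best = (dist, cand)
    return best

initial = {"s0": "a", "s1": "a", "s2": "a"}
result = counterfactual(initial, limit=0.3)
```

Here the initial strategy reaches the bad outcome with probability 0.45, and changing only the action in `s0` lowers it to 0.25. Enumeration grows exponentially with the number of states, which is one reason an optimization encoding is needed for realistic MDPs.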
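[AI-7] The Influence of Human-inspired Agentic Sophistication in LLM-driven Strategic Reasoners
[Quick read]: This paper asks whether agents based on large language models (LLMs) can replicate human strategic reasoning in game-theoretic settings, and in particular how the complexity of agent design relates to human-likeness. The key to its approach is evaluating three agent designs (a simple game-theoretic model, an unstructured LLM-as-agent model, and an LLM integrated into a traditional agentic framework) and introducing obfuscated game scenarios to test generalization, showing how human-inspired cognitive structures can improve the alignment of LLM agents with human strategic behaviour.
Link: https://arxiv.org/abs/2505.09396
Authors: Vince Trencsenyi,Agnieszka Mensfelt,Kostas Stathis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: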
Abstract:The rapid rise of large language models (LLMs) has shifted artificial intelligence (AI) research toward agentic systems, motivating the use of weaker and more flexible notions of agency. However, this shift raises key questions about the extent to which LLM-based agents replicate human strategic reasoning, particularly in game-theoretic settings. In this context, we examine the role of agentic sophistication in shaping artificial reasoners’ performance by evaluating three agent designs: a simple game-theoretic model, an unstructured LLM-as-agent model, and an LLM integrated into a traditional agentic framework. Using guessing games as a testbed, we benchmarked these agents against human participants across general reasoning patterns and individual role-based objectives. Furthermore, we introduced obfuscated game scenarios to assess agents’ ability to generalise beyond training distributions. Our analysis, covering over 2000 reasoning samples across 25 agent configurations, shows that human-inspired cognitive structures can enhance LLM agents’ alignment with human strategic behaviour. Still, the relationship between agentic design complexity and human-likeness is non-linear, highlighting a critical dependence on underlying LLM capabilities and suggesting limits to simple architectural augmentation.
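[AI-8] The Voice Timbre Attribute Detection 2025 Challenge Evaluation Plan
[Quick Read]: This paper addresses the comparative detection of voice timbre attributes: explaining voice timbre by comparing the intensity of two voices along a specific perceptual descriptor dimension. The key is to verbalize the human impression of timbre as a set of sensory descriptors (such as bright, coarse, soft, and magnetic) and to identify and quantify timbre attributes through pairwise comparison.
Link: https://arxiv.org/abs/2505.09382
Authors: Zhengyan Sheng, Jinghao He, Liping Chen, Kong Aik Lee, Zhen-Hua Ling
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: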
Abstract:Voice timbre refers to the unique quality or character of a person’s voice that distinguishes it from others as perceived by human hearing. The Voice Timbre Attribute Detection (VtaD) 2025 challenge focuses on explaining the voice timbre attribute in a comparative manner. In this challenge, the human impression of voice timbre is verbalized with a set of sensory descriptors, including bright, coarse, soft, magnetic, and so on. The timbre is explained from the comparison between two voices in their intensity within a specific descriptor dimension. The VtaD 2025 challenge starts in May and culminates in a special proposal at the NCMMSC2025 conference in October 2025 in Zhenjiang, China.
[AI-9] Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures ISCA'25
[Quick Read]: This paper addresses the key limits that current hardware architectures place on large language models (LLMs): insufficient memory capacity, low computational efficiency, and constrained interconnect bandwidth. The key to its solution is hardware-aware model co-design, realized through several techniques: Multi-head Latent Attention (MLA) for memory efficiency, a Mixture of Experts (MoE) architecture that optimizes the computation-communication trade-off, FP8 mixed-precision training that exploits the hardware's capabilities, and a multi-plane network topology that reduces cluster-level network overhead. Together these techniques enable cost-efficient training and inference at scale.
Link: https://arxiv.org/abs/2505.09343
Authors: Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y.X. Wei
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version will appear as part of the Industry Track in Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA '25)
Abstract:The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3’s development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.
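[AI-10] Evaluating the Robustness of Adversarial Defenses in Malware Detection Systems
[Quick Read]: This paper targets adversarial evasion attacks on machine learning (ML) based Android malware detection, in which small, carefully crafted modifications bypass the detector. Its two core techniques are Prioritized Binary Rounding, which converts continuous perturbations into a binary feature space while preserving a high attack success rate and a small perturbation size, and the sigma-binary attack, a new adversarial attack for binary domains that reaches its goal with minimal feature changes. Experiments show that the sigma-binary attack exposes the fragility of existing defenses, achieving very high success rates even with few feature modifications.
Link: https://arxiv.org/abs/2505.09342
Authors: Mostafa Jafari, Alireza Shameli-Sendi
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Information Forensics and Security (T-IFS), 13 pages, 4 figures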
Abstract:Machine learning is a key tool for Android malware detection, effectively identifying malicious patterns in apps. However, ML-based detectors are vulnerable to evasion attacks, where small, crafted changes bypass detection. Despite progress in adversarial defenses, the lack of comprehensive evaluation frameworks in binary-constrained domains limits understanding of their robustness. We introduce two key contributions. First, Prioritized Binary Rounding, a technique to convert continuous perturbations into binary feature spaces while preserving high attack success and low perturbation size. Second, the sigma-binary attack, a novel adversarial method for binary domains, designed to achieve attack goals with minimal feature changes. Experiments on the Malscan dataset show that sigma-binary outperforms existing attacks and exposes key vulnerabilities in state-of-the-art defenses. Defenses equipped with adversary detectors, such as KDE, DLA, DNN+, and ICNN, exhibit significant brittleness, with attack success rates exceeding 90% using fewer than 10 feature modifications and reaching 100% with just 20. Adversarially trained defenses, including AT-rFGSM-k and AT-MaxMA, improve robustness under small budgets but remain vulnerable to unrestricted perturbations, with attack success rates of 99.45% and 96.62%, respectively. Although PAD-SMA demonstrates strong robustness against state-of-the-art gradient-based adversarial attacks by maintaining an attack success rate below 16.55%, the sigma-binary attack significantly outperforms these methods, achieving a 94.56% success rate under unrestricted perturbations. These findings highlight the critical need for precise methods like sigma-binary to expose hidden vulnerabilities in existing defenses and support the development of more resilient malware detection systems.
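[AI-11] Access Controls Will Solve the Dual-Use Dilemma
[Quick Read]: This paper addresses the dual-use dilemma facing AI safety systems: the same request can be harmless or harmful depending on who makes it and why, so decisions based only on request content refuse some legitimate queries and let harmful ones through. The key to the solution is a conceptual access control framework built on verified user credentials (such as institutional affiliation) and classifiers that assign model outputs to risk categories (such as advanced virology). The system permits a response only when the user's verified credentials match the category's requirements, enabling granular governance of AI capabilities.
Link: https://arxiv.org/abs/2505.09341
Authors: Evžen Wybitul
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: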
Abstract:AI safety systems face a dual-use dilemma. Since the same request can be either harmless or harmful depending on who made it and why, if the system makes decisions based solely on the request’s content, it will refuse some legitimate queries and let pass harmful ones. To address this, we propose a conceptual access control framework, based on verified user credentials (such as institutional affiliation) and classifiers that assign model outputs to risk categories (such as advanced virology). The system permits responses only when the user’s verified credentials match the category’s requirements. For implementation of the model output classifiers, we introduce a theoretical approach utilizing small, gated expert modules integrated into the generator model, trained with gradient routing, that enable efficient risk detection without the capability gap problems of external monitors. While open questions remain about the verification mechanisms, risk categories, and the technical implementation, our framework makes the first step toward enabling granular governance of AI capabilities: verified users gain access to specialized knowledge without arbitrary restrictions, while adversaries are blocked from it. This contextual approach reconciles model utility with robust safety, addressing the dual-use dilemma.
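The gating rule described in the abstract (release a response only when verified credentials match the requirements of the output's risk category) can be sketched as follows. This is a minimal illustration under stated assumptions: the category names, the credential strings, the `REQUIRED_CREDENTIALS` table, and the keyword-based `classify_output` are all hypothetical stand-ins; the paper proposes gated expert modules inside the generator, not a keyword match.

```python
from dataclasses import dataclass

# Hypothetical policy table: risk category -> credentials a user must hold.
REQUIRED_CREDENTIALS = {
    "general": frozenset(),
    "advanced_virology": frozenset({"verified_biosafety_institution"}),
}


@dataclass(frozen=True)
class User:
    name: str
    credentials: frozenset


def classify_output(draft: str) -> str:
    """Toy keyword classifier standing in for a learned output classifier."""
    return "advanced_virology" if "virus" in draft.lower() else "general"


def release(user: User, draft: str):
    """Return the draft only if the user's credentials cover its risk category."""
    required = REQUIRED_CREDENTIALS[classify_output(draft)]
    return draft if required <= user.credentials else None


student = User("student", frozenset())
researcher = User("researcher", frozenset({"verified_biosafety_institution"}))
```

Note that the check is applied to the model's draft output rather than to the request, which matches the abstract's emphasis on classifying outputs into risk categories.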
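[AI-12] Toward Fair Federated Learning under Demographic Disparities and Data Imbalance
[Quick Read]: This paper addresses the risk that, in high-stakes domains such as healthcare, predictive models trained on imbalanced and demographically skewed data exacerbate existing disparities. The key to its solution is FedIDA (Federated Learning for Imbalance and Disparity Awareness), which combines fairness-aware regularization with group-conditional oversampling to handle multiple sensitive attributes and heterogeneous data distributions without altering the convergence behavior of the underlying federated learning (FL) algorithm.
Link: https://arxiv.org/abs/2505.09295
Authors: Qiming Wu, Siqi Li, Doudou Zhou, Nan Liu
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: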
Abstract:Ensuring fairness is critical when applying artificial intelligence to high-stakes domains such as healthcare, where predictive models trained on imbalanced and demographically skewed data risk exacerbating existing disparities. Federated learning (FL) enables privacy-preserving collaboration across institutions, but remains vulnerable to both algorithmic bias and subgroup imbalance - particularly when multiple sensitive attributes intersect. We propose FedIDA (Federated Learning for Imbalance and Disparity Awareness), a framework-agnostic method that combines fairness-aware regularization with group-conditional oversampling. FedIDA supports multiple sensitive attributes and heterogeneous data distributions without altering the convergence behavior of the underlying FL algorithm. We provide theoretical analysis establishing fairness improvement bounds using Lipschitz continuity and concentration inequalities, and show that FedIDA reduces the variance of fairness metrics across test sets. Empirical results on both benchmark and real-world clinical datasets confirm that FedIDA consistently improves fairness while maintaining competitive predictive performance, demonstrating its effectiveness for equitable and privacy-preserving modeling in healthcare. The source code is available on GitHub.
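Group-conditional oversampling, one of the two ingredients named in the abstract, can be illustrated on a single client's local data. The sketch below is an assumption about what such a step might look like, not FedIDA's actual procedure: it resamples every (sensitive group, label) cell with replacement up to the size of the largest cell. The record layout and the keys `"g"` and `"y"` are hypothetical.

```python
import random
from collections import defaultdict


def group_conditional_oversample(records, group_key, label_key, seed=0):
    """Resample each (group, label) cell with replacement to the largest cell size."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for record in records:
        cells[(record[group_key], record[label_key])].append(record)
    target = max(len(cell) for cell in cells.values())
    balanced = []
    for cell in cells.values():
        balanced.extend(cell)                                    # keep all originals
        balanced.extend(rng.choices(cell, k=target - len(cell)))  # pad with resamples
    return balanced


# Hypothetical local dataset: group "B" with label 1 is badly under-represented.
local_data = (
    [{"g": "A", "y": 1}] * 4
    + [{"g": "A", "y": 0}] * 2
    + [{"g": "B", "y": 1}] * 1
)
balanced = group_conditional_oversample(local_data, "g", "y")
```

Because the step only changes which local samples a client trains on, it is compatible with the abstract's claim of being framework-agnostic with respect to the underlying FL algorithm.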
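[AI-13] Reproducibility Study of "Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents"
[Quick Read]: This paper evaluates and extends GovSim, the framework proposed by Piatti et al. for analyzing the cooperative decision-making of large language models (LLMs) in resource-sharing scenarios. By replicating key experiments, it validates the advantage of large models (such as GPT-4-turbo) over smaller ones and examines how the universalization principle affects the sustainability of cooperation. The key to its approach is an extensible simulation environment for testing cooperative behavior across model architectures, sizes, and languages, with new settings such as a heterogeneous multi-agent environment and an "inverse environment" used to check the framework's applicability and the models' adaptability. The results indicate that high-performing models can influence the behavior of lower-performing ones, a finding relevant to other agent-based systems.
Link: https://arxiv.org/abs/2505.09289
Authors: Pedro M. P. Curvo, Mara Dragomir, Salvador Torpes, Mohammadmahdi Rahimi
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 Tables, 9 Figures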
Abstract:This study evaluates and extends the findings made by Piatti et al., who introduced GovSim, a simulation framework designed to assess the cooperative decision-making capabilities of large language models (LLMs) in resource-sharing scenarios. By replicating key experiments, we validate claims regarding the performance of large models, such as GPT-4-turbo, compared to smaller models. The impact of the universalization principle is also examined, with results showing that large models can achieve sustainable cooperation, with or without the principle, while smaller models fail without it. In addition, we provide multiple extensions to explore the applicability of the framework to new settings. We evaluate additional models, such as DeepSeek-V3 and GPT-4o-mini, to test whether cooperative behavior generalizes across different architectures and model sizes. Furthermore, we introduce new settings: we create a heterogeneous multi-agent environment, study a scenario using Japanese instructions, and explore an “inverse environment” where agents must cooperate to mitigate harmful resource distributions. Our results confirm that the benchmark can be applied to new models, scenarios, and languages, offering valuable insights into the adaptability of LLMs in complex cooperative tasks. Moreover, the experiment involving heterogeneous multi-agent systems demonstrates that high-performing models can influence lower-performing ones to adopt similar behaviors. This finding has significant implications for other agent-based applications, potentially enabling more efficient use of computational resources and contributing to the development of more effective cooperative AI systems.
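[AI-14] Educational impacts of generative artificial intelligence on learning and performance of engineering students in China
[Quick Read]: This paper examines how generative AI is currently used in engineering education and how it affects students' learning experience, focusing on the opportunities and challenges it brings. Surveying 148 students from different engineering disciplines and regions of China, it analyzes frequency of use, application scenarios, impact on learning, challenges encountered, and future prospects in engineering education. The key to its approach is to start from the students' perspective, understand the actual effect of generative AI, and propose integration strategies that realize its potential for learning efficiency, initiative, and creativity while addressing concerns about accuracy and domain-specific reliability.
Link: https://arxiv.org/abs/2505.09208
Authors: Lei Fan, Kunyang Deng, Fangxue Liu
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: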
Abstract:With the rapid advancement of generative artificial intelligence(AI), its potential applications in higher education have attracted significant attention. This study investigated how 148 students from diverse engineering disciplines and regions across China used generative AI, focusing on its impact on their learning experience and the opportunities and challenges it poses in engineering education. Based on the surveyed data, we explored four key areas: the frequency and application scenarios of AI use among engineering students, its impact on students’ learning and performance, commonly encountered challenges in using generative AI, and future prospects for its adoption in engineering education. The results showed that more than half of the participants reported a positive impact of generative AI on their learning efficiency, initiative, and creativity, with nearly half believing it also enhanced their independent thinking. However, despite acknowledging improved study efficiency, many felt their actual academic performance remained largely unchanged and expressed concerns about the accuracy and domain-specific reliability of generative AI. Our findings provide a first-hand insight into the current benefits and challenges generative AI brings to students, particularly Chinese engineering students, while offering several recommendations, especially from the students’ perspective, for effectively integrating generative AI into engineering education.
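[AI-15] An Initial Exploration of Default Images in Text-to-Image Generation
[Quick Read]: This paper addresses the "default images" that text-to-image generation (TTI) models produce when a prompt contains unknown terms. These images closely resemble each other across many unrelated prompts and may hurt output quality and user satisfaction. The key to its approach is a systematic method for creating input prompts that trigger default images, analysis of their properties through experiments and small-scale ablation studies, and a survey study of how default images affect user satisfaction, laying a foundation for improving TTI models and prompt engineering.
Link: https://arxiv.org/abs/2505.09166
Authors: Hannu Simonen, Atte Kiviniemi, Jonas Oppenlaender
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 16 pages, 6 figures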
Abstract:In the creative practice of text-to-image generation (TTI), images are generated from text prompts. However, TTI models are trained to always yield an output, even if the prompt contains unknown terms. In this case, the model may generate what we call “default images”: images that closely resemble each other across many unrelated prompts. We argue studying default images is valuable for designing better solutions for TTI and prompt engineering. In this paper, we provide the first investigation into default images on Midjourney, a popular image generator. We describe our systematic approach to create input prompts triggering default images, and present the results of our initial experiments and several small-scale ablation studies. We also report on a survey study investigating how default images affect user satisfaction. Our work lays the foundation for understanding default images in TTI and highlights challenges and future research directions.
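[AI-16] A Multi-Task Foundation Model for Wireless Channel Representation Using Contrastive and Masked Autoencoder Learning
[Quick Read]: This paper addresses the fact that current applications of self-supervised learning to wireless channel representation do not fully account for the unique characteristics and constraints of wireless communications. The key to its solution is WiMAE (Wireless Masked Autoencoder), a transformer-based encoder-decoder foundation model, and ContraWiMAE, which builds on it by adding a contrastive learning objective alongside the reconstruction task in a unified multi-task framework so that the model captures both structural and discriminative features and yields better representations.
Link: https://arxiv.org/abs/2505.09160
Authors: Berkay Guler, Giovanni Geraci, Hamid Jafarkhani
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: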
Abstract:Current applications of self-supervised learning to wireless channel representation often borrow paradigms developed for text and image processing, without fully addressing the unique characteristics and constraints of wireless communications. Aiming to fill this gap, we first propose WiMAE (Wireless Masked Autoencoder), a transformer-based encoder-decoder foundation model pretrained on a realistic open-source multi-antenna wireless channel dataset. Building upon this foundation, we develop ContraWiMAE, which enhances WiMAE by incorporating a contrastive learning objective alongside the reconstruction task in a unified multi-task framework. By warm-starting from pretrained WiMAE weights and generating positive pairs via noise injection, the contrastive component enables the model to capture both structural and discriminative features, enhancing representation quality beyond what reconstruction alone can achieve. Through extensive evaluation on unseen scenarios, we demonstrate the effectiveness of both approaches across multiple downstream tasks, with ContraWiMAE showing further improvements in linear separability and adaptability in diverse wireless environments. Comparative evaluations against a state-of-the-art wireless channel foundation model confirm the superior performance and data efficiency of our models, highlighting their potential as powerful baselines for future research in self-supervised wireless channel representation learning.
[AI-17] ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor
[Quick Read]: This paper addresses the head-of-line blocking problem caused by first-come-first-served (FCFS) scheduling in serving systems for large language models (LLMs). The key to its solution is an Iterative Shortest Remaining Time First (ISRTF) scheduling strategy combined with a trained response length predictor that estimates LLM inference time, enabling more efficient task management. The predictor is trained on top of the BGE model to cope with the difficulty of predicting inference latency under auto-regressive generation.
Link: https://arxiv.org/abs/2505.09142
Authors: Seungbeom Choi, Jeonghoe Goo, Eunjoo Jeon, Mingyu Yang, Minsung Jang
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 5 figures. Cloud-native LLM scheduling system with latency-aware inference optimization
Abstract:We propose ELIS, a serving system for Large Language Models (LLMs) featuring an Iterative Shortest Remaining Time First (ISRTF) scheduler designed to efficiently manage inference tasks with the shortest remaining tokens. Current LLM serving systems often employ a first-come-first-served scheduling strategy, which can lead to the “head-of-line blocking” problem. To overcome this limitation, it is necessary to predict LLM inference times and apply a shortest job first scheduling strategy. However, due to the auto-regressive nature of LLMs, predicting the inference latency is challenging. ELIS addresses this challenge by training a response length predictor for LLMs using the BGE model, an encoder-based state-of-the-art model. Additionally, we have devised the ISRTF scheduling strategy, an optimization of shortest remaining time first tailored to existing LLM iteration batching. To evaluate our work in an industrial setting, we simulate streams of requests based on our study of real-world user LLM serving trace records. Furthermore, we implemented ELIS as a cloud-native scheduler system on Kubernetes to evaluate its performance in production environments. Our experimental results demonstrate that ISRTF reduces the average job completion time by up to 19.6%.
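The difference between first-come-first-served and an iterative shortest-remaining-time policy can be seen in a small simulation. The sketch below is a deliberate simplification and not the ELIS implementation: it runs one job at a time instead of batching, uses a fixed up-front length prediction minus progress as the remaining estimate, and measures time in generated tokens. The job names, lengths, and the `window` parameter are hypothetical.

```python
import heapq


def fcfs_finish_times(jobs):
    """jobs: dict name -> (predicted_len, true_len), served in arrival order."""
    clock, finish = 0, {}
    for name, (_, true_len) in jobs.items():
        clock += true_len
        finish[name] = clock
    return finish


def isrtf_finish_times(jobs, window=50):
    """Each iteration runs `window` tokens of the job with the least predicted remainder."""
    heap = [(predicted, name) for name, (predicted, _) in jobs.items()]
    heapq.heapify(heap)
    generated = {name: 0 for name in jobs}
    clock, finish = 0, {}
    while heap:
        _, name = heapq.heappop(heap)
        predicted, true_len = jobs[name]
        step = min(window, true_len - generated[name])
        clock += step
        generated[name] += step
        if generated[name] >= true_len:
            finish[name] = clock
        else:
            # If the predictor underestimated, treat the job as nearly done.
            remaining = max(predicted - generated[name], 1)
            heapq.heappush(heap, (remaining, name))
    return finish


# A long request arrives first and blocks two shorter ones under FCFS.
jobs = {"long": (400, 400), "short": (40, 50), "mid": (120, 100)}
fcfs = fcfs_finish_times(jobs)
isrtf = isrtf_finish_times(jobs)
```

In this toy trace the average completion time falls because the short and medium requests no longer wait behind the long one; the size of the gain depends entirely on the made-up workload and says nothing about the 19.6% figure reported in the abstract.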
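[AI-18] Fair Clustering via Alignment ICML2025
[Quick Read]: This paper addresses fairness in clustering, that is, balancing the proportions of instances from different groups assigned to each cluster with respect to a sensitive attribute. Existing fair clustering algorithms optimize the clustering objective under specific fairness constraints, but their inherent complexity or approximations often lead to poor clustering utility or numerical instability. The proposed solution rests on a new decomposition of the fair K-means objective. Its key idea is to alternate between (i) finding a joint probability distribution that aligns the data from different protected groups and (ii) optimizing cluster centers in the aligned space, which theoretically guarantees approximately optimal clustering utility at any given fairness level and enables high-utility fair clustering.
Link: https://arxiv.org/abs/2505.09131
Authors: Kunwoong Kim, Jihu Lee, Sangchul Park, Yongdai Kim
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at ICML 2025. This is the version submitted for review and will be replaced by the camera-ready version soon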
Abstract:Algorithmic fairness in clustering aims to balance the proportions of instances assigned to each cluster with respect to a given sensitive attribute. While recently developed fair clustering algorithms optimize clustering objectives under specific fairness constraints, their inherent complexity or approximation often results in suboptimal clustering utility or numerical instability in practice. To resolve these limitations, we propose a new fair clustering algorithm based on a novel decomposition of the fair K-means clustering objective function. The proposed algorithm, called Fair Clustering via Alignment (FCA), operates by alternately (i) finding a joint probability distribution to align the data from different protected groups, and (ii) optimizing cluster centers in the aligned space. A key advantage of FCA is that it theoretically guarantees approximately optimal clustering utility for any given fairness level without complex constraints, thereby enabling high-utility fair clustering in practice. Experiments show that FCA outperforms existing methods by (i) attaining a superior trade-off between fairness level and clustering utility, and (ii) achieving near-perfect fairness without numerical instability.
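[AI-19] PreCare: Designing AI Assistants for Advance Care Planning (ACP) to Enhance Personal Value Exploration, Patient Knowledge, and Decisional Confidence
[Quick Read]: This paper addresses the fact that conventional online Advance Care Planning (ACP) tools lack key benefits of clinical consultations, such as personalized value exploration and immediate clarification of decision consequences. The key to its solution is the design and implementation of PreCare, a website with AI-driven assistants that guide users through exploring personal values, gaining ACP knowledge, and making informed decisions, narrowing the gap between online ACP and face-to-face consultation.
Link: https://arxiv.org/abs/2505.09115
Authors: Yu Lun Hsu (1), Yun-Rung Chou (1), Chiao-Ju Chang (1), Yu-Cheng Chang (1), Zer-Wei Lee (1), Rokas Gipiškis (2), Rachel Li (3), Chih-Yuan Shih (4), Jen-Kuei Peng (4), Hsien-Liang Huang (4), Jaw-Shiun Tsai (4), Mike Y. Chen
Affiliations: (1) National Taiwan University; (2) Vilnius University; (3) University of California, Berkeley; (4) National Taiwan University Hospital
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: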
Abstract:Advance Care Planning (ACP) allows individuals to specify their preferred end-of-life life-sustaining treatments before they become incapacitated by injury or terminal illness (e.g., coma, cancer, dementia). While online ACP offers high accessibility, it lacks key benefits of clinical consultations, including personalized value exploration and immediate clarification of decision consequences. To bridge this gap, we conducted two formative studies: 1) shadowed and interviewed 3 ACP teams consisting of physicians, nurses, and social workers (18 patients total), and 2) interviewed 14 users of ACP websites. Building on these insights, we designed PreCare in collaboration with 6 ACP professionals. PreCare is a website with 3 AI-driven assistants designed to guide users through exploring personal values, gaining ACP knowledge, and supporting informed decision-making. A usability study (n=12) showed that PreCare achieved a System Usability Scale (SUS) rating of excellent. A comparative evaluation (n=12) showed that PreCare's AI assistants significantly improved exploration of personal values, knowledge, and decisional confidence, and were preferred by 92% of participants.
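[AI-20] Beyond the Known: Decision Making with Counterfactual Reasoning Decision Transformer
[Quick Read]: This paper addresses the limited performance of the Decision Transformer (DT) in practice when training data is scarce and optimal behaviours are rare. A conventional DT needs high-quality, comprehensive datasets to perform well, conditions that are hard to meet in the real world, so suboptimal data degrades the model. The proposed solution is the Counterfactual Reasoning Decision Transformer (CRDT), whose key idea is a counterfactual reasoning mechanism that generates and uses counterfactual experiences to strengthen DT's decision-making in unseen scenarios, improving performance and generalization.
Link: https://arxiv.org/abs/2505.09114
Authors: Minh Hoang Nguyen, Linh Le Pham Van, Thommen George Karimpanal, Sunil Gupta, Hung Le
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: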
Abstract:Decision Transformers (DT) play a crucial role in modern reinforcement learning, leveraging offline datasets to achieve impressive results across various domains. However, DT requires high-quality, comprehensive data to perform optimally. In real-world applications, the lack of training data and the scarcity of optimal behaviours make training on offline datasets challenging, as suboptimal data can hinder performance. To address this, we propose the Counterfactual Reasoning Decision Transformer (CRDT), a novel framework inspired by counterfactual reasoning. CRDT enhances DT ability to reason beyond known data by generating and utilizing counterfactual experiences, enabling improved decision-making in unseen scenarios. Experiments across Atari and D4RL benchmarks, including scenarios with limited data and altered dynamics, demonstrate that CRDT outperforms conventional DT approaches. Additionally, reasoning counterfactually allows the DT agent to obtain stitching abilities, combining suboptimal trajectories, without architectural modifications. These results highlight the potential of counterfactual reasoning to enhance reinforcement learning agents’ performance and generalization capabilities.
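[AI-21] Air-Ground Collaboration for Language-Specified Missions in Unknown Environments
[Quick Read]: This paper addresses how autonomous robotic systems can carry out missions specified in natural language, in particular how heterogeneous robots can collaborate while the mission specification changes on the fly. The key to its solution is a Large Language Model (LLM)-enabled planner that reasons over semantic-metric maps built online and shared between the robots, allowing the system to infer mission-relevant semantics and actively acquire information for task-driven navigation in urban and rural environments.
Link: https://arxiv.org/abs/2505.09108
Authors: Fernando Cladera, Zachary Ravichandran, Jason Hughes, Varun Murali, Carlos Nieto-Granda, M. Ani Hsieh, George J. Pappas, Camillo J. Taylor, Vijay Kumar
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 19 pages, 24 figures, 7 tables. Submitted to T-FR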
Abstract:As autonomous robotic systems become increasingly mature, users will want to specify missions at the level of intent rather than in low-level detail. Language is an expressive and intuitive medium for such mission specification. However, realizing language-guided robotic teams requires overcoming significant technical hurdles. Interpreting and realizing language-specified missions requires advanced semantic reasoning. Successful heterogeneous robots must effectively coordinate actions and share information across varying viewpoints. Additionally, communication between robots is typically intermittent, necessitating robust strategies that leverage communication opportunities to maintain coordination and achieve mission objectives. In this work, we present a first-of-its-kind system where an unmanned aerial vehicle (UAV) and an unmanned ground vehicle (UGV) are able to collaboratively accomplish missions specified in natural language while reacting to changes in specification on the fly. We leverage a Large Language Model (LLM)-enabled planner to reason over semantic-metric maps that are built online and opportunistically shared between an aerial and a ground robot. We consider task-driven navigation in urban and rural areas. Our system must infer mission-relevant semantics and actively acquire information via semantic mapping. In both ground and air-ground teaming experiments, we demonstrate our system on seven different natural-language specifications at up to kilometer-scale navigation.
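[AI-22] Human-like Cognitive Generalization for Large Models via Brain-in-the-loop Supervision
[Quick Read]: This paper addresses the shortcomings of large-scale deep neural networks (DNNs) in complex cognitive abilities such as understanding abstract concepts, reasoning, and adapting to novel scenarios. The key to its solution is brain-in-the-loop supervised learning, which uses a small set of brain signals to transfer human conceptual structures to DNNs and significantly improves their comprehension of abstract and unseen concepts.
Link: https://arxiv.org/abs/2505.09085
Authors: Jiaxuan Chen, Yu Qi, Yueming Wang, Gang Pan
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: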
Abstract:Recent advancements in deep neural networks (DNNs), particularly large-scale language models, have demonstrated remarkable capabilities in image and natural language understanding. Although scaling up model parameters with increasing volume of training data has progressively improved DNN capabilities, achieving complex cognitive abilities - such as understanding abstract concepts, reasoning, and adapting to novel scenarios, which are intrinsic to human cognition - remains a major challenge. In this study, we show that brain-in-the-loop supervised learning, utilizing a small set of brain signals, can effectively transfer human conceptual structures to DNNs, significantly enhancing their comprehension of abstract and even unseen concepts. Experimental results further indicate that the enhanced cognitive capabilities lead to substantial performance gains in challenging tasks, including few-shot/zero-shot learning and out-of-distribution recognition, while also yielding highly interpretable concept representations. These findings highlight that human-in-the-loop supervision can effectively augment the complex cognitive abilities of large models, offering a promising pathway toward developing more human-like cognitive abilities in artificial systems.
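[AI-23] SALM: A Multi-Agent Framework for Language Model-Driven Social Network Simulation
[Quick Read]: This paper addresses the difficulty that traditional rule-based agent-based modeling (ABM) has in capturing the complex dynamics of social systems, especially in keeping behavior realistic and temporally stable over long simulations. The key to its solution is SALM (Social Agent LM Framework), which integrates language models (LMs) into social network simulation and introduces a hierarchical prompting architecture, an attention-based memory system, and formal bounds on personality stability, improving long-term stability and efficiency while preserving behavioral fidelity.
Link: https://arxiv.org/abs/2505.09081
Authors: Gaurav Koley
Affiliations: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: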
Abstract:Contemporary approaches to agent-based modeling (ABM) of social systems have traditionally emphasized rule-based behaviors, limiting their ability to capture nuanced dynamics by moving beyond predefined rules and leveraging contextual understanding from LMs of human social interaction. This paper presents SALM (Social Agent LM Framework), a novel approach for integrating language models (LMs) into social network simulation that achieves unprecedented temporal stability in multi-agent scenarios. Our primary contributions include: (1) a hierarchical prompting architecture enabling stable simulation beyond 4,000 timesteps while reducing token usage by 73%, (2) an attention-based memory system achieving 80% cache hit rates (95% CI [78%, 82%]) with sub-linear memory growth of 9.5%, and (3) formal bounds on personality stability. Through extensive validation against SNAP ego networks, we demonstrate the first LLM-based framework capable of modeling long-term social phenomena while maintaining empirically validated behavioral fidelity.
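[AI-24] Variational Prefix Tuning for Diverse and Accurate Code Summarization Using Pre-trained Language Models
[Quick Read]: This paper addresses the fact that existing code summarization methods produce only a single high-quality summary and overlook cases where that summary is inadequate and alternatives are needed. The key to its solution is Variational Prefix Tuning (VPT), which integrates a Conditional Variational Autoencoder (CVAE) framework so that pre-trained models can generate diverse yet accurate sets of summaries from which the user picks the most suitable one. The method is parameter-efficient, avoiding expensive model retraining, and uses a bi-criteria reranking step to balance the diversity and accuracy of the generated summaries.
Link: https://arxiv.org/abs/2505.09062
Authors: Junda Zhao, Yuliang Song, Eldan Cohen
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by the Journal of Systems and Software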
Abstract:Recent advancements in source code summarization have leveraged transformer-based pre-trained models, including Large Language Models of Code (LLMCs), to automate and improve the generation of code summaries. However, existing methods often focus on generating a single high-quality summary for a given source code, neglecting scenarios where the generated summary might be inadequate and alternative options are needed. In this paper, we introduce Variational Prefix Tuning (VPT), a novel approach that enhances pre-trained models’ ability to generate diverse yet accurate sets of summaries, allowing the user to choose the most suitable one for the given source code. Our method integrates a Conditional Variational Autoencoder (CVAE) framework as a modular component into pre-trained models, enabling us to model the distribution of observed target summaries and sample continuous embeddings to be used as prefixes to steer the generation of diverse outputs during decoding. Importantly, we construct our method in a parameter-efficient manner, eliminating the need for expensive model retraining, especially when using LLMCs. Furthermore, we employ a bi-criteria reranking method to select a subset of generated summaries, optimizing both the diversity and the accuracy of the options presented to users. We present extensive experimental evaluations using widely used datasets and current state-of-the-art pre-trained code summarization models to demonstrate the effectiveness of our approach and its adaptability across models.
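The abstract does not spell out the bi-criteria reranking step, so the sketch below shows one common way to trade accuracy against diversity when picking a subset: a greedy, maximal-marginal-relevance style selection. It is an assumption for illustration only, not the paper's method; the accuracy scores, the token-level Jaccard similarity, and the `trade_off` weight are all hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two summaries, used here as a redundancy measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def rerank(candidates, k, trade_off=0.5):
    """Greedily pick k summaries, rewarding accuracy and penalizing redundancy.

    candidates: list of (summary_text, accuracy_score) pairs.
    """
    remaining = list(candidates)
    chosen = []
    while remaining and len(chosen) < k:
        def gain(item):
            text, score = item
            redundancy = max((jaccard(text, c[0]) for c in chosen), default=0.0)
            return trade_off * score - (1 - trade_off) * redundancy
        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
    return [text for text, _ in chosen]


# Hypothetical sampled summaries with made-up accuracy scores.
samples = [
    ("returns the sum of a list", 0.90),
    ("returns the sum of the list", 0.88),
    ("adds up all numbers", 0.70),
]
top_two = rerank(samples, k=2)
```

Here the near-duplicate second summary is skipped in favour of a lower-scored but different phrasing, which is the behaviour a user choosing among alternatives would want.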
[AI-25] Monte Carlo Beam Search for Actor-Critic Reinforcement Learning in Continuous Control
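[Quick read]: This paper addresses the suboptimal policy convergence caused by noise-based exploration in actor-critic algorithms such as Twin Delayed Deep Deterministic Policy Gradient (TD3). The key to the solution is a new hybrid method, Monte Carlo Beam Search (MCBS), which combines beam search with Monte Carlo rollouts: it generates several candidate actions around the policy's output and evaluates them through short-horizon rollouts, improving exploration efficiency and action-selection quality.
Link: https://arxiv.org/abs/2505.09029
Authors: Hazim Alzorgan, Abolfazl Razi
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: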
Abstract:Actor-critic methods, like Twin Delayed Deep Deterministic Policy Gradient (TD3), depend on basic noise-based exploration, which can result in less than optimal policy convergence. In this study, we introduce Monte Carlo Beam Search (MCBS), a new hybrid method that combines beam search and Monte Carlo rollouts with TD3 to improve exploration and action selection. MCBS produces several candidate actions around the policy’s output and assesses them through short-horizon rollouts, enabling the agent to make better-informed choices. We test MCBS across various continuous-control benchmarks, including HalfCheetah-v4, Walker2d-v5, and Swimmer-v5, showing enhanced sample efficiency and performance compared to standard TD3 and other baseline methods like SAC, PPO, and A2C. Our findings emphasize MCBS’s capability to enhance policy learning through structured look-ahead search while ensuring computational efficiency. Additionally, we offer a detailed analysis of crucial hyperparameters, such as beam width and rollout depth, and explore adaptive strategies to optimize MCBS for complex control tasks. Our method shows a higher convergence rate across different environments compared to TD3, SAC, PPO, and A2C. For instance, we achieved 90% of the maximum achievable reward within around 200 thousand timesteps compared to 400 thousand timesteps for the second-best method.
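The core loop described above (sample candidates around the policy output, score each with a short rollout, keep the best) can be sketched as follows. This is a single-level simplification under stated assumptions: the 1-D dynamics, the proportional `policy`, and the parameter values are hypothetical stand-ins, and no TD3 critic or training loop is included.

```python
import random


def step(state: float, action: float):
    """Toy 1-D dynamics (hypothetical): move by `action`; reward is closeness to 0."""
    next_state = state + action
    return next_state, -abs(next_state)


def policy(state: float) -> float:
    """Stand-in for a trained actor: a deliberately weak proportional controller."""
    return -0.2 * state


def rollout_return(state, first_action, depth, gamma=0.99):
    """Discounted return of taking `first_action`, then following the policy."""
    total, discount, action = 0.0, 1.0, first_action
    for _ in range(depth):
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
        action = policy(state)
    return total


def select_action(state, beam_width=8, depth=5, noise_std=0.5, rng=None):
    """Sample candidates around the policy output and keep the best by rollout."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the example is reproducible
    base = policy(state)
    candidates = [base] + [
        base + rng.gauss(0.0, noise_std) for _ in range(beam_width - 1)
    ]
    return max(candidates, key=lambda a: rollout_return(state, a, depth))


state = 4.0
chosen = select_action(state)
```

Because the unperturbed policy action is always among the candidates, the selected action never scores worse under the rollout model than the policy's own output. `beam_width` and `depth` correspond to the beam width and rollout depth hyperparameters the abstract mentions.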
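[AI-26] Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation
[Quick read]: This paper addresses the evaluation of large language models (LLMs) on test-driven development (TDD) tasks, that is, how to measure an LLM's ability to generate functional code from test cases. The key to the solution is the WebApp1K benchmark, which uses 1000 diverse challenges across 20 application domains to evaluate code generation under context-length limits and multi-feature complexity, emphasizing the model's ability to understand and implement functionality directly from test cases rather than from natural-language prompts.
Link: https://arxiv.org/abs/2505.09027
Authors: Yi Cui
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes: arXiv admin note: text overlap with arXiv:2409.05177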
Abstract:We introduce WebApp1K, a novel benchmark for evaluating large language models (LLMs) in test-driven development (TDD) tasks, where test cases serve as both prompt and verification for code generation. Unlike traditional approaches relying on natural language prompts, our benchmark emphasizes the ability of LLMs to interpret and implement functionality directly from test cases, reflecting real-world software development practices. Comprising 1000 diverse challenges across 20 application domains, the benchmark evaluates LLMs on their ability to generate compact, functional code under the constraints of context length and multi-feature complexity. Our findings highlight instruction following and in-context learning as critical capabilities for TDD success, surpassing the importance of general coding proficiency or pretraining knowledge. Through comprehensive evaluation of 19 frontier models, we reveal performance bottlenecks, such as instruction loss in long prompts, and provide a detailed error analysis spanning multiple root causes. This work underscores the practical value of TDD-specific benchmarks and lays the foundation for advancing LLM capabilities in rigorous, application-driven coding scenarios.
[AI-27] Block-Biased Mamba for Long-Range Sequence Processing
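[Quick read]: This paper addresses Mamba's poor performance on long-range sequence tasks, despite an architecture designed for long-range dependencies. It analyzes Mamba's limitations from three perspectives (expressiveness, inductive bias, and training stability) and proposes an improvement called $\text{B}_2\text{S}_6$, whose key idea is to combine block-wise selective dynamics with a channel-specific bias, improving the model's inductive bias, expressiveness, and stability.
Link: https://arxiv.org/abs/2505.09022
Authors: Annan Yu, N. Benjamin Erichson
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes: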
Abstract:Mamba extends earlier state space models (SSMs) by introducing input-dependent dynamics, and has demonstrated strong empirical performance across a range of domains, including language modeling, computer vision, and foundation models. However, a surprising weakness remains: despite being built on architectures designed for long-range dependencies, Mamba performs poorly on long-range sequential tasks. Understanding and addressing this gap is important for improving Mamba’s universality and versatility. In this work, we analyze Mamba’s limitations through three perspectives: expressiveness, inductive bias, and training stability. Our theoretical results show how Mamba falls short in each of these aspects compared to earlier SSMs such as S4D. To address these issues, we propose $\text{B}_2\text{S}_6$, a simple extension of Mamba’s S6 unit that combines block-wise selective dynamics with a channel-specific bias. We prove that these changes equip the model with a better-suited inductive bias and improve its expressiveness and stability. Empirically, $\text{B}_2\text{S}_6$ outperforms S4 and S4D on Long-Range Arena (LRA) tasks while maintaining Mamba’s performance on language modeling benchmarks.
[AI-28] AI-Mediated Code Comment Improvement
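[Quick read]: This paper addresses the improvement of code comment quality by rewriting comments with customized Artificial Intelligence (AI) tools along several quality axes. The key to the solution is to first determine the quality axes to improve through an empirical study and grounded-theory qualitative analysis, then propose a procedure that uses a Large Language Model (LLM) to rewrite code comments, and finally distill the results into a smaller model that can run locally, so that users retain custody of their data.
Link: https://arxiv.org/abs/2505.09021
Authors: Maria Dhakal, Chia-Yi Su, Robert Wallace, Chris Fakhimi, Aakash Bansal, Toby Li, Yu Huang, Collin McMillan
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Notes: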
Abstract:This paper describes an approach to improve code comments along different quality axes by rewriting those comments with customized Artificial Intelligence (AI)-based tools. We conduct an empirical study followed by grounded theory qualitative analysis to determine the quality axes to improve. Then we propose a procedure using a Large Language Model (LLM) to rewrite existing code comments along the quality axes. We implement our procedure using GPT-4o, then distil the results into a smaller model capable of being run in-house, so users can maintain data custody. We evaluate both our approach using GPT-4o and the distilled model versions. We show in an evaluation how our procedure improves code comments along the quality axes. We release all data and source code in an online repository for reproducibility.
[AI-29] Deep Reinforcement Learning for Power Grid Multi-Stage Cascading Failure Mitigation ICLR2025
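[Quick read]: This paper addresses grid collapse caused by multi-stage cascading failures in power systems; existing mitigation strategies are usually single-stage and do not account for the complexity of multi-stage scenarios. The key to the solution is to model the multi-stage cascading failure problem as a reinforcement learning task, develop a corresponding simulation environment, and train the agent with the deterministic policy gradient algorithm to produce continuous actions, improving its ability to handle multi-stage failures.
Link: https://arxiv.org/abs/2505.09012
Authors: Bo Meng, Chenghao Xu, Yongli Zhu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Notes: This paper has been accepted and presented at ICLR 2025 in Singapore, Apr. 28, 2025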
Abstract:Cascading failures in power grids can lead to grid collapse, causing severe disruptions to social operations and economic activities. In certain cases, multi-stage cascading failures can occur. However, existing cascading-failure-mitigation strategies are usually single-stage-based, overlooking the complexity of the multi-stage scenario. This paper treats the multi-stage cascading failure problem as a reinforcement learning task and develops a simulation environment. The reinforcement learning agent is then trained via the deterministic policy gradient algorithm to achieve continuous actions. Finally, the effectiveness of the proposed approach is validated on the IEEE 14-bus and IEEE 118-bus systems.
[AI-30] Continual Reinforcement Learning via Autoencoder-Driven Task and New Environment Recognition AAMAS2025
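[Quick read]: This paper addresses the difficulty reinforcement learning agents have in preserving and leveraging existing information during continual learning, especially when no external signal indicates a change of task or environment. The key to the solution is to combine autoencoders with policy optimization in an end-to-end continual learning system that can recognize and learn new tasks or environments while preserving knowledge from earlier experience, and that selectively retrieves relevant knowledge when it re-encounters a known environment.
Link: https://arxiv.org/abs/2505.09003
Authors: Zeki Doruk Erden, Donia Gasmi, Boi Faltings
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Published in the Autonomous Robots and Multirobot Systems (ARMS) workshop at AAMAS 2025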
Abstract:Continual learning for reinforcement learning agents remains a significant challenge, particularly in preserving and leveraging existing information without an external signal to indicate changes in tasks or environments. In this study, we explore the effectiveness of autoencoders in detecting new tasks and matching observed environments to previously encountered ones. Our approach integrates policy optimization with familiarity autoencoders within an end-to-end continual learning system. This system can recognize and learn new tasks or environments while preserving knowledge from earlier experiences and can selectively retrieve relevant knowledge when re-encountering a known environment. Initial results demonstrate successful continual learning without external signals to indicate task changes or reencounters, showing promise for this methodology.
[AI-31] Enhancing Aerial Combat Tactics through Hierarchical Multi-Agent Reinforcement Learning
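[Quick read]: This paper addresses how to identify effective Courses of Action that lead to mission success in simulated air combat scenarios with heterogeneous agents, so that real-world defense scenarios can be explored in a low-cost, safe setting. The key to the solution is a hierarchical multi-agent reinforcement learning framework that splits decision-making into two levels of abstraction: low-level policies control individual units, while a high-level commander policy issues macro commands aligned with the overall mission targets. This hierarchy simplifies training and improves overall system effectiveness by exploiting policy symmetries among individual agents and by separating control from command tasks.
Link: https://arxiv.org/abs/2505.08995
Authors: Ardian Selmonaj, Oleg Szehr, Giacomo Del Rio, Alessandro Antonucci, Adrian Schneider, Michael Rüegsegger
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
Notes: Published as journal chapter in Deep Learning Applications, Vol. 1, by Taylor & Francis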
Abstract:This work presents a Hierarchical Multi-Agent Reinforcement Learning framework for analyzing simulated air combat scenarios involving heterogeneous agents. The objective is to identify effective Courses of Action that lead to mission success within preset simulations, thereby enabling the exploration of real-world defense scenarios at low cost and in a safe-to-fail setting. Applying deep Reinforcement Learning in this context poses specific challenges, such as complex flight dynamics, the exponential size of the state and action spaces in multi-agent systems, and the capability to integrate real-time control of individual units with look-ahead planning. To address these challenges, the decision-making process is split into two levels of abstraction: low-level policies control individual units, while a high-level commander policy issues macro commands aligned with the overall mission targets. This hierarchical structure facilitates the training process by exploiting policy symmetries of individual agents and by separating control from command tasks. The low-level policies are trained for individual combat control in a curriculum of increasing complexity. The high-level commander is then trained on mission targets given pre-trained control policies. The empirical validation confirms the advantages of the proposed framework.
[AI-32] Generalization in Monitored Markov Decision Processes (Mon-MDPs)
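[Quick read]: This paper addresses generalization in reinforcement learning (RL) when rewards are only partially observable, specifically the case of unobservable rewards in monitored Markov decision processes (Mon-MDPs). Prior methods are limited to tabular state spaces and are hard to apply to real-world scenarios. By combining function approximation (FA) with a learned reward model, this work lets agents generalize from monitored states with observable rewards to unmonitored environment states, achieving near-optimal policies in environments formally defined as unsolvable. The key to the solution is using the reward model for effective generalization; the paper also shows that function approximation can overgeneralize, leading to incorrect reward extrapolation and undesirable behavior. To mitigate this, it proposes a cautious policy optimization method based on reward uncertainty.
Link: https://arxiv.org/abs/2505.08988
Authors: Montaser Mohammedalamen, Michael Bowling
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Notes: Under Review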
Abstract:Reinforcement learning (RL) typically models the interaction between the agent and environment as a Markov decision process (MDP), where the rewards that guide the agent’s behavior are always observable. However, in many real-world scenarios, rewards are not always observable, which can be modeled as a monitored Markov decision process (Mon-MDP). Prior work on Mon-MDPs has been limited to simple, tabular cases, restricting its applicability to real-world problems. This work explores Mon-MDPs using function approximation (FA) and investigates the challenges involved. We show that combining function approximation with a learned reward model enables agents to generalize from monitored states with observable rewards to unmonitored environment states with unobservable rewards. Therefore, we demonstrate that such generalization with a reward model achieves near-optimal policies in environments formally defined as unsolvable. However, we identify a critical limitation of such function approximation, where agents incorrectly extrapolate rewards due to overgeneralization, resulting in undesirable behaviors. To mitigate overgeneralization, we propose a cautious policy optimization method leveraging reward uncertainty. This work serves as a step towards bridging the gap between Mon-MDP theory and real-world applications.
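The abstract does not say how reward uncertainty enters the optimization. One common way to act cautiously, shown here purely as an assumed illustration, is to score each option by a pessimistic estimate: the mean of an ensemble of reward predictions minus a multiple of their spread. The ensemble values, the action names, and the `caution` weight are hypothetical.

```python
import statistics


def cautious_value(ensemble_predictions, caution=1.0):
    """Pessimistic reward estimate: ensemble mean minus `caution` times the spread."""
    mean = statistics.fmean(ensemble_predictions)
    spread = statistics.pstdev(ensemble_predictions)
    return mean - caution * spread


# Hypothetical reward-model ensemble outputs for two actions.
reward_estimates = {
    # Monitored states: the models agree because rewards were observed there.
    "stay_in_monitored_region": [0.9, 1.0, 1.1],
    # Unmonitored states: the models extrapolate and disagree strongly.
    "enter_unmonitored_region": [3.0, 0.0, 1.5],
}

greedy = max(reward_estimates, key=lambda a: statistics.fmean(reward_estimates[a]))
cautious = max(reward_estimates, key=lambda a: cautious_value(reward_estimates[a]))
```

Here the greedy rule follows the higher but unreliable extrapolated mean, while the cautious rule prefers the action whose reward the models agree on.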
[AI-33] GPML: Graph Processing for Machine Learning
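[Quick read]: This paper addresses the detection of complex, multi-step, and rapidly evolving attacks in dynamic networks, which pose serious challenges for traditional detection methods. The key to the solution is the GPML (Graph Processing for Machine Learning) library, which converts raw network traffic traces into graph representations, enabling in-depth analysis of network behavior, supporting the detection of anomalies and community-structure shifts, and improving both real-time detection and historical forensic analysis.
Link: https://arxiv.org/abs/2505.08964
Authors: Majed Jaber, Julien Michel, Nicolas Boutry, Pierre Parrend
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Notes: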
Abstract:The dramatic increase of complex, multi-step, and rapidly evolving attacks in dynamic networks calls for advanced cyber-threat detectors. The GPML (Graph Processing for Machine Learning) library addresses this need by transforming raw network traffic traces into graph representations, enabling advanced insights into network behaviors. The library provides tools to detect anomalies in interaction and community shifts in dynamic networks. GPML supports community and spectral metrics extraction, enhancing both real-time detection and historical forensics analysis. This library supports modern cybersecurity challenges with a robust, graph-based approach.
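To make the idea of turning traffic traces into a graph concrete, here is a minimal sketch in plain Python. It does not use the GPML library or its API, which the abstract does not describe; the flow-record layout `(src, dst, n_bytes)`, the helper names, and the fan-out threshold are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_graph(flows):
    """Aggregate (src, dst, n_bytes) flow records into a weighted directed graph."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst, n_bytes in flows:
        graph[src][dst] += n_bytes
    return graph


def fan_out(graph):
    """Number of distinct destinations contacted by each source host."""
    return {src: len(dsts) for src, dsts in graph.items()}


def flag_high_fan_out(graph, threshold):
    """Very simple anomaly rule: hosts contacting at least `threshold` peers."""
    return sorted(src for src, n in fan_out(graph).items() if n >= threshold)


flows = [
    ("10.0.0.1", "10.0.0.2", 1200),
    ("10.0.0.1", "10.0.0.2", 800),
    ("10.0.0.3", "10.0.0.2", 300),
    ("10.0.0.9", "10.0.0.1", 40),  # one host probing several peers
    ("10.0.0.9", "10.0.0.2", 40),
    ("10.0.0.9", "10.0.0.3", 40),
]
graph = build_graph(flows)
suspects = flag_high_fan_out(graph, threshold=3)
```

Community and spectral metrics of the kind the abstract mentions would be computed on a graph like this one; a fan-out count is only the simplest possible structural feature.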
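[AI-34] Tracing the Invisible: Understanding Students' Judgment in AI-Supported Design Work
[Quick read]: This paper examines how generative AI, when students use it as a collaborator in design workflows, affects design judgment. By analyzing reflections from 33 student teams in a human-computer interaction design course, the study identifies both established design judgments (such as instrumental, appreciative, and quality judgments) and emergent types (such as agency-distribution judgment and reliability judgment). The key contribution is to identify and foreground these new forms of judgment, offering a conceptual lens for understanding how students engage in co-creative sensemaking with AI in design contexts.
Link: https://arxiv.org/abs/2505.08939
Authors: Suchismita Naik, Prakash Shukla, Ike Obi, Jessica Backus, Nancy Rasche, Paul Parsons
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Notes: 5 pages, 2 Tables, In Creativity and Cognition 2025, June 23–25, 2025, Virtual, United Kingdom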
Abstract:As generative AI tools become integrated into design workflows, students increasingly engage with these tools not just as aids, but as collaborators. This study analyzes reflections from 33 student teams in an HCI design course to examine the kinds of judgments students make when using AI tools. We found both established forms of design judgment (e.g., instrumental, appreciative, quality) and emergent types: agency-distribution judgment and reliability judgment. These new forms capture how students negotiate creative responsibility with AI and assess the trustworthiness of its outputs. Our findings suggest that generative AI introduces new layers of complexity into design reasoning, prompting students to reflect not only on what AI produces, but also on how and when to rely on it. By foregrounding these judgments, we offer a conceptual lens for understanding how students engage in co-creative sensemaking with AI in design contexts.
[AI-35] A New Tractable Description Logic under Categorical Semantics
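[Quick read]: This paper addresses the intractability that results from adding a negation operator to the tractable Description Logic (DL) EL, in particular how to represent biomedical ontology concept or role names involving negative knowledge (such as lacks_part, absence_of) while keeping reasoning efficient. The key to the solution is to weaken the categorical semantics of the logical constructors by identifying and dropping the independent categorical properties responsible for intractability, thereby extending the expressiveness of EL while retaining tractability.
Link: https://arxiv.org/abs/2505.08916
Authors: Chan Le Duc, Ludovic Brieulle
Affiliations: unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Notes: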
Abstract:Biomedical ontologies contain numerous concept or role names involving negative knowledge such as lacks_part, absence_of. Such a representation with labels rather than logical constructors would not allow a reasoner to interpret lacks_part as a kind of negation of has_part. It is known that adding negation to the tractable Description Logic (DL) EL allowing for conjunction, existential restriction and concept inclusion makes it intractable since the obtained logic includes implicitly disjunction and universal restriction which interact with other constructors. In this paper, we propose a new extension of EL with a weakened negation allowing to represent negative knowledge while retaining tractability. To this end, we introduce categorical semantics of all logical constructors of the DL SH including EL with disjunction, negation, universal restriction, role inclusion and transitive roles. The categorical semantics of a logical constructor is usually described as a set of categorical properties referring to several objects without using set membership. To restore tractability, we have to weaken semantics of disjunction and universal restriction by identifying independent categorical properties that are responsible for intractability, and dropping them from the set of categorical properties. We show that the logic resulting from weakening semantics is more expressive than EL with the bottom concept, transitive roles and role inclusion.
[AI-36] FareShare: A Tool for Labor Organizers to Estimate Lost Wages and Contest Arbitrary AI and Algorithmic Deactivations
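[Quick read]: This paper addresses the income loss rideshare drivers suffer after sudden deactivation by a platform, where algorithmic decisions lack transparency and avenues for appeal. The key to the solution is the design and deployment of FareShare, a computational tool that automates the estimation of lost wages for deactivated drivers, making the appeals process more efficient and accurate. By eliminating manual data entry errors, shortening wage calculation time, and helping legal teams generate arbitration-ready reports, the tool strengthens labor organizers' capacity to handle these complex workflows.
Link: https://arxiv.org/abs/2505.08904
Authors: Varun Nagaraj Rao, Samantha Dalal, Andrew Schwartz, Amna Liaqat, Dana Calacci, Andrés Monroy-Hernández
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Notes: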
Abstract:What happens when a rideshare driver is suddenly locked out of the platform connecting them to riders, wages, and daily work? Deactivation, the abrupt removal of gig workers’ platform access, typically occurs through arbitrary AI and algorithmic decisions with little explanation or recourse. This represents one of the most severe forms of algorithmic control and often devastates workers’ financial stability. Recent U.S. state policies now mandate appeals processes and recovering compensation during the period of wrongful deactivation based on past earnings. Yet, labor organizers still lack effective tools to support these complex, error-prone workflows. We designed FareShare, a computational tool automating lost wage estimation for deactivated drivers, through a 6-month partnership with the State of Washington’s largest rideshare labor union. Over the following 3 months, our field deployment of FareShare registered 178 account signups. We observed that the tool could reduce lost wage calculation time by over 95%, eliminate manual data entry errors, and enable legal teams to generate arbitration-ready reports more efficiently. Beyond these gains, the deployment also surfaced important socio-technical challenges around trust, consent, and tool adoption in high-stakes labor contexts.
[AI-37] Deep reinforcement learning-based longitudinal control strategy for automated vehicles at signalised intersections
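[Quick read]: This paper addresses the development of a longitudinal control strategy for autonomous vehicles at signalised intersections (SI), a challenging problem because of its complex decision-making process. The key to the solution is a Deep Reinforcement Learning (DRL) based control strategy with a comprehensive reward function that focuses on a distance-headway-based efficiency reward, decision-making criteria during the amber light, and an asymmetric acceleration/deceleration response, alongside traditional safety and comfort criteria. The reward function is combined with two popular DRL algorithms, Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC), which handle the continuous action space of acceleration/deceleration. The models are trained on real-world leader-vehicle trajectories together with simulated trajectories generated by an Ornstein-Uhlenbeck process, and the results confirm improvements in traffic safety, efficiency, and comfort.
Link: https://arxiv.org/abs/2505.08896
Authors: Pankaj Kumar, Aditya Mishra, Pranamesh Chakraborty, Subrahmanya Swamy Peruru
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: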
Abstract:Developing an autonomous vehicle control strategy for signalised intersections (SI) is a challenging task due to its inherently complex decision-making process. This study proposes a Deep Reinforcement Learning (DRL) based longitudinal vehicle control strategy at SI. A comprehensive reward function has been formulated with a particular focus on (i) distance headway-based efficiency reward, (ii) decision-making criteria during amber light, and (iii) asymmetric acceleration/deceleration response, along with the traditional safety and comfort criteria. This reward function has been incorporated with two popular DRL algorithms, Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC), which can handle the continuous action space of acceleration/deceleration. The proposed models have been trained on the combination of real-world leader vehicle (LV) trajectories and simulated trajectories generated using the Ornstein-Uhlenbeck (OU) process. The overall performance of the proposed models has been tested using Cumulative Distribution Function (CDF) plots and compared with the real-world trajectory data. The results show that the RL models successfully maintain lower distance headway (i.e., higher efficiency) and jerk compared to human-driven vehicles without compromising safety. Further, to assess the robustness of the proposed models, we evaluated the model performance on diverse safety-critical scenarios, in terms of car-following and traffic signal compliance. Both DDPG and SAC models successfully handled the critical scenarios, while the DDPG model showed smoother action profiles compared to the SAC model. Overall, the results confirm that a DRL-based longitudinal vehicle control strategy at SI can help to improve traffic safety, efficiency, and comfort.
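For readers unfamiliar with the Ornstein-Uhlenbeck (OU) process used to generate simulated trajectories, the sketch below draws one path with the standard Euler-Maruyama discretization, x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * N(0, 1). Treating the simulated quantity as a leader-vehicle speed in m/s, clamping it at zero, and the specific values of `theta`, `mu`, `sigma`, and `dt` are assumptions for illustration; the abstract does not give the authors' settings.

```python
import math
import random


def ou_trajectory(n_steps, dt=0.1, theta=0.5, mu=10.0, sigma=1.0, x0=10.0, seed=0):
    """Simulate a mean-reverting OU path with Euler-Maruyama steps.

    theta: strength of the pull toward the long-run mean `mu`
    sigma: noise scale; dt: time step in seconds (all illustrative values)
    """
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        drift = theta * (mu - x) * dt
        diffusion = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Clamp at zero on the assumption that the value is a vehicle speed.
        path.append(max(x + drift + diffusion, 0.0))
    return path


speeds = ou_trajectory(n_steps=50)
```

With `sigma=0.0` the path decays deterministically toward `mu`, which is a quick way to check the drift term in isolation.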
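[AI-38] WaLLM – Insights from an LLM-Powered Chatbot deployment via WhatsApp
[Quick read]: This paper addresses the limited access to generative AI in developing regions caused by the digital divide. The key to the solution is WaLLM, a custom AI chatbot built on WhatsApp. The system boosts user engagement with a daily top question, suggested follow-up questions, trending and recent queries, and a leaderboard-based reward system, improving users' access to information in resource-constrained settings.
Link: https://arxiv.org/abs/2505.08894
Authors: Hiba Eltigani, Rukhshan Haroon, Asli Kocak, Abdullah Bin Faisal, Noah Martin, Fahad Dogar
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Notes: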
Abstract:Recent advances in generative AI, such as ChatGPT, have transformed access to information in education, knowledge-seeking, and everyday decision-making. However, in many developing regions, access remains a challenge due to the persistent digital divide. To help bridge this gap, we developed WaLLM, a custom AI chatbot over WhatsApp, a widely used communication platform in developing regions. Beyond answering queries, WaLLM offers several features to enhance user engagement: a daily top question, suggested follow-up questions, trending and recent queries, and a leaderboard-based reward system. Our service has been operational for over 6 months, amassing over 14.7K queries from approximately 100 users. In this paper, we present WaLLM’s design and a systematic analysis of logs to understand user interactions. Our results show that 55% of user queries seek factual information. “Health and well-being” was the most popular topic (28%), including queries about nutrition and disease, suggesting users view WaLLM as a reliable source. Two-thirds of users’ activity occurred within 24 hours of the daily top question. Users who accessed the “Leaderboard” interacted with WaLLM 3x as much as those who did not. We conclude by discussing implications for culture-based customization, user interface design, and appropriate calibration of users’ trust in AI systems for developing regions.
[AI-39] Optimized Couplings for Watermarking Large Language Models
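[Quick read]: This paper addresses how to detect watermarks in text generated by large language models (LLMs) in a one-shot setting while preserving text quality. The key to the solution is a watermarking scheme that couples the side information shared with the watermark detector to a random partition of the LLM vocabulary, and that identifies the optimal coupling and randomization strategy under the worst-case LLM next-token distribution satisfying a min-entropy constraint.
Link: https://arxiv.org/abs/2505.08878
Authors: Dor Tsur, Carol Xuan Long, Claudio Mayrink Verdun, Hsiang Hsu, Haim Permuter, Flavio P. Calmon
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Notes: Accepted at ISIT25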
Abstract:Large-language models (LLMs) are now able to produce text that is, in many cases, seemingly indistinguishable from human-generated content. This has fueled the development of watermarks that imprint a “signal” in LLM-generated text with minimal perturbation of an LLM’s output. This paper provides an analysis of text watermarking in a one-shot setting. Through the lens of hypothesis testing with side information, we formulate and analyze the fundamental trade-off between watermark detection power and distortion in generated textual quality. We argue that a key component in watermark design is generating a coupling between the side information shared with the watermark detector and a random partition of the LLM vocabulary. Our analysis identifies the optimal coupling and randomization strategy under the worst-case LLM next-token distribution that satisfies a min-entropy constraint. We provide a closed-form expression of the resulting detection rate under the proposed scheme and quantify the cost in a max-min sense. Finally, we provide an array of numerical results, comparing the proposed scheme with the theoretical optimum and existing schemes, in both synthetic data and LLM watermarking. Our code is available at this https URL
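The notion of a vocabulary partition tied to side information shared with the detector can be illustrated with the familiar red/green-list scheme. The sketch below is that generic scheme, not the optimized coupling derived in the paper, and it scores a multi-token sequence rather than the one-shot setting the paper analyzes. The toy vocabulary, the use of the key plus the previous token as side information, and the always-green "model" are assumptions for the example.

```python
import hashlib
import math
import random


def green_set(vocab, side_info, fraction=0.5):
    """Pseudo-random vocabulary partition derived from the shared side information."""
    digest = hashlib.sha256(side_info.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    shuffled = sorted(vocab)
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * fraction)])


def generate_watermarked(vocab, key, length, seed=0):
    """Toy 'model': always samples the next token from the current green set."""
    rng = random.Random(seed)
    tokens = [rng.choice(sorted(vocab))]
    while len(tokens) < length:
        greens = sorted(green_set(vocab, key + tokens[-1]))
        tokens.append(rng.choice(greens))
    return tokens


def z_score(tokens, vocab, key, fraction=0.5):
    """Detection statistic: green-token hits compared with the chance rate."""
    n = len(tokens) - 1
    hits = sum(
        tok in green_set(vocab, key + prev, fraction)
        for prev, tok in zip(tokens, tokens[1:])
    )
    return (hits - fraction * n) / math.sqrt(n * fraction * (1 - fraction))


vocab = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
text = generate_watermarked(vocab, key="secret", length=21)
score = z_score(text, vocab, key="secret")
```

Forcing every token into the green set maximizes detectability but distorts the output heavily; that tension between detection power and distortion is the trade-off the paper studies formally.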
[AI-40] Improved Algorithms for Differentially Private Language Model Alignment
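[Quick read]: This paper addresses the privacy concerns raised by sensitive user data in the alignment of large language models (LLMs), while maintaining alignment quality. The key to the solution is a framework of privacy-preserving alignment algorithms compatible with direct preference optimization (DPO) and reinforcement learning from human feedback (RLHF), validated through systematic experiments across privacy budgets; one of the algorithms, DP-AdamW, notably improves alignment quality under moderate privacy budgets.
Link: https://arxiv.org/abs/2505.08849
Authors: Keyu Chen, Hao Tang, Qinglin Liu, Yizhao Xu
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: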
Abstract:Language model alignment is crucial for ensuring that large language models (LLMs) align with human preferences, yet it often involves sensitive user data, raising significant privacy concerns. While prior work has integrated differential privacy (DP) with alignment techniques, their performance remains limited. In this paper, we propose novel algorithms for privacy-preserving alignment and rigorously analyze their effectiveness across varying privacy budgets and models. Our framework can be deployed on two celebrated alignment techniques, namely direct preference optimization (DPO) and reinforcement learning from human feedback (RLHF). Through systematic experiments on large-scale language models, we demonstrate that our approach achieves state-of-the-art performance. Notably, one of our algorithms, DP-AdamW, combined with DPO, surpasses existing methods, improving alignment quality by up to 15% under moderate privacy budgets (ε = 2 to 5). We further investigate the interplay between privacy guarantees, alignment efficacy, and computational demands, providing practical guidelines for optimizing these trade-offs.
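The abstract names DP-AdamW without describing it. A plausible reading, assumed here and not confirmed by the source, is the standard DP-SGD recipe (clip each example's gradient, sum, add Gaussian noise, average) followed by an ordinary AdamW update. The sketch below shows only that mechanism on plain Python lists; it performs no privacy accounting, so it does not by itself yield any particular ε, and all hyperparameter values are illustrative.

```python
import math
import random


def privatize(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to `clip_norm`, sum, add noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        scale = min(1.0, clip_norm / (norm + 1e-12))
        for i, x in enumerate(grad):
            total[i] += x * scale
    n = len(per_example_grads)
    std = noise_multiplier * clip_norm
    return [(t + rng.gauss(0.0, std)) / n for t in total]


def adamw_step(params, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
               weight_decay=0.01):
    """One AdamW update with decoupled weight decay; `state` holds t, m, v."""
    state["t"] += 1
    b1, b2 = betas
    updated = []
    for i, (p, g) in enumerate(zip(params, grad)):
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * g
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g
        m_hat = state["m"][i] / (1 - b1 ** state["t"])
        v_hat = state["v"][i] / (1 - b2 ** state["t"])
        p = p - lr * weight_decay * p  # decay applied directly to the weight
        updated.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # the first gradient has norm 5 and is clipped
private_grad = privatize(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
params = adamw_step([1.0, 1.0], private_grad, state)
```

Clipping bounds each example's influence on the update, and the noise scale is tied to that bound; a separate privacy accountant would be needed to translate `noise_multiplier` and the number of steps into a privacy budget.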
[AI-41] On the interplay of Explainability, Privacy and Predictive Performance with Explanation-assisted Model Extraction
[Quick read]: This paper addresses the privacy and security of Machine Learning as a Service (MLaaS) platforms against model extraction attacks (MEA), especially once explainable AI (XAI) is integrated and attackers can exploit counterfactual explanations (CFs) to facilitate MEA. The key to the solution is to evaluate the trade-offs that Differential Privacy (DP) imposes among model performance, privacy, and explainability, applying DP either during classification model training or at the explainer during counterfactual generation to mitigate CF-facilitated MEA.
Link: https://arxiv.org/abs/2505.08847
Authors: Fatima Ezzeddine, Rinad Akel, Ihab Sbeity, Silvia Giordano, Marc Langheinrich, Omran Ayoub
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes:
Abstract:Machine Learning as a Service (MLaaS) has gained significant traction as a means for deploying powerful predictive models, offering ease of use that enables organizations to leverage advanced analytics without substantial investments in specialized infrastructure or expertise. However, MLaaS platforms must be safeguarded against security and privacy attacks, such as model extraction attacks (MEA). The increasing integration of explainable AI (XAI) within MLaaS has introduced an additional privacy challenge, as attackers can exploit model explanations, particularly counterfactual explanations (CFs), to facilitate MEA. In this paper, we investigate the trade-offs among model performance, privacy, and explainability when employing Differential Privacy (DP), a promising technique for mitigating CF-facilitated MEA. We evaluate two distinct DP strategies: one applied during classification model training and one applied at the explainer during CF generation.
[AI-42] Evaluating Simplification Algorithms for Interpretability of Time Series Classification
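[Quick read]: This paper addresses the limited interpretability of time series classification (TSC), in particular that time series data, unlike text and image data, are not intuitive to humans, which makes classification results hard to understand. The key to the solution is a set of metrics for evaluating how well simplified time series support TSC interpretability: the complexity of a simplification (how many segments it contains) and its loyalty (how likely it is to preserve the classification of the original time series). Using these metrics, the authors evaluate four simplification algorithms across several TSC algorithms and datasets with varying characteristics; the results suggest that simplified time series are better for interpretability than the originals when the series are seasonal, non-stationary, and/or low-entropy.
Link: https://arxiv.org/abs/2505.08846
Authors: Felix Marti-Perez, Brigt Håvardstun, Cèsar Ferri, Carlos Monserrat, Jan Arne Telle
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: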
Abstract:In this work, we introduce metrics to evaluate the use of simplified time series in the context of interpretability of a Time Series Classifier (TSC). Such simplifications are important because time series data, in contrast to text and image data, are not intuitively understandable to humans. These metrics are related to the complexity of the simplifications (how many segments they contain) and to their loyalty (how likely they are to maintain the classification of the original time series). We employ these metrics to evaluate four distinct simplification algorithms, across several TSC algorithms and across datasets of varying characteristics, from seasonal or stationary to short or long. Our findings suggest that using simplifications for interpretability of TSC is much better than using the original time series, particularly when the time series are seasonal, non-stationary and/or with low entropy.
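The two metrics can be made concrete with a small sketch. Complexity is taken to be the number of linear segments, and loyalty the fraction of series whose predicted label is unchanged after simplification, following the definitions in the abstract. The evenly spaced piecewise-linear `simplify` and the spike-detecting `has_spike` classifier are hypothetical stand-ins, not any of the four algorithms or the classifiers evaluated in the paper.

```python
def simplify(series, n_segments):
    """Keep n_segments + 1 evenly spaced points and interpolate linearly between them.

    Assumes 1 <= n_segments <= len(series) - 1.
    """
    n = len(series)
    knots = [i * (n - 1) // n_segments for i in range(n_segments + 1)]
    out = []
    for a, b in zip(knots, knots[1:]):
        for t in range(a, b):
            w = (t - a) / (b - a)
            out.append((1 - w) * series[a] + w * series[b])
    out.append(series[-1])
    return out


def loyalty(classifier, originals, simplified):
    """Fraction of series whose label survives the simplification."""
    agree = sum(classifier(o) == classifier(s) for o, s in zip(originals, simplified))
    return agree / len(originals)


def has_spike(series):
    """Toy classifier: label 1 if any value exceeds 1.0, else 0."""
    return int(max(series) > 1.0)


originals = [
    [0, 0, 0, 2, 0, 0, 0, 0, 0],  # narrow spike that falls between knots
    [0, 1, 2, 3, 4, 3, 2, 1, 0],  # broad peak that survives coarse knots
    [0.5] * 9,                    # flat series
]
coarse = [simplify(s, n_segments=2) for s in originals]
fine = [simplify(s, n_segments=8) for s in originals]
coarse_loyalty = loyalty(has_spike, originals, coarse)
fine_loyalty = loyalty(has_spike, originals, fine)
```

The example shows the trade-off the metrics are meant to capture: the two-segment simplification is easier to read but erases the narrow spike and changes one label, while the eight-segment version reproduces every point and keeps all labels.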
[AI-43] Will AI Take My Job? Evolving Perceptions of Automation and Labor Risk in Latin America
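[Quick read]: This paper examines public perceptions of the impact of artificial intelligence and robotics on the labor market, focusing on fear of job loss due to these technologies in Latin America. The key to the approach is to combine survey data from the 2017, 2018, 2020, and 2023 waves of the Latinobarómetro with statistical modeling and latent class analysis to identify the main structural and ideological predictors of concern, with education level and political orientation emerging as the most consistent drivers.
Link: https://arxiv.org/abs/2505.08841
Authors: Andrea Cremaschi, Dae-Jin Lee, Manuele Leonelli
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Notes: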
Abstract:As artificial intelligence and robotics increasingly reshape the global labor market, understanding public perceptions of these technologies becomes critical. We examine how these perceptions have evolved across Latin America, using survey data from the 2017, 2018, 2020, and 2023 waves of the Latinobarómetro. Drawing on responses from over 48,000 individuals across 16 countries, we analyze fear of job loss due to artificial intelligence and robotics. Using statistical modeling and latent class analysis, we identify key structural and ideological predictors of concern, with education level and political orientation emerging as the most consistent drivers. Our findings reveal substantial temporal and cross-country variation, with a notable peak in fear during 2018 and distinct attitudinal profiles emerging from latent segmentation. These results offer new insights into the social and structural dimensions of AI anxiety in emerging economies and contribute to a broader understanding of public attitudes toward automation beyond the Global North.
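[AI-44] Federated Large Language Models: Feasibility, Robustness, Security and Future Directions
[Quick Read]: This paper addresses the challenges that arise when combining Federated Learning (FL) with Large Language Models (LLMs): privacy protection, data silos, communication and computation overheads, heterogeneity, and security risks. Its key contribution is a systematic analysis of the feasibility, robustness, security, and future directions of FLLM, covering methods to enhance system robustness, strategies for coping with resource, data, and task heterogeneity, and defense mechanisms against privacy threats and security challenges, in support of the sustainable development of FLLM.
Link: https://arxiv.org/abs/2505.08830
Authors: Wenhao Jiang, Yuchuan Luo, Guilin Deng, Silong Chen, Xu Yang, Shihong Wu, Xinwen Gao, Lin Liu, Shaojing Fu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 35 pages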
Abstract:The integration of Large Language Models (LLMs) and Federated Learning (FL) presents a promising solution for joint training on distributed data while preserving privacy and addressing data silo issues. However, this emerging field, known as Federated Large Language Models (FLLM), faces significant challenges, including communication and computation overheads, heterogeneity, privacy and security concerns. Current research has primarily focused on the feasibility of FLLM, but future trends are expected to emphasize enhancing system robustness and security. This paper provides a comprehensive review of the latest advancements in FLLM, examining challenges from four critical perspectives: feasibility, robustness, security, and future directions. We present an exhaustive survey of existing studies on FLLM feasibility, introduce methods to enhance robustness in the face of resource, data, and task heterogeneity, and analyze novel risks associated with this integration, including privacy threats and security challenges. We also review the latest developments in defense mechanisms and explore promising future research directions, such as few-shot learning, machine unlearning, and IP protection. This survey highlights the pressing need for further research to enhance system robustness and security while addressing the unique challenges posed by the integration of FL and LLM.
[AI-45] Aggregating Concepts of Fairness and Accuracy in Predictive Systems
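[Quick Read]: This paper addresses the potential conflict between accuracy and fairness in predictive algorithms, and the difficulty of aggregating these two properties across different measures. The key to its solution is to argue for a linear combination of accuracy and fairness metrics as the measure of a predictive algorithm's overall value. The argument rests on a classic result due to Harsanyi in the preference aggregation literature and gives decision-makers who care about both accuracy and fairness a principled framework for combined evaluation.
Link: https://arxiv.org/abs/2505.08829
Authors: David Kinney
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: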
Abstract:An algorithm that outputs predictions about the state of the world will almost always be designed with the implicit or explicit goal of outputting accurate predictions (i.e., predictions that are likely to be true). In addition, the rise of increasingly powerful predictive algorithms brought about by the recent revolution in artificial intelligence has led to an emphasis on building predictive algorithms that are fair, in the sense that their predictions do not systematically evince bias or bring about harm to certain individuals or groups. This state of affairs presents two conceptual challenges. First, the goals of accuracy and fairness can sometimes be in tension, and there are no obvious normative guidelines for managing the trade-offs between these two desiderata when they arise. Second, there are many distinct ways of measuring both the accuracy and fairness of a predictive algorithm; here too, there are no obvious guidelines on how to aggregate our preferences for predictive algorithms that satisfy disparate measures of fairness and accuracy to various extents. The goal of this paper is to address these challenges by arguing that there are good reasons for using a linear combination of accuracy and fairness metrics to measure the all-things-considered value of a predictive algorithm for agents who care about both accuracy and fairness. My argument depends crucially on a classic result in the preference aggregation literature due to Harsanyi. After making this formal argument, I apply my result to an analysis of accuracy-fairness trade-offs using the COMPAS dataset compiled by Angwin et al.
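The linear-aggregation idea in this abstract can be illustrated with a minimal sketch. It is not the paper's formal construction: the metric names, weights, and scores below are hypothetical, and all scores are assumed to lie in [0, 1] with higher meaning better.

```python
def aggregate_value(metrics, weights):
    """Weighted linear combination of metric scores.

    Assumes every score is in [0, 1] and that higher is better.
    """
    if set(metrics) != set(weights):
        raise ValueError("metrics and weights must use the same keys")
    return sum(weights[name] * metrics[name] for name in metrics)


# Hypothetical weights expressing how much an agent cares about each metric.
weights = {"accuracy": 0.6, "demographic_parity": 0.4}

# Hypothetical scores for two candidate models.
model_a = {"accuracy": 0.90, "demographic_parity": 0.50}
model_b = {"accuracy": 0.80, "demographic_parity": 0.75}

value_a = aggregate_value(model_a, weights)
value_b = aggregate_value(model_b, weights)
preferred = "model_b" if value_b > value_a else "model_a"
```

With these made-up numbers the less accurate but fairer model scores higher overall; changing the weights changes the ranking, which is exactly the trade-off the weights are meant to encode.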
[AI-46] Self Rewarding Self Improving
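[Quick Read]: This paper addresses how large language models (LLMs) can self-improve without reference solutions, particularly in reinforcement learning settings limited by the difficulty of constructing programmatic rewards. The key is to exploit the inherent asymmetry between generating and verifying solutions, using self-judging to supply reliable reward signals and so enable a self-improvement loop without human annotation. Combined with synthetic question generation, the model generates practice problems, solves them, and evaluates its own performance, ultimately surpassing GPT-4o on integration tasks and demonstrating the effectiveness of LLM judges in training.
Link: https://arxiv.org/abs/2505.08827
Authors: Toby Simonds, Kevin Lopez, Akira Yoshiyama, Dominique Garmier
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: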
Abstract:We demonstrate that large language models can effectively self-improve through self-judging without requiring reference solutions, leveraging the inherent asymmetry between generating and verifying solutions. Our experiments on Countdown puzzles and MIT Integration Bee problems show that models can provide reliable reward signals without ground truth answers, enabling reinforcement learning in domains previously not possible. By implementing self-judging, we achieve significant performance gains maintaining alignment with formal verification. When combined with synthetic question generation, we establish a complete self-improvement loop where models generate practice problems, solve them, and evaluate their own performance, achieving an 8% improvement with Qwen 2.5 7B over baseline and surpassing GPT-4o performance on integration tasks. Our findings demonstrate that LLM judges can provide effective reward signals for training models, unlocking many reinforcement learning environments previously limited by the difficulty of creating programmatic rewards. This suggests a potential paradigm shift toward AI systems that continuously improve through self-directed learning rather than human-guided training, potentially accelerating progress in domains with scarce training data or complex evaluation requirements.
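The generation/verification asymmetry mentioned in this abstract is easy to see on Countdown puzzles: finding an expression that hits the target requires search, while checking a proposed expression takes a few lines. The sketch below is a hypothetical formal verifier, not the authors' code; it assumes the common rule that each given number may be used at most once and that only +, -, *, and / are allowed.

```python
import ast
from collections import Counter
from fractions import Fraction

# Supported binary operators, evaluated exactly with Fraction arithmetic.
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate an arithmetic AST and record every integer literal it uses."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        used.append(node.value)
        return Fraction(node.value)
    raise ValueError("unsupported expression")


def verify_countdown(expression, numbers, target):
    """Return True if `expression` reaches `target` using each given number at most once."""
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    # Counter subtraction keeps only numbers used more often than they were supplied.
    if Counter(used) - Counter(numbers):
        return False
    return value == target
```

For example, `verify_countdown("(25 - 5) * 4 + 3", [25, 5, 4, 3, 7], 83)` accepts the candidate, while an expression that reuses a number or calls a function is rejected.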
[AI-47] Multi-source Plume Tracing via Multi-Agent Reinforcement Learning
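[Quick Read]: This paper addresses rapid, accurate localization of multiple airborne pollution sources in complex turbulent conditions, motivated by the threat that industrial disasters pose to public health and the environment. The key is a multi-agent reinforcement learning (MARL) algorithm that models the problem as a Partially Observable Markov Game (POMG) and employs a Long Short-Term Memory (LSTM)-based Action-specific Double Deep Recurrent Q-Network (ADDRQN), which uses full sequences of historical action-observation pairs to approximate latent states, improving the model's adaptability in partially observable environments.
Link: https://arxiv.org/abs/2505.08825
Authors: Pedro Antonio Alarcon Granadeno, Theodore Chambers, Jane Cleland-Huang
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 figures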
Abstract:Industrial catastrophes like the Bhopal disaster (1984) and the Aliso Canyon gas leak (2015) demonstrate the urgent need for rapid and reliable plume tracing algorithms to protect public health and the environment. Traditional methods, such as gradient-based or biologically inspired approaches, often fail in realistic, turbulent conditions. To address these challenges, we present a Multi-Agent Reinforcement Learning (MARL) algorithm designed for localizing multiple airborne pollution sources using a swarm of small uncrewed aerial systems (sUAS). Our method models the problem as a Partially Observable Markov Game (POMG), employing a Long Short-Term Memory (LSTM)-based Action-specific Double Deep Recurrent Q-Network (ADDRQN) that uses full sequences of historical action-observation pairs, effectively approximating latent states. Unlike prior work, we use a general-purpose simulation environment based on the Gaussian Plume Model (GPM), incorporating realistic elements such as a three-dimensional environment, sensor noise, multiple interacting agents, and multiple plume sources. The incorporation of action histories as part of the inputs further enhances the adaptability of our model in complex, partially observable environments. Extensive simulations show that our algorithm significantly outperforms conventional approaches. Specifically, our model allows agents to explore only 1.29% of the environment to successfully locate pollution sources.
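For readers unfamiliar with the Gaussian Plume Model named in this abstract, the sketch below implements the textbook steady-state form with ground reflection. It is not the authors' simulator: the wind is assumed to blow along the +x axis, and the linear dispersion coefficients `spread_y` and `spread_z` are hypothetical placeholders for a proper atmospheric stability parameterization.

```python
import math


def plume_concentration(x, y, z, emission_rate, wind_speed, source_height,
                        spread_y=0.08, spread_z=0.06):
    """Concentration at (x, y, z) from a source at the origin with the given release height.

    x is the downwind distance, y the crosswind offset, z the height above ground.
    spread_y and spread_z are assumed linear growth rates of the plume widths.
    """
    if x <= 0:
        return 0.0  # this simple model puts no concentration upwind of the source
    sigma_y = spread_y * x
    sigma_z = spread_z * x
    crosswind = math.exp(-y ** 2 / (2 * sigma_y ** 2))
    # The second exponential is the mirror-image term for reflection at the ground.
    vertical = (math.exp(-(z - source_height) ** 2 / (2 * sigma_z ** 2))
                + math.exp(-(z + source_height) ** 2 / (2 * sigma_z ** 2)))
    scale = emission_rate / (2 * math.pi * wind_speed * sigma_y * sigma_z)
    return scale * crosswind * vertical


def total_concentration(x, y, z, sources, wind_speed):
    """Sum the contributions of several sources, each given as (x0, y0, height, rate)."""
    return sum(
        plume_concentration(x - x0, y - y0, z, rate, wind_speed, height)
        for x0, y0, height, rate in sources
    )
```

A sensor reading in a multi-source scenario would then be `total_concentration(...)` plus noise; the noise model and the 3D agent dynamics described in the abstract are outside this sketch.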
[AI-48] Position: Restructuring of Categories and Implementation of Guidelines Essential for VLM Adoption in Healthcare
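[Quick Read]: This paper addresses the lack of clear, standardized reporting protocols for the development, adaptation, and application of vision language models (VLMs), particularly in the high-stakes healthcare domain. The key is to restructure traditional machine learning reporting standards and evaluation guidelines to accommodate multiphase VLM studies, while keeping them intuitive for developers and rigorous enough for reproducibility. To that end, the paper proposes a categorization framework for VLM studies and corresponding reporting standards covering performance evaluation, data reporting protocols, and recommendations for manuscript composition.
Link: https://arxiv.org/abs/2505.08818
Authors: Amara Tariq, Rimita Lahiri, Charles Kahn, Imon Banerjee
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 2 tables, 3 figures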
Abstract:The intricate and multifaceted nature of vision language model (VLM) development, adaptation, and application necessitates the establishment of clear and standardized reporting protocols, particularly within the high-stakes context of healthcare. Defining these reporting standards is inherently challenging due to the diverse nature of studies involving VLMs, which vary significantly from the development of all new VLMs or finetuning for domain alignment to off-the-shelf use of VLM for targeted diagnosis and prediction tasks. In this position paper, we argue that traditional machine learning reporting standards and evaluation guidelines must be restructured to accommodate multiphase VLM studies; it also has to be organized for intuitive understanding of developers while maintaining rigorous standards for reproducibility. To facilitate community adoption, we propose a categorization framework for VLM studies and outline corresponding reporting standards that comprehensively address performance evaluation, data reporting protocols, and recommendations for manuscript composition. These guidelines are organized according to the proposed categorization scheme. Lastly, we present a checklist that consolidates reporting standards, offering a standardized tool to ensure consistency and quality in the publication of VLM-related research.
[AI-49] Machine Learning-Based Detection of DDoS Attacks in VANETs for Emergency Vehicle Communication
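[Quick Read]: This paper addresses the disruption of safety-critical communication channels by Distributed Denial of Service (DDoS) attacks in Vehicular Ad Hoc Networks (VANETs), with the goal of improving their reliability. The key is a synthetic dataset built with NS-3 and SUMO and enriched with real-world mobility traces from Germany's A81 highway. Features are extracted through data preprocessing and feature engineering, feature importance is assessed with SHAP, and classifiers such as XGBoost and CatBoost are used for DDoS detection, reaching an F1-score of 96%.
Link: https://arxiv.org/abs/2505.08810
Authors: Bappa Muktar, Vincent Fono, Adama Nouboukpo
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: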
Abstract:Vehicular Ad Hoc Networks (VANETs) play a key role in Intelligent Transportation Systems (ITS), particularly in enabling real-time communication for emergency vehicles. However, Distributed Denial of Service (DDoS) attacks, which interfere with safety-critical communication channels, can severely impair their reliability. This study introduces a robust and scalable framework to detect DDoS attacks in highway-based VANET environments. A synthetic dataset was constructed using Network Simulator 3 (NS-3) in conjunction with the Simulation of Urban Mobility (SUMO) and further enriched with real-world mobility traces from Germany’s A81 highway, extracted via OpenStreetMap (OSM). Three traffic categories were simulated: DDoS, VoIP, and TCP-based video streaming (VideoTCP). The data preprocessing pipeline included normalization, signal-to-noise ratio (SNR) feature engineering, missing value imputation, and class balancing using the Synthetic Minority Over-sampling Technique (SMOTE). Feature importance was assessed using SHapley Additive exPlanations (SHAP). Eleven classifiers were benchmarked, among them XGBoost (XGB), CatBoost (CB), AdaBoost (AB), GradientBoosting (GB), and an Artificial Neural Network (ANN). XGB and CB achieved the best performance, each attaining an F1-score of 96%. These results highlight the robustness of the proposed framework and its potential for real-time deployment in VANETs to secure critical emergency communications.
[AI-50] MixBridge: Heterogeneous Image-to-Image Backdoor Attack through Mixture of Schrödinger Bridges
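[Quick Read]: This paper addresses implanting multiple heterogeneous backdoor triggers in bridge-based diffusion models; existing backdoor formulations mainly target single-attack scenarios and are limited to Gaussian noise input models. The key is MixBridge, a novel diffusion Schrödinger bridge (DSB) framework that accommodates arbitrary input distributions and injects backdoor triggers by training directly on poisoned image pairs. This avoids the cumbersome modifications to stochastic differential equations required in previous studies and provides a flexible tool for studying backdoor behavior in bridge models.
Link: https://arxiv.org/abs/2505.08809
Authors: Shixi Qin, Zhiyong Yang, Shilong Bao, Shi Wang, Qianqian Xu, Qingming Huang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: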
Abstract:This paper focuses on implanting multiple heterogeneous backdoor triggers in bridge-based diffusion models designed for complex and arbitrary input distributions. Existing backdoor formulations mainly address single-attack scenarios and are limited to Gaussian noise input models. To fill this gap, we propose MixBridge, a novel diffusion Schrödinger bridge (DSB) framework to cater to arbitrary input distributions (taking I2I tasks as special cases). Beyond this trait, we demonstrate that backdoor triggers can be injected into MixBridge by directly training with poisoned image pairs. This eliminates the need for the cumbersome modifications to stochastic differential equations required in previous studies, providing a flexible tool to study backdoor behavior for bridge models. However, a key question arises: can a single DSB model train multiple backdoor triggers? Unfortunately, our theory shows that when attempting this, the model ends up following the geometric mean of benign and backdoored distributions, leading to performance conflict across backdoor tasks. To overcome this, we propose a Divide-and-Merge strategy to mix different bridges, where models are independently pre-trained for each specific objective (Divide) and then integrated into a unified model (Merge). In addition, a Weight Reallocation Scheme (WRS) is also designed to enhance the stealthiness of MixBridge. Empirical studies across diverse generation tasks speak to the efficacy of MixBridge.
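To give some intuition for why following a geometric mean of distributions can cause conflict, here is a toy illustration on discrete distributions. It is only a sketch of the general concept, not the paper's theory for bridge models; the two three-outcome distributions are hypothetical.

```python
import math


def geometric_mean_distribution(p, q):
    """Normalized pointwise geometric mean of two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    unnormalized = [math.sqrt(a * b) for a, b in zip(p, q)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("distributions have disjoint support")
    return [u / total for u in unnormalized]


# Hypothetical targets: the benign task favors outcome 0, the backdoor task outcome 2.
benign = [0.8, 0.2, 0.0]
backdoored = [0.0, 0.2, 0.8]

mixed = geometric_mean_distribution(benign, backdoored)
```

Here `mixed` puts all of its mass on the middle outcome, the only one both targets support, so it matches neither the benign nor the backdoored distribution. A single model pulled toward such a compromise would serve both objectives poorly, which is the kind of conflict the Divide-and-Merge strategy is described as avoiding.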
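[AI-51] Security of Internet of Agents: Attacks and Countermeasures
[Quick Read]: This paper addresses security and privacy issues in Internet of Agents (IoA) systems, in particular the identity authentication threats, cross-agent trust issues, embodied security, and privacy risks that arise when heterogeneous agents coordinate. The key is a comprehensive analysis of existing defense mechanisms and the identification of open research directions for improving the resilience and privacy protection of IoA ecosystems.
Link: https://arxiv.org/abs/2505.08807
Authors: Yuntao Wang, Yanghe Pan, Shaolong Guo, Zhou Su
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures, 3 tables, submitted to IEEE OJCS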
Abstract:With the rise of large language and vision-language models, AI agents have evolved into autonomous, interactive systems capable of perception, reasoning, and decision-making. As they proliferate across virtual and physical domains, the Internet of Agents (IoA) has emerged as a key infrastructure for enabling scalable and secure coordination among heterogeneous agents. This survey offers a comprehensive examination of the security and privacy landscape in IoA systems. We begin by outlining the IoA architecture and its distinct vulnerabilities compared to traditional networks, focusing on four critical aspects: identity authentication threats, cross-agent trust issues, embodied security, and privacy risks. We then review existing and emerging defense mechanisms and highlight persistent challenges. Finally, we identify open research directions to advance the development of resilient and privacy-preserving IoA ecosystems.
[AI-52] Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models
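[Quick Read]: This paper addresses model collapse in multi-modal generative systems caused by continual training on self-generated data, a phenomenon observed in single-modality generative models but not yet well understood in the multi-modal setting. The key finding is that general approaches such as increased decoding budgets, greater model diversity, and relabeling with frozen models can effectively mitigate model collapse, offering practical guidance for curating robust multi-modal synthetic datasets and building self-improving multi-agent AI systems.
Link: https://arxiv.org/abs/2505.08803
Authors: Zizhao Hu, Mohammad Rostami, Jesse Thomason
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: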
Abstract:Recent research has highlighted the risk of generative model collapse, where performance progressively degrades when continually trained on self-generated data. However, existing exploration on model collapse is limited to single, unimodal models, limiting our understanding in more realistic scenarios, such as diverse multi-modal AI agents interacting autonomously through synthetic data and continually evolving. We expand the synthetic data training and model collapse study to multi-modal vision-language generative systems, such as vision-language models (VLMs) and text-to-image diffusion models, as well as recursive generate-train loops with multiple models. We find that model collapse, previously observed in single-modality generative models, exhibits distinct characteristics in the multi-modal context, such as improved vision-language alignment and increased variance in VLM image-captioning task. Additionally, we find that general approaches such as increased decoding budgets, greater model diversity, and relabeling with frozen models can effectively mitigate model collapse. Our findings provide initial insights and practical guidelines for reducing the risk of model collapse in self-improving multi-agent AI systems and curating robust multi-modal synthetic datasets.
[AI-53] CaMDN: Enhancing Cache Efficiency for Multi-tenant DNNs on Integrated NPUs
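[Quick Read]: This paper addresses the performance bottleneck caused by the shared cache when multi-tenant deep neural networks (DNNs) run on a single system-on-chip (SoC). The key is CaMDN, an architecture-scheduling co-design: a lightweight architecture supports model-exclusive, NPU-controlled regions inside the shared cache to eliminate unexpected cache contention, and a cache-aware mapping method together with a dynamic allocation algorithm improves shared cache utilization and overall multi-tenant DNN performance.
Link: https://arxiv.org/abs/2505.06625
Authors: Tianhao Cai, Liang Wang, Limin Xiao, Meng Han, Zeyu Wang, Lin Sun, Xiaojian Liao
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
Comments: 7 pages, 9 figures. This paper has been accepted to the 2025 Design Automation Conference (DAC)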
Abstract:With the rapid development of DNN applications, multi-tenant execution, where multiple DNNs are co-located on a single SoC, is becoming a prevailing trend. Although many methods are proposed in prior works to improve multi-tenant performance, the impact of shared cache is not well studied. This paper proposes CaMDN, an architecture-scheduling co-design to enhance cache efficiency for multi-tenant DNNs on integrated NPUs. Specifically, a lightweight architecture is proposed to support model-exclusive, NPU-controlled regions inside shared cache to eliminate unexpected cache contention. Moreover, a cache scheduling method is proposed to improve shared cache utilization. In particular, it includes a cache-aware mapping method for adaptability to the varying available cache capacity and a dynamic allocation algorithm to adjust the usage among co-located DNNs at runtime. Compared to prior works, CaMDN reduces the memory access by 33.4% on average and achieves a model speedup of up to 2.56× (1.88× on average).
[AI-54] A Retrieval-Augmented Generation Framework for Academic Literature Navigation in Data Science
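[Quick Read]: This paper addresses efficient navigation of the large body of academic literature in data science, in support of decision-making and innovation. The key is an enhanced Retrieval-Augmented Generation (RAG) application that integrates GROBID for extracting bibliographic information, fine-tuned embedding models, semantic chunking, and an abstract-first retrieval method, improving the relevance and accuracy of the retrieved information.
Link: https://arxiv.org/abs/2412.15404
Authors: Ahmet Yasin Aytar, Kemal Kilic, Kamer Kaya
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: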
Abstract:In the rapidly evolving field of data science, efficiently navigating the expansive body of academic literature is crucial for informed decision-making and innovation. This paper presents an enhanced Retrieval-Augmented Generation (RAG) application, an artificial intelligence (AI)-based system designed to assist data scientists in accessing precise and contextually relevant academic resources. The AI-powered application integrates advanced techniques, including the GeneRation Of BIbliographic Data (GROBID) technique for extracting bibliographic information, fine-tuned embedding models, semantic chunking, and an abstract-first retrieval method, to significantly improve the relevance and accuracy of the retrieved information. This implementation of AI specifically addresses the challenge of academic literature navigation. A comprehensive evaluation using the Retrieval-Augmented Generation Assessment System (RAGAS) framework demonstrates substantial improvements in key metrics, particularly Context Relevance, underscoring the system’s effectiveness in reducing information overload and enhancing decision-making processes. Our findings highlight the potential of this enhanced Retrieval-Augmented Generation system to transform academic exploration within data science, ultimately advancing the workflow of research and innovation in the field.
[AI-55] WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
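[Quick Read]: This paper addresses the lack of evaluation of spoken dialogue models' conversational performance: intelligent chatbots convey a wealth of non-textual information that text-based language models cannot easily measure. The key is WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems. WavReward incorporates a deep reasoning process and a nonlinear reward mechanism, and uses multi-sample feedback via a reinforcement learning algorithm to build an evaluator tailored to spoken dialogue models. The paper also introduces ChatReward-30K, a preference dataset for training WavReward that covers both comprehension and generation aspects of spoken dialogue models.
Link: https://arxiv.org/abs/2505.09558
Authors: Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, Zhou Zhao
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
Comments: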
Abstract:End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models’ conversational performance has largely been overlooked. This is primarily because intelligent chatbots convey a wealth of non-textual information that cannot be easily measured using text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) based on audio language models, WavReward incorporates the deep reasoning process and the nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via the reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a preference dataset used to train WavReward. ChatReward-30K includes both comprehension and generation aspects of spoken dialogue models. These scenarios span various tasks, such as text-based chats, nine acoustic attributes of instruction chats, and implicit chats. WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, achieving a substantial improvement over Qwen2.5-Omni in objective accuracy, from 55.1% to 91.5%. In subjective A/B testing, WavReward also leads by a margin of 83%. Comprehensive ablation studies confirm the necessity of each component of WavReward. All data and code will be publicly available at this https URL after the paper is accepted.
[AI-56] Quantum state-agnostic work extraction (almost) without dissipation
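[Quick Read]: This paper addresses how to transfer the maximum possible energy to a battery using sequential access to multiple copies of an unknown pure qubit state. The core challenge is designing interactions that optimally balance two competing goals: charging the battery optimally with the qubit in hand, and acquiring more information to improve energy harvesting in subsequent rounds. The key is to leverage the exploration-exploitation trade-off from reinforcement learning to develop adaptive strategies whose energy dissipation scales only poly-logarithmically in N, an exponential improvement over existing protocols based on full state tomography.
Link: https://arxiv.org/abs/2505.09456
Authors: Josep Lumbreras, Ruo Cheng Huang, Yanglin Hu, Mile Gu, Marco Tomamichel
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages+14 pages, 2 figures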
Abstract:We investigate work extraction protocols designed to transfer the maximum possible energy to a battery using sequential access to N copies of an unknown pure qubit state. The core challenge is designing interactions to optimally balance two competing goals: charging of the battery optimally using the qubit in hand, and acquiring more information by qubit to improve energy harvesting in subsequent rounds. Here, we leverage the exploration-exploitation trade-off in reinforcement learning to develop adaptive strategies achieving energy dissipation that scales only poly-logarithmically in N. This represents an exponential improvement over current protocols based on full state tomography.
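[AI-57] Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment
[Quick Read]: This paper addresses the concern that generative AI may affect learning processes and assessment integrity in physics education, focusing on its physics problem-solving performance and the implications for instruction and assessment design. The key is a comparison of a general-purpose LLM (GPT-4o) and a reasoning-optimized model (o1-preview) against participants of the German Physics Olympiad on Olympiad problems, assessing the models' strengths and limitations and informing the design of summative and formative assessment in physics education.
Link: https://arxiv.org/abs/2505.09438
Authors: Paul Tschisgale, Holger Maus, Fabian Kieser, Ben Kroehs, Stefan Petersen, Peter Wulff
Affiliations: Unknown
Categories: Physics Education (physics.ed-ph); Artificial Intelligence (cs.AI)
Comments: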
Abstract:Large language models (LLMs) are now widely accessible, reaching learners at all educational levels. This development has raised concerns that their use may circumvent essential learning processes and compromise the integrity of established assessment formats. In physics education, where problem solving plays a central role in instruction and assessment, it is therefore essential to understand the physics-specific problem-solving capabilities of LLMs. Such understanding is key to informing responsible and pedagogically sound approaches to integrating LLMs into instruction and assessment. This study therefore compares the problem-solving performance of a general-purpose LLM (GPT-4o, using varying prompting techniques) and a reasoning-optimized model (o1-preview) with that of participants of the German Physics Olympiad, based on a set of well-defined Olympiad problems. In addition to evaluating the correctness of the generated solutions, the study analyzes characteristic strengths and limitations of LLM-generated solutions. The findings of this study indicate that both tested LLMs (GPT-4o and o1-preview) demonstrate advanced problem-solving capabilities on Olympiad-type physics problems, on average outperforming the human participants. Prompting techniques had little effect on GPT-4o’s performance, while o1-preview almost consistently outperformed both GPT-4o and the human benchmark. Based on these findings, the study discusses implications for the design of summative and formative assessment in physics education, including how to uphold assessment integrity and support students in critically engaging with LLMs.
[AI-58] Quantum-Enhanced Parameter-Efficient Learning for Typhoon Trajectory Forecasting
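[Quick Read]: This paper addresses the heavy computational cost and training complexity of typhoon trajectory forecasting, which stem from the complexity of atmospheric dynamics and the resource demands of deep learning models. The key is Quantum Parameter Adaptation (QPA), which uses quantum neural networks (QNNs) to generate trainable parameters during training, so that no quantum hardware is needed at inference time. Integrated with an Attention-based Multi-ConvGRU model, QPA significantly reduces the number of trainable parameters while maintaining predictive accuracy, offering a scalable and energy-efficient quantum machine learning (QML) approach to large-scale typhoon trajectory prediction.
Link: https://arxiv.org/abs/2505.09395
Authors: Chen-Yu Liu, Kuan-Cheng Chen, Yi-Chien Chen, Samuel Yen-Chi Chen, Wei-Hao Huang, Wei-Jia Huang, Yen-Jui Chang
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: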
Abstract:Typhoon trajectory forecasting is essential for disaster preparedness but remains computationally demanding due to the complexity of atmospheric dynamics and the resource requirements of deep learning models. Quantum-Train (QT) is a hybrid quantum-classical framework that leverages quantum neural networks (QNNs) to generate trainable parameters exclusively during training, eliminating the need for quantum hardware at inference time. Building on QT’s success across multiple domains, including image classification, reinforcement learning, flood prediction, and large language model (LLM) fine-tuning, we introduce Quantum Parameter Adaptation (QPA) for efficient typhoon forecasting model learning. Integrated with an Attention-based Multi-ConvGRU model, QPA enables parameter-efficient training while maintaining predictive accuracy. This work represents the first application of quantum machine learning (QML) to large-scale typhoon trajectory prediction, offering a scalable and energy-efficient approach to climate modeling. Our results demonstrate that QPA significantly reduces the number of trainable parameters while preserving performance, making high-performance forecasting more accessible and sustainable through hybrid quantum-classical learning.
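[AI-59] TensorRL-QAS: Reinforcement learning with tensor networks for scalable quantum architecture search
[Quick Read]: This paper addresses the challenge of designing quantum circuits that both solve the target problem and comply with device limitations on noisy intermediate-scale quantum hardware, especially the scalability issues that variational quantum algorithms face as circuit complexity grows. The key is TensorRL-QAS, a scalable framework that combines tensor network (TN) methods with reinforcement learning (RL): it warm-starts the architecture search with a matrix product state approximation of the target solution, which narrows the search space and accelerates convergence to the desired solution.
Link: https://arxiv.org/abs/2505.09371
Authors: Akash Kundu, Stefano Mangini
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: The code will be available soon! Comments are welcomed!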
Abstract:Variational quantum algorithms hold the promise to address meaningful quantum problems already on noisy intermediate-scale quantum hardware, but they face the challenge of designing quantum circuits that both solve the target problem and comply with device limitations. Quantum architecture search (QAS) automates this design process, with reinforcement learning (RL) emerging as a promising approach. Yet, RL-based QAS methods encounter significant scalability issues, as computational and training costs grow rapidly with the number of qubits, circuit depth, and noise, severely impacting performance. To address these challenges, we introduce TensorRL-QAS, a scalable framework that combines tensor network (TN) methods with RL for designing quantum circuits. By warm-starting the architecture search with a matrix product state approximation of the target solution, TensorRL-QAS effectively narrows the search space to physically meaningful circuits, accelerating convergence to the desired solution. Tested on several quantum chemistry problems of up to 12 qubits, TensorRL-QAS achieves up to a 10-fold reduction in CNOT count and circuit depth compared to baseline methods, while maintaining or surpassing chemical accuracy. It reduces function evaluations by up to 100-fold, accelerates training episodes by up to 98%, and achieves up to 50% success probability for 10-qubit systems, far exceeding the 1% rates of baseline approaches. Robustness and versatility are demonstrated both in the noiseless and noisy scenarios, where we report a simulation of up to 8 qubits. These advancements establish TensorRL-QAS as a promising candidate for a scalable and efficient quantum circuit discovery protocol on near-term quantum hardware.
[AI-60] InvDesFlow-AL: Active Learning-based Workflow for Inverse Design of Functional Materials
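[Quick Read]: This paper addresses the low success rates of generating and predicting crystal structures in the inverse design of functional materials, with the goal of accelerating materials discovery. The key is InvDesFlow-AL, a novel generative framework for inverse materials design based on active learning strategies, which iteratively optimizes the generation process to guide it gradually toward desired performance characteristics, improving both crystal structure prediction accuracy and the efficiency of materials exploration.
Link: https://arxiv.org/abs/2505.09203
Authors: Xiao-Qi Han, Peng-Jie Guo, Ze-Feng Gao, Hao Sun, Zhong-Yi Lu
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Superconductivity (cond-mat.supr-con); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 29 pages, 11 figures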
Abstract:Developing inverse design methods for functional materials with specific properties is critical to advancing fields like renewable energy, catalysis, energy storage, and carbon capture. Generative models based on diffusion principles can directly produce new materials that meet performance constraints, thereby significantly accelerating the material design process. However, existing methods for generating and predicting crystal structures often remain limited by low success rates. In this work, we propose a novel inverse material design generative framework called InvDesFlow-AL, which is based on active learning strategies. This framework can iteratively optimize the material generation process to gradually guide it towards desired performance characteristics. In terms of crystal structure prediction, the InvDesFlow-AL model achieves an RMSE of 0.0423 Å, representing a 32.96% improvement in performance compared to existing generative models. Additionally, InvDesFlow-AL has been successfully validated in the design of low-formation-energy and low-Ehull materials. It can systematically generate materials with progressively lower formation energies while continuously expanding the exploration across diverse chemical spaces. These results fully demonstrate the effectiveness of the proposed active learning-driven generative model in accelerating material discovery and inverse design. To further prove the effectiveness of this method, we took the search for BCS superconductors under ambient pressure as an example explored by InvDesFlow-AL. As a result, we successfully identified Li$_2$AuH$_6$ as a conventional BCS superconductor with an ultra-high transition temperature of 140 K. This discovery provides strong empirical support for the application of inverse design in materials science.
[AI-61] When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes ICLR2025
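[Quick read]: This paper addresses how to tokenize genomic sequences effectively to support comparative genomics in the context of telomere-to-telomere (T2T) genome assemblies. The approach applies Byte Pair Encoding (BPE): independent BPE tokenizers with a fixed vocabulary of 512,000 tokens are trained on nine T2T primate genomes (including three human assemblies). The study finds that very few tokens are shared while most are genome-specific, indicating that although BPE compresses repetitive sequences well, its sensitivity to high-copy repetitive elements limits its suitability as a universal tool for comparative genomics.
Link: https://arxiv.org/abs/2505.08918
Authors: Marina Popova, Iaroslav Chelombitko, Aleksey Komissarov
Affiliations: Unknown
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments: ICLR 2025 Workshop on Machine Learning for Genomics Explorations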
Abstract:The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte Pair Encoding (BPE) to nine T2T primate genomes including three human assemblies by training independent BPE tokenizers with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all assemblies, while nearly 991,854 tokens are unique to a single genome, indicating a rapid decline in shared vocabulary with increasing assembly comparisons. Moreover, phylogenetic trees derived from token overlap failed to recapitulate established primate relationships, a discrepancy attributed to the disproportionate influence of species-specific high-copy repetitive elements. These findings underscore the dual nature of BPE tokenization: while it effectively compresses repetitive sequences, its sensitivity to high-copy elements limits its utility as a universal tool for comparative genomics. We discuss potential hybrid strategies and repeat-masking approaches to refine genomic tokenization, emphasizing the need for domain-specific adaptations in the development of large-scale genomic language models. The dnaBPE tool used in this study is open-source and available at this https URL.
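The shared-versus-unique token comparison described in this abstract can be illustrated with a toy sketch. This is not the authors' dnaBPE tool: `train_bpe` and `overlap_stats` are hypothetical helper names, the sequences are made up, and the merge count is tiny compared with the 512,000-token vocabularies used in the study.

```python
from collections import Counter


def train_bpe(sequence, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair.

    Returns the learned vocabulary as a set of strings.
    """
    tokens = list(sequence)
    vocab = set(tokens)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so no merge is worthwhile
        merged = left + right
        vocab.add(merged)
        # Rewrite the token stream, replacing each occurrence of the pair.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return vocab


def overlap_stats(vocabs):
    """Return (tokens shared by every vocabulary, tokens found in exactly one)."""
    shared = set.intersection(*vocabs)
    counts = Counter(tok for vocab in vocabs for tok in vocab)
    unique = {tok for tok, c in counts.items() if c == 1}
    return shared, unique


# Two made-up "genomes" dominated by different repeats.
vocab_a = train_bpe("AC" * 10, num_merges=2)
vocab_b = train_bpe("GT" * 10, num_merges=2)
shared, unique = overlap_stats([vocab_a, vocab_b])
print(sorted(shared), sorted(unique))
```

Because each toy sequence is dominated by its own repeat, the merges are spent on repeat-specific tokens and the two vocabularies barely overlap, which mirrors the sensitivity to high-copy elements that the abstract reports.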
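[AI-62] CellTypeAgent: Trustworthy cell type annotation with Large Language Models
[Quick read]: This paper targets cell type annotation, a critical yet time-consuming step in single-cell RNA sequencing analysis. The key to its solution is CellTypeAgent, a trustworthy large language model (LLM) agent that combines LLMs with verification against relevant databases, improving annotation accuracy while reducing hallucinations.
Link: https://arxiv.org/abs/2505.08844
Authors: Jiawen Chen, Jianghao Zhang, Huaxiu Yao, Yun Li
Affiliations: Unknown
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments: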
Abstract:Cell type annotation is a critical yet laborious step in single-cell RNA sequencing analysis. We present a trustworthy large language model (LLM)-agent, CellTypeAgent, which integrates LLMs with verification from relevant databases. CellTypeAgent achieves higher accuracy than existing methods while mitigating hallucinations. We evaluated CellTypeAgent across nine real datasets involving 303 cell types from 36 tissues. This combined approach holds promise for more efficient and reliable cell type annotation.
[AI-63] A Comparative Study of Transformer-Based Models for Multi-Horizon Blood Glucose Prediction
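[Quick read]: This paper aims to improve the accuracy of blood glucose (BG) prediction for people with type 1 diabetes, in support of novel interventions such as personalized insulin and dietary adjustments. The key is to use Transformer-based architectures, in particular patch-wise embeddings of multivariate time series, to capture and exploit seasonal patterns in the data and improve prediction accuracy. Experiments show that models with patch-wise embeddings perform well across prediction horizons, with the best results obtained using up to one week of input history.
Link: https://arxiv.org/abs/2505.08821
Authors: Meryem Altin Karagoz, Marc D. Breton, Anas El Fathi
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 7 pages, 2 figures, 1 table, 1st IFAC Workshop on Engineering Diabetes Technologies (EDT 2025)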
Abstract:Accurate blood glucose prediction can enable novel interventions for type 1 diabetes treatment, including personalized insulin and dietary adjustments. Although recent advances in transformer-based architectures have demonstrated the power of attention mechanisms in complex multivariate time series prediction, their potential for blood glucose (BG) prediction remains underexplored. We present a comparative analysis of transformer models for multi-horizon BG prediction, examining forecasts up to 4 hours and input history up to 1 week. The publicly available DCLP3 dataset (n=112) was split (80%-10%-10%) for training, validation, and testing, and the OhioT1DM dataset (n=12) served as an external test set. We trained networks with point-wise, patch-wise, series-wise, and hybrid embeddings, using CGM, insulin, and meal data. For short-term blood glucose prediction, Crossformer, a patch-wise transformer architecture, achieved the best 30-minute prediction RMSE (15.6 mg/dL on OhioT1DM). For longer-term predictions (1h, 2h, and 4h), PatchTST, another patch-wise transformer, prevailed with the lowest RMSE (24.6 mg/dL, 36.1 mg/dL, and 46.5 mg/dL on OhioT1DM). In general, models that used tokenization through patches demonstrated improved accuracy with larger input sizes, with the best results obtained with a one-week history. These findings highlight the promise of transformer-based architectures for BG prediction by capturing and leveraging seasonal patterns in multivariate time-series data to improve accuracy.
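As a rough illustration of what patch-wise tokenization means for a glucose trace, the sketch below splits a series into fixed-length patches. The helper name `make_patches`, the 5-minute sampling interval, and the one-hour patch length are assumptions made for illustration; they are not taken from the paper or from the Crossformer or PatchTST implementations.

```python
def make_patches(series, patch_len, stride):
    """Split a 1-D series into fixed-length patches taken every `stride` steps.

    Trailing samples that do not fill a complete patch are dropped.
    """
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [series[s:s + patch_len] for s in range(0, last_start + 1, stride)]


# Assumption: one CGM reading every 5 minutes, so one week is 7 * 24 * 12 samples.
one_week = [100.0] * (7 * 24 * 12)
hourly_patches = make_patches(one_week, patch_len=12, stride=12)
print(len(one_week), len(hourly_patches))
```

Under these assumed settings a week of readings becomes a few hundred patch tokens instead of a couple of thousand point tokens, which is one reason patching makes long input histories tractable for attention.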
Machine Learning
[LG-0] DataMIL: Selecting Data for Robot Imitation Learning with Datamodels
Link: https://arxiv.org/abs/2505.09603
Authors: Shivin Dass, Alaa Khaddaj, Logan Engstrom, Aleksander Madry, Andrew Ilyas, Roberto Martín-Martín
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Recently, the robotics community has amassed ever larger and more diverse datasets to train generalist robot policies. However, while these policies achieve strong mean performance across a variety of tasks, they often underperform on individual, specialized tasks and require further tuning on newly acquired task-specific data. Combining task-specific data with carefully curated subsets of large prior datasets via co-training can produce better specialized policies, but selecting data naively may actually harm downstream performance. To address this, we introduce DataMIL, a policy-driven data selection framework built on the datamodels paradigm that reasons about data selection in an end-to-end manner, using the policy itself to identify which data points will most improve performance. Unlike standard practices that filter data using human notions of quality (e.g., based on semantic or visual similarity), DataMIL directly optimizes data selection for task success, allowing us to select data that enhance the policy while dropping data that degrade it. To avoid performing expensive rollouts in the environment during selection, we use a novel surrogate loss function on task-specific data, allowing us to use DataMIL in the real world without degrading performance. We validate our approach on a suite of more than 60 simulation and real-world manipulation tasks (most notably showing successful data selection from the Open X-Embodiment datasets), demonstrating consistent gains in success rates and superior performance over multiple baselines. Our results underscore the importance of end-to-end, performance-aware data selection for unlocking the potential of large prior datasets in robotics. More information at this https URL
[LG-1] Adversarial Suffix Filtering: a Defense Pipeline for LLMs
Link: https://arxiv.org/abs/2505.09602
Authors: David Khachaturov, Robert Mullins
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:Large Language Models (LLMs) are increasingly embedded in autonomous systems and public-facing environments, yet they remain susceptible to jailbreak vulnerabilities that may undermine their security and trustworthiness. Adversarial suffixes are considered to be the current state-of-the-art jailbreak, consistently outperforming simpler methods and frequently succeeding even in black-box settings. Existing defenses rely on access to the internal architecture of models, limiting diverse deployment, increase memory and computation footprints dramatically, or can be bypassed with simple prompt engineering methods. We introduce Adversarial Suffix Filtering (ASF), a lightweight, novel, model-agnostic defensive pipeline designed to protect LLMs against adversarial suffix attacks. ASF functions as an input preprocessor and sanitizer that detects and filters adversarially crafted suffixes in prompts, effectively neutralizing malicious injections. We demonstrate that ASF provides comprehensive defense capabilities across both black-box and white-box attack settings, reducing the attack efficacy of state-of-the-art adversarial suffix generation methods to below 4%, while only minimally affecting the target model’s capabilities in non-adversarial scenarios.
[LG-2] Online Isolation Forest ICML2024
Link: https://arxiv.org/abs/2505.09593
Authors: Filippo Leveni, Guilherme Weigert Cassales, Bernhard Pfahringer, Albert Bifet, Giacomo Boracchi
Subjects: Machine Learning (cs.LG)
Comments: Accepted at International Conference on Machine Learning (ICML 2024)
Abstract:The anomaly detection literature is abundant with offline methods, which require repeated access to data in memory, and impose impractical assumptions when applied to a streaming context. Existing online anomaly detection methods also generally fail to address these constraints, resorting to periodic retraining to adapt to the online context. We propose Online-iForest, a novel method explicitly designed for streaming conditions that seamlessly tracks the data generating process as it evolves over time. Experimental validation on real-world datasets demonstrated that Online-iForest is on par with online alternatives and closely rivals state-of-the-art offline anomaly detection techniques that undergo periodic retraining. Notably, Online-iForest consistently outperforms all competitors in terms of efficiency, making it a promising solution in applications where fast identification of anomalies is of primary importance such as cybersecurity, fraud and fault detection.
[LG-3] Rhomboid Tiling for Geometric Graph Deep Learning
Link: https://arxiv.org/abs/2505.09586
Authors: Yipeng Zhang, Longlong Li, Kelin Xia
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Graph Neural Networks (GNNs) have proven effective for learning from graph-structured data through their neighborhood-based message passing framework. Many hierarchical graph clustering pooling methods modify this framework by introducing clustering-based strategies, enabling the construction of more expressive and powerful models. However, all of these message passing frameworks rely heavily on the connectivity structure of graphs, limiting their ability to capture the rich geometric features inherent in geometric graphs. To address this, we propose Rhomboid Tiling (RT) clustering, a novel clustering method based on the rhomboid tiling structure, which performs clustering by leveraging the complex geometric information of the data and effectively extracts its higher-order geometric structures. Moreover, we design RTPool, a hierarchical graph clustering pooling model based on RT clustering for graph classification tasks. The proposed model demonstrates superior performance, outperforming 21 state-of-the-art competitors on all 7 benchmark datasets.
[LG-4] SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures
Link: https://arxiv.org/abs/2505.09572
Authors: Julian Kranz, Davide Gallon, Steffen Dereich, Arnulf Jentzen
Subjects: Machine Learning (cs.LG); Logic (math.LO); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 27 pages, 4 figures
Abstract:We study gradient flows for loss landscapes of fully connected feedforward neural networks with commonly used continuously differentiable activation functions such as the logistic, hyperbolic tangent, softplus or GELU function. We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an asymptotic critical value. Moreover, we prove the existence of a threshold $\varepsilon > 0$ such that the loss value of any gradient flow initialized at most $\varepsilon$ above the optimal level converges to it. For polynomial target functions and sufficiently big architecture and data set, we prove that the optimal loss value is zero and can only be realized asymptotically. From this setting, we deduce our main result that any gradient flow with sufficiently good initialization diverges to infinity. Our proof heavily relies on the geometry of o-minimal structures. We confirm these theoretical findings with numerical experiments and extend our investigation to real-world scenarios, where we observe an analogous behavior.
[LG-5] Distilling Realizable Students from Unrealizable Teachers
Link: https://arxiv.org/abs/2505.09546
Authors: Yujin Kim, Nathaniel Chin, Arnav Vasudev, Sanjiban Choudhury
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:We study policy distillation under privileged information, where a student policy with only partial observations must learn from a teacher with full-state access. A key challenge is information asymmetry: the student cannot directly access the teacher’s state space, leading to distributional shifts and policy degradation. Existing approaches either modify the teacher to produce realizable but sub-optimal demonstrations or rely on the student to explore missing information independently, both of which are inefficient. Our key insight is that the student should strategically interact with the teacher (querying only when necessary and resetting from recovery states) to stay on a recoverable path within its own observation space. We introduce two methods: (i) an imitation learning approach that adaptively determines when the student should query the teacher for corrections, and (ii) a reinforcement learning approach that selects where to initialize training for efficient exploration. We validate our methods in both simulated and real-world robotic tasks, demonstrating significant improvements over standard teacher-student baselines in training efficiency and final performance. The project website is available at this https URL
[LG-6] Towards Fair In-Context Learning with Tabular Foundation Models
Link: https://arxiv.org/abs/2505.09503
Authors: Patrik Kenfack, Samira Ebrahimi Kaho, Ulrich Aïvodji
Subjects: Machine Learning (cs.LG)
Comments: 24 pages, 10 figures, 4 tables
Abstract:Tabular foundational models have exhibited strong in-context learning (ICL) capabilities on structured data, allowing them to make accurate predictions on test sets without parameter updates, using training examples as context. This emerging approach positions itself as a competitive alternative to traditional gradient-boosted tree methods. However, while biases in conventional machine learning models are well documented, it remains unclear how these biases manifest in tabular ICL. The paper investigates the fairness implications of tabular ICL and explores three preprocessing strategies (correlation removal, group-balanced demonstration selection, and uncertainty-based demonstration selection) to address bias. Comprehensive experiments indicate that uncertainty-based demonstration selection consistently enhances group fairness of in-context predictions. The source code for reproducing the results of this work can be found at this https URL.
[LG-7] Layered Unlearning for Adversarial Relearning
Link: https://arxiv.org/abs/2505.09500
Authors: Timothy Qian, Vinith Suriyakumar, Ashia Wilson, Dylan Hadfield-Menell
Subjects: Machine Learning (cs.LG)
Comments: 37 pages, 8 figures
Abstract:Our goal is to understand how post-training methods, such as fine-tuning, alignment, and unlearning, modify language model behavior and representations. We are particularly interested in the brittle nature of these modifications that makes them easy to bypass through prompt engineering or relearning. Recent results suggest that post-training induces shallow context-dependent "circuits" that suppress specific response patterns. This could be one explanation for the brittleness of post-training. To test this hypothesis, we design an unlearning algorithm, Layered Unlearning (LU), that creates distinct inhibitory mechanisms for a growing subset of the data. By unlearning the first $i$ folds while retaining the remaining $k - i$ at the $i$-th of $k$ stages, LU limits the ability of relearning on a subset of data to recover the full dataset. We evaluate LU through a combination of synthetic and large language model (LLM) experiments. We find that LU improves robustness to adversarial relearning for several different unlearning methods. Our results contribute to the state-of-the-art of machine unlearning and provide insight into the effect of post-training updates.
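The stage schedule stated in this abstract (unlearn the first i folds, retain the remaining k - i, at stage i of k) can be written down directly. The sketch below only enumerates which folds play which role at each stage; the function name `layered_unlearning_schedule` is hypothetical, and the actual unlearning and retention objectives used by the authors are not shown.

```python
def layered_unlearning_schedule(k):
    """For stages i = 1..k, list the folds to unlearn and the folds to retain.

    Folds are indexed 0..k-1. Stage i unlearns the first i folds and retains
    the remaining k - i folds.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    folds = list(range(k))
    return [(folds[:i], folds[i:]) for i in range(1, k + 1)]


for stage, (unlearn, retain) in enumerate(layered_unlearning_schedule(3), start=1):
    print(f"stage {stage}: unlearn folds {unlearn}, retain folds {retain}")
```

Each stage extends the forget set by one fold while the still-retained folds act as a contrast set, which is how the abstract motivates the claim that relearning one fold should not recover the others.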
[LG-8] Variational Rank Reduction Autoencoder
Link: https://arxiv.org/abs/2505.09458
Authors: Jad Mounayer, Alicia Tierz, Jerome Tomezyk, Chady Ghnatios, Francisco Chinesta
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Deterministic Rank Reduction Autoencoders (RRAEs) enforce by construction a regularization on the latent space by applying a truncated SVD. While this regularization makes Autoencoders more powerful, using them for generative purposes is counter-intuitive due to their deterministic nature. On the other hand, Variational Autoencoders (VAEs) are well known for their generative abilities by learning a probabilistic latent space. In this paper, we present Variational Rank Reduction Autoencoders (VRRAEs), a model that leverages the advantages of both RRAEs and VAEs. Our claims and results show that when carefully sampling the latent space of RRAEs and further regularizing with the Kullback-Leibler (KL) divergence (similarly to VAEs), VRRAEs outperform RRAEs and VAEs. Additionally, we show that the regularization induced by the SVD not only makes VRRAEs better generators than VAEs, but also reduces the possibility of posterior collapse. Our results include a small synthetic dataset that showcases the robustness of VRRAEs against collapse, and three real-world datasets (MNIST, CelebA, and CIFAR-10), over which VRRAEs are shown to outperform both VAEs and RRAEs on many random generation and interpolation tasks based on the FID score.
[LG-9] Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenchel-Young Losses
Link: https://arxiv.org/abs/2505.09432
Authors: Yuzhou Cao, Han Bao, Lei Feng, Bo An
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Surrogate regret bounds bridge the gap between the convergence rates of surrogate and target losses, with linear bounds favorable for their lossless regret transfer. While convex smooth surrogate losses are appealing in particular due to the efficient estimation and optimization, the existence of a trade-off between the smoothness and linear regret bound has been believed in the community. That being said, the better optimization and estimation properties of convex smooth surrogate losses may inevitably deteriorate after undergoing the regret transfer onto a target loss. We overcome this dilemma for arbitrary discrete target losses by constructing a convex smooth surrogate loss, which entails a linear surrogate regret bound composed with a tailored prediction link. The construction is based on Fenchel-Young losses generated by the convolutional negentropy, which are equivalent to the infimal convolution of a generalized negentropy and the target Bayes risk. Consequently, the infimal convolution enables us to derive a smooth loss while maintaining the surrogate regret bound linear. We additionally benefit from the infimal convolution to have a consistent estimator of the underlying class probability. Our results are overall a novel demonstration of how convex analysis penetrates into optimization and statistical efficiency in risk minimization.
[LG-10] Train a Multi-Task Diffusion Policy on RLBench-18 in One Day with One GPU
Link: https://arxiv.org/abs/2505.09430
Authors: Yutong Hu, Pinhao Song, Kehan Wen, Renaud Detry
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:We present a method for training multi-task vision-language robotic diffusion policies that reduces training time and memory usage by an order of magnitude. This improvement arises from a previously underexplored distinction between action diffusion and the image diffusion techniques that inspired it: image generation targets are high-dimensional, while robot actions lie in a much lower-dimensional space. Meanwhile, the vision-language conditions for action generation remain high-dimensional. Our approach, Mini-Diffuser, exploits this asymmetry by introducing Level-2 minibatching, which pairs multiple noised action samples with each vision-language condition, instead of the conventional one-to-one sampling strategy. To support this batching scheme, we introduce architectural adaptations to the diffusion transformer that prevent information leakage across samples while maintaining full conditioning access. In RLBench simulations, Mini-Diffuser achieves 95% of the performance of state-of-the-art multi-task diffusion policies, while using only 5% of the training time and 7% of the memory. Real-world experiments further validate that Mini-Diffuser preserves the key strengths of diffusion-based policies, including the ability to model multimodal action distributions and produce behavior conditioned on diverse perceptual inputs. Code available at this http URL.
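To make the idea of pairing several noised action samples with one vision-language condition concrete, here is a minimal batching sketch. It is not the Mini-Diffuser code: `level2_minibatch`, the linear noise schedule, and the dictionary layout are all assumptions made for illustration, and the architectural changes that prevent leakage across samples are not modeled.

```python
import random


def level2_minibatch(conditions, actions, samples_per_condition, num_timesteps, rng):
    """Pair each (condition, clean action) with several independently noised copies.

    The expensive condition appears once per outer item, while the cheap
    low-dimensional action is noised `samples_per_condition` times.
    """
    batch = []
    for cond_id, (cond, action) in enumerate(zip(conditions, actions)):
        for _ in range(samples_per_condition):
            t = rng.randrange(num_timesteps)
            # Hypothetical linear signal-retention schedule, strictly inside (0, 1).
            alpha = 1.0 - (t + 1) / (num_timesteps + 1)
            noise = [rng.gauss(0.0, 1.0) for _ in action]
            noised = [alpha ** 0.5 * a + (1.0 - alpha) ** 0.5 * n
                      for a, n in zip(action, noise)]
            batch.append({"cond_id": cond_id, "cond": cond, "t": t,
                          "noised_action": noised, "target_noise": noise})
    return batch


rng = random.Random(0)
batch = level2_minibatch(["scene A", "scene B"], [[0.1, 0.2], [0.3, 0.4]],
                         samples_per_condition=4, num_timesteps=100, rng=rng)
print(len(batch), "training samples from 2 encoded conditions")
```

In this sketch the number of denoising targets grows with `samples_per_condition` while the number of distinct conditions to encode stays fixed, which is the asymmetry the abstract describes.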
[LG-11] SafePath: Conformal Prediction for Safe LLM-Based Autonomous Navigation
Link: https://arxiv.org/abs/2505.09427
Authors: Achref Doula, Max Mühläuser, Alejandro Sanchez Guinea
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:Large Language Models (LLMs) show growing promise in autonomous driving by reasoning over complex traffic scenarios to generate path plans. However, their tendencies toward overconfidence and hallucinations raise critical safety concerns. We introduce SafePath, a modular framework that augments LLM-based path planning with formal safety guarantees using conformal prediction. SafePath operates in three stages. In the first stage, we use an LLM that generates a set of diverse candidate paths, exploring possible trajectories based on agent behaviors and environmental cues. In the second stage, SafePath filters out high-risk trajectories while guaranteeing that at least one safe option is included with a user-defined probability, through a multiple-choice question-answering formulation that integrates conformal prediction. In the final stage, our approach selects the path with the lowest expected collision risk when uncertainty is low or delegates control to a human when uncertainty is high. We theoretically prove that SafePath guarantees a safe trajectory with a user-defined probability, and we show how its human delegation rate can be tuned to balance autonomy and safety. Extensive experiments on nuScenes and Highway-env show that SafePath reduces planning uncertainty by 77% and collision rates by up to 70%, demonstrating effectiveness in making LLM-driven path planning safer.
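The second stage described in this abstract relies on conformal prediction over multiple-choice options. A generic split conformal sketch is shown below; it is not SafePath's exact procedure. The nonconformity score (one minus the probability assigned to an option), the function names, and the toy numbers are assumptions for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.

    `cal_scores` are nonconformity scores of the correct option on held-out
    calibration examples. Returns infinity when n is too small for this alpha.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(option_probs, threshold):
    """Keep every option whose score (1 - probability) is within the threshold."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= threshold]


# Toy calibration scores and a toy three-way choice between candidate paths.
threshold = conformal_threshold([0.4, 0.1, 0.3, 0.2], alpha=0.5)
kept = prediction_set({"path_A": 0.75, "path_B": 0.2, "path_C": 0.05}, threshold)
print(threshold, kept)
```

A small prediction set signals low uncertainty, while a large one is the kind of situation in which the abstract says control is delegated to a human.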
[LG-12] Personalized Control for Lower Limb Prosthesis Using Kolmogorov-Arnold Networks
Link: https://arxiv.org/abs/2505.09366
Authors: SeyedMojtaba Mohasel, Alireza Afzal Aghaei, Corey Pew
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Objective: This paper investigates the potential of learnable activation functions in Kolmogorov-Arnold Networks (KANs) for personalized control in a lower-limb prosthesis. In addition, user-specific vs. pooled training data is evaluated to improve machine learning (ML) and Deep Learning (DL) performance for turn intent prediction. Method: Inertial measurement unit (IMU) data from the shank were collected from five individuals with lower-limb amputation performing turning tasks in a laboratory setting. Ability to classify an upcoming turn was evaluated for Multilayer Perceptron (MLP), Kolmogorov-Arnold Network (KAN), convolutional neural network (CNN), and fractional Kolmogorov-Arnold Networks (FKAN). The comparison of MLP and KAN (for ML models) and FKAN and CNN (for DL models) assessed the effectiveness of learnable activation functions. Models were trained separately on user-specific and pooled data to evaluate the impact of training data on their performance. Results: Learnable activation functions in KAN and FKAN did not yield significant improvement compared to MLP and CNN, respectively. Training on user-specific data yielded superior results compared to pooled data for ML models ($p < 0.05$). In contrast, no significant difference was observed between user-specific and pooled training for DL models. Significance: These findings suggest that learnable activation functions may demonstrate distinct advantages in datasets involving more complex tasks and larger volumes. In addition, pooled training showed comparable performance to user-specific training in DL models, indicating that model training for prosthesis control can utilize data from multiple participants.
[LG-13] Diffusion Recommender Models and the Illusion of Progress: A Concerning Study of Reproducibility and a Conceptual Mismatch
Link: https://arxiv.org/abs/2505.09364
Authors: Michael Benigni, Maurizio Ferrari Dacrema, Dietmar Jannach
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Countless new machine learning models are published every year and are reported to significantly advance the state-of-the-art in top-n recommendation. However, earlier reproducibility studies indicate that progress in this area may be quite limited. Specifically, various widespread methodological issues, e.g., comparisons with untuned baseline models, have led to an illusion of progress. In this work, our goal is to examine whether these problems persist in today’s research. To this end, we aim to reproduce the latest advancements reported from applying modern Denoising Diffusion Probabilistic Models to recommender systems, focusing on four models published at the top-ranked SIGIR conference in 2023 and 2024. Our findings are concerning, revealing persistent methodological problems. Alarmingly, through experiments, we find that the latest recommendation techniques based on diffusion models, despite their computational complexity and substantial carbon footprint, are consistently outperformed by simpler existing models. Furthermore, we identify key mismatches between the characteristics of diffusion models and those of the traditional top-n recommendation task, raising doubts about their suitability for recommendation. We also note that, in the papers we analyze, the generative capabilities of these models are constrained to a minimum. Overall, our results and continued methodological issues call for greater scientific rigor and a disruptive change in the research and publication culture in this area.
[LG-14] Efficient Mixed Precision Quantization in Graph Neural Networks
Link: https://arxiv.org/abs/2505.09361
Authors: Samir Moustafa, Nils M. Kriege, Wilfried N. Gansterer
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Graph Neural Networks (GNNs) have become essential for handling large-scale graph applications. However, the computational demands of GNNs necessitate the development of efficient methods to accelerate inference. Mixed precision quantization emerges as a promising solution to enhance the efficiency of GNN architectures without compromising prediction performance. Compared to conventional deep learning architectures, GNN layers contain a wider set of components that can be quantized, including message passing functions, aggregation functions, update functions, the inputs, learnable parameters, and outputs of these functions. In this paper, we introduce a theorem for efficient quantized message passing to aggregate integer messages. It guarantees numerical equality of the aggregated messages using integer values with respect to those obtained with full (FP32) precision. Based on this theorem, we introduce the Mixed Precision Quantization for GNN (MixQ-GNN) framework, which flexibly selects effective integer bit-widths for all components within GNN layers. Our approach systematically navigates the wide set of possible bit-width combinations, addressing the challenge of optimizing efficiency while aiming at maintaining comparable prediction performance. MixQ-GNN integrates with existing GNN quantization methods, utilizing their graph structure advantages to achieve higher prediction performance. On average, MixQ-GNN achieved reductions in bit operations of 5.5x for node classification and 5.1x for graph classification compared to architectures represented in FP32 precision.
[LG-15] Exploiting the Potential Supervision Information of Clean Samples in Partial Label Learning
Link: https://arxiv.org/abs/2505.09354
Authors: Guangtai Wang, Chi-Man Vong, Jintao Huang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Diminishing the impact of false-positive labels is critical for conducting disambiguation in partial label learning. However, the existing disambiguation strategies mainly focus on exploiting the characteristics of individual partial label instances while neglecting the strong supervision information of clean samples randomly lying in the datasets. In this work, we show that clean samples can be collected to offer guidance and enhance the confidence of the most likely candidates. Motivated by the differentiable count loss strategy and the K-Nearest-Neighbor algorithm, we propose a new calibration strategy called CleanSE. Specifically, we attribute the most reliable candidates with higher significance under the assumption that for each clean sample, if its label is one of the candidates of its nearest neighbor in the representation space, it is more likely to be the ground truth of its neighbor. Moreover, clean samples offer help in characterizing the sample distributions by restricting the label counts of each label to a specific interval. Extensive experiments on 3 synthetic benchmarks and 5 real-world PLL datasets show that this calibration strategy can be applied to most of the state-of-the-art PLL methods and enhances their performance.
[LG-16] MUST: Multi-Scale Structural-Temporal Link Prediction Model for UAV Ad Hoc Networks
Link: https://arxiv.org/abs/2505.09331
Authors: Cunlai Pu, Fangrui Wu, Rajput Ramiz Sharafat, Guangzhao Dai, Xiangbo Shu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Link prediction in unmanned aerial vehicle (UAV) ad hoc networks (UANETs) aims to predict the potential formation of future links between UAVs. In adversarial environments where the route information of UAVs is unavailable, predicting future links must rely solely on the observed historical topological information of UANETs. However, the highly dynamic and sparse nature of UANET topologies presents substantial challenges in effectively capturing meaningful structural and temporal patterns for accurate link prediction. Most existing link prediction methods focus on temporal dynamics at a single structural scale while neglecting the effects of sparsity, resulting in insufficient information capture and limited applicability to UANETs. In this paper, we propose a multi-scale structural-temporal link prediction model (MUST) for UANETs. Specifically, we first employ graph attention networks (GATs) to capture structural features at multiple levels, including the individual UAV level, the UAV community level, and the overall network level. Then, we use long short-term memory (LSTM) networks to learn the temporal dynamics of these multi-scale structural features. Additionally, we address the impact of sparsity by introducing a sophisticated loss function during model optimization. We validate the performance of MUST using several UANET datasets generated through simulations. Extensive experimental results demonstrate that MUST achieves state-of-the-art link prediction performance in highly dynamic and sparse UANETs.
[LG-17] Detecting Sybil Addresses in Blockchain Airdrops: A Subgraph-based Feature Propagation and Fusion Approach
链接: https://arxiv.org/abs/2505.09313
作者: Qiangqiang Liu,Qian Huang,Frank Fan,Haishan Wu,Xueyan Tang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: IEEE International Conference on Blockchain and Cryptocurrency(Proc. IEEE ICBC 2025)
Abstract:Sybil attacks pose a significant security threat to blockchain ecosystems, particularly in token airdrop events. This paper proposes a novel sybil address identification method based on subgraph feature extraction and LightGBM. The method first constructs a two-layer deep transaction subgraph for each address, then extracts key event operation features according to the lifecycle of sybil addresses, including the time of first transaction, first gas acquisition, participation in airdrop activities, and last transaction. These temporal features effectively capture the consistency of sybil address behavior operations. Additionally, the method extracts amount and network structure features, comprehensively describing address behavior patterns and network topology through feature propagation and fusion. Experiments conducted on a dataset containing 193,701 addresses (including 23,240 sybil addresses) show that this method outperforms existing approaches in terms of precision, recall, F1 score, and AUC, with all metrics exceeding 0.9. The methods and results of this study can be further applied to broader blockchain security areas such as transaction manipulation identification and token liquidity risk assessment, contributing to the construction of a more secure and fair blockchain ecosystem.
[LG-18] Neural Multivariate Regression: Qualitative Insights from the Unconstrained Feature Model
链接: https://arxiv.org/abs/2505.09308
作者: George Andriopoulos,Soyuj Jung Basnet,Juan Guevara,Li Guo,Keith Ross
类目: Machine Learning (cs.LG)
*备注: 31 pages, 8 figures
Abstract:The Unconstrained Feature Model (UFM) is a mathematical framework that enables closed-form approximations for minimal training loss and related performance measures in deep neural networks (DNNs). This paper leverages the UFM to provide qualitative insights into neural multivariate regression, a critical task in imitation learning, robotics, and reinforcement learning. Specifically, we address two key questions: (1) How do multi-task models compare to multiple single-task models in terms of training performance? (2) Can whitening and normalizing regression targets improve training performance? The UFM theory predicts that multi-task models achieve strictly smaller training MSE than multiple single-task models when the same or stronger regularization is applied to the latter, and our empirical results confirm these findings. Regarding whitening and normalizing regression targets, the UFM theory predicts that they reduce training MSE when the average variance across the target dimensions is less than one, and our empirical results once again confirm these findings. These findings highlight the UFM as a powerful framework for deriving actionable insights into DNN design and data pre-processing strategies.
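The condition quoted in the abstract (whitening and normalizing are predicted to reduce training MSE when the average variance across target dimensions is below one) can be checked on a dataset with a few lines. This is a minimal sketch, not the authors' code: the function names are hypothetical, and the use of population variance (rather than sample variance) is an assumption.

```python
from statistics import pvariance


def average_target_variance(targets):
    """Mean of the per-dimension population variances of the regression targets.

    `targets` is a list of equal-length tuples, one tuple per training example.
    """
    dims = list(zip(*targets))  # transpose: one tuple per target dimension
    return sum(pvariance(d) for d in dims) / len(dims)


def whitening_predicted_to_help(targets):
    """True when the average target variance is below one (the stated condition)."""
    return average_target_variance(targets) < 1.0


# Hypothetical 2-D targets: per-dimension variances are 2/3 and 8/3.
targets = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
print(average_target_variance(targets), whitening_predicted_to_help(targets))
```

The check only evaluates the stated threshold; it says nothing about how large the predicted improvement would be.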
[LG-19] Adaptive Noise Resilient Keyword Spotting Using One-Shot Learning
链接: https://arxiv.org/abs/2505.09304
作者: Luciano Sebastian Martinez-Rau,Quynh Nguyen Phuong Vu,Yuxuan Zhang,Bengt Oelmann,Sebastian Bader
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Preprint submitted to the IEEE 11th World Forum on Internet of Things
Abstract:Keyword spotting (KWS) is a key component of smart devices, enabling efficient and intuitive audio interaction. However, standard KWS systems deployed on embedded devices often suffer performance degradation under real-world operating conditions. Resilient KWS systems address this issue by enabling dynamic adaptation, with applications such as adding or replacing keywords, adjusting to specific users, and improving noise robustness. However, deploying resilient, standalone KWS systems with low latency on resource-constrained devices remains challenging due to limited memory and computational resources. This study proposes a low-computation approach for continuous noise adaptation of pretrained neural networks used for KWS classification, requiring only 1-shot learning and one epoch. The proposed method was assessed using two pretrained models and three real-world noise sources at signal-to-noise ratios (SNRs) ranging from 24 to -3 dB. The adapted models consistently outperformed the pretrained models across all scenarios, especially at SNR $\leq$ 18 dB, achieving accuracy improvements of 4.9% to 46.0%. These results highlight the efficacy of the proposed methodology while being lightweight enough for deployment on resource-constrained devices.
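Evaluations at fixed SNRs such as the 24 to -3 dB range above are commonly prepared by scaling a noise recording before adding it to the clean signal. The sketch below shows that standard scaling rule only; it is an assumption about a typical setup, not the paper's pipeline, and both function names are hypothetical.

```python
import math


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / noise power matches `snr_db`, then add it."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    target_p_noise = p_signal / (10 ** (snr_db / 10))  # SNR_dB = 10 * log10(Ps / Pn)
    gain = math.sqrt(target_p_noise / p_noise)
    return [s + gain * n for s, n in zip(signal, noise)]


def measured_snr_db(signal, noisy):
    """Recover the SNR in dB from the clean signal and the noisy mixture."""
    residual = [y - s for s, y in zip(signal, noisy)]
    p_signal = sum(s * s for s in signal) / len(signal)
    p_residual = sum(r * r for r in residual) / len(residual)
    return 10 * math.log10(p_signal / p_residual)


# Toy samples standing in for audio frames.
signal = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(signal, noise, snr_db=-3.0)
print(round(measured_snr_db(signal, noisy), 6))
```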
[LG-20] On the Learning with Augmented Class via Forests IJCAI2025
链接: https://arxiv.org/abs/2505.09294
作者: Fan Xu,Wuyang Chen,Wei Gao
类目: Machine Learning (cs.LG)
*备注: Accepted by IJCAI 2025
Abstract:Decision trees and forests have achieved successes in various real applications, most working with all testing classes known in training data. In this work, we focus on learning with augmented class via forests, where an augmented class may appear in testing data yet not in training data. We incorporate information of augmented class into trees’ splitting, i.e., a new splitting criterion, called augmented Gini impurity, is introduced to exploit some unlabeled data from testing distribution. We then develop the approach named Learning with Augmented Class via Forests (LACForest), which constructs shallow forests based on the augmented Gini impurity and then splits forests with pseudo-labeled augmented instances for better performance. We also develop deep neural forests with a novel optimization objective based on our augmented Gini impurity, so as to utilize the representation power of neural networks for forests. Theoretically, we present the convergence analysis for augmented Gini impurity, and finally conduct experiments to verify the effectiveness of our approaches. The code is available at this https URL.
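For orientation, the standard Gini impurity that tree splitting criteria build on is sketched below. The paper's augmented Gini impurity, which also uses unlabeled test-distribution data, is not specified in the abstract and is not reproduced here; the function names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Standard Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted impurity of a candidate split; lower values mean purer children."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


print(gini_impurity(["a", "a", "b", "b"]))       # mixed node
print(split_impurity(["a", "a"], ["b", "b"]))    # perfectly separating split
```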
[LG-21] Ranking-Based At-Risk Student Prediction Using Federated Learning and Differential Features
链接: https://arxiv.org/abs/2505.09287
作者: Shunsuke Yoneda,Valdemar Švábenský,Gen Li,Daisuke Deguchi,Atsushi Shimada
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: To appear in the Proceedings of the 18th Educational Data Mining Conference (EDM 2025)
Abstract:Digital textbooks are widely used in various educational contexts, such as university courses and online lectures. Such textbooks yield learning log data that have been used in numerous educational data mining (EDM) studies for student behavior analysis and performance prediction. However, these studies have faced challenges in integrating confidential data, such as academic records and learning logs, across schools due to privacy concerns. Consequently, analyses are often conducted with data limited to a single school, which makes developing high-performing and generalizable models difficult. This study proposes a method that combines federated learning and differential features to address these issues. Federated learning enables model training without centralizing data, thereby preserving student privacy. Differential features, which utilize relative values instead of absolute values, enhance model performance and generalizability. To evaluate the proposed method, a model for predicting at-risk students was trained using data from 1,136 students across 12 courses conducted over 4 years, and validated on hold-out test data from 5 other courses. Experimental results demonstrated that the proposed method addresses privacy concerns while achieving performance comparable to that of models trained via centralized learning in terms of Top-n precision, nDCG, and PR-AUC. Furthermore, using differential features improved prediction performance across all evaluation datasets compared to non-differential approaches. The trained models were also applicable for early prediction, achieving high performance in detecting at-risk students in earlier stages of the semester within the validation datasets.
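One simple reading of "relative values instead of absolute values" is to express each student's feature as a difference from the course-level mean. The sketch below illustrates that reading only; the exact definition of differential features used in the paper may differ, and the function name and data are hypothetical.

```python
from statistics import mean


def differential_features(values_by_student):
    """Map each student's absolute feature value to its difference from the course mean."""
    course_mean = mean(values_by_student.values())
    return {student: value - course_mean for student, value in values_by_student.items()}


# Hypothetical per-student reading-time totals (minutes) within one course.
reading_minutes = {"s1": 10, "s2": 20, "s3": 30}
print(differential_features(reading_minutes))
```

Because the values are relative to each course, the same model input scale can be shared across courses with different absolute activity levels.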
[LG-22] Generating Full-field Evolution of Physical Dynamics from Irregular Sparse Observations
链接: https://arxiv.org/abs/2505.09284
作者: Panqi Chen,Yifan Sun,Lei Cheng,Yang Yang,Weichang Li,Yang Liu,Weiqing Liu,Jiang Bian,Shikai Fang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Modeling and reconstructing multidimensional physical dynamics from sparse and off-grid observations presents a fundamental challenge in scientific research. Recently, diffusion-based generative modeling shows promising potential for physical simulation. However, current approaches typically operate on on-grid data with preset spatiotemporal resolution, but struggle with the sparsely observed and continuous nature of real-world physical dynamics. To fill the gaps, we present SDIFT, Sequential DIffusion in Functional Tucker space, a novel framework that generates full-field evolution of physical dynamics from irregular sparse observations. SDIFT leverages the functional Tucker model as the latent space representer with proven universal approximation property, and represents observations as latent functions and Tucker core sequences. We then construct a sequential diffusion model with temporally augmented UNet in the functional Tucker space, denoising noise drawn from a Gaussian process to generate the sequence of core tensors. At the posterior sampling stage, we propose a Message-Passing Posterior Sampling mechanism, enabling conditional generation of the entire sequence guided by observations at limited time steps. We validate SDIFT on three physical systems spanning astronomical (supernova explosions, light-year scale), environmental (ocean sound speed fields, kilometer scale), and molecular (organic liquid, millimeter scale) domains, demonstrating significant improvements in both reconstruction accuracy and computational efficiency compared to state-of-the-art approaches.
[LG-23] Stable and Convexified Information Bottleneck Optimization via Symbolic Continuation and Entropy-Regularized Trajectories
链接: https://arxiv.org/abs/2505.09239
作者: Faruk Alpay
类目: Machine Learning (cs.LG)
*备注: 23 pages, 11 figures, includes analytical proofs, sensitivity analysis (95% CI), and JAX-based open-source implementation available at: this https URL
Abstract:The Information Bottleneck (IB) method frequently suffers from unstable optimization, characterized by abrupt representation shifts near critical points of the IB trade-off parameter, $\beta$. In this paper, I introduce a novel approach to achieve stable and convex IB optimization through symbolic continuation and entropy-regularized trajectories. I analytically prove convexity and uniqueness of the IB solution path when an entropy regularization term is included, and demonstrate how this stabilizes representation learning across a wide range of $\beta$ values. Additionally, I provide extensive sensitivity analyses around critical points of $\beta$ with statistically robust uncertainty quantification (95% confidence intervals). The open-source implementation, experimental results, and reproducibility framework included in this work offer a clear path for practical deployment and future extension of my proposed method.
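For reference, the textbook IB Lagrangian to which the trade-off parameter refers can be written as below, with $X$ the input, $Y$ the target, and $T$ the learned representation. This is the standard form, given as an assumption about the starting point; the entropy-regularized objective of the paper is not stated in the abstract and is not reproduced.

```latex
\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y)
```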
[LG-24] Birch SGD: A Tree Graph Framework for Local and Asynchronous SGD Methods
链接: https://arxiv.org/abs/2505.09218
作者: Alexander Tyurin,Danil Sivtsov
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
*备注:
Abstract:We propose a new unifying framework, Birch SGD, for analyzing and designing distributed SGD methods. The central idea is to represent each method as a weighted directed tree, referred to as a computation tree. Leveraging this representation, we introduce a general theoretical result that reduces convergence analysis to studying the geometry of these trees. This perspective yields a purely graph-based interpretation of optimization dynamics, offering a new and intuitive foundation for method development. Using Birch SGD, we design eight new methods and analyze them alongside previously known ones, with at least six of the new methods shown to have optimal computational time complexity. Our research leads to two key insights: (i) all methods share the same “iteration rate” of $O\left(\frac{(R + 1) L \Delta}{\varepsilon} + \frac{\sigma^2 L \Delta}{\varepsilon^2}\right)$, where $R$ is the maximum “tree distance” along the main branch of a tree; and (ii) different methods exhibit different trade-offs: for example, some update iterates more frequently, improving practical performance, while others are more communication-efficient or focus on other aspects. Birch SGD serves as a unifying framework for navigating these trade-offs. We believe these results provide a unified foundation for understanding, analyzing, and designing efficient asynchronous and parallel optimization methods.
[LG-25] The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks
链接: https://arxiv.org/abs/2505.09214
作者: Zhonghao Lyu,Ming Xiao,Jie Xu,Mikael Skoglund,Marco Di Renzo
类目: Machine Learning (cs.LG)
*备注:
Abstract:The growing demand for large artificial intelligence model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications. In particular, edge-device co-inference, which partitions LAIMs between edge devices and servers, has emerged as a promising strategy for resource-efficient LAIM execution in wireless networks. In this paper, we investigate a pruning-aware LAIM co-inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment. For analysis, we first prove that the LAIM output distortion is upper bounded by its parameter distortion. Then, we derive a lower bound on parameter distortion via rate-distortion theory, analytically capturing the relationship between pruning ratio and co-inference performance. Next, based on the analytical results, we formulate an LAIM co-inference distortion bound minimization problem by jointly optimizing the pruning ratio, transmit power, and computation frequency under system latency, energy, and available resource constraints. Moreover, we propose an efficient algorithm to tackle the considered highly non-convex problem. Finally, extensive simulations demonstrate the effectiveness of the proposed design. In particular, model parameter distortion is shown to provide a reliable bound on output distortion. Also, the proposed joint pruning ratio and resource management design achieves superior performance in balancing trade-offs among inference performance, system latency, and energy consumption compared with benchmark schemes, such as fully on-device and on-server inference. Moreover, the split point is shown to play a critical role in system performance optimization under heterogeneous and resource-limited edge environments.
[LG-26] Quotient Complex Transformer (QCformer) for Perovskite Data Analysis
链接: https://arxiv.org/abs/2505.09174
作者: Xinyu You,Xiang Liu,Chuan-Shen Hu,Kelin Xia,Tze Chien Sum
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:
Abstract:The discovery of novel functional materials is crucial in addressing the challenges of sustainable energy generation and climate change. Hybrid organic-inorganic perovskites (HOIPs) have gained attention for their exceptional optoelectronic properties in photovoltaics. Recently, geometric deep learning, particularly graph neural networks (GNNs), has shown strong potential in predicting material properties and guiding material design. However, traditional GNNs often struggle to capture the periodic structures and higher-order interactions prevalent in such systems. To address these limitations, we propose a novel representation based on quotient complexes (QCs) and introduce the Quotient Complex Transformer (QCformer) for material property prediction. A material structure is modeled as a quotient complex, which encodes both pairwise and many-body interactions via simplices of varying dimensions and captures material periodicity through a quotient operation. Our model leverages higher-order features defined on simplices and processes them using a simplex-based Transformer module. We pretrain QCformer on benchmark datasets such as the Materials Project and JARVIS, and fine-tune it on HOIP datasets. The results show that QCformer outperforms state-of-the-art models in HOIP property prediction, demonstrating its effectiveness. The quotient complex representation and QCformer model together contribute a powerful new tool for predictive modeling of perovskite materials.
[LG-27] Scaling Gaussian Process Regression with Full Derivative Observations
链接: https://arxiv.org/abs/2505.09134
作者: Daniel Huang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 12 pages
Abstract:We present a scalable Gaussian Process (GP) method that can fit and predict full derivative observations called DSoftKI. It extends SoftKI, a method that approximates a kernel via softmax interpolation from learned interpolation point locations, to the setting with derivatives. DSoftKI enhances SoftKI’s interpolation scheme to incorporate the directional orientation of interpolation points relative to the data. This enables the construction of a scalable approximate kernel, including its first and second-order derivatives, through interpolation. We evaluate DSoftKI on a synthetic function benchmark and high-dimensional molecular force field prediction (100-1000 dimensions), demonstrating that DSoftKI is accurate and can scale to larger datasets with full derivative observations than previously possible.
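The idea of softmax interpolation from a set of interpolation points can be illustrated as follows: each data point gets a weight vector over the interpolation points, computed as a softmax of negative squared distances. This is a rough sketch under stated assumptions; the distance measure, the `temperature` parameter, and the function name are hypothetical, and the derivative-aware scheme of DSoftKI is not shown.

```python
import math


def softmax_interpolation_weights(x, points, temperature=1.0):
    """Softmax weights of `x` over interpolation `points` (closer points weigh more)."""
    scores = [
        -sum((a - b) ** 2 for a, b in zip(x, z)) / temperature
        for z in points
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-D interpolation points and one query point.
points = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0)]
weights = softmax_interpolation_weights((0.9, 1.1), points)
print(weights)
```

With a weight matrix W built this way for all data points, an approximate kernel of the form W K_zz W^T can be assembled from the much smaller kernel K_zz over the interpolation points, which is where the scalability comes from.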
[LG-28] Sequential Treatment Effect Estimation with Unmeasured Confounders
链接: https://arxiv.org/abs/2505.09113
作者: Yingrong Wang,Anpeng Wu,Baohong Li,Ziyang Xiao,Ruoxuan Xiong,Qing Han,Kun Kuang
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:This paper studies the cumulative causal effects of sequential treatments in the presence of unmeasured confounders. It is a critical issue in sequential decision-making scenarios where treatment decisions and outcomes dynamically evolve over time. Advanced causal methods apply a transformer as a backbone to model such time sequences, which shows superiority in capturing long-term dependence and periodic patterns via the attention mechanism. However, even when they control for the observed confounding, these estimators still suffer from unmeasured confounders, which influence both treatment assignments and outcomes. How to adjust for the latent confounding bias in sequential treatment effect estimation remains an open challenge. Therefore, we propose a novel Decomposing Sequential Instrumental Variable framework for CounterFactual Regression (DSIV-CFR), relying on a common negative control assumption. Specifically, an instrumental variable (IV) is a special negative control exposure, while the previous outcome serves as a negative control outcome. This allows us to recover the IVs latent in observation variables and estimate sequential treatment effects via a generalized moment condition. We conducted experiments on 4 datasets and achieved strong performance in one-step and multi-step prediction, which in turn supports identifying optimal treatments for dynamic systems.
[LG-29] Toward Malicious Clients Detection in Federated Learning ASIACCS2025
链接: https://arxiv.org/abs/2505.09110
作者: Zhihao Dou,Jiaqi Wang,Wei Sun,Zhuqing Liu,Minghong Fang
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: To appear in ACM ASIACCS 2025
Abstract:Federated learning (FL) enables multiple clients to collaboratively train a global machine learning model without sharing their raw data. However, the decentralized nature of FL introduces vulnerabilities, particularly to poisoning attacks, where malicious clients manipulate their local models to disrupt the training process. While Byzantine-robust aggregation rules have been developed to mitigate such attacks, they remain inadequate against more advanced threats. In response, recent advancements have focused on FL detection techniques to identify potentially malicious participants. Unfortunately, these methods often misclassify numerous benign clients as threats or rely on unrealistic assumptions about the server’s capabilities. In this paper, we propose a novel algorithm, SafeFL, specifically designed to accurately identify malicious clients in FL. The SafeFL approach involves the server collecting a series of global models to generate a synthetic dataset, which is then used to distinguish between malicious and benign models based on their behavior. Extensive testing demonstrates that SafeFL outperforms existing methods, offering superior efficiency and accuracy in detecting malicious clients.
[LG-30] Argus: Federated Non-convex Bilevel Learning over 6G Space-Air-Ground Integrated Network
链接: https://arxiv.org/abs/2505.09106
作者: Ya Liu,Kai Yang,Yu Zhu,Keying Yang,Haibo Zhao
类目: Machine Learning (cs.LG)
*备注: 17 pages, 11 figures
Abstract:The space-air-ground integrated network (SAGIN) has recently emerged as a core element of 6G networks. However, traditional centralized and synchronous optimization algorithms are unsuitable for SAGIN due to its infrastructureless and time-varying environments. This paper aims to develop a novel asynchronous algorithm, Argus, for tackling non-convex and non-smooth decentralized federated bilevel learning over SAGIN. The proposed algorithm allows networked agents (e.g., autonomous aerial vehicles) to tackle bilevel learning problems in time-varying networks asynchronously, thereby preventing stragglers from impeding the overall training speed. We provide a theoretical analysis of the iteration complexity, communication complexity, and computational complexity of Argus. Its effectiveness is further demonstrated through numerical experiments.
[LG-31] Imitation Learning for Adaptive Control of a Virtual Soft Exoglove
链接: https://arxiv.org/abs/2505.09099
作者: Shirui Lyu,Vittorio Caggiano,Matteo Leonetti,Dario Farina,Letizia Gionfrida
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:The use of wearable robots has been widely adopted in rehabilitation training for patients with hand motor impairments. However, the uniqueness of patients’ muscle loss is often overlooked. Leveraging reinforcement learning and a biologically accurate musculoskeletal model in simulation, we propose a customized wearable robotic controller that is able to address specific muscle deficits and to provide compensation for hand-object manipulation tasks. Video data of the same subject performing human grasping tasks is used to train a manipulation model through learning from demonstration. This manipulation model is subsequently fine-tuned to perform object-specific interaction tasks. The muscle forces in the musculoskeletal manipulation model are then weakened to simulate neurological motor impairments, which are later compensated by the actuation of a virtual wearable robotics glove. Results show that integrating the virtual wearable robotic glove provides shared assistance to support the hand manipulator with weakened muscle forces. The learned exoglove controller achieved an average of 90.5% of the original manipulation proficiency.
[LG-32] Statistical Mean Estimation with Coded Relayed Observations
链接: https://arxiv.org/abs/2505.09098
作者: Yan Hao Ling,Zhouhao Yang,Jonathan Scarlett
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:We consider a problem of statistical mean estimation in which the samples are not observed directly, but are instead observed by a relay (“teacher”) that transmits information through a memoryless channel to the decoder (“student”), who then produces the final estimate. We consider the minimax estimation error in the large deviations regime, and establish achievable error exponents that are tight in broad regimes of the estimation accuracy and channel quality. In contrast, two natural baseline methods are shown to yield strictly suboptimal error exponents. We initially focus on Bernoulli sources and binary symmetric channels, and then generalize to sub-Gaussian and heavy-tailed settings along with arbitrary discrete memoryless channels.
[LG-33] Generating time-consistent dynamics with discriminator-guided image diffusion models
链接: https://arxiv.org/abs/2505.09089
作者: Philipp Hess,Maximilian Gelbrecht,Christof Schötz,Michael Aich,Yu Huang,Shangshang Yang,Niklas Boers
类目: Machine Learning (cs.LG)
*备注:
Abstract:Realistic temporal dynamics are crucial for many video generation, processing and modelling applications, e.g. in computational fluid dynamics, weather prediction, or long-term climate simulations. Video diffusion models (VDMs) are the current state-of-the-art method for generating highly realistic dynamics. However, training VDMs from scratch can be challenging and requires large computational resources, limiting their wider application. Here, we propose a time-consistency discriminator that enables pretrained image diffusion models to generate realistic spatiotemporal dynamics. The discriminator guides the sampling inference process and does not require extensions or finetuning of the image diffusion model. We compare our approach against a VDM trained from scratch on an idealized turbulence simulation and a real-world global precipitation dataset. Our approach performs equally well in terms of temporal consistency, shows improved uncertainty calibration and lower biases compared to the VDM, and achieves stable centennial-scale climate simulations at daily time steps.
[LG-34] AdaFortiTran: An Adaptive Transformer Model for Robust OFDM Channel Estimation
链接: https://arxiv.org/abs/2505.09076
作者: Berkay Guler,Hamid Jafarkhani
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Deep learning models for channel estimation in Orthogonal Frequency Division Multiplexing (OFDM) systems often suffer from performance degradation under fast-fading channels and low-SNR scenarios. To address these limitations, we introduce the Adaptive Fortified Transformer (AdaFortiTran), a novel model specifically designed to enhance channel estimation in challenging environments. Our approach employs convolutional layers that exploit locality bias to capture strong correlations between neighboring channel elements, combined with a transformer encoder that applies the global attention mechanism to channel patches. This approach effectively models both long-range dependencies and spectro-temporal interactions within single OFDM frames. We further augment the model’s adaptability by integrating nonlinear representations of available channel statistics (SNR, delay spread, and Doppler shift) as priors. A residual connection is employed to merge global features from the transformer with local features from early convolutional processing, followed by final convolutional layers to refine the hierarchical channel representation. Despite its compact architecture, AdaFortiTran achieves up to 6 dB reduction in mean squared error (MSE) compared to state-of-the-art models. Tested across a wide range of Doppler shifts (200-1000 Hz), SNRs (0 to 25 dB), and delay spreads (50-300 ns), it demonstrates superior robustness in high-mobility environments.
[LG-35] Single-shot prediction of parametric partial differential equations
链接: https://arxiv.org/abs/2505.09063
作者: Khalid Rafiq,Wenjing Liao,Aditya G. Nair
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 35 pages, 17 figures
Abstract:We introduce Flexi-VAE, a data-driven framework for efficient single-shot forecasting of nonlinear parametric partial differential equations (PDEs), eliminating the need for iterative time-stepping while maintaining high accuracy and stability. Flexi-VAE incorporates a neural propagator that advances latent representations forward in time, aligning latent evolution with physical state reconstruction in a variational autoencoder setting. We evaluate two propagation strategies, the Direct Concatenation Propagator (DCP) and the Positional Encoding Propagator (PEP), and demonstrate, through representation-theoretic analysis, that DCP offers superior long-term generalization by fostering disentangled and physically meaningful latent spaces. Geometric diagnostics, including Jacobian spectral analysis, reveal that propagated latent states reside in regions of lower decoder sensitivity and more stable local geometry than those derived via direct encoding, enhancing robustness for long-horizon predictions. We validate Flexi-VAE on canonical PDE benchmarks, the 1D viscous Burgers equation and the 2D advection-diffusion equation, achieving accurate forecasts across wide parametric ranges. The model delivers over 50x CPU and 90x GPU speedups compared to autoencoder-LSTM baselines for large temporal shifts. These results position Flexi-VAE as a scalable and interpretable surrogate modeling tool for accelerating high-fidelity simulations in computational fluid dynamics (CFD) and other parametric PDE-driven applications, with extensibility to higher-dimensional and more complex systems.
[LG-36] DyGSSM: Multi-view Dynamic Graph Embeddings with State Space Model Gradient Update
链接: https://arxiv.org/abs/2505.09017
作者: Bizhan Alipour Pijan,Serdar Bozdag
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:
Abstract:Most dynamic graph representation learning methods divide a dynamic graph into discrete snapshots to capture the evolving behavior of nodes over time. Existing methods primarily capture only local or global structures of each node within a snapshot using message-passing and random walk-based methods. Then, they utilize sequence-based models (e.g., transformers) to encode the temporal evolution of node embeddings, and meta-learning techniques to update the model parameters. However, these approaches have two limitations. First, they neglect the extraction of global and local information simultaneously in each snapshot. Second, they fail to consider the model’s performance in the current snapshot during parameter updates, resulting in a lack of temporal dependency management. Recently, the HiPPO (High-order Polynomial Projection Operators) algorithm has gained attention for its ability to optimize and preserve sequence history in State Space Models (SSMs). To address the aforementioned limitations in dynamic graph representation learning, we propose a novel method called Multi-view Dynamic Graph Embeddings with State Space Model Gradient Update (DyGSSM). Our approach combines Graph Convolution Networks (GCN) for local feature extraction and random walk with Gated Recurrent Unit (GRU) for global feature extraction in each snapshot. We then integrate the local and global features using a cross-attention mechanism. Additionally, we incorporate an SSM based on the HiPPO algorithm to account for long-term dependencies when updating model parameters, ensuring that model performance in each snapshot informs subsequent updates. Experiments on five public datasets show that our method outperforms existing baseline and state-of-the-art (SOTA) methods in 17 out of 20 cases.
[LG-37] Signal-based AI-driven software solution for automated quantification of metastatic bone disease and treatment response assessment using Whole-Body Diffusion-Weighted MRI (WB-DWI) biomarkers in Advanced Prostate Cancer
Link: https://arxiv.org/abs/2505.09011
Authors: Antonio Candito,Matthew D Blackledge,Richard Holbrey,Nuria Porta,Ana Ribeiro,Fabio Zugni,Luca D’Erme,Francesca Castagnoli,Alina Dragan,Ricardo Donners,Christina Messiou,Nina Tunariu,Dow-Mu Koh
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We developed an AI-driven software solution to quantify metastatic bone disease from WB-DWI scans. Core technologies include: (i) a weakly-supervised Residual U-Net model generating a skeleton probability map to isolate bone; (ii) a statistical framework for WB-DWI intensity normalisation, obtaining a signal-normalised b=900s/mm^2 (b900) image; and (iii) a shallow convolutional neural network that processes outputs from (i) and (ii) to generate a mask of suspected bone lesions, characterised by higher b900 signal intensity due to restricted water diffusion. This mask is applied to the gADC map to extract TDV and gADC statistics. We tested the tool using expert-defined metastatic bone disease delineations on 66 datasets, assessed repeatability of imaging biomarkers (N=10), and compared software-based response assessment with a construct reference standard based on clinical, laboratory and imaging assessments (N=118). Dice score between manual and automated delineations was 0.6 for lesions within pelvis and spine, with an average surface distance of 2mm. Relative differences for log-transformed TDV (log-TDV) and median gADC were below 9% and 5%, respectively. Repeatability analysis showed coefficients of variation of 4.57% for log-TDV and 3.54% for median gADC, with intraclass correlation coefficients above 0.9. The software achieved 80.5% accuracy, 84.3% sensitivity, and 85.7% specificity in assessing response to treatment compared to the construct reference standard. Computation time generating a mask averaged 90 seconds per scan. Our software enables reproducible TDV and gADC quantification from WB-DWI scans for monitoring metastatic bone disease response, thus providing potentially useful measurements for clinical decision-making in APC patients.
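The Dice score quoted above measures the overlap between the manual and automated lesion masks. The following is a minimal sketch of that metric only, assuming masks are given as flat sequences of 0/1 values; the function name `dice_score` and the empty-mask convention are illustrative choices, not taken from the paper.

```python
def dice_score(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks.

    Both masks are flat sequences of 0/1 (or False/True) values of equal length.
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    # Voxels labelled positive in both masks.
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    # Total number of positive voxels across the two masks.
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Assumed convention: two empty masks count as perfect agreement.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


manual = [1, 1, 0, 0]
automated = [1, 0, 1, 0]
print(dice_score(manual, automated))
```

A score of 1 means the two delineations coincide exactly and 0 means they share no voxels.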
[LG-38] ChicGrasp: Imitation-Learning based Customized Dual-Jaw Gripper Control for Delicate Irregular Bio-products Manipulation
Link: https://arxiv.org/abs/2505.08986
Authors: Amirreza Davar,Zhengtong Xu,Siavash Mahmoudi,Pouya Sohrabipour,Chaitanya Pallerla,Yu She,Wan Shou,Philip Crandall,Dongyi Wang
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Submitted for journal review
Abstract:Automated poultry processing lines still rely on humans to lift slippery, easily bruised carcasses onto a shackle conveyor. Deformability, anatomical variance, and strict hygiene rules make conventional suction and scripted motions unreliable. We present ChicGrasp, an end-to-end hardware-software co-design for this task. An independently actuated dual-jaw pneumatic gripper clamps both chicken legs, while a conditional diffusion-policy controller, trained from only 50 multi-view teleoperation demonstrations (RGB + proprioception), plans 5-DoF end-effector motion, including jaw commands, in one shot. On individually presented raw broiler carcasses, our system achieves a 40.6% grasp-and-lift success rate and completes the pick-to-shackle cycle in 38 s, whereas state-of-the-art implicit behaviour cloning (IBC) and LSTM-GMM baselines fail entirely. All CAD, code, and datasets will be open-sourced. ChicGrasp shows that imitation learning can bridge the gap between rigid hardware and variable bio-products, offering a reproducible benchmark and a public dataset for researchers in agricultural engineering and robot learning.
[LG-39] Model-free Online Learning for the Kalman Filter: Forgetting Factor and Logarithmic Regret
Link: https://arxiv.org/abs/2505.08982
Authors: Jiachen Qian,Yang Zheng
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments:
Abstract:We consider the problem of online prediction for an unknown, non-explosive linear stochastic system. With a known system model, the optimal predictor is the celebrated Kalman filter. In the case of unknown systems, existing approaches based on recursive least squares and its variants may suffer from degraded performance due to the highly imbalanced nature of the regression model. This imbalance can easily lead to overfitting and thus degrade prediction accuracy. We tackle this problem by injecting an inductive bias into the regression model via exponential forgetting. While exponential forgetting is a common wisdom in online learning, it is typically used for re-weighting data. In contrast, our approach focuses on balancing the regression model. This achieves a better trade-off between regression and regularization errors, and simultaneously reduces the accumulation error. With new proof techniques, we also provide a sharper logarithmic regret bound of $O(\log^3 N)$, where $N$ is the number of observations.
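For orientation, the conventional use of exponential forgetting that the abstract contrasts itself with (re-weighting data inside recursive least squares) can be sketched as follows. This is the textbook recursion, not the paper's model-balancing method; the function name `rls_with_forgetting` and the defaults `lam=0.98` and `delta=1000.0` are assumptions chosen for illustration.

```python
def rls_with_forgetting(features, targets, lam=0.98, delta=1000.0):
    """Textbook recursive least squares with forgetting factor `lam`.

    Minimises sum_k lam**(n-k) * (y_k - theta . x_k)**2, so older samples
    are down-weighted geometrically. `delta` scales the initial covariance.
    """
    d = len(features[0])
    theta = [0.0] * d
    # P starts as a large multiple of the identity (weak prior on theta).
    P = [[delta if i == j else 0.0 for j in range(d)] for i in range(d)]
    for x, y in zip(features, targets):
        # Px = P @ x
        Px = [sum(P[i][j] * x[j] for j in range(d)) for i in range(d)]
        denom = lam + sum(x[i] * Px[i] for i in range(d))
        gain = [v / denom for v in Px]
        # Prediction error made with the current estimate.
        err = y - sum(theta[i] * x[i] for i in range(d))
        theta = [theta[i] + gain[i] * err for i in range(d)]
        # P <- (P - gain x^T P) / lam, using the symmetry of P (x^T P = Px^T).
        P = [[(P[i][j] - gain[i] * Px[j]) / lam for j in range(d)]
             for i in range(d)]
    return theta


# Noise-free toy data generated from theta* = [2, -1].
xs = [[(k % 7) - 3.0, (k % 5) - 2.0] for k in range(60)]
ys = [2.0 * a - 1.0 * b for a, b in xs]
print(rls_with_forgetting(xs, ys))
```

With `lam = 1` this reduces to ordinary recursive least squares; values below 1 shorten the effective memory of the estimator.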
[LG-40] SaFARi: State-Space Models for Frame-Agnostic Representation
Link: https://arxiv.org/abs/2505.08977
Authors: Hossein Babaei,Mel White,Sina Alemohammad,Richard G. Baraniuk
Categories: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: 13 pages, 5 figures
Abstract:State-Space Models (SSMs) have re-emerged as a powerful tool for online function approximation, and as the backbone of machine learning models for long-range dependent data. However, to date, only a few polynomial bases have been explored for this purpose, and the state-of-the-art implementations were built upon the best of a few limited options. In this paper, we present a generalized method for building an SSM with any frame or basis, rather than being restricted to polynomials. This framework encompasses the approach known as HiPPO, but also permits an infinite diversity of other possible “species” within the SSM architecture. We dub this approach SaFARi: SSMs for Frame-Agnostic Representation.
[LG-41] NeurIPS 2024 Ariel Data Challenge: Characterisation of Exoplanetary Atmospheres Using a Data-Centric Approach
Link: https://arxiv.org/abs/2505.08940
Authors: Jeremie Blanchard,Lisa Casino,Jordan Gierschendorf
Categories: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: 12 pages
Abstract:The characterization of exoplanetary atmospheres through spectral analysis is a complex challenge. The NeurIPS 2024 Ariel Data Challenge, in collaboration with the European Space Agency’s (ESA) Ariel mission, provided an opportunity to explore machine learning techniques for extracting atmospheric compositions from simulated spectral data. In this work, we focus on a data-centric business approach, prioritizing generalization over competition-specific optimization. We briefly outline multiple experimental axes, including feature extraction, signal transformation, and heteroskedastic uncertainty modeling. Our experiments demonstrate that uncertainty estimation plays a crucial role in the Gaussian Log-Likelihood (GLL) score, impacting performance by several percentage points. Despite improving the GLL score by 11%, our results highlight the inherent limitations of tabular modeling and feature engineering for this task, as well as the constraints of a business-driven approach within a Kaggle-style competition framework. Our findings emphasize the trade-offs between model simplicity, interpretability, and generalization in astrophysical data analysis.
[LG-42] An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models
Link: https://arxiv.org/abs/2505.08915
Authors: Jialin Mao,Itay Griniasty,Yan Sun,Mark K. Transtrum,James P. Sethna,Pratik Chaudhari
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech)
Comments:
Abstract:Recent experiments have shown that training trajectories of multiple deep neural networks with different architectures, optimization algorithms, hyper-parameter settings, and regularization methods evolve on a remarkably low-dimensional “hyper-ribbon-like” manifold in the space of probability distributions. Inspired by the similarities in the training trajectories of deep networks and linear networks, we analytically characterize this phenomenon for the latter. We show, using tools in dynamical systems theory, that the geometry of this low-dimensional manifold is controlled by (i) the decay rate of the eigenvalues of the input correlation matrix of the training data, (ii) the relative scale of the ground-truth output to the weights at the beginning of training, and (iii) the number of steps of gradient descent. By analytically computing and bounding the contributions of these quantities, we characterize phase boundaries of the region where hyper-ribbons are to be expected. We also extend our analysis to kernel machines and linear models that are trained with stochastic gradient descent.
[LG-43] The Geography of Transportation Cybersecurity: Visitor Flows, Industry Clusters, and Spatial Dynamics
Link: https://arxiv.org/abs/2505.08822
Authors: Yuhao Wang,Kailai Wang,Songhua Hu,Yunpeng (Jack) Zhang,Gino Lim,Pengyu Zhu
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
Comments:
Abstract:The rapid evolution of the transportation cybersecurity ecosystem, encompassing cybersecurity, automotive, and transportation and logistics sectors, will lead to the formation of distinct spatial clusters and visitor flow patterns across the US. This study examines the spatiotemporal dynamics of visitor flows, analyzing how socioeconomic factors shape industry clustering and workforce distribution within these evolving sectors. To model and predict visitor flow patterns, we develop a BiTransGCN framework, integrating an attention-based Transformer architecture with a Graph Convolutional Network backbone. By integrating AI-enabled forecasting techniques with spatial analysis, this study improves our ability to track, interpret, and anticipate changes in industry clustering and mobility trends, thereby supporting strategic planning for a secure and resilient transportation network. It offers a data-driven foundation for economic planning, workforce development, and targeted investments in the transportation cybersecurity ecosystem.
[LG-44] Self-Supervised Transformer-based Contrastive Learning for Intrusion Detection Systems
Link: https://arxiv.org/abs/2505.08816
Authors: Ippokratis Koukoulis,Ilias Syrigos,Thanasis Korakis
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted at IFIP Networking 2025. Code available at this https URL
Abstract:As the digital landscape becomes more interconnected, the frequency and severity of zero-day attacks have significantly increased, leading to an urgent need for innovative Intrusion Detection Systems (IDS). Machine Learning-based IDS that learn from network traffic characteristics and can discern attack patterns from benign traffic offer an advanced alternative to traditional signature-based IDS. However, they heavily rely on labeled datasets, and their ability to generalize when encountering unseen traffic patterns remains a challenge. This paper proposes a novel self-supervised contrastive learning approach based on transformer encoders, specifically tailored for generalizable intrusion detection on raw packet sequences. Our proposed learning scheme employs a packet-level data augmentation strategy combined with a transformer-based architecture to extract and generate meaningful representations of traffic flows. Unlike traditional methods reliant on handcrafted statistical features (NetFlow), our approach automatically learns comprehensive packet sequence representations, significantly enhancing performance in anomaly identification tasks and supervised learning for intrusion detection. Our transformer-based framework exhibits better performance in comparison to existing NetFlow self-supervised methods. Specifically, we achieve up to a 3% higher AUC in anomaly detection for intra-dataset evaluation and up to 20% higher AUC scores in inter-dataset evaluation. Moreover, our model provides a strong baseline for supervised intrusion detection with limited labeled data, exhibiting an improvement over self-supervised NetFlow models of up to 1.5% AUC when pretrained and evaluated on the same dataset. Additionally, we show the adaptability of our pretrained model when fine-tuned across different datasets, demonstrating strong performance even when lacking benign data from the target domain.
[LG-45] TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis
Link: https://arxiv.org/abs/2505.08804
Authors: Longtian Wang,Xiaofei Xie,Tianlin Li,Yuhan Zhi,Chao Shen
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 13 pages, 5 figures
Abstract:Text-to-image (T2I) models have significantly advanced in producing high-quality images. However, such models have the ability to generate images containing not-safe-for-work (NSFW) content, such as pornography, violence, political content, and discrimination. To mitigate the risk of generating NSFW content, refusal mechanisms, i.e., safety checkers, have been developed to check potential NSFW content. Adversarial prompting techniques have been developed to evaluate the robustness of the refusal mechanisms. The key challenge remains to subtly modify the prompt in a way that preserves its sensitive nature while bypassing the refusal mechanisms. In this paper, we introduce TokenProber, a method designed for sensitivity-aware differential testing, aimed at evaluating the robustness of the refusal mechanisms in T2I models by generating adversarial prompts. Our approach is based on the key observation that adversarial prompts often succeed by exploiting discrepancies in how T2I models and safety checkers interpret sensitive content. Thus, we conduct a fine-grained analysis of the impact of specific words within prompts, distinguishing between dirty words that are essential for NSFW content generation and discrepant words that highlight the different sensitivity assessments between T2I models and safety checkers. Through the sensitivity-aware mutation, TokenProber generates adversarial prompts, striking a balance between maintaining NSFW content generation and evading detection. Our evaluation of TokenProber against 5 safety checkers on 3 popular T2I models, using 324 NSFW prompts, demonstrates its superior effectiveness in bypassing safety filters compared to existing methods (e.g., 54%+ increase on average), highlighting TokenProber’s ability to uncover robustness issues in the existing refusal mechanisms.
[LG-46] Onboard Optimization and Learning: A Survey
Link: https://arxiv.org/abs/2505.08793
Authors: Monirul Islam Pavel,Siyi Hu,Mahardhika Pratama,Ryszard Kowalczyk
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: 36 pages, 5 figures, 3 tables
Abstract:Onboard learning is a transformative approach in edge AI, enabling real-time data processing, decision-making, and adaptive model training directly on resource-constrained devices without relying on centralized servers. This paradigm is crucial for applications demanding low latency, enhanced privacy, and energy efficiency. However, onboard learning faces challenges such as limited computational resources, high inference costs, and security vulnerabilities. This survey explores a comprehensive range of methodologies that address these challenges, focusing on techniques that optimize model efficiency, accelerate inference, and support collaborative learning across distributed devices. Approaches for reducing model complexity, improving inference speed, and ensuring privacy-preserving computation are examined alongside emerging strategies that enhance scalability and adaptability in dynamic environments. By bridging advancements in hardware-software co-design, model compression, and decentralized learning, this survey provides insights into the current state of onboard learning to enable robust, efficient, and secure AI deployment at the edge.
[LG-47] A Preliminary Framework for Intersectionality in ML Pipelines
Link: https://arxiv.org/abs/2505.08792
Authors: Michelle Nashla Turcios,Alicia E. Boyd,Angela D.R. Smith,Brittany Johnson
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: Accepted for the 1st International Intersectionality and Software Engineering Workshop, colocated with FSE 2025
Abstract:Machine learning (ML) has become a go-to solution for improving how we use, experience, and interact with technology (and the world around us). Unfortunately, studies have repeatedly shown that machine learning technologies may not provide adequate support for societal identities and experiences. Intersectionality is a sociological framework that provides a mechanism for explicitly considering complex social identities, focusing on social justice and power. While the framework of intersectionality can support the development of technologies that acknowledge and support all members of society, it has been adopted and adapted in ways that are not always true to its foundations, thereby weakening its potential for impact. To support the appropriate adoption and use of intersectionality for more equitable technological outcomes, we amplify the foundational intersectionality scholarship (Crenshaw, Combahee, and Collins, the three C’s) to create a socially relevant preliminary framework for developing machine-learning solutions. We use this framework to evaluate and report on the (mis)alignments of intersectionality application in machine learning literature.
[LG-48] Adaptively-weighted Nearest Neighbors for Matrix Completion
Link: https://arxiv.org/abs/2505.09612
Authors: Tathagata Sadhukhan,Manit Paul,Raaz Dwivedi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: 25 pages, 6 figures
Abstract:In this technical note, we introduce and analyze AWNN: an adaptively weighted nearest neighbor method for performing matrix completion. Nearest neighbor (NN) methods are widely used in missing data problems across multiple disciplines, such as in recommender systems and for performing counterfactual inference in panel data settings. Prior works have shown that in addition to being very intuitive and easy to implement, NN methods enjoy nice theoretical guarantees. However, the performance of most NN methods relies on an appropriate choice of the radii and of the weights assigned to each member of the nearest neighbor set. Despite several works on nearest neighbor methods in the past two decades, there is no systematic approach for choosing the radii and the weights without relying on methods like cross-validation. AWNN addresses this challenge by judiciously balancing the bias-variance trade-off inherent in weighted nearest-neighbor regression. We provide theoretical guarantees for the proposed method under minimal assumptions and support the theory via synthetic experiments.
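To make the role of the radius concrete, here is a minimal sketch of a plain fixed-radius, equally weighted row-wise nearest neighbor imputation. It is the kind of baseline whose radius and weights must be hand-tuned, not AWNN itself; the names `row_distance` and `nn_impute`, and the use of `None` for missing entries, are assumptions made for this illustration.

```python
def row_distance(row_a, row_b):
    """Mean squared difference over columns observed in both rows."""
    common = [(a, b) for a, b in zip(row_a, row_b)
              if a is not None and b is not None]
    if not common:
        return None  # no overlap, distance undefined
    return sum((a - b) ** 2 for a, b in common) / len(common)


def nn_impute(matrix, i, j, radius):
    """Estimate the missing entry (i, j) by averaging column j over rows
    whose distance to row i is at most `radius` (equal weights)."""
    neighbor_values = []
    for k, row in enumerate(matrix):
        if k == i or row[j] is None:
            continue
        dist = row_distance(matrix[i], row)
        if dist is not None and dist <= radius:
            neighbor_values.append(row[j])
    if not neighbor_values:
        return None  # radius too small: no usable neighbor
    return sum(neighbor_values) / len(neighbor_values)


ratings = [
    [1.0, 2.0, None],   # entry (0, 2) is missing
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 100.0],  # far from row 0, excluded by the radius
]
print(nn_impute(ratings, 0, 2, radius=0.05))
```

A small radius keeps only very similar rows (low bias, high variance), while a large radius averages over many dissimilar rows (high bias, low variance); this is the trade-off the abstract refers to.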
[LG-49] Scalable Computations for Generalized Mixed Effects Models with Crossed Random Effects Using Krylov Subspace Methods
Link: https://arxiv.org/abs/2505.09552
Authors: Pascal Kündig,Fabio Sigrist
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Mixed effects models are widely used for modeling data with hierarchically grouped structures and high-cardinality categorical predictor variables. However, for high-dimensional crossed random effects, current standard computations relying on Cholesky decompositions can become prohibitively slow. In this work, we present novel Krylov subspace-based methods that address several existing computational bottlenecks. Among other things, we theoretically analyze and empirically evaluate various preconditioners for the conjugate gradient and stochastic Lanczos quadrature methods, derive new convergence results, and develop computationally efficient methods for calculating predictive variances. Extensive experiments using simulated and real-world data sets show that our proposed methods scale much better than Cholesky-based computations, for instance, achieving a runtime reduction of approximately two orders of magnitude for both estimation and prediction. Moreover, our software implementation is up to 10,000 times faster and more stable than state-of-the-art implementations such as lme4 and glmmTMB when using default settings. Our methods are implemented in the free C++ software library GPBoost with high-level Python and R packages.
[LG-50] Depth-Based Local Center Clustering: A Framework for Handling Different Clustering Scenarios
Link: https://arxiv.org/abs/2505.09516
Authors: Siyi Wang,Alexandre Leblanc,Paul D. McNicholas
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Abstract:Cluster analysis, or clustering, plays a crucial role across numerous scientific and engineering domains. Despite the wealth of clustering methods proposed over the past decades, each method is typically designed for specific scenarios and presents certain limitations in practical applications. In this paper, we propose depth-based local center clustering (DLCC). This novel method makes use of data depth, which is known to produce a center-outward ordering of sample points in a multivariate space. However, data depth typically fails to capture the multimodal characteristics of data, something of the utmost importance in the context of clustering. To overcome this, DLCC makes use of a local version of data depth that is based on subsets of data. From this, local centers can be identified as well as clusters of varying shapes. Furthermore, we propose a new internal metric based on density-based clustering to evaluate clustering performance on non-convex clusters. Overall, DLCC is a flexible clustering approach that seems to overcome some limitations of traditional clustering methods, thereby enhancing data analysis capabilities across a wide range of application scenarios.
[LG-51] Deep-SITAR: A SITAR-Based Deep Learning Framework for Growth Curve Modeling via Autoencoders
Link: https://arxiv.org/abs/2505.09506
Authors: María Alejandra Hernández,Oscar Rodriguez,Dae-Jin Lee
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Pre-print
Abstract:Several approaches have been developed to capture the complexity and nonlinearity of human growth. One widely used approach is the Super Imposition by Translation and Rotation (SITAR) model, which has become popular in studies of adolescent growth. SITAR is a shape-invariant mixed-effects model that represents the shared growth pattern of a population using a natural cubic spline mean curve while incorporating three subject-specific random effects (timing, size, and growth intensity) to account for variations among individuals. In this work, we introduce a supervised deep learning framework based on an autoencoder architecture that integrates a deep neural network with a B-spline model to estimate the SITAR model. In this approach, the encoder estimates the random effects for each individual, while the decoder performs a fitting based on B-splines similar to the classic SITAR model. We refer to this method as the Deep-SITAR model. This innovative approach enables the prediction of the random effects of new individuals entering a population without requiring a full model re-estimation. As a result, Deep-SITAR offers a powerful approach to predicting growth trajectories, combining the flexibility and efficiency of deep learning with the interpretability of traditional mixed-effects models.
[LG-52] Reinforcement Learning for Individual Optimal Policy from Heterogeneous Data
Link: https://arxiv.org/abs/2505.09496
Authors: Rui Miao,Babak Shahbaba,Annie Qu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments in order to maximize the expected total rewards by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes, and thus, may result in a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables us to efficiently estimate the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate on the average regret under a weak partial coverage assumption on behavior policies. In addition, our simulation studies and a real data application demonstrate the superior numerical performance of the proposed method compared with existing methods.
[LG-53] Fairness-aware Bayes optimal functional classification
Link: https://arxiv.org/abs/2505.09471
Authors: Xiaoyu Hu,Gengyu Xue,Zhenhua Lin,Yi Yu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments:
Abstract:Algorithmic fairness has become a central topic in machine learning, and mitigating disparities across different subpopulations has emerged as a rapidly growing research area. In this paper, we systematically study the classification of functional data under fairness constraints, ensuring the disparity level of the classifier is controlled below a pre-specified threshold. We propose a unified framework for fairness-aware functional classification, tackling an infinite-dimensional functional space, addressing key challenges from the absence of density ratios and intractability of posterior probabilities, and discussing unique phenomena in functional classification. We further design a post-processing algorithm, the Fair Functional Linear Discriminant Analysis classifier (Fair-FLDA), which targets homoscedastic Gaussian processes and achieves fairness via group-wise thresholding. Under weak structural assumptions on the eigenspace, theoretical guarantees on fairness and excess risk control are established. As a byproduct, our results cover the excess risk control of the standard FLDA as a special case, which, to the best of our knowledge, is established here for the first time. Our theoretical findings are complemented by extensive numerical experiments on synthetic and real datasets, highlighting the practicality of our designed algorithm.
[LG-54] Independent Component Analysis by Robust Distance Correlation
Link: https://arxiv.org/abs/2505.09425
Authors: Sarah Leyder,Jakob Raymaekers,Peter J. Rousseeuw,Tom Van Deuren,Tim Verdonck
Categories: Computation (stat.CO); Machine Learning (cs.LG)
Comments:
Abstract:Independent component analysis (ICA) is a powerful tool for decomposing a multivariate signal or distribution into fully independent sources, not just uncorrelated ones. Unfortunately, most approaches to ICA are not robust against outliers. Here we propose a robust ICA method called RICA, which estimates the components by minimizing a robust measure of dependence between multivariate random variables. The dependence measure used is the distance correlation (dCor). In order to make it more robust we first apply a new transformation called the bowl transform, which is bounded, one-to-one, continuous, and maps far outliers to points close to the origin. This preserves the crucial property that a zero dCor implies independence. RICA estimates the independent sources sequentially, by looking for the component that has the smallest dCor with the remainder. RICA is strongly consistent and has the usual parametric rate of convergence. Its robustness is investigated by a simulation study, in which it generally outperforms its competitors. The method is illustrated on three applications, including the well-known cocktail party problem.
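The dependence measure named above, distance correlation, can be sketched for univariate samples as follows. This is the classical (non-robust) sample dCor only; the bowl transform is not defined in the abstract and is therefore not reproduced here. The function names are illustrative.

```python
import math


def _centered_distances(sample):
    """Double-centred matrix of pairwise absolute differences."""
    n = len(sample)
    d = [[abs(sample[i] - sample[j]) for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # d is symmetric, so column means equal row means.
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length 1-D samples."""
    if len(x) != len(y):
        raise ValueError("samples must have the same length")
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x2 = sum(v * v for row in a for v in row) / n**2
    dvar_y2 = sum(v * v for row in b for v in row) / n**2
    denom = math.sqrt(dvar_x2 * dvar_y2)
    if denom == 0.0:
        return 0.0  # at least one sample is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]  # uncorrelated with x in the Pearson sense, yet dependent
print(distance_correlation(x, y))
```

In this toy example the Pearson correlation between `x` and `y` is zero, while the distance correlation is clearly positive, which illustrates why dCor is suited to detecting dependence rather than mere correlation.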
[LG-55] ARCANE – Early Detection of Interplanetary Coronal Mass Ejections
Link: https://arxiv.org/abs/2505.09365
Authors: H. T. Rüdisser,G. Nguyen,J. Le Louëdec,C. Möstl
Categories: Space Physics (physics.space-ph); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
Comments: 25 pages, 9 figures, 1 table, submitted to AGU Space Weather on 14th May 2025
Abstract:Interplanetary coronal mass ejections (ICMEs) are major drivers of space weather disturbances, posing risks to both technological infrastructure and human activities. Automatic detection of ICMEs in solar wind in situ data is essential for early warning systems. While several methods have been proposed to identify these structures in time series data, robust real-time detection remains a significant challenge. In this work, we present ARCANE - the first framework explicitly designed for early ICME detection in streaming solar wind data under realistic operational constraints, enabling event identification without requiring observation of the full structure. Our approach evaluates the strengths and limitations of detection models by comparing a machine learning-based method to a threshold-based baseline. The ResUNet++ model, previously validated on science data, significantly outperforms the baseline, particularly in detecting high-impact events, while retaining solid performance on lower-impact cases. Notably, we find that using real-time solar wind (RTSW) data instead of high-resolution science data leads to only minimal performance degradation. Despite the challenges of operational settings, our detection pipeline achieves an F1 score of 0.53, with an average detection delay of 21.5% of the event’s duration while only seeing a minimal amount of data. As more data becomes available, the performance increases significantly. These results mark a substantial step forward in automated space weather monitoring and lay the groundwork for enhanced real-time forecasting capabilities.
[LG-56] Accelerating Machine Learning Systems via Category Theory: Applications to Spherical Attention for Gene Regulatory Networks
Link: https://arxiv.org/abs/2505.09326
Authors: Vincent Abbott,Kotaro Kamiya,Gerard Glowacki,Yu Atsumi,Gioele Zardini,Yoshihiro Maruyama
Categories: Category Theory (math.CT); Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
Comments:
Abstract:How do we enable artificial intelligence models to improve themselves? This is central to exponentially improving generalized artificial intelligence models, which can improve their own architecture to handle new problem domains in an efficient manner that leverages the latest hardware. However, current automated compilation methods are poor, and efficient algorithms require years of human development. In this paper, we use neural circuit diagrams, based in category theory, to prove a general theorem related to deep learning algorithms, guide the development of a novel attention algorithm catered to the domain of gene regulatory networks, and produce a corresponding efficient kernel. The algorithm we propose, spherical attention, shows that neural circuit diagrams enable a principled and systematic method for reasoning about deep learning architectures and providing high-performance code. By replacing SoftMax with an $L^2$ norm as suggested by diagrams, it overcomes the special function unit bottleneck of standard attention while retaining the streaming property essential to high performance. Our diagrammatically derived FlashSign kernel achieves comparable performance to the state-of-the-art, fine-tuned FlashAttention algorithm on an A100, and $3.6\times$ the performance of PyTorch. Overall, this investigation shows neural circuit diagrams’ suitability as a high-level framework for the automated development of efficient, novel artificial intelligence architectures.
[LG-57] Enhanced Photonic Chip Design via Interpretable Machine Learning Techniques
Link: https://arxiv.org/abs/2505.09266
Authors: Lirandë Pira,Airin Antony,Nayanthara Prathap,Daniel Peace,Jacquiline Romero
Categories: Optics (physics.optics); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments:
Abstract:Photonic chip design has seen significant advancements with the adoption of inverse design methodologies, offering flexibility and efficiency in optimizing device performance. However, the black-box nature of the optimization approaches, such as those used in inverse design in order to minimize a loss function or maximize coupling efficiency, poses challenges in understanding the outputs. This challenge is prevalent in machine learning-based optimization methods, which can suffer from the same lack of transparency. To this end, interpretability techniques address the opacity of optimization models. In this work, we apply interpretability techniques from machine learning, with the aim of gaining understanding of inverse design optimization used in designing photonic components, specifically two-mode multiplexers. We base our methodology on the widespread interpretability technique known as local interpretable model-agnostic explanations, or LIME. As a result, LIME-informed insights point us to more effective initial conditions, directly improving device performance. This demonstrates that interpretability methods can do more than explain models – they can actively guide and enhance the inverse-designed photonic components. Our results demonstrate the ability of interpretable techniques to reveal underlying patterns in the inverse design process, leading to the development of better-performing components.
[LG-58] Optimal Transport-Based Domain Adaptation for Rotated Linear Regression
Link: https://arxiv.org/abs/2505.09229
Authors: Brian Britos(AMU),Mathias Bourel(UDELAR)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments:
Abstract:Optimal Transport (OT) has proven effective for domain adaptation (DA) by aligning distributions across domains with differing statistical properties. Building on the approach of Courty et al. (2016), who mapped source data to the target domain for improved model transfer, we focus on a supervised DA problem involving linear regression models under rotational shifts. This ongoing work considers cases where source and target domains are related by a rotation, which is common in applications like sensor calibration or image orientation. We show that in \mathbb{R}^2, when using a p-norm cost with p \ge 2, the optimal transport map recovers the underlying rotation. Based on this, we propose an algorithm that combines K-means clustering, OT, and singular value decomposition (SVD) to estimate the rotation angle and adapt the regression model. This method is particularly effective when the target domain is sparsely sampled, leveraging abundant source data for improved generalization. Our contributions offer both theoretical and practical insights into OT-based model adaptation under geometric transformations.
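The rotation-recovery step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it skips the K-means and OT stages and simply assumes that matched, centered source/target point pairs are already available. In two dimensions the least-squares rotation has a closed form, which is what the SVD-based orthogonal Procrustes solution reduces to. The function names `estimate_rotation_angle` and `rotate` are hypothetical.

```python
import math

def estimate_rotation_angle(source, target):
    """Least-squares rotation angle (radians) taking source points to target points.

    Assumption: source[i] and target[i] are already matched pairs of
    centered 2D points (in the paper this matching would come from OT).
    """
    # Sum of 2D cross products and dot products over all matched pairs.
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    # The angle maximizing sum_i <target_i, R(theta) source_i>.
    return math.atan2(cross, dot)

def rotate(point, theta):
    """Rotate a 2D point counter-clockwise by theta radians."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

# Usage: recover a known rotation, then adapt linear-regression weights.
true_theta = 0.7
source_points = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
target_points = [rotate(p, true_theta) for p in source_points]
theta_hat = estimate_rotation_angle(source_points, target_points)

# If y = <w, x> on the source and target inputs are x' = R x,
# then w' = R w gives the same predictions on the target.
source_weights = (2.0, -1.0)
target_weights = rotate(source_weights, theta_hat)
```

With exact correspondences the estimate matches the true angle up to floating-point error; with noisy matches it is the least-squares fit.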
[LG-59] Online Learning of Neural Networks
Link: https://arxiv.org/abs/2505.09167
Authors: Amit Daniely,Idan Mehalel,Elchanan Mossel
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:We study online learning of feedforward neural networks with the sign activation function that implement functions from the unit ball in \mathbb{R}^d to a finite label set \{1, \ldots, Y\}. First, we characterize a margin condition that is sufficient and in some cases necessary for online learnability of a neural network: Every neuron in the first hidden layer classifies all instances with some margin \gamma bounded away from zero. Quantitatively, we prove that for any net, the optimal mistake bound is at most approximately \mathtt{TS}(d,\gamma), which is the (d,\gamma)-totally-separable-packing number, a more restricted variation of the standard (d,\gamma)-packing number. We complement this result by constructing a net on which any learner makes \mathtt{TS}(d,\gamma) many mistakes. We also give a quantitative lower bound of approximately \mathtt{TS}(d,\gamma) \geq \max\{1/(\gamma \sqrt{d})^d, d\} when \gamma \geq 1/2, implying that for some nets and input sequences every learner will err for \exp(d) many times, and that a dimension-free mistake bound is almost always impossible. To remedy this inevitable dependence on d, it is natural to seek additional natural restrictions to be placed on the network, so that the dependence on d is removed. We study two such restrictions. The first is the multi-index model, in which the function computed by the net depends only on k \ll d orthonormal directions. We prove a mistake bound of approximately (1.5/\gamma)^k + 2 in this model. The second is the extended margin assumption. In this setting, we assume that all neurons (in all layers) in the network classify every ingoing input from the previous layer with margin \gamma bounded away from zero. In this model, we prove a mistake bound of approximately (\log Y)/\gamma^{O(L)}, where L is the depth of the network.
[LG-60] Bridging Theory and Experiment in Materials Discovery: Machine-Learning-Assisted Prediction of Synthesizable Structures
Link: https://arxiv.org/abs/2505.09161
Authors: Yu Xin,Peng Liu,Zhuohang Xie,Wenhui Mi,Pengyue Gao,Hong Jian Zhao,Jian Lv,Yanchao Wang,Yanming Ma
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:
Abstract:Even though thermodynamic energy-based crystal structure prediction (CSP) has revolutionized materials discovery, the energy-driven CSP approaches often struggle to identify experimentally realizable metastable materials synthesized through kinetically controlled pathways, creating a critical gap between theoretical predictions and experimental synthesis. Here, we propose a synthesizability-driven CSP framework that integrates symmetry-guided structure derivation with a Wyckoff encode-based machine-learning model, allowing for the efficient localization of subspaces likely to yield highly synthesizable structures. Within the identified promising subspaces, a structure-based synthesizability evaluation model, fine-tuned using recently synthesized structures to enhance predictive accuracy, is employed in conjunction with ab initio calculations to systematically identify synthesizable candidates. The framework successfully reproduces 13 experimentally known XSe (X = Sc, Ti, Mn, Fe, Ni, Cu, Zn) structures, demonstrating its effectiveness in predicting synthesizable structures. Notably, 92,310 structures are filtered from the 554,054 candidates predicted by GNoME, exhibiting great potential for promising synthesizability. Additionally, eight thermodynamically favorable Hf-X-O (X = Ti, V, and Mn) structures have been identified, among which three HfV_2O_7 candidates exhibit high synthesizability, presenting viable candidates for experimental realization and potentially associated with experimentally observed temperature-induced phase transitions. This work establishes a data-driven paradigm for machine-learning-assisted inorganic materials synthesis, highlighting its potential to bridge the gap between computational predictions and experimental realization while unlocking new opportunities for the targeted discovery of novel functional materials.
[LG-61] A Comparative Review of RNA Language Models
Link: https://arxiv.org/abs/2505.09087
Authors: He Wang,Yikun Zhang,Jie Chen,Jian Zhan,Yaoqi Zhou
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments:
Abstract:Given the usefulness of protein language models (LMs) in structure and functional inference, RNA LMs have received increased attention in the last few years. However, these RNA models are often not compared against the same standard. Here, we divided RNA LMs into three classes (pretrained on multiple RNA types (especially noncoding RNAs), specific-purpose RNAs, and LMs that unify RNA with DNA or proteins or both) and compared 13 RNA LMs along with 3 DNA and 1 protein LMs as controls in zero-shot prediction of RNA secondary structure and functional classification. Results show that the models doing well on secondary structure prediction often perform worse in function classification or vice versa, suggesting that more balanced unsupervised training is needed.
[LG-62] Risk Bounds For Distributional Regression
Link: https://arxiv.org/abs/2505.09075
Authors: Carlos Misael Madrid Padilla,Oscar Hernan Madrid Padilla,Sabyasachi Chatterjee
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:This work examines risk bounds for nonparametric distributional regression estimators. For convex-constrained distributional regression, general upper bounds are established for the continuous ranked probability score (CRPS) and the worst-case mean squared error (MSE) across the domain. These theoretical results are applied to isotonic and trend filtering distributional regression, yielding convergence rates consistent with those for mean estimation. Furthermore, a general upper bound is derived for distributional regression under non-convex constraints, with a specific application to neural network-based estimators. Comprehensive experiments on both simulated and real data validate the theoretical contributions, demonstrating their practical effectiveness.
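For readers unfamiliar with the continuous ranked probability score (CRPS) named in the abstract, here is a minimal sketch of how it can be computed for a forecast given as a finite set of samples. It uses the standard identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'|, with X and X' drawn independently from F, applied to the empirical distribution of the samples. It illustrates the score only, not the paper's estimators or bounds; the function name `crps_from_samples` is hypothetical.

```python
def crps_from_samples(samples, observation):
    """CRPS of the empirical distribution of `samples` against one observed value.

    Uses CRPS(F, y) = E|X - y| - 0.5 * E|X - X'| with X, X' drawn
    independently from the empirical distribution. Lower is better.
    """
    n = len(samples)
    # Average distance between forecast samples and the observation.
    term_obs = sum(abs(x - observation) for x in samples) / n
    # Average pairwise distance between forecast samples (spread of the forecast).
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread

# Usage: a two-point forecast centered on the observation,
# and a point forecast that misses by 2.
score_spread = crps_from_samples([0.0, 2.0], 1.0)
score_point = crps_from_samples([1.0], 3.0)
```

For a point forecast the spread term vanishes and the score reduces to the absolute error, which is why CRPS is often read as a distributional generalization of mean absolute error.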
[LG-63] Probabilistic Wind Power Forecasting via Non-Stationary Gaussian Processes
Link: https://arxiv.org/abs/2505.09026
Authors: Domniki Ladopoulou,Dat Minh Hong,Petros Dellaportas
Categories: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 11 pages, 3 figures, 2 tables
Abstract:Accurate probabilistic forecasting of wind power is essential for maintaining grid stability and enabling efficient integration of renewable energy sources. Gaussian Process (GP) models offer a principled framework for quantifying uncertainty; however, conventional approaches rely on stationary kernels, which are inadequate for modeling the inherently non-stationary nature of wind speed and power output. We propose a non-stationary GP framework that incorporates the generalized spectral mixture (GSM) kernel, enabling the model to capture time-varying patterns and heteroscedastic behaviors in wind speed and wind power data. We evaluate the performance of the proposed model on real-world SCADA data across short-, medium-, and long-term forecasting horizons. Compared to standard radial basis function and spectral mixture kernels, the GSM-based model performs better, particularly in short-term forecasts. These results highlight the necessity of modeling non-stationarity in wind power forecasting and demonstrate the practical value of non-stationary GP models in operational settings.
[LG-64] Lower Bounds on the MMSE of Adversarially Inferring Sensitive Features
Link: https://arxiv.org/abs/2505.09004
Authors: Monica Welfert,Nathan Stromberg,Mario Diaz,Lalitha Sankar
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: submitted to IEEE Transactions on Information Theory
Abstract:We propose an adversarial evaluation framework for sensitive feature inference based on minimum mean-squared error (MMSE) estimation with a finite sample size and linear predictive models. Our approach establishes theoretical lower bounds on the true MMSE of inferring sensitive features from noisy observations of other correlated features. These bounds are expressed in terms of the empirical MMSE under a restricted hypothesis class and a non-negative error term. The error term captures both the estimation error due to a finite number of samples and the approximation error from using a restricted hypothesis class. For linear predictive models, we derive closed-form bounds, which are order optimal in terms of the noise variance, on the approximation error for several classes of relationships between the sensitive and non-sensitive features, including linear mappings, binary symmetric channels, and class-conditional multivariate Gaussian distributions. We also present a new lower bound that relies on the MSE computed on a hold-out validation dataset of the MMSE estimator learned on finite samples and a restricted hypothesis class. Through empirical evaluation, we demonstrate that our framework serves as an effective tool for MMSE-based adversarial evaluation of sensitive feature inference that balances theoretical guarantees with practical efficiency.
[LG-65] Statistical Decision Theory with Counterfactual Loss
Link: https://arxiv.org/abs/2505.08908
Authors: Benedikt Koch,Kosuke Imai
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
Comments:
Abstract:Classical statistical decision theory evaluates treatment choices based solely on observed outcomes. However, by ignoring counterfactual outcomes, it cannot assess the quality of decisions relative to feasible alternatives. For example, the quality of a physician’s decision may depend not only on patient survival, but also on whether a less invasive treatment could have produced a similar result. To address this limitation, we extend standard decision theory to incorporate counterfactual losses, that is, criteria that evaluate decisions using all potential outcomes. The central challenge in this generalization is identification: because only one potential outcome is observed for each unit, the associated risk under a counterfactual loss is generally not identifiable. We show that under the assumption of strong ignorability, a counterfactual risk is identifiable if and only if the counterfactual loss function is additive in the potential outcomes. Moreover, we demonstrate that additive counterfactual losses can yield treatment recommendations that differ from those based on standard loss functions, provided that the decision problem involves more than two treatment options.
[LG-66] Bounding Neyman-Pearson Region with f-Divergences
Link: https://arxiv.org/abs/2505.08899
Authors: Andrew Mullhaupt,Cheng Peng
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:The Neyman-Pearson region of a simple binary hypothesis test is the set of points whose coordinates represent the false positive rate and false negative rate of some test. The lower boundary of this region is given by the Neyman-Pearson lemma, and is, up to a coordinate change, equivalent to the optimal ROC curve. We establish a novel lower bound for the boundary in terms of any f-divergence. Since the bound generated by hockey-stick f-divergences characterizes the Neyman-Pearson boundary, this bound is best possible. In the case of KL divergence, this bound improves Pinsker’s inequality. Furthermore, we obtain a closed-form refined upper bound for the Neyman-Pearson boundary in terms of the Chernoff \alpha-coefficient. Finally, we present methods for constructing pairs of distributions that can approximately or exactly realize any given Neyman-Pearson boundary.
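As context for the statement that the new bound improves Pinsker's inequality, the following sketch shows the classical baseline only, not the paper's bound. For two discrete distributions P and Q, the smallest achievable sum of false positive and false negative rates equals 1 - TV(P, Q), and Pinsker's inequality TV <= sqrt(KL/2) turns a KL divergence into a (generally loose) lower bound on that sum. The function names are hypothetical, and the code assumes Q is positive wherever P is.

```python
import math

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL(P || Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def min_error_sum(p, q):
    """Smallest possible (false positive rate + false negative rate) over all tests."""
    return 1.0 - total_variation(p, q)

def pinsker_error_sum_bound(p, q):
    """Lower bound on the error sum implied by Pinsker's inequality."""
    return max(0.0, 1.0 - math.sqrt(kl_divergence(p, q) / 2.0))

# Usage: compare the exact minimum error sum with the Pinsker-based bound.
p = [0.5, 0.5]
q = [0.9, 0.1]
exact = min_error_sum(p, q)
bound = pinsker_error_sum_bound(p, q)
```

The gap between `exact` and `bound` in this example is the kind of slack that a sharper divergence-based bound on the Neyman-Pearson boundary aims to reduce.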
[LG-67] Equilibrium Propagation for Learning in Lagrangian Dynamical Systems
Link: https://arxiv.org/abs/2505.07363
Authors: Serge Massar
Categories: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 8 pages, 1 figure
Abstract:We propose a method for training dynamical systems governed by Lagrangian mechanics using Equilibrium Propagation. Our approach extends Equilibrium Propagation – initially developed for energy-based models – to dynamical trajectories by leveraging the principle of action extremization. Training is achieved by gently nudging trajectories toward desired targets and measuring how the variables conjugate to the parameters to be trained respond. This method is particularly suited to systems with periodic boundary conditions or fixed initial and final states, enabling efficient parameter updates without requiring explicit backpropagation through time. In the case of periodic boundary conditions, this approach yields the semiclassical limit of Quantum Equilibrium Propagation. Applications to systems with dissipation are also discussed.
Information Retrieval
[IR-0] Distance-aware Self-adaptive Graph Convolution for Fine-grained Hierarchical Recommendation
Link: https://arxiv.org/abs/2505.09590
Authors: Tao Huang,Yihong Chen,Wei Fan,Wei Zhou,Junhao Wen
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Graph Convolutional Networks (GCNs) are widely used to improve recommendation accuracy and performance by effectively learning the representations of user and item nodes. However, two major challenges remain: (1) the lack of further optimization in the graph representation structure and (2) insufficient attention given to the varying contributions of different convolutional layers. This paper proposes SAGCN, a distance-based adaptive hierarchical aggregation method that refines the aggregation process through differentiated representation metrics. SAGCN introduces a detailed approach to multilayer information aggregation and representation space optimization, enabling the model to learn hierarchical embedding weights based on the distance between hierarchical representations. This innovation allows for more precise cross-layer information aggregation, improves the model’s ability to capture hierarchical embeddings, and optimizes the representation space structure. Additionally, the objective loss function is refined to better align with recommendation tasks. Experiments conducted on four real-world datasets demonstrate significant improvements, including over a 5% increase on Yelp and a 5.58% increase in Recall@10 on the ML_1M dataset.
[IR-1] GlobalMood: A cross-cultural benchmark for music emotion recognition
Link: https://arxiv.org/abs/2505.09539
Authors: Harin Lee,Elif Çelen,Peter Harrison,Manuel Anglada-Tort,Pol van Rijn,Minsu Park,Marc Schönwiesner,Nori Jacoby
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Human annotations of mood in music are essential for music generation and recommender systems. However, existing datasets predominantly focus on Western songs with mood terms derived from English, which may limit generalizability across diverse linguistic and cultural backgrounds. To address this, we introduce 'GlobalMood', a novel cross-cultural benchmark dataset comprising 1,180 songs sampled from 59 countries, with large-scale annotations collected from 2,519 individuals across five culturally and linguistically distinct locations: U.S., France, Mexico, S. Korea, and Egypt. Rather than imposing predefined mood categories, we implement a bottom-up, participant-driven approach to organically elicit culturally specific music-related mood terms. We then recruit another pool of human participants to collect 988,925 ratings for these culture-specific descriptors. Our analysis confirms the presence of a valence-arousal structure shared across cultures, yet also reveals significant divergences in how certain mood terms, despite being dictionary equivalents, are perceived cross-culturally. State-of-the-art multimodal models benefit substantially from fine-tuning on our cross-culturally balanced dataset, as evidenced by improved alignment with human evaluations, particularly in non-English contexts. More broadly, our findings inform the ongoing debate on the universality versus cultural specificity of emotional descriptors, and our methodology can contribute to other multimodal and cross-lingual research.
[IR-2] FACTors: A New Dataset for Studying the Fact-checking Ecosystem SIGIR SIGIR’25
Link: https://arxiv.org/abs/2505.09414
Authors: Enes Altuncu,Can Başkent,Sanjay Bhattacherjee,Shujun Li,Dwaipayan Roy
Categories: Information Retrieval (cs.IR)
Comments: Accepted for the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25)
Abstract:Our fight against false information is spearheaded by fact-checkers. They investigate the veracity of claims and document their findings as fact-checking reports. With the rapid increase in the amount of false information circulating online, the use of automation in fact-checking processes aims to strengthen this ecosystem by enhancing scalability. Datasets containing fact-checked claims play a key role in developing such automated solutions. However, to the best of our knowledge, there is no fact-checking dataset at the ecosystem level, covering claims from a sufficiently long period of time and sourced from a wide range of actors reflecting the entire ecosystem that admittedly follows widely-accepted codes and principles of fact-checking. We present a new dataset FACTors, the first to fill this gap by presenting ecosystem-level data on fact-checking. It contains 118,112 claims from 117,993 fact-checking reports in English (co-)authored by 1,953 individuals and published during the period of 1995-2025 by 39 fact-checking organisations that are active signatories of the IFCN (International Fact-Checking Network) and/or EFCSN (European Fact-Checking Standards Network). It contains 7,327 overlapping claims investigated by multiple fact-checking organisations, corresponding to 2,977 unique claims. It enables new ecosystem-level studies of the fact-checkers (organisations and individuals). To demonstrate the usefulness of FACTors, we present three example applications, including a first-of-its-kind statistical analysis of the fact-checking ecosystem, examining the political inclinations of the fact-checking organisations, and attempting to assign a credibility score to each organisation based on the findings of the statistical analysis and political leanings. Our methods for constructing FACTors are generic and can be used to maintain a live dataset that can be updated dynamically.
[IR-3] HMamba: Hyperbolic Mamba for Sequential Recommendation
Link: https://arxiv.org/abs/2505.09205
Authors: Qianru Zhang,Honggang Wen,Wei Yuan,Crystal Chen,Menglin Yang,Siu-Ming Yiu,Hongzhi Yin
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Sequential recommendation systems have become a cornerstone of personalized services, adept at modeling the temporal evolution of user preferences by capturing dynamic interaction sequences. Existing approaches predominantly rely on traditional models, including RNNs and Transformers. Despite their success in local pattern recognition, Transformer-based methods suffer from quadratic computational complexity and a tendency toward superficial attention patterns, limiting their ability to infer enduring preference hierarchies in sequential recommendation data. Recent advances in Mamba-based sequential models introduce linear-time efficiency but remain constrained by Euclidean geometry, failing to leverage the intrinsic hyperbolic structure of recommendation data. To bridge this gap, we propose Hyperbolic Mamba, a novel architecture that unifies the efficiency of Mamba’s selective state space mechanism with hyperbolic geometry’s hierarchical representational power. Our framework introduces (1) a hyperbolic selective state space that maintains curvature-aware sequence modeling and (2) stabilized Riemannian operations to enable scalable training. Experiments across four benchmarks demonstrate that Hyperbolic Mamba achieves 3-11% improvement while retaining Mamba’s linear-time efficiency, enabling real-world deployment. This work establishes a new paradigm for efficient, hierarchy-aware sequential modeling.
[IR-4] Display Content, Display Methods and Evaluation Methods of the HCI in Explainable Recommender Systems: A Survey
Link: https://arxiv.org/abs/2505.09065
Authors: Weiqing Li,Yue Xu,Yuefeng Li,Yinghui Huang
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 2 Tables, 29 figures
Abstract:Explainable Recommender Systems (XRS) aim to provide users with understandable reasons for the recommendations generated by these systems, representing a crucial research direction in artificial intelligence (AI). Recent research has increasingly focused on the algorithms, display, and evaluation methodologies of XRS. However, current research and reviews primarily emphasize the algorithmic aspects, with fewer studies addressing the Human-Computer Interaction (HCI) layer of XRS. Additionally, existing reviews lack a unified taxonomy for XRS, and insufficient attention is given to the emerging area of short video recommendations. In this study, we synthesize existing literature and surveys on XRS, presenting a unified framework for its research and development. The main contributions are as follows: 1) We adopt a lifecycle perspective to systematically summarize the technologies and methods used in XRS, addressing challenges posed by the diversity and complexity of algorithmic models and explanation techniques. 2) For the first time, we highlight the application of multimedia, particularly video-based explanations, along with its potential, technical pathways, and challenges in XRS. 3) We provide a structured overview of evaluation methods from both qualitative and quantitative dimensions. These findings provide valuable insights for the systematic design, progress, and testing of XRS.
[IR-5] Item Level Exploration Traffic Allocation in Large-scale Recommendation Systems RECSYS
Link: https://arxiv.org/abs/2505.09033
Authors: Dong Wang,Junyi Jiao,Arnab Bhadury,Yaping Zhang,Mingyan Gao
Categories: Information Retrieval (cs.IR)
Comments: accepted by the 18th ACM Recsys Large Recsys Workshop
Abstract:This paper contributes to addressing the item cold start problem in large-scale recommender systems, focusing on how to efficiently gain initial visibility for newly ingested content. We propose an exploration system designed to efficiently allocate impressions to these fresh items. Our approach leverages a learned probabilistic model to predict an item’s discoverability, which then informs a scalable and adaptive traffic allocation strategy. This system intelligently distributes exploration budgets, optimizing for the long-term benefit of the recommendation platform. The impact is a demonstrably more efficient cold-start process, leading to a significant increase in the discoverability of new content and ultimately enriching the item corpus available for exploitation, as evidenced by its successful deployment in a large-scale production environment.