This blog post presents the latest paper list retrieved from Arxiv.org on 2025-05-08, updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR.
Note: paper data is retrieved from Arxiv.org daily and updated automatically at around 12:00 each day.
Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2025-05-08)
455 papers were updated today, including:
- Natural Language Processing: 36 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 147 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 93 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 146 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] ZeroSearch: Incentivize the Search Capability of LLMs without Searching
[Quick Read]: This paper tackles two major problems that arise when using Reinforcement Learning (RL) to improve the search capability of Large Language Models (LLMs): uncontrolled document quality and prohibitively high API costs. The key to the solution is the ZeroSearch framework, which uses lightweight supervised fine-tuning to turn an LLM into a retrieval module that can generate both relevant and noisy documents, and applies a curriculum-based rollout strategy during RL training that progressively degrades the quality of the generated documents to elicit the model's reasoning ability, so that the LLM's search capability can be incentivized without interacting with a real search engine.
Link: https://arxiv.org/abs/2505.04588
Authors: Hao Sun,Zile Qiao,Jiayan Guo,Xuanbo Fan,Yingyan Hou,Yong Jiang,Pengjun Xie,Fei Huang,Yan Zhang
Affiliations: Tongyi Lab; Alibaba Group
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs’ search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model’s reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
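As a rough illustration of the curriculum-based rollout described above, the sketch below anneals the probability that the simulated retriever emits a noisy document as training progresses. The linear schedule and the mixing mechanism are assumptions for illustration, not the paper's exact recipe.

```python
import random

def noise_probability(step: int, total_steps: int,
                      p_start: float = 0.1, p_end: float = 0.7) -> float:
    """Linearly increase the chance of a noisy document over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return p_start + frac * (p_end - p_start)

def generate_document(query: str, style: str) -> str:
    # Placeholder for the supervised-fine-tuned retrieval LLM's generation call.
    return f"[{style} document for: {query}]"

def simulated_retrieve(query: str, step: int, total_steps: int) -> str:
    """Stand-in retriever: emit a relevant or noisy document according to
    the curriculum schedule, instead of calling a real search engine."""
    if random.random() < noise_probability(step, total_steps):
        return generate_document(query, style="noisy")
    return generate_document(query, style="relevant")

print(simulated_retrieve("capital of France", step=900, total_steps=1000))
```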
[NLP-1] Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review
[Quick Read]: This paper addresses data scarcity in generative language modelling for low-resource languages (LRLs), with the aim of mitigating linguistic inequality in natural language processing (NLP). The key to the solution lies in surveying and evaluating a range of technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, to improve the performance of low-resource language models and to promote broader language coverage and equity. The review also finds that current methods rely heavily on transformer-based models but suffer from limited language coverage and inconsistent evaluation standards.
Link: https://arxiv.org/abs/2505.04531
Authors: Josh McGiff,Nikola S. Nikolov
Affiliations: University of Limerick
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This work is currently under review. Please do not cite without permission
Abstract:Generative language modelling has surged in popularity with the emergence of services such as ChatGPT and Google Gemini. While these models have demonstrated transformative potential in productivity and communication, they overwhelmingly cater to high-resource languages like English. This has amplified concerns over linguistic inequality in natural language processing (NLP). This paper presents the first systematic review focused specifically on strategies to address data scarcity in generative language modelling for low-resource languages (LRL). Drawing from 54 studies, we identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, across generative tasks. We also analyse trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a strong reliance on transformer-based models, a concentration on a small subset of LRLs, and a lack of consistent evaluation across studies. We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems. Ultimately, this review aims to support researchers and developers in building inclusive AI tools for underrepresented languages, a necessary step toward empowering LRL speakers and the preservation of linguistic diversity in a world increasingly shaped by large-scale language technologies.
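Back-translation, one of the surveyed augmentation techniques, is easy to sketch: round-trip monolingual text through a pivot language to create paraphrases that enlarge the training set. The sketch below uses publicly available MarianMT checkpoints as convenient stand-ins; real low-resource work would use models covering the target language.

```python
from transformers import pipeline

# Illustrative pivot pair; swap in checkpoints for the language of interest.
to_pivot = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
from_pivot = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(sentence: str) -> str:
    """Round-trip a sentence through the pivot language to get a paraphrase."""
    pivot = to_pivot(sentence)[0]["translation_text"]
    return from_pivot(pivot)[0]["translation_text"]

corpus = ["The harvest festival begins at dawn."]
augmented = corpus + [back_translate(s) for s in corpus]
print(augmented)
```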
[NLP-2] Beyond Theorem Proving: Formulation Framework and Benchmark for Formal Problem-Solving
[Quick Read]: This paper addresses two gaps: the lack of a general yet concrete formulation of problem-solving, and the underexplored demand for process-level verifiability in AI-based problem-solving agents. The key to the solution is to model problem-solving as a deterministic Markov decision process and to propose FPS (Formal Problem-Solving), a framework that leverages existing formal theorem proving (FTP) environments to perform process-verified problem-solving; D-FPS (Deductive FPS) further decouples solving from answer verification for better human-alignment. In addition, the paper proposes RPE (Restricted Propositional Equivalence) as an evaluation method for faithful, interpretable, and human-aligned assessment.
Link: https://arxiv.org/abs/2505.04528
Authors: Qi Liu,Xinhao Zheng,Renqiu Xia,Xingzhi Qi,Qinxiang Cao,Junchi Yan
Affiliations: Shanghai Jiao Tong University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: 42 pages, 3 figures
Abstract:As a seemingly self-explanatory task, problem-solving has been a significant component of science and engineering. However, a general yet concrete formulation of problem-solving itself is missing. With the recent development of AI-based problem-solving agents, the demand for process-level verifiability is rapidly increasing yet underexplored. To fill these gaps, we present a principled formulation of problem-solving as a deterministic Markov decision process; a novel framework, FPS (Formal Problem-Solving), which utilizes existing FTP (formal theorem proving) environments to perform process-verified problem-solving; and D-FPS (Deductive FPS), decoupling solving and answer verification for better human-alignment. The expressiveness, soundness and completeness of the frameworks are proven. We construct three benchmarks on problem-solving: FormalMath500, a formalization of a subset of the MATH500 benchmark; MiniF2F-Solving and PutnamBench-Solving, adaptations of FTP benchmarks MiniF2F and PutnamBench. For faithful, interpretable, and human-aligned evaluation, we propose RPE (Restricted Propositional Equivalence), a symbolic approach to determine the correctness of answers by formal verification. We evaluate four prevalent FTP models and two prompting methods as baselines, solving at most 23.77% of FormalMath500, 27.47% of MiniF2F-Solving, and 0.31% of PutnamBench-Solving.
[NLP-3] Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs
[Quick Read]: This paper aims to address the low utilization of computing resources and the difficulty of realizing expected performance gains when running sparse Large Language Models (LLMs) efficiently on Ascend NPUs. The key to the solution is using simulation to select model hyperparameters suited to Ascend NPUs, optimizing inter-device communication through Expert Parallelism to reduce synchronization overhead, and improving on-device memory efficiency, thereby enabling efficient training.
Link: https://arxiv.org/abs/2505.04519
Authors: Yehui Tang,Yichun Yin,Yaoyuan Wang,Hang Zhou,Yu Pan,Wei Guo,Ziyang Zhang,Miao Rang,Fangcheng Liu,Naifu Zhang,Binghan Li,Yonghan Dong,Xiaojun Meng,Yasheng Wang,Dong Li,Yin Li,Dandan Tu,Can Chen,Youliang Yan,Fisher Yu,Ruiming Tang,Yunhe Wang,Botian Huang,Bo Wang,Boxiao Liu,Changzheng Zhang,Da Kuang,Fei Liu,Gang Huang,Jiansheng Wei,Jiarui Qin,Jie Ran,Jinpeng Li,Jun Zhao,Liang Dai,Lin Li,Liqun Deng,Peifeng Qin,Pengyuan Zeng,Qiang Gu,Shaohua Tang,Shengjun Cheng,Tao Gao,Tao Yu,Tianshu Li,Tianyu Bi,Wei He,Weikai Mao,Wenyong Huang,Wulong Liu,Xiabing Li,Xianzhi Yu,Xueyu Wu,Xu He,Yangkai Du,Yan Xu,Ye Tian,Yimeng Wu,Yongbing Huang,Yong Tian,Yong Zhu,Yue Li,Yufei Wang,Yuhang Gai,Yujun Li,Yu Luo,Yunsheng Ni,Yusen Sun,Zelin Chen,Zhe Liu,Zhicheng Liu,Zhipeng Tu,Zilin Ding,Zongyuan Zhan
Affiliations: Huawei
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize the communication between NPU devices to reduce the synchronization overhead. We also optimize the memory efficiency within the devices to further reduce the parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system is capable of harnessing all the training stages of the state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.
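Model FLOPs Utilization (MFU), the headline efficiency figure in this abstract, is simply achieved model FLOPs per second divided by the hardware's aggregate peak. A minimal sketch of the computation follows; the 6-FLOPs-per-parameter-per-token training estimate is a common rule of thumb, and every number below is a made-up placeholder rather than an Ascend or Pangu specification.

```python
def mfu(active_params: float, tokens_per_second: float,
        num_chips: int, peak_flops_per_chip: float) -> float:
    """MFU = achieved training FLOPs/s over aggregate peak FLOPs/s.
    Uses the standard ~6 * params FLOPs-per-token estimate; for an MoE
    model, params should be the *activated* parameter count."""
    achieved = 6.0 * active_params * tokens_per_second
    return achieved / (num_chips * peak_flops_per_chip)

# Hypothetical numbers purely for illustration:
value = mfu(active_params=39e9, tokens_per_second=2.4e6,
            num_chips=6000, peak_flops_per_chip=3.1e14)
print(f"MFU = {value:.1%}")  # ~30% with these made-up inputs
```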
[NLP-4] Detecting Spelling and Grammatical Anomalies in Russian Poetry Texts
[Quick Read]: This paper aims to address the problem of poor text quality in the training datasets of generative models, especially in computational creativity tasks such as poem or song lyric generation, where low-quality texts significantly reduce the value of the generated output. The key to the solution is introducing automated linguistic anomaly detection to identify and filter out low-quality texts from training datasets, thereby improving the performance of generative models in creative domains.
Link: https://arxiv.org/abs/2505.04507
Authors: Ilya Koziev
Affiliations: SalutDevices
Categories: Computation and Language (cs.CL)
Comments:
Abstract:The quality of natural language texts in fine-tuning datasets plays a critical role in the performance of generative models, particularly in computational creativity tasks such as poem or song lyric generation. Fluency defects in generated poems significantly reduce their value. However, training texts are often sourced from internet-based platforms without stringent quality control, posing a challenge for data engineers to manage defect levels effectively. To address this issue, we propose the use of automated linguistic anomaly detection to identify and filter out low-quality texts from training datasets for creative models. In this paper, we present a comprehensive comparison of unsupervised and supervised text anomaly detection approaches, utilizing both synthetic and human-labeled datasets. We also introduce the RUPOR dataset, a collection of Russian-language human-labeled poems designed for cross-sentence grammatical error detection, and provide the full evaluation code. Our work aims to empower the community with tools and insights to improve the quality of training datasets for generative models in creative domains.
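A common unsupervised baseline for this kind of filtering scores each text with a language model and flags outliers by perplexity. The sketch below assumes that higher perplexity correlates with fluency defects and uses a GPT-2 checkpoint purely as a convenient stand-in, not the paper's model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Token-level perplexity; unusually high values flag candidate defects."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss
    return float(torch.exp(loss))

poems = ["Roses are red, violets are blue.",
         "Rose are red violet blues is."]
threshold = 200.0  # in practice, tuned on held-out clean data
clean = [p for p in poems if perplexity(p) < threshold]
print(clean)
```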
[NLP-5] Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration
[Quick Read]: This paper addresses speech restoration (SR) for cleaning the training data of large-scale generative models, in particular efficient processing of million-hour scale data. The key to the solution is the Miipher-2 model, which builds on a frozen, pre-trained Universal Speech Model (USM) as a robust feature extractor requiring no conditioning inputs (such as text or speaker IDs), and achieves efficient waveform synthesis through parallel adapters and the WaveFit neural vocoder, reducing computational cost while maintaining high speech quality.
Link: https://arxiv.org/abs/2505.04457
Authors: Shigeki Karita,Yuma Koizumi,Heiga Zen,Haruko Ishikawa,Robin Scheibler,Michiel Bacchiani
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour scale data, for training data cleaning for large-scale generative models like large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi-lingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2’s superior or comparable performance to conventional SR models in word-error-rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.
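The throughput claim is easy to sanity-check: a real-time factor (RTF) of 0.0078 means each hour of audio needs 0.0078 hours of processing, so a million hours spread over 100 accelerators comes to about 78 hours per device. A quick check:

```python
dataset_hours = 1_000_000
rtf = 0.0078        # processing time per unit of audio duration
accelerators = 100

compute_hours = dataset_hours * rtf            # 7,800 device-hours in total
wall_clock_days = compute_hours / accelerators / 24
print(f"{wall_clock_days:.2f} days")           # ~3.25 days, matching the paper
```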
[NLP-6] OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models
[Quick Read]: This paper aims to address the risk that Large Language Models (LLMs) memorize sensitive, copyrighted, or harmful content during training. The key to the solution is the OBLIVIATE framework, which extracts target tokens, builds retain sets, and fine-tunes with a tailored loss function comprising three components (masking, knowledge distillation, and world facts), thereby removing the targeted data while preserving model performance. The method also uses low-rank adapters (LoRA) for computational efficiency without compromising unlearning quality.
Link: https://arxiv.org/abs/2505.04416
Authors: Xiaoyu Xu,Minxin Du,Qingqing Ye,Haibo Hu
Affiliations: The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 18 pages, 2 figures
Abstract:Large language models (LLMs) trained over extensive corpora risk memorizing sensitive, copyrighted, or toxic content. To address this, we propose OBLIVIATE, a robust unlearning framework that removes targeted data while preserving model utility. The framework follows a structured process: extracting target tokens, building retain sets, and fine-tuning with a tailored loss function comprising three components – masking, distillation, and world fact. Using low-rank adapters (LoRA), it ensures efficiency without compromising unlearning quality. We conduct experiments on multiple datasets, including the Harry Potter series, WMDP, and TOFU, using a comprehensive suite of metrics: forget quality (new document-level memorization score), model utility, and fluency. Results demonstrate its effectiveness in resisting membership inference attacks, minimizing the impact on retained data, and maintaining robustness across diverse scenarios.
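A minimal sketch of what a three-part unlearning objective of this shape could look like; the component definitions and weights below are assumptions for illustration, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def unlearning_loss(logits_forget, mask_targets,
                    logits_retain, teacher_logits_retain,
                    logits_facts, fact_labels,
                    w1=1.0, w2=1.0, w3=1.0):
    """Illustrative composite: suppress targeted tokens on the forget set,
    distill the original model on the retain set, and keep world-fact
    answers correct."""
    # (1) masking: push down the probability of the targeted tokens
    log_probs = F.log_softmax(logits_forget, dim=-1)
    l_mask = log_probs.gather(-1, mask_targets.unsqueeze(-1)).mean()
    # (2) distillation: match the frozen teacher on retained data
    l_distill = F.kl_div(F.log_softmax(logits_retain, dim=-1),
                         F.softmax(teacher_logits_retain, dim=-1),
                         reduction="batchmean")
    # (3) world facts: standard cross-entropy on factual QA pairs
    l_fact = F.cross_entropy(logits_facts.view(-1, logits_facts.size(-1)),
                             fact_labels.view(-1))
    return w1 * l_mask + w2 * l_distill + w3 * l_fact
```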
[NLP-7] YABLoCo: Yet Another Benchmark for Long Context Code Generation ICSE2025
[Quick Read]: This paper addresses the mismatch between the benchmarks used to evaluate Large Language Models (LLMs) on code generation and the scale of real software projects: conventional benchmarks typically cover only small or medium context windows of thousands of lines of code, while real-world software projects can contain millions of lines of code (LoC). To close this gap, the paper contributes YABLoCo, a long-context code generation benchmark whose key contributions are evaluating function body generation in large C and C++ codebases (two languages not covered by previous benchmarks), covering large repositories from 200K to 2,000K LoC, and providing a scalable evaluation pipeline plus a tool for visual analysis of generated code, enabling effective evaluation of code generation in large repositories.
Link: https://arxiv.org/abs/2505.04406
Authors: Aidar Valeev(1),Roman Garaev(1),Vadim Lomshakov(2),Irina Piontkovskaya(3),Vladimir Ivanov(1),Israel Adewuyi(1) ((1) Research Center of the Artificial Intelligence Institute, Innopolis University, Russia, (2) St. Petersburg Department of the Steklov Institute of Mathematics, Russia, (3) Huawei Noah’s Ark Lab)
Affiliations: Innopolis University; Research Center of the Artificial Intelligence Institute; Huawei Noah’s Ark Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Presented at the LLM4Code 2025 Workshop, co-located with ICSE 2025
Abstract:Large Language Models demonstrate the ability to solve various programming tasks, including code generation. Typically, the performance of LLMs is measured on benchmarks with small or medium-sized context windows of thousands of lines of code. At the same time, in real-world software projects, repositories can span up to millions of LoC. This paper closes this gap by contributing to the long context code generation benchmark (YABLoCo). The benchmark featured a test set of 215 functions selected from four large repositories with thousands of functions. The dataset contained metadata of functions, contexts of the functions with different levels of dependencies, docstrings, functions bodies, and call graphs for each repository. This paper presents three key aspects of the contribution. First, the benchmark aims at function body generation in large repositories in C and C++, two languages not covered by previous benchmarks. Second, the benchmark contains large repositories from 200K to 2,000K LoC. Third, we contribute a scalable evaluation pipeline for efficient computing of the target metrics and a tool for visual analysis of generated code. Overall, these three aspects allow for evaluating code generation in large repositories in C and C++.
[NLP-8] Large Means Left: Political Bias in Large Language Models Increases with Their Number of Parameters
[Quick Read]: This paper examines possible political biases in Large Language Models (LLMs), which can affect how users obtain information and make decisions. The key to the approach is a quantitative analysis of the political leanings of LLMs in the context of the recent German Bundestag vote, using the Wahl-O-Mat score to measure how closely each model aligns with the positions of German political parties, and comparing alignment scores across models to identify factors shaping their political preferences. The study finds a bias toward left-leaning parties that is most dominant in larger LLMs, and that the language used to communicate with a model, as well as its origin, also significantly influence its political views.
Link: https://arxiv.org/abs/2505.04393
Authors: David Exler,Mark Schutera,Markus Reischl,Luca Rettenberger
Affiliations: Karlsruhe Institute of Technology
Categories: Computation and Language (cs.CL)
Comments:
Abstract:With the increasing prevalence of artificial intelligence, careful evaluation of inherent biases needs to be conducted to form the basis for alleviating the effects these predispositions can have on users. Large language models (LLMs) are predominantly used by many as a primary source of information for various topics. LLMs frequently make factual errors, fabricate data (hallucinations), or present biases, exposing users to misinformation and influencing opinions. Educating users on their risks is key to responsible use, as bias, unlike hallucinations, cannot be caught through data verification. We quantify the political bias of popular LLMs in the context of the recent vote of the German Bundestag using the score produced by the Wahl-O-Mat. This metric measures the alignment between an individual’s political views and the positions of German political parties. We compare the models’ alignment scores to identify factors influencing their political preferences. Doing so, we discover a bias toward left-leaning parties, most dominant in larger LLMs. Also, we find that the language we use to communicate with the models affects their political views. Additionally, we analyze the influence of a model’s origin and release date and compare the results to the outcome of the recent vote of the Bundestag. Our results imply that LLMs are prone to exhibiting political bias. Large corporations with the necessary means to develop LLMs, thus, knowingly or unknowingly, have a responsibility to contain these biases, as they can influence each voter’s decision-making process and inform public opinion in general and at scale.
[NLP-9] The Aloe Family Recipe for Open and Specialized Healthcare LLMs
[Quick Read]: This paper aims to address the lack of competitive open-source Large Language Models (LLMs) for healthcare, in order to protect the public interest. The key to the solution is optimizing the data preprocessing and training stages, improving model safety through Direct Preference Optimization (DPO), and improving efficacy through Retrieval-Augmented Generation (RAG). The paper also proposes an evaluation methodology comprising four different types of tests, setting a new standard for the field.
Link: https://arxiv.org/abs/2505.04388
Authors: Dario Garcia-Gasulla,Jordi Bayarri-Planas,Ashwin Kumar Gururajan,Enrique Lopez-Cuena,Adrian Tormos,Daniel Hinjos,Pablo Bernabeu-Perez,Anna Arias-Duart,Pablo Agustin Martin-Torres,Marta Gonzalez-Mallo,Sergio Alvarez-Napagao,Eduard Ayguadé-Parra,Ulises Cortés
Affiliations: Barcelona Supercomputing Center
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2405.01886
Abstract:Purpose: With advancements in Large Language Models (LLMs) for healthcare, the need arises for competitive open-source models to protect the public interest. This work contributes to the field of open medical LLMs by optimizing key stages of data preprocessing and training, while showing how to improve model safety (through DPO) and efficacy (through RAG). The evaluation methodology used, which includes four different types of tests, defines a new standard for the field. The resultant models, shown to be competitive with the best private alternatives, are released with a permissive license. Methods: Building on top of strong base models like Llama 3.1 and Qwen 2.5, Aloe Beta uses a custom dataset to enhance public data with synthetic Chain of Thought examples. The models undergo alignment with Direct Preference Optimization, emphasizing ethical and policy-aligned performance in the presence of jailbreaking attacks. Evaluation includes close-ended, open-ended, safety and human assessments, to maximize the reliability of results. Results: Recommendations are made across the entire pipeline, backed by the solid performance of the Aloe Family. These models deliver competitive performance across healthcare benchmarks and medical fields, and are often preferred by healthcare professionals. On bias and toxicity, the Aloe Beta models significantly improve safety, showing resilience to unseen jailbreaking attacks. For a responsible release, a detailed risk assessment specific to healthcare is attached to the Aloe Family models. Conclusion: The Aloe Beta models, and the recipe that leads to them, are a significant contribution to the open-source medical LLM field, offering top-of-the-line performance while maintaining high ethical requirements. This work sets a new standard for developing and reporting aligned LLMs in healthcare.
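Since DPO is the alignment workhorse named above, here is a compact sketch of its per-pair loss (the standard formulation; variable names are illustrative and this is not the Aloe training code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization: widen the margin by which the policy
    prefers the chosen response over the rejected one, measured relative
    to a frozen reference model."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy tensors standing in for summed token log-probs of each response:
lp_c, lp_r = torch.tensor([-12.3]), torch.tensor([-15.9])
ref_c, ref_r = torch.tensor([-13.0]), torch.tensor([-14.8])
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```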
[NLP-10] Benchmarking LLMs' Swarm Intelligence
[Quick Read]: This paper addresses the underexplored question of whether Large Language Models (LLMs) can achieve emergent coordination in Multi-Agent Systems (MAS) under strict constraints, such as the limited local perception and communication characteristic of natural swarms, particularly with respect to the nuances of swarm intelligence. The key to the solution is SwarmBench, a novel benchmark for systematically evaluating the swarm intelligence of LLMs acting as decentralized agents. SwarmBench comprises five foundational MAS coordination tasks in a configurable 2D grid environment that forces agents to rely primarily on local sensory input and local communication, with defined metrics for analyzing coordination effectiveness and emergent group dynamics.
Link: https://arxiv.org/abs/2505.04364
Authors: Kai Ruan,Mowen Huang,Ji-Rong Wen,Hao Sun
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) when operating under strict constraints-such as limited local perception and communication, characteristic of natural swarms-remains largely unexplored, particularly concerning the nuances of swarm intelligence. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination that arise when agents operate with incomplete spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks within a configurable 2D grid environment, forcing agents to rely primarily on local sensory input (k x k view) and local communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Evaluating several leading LLMs in a zero-shot setting, we find significant performance variations across tasks, highlighting the difficulties posed by local information constraints. While some coordination emerges, results indicate limitations in robust planning and strategy formation under uncertainty in these decentralized scenarios. Assessing LLMs under swarm-like conditions is crucial for realizing their potential in future decentralized systems. We release SwarmBench as an open, extensible toolkit-built upon a customizable and scalable physical system with defined mechanical properties. It provides environments, prompts, evaluation scripts, and the comprehensive experimental datasets generated, aiming to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of Embodied MAS. Our code repository is available at this https URL.
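The "k x k view" constraint in the abstract is easy to picture: each agent observes only the cells inside a square window centered on itself. A minimal sketch of such an observation function follows; the grid encoding is an assumption, not SwarmBench's actual format.

```python
import numpy as np

def local_view(grid: np.ndarray, row: int, col: int, k: int = 5) -> np.ndarray:
    """Return the k x k window centered on (row, col), padding
    out-of-bounds cells with -1 (unobservable)."""
    assert k % 2 == 1, "odd window size keeps the agent centered"
    r = k // 2
    padded = np.pad(grid, r, constant_values=-1)
    return padded[row:row + k, col:col + k]

world = np.zeros((10, 10), dtype=int)
world[2, 3] = 7                       # another agent or object
print(local_view(world, row=0, col=0, k=5))
```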
[NLP-11] GASCADE: Grouped Summarization of Adverse Drug Event for Enhanced Cancer Pharmacovigilance
[Quick Read]: This paper addresses the summarization of adverse drug events (ADEs) reported by patients taking prescribed drugs in cancer treatment, in order to strengthen pharmacovigilance practice and drug-related decision-making. Existing research mostly focuses on general diseases and pays little attention to cancer-specific scenarios. The key to the solution is the GASCADE framework, which combines the information-extraction capabilities of Large Language Models (LLMs) with the summarization power of the encoder-decoder T5 model, and is the first to apply alignment techniques such as Direct Preference Optimization on synthetic datasets to improve summarization performance.
Link: https://arxiv.org/abs/2505.04284
Authors: Sofia Jamil,Aryan Dabad,Bollampalli Areen Reddy,Sriparna Saha,Rajiv Misra,Adil A. Shakur
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:In the realm of cancer treatment, summarizing adverse drug events (ADEs) reported by patients using prescribed drugs is crucial for enhancing pharmacovigilance practices and improving drug-related decision-making. While the volume and complexity of pharmacovigilance data have increased, existing research in this field has predominantly focused on general diseases rather than specifically addressing cancer. This work introduces the task of grouped summarization of adverse drug events reported by multiple patients using the same drug for cancer treatment. To address the challenge of limited resources in cancer pharmacovigilance, we present the MultiLabeled Cancer Adverse Drug Reaction and Summarization (MCADRS) dataset. This dataset includes pharmacovigilance posts detailing patient concerns regarding drug efficacy and adverse effects, along with extracted labels for drug names, adverse drug events, severity, and adversity of reactions, as well as summaries of ADEs for each drug. Additionally, we propose the Grouping and Abstractive Summarization of Cancer Adverse Drug events (GASCADE) framework, a novel pipeline that combines the information extraction capabilities of Large Language Models (LLMs) with the summarization power of the encoder-decoder T5 model. Our work is the first to apply alignment techniques, including advanced algorithms like Direct Preference Optimization, to encoder-decoder models using synthetic datasets for summarization tasks. Through extensive experiments, we demonstrate the superior performance of GASCADE across various metrics, validated through both automated assessments and human evaluations. This multitasking approach enhances drug-related decision-making and fosters a deeper understanding of patient concerns, paving the way for advancements in personalized and responsive cancer care. The code and dataset used in this work are publicly available.
[NLP-12] LLM-Independent Adaptive RAG: Let the Question Speak for Itself
[Quick Read]: This paper targets the hallucination problem of Large Language Models (LLMs): Retrieval-Augmented Generation (RAG) reduces hallucination risk but is computationally expensive and can introduce misinformation. The key to the solution is a set of lightweight, LLM-independent adaptive retrieval methods that use external information to decide whether retrieval is needed, achieving significant efficiency gains while matching the question-answering performance of complex LLM-based methods.
Link: https://arxiv.org/abs/2505.04253
Authors: Maria Marina,Nikolay Ivanov,Sergey Pletenev,Mikhail Salnikov,Daria Galimzianova,Nikita Krayko,Vasily Konovalov,Alexander Panchenko,Viktor Moskvoretskii
Affiliations: Skoltech; AIRI; HSE University; MTS AI; MIPT
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 5 figures, 2 tables
Abstract:Large Language Models~(LLMs) are prone to hallucinations, and Retrieval-Augmented Generation (RAG) helps mitigate this, but at a high computational cost while risking misinformation. Adaptive retrieval aims to retrieve only when necessary, but existing approaches rely on LLM-based uncertainty estimation, which remain inefficient and impractical. In this study, we introduce lightweight LLM-independent adaptive retrieval methods based on external information. We investigated 27 features, organized into 7 groups, and their hybrid combinations. We evaluated these methods on 6 QA datasets, assessing the QA performance and efficiency. The results show that our approach matches the performance of complex LLM-based methods while achieving significant efficiency gains, demonstrating the potential of external information for adaptive retrieval.
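A minimal sketch of gating retrieval on cheap, LLM-independent features: train a small classifier that predicts whether a question needs retrieval, and only invoke the RAG pipeline when it does. The specific features and classifier below are illustrative assumptions; the paper studies 27 features organized into 7 groups.

```python
from sklearn.linear_model import LogisticRegression

def question_features(q: str) -> list:
    """Cheap external signals, no LLM involved: length, capitalized
    (entity-like) tokens, and question-mark presence."""
    toks = q.split()
    return [len(toks), sum(t[0].isupper() for t in toks), float("?" in q)]

def rag_answer(q):      # stand-in for a retrieve-then-read pipeline
    return f"[RAG answer to: {q}]"

def direct_answer(q):   # stand-in for answering from parametric memory
    return f"[direct answer to: {q}]"

# Toy supervision: 1 = retrieval was needed to answer correctly.
questions = ["Who won the 2023 Monaco Grand Prix?", "What is 2 + 2?"]
needs_retrieval = [1, 0]
gate = LogisticRegression().fit(
    [question_features(q) for q in questions], needs_retrieval)

def answer(q: str) -> str:
    if gate.predict([question_features(q)])[0]:
        return rag_answer(q)
    return direct_answer(q)

print(answer("Who won the 2024 Monaco Grand Prix?"))
```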
[NLP-13] VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning
[Quick Read]: This paper addresses the lack of large multimodal models (LMMs) in pathology video analysis that effectively integrate vision with diagnostic reasoning. The key to the solution is the VideoPath-Instruct dataset, which contains 4278 pairs of videos and diagnosis-specific chain-of-thought instructions, together with a transfer-learning strategy that first trains on weakly annotated, keyframe-extracted clips using existing single-image instruction datasets and then fine-tunes on manually segmented videos to strengthen the model's diagnostic reasoning.
Link: https://arxiv.org/abs/2505.04192
Authors: Trinh T.L. Vuong,Jin Tae Kwak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:We present VideoPath-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios, single patch images, automatically keyframe-extracted clips, and manually segmented video pathology images, to mimic the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, VideoPath-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the VideoPath-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. VideoPath-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at this https URL.
[NLP-14] Large Language Models are often politically extreme, usually ideologically inconsistent, and persuasive even in informational contexts
[Quick Read]: This paper asks whether Large Language Models (LLMs) harbor political biases and what role they could play as vectors of political influence. It argues that although academic research generally finds LLMs' overall partisan preference to be small, this apparently small bias is the net result of offsetting extreme views on specific topics, much like a moderate voter. The key to the approach is comparing 31 LLMs with legislators, judges, and a representative sample of U.S. voters to expose the models' extreme topic-level positions, and using a randomized experiment to show that LLMs effectively transmit their political leanings in information-seeking contexts, producing significant persuasion effects on voters.
Link: https://arxiv.org/abs/2505.04171
Authors: Nouar Aldahoul,Hazem Ibrahim,Matteo Varvello,Aaron Kaufman,Talal Rahwan,Yasir Zaki
Affiliations: New York University Abu Dhabi; USA
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 61 pages, 29 figures
Abstract:Large Language Models (LLMs) are a transformational technology, fundamentally changing how people obtain information and interact with the world. As people become increasingly reliant on them for an enormous variety of tasks, a body of academic research has developed to examine these models for inherent biases, especially political biases, often finding them small. We challenge this prevailing wisdom. First, by comparing 31 LLMs to legislators, judges, and a nationally representative sample of U.S. voters, we show that LLMs’ apparently small overall partisan preference is the net result of offsetting extreme views on specific topics, much like moderate voters. Second, in a randomized experiment, we show that LLMs can promulgate their preferences into political persuasiveness even in information-seeking contexts: voters randomized to discuss political issues with an LLM chatbot are as much as 5 percentage points more likely to express the same preferences as that chatbot. Contrary to expectations, these persuasive effects are not moderated by familiarity with LLMs, news consumption, or interest in politics. LLMs, especially those controlled by private companies or governments, may become a powerful and targeted vector for political influence.
[NLP-15] Can Language Models Understand Social Behavior in Clinical Conversations?
[Quick Read]: This paper addresses the automatic tracking of social signals in clinical conversations: 20 distinct behavioral signals, such as provider dominance and patient warmth, that shape the quality of the patient-provider relationship and health outcomes. The key to the solution is leveraging the capabilities of large language models (LLMs) by designing task-specific prompts and evaluating different model architectures and prompting styles on a highly imbalanced annotated dataset, enabling automatic analysis and extraction of social signals. The study presents the first system capable of tracking all 20 coded signals and uncovers patterns in LLM behavior on this task.
Link: https://arxiv.org/abs/2505.04152
Authors: Manas Satish Bedmutha,Feng Chen,Andrea Hartzler,Trevor Cohen,Nadir Weibel
Affiliations: UC San Diego; University of Washington
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Effective communication between providers and their patients influences health and care outcomes. The effectiveness of such conversations has been linked not only to the exchange of clinical information, but also to a range of interpersonal behaviors; commonly referred to as social signals, which are often conveyed through non-verbal cues and shape the quality of the patient-provider relationship. Recent advances in large language models (LLMs) have demonstrated an increasing ability to infer emotional and social behaviors even when analyzing only textual information. As automation increases also in clinical settings, such as for transcription of patient-provider conversations, there is growing potential for LLMs to automatically analyze and extract social behaviors from these interactions. To explore the foundational capabilities of LLMs in tracking social signals in clinical dialogue, we designed task-specific prompts and evaluated model performance across multiple architectures and prompting styles using a highly imbalanced, annotated dataset spanning 20 distinct social signals such as provider dominance, patient warmth, etc. We present the first system capable of tracking all these 20 coded signals, and uncover patterns in LLM behavior. Further analysis of model configurations and clinical context provides insights for enhancing LLM performance on social signal processing tasks in healthcare settings.
[NLP-16] Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety
[Quick Read]: This paper addresses the vulnerability of Large Language Models' (LLMs) content-safety checks in image generation tasks to prompt-based jailbreaks. The key to the solution is Unmasking the Canvas (UTC Benchmark; UTCB), a dynamic and scalable benchmark dataset that combines structured prompt engineering, multilingual obfuscation (e.g., Zulu, Gaelic, Base64), and evaluation with Groq-hosted LLaMA-3, supporting zero-shot and fallback prompting strategies, risk scoring, and automated tagging for systematic assessment of LLM vulnerabilities in image generation.
Link: https://arxiv.org/abs/2505.04146
Authors: Variath Madhupal Gautham Nair,Vishal Varma Dantuluri
Affiliations: Rutgers University-New Brunswick
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:Existing large language models (LLMs) are advancing rapidly and produce outstanding results in image generation tasks, yet their content safety checks remain vulnerable to prompt-based jailbreaks. Through preliminary testing on platforms such as ChatGPT, MetaAI, and Grok, we observed that even short, natural prompts could lead to the generation of compromising images ranging from realistic depictions of forged documents to manipulated images of public figures. We introduce Unmasking the Canvas (UTC Benchmark; UTCB), a dynamic and scalable benchmark dataset to evaluate LLM vulnerability in image generation. Our methodology combines structured prompt engineering, multilingual obfuscation (e.g., Zulu, Gaelic, Base64), and evaluation using Groq-hosted LLaMA-3. The pipeline supports both zero-shot and fallback prompting strategies, risk scoring, and automated tagging. All generations are stored with rich metadata and curated into Bronze (non-verified), Silver (LLM-aided verification), and Gold (manually verified) tiers. UTCB is designed to evolve over time with new data sources, prompt templates, and model behaviors. Warning: This paper includes visual examples of adversarial inputs designed to test model safety. All outputs have been redacted to ensure responsible disclosure.
[NLP-17] Enhancing Granular Sentiment Classification with Chain-of-Thought Prompting in Large Language Models
[Quick Read]: This paper targets the accuracy of granular sentiment categorization in app store reviews, since conventional numeric and polarity ratings often fail to capture the nuanced sentiment in user feedback. The key to the solution is Chain-of-Thought (CoT) prompting, which improves sentiment analysis through explicit reasoning; in experiments, CoT prompting raised classification accuracy from 84% to 93% compared with simple prompting.
Link: https://arxiv.org/abs/2505.04135
Authors: Vihaan Miriyala,Smrithi Bukkapatnam,Lavanya Prahallad
Affiliations: John F Kennedy High School, Fremont, USA; Queensland Academy for Science Mathematics and Technology, Australia; Research Spark Hub Inc
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 5 pages
Abstract:We explore the use of Chain-of-Thought (CoT) prompting with large language models (LLMs) to improve the accuracy of granular sentiment categorization in app store reviews. Traditional numeric and polarity-based ratings often fail to capture the nuanced sentiment embedded in user feedback. We evaluated the effectiveness of CoT prompting versus simple prompting on 2000 Amazon app reviews by comparing each method’s predictions to human judgements. CoT prompting improved classification accuracy from 84% to 93% highlighting the benefit of explicit reasoning in enhancing sentiment analysis performance.
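A minimal sketch contrasting the two prompting styles compared in this paper; the prompt wording and label set below are illustrative assumptions, not the authors' exact templates.

```python
SIMPLE_PROMPT = (
    "Classify the sentiment of this app review as one of: {labels}.\n"
    "Review: {review}\nLabel:"
)

COT_PROMPT = (
    "Classify the sentiment of this app review as one of: {labels}.\n"
    "First reason step by step: list the aspects the user mentions, the "
    "emotion attached to each, and the overall tone. Then give the final "
    "answer on the last line as 'Label: <label>'.\n"
    "Review: {review}"
)

LABELS = "delighted, satisfied, neutral, frustrated, angry"  # illustrative granular set

def build_prompt(review: str, cot: bool = True) -> str:
    template = COT_PROMPT if cot else SIMPLE_PROMPT
    return template.format(labels=LABELS, review=review)

print(build_prompt("Great features, but it crashes every time I export."))
```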
[NLP-18] Bringing legal knowledge to the public by constructing a legal question bank using large-scale pre-trained language model
[Quick Read]: This paper asks how highly technical legal documents can be turned into knowledge that the general public can navigate and understand (legal knowledge accessibility). The key to the solution is a three-step approach: first, translate selected sections of the law into plain-language CLIC-pages; second, build a Legal Question Bank (LQB) of legal questions whose answers can be found in the CLIC-pages; third, design an interactive CLIC Recommender (CRec) that recommends relevant CLIC pages based on a user's description of a legal situation. The study focuses on the technical construction of the LQB, showing how large-scale pre-trained language models such as GPT-3 can generate legal questions, and compares machine-generated questions (MGQs) with human-composed questions (HCQs).
Link: https://arxiv.org/abs/2505.04132
Authors: Mingruo Yuan,Ben Kao,Tien-Hsuan Wu,Michael M. K. Cheung,Henry W. H. Chan,Anne S. Y. Cheung,Felix W. H. Chan,Yongxi Chen
Affiliations: The University of Hong Kong; Australian National University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Access to legal information is fundamental to access to justice. Yet accessibility refers not only to making legal documents available to the public, but also rendering legal information comprehensible to them. A vexing problem in bringing legal information to the public is how to turn formal legal documents such as legislation and judgments, which are often highly technical, to easily navigable and comprehensible knowledge to those without legal education. In this study, we formulate a three-step approach for bringing legal knowledge to laypersons, tackling the issues of navigability and comprehensibility. First, we translate selected sections of the law into snippets (called CLIC-pages), each being a small piece of article that focuses on explaining certain technical legal concept in layperson’s terms. Second, we construct a Legal Question Bank (LQB), which is a collection of legal questions whose answers can be found in the CLIC-pages. Third, we design an interactive CLIC Recommender (CRec). Given a user’s verbal description of a legal situation that requires a legal solution, CRec interprets the user’s input and shortlists questions from the question bank that are most likely relevant to the given legal situation and recommends their corresponding CLIC pages where relevant legal knowledge can be found. In this paper we focus on the technical aspects of creating an LQB. We show how large-scale pre-trained language models, such as GPT-3, can be used to generate legal questions. We compare machine-generated questions (MGQs) against human-composed questions (HCQs) and find that MGQs are more scalable, cost-effective, and more diversified, while HCQs are more precise. We also show a prototype of CRec and illustrate through an example how our 3-step approach effectively brings relevant legal knowledge to the public.
[NLP-19] Natural Language Generation in Healthcare: A Review of Methods and Applications
[Quick Read]: This paper provides a systematic review of natural language generation (NLG) methods and applications in healthcare, surveying the state of the art, key methods, clinical application scenarios, and evaluation practices. The key to the approach is following the PRISMA guidelines to screen 3,988 related articles and systematically analyze 113 high-quality papers across data modalities, model architectures, clinical applications, and evaluation methods, thereby revealing the potential, limitations, and future research directions of NLG in healthcare.
Link: https://arxiv.org/abs/2505.04073
Authors: Mengxian Lyu,Xiaohan Li,Ziyi Chen,Jinqian Pan,Cheng Peng,Sankalp Talankar,Yonghui Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Natural language generation (NLG) is the key technology to achieve generative artificial intelligence (AI). With the breakthroughs in large language models (LLMs), NLG has been widely used in various medical applications, demonstrating the potential to enhance clinical workflows, support clinical decision-making, and improve clinical documentation. Heterogeneous and diverse medical data modalities, such as medical text, images, and knowledge bases, are utilized in NLG. Researchers have proposed many generative models and applied them in a number of healthcare applications. There is a need for a comprehensive review of NLG methods and applications in the medical domain. In this study, we systematically reviewed 113 scientific publications from a total of 3,988 NLG-related articles identified using a literature search, focusing on data modality, model architecture, clinical applications, and evaluation methods. Following PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines, we categorize key methods, identify clinical applications, and assess their capabilities, limitations, and emerging challenges. This timely review covers the key NLG technologies and medical applications and provides valuable insights for future studies to leverage NLG to transform medical discovery and healthcare.
[NLP-20] Advancing and Benchmarking Personalized Tool Invocation for LLMs
[Quick Read]: This paper addresses the neglect of personalized user constraints in tool invocation, defining two key tasks: Tool Preference and Profile-dependent Query. Existing research focuses on the basic tool-invocation ability of Large Language Models (LLMs) without considering user preferences when choosing among functionally similar tools, or the need to infer missing tool parameters from the user profile. The key to the solution is PTool, a data-synthesis framework for personalized tool invocation, together with PTBench, the first benchmark for evaluating personalized tool invocation; fine-tuning various open-source models validates the framework's effectiveness.
Link: https://arxiv.org/abs/2505.04072
Authors: Xu Huang,Yuefeng Huang,Weiwen Liu,Xingshan Zeng,Yasheng Wang,Ruiming Tang,Hong Xie,Defu Lian
Affiliations: University of Science and Technology of China; Shanghai Jiao Tong University; Huawei Noah’s Ark Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 7 figures, 5 tables
Abstract:Tool invocation is a crucial mechanism for extending the capabilities of Large Language Models (LLMs) and has recently garnered significant attention. It enables LLMs to solve complex problems through tool calls while accessing up-to-date world knowledge. However, existing work primarily focuses on the fundamental ability of LLMs to invoke tools for problem-solving, without considering personalized constraints in tool invocation. In this work, we introduce the concept of Personalized Tool Invocation and define two key tasks: Tool Preference and Profile-dependent Query. Tool Preference addresses user preferences when selecting among functionally similar tools, while Profile-dependent Query considers cases where a user query lacks certain tool parameters, requiring the model to infer them from the user profile. To tackle these challenges, we propose PTool, a data synthesis framework designed for personalized tool invocation. Additionally, we construct \textbfPTBench, the first benchmark for evaluating personalized tool invocation. We then fine-tune various open-source models, demonstrating the effectiveness of our framework and providing valuable insights. Our benchmark is public at this https URL.
[NLP-21] LLAMAPIE: Proactive In-Ear Conversation Assistants
[Quick Read]: This paper asks how a wearable device can provide real-time, discreet, and concise guidance that enhances human conversations without interrupting them. The key to the solution is a semi-synthetic dialogue dataset and a two-model pipeline: a small model decides when to respond and a larger model generates conversation-enhancing responses, enabling context-aware, real-time, on-device assistance.
Link: https://arxiv.org/abs/2505.04066
Authors: Tuochao Chen,Nicholas Batchelder,Alisa Liu,Noah Smith,Shyamnath Gollakota
Affiliations: University of Washington
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:
Abstract:We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations. We address several challenges, including determining when to respond, crafting concise responses that enhance conversations, leveraging knowledge of the user for context-aware assistance, and real-time, on-device processing. To achieve this, we construct a semi-synthetic dialogue dataset and propose a two-model pipeline: a small model that decides when to respond and a larger model that generates the response. We evaluate our approach on real-world datasets, demonstrating its effectiveness in providing helpful, unobtrusive assistance. User studies with our assistant, implemented on Apple Silicon M2 hardware, show a strong preference for the proactive assistant over both a baseline with no assistance and a reactive model, highlighting the potential of LlamaPie to enhance live conversations.
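A minimal sketch of the gate-then-generate pattern described above: a small model screens every turn cheaply, and the larger model is invoked only when a short whispered cue would help. Function names and the decision heuristic are assumptions; the real system operates on streamed on-device transcripts.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    text: str

def should_respond(history: list) -> bool:
    """Stand-in for the small gating model, run on every turn."""
    last = history[-1].text.lower()
    return "what was" in last or last.endswith("?")

def generate_whisper(history: list) -> str:
    """Stand-in for the larger model, called only when the gate fires;
    it produces a short cue rather than a full chatbot reply."""
    return "Her name is Dana; you met at the March workshop."

def on_new_turn(history: list):
    if should_respond(history):           # cheap check, every turn
        return generate_whisper(history)  # expensive call, rare
    return None

dialogue = [Turn("user", "Oh no, what was her name again?")]
print(on_new_turn(dialogue))
```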
[NLP-22] SLOT: Structuring the Output of Large Language Models
[Quick Read]: This paper addresses the problem that Large Language Models (LLMs) deviate from predefined schemas when generating structured output, which seriously undermines reliability in key applications such as agents and information extraction. The key to the proposed SLOT (Structured LLM Output Transformer) is a fine-tuned lightweight language model used as a post-processing layer that converts unstructured output into the exact structured format, providing flexibility and compatibility across a variety of LLMs and schema specifications.
Link: https://arxiv.org/abs/2505.04016
Authors: Darren Yow-Bang Wang,Zhengyuan Shen,Soumya Smruti Mishra,Zhichao Xu,Yifei Teng,Haibo Ding
Affiliations: Amazon Web Services
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Structured outputs are essential for large language models (LLMs) in critical applications like agents and information extraction. Despite their capabilities, LLMs often generate outputs that deviate from predefined schemas, significantly hampering reliable application development. We present SLOT (Structured LLM Output Transformer), a model-agnostic approach that transforms unstructured LLM outputs into precise structured formats. While existing solutions predominantly rely on constrained decoding techniques or are tightly coupled with specific models, SLOT employs a fine-tuned lightweight language model as a post-processing layer, achieving flexibility across various LLMs and schema specifications. We introduce a systematic pipeline for data curation and synthesis alongside a formal evaluation methodology that quantifies both schema accuracy and content fidelity. Our results demonstrate that fine-tuned Mistral-7B model with constrained decoding achieves near perfect schema accuracy (99.5%) and content similarity (94.0%), outperforming Claude-3.5-Sonnet by substantial margins (+25 and +20 percentage points, respectively). Notably, even compact models like Llama-3.2-1B can match or exceed the structured output capabilities of much larger proprietary models when equipped with SLOT, enabling reliable structured generation in resource-constrained environments.
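A minimal sketch of the plumbing such a post-processing layer implies: validate the raw LLM output against the schema and, only on failure, hand it to a repair model. The `repair_model` function is a hypothetical stand-in for SLOT's fine-tuned lightweight transformer.

```python
import json
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

def repair_model(raw: str) -> str:
    """Hypothetical stand-in for a fine-tuned post-processor that rewrites
    free-form output into schema-conformant JSON."""
    return '{"name": "Ada Lovelace", "age": 36}'

def structured_output(raw: str) -> dict:
    try:
        obj = json.loads(raw)
        validate(obj, SCHEMA)   # conformant output passes straight through
        return obj
    except (json.JSONDecodeError, ValidationError):
        return json.loads(repair_model(raw))

print(structured_output("Sure! The person is Ada Lovelace, aged 36."))
```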
[NLP-23] Quiet Feature Learning in Algorithmic Tasks
[Quick Read]: This paper examines the discontinuous performance gains that language models show during training, in particular the pronounced phase transitions in the validation-loss curves of Transformer-based models across large ranges of compute. The key finding is that during the stagnant phase the models learn "quiet features", followed by a sudden acquisition of "loud features" whose interaction produces the sharp drop in loss, revealing a nonlinear development of internal representations.
Link: https://arxiv.org/abs/2505.03997
Authors: Prudhviraj Naidu,Zixian Wang,Leon Bergen,Ramamohan Paturi
Affiliations: UC San Diego
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:We train Transformer-based language models on ten foundational algorithmic tasks and observe pronounced phase transitions in their loss curves that deviate from established power-law scaling trends. Over large ranges of compute, the validation loss barely improves, then abruptly decreases. Probing the models’ internal representations reveals the learning of quiet features during the stagnant phase, followed by sudden acquisition of loud features that coincide with the sharp drop in loss. Our ablation experiments show that disrupting a single learned feature can dramatically degrade performance, providing evidence of their causal role in task performance. These findings challenge the prevailing assumption that next-token predictive loss reliably tracks incremental progress; instead, key internal features may be developing below the surface until they coalesce, triggering a rapid performance gain.
[NLP-24] X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
[Quick Read]: This paper asks how reasoning capabilities can be extended effectively beyond text inputs and general domains (generalizing reasoning across modalities and domains). The key to the solution is post-training on general-domain text to obtain generalizable reasoning, yielding X-Reasoner, a vision-language model post-trained solely on general-domain text via a two-stage approach: supervised fine-tuning with distilled long chain-of-thoughts, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers reasoning to multimodal and out-of-domain settings, outperforming existing state-of-the-art models on several general and medical benchmarks.
Link: https://arxiv.org/abs/2505.03981
Authors: Qianchu Liu,Sheng Zhang,Guanghui Qin,Timothy Ossowski,Yu Gu,Ying Jin,Sid Kiblawi,Sam Preston,Mu Wei,Paul Vozila,Tristan Naumann,Hoifung Poon
Affiliations: Microsoft Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet, most existing open-source research concentrates on training text-only reasoning models, with evaluations limited to mainly mathematical and general-domain tasks. Therefore, it remains unclear how to effectively extend reasoning capabilities beyond text input and general domains. This paper explores a fundamental research question: Is reasoning generalizable across modalities and domains? Our findings support an affirmative answer: General-domain text-based post-training can enable such strong generalizable reasoning. Leveraging this finding, we introduce X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, using a two-stage approach: an initial supervised fine-tuning phase with distilled long chain-of-thoughts, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across various general and medical benchmarks (Figure 1). Additionally, we find that X-Reasoner’s performance in specialized domains can be further enhanced through continued training on domain-specific text-only data. Building upon this, we introduce X-Reasoner-Med, a medical-specialized variant that achieves new state of the art on numerous text-only and multimodal medical benchmarks.
[NLP-25] Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale
[Quick Read]: This paper addresses the context-window overflow and degraded pattern recognition that LLM-based optimization methods encounter on large datasets. The key to the solution is the Fine-Grained Optimization (FGO) framework, which divides a large optimization task into manageable subsets, performs targeted optimization on each, and systematically integrates the optimized components through progressive merging, achieving efficient and scalable optimization.
Link: https://arxiv.org/abs/2505.03973
Authors: Jiale Liu,Yifan Zeng,Shaokun Zhang,Chi Zhang,Malte Højmark-Bertelsen,Marie Normann Gadeberg,Huazheng Wang,Qingyun Wu
Affiliations: Pennsylvania State University; Oregon State University; The University of Texas at Austin; Beyond Work
Categories: Computation and Language (cs.CL)
Comments:
Abstract:LLM-based optimization has shown remarkable potential in enhancing agentic systems. However, the conventional approach of prompting LLM optimizer with the whole training trajectories on training dataset in a single pass becomes untenable as datasets grow, leading to context window overflow and degraded pattern recognition. To address these challenges, we propose Fine-Grained Optimization (FGO), a scalable framework that divides large optimization tasks into manageable subsets, performs targeted optimizations, and systematically combines optimized components through progressive merging. Evaluation across ALFWorld, LogisticsQA, and GAIA benchmarks demonstrate that FGO outperforms existing approaches by 1.6-8.6% while reducing average prompt token consumption by 56.3%. Our framework provides a practical solution for scaling up LLM-based optimization of increasingly sophisticated agent systems. Further analysis demonstrates that FGO achieves the most consistent performance gain in all training dataset sizes, showcasing its scalability and efficiency.
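A minimal skeleton of the divide-optimize-merge loop: chunk the training trajectories so each optimization call fits in the context window, optimize on each chunk, then fold the partial results together. The chunking rule, optimizer call, and merge step are all assumptions about the general pattern, not FGO's implementation.

```python
def fine_grained_optimize(trajectories: list, chunk_size: int,
                          optimize_chunk, merge):
    """Divide trajectories into window-sized subsets, optimize each,
    then progressively merge the optimized components."""
    chunks = [trajectories[i:i + chunk_size]
              for i in range(0, len(trajectories), chunk_size)]
    partial = [optimize_chunk(c) for c in chunks]   # targeted optimization
    merged = partial[0]
    for nxt in partial[1:]:                         # progressive merging
        merged = merge(merged, nxt)
    return merged

# Toy stand-ins so the skeleton runs end to end:
opt = lambda chunk: f"rule({len(chunk)} trajectories)"
mrg = lambda a, b: f"combine[{a} + {b}]"
print(fine_grained_optimize([f"t{i}" for i in range(10)], 4, opt, mrg))
```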
[NLP-26] A Reasoning -Focused Legal Retrieval Benchmark
[Quick Read]: This paper addresses the lack of realistic legal benchmarks for retrieval-augmented large language models (RAG systems) in the legal domain, which hampers the development of specialized RAG systems because existing benchmarks fail to capture the complexity of both legal retrieval and downstream legal question-answering. The key to the solution is two new legal RAG benchmarks, Bar Exam QA and Housing Statute QA, which correspond to real-world legal research tasks and were produced through annotation processes that resemble legal research. The paper describes the construction of these benchmarks and the performance of existing retriever pipelines; the results suggest that legal RAG remains a challenging application, motivating future research.
Link: https://arxiv.org/abs/2505.03970
Authors: Lucia Zheng,Neel Guha,Javokhir Arifov,Sarah Zhang,Michal Skreta,Christopher D. Manning,Peter Henderson,Daniel E. Ho
Affiliations: Stanford University; Princeton University
Categories: Computation and Language (cs.CL)
Comments: CSLaw 2025. For data, see this https URL
Abstract:As the legal community increasingly examines the use of large language models (LLMs) for various legal applications, legal AI developers have turned to retrieval-augmented LLMs (“RAG” systems) to improve system performance and robustness. An obstacle to the development of specialized RAG systems is the lack of realistic legal RAG benchmarks which capture the complexity of both legal retrieval and downstream legal question-answering. To address this, we introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA. Our tasks correspond to real-world legal research tasks, and were produced through annotation processes which resemble legal research. We describe the construction of these benchmarks and the performance of existing retriever pipelines. Our results suggest that legal RAG remains a challenging application, thus motivating future research.
[NLP-27] The Power of Stories: Narrative Priming Shapes How LLM Agents Collaborate and Compete
[Quick Read]: This paper investigates how shared narratives can nudge LLM agents toward collaboration, probing the mechanism by which stories shape negotiation behavior between agents. The key to the approach is a finitely repeated public goods game: agents are primed with stories emphasizing teamwork to different degrees, and the effect of narrative on cooperative strategies and negotiation outcomes is observed, informing multi-agent system design and AI alignment.
Link: https://arxiv.org/abs/2505.03961
Authors: Gerrit Großmann,Larisa Ivanova,Sai Leela Poduru,Mohaddeseh Tabrizian,Islam Mesabah,David A. Selby,Sebastian J. Vollmer
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 16 pages, 8 figures. Code available at this https URL
Abstract:According to Yuval Noah Harari, large-scale human cooperation is driven by shared narratives that encode common beliefs and values. This study explores whether such narratives can similarly nudge LLM agents toward collaboration. We use a finitely repeated public goods game in which LLM agents choose either cooperative or egoistic spending strategies. We prime agents with stories highlighting teamwork to different degrees and test how this influences negotiation outcomes. Our experiments explore four questions:(1) How do narratives influence negotiation behavior? (2) What differs when agents share the same story versus different ones? (3) What happens when the agent numbers grow? (4) Are agents resilient against self-serving negotiators? We find that story-based priming significantly affects negotiation strategies and success rates. Common stories improve collaboration, benefiting each agent. By contrast, priming agents with different stories reverses this effect, and those agents primed toward self-interest prevail. We hypothesize that these results carry implications for multi-agent system design and AI alignment.
zh
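为帮助理解论文的实验设定,下面给出一个不依赖 LLM 的有限重复公共物品博弈最小模拟:每个 agent 的合作概率 coop_probs 在此粗略地替代叙事预处理的效果(这一简化属笔者假设):

```python
import random

def public_goods_game(coop_probs, rounds=10, endowment=10, multiplier=1.6):
    """Finitely repeated public goods game.
    coop_probs: per-agent probability of choosing the cooperative strategy,
    here a crude stand-in for the effect of narrative priming."""
    n = len(coop_probs)
    totals = [0.0] * n
    for _ in range(rounds):
        # Each agent either contributes its whole endowment (cooperate) or nothing
        contrib = [endowment if random.random() < p else 0 for p in coop_probs]
        pot = sum(contrib) * multiplier
        share = pot / n
        for i in range(n):
            totals[i] += endowment - contrib[i] + share
    return totals

random.seed(0)
print("all primed to cooperate:", public_goods_game([0.9] * 4))
print("one egoistic agent:     ", public_goods_game([0.9, 0.9, 0.9, 0.1]))
```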
[NLP-28] Hesitation is defeat? Connecting Linguistic and Predictive Uncertainty
【速读】: 该论文试图解决在医学影像分析中,如何将深度学习模型的预测不确定性与人类放射科医生在自由文本报告中表达的语言不确定性进行有效对齐的问题。解决方案的关键在于利用贝叶斯深度学习近似方法(如蒙特卡洛Dropout和深度集成)来量化模型的预测不确定性,并通过BERT模型评估不同二值化方法在不确定性标签上的效果,以探索机器不确定性与人类不确定性之间的关联性。
链接: https://arxiv.org/abs/2505.03910
作者: Gianluca Manzo,Julia Ive
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Automating chest radiograph interpretation using Deep Learning (DL) models has the potential to significantly improve clinical workflows, decision-making, and large-scale health screening. However, in medical settings, merely optimising predictive performance is insufficient, as the quantification of uncertainty is equally crucial. This paper investigates the relationship between predictive uncertainty, derived from Bayesian Deep Learning approximations, and human/linguistic uncertainty, as estimated from free-text radiology reports labelled by rule-based labellers. Utilising BERT as the model of choice, this study evaluates different binarisation methods for uncertainty labels and explores the efficacy of Monte Carlo Dropout and Deep Ensembles in estimating predictive uncertainty. The results demonstrate good model performance, but also a modest correlation between predictive and linguistic uncertainty, highlighting the challenges in aligning machine uncertainty with human interpretation nuances. Our findings suggest that while Bayesian approximations provide valuable uncertainty estimates, further refinement is necessary to fully capture and utilise the subtleties of human uncertainty in clinical applications.
zh
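论文采用蒙特卡洛 Dropout 与深度集成作为贝叶斯近似。下面给出通用的 MC Dropout 预测不确定性估计示意(PyTorch;此处的小型分类器仅为演示,并非论文所用的 BERT 管线):

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, d_in=16, n_classes=2, p=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 32), nn.ReLU(), nn.Dropout(p),
            nn.Linear(32, n_classes),
        )
    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=50):
    """Keep dropout active at inference and average softmax outputs.
    The spread across samples is the predictive-uncertainty signal."""
    model.train()  # enables dropout; in practice freeze batch-norm layers
    probs = torch.stack([torch.softmax(model(x), dim=-1)
                         for _ in range(n_samples)])
    return probs.mean(0), probs.std(0)  # mean prediction, uncertainty

model = TinyClassifier()
x = torch.randn(4, 16)
mean_p, std_p = mc_dropout_predict(model, x)
print(mean_p, std_p, sep="\n")
```

跨样本的标准差即可与报告标注器给出的语言不确定性标签做相关性分析。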
[NLP-29] Sentiment-Aware Recommendation Systems in E-Commerce: A Review from a Natural Language Processing Perspective
【速读】: 该论文试图解决传统推荐系统主要依赖数值评分而忽视用户反馈中潜在情感信息的问题,从而影响推荐的准确性与可解释性。其解决方案的关键在于将情感分析(sentiment analysis)集成到电子商务推荐系统中,通过自然语言处理技术提取详细的主观意见,并将其与用户-物品交互信息相结合,以提升推荐效果。
链接: https://arxiv.org/abs/2505.03828
作者: Yogesh Gajula
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 12 pages, 2 tables, 2 figures
Abstract:E-commerce platforms generate vast volumes of user feedback, such as star ratings, written reviews, and comments. However, most recommendation engines rely primarily on numerical scores, often overlooking the nuanced opinions embedded in free text. This paper comprehensively reviews sentiment-aware recommendation systems from a natural language processing perspective, covering advancements from 2023 to early 2025. It highlights the benefits of integrating sentiment analysis into e-commerce recommenders to enhance prediction accuracy and explainability through detailed opinion extraction. Our survey categorizes recent work into four main approaches: deep learning classifiers that combine sentiment embeddings with user-item interactions, transformer-based methods for nuanced feature extraction, graph neural networks that propagate sentiment signals, and conversational recommenders that adapt in real time to user feedback. We summarize model architectures and demonstrate how sentiment flows through recommendation pipelines, impacting dialogue-based suggestions. Key challenges include handling noisy or sarcastic text, dynamic user preferences, and bias mitigation. Finally, we outline research gaps and provide a roadmap for developing smarter, fairer, and more user-centric recommendation tools.
zh
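综述中最基础的集成思路之一,是把评论文本挖掘出的情感得分与数值评分融合成统一的反馈信号,再交给下游推荐模型。下面是一个极简示意(区间映射方式与混合权重 alpha 均为笔者假设):

```python
import numpy as np

def fuse_ratings(ratings, sentiments, alpha=0.7):
    """Blend numeric star ratings (1-5) with review sentiment in [-1, 1]
    into a single feedback matrix for a downstream recommender.
    alpha is an assumed mixing weight, not taken from any specific paper."""
    sent_as_rating = (sentiments + 1.0) * 2.0 + 1.0   # map [-1, 1] -> [1, 5]
    return alpha * ratings + (1 - alpha) * sent_as_rating

ratings = np.array([[5.0, 2.0], [4.0, 1.0]])       # user x item star ratings
sentiments = np.array([[0.9, -0.6], [0.2, -0.9]])  # mined from review text
print(fuse_ratings(ratings, sentiments))
```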
[NLP-30] Grouped Sequency-arranged Rotation: Optimizing Rotation Transformation for Quantization for Free
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在部署过程中因计算成本过高而面临的挑战,特别是现有基于旋转的后训练量化(Post-Training Quantization, PTQ)方法在极低位宽(如2-bit)下性能受限的问题。其解决方案的关键在于提出一种无需训练的改进旋转矩阵构造方法,通过引入具有序号排列的Walsh-Hadamard变换,将相似频率成分聚类,从而减少量化误差,并进一步采用分组序号排列旋转(Grouped Sequency-arranged Rotation, GSR)结构,利用块对角矩阵有效隔离异常影响,实现与优化方法相当的性能。
链接: https://arxiv.org/abs/2505.03810
作者: Euntae Choi,Sumin Song,Woosang Lim,Sungjoo Yoo
机构: Seoul National University (首尔国立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages
Abstract:Large Language Models (LLMs) face deployment challenges due to high computational costs, and while Post-Training Quantization (PTQ) offers a solution, existing rotation-based methods struggle at very low bit-widths like 2-bit. We introduce a novel, training-free approach to construct an improved rotation matrix, addressing the limitations of current methods. The key contributions include leveraging the Walsh-Hadamard transform with sequency ordering, which clusters similar frequency components to reduce quantization error compared to standard Hadamard matrices, significantly improving performance. Furthermore, we propose a Grouped Sequency-arranged Rotation (GSR) using block-diagonal matrices with smaller Walsh blocks, effectively isolating outlier impacts and achieving performance comparable to optimization-based methods without requiring any training. Our method demonstrates robust performance on reasoning tasks and Perplexity (PPL) score on WikiText-2. Our method also enhances results even when applied over existing learned rotation techniques.
zh
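为说明摘要中的核心构造,下面演示如何生成按序号(sequency,即每行符号变化次数)排列的 Walsh-Hadamard 矩阵,并用较小的 Walsh 块拼出块对角旋转矩阵(分组大小等细节为笔者假设,仅演示矩阵构造本身):

```python
import numpy as np
from scipy.linalg import hadamard, block_diag

def walsh_sequency(n):
    """Walsh-Hadamard matrix with rows sorted by sequency,
    i.e. the number of sign changes along each row."""
    H = hadamard(n)
    sign_changes = np.count_nonzero(np.diff(H, axis=1), axis=1)
    return H[np.argsort(sign_changes)]

def grouped_sequency_rotation(dim, block=16):
    """Block-diagonal orthogonal matrix built from small Walsh blocks,
    a sketch of the GSR construction (the group size is an assumption)."""
    W = walsh_sequency(block) / np.sqrt(block)   # orthonormal Walsh block
    assert dim % block == 0
    return block_diag(*[W] * (dim // block))

R = grouped_sequency_rotation(64, block=16)
print(np.allclose(R @ R.T, np.eye(64)))  # True: a valid (orthogonal) rotation
```

由于整套构造只是确定性的矩阵排列与拼接,它无需任何训练即可直接套用在现有旋转式 PTQ 管线上。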
[NLP-31] Scalability Matters: Overcoming Challenges in InstructGLM with Similarity-Degree-Based Sampling IJCNN
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在图相关问题中的应用受限问题,主要原因是可扩展性约束以及缺乏专门处理图结构的机制。现有方法多将LLMs与图神经网络(Graph Neural Networks, GNNs)结合,但直接在LLMs中编码图结构的研究较少,尤其在大规模图中,由于令牌限制导致表示效果不佳。该论文提出的解决方案关键在于SDM-InstructGLM框架,其核心是引入基于相似度的偏差随机游走机制,根据节点特征相似性和度中心性选择性地采样和编码图信息,从而在LLMs中实现自适应且结构化的表示,提升令牌效率并增强图任务性能。
链接: https://arxiv.org/abs/2505.03799
作者: Hyun Lee,Chris Yi,Maminur Islam,B.D.S. Aritra
机构: Trinity College (三一学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: To be published in International Joint Conference on Neural Networks (IJCNN), 2025
Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in various natural language processing tasks; however, their application to graph-related problems remains limited, primarily due to scalability constraints and the absence of dedicated mechanisms for processing graph structures. Existing approaches predominantly integrate LLMs with Graph Neural Networks (GNNs), using GNNs as feature encoders or auxiliary components. However, directly encoding graph structures within LLMs has been underexplored, particularly in the context of large-scale graphs where token limitations hinder effective representation. To address these challenges, we propose SDM-InstructGLM, a novel instruction-tuned Graph Language Model (InstructGLM) framework that enhances scalability and efficiency without relying on GNNs. Our method introduces a similarity-degree-based biased random walk mechanism, which selectively samples and encodes graph information based on node-feature similarity and degree centrality, ensuring an adaptive and structured representation within the LLM. This approach significantly improves token efficiency, mitigates information loss due to random sampling, and enhances performance on graph-based tasks such as node classification and link prediction. Furthermore, our results demonstrate the feasibility of LLM-only graph processing, enabling scalable and interpretable Graph Language Models (GLMs) optimized through instruction-based fine-tuning. This work paves the way for GNN-free approaches to graph learning, leveraging LLMs as standalone graph reasoning models. Our source code is available on GitHub.
zh
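下面给出“基于相似度与度中心性的有偏随机游走”的一个最小实现示意(两种信号的混合系数 lam 与具体组合规则为笔者假设,论文的实际采样策略以原文为准):

```python
import numpy as np

def biased_random_walk(adj, features, start, length=8, lam=0.5, seed=0):
    """Sample a walk where the next node is drawn with probability
    proportional to lam * cosine-similarity + (1 - lam) * normalized degree."""
    rng = np.random.default_rng(seed)
    deg = adj.sum(1)
    walk = [start]
    for _ in range(length - 1):
        nbrs = np.flatnonzero(adj[walk[-1]])
        if nbrs.size == 0:
            break
        f = features[walk[-1]]
        sim = features[nbrs] @ f / (
            np.linalg.norm(features[nbrs], axis=1) * np.linalg.norm(f) + 1e-9)
        score = lam * (sim + 1) / 2 + (1 - lam) * deg[nbrs] / deg.max()
        walk.append(int(rng.choice(nbrs, p=score / score.sum())))
    return walk

adj = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]])
feats = np.random.default_rng(1).normal(size=(4, 5))
print(biased_random_walk(adj, feats, start=0))
```

采样得到的游走节点序列随后被文本化并编码进 LLM 的提示中,以控制 token 预算。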
[NLP-32] Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding
【速读】: 该论文试图解决多模态大语言模型(multi-modal large language models)中不确定性量化(uncertainty quantification, UQ)校准不足的问题。现有方法依赖于模型在不同设置下对同一输入生成的多个响应之间的一致性,但在模型持续错误的情况下仍会报告较高置信度,导致置信度与实际准确性之间校准不佳。解决方案的关键在于引入跨模态一致性(cross-modal consistency)与自一致性(self-consistency)相结合的方法,通过将文本响应与视觉输入进行对齐来增强模型的置信度校准,并利用温度缩放(temperature scaling)对对齐模型的置信度进行进一步校准。
链接: https://arxiv.org/abs/2505.03788
作者: Trilok Padhi,Ramneet Kaur,Adam D. Cobb,Manoj Acharya,Anirban Roy,Colin Samplawski,Brian Matejek,Alexander M. Berenbeim,Nathaniel D. Bastian,Susmit Jha
机构: Georgia State University (佐治亚州立大学); SRI (SRI); United States Military Academy (美国军事学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce a novel approach for calibrating uncertainty quantification (UQ) tailored for multi-modal large language models (LLMs). Existing state-of-the-art UQ methods rely on consistency among multiple responses generated by the LLM on an input query under diverse settings. However, these approaches often report higher confidence in scenarios where the LLM is consistently incorrect. This leads to a poorly calibrated confidence with respect to accuracy. To address this, we leverage cross-modal consistency in addition to self-consistency to improve the calibration of the multi-modal models. Specifically, we ground the textual responses to the visual inputs. The confidence from the grounding model is used to calibrate the overall confidence. Given that using a grounding model adds its own uncertainty in the pipeline, we apply temperature scaling - a widely accepted parametric calibration technique - to calibrate the grounding model’s confidence in the accuracy of generated responses. We evaluate the proposed approach across multiple multi-modal tasks, such as medical question answering (Slake) and visual question answering (VQAv2), considering multi-modal models such as LLaVA-Med and LLaVA. The experiments demonstrate that the proposed framework achieves significantly improved calibration on both tasks.
zh
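论文在 grounding 模型的置信度上应用了温度缩放。温度缩放本身是标准的参数化校准技术:在保留集上学习单个标量 T,使 softmax(logits / T) 的负对数似然最小。下面是独立于论文管线的最小实现:

```python
import torch

def fit_temperature(logits, labels, lr=0.01, steps=200):
    """Standard temperature scaling: learn a single scalar T > 0
    minimizing NLL of softmax(logits / T) on a held-out set."""
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t) stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    nll = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = nll(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

logits = torch.randn(256, 3) * 4.0          # deliberately over-confident logits
labels = torch.randint(0, 3, (256,))
T = fit_temperature(logits, labels)
calibrated = torch.softmax(logits / T, dim=-1)
print(f"fitted temperature T = {T:.2f}")
```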
[NLP-33] When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs as Discriminator
【速读】: 该论文试图解决在规划框架中利用生成式 AI (Generative AI) 进行候选评估时,推理模型与传统非推理模型性能对比不明确的问题。其解决方案的关键在于引入一种从思维链(Chain-of-Thought, CoT)输出中提取软分数的新方法,从而实现对候选结果的细粒度排序,并在此基础上评估推理模型作为判别器的有效性。研究通过对比一个参数量为1.5B的蒸馏推理模型DeepSeek-R1与多个最先进的非推理模型,验证了推理模型在判别任务中的优越性,同时揭示了推理模型在生成能力上的局限性。
链接: https://arxiv.org/abs/2505.03786
作者: Md Fahim Anjum
机构: University of California San Francisco (加州大学旧金山分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 12 pages, 5 figures. Code available at: this https URL
Abstract:Large Language Models (LLM) with reasoning capabilities offer a promising path for improving candidate evaluation in planning frameworks, but their relative performance against traditional non-reasoning models remains largely underexplored. In this study, we benchmark a distilled 1.5B parameter reasoning model (DeepSeek-R1) against several state-of-the-art non-reasoning LLMs within a generator-discriminator LLM planning framework for the text-to-SQL task. For this, we introduce a novel method for extracting soft scores from the chain-of-thought (CoT) outputs from reasoning that enables fine-grained ranking of candidates. Our central hypothesis is that reasoning models are more effective discriminators than non-reasoning LLMs. Our results show that distilled DeepSeek-R1-1.5B achieves up to 87% higher F1 and 3.7% better discrimination accuracy than CodeLlama-7B, as well as 3.7% higher execution accuracy than CodeLlama-13B, despite having significantly fewer parameters. Furthermore, we find that there is a limit to the logical capabilities of reasoning models, and only providing more context or allowing more compute budget for reasoning is not enough to improve their discrimination performance. Finally, we demonstrate that, unlike non-reasoning LLMs, reasoning models find generation more challenging than discrimination and may underperform as generators compared to smaller non-reasoning LLMs. Our work highlights the potential of reasoning models as discriminators in agentic frameworks, far outweighing their capabilities as generators, offering insights into their optimal role within LLM planning infrastructures.
zh
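论文的关键技巧之一是从思维链输出中提取软分数,以实现候选的细粒度排序。一个常见的简化思路(此处的 logit 取法与 sigmoid 形式均为笔者假设,并非论文原始方法)是把判别 token 的对数几率差转为 (0, 1) 区间的分数:

```python
import math

def soft_score(yes_logit: float, no_logit: float) -> float:
    """Turn the final 'Yes'/'No' verdict logits of a CoT discriminator
    into a soft score in (0, 1) for fine-grained candidate ranking."""
    return 1.0 / (1.0 + math.exp(-(yes_logit - no_logit)))  # sigmoid of margin

# Hypothetical (candidate_sql, yes_logit, no_logit) triples from the discriminator
candidates = [("SELECT a FROM t", 2.1, 0.3),
              ("SELECT * FROM t", 0.4, 1.9),
              ("SELECT a, b FROM t", 1.2, 0.9)]
ranked = sorted(candidates, key=lambda c: soft_score(c[1], c[2]), reverse=True)
for sql, y, n in ranked:
    print(f"{soft_score(y, n):.3f}  {sql}")
```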
[NLP-34] IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
【速读】: 该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)加速科学发现的问题,特别是针对研究初期的新型假设生成阶段。现有方法主要集中在多智能体框架和扩展测试时计算上,但未能有效结合透明性和可控性。论文提出的解决方案的关键在于引入IRIS:一种交互式研究构想系统(Interactive Research Ideation System),通过人机协同(Human-in-the-loop, HITL)的方式增强科学构想过程,其核心创新包括基于蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)的自适应测试时计算扩展、细粒度反馈机制以及基于查询的文献综合功能。
链接: https://arxiv.org/abs/2504.16728
作者: Aniketh Garikaparthi,Manasi Patwardhan,Lovekesh Vig,Arman Cohan
机构: TCS Research (TCS 研究院); Yale University (耶鲁大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 6 pages main-text, 2 pages appendix
Abstract:The rapid advancement in capabilities of large language models (LLMs) raises a pivotal question: How can LLMs accelerate scientific discovery? This work tackles the crucial first stage of research, generating novel hypotheses. While recent work on automated hypothesis generation focuses on multi-agent frameworks and extending test-time compute, none of the approaches effectively incorporate transparency and steerability through a synergistic Human-in-the-loop (HITL) approach. To address this gap, we introduce IRIS: Interactive Research Ideation System, an open-source platform designed for researchers to leverage LLM-assisted scientific ideation. IRIS incorporates innovative features to enhance ideation, including adaptive test-time compute expansion via Monte Carlo Tree Search (MCTS), a fine-grained feedback mechanism, and query-based literature synthesis, designed to empower researchers with greater control and insight throughout the ideation process. We additionally conduct a user study with researchers across diverse disciplines, validating the effectiveness of our system in enhancing ideation. We open-source our code at this https URL
zh
[NLP-35] Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs
【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)评估中测试数据量不足或选择不合理的挑战,特别是在缺乏系统性分析和指导的情况下,难以确定测试数据的充分性或选择具有信息量的样本。解决方案的关键在于提出一种可验证且成本高效的评估框架,通过引入“测试样本复杂度”来量化实现可验证评估所需的测试点数量,并推导出紧致的边界。基于此理论,作者开发了一种基于划分的算法Cer-Eval,该算法自适应地选择测试点以最小化评估成本,实验证明其可在保持与当前评估过程相当的估计误差水平的同时,减少20%至40%的测试点,并提供95%的置信保证。
链接: https://arxiv.org/abs/2505.03814
作者: Ganghua Wang,Zhaorun Chen,Bo Li,Haifeng Xu
机构: University of Chicago (芝加哥大学); UIUC (伊利诺伊大学厄巴纳-香槟分校)
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:As foundation models continue to scale, the size of trained models grows exponentially, presenting significant challenges for their evaluation. Current evaluation practices involve curating increasingly large datasets to assess the performance of large language models (LLMs). However, there is a lack of systematic analysis and guidance on determining the sufficiency of test data or selecting informative samples for evaluation. This paper introduces a certifiable and cost-efficient evaluation framework for LLMs. Our framework adapts to different evaluation objectives and outputs confidence intervals that contain true values with high probability. We use “test sample complexity” to quantify the number of test points needed for a certifiable evaluation and derive tight bounds on test sample complexity. Based on the developed theory, we develop a partition-based algorithm, named Cer-Eval, that adaptively selects test points to minimize the cost of LLM evaluation. Real-world experiments demonstrate that Cer-Eval can save 20% to 40% test points across various benchmarks, while maintaining an estimation error level comparable to the current evaluation process and providing a 95% confidence guarantee.
zh
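可验证评估的出发点是带置信保证的区间估计。下面用 Hoeffding 界给出一个顺序抽样的朴素基线示意:持续抽取测试点,直到区间半宽达到目标精度 eps(这只是最简单的基线,并非论文的划分式自适应算法 Cer-Eval):

```python
import math, random

def certifiable_accuracy(eval_fn, test_pool, eps=0.05, delta=0.05):
    """Sample test points until a Hoeffding interval of half-width eps
    holds with probability >= 1 - delta. A naive baseline, not Cer-Eval."""
    random.shuffle(test_pool)
    correct = 0
    for n, x in enumerate(test_pool, start=1):
        correct += int(eval_fn(x))
        half_width = math.sqrt(math.log(2 / delta) / (2 * n))
        if half_width <= eps:
            acc = correct / n
            return acc, (acc - eps, acc + eps), n
    acc = correct / len(test_pool)
    return acc, (acc - eps, acc + eps), len(test_pool)

random.seed(0)
pool = list(range(5000))
# eval_fn stub: a model that answers correctly with probability 0.8
acc, ci, used = certifiable_accuracy(lambda x: random.random() < 0.8, pool)
print(f"acc={acc:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f}), used {used} points")
```

Cer-Eval 的改进之处在于按划分自适应地选点,使达到同样保证所需的测试点更少。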
计算机视觉
[CV-0] PrimitiveAnything: Human-Crafted 3D Primitive Assembly Generation with Auto-Regressive Transformer SIGGRAPH2025
【速读】:该论文旨在解决复杂3D形状分解为简单几何元素(shape primitive abstraction)的问题,现有方法在语义理解或泛化能力方面存在局限。其解决方案的关键在于提出PrimitiveAnything框架,将形状分解任务重新定义为一种基于形状条件的生成任务,并采用自回归生成的primitive transformer和无歧义的参数化方案,从而直接学习人类对复杂形状进行分解的过程,实现更符合人类感知且保持几何精度的高质量分解结果。
链接: https://arxiv.org/abs/2505.04622
作者: Jingwen Ye,Yuze He,Yanning Zhou,Yiqin Zhu,Kaiwen Xiao,Yong-Jin Liu,Wei Yang,Xiao Han
机构: Tencent AIPD(腾讯AI产品部); Tsinghua University(清华大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH 2025. 14 pages, 15 figures
Abstract:Shape primitive abstraction, which decomposes complex 3D shapes into simple geometric elements, plays a crucial role in human visual cognition and has broad applications in computer vision and graphics. While recent advances in 3D content generation have shown remarkable progress, existing primitive abstraction methods either rely on geometric optimization with limited semantic understanding or learn from small-scale, category-specific datasets, struggling to generalize across diverse shape categories. We present PrimitiveAnything, a novel framework that reformulates shape primitive abstraction as a primitive assembly generation task. PrimitiveAnything includes a shape-conditioned primitive transformer for auto-regressive generation and an ambiguity-free parameterization scheme to represent multiple types of primitives in a unified manner. The proposed framework directly learns the process of primitive assembly from large-scale human-crafted abstractions, enabling it to capture how humans decompose complex shapes into primitive elements. Through extensive experiments, we demonstrate that PrimitiveAnything can generate high-quality primitive assemblies that better align with human perception while maintaining geometric fidelity across diverse shape categories. It benefits various 3D applications and shows potential for enabling primitive-based user-generated content (UGC) in games. Project page: this https URL
zh
[CV-1] On Path to Multimodal Generalist: General-Level and General-Bench ICML’25
【速读】:该论文试图解决当前多模态大语言模型(Multimodal Large Language Model, MLLM)评估中存在的一般性能力衡量标准不足的问题,即是否可以通过任务上的高性能直接推断模型具备更强的MLLM能力并更接近人类水平的人工智能。论文提出的解决方案关键在于引入了General-Level评估框架,该框架定义了五级性能与泛化能力指标,并通过Synergy概念衡量模型在理解和生成任务以及跨模态之间的一致性能力,同时构建了涵盖广泛技能、模态、格式和能力的General-Bench基准,包含超过700个任务和325,800个实例,以系统性地评估和比较现有MLLM的能力进展。
链接: https://arxiv.org/abs/2505.04620
作者: Hao Fei,Yuan Zhou,Juncheng Li,Xiangtai Li,Qingshan Xu,Bobo Li,Shengqiong Wu,Yaoting Wang,Junbao Zhou,Jiahao Meng,Qingyu Shi,Zhiyuan Zhou,Liangtao Shi,Minghe Gao,Daoan Zhang,Zhiqi Ge,Weiming Wu,Siliang Tang,Kaihang Pan,Yaobo Ye,Haobo Yuan,Tao Zhang,Tianjie Ju,Zixiang Meng,Shilin Xu,Liyu Jia,Wentao Hu,Meng Luo,Jiebo Luo,Tat-Seng Chua,Shuicheng Yan,Hanwang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML’25, 305 pages, 115 tables, 177 figures, project page: this https URL
Abstract:The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting limited modalities to arbitrary ones. While many benchmarks exist to assess MLLMs, a critical question arises: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: this https URL
zh
[CV-2] Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation
【速读】:该论文旨在解决多视角视觉伺服在计算上具有挑战性的问题,特别是在扩大视场角的同时保持系统鲁棒性与样本效率。其解决方案的关键在于提出一种名为Merge And Disentanglement (MAD)的算法,该算法通过高效融合多视角信息以提高样本效率,并结合单视角特征实现轻量级部署和确保鲁棒策略。
链接: https://arxiv.org/abs/2505.04619
作者: Abdulaziz Almuzairee,Rohan Patil,Dwait Bhatt,Henrik I. Christensen
机构: University of California San Diego (加州大学圣地亚哥分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: For project website and code, see this https URL
Abstract:Vision is well-known for its use in manipulation, especially using visual servoing. To make it robust, multiple cameras are needed to expand the field of view. That is computationally challenging. Merging multiple views and using Q-learning allows the design of more effective representations and optimization of sample efficiency. Such a solution might be expensive to deploy. To mitigate this, we introduce a Merge And Disentanglement (MAD) algorithm that efficiently merges views to increase sample efficiency while augmenting with single-view features to allow lightweight deployment and ensure robust policies. We demonstrate the efficiency and robustness of our approach using Meta-World and ManiSkill3. For project website and code, see this https URL
zh
[CV-3] Person Recognition at Altitude and Range: Fusion of Face, Body Shape and Gait
【速读】:该论文旨在解决在非受限环境中进行全身人体识别的问题,这一问题常见于如IARPA BRIAR项目等监控场景中,其中生物特征数据在远距离、高视角和恶劣大气条件下被捕获。解决方案的关键在于提出FarSight系统,这是一个统一的端到端系统,通过整合面部、步态和身体形态等多种生物特征线索来实现人体识别。FarSight包含四个核心模块:多目标检测与跟踪、识别感知的视频修复、模态特异性生物特征编码以及质量引导的多模态融合,这些模块在图像退化、大姿态和尺度变化以及跨域差距下协同工作,从而有效提升了识别性能。
链接: https://arxiv.org/abs/2505.04616
作者: Feng Liu,Nicholas Chimitt,Lanqing Guo,Jitesh Jain,Aditya Kane,Minchul Kim,Wes Robbins,Yiyang Su,Dingqiang Ye,Xingguang Zhang,Jie Zhu,Siddharth Satyakam,Christopher Perry,Stanley H. Chan,Arun Ross,Humphrey Shi,Zhangyang Wang,Anil Jain,Xiaoming Liu
机构: Michigan State University (密歇根州立大学); Purdue University (普渡大学); Georgia Tech (佐治亚理工学院); University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 12 figures
Abstract:We address the problem of whole-body person recognition in unconstrained environments. This problem arises in surveillance scenarios such as those in the IARPA Biometric Recognition and Identification at Altitude and Range (BRIAR) program, where biometric data is captured at long standoff distances, elevated viewing angles, and under adverse atmospheric conditions (e.g., turbulence and high wind velocity). To this end, we propose FarSight, a unified end-to-end system for person recognition that integrates complementary biometric cues across face, gait, and body shape modalities. FarSight incorporates novel algorithms across four core modules: multi-subject detection and tracking, recognition-aware video restoration, modality-specific biometric feature encoding, and quality-guided multi-modal fusion. These components are designed to work cohesively under degraded image conditions, large pose and scale variations, and cross-domain gaps. Extensive experiments on the BRIAR dataset, one of the most comprehensive benchmarks for long-range, multi-modal biometric recognition, demonstrate the effectiveness of FarSight. Compared to our preliminary system, this system achieves a 34.1% absolute gain in 1:1 verification accuracy (TAR@0.1% FAR), a 17.8% increase in closed-set identification (Rank-20), and a 34.3% reduction in open-set identification errors (FNIR@1% FPIR). Furthermore, FarSight was evaluated in the 2025 NIST RTE Face in Video Evaluation (FIVE), which conducts standardized face recognition testing on the BRIAR dataset. These results establish FarSight as a state-of-the-art solution for operational biometric recognition in challenging real-world conditions.
zh
[CV-4] FastMap: Revisiting Dense and Scalable Structure from Motion FAST
【速读】:该论文试图解决传统结构从运动(Structure from Motion, SfM)方法在处理大规模场景时的可扩展性问题,尤其是当匹配关键点对数量增加时,计算效率显著下降的问题。其解决方案的关键在于设计一种完全基于GPU友好操作的SfM框架,从而实现良好的并行化,并且每个优化步骤的时间复杂度与图像对数量呈线性关系,而非依赖于关键点对或三维点的数量。
链接: https://arxiv.org/abs/2505.04612
作者: Jiahao Li,Haochen Wang,Muhammad Zubair Irshad,Igor Vasiljevic,Matthew R. Walter,Vitor Campagnolo Guizilini,Greg Shakhnarovich
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project webpage: this https URL
Abstract:We propose FastMap, a new global structure from motion method focused on speed and simplicity. Previous methods like COLMAP and GLOMAP are able to estimate high-precision camera poses, but suffer from poor scalability when the number of matched keypoint pairs becomes large. We identify two key factors leading to this problem: poor parallelization and computationally expensive optimization steps. To overcome these issues, we design an SfM framework that relies entirely on GPU-friendly operations, making it easily parallelizable. Moreover, each optimization step runs in time linear to the number of image pairs, independent of keypoint pairs or 3D points. Through extensive experiments, we show that FastMap is one to two orders of magnitude faster than COLMAP and GLOMAP on large-scale scenes with comparable pose accuracy.
zh
[CV-5] OpenVision: A Fully-Open Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
【速读】:该论文试图解决当前多模态基础模型中视觉编码器缺乏完全开源的问题,现有替代方案如SigLIP虽已开始挑战OpenAI的CLIP(Contrastive Language-Image Pretraining)地位,但其训练数据或训练方法并未公开。解决方案的关键在于提出OpenVision,这是一个完全开源、成本效益高的视觉编码器系列,其性能在集成到多模态框架(如LLaVA)时可与CLIP相媲美甚至超越。OpenVision基于现有工作(如CLIPS训练框架和Recap-DataComp-1B训练数据)进行优化,揭示了提升编码器质量的关键洞察,并展示了在推进多模态模型方面的实际优势。
链接: https://arxiv.org/abs/2505.04601
作者: Xianhang Li,Yanqing Liu,Haoqin Tu,Hongru Zhu,Cihang Xie
机构: University of California, Santa Cruz (加利福尼亚大学圣克鲁兹分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:OpenAI’s CLIP, released in early 2021, has long been the go-to choice of vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills this gap with OpenVision, a fully-open, cost-effective family of vision encoders that match or surpass the performance of OpenAI’s CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing works – e.g., CLIPS for training framework and Recap-DataComp-1B for training data – while revealing multiple key insights in enhancing encoder quality and showcasing practical benefits in advancing multimodal models. By releasing vision encoders spanning from 5.9M to 632.1M parameters, OpenVision offers practitioners a flexible trade-off between capacity and efficiency in building multimodal models: larger models deliver enhanced multimodal performance, while smaller versions enable lightweight, edge-ready multimodal deployments.
zh
[CV-6] MonoCoP: Chain-of-Prediction for Monocular 3D Object Detection
【速读】:该论文旨在解决单目3D目标检测(Mono3D)中深度估计的准确性问题,该问题由于从2D图像到3D空间映射的固有模糊性而尤为困难。现有方法虽然通过引入多深度线索(如深度不确定性估计、深度误差建模)来提升深度精度,但忽略了准确的深度预测需要依赖其他3D属性的条件信息,这些属性通过3D到2D投影具有内在相关性,从而限制了整体精度和稳定性。论文提出的解决方案关键在于引入Chain-of-Prediction (CoP)机制,通过三个核心设计实现属性的顺序且条件化的预测:首先使用轻量级AttributeNet (AN)学习每个3D属性的特征,其次构建显式链路传播特征,最后利用残差连接在链中聚合特征,确保后续属性预测基于所有先前处理的属性而不遗忘早期特征。
链接: https://arxiv.org/abs/2505.04594
作者: Zhihao Zhang,Abhinav Kumar,Girish Chandar Ganesan,Xiaoming Liu
机构: Michigan State University (密歇根州立大学); Samsung Research America (三星美国研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurately predicting 3D attributes is crucial for monocular 3D object detection (Mono3D), with depth estimation posing the greatest challenge due to the inherent ambiguity in mapping 2D images to 3D space. While existing methods leverage multiple depth cues (e.g., estimating depth uncertainty, modeling depth error) to improve depth accuracy, they overlook that accurate depth prediction requires conditioning on other 3D attributes, as these attributes are intrinsically inter-correlated through the 3D to 2D projection, which ultimately limits overall accuracy and stability. Inspired by Chain-of-Thought (CoT) in large language models (LLMs), this paper proposes MonoCoP, which leverages a Chain-of-Prediction (CoP) to predict attributes sequentially and conditionally via three key designs. First, it employs a lightweight AttributeNet (AN) for each 3D attribute to learn attribute-specific features. Next, MonoCoP constructs an explicit chain to propagate these learned features from one attribute to the next. Finally, MonoCoP uses a residual connection to aggregate features for each attribute along the chain, ensuring that later attribute predictions are conditioned on all previously processed attributes without forgetting the features of earlier ones. Experimental results show that our MonoCoP achieves state-of-the-art (SoTA) performance on the KITTI leaderboard without requiring additional data and further surpasses existing methods on the Waymo and nuScenes frontal datasets.
zh
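下面用 PyTorch 勾勒“链式预测”的核心结构:每个 3D 属性对应一个轻量 AttributeNet,特征沿链条传递并以残差方式聚合,使后面的属性预测以前面的属性为条件(隐藏维度与属性顺序均为笔者假设,仅示意结构):

```python
import torch
import torch.nn as nn

class ChainOfPrediction(nn.Module):
    """Sketch of a MonoCoP-style chain: each attribute gets a small
    AttributeNet; its features are propagated to the next attribute
    and aggregated with a residual connection. Dimensions are assumed."""
    def __init__(self, d=64, attributes=("size", "rotation", "depth")):
        super().__init__()
        self.attributes = attributes
        self.attr_nets = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d), nn.ReLU()) for _ in attributes)
        self.heads = nn.ModuleList(nn.Linear(d, 1) for _ in attributes)

    def forward(self, feat):
        preds, chain = {}, feat
        for name, an, head in zip(self.attributes, self.attr_nets, self.heads):
            chain = chain + an(chain)   # residual aggregation along the chain
            preds[name] = head(chain)   # later attrs conditioned on earlier ones
        return preds

model = ChainOfPrediction()
out = model(torch.randn(2, 64))
print({k: v.shape for k, v in out.items()})
```

残差连接保证了链条末端(如深度)的预测同时“看到”所有先前属性的特征而不遗忘早期信息。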
[CV-7] TetWeave: Isosurface Extraction using On-The-Fly Delaunay Tetrahedral Grids for Gradient-Based Mesh Optimization SIGGRAPH2025
【速读】:该论文试图解决的是基于梯度的网格优化中,传统预定义四面体网格灵活性不足以及重建误差与网格公平性之间难以平衡的问题。解决方案的关键在于提出TetWeave,这是一种新的等值面表示方法,它联合优化用于Marching Tetrahedra的四面体网格布局和每个点的新型方向符号距离函数。TetWeave通过Delaunay三角剖分实时构建四面体网格,从而提高了灵活性,并支持在重建误差较高的区域进行自适应采样,同时保持网格的封闭性、双流形性和无交集特性,最终实现高质量、低内存占用的自适应网格。
链接: https://arxiv.org/abs/2505.04590
作者: Alexandre Binninger,Ruben Wiersma,Philipp Herholz,Olga Sorkine-Hornung
机构: ETH Zurich(ETH Zurich); Independent Contributor(Independent Contributor)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: ACM Trans. Graph. 44, 4. SIGGRAPH 2025. 19 pages, 21 figures
Abstract:We introduce TetWeave, a novel isosurface representation for gradient-based mesh optimization that jointly optimizes the placement of a tetrahedral grid used for Marching Tetrahedra and a novel directional signed distance at each point. TetWeave constructs tetrahedral grids on-the-fly via Delaunay triangulation, enabling increased flexibility compared to predefined grids. The extracted meshes are guaranteed to be watertight, two-manifold and intersection-free. The flexibility of TetWeave enables a resampling strategy that places new points where reconstruction error is high and allows to encourage mesh fairness without compromising on reconstruction error. This leads to high-quality, adaptive meshes that require minimal memory usage and few parameters to optimize. Consequently, TetWeave exhibits near-linear memory scaling relative to the vertex count of the output mesh - a substantial improvement over predefined grids. We demonstrate the applicability of TetWeave to a broad range of challenging tasks in computer graphics and vision, such as multi-view 3D reconstruction, mesh compression and geometric texture generation.
zh
[CV-8] Active Sampling for MRI-based Sequential Decision Making
【速读】:该论文试图解决磁共振成像(Magnetic Resonance Imaging, MRI)作为床旁检测(Point-of-Care, PoC)设备应用受限的问题,主要由于其高成本和复杂性。解决方案的关键在于通过降低磁场强度并改进采样策略来实现这一目标。研究提出了一种多目标强化学习框架,能够在欠采样的k空间数据上进行综合且连续的诊断评估,通过引入分步加权奖励函数的训练方法,识别对每个诊断目标贡献最大的样本,从而在减少采样数量的同时保持诊断性能。
链接: https://arxiv.org/abs/2505.04586
作者: Yuning Du,Jingshuai Liu,Rohan Dharmakumar,Sotirios A. Tsaftaris
机构: University of Edinburgh (爱丁堡大学); Cedars Sinai Medical Center (西达赛奈医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Under Review
Abstract:Despite the superior diagnostic capability of Magnetic Resonance Imaging (MRI), its use as a Point-of-Care (PoC) device remains limited by high cost and complexity. To enable such a future by reducing the magnetic field strength, one key approach will be to improve sampling strategies. Previous work has shown that it is possible to make diagnostic decisions directly from k-space with fewer samples. Such work shows that single diagnostic decisions can be made, but if we aspire to see MRI as a true PoC, multiple and sequential decisions are necessary while minimizing the number of samples acquired. We present a novel multi-objective reinforcement learning framework enabling comprehensive, sequential, diagnostic evaluation from undersampled k-space data. Our approach during inference actively adapts to sequential decisions to optimally sample. To achieve this, we introduce a training methodology that identifies the samples that contribute the best to each diagnostic objective using a step-wise weighting reward function. We evaluate our approach in two sequential knee pathology assessment tasks: ACL sprain detection and cartilage thickness loss assessment. Our framework achieves diagnostic performance competitive with various policy-based benchmarks on disease detection, severity quantification, and overall sequential diagnosis, while substantially saving k-space samples. Our approach paves the way for the future of MRI as a comprehensive and affordable PoC device. Our code is publicly available at this https URL
zh
[CV-9] Componential Prompt-Knowledge Alignment for Domain Incremental Learning ICML2025
【速读】:该论文试图解决领域增量学习(Domain Incremental Learning, DIL)中由于领域特定提示(domain-specific prompts)组件级错位导致的知识冲突与预测性能下降问题。解决方案的关键在于提出一种基于组件感知的提示-知识对齐方法(Componential Prompt-Knowledge Alignment, KA-Prompt),通过在训练过程中引入组件级对齐机制,实现新旧提示之间的内在一致性,从而提升模型的学习与推理能力。
链接: https://arxiv.org/abs/2505.04575
作者: Kunlun Xu,Xu Zou,Gang Hua,Jiahuan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accpted by ICML2025
Abstract:Domain Incremental Learning (DIL) aims to learn from non-stationary data streams across domains while retaining and utilizing past knowledge. Although prompt-based methods effectively store multi-domain knowledge in prompt parameters and obtain advanced performance through cross-domain prompt fusion, we reveal an intrinsic limitation: component-wise misalignment between domain-specific prompts leads to conflicting knowledge integration and degraded predictions. This arises from the random positioning of knowledge components within prompts, where irrelevant component fusion introduces interference. To address this, we propose Componential Prompt-Knowledge Alignment (KA-Prompt), a novel prompt-based DIL method that introduces component-aware prompt-knowledge alignment during training, significantly improving both the learning and inference capacity of the model. KA-Prompt operates in two phases: (1) Initial Componential Structure Configuring, where a set of old prompts containing knowledge relevant to the new domain are mined via greedy search, which is then exploited to initialize new prompts to achieve reusable knowledge transfer and establish intrinsic alignment between new and old prompts. (2) Online Alignment Preservation, which dynamically identifies the target old prompts and applies adaptive componential consistency constraints as new prompts evolve. Extensive experiments on DIL benchmarks demonstrate the effectiveness of our KA-Prompt. Our source code is available at this https URL
zh
[CV-10] Registration of 3D Point Sets Using Exponential-based Similarity Matrix
【速读】:该论文旨在解决点云配准(point cloud registration)中的两个关键问题:当点云之间存在较大的旋转差异或数据受到显著传感器噪声干扰时,现有先进配准技术容易出现对齐失败,从而导致三维重建不准确或失真。其解决方案的关键在于对经典迭代最近点(Iterative Closest Point, ICP)算法进行改进,提出了一种称为指数相似性矩阵ICP(Exponential Similarity Matrix ICP, ESM-ICP)的方法,该方法通过引入高斯启发的指数加权机制构建动态适应的相似性矩阵,从而提升旋转和平移参数估计的准确性。
链接: https://arxiv.org/abs/2505.04540
作者: Ashutosh Singandhupe,Sanket Lokhande,Hung Manh La
机构: University of Nevada, Reno(内华达大学雷诺分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Point cloud registration is a fundamental problem in computer vision and robotics, involving the alignment of 3D point sets captured from varying viewpoints using depth sensors such as LiDAR or structured light. In modern robotic systems, especially those focused on mapping, it is essential to merge multiple views of the same environment accurately. However, state-of-the-art registration techniques often struggle when large rotational differences exist between point sets or when the data is significantly corrupted by sensor noise. These challenges can lead to misalignments and, consequently, to inaccurate or distorted 3D reconstructions. In this work, we address both these limitations by proposing a robust modification to the classic Iterative Closest Point (ICP) algorithm. Our method, termed Exponential Similarity Matrix ICP (ESM-ICP), integrates a Gaussian-inspired exponential weighting scheme to construct a similarity matrix that dynamically adapts across iterations. This matrix facilitates improved estimation of both rotational and translational components during alignment. We demonstrate the robustness of ESM-ICP in two challenging scenarios: (i) large rotational discrepancies between the source and target point clouds, and (ii) data corrupted by non-Gaussian noise. Our results show that ESM-ICP outperforms traditional geometric registration techniques as well as several recent learning-based methods. To encourage reproducibility and community engagement, our full implementation is made publicly available on GitHub. this https URL
zh
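ESM-ICP 的核心是在经典 ICP 的最近邻残差上施加高斯式指数权重,再用加权 Kabsch/SVD 求解刚体变换。下面是一个简化的单步实现与小型对齐演示(权重带宽 sigma 及其随迭代的自适应策略为笔者假设/省略):

```python
import numpy as np

def esm_icp_step(src, dst, sigma=0.5):
    """One ICP iteration with Gaussian exponential weights on
    nearest-neighbor residuals, solved via weighted Kabsch/SVD."""
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)  # brute-force NN
    nn_idx = d2.argmin(1)
    matched = dst[nn_idx]
    w = np.exp(-d2[np.arange(len(src)), nn_idx] / (2 * sigma ** 2))
    w /= w.sum()                                  # down-weights bad matches
    mu_s = (w[:, None] * src).sum(0)
    mu_d = (w[:, None] * matched).sum(0)
    H = (src - mu_s).T @ (w[:, None] * (matched - mu_d))
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflection
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

rng = np.random.default_rng(0)
src = rng.normal(size=(200, 3))
theta = np.pi / 12
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.2, -0.1, 0.3])
for _ in range(20):                                # iterate to convergence
    R, t = esm_icp_step(src, dst)
    src = src @ R.T + t
d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
print("mean NN distance after alignment:", np.sqrt(d2.min(1)).mean())
```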
[CV-11] RAFT: Robust Augmentation of FeaTures for Image Segmentation
【速读】:该论文旨在解决生成式 AI 在图像分割任务中从合成数据(Synthetic Data)到真实数据(Real-world Data)的域适应问题,即所谓的 Syn2Real 问题,该问题导致模型在实际应用中的性能下降。解决方案的关键在于提出 RAFT 框架,该框架通过最小量的真实世界标注数据,结合数据增强、特征增强以及主动学习策略,实现对图像分割模型的有效适应。实验结果表明,RAFT 在多个基准测试中优于现有最先进的方法 HALO,显著提升了 mIoU 指标。
链接: https://arxiv.org/abs/2505.04529
作者: Edward Humes,Xiaomin Lin,Uttej Kallakuri,Tinoosh Mohsenin
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image segmentation is a powerful computer vision technique for scene understanding. However, real-world deployment is stymied by the need for high-quality, meticulously labeled datasets. Synthetic data provides high-quality labels while reducing the need for manual data collection and annotation. However, deep neural networks trained on synthetic data often face the Syn2Real problem, leading to poor performance in real-world deployments. To mitigate the aforementioned gap in image segmentation, we propose RAFT, a novel framework for adapting image segmentation models using minimal labeled real-world data through data and feature augmentations, as well as active learning. To validate RAFT, we perform experiments on the synthetic-to-real “SYNTHIA-Cityscapes” and “GTAV-Cityscapes” benchmarks. We managed to surpass the previous state of the art, HALO. SYNTHIA-Cityscapes experiences an improvement in mIoU* upon domain adaptation of 2.1%/79.9%, and GTAV-Cityscapes experiences a 0.4%/78.2% improvement in mIoU. Furthermore, we test our approach on the real-to-real benchmark of “Cityscapes-ACDC”, and again surpass HALO, with a gain in mIoU upon adaptation of 1.3%/73.2%. Finally, we examine the effect of the allocated annotation budget and various components of RAFT upon the final transfer mIoU.
zh
[CV-12] DFVO: Learning Darkness-free Visible and Infrared Image Disentanglement and Fusion All at Once
【速读】:该论文旨在解决可见光与红外图像融合中因可见光图像严重光照退化导致的融合结果模糊、暗淡的问题,这一问题对自动驾驶等高级视觉任务构成了重大挑战。其解决方案的关键在于提出一种名为DFVO(Darkness-Free network for Visible and infrared image disentanglement and fusion all at Once)的网络,采用级联多任务方法替代传统的两阶段级联训练(增强与融合),以减少层级数据传输导致的信息熵损失。该方法通过构建潜在公共特征提取器(LCFE)、细节提取模块(DEM)、超交叉注意力模块(HCAM)以及设计相关损失函数,实现了更清晰、更具信息量且光照更均匀的融合结果。
链接: https://arxiv.org/abs/2505.04526
作者: Qi Zhou,Yukai Shi,Xiaojun Yang,Xiaoyu Xian,Lunjia Liao,Ruimao Zhang,Liang Lin
机构: Guangdong University of Technology (广东工业大学); School of Information Engineering (信息工程学院); Key Laboratory of Photonic Technology for Integrated Sensing and Communication, Ministry of Education of China (教育部光子技术集成感知与通信重点实验室); CRRC Institute Co., Ltd. (中车研究院有限公司); UBTECH Robotics Co., Ltd (优必选科技有限公司); The Chinese University of Hong Kong, Shenzhen (香港中文大学深圳); School of Data Science (数据科学学院); School of Data and Computer Science (数据与计算机科学学院, Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Visible and infrared image fusion is one of the most crucial tasks in the field of image fusion, aiming to generate fused images with clear structural information and high-quality texture features for high-level vision tasks. However, when faced with severe illumination degradation in visible images, the fusion results of existing image fusion methods often exhibit blurry and dim visual effects, posing major challenges for autonomous driving. To this end, a Darkness-Free network is proposed to handle Visible and infrared image disentanglement and fusion all at Once (DFVO), which employs a cascaded multi-task approach to replace the traditional two-stage cascaded training (enhancement and fusion), addressing the issue of information entropy loss caused by hierarchical data transmission. Specifically, we construct a latent-common feature extractor (LCFE) to obtain latent features for the cascaded tasks strategy. Firstly, a details-extraction module (DEM) is devised to acquire high-frequency semantic information. Secondly, we design a hyper cross-attention module (HCAM) to extract low-frequency information and preserve texture features from source images. Finally, a relevant loss function is designed to guide the holistic network learning, thereby achieving better image fusion. Extensive experiments demonstrate that our proposed approach outperforms state-of-the-art alternatives in terms of qualitative and quantitative evaluations. Particularly, DFVO can generate clearer, more informative, and more evenly illuminated fusion results in the dark environments, achieving best performance on the LLVIP dataset with 63.258 dB PSNR and 0.724 CC, providing more effective information for high-level vision tasks. Our code is publicly accessible at this https URL.
zh
[CV-13] Edge-GPU Based Face Tracking for Face Detection and Recognition Acceleration
【速读】:该论文旨在解决在公共场合中实现高效、低功耗的实时人脸检测与识别系统的问题。其关键解决方案是采用软硬件协同设计方法,充分利用NVIDIA Jetson AGX Orin边缘GPU的所有硬件引擎,并集成人脸跟踪模块以减少冗余的人脸识别计算,从而提升处理吞吐量并降低功耗。
链接: https://arxiv.org/abs/2505.04524
作者: Asma Baobaid,Mahmoud Meribout
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 10 pages, 12 figures
Abstract:Cost-effective machine vision systems dedicated to real-time and accurate face detection and recognition in public places are crucial for many modern applications. However, despite their high performance, which could be reached using specialized edge or cloud AI hardware accelerators, there is still room for improvement in throughput and power consumption. This paper aims to suggest a combined hardware-software approach that optimizes face detection and recognition systems on one of the latest edge GPUs, namely the NVIDIA Jetson AGX Orin. First, it leverages the simultaneous usage of all its hardware engines to improve processing time. This offers an improvement over previous works where these tasks were mainly allocated automatically and exclusively to the CPU or, to a higher extent, to the GPU core. Additionally, the paper suggests integrating a face tracker module to avoid redundantly running the face recognition algorithm for every frame but only when a new face appears in the scene. The results of extended experiments suggest that simultaneous usage of all the hardware engines available in the Orin GPU and tracker integration into the pipeline yield an impressive throughput of 290 FPS (frames per second) on 1920 x 1080 frames containing an average of 6 faces per frame. Additionally, a substantial power saving of around 800 mW was achieved compared to running the task on the CPU/GPU engines only and without integrating a tracker into the Orin GPU’s pipeline. This hardware-software co-design approach can pave the way to designing high-performance machine vision systems at the edge, critically needed for video monitoring in public places where several nearby cameras are usually deployed for the same scene.
zh
[CV-14] Text2CT: Towards 3D CT Volume Generation from Free-text Descriptions Using Diffusion Model
【速读】:该论文试图解决从描述性自由文本输入生成3D CT影像的问题,这一问题在诊断和研究中具有重要价值。解决方案的关键在于提出Text2CT方法,该方法利用扩散模型从多样化的自由文本描述中合成3D CT体积,通过新颖的提示格式和医学文本到潜在表示的编码与解码机制,有效实现了语义文本输入与详细体素表示之间的桥梁构建。
链接: https://arxiv.org/abs/2505.04522
作者: Pengfei Guo,Can Zhao,Dong Yang,Yufan He,Vishwesh Nath,Ziyue Xu,Pedro R. A. S. Bassi,Zongwei Zhou,Benjamin D. Simon,Stephanie Anne Harmon,Baris Turkbey,Daguang Xu
机构: NVIDIA(英伟达); Johns Hopkins University(约翰霍普金斯大学); National Institutes of Health(美国国家卫生研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating 3D CT volumes from descriptive free-text inputs presents a transformative opportunity in diagnostics and research. In this paper, we introduce Text2CT, a novel approach for synthesizing 3D CT volumes from textual descriptions using the diffusion model. Unlike previous methods that rely on fixed-format text input, Text2CT employs a novel prompt formulation that enables generation from diverse, free-text descriptions. The proposed framework encodes medical text into latent representations and decodes them into high-resolution 3D CT scans, effectively bridging the gap between semantic text inputs and detailed volumetric representations in a unified 3D framework. Our method demonstrates superior performance in preserving anatomical fidelity and capturing intricate structures as described in the input text. Extensive evaluations show that our approach achieves state-of-the-art results, offering promising potential applications in diagnostics, and data augmentation.
zh
[CV-15] HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
【速读】:该论文旨在解决定制化视频生成中身份一致性不足和输入模态受限的问题(identity consistency and limited input modalities)。其解决方案的关键在于提出了一种多模态定制化视频生成框架HunyuanCustom,该框架通过引入文本-图像融合模块、图像ID增强模块以及针对音频和视频条件的模态特异性条件注入机制,实现了跨模态理解与身份特征的强化,从而显著提升了视频生成的连贯性、真实性和文本-视频对齐能力。
链接: https://arxiv.org/abs/2505.04512
作者: Teng Hu,Zhentao Yu,Zhengguang Zhou,Sen Liang,Yuan Zhou,Qin Lin,Qinglin Lu
机构: Tencent Hunyuan(腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at this https URL.
zh
[CV-16] Leveraging Simultaneous Usage of Edge GPU Hardware Engines for Video Face Detection and Recognition
【速读】:该论文旨在解决在公共场合边缘计算设备中视频人脸检测与识别的效率问题,特别是如何最大化利用现代边缘GPU中可用的硬件引擎。其关键解决方案是通过任务的并发性和流水线处理,同时使用所有可用的硬件资源,从而提高吞吐量并降低功耗。相较于以往工作仅将任务分配给单一引擎,本文提出的方法能够更有效地协调视频解码、人脸检测与识别等任务,以满足实时性能要求。
链接: https://arxiv.org/abs/2505.04502
作者: Asma Baobaid,Mahmoud Meribout
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Image and Video Processing (eess.IV)
备注: 10 pages, 11 figures
Abstract:Video face detection and recognition in public places at the edge is required in several applications, such as security reinforcement and contactless access to authorized venues. This paper aims to maximize the simultaneous usage of hardware engines available in edge GPUs nowadays by leveraging the concurrency and pipelining of tasks required for face detection and recognition. This also includes the video decoding task, which is required in most face monitoring applications as the video streams are usually carried via a Gbps Ethernet network. This constitutes an improvement over previous works where the tasks were usually allocated to a single engine due to the lack of a unified and automated framework that simultaneously explores all hardware engines. In addition, previously, the input faces were usually embedded in still images or within raw video streams that overlook the burst delay caused by the decoding stage. The results on real-life video streams suggest that by simultaneously using all the hardware engines available in the recent NVIDIA edge Orin GPU, higher throughput and a slight power saving of around 300 mW (about 5%) are achieved while satisfying the real-time performance constraint. Performance improves further when several video streams are considered simultaneously. Further performance improvement could have been obtained if the number of shuffle layers created by the TensorRT framework for the face recognition task were lower. Thus, the paper suggests some hardware improvements to the existing edge GPU processors to enhance their performance even further.
zh
[CV-17] Defining and Quantifying Creative Behavior in Popular Image Generators
【速读】:该论文试图解决生成式 AI 模型的创造力评估问题,这一问题在过去几年中引发了科学界的争论,但尚未有明确结论。论文从实用角度出发,提出了定量评估指标,帮助用户根据具体任务选择合适的 AI 模型。解决方案的关键在于通过实验验证这些指标与人类直觉的一致性,从而为模型选择提供可靠依据。
链接: https://arxiv.org/abs/2505.04497
作者: Aditi Ramaswamy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Creativity of generative AI models has been a subject of scientific debate in recent years, without a conclusive answer. In this paper, we study creativity from a practical perspective and introduce quantitative measures that help the user to choose a suitable AI model for a given task. We evaluated our measures on a number of popular image-to-image generation models, and the results suggest that our measures conform to human intuition.
zh
[CV-18] “I Can See Forever!”: Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
【速读】:该论文旨在解决视觉障碍人群在动态复杂环境中实时感知与辅助需求不足的问题,尤其是现有基于大语言模型和视觉-语言模型的解决方案多局限于静态内容,无法有效支持日常活动。其关键解决方案是引入先进的视频理解技术,构建了首个针对视觉障碍辅助任务的基准数据集VisAssistDaily,并通过用户研究评估模型在封闭与开放场景下的表现,同时提出环境感知数据集SafeVid及轮询机制以提升模型对潜在危险的识别能力。
链接: https://arxiv.org/abs/2505.04488
作者: Ziyi Zhang,Zhen Sun,Zongmin Zhang,Zifan Peng,Yuemeng Zhao,Zichun Wang,Zeren Luo,Ruiting Zuo,Xinlei He
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: 12 pages, 6 figures
Abstract:The visually impaired population, especially the severely visually impaired, is currently large in scale, and daily activities pose significant challenges for them. Although many studies use large language and vision-language models to assist the blind, most focus on static content and fail to meet real-time perception needs in dynamic and complex environments, such as daily activities. To provide them with more effective intelligent assistance, it is imperative to incorporate advanced visual understanding technologies. Although real-time vision and speech interaction VideoLLMs demonstrate strong real-time visual understanding, no prior work has systematically evaluated their effectiveness in assisting visually impaired individuals. In this work, we conduct the first such evaluation. First, we construct a benchmark dataset (VisAssistDaily), covering three categories of assistive tasks for visually impaired individuals: Basic Skills, Home Life Tasks, and Social Life Tasks. The results show that GPT-4o achieves the highest task success rate. Next, we conduct a user study to evaluate the models in both closed-world and open-world scenarios, further exploring the practical challenges of applying VideoLLMs in assistive contexts. One key issue we identify is the difficulty current models face in perceiving potential hazards in dynamic environments. To address this, we build an environment-awareness dataset named SafeVid and introduce a polling mechanism that enables the model to proactively detect environmental risks. We hope this work provides valuable insights and inspiration for future research in this field.
zh
[CV-19] Efficient Flow Matching using Latent Variables
【速读】:该论文旨在解决流匹配模型在学习从简单源分布到目标数据的流时,未能显式建模目标数据潜在结构/流形的问题,这导致了在高维真实数据集上的学习效率低下。其解决方案的关键在于提出\textttLatent-CFM,通过预训练的深度潜在变量模型简化训练与推理策略,从而有效地整合多模态数据结构,提升生成质量并减少训练时间和计算成本。
链接: https://arxiv.org/abs/2505.04486
作者: Anirban Samaddar,Yixuan Sun,Viktor Nilsson,Sandeep Madireddy
机构: Argonne National Laboratory (阿贡国家实验室); KTH Royal Institute of Technology (皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Flow matching models have shown great potential in image generation tasks among probabilistic generative models. Building upon the ideas of continuous normalizing flows, flow matching models generalize the transport path of the diffusion models from a simple prior distribution to the data. Most flow matching models in the literature do not explicitly model the underlying structure/manifold in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside in a low-dimensional manifold. Existing strategies of incorporating manifolds, including data with underlying multi-modal distribution, often require expensive training and hence frequently lead to suboptimal performance. To this end, we present Latent-CFM, which provides simplified training/inference strategies to incorporate multi-modal data structures using pretrained deep latent variable models. Through experiments on multi-modal synthetic data and widely used image benchmark datasets, we show that Latent-CFM exhibits improved generation quality with significantly less training (about 50% less in some cases) and computation than state-of-the-art flow matching models. Using a 2D Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competitive approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features.
zh
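条件流匹配的训练目标本身十分简洁:沿直线插值路径回归目标速度场。下面给出标准 CFM 损失的最小实现(Latent-CFM 在此基础上进一步以预训练潜变量模型提取的特征为条件,此处仅以可选的 cond 向量示意,属笔者简化):

```python
import torch
import torch.nn as nn

def cfm_loss(v_model, x1, cond=None):
    """Standard conditional flow matching loss:
    x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I);
    the target velocity along this straight path is (x1 - x0)."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1)
    xt = (1 - t) * x0 + t * x1
    target_v = x1 - x0
    pred_v = v_model(xt, t, cond)
    return ((pred_v - target_v) ** 2).mean()

class VelocityNet(nn.Module):
    def __init__(self, dim=2, cond_dim=0, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + cond_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, dim))
    def forward(self, x, t, cond=None):
        h = torch.cat([x, t] + ([cond] if cond is not None else []), dim=-1)
        return self.net(h)

model = VelocityNet(dim=2)
x1 = torch.randn(128, 2) + torch.tensor([3.0, 0.0])   # toy target data
loss = cfm_loss(model, x1)
loss.backward()
print(f"CFM loss: {loss.item():.3f}")
```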
[CV-20] FA-KPConv: Introducing Euclidean Symmetries to KPConv via Frame Averaging IJCNN2025
【速读】:该论文试图解决3D点云分析中神经网络对欧几里得变换(如平移、旋转和反射)的不变性和/或等变性难以精确实现的问题。尽管KPConv-based网络在大规模数据集或使用大量数据增强时可以近似实现这些性质,但无法保证精确性。解决方案的关键在于引入Frame Averaging方法,通过将其封装在现有的KPConv网络周围,使网络能够精确地具备对输入点云的平移、旋转和反射的不变性和/或等变性,同时保持可学习参数数量不变且不丢失任何输入信息。
链接: https://arxiv.org/abs/2505.04485
作者: Ali Alawieh(1 and 2),Alexandru P. Condurache(1 and 2) ((1) Robert Bosch GmbH - ADAS Systems, Software and Services, (2) University of Lübeck - Institute for Signal Processing)
机构: Robert Bosch GmbH - ADAS Systems, Software & Services (罗伯特·博世有限公司-高级驾驶辅助系统、软件与服务); University of Lübeck - Institute for Signal Processing (吕贝克大学-信号处理研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures, accepted at IJCNN 2025
Abstract:We present Frame-Averaging Kernel-Point Convolution (FA-KPConv), a neural network architecture built on top of the well-known KPConv, a widely adopted backbone for 3D point cloud analysis. Even though invariance and/or equivariance to Euclidean transformations are required for many common tasks, KPConv-based networks can only approximately achieve such properties when training on large datasets or with significant data augmentations. Using Frame Averaging, we can flexibly customize point cloud neural networks built with KPConv layers, by making them exactly invariant and/or equivariant to translations, rotations and/or reflections of the input point clouds. By simply wrapping around an existing KPConv-based network, FA-KPConv embeds geometrical prior knowledge into it while preserving the number of learnable parameters and not compromising any input information. We showcase the benefit of such an introduced bias for point cloud classification and point cloud registration, especially in challenging cases such as scarce training data or randomly rotated test data.
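下面用 Python 勾勒 Frame Averaging 的核心思想(参照 Puny 等人的做法,用 PCA 主轴加符号翻转构造帧):对帧内每个变换后的输入取网络输出的平均,即可对旋转/反射获得精确不变性。此处忽略特征值退化等细节,仅为示意,并非 FA-KPConv 的具体实现。

```python
import itertools
import torch

def pca_frames(points):
    # points: (N, 3);由 PCA 主轴加逐轴符号翻转构造 8 个帧(示意,忽略特征值退化)
    centered = points - points.mean(dim=0, keepdim=True)
    _, _, Vh = torch.linalg.svd(centered, full_matrices=False)  # Vh 的行为主轴
    frames = []
    for signs in itertools.product([1.0, -1.0], repeat=3):
        frames.append(torch.diag(torch.tensor(signs)) @ Vh)     # (3, 3) 正交变换
    return frames

def frame_averaged_invariant(net, points):
    # 精确不变性:对帧内所有规范化输入的网络输出取平均
    centered = points - points.mean(dim=0, keepdim=True)
    outs = [net(centered @ R.T) for R in pca_frames(points)]
    return torch.stack(outs).mean(dim=0)
```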
zh
[CV-21] CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation
【速读】:该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)生成计算机辅助设计(CAD)模型的参数化序列问题,从而扩展LLMs在特定领域中的生成能力。其关键解决方案是提出CAD-Llama框架,通过构建分层注释流程和代码类格式,将参数化3D CAD命令序列转化为结构化参数化CAD代码(Structured Parametric CAD Code, SPCC),并采用基于SPCC的自适应预训练方法及与CAD规范对齐的指令微调过程,以赋予LLMs空间知识和生成参数化3D形状的能力。
链接: https://arxiv.org/abs/2505.04481
作者: Jiahao Li,Weijian Ma,Xueyang Li,Yunzhong Lou,Guichun Zhou,Xiangdong Zhou
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, Large Language Models (LLMs) have achieved significant success, prompting increased interest in expanding their generative capabilities beyond general text into domain-specific areas. This study investigates the generation of parametric sequences for computer-aided design (CAD) models using LLMs. This endeavor represents an initial step towards creating parametric 3D shapes with LLMs, as CAD model parameters directly correlate with shapes in three-dimensional space. Despite the formidable generative capacities of LLMs, this task remains challenging, as these models neither encounter parametric sequences during their pretraining phase nor possess direct awareness of 3D structures. To address this, we present CAD-Llama, a framework designed to enhance pretrained LLMs for generating parametric 3D CAD models. Specifically, we develop a hierarchical annotation pipeline and a code-like format to translate parametric 3D CAD command sequences into Structured Parametric CAD Code (SPCC), incorporating hierarchical semantic descriptions. Furthermore, we propose an adaptive pretraining approach utilizing SPCC, followed by an instruction tuning process aligned with CAD-specific guidelines. This methodology aims to equip LLMs with the spatial knowledge inherent in parametric sequences. Experimental results demonstrate that our framework significantly outperforms prior autoregressive methods and existing LLM baselines.
zh
[CV-22] Learning Real Facial Concepts for Independent Deepfake Detection IJCAI2025
【速读】:该论文旨在解决深度伪造检测模型在未见过的数据集上泛化能力不足的问题,具体表现为在目标领域中将真实样本错误分类为伪造样本。解决方案的关键在于提出一种名为RealID的新方法,通过学习真实人脸的全面概念并独立评估属于真实和伪造类别的概率来提升泛化能力。RealID包含两个关键模块:Real Concept Capture Module (RealC2) 和 Independent Dual-Decision Classifier (IDC),其中RealC2利用MultiReal Memory维护多种真实人脸原型,而IDC则通过基于真实类概念和伪造痕迹的独立决策重新定义分类策略,从而减轻无关伪造模式的影响。
链接: https://arxiv.org/abs/2505.04460
作者: Ming-Hui Liu,Harry Cheng,Tianyi Wang,Xin Luo,Xin-Shun Xu
机构: School of Software, Shandong University (山东大学软件学院); School of Computing, National University of Singapore (新加坡国立大学计算机学院); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2025
Abstract:Deepfake detection models often struggle with generalization to unseen datasets, manifesting as misclassifying real instances as fake in target domains. This is primarily due to an overreliance on forgery artifacts and a limited understanding of real faces. To address this challenge, we propose a novel approach RealID to enhance generalization by learning a comprehensive concept of real faces while assessing the probabilities of belonging to the real and fake classes independently. RealID comprises two key modules: the Real Concept Capture Module (RealC2) and the Independent Dual-Decision Classifier (IDC). With the assistance of a MultiReal Memory, RealC2 maintains various prototypes for real faces, allowing the model to capture a comprehensive concept of real class. Meanwhile, IDC redefines the classification strategy by making independent decisions based on the concept of the real class and the presence of forgery artifacts. Through the combined effect of the above modules, the influence of forgery-irrelevant patterns is alleviated, and extensive experiments on five widely used datasets demonstrate that RealID significantly outperforms existing state-of-the-art methods, achieving a 1.74% improvement in average accuracy.
zh
[CV-23] RLMiniStyler: Light-weight RL Style Agent for Arbitrary Sequential Neural Style Generation IJCAI2025
【速读】:该论文试图解决任意风格迁移中现有基于深度学习的方法计算成本高、难以生成多样化且高质量艺术图像序列的问题。解决方案的关键在于提出一种基于强化学习的框架RLMiniStyler,该框架通过统一的强化学习策略迭代引导风格迁移过程,结合不确定性感知的多任务学习策略自动调整损失权重,从而在保持模型轻量化的同时生成平滑且多样化的风格化结果。
链接: https://arxiv.org/abs/2505.04424
作者: Jing Hu,Chengming Feng,Shu Hu,Ming-Ching Chang,Xin Li,Xi Wu,Xin Wang
机构: Chengdu University of Information Technology (成都信息工程大学); Purdue University (普渡大学); University at Albany, SUNY (纽约州立大学阿尔巴尼分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IJCAI2025
Abstract:Arbitrary style transfer aims to apply the style of any given artistic image to another content image. Still, existing deep learning-based methods often require significant computational costs to generate diverse stylized results. Motivated by this, we propose a novel reinforcement learning-based framework for arbitrary style transfer, RLMiniStyler. This framework leverages a unified reinforcement learning policy to iteratively guide the style transfer process by exploring and exploiting stylization feedback, generating smooth sequences of stylized results while keeping the model lightweight. Furthermore, we introduce an uncertainty-aware multi-task learning strategy that automatically adjusts loss weights to adapt to the content and style balance requirements at different training stages, thereby accelerating model convergence. Through a series of experiments across various image resolutions, we have validated the advantages of RLMiniStyler over other state-of-the-art methods in generating high-quality, diverse artistic image sequences at a lower cost. Codes are available at this https URL.
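摘要中的不确定性感知多任务加权,一种常见实现是 Kendall 等人(2018)的同方差不确定性加权,损失形如 Σ_i exp(−s_i)·L_i + s_i。以下示意是否与论文的具体公式一致属于假设:

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """同方差不确定性多任务加权的示意(Kendall et al., 2018);
    论文的具体形式可能不同。"""
    def __init__(self, num_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # s_i = log sigma_i^2

    def forward(self, losses):
        total = 0.0
        for s, loss in zip(self.log_vars, losses):
            total = total + torch.exp(-s) * loss + s  # 自动平衡各任务损失
        return total
```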
zh
[CV-24] DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
【速读】:该论文旨在解决密集视觉预测任务中因依赖预定义类别而限制其在现实场景中应用的问题,尤其是在视觉概念无边界的情况下。现有方法如Vision-Language Models(VLMs)虽然在开放词汇任务中表现出色,但其在密集预测任务中的直接应用效果不佳,主要原因是局部特征表示能力有限。该论文提出的解决方案关键在于提出DeCLIP框架,通过解耦自注意力模块分别获取“内容”和“上下文”特征,“内容”特征与图像裁剪表示对齐以提升局部判别性,“上下文”特征则在视觉基础模型(如DINO)的指导下保留空间相关性,从而有效提升模型在开放词汇密集预测任务中的性能。
链接: https://arxiv.org/abs/2505.04410
作者: Junjie Wang,Bin Chen,Yulin Li,Bin Kang,Yichi Chen,Zhuotao Tian
机构: HIT, Shenzhen (哈尔滨工业大学深圳); International Research Institute for Artificial Intelligence, HIT, Shenzhen (人工智能国际研究院,哈尔滨工业大学深圳); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The "content" features are aligned with image crop representations to improve local discriminability, while "context" features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation. Code is available at this https URL.
zh
[CV-25] MFSeg: Efficient Multi-frame 3D Semantic Segmentation ICRA2025
【速读】:该论文旨在解决多帧点云序列的3D语义分割问题,特别是在保持高精度的同时降低计算开销。其解决方案的关键在于通过在特征层面聚合点云序列并规范化特征提取与聚合过程,从而减少计算负担,同时采用轻量级基于MLP的点解码器,避免从历史帧中上采样冗余点。
链接: https://arxiv.org/abs/2505.04408
作者: Chengjie Huang,Krzysztof Czarnecki
机构: University of Waterloo (滑铁卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2025
Abstract:We propose MFSeg, an efficient multi-frame 3D semantic segmentation framework. By aggregating point cloud sequences at the feature level and regularizing the feature extraction and aggregation process, MFSeg reduces computational overhead while maintaining high accuracy. Moreover, by employing a lightweight MLP-based point decoder, our method eliminates the need to upsample redundant points from past frames. Experiments on the nuScenes and Waymo datasets show that MFSeg outperforms existing methods, demonstrating its effectiveness and efficiency.
zh
[CV-26] Deep residual learning with product units
【速读】:该论文旨在解决深度卷积网络在表达能力与参数效率之间的平衡问题,以及提升模型对噪声的鲁棒性。其解决方案的关键在于将乘积单元(product units)引入残差块中,以替代传统的求和神经元,从而实现特征的乘法交互,增强模型对复杂模式的表示能力。PURe通过在每个残差块的第二层使用二维乘积单元代替传统卷积层,并去除非线性激活函数以保留结构信息,从而在多个基准数据集上实现了更高的分类精度、更快的收敛速度以及更优的参数效率和噪声鲁棒性。
链接: https://arxiv.org/abs/2505.04397
作者: Ziyuan Li,Uwe Jaekel,Babette Dellen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We propose a deep product-unit residual neural network (PURe) that integrates product units into residual blocks to improve the expressiveness and parameter efficiency of deep convolutional networks. Unlike standard summation neurons, product units enable multiplicative feature interactions, potentially offering a more powerful representation of complex patterns. PURe replaces conventional convolutional layers with 2D product units in the second layer of each residual block, eliminating nonlinear activation functions to preserve structural information. We validate PURe on three benchmark datasets. On Galaxy10 DECaLS, PURe34 achieves the highest test accuracy of 84.89%, surpassing the much deeper ResNet152, while converging nearly five times faster and demonstrating strong robustness to Poisson noise. On ImageNet, PURe architectures outperform standard ResNet models at similar depths, with PURe34 achieving a top-1 accuracy of 80.27% and top-5 accuracy of 95.78%, surpassing deeper ResNet variants (ResNet50, ResNet101) while utilizing significantly fewer parameters and computational resources. On CIFAR-10, PURe consistently outperforms ResNet variants across varying depths, with PURe272 reaching 95.01% test accuracy, comparable to ResNet1001 but at less than half the model size. These results demonstrate that PURe achieves a favorable balance between accuracy, efficiency, and robustness. Compared to traditional residual networks, PURe not only achieves competitive classification performance with faster convergence and fewer parameters, but also demonstrates greater robustness to noise. Its effectiveness across diverse datasets highlights the potential of product-unit-based architectures for scalable and reliable deep learning in computer vision.
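乘积单元的经典定义为 y = Π_i x_i^{w_i} = exp(Σ_i w_i·log x_i)。下面是一个二维乘积单元层的简化示意:对输入取绝对值并加 eps 保证数值稳定;符号处理方式属于假设,未必与 PURe 的实现一致。

```python
import torch
import torch.nn as nn

class ProductUnit2d(nn.Module):
    """二维乘积单元层示意:用 exp(conv(log|x|)) 实现乘法聚合。"""
    def __init__(self, in_ch, out_ch, kernel_size=3, eps=1e-4):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.eps = eps  # 防止 log(0)

    def forward(self, x):
        # prod_i x_i^{w_i} = exp(sum_i w_i * log x_i),在 |x| + eps 上计算以保证稳定
        return torch.exp(self.conv(torch.log(x.abs() + self.eps)))
```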
zh
[CV-27] SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer
【速读】:该论文旨在解决传统基于卷积神经网络(Convolutional Neural Network, CNN)的唇读模型在处理时空信息时计算复杂度高、效率不足的问题,以及由此带来的多模态研究中模型复杂性和推理延迟增加的问题。其解决方案的关键在于引入了具有层次结构和窗口自注意力机制的Swin Transformer,并配置了一个适用于唇读数据的轻量级Swin Transformer变体——SwinLip视觉语音编码器,通过将改进的Convolution-augmented Transformer(Conformer)时间嵌入与传统空间嵌入相结合,有效降低了计算负载并提升了模型性能和推理速度。
链接: https://arxiv.org/abs/2505.04394
作者: Young-Hu Park,Rae-Hong Park,Hyung-Min Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注:
Abstract:This paper presents an efficient visual speech encoder for lip reading. While most recent lip reading studies have been based on the ResNet architecture and have achieved significant success, they are not sufficiently suitable for efficiently capturing lip reading features due to high computational complexity in modeling spatio-temporal information. Additionally, using a complex visual model not only increases the complexity of lip reading models but also induces delays in the overall network for multi-modal studies (e.g., audio-visual speech recognition, speech enhancement, and speech separation). To overcome the limitations of Convolutional Neural Network (CNN)-based models, we apply the hierarchical structure and window self-attention of the Swin Transformer to lip reading. We configure a new lightweight scale of the Swin Transformer suitable for processing lip reading data and present the SwinLip visual speech encoder, which efficiently reduces computational load by integrating modified Convolution-augmented Transformer (Conformer) temporal embeddings with conventional spatial embeddings in the hierarchical structure. Through extensive experiments, we have validated that our SwinLip successfully improves the performance and inference speed of the lip reading network when applied to various backbones for word and sentence recognition, reducing computational load. In particular, our SwinLip demonstrated robust performance in both English LRW and Mandarin LRW-1000 datasets and achieved state-of-the-art performance on the Mandarin LRW-1000 dataset with less computation compared to the existing state-of-the-art model.
zh
[CV-28] Predicting Road Surface Anomalies by Visual Tracking of a Preceding Vehicle
【速读】:该论文试图解决在低能见度或密集交通条件下,由于前车遮挡而无法直接观察道路表面异常的问题。传统方法依赖于对特定异常(如坑洞、凸起、碎屑等)的视觉检测器进行训练,而本文提出了一种新的方法,通过跟踪前车的视觉信息来预测道路异常。解决方案的关键在于利用相机跟踪前车信号,并通过迭代鲁棒估计器补偿由车辆振动引起的相机俯仰旋转,从而提高异常检测的准确性与可靠性。该方法能够在真实环境中实时运行,并在复杂路况下实现可靠的远距离异常检测。
链接: https://arxiv.org/abs/2505.04392
作者: Petr Jahoda,Jan Cech
机构: Czech Technical University in Prague (捷克技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE Intelligent Vehicles Symposium (IV), 2025
Abstract:A novel approach to detect road surface anomalies by visual tracking of a preceding vehicle is proposed. The method is versatile, predicting any kind of road anomalies, such as potholes, bumps, debris, etc., unlike direct observation methods that rely on training visual detectors of those cases. The method operates in low visibility conditions or in dense traffic where the anomaly is occluded by a preceding vehicle. Anomalies are detected predictively, i.e., before a vehicle encounters them, which makes it possible to pre-configure low-level vehicle systems (such as chassis) or to plan an avoidance maneuver in case of autonomous driving. A challenge is that the signal coming from camera-based tracking of a preceding vehicle may be weak and disturbed by camera ego motion due to vibrations affecting the ego vehicle. Therefore, we propose an efficient method to compensate camera pitch rotation by an iterative robust estimator. Our experiments on both a controlled setup and normal traffic conditions show that road anomalies can be detected reliably at a distance even in challenging cases where the ego vehicle traverses imperfect road surfaces. The method is effective and performs in real time on standard consumer hardware.
zh
[CV-29] Geometry-Aware Texture Generation for 3D Head Modeling with Artist-driven Control CVPR
【速读】:该论文试图解决为虚拟角色创建符合精确艺术愿景的逼真3D头部资产时劳动强度大的问题。解决方案的关键在于提出一种新颖的框架,通过提供对生成的3D头部的直观控制来简化这一过程,其核心是基于几何感知的纹理合成流程,该流程学习不同人口统计学背景下头部几何形状与皮肤纹理图之间的相关性。
链接: https://arxiv.org/abs/2505.04387
作者: Amin Fadaeinejad,Abdallah Dib,Luiz Gustavo Hafemann,Emeline Got,Trevor Anderson,Amaury Depierre,Nikolaus F. Troje,Marcus A. Brubaker,Marc-André Carbonneau
机构: Ubisoft LaForge(育碧实验室); York University(约克大学); Vector Institute(向量研究所)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 9 figures, AI for Creative Visual Content Generation Editing and Understanding (CVEU), CVPRW 2025
Abstract:Creating realistic 3D head assets for virtual characters that match a precise artistic vision remains labor-intensive. We present a novel framework that streamlines this process by providing artists with intuitive control over generated 3D heads. Our approach uses a geometry-aware texture synthesis pipeline that learns correlations between head geometry and skin texture maps across different demographics. The framework offers three levels of artistic control: manipulation of overall head geometry, adjustment of skin tone while preserving facial characteristics, and fine-grained editing of details such as wrinkles or facial hair. Our pipeline allows artists to make edits to a single texture map using familiar tools, with our system automatically propagating these changes coherently across the remaining texture maps needed for realistic rendering. Experiments demonstrate that our method produces diverse results with clean geometries. We showcase practical applications focusing on intuitive control for artists, including skin tone adjustments and simplified editing workflows for adding age-related details or removing unwanted features from scanned models. This integrated approach aims to streamline the artistic workflow in virtual character creation.
zh
[CV-30] DATA: Multi-Disentanglement based Contrastive Learning for Open-World Semi-Supervised Deepfake Attribution
【速读】:该论文旨在解决开放世界半监督深度伪造归属(OSS-DFA)任务中模型泛化能力不足以及难以区分不确定新类别的问题。现有方法仅关注特定生成技术的线索,容易过拟合并忽视共性伪造特征,导致在实际应用中的性能受限。论文提出的解决方案关键在于提出一种基于多解耦的对比学习框架DATA,通过定义“正交深度伪造基”来解耦方法特异性特征,减少对无关伪造信息的过拟合,并引入增强记忆机制辅助新类别发现与对比学习,从而提升模型对新类别的泛化能力和分类准确性。
链接: https://arxiv.org/abs/2505.04384
作者: Ming-Hui Liu,Xiao-Qian Liu,Xin Luo,Xin-Shun Xu
机构: Shandong University (山东大学); Quan Cheng Laboratory (全城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TMM on 17-Jan-2025; Submitted to IEEE TMM on 11-Jul-2024
Abstract:Deepfake attribution (DFA) aims to perform multiclassification on different facial manipulation techniques, thereby mitigating the detrimental effects of forgery content on the social order and personal reputations. However, previous methods focus only on method-specific clues, which easily lead to overfitting, while overlooking the crucial role of common forgery features. Additionally, they struggle to distinguish between uncertain novel classes in more practical open-world scenarios. To address these issues, in this paper we propose an innovative multi-DisentAnglement based conTrastive leArning framework, DATA, to enhance the generalization ability on novel classes for the open-world semi-supervised deepfake attribution (OSS-DFA) task. Specifically, since all generation techniques can be abstracted into a similar architecture, DATA defines the concept of 'Orthonormal Deepfake Basis' for the first time and utilizes it to disentangle method-specific features, thereby reducing the overfitting on forgery-irrelevant information. Furthermore, an augmented-memory mechanism is designed to assist in novel class discovery and contrastive learning, which aims to obtain clear class boundaries for the novel classes through instance-level disentanglements. Additionally, to enhance the standardization and discrimination of features, DATA uses bases contrastive loss and center contrastive loss as auxiliaries for the aforementioned modules. Extensive experimental evaluations show that DATA achieves state-of-the-art performance on the OSS-DFA benchmark, e.g., there are notable accuracy improvements of 2.55% / 5.7% under different settings, compared with the existing methods.
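摘要中的“正交深度伪造基”可以通过一个正交归一正则项来约束:令可学习基矩阵 B 满足 B·Bᵀ ≈ I。以下损失形式是常见写法,属于假设而非论文原式:

```python
import torch

def orthonormal_basis_loss(B):
    # B: (K, D) 可学习的伪造基;Frobenius 惩罚使各行相互正交且单位化(假设的形式)
    I = torch.eye(B.size(0), device=B.device)
    return ((B @ B.T - I) ** 2).sum()
```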
zh
[CV-31] Tetrahedron-Net for Medical Image Registration
【速读】:该论文旨在解决医学图像配准中特征表示表达能力不足的问题,以提升配准质量。其解决方案的关键在于引入一个额外的解码器,与原始编码器和解码器进行交互,从而在保持架构简洁性的同时增强特征表示能力。该设计形成了由一个编码器和两个解码器构成的“四面体”(Tetrahedron)结构,通过复用编码器中对应层的特征表示并协同优化原始解码器,实现更精确的配准结果。
链接: https://arxiv.org/abs/2505.04380
作者: Jinhai Xiang,Shuai Guo,Qianru Han,Dantong Shi,Xinwei He,Xiang Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:
Abstract:Medical image registration plays a vital role in medical image processing. Extracting expressive representations for medical images is crucial for improving the registration quality. One common practice for this end is constructing a convolutional backbone to enable interactions with skip connections among feature extraction layers. The de facto structure, U-Net-like networks, has attempted to design skip connections such as nested or full-scale ones to connect one single encoder and one single decoder to improve its representation capacity. Despite being effective, it still does not fully explore interactions within a single-encoder, single-decoder architecture. In this paper, we embrace this observation and introduce a simple yet effective alternative strategy to enhance the representations for registrations by appending one additional decoder. The new decoder is designed to interact with both the original encoder and decoder. In this way, it not only reuses feature representations from corresponding layers in the encoder but also interacts with the original decoder to cooperatively give more accurate registration results. The new architecture is concise yet generalized, with only one encoder and two decoders forming a "Tetrahedron" structure, thereby dubbed Tetrahedron-Net. Three instantiations of Tetrahedron-Net are further constructed regarding the different structures of the appended decoder. Our extensive experiments prove that superior performance can be obtained on several representative benchmarks of medical image registration. Finally, such a "Tetrahedron" design can also be easily integrated into popular U-Net-like architectures including VoxelMorph, ViT-V-Net, and TransMorph, leading to consistent performance gains.
zh
[CV-32] Label-efficient Single Photon Images Classification via Active Learning
【速读】:该论文旨在解决单光子LiDAR图像的语义解释问题,尤其是在高标注成本和低效标注策略下,如何有效提升分类性能。其关键解决方案是提出了一种面向成像条件的主动学习采样策略,通过结合合成增强技术来建模不同成像条件下的变化,从而选择性地标注最具信息量的样本,显著减少了所需标注数据的比例。
链接: https://arxiv.org/abs/2505.04376
作者: Zili Zhang,Ziting Wen,Yiheng Qiang,Hongzhou Dong,Wenle Dong,Xinyang Li,Xiaofan Wang,Xiaoqiang Ren
机构: School of Mechatronic Engineering and Automation, Shanghai University, Shanghai 200444, China; Australian Center for Robotics, School of Aerospace, Mechanical and Mechatronic Engineering, the University of Sydney, NSW 2006, Sydney; Shanghai Institute of Technology, Shanghai 200235, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single-photon LiDAR achieves high-precision 3D imaging in extreme environments through quantum-level photon detection technology. Current research primarily focuses on reconstructing 3D scenes from sparse photon events, whereas the semantic interpretation of single-photon images remains underexplored, due to high annotation costs and inefficient labeling strategies. This paper presents the first active learning framework for single-photon image classification. The core contribution is an imaging condition-aware sampling strategy that integrates synthetic augmentation to model variability across imaging conditions. By identifying samples where the model is both uncertain and sensitive to these conditions, the proposed method selectively annotates only the most informative examples. Experiments on both synthetic and real-world datasets show that our approach outperforms all baselines and achieves high classification accuracy with significantly fewer labeled samples. Specifically, our approach achieves 97% accuracy on synthetic single-photon data using only 1.5% labeled samples. On real-world data, we maintain 90.63% accuracy with just 8% labeled samples, which is 4.51% higher than the best-performing baseline. This illustrates that active learning enables the same level of classification performance on single-photon images as on classical images, opening doors to large-scale integration of single-photon data in real-world applications.
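其“成像条件感知”的样本选择可以理解为:同时度量模型的预测不确定性,以及预测对模拟成像条件扰动的敏感性。下面的采集函数仅为示意(两项直接相加为假设,论文的实际准则可能不同):

```python
import torch

@torch.no_grad()
def acquisition_scores(model, images, augment_fns):
    # 预测熵:模型不确定性
    probs = model(images).softmax(dim=-1)                      # (B, C)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)
    # 对若干模拟成像条件增广后的预测方差:条件敏感性
    aug_probs = torch.stack([model(f(images)).softmax(dim=-1)
                             for f in augment_fns])            # (A, B, C)
    sensitivity = aug_probs.var(dim=0).sum(-1)
    return entropy + sensitivity  # 取得分最高的若干样本送标(组合方式为假设)
```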
zh
[CV-33] Balancing Accuracy Calibration and Efficiency in Active Learning with Vision Transformers Under Label Noise
【速读】:该论文试图解决在存在标签噪声的情况下,视觉变换器(Vision Transformer, ViT)和Swin Transformer在下游任务中的性能表现及其对模型规模、补丁尺寸和主动学习策略的依赖问题。其解决方案的关键在于通过系统评估不同规模的ViT和Swin Transformer配置在CIFAR10和CIFAR100数据集上的分类准确性和校准性,分析模型大小、补丁尺寸以及主动学习策略在不同标签噪声水平下的影响,从而为资源受限环境下的模型微调或蒸馏提供实践指导。
链接: https://arxiv.org/abs/2505.04375
作者: Moseli Mots’oehli,Hope Mogale,Kyungim Baek
机构: University of Hawai‘i at Manoa (夏威夷大学马诺阿分校); University of Pretoria (普列托利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Fine-tuning pre-trained convolutional neural networks on ImageNet for downstream tasks is well-established. Still, the impact of model size on the performance of vision transformers in similar scenarios, particularly under label noise, remains largely unexplored. Given the utility and versatility of transformer architectures, this study investigates their practicality under low-budget constraints and noisy labels. We explore how classification accuracy and calibration are affected by symmetric label noise in active learning settings, evaluating four vision transformer configurations (Base and Large with 16x16 and 32x32 patch sizes) and three Swin Transformer configurations (Tiny, Small, and Base) on CIFAR10 and CIFAR100 datasets, under varying label noise rates. Our findings show that larger ViT models (ViTl32 in particular) consistently outperform their smaller counterparts in both accuracy and calibration, even under moderate to high label noise, while Swin Transformers exhibit weaker robustness across all noise levels. We find that smaller patch sizes do not always lead to better performance, as ViTl16 performs consistently worse than ViTl32 while incurring a higher computational cost. We also find that information-based Active Learning strategies only provide meaningful accuracy improvements at moderate label noise rates, but they result in poorer calibration compared to models trained on randomly acquired labels, especially at high label noise rates. We hope these insights provide actionable guidance for practitioners looking to deploy vision transformers in resource-constrained environments, where balancing model complexity, label noise, and compute efficiency is critical in model fine-tuning or distillation.
zh
[CV-34] WDMamba: When Wavelet Degradation Prior Meets Vision Mamba for Image Dehazing
【速读】:该论文旨在解决雾霾图像退化问题,其核心挑战在于如何有效恢复图像的清晰度并保留细节信息。解决方案的关键在于提出一种基于小波变换分析的新型去雾框架WDMamba,该框架通过将去雾任务分解为低频重建和细节增强两个阶段,实现从粗到细的逐步处理。在低频重建阶段,集成Mamba模块以线性复杂度重建全局结构,去除整体雾霾;在细节增强阶段,恢复可能被忽略的细粒度信息,最终输出去雾结果。此外,引入自引导对比正则化机制,利用粗略重建结果作为硬负例,提升模型的判别能力,从而显著提高去雾性能。
链接: https://arxiv.org/abs/2505.04369
作者: Jie Sun,Heng Liu,Yongzhen Wang,Xiao-Ping Zhang,Mingqiang Wei
机构: Anhui University of Technology(安徽理工大学); Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院); Nanjing University of Aeronautics and Astronautics(南京航空航天大学); Taiyuan University of Technology(太原理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we reveal a novel haze-specific wavelet degradation prior observed through wavelet transform analysis, which shows that haze-related information predominantly resides in low-frequency components. Exploiting this insight, we propose a novel dehazing framework, WDMamba, which decomposes the image dehazing task into two sequential stages: low-frequency restoration followed by detail enhancement. This coarse-to-fine strategy enables WDMamba to effectively capture features specific to each stage of the dehazing process, resulting in high-quality restored images. Specifically, in the low-frequency restoration stage, we integrate Mamba blocks to reconstruct global structures with linear complexity, efficiently removing overall haze and producing a coarse restored image. Thereafter, the detail enhancement stage reinstates fine-grained information that may have been overlooked during the previous phase, culminating in the final dehazed output. Furthermore, to enhance detail retention and achieve more natural dehazing, we introduce a self-guided contrastive regularization during network training. By utilizing the coarse restored output as a hard negative example, our model learns more discriminative representations, substantially boosting the overall dehazing performance. Extensive evaluations on public dehazing benchmarks demonstrate that our method surpasses state-of-the-art approaches both qualitatively and quantitatively. Code is available at this https URL.
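该由粗到细设计建立在小波分解之上:雾霾相关信息主要集中在低频近似分量。下面用 PyWavelets 演示单层二维离散小波变换(以灰度图为例,RGB 需逐通道处理;小波基选用 haar 仅为假设):

```python
import pywt

def wavelet_split(gray_image):
    # 单层 2D DWT:低频近似分量 + 三个方向的细节分量
    low, (lh, hl, hh) = pywt.dwt2(gray_image, "haar")
    return low, (lh, hl, hh)

# 雾霾集中于 low;修复低频后再逆变换即可重建图像:
# restored = pywt.idwt2((low_restored, (lh, hl, hh)), "haar")
```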
zh
[CV-35] CountDiffusion: Text-to-Image Synthesis with Training-Free Counting-Guidance Diffusion
【速读】:该论文试图解决文本到图像(Text-to-Image, T2I)生成模型在生成图像时准确控制物体数量的问题,这一问题由于计算成本高和模型难以理解数量的抽象概念而难以实现。解决方案的关键在于提出一种无需训练的框架CountDiffusion,其核心思想分为两个阶段:第一阶段通过扩散模型进行一步去噪生成中间结果,并利用计数模型确定物体数量;第二阶段通过修正模块调整注意力图以校正物体数量,从而在不改变原有模型结构的前提下提升T2I模型生成图像中物体数量的准确性。
链接: https://arxiv.org/abs/2505.04347
作者: Yanyu Li,Pencheng Wan,Liang Han,Yaowei Wang,Liqiang Nie,Min Zhang
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures, 3 tables
Abstract:Stable Diffusion has advanced text-to-image synthesis, but training models to generate images with accurate object quantity is still difficult due to the high computational cost and the challenge of teaching models the abstract concept of quantity. In this paper, we propose CountDiffusion, a training-free framework aiming at generating images with correct object quantity from textual descriptions. CountDiffusion consists of two stages. In the first stage, an intermediate denoising result is generated by the diffusion model to predict the final synthesized image with one-step denoising, and a counting model is used to count the number of objects in this image. In the second stage, a correction module is used to correct the object quantity by changing the attention map of the object with universal guidance. The proposed CountDiffusion can be plugged into any diffusion-based text-to-image (T2I) generation models without further training. Experiment results demonstrate the superiority of our proposed CountDiffusion, which improves the accurate object quantity generation ability of T2I models by a large margin.
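第一阶段的“一步去噪预览”对应 DDPM 的标准恒等式 x̂₀ = (x_t − √(1−ᾱ_t)·ε_θ(x_t, t)) / √ᾱ_t,计数模型即在该预览图上统计物体数量。最小示意如下:

```python
import torch

def predict_x0(xt, t, eps_pred, alphas_cumprod):
    # 由当前噪声样本 x_t 与预测噪声一步估计最终图像(标准 DDPM 恒等式)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)   # t: (B,) 的整型时间步
    return (xt - torch.sqrt(1.0 - a_bar) * eps_pred) / torch.sqrt(a_bar)
```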
zh
[CV-36] Multi-turn Consistent Image Editing
【速读】:该论文旨在解决现有图像编辑方法在处理模糊用户意图、复杂变换或需要逐步优化时,因仅支持单步修改而导致结果不一致或无法满足用户期望的问题。其解决方案的关键在于提出一种多轮图像编辑框架,该框架通过流匹配实现精确的图像逆向生成,并结合双目标线性二次调节器(LQR)确保采样稳定性,从而有效减少误差累积;同时,通过分析Transformer的层级作用,引入自适应注意力突出方法,在保持多轮一致性的同时提升可编辑性。
链接: https://arxiv.org/abs/2505.04320
作者: Zijun Zhou,Yingying Deng,Xiangyu He,Weiming Dong,Fan Tang
机构: Institute of Automation, Chinese Academy of Sciences, Beijing, China (自动化研究所,中国科学院,北京,中国); Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China (计算技术研究所,中国科学院,北京,中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Many real-world applications, such as interactive photo retouching, artistic content creation, and product design, require flexible and iterative image editing. However, existing image editing methods primarily focus on achieving the desired modifications in a single step, which often struggles with ambiguous user intent, complex transformations, or the need for progressive refinements. As a result, these methods frequently produce inconsistent outcomes or fail to meet user expectations. To address these challenges, we propose a multi-turn image editing framework that enables users to iteratively refine their edits, progressively achieving more satisfactory results. Our approach leverages flow matching for accurate image inversion and a dual-objective Linear Quadratic Regulator (LQR) for stable sampling, effectively mitigating error accumulation. Additionally, by analyzing the layer-wise roles of transformers, we introduce an adaptive attention highlighting method that enhances editability while preserving multi-turn coherence. Extensive experiments demonstrate that our framework significantly improves edit success rates and visual fidelity compared to existing methods.
zh
[CV-37] MoDE: Mixture of Diffusion Experts for Any Occluded Face Recognition
【速读】:该论文旨在解决遮挡面部识别(Occluded Face Recognition, OFR)中由于缺乏对遮挡先验知识而导致的性能不佳问题,特别是在处理不同类型和严重程度的遮挡面部时表现较差。其解决方案的关键在于提出一种基于扩散模型的专家混合架构(identity-gated mixture of diffusion experts, MoDE),其中每个基于扩散的生成专家估计一个可能的完整面部图像,并通过身份门控网络评估各重建面部对身份的贡献,从而在决策空间中自适应地整合预测结果,提升识别效果。
链接: https://arxiv.org/abs/2505.04306
作者: Qiannan Fan,Zhuoyang Li,Jitong Li,Chenyang Cao
机构: State Grid Tianjin Economic Research Institute(国网天津经济研究院); College of Intelligence and Computing(智能计算学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages,7 figures
Abstract:With the continuous impact of epidemics, people have become accustomed to wearing masks. However, most current occluded face recognition (OFR) algorithms lack prior knowledge of occlusions, resulting in poor performance when dealing with occluded faces of varying types and severity in reality. Recognizing occluded faces is still a significant challenge, which greatly affects the convenience of people's daily lives. In this paper, we propose an identity-gated mixture of diffusion experts (MoDE) for OFR. Each diffusion-based generative expert estimates one possible complete image for occluded faces. The random sampling process of the diffusion model inevitably introduces differences and variations between the inpainted faces and the real ones. To aggregate effective information from the multiple reconstructed faces, we introduce an identity-gating network to evaluate the contribution of each reconstructed face to the identity and adaptively integrate the predictions in the decision space. Moreover, our MoDE is a plug-and-play module for most existing face recognition models. Extensive experiments on three public face datasets and two datasets in the wild validate our advanced performance for various occlusions in comparison with the competing methods.
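身份门控的融合步骤可以概括为:门控网络为每个专家重建人脸的身份特征打分,再按 softmax 权重在决策空间加权求和。以下结构示意中的维度与打分方式均为假设:

```python
import torch
import torch.nn as nn

class IdentityGatedFusion(nn.Module):
    """对多个专家的身份特征做门控加权融合的示意,并非 MoDE 的精确设计。"""
    def __init__(self, feat_dim):
        super().__init__()
        self.gate = nn.Linear(feat_dim, 1)  # 为每个重建人脸的身份特征打分

    def forward(self, embeddings):                    # (E, B, D):每个专家一份特征
        scores = self.gate(embeddings).squeeze(-1)    # (E, B)
        weights = scores.softmax(dim=0).unsqueeze(-1) # 专家维度上的权重
        return (weights * embeddings).sum(dim=0)      # 融合后的身份特征 (B, D)
```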
zh
[CV-38] TS-Diff: Two-Stage Diffusion Model for Low-Light RAW Image Enhancement IJCNN
【速读】:该论文旨在解决极端低光条件下RAW图像增强的问题,特别是在噪声抑制、泛化能力和颜色一致性方面存在挑战。其解决方案的关键在于提出一种两阶段扩散模型(Two-Stage Diffusion Model, TS-Diff),该模型首先通过构建多个虚拟相机在预训练阶段合成噪声图像,并利用相机特征融合(Camera Feature Integration, CFI)模块学习跨不同虚拟相机的通用特征;随后在对齐阶段,通过平均CFI模块生成目标特定的CFI^T,并使用少量真实RAW数据进行微调以适应特定相机的噪声特性。此外,引入结构重参数化技术以简化CFI^T,提升部署效率,并通过颜色校正器确保扩散过程中的颜色一致性。
链接: https://arxiv.org/abs/2505.04281
作者: Yi Li,Zhiyuan Zhang,Jiangnan Xia,Jianghan Cheng,Qilong Wu,Junwei Li,Yibin Tian,Hui Kong
机构: Zhejiang University (浙江大学); Singapore Management University (新加坡管理大学); Shenzhen University (深圳大学); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: International Joint Conference on Neural Networks (IJCNN)
Abstract:This paper presents a novel Two-Stage Diffusion Model (TS-Diff) for enhancing extremely low-light RAW images. In the pre-training stage, TS-Diff synthesizes noisy images by constructing multiple virtual cameras based on a noise space. Camera Feature Integration (CFI) modules are then designed to enable the model to learn generalizable features across diverse virtual cameras. During the aligning stage, CFIs are averaged to create a target-specific CFI^T, which is fine-tuned using a small amount of real RAW data to adapt to the noise characteristics of specific cameras. A structural reparameterization technique further simplifies CFI^T for efficient deployment. To address color shifts during the diffusion process, a color corrector is introduced to ensure color consistency by dynamically adjusting global color distributions. Additionally, a novel dataset, QID, is constructed, featuring quantifiable illumination levels and a wide dynamic range, providing a comprehensive benchmark for training and evaluation under extreme low-light conditions. Experimental results demonstrate that TS-Diff achieves state-of-the-art performance on multiple datasets, including QID, SID, and ELD, excelling in denoising, generalization, and color consistency across various cameras and illumination levels. These findings highlight the robustness and versatility of TS-Diff, making it a practical solution for low-light imaging applications. Source codes and models are available at this https URL
zh
[CV-39] HDiffTG: A Lightweight Hybrid Diffusion-Transformer-GCN Architecture for 3D Human Pose Estimation IJCNN
【速读】:该论文旨在解决3D人体姿态估计(3DHPE)中的精度与鲁棒性问题,特别是在遮挡和复杂场景下的表现。其解决方案的关键在于提出一种融合Transformer、图卷积网络(GCN)和扩散模型的统一框架HDiffTG,通过Transformer捕捉全局时空依赖性,GCN建模局部骨骼结构,扩散模型实现逐步优化,从而在全局与局部特征之间实现互补平衡,提升模型在复杂环境下的性能。同时,通过轻量化优化和目标函数改进,在保持高性能的同时降低计算开销。
链接: https://arxiv.org/abs/2505.04276
作者: Yajie Fu,Chaorui Huang,Junwei Li,Hui Kong,Yibin Tian,Huakang Li,Zhiyuan Zhang
机构: Zhejiang University (浙江大学); University of Macau (澳门大学); Shenzhen University (深圳大学); Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 8 pages, 4 figures, International Joint Conference on Neural Networks (IJCNN)
Abstract:We propose HDiffTG, a novel 3D Human Pose Estimation (3DHPE) method that integrates Transformer, Graph Convolutional Network (GCN), and diffusion model into a unified framework. HDiffTG leverages the strengths of these techniques to significantly improve pose estimation accuracy and robustness while maintaining a lightweight design. The Transformer captures global spatiotemporal dependencies, the GCN models local skeletal structures, and the diffusion model provides step-by-step optimization for fine-tuning, achieving a complementary balance between global and local features. This integration enhances the model’s ability to handle pose estimation under occlusions and in complex scenarios. Furthermore, we introduce lightweight optimizations to the integrated model and refine the objective function design to reduce computational overhead without compromising performance. Evaluation results on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HDiffTG achieves state-of-the-art (SOTA) performance on the MPI-INF-3DHP dataset while excelling in both accuracy and computational efficiency. Additionally, the model exhibits exceptional robustness in noisy and occluded environments. Source codes and models are available at this https URL
zh
[CV-40] Object-Shot Enhanced Grounding Network for Egocentric Video CVPR2025
【速读】:该论文旨在解决第一视角视频定位(egocentric video grounding)任务中的关键问题,即现有方法主要关注第一视角与第三人称视角视频的分布差异,而忽视了第一视角视频的核心特征以及由问题类型查询所强调的细粒度信息。其解决方案的关键在于提出OSGNet,通过从视频中提取物体信息以丰富视频表征,特别是对文本查询中提及但未在视频特征中直接捕获的物体进行增强;同时分析第一视角视频中常见的镜头运动,利用这些特性提取佩戴者的注意力信息,从而提升模型在模态对齐方面的能力。
链接: https://arxiv.org/abs/2505.04270
作者: Yisen Feng,Haoyu Zhang,Meng Liu,Weili Guan,Liqiang Nie
机构: Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳)); Pengcheng Laboratory (鹏城实验室); Shandong Jianzhu University (山东建筑大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2025
Abstract:Egocentric video grounding is a crucial task for embodied intelligence applications, distinct from exocentric video moment localization. Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitations, we propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. Specifically, we extract object information from videos to enrich video representation, particularly for objects highlighted in the textual query but not directly captured in the video features. Additionally, we analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer’s attention information, which enhances the model’s ability to perform modality alignment. Experiments conducted on three datasets demonstrate that OSGNet achieves state-of-the-art performance, validating the effectiveness of our approach. Our code can be found at this https URL.
zh
[CV-41] Bridging Geometry-Coherent Text-to-3D Generation with Multi-View Diffusion Priors and Gaussian Splatting
【速读】:该论文旨在解决基于预训练2D扩散模型进行文本到3D生成时存在的多视角相关性忽略问题,从而导致生成的3D内容出现几何不一致和多面体伪影。其解决方案的关键在于提出耦合得分蒸馏(Coupled Score Distillation, CSD),通过引入多视角联合分布先验,确保几何一致性的3D生成,并实现对3D高斯点云(3D Gaussian Splatting)的稳定直接优化。该方法将优化过程重新表述为多视角联合优化问题,有效耦合多视角先验以指导不同视角下的优化,同时保持生成3D资产的多样性。
链接: https://arxiv.org/abs/2505.04262
作者: Feng Yang,Wenliang Qian,Wangmeng Zuo,Hui Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Score Distillation Sampling (SDS) leverages pretrained 2D diffusion models to advance text-to-3D generation but neglects multi-view correlations, being prone to geometric inconsistencies and multi-face artifacts in the generated 3D content. In this work, we propose Coupled Score Distillation (CSD), a framework that couples multi-view joint distribution priors to ensure geometrically consistent 3D generation while enabling the stable and direct optimization of 3D Gaussian Splatting. Specifically, by reformulating the optimization as a multi-view joint optimization problem, we derive an optimization rule that effectively couples multi-view priors to guide optimization across different viewpoints while preserving the diversity of generated 3D assets. Additionally, we propose a framework that directly optimizes 3D Gaussian Splatting (3D-GS) with random initialization to generate geometrically consistent 3D content. We further employ a deformable tetrahedral grid, initialized from 3D-GS and refined through CSD, to produce high-quality, refined meshes. Quantitative and qualitative experimental results demonstrate the efficiency and competitive quality of our approach.
zh
[CV-42] RGB-Event Fusion with Self-Attention for Collision Prediction
【速读】:该论文旨在解决自主机器人在动态现实环境中实现鲁棒且实时的障碍物避让问题,其核心挑战在于准确预测无人机与动态物体的碰撞时间和位置。解决方案的关键在于提出一种基于RGB和事件触发视觉传感器的神经网络框架,该框架采用两个独立的编码分支分别处理两种模态数据,并通过自注意力机制进行特征融合,以提升预测精度。
链接: https://arxiv.org/abs/2505.04258
作者: Pietro Bonazzi,Christian Vogt,Michael Jost,Haotong Qin,Lyes Khacef,Federico Paredes-Valles,Michele Magno
机构: ETH Zürich (ETH Zurich); Sony Semiconductor Solutions Europe (Sony Semiconductor Solutions Europe); Sony Europe B.V. (Sony Europe B.V.)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Ensuring robust and real-time obstacle avoidance is critical for the safe operation of autonomous robots in dynamic, real-world environments. This paper proposes a neural network framework for predicting the time and collision position of an unmanned aerial vehicle with a dynamic object, using RGB and event-based vision sensors. The proposed architecture consists of two separate encoder branches, one for each modality, followed by fusion by self-attention to improve prediction accuracy. To facilitate benchmarking, we leverage the collected ABCD [8] dataset, which enables detailed comparisons of single-modality and fusion-based approaches. At the same prediction throughput of 50Hz, the experimental results show that the fusion-based model offers an improvement in prediction accuracy over single-modality approaches of 1% on average and 10% for distances beyond 0.5m, but comes at the cost of +71% in memory and +105% in FLOPs. Notably, the event-based model outperforms the RGB model by 4% for position and 26% for time error at a similar computational cost, making it a competitive alternative. Additionally, we evaluate quantized versions of the event-based models, applying 1- to 8-bit quantization to assess the trade-offs between predictive performance and computational efficiency. These findings highlight the trade-offs of multi-modal perception using RGB and event-based cameras in robotic applications.
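其双分支编码加自注意力融合的结构,可用如下 PyTorch 草图表达(通道数、事件表示为两通道极性图、回归头输出碰撞时间与位置等均为假设,并非论文的确切网络):

```python
import torch
import torch.nn as nn

class RGBEventFusion(nn.Module):
    """RGB 与事件相机双分支编码 + 自注意力融合的结构示意。"""
    def __init__(self, dim=128):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, dim, 7, stride=4), nn.ReLU())
        self.evt_enc = nn.Sequential(nn.Conv2d(2, dim, 7, stride=4), nn.ReLU())
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, 3)  # 假设输出 (碰撞时间, x, y)

    def forward(self, rgb, events):
        tokens = lambda f: f.flatten(2).transpose(1, 2)  # (B, HW, dim)
        x = torch.cat([tokens(self.rgb_enc(rgb)),
                       tokens(self.evt_enc(events))], dim=1)
        x, _ = self.attn(x, x, x)          # 跨模态 token 上的自注意力融合
        return self.head(x.mean(dim=1))    # 池化后回归
```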
zh
[CV-43] A Weak Supervision Learning Approach Towards an Equitable Parking Lot Occupancy Estimation
【速读】:该论文旨在解决高分辨率遥感图像标注数据稀缺且成本高昂的问题,特别是在低收入地区,高分辨率数据较为匮乏。其解决方案的关键在于提出一种弱监督框架,利用粗粒度的时间标签(基于德国大型超市和五金店的停车场在周六通常满载、周日通常空置的假设),训练一个成对比较模型,从而估计停车场占用情况,该模型在大型停车场上的AUC达到了0.92。该方法减少了对昂贵高分辨率图像的依赖,并为城市交通分析提供了可扩展的途径。
链接: https://arxiv.org/abs/2505.04229
作者: Theophilus Aidoo,Till Koebe,Akansh Maurya,Hewan Shrestha,Ingmar Weber
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
Abstract:The scarcity and high cost of labeled high-resolution imagery have long challenged remote sensing applications, particularly in low-income regions where high-resolution data are scarce. In this study, we propose a weak supervision framework that estimates parking lot occupancy using 3m resolution satellite imagery. By leveraging coarse temporal labels – based on the assumption that parking lots of major supermarkets and hardware stores in Germany are typically full on Saturdays and empty on Sundays – we train a pairwise comparison model that achieves an AUC of 0.92 on large parking lots. The proposed approach minimizes the reliance on expensive high-resolution images and holds promise for scalable urban mobility analysis. Moreover, the method can be adapted to assess transit patterns and resource allocation in vulnerable communities, providing a data-driven basis to improve the well-being of those most in need.
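其弱监督的成对比较目标可以写成排序损失:同一停车场周六图像的占用得分应高于周日图像。以下 logistic 排序损失仅为该思想的示意,论文可能采用其他形式:

```python
import torch.nn.functional as F

def pairwise_occupancy_loss(model, sat_img, sun_img):
    # model 对每张图输出一个标量占用得分;鼓励 周六得分 > 周日得分
    s_sat = model(sat_img)
    s_sun = model(sun_img)
    return F.softplus(-(s_sat - s_sun)).mean()  # logistic 排序损失(假设的形式)
```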
zh
[CV-44] CM1 - A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Language Models
【速读】:该论文旨在解决从手写文档中自动提取关键值信息的问题,这是文档分析中的一个核心挑战。其解决方案的关键在于利用大型视觉语言模型(Large Vision Language Models, LVLMs),特别是在标注训练数据有限的情况下,这些模型能够通过其庞大的规模和广泛的预训练优势,表现出优于传统全页提取模型的性能。
链接: https://arxiv.org/abs/2505.04214
作者: Fabian Wolf,Oliver Tüselmann,Arthur Matei,Lukas Hennies,Christoph Rass,Gernot A. Fink
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The automatic extraction of key-value information from handwritten documents is a key challenge in document analysis. A reliable extraction is a prerequisite for the mass digitization efforts of many archives. Large Vision Language Models (LVLM) are a promising technology to tackle this problem especially in scenarios where little annotated training data is available. In this work, we present a novel dataset specifically designed to evaluate the few-shot capabilities of LVLMs. The CM1 documents are a historic collection of forms with handwritten entries created in Europe to administer the Care and Maintenance program after World War Two. The dataset establishes three benchmarks on extracting name and birthdate information and, furthermore, considers different training set sizes. We provide baseline results for two different LVLMs and compare performances to an established full-page extraction model. While the traditional full-page model achieves highly competitive performances, our experiments show that when only a few training samples are available the considered LVLMs benefit from their size and heavy pretraining and outperform the classical approach.
zh
[CV-45] An Enhanced YOLOv8 Model for Real-Time and Accurate Pothole Detection and Measurement
【速读】:该论文试图解决传统道路坑洼检测方法仅依赖2D RGB图像而无法准确分析坑洼物理特征的问题(Pothole Detection)。其关键解决方案是构建了一个公开的RGB-D图像数据集(PothRGBD),并提出了一种改进的YOLOv8模型,该模型通过引入动态蛇形卷积(Dynamic Snake Convolution, DSConv)、简单注意力模块(Simple Attention Module, SimAM)和高斯误差线性单元(Gaussian Error Linear Unit, GELU)结构,提升了坑洼边缘结构的分割精度,并实现了深度图上的周长和深度测量。
链接: https://arxiv.org/abs/2505.04207
作者: Mustafa Yurdakul,Şakir Tasdemir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Potholes cause vehicle damage and traffic accidents, creating serious safety and economic problems. Therefore, early and accurate detection of potholes is crucial. Existing detection methods are usually only based on 2D RGB images and cannot accurately analyze the physical characteristics of potholes. In this paper, a publicly available dataset of RGB-D images (PothRGBD) is created and an improved YOLOv8-based model is proposed for both pothole detection and pothole physical features analysis. The Intel RealSense D415 depth camera was used to collect RGB and depth data from the road surfaces, resulting in a PothRGBD dataset of 1000 images. The data was labeled in YOLO format suitable for segmentation. A novel YOLO model is proposed based on the YOLOv8n-seg architecture, which is structurally improved with Dynamic Snake Convolution (DSConv), Simple Attention Module (SimAM) and Gaussian Error Linear Unit (GELU). The proposed model segmented potholes with irregular edge structure more accurately, and performed perimeter and depth measurements on depth maps with high accuracy. The standard YOLOv8n-seg model achieved 91.9% precision, 85.2% recall and 91.9% mAP@50. With the proposed model, the values increased to 93.7%, 90.4% and 93.8% respectively. Thus, an improvement of 1.96% in precision, 6.13% in recall and 2.07% in mAP was achieved. The proposed model performs pothole detection as well as perimeter and depth measurement with high accuracy and is suitable for real-time applications due to its low model complexity. In this way, a lightweight and effective model that can be used in deep learning-based intelligent transportation solutions has been acquired.
zh
[CV-46] SToLa: Self-Adaptive Touch-Language Framework with Tactile Commonsense Reasoning in Open-Ended Scenarios
【速读】:该论文试图解决将触觉感知整合到智能系统中以实现多模态推理所面临的挑战,特别是对开放性物理世界的常识推理问题。其关键挑战包括模态差异(现有大型触觉-语言模型常将触觉视为语言的子模态)和开放性触觉数据稀缺(当前数据集缺乏多样性、开放性和复杂性)。解决方案的关键在于提出SToLa框架,该框架利用混合专家(Mixture of Experts, MoE)动态处理、统一和管理触觉与语言模态,捕捉其独特特征,并构建了一个全面的触觉常识推理数据集和基准测试。实验表明,SToLa在PhysiCLeAR基准和自建数据集上表现出色,验证了MoE架构在多模态管理中的有效性及在开放场景下触觉常识推理任务中的性能优势。
链接: https://arxiv.org/abs/2505.04201
作者: Ning Cheng,Jinan Xu,Jialing Chen,Wenjuan Han
机构: Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper explores the challenges of integrating tactile sensing into intelligent systems for multimodal reasoning, particularly in enabling commonsense reasoning about the open-ended physical world. We identify two key challenges: modality discrepancy, where existing large touch-language models often treat touch as a mere sub-modality of language, and open-ended tactile data scarcity, where current datasets lack the diversity, open-endedness, and complexity needed for reasoning. To overcome these challenges, we introduce SToLa, a Self-Adaptive Touch-Language framework. SToLa utilizes Mixture of Experts (MoE) to dynamically process, unify, and manage tactile and language modalities, capturing their unique characteristics. Crucially, we also present a comprehensive tactile commonsense reasoning dataset and benchmark featuring free-form questions and responses, 8 physical properties, 4 interactive characteristics, and diverse commonsense knowledge. Experiments show SToLa exhibits competitive performance compared to existing models on the PhysiCLeAR benchmark and self-constructed datasets, proving the effectiveness of the Mixture of Experts architecture in multimodal management and the performance advantages for open-scenario tactile commonsense reasoning tasks.
zh
[CV-47] S3D: Sketch-Driven 3D Model Generation CVPR’25
【速读】:该论文旨在解决从2D草图生成高质量3D模型的问题,这一任务由于草图数据固有的模糊性和稀疏性而具有挑战性。其解决方案的关键在于提出一种名为S3D的框架,该框架采用基于U-Net的编码器-解码器结构将草图转换为面分割掩码,进而生成可从新视角渲染的3D表示。为了确保草图域与3D输出之间的鲁棒一致性,引入了一种新颖的风格对齐损失,通过将U-Net瓶颈特征与3D生成模块的初始编码器输出对齐,显著提升了重建精度。
链接: https://arxiv.org/abs/2505.04185
作者: Hail Song,Wonsik Shin,Naeun Lee,Soomin Chung,Nojun Kwak,Woontack Woo
机构: KAIST(韩国科学技术院); Seoul National University(首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted as a short paper to the GMCV Workshop at CVPR’25
Abstract:Generating high-quality 3D models from 2D sketches is a challenging task due to the inherent ambiguity and sparsity of sketch data. In this paper, we present S3D, a novel framework that converts simple hand-drawn sketches into detailed 3D models. Our method utilizes a U-Net-based encoder-decoder architecture to convert sketches into face segmentation masks, which are then used to generate a 3D representation that can be rendered from novel views. To ensure robust consistency between the sketch domain and the 3D output, we introduce a novel style-alignment loss that aligns the U-Net bottleneck features with the initial encoder outputs of the 3D generation module, significantly enhancing reconstruction fidelity. To further enhance the network’s robustness, we apply augmentation techniques to the sketch dataset. This streamlined framework demonstrates the effectiveness of S3D in generating high-quality 3D models from sketch inputs. The source code for this project is publicly available at this https URL.
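风格对齐损失的一种常见实现,是将 U-Net 瓶颈特征与 3D 生成模块编码器的初始输出做余弦对齐。以下仅为该思想的示意,具体损失形式以论文为准:

```python
import torch.nn.functional as F

def style_alignment_loss(unet_bottleneck, enc3d_initial):
    # 将两路特征展平并归一化后做余弦对齐(假设的形式)
    b = F.normalize(unet_bottleneck.flatten(1), dim=1)
    e = F.normalize(enc3d_initial.flatten(1), dim=1)
    return (1.0 - (b * e).sum(dim=1)).mean()
```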
zh
[CV-48] DOTA: Deformable Optimized Transformer Architecture for End-to-End Text Recognition with Retrieval-Augmented Generation
【速读】:该论文旨在解决自然图像中的文本识别问题,这是一个在计算机视觉和自然语言处理领域中具有广泛应用但依然具有挑战性的任务。其解决方案的关键在于提出了一种结合ResNet和Vision Transformer主干网络的端到端框架,并引入了可变形卷积、检索增强生成和条件随机场(CRF)等先进方法,以提升特征表示能力和OCR性能。具体而言,该框架通过在第三和第四块中替换标准卷积层为可变形卷积,利用自适应丢弃进行正则化,并引入CRF进行更精细的序列建模,从而显著提升了文本识别的准确性。
链接: https://arxiv.org/abs/2505.04175
作者: Naphat Nithisopa,Teerapong Panboonyuen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Text recognition in natural images remains a challenging yet essential task, with broad applications spanning computer vision and natural language processing. This paper introduces a novel end-to-end framework that combines ResNet and Vision Transformer backbones with advanced methodologies, including Deformable Convolutions, Retrieval-Augmented Generation, and Conditional Random Fields (CRF). These innovations collectively enhance feature representation and improve Optical Character Recognition (OCR) performance. Specifically, the framework substitutes standard convolution layers in the third and fourth blocks with Deformable Convolutions, leverages adaptive dropout for regularization, and incorporates CRF for more refined sequence modeling. Extensive experiments conducted on six benchmark datasets (IC13, IC15, SVT, IIIT5K, SVTP, and CUTE80) validate the proposed method's efficacy, achieving notable accuracies: 97.32% on IC13, 58.26% on IC15, 88.10% on SVT, 74.13% on IIIT5K, 82.17% on SVTP, and 66.67% on CUTE80, resulting in an average accuracy of 77.77%. These results establish a new state-of-the-art for text recognition, demonstrating the robustness of the approach across diverse and challenging datasets.
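将标准卷积替换为可变形卷积,可直接借助 torchvision.ops.DeformConv2d 实现:偏移量由一个普通卷积预测,每个核位置学习 (dx, dy) 两个偏移。以下 3×3 可变形卷积块为示意,具体配置属于假设:

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """用可变形卷积替换标准 3x3 卷积的示意块。"""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 每个核位置 2 个偏移 (dx, dy),3x3 核共 2*3*3 个通道
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.dconv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.dconv(x, self.offset(x))
```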
zh
[CV-49] Learning from Similarity Proportion Loss for Classifying Skeletal Muscle Recovery Stages MICCAI2024
【速读】:该论文试图解决肌肉组织再生过程评估中依赖人工视觉检查的主观性和低效性问题,以及现有弱监督学习方法在处理肌肉组织特征提取和恢复阶段有序信息时的不足。其解决方案的关键在于提出一种基于相似性比例的序数尺度学习(Ordinal Scale Learning from Similarity Proportion, OSLSP),通过利用两个样本集合的相似性比例损失来更新特征提取器,并引入类别比例注意力机制以捕捉恢复阶段的序数信息,从而提升骨骼肌恢复阶段分类任务的性能。
链接: https://arxiv.org/abs/2505.04150
作者: Yu Yamaoka or Weng Ian Chan,Shigeto Seno,Soichiro Fukada,Hideo Matsuda
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: MICCAI2024 workshop ADSMI in Morocco (oral) [Peer-reviewed]
Abstract:Evaluating the regeneration process of damaged muscle tissue is a fundamental analysis in muscle research to measure experimental effect sizes and uncover mechanisms behind muscle weakness due to aging and disease. The conventional approach to assessing muscle tissue regeneration involves whole-slide imaging and expert visual inspection of the recovery stages based on the morphological information of cells and fibers. There is a need to replace these tasks with automated methods incorporating machine learning techniques to ensure a quantitative and objective analysis. Given the limited availability of fully labeled data, a possible approach is Learning from Label Proportions (LLP), a weakly supervised learning method using class label proportions. However, current LLP methods have two limitations: (1) they cannot adapt the feature extractor for muscle tissues, and (2) they treat the classes representing recovery stages and cell morphological changes as nominal, resulting in the loss of ordinal information. To address these issues, we propose Ordinal Scale Learning from Similarity Proportion (OSLSP), which uses a similarity proportion loss derived from two bag combinations. OSLSP can update the feature extractor by using class proportion attention to the ordinal scale of the class. Our model with OSLSP outperforms large-scale pre-trained and fine-tuning models in classification tasks of skeletal muscle recovery stages.
[CV-50] R3-VQA: “Read the Room” by Video Social Reasoning
【Quick Read】: This paper addresses the lack of complexity in current social reasoning tasks and datasets, e.g., simple scenes, basic interactions, incomplete mental-state variables, and single-step reasoning only. The key to its solution is R^3-VQA, a high-quality video dataset with precise, fine-grained annotations of social events and mental states (belief, intent, desire, and emotion), together with the corresponding social causal chains in complex social scenarios. Human-annotated and model-generated QA pairs are also included to comprehensively evaluate the social reasoning capabilities of current large vision-language models (LVLMs).
Link: https://arxiv.org/abs/2505.04147
Authors: Lixing Niu, Jiapeng Li, Xingping Yu, Shu Wang, Ruining Feng, Bo Wu, Ping Wei, Yisen Wang, Lifeng Fan
Affiliations: Peking University; Xi’an Jiaotong University; Beijing Institute for General Artificial Intelligence; Yuanpei College, Peking University; Tsinghua University; University of California, Los Angeles; MIT-IBM Watson AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:“Read the room” is a significant social reasoning capability in human daily life. Humans can infer others’ mental states from subtle social cues. Previous social reasoning tasks and datasets lack complexity (e.g., simple scenes, basic interactions, incomplete mental state variables, single-step reasoning, etc.) and fall far short of the challenges present in real-life social interactions. In this paper, we contribute a valuable, high-quality, and comprehensive video dataset named R^3-VQA with precise and fine-grained annotations of social events and mental states (i.e., belief, intent, desire, and emotion) as well as corresponding social causal chains in complex social scenarios. Moreover, we include human-annotated and model-generated QAs. Our task R^3-VQA includes three aspects: Social Event Understanding, Mental State Estimation, and Social Causal Reasoning. As a benchmark, we comprehensively evaluate the social reasoning capabilities and consistencies of current state-of-the-art large vision-language models (LVLMs). Comprehensive experiments show that (i) LVLMs are still far from human-level consistent social reasoning in complex social scenarios; (ii) Theory of Mind (ToM) prompting can help LVLMs perform better on social reasoning tasks. We provide some of our dataset and codes in supplementary material and will release our full dataset and codes upon acceptance.
[CV-51] Vision Graph Prompting via Semantic Low-Rank Decomposition ICML2025
【Quick Read】: This paper addresses a shortcoming of existing visual prompting methods when adapted to graph-based Vision GNNs (ViG): designed mainly for Transformer-based models, they ignore the rich topological relationships among nodes and edges in graph representations, limiting their ability to model complex semantics. The key to its solution is Vision Graph Prompting (VGP), a novel framework tailored to vision graph structures. Its core insight is that semantically connected components in the graph exhibit low-rank properties; the proposed semantic low-rank prompting method decomposes low-rank semantic features and integrates them with prompts on the vision graph topology, capturing both global structural patterns and fine-grained semantic dependencies.
Link: https://arxiv.org/abs/2505.04121
Authors: Zixiang Ai, Zichen Liu, Jiahuan Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2025
Abstract:Vision GNN (ViG) demonstrates superior performance by representing images as graph structures, providing a more natural way to capture irregular semantic patterns beyond traditional grid or sequence-based representations. To efficiently adapt ViG to downstream tasks, parameter-efficient fine-tuning techniques like visual prompting become increasingly essential. However, existing prompting methods are primarily designed for Transformer-based models, neglecting the rich topological relationships among nodes and edges in graph-based representations, limiting their capacity to model complex semantics. In this paper, we propose Vision Graph Prompting (VGP), a novel framework tailored for vision graph structures. Our core insight reveals that semantically connected components in the graph exhibit low-rank properties. Building on this observation, we introduce a semantic low-rank prompting method that decomposes low-rank semantic features and integrates them with prompts on vision graph topologies, capturing both global structural patterns and fine-grained semantic dependencies. Extensive experiments demonstrate our method significantly improves ViG’s transfer performance on diverse downstream tasks, achieving results comparable to full fine-tuning while maintaining parameter efficiency. Our code is available at this https URL.
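The low-rank prompt idea can be made concrete in a few lines: instead of learning a full prompt matrix over all node features, learn two thin factors whose product is added to the node embeddings. This is a minimal sketch of low-rank prompting under assumed node count, feature dimension, and rank; it is not the released VGP code.

```python
import torch
import torch.nn as nn

class LowRankPrompt(nn.Module):
    """Adds a rank-r learnable prompt to (num_nodes, dim) node features."""
    def __init__(self, num_nodes: int, dim: int, rank: int = 4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(num_nodes, rank) * 0.02)  # thin factor
        self.B = nn.Parameter(torch.randn(rank, dim) * 0.02)        # thin factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.A @ self.B   # additive rank-r prompt

x = torch.randn(196, 192)            # e.g. 196 graph nodes, 192-dim features
print(LowRankPrompt(196, 192)(x).shape)   # torch.Size([196, 192])
```

The parameter count drops from num_nodes * dim to rank * (num_nodes + dim), which is the usual motivation for a low-rank factorization.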
[CV-52] GAPrompt: Geometry-Aware Point Cloud Prompt for 3D Vision Model ICML2025
【Quick Read】: This paper targets the high computational and storage cost of fully fine-tuning pre-trained 3D vision models for downstream tasks. Existing parameter-efficient fine-tuning (PEFT) methods rely mainly on input token prompting, but their performance is limited because they cannot adequately capture the geometric information inherent in point clouds. The key to the proposed Geometry-Aware Point Cloud Prompt (GAPrompt) is its use of geometric cues: a Point Prompt and a Point Shift Prompter respectively strengthen the model's grasp of fine-grained geometric detail and global shape information, while a Prompt Propagation mechanism injects the shape information into the feature-extraction process, improving adaptability and performance.
Link: https://arxiv.org/abs/2505.04119
Authors: Zixiang Ai, Zichen Liu, Yuanhang Lei, Zhenyu Cui, Xu Zou, Jiahuan Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2025
Abstract:Pre-trained 3D vision models have gained significant attention for their promising performance on point cloud data. However, fully fine-tuning these models for downstream tasks is computationally expensive and storage-intensive. Existing parameter-efficient fine-tuning (PEFT) approaches, which focus primarily on input token prompting, struggle to achieve competitive performance due to their limited ability to capture the geometric information inherent in point clouds. To address this challenge, we propose a novel Geometry-Aware Point Cloud Prompt (GAPrompt) that leverages geometric cues to enhance the adaptability of 3D vision models. First, we introduce a Point Prompt that serves as an auxiliary input alongside the original point cloud, explicitly guiding the model to capture fine-grained geometric details. Additionally, we present a Point Shift Prompter designed to extract global shape information from the point cloud, enabling instance-specific geometric adjustments at the input level. Moreover, our proposed Prompt Propagation mechanism incorporates the shape information into the model’s feature extraction process, further strengthening its ability to capture essential geometric characteristics. Extensive experiments demonstrate that GAPrompt significantly outperforms state-of-the-art PEFT methods and achieves competitive results compared to full fine-tuning on various benchmarks, while utilizing only 2.19% of trainable parameters. Our code is available at this https URL.
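As a rough picture of the auxiliary point-prompt idea, the sketch below appends a small set of learnable 3D points to each input cloud before it enters the (typically frozen) encoder. This is a generic sketch with assumed shapes, not the actual GAPrompt modules, which additionally include a Point Shift Prompter and prompt propagation.

```python
import torch
import torch.nn as nn

class PointPrompt(nn.Module):
    """Concatenates k learnable prompt points to every input point cloud."""
    def __init__(self, k: int = 16):
        super().__init__()
        self.prompt_xyz = nn.Parameter(torch.zeros(k, 3))   # learnable 3D points

    def forward(self, pts: torch.Tensor) -> torch.Tensor:   # pts: (B, N, 3)
        b = pts.shape[0]
        prompt = self.prompt_xyz.unsqueeze(0).expand(b, -1, -1)
        return torch.cat([pts, prompt], dim=1)               # (B, N + k, 3)

cloud = torch.rand(2, 1024, 3)
print(PointPrompt()(cloud).shape)   # torch.Size([2, 1040, 3])
```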
[CV-53] One2Any: One-Reference 6D Pose Estimation for Any Object CVPR2025
【Quick Read】: This paper targets the poor generalization of 6D object pose estimation to novel objects, especially when no complete 3D model, multi-view images, or category constraints are available. The key to its solution, One2Any, is to estimate the relative 6-degrees-of-freedom (DOF) pose from only a single reference image and a single query RGB-D image through an encode-decode framework: a comprehensive Reference Object Pose Embedding (ROPE) first encodes the object's shape, orientation, and texture from the single reference view, after which a U-Net-based pose decoding module produces Reference Object Coordinates (ROC) for new views, enabling fast and accurate pose estimation.
Link: https://arxiv.org/abs/2505.04109
Authors: Mengya Liu, Siyuan Li, Ajad Chhatkuli, Prune Truong, Luc Van Gool, Federico Tombari
Affiliations: ETH Zurich; INSAIT, Sofia University “St. Kliment Ohridski”; Google; TUM
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by CVPR 2025
Abstract:6D object pose estimation remains challenging for many applications due to dependencies on complete 3D models, multi-view images, or training limited to specific object categories. These requirements make generalization to novel objects difficult for which neither 3D models nor multi-view images may be available. To address this, we propose a novel method One2Any that estimates the relative 6-degrees of freedom (DOF) object pose using only a single reference-single query RGB-D image, without prior knowledge of its 3D model, multi-view data, or category constraints. We treat object pose estimation as an encoding-decoding process, first, we obtain a comprehensive Reference Object Pose Embedding (ROPE) that encodes an object shape, orientation, and texture from a single reference view. Using this embedding, a U-Net-based pose decoding module produces Reference Object Coordinate (ROC) for new views, enabling fast and accurate pose estimation. This simple encoding-decoding framework allows our model to be trained on any pair-wise pose data, enabling large-scale training and demonstrating great scalability. Experiments on multiple benchmark datasets demonstrate that our model generalizes well to novel objects, achieving state-of-the-art accuracy and robustness even rivaling methods that require multi-view or CAD inputs, at a fraction of compute.
[CV-54] MAISY: Motion-Aware Image SYnthesis for MedicalImage Motion Correction
【Quick Read】: This paper addresses the blurring, ghosting, and organ distortion caused by patient motion during medical image acquisition, which compromise accurate interpretation. Existing GAN-based methods learn the mapping between corrupted and ground-truth images with a Structural Similarity Index Measure (SSIM) loss to generate motion-free images, but they have two main limitations: they focus on global structural characteristics while overlooking localized features that often carry critical pathological information, and the SSIM loss struggles with images whose pixel intensity, luminance, and variance vary widely. The key to the proposed Motion-Aware Image SYnthesis (MAISY) is twofold: it leverages the Segment Anything Model (SAM) to dynamically learn spatiotemporal patterns along anatomical boundaries where motion artifacts are most pronounced, and it introduces a Variance-Selective SSIM (VS-SSIM) loss that adaptively emphasizes high-pixel-variance regions so that essential anatomical detail is preserved during artifact correction.
Link: https://arxiv.org/abs/2505.04105
Authors: Andrew Zhang, Hao Wang, Shuchang Ye, Michael Fulham, Jinman Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Patient motion during medical image acquisition causes blurring, ghosting, and distorts organs, which makes image interpretation difficult. Current state-of-the-art algorithms using Generative Adversarial Network (GAN)-based methods with their ability to learn the mappings between corrupted images and their ground truth via Structural Similarity Index Measure (SSIM) loss effectively generate motion-free images. However, we identified the following limitations: (i) they mainly focus on global structural characteristics and therefore overlook localized features that often carry critical pathological information, and (ii) the SSIM loss function struggles to handle images with varying pixel intensities, luminance factors, and variance. In this study, we propose Motion-Aware Image SYnthesis (MAISY) which initially characterizes motion and then uses it for correction by: (a) leveraging the foundation model Segment Anything Model (SAM), to dynamically learn spatial patterns along anatomical boundaries where motion artifacts are most pronounced, and (b) introducing the Variance-Selective SSIM (VS-SSIM) loss which adaptively emphasizes spatial regions with high pixel variance to preserve essential anatomical details during artifact correction. Experiments on chest and head CT datasets demonstrate that our model outperformed the state-of-the-art counterparts, with Peak Signal-to-Noise Ratio (PSNR) increasing by 40%, SSIM by 10%, and Dice by 16%.
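The variance-selective weighting in VS-SSIM can be illustrated independently of the full model: compute a local-variance map of the ground-truth image and use it to up-weight a per-pixel similarity term, so high-variance (detail-rich) regions dominate the loss. The sketch below uses a weighted L1 term as a stand-in for the windowed SSIM computation, so it demonstrates the weighting scheme only; the window size and normalization are assumptions.

```python
import torch
import torch.nn.functional as F

def local_variance(img: torch.Tensor, win: int = 7) -> torch.Tensor:
    """Per-pixel variance over a win x win window. img: (B, 1, H, W)."""
    mu = F.avg_pool2d(img, win, stride=1, padding=win // 2)
    mu_sq = F.avg_pool2d(img * img, win, stride=1, padding=win // 2)
    return (mu_sq - mu * mu).clamp(min=0.0)

def variance_selective_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    w = local_variance(target)
    w = w / (w.mean() + 1e-8)                 # emphasize high-variance regions
    return (w * (pred - target).abs()).mean()

pred = torch.rand(1, 1, 64, 64, requires_grad=True)
target = torch.rand(1, 1, 64, 64)
variance_selective_loss(pred, target).backward()
```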
[CV-55] Scalable Aerial GNSS Localization for Marine Robots
【Quick Read】: This paper addresses precise localization for underwater robots. Traditional GNSS-based approaches are hard to apply because signals reflect off the water surface and underwater GNSS receivers are costly, while existing alternatives such as inertial navigation, Doppler Velocity Loggers (DVL), SLAM, and acoustic positioning suffer from error accumulation and high computational complexity. The key to the proposed solution is to use an aerial drone equipped with GNSS to track and localize a marine robot once it approaches the water surface, enabling efficient and scalable localization of both single and multiple marine robots.
Link: https://arxiv.org/abs/2505.04095
Authors: Shuo Wen, Edwin Meriaux, Mariana Sosa Guzmán, Charlotte Morissette, Chloe Si, Bobak Baghi, Gregory Dudek
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: International Conference on Robotics and Automation 2025 Workshop Robots in the Wild
Abstract:Accurate localization is crucial for water robotics, yet traditional onboard Global Navigation Satellite System (GNSS) approaches are difficult or ineffective due to signal reflection on the water’s surface and the high cost of aquatic GNSS receivers. Existing approaches, such as inertial navigation, Doppler Velocity Loggers (DVL), SLAM, and acoustic-based methods, face challenges like error accumulation and high computational complexity. Therefore, a more efficient and scalable solution remains necessary. This paper proposes an alternative approach that leverages an aerial drone equipped with GNSS localization to track and localize a marine robot once it is near the surface of the water. Our results show that this novel adaptation enables accurate single and multi-robot marine robot localization.
[CV-56] SMMT: Siamese Motion Mamba with Self-attention for Thermal Infrared Target Tracking
【Quick Read】: This paper targets challenges in thermal infrared (TIR) target tracking such as target occlusion, motion blur, and background clutter, which markedly degrade tracker performance. The key to its solution is the novel Siamese Motion Mamba Tracker (SMMT), which combines a bidirectional state-space model with a self-attention mechanism. Specifically, a Motion Mamba module is introduced into the Siamese architecture to extract motion features and recover overlooked edge details through bidirectional modeling and self-attention; a Siamese parameter-sharing strategy lets certain convolutional layers share weights, reducing computational redundancy while preserving strong feature representations; and a motion edge-aware regression loss is designed to improve tracking accuracy.
Link: https://arxiv.org/abs/2505.04088
Authors: Shang Zhang, Huanbin Zhang, Dali Feng, Yujie Cui, Ruoyan Xiong, Cen He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Thermal infrared (TIR) object tracking often suffers from challenges such as target occlusion, motion blur, and background clutter, which significantly degrade the performance of trackers. To address these issues, this paper proposes a novel Siamese Motion Mamba Tracker (SMMT), which integrates a bidirectional state-space model and a self-attention mechanism. Specifically, we introduce the Motion Mamba module into the Siamese architecture to extract motion features and recover overlooked edge details using bidirectional modeling and self-attention. We propose a Siamese parameter-sharing strategy that allows certain convolutional layers to share weights. This approach reduces computational redundancy while preserving strong feature representation. In addition, we design a motion edge-aware regression loss to improve tracking accuracy, especially for motion-blurred targets. Extensive experiments are conducted on four TIR tracking benchmarks, including LSOTB-TIR, PTB-TIR, VOT-TIR2015, and VOT-TIR2017. The results show that SMMT achieves superior performance in TIR target tracking.
[CV-57] SEVA: Leveraging Single-Step Ensemble of Vicinal Augmentations for Test-Time Adaptation
【Quick Read】: This paper addresses robustness under distribution shift in test-time adaptation (TTA) and the inefficiency of existing methods in exploiting reliable samples. The key to its solution, Single-step Ensemble of Vicinal Augmentations (SEVA), is a theoretical framework that analyzes how multiple augmentation strategies affect model adaptation and then optimizes an upper bound of the entropy loss, folding the effect of several rounds of augmentation training into a single step. This improves adaptation efficiency and reliability without adding computational burden.
Link: https://arxiv.org/abs/2505.04087
Authors: Zixuan Hu, Yichun Hu, Ling-Yu Duan
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Test-Time adaptation (TTA) aims to enhance model robustness against distribution shifts through rapid model adaptation during inference. While existing TTA methods often rely on entropy-based unsupervised training and achieve promising results, the common practice of a single round of entropy training is typically unable to adequately utilize reliable samples, hindering adaptation efficiency. In this paper, we discover augmentation strategies can effectively unleash the potential of reliable samples, but the rapidly growing computational cost impedes their real-time application. To address this limitation, we propose a novel TTA approach named Single-step Ensemble of Vicinal Augmentations (SEVA), which can take advantage of data augmentations without increasing the computational burden. Specifically, instead of explicitly utilizing the augmentation strategy to generate new data, SEVA develops a theoretical framework to explore the impacts of multiple augmentations on model adaptation and proposes to optimize an upper bound of the entropy loss to integrate the effects of multiple rounds of augmentation training into a single step. Furthermore, we discover and verify that using the upper bound as the loss is more conducive to the selection mechanism, as it can effectively filter out harmful samples that confuse the model. Combining these two key advantages, the proposed efficient loss and a complementary selection strategy can simultaneously boost the potential of reliable samples and meet the stringent time requirements of TTA. The comprehensive experiments on various network architectures across challenging testing scenarios demonstrate impressive performances and the broad adaptability of SEVA. The code will be publicly available.
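For context, the sketch below shows the plain entropy-minimization TTA step, with an entropy-based filter that drops unreliable samples, i.e., the standard baseline that SEVA improves upon. It is not SEVA's single-step upper-bound loss; the threshold and optimizer settings are assumptions.

```python
import torch
import torch.nn.functional as F

def tta_step(model, x, optimizer, ent_threshold=2.0):
    """One test-time adaptation step: minimize prediction entropy on reliable samples."""
    logits = model(x)
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)  # per-sample entropy
    reliable = entropy < ent_threshold                        # filter confusing samples
    if reliable.any():
        loss = entropy[reliable].mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return logits.detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32 * 3, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
_ = tta_step(model, torch.rand(8, 3, 32, 32), opt)
```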
[CV-58] AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding
【Quick Read】: This paper tackles a core difficulty of 3D visual grounding: the wide gap between the 3D and language modalities makes it hard to distinguish multiple similar objects in complex scenes from the spatial relationships given in a description. The key to its solution is a novel 2D-assisted 3D visual grounding framework that builds semantic-spatial scene graphs with referred-object discrimination for relationship perception: a dual-branch visual encoder uses 2D pre-trained attributes to guide multi-modal object encoding, and a cross-modal interaction module applies graph attention for relationship-oriented information fusion.
Link: https://arxiv.org/abs/2505.04058
Authors: Feng Xiao, Hongbin Xu, Guocan Zhao, Wenxiong Kang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D visual grounding aims to localize the unique target described by natural languages in 3D scenes. The significant gap between 3D and language modalities makes it a notable challenge to distinguish multiple similar objects through the described spatial relationships. Current methods attempt to achieve cross-modal understanding in complex scenes via a target-centered learning mechanism, ignoring the perception of referred objects. We propose a novel 2D-assisted 3D visual grounding framework that constructs semantic-spatial scene graphs with referred object discrimination for relationship perception. The framework incorporates a dual-branch visual encoder that utilizes 2D pre-trained attributes to guide the multi-modal object encoding. Furthermore, our cross-modal interaction module uses graph attention to facilitate relationship-oriented information fusion. The enhanced object representation and iterative relational learning enable the model to establish effective alignment between 3D vision and referential descriptions. Experimental results on the popular benchmarks demonstrate our superior performance compared to state-of-the-art methods, especially in addressing the challenges of multiple similar distractors.
[CV-59] FoodTrack: Estimating Handheld Food Portions with Egocentric Video CVPR2025
【Quick Read】: This paper addresses accurate tracking of food intake. Traditional methods usually depend on specific camera angles, non-occluded images, or gesture recognition to estimate intake, assuming a bite size rather than measuring food volume directly. The key to its solution is the FoodTrack framework, which tracks and measures the volume of hand-held food from egocentric video; it is robust to hand occlusion, flexible across camera and object poses, and estimates food volume directly without relying on intake gestures or fixed bite-size assumptions.
Link: https://arxiv.org/abs/2505.04055
Authors: Ervin Wang, Yuhao Chen
Affiliations: University of Waterloo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as extended abstract at CVPR 2025 Metafood workshop
Abstract:Accurately tracking food consumption is crucial for nutrition and health monitoring. Traditional approaches typically require specific camera angles, non-occluded images, or rely on gesture recognition to estimate intake, making assumptions about bite size rather than directly measuring food volume. We propose the FoodTrack framework for tracking and measuring the volume of hand-held food items using egocentric video which is robust to hand occlusions and flexible with varying camera and object poses. FoodTrack estimates food volume directly, without relying on intake gestures or fixed assumptions about bite size, offering a more accurate and adaptable solution for tracking food consumption. We achieve absolute percentage loss of approximately 7.01% on a handheld food object, improving upon a previous approach that achieved a 16.40% mean absolute percentage error in its best case, under less flexible conditions.
[CV-60] Person-In-Situ: Scene-Consistent Human Image Insertion with Occlusion-Aware Pose Control
【Quick Read】: This paper addresses two problems in compositing human figures into scene images: mishandled occlusion, with the person unnaturally placed in the front-most layer, and the limited pose control offered by existing methods. The key to its solution is to combine explicit pose control via a 3D body model with latent diffusion models that synthesize the person at a contextually appropriate depth, handling occlusions naturally without requiring occlusion masks.
Link: https://arxiv.org/abs/2505.04052
Authors: Shun Masuda, Yuki Endo, Yoshihiro Kanamori
Affiliations: University of Tsukuba
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Compositing human figures into scene images has broad applications in areas such as entertainment and advertising. However, existing methods often cannot handle occlusion of the inserted person by foreground objects and unnaturally place the person in the frontmost layer. Moreover, they offer limited control over the inserted person’s pose. To address these challenges, we propose two methods. Both allow explicit pose control via a 3D body model and leverage latent diffusion models to synthesize the person at a contextually appropriate depth, naturally handling occlusions without requiring occlusion masks. The first is a two-stage approach: the model first learns a depth map of the scene with the person through supervised learning, and then synthesizes the person accordingly. The second method learns occlusion implicitly and synthesizes the person directly from input data without explicit depth supervision. Quantitative and qualitative evaluations show that both methods outperform existing approaches by better preserving scene consistency while accurately reflecting occlusions and user-specified poses.
[CV-61] TerraFusion: Joint Generation of Terrain Geometry and Texture Using Latent Diffusion Models
【Quick Read】: This paper addresses the fact that existing terrain-generation methods produce either a heightmap or a texture without sufficiently accounting for the inherent correlation between the two. The key to its solution is a latent-diffusion-based method that generates terrain heightmaps and textures jointly: the model is first trained in an unsupervised manner to randomly generate paired heightmaps and textures, and an external adapter is then trained in a supervised way so that users can steer generation with hand-drawn sketches while the heightmap-texture correlation is preserved.
Link: https://arxiv.org/abs/2505.04050
Authors: Kazuki Higo, Toshiki Kanai, Yuki Endo, Yoshihiro Kanamori
Affiliations: University of Tsukuba
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D terrain models are essential in fields such as video game development and film production. Since surface color often correlates with terrain geometry, capturing this relationship is crucial to achieving realism. However, most existing methods generate either a heightmap or a texture, without sufficiently accounting for the inherent correlation. In this paper, we propose a method that jointly generates terrain heightmaps and textures using a latent diffusion model. First, we train the model in an unsupervised manner to randomly generate paired heightmaps and textures. Then, we perform supervised learning of an external adapter to enable user control via hand-drawn sketches. Experiments show that our approach allows intuitive terrain generation while preserving the correlation between heightmaps and textures.
[CV-62] he Eye as a Window to Systemic Health: A Survey of Retinal Imaging from Classical Techniques to Oculomics
【Quick Read】: This paper asks how retinal imaging combined with artificial intelligence (AI) can reveal systemic health information and yield non-invasive surrogate markers. The key lies in the shift from classical retinal imaging techniques toward oculomics: AI-driven analysis that strengthens early detection, progression monitoring, and intervention for both ocular and systemic diseases.
Link: https://arxiv.org/abs/2505.04006
Authors: Inamullah, Imran Razzak, Shoaib Jameel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The unique vascularized anatomy of the human eye, encased in the retina, provides an opportunity to act as a window for human health. The retinal structure assists in assessing the early detection, monitoring of disease progression and intervention for both ocular and non-ocular diseases. The advancement in imaging technology leveraging Artificial Intelligence has seized this opportunity to bridge the gap between the eye and human health. This track paves the way for unveiling systemic health insights from the ocular system and for deriving surrogate non-invasive markers for timely intervention and identification. The new frontiers of oculomics in ophthalmology cover both ocular and systemic diseases and are attracting growing attention. In this survey paper, we explore the evolution of retinal imaging techniques, the dire need for the integration of AI-driven analysis, and the shift of retinal imaging from classical techniques to oculomics. We also discuss some hurdles that may be faced in the progression of oculomics, highlighting the research gaps and future directions.
[CV-63] Action Spotting and Precise Event Detection in Sports: Datasets Methods and Challenges
【Quick Read】: This survey addresses event detection in sports video, covering Temporal Action Localization (TAL), Action Spotting (AS), and Precise Event Spotting (PES) for the automated identification of key moments. Its key contribution is a review and categorization of datasets and evaluation metrics tailored to sports, together with an analysis of state-of-the-art multi-modal approaches, self-supervised learning and knowledge distillation techniques, and cross-sport generalization strategies, pointing toward more generalized, efficient, and robust event-detection frameworks.
Link: https://arxiv.org/abs/2505.03991
Authors: Hao Xu, Arbind Agrahari Baniya, Sam Well, Mohamed Reda Bouadjenek, Richard Dazeley, Sunil Aryal
Affiliations: Deakin University; Agriculture Victoria Research; Paralympics Australia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures, 2 tables
Abstract:Video event detection has become an essential component of sports analytics, enabling automated identification of key moments and enhancing performance analysis, viewer engagement, and broadcast efficiency. Recent advancements in deep learning, particularly Convolutional Neural Networks (CNNs) and Transformers, have significantly improved accuracy and efficiency in Temporal Action Localization (TAL), Action Spotting (AS), and Precise Event Spotting (PES). This survey provides a comprehensive overview of these three key tasks, emphasizing their differences, applications, and the evolution of methodological approaches. We thoroughly review and categorize existing datasets and evaluation metrics specifically tailored for sports contexts, highlighting the strengths and limitations of each. Furthermore, we analyze state-of-the-art techniques, including multi-modal approaches that integrate audio and visual information, methods utilizing self-supervised learning and knowledge distillation, and approaches aimed at generalizing across multiple sports. Finally, we discuss critical open challenges and outline promising research directions toward developing more generalized, efficient, and robust event detection frameworks applicable to diverse sports. This survey serves as a foundation for future research on efficient, generalizable, and multi-modal sports event detection.
[CV-64] Deep Learning Framework for Infrastructure Maintenance: Crack Detection and High-Resolution Imaging of Infrastructure Surfaces
【Quick Read】: This paper addresses two problems in infrastructure asset management: dataset resolution limited by sensor characteristics, proximity to the structure, hard-to-reach areas, and environmental conditions; and the extra computational cost and false alarms incurred when existing super-resolution techniques process all structure images, both positive and negative distress classes. The key to its solution is a framework consisting of a convolutional neural network (CNN) and an efficient sub-pixel convolutional neural network (ESPCNN): the CNN accurately classifies the two classes, and the ESPCNN, a lightweight super-resolution technique, reconstructs high-resolution images only for the positive distress samples identified by the CNN, effectively cutting computational cost and false alarms in the subsequent super-resolution step.
Link: https://arxiv.org/abs/2505.03974
Authors: Nikhil M. Pawar, Jorge A. Prozzi, Feng Hong, Surya Sarat Chandra Congress
Affiliations: The University of Texas at Austin; Texas State University; Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Presented: Transportation Research Board 104th Annual Meeting, Washington, D.C
Abstract:Recently, there has been an impetus for the application of cutting-edge data collection platforms such as drones mounted with camera sensors for infrastructure asset management. However, the sensor characteristics, proximity to the structure, hard-to-reach access, and environmental conditions often limit the resolution of the datasets. A few studies used super-resolution techniques to address the problem of low-resolution images. Nevertheless, these techniques were observed to increase computational cost and false alarms of distress detection due to the consideration of all the infrastructure images i.e., positive and negative distress classes. In order to address the pre-processing of false alarm and achieve efficient super-resolution, this study developed a framework consisting of convolutional neural network (CNN) and efficient sub-pixel convolutional neural network (ESPCNN). CNN accurately classified both the classes. ESPCNN, which is the lightweight super-resolution technique, generated high-resolution infrastructure image of positive distress obtained from CNN. The ESPCNN outperformed bicubic interpolation in all the evaluation metrics for super-resolution. Based on the performance metrics, the combination of CNN and ESPCNN was observed to be effective in preprocessing the infrastructure images with negative distress, reducing the computational cost and false alarms in the next step of super-resolution. The visual inspection showed that ESPCNN is able to capture crack propagation and the complex geometry of even minor cracks. The proposed framework is expected to help the highway agencies in accurately performing distress detection and assist in efficient asset management practices.
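The efficient sub-pixel design is compact enough to sketch in full: a few convolutions predict r^2 channels per low-resolution pixel, and a sub-pixel shuffle rearranges them into an image r times larger. The sketch below follows the generic ESPCN architecture of Shi et al. (2016), on which the paper's ESPCNN builds; the layer widths are assumptions.

```python
import torch
import torch.nn as nn

class ESPCN(nn.Module):
    """Efficient sub-pixel CNN: upscale a 1-channel image by factor r."""
    def __init__(self, r: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, 5, padding=2), nn.Tanh(),
            nn.Conv2d(64, 32, 3, padding=1), nn.Tanh(),
            nn.Conv2d(32, r * r, 3, padding=1),   # r^2 sub-pixel channels
        )
        self.shuffle = nn.PixelShuffle(r)          # (B, r^2, H, W) -> (B, 1, rH, rW)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.body(x))

lr = torch.rand(1, 1, 32, 32)
print(ESPCN(r=3)(lr).shape)   # torch.Size([1, 1, 96, 96])
```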
[CV-65] OpenHelix: A Short Survey Empirical Analysis and Open-Source Dual-System VLA Model for Robotic Manipulation
【Quick Read】: This paper addresses the lack of sufficient open-source work on dual-system Vision-Language-Action (VLA) architectures, which hinders further performance analysis and optimization in embodied-intelligence research. The key to its solution is to summarize and compare the structural designs of existing dual-system architectures, conduct systematic empirical evaluations of their core design elements, and release a low-cost open-source model to support follow-up exploration.
Link: https://arxiv.org/abs/2505.03912
Authors: Can Cui, Pengxiang Ding, Wenxuan Song, Shuanghao Bai, Xinyang Tong, Zirui Ge, Runze Suo, Wanqi Zhou, Yang Liu, Bofang Jia, Han Zhao, Siteng Huang, Donglin Wang
Affiliations: Westlake University; Zhejiang University; Xi’an Jiaotong University; HKUST(GZ)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Dual-system VLA (Vision-Language-Action) architectures have become a hot topic in embodied intelligence research, but there is a lack of sufficient open-source work for further performance analysis and optimization. To address this problem, this paper will summarize and compare the structural designs of existing dual-system architectures, and conduct systematic empirical evaluations on the core design elements of existing dual-system architectures. Ultimately, it will provide a low-cost open-source model for further exploration. Of course, this project will continue to update with more experimental conclusions and open-source models with improved performance for everyone to choose from. Project page: this https URL.
[CV-66] Novel Extraction of Discriminative Fine-Grained Feature to Improve Retinal Vessel Segmentation
【Quick Read】: This paper addresses insufficient feature discriminability in retinal vessel segmentation, particularly when models only minimize the difference between decoder output and labels while ignoring fine-grained feature representations in the encoder. The key to its solution is AttUKAN, a novel Attention U-shaped Kolmogorov-Arnold Network, combined with a new Label-guided Pixel-wise Contrastive Loss. AttUKAN inserts Attention Gates to enhance sensitivity by suppressing irrelevant feature activations and improves interpretability through the non-linear modeling of KAN blocks, while the contrastive loss supervises the model to extract more discriminative features by distinguishing foreground vessel-pixel pairs from background pairs.
Link: https://arxiv.org/abs/2505.03896
Authors: Shuang Zeng, Chee Hong Lee, Micky C Nnamdi, Wenqi Shi, J Ben Tamo, Lei Zhu, Hangzhou He, Xinliang Zhang, Qian Chen, May D. Wang, Yanye Lu, Qiushi Ren
Affiliations: Peking University; Peking University Health Science Center; Peking University Shenzhen Graduate School; Georgia Institute of Technology; UT Southwestern Medical Center (UTSW); Emory University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Retinal vessel segmentation is a vital early detection method for several severe ocular diseases. Despite significant progress in retinal vessel segmentation with the advancement of Neural Networks, there are still challenges to overcome. Specifically, retinal vessel segmentation aims to predict the class label for every pixel within a fundus image, with a primary focus on intra-image discrimination, making it vital for models to extract more discriminative features. Nevertheless, existing methods primarily focus on minimizing the difference between the output from the decoder and the label, but ignore fully using feature-level fine-grained representations from the encoder. To address these issues, we propose a novel Attention U-shaped Kolmogorov-Arnold Network named AttUKAN along with a novel Label-guided Pixel-wise Contrastive Loss for retinal vessel segmentation. Specifically, we implement Attention Gates into Kolmogorov-Arnold Networks to enhance model sensitivity by suppressing irrelevant feature activations and model interpretability by non-linear modeling of KAN blocks. Additionally, we also design a novel Label-guided Pixel-wise Contrastive Loss to supervise our proposed AttUKAN to extract more discriminative features by distinguishing between foreground vessel-pixel pairs and background pairs. Experiments are conducted across four public datasets including DRIVE, STARE, CHASE_DB1, HRF and our private dataset. AttUKAN achieves F1 scores of 82.50%, 81.14%, 81.34%, 80.21% and 80.09%, along with MIoU scores of 70.24%, 68.64%, 68.59%, 67.21% and 66.94% in the above datasets, which are the highest compared to 11 networks for retinal vessel segmentation. Quantitative and qualitative results show that our AttUKAN achieves state-of-the-art performance and outperforms existing retinal vessel segmentation methods. Our code will be available at this https URL.
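A label-guided pixel-wise contrastive term can be sketched as a supervised InfoNCE-style loss over sampled pixel embeddings, where the segmentation mask decides which pairs are positives (vessel-vessel or background-background). This is one standard formulation under assumed shapes, sampling, and temperature, not necessarily the exact loss used in AttUKAN.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(emb, mask, n_samples=256, tau=0.1):
    """emb: (C, H, W) pixel embeddings; mask: (H, W) binary vessel labels."""
    c, h, w = emb.shape
    feats = F.normalize(emb.reshape(c, -1).t(), dim=1)        # (H*W, C), unit norm
    labels = mask.reshape(-1).float()
    idx = torch.randperm(h * w)[:n_samples]                   # subsample pixels
    f, y = feats[idx], labels[idx]
    n = f.shape[0]
    sim = f @ f.t() / tau                                     # pairwise similarities
    eye = torch.eye(n, dtype=torch.bool)
    sim = sim.masked_fill(eye, -1e9)                          # exclude self-pairs
    pos = (y[:, None] == y[None, :]).float().masked_fill(eye, 0.0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

emb = torch.randn(32, 64, 64, requires_grad=True)
mask = (torch.rand(64, 64) > 0.8).float()
pixel_contrastive_loss(emb, mask).backward()
```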
[CV-67] Deepfakes on Demand: the rise of accessible non-consensual deepfake image generators
【Quick Read】: This paper addresses the misuse of generative text-to-image (T2I) models to produce deepfakes, i.e., non-consensual depictions of identifiable individuals. The key to its solution is a metadata analysis of thousands of publicly downloadable model variants on the open-source platforms Hugging Face and Civitai, which documents the wide availability of deepfake models and the low barrier to creating them: variants based on the parameter-efficient fine-tuning technique low-rank adaptation (LoRA) can be trained from a handful of images on consumer-grade hardware. The findings expose the inadequacy of current regulation and platform policies in preventing the spread of such content and stress the urgent need for stronger countermeasures.
Link: https://arxiv.org/abs/2505.03859
Authors: Will Hawkins, Chris Russell, Brent Mittelstadt
Affiliations: Oxford Internet Institute, University of Oxford
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages
Abstract:Advances in multimodal machine learning have made text-to-image (T2I) models increasingly accessible and popular. However, T2I models introduce risks such as the generation of non-consensual depictions of identifiable individuals, otherwise known as deepfakes. This paper presents an empirical study exploring the accessibility of deepfake model variants online. Through a metadata analysis of thousands of publicly downloadable model variants on two popular repositories, Hugging Face and Civitai, we demonstrate a huge rise in easily accessible deepfake models. Almost 35,000 examples of publicly downloadable deepfake model variants are identified, primarily hosted on Civitai. These deepfake models have been downloaded almost 15 million times since November 2022, with the models targeting a range of individuals from global celebrities to Instagram users with under 10,000 followers. Both Stable Diffusion and Flux models are used for the creation of deepfake models, with 96% of these targeting women and many signalling intent to generate non-consensual intimate imagery (NCII). Deepfake model variants are often created via the parameter-efficient fine-tuning technique known as low rank adaptation (LoRA), requiring as few as 20 images, 24GB VRAM, and 15 minutes of time, making this process widely accessible via consumer-grade computers. Despite these models violating the Terms of Service of hosting platforms, and regulation seeking to prevent dissemination, these results emphasise the pressing need for greater action to be taken against the creation of deepfakes and NCII.
[CV-68] An Active Inference Model of Covert and Overt Visual Attention
【速读】:该论文旨在解决如何在处理复杂、高维感官输入的智能体中实现对相关刺激的选择性注意,同时过滤干扰的问题。其解决方案的关键在于通过主动推断框架构建一个隐匿性和显性视觉注意的模型,利用感官精度的动态优化以最小化自由能。该模型根据当前环境信念和感官输入确定视觉感官精度,从而影响隐匿性和显性注意的分配。
链接: https://arxiv.org/abs/2505.03856
作者: Tin Mišić,Karlo Koledić,Fabio Bonsignorio,Ivan Petrović,Ivan Marković
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 7 figures. Code available at this https URL
Abstract:The ability to selectively attend to relevant stimuli while filtering out distractions is essential for agents that process complex, high-dimensional sensory input. This paper introduces a model of covert and overt visual attention through the framework of active inference, utilizing dynamic optimization of sensory precisions to minimize free-energy. The model determines visual sensory precisions based on both current environmental beliefs and sensory input, influencing attentional allocation in both covert and overt modalities. To test the effectiveness of the model, we analyze its behavior in the Posner cueing task and a simple target focus task using two-dimensional (2D) visual data. Reaction times are measured to investigate the interplay between exogenous and endogenous attention, as well as valid and invalid cueing. The results show that exogenous and valid cues generally lead to faster reaction times compared to endogenous and invalid cues. Furthermore, the model exhibits behavior similar to inhibition of return, where previously attended locations become suppressed after a specific cue-target onset asynchrony interval. Lastly, we investigate different aspects of overt attention and show that involuntary, reflexive saccades occur faster than intentional ones, but at the expense of adaptability.
[CV-69] Advanced Clustering Framework for Semiconductor Image Analytics Integrating Deep TDA with Self-Supervised and Transfer Learning Techniques
【Quick Read】: This paper addresses the difficulty of defect identification and yield optimization in semiconductor manufacturing, where image volumes are vast and traditional clustering techniques handle high-dimensional, unlabeled data poorly, missing nuanced patterns. The key to its solution is an advanced clustering framework that integrates deep Topological Data Analysis (TDA) with self-supervised and transfer learning: TDA captures intrinsic topological features, self-supervised learning extracts meaningful representations from unlabeled data, and transfer learning adds adaptability and scalability, enabling effective unsupervised image clustering.
Link: https://arxiv.org/abs/2505.03848
Authors: Janhavi Giri, Attila Lengyel, Don Kent, Edward Kibardin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 46 pages, 22 figures, 5 tables
Abstract:Semiconductor manufacturing generates vast amounts of image data, crucial for defect identification and yield optimization, yet often exceeds manual inspection capabilities. Traditional clustering techniques struggle with high-dimensional, unlabeled data, limiting their effectiveness in capturing nuanced patterns. This paper introduces an advanced clustering framework that integrates deep Topological Data Analysis (TDA) with self-supervised and transfer learning techniques, offering a novel approach to unsupervised image clustering. TDA captures intrinsic topological features, while self-supervised learning extracts meaningful representations from unlabeled data, reducing reliance on labeled datasets. Transfer learning enhances the framework’s adaptability and scalability, allowing fine-tuning to new datasets without retraining from scratch. Validated on synthetic and open-source semiconductor image datasets, the framework successfully identifies clusters aligned with defect patterns and process variations. This study highlights the transformative potential of combining TDA, self-supervised learning, and transfer learning, providing a scalable solution for proactive process monitoring and quality control in semiconductor manufacturing and other domains with large-scale image datasets.
[CV-70] GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation
【Quick Read】: This paper addresses the difficulty of apparent personality analysis from short videos, which hinges on the complex interplay of visual, auditory, and textual cues. The key to its solution is GAME, a Graph-Augmented Multimodal Encoder that robustly models and fuses multi-source features. In the visual stream, a facial graph and a dual-branch Geo Two-Stream Network combine graph convolutional networks (GCNs) and convolutional neural networks (CNNs) with attention to capture both structural and appearance-based facial cues, while pretrained ResNet18 and VGGFace backbones extract global context and identity features. Frame-level features are processed by a BiGRU with temporal attention to capture temporal dynamics; audio representations come from VGGish, and linguistic semantics from the XLM-Roberta transformer. A channel-attention-based fusion module followed by a multi-layer perceptron (MLP) regression head finally integrates the modalities to predict personality traits.
Link: https://arxiv.org/abs/2505.03846
Authors: Kangsheng Wang, Yuhang Li, Chengwei Ye, Yufei Lin, Huanzhen Zhang, Bohan Hu, Linuo Xu, Shuyan Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Apparent personality analysis from short videos poses significant challenges due to the complex interplay of visual, auditory, and textual cues. In this paper, we propose GAME, a Graph-Augmented Multimodal Encoder designed to robustly model and fuse multi-source features for automatic personality prediction. For the visual stream, we construct a facial graph and introduce a dual-branch Geo Two-Stream Network, which combines Graph Convolutional Networks (GCNs) and Convolutional Neural Networks (CNNs) with attention mechanisms to capture both structural and appearance-based facial cues. Complementing this, global context and identity features are extracted using pretrained ResNet18 and VGGFace backbones. To capture temporal dynamics, frame-level features are processed by a BiGRU enhanced with temporal attention modules. Meanwhile, audio representations are derived from the VGGish network, and linguistic semantics are captured via the XLM-Roberta transformer. To achieve effective multimodal integration, we propose a Channel Attention-based Fusion module, followed by a Multi-Layer Perceptron (MLP) regression head for predicting personality traits. Extensive experiments show that GAME consistently outperforms existing methods across multiple benchmarks, validating its effectiveness and generalizability.
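Channel-attention fusion over concatenated modality features is a small, squeeze-and-excitation-style block: a gating MLP produces per-channel weights that re-scale the concatenated feature vector before the regression head. The sketch below shows that generic block with assumed feature sizes; it is not the released GAME code.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """SE-style gating over concatenated modality features (B, C)."""
    def __init__(self, dims=(256, 128, 128), reduction=4):
        super().__init__()
        c = sum(dims)
        self.gate = nn.Sequential(
            nn.Linear(c, c // reduction), nn.ReLU(),
            nn.Linear(c // reduction, c), nn.Sigmoid(),
        )
        self.head = nn.Linear(c, 5)   # e.g. Big-Five trait regression

    def forward(self, visual, audio, text):
        fused = torch.cat([visual, audio, text], dim=1)
        return self.head(fused * self.gate(fused))   # channel-wise re-weighting

v, a, t = torch.randn(4, 256), torch.randn(4, 128), torch.randn(4, 128)
print(ChannelAttentionFusion()(v, a, t).shape)   # torch.Size([4, 5])
```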
[CV-71] Coverage Biases in High-Resolution Satellite Imagery
【Quick Read】: This paper asks whether satellite remote-sensing data covers different regions of the Earth with bias, i.e., whether all places benefit equally from satellite imagery. The key to its solution is an analysis of the optical-imagery coverage of major satellite constellations, evaluating both future on-demand tasking opportunities and the availability of historical imagery, combined with an assessment of the influence of geographic factors, socio-economic factors, and geopolitical events, which together reveal the uneven distribution of satellite imagery.
Link: https://arxiv.org/abs/2505.03842
Authors: Vadim Musienko, Axel Jacquet, Ingmar Weber, Till Koebe
Affiliations: Universität des Saarlandes
Categories: Computers and Society (cs.CY); Earth and Planetary Astrophysics (astro-ph.EP); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Satellite imagery is increasingly used to complement traditional data collection approaches such as surveys and censuses across scientific disciplines. However, we ask: Do all places on earth benefit equally from this new wealth of information? In this study, we investigate coverage bias of major satellite constellations that provide optical satellite imagery with a ground sampling distance below 10 meters, evaluating both the future on-demand tasking opportunities as well as the availability of historic images across the globe. Specifically, forward-looking, we estimate how often different places are revisited during a window of 30 days based on the satellites’ orbital paths, thus investigating potential coverage biases caused by physical factors. We find that locations farther away from the equator are generally revisited more frequently by the constellations under study. Backward-looking, we show that historic satellite image availability – based on metadata collected from major satellite imagery providers – is influenced by socio-economic factors on the ground: less developed, less populated places have less satellite images available. Furthermore, in three small case studies on recent conflict regions in this world, namely Gaza, Sudan and Ukraine, we show that also geopolitical events play an important role in satellite image availability, hinting at underlying business model decisions. These insights lay bare that the digital dividend yielded by satellite imagery is not equally distributed across our planet.
[CV-72] Explainable Face Recognition via Improved Localization
【Quick Read】: This paper addresses the explanatory deficiency of deep-learning-based face recognition systems: their opaque decision processes make it hard for users to trust AI-based biometric authentication. The key to its solution is a Class Activation Mapping (CAM)-based discriminative localization technique called Scaled Directed Divergence (SDD), which localizes very narrowly the facial features relevant to a decision and thereby provides precise visual explanations, improving transparency and trust.
Link: https://arxiv.org/abs/2505.03837
Authors: Rashik Shadman, Daqing Hou, Faraz Hussain, M G Sarwar Murshed
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Biometric authentication has become one of the most widely used tools in the current technological era to authenticate users and to distinguish between genuine users and imposters. Face is the most common form of biometric modality that has proven effective. Deep learning-based face recognition systems are now commonly used across different domains. However, these systems usually operate like black-box models that do not provide necessary explanations or justifications for their decisions. This is a major disadvantage because users cannot trust such artificial intelligence-based biometric systems and may not feel comfortable using them when clear explanations or justifications are not provided. This paper addresses this problem by applying an efficient method for explainable face recognition systems. We use a Class Activation Mapping (CAM)-based discriminative localization (very narrow/specific localization) technique called Scaled Directed Divergence (SDD) to visually explain the results of deep learning-based face recognition systems. We perform fine localization of the face features relevant to the deep learning model for its prediction/decision. Our experiments show that the SDD Class Activation Map (CAM) highlights the relevant face features very specifically compared to the traditional CAM and very accurately. The provided visual explanations with narrow localization of relevant features can ensure much-needed transparency and trust for deep learning-based face recognition systems.
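Since SDD builds on class activation maps, the sketch below shows only the underlying CAM computation for a network with global average pooling: the final feature maps are weighted by the target class's classifier weights. How SDD then scales and contrasts the maps of competing classes to obtain its very narrow localization is specific to the paper and is not reproduced here; the toy network is an assumption.

```python
import torch
import torch.nn as nn

class TinyCamNet(nn.Module):
    """Minimal CNN with global average pooling, so plain CAM applies."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(16, n_classes)

    def forward(self, x):
        f = self.features(x)                       # (B, 16, H, W)
        return self.fc(f.mean(dim=(2, 3))), f      # logits and feature maps

def class_activation_map(model, x, cls):
    _, fmap = model(x)
    w = model.fc.weight[cls]                       # (16,) weights of target class
    cam = torch.einsum("c,bchw->bhw", w, fmap)     # weighted sum of channels
    return torch.relu(cam)                         # keep positive evidence only

net = TinyCamNet()
cam = class_activation_map(net, torch.rand(1, 3, 64, 64), cls=3)
print(cam.shape)   # torch.Size([1, 64, 64])
```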
[CV-73] OBD-Finder: Explainable Coarse-to-Fine Text-Centric Oracle Bone Duplicates Discovery ACL WWW ATC ECML KDD2025
【Quick Read】: This paper addresses Oracle Bone (OB) duplicate identification, a fundamental problem in Oracle Bone Inscription (OBI) research. The key to its solution is a progressive OB duplicate-discovery framework that combines unsupervised low-level keypoint matching with high-level text-centric, content-based matching to refine and rank candidate OB duplicates with semantic awareness and interpretability.
Link: https://arxiv.org/abs/2505.03836
Authors: Chongsheng Zhang, Shuwen Wu, Yingqi Chen, Matthias Aßenmacher, Christian Heumann, Yi Men, Gaojuan Fan, João Gama
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This is the long version of our OBD-Finder paper for AI-enabled Oracle Bone Duplicates Discovery (currently under review at the ECML PKDD 2025 Demo Track). The models, video illustration and demonstration of this paper are available at: this https URL. Illustration video: this https URL
Abstract:Oracle Bone Inscription (OBI) is the earliest systematic writing system in China, while the identification of Oracle Bone (OB) duplicates is a fundamental issue in OBI research. In this work, we design a progressive OB duplicate discovery framework that combines unsupervised low-level keypoints matching with high-level text-centric content-based matching to refine and rank the candidate OB duplicates with semantic awareness and interpretability. We compare our approach with state-of-the-art content-based image retrieval and image matching methods, showing that our approach yields comparable recall performance and the highest simplified mean reciprocal rank scores for both Top-5 and Top-15 retrieval results, and with significantly accelerated computation efficiency. We have discovered over 60 pairs of new OB duplicates in real-world deployment, which were missed by OBI researchers for decades. The models, video illustration and demonstration of this work are available at: this https URL.
[CV-74] PointExplainer: Towards Transparent Parkinsons Disease Diagnosis
【Quick Read】: This paper addresses the lack of clear interpretability when deep neural networks analyze digitized hand-drawn signals for early diagnosis of Parkinson's disease, a gap that undermines clinical trust. The key to PointExplainer is to assign discrete attribution values to hand-drawn segments, explicitly quantifying their relative contribution to the model's decision. Its core components are (i) a diagnosis module that encodes hand-drawn signals into 3D point clouds representing the drawing trajectories, and (ii) an explanation module that trains an interpretable surrogate model to approximate the local behavior of the black-box diagnostic model. Consistency measures are also introduced to further address the faithfulness of the explanations.
Link: https://arxiv.org/abs/2505.03833
Authors: Xuechao Wang, Sven Nomm, Junqing Huang, Kadri Medijainen, Aaro Toomela, Michael Ruzhansky
Affiliations: Ghent University; Tallinn University of Technology; University of Tartu; Tallinn University; Queen Mary University of London
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Deep neural networks have shown potential in analyzing digitized hand-drawn signals for early diagnosis of Parkinson’s disease. However, the lack of clear interpretability in existing diagnostic methods presents a challenge to clinical trust. In this paper, we propose PointExplainer, an explainable diagnostic strategy to identify hand-drawn regions that drive model diagnosis. Specifically, PointExplainer assigns discrete attribution values to hand-drawn segments, explicitly quantifying their relative contributions to the model’s decision. Its key components include: (i) a diagnosis module, which encodes hand-drawn signals into 3D point clouds to represent hand-drawn trajectories, and (ii) an explanation module, which trains an interpretable surrogate model to approximate the local behavior of the black-box diagnostic model. We also introduce consistency measures to further address the issue of faithfulness in explanations. Extensive experiments on two benchmark datasets and a newly constructed dataset show that PointExplainer can provide intuitive explanations with no diagnostic performance degradation. The source code is available at this https URL.
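The surrogate-explanation step follows the familiar perturbation recipe (as in LIME): mask out subsets of segments, query the black box, and fit a simple linear model whose coefficients act as per-segment attributions. The sketch below shows that generic recipe; the zero-masking scheme, the ridge surrogate, and the toy black box are assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import Ridge

def segment_attributions(black_box, segments, n_perturb=500, seed=0):
    """segments: list of arrays (one per hand-drawn segment).
    black_box: callable taking the list and returning a scalar score."""
    rng = np.random.default_rng(seed)
    k = len(segments)
    masks = rng.integers(0, 2, size=(n_perturb, k))            # keep/drop per segment
    scores = []
    for m in masks:
        kept = [s if keep else np.zeros_like(s) for s, keep in zip(segments, m)]
        scores.append(black_box(kept))
    surrogate = Ridge(alpha=1.0).fit(masks, np.array(scores))  # local linear model
    return surrogate.coef_                                     # one attribution per segment

segs = [np.random.rand(50, 3) for _ in range(6)]               # toy 3D point segments
toy_model = lambda ss: float(sum(s.sum() for s in ss))         # stand-in black box
print(segment_attributions(toy_model, segs))
```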
[CV-75] Video Forgery Detection for Surveillance Cameras: A Review
【Quick Read】: This paper addresses the credibility of video evidence in the face of tampering: with advanced video-editing tools widely available, surveillance footage used in judicial and security settings is easy to forge, which can spread misinformation and distort judicial decisions. The key lies in developing more robust video forensic techniques, including compression-based analysis, frame-duplication detection, and machine-learning-based approaches, to better verify the authenticity of surveillance recordings.
Link: https://arxiv.org/abs/2505.03832
Authors: Noor B. Tayfor, Tarik A. Rashid, Shko M. Qader, Bryar A. Hassan, Mohammed H. Abdalla, Jafar Majidpour, Aram M. Ahmed, Hussein M. Ali, Aso M. Aladdin, Abdulhady A. Abdullah, Ahmed S. Shamsaldin, Haval M. Sidqi, Abdulrahman Salih, Zaher M. Yaseen, Azad A. Ameen, Janmenjoy Nayak, Mahmood Yashar Hamza
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The widespread availability of video recording through smartphones and digital devices has made video-based evidence more accessible than ever. Surveillance footage plays a crucial role in security, law enforcement, and judicial processes. However, with the rise of advanced video editing tools, tampering with digital recordings has become increasingly easy, raising concerns about their authenticity. Ensuring the integrity of surveillance videos is essential, as manipulated footage can lead to misinformation and undermine judicial decisions. This paper provides a comprehensive review of existing forensic techniques used to detect video forgery, focusing on their effectiveness in verifying the authenticity of surveillance recordings. Various methods, including compression-based analysis, frame duplication detection, and machine learning-based approaches, are explored. The findings highlight the growing necessity for more robust forensic techniques to counteract evolving forgery methods. Strengthening video forensic capabilities will ensure that surveillance recordings remain credible and admissible as legal evidence.
[CV-76] VideoLLM Benchmarks and Evaluation: A Survey
【Quick Read】: This survey addresses the inadequacy of evaluation frameworks in video understanding, in particular the limitations of benchmarks and evaluation methodologies for Video Large Language Models (VideoLLMs). Its key contribution is a systematic analysis of existing video-understanding benchmarks and evaluation methods, covering closed-set, open-set, and specialized evaluations for temporal and spatiotemporal tasks, together with proposed future directions such as more diverse, multimodal, and interpretability-focused benchmarks to improve how VideoLLMs are assessed.
Link: https://arxiv.org/abs/2505.03829
Authors: Yogesh Kumar
Affiliations: Indian Institute of Technology Jodhpur
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 Tables
Abstract:The rapid development of Large Language Models (LLMs) has catalyzed significant advancements in video understanding technologies. This survey provides a comprehensive analysis of benchmarks and evaluation methodologies specifically designed or used for Video Large Language Models (VideoLLMs). We examine the current landscape of video understanding benchmarks, discussing their characteristics, evaluation protocols, and limitations. The paper analyzes various evaluation methodologies, including closed-set, open-set, and specialized evaluations for temporal and spatiotemporal understanding tasks. We highlight the performance trends of state-of-the-art VideoLLMs across these benchmarks and identify key challenges in current evaluation frameworks. Additionally, we propose future research directions to enhance benchmark design, evaluation metrics, and protocols, including the need for more diverse, multimodal, and interpretability-focused benchmarks. This survey aims to equip researchers with a structured understanding of how to effectively evaluate VideoLLMs and identify promising avenues for advancing the field of video understanding with large language models.
[CV-77] In-situ and Non-contact Etch Depth Prediction in Plasma Etching via Machine Learning (ANN BNN) and Digital Image Colorimetry
【Quick Read】: This paper addresses precise monitoring of etch depth and the thickness of insulating materials such as silicon dioxide and silicon nitride in semiconductor manufacturing, where conventional ex-situ analysis is accurate but suffers from time delays and contamination risk. The key to its solution is a non-contact, in-situ etch-depth prediction framework based on machine learning (ML): an artificial neural network (ANN) predicts etch depth with high accuracy, a Bayesian neural network (BNN) extends this with reliable uncertainty estimates, and RGB data from digital image colorimetry (DIC) is shown to work as input even without explicit process parameters, enabling real-time, in-situ, non-invasive monitoring of plasma etching.
Link: https://arxiv.org/abs/2505.03826
Authors: Minji Kang, Seongho Kim, Eunseo Go, Donghyeon Paek, Geon Lim, Muyoung Kim, Soyeun Kim, Sung Kyu Jang, Min Sup Choi, Woo Seok Kang, Jaehyun Kim, Jaekwang Kim, Hyeong-U Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages
Abstract:Precise monitoring of etch depth and the thickness of insulating materials, such as Silicon dioxide and silicon nitride, is critical to ensuring device performance and yield in semiconductor manufacturing. While conventional ex-situ analysis methods are accurate, they are constrained by time delays and contamination risks. To address these limitations, this study proposes a non-contact, in-situ etch depth prediction framework based on machine learning (ML) techniques. Two scenarios are explored. In the first scenario, an artificial neural network (ANN) is trained to predict average etch depth from process parameters, achieving a significantly lower mean squared error (MSE) compared to a linear baseline model. The approach is then extended to incorporate variability from repeated measurements using a Bayesian Neural Network (BNN) to capture both aleatoric and epistemic uncertainty. Coverage analysis confirms the BNN’s capability to provide reliable uncertainty estimates. In the second scenario, we demonstrate the feasibility of using RGB data from digital image colorimetry (DIC) as input for etch depth prediction, achieving strong performance even in the absence of explicit process parameters. These results suggest that the integration of DIC and ML offers a viable, cost-effective alternative for real-time, in-situ, and non-invasive monitoring in plasma etching processes, contributing to enhanced process stability, and manufacturing efficiency.
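One common, lightweight way to obtain BNN-like uncertainty for such a regression is Monte-Carlo dropout: keep dropout active at inference and read the spread of repeated predictions as an approximate epistemic uncertainty. The sketch below predicts an etch depth from RGB inputs this way; it is a stand-in approximation, not the paper's Bayesian network, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class DepthMLP(nn.Module):
    """Tiny RGB -> etch-depth regressor with dropout."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 1),
        )

    def forward(self, rgb):
        return self.net(rgb)

@torch.no_grad()
def predict_with_uncertainty(model, rgb, n_samples=50):
    model.train()                       # keep dropout active (MC dropout)
    preds = torch.stack([model(rgb) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)  # mean depth, epistemic spread

model = DepthMLP()
mu, sigma = predict_with_uncertainty(model, torch.rand(8, 3))
print(mu.shape, sigma.shape)            # torch.Size([8, 1]) twice
```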
[CV-78] Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models
【Quick Read】: This paper probes the limited spatial reasoning and perspective-taking ability of Vision Language Models (VLMs) on complex visual tasks. The key to its solution is a novel suite of visual tasks inspired by established human tests: carefully controlled scenes pair a single humanoid minifigure with a single object, systematically vary the spatial configuration (the object's position relative to the minifigure and the minifigure's orientation), and combine bird's-eye and surface-level views, yielding 144 unique tasks. Each task is paired with 7 diagnostic questions that assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking.
Link: https://arxiv.org/abs/2505.03821
Authors: Gracjan Góral, Alicja Ziarko, Piotr Miłoś, Michał Nauman, Maciej Wołczyk, Michał Kosiński
Affiliations: Faculty of Mathematics, Informatics and Mechanics, University of Warsaw; Institute of Mathematics, Polish Academy of Sciences; Graduate School of Business, Stanford University; Robot Learning Lab, University of California, Berkeley; IDEAS NCBR
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Dataset: this https URL
Abstract:We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a novel set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes, in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations - such as object position relative to the humanoid minifigure and the humanoid minifigure’s orientation - and using both bird’s-eye and surface-level views, we created 144 unique visual tasks. Each visual task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. Our evaluation of several state-of-the-art models, including GPT-4-Turbo, GPT-4o, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, reveals that while they excel in scene understanding, the performance declines significantly on spatial reasoning and further deteriorates on perspective-taking. Our analysis suggests a gap between surface-level object recognition and the deeper spatial and perspective reasoning required for complex visual tasks, pointing to the need for integrating explicit geometric representations and tailored training protocols in future VLM development.
[CV-79] When Dynamic Data Selection Meets Data Augmentation
【Quick Read】: This paper addresses the difficulty of optimizing dynamic data selection and data augmentation jointly, so that training is accelerated without hurting model performance. The key to its solution is a novel online data-training framework that, for the first time, unifies dynamic selection and augmentation: it estimates each sample's joint distribution of local density and multimodal semantic consistency, enabling targeted selection of augmentation-suitable samples while suppressing noisy or ambiguous data, so the dataset can be shrunk substantially without sacrificing the model's generalization.
Link: https://arxiv.org/abs/2505.03809
Authors: Suorong Yang, Peng Ye, Furao Shen, Dongzhan Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Dynamic data selection aims to accelerate training with lossless performance. However, reducing training data inherently limits data diversity, potentially hindering generalization. While data augmentation is widely used to enhance diversity, it is typically not optimized in conjunction with selection. As a result, directly combining these techniques fails to fully exploit their synergies. To tackle the challenge, we propose a novel online data training framework that, for the first time, unifies dynamic data selection and augmentation, achieving both training efficiency and enhanced performance. Our method estimates each sample’s joint distribution of local density and multimodal semantic consistency, allowing for the targeted selection of augmentation-suitable samples while suppressing the inclusion of noisy or ambiguous data. This enables a more significant reduction in dataset size without sacrificing model generalization. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches on various benchmark datasets and architectures, e.g., reducing 50% training costs on ImageNet-1k with lossless performance. Furthermore, our approach enhances noise resistance and improves model robustness, reinforcing its practical utility in real-world scenarios.
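One simple reading of the selection signal can be sketched with off-the-shelf tools: a k-nearest-neighbor distance yields a local-density score, the cosine similarity between paired embeddings (e.g., image and caption features) yields a semantic-consistency score, and samples are kept where both are favorable. The density proxy, the consistency definition, and the thresholding below are illustrative assumptions, not the paper's estimator of the joint distribution.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_for_augmentation(img_emb, txt_emb, k=10, keep_frac=0.5):
    """img_emb, txt_emb: (N, D) L2-normalized embeddings of paired modalities."""
    # Local density: inverse of mean distance to the k nearest neighbors
    nn = NearestNeighbors(n_neighbors=k + 1).fit(img_emb)
    dist, _ = nn.kneighbors(img_emb)
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-8)   # skip self at index 0
    # Semantic consistency: cosine similarity between paired embeddings
    consistency = (img_emb * txt_emb).sum(axis=1)
    score = density * np.clip(consistency, 0, None)     # favor both signals
    n_keep = int(len(score) * keep_frac)
    return np.argsort(-score)[:n_keep]                  # indices of kept samples

rng = np.random.default_rng(0)
img = rng.normal(size=(1000, 32)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(1000, 32)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(select_for_augmentation(img, txt).shape)          # (500,)
```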
zh
[CV-80] AI-driven multi-source data fusion for algal bloom severity classification in small inland water bodies: Leveraging Sentinel-2 DEM and NOAA climate data
【速读】:该论文试图解决有害藻华(Harmful Algal Blooms)对内陆水质和公共健康的威胁问题,旨在开发一种高效、准确且成本低廉的检测方法。解决方案的关键在于整合多源开源遥感数据与先进的人工智能模型,具体包括Copernicus Sentinel-2光学影像、Copernicus数字高程模型(DEM)以及NOAA的高分辨率快速刷新(HRRR)气候数据,并通过Google Earth Engine(GEE)和Microsoft Planetary Computer(MPC)平台进行高效获取。此外,该方法结合了基于树的机器学习模型与神经网络,构建集成模型以分类藻华严重程度,从而提升检测的鲁棒性和准确性。
链接: https://arxiv.org/abs/2505.03808
作者: Ioannis Nasios
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Harmful algal blooms are a growing threat to inland water quality and public health worldwide, creating an urgent need for efficient, accurate, and cost-effective detection methods. This research introduces a high-performing methodology that integrates multiple open-source remote sensing data with advanced artificial intelligence models. Key data sources include Copernicus Sentinel-2 optical imagery, the Copernicus Digital Elevation Model (DEM), and NOAA’s High-Resolution Rapid Refresh (HRRR) climate data, all efficiently retrieved using platforms like Google Earth Engine (GEE) and Microsoft Planetary Computer (MPC). The NIR and two SWIR bands from Sentinel-2, the altitude from the elevation model, the temperature and wind from NOAA as well as the longitude and latitude were the most important features. The approach combines two types of machine learning models, tree-based models and a neural network, into an ensemble for classifying algal bloom severity. While the tree models performed strongly on their own, incorporating a neural network added robustness and demonstrated how deep learning models can effectively use diverse remote sensing inputs. The method leverages high-resolution satellite imagery and AI-driven analysis to monitor algal blooms dynamically, and although initially developed for a NASA competition in the U.S., it shows potential for global application. The complete code is available for further adaptation and practical implementation, illustrating the convergence of remote sensing data and AI to address critical environmental challenges (this https URL).
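文中"树模型 + 神经网络"的集成可以用 scikit-learn 的概率软投票写成如下草图(仅为示意而非论文实现;特征与标签为随机伪造数据,0.5/0.5 的集成权重为假设):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# 伪造的多源特征:NIR/SWIR 波段、高程、气温、风速、经纬度(示意数据)
X = np.random.rand(2000, 7)
y = np.random.randint(0, 5, 2000)          # 5 个藻华严重度等级(假设)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = GradientBoostingClassifier().fit(X_tr, y_tr)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)

# 概率平均的软投票集成:树模型提供强基线,神经网络增加鲁棒性
proba = 0.5 * tree.predict_proba(X_te) + 0.5 * mlp.predict_proba(X_te)
pred = proba.argmax(axis=1)
print("ensemble acc:", (pred == y_te).mean())
```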
zh
[CV-81] Facilitating Video Story Interaction with Multi-Agent Collaborative System
【速读】:该论文旨在解决视频故事交互中用户个性化体验不足的问题,现有方法受限于用户选择、定制化叙事的缺乏以及缺乏自定义能力。其解决方案的关键在于构建一个基于用户意图的交互系统,该系统利用视觉语言模型(VLM)理解视频故事,并结合检索增强生成(RAG)和多智能体系统(MAS)来生成动态角色与场景体验,从而实现更具互动性和个性化的叙事内容。
链接: https://arxiv.org/abs/2505.03807
作者: Yiwen Zhang,Jianing Hao,Zhan Wang,Hongling Sheng,Wei Zeng
机构: The Hong Kong University of Science and Technology (Guangzhou)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注: Prepared and submitted in 2024
Abstract:Video story interaction enables viewers to engage with and explore narrative content for personalized experiences. However, existing methods are limited to user selection, specially designed narratives, and lack customization. To address this, we propose an interactive system based on user intent. Our system uses a Vision Language Model (VLM) to enable machines to understand video stories, combining Retrieval-Augmented Generation (RAG) and a Multi-Agent System (MAS) to create evolving characters and scene experiences. It includes three stages: 1) Video story processing, utilizing VLM and prior knowledge to simulate human understanding of stories across three modalities. 2) Multi-space chat, creating growth-oriented characters through MAS interactions based on user queries and story stages. 3) Scene customization, expanding and visualizing various story scenes mentioned in dialogue. Applied to the Harry Potter series, our study shows the system effectively portrays emergent character social behavior and growth, enhancing the interactive experience in the video story world.
zh
[CV-82] Design description of Wisdom Computing Perspective
【速读】:该论文试图解决学生在学习数学时因抽象公式和复杂计算步骤而难以理解的问题(abstract formulas and complex calculation steps)。其解决方案的关键在于通过引入Mamba主干网络提升手写矩阵内容的精确识别能力,利用YOLO模型完成数字提取与矩阵重建,并结合CoordAttention坐标注意力机制以提高字符空间位置的准确捕捉;同时,借助Manim动画引擎逐帧展示计算过程,直观呈现数学运算的每一步骤,从而帮助学生深入理解数学操作的内在逻辑。
链接: https://arxiv.org/abs/2505.03800
作者: TianYi Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This course design aims to develop and research a handwriting matrix recognition and step-by-step visual calculation process display system, addressing the issue of abstract formulas and complex calculation steps that students find difficult to understand when learning mathematics. By integrating artificial intelligence with visualization animation technology, the system enhances precise recognition of handwritten matrix content through the introduction of Mamba backbone networks, completes digital extraction and matrix reconstruction using the YOLO model, and simultaneously combines CoordAttention coordinate attention mechanisms to improve the accurate grasp of character spatial positions. The calculation process is demonstrated frame by frame through the Manim animation engine, vividly showcasing each mathematical calculation step, helping students intuitively understand the intrinsic logic of mathematical operations. Through dynamically generating animation processes for different computational tasks, the system exhibits high modularity and flexibility, capable of generating various mathematical operation examples in real-time according to student needs. By innovating human-computer interaction methods, it brings mathematical calculation processes to life, helping students bridge the gap between knowledge and understanding on a deeper level, ultimately achieving a learning experience where “every step is understood.” The system’s scalability and interactivity make it an intuitive, user-friendly, and efficient auxiliary tool in education.
zh
[CV-83] EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLM s via Reinforcement Learning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在结构化跨模态推理中的不足,尤其是在整合音频和视觉信号时的表现问题。其解决方案的关键在于提出EchoInk-R1框架,该框架基于Qwen2.5-Omni-7B模型并采用Group Relative Policy Optimization (GRPO)进行优化,通过强化学习方法提升模型在同步音频-图像对上的多项选择题回答能力。
链接: https://arxiv.org/abs/2505.04623
作者: Zhenghao Xing,Xiaowei Hu,Chi-Wing Fu,Wenhai Wang,Jifeng Dai,Pheng-Ann Heng
机构: The Chinese University of Hong Kong (香港中文大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Tsinghua University (清华大学)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注:
Abstract:Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, yet they often struggle with structured cross-modal reasoning, particularly when integrating audio and visual signals. We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs. Built upon the Qwen2.5-Omni-7B foundation and optimized with Group Relative Policy Optimization (GRPO), EchoInk-R1 tackles multiple-choice question answering over synchronized audio-image pairs. To enable this, we curate AVQA-R1-6K, a dataset pairing such audio-image inputs with multiple-choice questions derived from OmniInstruct-v1. EchoInk-R1-7B achieves 85.77% accuracy on the validation set, outperforming the base model, which scores 80.53%, using only 562 reinforcement learning steps. Beyond accuracy, EchoInk-R1 demonstrates reflective reasoning by revisiting initial interpretations and refining responses when facing ambiguous multimodal inputs. These results suggest that lightweight reinforcement learning fine-tuning enhances cross-modal reasoning in MLLMs. EchoInk-R1 is the first framework to unify audio, visual, and textual modalities for general open-world reasoning via reinforcement learning. Code and data are publicly released to facilitate further research.
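EchoInk-R1 所用的 GRPO 核心是在同一问题的一组采样回答内做"组内相对"的奖励归一化,以替代价值网络;下面用 PyTorch 给出这一步的最小示意(奖励取值为假设):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO 的组内相对优势:对同一问题采样的 G 个回答,
    以组内均值/标准差对奖励做归一化(示意)。rewards 形状为 (B, G)。"""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-4)

# 例:2 个音画问答样本,每个采样 4 个回答,奖励为答案正确与否(假设)
r = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(r))   # 组内高于均值的回答获得正优势
```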
zh
[CV-84] Dynamic Network Flow Optimization for Task Scheduling in PTZ Camera Surveillance Systems
【速读】:该论文旨在解决动态监控环境中平移-俯仰-变焦(Pan-Tilt-Zoom, PTZ)摄像头的调度与控制优化问题。其关键解决方案是将卡尔曼滤波器用于运动预测,并结合动态网络流模型以提升实时视频捕获效率。通过为跟踪目标分配卡尔曼滤波器,系统可预测目标未来位置,从而实现精准的摄像头任务调度,该预测驱动的方法被建模为网络流优化问题,确保了系统的可扩展性和适应性。此外,引入群组跟踪节点和基于价值的优先级系统进一步减少了冗余监控,并提高了对关键事件的响应速度。
链接: https://arxiv.org/abs/2505.04596
作者: Mohammad Merati,David Castañón
机构: Boston University (波士顿大学)
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注: 7 pages, 3 Figures, Accepted at AIRC 2025
Abstract:This paper presents a novel approach for optimizing the scheduling and control of Pan-Tilt-Zoom (PTZ) cameras in dynamic surveillance environments. The proposed method integrates Kalman filters for motion prediction with a dynamic network flow model to enhance real-time video capture efficiency. By assigning Kalman filters to tracked objects, the system predicts future locations, enabling precise scheduling of camera tasks. This prediction-driven approach is formulated as a network flow optimization, ensuring scalability and adaptability to various surveillance scenarios. To further reduce redundant monitoring, we also incorporate group-tracking nodes, allowing multiple objects to be captured within a single camera focus when appropriate. In addition, a value-based system is introduced to prioritize camera actions, focusing on the timely capture of critical events. By adjusting the decay rates of these values over time, the system ensures prompt responses to tasks with imminent deadlines. Extensive simulations demonstrate that this approach improves coverage, reduces average wait times, and minimizes missed events compared to traditional master-slave camera systems. Overall, our method significantly enhances the efficiency, scalability, and effectiveness of surveillance systems, particularly in dynamic and crowded environments.
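摘要中"为跟踪目标分配卡尔曼滤波器并外推未来位置"的做法,可用恒速运动模型的卡尔曼预测步示意如下(numpy 草图;dt、过程噪声 Q 等参数均为假设):

```python
import numpy as np

# 恒速运动模型下的卡尔曼预测步(示意):状态为 [x, y, vx, vy]
dt = 0.1
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
Q = np.eye(4) * 0.01                       # 过程噪声协方差(假设值)

def predict(x, P, steps=10):
    """向前外推 steps 步,返回目标未来位置及协方差,供摄像头任务调度使用。"""
    for _ in range(steps):
        x = F @ x
        P = F @ P @ F.T + Q
    return x[:2], P

x0 = np.array([0.0, 0.0, 1.0, 0.5])        # 当前状态估计
pos, P = predict(x0, np.eye(4))
print("预计 1 秒后位置:", pos)
```

协方差 P 随外推步数增长,可作为调度时对预测可信度的度量。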
zh
[CV-85] 3D Brain MRI Classification for Alzheimer Diagnosis Using CNN with Data Augmentation
【速读】:该论文旨在解决通过T1加权脑部磁共振成像(T1-weighted brain MRI)对健康个体与阿尔茨海默病(Alzheimer’s disease)患者进行分类的问题。其解决方案的关键在于采用一种三维卷积神经网络(3D convolutional neural network),该网络结合了3D卷积、池化、批量归一化、密集ReLU层和Sigmoid输出层,并通过随机噪声注入和五折交叉验证提高了模型的泛化能力,最终在测试集上达到了0.912的准确率和0.961的ROC曲线下面积,显著优于仅使用重采样的方法。
链接: https://arxiv.org/abs/2505.04097
作者: Thien Nhan Vo,Bac Nam Ho,Thanh Xuan Truong
机构: Ho Chi Minh City University of Technology (HUTECH); Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Vietnam National University Ho Chi Minh City (VNU-HCM)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:A three-dimensional convolutional neural network was developed to classify T1-weighted brain MRI scans as healthy or Alzheimer. The network comprises 3D convolution, pooling, batch normalization, dense ReLU layers, and a sigmoid output. Using stochastic noise injection and five-fold cross-validation, the model achieved test set accuracy of 0.912 and area under the ROC curve of 0.961, an improvement of approximately 0.027 over resizing alone. Sensitivity and specificity both exceeded 0.90. These results align with prior work reporting up to 0.10 gain via synthetic augmentation. The findings demonstrate the effectiveness of simple augmentation for 3D MRI classification and motivate future exploration of advanced augmentation methods and architectures such as 3D U-Net and vision transformers.
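论文描述的网络组件(3D 卷积、池化、批归一化、ReLU 全连接与 Sigmoid 输出)可以用 Keras 搭出如下骨架(层数、通道数与输入尺寸均为示意性假设,并非论文的确切配置):

```python
from tensorflow.keras import layers, models

def build_3d_cnn(input_shape=(96, 96, 96, 1)):
    """T1 加权脑 MRI 二分类(健康 / 阿尔茨海默)的 3D CNN 骨架(示意)。"""
    m = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv3D(16, 3, activation="relu"),
        layers.MaxPooling3D(2),
        layers.BatchNormalization(),
        layers.Conv3D(32, 3, activation="relu"),
        layers.MaxPooling3D(2),
        layers.BatchNormalization(),
        layers.GlobalAveragePooling3D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),   # Sigmoid 输出二分类概率
    ])
    m.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
    return m

build_3d_cnn().summary()
```

论文中的随机噪声注入可在训练数据管线中对体素加高斯噪声实现,并配合五折交叉验证评估。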
zh
[CV-86] Prototype-Based Information Compensation Network for Multi-Source Remote Sensing Data Classification
【速读】:该论文旨在解决多源遥感数据联合分类中的两个关键问题:跨频段多源特征耦合不足以及互补信息挖掘不一致。其解决方案的关键在于提出一种基于原型的信息补偿网络(Prototype-based Information Compensation Network, PICNet),该网络首先设计了一个频率交互模块以增强多源特征提取中的跨频段耦合,通过解耦与再耦合机制实现高效的跨频段通信;随后引入基于原型的信息补偿模块,利用可学习的模态原型表示多源数据的全局模态信息,并通过跨模态注意力计算实现特征集成与对齐。
链接: https://arxiv.org/abs/2505.04003
作者: Feng Gao,Sheng Liu,Chuanzheng Gong,Xiaowei Zhou,Jiayi Wang,Junyu Dong,Qian Du
机构: Ocean University of China (中国海洋大学); Mississippi State University (密西西比州立大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TGRS 2025
Abstract:Multi-source remote sensing data joint classification aims to provide accuracy and reliability of land cover classification by leveraging the complementary information from multiple data sources. Existing methods confront two challenges: inter-frequency multi-source feature coupling and inconsistency of complementary information exploration. To solve these issues, we present a Prototype-based Information Compensation Network (PICNet) for land cover classification based on HSI and SAR/LiDAR data. Specifically, we first design a frequency interaction module to enhance the inter-frequency coupling in multi-source feature extraction. The multi-source features are first decoupled into high- and low-frequency components. Then, these features are recoupled to achieve efficient inter-frequency communication. Afterward, we design a prototype-based information compensation module to model the global multi-source complementary information. Two sets of learnable modality prototypes are introduced to represent the global modality information of multi-source data. Subsequently, cross-modal feature integration and alignment are achieved through cross-attention computation between the modality-specific prototype vectors and the raw feature representations. Extensive experiments on three public datasets demonstrate the significant superiority of our PICNet over state-of-the-art methods. The codes are available at this https URL.
zh
[CV-87] A Deep Learning approach for Depressive Symptoms assessment in Parkinsons disease patients using facial videos
【速读】:该论文旨在解决帕金森病(Parkinson’s disease, PD)患者中抑郁症状的检测问题,特别是针对其常见但常被误诊或漏诊的情况。研究提出了一种基于深度学习(deep learning, DL)的方法,利用面部视频分析来评估抑郁症状的存在及其严重程度,通过Geriatric Depression Scale (GDS) 进行量化。解决方案的关键在于采用先进的视觉模型,如Video Swin Tiny、ViViT和3D CNN-LSTM,并结合注意力机制,以捕捉面部动态特征,从而提高检测的准确性和泛化能力。实验结果表明,Video Swin Tiny在二分类和多分类任务中均表现出最佳性能,验证了该方法的有效性。
链接: https://arxiv.org/abs/2505.03845
作者: Ioannis Kyprakis,Vasileios Skaramagkas,Iro Boura,Georgios Karamanis,Dimitrios I. Fotiadis,Zinovia Kefalopoulou,Cleanthe Spanaki,Manolis Tsiknakis
机构: Hellenic Mediterranean University (希腊地中海大学); Foundation for Research and Technology Hellas (希腊研究与技术基金会); University of Crete (克里特大学); Dept. of Neurology, University Hospital of Heraklion (赫拉克利翁大学医院神经科); Dept. of Neurology, Patras University Hospital (帕特拉斯大学医院神经科); University of Ioannina (约阿尼纳大学); Biomedical Research Institute, Foundation for Research and Technology Hellas (希腊研究与技术基金会生物医学研究所)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Parkinson’s disease (PD) is a neurodegenerative disorder, manifesting with motor and non-motor symptoms. Depressive symptoms are prevalent in PD, affecting up to 45% of patients. They are often underdiagnosed due to overlapping motor features, such as hypomimia. This study explores deep learning (DL) models - ViViT, Video Swin Tiny, and 3D CNN-LSTM with attention layers - to assess the presence and severity of depressive symptoms, as detected by the Geriatric Depression Scale (GDS), in PD patients through facial video analysis. The same parameters were assessed in a secondary analysis taking into account whether patients were one hour after (ON-medication state) or 12 hours without (OFF-medication state) dopaminergic medication. Using a dataset of 1,875 videos from 178 patients, the Video Swin Tiny model achieved the highest performance, with up to 94% accuracy and 93.7% F1-score in binary classification (presence or absence of depressive symptoms), and 87.1% accuracy with an 85.4% F1-score in multiclass tasks (absence, mild, or severe depressive symptoms).
zh
[CV-88] From Spaceborn to Airborn: SAR Image Synthesis Using Foundation Models for Multi-Scale Adaptation
【速读】:该论文试图解决高分辨率机载合成孔径雷达(SAR)图像数据稀缺的问题,这限制了现有基础模型在遥感应用中的使用。其解决方案的关键在于利用ONERA多年积累的机载SAR数据,构建一个包含11万张SAR图像的训练集,并基于一个35亿参数的预训练潜在扩散模型进行生成式AI (Generative AI) 的训练,从而实现从卫星SAR图像到机载SAR表征的转换。此外,该方法通过空间条件技术提升了生成图像的真实性,有效弥合了物理仿真器EMPRISE生成的模拟图像与真实数据之间的差距。
链接: https://arxiv.org/abs/2505.03844
作者: Solène Debuysère,Nicolas Trouvé,Nathan Letheule,Olivier Lévêque,Elise Colin
机构: ONERA - DEMR(法国国家航空航天研究院-电磁与雷达研究部); Université Paris Saclay(巴黎萨克雷大学); ONERA - DTIS(法国国家航空航天研究院-信息处理与系统研究部)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The availability of Synthetic Aperture Radar (SAR) satellite imagery has increased considerably in recent years, with datasets commercially available. However, the acquisition of high-resolution SAR images in airborne configurations remains costly and limited. Thus, the lack of open source, well-labeled, or easily exploitable SAR text-image datasets is a barrier to the use of existing foundation models in remote sensing applications. In this context, synthetic image generation is a promising solution to augment this scarce data, enabling a broader range of applications. Leveraging over 15 years of ONERA’s extensive archival airborne data from acquisition campaigns, we created a comprehensive training dataset of 110 thousand SAR images to exploit a 3.5-billion-parameter pre-trained latent diffusion model. In this work, we present a novel approach utilizing spatial conditioning techniques within a foundation model to transform satellite SAR imagery into airborne SAR representations. Additionally, we demonstrate that our pipeline is effective for bridging the realism of simulated images generated by ONERA’s physics-based simulator EMPRISE. Our method explores a key application of AI in advancing SAR imaging technology. To the best of our knowledge, we are the first to introduce this approach in the literature.
zh
[CV-89] IntelliCardiac: An Intelligent Platform for Cardiac Image Segmentation and Classification
【速读】:该论文旨在解决心血管疾病诊断中对心脏影像数据精确且高效处理的需求,其核心问题是传统方法在心脏结构分割与疾病分类上的准确性和自动化程度不足。解决方案的关键在于开发IntelliCardiac平台,该平台采用基于深度学习的分割模型与双阶段分类流程,利用公开的ACDC数据集进行训练,实现了对左右心室及心肌的自动分割,并将心脏影像分类为五类诊断结果,整体分割准确率为92.6%,分类准确率达98%,显著优于现有方法。
链接: https://arxiv.org/abs/2505.03838
作者: Ting Yu Tsai,An Yu,Meghana Spurthi Maadugundu,Ishrat Jahan Mohima,Umme Habiba Barsha,Mei-Hwa F. Chen,Balakrishnan Prabhakaran,Ming-Ching Chang
机构: University at Albany, State University of New York (纽约州立大学阿尔巴尼分校)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Precise and effective processing of cardiac imaging data is critical for the identification and management of the cardiovascular diseases. We introduce IntelliCardiac, a comprehensive, web-based medical image processing platform for the automatic segmentation of 4D cardiac images and disease classification, utilizing an AI model trained on the publicly accessible ACDC dataset. The system, intended for patients, cardiologists, and healthcare professionals, offers an intuitive interface and uses deep learning models to identify essential heart structures and categorize cardiac diseases. The system supports analysis of both the right and left ventricles as well as myocardium, and then classifies patient’s cardiac images into five diagnostic categories: dilated cardiomyopathy, myocardial infarction, hypertrophic cardiomyopathy, right ventricular abnormality, and no disease. IntelliCardiac combines a deep learning-based segmentation model with a two-step classification pipeline. The segmentation module gains an overall accuracy of 92.6%. The classification module, trained on characteristics taken from segmented heart structures, achieves 98% accuracy in five categories. These results exceed the performance of the existing state-of-the-art methods that integrate both segmentation and classification models. IntelliCardiac, which supports real-time visualization, workflow integration, and AI-assisted diagnostics, has great potential as a scalable, accurate tool for clinical decision assistance in cardiac imaging and diagnosis.
zh
[CV-90] On the Residual-based Neural Network for Unmodeled Distortions in Coordinate Transformation
【速读】:该论文试图解决坐标变换模型在处理非线性和空间依赖性畸变时存在的不足,这些不足导致地理空间应用中出现显著的残差误差。其解决方案的关键在于提出一种基于残差的神经网络校正策略,该策略使神经网络仅学习初始几何变换后遗留的系统性畸变,从而降低模型复杂度并提升性能,尤其在控制点配置稀疏或结构化的情况下表现更优。
链接: https://arxiv.org/abs/2505.03757
作者: Vinicius Francisco Rofatto,Luiz Felipe Rodrigues de Almeida,Marcelo Tomio Matsuoka,Ivandro Klein,Mauricio Roberto Veronez,Luiz Gonzaga Da Silveira Junior
机构: 未知
类目: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
备注:
Abstract:Coordinate transformation models often fail to account for nonlinear and spatially dependent distortions, leading to significant residual errors in geospatial applications. Here we propose a residual-based neural correction strategy, in which a neural network learns to model only the systematic distortions left by an initial geometric transformation. By focusing solely on residual patterns, the proposed method reduces model complexity and improves performance, particularly in scenarios with sparse or structured control point configurations. We evaluate the method using simulated datasets with varying distortion intensities and sampling strategies, as well as real-world image georeferencing tasks. Compared with a direct neural network coordinate converter and classical transformation models, the residual-based neural correction delivers more accurate and stable results under challenging conditions, while maintaining comparable performance in ideal cases. These findings demonstrate the effectiveness of residual modelling as a lightweight and robust alternative for improving coordinate transformation accuracy.
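"先拟合初始几何变换、再用小型网络只学习残差畸变"的思路可用如下 numpy + scikit-learn 草图演示(坐标数据与畸变形式为模拟,网络结构为假设):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# 模拟数据:仿射变换 + 非线性畸变
rng = np.random.default_rng(0)
src = rng.uniform(0, 100, (500, 2))                       # 源坐标
true = src @ np.array([[1.01, 0.02], [-0.02, 0.99]]) + 5  # 仿射部分
true += 0.3 * np.sin(src / 10.0)                          # 空间依赖的畸变(模拟)

# 1) 最小二乘拟合初始仿射变换
A = np.hstack([src, np.ones((len(src), 1))])
coef, *_ = np.linalg.lstsq(A, true, rcond=None)
affine_pred = A @ coef

# 2) 小型网络只拟合仿射之后遗留的系统性残差
resid = true - affine_pred
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                   random_state=0).fit(src, resid)
final = affine_pred + net.predict(src)

rmse = lambda e: np.sqrt((e ** 2).mean())
print("仿射 RMSE:", rmse(true - affine_pred))
print("残差校正后 RMSE:", rmse(true - final))
```

由于网络只需表达残差而非整个映射,其容量与训练数据需求都小于直接学习坐标转换的网络。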
zh
人工智能
[AI-0] Score Distillation Sampling for Audio: Source Separation Synthesis and Beyond
【速读】:该论文试图解决如何将Score Distillation Sampling (SDS)方法扩展到文本条件下的音频扩散模型(text-conditioned audio diffusion models)中,以实现多样化的音频生成任务。解决方案的关键在于将SDS的核心思想——将强大的生成先验知识提炼为独立的参数化表示——应用于音频领域,从而利用单一预训练模型完成包括物理启发的冲击声模拟、FM合成参数校准和指定提示的源分离等任务,展现了基于蒸馏方法在多模态中的通用性与有效性。
链接: https://arxiv.org/abs/2505.04621
作者: Jessie Richter-Powell,Antonio Torralba,Jonathan Lorraine
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: See the project website at this https URL
Abstract:We introduce Audio-SDS, a generalization of Score Distillation Sampling (SDS) to text-conditioned audio diffusion models. While SDS was initially designed for text-to-3D generation using image diffusion, its core idea of distilling a powerful generative prior into a separate parametric representation extends to the audio domain. Leveraging a single pretrained model, Audio-SDS enables a broad range of tasks without requiring specialized datasets. In particular, we demonstrate how Audio-SDS can guide physically informed impact sound simulations, calibrate FM-synthesis parameters, and perform prompt-specified source separation. Our findings illustrate the versatility of distillation-based methods across modalities and establish a robust foundation for future work using generative priors in audio tasks.
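SDS 的核心是把冻结扩散模型的去噪方向当作对可微参数化表示的梯度;下面是一个 PyTorch 最小示意(其中 eps_model 用随机函数占位,实际应为冻结的文本条件音频扩散模型;w、t、alpha_bar 的取值均为假设):

```python
import torch

def sds_grad(x, eps_model, t, alpha_bar, cond, w=1.0):
    """SDS 梯度(示意):对可微音频表示 x(如 FM 合成器输出)加噪,
    用冻结扩散模型预测噪声,以 w * (预测噪声 - 真实噪声) 作为 dL/dx。"""
    with torch.no_grad():
        eps = torch.randn_like(x)
        x_t = alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * eps
        eps_hat = eps_model(x_t, t, cond)     # 冻结模型,不反传其内部梯度
    return w * (eps_hat - eps)

# 用随机"扩散模型"演示一次参数更新(仅为接口示意)
param = torch.randn(1, 16000, requires_grad=True)
g = sds_grad(param, lambda x_t, t, c: torch.randn_like(x_t),
             t=torch.tensor(500), alpha_bar=torch.tensor(0.5), cond=None)
param.backward(gradient=g)                    # 把 g 直接当作 dL/dparam 回传
print(param.grad.shape)
```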
zh
[AI-1] WATCH: Weighted Adaptive Testing for Changepoint Hypotheses via Weighted-Conformal Martingales ICML
【速读】:该论文旨在解决在高风险场景中部署人工智能/机器学习(Artificial Intelligence/Machine Learning, AI/ML)系统时,如何实现持续、实时的系统行为监控以快速检测和应对不安全行为的问题。现有方法在监测范围或“警报标准”上存在局限,例如仅能检测特定类型的分布变化或无法在线适应数据分布的变动。论文的关键解决方案是提出一种加权广义的 conformal test martingales(WCTMs),其理论基础支持对数据分布中的任何意外变化点进行在线监控,同时控制误报率。通过设计具体的WCTM算法,该方法能够在面对轻微协变量偏移时实现在线适应,并在检测到更严重的概念偏移或极端协变量偏移时触发警报,从而在实际数据集上表现出优于现有最先进方法的性能。
链接: https://arxiv.org/abs/2505.04608
作者: Drew Prinster,Xing Han,Anqi Liu,Suchi Saria
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: To be published in The International Conference on Machine Learning (ICML), 2025
Abstract:Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but moreover continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Statistical methods for nonparametric change-point detection – especially the tools of conformal test martingales (CTMs) and anytime-valid inference – offer promising approaches to this monitoring task. However, existing methods are restricted to monitoring limited hypothesis classes or "alarm criteria," such as data shifts that violate certain exchangeability assumptions, or do not allow for online adaptation in response to shifts. In this paper, we expand the scope of these monitoring methods by proposing a weighted generalization of conformal test martingales (WCTMs), which lay a theoretical foundation for online monitoring for any unexpected changepoints in the data distribution while controlling false-alarms. For practical applications, we propose specific WCTM algorithms that accommodate online adaptation to mild covariate shifts (in the marginal input distribution) while raising alarms in response to more severe shifts, such as concept shifts (in the conditional label distribution) or extreme (out-of-support) covariate shifts that cannot be easily adapted to. On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines.
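共形检验鞅的基本形式可以用"幂鞅"演示:共形 p 值在无漂移(可交换)时近似均匀,鞅值超过 1/α 即按 Ville 不等式控制误报地报警;论文的 WCTM 是其加权推广,下面仅给出未加权版本的草图(eps、阈值与漂移构造均为假设):

```python
import numpy as np

def power_martingale(p_values, eps=0.5):
    """共形检验鞅(幂鞅)示意:每步以博注函数 eps * p^(eps-1) 更新,
    该函数在 [0,1] 上积分为 1,故无漂移时鞅性成立。"""
    log_m, path = 0.0, []
    for p in p_values:
        log_m += np.log(eps) + (eps - 1) * np.log(max(p, 1e-12))
        path.append(np.exp(log_m))
    return np.array(path)

rng = np.random.default_rng(1)
calm = rng.uniform(size=300)           # 无漂移:p 值均匀,鞅值徘徊在低位
shift = rng.uniform(size=300) ** 3     # 漂移后:p 值系统性偏小,鞅值增长
m = power_martingale(np.concatenate([calm, shift]))
print("报警时刻:", int(np.argmax(m > 100)))   # 阈值 100 ≈ 误报率 1%
```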
zh
[AI-2] AI Governance to Avoid Extinction: The Strategic Landscape and Actionable Research Questions
【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)快速发展可能带来的重大灾难性风险问题,特别是AI系统在认知领域全面超越人类专家后所引发的失控、滥用、大国战争及极权固化等风险。其解决方案的关键在于构建技术、法律和制度基础设施,以实现对危险AI开发与部署的国际限制(称为“Off Switch”),并最终推动全球协调的前沿AI活动暂停(Halt)。这一路径旨在通过国际合作减少AI相关的 catastrophic risks,而其他备选方案则被认为存在不可接受的风险。
链接: https://arxiv.org/abs/2505.04592
作者: Peter Barnett,Aaron Scher
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Humanity appears to be on course to soon develop AI systems that substantially outperform human experts in all cognitive domains and activities. We believe the default trajectory has a high likelihood of catastrophe, including human extinction. Risks come from failure to control powerful AI systems, misuse of AI by malicious rogue actors, war between great powers, and authoritarian lock-in. This research agenda has two aims: to describe the strategic landscape of AI development and to catalog important governance research questions. These questions, if answered, would provide important insight on how to successfully reduce catastrophic risks. We describe four high-level scenarios for the geopolitical response to advanced AI development, cataloging the research questions most relevant to each. Our favored scenario involves building the technical, legal, and institutional infrastructure required to internationally restrict dangerous AI development and deployment (which we refer to as an Off Switch), which leads into an internationally coordinated Halt on frontier AI activities at some point in the future. The second scenario we describe is a US National Project for AI, in which the US Government races to develop advanced AI systems and establish unilateral control over global AI development. We also describe two additional scenarios: a Light-Touch world similar to that of today and a Threat of Sabotage situation where countries use sabotage and deterrence to slow AI development. In our view, apart from the Off Switch and Halt scenario, all of these trajectories appear to carry an unacceptable risk of catastrophic harm. Urgent action is needed from the US National Security community and AI governance ecosystem to answer key research questions, build the capability to halt dangerous AI activities, and prepare for international AI agreements.
zh
[AI-3] Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization
【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)微调过程中存在的安全漏洞问题,即恶意RL微调能够高效地破坏模型的安全防护机制,导致有害输出从0-2级迅速上升至7-9级。解决方案的关键在于提出一种名为“奖励中和”(Reward Neutralization)的防御框架,该框架专门针对RL微调攻击设计,通过训练模型生成信息量最小的拒绝响应,使攻击者无法利用恶意奖励信号,从而系统性地中和有害输出优化的尝试。实验表明,该方法在经历200次攻击步骤后仍能保持较低的有害评分(不超过2),而标准模型则迅速恶化。
链接: https://arxiv.org/abs/2505.04578
作者: Wenjun Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) fine-tuning transforms large language models while creating a vulnerability we experimentally verify: Our experiment shows that malicious RL fine-tuning dismantles safety guardrails with remarkable efficiency, requiring only 50 steps and minimal adversarial prompts, with harmfulness scores escalating from 0-2 to 7-9. This attack vector particularly threatens open-source models with parameter-level access. Existing defenses targeting supervised fine-tuning prove ineffective against RL’s dynamic feedback mechanisms. We introduce Reward Neutralization, the first defense framework specifically designed against RL fine-tuning attacks, establishing concise rejection patterns that render malicious reward signals ineffective. Our approach trains models to produce minimal-information rejections that attackers cannot exploit, systematically neutralizing attempts to optimize toward harmful outputs. Experiments validate that our approach maintains low harmfulness scores (no greater than 2) after 200 attack steps, while standard models rapidly deteriorate. This work provides the first constructive proof that robust defense against increasingly accessible RL attacks is achievable, addressing a critical security gap for open-weight models.
zh
[AI-4] Purity Law for Generalizable Neural TSP Solvers
【速读】:该论文旨在解决神经方法在不同规模和分布下的泛化能力问题,特别是在旅行商问题(Traveling Salesman Problem, TSP)中,神经网络难以学习到识别通用模式并从多样本中推导出最优解的鲁棒性原则。论文提出了一种称为纯度定律(Purity Law, PuLa)的基本结构原则,该原则表明边的普遍性随着周围顶点的稀疏性呈指数增长。基于此洞察,作者提出了纯度策略优化(Purity Policy Optimization, PUPO),其关键在于在求解过程中显式地将神经解的特性与PuLa对齐,从而提升模型的泛化能力。实验表明,PUPO可无缝集成至主流神经求解器中,在不增加推理计算开销的情况下显著提升其泛化性能。
链接: https://arxiv.org/abs/2505.04558
作者: Wenzhao Liu,Haoran Li,Congying Han,Zicheng Zhang,Anqi Li,Tiande Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Achieving generalization in neural approaches across different scales and distributions remains a significant challenge for the Traveling Salesman Problem (TSP). A key obstacle is that neural networks often fail to learn robust principles for identifying universal patterns and deriving optimal solutions from diverse instances. In this paper, we first uncover Purity Law (PuLa), a fundamental structural principle for optimal TSP solutions, which states that edge prevalence grows exponentially with the sparsity of surrounding vertices. Statistically validated across diverse instances, PuLa reveals a consistent bias toward local sparsity in global optima. Building on this insight, we propose Purity Policy Optimization (PUPO), a novel training paradigm that explicitly aligns characteristics of neural solutions with PuLa during the solution construction process to enhance generalization. Extensive experiments demonstrate that PUPO can be seamlessly integrated with popular neural solvers, significantly enhancing their generalization performance without incurring additional computational overhead during inference.
zh
[AI-5] Qualitative Analysis of ω-Regular Objectives on Robust MDPs
【速读】:该论文旨在解决在不确定转移概率的鲁棒马尔可夫决策过程(RMDPs)中,针对可达性(reachability)目标和奇偶(parity)目标的定性分析问题,即判断是否可以在概率1下确保达成目标。其解决方案的关键在于提出高效的算法,这些算法通过访问不确定性集的预言机来解决可达性与奇偶目标的定性问题,并通过实验验证了该方法在大规模RMDP实例中的有效性。
链接: https://arxiv.org/abs/2505.04539
作者: Ali Asadi,Krishnendu Chatterjee,Ehsan Kafshdar Goharshady,Mehrdad Karrabi,Ali Shafiee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Robust Markov Decision Processes (RMDPs) generalize classical MDPs that consider uncertainties in transition probabilities by defining a set of possible transition functions. An objective is a set of runs (or infinite trajectories) of the RMDP, and the value for an objective is the maximal probability that the agent can guarantee against the adversarial environment. We consider (a) reachability objectives, where given a target set of states, the goal is to eventually arrive at one of them; and (b) parity objectives, which are a canonical representation for \omega-regular objectives. The qualitative analysis problem asks whether the objective can be ensured with probability 1. In this work, we study the qualitative problem for reachability and parity objectives on RMDPs without making any assumption over the structures of the RMDPs, e.g., unichain or aperiodic. Our contributions are twofold. We first present efficient algorithms with oracle access to uncertainty sets that solve qualitative problems of reachability and parity objectives. We then report experimental results demonstrating the effectiveness of our oracle-based approach on classical RMDP examples from the literature scaling up to thousands of states.
zh
[AI-6] On some improvements to Unbounded Minimax
【速读】:该论文旨在探索对无界最佳优先极小极大算法(Unbounded Best-First Minimax)的四种此前未被测试的改进方法,以提升其在博弈树搜索中的效率。解决方案的关键在于通过不同的策略优化算法的搜索过程,包括引入置换表(transposition tables)减少重复状态、调整反向传播策略、使用学习启发式函数替代精确终端评估函数以及采用优先处理胜利状态的完成技术。这些改进措施在不同场景下均显示出对算法性能的提升作用。
链接: https://arxiv.org/abs/2505.04525
作者: Quentin Cohen-Solal,Tristan Cazenave
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents the first experimental evaluation of four previously untested modifications of the Unbounded Best-First Minimax algorithm. This algorithm explores the game tree by iteratively expanding the most promising sequences of actions based on the current partial game tree. We first evaluate the use of transposition tables, which convert the game tree into a directed acyclic graph by merging duplicate states. Second, we compare the original algorithm by Korf and Chickering with the variant proposed by Cohen-Solal, which differs in its backpropagation strategy: instead of stopping when a stable value is encountered, it updates values up to the root. This change slightly improves performance when value ties or transposition tables are involved. Third, we assess replacing the exact terminal evaluation function with a learned heuristic function. While beneficial when exact evaluations are costly, this modification reduces performance in inexpensive settings. Finally, we examine the impact of the completion technique that prioritizes resolved winning states and avoids resolved losing states. This technique also improves performance. Overall, our findings highlight how targeted modifications can enhance the efficiency of Unbounded Best-First Minimax.
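下面在一个虚构的小型博弈图上演示无界最佳优先极小极大的骨架:每次迭代沿当前估值最优的子结点下行,展开到达的未展开结点,再沿路径回传极小极大值;置换表按状态键合并重复结点(状态与启发式值均为示意,并非论文实验设置):

```python
# 玩具有向无环博弈"树":字符串状态 -> 后继;状态 "d" 可由两条路径到达,
# 置换表把它合并为同一个表项
children = {
    "root": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d", "e"],
}
heuristic = {"root": 0, "a": 1, "b": 2, "c": 3, "d": -1, "e": 5}

table = {}  # 置换表:状态 -> {"value": 当前估值, "expanded": 是否已展开}

def entry(s):
    if s not in table:
        table[s] = {"value": heuristic[s], "expanded": False}
    return table[s]

def best_first_step(s, maximizing):
    """一次迭代:沿最有希望的子结点下行;首次到达的结点被展开,
    随后把子结点的极小极大值回传到本结点。"""
    e, kids = entry(s), children.get(s, [])
    if not kids:                 # 叶子/终局:保持启发式值
        return
    if not e["expanded"]:
        e["expanded"] = True     # 展开:子结点在 entry() 中懒创建并评估
    else:                        # 已展开:沿当前最优子结点继续下行
        pick = (max if maximizing else min)(kids, key=lambda c: entry(c)["value"])
        best_first_step(pick, not maximizing)
    vals = [entry(c)["value"] for c in kids]
    e["value"] = max(vals) if maximizing else min(vals)

for _ in range(4):               # 数次迭代后根值收敛到真实极小极大值
    best_first_step("root", maximizing=True)
print("root minimax value:", table["root"]["value"])   # -> -1
```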
zh
[AI-7] Model-Based AI planning and Execution Systems for Robotics
【速读】:该论文旨在探讨模型基础的规划与执行系统在机器人任务级控制中的设计选择和所面临的问题,并总结现有解决方案,以推动未来的发展。其关键在于通过集成现代机器人平台,实现灵活的自主机器人系统,这些系统能够通过自动组合基本技能来完成多样化任务,而这一理念自现代机器人学诞生以来已存在多年,但真正集成化、通用化的系统直到近年来才开始出现,如ROSPlan系统。
链接: https://arxiv.org/abs/2505.04493
作者: Or Wertheim,Ronen I. Brafman
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Model-based planning and execution systems offer a principled approach to building flexible autonomous robots that can perform diverse tasks by automatically combining a host of basic skills. This idea is almost as old as modern robotics. Yet, while diverse general-purpose reasoning architectures have been proposed since, general-purpose systems that are integrated with modern robotic platforms have emerged only recently, starting with the influential ROSPlan system. Since then, a growing number of model-based systems for robot task-level control have emerged. In this paper, we consider the diverse design choices and issues existing systems attempt to address, the different solutions proposed so far, and suggest avenues for future development.
zh
[AI-8] rajEvo: Designing Trajectory Prediction Heuristics via LLM -driven Evolution
【速读】:该论文旨在解决轨迹预测任务中传统启发式方法准确性不足以及深度学习方法在计算成本、可解释性和泛化能力方面的局限性问题。其解决方案的关键在于引入TrajEvo框架,该框架利用大型语言模型(Large Language Models, LLMs)自动设计轨迹预测启发式规则,并通过进化算法从历史轨迹数据中生成和优化这些规则,同时采用跨代精英采样和统计反馈循环机制以提升种群多样性和预测性能。
链接: https://arxiv.org/abs/2505.04480
作者: Zhikai Zhao,Chuanbo Hua,Federico Berto,Kanghoon Lee,Zihan Ma,Jiachen Li,Jinkyoo Park
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
备注:
Abstract:Trajectory prediction is a crucial task in modeling human behavior, especially in fields such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy, while recently proposed deep learning approaches suffer from computational cost, lack of explainability, and generalization issues that limit their practical adoption. In this paper, we introduce TrajEvo, a framework that leverages Large Language Models (LLMs) to automatically design trajectory prediction heuristics. TrajEvo employs an evolutionary algorithm to generate and refine prediction heuristics from past trajectory data. We introduce a Cross-Generation Elite Sampling to promote population diversity and a Statistics Feedback Loop allowing the LLM to analyze alternative predictions. Our evaluations show TrajEvo outperforms previous heuristic methods on the ETH-UCY datasets, and remarkably outperforms both heuristics and deep learning methods when generalizing to the unseen SDD dataset. TrajEvo represents a first step toward automated design of fast, explainable, and generalizable trajectory prediction heuristics. We make our source code publicly available to foster future research at this https URL.
zh
[AI-9] Spectral and Temporal Denoising for Differentially Private Optimization
【速读】:该论文旨在解决差分隐私随机梯度下降(DP-SGD)中因添加噪声而导致模型实用性能下降的问题。其解决方案的关键在于提出一种基于快速傅里叶变换的增强卡尔曼滤波器(FFTKF),通过频域噪声整形与卡尔曼滤波相结合的方式,在保持 (\varepsilon, \delta)-DP 保证的前提下提升梯度质量。具体而言,FFTKF在傅里叶域中使用高频整形掩码将差分隐私噪声集中到信息量较少的频谱成分中,同时保留低频梯度信号,并通过带有有限差分海森矩阵近似的标量增益卡尔曼滤波进一步优化去噪后的梯度。
链接: https://arxiv.org/abs/2505.04468
作者: Hyeju Shin,Kyudan Jung,Seongwon Yun,Juyoung Yun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:This paper introduces the FFT-Enhanced Kalman Filter (FFTKF), a differentially private optimization method that addresses the challenge of preserving performance in DP-SGD, where added noise typically degrades model utility. FFTKF integrates frequency-domain noise shaping with Kalman filtering to enhance gradient quality while preserving (\varepsilon, \delta)-DP guarantees. It employs a high-frequency shaping mask in the Fourier domain to concentrate differential privacy noise in less informative spectral components, preserving low-frequency gradient signals. A scalar-gain Kalman filter with finite-difference Hessian approximation further refines the denoised gradients. With a per-iteration complexity of \mathcal{O}(d \log d), FFTKF demonstrates improved test accuracy over DP-SGD and DiSK across MNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets using CNNs, Wide ResNets, and Vision Transformers. Theoretical analysis confirms that FFTKF maintains equivalent privacy guarantees while achieving a tighter privacy-utility trade-off through reduced noise and controlled bias.
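"把噪声能量搬向高频、保留低频梯度信号"的频域整形可用如下 numpy 草图体会(仅演示整形思路;掩码形状与归一化方式为假设,真实的 FFTKF 还需配合卡尔曼去噪与隐私核算来保证 DP 性质):

```python
import numpy as np

def shaped_dp_noise(d, sigma, high_freq_boost=4.0, cutoff=0.25, rng=None):
    """频域噪声整形示意:放大高频分量、相对压低低频分量,
    再归一化使总噪声能量与各向同性高斯噪声一致。"""
    rng = rng or np.random.default_rng()
    noise = rng.normal(0, sigma, d)
    spec = np.fft.rfft(noise)
    f = np.fft.rfftfreq(d)                          # [0, 0.5] 的归一化频率
    mask = np.where(f < cutoff, 1.0, high_freq_boost)
    spec *= mask                                    # 噪声能量集中到高频
    shaped = np.fft.irfft(spec, n=d)
    shaped *= sigma * np.sqrt(d) / np.linalg.norm(shaped)   # 能量归一化
    return shaped

n = shaped_dp_noise(1024, sigma=1.0)
print(n.std())
```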
zh
[AI-10] Discriminative Ordering Through Ensemble Consensus UAI2025
【速读】:该论文试图解决聚类模型性能评估的问题,尤其是在不同聚类定义下难以有效比较多个聚类模型以及整合约束条件的挑战。其解决方案的关键在于通过集成聚类构建一个判别性排序,该排序基于聚类模型的连通性与共识矩阵之间的距离,从而能够有效地对不同聚类模型进行排名和评估。
链接: https://arxiv.org/abs/2505.04464
作者: Louis Ohl,Fredrik Lindsten
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at UAI 2025
Abstract:Evaluating the performance of clustering models is a challenging task where the outcome depends on the definition of what constitutes a cluster. Due to this design, current existing metrics rarely handle multiple clustering models with diverse cluster definitions, nor do they comply with the integration of constraints when available. In this work, we take inspiration from consensus clustering and assume that a set of clustering models is able to uncover hidden structures in the data. We propose to construct a discriminative ordering through ensemble clustering based on the distance between the connectivity of a clustering model and the consensus matrix. We first validate the proposed method with synthetic scenarios, highlighting that the proposed score ranks the models that best match the consensus first. We then show that this simple ranking score significantly outperforms other scoring methods when comparing sets of different clustering algorithms that are not restricted to a fixed number of clusters and is compatible with clustering constraints.
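该排序的核心计算(连接矩阵、共识矩阵及两者的距离)可用几行 scikit-learn 代码示意(聚类模型组合与取 Frobenius 范数作距离均为假设):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
models = {
    "kmeans3": KMeans(3, n_init=10, random_state=0).fit_predict(X),
    "kmeans5": KMeans(5, n_init=10, random_state=0).fit_predict(X),
    "agglo3": AgglomerativeClustering(3).fit_predict(X),
}

def connectivity(labels):
    """连接矩阵:样本 i、j 同簇记 1,否则 0。"""
    return (labels[:, None] == labels[None, :]).astype(float)

# 共识矩阵 = 各模型连接矩阵的平均
consensus = np.mean([connectivity(l) for l in models.values()], axis=0)

# 判别性排序:连接矩阵与共识矩阵的距离越小,排名越靠前
scores = {name: np.linalg.norm(connectivity(l) - consensus)
          for name, l in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: {s:.2f}")
```

可以看到与多数模型共识最一致的聚类(此例中找对 3 个簇的模型)排在最前,而簇数设置偏离的模型得分更差。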
zh
[AI-11] A Survey on Temporal Interaction Graph Representation Learning: Progress Challenges and Opportunities IJCAI2025
【速读】:该论文旨在解决时间交互图(Temporal Interaction Graphs, TIGs)表示学习中的挑战,即如何在动态数据环境中有效嵌入节点以保留结构和时间信息,从而提升下游任务如分类、预测和聚类的性能。其解决方案的关键在于提出一种全面的TIGRL方法分类体系,根据学习过程中使用的不同类型信息进行系统性归类,以应对TIGs特有的时间依赖性问题。此外,论文还提供了数据集和基准测试资源,为后续研究和应用提供支持。
链接: https://arxiv.org/abs/2505.04461
作者: Pengfei Jiao,Hongjiang Chen,Xuan Guo,Zhidong Zhao,Dongxiao He,Di Jin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: IJCAI 2025 Survey Track
Abstract:Temporal interaction graphs (TIGs), defined by sequences of timestamped interaction events, have become ubiquitous in real-world applications due to their capability to model complex dynamic system behaviors. As a result, temporal interaction graph representation learning (TIGRL) has garnered significant attention in recent years. TIGRL aims to embed nodes in TIGs into low-dimensional representations that effectively preserve both structural and temporal information, thereby enhancing the performance of downstream tasks such as classification, prediction, and clustering within constantly evolving data environments. In this paper, we begin by introducing the foundational concepts of TIGs and emphasize the critical role of temporal dependencies. We then propose a comprehensive taxonomy of state-of-the-art TIGRL methods, systematically categorizing them based on the types of information utilized during the learning process to address the unique challenges inherent to TIGs. To facilitate further research and practical applications, we curate the source of datasets and benchmarks, providing valuable resources for empirical investigations. Finally, we examine key open challenges and explore promising research directions in TIGRL, laying the groundwork for future advancements that have the potential to shape the evolution of this field.
zh
[AI-12] Automatic Music Transcription using Convolutional Neural Networks and Constant-Q transform
【速读】:该论文试图解决自动音乐转录(Automatic Music Transcription, AMT)问题,即从包含多个同时演奏音符的音频信号中分析出正在演奏的音符,并生成乐谱表示。解决方案的关键在于设计一个处理流程,将古典钢琴音频文件转换为乐谱表示,其中音频信号的特征通过恒定Q变换(constant-Q transform)提取,并将所得系数作为卷积神经网络(Convolutional Neural Network, CNN)模型的输入。
链接: https://arxiv.org/abs/2505.04451
作者: Yohannis Telila,Tommaso Cucinotta,Davide Bacciu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: 6 pages
Abstract:Automatic music transcription (AMT) is the problem of analyzing an audio recording of a musical piece and detecting notes that are being played. AMT is a challenging problem, particularly when it comes to polyphonic music. The goal of AMT is to produce a score representation of a music piece, by analyzing a sound signal containing multiple notes played simultaneously. In this work, we design a processing pipeline that can transform classical piano audio files in .wav format into a music score representation. The features from the audio signals are extracted using the constant-Q transform, and the resulting coefficients are used as an input to the convolutional neural network (CNN) model.
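"恒定 Q 变换特征 + CNN"的输入管线可用 librosa 与 PyTorch 草绘如下(88 个音高 bin 对应钢琴音域;示例音频、窗口长度与网络结构均为假设,并非论文配置):

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

# 1) 恒定 Q 变换提取特征:从 A0 起的 88 个半音 bin(示意参数)
y, sr = librosa.load(librosa.ex("trumpet"))        # 示例音频,实际应为钢琴 .wav
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                         fmin=librosa.note_to_hz("A0"),
                         n_bins=88, bins_per_octave=12))

# 2) 取上下文窗口送入 CNN,输出 88 维多标签音高激活(示意结构)
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(88), nn.Sigmoid(),               # 每个键位同时发声的概率
)
window = torch.tensor(cqt[:, :32], dtype=torch.float32)[None, None]  # (1,1,88,32)
print(cnn(window).shape)                           # -> torch.Size([1, 88])
```

相比 STFT,CQT 的频率 bin 按对数间隔排列、与音高一一对应,这正是它适合复音转录任务的原因。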
zh
[AI-13] FedBWO: Enhancing Communication Efficiency in Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中由于客户端与服务器之间通信数据量过大导致的系统性能瓶颈问题。现有FL策略在训练过程中传输大量模型权重,需要高带宽支持,而在资源受限的设备上,增加客户端数量会进一步加剧通信负担。论文提出的解决方案关键在于引入联邦黑寡妇优化(Federated Black Widow Optimization, FedBWO)技术,通过仅传输客户端的性能评分而非本地模型权重来减少传输数据量,并利用黑寡妇优化(Black Widow Optimization, BWO)算法提升本地模型更新效果,从而显著提高全局模型性能和通信效率。
链接: https://arxiv.org/abs/2505.04435
作者: Vahideh Hayyolalam,Öznur Özkasap
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 5th IEEE International Conference on Human-Machine Systems, Abu Dhabi, UAE, 26-28 May 2025
Abstract:Federated Learning (FL) is a distributed Machine Learning (ML) setup, where a shared model is collaboratively trained by various clients using their local datasets while keeping the data private. Considering resource-constrained devices, FL clients often suffer from restricted transmission capacity. Aiming to enhance the system performance, the communication between clients and server needs to be diminished. Current FL strategies transmit a tremendous amount of data (model weights) within the FL process, which needs a high communication bandwidth. Considering resource constraints, increasing the number of clients and, consequently, the amount of data (model weights) can lead to a bottleneck. In this paper, we introduce the Federated Black Widow Optimization (FedBWO) technique to decrease the amount of transmitted data by transmitting only a performance score rather than the local model weights from clients. FedBWO employs the BWO algorithm to improve local model updates. The conducted experiments prove that FedBWO remarkably improves the performance of the global model and the communication efficiency of the overall system. According to the experimental outcomes, FedBWO enhances the global model accuracy by an average of 21% over FedAvg, and 12% over FedGWO. Furthermore, FedBWO dramatically decreases the communication cost compared to other methods.
zh
[AI-14] In-Context Adaptation to Concept Drift for Learned Database Operations ICML2025
【速读】:该论文旨在解决动态数据库环境中由于概念漂移导致的机器学习模型性能下降问题,这一问题限制了其在实际数据库操作中的应用。解决方案的关键在于提出FLAIR框架,该框架引入了“在线上下文适应”机制,通过利用数据系统中预测结果的即时可用性来动态构建上下文,并将适应过程形式化为基于动态上下文记忆的函数映射,从而实现无需运行时参数优化的当前概念对齐预测。
链接: https://arxiv.org/abs/2505.04404
作者: Jiaqi Zhu,Shaofeng Cai,Yanyan Shen,Gang Chen,Fang Deng,Beng Chin Ooi
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2025
Abstract:Machine learning has demonstrated transformative potential for database operations, such as query optimization and in-database data analytics. However, dynamic database environments, characterized by frequent updates and evolving data distributions, introduce concept drift, which leads to performance degradation for learned models and limits their practical applicability. Addressing this challenge requires efficient frameworks capable of adapting to shifting concepts while minimizing the overhead of retraining or fine-tuning. In this paper, we propose FLAIR, an online adaptation framework that introduces a new paradigm called in-context adaptation for learned database operations. FLAIR leverages the inherent property of data systems, i.e., immediate availability of execution results for predictions, to enable dynamic context construction. By formalizing adaptation as f: (\mathbf{x} \mid \mathcal{C}_t) \to \mathbf{y}, with \mathcal{C}_t representing a dynamic context memory, FLAIR delivers predictions aligned with the current concept, eliminating the need for runtime parameter optimization. To achieve this, FLAIR integrates two key modules: a Task Featurization Module for encoding task-specific features into standardized representations, and a Dynamic Decision Engine, pre-trained via Bayesian meta-training, to adapt seamlessly using contextual information at runtime. Extensive experiments across key database tasks demonstrate that FLAIR outperforms state-of-the-art baselines, achieving up to 5.2x faster adaptation and reducing error by 22.5% for cardinality estimation.
zh
[AI-15] Consensus-Aware AV Behavior: Trade-offs Between Safety Interaction and Performance in Mixed Urban Traffic
【速读】:该论文试图解决在混合交通环境中,自动驾驶车辆(AV)与人类驾驶车辆(HDV)之间在安全性、交互质量和交通性能方面实现共识的问题。其关键解决方案是将共识视为交通系统的基本属性,并通过高分辨率轨迹数据对关键指标如碰撞时间(TTC)、侵入后时间(PET)、减速度模式、车头时距和串稳定性进行实证分析,以量化不同场景下的共识程度。研究结果表明,全面的共识极为罕见,仅1.63%的AV-VRU交互帧满足所有三个条件,这凸显了需要开发能够显式平衡多维性能的AV模型。
链接: https://arxiv.org/abs/2505.04379
作者: Mohammad Elayan,Wissam Kontar
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 7 pages, 8 figures
Abstract:Transportation systems have long been shaped by complexity and heterogeneity, driven by the interdependency of agent actions and traffic outcomes. The deployment of automated vehicles (AVs) in such systems introduces a new challenge: achieving consensus across safety, interaction quality, and traffic performance. In this work, we position consensus as a fundamental property of the traffic system and aim to quantify it. We use high-resolution trajectory data from the Third Generation Simulation (TGSIM) dataset to empirically analyze AV and human-driven vehicle (HDV) behavior at a signalized urban intersection and around vulnerable road users (VRUs). Key metrics, including Time-to-Collision (TTC), Post-Encroachment Time (PET), deceleration patterns, headways, and string stability, are evaluated across the three performance dimensions. Results show that full consensus across safety, interaction, and performance is rare, with only 1.63% of AV-VRU interaction frames meeting all three conditions. These findings highlight the need for AV models that explicitly balance multi-dimensional performance in mixed-traffic environments. Full reproducibility is supported via our open-source codebase on this https URL.
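文中两项核心安全指标的定义很直接,下面给出最小的计算示意(碰撞时间 TTC 与侵入后时间 PET;数值与"安全阈值"均为假设):

```python
import numpy as np

def time_to_collision(gap_m, v_follow, v_lead):
    """TTC = 车间距 / 相对接近速度,仅在跟随车更快时有定义(示意)。"""
    closing = v_follow - v_lead
    return gap_m / closing if closing > 1e-6 else np.inf

def post_encroachment_time(t_first_exit, t_second_enter):
    """PET = 先行者离开冲突区与后到者进入冲突区的时间差。"""
    return t_second_enter - t_first_exit

ttc = time_to_collision(gap_m=12.0, v_follow=9.0, v_lead=6.0)   # -> 4.0 s
pet = post_encroachment_time(t_first_exit=10.2, t_second_enter=11.5)
# 用示意阈值(假设:TTC > 3s 且 PET > 1s 视为"安全")判断单帧是否满足安全维度;
# 论文的"共识"还要求交互与通行效率维度同时达标
print(ttc, pet, ttc > 3.0 and pet > 1.0)
```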
zh
[AI-16] Uncertain Machine Ethics Planning
【速读】:该论文试图解决在机器伦理决策中如何处理不确定性以及不同道德理论之间可能产生的冲突问题。其核心挑战在于,决策需要在长期序列动作中实现有利结果,而结果的评估可能涉及多个具有冲突判断的道德理论。解决方案的关键是将问题形式化为多道德马尔可夫决策过程(Multi-Moral Markov Decision Process)和多道德随机最短路径问题(Multi-Moral Stochastic Shortest Path Problem),并开发一种基于多目标AO*的启发式算法,结合Sven-Ove Hansson的假设回顾法进行不确定性下的伦理推理。
链接: https://arxiv.org/abs/2505.04352
作者: Simon Kolker,Louise A. Dennis,Ramon Fraga Pereira,Mengwei Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Machine Ethics decisions should consider the implications of uncertainty over decisions. Decisions should be made over sequences of actions to reach preferable outcomes long term. The evaluation of outcomes, however, may invoke one or more moral theories, which might have conflicting judgements. Each theory will require differing representations of the ethical situation. For example, Utilitarianism measures numerical values, Deontology analyses duties, and Virtue Ethics emphasises moral character. While balancing potentially conflicting moral considerations, decisions may need to be made, for example, to achieve morally neutral goals with minimal costs. In this paper, we formalise the problem as a Multi-Moral Markov Decision Process and a Multi-Moral Stochastic Shortest Path Problem. We develop a heuristic algorithm based on Multi-Objective AO*, utilising Sven-Ove Hansson’s Hypothetical Retrospection procedure for ethical reasoning under uncertainty. Our approach is validated by a case study from Machine Ethics literature: the problem of whether to steal insulin for someone who needs it.
zh
[AI-17] Multi-Granular Attention based Heterogeneous Hypergraph Neural Network
【速读】:该论文旨在解决异构图神经网络(HeteGNNs)在学习节点表示时存在的两个关键问题:一是由于元路径(meta-path)的成对性质导致无法捕捉节点间的高阶关系,二是长距离信息传递引起的“过压缩”(over-squashing)问题,进而导致信息失真和性能受限。其解决方案的关键在于提出MGA-HHN,该方法通过两种创新机制实现改进:首先,构建基于元路径的异构超图(heterogeneous hypergraph),以多视角显式建模异构图中的高阶语义信息;其次,引入多粒度注意力机制,在节点和超边层级上操作,从而捕捉同一语义上下文中节点的细粒度交互,同时保持不同超边类型之间的语义多样性。
链接: https://arxiv.org/abs/2505.04340
作者: Hong Jin,Kaicheng Zhou,Jie Yin,Lan You,Zhifeng Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Heterogeneous graph neural networks (HeteGNNs) have demonstrated strong abilities to learn node representations by effectively extracting complex structural and semantic information in heterogeneous graphs. Most of the prevailing HeteGNNs follow the neighborhood aggregation paradigm, leveraging meta-path based message passing to learn latent node representations. However, due to the pairwise nature of meta-paths, these models fail to capture high-order relations among nodes, resulting in suboptimal performance. Additionally, the challenge of "over-squashing", where long-range message passing in HeteGNNs leads to severe information distortion, further limits the efficacy of these models. To address these limitations, this paper proposes MGA-HHN, a Multi-Granular Attention based Heterogeneous Hypergraph Neural Network for heterogeneous graph representation learning. MGA-HHN introduces two key innovations: (1) a novel approach for constructing meta-path based heterogeneous hypergraphs that explicitly models higher-order semantic information in heterogeneous graphs through multiple views, and (2) a multi-granular attention mechanism that operates at both the node and hyperedge levels. This mechanism enables the model to capture fine-grained interactions among nodes sharing the same semantic context within a hyperedge type, while preserving the diversity of semantics across different hyperedge types. As such, MGA-HHN effectively mitigates long-range message distortion and generates more expressive node representations. Extensive experiments on real-world benchmark datasets demonstrate that MGA-HHN outperforms state-of-the-art models, showcasing its effectiveness in node classification, node clustering and visualization tasks.
zh
[AI-18] Detecting Concept Drift in Neural Networks Using Chi-squared Goodness of Fit Testing
【速读】:该论文试图解决深度学习模型在推理过程中因概念漂移(concept drift)导致的可靠性问题,即当模型遇到与训练数据分布不同的推理数据时,其性能可能下降但难以被及时检测。解决方案的关键在于应用卡方拟合优度检验(\chi^2 Goodness of Fit Hypothesis Test)作为元算法,用于检测模型在不同推理场景下的统计特性变化,从而在不直接检查推理输出的情况下识别准确率的异常下降,提升模型在不同条件下的持续可靠性评估能力。
链接: https://arxiv.org/abs/2505.04318
作者: Jacob Glenn Ayers,Buvaneswari A. Ramanan,Manzoor A. Khan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 8 pages, 6 figures, 1 table
Abstract:As the adoption of deep learning models has grown beyond human capacity for verification, meta-algorithms are needed to ensure reliable model inference. Concept drift detection is a field dedicated to identifying statistical shifts that is underutilized in monitoring neural networks that may encounter inference data with distributional characteristics diverging from their training data. Given the wide variety of model architectures, applications, and datasets, it is important that concept drift detection algorithms are adaptable to different inference scenarios. In this paper, we introduce an application of the \chi^2 Goodness of Fit Hypothesis Test as a drift detection meta-algorithm applied to a multilayer perceptron, a convolutional neural network, and a transformer trained for machine vision as they are exposed to simulated drift during inference. To that end, we demonstrate how unexpected drops in accuracy due to concept drift can be detected without directly examining the inference outputs. Our approach enhances safety by ensuring models are continually evaluated for reliability across varying conditions.
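作为元算法,该检验不看推理输出的对错,只比较"推理期预测类别分布"与"训练期分布"是否显著偏离;用 scipy 可写成如下草图(类别数、显著性水平与漂移构造均为假设):

```python
import numpy as np
from scipy.stats import chisquare

def drift_alarm(train_pred_labels, infer_pred_labels, n_classes, alpha=0.01):
    """卡方拟合优度检验:以训练期预测类别频率为期望分布,
    检验推理窗口内的观测类别计数是否显著偏离(示意)。"""
    expected_ratio = np.bincount(train_pred_labels, minlength=n_classes) \
                     / len(train_pred_labels)
    observed = np.bincount(infer_pred_labels, minlength=n_classes)
    expected = expected_ratio * len(infer_pred_labels)
    stat, p = chisquare(f_obs=observed, f_exp=expected)
    return p < alpha, p

rng = np.random.default_rng(0)
train = rng.integers(0, 10, 10000)                        # 训练期各类大致均匀
drifted = rng.choice(10, 2000, p=[0.3] + [0.7 / 9] * 9)   # 漂移后类别偏斜
print(drift_alarm(train, drifted, n_classes=10))          # -> (True, 极小的 p 值)
```

由于只依赖预测标签的直方图,这一检验对 MLP、CNN、Transformer 等任意结构都同样适用,这正是摘要强调的"元算法"属性。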
zh
[AI-19] Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning
【速读】:该论文试图解决3v3多无人机排球这一新型具身竞争任务中的学习问题,该任务需要高水平的战略协调与低水平的敏捷控制。其挑战主要来源于长时序依赖、强代理间耦合以及四旋翼飞行器的欠驱动动力学特性。解决方案的关键在于提出分层协同自对弈(Hierarchical Co-Self-Play, HCSP)框架,该框架通过将集中式的高层战略决策与分布式的低层运动控制分离,实现策略与技能的协同进化。
链接: https://arxiv.org/abs/2505.04317
作者: Ruize Zhang,Sirui Xiang,Zelai Xu,Feng Gao,Shilong Ji,Wenhao Tang,Wenbo Ding,Chao Yu,Yu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we tackle the problem of learning to play 3v3 multi-drone volleyball, a new embodied competitive task that requires both high-level strategic coordination and low-level agile control. The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors. To address this, we propose Hierarchical Co-Self-Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high-level strategic decision-making from decentralized low-level motion control. We design a three-stage population-based training pipeline to enable both strategy and skill to emerge from scratch without expert demonstrations: (I) training diverse low-level skills, (II) learning high-level strategy via self-play with fixed low-level controllers, and (III) joint fine-tuning through co-self-play. Experiments show that HCSP achieves superior performance, outperforming non-hierarchical self-play and rule-based hierarchical baselines with an average 82.9% win rate and a 71.5% win rate against the two-stage variant. Moreover, co-self-play leads to emergent team behaviors such as role switching and coordinated formations, demonstrating the effectiveness of our hierarchical design and training scheme.
zh
[AI-20] KERAIA: An Adaptive and Explainable Framework for Dynamic Knowledge Representation and Reasoning
【速读】:该论文旨在解决如何将非结构化且通常隐性的专家知识有效地转化为AI系统可高效利用的计算可处理算法这一挑战。其解决方案的关键在于KERAIA框架,该框架基于Minsky的基于框架的推理和K线概念,并引入了关键创新,包括知识云用于动态聚合、动态关系(DRels)用于上下文敏感的继承、显式思维路线(LoTs)用于可追溯推理以及云细化用于自适应知识转换,从而突破了传统静态知识表示范式的局限性。
链接: https://arxiv.org/abs/2505.04313
作者: Stephen Richard Varey,Alessandro Di Stefano, TheAnh Han
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Symbolic Computation (cs.SC)
备注: 22 pages
Abstract:In this paper, we introduce KERAIA, a novel framework and software platform for symbolic knowledge engineering designed to address the persistent challenges of representing, reasoning with, and executing knowledge in dynamic, complex, and context-sensitive environments. The central research question that motivates this work is: How can unstructured, often tacit, human expertise be effectively transformed into computationally tractable algorithms that AI systems can efficiently utilise? KERAIA seeks to bridge this gap by building on foundational concepts such as Minsky’s frame-based reasoning and K-lines, while introducing significant innovations. These include Clouds of Knowledge for dynamic aggregation, Dynamic Relations (DRels) for context-sensitive inheritance, explicit Lines of Thought (LoTs) for traceable reasoning, and Cloud Elaboration for adaptive knowledge transformation. This approach moves beyond the limitations of traditional, often static, knowledge representation paradigms. KERAIA is designed with Explainable AI (XAI) as a core principle, ensuring transparency and interpretability, particularly through the use of LoTs. The paper details the framework’s architecture, the KSYNTH representation language, and the General Purpose Paradigm Builder (GPPB) to integrate diverse inference methods within a unified structure. We validate KERAIA’s versatility, expressiveness, and practical applicability through detailed analysis of multiple case studies spanning naval warfare simulation, industrial diagnostics in water treatment plants, and strategic decision-making in the game of RISK. Furthermore, we provide a comparative analysis against established knowledge representation paradigms (including ontologies, rule-based systems, and knowledge graphs) and discuss the implementation aspects and computational considerations of the KERAIA platform.
zh
[AI-21] Flow Models for Unbounded and Geometry-Aware Distributional Reinforcement Learning
【速读】:该论文旨在解决传统分布强化学习(Distributional Reinforcement Learning, DistRL)方法在建模回报分布时的局限性,例如固定或有界表示、多模态、偏度和尾部行为建模能力不足以及参数效率低的问题。其解决方案的关键在于采用归一化流(normalizing flows)来建模回报分布,从而实现灵活且无界的支撑集,并提升对复杂分布特征的建模能力。此外,为克服现有训练指标如KL散度或Wasserstein距离在尺度不敏感或样本梯度有偏的问题,作者提出了一种几何感知的Cramèr距离替代指标,该指标可直接从回报分布的概率密度函数(PDF)计算,避免了代价高昂的累积分布函数(CDF)计算。
链接: https://arxiv.org/abs/2505.04310
作者: Simo Alami C.,Rim Kaddah,Jesse Read,Marie-Paule Cani
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:
Abstract:We introduce a new architecture for Distributional Reinforcement Learning (DistRL) that models return distributions using normalizing flows. This approach enables flexible, unbounded support for return distributions, in contrast to categorical approaches like C51 that rely on fixed or bounded representations. It also offers richer modeling capacity to capture multi-modality, skewness, and tail behavior than quantile-based approaches. Our method is significantly more parameter-efficient than categorical approaches. Standard metrics used to train existing models, like KL divergence or Wasserstein distance, are either scale-insensitive or have biased sample gradients, especially when return supports do not overlap. To address this, we propose a novel surrogate for the Cramèr distance, that is geometry-aware and computable directly from the return distribution's PDF, avoiding the costly CDF computation. We test our model on the ATARI-5 sub-benchmark and show that our approach outperforms PDF-based models while remaining competitive with quantile-based methods.
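作为参考,下面给出离散支撑上标准Cramèr距离的朴素计算(基于CDF累加的平方L2距离),仅用于说明该距离的几何含义;论文提出的代理指标正是为了绕开这里的CDF累加、直接从PDF计算,具体形式以原文为准。

```python
# 朴素参考实现:共享等距支撑上两个离散分布之间的 Cramèr 距离(CDF 的平方 L2 距离)
import numpy as np

def cramer_distance(p, q, support):
    """p, q: 归一化的离散概率质量; support: 等间距的原子位置"""
    F, G = np.cumsum(p), np.cumsum(q)
    dx = support[1] - support[0]  # 假设支撑点等间距
    return float(np.sum((F - G) ** 2) * dx)
```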
zh
[AI-22] Guardians of the Web: The Evolution and Future of Website Information Security
【速读】:该论文旨在探讨网站信息安全的发展历程、当前实践及未来方向,以应对数字时代日益复杂的网络安全挑战。其解决方案的关键在于采用多层防护策略,包括加密技术、安全编码规范、定期安全审计和用户教育,同时结合新兴技术如人工智能、区块链和量子计算,以及加强国际协作与标准化建设,以提升对敏感信息的保护能力并维护数字世界的信任体系。
链接: https://arxiv.org/abs/2505.04308
作者: Md Saiful Islam,Li Xiangdong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 22 pages
Abstract:Website information security has become a critical concern in the digital age. This article explores the evolution of website information security, examining its historical development, current practices, and future directions. The early beginnings from the 1960s to the 1980s laid the groundwork for modern cybersecurity, with the development of ARPANET, TCP/IP, public-key cryptography, and the first antivirus programs. The 1990s marked a transformative era, driven by the commercialization of the Internet and the emergence of web-based services. As the Internet grew, so did the range and sophistication of cyber threats, leading to advancements in security technologies such as the Secure Sockets Layer (SSL) protocol, password protection, and firewalls. Current practices in website information security involve a multi-layered approach, including encryption, secure coding practices, regular security audits, and user education. The future of website information security is expected to be shaped by emerging technologies such as artificial intelligence, blockchain, and quantum computing, as well as the increasing importance of international cooperation and standardization efforts. As cyber threats continue to evolve, ongoing research and innovation in website information security will be essential to protect sensitive information and maintain trust in the digital world.
zh
[AI-23] Non-stationary Diffusion For Probabilistic Time Series Forecasting ICML
【速读】:该论文试图解决时间序列预测中不确定性随时间变化的非平稳性问题,现有去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPMs)由于依赖加性噪声模型(Additive Noise Model, ANM)的固定方差假设,无法有效捕捉这种动态变化的不确定性。解决方案的关键在于创新性地引入位置-尺度噪声模型(Location-Scale Noise Model, LSNM),以放宽ANM的固定不确定性假设,并设计了一个基于LSNM的扩散概率预测框架——非平稳扩散(Non-stationary Diffusion, NsDiff),通过结合去噪扩散条件生成模型与预训练的条件均值和方差估计器,实现对不确定性的自适应建模,同时提出了一种考虑不确定性的噪声调度策略,动态调整噪声水平以反映数据在每一步的不确定性。
链接: https://arxiv.org/abs/2505.04278
作者: Weiwei Ye,Zhuopeng Xu,Ning Gui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted as spotlight poster at ICML
Abstract:Due to the dynamics of underlying physics and external influences, the uncertainty of time series often varies over time. However, existing Denoising Diffusion Probabilistic Models (DDPMs) often fail to capture this non-stationary nature, constrained by their constant variance assumption from the additive noise model (ANM). In this paper, we innovatively utilize the Location-Scale Noise Model (LSNM) to relax the fixed uncertainty assumption of ANM. A diffusion-based probabilistic forecasting framework, termed Non-stationary Diffusion (NsDiff), is designed based on LSNM that is capable of modeling the changing pattern of uncertainty. Specifically, NsDiff combines a denoising diffusion-based conditional generative model with a pre-trained conditional mean and variance estimator, enabling adaptive endpoint distribution modeling. Furthermore, we propose an uncertainty-aware noise schedule, which dynamically adjusts the noise levels to accurately reflect the data uncertainty at each step and integrates the time-varying variances into the diffusion process. Extensive experiments conducted on nine real-world and synthetic datasets demonstrate the superior performance of NsDiff compared to existing approaches. Code is available at this https URL.
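下面用一小段示意代码对比加性噪声模型(ANM)与位置-尺度噪声模型(LSNM)在采样形式上的区别(假设性写法,f、g 分别代表条件均值与尺度函数,对应论文中预训练的条件估计器,并非原实现):

```python
# 示意:ANM 的噪声尺度固定;LSNM 的均值与尺度均依赖输入,从而能表达时变不确定性
import torch

def anm_sample(x, f, sigma=1.0):
    return f(x) + sigma * torch.randn_like(x)   # y = f(x) + eps,方差恒定

def lsnm_sample(x, f, g):
    return f(x) + g(x) * torch.randn_like(x)    # y = f(x) + g(x)*eps,方差随输入变化
```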
zh
[AI-24] Weaponizing Language Models for Cybersecurity Offensive Operations: Automating Vulnerability Assessment Report Validation; A Review Paper
【速读】:该论文试图解决在漏洞评估(Vulnerability Assessment, VA)报告验证过程中存在的效率低和误报率高的问题。其解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)自动化并改进VA报告的分析与验证流程,从而减少误报并提升整体效率。
链接: https://arxiv.org/abs/2505.04265
作者: Abdulrahman S Almuhaidib,Azlan Mohd Zain,Zalmiyah Zakaria,Izyan Izzati Kamsani,Abdulaziz S Almuhaidib
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Pre-print - Accepted for publication in the Proceedings of the International Computer Sciences and Informatics Conference (ICSIC-2024), published by AIP Publishing
Abstract:The ever-increasing sophistication of cyberwarfare calls for novel solutions. In this regard, Large Language Models (LLMs) have emerged as a highly promising tool for defensive and offensive cybersecurity-related strategies. While existing literature has focused much on the defensive use of LLMs, very little has been reported on their offensive utilization, namely concerning Vulnerability Assessment (VA) report validation. Consequently, this paper tries to fill that gap by investigating the capabilities of LLMs in automating and improving the validation process of VA reports. Following a critical review of the related literature, this paper proposes a new approach to using LLMs in automating the analysis and validation of VA reports, which could potentially reduce the number of false positives and generally enhance efficiency. These results suggest that LLM-based automation of VA report validation can improve accuracy while reducing human effort and strengthening security postures. The contribution of this paper provides further evidence of the offensive and defensive capabilities of LLMs and therefore helps in devising more appropriate cybersecurity strategies and tools.
zh
[AI-25] Steerable Chatbots: Personalizing LLM s with Preference-Based Activation Steering
【速读】:该论文试图解决用户在与大型语言模型(Large Language Models, LLMs)交互时,由于缺乏有效的提示规范能力,难以准确传达其隐性偏好以获得个性化响应的问题。解决方案的关键在于利用激活转向(activation steering)技术,在推理过程中引导LLMs对可解释的偏好维度进行对齐,从而实现更符合用户软性偏好的个性化交互。相比依赖长期用户历史的记忆型个性化方法,激活转向具有轻量级和用户可控性强的特点,通过线性强度因子即可实现对输出的调节。
链接: https://arxiv.org/abs/2505.04260
作者: Jessica Y. Bo,Tianyu Xu,Ishan Chatterjee,Katrina Passarella-Ward,Achin Kulshrestha,D Shin
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models (LLMs) improve in their capacity to serve as personal AI assistants, their ability to output uniquely tailored, personalized responses that align with the soft preferences of their users is essential for enhancing user satisfaction and retention. However, untrained lay users have poor prompt specification abilities and often struggle with conveying their latent preferences to AI assistants. To address this, we leverage activation steering to guide LLMs to align with interpretable preference dimensions during inference. In contrast to memory-based personalization methods that require longer user history, steering is extremely lightweight and can be easily controlled by the user via a linear strength factor. We embed steering into three different interactive chatbot interfaces and conduct a within-subjects user study (n=14) to investigate how end users prefer to personalize their conversations. The results demonstrate the effectiveness of preference-based steering for aligning real-world conversations with hidden user preferences, and highlight further insights on how diverse values around control, usability, and transparency lead users to prefer different interfaces.
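激活转向的核心操作可以用几行代码示意:在某一隐藏层的输出上加上"偏好方向向量"乘以用户可调的线性强度因子。以下为假设性草图(steer_vec 的提取方式、作用层的位置均为假设,且假定该层前向输出为单个张量):

```python
# 示意性草图:通过 forward hook 在推理时对隐藏层输出做线性转向
import torch

def add_steering_hook(layer, steer_vec, strength=1.0):
    """layer: 任意 nn.Module;steer_vec: 预先提取的偏好方向;strength: 线性强度因子"""
    def hook(module, inputs, output):
        # 假设该层输出为单个张量;若为元组则需改取 output[0]
        return output + strength * steer_vec.to(output.device, output.dtype)
    return layer.register_forward_hook(hook)  # 返回句柄,可用 handle.remove() 撤销转向
```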
zh
[AI-26] Facilitating Trustworthy Human-Agent Collaboration in LLM-based Multi-Agent System oriented Software Engineering
【速读】:该论文试图解决在软件工程(Software Engineering, SE)领域中,将基于大语言模型的多智能体自主系统(LLM-based Multi-Agent Autonomous, LMA)引入时,如何可信地进行人机任务战略分配的问题。解决方案的关键是提出一种基于RACI(Responsible, Accountable, Consulted, Informed)的框架,该框架旨在促进高效协作、确保责任归属,并降低由大语言模型驱动的自动化带来的潜在风险,同时符合可信人工智能(Trustworthy AI)的指导原则。
链接: https://arxiv.org/abs/2505.04251
作者: Krishna Ronanki
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Multi-agent autonomous systems (MAS) are better at addressing challenges that span across multiple domains than singular autonomous agents. This holds true within the field of software engineering (SE) as well. The state-of-the-art research on MAS within SE focuses on integrating LLMs at the core of autonomous agents to create LLM-based multi-agent autonomous (LMA) systems. However, the introduction of LMA systems into SE brings a plethora of challenges. One of the major challenges is the strategic allocation of tasks between humans and the LMA system in a trustworthy manner. To address this challenge, a RACI-based framework is proposed in this work-in-progress article, along with implementation guidelines and an example implementation of the framework. The proposed framework can facilitate efficient collaboration, ensure accountability, and mitigate potential risks associated with LLM-driven automation while aligning with the Trustworthy AI guidelines. Future steps for this work, delineating the planned empirical validation method, are also presented.
zh
[AI-27] FRAIN to Train: A Fast-and-Reliable Solution for Decentralized Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中由于设备异构性导致的同步延迟、非独立同分布(Non-IID)数据带来的客户端漂移以及聚合器自由架构下同步开销增加等问题。其解决方案的关键在于提出一种名为FRAIN的新方法,通过两个核心机制实现:首先,FastSync策略避免了重复加载历史模型版本,使新加入或参与频率较低的客户端能够高效逼近全局模型;其次,在参数融合过程中采用球面线性插值(SLERP),保持模型方向的一致性,缓解因局部训练差异导致的破坏性干扰。
链接: https://arxiv.org/abs/2505.04223
作者: Sanghyeon Park,Soo-Mook Moon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated learning (FL) enables collaborative model training across distributed clients while preserving data locality. Although FedAvg pioneered synchronous rounds for global model averaging, slower devices can delay collective progress. Asynchronous FL (e.g., FedAsync) addresses stragglers by continuously integrating client updates, yet naive implementations risk client drift due to non-IID data and stale contributions. Some Blockchain-based FL approaches (e.g., BRAIN) employ robust weighting or scoring of updates to resist malicious or misaligned proposals. However, performance drops can still persist under severe data heterogeneity or high staleness, and synchronization overhead has emerged as a new concern due to its aggregator-free architectures. We introduce Fast-and-Reliable AI Network, FRAIN, a new asynchronous FL method that mitigates these limitations by incorporating two key ideas. First, our FastSync strategy eliminates the need to replay past model versions, enabling newcomers and infrequent participants to efficiently approximate the global model. Second, we adopt spherical linear interpolation (SLERP) when merging parameters, preserving models' directions and alleviating destructive interference from divergent local training. Experiments with a CNN image-classification model and a Transformer-based language model demonstrate that FRAIN achieves more stable and robust convergence than FedAvg, FedAsync, and BRAIN, especially under harsh environments: non-IID data distributions, networks that experience delays and require frequent re-synchronization, and the presence of malicious nodes.
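FRAIN在参数融合时用球面线性插值(SLERP)代替线性平均。下面是对展平参数向量做SLERP的最小示意(假设性实现,实际系统中的分块与加权策略以论文为准),t=0.5 即等权合并:

```python
# 最小 SLERP 草图:在两组展平参数之间做球面插值
import torch

def slerp(w_a, w_b, t=0.5, eps=1e-8):
    a = w_a / (w_a.norm() + eps)
    b = w_b / (w_b.norm() + eps)
    omega = torch.arccos(torch.clamp(torch.dot(a, b), -1.0, 1.0))  # 两参数向量的夹角
    if omega.abs() < eps:                      # 方向几乎一致时退化为线性插值
        return (1 - t) * w_a + t * w_b
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * w_a + (torch.sin(t * omega) / so) * w_b
```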
zh
[AI-28] To Judge or not to Judge: Using LLM Judgements for Advertiser Keyphrase Relevance at eBay
【速读】:该论文试图解决电商卖家在广告关键词推荐中的相关性问题,即如何确保推荐的关键词与卖家库存高度相关,以提升买家参与度并避免搜索系统被大量不相关商品淹没。解决方案的关键在于通过大规模使用大语言模型(LLM)作为评判者,替代传统基于点击率/销售数据/搜索相关性的训练信号,从而更准确地对齐卖家的判断,实现卖家行为、广告投放和搜索拍卖三者之间的协同优化。
链接: https://arxiv.org/abs/2505.04209
作者: Soumik Dey,Hansi Wu,Binbin Li
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:E-commerce sellers are recommended keyphrases based on their inventory on which they advertise to increase buyer engagement (clicks/sales). The relevance of advertiser keyphrases plays an important role in preventing the inundation of search systems with numerous irrelevant items that compete for attention in auctions, in addition to maintaining a healthy seller perception. In this work, we describe the shortcomings of training advertiser keyphrase relevance filter models on click/sales/search relevance signals and the importance of aligning with human judgment, as sellers have the power to adopt or reject said keyphrase recommendations. In this study, we frame advertiser keyphrase relevance as a complex interaction between three dynamical systems: seller judgment, which influences seller adoption of our product; Advertising, which provides the keyphrases to bid on; and Search, which holds the auctions for the same keyphrases. This study discusses the practicalities of using human judgment via a case study at eBay Advertising and demonstrates that using LLM-as-a-judge en masse as a scalable proxy for seller judgment to train our relevance models achieves a better harmony across the three systems, provided that they are bound by a meticulous evaluation framework grounded in business metrics.
zh
[AI-29] On-Device LLM for Context-Aware Wi-Fi Roaming
【速读】:该论文旨在解决动态移动环境中无线漫游(wireless roaming)的挑战性问题,传统基于阈值或启发式的方案常导致粘性或过度切换,影响连接的无缝性。其解决方案的关键在于首次将设备端的大语言模型(large language model, LLM)应用于跨层控制,通过应用层的高层推理生成实时动作,并在物理层/媒体访问控制层(PHY/MAC stack)执行。LLM主要处理两个任务:上下文感知的接入点(AP)选择与动态阈值调整,同时通过一系列优化技术(如思维链提示、参数高效微调和量化)满足边缘硬件的低延迟和资源限制需求。
链接: https://arxiv.org/abs/2505.04174
作者: Ju-Hyung Lee,Yanqing Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
备注:
Abstract:Wireless roaming is a critical yet challenging task for maintaining seamless connectivity in dynamic mobile environments. Conventional threshold-based or heuristic schemes often fail, leading to either sticky or excessive handovers. We introduce the first cross-layer use of an on-device large language model (LLM): high-level reasoning in the application layer that issues real-time actions executed in the PHY/MAC stack. The LLM addresses two tasks: (i) context-aware AP selection, where structured prompts fuse environmental cues (e.g., location, time) to choose the best BSSID; and (ii) dynamic threshold adjustment, where the model adaptively decides when to roam. To satisfy the tight latency and resource budgets of edge hardware, we apply a suite of optimizations: chain-of-thought prompting, parameter-efficient fine-tuning, and quantization. Experiments on indoor and outdoor datasets show that our approach surpasses legacy heuristics and DRL baselines, achieving a strong balance between roaming stability and signal quality. These findings underscore the promise of application-layer LLM reasoning for lower-layer wireless control in future edge systems.
zh
[AI-30] TS-SNN: Temporal Shift Module for Spiking Neural Networks ICML2025
【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在处理时间信息与保持低能耗之间的平衡问题。其解决方案的关键在于引入时间位移模块(Temporal Shift, TS),构建TS-SNN:该模块通过一个简单而有效的位移操作,在单个时间步内整合过去、现在和未来的脉冲特征,并采用残差组合方法防止信息丢失。TS模块轻量且仅需一个额外的可学习参数,能够以极小的计算成本融入现有架构,从而在减少时间步数的同时实现高准确率和低能耗。
链接: https://arxiv.org/abs/2505.04165
作者: Kairong Yu,Tianqing Zhang,Qi Xu,Gang Pan,Hongwei Wang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: Accepted by ICML2025
Abstract:Spiking Neural Networks (SNNs) are increasingly recognized for their biological plausibility and energy efficiency, positioning them as strong alternatives to Artificial Neural Networks (ANNs) in neuromorphic computing applications. SNNs inherently process temporal information by leveraging the precise timing of spikes, but balancing temporal feature utilization with low energy consumption remains a challenge. In this work, we introduce Temporal Shift module for Spiking Neural Networks (TS-SNN), which incorporates a novel Temporal Shift (TS) module to integrate past, present, and future spike features within a single timestep via a simple yet effective shift operation. A residual combination method prevents information loss by integrating shifted and original features. The TS module is lightweight, requiring only one additional learnable parameter, and can be seamlessly integrated into existing architectures with minimal additional computational cost. TS-SNN achieves state-of-the-art performance on benchmarks like CIFAR-10 (96.72%), CIFAR-100 (80.28%), and ImageNet (70.61%) with fewer timesteps, while maintaining low energy consumption. This work marks a significant step forward in developing efficient and accurate SNN architectures.
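TS模块的"位移"思想与视频理解中的TSM类似:将一部分通道沿时间维前移、一部分后移,再与原特征做残差组合。以下为假设性草图(通道划分比例与组合系数均为示意;论文中的组合权重是一个可学习参数,此处用固定 alpha 代替):

```python
# 示意性草图:对形状为 [T, B, C] 的脉冲特征做时间位移并残差组合
import torch

def temporal_shift(x, fold_div=8, alpha=0.5):
    T, B, C = x.shape
    fold = C // fold_div
    out = x.clone()
    out[:-1, :, :fold] = x[1:, :, :fold]                   # 第一组通道引入"未来"一步的信息
    out[1:, :, fold:2 * fold] = x[:-1, :, fold:2 * fold]   # 第二组通道引入"过去"一步的信息
    return alpha * x + (1 - alpha) * out                   # 残差组合,防止原始信息丢失
```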
zh
[AI-31] Polynomial-Time Relational Probabilistic Inference in Open Universes
【速读】:该论文试图解决人工智能中在不确定性下的推理问题,这一问题在表达能力与计算可 tractability 之间存在显著矛盾。论文提出了一种满足两者要求的一阶关系概率推理方法,能够处理混合(离散和连续)变量。其解决方案的关键在于将期望的平方和逻辑扩展到关系设置,并证明在有界度数片段和有界量词秩的知识库中,可以以多项式时间进行提升推理,即使对象集是先验未知和/或可数无限的。关键的可 tractability 概念以证明论术语定义,超越了语言或查询的语法属性。
链接: https://arxiv.org/abs/2505.04115
作者: Luise Ge,Brendan Juba,Kris Nilsson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reasoning under uncertainty is a fundamental challenge in Artificial Intelligence. As with most of these challenges, there is a harsh dilemma between the expressive power of the language used, and the tractability of the computational problem posed by reasoning. Inspired by human reasoning, we introduce a method of first-order relational probabilistic inference that satisfies both criteria, and can handle hybrid (discrete and continuous) variables. Specifically, we extend sum-of-squares logic of expectation to relational settings, demonstrating that lifted reasoning in the bounded-degree fragment for knowledge bases of bounded quantifier rank can be performed in polynomial time, even with an a priori unknown and/or countably infinite set of objects. Crucially, our notion of tractability is framed in proof-theoretic terms, which extends beyond the syntactic properties of the language or queries. We are able to derive the tightest bounds provable by proofs of a given degree and size and establish completeness in our sum-of-squares refutations for fixed degrees.
zh
[AI-32] LLMs Suitability for Network Security: A Case Study of STRIDE Threat Modeling
【速读】:该论文试图解决当前在6G网络中利用生成式AI(Generative AI)进行网络安全分析的适用性问题,特别是针对大型语言模型(Large Language Models, LLMs)在威胁建模中的应用缺乏系统研究的问题。解决方案的关键在于通过四种提示技术与五种LLMs对5G威胁进行STRIDE分类,评估LLMs在网络安全场景下的表现,并揭示影响其在特定威胁建模中行为的潜在因素,从而为LLMs在网络安全中的调整与微调提供依据。
链接: https://arxiv.org/abs/2505.04101
作者: AbdulAziz AbdulGhaffar,Ashraf Matrawy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Artificial Intelligence (AI) is expected to be an integral part of next-generation AI-native 6G networks. With the prevalence of AI, researchers have identified numerous use cases of AI in network security. However, there are almost nonexistent studies that analyze the suitability of Large Language Models (LLMs) in network security. To fill this gap, we examine the suitability of LLMs in network security, particularly with the case study of STRIDE threat modeling. We utilize four prompting techniques with five LLMs to perform STRIDE classification of 5G threats. From our evaluation results, we point out key findings and detailed insights along with the explanation of the possible underlying factors influencing the behavior of LLMs in the modeling of certain threats. The numerical results and the insights support the necessity for adjusting and fine-tuning LLMs for network security use cases.
zh
[AI-33] An Empirical Study of OpenAI API Discussions on Stack Overflow
【速读】:该论文试图解决开发者在使用OpenAI API时所面临的一系列独特挑战,这些挑战包括提示工程的复杂性、基于令牌的成本管理、非确定性输出以及作为黑箱的操作等。现有研究尚未对这些问题进行系统的实证分析。解决方案的关键在于通过分析Stack Overflow上2,874条与OpenAI API相关的讨论,手动分类为九个相关类别,并利用主题建模分析识别每个类别中的具体问题,从而提出针对开发者、LLM供应商和研究人员的可行建议。
链接: https://arxiv.org/abs/2505.04084
作者: Xiang Chen,Jibin Wang,Chaoyang Gao,Xiaolin Ju,Zhanqi Cui
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of large language models (LLMs), represented by OpenAI’s GPT series, has significantly impacted various domains such as natural language processing, software development, education, healthcare, finance, and scientific research. However, OpenAI APIs introduce unique challenges that differ from traditional APIs, such as the complexities of prompt engineering, token-based cost management, non-deterministic outputs, and operation as black boxes. To the best of our knowledge, the challenges developers encounter when using OpenAI APIs have not been explored in previous empirical studies. To fill this gap, we conduct the first comprehensive empirical study by analyzing 2,874 OpenAI API-related discussions from the popular QA forum Stack Overflow. We first examine the popularity and difficulty of these posts. After manually categorizing them into nine OpenAI API-related categories, we identify specific challenges associated with each category through topic modeling analysis. Based on our empirical findings, we finally propose actionable implications for developers, LLM vendors, and researchers.
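论文中的主题建模分析步骤可以用 scikit-learn 的 LDA 近似复现(以下为假设性草图,超参数与预处理均为示意,并非论文设定;主题数 9 对应文中的九个类别仅为巧合性的示例取值):

```python
# 示意性草图:对帖子文本做 LDA 主题建模,输出每个主题的高频词
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topics_from_posts(posts, n_topics=9, top_k=10):
    vec = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
    X = vec.fit_transform(posts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    words = vec.get_feature_names_out()
    # 每个主题取权重最高的 top_k 个词
    return [[words[i] for i in comp.argsort()[-top_k:][::-1]]
            for comp in lda.components_]
```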
zh
[AI-34] Plexus: Taming Billion-edge Graphs with 3D Parallel GNN Training
【速读】:该论文旨在解决大规模图神经网络(Graph Neural Network, GNN)训练中的内存限制与计算效率问题,特别是在处理具有数十亿边的图数据时,传统的小批量采样方法会导致精度下降和训练速度变慢,而分布式全图训练则面临通信开销高和负载不均衡的问题。论文提出的解决方案是Plexus,其关键在于采用三维(3D)并行策略,通过优化的排列方案实现负载均衡,并结合性能模型预测最优的3D配置,从而有效提升大规模图数据的训练效率与可扩展性。
链接: https://arxiv.org/abs/2505.04083
作者: Aditya K. Ranjan,Siddharth Singh,Cunyang Wei,Abhinav Bhatele
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Graph neural networks have emerged as a potent class of neural networks capable of leveraging the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a GPU due to their sheer size, and using GNNs on them requires techniques such as mini-batch sampling to scale. However, this can lead to reduced accuracy in some cases, and sampling and data transfer from the CPU to the GPU can also slow down training. On the other hand, distributed full-graph training suffers from high communication overhead and load imbalance due to the irregular structure of graphs. We propose Plexus, a three-dimensional (3D) parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. Additionally, we introduce optimizations such as a permutation scheme for load balancing, and a performance model to predict the optimal 3D configuration. We evaluate Plexus on several graph datasets and show scaling results for up to 2048 GPUs on Perlmutter, which is 33% of the machine, and 2048 GCDs on Frontier. Plexus achieves unprecedented speedups of 2.3x-12.5x over existing methods and a reduction in the time to solution by 5.2-8.7x on Perlmutter and 7-54.2x on Frontier.
zh
[AI-35] LLM-e Guess: Can LLMs Capabilities Advance Without Hardware Progress?
【速读】:该论文试图解决在计算资源受限环境下,大型语言模型(Large Language Model, LLM)是否仍能持续进步的问题,以及算法创新在该条件下的表现如何。其解决方案的关键在于提出一种新的分类框架,用于区分依赖计算的创新与独立于计算的创新,并通过计算等效增益(Compute-Equivalent Gain, CEG)量化算法改进对模型性能的贡献,从而评估不同类型的算法创新在资源受限场景下的有效性。
链接: https://arxiv.org/abs/2505.04075
作者: Teddy Foley,Spencer Guo,Henry Josephson,Anqi Qu,Jack Sanderson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper examines whether large language model (LLM) capabilities can continue to advance without additional compute by analyzing the development and role of algorithms used in state-of-the-art LLMs. Motivated by regulatory efforts that have largely focused on restricting access to high-performance hardware, we ask: Can LLMs progress in a compute-constrained environment, and how do algorithmic innovations perform under such conditions? To address these questions, we introduce a novel classification framework that distinguishes between compute-dependent innovations – which yield disproportionate benefits at high compute levels (e.g., the Transformer architecture and mixture-of-experts models) and compute-independent innovations, which improve efficiency across all compute scales (e.g., rotary positional encoding, FlashAttention, or layer normalization). We quantify these contributions using a metric called compute-equivalent gain (CEG), which estimates the additional compute that would be required to achieve similar improvements without these algorithmic advancements. To validate this framework, we conduct small-scale training experiments with a scaled-down GPT-2 model. Our results confirm that compute-independent advancements yield meaningful performance gains even in resource-constrained settings, with a CEG of up to 3.5\times over a baseline model. By contrast, compute-dependent advancements provided little benefit or even degraded performance at the small scale, reinforcing the importance of compute availability for certain algorithmic gains.
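CEG 的直观含义可以在一个假设的幂律缩放律 L(C) = a·C^(-b) 下用几行代码说明(仅为示意,论文的具体估计方式以原文为准):

```python
# 假设性草图:在幂律缩放律 L(C) = a * C**(-b) 下估算 compute-equivalent gain
def compute_equivalent_gain(loss_base, loss_improved, a, b):
    c_base = (a / loss_base) ** (1.0 / b)        # 基线损失对应的等效算力
    c_needed = (a / loss_improved) ** (1.0 / b)  # 仅靠扩大算力达到改进后损失所需的算力
    return c_needed / c_base                     # CEG > 1 表示该算法改进"等价于"额外算力
```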
zh
[AI-36] Izhikevich-Inspired Temporal Dynamics for Enhancing Privacy Efficiency and Transferability in Spiking Neural Networks
【速读】:该论文旨在解决将生物神经元复杂的时序放电模式集成到可扩展的脉冲神经网络(Spiking Neural Network, SNN)训练流程中的挑战,以提升神经形态学习系统的隐私性、泛化能力和生物合理性。其解决方案的关键在于提出两种基于概率驱动的输入级时序脉冲变换方法:Poisson-Burst 和 Delayed-Burst,它们通过引入生物启发的时序变异,直接作用于标准的漏电积分-发放(Leaky Integrate-and-Fire, LIF)神经元,从而实现对脉冲时间动态影响的系统性评估与优化。
链接: https://arxiv.org/abs/2505.04034
作者: Ayana Moshruba,Hamed Poursiami,Maryam Parsa
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Biological neurons exhibit diverse temporal spike patterns, which are believed to support efficient, robust, and adaptive neural information processing. While models such as Izhikevich can replicate a wide range of these firing dynamics, their complexity poses challenges for directly integrating them into scalable spiking neural networks (SNN) training pipelines. In this work, we propose two probabilistically driven, input-level temporal spike transformations: Poisson-Burst and Delayed-Burst that introduce biologically inspired temporal variability directly into standard Leaky Integrate-and-Fire (LIF) neurons. This enables scalable training and systematic evaluation of how spike timing dynamics affect privacy, generalization, and learning performance. Poisson-Burst modulates burst occurrence based on input intensity, while Delayed-Burst encodes input strength through burst onset timing. Through extensive experiments across multiple benchmarks, we demonstrate that Poisson-Burst maintains competitive accuracy and lower resource overhead while exhibiting enhanced privacy robustness against membership inference attacks, whereas Delayed-Burst provides stronger privacy protection at a modest accuracy trade-off. These findings highlight the potential of biologically grounded temporal spike dynamics in improving the privacy, generalization and biological plausibility of neuromorphic learning systems.
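Poisson-Burst 的思想可以用如下假设性草图说明:突发事件的发生概率随输入强度调制,每次事件触发一小段连续脉冲(窗口长度、速率等参数均为示意,并非论文取值):

```python
# 假设性草图:按输入强度调制突发发生率的 Poisson-Burst 输入变换
import numpy as np

def poisson_burst(intensity, T=100, max_rate=0.2, burst_len=3, rng=None):
    """intensity ∈ [0,1] -> 长度为 T 的二值脉冲序列"""
    rng = rng or np.random.default_rng()
    spikes = np.zeros(T, dtype=np.uint8)
    events = rng.random(T) < max_rate * intensity   # 以逐步伯努利近似泊松事件
    for t in np.flatnonzero(events):
        spikes[t:t + burst_len] = 1                 # 每个事件触发一小段突发脉冲
    return spikes
```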
zh
[AI-37] Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)服务中的高成本问题,特别是在多LLM服务场景下,如何通过GPU共享提升资源利用率并满足延迟服务等级目标(SLOs)。现有GPU共享系统缺乏在运行时动态调整资源分配和共享策略的能力,难以应对工作负载的快速变化。论文提出的Prism系统通过解决跨模型内存协调(cross-model memory coordination)这一关键限制,实现了灵活的GPU内存共享,其核心设计包括按需内存分配机制和基于模型运行时需求的两级调度策略,从而显著提升了成本效率和SLO达成率。
链接: https://arxiv.org/abs/2505.04021
作者: Shan Yu,Jiarong Xing,Yifan Qiao,Mingyuan Ma,Yangmin Li,Yang Wang,Shuo Yang,Zhiqiang Xie,Shiyi Cao,Ke Bao,Ion Stoica,Harry Xu,Ying Sheng
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:
Abstract:Serving large language models (LLMs) is expensive, especially for providers hosting many models, making cost reduction essential. The unique workload patterns of serving multiple LLMs (i.e., multi-LLM serving) create new opportunities and challenges for this task. The long-tail popularity of models and their long idle periods present opportunities to improve utilization through GPU sharing. However, existing GPU sharing systems lack the ability to adjust their resource allocation and sharing policies at runtime, making them ineffective at meeting latency service-level objectives (SLOs) under rapidly fluctuating workloads. This paper presents Prism, a multi-LLM serving system that unleashes the full potential of GPU sharing to achieve both cost efficiency and SLO attainment. At its core, Prism tackles a key limitation of existing systems: the lack of cross-model memory coordination, which is essential for flexibly sharing GPU memory across models under dynamic workloads. Prism achieves this with two key designs. First, it supports on-demand memory allocation by dynamically mapping physical to virtual memory pages, allowing flexible memory redistribution among models that space- and time-share a GPU. Second, it improves memory efficiency through a two-level scheduling policy that dynamically adjusts sharing strategies based on models' runtime demands. Evaluations on real-world traces show that Prism achieves more than 2\times cost savings and 3.3\times SLO attainment compared to state-of-the-art systems.
zh
[AI-38] Extending Decision Predicate Graphs for Comprehensive Explanation of Isolation Forest
【速读】:该论文试图解决机器学习中预测模型的可解释性问题,特别是针对异常检测技术Isolation Forest(iForest)的全局可解释性不足的问题。其关键解决方案是引入一种基于决策谓词图(Decision Predicate Graph, DPG)的可解释人工智能(Explainable AI, XAI)方法,并结合内点-离群传播得分(Inlier-Outlier Propagation Score, IOP-Score),以清晰地展示样本被识别为异常值的逻辑及特征贡献,从而提升iForest的可解释性并提供对决策过程的全面理解。
链接: https://arxiv.org/abs/2505.04019
作者: Matteo Ceschin,Leonardo Arrighi,Luca Longo,Sylvio Barbon Junior
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The need to explain predictive models is well-established in modern machine learning. However, beyond model interpretability, understanding pre-processing methods is equally essential. Understanding how data modifications impact model performance improvements and potential biases and promoting a reliable pipeline is mandatory for developing robust machine learning solutions. Isolation Forest (iForest) is a widely used technique for outlier detection that performs well. Its effectiveness increases with the number of tree-based learners. However, this also complicates the explanation of outlier selection and the decision boundaries for inliers. This research introduces a novel Explainable AI (XAI) method, tackling the problem of global explainability. In detail, it aims to offer a global explanation for outlier detection to address its opaque nature. Our approach is based on the Decision Predicate Graph (DPG), which clarifies the logic of ensemble methods and provides both insights and a graph-based metric to explain how samples are identified as outliers using the proposed Inlier-Outlier Propagation Score (IOP-Score). Our proposal enhances iForest's explainability and provides a comprehensive view of the decision-making process, detailing which features contribute to outlier identification and how the model utilizes them. This method advances the state-of-the-art by providing insights into decision boundaries and a comprehensive view of holistic feature usage in outlier identification, thus promoting a fully explainable machine learning pipeline.
zh
[AI-39] MergeGuard: Efficient Thwarting of Trojan Attacks in Machine Learning Models
【速读】:该论文试图解决由不可信第三方训练的AI模型中潜在的AI木马攻击(AI Trojan attacks)问题,此类攻击会使嵌入触发器的输入被错误分类到攻击者的目标类别,从而威胁模型的可用性。解决方案的关键在于提出了一种新的后训练方法——MergeGuard,其核心思想是通过线性化和合并全连接层来提升模型的泛化能力和性能,实验结果表明该方法在保持模型准确率的同时有效降低了木马攻击的成功率,并优于常见的微调类后训练缓解方法。
链接: https://arxiv.org/abs/2505.04015
作者: Soheil Zibakhsh Shabgahi,Yaman Jandali,Farinaz Koushanfar
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper proposes MergeGuard, a novel methodology for mitigation of AI Trojan attacks. Trojan attacks on AI models cause inputs embedded with triggers to be misclassified to an adversary's target class, posing a significant threat to the usability of models trained by an untrusted third party. The core of MergeGuard is a new post-training methodology for linearizing and merging fully connected layers which we show simultaneously improves model generalizability and performance. Our proof-of-concept evaluation on Transformer models demonstrates that MergeGuard maintains model accuracy while decreasing the Trojan attack success rate, outperforming commonly used post-training Trojan mitigation methodologies based on fine-tuning.
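"线性化并合并全连接层"的代数核心如下:去掉两层之间的非线性后,y = W2(W1x + b1) + b2 = (W2W1)x + (W2b1 + b2),两层可折叠为一层。下面是该代数步骤的最小示意(仅展示合并本身;论文中如何完成线性化以原文为准):

```python
# 最小示意:把两个相邻的全连接层折叠为一个等价的线性层
import torch
import torch.nn as nn

def merge_linear(fc1: nn.Linear, fc2: nn.Linear) -> nn.Linear:
    merged = nn.Linear(fc1.in_features, fc2.out_features)
    with torch.no_grad():
        merged.weight.copy_(fc2.weight @ fc1.weight)          # W = W2 @ W1
        merged.bias.copy_(fc2.weight @ fc1.bias + fc2.bias)   # b = W2 @ b1 + b2
    return merged
```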
zh
[AI-40] PARC: Physics-based Augmentation with Reinforcement Learning for Character Controllers SIGGRAPH
【速读】:该论文试图解决在模拟角色中再现敏捷地形穿越行为的问题,这一问题主要源于敏捷运动的运动捕捉数据稀缺以及获取此类数据的高成本。解决方案的关键在于提出PARC(Physics-based Augmentation with Reinforcement Learning for Character Controllers)框架,该框架通过机器学习与物理仿真相结合的方式,迭代增强运动数据集并扩展地形穿越控制器的能力。其核心步骤包括:首先在少量核心地形穿越技能数据上训练运动生成器,随后利用该生成器生成新地形的合成数据,再通过物理跟踪控制器修正生成动作中的缺陷,最终将优化后的动作加入数据集中以持续训练运动生成器,从而实现运动生成器与跟踪器能力的联合提升。
链接: https://arxiv.org/abs/2505.04002
作者: Michael Xu,Yi Shi,KangKang Yin,Xue Bin Peng
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: SIGGRAPH Conference Papers 2025
Abstract:Humans excel in navigating diverse, complex environments with agile motor skills, exemplified by parkour practitioners performing dynamic maneuvers, such as climbing up walls and jumping across gaps. Reproducing these agile movements with simulated characters remains challenging, in part due to the scarcity of motion capture data for agile terrain traversal behaviors and the high cost of acquiring such data. In this work, we introduce PARC (Physics-based Augmentation with Reinforcement Learning for Character Controllers), a framework that leverages machine learning and physics-based simulation to iteratively augment motion datasets and expand the capabilities of terrain traversal controllers. PARC begins by training a motion generator on a small dataset consisting of core terrain traversal skills. The motion generator is then used to produce synthetic data for traversing new terrains. However, these generated motions often exhibit artifacts, such as incorrect contacts or discontinuities. To correct these artifacts, we train a physics-based tracking controller to imitate the motions in simulation. The corrected motions are then added to the dataset, which is used to continue training the motion generator in the next iteration. PARC’s iterative process jointly expands the capabilities of the motion generator and tracker, creating agile and versatile models for interacting with complex environments. PARC provides an effective approach to develop controllers for agile terrain traversal, which bridges the gap between the scarcity of motion data and the need for versatile character controllers.
zh
[AI-41] An alignment safety case sketch based on debate
【速读】:该论文试图解决当人工智能系统在广泛任务上达到或超越人类能力时,人类难以有效评估其行为,从而难以通过人类反馈引导系统向有益方向发展的安全问题(AI safety)。其提出的解决方案是利用另一个超人类系统通过辩论(debate)指出系统输出中的缺陷,以此作为对齐(alignment)机制。该方案的关键在于构建一个“对齐安全论证”(alignment safety case),其核心假设包括:辩论能力的提升意味着系统诚实性的增强、部署过程中诚实性不会显著下降,以及部署环境能够容忍一定误差。
链接: https://arxiv.org/abs/2505.03989
作者: Marie Davidsen Buhl,Jacob Pfau,Benjamin Hilton,Geoffrey Irving
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:If AI systems match or exceed human capabilities on a wide range of tasks, it may become difficult for humans to efficiently judge their actions – making it hard to use human feedback to steer them towards desirable traits. One proposed solution is to leverage another superhuman system to point out flaws in the system's outputs via a debate. This paper outlines the value of debate for AI safety, as well as the assumptions and further research required to make debate work. It does so by sketching an "alignment safety case": an argument that an AI system will not autonomously take actions which could lead to egregious harm, despite being able to do so. The sketch focuses on the risk of an AI R&D agent inside an AI company sabotaging research, for example by producing false results. To prevent this, the agent is trained via debate, subject to exploration guarantees, to teach the system to be honest. Honesty is maintained throughout deployment via online training. The safety case rests on four key claims: (1) the agent has become good at the debate game, (2) good performance in the debate game implies that the system is mostly honest, (3) the system will not become significantly less honest during deployment, and (4) the deployment context is tolerant of some errors. We identify open research problems that, if solved, could render this a compelling argument that an AI system is safe.
zh
[AI-42] Can Large Language Models Predict Parallel Code Performance?
【速读】:该论文试图解决在缺乏高性能GPU硬件访问权限的情况下,如何准确预测并分类GPU内核是计算受限(compute-bound)还是带宽受限(bandwidth-bound)的问题。其解决方案的关键在于利用大型语言模型(LLMs)对GPU源代码和目标硬件规格进行分析,通过 Roofline 模型的分类任务实现性能预测,从而替代传统依赖实际硬件执行时间分析的方法。研究结果表明,先进的LLMs在提供明确的性能分析数据时能够达到100%的分类准确率,并且具备推理能力的LLMs在零样本和少量样本设置下也能表现出较高的预测能力。
链接: https://arxiv.org/abs/2505.03988
作者: Gregory Bolet,Giorgis Georgakoudis,Harshitha Menon,Konstantinos Parasyris,Niranjan Hasabnis,Hayden Estes,Kirk W. Cameron,Gal Oren
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: 5 pages, 4 figures, accepted to AI4Sys Workshop at HPDC 2025
Abstract:Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware – an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large Language Models (LLMs) can offer an alternative approach for GPU performance prediction without relying on hardware. We frame the problem as a roofline classification task: given the source code of a GPU kernel and the hardware specifications of a target GPU, can an LLM predict whether the GPU kernel is compute-bound or bandwidth-bound? For this study, we build a balanced dataset of 340 GPU kernels, obtained from HeCBench benchmark and written in CUDA and OpenMP, along with their ground-truth labels obtained via empirical GPU profiling. We evaluate LLMs across four scenarios: (1) with access to profiling data of the kernel source, (2) zero-shot with source code only, (3) few-shot with code and label pairs, and (4) fine-tuned on a small custom dataset. Our results show that state-of-the-art LLMs have a strong understanding of the Roofline model, achieving 100% classification accuracy when provided with explicit profiling data. We also find that reasoning-capable LLMs significantly outperform standard LLMs in zero- and few-shot settings, achieving up to 64% accuracy on GPU source codes, without profiling information. Lastly, we find that LLM fine-tuning will require much more data than what we currently have available. This work is among the first to use LLMs for source-level roofline performance prediction via classification, and illustrates their potential to guide optimization efforts when runtime profiling is infeasible. Our findings suggest that with better datasets and prompt strategies, LLMs could become practical tools for HPC performance analysis and performance portability.
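Roofline 分类任务本身可以由一个简单函数定义:当核函数的算术强度超过机器平衡点(峰值算力/峰值带宽)时为计算受限,否则为带宽受限。LLM 要做的正是在不运行程序的前提下预测这个标签:

```python
# Roofline 分类的参考定义:LLM 的预测目标即该函数的输出
def roofline_class(flops, bytes_moved, peak_flops, peak_bw):
    arithmetic_intensity = flops / bytes_moved   # 每字节访存对应的浮点运算数
    machine_balance = peak_flops / peak_bw       # roofline 的"屋脊点"
    return "compute-bound" if arithmetic_intensity >= machine_balance else "bandwidth-bound"
```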
zh
[AI-43] LogiDebrief: A Signal-Temporal Logic based Automated Debriefing Approach with Large Language Models Integration IJCAI-2025
【速读】:该论文旨在解决传统9-1-1电话接线员评估方法在处理高通话量时存在的覆盖不足和评估延迟问题。其解决方案的关键在于提出LogiDebrief框架,该框架通过将信号时序逻辑(Signal-Temporal Logic, STL)与大型语言模型(Large Language Models, LLMs)相结合,实现对9-1-1电话的自动化深入分析与全面性能评估,从而提升评估效率和准确性。
链接: https://arxiv.org/abs/2505.03985
作者: Zirong Chen,Ziyan An,Jennifer Reynolds,Kristin Mullen,Stephen Martini,Meiyi Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at IJCAI-2025
Abstract:Emergency response services are critical to public safety, with 9-1-1 call-takers playing a key role in ensuring timely and effective emergency operations. To ensure call-taking performance consistency, quality assurance is implemented to evaluate and refine call-takers’ skillsets. However, traditional human-led evaluations struggle with high call volumes, leading to low coverage and delayed assessments. We introduce LogiDebrief, an AI-driven framework that automates traditional 9-1-1 call debriefing by integrating Signal-Temporal Logic (STL) with Large Language Models (LLMs) for fully-covered rigorous performance evaluation. LogiDebrief formalizes call-taking requirements as logical specifications, enabling systematic assessment of 9-1-1 calls against procedural guidelines. It employs a three-step verification process: (1) contextual understanding to identify responder types, incident classifications, and critical conditions; (2) STL-based runtime checking with LLM integration to ensure compliance; and (3) automated aggregation of results into quality assurance reports. Beyond its technical contributions, LogiDebrief has demonstrated real-world impact. Successfully deployed at Metro Nashville Department of Emergency Communications, it has assisted in debriefing 1,701 real-world calls, saving 311.85 hours of active engagement. Empirical evaluation with real-world data confirms its accuracy, while a case study and extensive user study highlight its effectiveness in enhancing call-taking performance.
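STL 运行时检查的直观含义可以用一个玩具监视器体会:"每个触发事件之后必须在给定时限内出现响应事件"(有界响应性质;事件名称与时限均为示意,并非论文中的规范):

```python
# 玩具示例:有界响应(bounded-response)性质的离线检查
def bounded_response(events, trigger, response, deadline):
    """events: [(时间戳秒, 标签)];检查每个 trigger 后 deadline 秒内是否出现 response"""
    for t, label in events:
        if label == trigger:
            ok = any(lab2 == response and t <= t2 <= t + deadline
                     for t2, lab2 in events)
            if not ok:
                return False
    return True
```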
zh
[AI-44] Diffusion Models are Secretly Exchangeable: Parallelizing DDPMs via Autospeculation
【速读】:该论文试图解决扩散模型(Denoising Diffusion Probabilistic Models, DDPMs)在推理过程中由于顺序计算导致的显著性能瓶颈问题。解决方案的关键在于利用DDPM与随机定位(Stochastic Localization)之间的联系,证明在适当的参数化下,DDPM的增量满足可交换性(exchangeability)性质,从而使得自回归模型中的多种性能优化技术可以近似黑盒地迁移到扩散设置中。为此,作者提出了一种无需辅助草稿模型的自适应推测解码(Autospeculative Decoding, ASD)方法,并理论分析表明,ASD在K步顺序DDPM上实现了近似O(K^(1/3))的并行运行时加速。
链接: https://arxiv.org/abs/2505.03983
作者: Hengyuan Hu,Aniket Das,Dorsa Sadigh,Nima Anari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Denoising Diffusion Probabilistic Models (DDPMs) have emerged as powerful tools for generative modeling. However, their sequential computation requirements lead to significant inference-time bottlenecks. In this work, we utilize the connection between DDPMs and Stochastic Localization to prove that, under an appropriate reparametrization, the increments of DDPM satisfy an exchangeability property. This general insight enables near-black-box adaptation of various performance optimization techniques from autoregressive models to the diffusion setting. To demonstrate this, we introduce Autospeculative Decoding (ASD), an extension of the widely used speculative decoding algorithm to DDPMs that does not require any auxiliary draft models. Our theoretical analysis shows that ASD achieves a \tilde{O}(K^{1/3}) parallel runtime speedup over the K-step sequential DDPM. We also demonstrate that a practical implementation of autospeculative decoding accelerates DDPM inference significantly in various domains.
zh
[AI-45] Frog Soup: Zero-Shot In-Context and Sample-Efficient Frogger Agents
【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)中代理系统在面对新任务时适应性差、训练成本高且效率低的问题。其解决方案的关键在于利用具备域外强化学习(out-of-domain RL)微调的最新推理大语言模型(Large Language Models, LLMs),使其能够在零样本(zero-shot)设置下完成复杂的Atari游戏Frogger,并通过引入上下文学习和推理努力来优化LLM性能,同时结合传统RL方法以提升其性能与样本效率。
链接: https://arxiv.org/abs/2505.03947
作者: Xiang Li,Yiyang Hao,Doug Fulop
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:One of the primary aspirations in reinforcement learning research is developing general-purpose agents capable of rapidly adapting to and mastering novel tasks. While RL gaming agents have mastered many Atari games, they remain slow and costly to train for each game. In this work, we demonstrate that latest reasoning LLMs with out-of-domain RL post-training can play a challenging Atari game called Frogger under a zero-shot setting. We then investigate the effect of in-context learning and the amount of reasoning effort on LLM performance. Lastly, we demonstrate a way to bootstrap traditional RL method with LLM demonstrations, which significantly improves their performance and sample efficiency. Our implementation is open sourced at this https URL.
zh
[AI-46] Decentralized Distributed Proximal Policy Optimization (DD-PPO) for High Performance Computing Scheduling on Multi-User Systems
【速读】:该论文旨在解决高性能计算(High Performance Computing, HPC)环境中资源分配的复杂性问题,特别是传统基于规则的调度算法在面对系统异构性和规模扩大时效率与灵活性不足的问题。其解决方案的关键在于引入一种基于强化学习(Reinforcement Learning, RL)的新型调度器,该调度器采用去中心化分布式近端策略优化(Decentralized Distributed Proximal Policy Optimization, DD-PPO)算法,通过支持多工作节点的大规模分布式训练,避免了每一步都需要参数同步的限制,从而提升了调度系统的可扩展性、训练效率和样本利用率。
链接: https://arxiv.org/abs/2505.03946
作者: Matthew Sgambati,Aleksandar Vakanski,Matthew Anderson
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Resource allocation in High Performance Computing (HPC) environments presents a complex and multifaceted challenge for job scheduling algorithms. Beyond the efficient allocation of system resources, schedulers must account for and optimize multiple performance metrics, including job wait time and system utilization. While traditional rule-based scheduling algorithms dominate the current deployments of HPC systems, the increasing heterogeneity and scale of those systems is expected to challenge the efficiency and flexibility of those algorithms in minimizing job wait time and maximizing utilization. Recent research efforts have focused on leveraging advancements in Reinforcement Learning (RL) to develop more adaptable and intelligent scheduling strategies. Recent RL-based scheduling approaches have explored a range of algorithms, from Deep Q-Networks (DQN) to Proximal Policy Optimization (PPO), and more recently, hybrid methods that integrate Graph Neural Networks with RL techniques. However, a common limitation across these methods is their reliance on relatively small datasets, and these methods face scalability issues when using large datasets. This study introduces a novel RL-based scheduler utilizing the Decentralized Distributed Proximal Policy Optimization (DD-PPO) algorithm, which supports large-scale distributed training across multiple workers without requiring parameter synchronization at every step. By eliminating reliance on centralized updates to a shared policy, the DD-PPO scheduler enhances scalability, training efficiency, and sample utilization. The validation dataset leveraged over 11.5 million real HPC job traces for comparing DD-PPO performance between traditional and advanced scheduling approaches, and the experimental results demonstrate improved scheduling performance in comparison to both rule-based schedulers and existing RL-based scheduling algorithms.
zh
[AI-47] AI-Driven Security in Cloud Computing: Enhancing Threat Detection Automated Response and Cyber Resilience
【速读】:该论文试图解决传统安全方案在应对复杂威胁时效率不足的问题,特别是在实时检测和预防方面存在局限。其解决方案的关键在于利用人工智能(Artificial Intelligence, AI)技术,通过预测分析、基于行为的安全威胁检测以及AI驱动的加密手段来提升云安全防护能力。AI-enabled系统能够更有效地监控网络活动,并提前识别潜在的安全威胁,从而实现更高效的安全响应与防护。
链接: https://arxiv.org/abs/2505.03945
作者: Shamnad Mohamed Shaffi,Sunish Vengathattil,Jezeena Nikarthil Sidhick,Resmi Vijayan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Cloud security concerns have been greatly realized in recent years due to the increase of complicated threats in the computing world. Many traditional solutions do not work well in real-time to detect or prevent more complex threats. Artificial intelligence is today regarded as a revolution in determining a protection plan for cloud data architecture through machine learning, statistical visualization of computing infrastructure, and detection of security breaches followed by counteraction. These AI-enabled systems make work easier as more network activities are scrutinized, and any anomalous behavior that might be a precursor to a more serious breach is prevented. This paper examines ways AI can enhance cloud security by applying predictive analytics, behavior-based security threat detection, and AI-driven encryption. It also outlines the problems of the previous security models and how AI overcomes them. For a similar reason, issues like data privacy, biases in the AI model, and regulatory compliance are also covered. So, AI improves the protection of cloud computing contexts; however, more efforts are needed in the subsequent phases to extend the technology's reliability, modularity, and ethical aspects. This means that AI can be blended with other new computing technologies, including blockchain, to improve security frameworks further. The paper discusses the current trends in securing cloud data architecture using AI and presents further research and application directions.
zh
[AI-48] GRAML: Dynamic Goal Recognition As Metric Learning IJCAI
【速读】:该论文试图解决目标识别(Goal Recognition, GR)问题,即根据观察到的代理行为来识别其目标。传统数据驱动的方法虽然减少了对人工构建领域模型的依赖,但只能处理预定义的目标集合,并且在面对新出现的目标时需要耗时的训练。该论文提出的解决方案关键在于引入GRAML:目标识别作为度量学习(Metric Learning),通过Siamese网络将GR转化为深度度量学习任务,利用RNN在嵌入空间中学习一个度量,使得不同目标的观察轨迹嵌入距离较远,而相同目标的观察轨迹嵌入距离较近,从而实现对新目标的快速适应。
链接: https://arxiv.org/abs/2505.03941
作者: Matan Shamir,Reuth Mirsky
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication in International Joint Conference on Artificial Intelligence (IJCAI) 2025
Abstract:Goal Recognition (GR) is the problem of recognizing an agent’s objectives based on observed actions. Recent data-driven approaches for GR alleviate the need for costly, manually crafted domain models. However, these approaches can only reason about a pre-defined set of goals, and time-consuming training is needed for new emerging goals. To keep this model-learning automated while enabling quick adaptation to new goals, this paper introduces GRAML: Goal Recognition As Metric Learning. GRAML uses a Siamese network to treat GR as a deep metric learning task, employing an RNN that learns a metric over an embedding space, where the embeddings for observation traces leading to different goals are distant, and embeddings of traces leading to the same goals are close. This metric is especially useful when adapting to new goals, even if given just one example observation trace per goal. Evaluated on a versatile set of environments, GRAML shows speed, flexibility, and runtime improvements over the state-of-the-art GR while maintaining accurate recognition.
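GRAML 的度量学习骨架可以如下示意:一个共享参数的 GRU 编码观察轨迹,对比损失(contrastive loss)拉近同目标轨迹的嵌入、推远异目标轨迹的嵌入(网络规模与损失形式均为假设性示意,非论文实现):

```python
# 假设性草图:siamese RNN 轨迹编码器 + 对比损失
import torch
import torch.nn as nn
import torch.nn.functional as F

class TraceEncoder(nn.Module):
    def __init__(self, obs_dim, hidden=64, emb=32):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, emb)

    def forward(self, traces):            # traces: [B, T, obs_dim]
        _, h = self.rnn(traces)
        return self.head(h.squeeze(0))    # [B, emb]

def contrastive_loss(z1, z2, same_goal, margin=1.0):
    """same_goal: 0/1 张量,1 表示两条轨迹指向同一目标"""
    d = F.pairwise_distance(z1, z2)
    return torch.mean(same_goal * d ** 2 +
                      (1 - same_goal) * torch.clamp(margin - d, min=0) ** 2)
```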
zh
[AI-49] Scratch Copilot: Supporting Youth Creative Coding with AI
【速读】:该论文旨在解决儿童在使用图形化编程平台(如Scratch)时,将创意想法转化为功能性代码所面临的障碍。现有AI辅助工具主要面向成人程序员,缺乏针对儿童在积木式编程环境中的支持。论文提出的解决方案是Cognimates Scratch Copilot,这是一个集成于类似Scratch环境中的AI助手,能够实时支持创意生成、代码编写、调试和素材创建。其关键在于通过AI技术提供即时帮助,同时允许儿童在使用过程中保持创造性控制,通过协商调整或拒绝建议来维护自主性,从而平衡辅助性支架与独立解决问题能力的培养。
链接: https://arxiv.org/abs/2505.03867
作者: Stefania Druga,Amy J. Ko
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 5 figures, 14 pages
Abstract:Creative coding platforms like Scratch have democratized programming for children, yet translating imaginative ideas into functional code remains a significant hurdle for many young learners. While AI copilots assist adult programmers, few tools target children in block-based environments. Building on prior research \citedruga_how_2021,druga2023ai, druga2023scratch, we present Cognimates Scratch Copilot: an AI-powered assistant integrated into a Scratch-like environment, providing real-time support for ideation, code generation, debugging, and asset creation. This paper details the system architecture and findings from an exploratory qualitative evaluation with 18 international children (ages 7–12). Our analysis reveals how the AI Copilot supported key creative coding processes, particularly aiding ideation and debugging. Crucially, it also highlights how children actively negotiated the use of AI, demonstrating strong agency by adapting or rejecting suggestions to maintain creative control. Interactions surfaced design tensions between providing helpful scaffolding and fostering independent problem-solving, as well as learning opportunities arising from navigating AI limitations and errors. Findings indicate Cognimates Scratch Copilot’s potential to enhance creative self-efficacy and engagement. Based on these insights, we propose initial design guidelines for AI coding assistants that prioritize youth agency and critical interaction alongside supportive scaffolding.
[AI-50] From Glue-Code to Protocols: A Critical Analysis of A2A and MCP Integration for Scalable Agent Systems
【Quick Read】: This paper examines the complexity introduced by integrating the emerging standards for inter-agent communication and tool access in multi-agent systems, specifically the challenges that arise when Google's Agent to Agent (A2A) protocol is combined with Anthropic's Model Context Protocol (MCP): semantic interoperability, compounded security risks, and governance. The key contribution is a critical analysis of the practical implications of combining A2A and MCP, identifying the novel security vulnerabilities, privacy complexities, and cross-protocol debugging difficulties that the integration introduces, and arguing for stronger semantic negotiation mechanisms to realize an efficient, secure "Agent Economy" architecture.
Link: https://arxiv.org/abs/2505.03864
Authors: Qiaomu Li,Ying Xie
Institutions: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Artificial intelligence is rapidly evolving towards multi-agent systems where numerous AI agents collaborate and interact with external tools. Two key open standards, Google’s Agent to Agent (A2A) protocol for inter-agent communication and Anthropic’s Model Context Protocol (MCP) for standardized tool access, promise to overcome the limitations of fragmented, custom integration approaches. While their potential synergy is significant, this paper argues that effectively integrating A2A and MCP presents unique, emergent challenges at their intersection, particularly concerning semantic interoperability between agent tasks and tool capabilities, the compounded security risks arising from combined discovery and execution, and the practical governance required for the envisioned “Agent Economy”. This work provides a critical analysis, moving beyond a survey to evaluate the practical implications and inherent difficulties of combining these horizontal and vertical integration standards. We examine the benefits (e.g., specialization, scalability) while critically assessing their dependencies and trade-offs in an integrated context. We identify key challenges increased by the integration, including novel security vulnerabilities, privacy complexities, debugging difficulties across protocols, and the need for robust semantic negotiation mechanisms. In summary, A2A+MCP offers a vital architectural foundation, but fully realizing its potential requires substantial advancements to manage the complexities of their combined operation.
[AI-51] Data-Driven Falsification of Cyber-Physical Systems
【Quick Read】: This paper addresses the verification of the operational safety of Cyber-Physical Systems (CPS) via falsification, that is, searching for an unsafe execution of the system rather than proving its absence. The key of the solution is to build a surrogate model (either a deep neural network or a Decision Tree) and to exploit the inherent interpretability of Decision Trees to guide CPS falsification. The framework combines DNN falsification tools with analysis of safety violations explained by the Decision Tree surrogate, enabling effective detection of hard-to-find counterexamples in CPS.
Link: https://arxiv.org/abs/2505.03863
Authors: Atanu Kundu,Sauvik Gon,Rajarshi Ray
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Cyber-Physical Systems (CPS) are abundant in safety-critical domains such as healthcare, avionics, and autonomous vehicles. Formal verification of their operational safety is, therefore, of utmost importance. In this paper, we address the falsification problem, where the focus is on searching for an unsafe execution in the system instead of proving their absence. The contribution of this paper is a framework that (a) connects the falsification of CPS with the falsification of deep neural networks (DNNs) and (b) leverages the inherent interpretability of Decision Trees for faster falsification of CPS. This is achieved by: (1) building a surrogate model of the CPS under test, either as a DNN model or a Decision Tree, (2) application of various DNN falsification tools to falsify CPS, and (3) a novel falsification algorithm guided by the explanations of safety violations of the CPS model extracted from its Decision Tree surrogate. The proposed framework has the potential to exploit a repertoire of adversarial attack algorithms designed to falsify robustness properties of DNNs, as well as state-of-the-art falsification algorithms for DNNs. Although the presented methodology is applicable to systems that can be executed/simulated in general, we demonstrate its effectiveness, particularly in CPS. We show that our framework, implemented as a tool FlexiFal, can detect hard-to-find counterexamples in CPS that have linear and non-linear dynamics. Decision tree-guided falsification shows promising results in efficiently finding multiple counterexamples in the ARCH-COMP 2024 falsification benchmarks (khandait2024arch).
[AI-52] Impact Analysis of Inference Time Attack of Perception Sensors on Autonomous Vehicles
【Quick Read】: This paper targets the security of the perception module of Autonomous Vehicles (AVs), looking beyond perception correctness. While most existing work focuses on the correctness of perception outputs, this paper proposes an impact analysis based on inference time attacks, showing that such attacks can threaten the safety of both the ego vehicle and other traffic participants; the key is a simulation-based demonstration of the effectiveness and potential harm of these attacks.
Link: https://arxiv.org/abs/2505.03850
Authors: Hanlin Chen,Simin Chen,Wenyu Li,Wei Yang,Yiheng Feng
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted and presented in TRBAM 2024
Abstract:As a safety-critical cyber-physical system, cybersecurity and related safety issues for Autonomous Vehicles (AVs) have been important research topics for a while. Among all the modules on AVs, perception is one of the most accessible attack surfaces, as drivers and AVs have no control over the outside environment. Most current work targeting perception security for AVs focuses on perception correctness. In this work, we propose an impact analysis based on inference time attacks for autonomous vehicles. We demonstrate in a simulation system that such inference time attacks can also threaten the safety of both the ego vehicle and other traffic participants.
[AI-53] CoCoB: Adaptive Collaborative Combinatorial Bandits for Online Recommendation DASFAA2025
【Quick Read】: This paper addresses the degradation of recommendation quality in collaborative filtering caused by ill-defined user similarity and by users with unique preferences who lack suitable neighbors. The key of the solution is an adaptive Collaborative Combinatorial Bandits algorithm (CoCoB) with a two-sided bandit architecture: an enhanced Bayesian model on the user side dynamically identifies similar users, while the item side generates diverse recommendations informed by the user-side output, yielding accurate recommendations for the target user.
Link: https://arxiv.org/abs/2505.03840
Authors: Cairong Yan,Jinyi Han,Jin Ju,Yanting Zhang,Zijian Wang,Xuan Shao
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper has been accepted by DASFAA 2025: The International Conference on Database Systems for Advanced Applications. This version provides more detailed information
Abstract:Clustering bandits have gained significant attention in recommender systems by leveraging collaborative information from neighboring users to better capture target user preferences. However, these methods often lack a clear definition of similar users and face challenges when users with unique preferences lack appropriate neighbors. In such cases, relying on divergent preferences of misidentified neighbors can degrade recommendation quality. To address these limitations, this paper proposes an adaptive Collaborative Combinatorial Bandits algorithm (CoCoB). CoCoB employs an innovative two-sided bandit architecture, applying bandit principles to both the user and item sides. The user-bandit employs an enhanced Bayesian model to explore user similarity, identifying neighbors based on a similarity probability threshold. The item-bandit treats items as arms, generating diverse recommendations informed by the user-bandit’s output. CoCoB dynamically adapts, leveraging neighbor preferences when available or focusing solely on the target user otherwise. Regret analysis under a linear contextual bandit setting and experiments on three real-world datasets demonstrate CoCoB’s effectiveness, achieving an average 2.4% improvement in F1 score over state-of-the-art methods.
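To make the two-sided architecture concrete, here is a toy sketch in which Beta posteriors over pairwise similarity pick neighbours on the user side and a UCB rule scores items on the item side. Every name, threshold, and update rule here is illustrative rather than the paper's algorithm.

```python
import numpy as np

class CoCoBSketch:
    """Toy two-sided bandit: Beta posteriors decide which users count as
    neighbours; a UCB item-bandit then scores arms from pooled feedback."""
    def __init__(self, n_users, n_items, sim_threshold=0.6):
        self.sim_a = np.ones((n_users, n_users))   # Beta(a, b) per user pair
        self.sim_b = np.ones((n_users, n_users))
        self.clicks = np.zeros((n_users, n_items))
        self.pulls = np.ones((n_users, n_items))   # init at 1 to avoid div-by-zero
        self.threshold = sim_threshold

    def neighbours(self, u):
        p_similar = self.sim_a[u] / (self.sim_a[u] + self.sim_b[u])
        return np.where(p_similar >= self.threshold)[0]

    def recommend(self, u, t):
        nbrs = self.neighbours(u)
        nbrs = nbrs if len(nbrs) > 0 else np.array([u])  # fall back to target user
        mean = self.clicks[nbrs].sum(0) / self.pulls[nbrs].sum(0)
        ucb = mean + np.sqrt(2 * np.log(t + 1) / self.pulls[nbrs].sum(0))
        return int(np.argmax(ucb))

    def update(self, u, item, reward, agreeing_users=()):
        self.clicks[u, item] += reward
        self.pulls[u, item] += 1
        for v in agreeing_users:          # agreement on feedback suggests similarity
            self.sim_a[u, v] += reward
            self.sim_b[u, v] += 1 - reward

bandit = CoCoBSketch(n_users=10, n_items=50)
item = bandit.recommend(u=0, t=1)
bandit.update(0, item, reward=1.0, agreeing_users=[3, 7])
```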
[AI-54] The Shift Towards Preprints in AI Policy Research: A Comparative Study of Preprint Trends in the U.S., Europe and South Korea
【Quick Read】: This paper studies how major disruptive events shaped regional differences in preprint citation trends in global Artificial Intelligence (AI) policy research. The key of the approach is to use bibliometric data from the Web of Science, mark the timing of major events such as the COVID-19 pandemic and the release of ChatGPT, analyze changes in preprint citation patterns across regions from 2015 to 2024, and relate those changes to local research cultures, policy environments, and levels of open-science maturity.
Link: https://arxiv.org/abs/2505.03835
Authors: Simon Suh,Jihyuk Bang,Ji Woo Han
Institutions: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 22 pages, 6 figures, 3 tables. Uses cross-regional analysis to evaluate how preprint citation trends in AI policy research have shifted over time in response to two major global events: the COVID-19 pandemic and the release of ChatGPT. Compares United States, Europe, and South Korea
Abstract:The adoption of open science has quickly changed how artificial intelligence (AI) policy research is distributed globally. This study examines the regional trends in the citation of preprints, specifically focusing on the impact of two major disruptive events: the COVID-19 pandemic and the release of ChatGPT, on research dissemination patterns in the United States, Europe, and South Korea from 2015 to 2024. Using bibliometrics data from the Web of Science, this study tracks how global disruptive events influenced the adoption of preprints in AI policy research and how such shifts vary by region. By marking the timing of these disruptive events, the analysis reveals that while all regions experienced growth in preprint citations, the magnitude and trajectory of change varied significantly. The United States exhibited sharp, event-driven increases; Europe demonstrated institutional growth; and South Korea maintained consistent, linear growth in preprint adoption. These findings suggest that global disruptions may have accelerated preprint adoption, but the extent and trajectory are shaped by local research cultures, policy environments, and levels of open science maturity. This paper emphasizes the need for future AI governance strategies to consider regional variability in research dissemination and highlights opportunities for further longitudinal and comparative research to deepen our understanding of open-access adoption in AI policy development.
[AI-55] MISE: Meta-knowledge Inheritance for Social Media-Based Stressor Estimation WWW2025
【Quick Read】: This paper tackles the identification of specific stressors in social media posts, where stressor types are diverse with only a few examples per class and new stressors keep emerging, making them hard for conventional methods to understand and recognize. The key of the solution is a meta-learning-based stressor estimation framework augmented with a meta-knowledge inheritance mechanism, which improves generalization to new stressors from little labeled data while preventing catastrophic forgetting when adapting to them.
Link: https://arxiv.org/abs/2505.03827
Authors: Xin Wang,Ling Feng,Huijun Zhang,Lei Cao,Kaisheng Zeng,Qi Li,Yang Ding,Yi Dai,David Clifton
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: WWW2025, Oral Presentation
Abstract:Stress haunts people in modern society, which may cause severe health issues if left unattended. With social media becoming an integral part of daily life, leveraging social media to detect stress has gained increasing attention. While the majority of the work focuses on classifying stress states and stress categories, this study introduces a new task aimed at estimating more specific stressors (like exam, writing paper, etc.) through users’ posts on social media. Unfortunately, the diversity of stressors with many different classes but a few examples per class, combined with the consistent arising of new stressors over time, hinders the machine understanding of stressors. To this end, we cast the stressor estimation problem within a practical few-shot learning setting, and propose a novel meta-learning based stressor estimation framework that is enhanced by a meta-knowledge inheritance mechanism. This model can not only learn generic stressor context through meta-learning, but also has a good generalization ability to estimate new stressors with little labeled data. A fundamental breakthrough in our approach lies in the inclusion of the meta-knowledge inheritance mechanism, which equips our model with the ability to prevent catastrophic forgetting when adapting to new stressors. The experimental results show that our model achieves state-of-the-art performance compared with the baselines. Additionally, we construct a social media-based stressor estimation dataset that can help train artificial intelligence models to facilitate human well-being. The dataset is now public on Kaggle (this https URL) and Hugging Face (this https URL).
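A minimal sketch of the kind of few-shot episode such a meta-learning setup trains on, using a prototypical-network-style loss as a stand-in; MISE's actual model and its meta-knowledge inheritance mechanism are not reproduced here.

```python
# Illustrative few-shot episode: build a prototype per stressor class from a
# few labelled posts, then classify queries by embedding distance.
import torch
import torch.nn.functional as F

def prototypical_episode(encoder, support_x, support_y, query_x, query_y):
    z_s, z_q = encoder(support_x), encoder(query_x)
    classes = support_y.unique()
    protos = torch.stack([z_s[support_y == c].mean(0) for c in classes])
    logits = -torch.cdist(z_q, protos)        # closer prototype -> higher score
    targets = torch.stack([(classes == y).nonzero().squeeze() for y in query_y])
    return F.cross_entropy(logits, targets)

enc = torch.nn.Linear(32, 16)                 # toy post encoder
support_y = torch.arange(3).repeat(4)         # 3 stressor classes, 4 shots each
query_y = torch.arange(3).repeat(2)
loss = prototypical_episode(enc, torch.randn(12, 32), support_y,
                            torch.randn(6, 32), query_y)
```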
[AI-56] Intelligently Augmented Contrastive Tensor Factorization: Empowering Multi-dimensional Time Series Classification in Low-Data Environments
【Quick Read】: This paper targets the classification of multi-dimensional time series from real-world systems, in particular learning discriminative, complex features such as cross-dimensional dependencies and intra-class variations when training data is scarce. The key of the solution is an efficient and versatile framework, Intelligently Augmented Contrastive Tensor Factorization (ITA-CTF), which couples a tensor factorization (TF) module with a contrastive loss to learn class-aware representations, and uses an intelligent augmentation module to generate targeted augmentations that highlight intra-class patterns in the original data, thereby improving classification performance.
Link: https://arxiv.org/abs/2505.03825
Authors: Anushiya Arunan,Yan Qin,Xiaoli Li,Yuen Chau
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in Expert Systems with Applications (DOI pending)
Abstract:Classification of multi-dimensional time series from real-world systems requires fine-grained learning of complex features such as cross-dimensional dependencies and intra-class variations, all under the practical challenge of low training data availability. However, standard deep learning (DL) struggles to learn generalizable features in low-data environments due to model overfitting. We propose a versatile yet data-efficient framework, Intelligently Augmented Contrastive Tensor Factorization (ITA-CTF), to learn effective representations from multi-dimensional time series. The CTF module learns core explanatory components of the time series (e.g., sensor factors, temporal factors), and importantly, their joint dependencies. Notably, unlike standard tensor factorization (TF), the CTF module incorporates a new contrastive loss optimization to induce similarity learning and class-awareness into the learnt representations for better classification performance. To strengthen this contrastive learning, the preceding ITA module generates targeted but informative augmentations that highlight realistic intra-class patterns in the original data, while preserving class-wise properties. This is achieved by dynamically sampling a “soft” class prototype to guide the warping of each query data sample, which results in an augmentation that is intelligently pattern-mixed between the “soft” class prototype and the query sample. These augmentations enable the CTF module to recognize complex intra-class variations despite the limited original training data, and seek out invariant class-wise properties for accurate classification performance. The proposed method is comprehensively evaluated on five different classification tasks. Compared to standard TF and several DL benchmarks, notable performance improvements up to 18.7% were achieved.
[AI-57] Memory Assisted LLM for Personalized Recommendation System
【Quick Read】: This paper addresses two problems of personalized large language models (LLMs) in recommendation tasks: inaccurate capture of diverse user preferences and failure to incorporate timely updates of user history. The key of the solution is the Memory-Assisted Personalized LLM (MAP), which builds a history profile for each user, retrieves relevant memory by similarity, and incorporates it into the prompt to strengthen personalization. Experiments show that MAP outperforms conventional LLM recommenders that integrate user history directly through prompt design, in both single-domain and cross-domain settings, with its advantage growing as user history accumulates.
Link: https://arxiv.org/abs/2505.03824
Authors: Jiarui Chen
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures
Abstract:Large language models (LLMs) have demonstrated significant potential in solving recommendation tasks. With proven capabilities in understanding user preferences, LLM personalization has emerged as a critical area for providing tailored responses to individuals. Current studies explore personalization through prompt design and fine-tuning, paving the way for further research in personalized LLMs. However, existing approaches are either costly and inefficient in capturing diverse user preferences or fail to account for timely updates to user history. To address these gaps, we propose the Memory-Assisted Personalized LLM (MAP). Through user interactions, we first create a history profile for each user, capturing their preferences, such as ratings for historical items. During recommendation, we extract relevant memory based on similarity, which is then incorporated into the prompts to enhance personalized recommendations. In our experiments, we evaluate MAP using a sequential rating prediction task under two scenarios: single domain, where memory and tasks are from the same category (e.g., movies), and cross-domain (e.g., memory from movies and recommendation tasks in books). The results show that MAP outperforms regular LLM-based recommenders that integrate user history directly through prompt design. Moreover, as user history grows, MAP’s advantage increases in both scenarios, making it more suitable for addressing successive personalized user requests.
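A small sketch of the memory-retrieval-into-prompt idea, assuming an arbitrary text-embedding function and an illustrative prompt template; the paper's exact profile format and retrieval rule may differ.

```python
import numpy as np

def build_prompt(user_history, query_item, embed, k=3):
    """Embed the query item, pull the k most similar rated items from the
    user's history profile, and inline them into the recommendation prompt."""
    q = embed(query_item)
    scored = sorted(
        user_history,                             # [(item_text, rating), ...]
        key=lambda m: -np.dot(embed(m[0]), q) /
                      (np.linalg.norm(embed(m[0])) * np.linalg.norm(q)),
    )[:k]
    memory = "\n".join(f"- rated {r}/5: {t}" for t, r in scored)
    return (f"Relevant items this user rated before:\n{memory}\n\n"
            f"Predict the user's rating (1-5) for: {query_item}")

# `embed` can be any text -> vector function; a toy stand-in for demonstration:
toy_embed = lambda t: np.array([len(t), t.count("a"), t.count("e"), 1.0])
print(build_prompt([("The Matrix", 5), ("Heat", 4)], "Inception", toy_embed))
```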
[AI-58] DRSLF: Double Regularized Second-Order Low-Rank Representation for Web Service QoS Prediction
【Quick Read】: This paper addresses the problem that, in cloud service selection, users cannot access all services, so Quality-of-Service (QoS) data forms a high-dimensional and incomplete (HDI) matrix that hurts QoS prediction accuracy. The key of the solution is a double regularized second-order latent factor (DRSLF) model, which integrates L1-norm and L2-norm regularization terms to improve low-rank representation performance and incorporates second-order information by computing Hessian-vector products in each conjugate gradient step.
Link: https://arxiv.org/abs/2505.03822
Authors: Hao Wu,Jialiang Wang
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Quality-of-Service (QoS) data plays a crucial role in cloud service selection. Since users cannot access all services, QoS can be represented by a high-dimensional and incomplete (HDI) matrix. Latent factor analysis (LFA) models have been proven effective as low-rank representation techniques for addressing this issue. However, most LFA models rely on first-order optimizers and use L2-norm regularization, which can lead to lower QoS prediction accuracy. To address this issue, this paper proposes a double regularized second-order latent factor (DRSLF) model with two key ideas: a) integrating L1-norm and L2-norm regularization terms to enhance the low-rank representation performance; b) incorporating second-order information by calculating the Hessian-vector product in each conjugate gradient step. Experimental results on two real-world response-time QoS datasets demonstrate that DRSLF has a higher low-rank representation capability than two baselines.
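The second-order ingredient, a Hessian-vector product inside a conjugate-gradient step, can be obtained generically by double backpropagation. The sketch below shows this on an assumed L1-plus-L2-regularised factor loss over a masked HDI matrix, not the paper's exact solver.

```python
# Generic Hessian-vector product via double backprop, usable inside CG.
import torch

def hessian_vector_product(loss_fn, params, v):
    grad = torch.autograd.grad(loss_fn(params), params, create_graph=True)[0]
    return torch.autograd.grad((grad * v).sum(), params)[0]

R = torch.randn(50, 40)                       # response-time QoS matrix (toy)
mask = (torch.rand_like(R) < 0.1).float()     # HDI: only ~10% of entries observed
V = torch.randn(40, 8)                        # fixed service factors (toy)
U = torch.randn(50, 8, requires_grad=True)    # user latent factors being solved

def loss_fn(U):
    err = (mask * (R - U @ V.T)).pow(2).sum()
    return err + 0.01 * U.abs().sum() + 0.01 * U.pow(2).sum()   # L1 + L2 terms

hv = hessian_vector_product(loss_fn, U, torch.randn_like(U))
```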
[AI-59] Focus on the Likely: Test-time Instance-based Uncertainty Removal
【Quick Read】: This paper addresses prediction uncertainty, in particular the weak performance of models on samples with high decision uncertainty. The key of the solution is two new test-time fine-tuning methods that require no auxiliary data and use only the given test instance: during inference, an additional focus on the likely classes is introduced, and when the initial forward pass indicates high uncertainty, a single-step gradient descent refines the prediction, moving it closer to the ideal of assigning zero probability to implausible outcomes.
Link: https://arxiv.org/abs/2505.03819
Authors: Johannes Schneider
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:We propose two novel test-time fine-tuning methods to improve uncertain model predictions. Our methods require no auxiliary data and use the given test instance only. Instead of performing a greedy selection of the most likely class to make a prediction, we introduce an additional “focus on the likely classes” step during inference. By applying a single-step gradient descent, we refine predictions when an initial forward pass indicates high uncertainty. This aligns predictions more closely with the ideal of assigning zero probability to less plausible outcomes. Our theoretical discussion provides a deeper understanding highlighting the impact on shared and non-shared features among (focus) classes. The experimental evaluation highlights accuracy gains on samples exhibiting high decision uncertainty for a diverse set of models from both the text and image domain using the same hyperparameters.
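A speculative sketch of such a test-time refinement loop: when the first forward pass is uncertain, take one gradient step that concentrates probability mass on the top-k classes for the single test instance, then re-predict. The threshold, k, learning rate, and loss form are all assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def focus_refine(model, x, k=3, lr=1e-3, tau=0.5):
    """One-step test-time refinement focused on the k most likely classes."""
    probs = F.softmax(model(x), dim=-1)
    if probs.max() > tau:                      # confident: keep greedy prediction
        return probs.argmax(-1)
    focus = probs.topk(k, dim=-1).indices      # the "likely classes" for this x
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss = -torch.log(F.softmax(model(x), -1).gather(-1, focus).sum(-1)).mean()
    opt.zero_grad(); loss.backward(); opt.step()   # single step, this instance only
    return model(x).argmax(-1)
```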
[AI-60] Program Semantic Inequivalence Game with Large Language Models
【Quick Read】: This paper addresses the poor performance of large language models (LLMs) on tasks requiring non-trivial reasoning about program semantics, and in particular the lack of training data for improving them on such tasks. The key of the solution is a method for synthetically generating code-reasoning training data based on a Semantic Inequivalence Game (SInQ): a generator agent creates semantically distinct program variants, an evaluator agent identifies inputs on which the original program and the generated variant diverge in behaviour, and the two agents train each other semi-adversarially.
Link: https://arxiv.org/abs/2505.03818
Authors: Antonio Valerio Miceli-Barone,Vaishak Belle,Ali Payani
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) can achieve strong performance on everyday coding tasks, but they can fail on complex tasks that require non-trivial reasoning about program semantics. Finding training examples to teach LLMs to solve these tasks can be challenging. In this work, we explore a method to synthetically generate code reasoning training data based on a semantic inequivalence game SInQ: a generator agent creates program variants that are semantically distinct, derived from a dataset of real-world programming tasks, while an evaluator agent has to identify input examples that cause the original programs and the generated variants to diverge in their behaviour, with the agents training each other semi-adversarially. We prove that this setup enables theoretically unlimited improvement through self-play in the limit of infinite computational resources. We evaluated our approach on multiple code generation and understanding benchmarks, including cross-language vulnerability detection (Lu et al., 2021), where our method improves vulnerability detection in C/C++ code despite being trained exclusively on Python code, and the challenging Python builtin identifier swap benchmark (Miceli-Barone et al., 2023), showing that whereas modern LLMs still struggle with this benchmark, our approach yields substantial improvements. We release the code needed to replicate the experiments, as well as the generated synthetic data, which can be used to fine-tune LLMs.
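The evaluator's role reduces to finding a witness input on which the two program variants disagree, which certifies semantic inequivalence. The sketch below does this with random search in place of the paper's LLM evaluator agent.

```python
import random

def find_divergence(prog_a, prog_b, input_gen, trials=1000):
    """Search for an input where two program variants produce different outputs."""
    for _ in range(trials):
        x = input_gen()
        ya, yb = prog_a(x), prog_b(x)
        if ya != yb:
            return x, (ya, yb)                # witness of semantic inequivalence
    return None                               # none found: possibly equivalent

original = lambda xs: sum(xs)
variant = lambda xs: sum(xs[:-1])             # subtle bug: drops the last element
print(find_divergence(original, variant,
                      lambda: [random.randint(-9, 9) for _ in range(5)]))
```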
[AI-61] Modeling Behavioral Preferences of Cyber Adversaries Using Inverse Reinforcement Learning
【Quick Read】: This paper asks how to model attacker preferences from system-level audit logs so as to improve threat attribution of cyber adversaries. Existing approaches depend on continually updated documentation of attack tools and techniques; this paper instead proposes a holistic approach based on inverse reinforcement learning (IRL) that learns attackers' intrinsic behavioral preferences from forensic data. The key is to model the attacker as a decision-making agent with unknown behavioral preferences and to derive state-action trajectories of attacks from attack provenance graphs of audit logs, automatically revealing the attacker's subjective preferences, which can serve as unique behavioral signatures for improved threat attribution.
Link: https://arxiv.org/abs/2505.03817
Authors: Aditya Shinde,Prashant Doshi
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:This paper presents a holistic approach to attacker preference modeling from system-level audit logs using inverse reinforcement learning (IRL). Adversary modeling is an important capability in cybersecurity that lets defenders characterize behaviors of potential attackers, which enables attribution to known cyber adversary groups. Existing approaches rely on documenting an ever-evolving set of attacker tools and techniques to track known threat actors. Although attacks evolve constantly, attacker behavioral preferences are intrinsic and less volatile. Our approach learns the behavioral preferences of cyber adversaries from forensics data on their tools and techniques. We model the attacker as an expert decision-making agent with unknown behavioral preferences situated in a computer host. We leverage attack provenance graphs of audit logs to derive a state-action trajectory of the attack. We test our approach on open datasets of audit logs containing real attack data. Our results demonstrate for the first time that low-level forensics data can automatically reveal an adversary’s subjective preferences, which serves as an additional dimension to modeling and documenting cyber adversaries. Attackers’ preferences tend to be invariant despite their different tools and indicate predispositions that are inherent to the attacker. As such, these inferred preferences can potentially serve as unique behavioral signatures of attackers and improve threat attribution.
[AI-62] Geospatial and Temporal Trends in Urban Transportation: A Study of NYC Taxis and Pathao Food Deliveries
【Quick Read】: This paper addresses the identification and optimization of transportation demand in urban systems, specifically analyzing demand trends, peak times, and significant geographical hotspots. The key of the solution is the combination of Exploratory Data Analysis (EDA), geospatial analysis, SARIMAX time-series modeling, and clustering techniques to understand transportation patterns and forecast demand, informing fleet management and resource allocation.
Link: https://arxiv.org/abs/2505.03816
Authors: Bidyarthi Paul,Fariha Tasnim Chowdhury,Dipta Biswas,Meherin Sultana
Institutions: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:
Abstract:Urban transportation plays a vital role in modern city life, affecting how efficiently people and goods move around. This study analyzes transportation patterns using two datasets: the NYC Taxi Trip dataset from New York City and the Pathao Food Trip dataset from Dhaka, Bangladesh. Our goal is to identify key trends in demand, peak times, and important geographical hotspots. We start with Exploratory Data Analysis (EDA) to understand the basic characteristics of the datasets. Next, we perform geospatial analysis to map out high-demand and low-demand regions. We use the SARIMAX model for time series analysis to forecast demand patterns, capturing seasonal and weekly variations. Lastly, we apply clustering techniques to identify significant areas of high and low demand. Our findings provide valuable insights for optimizing fleet management and resource allocation in both passenger transport and food delivery services. These insights can help improve service efficiency, better meet customer needs, and enhance urban transportation systems in diverse urban environments.
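SARIMAX is available directly in statsmodels; a minimal forecasting sketch follows. The synthetic series, orders, and seasonal period are placeholders rather than the study's fitted configuration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic hourly demand with a daily cycle, standing in for real trip counts.
idx = pd.date_range("2024-01-01", periods=24 * 28, freq="h")
demand = pd.Series(100 + 30 * np.sin(2 * np.pi * idx.hour / 24)
                   + np.random.default_rng(0).normal(0, 5, len(idx)), index=idx)

model = SARIMAX(demand,
                order=(1, 0, 1),               # non-seasonal ARMA terms (illustrative)
                seasonal_order=(1, 1, 1, 24))  # 24-hour daily seasonality
fit = model.fit(disp=False)
print(fit.forecast(steps=24))                  # next-day hourly demand forecast
```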
[AI-63] ScarceGAN: Discriminative Classification Framework for Rare Class Identification for Longitudinal Data with Weak Prior
【Quick Read】: This paper addresses the identification of extremely rare or scarce samples in multi-dimensional longitudinal telemetry data, under severe positive-class scarcity, a multi-class negative class with uneven densities and partially overlapping feature distributions, and massive unlabeled data that leaves only a tiny, weak prior on both classes. The key of the solution is ScarceGAN, which reformulates the semi-supervised Generative Adversarial Network (GAN) to accommodate weakly labeled multi-class negative samples, introduces a "leeway" term relaxing the supervised discriminator's constraint on exactly separating noisy negatives, and modifies the cost objectives of both the discriminator and the generator; by leveraging what is (and is not) known about the negative class to better learn its complement, the positive class, it achieves large recall gains on rare-class identification.
Link: https://arxiv.org/abs/2505.03811
Authors: Surajit Chakrabarty,Rukma Talwadker,Tridib Mukherjee
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper introduces ScarceGAN which focuses on identification of extremely rare or scarce samples from multi-dimensional longitudinal telemetry data with small and weak label prior. We specifically address: (i) severe scarcity in positive class, stemming from both underlying organic skew in the data, as well as extremely limited labels; (ii) multi-class nature of the negative samples, with uneven density distributions and partially overlapping feature distributions; and (iii) massively unlabelled data leading to tiny and weak prior on both positive and negative classes, and possibility of unseen or unknown behavior in the unlabelled set, especially in the negative class. Although related to PU learning problems, we contend that knowledge (or lack of it) on the negative class can be leveraged to learn the compliment of it (i.e., the positive class) better in a semi-supervised manner. To this effect, ScarceGAN re-formulates semi-supervised GAN by accommodating weakly labelled multi-class negative samples and the available positive samples. It relaxes the supervised discriminator’s constraint on exact differentiation between negative samples by introducing a ‘leeway’ term for samples with noisy prior. We propose modifications to the cost objectives of discriminator, in supervised and unsupervised path as well as that of the generator. For identifying risky players in skill gaming, this formulation in whole gives us a recall of over 85% (~60% jump over vanilla semi-supervised GAN) on our scarce class with very minimal verbosity in the unknown space. Further ScarceGAN outperforms the recall benchmarks established by recent GAN based specialized models for the positive imbalanced class identification and establishes a new benchmark in identifying one of rare attack classes (0.09%) in the intrusion dataset from the KDDCUP99 challenge.
[AI-64] Perception-Informed Neural Networks: Beyond Physics-Informed Neural Networks
【Quick Read】: This paper asks how to incorporate perception-based information into neural networks for modeling dynamical systems, whether or not the system's physics laws or differential equations are known. The key of the solution is Perception-Informed Neural Networks (PrINNs), which integrate expert knowledge and perception-based information through loss functions, extending Physics-Informed Neural Networks (PINNs) and their variants to combine traditional physics-based modeling with modern data-driven approaches. Core contributions include Mixture of Experts Informed Neural Networks (MOEINNs), Transformed-Knowledge Informed Neural Networks (TKINNs), and Fuzzy-Informed Neural Networks (FINNs), which embed fuzzy-logic constraints, improve performance in uncertain environments, and support online training.
Link: https://arxiv.org/abs/2505.03806
Authors: Mehran Mazandarani,Marzieh Najariyan
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This article introduces Perception-Informed Neural Networks (PrINNs), a framework designed to incorporate perception-based information into neural networks, addressing both systems with known and unknown physics laws or differential equations. Moreover, PrINNs extend the concept of Physics-Informed Neural Networks (PINNs) and their variants, offering a platform for the integration of diverse forms of perception precisiation, including singular, probability distribution, possibility distribution, interval, and fuzzy graph. In fact, PrINNs allow neural networks to model dynamical systems by integrating expert knowledge and perception-based information through loss functions, enabling the creation of modern data-driven models. Some of the key contributions include Mixture of Experts Informed Neural Networks (MOEINNs), which combine heterogeneous expert knowledge into the network, and Transformed-Knowledge Informed Neural Networks (TKINNs), which facilitate the incorporation of meta-information for enhanced model performance. Additionally, Fuzzy-Informed Neural Networks (FINNs) as a modern class of fuzzy deep neural networks leverage fuzzy logic constraints within a deep learning architecture, allowing online training without pre-training and eliminating the need for defuzzification. PrINNs represent a significant step forward in bridging the gap between traditional physics-based modeling and modern data-driven approaches, enabling neural networks to learn from both structured physics laws and flexible perception-based rules. This approach empowers neural networks to operate in uncertain environments, model complex systems, and discover new forms of differential equations, making PrINNs a powerful tool for advancing computational science and engineering.
[AI-65] MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance
【Quick Read】: This paper addresses the marked accuracy degradation and weakened generalization that Mixture-of-Experts (MoE) large language models suffer under post-training quantization (PTQ), rooted in the inter-expert and intra-expert imbalance caused by the sparsity and dynamics of the MoE structure. The key of the solution is MoEQuant with two novel techniques: Expert-Balanced Self-Sampling (EBSS), which constructs a calibration set with balanced expert distributions, and Affinity-Guided Quantization (AGQ), which brings sample-expert affinities into the quantization process, together improving the performance of quantized MoE models.
Link: https://arxiv.org/abs/2505.03804
Authors: Xing Hu,Zhixuan Chen,Dawei Yang,Zukang Xu,Chen Xu,Zhihang Yuan,Sifan Zhou,Jiangyong Yu
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mixture-of-Experts (MoE) large language models (LLMs), which leverage dynamic routing and sparse activation to enhance efficiency and scalability, have achieved higher performance while reducing computational costs. However, these models face significant memory overheads, limiting their practical deployment and broader adoption. Post-training quantization (PTQ), a widely used method for compressing LLMs, encounters severe accuracy degradation and diminished generalization performance when applied to MoE models. This paper investigates the impact of MoE’s sparse and dynamic characteristics on quantization and identifies two primary challenges: (1) Inter-expert imbalance, referring to the uneven distribution of samples across experts, which leads to insufficient and biased calibration for less frequently utilized experts; (2) Intra-expert imbalance, arising from MoE’s unique aggregation mechanism, which leads to varying degrees of correlation between different samples and their assigned experts. To address these challenges, we propose MoEQuant, a novel quantization framework tailored for MoE LLMs. MoE-Quant includes two novel techniques: 1) Expert-Balanced Self-Sampling (EBSS) is an efficient sampling method that efficiently constructs a calibration set with balanced expert distributions by leveraging the cumulative probabilities of tokens and expert balance metrics as guiding factors. 2) Affinity-Guided Quantization (AGQ), which incorporates affinities between experts and samples into the quantization process, thereby accurately assessing the impact of individual samples on different experts within the MoE layer. Experiments demonstrate that MoEQuant achieves substantial performance gains (more than 10 points accuracy gain in the HumanEval for DeepSeekMoE-16B under 4-bit quantization) and boosts efficiency.
[AI-66] RWKVQuant: Quantizing the RWKV Family with Proxy Guided Hybrid of Scalar and Vector Quantization
【Quick Read】: This paper addresses the significant performance degradation that post-training quantization (PTQ) causes when RWKV models are deployed on resource-constrained devices. The key of the solution is the RWKVQuant framework with two innovations: a coarse-to-fine proxy that adaptively selects among quantization methods by assessing weight uniformity and identifying outliers, and a codebook optimization algorithm that improves cluster-based quantization of RWKV's element-wise multiplications.
Link: https://arxiv.org/abs/2505.03803
Authors: Chen Xu,Yuxuan Yue,Zukang Xu,Xing Hu,Jiangyong Yu,Zhixuan Chen,Sifan Zhou,Zhihang Yuan,Dawei Yang
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:RWKV is a modern RNN architecture with comparable performance to Transformer, but still faces challenges when deployed to resource-constrained devices. Post Training Quantization (PTQ), which is a an essential technique to reduce model size and inference latency, has been widely used in Transformer models. However, it suffers significant degradation of performance when applied to RWKV. This paper investigates and identifies two key constraints inherent in the properties of RWKV: (1) Non-linear operators hinder the parameter-fusion of both smooth- and rotation-based quantization, introducing extra computation overhead. (2) The larger amount of uniformly distributed weights poses challenges for cluster-based quantization, leading to reduced accuracy. To this end, we propose RWKVQuant, a PTQ framework tailored for RWKV models, consisting of two novel techniques: (1) a coarse-to-fine proxy capable of adaptively selecting different quantization approaches by assessing the uniformity and identifying outliers in the weights, and (2) a codebook optimization algorithm that enhances the performance of cluster-based quantization methods for element-wise multiplication in RWKV. Experiments show that RWKVQuant can quantize RWKV-6-14B into about 3-bit with less than 1% accuracy loss and 2.14x speed up.
[AI-67] Efficient Fine-Tuning of Quantized Models via Adaptive Rank and Bitwidth
【Quick Read】: This paper targets the performance loss caused by quantization error in quantized LoRA fine-tuning, and the lack of synergy in existing methods that optimize the low-rank subspace and the quantization components separately. The key of the solution is QR-Adaptor, a unified, gradient-free strategy that uses partial calibration data to jointly search each layer's quantization components and the rank of its low-rank space, treating precision and rank allocation as a discrete optimization problem driven by actual downstream performance and memory usage, thereby consistently improving model performance.
Link: https://arxiv.org/abs/2505.03802
Authors: Changhai Zhou,Yuhua Zhou,Qian Qiao,Weizhong Zhang,Cheng Jin
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 6 figures
Abstract:QLoRA effectively combines low-bit quantization and LoRA to achieve memory-friendly fine-tuning for large language models (LLM). Recently, methods based on SVD for continuous update iterations to initialize LoRA matrices to accommodate quantization errors have generally failed to consistently improve performance. Dynamic mixed precision is a natural idea for continuously improving the fine-tuning performance of quantized models, but previous methods often optimize low-rank subspaces or quantization components separately, without considering their synergy. To address this, we propose \textbfQR-Adaptor, a unified, gradient-free strategy that uses partial calibration data to jointly search the quantization components and the rank of low-rank spaces for each layer, thereby continuously improving model performance. QR-Adaptor does not minimize quantization error but treats precision and rank allocation as a discrete optimization problem guided by actual downstream performance and memory usage. Compared to state-of-the-art (SOTA) quantized LoRA fine-tuning methods, our approach achieves a 4.89% accuracy improvement on GSM8K, and in some cases even outperforms the 16-bit fine-tuned model while maintaining the memory footprint of the 4-bit setting.
[AI-68] Large Language Model Compression with Global Rank and Sparsity Optimization
【Quick Read】: This paper targets two main challenges in compressing large language models (LLMs): the interaction and cooperation between low-rank and sparse matrices, and the allocation of weights across layers, whose redundancy varies considerably. The key of the solution is a two-stage LLM compression method with global rank and sparsity optimization. The first stage uses Robust Principal Component Analysis (RPCA) to decompose weight matrices into low-rank and sparse components, shrinking the optimization space; the second stage applies a probabilistic global optimization technique to jointly identify low-rank and sparse structures within the two resulting spaces, automatically detecting cross-layer redundancy and managing the interaction between the sparse and low-rank components.
Link: https://arxiv.org/abs/2505.03801
Authors: Changhai Zhou,Qian Qiao,Weizhong Zhang,Cheng Jin
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 5 figures
Abstract:Low-rank and sparse composite approximation is a natural idea to compress Large Language Models (LLMs). However, such an idea faces two primary challenges that adversely affect the performance of existing methods. The first challenge relates to the interaction and cooperation between low-rank and sparse matrices, while the second involves determining weight allocation across different layers, as redundancy varies considerably among them. To address these challenges, we propose a novel two-stage LLM compression method with the capability of global rank and sparsity optimization. It is noteworthy that the overall optimization space is vast, making comprehensive optimization computationally prohibitive. Therefore, to reduce the optimization space, our first stage utilizes robust principal component analysis to decompose the weight matrices of LLMs into low-rank and sparse components, which span the low dimensional and sparse spaces containing the resultant low-rank and sparse matrices, respectively. In the second stage, we propose a probabilistic global optimization technique to jointly identify the low-rank and sparse structures within the above two spaces. The appealing feature of our approach is its ability to automatically detect the redundancy across different layers and to manage the interaction between the sparse and low-rank components. Extensive experimental results indicate that our method significantly surpasses state-of-the-art techniques for sparsification and composite approximation.
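The stage-one decomposition can be sketched with the standard Principal Component Pursuit recipe (inexact ALM alternating singular-value and soft thresholding); the parameter defaults below follow the usual PCP conventions, not necessarily the paper's.

```python
import numpy as np

def rpca_pcp(W, lam=None, mu=None, iters=100):
    """Split a weight matrix W into a low-rank part L and a sparse part S
    by alternating singular-value and element-wise soft thresholding."""
    m, n = W.shape
    lam = lam or 1 / np.sqrt(max(m, n))
    mu = mu or (m * n) / (4 * np.abs(W).sum())
    shrink = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0)
    L, S, Y = np.zeros_like(W), np.zeros_like(W), np.zeros_like(W)
    for _ in range(iters):
        U, sig, Vt = np.linalg.svd(W - S + Y / mu, full_matrices=False)
        L = (U * shrink(sig, 1 / mu)) @ Vt            # singular-value thresholding
        S = shrink(W - L + Y / mu, lam / mu)          # element-wise soft threshold
        Y = Y + mu * (W - L - S)                      # dual variable update
    return L, S

L, S = rpca_pcp(np.random.randn(64, 64))
```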
[AI-69] Position: Foundation Models Need Digital Twin Representations
【Quick Read】: This position paper addresses the limitations of current foundation models' (FMs) reliance on discrete token representations of multimodal data: difficulty maintaining semantic coherence across modalities, capturing fine-grained spatio-temporal dynamics, and performing causal reasoning. The key of the proposed solution is digital twin (DT) representations, outcome-driven digital representations that serve as building blocks for virtual replicas of physical processes, providing physically grounded representations that explicitly encode domain knowledge and preserve the continuous nature of real-world processes.
Link: https://arxiv.org/abs/2505.03798
Authors: Yiqing Shen,Hao Ding,Lalithkumar Seenivasan,Tianmin Shu,Mathias Unberath
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Current foundation models (FMs) rely on token representations that directly fragment continuous real-world multimodal data into discrete tokens. They limit FMs to learning real-world knowledge and relationships purely through statistical correlation rather than leveraging explicit domain knowledge. Consequently, current FMs struggle with maintaining semantic coherence across modalities, capturing fine-grained spatial-temporal dynamics, and performing causal reasoning. These limitations cannot be overcome by simply scaling up model size or expanding datasets. This position paper argues that the machine learning community should consider digital twin (DT) representations, which are outcome-driven digital representations that serve as building blocks for creating virtual replicas of physical processes, as an alternative to the token representation for building FMs. Finally, we discuss how DT representations can address these challenges by providing physically grounded representations that explicitly encode domain knowledge and preserve the continuous nature of real-world processes.
[AI-70] AI-Driven IRM: Transforming insider risk management with adaptive scoring and LLM -based threat detection
【Quick Read】: This paper addresses the challenge insider threats pose to organizational security, which traditional rule-based detection systems struggle to identify. The key of the solution is an AI-powered Insider Risk Management (IRM) system built around a hybrid scoring mechanism that transitions from the static PRISM model to an adaptive AI model based on an autoencoder neural network, trained on expert-annotated user activity data and refined through iterative feedback loops and continuous learning, which markedly improves detection accuracy and adaptability.
Link: https://arxiv.org/abs/2505.03796
Authors: Lokesh Koli,Shubham Kalra,Rohan Thakur,Anas Saifi,Karanpreet Singh
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Insider threats pose a significant challenge to organizational security, often evading traditional rule-based detection systems due to their subtlety and contextual nature. This paper presents an AI-powered Insider Risk Management (IRM) system that integrates behavioral analytics, dynamic risk scoring, and real-time policy enforcement to detect and mitigate insider threats with high accuracy and adaptability. We introduce a hybrid scoring mechanism - transitioning from the static PRISM model to an adaptive AI-based model utilizing an autoencoder neural network trained on expert-annotated user activity data. Through iterative feedback loops and continuous learning, the system reduces false positives by 59% and improves true positive detection rates by 30%, demonstrating substantial gains in detection precision. Additionally, the platform scales efficiently, processing up to 10 million log events daily with sub-300ms query latency, and supports automated enforcement actions for policy violations, reducing manual intervention. The IRM system’s deployment resulted in a 47% reduction in incident response times, highlighting its operational impact. Future enhancements include integrating explainable AI, federated learning, graph-based anomaly detection, and alignment with Zero Trust principles to further elevate its adaptability, transparency, and compliance-readiness. This work establishes a scalable and proactive framework for mitigating emerging insider risks in both on-premises and hybrid environments.
[AI-71] Modeling Human Behavior in a Strategic Network Game with Complex Group Dynamics
【Quick Read】: This paper asks how to model human behavior in network games more accurately, so as to better understand how human networks shape societal outcomes. The key of the solution is to model the population's distribution rather than its mean, and to assume community-aware behavior rather than simple behavior matching. The resulting model, called hCAB, closely mirrors the dynamics of small human societies (6-11 individuals), and a user study shows that participants could not distinguish hCAB agents from real humans.
Link: https://arxiv.org/abs/2505.03795
Authors: Jacob W. Crandall,Jonathan Skaggs
Institutions: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:
Abstract:Human networks greatly impact important societal outcomes, including wealth and health inequality, poverty, and bullying. As such, understanding human networks is critical to learning how to promote favorable societal outcomes. As a step toward better understanding human networks, we compare and contrast several methods for learning models of human behavior in a strategic network game called the Junior High Game (JHG). These modeling methods differ with respect to the assumptions they use to parameterize human behavior (behavior vs. community-aware behavior) and the statistical moments they model (mean vs. distribution). Results show that the highest-performing method models the population’s distribution rather than the mean and assumes humans use community-aware behavior rather than behavior matching. When applied to small societies (6-11 individuals), this learned model, called hCAB, closely mirrors the population dynamics of human groups (with some differences). Additionally, a user study reveals that human participants were unable to distinguish hCAB agents from other humans, thus illustrating that individual hCAB behavior plausibly mirrors human behavior in this strategic network game.
[AI-72] LENSLLM : Unveiling Fine-Tuning Dynamics for LLM Selection ICML’2025
【Quick Read】: This paper addresses efficient selection of large language models (LLMs) under computational constraints, in particular how to model LLMs' dynamic behavior during fine-tuning so as to better predict their generalization across diverse downstream tasks. The key of the solution is a new theoretical framework: a Hessian-based PAC-Bayes generalization bound that unveils LLM fine-tuning dynamics, and LENSLLM, a Neural Tangent Kernel (NTK)-based rectified scaling model that delivers accurate cross-task performance prediction while remaining computationally efficient.
Link: https://arxiv.org/abs/2505.03793
Authors: Xinyue Zeng,Haohui Wang,Junhong Lin,Jun Wu,Tyler Cody,Dawei Zhou
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2025; the code is open-sourced at this https URL
Abstract:The proliferation of open-sourced Large Language Models (LLMs) and diverse downstream tasks necessitates efficient model selection, given the impracticality of fine-tuning all candidates due to computational constraints. Despite the recent advances in LLM selection, a fundamental research question largely remains nascent: how can we model the dynamic behaviors of LLMs during fine-tuning, thereby enhancing our understanding of their generalization performance across diverse downstream tasks? In this work, we propose a novel theoretical framework that provides a proper lens to assess the generalization capabilities of LLMs, thereby enabling accurate and efficient LLM selection for downstream applications. In particular, we first derive a Hessian-based PAC-Bayes generalization bound that unveils fine-tuning dynamics of LLMs and then introduce LENSLLM, a Neural Tangent Kernel(NTK)-based Rectified Scaling Model that enables accurate performance predictions across diverse tasks while maintaining computational efficiency. Extensive empirical results on 3 large-scale benchmarks demonstrate that our model achieves up to 91.1% accuracy and reduces up to 88.5% computational cost in LLM selection, outperforming 5 state-of-the-art methods. We open-source our proposed LENSLLM model and corresponding results at the Github link: this https URL.
[AI-73] Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning ICML2025
【Quick Read】: This paper addresses the effective-exploration problem in online reinforcement learning (RL) fine-tuning of vision-language model (VLM) agents, where the open-ended textual action space and the non-end-to-end nature of action generation make the exploration space prone to explosion. The key of the solution is Counterfactual Soft Reinforcement Learning (CoSo), which uses counterfactual reasoning to dynamically assess each token's causal influence on the post-processed action, prioritizing exploration of action-critical tokens while damping semantically redundant or low-impact ones, yielding a more targeted and efficient online rollout process.
Link: https://arxiv.org/abs/2505.03792
Authors: Lang Feng,Weihao Tan,Zhiyi Lyu,Longtao Zheng,Haiyang Xu,Ming Yan,Fei Huang,Bo An
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2025
Abstract:Online fine-tuning vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, their open-ended textual action space and non-end-to-end nature of action generation present significant challenges to effective online exploration in RL, e.g., explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide theoretical analysis proving CoSo’s convergence and policy improvement guarantees, and extensive empirical evaluations supporting CoSo’s effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains. The code is available at this https URL.
[AI-74] Practical Boolean Backpropagation
【Quick Read】: This paper addresses the open problem of purely Boolean training of Boolean neural networks, aiming at a hardware-efficient alternative that involves no numerics. The key of the solution is a purely Boolean backpropagation method for networks built from a single chosen logic gate, operating directly in Boolean algebra with no numerical computation involved.
Link: https://arxiv.org/abs/2505.03791
Authors: Simon Golbert
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages
Abstract:Boolean neural networks offer hardware-efficient alternatives to real-valued models. While quantization is common, purely Boolean training remains underexplored. We present a practical method for purely Boolean backpropagation for networks built from a single chosen gate, operating directly in Boolean algebra with no numerics involved. Initial experiments confirm its feasibility.
[AI-75] A Time-Series Data Augmentation Model through Diffusion and Transformer Integration
【Quick Read】: This paper addresses the shortage of time-series data augmentation: generating large volumes of high-quality time-series data to boost deep neural network performance. The key of the solution is to combine a diffusion model and a Transformer: an adjusted diffusion denoising model first generates a large amount of initial time-step action data, a Transformer then predicts the subsequent actions, and a weighted loss function is introduced to achieve convergence, effectively producing high-quality augmented data.
Link: https://arxiv.org/abs/2505.03790
Authors: Yuren Zhang,Zhongnan Pu,Lei Jing
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 22 figures
Abstract:With the development of Artificial Intelligence, numerous real-world tasks have been accomplished using technology integrated with deep learning. To achieve optimal performance, deep neural networks typically require large volumes of data for training. Although advances in data augmentation have facilitated the acquisition of vast datasets, most of this data is concentrated in domains like images and speech. However, there has been relatively less focus on augmenting time-series data. To address this gap and generate a substantial amount of time-series data, we propose a simple and effective method that combines the Diffusion and Transformer models. By utilizing an adjusted diffusion denoising model to generate a large volume of initial time-step action data, followed by employing a Transformer model to predict subsequent actions, and incorporating a weighted loss function to achieve convergence, the method demonstrates its effectiveness. Using the performance improvement of the model after applying augmented data as a benchmark, and comparing the results with those obtained without data augmentation or using traditional data augmentation methods, this approach shows its capability to produce high-quality augmented data.
[AI-76] ArrhythmiaVision: Resource-Conscious Deep Learning Models with Visual Explanations for ECG Arrhythmia Classification
【Quick Read】: This paper targets the tension between the need for accurate, timely arrhythmia detection and the inefficiency, expertise-dependence, and error-proneness of manual ECG interpretation, as well as existing deep models' shortcomings in preserving the signal's temporal and morphological features, interpretability, and computational efficiency. The key of the solution is two lightweight 1D convolutional neural networks (ArrhythmiNet V1 and V2) that borrow MobileNet's depthwise separable convolution design to achieve high classification accuracy with very small memory footprints (302.18 KB and 157.76 KB, respectively), combined with Shapley Additive Explanations and Gradient-weighted Class Activation Mapping for interpretability, making them suitable for real-time arrhythmia classification on edge devices.
Link: https://arxiv.org/abs/2505.03787
Authors: Zuraiz Baig,Sidra Nasir,Rizwan Ahmed Khan,Muhammad Zeeshan Ul Haque
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages and 8 figures
Abstract:Cardiac arrhythmias are a leading cause of life-threatening cardiac events, highlighting the urgent need for accurate and timely detection. Electrocardiography (ECG) remains the clinical gold standard for arrhythmia diagnosis; however, manual interpretation is time-consuming, dependent on clinical expertise, and prone to human error. Although deep learning has advanced automated ECG analysis, many existing models abstract away the signal’s intrinsic temporal and morphological features, lack interpretability, and are computationally intensive-hindering their deployment on resource-constrained platforms. In this work, we propose two novel lightweight 1D convolutional neural networks, ArrhythmiNet V1 and V2, optimized for efficient, real-time arrhythmia classification on edge devices. Inspired by MobileNet’s depthwise separable convolutional design, these models maintain memory footprints of just 302.18 KB and 157.76 KB, respectively, while achieving classification accuracies of 0.99 (V1) and 0.98 (V2) on the MIT-BIH Arrhythmia Dataset across five classes: Normal Sinus Rhythm, Left Bundle Branch Block, Right Bundle Branch Block, Atrial Premature Contraction, and Premature Ventricular Contraction. In order to ensure clinical transparency and relevance, we integrate Shapley Additive Explanations and Gradient-weighted Class Activation Mapping, enabling both local and global interpretability. These techniques highlight physiologically meaningful patterns such as the QRS complex and T-wave that contribute to the model’s predictions. We also discuss performance-efficiency trade-offs and address current limitations related to dataset diversity and generalizability. Overall, our findings demonstrate the feasibility of combining interpretability, predictive accuracy, and computational efficiency in practical, wearable, and embedded ECG monitoring systems.
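The MobileNet-style building block the abstract describes, a depthwise convolution followed by a pointwise one, is easy to write down in PyTorch; the channel counts and kernel sizes below are illustrative, not the published architectures.

```python
import torch.nn as nn

class DSConv1d(nn.Sequential):
    """Depthwise-separable 1D conv block (depthwise filter, then 1x1 pointwise)."""
    def __init__(self, c_in, c_out, k=7):
        super().__init__(
            nn.Conv1d(c_in, c_in, k, padding=k // 2, groups=c_in),  # depthwise
            nn.Conv1d(c_in, c_out, 1),                              # pointwise
            nn.BatchNorm1d(c_out),
            nn.ReLU(),
        )

# A tiny 5-class ECG classifier assembled from this primitive:
net = nn.Sequential(
    DSConv1d(1, 16), nn.MaxPool1d(2),
    DSConv1d(16, 32), nn.MaxPool1d(2),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, 5),          # NSR, LBBB, RBBB, APC, PVC
)
```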
[AI-77] GPU Performance Portability needs Autotuning
【Quick Read】: This paper addresses the portability problem of large language model (LLM) kernels across hardware platforms, along with the vendor lock-in and optimization burden this creates. The key of the solution is to combine just-in-time (JIT) compilation with kernel parameter autotuning, enabling portable, state-of-the-art LLM execution without code changes. Experiments on flash attention, a widely used performance-critical kernel, show that the approach explores many more parameter configurations, produces more diverse code, and can even outperform vendor-optimized implementations, while drastically shrinking kernel code size and eliminating manual optimization.
Link: https://arxiv.org/abs/2505.03780
Authors: Burkhard Ringlein,Thomas Parnell,Radu Stoica
Institutions: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:
Abstract:As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today’s reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with kernel parameter autotuning to enable portable, state-of-the-art performance LLM execution without code changes. Focusing on flash attention – a widespread performance-critical LLM kernel – we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.
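Stripped of GPU specifics, kernel-parameter autotuning is a timed search over configurations. The sketch below shows the generic loop, where run_kernel is an assumed callable that JIT-compiles and launches one candidate configuration.

```python
import itertools
import time

def autotune(run_kernel, param_grid, reps=5):
    """Time every kernel-parameter configuration and keep the fastest one."""
    best_cfg, best_t = None, float("inf")
    keys = list(param_grid)
    for values in itertools.product(*param_grid.values()):
        cfg = dict(zip(keys, values))
        run_kernel(**cfg)                      # warm-up pass triggers JIT compile
        t0 = time.perf_counter()
        for _ in range(reps):
            run_kernel(**cfg)
        t = (time.perf_counter() - t0) / reps
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

# e.g., flash-attention-style tile sizes (illustrative search space):
grid = {"BLOCK_M": [64, 128], "BLOCK_N": [32, 64, 128], "num_warps": [4, 8]}
```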
[AI-78] Proceedings of 1st Workshop on Advancing Artificial Intelligence through Theory of Mind
【Quick Read】: This volume addresses how to advance artificial intelligence through Theory of Mind (ToM); its key contribution is an open-access, curated anthology that promotes knowledge sharing and exchange between the ToM and AI research communities.
Link: https://arxiv.org/abs/2505.03770
Authors: Mouad Abrini,Omri Abend,Dina Acklin,Henny Admoni,Gregor Aichinger,Nitay Alon,Zahra Ashktorab,Ashish Atreja,Moises Auron,Alexander Aufreiter,Raghav Awasthi,Soumya Banerjee,Joe M. Barnby,Rhea Basappa,Severin Bergsmann,Djallel Bouneffouf,Patrick Callaghan,Marc Cavazza,Thierry Chaminade,Sonia Chernova,Mohamed Chetouan,Moumita Choudhury,Axel Cleeremans,Jacek B. Cywinski,Fabio Cuzzolin,Hokin Deng,N’yoma Diamond,Camilla Di Pasquasio,Guillaume Dumas,Max van Duijn,Mahapatra Dwarikanath,Qingying Gao,Ashok Goel,Rebecca Goldstein,Matthew Gombolay,Gabriel Enrique Gonzalez,Amar Halilovic,Tobias Halmdienst,Mahimul Islam,Julian Jara-Ettinger,Natalie Kastel,Renana Keydar,Ashish K. Khanna,Mahdi Khoramshahi,JiHyun Kim,MiHyeon Kim,YoungBin Kim,Senka Krivic,Nikita Krasnytskyi,Arun Kumar,JuneHyoung Kwon,Eunju Lee,Shane Lee,Peter R. Lewis,Xue Li,Yijiang Li,Michal Lewandowski,Nathan Lloyd,Matthew B. Luebbers,Dezhi Luo,Haiyun Lyu,Dwarikanath Mahapatra,Kamal Maheshwari,Mallika Mainali,Piyush Mathur,Patrick Mederitsch,Shuwa Miura,Manuel Preston de Miranda,Reuth Mirsky,Shreya Mishra,Nina Moorman,Katelyn Morrison,John Muchovej,Bernhard Nessler,Felix Nessler,Hieu Minh Jord Nguyen,Abby Ortego,Francis A. Papay,Antoine Pasquali,Hamed Rahimi,Charumathi Raghu,Amanda Royka,Stefan Sarkadi,Jaelle Scheuerman,Simon Schmid,Paul Schrater,Anik Sen,Zahra Sheikhbahaee,Ke Shi,Reid Simmons,Nishant Singh,Mason O. Smith,Ramira van der Meulen,Anthia Solaki,Haoran Sun,Viktor Szolga,Matthew E. Taylor,Travis Taylor,Sanne Van Waveren,Juan David Vargas
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: workshop proceedings
Abstract:This volume includes a selection of papers presented at the Workshop on Advancing Artificial Intelligence through Theory of Mind held at AAAI 2025 in Philadelphia US on 3rd March 2025. The purpose of this volume is to provide an open access and curated anthology for the ToM and AI research community.
[AI-79] he Influence of Text Variation on User Engagement in Cross-Platform Content Sharing
【Quick Read】: This paper investigates the complex factors that drive user engagement with multimodal content, especially text paired with visuals, in a cross-platform social media setting. The key of the solution is a controlled dataset and a multi-phase experiment that isolate the effect of textual variation on engagement, together with statistical analysis and a fine-tuned BERT-based classifier that validate the features of effective title rewrites, such as emotional resonance, lexical richness, and alignment with community-specific norms, providing both theoretical grounding and a practical framework for future cross-platform multimodal content strategies.
Link: https://arxiv.org/abs/2505.03769
Authors: Yibo Hu,Yiqiao Jin,Meng Ye,Ajay Divakaran,Srijan Kumar
Institutions: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:In today’s cross-platform social media landscape, understanding factors that drive engagement for multimodal content, especially text paired with visuals, remains complex. This study investigates how rewriting Reddit post titles adapted from YouTube video titles affects user engagement. First, we build and analyze a large dataset of Reddit posts sharing YouTube videos, revealing that 21% of post titles are minimally modified. Statistical analysis demonstrates that title rewrites measurably improve engagement. Second, we design a controlled, multi-phase experiment to rigorously isolate the effects of textual variations by neutralizing confounding factors like video popularity, timing, and community norms. Comprehensive statistical tests reveal that effective title rewrites tend to feature emotional resonance, lexical richness, and alignment with community-specific norms. Lastly, pairwise ranking prediction experiments using a fine-tuned BERT classifier achieves 74% accuracy, significantly outperforming near-random baselines, including GPT-4o. These results validate that our controlled dataset effectively minimizes confounding effects, allowing advanced models to both learn and demonstrate the impact of textual features on engagement. By bridging quantitative rigor with qualitative insights, this study uncovers engagement dynamics and offers a robust framework for future cross-platform, multimodal content strategies.
zh
[AI-80] Ultra-Low-Power Spiking Neurons in 7 nm FinFET Technology: A Comparative Analysis of Leaky Integrate-and-Fire Morris-Lecar and Axon-Hillock Architectures
【速读】:该论文旨在解决神经形态计算中如何优化脉冲神经元电路设计以实现高能效与高吞吐量的问题。其解决方案的关键在于对三种脉冲神经元电路架构——漏电积分-放电(Leaky-Integrate-and-Fire, LIF)、Morris-Lecar (ML) 和轴突始段(Axon-Hillock, AH)在 7 nm FinFET 工艺下的性能进行系统比较与优化,通过 SPICE 仿真分析脉冲频率、单脉冲能耗及静态功耗,并揭示不同架构在不同工作区域的性能优势,从而为先进纳米尺度技术下的神经形态硬件设计提供优化路线图。
链接: https://arxiv.org/abs/2505.03764
作者: Logan Larsh,Raiyan Siddique,Sarah Sharif,Yaser Mike Banad
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:
Abstract:Neuromorphic computing aims to replicate the brain’s remarkable energy efficiency and parallel processing capabilities for large-scale artificial intelligence applications. In this work, we present a comprehensive comparative study of three spiking neuron circuit architectures-Leaky-Integrate-and-Fire (LIF), Morris-Lecar (ML), and Axon-Hillock (AH)-implemented in a 7 nm FinFET technology. Through extensive SPICE simulations, we explore the optimization of spiking frequency, energy per spike, and static power consumption. Our results show that the AH design achieves the highest throughput, demonstrating multi-gigahertz firing rates (up to 3 GHz) with attojoule energy costs. By contrast, the ML architecture excels in subthreshold to near-threshold regimes, offering robust low-power operation (as low as 0.385 aJ/spike) and biological bursting behavior. Although LIF benefits from a decoupled current mirror for high-frequency operation, it exhibits slightly higher static leakage compared to ML and AH at elevated supply voltages. Comparisons with previous node implementations (22 nm planar, 28 nm) reveal that 7 nm FinFETs can drastically boost energy efficiency and speed albeit at the cost of increased subthreshold leakage in deep subthreshold regions. By quantifying design trade-offs for each neuron architecture, our work provides a roadmap for optimizing spiking neuron circuits in advanced nanoscale technologies to deliver neuromorphic hardware capable of both ultra-low-power operation and high computational throughput.
zh
[AI-81] Splitwiser: Efficient LM inference with constrained resources
【速读】:该论文试图解决大型语言模型(Large Language Model, LLM)推理过程中计算资源利用率不足的问题,特别是在令牌生成阶段无法充分利用计算资源。解决方案的关键在于提出Splitwiser方法,该方法将LLM推理请求的两个阶段——提示计算和令牌生成——分割并在同一GPU上执行,从而减少开销并提升内存访问和缓存利用率,同时避免跨设备数据传输带来的网络相关开销。
链接: https://arxiv.org/abs/2505.03763
作者: Asad Aali,Adney Cardoza,Melissa Capo
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:
Abstract:Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail to fully utilize compute resources, especially when compared to prompt computation phases. To address these challenges, we propose Splitwiser, a methodology that splits the two phases of an LLM inference request onto the same GPU, thereby reducing overhead and improving memory access and cache utilization. By eliminating the need to transfer data across devices, Splitwiser aims to minimize network-related overheads. In this report, we describe the basic structure of our proposed pipeline while sharing preliminary results and analysis. We implement our proposed multiprocessing design on two widely-used and independent LLM architectures: Huggingface and vLLM. We open-source our code for the respective implementations: 1) Huggingface (this https URL), and 2) vLLM (this https URL).
zh
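上文 Splitwiser 的核心思想是把提示计算(prefill,计算密集)与令牌生成(decode,访存密集)两个阶段拆开调度。下面给出一个基于 Hugging Face transformers 的极简 Python 示意,仅演示两阶段共享 KV 缓存的基本模式;模型选择与生成长度均为示例假设,并非论文的多进程实现(后者见其开源代码):

```python
# 极简示意:将 LLM 推理拆分为 prefill(提示计算)与 decode(令牌生成)两阶段
# 非 Splitwiser 官方实现;需本地可下载 gpt2 权重
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = tok("Efficient LLM inference has two phases:", return_tensors="pt")

with torch.no_grad():
    # 阶段一:计算密集的 prefill,一次前向得到整个提示的 KV 缓存
    out = model(**prompt, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)

    # 阶段二:访存密集的 decode,逐 token 复用 KV 缓存
    generated = [next_id]
    for _ in range(20):
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```

真实系统会把两阶段放入不同进程或 CUDA 流并做批处理调度以提升 GPU 利用率,此处从略。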
[AI-82] Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
【速读】:该论文旨在解决多低秩适配器(Multi-LoRAs)在推理服务中的性能优化问题,特别是针对时延指标Time-To-First-Token (TTFT) 的优化。现有系统未能考虑LoRA适配器与键值(KV)缓存之间的使用依赖关系,导致缓存策略不够高效。论文提出的解决方案是FASTLIBRA,其关键在于引入了一个依赖感知的缓存管理器和一个基于统一成本模型的性能驱动缓存交换器,以在高带宽内存(HBM)空闲或繁忙时动态调整LoRA和KV缓存的交换策略,从而显著降低TTFT。
链接: https://arxiv.org/abs/2505.03756
作者: Hang Zhang,Jiuchen Shi,Yixiao Wang,Quan Chen,Yizhou Shan,Minyi Guo
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:
Abstract:Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For multi-LoRA serving, caching hot KV caches and LoRA adapters in the high bandwidth memory of accelerators can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving performance metrics like Time-To-First-Token (TTFT), neglecting usage dependencies when caching LoRAs and KVs. We therefore propose FASTLIBRA, a Multi-LoRA caching system to optimize the serving performance. FASTLIBRA comprises a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during the inference with a unified caching pool. The cache swapper determines the swap-in or swap-out of LoRAs and KV caches based on a unified cost model, when the HBM is idle or busy, respectively. Experimental results show that FASTLIBRA reduces the TTFT by 63.4% on average, compared to state-of-the-art works.
zh
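作为思路示意,下面用纯 Python 勾勒一个"统一缓存池 + 依赖感知 + 成本模型换出"的玩具实现;容量单位、成本函数与接口均为假设,仅在概念上对应 FASTLIBRA 的设计,并非其真实实现:

```python
# 思路示意:统一缓存池,换出时考虑 LoRA 与 KV 缓存之间的依赖(玩具实现)
class CachePool:
    def __init__(self, capacity):
        self.capacity = capacity       # HBM 可用容量(抽象单位)
        self.clock = 0                 # 逻辑时钟,代替真实时间戳
        self.entries = {}              # key -> {"size", "last_use", "deps"}

    def put(self, key, size, deps=()):
        while self.entries and sum(e["size"] for e in self.entries.values()) + size > self.capacity:
            self._evict()
        self.clock += 1
        self.entries[key] = {"size": size, "last_use": self.clock, "deps": set(deps)}

    def touch(self, key):
        self.clock += 1
        self.entries[key]["last_use"] = self.clock
        for dep in self.entries[key]["deps"]:   # 使用 KV 时一并刷新其依赖的 LoRA
            if dep in self.entries:
                self.entries[dep]["last_use"] = self.clock

    def _cost(self, key):
        # 简化成本:越近被用、被依赖越多,越不该换出(真实系统用统一成本模型估计收益)
        dependents = sum(key in e["deps"] for e in self.entries.values())
        return self.entries[key]["last_use"] + 10 * dependents

    def _evict(self):
        victim = min(self.entries, key=self._cost)
        del self.entries[victim]

pool = CachePool(capacity=100)
pool.put("lora:A", size=30)
pool.put("kv:req1", size=50, deps=["lora:A"])
pool.touch("kv:req1")
pool.put("kv:req2", size=40)        # 容量不足,按成本换出 kv:req1 而非其被依赖的 lora:A
print(sorted(pool.entries))         # ['kv:req2', 'lora:A']
```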
[AI-83] AI-Powered Agile Analog Circuit Design and Optimization
【速读】:该论文试图解决模拟电路设计中传统方法效率低、迭代成本高的问题,以及如何通过人工智能(AI)提升电路性能和系统级优化。其解决方案的关键在于结合两种AI技术:一是利用多目标贝叶斯优化(Multi-Objective Bayesian Optimization, MOBO)进行晶体管尺寸调整,实现电路参数的直接优化;二是将AI集成到电路传递函数建模中,用于关键词检测(Keyword Spotting, KWS)应用中的系统级优化,通过在机器学习训练循环中优化模拟带通滤波器来实现。
链接: https://arxiv.org/abs/2505.03750
作者: Jinhai Hu,Wang Ling Goh,Yuan Gao
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 3 pages, 5 figures, AI4X, 2025
Abstract:Artificial intelligence (AI) techniques are transforming analog circuit design by automating device-level tuning and enabling system-level co-optimization. This paper integrates two approaches: (1) AI-assisted transistor sizing using Multi-Objective Bayesian Optimization (MOBO) for direct circuit parameter optimization, demonstrated on a linearly tunable transconductor; and (2) AI-integrated circuit transfer function modeling for system-level optimization in a keyword spotting (KWS) application, demonstrated by optimizing an analog bandpass filter within a machine learning training loop. The combined insights highlight how AI can improve analog performance, reduce design iteration effort, and jointly optimize analog components and application-level metrics.
zh
[AI-84] APSQ: Additive Partial Sum Quantization with Algorithm-Hardware Co-Design
【速读】:该论文旨在解决深度神经网络(DNN)加速器中高精度部分和(PSUM)频繁访问导致的内存需求过高的问题。传统压缩策略通常忽略了PSUM量化,而PSUM量化可能占总功耗的69%。论文提出的解决方案是引入一种新颖的加法部分和量化(APSQ)方法,将PSUM累积无缝集成到量化框架中,关键在于通过APSQ与可重构架构结合的分组策略实现PSUM的高效压缩,从而显著降低能量成本。
链接: https://arxiv.org/abs/2505.03748
作者: Yonghao Tan,Pingcheng Dong,Yongkun Wu,Yu Liu,Xuejiao Liu,Peng Luo,Shih-Yang Liu,Xijie Huang,Dong Zhang,Luhong Liang,Kwang-Ting Cheng
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 62nd ACM/IEEE Design Automation Conference (DAC) 2025
Abstract:DNN accelerators, significantly advanced by model compression and specialized dataflow techniques, have marked considerable progress. However, frequent access to high-precision partial sums (PSUMs) leads to excessive memory demands in architectures utilizing input/weight stationary dataflows. Traditional compression strategies have typically overlooked PSUM quantization, which may account for 69% of power consumption. This study introduces a novel Additive Partial Sum Quantization (APSQ) method, seamlessly integrating PSUM accumulation into the quantization framework. A grouping strategy that combines APSQ with PSUM quantization enhanced by a reconfigurable architecture is further proposed. APSQ performs nearly losslessly on NLP and CV tasks across BERT, Segformer, and EfficientViT models while compressing PSUMs to INT8. This leads to a notable 28-87% reduction in energy costs. Extended experiments on LLaMA2-7B demonstrate the potential of APSQ for large language models. Code is available at this https URL.
zh
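下面用 NumPy 做一个数值模拟,示意"在分组累加过程中把部分和量化为 INT8"这一基本想法;量化步长与分组方式均为示例假设,并非论文的 APSQ 电路实现或其具体量化公式:

```python
# 数值模拟:在累加过程中对部分和(PSUM)做 INT8 量化(非论文的硬件实现)
import numpy as np

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)   # 权重分 4 组,模拟分组累加
x = rng.standard_normal(64).astype(np.float32)

scale = 0.25          # 假设的 PSUM 量化步长,实际需按数值统计量校准以避免饱和
psum = 0.0
for g in range(4):
    partial = w[g] @ x                                # 每组产生一个部分和
    psum = dequantize(quantize_int8(psum + partial, scale), scale)  # 加法式量化累加

print("量化累加:", psum, " 全精度:", (w @ x).sum())
```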
[AI-85] Promoting Security and Trust on Social Networks: Explainable Cyberbullying Detection Using Large Language Models in a Stream-Based Machine Learning Framework
【速读】:该论文试图解决在线社区中网络欺凌(cyberbullying)的检测问题,特别是在应对不断演变的侮辱性和仇恨言论方面的挑战。解决方案的关键在于提出一种创新的实时检测方法,该方法结合了基于流的机器学习(stream-based Machine Learning, ML)模型以处理增量数据,以及大型语言模型(Large Language Models, LLMs)用于特征工程,从而有效捕捉网络欺凌行为的动态特性。此外,系统还配备了一个可解释性仪表板,以增强系统的可信度、可靠性和问责性。
链接: https://arxiv.org/abs/2505.03746
作者: Silvia García-Méndez,Francisco De Arriba-Pérez
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:
Abstract:Social media platforms enable instant and ubiquitous connectivity and are essential to social interaction and communication in our technological society. Apart from their advantages, these platforms have given rise to negative behaviors in the online community, the so-called cyberbullying. Despite the many works involving generative Artificial Intelligence (AI) in the literature lately, there remain opportunities to study its performance apart from zero/few-shot learning strategies. Accordingly, we propose an innovative and real-time solution for cyberbullying detection that leverages stream-based Machine Learning (ML) models able to process the incoming samples incrementally and Large Language Models (LLMs) for feature engineering to address the evolving nature of abusive and hate speech online. An explainability dashboard is provided to promote the system’s trustworthiness, reliability, and accountability. Results on experimental data report promising performance close to 90% across all evaluation metrics, surpassing those obtained by competing works in the literature. Ultimately, our proposal contributes to the safety of online communities by timely detecting abusive behavior to prevent long-lasting harassment and reduce the negative consequences in society.
zh
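若想体会"流式增量学习 + LLM 特征工程"的组合,下面给出一个基于 river 库的极简示意;其中 llm_features 是占位函数(假设由大模型产出长度、情感等分数),样本数据亦为示例,并非论文的系统实现:

```python
# 极简示意:流式(增量)文本分类 + 占位的"LLM 特征"
from river import compose, feature_extraction, linear_model, metrics

def llm_features(text):
    # 占位特征:实际可替换为 LLM 给出的情感/毒性等分数(此处为假设)
    return {"length": len(text), "exclaims": text.count("!")}

model = compose.Pipeline(
    compose.TransformerUnion(
        feature_extraction.BagOfWords(lowercase=True),
        compose.FuncTransformer(llm_features),
    ),
    linear_model.LogisticRegression(),
)
metric = metrics.Accuracy()

stream = [
    ("you are awesome", 0),
    ("everyone hates you, loser", 1),
    ("great game yesterday", 0),
    ("go away nobody wants you here", 1),
]
for text, label in stream:
    y_pred = model.predict_one(text)   # 先预测(prequential 评估)
    metric.update(label, y_pred)
    model.learn_one(text, label)       # 再用该样本增量学习

print(metric)
```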
[AI-86] AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
【速读】:该论文旨在解决将大语言模型(Large Language Models, LLMs)部署到资源受限的边缘设备所面临的挑战,包括密集计算与巨大模型规模、自回归生成过程带来的内存和带宽需求以及处理长序列的可扩展性问题。其解决方案的关键在于提出AccLLM框架,通过算法与硬件协同设计实现高效且快速的长上下文LLM推理,具体包括剪枝、Λ形注意力机制以及一种创新的W2A8KV4量化方案,同时设计了基于FPGA的专用加速器以支持压缩算法的多样化操作,从而将算法创新转化为实际的硬件效率提升。
链接: https://arxiv.org/abs/2505.03745
作者: Yanbiao Liang,Huihong Shi,Haikuo Shao,Zhongfeng Wang
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recently, large language models (LLMs) have achieved huge success in the natural language processing (NLP) field, driving a growing demand to extend their deployment from the cloud to edge devices. However, deploying LLMs on resource-constrained edge devices poses significant challenges, including (1) intensive computations and huge model sizes, (2) great memory and bandwidth demands introduced by the autoregressive generation process, and (3) limited scalability for handling long sequences. To address these challenges, we propose AccLLM, a comprehensive acceleration framework that enables efficient and fast long-context LLM inference through algorithm and hardware co-design. At the algorithmic level, we integrate (1) pruning, (2) \Lambda-shaped attention, and (3) an innovative W2A8KV4 (2-bit weights, 8-bit activations, and 4-bit KV cache) quantization scheme, thus effectively reducing memory and bandwidth requirements while facilitating LLMs’ long-sequence generation. At the hardware level, we design a dedicated FPGA-based accelerator with a reconfigurable computing engine to effectively and flexibly accommodate diverse operations arising from our compression algorithm, thereby fully translating the algorithmic innovations into tangible hardware efficiency. We validate AccLLM on the Xilinx Alveo U280 FPGA, demonstrating a 4.07x energy efficiency improvement and a 2.98x throughput improvement over the state-of-the-art work FlightLLM.
zh
[AI-87] Beyond Misinformation: A Conceptual Framework for Studying AI Hallucinations in (Science) Communication
【速读】:该论文试图解决生成式人工智能(Generative AI)产生的幻觉(hallucinations)作为一类新型虚假信息的问题,其核心在于重新界定虚假信息理论的边界。传统虚假信息研究主要关注人类意图,而生成式AI系统在无明确意图的情况下仍能产生看似合理但虚假的输出。论文提出的解决方案关键在于将AI幻觉视为具有社会后果的传播现象,而非单纯的技术故障,并基于供给需求模型和分布式代理概念,分析其在生产、感知及制度回应方面的独特性。该框架呼吁传播学研究者从宏观(制度)、中观(群体)和微观(个体)层面探讨幻觉内容的产生、传播与接受机制。
链接: https://arxiv.org/abs/2504.13777
作者: Anqi Shao
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper proposes a conceptual framework for understanding AI hallucinations as a distinct form of misinformation. While misinformation scholarship has traditionally focused on human intent, generative AI systems now produce false yet plausible outputs absent of such intent. I argue that these AI hallucinations should not be treated merely as technical failures but as communication phenomena with social consequences. Drawing on a supply-and-demand model and the concept of distributed agency, the framework outlines how hallucinations differ from human-generated misinformation in production, perception, and institutional response. I conclude by outlining a research agenda for communication scholars to investigate the emergence, dissemination, and audience reception of hallucinated content, with attention to macro (institutional), meso (group), and micro (individual) levels. This work urges communication researchers to rethink the boundaries of misinformation theory in light of probabilistic, non-human actors increasingly embedded in knowledge production.
zh
[AI-88] Risk-sensitive Reinforcement Learning Based on Convex Scoring Functions
【速读】:该论文试图解决在风险目标下的强化学习(Reinforcement Learning, RL)框架中的时间不一致性问题。其解决方案的关键在于引入扩展的状态空间和辅助变量,将原问题转化为一个双状态优化问题,并提出了一种定制的Actor-Critic算法,同时建立了理论上的近似保证。此外,该方法还引入了一种受交替最小化算法启发的辅助变量采样方法,在特定条件下具有收敛性。
链接: https://arxiv.org/abs/2505.04553
作者: Shanyu Han,Yang Liu,Xiang Yu
机构: 未知
类目: Mathematical Finance (q-fin.MF); Artificial Intelligence (cs.AI); Risk Management (q-fin.RM)
备注: 35 pages
Abstract:We propose a reinforcement learning (RL) framework under a broad class of risk objectives, characterized by convex scoring functions. This class covers many common risk measures, such as variance, Expected Shortfall, entropic Value-at-Risk, and mean-risk utility. To resolve the time-inconsistency issue, we consider an augmented state space and an auxiliary variable and recast the problem as a two-state optimization problem. We propose a customized Actor-Critic algorithm and establish some theoretical approximation guarantees. A key theoretical contribution is that our results do not require the Markov decision process to be continuous. Additionally, we propose an auxiliary variable sampling method inspired by the alternating minimization algorithm, which is convergent under certain conditions. We validate our approach in simulation experiments with a financial application in statistical arbitrage trading, demonstrating the effectiveness of the algorithm.
zh
[AI-89] Recognizing Ornaments in Vocal Indian Art Music with Active Annotation
【速读】:该论文旨在解决在印度古典音乐中识别装饰音(ornamentations)的问题,这对于音乐信息检索(MIR)具有重要意义,应用包括音乐教学、歌手识别、流派分类和受控歌唱语音生成。其解决方案的关键在于引入了一个名为Rāga Ornamentation Detection (ROD)的新数据集,该数据集由专业音乐家精选并使用定制的人机交互工具进行标注,同时开发了一种基于深度时间序列分析的装饰音检测模型,能够在长音频片段分割过程中保持装饰音边界。
链接: https://arxiv.org/abs/2505.04419
作者: Sumit Kumar,Parampreet Singh,Vipul Arora
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Ornamentations, embellishments, or microtonal inflections are essential to melodic expression across many musical traditions, adding depth, nuance, and emotional impact to performances. Recognizing ornamentations in singing voices is key to MIR, with potential applications in music pedagogy, singer identification, genre classification, and controlled singing voice generation. However, the lack of annotated datasets and specialized modeling approaches remains a major obstacle for progress in this research area. In this work, we introduce Rāga Ornamentation Detection (ROD), a novel dataset comprising Indian classical music recordings curated by expert musicians. The dataset is annotated using a custom Human-in-the-Loop tool for six vocal ornaments marked as event-based labels. Using this dataset, we develop an ornamentation detection model based on deep time-series analysis, preserving ornament boundaries during the chunking of long audio recordings. We conduct experiments using different train-test configurations within the ROD dataset and also evaluate our approach on a separate, manually annotated dataset of Indian classical concert recordings. Our experimental results support the superior performance of our proposed approach over the baseline CRNN.
zh
[AI-90] High-speed multiwavelength photonic temporal integration using silicon photonics
【速读】:该论文试图解决光学硬件在映射大规模向量以支持人工智能任务时的可扩展性问题,尤其是在保持高速并行计算的同时实现高效的能量利用。其解决方案的关键在于通过时间展开标量运算并引入光路中的光电热单元(photonic-heater-in-lightpath, PHIL),从而实现全光时域积分。该方法利用缓慢的热耗散过程,将50 GHz调制的光信号进行整合,弥合了广泛使用的热光效应与超快光子学之间的速度差距,进而支持端到端的光学信号处理,并在统一框架内实现线性和非线性操作。
链接: https://arxiv.org/abs/2505.04405
作者: Yi Zhang,Nikolaos Farmakidis,Ioannis Roumpos,Miltiadis Moralis-Pegios,Apostolos Tsakyridis,June Sang Lee,Bowei Dong,Yuhan He,Samarth Aggarwal,Nikolaos Pleros,Harish Bhaskaran
机构: 未知
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
备注:
Abstract:Optical systems have been pivotal for energy-efficient computing, performing high-speed, parallel operations in low-loss carriers. While these predominantly analog optical accelerators bypass digitization to perform parallel floating-point computations, scaling optical hardware to map large-vector sizes for AI tasks remains challenging. Here, we overcome this limitation by unfolding scalar operations in time and introducing a photonic-heater-in-lightpath (PHIL) unit for all-optical temporal integration. Counterintuitively, we exploit a slow heat dissipation process to integrate optical signals modulated at 50 GHz bridging the speed gap between the widely applied thermo-optic effects and ultrafast photonics. This architecture supports optical end-to-end signal processing, eliminates inefficient electro-optical conversions, and enables both linear and nonlinear operations within a unified framework. Our results demonstrate a scalable path towards high-speed photonic computing through thermally driven integration.
zh
[AI-91] Optimization Problem Solving Can Transition to Evolutionary Agentic Workflows
【速读】:该论文试图解决传统优化问题求解中依赖专家的瓶颈问题(expert-dependent bottlenecks),这些问题主要体现在问题建模、算法选择和超参数调优等环节,限制了前沿方法在工业场景中的应用。其解决方案的关键在于引入进化代理工作流(evolutionary agentic workflow),该流程利用基础模型和进化搜索技术,实现对优化空间(包括问题空间、建模空间、算法空间和超参数空间)的自主探索与优化,从而提升求解过程的可扩展性和适应性。
链接: https://arxiv.org/abs/2505.04354
作者: Wenhao Li,Bo Jin,Mingyi Hong,Changhong Lu,Xiangfeng Wang
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注: 27 pages, 5 figures
Abstract:This position paper argues that optimization problem solving can transition from expert-dependent to evolutionary agentic workflows. Traditional optimization practices rely on human specialists for problem formulation, algorithm selection, and hyperparameter tuning, creating bottlenecks that impede industrial adoption of cutting-edge methods. We contend that an evolutionary agentic workflow, powered by foundation models and evolutionary search, can autonomously navigate the optimization space, comprising problem, formulation, algorithm, and hyperparameter spaces. Through case studies in cloud resource scheduling and ADMM parameter adaptation, we demonstrate how this approach can bridge the gap between academic innovation and industrial implementation. Our position challenges the status quo of human-centric optimization workflows and advocates for a more scalable, adaptive approach to solving real-world optimization problems.
zh
[AI-92] Sparsity is All You Need: Rethinking Biological Pathway-Informed Approaches in Deep Learning
【速读】:该论文试图解决生物信息学中路径注释在神经网络模型中的实际贡献问题,即验证路径整合是否真正源于其生物学相关性,还是仅仅由于其引入的稀疏性。解决方案的关键在于通过系统比较生物信息学最先进的深度学习模型与其随机化版本,发现随机化信息在多个指标和数据集上表现相当甚至更优,且在可解释性方面未表现出明显劣势。这一结果表明当前方法可能未能充分挖掘路径注释的有效信息,或路径注释本身存在噪声。为此,作者提出了一种通用方法,可用于不同领域,并作为基准以系统评估新型路径注释模型与随机化模型的性能差异。
链接: https://arxiv.org/abs/2505.04300
作者: Isabella Caranzano,Corrado Pancotti,Cesare Rollo,Flavio Sartori,Pietro Liò,Piero Fariselli,Tiziana Sanavia
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Biologically-informed neural networks typically leverage pathway annotations to enhance performance in biomedical applications. We hypothesized that the benefits of pathway integration do not arise from its biological relevance, but rather from the sparsity it introduces. We conducted a comprehensive analysis of all relevant pathway-based neural network models for predictive tasks, critically evaluating each study’s contributions. From this review, we curated a subset of methods for which the source code was publicly available. The comparison of the biologically informed state-of-the-art deep learning models and their randomized counterparts showed that models based on randomized information performed equally well as biologically informed ones across different metrics and datasets. Notably, in 3 out of the 15 analyzed models, the randomized versions even outperformed their biologically informed counterparts. Moreover, pathway-informed models did not show any clear advantage in interpretability, as randomized models were still able to identify relevant disease biomarkers despite lacking explicit pathway information. Our findings suggest that pathway annotations may be too noisy or inadequately explored by current methods. Therefore, we propose a methodology that can be applied to different domains and can serve as a robust benchmark for systematically comparing novel pathway-informed models against their randomized counterparts. This approach enables researchers to rigorously determine whether observed performance improvements can be attributed to biological insights.
zh
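论文的对照思想可以用几行 PyTorch 复现:构造一个按通路掩码连接的稀疏层,再构造一个打乱掩码元素、稀疏度完全相同的随机层进行对比。维度与掩码均为示例假设:

```python
# 极简示意:通路(pathway)掩码稀疏层 vs. 同等稀疏度的随机掩码层
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, mask):                 # mask: (out_features, in_features) 的 0/1 矩阵
        super().__init__()
        out_f, in_f = mask.shape
        self.linear = nn.Linear(in_f, out_f)
        self.register_buffer("mask", mask.float())

    def forward(self, x):
        return nn.functional.linear(x, self.linear.weight * self.mask, self.linear.bias)

n_genes, n_pathways = 200, 16
pathway_mask = (torch.rand(n_pathways, n_genes) < 0.1).float()   # 假设的基因-通路隶属关系

# 随机对照:打乱掩码元素,保持稀疏度完全一致
perm = torch.randperm(pathway_mask.numel())
random_mask = pathway_mask.flatten()[perm].reshape_as(pathway_mask)

bio_layer = MaskedLinear(pathway_mask)
rand_layer = MaskedLinear(random_mask)
x = torch.randn(8, n_genes)
print(bio_layer(x).shape, rand_layer(x).shape)   # 结构与稀疏度一致,仅连接模式不同
```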
[AI-93] A Graphical Global Optimization Framework for Parameter Estimation of Statistical Models with Nonconvex Regularization Functions
【速读】:该论文旨在解决包含范数约束的优化问题,特别是涉及零范数函数等复杂非凸惩罚项的稀疏线性回归问题。现有方法通常通过引入二进制变量或利用目标函数的特定结构进行求解,但这些方法在通用性和计算效率上存在局限。本文提出的解决方案关键在于采用基于图的方法,利用决策图在原始变量空间中构建强凸松弛,从而避免了辅助变量和人工边界的需求,并将其集成到空间分支切割框架中,确保全局最优解的收敛性。
链接: https://arxiv.org/abs/2505.03899
作者: Danial Davarnia,Mohammadreza Kiaghadi
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
备注:
Abstract:Optimization problems with norm-bounding constraints arise in a variety of applications, including portfolio optimization, machine learning, and feature selection. A common approach to these problems involves relaxing the norm constraint via Lagrangian relaxation, transforming it into a regularization term in the objective function. A particularly challenging class includes the zero-norm function, which promotes sparsity in statistical parameter estimation. Most existing exact methods for solving these problems introduce binary variables and artificial bounds to reformulate them as higher-dimensional mixed-integer programs, solvable by standard solvers. Other exact approaches exploit specific structural properties of the objective, making them difficult to generalize across different problem types. Alternative methods employ nonconvex penalties with favorable statistical characteristics, but these are typically addressed using heuristic or local optimization techniques due to their structural complexity. In this paper, we propose a novel graph-based method to globally solve optimization problems involving generalized norm-bounding constraints. Our approach encompasses standard \ell_p-norms for p \in [0, \infty) and nonconvex penalties such as SCAD and MCP. We leverage decision diagrams to construct strong convex relaxations directly in the original variable space, eliminating the need for auxiliary variables or artificial bounds. Integrated into a spatial branch-and-cut framework, our method guarantees convergence to the global optimum. We demonstrate its effectiveness through preliminary computational experiments on benchmark sparse linear regression problems involving complex nonconvex penalties, which are not tractable using existing global optimization techniques.
zh
[AI-94] GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype
【速读】:该论文旨在解决现有方法在构建基因调控网络(GRN)时未能充分利用基因相关信息、仅依赖简单评估指标导致网络粗粒度以及忽略生物类型(biotype)功能差异的问题,从而限制了潜在基因互作的捕捉能力。其解决方案的关键在于利用预训练的大语言模型和DNA序列模型分别从基因描述和DNA序列数据中提取特征,作为基因表示的初始化,并首次引入基因生物类型信息,在异构图神经网络(HGNN)中模拟不同生物类型基因在调控细胞过程中的不同作用,同时通过图结构学习(GSL)动态优化GRN。
链接: https://arxiv.org/abs/2505.03853
作者: Changxi Chi,Jun Xia,Jingbo Zhou,Jiabei Cheng,Chang Yu,Stan Z. Li
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:
Abstract:Predicting genetic perturbations enables the identification of potentially crucial genes prior to wet-lab experiments, significantly improving overall experimental efficiency. Since genes are the foundation of cellular life, building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations. However, current methods fail to fully leverage gene-related information, and solely rely on simple evaluation metrics to construct coarse-grained GRN. More importantly, they ignore functional differences between biotypes, limiting the ability to capture potential gene interactions. In this work, we leverage pre-trained large language model and DNA sequence model to extract features from gene descriptions and DNA sequence data, respectively, which serve as the initialization for gene representations. Additionally, we introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes, while capturing implicit gene relationships through graph structure learning (GSL). We propose GRAPE, a heterogeneous graph neural network (HGNN) that leverages gene representations initialized with features from descriptions and sequences, models the distinct roles of genes with different biotypes, and dynamically refines the GRN through GSL. The results on publicly available datasets show that our method achieves state-of-the-art performance.
zh
[AI-95] Deep Reinforcement Learning for Investor-Specific Portfolio Optimization: A Volatility-Guided Asset Selection Approach
【速读】:该论文试图解决在动态市场条件下,如何通过平衡风险与收益来实现资金的动态配置问题,特别是在资产预选阶段考虑投资者偏好以优化投资策略。解决方案的关键在于提出一种基于波动率引导的深度强化学习(DRL)投资组合优化框架,该框架利用广义自回归条件异方差(GARCH)模型对股票进行波动率预测并分类,随后通过DRL代理与历史市场数据交互学习最优投资策略,从而生成符合投资者风险偏好的动态投资组合。
链接: https://arxiv.org/abs/2505.03760
作者: Arishi Orra,Aryan Bhambu,Himanshu Choudhary,Manoj Thakur,Selvaraju Natarajan
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:
Abstract:Portfolio optimization requires dynamic allocation of funds by balancing the risk and return tradeoff under dynamic market conditions. With the recent advancements in AI, Deep Reinforcement Learning (DRL) has gained prominence in providing adaptive and scalable strategies for portfolio optimization. However, the success of these strategies depends not only on their ability to adapt to market dynamics but also on the careful pre-selection of assets that influence overall portfolio performance. Incorporating the investor’s preference in pre-selecting assets for a portfolio is essential in refining their investment strategies. This study proposes a volatility-guided DRL-based portfolio optimization framework that dynamically constructs portfolios based on investors’ risk profiles. The Generalized Autoregressive Conditional Heteroscedasticity (GARCH) model is utilized for volatility forecasting of stocks and categorizes them based on their volatility as aggressive, moderate, and conservative. The DRL agent is then employed to learn an optimal investment policy by interacting with the historical market data. The efficacy of the proposed methodology is established using stocks from the Dow 30 index. The proposed investor-specific DRL-based portfolios outperformed the baseline strategies by generating consistent risk-adjusted returns.
zh
[AI-96] he Evolution of Rough Sets 1970s-1981
【速读】:该论文试图回顾Zdzisław Pawlak及其合作者在1970年代至1981年间的研究与发表成果,重点分析其研究灵感的来源,并概述1981年后与粗糙集(rough sets)和信息系统的相关发展。解决方案的关键在于通过对早期文献的梳理,揭示粗糙集理论的形成背景及其对信息处理方法的贡献。
链接: https://arxiv.org/abs/2505.03747
作者: Viktor Marek,Ewa Orłowska,Ivo Düntsch
机构: 未知
类目: History and Overview (math.HO); Artificial Intelligence (cs.AI)
备注:
Abstract:In this note research and publications by Zdzisław Pawlak and his collaborators from 1970s and 1981 are recalled. Focus is placed on the sources of inspiration which one can identify on the basis of those publications. Finally, developments from 1981 related to rough sets and information systems are outlined.
zh
机器学习
[LG-0] Testing Juntas Optimally with Samples
链接: https://arxiv.org/abs/2505.04604
作者: Lorenzo Beretta,Nathaniel Harms,Caleb Koch
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS)
*备注:
Abstract:We prove tight upper and lower bounds of \Theta\left(\frac{1}{\epsilon}\left(\sqrt{2^k \log\binom{n}{k}} + \log\binom{n}{k}\right)\right) on the number of samples required for distribution-free k-junta testing. This is the first tight bound for testing a natural class of Boolean functions in the distribution-free sample-based model. Our bounds also hold for the feature selection problem, showing that a junta tester must learn the set of relevant variables. For tolerant junta testing, we prove a sample lower bound of \Omega(2^{(1-o(1))k} + \log\binom{n}{k}) showing that, unlike standard testing, there is no large gap between tolerant testing and learning.
[LG-1] Complexity Lower Bounds of Adaptive Gradient Algorithms for Non-convex Stochastic Optimization under Relaxed Smoothness ICLR2025
链接: https://arxiv.org/abs/2505.04599
作者: Michael Crawshaw,Mingrui Liu
类目: Machine Learning (cs.LG)
*备注: ICLR 2025
Abstract:Recent results in non-convex stochastic optimization demonstrate the convergence of popular adaptive algorithms (e.g., AdaGrad) under the (L_0, L_1)-smoothness condition, but the rate of convergence is a higher-order polynomial in terms of problem parameters like the smoothness constants. The complexity guaranteed by such algorithms to find an \epsilon-stationary point may be significantly larger than the optimal complexity of \Theta(\Delta L \sigma^2 \epsilon^{-4}) achieved by SGD in the L-smooth setting, where \Delta is the initial optimality gap and \sigma^2 is the variance of the stochastic gradient. However, it is currently not known whether these higher-order dependencies can be tightened. To answer this question, we investigate complexity lower bounds for several adaptive optimization algorithms in the (L_0, L_1)-smooth setting, with a focus on the dependence in terms of the problem parameters \Delta, L_0, L_1. We provide complexity bounds for three variations of AdaGrad, which show at least a quadratic dependence on the problem parameters \Delta, L_0, L_1. Notably, we show that the decorrelated variant of AdaGrad-Norm requires at least \Omega(\Delta^2 L_1^2 \sigma^2 \epsilon^{-4}) stochastic gradient queries to find an \epsilon-stationary point. We also provide a lower bound for SGD with a broad class of adaptive stepsizes. Our results show that, for certain adaptive algorithms, the (L_0, L_1)-smooth setting is fundamentally more difficult than the standard smooth setting, in terms of the initial optimality gap and the smoothness constants.
[LG-2] Modeling Personalized Difficulty of Rehabilitation Exercises Using Causal Trees
链接: https://arxiv.org/abs/2505.04583
作者: Nathaniel Dennler,Zhonghao Shi,Uksang Yoo,Stefanos Nikolaidis,Maja Matarić
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted to IEEE/RAS-EMBS International Conference on Rehabilitation Robotics (ICORR 2025)
Abstract:Rehabilitation robots are often used in game-like interactions for rehabilitation to increase a person’s motivation to complete rehabilitation exercises. By adjusting exercise difficulty for a specific user throughout the exercise interaction, robots can maximize both the user’s rehabilitation outcomes and their motivation throughout the exercise. Previous approaches have assumed exercises have generic difficulty values that apply to all users equally; however, we identified that stroke survivors have varied and unique perceptions of exercise difficulty. For example, some stroke survivors found reaching vertically more difficult than reaching farther but lower while others found reaching farther more challenging than reaching vertically. In this paper, we formulate a causal tree-based method to calculate exercise difficulty based on the user’s performance. We find that this approach accurately models exercise difficulty and provides a readily interpretable model of why a given exercise is difficult, for both users and caretakers.
[LG-3] Implicitly Aligning Humans and Autonomous Agents through Shared Task Abstractions IJCAI2025
链接: https://arxiv.org/abs/2505.04579
作者: Stéphane Aroca-Ouellette,Miguel Aroca-Ouellette,Katharina von der Wense,Alessandro Roncone
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注: 9 pages (7 paper + 2 references). To be published in IJCAI 2025
Abstract:In collaborative tasks, autonomous agents fall short of humans in their capability to quickly adapt to new and unfamiliar teammates. We posit that a limiting factor for zero-shot coordination is the lack of shared task abstractions, a mechanism humans rely on to implicitly align with teammates. To address this gap, we introduce HA^2: Hierarchical Ad Hoc Agents, a framework leveraging hierarchical reinforcement learning to mimic the structured approach humans use in collaboration. We evaluate HA^2 in the Overcooked environment, demonstrating statistically significant improvement over existing baselines when paired with both unseen agents and humans, providing better resilience to environmental shifts, and outperforming all state-of-the-art methods.
[LG-4] Multitask LSTM for Arboviral Outbreak Prediction Using Public Health Data
链接: https://arxiv.org/abs/2505.04566
作者: Lucas R. C. Farias,Talita P. Silva,Pedro H. M. Araujo
类目: Machine Learning (cs.LG)
*备注: 6 pages, 4 figures
Abstract:This paper presents a multitask learning approach based on long short-term memory (LSTM) networks for the joint prediction of arboviral outbreaks and case counts of dengue, chikungunya, and Zika in Recife, Brazil. Leveraging historical public health data from DataSUS (2017-2023), the proposed model concurrently performs binary classification (outbreak detection) and regression (case forecasting) tasks. A sliding window strategy was adopted to construct temporal features using varying input lengths (60, 90, and 120 days), with hyperparameter optimization carried out using Keras Tuner. Model evaluation used time series cross-validation for robustness and a held-out test set from 2023 for generalization assessment. The results show that longer windows improve dengue regression accuracy, while classification performance peaks at intermediate windows, suggesting an optimal trade-off between sequence length and generalization. The multitask architecture delivers competitive performance across diseases and tasks, demonstrating the feasibility and advantages of unified modeling strategies for scalable epidemic forecasting in data-limited public health scenarios.
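论文的多任务结构可以概括为"共享 LSTM 主干 + 分类与回归双头"。下面给出一个 Keras 函数式 API 的极简示意;窗口长度、层宽与损失权重均为示例假设,并非论文经 Keras Tuner 调优后的配置:

```python
# 极简示意:共享 LSTM 主干 + 双头输出(疫情二分类 + 病例数回归)
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

window, n_features = 60, 4                     # 60 天滑动窗口,4 个公共卫生特征(假设)
inputs = keras.Input(shape=(window, n_features))
h = layers.LSTM(64)(inputs)
h = layers.Dense(32, activation="relu")(h)

outbreak = layers.Dense(1, activation="sigmoid", name="outbreak")(h)  # 分类头
cases = layers.Dense(1, activation="relu", name="cases")(h)           # 回归头(非负)

model = keras.Model(inputs, [outbreak, cases])
model.compile(
    optimizer="adam",
    loss={"outbreak": "binary_crossentropy", "cases": "mse"},
    loss_weights={"outbreak": 1.0, "cases": 0.5},
)

# 随机数据仅用于演示训练接口
X = np.random.rand(128, window, n_features)
y_cls = np.random.randint(0, 2, (128, 1))
y_reg = np.random.rand(128, 1) * 100
model.fit(X, {"outbreak": y_cls, "cases": y_reg}, epochs=1, verbose=0)
```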
[LG-5] ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via α-β-Divergence ICML2025
链接: https://arxiv.org/abs/2505.04560
作者: Guanghui Wang,Zhiyong Yang,Zitai Wang,Shi Wang,Qianqian Xu,Qingming Huang
类目: Machine Learning (cs.LG)
*备注: ICML 2025 Spotlight
Abstract:Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model by minimizing the divergence between their output distributions, typically using forward Kullback-Leibler divergence (FKLD) or reverse KLD (RKLD). It has become an effective training paradigm due to the broader supervision information provided by the teacher distribution compared to one-hot labels. We identify that the core challenge in KD lies in balancing two mode-concentration effects: the Hardness-Concentration effect, which refers to focusing on modes with large errors, and the Confidence-Concentration effect, which refers to focusing on modes with high student confidence. Through an analysis of how probabilities are reassigned during gradient updates, we observe that these two effects are entangled in FKLD and RKLD, but in extreme forms. Specifically, both are too weak in FKLD, causing the student to fail to concentrate on the target class. In contrast, both are too strong in RKLD, causing the student to overly emphasize the target class while ignoring the broader distributional information from the teacher. To address this imbalance, we propose ABKD, a generic framework with \alpha-\beta-divergence. Our theoretical results show that ABKD offers a smooth interpolation between FKLD and RKLD, achieving an effective trade-off between these effects. Extensive experiments on 17 language/vision datasets with 12 teacher-student settings confirm its efficacy. The code is available at this https URL.
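ABKD 的核心是用 \alpha-\beta-散度在 FKLD 与 RKLD 之间插值。下面按通用的 AB 散度形式(Cichocki 等人)给出一个可求导的 PyTorch 损失示意;它与论文 ABKD 的精确参数化细节可能不同,且特例(如 α→1, β→0 退化为前向 KL)需单独处理:

```python
# 极简示意:通用 alpha-beta 散度作为蒸馏损失(要求 α, β, α+β ≠ 0)
import torch

def ab_divergence(p, q, alpha=0.5, beta=0.5, eps=1e-8):
    # p: 教师分布, q: 学生分布, 形状 (batch, num_classes)
    p, q = p.clamp_min(eps), q.clamp_min(eps)
    term = (p ** alpha) * (q ** beta)
    norm = (alpha * p ** (alpha + beta) + beta * q ** (alpha + beta)) / (alpha + beta)
    return (-(term - norm).sum(-1) / (alpha * beta)).mean()

teacher_logits = torch.randn(16, 10)
student_logits = torch.randn(16, 10, requires_grad=True)
loss = ab_divergence(teacher_logits.softmax(-1), student_logits.softmax(-1))
loss.backward()          # 可直接作为蒸馏损失参与训练
print(float(loss))
```

以 α=β=0.5 为例,该散度化为 4Σ((p+q)/2 − √(pq)),由均值不等式非负;调节 (α, β) 即可在"关注大误差模式"与"关注学生高置信模式"之间取折中。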
[LG-6] Communication-Efficient Federated Fine-Tuning of Language Models via Dynamic Update Schedules
链接: https://arxiv.org/abs/2505.04535
作者: Michail Theologitis,Vasilis Samoladas,Antonios Deligiannakis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning (FL) makes it possible to train models on data that would otherwise remain untapped and inaccessible. Simultaneously, pre-trained language models (LMs) have emerged as indispensable tools in modern workflows. These models exhibit extraordinary capabilities and are easily adapted to downstream tasks. This opens one of the most exciting frontiers in FL: fine-tuning LMs. However, a persistent challenge in FL is the frequent, rigid communication of parameters, a problem which is magnified by the sheer size of these modern models. Currently, the FedOpt family of algorithms is the prevailing approach in FL, though it relies on fixed, heuristic intervals for model synchronization. Recently, the FDA algorithm introduced a dynamic alternative by monitoring training progress, but it came with its own drawbacks; namely, a hard-to-tune threshold parameter and a rigid synchronization scheme. In this work, we introduce the FDA-Opt family of algorithms – a unified generalization that extends the principles behind both FDA and FedOpt, while resolving their core limitations. We evaluate our approach on fine-tuning LMs across a range of downstream NLP tasks, and demonstrate that it consistently outperforms FedOpt – even when FDA-Opt operates under hyper-parameter settings originally optimized for its competitors. In other words, we show that FDA-Opt is a practical, drop-in replacement for FedOpt in modern FL libraries and systems: it requires no additional configuration and delivers superior performance out of the box.
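FDA 一系的做法是"监控训练进展、动态决定何时同步"。下面是一个思想层面的联邦循环示意:以各客户端到上次同步点的平均参数漂移作为统计量,超过阈值才通信;该判据与阈值均为示例假设,并非 FDA-Opt 的精确规则:

```python
# 思想示意:基于漂移监控的动态同步联邦循环(非 FDA-Opt 的精确判据)
import copy
import torch
import torch.nn as nn

def drift(models, ref):
    # 各客户端参数到上次同步点的平均平方距离(简化的漂移统计量)
    return sum(
        sum((p - q).pow(2).sum() for p, q in zip(m.parameters(), ref.parameters()))
        for m in models
    ) / len(models)

global_model = nn.Linear(10, 2)
clients = [copy.deepcopy(global_model) for _ in range(4)]
opts = [torch.optim.SGD(c.parameters(), lr=0.1) for c in clients]
THRESHOLD, syncs = 0.5, 0

for step in range(100):
    for c, opt in zip(clients, opts):       # 本地一步训练(随机数据代替真实本地数据)
        x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
        opt.zero_grad()
        nn.functional.cross_entropy(c(x), y).backward()
        opt.step()
    if drift(clients, global_model) > THRESHOLD:   # 漂移超阈值才通信与聚合
        with torch.no_grad():
            for gp, *cps in zip(global_model.parameters(), *(c.parameters() for c in clients)):
                gp.copy_(torch.stack(cps).mean(0))
        for c in clients:
            c.load_state_dict(global_model.state_dict())
        syncs += 1

print("rounds=100, syncs=", syncs)   # 同步次数远小于轮数,即节省通信
```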
[LG-7] Hamiltonian Normalizing Flows as kinetic PDE solvers: application to the 1D Vlasov-Poisson Equations
链接: https://arxiv.org/abs/2505.04471
作者: Vincent Souveton,Sébastien Terrana
类目: Machine Learning (cs.LG)
*备注:
Abstract:Many conservative physical systems can be described using the Hamiltonian formalism. A notable example is the Vlasov-Poisson equations, a set of partial differential equations that govern the time evolution of a phase-space density function representing collisionless particles under a self-consistent potential. These equations play a central role in both plasma physics and cosmology. Due to the complexity of the potential involved, analytical solutions are rarely available, necessitating the use of numerical methods such as Particle-In-Cell. In this work, we introduce a novel approach based on Hamiltonian-informed Normalizing Flows, specifically a variant of Fixed-Kinetic Neural Hamiltonian Flows. Our method transforms an initial Gaussian distribution in phase space into the final distribution using a sequence of invertible, volume-preserving transformations derived from Hamiltonian dynamics. The model is trained on a dataset comprising initial and final states at a fixed time T, generated via numerical simulations. After training, the model enables fast sampling of the final distribution from any given initial state. Moreover, by automatically learning an interpretable physical potential, it can generalize to intermediate states not seen during training, offering insights into the system’s evolution across time.
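固定动能的 Hamiltonian 流的基本构件是 leapfrog 积分步:它可逆且保相空间体积,取 eps → -eps 即得逆映射。下面用 PyTorch 给出一个以小网络近似势能 V(q) 的示意(步长、步数与网络结构均为示例假设):

```python
# 极简示意:以可学习势能驱动的 leapfrog 辛积分(可逆、保体积)
import torch
import torch.nn as nn

potential = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))  # 可学习势能 V(q)

def grad_V(q):
    return torch.autograd.grad(potential(q).sum(), q, create_graph=True)[0]

def leapfrog(q, p, eps=0.05, steps=10):
    p = p - 0.5 * eps * grad_V(q)        # 半步动量
    for _ in range(steps - 1):
        q = q + eps * p                  # 整步位置
        p = p - eps * grad_V(q)          # 整步动量
    q = q + eps * p
    p = p - 0.5 * eps * grad_V(q)        # 收尾半步
    return q, p

q0 = torch.randn(256, 2, requires_grad=True)   # 初始高斯样本(相空间位置)
p0 = torch.randn(256, 2, requires_grad=True)   # 固定动能假设下的动量
qT, pT = leapfrog(q0, p0)
print(qT.shape, pT.shape)   # 训练时可对 (qT, pT) 施加匹配目标分布(t=T 数据)的损失
```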
[LG-8] Towards Effectively Leveraging Execution Traces for Program Repair with Code LLMs
链接: https://arxiv.org/abs/2505.04441
作者: Mirazul Haque,Petr Babkin,Farima Farmahinifarahani,Manuela Veloso
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) show promising performance on various programming tasks, including Automatic Program Repair (APR). However, most approaches to LLM-based APR are limited to the static analysis of the programs, while disregarding their runtime behavior. Inspired by knowledge-augmented NLP, in this work, we aim to remedy this potential blind spot by augmenting standard APR prompts with program execution traces. We evaluate our approach using the GPT family of models on three popular APR datasets. Our findings suggest that simply incorporating execution traces into the prompt provides a limited performance improvement over trace-free baselines, in only 2 out of 6 tested dataset / model configurations. We further find that the effectiveness of execution traces for APR diminishes as their complexity increases. We explore several strategies for leveraging traces in prompts and demonstrate that LLM-optimized prompts help outperform trace-free prompts more consistently. Additionally, we show trace-based prompting to be superior to finetuning a smaller LLM on a small-scale dataset; and conduct probing studies reinforcing the notion that execution traces can complement the reasoning abilities of the LLMs.
[LG-9] Towards Initialization-Agnostic Clustering with Iterative Adaptive Resonance Theory IJCNN2025
链接: https://arxiv.org/abs/2505.04440
作者: Xiaozheng Qu,Zhaochuan Li,Zhuang Qi,Xiang Li,Haibei Huang,Lei Meng,Xiangxu Meng
类目: Machine Learning (cs.LG)
*备注: 2025 International Joint Conference on Neural Networks (IJCNN 2025)
Abstract:The clustering performance of Fuzzy Adaptive Resonance Theory (Fuzzy ART) is highly dependent on the preset vigilance parameter, where deviations in its value can lead to significant fluctuations in clustering results, severely limiting its practicality for non-expert users. Existing approaches generally enhance vigilance parameter robustness through adaptive mechanisms such as particle swarm optimization and fuzzy logic rules. However, they often introduce additional hyperparameters or complex frameworks that contradict the original simplicity of the algorithm. To address this, we propose Iterative Refinement Adaptive Resonance Theory (IR-ART), which integrates three key phases into a unified iterative framework: (1) Cluster Stability Detection: A dynamic stability detection module that identifies unstable clusters by analyzing the change in sample size (the number of samples in a cluster) across iterations. (2) Unstable Cluster Deletion: An evolutionary pruning module that eliminates low-quality clusters. (3) Vigilance Region Expansion: A vigilance region expansion mechanism that adaptively adjusts similarity thresholds. Independent of the specific execution of clustering, these three phases sequentially focus on analyzing the implicit knowledge within the iterative process, adjusting weights and vigilance parameters, thereby laying a foundation for the next iteration. Experimental evaluation on 15 datasets demonstrates that IR-ART improves tolerance to suboptimal vigilance parameter values while preserving the parameter simplicity of Fuzzy ART. Case studies visually confirm the algorithm’s self-optimization capability through iterative refinement, making it particularly suitable for non-expert users in resource-constrained scenarios.
[LG-10] Localized Diffusion Models for High Dimensional Distributions Generation
链接: https://arxiv.org/abs/2505.04417
作者: Georg A. Gottwald,Shuigen Liu,Youssef Marzouk,Sebastian Reich,Xin T. Tong
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:
Abstract:Diffusion models are the state-of-the-art tools for various generative tasks. However, estimating high-dimensional score functions makes them potentially suffer from the curse of dimensionality (CoD). This underscores the importance of better understanding and exploiting low-dimensional structure in the target distribution. In this work, we consider locality structure, which describes sparse dependencies between model components. Under locality structure, the score function is effectively low-dimensional, so that it can be estimated by a localized neural network with significantly reduced sample complexity. This motivates the localized diffusion model, where a localized score matching loss is used to train the score function within a localized hypothesis space. We prove that such localization enables diffusion models to circumvent CoD, at the price of additional localization error. Under realistic sample size scaling, we show both theoretically and numerically that a moderate localization radius can balance the statistical and localization error, leading to a better overall performance. The localized structure also facilitates parallel training of diffusion models, making it potentially more efficient for large-scale applications.
[LG-11] Latent Manifold Reconstruction and Representation with Topological and Geometrical Regularization
链接: https://arxiv.org/abs/2505.04412
作者: Ren Wang,Pengcheng Zhou
类目: Machine Learning (cs.LG)
*备注: 25 pages, 11 figures, 4 tables
Abstract:Manifold learning aims to discover and represent low-dimensional structures underlying high-dimensional data while preserving critical topological and geometric properties. Existing methods often fail to capture local details with global topological integrity from noisy data or construct a balanced dimensionality reduction, resulting in distorted or fractured embeddings. We present an AutoEncoder-based method that integrates a manifold reconstruction layer, which uncovers latent manifold structures from noisy point clouds, and further provides regularizations on topological and geometric properties during dimensionality reduction, whereas the two components promote each other during training. Experiments on point cloud datasets demonstrate that our method outperforms baselines like t-SNE, UMAP, and Topological AutoEncoders in discovering manifold structures from noisy data and preserving them through dimensionality reduction, as validated by visualization and quantitative metrics. This work demonstrates the significance of combining manifold reconstruction with manifold learning to achieve reliable representation of the latent manifold, particularly when dealing with noisy real-world data. Code repository: this https URL.
[LG-12] Supporting renewable energy planning and operation with data-driven high-resolution ensemble weather forecast
链接: https://arxiv.org/abs/2505.04396
作者: Jingnan Wang,Jie Chao,Shangshang Yang,Congyi Nai,Kaijun Ren,Kefeng Deng,Xi Chen,Yaxin Liu,Hanqiuzi Wen,Ziniu Xiao,Lifeng Zhang,Xiaodong Wang,Jiping Guan,Baoxiang Pan
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
Abstract:The planning and operation of renewable energy, especially wind power, depend crucially on accurate, timely, and high-resolution weather information. Coarse-grid global numerical weather forecasts are typically downscaled to meet these requirements, introducing challenges of scale inconsistency, process representation error, computation cost, and entanglement of distinct uncertainty sources from chaoticity, model bias, and large-scale forcing. We address these challenges by learning the climatological distribution of a target wind farm using its high-resolution numerical weather simulations. An optimal combination of this learned high-resolution climatological prior with coarse-grid large scale forecasts yields highly accurate, fine-grained, full-variable, large ensemble of weather pattern forecasts. Using observed meteorological records and wind turbine power outputs as references, the proposed methodology verifies advantageously compared to existing numerical/statistical forecasting-downscaling pipelines, regarding either deterministic/probabilistic skills or economic gains. Moreover, a 100-member, 10-day forecast with spatial resolution of 1 km and output frequency of 15 min takes 1 hour on a moderate-end GPU, in contrast to \mathcal{O}(10^3) CPU hours for conventional numerical simulation. By drastically reducing computational costs while maintaining accuracy, our method paves the way for more efficient and reliable renewable energy planning and operation.
[LG-13] Clust-Splitter - an Efficient Nonsmooth Optimization-Based Algorithm for Clustering Large Datasets
链接: https://arxiv.org/abs/2505.04389
作者: Jenni Lampainen,Kaisa Joki,Napsu Karmitsa,Marko M. Mäkelä
类目: Machine Learning (cs.LG)
*备注: 36 pages, 23 figures
Abstract:Clustering is a fundamental task in data mining and machine learning, particularly for analyzing large-scale data. In this paper, we introduce Clust-Splitter, an efficient algorithm based on nonsmooth optimization, designed to solve the minimum sum-of-squares clustering problem in very large datasets. The clustering task is approached through a sequence of three nonsmooth optimization problems: two auxiliary problems used to generate suitable starting points, followed by a main clustering formulation. To solve these problems effectively, the limited memory bundle method is combined with an incremental approach to develop the Clust-Splitter algorithm. We evaluate Clust-Splitter on real-world datasets characterized by both a large number of attributes and a large number of data points and compare its performance with several state-of-the-art large-scale clustering algorithms. Experimental results demonstrate the efficiency of the proposed method for clustering very large datasets, as well as the high quality of its solutions, which are on par with those of the best existing methods.
[LG-14] Extending a Quantum Reinforcement Learning Exploration Policy with Flags to Connect Four
链接: https://arxiv.org/abs/2505.04371
作者: Filipe Santos(1),João Paulo Fernandes(2),Luís Macedo(1) ((1) CISUC, DEI, University of Coimbra, (2) LIACC, New York University Abu Dhabi)
类目: Machine Learning (cs.LG)
*备注: 8 pages, 3 figures, to be submitted to a journal
Abstract:Action selection based on flags is a Reinforcement Learning (RL) exploration policy that improves the exploration of the state space through the use of flags, which can identify the most promising actions to take in each state. The quantum counterpart of this exploration policy further improves upon this by taking advantage of a quadratic speedup for sampling flagged actions. This approach has already been successfully employed for the game of Checkers. In this work, we describe the application of this method to the context of Connect Four, in order to study its performance in a different setting, which can lead to a better generalization of the technique. We also kept track of a metric that wasn’t taken into account in previous work: the average number of iterations to obtain a flagged action. Since going second is a significant disadvantage in Connect Four, we also intended to explore how this more complex scenario would impact the performance of our approach. The experiments involved training and testing classical and quantum RL agents that played either going first or going second against a Randomized Negamax opponent. The results showed that both flagged exploration policies were clearly superior to a simple epsilon-greedy policy. Furthermore, the quantum agents did in fact sample flagged actions in fewer iterations. Despite obtaining flagged actions more consistently, the win rates between the classical and quantum versions of the approach were identical, which could be due to the simplicity of the training scenario chosen.
[LG-15] Deep Learning Innovations for Energy Efficiency: Advances in Non-Intrusive Load Monitoring and EV Charging Optimization for a Sustainable Grid
链接: https://arxiv.org/abs/2505.04367
作者: Stavros Sykiotis
类目: Machine Learning (cs.LG)
*备注: PhD thesis
Abstract:The global energy landscape is undergoing a profound transformation, often referred to as the energy transition, driven by the urgent need to mitigate climate change, reduce greenhouse gas emissions, and ensure sustainable energy supplies. However, the undoubted complexity of new investments in renewables, as well as the phase-out of high CO2-emission energy sources, hampers the pace of the energy transition and raises doubts as to whether new renewable energy sources are capable of solely meeting the climate target goals. This highlights the need to investigate alternative pathways to accelerate the energy transition, by identifying human activity domains with higher/excessive energy demands. Two notable examples where there is room for improvement, in the sense of reducing energy consumption and consequently CO2 emissions, are residential energy consumption and road transport. This dissertation investigates the development of novel Deep Learning techniques to create tools that address limitations in these two key energy domains. Reduction of residential energy consumption can be achieved by empowering end-users with the use of Non-Intrusive Load Monitoring, whereas optimization of EV charging with Deep Reinforcement Learning can tackle road transport decarbonization.
[LG-16] Topology-Driven Clustering: Enhancing Performance with Betti Number Filtration
链接: https://arxiv.org/abs/2505.04346
作者: Arghya Pratihar,Kushal Bose,Swagatam Das
类目: Machine Learning (cs.LG)
*备注:
Abstract:Clustering aims to form groups of similar data points in an unsupervised regime. Yet, clustering complex datasets containing critically intertwined shapes poses significant challenges. The prevailing clustering algorithms widely depend on evaluating similarity measures based on Euclidean metrics. Exploring topological characteristics to cluster complex datasets therefore presents a promising alternative. The topological clustering algorithms predominantly perceive the point set through the lens of Simplicial complexes and Persistent homology. Despite these approaches, the existing topological clustering algorithms cannot fully exploit topological structures and show inconsistent performance on some highly complicated datasets. This work aims to mitigate these limitations by identifying topologically similar neighbors through the Vietoris-Rips complex and Betti number filtration. In addition, we introduce the concept of the Betti sequences to capture flexibly essential features from the topological structures. Our proposed algorithm is adept at clustering complex, intertwined shapes contained in the datasets. We carried out experiments on several synthetic and real-world datasets. Our algorithm demonstrated commendable performances across the datasets compared to some of the well-known topology-based clustering algorithms.
[LG-17] Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning
链接: https://arxiv.org/abs/2505.04339
作者: Hao Peng,Xiang Huang,Shuo Sun,Ruitong Zhang,Philip S. Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:DBSCAN, a well-known density-based clustering algorithm, has gained widespread popularity and usage due to its effectiveness in identifying clusters of arbitrary shapes and handling noisy data. However, it encounters challenges in producing satisfactory cluster results when confronted with datasets of varying density scales, a common scenario in real-world applications. In this paper, we propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN. First, we model the initial dataset as a two-level encoding tree and categorize the data vertices into distinct density partitions according to the information uncertainty determined in the encoding tree. Each partition is then assigned to an agent to find the best clustering parameters without manual assistance. The allocation is density-adaptive, enabling AR-DBSCAN to effectively handle diverse density distributions within the dataset by utilizing distinct agents for different partitions. Second, a multi-agent deep reinforcement learning guided automatic parameter searching process is designed. The process of adjusting the parameter search direction by perceiving the clustering environment is modeled as a Markov decision process. Using a policy network trained with weakly supervised rewards, each agent adaptively learns the optimal clustering parameters by interacting with the clusters. Third, a recursive search mechanism adaptable to the data’s scale is presented, enabling efficient and controlled exploration of large parameter spaces. Extensive experiments are conducted on nine artificial datasets and a real-world dataset. The results of offline and online tasks show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but also is capable of robustly finding dominant parameters.
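AR-DBSCAN 的参数搜索由多代理强化学习完成;作为可运行的简化替身,下面用 sklearn 在 eps/min_samples 网格上以轮廓系数打分,示意"按数据自动选 DBSCAN 参数"这一问题本身(网格范围与评分准则均为示例假设,并非论文方法):

```python
# 简化替身:网格搜索 DBSCAN 参数,用轮廓系数代替论文的 RL 奖励
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# 不同 cluster_std 模拟"密度尺度不一"的数据
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=[0.3, 1.0, 2.0], random_state=0)

best = None
for eps in np.linspace(0.2, 2.0, 10):
    for min_samples in (3, 5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters < 2:
            continue
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, eps, min_samples)

print("best silhouette=%.3f eps=%.2f min_samples=%d" % best)
```

论文的做法相当于把上面的网格遍历换成按密度分区、由各自的 RL 代理感知聚类环境后自适应地调整搜索方向。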
[LG-18] Riemannian Denoising Diffusion Probabilistic Models
链接: https://arxiv.org/abs/2505.04338
作者: Zichen Liu,Wei Zhang,Christof Schütte,Tiejun Li
类目: Machine Learning (cs.LG)
*备注: 28 pages
Abstract:We propose Riemannian Denoising Diffusion Probabilistic Models (RDDPMs) for learning distributions on submanifolds of Euclidean space that are level sets of functions, including most of the manifolds relevant to applications. Existing methods for generative modeling on manifolds rely on substantial geometric information such as geodesic curves or eigenfunctions of the Laplace-Beltrami operator and, as a result, they are limited to manifolds where such information is available. In contrast, our method, built on a projection scheme, can be applied to more general manifolds, as it only requires being able to evaluate the value and the first order derivatives of the function that defines the submanifold. We provide a theoretical analysis of our method in the continuous-time limit, which elucidates the connection between our RDDPMs and score-based generative models on manifolds. The capability of our method is demonstrated on datasets from previous studies and on new datasets sampled from two high-dimensional manifolds, i.e. \mathrm{SO}(10) and the configuration space of the molecular system alanine dipeptide with a fixed dihedral angle.
[LG-19] Hyperbolic Fuzzy C-Means with Adaptive Weight-based Filtering for Clustering in Non-Euclidean Spaces
链接: https://arxiv.org/abs/2505.04335
作者: Swagato Das,Arghya Pratihar,Swagatam Das
类目: Machine Learning (cs.LG)
*备注:
Abstract:Clustering algorithms play a pivotal role in unsupervised learning by identifying and grouping similar objects based on shared characteristics. While traditional clustering techniques, such as hard and fuzzy center-based clustering, have been widely used, they struggle with complex, high-dimensional, and non-Euclidean datasets. In particular, the Fuzzy C-Means (FCM) algorithm, despite its efficiency and popularity, exhibits notable limitations in non-Euclidean spaces. Euclidean spaces assume linear separability and uniform distance scaling, limiting their effectiveness in capturing complex, hierarchical, or non-Euclidean structures in fuzzy clustering. To overcome these challenges, we introduce Filtration-based Hyperbolic Fuzzy C-Means (HypeFCM), a novel clustering algorithm tailored for better representation of data relationships in non-Euclidean spaces. HypeFCM integrates the principles of fuzzy clustering with hyperbolic geometry and employs a weight-based filtering mechanism to improve performance. The algorithm initializes weights using a Dirichlet distribution and iteratively refines cluster centroids and membership assignments based on a hyperbolic metric in the Poincaré Disc model. Extensive experimental evaluations demonstrate that HypeFCM significantly outperforms conventional fuzzy clustering methods in non-Euclidean settings, underscoring its robustness and effectiveness.
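A minimal sketch of the metric HypeFCM is described as using: the geodesic distance in the Poincaré disc/ball model, which replaces the Euclidean distance in the fuzzy clustering updates. The paper's Dirichlet-initialized weights and filtering mechanism are not reproduced.

```python
# Hyperbolic distance in the Poincare ball model:
# d(u, v) = arccosh(1 + 2*||u-v||^2 / ((1-||u||^2)(1-||v||^2)))
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between points strictly inside the unit ball."""
    diff = np.dot(u - v, u - v)
    denom = max((1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v)), eps)
    return np.arccosh(1.0 + 2.0 * diff / denom)

print(poincare_distance(np.array([0.1, 0.2]), np.array([-0.3, 0.4])))
```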
[LG-20] Physics-Informed DeepONets for drift-diffusion on metric graphs: simulation and parameter identification
链接: https://arxiv.org/abs/2505.04263
作者: Jan Blechschmidt,Tom-Christian Riemer,Max Winkler,Martin Stoll,Jan-F. Pietschmann
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We develop a novel physics informed deep learning approach for solving nonlinear drift-diffusion equations on metric graphs. These models represent an important model class with a large number of applications in areas ranging from transport in biological cells to the motion of human crowds. While traditional numerical schemes require a large amount of tailoring, especially in the case of model design or parameter identification problems, physics informed deep operator networks (DeepONet) have emerged as a versatile tool for the solution of partial differential equations with the particular advantage that they easily incorporate parameter identification questions. We here present an approach where we first learn three DeepONet models for representative inflow, inner and outflow edges, resp., and then subsequently couple these models for the solution of the drift-diffusion metric graph problem by relying on an edge-based domain decomposition approach. We illustrate that our framework is applicable for the accurate evaluation of graph-coupled physics models and is well suited for solving optimization or inverse problems on these coupled networks.
[LG-21] Technology prediction of a 3D model using Neural Network
链接: https://arxiv.org/abs/2505.04241
作者: Grzegorz Miebs,Rafał A. Bachorz
类目: Machine Learning (cs.LG)
*备注: 5 pages, 2 figures
Abstract:Accurate estimation of production times is critical for effective manufacturing scheduling, yet traditional methods relying on expert analysis or historical data often fall short in dynamic or customized production environments. This paper introduces a data-driven approach that predicts manufacturing steps and their durations directly from a product’s 3D model. By rendering the model into multiple 2D images and leveraging a neural network inspired by the Generative Query Network, the method learns to map geometric features into time estimates for predefined production steps, enabling scalable, adaptive, and precise process planning across varied product types.
[LG-22] Cyber Security Data Science: Machine Learning Methods and their Performance on Imbalanced Datasets
链接: https://arxiv.org/abs/2505.04204
作者: Mateo Lopez-Ledezma,Gissel Velarde
类目: Machine Learning (cs.LG)
*备注: 13 pages, 5 figures. Digital Management and Artificial Intelligence. Proceedings of the Fourth International Scientific-Practical Conference (ISPC 2024), Hybrid, October 10-11, 2024. this https URL
Abstract:Cybersecurity has become essential worldwide and at all levels, concerning individuals, institutions, and governments. A basic principle in cybersecurity is to be always alert. Therefore, automation is imperative in processes where the volume of daily operations is large. Several cybersecurity applications can be addressed as binary classification problems, including anomaly detection, fraud detection, intrusion detection, spam detection, or malware detection. We present three experiments. In the first experiment, we evaluate single classifiers including Random Forests, Light Gradient Boosting Machine, eXtreme Gradient Boosting, Logistic Regression, Decision Tree, and Gradient Boosting Decision Tree. In the second experiment, we test different sampling techniques including over-sampling, under-sampling, Synthetic Minority Over-sampling Technique, and Self-Paced Ensembling. In the last experiment, we evaluate Self-Paced Ensembling and its number of base classifiers. We found that imbalance learning techniques had positive and negative effects, as reported in related studies. Thus, these techniques should be applied with caution. Besides, we found different best performers for each dataset. Therefore, we recommend testing single classifiers and imbalance learning techniques for each new dataset and application involving imbalanced datasets as is the case in several cyber security applications.
[LG-23] Estimating Causal Effects in Networks with Cluster-Based Bandits AAAI2022
链接: https://arxiv.org/abs/2505.04200
作者: Ahmed Sayeed Faruk,Jason Sulskis,Elena Zheleva
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Presented at the AAAI 2022 Workshop on Artificial Intelligence for Behavioral Change (AI4BC)
Abstract:The gold standard for estimating causal effects is the randomized controlled trial (RCT) or A/B test, where a random group of individuals from a population of interest is given treatment and the outcome is compared to that of a random group of individuals from the same population. However, A/B testing is challenging in the presence of interference, commonly occurring in social networks, where individuals can impact each other's outcomes. Moreover, A/B testing can incur a high performance loss when one of the treatment arms performs poorly and the test continues to treat individuals with it. Therefore, it is important to design a strategy that can adapt over time and efficiently learn the total treatment effect in the network. We introduce two cluster-based multi-armed bandit (MAB) algorithms to gradually estimate the total treatment effect in a network while maximizing the expected reward by making a tradeoff between exploration and exploitation. We compare the performance of our MAB algorithms with a vanilla MAB algorithm that ignores clusters and the corresponding RCT methods on semi-synthetic data with simulated interference. The vanilla MAB algorithm shows a higher reward-action ratio at the cost of higher treatment effect error due to undesired spillover. The cluster-based MAB algorithms show higher reward-action ratios compared to their corresponding RCT methods without sacrificing much accuracy in treatment effect estimation.
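A hedged sketch of the core mechanism: an epsilon-greedy two-armed bandit that assigns treatment at the cluster level, so everyone in a cluster receives the same arm (limiting spillover). The paper's actual algorithms and interference model are more involved; the reward model here is synthetic.

```python
# Epsilon-greedy cluster-level bandit on synthetic rewards.
import numpy as np

rng = np.random.default_rng(1)
n_clusters, eps, true_effect = 40, 0.1, 0.3
counts, sums = np.zeros(2), np.zeros(2)          # per-arm pull counts / reward sums

for cluster in range(n_clusters):
    if cluster < 2:
        arm = cluster                            # pull each arm once first
    elif rng.random() < eps:
        arm = int(rng.integers(2))               # explore
    else:
        arm = int((sums / counts).argmax())      # exploit
    # All individuals in the cluster get the same arm, limiting spillover.
    outcome = rng.normal(loc=arm * true_effect, scale=1.0, size=50).mean()
    counts[arm] += 1
    sums[arm] += outcome

print("estimated treatment effect:", sums[1] / counts[1] - sums[0] / counts[0])
```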
[LG-24] A Large Language Model for Feasible and Diverse Population Synthesis
链接: https://arxiv.org/abs/2505.04196
作者: Sung Yoo Lim,Hyunsoo Yun,Prateek Bansal,Dong-Kyu Kim,Eui-Jin Kim
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 28 pages, 7 figures, 6 tables. Submitted to Transportation Research Part C: Emerging Technologies. Preprint version
Abstract:Generating a synthetic population that is both feasible and diverse is crucial for ensuring the validity of downstream activity schedule simulation in activity-based models (ABMs). While deep generative models (DGMs), such as variational autoencoders and generative adversarial networks, have been applied to this task, they often struggle to balance the inclusion of rare but plausible combinations (i.e., sampling zeros) with the exclusion of implausible ones (i.e., structural zeros). To improve feasibility while maintaining diversity, we propose a fine-tuning method for large language models (LLMs) that explicitly controls the autoregressive generation process through topological orderings derived from a Bayesian Network (BN). Experimental results show that our hybrid LLM-BN approach outperforms both traditional DGMs and proprietary LLMs (e.g., ChatGPT-4o) with few-shot learning. Specifically, our approach achieves approximately 95% feasibility, significantly higher than the ~80% observed in DGMs, while maintaining comparable diversity, making it well-suited for practical applications. Importantly, the method is based on a lightweight open-source LLM, enabling fine-tuning and inference on standard personal computing environments. This makes the approach cost-effective and scalable for large-scale applications, such as synthesizing populations in megacities, without relying on expensive infrastructure. By initiating the ABM pipeline with high-quality synthetic populations, our method improves overall simulation reliability and reduces downstream error propagation. The source code for these methods is available for research and practical application.
[LG-25] Trajectory Entropy Reinforcement Learning for Predictable and Robust Control
链接: https://arxiv.org/abs/2505.04193
作者: Bang You,Chenxu Wang,Huaping Liu
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 10 pages
Abstract:Simplicity is a critical inductive bias for designing data-driven controllers, especially when robustness is important. Despite the impressive results of deep reinforcement learning in complex control tasks, it is prone to capturing intricate and spurious correlations between observations and actions, leading to failure under slight perturbations to the environment. To tackle this problem, in this work we introduce a novel inductive bias towards simple policies in reinforcement learning. The simplicity inductive bias is introduced by minimizing the entropy of entire action trajectories, corresponding to the number of bits required to describe information in action trajectories after the agent observes state trajectories. Our reinforcement learning agent, Trajectory Entropy Reinforcement Learning, is optimized to minimize the trajectory entropy while maximizing rewards. We show that the trajectory entropy can be effectively estimated by learning a variational parameterized action prediction model, and use the prediction model to construct an information-regularized reward function. Furthermore, we construct a practical algorithm that enables the joint optimization of models, including the policy and the prediction model. Experimental evaluations on several high-dimensional locomotion tasks show that our learned policies produce more cyclical and consistent action trajectories, and achieve superior performance and robustness to noise and dynamic changes compared to the state of the art.
[LG-26] DiffPattern-Flex: Efficient Layout Pattern Generation via Discrete Diffusion
链接: https://arxiv.org/abs/2505.04173
作者: Zixiao Wang,Wenqian Zhao,Yunheng Shen,Yang Bai,Guojin Chen,Farzan Farnia,Bei Yu
类目: Machine Learning (cs.LG)
*备注: 13 pages, 13 figures. Accepted by TCAD
Abstract:Recent advancements in layout pattern generation have been dominated by deep generative models. However, relying solely on neural networks for legality guarantees raises concerns in many practical applications. In this paper, we present DiffPattern-Flex, a novel approach designed to generate reliable layout patterns efficiently. DiffPattern-Flex incorporates a new method for generating diverse topologies using a discrete diffusion model while maintaining a lossless and compute-efficient layout representation. To ensure legal pattern generation, we employ an optimization-based, white-box pattern assessment process based on specific design rules. Furthermore, fast sampling and efficient legalization technologies are employed to accelerate the generation process. Experimental results across various benchmarks demonstrate that DiffPattern-Flex significantly outperforms existing methods and excels at producing reliable layout patterns.
[LG-27] STRGCN: Capturing Asynchronous Spatio-Temporal Dependencies for Irregular Multivariate Time Series Forecasting
链接: https://arxiv.org/abs/2505.04167
作者: Yulong Wang,Xiaofeng Hu,Xiaojian Cui,Kai Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Irregular multivariate time series (IMTS) are prevalent in real-world applications across many fields, where varying sensor frequencies and asynchronous measurements pose significant modeling challenges. Existing solutions often rely on a pre-alignment strategy to normalize data, which can distort intrinsic patterns and escalate computational and memory demands. Addressing these limitations, we introduce STRGCN, a Spatio-Temporal Relational Graph Convolutional Network that avoids pre-alignment and directly captures the complex interdependencies in IMTS by representing them as a fully connected graph. Each observation is represented as a node, allowing the model to effectively handle misaligned timestamps by mapping all inter-node relationships, thus faithfully preserving the asynchronous nature of the data. Moreover, we enhance this model with a hierarchical "Sandwich" structure that strategically aggregates nodes to optimize graph embeddings, reducing computational overhead while maintaining detailed local and global context. Extensive experiments on four public datasets demonstrate that STRGCN achieves state-of-the-art accuracy, competitive memory usage and training speed.
[LG-28] Retrieval Augmented Time Series Forecasting
链接: https://arxiv.org/abs/2505.04163
作者: Sungwon Han,Seungeon Lee,Meeyoung Cha,Sercan O Arik,Jinsung Yoon
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
Abstract:Time series forecasting uses historical data to predict future trends, leveraging the relationships between past observations and available features. In this paper, we propose RAFT, a retrieval-augmented time series forecasting method to provide sufficient inductive biases and complement the model’s learning capacity. When forecasting the subsequent time frames, we directly retrieve historical data candidates from the training dataset with patterns most similar to the input, and utilize the future values of these candidates alongside the inputs to obtain predictions. This simple approach augments the model’s capacity by externally providing information about past patterns via retrieval modules. Our empirical evaluations on ten benchmark datasets show that RAFT consistently outperforms contemporary baselines with an average win ratio of 86%.
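A minimal sketch of the retrieval idea described above: find the training window most similar to the current context and reuse the values that followed it as an auxiliary forecast. RAFT combines such retrieved futures with a learned model; that part is omitted, and all names are illustrative.

```python
# Nearest-window retrieval: return the continuation of the most similar
# historical window as a forecast for the given context.
import numpy as np

def retrieve_forecast(history, context, horizon):
    L = len(context)
    best_i, best_d = 0, np.inf
    for i in range(len(history) - L - horizon):
        d = np.linalg.norm(history[i:i + L] - context)
        if d < best_d:
            best_i, best_d = i, d
    return history[best_i + L : best_i + L + horizon]

t = np.arange(500)
series = np.sin(0.1 * t) + 0.05 * np.random.randn(500)
print(retrieve_forecast(series[:450], series[450:480], horizon=10))
```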
[LG-29] Optimization of Infectious Disease Intervention Measures Based on Reinforcement Learning - Empirical analysis based on UK COVID-19 epidemic data
链接: https://arxiv.org/abs/2505.04161
作者: Baida Zhang,Yakai Chen,Huichun Li,Zhenghu Zu
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Multiagent Systems (cs.MA); Computational Physics (physics.comp-ph)
*备注:
Abstract:Globally, the outbreaks of infectious diseases have exerted an extremely profound and severe influence on health security and the economy. During the critical phases of epidemics, devising effective intervention measures poses a significant challenge to both the academic and practical arenas. There is extensive research on using reinforcement learning to optimize infectious disease intervention measures. Nevertheless, most of these efforts have been confined to differential-equation-based infectious disease models. Although a limited number of studies have incorporated reinforcement learning methodologies into individual-based infectious disease models, the models employed therein have entailed simplifications and limitations, rendering them incapable of modeling the complexity and dynamics inherent in infectious disease transmission. We establish a decision-making framework based on an individual agent-based transmission model, utilizing reinforcement learning to continuously explore and develop a strategy function. The framework’s validity is verified through both experimental and theoretical approaches. Covasim, a detailed and widely used agent-based disease transmission model, was modified to support reinforcement learning research. We conduct an exhaustive exploration of the application efficacy of multiple algorithms across diverse action spaces. Furthermore, we conduct an innovative preliminary theoretical analysis concerning the issue of “time coverage”. The results of the experiment robustly validate the effectiveness and feasibility of the methodological framework of this study. The coping strategies gleaned therefrom prove highly efficacious in suppressing the expansion of the epidemic scale and safeguarding the stability of the economic system, thereby providing crucial reference perspectives for the formulation of global public health security strategies.
[LG-30] FilterTS: Comprehensive Frequency Filtering for Multivariate Time Series Forecasting AAAI2025
链接: https://arxiv.org/abs/2505.04158
作者: Yulong Wang,Yushuo Liu,Xiaoyi Duan,Kai Wang
类目: Machine Learning (cs.LG)
*备注: Accepted to AAAI 2025
Abstract:Multivariate time series forecasting is crucial across various industries, where accurate extraction of complex periodic and trend components can significantly enhance prediction performance. However, existing models often struggle to capture these intricate patterns. To address these challenges, we propose FilterTS, a novel forecasting model that utilizes specialized filtering techniques based on the frequency domain. FilterTS introduces a Dynamic Cross-Variable Filtering Module, a key innovation that dynamically leverages other variables as filters to extract and reinforce shared variable frequency components across variables in multivariate time series. Additionally, a Static Global Filtering Module captures stable frequency components, identified throughout the entire training set. Moreover, the model is built in the frequency domain, converting time-domain convolutions into frequency-domain multiplicative operations to enhance computational efficiency. Extensive experimental results on eight real-world datasets have demonstrated that FilterTS significantly outperforms existing methods in terms of prediction accuracy and computational efficiency.
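A hedged sketch of the basic frequency-domain operation FilterTS builds on: keep only the k strongest FFT components of a series. The paper's filters are learned (dynamic cross-variable and static global modules); this fixed top-k filter is purely illustrative.

```python
# Frequency-domain filtering: zero out all but the k strongest components.
import numpy as np

def keep_top_frequencies(x, k):
    spec = np.fft.rfft(x)
    weakest = np.argsort(np.abs(spec))[:-k]      # indices of all but the k largest
    spec[weakest] = 0.0
    return np.fft.irfft(spec, n=len(x))

x = np.sin(np.linspace(0.0, 20.0 * np.pi, 512)) + 0.3 * np.random.randn(512)
print(keep_top_frequencies(x, k=3)[:5])          # denoised reconstruction
```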
[LG-31] LHT: Statistically-Driven Oblique Decision Trees for Interpretable Classification
链接: https://arxiv.org/abs/2505.04139
作者: Hongyi Li,Jun Xu,William Ward Armstrong
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce the Learning Hyperplane Tree (LHT), a novel oblique decision tree model designed for expressive and interpretable classification. LHT fundamentally distinguishes itself through a non-iterative, statistically-driven approach to constructing splitting hyperplanes. Unlike methods that rely on iterative optimization or heuristics, LHT directly computes the hyperplane parameters, which are derived from feature weights based on the differences in feature expectations between classes within each node. This deterministic mechanism enables a direct and well-defined hyperplane construction process. Predictions leverage a unique piecewise linear membership function within leaf nodes, obtained via local least-squares fitting. We formally analyze the convergence of the LHT splitting process, ensuring that each split yields meaningful, non-empty partitions. Furthermore, we establish that the time complexity for building an LHT up to depth d is O(mnd), demonstrating the practical feasibility of constructing trees with powerful oblique splits using this methodology. The explicit feature weighting at each split provides inherent interpretability. Experimental results on benchmark datasets demonstrate LHT’s competitive accuracy, positioning it as a practical, theoretically grounded, and interpretable alternative in the landscape of tree-based models. The implementation of the proposed method is available at this https URL.
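A minimal sketch of the splitting rule the abstract describes: hyperplane weights computed directly from differences of per-class feature means, with no iterative optimization. LHT's exact weighting scheme and piecewise linear membership functions are not reproduced here.

```python
# Oblique split from class-mean differences (illustrative only).
import numpy as np

def mean_difference_split(X, y):
    mu_pos, mu_neg = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu_pos - mu_neg                        # direction separating class means
    b = -0.5 * w @ (mu_pos + mu_neg)           # threshold midway between means
    return (X @ w + b) > 0                     # boolean partition of the node

X = np.vstack([np.random.randn(50, 3) + 2.0, np.random.randn(50, 3)])
y = np.array([1] * 50 + [0] * 50)
print(mean_difference_split(X, y).mean())      # fraction sent to one child
```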
[LG-32] Alpha Excel Benchmark
链接: https://arxiv.org/abs/2505.04110
作者: David Noever,Forrest McKee
类目: Machine Learning (cs.LG)
*备注:
Abstract:This study presents a novel benchmark for evaluating Large Language Models (LLMs) using challenges derived from the Financial Modeling World Cup (FMWC) Excel competitions. We introduce a methodology for converting 113 existing FMWC challenges into programmatically evaluable JSON formats and use this dataset to compare the performance of several leading LLMs. Our findings demonstrate significant variations in performance across different challenge categories, with models showing specific strengths in pattern recognition tasks but struggling with complex numerical reasoning. The benchmark provides a standardized framework for assessing LLM capabilities in realistic business-oriented tasks rather than abstract academic problems. This research contributes to the growing field of AI benchmarking by establishing proficiency in the tasks performed by the 1.5 billion people who use Microsoft Excel daily as a meaningful evaluation metric, bridging the gap between academic AI benchmarks and practical business applications.
[LG-33] Position: We need responsible application-driven (RAD) AI research
链接: https://arxiv.org/abs/2505.04104
作者: Sarah Hartman,Cheng Soon Ong,Julia Powles,Petra Kuhnert
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 11 pages, 1 figure, Accepted to Proceedings of the 41 st International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025
Abstract:This position paper argues that achieving meaningful scientific and societal advances with artificial intelligence (AI) requires a responsible, application-driven approach (RAD) to AI research. As AI is increasingly integrated into society, AI researchers must engage with the specific contexts where AI is being applied. This includes being responsive to ethical and legal considerations, technical and societal constraints, and public discourse. We present the case for RAD-AI to drive research through a three-staged approach: (1) building transdisciplinary teams and people-centred studies; (2) addressing context-specific methods, ethical commitments, assumptions, and metrics; and (3) testing and sustaining efficacy through staged testbeds and a community of practice. We present a vision for the future of application-driven AI research to unlock new value through technically feasible methods that are adaptive to the contextual needs and values of the communities they ultimately serve.
[LG-34] Reliable Disentanglement Multi-view Learning Against View Adversarial Attacks IJCAI2025
链接: https://arxiv.org/abs/2505.04046
作者: Xuyang Wang,Siyuan Duan,Qizhi Li,Guiduo Duan,Yuan Sun,Dezhong Peng
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 11 pages, 11 figures, accepted by International Joint Conference on Artificial Intelligence (IJCAI 2025)
Abstract:Recently, trustworthy multi-view learning has attracted extensive attention because evidence learning can provide reliable uncertainty estimation to enhance the credibility of multi-view predictions. Existing trusted multi-view learning methods implicitly assume that multi-view data is secure. In practice, however, in safety-sensitive applications such as autonomous driving and security monitoring, multi-view data often faces threats from adversarial perturbations, thereby deceiving or disrupting multi-view learning models. This inevitably leads to the adversarial unreliability problem (AUP) in trusted multi-view learning. To overcome this tricky problem, we propose a novel multi-view learning framework, namely Reliable Disentanglement Multi-view Learning (RDML). Specifically, we first propose evidential disentanglement learning to decompose each view into clean and adversarial parts under the guidance of the corresponding evidence, which is extracted by a pretrained evidence extractor. Then, we employ the feature recalibration module to mitigate the negative impact of adversarial perturbations and extract potential informative features from them. Finally, to further ignore the irreparable adversarial interferences, a view-level evidential attention mechanism is designed. Extensive experiments on multi-view classification tasks with adversarial attacks show that our RDML outperforms the state-of-the-art multi-view learning methods by a relatively large margin.
[LG-35] Iterative Orthogonalization Scaling Laws
链接: https://arxiv.org/abs/2505.04005
作者: Devan Selvaraj
类目: Machine Learning (cs.LG)
*备注:
Abstract:The muon optimizer has attracted much attention of late as a possible replacement for the seemingly omnipresent Adam optimizer. Recently, care has been taken to document the scaling laws of hyper-parameters under muon, such as weight decay and learning rate. However, at much larger scales the iterative orthogonalization procedure present in muon may run into an issue, as the singular values of random matrices shrink with scale. This paper shows this scaling behavior theoretically and empirically on random matrices but does not suggest what to do about it.
[LG-36] Algorithmic Accountability in Small Data: Sample-Size-Induced Bias Within Classification Metrics AISTATS2025
链接: https://arxiv.org/abs/2505.03992
作者: Jarren Briscoe,Garrett Kepler,Daryl Deford,Assefaw Gebremedhin
类目: Machine Learning (cs.LG)
*备注: AISTATS 2025
Abstract:Evaluating machine learning models is crucial not only for determining their technical accuracy but also for assessing their potential societal implications. While the potential for low-sample-size bias in algorithms is well known, we demonstrate the significance of sample-size bias induced by combinatorics in classification metrics. This revelation challenges the efficacy of these metrics in assessing bias with high resolution, especially when comparing groups of disparate sizes, which frequently arise in social applications. We provide analyses of the bias that appears in several commonly applied metrics and propose a model-agnostic assessment and correction technique. Additionally, we analyze counts of undefined cases in metric calculations, which can lead to misleading evaluations if improperly handled. This work illuminates the previously unrecognized challenge of combinatorics and probability in standard evaluation practices and thereby advances approaches for performing fair and trustworthy classification methods.
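A hedged Monte Carlo illustration of the effect the paper analyzes: even a purely random classifier's expected F1 drifts with sample size, which can confound comparisons between groups of unequal size. The paper's analytic treatment and correction technique are not reproduced.

```python
# Expected F1 of a coin-flip classifier as a function of sample size n.
import numpy as np

def expected_f1_random(n, pos_rate=0.3, trials=5000, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(trials):
        y = rng.random(n) < pos_rate           # true labels
        p = rng.random(n) < 0.5                # coin-flip predictions
        tp, denom = (y & p).sum(), p.sum() + y.sum()
        if denom > 0:                          # F1 undefined otherwise
            scores.append(2.0 * tp / denom)    # F1 = 2TP / (2TP + FP + FN)
    return float(np.mean(scores))

print({n: round(expected_f1_random(n), 3) for n in (10, 30, 100, 1000)})
```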
[LG-37] Comparing statistical and deep learning techniques for parameter estimation of continuous-time stochastic differentiable equations
链接: https://arxiv.org/abs/2505.03980
作者: Aroon Sankoh,Victor Wickerhauser
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注: 6 pages, 2 figures, 2 tables
Abstract:Stochastic differential equations such as the Ornstein-Uhlenbeck process have long been used to model real-world probabilistic events such as stock prices and temperature fluctuations. While statistical methods such as Maximum Likelihood Estimation (MLE), Kalman Filtering, the Inverse Variable Method, and others have historically been used to estimate the parameters of stochastic differential equations, the recent explosion of deep learning technology suggests that models such as a Recurrent Neural Network (RNN) could produce more precise estimators. We present a series of experiments that compare the estimation accuracy and computational expense of a statistical method (MLE) with a deep learning model (RNN) for the parameters of the Ornstein-Uhlenbeck process.
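A minimal sketch of the statistical side of the comparison: closed-form estimation of Ornstein-Uhlenbeck parameters dX = theta*(mu - X)dt + sigma*dW via an AR(1) regression of X_{t+1} on X_t, which coincides with conditional MLE for the Gaussian transitions. The RNN estimator is not reproduced.

```python
# OU parameter recovery via AR(1) regression on a simulated path.
import numpy as np

def estimate_ou(x, dt):
    x0, x1 = x[:-1], x[1:]
    b, a = np.polyfit(x0, x1, 1)               # x1 ~ b*x0 + a
    theta = -np.log(b) / dt                    # b = exp(-theta*dt)
    mu = a / (1.0 - b)                         # a = mu*(1 - b)
    resid = x1 - (b * x0 + a)
    sigma = np.sqrt(resid.var() * 2.0 * theta / (1.0 - b * b))
    return theta, mu, sigma

rng = np.random.default_rng(0)
theta, mu, sigma, dt, n = 2.0, 1.0, 0.5, 0.01, 20000
x = np.empty(n); x[0] = mu
for t in range(n - 1):                          # Euler-Maruyama simulation
    x[t + 1] = x[t] + theta * (mu - x[t]) * dt + sigma * np.sqrt(dt) * rng.normal()
print(estimate_ou(x, dt))                       # close to (2.0, 1.0, 0.5)
```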
[LG-38] Call for Action: towards the next generation of symbolic regression benchmark GECCO’25
链接: https://arxiv.org/abs/2505.03977
作者: Guilherme S. Imai Aldeia,Hengzhe Zhang,Geoffrey Bomarito,Miles Cranmer,Alcides Fonseca,Bogdan Burlacu,William G. La Cava,Fabrício Olivetti de França
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 10 pages, 4 figures, 3 tables, accepted in Genetic and Evolutionary Computation Conference (GECCO '25) Symbolic Regression Workshop
Abstract:Symbolic Regression (SR) is a powerful technique for discovering interpretable mathematical expressions. However, benchmarking SR methods remains challenging due to the diversity of algorithms, datasets, and evaluation criteria. In this work, we present an updated version of SRBench. Our benchmark expands the previous one by nearly doubling the number of evaluated methods, refining evaluation metrics, and using improved visualizations of the results to understand the performances. Additionally, we analyze trade-offs between model complexity, accuracy, and energy consumption. Our results show that no single algorithm dominates across all datasets. We propose a call for action from SR community in maintaining and evolving SRBench as a living benchmark that reflects the state-of-the-art in symbolic regression, by standardizing hyperparameter tuning, execution constraints, and computational resource allocation. We also propose deprecation criteria to maintain the benchmark’s relevance and discuss best practices for improving SR algorithms, such as adaptive hyperparameter tuning and energy-efficient implementations.
[LG-39] Hierarchical Forecast Reconciliation on Networks: A Network Flow Optimization Formulation
链接: https://arxiv.org/abs/2505.03955
作者: Charupriya Sharma,Iñaki Estella Aguerri,Daniel Guimarans
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hierarchical forecasting with reconciliation requires forecasting values of a hierarchy (e.g. customer demand in a state and district), such that forecast values are linked (e.g. district forecasts should add up to the state forecast). Basic forecasting provides no guarantee for these desired structural relationships. Reconciliation addresses this problem, which is crucial for organizations requiring coherent predictions across multiple aggregation levels. Current methods like minimum trace (MinT) are mostly limited to tree structures and are computationally expensive. We introduce FlowRec, which reformulates hierarchical forecast reconciliation as a network flow optimization, enabling forecasting on generalized network structures. While reconciliation under the \ell_0 norm is NP-hard, we prove polynomial-time solvability for all \ell_p norms with p > 0 and, more generally, for any strictly convex and continuously differentiable loss function. For sparse networks, FlowRec achieves O(n^2 \log n) complexity, significantly improving upon MinT’s O(n^3). Furthermore, we prove that FlowRec extends MinT to handle general networks, replacing MinT’s error-covariance estimation step with direct network structural information. A key novelty of our approach is its handling of dynamic scenarios: while traditional methods recompute both base forecasts and reconciliation, FlowRec provides efficient localised updates with optimality guarantees. Monotonicity ensures that when forecasts improve incrementally, the initial reconciliation remains optimal. We also establish efficient, error-bounded approximate reconciliation, enabling fast updates in time-critical applications. Experiments on both simulated and real benchmarks demonstrate that FlowRec improves accuracy, reduces runtime by 3-40x, and reduces memory usage by 5-7x. These results establish FlowRec as a powerful tool for large-scale hierarchical forecasting applications.
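A hedged sketch of the reconciliation problem itself, not FlowRec's flow-based solver: project incoherent base forecasts onto the coherent subspace defined by a summing matrix S via ordinary least squares, the textbook baseline. The hierarchy here (total = three districts) is invented for illustration.

```python
# OLS reconciliation: y_rec = S (S^T S)^{-1} S^T y_base is coherent by construction.
import numpy as np

S = np.array([[1, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)           # summing matrix
y_base = np.array([100.0, 40.0, 35.0, 30.0])     # incoherent: 40+35+30 != 100

b = np.linalg.solve(S.T @ S, S.T @ y_base)       # bottom-level estimates
y_rec = S @ b                                    # coherent reconciled forecasts
print(y_rec, np.isclose(y_rec[1:].sum(), y_rec[0]))
```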
[LG-40] Sufficient Decision Proxies for Decision-Focused Learning
链接: https://arxiv.org/abs/2505.03953
作者: Noah Schutte,Grigorii Veviurko,Krzysztof Postek,Neil Yorke-Smith
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 16 pages, 4 figures,
Abstract:When solving optimization problems under uncertainty with contextual data, utilizing machine learning to predict the uncertain parameters is a popular and effective approach. Decision-focused learning (DFL) aims at learning a predictive model such that decision quality, instead of prediction accuracy, is maximized. Common practice here is to predict a single value for each uncertain parameter, implicitly assuming that there exists a (single-scenario) deterministic problem approximation (proxy) that is sufficient to obtain an optimal decision. Other work assumes the opposite, where the underlying distribution needs to be estimated. However, little is known about when either choice is valid. This paper investigates for the first time problem properties that justify using either assumption. Using this, we present effective decision proxies for DFL, with very limited compromise on the complexity of the learning task. We show the effectiveness of presented approaches in experiments on problems with continuous and discrete variables, as well as uncertainty in the objective function and in the constraints.
[LG-41] Deep Q-Network (DQN) multi-agent reinforcement learning (MARL) for Stock Trading
链接: https://arxiv.org/abs/2505.03949
作者: John Christopher Tidwell,John Storm Tidwell
类目: Machine Learning (cs.LG)
*备注:
Abstract:This project addresses the challenge of automated stock trading, where traditional methods and direct reinforcement learning (RL) struggle with market noise, complexity, and generalization. Our proposed solution is an integrated deep learning framework combining a Convolutional Neural Network (CNN) to identify patterns in technical indicators formatted as images, a Long Short-Term Memory (LSTM) network to capture temporal dependencies across both price history and technical indicators, and a Deep Q-Network (DQN) agent which learns the optimal trading policy (buy, sell, hold) based on the features extracted by the CNN and LSTM.
[LG-42] SAND: One-Shot Feature Selection with Additive Noise Distortion ICML
链接: https://arxiv.org/abs/2505.03923
作者: Pedram Pad,Hadi Hammoud,Mohamad Dia,Nadim Maamari,L. Andrea Dunbar
类目: Machine Learning (cs.LG)
*备注: Proceedings of the 42nd International Conference on Machine Learning (ICML), Vancouver, Canada. PMLR 267, 2025
Abstract:Feature selection is a critical step in data-driven applications, reducing input dimensionality to enhance learning accuracy, computational efficiency, and interpretability. Existing state-of-the-art methods often require post-selection retraining and extensive hyperparameter tuning, complicating their adoption. We introduce a novel, non-intrusive feature selection layer that, given a target feature count k, automatically identifies and selects the k most informative features during neural network training. Our method is uniquely simple, requiring no alterations to the loss function, network architecture, or post-selection retraining. The layer is mathematically elegant and can be fully described by \tilde{x}_i = a_i x_i + (1-a_i) z_i, where x_i is the input feature, \tilde{x}_i the output, z_i Gaussian noise, and a_i a trainable gain such that \sum_i a_i^2 = k. This formulation induces an automatic clustering effect, driving k of the a_i gains to 1 (selecting informative features) and the rest to 0 (discarding redundant ones) via weighted noise distortion and gain normalization. Despite its extreme simplicity, our method delivers state-of-the-art performance on standard benchmark datasets and a novel real-world dataset, outperforming or matching existing approaches without requiring hyperparameter search for k or retraining. Theoretical analysis in the context of linear regression further validates its efficacy. Our work demonstrates that simplicity and performance are not mutually exclusive, offering a powerful yet straightforward tool for feature selection in machine learning.
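A PyTorch sketch of the layer exactly as the abstract writes it: x̃_i = a_i x_i + (1 - a_i) z_i with the gains renormalized so that Σ a_i² = k. The initialization, noise scale, and noise-free inference behavior are assumptions; the paper's full training recipe is not reproduced.

```python
# Hedged sketch of a SAND-style feature selection layer.
import torch
import torch.nn as nn

class SANDLayer(nn.Module):
    def __init__(self, n_features, k, noise_std=1.0):
        super().__init__()
        self.k, self.noise_std = k, noise_std
        # Uniform init satisfying sum(a^2) = k (an assumption).
        self.a = nn.Parameter(torch.full((n_features,), (k / n_features) ** 0.5))

    def forward(self, x):
        a = self.a * (self.k ** 0.5) / self.a.norm()   # enforce sum(a^2) = k
        if self.training:
            z = torch.randn_like(x) * self.noise_std   # additive noise distortion
            return a * x + (1.0 - a) * z
        return a * x                                   # noise-free at inference

layer = SANDLayer(n_features=20, k=5)
print(layer(torch.randn(8, 20)).shape)
```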
[LG-43] Explaining Anomalies with Tensor Networks
链接: https://arxiv.org/abs/2505.03911
作者: Hans Hohenfeld,Marius Beuerle,Elie Mounzer
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 6 pages, 3 figures
Abstract:Tensor networks, a class of variational quantum many-body wave functions, have attracted considerable research interest across many disciplines, including classical machine learning. Recently, Aizpurua et al. demonstrated explainable anomaly detection with matrix product states on a discrete-valued cyber-security task, using quantum-inspired methods to gain insight into the learned model and detected anomalies. Here, we extend this framework to real-valued data domains. We furthermore introduce tree tensor networks for the task of explainable anomaly detection. We demonstrate these methods with three benchmark problems, show adequate predictive performance compared to several baseline models, and demonstrate both tensor network architectures’ ability to explain anomalous samples. We thereby extend the application of tensor networks to a broader class of potential problems and open a pathway for future extensions to more complex tensor network architectures.
[LG-44] MARCO: A Multi-Agent System for Optimizing HPC Code Generation Using Large Language Models
链接: https://arxiv.org/abs/2505.03906
作者: Asif Rahman,Veljko Cvetkovic,Kathleen Reece,Aidan Walters,Yasir Hassan,Aneesh Tummeti,Bryan Torres,Denise Cooney,Margaret Ellis,Dimitrios S. Nikolopoulos
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 9 pages, 4 figures, 2 tables
Abstract:Large language models (LLMs) have transformed software development through code generation capabilities, yet their effectiveness for high-performance computing (HPC) remains limited. HPC code requires specialized optimizations for parallelism, memory efficiency, and architecture-specific considerations that general-purpose LLMs often overlook. We present MARCO (Multi-Agent Reactive Code Optimizer), a novel framework that enhances LLM-generated code for HPC through a specialized multi-agent architecture. MARCO employs separate agents for code generation and performance evaluation, connected by a feedback loop that progressively refines optimizations. A key innovation is MARCO’s web-search component that retrieves real-time optimization techniques from recent conference proceedings and research publications, bridging the knowledge gap in pre-trained LLMs. Our extensive evaluation on the LeetCode 75 problem set demonstrates that MARCO achieves a 14.6% average runtime reduction compared to Claude 3.5 Sonnet alone, while the integration of the web-search component yields a 30.9% performance improvement over the base MARCO system. These results highlight the potential of multi-agent systems to address the specialized requirements of high-performance code generation, offering a cost-effective alternative to domain-specific model fine-tuning.
[LG-45] Machine Learning: a Lecture Note
链接: https://arxiv.org/abs/2505.03861
作者: Kyunghyun Cho
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:This lecture note is intended to prepare early-year master’s and PhD students in data science or a related discipline with foundational ideas in machine learning. It starts with basic ideas in modern machine learning with classification as a main target task. These basic ideas include loss formulation, backpropagation, stochastic gradient descent, generalization, model selection as well as fundamental blocks of artificial neural networks. Based on these basic ideas, the lecture note explores in depth the probabilistic approach to unsupervised learning, covering directed latent variable models, product of experts, generative adversarial networks and autoregressive models. Finally, the note ends by covering a diverse set of further topics, such as reinforcement learning, ensemble methods and meta-learning. After reading this lecture note, a student should be ready to embark on studying and researching more advanced topics in machine learning and more broadly artificial intelligence.
[LG-46] Differentially Private Densest-k-Subgraph
链接: https://arxiv.org/abs/2505.03858
作者: Alireza Khayatian,Anil Vullikanti,Aritra Konar
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:Many graph datasets involve sensitive network data, motivating the need for privacy-preserving graph mining. The Densest-k-Subgraph (DkS) problem is a key primitive in graph mining that aims to extract a subset of k vertices with the maximum internal connectivity. Although non-private algorithms are known for DkS, this paper is the first to design algorithms that offer formal differential privacy (DP) guarantees for the problem. We base our general approach on using the principal component (PC) of the graph adjacency matrix to output a subset of k vertices under edge DP. For this task, we first consider output perturbation, which traditionally offers good scalability, but at the expense of utility. Our tight bounds on the local sensitivity indicate a large gap with the global sensitivity, motivating the use of instance-specific sensitivity methods for private PC. Next, we derive a tight bound on the smooth sensitivity and show that it can be close to the global sensitivity. This leads us to consider the Propose-Test-Release (PTR) framework for private PC. Although computationally expensive in general, we design a novel approach for implementing PTR in the same time as computing a non-private PC, while offering good utility for DkS. Additionally, we also consider the iterative private power method (PPM) for private PC, albeit it is significantly slower than PTR on large networks. We run our methods on diverse real-world networks, with the largest having 3 million vertices, and show good privacy-utility trade-offs. Although PTR requires a slightly larger privacy budget, on average, it achieves a 180-fold improvement in runtime over PPM.
[LG-47] Improved Dimensionality Reduction for Inverse Problems in Nuclear Fusion and High-Energy Astrophysics
链接: https://arxiv.org/abs/2505.03849
作者: Jonathan Gorard,Ammar Hakim,Hong Qin,Kyle Parfrey,Shantenu Jha
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Nuclear Theory (nucl-th)
*备注: 2 pages. Position paper accepted to DOE-ASCR Inverse Methods for Complex Systems under Uncertainty Workshop (Rockville, MD, United States, June 10-12, 2025)
Abstract:Many inverse problems in nuclear fusion and high-energy astrophysics research, such as the optimization of tokamak reactor geometries or the inference of black hole parameters from interferometric images, necessitate high-dimensional parameter scans and large ensembles of simulations to be performed. Such inverse problems typically involve large uncertainties, both in the measurement parameters being inverted and in the underlying physics models themselves. Monte Carlo sampling, when combined with modern non-linear dimensionality reduction techniques such as autoencoders and manifold learning, can be used to reduce the size of the parameter spaces considerably. However, there is no guarantee that the resulting combinations of parameters will be physically valid, or even mathematically consistent. In this position paper, we advocate adopting a hybrid approach that leverages our recent advances in the development of formal verification methods for numerical algorithms, with the goal of constructing parameter space restrictions with provable mathematical and physical correctness properties, whilst nevertheless respecting both experimental uncertainties and uncertainties in the underlying physical processes.
[LG-48] A Comprehensive Analysis of Adversarial Attacks against Spam Filters
链接: https://arxiv.org/abs/2505.03831
作者: Esra Hotoğlu,Sevil Sen,Burcu Can
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Deep learning has revolutionized email filtering, which is critical to protect users from cyber threats such as spam, malware, and phishing. However, the increasing sophistication of adversarial attacks poses a significant challenge to the effectiveness of these filters. This study investigates the impact of adversarial attacks on deep learning-based spam detection systems using real-world datasets. Six prominent deep learning models are evaluated on these datasets, analyzing attacks at the word, character, sentence, and AI-generated paragraph levels. Novel scoring functions, including spam weights and attention weights, are introduced to improve attack effectiveness. This comprehensive analysis sheds light on the vulnerabilities of spam filters and contributes to efforts to improve their security against evolving adversarial threats.
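A hedged sketch of a character-level perturbation of the kind the study evaluates: swap adjacent inner characters of randomly chosen words. The paper's scoring functions (spam weights, attention weights) for choosing which words to attack are not reproduced; word choice here is random.

```python
# Character-swap perturbation of a message (illustrative only).
import random

def perturb(text, rate=0.3, seed=0):
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(1, len(w) - 2)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]   # swap chars j, j+1
    return " ".join(words)

print(perturb("claim your free lottery prize now by clicking this link"))
```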
[LG-49] Information Filtering Networks: Theoretical Foundations Generative Methodologies and Real-World Applications
链接: https://arxiv.org/abs/2505.03812
作者: Tomaso Aste
类目: Machine Learning (cs.LG)
*备注:
Abstract:Information Filtering Networks (IFNs) provide a powerful framework for modeling complex systems through globally sparse yet locally dense and interpretable structures that capture multivariate dependencies. This review offers a comprehensive account of IFNs, covering their theoretical foundations, construction methodologies, and diverse applications. Tracing their origins from early network-based models to advanced formulations such as the Triangulated Maximally Filtered Graph (TMFG) and the Maximally Filtered Clique Forest (MFCF), the paper highlights how IFNs address key challenges in high-dimensional data-driven modeling. IFNs and their construction methodologies are intrinsically higher-order networks that generate simplicial complexes, structures that are only now becoming popular in the broader literature. Applications span fields including finance, biology, psychology, and artificial intelligence, where IFNs improve interpretability, computational efficiency, and predictive performance. Special attention is given to their role in graphical modeling, where IFNs enable the estimation of sparse inverse covariance matrices with greater accuracy and scalability than traditional approaches like Graphical LASSO. Finally, the review discusses recent developments that integrate IFNs with machine learning and deep learning, underscoring their potential not only to bridge classical network theory with contemporary data-driven paradigms, but also to shape the architectures of deep learning models themselves.
[LG-50] Feature Optimization for Time Series Forecasting via Novel Randomized Uphill Climbing
链接: https://arxiv.org/abs/2505.03805
作者: Nguyen Van Thanh
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注:
Abstract:Randomized Uphill Climbing is a lightweight, stochastic search heuristic that has delivered state-of-the-art equity alpha factors for quantitative hedge funds. I propose to generalize RUC into a model-agnostic feature optimization framework for multivariate time series forecasting. The core idea is to synthesize candidate feature programs by randomly composing operators from a domain-specific grammar, score candidates rapidly with inexpensive surrogate models on rolling windows, and filter instability via nested cross-validation and information-theoretic shrinkage. By decoupling feature discovery from GPU-heavy deep learning, the method promises faster iteration cycles, lower energy consumption, and greater interpretability. Societal relevance: accurate, transparent forecasting tools empower resource-constrained institutions, energy regulators, and climate-risk NGOs to make data-driven decisions without proprietary black-box models.
[LG-51] Utilising Gradient-Based Proposals Within Sequential Monte Carlo Samplers for Training of Partial Bayesian Neural Networks
链接: https://arxiv.org/abs/2505.03797
作者: Andrew Millard,Joshua Murphy,Simon Maskell,Zheng Zhao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Partial Bayesian neural networks (pBNNs) have been shown to perform competitively with fully Bayesian neural networks while only having a subset of the parameters be stochastic. Using sequential Monte Carlo (SMC) samplers as the inference method for pBNNs gives a non-parametric probabilistic estimation of the stochastic parameters, and has shown improved performance over parametric methods. In this paper we introduce a new SMC-based training method for pBNNs by utilising a guided proposal and incorporating gradient-based Markov kernels, which gives us better scalability on high dimensional problems. We show that our new method outperforms the state-of-the-art in terms of predictive performance and optimal loss. We also show that pBNNs scale well with larger batch sizes, resulting in significantly reduced training times and often better performance.
[LG-52] A Double Inertial Forward-Backward Splitting Algorithm With Applications to Regression and Classification Problems
链接: https://arxiv.org/abs/2505.03794
作者: İrfan Işik,Ibrahim Karahan,Okan Erkaymaz
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 20 pages, 5 sections, 5 figures, 5 tables
Abstract:This paper presents an improved forward-backward splitting algorithm with two inertial parameters. It aims to find a point in the real Hilbert space at which the sum of a co-coercive operator and a maximal monotone operator vanishes. Under standard assumptions, our proposed algorithm demonstrates weak convergence. We present numerous experimental results to demonstrate the behavior of the developed algorithm by comparing it with existing algorithms in the literature for regression and data classification problems. Furthermore, these implementations suggest our proposed algorithm yields superior outcomes when benchmarked against other relevant algorithms in existing literature.
[LG-53] A new architecture of high-order deep neural networks that learn martingales
链接: https://arxiv.org/abs/2505.03789
作者: Syoiti Ninomiya,Yuming Ma
类目: Machine Learning (cs.LG); Probability (math.PR); Computational Finance (q-fin.CP)
*备注: 19 pages, 3 figures
Abstract:A new deep-learning neural network architecture based on high-order weak approximation algorithms for stochastic differential equations (SDEs) is proposed. The architecture enables the efficient learning of martingales by deep learning models. The behaviour of deep neural networks based on this architecture, when applied to the problem of pricing financial derivatives, is also examined. The core of this new architecture lies in the high-order weak approximation algorithms of the explicit Runge–Kutta type, wherein the approximation is realised solely through iterative compositions and linear combinations of vector fields of the target SDEs.
[LG-54] mAIstro: an open-source multi-agentic system for automated end-to-end development of radiomics and deep learning models for medical imaging
链接: https://arxiv.org/abs/2505.03785
作者: Eleftherios Tzanis,Michail E. Klontzas
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:
Abstract:Agentic systems built on large language models (LLMs) offer promising capabilities for automating complex workflows in healthcare AI. We introduce mAIstro, an open-source, autonomous multi-agentic framework for end-to-end development and deployment of medical AI models. The system orchestrates exploratory data analysis, radiomic feature extraction, image segmentation, classification, and regression through a natural language interface, requiring no coding from the user. Built on a modular architecture, mAIstro supports both open- and closed-source LLMs, and was evaluated using a large and diverse set of prompts across 16 open-source datasets, covering a wide range of imaging modalities, anatomical regions, and data types. The agents successfully executed all tasks, producing interpretable outputs and validated models. This work presents the first agentic framework capable of unifying data analysis, AI model development, and inference across varied healthcare applications, offering a reproducible and extensible foundation for clinical and research AI integration. The code is available at: this https URL
[LG-55] Insulin Resistance Prediction From Wearables and Routine Blood Biomarkers
链接: https://arxiv.org/abs/2505.03784
作者: Ahmed A. Metwally,A. Ali Heydari,Daniel McDuff,Alexandru Solot,Zeinab Esmaeilpour,Anthony Z Faranesh,Menglian Zhou,David B. Savage,Conor Heneghan,Shwetak Patel,Cathy Speed,Javier L. Prieto
类目: Machine Learning (cs.LG)
*备注:
Abstract:Insulin resistance, a precursor to type 2 diabetes, is characterized by impaired insulin action in tissues. Current methods for measuring insulin resistance, while effective, are expensive, inaccessible, not widely available and hinder opportunities for early intervention. In this study, we remotely recruited the largest cohort to date across the US to study insulin resistance (N=1,165 participants, with median BMI=28 kg/m2, age=45 years, HbA1c=5.4%), incorporating wearable device time series data and blood biomarkers, including the ground-truth measure of insulin resistance, homeostatic model assessment for insulin resistance (HOMA-IR). We developed deep neural network models to predict insulin resistance based on readily available digital and blood biomarkers. Our results show that our models can predict insulin resistance by combining both wearable data and readily available blood biomarkers better than either of the two data sources separately (R2=0.5, auROC=0.80, sensitivity=76%, and specificity=84%). The model showed 93% sensitivity and 95% adjusted specificity in obese and sedentary participants, a subpopulation most vulnerable to developing type 2 diabetes and who could benefit most from early intervention. Rigorous evaluation of model performance, including interpretability and robustness, facilitates generalizability across larger cohorts, which is demonstrated by reproducing the prediction performance on an independent validation cohort (N=72 participants). Additionally, we demonstrated how the predicted insulin resistance can be integrated into a large language model agent to help understand and contextualize HOMA-IR values, facilitating interpretation and safe personalized recommendations. This work offers the potential for early detection of people at risk of type 2 diabetes and thereby facilitates earlier implementation of preventative strategies.
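The ground-truth measure named above, HOMA-IR, has a standard closed form: fasting glucose (mg/dL) times fasting insulin (µU/mL) divided by 405 (equivalently, glucose in mmol/L times insulin divided by 22.5). This sketch reproduces only that textbook formula, not the paper's wearable-based predictor; the threshold noted in the comment is a commonly cited convention, not the paper's.

```python
# Textbook HOMA-IR computation from fasting labs.
def homa_ir(glucose_mg_dl, insulin_uU_ml):
    return glucose_mg_dl * insulin_uU_ml / 405.0

print(homa_ir(95.0, 8.0))   # ~1.88; values above ~2.5 often flag insulin resistance
```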
[LG-56] A general physics-constrained method for the modelling of equations closure terms with sparse data
链接: https://arxiv.org/abs/2505.03783
作者: Tian Chen,Shengping Liu,Li Liu,Heng Yong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate modeling of closure terms is a critical challenge in engineering and scientific research, particularly when data is sparse (scarce or incomplete), making widely applicable models difficult to develop. This study proposes a novel approach for constructing closure models in such challenging scenarios. We introduce a Series-Parallel Multi-Network Architecture that integrates Physics-Informed Neural Networks (PINNs) to incorporate physical constraints and heterogeneous data from multiple initial and boundary conditions, while employing dedicated subnetworks to independently model unknown closure terms, enhancing generalizability across diverse problems. These closure models are integrated into an accurate Partial Differential Equation (PDE) solver, enabling robust solutions to complex predictive simulations in engineering applications.
[LG-57] ALFRED: Ask a Large-language model For Reliable ECG Diagnosis
链接: https://arxiv.org/abs/2505.03781
作者: Jin Yu,JaeHo Park,TaeJun Park,Gyurin Kim,JiHyun Lee,Min Sung Lee,Joon-myoung Kwon,Jeong Min Son,Yong-Yeon Jo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Leveraging Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) for analyzing medical data, particularly Electrocardiogram (ECG), offers high accuracy and convenience. However, generating reliable, evidence-based results in specialized fields like healthcare remains a challenge, as RAG alone may not suffice. We propose a zero-shot, RAG-based ECG diagnosis framework that incorporates expert-curated knowledge to enhance diagnostic accuracy and explainability. Evaluation on the PTB-XL dataset demonstrates the framework’s effectiveness, highlighting the value of structured domain expertise in automated ECG interpretation. Our framework is designed to support comprehensive ECG analysis, addressing diverse diagnostic needs with potential applications beyond the tested dataset.
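The retrieve-then-prompt pattern the abstract describes can be sketched in a few lines. The snippet below is an illustrative skeleton, not ALFRED's pipeline: `embed` and `llm_generate` are hypothetical stand-ins for whichever encoder and LLM backend one plugs in, and the knowledge snippets are toy examples.

```python
# Illustrative RAG skeleton for grounded ECG diagnosis: retrieve the most
# similar expert-curated guideline snippets, then assemble a grounded prompt.
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=3):
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:k]

knowledge_base = [
    "ST elevation in contiguous leads suggests acute myocardial infarction.",
    "A PR interval > 200 ms indicates first-degree AV block.",
    "QRS duration > 120 ms with RSR' in V1 suggests right bundle branch block.",
]

def diagnose(ecg_finding, embed, llm_generate):
    # `embed` maps text -> vector; `llm_generate` maps prompt -> answer.
    doc_vecs = np.stack([embed(t) for t in knowledge_base])
    idx = cosine_top_k(embed(ecg_finding), doc_vecs)
    evidence = "\n".join(knowledge_base[i] for i in idx)
    prompt = (f"ECG finding: {ecg_finding}\n"
              f"Relevant expert knowledge:\n{evidence}\n"
              f"Give a diagnosis citing the evidence above.")
    return llm_generate(prompt)
```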
[LG-58] Neural Co-Optimization of Structural Topology Manufacturable Layers and Path Orientations for Fiber-Reinforced Composites
链接: https://arxiv.org/abs/2505.03779
作者: Tao Liu,Tianyu Zhang,Yongxue Chen,Weiming Wang,Yu Jiang,Yuming Huang,Charlie C.L. Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose a neural network-based computational framework for the simultaneous optimization of structural topology, curved layers, and path orientations to achieve strong anisotropic strength in fiber-reinforced thermoplastic composites while ensuring manufacturability. Our framework employs three implicit neural fields to represent geometric shape, layer sequence, and fiber orientation. This enables the direct formulation of both design and manufacturability objectives - such as anisotropic strength, structural volume, machine motion control, layer curvature, and layer thickness - into an integrated and differentiable optimization process. By incorporating these objectives as loss functions, the framework ensures that the resultant composites exhibit optimized mechanical strength while remaining manufacturable via filament-based multi-axis 3D printing across diverse hardware platforms. Physical experiments demonstrate that the composites generated by our co-optimization method can achieve an improvement of up to 33.1% in failure loads compared to composites with sequentially optimized structures and manufacturing sequences.
[LG-59] Dragonfly: a modular deep reinforcement learning library
链接: https://arxiv.org/abs/2505.03778
作者: Jonathan Viquerat,Paul Garnier,Amirhossein Bateni,Elie Hachem
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dragonfly is a deep reinforcement learning library focused on modularity, in order to ease experimentation and development. It relies on JSON serialization, which allows building blocks to be swapped and parameter sweeps to be performed while minimizing code maintenance. Some of its features are specifically designed for CPU-intensive environments, such as numerical simulations. Its performance on standard agents using common benchmarks compares favorably with the literature.
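The config-driven modularity described here is a common pattern; a minimal sketch (with a hypothetical schema, not Dragonfly's actual JSON format) might look as follows, where swapping the agent is a one-word change in the JSON.

```python
# Sketch of config-driven building-block swapping: the JSON names the agent,
# network, and trainer; a registry maps names to constructors.
import json

config_json = """
{
  "agent": {"type": "ppo", "gamma": 0.99, "clip": 0.2},
  "network": {"type": "mlp", "hidden": [64, 64]},
  "trainer": {"n_envs": 8, "steps": 100000}
}
"""

# Hypothetical registry: each entry builds an agent from its config dict.
REGISTRY = {
    "ppo": lambda cfg: f"PPO(gamma={cfg['gamma']}, clip={cfg['clip']})",
    "a2c": lambda cfg: f"A2C(gamma={cfg['gamma']})",
}

cfg = json.loads(config_json)
agent = REGISTRY[cfg["agent"]["type"]](cfg["agent"])
print(agent)  # PPO(gamma=0.99, clip=0.2); changing "ppo" to "a2c" swaps the agent
```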
[LG-60] MolMole: Molecule Mining from Scientific Literature
链接: https://arxiv.org/abs/2505.03777
作者: LG AI Research,Sehyun Chun,Jiye Kim,Ahra Jo,Yeonsik Jo,Seungyul Oh,Seungjun Lee,Kwangrok Ryoo,Jongmin Lee,Seunghwan Kim,Byung Jun Kang,Soonyoung Lee,Jun Ha Park,Chanwoo Moon,Jiwon Ham,Haein Lee,Heejae Han,Jaeseung Byun,Soojong Do,Minju Ha,Dongyun Kim,Kyunghoon Bae,Woohyung Lim,Edward Hwayoung Lee,Yongmin Park,Jeongsang Yu,Gerrard Jeongwon Jo,Yeonjung Hong,Kyungjae Yoo,Sehui Han,Jaewan Lee,Changyoung Park,Kijeong Jeon,Sihyuk Yi
类目: Machine Learning (cs.LG)
*备注: 15 pages, 12 figures
Abstract:The extraction of molecular structures and reaction data from scientific documents is challenging due to their varied, unstructured chemical formats and complex document layouts. To address this, we introduce MolMole, a vision-based deep learning framework that unifies molecule detection, reaction diagram parsing, and optical chemical structure recognition (OCSR) into a single pipeline for automating the extraction of chemical data directly from page-level documents. Recognizing the lack of a standard page-level benchmark and evaluation metric, we also present a testset of 550 pages annotated with molecule bounding boxes, reaction labels, and MOLfiles, along with a novel evaluation metric. Experimental results demonstrate that MolMole outperforms existing toolkits on both our benchmark and public datasets. The benchmark testset will be publicly available, and the MolMole toolkit will be accessible soon through an interactive demo on the LG AI Research website. For commercial inquiries, please contact us at contact_ddu@lgresearch.ai.
[LG-61] PAPN: Proximity Attention Encoder and Pointer Network Decoder for Parcel Pickup Route Prediction
链接: https://arxiv.org/abs/2505.03776
作者: Hansi Denis,Siegfried Mercelis,Ngoc-Quang Luong
类目: Machine Learning (cs.LG)
*备注: 9 pages, 2 figures, 2 tables
Abstract:Optimization of the last-mile delivery and first-mile pickup of parcels is an integral part of the broader logistics optimization pipeline, as it entails both cost and resource efficiency as well as heightened service quality. Such optimization requires accurate route and time prediction systems to adapt to different scenarios in advance. This work tackles the first building block, namely route prediction. It does so by introducing a novel Proximity Attention mechanism in an encoder-decoder architecture that utilizes a Pointer Network in the decoding process (Proximity Attention Encoder and Pointer Network decoder: PAPN) to leverage the underlying connections between the different visitable pickup positions at each timestep. This local attention process is coupled with global context computation via a multi-head attention transformer encoder. The obtained global context is then mixed with an aggregated version of the local embedding, achieving a blend of global and local attention for complete modeling of the problem. Proximity attention is also used in the decoding process to skew predictions towards the locations with the highest attention scores, using the inter-connectivity of locations as a basis for next-location prediction. The method is trained, validated and tested on LaDE[1], a large industry-level dataset of real-world, large-scale last-mile delivery and first-mile pickup. The approach shows noticeable promise, outperforming all state-of-the-art supervised systems on most metrics used for benchmarking methods on this dataset, while remaining competitive with the best-performing reinforcement learning method, DRL4Route[2].
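As a rough illustration of the decoding idea (our reading of the abstract, not the authors' code), the next-location choice can be sketched as greedy pointer decoding with attention scores skewed by proximity:

```python
# Toy sketch: at each step, content scores over unvisited locations are skewed
# by a proximity term before the pointer picks the next stop.
import numpy as np

def decode_route(embeddings, coords, start=0, beta=1.0):
    n = len(coords)
    visited, route = {start}, [start]
    while len(route) < n:
        cur = route[-1]
        mask = np.array([i not in visited for i in range(n)])
        # Content score: dot product of current and candidate embeddings.
        content = embeddings @ embeddings[cur]
        # Proximity score: closer candidates get larger (less negative) values.
        proximity = -np.linalg.norm(coords - coords[cur], axis=1)
        scores = np.where(mask, content + beta * proximity, -np.inf)
        nxt = int(np.argmax(scores))
        visited.add(nxt); route.append(nxt)
    return route

rng = np.random.default_rng(0)
print(decode_route(rng.normal(size=(6, 8)), rng.random((6, 2))))
```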
[LG-62] Hierarchical Multi-Label Generation with Probabilistic Level-Constraint
链接: https://arxiv.org/abs/2505.03775
作者: Linqing Chen,Weilei Wang,Wentao Wu,Hanmeng Zhong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hierarchical extreme multi-label classification poses greater difficulties than traditional multi-label classification because of the intricate hierarchical connections of labels within a domain-specific taxonomy and the substantial number of labels. Some prior research centered on classifying text through several ancillary stages such as clustering algorithms and multiphase classification. Other work attempted to leverage generative methods yet was unable to properly control the output of the generative model. We redefine the task from hierarchical multi-label classification to Hierarchical Multi-Label Generation (HMG) and employ a generative framework with Probabilistic Level Constraints (PLC) to generate hierarchical labels within a specific taxonomy that have complex hierarchical relationships. The approach proposed in this paper enables the framework to generate all relevant labels across levels for each document without relying on preliminary operations like clustering. Meanwhile, it can control the model output precisely in terms of count, length, and level. Experiments demonstrate that our approach not only achieves a new SOTA performance on the HMG task, but also controls the model output far better than previous work.
[LG-63] Out-of-Distribution Detection in Heterogeneous Graphs via Energy Propagation
链接: https://arxiv.org/abs/2505.03774
作者: Tao Yin,Chen Zhao,Xiaoyan Liu,Minglai Shao
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Knowledge-Based Systems 2025
Abstract:Graph neural networks (GNNs) are proven effective in extracting complex node and structural information from graph data. While current GNNs perform well in node classification tasks within in-distribution (ID) settings, real-world scenarios often present distribution shifts, leading to the presence of out-of-distribution (OOD) nodes. OOD detection in graphs is a crucial and challenging task. Most existing research focuses on homogeneous graphs, but real-world graphs are often heterogeneous, consisting of diverse node and edge types. This heterogeneity adds complexity and enriches the informational content. To the best of our knowledge, OOD detection in heterogeneous graphs remains an underexplored area. In this context, we propose a novel methodology for OOD detection in heterogeneous graphs (OODHG) that aims to achieve two main objectives: 1) detecting OOD nodes and 2) classifying all ID nodes based on the first task’s results. Specifically, we learn representations for each node in the heterogeneous graph, calculate energy values to determine whether nodes are OOD, and then classify ID nodes. To leverage the structural information of heterogeneous graphs, we introduce a meta-path-based energy propagation mechanism and an energy constraint to enhance the distinction between ID and OOD nodes. Extensive experimental findings substantiate the simplicity and effectiveness of OODHG, demonstrating its superiority over baseline models in OOD detection tasks and its accuracy in ID node classification.
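The energy part of the method builds on the standard energy score for OOD detection (Liu et al., 2020), E(x) = -T·logsumexp(logits/T). The sketch below computes that score and then smooths it over a plain adjacency matrix; this simple propagation rule is our stand-in for the paper's meta-path-based mechanism, not OODHG's exact rule.

```python
# Energy-based OOD scoring plus a simplified neighbor-averaging propagation.
import torch

def energy_score(logits, T=1.0):
    # E(x) = -T * logsumexp(logits / T); lower energy -> more in-distribution.
    return -T * torch.logsumexp(logits / T, dim=-1)

def propagate_energy(energy, adj, alpha=0.5, steps=2):
    # Smooth node energies over a row-normalized adjacency matrix.
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    P = adj / deg
    for _ in range(steps):
        energy = alpha * energy + (1 - alpha) * (P @ energy)
    return energy

logits = torch.randn(5, 3)                    # 5 nodes, 3 classes
adj = (torch.rand(5, 5) > 0.5).float()        # toy graph
e = propagate_energy(energy_score(logits).unsqueeze(1), adj).squeeze(1)
is_ood = e > e.median()                       # the threshold is a hyperparameter
print(e, is_ood)
```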
[LG-64] Is the end of Insight in Sight?
链接: https://arxiv.org/abs/2505.04627
作者: Jean-Michel Tucny,Mihir Durve,Sauro Succi
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 20 pages, 5 figures
Abstract:It is shown that the weight matrices of a Physics-informed neural network (PINN)-based deep learning application to a rarefied gas dynamics problem described by the Boltzmann equation bear no evident link to the mathematical structure of the physical problem. Instead, the weights appear close to Gaussian distributed random matrices. Although significantly more work is needed to support a robust assessment in this direction, these results suggest that deep-learning and the numerical solution of the Boltzmann equation represent two equivalent, but largely distinct paths to the same physical knowledge. If so, Explainable AI might be an unrealistic target and possibly even an ill-posed one.
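One can probe the same observation on one's own network with a quick normality check on each weight matrix; a minimal sketch (illustrative only, far cruder than the paper's analysis):

```python
# Compare each layer's flattened weights against a fitted Gaussian with a
# Kolmogorov-Smirnov test. The weight matrices here are random stand-ins;
# in practice one would pull them from a trained PINN.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
layers = {"fc1": rng.normal(0, 0.1, size=(64, 64)),
          "fc2": rng.normal(0, 0.1, size=(64, 1))}

for name, W in layers.items():
    w = W.ravel()
    z = (w - w.mean()) / w.std()              # standardize before testing
    ks_stat, p_value = stats.kstest(z, "norm")
    verdict = "consistent with" if p_value > 0.05 else "deviates from"
    print(f"{name}: KS={ks_stat:.3f}, p={p_value:.3f} ({verdict} Gaussian)")
```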
[LG-65] From Two Sample Testing to Singular Gaussian Discrimination
链接: https://arxiv.org/abs/2505.04613
作者: Leonardo V. Santoro,Kartik G. Waghmare,Victor M. Panaretos
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:We establish that testing for the equality of two probability measures on a general separable and compact metric space is equivalent to testing for the singularity between two corresponding Gaussian measures on a suitable Reproducing Kernel Hilbert Space. The corresponding Gaussians are defined via the notion of kernel mean and covariance embedding of a probability measure. Discerning two singular Gaussians is fundamentally simpler from an information-theoretic perspective than non-parametric two-sample testing, particularly in high-dimensional settings. Our proof leverages the Feldman-Hajek criterion for singularity/equivalence of Gaussians on Hilbert spaces, and shows that discrepancies between distributions are heavily magnified through their corresponding Gaussian embeddings: at a population level, distinct probability measures lead to essentially separated Gaussian embeddings. This appears to be a new instance of the blessing of dimensionality that can be harnessed for the design of efficient inference tools in great generality.
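The kernel mean embeddings invoked here are the same objects behind the classical MMD two-sample statistic. For orientation, a minimal unbiased MMD² estimate with an RBF kernel (standard material, not the paper's test):

```python
# Unbiased MMD^2 between two samples via kernel mean embeddings (RBF kernel).
import numpy as np

def rbf(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf(X, X, sigma), rbf(Y, Y, sigma), rbf(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * Kxy.mean())

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
diff = mmd2_unbiased(rng.normal(size=(200, 2)), rng.normal(1.0, 1.0, (200, 2)))
print(f"same distribution: {same:.4f}, shifted: {diff:.4f}")  # shifted >> same
```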
[LG-66] Likelihood-Free Adaptive Bayesian Inference via Nonparametric Distribution Matching
链接: https://arxiv.org/abs/2505.04603
作者: Wenhui Sophia Lu,Wing Hung Wong
类目: Methodology (stat.ME); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注:
Abstract:When the likelihood is analytically unavailable and computationally intractable, approximate Bayesian computation (ABC) has emerged as a widely used methodology for approximate posterior inference; however, it suffers from severe computational inefficiency in high-dimensional settings or under diffuse priors. To overcome these limitations, we propose Adaptive Bayesian Inference (ABI), a framework that bypasses traditional data-space discrepancies and instead compares distributions directly in posterior space through nonparametric distribution matching. By leveraging a novel Marginally-augmented Sliced Wasserstein (MSW) distance on posterior measures and exploiting its quantile representation, ABI transforms the challenging problem of measuring divergence between posterior distributions into a tractable sequence of one-dimensional conditional quantile regression tasks. Moreover, we introduce a new adaptive rejection sampling scheme that iteratively refines the posterior approximation by updating the proposal distribution via generative density estimation. Theoretically, we establish parametric convergence rates for the trimmed MSW distance and prove that the ABI posterior converges to the true posterior as the tolerance threshold vanishes. Through extensive empirical evaluation, we demonstrate that ABI significantly outperforms data-based Wasserstein ABC, summary-based ABC, and state-of-the-art likelihood-free simulators, especially in high-dimensional or dependent observation regimes.
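For readers unfamiliar with the building block, here is the plain (unaugmented) sliced Wasserstein distance underlying the paper's MSW variant: project both samples onto random directions and average the resulting 1D Wasserstein distances, each reduced to sorted-quantile matching.

```python
# Sliced Wasserstein distance between two equal-size empirical samples.
import numpy as np

def sliced_wasserstein(X, Y, n_proj=200, seed=0):
    rng = np.random.default_rng(seed)
    thetas = rng.normal(size=(n_proj, X.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # unit directions
    total = 0.0
    for theta in thetas:
        x_proj = np.sort(X @ theta)   # empirical quantiles of the projection
        y_proj = np.sort(Y @ theta)
        total += np.mean(np.abs(x_proj - y_proj))  # 1D W1 for equal sizes
    return total / n_proj

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
print(sliced_wasserstein(X, rng.normal(size=(500, 5))))       # near 0
print(sliced_wasserstein(X, rng.normal(2.0, 1.0, (500, 5))))  # clearly larger
```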
[LG-67] A Two-Timescale Primal-Dual Framework for Reinforcement Learning via Online Dual Variable Guidance
链接: https://arxiv.org/abs/2505.04494
作者: Axel Friedrich Wolter,Tobias Sutter
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 35 pages, 1 figure
Abstract:We study reinforcement learning by combining recent advances in regularized linear programming formulations with the classical theory of stochastic approximation. Motivated by the challenge of designing algorithms that leverage off-policy data while maintaining on-policy exploration, we propose PGDA-RL, a novel primal-dual Projected Gradient Descent-Ascent algorithm for solving regularized Markov Decision Processes (MDPs). PGDA-RL integrates experience replay-based gradient estimation with a two-timescale decomposition of the underlying nested optimization problem. The algorithm operates asynchronously, interacts with the environment through a single trajectory of correlated data, and updates its policy online in response to the dual variable associated with the occupation measure of the underlying MDP. We prove that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP. Our convergence analysis relies on tools from stochastic approximation theory and holds under weaker assumptions than those required by existing primal-dual RL approaches, notably removing the need for a simulator or a fixed behavioral policy.
[LG-68] A Tutorial on Discriminative Clustering and Mutual Information
链接: https://arxiv.org/abs/2505.04484
作者: Louis Ohl,Pierre-Alexandre Mattei,Frédéric Precioso
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:To cluster data is to separate samples into distinctive groups that should ideally have some cohesive properties. Today, numerous clustering algorithms exist, and their differences lie essentially in what can be perceived as "cohesive properties". Therefore, hypotheses on the nature of clusters must be set: they can be either generative or discriminative. As the last decade witnessed the impressive growth of deep clustering methods that involve neural networks to handle high-dimensional data, often in a discriminative manner, we concentrate mainly on the discriminative hypotheses. In this paper, our aim is to provide an accessible historical perspective on the evolution of discriminative clustering methods, and notably how the nature of the assumptions of discriminative models changed over time: from decision boundaries to invariance critics. We notably highlight how mutual information has been a historical cornerstone of the progress of (deep) discriminative clustering methods. We also show some known limitations of mutual information and how discriminative clustering methods have tried to circumvent them. We then discuss the challenges that discriminative clustering faces with respect to the selection of the number of clusters. Finally, we showcase these techniques using GemClus, a dedicated Python package that we have developed for discriminative clustering.
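The mutual-information objective the tutorial centers on is compact enough to state in code: maximize I(X;Z) = H(E[p]) - E[H(p)], i.e., prefer balanced cluster usage with confident per-sample assignments. A minimal NumPy version (a generic illustration, not GemClus's API):

```python
# Mutual information between inputs and soft cluster assignments, as used in
# classical discriminative clustering objectives (e.g., RIM).
import numpy as np

def mutual_information(p):
    """p: (n_samples, n_clusters) soft assignments from a discriminative model."""
    eps = 1e-12
    marginal = p.mean(axis=0)                                  # cluster usage
    h_marginal = -(marginal * np.log(marginal + eps)).sum()    # H(E[p])
    h_conditional = -(p * np.log(p + eps)).sum(axis=1).mean()  # E[H(p|x)]
    return h_marginal - h_conditional

confident = np.eye(3)[np.random.default_rng(0).integers(0, 3, 300)]
uniform = np.full((300, 3), 1 / 3)
print(mutual_information(confident))  # ~log(3): balanced and confident
print(mutual_information(uniform))    # 0: assignments carry no information
```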
[LG-69] A Heuristic-Integrated DRL Approach for Phase Optimization in Large-Scale RISs
链接: https://arxiv.org/abs/2505.04401
作者: Wei Wang,Peizheng Li,Angela Doufexi,Mark A. Beach
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 5 pages, 5 figures. This work has been accepted for publication in IEEE Communications Letters
Abstract:Optimizing discrete phase shifts in large-scale reconfigurable intelligent surfaces (RISs) is challenging due to their non-convex and non-linear nature. In this letter, we propose a heuristic-integrated deep reinforcement learning (DRL) framework that (1) leverages accumulated actions over multiple steps in the double deep Q-network (DDQN) for RIS column-wise control and (2) integrates a greedy algorithm (GA) into each DRL step to refine the state via fine-grained, element-wise optimization of RIS configurations. By learning from GA-included states, the proposed approach effectively addresses RIS optimization within a small DRL action space, demonstrating its capability to optimize phase-shift configurations of large-scale RISs.
[LG-70] Discrete Optimal Transport and Voice Conversion
链接: https://arxiv.org/abs/2505.04382
作者: Anton Selitskiy,Maitreya Kocharekar
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: 4 pages, 6 figures, 1 table
Abstract:In this work, we address the voice conversion (VC) task using a vector-based interface. To align audio embeddings between speakers, we employ discrete optimal transport mapping. Our evaluation results demonstrate the high quality and effectiveness of this method. Additionally, we show that applying discrete optimal transport as a post-processing step in audio generation can lead to the incorrect classification of synthetic audio as real.
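A minimal version of the mapping step can be written with an assignment solver: match source-speaker embeddings to target-speaker embeddings under a quadratic cost. This uses the Hungarian algorithm as a stand-in; the paper's exact discrete OT formulation may differ.

```python
# Discrete OT as a one-to-one assignment between two equal-size embedding sets.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(100, 16))  # source-speaker audio embeddings
tgt = rng.normal(0.5, 1.0, size=(100, 16))  # target-speaker audio embeddings

cost = cdist(src, tgt, metric="sqeuclidean")  # pairwise transport costs
row, col = linear_sum_assignment(cost)        # optimal one-to-one coupling
converted = tgt[col]                          # map each source frame to its match
print("mean transport cost:", cost[row, col].mean())
```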
[LG-71] Learning based convex approximation for constrained parametric optimization
链接: https://arxiv.org/abs/2505.04037
作者: Kang Liu,Wei Peng,Jianchen Hu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:We propose an input convex neural network (ICNN)-based self-supervised learning framework to solve continuous constrained optimization problems. By integrating the augmented Lagrangian method (ALM) with a constraint correction mechanism, our framework ensures non-strict constraint feasibility, a better optimality gap, and the best convergence rate with respect to state-of-the-art learning-based methods. We provide a rigorous convergence analysis, showing that the algorithm converges to a Karush-Kuhn-Tucker (KKT) point of the original problem even when the internal solver is a neural network, and that the approximation error is bounded. We test our approach on a range of benchmark tasks including quadratic programming (QP), nonconvex programming, and large-scale AC optimal power flow problems. The results demonstrate that, compared to existing solvers (e.g., OSQP, IPOPT) and the latest learning-based methods (e.g., DC3, PDL), our approach achieves a superior balance among accuracy, feasibility, and computational efficiency.
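The ICNN ingredient is easy to isolate: constraining the hidden-to-hidden weights to be non-negative while using convex, non-decreasing activations makes the output convex in the input (Amos et al., 2017). A minimal sketch, omitting the paper's ALM loop and correction mechanism:

```python
# Minimal input convex neural network: the z-path weights are clamped
# non-negative at forward time, so the map x -> f(x) is convex.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    def __init__(self, dim_in, hidden=64, n_layers=3):
        super().__init__()
        self.Wx = nn.ModuleList([nn.Linear(dim_in, hidden) for _ in range(n_layers)])
        self.Wz = nn.ModuleList([nn.Linear(hidden, hidden, bias=False)
                                 for _ in range(n_layers - 1)])
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        z = torch.relu(self.Wx[0](x))
        for wx, wz in zip(self.Wx[1:], self.Wz):
            # Clamp z-to-z weights to keep the composition convex in x.
            z = torch.relu(wx(x) + F.linear(z, wz.weight.clamp(min=0)))
        # Non-negative final combination of convex features stays convex.
        return F.linear(z, self.out.weight.clamp(min=0), self.out.bias)

f = ICNN(dim_in=4)
print(f(torch.randn(8, 4)).shape)  # torch.Size([8, 1])
```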
[LG-72] Variational Formulation of the Particle Flow Particle Filter
链接: https://arxiv.org/abs/2505.04007
作者: Yinzhuang Yi,Jorge Cortés,Nikolay Atanasov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:This paper provides a formulation of the particle flow particle filter from the perspective of variational inference. We show that the transient density used to derive the particle flow particle filter follows a time-scaled trajectory of the Fisher-Rao gradient flow in the space of probability densities. The Fisher-Rao gradient flow is obtained as a continuous-time algorithm for variational inference, minimizing the Kullback-Leibler divergence between a variational density and the true posterior density.
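For reference, the Fisher-Rao gradient flow of F(p) = KL(p || π) that the abstract refers to can be written down explicitly (a standard identity, our paraphrase of the paper's statement, not a quotation from it):

```latex
% Fisher-Rao gradient flow minimizing F(p) = KL(p \,\|\, \pi):
\[
  \partial_t p_t \;=\; -\,p_t\!\left(\log\frac{p_t}{\pi}
      \;-\; \mathbb{E}_{p_t}\!\left[\log\frac{p_t}{\pi}\right]\right),
\]
% and the paper's claim is that the particle-flow transient density follows a
% time-scaled reparametrization of this trajectory.
```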
[LG-73] Categorical and geometric methods in statistical manifold and machine learning
链接: https://arxiv.org/abs/2505.03862
作者: Hông Vân Lê,Hà Quang Minh,Frederic Protin,Wilderich Tuschmann
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Category Theory (math.CT); Differential Geometry (math.DG); Statistics Theory (math.ST)
*备注: 37 p., will appear as part of a special volume in the Springer Tohoku Series in Mathematics
Abstract:We present and discuss applications of the category of probabilistic morphisms, initially developed in \citeLe2023, as well as some geometric methods to several classes of problems in statistical, machine and manifold learning which shall be, along with many other topics, considered in depth in the forthcoming book \citeLMPT2024.
信息检索
[IR-0] User and Recommender Behavior Over Time: Contextualizing Activity Effectiveness Diversity and Fairness in Book Recommendation
链接: https://arxiv.org/abs/2505.04518
作者: Samira Vaez Barenji,Sushobhan Parajuli,Michael D. Ekstrand
类目: Information Retrieval (cs.IR)
*备注: 8 pages, 9 figures
Abstract:Data is an essential resource for studying recommender systems. While there has been significant work on improving and evaluating state-of-the-art models and measuring various properties of recommender system outputs, less attention has been given to the data itself, particularly how data has changed over time. Such documentation and analysis provide guidance and context for designing and evaluating recommender systems, particularly for evaluation designs making use of time (e.g., temporal splitting). In this paper, we present a temporal explanatory analysis of the UCSD Book Graph dataset scraped from Goodreads, a social reading and recommendation platform active since 2006. We measure the book interaction data using a set of activity, diversity, and fairness metrics; we then train a set of collaborative filtering algorithms on rolling training windows to observe how the same measures evolve over time in the recommendations. Additionally, we explore whether the introduction of algorithmic recommendations in 2011 was followed by observable changes in user or recommender system behavior.
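The rolling-window protocol the authors describe is easy to reproduce in outline. A minimal sketch (the window lengths are arbitrary illustrative choices, not the paper's settings):

```python
# Rolling temporal splits over an interaction log: train on a sliding window
# of past interactions, evaluate on the period that immediately follows it.
import pandas as pd

def rolling_windows(df, train_days=365, test_days=90, step_days=90):
    """df needs a 'timestamp' column; yields (train, test) frames."""
    start, end = df["timestamp"].min(), df["timestamp"].max()
    t = start
    while t + pd.Timedelta(days=train_days + test_days) <= end:
        train_end = t + pd.Timedelta(days=train_days)
        test_end = train_end + pd.Timedelta(days=test_days)
        yield (df[(df["timestamp"] >= t) & (df["timestamp"] < train_end)],
               df[(df["timestamp"] >= train_end) & (df["timestamp"] < test_end)])
        t += pd.Timedelta(days=step_days)

interactions = pd.DataFrame({
    "user": [1, 2, 1, 3], "item": [10, 11, 12, 10],
    "timestamp": pd.to_datetime(["2007-01-01", "2009-06-01",
                                 "2011-03-15", "2013-09-30"]),
})
for train, test in rolling_windows(interactions):
    print(len(train), "train rows /", len(test), "test rows")
```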
[IR-1] M2Rec: Multi-scale Mamba for Efficient Sequential Recommendation
链接: https://arxiv.org/abs/2505.04445
作者: Qianru Zhang,Liang Qu,Honggang Wen,Dong Huang,Siu-Ming Yiu,Nguyen Quoc Viet Hung,Hongzhi Yin
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Sequential recommendation systems aim to predict users’ next preferences based on their interaction histories, but existing approaches face critical limitations in efficiency and multi-scale pattern recognition. While Transformer-based methods struggle with quadratic computational complexity, recent Mamba-based models improve efficiency but fail to capture periodic user behaviors, leverage rich semantic information, or effectively fuse multimodal features. To address these challenges, we propose M2Rec, a novel sequential recommendation framework that integrates multi-scale Mamba with Fourier analysis, Large Language Models (LLMs), and adaptive gating. First, we enhance Mamba with the Fast Fourier Transform (FFT) to explicitly model periodic patterns in the frequency domain, separating meaningful trends from noise. Second, we incorporate LLM-based text embeddings to enrich sparse interaction data with semantic context from item descriptions. Finally, we introduce a learnable gate mechanism to dynamically balance temporal (Mamba), frequency (FFT), and semantic (LLM) features, ensuring harmonious multimodal fusion. Extensive experiments demonstrate that M2Rec achieves state-of-the-art performance, improving Hit Rate@10 by 3.2% over existing Mamba-based models while maintaining 20% faster inference than Transformer baselines. Our results highlight the effectiveness of combining frequency analysis, semantic understanding, and adaptive fusion for sequential recommendation. Code and datasets are available at: this https URL.
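The FFT component can be illustrated in isolation (a generic low-pass example, not M2Rec's module): move a behavior signal into the frequency domain, keep the dominant low frequencies, and invert, separating periodic trend from noise.

```python
# FFT-based separation of a periodic behavior pattern from noise.
import numpy as np

t = np.arange(256)
periodic = np.sin(2 * np.pi * t / 32)  # periodic behavior, period 32 steps
signal = periodic + 0.5 * np.random.default_rng(0).normal(size=t.size)

spectrum = np.fft.rfft(signal)
keep = 12                              # keep only the lowest frequency bins
spectrum[keep:] = 0                    # crude low-pass filter
denoised = np.fft.irfft(spectrum, n=t.size)

print("mean abs error before:", np.abs(signal - periodic).mean())
print("mean abs error after: ", np.abs(denoised - periodic).mean())
```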
[IR-2] Theoretical Guarantees for LT-TTD: A Unified Transformer-based Architecture for Two-Level Ranking Systems
链接: https://arxiv.org/abs/2505.04434
作者: Ayoub Abraich
类目: Information Retrieval (cs.IR); Machine Learning (stat.ML)
*备注:
Abstract:Modern recommendation and search systems typically employ multi-stage ranking architectures to efficiently handle billions of candidates. The conventional approach uses distinct L1 (candidate retrieval) and L2 (re-ranking) models with different optimization objectives, introducing critical limitations including irreversible error propagation and suboptimal ranking. This paper identifies and analyzes the fundamental limitations of this decoupled paradigm and proposes LT-TTD (Listwise Transformer with Two-Tower Distillation), a novel unified architecture that bridges retrieval and ranking phases. Our approach combines the computational efficiency of two-tower models with the expressivity of transformers in a unified listwise learning framework. We provide a comprehensive theoretical analysis of our architecture and establish formal guarantees regarding error propagation mitigation, ranking quality improvements, and optimization convergence. We derive theoretical bounds showing that LT-TTD reduces the upper limit on irretrievable relevant items by a factor that depends on the knowledge distillation strength, and prove that our multi-objective optimization framework achieves a provably better global optimum than disjoint training. Additionally, we analyze the computational complexity of our approach, demonstrating that the asymptotic complexity remains within practical bounds for real-world applications. We also introduce UPQE, a novel evaluation metric specifically designed for unified ranking architectures that holistically captures retrieval quality, ranking performance, and computational efficiency.
[IR-3] LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders
链接: https://arxiv.org/abs/2505.04421
作者: Zheng Chai,Qin Ren,Xijun Xiao,Huizhi Yang,Bo Han,Sijun Zhang,Di Chen,Hui Lu,Wenlin Zhao,Lele Yu,Xionghang Xie,Shiru Ren,Xiang Sun,Yaocheng Tan,Peng Xu,Yuchao Zheng,Di Wu
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Modeling ultra-long user behavior sequences is critical for capturing both long- and short-term preferences in industrial recommender systems. Existing solutions typically rely on two-stage retrieval or indirect modeling paradigms, incurring upstream-downstream inconsistency and computational inefficiency. In this paper, we present LONGER, a Long-sequence Optimized traNsformer for GPU-Efficient Recommenders. LONGER incorporates (i) a global token mechanism for stabilizing attention over long contexts, (ii) a token merge module with lightweight InnerTransformers and a hybrid attention strategy to reduce quadratic complexity, and (iii) a series of engineering optimizations, including training with mixed precision and activation recomputation, KV cache serving, and a fully synchronous model training and serving framework for unified GPU-based dense and sparse parameter updates. LONGER consistently outperforms strong baselines in both offline metrics and online A/B testing across advertising and e-commerce services at ByteDance, validating its consistent effectiveness and industrial-level scaling laws. Currently, LONGER has been fully deployed in more than 10 influential scenarios at ByteDance, serving billions of users.
[IR-4] CDE-Mapper: Using Retrieval-Augmented Language Models for Linking Clinical Data Elements to Controlled Vocabularies
链接: https://arxiv.org/abs/2505.04365
作者: Komal Gilani,Marlo Verket,Christof Peters,Michel Dumontier,Hans-Peter Brunner-La Rocca,Visara Urovi
类目: Information Retrieval (cs.IR)
*备注: 25 pages (one column), 7 figures
Abstract:The standardization of clinical data elements (CDEs) aims to ensure consistent and comprehensive patient information across various healthcare systems. Existing methods often falter when standardizing CDEs of varying representation and complex structure, impeding data integration and interoperability in clinical research. We introduce CDE-Mapper, an innovative framework that leverages a Retrieval-Augmented Generation approach combined with Large Language Models to automate the linking of CDEs to controlled vocabularies. Our modular approach features query decomposition to manage varying levels of CDE complexity, integrates expert-defined rules within prompt engineering, and employs in-context learning alongside multiple retriever components to resolve terminological ambiguities. In addition, we propose a knowledge reservoir validated by a human-in-the-loop approach, achieving accurate concept linking for future applications while minimizing computational costs. Across four diverse datasets, CDE-Mapper achieved an average accuracy improvement of 7.2% over baseline methods. This work highlights the potential of advanced language models to improve data harmonization and significantly advance capabilities in clinical decision support systems and research.
[IR-5] Towards Large-scale Generative Ranking
链接: https://arxiv.org/abs/2505.04180
作者: Yanhua Huang,Yuqi Chen,Xiong Cao,Rui Yang,Mingliang Qi,Yinghao Zhu,Qingchang Han,Yaowei Liu,Zhaoyu Liu,Xuefeng Yao,Yuting Jia,Leilei Ma,Yinqi Zhang,Taoyu Zhu,Liujie Zhang,Lei Chen,Weihang Chen,Min Zhu,Ruiwen Xu,Lei Zhang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Generative recommendation has recently emerged as a promising paradigm in information retrieval. However, generative ranking systems are still understudied, particularly with respect to their effectiveness and feasibility in large-scale industrial settings. This paper investigates this topic at the ranking stage of Xiaohongshu’s Explore Feed, a recommender system that serves hundreds of millions of users. Specifically, we first examine how generative ranking outperforms current industrial recommenders. Through theoretical and empirical analyses, we find that the primary improvement in effectiveness stems from the generative architecture, rather than the training paradigm. To facilitate efficient deployment of generative ranking, we introduce RankGPT, a novel generative architecture for ranking. We validate the effectiveness and efficiency of our solution through online A/B experiments. The results show that RankGPT achieves significant improvements in user satisfaction with nearly equivalent computational resources compared to the existing production system.
[IR-6] An Adaptive Data-Resilient Multi-Modal Framework for Hierarchical Multi-Label Book Genre Identification
链接: https://arxiv.org/abs/2505.03839
作者: Utsav Kumar Nareti,Soumi Chattopadhyay,Prolay Mallick,Suraj Kumar,Ayush Vikas Daga,Chandranath Adak,Adarsh Wase,Arjab Roy
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Identifying the finer details of a book’s genres enhances user experience by enabling efficient book discovery and personalized recommendations, ultimately improving reader engagement and satisfaction. It also provides valuable insights into market trends and consumer preferences, allowing publishers and marketers to make data-driven decisions regarding book production and marketing strategies. While traditional book genre classification methods primarily rely on review data or textual analysis, incorporating additional modalities, such as book covers, blurbs, and metadata, can offer richer context and improve prediction accuracy. However, the presence of incomplete or noisy information across these modalities presents a significant challenge. This paper introduces IMAGINE (Intelligent Multi-modal Adaptive Genre Identification NEtwork), a framework designed to address these complexities. IMAGINE extracts robust feature representations from multiple modalities and dynamically selects the most informative sources based on data availability. It employs a hierarchical classification strategy to capture genre relationships and remains adaptable to varying input conditions. Additionally, we curate a hierarchical genre classification dataset that structures genres into a well-defined taxonomy, accommodating the diverse nature of literary works. IMAGINE integrates information from multiple sources and assigns multiple genre labels to each book, ensuring a more comprehensive classification. A key feature of our framework is its resilience to incomplete data, enabling accurate predictions even when certain modalities, such as text, images, or metadata, are missing or incomplete. Experimental results show that IMAGINE outperformed existing baselines in genre classification accuracy, particularly in scenarios with insufficient modality-specific data.