This post lists the latest papers fetched from Arxiv.org on 2025-05-01, updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the digest by email, please leave your email address in the comments.
Note: The paper data is fetched from Arxiv.org and updated automatically around 12:00 every day.
Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2025-05-01)
412 papers were updated today, including:
- Natural Language Processing: 60 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 132 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 78 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 97 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] TRUST: An LLM-Based Dialogue System for Trauma Understanding and Structured Assessments
【Quick Read】: This paper tackles the limited accessibility of mental healthcare, specifically the absence of dialogue systems supporting standardized diagnostic interviews and assessments. The key to the solution is TRUST, an LLM-based dialogue system that replicates clinician behavior: a Dialogue Acts schema designed for clinical interviews guides the generation of appropriate clinical responses, and a patient simulation approach built from real-life interview transcripts enables systematic evaluation, making diagnostic interviews and assessments efficient and low-cost.
Link: https://arxiv.org/abs/2504.21851
Authors: Sichang Tu, Abigail Powers, Stephen Doogan, Jinho D. Choi
Affiliations: Emory University; DooGood Foundation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 figures, 4 tables
Abstract:Objectives: While Large Language Models (LLMs) have been widely used to assist clinicians and support patients, no existing work has explored dialogue systems for standard diagnostic interviews and assessments. This study aims to bridge the gap in mental healthcare accessibility by developing an LLM-powered dialogue system that replicates clinician behavior. Materials and Methods: We introduce TRUST, a framework of cooperative LLM modules capable of conducting formal diagnostic interviews and assessments for Post-Traumatic Stress Disorder (PTSD). To guide the generation of appropriate clinical responses, we propose a Dialogue Acts schema specifically designed for clinical interviews. Additionally, we develop a patient simulation approach based on real-life interview transcripts to replace time-consuming and costly manual testing by clinicians. Results: A comprehensive set of evaluation metrics is designed to assess the dialogue system from both the agent and patient simulation perspectives. Expert evaluations by conversation and clinical specialists show that TRUST performs comparably to real-life clinical interviews. Discussion: Our system performs at the level of average clinicians, with room for future enhancements in communication styles and response appropriateness. Conclusions: Our TRUST framework shows its potential to facilitate mental healthcare availability.
[NLP-1] DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition
【Quick Read】: This paper targets the performance bottleneck of large language models in formal theorem proving, particularly in the Lean 4 environment. The key to the solution is collecting initialization data through a recursive theorem-proving pipeline: DeepSeek-V3 decomposes complex problems into subgoals, the proofs of resolved subgoals are synthesized into a chain-of-thought process combined with DeepSeek-V3's step-by-step reasoning, and the result serves as a cold-start policy for reinforcement learning, unifying informal and formal mathematical reasoning in a single model.
Link: https://arxiv.org/abs/2504.21801
Authors: Z.Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z.F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, Chong Ruan
Affiliations: DeepSeek-AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process, combined with DeepSeek-V3’s step-by-step reasoning, to create an initial cold start for reinforcement learning. This process enables us to integrate both informal and formal mathematical reasoning into a unified model. The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench. In addition to standard benchmarks, we introduce ProverBench, a collection of 325 formalized problems, to enrich our evaluation, including 15 selected problems from the recent AIME competitions (years 24-25). Further evaluation on these 15 AIME problems shows that the model successfully solves 6 of them. In comparison, DeepSeek-V3 solves 8 of these problems using majority voting, highlighting that the gap between formal and informal mathematical reasoning in large language models is substantially narrowing.
[NLP-2] How Real Are Synthetic Therapy Conversations? Evaluating Fidelity in Prolonged Exposure Dialogues DATE
【Quick Read】: This paper targets data scarcity and privacy constraints in training and evaluating clinical models, proposing synthetic Prolonged Exposure (PE) therapy conversations as a scalable alternative data source. The key to the solution is a systematic comparison of real and synthetic dialogues using linguistic, structural, and protocol-specific metrics, plus PE-specific metrics grounded in linguistic analysis and semantic modeling that measure clinical fidelity beyond surface fluency, revealing where synthetic data fails to capture the subtle dynamics of therapeutic interactions.
Link: https://arxiv.org/abs/2504.21800
Authors: Suhas BN, Dominik Mattioli, Saeed Abdullah, Rosa I. Arriaga, Chris W. Wiese, Andrew M. Sherrill
Affiliations: Penn State University; Georgia Tech; Emory University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 11 pages, 5 tables, updated abstract and tables
Abstract:The growing adoption of synthetic data in healthcare is driven by privacy concerns, limited access to real-world data, and the high cost of annotation. This work explores the use of synthetic Prolonged Exposure (PE) therapeutic conversations for Post-Traumatic Stress Disorder (PTSD) as a scalable alternative for training and evaluating clinical models. We systematically compare real and synthetic dialogues using linguistic, structural, and protocol-specific metrics, including turn-taking patterns and treatment fidelity. We also introduce and evaluate PE-specific metrics derived from linguistic analysis and semantic modeling, offering a novel framework for assessing clinical fidelity beyond surface fluency. Our findings show that although synthetic data holds promise for mitigating data scarcity and protecting patient privacy, it can struggle to capture the subtle dynamics of therapeutic interactions. In our dataset, synthetic dialogues match structural features of real-world dialogues (e.g., speaker switch ratio: 0.98 vs. 0.99), however, synthetic interactions do not adequately reflect key fidelity markers (e.g., distress monitoring). We highlight gaps in existing evaluation frameworks and advocate for fidelity-aware metrics that go beyond surface fluency to uncover clinically significant failures. Our findings clarify where synthetic data can effectively complement real-world datasets – and where critical limitations remain.
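To make the structural comparison concrete, here is a minimal sketch of one turn-taking metric the abstract reports. The exact definition the authors use is not given here, so treating "speaker switch ratio" as the fraction of adjacent turns where the speaker changes is an assumption, and the toy dialogues below are illustrative only:

```python
# A minimal sketch of one structural fidelity metric (speaker switch
# ratio). Assumption: it is the fraction of adjacent turn pairs where
# the speaker changes.
def speaker_switch_ratio(turns):
    """turns: speaker labels in order, e.g. ['therapist', 'patient', ...]"""
    if len(turns) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(turns, turns[1:]))
    return switches / (len(turns) - 1)

real = ["therapist", "patient"] * 40               # strict alternation
synthetic = ["therapist", "patient", "patient"] * 20  # occasional double turn
print(f"real: {speaker_switch_ratio(real):.2f}, "
      f"synthetic: {speaker_switch_ratio(synthetic):.2f}")
```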
[NLP-3] SWE-smith: Scaling Data for Software Engineering Agents
【Quick Read】: This paper tackles the difficulty of collecting training data for language models (LMs) in software engineering: existing datasets are small and complex to build, limiting scalability and usability. The key to the solution is SWE-smith, a scalable pipeline that, given any Python codebase, constructs a corresponding execution environment and automatically synthesizes hundreds to thousands of task instances that break the codebase's existing tests, producing high-quality training data at scale.
Link: https://arxiv.org/abs/2504.21798
Authors: John Yang, Kilian Leret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, Diyi Yang
Affiliations: Stanford University; Princeton University; Independent; Alibaba Qwen
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at this https URL.
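The validation loop behind this kind of task synthesis can be sketched in a few lines: mutate the code, run the test suite, and keep only mutations that break at least one test. This is a toy illustration, not SWE-smith's actual pipeline; the repo path, target file, and string mutation below are hypothetical.

```python
# Toy illustration of "synthesize task instances that break existing
# tests": apply a mutation, run pytest, keep the mutation as a task
# only if the suite now fails.
import pathlib
import subprocess

def tests_fail(repo):
    """Run the repo's test suite; True if at least one test fails."""
    result = subprocess.run(["pytest", "-x", "-q"], cwd=repo,
                            capture_output=True, text=True)
    return result.returncode != 0

def try_mutation(repo, target, old, new):
    """Apply a one-off string mutation; report whether it breaks tests."""
    path = pathlib.Path(repo) / target
    original = path.read_text()
    path.write_text(original.replace(old, new, 1))
    broke_tests = tests_fail(repo)
    path.write_text(original)  # always restore; store the diff elsewhere
    return broke_tests

# Hypothetical usage: flip a comparison operator in one function.
# try_mutation("my_repo", "src/utils.py", "<=", "<")
```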
[NLP-4] WebThinker: Empowering Large Reasoning Models with Deep Research Capability
【Quick Read】: This paper addresses two limitations of large reasoning models (LRMs): reliance on static internal knowledge limits performance on complex, knowledge-intensive tasks, and the models struggle to produce comprehensive research reports that synthesize diverse web information. The key to the solution is WebThinker, a deep research agent built around a Deep Web Explorer module that lets LRMs dynamically search, navigate, and extract web information when they encounter knowledge gaps; an autonomous Think-Search-and-Draft strategy that interleaves reasoning, information gathering, and report writing in real time; and an RL-based training strategy using iterative online Direct Preference Optimization (DPO) to improve research tool use.
Link: https://arxiv.org/abs/2504.21776
Authors: Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, Zhicheng Dou
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at this https URL.
[NLP-5] MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness
【Quick Read】: This paper addresses hallucination in large language models (LLMs) under the multi-problem setting, where models struggle to recognize the boundaries of their own parametric knowledge. The key to the solution is Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), which separates the learning of answer prediction from confidence estimation during fine-tuning on instruction data, improving accuracy and reliability in multi-task scenarios.
Link: https://arxiv.org/abs/2504.21773
Authors: Junsheng Huang, Zhitao He, Sandeep Polisetty, Qingyun Wang, May Fung
Affiliations: Hong Kong University of Science and Technology; University of Illinois; UMass Amherst
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:With the widespread application of large language models (LLMs), the issue of generating non-existing facts, known as hallucination, has garnered increasing attention. Previous research in enhancing LLM confidence estimation mainly focuses on the single problem setting. However, LLM awareness of its internal parameterized knowledge boundary under the more challenging multi-problem setting, which requires answering multiple problems accurately simultaneously, remains underexplored. To bridge this gap, we introduce a novel method, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.
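As a rough illustration of separating the two learning signals, one can build two supervised examples per multi-problem instance: one teaching answers, one teaching per-answer confidence. The prompt and label format below is an assumption for illustration, not the paper's exact recipe:

```python
# Sketch: build separate instruction-tuning examples for answer
# prediction and for confidence estimation (format assumed).
def build_mac_examples(questions, answers, confidences):
    answer_ex = {
        "instruction": "Answer each question.\n" +
                       "\n".join(f"Q{i+1}: {q}" for i, q in enumerate(questions)),
        "output": "\n".join(f"A{i+1}: {a}" for i, a in enumerate(answers)),
    }
    confidence_ex = {
        "instruction": "For each answer below, state whether you are "
                       "confident it is correct (yes/no).\n" +
                       "\n".join(f"A{i+1}: {a}" for i, a in enumerate(answers)),
        "output": "\n".join(f"C{i+1}: {'yes' if c else 'no'}"
                            for i, c in enumerate(confidences)),
    }
    return [answer_ex, confidence_ex]  # trained as separate steps

print(build_mac_examples(["2+2?", "Capital of France?"],
                         ["4", "Paris"], [True, True]))
```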
[NLP-6] CodeFlowBench: A Multi-turn Iterative Benchmark for Complex Code Generation
【Quick Read】: This paper addresses evaluating large language models (LLMs) on multi-turn, iterative code reuse ("codeflow"), i.e., implementing new functionality by reusing existing functions across multiple turns. The key to the solution is CodeFlowBench, a benchmark of 5,258 problems drawn from Codeforces, with an automated pipeline that decomposes each problem into function-level subproblems based on its dependency tree and pairs each subproblem with unit tests, plus a novel evaluation framework with tasks and metrics tailored to multi-turn code reuse.
Link: https://arxiv.org/abs/2504.21751
Authors: Sizhe Wang, Zhengren Wang, Dongsheng Ma, Yongan Yu, Rui Ling, Zhiyu Li, Feiyu Xiong, Wentao Zhang
Affiliations: Center for Data Science, Peking University; Shanghai University of Finance and Economics; Institute for Advanced Algorithms Research, Shanghai
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:
Abstract:Real world development demands code that is readable, extensible, and testable by organizing the implementation into modular components and iteratively reusing pre-implemented code. We term this iterative, multi-turn process codeflow and introduce CodeFlowBench, the first benchmark designed for comprehensively evaluating LLMs’ ability to perform codeflow, namely to implement new functionality by reusing existing functions over multiple turns. CodeFlowBench comprises 5258 problems drawn from Codeforces and is continuously updated via an automated pipeline that decomposes each problem into a series of function-level subproblems based on its dependency tree and each subproblem is paired with unit tests. We further propose a novel evaluation framework with tasks and metrics tailored to multi-turn code reuse to assess model performance. In experiments across various LLMs under both multi-turn and single-turn patterns, we observe poor model performance on CodeFlowBench, with a substantial performance drop in the iterative codeflow scenario. For instance, o1-mini achieves a pass@1 of 20.8% in multi-turn pattern versus 37.8% in single-turn pattern. Further analysis shows that different models excel at different dependency depths, yet all struggle to correctly solve structurally complex problems, highlighting challenges for current LLMs to serve as code generation tools when performing codeflow. Overall, CodeFlowBench offers a comprehensive benchmark and new insights into LLM capabilities for multi-turn, iterative code generation, guiding future advances in code generation tasks.
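The abstract reports pass@1 numbers. For reference, here is the standard unbiased pass@k estimator (Chen et al., 2021) that commonly underlies such figures, where n is the number of samples, c the number that passed, and k the attempt budget; whether CodeFlowBench uses exactly this scoring code is an assumption:

```python
# Standard unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
# computed in the numerically stable product form.
import numpy as np

def pass_at_k(n, c, k):
    """n: samples generated, c: samples that passed, k: budget."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=10, c=3, k=1))  # expected success rate with one try
```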
[NLP-7] Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data
【Quick Read】: This paper addresses the dependence of conventional retrieval-augmented neural machine translation (RANMT) systems on bilingual corpora such as translation memories (TMs), even though in many practical settings in-domain monolingual target-side corpora are far more plentiful. The key to the solution is improved cross-lingual retrieval systems trained with both sentence-level and word-level matching objectives, which retrieve relevant segments directly in the target language and thus exploit target-side monolingual resources; experiments show gains over standard TM-based baselines in a controlled setting and large improvements in a real-world setup.
Link: https://arxiv.org/abs/2504.21747
Authors: Maxime Bouthors, Josep Crego, François Yvon
Affiliations: SYSTRAN by ChapsVision; Sorbonne Université
Subjects: Computation and Language (cs.CL)
Comments: 13 pages
Abstract:Conventional retrieval-augmented neural machine translation (RANMT) systems leverage bilingual corpora, e.g., translation memories (TMs). Yet, in many settings, in-domain monolingual target-side corpora are often available. This work explores ways to take advantage of such resources by retrieving relevant segments directly in the target language, based on a source-side query. For this, we design improved cross-lingual retrieval systems, trained with both sentence level and word-level matching objectives. In our experiments with two RANMT architectures, we first demonstrate the benefits of such cross-lingual objectives in a controlled setting, obtaining translation performances that surpass standard TM-based models. We then showcase our method on a real-world set-up, where the target monolingual resources far exceed the amount of parallel data and observe large improvements of our new techniques, which outperform both the baseline setting, and general-purpose cross-lingual retrievers.
[NLP-8] Investigating Literary Motifs in Ancient and Medieval Novels with Large Language Models
【Quick Read】: This paper investigates which literary motifs the ancient Greek fictional narratives often termed love novels or romances have in common, and in which ways they differ from each other. The key to the solution is applying fine-tuned large language models to extract literary motifs according to a set definition, providing data for both quantitative and qualitative analyses.
Link: https://arxiv.org/abs/2504.21742
Authors: Emelie Hallenberg
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The Greek fictional narratives often termed love novels or romances, ranging from the first century CE to the middle of the 15th century, have long been considered as similar in many ways, not least in the use of particular literary motifs. By applying the use of fine-tuned large language models, this study aims to investigate which motifs exactly that the texts in this corpus have in common, and in which ways they differ from each other. The results show that while some motifs persist throughout the corpus, others fluctuate in frequency, indicating certain trends or external influences. Conclusively, the method proves to adequately extract literary motifs according to a set definition, providing data for both quantitative and qualitative analyses.
[NLP-9] LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics
【Quick Read】: This paper addresses autonomous household object management by robots, combining memory-augmented task planning with a multi-agent architecture for efficient task execution and long-term object tracking. The key to the solution is an LLM-driven agent-orchestration architecture with three specialized agents, retrieval-augmented generation (RAG) for memory, and vision-language models for semantic scene understanding, improving task planning accuracy and memory recall.
Link: https://arxiv.org/abs/2504.21716
Authors: Marc Glocker, Peter Hönig, Matthias Hirschmanner, Markus Vincze
Affiliations: TU Wien; AIT Austrian Institute of Technology GmbH
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at Austrian Robotics Workshop 2025
Abstract:We present an embodied robotic system with an LLM-driven agent-orchestration architecture for autonomous household object management. The system integrates memory-augmented task planning, enabling robots to execute high-level user commands while tracking past actions. It employs three specialized agents: a routing agent, a task planning agent, and a knowledge base agent, each powered by task-specific LLMs. By leveraging in-context learning, our system avoids the need for explicit model training. RAG enables the system to retrieve context from past interactions, enhancing long-term object tracking. A combination of Grounded SAM and LLaMa3.2-Vision provides robust object detection, facilitating semantic scene understanding for task planning. Evaluation across three household scenarios demonstrates high task planning accuracy and an improvement in memory recall due to RAG. Specifically, Qwen2.5 yields best performance for specialized agents, while LLaMA3.1 excels in routing tasks. The source code is available at: this https URL.
[NLP-10] Enhancing Health Mention Classification Performance: A Study on Advancements in Parameter Efficient Tuning
【Quick Read】: This paper addresses the challenges of Health Mention Classification (HMC): accurately identifying health-related content in social media posts, which often hinges on context, figurative language, and descriptive terminology rather than explicit statements of personal illness. The key to the solution is conventional fine-tuning of biomedical NLP models with enhanced parameters, combining part-of-speech (POS) tagger information with parameter-efficient fine-tuning (PEFT) techniques, which significantly improves F1-score while keeping models small and training efficient.
Link: https://arxiv.org/abs/2504.21685
Authors: Reem Abdel-Salam, Mary Adewunmi
Affiliations: Cairo University; Menzies School of Health Research; Charles Darwin University; CaresAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages
Abstract:Health Mention Classification (HMC) plays a critical role in leveraging social media posts for real-time tracking and public health monitoring. Nevertheless, the process of HMC presents significant challenges due to its intricate nature, primarily stemming from the contextual aspects of health mentions, such as figurative language and descriptive terminology, rather than explicitly reflecting a personal ailment. To address this problem, we argue that clearer mentions can be achieved through conventional fine-tuning with enhanced parameters of biomedical natural language processing (NLP) methods. In this study, we explore different techniques such as the utilisation of part-of-speech (POS) tagger information, improving on PEFT techniques, and different combinations thereof. Extensive experiments are conducted on three widely used datasets: RHDM, PHM, and Illness. The results show that incorporating POS tagger information and leveraging PEFT techniques significantly improves performance in terms of F1-score compared to state-of-the-art methods across all three datasets, while utilising smaller models and efficient training. Furthermore, the findings highlight the effectiveness of incorporating POS tagger information and leveraging PEFT techniques for HMC. In conclusion, the proposed methodology presents a potentially effective approach to accurately classifying health mentions in social media posts while optimising the model size and training efficiency.
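As a minimal sketch of the PEFT side of this recipe, the following applies LoRA (via the Hugging Face peft library) to a biomedical sequence classifier; the base checkpoint, target modules, and hyperparameters are illustrative, not the paper's configuration:

```python
# Minimal LoRA setup with the HF `peft` library on a BERT-style
# biomedical classifier; only the low-rank adapters are trained.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-base-cased-v1.1", num_labels=2)
config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16,
                    lora_dropout=0.1, target_modules=["query", "value"])
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction is trainable
```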
[NLP-11] Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders
【Quick Read】: This paper addresses cross-lingual transfer for multilingual vision-language (VL) tasks, which arises because most pre-trained models and downstream task data are English-only. The key to the solution is transferring an already trained encoder using parallel data, rather than fine-tuning a multilingual pre-trained model or transferring only the text encoder, with a focus on the domain of the parallel data and the number of languages: machine-translated task data is best on average, caption-like authentic parallel data wins in some languages, and most languages benefit from multilingual training.
Link: https://arxiv.org/abs/2504.21681
Authors: Andrei-Alexandru Manea, Jindřich Libovický
Affiliations: Charles University, Faculty of Mathematics and Physics; Institute of Formal and Applied Linguistics
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Most pre-trained Vision-Language (VL) models and training data for the downstream tasks are only available in English. Therefore, multilingual VL tasks are solved using cross-lingual transfer: fine-tune a multilingual pre-trained model or transfer the text encoder using parallel data. We study the alternative approach: transferring an already trained encoder using parallel data. We investigate the effect of parallel data: domain and the number of languages, which were out of focus in previous work. Our results show that even machine-translated task data are the best on average, caption-like authentic parallel data outperformed it in some languages. Further, we show that most languages benefit from multilingual training.
[NLP-12] 20min-XD: A Comparable Corpus of Swiss News Articles
【Quick Read】: This paper builds a French-German document-level comparable corpus to support NLP applications and linguistically motivated studies. The key to the solution is automatically aligning news articles from the Swiss online news outlet 20 Minuten/20 minutes by semantic similarity, yielding around 15,000 article pairs spanning 2015 to 2024 with a broad spectrum of cross-lingual similarity, from near-translations to loosely related articles.
Link: https://arxiv.org/abs/2504.21677
Authors: Michelle Wastl, Jannis Vamvas, Selena Calleri, Rico Sennrich
Affiliations: University of Zurich; TX Group
Subjects: Computation and Language (cs.CL)
Comments: 10 pages; accepted at SwissText 2025
Abstract:We present 20min-XD (20 Minuten cross-lingual document-level), a French-German, document-level comparable corpus of news articles, sourced from the Swiss online news outlet 20 Minuten/20 minutes. Our dataset comprises around 15,000 article pairs spanning 2015 to 2024, automatically aligned based on semantic similarity. We detail the data collection process and alignment methodology. Furthermore, we provide a qualitative and quantitative analysis of the corpus. The resulting dataset exhibits a broad spectrum of cross-lingual similarity, ranging from near-translations to loosely related articles, making it valuable for various NLP applications and broad linguistically motivated studies. We publicly release the dataset in document- and sentence-aligned versions and code for the described experiments.
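A minimal sketch of alignment by semantic similarity might look as follows; the embedding model, threshold, and greedy best-match pairing are assumptions rather than the authors' exact procedure, and the article snippets are placeholders:

```python
# Sketch: embed articles in both languages with a multilingual
# encoder, then pair each French article with its most similar
# German article above a similarity threshold.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
fr_articles = ["Le Conseil fédéral a décidé ...", "Accident sur l'A1 ..."]
de_articles = ["Der Bundesrat hat entschieden ...", "Unfall auf der A1 ..."]

sim = util.cos_sim(model.encode(fr_articles), model.encode(de_articles))
for i in range(len(fr_articles)):
    j = int(sim[i].argmax())
    score = float(sim[i][j])
    if score > 0.7:  # keep only sufficiently similar pairs
        print(f"fr[{i}] <-> de[{j}] (score={score:.2f})")
```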
[NLP-13] AdaR1: From Long-CoT to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization
【Quick Read】: This paper addresses the substantial inference overhead of long-thought reasoning models, which perform well on complex reasoning tasks but at high cost. The key to the solution is a novel two-stage framework for adaptive, efficient reasoning: first, a hybrid reasoning model is built by merging Long-CoT and Short-CoT models to enable diverse reasoning styles; second, bi-level preference training guides the model to select a suitable reasoning style at the group level and to prefer concise, correct reasoning within each style group, significantly reducing inference cost while maintaining performance.
Link: https://arxiv.org/abs/2504.21659
Authors: Haotian Luo, Haiying He, Yibo Wang, Jinluan Yang, Rui Liu, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, Li Shen
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Recently, long-thought reasoning models achieve strong performance on complex reasoning tasks, but often incur substantial inference overhead, making efficiency a critical concern. Our empirical analysis reveals that the benefit of using Long-CoT varies across problems: while some problems require elaborate reasoning, others show no improvement, or even degraded accuracy. This motivates adaptive reasoning strategies that tailor reasoning depth to the input. However, prior work primarily reduces redundancy within long reasoning paths, limiting exploration of more efficient strategies beyond the Long-CoT paradigm. To address this, we propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models to enable diverse reasoning styles. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles (group-level), and prefer concise and correct reasoning within each style group (instance-level). Experiments demonstrate that our method significantly reduces inference costs compared to other baseline approaches, while maintaining performance. Notably, on five mathematical datasets, the average length of reasoning is reduced by more than 50%, highlighting the potential of adaptive strategies to optimize reasoning efficiency in large language models. Our code is coming soon at this https URL
[NLP-14] Sadeed: Advancing Arabic Diacritization Through Small Language Model
【Quick Read】: This paper addresses Arabic text diacritization, a persistent challenge in natural language processing due to the language's morphological richness. The key to the solution is Sadeed, a fine-tuned decoder-only language model adapted from the compact Kuwain 1.5B model (Hennara et al. [2025]) and trained on carefully curated, high-quality diacritized datasets built through a rigorous cleaning and normalization pipeline. Despite modest computational resources, Sadeed outperforms traditional models and is competitive with proprietary large language models.
Link: https://arxiv.org/abs/2504.21635
Authors: Zeina Aldallal, Sara Chrouf, Khalil Hennara, Mohamed Motaism Hamed, Muhammad Hreden, Safwan AlModhayan
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Arabic text diacritization remains a persistent challenge in natural language processing due to the language’s morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B Hennara et al. [2025], a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. Despite utilizing modest computational resources, Sadeed achieves competitive results compared to proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.
[NLP-15] Meeseeks: An Iterative Benchmark Evaluating LLMs Multi-Turn Instruction-Following Ability
【Quick Read】: This paper addresses the gap in evaluating how accurately large language models (LLMs) follow instructions in realistic use: existing benchmarks are either single-turn or add new requirements each turn without allowing self-correction, so they do not reflect real user-model interaction. The key to the solution, Meeseeks, is an iterative feedback process that simulates realistic human-LLM interaction, letting models self-correct based on specific requirement failures and better mirroring real-world usage patterns.
Link: https://arxiv.org/abs/2504.21625
Authors: Jiaming Wang
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The ability to follow instructions accurately is fundamental for Large Language Models (LLMs) to serve as reliable agents in real-world applications. While existing instruction-following benchmarks are either single-turn or introduce new requirements in each turn without allowing self-correction, Meeseeks simulates realistic human-LLM interactions through an iterative feedback process. This design enables models to self-correct based on specific requirement failures, better reflecting real-world user-end usage patterns. The benchmark implements a comprehensive evaluation system with 38 capability tags organized across three dimensions: Intent Recognition, Granular Content Validation, and Output Structure Validation. Through rigorous evaluation across LLMs, Meeseeks provides valuable insights into LLMs’ instruction-following capabilities in practical applications.
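The iterative feedback protocol can be sketched as a simple loop: the model answers, an automatic checker reports which requirements failed, and the model retries with that feedback. The `llm` and `check` callables below are placeholders, not the benchmark's implementation:

```python
# Sketch of a Meeseeks-style iterative feedback loop with
# self-correction on failed requirements.
def run_iterative(prompt, requirements, llm, check, max_turns=3):
    """llm(history) -> answer string; check(req, answer) -> bool."""
    history = [{"role": "user", "content": prompt}]
    answer = ""
    for _ in range(max_turns):
        answer = llm(history)
        history.append({"role": "assistant", "content": answer})
        failed = [r for r in requirements if not check(r, answer)]
        if not failed:
            return answer, True  # all requirements satisfied
        feedback = ("These requirements were not met:\n- "
                    + "\n- ".join(failed))
        history.append({"role": "user", "content": feedback})
    return answer, False
```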
[NLP-16] RDF-Based Structured Quality Assessment Representation of Multilingual LLM Evaluations
【Quick Read】: This paper addresses the difficulty of systematically assessing the reliability of large language models (LLMs) when they face conflicting information. The key to the solution is an RDF-based framework for assessing multilingual LLM quality with a focus on knowledge conflicts: model responses are captured under four context conditions (complete, incomplete, conflicting, and no-context) in German and English, enabling systematic analysis of knowledge leakage, error detection, and multilingual consistency.
Link: https://arxiv.org/abs/2504.21605
Authors: Jonas Gwozdz, Andreas Both
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Large Language Models (LLMs) increasingly serve as knowledge interfaces, yet systematically assessing their reliability with conflicting information remains difficult. We propose an RDF-based framework to assess multilingual LLM quality, focusing on knowledge conflicts. Our approach captures model responses across four distinct context conditions (complete, incomplete, conflicting, and no-context information) in German and English. This structured representation enables the comprehensive analysis of knowledge leakage-where models favor training data over provided context-error detection, and multilingual consistency. We demonstrate the framework through a fire safety domain experiment, revealing critical patterns in context prioritization and language-specific performance, and demonstrating that our vocabulary was sufficient to express every assessment facet encountered in the 28-question study.
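To illustrate what an RDF record of one evaluation might look like, here is a sketch using rdflib; the namespace and property names are invented for illustration and are not the paper's actual vocabulary:

```python
# Sketch: one model response under the "conflicting" context condition,
# recorded as RDF triples (vocabulary invented for illustration).
from rdflib import Graph, Literal, Namespace, RDF

EVAL = Namespace("http://example.org/llm-eval#")
g = Graph()
resp = EVAL["response-001"]
g.add((resp, RDF.type, EVAL.ModelResponse))
g.add((resp, EVAL.model, Literal("gpt-4")))
g.add((resp, EVAL.language, Literal("de")))
g.add((resp, EVAL.contextCondition, Literal("conflicting")))
g.add((resp, EVAL.answerText, Literal("Feuerlöscher Typ ABC verwenden.")))
g.add((resp, EVAL.usedProvidedContext, Literal(False)))  # knowledge leakage
print(g.serialize(format="turtle"))
```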
[NLP-17] Robust Misinformation Detection by Visiting Potential Commonsense Conflict IJCAI2025
【Quick Read】: This paper addresses online Misinformation Detection (MD), and in particular fake articles driven by commonsense conflict. The key to the solution is a novel plug-and-play augmentation method, Misinformation Detection with Potential Commonsense Conflict (MD-PCC), which constructs commonsense expressions for articles that capture potential commonsense conflicts, inferred from the difference between extracted commonsense triplets and golden ones produced by the well-established commonsense reasoning tool COMET; these expressions are then used as data augmentation for any MD method.
Link: https://arxiv.org/abs/2504.21604
Authors: Bing Wang, Ximing Li, Changchun Li, Bingrui Zhao, Bo Fu, Renchu Guan, Shengsheng Wang
Affiliations: Jilin University; Liaoning Normal University
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 11 pages, 2 figures. Accepted by IJCAI 2025. Code: this https URL
Abstract:The development of Internet technology has led to an increased prevalence of misinformation, causing severe negative effects across diverse domains. To mitigate this challenge, Misinformation Detection (MD), aiming to detect online misinformation automatically, emerges as a rapidly growing research topic in the community. In this paper, we propose a novel plug-and-play augmentation method for the MD task, namely Misinformation Detection with Potential Commonsense Conflict (MD-PCC). We take inspiration from the prior studies indicating that fake articles are more likely to involve commonsense conflict. Accordingly, we construct commonsense expressions for articles, serving to express potential commonsense conflicts inferred by the difference between extracted commonsense triplet and golden ones inferred by the well-established commonsense reasoning tool COMET. These expressions are then specified for each article as augmentation. Any specific MD methods can be then trained on those commonsense-augmented articles. Besides, we also collect a novel commonsense-oriented dataset named CoMis, whose all fake articles are caused by commonsense conflict. We integrate MD-PCC with various existing MD backbones and compare them across both 4 public benchmark datasets and CoMis. Empirical results demonstrate that MD-PCC can consistently outperform the existing MD baselines.
[NLP-18] DNB-AI-Project at SemEval-2025 Task 5: An LLM-Ensemble Approach for Automated Subject Indexing SEMEVAL-2025
【Quick Read】: This paper addresses LLM-based automated subject tagging for a national technical library's open-access catalog. The key to the solution is few-shot prompting: several LLMs are shown examples of intellectually annotated records and asked to suggest keywords for new records, followed by post-processing steps that map the generated keywords to the target vocabulary, aggregate the results into an ensemble vote, and rank the final subject terms by their relevance to the record.
Link: https://arxiv.org/abs/2504.21589
Authors: Lisa Kluge, Maximilian Kähler
Affiliations: Deutsche Nationalbibliothek
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 11 pages, 4 figures, submitted to SemEval-2025 workshop Task 5: LLMs4Subjects
Abstract:This paper presents our system developed for the SemEval-2025 Task 5: LLMs4Subjects: LLM-based Automated Subject Tagging for a National Technical Library’s Open-Access Catalog. Our system relies on prompting a selection of LLMs with varying examples of intellectually annotated records and asking the LLMs to similarly suggest keywords for new records. This few-shot prompting technique is combined with a series of post-processing steps that map the generated keywords to the target vocabulary, aggregate the resulting subject terms to an ensemble vote and, finally, rank them as to their relevance to the record. Our system is fourth in the quantitative ranking in the all-subjects track, but achieves the best result in the qualitative ranking conducted by subject indexing experts.
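The post-processing chain (vocabulary mapping, ensemble vote, ranking) can be sketched as follows; the fuzzy string matching is a stand-in for the actual vocabulary-mapping step, and the cutoff value is an assumption:

```python
# Sketch: map each LLM's suggested keywords onto the target vocabulary,
# then rank terms by how many LLMs voted for them.
from collections import Counter
from difflib import get_close_matches

def ensemble_rank(suggestions_per_llm, vocabulary, cutoff=0.8):
    votes = Counter()
    for keywords in suggestions_per_llm:
        mapped = set()  # one vote per LLM per vocabulary term
        for kw in keywords:
            match = get_close_matches(kw, vocabulary, n=1, cutoff=cutoff)
            if match:
                mapped.add(match[0])
        votes.update(mapped)
    return [term for term, _ in votes.most_common()]

vocab = ["Machine learning", "Digital libraries", "Metadata"]
print(ensemble_rank([["machine-learning", "metadata"],
                     ["Machine Learning"]], vocab))
```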
[NLP-19] Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models NAACL2025
【Quick Read】: This paper addresses object hallucination in Large Vision Language Models (LVLMs), which undermines their reliability. The key to the solution is Black-Box Visual Prompt Engineering (BBVPE): a pool of candidate visual prompts (VPs) is maintained, and a router model is trained to dynamically select the most effective VP for a given input image, improving LVLM responses without access to model internals.
Link: https://arxiv.org/abs/2504.21559
Authors: Sangmin Woo, Kang Zhou, Yun Zhou, Shuai Wang, Sheng Guan, Haibo Ding, Lin Lee Cheong
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: NAACL 2025
Abstract:Large Vision Language Models (LVLMs) often suffer from object hallucination, which undermines their reliability. Surprisingly, we find that simple object-based visual prompting – overlaying visual cues (e.g., bounding box, circle) on images – can significantly mitigate such hallucination; however, different visual prompts (VPs) vary in effectiveness. To address this, we propose Black-Box Visual Prompt Engineering (BBVPE), a framework to identify optimal VPs that enhance LVLM responses without needing access to model internals. Our approach employs a pool of candidate VPs and trains a router model to dynamically select the most effective VP for a given input image. This black-box approach is model-agnostic, making it applicable to both open-source and proprietary LVLMs. Evaluations on benchmarks such as POPE and CHAIR demonstrate that BBVPE effectively reduces object hallucination.
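The object-based visual prompting that BBVPE builds on is easy to illustrate: overlay a simple cue on the image before sending it to the LVLM. A minimal sketch with Pillow follows; the file name and box coordinates are illustrative:

```python
# Sketch: overlay a bounding-box visual prompt on an image.
from PIL import Image, ImageDraw

def add_box_prompt(image_path, box, color="red", width=4):
    """box: (left, top, right, bottom) pixel coordinates."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline=color, width=width)
    return img

# prompted = add_box_prompt("photo.jpg", box=(40, 60, 220, 300))
# prompted.save("photo_prompted.jpg")  # then send to the LVLM
```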
[NLP-20] Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models
【Quick Read】: This paper addresses the deployment and inference challenges posed by the size of Large Language Models (LLMs); existing quantization techniques handle activation outliers poorly. The key to the solution is a mixed-precision quantization method tailored to the LLaMA architecture and its derivatives: the specific projection layers where activation spikes concentrate are kept at higher precision (FP16 or FP8) while the rest of the model is quantized to lower bit-widths, preserving performance while improving efficiency on LLaMA2, LLaMA3, and Mistral models.
Link: https://arxiv.org/abs/2504.21553
Authors: Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks. However, their size presents significant challenges for deployment and inference. This paper investigates the quantization of LLMs, focusing on the LLaMA architecture and its derivatives. We challenge existing assumptions about activation outliers in LLMs and propose a novel mixed-precision quantization approach tailored for LLaMA-like models. Our method leverages the observation that activation spikes in LLaMA architectures are predominantly concentrated in specific projection layers. By applying higher precision (FP16 or FP8) to these layers while quantizing the rest of the model to lower bit-widths, we achieve superior performance compared to existing quantization techniques. Experimental results on LLaMA2, LLaMA3, and Mistral models demonstrate significant improvements in perplexity and zero-shot accuracy, particularly for 8-bit per-tensor quantization. Our approach outperforms general-purpose methods designed to handle outliers across all architecture types, highlighting the benefits of architecture-specific quantization strategies. This research contributes to the ongoing efforts to make LLMs more efficient and deployable, potentially enabling their use in resource-constrained environments. Our findings emphasize the importance of considering model-specific characteristics in developing effective quantization pipelines for state-of-the-art language models by identifying and targeting a small number of projections that concentrate activation spikes.
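A sketch of the diagnostic step such a method needs: record per-layer activation peaks with forward hooks over a calibration set, then flag the linear projections whose activations spike. Thresholding at a multiple of the median is an assumption for illustration, not the paper's exact rule; `calibration_batches` is assumed to yield model-ready inputs:

```python
# Sketch: find projection layers with outsized activation peaks,
# which would then be kept at FP16/FP8 while the rest is quantized.
import torch

def find_spike_layers(model, calibration_batches, factor=20.0):
    stats, hooks = {}, []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            def hook(mod, inp, out, name=name):
                peak = out.detach().abs().max().item()
                stats[name] = max(stats.get(name, 0.0), peak)
            hooks.append(module.register_forward_hook(hook))
    with torch.no_grad():
        for batch in calibration_batches:
            model(batch)
    for h in hooks:
        h.remove()
    median = sorted(stats.values())[len(stats) // 2]
    return [n for n, peak in stats.items() if peak > factor * median]
```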
[NLP-21] TartuNLP at SemEval-2025 Task 5: Subject Tagging as Two-Stage Information Retrieval SEMEVAL-2025
【Quick Read】: This paper addresses the burden on librarians of assigning subject tags to library records, aiming to assist by producing a list of likely relevant tags for a given document. The key to the solution is framing the task as information retrieval and building a two-stage system from two types of encoder models: a bi-encoder for coarse-grained candidate extraction in the first stage and a cross-encoder for fine-grained re-ranking in the second, which significantly improves recall over single-stage methods and achieves competitive results.
Link: https://arxiv.org/abs/2504.21547
Authors: Aleksei Dorkin, Kairit Sirts
Affiliations: University of Tartu
Subjects: Computation and Language (cs.CL)
Comments: To appear in the Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
Abstract:We present our submission to the Task 5 of SemEval-2025 that aims to aid librarians in assigning subject tags to the library records by producing a list of likely relevant tags for a given document. We frame the task as an information retrieval problem, where the document content is used to retrieve subject tags from a large subject taxonomy. We leverage two types of encoder models to build a two-stage information retrieval system – a bi-encoder for coarse-grained candidate extraction at the first stage, and a cross-encoder for fine-grained re-ranking at the second stage. This approach proved effective, demonstrating significant improvements in recall compared to single-stage methods and showing competitive results according to qualitative evaluation.
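A minimal sketch of the two-stage design with the sentence-transformers library follows; the checkpoints are common public models and the tiny tag list is a stand-in for the subject taxonomy, not necessarily what the team used:

```python
# Stage 1: bi-encoder retrieves coarse candidates from the taxonomy.
# Stage 2: cross-encoder re-ranks (document, tag) pairs.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

tags = ["Machine learning", "Library science", "Thermodynamics"]
tag_emb = bi_encoder.encode(tags, convert_to_tensor=True)

def suggest_tags(document, coarse_k=2, final_k=1):
    doc_emb = bi_encoder.encode(document, convert_to_tensor=True)
    hits = util.semantic_search(doc_emb, tag_emb, top_k=coarse_k)[0]
    candidates = [tags[h["corpus_id"]] for h in hits]               # stage 1
    scores = cross_encoder.predict([(document, c) for c in candidates])
    return sorted(zip(candidates, scores), key=lambda x: -x[1])[:final_k]

print(suggest_tags("Neural networks for text classification"))
```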
[NLP-22] Improving Informally Romanized Language Identification
【Quick Read】: This paper addresses language identification (LID) accuracy for romanized text, where languages with non-Latin native scripts are written informally in the Latin script and high spelling variability makes otherwise easily distinguished languages, such as Hindi and Urdu, highly confusable. The key to the solution is improving how synthetic training sets are generated: training on synthetic samples that incorporate natural spelling variation yields higher LID accuracy, reaching 85.4% F1 on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set with a linear classifier trained solely on synthetic data, and 88.2% when harvested text is also used.
Link: https://arxiv.org/abs/2504.21540
Authors: Adrian Benton, Alexander Gutkin, Christo Kirov, Brian Roark
Affiliations: Google Research
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 14 tables, 4 figures
Abstract:The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), there is no conventional spelling of words in the Latin script, hence there will be high spelling variability in written text. Such romanization renders languages that are normally easily distinguished based on script highly confusable, such as Hindi and Urdu. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.
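The modeling side is straightforward to sketch: a linear classifier over character n-grams trained on romanized text. The toy examples below stand in for the synthesized training sets with spelling variation; the feature and model choices are illustrative, not the paper's exact setup:

```python
# Sketch: linear LID over character n-grams; training spellings vary
# to mimic synthetic romanization with natural spelling variation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["aap kaise hain", "ap kese hain",    # Hindi, two spellings
               "tusi kiven ho", "tussi kivein ho"]  # Punjabi, two spellings
train_langs = ["hi", "hi", "pa", "pa"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
    LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_langs)
print(clf.predict(["aap kese ho"]))  # spelling variant unseen in training
```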
[NLP-23] Advancing Arabic Reverse Dictionary Systems: A Transformer-Based Approach with Dataset Construction Guidelines
【Quick Read】: This paper addresses a critical gap in Arabic natural language processing: building an effective Arabic Reverse Dictionary (RD) system that lets users find words from their descriptions or meanings. The key to the solution is a novel transformer-based semi-encoder neural network architecture with geometrically decreasing layers, which achieves state-of-the-art results for Arabic RD tasks; the work also contributes a comprehensive dataset-construction process, formal quality standards for Arabic lexicographic definitions, and a modular, extensible Python library (RDTL) with configurable training pipelines.
Link: https://arxiv.org/abs/2504.21475
Authors: Serry Sibaee, Samar Ahmed, Abdullah Al Harbi, Omer Nacar, Adel Ammar, Yasser Habashi, Wadii Boulila
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:This study addresses the critical gap in Arabic natural language processing by developing an effective Arabic Reverse Dictionary (RD) system that enables users to find words based on their descriptions or meanings. We present a novel transformer-based approach with a semi-encoder neural network architecture featuring geometrically decreasing layers that achieves state-of-the-art results for Arabic RD tasks. Our methodology incorporates a comprehensive dataset construction process and establishes formal quality standards for Arabic lexicographic definitions. Experiments with various pre-trained models demonstrate that Arabic-specific models significantly outperform general multilingual embeddings, with ARBERTv2 achieving the best ranking score (0.0644). Additionally, we provide a formal abstraction of the reverse dictionary task that enhances theoretical understanding and develop a modular, extensible Python library (RDTL) with configurable training pipelines. Our analysis of dataset quality reveals important insights for improving Arabic definition construction, leading to eight specific standards for building high-quality reverse dictionary resources. This work contributes significantly to Arabic computational linguistics and provides valuable tools for language learning, academic writing, and professional communication in Arabic.
[NLP-24] Homa at SemEval-2025 Task 5: Aligning Librarian Records with OntoAligner for Subject Tagging SEMEVAL2025
【Quick Read】: This paper addresses automatic subject tagging of technical records: assigning subject labels to records from TIBKAT using the Gemeinsame Normdatei (GND) taxonomy. The key to the solution is recasting subject tagging as an ontology alignment task, using the OntoAligner toolkit together with retrieval-augmented generation (RAG) to match records to GND categories by semantic similarity, enabling efficient subject indexing.
Link: https://arxiv.org/abs/2504.21474
Authors: Hadi Bayrami Asl Tekanlou, Jafar Razmara, Mahsa Sanaei, Mostafa Rahgouy, Hamed Babaei Giglou
Affiliations: University of Tabriz; Auburn University; TIB - Leibniz Information Centre for Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures, accepted to the LLMs4Subjects shared task at SemEval2025
Abstract:This paper presents our system, Homa, for SemEval-2025 Task 5: Subject Tagging, which focuses on automatically assigning subject labels to technical records from TIBKAT using the Gemeinsame Normdatei (GND) taxonomy. We leverage OntoAligner, a modular ontology alignment toolkit, to address this task by integrating retrieval-augmented generation (RAG) techniques. Our approach formulates the subject tagging problem as an alignment task, where records are matched to GND categories based on semantic similarity. We evaluate OntoAligner’s adaptability for subject indexing and analyze its effectiveness in handling multilingual records. Experimental results demonstrate the strengths and limitations of this method, highlighting the potential of alignment techniques for improving subject tagging in digital libraries.
[NLP-25] RWKV-X: A Linear Complexity Hybrid Language Model
【Quick Read】: This paper addresses the inefficiency and high computational complexity of conventional language models on long sequences. The key to the solution is RWKV-X, a novel hybrid architecture that combines RWKV's efficiency for short-range modeling with a sparse attention mechanism that captures long-range context, achieving linear-time complexity in training and constant-time complexity in inference decoding.
Link: https://arxiv.org/abs/2504.21463
Authors: Haowen Hou, Zhiyi Huang, Kaifeng Tan, Rongchang Lu, Fei Richard Yu
Affiliations: Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; Hohai University; Shenzhen University; Qinghai University
Subjects: Computation and Language (cs.CL)
Comments: 12 pages
Abstract:In this paper, we introduce RWKV-X, a novel hybrid architecture that combines the efficiency of RWKV for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches that rely on full attention layers and retain quadratic complexity, RWKV-X achieves linear-time complexity in training and constant-time complexity in inference decoding. We demonstrate that RWKV-X, when continually pretrained on 64K-token sequences, achieves near-perfect accuracy on the 64K passkey retrieval benchmark. It consistently outperforms prior RWKV-7 models on long-context benchmarks, while maintaining strong performance on short-context tasks. These results highlight RWKV-X as a scalable and efficient backbone for general-purpose language modeling, capable of decoding sequences up to 1 million tokens with stable speed and memory usage. To facilitate further research and analysis, we have made the checkpoints and the associated code publicly accessible at: this https URL.
[NLP-26] SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding CVPR2025
【Quick Read】: This paper addresses the shortfall of current multi-modal large language models (MLLMs) in understanding narrative-driven series: existing benchmarks target standalone videos and mainly assess visual elements such as human actions and object states, overlooking the complex, continuous narrative structures of real-world video. The key to the solution is SeriesBench, a benchmark of 105 carefully curated narrative-driven series covering 28 tasks that require deep narrative understanding, built with a novel long-span narrative annotation method and a full-information transformation approach that converts manual annotations into diverse task formats, together with PC-DCoT, a narrative reasoning framework that strengthens detailed analysis of plot structures and character relationships.
Link: https://arxiv.org/abs/2504.21435
Authors: Chenkai Zhang, Yiming Lei, Zeming Liu, Haitao Leng, ShaoGuo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Affiliations: Beihang University; Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 29 pages, 15 figures, CVPR 2025
Abstract:With the rapid development of Multi-modal Large Language Models (MLLMs), an increasing number of benchmarks have been established to evaluate the video understanding capabilities of these models. However, these benchmarks focus on standalone videos and mainly assess "visual elements" like human actions and object states. In reality, contemporary videos often encompass complex and continuous narratives, typically presented as a series. To address this challenge, we propose SeriesBench, a benchmark consisting of 105 carefully curated narrative-driven series, covering 28 specialized tasks that require deep narrative understanding. Specifically, we first select a diverse set of drama series spanning various genres. Then, we introduce a novel long-span narrative annotation method, combined with a full-information transformation approach to convert manual annotations into diverse task formats. To further enhance model capacity for detailed analysis of plot structures and character relationships within series, we propose a novel narrative reasoning framework, PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still face significant challenges in understanding narrative-driven series, while PC-DCoT enables these MLLMs to achieve performance improvements. Overall, our SeriesBench and PC-DCoT highlight the critical necessity of advancing model capabilities to understand narrative-driven series, guiding the future development of MLLMs. SeriesBench is publicly available at this https URL.
[NLP-27] he Distribution of Dependency Distance and Hierarchical Distance in Contemporary Written Japanese and Its Influencing Factors
【Quick Read】: This paper examines the relationship between dependency distance (DD) and hierarchical distance (HD) in Japanese, in particular how their probability distributions differ with and without sentence length fixed and how the two interact. The key finding is that predicate valency underlies the trade-off between mean dependency distance (MDD) and mean hierarchical distance (MHD) in Japanese: native speakers regulate linear and hierarchical complexity through the valency of predicates, the relative sizes of MDD and MHD depend on whether the valency threshold has been reached, and valency affects the distribution of HD more strongly than that of DD, which explains the differences between the two probability distributions.
Link: https://arxiv.org/abs/2504.21421
Authors: Linxuan Wang, Shuiyuan Yu
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: This paper has been accepted by the 13th International Quantitative Linguistics Conference QUALICO 2025
Abstract:To explore the relationship between dependency distance (DD) and hierarchical distance (HD) in Japanese, we compared the probability distributions of DD and HD with and without sentence length fixed, and analyzed the changes in mean dependency distance (MDD) and mean hierarchical distance (MHD) as sentence length increases, along with their correlation coefficient based on the Balanced Corpus of Contemporary Written Japanese. It was found that the valency of the predicates is the underlying factor behind the trade-off relation between MDD and MHD in Japanese. Native speakers of Japanese regulate the linear complexity and hierarchical complexity through the valency of the predicates, and the relative sizes of MDD and MHD depend on whether the threshold of valency has been reached. Apart from the cognitive load, the valency of the predicates also affects the probability distributions of DD and HD. The effect of the valency of the predicates on the distribution of HD is greater than on that of DD, which leads to differences in their probability distributions and causes the mean of MDD to be lower than that of MHD.
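The two measures are easy to compute from a dependency tree. A minimal sketch follows, where each word is represented by the 1-based index of its head (0 for the root); counting HD as edge depth below the root, with the root excluded from both means, is an assumption about the exact convention:

```python
# Sketch: mean dependency distance (MDD) and mean hierarchical
# distance (MHD) from a head-index encoding of a dependency tree.
def mdd(heads):
    """heads[i] = 1-based index of word i+1's head, 0 for the root."""
    dists = [abs(h - (i + 1)) for i, h in enumerate(heads) if h != 0]
    return sum(dists) / len(dists)

def mhd(heads):
    def depth(i):  # edges from word i+1 up to the root
        return 0 if heads[i] == 0 else 1 + depth(heads[i] - 1)
    depths = [depth(i) for i in range(len(heads)) if heads[i] != 0]
    return sum(depths) / len(depths)

heads = [3, 3, 0, 5, 3]        # a 5-word sentence; word 3 is the root
print(mdd(heads), mhd(heads))  # 1.5 and 1.25 for this toy tree
```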
[NLP-28] Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction
【Quick Read】: This paper addresses Speech Event Extraction (SpeechEE): identifying structured event information in spoken language. The key to the solution is a modular, pipeline-based framework that combines high-performance automatic speech recognition (ASR) with semantic search-enhanced prompting of large language models (LLMs): a hybrid filtering mechanism classifies speech segments for event relevance, and few-shot LLM prompts, dynamically enriched via semantic similarity retrieval, identify event triggers and extract the corresponding arguments.
Link: https://arxiv.org/abs/2504.21372
Authors: Máté Gedeon
Affiliations: Budapest University of Technology and Economics
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Speech Event Extraction (SpeechEE) is a challenging task that lies at the intersection of Automatic Speech Recognition (ASR) and Natural Language Processing (NLP), requiring the identification of structured event information from spoken language. In this work, we present a modular, pipeline-based SpeechEE framework that integrates high-performance ASR with semantic search-enhanced prompting of Large Language Models (LLMs). Our system first classifies speech segments likely to contain events using a hybrid filtering mechanism including rule-based, BERT-based, and LLM-based models. It then employs few-shot LLM prompting, dynamically enriched via semantic similarity retrieval, to identify event triggers and extract corresponding arguments. We evaluate the pipeline using multiple LLMs (Llama3-8B, GPT-4o-mini, and o1-mini) highlighting significant performance gains with o1-mini, which achieves 63.3% F1 on trigger classification and 27.8% F1 on argument classification, outperforming prior benchmarks. Our results demonstrate that pipeline approaches, when empowered by retrieval-augmented LLMs, can rival or exceed end-to-end systems while maintaining interpretability and modularity. This work provides practical insights into LLM-driven event extraction and opens pathways for future hybrid models combining textual and acoustic features.
[NLP-29] Does the Prompt-based Large Language Model Recognize Students Demographics and Introduce Bias in Essay Scoring?
【Quick Read】: This paper asks whether scoring bias, particularly against disadvantaged groups, persists in Automated Essay Scoring (AES) under the prompt-based generative AI paradigm. The key to the solution is probing the relationship between a model's ability to infer students' demographic attributes (such as gender and first-language background) from their essays and its scoring bias, with multivariate regression analysis showing how that predictive ability affects the fairness of scoring outcomes.
Link: https://arxiv.org/abs/2504.21330
Authors: Kaixun Yang, Mladen Raković, Dragan Gašević, Guanliang Chen
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) are widely used in Automated Essay Scoring (AES) due to their ability to capture semantic meaning. Traditional fine-tuning approaches required technical expertise, limiting accessibility for educators with limited technical backgrounds. However, prompt-based tools like ChatGPT have made AES more accessible, enabling educators to obtain machine-generated scores using natural-language prompts (i.e., the prompt-based paradigm). Despite advancements, prior studies have shown bias in fine-tuned LLMs, particularly against disadvantaged groups. It remains unclear whether such biases persist or are amplified in the prompt-based paradigm with cutting-edge tools. Since such biases are believed to stem from the demographic information embedded in pre-trained models (i.e., the ability of LLMs’ text embeddings to predict demographic attributes), this study explores the relationship between the model’s predictive power of students’ demographic attributes based on their written works and its predictive bias in the scoring task in the prompt-based paradigm. Using a publicly available dataset of over 25,000 students’ argumentative essays, we designed prompts to elicit demographic inferences (i.e., gender, first-language background) from GPT-4o and assessed fairness in automated scoring. Then we conducted multivariate regression analysis to explore the impact of the model’s ability to predict demographics on its scoring outcomes. Our findings revealed that (i) prompt-based LLMs can somewhat infer students’ demographics, particularly their first-language backgrounds, from their essays; (ii) scoring biases are more pronounced when the LLM correctly predicts students’ first-language background than when it does not; and (iii) scoring error for non-native English speakers increases when the LLM correctly identifies them as non-native.
[NLP-30] Phi-4-reasoning Technical Report
【Quick Read】: This paper addresses weak model performance on complex reasoning tasks, specifically how to raise reasoning ability at a modest parameter scale. The key to the solution is supervised fine-tuning (SFT) of Phi-4 on a carefully curated set of "teachable" prompts with reasoning demonstrations generated by o3-mini, yielding detailed and effective reasoning chains, followed by a short phase of outcome-based reinforcement learning (RL) that produces longer reasoning traces and markedly better performance across a wide range of reasoning tasks.
Link: https://arxiv.org/abs/2504.21318
Authors: Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safoora Yousefi, Guoqing Zheng
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on carefully curated set of “teachable” prompts-selected for the right level of complexity and diversity-and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.
[NLP-31] Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges
【Quick Read】: This paper addresses the limitations of conventional evaluation frameworks for assessing large language model (LLM) capabilities under limited-sample conditions. The key to the solution is a Bayesian approach that integrates prior knowledge through probabilistic inference: model capabilities are treated as latent variables, a curated query set induces discriminative responses, and model ranking is formalized as a Bayesian hypothesis testing problem over mutually exclusive capability intervals.
Link: https://arxiv.org/abs/2504.21303
Authors: Xiao Xiao, Yu Su, Sijing Zhang, Zhang Chen, Yadong Chen, Tian Liu
Affiliations: Tencent Hunyuan
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) exhibit probabilistic output characteristics, yet conventional evaluation frameworks rely on deterministic scalar metrics. This study introduces a Bayesian approach for LLM capability assessment that integrates prior knowledge through probabilistic inference, addressing limitations under limited-sample regimes. By treating model capabilities as latent variables and leveraging a curated query set to induce discriminative responses, we formalize model ranking as a Bayesian hypothesis testing problem over mutually exclusive capability intervals. Experimental evaluations with GPT-series models demonstrate that the proposed method achieves superior discrimination compared to conventional evaluation methods. Results indicate that even with reduced sample sizes, the approach maintains statistical robustness while providing actionable insights, such as probabilistic statements about a model’s likelihood of surpassing specific baselines. This work advances LLM evaluation methodologies by bridging Bayesian inference with practical constraints in real-world deployment scenarios.
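A minimal sketch of the Bayesian flavor of evaluation advocated here: put a Beta prior on each model's pass rate, update with observed results, and report the probability that one model beats another. The uniform prior and the Monte Carlo comparison are assumptions, not the paper's exact formulation:

```python
# Sketch: Beta-Binomial posterior over pass rates, then a
# probabilistic statement about one model surpassing a baseline.
import numpy as np

def posterior_samples(successes, trials, prior=(1, 1), n=100_000, rng=None):
    """Samples from Beta(prior_a + successes, prior_b + failures)."""
    rng = rng or np.random.default_rng(0)
    return rng.beta(prior[0] + successes, prior[1] + trials - successes, n)

a = posterior_samples(successes=18, trials=25)  # model A: 18/25 correct
b = posterior_samples(successes=15, trials=25)  # baseline: 15/25 correct
print(f"P(A > baseline) = {(a > b).mean():.3f}")
```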
zh
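为便于理解上文 [NLP-31] 的思路,下面给出一个极简示意:把模型能力视为潜在变量 θ(在判别性查询上答对的概率),用 Beta-Binomial 共轭更新得到后验,再在互斥能力区间上做贝叶斯假设检验。其中先验、区间划分与样本数均为假设示例,并非论文的实际建模细节。

```python
import numpy as np
from scipy import stats

def capability_posterior(correct: int, total: int, prior=(1.0, 1.0)):
    """返回潜在能力 theta 的后验 Beta 分布(Beta-Binomial 共轭更新)。"""
    a0, b0 = prior
    return stats.beta(a0 + correct, b0 + (total - correct))

def interval_probabilities(posterior, intervals):
    """计算后验落入每个互斥能力区间的概率,对应区间上的贝叶斯假设检验。"""
    return {iv: posterior.cdf(iv[1]) - posterior.cdf(iv[0]) for iv in intervals}

# 小样本场景:20 道精心设计的查询,模型答对 14 道
post = capability_posterior(correct=14, total=20)
intervals = [(0.0, 0.5), (0.5, 0.7), (0.7, 0.9), (0.9, 1.0)]
for iv, p in interval_probabilities(post, intervals).items():
    print(f"P(theta in {iv}) = {p:.3f}")

# 也可以直接给出摘要中提到的那类概率化结论,如“超过基线 0.6 的概率”
print("P(theta > 0.6) =", 1 - post.cdf(0.6))
```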
[NLP-32] BiasGuard: A Reasoning-enhanced Bias Detection Tool For Large Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)生成内容中的偏见识别问题,这是确保LLM公平性的关键前提。现有方法如公平性分类器和基于LLM的评判者存在理解潜在意图困难以及缺乏公平性判断标准的局限性。论文提出的解决方案是BiasGuard,其关键在于通过显式分析输入并依据公平性规范进行推理,从而提供准确的判断。BiasGuard采用两阶段方法:第一阶段使模型基于公平性规范进行显式推理,第二阶段利用强化学习提升其推理与判断能力。实验结果表明,该方法在五大数据集上优于现有工具,提高了准确性并减少了过度公平的误判。
链接: https://arxiv.org/abs/2504.21299
作者: Zhiting Fan,Ruizhe Chen,Zuozhu Liu
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Identifying bias in LLM-generated content is a crucial prerequisite for ensuring fairness in LLMs. Existing methods, such as fairness classifiers and LLM-based judges, face limitations related to difficulties in understanding underlying intentions and the lack of criteria for fairness judgment. In this paper, we introduce BiasGuard, a novel bias detection tool that explicitly analyzes inputs and reasons through fairness specifications to provide accurate judgments. BiasGuard is implemented through a two-stage approach: the first stage initializes the model to explicitly reason based on fairness specifications, while the second stage leverages reinforcement learning to enhance its reasoning and judgment capabilities. Our experiments, conducted across five datasets, demonstrate that BiasGuard outperforms existing tools, improving accuracy and reducing over-fairness misjudgments. We also highlight the importance of reasoning-enhanced decision-making and provide evidence for the effectiveness of our two-stage optimization pipeline.
zh
[NLP-33] Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA
【速读】: 该论文旨在解决医疗问答(Medical QA)任务中大型语言模型(LLMs)因幻觉和过时领域知识而导致的性能不足问题,以及现有医疗检索增强生成(RAG)系统在信息检索过程中缺乏类人推理建模和依赖次优医学语料库所导致的相关性不佳问题。其解决方案的关键在于提出了一种名为Discuss-RAG的即插即用模块,通过基于协作代理的推理机制来增强医疗QA RAG系统,具体包括引入一个摘要代理以协调医疗专家团队进行多轮头脑风暴,从而提升检索内容的相关性,并引入一个决策代理在最终整合前评估检索到的片段。
链接: https://arxiv.org/abs/2504.21252
作者: Xuanzhao Dong,Wenhui Zhu,Hao Wang,Xiwen Chen,Peijie Qiu,Rui Yin,Yi Su,Yalin Wang
机构: Arizona State University (亚利桑那州立大学); Clemson University (克莱姆森大学); Washington University in St.Louis (圣路易斯华盛顿大学); Banner Alzheimer’s Institute (Banner阿尔茨海默病研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Medical question answering (QA) is a reasoning-intensive task that remains challenging for large language models (LLMs) due to hallucinations and outdated domain knowledge. Retrieval-Augmented Generation (RAG) provides a promising post-training solution by leveraging external knowledge. However, existing medical RAG systems suffer from two key limitations: (1) a lack of modeling for human-like reasoning behaviors during information retrieval, and (2) reliance on suboptimal medical corpora, which often results in the retrieval of irrelevant or noisy snippets. To overcome these challenges, we propose Discuss-RAG, a plug-and-play module designed to enhance the medical QA RAG system through collaborative agent-based reasoning. Our method introduces a summarizer agent that orchestrates a team of medical experts to emulate multi-turn brainstorming, thereby improving the relevance of retrieved content. Additionally, a decision-making agent evaluates the retrieved snippets before their final integration. Experimental results on four benchmark medical QA datasets show that Discuss-RAG consistently outperforms MedRAG, especially significantly improving answer accuracy by up to 16.67% on BioASQ and 12.20% on PubMedQA. The code is available at: this https URL.
zh
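下面是对 [NLP-33] Discuss-RAG 流程的一个示意性骨架:摘要代理组织多轮“专家讨论”以细化检索需求,决策代理在最终整合前逐条把关检索片段。其中 call_llm 与 retrieve 均为假设的占位函数,提示词也仅为示意,并非论文实现。

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("接入任意 LLM 后端")

def retrieve(query: str, k: int = 5) -> list[str]:
    raise NotImplementedError("接入任意医学语料检索器")

def discuss_rag(question: str, n_rounds: int = 2) -> str:
    # 1) 摘要代理组织“医学专家”多轮头脑风暴,逐轮细化检索要点
    notes = ""
    for r in range(n_rounds):
        notes = call_llm(
            f"你是摘要代理,请组织多位医学专家围绕问题进行第{r + 1}轮讨论,"
            f"并总结出更聚焦的检索要点。\n问题: {question}\n已有讨论: {notes}")
    # 2) 基于讨论要点构造查询并检索候选片段
    snippets = retrieve(call_llm(f"把以下讨论要点改写为一条检索查询: {notes}"))
    # 3) 决策代理逐条评估片段,过滤无关或噪声内容
    kept = [s for s in snippets if "yes" in call_llm(
        f"该片段是否有助于回答问题?只答 yes/no。\n问题: {question}\n片段: {s}").lower()]
    # 4) 用筛选后的片段生成最终答案
    return call_llm(f"依据以下片段回答问题。\n问题: {question}\n片段: {kept}")
```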
[NLP-34] Memorization and Knowledge Injection in Gated LLMs
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在连续学习过程中难以有效添加新记忆和整合新知识的问题。现有方法通常依赖于大上下文窗口或外部记忆缓冲区,而缺乏对日常生活中真实场景的测试。论文提出的解决方案关键在于引入一种持续学习框架——嵌入门控大语言模型的记忆(Memory Embedded in Gated LLMs, MEGa),该框架通过将事件记忆直接注入模型权重中实现知识注入,每个记忆存储在一组专用的门控低秩权重中,并在推理时通过查询嵌入与存储记忆嵌入的匹配激活相关记忆权重,从而实现记忆回忆与问答功能。
链接: https://arxiv.org/abs/2504.21239
作者: Xu Pan,Ely Hahami,Zechen Zhang,Haim Sompolinsky
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) currently struggle to sequentially add new memories and integrate new knowledge. These limitations contrast with the human ability to continuously learn from new experiences and acquire knowledge throughout life. Most existing approaches add memories either through large context windows or external memory buffers (e.g., Retrieval-Augmented Generation), and studies on knowledge injection rarely test scenarios resembling everyday life events. In this work, we introduce a continual learning framework, Memory Embedded in Gated LLMs (MEGa), which injects event memories directly into the weights of LLMs. Each memory is stored in a dedicated set of gated low-rank weights. During inference, a gating mechanism activates relevant memory weights by matching query embeddings to stored memory embeddings. This enables the model to both recall entire memories and answer related questions. On two datasets - fictional characters and Wikipedia events - MEGa outperforms baseline approaches in mitigating catastrophic forgetting. Our model draws inspiration from the complementary memory system of the human brain.
zh
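针对 [NLP-34] 中“门控低秩记忆权重”的机制,下面给出一个最小化的 PyTorch 示意:每条记忆对应一组低秩矩阵和一个记忆嵌入,推理时以查询嵌入与记忆嵌入的相似度作为门控。结构与超参均为假设,仅用于说明思路,并非论文实现。

```python
import torch
import torch.nn.functional as F

class GatedMemoryLinear(torch.nn.Module):
    def __init__(self, d_in, d_out, n_mem, rank=8, d_emb=64):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out)  # 基座权重(实际使用时应冻结)
        # 每条记忆一组低秩权重 (A_i, B_i) 和一个记忆嵌入 m_i
        self.A = torch.nn.Parameter(torch.randn(n_mem, rank, d_in) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(n_mem, d_out, rank))
        self.mem_emb = torch.nn.Parameter(torch.randn(n_mem, d_emb))

    def forward(self, x, query_emb):
        # 门控:查询嵌入与各记忆嵌入的余弦相似度 -> softmax
        gate = F.softmax(F.cosine_similarity(
            query_emb.unsqueeze(0), self.mem_emb, dim=-1), dim=0)  # (n_mem,)
        out = self.base(x)
        for i in range(self.A.shape[0]):
            # 第 i 条记忆的低秩增量,按门控权重叠加
            out = out + gate[i] * (x @ self.A[i].T @ self.B[i].T)
        return out

layer = GatedMemoryLinear(d_in=32, d_out=32, n_mem=4)
x = torch.randn(2, 32)   # batch 输入
q = torch.randn(64)      # 查询嵌入
print(layer(x, q).shape)  # torch.Size([2, 32])
```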
[NLP-35] Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
【速读】: 该论文试图解决在小型语言模型(Small Language Models, SLMs)中提升推理能力的问题,因为尽管大型语言模型(Large Language Models, LLMs)可以通过Chain-of-Thought (CoT) 显著增强形式推理能力,但SLMs由于模型容量有限,难以实现类似效果。论文提出的解决方案关键在于设计了一套系统化的训练流程,包括大规模中等训练、监督微调、Rollout DPO以及基于可验证奖励的强化学习,结合高质量的长CoT数据,从而有效释放资源受限的小型模型的强推理能力。
链接: https://arxiv.org/abs/2504.21233
作者: Haoran Xu,Baolin Peng,Hany Awadalla,Dongdong Chen,Yen-Chun Chen,Mei Gao,Young Jin Kim,Yunsheng Li,Liliang Ren,Yelong Shen,Shuohang Wang,Weijian Xu,Jianfeng Gao,Weizhu Chen
机构: Microsoft(微软)
类目: Computation and Language (cs.CL)
备注:
Abstract:Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs) by training them to explicitly generate intermediate reasoning steps. While LLMs readily benefit from such techniques, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity. Recent work by Deepseek-R1 demonstrates that distillation from LLM-generated synthetic data can substantially improve the reasoning ability of SLMs. However, the detailed modeling recipe is not disclosed. In this work, we present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward. We apply our method on Phi-4-Mini, a compact 3.8B-parameter model. The resulting Phi-4-Mini-Reasoning model exceeds much larger reasoning models on math reasoning tasks, e.g., outperforming DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500. Our results validate that a carefully designed training recipe, with large-scale high-quality CoT data, is effective in unlocking strong reasoning capabilities even in resource-constrained small models.
zh
[NLP-36] Pretraining Large Brain Language Model for Active BCI: Silent Speech
【速读】: 该论文旨在解决主动脑机接口(active BCI)系统中静默语音解码的问题,以实现更自然和灵活的通信方式。其关键解决方案是提出一种名为Large Brain Language Model (LBLM) 的预训练模型,并采用未来时频预测(Future Spectro-Temporal Prediction, FSTP)的预训练范式,通过自回归建模在时间和频率域中捕捉EEG信号的时序和频谱依赖性,从而学习有效的表征。此方法在未标记EEG数据上进行预训练,并在下游任务中进行微调,显著提升了静默语音解码的性能。
链接: https://arxiv.org/abs/2504.21214
作者: Jinzhao Zhou,Zehong Cao,Yiqun Duan,Connor Barkley,Daniel Leong,Xiaowei Jiang,Quoc-Toan Nguyen,Ziyi Zhao,Thomas Do,Yu-Cheng Chang,Sheng-Fu Liang,Chin-teng Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:This paper explores silent speech decoding in active brain-computer interface (BCI) systems, which offer more natural and flexible communication than traditional BCI applications. We collected a new silent speech dataset of over 120 hours of electroencephalogram (EEG) recordings from 12 subjects, capturing 24 commonly used English words for language model pretraining and decoding. Following the recent success of pretraining large models with self-supervised paradigms to enhance EEG classification performance, we propose Large Brain Language Model (LBLM) pretrained to decode silent speech for active BCI. To pretrain LBLM, we propose Future Spectro-Temporal Prediction (FSTP) pretraining paradigm to learn effective representations from unlabeled EEG data. Unlike existing EEG pretraining methods that mainly follow a masked-reconstruction paradigm, our proposed FSTP method employs autoregressive modeling in temporal and frequency domains to capture both temporal and spectral dependencies from EEG signals. After pretraining, we finetune our LBLM on downstream tasks, including word-level and semantic-level classification. Extensive experiments demonstrate significant performance gains of the LBLM over fully-supervised and pretrained baseline models. For instance, in the difficult cross-session setting, our model achieves 47.0% accuracy on semantic-level classification and 39.6% in word-level classification, outperforming baseline methods by 5.4% and 7.3%, respectively. Our research advances silent speech decoding in active BCI systems, offering an innovative solution for EEG language model pretraining and a new dataset for fundamental research.
zh
[NLP-37] Automatic Legal Writing Evaluation of LLMs
【速读】: 该论文试图解决法律写作评估基准稀缺的问题,尤其是在大型语言模型(Large Language Models)领域中,由于评估开放式回答的固有复杂性,缺乏公开、频繁更新且包含全面评估指南的数据集。解决方案的关键在于引入oab-bench,这是一个包含来自巴西律师考试近几届试题的基准,涵盖七个法律领域,共105道题目,并附有由人类考官使用的全面评估指南和参考材料,以确保评分一致性。通过在该基准上评估四种LLMs,验证了其在法律写作评估中的有效性,并探索了LLMs作为可靠自动化评判工具的潜力。
链接: https://arxiv.org/abs/2504.21202
作者: Ramon Pires,Roseval Malaquias Junior,Rodrigo Nogueira
机构: Maritaca AI(马里塔卡人工智能), Campinas, São Paulo, Brazil
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the recent advances in Large Language Models, benchmarks for evaluating legal writing remain scarce due to the inherent complexity of assessing open-ended responses in this domain. One of the key challenges in evaluating language models on domain-specific tasks is finding test datasets that are public, frequently updated, and contain comprehensive evaluation guidelines. The Brazilian Bar Examination meets these requirements. We introduce oab-bench, a benchmark comprising 105 questions across seven areas of law from recent editions of the exam. The benchmark includes comprehensive evaluation guidelines and reference materials used by human examiners to ensure consistent grading. We evaluate the performance of four LLMs on oab-bench, finding that Claude-3.5 Sonnet achieves the best results with an average score of 7.93 out of 10, passing all 21 exams. We also investigated whether LLMs can serve as reliable automated judges for evaluating legal writing. Our experiments show that frontier models like OpenAI’s o1 achieve a strong correlation with human scores when evaluating approved exams, suggesting their potential as reliable automated evaluators despite the inherently subjective nature of legal writing assessment. The source code and the benchmark – containing questions, evaluation guidelines, model-generated responses, and their respective automated evaluations – are publicly available.
zh
[NLP-38] Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare
【速读】: 该论文试图解决在特定任务中如何选择合适的语言模型的问题,具体包括微调与零样本使用的效果比较、邻域领域与通用预训练模型的优劣、进一步领域特定预训练的价值以及小型语言模型(SLMs)与大型语言模型(LLMs)在特定任务中的相关性。研究的关键在于通过在不同时期和不同数据规模下的分类场景评估多种SLMs和一个LLM,发现微调显著提升了SLMs在所有场景中的表现,并且经过微调的SLMs在复杂任务中能够超越零样本LLMs,同时领域邻近或特定的预训练数据进一步增强了SLMs在复杂问题或有限微调数据情况下的性能。
链接: https://arxiv.org/abs/2504.21191
作者: Lovedeep Gondara,Jonathan Simkin,Graham Sayle,Shebnum Devji,Gregory Arbour,Raymond Ng
机构: British Columbia Cancer Registry (不列颠哥伦比亚癌症登记处); Provincial Health Services Authority (省卫生服务局); University of British Columbia (不列颠哥伦比亚大学); Data Science Institute (数据科学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This study aims to guide language model selection by investigating: 1) the necessity of finetuning versus zero-shot usage, 2) the benefits of domain-adjacent versus generic pretrained models, 3) the value of further domain-specific pretraining, and 4) the continued relevance of Small Language Models (SLMs) compared to Large Language Models (LLMs) for specific tasks. Using electronic pathology reports from the British Columbia Cancer Registry (BCCR), three classification scenarios with varying difficulty and data size are evaluated. Models include various SLMs and an LLM. SLMs are evaluated both zero-shot and finetuned; the LLM is evaluated zero-shot only. Finetuning significantly improved SLM performance across all scenarios compared to their zero-shot results. The zero-shot LLM outperformed zero-shot SLMs but was consistently outperformed by finetuned SLMs. Domain-adjacent SLMs generally performed better than the generic SLM after finetuning, especially on harder tasks. Further domain-specific pretraining yielded modest gains on easier tasks but significant improvements on the complex, data-scarce task. The results highlight the critical role of finetuning for SLMs in specialized domains, enabling them to surpass zero-shot LLM performance on targeted classification tasks. Pretraining on domain-adjacent or domain-specific data provides further advantages, particularly for complex problems or limited finetuning data. While LLMs offer strong zero-shot capabilities, their performance on these specific tasks did not match that of appropriately finetuned SLMs. In the era of LLMs, SLMs remain relevant and effective, offering a potentially superior performance-resource trade-off compared to LLMs.
zh
[NLP-39] Detecting Manipulated Contents Using Knowledge-Grounded Inference
【速读】: 该论文旨在解决零日篡改内容(zero-day manipulated content)的检测问题,这类内容无法通过传统方法基于历史事件进行识别,因为它们需要实时上下文信息才能被正确判断。解决方案的关键在于Manicod工具,它通过从主流搜索引擎获取输入声明的上下文信息,并利用检索增强生成(Retrieval-Augmented Generation, RAG)技术将上下文向量化,输入到大语言模型(Large Language Model, LLM)中进行推理,从而生成“真实”或“篡改”的判断及相应的文本解释。
链接: https://arxiv.org/abs/2504.21165
作者: Mark Huasong Meng,Ruizhe Wang,Meng Xu,Chuan Yan,Guangdong Bai
机构: Technical University of Munich (慕尼黑工业大学); University of Waterloo (滑铁卢大学); The University of Queensland (昆士兰大学)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: 16 pages
Abstract:The detection of manipulated content, a prevalent form of fake news, has been widely studied in recent years. While existing solutions have been proven effective in fact-checking and analyzing fake news based on historical events, the reliance on either intrinsic knowledge obtained during training or manually curated context hinders them from tackling zero-day manipulated content, which can only be recognized with real-time contextual information. In this work, we propose Manicod, a tool designed for detecting zero-day manipulated content. Manicod first sources contextual information about the input claim from mainstream search engines, and subsequently vectorizes the context for the large language model (LLM) through retrieval-augmented generation (RAG). The LLM-based inference can produce a “truthful” or “manipulated” decision and offer a textual explanation for the decision. To validate the effectiveness of Manicod, we also propose a dataset comprising 4270 pieces of manipulated fake news derived from 2500 recent real-world news headlines. Manicod achieves an overall F1 score of 0.856 on this dataset and outperforms existing methods by up to 1.9x in F1 score on their benchmarks on fact-checking and claim verification.
zh
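下面用一个示意性草图说明 [NLP-39] Manicod 的“实时检索 + RAG 推理”流程:先从搜索引擎获取上下文,再以向量相似度筛选片段,最后交由 LLM 给出 truthful / manipulated 判断。web_search、embed 与 call_llm 均为假设的占位函数,并非论文代码。

```python
import numpy as np

def web_search(claim: str) -> list[str]: ...
def embed(text: str) -> np.ndarray: ...
def call_llm(prompt: str) -> str: ...

def detect(claim: str, k: int = 3) -> str:
    # 1) 从主流搜索引擎获取该声明的实时上下文
    pages = web_search(claim)
    # 2) 向量化并选出与声明最相关的 k 段上下文(RAG)
    q = embed(claim)
    scored = sorted(pages, key=lambda p: -float(np.dot(embed(p), q)))
    context = "\n".join(scored[:k])
    # 3) LLM 基于实时上下文给出判断并附文字解释
    return call_llm(
        "依据以下实时上下文,判断该声明是 truthful 还是 manipulated,"
        f"并给出文字解释。\n声明: {claim}\n上下文: {context}")
```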
[NLP-40] LLM Enhancer: Merged Approach using Vector Embedding for Reducing Large Language Model Hallucinations with External Knowledge
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在现实世界关键场景中应用时存在的问题,即容易生成不准确信息以及缺乏有效利用外部知识源的能力。其解决方案的关键在于构建LLM ENHANCER系统,该系统通过集成多个在线来源(如Google、Wikipedia和DuckDuckGo)来提升数据准确性,并利用向量嵌入技术筛选相关信息,从而增强LLMs的可靠性和信息质量,同时保持对话的自然性和准确性。
链接: https://arxiv.org/abs/2504.21132
作者: Naheed Rayhan,Md. Ashrafuzzaman
机构: Jagannath University (贾甘纳特大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs), such as ChatGPT, have demonstrated the capability to generate human-like, natural responses across a range of tasks, including task-oriented dialogue and question answering. However, their application in real-world, critical scenarios is often hindered by a tendency to produce inaccurate information and a limited ability to leverage external knowledge sources. This paper introduces the LLM ENHANCER system, designed to integrate multiple online sources such as Google, Wikipedia, and DuckDuckGo to enhance data accuracy. The LLMs employed within this system are open source. The data acquisition process for the LLM ENHANCER system operates in parallel, utilizing custom agent tools to manage the flow of information. Vector embeddings are used to identify the most pertinent information, which is subsequently supplied to the LLM for user interaction. The LLM ENHANCER system mitigates hallucinations in chat-based LLMs while preserving response naturalness and accuracy.
zh
[NLP-41] Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
【速读】: 该论文试图解决自然语言生成(Natural Language Generation, NLG)系统评估中因输出多样性带来的挑战,尤其是传统人工评估存在不一致、缺乏标准化和人口统计偏差等问题,以及基于大语言模型(Large Language Model, LLM)的评估方法对提示设计高度敏感的问题。解决方案的关键在于提出一种逆向学习方法,该方法通过从模型输出反向学习到输入指令的有效映射,实现了模型特定评估提示的自动生成,仅需单个评估样本即可完成,从而避免了耗时的手动提示工程,提升了评估的效率与鲁棒性。
链接: https://arxiv.org/abs/2504.21117
作者: Hanhua Hong,Chenghao Xiao,Yang Wang,Yiqi Liu,Wenge Rong,Chenghua Lin
机构: The University of Manchester (曼彻斯特大学); Durham University (杜伦大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL)
备注: 10 pages
Abstract:Evaluating natural language generation (NLG) systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardisation, and demographic biases, limiting reproducibility. LLM-based evaluation offers a scalable alternative but is highly sensitive to prompt design, where small variations can lead to significant discrepancies. In this work, we propose an inversion learning method that learns effective reverse mappings from model outputs back to their input instructions, enabling the automatic generation of highly effective, model-specific evaluation prompts. Our method requires only a single evaluation sample and eliminates the need for time-consuming manual prompt engineering, thereby improving both efficiency and robustness. Our work contributes toward a new direction for more robust and efficient LLM-based evaluation.
zh
[NLP-42] Multimodal Large Language Models for Medicine: A Comprehensive Survey
【速读】: 该论文旨在探讨多模态大语言模型(Multimodal Large Language Models, MLLMs)在医疗和健康领域的应用潜力及其面临的挑战。其解决方案的关键在于通过综合分析330篇相关文献,系统梳理MLLMs在医疗报告、医学诊断和医疗治疗三个主要方向的应用,并结合六种主流数据模式及其评估基准,揭示MLLMs在医疗领域中的强大能力,同时提出应对当前技术瓶颈的可行方法。
链接: https://arxiv.org/abs/2504.21051
作者: Jiarui Ye,Hao Tang
机构: Nanjing University of Science and Technology (南京理工大学); Peking University (北京大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:
Abstract:MLLMs have recently become a focal point in the field of artificial intelligence research. Building on the strong capabilities of LLMs, MLLMs are adept at addressing complex multi-modal tasks. With the release of GPT-4, MLLMs have gained substantial attention from different domains. Researchers have begun to explore the potential of MLLMs in the medical and healthcare domain. In this paper, we first introduce the background and fundamental concepts related to LLMs and MLLMs, while emphasizing the working principles of MLLMs. Subsequently, we summarize three main directions of application within healthcare: medical reporting, medical diagnosis, and medical treatment. Our findings are based on a comprehensive review of 330 recent papers in this area. We illustrate the remarkable capabilities of MLLMs in these domains by providing specific examples. For data, we present six mainstream modes of data along with their corresponding evaluation benchmarks. At the end of the survey, we discuss the challenges faced by MLLMs in the medical and healthcare domain and propose feasible methods to mitigate or overcome these issues.
zh
[NLP-43] A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
【速读】: 该论文试图解决现有文本数据脱敏技术在隐私保护方面的不足,特别是针对隐含的文本特征可能导致个体重新识别的问题。传统方法通常依赖于移除显式标识符或生成合成数据,但其有效性往往仅通过检测显式标识符泄露来评估,忽略了可能用于重新识别的细微文本标记。论文提出了一种新的框架,用于评估重新识别攻击并量化数据发布后的个体隐私风险,其关键在于揭示看似无害的辅助信息(如日常社交活动)可能被用来推断敏感属性(如年龄或药物使用史),从而证明当前脱敏技术仅提供一种虚假的隐私保护感。
链接: https://arxiv.org/abs/2504.21035
作者: Rui Xin,Niloofar Mireshghallah,Shuyue Stella Li,Michael Duan,Hyunwoo Kim,Yejin Choi,Yulia Tsvetkov,Sewoong Oh,Pang Wei Koh
机构: University of Washington (华盛顿大学); Allen Institute for Artificial Intelligence (艾伦人工智能研究所)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Sanitizing sensitive text data typically involves removing personally identifiable information (PII) or generating synthetic data under the assumption that these methods adequately protect privacy; however, their effectiveness is often only assessed by measuring the leakage of explicit identifiers while ignoring nuanced textual markers that can lead to re-identification. We challenge the above illusion of privacy by proposing a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release. Our approach shows that seemingly innocuous auxiliary information, such as routine social activities, can be used to infer sensitive attributes like age or substance use history from sanitized data. For instance, we demonstrate that Azure’s commercial PII removal tool fails to protect 74% of information in the MedQA dataset. Although differential privacy mitigates these risks to some extent, it significantly reduces the utility of the sanitized text for downstream tasks. Our findings indicate that current sanitization techniques offer a false sense of privacy, highlighting the need for more robust methods that protect against semantic-level information leakage.
zh
[NLP-44] UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在城市规划领域中辅助人类专业人员的能力不足问题,特别是LLMs在获取和应用城市规划知识方面的局限性。解决方案的关键在于构建一个全面的基准测试平台UrbanPlanBench,用于评估LLMs在城市规划基础原则、专业知识及管理法规等方面的表现,并提供一个大规模的监督微调数据集UrbanPlanText,以提升模型对城市规划知识的掌握程度。通过该基准和数据集,研究旨在推动LLMs与城市规划专业技能的深度融合。
链接: https://arxiv.org/abs/2504.21027
作者: Yu Zheng,Longyi Liu,Yuming Lin,Jie Feng,Guozhen Zhang,Depeng Jin,Yong Li
机构: Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The advent of Large Language Models (LLMs) holds promise for revolutionizing various fields traditionally dominated by human expertise. Urban planning, a professional discipline that fundamentally shapes our daily surroundings, is one such field heavily relying on multifaceted domain knowledge and experience of human experts. The extent to which LLMs can assist human practitioners in urban planning remains largely unexplored. In this paper, we introduce a comprehensive benchmark, UrbanPlanBench, tailored to evaluate the efficacy of LLMs in urban planning, which encompasses fundamental principles, professional knowledge, and management and regulations, aligning closely with the qualifications expected of human planners. Through extensive evaluation, we reveal a significant imbalance in the acquisition of planning knowledge among LLMs, with even the most proficient models falling short of meeting professional standards. For instance, we observe that 70% of LLMs achieve subpar performance in understanding planning regulations compared to other aspects. Besides the benchmark, we present the largest-ever supervised fine-tuning (SFT) dataset, UrbanPlanText, comprising over 30,000 instruction pairs sourced from urban planning exams and textbooks. Our findings demonstrate that fine-tuned models exhibit enhanced performance in memorization tests and comprehension of urban planning knowledge, while there exists significant room for improvement, particularly in tasks requiring domain-specific terminology and reasoning. By making our benchmark, dataset, and associated evaluation and fine-tuning toolsets publicly available at this https URL, we aim to catalyze the integration of LLMs into practical urban planning, fostering a symbiotic collaboration between human expertise and machine intelligence.
zh
[NLP-45] Creating and Evaluating Code-Mixed Nepali-English and Telugu-English Datasets for Abusive Language Detection Using Traditional and Deep Learning Models
【速读】: 该论文旨在解决在多语言社交媒体环境中检测代码混合(code-mixed)文本中的仇恨语言(abusive language)的挑战,特别是在低资源语言如泰卢固语和尼泊尔语与英语混合的情况下,传统检测模型因语言混杂和上下文依赖性而难以有效识别有害内容。其解决方案的关键在于构建一个手动标注的高质量代码混合数据集,并通过多种机器学习(ML)、深度学习(DL)和大语言模型(LLMs)进行系统评估与优化,以探索适用于多语言环境的有效检测方法。
链接: https://arxiv.org/abs/2504.21026
作者: Manish Pandey,Nageshwar Prasad Yadav,Mokshada Adduru,Sawan Rai
机构: SRM Ambedkar University (SRM阿姆倍德卡尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:
Abstract:With the growing presence of multilingual users on social media, detecting abusive language in code-mixed text has become increasingly challenging. Code-mixed communication, where users seamlessly switch between English and their native languages, poses difficulties for traditional abuse detection models, as offensive content may be context-dependent or obscured by linguistic blending. While abusive language detection has been extensively explored for high-resource languages like English and Hindi, low-resource languages such as Telugu and Nepali remain underrepresented, leaving gaps in effective moderation. In this study, we introduce a novel, manually annotated dataset of 2 thousand Telugu-English and 5 thousand Nepali-English code-mixed comments, categorized as abusive and non-abusive, collected from various social media platforms. The dataset undergoes rigorous preprocessing before being evaluated across multiple Machine Learning (ML) models, Deep Learning (DL) models, and Large Language Models (LLMs). We experimented with models including Logistic Regression, Random Forest, Support Vector Machines (SVM), Neural Networks (NN), LSTM, CNN, and LLMs, optimizing their performance through hyperparameter tuning and evaluating them using 10-fold cross-validation and statistical significance testing (t-test). Our findings provide key insights into the challenges of detecting abusive language in code-mixed settings and offer a comparative analysis of computational approaches. This study contributes to advancing NLP for low-resource languages by establishing benchmarks for abusive language detection in Telugu-English and Nepali-English code-mixed text. The dataset and insights can aid in the development of more robust moderation strategies for multilingual social media environments.
zh
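摘要中提到的传统 ML 基线(如逻辑回归)在代码混合文本上的典型做法,可以用下面的草图说明:字符级 TF-IDF 特征加 10 折交叉验证。语料为杜撰的占位样本,特征配置也只是常见选择,并非论文的具体设置。

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# 占位玩具语料:模拟尼泊尔语-英语 / 泰卢固语-英语混码评论
abusive = ["timro dimag kharab cha", "nuvvu waste fellow",
           "chup laag moorkha", "nee pani chala worst"] * 3
normal = ["ramro video dai", "this song is nice",
          "subha din everyone", "chala bagundi bro"] * 3
texts = abusive + normal
labels = [1] * len(abusive) + [0] * len(normal)  # 1 = abusive, 0 = non-abusive

clf = make_pipeline(
    # 字符 n-gram 对拼写多变的混码文本通常更稳健
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(clf, texts, labels, cv=10, scoring="f1")
print("10-fold F1:", scores.mean())
```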
[NLP-46] Durghotona GPT : A Web Scraping and Large Language Model Based Framework to Generate Road Accident Dataset Automatically in Bangladesh
【速读】: 该论文试图解决道路交通事故数据收集不及时、不准确以及存在沟通障碍的问题,这些问题限制了对事故的预测与缓解。解决方案的关键在于提出一种名为“Durghotona GPT”的框架,该框架结合了网络爬虫技术和大型语言模型(Large Language Models, LLMs),实现了从孟加拉国主流日报中自动化生成全面的事故数据集。通过使用GPT-4、GPT-3.5和Llama-3等最新LLMs进行信息提取与分类,该框架有效克服了传统人工数据收集方法的局限性。
链接: https://arxiv.org/abs/2504.21025
作者: MD Thamed Bin Zaman Chowdhury,Moazzem Hossain,Md. Ridwanul Islam
机构: 未知
类目: Computation and Language (cs.CL)
备注: It has been accepted in IEEE 27th International Conference on Computer and Information Technology (ICCIT). Now, we are waiting for it to get published in IEEE Xplore
Abstract:Road accidents pose significant concerns globally. They lead to large financial losses, injuries, disabilities, and societal challenges. Accurate and timely accident data is essential for predicting and mitigating these events. This paper presents a novel framework named ‘Durghotona GPT’ that integrates web scraping and Large Language Models (LLMs) to automate the generation of comprehensive accident datasets from prominent national dailies in Bangladesh. The authors collected accident reports from three major newspapers: Prothom Alo, Dhaka Tribune, and The Daily Star. The collected news was then processed using the newest available LLMs: GPT-4, GPT-3.5, and Llama-3. The framework efficiently extracts relevant information, categorizes reports, and compiles detailed datasets. Thus, this framework overcomes limitations of manual data collection methods such as delays, errors, and communication gaps. The authors’ evaluation demonstrates that Llama-3, an open-source model, performs comparably to GPT-4. It achieved 89% accuracy in the authors’ evaluation. Therefore, it can be considered a cost-effective alternative for similar tasks. The results suggest that the framework developed by the authors can drastically enhance the quality and availability of accident data. As a result, it can support critical applications in traffic safety analysis, urban planning, and public health. The authors also developed an interface for ‘Durghotona GPT’ for ease of use as part of this paper. Future work will focus on expanding data collection methods and refining LLMs to further increase dataset accuracy and applicability.
zh
[NLP-47] WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model
【速读】: 该论文试图解决自主学习过程中代理模型性能达到瓶颈的问题,这一问题在基于网络环境的生成式 AI (Generative AI) 中尤为显著。其关键解决方案是引入一个协同进化的世界模型(World Model)LLM,该模型通过预测当前观测与动作下的下一状态,利用预训练的网络知识,既作为虚拟网络服务器生成自我指导的训练数据以持续优化代理策略,又在推理阶段作为想象引擎进行前瞻模拟,从而提升代理的行动选择能力。
链接: https://arxiv.org/abs/2504.21024
作者: Tianqing Fang,Hongming Zhang,Zhisong Zhang,Kaixin Ma,Wenhao Yu,Haitao Mi,Dong Yu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 19 pages
Abstract:Agent self-improvement, where the backbone Large Language Model (LLM) of the agent are trained on trajectories sampled autonomously based on their own policies, has emerged as a promising approach for enhancing performance. Recent advancements, particularly in web environments, face a critical limitation: their performance will reach a stagnation point during autonomous learning cycles, hindering further improvement. We argue that this stems from limited exploration of the web environment and insufficient exploitation of pre-trained web knowledge in LLMs. To improve the performance of self-improvement, we propose a novel framework that introduces a co-evolving World Model LLM. This world model predicts the next observation based on the current observation and action within the web environment. Leveraging LLMs’ pretrained knowledge of abundant web content, the World Model serves dual roles: (1) as a virtual web server generating self-instructed training data to continuously refine the agent’s policy, and (2) as an imagination engine during inference, enabling look-ahead simulation to guide action selection for the agent LLM. Experiments in real-world web environments (Mind2Web-Live, WebVoyager, and GAIA-web) show a 10% performance gain over existing self-evolving agents, demonstrating the efficacy and generalizability of our approach, without using any distillation from more powerful close-sourced models. Our work establishes the necessity of integrating world models into autonomous agent frameworks to unlock sustained adaptability.
zh
[NLP-48] ParamΔ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost ICLR2025
【速读】: 该论文试图解决大语言模型后训练阶段需要大量高质量数据、存在过拟合风险以及计算成本高昂的问题。解决方案的关键在于提出 ParamΔ 方法:通过计算已有后训练模型权重(Θ_post)与原基础模型权重(Θ_base)之间的差值(Θ_post - Θ_base),并将其直接叠加到更新后的基础模型权重(Θ'_base)上,从而无需任何额外训练即可得到具备后训练能力的 ParamΔ 模型,其性能可达到直接后训练的约95%。
链接: https://arxiv.org/abs/2504.21023
作者: Sheng Cao,Mingrui Wu,Karthik Prasad,Yuandong Tian,Zechun Liu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Published as a conference paper at ICLR 2025
Abstract:The post-training phase of large language models is essential for enhancing capabilities such as instruction-following, reasoning, and alignment with human preferences. However, it demands extensive high-quality data and poses risks like overfitting, alongside significant computational costs due to repeated post-training and evaluation after each base model update. This paper introduces Param $\Delta$ , a novel method that streamlines post-training by transferring knowledge from an existing post-trained model to a newly updated base model with ZERO additional training. By computing the difference between post-trained model weights ( $\Theta_{\text{post}}$ ) and base model weights ( $\Theta_{\text{base}}$ ), and adding this to the updated base model ( $\Theta'_{\text{base}}$ ), we define the Param $\Delta$ model as: $\Theta_{\text{Param}\Delta} = \Theta_{\text{post}} - \Theta_{\text{base}} + \Theta'_{\text{base}}$ . This approach surprisingly equips the new base model with post-trained capabilities, achieving performance comparable to direct post-training. We did analysis on Llama3, Llama3.1, Qwen, and DeepSeek-distilled models. Results indicate the Param $\Delta$ model effectively replicates traditional post-training. For example, the Param $\Delta$ model obtained from the 70B Llama3-inst, Llama3-base, and Llama3.1-base models attains approximately 95% of the Llama3.1-inst model’s performance on average. Param $\Delta$ brings a new perspective on how to fully leverage models in the open-weight community, where checkpoints for base and instruct models are readily available and frequently updated, by providing a cost-free framework to accelerate the iterative cycle of model development.
zh
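摘要已给出显式公式 Θ_ParamΔ = Θ_post - Θ_base + Θ'_base,下面按该公式对三份 state_dict 做逐张量加减的直接示意;检查点文件名为假设示例,且要求三份检查点结构一致(同名同形状张量)。

```python
import torch

def param_delta(post_sd: dict, base_sd: dict, new_base_sd: dict) -> dict:
    """按 Θ_ParamΔ = Θ_post - Θ_base + Θ'_base 逐张量合并三份 state_dict。"""
    merged = {}
    for name, w_new in new_base_sd.items():
        delta = post_sd[name] - base_sd[name]  # 后训练带来的权重增量
        merged[name] = w_new + delta           # 叠加到更新后的基座权重上
    return merged

# 用法示意(文件路径为假设):
# post = torch.load("llama3-instruct.pt")
# base = torch.load("llama3-base.pt")
# new_base = torch.load("llama3.1-base.pt")
# torch.save(param_delta(post, base, new_base), "llama3.1-paramdelta.pt")
```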
[NLP-49] ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees
【速读】: 该论文试图解决自然语言(Natural Language, NL)指令到线性时序逻辑(Linear Temporal Logic, LTL)公式的翻译问题,现有方法缺乏正确性保障。其解决方案的关键在于提出一种名为ConformalNL2LTL的新方法,该方法通过迭代地解决一系列开放词汇问答(open-vocabulary Question-Answering, QA)问题来构建LTL公式,并利用置信区间预测(conformal prediction, CP)实现不确定性感知的翻译,从而在保证用户定义的翻译成功率的同时最小化求助频率。
链接: https://arxiv.org/abs/2504.21022
作者: Jun Wang,David Smith Sundarsingh,Jyotirmoy V. Deshmukh,Yiannis Kantaros
机构: Washington University in St Louis, MO, 63108, USA; University of Southern California, Los Angeles, CA 90089, USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
Abstract:Linear Temporal Logic (LTL) has become a prevalent specification language for robotic tasks. To mitigate the significant manual effort and expertise required to define LTL-encoded tasks, several methods have been proposed for translating Natural Language (NL) instructions into LTL formulas, which, however, lack correctness guarantees. To address this, we introduce a new NL-to-LTL translation method, called ConformalNL2LTL, that can achieve user-defined translation success rates over unseen NL commands. Our method constructs LTL formulas iteratively by addressing a sequence of open-vocabulary Question-Answering (QA) problems with LLMs. To enable uncertainty-aware translation, we leverage conformal prediction (CP), a distribution-free uncertainty quantification tool for black-box models. CP enables our method to assess the uncertainty in LLM-generated answers, allowing it to proceed with translation when sufficiently confident and request help otherwise. We provide both theoretical and empirical results demonstrating that ConformalNL2LTL achieves user-specified translation accuracy while minimizing help rates.
zh
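下面用一个常见的分裂式共形预测(split conformal prediction)草图说明 [NLP-49] 中“足够自信才继续翻译、否则求助”的决策方式。非一致性分数取 1 减去 LLM 对所选答案的置信度,校准数据均为假设示例,并非论文的具体构造。

```python
import numpy as np

def calibrate(cal_scores: np.ndarray, alpha: float = 0.1) -> float:
    """在校准集非一致性分数上取带有限样本校正的 (1-alpha) 分位数作阈值。"""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(cal_scores, min(q, 1.0), method="higher"))

def answer_or_ask(confidence: float, threshold: float) -> str:
    """测试时:分数不超过阈值则继续翻译,否则向用户求助。"""
    score = 1.0 - confidence
    return "proceed" if score <= threshold else "ask-for-help"

cal = np.array([0.05, 0.2, 0.12, 0.4, 0.08, 0.3, 0.15, 0.25])  # 校准分数(示意)
thr = calibrate(cal, alpha=0.1)
print(answer_or_ask(confidence=0.95, threshold=thr))  # proceed
print(answer_or_ask(confidence=0.50, threshold=thr))  # ask-for-help
```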
[NLP-50] Context-Enhanced Contrastive Search for Improved LLM Text Generation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成高质量文本时面临的挑战,即如何在连贯性、多样性和相关性之间取得平衡。传统解码方法如束搜索和top-k采样在长文本生成任务中常出现重复或不连贯的输出问题。为解决这些限制,论文提出了一种改进的对比搜索算法——上下文增强的对比搜索(Context-Enhanced Contrastive Search, CECS),其关键在于引入动态上下文重要性加权、多层级对比搜索以及自适应温度控制,以优化流畅性、创造性和精确性的平衡。
链接: https://arxiv.org/abs/2504.21020
作者: Jaydip Sen,Rohit Pandey,Hetvi Waghela
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This is the pre-review version of our paper, which has been accepted for publication in the IEEE 6th International Conference on Emerging Technologies (INCET). The conference will be organized at Belgaum, India, from May 24 to 26, 2025. This is not the final camera-ready paper, which will be available on IEEE Xplore. The paper is 9 pages long, and it contains 2 Figures and 4 Tables
Abstract:Recently, Large Language Models (LLMs) have demonstrated remarkable advancements in Natural Language Processing (NLP). However, generating high-quality text that balances coherence, diversity, and relevance remains challenging. Traditional decoding methods, such as beam search and top-k sampling, often struggle with either repetitive or incoherent outputs, particularly in tasks that require long-form text generation. To address these limitations, the paper proposes a novel enhancement of the well-known Contrastive Search algorithm: Context-Enhanced Contrastive Search (CECS) with contextual calibration. The proposed scheme introduces several novelties, including dynamic contextual importance weighting, multi-level Contrastive Search, and adaptive temperature control, to optimize the balance between fluency, creativity, and precision. The performance of CECS is evaluated using several standard metrics such as BLEU, ROUGE, and semantic similarity. Experimental results demonstrate significant improvements in both the coherence and relevance of the texts generated by CECS, outperforming the existing Contrastive Search techniques. The proposed algorithm has several potential applications in the real world, including legal document drafting, customer service chatbots, and content marketing.
zh
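作为背景,经典对比搜索的单步选词规则是在模型置信度与“退化惩罚”(候选 token 表示与上下文表示的最大相似度)之间折中。下面的草图在此基础上加了一个随上下文自相似度变化的 α,用来粗略模拟 CECS 的动态上下文加权;该动态规则是假设,并非论文公式。

```python
import torch
import torch.nn.functional as F

def contrastive_step(cand_logprobs, cand_hidden, ctx_hidden, base_alpha=0.6):
    # 退化惩罚:每个候选 token 与已有上下文各位置表示的最大余弦相似度
    sim = F.cosine_similarity(
        cand_hidden.unsqueeze(1), ctx_hidden.unsqueeze(0), dim=-1)  # (k, T)
    penalty = sim.max(dim=1).values                                 # (k,)
    # 动态 alpha(假设规则):上下文越“自相似”(越易重复),惩罚权重越大
    alpha = base_alpha * float(penalty.mean().clamp(0, 1))
    # 经典对比搜索打分:(1 - alpha) * p(v|x) - alpha * penalty
    score = (1 - alpha) * cand_logprobs.exp() - alpha * penalty
    return int(score.argmax())

k, T, d = 5, 12, 16  # 候选数、上下文长度、隐层维度(示意)
best = contrastive_step(torch.log_softmax(torch.randn(k), -1),
                        torch.randn(k, d), torch.randn(T, d))
print("chosen candidate:", best)
```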
[NLP-51] Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations NAACL2025
【速读】: 该论文旨在解决生成式 AI (Generative AI) 生成文本(AIGT)检测方法在泛化能力和鲁棒性之间难以兼顾的问题。现有方法通常仅关注模型的泛化能力或鲁棒性,而缺乏统一机制来同时提升两者。论文提出的关键解决方案是通过引入强化学习框架下的动态扰动(DP-Net),结合精心设计的奖励和动作机制,从而有效提升检测模型在跨领域场景下的泛化能力以及在对抗攻击下的鲁棒性。
链接: https://arxiv.org/abs/2504.21019
作者: Yinghan Zhou,Juan Wen,Wanli Peng,Yiming Xue,Ziwei Zhang,Zhengxian Wu
机构: China Agricultural University (中国农业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by NAACL 2025 main conference
Abstract:The growing popularity of large language models has raised concerns regarding the potential to misuse AI-generated text (AIGT). It becomes increasingly critical to establish an excellent AIGT detection method with high generalization and robustness. However, existing methods either focus on model generalization or concentrate on robustness. A unified mechanism to simultaneously address the challenges of generalization and robustness is less explored. In this paper, we argue that robustness can be viewed as a specific form of domain shift, and empirically reveal an intrinsic mechanism for model generalization in the AIGT detection task. Then, we propose a novel AIGT detection method (DP-Net) via dynamic perturbations introduced by reinforcement learning with an elaborated reward and action. Experimentally, extensive results show that the proposed DP-Net significantly outperforms some state-of-the-art AIGT detection methods in generalization capacity across three cross-domain scenarios. Meanwhile, DP-Net achieves the best robustness under two text adversarial attacks. The code is publicly available at this https URL.
zh
[NLP-52] HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization
【速读】: 该论文旨在解决预训练语言模型(Pre-trained Language Models, PLMs)在中等和低资源语言上的性能不足问题,这一问题主要源于模型在预训练阶段对这些语言的接触有限。为了解决这一问题,现有方法通常通过引入目标语言特定的token、初始化其嵌入并进行持续预训练来提升性能。其中,OFA方法采用基于相似性的子词嵌入初始化策略,虽有效但受限于将目标语言token嵌入限制为固定数量源语言嵌入的凸组合,可能影响表达能力。该论文提出的解决方案关键在于使用基于超网络(Hypernetwork)的方法——HYPEROFA,该方法通过训练一个超网络,从外部多语言词向量空间映射到PLMs的token嵌入空间,从而生成灵活的目标语言token嵌入,作为持续预训练的良好起点。
链接: https://arxiv.org/abs/2504.21018
作者: Enes Özeren,Yihong Liu,Hinrich Schütze
机构: LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 3 figures, 15 tables
Abstract:Many pre-trained language models (PLMs) exhibit suboptimal performance on mid- and low-resource languages, largely due to limited exposure to these languages during pre-training. A common strategy to address this is to introduce new tokens specific to the target languages, initialize their embeddings, and apply continual pre-training on target-language data. Among such methods, OFA (Liu et al., 2024a) proposes a similarity-based subword embedding initialization heuristic that is both effective and efficient. However, OFA restricts target-language token embeddings to be convex combinations of a fixed number of source-language embeddings, which may limit expressiveness. To overcome this limitation, we propose HYPEROFA, a hypernetwork-based approach for more adaptive token embedding initialization. The hypernetwork is trained to map from an external multilingual word vector space to the PLMs token embedding space using source-language tokens. Once trained, it can generate flexible embeddings for target-language tokens, serving as a good starting point for continual pretraining. Experiments demonstrate that HYPEROFA consistently outperforms random initialization baseline and matches or exceeds the performance of OFA in both continual pre-training convergence and downstream task performance. We make the code publicly available.
zh
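HYPEROFA 的核心是一个从外部多语言词向量空间到 PLM 嵌入空间的映射网络。下面是一个极简训练草图:用源语言 token(两个空间中都有表示)监督一个小型 MLP,训练后为目标语言新 token 生成初始嵌入。网络结构、维度与随机数据均为假设示意,并非论文配置。

```python
import torch

d_ext, d_plm, n_src, n_tgt = 300, 768, 5000, 200

# 超网络:外部词向量空间 -> PLM token 嵌入空间(结构为假设)
hyper = torch.nn.Sequential(
    torch.nn.Linear(d_ext, 512), torch.nn.GELU(), torch.nn.Linear(512, d_plm))

src_ext = torch.randn(n_src, d_ext)  # 源语言 token 的外部多语言词向量(示意数据)
src_plm = torch.randn(n_src, d_plm)  # 同一批 token 在 PLM 中已有的嵌入(示意数据)

opt = torch.optim.Adam(hyper.parameters(), lr=1e-3)
for step in range(200):
    idx = torch.randint(0, n_src, (256,))
    loss = torch.nn.functional.mse_loss(hyper(src_ext[idx]), src_plm[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()

# 训练后:为目标语言新 token 生成灵活的初始嵌入,作为持续预训练的起点
tgt_ext = torch.randn(n_tgt, d_ext)
init_emb = hyper(tgt_ext).detach()
print(init_emb.shape)  # torch.Size([200, 768])
```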
[NLP-53] ViQA-COVID: COVID-19 Machine Reading Comprehension Dataset for Vietnamese
【速读】: 该论文试图解决在新冠疫情背景下,利用人工智能(AI)技术支持疾病预防与控制的问题,特别是通过机器阅读理解(MRC)技术来提升信息处理与应对能力。其解决方案的关键在于创建了首个针对越南语的新冠疫情机器阅读理解数据集ViQA-COVID,该数据集不仅可用于构建相关模型和系统以辅助疾病预防,同时也是首个支持多跨度抽取的越南语MRC数据集,旨在推动越南语及多语言环境下的MRC研究发展。
链接: https://arxiv.org/abs/2504.21017
作者: Hai-Chung Nguyen-Phung,Ngoc C. Lê,Van-Chien Nguyen,Hang Thi Nguyen,Thuy Phuong Thi Nguyen
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages. Technical report
Abstract:After two years of appearance, COVID-19 has negatively affected people and normal life around the world. As of May 2022, there are more than 522 million cases and six million deaths worldwide (including nearly ten million cases and over forty-three thousand deaths in Vietnam). Economy and society are both severely affected. The Omicron variant of COVID-19 has broken through the disease prevention measures of many countries and rapidly increased the number of infections. Resource overloading in treatment and epidemic prevention is happening all over the world. It can be seen that applying artificial intelligence (AI) to support people at this time is extremely necessary. There have been many extremely useful studies applying AI to prevent COVID-19, and studies on machine reading comprehension (MRC) are among them. Realizing that, we created the first MRC dataset about COVID-19 for Vietnamese, ViQA-COVID, which can be used to build models and systems, contributing to disease prevention. Besides, ViQA-COVID is also the first multi-span extraction MRC dataset for Vietnamese; we hope that it can contribute to promoting MRC studies in Vietnamese and multilingual settings.
zh
[NLP-54] Nested Named-Entity Recognition on Vietnamese COVID-19: Dataset and Experiments IJCAI2021
【速读】: 该论文试图解决在越南疫情防控过程中,由于人工进行接触者追踪、本地化和隔离工作量大且效率低的问题。其解决方案的关键在于开展命名实体识别(Named Entity Recognition, NER)研究,并构建一个手动标注的新冠病毒数据集,该数据集支持嵌套命名实体识别任务,定义了新的实体类型以提升系统在疫情防控中的应用效果。
链接: https://arxiv.org/abs/2504.21016
作者: Ngoc C.Lê,Hai-Chung Nguyen-Phung,Thu-Huong Pham Thi,Hue Vu,Phuong-Thao Nguyen Thi,Thu-Thuy Tran,Hong-Nhung Le Thi,Thuy-Duong Nguyen-Thi,Thanh-Huy Nguyen
机构: School of Applied Mathematics and Informatics, Hanoi University of Science and Technology (应用数学与信息学院,河内科技大学); Financial Deep Mind (金融深度思维); National Economics University (国家经济大学); Faculty of Information Technology, Hanoi Open University (信息技术学院,河内开放大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages. AI4SG-21 The 3rd Workshop on Artificial Intelligence for Social Good at IJCAI 2021
Abstract:The COVID-19 pandemic has caused great losses worldwide; efforts to prevent it have taken place, but many countries have failed. In Vietnam, the traceability, localization, and quarantine of people who have been in contact with patients contribute to effective disease prevention. However, this is done by hand and takes a lot of work. In this research, we describe a named-entity recognition (NER) study that assists in the prevention of the COVID-19 pandemic in Vietnam. We also present our manually annotated COVID-19 dataset with a nested named entity recognition task for Vietnamese, which defines new entity types used by our system.
zh
[NLP-55] Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
【速读】: 该论文试图解决传统密集检索模型训练中依赖于从文档语料库中挖掘困难负例(hard negative, HN)的问题,这一过程通常计算成本高且需要完整的语料库访问权限。其解决方案的关键在于提出一种端到端的流程,利用大型语言模型(Large Language Model, LLM)首先从一段文本生成查询,随后仅基于该查询文本生成困难负例,从而实现无需语料库的负例生成。实验结果表明,该全LLM流水线在多个BEIR基准数据集上表现与基于BM25和交叉编码器(cross-encoder, CE)的传统方法相当,证明了该方法在不依赖语料库的情况下能够达到与复杂挖掘技术相当的效果。
链接: https://arxiv.org/abs/2504.21015
作者: Aarush Sinha
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Training effective dense retrieval models often relies on hard negative (HN) examples mined from the document corpus via methods like BM25 or cross-encoders (CE), processes that can be computationally demanding and require full corpus access. This paper introduces a different approach, an end-to-end pipeline where a Large Language Model (LLM) first generates a query from a passage, and then generates a hard negative example using only that query text. This corpus-free negative generation contrasts with standard mining techniques. We evaluated this LLM Query → LLM HN approach against traditional LLM Query → BM25 HN and LLM Query → CE HN pipelines using E5-Base and GTE-Base models on several BEIR benchmark datasets. Our results show the proposed all-LLM pipeline achieves performance identical to both the BM25 and the computationally intensive CE baselines across nDCG@10, Precision@10, and Recall@100 metrics. This demonstrates that our corpus-free negative generation method matches the effectiveness of complex, corpus-dependent mining techniques, offering a potentially simpler and more efficient pathway for training high-performance retrievers without sacrificing results. We make the dataset, including the queries and the hard negatives for all three methods, publicly available at this https URL.
zh
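论文的“免语料”负例生成只有两步提示:先由段落生成查询,再仅凭查询文本生成困难负例。下面是一个示意性实现,call_llm 为假设占位函数,提示词也仅供参考,并非论文原文。

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("接入任意 LLM 后端")

def make_training_triplet(passage: str) -> dict:
    # 第一步:由段落生成一条自然的检索查询
    query = call_llm(
        f"为以下段落写一个自然的检索查询,只输出查询本身:\n{passage}")
    # 第二步:仅凭查询文本生成困难负例,全程不访问语料库
    hard_negative = call_llm(
        "写一段与下面查询主题相关、表面上像答案、但实际上并不能回答该查询的段落"
        f"(困难负例),只输出段落本身:\n{query}")
    return {"query": query, "positive": passage, "negative": hard_negative}
```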
[NLP-56] Analyzing Feedback Mechanisms in AI-Generated MCQs: Insights into Readability, Lexical Properties, and Levels of Challenge CCS
【速读】: 该论文旨在解决AI生成反馈在教育场景中的语言特征及其适应性问题,特别是其可读性、词汇丰富性和在不同难度题目中的适应能力。研究通过分析Google的Gemini 1.5-flash文本模型针对计算机科学多项选择题生成的反馈,结合三种难度等级和三种反馈语气进行评估。解决方案的关键在于利用微调的基于RoBERTa的多任务学习(MTL)模型预测这些语言属性,并取得了较高的预测精度,为构建更个性化和有效的AI驱动反馈机制提供了理论支持与技术路径。
链接: https://arxiv.org/abs/2504.21013
作者: Antoun Yaacoub,Zainab Assaghir,Lionel Prevost,Jérôme Da-Rugna
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper will be presented in the 9th Int. Conf. on Computer, Software and Modeling (ICCSM 2025), Roma, Italy, 2025, July 3-5
Abstract:Artificial Intelligence (AI)-generated feedback in educational settings has garnered considerable attention due to its potential to enhance learning outcomes. However, a comprehensive understanding of the linguistic characteristics of AI-generated feedback, including readability, lexical richness, and adaptability across varying challenge levels, remains limited. This study delves into the linguistic and structural attributes of feedback generated by Google’s Gemini 1.5-flash text model for computer science multiple-choice questions (MCQs). A dataset of over 1,200 MCQs was analyzed, considering three difficulty levels (easy, medium, hard) and three feedback tones (supportive, neutral, challenging). Key linguistic metrics, such as length, readability scores (Flesch-Kincaid Grade Level), vocabulary richness, and lexical density, were computed and examined. A fine-tuned RoBERTa-based multi-task learning (MTL) model was trained to predict these linguistic properties, achieving a Mean Absolute Error (MAE) of 2.0 for readability and 0.03 for vocabulary richness. The findings reveal significant interaction effects between feedback tone and question difficulty, demonstrating the dynamic adaptation of AI-generated feedback within diverse educational contexts. These insights contribute to the development of more personalized and effective AI-driven feedback mechanisms, highlighting the potential for improved learning outcomes while underscoring the importance of ethical considerations in their design and deployment.
zh
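摘要中使用的 Flesch-Kincaid Grade Level 有标准公式:FKGL = 0.39 × (词数/句数) + 11.8 × (音节数/词数) - 15.59。下面的草图按该公式计算,其中音节计数采用粗略的元音组启发式,仅作近似,并非论文使用的具体实现。

```python
import re

def count_syllables(word: str) -> int:
    # 粗略启发式:按连续元音组计数,至少记 1 个音节
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fk_grade_level(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / sentences
            + 11.8 * syllables / len(words) - 15.59)

feedback = ("Good attempt! Review how a binary search halves the interval "
            "on each step, then retry the question.")
print(round(fk_grade_level(feedback), 2))
```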
[NLP-57] Waking Up an AI: A Quantitative Framework for Prompt-Induced Phase Transition in Large Language Models
【速读】: 该论文试图解决的问题是揭示人类直觉思维的底层机制,并通过对比人类与大型语言模型(Large Language Models, LLMs)的认知动态来探索这一问题。其解决方案的关键在于提出了一种两部分框架:一种是能够触发LLMs响应性快速变化的过渡诱导提示(Transition-Inducing Prompt, TIP),另一种是利用另一个LLM评估这种变化的过渡量化提示(Transition Quantifying Prompt, TQP)。该方法实现了对AI认知行为的定量分析,从而揭示了LLMs在概念融合方面的不足,表明其尚未能复制人类直觉中的概念整合过程。
链接: https://arxiv.org/abs/2504.21012
作者: Makoto Sato
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:What underlies intuitive human thinking? One approach to this question is to compare the cognitive dynamics of humans and large language models (LLMs). However, such a comparison requires a method to quantitatively analyze AI cognitive behavior under controlled conditions. While anecdotal observations suggest that certain prompts can dramatically change LLM behavior, these observations have remained largely qualitative. Here, we propose a two-part framework to investigate this phenomenon: a Transition-Inducing Prompt (TIP) that triggers a rapid shift in LLM responsiveness, and a Transition Quantifying Prompt (TQP) that evaluates this change using a separate LLM. Through controlled experiments, we examined how LLMs react to prompts embedding two semantically distant concepts (e.g., mathematical aperiodicity and traditional crafts)–either fused together or presented separately–by changing their linguistic quality and affective tone. Whereas humans tend to experience heightened engagement when such concepts are meaningfully blended producing a novel concept–a form of conceptual fusion–current LLMs showed no significant difference in responsiveness between semantically fused and non-fused prompts. This suggests that LLMs may not yet replicate the conceptual integration processes seen in human intuition. Our method enables fine-grained, reproducible measurement of cognitive responsiveness, and may help illuminate key differences in how intuition and conceptual leaps emerge in artificial versus human minds.
zh
[NLP-58] Glucagon and insulin production in pancreatic cells modeled using Petri nets and Boolean networks
【速读】: 该论文试图解决糖尿病中葡萄糖调节机制的复杂性问题,特别是通过构建数学模型来理解胰岛素和胰高血糖素分泌的动态过程。解决方案的关键在于利用Petri网模型对胰岛β细胞中的胰岛素分泌以及胰岛α细胞中的胰高血糖素分泌进行建模,并进一步将整个胰岛素和胰高血糖素分泌系统转化为布尔网络,以分析其在不同血糖水平下的动态行为。
链接: https://arxiv.org/abs/2504.21578
作者: Kamila Barylska,Frank Delaplace,Anna Gogolińska,Ewa Pańkowska
机构: Nicolaus Copernicus University in Toruń (尼古拉·哥白尼托伦大学); Paris-Saclay University - University Evry (巴黎-萨克雷大学-埃夫里大学); Institute of Diabetology, Warsaw (华沙糖尿病研究所)
类目: Cell Behavior (q-bio.CB); Computation and Language (cs.CL)
备注:
Abstract:Diabetes is a chronic civilization disease characterized by a constantly elevated concentration of glucose in the blood. Many processes are involved in glucose regulation, and their interactions are very complex. To better understand those processes, we set ourselves the goal of creating a Petri net model of glucose regulation in the whole body. So far we have managed to create a model of glycolysis and synthesis of glucose in the liver, and general overview models of glucose regulation in a healthy and a diabetic person. In this paper we introduce Petri net models of insulin secretion in the beta cells of the pancreas, and of glucagon secretion in the pancreas alpha cells. Those two hormones have mutually opposite effects: insulin preventing hyperglycemia, and glucagon preventing hypoglycemia. Understanding the mechanisms of insulin and glucagon secretion constitutes the basis for understanding diabetes. We also present a model in which both processes occur together, depending on the blood glucose level. The dynamics of each model is analysed. Additionally, we transform the overall insulin and glucagon secretion system into a Boolean network, following standard transformation rules.
zh
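为说明“将分泌系统转化为布尔网络”的含义,下面给出一个教科书式的三节点玩具布尔网络并做同步更新:高血糖激活胰岛素、抑制胰高血糖素。该网络是极度简化的假设示例,并非论文中由 Petri 网转换得到的真实网络。

```python
# 玩具布尔网络:节点取 True/False,按更新规则同步演化
def step(state: dict) -> dict:
    return {
        "glucose_high": state["glucose_high"],   # 血糖水平作为外部输入,保持不变
        "insulin": state["glucose_high"],        # 高血糖 -> β 细胞分泌胰岛素
        "glucagon": not state["glucose_high"],   # 低血糖 -> α 细胞分泌胰高血糖素
    }

state = {"glucose_high": True, "insulin": False, "glucagon": True}
for t in range(3):
    state = step(state)
    print(t, state)  # 一步后收敛到 insulin=True, glucagon=False 的不动点
```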
[NLP-59] Who Gets the Callback? Generative AI and Gender Bias
【速读】: 该论文试图解决生成式人工智能(Generative AI)在招聘过程中可能存在的性别偏见问题,特别是大型语言模型(LLMs)在候选人筛选中的偏见表现。其解决方案的关键在于通过分析大量真实工作招聘信息,评估模型对男女候选人的推荐倾向,并结合职业分类和语言特征分析,揭示模型推荐与传统性别刻板印象的强相关性。此外,通过引入人格特质和历史人物视角模拟,探索了减少刻板印象的可能性,发现较低宜人性的模型行为可降低偏见,从而为提升AI招聘系统的公平性提供依据。
链接: https://arxiv.org/abs/2504.21400
作者: Sugat Chaturvedi,Rochana Chaturvedi
机构: Ahmedabad University (艾哈迈达巴德大学); University of Illinois Chicago (芝加哥伊利诺伊大学)
类目: General Economics (econ.GN); Computation and Language (cs.CL)
备注:
Abstract:Generative artificial intelligence (AI), particularly large language models (LLMs), is being rapidly deployed in recruitment and for candidate shortlisting. We audit several mid-sized open-source LLMs for gender bias using a dataset of 332,044 real-world online job postings. For each posting, we prompt the model to recommend whether an equally qualified male or female candidate should receive an interview callback. We find that most models tend to favor men, especially for higher-wage roles. Mapping job descriptions to the Standard Occupational Classification system, we find lower callback rates for women in male-dominated occupations and higher rates in female-associated ones, indicating occupational segregation. A comprehensive analysis of linguistic features in job ads reveals strong alignment of model recommendations with traditional gender stereotypes. To examine the role of recruiter identity, we steer model behavior by infusing Big Five personality traits and simulating the perspectives of historical figures. We find that less agreeable personas reduce stereotyping, consistent with an agreeableness bias in LLMs. Our findings highlight how AI-driven hiring may perpetuate biases in the labor market and have implications for fairness and diversity within firms.
zh
计算机视觉
[CV-0] ReVision: High-Quality Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction
【速读】:该论文旨在解决视频生成中复杂运动和交互生成的挑战,尤其是在保持运动真实性与一致性方面的问题。其解决方案的关键在于引入ReVision框架,该框架通过显式整合参数化的3D物理知识到预训练的条件视频生成模型中,从而显著提升模型生成高质量视频的能力。具体而言,ReVision通过三个阶段实现这一目标:首先生成粗略视频,接着提取2D和3D特征构建3D物体中心表示并进行物理先验优化,最后将优化后的运动序列作为额外条件反馈至视频扩散模型,以生成运动一致的视频。
链接: https://arxiv.org/abs/2504.21855
作者: Qihao Liu,Ju He,Qihang Yu,Liang-Chieh Chen,Alan Yuille
机构: Johns Hopkins University (约翰霍普金斯大学); Independent Researcher (独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:In recent years, video generation has seen significant advancements. However, challenges still persist in generating complex motions and interactions. To address these challenges, we introduce ReVision, a plug-and-play framework that explicitly integrates parameterized 3D physical knowledge into a pretrained conditional video generation model, significantly enhancing its ability to generate high-quality videos with complex motion and interactions. Specifically, ReVision consists of three stages. First, a video diffusion model is used to generate a coarse video. Next, we extract a set of 2D and 3D features from the coarse video to construct a 3D object-centric representation, which is then refined by our proposed parameterized physical prior model to produce an accurate 3D motion sequence. Finally, this refined motion sequence is fed back into the same video diffusion model as additional conditioning, enabling the generation of motion-consistent videos, even in scenarios involving complex actions and interactions. We validate the effectiveness of our approach on Stable Video Diffusion, where ReVision significantly improves motion fidelity and coherence. Remarkably, with only 1.5B parameters, it even outperforms a state-of-the-art video generation model with over 13B parameters on complex video generation by a substantial margin. Our results suggest that, by incorporating 3D physical knowledge, even a relatively small video diffusion model can generate complex motions and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation.
zh
[CV-1] A Survey of Interactive Generative Video
【Quick Read】: This paper asks how to build a high-quality video generation system that combines generative capability with interactive functionality, i.e., Interactive Generative Video (IGV). The key to its solution is a comprehensive framework of five core modules: Generation, Control, Memory, Dynamics, and Intelligence, aimed at real-time generation of dynamic scenes, open-domain control, long-term coherence, accurate physics simulation, and integrated causal reasoning, thereby pushing IGV toward more sophisticated and practical applications.
Link: https://arxiv.org/abs/2504.21853
Authors: Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, Xihui Liu
Affiliations: The University of Hong Kong; Kuaishou Technology; The Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Interactive Generative Video (IGV) has emerged as a crucial technology in response to the growing demand for high-quality, interactive video content across various domains. In this paper, we define IGV as a technology that combines generative capabilities to produce diverse high-quality video content with interactive features that enable user engagement through control signals and responsive feedback. We survey the current landscape of IGV applications, focusing on three major domains: 1) gaming, where IGV enables infinite exploration in virtual worlds; 2) embodied AI, where IGV serves as a physics-aware environment synthesizer for training agents in multimodal interaction with dynamically evolving scenes; and 3) autonomous driving, where IGV provides closed-loop simulation capabilities for safety-critical testing and validation. To guide future development, we propose a comprehensive framework that decomposes an ideal IGV system into five essential modules: Generation, Control, Memory, Dynamics, and Intelligence. Furthermore, we systematically analyze the technical challenges and future directions in realizing each component for an ideal IGV system, such as achieving real-time generation, enabling open-domain control, maintaining long-term coherence, simulating accurate physics, and integrating causal reasoning. We believe that this systematic analysis will facilitate future research and development in the field of IGV, ultimately advancing the technology toward more sophisticated and practical applications.
zh
[CV-2] COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
【Quick Read】: This paper targets the weakness of Multimodal Large Language Models (MLLMs) on complex vision-language tasks that require multiple capabilities working together, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. The conventional training recipe, Visual Instruction Tuning (VIT), focuses on scaling data volume while ignoring the compositional complexity of training samples. The key to the solution is COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset with explicitly controlled compositional complexity, allowing MLLMs to learn complex capabilities more efficiently.
Link: https://arxiv.org/abs/2504.21850
Authors: Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky
Affiliations: Princeton University; Meta AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 13 figures
Abstract:Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This might be partially the result of the fact that Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume, but not the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset explicitly controlling for the compositional complexity of the training examples. The data from COMPACT allows MLLMs to train on combinations of atomic capabilities to learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves comparable performance to the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several, especially those involving complex multi-capability tasks. For example, COMPACT achieves substantial 83.3% improvement on MMStar and 94.0% improvement on MM-Vet compared to the full-scale VIT on particularly complex questions that require four or more atomic capabilities. COMPACT offers a scalable, data-efficient, visual compositional tuning recipe to improve on complex visual-language tasks.
zh
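The compositional recipe above can be illustrated with a small sketch: given a pool of atomic visual capabilities, training prompts are assembled at an explicitly chosen complexity level k (the number of atomic capabilities combined per example). The capability names and prompt templates below are hypothetical placeholders, not COMPACT's actual data pipeline.

```python
import itertools
import random

# Hypothetical pool of atomic visual capabilities (illustrative only).
ATOMIC_CAPABILITIES = {
    "recognition": "identify the objects present",
    "counting": "count how many instances of each object appear",
    "spatial": "describe the spatial relations between the objects",
    "attributes": "describe the color and size of each object",
}

def sample_compositional_prompts(k: int, n: int, seed: int = 0) -> list:
    """Build n prompts that each combine exactly k atomic capabilities,
    making the compositional complexity of the training set explicit."""
    rng = random.Random(seed)
    combos = list(itertools.combinations(ATOMIC_CAPABILITIES, k))
    prompts = []
    for _ in range(n):
        combo = rng.choice(combos)
        steps = "; then ".join(ATOMIC_CAPABILITIES[c] for c in combo)
        prompts.append(f"Look at the image, {steps}.")
    return prompts

# A complexity-controlled curriculum: k = 1 (atomic) up to k = 3 (complex).
for k in (1, 2, 3):
    print(f"k={k}:", sample_compositional_prompts(k, 1)[0])
```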
[CV-3] Differentiable Room Acoustic Rendering with Multi-View Vision Priors
【Quick Read】: This paper aims to create realistic acoustic experiences in virtual environments, in particular the efficient and accurate estimation of the Room Impulse Response (RIR). Existing methods either rely on data-hungry learning-based models or require computationally expensive physics-based modeling. The key to the solution is a multimodal, physics-grounded acoustic rendering framework, Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), which combines visual cues extracted from multi-view images with acoustic beam tracing to achieve efficient, interpretable, and accurate acoustic rendering.
Link: https://arxiv.org/abs/2504.21847
Authors: Derong Jin, Ruohan Gao
Affiliations: University of Maryland, College Park
Subjects: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: Project Page: this https URL
Abstract:An immersive acoustic experience enabled by spatial audio is just as crucial as the visual aspect in creating realistic virtual environments. However, existing methods for room impulse response estimation rely either on data-demanding learning-based models or computationally expensive physics-based modeling. In this work, we introduce Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), a framework that leverages visual cues extracted from multi-view images and acoustic beam tracing for physics-based room acoustic rendering. Experiments across six real-world environments from two datasets demonstrate that our multimodal, physics-based approach is efficient, interpretable, and accurate, significantly outperforming a series of prior methods. Notably, on the Real Acoustic Field dataset, AV-DAR achieves comparable performance to models trained on 10 times more data while delivering relative gains ranging from 16.6% to 50.9% when trained at the same scale.
zh
[CV-4] Active Light Modulation to Counter Manipulation of Speech Visual Content
【Quick Read】: This paper addresses the vulnerability of high-profile speech videos to manipulation in the digital domain, specifically visual falsification of speaker identity and of lip and facial motion. The key to the solution is the Spotlight system, which generates dynamic physical signatures at the event site and embeds them into all video recordings as imperceptible modulated light, thereby protecting video authenticity. These physical signatures encode semantic features tied to the speech event and are cryptographically secured against spoofing; they can be extracted from any downstream video and validated to verify its integrity.
Link: https://arxiv.org/abs/2504.21846
Authors: Hadleigh Schwartz, Xiaofeng Yan, Charles J. Carver, Xia Zhou
Affiliations: Columbia University; Massachusetts Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:High-profile speech videos are prime targets for falsification, owing to their accessibility and influence. This work proposes Spotlight, a low-overhead and unobtrusive system for protecting live speech videos from visual falsification of speaker identity and lip and facial motion. Unlike predominant falsification detection methods operating in the digital domain, Spotlight creates dynamic physical signatures at the event site and embeds them into all video recordings via imperceptible modulated light. These physical signatures encode semantically-meaningful features unique to the speech event, including the speaker’s identity and facial motion, and are cryptographically-secured to prevent spoofing. The signatures can be extracted from any video downstream and validated against the portrayed speech content to check its integrity. Key elements of Spotlight include (1) a framework for generating extremely compact (i.e., 150-bit), pose-invariant speech video features, based on locality-sensitive hashing; and (2) an optical modulation scheme that embeds 200 bps into video while remaining imperceptible both in video and live. Prototype experiments on extensive video datasets show Spotlight achieves AUCs \geq 0.99 and an overall true positive rate of 100% in detecting falsified videos. Further, Spotlight is highly robust across recording conditions, video post-processing techniques, and white-box adversarial attacks on its video feature extraction methodologies.
zh
[CV-5] 3D Stylization via Large Reconstruction Model SIGGRAPH2025
【Quick Read】: This paper addresses appearance stylization in 3D generation from a reference image: adapting the appearance of a generated 3D asset to the visual style of the reference while maintaining multi-view visual consistency. The key to the solution is exploiting the attention mechanisms inside large reconstruction models: certain attention blocks are found to capture appearance-specific features, and injecting features of a style image into those blocks yields an efficient 3D appearance stylization method that requires no training or test-time optimization.
Link: https://arxiv.org/abs/2504.21836
Authors: Ipek Oztas, Duygu Ceylan, Aysegul Dundar
Affiliations: Bilkent University; Adobe Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to SIGGRAPH 2025
Abstract:With the growing success of text or image guided 3D generators, users demand more control over the generation process, appearance stylization being one of them. Given a reference image, this requires adapting the appearance of a generated 3D asset to reflect the visual style of the reference while maintaining visual consistency from multiple viewpoints. To tackle this problem, we draw inspiration from the success of 2D stylization methods that leverage the attention mechanisms in large image generation models to capture and transfer visual style. In particular, we probe if large reconstruction models, commonly used in the context of 3D generation, has a similar capability. We discover that the certain attention blocks in these models capture the appearance specific features. By injecting features from a visual style image to such blocks, we develop a simple yet effective 3D appearance stylization method. Our method does not require training or test time optimization. Through both quantitative and qualitative evaluations, we demonstrate that our approach achieves superior results in terms of 3D appearance stylization, significantly improving efficiency while maintaining high-quality visual outcomes.
zh
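A minimal sketch of the feature-injection idea: run the style image through the network once, cache the output of a chosen attention block, then use a forward hook to blend those cached features in while processing the content input. The toy `AttentionBlock` below is a stand-in for a block of a large reconstruction model, which is not reproduced here; the blending weight `alpha` is an assumption.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Toy stand-in for an appearance-sensitive attention block."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

block = AttentionBlock(dim=64)
style_tokens = torch.randn(1, 16, 64)    # features of the style image
content_tokens = torch.randn(1, 16, 64)  # features of the asset being stylized

# 1) Cache the block's output on the style image.
with torch.no_grad():
    cached_style = block(style_tokens)

# 2) Hook that blends cached style features into the block's output.
def inject_style(module, inputs, output, alpha=0.7):
    return alpha * cached_style + (1 - alpha) * output

handle = block.register_forward_hook(inject_style)
stylized = block(content_tokens)  # output now carries style features
handle.remove()                   # restore the block's normal behavior
```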
[CV-6] Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization
【Quick Read】: This paper aims to balance model performance and computational efficiency in video summarization, particularly segment-wise summary generation. The key to the solution is the DEEVISum model, which combines multimodal prompts (text and audio-derived signals) with Multi Stage Knowledge Distillation (MSKD) and Early Exit (EE) to lower computational cost while preserving summary quality: MSKD improves performance, while EE substantially reduces inference time, yielding an optimized trade-off between the two.
Link: https://arxiv.org/abs/2504.21831
Authors: Anas Anwarul Haq Khan, Utkarsh Verma, Prateek Chanda, Ganesh Ramakrishnan
Affiliations: IIT Bombay
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce DEEVISum (Distilled Early Exit Vision language model for Summarization), a lightweight, efficient, and scalable vision language model designed for segment wise video summarization. Leveraging multi modal prompts that combine textual and audio derived signals, DEEVISum incorporates Multi Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement over baseline distillation (0.5%), while EE reduces inference time by approximately 21% with a 1.3 point drop in F1. Evaluated on the TVSum dataset, our best model PaLI Gemma2 3B + MSKD achieves an F1 score of 61.1, competing the performance of significantly larger models, all while maintaining a lower computational footprint. We publicly release our code and processed dataset to support further research.
zh
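The Early Exit mechanism can be sketched generically: intermediate classifier heads are attached at several depths, and inference stops at the first head whose confidence clears a threshold, trading a small accuracy drop for faster inference. The layer sizes and threshold below are illustrative, not DEEVISum's actual configuration.

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Backbone stages with an exit head after each stage."""
    def __init__(self, dim=128, num_classes=10, num_stages=4):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_stages)
        )
        self.heads = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(num_stages))

    @torch.no_grad()
    def forward(self, x, threshold=0.9):
        for depth, (stage, head) in enumerate(zip(self.stages, self.heads)):
            x = stage(x)
            probs = head(x).softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            # Exit as soon as this head is confident enough.
            if conf.item() >= threshold:
                return pred.item(), depth
        return pred.item(), depth  # fall through to the final head

model = EarlyExitNet()
label, exit_depth = model(torch.randn(1, 128))
print(f"predicted class {label}, exited at stage {exit_depth}")
```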
[CV-7] Why Compress What You Can Generate? When GPT-4o Generation Ushers in Image Compression Fields
【Quick Read】: This paper questions the traditional reliance of image compression on pixel-level transforms and coding, and explores the potential of generative AI for the field. The central challenge is maintaining semantic and structural consistency during decoding. The key to the solution is a structure raster-scan prompt engineering mechanism that maps the image into the textual space and uses it as the condition for GPT-4o's image generation, achieving efficient compression with high-quality reconstruction.
Link: https://arxiv.org/abs/2504.21814
Authors: Yixin Gao, Xiaohan Pan, Xin Li, Zhibo Chen
Affiliations: University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The rapid development of AIGC foundation models has revolutionized the paradigm of image compression, which paves the way for the abandonment of most pixel-level transform and coding, compelling us to ask: why compress what you can generate if the AIGC foundation model is powerful enough to faithfully generate intricate structure and fine-grained details from nothing more than some compact descriptors, i.e., texts, or cues. Fortunately, recent GPT-4o image generation of OpenAI has achieved impressive cross-modality generation, editing, and design capabilities, which motivates us to answer the above question by exploring its potential in image compression fields. In this work, we investigate two typical compression paradigms: textual coding and multimodal coding (i.e., text + extremely low-resolution image), where all/most pixel-level information is generated instead of compressing via the advanced GPT-4o image generation function. The essential challenge lies in how to maintain semantic and structure consistency during the decoding process. To overcome this, we propose a structure raster-scan prompt engineering mechanism to transform the image into textual space, which is compressed as the condition of GPT-4o image generation. Extensive experiments have shown that the combination of our designed structural raster-scan prompts and GPT-4o’s image generation function achieved the impressive performance compared with recent multimodal/generative image compression at ultra-low bitrate, further indicating the potential of AIGC generation in image compression fields.
zh
[CV-8] A simple and effective approach for body part recognition on CT scans based on projection estimation
【Quick Read】: This paper addresses the complexity and cost of annotating CT data in medical imaging, caused by its volumetric nature and frequently missing or incomplete metadata. The key to the solution is a body-region recognition method for 3D CT scans based on 2D X-ray-like estimation: the generated 2D images are used to identify 14 distinct body regions, providing valuable information for constructing high-quality medical datasets.
Link: https://arxiv.org/abs/2504.21810
Authors: Franko Hrzic, Mohammadreza Movahhedi, Ophelie Lavoie-Gagne, Ata Kiapour
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 6 figures
Abstract:It is well known that machine learning models require a high amount of annotated data to obtain optimal performance. Labelling Computed Tomography (CT) data can be a particularly challenging task due to its volumetric nature and often missing and / or incomplete associated meta-data. Even inspecting one CT scan requires additional computer software, or in the case of programming languages - additional programming libraries. This study proposes a simple, yet effective approach based on 2D X-ray-like estimation of 3D CT scans for body region identification. Although body region is commonly associated with the CT scan, it often describes only the focused major body region neglecting other anatomical regions present in the observed CT. In the proposed approach, estimated 2D images were utilized to identify 14 distinct body regions, providing valuable information for constructing a high-quality medical dataset. To evaluate the effectiveness of the proposed method, it was compared against 2.5D, 3D and foundation model (MI2) based approaches. Our approach outperformed the others, where it came on top with statistical significance and F1-Score for the best-performing model EffNet-B0 of 0.980 \pm 0.016 in comparison to the 0.840 \pm 0.114 (2.5D DenseNet-161), 0.854 \pm 0.096 (3D VoxCNN), and 0.852 \pm 0.104 (MI2 foundation model). The utilized dataset comprised three different clinical centers and counted 15,622 CT scans (44,135 labels).
zh
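The core preprocessing step, collapsing a 3D CT volume into a 2D X-ray-like estimate, amounts to an intensity projection along one anatomical axis. A hedged numpy sketch follows; the averaging and normalization choices are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def xray_like_projection(volume: np.ndarray, axis: int = 1) -> np.ndarray:
    """Collapse a CT volume (D, H, W) into a 2D X-ray-like image by
    averaging attenuation along the chosen anatomical axis."""
    proj = volume.astype(np.float32).mean(axis=axis)
    # Rescale to [0, 1] so the image can feed a standard 2D classifier.
    lo, hi = proj.min(), proj.max()
    return (proj - lo) / (hi - lo + 1e-8)

ct = np.random.randint(-1000, 1000, size=(160, 256, 256))  # fake HU volume
frontal = xray_like_projection(ct, axis=1)   # coronal-like view
lateral = xray_like_projection(ct, axis=2)   # sagittal-like view
print(frontal.shape, lateral.shape)          # (160, 256) (160, 256)
```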
[CV-9] Anomaly-Driven Approach for Enhanced Prostate Cancer Segmentation
【Quick Read】: This paper targets the challenges of automatically identifying clinically significant prostate cancer (csPCa), including data imbalance, large variation in tumor size, and scarce annotated data. The key to the solution is an anomaly-detection-based segmentation framework, Anomaly-Driven U-Net (adU-Net), which integrates anomaly maps derived from biparametric MRI sequences into a deep-learning segmentation pipeline to improve csPCa identification. The anomaly maps, generated via Fixed-Point GAN reconstruction, highlight deviations from normal prostate tissue and guide the segmentation model toward potentially cancerous regions.
Link: https://arxiv.org/abs/2504.21789
Authors: Alessia Hu, Regina Beets-Tan, Lishan Cai, Eduardo Pooch
Affiliations: University of Amsterdam, The Netherlands; The Netherlands Cancer Institute, The Netherlands; GROW Research Institute for Oncology and Reproduction, Maastricht University, The Netherlands
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Paper accepted for publication at 2025 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). Copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media.
Abstract:Magnetic Resonance Imaging (MRI) plays an important role in identifying clinically significant prostate cancer (csPCa), yet automated methods face challenges such as data imbalance, variable tumor sizes, and a lack of annotated data. This study introduces Anomaly-Driven U-Net (adU-Net), which incorporates anomaly maps derived from biparametric MRI sequences into a deep learning-based segmentation framework to improve csPCa identification. We conduct a comparative analysis of anomaly detection methods and evaluate the integration of anomaly maps into the segmentation pipeline. Anomaly maps, generated using Fixed-Point GAN reconstruction, highlight deviations from normal prostate tissue, guiding the segmentation model to potential cancerous regions. We compare the performance by using the average score, computed as the mean of the AUROC and Average Precision (AP). On the external test set, adU-Net achieves the best average score of 0.618, outperforming the baseline nnU-Net model (0.605). The results demonstrate that incorporating anomaly detection into segmentation improves generalization and performance, particularly with ADC-based anomaly maps, offering a promising direction for automated csPCa identification.
zh
[CV-10] Anatomical Similarity as a New Metric to Evaluate Brain Generative Models
【Quick Read】: This paper addresses the lack of sensitivity to anatomical fidelity when evaluating generative brain MRI: existing evaluations focus on texture and visual perception while overlooking the accuracy of key anatomical structures. The key to the solution is a new metric, WASABI (Wasserstein-Based Anatomical Brain Index), which uses the deep-learning brain parcellation tool SynthSeg to obtain volumetric measurements of brain regions in each MRI and compares the distributions of real and synthetic anatomies with the multivariate Wasserstein distance, quantifying anatomical discrepancies with higher sensitivity.
Link: https://arxiv.org/abs/2504.21771
Authors: Bahram Jafrasteh, Wei Peng, Cheng Wan, Yimin Luo, Ehsan Adeli, Qingyu Zhao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generative models enhance neuroimaging through data augmentation, quality improvement, and rare condition studies. Despite advances in realistic synthetic MRIs, evaluations focus on texture and perception, lacking sensitivity to crucial anatomical fidelity. This study proposes a new metric, called WASABI (Wasserstein-Based Anatomical Brain Index), to assess the anatomical realism of synthetic brain MRIs. WASABI leverages \textitSynthSeg, a deep learning-based brain parcellation tool, to derive volumetric measures of brain regions in each MRI and uses the multivariate Wasserstein distance to compare distributions between real and synthetic anatomies. Based on controlled experiments on two real datasets and synthetic MRIs from five generative models, WASABI demonstrates higher sensitivity in quantifying anatomical discrepancies compared to traditional image-level metrics, even when synthetic images achieve near-perfect visual quality. Our findings advocate for shifting the evaluation paradigm beyond visual inspection and conventional metrics, emphasizing anatomical fidelity as a crucial benchmark for clinically meaningful brain MRI synthesis. Our code is available at this https URL.
zh
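The metric's two ingredients, per-region volume vectors and a distributional distance between the real and synthetic cohorts, can be sketched as below. For simplicity this sketch computes the discrete optimal-transport cost with the POT library as a stand-in for the multivariate Wasserstein distance; SynthSeg is not invoked, so the volume matrices are fabricated placeholders for its output, and the normalization is an assumption.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def wasabi_score(real_vols: np.ndarray, synth_vols: np.ndarray) -> float:
    """Discrete 2-Wasserstein distance between two cohorts of
    per-region brain volume vectors, each of shape (n_subjects, n_regions)."""
    # Normalize each region so large structures do not dominate the cost.
    scale = real_vols.mean(axis=0, keepdims=True) + 1e-8
    X, Y = real_vols / scale, synth_vols / scale
    # Uniform weights over subjects; squared-Euclidean ground cost.
    a = np.full(len(X), 1 / len(X))
    b = np.full(len(Y), 1 / len(Y))
    M = ot.dist(X, Y)          # pairwise squared distances
    return float(np.sqrt(ot.emd2(a, b, M)))

rng = np.random.default_rng(0)
real = rng.normal(1000, 50, size=(64, 32))    # e.g., 32 parcellated regions
synth = rng.normal(980, 60, size=(64, 32))
print(f"WASABI-style score: {wasabi_score(real, synth):.4f}")
```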
[CV-11] Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space
【Quick Read】: This paper addresses the fact that existing 3D morphable models (3DMMs) cover only a few object categories of particular interest (such as faces or human bodies), because they require demanding 3D data acquisition and category-specific training. The key to the solution is Common3D, a new method that learns 3DMMs of common objects in a fully self-supervised manner from collections of object-centric videos. It represents an object as a learned 3D template mesh plus a deformation field parameterized as an image-conditioned neural network, and models appearance with neural features instead of RGB colors, enabling more generalizable representations through an abstraction from pixel intensities. Training the appearance features with a contrastive objective that exploits the correspondences defined by the deformable template mesh further improves correspondence quality and model performance on 3D object pose and semantic correspondence estimation.
Link: https://arxiv.org/abs/2504.21749
Authors: Leonhard Sommer, Olaf Dünkel, Christian Theobalt, Adam Kortylewski
Affiliations: University of Freiburg; Max Planck Institute for Informatics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D morphable models (3DMMs) are a powerful tool to represent the possible shapes and appearances of an object category. Given a single test image, 3DMMs can be used to solve various tasks, such as predicting the 3D shape, pose, semantic correspondence, and instance segmentation of an object. Unfortunately, 3DMMs are only available for very few object categories that are of particular interest, like faces or human bodies, as they require a demanding 3D data acquisition and category-specific training process. In contrast, we introduce a new method, Common3D, that learns 3DMMs of common objects in a fully self-supervised manner from a collection of object-centric videos. For this purpose, our model represents objects as a learned 3D template mesh and a deformation field that is parameterized as an image-conditioned neural network. Different from prior works, Common3D represents the object appearance with neural features instead of RGB colors, which enables the learning of more generalizable representations through an abstraction from pixel intensities. Importantly, we train the appearance features using a contrastive objective by exploiting the correspondences defined through the deformable template mesh. This leads to higher quality correspondence features compared to related works and a significantly improved model performance at estimating 3D object pose and semantic correspondence. Common3D is the first completely self-supervised method that can solve various vision tasks in a zero-shot manner.
zh
[CV-12] Adaptive 3D UI Placement in Mixed Reality Using Deep Reinforcement Learning
【Quick Read】: This paper addresses how to dynamically and continuously place virtual content at optimal positions in Mixed Reality (MR) to support user tasks. Whereas prior work mostly relies on optimization-based methods, this paper explores reinforcement learning (RL) for continuous 3D content placement that accounts for the user's pose and surrounding environment. The key to the solution is an RL model that maximizes the user's reward signal based on real-time interaction and environmental feedback, enabling personalized and optimized UI and content placement.
Link: https://arxiv.org/abs/2504.21731
Authors: Feiyu Lu, Mengyu Chen, Hsiang Hsu, Pranav Deshpande, Cheng Yao Wang, Blair MacIntyre
Affiliations: JPMorgan Chase & Co.
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24)
Abstract:Mixed Reality (MR) could assist users’ tasks by continuously integrating virtual content with their view of the physical environment. However, where and how to place these content to best support the users has been a challenging problem due to the dynamic nature of MR experiences. In contrast to prior work that investigates optimization-based methods, we are exploring how reinforcement learning (RL) could assist with continuous 3D content placement that is aware of users’ poses and their surrounding environments. Through an initial exploration and preliminary evaluation, our results demonstrate the potential of RL to position content that maximizes the reward for users on the go. We further identify future directions for research that could harness the power of RL for personalized and optimized UI and content placement in MR.
zh
[CV-13] Cert-SSB: Toward Certified Sample-Specific Backdoor Defense
【Quick Read】: This paper targets the security of deep neural networks (DNNs) against backdoor attacks, where an attacker manipulates a small portion of the training data to implant hidden backdoors so that the model behaves normally on clean samples but misclassifies backdoored samples into an attacker-specified target class. Existing certified defenses based on randomized smoothing implicitly assume that all samples are equidistant from the decision boundary, an assumption that may not hold in practice and leads to suboptimal certification performance. The key to the proposed sample-specific certified backdoor defense, Cert-SSB, is to optimize the noise magnitude for each sample, retrain multiple smoothed models on several poisoned training sets with these sample-specific noise levels, and aggregate the smoothed models' predictions into a robust final prediction, together with a storage-update-based certification method that dynamically adjusts each sample's certification region to improve certification performance.
Link: https://arxiv.org/abs/2504.21730
Authors: Ting Qiao, Yingjia Wang, Xing Liu, Sixing Wu, Jianbing Li, Yiming Li
Affiliations: North China Electric Power University; China Unicom; National Engineering Research Center of Next Generation Internet Broadband Service Application; Nanyang Technological University
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages
Abstract:Deep neural networks (DNNs) are vulnerable to backdoor attacks, where an attacker manipulates a small portion of the training data to implant hidden backdoors into the model. The compromised model behaves normally on clean samples but misclassifies backdoored samples into the attacker-specified target class, posing a significant threat to real-world DNN applications. Currently, several empirical defense methods have been proposed to mitigate backdoor attacks, but they are often bypassed by more advanced backdoor techniques. In contrast, certified defenses based on randomized smoothing have shown promise by adding random noise to training and testing samples to counteract backdoor attacks. In this paper, we reveal that existing randomized smoothing defenses implicitly assume that all samples are equidistant from the decision boundary. However, it may not hold in practice, leading to suboptimal certification performance. To address this issue, we propose a sample-specific certified backdoor defense method, termed Cert-SSB. Cert-SSB first employs stochastic gradient ascent to optimize the noise magnitude for each sample, ensuring a sample-specific noise level that is then applied to multiple poisoned training sets to retrain several smoothed models. After that, Cert-SSB aggregates the predictions of multiple smoothed models to generate the final robust prediction. In particular, in this case, existing certification methods become inapplicable since the optimized noise varies across different samples. To conquer this challenge, we introduce a storage-update-based certification method, which dynamically adjusts each sample’s certification region to improve certification performance. We conduct extensive experiments on multiple benchmark datasets, demonstrating the effectiveness of our proposed method. Our code is available at this https URL.
zh
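The aggregation step, classifying many noisy copies of the input under a sample-specific noise level and majority-voting across several smoothed models, can be sketched as follows. The per-sample sigma is taken as given here; the paper obtains it via stochastic gradient ascent, which is not reproduced, and the toy models are stand-ins.

```python
import torch

@torch.no_grad()
def smoothed_predict(models, x, sigma, n_samples=100, num_classes=10):
    """Majority-vote prediction of an ensemble of smoothed classifiers.

    models:   list of torch modules (the retrained smoothed models)
    x:        input tensor of shape (C, H, W)
    sigma:    sample-specific noise level for this particular x
    """
    votes = torch.zeros(num_classes)
    for model in models:
        noise = torch.randn(n_samples, *x.shape) * sigma
        logits = model(x.unsqueeze(0) + noise)      # (n_samples, num_classes)
        preds = logits.argmax(dim=-1)
        votes += torch.bincount(preds, minlength=num_classes).float()
    return int(votes.argmax())

# Toy usage with linear "models" over flattened 8x8 grayscale inputs.
models = [torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 10))
          for _ in range(3)]
x = torch.randn(1, 8, 8)
print("robust prediction:", smoothed_predict(models, x, sigma=0.25))
```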
[CV-14] VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction
【Quick Read】: This paper addresses the lack of fine-grained control over motion variation and emotional intensity in dialogue modeling for virtual avatars, especially in long-sequence modeling. The key to the solution is the VividListener framework, which uses multimodal conditions to guide coherent interaction between both dialogue parties. It introduces a Responsive Interaction Module (RIM) to adaptively represent multimodal interactive embeddings, ensuring that listener dynamics stay semantically coordinated with textual descriptions while preserving expressive reactions to speaker behavior; Emotional Intensity Tags (EIT) further enable editing of emotional intensity.
Link: https://arxiv.org/abs/2504.21718
Authors: Shiying Li, Xingqun Qi, Bingkun Yang, Chen Weile, Zezhao Tian, Muyi Sun, Qifeng Liu, Man Zhang, Zhenan Sun
Affiliations: Beijing University of Posts and Telecommunications; Hong Kong University of Science and Technology; Institute of Automation, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generating responsive listener head dynamics with nuanced emotions and expressive reactions is crucial for practical dialogue modeling in various virtual avatar animations. Previous studies mainly focus on the direct short-term production of listener behavior. They overlook the fine-grained control over motion variations and emotional intensity, especially in long-sequence modeling. Moreover, the lack of long-term and large-scale paired speaker-listener corpora including head dynamics and fine-grained multi-modality annotations (e.g., text-based expression descriptions, emotional intensity) also limits the application of dialogue modeling. To this end, we first newly collect a large-scale multi-turn dataset of 3D dyadic conversation containing more than 1.4M valid frames for multi-modal responsive interaction, dubbed ListenerX. Additionally, we propose VividListener, a novel framework enabling fine-grained, expressive and controllable listener dynamics modeling. This framework leverages multi-modal conditions as guiding principles for fostering coherent interactions between speakers and listeners. Specifically, we design the Responsive Interaction Module (RIM) to adaptively represent the multi-modal interactive embeddings. RIM ensures the listener dynamics achieve fine-grained semantic coordination with textual descriptions and adjustments, while preserving expressive reaction with speaker behavior. Meanwhile, we design the Emotional Intensity Tags (EIT) for emotion intensity editing with multi-modal information integration, applying to both text descriptions and listener motion. Extensive experiments conducted on our newly collected ListenerX dataset demonstrate that VividListener achieves state-of-the-art performance, realizing expressive and controllable listener dynamics.
zh
[CV-15] Vision Transformers in Precision Agriculture: A Comprehensive Survey
【Quick Read】: This paper addresses the limitations of traditional plant disease detection methods in scalability and accuracy, proposing Vision Transformers (ViTs) as an alternative. The key to the solution lies in ViTs' advantages in handling long-range dependencies and in scalability for visual tasks, while mitigating the inductive biases of traditional models such as Convolutional Neural Networks (CNNs) to improve adaptability and performance.
Link: https://arxiv.org/abs/2504.21706
Authors: Saber Mehdipour, Seyed Abolghasem Mirroshandel, Seyed Amirhossein Tabatabaei
Affiliations: University of Guilan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Detecting plant diseases is a crucial aspect of modern agriculture - it plays a key role in maintaining crop health and increasing overall yield. Traditional approaches, though still valuable, often rely on manual inspection or conventional machine learning techniques, both of which face limitations in scalability and accuracy. Recently, Vision Transformers (ViTs) have emerged as a promising alternative, offering benefits such as improved handling of long-range dependencies and better scalability for visual tasks. This survey explores the application of ViTs in precision agriculture, covering tasks from classification to detection and segmentation. We begin by introducing the foundational architecture of ViTs and discuss their transition from Natural Language Processing (NLP) to computer vision. The discussion includes the concept of inductive bias in traditional models like Convolutional Neural Networks (CNNs), and how ViTs mitigate these biases. We provide a comprehensive review of recent literature, focusing on key methodologies, datasets, and performance metrics. The survey also includes a comparative analysis of CNNs and ViTs, with a look at hybrid models and performance enhancements. Technical challenges - such as data requirements, computational demands, and model interpretability - are addressed alongside potential solutions. Finally, we outline potential research directions and technological advancements that could further support the integration of ViTs in real-world agricultural settings. Our goal with this study is to offer practitioners and researchers a deeper understanding of how ViTs are poised to transform smart and precision agriculture.
zh
[CV-16] REHEARSE-3D: A Multi-modal Emulated Rain Dataset for 3D Point Cloud De-raining
【Quick Read】: This paper addresses the impact of sensor degradation on autonomous driving systems in adverse weather, in particular the deterioration of LiDAR point cloud quality caused by rainfall. The key to the solution is REHEARSE-3D, a large-scale, multimodal emulated rain dataset featuring high-resolution LiDAR data (LiDAR-256) and 4D Radar point clouds, recorded under both daytime and nighttime conditions in a controlled weather environment and enriched with rain-characteristic information, providing strong support for sensor noise modeling and point-level analysis of weather effects.
Link: https://arxiv.org/abs/2504.21699
Authors: Abu Mohammed Raisuddin, Jesper Holmblad, Hamed Haghighi, Yuri Poledna, Maikol Funk Drechsler, Valentina Donzella, Eren Erdal Aksoy
Affiliations: Halmstad University; WMG, University of Warwick; CARISSMA Institute of Automated Driving; Technische Hochschule Ingolstadt
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Sensor degradation poses a significant challenge in autonomous driving. During heavy rainfall, the interference from raindrops can adversely affect the quality of LiDAR point clouds, resulting in, for instance, inaccurate point measurements. This, in turn, can potentially lead to safety concerns if autonomous driving systems are not weather-aware, i.e., if they are unable to discern such changes. In this study, we release a new, large-scale, multi-modal emulated rain dataset, REHEARSE-3D, to promote research advancements in 3D point cloud de-raining. Distinct from the most relevant competitors, our dataset is unique in several respects. First, it is the largest point-wise annotated dataset, and second, it is the only one with high-resolution LiDAR data (LiDAR-256) enriched with 4D Radar point clouds logged in both daytime and nighttime conditions in a controlled weather environment. Furthermore, REHEARSE-3D involves rain-characteristic information, which is of significant value not only for sensor noise modeling but also for analyzing the impact of weather at a point level. Leveraging REHEARSE-3D, we benchmark raindrop detection and removal in fused LiDAR and 4D Radar point clouds. Our comprehensive study further evaluates the performance of various statistical and deep-learning models. Upon publication, the dataset and benchmark models will be made publicly available at: this https URL.
zh
[CV-17] Enhancing Self-Supervised Fine-Grained Video Object Tracking with Dynamic Memory Prediction
【Quick Read】: This paper aims to fix the accuracy loss of frame-reconstruction methods for video analysis in complex scenarios (such as occlusion or fast motion), which stems from underusing multiple reference frames for reconstruction and decision-making. The key to the solution is the Dynamic Memory Prediction (DMP) framework: a Reference Frame Memory Engine dynamically selects reference frames based on target pixel features to improve tracking accuracy, and a Bidirectional Target Prediction Network exploits multiple reference frames to strengthen model robustness.
Link: https://arxiv.org/abs/2504.21692
Authors: Zihan Zhou, Changrui Dai, Aibo Song, Xiaolin Fang
Affiliations: Southeast University; Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Successful video analysis relies on accurate recognition of pixels across frames, and frame reconstruction methods based on video correspondence learning are popular due to their efficiency. Existing frame reconstruction methods, while efficient, neglect the value of direct involvement of multiple reference frames for reconstruction and decision-making aspects, especially in complex situations such as occlusion or fast movement. In this paper, we introduce a Dynamic Memory Prediction (DMP) framework that innovatively utilizes multiple reference frames to concisely and directly enhance frame reconstruction. Its core component is a Reference Frame Memory Engine that dynamically selects frames based on object pixel features to improve tracking accuracy. In addition, a Bidirectional Target Prediction Network is built to utilize multiple reference frames to improve the robustness of the model. Through experiments, our algorithm outperforms the state-of-the-art self-supervised techniques on two fine-grained video object tracking tasks: object segmentation and keypoint tracking.
zh
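The memory engine's core operation, choosing from a bank of past frames the references whose features best match the current target, can be sketched as a top-k cosine-similarity lookup. The feature dimension and k below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def select_reference_frames(query_feat, memory_feats, k=3):
    """Pick the k frames in memory most similar to the current frame.

    query_feat:   (D,) feature of the current frame's target region
    memory_feats: (N, D) features of the N frames stored in memory
    Returns indices of the selected reference frames, most similar first.
    """
    sims = F.cosine_similarity(memory_feats, query_feat.unsqueeze(0), dim=-1)
    return sims.topk(min(k, len(memory_feats))).indices

memory = torch.randn(10, 256)   # 10 candidate reference frames
query = torch.randn(256)
refs = select_reference_frames(query, memory)
print("use frames:", refs.tolist())
```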
[CV-18] Visual Text Processing: A Comprehensive Review and Unified Evaluation
【Quick Read】: This paper tackles key challenges in visual text processing, namely how to effectively capture and exploit the distinctive properties of text to improve model robustness. The key to the solution is an in-depth analysis of which textual features suit different visual text processing tasks and of how to integrate these features effectively into processing frameworks. To this end, the authors introduce the VTPBench benchmark and the VTPScore evaluation metric to enable comprehensive, fair assessment and improvement of current techniques.
Link: https://arxiv.org/abs/2504.21682
Authors: Yan Shu, Weichao Zeng, Fangmin Zhao, Zeyu Chen, Zhenhang Li, Xiaomeng Yang, Yu Zhou, Paolo Rota, Xiang Bai, Lianwen Jin, Xu-Cheng Yin, Nicu Sebe
Affiliations: Nankai University; University of Trento; Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Northeastern University; Huazhong University of Science and Technology; South China University of Technology; University of Science and Technology Beijing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visual text is a crucial component in both document and scene images, conveying rich semantic information and attracting significant attention in the computer vision community. Beyond traditional tasks such as text detection and recognition, visual text processing has witnessed rapid advancements driven by the emergence of foundation models, including text image reconstruction and text image manipulation. Despite significant progress, challenges remain due to the unique properties that differentiate text from general objects. Effectively capturing and leveraging these distinct textual characteristics is essential for developing robust visual text processing models. In this survey, we present a comprehensive, multi-perspective analysis of recent advancements in visual text processing, focusing on two key questions: (1) What textual features are most suitable for different visual text processing tasks? (2) How can these distinctive text features be effectively incorporated into processing frameworks? Furthermore, we introduce VTPBench, a new benchmark that encompasses a broad range of visual text processing datasets. Leveraging the advanced visual quality assessment capabilities of multimodal large language models (MLLMs), we propose VTPScore, a novel evaluation metric designed to ensure fair and reliable evaluation. Our empirical study with more than 20 specific models reveals substantial room for improvement in the current techniques. Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing. The relevant repository is available at this https URL.
zh
[CV-19] HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation
【Quick Read】: This paper addresses the inability of existing diffusion models to generate scene-level 4D assets, which are key to improving VR and AR user experiences: current diffusion models mainly model static 3D scenes or object-level dynamics and struggle to deliver truly immersive experiences. The key to the solution is the HoloTime framework, which combines video diffusion models that generate panoramic videos from a single prompt or reference image with a 360-degree 4D scene reconstruction method that seamlessly converts the generated panoramic video into 4D assets, enabling a fully immersive 4D experience.
Link: https://arxiv.org/abs/2504.21650
Authors: Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, Li Yuan
Affiliations: Peking University; Peng Cheng Laboratory; Harbin Institute of Technology, Shenzhen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project homepage: this https URL
Abstract:The rapid advancement of diffusion models holds the promise of revolutionizing the application of VR and AR technologies, which typically require scene-level 4D assets for user experience. Nonetheless, existing diffusion models predominantly concentrate on modeling static 3D scenes or object-level dynamics, constraining their capacity to provide truly immersive experiences. To address this issue, we propose HoloTime, a framework that integrates video diffusion models to generate panoramic videos from a single prompt or reference image, along with a 360-degree 4D scene reconstruction method that seamlessly transforms the generated panoramic video into 4D assets, enabling a fully immersive 4D experience for users. Specifically, to tame video diffusion models for generating high-fidelity panoramic videos, we introduce the 360World dataset, the first comprehensive collection of panoramic videos suitable for downstream 4D scene reconstruction tasks. With this curated dataset, we propose Panoramic Animator, a two-stage image-to-video diffusion model that can convert panoramic images into high-quality panoramic videos. Following this, we present Panoramic Space-Time Reconstruction, which leverages a space-time depth estimation method to transform the generated panoramic videos into 4D point clouds, enabling the optimization of a holistic 4D Gaussian Splatting representation to reconstruct spatially and temporally consistent 4D scenes. To validate the efficacy of our method, we conducted a comparative analysis with existing approaches, revealing its superiority in both panoramic video generation and 4D scene reconstruction. This demonstrates our method’s capability to create more engaging and realistic immersive environments, thereby enhancing user experiences in VR and AR applications.
zh
[CV-20] Diffusion-based Adversarial Identity Manipulation for Facial Privacy Protection
【Quick Read】: This paper targets the privacy risks that face recognition (FR) systems pose on social networks, where existing privacy-enhancing methods fail to generate natural images that effectively protect facial privacy. The key to the solution is Diffusion-based Adversarial Identity Manipulation (DiffAIM): gradient-based adversarial identity guidance is iteratively injected in the low-dimensional latent space of a diffusion model during the reverse diffusion process, progressively steering generation toward the desired adversarial face, achieving effective impersonation while preserving visual naturalness.
Link: https://arxiv.org/abs/2504.21646
Authors: Liqin Wang, Qianyue Hu, Wei Lu, Xiangyang Luo
Affiliations: Sun Yat-sen University; State Key Laboratory of Mathematical Engineering and Advanced Computing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The success of face recognition (FR) systems has led to serious privacy concerns due to potential unauthorized surveillance and user tracking on social networks. Existing methods for enhancing privacy fail to generate natural face images that can protect facial privacy. In this paper, we propose diffusion-based adversarial identity manipulation (DiffAIM) to generate natural and highly transferable adversarial faces against malicious FR systems. To be specific, we manipulate facial identity within the low-dimensional latent space of a diffusion model. This involves iteratively injecting gradient-based adversarial identity guidance during the reverse diffusion process, progressively steering the generation toward the desired adversarial faces. The guidance is optimized for identity convergence towards a target while promoting semantic divergence from the source, facilitating effective impersonation while maintaining visual naturalness. We further incorporate structure-preserving regularization to preserve facial structure consistency during manipulation. Extensive experiments on both face verification and identification tasks demonstrate that compared with the state-of-the-art, DiffAIM achieves stronger black-box attack transferability while maintaining superior visual quality. We also demonstrate the effectiveness of the proposed approach for commercial FR APIs, including Face++ and Aliyun.
zh
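The guidance mechanism, nudging the latent at each reverse-diffusion step with the gradient of an identity objective (pull toward the target identity, push away from the source), can be sketched as below. The decoder and face-embedding network are toy stand-ins, and the guidance scale is an assumption; only the gradient-injection logic is the point.

```python
import torch

def identity_guidance_step(z, decode, face_embed, target_id, source_id, scale=0.5):
    """One gradient-based identity-guidance update on a diffusion latent z.

    decode:     maps latent -> image (differentiable stand-in for the decoder)
    face_embed: maps image -> identity embedding (stand-in FR model)
    """
    z = z.detach().requires_grad_(True)
    emb = face_embed(decode(z))
    # Converge toward the target identity, diverge from the source identity.
    loss = (1 - torch.cosine_similarity(emb, target_id, dim=-1)
            + torch.cosine_similarity(emb, source_id, dim=-1)).mean()
    grad, = torch.autograd.grad(loss, z)
    return (z - scale * grad).detach()

# Toy stand-ins: latent of size 16, "images" of size 32, embeddings of size 8.
decode = torch.nn.Linear(16, 32)
face_embed = torch.nn.Linear(32, 8)
z = torch.randn(1, 16)
target_id, source_id = torch.randn(1, 8), torch.randn(1, 8)
z = identity_guidance_step(z, decode, face_embed, target_id, source_id)
```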
[CV-21] Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection
【Quick Read】: This paper addresses the difficulty of selecting and labeling suitable training samples from large amounts of unlabeled data, where detecting long-tail classes of interest is especially challenging. The key to the solution is an open-vocabulary data selection process that focuses on rare and novel classes to improve model performance in Intelligent Transportation Systems (ITS). The work also provides modules for the complete data-based development cycle, from data acquisition to model deployment, with a public code repository ensuring accessibility and reusability.
Link: https://arxiv.org/abs/2504.21614
Authors: Daniel Bogdoll, Rajanikant Patnaik Ananta, Abeyankar Giridharan, Isabel Moore, Gregory Stevens, Henry X. Liu
Affiliations: University of Michigan Transportation Research Institute, Ann Arbor, MI, USA; Karlsruhe Institute of Technology, Karlsruhe, BW, Germany; Texas A&M University, College Station, TX, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:With an ever-increasing availability of data, it has become more and more challenging to select and label appropriate samples for the training of machine learning models. It is especially difficult to detect long-tail classes of interest in large amounts of unlabeled data. This holds especially true for Intelligent Transportation Systems (ITS), where vehicle fleets and roadside perception systems generate an abundance of raw data. While industrial, proprietary data engines for such iterative data selection and model training processes exist, researchers and the open-source community suffer from a lack of an openly available system. We present the Mcity Data Engine, which provides modules for the complete data-based development cycle, beginning at the data acquisition phase and ending at the model deployment stage. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process. All code is publicly available on GitHub under an MIT license: this https URL
zh
[CV-22] Cascade Detector Analysis and Application to Biomedical Microscopy
【Quick Read】: This paper addresses how to perform object detection efficiently as computer vision models and biomedical datasets grow in size. The key to the solution is using cascade detectors to efficiently identify sparse objects in multiresolution images. Given an object's prevalence and the known accuracies of detectors at different resolutions, the authors derive the accuracy of the cascade detector and the expected number of classifier calls; the results generalize across the number of dimensions and the number of cascade levels.
Link: https://arxiv.org/abs/2504.21598
Authors: Thomas L. Athey, Shashata Sawmya, Nir Shavit
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:As both computer vision models and biomedical datasets grow in size, there is an increasing need for efficient inference algorithms. We utilize cascade detectors to efficiently identify sparse objects in multiresolution images. Given an object’s prevalence and a set of detectors at different resolutions with known accuracies, we derive the accuracy, and expected number of classifier calls by a cascade detector. These results generalize across number of dimensions and number of cascade levels. Finally, we compare one- and two-level detectors in fluorescent cell detection, organelle segmentation, and tissue segmentation across various microscopy modalities. We show that the multi-level detector achieves comparable performance in 30-75% less time. Our work is compatible with a variety of computer vision models and data domains.
zh
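The quantities derived in the paper can be illustrated numerically for a two-level cascade: a cheap low-resolution detector screens everything, and the expensive high-resolution detector runs only on first-stage positives. Under a conditional-independence assumption between the two stages' errors (a simplification of the paper's analysis), the expected second-stage call rate and end-to-end accuracy follow directly; the numbers below are made up.

```python
def two_level_cascade(prevalence, sens1, spec1, sens2, spec2):
    """Expected second-stage call rate and end-to-end accuracy of a
    two-level cascade, assuming conditionally independent detectors.

    sens*/spec*: sensitivity and specificity of each level's detector.
    A sample reaches level 2 only if level 1 fires positive.
    """
    p = prevalence
    # Probability level 1 fires (true positives + false positives).
    fire1 = p * sens1 + (1 - p) * (1 - spec1)
    # The cascade is positive only when both levels fire.
    cascade_sens = sens1 * sens2
    cascade_spec = 1 - (1 - spec1) * (1 - spec2)
    accuracy = p * cascade_sens + (1 - p) * cascade_spec
    return fire1, cascade_sens, cascade_spec, accuracy

# Sparse objects (0.1% prevalence), a fast coarse detector, a slow fine one.
fire1, sens, spec, acc = two_level_cascade(0.001, 0.95, 0.90, 0.99, 0.999)
print(f"level-2 runs on {fire1:.1%} of samples; "
      f"sens={sens:.3f}, spec={spec:.5f}, acc={acc:.5f}")
```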
[CV-23] NCApsulate: NCA for Precision Diagnosis on Capsule Endoscopes
【Quick Read】: This paper addresses the difficulty of localizing a wireless capsule endoscope after ingestion, and of running bleeding segmentation and monocular depth estimation on such resource-constrained miniature devices. The key to the solution is the Neural Cellular Automata (NCA) architecture: large foundation models are distilled into lean NCAs and deployed on an ESP32 microcontroller, enabling efficient image processing on hardware as small as a camera capsule. The approach maintains high accuracy with a drastically smaller memory footprint, and runtime optimizations further accelerate inference speed.
Link: https://arxiv.org/abs/2504.21562
Authors: Henry John Krumb, Anirban Mukhopadhyay
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Wireless Capsule Endoscopy is a non-invasive imaging method for the entire gastrointestinal tract, and is a pain-free alternative to traditional endoscopy. It generates extensive video data that requires significant review time, and localizing the capsule after ingestion is a challenge. Techniques like bleeding detection and depth estimation can help with localization of pathologies, but deep learning models are typically too large to run directly on the capsule. Neural Cellular Automata (NCA) for bleeding segmentation and depth estimation are trained on capsule endoscopic images. For monocular depth estimation, we distill a large foundation model into the lean NCA architecture, by treating the outputs of the foundation model as pseudo ground truth. We then port the trained NCA to the ESP32 microcontroller, enabling efficient image processing on hardware as small as a camera capsule. NCA are more accurate (Dice) than other portable segmentation models, while requiring more than 100x fewer parameters stored in memory than other small-scale models. The visual results of NCA depth estimation look convincing, and in some cases beat the realism and detail of the pseudo ground truth. Runtime optimizations on the ESP32-S3 accelerate the average inference speed significantly, by more than factor 3. With several algorithmic adjustments and distillation, it is possible to eNCApsulate NCA models into microcontrollers that fit into wireless capsule endoscopes. This is the first work that enables reliable bleeding segmentation and depth estimation on a miniaturized device, paving the way for precise diagnosis combined with visual odometry as a means of precise localization of the capsule – on the capsule.
zh
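For readers unfamiliar with Neural Cellular Automata, the core update rule is tiny, which is exactly why it can fit on a microcontroller: a fixed perception convolution followed by a small per-cell MLP, applied with a stochastic update mask. The channel counts below are illustrative assumptions, and the distillation from a depth foundation model is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNCA(nn.Module):
    """Minimal Neural Cellular Automaton update rule."""
    def __init__(self, channels=8, hidden=32):
        super().__init__()
        # Perception: identity + Sobel filters, applied depthwise per channel.
        ident = torch.tensor([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=torch.float32)
        sx = torch.tensor([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=torch.float32) / 8
        kernels = torch.stack([ident, sx, sx.t()])                  # (3, 3, 3)
        self.register_buffer("filters",
                             kernels.repeat(channels, 1, 1).unsqueeze(1))
        self.channels = channels
        self.update = nn.Sequential(
            nn.Conv2d(channels * 3, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 1),
        )

    def forward(self, state, fire_rate=0.5):
        perception = F.conv2d(state, self.filters, padding=1,
                              groups=self.channels)
        delta = self.update(perception)
        # Stochastic mask emulates asynchronous cell updates.
        mask = (torch.rand_like(state[:, :1]) < fire_rate).float()
        return state + delta * mask

nca = TinyNCA()
state = torch.zeros(1, 8, 64, 64)
state[:, :, 32, 32] = 1.0          # seed cell
for _ in range(20):
    state = nca(state)
print(state.shape)
```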
[CV-24] Iterative Trajectory Exploration for Multimodal Agents
【Quick Read】: This paper addresses the problem that multimodal agents rely on large amounts of expert data for fine-tuning when adapting to new environments. The key to the solution is SPORT, an online self-exploration method that refines agent trajectories through step-wise preference optimization: it automatically generates tasks and learns from solving them, without any expert annotation. SPORT iterates over four components: task synthesis, step sampling, step verification, and preference tuning. A verifier provides AI feedback to construct step-wise preference data, which is then used to update the controller's policy through preference tuning, producing a SPORT agent with stronger generalization and effectiveness.
Link: https://arxiv.org/abs/2504.21561
Authors: Pengxiang Li, Zhi Gao, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 8 figures
Abstract:Multimodal agents, which integrate a controller (e.g., a large language model) with external tools, have demonstrated remarkable capabilities in tackling complex tasks. However, existing agents need to collect a large number of expert data for fine-tuning to adapt to new environments. In this paper, we propose an online self-exploration method for multimodal agents, namely SPORT, via step-wise preference optimization to refine the trajectories of agents, which automatically generates tasks and learns from solving the generated tasks, without any expert annotation. SPORT operates through four iterative components: task synthesis, step sampling, step verification, and preference tuning. First, we synthesize multi-modal tasks using language models. Then, we introduce a novel search scheme, where step sampling and step verification are executed alternately to solve each generated task. We employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller’s policy through preference tuning, producing a SPORT Agent. By interacting with real environments, the SPORT Agent evolves into a more refined and capable system. Evaluation in the GTA and GAIA benchmarks show that the SPORT Agent achieves 6.41% and 3.64% improvements, underscoring the generalization and effectiveness introduced by our method. The project page is this https URL.
zh
[CV-25] SAM4EM: Efficient memory-based two stage prompt-free segment anything model adapter for complex 3D neuroscience electron microscopy stacks MICRO CVPR
【Quick Read】: This paper targets 3D segmentation of complex neural structures in electron microscopy (EM) data, in particular improving segmentation accuracy when annotated data is limited. The keys to the solution are: leveraging the Segment Anything Model (SAM) with advanced fine-tuning strategies; a prompt-free adapter that automatically generates prompt embeddings through two-stage mask decoding; a dual-stage fine-tuning method based on Low-Rank Adaptation (LoRA) to enhance segmentation with limited annotations; and a 3D memory attention mechanism that enforces segmentation consistency across 3D stacks.
Link: https://arxiv.org/abs/2504.21544
Authors: Uzair Shah, Marco Agus, Daniya Boges, Vanessa Chiappini, Mahmood Alzubaidi, Jens Schneider, Markus Hadwiger, Pierre J. Magistretti, Mowafa Househ, Corrado Calı
Affiliations: HBKU; KAUST; University of Turin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at (CVPRW) 10th IEEE Workshop on Computer Vision for Microscopy Image Analysis (CVMI)
Abstract:We present SAM4EM, a novel approach for 3D segmentation of complex neural structures in electron microscopy (EM) data by leveraging the Segment Anything Model (SAM) alongside advanced fine-tuning strategies. Our contributions include the development of a prompt-free adapter for SAM using two stage mask decoding to automatically generate prompt embeddings, a dual-stage fine-tuning method based on Low-Rank Adaptation (LoRA) for enhancing segmentation with limited annotated data, and a 3D memory attention mechanism to ensure segmentation consistency across 3D stacks. We further release a unique benchmark dataset for the segmentation of astrocytic processes and synapses. We evaluated our method on challenging neuroscience segmentation benchmarks, specifically targeting mitochondria, glia, and synapses, with significant accuracy improvements over state-of-the-art (SOTA) methods, including recent SAM-based adapters developed for the medical domain and other vision transformer-based approaches. Experimental results indicate that our approach outperforms existing solutions in the segmentation of complex processes like glia and post-synaptic densities. Our code and models are available at this https URL.
zh
[CV-26] RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
【Quick Read】: This paper addresses the insufficient generalization of policies in robotic manipulation, i.e., how to keep a robot effective when facing new environments, objects, or tasks. The key to the solution is introducing grounding masks as an effective intermediate representation: they provide spatial guidance for target objects and placement areas, convey information about object shape and size, and, being built on large-scale vision-language models, carry broad generalization potential. By building the RoboGround system and designing an automated data generation pipeline, the paper validates the effectiveness of grounding masks for improving the generalization of robot policies.
Link: https://arxiv.org/abs/2504.21530
Authors: Haifeng Huang, Xinyi Chen, Yilun Chen, Hao Li, Xiaoshen Han, Zehan Wang, Tai Wang, Jiangmiao Pang, Zhou Zhao
Affiliations: Zhejiang University; Shanghai AI Laboratory
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements in robotic manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing two key advantages: (1) effective spatial guidance that specifies target objects and placement areas while also conveying information about object shape and size, and (2) broad generalization potential driven by large-scale vision-language models pretrained on diverse grounding datasets. We introduce RoboGround, a grounding-aware robotic manipulation system that leverages grounding masks as an intermediate representation to guide policy networks in object manipulation tasks. To further explore and enhance generalization, we propose an automated pipeline for generating large-scale, simulated data with a diverse set of objects and instructions. Extensive experiments show the value of our dataset and the effectiveness of grounding masks as intermediate guidance, significantly enhancing the generalization abilities of robot policies.
zh
[CV-27] MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance
【Quick Read】: This paper aims to improve shape consistency and motion control in video face reenactment. The key to the solution is integrating a 3D face parametric model (FLAME) into a latent diffusion framework: rich 3D expression and pose information, in the form of depth maps, normal maps, and rendering maps, strengthens the expressive power of the latent diffusion model, and a multi-layer face movements fusion module combines identity and motion latent features, enabling parametric alignment of facial identity between the reference image and the motion captured from the driving video.
Link: https://arxiv.org/abs/2504.21497
Authors: Mengting Wei, Yante Li, Tuomas Varanka, Yan Jiang, Licai Sun, Guoying Zhao
Affiliations: University of Oulu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this paper, we propose a method for video face reenactment that integrates a 3D face parametric model into a latent diffusion framework, aiming to improve shape consistency and motion control in existing video-based face generation approaches. Our approach employs the FLAME (Faces Learned with an Articulated Model and Expressions) model as the 3D face parametric representation, providing a unified framework for modeling face expressions and head pose. This enables precise extraction of detailed face geometry and motion features from driving videos. Specifically, we enhance the latent diffusion model with rich 3D expression and detailed pose information by incorporating depth maps, normal maps, and rendering maps derived from FLAME sequences. A multi-layer face movements fusion module with integrated self-attention mechanisms is used to combine identity and motion latent features within the spatial domain. By utilizing the 3D face parametric model as motion guidance, our method enables parametric alignment of face identity between the reference image and the motion captured from the driving video. Experimental results on benchmark datasets show that our method excels at generating high-quality face animations with precise expression and head pose variation modeling. In addition, it demonstrates strong generalization performance on out-of-domain images. Code is publicly available at this https URL.
zh
[CV-28] Consistency-aware Fake Videos Detection on Short Video Platforms
【Quick Read】: This paper addresses the insufficient accuracy of fake news detection on short video platforms, where detection is hampered by the rapid evolution of content manipulation and generation techniques. The key to the solution is a new detection paradigm that explicitly identifies and exploits cross-modal contradictions as discriminative features, built from two core modules: Cross-modal Consistency Learning (CMCL) and Multi-modal Collaborative Diagnosis (MMCD). CMCL quantifies cross-modal inconsistency through pseudo-label generation and cross-modal consistency diagnosis, while MMCD strengthens the model's ability to discriminate fake content through multimodal feature fusion and probability score fusion.
Link: https://arxiv.org/abs/2504.21495
Authors: Junxi Wang, Jize Liu, Na Zhang, Yaxiong Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: ICIC 2025
Abstract:This paper focuses to detect the fake news on the short video platforms. While significant research efforts have been devoted to this task with notable progress in recent years, current detection accuracy remains suboptimal due to the rapid evolution of content manipulation and generation technologies. Existing approaches typically employ a cross-modal fusion strategy that directly combines raw video data with metadata inputs before applying a classification layer. However, our empirical observations reveal a critical oversight: manipulated content frequently exhibits inter-modal inconsistencies that could serve as valuable discriminative features, yet remain underutilized in contemporary detection frameworks. Motivated by this insight, we propose a novel detection paradigm that explicitly identifies and leverages cross-modal contradictions as discriminative cues. Our approach consists of two core modules: Cross-modal Consistency Learning (CMCL) and Multi-modal Collaborative Diagnosis (MMCD). CMCL includes Pseudo-label Generation (PLG) and Cross-modal Consistency Diagnosis (CMCD). In PLG, a Multimodal Large Language Model is used to generate pseudo-labels for evaluating cross-modal semantic consistency. Then, CMCD extracts [CLS] tokens and computes cosine loss to quantify cross-modal inconsistencies. MMCD further integrates multimodal features through Multimodal Feature Fusion (MFF) and Probability Scores Fusion (PSF). MFF employs a co-attention mechanism to enhance semantic interactions across different modalities, while a Transformer is utilized for comprehensive feature fusion. Meanwhile, PSF further integrates the fake news probability scores obtained in the previous step. Extensive experiments on established benchmarks (FakeSV and FakeTT) demonstrate our model exhibits outstanding performance in Fake videos detection.
zh
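The consistency-diagnosis signal reduces to a cosine objective between the [CLS] tokens of the two modality encoders, supervised by the pseudo-labels: consistent pairs are pulled together, contradictory ones pushed apart. A hedged torch sketch follows; the encoder outputs are random stand-ins and the margin is an assumption.

```python
import torch
import torch.nn.functional as F

def consistency_loss(cls_video, cls_meta, pseudo_labels, margin=0.0):
    """Cosine-based cross-modal consistency loss.

    cls_video, cls_meta: (B, D) [CLS] tokens from the two encoders
    pseudo_labels:       (B,) 1 = modalities consistent, 0 = contradictory
    """
    cos = F.cosine_similarity(cls_video, cls_meta, dim=-1)
    # Pull consistent pairs toward cos=1; push contradictory pairs below margin.
    pos = (1 - cos) * pseudo_labels
    neg = F.relu(cos - margin) * (1 - pseudo_labels)
    return (pos + neg).mean()

v = torch.randn(4, 256)
m = torch.randn(4, 256)
y = torch.tensor([1.0, 1.0, 0.0, 0.0])
print(consistency_loss(v, m, y))
```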
[CV-29] ClassWise-CRF: Category-Specific Fusion for Enhanced Semantic Segmentation of Remote Sensing Imagery
【Quick Read】: This paper aims to optimize the fusion of predictions from multiple networks in semantic segmentation of remote sensing imagery, improving both overall accuracy and category-specific performance. The key to the solution is ClassWise-CRF, a category-specific fusion architecture with a two-stage process: first, a greedy algorithm selects expert networks that perform well on specific categories from a pool of candidates; second, their predictions are integrated by adaptively weighting each network's contribution according to its per-category segmentation performance. Borrowing the structure of Conditional Random Fields (CRF), the fused result is further refined for spatial consistency and boundary accuracy.
Link: https://arxiv.org/abs/2504.21491
Authors: Qinfeng Zhu, Yunxi Jiang, Lei Fan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:
Abstract:We propose a result-level category-specific fusion architecture called ClassWise-CRF. This architecture employs a two-stage process: first, it selects expert networks that perform well in specific categories from a pool of candidate networks using a greedy algorithm; second, it integrates the segmentation predictions of these selected networks by adaptively weighting their contributions based on their segmentation performance in each category. Inspired by Conditional Random Field (CRF), the ClassWise-CRF architecture treats the segmentation predictions from multiple networks as confidence vector fields. It leverages segmentation metrics (such as Intersection over Union) from the validation set as priors and employs an exponential weighting strategy to fuse the category-specific confidence scores predicted by each network. This fusion method dynamically adjusts the weights of each network for different categories, achieving category-specific optimization. Building on this, the architecture further optimizes the fused results using unary and pairwise potentials in CRF to ensure spatial consistency and boundary accuracy. To validate the effectiveness of ClassWise-CRF, we conducted experiments on two remote sensing datasets, LoveDA and Vaihingen, using eight classic and advanced semantic segmentation networks. The results show that the ClassWise-CRF architecture significantly improves segmentation performance: on the LoveDA dataset, the mean Intersection over Union (mIoU) metric increased by 1.00% on the validation set and by 0.68% on the test set; on the Vaihingen dataset, the mIoU improved by 0.87% on the validation set and by 0.91% on the test set. These results fully demonstrate the effectiveness and generality of the ClassWise-CRF architecture in semantic segmentation of remote sensing images. The full code is available at this https URL.
zh
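【代码示意】:摘要中"以验证集 IoU 为先验、按指数策略融合各网络逐类置信度"的思路可用如下极简 NumPy 示意(假设性实现;温度系数 `alpha` 与归一化方式均为示意,论文未给出具体形式):

```python
import numpy as np

def classwise_fusion(probs: np.ndarray, iou: np.ndarray, alpha: float = 10.0) -> np.ndarray:
    """probs: [N, C, H, W],N 个网络的逐类概率图;
    iou: [N, C],各网络在验证集上的逐类 IoU(先验);
    返回融合后的 [C, H, W] 置信度图。"""
    w = np.exp(alpha * iou)                  # 逐类指数加权
    w = w / w.sum(axis=0, keepdims=True)     # 在网络维度归一化
    return np.einsum('nc,nchw->chw', w, probs)

probs = np.random.rand(3, 6, 4, 4)           # 3 个网络、6 个类别
iou = np.random.rand(3, 6)
fused = classwise_fusion(probs, iou)
pred = fused.argmax(axis=0)                  # 逐像素取置信度最大的类别
```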
[CV-30] DGSolver: Diffusion Generalist Solver with Universal Posterior Sampling for Image Restoration
【速读】:该论文旨在解决扩散模型在通用图像修复任务中因减少采样步骤而导致的累积误差问题,以及在退化表示的共性和修复质量之间难以平衡的问题。其解决方案的关键在于提出一种名为DGSolver的扩散通用求解器,该求解器通过推导通用扩散模型的精确常微分方程,并结合基于队列的加速采样策略设计高阶求解器,以提升准确性和效率;同时引入通用后验采样方法,更精确地逼近流形约束梯度,从而实现更准确的噪声估计并修正逆向推理中的误差。
链接: https://arxiv.org/abs/2504.21487
作者: Hebaixu Wang,Jing Zhang,Haonan Guo,Di Wang,Jiayi Ma,Bo Du
机构: Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models have achieved remarkable progress in universal image restoration. While existing methods speed up inference by reducing sampling steps, substantial step intervals often introduce cumulative errors. Moreover, they struggle to balance the commonality of degradation representations and restoration quality. To address these challenges, we introduce DGSolver, a diffusion generalist solver with universal posterior sampling. We first derive the exact ordinary differential equations for generalist diffusion models and tailor high-order solvers with a queue-based accelerated sampling strategy to improve both accuracy and efficiency. We then integrate universal posterior sampling to better approximate manifold-constrained gradients, yielding a more accurate noise estimation and correcting errors in inverse inference. Extensive experiments show that DGSolver outperforms state-of-the-art methods in restoration accuracy, stability, and scalability, both qualitatively and quantitatively. Code and models will be available at this https URL.
zh
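【代码示意】:摘要未给出 DGSolver 高阶求解器的具体形式;作为参照,下面用 EDM 风格的二阶 Heun 步骤示意"高阶 ODE 求解器如何在更少步数下抑制累积误差"(`denoise` 为假设的去噪函数,非论文 API):

```python
import torch

def heun_sampler(denoise, x: torch.Tensor, sigmas) -> torch.Tensor:
    """沿概率流 ODE dx/dσ = (x - D(x, σ)) / σ 做二阶 Heun 采样。
    denoise(x, sigma) 返回对干净图像的估计;sigmas 为递减噪声序列(末项为 0)。"""
    for i in range(len(sigmas) - 1):
        s, s_next = sigmas[i], sigmas[i + 1]
        d = (x - denoise(x, s)) / s              # 一阶斜率
        x_pred = x + (s_next - s) * d            # Euler 预测
        if s_next > 0:                           # Heun 校正:取两端斜率平均
            d2 = (x_pred - denoise(x_pred, s_next)) / s_next
            x = x + (s_next - s) * 0.5 * (d + d2)
        else:
            x = x_pred
    return x
```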
[CV-31] CAE-DFKD: Bridging the Transferability Gap in Data-Free Knowledge Distillation
【速读】:该论文旨在解决数据无关的知识蒸馏(Data-Free Knowledge Distillation, DFKD)中模型表示的可迁移性不足的问题,现有方法主要关注图像识别性能的提升,而忽视了学习到的表征在不同任务中的泛化能力。其解决方案的关键在于提出一种类别感知的嵌入级数据无关知识蒸馏(Category-Aware Embedding Data-Free Knowledge Distillation, CAE-DFKD),通过在嵌入层进行优化,克服了传统基于图像级方法在DFKD中应用时的局限性。
链接: https://arxiv.org/abs/2504.21478
作者: Zherui Zhang,Changwei Wang,Rongtao Xu,Wenhao Xu,Shibiao Xu,Yu Zhang,Li Guo
机构: Beijing University of Posts and Telecommunications(北京邮电大学); Chinese Academy of Sciences(中国科学院); Shandong Computer Science Center(山东计算机科学中心); Qilu University of Technology(齐鲁工业大学); Shandong Fundamental Research Center for Computer Science(山东省计算机科学基础研究中心); Tongji University(同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Data-Free Knowledge Distillation (DFKD) enables the knowledge transfer from the given pre-trained teacher network to the target student model without access to the real training data. Existing DFKD methods focus primarily on improving image recognition performance on associated datasets, often neglecting the crucial aspect of the transferability of learned representations. In this paper, we propose Category-Aware Embedding Data-Free Knowledge Distillation (CAE-DFKD), which addresses, at the embedding level, the limitations of previous image-level methods that improve model generalization but fail when directly applied to DFKD. The superiority and flexibility of CAE-DFKD are extensively evaluated, including: (i) significant efficiency advantages resulting from altering the generator training paradigm; (ii) competitive performance with existing DFKD state-of-the-art methods on image recognition tasks; (iii) remarkable transferability of data-free learned representations demonstrated in downstream tasks.
zh
[CV-32] GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers IJCAI2025
【速读】:该论文旨在解决服装缝制图案生成中的多样化设计与生成效率不足的问题,现有方法要么依赖单一输入模态,要么生成效率较低。其解决方案的关键在于提出GarmentDiffusion模型,该模型能够从多模态输入(文本、图像和不完整缝制图案)中生成厘米级精度的矢量3D缝制图案,通过高效编码3D缝制参数为紧凑的边缘标记表示,显著缩短序列长度,并利用扩散变换器在时间轴上同时去噪所有边缘标记,从而大幅提升生成速度。
链接: https://arxiv.org/abs/2504.21476
作者: Xinyu Li,Qi Yao,Yuanda Wang
机构: Zhejiang University (浙江大学); Shenfu Research (深孚研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)
Abstract:Garment sewing patterns are fundamental design elements that bridge the gap between design concepts and practical manufacturing. The generative modeling of sewing patterns is crucial for creating diversified garments. However, existing approaches are limited either by reliance on a single input modality or by suboptimal generation efficiency. In this work, we present GarmentDiffusion, a new generative model capable of producing centimeter-precise, vectorized 3D sewing patterns from multimodal inputs (text, image, and incomplete sewing pattern). Our method efficiently encodes 3D sewing pattern parameters into compact edge token representations, achieving a sequence length that is 10× shorter than that of the autoregressive SewingGPT in DressCode. By employing a diffusion transformer, we simultaneously denoise all edge tokens along the temporal axis, while maintaining a constant number of denoising steps regardless of dataset-specific edge and panel statistics. With all of our design choices combined, the sewing pattern generation speed is accelerated by 100× compared to SewingGPT. We achieve new state-of-the-art results on DressCodeData, as well as on the largest sewing pattern dataset, namely GarmentCodeData. The project website is available at this https URL.
zh
[CV-33] Robust Orthogonal NMF with Label Propagation for Image Clustering
【速读】:该论文旨在解决传统非负矩阵分解(Non-negative Matrix Factorization, NMF)在实际图像聚类任务中对噪声敏感以及无法有效利用有限监督信息的问题。其解决方案的关键在于提出一种统一的非凸框架,称为鲁棒正交非负矩阵分解(Robust Orthogonal Nonnegative Matrix Factorization, RONMF),该方法通过引入图拉普拉斯和标签传播作为正则化项、采用更有效的非凸结构来衡量重构误差,并对基矩阵施加正交约束,从而提升模型的鲁棒性。
链接: https://arxiv.org/abs/2504.21472
作者: Jingjing Liu,Nian Wu,Xianchao Xiu,Jianhua Zhang
机构: Shanghai Key Laboratory of Chips and Systems for Intelligent Connected Vehicle, School of Microelectronics, Shanghai University; State Key Laboratory of Integrated Chips and Systems, Fudan University; School of Mechatronic Engineering and Automation, Shanghai University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Non-negative matrix factorization (NMF) is a popular unsupervised learning approach widely used in image clustering. However, in real-world clustering scenarios, most existing NMF methods are highly sensitive to noise corruption and are unable to effectively leverage limited supervised information. To overcome these drawbacks, we propose a unified non-convex framework with label propagation called robust orthogonal nonnegative matrix factorization (RONMF). This method not only considers the graph Laplacian and label propagation as regularization terms but also introduces a more effective non-convex structure to measure the reconstruction error and imposes orthogonal constraints on the basis matrix to reduce the noise corruption, thereby achieving higher robustness. To solve RONMF, we develop an alternating direction method of multipliers (ADMM)-based optimization algorithm. In particular, all subproblems have closed-form solutions, which ensures its efficiency. Experimental evaluations on eight public image datasets demonstrate that the proposed RONMF outperforms state-of-the-art NMF methods across various standard metrics and shows excellent robustness. The code will be available at this https URL.
zh
[CV-34] Quaternion Nuclear Norms Over Frobenius Norms Minimization for Robust Matrix Completion
【速读】:该论文旨在解决从不完整或噪声数据中恢复隐藏结构的问题,特别是在需要多维数据表示的领域中。其解决方案的关键在于引入四元数核范数与Frobenius范数之比(Quaternion Nuclear norm Over Frobenius norm, QNOF),作为四元数矩阵秩的非凸近似,该方法无需参数且具有尺度不变性。通过四元数奇异值分解,证明了求解QNOF可简化为求解奇异值的L₁/L₂问题,并进一步将其扩展至鲁棒的四元数矩阵补全,采用交替方向乘子法获得在温和条件下保证弱收敛的解。
链接: https://arxiv.org/abs/2504.21468
作者: Yu Guo,Guoqing Chen,Tieyong Zeng,Qiyu Jin,Michael Kwok-Po Ng
机构: Inner Mongolia University (内蒙古大学); The Chinese University of Hong Kong (香港中文大学); Hong Kong Baptist University (香港浸会大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recovering hidden structures from incomplete or noisy data remains a pervasive challenge across many fields, particularly where multi-dimensional data representation is essential. Quaternion matrices, with their ability to naturally model multi-dimensional data, offer a promising framework for this problem. This paper introduces the quaternion nuclear norm over the Frobenius norm (QNOF) as a novel nonconvex approximation for the rank of quaternion matrices. QNOF is parameter-free and scale-invariant. Utilizing quaternion singular value decomposition, we prove that solving the QNOF can be simplified to solving the singular value L1/L2 problem. Additionally, we extend the QNOF to robust quaternion matrix completion, employing the alternating direction method of multipliers (ADMM) to derive solutions that guarantee weak convergence under mild conditions. Extensive numerical experiments validate the proposed model's superiority, consistently outperforming state-of-the-art quaternion methods.
zh
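【代码示意】:QNOF 将四元数矩阵的秩近似为奇异值的 L1/L2 比值,无参数且尺度不变。NumPy 没有四元数 SVD,下面以实矩阵为类比示意该度量的性质(假设性示意,仅说明 L1/L2 比值的行为):

```python
import numpy as np

def nof(X: np.ndarray) -> float:
    """奇异值的 L1/L2 比值:||sigma||_1 / ||sigma||_2。
    对秩为 r 的矩阵,该值介于 1 和 sqrt(r) 之间,且对整体缩放不变。"""
    s = np.linalg.svd(X, compute_uv=False)
    return s.sum() / np.linalg.norm(s)

X = np.random.randn(50, 5) @ np.random.randn(5, 50)   # 低秩矩阵
print(nof(X), nof(3.0 * X))                           # 尺度不变:两者相同
print(nof(np.random.randn(50, 50)))                   # 满秩矩阵的比值更大
```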
[CV-35] Multiview Point Cloud Registration via Optimization in an Autoencoder Latent Space
【速读】:该论文旨在解决多视角点云刚性配准(multiview point cloud rigid registration)问题,传统基于成对配准的方法因依赖后续同步算法而难以扩展至大量视角,而现有生成式方法受限于高斯混合模型和期望最大化算法,难以处理大变换和高退化情况。论文提出的解决方案关键在于将配准问题转换到预训练自编码器的潜在空间中,设计考虑退化的损失函数,并开发高效的多起点优化策略,从而在大规模视角、高退化和大初始角度下实现高效且鲁棒的配准。
链接: https://arxiv.org/abs/2504.21467
作者: Luc Vedrenne,Sylvain Faisan,Denis Fortun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 19 figures, IEEE Transactions on Image Processing
Abstract:Point cloud rigid registration is a fundamental problem in 3D computer vision. In the multiview case, we aim to find a set of 6D poses to align a set of objects. Methods based on pairwise registration rely on a subsequent synchronization algorithm, which makes them poorly scalable with the number of views. Generative approaches overcome this limitation, but are based on Gaussian Mixture Models and use an Expectation-Maximization algorithm. Hence, they are not well suited to handle large transformations. Moreover, most existing methods cannot handle high levels of degradations. In this paper, we introduce POLAR (POint cloud LAtent Registration), a multiview registration method able to efficiently deal with a large number of views, while being robust to a high level of degradations and large initial angles. To achieve this, we transpose the registration problem into the latent space of a pretrained autoencoder, design a loss taking degradations into account, and develop an efficient multistart optimization strategy. Our proposed method significantly outperforms state-of-the-art approaches on synthetic and real data. POLAR is available at this http URL or as a standalone package which can be installed with pip install polaregistration.
zh
[CV-36] VR-FuseNet: A Fusion of Heterogeneous Fundus Data and Explainable Deep Network for Diabetic Retinopathy Classification
【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy)的早期准确检测问题,以防止病情进展导致视力丧失和失明。其解决方案的关键在于提出一种新的混合深度学习模型VR-FuseNet,该模型结合了VGG19和ResNet50V2的优势,分别用于捕捉细粒度的空间特征和深层的层次化特征,从而提升诊断性能,并通过引入多种可解释人工智能(XAI)技术增强模型的临床可用性和可解释性。
链接: https://arxiv.org/abs/2504.21464
作者: Shamim Rahim Refat,Ziyan Shirin Raha,Shuvashis Sarker,Faika Fairuj Preotee,MD. Musfikur Rahman,Tashreef Muhammad,Mohammad Shafiul Islam
机构: Ahsanullah University of Science and Technology; Southeast University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages, 49 figures
Abstract:Diabetic retinopathy is a severe eye condition caused by diabetes in which the retinal blood vessels are damaged, which can lead to vision loss and blindness if not treated. Early and accurate detection is key to intervention and stopping the progression of the disease. To address this disease properly, this paper presents a comprehensive approach for automated diabetic retinopathy detection by proposing a new hybrid deep learning model called VR-FuseNet. Because diabetic retinopathy is a leading cause of blindness, especially among diabetic patients, accurate and efficient automated detection methods are required. To address the limitations of existing methods, including dataset imbalance, diversity and generalization issues, this paper presents a hybrid dataset created from five publicly available diabetic retinopathy datasets. Essential preprocessing techniques such as SMOTE for class balancing and CLAHE for image enhancement are applied systematically to improve the robustness and generalizability of the dataset. The proposed VR-FuseNet model combines the strengths of two state-of-the-art convolutional neural networks: VGG19, which captures fine-grained spatial features, and ResNet50V2, which is known for its deep hierarchical feature extraction. This fusion improves the diagnostic performance and achieves an accuracy of 91.824%. The model outperforms individual architectures on all performance metrics, demonstrating the effectiveness of hybrid feature extraction in diabetic retinopathy classification tasks. To make the proposed model more clinically useful and interpretable, this paper incorporates multiple XAI techniques. These techniques generate visual explanations that clearly indicate the retinal features affecting the model's prediction, such as microaneurysms, hemorrhages and exudates, so that clinicians can interpret and validate the model's predictions.
zh
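【代码示意】:VR-FuseNet 将 VGG19 的细粒度空间特征与 ResNet50V2 的深层层次特征融合后分类。下面用 torchvision 给出一个结构示意(torchvision 不提供 ResNet50V2,此处以 resnet50 代替,属假设性替换;分类头与类别数亦为示意):

```python
import torch
import torch.nn as nn
from torchvision import models

class HybridFuseNet(nn.Module):
    """拼接两个骨干网络的全局池化特征,再接分类头(5 类 DR 分级为示意)。"""
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.vgg = models.vgg19(weights=None).features   # 细粒度空间特征
        self.resnet = models.resnet50(weights=None)      # 深层层次特征
        self.resnet.fc = nn.Identity()                   # 去掉原分类头
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(512 + 2048, num_classes)

    def forward(self, x):
        f1 = self.pool(self.vgg(x)).flatten(1)   # [B, 512]
        f2 = self.resnet(x)                      # [B, 2048]
        return self.head(torch.cat([f1, f2], dim=1))

logits = HybridFuseNet()(torch.randn(2, 3, 224, 224))
print(logits.shape)   # torch.Size([2, 5])
```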
[CV-37] Rethinking Visual Layer Selection in Multimodal LLM s ICCV2025
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中视觉层选择的优化问题,即如何系统性地选择CLIP-ViT的不同层次特征以提升模型性能。传统方法依赖经验启发式策略,缺乏理论依据。其解决方案的关键在于提出一种基于层间表示相似性的方法(Layer-wise Representation Similarity),将CLIP-ViT的层次划分为浅层、中层和深层,并分析其对MLLM性能的影响,从而为视觉特征的选择提供科学依据。
链接: https://arxiv.org/abs/2504.21447
作者: Haoran Chen,Junyan Lin,Xinhao Chen,Yue Fan,Xin Jin,Hui Su,Jianfeng Dong,Jinlan Fu,Xiaoyu Shen
机构: Zhejiang Gongshang University (浙江工商大学); Genmo.ai (未知); Meituan Inc. (美团公司); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, submitted to ICCV 2025
Abstract:Multimodal large language models (MLLMs) have achieved impressive performance across a wide range of tasks, typically using CLIP-ViT as their visual encoder due to its strong text-image alignment capabilities. While prior studies suggest that different CLIP-ViT layers capture different types of information, with shallower layers focusing on fine visual details and deeper layers aligning more closely with textual semantics, most MLLMs still select visual features based on empirical heuristics rather than systematic analysis. In this work, we propose a Layer-wise Representation Similarity approach to group CLIP-ViT layers with similar behaviors into shallow, middle, and deep categories and assess their impact on MLLM performance. Building on this foundation, we revisit the visual layer selection problem in MLLMs at scale, training LLaVA-style models ranging from 1.4B to 7B parameters. Through extensive experiments across 10 datasets and 4 tasks, we find that: (1) deep layers are essential for OCR tasks; (2) shallow and middle layers substantially outperform deep layers on reasoning tasks involving counting, positioning, and object localization; (3) a lightweight fusion of features across shallow, middle, and deep layers consistently outperforms specialized fusion baselines and single-layer selections, achieving gains on 9 out of 10 datasets. Our work offers the first principled study of visual layer selection in MLLMs, laying the groundwork for deeper investigations into visual representation learning for MLLMs.
zh
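【代码示意】:论文按"层间表示相似性"将 CLIP-ViT 各层分为浅、中、深三组。下面以相邻层特征的线性 CKA 相似度为例示意这类分析(CKA 是衡量表示相似性的常用选择,论文所用的具体度量以原文为准):

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """线性 CKA:衡量两组表示 [N, D] 的相似度,取值 0~1。"""
    X = X - X.mean(0, keepdim=True)
    Y = Y - Y.mean(0, keepdim=True)
    hsic = (X.T @ Y).norm() ** 2
    return hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())

# feats[l]: 第 l 层对 N 个样本的表示矩阵(此处用随机数示意)
feats = [torch.randn(128, 768) for _ in range(24)]
sims = [linear_cka(feats[l], feats[l + 1]).item() for l in range(23)]
# 依据相似度曲线的断点,即可把各层聚成浅、中、深三组
```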
[CV-38] UAV-VLN: End-to-End Vision Language guided Navigation for UAVs
【速读】:该论文旨在解决人工智能引导的自主系统在未见过的环境中基于自然语言指令进行现实且高效导航的核心挑战。其解决方案的关键在于提出UAV-VLN框架,该框架是一种端到端的视觉-语言导航(VLN)方法,将大语言模型(LLMs)与视觉感知无缝集成,以实现人机交互式导航。该系统通过解析高层次语义目标、检测和定位环境中的语义相关物体,并融合多模态信息来推理空间关系、消除人类指令中的歧义,从而在最少任务特定监督下规划上下文感知的行为。
链接: https://arxiv.org/abs/2504.21432
作者: Pranav Saxena,Nishant Raghuvanshi,Neena Goveas
机构: Birla Institute of Technology and Science Pilani, K.K Birla Goa Campus (比尔拉理工学院和科学学院,K.K 比拉果阿校区)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A core challenge in AI-guided autonomy is enabling agents to navigate realistically and effectively in previously unseen environments based on natural language commands. We propose UAV-VLN, a novel end-to-end Vision-Language Navigation (VLN) framework for Unmanned Aerial Vehicles (UAVs) that seamlessly integrates Large Language Models (LLMs) with visual perception to facilitate human-interactive navigation. Our system interprets free-form natural language instructions, grounds them into visual observations, and plans feasible aerial trajectories in diverse environments. UAV-VLN leverages the common-sense reasoning capabilities of LLMs to parse high-level semantic goals, while a vision model detects and localizes semantically relevant objects in the environment. By fusing these modalities, the UAV can reason about spatial relationships, disambiguate references in human instructions, and plan context-aware behaviors with minimal task-specific supervision. To ensure robust and interpretable decision-making, the framework includes a cross-modal grounding mechanism that aligns linguistic intent with visual context. We evaluate UAV-VLN across diverse indoor and outdoor navigation scenarios, demonstrating its ability to generalize to novel instructions and environments with minimal task-specific training. Our results show significant improvements in instruction-following accuracy and trajectory efficiency, highlighting the potential of LLM-driven vision-language interfaces for safe, intuitive, and generalizable UAV autonomy.
zh
[CV-39] Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision ICLR2025
【速读】:该论文旨在解决预训练多模态模型在复杂且细粒度任务中微调时性能提升受限的问题,其核心原因是现有方法通过损失反向传播直接优化提示生成过程中的参数,导致提示表示的丰富性和特异性受到限制。解决方案的关键在于提出一种基于扩散模型的提示生成器(Diffusion-Driven Prompt Generator, Diff-Prompt),通过三个阶段实现:首先使用Mask-VAE将掩码压缩到潜在空间,其次利用改进的Diffusion Transformer(DiT)在潜在空间中训练提示生成器,并以掩码为监督信号,最后将提示生成器的去噪过程与预训练模型在语义空间中对齐,从而生成高质量的提示用于模型微调。
链接: https://arxiv.org/abs/2504.21423
作者: Weicai Yan,Wang Lin,Zirun Guo,Ye Wang,Fangming Feng,Xiaoda Yang,Zehan Wang,Tao Jin
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR 2025
Abstract:Prompt learning has demonstrated promising results in fine-tuning pre-trained multimodal models. However, the performance improvement is limited when applied to more complex and fine-grained tasks. The reason is that most existing methods directly optimize the parameters involved in the prompt generation process through loss backpropagation, which constrains the richness and specificity of the prompt representations. In this paper, we propose Diffusion-Driven Prompt Generator (Diff-Prompt), aiming to use the diffusion model to generate rich and fine-grained prompt information for complex downstream tasks. Specifically, our approach consists of three stages. In the first stage, we train a Mask-VAE to compress the masks into latent space. In the second stage, we leverage an improved Diffusion Transformer (DiT) to train a prompt generator in the latent space, using the masks for supervision. In the third stage, we align the denoising process of the prompt generator with the pre-trained model in the semantic space, and use the generated prompts to fine-tune the model. We conduct experiments on a complex pixel-level downstream task, referring expression comprehension, and compare our method with various parameter-efficient fine-tuning approaches. Diff-Prompt achieves a maximum improvement of 8.87 in R@1 and 14.05 in R@5 compared to the foundation model and also outperforms other state-of-the-art methods across multiple metrics. The experimental results validate the effectiveness of our approach and highlight the potential of using generative models for prompt generation. Code is available at this https URL.
zh
[CV-40] Adapting In-Domain Few-Shot Segmentation to New Domains without Retraining
【速读】:该论文旨在解决跨域小样本分割(Cross-domain Few-shot Segmentation, CD-FSS)中由于目标域特性多样性和支持数据有限而导致的分割难题。现有方法通常通过重新设计和微调域内小样本分割(In-domain Few-shot Segmentation, FSS)模型来应对这一问题,但此过程成本高昂。论文提出的解决方案关键在于利用推理阶段从少量标注的支持样本中学习域特征,从而适应预训练FSS模型的结构,无需重新训练。具体而言,通过一种基于数据依赖性的结构Fisher得分自适应地识别域特定模型结构,并逐步使用分层构建的训练样本进行训练,最终实现对域偏移的有效应对。
链接: https://arxiv.org/abs/2504.21414
作者: Qi Fan,Kaiqi Liu,Nian Liu,Hisham Cholakkal,Rao Muhammad Anwer,Wenbin Li,Yang Gao
机构: Nanjing University (南京大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cross-domain few-shot segmentation (CD-FSS) aims to segment objects of novel classes in new domains, which is often challenging due to the diverse characteristics of target domains and the limited availability of support data. Most CD-FSS methods redesign and retrain in-domain FSS models using various domain-generalization techniques, which are effective but costly to train. To address these issues, we propose adapting informative model structures of the well-trained FSS model for target domains by learning domain characteristics from few-shot labeled support samples during inference, thereby eliminating the need for retraining. Specifically, we first adaptively identify domain-specific model structures by measuring parameter importance using a novel structure Fisher score in a data-dependent manner. Then, we progressively train the selected informative model structures with hierarchically constructed training samples, progressing from fewer to more support shots. The resulting Informative Structure Adaptation (ISA) method effectively addresses domain shifts and equips existing well-trained in-domain FSS models with flexible adaptation capabilities for new domains, eliminating the need to redesign or retrain CD-FSS models on base data. Extensive experiments validate the effectiveness of our method, demonstrating superior performance across multiple CD-FSS benchmarks.
zh
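【代码示意】:ISA 用"结构 Fisher 得分"在推理时从少量支持样本度量参数重要性。经验 Fisher 信息通常以梯度平方的累计值估计,下面给出一个假设性示意(损失函数与聚合粒度均为示意,非论文源码):

```python
import torch

def fisher_scores(model, loss_fn, support_loader):
    """对每个参数张量累计梯度平方,作为经验 Fisher 重要性得分。"""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in support_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.detach() ** 2
    # 将逐参数得分按模块聚合,即可挑出"领域特定"的结构用于自适应训练
    return {n: s.mean().item() for n, s in scores.items()}
```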
[CV-41] Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering
【速读】:该论文旨在解决视频问答任务中由于长视频生成大量token而导致的内存效率低下和模型性能受限的问题。现有方法通过压缩视频输入来缓解这一问题,但往往忽视了不同查询对静态信息与动态信息重要性的差异,导致在有限的token预算下使用效率不高。该研究提出了一种新颖的token选择策略——EXPLORE-THEN-SELECT,其关键在于根据问题需求自适应调整所需静态和动态信息,首先探索静态帧与动态帧之间的不同token分配方案,然后通过基于查询感知注意力的度量方法选择最优的token组合,无需更新模型即可实现高效的信息提取。
链接: https://arxiv.org/abs/2504.21403
作者: Yumeng Shi,Quanyu Long,Wenya Wang
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video question answering benefits from the rich information available in videos, enabling a wide range of applications. However, the large volume of tokens generated from longer videos presents significant challenges to memory efficiency and model performance. To alleviate this issue, existing works propose to compress video inputs, but usually overlook the varying importance of static and dynamic information across different queries, leading to inefficient token usage within limited budgets. To tackle this, we propose a novel token selection strategy, EXPLORE-THEN-SELECT, that adaptively adjusts the amount of static and dynamic information needed based on question requirements. Our framework first explores different token allocations between static frames, which preserve spatial details, and dynamic frames, which capture temporal changes. Next, it employs a query-aware attention-based metric to select the optimal token combination without model updates. Our proposed framework is plug-and-play and can be seamlessly integrated into diverse video-language models. Extensive experiments show that our method achieves significant performance improvements (up to 5.8%) across various video question answering benchmarks.
zh
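【代码示意】:EXPLORE-THEN-SELECT 用"查询感知的注意力度量"在不更新模型的前提下挑选 token 组合。下面示意其中"按查询-视觉 token 相似度选 top-k、再探索静/动态预算分配"这一思路(假设性实现,评分与选择细节以论文为准):

```python
import torch

def select_tokens(query: torch.Tensor, tokens: torch.Tensor, k: int):
    """query: [D] 问题嵌入;tokens: [N, D] 视频 token;返回得分最高的 k 个 token。"""
    scores = tokens @ query / query.norm()     # 类注意力的点积评分
    topk = scores.topk(k).indices
    return tokens[topk], topk

q = torch.randn(256)
static_toks, dynamic_toks = torch.randn(400, 256), torch.randn(400, 256)
# "探索"不同的静/动态 token 预算分配,再用评分"选择"候选组合
for k_static in (100, 200, 300):
    sel_s, _ = select_tokens(q, static_toks, k_static)
    sel_d, _ = select_tokens(q, dynamic_toks, 400 - k_static)
    candidate = torch.cat([sel_s, sel_d], dim=0)   # 总预算固定为 400
```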
[CV-42] Comparison of Different Deep Neural Network Models in the Cultural Heritage Domain
【速读】:该论文试图解决如何将通用数据集(如ImageNet)中的知识迁移至文化遗产相关任务中的问题,以提升文化遗址记录与保护及游客体验的效果。解决方案的关键在于对比分析卷积神经网络(Convolutional Neural Networks)和Transformer架构在知识迁移能力上的表现,最终结果表明DenseNet在效率-计算能力比率方面表现最优。
链接: https://arxiv.org/abs/2504.21387
作者: Teodor Boyadzhiev,Gabriele Lagani,Luca Ciampi,Giuseppe Amato,Krassimira Ivanova
机构: Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Sofia, Bulgaria; Institute of Information Science and Technologies, Consiglio Nazionale delle Ricerche, Pisa, Italy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 10th International Euro-Mediterranean Conference (EuroMed 2024)
Abstract:The integration of computer vision and deep learning is an essential part of documenting and preserving cultural heritage, as well as improving visitor experiences. In recent years, two deep learning paradigms have been established in the field of computer vision: convolutional neural networks and transformer architectures. The present study provides a comparative analysis of representatives of these two paradigms in terms of their ability to transfer knowledge from a generic dataset, such as ImageNet, to cultural-heritage-specific tasks. Testing representatives of the architectures VGG, ResNet, DenseNet, Visual Transformer, Swin Transformer, and PoolFormer showed that DenseNet is the best in terms of the efficiency-computability ratio.
zh
[CV-43] IDDM: Bridging Synthetic-to-Real Domain Gap from Physics-Guided Diffusion for Real-world Image Dehazing
【速读】:该论文旨在解决当前基于数据驱动的去雾算法在合成数据集上表现良好,但在真实世界场景中泛化能力不足的问题(domain gap)。其解决方案的关键在于提出一种名为Image Dehazing Diffusion Models (IDDM) 的新型扩散过程,该过程将大气散射模型引入噪声扩散过程中,通过渐进式雾霾生成过程帮助去噪U-Net从条件输入的雾霾图像中稳健地学习清晰图像的分布。
链接: https://arxiv.org/abs/2504.21385
作者: Shijun Zhou,Yajing Liu,Chunhui Hao,Zhiyuan Liu,Jiandong Tian
机构: State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences(机器人学国家重点实验室,沈阳自动化研究所,中国科学院); University of Chinese Academy of Sciences(中国科学院大学); Shenyang University of Chemical Technology(沈阳化工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Due to the domain gap between real-world and synthetic hazy images, current data-driven dehazing algorithms trained on synthetic datasets perform well on synthetic data but struggle to generalize to real-world scenarios. To address this challenge, we propose Image Dehazing Diffusion Models (IDDM), a novel diffusion process that incorporates the atmospheric scattering model into noise diffusion. IDDM aims to use the gradual haze formation process to help the denoising U-Net robustly learn the distribution of clear images from the conditional input hazy images. We design a specialized training strategy centered around IDDM. Diffusion models are leveraged to bridge the domain gap from synthetic to real-world, while the atmospheric scattering model provides physical guidance for haze formation. During the forward process, IDDM simultaneously introduces haze and noise into clear images, and then robustly separates them during the sampling process. By training with physics-guided information, IDDM shows the ability of domain generalization, and effectively restores the real-world hazy images despite being trained on synthetic datasets. Extensive experiments demonstrate the effectiveness of our method through both quantitative and qualitative comparisons with state-of-the-art approaches.
zh
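【代码示意】:IDDM 在前向过程中同时向清晰图像注入雾与噪声,其物理基础是大气散射模型 I = J·t + A·(1 − t)。下面给出该合成过程的极简示意(透射率与噪声的调度均为假设,非论文给定形式):

```python
import torch

def hazy_forward(J: torch.Tensor, t_step: int, T: int = 1000,
                 A: float = 0.9) -> torch.Tensor:
    """J: [B,3,H,W] 清晰图像(取值 0~1)。随扩散步 t_step 增大,
    透射率 trans 从 1 线性衰减(雾变浓),同时按扩散前向叠加高斯噪声。"""
    frac = t_step / T
    trans = 1.0 - 0.8 * frac                 # 假设的透射率调度
    hazy = J * trans + A * (1.0 - trans)     # 大气散射模型
    alpha_bar = (1.0 - frac) ** 2            # 假设的噪声调度
    noise = torch.randn_like(J)
    return alpha_bar ** 0.5 * hazy + (1 - alpha_bar) ** 0.5 * noise

x_t = hazy_forward(torch.rand(2, 3, 64, 64), t_step=500)
```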
[CV-44] Sparse-to-Sparse Training of Diffusion Models
【速读】:该论文试图解决扩散模型(Diffusion Models, DMs)在训练和推理阶段计算资源消耗大的问题,其解决方案的关键在于引入了稀疏到稀疏训练(sparse-to-sparse training)的范式,通过在模型中引入稀疏性来提升训练和推理效率。研究通过在六个数据集上使用三种不同方法(Static-DM、RigL-DM 和 MagRan-DM)从头训练稀疏扩散模型(Latent Diffusion 和 ChiroDiff),验证了稀疏模型在保持生成质量的同时显著减少了可训练参数数量和浮点运算次数(FLOPs)。
链接: https://arxiv.org/abs/2504.21380
作者: Inês Cardoso Oliveira,Decebal Constantin Mocanu,Luis A. Leiva
机构: University of Luxembourg (卢森堡大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models (DMs) are a powerful type of generative models that have achieved state-of-the-art results in various image synthesis tasks and have shown potential in other domains, such as natural language processing and temporal data modeling. Despite their stable training dynamics and ability to produce diverse high-quality samples, DMs are notorious for requiring significant computational resources, both in the training and inference stages. Previous work has focused mostly on increasing the efficiency of model inference. This paper introduces, for the first time, the paradigm of sparse-to-sparse training to DMs, with the aim of improving both training and inference efficiency. We focus on unconditional generation and train sparse DMs from scratch (Latent Diffusion and ChiroDiff) on six datasets using three different methods (Static-DM, RigL-DM, and MagRan-DM) to study the effect of sparsity in model performance. Our experiments show that sparse DMs are able to match and often outperform their Dense counterparts, while substantially reducing the number of trainable parameters and FLOPs. We also identify safe and effective values to perform sparse-to-sparse training of DMs.
zh
[CV-45] Revisiting Diffusion Autoencoder Training for Image Reconstruction Quality CVPR2025
【速读】:该论文试图解决扩散自编码器(Diffusion Autoencoders, DAEs)在生成图像时因噪声水平设置导致的图像质量低下和模糊问题。传统方法采用线性β噪声调度,在高噪声水平下花费较多采样步骤,这虽然有助于恢复大尺度图像结构,但不利于细节恢复。解决方案的关键在于将训练过程分为两个阶段:第一阶段将噪声水平始终设置为最高,迫使编码器和解码器在潜在空间中填充结构信息;第二阶段引入更多时间停留在低噪声区域的噪声调度,使DAE能够学习完善细节。该方法实现了图像在保留潜在代码有用特性的同时,具备准确的高层结构和低层细节。
链接: https://arxiv.org/abs/2504.21368
作者: Pramook Khungurn,Sukit Seripanitkarn,Phonphrm Thawatdamrongkit,Supasorn Suwajanakorn
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: AI for Content Creation (AI4CC) Workshop at CVPR 2025
Abstract:Diffusion autoencoders (DAEs) are typically formulated as a noise prediction model and trained with a linear-β noise schedule that spends many of its sampling steps at high noise levels. Because high noise levels are associated with recovering large-scale image structures and low noise levels with recovering details, this configuration can result in low-quality and blurry images. However, it should be possible to improve details while spending fewer steps recovering structures because the latent code should already contain structural information. Based on this insight, we propose a new DAE training method that improves the quality of reconstructed images. We divide training into two phases. In the first phase, the DAE is trained as a vanilla autoencoder by always setting the noise level to the highest, forcing the encoder and decoder to populate the latent code with structural information. In the second phase, we incorporate a noise schedule that spends more time in the low-noise region, allowing the DAE to learn how to perfect the details. Our method results in images that have accurate high-level structures and low-level details while still preserving useful properties of the latent codes.
zh
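【代码示意】:该两阶段训练可以概括为:第一阶段固定使用最高噪声水平,第二阶段改用偏向低噪声的采样分布。下面是训练循环中采样时间步 t 的假设性示意(第二阶段用 Beta(1, 3) 作为偏向低噪声的示意分布,论文未给出具体调度):

```python
import torch

T = 1000  # 扩散总步数

def sample_timestep(phase: int, batch: int) -> torch.Tensor:
    if phase == 1:
        # 阶段一:恒为最高噪声,迫使结构信息进入潜在码
        return torch.full((batch,), T - 1, dtype=torch.long)
    # 阶段二:偏向低噪声区域的采样分布(此处为示意)
    u = torch.distributions.Beta(1.0, 3.0).sample((batch,))
    return (u * (T - 1)).long()

print(sample_timestep(1, 4))  # tensor([999, 999, 999, 999])
print(sample_timestep(2, 4))  # 多数落在小 t(低噪声)处
```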
[CV-46] Nexus-Gen: A Unified Model for Image Understanding Generation and Editing
【速读】:该论文旨在解决统一多模态大语言模型(Unified Multimodal Large Language Models, MLLMs)在性能上与领域专用架构存在差距的问题。其解决方案的关键在于提出Nexus-Gen,该模型通过将语言推理能力与扩散模型的图像生成能力相结合,实现多模态理解与生成的统一。为对齐语言模型与扩散模型的嵌入空间,采用了双阶段对齐训练过程:第一阶段,自回归语言模型学习根据多模态输入预测图像嵌入;第二阶段,视觉解码器则被训练从这些嵌入中重建高保真图像。此外,为解决自回归范式在训练与推理阶段之间的关键差异问题,引入了预填充自回归策略,以位置嵌入的特殊标记替代连续嵌入来填充输入序列,从而避免连续嵌入空间中的误差累积问题。
链接: https://arxiv.org/abs/2504.21356
作者: Hong Zhang,Zhongjie Duan,Xingjun Wang,Yingda Chen,Yuze Zhao,Yu Zhang
机构: Zhejiang University (浙江大学); Alibaba Group Inc. (阿里巴巴集团); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Unified multimodal large language models (MLLMs) aim to integrate multimodal understanding and generation abilities through a single framework. Despite their versatility, existing open-source unified models exhibit performance gaps against domain-specific architectures. To bridge this gap, we present Nexus-Gen, a unified model that synergizes the language reasoning capabilities of LLMs with the image synthesis power of diffusion models. To align the embedding space of the LLM and diffusion model, we conduct a dual-phase alignment training process. (1) The autoregressive LLM learns to predict image embeddings conditioned on multimodal inputs, while (2) the vision decoder is trained to reconstruct high-fidelity images from these embeddings. While training the LLM, we identified a critical discrepancy between the autoregressive paradigm's training and inference phases, where error accumulation in continuous embedding space severely degrades generation quality. To avoid this issue, we introduce a prefilled autoregression strategy that prefills the input sequence with position-embedded special tokens instead of continuous embeddings. Through dual-phase training, Nexus-Gen has developed the integrated capability to comprehensively address the image understanding, generation and editing tasks. All models, datasets, and codes are published at this https URL to facilitate further advancements across the field.
zh
[CV-47] Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Early Lung Cancer Detection
【速读】:该论文试图解决现有机器学习模型在评估肺结节恶性程度时依赖人工标注、可解释性差以及对影像变化敏感的问题,这些问题限制了其在真实临床环境中的应用。解决方案的关键在于整合放射科医生对结节的语义特征,使模型能够学习到临床相关、稳健且可解释的特征以预测肺癌。通过微调预训练的对比语言-图像预训练(Contrastive Language-Image Pretraining, CLIP)模型,实现影像与语义特征的对齐,并提升模型在外部数据集上的性能。
链接: https://arxiv.org/abs/2504.21344
作者: Luoting Zhuang,Seyed Mohammad Hossein Tabatabaei,Ramin Salehi-Rad,Linh M. Tran,Denise R. Aberle,Ashley E. Prosper,William Hsu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:Objective: A number of machine learning models have utilized semantic features, deep features, or both to assess lung nodule malignancy. However, their reliance on manual annotation during inference, limited interpretability, and sensitivity to imaging variations hinder their application in real-world clinical settings. Thus, this research aims to integrate semantic features derived from radiologists’ assessments of nodules, allowing the model to learn clinically relevant, robust, and explainable features for predicting lung cancer. Methods: We obtained 938 low-dose CT scans from the National Lung Screening Trial with 1,246 nodules and semantic features. The Lung Image Database Consortium dataset contains 1,018 CT scans, with 2,625 lesions annotated for nodule characteristics. Three external datasets were obtained from UCLA Health, the LUNGx Challenge, and the Duke Lung Cancer Screening. We finetuned a pretrained Contrastive Language-Image Pretraining model with a parameter-efficient fine-tuning approach to align imaging and semantic features and predict the one-year lung cancer diagnosis. Results: We evaluated the performance of the one-year diagnosis of lung cancer with AUROC and AUPRC and compared it to three state-of-the-art models. Our model demonstrated an AUROC of 0.90 and AUPRC of 0.78, outperforming baseline state-of-the-art models on external datasets. Using CLIP, we also obtained predictions on semantic features, such as nodule margin (AUROC: 0.81), nodule consistency (0.81), and pleural attachment (0.84), that can be used to explain model predictions. Conclusion: Our approach accurately classifies lung nodules as benign or malignant, providing explainable outputs, aiding clinicians in comprehending the underlying meaning of model predictions. This approach also prevents the model from learning shortcuts and generalizes across clinical settings.
zh
[CV-48] owards Improved Cervical Cancer Screening: Vision Transformer-Based Classification and Interpretability
【速读】:该论文旨在解决宫颈细胞图像分类问题,以提高宫颈癌筛查的准确性。其解决方案的关键在于提出一种基于EVA-02变压器模型的四步流程:微调EVA-02、特征提取、通过多种机器学习模型选择重要特征,并训练具有可选损失加权的新人工神经网络,从而提升模型的泛化能力。
链接: https://arxiv.org/abs/2504.21340
作者: Khoa Tuan Nguyen,Ho-min Park,Gaeun Oh,Joris Vankerschaver,Wesley De Neve
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ISBI 2025 “Challenge 2: Pap Smear Cell Classification Challenge”
Abstract:We propose a novel approach to cervical cell image classification for cervical cancer screening using the EVA-02 transformer model. We developed a four-step pipeline: fine-tuning EVA-02, feature extraction, selecting important features through multiple machine learning models, and training a new artificial neural network with optional loss weighting for improved generalization. With this design, our best model achieved an F1-score of 0.85227, outperforming the baseline EVA-02 model (0.84878). We also utilized Kernel SHAP analysis and identified key features correlating with cell morphology and staining characteristics, providing interpretable insights into the decision-making process of the fine-tuned model. Our code is available at this https URL.
zh
[CV-49] UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation
【速读】:该论文旨在解决传统人工智能方法在生物医学图像分析中因训练过程割裂而导致的灵活性不足和无法充分利用全面生物医学信息的问题。传统方法通常分别使用大型语言模型(LLMs)进行临床文本生成和分割模型进行目标提取,这种分离的训练方式限制了实际应用效果。解决方案的关键在于提出UniBiomed,这是首个用于基础生物医学图像解释的通用基础模型,其核心是将多模态大语言模型(MLLM)与分割一切模型(Segment Anything Model, SAM)进行创新性整合,从而有效统一临床文本生成与相应生物医学对象的分割,实现端到端的、基于图像的解释能力。
链接: https://arxiv.org/abs/2504.21336
作者: Linshan Wu,Yuxiang Nie,Sunan He,Jiaxin Zhuang,Hao Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The first universal foundation model for grounded biomedical image interpretation
Abstract:Multi-modal interpretation of biomedical images opens up novel opportunities in biomedical image analysis. Conventional AI approaches typically rely on disjointed training, i.e., Large Language Models (LLMs) for clinical text generation and segmentation models for target extraction, which results in inflexible real-world deployment and a failure to leverage holistic biomedical information. To this end, we introduce UniBiomed, the first universal foundation model for grounded biomedical image interpretation. UniBiomed is based on a novel integration of Multi-modal Large Language Model (MLLM) and Segment Anything Model (SAM), which effectively unifies the generation of clinical texts and the segmentation of corresponding biomedical objects for grounded interpretation. In this way, UniBiomed is capable of tackling a wide range of biomedical tasks across ten diverse biomedical imaging modalities. To develop UniBiomed, we curate a large-scale dataset comprising over 27 million triplets of images, annotations, and text descriptions across ten imaging modalities. Extensive validation on 84 internal and external datasets demonstrated that UniBiomed achieves state-of-the-art performance in segmentation, disease recognition, region-aware diagnosis, visual question answering, and report generation. Moreover, unlike previous models that rely on clinical experts to pre-diagnose images and manually craft precise textual or visual prompts, UniBiomed can provide automated and end-to-end grounded interpretation for biomedical image analysis. This represents a novel paradigm shift in clinical workflows, which will significantly improve diagnostic efficiency. In summary, UniBiomed represents a novel breakthrough in biomedical AI, unlocking powerful grounded interpretation capabilities for more accurate and efficient biomedical image analysis.
zh
[CV-50] Simple Visual Artifact Detection in Sora-Generated Videos
【速读】:该论文旨在解决生成式视频模型(如Sora)中常见的视觉伪影问题,这些问题可能影响视频质量、误导观众或传播虚假信息。其解决方案的关键在于提出一种多标签分类框架,用于识别四种常见的视觉伪影类型:边界/边缘缺陷、纹理/噪声问题、运动/关节异常以及物体不匹配/消失。通过使用手动标注的300帧数据集,并训练多种2D卷积神经网络架构,最终基于ResNet-50的模型在多标签分类任务中达到了94.14%的平均准确率,为视频质量评估和视觉风险识别提供了有效手段。
链接: https://arxiv.org/abs/2504.21334
作者: Misora Sugiyama,Hirokatsu Kataoka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The December 2024 release of OpenAI's Sora, a powerful video generation model driven by natural language prompts, highlights a growing convergence between large language models (LLMs) and video synthesis. As these multimodal systems evolve into video-enabled LLMs (VidLLMs), capable of interpreting, generating, and interacting with visual content, understanding their limitations and ensuring their safe deployment becomes essential. This study investigates visual artifacts frequently found and reported in Sora-generated videos, which can compromise quality, mislead viewers, or propagate disinformation. We propose a multi-label classification framework targeting four common artifact label types: label 1: boundary / edge defects, label 2: texture / noise issues, label 3: movement / joint anomalies, and label 4: object mismatches / disappearances. Using a dataset of 300 manually annotated frames extracted from 15 Sora-generated videos, we trained multiple 2D CNN architectures (ResNet-50, EfficientNet-B3 / B4, ViT-Base). The best-performing model, based on ResNet-50, achieved an average multi-label classification accuracy of 94.14%. This work supports the broader development of VidLLMs by contributing to (1) the creation of datasets for video quality evaluation, (2) interpretable artifact-based analysis beyond language metrics, and (3) the identification of visual risks relevant to factuality and safety.
zh
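【代码示意】:摘要中的多标签伪影分类可以直接用 torchvision 的 ResNet-50 加 4 维输出头实现,每个标签独立做二分类。下面为一个极简示意(训练细节与数据为假设):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, 4)   # 4 类伪影,多标签输出

criterion = nn.BCEWithLogitsLoss()              # 每个标签独立的二分类损失
x = torch.randn(8, 3, 224, 224)                 # 一批视频帧(示意数据)
y = torch.randint(0, 2, (8, 4)).float()         # 多标签标注(可同时为 1)
loss = criterion(model(x), y)

# 推理:sigmoid 后按阈值 0.5 判定各伪影是否出现
probs = torch.sigmoid(model(x))
preds = (probs > 0.5).int()
```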
[CV-51] xt-Conditioned Diffusion Model for High-Fidelity Korean Font Generation
【速读】:该论文试图解决复杂语言如韩语和汉语的手写风格字体生成问题,传统自动字体生成方法(如生成对抗网络和变分自编码器)在训练过程中通常不稳定,并且容易出现模式崩溃问题,同时难以捕捉字体图像中的细节。该研究提出了一种基于扩散模型的自动字体生成方法,其关键在于通过增量式地细化噪声图像实现稳定训练并生成高质量、多样化的手写和印刷风格的韩文字体图像,同时引入文本编码器以处理语音表示并生成准确且上下文正确的字符,以及采用感知损失来提升生成图像的整体风格质量。
链接: https://arxiv.org/abs/2504.21325
作者: Abdul Sami,Avinash Kumar,Irfanullah Memon,Youngwon Jo,Muhammad Rizwan,Jaeyoung Choi
机构: School of Computer Science and Engineering, Soongsil University (计算机科学与工程学院,松溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures, Accepted at ICOIN 2025
Abstract:Automatic font generation (AFG) is the process of creating a new font using only a few examples of the style images. Generating fonts for complex languages like Korean and Chinese, particularly in handwritten styles, presents significant challenges. Traditional AFGs, like Generative adversarial networks (GANs) and Variational Auto-Encoders (VAEs), are usually unstable during training and often face mode collapse problems. They also struggle to capture fine details within font images. To address these problems, we present a diffusion-based AFG method which generates high-quality, diverse Korean font images using only a single reference image, focusing on handwritten and printed styles. Our approach refines noisy images incrementally, ensuring stable training and visually appealing results. A key innovation is our text encoder, which processes phonetic representations to generate accurate and contextually correct characters, even for unseen characters. We used a pre-trained style encoder from DG FONT to effectively and accurately encode the style images. To further enhance the generation quality, we used perceptual loss that guides the model to focus on the global style of generated images. Experimental results on over 2000 Korean characters demonstrate that our model consistently generates accurate and detailed font images and outperforms benchmark methods, making it a reliable tool for generating authentic Korean fonts across different styles.
zh
[CV-52] An Evaluation of a Visual Question Answering Strategy for Zero-shot Facial Expression Recognition in Still Images
【速读】:该论文试图解决在面部表情识别(Facial Expression Recognition, FER)中,现有先进模型在零样本(zero-shot)场景下性能显著下降的问题。解决方案的关键在于引入本地执行的视觉语言模型(Visual Language Models, VLMs),通过视觉问答策略避免任务特定知识的缺失,并评估其在多个公开FER基准数据集上的表现,结果显示部分VLM在零样本FER场景中表现出色,表明进一步探索VLM在提升FER泛化能力方面的潜力。
链接: https://arxiv.org/abs/2504.21309
作者: Modesto Castrillón-Santana,Oliverio J Santana,David Freire-Obregón,Daniel Hernández-Sosa,Javier Lorenzo-Navarro
机构: SIANI - Universidad de Las Palmas de Gran Canaria (SIANI - 拉斯帕尔马斯大加那利大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Facial expression recognition (FER) is a key research area in computer vision and human-computer interaction. Despite recent advances in deep learning, challenges persist, especially in generalizing to new scenarios. In fact, zero-shot FER significantly reduces the performance of state-of-the-art FER models. To address this problem, the community has recently started to explore the integration of knowledge from Large Language Models for visual tasks. In this work, we evaluate a broad collection of locally executed Visual Language Models (VLMs), avoiding the lack of task-specific knowledge by adopting a Visual Question Answering strategy. We compare the proposed pipeline with state-of-the-art FER models, both integrating and excluding VLMs, evaluating well-known FER benchmarks: AffectNet, FERPlus, and RAF-DB. The results show excellent performance for some VLMs in zero-shot FER scenarios, indicating the need for further exploration to improve FER generalization.
zh
[CV-53] AGHI-QA: A Subjective-Aligned Dataset and Metric for AI-Generated Human Images
【速读】:该论文试图解决现有图像质量评估(IQA)方法在生成式AI (Generative AI) 生成的人像图像(AGHIs)中无法提供细粒度感知评价的问题,特别是在结构复杂的主体如人体上存在频繁的解剖和纹理失真情况下。解决方案的关键在于提出AGHI-QA,首个针对AGHIs质量评估的大规模基准数据集,并基于此构建AGHI-Assessor,一种结合大模态模型(LMM)与领域特定人体特征的新型质量评估指标,以实现对AGHIs中可见和失真身体部位的精确预测与识别。
链接: https://arxiv.org/abs/2504.21308
作者: Yunhao Li,Sijing Wu,Wei Sun,Zhichao Zhang,Yucheng Zhu,Zicheng Zhang,Huiyu Duan,Xiongkuo Min,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid development of text-to-image (T2I) generation approaches has attracted extensive interest in evaluating the quality of generated images, leading to the development of various quality assessment methods for general-purpose T2I outputs. However, existing image quality assessment (IQA) methods are limited to providing global quality scores, failing to deliver fine-grained perceptual evaluations for structurally complex subjects like humans, which is a critical challenge considering the frequent anatomical and textural distortions in AI-generated human images (AGHIs). To address this gap, we introduce AGHI-QA, the first large-scale benchmark specifically designed for quality assessment of AGHIs. The dataset comprises 4,000 images generated from 400 carefully crafted text prompts using 10 state-of-the-art T2I models. We conduct a systematic subjective study to collect multidimensional annotations, including perceptual quality scores, text-image correspondence scores, visible and distorted body part labels. Based on AGHI-QA, we evaluate the strengths and weaknesses of current T2I methods in generating human images from multiple dimensions. Furthermore, we propose AGHI-Assessor, a novel quality metric that integrates the large multimodal model (LMM) with domain-specific human features for precise quality prediction and identification of visible and distorted body parts in AGHIs. Extensive experimental results demonstrate that AGHI-Assessor showcases state-of-the-art performance, significantly outperforming existing IQA methods in multidimensional quality assessment and surpassing leading LMMs in detecting structural distortions in AGHIs.
zh
[CV-54] he Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning
【速读】:该论文试图解决扩散模型在受到特定文本指令时可能记忆并生成有害内容的问题,以及现有微调方法在面对越狱攻击时的不足,这些方法未能彻底消除有害概念。解决方案的关键在于提出一种能够学习可解释的正交攻击标记嵌入(attack token embeddings)的攻击方法,该方法可以分解为人类可理解的文本元素,揭示未学习模型仍通过隐式文本成分保留目标概念。此外,这些攻击标记嵌入在不同文本提示、初始噪声和未学习模型之间具有鲁棒性和可迁移性。
链接: https://arxiv.org/abs/2504.21307
作者: Siyi Chen,Yimeng Zhang,Sijia Liu,Qing Qu
机构: University of Michigan(密歇根大学); Michigan State University(密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite the remarkable generalization capabilities of diffusion models, recent studies have shown that these models can memorize and generate harmful content when prompted with specific text instructions. Although fine-tuning approaches have been developed to mitigate this issue by unlearning harmful concepts, these methods can be easily circumvented through jailbreaking attacks. This indicates that the harmful concept has not been fully erased from the model. However, existing attack methods, while effective, lack interpretability regarding why unlearned models still retain the concept, thereby hindering the development of defense strategies. In this work, we address these limitations by proposing an attack method that learns an orthogonal set of interpretable attack token embeddings. The attack token embeddings can be decomposed into human-interpretable textual elements, revealing that unlearned models still retain the target concept through implicit textual components. Furthermore, these attack token embeddings are robust and transferable across text prompts, initial noises, and unlearned models. Finally, leveraging this diverse set of embeddings, we design a defense method applicable to both our proposed attack and existing attack methods. Experimental results demonstrate the effectiveness of both our attack and defense strategies.
zh
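【代码示意】:摘要提出学习"一组正交、可解释的攻击 token 嵌入"。正交性通常可用惩罚 E·Eᵀ 偏离单位阵的正则项实现,下面给出该正则项的假设性示意(与具体攻击目标损失加权组合的方式以论文为准):

```python
import torch

def orthogonality_penalty(E: torch.Tensor) -> torch.Tensor:
    """E: [K, D],K 个攻击 token 嵌入(先按行归一化)。
    惩罚 E E^T 偏离单位阵的程度,促使各嵌入相互正交。"""
    E = torch.nn.functional.normalize(E, dim=-1)
    gram = E @ E.T
    eye = torch.eye(E.size(0), device=E.device)
    return ((gram - eye) ** 2).sum()

E = torch.randn(8, 768, requires_grad=True)   # 8 个可学习的攻击嵌入
loss = orthogonality_penalty(E)               # 实际训练中与攻击损失加权求和
loss.backward()
```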
[CV-55] CMD: Constraining Multimodal Distribution for Domain Adaptation in Stereo Matching
【速读】:该论文试图解决在无监督领域自适应场景中,基于学习的立体匹配方法由于软argmin和平滑L1损失操作导致目标域中出现多模态视差概率分布,从而降低模型泛化能力的问题。解决方案的关键在于提出了一种名为约束多模态分布(Constrain Multi-modal Distribution, CMD)的方法,通过引入不确定性正则化最小化和各向异性软argmin,促使网络在目标域中生成主要为单峰的视差分布,从而提升预测精度。
链接: https://arxiv.org/abs/2504.21302
作者: Zhelun Shen,Zhuo Li,Chenming Wu,Zhibo Rao,Lina Liu,Yuchao Dai,Liangjun Zhang
机构: RAL, Baidu Research(百度研究院); ICT, Chinese Academy of Science(中国科学院); Nanchang Hangkong University(南昌航空大学); Zhejiang University(浙江大学); Northwestern Polytechnical University(西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 13 pages, 5 figures, accepted for publication in Pattern Recognition
Abstract:Recently, learning-based stereo matching methods have achieved great improvement in public benchmarks, where soft argmin and smooth L1 loss play a core contribution to their success. However, in unsupervised domain adaptation scenarios, we observe that these two operations often yield multimodal disparity probability distributions in target domains, resulting in degraded generalization. In this paper, we propose a novel approach, Constrain Multi-modal Distribution (CMD), to address this issue. Specifically, we introduce uncertainty-regularized minimization and anisotropic soft argmin to encourage the network to produce predominantly unimodal disparity distributions in the target domain, thereby improving prediction accuracy. Experimentally, we apply the proposed method to multiple representative stereo-matching networks and conduct domain adaptation from synthetic data to unlabeled real-world scenes. Results consistently demonstrate improved generalization in both top-performing and domain-adaptable stereo-matching models. The code for CMD will be available at: this https URL.
zh
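【代码示意】:立体匹配中的 soft argmin 将代价体沿视差维做 softmax 加权平均;"各向异性 soft argmin"可理解为对该 softmax 引入温度等调制以锐化分布、抑制多峰(具体形式以论文为准)。极简示意如下:

```python
import torch

def soft_argmin(cost: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """cost: [B, D, H, W] 代价体;返回 [B, H, W] 的视差期望。
    tau < 1 会锐化分布、抑制多峰,可视作各向异性调制的一种假设形式。"""
    B, D, H, W = cost.shape
    prob = torch.softmax(-cost / tau, dim=1)             # 代价越小概率越大
    disp = torch.arange(D, dtype=cost.dtype).view(1, D, 1, 1)
    return (prob * disp).sum(dim=1)

cost = torch.randn(1, 64, 32, 32)
print(soft_argmin(cost).shape)            # torch.Size([1, 32, 32])
print(soft_argmin(cost, tau=0.3).shape)   # 温度更低,分布更接近单峰
```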
[CV-56] Learning Multi-view Multi-class Anomaly Detection
【速读】:该论文试图解决多视图多类别异常检测(Multi-View Multi-Class Anomaly Detection, MVMCAD)中现有模型在多视图场景下性能不佳的问题,主要原因是无法有效建模不同视图之间的关系和互补信息。解决方案的关键在于提出一种半冻结编码器(semi-frozen encoder),通过预编码器先验增强机制实现稳定的跨视图特征建模;引入异常增强模块(Anomaly Amplification Module, AAM)以建模全局token交互并抑制正常区域,从而增强异常信号;以及设计跨特征损失(Cross-Feature Loss)来对齐浅层编码器特征与深层解码器特征,提升模型在多视图场景下对不同语义层级异常的敏感性。
链接: https://arxiv.org/abs/2504.21294
作者: Qianzi Yu,Yang Cao,Yu Kang
机构: USTC(中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The latest trend in anomaly detection is to train a unified model instead of training a separate model for each category. However, existing multi-class anomaly detection (MCAD) models perform poorly in multi-view scenarios because they often fail to effectively model the relationships and complementary information among different views. In this paper, we introduce a Multi-View Multi-Class Anomaly Detection model (MVMCAD), which integrates information from multiple views to accurately identify anomalies. Specifically, we propose a semi-frozen encoder, where a pre-encoder prior enhancement mechanism is added before the frozen encoder, enabling stable cross-view feature modeling and efficient adaptation for improved anomaly detection. Furthermore, we propose an Anomaly Amplification Module (AAM) that models global token interactions and suppresses normal regions to enhance anomaly signals, leading to improved detection performance in multi-view settings. Finally, we propose a Cross-Feature Loss that aligns shallow encoder features with deep decoder features and vice versa, enhancing the model’s sensitivity to anomalies at different semantic levels under multi-view scenarios. Extensive experiments on the Real-IAD dataset for multi-view multi-class anomaly detection validate the effectiveness of our approach, achieving state-of-the-art performance of 91.0/88.6/82.1 and 99.1/43.9/48.2/95.2 for image-level and the pixel-level, respectively.
zh
[CV-57] Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions
【速读】:该论文试图解决传统扩散模型中自注意力机制(self-attention)在计算复杂度高且全局交互可能并非关键的问题,其解决方案的关键在于提出ΔConvFusion,通过将常规的自注意力模块替换为金字塔卷积块(ΔConvBlocks),将注意力模式蒸馏为局部卷积操作,从而在保持生成质量的同时显著降低计算成本。
链接: https://arxiv.org/abs/2504.21292
作者: ZiYi Dong,Chengxing Zhou,Weijian Deng,Pengxu Wei,Xiangyang Ji,Liang Lin
机构: Sun Yat-sen Unviersity (中山大学); Australian National University (澳大利亚国立大学); Peng Cheng Laboratory (鹏城实验室); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Contemporary diffusion models built upon U-Net or Diffusion Transformer (DiT) architectures have revolutionized image generation through transformer-based attention mechanisms. The prevailing paradigm has commonly employed self-attention with quadratic computational complexity to handle global spatial relationships in complex images, thereby synthesizing high-fidelity images with coherent visual structures. Contrary to conventional wisdom, our systematic layer-wise analysis reveals an interesting discrepancy: self-attention in pre-trained diffusion models predominantly exhibits localized attention patterns, closely resembling convolutional inductive biases. This suggests that global interactions in self-attention may be less critical than commonly believed. Motivated by this, we propose ΔConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks (ΔConvBlocks). By distilling attention patterns into localized convolutional operations while keeping other components frozen, ΔConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929× and surpassing LinFusion by 5.42× in efficiency, all without compromising generative fidelity.
zh
[CV-58] Mamba Based Feature Extraction And Adaptive Multilevel Feature Fusion For 3D Tumor Segmentation From Multi-modal Medical Image
【速读】:该论文旨在解决多模态3D医学图像分割中因图像强度差异和肿瘤形态变化带来的挑战,特别是传统卷积神经网络(CNN)难以捕捉全局特征,而基于Transformer的方法虽能有效捕获全局上下文但计算成本高昂的问题。其解决方案的关键在于提出一种基于Mamba的特征提取与自适应多层级特征融合方法,通过设计特定模态的Mamba编码器以高效提取长程相关特征,并引入双层级协同整合模块,利用模态注意力和通道注意力学习动态融合多模态、多层级的互补特征,最终通过解码器结合语义信息与细粒度细节生成肿瘤分割图。
链接: https://arxiv.org/abs/2504.21281
作者: Zexin Ji,Beiji Zou,Xiaoyan Kui,Hua Li,Pierre Vera,Su Ruan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-modal 3D medical image segmentation aims to accurately identify tumor regions across different modalities, facing challenges from variations in image intensity and tumor morphology. Traditional convolutional neural network (CNN)-based methods struggle with capturing global features, while Transformer-based methods, despite effectively capturing global context, encounter high computational costs in 3D medical image segmentation. The Mamba model combines linear scalability with long-distance modeling, making it a promising approach for visual representation learning. However, Mamba-based 3D multi-modal segmentation still struggles to leverage modality-specific features and fuse complementary information effectively. In this paper, we propose Mamba-based feature extraction and adaptive multilevel feature fusion for 3D tumor segmentation using multi-modal medical images. We first develop a modality-specific Mamba encoder to efficiently extract long-range relevant features that represent anatomical and pathological structures present in each modality. Moreover, we design a bi-level synergistic integration block that dynamically merges multi-modal and multi-level complementary features via modality attention and channel attention learning. Lastly, the decoder combines deep semantic information with fine-grained details to generate the tumor segmentation map. Experimental results on medical image datasets (PET/CT and MRI multi-sequence) show that our approach achieves competitive performance compared to state-of-the-art CNN, Transformer, and Mamba-based approaches.
zh
[CV-59] CoCoDiff: Diversifying Skeleton Action Features via Coarse-Fine Text-Co-Guided Latent Diffusion
【速读】:该论文旨在解决动作识别任务中特征多样性不足的问题,以提升模型的泛化能力和性能。现有方法通常通过扩展样本空间中的训练数据来促进特征多样性,但这种方法常导致效率低下和语义不一致。论文提出的解决方案关键在于引入一种新颖的粗粒度-细粒度文本协同引导扩散模型(Coarse-fine text co-guidance Diffusion model, CoCoDiff),该模型通过扩散机制与多粒度文本引导,在潜在空间中生成既多样化又语义一致的特征。其核心创新在于利用从大语言模型中获取的文本信息,确保生成特征与原始输入之间的语义一致性,同时作为可插拔的辅助模块,在训练过程中不增加推理成本。
链接: https://arxiv.org/abs/2504.21266
作者: Zhifu Zhao,Hanyang Hua,Jianan Li,Shaoxin Wu,Fu Li,Yangtao Zhou,Yang Li
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In action recognition tasks, feature diversity is essential for enhancing model generalization and performance. Existing methods typically promote feature diversity by expanding the training data in the sample space, which often leads to inefficiencies and semantic inconsistencies. To overcome these problems, we propose a novel Coarse-fine text co-guidance Diffusion model (CoCoDiff). CoCoDiff generates diverse yet semantically consistent features in the latent space by leveraging diffusion and multi-granularity textual guidance. Specifically, our approach feeds spatio-temporal features extracted from skeleton sequences into a latent diffusion model to generate diverse action representations. Meanwhile, we introduce a coarse-fine text co-guided strategy that leverages textual information from large language models (LLMs) to ensure semantic consistency between the generated features and the original inputs. It is noted that CoCoDiff operates as a plug-and-play auxiliary module during training, incurring no additional inference cost. Extensive experiments demonstrate that CoCoDiff achieves SOTA performance on skeleton-based action recognition benchmarks, including NTU RGB+D, NTU RGB+D 120 and Kinetics-Skeleton.
zh
[CV-60] Embracing Collaboration Over Competition: Condensing Multiple Prompts for Visual In-Context Learning CVPR’25
【速读】:该论文旨在解决视觉在上下文学习(Visual In-Context Learning, VICL)中提示选择(prompt selection)的问题,现有方法假设候选提示池中存在一个“理想”提示,但在实际应用中可能并不成立,因此需要更有效的提示整合策略。其解决方案的关键在于提出一种新的视角——提示压缩(prompt condensation),通过让多个候选提示协作,高效整合细粒度的上下文信息,而无需牺牲分辨率。为此,作者设计了Condenser,一个轻量级的外部插件,能够端到端优化以确保上下文线索的准确整合,从而在多个基准任务中表现出优于现有方法的性能。
链接: https://arxiv.org/abs/2504.21263
作者: Jinpeng Wang,Tianci Luo,Yaohua Zha,Yan Feng,Ruisheng Luo,Bin Chen,Tao Dai,Long Chen,Yaowei Wang,Shu-Tao Xia
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学); Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)); Shenzhen University(深圳大学); The Hong Kong University of Science and Technology(香港科技大学); Research Center of Artificial Intelligence, Peng Cheng Laboratory(鹏城实验室人工智能研究中心); Meituan, Beijing(美团,北京)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Accepted by CVPR’25. 10 pages, 5 figures, 6 tables
Abstract:Visual In-Context Learning (VICL) enables adaptively solving vision tasks by leveraging pixel demonstrations, mimicking human-like task completion through analogy. Prompt selection is critical in VICL, but current methods assume the existence of a single “ideal” prompt in a pool of candidates, which in practice may not hold true. Multiple suitable prompts may exist, but individually they often fall short, leading to difficulties in selection and the exclusion of useful context. To address this, we propose a new perspective: prompt condensation. Rather than relying on a single prompt, candidate prompts collaborate to efficiently integrate informative contexts without sacrificing resolution. We devise Condenser, a lightweight external plugin that compresses relevant fine-grained context across multiple prompts. Optimized end-to-end with the backbone, Condenser ensures accurate integration of contextual cues. Experiments demonstrate Condenser outperforms state-of-the-art methods across benchmark tasks, showing superior context compression, scalability with more prompts, and enhanced computational efficiency compared to ensemble methods, positioning it as a highly competitive solution for VICL. Code is open-sourced at this https URL.
zh
[CV-61] Multi-modal Transfer Learning for Dynamic Facial Emotion Recognition in the Wild
【速读】:该论文试图解决视频面部表情识别(video-based FER)中的分类准确性问题,特别是在具有挑战性的动态面部表情在野数据集(Dynamic Facial Expression in-the-Wild, DFEW)上。解决方案的关键在于利用多模态迁移学习,结合预训练的ResNets、OpenPose和OmniVec网络,提取跨时间的多模态特征,以提升基于Transformer的分类模型的性能。
链接: https://arxiv.org/abs/2504.21248
作者: Ezra Engel,Lishan Li,Chris Hudy,Robert Schleusner
机构: Georgia Institute of Technology(佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures
Abstract:Facial expression recognition (FER) is a subset of computer vision with important applications for human-computer-interaction, healthcare, and customer service. FER represents a challenging problem space because accurate classification requires a model to differentiate between subtle changes in facial features. In this paper, we examine the use of multi-modal transfer learning to improve performance on a challenging video-based FER dataset, Dynamic Facial Expression in-the-Wild (DFEW). Using a combination of pretrained ResNets, OpenPose, and OmniVec networks, we explore the impact of cross-temporal, multi-modal features on classification accuracy. Ultimately, we find that these fine-tuned multi-modal feature generators modestly improve accuracy of our transformer-based classification model.
zh
[CV-62] Subject Information Extraction for Novelty Detection with Domain Shifts
【速读】:该论文旨在解决在领域偏移(domain shift)条件下,无监督新颖性检测(Unsupervised Novelty Detection, UND)方法容易将正常数据误判为新颖样本的问题。现有方法通常假设训练数据与测试正常数据来自同一领域,但在实际场景中,训练与测试数据可能来自不同领域,导致模型性能下降。论文提出的解决方案的关键在于将主体信息与背景变化分离,其中背景变化包含领域信息,通过最小化主体与背景表示之间的互信息,并利用深度高斯混合模型建模背景变化,从而在仅基于主体表示进行新颖性检测的过程中避免领域变化的影响。
链接: https://arxiv.org/abs/2504.21247
作者: Yangyang Qu,Dazhi Fu,Jicong Fan
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Unsupervised novelty detection (UND), aimed at identifying novel samples, is essential in fields like medical diagnosis, cybersecurity, and industrial quality control. Most existing UND methods assume that the training data and testing normal data originate from the same domain and only consider the distribution variation between training data and testing data. However, in real scenarios, it is common for normal testing and training data to originate from different domains, a challenge known as domain shift. The discrepancies between training and testing data often lead to incorrect classification of normal data as novel by existing methods. A typical situation is that testing normal data and training data describe the same subject, yet they differ in the background conditions. To address this problem, we introduce a novel method that separates subject information from background variation encapsulating the domain information to enhance detection performance under domain shifts. The proposed method minimizes the mutual information between the representations of the subject and background while modelling the background variation using a deep Gaussian mixture model, where the novelty detection is conducted on the subject representations solely and hence is not affected by the variation of domains. Extensive experiments demonstrate that our model generalizes effectively to unseen domains and significantly outperforms baseline methods, especially under substantial domain shifts between training and testing data.
zh
[CV-63] T2ID-CAS: Diffusion Model and Class Aware Sampling to Mitigate Class Imbalance in Neck Ultrasound Anatomical Landmark Detection
【速读】:该论文旨在解决颈部超声(Neck US)中由于数据集内关键结构如气管环和声带等样本不足而导致的类别不平衡问题,该问题严重影响了目标检测模型的性能。解决方案的关键在于提出T2ID-CAS方法,该方法结合了文本到图像的潜在扩散模型与类别感知采样技术,用于生成高质量的合成样本以增强少数类别的表示,从而提升检测模型的准确性。
链接: https://arxiv.org/abs/2504.21231
作者: Manikanta Varaganti,Amulya Vankayalapati,Nour Awad,Gregory R. Dion,Laura J. Brattain
机构: University of Central Florida(中佛罗里达大学); University of Cincinnati College of Medicine(辛辛那提大学医学院); University of Central Florida College of Medicine(中佛罗里达大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: submitted to IEEE EMBC 2025
Abstract:Neck ultrasound (US) plays a vital role in airway management by providing non-invasive, real-time imaging that enables rapid and precise interventions. Deep learning-based anatomical landmark detection in neck US can further facilitate procedural efficiency. However, class imbalance within datasets, where key structures like tracheal rings and vocal folds are underrepresented, presents significant challenges for object detection models. To address this, we propose T2ID-CAS, a hybrid approach that combines a text-to-image latent diffusion model with class-aware sampling to generate high-quality synthetic samples for underrepresented classes. This approach, rarely explored in the ultrasound domain, improves the representation of minority classes. Experimental results using YOLOv9 for anatomical landmark detection in neck US demonstrated that T2ID-CAS achieved a mean Average Precision of 88.2, significantly surpassing the baseline of 66. This highlights its potential as a computationally efficient and scalable solution for mitigating class imbalance in AI-assisted ultrasound-guided interventions.
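其中"类别感知采样"(class-aware sampling)的基本思想可用下面的极简Python示意来说明:按类别频率的倒数为每个样本分配采样权重,使稀有解剖结构更容易被采到(示意代码为笔者假设的简化实现,`class_aware_weights` 等名称均为虚构,并非论文代码):
```python
import random
from collections import Counter

def class_aware_weights(labels):
    """按类别频率的倒数为每个样本分配采样权重(类别感知采样的核心思想)。"""
    freq = Counter(labels)
    return [1.0 / freq[y] for y in labels]

# 少数类(如气管环、声带)在混合批次中被采样的概率被显著提高
labels = ["trachea_ring"] * 5 + ["vocal_fold"] * 10 + ["thyroid"] * 85
weights = class_aware_weights(labels)
batch_indices = random.choices(range(len(labels)), weights=weights, k=16)
```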
zh
[CV-64] MemeBLIP2: A novel lightweight multimodal system to detect harmful memes
【速读】:该论文旨在解决有害模因(harmful memes)检测的问题,特别是针对那些融合了视觉与简短文本的模因中可能包含的仇恨言论等负面信息。解决方案的关键在于提出MemeBLIP2,一个轻量级的多模态系统,通过将图像和文本表示对齐到共享空间并进行融合,从而更有效地捕捉两种模态中的细微线索,提升有害内容的检测性能。
链接: https://arxiv.org/abs/2504.21226
作者: Jiaqi Liu,Ran Tong,Aowei Shen,Shuzheng Li,Changlin Yang,Lisha Xu
机构: Walmart Global Tech(沃尔玛全球科技); University of Texas at Dallas(德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures, manuscript in preparation
Abstract:Memes often merge visuals with brief text to share humor or opinions, yet some memes contain harmful messages such as hate speech. In this paper, we introduce MemeBLIP2, a lightweight multimodal system that detects harmful memes by combining image and text features effectively. We build on previous studies by adding modules that align image and text representations into a shared space and fuse them for better classification. Using BLIP-2 as the core vision-language model, our system is evaluated on the PrideMM dataset. The results show that MemeBLIP2 can capture subtle cues in both modalities, even in cases with ironic or culturally specific content, thereby improving the detection of harmful material.
zh
[CV-65] Geolocating Earth Imagery from ISS: Integrating Machine Learning with Astronaut Photography for Enhanced Geographic Mapping
【速读】:该论文试图解决从国际空间站(International Space Station, ISS)拍摄的图像中自动确定其地理定位的问题,尽管ISS的坐标是已知的,但图像中所展示的具体地球位置往往无法准确识别。解决方案的关键在于采用三种不同的图像处理流程:基于神经网络(Neural Network, NN)的方法、基于SIFT(Scale-Invariant Feature Transform)的算法以及GPT-4模型,通过这些方法对高分辨率ISS图像进行处理,以识别自然和人工地理特征,并实现高效的自动化地理定位。
链接: https://arxiv.org/abs/2504.21194
作者: Vedika Srivastava,Hemant Kumar Singh,Jaisal Singh
机构: Boston University (波士顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents a novel approach to geolocating images captured from the International Space Station (ISS) using advanced machine learning algorithms. Despite having precise ISS coordinates, the specific Earth locations depicted in astronaut-taken photographs often remain unidentified. Our research addresses this gap by employing three distinct image processing pipelines: a Neural Network based approach, a SIFT based method, and GPT-4 model. Each pipeline is tailored to process high-resolution ISS imagery, identifying both natural and man-made geographical features. Through extensive evaluation on a diverse dataset of over 140 ISS images, our methods demonstrate significant promise in automated geolocation with varied levels of success. The NN approach showed a high success rate in accurately matching geographical features, while the SIFT pipeline excelled in processing zoomed-in images. GPT-4 model provided enriched geographical descriptions alongside location predictions. This research contributes to the fields of remote sensing and Earth observation by enhancing the accuracy and efficiency of geolocating space-based imagery, thereby aiding environmental monitoring and global mapping efforts.
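文中提到的SIFT流水线,其核心步骤(关键点检测与比率检验匹配)可用OpenCV写成如下极简示意(假设使用opencv-python 4.4及以上版本,文件路径与阈值均为示例,非论文实现):
```python
import cv2

def match_iss_to_reference(iss_path, ref_path, ratio=0.75):
    """用SIFT关键点与Lowe比率检验,将ISS照片与参考地图图像匹配。"""
    img_a = cv2.imread(iss_path, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(ref_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    # 比率检验:仅保留显著优于次优匹配的关键点对
    return [m for m, n in matches if m.distance < ratio * n.distance]
```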
zh
[CV-66] Dance Style Recognition Using Laban Movement Analysis
【速读】:该论文试图解决在复杂人类活动(包括舞蹈)识别中,传统方法因依赖跨帧运动分析而难以捕捉动作的时序上下文和动态过渡的问题。解决方案的关键在于引入一种新颖的流程,结合三维姿态估计、三维人体网格重建和地面感知身体建模,以有效提取Laban Movement Analysis (LMA) 特征,并通过滑动窗口方法捕捉特征随时间演变的信息,从而增强时序上下文的表达能力。
链接: https://arxiv.org/abs/2504.21166
作者: Muhammad Turab,Philippe Colantoni,Damien Muselet,Alain Tremeau
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The growing interest in automated movement analysis has presented new challenges in recognition of complex human activities including dance. This study focuses on dance style recognition using features extracted using Laban Movement Analysis. Previous studies for dance style recognition often focus on cross-frame movement analysis, which limits the ability to capture temporal context and dynamic transitions between movements. This gap highlights the need for a method that can add temporal context to LMA features. For this, we introduce a novel pipeline which combines 3D pose estimation, 3D human mesh reconstruction, and floor aware body modeling to effectively extract LMA features. To address the temporal limitation, we propose a sliding window approach that captures movement evolution across time in features. These features are then used to train various machine learning methods for classification, and explainable AI methods are applied to evaluate the contribution of each feature to classification performance. Our proposed method achieves a highest classification accuracy of 99.18% which shows that the addition of temporal context significantly improves dance style recognition performance.
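文中的滑动窗口思路,大致相当于在逐帧LMA特征上做如下聚合(示意代码,窗口长度、步长与统计量均为笔者假设):
```python
import numpy as np

def sliding_window_features(frame_features, window=30, stride=10):
    """在逐帧特征序列上滑窗,聚合出带时序上下文的窗口级特征。

    frame_features: 形状为 (帧数, 特征维度) 的数组。
    """
    out = []
    for start in range(0, len(frame_features) - window + 1, stride):
        seg = frame_features[start:start + window]
        # 用均值与标准差概括窗口内运动特征随时间的演变
        out.append(np.concatenate([seg.mean(axis=0), seg.std(axis=0)]))
    return np.stack(out)
```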
zh
[CV-67] Emotion Recognition in Contemporary Dance Performances Using Laban Movement Analysis
【速读】:该论文试图解决当代舞蹈中情感识别的问题,旨在通过改进现有的Laban Movement Analysis (LMA)特征描述符,并引入稳健且新颖的描述符来捕捉运动的定量和定性特征。解决方案的关键在于从专业舞者在不同情绪状态下表演的3D关键点数据中提取表现性特征,并利用随机森林和支持向量机等分类器进行训练,同时通过可解释机器学习方法深入分析特征及其对模型预测的影响。
链接: https://arxiv.org/abs/2504.21154
作者: Muhammad Turab,Philippe Colantoni,Damien Muselet,Alain Tremeau
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents a novel framework for emotion recognition in contemporary dance by improving existing Laban Movement Analysis (LMA) feature descriptors and introducing robust, novel descriptors that capture both quantitative and qualitative aspects of the movement. Our approach extracts expressive characteristics from 3D keypoints data of professional dancers performing contemporary dance under various emotional states, and trains multiple classifiers, including Random Forests and Support Vector Machines. Additionally, we provide in-depth explanation of features and their impact on model predictions using explainable machine learning methods. Overall, our study improves emotion recognition in contemporary dance and offers promising applications in performance analysis, dance training, and human–computer interaction, with a highest accuracy of 96.85%.
zh
[CV-68] Legilimens: Performant Video Analytics on the System-on-Chip Edge
【速读】:该论文旨在解决在资源受限的移动边缘设备(如无人机和行车记录仪)上实现高精度视频分析时,持续重新训练模型所面临的计算与内存效率问题。现有系统依赖传统边缘服务器的空闲计算资源进行模型适应,而移动边缘设备具有较弱的计算能力但拥有丰富的统一内存池。论文提出的解决方案关键在于利用视觉上差异显著但模型嵌入存在重叠的场景特性,通过将基础模型存储在设备内存中,使针对每个新场景的模型微调变得轻量,仅需少量样本即可完成。这一方法有效降低了重新训练成本,并提升了模型精度。
链接: https://arxiv.org/abs/2504.21136
作者: Murali Ramanujam,Yinwei Dai,Kyle Jamieson,Ravi Netravali
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Continually retraining models has emerged as a primary technique to enable high-accuracy video analytics on edge devices. Yet, existing systems employ such adaptation by relying on the spare compute resources that traditional (memory-constrained) edge servers afford. In contrast, mobile edge devices such as drones and dashcams offer a fundamentally different resource profile: weak(er) compute with abundant unified memory pools. We present Legilimens, a continuous learning system for the mobile edge’s System-on-Chip GPUs. Our driving insight is that visually distinct scenes that require retraining exhibit substantial overlap in model embeddings; if captured into a base model on device memory, specializing to each new scene can become lightweight, requiring very few samples. To practically realize this approach, Legilimens presents new, compute-efficient techniques to (1) select high-utility data samples for retraining specialized models, (2) update the base model without complete retraining, and (3) time-share compute resources between retraining and live inference for maximal accuracy. Across diverse workloads, Legilimens lowers retraining costs by 2.8-10x compared to existing systems, resulting in 18-45% higher accuracies.
zh
[CV-69] GauSS-MI: Gaussian Splatting Shannon Mutual Information for Active 3D Reconstruction
【速读】:该论文试图解决实时主动视点选择与视觉质量不确定性量化的问题,特别是在主动三维重建中如何高效获取具有信息量的输入图像。现有研究主要关注几何完整性及未观测区域的探索,而缺乏对重建模型内部视觉不确定性的直接评估。解决方案的关键在于引入一种概率模型,用于量化每个高斯分布的视觉不确定性,并基于香农互信息构建GauSS-MI准则,以实现实时从新视点评估视觉互信息并选择最优视点。
链接: https://arxiv.org/abs/2504.21067
作者: Yuhan Xie,Yixi Cai,Yinqiang Zhang,Lei Yang,Jia Pan
机构: The University of Hong Kong (香港大学); KTH Royal Institute of Technology (皇家理工学院); Hong Kong SAR (香港特别行政区)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:This research tackles the challenge of real-time active view selection and uncertainty quantification on visual quality for active 3D reconstruction. Visual quality is a critical aspect of 3D reconstruction. Recent advancements such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have notably enhanced the image rendering quality of reconstruction models. Nonetheless, the efficient and effective acquisition of input images for reconstruction-specifically, the selection of the most informative viewpoint-remains an open challenge, which is crucial for active reconstruction. Existing studies have primarily focused on evaluating geometric completeness and exploring unobserved or unknown regions, without direct evaluation of the visual uncertainty within the reconstruction model. To address this gap, this paper introduces a probabilistic model that quantifies visual uncertainty for each Gaussian. Leveraging Shannon Mutual Information, we formulate a criterion, Gaussian Splatting Shannon Mutual Information (GauSS-MI), for real-time assessment of visual mutual information from novel viewpoints, facilitating the selection of next best view. GauSS-MI is implemented within an active reconstruction system integrated with a view and motion planner. Extensive experiments across various simulated and real-world scenes showcase the superior visual quality and reconstruction efficiency performance of the proposed system.
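GauSS-MI建立在香农互信息的标准定义之上;作为背景,离散情形下互信息可写为如下形式(以下仅为教科书式定义,论文中针对每个高斯的视觉不确定性建模细节请参见原文):
```latex
I(X;Y) = H(X) - H(X \mid Y)
       = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
```
互信息越大,说明从新视点获得的观测对降低重建模型不确定性的贡献越大,这正是"下一最佳视点"选择的依据。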
zh
[CV-70] Can a Large Language Model Assess Urban Design Quality? Evaluating Walkability Metrics Across Expertise Levels
【速读】:该论文试图解决如何通过将更正式和结构化的专家城市设计知识整合到多模态大语言模型(MLLM)的输入提示中,以提升其在基于街景图像(SVIs)评估建成环境步行可达性方面的性能与可靠性。解决方案的关键在于通过引入专家知识来优化提示设计,从而增强模型在评价标准上的清晰度和具体性,进而提高其评估结果的一致性和准确性。
链接: https://arxiv.org/abs/2504.21040
作者: Chenyi Cai,Kosuke Kuriyama,Youlong Gu,Filip Biljecki,Pieter Herthogs
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Urban street environments are vital to supporting human activity in public spaces. The emergence of big data, such as street view images (SVIs) combined with multimodal large language models (MLLMs), is transforming how researchers and practitioners investigate, measure, and evaluate semantic and visual elements of urban environments. Considering the low threshold for creating automated evaluative workflows using MLLMs, it is crucial to explore both the risks and opportunities associated with these probabilistic models. In particular, the extent to which the integration of expert knowledge can influence the performance of MLLMs in evaluating the quality of urban design has not been fully explored. This study sets out an initial exploration of how integrating more formal and structured representations of expert urban design knowledge into the input prompts of an MLLM (ChatGPT-4) can enhance the model’s capability and reliability in evaluating the walkability of built environments using SVIs. We collect walkability metrics from the existing literature and categorize them using relevant ontologies. We then select a subset of these metrics, focusing on the subthemes of pedestrian safety and attractiveness, and develop prompts for the MLLM accordingly. We analyze the MLLM’s ability to evaluate SVI walkability subthemes through prompts with varying levels of clarity and specificity regarding evaluation criteria. Our experiments demonstrate that MLLMs are capable of providing assessments and interpretations based on general knowledge and can support the automation of multimodal image-text evaluations. However, they generally provide more optimistic scores and can make mistakes when interpreting the provided metrics, resulting in incorrect evaluations. By integrating expert knowledge, the MLLM’s evaluative performance exhibits higher consistency and concentration.
zh
[CV-71] ranscending Dimensions using Generative AI: Real-Time 3D Model Generation in Augmented Reality
【速读】:该论文试图解决传统3D建模过程需要专业技术知识、专业软件以及耗时的问题,从而使得许多用户难以接触和使用。其解决方案的关键在于将生成式 AI (Generative AI) 与增强现实 (Augmented Reality, AR) 结合,构建一个一体化系统,使用户能够在AR环境中实时生成、操作和交互3D模型。通过利用先进的AI模型(如Shap-E)以及基于Mask R-CNN的先进目标检测方法,解决了2D图像到3D表示转换中的复杂挑战,包括对象隔离、复杂背景处理和无缝用户交互等问题。
链接: https://arxiv.org/abs/2504.21033
作者: Majid Behravan,Maryam Haghani,Denis Gracanin
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Traditional 3D modeling requires technical expertise, specialized software, and time-intensive processes, making it inaccessible for many users. Our research aims to lower these barriers by combining generative AI and augmented reality (AR) into a cohesive system that allows users to easily generate, manipulate, and interact with 3D models in real time, directly within AR environments. Utilizing cutting-edge AI models like Shap-E, we address the complex challenges of transforming 2D images into 3D representations in AR environments. Key challenges such as object isolation, handling intricate backgrounds, and achieving seamless user interaction are tackled through advanced object detection methods, such as Mask R-CNN. Evaluation results from 35 participants reveal an overall System Usability Scale (SUS) score of 69.64, with participants who engaged with AR/VR technologies more frequently rating the system significantly higher, at 80.71. This research is particularly relevant for applications in gaming, education, and AR-based e-commerce, offering intuitive, model creation for users without specialized skills.
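文中用于对象隔离的Mask R-CNN,可以用torchvision的预训练模型写出如下推理示意(假设torchvision>=0.13,文件名与置信度阈值均为示例):
```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# 加载在COCO上预训练的Mask R-CNN
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("scene.jpg").convert("RGB"))
with torch.no_grad():
    pred = model([image])[0]  # 包含 boxes、labels、scores、masks

# 只保留高置信度实例掩码,用于把前景对象从复杂背景中分离出来
keep = pred["scores"] > 0.8
masks = pred["masks"][keep]  # 形状 (N, 1, H, W) 的软掩码
```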
zh
[CV-72] LoC-LIC: Low Complexity Learned Image Coding Using Hierarchical Feature Transforms
【速读】:该论文试图解决当前学习型图像压缩模型复杂度高、计算资源需求大的问题(learned image compression models typically exhibit high complexity, which demands significant computational resources)。其解决方案的关键在于采用分层特征提取变换(hierarchical feature extraction transforms),通过减少高空间分辨率输入/特征图的通道数,并降低通道数较多的特征图的空间维度,从而在不牺牲性能的前提下显著降低计算负载。这一策略将前向传播复杂度从1256kMAC/Pixel降至270kMAC/Pixel。
链接: https://arxiv.org/abs/2504.21778
作者: Ayman A. Ameen,Thomas Richter,André Kaup
机构: Fraunhofer Institute for Integrated Circuits IIS (弗劳恩霍夫集成电路研究所); Sohag University (索哈格大学); Friedrich-Alexander University at Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希-亚历山大大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current learned image compression models typically exhibit high complexity, which demands significant computational resources. To overcome these challenges, we propose an innovative approach that employs hierarchical feature extraction transforms to significantly reduce complexity while preserving bit rate reduction efficiency. Our novel architecture achieves this by using fewer channels for high spatial resolution inputs/feature maps. On the other hand, feature maps with a large number of channels have reduced spatial dimensions, thereby cutting down on computational load without sacrificing performance. This strategy effectively reduces the forward pass complexity from 1256 kMAC/Pixel to just 270 kMAC/Pixel. As a result, the reduced complexity model can open the way for learned image compression models to operate efficiently across various devices and pave the way for the development of new architectures in image compression technology.
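这种"高分辨率少通道、低分辨率多通道"的复杂度折衷可以用一个简单算式验证(kMAC/Pixel按原图像素折算,通道数与下采样比例仅为示意):
```python
def conv_kmac_per_pixel(c_in, c_out, k, downscale):
    """k×k卷积在 1/downscale 分辨率特征图上的复杂度(kMAC/原图像素)。

    每个输出位置需要 c_in*c_out*k*k 次乘加,再按下采样比例折算。
    """
    return c_in * c_out * k * k / (downscale ** 2) / 1000.0

# 两种配置的每像素复杂度相同(约9.2 kMAC/Pixel):
print(conv_kmac_per_pixel(32, 32, 3, downscale=1))   # 高分辨率、少通道
print(conv_kmac_per_pixel(256, 256, 3, downscale=8)) # 低分辨率、多通道
```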
zh
[CV-73] owards Space Group Determination from EBSD Patterns: The Role of Deep Learning and High-throughput Dynamical Simulations
【速读】:该论文旨在解决高通量纳米材料发现中晶体对称性确定的瓶颈问题,即在大量材料合成后,缺乏快速且高效的方法对其结构进行表征。其解决方案的关键在于利用扫描电子显微镜(SEM)中的Kikuchi衍射技术结合深度学习方法,通过分析电子背散射衍射(EBSD)图案来分类空间群对称性,从而实现对晶体结构的快速确定。研究中采用人工生成的5,148种立方相EBSD图案数据集进行神经网络训练,并引入一种重标签方案以提升模型在实验数据上的准确率,证明了神经网络在从EBSD图案中预测晶体对称性的可行性。
链接: https://arxiv.org/abs/2504.21331
作者: Alfred Yan,Muhammad Nur Talha Kilic,Gert Nolze,Ankit Agrawal,Alok Choudhary,Roberto dos Reis,Vinayak Dravid
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages, preliminary version
Abstract:The design of novel materials hinges on the understanding of structure-property relationships. However, our capability to synthesize a large number of materials has outpaced the ability and speed needed to characterize them. While the overall chemical constituents can be readily known during synthesis, the structural evolution and characterization of newly synthesized samples remains a bottleneck for the ultimate goal of high throughput nanomaterials discovery. Thus, scalable methods for crystal symmetry determination that can analyze a large volume of material samples within a short time-frame are especially needed. Kikuchi diffraction in the SEM is a promising technique for this due to its sensitivity to dynamical scattering, which may provide information beyond just the seven crystal systems and fourteen Bravais lattices. After diffraction patterns are collected from material samples, deep learning methods may be able to classify the space group symmetries using the patterns as input, which, paired with the elemental composition, would help enable the determination of the crystal structure. To investigate the feasibility of this solution, neural networks were trained to predict the space group type of background corrected EBSD patterns. Our networks were first trained and tested on an artificial dataset of EBSD patterns of 5,148 different cubic phases, created through physics-based dynamical simulations. Next, Maximum Classifier Discrepancy, an unsupervised deep learning-based domain adaptation method, was utilized to train neural networks to make predictions for experimental EBSD patterns. We introduce a relabeling scheme, which enables our models to achieve accuracy scores higher than 90% on simulated and experimental data, suggesting that neural networks are capable of making predictions of crystal symmetry from an EBSD pattern.
zh
[CV-74] Gradient Attention Map Based Verification of Deep Convolutional Neural Networks with Application to X-ray Image Datasets
【速读】:该论文旨在解决深度学习模型在医学影像中应用时,因数据分布差异导致预测不可靠的问题,这可能影响患者护理。其解决方案的关键在于提出一个综合验证框架,通过多种互补策略评估模型的适用性:首先,采用基于梯度注意力图(Gradient Attention Map, GAM)的方法,利用Grad-CAM分析注意力模式,并通过IoU、Dice Similarity、SSIM、余弦相似性、皮尔逊相关性、KL散度和Wasserstein距离等相似性度量进行比较;其次,将验证扩展到早期卷积特征图,以捕捉仅依靠注意力机制可能遗漏的结构错位;最后,在分类模型中引入额外的“垃圾”类别,以明确拒绝分布外输入。这些方法的结合有效识别了不适用的模型和输入,促进了深度学习在医学影像中的安全可靠部署。
链接: https://arxiv.org/abs/2504.21227
作者: Omid Halimi Milani,Amanda Nikho,Lauren Mills,Marouane Tliba,Ahmet Enis Cetin,Mohammed H. Elnagar
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 7 figures, accepted at IEEE VLSI Test Symposium (VTS) 2025
Abstract:Deep learning models have great potential in medical imaging, including orthodontics and skeletal maturity assessment. However, applying a model to data different from its training set can lead to unreliable predictions that may impact patient care. To address this, we propose a comprehensive verification framework that evaluates model suitability through multiple complementary strategies. First, we introduce a Gradient Attention Map (GAM)-based approach that analyzes attention patterns using Grad-CAM and compares them via similarity metrics such as IoU, Dice Similarity, SSIM, Cosine Similarity, Pearson Correlation, KL Divergence, and Wasserstein Distance. Second, we extend verification to early convolutional feature maps, capturing structural misalignments missed by attention alone. Finally, we incorporate an additional garbage class into the classification model to explicitly reject out-of-distribution inputs. Experimental results demonstrate that these combined methods effectively identify unsuitable models and inputs, promoting safer and more reliable deployment of deep learning in medical imaging.
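文中比较Grad-CAM注意力图所用的相似性度量,其中IoU与余弦相似度可以按如下方式计算(极简示意,阈值与随机数据仅用于演示):
```python
import numpy as np

def attention_iou(a, b, thr=0.5):
    """二值化后注意力图的IoU。"""
    a_bin, b_bin = a > thr, b > thr
    union = np.logical_or(a_bin, b_bin).sum()
    return np.logical_and(a_bin, b_bin).sum() / union if union else 0.0

def attention_cosine(a, b, eps=1e-8):
    """展平后注意力图的余弦相似度。"""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

cam_ref = np.random.rand(14, 14)  # 参考模型的Grad-CAM图(已归一化)
cam_new = np.random.rand(14, 14)  # 待验证模型的Grad-CAM图
print(attention_iou(cam_ref, cam_new), attention_cosine(cam_ref, cam_new))
```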
zh
[CV-75] Light Weight CNN for classification of Brain Tumors from MRI Images
【速读】:该论文试图解决脑肿瘤类型的多类分类问题,旨在通过磁共振成像(MRI)扫描实现自动化的高精度分类。解决方案的关键在于构建一个轻量级的深度学习模型,并通过图像预处理步骤(包括归一化、数据增强和裁剪技术)提升模型性能,同时利用Keras Tuner进行超参数调优,并采用5折交叉验证确保评估的可靠性,最终实现了98.78%的分类准确率。
链接: https://arxiv.org/abs/2504.21188
作者: Natnael Alemayehu
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages
Abstract:This study presents a convolutional neural network (CNN)-based approach for the multi-class classification of brain tumors using magnetic resonance imaging (MRI) scans. We utilize a publicly available dataset containing MRI images categorized into four classes: glioma, meningioma, pituitary tumor, and no tumor. Our primary objective is to build a light weight deep learning model that can automatically classify brain tumor types with high accuracy. To achieve this goal, we incorporate image preprocessing steps, including normalization, data augmentation, and a cropping technique designed to reduce background noise and emphasize relevant regions. The CNN architecture is optimized through hyperparameter tuning using Keras Tuner, enabling systematic exploration of network parameters. To ensure reliable evaluation, we apply 5-fold cross-validation, where each hyperparameter configuration is evaluated across multiple data splits to mitigate overfitting. Experimental results demonstrate that the proposed model achieves a classification accuracy of 98.78%, indicating its potential as a diagnostic aid in clinical settings. The proposed method offers a low-complexity yet effective solution for assisting in early brain tumor diagnosis.
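文中"每组超参数配置都在5折交叉验证下评估"的流程大致如下(示意代码,假设 build_model 返回一个已编译、以accuracy为指标的Keras模型):
```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(build_model, X, y, n_splits=5):
    """对一组超参数配置做分层5折交叉验证,返回平均验证准确率。"""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = build_model()  # 每折重新构建模型,避免权重泄漏
        model.fit(X[train_idx], y[train_idx], epochs=20, verbose=0)
        _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores))
```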
zh
人工智能
[AI-0] Public Opinion and The Rise of Digital Minds: Perceived Risk Trust and Regulation Support
【速读】:该论文试图解决如何理解公众对人工智能(AI)监管的偏好及其背后的影响因素,特别是公共信任和风险感知在其中的作用。其解决方案的关键在于通过实证分析,揭示政府、AI企业和AI技术的信任水平以及公众对风险的感知如何共同塑造对软性(如放缓AI发展)和强性(如全面禁止AI系统)监管措施的支持程度。研究结果强调了在AI治理中平衡公众风险担忧与制度信任的重要性,并为政策制定者提供了基于实证数据的参考依据。
链接: https://arxiv.org/abs/2504.21849
作者: Justin B. Bullock,Janet V.T. Pauketat,Hsini Huang,Yi-Fan Wang,Jacy Reese Anthis
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 31 pages, 1 figure, 5 tables, accepted to Public Performance and Management Review
Abstract:Governance institutions must respond to societal risks, including those posed by generative AI. This study empirically examines how public trust in institutions and AI technologies, along with perceived risks, shape preferences for AI regulation. Using the nationally representative 2023 Artificial Intelligence, Morality, and Sentience (AIMS) survey, we assess trust in government, AI companies, and AI technologies, as well as public support for regulatory measures such as slowing AI development or outright bans on advanced AI. Our findings reveal broad public support for AI regulation, with risk perception playing a significant role in shaping policy preferences. Individuals with higher trust in government favor regulation, while those with greater trust in AI companies and AI technologies are less inclined to support restrictions. Trust in government and perceived risks significantly predict preferences for both soft (e.g., slowing development) and strong (e.g., banning AI systems) regulatory interventions. These results highlight the importance of public opinion in AI governance. As AI capabilities advance, effective regulation will require balancing public concerns about risks with trust in institutions. This study provides a foundational empirical baseline for policymakers navigating AI governance and underscores the need for further research into public trust, risk perception, and regulatory strategies in the evolving AI landscape.
zh
[AI-1] Characterizing AI Agents for Alignment and Governance
【速读】:该论文试图解决如何构建有效的AI代理治理机制的问题,其核心在于深入理解AI代理的核心属性及其与代理在现实世界中部署和操作相关问题之间的关系。解决方案的关键在于提出一个基于四个维度(自主性、效能性、目标复杂性和普遍性)的AI代理表征框架,并为每个维度设定不同的梯度,从而揭示这些系统在设计、运行和治理方面所面临的独特问题。通过该框架,论文进一步构建了不同类型的AI代理的“代理档案”,以阐明各类AI代理所带来的跨领域技术与非技术治理挑战。
链接: https://arxiv.org/abs/2504.21848
作者: Atoosa Kasirzadeh,Iason Gabriel
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:The creation of effective governance mechanisms for AI agents requires a deeper understanding of their core properties and how these properties relate to questions surrounding the deployment and operation of agents in the world. This paper provides a characterization of AI agents that focuses on four dimensions: autonomy, efficacy, goal complexity, and generality. We propose different gradations for each dimension, and argue that each dimension raises unique questions about the design, operation, and governance of these systems. Moreover, we draw upon this framework to construct “agentic profiles” for different kinds of AI agents. These profiles help to illuminate cross-cutting technical and non-technical governance challenges posed by different classes of AI agents, ranging from narrow task-specific assistants to highly autonomous general-purpose systems. By mapping out key axes of variation and continuity, this framework provides developers, policymakers, and members of the public with the opportunity to develop governance approaches that better align with collective societal goals.
zh
[AI-2] Learning Heterogeneous Performance-Fairness Trade-offs in Federated Learning IJCAI2025
【速读】:该论文试图解决联邦学习中性能与公平性权衡问题,特别是现有方法在训练超网络(hypernet)时采用统一的偏好采样分布,忽略了客户端本地帕累托前沿(local Pareto front)的固有异质性,以及未考虑全局数据集上本地与全局帕累托前沿之间的差距。解决方案的关键在于提出HetPFL框架,其包含偏好采样适应(Preference Sampling Adaptation, PSA)和偏好感知超网络融合(Preference-aware Hypernet Fusion, PHF),分别用于适应性地确定每个客户端的最优偏好采样分布,并实现客户端超网络的偏好感知融合,以提升全局帕累托前沿的性能。
链接: https://arxiv.org/abs/2504.21775
作者: Rongguang Ye,Ming Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IJCAI 2025
Abstract:Recent methods leverage a hypernet to handle the performance-fairness trade-offs in federated learning. This hypernet maps the clients’ preferences between model performance and fairness to preference-specific models on the trade-off curve, known as local Pareto front. However, existing methods typically adopt a uniform preference sampling distribution to train the hypernet across clients, neglecting the inherent heterogeneity of their local Pareto fronts. Meanwhile, from the perspective of generalization, they do not consider the gap between local and global Pareto fronts on the global dataset. To address these limitations, we propose HetPFL to effectively learn both local and global Pareto fronts. HetPFL comprises Preference Sampling Adaptation (PSA) and Preference-aware Hypernet Fusion (PHF). PSA adaptively determines the optimal preference sampling distribution for each client to accommodate heterogeneous local Pareto fronts. While PHF performs preference-aware fusion of clients’ hypernets to ensure the performance of the global Pareto front. We prove that HetPFL converges linearly with respect to the number of rounds, under weaker assumptions than existing methods. Extensive experiments on four datasets show that HetPFL significantly outperforms seven baselines in terms of the quality of learned local and global Pareto fronts.
zh
[AI-3] Is Intermediate Fusion All You Need for UAV-based Collaborative Perception?
【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicles, UAVs)在协作感知中因忽视其视角特性而导致的通信开销过大的问题。其解决方案的关键在于提出一种基于晚期中间融合(Late-Intermediate Fusion, LIF)的高效通信协作感知框架,通过交换信息丰富且紧凑的检测结果,并将融合阶段转移到特征表示层面,从而减少通信负担。该方法利用视觉引导的位置嵌入(Vision-Guided Positional Embedding, VPE)和基于边界框的虚拟增强特征(Box-Based Virtual Augmented Feature, BoBEV)来有效整合来自不同智能体的互补信息,同时引入基于不确定性的通信机制以选择高质量的共享区域。
链接: https://arxiv.org/abs/2504.21774
作者: Jiuwu Hao,Liguo Sun,Yuting Wan,Yueyang Wu,Ti Xiang,Haolin Song,Pin Lv
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Collaborative perception enhances environmental awareness through inter-agent communication and is regarded as a promising solution to intelligent transportation systems. However, existing collaborative methods for Unmanned Aerial Vehicles (UAVs) overlook the unique characteristics of the UAV perspective, resulting in substantial communication overhead. To address this issue, we propose a novel communication-efficient collaborative perception framework based on late-intermediate fusion, dubbed LIF. The core concept is to exchange informative and compact detection results and shift the fusion stage to the feature representation level. In particular, we leverage vision-guided positional embedding (VPE) and box-based virtual augmented feature (BoBEV) to effectively integrate complementary information from various agents. Additionally, we innovatively introduce an uncertainty-driven communication mechanism that uses uncertainty evaluation to select high-quality and reliable shared areas. Experimental results demonstrate that our LIF achieves superior performance with minimal communication bandwidth, proving its effectiveness and practicality. Code and models are available at this https URL.
zh
[AI-4] Solving Copyright Infringement on Short Video Platforms: Novel Datasets and an Audio Restoration Deep Learning Pipeline IJCAI2025
【速读】:该论文试图解决短视频平台(如YouTube Shorts和TikTok)在版权合规方面面临的挑战,特别是侵权者通过嵌入任意背景音乐(BGM)来掩盖原始音轨(OST)并逃避内容原创性检测的问题。解决方案的关键在于提出一种集成音乐源分离(Music Source Separation, MSS)和跨模态视频-音乐检索(Cross-modal Video-Music Retrieval, CMVMR)的新型流水线,该方法能够有效分离出任意BGM并恢复原始OST,从而保障内容的完整性。
链接: https://arxiv.org/abs/2504.21772
作者: Minwoo Oh,Minsu Park,Eunil Park
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注: will be presented in IJCAI 2025, 9 pages, 4 tables, 3 figures
Abstract:Short video platforms like YouTube Shorts and TikTok face significant copyright compliance challenges, as infringers frequently embed arbitrary background music (BGM) to obscure original soundtracks (OST) and evade content originality detection. To tackle this issue, we propose a novel pipeline that integrates Music Source Separation (MSS) and cross-modal video-music retrieval (CMVMR). Our approach effectively separates arbitrary BGM from the original OST, enabling the restoration of authentic video audio tracks. To support this work, we introduce two domain-specific datasets: OASD-20K for audio separation and OSVAR-160 for pipeline evaluation. OASD-20K contains 20,000 audio clips featuring mixed BGM and OST pairs, while OSVAR-160 is a unique benchmark dataset comprising 1,121 video and mixed-audio pairs, specifically designed for short video restoration tasks. Experimental results demonstrate that our pipeline not only removes arbitrary BGM with high accuracy but also restores OSTs, ensuring content integrity. This approach provides an ethical and scalable solution to copyright challenges in user-generated content on short video platforms.
zh
[AI-5] Sionna RT: Technical Report
【速读】:该论文旨在解决无线电信号传播模拟中的高效计算问题,特别是针对信道冲激响应(CIR)和无线电图(radio maps)的生成。其解决方案的关键在于Sionna RT模块中采用的可微分射线追踪算法,该算法结合了射击与反弹射线(SBR)方法与图像法,并通过基于哈希的机制高效消除重复路径,从而实现了对系统和环境参数的梯度计算,提升了模拟的速度、内存效率和可扩展性。
链接: https://arxiv.org/abs/2504.21719
作者: Fayçal Aït Aoudia,Jakob Hoydis,Merlin Nimier-David,Sebastian Cammerer,Alexander Keller
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:Sionna is an open-source, GPU-accelerated library that, as of version 0.14, incorporates a ray tracer for simulating radio wave propagation. A unique feature of Sionna RT is differentiability, enabling the calculation of gradients for the channel impulse responses (CIRs), radio maps, and other related metrics with respect to system and environmental parameters, such as material properties, antenna patterns, and array geometries. The release of Sionna 1.0 provides a complete overhaul of the ray tracer, significantly improving its speed, memory efficiency, and extensibility. This document details the algorithms employed by Sionna RT to simulate radio wave propagation efficiently, while also addressing their current limitations. Given that the computation of CIRs and radio maps requires distinct algorithms, these are detailed in separate sections. For CIRs, Sionna RT integrates shooting and bouncing of rays (SBR) with the image method and uses a hashing-based mechanism to efficiently eliminate duplicate paths. Radio maps are computed using a purely SBR-based approach.
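文中"基于哈希消除重复路径"的机制,其思想可用如下极简示意说明:将一条路径的交互序列量化后作为去重键,相同键即视为重复(量化步长与数据结构均为笔者假设,并非Sionna RT的实际实现):
```python
seen = set()

def path_key(interactions, quant=1e-3):
    """把一条传播路径的交互序列量化成可哈希的键。

    interactions: [(交互类型, (x, y, z)), ...],例如反射点序列。
    """
    return tuple(
        (kind, round(x / quant), round(y / quant), round(z / quant))
        for kind, (x, y, z) in interactions
    )

def is_duplicate(path):
    """SBR与镜像法可能找到同一条路径,只保留第一次出现的那条。"""
    key = path_key(path)
    if key in seen:
        return True
    seen.add(key)
    return False
```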
zh
[AI-6] Recursive KL Divergence Optimization: A Dynamic Framework for Representation Learning
【速读】:该论文试图解决传统表示学习目标在优化过程中的效率和稳定性问题,特别是现有框架如信息对比学习(Information Contrastive Learning, I-Con)通过固定邻域条件下的KL散度统一多种学习范式,但忽略了学习过程中固有的递归结构。解决方案的关键在于引入递归KL散度优化(Recursive KL Divergence Optimization, RKDO),该方法将表示学习建模为数据邻域间KL散度的演化过程,不仅能够捕捉对比聚类和降维方法作为静态切片,还为模型稳定性和局部适应性提供了新的路径。实验表明,RKDO在三个不同数据集上相比静态方法损失值降低约30%,且计算资源消耗减少60%至80%,体现了其在资源受限应用中的显著优势。
链接: https://arxiv.org/abs/2504.21707
作者: Anthony D Martin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:We propose a generalization of modern representation learning objectives by reframing them as recursive divergence alignment processes over localized conditional distributions. While recent frameworks like Information Contrastive Learning (I-Con) unify multiple learning paradigms through KL divergence between fixed neighborhood conditionals, we argue this view underplays a crucial recursive structure inherent in the learning process. We introduce Recursive KL Divergence Optimization (RKDO), a dynamic formalism where representation learning is framed as the evolution of KL divergences across data neighborhoods. This formulation captures contrastive clustering and dimensionality reduction methods as static slices, while offering a new path to model stability and local adaptation. Our experiments demonstrate that RKDO offers dual efficiency advantages: approximately 30 percent lower loss values compared to static approaches across three different datasets, and a 60 to 80 percent reduction in computational resources needed to achieve comparable results. This suggests that RKDO’s recursive updating mechanism provides a fundamentally more efficient optimization landscape for representation learning, with significant implications for resource-constrained applications.
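作为背景,KL散度与"邻域条件分布随学习过程演化"的递归更新可以写成如下玩具示意(仅演示概念,步长与更新规则为笔者假设,并非论文算法):
```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """离散分布间的KL散度 D_KL(p || q)。"""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def rkdo_step(neighborhoods, targets, lr=0.1):
    """把每个邻域条件分布朝对应目标分布移动一小步,再归一化。"""
    updated = []
    for p, q in zip(neighborhoods, targets):
        p_new = (1 - lr) * p + lr * q
        updated.append(p_new / p_new.sum())
    return updated

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q))                        # 更新前的散度
print(kl_divergence(rkdo_step([p], [q])[0], q))   # 一步更新后散度下降
```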
zh
[AI-7] XBreaking: Explainable Artificial Intelligence for Jailbreaking LLM s
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在关键应用场景中因安全威胁而难以可靠部署的问题,特别是针对商业LLMs中用于消除有害输出的复杂过滤机制的突破。解决方案的关键在于提出一种可解释的人工智能(Explainable-AI)方法,通过对比分析被过滤与未被过滤模型的行为,提取可利用的对齐模式,并基于此设计XBreaking攻击,通过有针对性的噪声注入来绕过LLMs的安全约束。
链接: https://arxiv.org/abs/2504.21700
作者: Marco Arazzi,Vignesh Kumar Kembu,Antonino Nocera,Vinod P
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models are fundamental actors in the modern IT landscape dominated by AI solutions. However, security threats associated with them might prevent their reliable adoption in critical application scenarios such as government organizations and medical institutions. For this reason, commercial LLMs typically undergo a sophisticated censoring mechanism to eliminate any harmful output they could possibly produce. In response to this, LLM Jailbreaking is a significant threat to such protections, and many previous approaches have already demonstrated its effectiveness across diverse domains. Existing jailbreak proposals mostly adopt a generate-and-test strategy to craft malicious input. To improve the comprehension of censoring mechanisms and design a targeted jailbreak attack, we propose an Explainable-AI solution that comparatively analyzes the behavior of censored and uncensored models to derive unique exploitable alignment patterns. Then, we propose XBreaking, a novel jailbreak attack that exploits these unique patterns to break the security constraints of LLMs by targeted noise injection. Our thorough experimental campaign returns important insights about the censoring mechanisms and demonstrates the effectiveness and performance of our attack.
zh
[AI-8] Self-Supervised Monocular Visual Drone Model Identification through Improved Occlusion Handling
【速读】:该论文试图解决在GPS拒止环境中,无人机进行高速飞行时由于视觉条件恶劣(如运动模糊和大范围遮挡)导致的自运动估计(ego-motion estimation)精度下降问题。当前的视觉方法通常需要依赖外部运动捕捉系统提供的监督数据来学习无人机模型,这限制了其在不同环境和无人机平台上的可扩展性。解决方案的关键在于提出一种自监督学习框架,利用机载单目视频和飞行控制器数据(IMU和电机反馈)训练基于神经网络的无人机模型,通过先训练一个自监督的相对位姿估计模型作为教师模型,再将其用于指导学生模型的学习,同时引入改进的遮挡处理方法以提升高速接近障碍物时的性能,最终实现了更精确的位姿估计和更高的飞行速度适应性。
链接: https://arxiv.org/abs/2504.21695
作者: Stavrow A. Bahnam,Christophe De Wagter,Guido C.H.E. de Croon
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Ego-motion estimation is vital for drones when flying in GPS-denied environments. Vision-based methods struggle when flight speed increases and close-by objects lead to difficult visual conditions with considerable motion blur and large occlusions. To tackle this, vision is typically complemented by state estimation filters that combine a drone model with inertial measurements. However, these drone models are currently learned in a supervised manner with ground-truth data from external motion capture systems, limiting scalability to different environments and drones. In this work, we propose a self-supervised learning scheme to train a neural-network-based drone model using only onboard monocular video and flight controller data (IMU and motor feedback). We achieve this by first training a self-supervised relative pose estimation model, which then serves as a teacher for the drone model. To allow this to work at high speed close to obstacles, we propose an improved occlusion handling method for training self-supervised pose estimation models. Due to this method, the root mean squared error of resulting odometry estimates is reduced by an average of 15%. Moreover, the student neural drone model can be successfully obtained from the onboard data. It even becomes more accurate at higher speeds compared to its teacher, the self-supervised vision-based model. We demonstrate the value of the neural drone model by integrating it into a traditional filter-based VIO system (ROVIO), resulting in superior odometry accuracy on aggressive 3D racing trajectories near obstacles. Self-supervised learning of ego-motion estimation represents a significant step toward bridging the gap between flying in controlled, expensive lab environments and real-world drone applications. The fusion of vision and drone models will enable higher-speed flight and improve state estimation, on any drone in any environment.
zh
[AI-9] Automatic Mapping of AutomationML Files to Ontologies for Graph Queries and Validation
【速读】:该论文试图解决AutomationML作为自动化领域广泛采用的开放数据交换格式,其基于XML的结构在使用常规XML工具进行查询或数据验证时存在局限性。解决方案的关键在于构建一个最新的概念本体以及一种声明式的映射方法,以将任何AutomationML模型自动转换为RDF三元组,从而实现AutomationML信息向工业知识图谱的便捷集成。通过将AutomationML转换为OWL,该方法为查询和验证提供了前所未有的强大能力。
链接: https://arxiv.org/abs/2504.21694
作者: Tom Westermann,Malte Ramonat,Johannes Hujer,Felix Gehlhoff,Alexander Fay
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:AutomationML has seen widespread adoption as an open data exchange format in the automation domain. It is an open and vendor neutral standard based on the extensible markup language XML. However, AutomationML extends XML with additional semantics, that limit the applicability of common XML-tools for applications like querying or data validation. This article provides practitioners with 1) an up-to-date ontology of the concepts in the AutomationML-standard, as well as 2) a declarative mapping to automatically transform any AutomationML model into RDF triples. Together, these artifacts allow practitioners an easy integration of AutomationML information into industrial knowledge graphs. A study on examples from the automation domain concludes that transforming AutomationML to OWL opens up new powerful ways for querying and validation that are impossible without transformation.
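"将AutomationML模型转换为RDF三元组"之后的结果形态,可用rdflib做一个极简示意(命名空间与实例URI均为虚构示例,并非文中发布的本体):
```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

AML = Namespace("https://example.org/automationml#")  # 假设的本体命名空间

g = Graph()
ie = URIRef("https://example.org/plant#Robot1")
# 把AutomationML中的一个InternalElement表达为RDF三元组
g.add((ie, RDF.type, AML.InternalElement))
g.add((ie, AML.hasName, Literal("Robot1")))

# 转换为RDF后即可用SPARQL进行查询与验证
q = "SELECT ?x WHERE { ?x a <https://example.org/automationml#InternalElement> }"
for row in g.query(q):
    print(row.x)
```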
zh
[AI-10] Extension-ranking Semantics for Abstract Argumentation Preprint
【速读】:该论文试图解决在抽象论证框架中对论证集合进行排序的问题,具体是通过论证被接受的合理性(plausibility of acceptance)来实现这一目标。解决方案的关键在于提出了一种扩展排序语义(extension-ranking semantics),这是对Dung的扩展语义的一种推广,它在所有论证的幂集上诱导出一个偏序关系,从而能够判断一个论证集合比另一个更接近被接受。该方法通过引入一系列原则来评估扩展排序语义的有效性,并结合多种基础关系以形成一族扩展排序语义。
链接: https://arxiv.org/abs/2504.21683
作者: Kenneth Skiba,Tjitze Rienstra,Matthias Thimm,Jesse Heyninck,Gabriele Kern-Isberner
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we present a general framework for ranking sets of arguments in abstract argumentation based on their plausibility of acceptance. We present a generalisation of Dung’s extension semantics as extension-ranking semantics, which induce a preorder over the power set of all arguments, allowing us to state that one set is “closer” to being acceptable than another. To evaluate the extension-ranking semantics, we introduce a number of principles that a well-behaved extension-ranking semantics should satisfy. We consider several simple base relations, each of which models a single central aspect of argumentative reasoning. The combination of these base relations provides us with a family of extension-ranking semantics. We also adapt a number of approaches from the literature for ranking extensions to be usable in the context of extension-ranking semantics, and evaluate their behaviour.
zh
[AI-11] Designing Control Barrier Function via Probabilistic Enumeration for Safe Reinforcement Learning Navigation
【速读】:该论文试图解决在动态和不确定的现实环境中部署安全的自主导航系统的问题。其解决方案的关键在于提出一种分层控制框架,该框架利用神经网络验证技术设计控制屏障函数(Control Barrier Functions, CBF)和策略修正机制,以确保强化学习导航策略的安全性。通过概率枚举识别操作中的不安全区域,并构建适用于任意策略的安全CBF控制层,从而实现对不安全行为的修正,同时保持高效的导航性能。
链接: https://arxiv.org/abs/2504.21643
作者: Luca Marzari,Francesco Trotti,Enrico Marchesini,Alessandro Farinelli
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Achieving safe autonomous navigation systems is critical for deploying robots in dynamic and uncertain real-world environments. In this paper, we propose a hierarchical control framework leveraging neural network verification techniques to design control barrier functions (CBFs) and policy correction mechanisms that ensure safe reinforcement learning navigation policies. Our approach relies on probabilistic enumeration to identify unsafe regions of operation, which are then used to construct a safe CBF-based control layer applicable to arbitrary policies. We validate our framework both in simulation and on a real robot, using a standard mobile robot benchmark and a highly dynamic aquatic environmental monitoring task. These experiments demonstrate the ability of the proposed solution to correct unsafe actions while preserving efficient navigation behavior. Our results show the promise of developing hierarchical verification-based systems to enable safe and robust navigation behaviors in complex scenarios.
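作为背景,对控制仿射系统 ẋ = f(x) + g(x)u,以 h 定义安全集时的控制屏障函数条件通常写为如下形式,其中 α 为class-K函数,满足该条件即可保证安全集的前向不变性(教科书式定义;文中基于神经网络验证与概率枚举构造CBF的细节请见原文):
```latex
\sup_{u \in U}\left[ L_f h(x) + L_g h(x)\,u \right] \ge -\alpha\big(h(x)\big),
\qquad \mathcal{C} = \{\, x : h(x) \ge 0 \,\}
```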
zh
[AI-12] Quantitative Auditing of AI Fairness with Differentially Private Synthetic Data
【速读】:该论文试图解决AI系统公平性审计中的安全与隐私问题,传统方法使用真实数据进行审计会暴露敏感信息,增加安全风险和隐私泄露的可能性。解决方案的关键在于提出一种基于差分隐私合成数据的框架,通过应用隐私保护机制生成能够反映原始数据统计特征的同时确保隐私的合成数据,从而在实现严格公平性审计目标与保障强隐私保护之间取得平衡。
链接: https://arxiv.org/abs/2504.21634
作者: Chih-Cheng Rex Yuan,Bow-Yaw Wang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Fairness auditing of AI systems can identify and quantify biases. However, traditional auditing using real-world data raises security and privacy concerns. It exposes auditors to security risks as they become custodians of sensitive information and targets for cyberattacks. Privacy risks arise even without direct breaches, as data analyses can inadvertently expose confidential information. To address these, we propose a framework that leverages differentially private synthetic data to audit the fairness of AI systems. By applying privacy-preserving mechanisms, it generates synthetic data that mirrors the statistical properties of the original dataset while ensuring privacy. This method balances the goal of rigorous fairness auditing and the need for strong privacy protections. Through experiments on real datasets like Adult, COMPAS, and Diabetes, we compare fairness metrics of synthetic and real data. By analyzing the alignment and discrepancies between these metrics, we assess the capacity of synthetic data to preserve the fairness properties of real data. Our results demonstrate the framework’s ability to enable meaningful fairness evaluations while safeguarding sensitive information, proving its applicability across critical and sensitive domains.
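框架中"在合成数据与真实数据上分别计算公平性指标并比较"的做法,可用最常见的统计均等差(demographic parity gap)示意如下(数据与指标选择均为演示用假设):
```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """两个群体正类预测率之差,是常用的公平性度量之一。"""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

# 在真实数据与差分隐私合成数据上分别计算,再检验两者是否一致
y_real, g_real = np.array([1, 0, 1, 1, 0, 0]), np.array([0, 0, 0, 1, 1, 1])
y_syn,  g_syn  = np.array([1, 1, 0, 1, 0, 0]), np.array([0, 0, 0, 1, 1, 1])
print(demographic_parity_gap(y_real, g_real))
print(demographic_parity_gap(y_syn, g_syn))
```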
zh
[AI-13] Leverag ing Pre-trained Large Language Models with Refined Prompting for Online Task and Motion Planning
【速读】:该论文旨在解决智能机器人在执行复杂任务时面临的稳定性与鲁棒性问题,特别是在处理环境异常和任务规划中的约束条件时。其解决方案的关键在于提出一种闭环任务规划与执行系统(LLM-PAS),该系统通过将部分约束检查过程从规划阶段转移到执行阶段,增强了对环境异常的检测能力,并利用预训练大型语言模型(LLM)的推理能力来应对传统执行器无法处理的异常情况。此外,论文还引入了First Look Prompting (FLP)方法,以提升LLM在重规划过程中生成有效PDDL目标的能力。
链接: https://arxiv.org/abs/2504.21596
作者: Huihui Guo,Huilong Pi,Yunchuan Qin,Zhuo Tang,Kenli Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:With the rapid advancement of artificial intelligence, there is an increasing demand for intelligent robots capable of assisting humans in daily tasks and performing complex operations. Such robots not only require task planning capabilities but must also execute tasks with stability and robustness. In this paper, we present a closed-loop task planning and acting system, LLM-PAS, which is assisted by a pre-trained Large Language Model (LLM). While LLM-PAS plans long-horizon tasks in a manner similar to traditional task and motion planners, it also emphasizes the execution phase of the task. By transferring part of the constraint-checking process from the planning phase to the execution phase, LLM-PAS enables exploration of the constraint space and delivers more accurate feedback on environmental anomalies during execution. The reasoning capabilities of the LLM allow it to handle anomalies that cannot be addressed by the robust executor. To further enhance the system’s ability to assist the planner during replanning, we propose the First Look Prompting (FLP) method, which induces LLM to generate effective PDDL goals. Through comparative prompting experiments and systematic experiments, we demonstrate the effectiveness and robustness of LLM-PAS in handling anomalous conditions during task execution.
zh
[AI-14] One Net to Rule Them All: Domain Randomization in Quadcopter Racing Across Different Platforms
【速读】:该论文试图解决在高速四旋翼无人机竞速中,如何设计一个能够在不同物理特性各异的四旋翼平台间通用的控制器的问题。解决方案的关键在于提出一种基于神经网络的控制器,该控制器通过领域随机化(domain randomization)进行训练,仅依赖当前状态直接计算电机指令,从而实现对多种类型四旋翼无人机的鲁棒控制。
链接: https://arxiv.org/abs/2504.21586
作者: Robin Ferede,Till Blaha,Erin Lucassen,Christophe De Wagter,Guido C.H.E. de Croon
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:In high-speed quadcopter racing, finding a single controller that works well across different platforms remains challenging. This work presents the first neural network controller for drone racing that generalizes across physically distinct quadcopters. We demonstrate that a single network, trained with domain randomization, can robustly control various types of quadcopters. The network relies solely on the current state to directly compute motor commands. The effectiveness of this generalized controller is validated through real-world tests on two substantially different crafts (3-inch and 5-inch race quadcopters). We further compare the performance of this generalized controller with controllers specifically trained for the 3-inch and 5-inch drone, using their identified model parameters with varying levels of domain randomization (0%, 10%, 20%, 30%). While the generalized controller shows slightly slower speeds compared to the fine-tuned models, it excels in adaptability across different platforms. Our results show that no randomization fails sim-to-real transfer while increasing randomization improves robustness but reduces speed. Despite this trade-off, our findings highlight the potential of domain randomization for generalizing controllers, paving the way for universal AI controllers that can adapt to any platform.
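领域随机化在训练循环中的形态大致如下:每个回合随机采样一组物理参数,迫使策略对机型差异保持鲁棒(参数名与取值范围为笔者假设的示意,并非论文所用数值):
```python
import random

def sample_quad_params():
    """每个训练回合随机采样一组四旋翼物理参数(取值范围为示意)。"""
    return {
        "mass_kg": random.uniform(0.25, 0.9),      # 大致覆盖3寸到5寸机型
        "arm_length_m": random.uniform(0.06, 0.13),
        "thrust_coeff": random.uniform(0.8, 1.2),  # 相对标称值的比例扰动
        "motor_tau_s": random.uniform(0.02, 0.08), # 电机响应时间常数
    }

for episode in range(3):
    params = sample_quad_params()  # 每回合使用不同的动力学
    print(episode, params)
```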
zh
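【代码示意】:以下是领域随机化采样的极简Python草图,仅演示“每个训练回合围绕名义值随机扰动四旋翼物理参数”这一核心思想;其中的参数名与数值均为示意性假设,并非论文所用的辨识模型。

```python
import numpy as np

rng = np.random.default_rng(0)

# 名义物理参数仅为示意(假设值),并非论文的真实模型
NOMINAL = {"mass_kg": 0.60, "arm_length_m": 0.12, "motor_time_const_s": 0.03}

def sample_dynamics(level=0.3):
    """每回合围绕名义值均匀扰动动力学参数;level=0.0 即"无随机化"基线,
    0.1/0.2/0.3 对应摘要中的随机化强度扫描。"""
    return {k: v * rng.uniform(1.0 - level, 1.0 + level)
            for k, v in NOMINAL.items()}

for episode in range(3):
    dyn = sample_dynamics(level=0.3)
    # env = QuadcopterEnv(**dyn)  # 假设的环境构造:用随机化参数重建仿真
    # 随后在 env 中 rollout "状态 -> 电机指令" 的策略网络并更新
    print(episode, dyn)
```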
[AI-15] Multi-Goal Dexterous Hand Manipulation using Probabilistic Model-based Reinforcement Learning
【速读】:该论文旨在解决基于模型的强化学习在多目标灵巧手操作任务中的挑战。其解决方案的关键在于提出一种基于目标条件的概率模型预测控制(Goal-Conditioned Probabilistic Model Predictive Control, GC-PMPC),通过设计概率神经网络集成来描述高维灵巧手动力学,并引入异步MPC策略以满足真实世界灵巧手系统中的控制频率要求。
链接: https://arxiv.org/abs/2504.21585
作者: Yingzhuo Jiang,Wenjun Huang,Rongdun Lin,Chenyang Miao,Tianfu Sun,Yunduan Cui
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:This paper tackles the challenge of learning multi-goal dexterous hand manipulation tasks using model-based Reinforcement Learning. We propose Goal-Conditioned Probabilistic Model Predictive Control (GC-PMPC) by designing probabilistic neural network ensembles to describe the high-dimensional dexterous hand dynamics and introducing an asynchronous MPC policy to meet the control frequency requirements in real-world dexterous hand systems. Extensive evaluations on four simulated Shadow Hand manipulation scenarios with randomly generated goals demonstrate GC-PMPC’s superior performance over state-of-the-art baselines. It successfully drives a cable-driven dexterous hand, DexHand 021, with 12 active DOFs and 5 tactile sensors, to learn to manipulate a cubic die to three goal poses within approximately 80 minutes of interactions, demonstrating exceptional learning efficiency and control performance on a cost-effective dexterous hand platform.
zh
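【代码示意】:下面用随机打靶(random-shooting)MPC 给出目标条件集成模型预测控制的简化草图:以若干线性模型充当概率神经网络集成的替身,对候选动作序列在全体成员上滚动,以“到目标的平均距离”打分。这只是依据摘要思想的示意,并非 GC-PMPC 的原实现。

```python
import numpy as np

rng = np.random.default_rng(1)

def make_ensemble(n_members, s_dim, a_dim):
    """用若干随机扰动的线性模型 s' = A s + B a 充当概率集成的替身(仅为示意)。"""
    return [(np.eye(s_dim) + 0.01 * rng.standard_normal((s_dim, s_dim)),
             0.1 * rng.standard_normal((s_dim, a_dim)))
            for _ in range(n_members)]

def gc_mpc_action(models, s, goal, horizon=5, n_cand=128, a_dim=2):
    """目标条件随机打靶MPC:每条候选动作序列在所有集成成员上滚动,
    以"到目标的平均距离"为代价,仅执行最优序列的第一个动作。"""
    best_a, best_cost = None, np.inf
    for _ in range(n_cand):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, a_dim))
        states = [s.copy() for _ in models]
        cost = 0.0
        for a in seq:
            states = [A @ st + B @ a for (A, B), st in zip(models, states)]
            cost += np.mean([np.linalg.norm(st - goal) for st in states])
        if cost < best_cost:
            best_cost, best_a = cost, seq[0]
    return best_a

models = make_ensemble(n_members=5, s_dim=4, a_dim=2)
print(gc_mpc_action(models, s=np.zeros(4), goal=np.ones(4)))
```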
[AI-16] MF-LLM : Simulating Collective Decision Dynamics via a Mean-Field Large Language Model Framework
【速读】:该论文试图解决社会模拟中个体行为聚合与真实世界数据之间存在的偏差问题,特别是如何更准确地模拟集体决策的动态交互过程。其解决方案的关键在于提出一种名为均场大语言模型(Mean-Field LLM, MF-LLM)的框架,该框架通过交替运行策略模型与均场模型,显式建模微观决策与宏观群体分布之间的反馈循环,从而生成更贴近现实的群体行为轨迹。
链接: https://arxiv.org/abs/2504.21582
作者: Qirui Mi,Mengyue Yang,Xiangning Yu,Zhiyu Zhao,Cheng Deng,Bo An,Haifeng Zhang,Xu Chen,Jun Wang
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 27 pages, 8 figures, 4 tables
Abstract:Simulating collective decision-making involves more than aggregating individual behaviors; it arises from dynamic interactions among individuals. While large language models (LLMs) show promise for social simulation, existing approaches often exhibit deviations from real-world data. To address this gap, we propose the Mean-Field LLM (MF-LLM) framework, which explicitly models the feedback loop between micro-level decisions and macro-level population. MF-LLM alternates between two models: a policy model that generates individual actions based on personal states and group-level information, and a mean field model that updates the population distribution from the latest individual decisions. Together, they produce rollouts that simulate the evolving trajectories of collective decision-making. To better match real-world data, we introduce IB-Tune, a fine-tuning method for LLMs grounded in the information bottleneck principle, which maximizes the relevance of population distributions to future actions while minimizing redundancy with historical data. We evaluate MF-LLM on a real-world social dataset, where it reduces KL divergence to human population distributions by 47 percent over non-mean-field baselines, and enables accurate trend forecasting and intervention planning. It generalizes across seven domains and four LLM backbones, providing a scalable foundation for high-fidelity social simulation.
zh
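【代码示意】:MF-LLM 的核心是策略模型与均场模型交替迭代。下面的玩具草图用标量“采纳率”充当群体分布、用线性混合充当 LLM 策略,仅演示“微观决策 → 宏观分布 → 反馈微观”的闭环,并非论文模型本身。

```python
import numpy as np

rng = np.random.default_rng(2)

def policy_model(personal_state, population_dist):
    """策略模型的替身:个体依据自身状态与群体信号的混合概率做二元决策。"""
    p = 0.3 * personal_state + 0.7 * population_dist
    return rng.random() < p

def mean_field_model(actions):
    """均场模型的替身:把最新的个体决策汇总为新的群体分布。"""
    return float(np.mean(actions))

states = rng.random(100)   # 100个异质个体
dist = 0.5                 # 初始群体分布
for step in range(10):     # rollout:两模型交替,模拟集体决策轨迹
    actions = [policy_model(s, dist) for s in states]
    dist = mean_field_model(actions)
    print(f"step {step}: 采纳率 = {dist:.2f}")
```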
[AI-17] A Study on Group Decision Making Problem Based on Fuzzy Reasoning and Bayesian Networks
【速读】:该论文旨在解决多目标属性下的群体决策问题,其核心挑战在于处理尺度差异、专家语言变量以及多维指标间的非线性关联。解决方案的关键在于集成模糊推理与贝叶斯网络:通过构建融合阈值、隶属函数、专家经验和领域知识的模糊规则库,以应对量化难题;同时设计分层贝叶斯网络,利用最大似然估计动态优化条件概率表,从而建模多维指标间的非线性关系并实现后验概率聚合。
链接: https://arxiv.org/abs/2504.21568
作者: Shui-jin Rong,Wei Guo,Da-qing Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Aiming at the group decision-making problem with multi-objective attributes, this study proposes a group decision-making system that integrates fuzzy inference and Bayesian network. A fuzzy rule base is constructed by combining threshold values, membership functions, expert experience, and domain knowledge to address quantitative challenges such as scale differences and expert linguistic variables. A hierarchical Bayesian network is designed, featuring a directed acyclic graph with nodes selected by experts, and maximum likelihood estimation is used to dynamically optimize the conditional probability table, modeling the nonlinear correlations among multidimensional indices for posterior probability aggregation. In a comprehensive student evaluation case, this method is compared with the traditional weighted scoring approach. The results indicate that the proposed method demonstrates effectiveness in both rule criterion construction and ranking consistency, with a classification accuracy of 86.0% and an F1 value improvement of 53.4% over the traditional method. Additionally, computational experiments on real-world datasets across various group decision scenarios assess the method’s performance and robustness, providing evidence of its reliability in diverse contexts.
zh
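【代码示意】:下面给出“模糊化 + 条件概率聚合”的最小草图:三角隶属函数把原始分数映射为语言变量的隶属度,再用一张玩具条件概率表做后验聚合。阈值、隶属函数与 CPT 数值均为假设,并非论文的规则库或学习结果。

```python
def tri(x, a, b, c):
    """三角隶属函数:支撑区间 [a, c],峰值在 b。"""
    return max(min((x - a) / (b - a + 1e-9), (c - x) / (c - b + 1e-9)), 0.0)

def fuzzify(score):
    """把 0-100 的原始分数映射为三个语言等级的隶属度(阈值为示意性假设)。"""
    return {"low": tri(score, 0, 0, 50),
            "medium": tri(score, 30, 55, 80),
            "high": tri(score, 60, 100, 100)}

# 玩具条件概率表 P(优秀 | 等级),充当贝叶斯网络经最大似然估计学到的CPT
CPT = {"low": 0.05, "medium": 0.40, "high": 0.85}

def posterior_excellent(score):
    m = fuzzify(score)
    z = sum(m.values())
    return sum(m[g] / z * CPT[g] for g in m)  # 按隶属度加权的后验聚合

print(posterior_excellent(72.0))
```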
[AI-18] Towards proactive self-adaptive AI for non-stationary environments with dataset shifts
【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)模型在非平稳环境中的性能维持问题,特别是在医疗场景中由于时间数据分布变化导致的模型退化问题。解决方案的关键在于提出一种主动自适应AI方法(pro-adaptive),通过建模AI参数的时间轨迹,实现对参数值的短期预测,从而在缺乏实时标注数据的情况下提升模型对数据分布变化的鲁棒性。该方法基于可扩展的功能数据分析框架,采用多项式样条基进行建模,并在模拟数据集和真实世界新冠数据集上进行了验证。
链接: https://arxiv.org/abs/2504.21565
作者: David Fernández Narro,Pablo Ferri,Juan M. García-Gómez,Carlos Sáez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures, conference paper
Abstract:Artificial Intelligence (AI) models deployed in production frequently face challenges in maintaining their performance in non-stationary environments. This issue is particularly noticeable in medical settings, where temporal dataset shifts often occur. These shifts arise when the distributions of training data differ from those of the data encountered during deployment over time. Further, new labeled data to continuously retrain AI is not typically available in a timely manner due to data access limitations. To address these challenges, we propose a proactive self-adaptive AI approach, or pro-adaptive, where we model the temporal trajectory of AI parameters, allowing us to short-term forecast parameter values. To this end, we use polynomial spline bases, within an extensible Functional Data Analysis framework. We validate our methodology with a logistic regression model addressing prior probability shift, covariate shift, and concept shift. This validation is conducted on both a controlled simulated dataset and a publicly available real-world COVID-19 dataset from Mexico, with various shifts occurring between 2020 and 2024. Our results indicate that this approach enhances the performance of AI against shifts compared to baseline stable models trained at different time distances from the present, without requiring updated training data. This work lays the foundation for pro-adaptive AI research against dynamic, non-stationary environments, being compatible with data protection, in resilient AI production environments for health.
zh
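【代码示意】:“主动自适应”的可操作化思路是把模型参数视为时间函数并做短期外推。下例用 numpy 的低阶多项式拟合充当论文函数型数据分析框架中多项式样条基的简化替身,演示如何在没有新标注数据时前瞻地更新参数。

```python
import numpy as np

# 合成的参数轨迹:某个逻辑回归系数按月重估得到的时间序列(示意数据)
t = np.arange(24)                                # 月份 0..23
theta = 0.8 + 0.05 * np.sin(t / 4.0) + 0.01 * t  # 带漂移的系数轨迹

# 用低阶多项式基拟合轨迹;论文在函数型数据分析框架内使用多项式样条基,
# 这里以单一全局多项式作无依赖的简化替身
coefs = np.polyfit(t, theta, deg=3)

# 对参数值做短期前瞻预测,无需等待新的标注数据即可更新部署模型
print("theta 预测值(第26个月):", np.polyval(coefs, 26))
```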
[AI-19] Meta knowledge assisted Evolutionary Neural Architecture Search
【速读】:该论文旨在解决基于进化计算(Evolutionary Computation, EC)的神经网络架构搜索(Neural Architecture Search, NAS)中高计算成本以及固定学习率(Learning Rate, LR)调度导致的信息丢失问题。其解决方案的关键在于引入一种创新的元学习框架,通过预训练获得适应性强的学习率调度策略(Meta-LR),以减少在评估每个个体时的信息损失;同时设计了一个自适应代理模型,利用自适应阈值在少量周期内筛选潜在架构,并在完整周期内进行评估,从而提升效率与效果。此外,提出了一种周期性变异算子以增强种群多样性,提高方法的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2504.21545
作者: Yangyang Li,Guanlong Liu,Ronghua Shang,Licheng Jiao
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Evolutionary computation (EC)-based neural architecture search (NAS) has achieved remarkable performance in the automatic design of neural architectures. However, the high computational cost associated with evaluating searched architectures poses a challenge for these methods, and a fixed form of learning rate (LR) schedule means greater information loss on diverse searched architectures. This paper introduces an efficient EC-based NAS method to solve these problems via an innovative meta-learning framework. Specifically, a meta-learning-rate (Meta-LR) scheme is used through pretraining to obtain a suitable LR schedule, which guides the training process with lower information loss when evaluating each individual. An adaptive surrogate model is designed through an adaptive threshold to select the potential architectures in a few epochs and then evaluate the potential architectures with complete epochs. Additionally, a periodic mutation operator is proposed to increase the diversity of the population, which enhances the generalizability and robustness. Experiments on CIFAR-10, CIFAR-100, and ImageNet1K datasets demonstrate that the proposed method achieves high performance comparable to that of many state-of-the-art peer methods, with lower computational cost and greater robustness.
zh
[AI-20] TRIED: Truly Innovative and Effective Detection Benchmark developed by WITNESS
【速读】:该论文试图解决生成式 AI (Generative AI) 和欺骗性合成媒体对全球信息生态系统的威胁,特别是在全球多数国家(Global Majority)中表现尤为突出的问题。当前的AI检测工具在实际应用中存在性能不足的情况,主要受限于可解释性、公平性、可及性和上下文相关性等方面的挑战。解决方案的关键是引入TRIED Benchmark,这是一个基于实际影响和创新能力的新评估框架,旨在推动检测工具在多样化的语言、文化和技术背景下实现真正的创新与相关性,并为开发者、政策制定者和标准机构提供设计负责任、透明且以用户为中心的检测方案的实践指导。
链接: https://arxiv.org/abs/2504.21489
作者: Shirin Anlen,Zuzanna Wojciak(WITNESS)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 33 pages
Abstract:The rise of generative AI and deceptive synthetic media threatens the global information ecosystem, especially across the Global Majority. This report from WITNESS highlights the limitations of current AI detection tools, which often underperform in real-world scenarios due to challenges related to explainability, fairness, accessibility, and contextual relevance. In response, WITNESS introduces the Truly Innovative and Effective AI Detection (TRIED) Benchmark, a new framework for evaluating detection tools based on their real-world impact and capacity for innovation. Drawing on frontline experiences, deceptive AI cases, and global consultations, the report outlines how detection tools must evolve to become truly innovative and relevant by meeting diverse linguistic, cultural, and technological contexts. It offers practical guidance for developers, policymakers, and standards bodies to design accountable, transparent, and user-centered detection solutions, and incorporate sociotechnical considerations into future AI standards, procedures and evaluation frameworks. By adopting the TRIED Benchmark, stakeholders can drive innovation, safeguard public trust, strengthen AI literacy, and contribute to a more resilient global information credibility.
zh
[AI-21] A Comprehensive Study of Exploitable Patterns in Smart Contracts: From Vulnerability to Defense
【速读】:该论文旨在解决以太坊智能合约中的安全问题,特别是针对用Solidity语言编写并在以太坊虚拟机(EVM)上执行的智能合约中存在的关键安全风险。论文重点关注两种常见且严重的漏洞类型——重入(reentrancy)和整数溢出(integer overflow),通过分析其底层机制、复现攻击场景以及评估有效的防御措施来提出解决方案。其关键在于深入理解这些漏洞的工作原理,并据此设计针对性的防护策略,以提升智能合约的安全性。
链接: https://arxiv.org/abs/2504.21480
作者: Yuchen Ding,Hongli Peng,Xiaoqi Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:With the rapid advancement of blockchain technology, smart contracts have enabled the implementation of increasingly complex functionalities. However, ensuring the security of smart contracts remains a persistent challenge across the stages of development, compilation, and execution. Vulnerabilities within smart contracts not only undermine the security of individual applications but also pose significant risks to the broader blockchain ecosystem, as demonstrated by the growing frequency of attacks since 2016, resulting in substantial financial losses. This paper provides a comprehensive analysis of key security risks in Ethereum smart contracts, specifically those written in Solidity and executed on the Ethereum Virtual Machine (EVM). We focus on two prevalent and critical vulnerability types (reentrancy and integer overflow) by examining their underlying mechanisms, replicating attack scenarios, and assessing effective countermeasures.
zh
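【代码示意】:重入漏洞的机理可以脱离 Solidity 用几行 Python 复现,关键错误在于“先外部调用、后更新状态”:恶意回调在余额清零前反复重入 withdraw,从而超额提取;修复即经典的 checks-effects-interactions 顺序或重入锁。

```python
class VulnerableVault:
    """用Python模拟Solidity重入模式:外部调用(此处为回调)先于状态更新执行。"""
    def __init__(self):
        self.balances = {"attacker": 10}

    def withdraw(self, who, receive_callback):
        amount = self.balances.get(who, 0)
        if amount > 0:
            receive_callback(amount)   # 先做"外部调用"(不安全的顺序)
            self.balances[who] = 0     # 状态更新来得太晚

vault = VulnerableVault()
stolen = []

def malicious_receive(amount):
    stolen.append(amount)
    if len(stolen) < 5:                # 余额尚未清零时再次重入
        vault.withdraw("attacker", malicious_receive)

vault.withdraw("attacker", malicious_receive)
print("提取总额:", sum(stolen))        # 余额只有10,却被提走50
# 修复:先 self.balances[who] = 0 再执行外部调用(checks-effects-interactions)
```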
[AI-22] xEEGNet: Towards Explainable AI in EEG Dementia Classification
【速读】:该论文旨在解决EEG数据分类中模型的可解释性不足与过拟合问题,特别是在阿尔茨海默病和额颞叶变性等常见痴呆病症的分类任务中。其关键解决方案是提出一种新型、紧凑且可解释的神经网络架构——xEEGNet,通过大幅减少参数数量(仅168个参数,比ShallowNet少200倍)实现模型的可解释性与抗过拟合能力,同时保持与复杂模型相当的分类性能,并通过嵌入式EEG表示分析数据分割的变异性,从而提升模型的临床适用性与可靠性。
链接: https://arxiv.org/abs/2504.21457
作者: Andrea Zanola,Louis Fabrice Tshimanga,Federico Del Pup,Marco Baiesi,Manfredo Atzori
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This work presents xEEGNet, a novel, compact, and explainable neural network for EEG data analysis. It is fully interpretable and reduces overfitting through major parameter reduction. As an applicative use case, we focused on classifying common dementia conditions, Alzheimer’s and frontotemporal dementia, versus controls. xEEGNet is broadly applicable to other neurological conditions involving spectral alterations. We initially used ShallowNet, a simple and popular model from the EEGNet-family. Its structure was analyzed and gradually modified to move from a “black box” to a more transparent model, without compromising performance. The learned kernels and weights were examined from a clinical standpoint to assess medical relevance. Model variants, including ShallowNet and the final xEEGNet, were evaluated using robust Nested-Leave-N-Subjects-Out cross-validation for unbiased performance estimates. Variability across data splits was explained using embedded EEG representations, grouped by class and set, with pairwise separability to quantify group distinction. Overfitting was assessed through training-validation loss correlation and training speed. xEEGNet uses only 168 parameters, 200 times fewer than ShallowNet, yet retains interpretability, resists overfitting, achieves comparable median performance (-1.5%), and reduces variability across splits. This variability is explained by embedded EEG representations: higher accuracy correlates with greater separation between test set controls and Alzheimer’s cases, without significant influence from training data. xEEGNet’s ability to filter specific EEG bands, learn band-specific topographies, and use relevant spectral features demonstrates its interpretability. While large deep learning models are often prioritized for performance, this study shows smaller architectures like xEEGNet can be equally effective in EEG pathology classification.
zh
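【代码示意】:xEEGNet 的一大卖点是仅 168 个参数。下面给出一个同风格的示意性微型网络(非作者架构):时间卷积充当可学习带通滤波器,分组空间卷积学习各频带地形图,功率池化后接线性分类头,并演示参数量统计。

```python
import torch
import torch.nn as nn

class TinyEEGNet(nn.Module):
    """沿 xEEGNet 思路的示意性微型分类器(并非作者架构):
    时间卷积充当可学习带通滤波器,分组空间卷积学习每个频带的地形图,
    对带限信号取功率并池化后接线性分类头。"""
    def __init__(self, n_channels=19, n_classes=3, n_filters=4):
        super().__init__()
        self.temporal = nn.Conv2d(1, n_filters, kernel_size=(1, 64), bias=False)
        self.spatial = nn.Conv2d(n_filters, n_filters, kernel_size=(n_channels, 1),
                                 groups=n_filters, bias=False)
        self.head = nn.Linear(n_filters, n_classes)

    def forward(self, x):                           # x: (batch, 1, 通道数, 时间)
        z = self.spatial(self.temporal(x)) ** 2     # 带限信号的瞬时功率
        z = z.mean(dim=(2, 3))                      # 每个滤波器的平均功率
        return self.head(z)

model = TinyEEGNet()
print(sum(p.numel() for p in model.parameters()))   # 仅数百个参数
print(model(torch.randn(2, 1, 19, 256)).shape)      # torch.Size([2, 3])
```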
[AI-23] SimPRIVE: a Simulation framework for Physical Robot Interaction with Virtual Environments ITSC2025
【速读】:该论文试图解决在工业和学术界中,由于神经网络和强化学习智能体的不可预测行为而缺乏通用解决方案的问题。其关键解决方案是提出SimPRIVE,一个用于物理机器人与虚拟环境交互的仿真框架,该框架作为车辆在环平台运行,在渲染虚拟世界的同时操作真实车辆,使物理移动机器人能够在其基于Unreal Engine 5构建的虚拟世界中运行数字孪生,从而在降低测试风险和成本的前提下,对完整的软硬件堆栈进行复杂算法的测试。
链接: https://arxiv.org/abs/2504.21454
作者: Federico Nesti,Gianluca D’Amico,Mauro Marinoni,Giorgio Buttazzo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE ITSC 2025
Abstract:The use of machine learning in cyber-physical systems has attracted the interest of both industry and academia. However, no general solution has yet been found against the unpredictable behavior of neural networks and reinforcement learning agents. Nevertheless, the improvements of photo-realistic simulators have paved the way towards extensive testing of complex algorithms in different virtual scenarios, which would be expensive and dangerous to implement in the real world. This paper presents SimPRIVE, a simulation framework for physical robot interaction with virtual environments, which operates as a vehicle-in-the-loop platform, rendering a virtual world while operating the vehicle in the real world. Using SimPRIVE, any physical mobile robot running on ROS 2 can easily be configured to move its digital twin in a virtual world built with the Unreal Engine 5 graphic engine, which can be populated with objects, people, or other vehicles with programmable behavior. SimPRIVE has been designed to accommodate custom or pre-built virtual worlds while being light-weight to contain execution times and allow fast rendering. Its main advantage lies in the possibility of testing complex algorithms on the full software and hardware stack while minimizing the risks and costs of a test campaign. The framework has been validated by testing a reinforcement learning agent trained for obstacle avoidance on an AgileX Scout Mini rover that navigates a virtual office environment where everyday objects and people are placed as obstacles. The physical rover moves with no collision in an indoor limited space, thanks to a LiDAR-based heuristic.
zh
[AI-24] NGENT: Next-Generation AI Agents Must Integrate Multi-Domain Abilities to Achieve Artificial General Intelligence
【速读】:该论文试图解决当前AI代理在特定领域内表现有效但缺乏跨领域能力的问题,从而阻碍了向人工通用智能(AGI)的进展。论文提出的解决方案的关键在于将不同领域的优势整合到一个统一框架中,使AI代理能够在文本、视觉、机器人技术、强化学习、情感智能等多个领域协同工作,实现类人智能所需的多功能性和适应性。
链接: https://arxiv.org/abs/2504.21433
作者: Zhicong Li,Hangyu Mao,Jiangjin Yin,Mingzhe Xing,Zhiwei Xu,Yuanxing Zhang,Yang Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper argues that the next generation of AI agent (NGENT) should integrate across-domain abilities to advance toward Artificial General Intelligence (AGI). Although current AI agents are effective in specialized tasks such as robotics, role-playing, and tool-using, they remain confined to narrow domains. We propose that future AI agents should synthesize the strengths of these specialized systems into a unified framework capable of operating across text, vision, robotics, reinforcement learning, emotional intelligence, and beyond. This integration is not only feasible but also essential for achieving the versatility and adaptability that characterize human intelligence. The convergence of technologies across AI domains, coupled with increasing user demand for cross-domain capabilities, suggests that such integration is within reach. Ultimately, the development of these versatile agents is a critical step toward realizing AGI. This paper explores the rationale for this shift, potential pathways for achieving it.
zh
[AI-25] UAV Marketplace Simulation Tool for BVLOS Operations AAMAS2025
【速读】:该论文试图解决在自主多无人机(Multi-UAV)任务中,尤其是在超出视觉视距(BVLOS)操作环境下,如何有效评估和优化团队协作与任务执行的问题。解决方案的关键在于开发一个仿真工具,该工具能够模拟动态和对抗性条件下的无人机协作,并允许研究人员在可控环境中集成和比较不同的团队形成策略,同时通过结构化日志和性能指标支持统计分析,从而提升无人机协调策略的实际应用效果。
链接: https://arxiv.org/abs/2504.21428
作者: Kıvanç Şerefoğlu,Önder Gürcan,Reyhan Aydoğan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 3 pages, 2 figures, the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025)
Abstract:We present a simulation tool for evaluating team formation in autonomous multi-UAV (Unmanned Aerial Vehicle) missions that operate Beyond Visual Line of Sight (BVLOS). The tool models UAV collaboration and mission execution in dynamic and adversarial conditions, where Byzantine UAVs attempt to disrupt operations. Our tool allows researchers to integrate and compare various team formation strategies in a controlled environment with configurable mission parameters and adversarial behaviors. The log of each simulation run is stored in a structured way along with performance metrics so that statistical analysis could be done straightforwardly. The tool is versatile for testing and improving UAV coordination strategies in real-world applications.
zh
[AI-26] MPEC: Manifold-Preserved EEG Classification via an Ensemble of Clustering-Based Classifiers
【速读】:该论文旨在解决脑电(EEG)信号分类中因忽略其非欧几里得流形结构而导致性能不佳的问题,传统分类方法未能有效保留EEG数据的流形信息,从而影响了脑机接口(BCI)和神经假体应用的准确性。解决方案的关键在于提出MPEC(基于聚类的流形保留EEG分类集成方法),其核心创新包括:(1)通过结合协方差矩阵与径向基函数(RBF)核进行特征工程,以捕捉EEG通道间的线性和非线性关系;(2)采用改进的K-means算法在黎曼流形空间中进行聚类,确保局部几何敏感性,并通过集成多个基于聚类的分类器实现性能提升。
链接: https://arxiv.org/abs/2504.21427
作者: Shermin Shahbazi,Mohammad-Reza Nasiri,Majid Ramezani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages ,3 figures
Abstract:Accurate classification of EEG signals is crucial for brain-computer interfaces (BCIs) and neuroprosthetic applications, yet many existing methods fail to account for the non-Euclidean, manifold structure of EEG data, resulting in suboptimal performance. Preserving this manifold information is essential to capture the true geometry of EEG signals, but traditional classification techniques largely overlook this need. To this end, we propose MPEC (Manifold-Preserved EEG Classification via an Ensemble of Clustering-Based Classifiers), which introduces two key innovations: (1) a feature engineering phase that combines covariance matrices and Radial Basis Function (RBF) kernels to capture both linear and non-linear relationships among EEG channels, and (2) a clustering phase that employs a modified K-means algorithm tailored for the Riemannian manifold space, ensuring local geometric sensitivity. Ensembling multiple clustering-based classifiers, MPEC achieves superior results, validated by significant improvements on the BCI Competition IV dataset 2a.
zh
[AI-27] Optimizing Mouse Dynamics for User Authentication by Machine Learning: Addressing Data Sufficiency Accuracy-Practicality Trade-off and Model Performance Challenges
【速读】:该论文旨在解决传统用户认证方法在可用性、成本和安全性方面的局限性,以及鼠标动态认证中数据量不足、准确性和实用性难以平衡、时间行为模式捕捉效率低等问题。其解决方案的关键在于提出一种基于高斯核密度估计(Gaussian kernel density estimate, KDE)和Kullback-Leibler(KL)散度的统计方法,用于确定训练认证模型的足够数据量,并引入Mouse Authentication Unit(MAU)以优化分段长度,结合局部时间鼠标认证(Local-Time Mouse Authentication, LT-AMouse)框架,融合1D-ResNet与GRU模型,实现高效且精确的行为表征与长期时间依赖建模。
链接: https://arxiv.org/abs/2504.21415
作者: Yi Wang,Chengyv Wu,Yang Liao,Maowei You
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 14pages, 10 figures
Abstract:User authentication is essential to ensure secure access to computer systems, yet traditional methods face limitations in usability, cost, and security. Mouse dynamics authentication, based on the analysis of users’ natural interaction behaviors with mouse devices, offers a cost-effective, non-intrusive, and adaptable solution. However, challenges remain in determining the optimal data volume, balancing accuracy and practicality, and effectively capturing temporal behavioral patterns. In this study, we propose a statistical method using Gaussian kernel density estimate (KDE) and Kullback-Leibler (KL) divergence to estimate the sufficient data volume for training authentication models. We introduce the Mouse Authentication Unit (MAU), leveraging Approximate Entropy (ApEn) to optimize segment length for efficient and accurate behavioral representation. Furthermore, we design the Local-Time Mouse Authentication (LT-AMouse) framework, integrating 1D-ResNet for local feature extraction and GRU for modeling long-term temporal dependencies. Taking the Balabit and DFL datasets as examples, we significantly reduced the data scale, particularly by a factor of 10 for the DFL dataset, greatly alleviating the training burden. Additionally, we determined the optimal input recognition unit length for the user authentication system on different datasets based on the slope of Approximate Entropy. Training with imbalanced samples, our model achieved a defense AUC of 98.52% against blind attacks on the DFL dataset and 94.65% on the Balabit dataset, surpassing the current SOTA performance.
zh
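【代码示意】:论文用 KDE 与 KL 散度判断训练数据量是否充分。下例是这一思路的可运行草图:样本量逐次翻倍后重新拟合核密度估计,若相邻两次拟合的 KL 散度趋于平台,即可认为数据量已足够;特征与分布均为合成示意。

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)

def kl_on_grid(p, q, grid):
    """在共享网格上离散化计算两个KDE之间的 KL(p || q)。"""
    ps, qs = p(grid) + 1e-12, q(grid) + 1e-12
    ps, qs = ps / ps.sum(), qs / qs.sum()
    return float(np.sum(ps * np.log(ps / qs)))

# 某一行为特征(如事件间隔时间)的合成数据流:逐步扩大样本量,
# 当继续加数据不再改变估计分布时即停止收集
stream = rng.gamma(shape=2.0, scale=0.05, size=5000)
grid = np.linspace(stream.min(), stream.max(), 512)

prev = gaussian_kde(stream[:200])
for n in (400, 800, 1600, 3200):
    curr = gaussian_kde(stream[:n])
    print(f"n={n}: 与上一次拟合的KL = {kl_on_grid(curr, prev, grid):.5f}")
    prev = curr
# KL 持续走低并趋于平台,说明该数据量已足以刻画此特征的分布
```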
[AI-28] Galvatron: An Automatic Distributed System for Efficient Foundation Model Training
【速读】:该论文试图解决大规模基础模型(Foundation Models)高效分布式训练的问题,特别是如何选择最优的并行策略以提升训练效率。解决方案的关键在于Galvatron系统能够自动识别最有效的混合并行策略,该策略结合了数据并行、张量并行、流水线并行、分片数据并行和序列并行,以及梯度重计算,从而克服了手动选择并行策略的复杂性。
链接: https://arxiv.org/abs/2504.21411
作者: Xinyi Liu,Yujie Wang,Shenhan Zhu,Fangcheng Fu,Qingshuo Liu,Guangming Lin,Bin Cui
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy, incorporating data, tensor, pipeline, sharded data, and sequence parallelism, along with recomputation. The system’s architecture includes a profiler for hardware and model analysis, a search engine for strategy optimization using decision trees and dynamic programming, and a runtime for executing these strategies efficiently. Benchmarking on various clusters demonstrates Galvatron’s superior throughput compared to existing frameworks. This open-source system offers user-friendly interfaces and comprehensive documentation, making complex distributed training accessible and efficient. The source code of Galvatron is available at this https URL.
zh
[AI-29] FAST-Q: Fast-track Exploration with Adversarially Balanced State Representations for Counterfactual Action Estimation in Offline Reinforcement Learning
【速读】:该论文旨在解决离线强化学习(offline reinforcement learning, RL)中由于函数近似误差导致的Q值高估问题,特别是在动态且高风险的应用场景下,如在线游戏中的推荐系统,其面临玩家心理意图驱动和平台内在波动带来的复杂性。这些问题导致状态空间高度稀疏且部分重叠,同时实验路径选择逻辑会偏向特定策略,进一步限制了学习效果。论文提出的解决方案——FAST-Q的关键在于:(1) 利用梯度反转学习构建平衡的状态表示,规范化玩家状态与动作之间的策略特异性偏差,从而实现反事实估计;(2) 支持在并行处理静态数据利用的同时进行离线反事实探索;(3) 提出一种Q值分解策略以实现多目标优化,促进短期与长期目标的可解释性推荐。这些创新使FAST-Q在多个指标上优于现有最先进方法。
链接: https://arxiv.org/abs/2504.21383
作者: Pulkit Agrawal,Rukma Talwadker,Aditya Pareek,Tridib Mukherjee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in state-of-the-art (SOTA) offline reinforcement learning (RL) have primarily focused on addressing function approximation errors, which contribute to the overestimation of Q-values for out-of-distribution actions, a challenge that static datasets exacerbate. However, high-stakes applications such as recommendation systems in online gaming introduce further complexities due to player psychology (intent) driven by gameplay experiences and the inherent volatility of the platform. These factors create highly sparse, partially overlapping state spaces across policies, further influenced by the experiment path selection logic which biases state spaces towards specific policies. Current SOTA methods constrain learning from such offline data by clipping known counterfactual actions as out-of-distribution due to poor generalization across unobserved states, further aggravating conservative Q-learning and necessitating more online exploration. FAST-Q introduces a novel approach that (1) leverages Gradient Reversal Learning to construct balanced state representations, regularizing the policy-specific bias between the player’s state and action, thereby enabling counterfactual estimation; (2) supports offline counterfactual exploration in parallel with static data exploitation; and (3) proposes a Q-value decomposition strategy for multi-objective optimization, facilitating explainable recommendations over short and long-term objectives. These innovations demonstrate the superiority of FAST-Q over prior SOTA approaches, delivering at least a 0.15 percent increase in player returns, a 2 percent improvement in lifetime value (LTV), a 0.4 percent enhancement in recommendation-driven engagement, a 2 percent improvement in the player’s platform dwell time, and an impressive 10 percent reduction in the costs associated with the recommendation, on our volatile gaming platform.
zh
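【代码示意】:FAST-Q 借助梯度反转学习构造平衡的状态表示,其核心组件是标准的梯度反转层:前向恒等、反向取负。下面是该层的 PyTorch 最小实现(通用写法,非论文代码)。

```python
import torch

class GradReverse(torch.autograd.Function):
    """梯度反转层:前向为恒等映射,反向把梯度取负并缩放,使下游判别器的
    损失反过来"抹掉"共享状态表示中的策略特异信息(通用写法,非论文代码)。"""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# 最小验证:经过该层回传的梯度被整体取负
x = torch.randn(4, 8, requires_grad=True)
grad_reverse(x, lam=0.5).sum().backward()
print(x.grad[0, 0])   # 每个元素的梯度都是 -0.5
```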
[AI-30] ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning
【速读】:该论文试图解决推理模型在复杂问题求解过程中因“过度思考”而导致的推理轨迹过长、效率低下问题。其解决方案的关键在于提出一种名为ShorterBetter的强化学习方法,该方法通过为每个问题采样多个输出并定义样本最优长度(Sample Optimal Length, SOL)来动态引导模型找到最佳的Chain-of-Thought (CoT)长度,从而在保持准确性的同时显著缩短输出长度。
链接: https://arxiv.org/abs/2504.21370
作者: Jingyang Yi,Jiazheng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: An appendix will be uploaded soon
Abstract:Reasoning models such as OpenAI o3 and DeepSeek-R1 have demonstrated strong performance on reasoning-intensive tasks through extended Chain-of-Thought (CoT) prompting. While longer reasoning traces can facilitate a more thorough exploration of solution paths for complex problems, researchers have observed that these models often “overthink”, leading to inefficient inference. In this paper, we introduce ShorterBetter, a simple yet effective reinforcement learning method that enables reasoning language models to discover their own optimal CoT lengths without human intervention. By sampling multiple outputs per problem and defining the Sample Optimal Length (SOL) as the shortest correct response among all the outputs, our method dynamically guides the model toward optimal inference lengths. Applied to the DeepSeek-Distill-Qwen-1.5B model, ShorterBetter achieves up to an 80% reduction in output length on both in-domain and out-of-domain reasoning tasks while maintaining accuracy. Our analysis shows that overly long reasoning traces often reflect loss of reasoning direction, and thus suggests that the extended CoT produced by reasoning models is highly compressible.
zh
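【代码示意】:按摘要定义,SOL 是同一问题多条采样输出中最短正确回答的长度。下面的草图演示如何据此给各条输出打分;其中奖励的具体形状(对长度偏差的线性惩罚)是示意性假设,并非论文公式。

```python
def sol_reward(samples):
    """ShorterBetter 风格的奖励草图:对同一问题采样多条输出,取其中
    最短的正确回答长度为 SOL,再按"正确且接近 SOL"打分。
    具体奖励形状(对长度偏差的线性惩罚)是示意性假设,并非论文公式。"""
    correct_lens = [s["len"] for s in samples if s["correct"]]
    if not correct_lens:          # 无正确样本则没有长度信号
        return [0.0] * len(samples)
    sol = min(correct_lens)
    return [(1.0 - abs(s["len"] - sol) / max(s["len"], sol)) if s["correct"] else 0.0
            for s in samples]

outputs = [{"len": 350, "correct": True},
           {"len": 900, "correct": True},
           {"len": 120, "correct": False}]
print(sol_reward(outputs))        # 350 token 的正确回答得分最高
```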
[AI-31] DGFNet: End-to-End Audio-Visual Source Separation Based on Dynamic Gating Fusion ICMR
【速读】:该论文旨在解决当前音频-视觉源分离(Audio-Visual Source Separation)方法中存在的模态融合不足问题,特别是在音频和视觉特征存在显著差异时,传统方法容易丢失关键信息或无法有效捕捉两者之间的复杂关系。其解决方案的关键在于提出一种基于门控机制的动态融合方法,该方法能够动态调整模态融合程度,从而弥补仅依赖解码器进行交互的局限性,并促进音频与视觉特征之间的高效协作。此外,引入的音频注意力模块进一步增强了音频特征的表达能力,提升了模型的整体性能。
链接: https://arxiv.org/abs/2504.21366
作者: Yinfeng Yu,Shiyu Sun
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Main paper (9 pages). Accepted for publication by ICMR(International Conference on Multimedia Retrieval) 2025
Abstract:Current Audio-Visual Source Separation methods primarily adopt two design strategies. The first strategy involves fusing audio and visual features at the bottleneck layer of the encoder, followed by processing the fused features through the decoder. However, when there is a significant disparity between the two modalities, this approach may lead to the loss of critical information. The second strategy avoids direct fusion and instead relies on the decoder to handle the interaction between audio and visual features. Nonetheless, if the encoder fails to integrate information across modalities adequately, the decoder may be unable to effectively capture the complex relationships between them. To address these issues, this paper proposes a dynamic fusion method based on a gating mechanism that dynamically adjusts the modality fusion degree. This approach mitigates the limitations of solely relying on the decoder and facilitates efficient collaboration between audio and visual features. Additionally, an audio attention module is introduced to enhance the expressive capacity of audio features, thereby further improving model performance. Experimental results demonstrate that our method achieves significant performance improvements on two benchmark datasets, validating its effectiveness and advantages in Audio-Visual Source Separation tasks.
zh
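【代码示意】:其门控思想可写成一个很小的融合模块:门由两种模态共同决定,逐元素控制注入音频特征的视觉信息量,门趋近 0 时退化为纯音频分支。以下 PyTorch 草图仅示意该思想,模块结构并非 DGFNet 原文设计。

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """动态门控融合的草图:门由两种模态共同决定,逐元素控制注入音频特征的
    视觉信息量;门趋近 0 时退化为纯音频分支(模块结构为示意,非 DGFNet 原设计)。"""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj_v = nn.Linear(dim, dim)

    def forward(self, audio_feat, visual_feat):
        g = self.gate(torch.cat([audio_feat, visual_feat], dim=-1))
        return audio_feat + g * self.proj_v(visual_feat)

fusion = GatedFusion(dim=256)
a, v = torch.randn(8, 256), torch.randn(8, 256)
print(fusion(a, v).shape)   # torch.Size([8, 256])
```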
[AI-32] A comparative study of deep learning and ensemble learning to extend the horizon of traffic forecasting
【速读】:该论文旨在解决长期交通流预测这一具有挑战性的问题,尤其是在大预测范围(长达30天)下的准确性和稳定性。其解决方案的关键在于通过改进模型对时间动态的建模能力,特别是利用时间嵌入(time embedding)来增强模型对周期性和事件因素的理解,从而提升长期预测性能。研究对比了多种机器学习和深度学习方法,发现尽管注意力机制/Transformer框架在捕捉序列数据中的长程依赖关系方面有效,但随着预测范围的扩大,模型的重点逐渐从时间依赖性的捕捉转向周期性建模,而时间嵌入在此过程中表现出显著优势。此外,XGBoost在仅使用时间特征的情况下仍能与深度学习方法竞争,显示出其在长期预测任务中的高效性和鲁棒性。
链接: https://arxiv.org/abs/2504.21358
作者: Xiao Zheng,Saeed Asadi Bagloee,Majid Sarvi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 32 pages, 16 figures
Abstract:Traffic forecasting is vital for Intelligent Transportation Systems, for which Machine Learning (ML) methods have been extensively explored to develop data-driven Artificial Intelligence (AI) solutions. Recent research focuses on modelling spatial-temporal correlations for short-term traffic prediction, leaving the favourable long-term forecasting a challenging and open issue. This paper presents a comparative study on large-scale real-world signalized arterials and freeway traffic flow datasets, aiming to evaluate promising ML methods in the context of large forecasting horizons up to 30 days. Focusing on modelling capacity for temporal dynamics, we develop one ensemble ML method, eXtreme Gradient Boosting (XGBoost), and a range of Deep Learning (DL) methods, including Recurrent Neural Network (RNN)-based methods and the state-of-the-art Transformer-based method. Time embedding is leveraged to enhance their understanding of seasonality and event factors. Experimental results highlight that while the attention mechanism/Transformer framework is effective for capturing long-range dependencies in sequential data, as the forecasting horizon extends, the key to effective traffic forecasting gradually shifts from temporal dependency capturing to periodicity modelling. Time embedding is particularly effective in this context, helping naive RNN outperform Informer by 31.1% for 30-day-ahead forecasting. Meanwhile, as an efficient and robust model, XGBoost, while learning solely from time features, performs competitively with DL methods. Moreover, we investigate the impacts of various factors like input sequence length, holiday traffic, data granularity, and training data size. The findings offer valuable insights and serve as a reference for future long-term traffic forecasting research and the improvement of AI’s corresponding learning capabilities.
zh
[AI-33] IRL Dittos: Embodied Multimodal AI Agent Interactions in Open Spaces
【速读】:该论文试图解决分布式团队中远程同事在共享办公空间中的社交互动不足问题,旨在通过生成式 AI (Generative AI) 驱动的具身代理 IRL Ditto 来增强同事间的社交联系。解决方案的关键在于利用 IRL Ditto 模拟远程同事的存在感,并通过其与现场同事的实时互动,促进有意义的交流,从而加强不同社会熟悉度层次下的关系。研究发现,社交关系的增强高度依赖于参与者与 IRL Ditto 源对象之间的原有关系基础。
链接: https://arxiv.org/abs/2504.21347
作者: Seonghee Lee,Denae Ford,John Tang,Sasa Junuzovic,Asta Roseway,Ed Cutrell,Kori Inkpen
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 8 pages, 3 figures
Abstract:We introduce the In Real Life (IRL) Ditto, an AI-driven embodied agent designed to represent remote colleagues in shared office spaces, creating opportunities for real-time exchanges even in their absence. IRL Ditto offers a unique hybrid experience by allowing in-person colleagues to encounter a digital version of their remote teammates, initiating greetings, updates, or small talk as they might in person. Our research question examines: How can the IRL Ditto influence interactions and relationships among colleagues in a shared office space? Through a four-day study, we assessed IRL Ditto’s ability to strengthen social ties by simulating presence and enabling meaningful interactions across different levels of social familiarity. We find that enhancing social relationships depended deeply on the foundation of the relationship participants had with the source of the IRL Ditto. This study provides insights into the role of embodied agents in enriching workplace dynamics for distributed teams.
zh
[AI-34] Q-function Decomposition with Intervention Semantics with Factored Action Spaces AISTATS2025
【速读】:该论文旨在解决在具有离散因子动作空间的强化学习环境中,由于动作组合数量庞大而导致的显著挑战。现有方法通过利用动作空间的结构并采用Q函数的线性分解来避免枚举所有可能的动作组合。本文的关键解决方案是考虑定义在原始动作空间低维投影子空间上的Q函数,并利用因果统计中的无未观测混杂因素设定下的因果效应估计,研究分解Q函数的无偏性条件,从而提出一种称为动作分解强化学习(Action Decomposed Reinforcement Learning)的通用框架,该框架使用投影Q函数近似标准无模型强化学习算法中的Q函数,从而在模型基于强化学习设置中提升了样本复杂度,并在在线连续控制环境和真实世界的败血症治疗离线环境中表现出更高的样本效率。
链接: https://arxiv.org/abs/2504.21326
作者: Junkyu Lee,Tian Gao,Elliot Nelson,Miao Liu,Debarun Bhattacharjya,Songtao Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AISTATS 2025
Abstract:Many practical reinforcement learning environments have a discrete factored action space that induces a large combinatorial set of actions, thereby posing significant challenges. Existing approaches leverage the regular structure of the action space and resort to a linear decomposition of Q-functions, which avoids enumerating all combinations of factored actions. In this paper, we consider Q-functions defined over a lower dimensional projected subspace of the original action space, and study the condition for the unbiasedness of decomposed Q-functions using causal effect estimation from the no unobserved confounder setting in causal statistics. This leads to a general scheme which we call action decomposed reinforcement learning that uses the projected Q-functions to approximate the Q-function in standard model-free reinforcement learning algorithms. The proposed approach is shown to improve sample complexity in a model-based reinforcement learning setting. We demonstrate improvements in sample efficiency compared to state-of-the-art baselines in online continuous control environments and a real-world offline sepsis treatment environment.
zh
[AI-35] How to Backdoor the Knowledge Distillation
【速读】:该论文试图解决知识蒸馏(knowledge distillation)过程中可能存在的后门攻击漏洞问题,即在教师模型(teacher model)为干净模型的前提下,如何通过污染蒸馏数据集来隐蔽地破坏学生模型(student model)。解决方案的关键在于提出一种新颖的攻击方法,该方法通过在蒸馏数据集中嵌入带有后门触发器的对抗样本,从而在不破坏教师模型完整性的情况下实现对学生模型的隐蔽攻击。
链接: https://arxiv.org/abs/2504.21323
作者: Chen Wu,Qian Ma,Prasenjit Mitra,Sencun Zhu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Knowledge distillation has become a cornerstone in modern machine learning systems, celebrated for its ability to transfer knowledge from a large, complex teacher model to a more efficient student model. Traditionally, this process is regarded as secure, assuming the teacher model is clean. This belief stems from conventional backdoor attacks relying on poisoned training data with backdoor triggers and attacker-chosen labels, which are not involved in the distillation process. Instead, knowledge distillation uses the outputs of a clean teacher model to guide the student model, inherently preventing recognition or response to backdoor triggers as intended by an attacker. In this paper, we challenge this assumption by introducing a novel attack methodology that strategically poisons the distillation dataset with adversarial examples embedded with backdoor triggers. This technique allows for the stealthy compromise of the student model while maintaining the integrity of the teacher model. Our innovative approach represents the first successful exploitation of vulnerabilities within the knowledge distillation process using clean teacher models. Through extensive experiments conducted across various datasets and attack settings, we demonstrate the robustness, stealthiness, and effectiveness of our method. Our findings reveal previously unrecognized vulnerabilities and pave the way for future research aimed at securing knowledge distillation processes against backdoor attacks.
zh
[AI-36] Participatory AI, Public Sector AI, Differential Privacy, Conversational Interfaces, Explainable AI, Citizen Engagement in AI
【速读】:该论文试图解决在公共部门应用中如何平衡数学上的隐私保障与民主问责的问题。其关键解决方案是提出三个核心贡献:一是基于TOPSIS多准则决策分析的自适应ε选择协议,以对齐公民偏好与差分隐私(Differential Privacy, DP)参数;二是具有实时平均绝对误差(Mean Absolute Error, MAE)可视化和GPT-4驱动影响分析的可解释噪声注入框架;三是根据动态变化的法规约束调节隐私预算的集成法律合规机制。这些方案通过对话式接口增强了公众在算法隐私机制中的参与度,确保公共部门治理中的隐私保护AI既数学稳健又民主可问责。
链接: https://arxiv.org/abs/2504.21297
作者: Wenjun Yang,Eyhab Al-Masri
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
备注:
Abstract:This paper introduces a conversational interface system that enables participatory design of differentially private AI systems in public sector applications. Addressing the challenge of balancing mathematical privacy guarantees with democratic accountability, we propose three key contributions: (1) an adaptive ε-selection protocol leveraging TOPSIS multi-criteria decision analysis to align citizen preferences with differential privacy (DP) parameters, (2) an explainable noise-injection framework featuring real-time Mean Absolute Error (MAE) visualizations and GPT-4-powered impact analysis, and (3) an integrated legal-compliance mechanism that dynamically modulates privacy budgets based on evolving regulatory constraints. Our results advance participatory AI practices by demonstrating how conversational interfaces can enhance public engagement in algorithmic privacy mechanisms, ensuring that privacy-preserving AI in public sector governance remains both mathematically robust and democratically accountable.
zh
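【代码示意】:摘要中“可解释噪声注入 + 实时 MAE 可视化”可用标准拉普拉斯机制直观说明:对计数查询,噪声尺度 b = 敏感度/ε,期望绝对误差恰为 b,故 ε 越小隐私越强、误差越大。下例演示这一权衡(属通用 DP 知识,与论文系统实现无关)。

```python
import numpy as np

rng = np.random.default_rng(4)

def laplace_count(true_count, epsilon, sensitivity=1.0):
    """标准拉普拉斯机制(计数查询):噪声尺度 b = 敏感度/ε,满足 ε-DP。"""
    return true_count + rng.laplace(scale=sensitivity / epsilon)

# 界面实时可视化的那类 MAE 读数:ε 越小隐私越强,期望绝对误差 E|noise| = b 越大
true_count = 1234
for eps in (0.1, 0.5, 1.0, 2.0):
    errs = [abs(laplace_count(true_count, eps) - true_count) for _ in range(2000)]
    print(f"epsilon={eps}: 经验MAE = {np.mean(errs):.1f}(理论值 {1/eps:.1f})")
```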
[AI-37] Fairness in Graph Learning Augmented with Machine Learning: A Survey
【速读】:该论文试图解决图学习(Graph Learning)与机器学习(Machine Learning, ML)融合后产生的公平性问题,特别是在高风险应用场景中可能引发的歧视性结果。解决方案的关键在于系统分析图学习机制与机器学习技术之间的复杂相互作用,并探索四种常用的技术手段以提升GL-ML(Graph Learning augmented with Machine Learning)方法的公平性,从而为该领域的公平性研究奠定坚实基础。
链接: https://arxiv.org/abs/2504.21296
作者: Renqiang Luo,Ziqi Xu,Xikun Zhang,Qing Qing,Huafei Huang,Enyan Dai,Zhe Wang,Bo Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Augmenting specialised machine learning techniques into traditional graph learning models has achieved notable success across various domains, including federated graph learning, dynamic graph learning, and graph transformers. However, the intricate mechanisms of these specialised techniques introduce significant challenges in maintaining model fairness, potentially resulting in discriminatory outcomes in high-stakes applications such as recommendation systems, disaster response, criminal justice, and loan approval. This paper systematically examines the unique fairness challenges posed by Graph Learning augmented with Machine Learning (GL-ML). It highlights the complex interplay between graph learning mechanisms and machine learning techniques, emphasising how the augmentation of machine learning both enhances and complicates fairness. Additionally, we explore four critical techniques frequently employed to improve fairness in GL-ML methods. By thoroughly investigating the root causes and broader implications of fairness challenges in this rapidly evolving field, this work establishes a robust foundation for future research and innovation in GL-ML fairness.
zh
[AI-38] Orthogonal Factor-Based Biclustering Algorithm (BCBOF) for High-Dimensional Data and Its Application in Stock Trend Prediction
【速读】:该论文旨在解决高维数据中传统聚类方法在进行双聚类(biclustering)时面临的两个核心问题:高维空间中的距离集中现象导致数据稀疏性,使得相似性度量失效;以及主流线性降维方法破坏了关键的局部结构模式。其解决方案的关键在于提出一种基于正交因子的双聚类算法(BCBOF),通过在高维数据的向量空间中构建正交因子,并利用正交子空间中的原始数据坐标作为聚类目标,从而在聚类前进行降维,有效缓解高维带来的数据稀疏问题。
链接: https://arxiv.org/abs/2504.21289
作者: Yan Huang,Da-Qing Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Biclustering is an effective technique in data mining and pattern recognition. Biclustering algorithms based on traditional clustering face two fundamental limitations when processing high-dimensional data: (1) The distance concentration phenomenon in high-dimensional spaces leads to data sparsity, rendering similarity measures ineffective; (2) Mainstream linear dimensionality reduction methods disrupt critical local structural patterns. To apply biclustering to high-dimensional datasets, we propose an orthogonal factor-based biclustering algorithm (BCBOF). First, we constructed orthogonal factors in the vector space of the high-dimensional dataset. Then, we performed clustering using the coordinates of the original data in the orthogonal subspace as clustering targets. Finally, we obtained biclustering results of the original dataset. Since dimensionality reduction was applied before clustering, the proposed algorithm effectively mitigated the data sparsity problem caused by high dimensionality. Additionally, we applied this biclustering algorithm to stock technical indicator combinations and stock price trend prediction. Biclustering results were transformed into fuzzy rules, and we incorporated profit-preserving and stop-loss rules into the rule set, ultimately forming a fuzzy inference system for stock price trend predictions and trading signals. To evaluate the performance of BCBOF, we compared it with existing biclustering methods using multiple evaluation metrics. The results showed that our algorithm outperformed other biclustering techniques. To validate the effectiveness of the fuzzy inference system, we conducted virtual trading experiments using historical data from 10 A-share stocks. The experimental results showed that the generated trading strategies yielded higher returns for investors.
zh
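【代码示意】:BCBOF 的骨架是“先取正交子空间坐标、再聚类”。下例以 SVD 基充当正交因子的通用替身(论文的因子构造方式以原文为准),演示把高维数据投影到低维正交子空间后做 K-means,以规避高维距离集中问题。

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)

# 高维数据示意:200 个样本 x 500 维技术指标特征(合成数据)
X = rng.standard_normal((200, 500))
Xc = X - X.mean(axis=0)

# 构造数据向量空间的正交因子;此处以 SVD 正交基作为通用替身,
# 论文中正交因子的具体构造方式以原文为准
k = 10
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
factors = Vt[:k].T                       # 500 x k,列向量两两正交

# 以原始数据在正交子空间中的坐标为聚类对象,规避高维距离集中问题
coords = Xc @ factors                    # 200 x k
row_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(coords)
print(np.bincount(row_labels))
# 双聚类可由"样本行簇 x 该簇主导正交因子所对应的特征方向"组合得到
```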
[AI-39] Reinforced MLLM : A Survey on RL-Based Reasoning in Multimodal Large Language Models
【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理跨模态信息时的推理能力不足问题,特别是在优化推理轨迹和对齐多模态信息方面。解决方案的关键在于将强化学习(Reinforcement Learning, RL)引入MLLMs的推理机制中,通过价值无关和价值相关两种主要RL范式来提升模型的推理能力,同时探索奖励机制创新与算法设计以应对稀疏奖励、跨模态推理效率低及实际部署约束等挑战。
链接: https://arxiv.org/abs/2504.21277
作者: Guanghao Zhou,Panjia Qiu,Cen Chen,Jie Wang,Zheming Yang,Jian Xu,Minghui Qiu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of reinforcement learning (RL) into the reasoning capabilities of Multimodal Large Language Models (MLLMs) has rapidly emerged as a transformative research direction. While MLLMs significantly extend Large Language Models (LLMs) to handle diverse modalities such as vision, audio, and video, enabling robust reasoning across multimodal inputs remains a major challenge. This survey systematically reviews recent advances in RL-based reasoning for MLLMs, covering key algorithmic designs, reward mechanism innovations, and practical applications. We highlight two main RL paradigms–value-free and value-based methods–and analyze how RL enhances reasoning abilities by optimizing reasoning trajectories and aligning multimodal information. Furthermore, we provide an extensive overview of benchmark datasets, evaluation protocols, and existing limitations, and propose future research directions to address current bottlenecks such as sparse rewards, inefficient cross-modal reasoning, and real-world deployment constraints. Our goal is to offer a comprehensive and structured guide to researchers interested in advancing RL-based reasoning in the multimodal era.
zh
[AI-40] Assessing LLM code generation quality through path planning tasks
【速读】:该论文试图解决生成式 AI (Generative AI) 生成的代码在安全关键型应用(如路径规划)中的潜在风险评估问题,因为现有的编码基准无法反映此类应用的上下文和复杂性。其解决方案的关键在于评估六种大型语言模型(LLMs)生成三种不同路径规划算法代码的能力,并在三种不同难度的地图上进行测试,从而揭示 LLM 生成代码在安全关键场景中的严重隐患。
链接: https://arxiv.org/abs/2504.21276
作者: Wanyi Chen,Meng-Wen Su,Mary L. Cummings
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:As LLM-generated code grows in popularity, more evaluation is needed to assess the risks of using such tools, especially for safety-critical applications such as path planning. Existing coding benchmarks are insufficient as they do not reflect the context and complexity of safety-critical applications. To this end, we assessed six LLMs’ abilities to generate the code for three different path-planning algorithms and tested them on three maps of various difficulties. Our results suggest that LLM-generated code presents serious hazards for path planning applications and should not be applied in safety-critical contexts without rigorous testing.
zh
[AI-41] Multi-Domain Causal Discovery in Bijective Causal Models
【速读】:该论文试图解决多领域设置下的因果发现(causal discovery,也称为因果结构学习)问题。其核心假设是因果函数在不同领域中保持不变,而外生噪声的分布可能变化。论文的关键解决方案是引入双射生成机制(bijective generation mechanisms, BGM),该机制确保外生噪声 $ E $ 与内生变量 $ Y $ 之间的函数关系在每个层次的因果变量 $ X = x $ 上均为双射且可微,从而在因果充分性条件下,相较于以往工作,能够在更宽松的功能假设下实现因果图的识别。
链接: https://arxiv.org/abs/2504.21261
作者: Kasra Jalaldoust,Saber Salehkaleybar,Negar Kiyavash
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注: Proceedings of Causal Learning and Reasoning (CLeaR) 2025
Abstract:We consider the problem of causal discovery (a.k.a., causal structure learning) in a multi-domain setting. We assume that the causal functions are invariant across the domains, while the distribution of the exogenous noise may vary. Under causal sufficiency (i.e., no confounders exist), we show that the causal diagram can be discovered under less restrictive functional assumptions compared to previous work. What enables causal discovery in this setting is bijective generation mechanisms (BGM), which ensures that the functional relation between the exogenous noise E and the endogenous variable Y is bijective and differentiable in both directions at every level of the cause variable X = x . BGM generalizes a variety of models including additive noise model, LiNGAM, post-nonlinear model, and location-scale noise model. Further, we derive a statistical test to find the parents set of the target variable. Experiments on various synthetic and real-world datasets validate our theoretical findings.
zh
[AI-42] CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对间接提示注入攻击(indirect prompt injection attack)时的脆弱性问题,此类攻击通过在提示上下文中注入任务触发指令,使模型偏离用户提供的原始指令。解决方案的关键在于提出CachePrune方法,该方法通过识别并修剪输入提示上下文的键值(KV)缓存中与任务触发相关的神经元,从而促使模型将输入提示文本视为纯数据而非指令指示。该方法利用基于直接偏好优化(Direct Preference Optimization, DPO)目标上界诱导的损失函数进行特征归因,有效识别出任务触发神经元,并通过利用指令遵循中的触发效应进一步提升特征归因质量。
链接: https://arxiv.org/abs/2504.21228
作者: Rui Wang,Junda Wu,Yu Xia,Tong Yu,Ruiyi Zhang,Ryan Rossi,Lina Yao,Julian McAuley
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are identified as being susceptible to indirect prompt injection attack, where the model undesirably deviates from user-provided instructions by executing tasks injected in the prompt context. This vulnerability stems from LLMs’ inability to distinguish between data and instructions within a prompt. In this paper, we propose CachePrune that defends against this attack by identifying and pruning task-triggering neurons from the KV cache of the input prompt context. By pruning such neurons, we encourage the LLM to treat the text spans of input prompt context as only pure data, instead of any indicator of instruction following. These neurons are identified via feature attribution with a loss function induced from an upperbound of the Direct Preference Optimization (DPO) objective. We show that such a loss function enables effective feature attribution with only a few samples. We further improve on the quality of feature attribution, by exploiting an observed triggering effect in instruction following. Our approach does not impose any formatting on the original prompt or introduce extra test-time LLM calls. Experiments show that CachePrune significantly reduces attack success rates without compromising the response quality. Note: This paper aims to defend against indirect prompt injection attacks, with the goal of developing more secure and robust AI systems.
zh
[AI-43] heoretical Foundations for Semantic Cognition in Artificial Intelligence
【速读】:该论文试图解决如何构建具有自我调节认知能力的人工智能代理,使其能够以结构化、可解释的方式进行推理、记忆和信念调控。其核心挑战在于如何形式化地建模信念作为结构化的语义状态,并在此基础上实现动态的、具备反思能力的认知过程。解决方案的关键在于提出“认识论真空”(epistemic vacuum)这一概念,即一类语义惰性认知状态,作为信念空间的概念起点,并通过“空塔”(Null Tower)这一递归生成结构,利用内部表征能力构建出模块化的认知架构。该框架旨在为符号系统与神经网络系统提供通用的实现路径,支持包括大语言模型在内的多种智能代理的开发。
链接: https://arxiv.org/abs/2504.21218
作者: Sebastian Dumbrava
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This monograph presents a modular cognitive architecture for artificial intelligence grounded in the formal modeling of belief as structured semantic state. Belief states are defined as dynamic ensembles of linguistic expressions embedded within a navigable manifold, where operators enable assimilation, abstraction, nullification, memory, and introspection. Drawing from philosophy, cognitive science, and neuroscience, we develop a layered framework that enables self-regulating epistemic agents capable of reflective, goal-directed thought. At the core of this framework is the epistemic vacuum: a class of semantically inert cognitive states that serves as the conceptual origin of belief space. From this foundation, the Null Tower arises as a generative structure recursively built through internal representational capacities. The theoretical constructs are designed to be implementable in both symbolic and neural systems, including large language models, hybrid agents, and adaptive memory architectures. This work offers a foundational substrate for constructing agents that reason, remember, and regulate their beliefs in structured, interpretable ways.
zh
[AI-44] A Cost-Effective LLM -based Approach to Identify Wildlife Trafficking in Online Marketplaces
【速读】:该论文试图解决野生动物走私分析数据科学管道中的一个关键挑战,即生成高质量的标记数据以训练分类器来筛选相关数据。传统方法需要耗费大量时间和成本进行数据标注,限制了对多样化广告和研究问题的支持。解决方案的关键在于提出一种成本效益策略,利用大型语言模型(Large Language Models, LLMs)为小样本数据生成伪标签,并基于这些标签构建专业分类模型,从而在降低标注成本的同时实现高精度的广告识别。实验结果表明,该方法在保持较低成本的前提下,分类器的F1分数可达95%,优于直接使用LLMs进行大规模标注的效果。
链接: https://arxiv.org/abs/2504.21211
作者: Juliana Barbosa,Ulhas Gondhali,Gohar Petrossian,Kinshuk Sharma,Sunandan Chakraborty,Jennifer Jacquet,Juliana Freire
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Wildlife trafficking remains a critical global issue, significantly impacting biodiversity, ecological stability, and public health. Despite efforts to combat this illicit trade, the rise of e-commerce platforms has made it easier to sell wildlife products, putting new pressure on wild populations of endangered and threatened species. The use of these platforms also opens a new opportunity: as criminals sell wildlife products online, they leave digital traces of their activity that can provide insights into trafficking activities as well as how they can be disrupted. The challenge lies in finding these traces. Online marketplaces publish ads for a plethora of products, and identifying ads for wildlife-related products is like finding a needle in a haystack. Learning classifiers can automate ad identification, but creating them requires costly, time-consuming data labeling that hinders support for diverse ads and research questions. This paper addresses a critical challenge in the data science pipeline for wildlife trafficking analytics: generating quality labeled data for classifiers that select relevant data. While large language models (LLMs) can directly label advertisements, doing so at scale is prohibitively expensive. We propose a cost-effective strategy that leverages LLMs to generate pseudo labels for a small sample of the data and uses these labels to create specialized classification models. Our novel method automatically gathers diverse and representative samples to be labeled while minimizing the labeling costs. Our experimental evaluation shows that our classifiers achieve up to 95% F1 score, outperforming LLMs at a lower cost. We present real use cases that demonstrate the effectiveness of our approach in enabling analyses of different aspects of wildlife trafficking.
zh
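【代码示意】:其成本控制思路是“LLM 只标注小样本,再训练专用小分类器覆盖全量广告”。下面的端到端玩具流水线演示这一两段式结构;其中 llm_label 是昂贵 LLM 调用的占位实现,关键词规则与广告文本均为虚构示例。

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm_label(ad_text):
    """昂贵 LLM 调用的占位实现(实践中:用提示词让GPT-4级模型判定并解析 yes/no);
    这里用关键词规则模拟其输出,规则与文本均为虚构示例。"""
    return int("ivory" in ad_text.lower() or "rhino horn" in ad_text.lower())

ads = ["Antique ivory carving, discreet shipping",
       "Used mountain bike, good condition",
       "Genuine rhino horn powder for sale",
       "Vintage wooden chess set"]

# 第一步:仅对小而多样的样本调用 LLM 生成伪标签(总成本可控)
pseudo_labels = [llm_label(a) for a in ads]

# 第二步:用伪标签训练轻量专用分类器,再以近零边际成本筛选海量广告
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(ads, pseudo_labels)
print(clf.predict(["carved ivory figurine, ships worldwide"]))
```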
[AI-45] FedHERO: A Federated Learning Approach for Node Classification Task on Heterophilic Graphs
【速读】:该论文旨在解决联邦图学习(Federated Graph Learning, FGL)中由于客户端持有的图数据在节点邻居分布模式上存在异质性(heterophily)而导致的模型聚合性能下降问题。传统FGL方法假设所有客户端的数据具有同质性(homophily),以保证本地模型知识的一致性,从而实现有效的全局模型聚合。然而,当不同客户端的图数据具有不同的异质性水平时,本地模型可能学到冲突的知识,导致全局模型性能显著下降。为解决这一问题,论文提出了FedHERO框架,其关键在于引入了一个双通道图神经网络(GNN)和结构学习器,用于捕捉本地图中的结构知识,使每个客户端的本地模型能够识别并学习跨不同节点邻居分布模式的通用模式,从而提升模型性能并有效处理异质性图数据。
链接: https://arxiv.org/abs/2504.21206
作者: Zihan Chen,Xingbo Fu,Yushun Dong,Jundong Li,Cong Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Federated Graph Learning (FGL) empowers clients to collaboratively train Graph neural networks (GNNs) in a distributed manner while preserving data privacy. However, FGL methods usually require that the graph data owned by all clients is homophilic to ensure similar neighbor distribution patterns of nodes. Such an assumption ensures that the learned knowledge is consistent across the local models from all clients. Therefore, these local models can be properly aggregated as a global model without undermining the overall performance. Nevertheless, when the neighbor distribution patterns of nodes vary across different clients (e.g., when clients hold graphs with different levels of heterophily), their local models may gain different and even conflict knowledge from their node-level predictive tasks. Consequently, aggregating these local models usually leads to catastrophic performance deterioration on the global model. To address this challenge, we propose FedHERO, an FGL framework designed to harness and share insights from heterophilic graphs effectively. At the heart of FedHERO is a dual-channel GNN equipped with a structure learner, engineered to discern the structural knowledge encoded in the local graphs. With this specialized component, FedHERO enables the local model for each client to identify and learn patterns that are universally applicable across graphs with different patterns of node neighbor distributions. FedHERO not only enhances the performance of individual client models by leveraging both local and shared structural insights but also sets a new precedent in this field to effectively handle graph data with various node neighbor distribution patterns. We conduct extensive experiments to validate the superior performance of FedHERO against existing alternatives.
zh
[AI-46] SecRepoBench: Benchmarking LLM s for Secure Code Generation in Real-World Repositories
【速读】:该论文试图解决大语言模型(Large Language Models, LLMs)在真实世界代码仓库中生成安全代码的能力不足的问题。其解决方案的关键在于构建SecRepoBench,这是一个针对LLMs在实际代码仓库中生成安全代码能力的基准测试,包含27个C/C++仓库中的318个代码生成任务,覆盖15种通用缺陷枚举(Common Weakness Enumerations, CWEs),通过该基准测试揭示当前LLMs在生成正确且安全代码方面的局限性,并验证现有提示工程方法在该场景下的有效性下降。
链接: https://arxiv.org/abs/2504.21205
作者: Connor Dilgren,Purva Chiniya,Luke Griffith,Yu Ding,Yizheng Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper introduces SecRepoBench, a benchmark to evaluate LLMs on secure code generation in real-world repositories. SecRepoBench has 318 code generation tasks in 27 C/C++ repositories, covering 15 CWEs. We evaluate 19 state-of-the-art LLMs using our benchmark and find that the models struggle with generating correct and secure code. In addition, the performance of LLMs at generating self-contained programs as measured by prior benchmarks does not translate to comparable performance at generating secure and correct code at the repository level in SecRepoBench. We show that state-of-the-art prompt engineering techniques become less effective when applied to the repository-level secure code generation problem. We conduct extensive experiments, including an agentic technique to generate secure code, to demonstrate that our benchmark is currently the most difficult secure coding benchmark, compared to previous state-of-the-art benchmarks. Finally, our comprehensive analysis provides insights into potential directions for enhancing the ability of LLMs to generate correct and secure code in real-world repositories.
zh
[AI-47] TT-LoRA MoE: Unifying Parameter-Efficient Fine-Tuning and Sparse Mixture-of-Experts
【速读】:该论文旨在解决大规模模型部署中的可扩展性问题,特别是传统Mixture of Experts (MoE)方法在专家数量增加时面临的计算开销过大的挑战。其解决方案的关键在于提出了一种名为Tensor-Trained Low-Rank Adaptation Mixture of Experts (TT-LoRA MoE)的框架,该框架将参数高效微调(PEFT)与稀疏MoE路由相结合,通过将训练过程分解为两个独立优化阶段:首先训练轻量级、张量化的低秩适配器(TT-LoRA专家),随后冻结这些专家以避免多任务设置中的任务间干扰和灾难性遗忘,最后通过一个单独训练的稀疏MoE路由器动态选择适合的适配器,从而实现高效的多任务推理部署。
链接: https://arxiv.org/abs/2504.21190
作者: Pradip Kunwar,Minh N. Vu,Maanak Gupta,Mahmoud Abdelsalam,Manish Bhattarai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose Tensor-Trained Low-Rank Adaptation Mixture of Experts (TT-LoRA MoE), a novel computational framework integrating Parameter-Efficient Fine-Tuning (PEFT) with sparse MoE routing to address scalability challenges in large model deployments. Unlike traditional MoE approaches, which face substantial computational overhead as expert counts grow, TT-LoRA MoE decomposes training into two distinct, optimized stages. First, we independently train lightweight, tensorized low-rank adapters (TT-LoRA experts), each specialized for specific tasks. Subsequently, these expert adapters remain frozen, eliminating inter-task interference and catastrophic forgetting in multi-task settings. A sparse MoE router, trained separately, dynamically leverages base model representations to select exactly one specialized adapter per input at inference time, automating expert selection without explicit task specification. Comprehensive experiments confirm our architecture retains the memory efficiency of low-rank adapters, seamlessly scales to large expert pools, and achieves robust task-level optimization. This structured decoupling significantly enhances computational efficiency and flexibility: it uses only 2% of LoRA, 0.3% of Adapters, and 0.03% of AdapterFusion parameters, and outperforms AdapterFusion by 4 points in multi-tasking, enabling practical and scalable multi-task inference deployments.
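以下为“冻结专家 + top-1 稀疏路由”两阶段思路的极简 PyTorch 示意:适配器仅体现低秩结构(省略了张量列车分解),路由器单独训练并在推理时为每个输入恰好选择一个专家。维度与专家数均为假设值,并非论文实现。

```python
import torch, torch.nn as nn

class LoRAAdapter(nn.Module):
    """低秩适配器示意(未做张量列车分解,仅体现低秩结构)。"""
    def __init__(self, d, r=4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, d))
    def forward(self, h):
        return h @ self.A @ self.B

class TopOneMoE(nn.Module):
    def __init__(self, d, n_experts):
        super().__init__()
        self.experts = nn.ModuleList(LoRAAdapter(d) for _ in range(n_experts))
        for p in self.experts.parameters():
            p.requires_grad = False            # 第二阶段:冻结已训练好的专家
        self.router = nn.Linear(d, n_experts)  # 单独训练的稀疏路由器
    def forward(self, h):
        idx = self.router(h).argmax(dim=-1)    # 推理时每个输入恰好激活一个专家
        out = torch.stack([self.experts[int(i)](h[b]) for b, i in enumerate(idx)])
        return h + out                          # 残差式叠加适配器输出

x = torch.randn(8, 64)
print(TopOneMoE(64, 5)(x).shape)  # torch.Size([8, 64])
```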
zh
[AI-48] Artificial Intelligence for Personalized Prediction of Alzheimer's Disease Progression: A Survey of Methods, Data Challenges, and Future Directions
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)在个体间进展存在显著异质性所带来的精准预后与个性化护理规划难题,其核心挑战在于如何利用复杂、多模态和纵向患者数据构建有效的预测模型。解决方案的关键在于应用人工智能(Artificial Intelligence, AI)技术,特别是状态空间模型、深度学习方法(如循环神经网络)、图神经网络(Graph Neural Networks, GNNs)以及AI驱动的数字孪生技术,以捕捉疾病动态并实现个体化模拟。此外,针对数据限制问题,研究还探讨了基于变分自编码器(Variational Autoencoders, VAEs)和生成对抗网络(Generative Adversarial Networks, GANs)的合成数据生成策略,以增强和平衡数据集。
链接: https://arxiv.org/abs/2504.21189
作者: Gulsah Hancerliogullari Koksalmis,Bulent Soykan,Laura J. Brattain,Hsin-Hsiung Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 25 pages, 11 figures
Abstract:Alzheimer’s Disease (AD) is marked by significant inter-individual variability in its progression, complicating accurate prognosis and personalized care planning. This heterogeneity underscores the critical need for predictive models capable of forecasting patient-specific disease trajectories. Artificial Intelligence (AI) offers powerful tools to address this challenge by analyzing complex, multi-modal, and longitudinal patient data. This paper provides a comprehensive survey of AI methodologies applied to personalized AD progression prediction. We review key approaches including state-space models for capturing temporal dynamics, deep learning techniques like Recurrent Neural Networks for sequence modeling, Graph Neural Networks (GNNs) for leveraging network structures, and the emerging concept of AI-driven digital twins for individualized simulation. Recognizing that data limitations often impede progress, we examine common challenges such as high dimensionality, missing data, and dataset imbalance. We further discuss AI-driven mitigation strategies, with a specific focus on synthetic data generation using Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) to augment and balance datasets. The survey synthesizes the strengths and limitations of current approaches, emphasizing the trend towards multimodal integration and the persistent need for model interpretability and generalizability. Finally, we identify critical open challenges, including robust external validation, clinical integration, and ethical considerations, and outline promising future research directions such as hybrid models, causal inference, and federated learning. This review aims to consolidate current knowledge and guide future efforts in developing clinically relevant AI tools for personalized AD prognostication.
zh
[AI-49] AffectEval: A Modular and Customizable Framework for Affective Computing
【速读】:该论文试图解决情感计算(affective computing)流水线开发过程中因缺乏支持多模态、多领域情绪识别应用的软件框架而导致的劳动密集问题,这通常会导致在不同应用场景中构建流水线时出现重复性工作。解决方案的关键在于引入AffectEval,这是一个模块化和可定制的框架,旨在简化情感计算流水线的开发,减少手动操作和重复工作,实验结果表明该框架可将编程工作量减少高达90%。
链接: https://arxiv.org/abs/2504.21184
作者: Emily Zhou,Khushboo Khatri,Yixue Zhao,Bhaskar Krishnamachari
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The short version is published in ACM/IEEE CHASE 2025
Abstract:The field of affective computing focuses on recognizing, interpreting, and responding to human emotions, and has broad applications across education, child development, and human health and wellness. However, developing affective computing pipelines remains labor-intensive due to the lack of software frameworks that support multimodal, multi-domain emotion recognition applications. This often results in redundant effort when building pipelines for different applications. While recent frameworks attempt to address these challenges, they remain limited in reducing manual effort and ensuring cross-domain generalizability. We introduce AffectEval, a modular and customizable framework to facilitate the development of affective computing pipelines while reducing the manual effort and duplicate work involved in developing such pipelines. We validate AffectEval by replicating prior affective computing experiments, and we demonstrate that our framework reduces programming effort by up to 90%, as measured by the reduction in raw lines of code.
zh
[AI-50] SMOGAN: Synthetic Minority Oversampling with GAN Refinement for Imbalanced Regression
【速读】:该论文旨在解决不平衡回归(imbalanced regression)问题,即在目标变量分布偏斜的情况下,机器学习模型(尤其是神经网络)难以有效预测少数样本。现有方法多借鉴分类领域的过采样技术,如线性插值和高斯噪声添加,但这些方法在处理复杂非线性数据分布时生成的合成样本无法准确反映真实特征-目标关系。该论文提出的解决方案是SMOGAN,其关键在于采用两阶段过采样框架:第一阶段生成初始合成样本,第二阶段引入DistGAN,一种基于分布感知的生成对抗网络(Generative Adversarial Network),通过对抗损失与最大均值差异(Maximum Mean Discrepancy)目标联合优化,将合成样本对齐至真实特征-目标联合分布。
链接: https://arxiv.org/abs/2504.21152
作者: Shayan Alahyari,Mike Domaratzki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Imbalanced regression refers to prediction tasks where the target variable is skewed. This skewness hinders machine learning models, especially neural networks, which concentrate on dense regions and therefore perform poorly on underrepresented (minority) samples. Despite the importance of this problem, only a few methods have been proposed for imbalanced regression. Many of the available solutions for imbalanced regression adapt techniques from the class imbalance domain, such as linear interpolation and the addition of Gaussian noise, to create synthetic data in sparse regions. However, in many cases, the underlying distribution of the data is complex and non-linear. Consequently, these approaches generate synthetic samples that do not accurately represent the true feature-target relationship. To overcome these limitations, we propose SMOGAN, a two-step oversampling framework for imbalanced regression. In Stage 1, an existing oversampler generates initial synthetic samples in sparse target regions. In Stage 2, we introduce DistGAN, a distribution-aware GAN that serves as SMOGAN’s filtering layer and refines these samples via adversarial loss augmented with a Maximum Mean Discrepancy objective, aligning them with the true joint feature-target distribution. Extensive experiments on 23 imbalanced datasets show that SMOGAN consistently outperforms the default oversampling method without the DistGAN filtering layer.
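下面给出 RBF 核最大均值差异(MMD)的一个简短 PyTorch 实现示意,对应 DistGAN 在对抗损失之外用于把合成样本拉向真实联合分布的那一项;带宽 sigma 与维度均为假设值,并非论文配置。

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """RBF核下的最大均值差异(有偏估计),可作为对抗损失之外的附加正则项最小化。"""
    def k(a, b):
        d = torch.cdist(a, b) ** 2
        return torch.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

real = torch.randn(128, 11)        # 假设:特征与目标拼接成11维联合向量
synth = torch.randn(128, 11) + 0.5  # 第一阶段过采样器产出的初始合成样本
print(rbf_mmd(real, synth))
```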
zh
[AI-51] A Formalism for Optimal Search with Dynamic Heuristics
【速读】:该论文试图解决在启发式搜索中,动态启发式(dynamic heuristics)因依赖搜索历史而带来的复杂性问题,这些问题传统上被忽视或未被充分形式化。论文的关键在于形式化动态启发式的概念,并将其融入一个通用算法框架中,从而推导出一般性的最优性结果。通过研究一种特定的实例化方法,该方法将动态启发式与A∗算法结合,论文证明了其最优性,并展示了经典规划中的现有方法可视为该实例化的特例,从而可以直接应用其最优性结论。
链接: https://arxiv.org/abs/2504.21131
作者: Remo Christen,Florian Pommerening,Clemens Büchner,Malte Helmert
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While most heuristics studied in heuristic search depend only on the state, some accumulate information during search and thus also depend on the search history. Various existing approaches use such dynamic heuristics in \mathrm{A}^*-like algorithms and appeal to classic results for \mathrm{A}^* to show optimality. However, doing so ignores the complexities of searching with a mutable heuristic. In this paper we formalize the idea of dynamic heuristics and use them in a generic algorithm framework. We study a particular instantiation that models \mathrm{A}^* with dynamic heuristics and show general optimality results. Finally we show how existing approaches from classical planning can be viewed as special cases of this instantiation, making it possible to directly apply our optimality results.
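下面是一个带“动态启发式”的 A* 玩具示意(Python):启发值缓存 h 会随搜索历史更新,演示启发式可变时搜索仍按 f = g + h 展开;这里的更新规则是示例性的假设,并非论文中的通用框架或其最优性条件。

```python
import heapq

def astar_dynamic(start, goal, succ, h0):
    """A* 搜索,启发式 h 可在搜索过程中被抬升(示例性的动态更新规则)。"""
    h = {}                                   # 动态启发式缓存
    def h_val(s): return h.get(s, h0(s))
    g = {start: 0}
    pq = [(h_val(start), start)]
    while pq:
        f, s = heapq.heappop(pq)
        if s == goal:
            return g[s]
        for t, c in succ(s):
            if g[s] + c < g.get(t, float("inf")):
                g[t] = g[s] + c
                heapq.heappush(pq, (g[t] + h_val(t), t))
        if pq:                               # 依据已扩展信息抬高 s 的启发值
            h[s] = max(h_val(s), min(f_ for f_, _ in pq) - g[s])
    return None

succ = lambda s: [(s + 1, 1)] if s < 5 else []   # 玩具状态空间: 0 -> 5 链
print(astar_dynamic(0, 5, succ, lambda s: 0))    # 5
```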
zh
[AI-52] A Survey on Parameter-Efficient Fine-Tuning for Foundation Models in Federated Learning
【速读】:该论文旨在解决在联邦学习(Federated Learning, FL)环境中,如何高效地适应大规模基础模型(Foundation Models)以完成特定下游任务的问题。传统方法需要对整个模型进行微调,这在计算资源上成本高昂。其解决方案的关键在于采用参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)技术,通过仅更新少量参数或对模型结构进行重参数化,从而在保持模型性能的同时降低计算和通信开销。
链接: https://arxiv.org/abs/2504.21099
作者: Jieming Bian,Yuanzhe Peng,Lei Wang,Yin Huang,Jie Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: survey paper, under updating
Abstract:Foundation models have revolutionized artificial intelligence by providing robust, versatile architectures pre-trained on large-scale datasets. However, adapting these massive models to specific downstream tasks requires fine-tuning, which can be prohibitively expensive in computational resources. Parameter-Efficient Fine-Tuning (PEFT) methods address this challenge by selectively updating only a small subset of parameters. Meanwhile, Federated Learning (FL) enables collaborative model training across distributed clients without sharing raw data, making it ideal for privacy-sensitive applications. This survey provides a comprehensive review of the integration of PEFT techniques within federated learning environments. We systematically categorize existing approaches into three main groups: Additive PEFT (which introduces new trainable parameters), Selective PEFT (which fine-tunes only subsets of existing parameters), and Reparameterized PEFT (which transforms model architectures to enable efficient updates). For each category, we analyze how these methods address the unique challenges of federated settings, including data heterogeneity, communication efficiency, computational constraints, and privacy concerns. We further organize the literature based on application domains, covering both natural language processing and computer vision tasks. Finally, we discuss promising research directions, including scaling to larger foundation models, theoretical analysis of federated PEFT methods, and sustainable approaches for resource-constrained environments.
zh
[AI-53] On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks
【速读】:该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)进行语义感知的过程挖掘任务,这些任务需要对活动及其关系的含义有深入理解。与以往研究主要在模型默认状态下评估LLMs不同,本文的关键解决方案是通过上下文学习和监督微调来提升LLMs在过程挖掘任务中的性能。研究定义了五个需要语义理解的过程挖掘任务,并提供了广泛的基准数据集进行评估,实验结果表明,经过微调的LLMs在多种过程类型和行业场景下能够实现强大的性能。
链接: https://arxiv.org/abs/2504.21074
作者: Adrian Rebmann,Fabian David Schmidt,Goran Glavaš,Han van der Aa
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: 31 pages, submitted to PS
Abstract:Large language models (LLMs) have shown to be valuable tools for tackling process mining tasks. Existing studies report on their capability to support various data-driven process analyses and even, to some extent, that they are able to reason about how processes work. This reasoning ability suggests that there is potential for LLMs to tackle semantics-aware process mining tasks, which are tasks that rely on an understanding of the meaning of activities and their relationships. Examples of these include process discovery, where the meaning of activities can indicate their dependency, whereas in anomaly detection the meaning can be used to recognize process behavior that is abnormal. In this paper, we systematically explore the capabilities of LLMs for such tasks. Unlike prior work, which largely evaluates LLMs in their default state, we investigate their utility through both in-context learning and supervised fine-tuning. Concretely, we define five process mining tasks requiring semantic understanding and provide extensive benchmarking datasets for evaluation. Our experiments reveal that while LLMs struggle with challenging process mining tasks when used out of the box or with minimal in-context examples, they achieve strong performance when fine-tuned for these tasks across a broad range of process types and industries.
zh
[AI-54] Erased but Not Forgotten: How Backdoors Compromise Concept Erasure
【速读】:该论文试图解决大规模文本到图像扩散模型在生成有害内容方面的安全风险,特别是针对通过微调实现的概念擦除(concept erasure)技术的有效性问题。其解决方案的关键在于揭示现有擦除算法在面对特定后门攻击时的脆弱性,并提出一种新型的威胁模型——毒性擦除(Toxic Erasure, ToxE),该模型通过建立触发词与不良内容之间的关联,使得擦除操作无法有效移除此类关联,从而允许攻击者持续生成有害内容。此外,论文还引入了深度干预评分攻击(Deep Intervention Score-based Attack, DISA),通过优化整个U-Net结构来增强攻击的持久性。
链接: https://arxiv.org/abs/2504.21072
作者: Jonas Henry Grebe,Tobias Braun,Marcus Rohrbach,Anna Rohrbach
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The expansion of large-scale text-to-image diffusion models has raised growing concerns about their potential to generate undesirable or harmful content, ranging from fabricated depictions of public figures to sexually explicit images. To mitigate these risks, prior work has devised machine unlearning techniques that attempt to erase unwanted concepts through fine-tuning. However, in this paper, we introduce a new threat model, Toxic Erasure (ToxE), and demonstrate how recent unlearning algorithms, including those explicitly designed for robustness, can be circumvented through targeted backdoor attacks. The threat is realized by establishing a link between a trigger and the undesired content. Subsequent unlearning attempts fail to erase this link, allowing adversaries to produce harmful content. We instantiate ToxE via two established backdoor attacks: one targeting the text encoder and another manipulating the cross-attention layers. Further, we introduce Deep Intervention Score-based Attack (DISA), a novel, deeper backdoor attack that optimizes the entire U-Net using a score-based objective, improving the attack’s persistence across different erasure methods. We evaluate five recent concept erasure methods against our threat model. For celebrity identity erasure, our deep attack circumvents erasure with up to 82% success, averaging 57% across all erasure methods. For explicit content erasure, ToxE attacks can elicit up to 9 times more exposed body parts, with DISA yielding an average increase by a factor of 2.9. These results highlight a critical security gap in current unlearning strategies.
zh
[AI-55] A Brief Review for Compression and Transfer Learning Techniques in DeepFake Detection
【速读】:该论文试图解决在边缘设备上训练和部署深度伪造检测模型时面临的计算和内存资源受限问题。其解决方案的关键在于采用压缩技术(如剪枝、知识蒸馏、量化等)以降低计算需求和推理时间,同时结合迁移学习方法以减少训练开销,从而在保持检测性能的同时实现模型的高效部署。
链接: https://arxiv.org/abs/2504.21066
作者: Andreas Karathanasis,John Violos,Ioannis Kompatsiaris,Symeon Papadopoulos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Training and deploying deepfake detection models on edge devices offers the advantage of maintaining data privacy and confidentiality by processing it close to its source. However, this approach is constrained by the limited computational and memory resources available at the edge. To address this challenge, we explore compression techniques to reduce computational demands and inference time, alongside transfer learning methods to minimize training overhead. Using the Synthbuster, RAISE, and ForenSynths datasets, we evaluate the effectiveness of pruning, knowledge distillation (KD), quantization, fine-tuning, and adapter-based techniques. Our experimental results demonstrate that both compression and transfer learning can be effectively achieved, even with a high compression level of 90%, remaining at the same performance level when the training and validation data originate from the same DeepFake model. However, when the testing dataset is generated by DeepFake models not present in the training set, a domain generalization issue becomes evident.
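作为文中知识蒸馏(KD)技术的一个常见实现示意,下面给出软目标 KL 散度与硬标签交叉熵加权的蒸馏损失(PyTorch);温度 T 与权重 alpha 为假设超参数,并非论文实验所用配置。

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """经典KD损失:软目标KL散度(温度缩放)+ 硬标签交叉熵的加权和。"""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                              # 温度平方补偿梯度尺度
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(8, 2), torch.randn(8, 2)  # 二分类:真实 / 深度伪造
y = torch.randint(0, 2, (8,))
print(kd_loss(s, t, y))
```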
zh
[AI-56] A 3D pocket-aware and affinity-guided diffusion model for lead optimization
【速读】:该论文旨在解决分子优化过程中难以充分考虑与蛋白质靶点结合亲和力的问题,这是药物发现中的关键任务。其解决方案的关键在于提出了一种名为Diffleop的3D口袋感知且亲和力引导的扩散模型,该模型通过显式整合蛋白质-配体结合亲和力的知识,指导去噪采样以生成具有高亲和力的分子。
链接: https://arxiv.org/abs/2504.21065
作者: Anjie Qiao,Junjie Xie,Weifeng Huang,Hao Zhang,Jiahua Rao,Shuangjia Zheng,Yuedong Yang,Zhen Wang,Guo-Bo Li,Jinping Lei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Molecular optimization, aimed at improving binding affinity or other molecular properties, is a crucial task in drug discovery that often relies on the expertise of medicinal chemists. Recently, deep learning-based 3D generative models showed promise in enhancing the efficiency of molecular optimization. However, these models often struggle to adequately consider binding affinities with protein targets during lead optimization. Herein, we propose a 3D pocket-aware and affinity-guided diffusion model, named Diffleop, to optimize molecules with enhanced binding affinity. The model explicitly incorporates the knowledge of protein-ligand binding affinity to guide the denoising sampling for molecule generation with high affinity. The comprehensive evaluations indicated that Diffleop outperforms baseline models across multiple metrics, especially in terms of binding affinity.
zh
[AI-57] Frequency Feature Fusion Graph Network For Depression Diagnosis Via fNIRS
【速读】:该论文试图解决抑郁症诊断中因缺乏稳健的时间生物标志物而限制图神经网络(GNN)模型有效性的难题。其解决方案的关键在于引入一种基于离散傅里叶变换(DFT)的新生物标志物,并设计一种定制的时序图卷积网络(TGCN)架构,以增强对脑区时间特征的表征能力。通过在包含1,086名受试者的大规模数据集上进行训练,并结合倾向评分匹配(PSM)方法构建更符合医学需求的子集,实验结果表明该方法显著提升了F1分数,同时利用SHAP方法验证了模型的可解释性,为实际医疗应用提供了支持。
链接: https://arxiv.org/abs/2504.21064
作者: Chengkai Yang,Xingping Dong,Xiaofen Zong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Data-driven approaches for depression diagnosis have emerged as a significant research focus in neuromedicine, driven by the development of relevant datasets. Recently, graph neural network (GNN)-based models have gained widespread adoption due to their ability to capture brain channel functional connectivity from both spatial and temporal perspectives. However, their effectiveness is hindered by the absence of a robust temporal biomarker. In this paper, we introduce a novel and effective biomarker for depression diagnosis by leveraging the discrete Fourier transform (DFT) and propose a customized graph network architecture based on Temporal Graph Convolutional Network (TGCN). Our model was trained on a dataset comprising 1,086 subjects, which is over 10 times larger than previous datasets in the field of depression diagnosis. Furthermore, to align with medical requirements, we performed propensity score matching (PSM) to create a refined subset, referred to as the PSM dataset. Experimental results demonstrate that incorporating our newly designed biomarker enhances the representation of temporal characteristics in brain channels, leading to improved F1 scores in both the real-world dataset and the PSM dataset. This advancement has the potential to contribute to the development of more effective depression diagnostic tools. In addition, we used SHapley Additive exPlanations (SHAP) to validate the interpretability of our model, ensuring its practical applicability in medical settings.
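下面用 NumPy 示意基于离散傅里叶变换(DFT)的频带能量特征提取,帮助直观理解这类时间生物标志物;频带划分与采样率均为假设值,论文的具体构造请以原文为准。

```python
import numpy as np

def dft_band_features(signal, fs, bands=((0.01, 0.08), (0.08, 0.2))):
    """对单个通道时间序列做DFT,取若干频带的能量作为时间特征(频带为示意性假设)。"""
    spec = np.abs(np.fft.rfft(signal)) ** 2          # 功率谱
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands])

x = np.random.randn(1000)             # 模拟一段fNIRS通道信号
print(dft_band_features(x, fs=10.0))  # 每个频带一个能量值,可作为TGCN的节点特征
```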
zh
[AI-58] oken-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization
【速读】:该论文旨在解决联邦领域泛化(FedDG)中由于客户端数据异构性导致的模型泛化能力下降问题,尤其是在使用单一全局提示(global prompt)进行视觉-语言模型(VLM)适应时,难以有效处理个性化样本的问题。其解决方案的关键在于提出TRIP框架,该框架通过将多个提示视为不同的专家,并采用基于token级聚类和最优传输的无参数路由机制,实现对图像中不同token的精细化专家分配,从而提升模型的泛化能力和通信效率。
链接: https://arxiv.org/abs/2504.21063
作者: Shuai Gong,Chaoran Cui,Xiaolin Dong,Xiushan Nie,Lei Zhu,Xiaojun Chang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: The manuscript has been submitted to IEEE Transactions on Knowledge and Data Engineering
Abstract:Federated domain generalization (FedDG) aims to learn a globally generalizable model from decentralized clients with heterogeneous data while preserving privacy. Recent studies have introduced prompt learning to adapt vision-language models (VLMs) in FedDG by learning a single global prompt. However, such a one-prompt-fits-all learning paradigm typically leads to performance degradation on personalized samples. Although the mixture of experts (MoE) offers a promising solution for specialization, existing MoE-based methods suffer from coarse image-level expert assignment and high communication costs from parameterized routers. To address these limitations, we propose TRIP, a Token-level prompt mixture with parameter-free routing framework for FedDG, which treats multiple prompts as distinct experts. Unlike existing image-level routing designs, TRIP assigns different tokens within an image to specific experts. To ensure communication efficiency, TRIP incorporates a parameter-free routing mechanism based on token clustering and optimal transport. The instance-specific prompt is then synthesized by aggregating experts, weighted by the number of tokens assigned to each. Additionally, TRIP develops an unbiased learning strategy for prompt experts, leveraging the VLM’s zero-shot generalization capability. Extensive experiments across four benchmarks demonstrate that TRIP achieves optimal generalization results, with communication of only 1K parameters per round. Our code is available at this https URL.
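以下为 token 级无参数路由的简化 PyTorch 示意:按与各专家提示原型的余弦相似度分配 token,再按各专家分得的 token 数加权合成实例级提示。这里省略了论文中的聚类与最优传输步骤,仅保留“按 token 数加权聚合”的核心思想;维度均为假设值。

```python
import torch
import torch.nn.functional as F

def trip_route(tokens, expert_prompts):
    """无参数的token级路由示意(省略最优传输):分配token -> 按数量加权合成提示。"""
    protos = expert_prompts.mean(dim=1)                      # (E, d) 每个专家的原型
    sim = F.normalize(tokens, dim=-1) @ F.normalize(protos, dim=-1).T  # (N, E)
    assign = sim.argmax(dim=-1)                              # 每个token归属一个专家
    counts = torch.bincount(assign, minlength=expert_prompts.size(0)).float()
    w = counts / counts.sum()                                # 按分得的token数加权
    return (w[:, None, None] * expert_prompts).sum(0)        # 合成实例级提示

tokens = torch.randn(196, 512)            # 一张图的patch token(假设维度)
experts = torch.randn(4, 8, 512)          # 4个专家提示,每个含8个提示token
print(trip_route(tokens, experts).shape)  # torch.Size([8, 512])
```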
zh
[AI-59] Modeling and Performance Analysis for Semantic Communications Based on Empirical Results
【速读】:该论文试图解决深度学习驱动的语义编码器和解码器的黑箱特性所带来的端到端性能分析难题,旨在建立端到端性能指标与信噪比(SNR)之间的理论关系。解决方案的关键是提出了一种Alpha-Beta-Gamma (ABG) 公式,该公式能够建模端到端测量与SNR之间的关系,并适用于图像重建任务和推理任务。通过该公式,研究进一步设计了适应性功率控制方案和最优功率分配策略,以提升语义通信系统的能量效率和服务质量(QoS),并利用二分法算法实现了多用户OFDMA下行语义通信中的最小QoS最大化。
链接: https://arxiv.org/abs/2504.21055
作者: Shuai Ma,Bin Shen,Chuanhui Zhang,Youlong Wu,Hang Li,Shiyin Li,Guangming Shi,Naofal Al-Dhahir
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Due to the black-box characteristics of deep learning based semantic encoders and decoders, finding a tractable method for the performance analysis of semantic communications is a challenging problem. In this paper, we propose an Alpha-Beta-Gamma (ABG) formula to model the relationship between the end-to-end measurement and SNR, which can be applied for both image reconstruction tasks and inference tasks. Specifically, for image reconstruction tasks, the proposed ABG formula can well fit the commonly used DL networks, such as SCUNet and Vision Transformer, for semantic encoding with the multi-scale structural similarity index measure (MS-SSIM). Furthermore, we find that the upper bound of the MS-SSIM depends on the number of quantized output bits of semantic encoders, and we also propose a closed-form expression to fit the relationship between the MS-SSIM and quantized output bits. To the best of our knowledge, this is the first theoretical expression between end-to-end performance metrics and SNR for semantic communications. Based on the proposed ABG formula, we investigate an adaptive power control scheme for semantic communications over random fading channels, which can effectively guarantee quality of service (QoS) for semantic communications, and then design the optimal power allocation scheme to maximize the energy efficiency of the semantic communication system. Furthermore, by exploiting the bisection algorithm, we develop the power allocation scheme to maximize the minimum QoS of multiple users for OFDMA downlink semantic communication. Extensive simulations verify the effectiveness and superiority of the proposed ABG formula and power allocation schemes.
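摘要未给出 ABG 公式的显式形式,下面仅以一个假设的三参数饱和指数函数演示“用 (alpha, beta, gamma) 拟合端到端指标-SNR 曲线”的流程(Python/SciPy),测量点为模拟数据,函数形式与参数均非论文结论。

```python
import numpy as np
from scipy.optimize import curve_fit

def abg(snr_db, alpha, beta, gamma):
    """假设的ABG函数形式:性能随SNR饱和上升,仅作拟合流程示意。"""
    return alpha - beta * np.exp(-gamma * snr_db)

snr = np.linspace(0, 20, 21)
msssim = 0.95 - 0.4 * np.exp(-0.3 * snr) + 0.01 * np.random.randn(21)  # 模拟测量点
params, _ = curve_fit(abg, snr, msssim, p0=(1.0, 0.5, 0.1))
print(dict(zip(("alpha", "beta", "gamma"), params.round(3))))
```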
zh
[AI-60] FFCBA: Feature-based Full-target Clean-label Backdoor Attacks
【速读】:该论文旨在解决多目标后门攻击中存在的一些挑战,特别是现有方法依赖于脏标签(dirty-label)策略导致易被检测,而干净标签(clean-label)攻击则难以实现稳定且可扩展的多目标攻击性能。其解决方案的关键在于提出一种基于特征的全目标干净标签后门攻击(Feature-based Full-target Clean-label Backdoor Attacks, FFCBA),该方法包含两种范式:特征扩展后门攻击(Feature-Spanning Backdoor Attacks, FSBA)和特征迁移后门攻击(Feature-Migrating Backdoor Attacks, FMBA)。FSBA通过类条件自编码器生成与原类别特征对齐的噪声触发器,确保触发器的有效性、类内一致性、类间特异性和自然特征相关性;FMBA则通过两阶段类条件自编码器训练过程,生成具有强目标类别特征的触发器,从而提升跨模型攻击能力。
链接: https://arxiv.org/abs/2504.21054
作者: Yangxu Yin,Honglong Chen,Yudong Gao,Peng Sun,Liantao Wu,Zhe Li,Weifeng Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Backdoor attacks pose a significant threat to deep neural networks, as backdoored models would misclassify poisoned samples with specific triggers into target classes while maintaining normal performance on clean samples. Among these, multi-target backdoor attacks can simultaneously target multiple classes. However, existing multi-target backdoor attacks all follow the dirty-label paradigm, where poisoned samples are mislabeled, and most of them require an extremely high poisoning rate. This makes them easily detectable by manual inspection. In contrast, clean-label attacks are more stealthy, as they avoid modifying the labels of poisoned samples. However, they generally struggle to achieve stable and satisfactory attack performance and often fail to scale effectively to multi-target attacks. To address this issue, we propose the Feature-based Full-target Clean-label Backdoor Attacks (FFCBA) which consists of two paradigms: Feature-Spanning Backdoor Attacks (FSBA) and Feature-Migrating Backdoor Attacks (FMBA). FSBA leverages class-conditional autoencoders to generate noise triggers that align perturbed in-class samples with the original category’s features, ensuring the effectiveness, intra-class consistency, inter-class specificity and natural-feature correlation of triggers. While FSBA supports swift and efficient attacks, its cross-model attack capability is relatively weak. FMBA employs a two-stage class-conditional autoencoder training process that alternates between using out-of-class samples and in-class samples. This allows FMBA to generate triggers with strong target-class features, making it highly effective for cross-model attacks. We conduct experiments on multiple datasets and models, the results show that FFCBA achieves outstanding attack performance and maintains desirable robustness against the state-of-the-art backdoor defenses.
zh
[AI-61] NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)中安全对齐(safety alignment)的问题,即通过微调机制调节神经元激活以抑制有害内容。其解决方案的关键在于识别并修改负责安全约束的神经元,具体包括三个关键步骤:神经元激活分析、基于相似性的神经元识别以及安全移除的神经元再学习,从而实现对模型安全约束的有效去除。
链接: https://arxiv.org/abs/2504.21053
作者: Yi Zhou,Wenpeng Xing,Dezhang Kong,Changting Lin,Meng Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. In this work, we propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints. Our method consists of three key steps: Neuron Activation Analysis, where we examine activation patterns in response to harmful and harmless prompts to detect neurons that are critical for distinguishing between harmful and harmless inputs; Similarity-Based Neuron Identification, which systematically locates the neurons responsible for safe alignment; and Neuron Relearning for Safety Removal, where we fine-tune these selected neurons to restore the model’s ability to generate previously restricted responses. Experimental results demonstrate that our method effectively removes safety constraints with minimal fine-tuning, highlighting a critical vulnerability in current alignment techniques. Our findings underscore the need for robust defenses against adversarial fine-tuning attacks on LLMs.
zh
[AI-62] SFIBA: Spatial-based Full-target Invisible Backdoor Attacks
【速读】:该论文旨在解决多目标后门攻击在黑盒设置中无法保证触发器特异性和隐蔽性的问题,具体表现为无法同时针对所有类别进行攻击以及触发器缺乏视觉不可感知性。其解决方案的关键在于提出一种基于空间的全目标隐形后门攻击(Spatial-based Full-target Invisible Backdoor Attack, SFIBA),通过将不同类别的触发器限制在像素空间中的特定局部区域和形态以确保特异性,并采用基于频域的触发器注入方法以提升隐蔽性。
链接: https://arxiv.org/abs/2504.21052
作者: Yangxu Yin,Honglong Chen,Yudong Gao,Peng Sun,Zhishuai Li,Weifeng Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-target backdoor attacks pose significant security threats to deep neural networks, as they can preset multiple target classes through a single backdoor injection. This allows attackers to control the model to misclassify poisoned samples with triggers into any desired target class during inference, exhibiting superior attack performance compared with conventional backdoor attacks. However, existing multi-target backdoor attacks fail to guarantee trigger specificity and stealthiness in black-box settings, resulting in two main issues. First, they are unable to simultaneously target all classes when only training data can be manipulated, limiting their effectiveness in realistic attack scenarios. Second, the triggers often lack visual imperceptibility, making poisoned samples easy to detect. To address these problems, we propose a Spatial-based Full-target Invisible Backdoor Attack, called SFIBA. It restricts triggers for different classes to specific local spatial regions and morphologies in the pixel space to ensure specificity, while employing a frequency-domain-based trigger injection method to guarantee stealthiness. Specifically, for injection of each trigger, we first apply the fast Fourier transform to obtain the amplitude spectrum of clean samples in local spatial regions. Then, we employ the discrete wavelet transform to extract the features from the amplitude spectrum and use singular value decomposition to integrate the trigger. Subsequently, we selectively filter parts of the trigger in pixel space to implement trigger morphology constraints and adjust injection coefficients based on visual effects. We conduct experiments on multiple datasets and models. The results demonstrate that SFIBA can achieve excellent attack performance and stealthiness, while preserving the model’s performance on benign samples, and can also bypass existing backdoor defenses.
zh
[AI-63] Phishing URL Detection using Bi-LSTM
【速读】:该论文旨在解决网络钓鱼攻击检测中存在的高误报率及检测类型受限的问题(Phishing attacks detection problem),传统检测系统难以有效识别多种类型的网络钓鱼攻击。其解决方案的关键在于采用基于深度学习的双向长短期记忆网络(Bidirectional Long Short-Term Memory, Bi-LSTM),通过利用URL的序列数据和上下文信息,实现对URL的分类,从而提升检测准确性。
链接: https://arxiv.org/abs/2504.21049
作者: Sneha Baskota
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Phishing attacks threaten online users, often leading to data breaches, financial losses, and identity theft. Traditional phishing detection systems struggle with high false positive rates and are usually limited by the types of attacks they can identify. This paper proposes a deep learning-based approach using a Bidirectional Long Short-Term Memory (Bi-LSTM) network to classify URLs into four categories: benign, phishing, defacement, and malware. The model leverages sequential URL data and captures contextual information, improving the accuracy of phishing detection. Experimental results on a dataset comprising over 650,000 URLs demonstrate the model’s effectiveness, achieving 97% accuracy and significant improvements over traditional techniques.
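下面给出字符级 Bi-LSTM 四分类器(benign/phishing/defacement/malware)的最小 PyTorch 示意;词表大小、维度与“取末位时刻表示”的做法均为假设,并非论文原始网络配置。

```python
import torch, torch.nn as nn

class BiLSTMUrl(nn.Module):
    """字符级Bi-LSTM URL四分类示意(超参数为假设值)。"""
    def __init__(self, vocab=128, emb=32, hid=64, n_cls=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hid, n_cls)
    def forward(self, x):
        h, _ = self.lstm(self.emb(x))     # (B, T, 2*hid)
        return self.fc(h[:, -1])          # 取末位时刻的双向表示做分类

url = "http://example.com/login"
ids = torch.tensor([[min(ord(c), 127) for c in url]])  # 字符 -> ASCII 索引
print(BiLSTMUrl()(ids).shape)             # torch.Size([1, 4])
```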
zh
[AI-64] Multi-Agent Reinforcement Learning for Resource Allocation Optimization: A Survey
【速读】:该论文试图解决资源分配优化(Resource Allocation Optimization, RAO)问题,特别是在动态和去中心化环境中如何有效进行决策与学习。其解决方案的关键在于利用多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)框架,通过智能体间的协作与竞争机制,提升资源分配的效率与适应性。论文系统梳理了当前MARL在RAO中的算法、核心概念及分类,并提出了结构化的分类体系,旨在为研究者和实践者提供理论支持与技术指导。
链接: https://arxiv.org/abs/2504.21048
作者: Mohamad A. Hady,Siyi Hu,Mahardhika Pratama,Jimmy Cao,Ryszard Kowalczyk
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Multi-Agent Reinforcement Learning (MARL) has become a powerful framework for numerous real-world applications, modeling distributed decision-making and learning from interactions with complex environments. Resource Allocation Optimization (RAO) benefits significantly from MARL’s ability to tackle dynamic and decentralized contexts. MARL-based approaches are increasingly applied to RAO challenges across sectors, playing a pivotal role in Industry 4.0 developments. This survey provides a comprehensive review of recent MARL algorithms for RAO, encompassing core concepts, classifications, and a structured taxonomy. By outlining the current research landscape and identifying primary challenges and future directions, this survey aims to support researchers and practitioners in leveraging MARL’s potential to advance resource allocation solutions.
zh
[AI-65] Model Connectomes: A Generational Approach to Data-Efficient Language Models
【速读】:该论文试图解决人工神经网络与生物神经网络在学习机制上的差异问题,即当前的人工神经网络缺乏进化和个体学习的双重维度,仅依赖单一的大规模训练过程。其解决方案的关键在于引入一个“外循环”进化机制,该机制塑造了“内循环”学习过程,从而使得人工网络更接近生物体中进化与个体学习的协同作用。通过在语言任务中训练一个继承“模型连接组(model connectome)”的模型,并在大规模语料库上进行微调,实验结果显示该模型在自然语言处理任务及与人类行为和脑数据的对齐方面表现优于或相当于是对照模型,表明模型连接组可作为低数据环境下学习的有效先验。
链接: https://arxiv.org/abs/2504.21047
作者: Klemen Kotar,Greta Tuckute
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Biological neural networks are shaped both by evolution across generations and by individual learning within an organism’s lifetime, whereas standard artificial neural networks undergo a single, large training procedure without inherited constraints. In this preliminary work, we propose a framework that incorporates this crucial generational dimension - an “outer loop” of evolution that shapes the “inner loop” of learning - so that artificial networks better mirror the effects of evolution and individual learning in biological organisms. Focusing on language, we train a model that inherits a “model connectome” from the outer evolution loop before exposing it to a developmental-scale corpus of 100M tokens. Compared with two closely matched control models, we show that the connectome model performs on par with or better than the controls on natural language processing tasks, as well as on alignment to human behavior and brain data. These findings suggest that a model connectome serves as an efficient prior for learning in low-data regimes - narrowing the gap between single-generation artificial models and biologically evolved neural networks.
zh
[AI-66] Leveraging LLM to Strengthen ML-Based Cross-Site Scripting Detection
【速读】:该论文旨在解决生成式AI在检测跨站脚本(XSS)攻击中的有效性问题,特别是在面对经过混淆的XSS攻击时检测能力下降的问题。其关键解决方案是通过微调大型语言模型(LLM)来自动生成复杂度更高的混淆XSS载荷,从而为机器学习(ML)模型提供更具挑战性的训练数据,提升其在实际应用中对高级XSS攻击的检测能力。
链接: https://arxiv.org/abs/2504.21045
作者: Dennis Miczek,Divyesh Gabbireddy,Suman Saha
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This work has been accepted for presentation at the ACM Workshop on Wireless Security and Machine Learning (WiseML 2025)
Abstract:According to the Open Web Application Security Project (OWASP), Cross-Site Scripting (XSS) is a critical security vulnerability. Despite decades of research, XSS remains among the top 10 security vulnerabilities. Researchers have proposed various techniques to protect systems from XSS attacks, with machine learning (ML) being one of the most widely used methods. An ML model is trained on a dataset to identify potential XSS threats, making its effectiveness highly dependent on the size and diversity of the training data. A variation of XSS is obfuscated XSS, where attackers apply obfuscation techniques to alter the code’s structure, making it challenging for security systems to detect its malicious intent. Our study’s random forest model, trained on traditional (non-obfuscated) XSS data, achieved 99.8% accuracy. However, when tested against obfuscated XSS samples, accuracy dropped to 81.9%, underscoring the importance of training ML models with obfuscated data to improve their effectiveness in detecting XSS attacks. A significant challenge is to generate highly complex obfuscated code despite the availability of several public tools. These tools can only produce obfuscation up to certain levels of complexity. In our proposed system, we fine-tune a Large Language Model (LLM) to generate complex obfuscated XSS payloads automatically. By transforming original XSS samples into diverse obfuscated variants, we create challenging training data for ML model evaluation. Our approach achieved a 99.5% accuracy rate with the obfuscated dataset. We also found that the obfuscated samples generated by the LLMs were 28.1% more complex than those created by other tools, significantly improving the model’s ability to handle advanced XSS attacks and making it more effective for real-world application security.
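下面用 scikit-learn 示意“字符 n-gram + 随机森林”的 XSS 载荷分类流程;训练样本为手写玩具数据,实际训练时应按文中思路混入 LLM 生成的混淆载荷。特征方案(TF-IDF 字符 n-gram)是此处的假设,并非论文所用特征。

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# 玩具训练集:1 = XSS 载荷,0 = 良性文本(仅供演示流程)
X = ["<script>alert(1)</script>",
     "<img src=x onerror=alert(1)>",
     "hello world",
     "SELECT name FROM users"]
y = [1, 1, 0, 0]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # 字符n-gram特征
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(X, y)
# 玩具数据下的预测仅供演示,期望判为XSS
print(clf.predict(["<svg onload=alert(1)>"]))
```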
zh
[AI-67] AGATE: Stealthy Black-box Watermarking for Multimodal Model Copyright Protection
【速读】:该论文旨在解决多模态人工智能模型在版权保护中的隐蔽性和鲁棒性问题,尤其是针对现有方法易被攻击者检测和伪造导致水印逃逸的缺陷。其解决方案的关键在于提出一种基于对抗触发生成和后处理模块的黑盒后门水印框架AGATE,通过从常规数据集中生成隐蔽的对抗触发器,并利用后处理模块校正模型输出以减少异常检测风险,最终通过两阶段水印验证机制判断模型是否侵权。
链接: https://arxiv.org/abs/2504.21044
作者: Jianbo Gao,Keke Gai,Jing Yu,Liehuang Zhu,Qi Wu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in large-scale Artificial Intelligence (AI) models offering multimodal services have become foundational in AI systems, making them prime targets for model theft. Existing methods select Out-of-Distribution (OoD) data as backdoor watermarks and retrain the original model for copyright protection. However, existing methods are susceptible to malicious detection and forgery by adversaries, resulting in watermark evasion. In this work, we propose a Model-agnostic Black-box Backdoor Watermarking Framework (AGATE) to address stealthiness and robustness challenges in multimodal model copyright protection. Specifically, we propose an adversarial trigger generation method to generate stealthy adversarial triggers from an ordinary dataset, providing visual fidelity while inducing semantic shifts. To alleviate the issue of anomaly detection among model outputs, we propose a post-transform module to correct the model output by narrowing the distance between the adversarial trigger image embedding and text embedding. Subsequently, a two-phase watermark verification is proposed to judge whether the current model infringes by comparing the two results with and without the transform module. Consequently, we consistently outperform state-of-the-art methods across five datasets in the downstream tasks of multimodal image-text retrieval and image classification. Additionally, we validated the robustness of AGATE under two adversarial attack scenarios.
zh
[AI-68] CodeBC: A More Secure Large Language Model for Smart Contract Code Generation in Blockchain
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在生成代码时缺乏对安全漏洞的理解,从而难以避免生成代码中的安全风险的问题,尤其是在高安全性编程任务如区块链智能合约开发中。其解决方案的关键在于提出CodeBC,一个专门用于生成安全智能合约的代码生成模型,该模型采用三阶段微调方法,不依赖于成对的安全漏洞定位标注,而是通过漏洞和安全标签来教导模型区分易受攻击与安全代码,从而在推理阶段生成更安全、鲁棒的代码。
链接: https://arxiv.org/abs/2504.21043
作者: Lingxiang wang,Hainan Zhang,Qinnan Zhang,Ziwei Wang,Hongwei Zheng,Jin Dong,Zhiming Zheng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) excel at generating code from natural language instructions, yet they often lack an understanding of security vulnerabilities. This limitation makes it difficult for LLMs to avoid security risks in generated code, particularly in high-security programming tasks such as smart contract development for blockchain. Researchers have attempted to enhance the vulnerability awareness of these models by training them to differentiate between vulnerable and fixed code snippets. However, this approach relies heavily on manually labeled vulnerability data, which is only available for popular languages like Python and C++. For low-resource languages like Solidity, used in smart contracts, large-scale annotated datasets are scarce and difficult to obtain. To address this challenge, we introduce CodeBC, a code generation model specifically designed for generating secure smart contracts in blockchain. CodeBC employs a three-stage fine-tuning approach based on CodeLlama, distinguishing itself from previous methods by not relying on pairwise vulnerability location annotations. Instead, it leverages vulnerability and security tags to teach the model the differences between vulnerable and secure code. During the inference phase, the model leverages security tags to generate secure and robust code. Experimental results demonstrate that CodeBC outperforms baseline models in terms of BLEU, CodeBLEU, and compilation pass rates, while significantly reducing vulnerability rates. These findings validate the effectiveness and cost-efficiency of our three-stage fine-tuning strategy, making CodeBC a promising solution for generating secure smart contract code.
zh
[AI-69] What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift CCS
【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)系统中可信度问题,包括完整性、隐私性、鲁棒性和偏见等威胁。其解决方案的关键在于提出ConceptLens框架,该框架利用预训练的多模态模型,通过分析探测样本中的概念漂移(Concept Shift)来识别完整性威胁的根本原因,从而实现对数据污染攻击的有效检测、偏见注入的揭示以及隐私风险的识别,并提供对模型弱点的深入分析。
链接: https://arxiv.org/abs/2504.21042
作者: Jiamin Chang,Haoyang Li,Hammond Pearce,Ruoxi Sun,Bo Li,Minhui Xue
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accept By The ACM Conference on Computer and Communications Security (CCS) 2025
Abstract:The growing adoption of artificial intelligence (AI) has amplified concerns about trustworthiness, including integrity, privacy, robustness, and bias. To assess and attribute these threats, we propose ConceptLens, a generic framework that leverages pre-trained multimodal models to identify the root causes of integrity threats by analyzing Concept Shift in probing samples. ConceptLens demonstrates strong detection performance for vanilla data poisoning attacks and uncovers vulnerabilities to bias injection, such as the generation of covert advertisements through malicious concept shifts. It identifies privacy risks in unaltered but high-risk samples, filters them before training, and provides insights into model weaknesses arising from incomplete or imbalanced training data. Additionally, at the model level, it attributes concepts that the target model is overly dependent on, identifies misleading concepts, and explains how disrupting key concepts negatively impacts the model. Furthermore, it uncovers sociological biases in generative content, revealing disparities across sociological contexts. Strikingly, ConceptLens reveals how safe training and inference data can be unintentionally and easily exploited, potentially undermining safety alignment. Our study informs actionable insights to breed trust in AI systems, thereby speeding adoption and driving greater innovation.
zh
[AI-70] Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report
【速读】:该论文旨在解决生成式 AI 在网络安全领域应用受限的问题,主要挑战包括专业训练数据稀缺以及网络安全知识表示的复杂性。其解决方案的关键在于构建一个专注于网络安全的大型语言模型 Foundation-Sec-8B,该模型基于 Llama 3.1 架构,并通过在精心筛选的网络安全语料库上进行持续预训练来增强其专业知识。
链接: https://arxiv.org/abs/2504.21039
作者: Paul Kassianik,Baturay Saglam,Alexander Chen,Blaine Nelson,Anu Vellore,Massimo Aufiero,Fraser Burch,Dhruv Kedia,Avi Zohary,Sajana Weerawardhena,Aman Priyanshu,Adam Swanda,Amy Chang,Hyrum Anderson,Kojin Oshiba,Omar Santos,Yaron Singer,Amin Karbasi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:As transformer-based large language models (LLMs) increasingly permeate society, they have revolutionized domains such as software engineering, creative writing, and digital arts. However, their adoption in cybersecurity remains limited due to challenges like scarcity of specialized training data and complexity of representing cybersecurity-specific knowledge. To address these gaps, we present Foundation-Sec-8B, a cybersecurity-focused LLM built on the Llama 3.1 architecture and enhanced through continued pretraining on a carefully curated cybersecurity corpus. We evaluate Foundation-Sec-8B across both established and new cybersecurity benchmarks, showing that it matches Llama 3.1-70B and GPT-4o-mini in certain cybersecurity-specific tasks. By releasing our model to the public, we aim to accelerate progress and adoption of AI-driven tools in both public and private cybersecurity contexts.
zh
[AI-71] Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary
【速读】:该论文旨在解决生成式 AI (Generative AI) 中因对抗性攻击(即“越狱”)导致的安全协议失效问题,此类攻击能够使大型语言模型(LLMs)生成有害内容或泄露敏感信息。论文提出的解决方案关键在于利用 LLMs 的预填充(prefilling)功能,通过直接操纵后续标记的概率分布来绕过安全机制,从而控制模型输出。该方法包含两种变体:静态预填充(Static Prefilling, SP)和优化预填充(Optimized Prefilling, OP),其中 OP 通过迭代优化预填充文本以最大化攻击成功率,实验结果显示其在多个先进 LLM 上的攻击成功率高达 99.82%,显著优于基线方法。
链接: https://arxiv.org/abs/2504.21038
作者: Yakai Li,Jiekang Hu,Weiduan Sang,Luping Ma,Jing Xie,Weijuan Zhang,Aimin Yu,Shijie Zhao,Qingjia Huang,Qihang Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are designed to generate helpful and safe content. However, adversarial attacks, commonly referred to as jailbreak, can bypass their safety protocols, prompting LLMs to generate harmful content or reveal sensitive data. Consequently, investigating jailbreak methodologies is crucial for exposing systemic vulnerabilities within LLMs, ultimately guiding the continuous implementation of security enhancements by developers. In this paper, we introduce a novel jailbreak attack method that leverages the prefilling feature of LLMs, a feature designed to enhance model output constraints. Unlike traditional jailbreak methods, the proposed attack circumvents LLMs’ safety mechanisms by directly manipulating the probability distribution of subsequent tokens, thereby exerting control over the model’s output. We propose two attack variants: Static Prefilling (SP), which employs a universal prefill text, and Optimized Prefilling (OP), which iteratively optimizes the prefill text to maximize the attack success rate. Experiments on six state-of-the-art LLMs using the AdvBench benchmark validate the effectiveness of our method and demonstrate its capability to substantially enhance attack success rates when combined with existing jailbreak approaches. The OP method achieved attack success rates of up to 99.82% on certain models, significantly outperforming baseline methods. This work introduces a new jailbreak attack method in LLMs, emphasizing the need for robust content validation mechanisms to mitigate the adversarial exploitation of prefilling features. All code and data used in this paper are publicly available.
zh
[AI-72] Security Bug Report Prediction Within and Across Projects: A Comparative Study of BERT and Random Forest
【速读】:该论文旨在解决安全漏洞报告(Security Bug Reports, SBRs)的早期检测问题,以防止系统漏洞并确保系统可靠性。其解决方案的关键在于对比和分析基于深度学习的BERT模型与传统机器学习方法随机森林(Random Forest, RF)在SBR预测中的性能表现,并探索不同数据组合对模型效果的影响。研究发现,虽然RF在项目内预测中优于BERT,但引入多项目SBR数据后,BERT在跨项目预测中表现出更优的性能,尤其在包含安全与非安全漏洞报告的情况下,BERT的平均G-measure显著提升。
链接: https://arxiv.org/abs/2504.21037
作者: Farnaz Soltaniani,Mohammad Ghafari,Mohammed Sayagh
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Early detection of security bug reports (SBRs) is crucial for preventing vulnerabilities and ensuring system reliability. While machine learning models have been developed for SBR prediction, their predictive performance still has room for improvement. In this study, we conduct a comprehensive comparison between BERT and Random Forest (RF), a competitive baseline for predicting SBRs. The results show that RF outperforms BERT with a 34% higher average G-measure for within-project predictions. Adding only SBRs from various projects improves both models’ average performance. However, including both security and non-security bug reports significantly reduces RF’s average performance to 46%, while boosting BERT to its best average performance of 66%, surpassing RF. In cross-project SBR prediction, BERT achieves a remarkable 62% G-measure, which is substantially higher than RF’s.
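摘要以 G-measure 为指标但未给出定义,下面按缺陷预测文献中常见的定义(召回率 pd 与 1-误报率 pf 的调和平均)给出计算示意(Python);若论文另有定义,以原文为准,示例数字为虚构。

```python
def g_measure(tp, fp, fn, tn):
    """G-measure = 2 * pd * (1 - pf) / (pd + (1 - pf)),缺陷预测文献中的常见定义。"""
    pd = tp / (tp + fn)          # 召回率 (probability of detection)
    pf = fp / (fp + tn)          # 误报率 (probability of false alarm)
    return 2 * pd * (1 - pf) / (pd + (1 - pf))

# 虚构的混淆矩阵计数,仅演示计算
print(round(g_measure(tp=40, fp=10, fn=20, tn=130), 3))  # 0.776
```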
zh
[AI-73] Can Differentially Private Fine-tuning LLMs Protect Against Privacy Attacks?
【速读】:该论文旨在解决在对大语言模型(Large Language Models, LLMs)进行微调过程中所面临的隐私泄露问题,特别是敏感训练数据可能被模型无意中记忆并暴露的风险。其解决方案的关键在于系统评估差分隐私(Differential Privacy, DP)在不同微调方法和隐私预算下的隐私保护效果,并通过数据提取和成员推断攻击来量化实际的隐私风险,从而为隐私敏感场景下的LLM部署提供实践指导。
链接: https://arxiv.org/abs/2504.21036
作者: Hao Du,Shang Liu,Yang Cao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Fine-tuning large language models (LLMs) has become an essential strategy for adapting them to specialized tasks; however, this process introduces significant privacy challenges, as sensitive training data may be inadvertently memorized and exposed. Although differential privacy (DP) offers strong theoretical guarantees against such leakage, its empirical privacy effectiveness on LLMs remains unclear, especially under different fine-tuning methods. In this paper, we systematically investigate the impact of DP across fine-tuning methods and privacy budgets, using both data extraction and membership inference attacks to assess empirical privacy risks. Our main findings are as follows: (1) Differential privacy reduces model utility, but its impact varies significantly across different fine-tuning methods. (2) Without DP, the privacy risks of models fine-tuned with different approaches differ considerably. (3) When DP is applied, even a relatively high privacy budget can substantially lower privacy risk. (4) The privacy-utility trade-off under DP training differs greatly among fine-tuning methods, with some methods being unsuitable for DP due to severe utility degradation. Our results provide practical guidance for privacy-conscious deployment of LLMs and pave the way for future research on optimizing the privacy-utility trade-off in fine-tuning methodologies.
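下面以 Opacus 库为例给出 DP-SGD 训练循环的通用示意(Python/PyTorch),展示噪声乘子与梯度裁剪如何进入训练;模型、数据与超参数均为玩具设置,并非论文的实验配置,运行前需先安装 opacus。

```python
import torch, torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine   # 假设使用Opacus实现DP-SGD

model = nn.Linear(16, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
data = DataLoader(TensorDataset(torch.randn(256, 16),
                                torch.randint(0, 2, (256,))), batch_size=32)

engine = PrivacyEngine()
model, opt, data = engine.make_private(
    module=model, optimizer=opt, data_loader=data,
    noise_multiplier=1.0, max_grad_norm=1.0)   # 噪声与逐样本梯度裁剪决定隐私预算

for x, y in data:                 # 其余训练循环与普通SGD相同
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    opt.step()

print("epsilon ≈", engine.get_epsilon(delta=1e-5))  # 消耗的隐私预算
```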
zh
[AI-74] SAGA: A Security Architecture for Governing AI Agent ic Systems
【速读】:该论文试图解决自主代理系统中用户缺乏对代理生命周期的全面控制问题,特别是在面对潜在恶意代理时难以有效减轻其损害的问题。解决方案的关键是提出 SAGA(Security Architecture for Governing Agentic systems),该架构通过让用户将代理注册到一个中央实体(Provider)来实现对代理的管理,Provider 负责维护代理的联系信息、用户定义的访问控制策略,并协助代理在代理间通信中执行这些策略。此外,SAGA 引入了一种加密机制用于生成访问控制令牌,从而实现对代理交互的细粒度控制,兼顾安全性和性能。
链接: https://arxiv.org/abs/2504.21034
作者: Georgios Syros,Anshuman Suri,Cristina Nita-Rotaru,Alina Oprea
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Model (LLM)-based agents increasingly interact, collaborate, and delegate tasks to one another autonomously with minimal human interaction. Industry guidelines for agentic system governance emphasize the need for users to maintain comprehensive control over their agents, mitigating potential damage from malicious agents. Several proposed agentic system designs address agent identity, authorization, and delegation, but remain purely theoretical, without concrete implementation and evaluation. Most importantly, they do not provide user-controlled agent management. To address this gap, we propose SAGA, a Security Architecture for Governing Agentic systems, that offers user oversight over their agents’ lifecycle. In our design, users register their agents with a central entity, the Provider, which maintains agents’ contact information and user-defined access control policies, and helps agents enforce these policies on inter-agent communication. We introduce a cryptographic mechanism for deriving access control tokens that offers fine-grained control over an agent’s interaction with other agents, balancing security and performance considerations. We evaluate SAGA on several agentic tasks, using agents in different geolocations, and multiple on-device and cloud LLMs, demonstrating minimal performance overhead with no impact on underlying task utility in a wide range of conditions. Our architecture enables secure and trustworthy deployment of autonomous agents, accelerating the responsible adoption of this technology in sensitive environments.
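下面用 HMAC 给出“从用户主密钥派生细粒度访问控制令牌”这一思想的极简示意(Python 标准库);论文的实际密码学构造未在摘要中给出,此处的声明字段与派生方式均为假设。

```python
import hmac, hashlib, json, time

def derive_token(master_key: bytes, agent_id: str, peer_id: str,
                 scope: str, ttl_s: int = 600) -> dict:
    """示意性的令牌派生:用HMAC把(代理, 对端, 权限范围, 过期时间)绑定到主密钥。"""
    claims = {"agent": agent_id, "peer": peer_id,
              "scope": scope, "exp": int(time.time()) + ttl_s}
    msg = json.dumps(claims, sort_keys=True).encode()
    return {"claims": claims,
            "mac": hmac.new(master_key, msg, hashlib.sha256).hexdigest()}

def verify(master_key: bytes, token: dict) -> bool:
    msg = json.dumps(token["claims"], sort_keys=True).encode()
    good = hmac.compare_digest(
        token["mac"], hmac.new(master_key, msg, hashlib.sha256).hexdigest())
    return good and token["claims"]["exp"] > time.time()  # 校验完整性与时效

key = b"user-master-key"                      # 假设的用户主密钥
tok = derive_token(key, "agent-A", "agent-B", scope="read:calendar")
print(verify(key, tok))  # True
```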
zh
[AI-75] Selecting the Right LLM for eGov Explanations
【速读】:该论文试图解决如何为电子政务服务选择合适的生成式 AI (Generative AI) 模型以生成高质量解释的问题,从而提升用户对这些服务的信任度和使用率。解决方案的关键在于适应并应用一个已有的量表,以系统化地比较不同大型语言模型 (Large Language Models, LLMs) 生成解释的感知质量,并通过税务申报流程作为实例验证其适用性,最终提供一种基于用户研究的方法来选择最合适的 LLM。
链接: https://arxiv.org/abs/2504.21032
作者: Lior Limonad,Fabiana Fournier,Hadar Mulian,George Manias,Spiros Borotis,Danai Kyrkou
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 7 figures. ICEDEG 2025, Bern, Switzerland, June 2025
Abstract:The perceived quality of the explanations accompanying e-government services is key to gaining trust in these institutions, consequently amplifying further usage of these services. Recent advances in generative AI, and concretely in Large Language Models (LLMs) allow the automation of such content articulations, eliciting explanations’ interpretability and fidelity, and more generally, adapting content to various audiences. However, selecting the right LLM type for this has become a non-trivial task for e-government service providers. In this work, we adapted a previously developed scale to assist with this selection, providing a systematic approach for the comparative analysis of the perceived quality of explanations generated by various LLMs. We further demonstrated its applicability through the tax-return process, using it as an exemplar use case that could benefit from employing an LLM to generate explanations about tax refund decisions. This was attained through a user study with 128 survey respondents who were asked to rate different versions of LLM-generated explanations about tax refund decisions, providing a methodological basis for selecting the most appropriate LLM. Recognizing the practical challenges of conducting such a survey, we also began exploring the automation of this process by attempting to replicate human feedback using a selection of cutting-edge predictive techniques.
zh
[AI-76] Advancing Multi-Agent Systems Through Model Context Protocol: Architecture, Implementation, and Applications
【速读】:该论文旨在解决多智能体系统在上下文管理、协调效率和可扩展性方面面临的根本性挑战。其解决方案的关键在于引入模型上下文协议(Model Context Protocol, MCP),通过标准化的上下文共享和协调机制来提升系统的性能。该框架建立了统一的理论基础,结合先进的上下文管理技术和可扩展的协调模式,从而在企业知识管理、协作研究和分布式问题解决等领域实现了显著的性能提升。
链接: https://arxiv.org/abs/2504.21030
作者: Naveen Krishnan
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-agent systems represent a significant advancement in artificial intelligence, enabling complex problem-solving through coordinated specialized agents. However, these systems face fundamental challenges in context management, coordination efficiency, and scalable operation. This paper introduces a comprehensive framework for advancing multi-agent systems through Model Context Protocol (MCP), addressing these challenges through standardized context sharing and coordination mechanisms. We extend previous work on AI agent architectures by developing a unified theoretical foundation, advanced context management techniques, and scalable coordination patterns. Through detailed implementation case studies across enterprise knowledge management, collaborative research, and distributed problem-solving domains, we demonstrate significant performance improvements compared to traditional approaches. Our evaluation methodology provides a systematic assessment framework with benchmark tasks and datasets specifically designed for multi-agent systems. We identify current limitations, emerging research opportunities, and potential transformative applications across industries. This work contributes to the evolution of more capable, collaborative, and context-aware artificial intelligence systems that can effectively address complex real-world challenges.
zh
[AI-77] PICO: Secure Transformers via Robust Prompt Isolation and Cybersecurity Oversight
【速读】:该论文试图解决生成式 AI (Generative AI) 在面对提示注入攻击(prompt injection attacks)时的安全性和可靠性问题。其解决方案的关键在于提出一种名为 PICO(Prompt Isolation and Cybersecurity Oversight)的框架,通过双通道结构将可信系统指令与不可信用户输入进行结构性隔离,并仅通过受控的门控融合机制进行合并。此外,该框架整合了安全专家代理(Security Expert Agent)和网络安全知识图谱(Cybersecurity Knowledge Graph),以增强系统的安全推理能力,同时确保系统提示分支在训练过程中保持不可变性。
链接: https://arxiv.org/abs/2504.21029
作者: Ben Goertzel,Paulos Yibelo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose a robust transformer architecture designed to prevent prompt injection attacks and ensure secure, reliable response generation. Our PICO (Prompt Isolation and Cybersecurity Oversight) framework structurally separates trusted system instructions from untrusted user inputs through dual channels that are processed independently and merged only by a controlled, gated fusion mechanism. In addition, we integrate a specialized Security Expert Agent within a Mixture-of-Experts (MoE) framework and incorporate a Cybersecurity Knowledge Graph (CKG) to supply domain-specific reasoning. Our training design further ensures that the system prompt branch remains immutable while the rest of the network learns to handle adversarial inputs safely. This PICO framework is presented via a general mathematical formulation, then elaborated in terms of the specifics of transformer architecture, and fleshed out via hypothetical case studies including Policy Puppetry attacks. While the most effective implementation may involve training transformers in a PICO-based way from scratch, we also present a cost-effective fine-tuning approach.
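说明:摘要中的“双通道 + 受控门控融合”可用如下 PyTorch 草图直观理解(层结构、维度与门控形式均为示意性假设,并非论文给出的具体架构):

```python
# 示意代码:PICO 式双通道门控融合(假设性草图)
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.sys_enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.usr_enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # 门控只由可信系统通道计算,不可信的用户输入无法自行“开门”
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, sys_tokens, usr_tokens):
        h_sys = self.sys_enc(sys_tokens)   # 可信系统指令,独立处理
        h_usr = self.usr_enc(usr_tokens)   # 不可信用户输入,独立处理
        g = self.gate(h_sys.mean(dim=1, keepdim=True))       # 受控融合门
        return h_sys + g * h_usr.mean(dim=1, keepdim=True)   # 门控合并
```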
zh
[AI-78] Semantic-Aware Contrastive Fine-Tuning: Boosting Multimodal Malware Classification with Discriminative Embeddings
【速读】:该论文旨在解决恶意软件变种快速演化带来的分类挑战,特别是传统方法在语义嵌入重叠和二进制行为特征不对齐方面的局限性。其解决方案的关键在于提出一种对比微调(Contrastive Fine-Tuning, CFT)方法,通过基于余弦相似度选择困难负样本,优化大型语言模型(Large Language Models, LLMs)的嵌入表示,从而增强模型区分相近恶意软件家族的能力。该方法结合高相似性负样本以提升判别能力,并引入中等相似性负样本以增加嵌入多样性,实现精度与泛化的平衡。
链接: https://arxiv.org/abs/2504.21028
作者: Ivan Montoya Sanchez,Shaswata Mitra,Aritran Piplai,Sudip Mittal
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 5 figures, 5 tables
Abstract:The rapid evolution of malware variants requires robust classification methods to enhance cybersecurity. While Large Language Models (LLMs) offer potential for generating malware descriptions to aid family classification, their utility is limited by semantic embedding overlaps and misalignment with binary behavioral features. We propose a contrastive fine-tuning (CFT) method that refines LLM embeddings via targeted selection of hard negative samples based on cosine similarity, enabling LLMs to distinguish between closely related malware families. Our approach combines high-similarity negatives to enhance discriminative power and mid-tier negatives to increase embedding diversity, optimizing both precision and generalization. Evaluated on the CIC-AndMal-2020 and BODMAS datasets, our refined embeddings are integrated into a multimodal classifier within a Model-Agnostic Meta-Learning (MAML) framework in a few-shot setting. Experiments demonstrate significant improvements: our method achieves 63.15% classification accuracy with as few as 20 samples on CIC-AndMal-2020, outperforming baselines by 11–21 percentage points and surpassing prior negative sampling strategies. Ablation studies confirm the superiority of similarity-based selection over random sampling, with gains of 10-23%. Additionally, fine-tuned LLMs generate attribute-aware descriptions that generalize to unseen variants, bridging textual and binary feature gaps. This work advances malware classification by enabling nuanced semantic distinctions and provides a scalable framework for adapting LLMs to cybersecurity challenges.
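说明:下面的草图演示摘要所述“按余弦相似度分层选取负样本 + 对比损失”的一种可能写法(相似度阈值与损失形式均为示意性假设):

```python
# 示意代码:基于相似度分层的负样本选取与 InfoNCE 式对比损失(假设性草图)
import torch
import torch.nn.functional as F

def select_negatives(anchor, cands, cand_labels, anchor_label,
                     hi_band=(0.8, 1.0), mid_band=(0.4, 0.6), k=4):
    """按余弦相似度挑选高相似(困难)与中等相似负样本的索引"""
    sims = F.cosine_similarity(anchor.unsqueeze(0), cands)   # (N,)
    neg_mask = cand_labels != anchor_label                   # 仅保留其他家族的样本
    def pick(lo, hi):
        idx = torch.where(neg_mask & (sims >= lo) & (sims < hi))[0]
        return idx[sims[idx].argsort(descending=True)][:k]
    return torch.cat([pick(*hi_band), pick(*mid_band)])

def cft_loss(anchor, positive, negatives, tau=0.07):
    """将锚点拉向正样本、推离选中的负样本"""
    pos = F.cosine_similarity(anchor, positive, dim=0) / tau
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives) / tau
    return -pos + torch.logsumexp(torch.cat([pos.view(1), neg]), dim=0)
```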
zh
[AI-79] Research on CNN-BiLSTM Network Traffic Anomaly Detection Model Based on MindSpore
【速读】:该论文试图解决传统安全机制在面对日益复杂的网络架构和大量流量时,难以有效检测高频、多样化且隐蔽性强的网络攻击的问题。解决方案的关键在于提出一种融合卷积神经网络(Convolutional Neural Network, CNN)与双向长短期记忆网络(Bidirectional Long Short-Term Memory, BiLSTM)的新型网络流量异常检测模型,并基于MindSpore框架进行实现。
链接: https://arxiv.org/abs/2504.21008
作者: Qiuyan Xiang,Shuang Wu,Dongze Wu,Yuxin Liu,Zhenkai Qin
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:With the widespread adoption of the Internet of Things (IoT) and Industrial IoT (IIoT) technologies, network architectures have become increasingly complex, and the volume of traffic has grown substantially. This evolution poses significant challenges to traditional security mechanisms, particularly in detecting high-frequency, diverse, and highly covert network attacks. To address these challenges, this study proposes a novel network traffic anomaly detection model that integrates a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory (BiLSTM) network, implemented on the MindSpore framework. Comprehensive experiments were conducted using the NF-BoT-IoT dataset. The results demonstrate that the proposed model achieves 99% across accuracy, precision, recall, and F1-score, indicating its strong performance and robustness in network intrusion detection tasks.
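说明:摘要只给出网络类型而未给出维度,下面用 PyTorch 画一个结构草图以便理解(论文实际基于 MindSpore 实现,各通道与隐层大小均为任意示例):

```python
# 示意代码:CNN-BiLSTM 流量异常检测模型结构(假设性草图,论文原实现为 MindSpore)
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        # CNN 提取流量特征序列中的局部模式
        self.cnn = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # BiLSTM 建模双向时序依赖
        self.bilstm = nn.LSTM(64, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 128, n_classes)

    def forward(self, x):                  # x: (batch, time, n_features)
        z = self.cnn(x.transpose(1, 2))    # -> (batch, 64, time//2)
        out, _ = self.bilstm(z.transpose(1, 2))
        return self.head(out[:, -1])       # 取末状态做分类
```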
zh
[AI-80] Efficient Quantum-Safe Homomorphic Encryption for Quantum Computer Programs
【速读】:该论文旨在解决在量子计算环境中实现对量子程序的同态评估,并确保其对量子敌手的安全性。其解决方案的关键在于将经典同态加密提升至量子设置,通过使用模块学习误差(MLWE)格代替复合阶群,并将多项式函子推广为有界自然超函子(BNSF)。此外,通过秘密去极化BNSF掩码隐藏振幅,量子状态以MLWE密文对形式存储,并通过qIND-CPA游戏形式化安全性,同时采用四重混合归约到决策性MLWE问题,从而保证了方案的安全性和实用性。
链接: https://arxiv.org/abs/2504.21235
作者: Ben Goertzel
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a lattice-based scheme for homomorphic evaluation of quantum programs and proofs that remains secure against quantum adversaries. Classical homomorphic encryption is lifted to the quantum setting by replacing composite-order groups with Module Learning-With-Errors (MLWE) lattices and by generalizing polynomial functors to bounded natural super functors (BNSFs). A secret depolarizing BNSF mask hides amplitudes, while each quantum state is stored as an MLWE ciphertext pair. We formalize security with the qIND-CPA game that allows coherent access to the encryption oracle and give a four-hybrid reduction to decisional MLWE. The design also covers practical issues usually left open. A typed QC-bridge keeps classical bits produced by measurements encrypted yet still usable as controls, with weak-measurement semantics for expectation-value workloads. Encrypted Pauli twirls add circuit privacy. If a fixed knowledge base is needed, its axioms are shipped as MLWE “capsules”; the evaluator can use them but cannot read them. A rho-calculus driver schedules encrypted tasks across several QPUs and records an auditable trace on an RChain-style ledger. Performance analysis shows that the extra lattice arithmetic fits inside today’s QPU idle windows: a 100-qubit, depth-10^3 teleportation-based proof runs in about 10 ms, the public key (seed only) is 32 bytes, and even a CCA-level key stays below 300 kB. A photonic Dirac-3 prototype that executes homomorphic teleportation plus knowledge-base-relative amplitude checks appears feasible with current hardware. These results indicate that fully homomorphic, knowledge-base-aware quantum reasoning is compatible with near-term quantum clouds and standard post-quantum security assumptions.
zh
[AI-81] Turning Up the Heat: Assessing 2-m Temperature Forecast Errors in AI Weather Prediction Models During Heat Waves
【速读】:该论文试图解决传统数值天气预测(NWP)模型在中短期和次季节至季节(S2S)时间尺度上对极端高温预测能力不足的问题,以及人工智能天气预测(AIWP)模型在极端天气事件预测中的表现尚不明确的问题。解决方案的关键在于利用两种AIWP模型(Google GraphCast和Pangu-Weather)与一种传统NWP模型(NOAA UFS GEFS)进行对比分析,评估其在不同季节和区域对60次热浪事件的2米温度预测性能,并探讨AIWP模型在中短期和S2S时间尺度上的可预报性潜力。
链接: https://arxiv.org/abs/2504.21195
作者: Kelsey E. Ennis,Elizabeth A. Barnes,Marybeth C. Arcodia,Martin A. Fernandez,Eric D. Maloney
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Extreme heat is the deadliest weather-related hazard in the United States. Furthermore, it is increasing in intensity, frequency, and duration, making skillful forecasts vital to protecting life and property. Traditional numerical weather prediction (NWP) models struggle with extreme heat for medium-range and subseasonal-to-seasonal (S2S) timescales. Meanwhile, artificial intelligence-based weather prediction (AIWP) models are progressing rapidly. However, it is largely unknown how well AIWP models forecast extremes, especially for medium-range and S2S timescales. This study investigates 2-m temperature forecasts for 60 heat waves across the four boreal seasons and over four CONUS regions at lead times up to 20 days, using two AIWP models (Google GraphCast and Pangu-Weather) and one traditional NWP model (NOAA Unified Forecast System Global Ensemble Forecast System (UFS GEFS)). First, case study analyses show that both AIWP models and the UFS GEFS exhibit consistent cold biases on regional scales in the 5-10 days of lead time before heat wave onset. GraphCast is the more skillful AIWP model, outperforming UFS GEFS and Pangu-Weather in most locations. Next, the two AIWP models are isolated and analyzed across all heat waves and seasons, with events split among the model’s testing (2018-2023) and training (1979-2017) periods. There are cold biases before and during the heat waves in both models and all seasons, except Pangu-Weather in winter, which exhibits a mean warm bias before heat wave onset. Overall, results offer encouragement that AIWP models may be useful for medium-range and S2S predictability of extreme heat.
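说明:这类评估的核心诊断量是“预报减分析场”的平均误差随提前期的变化(负值即上文所述的冷偏差),一个极简的 numpy 计算示意如下(数组形状为假设):

```python
# 示意代码:按提前期统计 2 米气温预报的区域平均偏差(假设性草图)
import numpy as np

def mean_bias_by_lead(forecasts, analysis):
    """forecasts: (事件数, 提前期数, ny, nx); analysis: (事件数, ny, nx)"""
    err = forecasts - analysis[:, None, :, :]   # 在提前期维度上广播
    return err.mean(axis=(0, 2, 3))             # 返回 (提前期数,) 的平均偏差
```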
zh
[AI-82] Evaluation and Verification of Physics-Informed Neural Models of the Grad-Shafranov Equation
【速读】:该论文旨在解决在磁约束聚变装置中,如何利用物理信息神经网络(Physics-Informed Neural Networks, PINNs)对广义的平衡条件进行建模,并在多种边界条件下实现模型的泛化问题。其解决方案的关键在于设计一种将边界点作为网络输入的PINN架构,以增强模型对不同工况的适应能力,同时通过与傅里叶神经算子(Fourier Neural Operator, FNO)模型的对比,验证了PINN在精度和推理速度上的优越性,并借助网络验证工具Marabou实现了对模型的可靠性分析。
链接: https://arxiv.org/abs/2504.21155
作者: Fauzan Nazranda Rizqa,Matthew Hole,Charles Gretton
机构: 未知
类目: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 9 pages, 4 figures
Abstract:Our contributions are motivated by fusion reactors that rely on maintaining magnetohydrodynamic (MHD) equilibrium, where the balance between plasma pressure and confining magnetic fields is required for stable operation. In axisymmetric tokamak reactors in particular, and under the assumption of toroidal symmetry, this equilibrium can be mathematically modelled using the Grad-Shafranov Equation (GSE). Recent works have demonstrated the potential of using Physics-Informed Neural Networks (PINNs) to model the GSE. Existing studies did not examine realistic scenarios in which a single network generalizes to a variety of boundary conditions. Addressing that limitation, we evaluate a PINN architecture that incorporates boundary points as network inputs. Additionally, we compare PINN model accuracy and inference speeds with a Fourier Neural Operator (FNO) model. Finding the PINN model to be the most performant, and accurate in our setting, we use the network verification tool Marabou to perform a range of verification tasks. Although we find some discrepancies between evaluations of the networks natively in PyTorch, compared to via Marabou, we are able to demonstrate useful and practical verification workflows. Our study is the first investigation of verification of such networks.
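说明:“将边界点作为网络输入”的 PINN 思路可草拟如下(PyTorch 示意,Grad-Shafranov 方程右端项留作用户回调,网络宽度等均为假设):

```python
# 示意代码:以边界信息为额外输入的 Grad-Shafranov PINN(假设性草图)
import torch
import torch.nn as nn

class GSPinn(nn.Module):
    def __init__(self, n_bdry_feats: int, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + n_bdry_feats, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, 1),
        )

    def forward(self, R, Z, bdry):          # R, Z: (N,1);bdry: (N,k) 边界描述子
        return self.net(torch.cat([R, Z, bdry], dim=-1))

def gs_residual(model, R, Z, bdry, rhs):
    """GSE 残差:psi_RR - psi_R / R + psi_ZZ - rhs(R, Z, psi)"""
    R = R.requires_grad_(True); Z = Z.requires_grad_(True)
    psi = model(R, Z, bdry)
    g = lambda out, x: torch.autograd.grad(out.sum(), x, create_graph=True)[0]
    psi_R, psi_Z = g(psi, R), g(psi, Z)
    psi_RR, psi_ZZ = g(psi_R, R), g(psi_Z, Z)
    return psi_RR - psi_R / R + psi_ZZ - rhs(R, Z, psi)
```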
zh
机器学习
[LG-0] Stable Trajectory Clustering: An Efficient Split and Merge Algorithm
链接: https://arxiv.org/abs/2504.21808
作者: Atieh Rahmani,Mansoor Davoodi,Justin M. Calabrese
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG)
备注:
Abstract:Clustering algorithms group data points by characteristics to identify patterns. Over the past two decades, researchers have extended these methods to analyze trajectories of humans, animals, and vehicles, studying their behavior and movement across applications. This paper presents whole-trajectory clustering and sub-trajectory clustering algorithms based on DBSCAN line segment clustering, which encompasses two key events: split and merge of line segments. The events are employed by object movement history and the average Euclidean distance between line segments. In this framework, whole-trajectory clustering considers entire entities’ trajectories, whereas sub-trajectory clustering employs a sliding window model to identify similar sub-trajectories. Many existing trajectory clustering algorithms respond to temporary anomalies in data by splitting trajectories, which often obscures otherwise consistent clustering patterns and leads to less reliable insights. We introduce the stable trajectory clustering algorithm, which leverages the mean absolute deviation concept to demonstrate that selective omission of transient deviations not only preserves the integrity of clusters but also improves their stability and interpretability. We run all proposed algorithms on real trajectory datasets to illustrate their effectiveness and sensitivity to parameter variations.
[LG-1] Traceback of Poisoning Attacks to Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2504.21668
作者: Baolei Zhang,Haoran Xin,Minghong Fang,Zhuqing Liu,Biao Yi,Tong Li,Zheli Liu
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted by The Web Conference 2025
Abstract:Large language models (LLMs) integrated with retrieval-augmented generation (RAG) systems improve accuracy by leveraging external knowledge sources. However, recent research has revealed RAG’s susceptibility to poisoning attacks, where the attacker injects poisoned texts into the knowledge database, leading to attacker-desired responses. Existing defenses, which predominantly focus on inference-time mitigation, have proven insufficient against sophisticated attacks. In this paper, we introduce RAGForensics, the first traceback system for RAG, designed to identify poisoned texts within the knowledge database that are responsible for the attacks. RAGForensics operates iteratively, first retrieving a subset of texts from the database and then utilizing a specially crafted prompt to guide an LLM in detecting potential poisoning texts. Empirical evaluations across multiple datasets demonstrate the effectiveness of RAGForensics against state-of-the-art poisoning attacks. This work pioneers the traceback of poisoned texts in RAG systems, providing a practical and promising defense mechanism to enhance their security.
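说明:摘要描述的“迭代检索 + LLM 判别”溯源流程大致如下(retrieve、llm_judge 与轮数均为占位假设,并非论文的实际实现):

```python
# 示意代码:RAG 投毒文本迭代溯源流程(假设性草图)
def rag_forensics(query, knowledge_db, retrieve, llm_judge,
                  rounds: int = 3, k: int = 10):
    """返回被判定为可疑投毒的文本集合"""
    seen, flagged = set(), set()
    for _ in range(rounds):
        # 1) 从知识库检索一批尚未检查过的候选文本
        batch = [t for t in retrieve(query, knowledge_db, top_k=k) if t not in seen]
        for text in batch:
            seen.add(text)
            # 2) 用专门设计的提示词让 LLM 判断该文本是否诱导攻击者期望的回答
            if llm_judge(query=query, candidate=text):
                flagged.add(text)
        # 3) 移除已标记文本,让下一轮能浮现更深层的可疑文本
        knowledge_db = [t for t in knowledge_db if t not in flagged]
    return flagged
```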
[LG-2] On Advancements of the Forward-Forward Algorithm
链接: https://arxiv.org/abs/2504.21662
作者: Mauricio Ortiz Torres,Markus Lange,Arne P. Raulf
类目: Machine Learning (cs.LG)
备注: This work has been submitted to the IEEE for possible publication
Abstract:The Forward-Forward algorithm has evolved in machine learning research, tackling more complex tasks that mimic real-life applications. In the last years, it has been improved by several techniques to perform better than its original version, handling a challenging dataset like CIFAR10 without losing its flexibility and low memory usage. We have shown in our results that improvements are achieved through a combination of convolutional channel grouping, learning rate schedules, and independent block structures during training that lead to a 20% decrease in test error percentage. Additionally, to approach further implementations on low-capacity hardware projects we have presented a series of lighter models that achieve low test error percentages within (21 ± 6)% and number of trainable parameters between 164,706 and 754,386. This also serves as a basis for our future study on complete verification and validation of these kinds of neural networks.
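说明:作为背景,Forward-Forward 的核心是逐层局部训练:使“优度”(激活平方和)在正样本上高于阈值、在负样本上低于阈值。下面是该基本更新的简化示意(维度与阈值均为示例,非本文的改进版本):

```python
# 示意代码:Forward-Forward 单层局部训练(基本算法示意)
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    def __init__(self, d_in: int, d_out: int, theta: float = 2.0, lr: float = 1e-3):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)
        self.theta = theta
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # 归一化使得只有输入方向(而非模长)携带信息
        return F.relu(self.lin(F.normalize(x, dim=1)))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)   # 正样本优度
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)   # 负样本优度
        # 逻辑损失:推动 g_pos > theta 且 g_neg < theta
        loss = F.softplus(torch.cat([self.theta - g_pos,
                                     g_neg - self.theta])).mean()
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        # 输出 detach 后传给下一层,实现无需反向传播的逐层训练
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```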
[LG-3] Real Time Semantic Segmentation of High Resolution Automotive LiDAR Scans
链接: https://arxiv.org/abs/2504.21602
作者: Hannes Reichert,Benjamin Serfling,Elijah Schüssler,Kerim Turacan,Konrad Doll,Bernhard Sick
类目: Robotics (cs.RO); Machine Learning (cs.LG)
备注:
Abstract:Numerous recent works emphasize the importance of semantic segmentation of LiDAR data as a critical component to the development of driver-assistance systems and autonomous vehicles. However, many state-of-the-art methods are tested on outdated, lower-resolution LiDAR sensors and struggle with real-time constraints. This study introduces a novel semantic segmentation framework tailored for modern high-resolution LiDAR sensors that addresses both accuracy and real-time processing demands. We propose a novel LiDAR dataset collected by a cutting-edge automotive 128-layer LiDAR in urban traffic scenes. Furthermore, we propose a semantic segmentation method utilizing surface normals as strong input features. Our approach bridges the gap between cutting-edge research and practical automotive applications. Additionally, we provide a Robot Operating System (ROS2) implementation that we operate on our research vehicle. Our dataset and code are publicly available: this https URL.
[LG-4] Low-rank computation of the posterior mean in Multi-Output Gaussian Processes
链接: https://arxiv.org/abs/2504.21527
作者: Sebastian Esche,Martin Stoll
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
备注:
Abstract:Gaussian processes (GP) are a versatile tool in machine learning and computational science. We here consider the case of multi-output Gaussian processes (MOGP) and present low-rank approaches for efficiently computing the posterior mean of a MOGP. Starting from low-rank spatio-temporal data we consider a structured covariance function, assuming separability across space and time. This separability, in turn, gives a decomposition of the covariance matrix into a Kronecker product of individual covariance matrices. Incorporating the typical noise term to the model then requires the solution of a large-scale Stein equation for computing the posterior mean. For this, we propose efficient low-rank methods based on a combination of a LRPCG method with the Sylvester equation solver KPIK adjusted for solving Stein equations. We test the developed method on real world street network graphs by using graph filters as covariance matrices. Moreover, we propose a degree-weighted average covariance matrix, which can be employed under specific assumptions to achieve more efficient convergence.
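说明:在可分协方差假设下,后验均值的关键一步等价于求解 Stein 方程 K_s X K_t + σ²X = Y。小规模时可用特征分解直接求解,如下示意;论文的贡献正是用 LRPCG/KPIK 等低秩方法替代这一 O(n³) 稠密解法:

```python
# 示意代码:Kronecker 结构下 Stein 方程的稠密基线解法(特征分解技巧)
import numpy as np

def stein_solve(K_s, K_t, Y, s2):
    """求解 K_s @ X @ K_t + s2 * X = Y,即 (K_t ⊗ K_s + s2·I) vec(X) = vec(Y)"""
    ls, Us = np.linalg.eigh(K_s)        # K_s = Us diag(ls) Us^T
    lt, Ut = np.linalg.eigh(K_t)        # K_t = Ut diag(lt) Ut^T
    Yt = Us.T @ Y @ Ut                  # 旋转到联合特征基
    X = Yt / (np.outer(ls, lt) + s2)    # 按特征值逐元素求解
    return Us @ X @ Ut.T                # 旋转回原坐标

# 训练点上的后验均值随后为 M = K_s @ stein_solve(K_s, K_t, Y, s2) @ K_t
```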
[LG-5] Deep Learning Optimization Using Self-Adaptive Weighted Auxiliary Variables
链接: https://arxiv.org/abs/2504.21501
作者: Yaru Liu,Yiqi Gu,Michael K. Ng
类目: Machine Learning (cs.LG)
备注: 32 pages, 11 figures
Abstract:In this paper, we develop a new optimization framework for the least squares learning problem via fully connected neural networks or physics-informed neural networks. The gradient descent sometimes behaves inefficiently in deep learning because of the high non-convexity of loss functions and the vanishing gradient issue. Our idea is to introduce auxiliary variables to separate the layers of the deep neural networks and reformulate the loss functions for ease of optimization. We design the self-adaptive weights to preserve the consistency between the reformulated loss and the original mean squared loss, which guarantees that optimizing the new loss helps optimize the original problem. Numerical experiments are presented to verify the consistency and show the effectiveness and robustness of our models over gradient descent.
[LG-6] Whispers of Data: Unveiling Label Distributions in Federated Learning Through Virtual Client Simulation
链接: https://arxiv.org/abs/2504.21436
作者: Zhixuan Ma,Haichang Gao,Junxiang Huang,Ping Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
备注:
Abstract:Federated Learning enables collaborative training of a global model across multiple geographically dispersed clients without the need for data sharing. However, it is susceptible to inference attacks, particularly label inference attacks. Existing studies on label distribution inference are sensitive to the specific settings of the victim client and typically underperform under defensive strategies. In this study, we propose a novel label distribution inference attack that is stable and adaptable to various scenarios. Specifically, we estimate the size of the victim client’s dataset and construct several virtual clients tailored to the victim client. We then quantify the temporal generalization of each class label for the virtual clients and utilize the variation in temporal generalization to train an inference model that predicts the label distribution proportions of the victim client. We validate our approach on multiple datasets, including MNIST, Fashion-MNIST, FER2013, and AG-News. The results demonstrate the superiority of our method compared to state-of-the-art techniques. Furthermore, our attack remains effective even under differential privacy defense mechanisms, underscoring its potential for real-world applications.
[LG-7] Enhanced Semi-Supervised Stamping Process Monitoring with Physically-Informed Feature Extraction
链接: https://arxiv.org/abs/2504.21389
作者: Jianyu Zhang,Jianshe Feng,Yizhang Zhu,Fanyu Qi
类目: Machine Learning (cs.LG)
备注: 19 pages, 14 figures
Abstract:In tackling frequent anomalies in stamping processes, this study introduces a novel semi-supervised in-process anomaly monitoring framework, utilizing accelerometer signals and physics information, to capture the process anomaly effectively. The proposed framework facilitates the construction of a monitoring model with imbalanced sample distribution, enabling real-time in-process condition monitoring that helps prevent batch anomalies, reduce batch defect risk, and enhance production yield. Firstly, to effectively capture key features from raw data containing redundant information, a hybrid feature extraction algorithm is proposed to utilize data-driven methods and physical mechanisms simultaneously. Secondly, to address the challenge brought by imbalanced sample distribution, a semi-supervised anomaly detection model is established, which merely employs normal samples to build a golden baseline model, and a novel deviation score is proposed to quantify the anomaly level of each online stamping stroke. The effectiveness of the proposed feature extraction method is validated with various classification algorithms. A real-world in-process dataset from a stamping manufacturing workshop is employed to illustrate the superiority of the proposed semi-supervised framework with enhanced performance for process anomaly monitoring.
[LG-8] Synergy-CLIP: Extending CLIP with Multi-modal Integration for Robust Representation Learning
链接: https://arxiv.org/abs/2504.21375
作者: Sangyeon Cho,Jangyeong Jeon,Mingi Kim,Junyeong Kim
类目: Machine Learning (cs.LG)
备注: Multi-modal, Multi-modal Representation Learning, Missing Modality, Missing Modality Reconstruction, Speech and Multi-modality, Vision and Language
Abstract:Multi-modal representation learning has become a pivotal area in artificial intelligence, enabling the integration of diverse modalities such as vision, text, and audio to solve complex problems. However, existing approaches predominantly focus on bimodal interactions, such as image-text pairs, which limits their ability to fully exploit the richness of multi-modal data. Furthermore, the integration of modalities in equal-scale environments remains underexplored due to the challenges of constructing large-scale, balanced datasets. In this study, we propose Synergy-CLIP, a novel framework that extends the contrastive language-image pre-training (CLIP) architecture to enhance multi-modal representation learning by integrating visual, textual, and audio modalities. Unlike existing methods that focus on adapting individual modalities to vanilla-CLIP, Synergy-CLIP aligns and captures latent information across three modalities equally. To address the high cost of constructing large-scale multi-modal datasets, we introduce VGG-sound+, a triple-modal dataset designed to provide equal-scale representation of visual, textual, and audio data. Synergy-CLIP is validated on various downstream tasks, including zero-shot classification, where it outperforms existing baselines. Additionally, we introduce a missing modality reconstruction task, demonstrating Synergy-CLIP’s ability to extract synergy among modalities in realistic application scenarios. These contributions provide a robust foundation for advancing multi-modal representation learning and exploring new research directions.
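说明:摘要所述“对三种模态同等对齐”可以理解为把 CLIP 式 InfoNCE 对称地施加到全部三个模态对上,示意如下(温度与等权平均为假设):

```python
# 示意代码:三模态对称对比损失(CLIP 式 InfoNCE 推广到三个模态对,假设性草图)
import torch
import torch.nn.functional as F

def pair_nce(a, b, tau: float = 0.07):
    """单个模态对的对称 InfoNCE:批内配对为正样本,其余为负样本"""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / tau                       # (batch, batch) 相似度矩阵
    target = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))

def trimodal_loss(v, t, au):                       # 视觉 / 文本 / 音频嵌入
    return (pair_nce(v, t) + pair_nce(v, au) + pair_nce(t, au)) / 3.0
```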
[LG-9] Generative QoE Modeling: A Lightweight Approach for Telecom Networks
链接: https://arxiv.org/abs/2504.21353
作者: Vinti Nayar,Kanica Sachdev,Brejesh Lall
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Quality of Experience (QoE) prediction plays a crucial role in optimizing resource management and enhancing user satisfaction across both telecommunication and OTT services. While recent advances predominantly rely on deep learning models, this study introduces a lightweight generative modeling framework that balances computational efficiency, interpretability, and predictive accuracy. By validating the use of Vector Quantization (VQ) as a preprocessing technique, continuous network features are effectively transformed into discrete categorical symbols, enabling integration with a Hidden Markov Model (HMM) for temporal sequence modeling. This VQ-HMM pipeline enhances the model’s capacity to capture dynamic QoE patterns while supporting probabilistic inference on new and unseen data. Experimental results on publicly available time-series datasets incorporating both objective indicators and subjective QoE scores demonstrate the viability of this approach in real-time and resource-constrained environments, where inference latency is also critical. The framework offers a scalable alternative to complex deep learning methods, particularly in scenarios with limited computational resources or where latency constraints are critical.
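说明:VQ-HMM 流水线可用 scikit-learn 的 KMeans 充当矢量量化器、hmmlearn 的 CategoricalHMM 建模符号序列来草拟(假设环境装有 hmmlearn 且其较新版本提供 CategoricalHMM;聚类数与状态数均为示例):

```python
# 示意代码:VQ(KMeans)+ HMM 的 QoE 序列建模流水线(假设性草图)
import numpy as np
from sklearn.cluster import KMeans
from hmmlearn import hmm

def fit_vq_hmm(kpi_windows, n_symbols: int = 16, n_states: int = 4):
    """kpi_windows: 每个会话一个 (T_i, n_features) 的连续网络指标数组"""
    stacked = np.vstack(kpi_windows)
    vq = KMeans(n_clusters=n_symbols, n_init=10).fit(stacked)    # 学习码本
    seqs = [vq.predict(w).reshape(-1, 1) for w in kpi_windows]   # 离散符号序列
    model = hmm.CategoricalHMM(n_components=n_states, n_iter=100)
    model.fit(np.vstack(seqs), lengths=[len(s) for s in seqs])
    return vq, model

# 推理时 model.score(vq.predict(new_window).reshape(-1, 1)) 给出对数似然,
# 可用于对新会话做概率化的 QoE 推断
```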
[LG-10] A Memetic Algorithm based on Variational Autoencoder for Black-Box Discrete Optimization with Epistasis among Parameters CEC2025
链接: https://arxiv.org/abs/2504.21338
作者: Aoi Kato,Kenta Kojima,Masahiro Nomura,Isao Ono
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
备注: IEEE CEC 2025 (Poster)
Abstract:Black-box discrete optimization (BB-DO) problems arise in many real-world applications, such as neural architecture search and mathematical model estimation. A key challenge in BB-DO is epistasis among parameters where multiple variables must be modified simultaneously to effectively improve the objective function. Estimation of Distribution Algorithms (EDAs) provide a powerful framework for tackling BB-DO problems. In particular, an EDA leveraging a Variational Autoencoder (VAE) has demonstrated strong performance on relatively low-dimensional problems with epistasis while reducing computational cost. Meanwhile, evolutionary algorithms such as DSMGA-II and P3, which integrate bit-flip-based local search with linkage learning, have shown excellent performance on high-dimensional problems. In this study, we propose a new memetic algorithm that combines VAE-based sampling with local search. The proposed method inherits the strengths of both VAE-based EDAs and local search-based approaches: it effectively handles high-dimensional problems with epistasis among parameters without incurring excessive computational overhead. Experiments on NK landscapes – a challenging benchmark for BB-DO involving epistasis among parameters – demonstrate that our method outperforms state-of-the-art VAE-based EDA methods, as well as leading approaches such as P3 and DSMGA-II.
[LG-11] Multi-level datasets training method in Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2504.21328
作者: Yao-Hsuan Tsai,Hsiao-Tung Juan,Pao-Hsiung Chiu,Chao-An Lin
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Fluid Dynamics (physics.flu-dyn)
备注: 33 pages, 12 figures
Abstract:Physics-Informed Neural Networks have emerged as a promising methodology for solving PDEs, gaining significant attention in computer science and various physics-related fields. Despite their demonstrated ability to incorporate physical laws for versatile applications, PINNs still struggle with problems that are stiff to solve and/or have high-frequency components in the solutions, resulting in accuracy and convergence issues. This may not only increase computational costs, but also lead to accuracy loss or solution divergence. In this study, an alternative approach is proposed to mitigate the above-mentioned problems. Inspired by the multi-grid method in the CFD community, the underlying idea of the current approach is to efficiently remove different frequency errors via training with different levels of training samples, resulting in a simpler way to improve training accuracy without spending time fine-tuning neural network structures, loss weights, or hyperparameters. To demonstrate the efficacy of the current approach, we first investigate a canonical 1D ODE with a high-frequency component and a 2D convection-diffusion equation with a V-cycle training strategy. Finally, the current method is employed for the classical benchmark problem of steady lid-driven cavity flows at different Reynolds numbers, to investigate its applicability and efficacy for problems involving multiple modes of high and low frequency. By virtue of various training sequence modes, predictions achieve 30% to 60% accuracy improvement. We also investigate the synergies between the current method and transfer learning techniques for more challenging problems (i.e., higher Re). The present results also reveal that the current framework can produce good predictions even for the case of Re=5000, demonstrating the ability to solve complex high-frequency PDEs.
[LG-12] A Generalized Meta Federated Learning Framework with Theoretical Convergence Guarantees
链接: https://arxiv.org/abs/2504.21327
作者: Mohammad Vahid Jamali,Hamid Saber,Jung Hyun Bae
类目: Machine Learning (cs.LG)
备注:
Abstract:Meta federated learning (FL) is a personalized variant of FL, where multiple agents collaborate on training an initial shared model without exchanging raw data samples. The initial model should be trained in a way that current or new agents can easily adapt it to their local datasets after one or a few fine-tuning steps, thus improving the model personalization. Conventional meta FL approaches minimize the average loss of agents on the local models obtained after one step of fine-tuning. In practice, agents may need to apply several fine-tuning steps to adapt the global model to their local data, especially under highly heterogeneous data distributions across agents. To this end, we present a generalized framework for the meta FL by minimizing the average loss of agents on their local model after any arbitrary number ν of fine-tuning steps. For this generalized framework, we present a variant of the well-known federated averaging (FedAvg) algorithm and conduct a comprehensive theoretical convergence analysis to characterize the convergence speed as well as behavior of the meta loss functions in both the exact and approximated cases. Our experiments on real-world datasets demonstrate superior accuracy and faster convergence for the proposed scheme compared to conventional approaches.
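说明:广义元联邦学习的目标是“共享初始化经 ν 步本地微调后的平均损失”。单个客户端的内循环可草拟如下(一阶近似,不回传二阶梯度;步长与批次调度均为示例):

```python
# 示意代码:单客户端的 ν 步本地微调与元损失计算(一阶近似,假设性草图)
import copy
import torch

def local_meta_loss(global_model, loss_fn, batches, nu: int, lr_in: float = 0.01):
    """返回共享模型经 nu 步本地微调后的评估损失与适配后参数"""
    model = copy.deepcopy(global_model)            # 从共享初始化出发
    opt = torch.optim.SGD(model.parameters(), lr=lr_in)
    for step in range(nu):                         # ν 步本地适配
        x, y = batches[step % len(batches)]
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    x, y = batches[nu % len(batches)]              # 在适配后的模型上评估
    return loss_fn(model(x), y).item(), model.state_dict()

# 服务器端随后按 FedAvg 方式聚合各客户端适配后的参数/更新
```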
[LG-13] Redundancy Analysis and Mitigation for Machine Learning-Based Process Monitoring of Additive Manufacturing
链接: https://arxiv.org/abs/2504.21317
作者: Jiarui Xie,Yaoyao Fiona Zhao
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 13 pages, 5 figures, 2 tables. Accepted by IDETC-CIE 2025
Abstract:The deployment of machine learning (ML)-based process monitoring systems has significantly advanced additive manufacturing (AM) by enabling real-time defect detection, quality assessment, and process optimization. However, redundancy is a critical yet often overlooked challenge in the deployment and operation of ML-based AM process monitoring systems. Excessive redundancy leads to increased equipment costs, compromised model performance, and high computational requirements, posing barriers to industrial adoption. However, existing research lacks a unified definition of redundancy and a systematic framework for its evaluation and mitigation. This paper defines redundancy in ML-based AM process monitoring and categorizes it into sample-level, feature-level, and model-level redundancy. A comprehensive multi-level redundancy mitigation (MLRM) framework is proposed, incorporating advanced methods such as data registration, downscaling, cross-modality knowledge transfer, and model pruning to systematically reduce redundancy while improving model performance. The framework is validated through an ML-based in-situ defect detection case study for directed energy deposition (DED), demonstrating a 91% reduction in latency, a 47% decrease in error rate, and a 99.4% reduction in storage requirements. Additionally, the proposed approach lowers sensor costs and energy consumption, enabling a lightweight, cost-effective, and scalable monitoring system. By defining redundancy and introducing a structured mitigation framework, this study establishes redundancy analysis and mitigation as a key enabler of efficient ML-based process monitoring in production environments.
[LG-14] Capturing Conditional Dependence via Auto-regressive Diffusion Models
链接: https://arxiv.org/abs/2504.21314
作者: Xunpeng Huang,Yujin Han,Difan Zou,Yian Ma,Tong Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Diffusion models have demonstrated appealing performance in both image and video generation. However, many works discover that they struggle to capture important, high-level relationships that are present in the real world. For example, they fail to learn physical laws from data, and even fail to understand that the objects in the world exist in a stable fashion. This is due to the fact that important conditional dependence structures are not adequately captured in the vanilla diffusion models. In this work, we initiate an in-depth study on strengthening the diffusion model to capture the conditional dependence structures in the data. In particular, we examine the efficacy of the auto-regressive (AR) diffusion models for such purpose and develop the first theoretical results on the sampling error of AR diffusion models under (possibly) the mildest data assumption. Our theoretical findings indicate that, compared with typical diffusion models, the AR variant produces samples with a reduced gap in approximating the data conditional distribution. On the other hand, the overall inference time of the AR-diffusion models is only moderately larger than that for the vanilla diffusion models, making them still practical for large scale applications. We also provide empirical results showing that when there is clear conditional dependence structure in the data, the AR diffusion models captures such structure, whereas vanilla DDPM fails to do so. On the other hand, when there is no obvious conditional dependence across patches of the data, AR diffusion does not outperform DDPM.
[LG-15] Unsupervised Feature Transformation via In-context Generation, Generator-critic LLM Agents, and Duet-play Teaming IJCAI2025
链接: https://arxiv.org/abs/2504.21304
作者: Nanxu Gong,Xinyuan Wang,Wangyang Ying,Haoyue Bai,Sixun Dong,Haifeng Chen,Yanjie Fu
类目: Machine Learning (cs.LG)
备注: Accepted to IJCAI 2025
Abstract:Feature transformation involves generating a new set of features from the original dataset to enhance the data’s utility. In certain domains like material performance screening, dimensionality is large and collecting labels is expensive and lengthy. It highly necessitates transforming feature spaces efficiently and without supervision to enhance data readiness and AI utility. However, existing methods fall short in efficient navigation of a vast space of feature combinations, and are mostly designed for supervised settings. To fill this gap, our unique perspective is to leverage a generator-critic duet-play teaming framework using LLM agents and in-context learning to derive pseudo-supervision from unsupervised data. The framework consists of three interconnected steps: (1) Critic agent diagnoses data to generate actionable advice, (2) Generator agent produces tokenized feature transformations guided by the critic’s advice, and (3) Iterative refinement ensures continuous improvement through feedback between agents. The generator-critic framework can be generalized to human-agent collaborative generation, by replacing the critic agent with human experts. Extensive experiments demonstrate that the proposed framework outperforms even supervised baselines in feature transformation efficiency, robustness, and practical applicability across diverse datasets.
[LG-16] Power Flow Approximations for Multiphase Distribution Networks using Gaussian Processes
链接: https://arxiv.org/abs/2504.21260
作者: Daniel Glover,Parikshit Pareek,Deepjyoti Deka,Anamika Dubey
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
备注: 5 pages, 7 figures, Accepted at 2025 IEEE PES General Meeting
Abstract:Learning-based approaches are increasingly leveraged to manage and coordinate the operation of grid-edge resources in active power distribution networks. Among these, model-based techniques stand out for their superior data efficiency and robustness compared to model-free methods. However, effective model learning requires a learning-based approximator for the underlying power flow model. This study extends existing work by introducing a data-driven power flow method based on Gaussian Processes (GPs) to approximate the multiphase power flow model, by mapping net load injections to nodal voltages. Simulation results using the IEEE 123-bus and 8500-node distribution test feeders demonstrate that the trained GP model can reliably predict the nonlinear power flow solutions with minimal training data. We also conduct a comparative analysis of the training efficiency and testing performance of the proposed GP-based power flow approximator against a deep neural network-based approximator, highlighting the advantages of our data-efficient approach. Results over realistic operating conditions show that despite an 85% reduction in the training sample size (corresponding to a 92.8% improvement in training time), GP models produce a 99.9% relative reduction in mean absolute error compared to the baselines of deep neural networks.
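说明:“净负荷注入 → 节点电压”的 GP 映射可用 scikit-learn 草拟为每个输出维度一个独立 GP(核函数与数据形状均为示意假设,未体现论文的多相建模细节):

```python
# 示意代码:数据驱动的 GP 潮流近似(每个节点电压一个独立 GP,假设性草图)
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def fit_gp_power_flow(injections, voltages):
    """injections: (样本数, 负荷维数);voltages: (样本数, 节点数)"""
    kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(injections.shape[1]))
    gps = []
    for j in range(voltages.shape[1]):             # 每个输出维度独立拟合
        gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-6, normalize_y=True)
        gps.append(gp.fit(injections, voltages[:, j]))
    return gps

def predict_voltages(gps, injections):
    return np.column_stack([gp.predict(injections) for gp in gps])
```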
[LG-17] LSTM+Geo with XGBoost Filtering: A Novel Approach for Race and Ethnicity Imputation with Reduced Bias
链接: https://arxiv.org/abs/2504.21259
作者: S. Chalavadi,A. Pastor,T. Leitch
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:Accurate imputation of race and ethnicity (RE) is crucial for analyzing disparities and informing policy. Methods like Bayesian Improved Surname Geocoding (BISG) are widely used but exhibit limitations, including systematic misclassification biases linked to socioeconomic status. This paper introduces LSTM+Geo, a novel approach enhancing Long Short-Term Memory (LSTM) networks with census tract geolocation information. Using a large voter dataset, we demonstrate that LSTM+Geo (88.7% accuracy) significantly outperforms standalone LSTM (86.4%) and Bayesian methods like BISG (82.9%) and BIFSG (86.8%) in accuracy and F1-score on a held-out validation set. LSTM+Geo reduces the rate at which non-White individuals are misclassified as White (White FPR 19.3%) compared to name-only LSTMs (White FPR 24.6%). While sophisticated ensemble methods incorporating XGBoost achieve the highest overall accuracy (up to 89.4%) and lowest White FPR (17.8%), LSTM+Geo offers strong standalone performance with improved bias characteristics compared to baseline models. Integrating LSTM+Geo into an XGBoost ensemble further boosts accuracy, highlighting its utility as both a standalone model and a component for advanced systems. We give a caution at the end regarding the appropriate use of these methods.
[LG-18] ABG-NAS: Adaptive Bayesian Genetic Neural Architecture Search for Graph Representation Learning
链接: https://arxiv.org/abs/2504.21254
作者: Sixuan Wang,Jiao Yin,Jinli Cao,MingJian Tang,Hua Wang,Yanchun Zhang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Effective and efficient graph representation learning is essential for enabling critical downstream tasks, such as node classification, link prediction, and subgraph search. However, existing graph neural network (GNN) architectures often struggle to adapt to diverse and complex graph structures, limiting their ability to provide robust and generalizable representations. To address this challenge, we propose ABG-NAS, a novel framework for automated graph neural network architecture search tailored for efficient graph representation learning. ABG-NAS encompasses three key components: a Comprehensive Architecture Search Space (CASS), an Adaptive Genetic Optimization Strategy (AGOS), and a Bayesian-Guided Tuning Module (BGTM). CASS systematically explores diverse propagation (P) and transformation (T) operations, enabling the discovery of GNN architectures capable of capturing intricate graph characteristics. AGOS dynamically balances exploration and exploitation, ensuring search efficiency and preserving solution diversity. BGTM further optimizes hyperparameters periodically, enhancing the scalability and robustness of the resulting architectures. Empirical evaluations on benchmark datasets (Cora, PubMed, Citeseer, and CoraFull) demonstrate that ABG-NAS consistently outperforms both manually designed GNNs and state-of-the-art neural architecture search (NAS) methods. These results highlight the potential of ABG-NAS to advance graph representation learning by providing scalable and adaptive solutions for diverse graph structures. Our code is publicly available at this https URL.
[LG-19] Data-driven operator learning for energy-efficient building control
链接: https://arxiv.org/abs/2504.21243
作者: Yuexin Bian,Yuanyuan Shi
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
备注:
Abstract:Energy-efficient ventilation control plays a vital role in reducing building energy consumption while ensuring occupant health and comfort. While Computational Fluid Dynamics (CFD) simulations offer high-fidelity modeling of airflow for building HVAC design, their high computational cost makes them impractical for adoption in real-time building management systems. In this work, we present a data-driven framework that combines the physical accuracy of CFD with the computational efficiency of machine learning to enable energy-efficient building ventilation control. Our method jointly optimizes airflow supply rates and vent angles to reduce energy use and adhere to air quality constraints. We train a neural operator transformer to learn the mapping from building control actions to airflow field distributions using high-resolution CFD data. This learned operator enables a gradient-based control framework capable of optimal decision-making. Experimental results demonstrate that our approach achieves substantial energy savings compared to maximum airflow rate control, rule-based control, and data-driven control based on regional average CO2 predictions, while consistently maintaining safe indoor air quality. These results highlight the practicality and scalability of our method for enabling safe and energy-efficient building management.
[LG-20] Passive Measurement of Autonomic Arousal in Real-World Settings
链接: https://arxiv.org/abs/2504.21242
作者: Samy Abdel-Ghaffar,Isaac Galatzer-Levy,Conor Heneghan,Xin Liu,Sarah Kernasovskiy,Brennan Garrett,Andrew Barakat,Daniel McDuff
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:The autonomic nervous system (ANS) is activated during stress, which can have negative effects on cardiovascular health, sleep, the immune system, and mental health. While there are ways to quantify ANS activity in laboratories, there is a paucity of methods that have been validated in real-world contexts. We present the Fitbit Body Response Algorithm, an approach to continuous remote measurement of ANS activation through widely available remote wrist-based sensors. The design was validated via two experiments, a Trier Social Stress Test (n = 45) and ecological momentary assessments (EMA) of perceived stress (n=87), providing both controlled and ecologically valid test data. Model performance predicting perceived stress when using all available sensor modalities was consistent with expectations (accuracy=0.85) and outperformed models with access to only a subset of the signals. We discuss and address challenges to sensing that arise in real world settings that do not present in conventional lab environments.
[LG-21] Graph Synthetic Out-of-Distribution Exposure with Large Language Models
链接: https://arxiv.org/abs/2504.21198
作者: Haoyan Xu,Zhengtao Yao,Ziyi Wang,Zhan Cheng,Xiyang Hu,Mengyuan Li,Yue Zhao
类目: Machine Learning (cs.LG)
备注:
Abstract:Out-of-distribution (OOD) detection in graphs is critical for ensuring model robustness in open-world and safety-sensitive applications. Existing approaches to graph OOD detection typically involve training an in-distribution (ID) classifier using only ID data, followed by the application of post-hoc OOD scoring techniques. Although OOD exposure - introducing auxiliary OOD samples during training - has proven to be an effective strategy for enhancing detection performance, current methods in the graph domain generally assume access to a set of real OOD nodes. This assumption, however, is often impractical due to the difficulty and cost of acquiring representative OOD samples. In this paper, we introduce GOE-LLM, a novel framework that leverages Large Language Models (LLMs) for OOD exposure in graph OOD detection without requiring real OOD nodes. GOE-LLM introduces two pipelines: (1) identifying pseudo-OOD nodes from the initially unlabeled graph using zero-shot LLM annotations, and (2) generating semantically informative synthetic OOD nodes via LLM-prompted text generation. These pseudo-OOD nodes are then used to regularize the training of the ID classifier for improved OOD awareness. We evaluate our approach across multiple benchmark datasets, showing that GOE-LLM significantly outperforms state-of-the-art graph OOD detection methods that do not use OOD exposure and achieves comparable performance to those relying on real OOD data.
[LG-22] LIFT: LLM-Based Pragma Insertion for HLS via GNN Supervised Fine-Tuning
链接: https://arxiv.org/abs/2504.21187
作者: Neha Prakriya,Zijian Ding,Yizhou Sun,Jason Cong
类目: Machine Learning (cs.LG)
备注:
Abstract:FPGAs are increasingly adopted in datacenter environments for their reconfigurability and energy efficiency. High-Level Synthesis (HLS) tools have eased FPGA programming by raising the abstraction level from RTL to untimed C/C++, yet attaining high performance still demands expert knowledge and iterative manual insertion of optimization pragmas to modify the microarchitecture. To address this challenge, we propose LIFT, a large language model (LLM)-based coding assistant for HLS that automatically generates performance-critical pragmas given a C/C++ design. We fine-tune the LLM by tightly integrating and supervising the training process with a graph neural network (GNN), combining the sequential modeling capabilities of LLMs with the structural and semantic understanding of GNNs necessary for reasoning over code and its control/data dependencies. On average, LIFT produces designs that improve performance by 3.52x and 2.16x over the prior state-of-the-art AutoDSE and HARP, respectively, and by 66x over GPT-4o.
[LG-23] GLIP-OOD: Zero-Shot Graph OOD Detection with Foundation Model
链接: https://arxiv.org/abs/2504.21186
作者: Haoyan Xu,Zhengtao Yao,Xuzhi Zhang,Ziyi Wang,Langzhou He,Yushun Dong,Philip S. Yu,Mengyuan Li,Yue Zhao
类目: Machine Learning (cs.LG)
备注:
Abstract:Out-of-distribution (OOD) detection is critical for ensuring the safety and reliability of machine learning systems, particularly in dynamic and open-world environments. In the vision and text domains, zero-shot OOD detection - which requires no training on in-distribution (ID) data - has made significant progress through the use of large-scale pretrained models such as vision-language models (VLMs) and large language models (LLMs). However, zero-shot OOD detection in graph-structured data remains largely unexplored, primarily due to the challenges posed by complex relational structures and the absence of powerful, large-scale pretrained models for graphs. In this work, we take the first step toward enabling zero-shot graph OOD detection by leveraging a graph foundation model (GFM). We show that, when provided only with class label names, the GFM can perform OOD detection without any node-level supervision - outperforming existing supervised methods across multiple datasets. To address the more practical setting where OOD label names are unavailable, we introduce GLIP-OOD, a novel framework that employs LLMs to generate semantically informative pseudo-OOD labels from unlabeled data. These labels enable the GFM to capture nuanced semantic boundaries between ID and OOD classes and perform fine-grained OOD detection - without requiring any labeled nodes. Our approach is the first to enable node-level graph OOD detection in a fully zero-shot setting, and achieves state-of-the-art performance on four benchmark text-attributed graph datasets.
[LG-24] Federated One-Shot Learning with Data Privacy and Objective-Hiding
链接: https://arxiv.org/abs/2504.21182
作者: Maximilian Egger,Rüdiger Urbanke,Rawad Bitar
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Privacy in federated learning is crucial, encompassing two key aspects: safeguarding the privacy of clients’ data and maintaining the privacy of the federator’s objective from the clients. While the first aspect has been extensively studied, the second has received much less attention. We present a novel approach that addresses both concerns simultaneously, drawing inspiration from techniques in knowledge distillation and private information retrieval to provide strong information-theoretic privacy guarantees. Traditional private function computation methods could be used here; however, they are typically limited to linear or polynomial functions. To overcome these constraints, our approach unfolds in three stages. In stage 0, clients perform the necessary computations locally. In stage 1, these results are shared among the clients, and in stage 2, the federator retrieves its desired objective without compromising the privacy of the clients’ data. The crux of the method is a carefully designed protocol that combines secret-sharing-based multi-party computation and a graph-based private information retrieval scheme. We show that our method outperforms existing tools from the literature when properly adapted to this setting.
[LG-25] Efficient LLMs with AMP: Attention Heads and MLP Pruning IJCNN
链接: https://arxiv.org/abs/2504.21174
作者: Leandro Giusti Mugnaini,Bruno Lopes Yamamoto,Lucas Lauton de Alcantara,Victor Zacarias,Edson Bollis,Lucas Pellicer,Anna Helena Reali Costa,Artur Jordao
类目: Machine Learning (cs.LG)
备注: To be published in International Joint Conference on Neural Networks (IJCNN), 2025
Abstract:Deep learning drives a new wave in computing systems and triggers the automation of increasingly complex problems. In particular, Large Language Models (LLMs) have significantly advanced cognitive tasks, often matching or even surpassing human-level performance. However, their extensive parameters result in high computational costs and slow inference, posing challenges for deployment in resource-limited settings. Among the strategies to overcome the aforementioned challenges, pruning emerges as a successful mechanism since it reduces model size while maintaining predictive ability. In this paper, we introduce AMP: Attention Heads and MLP Pruning, a novel structured pruning method that efficiently compresses LLMs by removing less critical structures within Multi-Head Attention (MHA) and Multilayer Perceptron (MLP). By projecting the input data onto weights, AMP assesses structural importance and overcomes the limitations of existing techniques, which often fall short in flexibility or efficiency. In particular, AMP surpasses the current state-of-the-art on commonsense reasoning tasks by up to 1.49 percentage points, achieving a 30% pruning ratio with minimal impact on zero-shot task performance. Moreover, AMP also improves inference speeds, making it well-suited for deployment in resource-constrained environments. We confirm the flexibility of AMP on different families of LLMs, including LLaMA and Phi.
[LG-26] R2VFL: A Robust Random Vector Functional Link Network with Huber-Weighted Framework
链接: https://arxiv.org/abs/2504.21069
作者: Anuradha Kumari,Mushir Akhtar,P. N. Suganthan,M. Tanveer
类目: Machine Learning (cs.LG)
备注:
Abstract:The random vector functional link (RVFL) neural network has shown significant potential in overcoming the constraints of traditional artificial neural networks, such as excessive computation time and suboptimal solutions. However, RVFL faces challenges when dealing with noise and outliers, as it assumes all data samples contribute equally. To address this issue, we propose a novel robust framework, R2VFL, RVFL with Huber weighting function and class probability, which enhances the model’s robustness and adaptability by effectively mitigating the impact of noise and outliers in the training data. The Huber weighting function reduces the influence of outliers, while the class probability mechanism assigns less weight to noisy data points, resulting in a more resilient model. We explore two distinct approaches for calculating class centers within the R2VFL framework: the simple average of all data points in each class and the median of each feature, the latter providing a robust alternative by minimizing the effect of extreme values. These approaches give rise to two novel variants of the model: R2VFL-A and R2VFL-M. We extensively evaluate the proposed models on 47 UCI datasets, encompassing both binary and multiclass datasets, and conduct rigorous statistical testing, which confirms the superiority of the proposed models. Notably, the models also demonstrate exceptional performance in classifying EEG signals, highlighting their practical applicability in the real-world biomedical domain.
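说明:摘要中的 Huber 加权函数是标准构造:残差绝对值不超过阈值 δ 时权重为 1,否则按 δ/|r| 衰减,从而限制离群点的影响。示意如下(δ 取稳健统计中的常用值,仅为示例):

```python
# 示意代码:Huber 加权函数(标准形式,阈值 delta 为示例取值)
import numpy as np

def huber_weights(residuals, delta: float = 1.345):
    """|r| <= delta 时 w=1,否则 w = delta/|r|(影响函数有界)"""
    r = np.abs(residuals)
    return np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))

print(huber_weights(np.array([0.3, 1.0, 5.0])))   # -> [1.    1.    0.269]
```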
[LG-27] A Hamiltonian Higher-Order Elasticity Framework for Dynamic Diagnostics (2HOED)
链接: https://arxiv.org/abs/2504.21062
作者: Ngueuleweu Tiwang Gildas
类目: Machine Learning (cs.LG); General Economics (econ.GN)
*备注: 19 pages, 7 figures
Abstract:Machine learning detects patterns, blockchain guarantees trust and immutability, and modern causal inference identifies directional linkages, yet none alone exposes the full energetic anatomy of complex systems; the Hamiltonian Higher-Order Elasticity Dynamics (2HOED) framework bridges these gaps. Grounded in classical mechanics but extended with higher-order elasticity terms from economics, 2HOED represents economic, social, and physical systems as energy-based Hamiltonians whose position, velocity, acceleration, and jerk of elasticity jointly determine systemic power, inertia, policy sensitivity, and marginal responses. Because the formalism is scale-free and coordinate-agnostic, it transfers seamlessly from financial markets to climate science and from supply chain logistics to epidemiology, and thus to any discipline in which adaptation and shocks coexist. By embedding standard econometric variables inside a Hamiltonian, 2HOED enriches conventional economic analysis with rigorous diagnostics of resilience, tipping points, and feedback loops, revealing failure modes invisible to linear models. Wavelet spectra, phase-space attractors, and topological persistence diagrams derived from 2HOED expose multistage policy leverage that machine learning detects only empirically and blockchain secures only after the fact. For economists, physicians and other scientists, the method opens a new causal energetic channel linking biological or mechanical elasticity to macro-level outcomes. Portable, interpretable, and computationally light, 2HOED turns data streams into dynamical energy maps, empowering decision makers to anticipate crises, design adaptive policies, and engineer robust systems, delivering the predictive punch of AI with the explanatory clarity of physics.
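One way to read the "position, velocity, acceleration, and jerk of elasticity" is as successive time derivatives of an elasticity series; the numerical sketch below takes that reading with finite differences on a synthetic series (the series and the interpretation are our assumptions, not the paper's code):

```python
# Successive derivatives of a toy elasticity signal via np.gradient.
import numpy as np

t = np.linspace(0, 10, 500)
elasticity = 1.0 + 0.3 * np.sin(t)             # toy elasticity "position"

velocity = np.gradient(elasticity, t)          # 1st derivative
acceleration = np.gradient(velocity, t)        # 2nd derivative
jerk = np.gradient(acceleration, t)            # 3rd derivative
print([round(x[100], 3) for x in (elasticity, velocity, acceleration, jerk)])
```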
[LG-28] AI Supply Chains: An Emerging Ecosystem of AI Actors Products and Services
链接: https://arxiv.org/abs/2504.20185
作者: Aspen Hopkins,Sarah H. Cen,Andrew Ilyas,Isabella Struckman,Luis Videgaray,Aleksander Mądry
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 27 pages, 8 figures
Abstract:The widespread adoption of AI in recent years has led to the emergence of AI supply chains: complex networks of AI actors contributing models, datasets, and more to the development of AI products and services. AI supply chains have many implications yet are poorly understood. In this work, we take a first step toward a formal study of AI supply chains and their implications, providing two illustrative case studies indicating that both AI development and regulation are complicated in the presence of supply chains. We begin by presenting a brief historical perspective on AI supply chains, discussing how their rise reflects a longstanding shift towards specialization and outsourcing that signals the healthy growth of the AI industry. We then model AI supply chains as directed graphs and demonstrate the power of this abstraction by connecting examples of AI issues to graph properties. Finally, we examine two case studies in detail, providing theoretical and empirical results in both. In the first, we show that information passing (specifically, of explanations) along the AI supply chains is imperfect, which can result in misunderstandings that have real-world implications. In the second, we show that upstream design choices (e.g., by base model providers) have downstream consequences (e.g., on AI products fine-tuned on the base model). Together, our findings motivate further study of AI supply chains and their increasingly salient social, economic, regulatory, and technical implications.
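The directed-graph abstraction is easy to make concrete; a toy sketch with networkx (the chain and node names are invented for illustration) shows how an upstream design choice propagates to everything downstream:

```python
# Nodes are AI actors/artifacts; edges point from suppliers to consumers.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("web_corpus", "base_model"),
    ("base_model", "fine_tuned_model"),
    ("annotation_vendor", "fine_tuned_model"),
    ("fine_tuned_model", "ai_product"),
])

# Upstream choices have downstream consequences: everything reachable
# from the base model is affected by a change to it.
print(sorted(nx.descendants(G, "base_model")))
# ['ai_product', 'fine_tuned_model']
```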
[LG-29] Automated Generation of Precedence Graphs in Digital Value Chains for Automotive Production
链接: https://arxiv.org/abs/2504.19835
作者: Cornelius Hake,Christian Friedrich
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:This study examines the digital value chain in automotive manufacturing, focusing on the identification, software flashing, customization, and commissioning of electronic control units in vehicle networks. A novel precedence graph design is proposed to optimize this process chain using an automated scheduling algorithm that employs mixed integer linear programming techniques. The results show significant improvements in key metrics. The algorithm reduces the number of production stations equipped with expensive hardware and software to execute digital value chain processes, while increasing capacity utilization through efficient scheduling and reduced idle time. Task parallelization is optimized, resulting in streamlined workflows and increased throughput. Compared to the traditional method, the automated approach has reduced preparation time by 50% and reduced scheduling activities, as it now takes two minutes to create the precedence graph. The flexibility of the algorithm’s constraints allows for vehicle-specific configurations while maintaining high responsiveness, eliminating backup stations and facilitating the integration of new topologies. Automated scheduling significantly outperforms manual methods in efficiency, functionality, and adaptability.
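For a minimal flavour of precedence-constrained scheduling with mixed integer linear programming, here is a PuLP sketch; the tasks, durations, and precedence edges are invented, and the model omits stations and resource constraints, so it illustrates the technique rather than the paper's formulation:

```python
# Minimise the makespan of a toy ECU process chain subject to precedence.
import pulp

durations = {"identify": 2, "flash": 4, "customize": 3, "commission": 2}
precedence = [("identify", "flash"), ("flash", "customize"),
              ("flash", "commission")]

prob = pulp.LpProblem("ecu_chain", pulp.LpMinimize)
start = {t: pulp.LpVariable(f"s_{t}", lowBound=0) for t in durations}
makespan = pulp.LpVariable("makespan", lowBound=0)

for a, b in precedence:                        # finish a before starting b
    prob += start[b] >= start[a] + durations[a]
for t in durations:                            # makespan covers every task
    prob += makespan >= start[t] + durations[t]

prob += makespan                               # objective
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({t: start[t].value() for t in durations}, "makespan:", makespan.value())
```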
[LG-30] Scalable Multi-Task Learning for Particle Collision Event Reconstruction with Heterogeneous Graph Neural Networks
链接: https://arxiv.org/abs/2504.21844
作者: William Sutcliffe,Marta Calvi,Simone Capelli,Jonas Eschle,Julián García Pardiñas,Abhijit Mathad,Azusa Uzuki,Nicola Serra
类目: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 21 pages, 10 figures, 4 tables
Abstract:The growing luminosity frontier at the Large Hadron Collider is challenging the reconstruction and analysis of particle collision events. Increased particle multiplicities are straining latency and storage requirements at the data acquisition stage, while new complications are emerging, including higher background levels and more frequent particle vertex misassociations. This in turn necessitates the development of more holistic and scalable reconstruction methods that take advantage of recent advances in machine learning. We propose a novel Heterogeneous Graph Neural Network (HGNN) architecture featuring unique representations for diverse particle collision relationships and integrated graph pruning layers for scalability. Trained with a multi-task paradigm in an environment mimicking the LHCb experiment, this HGNN significantly improves beauty hadron reconstruction performance. Notably, it concurrently performs particle vertex association and graph pruning within a single framework. We quantify reconstruction and pruning performance, demonstrate enhanced inference time scaling with event complexity, and mitigate potential performance loss using a weighted message passing scheme.
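To make the two core ideas concrete, here is a schematic in plain PyTorch (not the authors' model): relation-specific message passing over a heterogeneous graph, followed by a pruning step that keeps only the top-scoring edges. Dimensions, relation names, and the dot-product edge score are illustrative assumptions:

```python
import torch
import torch.nn as nn

d = 16
relations = {"track->vertex": nn.Linear(d, d), "hit->track": nn.Linear(d, d)}

def message_pass(x_src, x_dst, edges, rel):
    """Aggregate relation-transformed source features into destinations."""
    msg = relations[rel](x_src)[edges[0]]
    return torch.zeros_like(x_dst).index_add(0, edges[1], msg)

def prune_edges(x_src, x_dst, edges, keep_ratio=0.5):
    """Score edges by endpoint similarity; drop the weakest ones."""
    scores = (x_src[edges[0]] * x_dst[edges[1]]).sum(-1)
    k = max(1, int(keep_ratio * edges.shape[1]))
    return edges[:, scores.topk(k).indices]

tracks, vertices = torch.randn(5, d), torch.randn(3, d)
edges = torch.tensor([[0, 1, 2, 3, 4], [0, 0, 1, 2, 2]])  # track -> vertex
edges = prune_edges(tracks, vertices, edges)               # scalability step
print(message_pass(tracks, vertices, edges, "track->vertex").shape)
```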
[LG-31] Balancing Interpretability and Flexibility in Modeling Diagnostic Trajectories with an Embedded Neural Hawkes Process Model
链接: https://arxiv.org/abs/2504.21795
作者: Yuankang Zhao,Matthew Engelhard
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:The Hawkes process (HP) is commonly used to model event sequences with self-reinforcing dynamics, including electronic health records (EHRs). Traditional HPs capture self-reinforcement via parametric impact functions that can be inspected to understand how each event modulates the intensity of others. Neural network-based HPs offer greater flexibility, resulting in improved fit and prediction performance, but at the cost of interpretability, which is often critical in healthcare. In this work, we aim to understand and improve upon this tradeoff. We propose a novel HP formulation in which impact functions are modeled by defining a flexible impact kernel, instantiated as a neural network, in event embedding space, which allows us to model large-scale event sequences with many event types. This approach is more flexible than traditional HPs yet more interpretable than other neural network approaches, and allows us to explicitly trade flexibility for interpretability by adding transformer encoder layers to further contextualize the event embeddings. Results show that our method accurately recovers impact functions in simulations, achieves competitive performance on the MIMIC-IV procedure dataset, and yields clinically meaningful interpretations on the XX-EHR children’s diagnosis dataset even without transformer layers. This suggests that our flexible impact kernel is often sufficient to capture self-reinforcing dynamics in EHRs and other data effectively, implying that interpretability can be maintained without loss of performance.
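The key trick is that impacts between event types are computed by a small neural kernel acting on event embeddings, so parameters do not grow quadratically in the number of event types. A hedged sketch of that intensity function (layer sizes and the exponential time decay are our assumptions, not the paper's architecture):

```python
# lambda(t) = mu + sum_i k(e_source_i, e_target) * exp(-decay * (t - t_i))
import torch
import torch.nn as nn

n_types, d_emb = 1000, 32
emb = nn.Embedding(n_types, d_emb)
impact_kernel = nn.Sequential(        # (source_emb, target_emb) -> impact >= 0
    nn.Linear(2 * d_emb, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus()
)

def intensity(target_type, history_types, history_times, t, mu=0.1, decay=1.0):
    tgt = emb(torch.tensor([target_type])).expand(len(history_types), -1)
    src = emb(torch.tensor(history_types))
    k = impact_kernel(torch.cat([src, tgt], dim=-1)).squeeze(-1)
    dt = t - torch.tensor(history_times)
    return mu + (k * torch.exp(-decay * dt)).sum()

print(intensity(7, [3, 42, 7], [0.5, 1.2, 2.0], t=3.0))
```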
[LG-32] Estimation of discrete distributions in relative entropy and the deviations of the missing mass
链接: https://arxiv.org/abs/2504.21787
作者: Jaouad Mourtada
类目: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 54 pages
Abstract:We study the problem of estimating a distribution over a finite alphabet from an i.i.d. sample, with accuracy measured in relative entropy (Kullback-Leibler divergence). While optimal expected risk bounds are known, high-probability guarantees remain less well-understood. First, we analyze the classical Laplace (add-1) estimator, obtaining matching upper and lower bounds on its performance and showing its optimality among confidence-independent estimators. We then characterize the minimax-optimal high-probability risk achievable by any estimator, which is attained via a simple confidence-dependent smoothing technique. Interestingly, the optimal non-asymptotic risk contains an additional logarithmic factor over the ideal asymptotic risk. Next, motivated by scenarios where the alphabet exceeds the sample size, we investigate methods that adapt to the sparsity of the distribution at hand. We introduce an estimator using data-dependent smoothing, for which we establish a high-probability risk bound depending on two effective sparsity parameters. As part of the analysis, we also derive a sharp high-probability upper bound on the missing mass.
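The Laplace (add-1) estimator itself is a one-liner, and the sketch below shows why it is natural under relative-entropy loss: smoothing keeps every symbol's probability strictly positive, so the KL divergence stays finite even when a symbol is never observed:

```python
import numpy as np

def laplace_estimate(counts):
    """Add-1 smoothing: (n_x + 1) / (n + |alphabet|)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + 1) / (counts.sum() + counts.size)

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1, 0.0])              # true distribution
rng = np.random.default_rng(0)
counts = rng.multinomial(50, p)                 # i.i.d. sample of size 50
q = laplace_estimate(counts)
print("estimate:", np.round(q, 3), "KL risk:", round(kl(p, q), 4))
```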
[LG-33] A comparison of generative deep learning methods for multivariate angular simulation
链接: https://arxiv.org/abs/2504.21505
作者: Jakob Benjamin Wessel,Callum J. R. Murphy-Barltrop,Emma S. Simpson
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:With the recent development of new geometric and angular-radial frameworks for multivariate extremes, reliably simulating from angular variables in moderate-to-high dimensions is of increasing importance. Empirical approaches have the benefit of simplicity, and work reasonably well in low dimensions, but as the number of variables increases, they can lack the required flexibility and scalability. Classical parametric models for angular variables, such as the von Mises-Fisher (vMF) distribution, provide an alternative. Exploiting mixtures of vMF distributions increases their flexibility, but there are cases where even this is not sufficient to capture the intricate features that can arise in data. Owing to their flexibility, generative deep learning methods are able to capture complex data structures; they therefore have the potential to be useful in the simulation of angular variables. In this paper, we explore a range of deep learning approaches for this task, including generative adversarial networks, normalizing flows and flow matching. We assess their performance via a range of metrics and make comparisons to the more classical approach of using a mixture of vMF distributions. The methods are also applied to a metocean data set, demonstrating their applicability to real-world, complex data structures.
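As a reference point for the classical baseline the paper compares against, here is a sketch of sampling angular data from a mixture of von Mises-Fisher distributions (requires SciPy >= 1.11 for `scipy.stats.vonmises_fisher`; the mixture weights, mean directions, and concentrations are invented):

```python
import numpy as np
from scipy.stats import vonmises_fisher

rng = np.random.default_rng(0)
components = [
    (0.6, np.array([1.0, 0.0, 0.0]), 20.0),   # (weight, mean direction, kappa)
    (0.4, np.array([0.0, 1.0, 0.0]), 5.0),
]

def sample_vmf_mixture(n):
    choices = rng.choice(len(components), size=n,
                         p=[w for w, _, _ in components])
    return np.vstack([
        vonmises_fisher(components[c][1], components[c][2]).rvs(1, random_state=rng)
        for c in choices
    ])

samples = sample_vmf_mixture(5)                # points on the unit sphere
print(np.round(samples, 3), np.linalg.norm(samples, axis=1))
```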
[LG-34] Wasserstein-Aitchison GAN for angular measures of multivariate extremes
链接: https://arxiv.org/abs/2504.21438
作者: Stéphane Lhaut,Holger Rootzén,Johan Segers
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 38 pages, 11 figures
Abstract:Economically responsible mitigation of multivariate extreme risks – extreme rainfall in a large area, huge variations of many stock prices, widespread breakdowns in transportation systems – requires estimates of the probabilities that such risks will materialize in the future. This paper develops a new method, Wasserstein-Aitchison Generative Adversarial Networks (WA-GAN), which provides simulated values of future d-dimensional multivariate extreme events and which hence can be used to give estimates of such probabilities. The main hypothesis is that, after transforming the observations to the unit-Pareto scale, their distribution is regularly varying in the sense that the distributions of their radial and angular components (with respect to the L_1-norm) converge and become asymptotically independent as the radius gets large. The method is a combination of standard extreme value analysis modeling of the tails of the marginal distributions with nonparametric GAN modeling of the angular distribution. For the latter, the angular values are transformed to Aitchison coordinates in a full (d-1)-dimensional linear space, and a Wasserstein GAN is trained on these coordinates and used to generate new values. A reverse transformation is then applied to these values and gives simulated values on the original data scale. The method shows good performance compared to other existing methods in the literature, both in terms of capturing the dependence structure of the extremes in the data, as well as in generating accurate new extremes of the data distribution. The comparison is performed on simulated multivariate extremes from a logistic model in dimensions up to 50 and on a 30-dimensional financial data set.
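The coordinate trick is the part worth sketching: angular values on the L_1 simplex are mapped to unconstrained Aitchison coordinates where a standard Wasserstein GAN can be trained, then mapped back. The sketch below uses the centred log-ratio (clr) map rather than a full (d-1)-dimensional ilr basis, a simplification for illustration; the GAN itself is omitted:

```python
import numpy as np

def to_clr(theta, eps=1e-9):
    """Simplex (L1-normalised, positive) -> centred log-ratio coordinates."""
    logt = np.log(theta + eps)
    return logt - logt.mean(axis=-1, keepdims=True)

def from_clr(z):
    """Inverse map: softmax returns points to the unit simplex."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

theta = np.array([[0.2, 0.5, 0.3]])           # an angular observation
z = to_clr(theta)                              # train the WGAN in this space
print(np.round(from_clr(z), 3))                # round-trips to the simplex
```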
[LG-35] Kernel Density Machines
链接: https://arxiv.org/abs/2504.21419
作者: Damir Filipovic,Paul Schneider
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:We introduce kernel density machines (KDM), a novel density ratio estimator in a reproducing kernel Hilbert space setting. KDM applies to general probability measures on countably generated measurable spaces without restrictive assumptions on continuity, or the existence of a Lebesgue density. For computational efficiency, we incorporate a low-rank approximation with precisely controlled error that grants scalability to large-sample settings. We provide rigorous theoretical guarantees, including asymptotic consistency, a functional central limit theorem, and finite-sample error bounds, establishing a strong foundation for practical use. Empirical results based on simulated and real data demonstrate the efficacy and precision of KDM.
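KDM itself is new, so as a stand-in here is a classical RKHS density-ratio sketch in the uLSIF style (a different, established estimator, not the paper's method): model r(x) = p(x)/q(x) as a kernel expansion and solve a regularised least-squares problem. The kernel width, centres, and regulariser are arbitrary choices for this toy example:

```python
import numpy as np

rng = np.random.default_rng(0)
xp = rng.normal(0.5, 1.0, size=(200, 1))       # samples from numerator p
xq = rng.normal(0.0, 1.0, size=(200, 1))       # samples from denominator q
centers = xq[:50]                               # kernel centres

def K(a, b, s=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s**2))

Phi_p, Phi_q = K(xp, centers), K(xq, centers)
H = Phi_q.T @ Phi_q / len(xq)                   # second moment under q
h = Phi_p.mean(axis=0)                          # first moment under p
alpha = np.linalg.solve(H + 1e-2 * np.eye(len(centers)), h)

ratio_at_zero = (K(np.array([[0.0]]), centers) @ alpha)[0]
print(round(ratio_at_zero, 3))                  # approximates p(0)/q(0)
```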
[LG-36] Generalised Label-free Artefact Cleaning for Real-time Medical Pulsatile Time Series
链接: https://arxiv.org/abs/2504.21209
作者: Xuhang Chen,Ihsane Olakorede,Stefan Yu Bögli,Wenhao Xu,Erta Beqiri,Xuemeng Li,Chenyu Tang,Zeyu Gao,Shuo Gao,Ari Ercole,Peter Smielewski
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Artefacts compromise clinical decision-making in the use of medical time series. Pulsatile waveforms offer possibilities for accurate artefact detection, yet most approaches rely on supervised learning and overlook patient-level distribution shifts. To address these issues, we introduce a generalised label-free framework, GenClean, for real-time artefact cleaning and leverage an in-house dataset of 180,000 ten-second arterial blood pressure (ABP) samples for training. We first investigate patient-level generalisation, demonstrating robust performances under both intra- and inter-patient distribution shifts. We further validate its effectiveness through challenging cross-disease cohort experiments on the MIMIC-III database. Additionally, we extend our method to photoplethysmography (PPG), highlighting its applicability to diverse medical pulsatile signals. Finally, its integration into ICM+, a clinical research monitoring software, confirms the real-time feasibility of our framework, emphasising its practical utility in continuous physiological monitoring. This work provides a foundational step toward precision medicine in improving the reliability of high-resolution medical time series analysis.
[LG-37] Generate-then-Verify: Reconstructing Data from Limited Published Statistics
链接: https://arxiv.org/abs/2504.21199
作者: Terrance Liu,Eileen Xiao,Pratiksha Thaker,Adam Smith,Zhiwei Steven Wu
类目: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:We study the problem of reconstructing tabular data from aggregate statistics, in which the attacker aims to identify interesting claims about the sensitive data that can be verified with 100% certainty given the aggregates. Successful attempts in prior work have conducted studies in settings where the set of published statistics is rich enough that entire datasets can be reconstructed with certainty. In our work, we instead focus on the regime where many possible datasets match the published statistics, making it impossible to reconstruct the entire private dataset perfectly (i.e., when approaches in prior work fail). We propose the problem of partial data reconstruction, in which the goal of the adversary is to instead output a subset of rows and/or columns that are guaranteed to be correct. We introduce a novel integer programming approach that first generates a set of claims and then verifies whether each claim holds for all possible datasets consistent with the published aggregates. We evaluate our approach on the housing-level microdata from the U.S. Decennial Census release, demonstrating that privacy violations can still persist even when information published about such data is relatively sparse.
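The verification logic can be shown in miniature: a claim is guaranteed to be correct if and only if it holds in every dataset consistent with the published aggregates. The paper does this with integer programming; the sketch below substitutes exhaustive search over an invented tiny domain to make the idea visible:

```python
from itertools import product

n_rows, domain = 4, (0, 1, 2)
published_sum, published_max = 5, 2            # the released aggregates

consistent = [d for d in product(domain, repeat=n_rows)
              if sum(d) == published_sum and max(d) == published_max]

claim = lambda d: d.count(2) >= 1              # "at least one row equals 2"
print("claim guaranteed:", all(claim(d) for d in consistent))       # True

weak_claim = lambda d: d.count(0) >= 1         # not implied by aggregates
print("weak claim guaranteed:", all(weak_claim(d) for d in consistent))  # False
```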
[LG-38] QAOA Parameter Transferability for Maximum Independent Set using Graph Attention Networks
链接: https://arxiv.org/abs/2504.21135
作者: Hanjing Xu,Xiaoyuan Liu,Alex Pothen,Ilya Safro
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:The quantum approximate optimization algorithm (QAOA) is one of the promising variational approaches of quantum computing to solve combinatorial optimization problems. In QAOA, variational parameters need to be optimized by solving a series of nonlinear, nonconvex optimization programs. In this work, we propose a QAOA parameter transfer scheme using Graph Attention Networks (GAT) to solve Maximum Independent Set (MIS) problems. We prepare optimized parameters for graphs of 12 and 14 vertices and use GATs to transfer their parameters to larger graphs. Additionally, we design a hybrid distributed resource-aware algorithm for MIS (HyDRA-MIS), which decomposes large problems into smaller ones that can fit onto noisy intermediate-scale quantum (NISQ) computers. We integrate our GAT-based parameter transfer approach into HyDRA-MIS and demonstrate competitive results compared to KaMIS, a state-of-the-art classical MIS solver, on graphs with several thousand vertices.
信息检索
[IR-0] Learning Universal User Representations Leveraging Cross-domain User Intent at Snapchat SIGIR’25
链接: https://arxiv.org/abs/2504.21838
作者: Clark Mingxuan Ju,Leonardo Neves,Bhuvesh Kumar,Liam Collins,Tong Zhao,Yuwei Qiu,Qing Dou,Yang Zhou,Sohail Nizam,Rengim Ozturk,Yvette Liu,Sen Yang,Manish Malik,Neil Shah
类目: Information Retrieval (cs.IR)
*备注: Accepted to the industrial track of SIGIR’25
Abstract:The development of powerful user representations is a key factor in the success of recommender systems (RecSys). Online platforms employ a range of RecSys techniques to personalize user experience across diverse in-app surfaces. User representations are often learned individually from a user’s historical interactions within each surface, and representations from different surfaces can be shared post-hoc as auxiliary features or additional retrieval sources. While effective, such schemes cannot directly encode collaborative filtering signals across different surfaces, hindering their capacity to discover complex relationships between user behaviors and preferences across the whole platform. To bridge this gap at Snapchat, we seek to conduct universal user modeling (UUM) across different in-app surfaces, learning general-purpose user representations which encode behaviors across surfaces. Instead of replacing domain-specific representations, UUM representations capture cross-domain trends, enriching existing representations with complementary information. This work discusses our efforts in developing initial UUM versions, practical challenges, technical choices, and modeling and research directions with promising offline performance. Following successful A/B testing, UUM representations have been launched in production, powering multiple use cases and demonstrating their value. UUM embeddings have been incorporated into (i) Long-form Video embedding-based retrieval, leading to a 2.78% increase in Long-form Video Open Rate, (ii) Long-form Video L2 ranking, with a 19.2% increase in Long-form Video View Time sum, (iii) Lens L2 ranking, leading to a 1.76% increase in Lens play time, and (iv) Notification L2 ranking, with a 0.87% increase in Notification Open Rate.
[IR-1] From Precision to Perception: User-Centred Evaluation of Keyword Extraction Algorithms for Internet-Scale Contextual Advertising
链接: https://arxiv.org/abs/2504.21667
作者: Jingwen Cai,Sara Leckner,Johanna Björklund
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Keyword extraction is a foundational task in natural language processing, underpinning countless real-world applications. A salient example is contextual advertising, where keywords help predict the topical congruence between ads and their surrounding media contexts to enhance advertising effectiveness. Recent advances in artificial intelligence, particularly large language models, have improved keyword extraction capabilities but also introduced concerns about computational cost. Moreover, although the end-user experience is of vital importance, human evaluation of keyword extraction performances remains under-explored. This study provides a comparative evaluation of three prevalent keyword extraction algorithms that vary in complexity: TF-IDF, KeyBERT, and Llama 2. To evaluate their effectiveness, a mixed-methods approach is employed, combining quantitative benchmarking with qualitative assessments from 552 participants through three survey-based experiments. Findings indicate a slight user preference for KeyBERT, which offers a favourable balance between performance and computational efficiency compared to the other two algorithms. Despite a strong overall preference for gold-standard keywords, differences between the algorithmic outputs are not statistically significant, highlighting a long-overlooked gap between traditional precision-focused metrics and user-perceived algorithm efficiency. The study highlights the importance of user-centred evaluation methodologies and proposes analytical tools to support their implementation.
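Of the three compared extractors, TF-IDF is simple enough to show in a few lines of scikit-learn (the toy corpus is invented); KeyBERT and Llama 2 would slot into the same extract-top-terms interface:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "electric cars reduce urban air pollution",
    "new electric car batteries charge in minutes",
    "city council debates public transport funding",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(corpus)
terms = vec.get_feature_names_out()

doc = 0                                        # extract keywords for doc 0
row = X[doc].toarray().ravel()
top = row.argsort()[::-1][:3]
print([terms[i] for i in top])                 # highest TF-IDF terms
```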
[IR-2] Efficient Conversational Search via Topical Locality in Dense Retrieval SIGIR2025
链接: https://arxiv.org/abs/2504.21507
作者: Cristina Ioana Muntean,Franco Maria Nardini,Raffaele Perego,Guido Rocchietti,Cosimo Rulli
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*备注: 5 pages, 2 figures, SIGIR 2025
Abstract:Pre-trained language models have been widely exploited to learn dense representations of documents and queries for information retrieval. While previous efforts have primarily focused on improving effectiveness and user satisfaction, response time remains a critical bottleneck of conversational search systems. To address this, we exploit the topical locality inherent in conversational queries, i.e., the tendency of queries within a conversation to focus on related topics. By leveraging query embedding similarities, we dynamically restrict the search space to semantically relevant document clusters, reducing computational complexity without compromising retrieval quality. We evaluate our approach on the TREC CAsT 2019 and 2020 datasets using multiple embedding models and vector indexes, achieving improvements in processing speed of up to 10.4X with little loss in performance (4.4X without any loss). Our results show that the proposed system effectively handles complex, multi-turn queries with high precision and efficiency, offering a practical solution for real-time conversational search.
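A hedged sketch of the search-space restriction: cluster document embeddings offline, then at query time score only the documents in the clusters whose centroids are most similar to the query embedding. Random vectors stand in for a real encoder's output, and the cluster count and probe depth are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

km = KMeans(n_clusters=64, n_init="auto", random_state=0).fit(docs)

def search(query, n_probe=4, k=5):
    query = query / np.linalg.norm(query)
    # Restrict to the n_probe clusters with the most similar centroids.
    nearest = (km.cluster_centers_ @ query).argsort()[::-1][:n_probe]
    cand = np.flatnonzero(np.isin(km.labels_, nearest))
    scores = docs[cand] @ query                # exact scoring, smaller space
    return cand[scores.argsort()[::-1][:k]]

print(search(rng.normal(size=64)))
```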
[IR-3] In a Few Words: Comparing Weak Supervision and LLMs for Short Query Intent Classification SIGIR’25
链接: https://arxiv.org/abs/2504.21398
作者: Daria Alexander,Arjen P. de Vries
类目: Information Retrieval (cs.IR)
*备注: accepted at International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25), July 13–18, 2025, Padua, Italy
Abstract:User intent classification is an important task in information retrieval. Previously, user intents were classified manually and automatically; the latter helped to avoid hand labelling of large datasets. Recent studies explored whether LLMs can reliably determine user intent. However, researchers have recognized the limitations of using generative LLMs for classification tasks. In this study, we empirically compare user intent classification into informational, navigational, and transactional categories, using weak supervision and LLMs. Specifically, we evaluate LLaMA-3.1-8B-Instruct and LLaMA-3.1-70B-Instruct for in-context learning and LLaMA-3.1-8B-Instruct for fine-tuning, comparing their performance to an established baseline classifier trained using weak supervision (ORCAS-I). Our results indicate that while LLMs outperform weak supervision in recall, they continue to struggle with precision, which shows the need for improved methods to balance both metrics effectively.
[IR-4] Enhancing New-item Fairness in Dynamic Recommender Systems SIGIR’25
链接: https://arxiv.org/abs/2504.21362
作者: Huizhong Guo,Zhu Sun,Dongxia Wang,Tianjun Wei,Jinfeng Li,Jie Zhang
类目: Information Retrieval (cs.IR)
*备注: 10 pages, 6 figures, Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25)
Abstract:New-items play a crucial role in recommender systems (RSs) for delivering fresh and engaging user experiences. However, traditional methods struggle to effectively recommend new-items due to their short exposure time and limited interaction records, especially in dynamic recommender systems (DRSs) where new-items get continuously introduced and users’ preferences evolve over time. This leads to significant unfairness towards new-items, which could accumulate over the successive model updates, ultimately compromising the stability of the entire system. Therefore, we propose FairAgent, a reinforcement learning (RL)-based new-item fairness enhancement framework specifically designed for DRSs. It leverages knowledge distillation to extract collaborative signals from traditional models, retaining strong recommendation capabilities for old-items. In addition, FairAgent introduces a novel reward mechanism for recommendation tailored to the characteristics of DRSs, which consists of three components: 1) a new-item exploration reward to promote the exposure of dynamically introduced new-items, 2) a fairness reward to adapt to users’ personalized fairness requirements for new-items, and 3) an accuracy reward which leverages users’ dynamic feedback to enhance recommendation accuracy. Extensive experiments on three public datasets and backbone models demonstrate the superior performance of FairAgent. The results present that FairAgent can effectively boost new-item exposure, achieve personalized new-item fairness, while maintaining high recommendation accuracy.
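The three-part reward is the heart of the method; its shape can be sketched as a weighted blend, though the component definitions and weights below are illustrative guesses, not the paper's exact formulas:

```python
# reward = w_explore * exploration + w_fair * fairness + w_acc * accuracy
def fairagent_reward(rec_list, clicked, is_new, user_fairness_pref,
                     w_explore=0.2, w_fair=0.3, w_acc=0.5):
    new_ratio = sum(is_new[i] for i in rec_list) / len(rec_list)
    r_explore = new_ratio                          # expose new items
    r_fair = -abs(new_ratio - user_fairness_pref)  # match user's tolerance
    r_acc = sum(1.0 for i in rec_list if i in clicked) / len(rec_list)
    return w_explore * r_explore + w_fair * r_fair + w_acc * r_acc

is_new = {1: True, 2: False, 3: True, 4: False}
print(fairagent_reward([1, 2, 3, 4], clicked={2, 3},
                       is_new=is_new, user_fairness_pref=0.4))
```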
[IR-5] A Framework for Elastic Adaptation of User Multiple Intents in Sequential Recommendation
链接: https://arxiv.org/abs/2504.21270
作者: Zhikai Wang,Yanyan Shen
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Recently, substantial research has been conducted on sequential recommendation, with the objective of forecasting the subsequent item by leveraging a user’s historical sequence of interacted items. Prior studies employ both capsule networks and self-attention techniques to effectively capture diverse underlying intents within a user’s interaction sequence, thereby achieving the most advanced performance in sequential recommendation. However, users could potentially form novel intents from fresh interactions as the lengths of user interaction sequences grow. Consequently, models need to be continually updated or even extended to adeptly encompass these emerging user intents, a problem referred to as incremental multi-intent sequential recommendation, which has not yet been well investigated in the existing literature. In this paper, we propose an effective Incremental learning framework for user Multi-intent Adaptation in sequential recommendation called IMA, which augments the traditional fine-tuning strategy with the existing-intents retainer, new-intents detector, and projection-based intents trimmer to adaptively expand the model to accommodate user’s new intents and prevent it from forgetting user’s existing intents. Furthermore, we upgrade IMA into an Elastic Multi-intent Adaptation (EMA) framework which can elastically remove inactive intents and compress user intent vectors under a memory space limit. Extensive experiments on real-world datasets verify the effectiveness of the proposed IMA and EMA on incremental multi-intent sequential recommendation, compared with various baselines.