本篇博文主要内容为 2025-05-13 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-05-13)

今日共更新843篇论文,其中:

  • 自然语言处理85篇(Computation and Language (cs.CL))
  • 人工智能236篇(Artificial Intelligence (cs.AI))
  • 计算机视觉177篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习259篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] A Comparative Analysis of Static Word Embeddings for Hungarian

【速读】: 该论文旨在评估不同静态词嵌入模型在匈牙利语中的表现,包括传统的Word2Vec和FastText模型,以及基于BERT的静态嵌入。其解决方案的关键在于通过内在和外在任务对这些嵌入进行综合评估,并探索从BERT等动态模型中提取静态嵌入的有效方法。研究发现,基于BERT的X2Static方法在静态嵌入提取中表现出色,接近传统静态嵌入的效果,并且在命名实体识别和词性标注等任务中优于纯静态嵌入,表明上下文感知表示在静态形式下仍具有显著优势。

链接: https://arxiv.org/abs/2505.07809
作者: Máté Gedeon
机构: Budapest University of Technology and Economics (布达佩斯技术与经济大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a comprehensive analysis of various static word embeddings for Hungarian, including traditional models such as Word2Vec, FastText, as well as static embeddings derived from BERT-based models using different extraction methods. We evaluate these embeddings on both intrinsic and extrinsic tasks to provide a holistic view of their performance. For intrinsic evaluation, we employ a word analogy task, which assesses the embeddings ability to capture semantic and syntactic relationships. Our results indicate that traditional static embeddings, particularly FastText, excel in this task, achieving high accuracy and mean reciprocal rank (MRR) scores. Among the BERT-based models, the X2Static method for extracting static embeddings demonstrates superior performance compared to decontextualized and aggregate methods, approaching the effectiveness of traditional static embeddings. For extrinsic evaluation, we utilize a bidirectional LSTM model to perform Named Entity Recognition (NER) and Part-of-Speech (POS) tagging tasks. The results reveal that embeddings derived from dynamic models, especially those extracted using the X2Static method, outperform purely static embeddings. Notably, ELMo embeddings achieve the highest accuracy in both NER and POS tagging tasks, underscoring the benefits of contextualized representations even when used in a static form. Our findings highlight the continued relevance of static word embeddings in NLP applications and the potential of advanced extraction methods to enhance the utility of BERT-based models. This piece of research contributes to the understanding of embedding performance in the Hungarian language and provides valuable insights for future developments in the field. The training scripts, evaluation codes, restricted vocabulary, and extracted embeddings will be made publicly available to support further research and reproducibility.
zh

[NLP-1] Learning Dynamics in Continual Pre-Training for Large Language Models ICML2025

【速读】: 该论文旨在解决持续预训练(Continual Pre-Training, CPT)过程中模型在通用任务和下游领域任务性能演变的动态理解问题,以及如何通过理论建模预测不同训练步骤和学习率调度下的损失表现。其解决方案的关键在于提出了一种CPT缩放定律(scaling law),该定律通过解耦分布偏移(distribution shift)和学习率衰减(learning rate annealing)的影响,能够准确描述CPT损失曲线的演变,并实现对任意训练步骤和学习率调度下的损失进行预测。这一方法为定制化调整CPT超参数以平衡通用性和领域特定性能提供了理论基础。

链接: https://arxiv.org/abs/2505.07796
作者: Xingjin Wang,Howe Tissue,Lu Wang,Linjing Li,Daniel Dajun Zeng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICML2025 (spotlight)

点击查看摘要

Abstract:Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We have observed that the CPT loss curve fundamentally characterizes the transition from one curve to another hidden curve, and could be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines the two factors, enabling the prediction of loss at any (continual) training steps and across learning rate schedules (LRS) in CPT. Our formulation presents a comprehensive understanding of several critical factors in CPT, including loss potential, peak learning rate, training steps, replay ratio, etc. Moreover, our approach can be adapted to customize training hyper-parameters to different CPT goals such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.
zh

[NLP-2] Learning from Peers in Reasoning Models

【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在推理过程中因起始阶段的不良表现而难以自我纠正的问题,这一现象被称为“Prefix Dominance Trap”。解决方案的关键在于提出Learning from Peers(LeaP),通过让每个推理路径在生成过程中总结中间推理并与其他路径共享,利用路由机制引入同伴见解,从而实现推理过程中的协同与自我修正。此外,针对小型模型在总结和反思指令上的不足,进一步提出了LeaP-T模型系列进行微调以提升效果。

链接: https://arxiv.org/abs/2505.07787
作者: Tongxu Luo,Wenyu Du,Jiaxi Bi,Stephen Chung,Zhengyang Tang,Hao Yang,Min Zhang,Benyou Wang
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); DualityRL (DualityRL); USTB (USTB); Huawei (华为)
类目: Computation and Language (cs.CL)
备注: 29 pages, 32 figures

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the “Prefix Dominance Trap”. Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose Learning from Peers (LeaP) to address this phenomenon. Specifically, every tokens, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our LeaP-T model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP achieves nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals LeaP’s robust error correction by timely peer insights, showing strong error tolerance and handling varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at this https URL .
zh

[NLP-3] Domain Regeneration: How well do LLM s match syntactic properties of text domains?

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在多大程度上能够准确地近似其训练数据的文本领域特性。论文的核心在于探究LLMs对不同文本领域属性的拟合程度及其准确性。解决方案的关键在于采用来自语料语言学的观察方法,通过让一个常用的开源LLM重新生成来自两个受许可协议宽松的英文文本领域(如维基百科和新闻文本)的文本,从而在语义控制较为严格的环境中评估LLMs是否能忠实匹配原始人类文本领域的特性。

链接: https://arxiv.org/abs/2505.07784
作者: Da Ju,Hagen Blix,Adina Williams
机构: Meta AI (元宇宙人工智能); New York University (纽约大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent improvement in large language model performance have, in all likelihood, been accompanied by improvement in how well they can approximate the distribution of their training data. In this work, we explore the following question: which properties of text domains do LLMs faithfully approximate, and how well do they do so? Applying observational approaches familiar from corpus linguistics, we prompt a commonly used, opensource LLM to regenerate text from two domains of permissively licensed English text which are often contained in LLM training data – Wikipedia and news text. This regeneration paradigm allows us to investigate whether LLMs can faithfully match the original human text domains in a fairly semantically-controlled setting. We investigate varying levels of syntactic abstraction, from more simple properties like sentence length, and article readability, to more complex and higher order properties such as dependency tag distribution, parse depth, and parse complexity. We find that the majority of the regenerated distributions show a shifted mean, a lower standard deviation, and a reduction of the long tail, as compared to the human originals.
zh

[NLP-4] Must Read: A Systematic Survey of Computational Persuasion

【速读】: 该论文试图解决AI在说服过程中的多维角色及其带来的伦理与安全问题,包括AI作为说服者、被说服者和说服评估者的复杂性。其关键解决方案在于构建一个围绕“计算说服”的综合框架,从三个核心视角出发:AI作为说服者,探讨AI生成的说服内容及其应用;AI作为被说服者,分析其对影响和操纵的敏感性;AI作为说服评估者,研究其在评估说服策略、检测操控和确保伦理说服中的作用。论文提出了一种计算说服的研究分类法,并讨论了关键挑战,如评估说服力、减轻操纵性说服以及开发负责任的AI驱动说服系统,旨在提升AI说服技术的安全性、公平性和有效性。

链接: https://arxiv.org/abs/2505.07775
作者: Nimet Beyza Bozdag,Shuhaib Mehri,Xiaocheng Yang,Hyeonjeong Ha,Zirui Cheng,Esin Durmus,Jiaxuan You,Heng Ji,Gokhan Tur,Dilek Hakkani-Tür
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Anthropic (Anthropic)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Persuasion is a fundamental aspect of communication, influencing decision-making across diverse contexts, from everyday conversations to high-stakes scenarios such as politics, marketing, and law. The rise of conversational AI systems has significantly expanded the scope of persuasion, introducing both opportunities and risks. AI-driven persuasion can be leveraged for beneficial applications, but also poses threats through manipulation and unethical influence. Moreover, AI systems are not only persuaders, but also susceptible to persuasion, making them vulnerable to adversarial attacks and bias reinforcement. Despite rapid advancements in AI-generated persuasive content, our understanding of what makes persuasion effective remains limited due to its inherently subjective and context-dependent nature. In this survey, we provide a comprehensive overview of computational persuasion, structured around three key perspectives: (1) AI as a Persuader, which explores AI-generated persuasive content and its applications; (2) AI as a Persuadee, which examines AI’s susceptibility to influence and manipulation; and (3) AI as a Persuasion Judge, which analyzes AI’s role in evaluating persuasive strategies, detecting manipulation, and ensuring ethical persuasion. We introduce a taxonomy for computational persuasion research and discuss key challenges, including evaluating persuasiveness, mitigating manipulative persuasion, and developing responsible AI-driven persuasive systems. Our survey outlines future research directions to enhance the safety, fairness, and effectiveness of AI-powered persuasion while addressing the risks posed by increasingly capable language models.
zh

[NLP-5] Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding ICSE2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)生成代码时存在的功能性错误问题,尤其是在处理复杂编程任务时,LLMs由于缺乏对新任务的理解而难以生成正确代码。其解决方案的关键在于提出一种交互式方法,通过代码注释作为开发者与LLMs之间建立共同理解的媒介,实现迭代性的语境对齐,从而更准确地反映开发者的意图。该方法通过交错进行代码生成、内联注释生成以及基于可编辑注释的上下文反馈,促进相互校准,提升代码生成的准确性与开发者信心。

链接: https://arxiv.org/abs/2505.07768
作者: Yifeng Di,Tianyi Zhang
机构: Purdue University (普渡大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ICSE 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated unprecedented capability in code generation. However, LLM-generated code is still plagued with a wide range of functional errors, especially for complex programming tasks that LLMs have not seen before. Recent studies have shown that developers often struggle with inspecting and fixing incorrect code generated by LLMs, diminishing their productivity and trust in LLM-based code generation. Inspired by the mutual grounding theory in communication, we propose an interactive approach that leverages code comments as a medium for developers and LLMs to establish a shared understanding. Our approach facilitates iterative grounding by interleaving code generation, inline comment generation, and contextualized user feedback through editable comments to align generated code with developer intent. We evaluated our approach on two popular benchmarks and demonstrated that our approach significantly improved multiple state-of-the-art LLMs, e.g., 17.1% pass@1 improvement for code-davinci-002 on HumanEval. Furthermore, we conducted a user study with 12 participants in comparison to two baselines: (1) interacting with GitHub Copilot, and (2) interacting with a multi-step code generation paradigm called Multi-Turn Program Synthesis. Participants completed the given programming tasks 16.7% faster and with 10.5% improvement in task success rate when using our approach. Both results show that interactively refining code comments enables the collaborative establishment of mutual grounding, leading to more accurate code generation and higher developer confidence.
zh

[NLP-6] Spoken Language Understanding on Unseen Tasks With In-Context Learning

【速读】: 该论文试图解决在缺乏任务特定训练数据的情况下,如何提升语音-文本大型语言模型(speech-text large language models)在口语语言理解(SLU)任务中的零样本/少样本性能问题。解决方案的关键在于提出一种基于随机类别标签的鲁棒任务无关微调方法,该方法显著提升了模型在未见任务上的表现,同时避免了启用新任务时对任务特定数据标注的依赖。

链接: https://arxiv.org/abs/2505.07731
作者: Neeraj Agrawal,Sriram Ganapathy
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Spoken language understanding (SLU) tasks involve diverse skills that probe the information extraction, classification and/or generation capabilities of models. In this setting, task-specific training data may not always be available. While traditional task-specific SLU models are unable to cater to such requirements, the speech-text large language models (LLMs) offer a promising alternative with emergent abilities. However, out of-the-box, our evaluations indicate that the zero/few-shot performance of prominent open-source speech-text LLMs on SLU tasks are not up to the mark. In this paper, we introduce a novel approach to robust task-agnostic fine-tuning using randomized class labels. With this proposed fine-tuning, we illustrate that the performance of the speech-text LLMs on an unseen task is significantly improved over standard approaches. Critically, the proposed approach avoids the requirement of task-specific data annotations for enabling new tasks in speech-text LLMs.
zh

[NLP-7] Codifying Character Logic in Role-Playing

【速读】: 该论文试图解决传统基于提示的角色扮演(role-playing)方法在持久性、可更新性和可控随机性方面的不足。传统方法通过将角色描述直接附加到文本提示中来实现角色行为,但这种方式依赖模型的隐式推理,难以保证逻辑的一致性和可维护性。论文提出的解决方案是“编码化角色档案(Codified Profiles)”,其关键在于将角色逻辑表示为结构化的可执行函数,通过显式的控制结构和条件检查来驱动行为决策,从而实现更稳定、可调试和可定制的角色扮演效果。

链接: https://arxiv.org/abs/2505.07705
作者: Letian Peng,Jingbo Shang
机构: University of California, San Diego (加利福尼亚大学圣地亚哥分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces Codified Profiles for role-playing, a novel approach that represents character logic as structured, executable functions for behavioral decision-making. Each profile defines a set of functions parse_by_scene(scene) that outputs a list of logic-grounded assertions triggered_statements, using both explicit control structures (e.g., if-then-else) and condition checks like check_condition(scene, question), where each question is a semantically meaningful prompt about the scene (e.g., “Is the character in danger?”) discriminated by the role-playing LLM as true, false, or unknown. This explicit representation offers three key advantages over traditional prompt-based profiles, which append character descriptions directly into text prompts: (1) Persistence, by enforcing complete and consistent execution of character logic, rather than relying on the model’s implicit reasoning; (2) Updatability, through systematic inspection and revision of behavioral logic, which is difficult to track or debug in prompt-only approaches; (3) Controllable Randomness, by supporting stochastic behavior directly within the logic, enabling fine-grained variability that prompting alone struggles to achieve. To validate these advantages, we introduce a new benchmark constructed from 83 characters and 5,141 scenes curated from Fandom, using NLI-based scoring to compare character responses against ground-truth actions. Our experiments demonstrate the significant benefits of codified profiles in improving persistence, updatability, and behavioral diversity. Notably, by offloading a significant portion of reasoning to preprocessing, codified profiles enable even 1B-parameter models to perform high-quality role-playing, providing a scalable and efficient foundation for local deployment of role-play agents.
zh

[NLP-8] hrough the Looking Glass: Common Sense Consistency Evaluation of Weird Images

【速读】: 该论文试图解决如何评估图像与常识一致性的问题(Image Common Sense Consistency),例如判断一张图片是否符合现实逻辑。其解决方案的关键在于引入一种名为Through the Looking Glass (TLG)的新方法,该方法利用生成式AI(Generative AI)中的大型视觉-语言模型(LVLMs)提取图像中的原子事实,并通过基于Transformer的编码器进行处理,随后对编码后的原子事实进行微调以构建一个紧凑的注意力池化分类器,从而实现对图像常识一致性的高效评估。

链接: https://arxiv.org/abs/2505.07704
作者: Elisei Rykov,Kseniia Petrushina,Kseniia Titova,Anton Razzhigaev,Alexander Panchenko,Vasily Konovalov
机构: Skoltech(斯科尔科沃科技学院); AIRI(人工智能研究机构); MTS AI(MTS人工智能); Moscow Institute of Physics and Technology(莫斯科物理技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Measuring how real images look is a complex task in artificial intelligence research. For example, an image of a boy with a vacuum cleaner in a desert violates common sense. We introduce a novel method, which we call Through the Looking Glass (TLG), to assess image common sense consistency using Large Vision-Language Models (LVLMs) and Transformer-based encoder. By leveraging LVLMs to extract atomic facts from these images, we obtain a mix of accurate facts. We proceed by fine-tuning a compact attention-pooling classifier over encoded atomic facts. Our TLG has achieved a new state-of-the-art performance on the WHOOPS! and WEIRD datasets while leveraging a compact fine-tuning component.
zh

[NLP-9] OnPrem.LLM : A Privacy-Conscious Document Intelligence Toolkit

【速读】: 该论文试图解决在隐私保护场景下,如何将大型语言模型(Large Language Models, LLMs)应用于敏感的非公开数据的问题。解决方案的关键在于提供一个基于Python的工具包,支持离线或受限环境中的本地执行,并通过预构建的文档处理与存储、检索增强生成(Retrieval-Augmented Generation, RAG)、信息提取、摘要生成、分类以及提示/输出处理等流水线,实现低配置下的隐私保护应用。此外,该工具包支持多种LLM后端及量化模型,具备GPU加速和后端无缝切换能力,同时允许在合规条件下与云服务提供商集成,以实现性能与数据控制之间的平衡。

链接: https://arxiv.org/abs/2505.07672
作者: Arun S. Maiya
机构: Institute for Defense Analyses (国防分析研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages

点击查看摘要

Abstract:We present this http URL, a Python-based toolkit for applying large language models (LLMs) to sensitive, non-public data in offline or restricted environments. The system is designed for privacy-preserving use cases and provides prebuilt pipelines for document processing and storage, retrieval-augmented generation (RAG), information extraction, summarization, classification, and prompt/output processing with minimal configuration. this http URL supports multiple LLM backends – including this http URL, Ollama, vLLM, and Hugging Face Transformers – with quantized model support, GPU acceleration, and seamless backend switching. Although designed for fully local execution, this http URL also supports integration with a wide range of cloud LLM providers when permitted, enabling hybrid deployments that balance performance with data control. A no-code web interface extends accessibility to non-technical users.
zh

[NLP-10] Benchmarking Retrieval-Augmented Generation for Chemistry

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在化学领域中应用受限的问题,特别是在缺乏高质量、领域特定语料库和评估基准的情况下,如何有效增强大型语言模型 (LLMs) 的性能。其解决方案的关键在于构建了一个全面的基准测试平台 ChemRAG-Bench,以及配套的模块化 RAG 工具包 ChemRAG-Toolkit,通过整合多源异构知识数据并结合多种检索算法与 LLMs,显著提升了模型在化学相关任务中的表现,平均相对提升达 17.4%。

链接: https://arxiv.org/abs/2505.07671
作者: Xianrui Zhong,Bowen Jin,Siru Ouyang,Yanzhen Shen,Qiao Jin,Yin Fang,Zhiyong Lu,Jiawei Han
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); National Institutes of Health (美国国家卫生研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has emerged as a powerful framework for enhancing large language models (LLMs) with external knowledge, particularly in scientific domains that demand specialized and dynamic information. Despite its promise, the application of RAG in the chemistry domain remains underexplored, primarily due to the lack of high-quality, domain-specific corpora and well-curated evaluation benchmarks. In this work, we introduce ChemRAG-Bench, a comprehensive benchmark designed to systematically assess the effectiveness of RAG across a diverse set of chemistry-related tasks. The accompanying chemistry corpus integrates heterogeneous knowledge sources, including scientific literature, the PubChem database, PubMed abstracts, textbooks, and Wikipedia entries. In addition, we present ChemRAG-Toolkit, a modular and extensible RAG toolkit that supports five retrieval algorithms and eight LLMs. Using ChemRAG-Toolkit, we demonstrate that RAG yields a substantial performance gain – achieving an average relative improvement of 17.4% over direct inference methods. We further conduct in-depth analyses on retriever architectures, corpus selection, and the number of retrieved passages, culminating in practical recommendations to guide future research and deployment of RAG systems in the chemistry domain. The code and data is available at this https URL.
zh

[NLP-11] Using Information Theory to Characterize Prosodic Typology: The Case of Tone Pitch-Accent and Stress-Accent

【速读】: 该论文试图解决语言学中词汇身份(lexical identity)与语调(prosody)之间关系的量化问题,特别是探讨语调在词汇区分中的作用是否可以通过信息论进行表征。解决方案的关键在于利用互信息(mutual information)来衡量词义与语调之间的关联性,并通过分析不同语言中音高(pitch)曲线与文本之间的互信息差异,验证语调在词汇区分中的功能差异。研究结果表明,声调语言在音高曲线与文本之间的互信息更高,支持了其语调系统在词汇区分中的重要作用。

链接: https://arxiv.org/abs/2505.07659
作者: Ethan Gotlieb Wilcox,Cui Ding,Giovanni Acampa,Tiago Pimentel,Alex Warstadt,Tamar I. Regev
机构: Georgetown University (乔治城大学); University of Zürich (苏黎世大学); ETH Zürich (苏黎世联邦理工学院); UC San Diego (加州大学圣地亚哥分校); MIT (麻省理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper argues that the relationship between lexical identity and prosody – one well-studied parameter of linguistic variation – can be characterized using information theory. We predict that languages that use prosody to make lexical distinctions should exhibit a higher mutual information between word identity and prosody, compared to languages that don’t. We test this hypothesis in the domain of pitch, which is used to make lexical distinctions in tonal languages, like Cantonese. We use a dataset of speakers reading sentences aloud in ten languages across five language families to estimate the mutual information between the text and their pitch curves. We find that, across languages, pitch curves display similar amounts of entropy. However, these curves are easier to predict given their associated text in the tonal languages, compared to pitch- and stress-accent languages, and thus the mutual information is higher in these languages, supporting our hypothesis. Our results support perspectives that view linguistic typology as gradient, rather than categorical.
zh

[NLP-12] JobHop: A Large-Scale Dataset of Career Trajectories

【速读】: 该论文旨在解决劳动力市场动态研究中缺乏全面数据的问题,特别是真实职业轨迹的捕捉。其解决方案的关键在于利用大型语言模型(Large Language Models)处理非结构化简历数据,提取结构化的职业信息,并通过多标签分类模型将其映射到标准化的ESCO职业代码,从而构建了一个包含超过230万条工作经历的大规模公共数据集JobHop。

链接: https://arxiv.org/abs/2505.07653
作者: Iman Johary,Raphael Romero,Alexandru C. Mara,Tijl De Bie
机构: Ghent University (根特大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding labor market dynamics is essential for policymakers, employers, and job seekers. However, comprehensive datasets that capture real-world career trajectories are scarce. In this paper, we introduce JobHop, a large-scale public dataset derived from anonymized resumes provided by VDAB, the public employment service in Flanders, Belgium. Utilizing Large Language Models (LLMs), we process unstructured resume data to extract structured career information, which is then mapped to standardized ESCO occupation codes using a multi-label classification model. This results in a rich dataset of over 2.3 million work experiences, extracted from and grouped into more than 391,000 user resumes and mapped to standardized ESCO occupation codes, offering valuable insights into real-world occupational transitions. This dataset enables diverse applications, such as analyzing labor market mobility, job stability, and the effects of career breaks on occupational transitions. It also supports career path prediction and other data-driven decision-making processes. To illustrate its potential, we explore key dataset characteristics, including job distributions, career breaks, and job transitions, demonstrating its value for advancing labor market research.
zh

[NLP-13] Chronocept: Instilling a Sense of Time in Machines

【速读】: 该论文试图解决人工智能(Artificial Intelligence, AI)在时间有效性推理方面的不足,即AI难以准确判断事实的有效性随时间变化的模式。解决方案的关键在于提出Chronocept,这是首个将时间有效性建模为时间上连续概率分布的基准,通过拟合偏正态曲线来捕捉事实的出现、衰减和相关性峰值等细微模式,从而实现可解释且泛化的学习。

链接: https://arxiv.org/abs/2505.07637
作者: Krish Goel,Sanskar Pandey,KS Mahadevan,Harsh Kumar,Vishesh Khadaria
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 8 figures, 18 tables

点击查看摘要

Abstract:Human cognition is deeply intertwined with a sense of time, known as Chronoception. This sense allows us to judge how long facts remain valid and when knowledge becomes outdated. Despite progress in vision, language, and motor control, AI still struggles to reason about temporal validity. We introduce Chronocept, the first benchmark to model temporal validity as a continuous probability distribution over time. Using skew-normal curves fitted along semantically decomposed temporal axes, Chronocept captures nuanced patterns of emergence, decay, and peak relevance. It includes two datasets: Benchmark I (atomic facts) and Benchmark II (multi-sentence passages). Annotations show strong inter-annotator agreement (84% and 89%). Our baselines predict curve parameters - location, scale, and skewness - enabling interpretable, generalizable learning and outperforming classification-based approaches. Chronocept fills a foundational gap in AI’s temporal reasoning, supporting applications in knowledge grounding, fact-checking, retrieval-augmented generation (RAG), and proactive agents. Code and data are publicly available.
zh

[NLP-14] Concept-Level Explainability for Auditing Steering LLM Responses NEURIPS2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全性和对齐性方面的挑战,特别是如何通过识别提示中影响模型输出特定方面的部分来引导模型行为,例如减轻偏见或防御越狱攻击。其解决方案的关键在于提出ConceptX,一种模型无关的、基于概念层面的可解释性方法,该方法通过识别提示中的语义丰富标记(即概念)并根据输出的语义相似性分配重要性,从而实现对模型行为的审计与引导。与当前基于令牌的方法不同,ConceptX通过原地令牌替换保持上下文完整性,并支持灵活的解释目标,如性别偏见分析,从而在忠实性和人类对齐性方面优于现有方法。

链接: https://arxiv.org/abs/2505.07610
作者: Kenza Amara,Rita Sevastjanova,Mennatallah El-Assady
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures, Submission to Neurips 2025

点击查看摘要

Abstract:As large language models (LLMs) become widely deployed, concerns about their safety and alignment grow. An approach to steer LLM behavior, such as mitigating biases or defending against jailbreaks, is to identify which parts of a prompt influence specific aspects of the model’s output. Token-level attribution methods offer a promising solution, but still struggle in text generation, explaining the presence of each token in the output separately, rather than the underlying semantics of the entire LLM response. We introduce ConceptX, a model-agnostic, concept-level explainability method that identifies the concepts, i.e., semantically rich tokens in the prompt, and assigns them importance based on the outputs’ semantic similarity. Unlike current token-level methods, ConceptX also offers to preserve context integrity through in-place token replacements and supports flexible explanation goals, e.g., gender bias. ConceptX enables both auditing, by uncovering sources of bias, and steering, by modifying prompts to shift the sentiment or reduce the harmfulness of LLM responses, without requiring retraining. Across three LLMs, ConceptX outperforms token-level methods like TokenSHAP in both faithfulness and human alignment. Steering tasks boost sentiment shift by 0.252 versus 0.131 for random edits and lower attack success rates from 0.463 to 0.242, outperforming attribution and paraphrasing baselines. While prompt engineering and self-explaining methods sometimes yield safer responses, ConceptX offers a transparent and faithful alternative for improving LLM safety and alignment, demonstrating the practical value of attribution-based explainability in guiding LLM behavior.
zh

[NLP-15] MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining

【速读】: 该论文旨在解决大型语言模型在推理任务中的性能瓶颈问题,特别是如何通过优化预训练和后训练阶段来提升模型的推理能力。其解决方案的关键在于:在预训练阶段,通过增强数据预处理流程并采用三阶段数据混合策略来强化基础模型的推理潜力,同时引入多标记预测目标以提高性能和推理速度;在后训练阶段,通过构建包含13万道可验证数学与编程问题的数据集进行强化学习,并结合基于测试难度驱动的代码奖励机制缓解稀疏奖励问题,同时采用策略性数据重采样以稳定训练过程。

链接: https://arxiv.org/abs/2505.07608
作者: Xiaomi LLM-Core Team:Bingquan Xia,Bowen Shen,Cici,Dawei Zhu,Di Zhang,Gang Wang,Hailin Zhang,Huaqiu Liu,Jiebao Xiao,Jinhao Dong,Liang Zhao,Peidian Li,Peng Wang,Shihua Yu,Shimao Chen,Weikun Wang,Wenhan Ma,Xiangwei Deng,Yi Huang,Yifan Song,Zihan Jiang,Bowen Ye,Can Cai,Chenhong He,Dong Zhang,Duo Zhang,Guoan Wang,Hao Tian,Haochen Zhao,Heng Qu,Hongshen Xu,Jun Shi,Kainan Bao,QingKai Fang,Kang Zhou,Kangyang Zhou,Lei Li,Menghang Zhu,Nuo Chen,Qiantong Wang,Shaohui Liu,Shicheng Li,Shuhao Gu,Shuhuai Ren,Shuo Liu,Sirui Deng,Weiji Zhuang,Weiwei Lv,Wenyu Yang,Xin Zhang,Xing Yong,Xing Zhang,Xingchen Song,Xinzhe Xu,Xu Wang,Yihan Yan,Yu Tu,Yuanyuan Tian,Yudong Wang,Yue Yu,Zhenru Lin,Zhichao Song,Zihao Yue
机构: Xiaomi LLM-Core Team(小米大模型核心团队)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model’s reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at this https URL.
zh

[NLP-16] Characterizing the Investigative Methods of Fictional Detectives with Large Language Models

【速读】: 该论文试图解决计算叙事学中对侦探小说中虚构侦探的调查方法进行系统性表征的问题,传统文学研究虽然提供了深入的分析,但通常局限于少量角色且缺乏可扩展性。解决方案的关键在于提出一种基于人工智能的方法,利用15个大型语言模型(Large Language Models, LLMs)的多阶段工作流,提取、综合并验证虚构侦探的独特调查特征,从而实现对侦探调查风格的有效识别与建模。

链接: https://arxiv.org/abs/2505.07601
作者: Edirlei Soares de Lima,Marco A. Casanova,Bruno Feijó,Antonio L. Furtado
机构: Academy for AI, Games and Media; Breda University of Applied Sciences; Department of Informatics; PUC-Rio
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Detective fiction, a genre defined by its complex narrative structures and character-driven storytelling, presents unique challenges for computational narratology, a research field focused on integrating literary theory into automated narrative generation. While traditional literary studies have offered deep insights into the methods and archetypes of fictional detectives, these analyses often focus on a limited number of characters and lack the scalability needed for the extraction of unique traits that can be used to guide narrative generation methods. In this paper, we present an AI-driven approach for systematically characterizing the investigative methods of fictional detectives. Our multi-phase workflow explores the capabilities of 15 Large Language Models (LLMs) to extract, synthesize, and validate distinctive investigative traits of fictional detectives. This approach was tested on a diverse set of seven iconic detectives - Hercule Poirot, Sherlock Holmes, William Murdoch, Columbo, Father Brown, Miss Marple, and Auguste Dupin - capturing the distinctive investigative styles that define each character. The identified traits were validated against existing literary analyses and further tested in a reverse identification phase, achieving an overall accuracy of 91.43%, demonstrating the method’s effectiveness in capturing the distinctive investigative approaches of each detective. This work contributes to the broader field of computational narratology by providing a scalable framework for character analysis, with potential applications in AI-driven interactive storytelling and automated narrative generation.
zh

[NLP-17] Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在使用检索增强生成(Retrieval-augmented generation, RAG)策略时存在的冗余检索、潜在有害知识冲突以及推理延迟增加的问题。其解决方案的关键在于提出一种高效且自适应的搜索代理——强化内部-外部知识协同推理代理(Reinforced Internal-External Knowledge Synergistic Reasoning Agent, IKEA),该代理能够识别自身的知识边界,优先利用内部知识,在内部知识不足时再调用外部检索,并通过一种新的知识边界感知奖励函数和训练数据集来实现内部与外部知识的协同优化。

链接: https://arxiv.org/abs/2505.07596
作者: Ziyang Huang,Xiaowei Yuan,Yiming Ju,Jun Zhao,Kang Liu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of Chinese Academy of Sciences (中国科学院大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is a common strategy to reduce hallucinations in Large Language Models (LLMs). While reinforcement learning (RL) can enable LLMs to act as search agents by activating retrieval capabilities, existing ones often underutilize their internal knowledge. This can lead to redundant retrievals, potential harmful knowledge conflicts, and increased inference latency. To address these limitations, an efficient and adaptive search agent capable of discerning optimal retrieval timing and synergistically integrating parametric (internal) and retrieved (external) knowledge is in urgent need. This paper introduces the Reinforced Internal-External Knowledge Synergistic Reasoning Agent (IKEA), which could indentify its own knowledge boundary and prioritize the utilization of internal knowledge, resorting to external search only when internal knowledge is deemed insufficient. This is achieved using a novel knowledge-boundary aware reward function and a knowledge-boundary aware training dataset. These are designed for internal-external knowledge synergy oriented RL, incentivizing the model to deliver accurate answers, minimize unnecessary retrievals, and encourage appropriate external searches when its own knowledge is lacking. Evaluations across multiple knowledge reasoning tasks demonstrate that IKEA significantly outperforms baseline methods, reduces retrieval frequency significantly, and exhibits robust generalization capabilities.
zh

[NLP-18] A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models

【速读】: 该论文试图解决现有指令跟随评估基准依赖模板化约束提示导致的真实场景多样性不足以及细粒度性能评估受限的问题。其解决方案的关键在于提出一个包含三种约束模式、四种约束类别和四种难度级别的多维约束框架,并基于此框架开发了一条自动化指令生成流水线,实现约束扩展、冲突检测和指令重写,从而生成1,200个可验证代码的指令跟随测试样本。

链接: https://arxiv.org/abs/2505.07591
作者: Junjie Ye,Caishuang Huang,Zhuohan Chen,Wenjie Fu,Chenyuan Yang,Leyi Yang,Yilong Wu,Peng Wang,Meng Zhou,Xiaolong Yang,Tao Gui,Qi Zhang,Zhongchao Shi,Jianping Fan,Xuanjing Huang
机构: Fudan University (复旦大学); Lenovo Research (联想研究院); Tencent (腾讯)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Instruction following evaluates large language models (LLMs) on their ability to generate outputs that adhere to user-defined constraints. However, existing benchmarks often rely on templated constraint prompts, which lack the diversity of real-world usage and limit fine-grained performance assessment. To fill this gap, we propose a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Building on this framework, we develop an automated instruction generation pipeline that performs constraint expansion, conflict detection, and instruction rewriting, yielding 1,200 code-verifiable instruction-following test samples. We evaluate 19 LLMs across seven model families and uncover substantial variation in performance across constraint forms. For instance, average performance drops from 77.67% at Level I to 32.96% at Level IV. Furthermore, we demonstrate the utility of our approach by using it to generate data for reinforcement learning, achieving substantial gains in instruction following without degrading general performance. In-depth analysis indicates that these gains stem primarily from modifications in the model’s attention modules parameters, which enhance constraint recognition and adherence. Code and data are available in this https URL.
zh

[NLP-19] Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)与人类偏好对齐的问题,现有方法依赖于特定的偏好模型(如Bradley-Terry模型),导致统计不一致性,即增加数据量并不能保证收敛到真实的用户偏好。解决方案的关键在于提出一种新的对齐方法——直接密度比优化(Direct Density Ratio Optimization, DDRO),该方法直接估计偏好输出与非偏好输出分布之间的密度比,从而避免了显式建模人类偏好,理论上证明了DDRO的统计一致性,确保随着数据量增加能够收敛到真实偏好分布。

链接: https://arxiv.org/abs/2505.07558
作者: Rei Higuchi,Taiji Suzuki
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Aligning large language models (LLMs) with human preferences is crucial for safe deployment, yet existing methods assume specific preference models like Bradley-Terry model. This assumption leads to statistical inconsistency, where more data doesn’t guarantee convergence to true human preferences. To address this critical gap, we introduce a novel alignment method Direct Density Ratio Optimization (DDRO). DDRO directly estimates the density ratio between preferred and unpreferred output distributions, circumventing the need for explicit human preference modeling. We theoretically prove that DDRO is statistically consistent, ensuring convergence to the true preferred distribution as the data size grows, regardless of the underlying preference structure. Experiments demonstrate that DDRO achieves superior performance compared to existing methods on many major benchmarks. DDRO unlocks the potential for truly data-driven alignment, paving the way for more reliable and human-aligned LLMs.
zh

[NLP-20] SEReDeEP: Hallucination Detection in Retrieval-Augmented Models via Semantic Entropy and Context-Parameter Fusion

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)模型在整合外部信息与内部参数化知识时频繁出现的幻觉(hallucination)问题。现有方法多侧重于单独分析外部或内部机制,忽略了两者的协同作用。论文提出的解决方案关键在于通过语义熵(semantic entropy)的计算,利用训练好的线性探测器增强计算过程,从而更准确地评估幻觉,提升对真实情况的反映能力。

链接: https://arxiv.org/abs/2505.07528
作者: Lei Wang
机构: Hunan University (湖南大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) models frequently encounter hallucination phenomena when integrating external information with internal parametric knowledge. Empirical studies demonstrate that the disequilibrium between external contextual information and internal parametric knowledge constitutes a primary factor in hallucination generation. Existing hallucination detection methodologies predominantly emphasize either the external or internal mechanism in isolation, thereby overlooking their synergistic effects. The recently proposed ReDeEP framework decouples these dual mechanisms, identifying two critical contributors to hallucinations: excessive reliance on parametric knowledge encoded in feed-forward networks (FFN) and insufficient utilization of external information by attention mechanisms (particularly copy heads). ReDeEP quantitatively assesses these factors to detect hallucinations and dynamically modulates the contributions of FFNs and copy heads to attenuate their occurrence. Nevertheless, ReDeEP and numerous other hallucination detection approaches have been employed at logit-level uncertainty estimation or language-level self-consistency evaluation, inadequately address the semantic dimensions of model responses, resulting in inconsistent hallucination assessments in RAG implementations. Building upon ReDeEP’s foundation, this paper introduces SEReDeEP, which enhances computational processes through semantic entropy captured via trained linear probes, thereby achieving hallucination assessments that more accurately reflect ground truth evaluations.
zh

[NLP-21] oolACE-DEV: Self-Improving Tool Learning via Decomposition and EVolution

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在工具使用能力上的提升问题,特别是针对当前依赖先进模型进行数据合成的微调方法所存在的高成本和数据兼容性问题。解决方案的关键在于提出一种自改进框架ToolACE-DEV,通过将工具学习目标分解为增强基础工具构建与使用能力的子任务,并引入自进化范式,使轻量级模型能够自主优化,从而降低对先进LLMs的依赖。

链接: https://arxiv.org/abs/2505.07512
作者: Xu Huang,Weiwen Liu,Xingshan Zeng,Yuefeng Huang,Xinlong Hao,Yuxian Wang,Yirong Zeng,Chuhan Wu,Yasheng Wang,Ruiming Tang,Defu Lian
机构: University of Science and Technology of China (中国科学技术大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Huawei Technologies Co., Ltd (华为技术有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The tool-using capability of large language models (LLMs) enables them to access up-to-date external information and handle complex tasks. Current approaches to enhancing this capability primarily rely on distilling advanced models by data synthesis. However, this method incurs significant costs associated with advanced model usage and often results in data compatibility issues, led by the high discrepancy in the knowledge scope between the advanced model and the target model. To address these challenges, we propose ToolACE-DEV, a self-improving framework for tool learning. First, we decompose the tool-learning objective into sub-tasks that enhance basic tool-making and tool-using abilities. Then, we introduce a self-evolving paradigm that allows lightweight models to self-improve, reducing reliance on advanced LLMs. Extensive experiments validate the effectiveness of our approach across models of varying scales and architectures.
zh

[NLP-22] ranslating the Grievance Dictionary: a psychometric evaluation of Dutch German and Italian versions

【速读】: 该论文试图解决跨语言分析暴力、威胁或基于不满情绪文本的工具不足问题,其解决方案的关键在于开发并评估Grievance Dictionary(一种心理语言学词典)的荷兰语、德语和意大利语翻译版本。通过自动化翻译结合人工标注的过程,确保翻译词典在不同语言中的适用性,并通过心理测量分析验证其可靠性与有效性。

链接: https://arxiv.org/abs/2505.07495
作者: Isabelle van der Vegt,Bennett Kleinberg,Marilu Miotto,Jonas Festor
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces and evaluates three translations of the Grievance Dictionary, a psycholinguistic dictionary for the analysis of violent, threatening or grievance-fuelled texts. Considering the relevance of these themes in languages beyond English, we translated the Grievance Dictionary to Dutch, German, and Italian. We describe the process of automated translation supplemented by human annotation. Psychometric analyses are performed, including internal reliability of dictionary categories and correlations with the LIWC dictionary. The Dutch and German translations perform similarly to the original English version, whereas the Italian dictionary shows low reliability for some categories. Finally, we make suggestions for further validation and application of the dictionary, as well as for future dictionary translations following a similar approach.
zh

[NLP-23] A Survey on Collaborative Mechanisms Between Large and Small Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在部署过程中面临的高资源消耗和延迟问题,以及小型语言模型(Small Language Models, SLMs)在性能上的局限性之间的权衡问题。其解决方案的关键在于通过LLM与SLM的协作,利用多种交互机制(如流水线、路由、辅助、蒸馏和融合)实现性能与效率的平衡,从而推动更高效、可扩展且适用于资源受限边缘设备的AI应用。

链接: https://arxiv.org/abs/2505.07460
作者: Yi Chen,JiaHao Zhao,HaoHao Han
机构: Chengdu Institute of Computer Applications, Chinese Academy of Sciences(中国科学院成都计算机应用研究所); University of Chinese Academy of Sciences(中国科学院大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) deliver powerful AI capabilities but face deployment challenges due to high resource costs and latency, whereas Small Language Models (SLMs) offer efficiency and deployability at the cost of reduced performance. Collaboration between LLMs and SLMs emerges as a crucial paradigm to synergistically balance these trade-offs, enabling advanced AI applications, especially on resource-constrained edge devices. This survey provides a comprehensive overview of LLM-SLM collaboration, detailing various interaction mechanisms (pipeline, routing, auxiliary, distillation, fusion), key enabling technologies, and diverse application scenarios driven by on-device needs like low latency, privacy, personalization, and offline operation. While highlighting the significant potential for creating more efficient, adaptable, and accessible AI, we also discuss persistent challenges including system overhead, inter-model consistency, robust task allocation, evaluation complexity, and security/privacy concerns. Future directions point towards more intelligent adaptive frameworks, deeper model fusion, and expansion into multimodal and embodied AI, positioning LLM-SLM collaboration as a key driver for the next generation of practical and ubiquitous artificial intelligence.
zh

[NLP-24] Matching Tasks with Industry Groups for Augmenting Commonsense Knowledge

【速读】: 该论文试图解决现有常识知识库(Commonsense Knowledge Base, KB)在捕捉不同行业领域(Industry Group, IG)特定任务方面的不足,尤其是在像ConceptNet这样的大型KB中,仅包含少量跨行业通用任务的显式知识。其解决方案的关键在于提出一种弱监督框架,通过训练神经模型学习任务与行业群体之间的亲和性(task-IG affinity),并结合聚类方法为每个IG选择top-k任务,从而有效地扩充常识KB中的任务-行业关系。该方法从两个公开新闻数据集中提取了2339个形式为⟨IG, is capable of, task⟩的三元组,精度达到0.86,验证了所提取任务-行业对的可靠性。

链接: https://arxiv.org/abs/2505.07440
作者: Rituraj Singh,Sachin Pawar,Girish Palshikar
机构: TCS Research, Tata Consultancy Services Limited(塔塔咨询服务有限公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Commonsense knowledge bases (KB) are a source of specialized knowledge that is widely used to improve machine learning applications. However, even for a large KB such as ConceptNet, capturing explicit knowledge from each industry domain is challenging. For example, only a few samples of general \em tasks performed by various industries are available in ConceptNet. Here, a task is a well-defined knowledge-based volitional action to achieve a particular goal. In this paper, we aim to fill this gap and present a weakly-supervised framework to augment commonsense KB with tasks carried out by various industry groups (IG). We attempt to \em match each task with one or more suitable IGs by training a neural model to learn task-IG affinity and apply clustering to select the top-k tasks per IG. We extract a total of 2339 triples of the form \langle IG, is~capable~of, task \rangle from two publicly available news datasets for 24 IGs with the precision of 0.86. This validates the reliability of the extracted task-IG pairs that can be directly added to existing KBs.
zh

[NLP-25] Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights

【速读】: 该论文试图解决在突发全球健康危机中,如何通过分析公众情绪来优化公共卫生策略的问题。其解决方案的关键在于利用大规模推文数据集(分别包含147,475条和106,638条关于新冠疫情和猴痘疫情的推文)并结合先进的机器学习模型(如逻辑回归、朴素贝叶斯、RoBERTa、DistilRoBERTa和XLNet)进行情感分类,从而揭示公众情绪和讨论趋势的差异。该方法为制定针对性的公共卫生信息传播策略、减少错误信息以及增强公众信任提供了重要依据。

链接: https://arxiv.org/abs/2505.07430
作者: Mostafa Mohaimen Akand Faisal,Rabeya Amin Jhuma
机构: University of Information Technology and Sciences (UITS)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The emergence of global health crises, such as COVID-19 and Monkeypox (mpox), has underscored the importance of understanding public sentiment to inform effective public health strategies. This study conducts a comparative sentiment analysis of public perceptions surrounding COVID-19 and mpox by leveraging extensive datasets of 147,475 and 106,638 tweets, respectively. Advanced machine learning models, including Logistic Regression, Naive Bayes, RoBERTa, DistilRoBERTa and XLNet, were applied to perform sentiment classification, with results indicating key trends in public emotion and discourse. The analysis highlights significant differences in public sentiment driven by disease characteristics, media representation, and pandemic fatigue. Through the lens of sentiment polarity and thematic trends, this study offers valuable insights into tailoring public health messaging, mitigating misinformation, and fostering trust during concurrent health crises. The findings contribute to advancing sentiment analysis applications in public health informatics, setting the groundwork for enhanced real-time monitoring and multilingual analysis in future research.
zh

[NLP-26] ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation

【速读】: 该论文旨在解决多模态评论有用性预测(Multimodal Review Helpfulness Prediction, MRHP)任务中数据集语言多样性不足的问题,特别是在低资源语言如越南语中的数据稀缺问题。其解决方案的关键在于构建一个大规模的越南语多模态评论有用性预测基准数据集ViMRHP,并通过生成式AI辅助标注过程,以显著降低标注时间和成本,同时保持数据质量。

链接: https://arxiv.org/abs/2505.07416
作者: Truc Mai-Thanh Nguyen,Dat Minh Nguyen,Son T. Luu,Kiet Van Nguyen
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at NLDB 2025

点击查看摘要

Abstract:Multimodal Review Helpfulness Prediction (MRHP) is an essential task in recommender systems, particularly in E-commerce platforms. Determining the helpfulness of user-generated reviews enhances user experience and improves consumer decision-making. However, existing datasets focus predominantly on English and Indonesian, resulting in a lack of linguistic diversity, especially for low-resource languages such as Vietnamese. In this paper, we introduce ViMRHP (Vietnamese Multimodal Review Helpfulness Prediction), a large-scale benchmark dataset for MRHP task in Vietnamese. This dataset covers four domains, including 2K products with 46K reviews. Meanwhile, a large-scale dataset requires considerable time and cost. To optimize the annotation process, we leverage AI to assist annotators in constructing the ViMRHP dataset. With AI assistance, annotation time is reduced (90 to 120 seconds per task down to 20 to 40 seconds per task) while maintaining data quality and lowering overall costs by approximately 65%. However, AI-generated annotations still have limitations in complex annotation tasks, which we further examine through a detailed performance analysis. In our experiment on ViMRHP, we evaluate baseline models on human-verified and AI-generated annotations to assess their quality differences. The ViMRHP dataset is publicly available at this https URL
zh

[NLP-27] Computational Fact-Checking of Online Discourse: Scoring scientific accuracy in climate change related news articles

【速读】: 该论文试图解决虚假信息在主流媒体中对公民话语造成的威胁,特别是如何半自动化地量化在线媒体的科学准确性(scientific accuracy)。解决方案的关键在于通过语义化处理未知可信度的媒体内容,并将其陈述与经过相同处理的可信来源进行对比,从而实现对信息真实性的评估。该方法利用基于大语言模型(LLM)的陈述提取和知识图谱分析,构建了一个神经符号系统,以提升现有真实性量化技术的效率。

链接: https://arxiv.org/abs/2505.07409
作者: Tim Wittenborg,Constantin Sebastian Tremel,Markus Stocker,Sören Auer
机构: L3S Research Center, Leibniz University Hannover(莱布尼茨汉诺威L3S研究中心); TIB - Leibniz Information Centre for Science and Technology(莱布尼茨科学技术信息中心)
类目: Computation and Language (cs.CL)
备注: 4 pages, 4 figures, submitted to ACM Web Conference 2025

点击查看摘要

Abstract:Democratic societies need reliable information. Misinformation in popular media such as news articles or videos threatens to impair civic discourse. Citizens are, unfortunately, not equipped to verify this content flood consumed daily at increasing rates. This work aims to semi-automatically quantify scientific accuracy of online media. By semantifying media of unknown veracity, their statements can be compared against equally processed trusted sources. We implemented a workflow using LLM-based statement extraction and knowledge graph analysis. Our neurosymbolic system was able to evidently streamline state-of-the-art veracity quantification. Evaluated via expert interviews and a user survey, the tool provides a beneficial veracity indication. This indicator, however, is unable to annotate public media at the required granularity and scale. Further work towards a FAIR (Findable, Accessible, Interoperable, Reusable) ground truth and complementary metrics are required to scientifically support civic discourse.
zh

[NLP-28] Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge

【速读】: 该论文旨在解决音频语言模型在多领域声音理解中的问答能力问题,具体通过构建一个跨领域的音频问答(Audio Question Answering, AQA)基准来评估模型在不同声学场景下的交互式问答性能。解决方案的关键在于设计三个不同的QA子集(生物声学、时间声景和复杂QA),涵盖从海洋哺乳动物叫声到复杂现实场景的多样化数据,并采用基于top-1准确率与答案洗牌鲁棒性的评估协议,同时引入多个基线系统以对比模型表现,从而推动音频语言模型在音频理解和推理能力上向人类水平迈进。

链接: https://arxiv.org/abs/2505.07365
作者: Chao-Han Huck Yang,Sreyan Ghosh,Qing Wang,Jaeyeon Kim,Hengyi Hong,Sonal Kumar,Guirui Zhong,Zhifeng Kong,S Sakshi,Vaibhavi Lokegaonkar,Oriol Nieto,Ramani Duraiswami,Dinesh Manocha,Gunhee Kim,Jun Du,Rafael Valle,Bryan Catanzaro
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Preprint. DCASE 2025 Audio QA Challenge: this https URL

点击查看摘要

Abstract:We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which are crucial for enabling AI agents to perceive and interact about the world effectively.
zh

[NLP-29] QUPID: Quantified Understanding for Enhanced Performance Insights and Decisions in Korean Search Engines

【速读】: 该论文试图解决信息检索中相关性评估的效率与准确性问题,特别是针对当前广泛使用的大型语言模型(Large Language Models, LLMs)在计算成本高且性能提升有限的局限性。解决方案的关键在于通过结合两种具有不同架构的小型语言模型(Small Language Models, SLMs),即生成式SLM与基于嵌入的SLM,构建QUPID方法,从而在保持较高相关性判断准确率的同时显著降低计算成本,并提升系统的可扩展性。

链接: https://arxiv.org/abs/2505.07345
作者: Ohjoon Kwon,Changsu Lee,Jihye Back,Lim Sun Suk,Inho Kang,Donghyeon Jeon
机构: Naver Corporation (耐伯)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been widely used for relevance assessment in information retrieval. However, our study demonstrates that combining two distinct small language models (SLMs) with different architectures can outperform LLMs in this task. Our approach – QUPID – integrates a generative SLM with an embedding-based SLM, achieving higher relevance judgment accuracy while reducing computational costs compared to state-of-the-art LLM solutions. This computational efficiency makes QUPID highly scalable for real-world search systems processing millions of queries daily. In experiments across diverse document types, our method demonstrated consistent performance improvements (Cohen’s Kappa of 0.646 versus 0.387 for leading LLMs) while offering 60x faster inference times. Furthermore, when integrated into production search pipelines, QUPID improved nDCG@5 scores by 1.9%. These findings underscore how architectural diversity in model combinations can significantly enhance both search relevance and operational efficiency in information retrieval systems.
zh

[NLP-30] owards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study

【速读】: 该论文旨在解决多智能体大语言模型系统中协作结构设计的问题,以提升集体推理能力。其关键解决方案在于系统性地研究三个核心设计维度:(1)专业领域对齐(Expertise-Domain Alignment),(2)协作范式(结构化工作流与多样性驱动集成),(3)系统规模,并发现专业知识对齐在上下文推理任务中效果显著,基于多样知识整合的协作优于严格的任务分解,同时揭示了在扩展多智能体系统时的计算权衡,强调了高效通信协议设计的重要性。

链接: https://arxiv.org/abs/2505.07313
作者: Baixuan Xu,Chunyang Li,Weiqi Wang,Wei Fan,Tianshi Zheng,Haochen Shi,Tao Fan,Yangqiu Song,Qiang Yang
机构: The Hong Kong University of Science and Technology (香港科技大学); WeBank (微众银行); The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages

点击查看摘要

Abstract:Designing effective collaboration structure for multi-agent LLM systems to enhance collective reasoning is crucial yet remains under-explored. In this paper, we systematically investigate how collaborative reasoning performance is affected by three key design dimensions: (1) Expertise-Domain Alignment, (2) Collaboration Paradigm (structured workflow vs. diversity-driven integration), and (3) System Scale. Our findings reveal that expertise alignment benefits are highly domain-contingent, proving most effective for contextual reasoning tasks. Furthermore, collaboration focused on integrating diverse knowledge consistently outperforms rigid task decomposition. Finally, we empirically explore the impact of scaling the multi-agent system with expertise specialization and study the computational trade off, highlighting the need for more efficient communication protocol design. This work provides concrete guidelines for configuring specialized multi-agent system and identifies critical architectural trade-offs and bottlenecks for scalable multi-agent reasoning. The code will be made available upon acceptance.
zh

[NLP-31] AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection

【速读】: 该论文试图解决如何有效收集具有复杂推理能力的预训练数据以提升大语言模型(Large Language Models, LLMs)性能的问题。传统方法依赖于监督分类器来识别此类数据,但需要人工或模型标注,常引入领域特定偏差。该论文提出的解决方案关键在于AttentionInfluence,这是一种无需监督信号、训练-free 的方法,通过简单的注意力头掩码操作,使小型预训练语言模型能够作为高效的数据选择器,从而提升大规模模型的性能。

链接: https://arxiv.org/abs/2505.07293
作者: Kai Hua,Steven Wu,Ge Zhang,Ke Shen
机构: 未知
类目: Computation and Language (cs.CL)
备注: 28 pages, 19 figures

点击查看摘要

Abstract:Recently, there has been growing interest in collecting reasoning-intensive pretraining data to improve LLMs’ complex reasoning ability. Prior approaches typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs, often introducing domain-specific biases. Due to the attention heads being crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method without supervision signal. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when masking these heads. We apply AttentionInfluence to a 1.3B-parameter dense model to conduct data selection on the SmolLM corpus of 241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B tokens to pretrain a 7B-parameter dense model using 1T training tokens and WSD learning rate scheduling. Our experimental results demonstrate substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks (i.e., MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval). This demonstrates an effective weak-to-strong scaling property, with small models improving the final performance of larger models-offering a promising and scalable path for reasoning-centric data selection.
zh

[NLP-32] Semantic Retention and Extreme Compression in LLM s: Can We Have Both? IJCNN

【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)部署中计算和内存成本过高的问题,特别是通过高效模型压缩技术来降低这些成本。其解决方案的关键在于探索剪枝(pruning)与量化(quantization)联合应用的潜力,以实现比单一方法更优的性能-压缩比。为此,作者提出了语义保留压缩率(Semantic Retention Compression Rate, SrCr)作为评估指标,以更准确地衡量模型压缩与语义保留之间的权衡,从而优化剪枝与量化的配置。实验结果表明,该联合方法在相同理论压缩率下,平均性能提升了20%。

链接: https://arxiv.org/abs/2505.07289
作者: Stanislas Laborde,Martin Cousseau,Antoun Yaacoub,Lionel Prevost
机构: Learning, Data and Robotics (LDR) ESIEA Lab, ESIEA, Paris, France
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication in the Proceedings of the 2025 International Joint Conference on Neural Networks (IJCNN); this arXiv version includes an appendix with 6 result tables; 10 pages, 15 figures, 7 tables

点击查看摘要

Abstract:The exponential growth in Large Language Model (LLM) deployment has intensified the need for efficient model compression techniques to reduce computational and memory costs. While pruning and quantization have shown promise, their combined potential remains largely unexplored. In this paper, we examine joint compression and how strategically combining pruning and quantization could yield superior performance-to-compression ratios compared to single-method approaches. Recognizing the challenges in accurately assessing LLM performance, we address key limitations of previous evaluation frameworks and introduce the Semantic Retention Compression Rate (SrCr), a novel metric that quantifies the trade-off between model compression and semantic preservation, facilitating the optimization of pruning-quantization configurations. Experiments demonstrate that our recommended combination achieves, on average, a 20% performance increase compared to an equivalent quantization-only model at the same theoretical compression rate.
zh

[NLP-33] On the Robustness of Reward Models for Language Model Alignment ICML2025

【速读】: 该论文旨在解决奖励模型(Reward Model, RM)在基于人类反馈的强化学习(Reinforcement Learning with Human Feedback, RLHF)中因使用Bradley-Terry (BT) 模型损失函数而产生的过优化问题,该问题导致RM在未见输入分布上的泛化能力下降。论文提出的关键解决方案是批次级归零正则化(Batch-wise Sum-to-Zero Regularization, BSR),通过强制每批次的奖励总和为零,从而约束极端幅度的奖励值,缓解隐藏状态范数的过度离散,提升RM的分布鲁棒性。

链接: https://arxiv.org/abs/2505.07271
作者: Jiwoo Hong,Noah Lee,Eunki Kim,Guijin Son,Woojin Chung,Aman Gupta,Shao Tang,James Thorne
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICML 2025

点击查看摘要

Abstract:The Bradley-Terry (BT) model is widely practiced in reward modeling for reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with BT model loss are prone to over-optimization, losing generalizability to unseen input distributions. In this paper, we study the cause of over-optimization in RM training and its downstream effects on the RLHF procedure, accentuating the importance of distributional robustness of RMs in unseen data. First, we show that the excessive dispersion of hidden state norms is the main source of over-optimization. Then, we propose batch-wise sum-to-zero regularization (BSR) to enforce zero-centered reward sum per batch, constraining the rewards with extreme magnitudes. We assess the impact of BSR in improving robustness in RMs through four scenarios of over-optimization, where BSR consistently manifests better robustness. Subsequently, we compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, which surpasses state-of-the-art RMs in the 8B scale by adding more than 5% in complex preference prediction tasks. By conducting RLOO training with 8B RMs, AlpacaEval 2.0 reduces generation length by 40% while adding a 7% increase in win rate, further highlighting that robustness in RMs induces robustness in RLHF training. We release the code, data, and models: this https URL.
zh

[NLP-34] No Query No Access

【速读】: 该论文旨在解决文本对抗攻击中存在的一系列限制问题,包括对目标模型的先验知识依赖、大量查询需求以及对训练数据的访问要求,从而提升攻击在现实场景中的可行性。其解决方案的关键在于提出基于目标文本的对抗攻击方法(Victim Data-based Adversarial Attack, VDBA),该方法仅利用目标文本进行攻击,无需访问目标模型或其训练数据。为应对信息反馈不足导致的攻击成功率低的问题,VDBA引入了分层替代模型设计和多样化的对抗样本生成策略,以提高攻击的有效性和相似性。实验结果表明,VDBA在多个数据集上显著提升了攻击成功率,并对如Qwen2和GPT系列等大型语言模型构成了严重威胁。

链接: https://arxiv.org/abs/2505.07258
作者: Wenqiang Wang,Siyuan Liang,Yangshijie Zhang,Xiaojun Jia,Hao Lin,Xiaochun Cao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Textual adversarial attacks mislead NLP models, including Large Language Models (LLMs), by subtly modifying text. While effective, existing attacks often require knowledge of the victim model, extensive queries, or access to training data, limiting real-world feasibility. To overcome these constraints, we introduce the \textbfVictim Data-based Adversarial Attack (VDBA), which operates using only victim texts. To prevent access to the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods as a foundation for developing substitute models. To address the low attack success rate (ASR) due to insufficient information feedback, we propose the hierarchical substitution model design, generating substitute models to mitigate the failure of a single substitute model at the decision boundary. Concurrently, we use diverse adversarial example generation, employing various attack methods to generate and select the adversarial example with better similarity and attack effectiveness. Experiments on the Emotion and SST5 datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR improvement of 52.08% while significantly reducing attack queries to 0. More importantly, we discover that VDBA poses a significant threat to LLMs such as Qwen2 and the GPT family, and achieves the highest ASR of 45.99% even without access to the API, confirming that advanced NLP models still face serious security risks. Our codes can be found at this https URL Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.07258 [cs.CL] (or arXiv:2505.07258v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.07258 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-35] SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models

【速读】: 该论文旨在解决主观答案评分(Subjective Answer Grading, SAG)中现有方法评分粒度粗、缺乏详细推理以及大型语言模型(Large Language Models, LLMs)在评分决策中存在偏差、不一致性和透明度不足的问题。其解决方案的关键在于引入SAS-Bench,这是一个专为基于LLM的短答案评分(Short Answer Scoring, SAS)任务设计的基准,提供了细粒度的分步评分、专家标注的错误类别以及来自真实学科考试的多样化题目类型,从而支持对模型推理过程和可解释性的深入评估。

链接: https://arxiv.org/abs/2505.07247
作者: Peichao Lai,Kexuan Zhang,Yi Lin,Linyihan Zhang,Feiyang Ye,Jinhao Yan,Yanwei Xu,Conghui He,Yilei Wang,Wentao Zhang,Bin Cui
机构: Peking University (北京大学); Fuzhou University (福州大学); Hunan University (湖南大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Subjective Answer Grading (SAG) plays a crucial role in education, standardized testing, and automated assessment systems, particularly for evaluating short-form responses in Short Answer Scoring (SAS). However, existing approaches often produce coarse-grained scores and lack detailed reasoning. Although large language models (LLMs) have demonstrated potential as zero-shot evaluators, they remain susceptible to bias, inconsistencies with human judgment, and limited transparency in scoring decisions. To overcome these limitations, we introduce SAS-Bench, a benchmark specifically designed for LLM-based SAS tasks. SAS-Bench provides fine-grained, step-wise scoring, expert-annotated error categories, and a diverse range of question types derived from real-world subject-specific exams. This benchmark facilitates detailed evaluation of model reasoning processes and explainability. We also release an open-source dataset containing 1,030 questions and 4,109 student responses, each annotated by domain experts. Furthermore, we conduct comprehensive experiments with various LLMs, identifying major challenges in scoring science-related questions and highlighting the effectiveness of few-shot prompting in improving scoring accuracy. Our work offers valuable insights into the development of more robust, fair, and educationally meaningful LLM-based evaluation systems.
zh

[NLP-36] DynamicRAG : Leverag ing Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation

【速读】: 该论文试图解决检索增强生成(Retrieval-augmented generation, RAG)系统中检索文档数量(k)选择不当导致的性能问题,即过少可能遗漏关键信息,过多则引入噪声和效率低下。解决方案的关键在于提出DynamicRAG框架,其中重排序器通过强化学习(Reinforcement Learning, RL)优化,动态调整检索文档的顺序和数量,其奖励信号来源于大语言模型(Large Language Models, LLMs)输出的质量,从而提升生成质量和可解释性。

链接: https://arxiv.org/abs/2505.07233
作者: Jiashuo Sun,Xianrui Zhong,Sizhe Zhou,Jiawei Han
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages, 6 figures, 15 tables

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems combine large language models (LLMs) with external knowledge retrieval, making them highly effective for knowledge-intensive tasks. A crucial but often under-explored component of these systems is the reranker, which refines retrieved documents to enhance generation quality and explainability. The challenge of selecting the optimal number of documents (k) remains unsolved: too few may omit critical information, while too many introduce noise and inefficiencies. Although recent studies have explored LLM-based rerankers, they primarily leverage internal model knowledge and overlook the rich supervisory signals that LLMs can provide, such as using response quality as feedback for optimizing reranking decisions. In this paper, we propose DynamicRAG, a novel RAG framework where the reranker dynamically adjusts both the order and number of retrieved documents based on the query. We model the reranker as an agent optimized through reinforcement learning (RL), using rewards derived from LLM output quality. Across seven knowledge-intensive datasets, DynamicRAG demonstrates superior performance, achieving state-of-the-art results. The model, data and code are available at this https URL
zh

[NLP-37] Benchmarking Ethical and Safety Risks of Healthcare LLM s in China-Toward Systemic Governance under Healthy China 2030

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗领域应用中所带来的伦理与患者安全挑战。其关键解决方案是构建一个包含12,000个问题的问答基准,覆盖11个伦理维度和9个安全维度,用于量化评估LLMs在医疗场景中的风险。通过该数据集对先进中文医疗LLMs进行评估,揭示了现有模型在伦理与安全决策方面的显著不足,并提出了包括嵌入LLM审计团队、制定数据伦理指南以及实施安全模拟流程在内的治理框架,以实现对LLM风险的主动管理。

链接: https://arxiv.org/abs/2505.07205
作者: Mouxiao Bian,Rongzhao Zhang,Chao Ding,Xinwei Peng,Jie Xu
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are poised to transform healthcare under China’s Healthy China 2030 initiative, yet they introduce new ethical and patient-safety challenges. We present a novel 12,000-item QA benchmark covering 11 ethics and 9 safety dimensions in medical contexts, to quantitatively evaluate these risks. Using this dataset, we assess state-of-the-art Chinese medical LLMs (e.g., Qwen 2.5-32B, DeepSeek), revealing moderate baseline performance (accuracy 42.7% for Qwen 2.5-32B) and significant improvements after fine-tuning on our data (up to 50.8% accuracy). Results show notable gaps in LLM decision-making on ethics and safety scenarios, reflecting insufficient institutional oversight. We then identify systemic governance shortfalls-including the lack of fine-grained ethical audit protocols, slow adaptation by hospital IRBs, and insufficient evaluation tools-that currently hinder safe LLM deployment. Finally, we propose a practical governance framework for healthcare institutions (embedding LLM auditing teams, enacting data ethics guidelines, and implementing safety simulation pipelines) to proactively manage LLM risks. Our study highlights the urgent need for robust LLM governance in Chinese healthcare, aligning AI innovation with patient safety and ethical standards.
zh

[NLP-38] On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Stud

【速读】: 该论文试图解决现代对话式文本转语音(Text-to-Speech, TTS)系统在公开可用性方面的局限性,探讨现有开源架构与训练技术是否不足以支持高质量的对话生成。论文的关键解决方案是通过对比两种训练方法:基于上下文的逐句训练与完整对话训练,发现基于上下文的逐句训练在主观意见分数(MOS)上表现更优(4.3/5.0 vs 3.7/5.0),并且训练时间减少了37%,同时避免了完整对话训练中出现的说话人相似性幻觉问题。因此,研究建议采用具有上下文条件的逐句训练方法以提高资源效率和输出质量。

链接: https://arxiv.org/abs/2505.07202
作者: Hyouin Liu,Zhikuan Zhang
机构: Divergence 2% LLC (Divergence 2% LLC)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Modern TTS systems designed for conversations achieve high-quality utterances but often remain inaccessible publicly. Are existing open-source architectures inadequate, or are current training techniques insufficient? This paper investigates prominent models and their underlying behaviors regarding conversational context. Using 20 GPU-hours on an NVIDIA H100, we empirically examine two approaches: context-based utterance-level training versus full conversation training. Results demonstrate that context-based utterance training achieves superior MOS scores (4.3/5.0 vs 3.7/5.0) and reduces training time by 37%, while full conversation approaches suffer from speaker similarity hallucination issues. These findings provide practical guidelines for conversational TTS development, favoring utterance-level training with contextual conditioning for both resource efficiency and output quality.
zh

[NLP-39] Securing Genomic Data Against Inference Attacks in Federated Learning Environments

【速读】: 该论文旨在解决联邦学习(Federated Learning, FL)在处理去中心化基因组数据时面临的隐私泄露问题,特别是针对梯度暴露导致的成员推理攻击(Membership Inference Attack, MIA)和标签推理攻击(Label Inference Attack, LIA)等威胁。研究通过模拟联邦学习环境并利用合成基因组数据评估不同攻击向量的有效性,发现基于梯度的MIA具有最高的攻击效果,表明梯度信息在联邦更新中存在显著的隐私风险。解决方案的关键在于识别梯度暴露所带来的隐私漏洞,并强调需要开发更强大的隐私保护机制以应对基因组数据的特殊敏感性。

链接: https://arxiv.org/abs/2505.07188
作者: Chetan Pathade,Shubham Patil
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 10 Pages, 7 Figures

点击查看摘要

Abstract:Federated Learning (FL) offers a promising framework for collaboratively training machine learning models across decentralized genomic datasets without direct data sharing. While this approach preserves data locality, it remains susceptible to sophisticated inference attacks that can compromise individual privacy. In this study, we simulate a federated learning setup using synthetic genomic data and assess its vulnerability to three key attack vectors: Membership Inference Attack (MIA), Gradient-Based Membership Inference Attack, and Label Inference Attack (LIA). Our experiments reveal that Gradient-Based MIA achieves the highest effectiveness, with a precision of 0.79 and F1-score of 0.87, underscoring the risk posed by gradient exposure in federated updates. Additionally, we visualize comparative attack performance through radar plots and quantify model leakage across clients. The findings emphasize the inadequacy of naïve FL setups in safeguarding genomic privacy and motivate the development of more robust privacy-preserving mechanisms tailored to the unique sensitivity of genomic data.
zh

[NLP-40] Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLM s

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在知识密集型领域(如医学和科研)中表现不佳的问题,这些问题需要高事实准确性。现有方法生成的合成数据常包含冗余样本,无法有效填补模型的真实知识缺口。解决方案的关键在于提出一种基于结构熵引导的知识导航框架(Structural Entropy-guided Knowledge Navigator, SENATOR),该框架利用结构熵(Structure Entropy, SE)度量知识图谱路径中的不确定性,并结合蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)选择性地探索模型缺乏领域知识的区域,从而生成针对性的合成数据用于监督微调,实现模型的持续自我优化。

链接: https://arxiv.org/abs/2505.07184
作者: Yifan Wei,Xiaoyan Yu,Tengfei Pan,Angsheng Li,Li Du
机构: Beihang University (北京航空航天大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Beijing Institute of Technology (北京理工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved unprecedented performance by leveraging vast pretraining corpora, yet their performance remains suboptimal in knowledge-intensive domains such as medicine and scientific research, where high factual precision is required. While synthetic data provides a promising avenue for augmenting domain knowledge, existing methods frequently generate redundant samples that do not align with the model’s true knowledge gaps. To overcome this limitation, we propose a novel Structural Entropy-guided Knowledge Navigator (SENATOR) framework that addresses the intrinsic knowledge deficiencies of LLMs. Our approach employs the Structure Entropy (SE) metric to quantify uncertainty along knowledge graph paths and leverages Monte Carlo Tree Search (MCTS) to selectively explore regions where the model lacks domain-specific knowledge. Guided by these insights, the framework generates targeted synthetic data for supervised fine-tuning, enabling continuous self-improvement. Experimental results on LLaMA-3 and Qwen2 across multiple domain-specific benchmarks show that SENATOR effectively detects and repairs knowledge deficiencies, achieving notable performance improvements. The code and data for our methods and experiments are available at this https URL.
zh

[NLP-41] One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全对齐过程中存在的脆弱性问题,特别是针对“越狱攻击”(jailbreak attacks)导致模型生成有害响应的隐患。研究发现,当前安全对齐的LLMs在拒绝响应时,其初始tokens具有高度相似性,这些初始tokens被定义为“安全触发词(safety trigger tokens)”。解决方案的关键在于提出一种名为\textttD-STT的防御算法,该算法通过识别并显式解码安全触发词来触发模型已有的安全模式,从而在最小干预解码过程的前提下有效降低输出的有害性。

链接: https://arxiv.org/abs/2505.07167
作者: Haoran Gu,Handing Wang,Yi Mei,Mengjie Zhang,Yaochu Jin
机构: Xidian University (西安电子科技大学); Victoria University of Wellington (维多利亚大学); Westlake University (西湖大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. However, they remain vulnerable to jailbreak attacks, which manipulate the models into generating harmful responses despite safety alignment. Recent studies have shown that current safety-aligned LLMs often undergo the shallow safety alignment, where the first few tokens largely determine whether the response will be harmful. Through comprehensive observations, we find that safety-aligned LLMs and various defense strategies generate highly similar initial tokens in their refusal responses, which we define as safety trigger tokens. Building on this insight, we propose \textttD-STT, a simple yet effective defense algorithm that identifies and explicitly decodes safety trigger tokens of the given safety-aligned LLM to trigger the model’s learned safety patterns. In this process, the safety trigger is constrained to a single token, which effectively preserves model usability by introducing minimum intervention in the decoding process. Extensive experiments across diverse jailbreak attacks and benign prompts demonstrate that \ours significantly reduces output harmfulness while preserving model usability and incurring negligible response time overhead, outperforming ten baseline methods.
zh

[NLP-42] Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition SIGIR-2025

【速读】: 该论文试图解决密集检索器(dense retriever)中预训练与微调作用的争议问题,即检索知识是否主要在预训练阶段获得,而微调仅起到次要作用。其解决方案的关键在于通过扩展实验验证不同表示方法(如CLS标记与均值池化)、主干架构(如仅编码器的BERT与仅解码器的LLaMA)以及数据集(如MSMARCO与Natural Questions)下的检索性能,以评估预训练知识与微调对检索效果的影响。研究结果表明,在DPR微调中,预训练知识是检索性能的基础,而微调主要调整神经元激活而非重新组织知识,但这一模式在某些模型(如均值池化的Contriever和基于解码器的LLaMA)中并不成立。

链接: https://arxiv.org/abs/2505.07166
作者: Zheng Yao,Shuai Wang,Guido Zuccon
机构: The University of Queensland, Australia
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted in SIGIR-2025

点击查看摘要

Abstract:Dense retrievers utilize pre-trained backbone language models (e.g., BERT, LLaMA) that are fine-tuned via contrastive learning to perform the task of encoding text into sense representations that can be then compared via a shallow similarity operation, e.g. inner product. Recent research has questioned the role of fine-tuning vs. that of pre-training within dense retrievers, specifically arguing that retrieval knowledge is primarily gained during pre-training, meaning knowledge not acquired during pre-training cannot be sub-sequentially acquired via fine-tuning. We revisit this idea here as the claim was only studied in the context of a BERT-based encoder using DPR as representative dense retriever. We extend the previous analysis by testing other representation approaches (comparing the use of CLS tokens with that of mean pooling), backbone architectures (encoder-only BERT vs. decoder-only LLaMA), and additional datasets (MSMARCO in addition to Natural Questions). Our study confirms that in DPR tuning, pre-trained knowledge underpins retrieval performance, with fine-tuning primarily adjusting neuron activation rather than reorganizing knowledge. However, this pattern does not hold universally, such as in mean-pooled (Contriever) and decoder-based (LLaMA) models. We ensure full reproducibility and make our implementation publicly available at this https URL.
zh

[NLP-43] KDH-MLTC: Knowledge Distillation for Healthcare Multi-Label Text Classification

【速读】: 该论文旨在解决医疗文本数据分类中计算效率与高准确性之间的矛盾,特别是在处理医疗术语的细微差别和复杂性时。其关键解决方案是提出一种基于知识蒸馏的医疗多标签文本分类框架(KDH-MLTC),通过将复杂的教师模型(如BERT)的知识迁移至轻量级学生模型(如DistilBERT),结合序列微调和粒子群优化(PSO)进行超参数调优,从而在保持较高分类性能的同时显著降低计算需求,实现本地化部署并满足HIPAA合规性要求。

链接: https://arxiv.org/abs/2505.07162
作者: Hajar Sakai,Sarah S. Lam
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The increasing volume of healthcare textual data requires computationally efficient, yet highly accurate classification approaches able to handle the nuanced and complex nature of medical terminology. This research presents Knowledge Distillation for Healthcare Multi-Label Text Classification (KDH-MLTC), a framework leveraging model compression and Large Language Models (LLMs). The proposed approach addresses conventional healthcare Multi-Label Text Classification (MLTC) challenges by integrating knowledge distillation and sequential fine-tuning, subsequently optimized through Particle Swarm Optimization (PSO) for hyperparameter tuning. KDH-MLTC transfers knowledge from a more complex teacher LLM (i.e., BERT) to a lighter student LLM (i.e., DistilBERT) through sequential training adapted to MLTC that preserves the teacher’s learned information while significantly reducing computational requirements. As a result, the classification is enabled to be conducted locally, making it suitable for healthcare textual data characterized by sensitivity and, therefore, ensuring HIPAA compliance. The experiments conducted on three medical literature datasets of different sizes, sampled from the Hallmark of Cancer (HoC) dataset, demonstrate that KDH-MLTC achieves superior performance compared to existing approaches, particularly for the largest dataset, reaching an F1 score of 82.70%. Additionally, statistical validation and an ablation study are carried out, proving the robustness of KDH-MLTC. Furthermore, the PSO-based hyperparameter optimization process allowed the identification of optimal configurations. The proposed approach contributes to healthcare text classification research, balancing efficiency requirements in resource-constrained healthcare settings with satisfactory accuracy demands.
zh

[NLP-44] owards Actionable Pedagogical Feedback: A Multi-Perspective Analysis of Mathematics Teaching and Tutoring Dialogue

【速读】: 该论文试图解决课堂对话分析中话语行为(talk moves)的多用途性问题以及领域特定话语行为分类排除部分话语导致反馈缺失的问题。其解决方案的关键在于提出一种多视角话语分析框架,该框架整合了领域特定的话语行为与对话行为(采用包含43个标签的扁平化多功能SWBD-MASL模式)以及话语关系(应用分段话语表示理论,包含16种关系),从而实现对包含和不包含话语行为的语句进行全面分析。

链接: https://arxiv.org/abs/2505.07161
作者: Jannatun Naim,Jie Cao,Fareen Tasneem,Jennifer Jacobs,Brent Milne,James Martin,Tamara Sumner
机构: University of Colorado Boulder (科罗拉多大学博尔德分校); University of Oklahoma (俄克拉荷马大学); University of Chittagong (锡塔贡大学); Saga Education (Saga教育)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted to EDM’2025

点击查看摘要

Abstract:Effective feedback is essential for refining instructional practices in mathematics education, and researchers often turn to advanced natural language processing (NLP) models to analyze classroom dialogues from multiple perspectives. However, utterance-level discourse analysis encounters two primary challenges: (1) multifunctionality, where a single utterance may serve multiple purposes that a single tag cannot capture, and (2) the exclusion of many utterances from domain-specific discourse move classifications, leading to their omission in feedback. To address these challenges, we proposed a multi-perspective discourse analysis that integrates domain-specific talk moves with dialogue act (using the flattened multi-functional SWBD-MASL schema with 43 tags) and discourse relation (applying Segmented Discourse Representation Theory with 16 relations). Our top-down analysis framework enables a comprehensive understanding of utterances that contain talk moves, as well as utterances that do not contain talk moves. This is applied to two mathematics education datasets: TalkMoves (teaching) and SAGA22 (tutoring). Through distributional unigram analysis, sequential talk move analysis, and multi-view deep dive, we discovered meaningful discourse patterns, and revealed the vital role of utterances without talk moves, demonstrating that these utterances, far from being mere fillers, serve crucial functions in guiding, acknowledging, and structuring classroom discourse. These insights underscore the importance of incorporating discourse relations and dialogue acts into AI-assisted education systems to enhance feedback and create more responsive learning environments. Our framework may prove helpful for providing human educator feedback, but also aiding in the development of AI agents that can effectively emulate the roles of both educators and students.
zh

[NLP-45] HAMLET: Healthcare-focused Adaptive Multilingual Learning Embedding-based Topic Modeling

【速读】: 该论文试图解决传统主题模型在处理语境细微差别、多义词和罕见词时的局限性,这些问题导致生成的主题缺乏连贯性和质量。解决方案的关键在于提出一种基于图驱动架构的跨语言医疗主题建模方法——HAMLET,该方法利用大型语言模型(LLM)生成初始主题,并通过神经增强的语义融合技术对主题嵌入进行精炼。此外,该方法结合了双向编码器表示从变压器(BERT)和图神经网络(GNN),以建立文档、主题、词语及其相似性之间的联系,从而提升主题的代表性和可解释性。

链接: https://arxiv.org/abs/2505.07157
作者: Hajar Sakai,Sarah S. Lam
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Traditional topic models often struggle with contextual nuances and fail to adequately handle polysemy and rare words. This limitation typically results in topics that lack coherence and quality. Large Language Models (LLMs) can mitigate this issue by generating an initial set of topics. However, these raw topics frequently lack refinement and representativeness, which leads to redundancy without lexical similarity and reduced interpretability. This paper introduces HAMLET, a graph-driven architecture for cross-lingual healthcare topic modeling that uses LLMs. The proposed approach leverages neural-enhanced semantic fusion to refine the embeddings of topics generated by the LLM. Instead of relying solely on statistical co-occurrence or human interpretation to extract topics from a document corpus, this method introduces a topic embedding refinement that uses Bidirectional Encoder Representations from Transformers (BERT) and Graph Neural Networks (GNN). After topic generation, a hybrid technique that involves BERT and Sentence-BERT (SBERT) is employed for embedding. The topic representations are further refined using a GNN, which establishes connections between documents, topics, words, similar topics, and similar words. A novel method is introduced to compute similarities. Consequently, the topic embeddings are refined, and the top k topics are extracted. Experiments were conducted using two healthcare datasets, one in English and one in French, from which six sets were derived. The results demonstrate the effectiveness of HAMLET.
zh

[NLP-46] Reassessing Large Language Model Boolean Query Generation for Systematic Reviews SIGIR-2025

【速读】: 该论文试图解决在系统综述中生成有效布尔查询(Boolean queries)的问题,特别是如何利用大型语言模型(Large Language Models, LLMs)提高查询构建的效率和效果。其解决方案的关键在于系统性地重现已有研究,并针对先前工作中被忽视的重要因素进行优化,包括生成查询的验证、输出格式的约束以及链式思维(Guided)提示中示例的选择。研究结果表明,模型选择和提示设计是成功生成有效查询的核心驱动因素,强调了针对特定模型和提示进行优化的重要性。

链接: https://arxiv.org/abs/2505.07155
作者: Shuai Wang,Harrisen Scells,Bevan Koopman,Guido Zuccon
机构: The University of Queensland, Australia; University of Tübingen, Germany; CSIRO & The University of Queensland, Australia
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted in SIGIR-2025

点击查看摘要

Abstract:Systematic reviews are comprehensive literature reviews that address highly focused research questions and represent the highest form of evidence in medicine. A critical step in this process is the development of complex Boolean queries to retrieve relevant literature. Given the difficulty of manually constructing these queries, recent efforts have explored Large Language Models (LLMs) to assist in their formulation. One of the first studies,Wang et al., investigated ChatGPT for this task, followed by Staudinger et al., which evaluated multiple LLMs in a reproducibility study. However, the latter overlooked several key aspects of the original work, including (i) validation of generated queries, (ii) output formatting constraints, and (iii) selection of examples for chain-of-thought (Guided) prompting. As a result, its findings diverged significantly from the original study. In this work, we systematically reproduce both studies while addressing these overlooked factors. Our results show that query effectiveness varies significantly across models and prompt designs, with guided query formulation benefiting from well-chosen seed studies. Overall, prompt design and model selection are key drivers of successful query formulation. Our findings provide a clearer understanding of LLMs’ potential in Boolean query generation and highlight the importance of model- and prompt-specific optimisations. The complex nature of systematic reviews adds to challenges in both developing and reproducing methods but also highlights the importance of reproducibility studies in this domain.
zh

[NLP-47] LLM -Augmented Chemical Synthesis and Design Decision Programs

【速读】: 该论文试图解决有机化学中复杂的多步逆合成规划问题(retrosynthesis planning),该问题涉及从目标分子出发,通过一系列有效反应分解为更简单的前体分子。传统方法受限于可能路径的庞大组合空间,而现有机器学习方法主要集中在单步逆合成建模和后续路线搜索上。论文提出的解决方案关键在于利用大语言模型(LLM)的强大化学知识,结合一种高效的反应路径编码方案和新的路线级搜索策略,从而超越传统的逐步预测反应物的方法,提升逆合成规划的效率与准确性。

链接: https://arxiv.org/abs/2505.07027
作者: Haorui Wang,Jeff Guo,Lingkai Kong,Rampi Ramprasad,Philippe Schwaller,Yuanqi Du,Chao Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Chemical Physics (physics.chem-ph)
备注:

点击查看摘要

Abstract:Retrosynthesis, the process of breaking down a target molecule into simpler precursors through a series of valid reactions, stands at the core of organic chemistry and drug development. Although recent machine learning (ML) research has advanced single-step retrosynthetic modeling and subsequent route searches, these solutions remain restricted by the extensive combinatorial space of possible pathways. Concurrently, large language models (LLMs) have exhibited remarkable chemical knowledge, hinting at their potential to tackle complex decision-making tasks in chemistry. In this work, we explore whether LLMs can successfully navigate the highly constrained, multi-step retrosynthesis planning problem. We introduce an efficient scheme for encoding reaction pathways and present a new route-level search strategy, moving beyond the conventional step-by-step reactant prediction. Through comprehensive evaluations, we show that our LLM-augmented approach excels at retrosynthesis planning and extends naturally to the broader challenge of synthesizable molecular design.
zh

[NLP-48] owards the Three-Phase Dynamics of Generalization Power of a DNN

【速读】: 该论文试图解决深度神经网络(Deep Neural Networks, DNNs)的泛化能力分析问题,具体是通过直接解耦和分析DNN在训练过程中编码的可泛化与不可泛化交互动态来理解其泛化性能。解决方案的关键在于利用可解释AI(Explainable AI, XAI)的最新理论成果,即证明DNN的详细推理逻辑可以严格重写为少量的AND-OR交互模式,并基于此提出一种高效量化每个交互泛化能力的方法,从而揭示了交互泛化能力在训练过程中的三阶段动态特性。

链接: https://arxiv.org/abs/2505.06993
作者: Yuxuan He,Junpeng Zhang,Hongyuan Zhang,Quanshi Zhang
机构: University of Electronic Science and Technology of China (中国电子科技大学); Shanghai Jiao Tong University (上海交通大学); Institute of Artificial Intelligence, China Telecom (中国电信人工智能研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper proposes a new perspective for analyzing the generalization power of deep neural networks (DNNs), i.e., directly disentangling and analyzing the dynamics of generalizable and non-generalizable interaction encoded by a DNN through the training process. Specifically, this work builds upon the recent theoretical achievement in explainble AI, which proves that the detailed inference logic of DNNs can be can be strictly rewritten as a small number of AND-OR interaction patterns. Based on this, we propose an efficient method to quantify the generalization power of each interaction, and we discover a distinct three-phase dynamics of the generalization power of interactions during training. In particular, the early phase of training typically removes noisy and non-generalizable interactions and learns simple and generalizable ones. The second and the third phases tend to capture increasingly complex interactions that are harder to generalize. Experimental results verify that the learning of non-generalizable interactions is the the direct cause for the gap between the training and testing losses.
zh

[NLP-49] Convert Language Model into a Value-based Strategic Planner ACL2025

【速读】: 该论文旨在解决情感支持对话(Emotional Support Conversation, ESC)中长期满意度不足的问题,现有研究大多未能从状态模型的角度定义问题,导致解决方案效果有限。其关键解决方案是将Q-learning应用于大语言模型(Large Language Models, LLMs),提出了一种名为straQ*的框架,该框架通过让LLM在ESC过程中进行规划、基于长期回报确定最优策略,并最终指导LLM生成回应,从而提升对话的长期有效性。

链接: https://arxiv.org/abs/2505.06987
作者: Xiaoyu Wang,Yue Zhao,Qingqing Gu,Zhonglin Jiang,Xiaokai Chen,Yong Chen,Luo Ji
机构: Geely AI Lab (吉利人工智能实验室); Beijing Institute of Technology (北京理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures, Accepted by ACL 2025 Industry Track

点击查看摘要

Abstract:Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage the Q-learning on LLMs, and propose a framework called straQ*. Our framework allows a plug-and-play LLM to bootstrap the planning during ESC, determine the optimal strategy based on long-term returns, and finally guide the LLM to response. Substantial experiments on ESC datasets suggest that straQ* outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and finite state machines.
zh

[NLP-50] CNN-based Image Models Verify a Hypothesis that The Writers of Cuneiform Texts Improved Their Writing Skills When Studying at the Age of Hittite Empire

【速读】: 该论文试图解决古代文献中为何存在内容几乎相同的两份誊写文本的问题,特别是针对Kizzuwatna仪式文本的双写现象。其解决方案的关键在于开发了一种基于卷积神经网络(CNN)的图像分析方法,无需逐个分割楔形文字即可对泥板图像进行定量分析,从而揭示了两位作者之间的教学与学习关系。这一方法突破了传统语言学的研究范式,提供了新的研究视角。

链接: https://arxiv.org/abs/2505.06974
作者: Daichi Kohmoto,Katsutoshi Fukuda,Daisuke Yoshida,Takafumi Matsui,Sachihiro Omura
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 9 figures, 5 tables

点击查看摘要

Abstract:A cuneiform tablet KBo 23.1 ++/KUB 30.38, which is known to represent a text of Kizzuwatna rituals, was written by two writers with almost identical content in two iterations. Unlike other cuneiform tablets that contained information such as myths, essays, or business records, the reason why ancient people left such tablets for posterity remains unclear. To study this problem, we develop a new methodology by analyzing images of a tablet quantitatively using CNN (Convolutional Neural Network)-based image models, without segmenting cuneiforms one-by-one. Our data-driven methodology implies that the writer writing the first half was a teacher' and the other writer was a student’ who was training his skills of writing cuneiforms. This result has not been reached by classical linguistics. We also discuss related conclusions and possible further directions for applying our method and its generalizations.
zh

[NLP-51] Web Page Classification using LLM s for Crawling Support

【速读】: 该论文试图解决如何高效爬取新网页的问题,其核心挑战在于如何在不同网站特性(如XML站点地图和页面更新频率)下实现通用的高效爬取。解决方案的关键在于利用大型语言模型(Large Language Model, LLM)将网页分类为“索引页”和“内容页”,并通过分类结果选择索引页作为访问新页面的起点,从而提升爬取效率和覆盖范围。

链接: https://arxiv.org/abs/2505.06972
作者: Yuichi Sasazawa,Yasuhiro Sogawa
机构: Hitachi, Ltd. Research and Development Group (日立有限公司研发集团)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 8 pages, 2 figures

点击查看摘要

Abstract:A web crawler is a system designed to collect web pages, and efficient crawling of new pages requires appropriate algorithms. While website features such as XML sitemaps and the frequency of past page updates provide important clues for accessing new pages, their universal application across diverse conditions is challenging. In this study, we propose a method to efficiently collect new pages by classifying web pages into two types, “Index Pages” and “Content Pages,” using a large language model (LLM), and leveraging the classification results to select index pages as starting points for accessing new pages. We construct a dataset with automatically annotated web page types and evaluate our approach from two perspectives: the page type classification performance and coverage of new pages. Experimental results demonstrate that the LLM-based method outperformed baseline methods in both evaluation metrics.
zh

[NLP-52] A digital perspective on the role of a stemma in material-philological transmission studies

【速读】: 该论文试图解决数字方法在文本传统研究中的应用问题,特别是如何利用生成式 AI (Generative AI) 技术来重新定义校勘谱系(stemma codicum)的角色,使其从传统的最终研究成果转变为一种研究工具。解决方案的关键在于利用计算生成谱系的相对简便性,通过将《霍姆undur传说》作为案例研究,展示谱系可以作为进一步探索文本传统的起点,并帮助回答以往无法解答的研究问题。此外,论文提供了用于生成谱系的数据集以及两个自定义的 Python 脚本,用于将基于 TEI 标准的 XML 文本数据转换为 PHYLIP 包所需的输入格式,以生成文本间关系的无根树结构。

链接: https://arxiv.org/abs/2505.06938
作者: Katarzyna Anna Kapitan
机构: 未知
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Taking its point of departure in the recent developments in the field of digital humanities and the increasing automatisation of scholarly workflows, this study explores the implications of digital approaches to textual traditions for the broader field of textual scholarship. It argues that the relative simplicity of creating computergenerated stemmas allows us to view the stemma codicum as a research tool rather than the final product of our scholarly investigation. Using the Old Norse saga of Hrómundur as a case study, this article demonstrates that stemmas can serve as a starting point for exploring textual traditions further. In doing so, they enable us to address research questions that otherwise remain unanswered. The article is accompanied by datasets used to generate stemmas for the Hrómundar saga tradition as well as two custom Python scripts. The scripts are designed to convert XML-based textual data, encoded according to the TEI Guidelines, into the input format used for the analysis in the PHYLIP package to generate unrooted trees of relationships between texts.
zh

[NLP-53] he Distracting Effect: Understanding Irrelevant Passages in RAG

【速读】: 该论文旨在解决检索增强生成(Retrieval Augmented Generation, RAG)系统中因检索到与查询无关的文本片段而干扰大语言模型(LLM)生成正确答案的问题。其解决方案的关键在于提出了一种可量化的文本片段对查询的干扰效应度量方法,并引入了针对难处理干扰片段的识别与利用方法。通过使用这些精心挑选的干扰片段对LLM进行微调,实验结果显示相比传统RAG数据集微调的模型,回答准确率提升了最高7.5%。该研究的贡献在于突破了传统二分类方式对无关文本的处理,提出了多种寻找硬干扰片段的方法,并构建了一个全面的框架用于识别和利用这些干扰片段。

链接: https://arxiv.org/abs/2505.06914
作者: Chen Amiraz,Florin Cuconasu,Simone Filice,Zohar Karnin
机构: Technology Innovation Institute (技术创新研究所); Sapienza University of Rome (罗马第一大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:A well-known issue with Retrieval Augmented Generation (RAG) is that retrieved passages that are irrelevant to the query sometimes distract the answer-generating LLM, causing it to provide an incorrect response. In this paper, we shed light on this core issue and formulate the distracting effect of a passage w.r.t. a query (and an LLM). We provide a quantifiable measure of the distracting effect of a passage and demonstrate its robustness across LLMs. Our research introduces novel methods for identifying and using hard distracting passages to improve RAG systems. By fine-tuning LLMs with these carefully selected distracting passages, we achieve up to a 7.5% increase in answering accuracy compared to counterparts fine-tuned on conventional RAG datasets. Our contribution is two-fold: first, we move beyond the simple binary classification of irrelevant passages as either completely unrelated vs. distracting, and second, we develop and analyze multiple methods for finding hard distracting passages. To our knowledge, no other research has provided such a comprehensive framework for identifying and utilizing hard distracting passages. Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR) Cite as: arXiv:2505.06914 [cs.CL] (or arXiv:2505.06914v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.06914 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-54] EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation

【速读】: 该论文旨在解决大规模社会模拟中面临的高时间和计算成本问题。现有解决方案如分布式机制或混合基于代理的模型(Agent-Based Model, ABM)集成,要么无法有效降低推理成本,要么牺牲了精度和泛化能力。论文提出的解决方案关键在于EcoLANG:一种高效且有效的代理通信语言诱导方法,其通过两个阶段实现优化,即语言进化阶段(过滤同义词并优化句法规则)和语言利用阶段(代理使用演化后的语言进行交流),从而在不牺牲模拟精度的前提下,显著降低令牌消耗量。

链接: https://arxiv.org/abs/2505.06904
作者: Xinyi Mou,Chen Qian,Wei Liu,Xuanjing Huang,Zhongyu Wei
机构: Fudan University (复旦大学); Shanghai Jiao Tong University (上海交通大学); King’s College London (伦敦国王学院); Shanghai Innovation Institute (上海创新研究院)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated an impressive ability to role-play humans and replicate complex social dynamics. While large-scale social simulations are gaining increasing attention, they still face significant challenges, particularly regarding high time and computation costs. Existing solutions, such as distributed mechanisms or hybrid agent-based model (ABM) integrations, either fail to address inference costs or compromise accuracy and generalizability. To this end, we propose EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation. EcoLANG operates in two stages: (1) language evolution, where we filter synonymous words and optimize sentence-level rules through natural selection, and (2) language utilization, where agents in social simulations communicate using the evolved language. Experimental results demonstrate that EcoLANG reduces token consumption by over 20%, enhancing efficiency without sacrificing simulation accuracy.
zh

[NLP-55] Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration

【速读】: 该论文旨在解决通用医学人工智能(Generalist Medical AI, GMAI)系统在临床应用中的可解释性不足和预后能力欠佳的问题。其解决方案的关键在于提出XMedGPT,一个以临床医生为中心的多模态AI助手,通过整合文本与视觉可解释性,实现透明且可信的医疗决策支持。XMedGPT不仅能够生成准确的诊断和描述性输出,还能在医学图像中定位相关解剖部位,从而提升可解释性与临床可用性。此外,引入了可靠性索引机制,通过交互式问答进行一致性评估以量化不确定性,进一步增强了模型的可信度与实用性。

链接: https://arxiv.org/abs/2505.06898
作者: Honglong Yang,Shanshan Song,Yi Qin,Lehan Wang,Haonan Wang,Xinpeng Ding,Qixiang Zhang,Bodong Du,Xiaomeng Li
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generalist Medical AI (GMAI) systems have demonstrated expert-level performance in biomedical perception tasks, yet their clinical utility remains limited by inadequate multi-modal explainability and suboptimal prognostic capabilities. Here, we present XMedGPT, a clinician-centric, multi-modal AI assistant that integrates textual and visual interpretability to support transparent and trustworthy medical decision-making. XMedGPT not only produces accurate diagnostic and descriptive outputs, but also grounds referenced anatomical sites within medical images, bridging critical gaps in interpretability and enhancing clinician usability. To support real-world deployment, we introduce a reliability indexing mechanism that quantifies uncertainty through consistency-based assessment via interactive question-answering. We validate XMedGPT across four pillars: multi-modal interpretability, uncertainty quantification, and prognostic modeling, and rigorous benchmarking. The model achieves an IoU of 0.703 across 141 anatomical regions, and a Kendall’s tau-b of 0.479, demonstrating strong alignment between visual rationales and clinical outcomes. For uncertainty estimation, it attains an AUC of 0.862 on visual question answering and 0.764 on radiology report generation. In survival and recurrence prediction for lung and glioma cancers, it surpasses prior leading models by 26.9%, and outperforms GPT-4o by 25.0%. Rigorous benchmarking across 347 datasets covers 40 imaging modalities and external validation spans 4 anatomical systems confirming exceptional generalizability, with performance gains surpassing existing GMAI by 20.7% for in-domain evaluation and 16.7% on 11,530 in-house data evaluation. Together, XMedGPT represents a significant leap forward in clinician-centric AI integration, offering trustworthy and scalable support for diverse healthcare applications.
zh

[NLP-56] IM-BERT: Enhancing Robustness of BERT through the Implicit Euler Method EMNLP2024

【速读】: 该论文旨在解决预训练语言模型(Pre-trained Language Models, PLMs)在有限下游数据集上进行微调时,因参数量大而容易受到对抗攻击的问题,从而导致模型在标准数据集上的过拟合现象。其解决方案的关键在于从动态系统的角度出发,将BERT的一层概念化为常微分方程(Ordinary Differential Equations, ODEs)的解,并分析显式和隐式欧拉方法在初始值扰动下的数值稳定性,进而引入一种数值稳健的IM-connection结构,该结构通过整合BERT的各层来增强模型对对抗攻击的鲁棒性,且无需引入额外参数或对抗训练策略。

链接: https://arxiv.org/abs/2505.06889
作者: Mihyeon Kim,Juhyoung Park,Youngbin Kim
机构: KT CORPORATION(KT公司); VAIV COMAPANY(VAIV公司); Chung-Ang University(中央大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2024 Main

点击查看摘要

Abstract:Pre-trained Language Models (PLMs) have achieved remarkable performance on diverse NLP tasks through pre-training and fine-tuning. However, fine-tuning the model with a large number of parameters on limited downstream datasets often leads to vulnerability to adversarial attacks, causing overfitting of the model on standard datasets. To address these issues, we propose IM-BERT from the perspective of a dynamic system by conceptualizing a layer of BERT as a solution of Ordinary Differential Equations (ODEs). Under the situation of initial value perturbation, we analyze the numerical stability of two main numerical ODE solvers: the explicit and implicit Euler approaches. Based on these analyses, we introduce a numerically robust IM-connection incorporating BERT’s layers. This strategy enhances the robustness of PLMs against adversarial attacks, even in low-resource scenarios, without introducing additional parameters or adversarial training strategies. Experimental results on the adversarial GLUE (AdvGLUE) dataset validate the robustness of IM-BERT under various conditions. Compared to the original BERT, IM-BERT exhibits a performance improvement of approximately 8.3%p on the AdvGLUE dataset. Furthermore, in low-resource scenarios, IM-BERT outperforms BERT by achieving 5.9%p higher accuracy. Comments: Accepted to EMNLP 2024 Main Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.06889 [cs.CL] (or arXiv:2505.06889v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.06889 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Mihyeon Kim [view email] [v1] Sun, 11 May 2025 07:54:33 UTC (8,693 KB)
zh

[NLP-57] A Split-then-Join Approach to Abstractive Summarization for Very Long Documents in a Low Resource Setting

【速读】: 该论文试图解决生成式 AI (Generative AI) 在处理超长文档摘要任务时性能下降的问题,因为现有模型如 \textttBIGBIRD-PEGASUS 的最大输入长度限制为 4,096 个 tokens,导致在处理更长文档时效果受限。解决方案的关键在于通过微调预训练的 \textttBIGBIRD-PEGASUS 模型,并利用领域数据增强方法,将长文档拆分为符合模型输入长度限制的部分,从而避免领域偏移和过拟合问题。

链接: https://arxiv.org/abs/2505.06862
作者: Lhuqita Fazry
机构: Universitas Indonesia (印度尼西亚大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract: \textttBIGBIRD-PEGASUS model achieves \textitstate-of-the-art on abstractive text summarization for long documents. However it’s capacity still limited to maximum of 4,096 tokens, thus caused performance degradation on summarization for very long documents. Common method to deal with the issue is to truncate the documents. In this reasearch, we’ll use different approach. We’ll use the pretrained \textttBIGBIRD-PEGASUS model by fine tuned the model on other domain dataset. First, we filter out all documents which length less than 20,000 tokens to focus on very long documents. To prevent domain shifting problem and overfitting on transfer learning due to small dataset, we augment the dataset by splitting document-summary training pair into parts, to fit the document into 4,096 tokens. Source code available on \hrefthis https URLthis https URL .
zh

[NLP-58] Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety

【速读】: 该论文试图解决在大型语言模型(Large Language Models, LLMs)微调阶段中,即使使用完全无害的数据集也可能导致模型输出有害性显著增加的问题。解决方案的关键在于从无害数据集中识别出对安全性能下降贡献最大的样本,并仅使用这些异常样本进行微调。研究提出了一种基于异常检测的方法——Self-Inf-N,用于检测和提取这些异常样本,实验结果表明,仅使用100个由Self-Inf-N选出的异常样本进行微调即可严重损害LLM的安全对齐性。

链接: https://arxiv.org/abs/2505.06843
作者: Zihan Guan,Mengxuan Hu,Ronghang Zhu,Sheng Li,Anil Vullikanti
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 26 pages, 13 figures

点击查看摘要

Abstract:Recent studies have uncovered a troubling vulnerability in the fine-tuning stage of large language models (LLMs): even fine-tuning on entirely benign datasets can lead to a significant increase in the harmfulness of LLM outputs. Building on this finding, our red teaming study takes this threat one step further by developing a more effective attack. Specifically, we analyze and identify samples within benign datasets that contribute most to safety degradation, then fine-tune LLMs exclusively on these samples. We approach this problem from an outlier detection perspective and propose Self-Inf-N, to detect and extract outliers for fine-tuning. Our findings reveal that fine-tuning LLMs on 100 outlier samples selected by Self-Inf-N in the benign datasets severely compromises LLM safety alignment. Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards. Codes are available at this https URL.
zh

[NLP-59] Overview of the NLPCC 2025 Shared Task 4: Multi-modal Multilingual and Multi-hop Medical Instructional Video Question Answering Challenge

【速读】: 该论文试图解决多模态、多语言和多跳医疗指导问答(M4IVQA)系统的研究难题,特别是针对医疗指导视频的问答任务。解决方案的关键在于构建能够整合医疗指导视频信息、理解多种语言并回答需要跨模态推理的多跳问题的模型。具体包括三个赛道:单视频中的多模态、多语言和多跳时间答案定位(M4TAGSV)、视频语料库检索(M4VCR)以及视频语料库中的多模态、多语言和多跳时间答案定位(M4TAGVC)。

链接: https://arxiv.org/abs/2505.06814
作者: Bin Li,Shenxi Liu,Yixuan Weng,Yue Du,Yuhang Tian,Shoujun Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Following the successful hosts of the 1-st (NLPCC 2023 Foshan) CMIVQA and the 2-rd (NLPCC 2024 Hangzhou) MMIVQA challenges, this year, a new task has been introduced to further advance research in multi-modal, multilingual, and multi-hop medical instructional question answering (M4IVQA) systems, with a specific focus on medical instructional videos. The M4IVQA challenge focuses on evaluating models that integrate information from medical instructional videos, understand multiple languages, and answer multi-hop questions requiring reasoning over various modalities. This task consists of three tracks: multi-modal, multilingual, and multi-hop Temporal Answer Grounding in Single Video (M4TAGSV), multi-modal, multilingual, and multi-hop Video Corpus Retrieval (M4VCR) and multi-modal, multilingual, and multi-hop Temporal Answer Grounding in Video Corpus (M4TAGVC). Participants in M4IVQA are expected to develop algorithms capable of processing both video and text data, understanding multilingual queries, and providing relevant answers to multi-hop medical questions. We believe the newly introduced M4IVQA challenge will drive innovations in multimodal reasoning systems for healthcare scenarios, ultimately contributing to smarter emergency response systems and more effective medical education platforms in multilingual communities. Our official website is this https URL
zh

[NLP-60] Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation

【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在不同感知模态(如音频、视觉或视听)之间的性能差异问题,特别是音频大语言模型(Audio LLMs)与视觉或视听模型以及人类在声音物体识别任务中的表现差距。其解决方案的关键在于引入一种跨模态蒸馏框架,通过一个模态的模型作为教师模型,另一个作为学生模型,利用启发式模型预测声音类别对学生的难度,从而实现知识迁移。双向蒸馏(从Qwen2-VL到Qwen2-Audio以及反之)显著提升了模型在挑战性类别上的性能。

链接: https://arxiv.org/abs/2505.06803
作者: Xilin Jiang,Junkai Wu,Vishal Choudhari,Nima Mesgarani
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Audio large language models (LLMs) are considered experts at recognizing sound objects, yet their performance relative to LLMs in other sensory modalities, such as visual or audio-visual LLMs, and to humans using their ears, eyes, or both remains unexplored. To investigate this, we systematically evaluate audio, visual, and audio-visual LLMs, specifically Qwen2-Audio, Qwen2-VL, and Qwen2.5-Omni, against humans in recognizing sound objects of different classes from audio-only, silent video, or sounded video inputs. We uncover a performance gap between Qwen2-Audio and Qwen2-VL that parallels the sensory discrepancy between human ears and eyes. To reduce this gap, we introduce a cross-modal distillation framework, where an LLM in one modality serves as the teacher and another as the student, with knowledge transfer in sound classes predicted as more challenging to the student by a heuristic model. Distillation in both directions, from Qwen2-VL to Qwen2-Audio and vice versa, leads to notable improvements, particularly in challenging classes. This work highlights the sensory gap in LLMs from a human-aligned perspective and proposes a principled approach to enhancing modality-specific perception in multimodal LLMs.
zh

[NLP-61] Utilizing LLM s to Investigate the Disputed Role of Evidence in Electronic Cigarette Health Policy Formation in Australia and the UK

【速读】: 该论文试图解决不同国家在电子烟(Electronic Nicotine Delivery Systems, ENDS)政策制定中如何管理和呈现证据的问题,特别是澳大利亚和英国在相同证据基础上形成截然不同政策的原因。解决方案的关键在于开发并评估一种基于大型语言模型(Large Language Model, LLM)的句子分类器,利用GPT-4对电子烟相关立法文件中的句子进行自动分类,以判断其是否包含关于电子烟对公共健康有益或有害的陈述,从而揭示两国政策文本中证据呈现的差异。

链接: https://arxiv.org/abs/2505.06782
作者: Damian Curran,Brian Chapman,Mike Conway
机构: University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Australia and the UK have developed contrasting approaches to the regulation of electronic cigarettes, with - broadly speaking - Australia adopting a relatively restrictive approach and the UK adopting a more permissive approach. Notably, these divergent policies were developed from the same broad evidence base. In this paper, to investigate differences in how the two jurisdictions manage and present evidence, we developed and evaluated a Large Language Model-based sentence classifier to perform automated analyses of electronic cigarette-related policy documents drawn from official Australian and UK legislative processes (109 documents in total). Specifically, we utilized GPT-4 to automatically classify sentences based on whether they contained claims that e-cigarettes were broadly helpful or harmful for public health. Our LLM-based classifier achieved an F-score of 0.9. Further, when applying the classifier to our entire sentence-level corpus, we found that Australian legislative documents show a much higher proportion of harmful statements, and a lower proportion of helpful statements compared to the expected values, with the opposite holding for the UK. In conclusion, this work utilized an LLM-based approach to provide evidence to support the contention that - drawing on the same evidence base - Australian ENDS-related policy documents emphasize the harms associated with ENDS products and UK policy documents emphasize the benefits. Further, our approach provides a starting point for using LLM-based methods to investigate the complex relationship between evidence and health policy formation.
zh

[NLP-62] Gated Attention for Large Language Models : Non-linearity Sparsity and Attention-Sink-Free

【速读】: 该论文试图解决现有研究中对门控机制(gating mechanisms)具体影响缺乏系统性分析的问题,特别是在softmax注意力机制中的作用。其解决方案的关键在于通过在缩放点积注意力(Scaled Dot-Product Attention, SDPA)后应用一个头特定的sigmoid门控,这一简单修改能够持续提升模型性能,并增强训练稳定性、容忍更大的学习率以及改进模型的扩展性。该方法的有效性主要归因于两点:一是在softmax注意力中的低秩映射引入非线性,二是通过查询相关的稀疏门控分数调节SDPA输出。

链接: https://arxiv.org/abs/2505.06708
作者: Zihan Qiu,Zekun Wang,Bo Zheng,Zeyu Huang,Kaiyue Wen,Songlin Yang,Rui Men,Le Yu,Fei Huang,Suozhi Huang,Dayiheng Liu,Jingren Zhou,Junyang Lin
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification-applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA)-consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates ‘attention sink’ and enhances long-context extrapolation performance, and we also release related \hrefthis https URLcodes and \hrefthis https URLmodels to facilitate future research.
zh

[NLP-63] From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback

【速读】: 该论文试图解决当前自动评估基准(如MT-Bench、Arena-Hard和Auto-Arena)在评估大型语言模型(Large Language Models, LLMs)时存在的局限性,即这些基准仅提供总体评分,无法为模型优化和行为分析提供有价值的反馈。其解决方案的关键在于提出一种新的评估范式,从近似人类模型排名转向提供具有分析价值的反馈。核心工具是Feedbacker框架,它包含可扩展的基于树状查询分类法构建器、自动查询生成方案以及可视化与分析工具,并引入了一种新的LLM-as-a-Judge方法——PC2(Pre-Comparison-derived Criteria)点对点评估,通过预比较辅助响应差异来生成评估标准,从而在保持点对点评估时间复杂度的同时实现成对评估的准确性。

链接: https://arxiv.org/abs/2505.06698
作者: Zongqi Wang,Tianle Gu,Chen Gong,Xin Tian,Siqi Bao,Yujiu Yang
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); School of Cyber Engineering, Xidian University (西安电子科技大学网络空间安全学院); Baidu, Inc (百度公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automatic evaluation benchmarks such as MT-Bench, Arena-Hard, and Auto-Arena are seeing growing adoption for the evaluation of Large Language Models (LLMs). Existing research has primarily focused on approximating human-based model rankings using limited data and LLM-as-a-Judge. However, the fundamental premise of these studies, which attempts to replicate human rankings, is flawed. Specifically, these benchmarks typically offer only overall scores, limiting their utility to leaderboard rankings, rather than providing feedback that can guide model optimization and support model profiling. Therefore, we advocate for an evaluation paradigm shift from approximating human-based model rankings to providing feedback with analytical value. To this end, we introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results, thereby enabling thorough identification of a model’s specific strengths and weaknesses. Such feedback not only supports the targeted optimization of the model but also enhances the understanding of its behavior. Feedbacker comprises three key components: an extensible tree-based query taxonomy builder, an automated query synthesis scheme, and a suite of visualization and analysis tools. Furthermore, we propose a novel LLM-as-a-Judge method: PC2 (Pre-Comparison-derived Criteria) pointwise evaluation. This method derives evaluation criteria by pre-comparing the differences between several auxiliary responses, achieving the accuracy of pairwise evaluation while maintaining the time complexity of pointwise evaluation. Finally, leveraging the evaluation results of 17 mainstream LLMs, we demonstrate the usage of Feedbacker and highlight its effectiveness and potential. Our homepage project is available at this https URL.
zh

[NLP-64] Enhancing BERTopic with Intermediate Layer Representations

【速读】: 该论文旨在解决传统主题建模方法在处理大规模文本数据时效率低、难以捕捉语义信息的问题,其解决方案的关键在于利用基于Transformer的嵌入(embedding)技术生成密集聚类,从而更有效地估计主题结构并提取有价值的信息。通过评估18种不同的嵌入表示,并在三个多样化数据集上进行实验,研究证明了通过优化嵌入配置可以显著提升BERTopic算法的性能。

链接: https://arxiv.org/abs/2505.06696
作者: Dominik Koterwa,Maciej Świtała
机构: Faculty of Economic Sciences, University of Warsaw (经济科学学院,华沙大学)
类目: Computation and Language (cs.CL)
备注: Repository with code for reproduction: this https URL

点击查看摘要

Abstract:BERTopic is a topic modeling algorithm that leverages transformer-based embeddings to create dense clusters, enabling the estimation of topic structures and the extraction of valuable insights from a corpus of documents. This approach allows users to efficiently process large-scale text data and gain meaningful insights into its structure. While BERTopic is a powerful tool, embedding preparation can vary, including extracting representations from intermediate model layers and applying transformations to these embeddings. In this study, we evaluate 18 different embedding representations and present findings based on experiments conducted on three diverse datasets. To assess the algorithm’s performance, we report topic coherence and topic diversity metrics across all experiments. Our results demonstrate that, for each dataset, it is possible to find an embedding configuration that performs better than the default setting of BERTopic. Additionally, we investigate the influence of stop words on different embedding configurations.
zh

[NLP-65] S-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models ICASSP2025

【速读】: 该论文试图解决在噪声和多说话人环境下,针对目标说话人的语音处理任务的评估与建模问题,这一场景相较于传统的单说话人任务更具挑战性且更贴近实际应用。解决方案的关键在于提出TS-SUPERB基准,该基准包含四个目标说话人处理任务,并利用从注册语音中提取的说话人嵌入作为条件来指导下游模型,同时采用基于统一SSL的目标语音编码器,结合说话人编码器和提取模块,通过联合优化多个任务以利用相互信息,从而提升模型在目标说话人场景下的性能。

链接: https://arxiv.org/abs/2505.06660
作者: Junyi Peng,Takanori Ashihara,Marc Delcroix,Tsubasa Ochiai,Oldrich Plchot,Shoko Araki,Jan Černocký
机构: Brno University of Technology (布雷诺大学); NTT Corporation (NTT公司)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at ICASSP 2025

点击查看摘要

Abstract:Self-supervised learning (SSL) models have significantly advanced speech processing tasks, and several benchmarks have been proposed to validate their effectiveness. However, previous benchmarks have primarily focused on single-speaker scenarios, with less exploration of target-speaker tasks in noisy, multi-talker conditions – a more challenging yet practical case. In this paper, we introduce the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB), which includes four widely recognized target-speaker processing tasks that require identifying the target speaker and extracting information from the speech mixture. In our benchmark, the speaker embedding extracted from enrollment speech is used as a clue to condition downstream models. The benchmark result reveals the importance of evaluating SSL models in target speaker scenarios, demonstrating that performance cannot be easily inferred from related single-speaker tasks. Moreover, by using a unified SSL-based target speech encoder, consisting of a speaker encoder and an extractor module, we also investigate joint optimization across TS tasks to leverage mutual information and demonstrate its effectiveness.
zh

[NLP-66] Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在微调和推理过程中对内存容量的高需求问题,特别是通过块状量化(block-wise quantization)技术实现更高效的内存使用。其关键解决方案是提出一种优化的块状量化方法,即4-bit block-wise optimal float (BOF4),通过理论与数据驱动的优化过程,显著降低了量化误差。此外,论文还引入了基于有符号绝对块最大值的归一化方法(BOF4-S)以及一种保留异常值的混合精度量化策略(outlier-preserving quantization, OPQ),以进一步减少量化误差并提升语言建模性能。

链接: https://arxiv.org/abs/2505.06653
作者: Patrick Blumenberg,Thomas Graave,Tim Fingscheidt
机构: Insitute for Communications Technology, Technische Universität Braunschweig, Germany
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) demand extensive memory capacity during both fine-tuning and inference. To enable memory-efficient fine-tuning, existing methods apply block-wise quantization techniques, such as NF4 and AF4, to the network weights. We show that these quantization techniques incur suboptimal quantization errors. Therefore, as a first novelty, we propose an optimization approach for block-wise quantization. Using this method, we design a family of quantizers named 4-bit block-wise optimal float (BOF4), which consistently reduces the quantization error compared to both baseline methods. We provide both a theoretical and a data-driven solution for the optimization process and prove their practical equivalence. Secondly, we propose a modification to the employed normalization method based on the signed absolute block maximum (BOF4-S), enabling further reduction of the quantization error and empirically achieving less degradation in language modeling performance. Thirdly, we explore additional variations of block-wise quantization methods applied to LLMs through an experimental study on the importance of accurately representing zero and large-amplitude weights on the one hand, and optimization towards various error metrics on the other hand. Lastly, we introduce a mixed-precision quantization strategy dubbed outlier-preserving quantization (OPQ) to address the distributional mismatch induced by outlier weights in block-wise quantization. By storing outlier weights in 16-bit precision (OPQ) while applying BOF4-S, we achieve top performance among 4-bit block-wise quantization techniques w.r.t. perplexity.
zh

[NLP-67] Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models

【速读】: 该论文试图解决如何优化Transformer模型结构以提升语言建模性能的问题,特别是在减少参数量和训练时间的同时保持或提高模型效果。其解决方案的关键在于对前馈网络(Feedforward Network, FFN)结构的改进,即采用三层FFN的变压器块配置,相较于传统的两层FFN结构,在减少总参数数量和训练时间的情况下实现了更低的训练损失和更好的模型性能。

链接: https://arxiv.org/abs/2505.06633
作者: Isaac Gerber
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Decoder-only transformer networks have become incredibly popular for language modeling tasks. State-of-the-art models can have over a hundred transformer blocks, containing billions of trainable parameters, and are trained on trillions of tokens of text. Each transformer block typically consists of a multi-head attention (MHA) mechanism and a two-layer fully connected feedforward network (FFN). In this paper, we examine the importance of the FFN during the model pre-training process through a series of experiments, confirming that the FFN is important to model performance. Furthermore, we show that models using a transformer block configuration with three-layer FFNs with fewer such blocks outperform the standard two-layer configuration delivering lower training loss with fewer total parameters in less time.
zh

[NLP-68] Dynamic Domain Information Modulation Algorithm for Multi-domain Sentiment Analysis

【速读】: 该论文旨在解决多领域情感分类中由于单一领域标注数据稀缺而导致模型性能下降的问题,其核心挑战在于如何有效利用多领域数据提升情感分类的准确性。解决方案的关键在于提出一种动态信息调制算法,通过将模型训练过程分为两个阶段:第一阶段确定一个共享的超参数以控制跨领域的领域分类任务比例,第二阶段引入一种新的领域感知调制算法,基于梯度和损失的方法调整输入文本中的领域信息,从而更高效地生成情感分类所需的领域信息。

链接: https://arxiv.org/abs/2505.06630
作者: Chunyi Yue,Ang Li
机构: Hainan University (海南大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Multi-domain sentiment classification aims to mitigate poor performance models due to the scarcity of labeled data in a single domain, by utilizing data labeled from various domains. A series of models that jointly train domain classifiers and sentiment classifiers have demonstrated their advantages, because domain classification helps generate necessary information for sentiment classification. Intuitively, the importance of sentiment classification tasks is the same in all domains for multi-domain sentiment classification; but domain classification tasks are different because the impact of domain information on sentiment classification varies across different fields; this can be controlled through adjustable weights or hyper parameters. However, as the number of domains increases, existing hyperparameter optimization algorithms may face the following challenges: (1) tremendous demand for computing resources, (2) convergence problems, and (3) high algorithm complexity. To efficiently generate the domain information required for sentiment classification in each domain, we propose a dynamic information modulation algorithm. Specifically, the model training process is divided into two stages. In the first stage, a shared hyperparameter, which would control the proportion of domain classification tasks across all fields, is determined. In the second stage, we introduce a novel domain-aware modulation algorithm to adjust the domain information contained in the input text, which is then calculated based on a gradient-based and loss-based method. In summary, experimental results on a public sentiment analysis dataset containing 16 domains prove the superiority of the proposed method.
zh

[NLP-69] he Efficiency of Pre-training with Objective Masking in Pseudo Labeling for Semi-Supervised Text Classification

【速读】: 该论文试图解决在文档类别仅由少量人工标注示例(gold-labeled examples)描述的情况下,如何有效进行文本分类的问题,尤其是在大多数训练样本未被标注的情形下。解决方案的关键在于采用一种半监督学习模型,该模型借鉴了Meta Pseudo Labels的教师-学生架构,其中“教师”模型为原始未标注数据生成伪标签以训练“学生”模型,并根据学生在人工标注数据上的表现迭代更新自身模型。此外,作者通过引入基于目标掩码的无监督预训练阶段对原始模型进行了扩展,以提升模型性能。

链接: https://arxiv.org/abs/2505.06624
作者: Arezoo Hatefi,Xuan-Son Vu,Monowar Bhuyan,Frank Drewes
机构: Umeå University (于默奥大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We extend and study a semi-supervised model for text classification proposed earlier by Hatefi et al. for classification tasks in which document classes are described by a small number of gold-labeled examples, while the majority of training examples is unlabeled. The model leverages the teacher-student architecture of Meta Pseudo Labels in which a ‘‘teacher’’ generates labels for originally unlabeled training data to train the ‘‘student’’ and updates its own model iteratively based on the performance of the student on the gold-labeled portion of the data. We extend the original model of Hatefi et al. by an unsupervised pre-training phase based on objective masking, and conduct in-depth performance evaluations of the original model, our extension, and various independent baselines. Experiments are performed using three different datasets in two different languages (English and Swedish).
zh

[NLP-70] Boosting Neural Language Inference via Cascaded Interactive Reasoning

【速读】: 该论文旨在解决自然语言推理(Natural Language Inference, NLI)任务中,现有方法主要依赖于预训练语言模型(Pre-trained Language Models, PLMs)的最终层表示,从而可能忽略中间层中蕴含的有价值信息的问题。其解决方案的关键在于提出一种名为级联交互推理网络(Cascaded Interactive Reasoning Network, CIRN)的新架构,该架构通过多层次的特征提取策略,在一个交互空间中持续整合跨句信息,以实现从表层特征匹配到深层次逻辑与语义关联的渐进式推理过程,从而更全面地建模复杂的语义交互。

链接: https://arxiv.org/abs/2505.06607
作者: Min Li,Chun Yuan
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Natural Language Inference (NLI) focuses on ascertaining the logical relationship (entailment, contradiction, or neutral) between a given premise and hypothesis. This task presents significant challenges due to inherent linguistic features such as diverse phrasing, semantic complexity, and contextual nuances. While Pre-trained Language Models (PLMs) built upon the Transformer architecture have yielded substantial advancements in NLI, prevailing methods predominantly utilize representations from the terminal layer. This reliance on final-layer outputs may overlook valuable information encoded in intermediate layers, potentially limiting the capacity to model intricate semantic interactions effectively. Addressing this gap, we introduce the Cascaded Interactive Reasoning Network (CIRN), a novel architecture designed for deeper semantic comprehension in NLI. CIRN implements a hierarchical feature extraction strategy across multiple network depths, operating within an interactive space where cross-sentence information is continuously integrated. This mechanism aims to mimic a process of progressive reasoning, transitioning from surface-level feature matching to uncovering more profound logical and semantic connections between the premise and hypothesis. By systematically mining latent semantic relationships at various representational levels, CIRN facilitates a more thorough understanding of the input pair. Comprehensive evaluations conducted on several standard NLI benchmark datasets reveal consistent performance gains achieved by CIRN over competitive baseline approaches, demonstrating the efficacy of leveraging multi-level interactive features for complex relational reasoning.
zh

[NLP-71] Using External knowledge to Enhanced PLM for Semantic Matching

【速读】: 该论文试图解决如何在仅有大规模标注数据的情况下,使机器学习所有必要的知识以完成语义相关性检测任务的问题,以及如何将外部知识融入基于神经网络的模型中以提升语义相关性检测性能。其解决方案的关键在于利用外部知识增强预训练的语义相关性判别模型,从而在10个公开数据集上的实验中实现了相对于基线模型的性能一致提升。

链接: https://arxiv.org/abs/2505.06605
作者: Min Li,Chun Yuan
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modeling semantic relevance has always been a challenging and critical task in natural language processing. In recent years, with the emergence of massive amounts of annotated data, it has become feasible to train complex models, such as neural network-based reasoning models. These models have shown excellent performance in practical applications and have achieved the current state-ofthe-art performance. However, even with such large-scale annotated data, we still need to think: Can machines learn all the knowledge necessary to perform semantic relevance detection tasks based on this data alone? If not, how can neural network-based models incorporate external knowledge into themselves, and how can relevance detection models be constructed to make full use of external knowledge? In this paper, we use external knowledge to enhance the pre-trained semantic relevance discrimination model. Experimental results on 10 public datasets show that our method achieves consistent improvements in performance compared to the baseline model.
zh

[NLP-72] Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation

【速读】: 该论文旨在解决波斯语中图素到音素(Grapheme-to-Phoneme, G2P)转换的挑战,特别是由于同形异义词(homographs)和Ezafe结构在正式与非正式语言环境中的存在所带来的语音歧义问题。解决方案的关键在于提出一种专门用于波斯语处理的中间语言,并结合大型语言模型(Large Language Model, LLM)提示技术和专门设计的序列到序列机器转写架构,同时通过形式概念分析构建一个全面的词汇数据库以实现多发音词(polyphones)的消歧。该方法在两个不同数据集上进行训练,显著提升了音素错误率(Phoneme Error Rate, PER)指标,为波斯语文本到语音系统提供了可靠解决方案,并具备扩展至其他具有丰富同形异义现象的语言的潜力。

链接: https://arxiv.org/abs/2505.06599
作者: Abbas Bertina,Shahab Beirami,Hossein Biniazian,Elham Esmaeilnia,Soheil Shahi,Mahdi Pirnia
机构: 未知
类目: Computation and Language (cs.CL)
备注: pdf, 8 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Grapheme-to-phoneme (G2P) conversion for Persian presents unique challenges due to its complex phonological features, particularly homographs and Ezafe, which exist in formal and informal language contexts. This paper introduces an intermediate language specifically designed for Persian language processing that addresses these challenges through a multi-faceted approach. Our methodology combines two key components: Large Language Model (LLM) prompting techniques and a specialized sequence-to-sequence machine transliteration architecture. We developed and implemented a systematic approach for constructing a comprehensive lexical database for homographs with multiple pronunciations disambiguation often termed polyphones, utilizing formal concept analysis for semantic differentiation. We train our model using two distinct datasets: the LLM-generated dataset for formal and informal Persian and the B-Plus podcasts for informal language variants. The experimental results demonstrate superior performance compared to existing state-of-the-art approaches, particularly in handling the complexities of Persian phoneme conversion. Our model significantly improves Phoneme Error Rate (PER) metrics, establishing a new benchmark for Persian G2P conversion accuracy. This work contributes to the growing research in low-resource language processing and provides a robust solution for Persian text-to-speech systems and demonstrating its applicability beyond Persian. Specifically, the approach can extend to languages with rich homographic phenomena such as Chinese and Arabic
zh

[NLP-73] Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation

【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在处理复杂多模态输入(如整个电视节目剧集)时难以平衡视觉与文本信息的问题。其解决方案的关键在于提出一种零样本视频到文本的摘要方法,该方法构建了剧集的原创剧本表示,将关键视频片段、对话和角色信息整合为统一文档。该方法通过仅使用音频、视频和字幕作为输入,同时生成剧本并命名角色,实现了零样本场景下的多模态信息融合。

链接: https://arxiv.org/abs/2505.06594
作者: Galann Pennec,Zhengyuan Liu,Nicholas Asher,Philippe Muller,Nancy F. Chen
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating superiority against state-of-the-art VLMs such as Gemini 1.5 by generating summaries containing 20% more relevant visual information while requiring 75% less of the video as input.
zh

[NLP-74] Evaluating LLM -Generated QA Test: a Student-Centered Study

【速读】: 该论文试图解决传统考试设计中耗时耗力且难以大规模定制化的问题,提出利用生成式 AI (Generative AI) 自动构建可靠的问题-答案(QA)测试。解决方案的关键在于基于 GPT-4o-mini 的自动问答测试生成管道,并通过项目反应理论(IRT)分析和用户评价验证其心理测量学性能与感知质量,证明了大语言模型生成的评估工具在信度、效度及用户满意度方面可与人工编写的测试相媲美。

链接: https://arxiv.org/abs/2505.06591
作者: Anna Wróblewska,Bartosz Grabek,Jakub Świstak,Daniel Dan
机构: Warsaw University of Technology (华沙理工大学); Faculty of Mathematics and Information Science, Poland (数学与信息科学学院,波兰); MODUL Univeristy Vienna (MODUL维也纳大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: accepted to AIED 2025

点击查看摘要

Abstract:This research prepares an automatic pipeline for generating reliable question-answer (QA) tests using AI chatbots. We automatically generated a GPT-4o-mini-based QA test for a Natural Language Processing course and evaluated its psychometric and perceived-quality metrics with students and experts. A mixed-format IRT analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality. A uniform DIF check identified two items for review. These findings demonstrate that LLM-generated assessments can match human-authored tests in psychometric performance and user satisfaction, illustrating a scalable approach to AI-assisted assessment development.
zh

[NLP-75] MacRAG : Compress Slice and Scale-up for Multi-Scale Adaptive Context RAG

【速读】: 该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理长上下文、多跳任务时存在的检索不精确、受限上下文窗口下的上下文覆盖不全以及由次优上下文构建导致的信息碎片化问题。其解决方案的关键在于提出多尺度自适应上下文RAG(Multi-scale Adaptive Context RAG, MacRAG),该框架通过将文档压缩并划分为从粗到细的不同粒度,实时通过块级和文档级扩展自适应地融合相关上下文,从而构建有效的查询特定长上下文,优化了检索的精度与覆盖范围。

链接: https://arxiv.org/abs/2505.06569
作者: Woosang Lim,Zekun Li,Gyuwan Kim,Sungyoung Ji,HyeonJung Kim,Kyuri Choi,Jin Hyuk Lim,Kyungpyo Park,William Yang Wang
机构: POSCO HOLDINGS(浦项控股); University of California, Santa Barbara(加州大学圣塔芭芭拉分校); Google Cloud(谷歌云)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Long-context (LC) Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) hold strong potential for complex multi-hop and large-document tasks. However, existing RAG systems often suffer from imprecise retrieval, incomplete context coverage under constrained context windows, and fragmented information caused by suboptimal context construction. We introduce Multi-scale Adaptive Context RAG (MacRAG), a hierarchical retrieval framework that compresses and partitions documents into coarse-to-fine granularities, then adaptively merges relevant contexts through chunk- and document-level expansions in real time. By starting from the finest-level retrieval and progressively incorporating higher-level and broader context, MacRAG constructs effective query-specific long contexts, optimizing both precision and coverage. Evaluations on the challenging LongBench expansions of HotpotQA, 2WikiMultihopQA, and Musique confirm that MacRAG consistently surpasses baseline RAG pipelines on single- and multi-step generation with Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o. Our results establish MacRAG as an efficient, scalable solution for real-world long-context, multi-hop reasoning. Our code is available at this https URL.
zh

[NLP-76] References Indeed Matter? Reference-Free Preference Optimization for Conversational Query Reformulation

【速读】: 该论文试图解决对话式应用中信息检索性能提升所依赖的参考段落(reference passages)在现实场景中难以获取的问题。其解决方案的关键在于提出一种无需参考段落的双阶段偏好优化框架DualReform,该框架通过两种核心创新实现目标:一是基于响应的推理,利用响应作为代理来推断伪参考段落;二是通过CQR的双重角色进行响应优化,即CQR模型根据响应优化与CQR之间的共同目标对响应进行精炼。

链接: https://arxiv.org/abs/2505.06552
作者: Doyoung Kim,Youngjun Lee,Joeun Kim,Jihwan Bang,Hwanjun Song,Susik Yoon,Jae-Gil Lee
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Conversational query reformulation (CQR) has become indispensable for improving retrieval in dialogue-based applications. However, existing approaches typically rely on reference passages for optimization, which are impractical to acquire in real-world scenarios. To address this limitation, we introduce a novel reference-free preference optimization framework DualReform that generates pseudo reference passages from commonly-encountered conversational datasets containing only queries and responses. DualReform attains this goal through two key innovations: (1) response-based inference, where responses serve as proxies to infer pseudo reference passages, and (2) response refinement via the dual-role of CQR, where a CQR model refines responses based on the shared objectives between response refinement and CQR. Despite not relying on reference passages, DualReform achieves 96.9–99.1% of the retrieval accuracy attainable only with reference passages and surpasses the state-of-the-art method by up to 31.6%.
zh

[NLP-77] REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback

【速读】: 该论文试图解决在生成指令数据时依赖人工标注所带来的高成本、低效率及任务多样性不足的问题。其解决方案的关键在于采用一种半自动化框架,利用开源的小型大语言模型(Large Language Models, LLMs)如LLaMA 2-7B、LLaMA 2-13B和Mistral 7B来减少人工干预,从而降低生成用于微调LLMs的指令数据集所需的人力、努力和成本。此外,研究还表明,在该框架中引入基于强化学习(Reinforcement Learning, RL)的训练算法能够进一步提升性能。

链接: https://arxiv.org/abs/2505.06548
作者: Aniruddha Roy,Pretam Ray,Abhilash Nandy,Somak Aditya,Pawan Goyal
机构: Indian Institute of Technology, Kharagpur (印度理工学院,克哈格普尔分校)
类目: Computation and Language (cs.CL)
备注: 11 pages

点击查看摘要

Abstract:Instruction-based Large Language Models (LLMs) have proven effective in numerous few-shot or zero-shot Natural Language Processing (NLP) tasks. However, creating human-annotated instruction data is time-consuming, expensive, and often limited in quantity and task diversity. Previous research endeavors have attempted to address this challenge by proposing frameworks capable of generating instructions in a semi-automated and task-agnostic manner directly from the model itself. Many of these efforts have relied on large API-only parameter-based models such as GPT-3.5 (175B), which are expensive, and subject to limits on a number of queries. This paper explores the performance of three open-source small LLMs such as LLaMA 2-7B, LLama 2-13B, and Mistral 7B, using a semi-automated framework, thereby reducing human intervention, effort, and cost required to generate an instruction dataset for fine-tuning LLMs. Furthermore, we demonstrate that incorporating a Reinforcement Learning (RL) based training algorithm into this LLMs-based framework leads to further enhancements. Our evaluation of the dataset reveals that these RL-based frameworks achieve a substantial improvements in 63-66% of the tasks compared to previous approaches.
zh

[NLP-78] hink in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model

【速读】: 该论文旨在解决多模态大推理模型(Multimodal Large Reasoning Models, MLRMs)在安全性和可靠性方面的关键问题,特别是在面对潜在的恶意攻击(如越狱攻击)时模型表现退化的问题。其解决方案的关键在于利用模型内在的推理能力来检测不安全意图,通过构建一个包含安全导向思维过程的多模态微调数据集,进而提升模型在越狱鲁棒性和安全意识基准测试中的安全性表现。

链接: https://arxiv.org/abs/2505.06538
作者: Xinyue Lou,You Li,Jinan Xu,Xiangyu Shi,Chi Chen,Kaiyu Huang
机构: Beijing Jiaotong University (北京交通大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:The rapid development of multimodal large reasoning models (MLRMs) has demonstrated broad application potential, yet their safety and reliability remain critical concerns that require systematic exploration. To address this gap, we conduct a comprehensive and systematic safety evaluation of 11 MLRMs across 5 benchmarks and unveil prevalent safety degradation phenomena in most advanced models. Moreover, our analysis reveals distinct safety patterns across different benchmarks: significant safety degradation is observed across jailbreak robustness benchmarks, whereas safety-awareness benchmarks demonstrate less pronounced degradation. In particular, a long thought process in some scenarios even enhances safety performance. Therefore, it is a potential approach to addressing safety issues in MLRMs by leveraging the intrinsic reasoning capabilities of the model to detect unsafe intent. To operationalize this insight, we construct a multimodal tuning dataset that incorporates a safety-oriented thought process. Experimental results from fine-tuning existing MLRMs with this dataset effectively enhances the safety on both jailbreak robustness and safety-awareness benchmarks. This study provides a new perspective for developing safe MLRMs. Our dataset is available at this https URL.
zh

[NLP-79] xGen-small Technical Report

【速读】: 该论文旨在解决长上下文处理任务中模型性能不足的问题,特别是针对数学和编程领域的需求。其解决方案的关键在于采用垂直集成的流水线,包括领域平衡且频率感知的数据收集、多阶段预训练(结合质量退火和长度扩展至128k token)以及通过监督微调、偏好学习和在线强化学习进行针对性后训练。

链接: https://arxiv.org/abs/2505.06496
作者: Erik Nijkamp,Bo Pang,Egor Pakhomov,Akash Gokul,Jin Qu,Silvio Savarese,Yingbo Zhou,Caiming Xiong
机构: Salesforce( Salesforce)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce xGen-small, a family of 4B and 9B Transformer decoder models optimized for long-context applications. Our vertically integrated pipeline unites domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension to 128k tokens; and targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning. xGen-small delivers strong performance across various tasks, especially in math and coding domains, while excelling at long context benchmarks.
zh

[NLP-80] Is your multimodal large language model a good science tutor?

【速读】: 该论文试图解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在教育场景中仅关注最终答案准确性而忽视教学能力的问题。其解决方案的关键在于提出一个基于教育评分标准和模拟学生模型的框架,用于评估MLLM作为科学导师的教学表现,并通过对比强弱导师的输出进行偏好优化,从而提升模型的教育适配性。

链接: https://arxiv.org/abs/2505.06418
作者: Ming Liu,Liwen Wang,Wensheng Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) demonstrate impressive performance on scientific reasoning tasks (e.g., ScienceQA). However, most existing benchmarks focus narrowly on the accuracy of the final answer while ignoring other metrics. In particular, when applying MLLMs to educational contexts, the goal is not only correctness but also the ability to teach. In this paper, we propose a framework that evaluates MLLMs as science tutors using a comprehensive educational rubric and a simulated student model that judges the teaching performance of the tutors. Given a list of candidate MLLM science tutors, we use rubric-based student judgments to produce a range of tutor performance scores, identifying both strong and weak tutors. Using the training section of the ScienceQA dataset, we then construct a data set of pairwise comparisons between the outputs of strong and weak tutors. This enables us to apply multiple preference optimization methods to fine-tune an underperforming tutor model (Qwen2-VL-2B) into more effective ones. Our results also show that strong problem-solving skills do not guarantee high-quality tutoring and that performance optimization-guided refinements can yield more educationally aligned tutor models. This approach opens avenues for building MLLMs that serve not only as problem solvers, but as genuinely helpful educational assistants.
zh

[NLP-81] ScaleMCP: Dynamic and Auto-Synchronizing Model Context Protocol Tools for LLM Agents

【速读】: 该论文旨在解决现有工具选择框架在集成Model Context Protocol (MCP)服务器方面的不足,以及由此导致的工具仓库维护效率低、重复和不一致等问题。传统方法依赖人工更新静态本地工具仓库,限制了LLM代理的自主性和多轮交互中的动态重新查询能力。论文提出的解决方案关键在于引入ScaleMCP,其核心是通过MCP服务器动态为LLM代理配备工具检索器,并结合CRUD操作实现自动同步的工具存储系统,同时提出一种新的嵌入策略Tool Document Weighted Average (TDWA),以在嵌入过程中突出工具文档的关键部分。

链接: https://arxiv.org/abs/2505.06416
作者: Elias Lumer,Anmol Gulati,Vamse Kumar Subbiah,Pradeep Honaganahalli Basavaraju,James A. Burke
机构: PricewaterhouseCoopers(普华永道)
类目: Computation and Language (cs.CL)
备注: 17 pages

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) and the introduction of the Model Context Protocol (MCP) have significantly expanded LLM agents’ capability to interact dynamically with external tools and APIs. However, existing tool selection frameworks do not integrate MCP servers, instead relying heavily on error-prone manual updates to monolithic local tool repositories, leading to duplication, inconsistencies, and inefficiencies. Additionally, current approaches abstract tool selection before the LLM agent is invoked, limiting its autonomy and hindering dynamic re-querying capabilities during multi-turn interactions. To address these issues, we introduce ScaleMCP, a novel tool selection approach that dynamically equips LLM agents with a MCP tool retriever, giving agents the autonomy to add tools into their memory, as well as an auto-synchronizing tool storage system pipeline through CRUD (create, read, update, delete) operations with MCP servers as the single source of truth. We also propose a novel embedding strategy, Tool Document Weighted Average (TDWA), designed to selectively emphasize critical components of tool documents (e.g. tool name or synthetic questions) during the embedding process. Comprehensive evaluations conducted on a created dataset of 5,000 financial metric MCP servers, across 10 LLM models, 5 embedding models, and 5 retriever types, demonstrate substantial improvements in tool retrieval and agent invocation performance, emphasizing ScaleMCP’s effectiveness in scalable, dynamic tool selection and invocation.
zh

[NLP-82] Divide (Text) and Conquer (Sentiment): Improved Sentiment Classification by Constituent Conflict Resolution

【速读】: 该论文旨在解决在分析具有多个冲突情感的文本段落时,情感分类任务面临的挑战,尤其是长文本导致模型性能下降的问题。其解决方案的关键在于提出新颖的方法来隔离冲突情感并进行有效聚合,其中一种聚合策略采用多层感知机(MLP)模型,在多个数据集上表现出优于基线模型的性能,且训练成本仅为微调基线模型的约1/100。

链接: https://arxiv.org/abs/2505.06320
作者: Jan Kościałkowski,Paweł Marcinkowski
机构: Relativity(相对论); Allegro Pay(阿尔格罗支付)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages, 6 figures, 4 tables, developed as a final project for the Stanford Center for Professional Education XCS224U (Natural Language Understanding) course

点击查看摘要

Abstract:Sentiment classification, a complex task in natural language processing, becomes even more challenging when analyzing passages with multiple conflicting tones. Typically, longer passages exacerbate this issue, leading to decreased model performance. The aim of this paper is to introduce novel methodologies for isolating conflicting sentiments and aggregating them to effectively predict the overall sentiment of such passages. One of the aggregation strategies involves a Multi-Layer Perceptron (MLP) model which outperforms baseline models across various datasets, including Amazon, Twitter, and SST while costing \sim 1/100 of what fine-tuning the baseline would take.
zh

[NLP-83] AI Approaches to Qualitative and Quantitative News Analytics on NATO Unity

【速读】: 该论文试图解决如何利用生成式 AI (Generative AI) 结合检索增强生成(RAG)技术对不同网络来源中的北约(NATO)情绪、北约团结以及北约第五条信任意见分数进行定性和定量分析的问题。解决方案的关键在于采用基于 GPT-4.1 模型的 RAG 方法,通过零样本提示生成新闻摘要和意见评分,并在两个层级上进行分析:第一层级生成新闻的定性摘要和定量意见评分,第二层级生成新闻摘要的总结;随后使用贝叶斯回归分析生成的定量意见评分以获取趋势线,并通过回归参数分布评估意见趋势的不确定性。该方法旨在为复杂分析提供一种基于 AI 的辅助工具,而非直接进行政治分析。

链接: https://arxiv.org/abs/2505.06313
作者: Bohdan M. Pavlyshenko
机构: Ivan Franko National University of Lviv (伊万·弗兰科国立利沃夫大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:The paper considers the use of GPT models with retrieval-augmented generation (RAG) for qualitative and quantitative analytics on NATO sentiments, NATO unity and NATO Article 5 trust opinion scores in different web sources: news sites found via Google Search API, Youtube videos with comments, and Reddit discussions. A RAG approach using GPT-4.1 model was applied to analyse news where NATO related topics were discussed. Two levels of RAG analytics were used: on the first level, the GPT model generates qualitative news summaries and quantitative opinion scores using zero-shot prompts; on the second level, the GPT model generates the summary of news summaries. Quantitative news opinion scores generated by the GPT model were analysed using Bayesian regression to get trend lines. The distributions found for the regression parameters make it possible to analyse an uncertainty in specified news opinion score trends. Obtained results show a downward trend for analysed scores of opinion related to NATO unity. This approach does not aim to conduct real political analysis; rather, it consider AI based approaches which can be used for further analytics as a part of a complex analytical approach. The obtained results demonstrate that the use of GPT models for news analysis can give informative qualitative and quantitative analytics, providing important insights. The dynamic model based on neural ordinary differential equations was considered for modelling public opinions. This approach makes it possible to analyse different scenarios for evolving public opinions. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI) Cite as: arXiv:2505.06313 [cs.IR] (or arXiv:2505.06313v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2505.06313 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Bohdan Pavlyshenko [view email] [v1] Thu, 8 May 2025 18:42:01 UTC (951 KB)
zh

[NLP-84] Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)生成数据在现代文本管理系统中高效且无损压缩的问题。传统压缩方法在处理结构化程度高、内容简单的机器生成数据时表现良好,但面对LLM生成的复杂且多样化的文本数据则效果有限。论文提出的解决方案关键在于利用LLM自身的预测能力,因其在训练过程中通过下一个词预测机制,使得LLM生成的数据具有高度可预测性,从而能够作为自身输出的高效压缩器。实验结果表明,基于LLM的预测方法在多个数据集和模型规模下均实现了显著优于Gzip等传统压缩工具的压缩率。

链接: https://arxiv.org/abs/2505.06297
作者: Yu Mao,Holger Pirk,Chun Jason Xue
机构: MBZUAI(穆巴达拉人工智能大学); Imperial College London(帝国理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) continue to be deployed and utilized across domains, the volume of LLM-generated data is growing rapidly. This trend highlights the increasing importance of effective and lossless compression for such data in modern text management systems. However, compressing LLM-generated data presents unique challenges compared to traditional human- or machine-generated content. Traditional machine-generated data is typically derived from computational processes or device outputs, often highly structured and limited to low-level elements like labels or numerical values. This structure enables conventional lossless compressors to perform efficiently. In contrast, LLM-generated data is more complex and diverse, requiring new approaches for effective compression. In this work, we conduct the first systematic investigation of lossless compression techniques tailored specifically to LLM-generated data. Notably, because LLMs are trained via next-token prediction, we find that LLM-generated data is highly predictable for the models themselves. This predictability enables LLMs to serve as efficient compressors of their own outputs. Through extensive experiments with 14 representative LLMs and 8 LLM-generated datasets from diverse domains, we show that LLM-based prediction methods achieve remarkable compression rates, exceeding 20x, far surpassing the 3x rate achieved by Gzip, a widely used general-purpose compressor. Furthermore, this advantage holds across different LLM sizes and dataset types, demonstrating the robustness and practicality of LLM-based methods in lossless text compression under generative AI workloads.
zh

计算机视觉

[CV-0] Hmathbf3DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning

【速读】:该论文旨在解决视觉-运动策略学习中视觉感知与动作预测之间关键耦合关系被忽视的问题,传统方法多依赖生成模型建模动作分布,但未能有效整合视觉特征与动作生成。其解决方案的关键在于提出一种三层次的扩散策略框架(Triply-Hierarchical Diffusion Policy, H³DP),通过深度感知输入层、多尺度视觉表征以及分层条件扩散过程,强化视觉特征与动作生成之间的层次化整合。

链接: https://arxiv.org/abs/2505.07819
作者: Yiyang Lu,Yufeng Tian,Zhecheng Yuan,Xianbang Wang,Pu Hua,Zhengrong Xue,Huazhe Xu
机构: Tsinghua University IIIS(清华大学人工智能学院); Shanghai Qi Zhi Institute(上海期智研究院); Shanghai AI Lab(上海人工智能实验室); Harbin Institute of Technology(哈尔滨工业大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visuomotor policy learning has witnessed substantial progress in robotic manipulation, with recent approaches predominantly relying on generative models to model the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce \textbfTriply-Hierarchical Diffusion Policy~(\textbfH ^\mathbf3 DP) , a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. H ^3 DP contains \mathbf3 levels of hierarchy: (1) depth-aware input layering that organizes RGB-D observations based on depth information; (2) multi-scale visual representations that encode semantic features at varying levels of granularity; and (3) a hierarchically conditioned diffusion process that aligns the generation of coarse-to-fine actions with corresponding visual features. Extensive experiments demonstrate that H ^3 DP yields a \mathbf+27.5% average relative improvement over baselines across \mathbf44 simulation tasks and achieves superior performance in \mathbf4 challenging bimanual real-world manipulation tasks. Project Page: this https URL.
zh

[CV-1] DanceGRPO: Unleashing GRPO on Visual Generation

【速读】:该论文试图解决生成式模型(Generative Models)输出与人类偏好对齐的问题,尤其是在视觉内容生成中,现有基于强化学习(Reinforcement Learning, RL)的方法存在与现代常微分方程(Ordinary Differential Equations, ODEs)采样范式的不兼容性、大规模训练的不稳定性以及视频生成验证的缺失。解决方案的关键在于提出DanceGRPO,这是首个将群体相对策略优化(Group Relative Policy Optimization, GRPO)适配到视觉生成范式的统一框架,能够跨扩散模型和修正流两种生成范式、三种任务(文本到图像、文本到视频、图像到视频)、四种基础模型及五种奖励模型实现无缝适应,从而显著提升生成质量并稳定策略优化过程。

链接: https://arxiv.org/abs/2505.07818
作者: Zeyue Xue,Jie Wu,Yu Gao,Fangyuan Kong,Lingting Zhu,Mengzhao Chen,Zhiheng Liu,Wei Liu,Qiushan Guo,Weilin Huang,Ping Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recent breakthroughs in generative models-particularly diffusion models and rectified flows-have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. Existing reinforcement learning (RL)-based methods for visual generation face critical limitations: incompatibility with modern Ordinary Differential Equations (ODEs)-based sampling paradigms, instability in large-scale training, and lack of validation for video generation. This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward). To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundational models, and reward models. DanceGRPO demonstrates consistent and substantial improvements, which outperform baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Notably, DanceGRPO not only can stabilize policy optimization for complex video generation, but also enables generative policy to better capture denoising trajectories for Best-of-N inference scaling and learn from sparse binary feedback. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis. The code will be released.
zh

[CV-2] Pixel Motion as Universal Representation for Robot Control

【速读】:该论文试图解决机器人控制中如何有效融合语言指令、运动表示与具体动作执行的问题,特别是在无监督和有监督设置下实现灵活、可扩展和通用的控制。解决方案的关键在于提出LangToMo框架,该框架采用双系统架构,其中System 2(高阶系统)作为基于图像扩散模型的高层策略,从单帧生成文本条件下的像素运动序列;System 1(低阶系统)则通过运动到动作的映射函数将这些像素运动转换为具体的机器人动作,从而实现分层解耦的控制结构。像素运动(pixel motion)作为一种通用、可解释且以运动为中心的表示,能够通过自监督方式从视频中提取,并用于大规模扩散模型训练,提升了系统的泛化能力。

链接: https://arxiv.org/abs/2505.07817
作者: Kanchana Ranasinghe,Xiang Li,Cristina Mata,Jongwoo Park,Michael S Ryoo
机构: Stony Brook University (斯托尼布鲁克大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present LangToMo, a vision-language-action framework structured as a dual-system architecture that uses pixel motion forecasts as intermediate representations. Our high-level System 2, an image diffusion model, generates text-conditioned pixel motion sequences from a single frame to guide robot control. Pixel motion-a universal, interpretable, and motion-centric representation-can be extracted from videos in a self-supervised manner, enabling diffusion model training on web-scale video-caption data. Treating generated pixel motion as learned universal representations, our low level System 1 module translates these into robot actions via motion-to-action mapping functions, which can be either hand-crafted or learned with minimal supervision. System 2 operates as a high-level policy applied at sparse temporal intervals, while System 1 acts as a low-level policy at dense temporal intervals. This hierarchical decoupling enables flexible, scalable, and generalizable robot control under both unsupervised and supervised settings, bridging the gap between language, motion, and action. Checkout this https URL for visualizations.
zh

[CV-3] Imagine Verify Execute: Memory-Guided Agent ic Exploration with Vision-Language Models

【速读】:该论文旨在解决通用机器人学习中探索效率低下的问题,特别是在缺乏密集奖励、明确目标或任务特定监督的开放环境中的探索挑战。其解决方案的关键在于提出IVE(Imagine, Verify, Execute)框架,该框架受人类好奇心启发,利用视觉语言模型(VLM)将RGB-D观测抽象为语义场景图,生成新颖场景并预测其物理可行性,最终通过动作工具生成可执行技能序列,从而实现更高效和有意义的探索。

链接: https://arxiv.org/abs/2505.07815
作者: Seungjae Lee,Daniel Ekpo,Haowen Liu,Furong Huang,Abhinav Shrivastava,Jia-Bin Huang
机构: University of Maryland(马里兰大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project webpage: this https URL

点击查看摘要

Abstract:Exploration is essential for general-purpose robotic learning, especially in open-ended environments where dense rewards, explicit goals, or task-specific supervision are scarce. Vision-language models (VLMs), with their semantic reasoning over objects, spatial relations, and potential outcomes, present a compelling foundation for generating high-level exploratory behaviors. However, their outputs are often ungrounded, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration is often driven by the desire to discover novel scene configurations and to deepen understanding of the environment. Similarly, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE enables more diverse and meaningful exploration than RL baselines, as evidenced by a 4.1 to 7.8x increase in the entropy of visited states. Moreover, the collected experience supports downstream learning, producing policies that closely match or exceed the performance of those trained on human-collected demonstrations.
zh

[CV-4] DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies

【速读】:该论文旨在解决机器人在未知环境中实现灵活操作(dexterous manipulation)的泛化能力不足问题,尤其是在缺乏大规模、多样化机器人数据集的情况下。其解决方案的关键在于提出DexWild-System,一种低成本、便携且易于使用的设备,使人类通过日常手部操作收集多环境、多物体的交互数据,并结合人类与机器人示范进行联合训练(co-training),从而提升机器人策略的泛化能力。这种混合数据源的训练方法显著优于仅依赖机器人数据的训练方式。

链接: https://arxiv.org/abs/2505.07813
作者: Tony Tao,Mohan Kumar Srirama,Jason Jingzhou Liu,Kenneth Shaw,Deepak Pathak
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: In RSS 2025. Website at this https URL

点击查看摘要

Abstract:Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments-nearly four times higher than policies trained with robot data only-and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions at this https URL
zh

[CV-5] Continuous Visual Autoregressive Generation via Score Maximization ICML2025

【速读】:该论文试图解决传统自回归模型在处理连续模态(如视觉数据)时因量化导致的信息损失问题。其解决方案的关键在于引入一种连续的自回归建模框架(Continuous VAR),该框架通过严格适当的评分规则(strictly proper scoring rules)实现无需向量量化即可直接进行视觉自回归生成。该理论基础提供了强大的统计工具,用于评估生成模型对真实分布的逼近程度,核心在于选择合适的严格适当评分作为训练目标进行优化。

链接: https://arxiv.org/abs/2505.07812
作者: Chenze Shao,Fandong Meng,Jie Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2025

点击查看摘要

Abstract:Conventional wisdom suggests that autoregressive models are used to process discrete data. When applied to continuous modalities such as visual data, Visual AutoRegressive modeling (VAR) typically resorts to quantization-based approaches to cast the data into a discrete space, which can introduce significant information loss. To tackle this issue, we introduce a Continuous VAR framework that enables direct visual autoregressive generation without vector quantization. The underlying theoretical foundation is strictly proper scoring rules, which provide powerful statistical tools capable of evaluating how well a generative model approximates the true distribution. Within this framework, all we need is to select a strictly proper score and set it as the training objective to optimize. We primarily explore a class of training objectives based on the energy score, which is likelihood-free and thus overcomes the difficulty of making probabilistic predictions in the continuous space. Previous efforts on continuous autoregressive generation, such as GIVT and diffusion loss, can also be derived from our framework using other strictly proper scores. Source code: this https URL.
zh

[CV-6] Privacy Risks of Robot Vision: A User Study on Image Modalities and Resolution

【速读】:该论文试图解决移动服务机器人在个人或敏感环境中部署时用户隐私保护的问题,特别是在使用摄像头进行下游任务时可能引发的隐私风险。其解决方案的关键在于通过用户研究分析不同图像模态(如深度图像和语义分割图像)及图像分辨率(如32×32和16×16)对用户隐私感知的影响,从而为隐私安全的视觉数据处理提供依据。

链接: https://arxiv.org/abs/2505.07766
作者: Xuying Huang,Sicong Pan,Maren Bennewitz
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:User privacy is a crucial concern in robotic applications, especially when mobile service robots are deployed in personal or sensitive environments. However, many robotic downstream tasks require the use of cameras, which may raise privacy risks. To better understand user perceptions of privacy in relation to visual data, we conducted a user study investigating how different image modalities and image resolutions affect users’ privacy concerns. The results show that depth images are broadly viewed as privacy-safe, and a similarly high proportion of respondents feel the same about semantic segmentation images. Additionally, the majority of participants consider 3232 resolution RGB images to be almost sufficiently privacy-preserving, while most believe that 1616 resolution can fully guarantee privacy protection.
zh

[CV-7] Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

【速读】:该论文旨在解决3D生成领域面临的挑战,包括数据稀缺性、算法局限性和生态系统碎片化问题。其解决方案的关键在于提出一个名为Step1X-3D的开源框架,该框架通过三个核心要素进行改进:(1)构建一个经过严格数据筛选的高质量数据集,包含2M个具有标准化几何和纹理属性的资产;(2)设计一种两阶段的3D原生架构,结合混合VAE-DiT几何生成器与基于扩散的纹理合成模块,以提升生成质量和一致性;(3)全面开源模型、训练代码和适配模块,促进研究的可重复性和扩展性。其中,混合VAE-DiT组件通过感知器基础的潜在编码和锐利边缘采样实现细节保留,而扩散纹理合成模块则通过几何条件和潜在空间同步确保跨视角一致性。

链接: https://arxiv.org/abs/2505.07747
作者: Weiyu Li,Xuanyang Zhang,Zheng Sun,Di Qi,Hao Li,Wei Cheng,Weiwei Cai,Shihao Wu,Jiarui Liu,Zihao Wang,Xiao Chen,Feipeng Tian,Jianxiong Pan,Zeming Li,Gang Yu,Xiangyu Zhang,Daxin Jiang,Ping Tan
机构: Step1X-3D Team & LightIllusions Team; StepFun & LightIllusions
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report

点击查看摘要

Abstract:While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing 5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with an diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques~(e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.
zh

[CV-8] BodyGPS: Anatomical Positioning System

【速读】:该论文旨在解决医学影像中人体解剖结构解析的通用性与效率问题,特别是针对不同成像模态(如CT和MRI)的适应性与实时处理需求。其解决方案的关键在于训练一个神经网络估计器,通过回归将查询位置映射到图谱坐标,从而实现匹配、配准、分类或分割等任务,并通过稀疏采样输入以提升计算效率,使响应时间低于1毫秒而无需额外加速硬件。

链接: https://arxiv.org/abs/2505.07744
作者: Halid Ziya Yerebakan,Kritika Iyer,Xueqi Guo,Yoshihisa Shinagawa,Gerardo Hermosillo Valadez
机构: Siemens Healthineers (西门子医疗)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce a new type of foundational model for parsing human anatomy in medical images that works for different modalities. It supports supervised or unsupervised training and can perform matching, registration, classification, or segmentation with or without user interaction. We achieve this by training a neural network estimator that maps query locations to atlas coordinates via regression. Efficiency is improved by sparsely sampling the input, enabling response times of less than 1 ms without additional accelerator hardware. We demonstrate the utility of the algorithm in both CT and MRI modalities.
zh

[CV-9] LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention

【速读】:该论文旨在解决AI生成人脸检测中的关键挑战,即难以捕捉不同生成技术之间面部区域的一致性结构关系。现有方法通常关注特定伪影而非根本性不一致,因此在面对新型生成模型时表现不佳。其解决方案的关键在于提出一种名为Layer-aware Mask Modulation Vision Transformer (LAMM-ViT) 的视觉变换器架构,该架构通过集成Region-Guided Multi-Head Attention (RG-MHA) 和 Layer-aware Mask Modulation (LAMM) 模块,实现对面部区域的结构不一致性进行更精确的建模与分析。其中,LAMM模块根据网络上下文动态生成层特定参数,从而调节RG-MHA的行为,使模型能够自适应调整不同网络深度下的区域关注点,进而有效捕获多种生成技术中普遍存在的细微伪造线索。

链接: https://arxiv.org/abs/2505.07734
作者: Jiangling Zhang,Weijie Zhu,Jirui Huang,Yaxiong Chen
机构: Wuhan University of Technology (武汉理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detecting AI-synthetic faces presents a critical challenge: it is hard to capture consistent structural relationships between facial regions across diverse generation techniques. Current methods, which focus on specific artifacts rather than fundamental inconsistencies, often fail when confronted with novel generative models. To address this limitation, we introduce Layer-aware Mask Modulation Vision Transformer (LAMM-ViT), a Vision Transformer designed for robust facial forgery detection. This model integrates distinct Region-Guided Multi-Head Attention (RG-MHA) and Layer-aware Mask Modulation (LAMM) components within each layer. RG-MHA utilizes facial landmarks to create regional attention masks, guiding the model to scrutinize architectural inconsistencies across different facial areas. Crucially, the separate LAMM module dynamically generates layer-specific parameters, including mask weights and gating values, based on network context. These parameters then modulate the behavior of RG-MHA, enabling adaptive adjustment of regional focus across network depths. This architecture facilitates the capture of subtle, hierarchical forgery cues ubiquitous among diverse generation techniques, such as GANs and Diffusion Models. In cross-model generalization tests, LAMM-ViT demonstrates superior performance, achieving 94.09% mean ACC (a +5.45% improvement over SoTA) and 98.62% mean AP (a +3.09% improvement). These results demonstrate LAMM-ViT’s exceptional ability to generalize and its potential for reliable deployment against evolving synthetic media threats.
zh

[CV-10] Gameplay Highlights Generation

【速读】:该论文试图解决游戏玩家在社交平台上分享游戏体验时,需要手动制作吸引人的精彩片段(highlight reels)耗时且低效的问题。解决方案的关键在于通过自动识别游戏视频中有趣的事件区间并进行拼接,从而实现自动化生成精彩片段。其核心技术是利用自研的游戏事件检测数据集对多模态通用视频理解模型(如X-CLIP)进行微调,使其能够在无需针对每款游戏进行工程开发的情况下,跨多个同类型游戏泛化识别有趣事件,并通过提示工程提升分类性能。此外,模型在低资源游戏中表现出良好的迁移学习能力,同时通过ONNX库优化模型以实现跨平台高效推理。

链接: https://arxiv.org/abs/2505.07721
作者: Vignesh Edithal,Le Zhang,Ilia Blank,Imran Junejo
机构: AMD(超微半导体)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we enable gamers to share their gaming experience on social media by automatically generating eye-catching highlight reels from their gameplay session Our automation will save time for gamers while increasing audience engagement. We approach the highlight generation problem by first identifying intervals in the video where interesting events occur and then concatenate them. We developed an in-house gameplay event detection dataset containing interesting events annotated by humans using VIA video annotator. Traditional techniques for highlight detection such as game engine integration requires expensive collaboration with game developers. OCR techniques which detect patches of specific images or texts require expensive per game engineering and may not generalize across game UI and different language. We finetuned a multimodal general purpose video understanding model such as X-CLIP using our dataset which generalizes across multiple games in a genre without per game engineering. Prompt engineering was performed to improve the classification performance of this multimodal model. Our evaluation showed that such a finetuned model can detect interesting events in first person shooting games from unseen gameplay footage with more than 90% accuracy. Moreover, our model performed significantly better on low resource games (small dataset) when trained along with high resource games, showing signs of transfer learning. To make the model production ready, we used ONNX libraries to enable cross platform inference. These libraries also provide post training quantization tools to reduce model size and inference time for deployment. ONNX runtime libraries with DirectML backend were used to perform efficient inference on Windows OS. We show that natural language supervision in the X-CLIP model leads to data efficient and highly performant video recognition models.
zh

[CV-11] Hybrid Spiking Vision Transformer for Object Detection with Event Cameras

【速读】:该论文旨在解决事件驱动的物体检测(event-based object detection)中因事件数据的异步性和稀疏性而导致的复杂任务处理能力不足的问题。其解决方案的关键在于提出了一种混合脉冲视觉Transformer(HsVT)模型,该模型通过集成空间特征提取模块与时间特征提取模块,有效捕捉事件序列中的时空特征,从而提升对复杂事件驱动物体检测任务的处理能力。

链接: https://arxiv.org/abs/2505.07715
作者: Qi Xu,Jie Deng,Jiangrong Shen,Biwu Chen,Huajin Tang,Gang Pan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Event-based object detection has gained increasing attention due to its advantages such as high temporal resolution, wide dynamic range, and asynchronous address-event representation. Leveraging these advantages, Spiking Neural Networks (SNNs) have emerged as a promising approach, offering low energy consumption and rich spatiotemporal dynamics. To further enhance the performance of event-based object detection, this study proposes a novel hybrid spike vision Transformer (HsVT) model. The HsVT model integrates a spatial feature extraction module to capture local and global features, and a temporal feature extraction module to model time dependencies and long-term patterns in event sequences. This combination enables HsVT to capture spatiotemporal features, improving its capability to handle complex event-based object detection tasks. To support research in this area, we developed and publicly released The Fall Detection Dataset as a benchmark for event-based object detection tasks. This dataset, captured using an event-based camera, ensures facial privacy protection and reduces memory usage due to the event representation format. We evaluated the HsVT model on GEN1 and Fall Detection datasets across various model sizes. Experimental results demonstrate that HsVT achieves significant performance improvements in event detection with fewer parameters.
zh

[CV-12] Feedback-Driven Pseudo-Label Reliability Assessment: Redefining Thresholding for Semi-Supervised Semantic Segmentation

【速读】:该论文旨在解决半监督学习中伪标签选择的挑战,特别是在缺乏足够标注数据的现实场景下,传统基于预定义置信度阈值或熵值的伪标签过滤方法难以获得最优性能。其解决方案的关键在于提出了一种动态反馈驱动的阈值策略——Ensemble-of-Confidence Reinforcement (ENCORE),该方法通过估计未标注数据中各类别的真实正向置信度,并根据模型对不同伪标签过滤程度的响应持续调整阈值,从而在保留信息丰富伪标签的同时过滤不可靠标签,实现无需人工调参的高效模型训练。

链接: https://arxiv.org/abs/2505.07691
作者: Negin Ghamsarian,Sahar Nasirihaghighi,Klaus Schoeffmann,Raphael Sznitman
机构: University of Bern(伯尔尼大学); University of Klagenfurt(克兰纳特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 Figures

点击查看摘要

Abstract:Semi-supervised learning leverages unlabeled data to enhance model performance, addressing the limitations of fully supervised approaches. Among its strategies, pseudo-supervision has proven highly effective, typically relying on one or multiple teacher networks to refine pseudo-labels before training a student network. A common practice in pseudo-supervision is filtering pseudo-labels based on pre-defined confidence thresholds or entropy. However, selecting optimal thresholds requires large labeled datasets, which are often scarce in real-world semi-supervised scenarios. To overcome this challenge, we propose Ensemble-of-Confidence Reinforcement (ENCORE), a dynamic feedback-driven thresholding strategy for pseudo-label selection. Instead of relying on static confidence thresholds, ENCORE estimates class-wise true-positive confidence within the unlabeled dataset and continuously adjusts thresholds based on the model’s response to different levels of pseudo-label filtering. This feedback-driven mechanism ensures the retention of informative pseudo-labels while filtering unreliable ones, enhancing model training without manual threshold tuning. Our method seamlessly integrates into existing pseudo-supervision frameworks and significantly improves segmentation performance, particularly in data-scarce conditions. Extensive experiments demonstrate that integrating ENCORE with existing pseudo-supervision frameworks enhances performance across multiple datasets and network architectures, validating its effectiveness in semi-supervised learning.
zh

[CV-13] Beyond CLIP Generalization: Against ForwardBackward Forgetting Adapter for Continual Learning of Vision-Language Models

【速读】:该论文旨在解决多领域任务增量学习(MTIL)问题,即要求视觉-语言模型(VLMs)在持续获取新知识的同时保持其固有的零样本识别能力。现有方法将未见领域样本的测试任务委托给原始CLIP,仅能防止模型零样本能力的退化,但未能进一步提升VLM的泛化能力。为此,本文提出了一种名为AFA的新颖MTIL框架,其关键在于两个核心模块:(1)对抗正向遗忘适配器,用于在增量任务中学习每个数据集的任务不变信息,以增强VLM的零样本识别能力;(2)对抗反向遗忘适配器,在支持增量学习的同时增强VLM的少样本学习能力。

链接: https://arxiv.org/abs/2505.07690
作者: Songlin Dong,Chenhao Ding,Jiangyang Li,Jizhou Han,Qiang Wang,Yuhang He,Yihong Gong
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This study aims to address the problem of multi-domain task incremental learning~(MTIL), which requires that vision-language models~(VLMs) continuously acquire new knowledge while maintaining their inherent zero-shot recognition capability. Existing paradigms delegate the testing of unseen-domain samples to the original CLIP, which only prevents the degradation of the model’s zero-shot capability but fails to enhance the generalization of the VLM further. To this end, we propose a novel MTIL framework, named AFA, which comprises two core modules: (1) an against forward-forgetting adapter that learns task-invariant information for each dataset in the incremental tasks to enhance the zero-shot recognition ability of VLMs; (2) an against backward-forgetting adapter that strengthens the few-shot learning capability of VLMs while supporting incremental learning. Extensive experiments demonstrate that the AFA method significantly outperforms existing state-of-the-art approaches, especially in few-shot MTIL tasks, and surpasses the inherent zero-shot performance of CLIP in terms of transferability. The code is provided in the Supplementary Material.
zh

[CV-14] Anatomical Attention Alignment representation for Radiology Report Generation

【速读】:该论文旨在解决自动化放射学报告生成(Automated Radiology Report Generation, RRG)中由于现有编码器-解码器模型仅依赖于从原始输入图像中提取的视觉特征,导致对空间结构和语义关系理解不足,从而影响文本生成质量的问题。解决方案的关键在于提出一种名为Anatomical Attention Alignment Network (A3Net)的框架,通过构建超视觉表示来增强视觉-文本理解,该框架将解剖结构的知识字典与局部图像特征相结合,使模型能够有效关联图像区域与其对应的解剖实体,从而提升语义推理、可解释性及跨模态对齐能力。

链接: https://arxiv.org/abs/2505.07689
作者: Quang Vinh Nguyen,Minh Duc Nguyen,Thanh Hoang Son Vo,Hyung-Jeong Yang,Soo-Hyung Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated Radiology report generation (RRG) aims at producing detailed descriptions of medical images, reducing radiologists’ workload and improving access to high-quality diagnostic services. Existing encoder-decoder models only rely on visual features extracted from raw input images, which can limit the understanding of spatial structures and semantic relationships, often resulting in suboptimal text generation. To address this, we propose Anatomical Attention Alignment Network (A3Net), a framework that enhance visual-textual understanding by constructing hyper-visual representations. Our approach integrates a knowledge dictionary of anatomical structures with patch-level visual features, enabling the model to effectively associate image regions with their corresponding anatomical entities. This structured representation improves semantic reasoning, interpretability, and cross-modal alignment, ultimately enhancing the accuracy and clinical relevance of generated reports. Experimental results on IU X-Ray and MIMIC-CXR datasets demonstrate that A3Net significantly improves both visual perception and text generation quality. Our code is available at \hrefthis https URLGitHub.
zh

[CV-15] Simple Semi-supervised Knowledge Distillation from Vision-Language Models via mathbftextttDual-mathbftextttHead mathbftextttOptimization

【速读】:该论文旨在解决在资源受限环境中部署大型视觉-语言模型(Vision-Language Models, VLMs)的挑战,特别是通过知识蒸馏(Knowledge Distillation, KD)方法将VLM的知识迁移至轻量级、任务特定模型的问题。其解决方案的关键在于提出了一种名为Dual-Head Optimization (DHO)的简单而有效的KD框架,该框架通过引入两个独立的预测头分别学习标注数据和教师模型的预测,并在推理阶段线性组合其输出,从而缓解监督信号与蒸馏信号之间的梯度冲突,提升特征学习效果。

链接: https://arxiv.org/abs/2505.07675
作者: Seongjae Kang,Dong Bok Lee,Hyungjoon Jang,Sung Ju Hwang
机构: VUNO Inc.(VUNO公司); KAIST(韩国科学技术院); DeepAuto.ai(DeepAuto.ai)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 41 pages, 19 figures, preprint

点击查看摘要

Abstract:Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. However, deploying such large models remains challenging, particularly in resource-constrained environments. Knowledge distillation (KD) offers a well-established solution to this problem; however, recent KD approaches from VLMs often involve multi-stage training or additional tuning, increasing computational overhead and optimization complexity. In this paper, we propose \mathbf\textttD ual- \mathbf\textttH ead \mathbf\textttO ptimization ( \mathbf\textttDHO ) – a simple yet effective KD framework that transfers knowledge from VLMs to compact, task-specific models in semi-supervised settings. Specifically, we introduce dual prediction heads that independently learn from labeled data and teacher predictions, and propose to linearly combine their outputs during inference. We observe that \textttDHO mitigates gradient conflicts between supervised and distillation signals, enabling more effective feature learning than single-head KD baselines. As a result, extensive experiments show that \textttDHO consistently outperforms baselines across multiple domains and fine-grained datasets. Notably, on ImageNet, it achieves state-of-the-art performance, improving accuracy by 3% and 0.1% with 1% and 10% labeled data, respectively, while using fewer parameters.
zh

[CV-16] ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models CVPR2025

【速读】:该论文试图解决当前基于扩散的文本到视频方法在生成多镜头视频时的局限性,即无法生成具有离散过渡的多镜头视频,其中同一角色在相同或不同背景中执行不同的活动。解决方案的关键在于提出一个包含数据集收集管道和视频扩散模型架构扩展的框架,通过引入过渡标记和局部注意力掩码策略,实现对视频中每个镜头的特定控制,从而确保角色和背景的一致性,并允许用户控制镜头的数量、持续时间和内容。

链接: https://arxiv.org/abs/2505.07652
作者: Ozgur Kara,Krishna Kumar Singh,Feng Liu,Duygu Ceylan,James M. Rehg,Tobias Hinz
机构: UIUC(伊利诺伊大学厄巴纳-香槟分校); Adobe(Adobe公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot and lack the capability to generate multi-shot videos with discrete transitions where the same character performs distinct activities across the same or different backgrounds. To address this limitation we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach enables generation of multi-shot videos as a single video with full attention across all frames of all shots, ensuring character and background consistency, and allows users to control the number, duration, and content of shots through shot-specific conditioning. This is achieved by incorporating a transition token into the text-to-video model to control at which frames a new shot begins and a local attention masking strategy which controls the transition token’s effect and allows shot-specific prompting. To obtain training data we propose a novel data collection pipeline to construct a multi-shot video dataset from existing single-shot video datasets. Extensive experiments demonstrate that fine-tuning a pre-trained text-to-video model for a few thousand iterations is enough for the model to subsequently be able to generate multi-shot videos with shot-specific control, outperforming the baselines. You can find more details in this https URL
zh

[CV-17] Neural Brain: A Neuroscience-inspired Framework for Embodied Agents

【速读】:该论文试图解决当前人工智能系统(AI)在现实世界中部署时的动态适应性不足问题,特别是针对具身智能(embodied AI)中自主代理在非结构化环境中感知、决策与操作的能力限制。其核心挑战在于如何构建具备人类般适应能力的神经大脑(Neural Brain),以实现多模态感知与认知功能的无缝集成,并通过自适应记忆系统和能效优化的软硬件协同设计,支持实时动态响应。解决方案的关键在于提出一种生物启发的架构,整合多模态主动感知、感知-认知-行动功能、基于神经可塑性的记忆存储与更新以及类脑硬件软件优化,从而弥合静态AI模型与真实环境动态适应性之间的差距。

链接: https://arxiv.org/abs/2505.07634
作者: Jian Liu,Xiongtao Shi,Thai Duy Nguyen,Haitian Zhang,Tianxiang Zhang,Wei Sun,Yanjie Li,Athanasios V. Vasilakos,Giovanni Iacca,Arshad Ali Khan,Arvind Kumar,Jae Won Cho,Ajmal Mian,Lihua Xie,Erik Cambria,Lin Wang
机构: Nanyang Technological University(南洋理工大学); Hunan University(湖南大学); Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)); University of Agder(阿格德大学); University of Trento(特伦托大学); Elm Company(Elm公司); KTH Royal Institute of Technology(皇家理工学院); Sejong University(世宗大学); University of Western Australia(西澳大利亚大学); Nanyang Technological University(南洋理工大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 51 pages, 17 figures, 9 tables

点击查看摘要

Abstract:The rapid evolution of artificial intelligence (AI) has shifted from static, data-driven models to dynamic systems capable of perceiving and interacting with real-world environments. Despite advancements in pattern recognition and symbolic reasoning, current AI systems, such as large language models, remain disembodied, unable to physically engage with the world. This limitation has driven the rise of embodied AI, where autonomous agents, such as humanoid robots, must navigate and manipulate unstructured environments with human-like adaptability. At the core of this challenge lies the concept of Neural Brain, a central intelligence system designed to drive embodied agents with human-like adaptability. A Neural Brain must seamlessly integrate multimodal sensing and perception with cognitive capabilities. Achieving this also requires an adaptive memory system and energy-efficient hardware-software co-design, enabling real-time action in dynamic environments. This paper introduces a unified framework for the Neural Brain of embodied agents, addressing two fundamental challenges: (1) defining the core components of Neural Brain and (2) bridging the gap between static AI models and the dynamic adaptability required for real-world deployment. To this end, we propose a biologically inspired architecture that integrates multimodal active sensing, perception-cognition-action function, neuroplasticity-based memory storage and updating, and neuromorphic hardware/software optimization. Furthermore, we also review the latest research on embodied agents across these four aspects and analyze the gap between current AI systems and human intelligence. By synthesizing insights from neuroscience, we outline a roadmap towards the development of generalizable, autonomous agents capable of human-level intelligence in real-world scenarios.
zh

[CV-18] A Unified Hierarchical Framework for Fine-grained Cross-view Geo-localization over Large-scale Scenarios

【速读】:该论文旨在解决大规模定位问题中,传统方法因独立设计检索和度量定位任务而导致协作效率低及训练开销大的问题。其解决方案的关键在于提出一种统一的分层地理定位框架UnifyGeo,通过共享参数的统一学习策略联合学习多粒度表征,促进两个任务之间的相互增强,并设计了一个由专用损失函数引导的重排序机制,以提升定位性能。

链接: https://arxiv.org/abs/2505.07622
作者: Zhuo Song,Ye Zhang,Kunhong Li,Longguang Wang,Yulan Guo
机构: Sun Yat-sen University(中山大学); Aviation University of Air Force(空军航空大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cross-view geo-localization is a promising solution for large-scale localization problems, requiring the sequential execution of retrieval and metric localization tasks to achieve fine-grained predictions. However, existing methods typically focus on designing standalone models for these two tasks, resulting in inefficient collaboration and increased training overhead. In this paper, we propose UnifyGeo, a novel unified hierarchical geo-localization framework that integrates retrieval and metric localization tasks into a single network. Specifically, we first employ a unified learning strategy with shared parameters to jointly learn multi-granularity representation, facilitating mutual reinforcement between these two tasks. Subsequently, we design a re-ranking mechanism guided by a dedicated loss function, which enhances geo-localization performance by improving both retrieval accuracy and metric localization references. Extensive experiments demonstrate that UnifyGeo significantly outperforms the state-of-the-arts in both task-isolated and task-associated settings. Remarkably, on the challenging VIGOR benchmark, which supports fine-grained localization evaluation, the 1-meter-level localization recall rate improves from 1.53% to 39.64% and from 0.43% to 25.58% under same-area and cross-area evaluations, respectively. Code will be made publicly available.
zh

[CV-19] Higher-Order Convolution Improves Neural Predictivity in the Retina

【速读】:该论文旨在解决传统卷积神经网络(CNN)在建模生物视觉系统中复杂时空交互时的表达能力不足问题,以及其深度与生物视觉系统相对浅层处理层次之间的架构差异。其解决方案的关键在于提出一种高阶卷积神经网络(HoCNN),通过在卷积操作符中嵌入高阶运算,直接建模相邻像素在空间和时间上的乘积交互,从而增强CNN的表达能力而不增加网络深度。

链接: https://arxiv.org/abs/2505.07620
作者: Simone Azeglio,Victor Calbiague Garcia,Guilhem Glaziou,Peter Neri,Olivier Marre,Ulisse Ferrari
机构: Sorbonne Université (索邦大学); CNRS (法国国家科学研究中心); INSERM (法国国家医学研究院); École Normale Supérieure (巴黎高等师范学院); Italian Institute of Technology (意大利技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:We present a novel approach to neural response prediction that incorporates higher-order operations directly within convolutional neural networks (CNNs). Our model extends traditional 3D CNNs by embedding higher-order operations within the convolutional operator itself, enabling direct modeling of multiplicative interactions between neighboring pixels across space and time. Our model increases the representational power of CNNs without increasing their depth, therefore addressing the architectural disparity between deep artificial networks and the relatively shallow processing hierarchy of biological visual systems. We evaluate our approach on two distinct datasets: salamander retinal ganglion cell (RGC) responses to natural scenes, and a new dataset of mouse RGC responses to controlled geometric transformations. Our higher-order CNN (HoCNN) achieves superior performance while requiring only half the training data compared to standard architectures, demonstrating correlation coefficients up to 0.75 with neural responses (against 0.80 \pm 0.02 retinal reliability). When integrated into state-of-the-art architectures, our approach consistently improves performance across different species and stimulus conditions. Analysis of the learned representations reveals that our network naturally encodes fundamental geometric transformations, particularly scaling parameters that characterize object expansion and contraction. This capability is especially relevant for specific cell types, such as transient OFF-alpha and transient ON cells, which are known to detect looming objects and object motion respectively, and where our model shows marked improvement in response prediction. The correlation coefficients for scaling parameters are more than twice as high in HoCNN (0.72) compared to baseline models (0.32).
zh

[CV-20] Deep Learning Advances in Vision-Based Traffic Accident Anticipation: A Comprehensive Review of MethodsDatasetsand Future Directions

【速读】:该论文试图解决交通事故预测与检测的问题,以提升道路安全性。其关键解决方案在于应用监督学习、无监督学习和混合深度学习模型,并结合真实世界与合成数据,通过图像和视频特征提取、时空特征建模、场景理解以及多模态数据融合等方法进行事故预测。这些方法虽展现出显著潜力,但仍面临数据稀缺、复杂场景泛化能力有限及实时性能约束等挑战,因此未来研究需关注多模态数据融合、自监督学习和基于Transformer的架构以提高预测精度。

链接: https://arxiv.org/abs/2505.07611
作者: Yi Zhang,Wenye Zhou,Ruonan Lin,Xin Yang,Hao Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traffic accident prediction and detection are critical for enhancing road safety,and vision-based traffic accident anticipation (Vision-TAA) has emerged as a promising approach in the era of deep this http URL paper reviews 147 recent studies,focusing on the application of supervised,unsupervised,and hybrid deep learning models for accident prediction,alongside the use of real-world and synthetic this http URL methodologies are categorized into four key approaches: image and video feature-based prediction, spatiotemporal feature-based prediction, scene understanding,and multimodal data this http URL these methods demonstrate significant potential,challenges such as data scarcity,limited generalization to complex scenarios,and real-time performance constraints remain prevalent. This review highlights opportunities for future research,including the integration of multimodal data fusion, self-supervised learning,and Transformer-based architectures to enhance prediction accuracy and this http URL synthesizing existing advancements and identifying critical gaps, this paper provides a foundational reference for developing robust and adaptive Vision-TAA systems,contributing to road safety and traffic management.
zh

[CV-21] Beyond Static Perception: Integrating Temporal Context into VLMs for Cloth Folding ICRA2025

【速读】:该论文旨在解决衣物操作中的挑战,包括其复杂的动态特性、高度可变形性以及频繁的自遮挡问题。由于衣物具有几乎无限的配置可能性,因此难以定义显式的状态表示。论文提出的解决方案是BiFold模型,其关键在于通过端到端学习隐式编码衣物状态,并基于视觉观察预测语言条件下的抓取与放置动作。该模型利用时间上下文以提高状态估计的准确性,特别是在处理褶皱衣物或从失败操作中恢复的情况下。

链接: https://arxiv.org/abs/2505.07600
作者: Oriol Barbany,Adrià Colomé,Carme Torras
机构: Institut de Robòtica i Informàtica Industrial, CSIC-UPC (工业机器人与信息学研究所,CSIC-UPC)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICRA 2025 Workshop “Reflections on Representations and Manipulating Deformable Objects”. Project page this https URL

点击查看摘要

Abstract:Manipulating clothes is challenging due to their complex dynamics, high deformability, and frequent self-occlusions. Garments exhibit a nearly infinite number of configurations, making explicit state representations difficult to define. In this paper, we analyze BiFold, a model that predicts language-conditioned pick-and-place actions from visual observations, while implicitly encoding garment state through end-to-end learning. To address scenarios such as crumpled garments or recovery from failed manipulations, BiFold leverages temporal context to improve state estimation. We examine the internal representations of the model and present evidence that its fine-tuning and temporal context enable effective alignment between text and image regions, as well as temporal consistency.
zh

[CV-22] Evaluating Modern Visual Anomaly Detection Approaches in Semiconductor Manufacturing: A Comparative Study

【速读】:该论文旨在解决半导体制造过程中异常检测的问题,特别是在缺乏足够标注异常样本的情况下,如何实现高效的视觉异常检测。传统方法多依赖于监督学习,而本文则采用视觉异常检测(Visual Anomaly Detection, VAD)这一新兴研究方向,通过无监督学习避免了高成本的缺陷样本收集阶段,同时提供预测结果的解释性。解决方案的关键在于利用MIIC数据集构建半导体领域的VAD基准,验证现代VAD方法在该领域的有效性。

链接: https://arxiv.org/abs/2505.07576
作者: Manuel Barusco,Francesco Borsatti,Youssef Ben Khalifa,Davide Dalle Pezze,Gian Antonio Susto
机构: University of Padova(帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Semiconductor manufacturing is a complex, multistage process. Automated visual inspection of Scanning Electron Microscope (SEM) images is indispensable for minimizing equipment downtime and containing costs. Most previous research considers supervised approaches, assuming a sufficient number of anomalously labeled samples. On the contrary, Visual Anomaly Detection (VAD), an emerging research domain, focuses on unsupervised learning, avoiding the costly defect collection phase while providing explanations of the predictions. We introduce a benchmark for VAD in the semiconductor domain by leveraging the MIIC dataset. Our results demonstrate the efficacy of modern VAD approaches in this field.
zh

[CV-23] Robust Kidney Abnormality Segmentation: A Validation Study of an AI-Based Framework

【速读】:该论文试图解决肾脏异常分割在临床实践中依赖主观视觉评估的问题,从而提升评估的客观性和可重复性。其解决方案的关键在于开发一种经过全面验证的肾脏异常分割算法,该算法基于公开的训练数据集,并采用先进的医学图像分割框架nnU-Net,实现了对外部测试集的有效泛化,并在所有测试数据集中均优于现有最先进模型。

链接: https://arxiv.org/abs/2505.07573
作者: Sarah de Boer,Hartmut Häntze,Kiran Vaidhya Venkadesh,Myrthe A. D. Buser,Gabriel E. Humpire Mamani,Lina Xu,Lisa C. Adams,Jawed Nawabi,Keno K. Bressem,Bram van Ginneken,Mathias Prokop,Alessa Hering
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 35 pages, 11 figures

点击查看摘要

Abstract:Kidney abnormality segmentation has important potential to enhance the clinical workflow, especially in settings requiring quantitative assessments. Kidney volume could serve as an important biomarker for renal diseases, with changes in volume correlating directly with kidney function. Currently, clinical practice often relies on subjective visual assessment for evaluating kidney size and abnormalities, including tumors and cysts, which are typically staged based on diameter, volume, and anatomical location. To support a more objective and reproducible approach, this research aims to develop a robust, thoroughly validated kidney abnormality segmentation algorithm, made publicly available for clinical and research use. We employ publicly available training datasets and leverage the state-of-the-art medical image segmentation framework nnU-Net. Validation is conducted using both proprietary and public test datasets, with segmentation performance quantified by Dice coefficient and the 95th percentile Hausdorff distance. Furthermore, we analyze robustness across subgroups based on patient sex, age, CT contrast phases, and tumor histologic subtypes. Our findings demonstrate that our segmentation algorithm, trained exclusively on publicly available data, generalizes effectively to external test sets and outperforms existing state-of-the-art models across all tested datasets. Subgroup analyses reveal consistent high performance, indicating strong robustness and reliability. The developed algorithm and associated code are publicly accessible at this https URL.
zh

[CV-24] Self-Supervised Event Representations: Towards Accurate Real-Time Perception on SoC FPGAs

【速读】:该论文旨在解决事件相机(event camera)稀疏、异步事件流的有效处理问题,传统方法在定性性能或时间保真度上存在局限。其解决方案的关键在于提出一种自监督事件表示(Self-Supervised Event Representation, SSER)方法,利用门控循环单元(Gated Recurrent Unit, GRU)网络实现无需时间离散化的像素级事件时间戳和极性精确编码,通过自监督训练提升事件时间编码的保真度,并支持异步推理以兼容高吞吐量传感器。

链接: https://arxiv.org/abs/2505.07556
作者: Kamil Jeziorek,Tomasz Kryjak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at the Real-time Processing of Image, Depth and Video Information 2025 workshop and to be considered for publication is the SPIE Proceedings

点击查看摘要

Abstract:Event cameras offer significant advantages over traditional frame-based sensors. These include microsecond temporal resolution, robustness under varying lighting conditions and low power consumption. Nevertheless, the effective processing of their sparse, asynchronous event streams remains challenging. Existing approaches to this problem can be categorised into two distinct groups. The first group involves the direct processing of event data with neural models, such as Spiking Neural Networks or Graph Convolutional Neural Networks. However, this approach is often accompanied by a compromise in terms of qualitative performance. The second group involves the conversion of events into dense representations with handcrafted aggregation functions, which can boost accuracy at the cost of temporal fidelity. This paper introduces a novel Self-Supervised Event Representation (SSER) method leveraging Gated Recurrent Unit (GRU) networks to achieve precise per-pixel encoding of event timestamps and polarities without temporal discretisation. The recurrent layers are trained in a self-supervised manner to maximise the fidelity of event-time encoding. The inference is performed with event representations generated asynchronously, thus ensuring compatibility with high-throughput sensors. The experimental validation demonstrates that SSER outperforms aggregation-based baselines, achieving improvements of 2.4% mAP and 0.6% on the Gen1 and 1 Mpx object detection datasets. Furthermore, the paper presents the first hardware implementation of recurrent representation for event data on a System-on-Chip FPGA, achieving sub-microsecond latency and power consumption between 1-2 W, suitable for real-time, power-efficient applications. Code is available at this https URL.
zh

[CV-25] Automated Visual Attention Detection using Mobile Eye Tracking in Behavioral Classroom Studies

【速读】:该论文试图解决教师在课堂中视觉注意力及其在学生之间的分布信息难以准确推断的问题(Teachers’ visual attention and its distribution across the students in classrooms),这一信息对于学生参与度、学业成就以及教师专业培训具有重要意义。解决方案的关键在于提出一种自动化处理流程,该流程利用迁移学习在课堂环境中训练面部识别模型,并结合移动眼动追踪设备获取的教师注视数据,从而最小化对人工标注数据的依赖,实现对学生被教师关注情况的识别。

链接: https://arxiv.org/abs/2505.07552
作者: Efe Bozkir,Christian Kosel,Tina Seidel,Enkelejda Kasneci
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted as a long paper at the Educational Data Mining (EDM) Conference 2025

点击查看摘要

Abstract:Teachers’ visual attention and its distribution across the students in classrooms can constitute important implications for student engagement, achievement, and professional teacher training. Despite that, inferring the information about where and which student teachers focus on is not trivial. Mobile eye tracking can provide vital help to solve this issue; however, the use of mobile eye tracking alone requires a significant amount of manual annotations. To address this limitation, we present an automated processing pipeline concept that requires minimal manually annotated data to recognize which student the teachers focus on. To this end, we utilize state-of-the-art face detection models and face recognition feature embeddings to train face recognition models with transfer learning in the classroom context and combine these models with the teachers’ gaze from mobile eye trackers. We evaluated our approach with data collected from four different classrooms, and our results show that while it is possible to estimate the visually focused students with reasonable performance in all of our classroom setups, U-shaped and small classrooms led to the best results with accuracies of approximately 0.7 and 0.9, respectively. While we did not evaluate our method for teacher-student interactions and focused on the validity of the technical approach, as our methodology does not require a vast amount of manually annotated data and offers a non-intrusive way of handling teachers’ visual attention, it could help improve instructional strategies, enhance classroom management, and provide feedback for professional teacher development.
zh

[CV-26] Noise Optimized Conditional Diffusion for Domain Adaptation IJCAI2025

【速读】:该论文旨在解决无监督域适应(Unsupervised Domain Adaptation, UDA)中由于高置信伪标签目标域样本(high-confidence pseudo-labeled target domain samples, hcpl-tds)稀缺而导致的跨域统计对齐不准确问题,进而引发域适应失败。其解决方案的关键在于提出一种名为Noise Optimized Conditional Diffusion for Domain Adaptation (NOCDDA)的方法,该方法将条件扩散模型的生成能力与域适应的决策需求相结合,实现任务耦合优化,以提升域适应效率。此外,通过引入类感知噪声优化策略,改进传统高斯初始化导致的类别混淆问题,从而增强跨域对齐效果。

链接: https://arxiv.org/abs/2505.07548
作者: Lingkun Luo,Shiqiang Hu,Liming Chen
机构: Shanghai Jiao Tong University (上海交通大学); Ecole Centrale de Lyon (里昂中央理工学院); Institut Universitaire de France (法国国家科学研究院); CNRS UMR 5205 (法国国家科学研究中心5205实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures This work has been accepted by the International Joint Conference on Artificial Intelligence (IJCAI 2025)

点击查看摘要

Abstract:Pseudo-labeling is a cornerstone of Unsupervised Domain Adaptation (UDA), yet the scarcity of High-Confidence Pseudo-Labeled Target Domain Samples (\textbfhcpl-tds) often leads to inaccurate cross-domain statistical alignment, causing DA failures. To address this challenge, we propose \textbfNoise \textbfOptimized \textbfConditional \textbfDiffusion for \textbfDomain \textbfAdaptation (\textbfNOCDDA), which seamlessly integrates the generative capabilities of conditional diffusion models with the decision-making requirements of DA to achieve task-coupled optimization for efficient adaptation. For robust cross-domain consistency, we modify the DA classifier to align with the conditional diffusion classifier within a unified optimization framework, enabling forward training on noise-varying cross-domain samples. Furthermore, we argue that the conventional ( \mathcalN(\mathbf0, \mathbfI) ) initialization in diffusion models often generates class-confused hcpl-tds, compromising discriminative DA. To resolve this, we introduce a class-aware noise optimization strategy that refines sampling regions for reverse class-specific hcpl-tds generation, effectively enhancing cross-domain alignment. Extensive experiments across 5 benchmark datasets and 29 DA tasks demonstrate significant performance gains of \textbfNOCDDA over 31 state-of-the-art methods, validating its robustness and effectiveness.
zh

[CV-27] SynID: Passport Synthetic Dataset for Presentation Attack Detection

【速读】:该论文试图解决在远程验证系统中识别伪造身份证件的Presentation Attack Detection (PAD)问题,特别是在训练数据受限的情况下。解决方案的关键在于提出了一种新的护照数据集,该数据集通过混合合成数据和公开信息,并遵循国际民航组织(ICAO)的要求,生成逼真的训练和测试图像,以弥补隐私限制导致的真实ID文档数量不足的问题。

链接: https://arxiv.org/abs/2505.07540
作者: Juan E. Tapia,Fabian Stockhardt,Lázaro Janier González-Soler,Christoph Busch
机构: Hochschule Darmstadt (h-da)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The demand for Presentation Attack Detection (PAD) to identify fraudulent ID documents in remote verification systems has significantly risen in recent years. This increase is driven by several factors, including the rise of remote work, online purchasing, migration, and advancements in synthetic images. Additionally, we have noticed a surge in the number of attacks aimed at the enrolment process. Training a PAD to detect fake ID documents is very challenging because of the limited number of ID documents available due to privacy concerns. This work proposes a new passport dataset generated from a hybrid method that combines synthetic data and open-access information using the ICAO requirement to obtain realistic training and testing images.
zh

[CV-28] GIFStream: 4D Gaussian-based Immersive Video with Feature Stream

【速读】:该论文旨在解决沉浸式视频中高质量渲染与可管理存储之间的矛盾,特别是在保持高视觉质量的同时实现高效压缩。其解决方案的关键在于提出GIFStream,一种基于规范空间和时间依赖特征流的4D高斯表示方法,通过引入变形场和时间对应性来建模复杂运动,并结合时空压缩网络实现端到端的高效压缩与解码。

链接: https://arxiv.org/abs/2505.07539
作者: Hao Li,Sicheng Li,Xiang Gao,Abudouaihati Batuer,Lu Yu,Yiyi Liao
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 10 figures

点击查看摘要

Abstract:Immersive video offers a 6-Dof-free viewing experience, potentially playing a key role in future video technology. Recently, 4D Gaussian Splatting has gained attention as an effective approach for immersive video due to its high rendering efficiency and quality, though maintaining quality with manageable storage remains challenging. To address this, we introduce GIFStream, a novel 4D Gaussian representation using a canonical space and a deformation field enhanced with time-dependent feature streams. These feature streams enable complex motion modeling and allow efficient compression by leveraging temporal correspondence and motion-aware pruning. Additionally, we incorporate both temporal and spatial compression networks for end-to-end compression. Experimental results show that GIFStream delivers high-quality immersive video at 30 Mbps, with real-time rendering and fast decoding on an RTX 4090. Project page: this https URL
zh

[CV-29] Discrete Visual Tokens of Autoregression by Diffusion and for Reasoning

【速读】:该论文试图解决视觉令牌(visual tokens)无法有效支持强化学习(reinforcement learning, RL)的长期挑战。传统视觉表示依赖于空间先验(spatial prior),而该工作提出了一种新的离散视觉分词器——自洽分词器(Self-consistency Tokenizer, Selftok),其核心在于将自回归(autoregressive, AR)先验引入视觉令牌中,通过图像生成的反向扩散过程实现。Selftok的关键创新在于其AR属性满足贝尔曼方程(Bellman equation),从而使得视觉生成任务中的RL应用效果可与语言模型(language models, LLMs)相媲美。此外,Selftok在高质量重建与压缩率之间实现了优越的平衡,并成功构建了一个纯AR的视觉语言模型(vision-language model, VLM),在无需文本-图像训练对的情况下,通过简单的策略梯度RL显著提升了视觉生成性能。

链接: https://arxiv.org/abs/2505.07538
作者: Bohan Wang,Zhongqi Yue,Fengda Zhang,Shuo Chen,Li’an Bi,Junzhe Zhang,Xue Song,Kennard Yanting Chan,Jiachun Pan,Weijia Wu,Mingze Zhou,Wang Lin,Kaihang Pan,Saining Zhang,Liyu Jia,Wentao Hu,Wei Zhao,Hanwang Zhang
机构: Selftok Team; Media Technology Institute; Huawei Singapore
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior – mirroring the causal structure of language – into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally distinct from traditional spatial tokens in the following two key ways: - Selftok offers an elegant and minimalist approach to unify diffusion and AR for vision-language models (VLMs): By representing images with Selftok tokens, we can train a VLM using a purely discrete autoregressive architecture – like that in LLMs – without requiring additional modules or training objectives. - We theoretically show that the AR prior satisfies the Bellman equation, whereas the spatial prior does not. Therefore, Selftok supports reinforcement learning (RL) for visual generation with effectiveness comparable to that achieved in LLMs. Besides the AR property, Selftok is also a SoTA tokenizer that achieves a favorable trade-off between high-quality reconstruction and compression rate. We use Selftok to build a pure AR VLM for both visual comprehension and generation tasks. Impressively, without using any text-image training pairs, a simple policy gradient RL working in the visual tokens can significantly boost the visual generation benchmark, surpassing all the existing models by a large margin. Therefore, we believe that Selftok effectively addresses the long-standing challenge that visual tokens cannot support effective RL. When combined with the well-established strengths of RL in LLMs, this brings us one step closer to realizing a truly multimodal LLM. Project Page: this https URL.
zh

[CV-30] IKrNet: A Neural Network for Detecting Specific Drug-Induced Patterns in Electrocardiograms Amidst Physiological Variability

【速读】:该论文旨在解决现有基于人工智能的心电图(ECG)信号分析方法在复杂生理条件下(如运动、药物和压力等)无法准确识别药物特异性模式的问题,从而限制了其在实际临床中的应用。其解决方案的关键在于提出IKrNet模型,该模型通过结合卷积神经网络的多尺度感受野以捕捉空间特征,并利用双向长短期记忆模块建模时间依赖性,同时将心率变异性作为生理波动的替代指标,从而在多种生理条件下实现更准确和稳定的ECG分析。

链接: https://arxiv.org/abs/2505.07533
作者: Ahmad Fall,Federica Granese,Alex Lence,Dominique Fourer,Blaise Hanczar,Joe-Elie Salem,Jean-Daniel Zucker,Edi Prifti
机构: IRD, Sorbonne Université, Unité de Modélisation Mathématique et Informatique des Systèmes Complexes, UMMISCO (IRD, 索邦大学,复杂系统数学建模与信息单位,UMMISCO); University of Evry - Paris-Saclay, IBISC Laboratory (埃维昂大学-巴黎萨克雷大学,IBISC 实验室); Clinical Investigation Center Paris-Est, CIC-1901, INSERM, Department of Pharmacology, Pitié-Salpêtrière University Hospital, Sorbonne Université (巴黎-埃斯特临床研究中心,CIC-1901,INSERM,药理学系,皮蒂耶-萨尔佩特里埃大学医院,索邦大学); Department of Medicine, Vanderbilt University Medical Center (医学系,范德比尔特大学医学中心); Sorbonne Université, INSERM, Nutrition et Obesities; systemic approaches, NutriOmique, AP-HP Hôpital Pitié-Salpêtrière (索邦大学,INSERM,营养与肥胖;系统方法,营养组学,AP-HP 皮蒂耶-萨尔佩特里埃医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Monitoring and analyzing electrocardiogram (ECG) signals, even under varying physiological conditions, including those influenced by physical activity, drugs and stress, is crucial to accurately assess cardiac health. However, current AI-based methods often fail to account for how these factors interact and alter ECG patterns, ultimately limiting their applicability in real-world settings. This study introduces IKrNet, a novel neural network model, which identifies drug-specific patterns in ECGs amidst certain physiological conditions. IKrNet’s architecture incorporates spatial and temporal dynamics by using a convolutional backbone with varying receptive field size to capture spatial features. A bi-directional Long Short-Term Memory module is also employed to model temporal dependencies. By treating heart rate variability as a surrogate for physiological fluctuations, we evaluated IKrNet’s performance across diverse scenarios, including conditions with physical stress, drug intake alone, and a baseline without drug presence. Our assessment follows a clinical protocol in which 990 healthy volunteers were administered 80mg of Sotalol, a drug which is known to be a precursor to Torsades-de-Pointes, a life-threatening arrhythmia. We show that IKrNet outperforms state-of-the-art models’ accuracy and stability in varying physiological conditions, underscoring its clinical viability.
zh

[CV-31] FLUXSynID: A Framework for Identity-Controlled Synthetic Face Generation with Document and Live Images

【速读】:该论文旨在解决现有合成人脸数据集在身份属性细粒度控制不足以及在结构化采集条件下无法生成配对且身份一致图像的问题。其解决方案的关键在于提出FLUXSynID框架,该框架能够生成高分辨率的合成人脸数据集,支持用户自定义身份属性分布,并生成符合文档风格和可信活体采集条件的配对图像。

链接: https://arxiv.org/abs/2505.07530
作者: Raul Ismayilov,Luuk Spreeuwers,Dzemila Sero
机构: University of Twente (特温特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synthetic face datasets are increasingly used to overcome the limitations of real-world biometric data, including privacy concerns, demographic imbalance, and high collection costs. However, many existing methods lack fine-grained control over identity attributes and fail to produce paired, identity-consistent images under structured capture conditions. We introduce FLUXSynID, a framework for generating high-resolution synthetic face datasets with user-defined identity attribute distributions and paired document-style and trusted live capture images. The dataset generated using the FLUXSynID framework shows improved alignment with real-world identity distributions and greater inter-set diversity compared to prior work. The FLUXSynID framework for generating custom datasets, along with a dataset of 14,889 synthetic identities, is publicly released to support biometric research, including face recognition and morphing attack detection.
zh

[CV-32] MAIS: Memory-Attention for Interactive Segmentation

【速读】:该论文试图解决交互式医学分割中用户反馈被独立处理导致的冗余修正和优化效果有限的问题。解决方案的关键在于引入MAIS(Memory-Attention mechanism for Interactive Segmentation),该机制通过存储过去的用户输入和分割状态,实现时间上下文的整合,从而提升ViT-based分割模型在多种成像模态下的效率和准确性。

链接: https://arxiv.org/abs/2505.07511
作者: Mauricio Orbes-Arteaga,Oeslle Lucena,Sabastien Ourselin,M. Jorge Cardoso
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Interactive medical segmentation reduces annotation effort by refining predictions through user feedback. Vision Transformer (ViT)-based models, such as the Segment Anything Model (SAM), achieve state-of-the-art performance using user clicks and prior masks as prompts. However, existing methods treat interactions as independent events, leading to redundant corrections and limited refinement gains. We address this by introducing MAIS, a Memory-Attention mechanism for Interactive Segmentation that stores past user inputs and segmentation states, enabling temporal context integration. Our approach enhances ViT-based segmentation across diverse imaging modalities, achieving more efficient and accurate refinements.
zh

[CV-33] Learning to Reason and Navigate: Parameter Efficient Action Planning with Large Language Models

【速读】:该论文旨在解决远程具身指代表达(REVERIE)任务中,智能体在未预先探索环境的情况下,如何高效导航并定位由高层指令指定的远程物体的问题。解决方案的关键在于提出一种基于大语言模型的参数高效动作规划器(PEAP-LLM),其核心包含两个模块:LLM目标规划器(LGP)和LoRA动作规划器(LAP)。LGP从指令中提取目标导向的计划,而LAP则结合目标导向计划、高层指令和当前视觉观测生成单步指令,从而实现智能体的实时路径规划。此外,为提升生成指令的质量并避免幻觉和偏差信息,论文还引入了两阶段微调方法,包括监督微调(SFT)和直接偏好优化(DPO)。

链接: https://arxiv.org/abs/2505.07500
作者: Bahram Mohammadi,Ehsan Abbasnejad,Yuankai Qi,Qi Wu,Anton Van Den Hengel,Javen Qinfeng Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The remote embodied referring expression (REVERIE) task requires an agent to navigate through complex indoor environments and localize a remote object specified by high-level instructions, such as “bring me a spoon”, without pre-exploration. Hence, an efficient navigation plan is essential for the final success. This paper proposes a novel parameter-efficient action planner using large language models (PEAP-LLM) to generate a single-step instruction at each location. The proposed model consists of two modules, LLM goal planner (LGP) and LoRA action planner (LAP). Initially, LGP extracts the goal-oriented plan from REVERIE instructions, including the target object and room. Then, LAP generates a single-step instruction with the goal-oriented plan, high-level instruction, and current visual observation as input. PEAP-LLM enables the embodied agent to interact with LAP as the path planner on the fly. A simple direct application of LLMs hardly achieves good performance. Also, existing hard-prompt-based methods are error-prone in complicated scenarios and need human intervention. To address these issues and prevent the LLM from generating hallucinations and biased information, we propose a novel two-stage method for fine-tuning the LLM, consisting of supervised fine-tuning (STF) and direct preference optimization (DPO). SFT improves the quality of generated instructions, while DPO utilizes environmental feedback. Experimental results show the superiority of our proposed model on REVERIE compared to the previous state-of-the-art.
zh

[CV-34] DocVXQA: Context-Aware Visual Explanations for Document Question Answering

【速读】:该论文试图解决文档视觉问答(Document Visual Question Answering, DocVQA)中模型决策缺乏可解释性的问题。传统方法通常仅关注与答案相关的区域,而未能提供上下文充分且表示高效的解释,从而影响了用户对模型的信任。解决方案的关键在于提出DocVXQA框架,该框架不仅生成准确的答案,还通过学习视觉热图来突出上下文关键区域,从而提供可解释的决策依据。此外,该框架将可解释性原则量化为显式的学习目标,实现了预测性能与可解释性之间的平衡。

链接: https://arxiv.org/abs/2505.07496
作者: Mohamed Ali Souibgui,Changkyu Choi,Andrey Barsky,Kangsoo Jung,Ernest Valveny,Dimosthenis Karatzas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions, thereby offering interpretable justifications for the model’s decisions. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning objectives. Unlike conventional methods that emphasize only the regions pertinent to the answer, our framework delivers explanations that are \textitcontextually sufficient while remaining \textitrepresentation-efficient. This fosters user trust while achieving a balance between predictive performance and interpretability in DocVQA applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method. The code is available at this https URL.
zh

[CV-35] Addressing degeneracies in latent interpolation for diffusion models

【速读】:该论文试图解决在使用图像生成扩散模型进行深度数据增强和图像变形时,当输入图像数量较多时,通过潜在空间插值得到的图像容易出现退化结果的问题。其解决方案的关键在于提出一种简单的归一化方案,以在需要潜在空间插值的情况下有效减少退化现象,并提升图像质量指标(如FID和CLIP嵌入距离)。

链接: https://arxiv.org/abs/2505.07481
作者: Erik Landolsi,Fredrik Kahl
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 12 figures

点击查看摘要

Abstract:There is an increasing interest in using image-generating diffusion models for deep data augmentation and image morphing. In this context, it is useful to interpolate between latents produced by inverting a set of input images, in order to generate new images representing some mixture of the inputs. We observe that such interpolation can easily lead to degenerate results when the number of inputs is large. We analyze the cause of this effect theoretically and experimentally, and suggest a suitable remedy. The suggested approach is a relatively simple normalization scheme that is easy to use whenever interpolation between latents is needed. We measure image quality using FID and CLIP embedding distance and show experimentally that baseline interpolation methods lead to a drop in quality metrics long before the degeneration issue is clearly visible. In contrast, our method significantly reduces the degeneration effect and leads to improved quality metrics also in non-degenerate situations.
zh

[CV-36] You Only Look One Step: Accelerating Backpropagation in Diffusion Sampling with Gradient Shortcuts

【速读】:该论文试图解决扩散模型(Diffusion Models, DMs)在下游任务中需要通过反向传播进行生成内容引导所带来的计算成本过高的问题。传统方法在生成过程中需要多次递归调用网络,导致高内存占用和显著的时间消耗。解决方案的关键在于从并行去噪的角度出发,证明在整个生成过程中不需要完整的反向传播,仅保留生成过程中的一步计算图即可优化下游指标,从而实现梯度传播的快捷路径。该方法称为Shortcut Diffusion Optimization (SDO),具有通用性、高性能和计算轻量的特点,能够在保持优异性能的同时将计算成本降低约90%。

链接: https://arxiv.org/abs/2505.07477
作者: Hongkun Dou,Zeyu Li,Xingyu Jiang,Hongjue Li,Lijun Yang,Wen Yao,Yue Deng
机构: Beihang University (北京航空航天大学); Defense Innovation Institute, Chinese Academy of Military Science (中国军事科学研究院国防创新研究所); Institute of Artificial Intelligence, Beihang University (北京航空航天大学人工智能学院); Beijing Zhongguancun Academy (北京中关村科学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models (DMs) have recently demonstrated remarkable success in modeling large-scale data distributions. However, many downstream tasks require guiding the generated content based on specific differentiable metrics, typically necessitating backpropagation during the generation process. This approach is computationally expensive, as generating with DMs often demands tens to hundreds of recursive network calls, resulting in high memory usage and significant time consumption. In this paper, we propose a more efficient alternative that approaches the problem from the perspective of parallel denoising. We show that full backpropagation throughout the entire generation process is unnecessary. The downstream metrics can be optimized by retaining the computational graph of only one step during generation, thus providing a shortcut for gradient propagation. The resulting method, which we call Shortcut Diffusion Optimization (SDO), is generic, high-performance, and computationally lightweight, capable of optimizing all parameter types in diffusion sampling. We demonstrate the effectiveness of SDO on several real-world tasks, including controlling generation by optimizing latent and aligning the DMs by fine-tuning network parameters. Compared to full backpropagation, our approach reduces computational costs by \sim 90% while maintaining superior performance. Code is available at this https URL.
zh

[CV-37] Unified Continuous Generative Models

【速读】:该论文试图解决连续生成模型(continuous generative models)中多步方法(如扩散模型和流匹配模型)与少步方法(如一致性模型)被当作独立范式处理所导致的训练和采样方法分离的问题。解决方案的关键在于提出一个统一框架(Unified Continuous Generative Models Trainer and Sampler, UCGM-T,S),该框架能够统一训练、采样和分析这些模型,从而实现性能优化。通过该框架,论文在ImageNet 256x256数据集上取得了当前最优的生成效果,例如在20步内达到1.30 FID的多步模型以及在2步内达到1.42 FID的少步模型。

链接: https://arxiv.org/abs/2505.07447
作者: Peng Sun,Yi Jiang,Tao Lin
机构: Westlake University (西湖大学); Zhejiang University (浙江大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Recent advances in continuous generative models, including multi-step approaches like diffusion and flow-matching (typically requiring 8-1000 sampling steps) and few-step methods such as consistency models (typically 1-8 steps), have demonstrated impressive generative performance. However, existing work often treats these approaches as distinct paradigms, resulting in separate training and sampling methodologies. We introduce a unified framework for training, sampling, and analyzing these models. Our implementation, the Unified Continuous Generative Models Trainer and Sampler (UCGM-T,S), achieves state-of-the-art (SOTA) performance. For example, on ImageNet 256x256 using a 675M diffusion transformer, UCGM-T trains a multi-step model achieving 1.30 FID in 20 steps and a few-step model reaching 1.42 FID in just 2 steps. Additionally, applying UCGM-S to a pre-trained model (previously 1.26 FID at 250 steps) improves performance to 1.06 FID in only 40 steps. Code is available at: this https URL.
zh

[CV-38] Lightweight Multispectral Crop-Weed Segmentation for Precision Agriculture

【速读】:该论文试图解决精准农业中作物与杂草分割效率不足的问题,传统基于卷积神经网络(Convolutional Neural Network, CNN)的方法在复杂田间条件下泛化能力差,并且依赖于RGB影像,限制了性能。解决方案的关键在于提出一种轻量级的Transformer-CNN混合模型,通过专用编码器处理RGB、近红外(Near-Infrared, NIR)和红边(Red-Edge, RE)波段,并采用动态模态融合机制,从而提升了分割精度与计算效率。

链接: https://arxiv.org/abs/2505.07444
作者: Zeynep Galymzhankyzy,Eric Martinson
机构: Lawrence Technological University (劳伦斯科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 5 figures, 1 table

点击查看摘要

Abstract:Efficient crop-weed segmentation is critical for site-specific weed control in precision agriculture. Conventional CNN-based methods struggle to generalize and rely on RGB imagery, limiting performance under complex field conditions. To address these challenges, we propose a lightweight transformer-CNN hybrid. It processes RGB, Near-Infrared (NIR), and Red-Edge (RE) bands using specialized encoders and dynamic modality integration. Evaluated on the WeedsGalore dataset, the model achieves a segmentation accuracy (mean IoU) of 78.88%, outperforming RGB-only models by 15.8 percentage points. With only 8.7 million parameters, the model offers high accuracy, computational efficiency, and potential for real-time deployment on Unmanned Aerial Vehicles (UAVs) and edge devices, advancing precision weed management.
zh

[CV-39] ICE-Pruning: An Iterative Cost-Efficient Pruning Pipeline for Deep Neural Networks IJCNN

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)压缩过程中计算成本过高的问题,尤其是在传统剪枝流程中需要多次微调(fine-tuning)导致的时间消耗。其解决方案的关键在于提出一种名为ICE-Pruning的迭代剪枝框架,该框架通过三个核心组件显著降低微调的总体成本:i) 自动确定在哪些剪枝步骤后进行微调的机制;ii) 在每个剪枝步骤中采用冻结策略以加快微调速度;iii) 一种定制的剪枝感知学习率调度器,以提升每个剪枝步骤的精度并减少整体时间消耗。此外,还引入了一个高效的自动调参阶段来优化由这三个组件引入的超参数。

链接: https://arxiv.org/abs/2505.07411
作者: Wenhao Hu,Paul Henderson,José Cano
机构: School of Computing Science, University of Glasgow, Scotland, UK
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Accepted to International Joint Conference on Neural Networks (IJCNN) 2025

点击查看摘要

Abstract:Pruning is a widely used method for compressing Deep Neural Networks (DNNs), where less relevant parameters are removed from a DNN model to reduce its size. However, removing parameters reduces model accuracy, so pruning is typically combined with fine-tuning, and sometimes other operations such as rewinding weights, to recover accuracy. A common approach is to repeatedly prune and then fine-tune, with increasing amounts of model parameters being removed in each step. While straightforward to implement, pruning pipelines that follow this approach are computationally expensive due to the need for repeated fine-tuning. In this paper we propose ICE-Pruning, an iterative pruning pipeline for DNNs that significantly decreases the time required for pruning by reducing the overall cost of fine-tuning, while maintaining a similar accuracy to existing pruning pipelines. ICE-Pruning is based on three main components: i) an automatic mechanism to determine after which pruning steps fine-tuning should be performed; ii) a freezing strategy for faster fine-tuning in each pruning step; and iii) a custom pruning-aware learning rate scheduler to further improve the accuracy of each pruning step and reduce the overall time consumption. We also propose an efficient auto-tuning stage for the hyperparameters (e.g., freezing percentage) introduced by the three components. We evaluate ICE-Pruning on several DNN models and datasets, showing that it can accelerate pruning by up to 9.61x. Code is available at this https URL Comments: Accepted to International Joint Conference on Neural Networks (IJCNN) 2025 Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML) Cite as: arXiv:2505.07411 [cs.LG] (or arXiv:2505.07411v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.07411 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-40] DepthFusion: Depth-Aware Hybrid Feature Fusion for LiDAR-Camera 3D Object Detection

【速读】:该论文旨在解决当前LiDAR-相机3D目标检测方法在特征融合策略中忽视深度信息的问题(depth information),从而导致多模态数据融合效果受限。其解决方案的关键在于提出一种深度感知的混合特征融合策略(Depth-Aware Hybrid Feature Fusion, DepthFusion),通过在全局和局部层次引入深度编码,动态调整点云与RGB图像模态的权重,以提升融合效果并增强对各种噪声和退化情况的鲁棒性。

链接: https://arxiv.org/abs/2505.07398
作者: Mingqian Ji,Jian Yang,Shanshan Zhang
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:State-of-the-art LiDAR-camera 3D object detectors usually focus on feature fusion. However, they neglect the factor of depth while designing the fusion strategy. In this work, we are the first to observe that different modalities play different roles as depth varies via statistical analysis and visualization. Based on this finding, we propose a Depth-Aware Hybrid Feature Fusion (DepthFusion) strategy that guides the weights of point cloud and RGB image modalities by introducing depth encoding at both global and local levels. Specifically, the Depth-GFusion module adaptively adjusts the weights of image Bird’s-Eye-View (BEV) features in multi-modal global features via depth encoding. Furthermore, to compensate for the information lost when transferring raw features to the BEV space, we propose a Depth-LFusion module, which adaptively adjusts the weights of original voxel features and multi-view image features in multi-modal local features via depth encoding. Extensive experiments on the nuScenes and KITTI datasets demonstrate that our DepthFusion method surpasses previous state-of-the-art methods. Moreover, our DepthFusion is more robust to various kinds of corruptions, outperforming previous methods on the nuScenes-C dataset.
zh

[CV-41] UM2TWIN: Introducing the Large-Scale Multimodal Urban Digital Twin Benchmark Dataset

【速读】:该论文旨在解决城市数字孪生(Urban Digital Twin, UDT)构建过程中多阶段面临的挑战,包括获取高精度三维源数据、重建高保真三维模型、维护模型更新以及确保与下游任务的无缝互操作性。现有数据集通常仅覆盖处理链的一部分,限制了对完整UDT的验证。论文提出的解决方案是引入首个全面的多模态城市数字孪生基准数据集TUM2TWIN,其关键在于整合了地理参考、语义对齐的三维模型与网络,以及多种地面、移动、空中和卫星观测数据,覆盖约100,000 m²区域,总数据量达767 GB,并支持室内-室外地理参考采集、高精度与多模态数据融合,从而为传感器分析和先进重建方法的发展提供支持。

链接: https://arxiv.org/abs/2505.07396
作者: Olaf Wysocki,Benedikt Schwab,Manoj Kumar Biswanath,Qilin Zhang,Jingwei Zhu,Thomas Froech,Medhini Heeramaglore,Ihab Hijazi,Khaoula Kanna,Mathias Pechinger,Zhaiyu Chen,Yao Sun,Alejandro Rueda Segura,Ziyang Xu,Omar AbdelGafar,Mansour Mehranfar,Chandan Yeshwanth,Yueh-Cheng Liu,Hadi Yazdi,Jiapan Wang,Stefan Auer,Katharina Anders,Klaus Bogenberger,Andre Borrmann,Angela Dai,Ludwig Hoegner,Christoph Holst,Thomas H. Kolbe,Ferdinand Ludwig,Matthias Nießner,Frank Petzold,Xiao Xiang Zhu,Boris Jutzi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted to the ISPRS Journal of Photogrammetry and Remote Sensing

点击查看摘要

Abstract:Urban Digital Twins (UDTs) have become essential for managing cities and integrating complex, heterogeneous data from diverse sources. Creating UDTs involves challenges at multiple process stages, including acquiring accurate 3D source data, reconstructing high-fidelity 3D models, maintaining models’ updates, and ensuring seamless interoperability to downstream tasks. Current datasets are usually limited to one part of the processing chain, hampering comprehensive UDTs validation. To address these challenges, we introduce the first comprehensive multimodal Urban Digital Twin benchmark dataset: TUM2TWIN. This dataset includes georeferenced, semantically aligned 3D models and networks along with various terrestrial, mobile, aerial, and satellite observations boasting 32 data subsets over roughly 100,000 m^2 and currently 767 GB of data. By ensuring georeferenced indoor-outdoor acquisition, high accuracy, and multimodal data integration, the benchmark supports robust analysis of sensors and the development of advanced reconstruction methods. Additionally, we explore downstream tasks demonstrating the potential of TUM2TWIN, including novel view synthesis of NeRF and Gaussian Splatting, solar potential analysis, point cloud semantic segmentation, and LoD3 building reconstruction. We are convinced this contribution lays a foundation for overcoming current limitations in UDT creation, fostering new research directions and practical solutions for smarter, data-driven urban environments. The project is available under: this https URL
zh

[CV-42] Feature Visualization in 3D Convolutional Neural Networks

【速读】:该论文试图解决3D卷积核可视化困难的问题,特别是在高维和复杂3D特征下,直接应用2D卷积核的最大激活方法往往导致不可解释的结果。解决方案的关键在于提出一种新的可视化方法,该方法通过数据驱动的分解和两阶段优化策略,分离出3D卷积核的纹理和运动偏好,从而提供更具解释性的3D卷积操作洞察。

链接: https://arxiv.org/abs/2505.07387
作者: Chunpeng Li,Ya-tang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding the computations of convolutional neural networks requires effective visualization of their kernels. While maximal activation methods have proven successful in highlighting the preferred features of 2D convolutional kernels, directly applying these techniques to 3D convolutions often leads to uninterpretable results due to the higher dimensionality and complexity of 3D features. To address this challenge, we propose a novel visualization approach for 3D convolutional kernels that disentangles their texture and motion preferences. Our method begins with a data-driven decomposition of the optimal input that maximally activates a given kernel. We then introduce a two-stage optimization strategy to extract distinct texture and motion components from this input. Applying our approach to visualize kernels at various depths of several pre-trained models, we find that the resulting visualizations–particularly those capturing motion–clearly reveal the preferred dynamic patterns encoded by 3D kernels. These results demonstrate the effectiveness of our method in providing interpretable insights into 3D convolutional operations. Code is available at this https URL.
zh

[CV-43] Few-shot Semantic Encoding and Decoding for Video Surveillance

【速读】:该论文试图解决视频监控中由于摄像头数量和分辨率的持续增加而导致的视频传输与存储负担加重的问题,以及传统基于香农理论的通信方法面临的优化瓶颈。其解决方案的关键在于提出一种语义编码与解码方法,通过提取草图作为语义信息并采用草图压缩方法降低语义信息的比特率,结合图像翻译网络将草图转换为参考帧,并利用少样本草图解码网络从草图重建视频,从而在减少存储和传输消耗的同时保持较高的视频质量。

链接: https://arxiv.org/abs/2505.07381
作者: Baoping Cheng,Yukun Zhang,Liming Wang,Xiaoyan Xie,Tao Fu,Dongkun Wang,Xiaoming Tao
机构: Tsinghua University (清华大学); China Mobile (Hangzhou) Information Technology Co., Ltd (中国移动(杭州)信息技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the continuous increase in the number and resolution of video surveillance cameras, the burden of transmitting and storing surveillance video is growing. Traditional communication methods based on Shannon’s theory are facing optimization bottlenecks. Semantic communication, as an emerging communication method, is expected to break through this bottleneck and reduce the storage and transmission consumption of video. Existing semantic decoding methods often require many samples to train the neural network for each scene, which is time-consuming and labor-intensive. In this study, a semantic encoding and decoding method for surveillance video is proposed. First, the sketch was extracted as semantic information, and a sketch compression method was proposed to reduce the bit rate of semantic information. Then, an image translation network was proposed to translate the sketch into a video frame with a reference frame. Finally, a few-shot sketch decoding network was proposed to reconstruct video from sketch. Experimental results showed that the proposed method achieved significantly better video reconstruction performance than baseline methods. The sketch compression method could effectively reduce the storage and transmission consumption of semantic information with little compromise on video quality. The proposed method provides a novel semantic encoding and decoding method that only needs a few training samples for each surveillance scene, thus improving the practicality of the semantic communication system.
zh

[CV-44] Apples Synthetic Defocus Noise Pattern: Characterization and Forensic Applications

【速读】:该论文试图解决iPhone人像模式图像中因模拟虚化效果(bokeh effect)而产生的Apple’s Synthetic Defocus Noise Pattern (SDNP) 对盲取证分析,尤其是基于PRNU(Photo Response Non-Uniformity)的相机源验证造成的干扰问题。解决方案的关键在于对SDNP进行详细的表征,并提出一种精确估计其方法,同时建模其与场景亮度、ISO设置等因素的依赖关系,从而实现对SDNP在不同iPhone型号和iOS版本中的可追溯性分析,并通过遮蔽SDNP影响区域显著降低PRNU验证中的误报率,提升相机归属技术的准确性。

链接: https://arxiv.org/abs/2505.07380
作者: David Vázquez-Padín,Fernando Pérez-González,Pablo Pérez-Miguélez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Image and Video Processing (eess.IV)
备注: This paper was submitted to IEEE Transactions on Information Forensics Security on May, 2025

点击查看摘要

Abstract:iPhone portrait-mode images contain a distinctive pattern in out-of-focus regions simulating the bokeh effect, which we term Apple’s Synthetic Defocus Noise Pattern (SDNP). If overlooked, this pattern can interfere with blind forensic analyses, especially PRNU-based camera source verification, as noted in earlier works. Since Apple’s SDNP remains underexplored, we provide a detailed characterization, proposing a method for its precise estimation, modeling its dependence on scene brightness, ISO settings, and other factors. Leveraging this characterization, we explore forensic applications of the SDNP, including traceability of portrait-mode images across iPhone models and iOS versions in open-set scenarios, assessing its robustness under post-processing. Furthermore, we show that masking SDNP-affected regions in PRNU-based camera source verification significantly reduces false positives, overcoming a critical limitation in camera attribution, and improving state-of-the-art techniques.
zh

[CV-45] Boosting Global-Local Feature Matching via Anomaly Synthesis for Multi-Class Point Cloud Anomaly Detection

【速读】:该论文旨在解决多类点云异常检测中由于正常与异常点特征相似性导致的特征混淆问题,该问题严重限制了多类无监督方法的性能。解决方案的关键在于引入一种名为GLFM的方法,通过全局-局部特征匹配逐步分离跨类易混淆的数据,其核心包括三个阶段:第一阶段提出异常合成流程以增强特征提取器的表征能力;第二阶段根据训练数据的全局和局部特征分布建立记忆库以缓解特征混淆的影响;第三阶段利用测试数据与全局和局部记忆库的特征距离进行异常检测。

链接: https://arxiv.org/abs/2505.07375
作者: Yuqi Cheng,Yunkang Cao,Dongfang Wang,Weiming Shen,Wenlong Li
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 12 figures

点击查看摘要

Abstract:Point cloud anomaly detection is essential for various industrial applications. The huge computation and storage costs caused by the increasing product classes limit the application of single-class unsupervised methods, necessitating the development of multi-class unsupervised methods. However, the feature similarity between normal and anomalous points from different class data leads to the feature confusion problem, which greatly hinders the performance of multi-class methods. Therefore, we introduce a multi-class point cloud anomaly detection method, named GLFM, leveraging global-local feature matching to progressively separate data that are prone to confusion across multiple classes. Specifically, GLFM is structured into three stages: Stage-I proposes an anomaly synthesis pipeline that stretches point clouds to create abundant anomaly data that are utilized to adapt the point cloud feature extractor for better feature representation. Stage-II establishes the global and local memory banks according to the global and local feature distributions of all the training data, weakening the impact of feature confusion on the establishment of the memory bank. Stage-III implements anomaly detection of test data leveraging its feature distance from global and local memory banks. Extensive experiments on the MVTec 3D-AD, Real3D-AD and actual industry parts dataset showcase our proposed GLFM’s superior point cloud anomaly detection performance. The code is available at this https URL.
zh

[CV-46] Geometric Prior-Guided Neural Implicit Surface Reconstruction in the Wild

【速读】:该论文旨在解决在非受控环境下(如存在瞬时遮挡或外观变化)从多视角2D图像中准确重建高保真3D表面的问题。现有方法在一致光照条件下表现良好,但在复杂场景中难以保持几何精度。该论文的解决方案关键在于将多种几何约束引入隐式表面优化过程,包括利用结构-from-motion (SfM) 得到的稀疏3D点对符号距离函数进行优化,并通过位移补偿处理点云噪声;同时结合由法线预测器生成的鲁棒法线先验,结合边缘先验过滤和多视角一致性约束,提升与实际表面几何的一致性。

链接: https://arxiv.org/abs/2505.07373
作者: Lintao Xiang,Hongpei Zheng,Bailin Deng,Hujun Yin
机构: The University of Manchester (曼彻斯特大学); Cardiff University (卡迪夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neural implicit surface reconstruction using volume rendering techniques has recently achieved significant advancements in creating high-fidelity surfaces from multiple 2D images. However, current methods primarily target scenes with consistent illumination and struggle to accurately reconstruct 3D geometry in uncontrolled environments with transient occlusions or varying appearances. While some neural radiance field (NeRF)-based variants can better manage photometric variations and transient objects in complex scenes, they are designed for novel view synthesis rather than precise surface reconstruction due to limited surface constraints. To overcome this limitation, we introduce a novel approach that applies multiple geometric constraints to the implicit surface optimization process, enabling more accurate reconstructions from unconstrained image collections. First, we utilize sparse 3D points from structure-from-motion (SfM) to refine the signed distance function estimation for the reconstructed surface, with a displacement compensation to accommodate noise in the sparse points. Additionally, we employ robust normal priors derived from a normal predictor, enhanced by edge prior filtering and multi-view consistency constraints, to improve alignment with the actual surface geometry. Extensive testing on the Heritage-Recon benchmark and other datasets has shown that the proposed method can accurately reconstruct surfaces from in-the-wild images, yielding geometries with superior accuracy and granularity compared to existing techniques. Our approach enables high-quality 3D reconstruction of various landmarks, making it applicable to diverse scenarios such as digital preservation of cultural heritage sites.
zh

[CV-47] AI-Enabled Accurate Non-Invasive Assessment of Pulmonary Hypertension Progression via Multi-Modal Echocardiography

【速读】:该论文旨在解决肺动脉高压(Pulmonary Hypertension, PH)进展评估中非侵入性方法准确性不足的问题,以及传统金标准右心导管检查(Right Heart Catheterization, RHC)的侵入性和不适用于常规监测的局限性。解决方案的关键在于提出一种多视角、多模态的视觉-语言模型MePH,该模型通过非侵入性超声心动图数据与RHC获取的压力和阻力数据之间的精确建模,实现了对肺动脉高压进展的准确评估。

链接: https://arxiv.org/abs/2505.07347
作者: Jiewen Yang,Taoran Huang,Shangwei Ding,Xiaowei Xu,Qinhua Zhao,Yong Jiang,Jiarong Guo,Bin Pu,Jiexuan Zheng,Caojin Zhang,Hongwen Fei,Xiaomeng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Echocardiographers can detect pulmonary hypertension using Doppler echocardiography; however, accurately assessing its progression often proves challenging. Right heart catheterization (RHC), the gold standard for precise evaluation, is invasive and unsuitable for routine use, limiting its practicality for timely diagnosis and monitoring of pulmonary hypertension progression. Here, we propose MePH, a multi-view, multi-modal vision-language model to accurately assess pulmonary hypertension progression using non-invasive echocardiography. We constructed a large dataset comprising paired standardized echocardiogram videos, spectral images and RHC data, covering 1,237 patient cases from 12 medical centers. For the first time, MePH precisely models the correlation between non-invasive multi-view, multi-modal echocardiography and the pressure and resistance obtained via RHC. We show that MePH significantly outperforms echocardiographers’ assessments using echocardiography, reducing the mean absolute error in estimating mean pulmonary arterial pressure (mPAP) and pulmonary vascular resistance (PVR) by 49.73% and 43.81%, respectively. In eight independent external hospitals, MePH achieved a mean absolute error of 3.147 for PVR assessment. Furthermore, MePH achieved an area under the curve of 0.921, surpassing echocardiographers (area under the curve of 0.842) in accurately predicting the severity of pulmonary hypertension, whether mild or severe. A prospective study demonstrated that MePH can predict treatment efficacy for patients. Our work provides pulmonary hypertension patients with a non-invasive and timely method for monitoring disease progression, improving the accuracy and efficiency of pulmonary hypertension management while enabling earlier interventions and more personalized treatment decisions.
zh

[CV-48] Generative Pre-trained Autoregressive Diffusion Transformer

【速读】:该论文旨在解决长视频生成中运动动态自然性与帧间语义一致性难以同时保证的问题,以及传统离散token预测方式在连续潜在空间建模中的局限性。其解决方案的关键在于提出GPDiT,一种将扩散模型与自回归建模相结合的生成式预训练自回归扩散Transformer,通过在连续潜在空间中自回归地预测未来潜在帧,而非预测离散token,从而实现更自然的运动动力学建模和跨帧语义一致性。

链接: https://arxiv.org/abs/2505.07344
作者: Yuan Zhang,Jiacheng Jiang,Guoqing Ma,Zhiying Lu,Haoyang Huang,Jianlong Yuan,Nan Duan
机构: StepFun, China(步骤科技,中国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we present GPDiT, a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis, within a continuous latent space. Instead of predicting discrete tokens, GPDiT autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics and semantic consistency across frames. This continuous autoregressive framework not only enhances generation quality but also endows the model with representation capabilities. Additionally, we introduce a lightweight causal attention variant and a parameter-free rotation-based time-conditioning mechanism, improving both the training and inference efficiency. Extensive experiments demonstrate that GPDiT achieves strong performance in video generation quality, video representation ability, and few-shot learning tasks, highlighting its potential as an effective framework for video modeling in continuous space.
zh

[CV-49] SAEN-BGS: Energy-Efficient Spiking AutoEncoder Network for Background Subtraction

【速读】:该论文旨在解决基于深度学习的背景减除(Background Subtraction, BGS)技术在面对视频中的各种背景噪声时所遇到的挑战,如光照变化、摄像机角度偏移以及空气湍流或摇曳树木等干扰因素。其解决方案的关键在于设计一种基于脉冲神经网络(Spiking Neural Networks, SNNs)的脉冲自编码器网络(Spiking Autoencoder Network for BGS, SAEN-BGS),充分利用SNN对噪声的鲁棒性和对时间序列的敏感性,以提升前景与背景的分离效果。此外,通过引入一种基于人工神经网络到脉冲神经网络框架的新型自蒸馏脉冲监督学习方法,进一步提高了系统的能量效率。

链接: https://arxiv.org/abs/2505.07336
作者: Zhixuan Zhang,Xiaopeng Li,Qi Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by Pattern Recognition

点击查看摘要

Abstract:Background subtraction (BGS) is utilized to detect moving objects in a video and is commonly employed at the onset of object tracking and human recognition processes. Nevertheless, existing BGS techniques utilizing deep learning still encounter challenges with various background noises in videos, including variations in lighting, shifts in camera angles, and disturbances like air turbulence or swaying trees. To address this problem, we design a spiking autoencoder network, termed SAEN-BGS, based on noise resilience and time-sequence sensitivity of spiking neural networks (SNNs) to enhance the separation of foreground and background. To eliminate unnecessary background noise and preserve the important foreground elements, we begin by creating the continuous spiking conv-and-dconv block, which serves as the fundamental building block for the decoder in SAEN-BGS. Moreover, in striving for enhanced energy efficiency, we introduce a novel self-distillation spiking supervised learning method grounded in ANN-to-SNN frameworks, resulting in decreased power consumption. In extensive experiments conducted on CDnet-2014 and DAVIS-2016 datasets, our approach demonstrates superior segmentation performance relative to other baseline methods, even when challenged by complex scenarios with dynamic backgrounds.
zh

[CV-50] Link to the Past: Temporal Propagation for Fast 3D Human Reconstruction from Monocular Video CVPR2025

【速读】:该论文旨在解决从单目视频中快速重建穿着衣物的人体这一计算机视觉领域的挑战,特别是在计算效率与重建质量之间的平衡问题。现有方法要么专注于静态图像重建但计算成本过高,要么通过每视频优化实现高质量重建,但需要数分钟到数小时的处理时间,难以满足实时应用需求。论文提出的解决方案是TemPoFast3D,其关键在于利用人体外观的时间一致性,通过维护和优化一个规范外观表示,并借助高效的坐标映射,将像素对齐的重建网络转换为能够处理连续视频流的“即插即用”方案,从而减少冗余计算并保持重建质量。

链接: https://arxiv.org/abs/2505.07333
作者: Matthew Marchellus,Nadhira Noor,In Kyu Park
机构: Inha University (因哈大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2025

点击查看摘要

Abstract:Fast 3D clothed human reconstruction from monocular video remains a significant challenge in computer vision, particularly in balancing computational efficiency with reconstruction quality. Current approaches are either focused on static image reconstruction but too computationally intensive, or achieve high quality through per-video optimization that requires minutes to hours of processing, making them unsuitable for real-time applications. To this end, we present TemPoFast3D, a novel method that leverages temporal coherency of human appearance to reduce redundant computation while maintaining reconstruction quality. Our approach is a “plug-and play” solution that uniquely transforms pixel-aligned reconstruction networks to handle continuous video streams by maintaining and refining a canonical appearance representation through efficient coordinate mapping. Extensive experiments demonstrate that TemPoFast3D matches or exceeds state-of-the-art methods across standard metrics while providing high-quality textured reconstruction across diverse pose and appearance, with a maximum speed of 12 FPS.
zh

[CV-51] RealRep: Generalized SDR-to-HDR Conversion with Style Disentangled Representation Learning

【速读】:该论文旨在解决标准动态范围(Standard Dynamic Range, SDR)内容向高动态范围宽色域(High-Dynamic-Range Wide-Color-Gamut, HDR-WCG)转换的问题,特别是现有方法依赖固定色调映射算子,难以应对真实场景中风格多样的SDR输入。其解决方案的关键在于提出一种通用的SDR-to-HDR方法——真实风格解耦表示学习(Realistic Style Disentangled Representation Learning, RealRep),通过解耦亮度与色度,分析不同风格内容的内在差异,并引入解耦多视角风格表示学习方法,以捕捉不同风格下真实亮度和色度分布的指导先验,从而建立鲁棒的逆色调映射嵌入空间。

链接: https://arxiv.org/abs/2505.07322
作者: Gang He,Siqi Wang,Kepeng Xu,Lin Zhang
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-Dynamic-Range Wide-Color-Gamut (HDR-WCG) technology is becoming increasingly prevalent, intensifying the demand for converting Standard Dynamic Range (SDR) content to HDR. Existing methods primarily rely on fixed tone mapping operators, which are inadequate for handling SDR inputs with diverse styles commonly found in real-world scenarios. To address this challenge, we propose a generalized SDR-to-HDR method that handles diverse styles in real-world SDR content, termed Realistic Style Disentangled Representation Learning (RealRep). By disentangling luminance and chrominance, we analyze the intrinsic differences between contents with varying styles and propose a disentangled multi-view style representation learning method. This approach captures the guidance prior of true luminance and chrominance distributions across different styles, even when the SDR style distributions exhibit significant variations, thereby establishing a robust embedding space for inverse tone mapping. Motivated by the difficulty of directly utilizing degradation representation priors, we further introduce the Degradation-Domain Aware Controlled Mapping Network (DDACMNet), a two-stage framework that performs adaptive hierarchical mapping guided by a control-aware normalization mechanism. DDACMNet dynamically modulates the mapping process via degradation-conditioned hierarchical features, enabling robust adaptation across diverse degradation domains. Extensive experiments show that RealRep consistently outperforms state-of-the-art methods with superior generalization and perceptually faithful HDR color gamut reconstruction.
zh

[CV-52] Enabling Privacy-Aware AI-Based Ergonomic Analysis

【速读】:该论文旨在解决制造业中肌肉骨骼疾病(Musculoskeletal Disorders, MSDs)导致的工伤和生产力损失问题,同时应对基于摄像头的工效学评估系统所引发的隐私担忧。其解决方案的关键在于提出一种隐私感知的工效学评估框架,利用对抗训练构建轻量级神经网络,对视频数据进行模糊处理,仅保留用于人体姿态估计的必要信息,从而在保证姿态检测准确性的同时保护用户隐私。

链接: https://arxiv.org/abs/2505.07306
作者: Sander De Coninck,Emilio Gamba,Bart Van Doninck,Abdellatif Bey-Temsamani,Sam Leroux,Pieter Simoens
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and presented at the 35th CIRP Design conference

点击查看摘要

Abstract:Musculoskeletal disorders (MSDs) are a leading cause of injury and productivity loss in the manufacturing industry, incurring substantial economic costs. Ergonomic assessments can mitigate these risks by identifying workplace adjustments that improve posture and reduce strain. Camera-based systems offer a non-intrusive, cost-effective method for continuous ergonomic tracking, but they also raise significant privacy concerns. To address this, we propose a privacy-aware ergonomic assessment framework utilizing machine learning techniques. Our approach employs adversarial training to develop a lightweight neural network that obfuscates video data, preserving only the essential information needed for human pose estimation. This obfuscation ensures compatibility with standard pose estimation algorithms, maintaining high accuracy while protecting privacy. The obfuscated video data is transmitted to a central server, where state-of-the-art keypoint detection algorithms extract body landmarks. Using multi-view integration, 3D keypoints are reconstructed and evaluated with the Rapid Entire Body Assessment (REBA) method. Our system provides a secure, effective solution for ergonomic monitoring in industrial environments, addressing both privacy and workplace safety concerns.
zh

[CV-53] Human Motion Prediction via Test-domain-aware Adaptation with Easily-available Human Motions Estimated from Videos

【速读】:该论文试图解决3D人体运动预测(HMP)中由于依赖昂贵的动作捕捉数据而导致的数据多样性不足,进而影响模型对未见动作或个体的泛化能力的问题。解决方案的关键在于利用易于获取的视频中估计的2D姿态,通过其转换生成类似动作捕捉风格的3D运动数据,并以此进行额外学习,从而提升HMP模型在测试域中的适应性。

链接: https://arxiv.org/abs/2505.07301
作者: Katsuki Shimbo,Hiromu Taketsugu,Norimichi Ukita
机构: Toyota Technological Institute(丰田技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:In 3D Human Motion Prediction (HMP), conventional methods train HMP models with expensive motion capture data. However, the data collection cost of such motion capture data limits the data diversity, which leads to poor generalizability to unseen motions or subjects. To address this issue, this paper proposes to enhance HMP with additional learning using estimated poses from easily available videos. The 2D poses estimated from the monocular videos are carefully transformed into motion capture-style 3D motions through our pipeline. By additional learning with the obtained motions, the HMP model is adapted to the test domain. The experimental results demonstrate the quantitative and qualitative impact of our method.
zh

[CV-54] L-SWAG: Layer-Sample Wise Activation with Gradients information for Zero-Shot NAS on Vision Transformers CVPR2025

【速读】:该论文旨在解决生成式 AI (Generative AI) 领域中训练-free 神经网络架构搜索(Training-free Neural Architecture Search, NAS)的适用性受限问题,特别是在扩展到视觉变压器(Vision Transformers, ViTs)时缺乏有效零成本(Zero-cost, ZC)代理的问题。其解决方案的关键在于提出一种新的通用度量方法——层-样本激活与梯度信息(Layer-Sample Wise Activation with Gradients information, L-SWAG),该方法能够同时表征卷积和变压器架构,并在14个任务中展现出良好的泛化能力。此外,论文还引入了LIBRA-NAS方法,通过策略性地组合多个ZC代理以更好地反映特定基准,从而提升NAS性能。

链接: https://arxiv.org/abs/2505.07300
作者: Sofia Casarin,Sergio Escalera,Oswald Lanz
机构: Free University of Bozen-Bolzano (博岑-波尔扎诺自由大学); Computer Vision Center (计算机视觉中心); Universitat de Barcelona (巴塞罗那大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted at CVPR 2025

点击查看摘要

Abstract:Training-free Neural Architecture Search (NAS) efficiently identifies high-performing neural networks using zero-cost (ZC) proxies. Unlike multi-shot and one-shot NAS approaches, ZC-NAS is both (i) time-efficient, eliminating the need for model training, and (ii) interpretable, with proxy designs often theoretically grounded. Despite rapid developments in the field, current SOTA ZC proxies are typically constrained to well-established convolutional search spaces. With the rise of Large Language Models shaping the future of deep learning, this work extends ZC proxy applicability to Vision Transformers (ViTs). We present a new benchmark using the Autoformer search space evaluated on 6 distinct tasks and propose Layer-Sample Wise Activation with Gradients information (L-SWAG), a novel, generalizable metric that characterizes both convolutional and transformer architectures across 14 tasks. Additionally, previous works highlighted how different proxies contain complementary information, motivating the need for a ML model to identify useful combinations. To further enhance ZC-NAS, we therefore introduce LIBRA-NAS (Low Information gain and Bias Re-Alignment), a method that strategically combines proxies to best represent a specific benchmark. Integrated into the NAS search, LIBRA-NAS outperforms evolution and gradient-based NAS techniques by identifying an architecture with a 17.0% test error on ImageNet1k in just 0.1 GPU days.
zh

[CV-55] Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

【速读】:该论文旨在解决多模态对齐中缺乏通用且可靠奖励模型的问题,以提升多模态理解与推理任务的性能。其解决方案的关键在于构建一个大规模多模态偏好数据集,并基于Qwen2.5-VL-7B-Instruct设计一种集成奖励头的奖励模型架构,通过多阶段微调和成对排序损失优化模型,从而实现对多模态任务的有效奖励信号生成。

链接: https://arxiv.org/abs/2505.07263
作者: Xiaokun Wang,Chris,Jiangbo Pei,Wei Shen,Yi Peng,Yunzhuo Hao,Weijie Qiu,Ai Jian,Tianyidan Xie,Xuchen Song,Yang Liu,Yahui Zhou
机构: Skywork AI; Kunlun Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our technical approach comprises two key components: First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM reasoners. Second, we design a reward model architecture based on Qwen2.5-VL-7B-Instruct, integrating a reward head and applying multi-stage fine-tuning using pairwise ranking loss on pairwise preference data. Experimental evaluations show that Skywork-VL Reward achieves state-of-the-art results on multimodal VL-RewardBench and exhibits competitive performance on the text-only RewardBench benchmark. Furthermore, preference data constructed based on our Skywork-VL Reward proves highly effective for training Mixed Preference Optimization (MPO), leading to significant improvements in multimodal reasoning capabilities. Our results underscore Skywork-VL Reward as a significant advancement toward general-purpose, reliable reward models for multimodal alignment. Our model has been publicly released to promote transparency and reproducibility.
zh

[CV-56] Synthetic Similarity Search in Automotive Production

【速读】:该论文试图解决汽车生产中视觉质量检测对大量标注数据依赖的问题,这一问题导致数据收集成本高且耗时。解决方案的关键在于提出一种结合基于视觉的基础模型进行相似性搜索与合成数据的图像分类流程,通过DINOv2模型将输入图像转换为特征向量,并利用余弦距离与预分类参考图像进行比较,从而在不依赖真实数据的情况下实现高分类精度。

链接: https://arxiv.org/abs/2505.07256
作者: Christoph Huber,Ludwig Schleeh,Dino Knoll,Michael Guthe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in Procedia CIRP

点击查看摘要

Abstract:Visual quality inspection in automotive production is essential for ensuring the safety and reliability of vehicles. Computer vision (CV) has become a popular solution for these inspections due to its cost-effectiveness and reliability. However, CV models require large, annotated datasets, which are costly and time-consuming to collect. To reduce the need for extensive training data, we propose a novel image classification pipeline that combines similarity search using a vision-based foundation model with synthetic data. Our approach leverages a DINOv2 model to transform input images into feature vectors, which are then compared to pre-classified reference images using cosine distance measurements. By utilizing synthetic data instead of real images as references, our pipeline achieves high classification accuracy without relying on real data. We evaluate this approach in eight real-world inspection scenarios and demonstrate that it meets the high performance requirements of production environments.
zh

[CV-57] owards Accurate State Estimation: Kalman Filter Incorporating Motion Dynamics for 3D Multi-Object Tracking

【速读】:该论文试图解决卡尔曼滤波在三维多目标跟踪(3D MOT)中状态估计精度不足以及运动模型选择困难的问题。现有方法通常依赖于恒定运动模型来估计目标状态,忽视了每个目标独特的复杂运动动态,导致轨迹分割和定位不准确,尤其是在遮挡条件下。解决方案的关键在于提出一种新的卡尔曼滤波公式,该公式引入了运动动态,使运动模型能够根据目标运动的变化自适应调整,从而显著提升了状态估计、定位和轨迹预测的性能。

链接: https://arxiv.org/abs/2505.07254
作者: Mohamed Nagy,Naoufel Werghi,Bilal Hassan,Jorge Dias,Majid Khonji
机构: Khalifa University (哈利法大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:This work addresses the critical lack of precision in state estimation in the Kalman filter for 3D multi-object tracking (MOT) and the ongoing challenge of selecting the appropriate motion model. Existing literature commonly relies on constant motion models for estimating the states of objects, neglecting the complex motion dynamics unique to each object. Consequently, trajectory division and imprecise object localization arise, especially under occlusion conditions. The core of these challenges lies in the limitations of the current Kalman filter formulation, which fails to account for the variability of motion dynamics as objects navigate their environments. This work introduces a novel formulation of the Kalman filter that incorporates motion dynamics, allowing the motion model to adaptively adjust according to changes in the object’s movement. The proposed Kalman filter substantially improves state estimation, localization, and trajectory prediction compared to the traditional Kalman filter. This is reflected in tracking performance that surpasses recent benchmarks on the KITTI and Waymo Open Datasets, with margins of 0.56% and 0.81% in higher order tracking accuracy (HOTA) and multi-object tracking accuracy (MOTA), respectively. Furthermore, the proposed Kalman filter consistently outperforms the baseline across various detectors. Additionally, it shows an enhanced capability in managing long occlusions compared to the baseline Kalman filter, achieving margins of 1.22% in higher order tracking accuracy (HOTA) and 1.55% in multi-object tracking accuracy (MOTA) on the KITTI dataset. The formulation’s efficiency is evident, with an additional processing time of only approximately 0.078 ms per frame, ensuring its applicability in real-time applications.
zh

[CV-58] Incomplete In-context Learning

【速读】:该论文试图解决在检索数据库不完整的情况下,大型视觉语言模型(Large Vision Language Models, LVLMs)进行上下文学习(In-context Learning)时性能下降的问题,即Incomplete In-context Learning (IICL)。解决方案的关键在于提出一种两阶段框架——Iterative Judgments and Integrated Prediction (IJIP),该框架通过将多类分类问题转化为一系列二分类任务,从而将IICL场景转换为标准的Vision In-context Learning (VICL)情景,并通过整合输入图像与迭代判断阶段的预测结果,进一步提升分类精度。

链接: https://arxiv.org/abs/2505.07251
作者: Wenqiang Wang,Yangshijie Zhang
机构: Sun Yat-sen University (中山大学); Lanzhou University (兰州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large vision language models (LVLMs) achieve remarkable performance through Vision In-context Learning (VICL), a process that depends significantly on demonstrations retrieved from an extensive collection of annotated examples (retrieval database). Existing studies often assume that the retrieval database contains annotated examples for all labels. However, in real-world scenarios, delays in database updates or incomplete data annotation may result in the retrieval database containing labeled samples for only a subset of classes. We refer to this phenomenon as an \textbfincomplete retrieval database and define the in-context learning under this condition as \textbfIncomplete In-context Learning (IICL). To address this challenge, we propose \textbfIterative Judgments and Integrated Prediction (IJIP), a two-stage framework designed to mitigate the limitations of IICL. The Iterative Judgments Stage reformulates an (\boldsymbolm)-class classification problem into a series of (\boldsymbolm) binary classification tasks, effectively converting the IICL setting into a standard VICL scenario. The Integrated Prediction Stage further refines the classification process by leveraging both the input image and the predictions from the Iterative Judgments Stage to enhance overall classification accuracy. IJIP demonstrates considerable performance across two LVLMs and two datasets under three distinct conditions of label incompleteness, achieving the highest accuracy of 93.9%. Notably, even in scenarios where labels are fully available, IJIP still achieves the best performance of all six baselines. Furthermore, IJIP can be directly applied to \textbfPrompt Learning and is adaptable to the \textbftext domain.
zh

[CV-59] When Dance Video Archives Challenge Computer Vision

【速读】:该论文试图解决舞蹈视频中人体姿态估计的准确性和效率问题,其核心挑战在于舞蹈数据的特殊性及高质量数据处理的需求。解决方案的关键在于提出了一种新的3D人体姿态估计流水线,该流水线整合了尚未应用于舞蹈分析的最新技术与方法,以提升对复杂舞蹈动作的建模能力。

链接: https://arxiv.org/abs/2505.07249
作者: Philippe Colantoni,Rafique Ahmed,Prashant Ghimire,Damien Muselet,Alain Trémeau
机构: Laboratoire Hubert Curien - UMR 5516 (Hubert Curien实验室 - UMR 5516)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The accuracy and efficiency of human body pose estimation depend on the quality of the data to be processed and of the particularities of these data. To demonstrate how dance videos can challenge pose estimation techniques, we proposed a new 3D human body pose estimation pipeline which combined up-to-date techniques and methods that had not been yet used in dance analysis. Second, we performed tests and extensive experimentations from dance video archives, and used visual analytic tools to evaluate the impact of several data parameters on human body pose. Our results are publicly available for research at this https URL
zh

[CV-60] Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection

【速读】:该论文旨在解决将仅在单一领域训练的物体检测器泛化到多个未见领域的挑战。现有方法通常通过图像或特征增强来提升检测器的鲁棒性,但基于视觉-语言模型(VLM)的增强技术受限于检测器主干与VLM图像编码器结构的一致性,从而限制了检测框架的选择。该论文提出的解决方案关键在于Language-Driven Dual Style Mixing (LDDS),其核心是通过充分利用VLM的语义信息来多样化源域,具体包括通过提示构建将VLM中的风格语义传递至图像翻译网络,生成具有明确语义信息的风格多样化图像,并结合图像级和特征级的风格混合策略,实现与主流检测框架的兼容性。

链接: https://arxiv.org/abs/2505.07219
作者: Hongda Qin,Xiao Lu,Zhiyong Wei,Yihong Cao,Kailun Yang,Ningjiang Chen
机构: Guangxi University (广西大学); Hunan Normal University (湖南师范大学); Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: The source code and pre-trained models will be publicly available at this https URL

点击查看摘要

Abstract:Generalizing an object detector trained on a single domain to multiple unseen domains is a challenging task. Existing methods typically introduce image or feature augmentation to diversify the source domain to raise the robustness of the detector. Vision-Language Model (VLM)-based augmentation techniques have been proven to be effective, but they require that the detector’s backbone has the same structure as the image encoder of VLM, limiting the detector framework selection. To address this problem, we propose Language-Driven Dual Style Mixing (LDDS) for single-domain generalization, which diversifies the source domain by fully utilizing the semantic information of the VLM. Specifically, we first construct prompts to transfer style semantics embedded in the VLM to an image translation network. This facilitates the generation of style diversified images with explicit semantic information. Then, we propose image-level style mixing between the diversified images and source domain images. This effectively mines the semantic information for image augmentation without relying on specific augmentation selections. Finally, we propose feature-level style mixing in a double-pipeline manner, allowing feature augmentation to be model-agnostic and can work seamlessly with the mainstream detector frameworks, including the one-stage, two-stage, and transformer-based detectors. Extensive experiments demonstrate the effectiveness of our approach across various benchmark datasets, including real to cartoon and normal to adverse weather tasks. The source code and pre-trained models will be publicly available at this https URL.
zh

[CV-61] owards user-centered interactive medical image segmentation in VR with an assistive AI agent

【速读】:该论文旨在解决医学影像中三维医学概念的定位、分割与可视化问题,尤其是在手动分割耗时且易出错的情况下,如何通过结合最新放射学AI基础模型与虚拟现实(VR)的直观数据交互来提升效率和准确性。其解决方案的关键在于提出SAMIRA,一个新型的对话式AI代理,它通过语音交互帮助用户理解放射学特征、定位临床目标,并生成可通过少量点提示进行优化的分割掩码,同时支持真实比例的三维病理可视化,以增强个性化解剖理解。

链接: https://arxiv.org/abs/2505.07214
作者: Pascal Spiegler,Arash Harirpoush,Yiming Xiao
机构: Concordia University (康考迪亚大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Crucial in disease analysis and surgical planning, manual segmentation of volumetric medical scans (e.g. MRI, CT) is laborious, error-prone, and challenging to master, while fully automatic algorithms can benefit from user-feedback. Therefore, with the complementary power of the latest radiological AI foundation models and virtual reality (VR)'s intuitive data interaction, we propose SAMIRA, a novel conversational AI agent that assists users with localizing, segmenting, and visualizing 3D medical concepts in VR. Through speech-based interaction, the agent helps users understand radiological features, locate clinical targets, and generate segmentation masks that can be refined with just a few point prompts. The system also supports true-to-scale 3D visualization of segmented pathology to enhance patient-specific anatomical understanding. Furthermore, to determine the optimal interaction paradigm under near-far attention-switching for refining segmentation masks in an immersive, human-in-the-loop workflow, we compare VR controller pointing, head pointing, and eye tracking as input modes. With a user study, evaluations demonstrated a high usability score (SUS=90.0 \pm 9.0), low overall task load, as well as strong support for the proposed VR system’s guidance, training potential, and integration of AI in radiological segmentation tasks.
zh

[CV-62] Discovering Fine-Grained Visual-Concept Relations by Disentangled Optimal Transport Concept Bottleneck Models CVPR2025

【速读】:该论文试图解决传统概念瓶颈模型(Concept Bottleneck Models, CBMs)在处理图像与概念之间的关系时存在的两个主要问题:一是容易产生虚假的视觉-概念关联,从而降低模型可靠性;二是难以确定哪些视觉区域对最终预测起关键作用。其解决方案的关键在于提出一种解耦最优传输概念瓶颈模型(Disentangled Optimal Transport CBM, DOT-CBM),通过将概念预测过程建模为局部图像块与概念之间的运输问题,实现细粒度特征对齐,并引入正交投影损失以增强局部特征的解耦性,同时利用视觉显著图和概念标签统计作为运输先验,以缓解数据中的统计偏差导致的捷径问题。

链接: https://arxiv.org/abs/2505.07209
作者: Yan Xie,Zequn Zeng,Hao Zhang,Yucheng Ding,Yi Wang,Zhengjue Wang,Bo Chen,Hongwei Liu
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) try to make the decision-making process transparent by exploring an intermediate concept space between the input image and the output prediction. Existing CBMs just learn coarse-grained relations between the whole image and the concepts, less considering local image information, leading to two main drawbacks: i) they often produce spurious visual-concept relations, hence decreasing model reliability; and ii) though CBMs could explain the importance of every concept to the final prediction, it is still challenging to tell which visual region produces the prediction. To solve these problems, this paper proposes a Disentangled Optimal Transport CBM (DOT-CBM) framework to explore fine-grained visual-concept relations between local image patches and concepts. Specifically, we model the concept prediction process as a transportation problem between the patches and concepts, thereby achieving explicit fine-grained feature alignment. We also incorporate orthogonal projection losses within the modality to enhance local feature disentanglement. To further address the shortcut issues caused by statistical biases in the data, we utilize the visual saliency map and concept label statistics as transportation priors. Thus, DOT-CBM can visualize inversion heatmaps, provide more reliable concept predictions, and produce more accurate class predictions. Comprehensive experiments demonstrate that our proposed DOT-CBM achieves SOTA performance on several tasks, including image classification, local part detection and out-of-distribution generalization.
zh

[CV-63] Ranking-aware Continual Learning for LiDAR Place Recognition

【速读】:该论文旨在解决LiDAR位姿识别(LiDAR place recognition, LPR)中由于持续学习导致的灾难性遗忘问题,该问题使得模型在新环境训练后对先前训练过的场景识别性能显著下降。论文提出的解决方案关键在于引入一种基于知识蒸馏与融合(Knowledge Distillation and Fusion, KDF)的持续学习框架,通过设计一种排名感知的知识蒸馏损失函数来保持高层次的位姿识别知识,并结合知识融合模块整合新旧模型的知识,从而有效缓解遗忘问题。

链接: https://arxiv.org/abs/2505.07198
作者: Xufei Wang,Gengxuan Tian,Junqiao Zhao,Siyue Tao,Qiwen Gu,Qiankun Yu,Tiantian Feng
机构: Shanghai Research Institute for Intelligent Autonomous System, Tongji University(上海智能自主系统研究院,同济大学); Department of Computer Science and Technology, School of Electronics and Information Engineering, Tongji University(计算机科学与技术系,电子与信息工程学院,同济大学); MOE Key Lab of Embedded System and Service Computing, Tongji University(教育部嵌入式系统与服务计算重点实验室,同济大学); Institute of Intelligent Vehicles, Tongji University(智能车辆研究所,同济大学); SAIC Intelligent Technology (Shanghai) Co. Ltd(上汽智能技术(上海)有限公司); College of Surveying and Geo-Informatics, Tongji University(测绘与地理信息学院,同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Place recognition plays a significant role in SLAM, robot navigation, and autonomous driving applications. Benefiting from deep learning, the performance of LiDAR place recognition (LPR) has been greatly improved. However, many existing learning-based LPR methods suffer from catastrophic forgetting, which severely harms the performance of LPR on previously trained places after training on a new environment. In this paper, we introduce a continual learning framework for LPR via Knowledge Distillation and Fusion (KDF) to alleviate forgetting. Inspired by the ranking process of place recognition retrieval, we present a ranking-aware knowledge distillation loss that encourages the network to preserve the high-level place recognition knowledge. We also introduce a knowledge fusion module to integrate the knowledge of old and new models for LiDAR place recognition. Our extensive experiments demonstrate that KDF can be applied to different networks to overcome catastrophic forgetting, surpassing the state-of-the-art methods in terms of mean Recall@1 and forgetting score.
zh

[CV-64] Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning

【速读】:该论文旨在解决现有大型视觉语言模型(Large Vision-Language Models, LVLMs)在解释相关图像时容易产生缺乏视觉依据的响应的问题。其解决方案的关键在于提出一种名为Re-Critic的可扩展的理性增强框架,该框架通过引入基本规则和思维链(Chain-of-Thought, CoT)作为桥梁来提升模型的推理能力。具体而言,Re-Critic开发了一个视觉理性合成器,能够以可扩展的方式将原始指令与理性解释相结合,并采用上下文自批评机制选择响应对进行偏好微调,从而获得更具语境依据的输出。

链接: https://arxiv.org/abs/2505.07172
作者: Zexian Yang,Dian Li,Dayan Wu,Gang Liu,Weiping Wang
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); Foundation Technology Center, Tencent PCG (腾讯PCG基础技术中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite significant advancements in multimodal reasoning tasks, existing Large Vision-Language Models (LVLMs) are prone to producing visually ungrounded responses when interpreting associated images. In contrast, when humans embark on learning new knowledge, they often rely on a set of fundamental pre-study principles: reviewing outlines to grasp core concepts, summarizing key points to guide their focus and enhance understanding. However, such preparatory actions are notably absent in the current instruction tuning processes. This paper presents Re-Critic, an easily scalable rationale-augmented framework designed to incorporate fundamental rules and chain-of-thought (CoT) as a bridge to enhance reasoning abilities. Specifically, Re-Critic develops a visual rationale synthesizer that scalably augments raw instructions with rationale explanation. To probe more contextually grounded responses, Re-Critic employs an in-context self-critic mechanism to select response pairs for preference tuning. Experiments demonstrate that models fine-tuned with our rationale-augmented dataset yield gains that extend beyond hallucination-specific tasks to broader multimodal reasoning tasks.
zh

[CV-65] Generalizable Pancreas Segmentation via a Dual Self-Supervised Learning Framework

【速读】:该论文旨在解决单源数据训练的胰腺分割模型在跨源测试数据上表现有限且稳定性差的泛化性问题(generalizability issues)。其解决方案的关键在于提出一种双自监督学习模型,该模型结合了全局和局部解剖上下文信息,以充分挖掘胰腺内和胰腺外区域的解剖特征,从而提升高不确定性区域的表征能力,实现更稳健的泛化性能。

链接: https://arxiv.org/abs/2505.07165
作者: Jun Li,Hongzhang Zhu,Tao Chen,Xiaohua Qian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accept by IEEE JBHI. Due to the limitation “The abstract field cannot be longer than 1,920 characters”, the abstract here is shorter than that in the PDF file

点击查看摘要

Abstract:Recently, numerous pancreas segmentation methods have achieved promising performance on local single-source datasets. However, these methods don’t adequately account for generalizability issues, and hence typically show limited performance and low stability on test data from other sources. Considering the limited availability of distinct data sources, we seek to improve the generalization performance of a pancreas segmentation model trained with a single-source dataset, i.e., the single source generalization task. In particular, we propose a dual self-supervised learning model that incorporates both global and local anatomical contexts. Our model aims to fully exploit the anatomical features of the intra-pancreatic and extra-pancreatic regions, and hence enhance the characterization of the high-uncertainty regions for more robust generalization. Specifically, we first construct a global-feature contrastive self-supervised learning module that is guided by the pancreatic spatial structure. This module obtains complete and consistent pancreatic features through promoting intra-class cohesion, and also extracts more discriminative features for differentiating between pancreatic and non-pancreatic tissues through maximizing inter-class separation. It mitigates the influence of surrounding tissue on the segmentation outcomes in high-uncertainty regions. Subsequently, a local-image restoration self-supervised learning module is introduced to further enhance the characterization of the high uncertainty regions. In this module, informative anatomical contexts are actually learned to recover randomly corrupted appearance patterns in those regions.
zh

[CV-66] owards Scalable IoT Deployment for Visual Anomaly Detection via Efficient Compression

【速读】:该论文旨在解决在工业场景中部署深度学习模型进行视觉异常检测(Visual Anomaly Detection, VAD)时所面临的计算能力和带宽限制问题。解决方案的关键在于采用紧凑且高效的处理策略,通过评估多种数据压缩技术,在系统延迟与检测精度之间找到平衡,从而实现在边缘设备上有效执行VAD任务。实验结果表明,该方法能够在保持较高异常检测性能的同时实现显著的数据压缩。

链接: https://arxiv.org/abs/2505.07119
作者: Arianna Stropeni,Francesco Borsatti,Manuel Barusco,Davide Dalle Pezze,Marco Fabris,Gian Antonio Susto
机构: University of Padova(帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visual Anomaly Detection (VAD) is a key task in industrial settings, where minimizing waste and operational costs is essential. Deploying deep learning models within Internet of Things (IoT) environments introduces specific challenges due to the limited computational power and bandwidth of edge devices. This study investigates how to perform VAD effectively under such constraints by leveraging compact and efficient processing strategies. We evaluate several data compression techniques, examining the trade-off between system latency and detection accuracy. Experiments on the MVTec AD benchmark demonstrate that significant compression can be achieved with minimal loss in anomaly detection performance compared to uncompressed data.
zh

[CV-67] DeepSORT-Driven Visual Tracking Approach for Gesture Recognition in Interactive Systems

【速读】:该论文旨在解决智能人机交互中手势识别与跟踪的准确性与实时性问题,特别是在复杂场景下的多目标跟踪挑战。其解决方案的关键在于基于DeepSORT算法,通过结合卡尔曼滤波与深度学习特征提取方法,实现动态环境中目标的精准跟踪。该方法在处理目标遮挡、运动模糊以及多目标环境方面表现出色,从而提升了手势识别的稳定性和用户交互的流畅性。

链接: https://arxiv.org/abs/2505.07110
作者: Tong Zhang,Fenghua Shao,Runsheng Zhang,Yifan Zhuang,Liuqingqing Yang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Based on the DeepSORT algorithm, this study explores the application of visual tracking technology in intelligent human-computer interaction, especially in the field of gesture recognition and tracking. With the rapid development of artificial intelligence and deep learning technology, visual-based interaction has gradually replaced traditional input devices and become an important way for intelligent systems to interact with users. The DeepSORT algorithm can achieve accurate target tracking in dynamic environments by combining Kalman filters and deep learning feature extraction methods. It is especially suitable for complex scenes with multi-target tracking and fast movements. This study experimentally verifies the superior performance of DeepSORT in gesture recognition and tracking. It can accurately capture and track the user’s gesture trajectory and is superior to traditional tracking methods in terms of real-time and accuracy. In addition, this study also combines gesture recognition experiments to evaluate the recognition ability and feedback response of the DeepSORT algorithm under different gestures (such as sliding, clicking, and zooming). The experimental results show that DeepSORT can not only effectively deal with target occlusion and motion blur but also can stably track in a multi-target environment, achieving a smooth user interaction experience. Finally, this paper looks forward to the future development direction of intelligent human-computer interaction systems based on visual tracking and proposes future research focuses such as algorithm optimization, data fusion, and multimodal interaction in order to promote a more intelligent and personalized interactive experience. Keywords-DeepSORT, visual tracking, gesture recognition, human-computer interaction
zh

[CV-68] Privacy of Groups in Dense Street Imagery

【速读】:该论文试图解决由空间和时间密集的街景图像(DSI)数据集引发的隐私问题,特别是针对通过去标识化处理后的数据仍可能被用于推断敏感群体成员身份的风险。解决方案的关键在于通过渗透测试验证了在高密度数据和人工智能技术进步的背景下,即使对行人面部和车牌进行模糊处理,仍然可以轻易推断出敏感的群体归属信息。研究提出了DSI中可识别群体的类型学,并基于情境完整性理论分析了隐私影响,最终提出了针对DSI数据使用者的可行建议。

链接: https://arxiv.org/abs/2505.07085
作者: Matt Franchi,Hauke Sandhaus,Madiha Zahrah Choksi,Severin Engelmann,Wendy Ju,Helen Nissenbaum
机构: Cornell University, Cornell Tech(康奈尔大学,康奈尔技术学院); Jacobs Technion-Cornell Institute(雅各布斯技术-康奈尔研究所)
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注: To appear in ACM Conference on Fairness, Accountability, and Transparency (FAccT) '25

点击查看摘要

Abstract:Spatially and temporally dense street imagery (DSI) datasets have grown unbounded. In 2024, individual companies possessed around 3 trillion unique images of public streets. DSI data streams are only set to grow as companies like Lyft and Waymo use DSI to train autonomous vehicle algorithms and analyze collisions. Academic researchers leverage DSI to explore novel approaches to urban analysis. Despite good-faith efforts by DSI providers to protect individual privacy through blurring faces and license plates, these measures fail to address broader privacy concerns. In this work, we find that increased data density and advancements in artificial intelligence enable harmful group membership inferences from supposedly anonymized data. We perform a penetration test to demonstrate how easily sensitive group affiliations can be inferred from obfuscated pedestrians in 25,232,608 dashcam images taken in New York City. We develop a typology of identifiable groups within DSI and analyze privacy implications through the lens of contextual integrity. Finally, we discuss actionable recommendations for researchers working with data from DSI providers.
zh

[CV-69] Discovering Concept Directions from Diffusion-based Counterfactuals via Latent Clustering

【速读】:该论文试图解决现有基于概念的解释方法在计算复杂度高且难以高效捕捉复杂语义概念的问题。其解决方案的关键在于提出Concept Directions via Latent Clustering (CDLC),通过聚类从真实图像与扩散生成的反事实图像对中提取的潜在差异向量,从而获得全局、类别特定的概念方向,相较于之前需要遍历潜在空间维度的CDCT方法,显著降低了计算复杂度,并实现了跨潜在维度的多维语义概念提取。

链接: https://arxiv.org/abs/2505.07073
作者: Payal Varshney,Adriano Lucieri,Christoph Balada,Andreas Dengel,Sheraz Ahmed
机构: DFKI(德国弗劳恩霍夫研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Concept-based explanations have emerged as an effective approach within Explainable Artificial Intelligence, enabling interpretable insights by aligning model decisions with human-understandable concepts. However, existing methods rely on computationally intensive procedures and struggle to efficiently capture complex, semantic concepts. Recently, the Concept Discovery through Latent Diffusion-based Counterfactual Trajectories (CDCT) framework, introduced by Varshney et al. (2025), attempts to identify concepts via dimension-wise traversal of the latent space of a Variational Autoencoder trained on counterfactual trajectories. Extending the CDCT framework, this work introduces Concept Directions via Latent Clustering (CDLC), which extracts global, class-specific concept directions by clustering latent difference vectors derived from factual and diffusion-generated counterfactual image pairs. CDLC substantially reduces computational complexity by eliminating the exhaustive latent dimension traversal required in CDCT and enables the extraction of multidimensional semantic concepts encoded across the latent dimensions. This approach is validated on a real-world skin lesion dataset, demonstrating that the extracted concept directions align with clinically recognized dermoscopic features and, in some cases, reveal dataset-specific biases or unknown biomarkers. These results highlight that CDLC is interpretable, scalable, and applicable across high-stakes domains and diverse data modalities.
zh

[CV-70] Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution

【速读】:该论文旨在解决基于扩散的图像超分辨率(Diffusion-based Image Super-Resolution)方法在处理复杂语义区域时效率受限的问题,尽管现有方法通过确定性采样过程将推理步骤从15次减少到单次,从而提升了推理速度,但在面对语义复杂的区域时仍存在不足。其解决方案的关键在于提出SAMSR框架,该框架通过引入语义分割掩码指导采样过程,具体包括SAM-Noise模块,用于利用分割掩码精炼高斯噪声以保留空间和语义特征,以及像素级采样策略,根据像素级语义权重动态调整残差传输率和噪声强度,优先处理语义丰富的区域。此外,还提出了语义一致性损失以增强模型训练效果。

链接: https://arxiv.org/abs/2505.07071
作者: Zihang Liu,Zhenyu Zhang,Hao Tang
机构: Beijing Institute of Technology (北京理工大学); Nanjing University (南京大学); School of Computer Science, Peking University (北京大学计算机学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based image super-resolution (SR) methods have demonstrated remarkable performance. Recent advancements have introduced deterministic sampling processes that reduce inference from 15 iterative steps to a single step, thereby significantly improving the inference speed of existing diffusion models. However, their efficiency remains limited when handling complex semantic regions due to the single-step inference. To address this limitation, we propose SAMSR, a semantic-guided diffusion framework that incorporates semantic segmentation masks into the sampling process. Specifically, we introduce the SAM-Noise Module, which refines Gaussian noise using segmentation masks to preserve spatial and semantic features. Furthermore, we develop a pixel-wise sampling strategy that dynamically adjusts the residual transfer rate and noise strength based on pixel-level semantic weights, prioritizing semantically rich regions during the diffusion process. To enhance model training, we also propose a semantic consistency loss, which aligns pixel-wise semantic weights between predictions and ground truth. Extensive experiments on both real-world and synthetic datasets demonstrate that SAMSR significantly improves perceptual quality and detail recovery, particularly in semantically complex images. Our code is released at this https URL.
zh

[CV-71] Seed1.5-VL Technical Report

【速读】:该论文旨在解决多模态理解与推理的通用性问题,特别是提升模型在视觉、语言及两者结合任务中的表现。其解决方案的关键在于构建一个由532M参数视觉编码器和20B活跃参数的专家混合(Mixture-of-Experts, MoE)大语言模型组成的架构,该架构在多个公共基准测试中实现了最先进的性能,并在以代理为中心的任务中超越了现有主流多模态系统。

链接: https://arxiv.org/abs/2505.07062
作者: Dong Guo,Faming Wu,Feida Zhu,Fuxing Leng,Guang Shi,Haobin Chen,Haoqi Fan,Jian Wang,Jianyu Jiang,Jiawei Wang,Jingji Chen,Jingjia Huang,Kang Lei,Liping Yuan,Lishu Luo,Pengfei Liu,Qinghao Ye,Rui Qian,Shen Yan,Shixiong Zhao,Shuai Peng,Shuangye Li,Sihang Yuan,Sijin Wu,Tianheng Cheng,Weiwei Liu,Wenqian Wang,Xianhan Zeng,Xiao Liu,Xiaobo Qin,Xiaohan Ding,Xiaojun Xiao,Xiaoying Zhang,Xuanwei Zhang,Xuehan Xiong,Yanghua Peng,Yangrui Chen,Yanwei Li,Yanxu Hu,Yi Lin,Yiyuan Hu,Yiyuan Zhang,Youbin Wu,Yu Li,Yudong Liu,Yue Ling,Yujia Qin,Zanbo Wang,Zhiwu He,Aoxue Zhang,Bairen Yi,Bencheng Liao,Can Huang,Can Zhang,Chaorui Deng,Chaoyi Deng,Cheng Lin,Cheng Yuan,Chenggang Li,Chenhui Gou,Chenwei Lou,Chengzhi Wei,Chundian Liu,Chunyuan Li,Deyao Zhu,Donghong Zhong,Feng Li,Feng Zhang,Gang Wu,Guodong Li,Guohong Xiao,Haibin Lin,Haihua Yang,Haoming Wang,Heng Ji,Hongxiang Hao,Hui Shen,Huixia Li,Jiahao Li,Jialong Wu,Jianhua Zhu,Jianpeng Jiao,Jiashi Feng,Jiaze Chen,Jianhui Duan,Jihao Liu,Jin Zeng,Jingqun Tang,Jingyu Sun,Joya Chen,Jun Long,Junda Feng,Junfeng Zhan,Junjie Fang,Junting Lu,Kai Hua,Kai Liu,Kai Shen,Kaiyuan Zhang,Ke Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at this https URL (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)
zh

[CV-72] DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models

【速读】:该论文旨在解决基于扩散模型的视频生成任务中视频编辑所面临的挑战,特别是现有方法在计算成本高或性能不足方面的局限性。其关键解决方案是提出DAPE,一种高效且低成本的两阶段参数高效微调(PEFT)框架,通过第一阶段的高效范数调优增强视频的时间一致性,第二阶段引入视觉友好的适配器提升视觉质量。

链接: https://arxiv.org/abs/2505.07057
作者: Junhao Xia,Chaoyang Zhang,Yecheng Zhang,Chengyang Zhou,Zhichang Wang,Bochun Liu,Dongshuo Yin
机构: Tsinghua University(清华大学); Duke University(杜克大学); Peking University(北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video generation based on diffusion models presents a challenging multimodal task, with video editing emerging as a pivotal direction in this field. Recent video editing approaches primarily fall into two categories: training-required and training-free methods. While training-based methods incur high computational costs, training-free alternatives often yield suboptimal performance. To address these limitations, we propose DAPE, a high-quality yet cost-effective two-stage parameter-efficient fine-tuning (PEFT) framework for video editing. In the first stage, we design an efficient norm-tuning method to enhance temporal consistency in generated videos. The second stage introduces a vision-friendly adapter to improve visual quality. Additionally, we identify critical shortcomings in existing benchmarks, including limited category diversity, imbalanced object distribution, and inconsistent frame counts. To mitigate these issues, we curate a large dataset benchmark comprising 232 videos with rich annotations and 6 editing prompts, enabling objective and comprehensive evaluation of advanced methods. Extensive experiments on existing datasets (BalanceCC, LOVEU-TGVE, RAVE) and our proposed benchmark demonstrate that DAPE significantly improves temporal coherence and text-video alignment while outperforming previous state-of-the-art approaches.
zh

[CV-73] Depth-Sensitive Soft Suppression with RGB-D Inter-Modal Stylization Flow for Domain Generalization Semantic Segmentation

【速读】:该论文旨在解决无监督域适应(Unsupervised Domain Adaptation, UDA)中由于缺乏目标域数据而导致的域间差异问题,同时借鉴领域泛化(Domain Generalization, DG)的优势,无需依赖目标域数据即可提升模型的泛化能力。其解决方案的关键在于提出一种新颖的框架——基于RGB-D跨模态风格流的深度敏感软抑制(Depth-Sensitive Soft Suppression with RGB-D inter-modal stylization flow, DSSS),通过生成风格化的深度图进行敏感区域检测,并设计类别感知的软空间敏感性抑制机制,以提取更具域不变性的特征,同时引入RGB-D软对齐损失以保留深度信息的独特性。

链接: https://arxiv.org/abs/2505.07050
作者: Binbin Wei,Yuhang Zhang,Shishun Tian,Muxin Liao,Wei Li,Wenbin Zou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unsupervised Domain Adaptation (UDA) aims to align source and target domain distributions to close the domain gap, but still struggles with obtaining the target data. Fortunately, Domain Generalization (DG) excels without the need for any target data. Recent works expose that depth maps contribute to improved generalized performance in the UDA tasks, but they ignore the noise and holes in depth maps due to device and environmental factors, failing to sufficiently and effectively learn domain-invariant representation. Although high-sensitivity region suppression has shown promising results in learning domain-invariant features, existing methods cannot be directly applicable to depth maps due to their unique characteristics. Hence, we propose a novel framework, namely Depth-Sensitive Soft Suppression with RGB-D inter-modal stylization flow (DSSS), focusing on learning domain-invariant features from depth maps for the DG semantic segmentation. Specifically, we propose the RGB-D inter-modal stylization flow to generate stylized depth maps for sensitivity detection, cleverly utilizing RGB information as the stylization source. Then, a class-wise soft spatial sensitivity suppression is designed to identify and emphasize non-sensitive depth features that contain more domain-invariant information. Furthermore, an RGB-D soft alignment loss is proposed to ensure that the stylized depth maps only align part of the RGB features while still retaining the unique depth information. To our best knowledge, our DSSS framework is the first work to integrate RGB and Depth information in the multi-class DG semantic segmentation task. Extensive experiments over multiple backbone networks show that our framework achieves remarkable performance improvement.
zh

[CV-74] Differentiable NMS via Sinkhorn Matching for End-to-End Fabric Defect Detection

【速读】:该论文旨在解决纺织品缺陷检测中的两个核心问题:传统非极大值抑制(Non-Maximum Suppression, NMS)破坏了梯度流,阻碍了端到端学习;以及在工业规模下获取像素级标注的成本过高。其解决方案的关键在于提出一种可微分的NMS框架,将NMS重新建模为一个通过Sinkhorn-Knopp算法求解的可微分二部匹配问题,从而保持网络中梯度流的连续性。此外,该方法通过整合提议质量、特征相似性和空间关系来应对纺织品缺陷的不规则形态和模糊边界,并引入熵约束的掩码细化机制以提升定位精度。

链接: https://arxiv.org/abs/2505.07040
作者: Zhengyang Lu,Bingjie Lu,Weifan Wang,Feng Wang
机构: Jiangnan University (江南大学); Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fabric defect detection confronts two fundamental challenges. First, conventional non-maximum suppression disrupts gradient flow, which hinders genuine end-to-end learning. Second, acquiring pixel-level annotations at industrial scale is prohibitively costly. Addressing these limitations, we propose a differentiable NMS framework for fabric defect detection that achieves superior localization precision through end-to-end optimization. We reformulate NMS as a differentiable bipartite matching problem solved through the Sinkhorn-Knopp algorithm, maintaining uninterrupted gradient flow throughout the network. This approach specifically targets the irregular morphologies and ambiguous boundaries of fabric defects by integrating proposal quality, feature similarity, and spatial relationships. Our entropy-constrained mask refinement mechanism further enhances localization precision through principled uncertainty modeling. Extensive experiments on the Tianchi fabric defect dataset demonstrate significant performance improvements over existing methods while maintaining real-time speeds suitable for industrial deployment. The framework exhibits remarkable adaptability across different architectures and generalizes effectively to general object detection tasks.
zh

[CV-75] MarkMatch: Same-Hand Stuffing Detection

【速读】:该论文试图解决的问题是检测纸质选票上的标记是否由同一人书写,即判断两份选票的标记是否为同一只手所写。与之前最先进的方法BubbleSig(一种基于孤立标记对的二分类方法)不同,MarkMatch的关键解决方案是利用对比学习对查询标记与数据库中的标记进行风格相似性排序,通过密集批次相似性矩阵和双损失目标进行训练,使模型能够学习细微的手写差异,并在手写变化和视觉噪声下提升泛化能力,同时通过对角监督增强对真实匹配的高置信度。该方法在F1分数上达到了0.943,优于BubbleSig的最佳性能。

链接: https://arxiv.org/abs/2505.07032
作者: Fei Zhao,Runlin Zhang,Chengcui Zhang,Nitesh Saxena
机构: The University of Alabama at Birmingham (阿拉巴马大学伯明翰分校); University of Waterloo (滑铁卢大学); Texas A&M University (德克萨斯A&M大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present MarkMatch, a retrieval system for detecting whether two paper ballot marks were filled by the same hand. Unlike the previous SOTA method BubbleSig, which used binary classification on isolated mark pairs, MarkMatch ranks stylistic similarity between a query mark and a mark in the database using contrastive learning. Our model is trained with a dense batch similarity matrix and a dual loss objective. Each sample is contrasted against many negatives within each batch, enabling the model to learn subtle handwriting difference and improve generalization under handwriting variation and visual noise, while diagonal supervision reinforces high confidence on true matches. The model achieves an F1 score of 0.943, surpassing BubbleSig’s best performance. MarkMatch also integrates Segment Anything Model for flexible mark extraction via box- or point-based prompts. The system offers election auditors a practical tool for visual, non-biometric investigation of suspicious ballots.
zh

[CV-76] A Vision-Language Foundation Model for Leaf Disease Identification

【速读】:该论文旨在解决农业领域中叶片病害识别任务中图像与文本模态融合不足以及预训练数据缺乏领域特异性信息的问题。其解决方案的关键在于提出SCOLD(Soft-target COntrastive learning for Leaf Disease identification),一个面向农业任务的上下文感知视觉-语言基础模型,通过使用包含186,000对植物叶片图像及对应症状描述的多样化语料库进行任务无关的预训练,利用上下文软目标平滑标签以缓解对比学习中的过度自信问题,从而提升模型在细粒度分类任务中的泛化能力和鲁棒性。

链接: https://arxiv.org/abs/2505.07019
作者: Khang Nguyen Quoc,Lan Le Thi Thu,Luyl-Da Quach
机构: Korea University (韩国科学技术院); FPT University (FPT大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Leaf disease identification plays a pivotal role in smart agriculture. However, many existing studies still struggle to integrate image and textual modalities to compensate for each other’s limitations. Furthermore, many of these approaches rely on pretraining with constrained datasets such as ImageNet, which lack domain-specific information. We propose SCOLD (Soft-target COntrastive learning for Leaf Disease identification), a context-aware vision-language foundation model tailored to address these challenges for agricultural tasks. SCOLD is developed using a diverse corpus of plant leaf images and corresponding symptom descriptions, comprising over 186,000 image-caption pairs aligned with 97 unique concepts. Through task-agnostic pretraining, SCOLD leverages contextual soft targets to mitigate overconfidence in contrastive learning by smoothing labels, thereby improving model generalization and robustness on fine-grained classification tasks. Experimental results demonstrate that SCOLD outperforms existing vision-language models such as OpenAI-CLIP-L, BioCLIP, and SigLIP2 across several benchmarks, including zero-shot and few-shot classification, image-text retrieval, and image classification, while maintaining a competitive parameter footprint. Ablation studies further highlight SCOLD’s effectiveness in contrast to its counterparts. The proposed approach significantly advances the agricultural vision-language foundation model, offering strong performance with minimal or no supervised fine-tuning. This work lays a solid groundwork for future research on models trained with long-form and simplified contexts, tasks involving class ambiguity, and multi-modal systems for intelligent plant disease diagnostics. The code for this study is available at this https URL
zh

[CV-77] Efficient and Robust Multidimensional Attention in Remote Physiological Sensing through Target Signal Constrained Factorization

【速读】:该论文旨在解决远程生理传感中由于领域迁移(domain shifts)导致的模型泛化能力不足的问题,特别是在光照条件、相机参数、头部运动、面部姿态和生理状态变化等实际应用场景下的性能下降问题。其解决方案的关键在于引入了目标信号约束分解模块(Target Signal Constrained Factorization module, TSFM),该模块通过显式整合生理信号特征作为分解约束,实现了更精确的特征提取,从而提升了模型在跨数据集评估中的泛化能力。基于TSFM,作者进一步提出了MMRPhys架构,该架构采用高效的双分支3D-CNN结构,能够同时估计光体积描记图(rPPG)和呼吸信号(rRSP),并在多个基准数据集上验证了其优越性。

链接: https://arxiv.org/abs/2505.07013
作者: Jitesh Joshi,Youngjun Cho
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 25 pages, 6 figures

点击查看摘要

Abstract:Remote physiological sensing using camera-based technologies offers transformative potential for non-invasive vital sign monitoring across healthcare and human-computer interaction domains. Although deep learning approaches have advanced the extraction of physiological signals from video data, existing methods have not been sufficiently assessed for their robustness to domain shifts. These shifts in remote physiological sensing include variations in ambient conditions, camera specifications, head movements, facial poses, and physiological states which often impact real-world performance significantly. Cross-dataset evaluation provides an objective measure to assess generalization capabilities across these domain shifts. We introduce Target Signal Constrained Factorization module (TSFM), a novel multidimensional attention mechanism that explicitly incorporates physiological signal characteristics as factorization constraints, allowing more precise feature extraction. Building on this innovation, we present MMRPhys, an efficient dual-branch 3D-CNN architecture designed for simultaneous multitask estimation of photoplethysmography (rPPG) and respiratory (rRSP) signals from multimodal RGB and thermal video inputs. Through comprehensive cross-dataset evaluation on five benchmark datasets, we demonstrate that MMRPhys with TSFM significantly outperforms state-of-the-art methods in generalization across domain shifts for rPPG and rRSP estimation, while maintaining a minimal inference latency suitable for real-time applications. Our approach establishes new benchmarks for robust multitask and multimodal physiological sensing and offers a computationally efficient framework for practical deployment in unconstrained environments. The web browser-based application featuring on-device real-time inference of MMRPhys model is available at this https URL
zh

[CV-78] MELLM : Exploring LLM -Powered Micro-Expression Understanding Enhanced by Subtle Motion Perception

【速读】:该论文试图解决当前自动微表情识别(MER)研究中主要关注离散情绪分类,而忽视了对微表情细微动态运动和内在情感线索的深入分析的问题。解决方案的关键在于提出一种新型的微表情大语言模型(MELLM),该模型结合了多模态大语言模型(MLLM)的强大推理能力与细微面部运动感知策略,通过构建可解释的运动增强颜色图来显式引导模型关注运动敏感区域,并采用专门的微调策略以提升模型对微表情的视觉感知能力。

链接: https://arxiv.org/abs/2505.07007
作者: Zhengye Zhang,Sirui Zhao,Shifeng Liu,Shukang Yin,Xinglong Mao,Tong Xu,Enhong Chen
机构: School of Computer Science and Technology, University of Science and Technology of China (中国科学技术大学计算机科学与技术学院); School of Artificial Intelligence and Data Science, University of Science and Technology of China (中国科学技术大学人工智能与数据科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Micro-expressions (MEs) are crucial psychological responses with significant potential for affective computing. However, current automatic micro-expression recognition (MER) research primarily focuses on discrete emotion classification, neglecting a convincing analysis of the subtle dynamic movements and inherent emotional cues. The rapid progress in multimodal large language models (MLLMs), known for their strong multimodal comprehension and language generation abilities, offers new possibilities. MLLMs have shown success in various vision-language tasks, indicating their potential to understand MEs comprehensively, including both fine-grained motion patterns and underlying emotional semantics. Nevertheless, challenges remain due to the subtle intensity and short duration of MEs, as existing MLLMs are not designed to capture such delicate frame-level facial dynamics. In this paper, we propose a novel Micro-Expression Large Language Model (MELLM), which incorporates a subtle facial motion perception strategy with the strong inference capabilities of MLLMs, representing the first exploration of MLLMs in the domain of ME analysis. Specifically, to explicitly guide the MLLM toward motion-sensitive regions, we construct an interpretable motion-enhanced color map by fusing onset-apex optical flow dynamics with the corresponding grayscale onset frame as the model input. Additionally, specialized fine-tuning strategies are incorporated to further enhance the model’s visual perception of MEs. Furthermore, we construct an instruction-description dataset based on Facial Action Coding System (FACS) annotations and emotion labels to train our MELLM. Comprehensive evaluations across multiple benchmark datasets demonstrate that our model exhibits superior robustness and generalization capabilities in ME understanding (MEU). Code is available at this https URL.
zh

[CV-79] CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation SIGGRAPH2025

【速读】:该论文试图解决现有3D生成方法仅依赖输入图像或文本提示生成3D模型,导致无法对生成模型的各个组件进行灵活控制的问题。其解决方案的关键在于提出一种名为CMD(Conditional Multiview Diffusion)的方法,该方法将3D生成建模为条件多视角扩散模型,通过已知部分作为条件来生成或编辑新增组件,从而实现3D模型的分部件生成与局部编辑。

链接: https://arxiv.org/abs/2505.07003
作者: Peng Li,Suizhi Ma,Jialiang Chen,Yuan Liu,Chongyi Zhang,Wei Xue,Wenhan Luo,Alla Sheffer,Wenping Wang,Yike Guo
机构: The Hong Kong University of Science and Technology(香港科技大学); University of British Columbia(不列颠哥伦比亚大学); Texas A&M University(得克萨斯A&M大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Siggraph 2025

点击查看摘要

Abstract:Recently, 3D generation methods have shown their powerful ability to automate 3D model creation. However, most 3D generation methods only rely on an input image or a text prompt to generate a 3D model, which lacks the control of each component of the generated 3D model. Any modifications of the input image lead to an entire regeneration of the 3D models. In this paper, we introduce a new method called CMD that generates a 3D model from an input image while enabling flexible local editing of each component of the 3D model. In CMD, we formulate the 3D generation as a conditional multiview diffusion model, which takes the existing or known parts as conditions and generates the edited or added components. This conditional multiview diffusion model not only allows the generation of 3D models part by part but also enables local editing of 3D models according to the local revision of the input image without changing other 3D parts. Extensive experiments are conducted to demonstrate that CMD decomposes a complex 3D generation task into multiple components, improving the generation quality. Meanwhile, CMD enables efficient and flexible local editing of a 3D model by just editing one rendered image.
zh

[CV-80] Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models

【速读】:该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在医学领域中产生的幻觉问题,即模型生成的描述与实际视觉内容不一致的现象。其解决方案的关键在于构建一个高质量的多模态胃肠道(Gastrointestinal, GI)图像-文本数据集Gut-VLM,并采用一种称为“幻觉感知微调”(hallucination-aware finetuning)的方法,通过微调模型以检测和纠正幻觉,而非仅专注于生成描述性报告。

链接: https://arxiv.org/abs/2505.07001
作者: Bidur Khanal,Sandesh Pokhrel,Sanjay Bhandari,Ramesh Rana,Nikesh Shrestha,Ram Bahadur Gurung,Cristian Linte,Angus Watson,Yash Raj Shrestha,Binod Bhattarai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) are becoming increasingly popular in the medical domain, bridging the gap between medical images and clinical language. Existing VLMs demonstrate an impressive ability to comprehend medical images and text queries to generate detailed, descriptive diagnostic medical reports. However, hallucination–the tendency to generate descriptions that are inconsistent with the visual content–remains a significant issue in VLMs, with particularly severe implications in the medical field. To facilitate VLM research on gastrointestinal (GI) image analysis and study hallucination, we curate a multimodal image-text GI dataset: Gut-VLM. This dataset is created using a two-stage pipeline: first, descriptive medical reports of Kvasir-v2 images are generated using ChatGPT, which introduces some hallucinated or incorrect texts. In the second stage, medical experts systematically review these reports, and identify and correct potential inaccuracies to ensure high-quality, clinically reliable annotations. Unlike traditional datasets that contain only descriptive texts, our dataset also features tags identifying hallucinated sentences and their corresponding corrections. A common approach to reducing hallucination in VLM is to finetune the model on a small-scale, problem-specific dataset. However, we take a different strategy using our dataset. Instead of finetuning the VLM solely for generating textual reports, we finetune it to detect and correct hallucinations, an approach we call hallucination-aware finetuning. Our results show that this approach is better than simply finetuning for descriptive report generation. Additionally, we conduct an extensive evaluation of state-of-the-art VLMs across several metrics, establishing a benchmark. GitHub Repo: this https URL.
zh

[CV-81] Replay-Based Continual Learning with Dual-Layered Distillation and a Streamlined U-Net for Efficient Text-to-Image Generation

【速读】:该论文旨在解决文本到图像扩散模型在计算资源需求高导致可访问性和可扩展性受限的问题。其解决方案的关键在于提出KDC-Diff框架,该框架通过一个参数量仅为原始U-Net(482M)一半的精简U-Net架构降低模型复杂度,并采用双层知识蒸馏策略,将教师模型的语义和结构信息有效地传递给轻量级学生模型,同时减少质量退化;此外,还引入基于重放的持续学习方法以缓解灾难性遗忘,从而在极低计算资源下实现优异的生成性能。

链接: https://arxiv.org/abs/2505.06995
作者: Md. Naimur Asif Borno,Md Sakib Hossain Shovon,Asmaa Soliman Al-Moisheer,Mohammad Ali Moni
机构: University of Queensland (昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in text-to-image diffusion models are hindered by high computational demands, limiting accessibility and scalability. This paper introduces KDC-Diff, a novel stable diffusion framework that enhances efficiency while maintaining image quality. KDC-Diff features a streamlined U-Net architecture with nearly half the parameters of the original U-Net (482M), significantly reducing model complexity. We propose a dual-layered distillation strategy to ensure high-fidelity generation, transferring semantic and structural insights from a teacher to a compact student model while minimizing quality degradation. Additionally, replay-based continual learning is integrated to mitigate catastrophic forgetting, allowing the model to retain prior knowledge while adapting to new data. Despite operating under extremely low computational resources, KDC-Diff achieves state-of-the-art performance on the Oxford Flowers and Butterflies Moths 100 Species datasets, demonstrating competitive metrics such as FID, CLIP, and LPIPS. Moreover, it significantly reduces inference time compared to existing models. These results establish KDC-Diff as a highly efficient and adaptable solution for text-to-image generation, particularly in computationally constrained environments.
zh

[CV-82] chnical Report for ICRA 2025 GOOSE 2D Semantic Segmentation Challenge: Leverag ing Color Shift Correction RoPE-Swin Backbone and Quantile-based Label Denoising Strategy for Robust Outdoor Scene Understanding

【速读】:该论文旨在解决户外场景的语义分割问题,特别是在真实环境条件下将图像划分为九个语义类别。其解决方案的关键在于结合了增强的Swin Transformer主干网络(引入了Rotary Position Embedding以提升空间泛化能力)、颜色偏移估计与校正模块(用于补偿自然环境中光照不一致的问题),以及基于分位数的去噪策略(通过降低高误差像素的影响来提高训练稳定性)。这些技术的综合应用显著提升了语义分割的鲁棒性和性能。

链接: https://arxiv.org/abs/2505.06991
作者: Chih-Chung Hsu,I-Hsuan Wu,Wen-Hai Tseng,Ching-Heng Cheng,Ming-Hsuan Wu,Jin-Hui Jiang,Yu-Jou Hsiao
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); National Cheng Kung University (国立成功大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This report presents our semantic segmentation framework developed by team ACVLAB for the ICRA 2025 GOOSE 2D Semantic Segmentation Challenge, which focuses on parsing outdoor scenes into nine semantic categories under real-world conditions. Our method integrates a Swin Transformer backbone enhanced with Rotary Position Embedding (RoPE) for improved spatial generalization, alongside a Color Shift Estimation-and-Correction module designed to compensate for illumination inconsistencies in natural environments. To further improve training stability, we adopt a quantile-based denoising strategy that downweights the top 2.5% of highest-error pixels, treating them as noise and suppressing their influence during optimization. Evaluated on the official GOOSE test set, our approach achieved a mean Intersection over Union (mIoU) of 0.848, demonstrating the effectiveness of combining color correction, positional encoding, and error-aware denoising in robust semantic segmentation.
zh

[CV-83] BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation

【速读】:该论文旨在解决定制化文本到视频(CT2V)生成中存在的一致性不足与细节丢失问题。现有零样本CT2V方法泛化能力差,而基于微调的文本到图像模型结合时间运动模块则容易导致结构和纹理信息的损失。论文提出的关键解决方案是自回归结构与纹理传播模块(STPM),该模块从参考主体中提取关键结构和纹理特征,并逐帧自回归注入以增强视频的一致性,同时引入测试时奖励优化(TTRO)方法进一步优化细节。

链接: https://arxiv.org/abs/2505.06985
作者: Panwen Hu,Jiehui Huang,Qiang Sun,Xiaodan Liang
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Sun Yat-sen University (中山大学); University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Both zero-shot and tuning-based customized text-to-image (CT2I) generation have made significant progress for storytelling content creation. In contrast, research on customized text-to-video (CT2V) generation remains relatively limited. Existing zero-shot CT2V methods suffer from poor generalization, while another line of work directly combining tuning-based T2I models with temporal motion modules often leads to the loss of structural and texture information. To bridge this gap, we propose an autoregressive structure and texture propagation module (STPM), which extracts key structural and texture features from the reference subject and injects them autoregressively into each video frame to enhance consistency. Additionally, we introduce a test-time reward optimization (TTRO) method to further refine fine-grained details. Quantitative and qualitative experiments validate the effectiveness of STPM and TTRO, demonstrating improvements of 7.8 and 13.1 in CLIP-I and DINO consistency metrics over the baseline, respectively.
zh

[CV-84] Federated Learning with LoRA Optimized DeiT and Multiscale Patch Embedding for Secure Eye Disease Recognition

【速读】:该论文旨在解决医学影像疾病检测中面临的标注数据有限、空间特征分析不足、数据安全问题以及训练框架效率低下等挑战。其解决方案的关键在于提出一种基于数据高效图像变换器(DeIT)的方法,通过多尺度块嵌入实现更优的特征提取,并采用分层加权随机采样缓解类别不平衡问题;同时结合LoRA增强的变换器编码器、知识蒸馏框架和联邦学习,以提升模型效率与数据安全性。

链接: https://arxiv.org/abs/2505.06982
作者: Md. Naimur Asif Borno,Md Sakib Hossain Shovon,MD Hanif Sikder,Iffat Firozy Rimi,Tahani Jaser Alahmadi,Mohammad Ali Moni
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent progress in image-based medical disease detection encounters challenges such as limited annotated data sets, inadequate spatial feature analysis, data security issues, and inefficient training frameworks. This study introduces a data-efficient image transformer (DeIT)-based approach that overcomes these challenges by utilizing multiscale patch embedding for better feature extraction and stratified weighted random sampling to address class imbalance. The model also incorporates a LoRA-enhanced transformer encoder, a distillation framework, and federated learning for decentralized training, improving both efficiency and data security. Consequently, it achieves state-of-the-art performance, with the highest AUC, F1 score, precision, minimal loss, and Top-5 accuracy. Additionally, Grad-CAM++ visualizations improve interpretability by highlighting critical pathological regions, enhancing the model’s clinical relevance. These results highlight the potential of this approach to advance AI-powered medical imaging and disease detection.
zh

[CV-85] VALISENS: A Validated Innovative Multi-Sensor System for Cooperative Automated Driving ITSC

【速读】:该论文旨在解决自动驾驶车辆在复杂现实场景中感知系统鲁棒性不足的问题,特别是在面对外部因素干扰时的感知可靠性。其解决方案的关键在于构建一个分布式的多传感器系统VALISENS,该系统通过融合车载与路侧的LiDAR、雷达、热成像相机和RGB相机等异构传感器数据,提升情境感知能力,并支持协同自动化驾驶。其中,热成像相机为感知 Vulnerable Road User (VRU) 提供了关键冗余,而与路侧传感器的融合则有效缓解了视觉遮挡问题并扩展了感知范围。

链接: https://arxiv.org/abs/2505.06980
作者: Lei Wan,Prabesh Gupta,Andreas Eich,Marcel Kettelgerdes,Hannan Ejaz Keen,Michael Klöppel-Gersdorf,Alexey Vinel
机构: XITASO GmbH(西塔索有限公司); LiangDao GmbH(梁道有限公司); Fraunhofer IVI(弗劳恩霍夫IVI); Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 11 figures, submitted to IEEE ITSC

点击查看摘要

Abstract:Perception is a core capability of automated vehicles and has been significantly advanced through modern sensor technologies and artificial intelligence. However, perception systems still face challenges in complex real-world scenarios. To improve robustness against various external factors, multi-sensor fusion techniques are essential, combining the strengths of different sensor modalities. With recent developments in Vehicle-to-Everything (V2X communication, sensor fusion can now extend beyond a single vehicle to a cooperative multi-agent system involving Connected Automated Vehicle (CAV) and intelligent infrastructure. This paper presents VALISENS, an innovative multi-sensor system distributed across multiple agents. It integrates onboard and roadside LiDARs, radars, thermal cameras, and RGB cameras to enhance situational awareness and support cooperative automated driving. The thermal camera adds critical redundancy for perceiving Vulnerable Road User (VRU), while fusion with roadside sensors mitigates visual occlusions and extends the perception range beyond the limits of individual vehicles. We introduce the corresponding perception module built on this sensor system, which includes object detection, tracking, motion forecasting, and high-level data fusion. The proposed system demonstrates the potential of cooperative perception in real-world test environments and lays the groundwork for future Cooperative Intelligent Transport Systems (C-ITS) applications.
zh

[CV-86] High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution

【速读】:该论文旨在解决图像超分辨率加速中的核心问题,即在减少计算量的同时保持模型性能和适应性。其解决方案的关键在于提出一种无需训练的自适应掩码模块,该模块通过高斯模糊减法提取高频成分,并利用K-means聚类自适应生成二值掩码,动态聚焦于需要密集处理的区域(如边缘和纹理)。该方法能够与CNN和Transformer架构无缝集成,通过稀疏计算或选择性处理降低推理时的计算复杂度,同时支持无需重新训练的掩码调整策略,具备对未见退化因素的鲁棒性。

链接: https://arxiv.org/abs/2505.06975
作者: Wei Shang,Dongwei Ren,Wanying Zhang,Pengfei Zhu,Qinghua Hu,Wangmeng Zuo
机构: Harbin Institute of Technology (哈尔滨工业大学); City University of Hong Kong (香港城市大学); Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures, 5 tables

点击查看摘要

Abstract:The primary challenge in accelerating image super-resolution lies in reducing computation while maintaining performance and adaptability. Motivated by the observation that high-frequency regions (e.g., edges and textures) are most critical for reconstruction, we propose a training-free adaptive masking module for acceleration that dynamically focuses computation on these challenging areas. Specifically, our method first extracts high-frequency components via Gaussian blur subtraction and adaptively generates binary masks using K-means clustering to identify regions requiring intensive processing. Our method can be easily integrated with both CNNs and Transformers. For CNN-based architectures, we replace standard 3 \times 3 convolutions with an unfold operation followed by 1 \times 1 convolutions, enabling pixel-wise sparse computation guided by the mask. For Transformer-based models, we partition the mask into non-overlapping windows and selectively process tokens based on their average values. During inference, unnecessary pixels or windows are pruned, significantly reducing computation. Moreover, our method supports dilation-based mask adjustment to control the processing scope without retraining, and is robust to unseen degradations (e.g., noise, compression). Extensive experiments on benchmarks demonstrate that our method reduces FLOPs by 24–43% for state-of-the-art models (e.g., CARN, SwinIR) while achieving comparable or better quantitative metrics. The source code is available at this https URL
zh

[CV-87] Reinforcement Learning-Based Monocular Vision Approach for Autonomous UAV Landing

【速读】:该论文试图解决无人机(Unmanned Aerial Vehicles, UAVs)在仅使用前视单目相机的情况下实现自主着陆的问题,从而避免对深度估计摄像头的依赖。解决方案的关键在于将着陆任务重新建模为优化问题,并利用着陆垫上特殊设计的透镜圆环的视觉特性变化,通过感知颜色和形状来估计高度和深度。该方法采用强化学习算法近似这些估计的函数,使无人机能够通过训练确定最佳着陆参数。

链接: https://arxiv.org/abs/2505.06963
作者: Tarik Houichime,Younes EL Amrani
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces an innovative approach for the autonomous landing of Unmanned Aerial Vehicles (UAVs) using only a front-facing monocular camera, therefore obviating the requirement for depth estimation cameras. Drawing on the inherent human estimating process, the proposed method reframes the landing task as an optimization problem. The UAV employs variations in the visual characteristics of a specially designed lenticular circle on the landing pad, where the perceived color and form provide critical information for estimating both altitude and depth. Reinforcement learning algorithms are utilized to approximate the functions governing these estimations, enabling the UAV to ascertain ideal landing settings via training. This method’s efficacy is assessed by simulations and experiments, showcasing its potential for robust and accurate autonomous landing without dependence on complex sensor setups. This research contributes to the advancement of cost-effective and efficient UAV landing solutions, paving the way for wider applicability across various fields.
zh

[CV-88] Boosting Cross-spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation ICRA

【速读】:该论文旨在解决热成像语义分割中由于缺乏标注数据而导致的领域适应问题,特别是在无监督领域自适应(Unsupervised Domain Adaptation, UDA)场景下,现有方法未能有效利用RGB与热成像之间的互补信息,从而导致性能下降。其解决方案的关键在于提出了一种新颖的掩码互学习策略,通过选择性地在不同光谱模型间传递结果并遮蔽不确定区域,促进互补信息的交换,同时引入了一种基于原型的自监督损失,以提升夜间场景下的热成像分割性能。

链接: https://arxiv.org/abs/2505.06951
作者: Seokjun Kwon,Jeongmin Shin,Namil Kim,Soonmin Hwang,Yukyung Choi
机构: Sejong University (世宗大学); NAVER LABS (NAVER 实验室); Hanyang University (汉阳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 7 pages, 4 figures, International Conference on Robotics and Automation(ICRA) 2025

点击查看摘要

Abstract:In autonomous driving, thermal image semantic segmentation has emerged as a critical research area, owing to its ability to provide robust scene understanding under adverse visual conditions. In particular, unsupervised domain adaptation (UDA) for thermal image segmentation can be an efficient solution to address the lack of labeled thermal datasets. Nevertheless, since these methods do not effectively utilize the complementary information between RGB and thermal images, they significantly decrease performance during domain adaptation. In this paper, we present a comprehensive study on cross-spectral UDA for thermal image semantic segmentation. We first propose a novel masked mutual learning strategy that promotes complementary information exchange by selectively transferring results between each spectral model while masking out uncertain regions. Additionally, we introduce a novel prototypical self-supervised loss designed to enhance the performance of the thermal segmentation model in nighttime scenarios. This approach addresses the limitations of RGB pre-trained networks, which cannot effectively transfer knowledge under low illumination due to the inherent constraints of RGB sensors. In experiments, our method achieves higher performance over previous UDA methods and comparable performance to state-of-the-art supervised methods.
zh

[CV-89] Unsupervised Learning for Class Distribution Mismatch ICML2025

【速读】:该论文试图解决类别分布不匹配(Class Distribution Mismatch, CDM)问题,即训练数据与目标任务之间的类别分布差异。传统方法通过设计分类器对已知类别进行分类,并将未知或新类别归入“其他”类别,但这些方法依赖于大量标记数据,限制了其适用性和性能。论文提出的解决方案为无监督学习的类别分布不匹配(Unsupervised Learning for Class Distribution Mismatch, UCDM),其关键在于利用未标记数据构建正负样本对进行分类器训练,并引入基于置信度的伪标签机制,以迭代方式将有价值的真实数据纳入训练过程,从而在无需依赖标记数据的情况下提升分类性能。

链接: https://arxiv.org/abs/2505.06948
作者: Pan Du,Wangbo Zhao,Xinai Lu,Nian Liu,Zhikai Li,Chaoyu Gong,Suyun Zhao,Hong Chen,Cuiping Li,Kai Wang,Yang You
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by ICML 2025

点击查看摘要

Abstract:Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an “other” category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability and performance. To address this, we propose Unsupervised Learning for Class Distribution Mismatch (UCDM), which constructs positive-negative pairs from unlabeled data for classifier training. Our approach randomly samples images and uses a diffusion model to add or erase semantic classes, synthesizing diverse training pairs. Additionally, we introduce a confidence-based labeling mechanism that iteratively assigns pseudo-labels to valuable real-world data and incorporates them into the training process. Extensive experiments on three datasets demonstrate UCDM’s superiority over previous semi-supervised methods. Specifically, with a 60% mismatch proportion on Tiny-ImageNet dataset, our approach, without relying on labeled data, surpasses OpenMatch (with 40 labels per class) by 35.1%, 63.7%, and 72.5% in classifying known, unknown, and new classes.
zh

[CV-90] ransformer-Based Dual-Optical Attention Fusion Crowd Head Point Counting and Localization Network

【速读】:该论文旨在解决在无人机视角下的群体计数任务中,由于人群密集遮挡和低光照等复杂场景导致的准确计数难题。其解决方案的关键在于提出了一种双光谱注意力融合群体头部点计数模型(TAPNet),该模型通过引入红外图像的互补信息设计了双光谱注意力融合模块(DAFP),以提升全天候群体计数的准确性和鲁棒性;同时,为充分利用不同模态信息并解决图像对之间的系统性错位导致的定位不准确问题,还提出了自适应双光谱特征分解融合模块(AFDF)。此外,通过优化训练策略,采用空间随机偏移数据增强方法进一步提升了模型的鲁棒性。

链接: https://arxiv.org/abs/2505.06937
作者: Fei Zhou,Yi Li,Mingqing Zhu
机构: Neusoft Institute Guangdong(东软学院广东); Airace Technology Co.,Ltd.(Airace科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, the dual-optical attention fusion crowd head point counting model (TAPNet) is proposed to address the problem of the difficulty of accurate counting in complex scenes such as crowd dense occlusion and low light in crowd counting tasks under UAV view. The model designs a dual-optical attention fusion module (DAFP) by introducing complementary information from infrared images to improve the accuracy and robustness of all-day crowd counting. In order to fully utilize different modal information and solve the problem of inaccurate localization caused by systematic misalignment between image pairs, this paper also proposes an adaptive two-optical feature decomposition fusion module (AFDF). In addition, we optimize the training strategy to improve the model robustness through spatial random offset data augmentation. Experiments on two challenging public datasets, DroneRGBT and GAIIC2, show that the proposed method outperforms existing techniques in terms of performance, especially in challenging dense low-light scenes. Code is available at this https URL
zh

[CV-91] Bi-directional Self-Registration for Misaligned Infrared-Visible Image Fusion

【速读】:该论文旨在解决多模态图像配准与融合中缺乏真实标签的问题,从而影响模型性能。其解决方案的关键在于提出一种自监督的双向自配准框架(B-SR),通过代理数据生成器(PDG)和逆向代理数据生成器(IPDG)实现全局-局部配准,利用伪全局差异与真实全局差异进行一致性约束,并设计邻域动态对齐损失以消除模态差异对配准模块的影响。

链接: https://arxiv.org/abs/2505.06920
作者: Timing Li,Bing Cao,Pengfei Zhu,Bin Xiao,Qinghua Hu
机构: Tianjin University (天津大学); Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Acquiring accurately aligned multi-modal image pairs is fundamental for achieving high-quality multi-modal image fusion. To address the lack of ground truth in current multi-modal image registration and fusion methods, we propose a novel self-supervised \textbfBi-directional \textbfSelf-\textbfRegistration framework (\textbfB-SR). Specifically, B-SR utilizes a proxy data generator (PDG) and an inverse proxy data generator (IPDG) to achieve self-supervised global-local registration. Visible-infrared image pairs with spatially misaligned differences are aligned to obtain global differences through the registration module. The same image pairs are processed by PDG, such as cropping, flipping, stitching, etc., and then aligned to obtain local differences. IPDG converts the obtained local differences into pseudo-global differences, which are used to perform global-local difference consistency with the global differences. Furthermore, aiming at eliminating the effect of modal gaps on the registration module, we design a neighborhood dynamic alignment loss to achieve cross-modal image edge alignment. Extensive experiments on misaligned multi-modal images demonstrate the effectiveness of the proposed method in multi-modal image alignment and fusion against the competing methods. Our code will be publicly available.
zh

[CV-92] Building a Human-Verified Clinical Reasoning Dataset via a Human LLM Hybrid Pipeline for Trustworthy Medical AI

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗领域应用中因推理过程不透明而限制临床信任的问题,以及当前医疗LLMs依赖科学文献或合成数据导致的专家验证不足和临床相关性低的问题。其解决方案的关键在于构建一个包含31,247个医学问答对的高临床相关性数据集,每个问答对均配有专家验证的链式思维(chain-of-thought, CoT)解释,并通过可扩展的人机混合流程进行数据标注与优化,确保生成内容的准确性和可解释性。

链接: https://arxiv.org/abs/2505.06912
作者: Chao Ding,Mouxiao Bian,Pengcheng Chen,Hongliang Zhang,Tianbin Li,Lihao Liu,Jiayuan Chen,Zhuoran Li,Yabei Zhong,Yongqi Liu,Haiqing Huang,Dongming Shan,Junjun He,Jie Xu
机构: Shanghai Artificial Intelligence Laboratory(上海人工智能实验室); Shanghai Jiao Tong University(上海交通大学); Shanghai Kupas Technology Limited Company(上海酷拍科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite strong performance in medical question-answering, the clinical adoption of Large Language Models (LLMs) is critically hampered by their opaque ‘black-box’ reasoning, limiting clinician trust. This challenge is compounded by the predominant reliance of current medical LLMs on corpora from scientific literature or synthetic data, which often lack the granular expert validation and high clinical relevance essential for advancing their specialized medical capabilities. To address these critical gaps, we introduce a highly clinically relevant dataset with 31,247 medical question-answer pairs, each accompanied by expert-validated chain-of-thought (CoT) explanations. This resource, spanning multiple clinical domains, was curated via a scalable human-LLM hybrid pipeline: LLM-generated rationales were iteratively reviewed, scored, and refined by medical experts against a structured rubric, with substandard outputs revised through human effort or guided LLM regeneration until expert consensus. This publicly available dataset provides a vital source for the development of medical LLMs that capable of transparent and verifiable reasoning, thereby advancing safer and more interpretable AI in medicine.
zh

[CV-93] owards Artificial General or Personalized Intelligence? A Survey on Foundation Models for Personalized Federated Intelligence

【速读】:该论文试图解决大规模语言模型(LLMs)在个性化定制过程中面临的隐私保护、计算效率及用户需求适配等挑战,旨在实现人工个性化智能(API)。其解决方案的关键在于提出个性化联邦智能(PFI),该方法融合了联邦学习(FL)的隐私保护优势与基础模型(FMs)的零样本泛化能力,从而在边缘端实现高效、隐私保护且个性化的模型部署。

链接: https://arxiv.org/abs/2505.06907
作者: Yu Qiao,Huy Q. Le,Avi Deb Raha,Phuong-Nam Tran,Apurba Adhikary,Mengchun Zhang,Loc X. Nguyen,Eui-Nam Huh,Dusit Niyato,Choong Seon Hong
机构: Kyung Hee University (庆熙大学); Noakhali Science and Technology University (诺阿哈利科技大学); Korea Advanced Institute of Science and Technology (韩国科学技术院); Nanyang Technological University (南洋理工大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: On going work

点击查看摘要

Abstract:The rise of large language models (LLMs), such as ChatGPT, DeepSeek, and Grok-3, has reshaped the artificial intelligence landscape. As prominent examples of foundational models (FMs) built on LLMs, these models exhibit remarkable capabilities in generating human-like content, bringing us closer to achieving artificial general intelligence (AGI). However, their large-scale nature, sensitivity to privacy concerns, and substantial computational demands present significant challenges to personalized customization for end users. To bridge this gap, this paper presents the vision of artificial personalized intelligence (API), focusing on adapting these powerful models to meet the specific needs and preferences of users while maintaining privacy and efficiency. Specifically, this paper proposes personalized federated intelligence (PFI), which integrates the privacy-preserving advantages of federated learning (FL) with the zero-shot generalization capabilities of FMs, enabling personalized, efficient, and privacy-protective deployment at the edge. We first review recent advances in both FL and FMs, and discuss the potential of leveraging FMs to enhance federated systems. We then present the key motivations behind realizing PFI and explore promising opportunities in this space, including efficient PFI, trustworthy PFI, and PFI empowered by retrieval-augmented generation (RAG). Finally, we outline key challenges and future research directions for deploying FM-powered FL systems at the edge with improved personalization, computational efficiency, and privacy guarantees. Overall, this survey aims to lay the groundwork for the development of API as a complement to AGI, with a particular focus on PFI as a key enabling technique.
zh

[CV-94] Enhancing Monocular Height Estimation via Sparse LiDAR-Guided Correction

【速读】:该论文旨在解决基于单目高度估计(Monocular Height Estimation, MHE)的深度学习模型在缺乏足够结构信息的超高分辨率(VHR)遥感图像中的可靠性问题,以及纯合成数据训练模型在实际应用中可能存在的偏差和误差。其解决方案的关键在于提出一种新的校正流程,将稀疏且不完美的全球LiDAR测量数据(ICESat-2)与深度学习输出相结合,通过两阶段处理——先对原始ICESat-2数据进行预处理,再利用随机森林方法对高度估计进行密集细化,从而提升局部精度并实现空间一致性的校正。

链接: https://arxiv.org/abs/2505.06905
作者: Jian Song,Hongruixuan Chen,Naoto Yokoya
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Monocular height estimation (MHE) from very-high-resolution (VHR) remote sensing imagery via deep learning is notoriously challenging due to the lack of sufficient structural information. Conventional digital elevation models (DEMs), typically derived from airborne LiDAR or multi-view stereo, remain costly and geographically limited. Recently, models trained on synthetic data and refined through domain adaptation have shown remarkable performance in MHE, yet it remains unclear how these models make predictions or how reliable they truly are. In this paper, we investigate a state-of-the-art MHE model trained purely on synthetic data to explore where the model looks when making height predictions. Through systematic analyses, we find that the model relies heavily on shadow cues, a factor that can lead to overestimation or underestimation of heights when shadows deviate from expected norms. Furthermore, the inherent difficulty of evaluating regression tasks with the human eye underscores additional limitations of purely synthetic training. To address these issues, we propose a novel correction pipeline that integrates sparse, imperfect global LiDAR measurements (ICESat-2) with deep-learning outputs to improve local accuracy and achieve spatially consistent corrections. Our method comprises two stages: pre-processing raw ICESat-2 data, followed by a random forest-based approach to densely refine height estimates. Experiments in three representative urban regions – Saint-Omer, Tokyo, and Sao Paulo – reveal substantial error reductions, with mean absolute error (MAE) decreased by 22.8%, 6.9%, and 4.9%, respectively. These findings highlight the critical role of shadow awareness in synthetic data-driven models and demonstrate how fusing imperfect real-world LiDAR data can bolster the robustness of MHE, paving the way for more reliable and scalable 3D mapping solutions.
zh

[CV-95] CheXLearner: Text-Guided Fine-Grained Representation Learning for Progression Detection

【速读】:该论文旨在解决时间性医学图像分析中存在语义不匹配和缺乏医学语义整合的问题。现有方法要么在粗粒度层面对齐图像与文本,导致潜在的语义不一致,要么仅依赖视觉信息,未能有效融合医学语义。其解决方案的关键在于提出CheXLearner框架,该框架通过统一解剖区域检测、基于黎曼流形的结构对齐以及细粒度区域语义引导,实现跨模态表示学习的增强,并引入区域进展描述作为监督信号以支持动态低级特征优化。其中,Med-Manifold Alignment Module (Med-MAM) 利用双曲几何实现解剖结构的鲁棒对齐并捕捉病理性差异。

链接: https://arxiv.org/abs/2505.06903
作者: Yuanzhuo Wang,Junwen Duan,Xinyu Li,Jianxin Wang
机构: Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Temporal medical image analysis is essential for clinical decision-making, yet existing methods either align images and text at a coarse level - causing potential semantic mismatches - or depend solely on visual information, lacking medical semantic integration. We present CheXLearner, the first end-to-end framework that unifies anatomical region detection, Riemannian manifold-based structure alignment, and fine-grained regional semantic guidance. Our proposed Med-Manifold Alignment Module (Med-MAM) leverages hyperbolic geometry to robustly align anatomical structures and capture pathologically meaningful discrepancies across temporal chest X-rays. By introducing regional progression descriptions as supervision, CheXLearner achieves enhanced cross-modal representation learning and supports dynamic low-level feature optimization. Experiments show that CheXLearner achieves 81.12% (+17.2%) average accuracy and 80.32% (+11.05%) F1-score on anatomical region progression detection - substantially outperforming state-of-the-art baselines, especially in structurally complex regions. Additionally, our model attains a 91.52% average AUC score in downstream disease classification, validating its superior feature representation.
zh

[CV-96] NeuGen: Amplifying the Neural in Neural Radiance Fields for Domain Generalization

【速读】:该论文试图解决Neural Radiance Fields (NeRF) 在不同场景和条件下的泛化能力不足的问题。解决方案的关键在于引入一种受脑科学启发的归一化技术——Neural Generalization (NeuGen),该技术能够提取领域不变特征,从而增强模型的泛化能力,并且可以无缝集成到主流NeRF架构中,显著提升图像渲染的准确性和鲁棒性。

链接: https://arxiv.org/abs/2505.06894
作者: Ahmed Qazi,Abdul Basit,Asim Iqbal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 18 pages, 6 figures

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) have significantly advanced the field of novel view synthesis, yet their generalization across diverse scenes and conditions remains challenging. Addressing this, we propose the integration of a novel brain-inspired normalization technique Neural Generalization (NeuGen) into leading NeRF architectures which include MVSNeRF and GeoNeRF. NeuGen extracts the domain-invariant features, thereby enhancing the models’ generalization capabilities. It can be seamlessly integrated into NeRF architectures and cultivates a comprehensive feature set that significantly improves accuracy and robustness in image rendering. Through this integration, NeuGen shows improved performance on benchmarks on diverse datasets across state-of-the-art NeRF architectures, enabling them to generalize better across varied scenes. Our comprehensive evaluations, both quantitative and qualitative, confirm that our approach not only surpasses existing models in generalizability but also markedly improves rendering quality. Our work exemplifies the potential of merging neuroscientific principles with deep learning frameworks, setting a new precedent for enhanced generalizability and efficiency in novel view synthesis. A demo of our study is available at this https URL.
zh

[CV-97] Image Classification Using a Diffusion Model as a Pre-Training Model

【速读】:该论文旨在解决传统方法在图像分类任务中依赖大规模标注数据的问题,尤其针对脑部影像中的血肿检测场景。其解决方案的关键在于提出一种结合表示条件机制的扩散模型,利用视觉变换器(ViT)提取的表示作为条件,引导基于Transformer的扩散模型内部过程,从而实现无需大量标注数据的生成式数据训练,通过自监督学习有效提升分类性能。

链接: https://arxiv.org/abs/2505.06890
作者: Kosuke Ukita,Ye Xiaolong,Tsuyoshi Okita
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 10 pages, 9 figures

点击查看摘要

Abstract:In this paper, we propose a diffusion model that integrates a representation-conditioning mechanism, where the representations derived from a Vision Transformer (ViT) are used to condition the internal process of a Transformer-based diffusion model. This approach enables representation-conditioned data generation, addressing the challenge of requiring large-scale labeled datasets by leveraging self-supervised learning on unlabeled data. We evaluate our method through a zero-shot classification task for hematoma detection in brain imaging. Compared to the strong contrastive learning baseline, DINOv2, our method achieves a notable improvement of +6.15% in accuracy and +13.60% in F1-score, demonstrating its effectiveness in image classification.
zh

[CV-98] Mice to Machines: Neural Representations from Visual Cortex for Domain Generalization

【速读】:该论文试图解决如何将深度学习模型与小鼠视觉皮层的功能架构进行功能对齐的问题,以更好地理解小鼠视觉系统中的神经表征并提升AI模型在现实任务中的性能。其解决方案的关键在于引入一种通用的表示学习策略,该策略揭示了小鼠视觉皮层与高性能深度学习模型在自上而下(群体水平)和自下而上(单细胞水平)功能映射上的显著相似性,并通过添加受小鼠视觉皮层兴奋性和抑制性神经元激活特征启发的神经响应归一化(NeuRN)层进一步增强两种系统的表示相似性。

链接: https://arxiv.org/abs/2505.06886
作者: Ahmed Qazi,Hamd Jalil,Asim Iqbal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 12 pages, 8 figures, 1 table

点击查看摘要

Abstract:The mouse is one of the most studied animal models in the field of systems neuroscience. Understanding the generalized patterns and decoding the neural representations that are evoked by the diverse range of natural scene stimuli in the mouse visual cortex is one of the key quests in computational vision. In recent years, significant parallels have been drawn between the primate visual cortex and hierarchical deep neural networks. However, their generalized efficacy in understanding mouse vision has been limited. In this study, we investigate the functional alignment between the mouse visual cortex and deep learning models for object classification tasks. We first introduce a generalized representational learning strategy that uncovers a striking resemblance between the functional mapping of the mouse visual cortex and high-performing deep learning models on both top-down (population-level) and bottom-up (single cell-level) scenarios. Next, this representational similarity across the two systems is further enhanced by the addition of Neural Response Normalization (NeuRN) layer, inspired by the activation profile of excitatory and inhibitory neurons in the visual cortex. To test the performance effect of NeuRN on real-world tasks, we integrate it into deep learning models and observe significant improvements in their robustness against data shifts in domain generalization tasks. Our work proposes a novel framework for comparing the functional architecture of the mouse visual cortex with deep learning models. Our findings carry broad implications for the development of advanced AI models that draw inspiration from the mouse visual cortex, suggesting that these models serve as valuable tools for studying the neural representations of the mouse visual cortex and, as a result, enhancing their performance on real-world tasks.
zh

[CV-99] NeuRN: Neuro-inspired Domain Generalization for Image Classification

【速读】:该论文试图解决图像分类中的领域泛化(domain generalization)问题,即模型在未见过的数据集上表现不佳的挑战。解决方案的关键在于引入一种受哺乳动物视觉皮层神经元启发的神经响应归一化(Neural Response Normalization, NeuRN)层,该层旨在通过在源域上训练深度学习模型来提升其在未知目标域上的性能。

链接: https://arxiv.org/abs/2505.06881
作者: Hamd Jalil,Ahmed Qazi,Asim Iqbal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 14 pages, 7 figures, 1 table

点击查看摘要

Abstract:Domain generalization in image classification is a crucial challenge, with models often failing to generalize well across unseen datasets. We address this issue by introducing a neuro-inspired Neural Response Normalization (NeuRN) layer which draws inspiration from neurons in the mammalian visual cortex, which aims to enhance the performance of deep learning architectures on unseen target domains by training deep learning models on a source domain. The performance of these models is considered as a baseline and then compared against models integrated with NeuRN on image classification tasks. We perform experiments across a range of deep learning architectures, including ones derived from Neural Architecture Search and Vision Transformer. Additionally, in order to shortlist models for our experiment from amongst the vast range of deep neural networks available which have shown promising results, we also propose a novel method that uses the Needleman-Wunsch algorithm to compute similarity between deep learning architectures. Our results demonstrate the effectiveness of NeuRN by showing improvement against baseline in cross-domain image classification tasks. Our framework attempts to establish a foundation for future neuro-inspired deep learning models.
zh

[CV-100] Efficient Robotic Policy Learning via Latent Space Backward Planning ICML2025

【速读】:该论文试图解决机器人规划中因依赖高精度多帧图像预测而导致的计算成本高和累积误差问题,这些问题限制了实时部署和长期任务的准确性。解决方案的关键在于提出一种潜在空间逆向规划方法(Latent Space Backward Planning, LBP),该方法从任务的最终潜在目标出发,递归地预测接近当前状态的中间子目标,从而在整个规划过程中保持对任务完成的意识,提升预测的准确性和效率。

链接: https://arxiv.org/abs/2505.06861
作者: Dongxiu Liu,Haoyi Niu,Zhihao Wang,Jinliang Zheng,Yinan Zheng,Zhonghong Ou,Jianming Hu,Jianxiong Li,Xianyuan Zhan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2025

点击查看摘要

Abstract:Current robotic planning methods often rely on predicting multi-frame images with full pixel details. While this fine-grained approach can serve as a generic world model, it introduces two significant challenges for downstream policy learning: substantial computational costs that hinder real-time deployment, and accumulated inaccuracies that can mislead action extraction. Planning with coarse-grained subgoals partially alleviates efficiency issues. However, their forward planning schemes can still result in off-task predictions due to accumulation errors, leading to misalignment with long-term goals. This raises a critical question: Can robotic planning be both efficient and accurate enough for real-time control in long-horizon, multi-stage tasks? To address this, we propose a Latent Space Backward Planning scheme (LBP), which begins by grounding the task into final latent goals, followed by recursively predicting intermediate subgoals closer to the current state. The grounded final goal enables backward subgoal planning to always remain aware of task completion, facilitating on-task prediction along the entire planning horizon. The subgoal-conditioned policy incorporates a learnable token to summarize the subgoal sequences and determines how each subgoal guides action extraction. Through extensive simulation and real-robot long-horizon experiments, we show that LBP outperforms existing fine-grained and forward planning methods, achieving SOTA performance. Project Page: this https URL
zh

[CV-101] Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies

【速读】:该论文旨在解决现有文本识别方法因依赖大规模合成数据集而导致在处理真实世界复杂场景时性能下降的问题,这些真实场景包括不均匀光照、不规则排版、遮挡和退化等。其解决方案的关键在于通过引入多掩码策略(Multi-Masking Strategy, MMS),结合随机块状掩码和跨度掩码,以充分捕捉文本的高层上下文表示,从而增强模型对字符间关系的推理能力,并在无监督预训练后通过微调提升文本识别、分割及文本图像超分辨率等任务的性能。

链接: https://arxiv.org/abs/2505.06855
作者: Zhengmi Tang,Yuto Mitsui,Tomo Miyazaki,Shinichiro Omachi
机构: Wenzhou University Artificial Intelligence and Advanced Manufacturing Institute (AIAMI)(温州大学人工智能与先进制造研究所); Tohoku University(东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Most existing text recognition methods are trained on large-scale synthetic datasets due to the scarcity of labeled real-world datasets. Synthetic images, however, cannot faithfully reproduce real-world scenarios, such as uneven illumination, irregular layout, occlusion, and degradation, resulting in performance disparities when handling complex real-world images. Recent self-supervised learning techniques, notably contrastive learning and masked image modeling (MIM), narrow this domain gap by exploiting unlabeled real text images. This study first analyzes the original Masked AutoEncoder (MAE) and observes that random patch masking predominantly captures low-level textural features but misses high-level contextual representations. To fully exploit the high-level contextual representations, we introduce random blockwise and span masking in the text recognition task. These strategies can mask the continuous image patches and completely remove some characters, forcing the model to infer relationships among characters within a word. Our Multi-Masking Strategy (MMS) integrates random patch, blockwise, and span masking into the MIM frame, which jointly learns low and high-level textual representations. After fine-tuning with real data, MMS outperforms the state-of-the-art self-supervised methods in various text-related tasks, including text recognition, segmentation, and text-image super-resolution.
zh

[CV-102] Predicting Surgical Safety Margins in Osteosarcoma Knee Resections: An Unsupervised Approach

【速读】:该论文试图解决骨肉瘤手术中安全边界估计的难题,以提高手术的精确性和患者预后。其解决方案的关键在于利用MRI和X射线数据、数字处理技术和无监督学习算法(如k-means聚类)来自动界定肿瘤边界,从而实现患者特异性安全边界的自动化确定。

链接: https://arxiv.org/abs/2505.06853
作者: Carolina Vargas-Ecos,Edwin Salcedo
机构: Universidad Católica Boliviana “San Pablo” (天主教玻利维亚大学“圣保罗”)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the 6th BioSMART Conference, 2025

点击查看摘要

Abstract:According to the Pan American Health Organization, the number of cancer cases in Latin America was estimated at 4.2 million in 2022 and is projected to rise to 6.7 million by 2045. Osteosarcoma, one of the most common and deadly bone cancers affecting young people, is difficult to detect due to its unique texture and intensity. Surgical removal of osteosarcoma requires precise safety margins to ensure complete resection while preserving healthy tissue. Therefore, this study proposes a method for estimating the confidence interval of surgical safety margins in osteosarcoma surgery around the knee. The proposed approach uses MRI and X-ray data from open-source repositories, digital processing techniques, and unsupervised learning algorithms (such as k-means clustering) to define tumor boundaries. Experimental results highlight the potential for automated, patient-specific determination of safety margins.
zh

[CV-103] Visual Instruction Tuning with Chain of Region-of-Interest

【速读】:该论文旨在解决高分辨率(High-Resolution, HR)图像在多模态大语言模型(Multimodal Large Language Models, MLLMs)中带来的计算负担问题。传统方法通过直接提升图像分辨率来增强模型的识别与理解能力,但会导致计算需求显著增加。论文提出的解决方案是“感兴趣区域链”(Chain of Region-of-Interest, CoRoI),其关键在于模仿人类视觉系统的选择性,识别并优先处理高分辨率图像中最具信息量的区域,从而在不处理大量HR图像标记的情况下提升多模态视觉理解和识别性能。

链接: https://arxiv.org/abs/2505.06840
作者: Yixin Chen,Shuai Zhang,Boran Han,Bernie Wang
机构: Amazon Web Services (亚马逊云服务)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: N/A

点击查看摘要

Abstract:High-resolution (HR) images are pivotal for enhancing the recognition and understanding capabilities of multimodal large language models (MLLMs). However, directly increasing image resolution can significantly escalate computational demands. In this study, we propose a method called Chain of Region-of-Interest (CoRoI) for Visual Instruction Tuning, aimed at alleviating the computational burden associated with high-resolution images for MLLMs. Drawing inspiration from the selective nature of the human visual system, we recognize that not all regions within high-resolution images carry equal importance. CoRoI seeks to identify and prioritize the most informative regions, thereby enhancing multimodal visual comprehension and recognition while circumventing the need for processing lengthy HR image tokens. Through extensive experiments on 11 benchmarks, we validate the efficacy of CoRoI across varying sizes, ranging from 7B to 34B in parameters. Our models consistently demonstrate superior performance across diverse multimodal benchmarks and tasks. Notably, our method outperforms LLaVA-NeXT on almost all benchmarks and our finetuned 34B model surpasses proprietary methods like Gemini Pro 1.0 on six benchmarks, as well as outperforming GPT-4V on MMB, SEED-I, and MME.
zh

[CV-104] Fine-Grained Bias Exploration and Mitigation for Group-Robust Classification

【速读】:该论文旨在解决在存在虚假相关性的情况下,如何实现群体鲁棒的泛化问题,尤其是在缺乏偏差标注的情况下。其解决方案的关键在于提出一种名为通过过拟合进行偏差探索(Bias Exploration via Overfitting, BEO)的新方法,该方法通过将每个分布建模为潜在群体的混合来更精确地捕捉数据分布特性,并在此基础上引入细粒度的类条件分布平衡(Fine-Grained Class-Conditional Distribution Balancing, FG-CCDB),从而在群体内部进行更精确的分布匹配与平衡,有效缓解虚假相关性。

链接: https://arxiv.org/abs/2505.06831
作者: Miaoyun Zhao,Qiang Zhang,Chenrong Li
机构: Dalian University of Technology (大连理工大学); Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology) (社会计算与认知智能重点实验室(大连理工大学))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Achieving group-robust generalization in the presence of spurious correlations remains a significant challenge, particularly when bias annotations are unavailable. Recent studies on Class-Conditional Distribution Balancing (CCDB) reveal that spurious correlations often stem from mismatches between the class-conditional and marginal distributions of bias attributes. They achieve promising results by addressing this issue through simple distribution matching in a bias-agnostic manner. However, CCDB approximates each distribution using a single Gaussian, which is overly simplistic and rarely holds in real-world applications. To address this limitation, we propose a novel method called Bias Exploration via Overfitting (BEO), which captures each distribution in greater detail by modeling it as a mixture of latent groups. Building on these group-level descriptions, we introduce a fine-grained variant of CCDB, termed FG-CCDB, which performs more precise distribution matching and balancing within each group. Through group-level reweighting, FG-CCDB learns sample weights from a global perspective, achieving stronger mitigation of spurious correlations without incurring substantial storage or computational costs. Extensive experiments demonstrate that BEO serves as a strong proxy for ground-truth bias annotations and can be seamlessly integrated with bias-supervised methods. Moreover, when combined with FG-CCDB, our method performs on par with bias-supervised approaches on binary classification tasks and significantly outperforms them in highly biased multi-class scenarios.
zh

[CV-105] Active Learning for Multi-class Image Classification

【速读】:该论文试图解决图像分类中需要大量训练样本这一关键瓶颈问题。其解决方案的关键在于采用主动学习(active learning),通过策略性地选择具有高信息量的样本,从而减少所需的训练样本数量。具体而言,利用不同的不确定性度量(uncertainty metrics)为图像样本赋值,使模型能够在较小的训练集规模下识别并选择高价值样本。

链接: https://arxiv.org/abs/2505.06825
作者: Thien Nhan Vo
机构: Institute of Engineering, Ho Chi Minh City University of Technology (HUTECH), Vietnam
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A principle bottleneck in image classification is the large number of training examples needed to train a classifier. Using active learning, we can reduce the number of training examples to teach a CNN classifier by strategically selecting examples. Assigning values to image examples using different uncertainty metrics allows the model to identify and select high-value examples in a smaller training set size. We demonstrate results for digit recognition and fruit classification on the MNIST and Fruits360 data sets. We formally compare results for four different uncertainty metrics. Finally, we observe active learning is also effective on simpler (binary) classification tasks, but marked improvement from random sampling is more evident on more difficult tasks. We show active learning is a viable algorithm for image classification problems.
zh

[CV-106] Multimodal Fake News Detection: MFND Dataset and Shallow-Deep Multitask Learning IJCAI2025

【速读】:该论文旨在解决多模态新闻中因深度伪造建模攻击而导致的信息失真问题,特别是针对当前图像和文本生成技术所制造的高真实感虚假新闻的检测与定位。其解决方案的关键在于提出了一种浅-深多任务学习(Shallow-Deep Multitask Learning, SDML)模型,该模型充分利用单模态和跨模态特征,挖掘新闻的内在语义。在浅层推理中,采用基于动量蒸馏的轻量惩罚对比学习实现细粒度的空间图像与文本语义对齐,并引入自适应跨模态融合模块增强跨模态特征;在深层推理中,设计双分支框架分别增强图像和文本的单模态特征,并与跨模态特征融合,通过专用检测和定位投影进行四类预测。

链接: https://arxiv.org/abs/2505.06796
作者: Ye Zhu,Yunan Wang,Zitong Yu
机构: Hebei University of Technology (河北工业大学); Great Bay University (大湾大学); Shenzhen University (深圳大学); Dongguan Key Laboratory for Intelligence and Information Technology (东莞人工智能与信息技术重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Multimodal news contains a wealth of information and is easily affected by deepfake modeling attacks. To combat the latest image and text generation methods, we present a new Multimodal Fake News Detection dataset (MFND) containing 11 manipulated types, designed to detect and localize highly authentic fake news. Furthermore, we propose a Shallow-Deep Multitask Learning (SDML) model for fake news, which fully uses unimodal and mutual modal features to mine the intrinsic semantics of news. Under shallow inference, we propose the momentum distillation-based light punishment contrastive learning for fine-grained uniform spatial image and text semantic alignment, and an adaptive cross-modal fusion module to enhance mutual modal features. Under deep inference, we design a two-branch framework to augment the image and text unimodal features, respectively merging with mutual modalities features, for four predictions via dedicated detection and localization projections. Experiments on both mainstream and our proposed datasets demonstrate the superiority of the model. Codes and dataset are released at this https URL.
zh

[CV-107] M3CAD: Towards Generic Cooperative Autonomous Driving Benchmark

【速读】:该论文旨在解决通用协同自动驾驶(cooperative autonomous driving)研究中的基准不足问题,提出了一种名为M ^3 CAD的新基准,以促进相关领域的研究。其解决方案的关键在于构建一个包含204个序列、30k帧的多样化协同驾驶场景数据集,支持多车辆和多模态感知数据(如LiDAR点云、RGB图像和GPS/IMU),并提供多种自动驾驶任务的支持。此外,论文还提出了E2EC框架,通过利用车辆间共享信息来提升路径规划性能,从而推动协同自动驾驶研究的发展。

链接: https://arxiv.org/abs/2505.06746
作者: Morui Zhu,Yongqi Zhu,Yihao Zhu,Qi Chen,Deyuan Qu,Song Fu,Qing Yang
机构: University of North Texas (北德克萨斯大学); Toyota InfoTech Labs (丰田信息科技实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: supplementary material included

点击查看摘要

Abstract:We introduce M ^3 CAD, a novel benchmark designed to advance research in generic cooperative autonomous driving. M ^3 CAD comprises 204 sequences with 30k frames, spanning a diverse range of cooperative driving scenarios. Each sequence includes multiple vehicles and sensing modalities, e.g., LiDAR point clouds, RGB images, and GPS/IMU, supporting a variety of autonomous driving tasks, including object detection and tracking, mapping, motion forecasting, occupancy prediction, and path planning. This rich multimodal setup enables M ^3 CAD to support both single-vehicle and multi-vehicle autonomous driving research, significantly broadening the scope of research in the field. To our knowledge, M ^3 CAD is the most comprehensive benchmark specifically tailored for cooperative multi-task autonomous driving research. We evaluate the state-of-the-art end-to-end solution on M ^3 CAD to establish baseline performance. To foster cooperative autonomous driving research, we also propose E2EC, a simple yet effective framework for cooperative driving solution that leverages inter-vehicle shared information for improved path planning. We release M ^3 CAD, along with our baseline models and evaluation results, to support the development of robust cooperative autonomous driving systems. All resources will be made publicly available on this https URL
zh

[CV-108] Symbolic Rule Extraction from Attention-Guided Sparse Representations in Vision Transformers

【速读】:该论文试图解决Vision Transformers(ViTs)缺乏模块化概念检测器和依赖全局自注意力机制而导致的可解释性不足问题。其解决方案的关键在于引入一个受稀疏自编码器(Sparse Autoencoders, SAEs)启发的稀疏概念层,该层在注意力加权的块表示上操作,并学习一种解耦且二值化的表示,其中单个神经元针对高层视觉概念激活。通过结合L1稀疏性、熵最小化和监督对比损失来增强可解释性,最终利用FOLD-SE-M算法生成逻辑程序形式的符号规则集,实现了对ViTs的符号化推理与可验证的神经符号AI。

链接: https://arxiv.org/abs/2505.06745
作者: Parth Padalkar,Gopal Gupta
机构: The University of Texas at Dallas(德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent neuro-symbolic approaches have successfully extracted symbolic rule-sets from CNN-based models to enhance interpretability. However, applying similar techniques to Vision Transformers (ViTs) remains challenging due to their lack of modular concept detectors and reliance on global self-attention mechanisms. We propose a framework for symbolic rule extraction from ViTs by introducing a sparse concept layer inspired by Sparse Autoencoders (SAEs). This linear layer operates on attention-weighted patch representations and learns a disentangled, binarized representation in which individual neurons activate for high-level visual concepts. To encourage interpretability, we apply a combination of L1 sparsity, entropy minimization, and supervised contrastive loss. These binarized concept activations are used as input to the FOLD-SE-M algorithm, which generates a rule-set in the form of logic programs. Our method achieves a 5.14% better classification accuracy than the standard ViT while enabling symbolic reasoning. Crucially, the extracted rule-set is not merely post-hoc but acts as a logic-based decision layer that operates directly on the sparse concept representations. The resulting programs are concise and semantically meaningful. This work is the first to extract executable logic programs from ViTs using sparse symbolic representations. It bridges the gap between transformer-based vision models and symbolic logic programming, providing a step forward in interpretable and verifiable neuro-symbolic AI.
zh

[CV-109] SimMIL: A Universal Weakly Supervised Pre-Training Framework for Multi-Instance Learning in Whole Slide Pathology Images

【速读】:该论文试图解决多实例学习(Multi-Instance Learning, MIL)中对实例级表示学习的忽视问题,即现有方法过于依赖特征聚合器而未充分考虑实例级别的特征学习。其解决方案的关键在于通过一种弱监督方案对特征提取器进行预训练,即将弱标签的包级别标签传播到对应的实例以进行监督学习,同时引入强数据增强、非线性预测头和鲁棒损失函数等关键组件来学习有效的特征。

链接: https://arxiv.org/abs/2505.06710
作者: Yicheng Song,Tiancheng Lin,Die Peng,Su Yang,Yi Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Various multi-instance learning (MIL) based approaches have been developed and successfully applied to whole-slide pathological images (WSI). Existing MIL methods emphasize the importance of feature aggregators, but largely neglect the instance-level representation learning. They assume that the availability of a pre-trained feature extractor can be directly utilized or fine-tuned, which is not always the case. This paper proposes to pre-train feature extractor for MIL via a weakly-supervised scheme, i.e., propagating the weak bag-level labels to the corresponding instances for supervised learning. To learn effective features for MIL, we further delve into several key components, including strong data augmentation, a non-linear prediction head and the robust loss function. We conduct experiments on common large-scale WSI datasets and find it achieves better performance than other pre-training schemes (e.g., ImageNet pre-training and self-supervised learning) in different downstream tasks. We further show the compatibility and scalability of the proposed scheme by deploying it in fine-tuning the pathological-specific models and pre-training on merged multiple datasets. To our knowledge, this is the first work focusing on the representation learning for MIL.
zh

[CV-110] Underwater object detection in sonar imagery with detection transformer and Zero-shot neural architecture search

【速读】:该论文旨在解决水下目标检测中由于声呐图像分辨率较低和特征稀疏而导致的检测性能下降问题。其解决方案的关键在于提出一种基于神经架构搜索(Neural Architecture Search, NAS)优化的检测变压器(DETR)架构,称为NAS-DETR。该方法通过改进的零样本神经架构搜索方法,高效地发现了具有高表征能力的CNN-Transformer主干网络,并结合特征金字塔网络(FPN)和基于可变形注意力的Transformer解码器,构建了完整的网络架构,从而在保持实时效率的同时提升了检测性能。

链接: https://arxiv.org/abs/2505.06694
作者: XiaoTong Gu,Shengyu Tang,Yiming Cao,Changdong Yu
机构: Dalian Maritime University (大连海事大学); Ocean Univesity of China (海洋大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Underwater object detection using sonar imagery has become a critical and rapidly evolving research domain within marine technology. However, sonar images are characterized by lower resolution and sparser features compared to optical images, which seriously degrades the performance of object this http URL address these challenges, we specifically propose a Detection Transformer (DETR) architecture optimized with a Neural Architecture Search (NAS) approach called NAS-DETR for object detection in sonar images. First, an improved Zero-shot Neural Architecture Search (NAS) method based on the maximum entropy principle is proposed to identify a real-time, high-representational-capacity CNN-Transformer backbone for sonar image detection. This method enables the efficient discovery of high-performance network architectures with low computational and time overhead. Subsequently, the backbone is combined with a Feature Pyramid Network (FPN) and a deformable attention-based Transformer decoder to construct a complete network architecture. This architecture integrates various advanced components and training schemes to enhance overall performance. Extensive experiments demonstrate that this architecture achieves state-of-the-art performance on two Representative datasets, while maintaining minimal overhead in real-time efficiency and computational complexity. Furthermore, correlation analysis between the key parameters and differential entropy-based fitness function is performed to enhance the interpretability of the proposed framework. To the best of our knowledge, this is the first work in the field of sonar object detection to integrate the DETR architecture with a NAS search mechanism.
zh

[CV-111] Emotion-Qwen : Training Hybrid Experts for Unified Emotion and General Vision-Language Understanding

【速读】:该论文旨在解决大型多模态模型(LMMs)在情感特定场景下的表现受限以及在情感相关任务微调过程中出现的灾难性遗忘问题。其解决方案的关键在于提出Emotion-Qwen,一个针对情感理解与通用视觉语言(VL)推理优化的多模态框架,该框架采用基于专家混合(MoE)的混合压缩器,动态地将输入路由至情感特异性与通用处理模块,以实现情感理解和通用VL推理的平衡。

链接: https://arxiv.org/abs/2505.06685
作者: Dawei Huang,Qing Li,Chuan Yan,Zebang Cheng,Yurong Huang,Xiang Li,Bin Li,Xiaohui Wang,Zheng Lian,Xiaojiang Peng
机构: Shenzhen Technology University(深圳技术大学); Stanford University(斯坦福大学); University of Electronic Science and Technology of China(中国电子科技大学); Skyworth Digital(创维数字); Shenzhen Xiaopai Tech Co(深圳小派科技有限公司); Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Emotion understanding in videos aims to accurately recognize and interpret individuals’ emotional states by integrating contextual, visual, textual, and auditory cues. While Large Multimodal Models (LMMs) have demonstrated significant progress in general vision-language (VL) tasks, their performance in emotion-specific scenarios remains limited. Moreover, fine-tuning LMMs on emotion-related tasks often leads to catastrophic forgetting, hindering their ability to generalize across diverse tasks. To address these challenges, we present Emotion-Qwen, a tailored multimodal framework designed to enhance both emotion understanding and general VL reasoning. Emotion-Qwen incorporates a sophisticated Hybrid Compressor based on the Mixture of Experts (MoE) paradigm, which dynamically routes inputs to balance emotion-specific and general-purpose processing. The model is pre-trained in a three-stage pipeline on large-scale general and emotional image datasets to support robust multimodal representations. Furthermore, we construct the Video Emotion Reasoning (VER) dataset, comprising more than 40K bilingual video clips with fine-grained descriptive annotations, to further enrich Emotion-Qwen’s emotional reasoning capability. Experimental results demonstrate that Emotion-Qwen achieves state-of-the-art performance on multiple emotion recognition benchmarks, while maintaining competitive results on general VL tasks. Code and models are available at this https URL.
zh

[CV-112] FNBench: Benchmarking Robust Federated Learning against Noisy Labels

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中标签噪声对模型鲁棒性的影响问题。由于不同客户端的数据标注存在复杂的标签噪声,导致模型性能下降,现有方法在统一设置下的实际表现缺乏系统性评估。本文的关键解决方案是提出首个基准研究FNBench,通过考虑三种多样化的标签噪声模式(合成标签噪声、不完美的人工标注错误和系统性错误),对十八种先进方法在五个图像识别数据集和一个文本分类数据集上的性能进行全面评估,并基于观察结果引入一种表征感知正则化方法以提升现有方法对标签噪声的鲁棒性。

链接: https://arxiv.org/abs/2505.06684
作者: Xuefeng Jiang,Jia Li,Nannan Wu,Zhiyuan Wu,Xujing Li,Sheng Sun,Gang Xu,Yuwei Wang,Qi Li,Min Liu
机构: Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; Huazhong University of Science and Technology, Wuhan, Hubei province, China; Institution for Network Sciences and Cyberspace, Tsinghua University, Beijing, China; Zhongguancun Laboratory, Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE TDSC, currently under major revision

点击查看摘要

Abstract:Robustness to label noise within data is a significant challenge in federated learning (FL). From the data-centric perspective, the data quality of distributed datasets can not be guaranteed since annotations of different clients contain complicated label noise of varying degrees, which causes the performance degradation. There have been some early attempts to tackle noisy labels in FL. However, there exists a lack of benchmark studies on comprehensively evaluating their practical performance under unified settings. To this end, we propose the first benchmark study FNBench to provide an experimental investigation which considers three diverse label noise patterns covering synthetic label noise, imperfect human-annotation errors and systematic errors. Our evaluation incorporates eighteen state-of-the-art methods over five image recognition datasets and one text classification dataset. Meanwhile, we provide observations to understand why noisy labels impair FL, and additionally exploit a representation-aware regularization method to enhance the robustness of existing methods against noisy labels based on our observations. Finally, we discuss the limitations of this work and propose three-fold future directions. To facilitate related communities, our source code is open-sourced at this https URL.
zh

[CV-113] UnfoldIR: Rethinking Deep Unfolding Network in Illumination Degradation Image Restoration

【速读】:该论文旨在解决深度展开网络(DUN)在光照退化图像修复(IDIR)任务中性能落后于最先进方法的问题。研究指出,这一局限并非源于DUN结构本身的缺陷,而是由于对展开结构的探索不足,具体体现在任务特定修复模型构建、先进网络架构整合以及DUN专用损失函数设计三个方面。解决方案的关键在于提出一种新型DUN方法——UnfoldIR,其通过引入具有专用正则化项的IDIR模型,并将其迭代优化解展开为多阶段网络,每个阶段包含反射率辅助光照校正(RAIC)模块和光照引导反射率增强(IGRE)模块,从而实现光照平滑与纹理增强的协同优化,同时设计了跨阶段信息一致性损失以提升网络稳定性与性能。

链接: https://arxiv.org/abs/2505.06683
作者: Chunming He,Rihan Zhang,Fengyang Xiao,Chengyu Fang,Longxiang Tang,Yulun Zhang,Sina Farsiu
机构: Duke University (杜克大学); Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 14 tables, 11 figures

点击查看摘要

Abstract:Deep unfolding networks (DUNs) are widely employed in illumination degradation image restoration (IDIR) to merge the interpretability of model-based approaches with the generalization of learning-based methods. However, the performance of DUN-based methods remains considerably inferior to that of state-of-the-art IDIR solvers. Our investigation indicates that this limitation does not stem from structural shortcomings of DUNs but rather from the limited exploration of the unfolding structure, particularly for (1) constructing task-specific restoration models, (2) integrating advanced network architectures, and (3) designing DUN-specific loss functions. To address these issues, we propose a novel DUN-based method, UnfoldIR, for IDIR tasks. UnfoldIR first introduces a new IDIR model with dedicated regularization terms for smoothing illumination and enhancing texture. We unfold the iterative optimized solution of this model into a multistage network, with each stage comprising a reflectance-assisted illumination correction (RAIC) module and an illumination-guided reflectance enhancement (IGRE) module. RAIC employs a visual state space (VSS) to extract non-local features, enforcing illumination smoothness, while IGRE introduces a frequency-aware VSS to globally align similar textures, enabling mildly degraded regions to guide the enhancement of details in more severely degraded areas. This suppresses noise while enhancing details. Furthermore, given the multistage structure, we propose an inter-stage information consistent loss to maintain network stability in the final stages. This loss contributes to structural preservation and sustains the model’s performance even in unsupervised settings. Experiments verify our effectiveness across 5 IDIR tasks and 3 downstream problems.
zh

[CV-114] Jailbreaking the Text-to-Video Generative Models

【速读】:该论文旨在解决文本到视频生成模型在面对恶意提示时可能生成不安全内容的安全漏洞问题,特别是针对“越狱攻击”(jailbreak attack)的防御不足。其解决方案的关键在于提出一种基于优化的越狱攻击方法,将提示生成任务建模为一个包含三个核心目标的优化问题:最大化输入与生成提示之间的语义相似性、确保生成的提示能够绕过模型的安全过滤机制,以及最大化生成视频与原始输入提示之间的语义相似性。此外,通过引入提示变异策略,进一步提升了生成提示的鲁棒性和视频的语义相关性。

链接: https://arxiv.org/abs/2505.06679
作者: Jiayang Liu,Siyuan Liang,Shiqian Zhao,Rongcheng Tu,Wenbo Zhou,Xiaochun Cao,Dacheng Tao,Siew Kei Lam
机构: Nanyang Technological University (南洋理工大学); University of Science and Technology of China (中国科学技术大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-video generative models have achieved significant progress, driven by the rapid advancements in diffusion models, with notable examples including Pika, Luma, Kling, and Sora. Despite their remarkable generation ability, their vulnerability to jailbreak attack, i.e. to generate unsafe content, including pornography, violence, and discrimination, raises serious safety concerns. Existing efforts, such as T2VSafetyBench, have provided valuable benchmarks for evaluating the safety of text-to-video models against unsafe prompts but lack systematic studies for exploiting their vulnerabilities effectively. In this paper, we propose the \textitfirst optimization-based jailbreak attack against text-to-video models, which is specifically designed. Our approach formulates the prompt generation task as an optimization problem with three key objectives: (1) maximizing the semantic similarity between the input and generated prompts, (2) ensuring that the generated prompts can evade the safety filter of the text-to-video model, and (3) maximizing the semantic similarity between the generated videos and the original input prompts. To further enhance the robustness of the generated prompts, we introduce a prompt mutation strategy that creates multiple prompt variants in each iteration, selecting the most effective one based on the averaged score. This strategy not only improves the attack success rate but also boosts the semantic relevance of the generated video. We conduct extensive experiments across multiple text-to-video models, including Open-Sora, Pika, Luma, and Kling. The results demonstrate that our method not only achieves a higher attack success rate compared to baseline methods but also generates videos with greater semantic similarity to the original input prompts.
zh

[CV-115] Video Dataset Condensation with Diffusion Models

【速读】:该论文旨在解决大规模数据集和复杂深度学习模型带来的计算资源需求激增问题,特别是针对视频领域的数据集蒸馏任务。其核心挑战在于如何生成高质量且具有代表性的合成视频数据,以保留原始数据集的关键信息。论文提出的解决方案关键在于引入视频扩散模型生成高质量合成视频,并结合Video Spatio-Temporal U-Net (VST-UNet) 选择多样化且信息丰富的视频子集,同时采用无需训练的Temporal-Aware Cluster-based Distillation (TAC-DT) 算法提升计算效率,从而在多个基准数据集上实现了性能提升。

链接: https://arxiv.org/abs/2505.06670
作者: Zhe Li,Hadrien Reynaud,Mischa Dombrowski,Sarah Cechnicka,Franciskus Xaverius Erick,Bernhard Kainz
机构: FAU Erlangen-Nürnberg(弗罗茨瓦夫大学); Imperial College London(帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:In recent years, the rapid expansion of dataset sizes and the increasing complexity of deep learning models have significantly escalated the demand for computational resources, both for data storage and model training. Dataset distillation has emerged as a promising solution to address this challenge by generating a compact synthetic dataset that retains the essential information from a large real dataset. However, existing methods often suffer from limited performance and poor data quality, particularly in the video domain. In this paper, we focus on video dataset distillation by employing a video diffusion model to generate high-quality synthetic videos. To enhance representativeness, we introduce Video Spatio-Temporal U-Net (VST-UNet), a model designed to select a diverse and informative subset of videos that effectively captures the characteristics of the original dataset. To further optimize computational efficiency, we explore a training-free clustering algorithm, Temporal-Aware Cluster-based Distillation (TAC-DT), to select representative videos without requiring additional training overhead. We validate the effectiveness of our approach through extensive experiments on four benchmark datasets, demonstrating performance improvements of up to (10.61%) over the state-of-the-art. Our method consistently outperforms existing approaches across all datasets, establishing a new benchmark for video dataset distillation.
zh

[CV-116] StableMotion: Repurposing Diffusion-Based Image Priors for Motion Estimation

【速读】:该论文试图解决单图像为基础的图像校正任务,如拼接图像校正(Stitched Image Rectangling, SIR)和滚动快门校正(Rolling Shutter Correction, RSC),这些问题通常需要从单张图像中估计运动信息。解决方案的关键在于提出StableMotion框架,该框架利用预训练大规模图像扩散模型的知识(几何和内容先验),将文本到图像的扩散模型重构为图像到运动估计器。此外,通过引入自适应集成策略(Adaptive Ensemble Strategy, AES)来缓解扩散模型输出不一致的问题,并结合采样步骤灾难(Sampling Steps Disaster, SSD)现象实现单步推理,从而显著提升效率与效果。

链接: https://arxiv.org/abs/2505.06668
作者: Ziyi Wang,Haipeng Li,Lin Sui,Tianhao Zhou,Hai Jiang,Lang Nie,Shuaicheng Liu
机构: University of Electronic Science and Technology of China (中国电子科技大学); Paradigm Inc (帕拉迪格姆公司); Sichuan University (四川大学); Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:We present StableMotion, a novel framework leverages knowledge (geometry and content priors) from pretrained large-scale image diffusion models to perform motion estimation, solving single-image-based image rectification tasks such as Stitched Image Rectangling (SIR) and Rolling Shutter Correction (RSC). Specifically, StableMotion framework takes text-to-image Stable Diffusion (SD) models as backbone and repurposes it into an image-to-motion estimator. To mitigate inconsistent output produced by diffusion models, we propose Adaptive Ensemble Strategy (AES) that consolidates multiple outputs into a cohesive, high-fidelity result. Additionally, we present the concept of Sampling Steps Disaster (SSD), the counterintuitive scenario where increasing the number of sampling steps can lead to poorer outcomes, which enables our framework to achieve one-step inference. StableMotion is verified on two image rectification tasks and delivers state-of-the-art performance in both, as well as showing strong generalizability. Supported by SSD, StableMotion offers a speedup of 200 times compared to previous diffusion model-based methods.
zh

[CV-117] MultiTaskVIF: Segmentation-oriented visible and infrared image fusion via multi-task learning

【速读】:该论文旨在解决传统可见光与红外图像融合(Visible and Infrared Image Fusion, VIF)方法在整合语义信息时存在的网络结构复杂和冗余问题。现有面向分割的VIF方法通常采用级联结构,分别独立进行融合与分割,导致模型效率低下。论文提出的解决方案关键在于设计一种简洁且通用的多任务训练框架——MultiTaskVIF,通过引入多任务头解码器(Multi-Task Head Decoder, MTH),在训练过程中同时输出融合图像和分割结果,从而直接将语义信息嵌入融合模型,避免了对完整分割模型的依赖,提升了模型的效率与性能。

链接: https://arxiv.org/abs/2505.06665
作者: Zixian Zhao,Andrew Howes,Xingchen Zhang
机构: University of Exeter (埃克塞特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visible and infrared image fusion (VIF) has attracted significant attention in recent years. Traditional VIF methods primarily focus on generating fused images with high visual quality, while recent advancements increasingly emphasize incorporating semantic information into the fusion model during training. However, most existing segmentation-oriented VIF methods adopt a cascade structure comprising separate fusion and segmentation models, leading to increased network complexity and redundancy. This raises a critical question: can we design a more concise and efficient structure to integrate semantic information directly into the fusion model during training-Inspired by multi-task learning, we propose a concise and universal training framework, MultiTaskVIF, for segmentation-oriented VIF models. In this framework, we introduce a multi-task head decoder (MTH) to simultaneously output both the fused image and the segmentation result during training. Unlike previous cascade training frameworks that necessitate joint training with a complete segmentation model, MultiTaskVIF enables the fusion model to learn semantic features by simply replacing its decoder with MTH. Extensive experimental evaluations validate the effectiveness of the proposed method. Our code will be released upon acceptance.
zh

[CV-118] METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection IJCAI2025

【速读】:该论文旨在解决开放词汇视频视觉关系检测(open-vocabulary video visual relationship detection)中因传统级联流程导致的误差传播问题,从而提升模型在未见类别上的性能。其解决方案的关键在于提出一种基于查询的统一框架——Mutual EnhancemenT of Objects and Relationships (METOR),通过联合建模和相互增强物体检测与关系分类,有效利用物体与关系之间的依赖性,提升识别性能。

链接: https://arxiv.org/abs/2505.06663
作者: Yongqi Wang,Xinxiao Wu,Shuo Yang
机构: Beijing Institute of Technology (北京理工大学); Shenzhen MSU-BIT University (深圳美中商学院-比特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IJCAI2025

点击查看摘要

Abstract:Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories. Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories. They typically adopt a cascaded pipeline to first detect objects and then classify relationships based on the detected objects, which may lead to error propagation and thus suboptimal performance. In this paper, we propose Mutual EnhancemenT of Objects and Relationships (METOR), a query-based unified framework to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios. Under this framework, we first design a CLIP-based contextual refinement encoding module that extracts visual contexts of objects and relationships to refine the encoding of text features and object queries, thus improving the generalization of encoding to novel categories. Then we propose an iterative enhancement module to alternatively enhance the representations of objects and relationships by fully exploiting their interdependence to improve recognition performance. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate that our framework achieves state-of-the-art performance.
zh

[CV-119] Dataset Distillation with Probabilistic Latent Features

【速读】:该论文旨在解决深度学习模型在复杂性和训练数据量增加背景下,存储和计算成本过高的问题。其解决方案的关键在于提出一种新颖的随机方法,通过建模潜在特征的联合分布来生成具有空间结构和多样性的合成样本,从而有效替代原始数据集。该方法采用由轻量级网络参数化的低秩多元正态分布,保持了较低的计算复杂度,并与多种数据集蒸馏中使用的匹配网络兼容,最终通过预训练生成器生成合成图像用于分类模型训练。

链接: https://arxiv.org/abs/2505.06647
作者: Zhe Li,Sarah Cechnicka,Cheng Ouyang,Katharina Breininger,Peter Schüffler,Bernhard Kainz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages

点击查看摘要

Abstract:As deep learning models grow in complexity and the volume of training data increases, reducing storage and computational costs becomes increasingly important. Dataset distillation addresses this challenge by synthesizing a compact set of synthetic data that can effectively replace the original dataset in downstream classification tasks. While existing methods typically rely on mapping data from pixel space to the latent space of a generative model, we propose a novel stochastic approach that models the joint distribution of latent features. This allows our method to better capture spatial structures and produce diverse synthetic samples, which benefits model training. Specifically, we introduce a low-rank multivariate normal distribution parameterized by a lightweight network. This design maintains low computational complexity and is compatible with various matching networks used in dataset distillation. After distillation, synthetic images are generated by feeding the learned latent features into a pretrained generator. These synthetic images are then used to train classification models, and performance is evaluated on real test set. We validate our method on several benchmarks, including ImageNet subsets, CIFAR-10, and the MedMNIST histopathological dataset. Our approach achieves state-of-the-art cross architecture performance across a range of backbone architectures, demonstrating its generality and effectiveness.
zh

[CV-120] Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization

【速读】:该论文旨在解决多模态输入在密集预测任务(尤其是语义分割)中因模态主导(unimodal dominance)导致的性能下降问题。其关键解决方案是引入一种基于功能熵(functional entropy)的简单但有效的即插即用正则化项,该方法通过利用log-Sobolev不等式将功能熵与功能Fisher信息联系起来,从而最大化每个视觉模态的信息贡献,实现多模态间的平衡。此外,还提出了一种多尺度正则化模块,在高层次特征和分割预测上应用该正则化项,进一步提升多模态学习的平衡性和鲁棒性。

链接: https://arxiv.org/abs/2505.06635
作者: Xu Zheng,Yuanhuiyi Lyu,Lutao Jiang,Danda Pani Paudel,Luc Van Gool,Xuming Hu
机构: HKUST(GZ); INSAIT; Sofia University “St. Kliment Ohridski”; HKUST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks, particularly semantic segmentation, is critically important yet remains a significant challenge. One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities, a phenomenon referred to as unimodal dominance or bias. This issue becomes especially problematic in real-world scenarios where the dominant modality may be unavailable, resulting in severe performance degradation. To this end, we apply a simple but effective plug-and-play regularization term based on functional entropy, which introduces no additional parameters or modules. This term is designed to intuitively balance the contribution of each visual modality to the segmentation results. Specifically, we leverage the log-Sobolev inequality to bound functional entropy using functional-Fisher-information. By maximizing the information contributed by each visual modality, our approach mitigates unimodal dominance and establishes a more balanced and robust segmentation framework. A multi-scale regularization module is proposed to apply our proposed plug-and-play term on high-level features and also segmentation predictions for more balanced multi-modal learning. Extensive experiments on three datasets demonstrate that our proposed method achieves superior performance, i.e., +13.94%, +3.25%, and +3.64%, without introducing any additional parameters.
zh

[CV-121] Minimizing Risk Through Minimizing Model-Data Interaction: A Protocol For Relying on Proxy Tasks When Designing Child Sexual Abuse Imagery Detection Models

【速读】:该论文试图解决儿童性虐待图像(Child Sexual Abuse Imagery, CSAI)分布问题所带来的挑战,特别是由于敏感数据的限制导致的研究方法受限以及执法机构(Law Enforcement Agents, LEAs)在数据分类上的手动负担过重。解决方案的关键在于引入“代理任务”(Proxy Tasks),即在不使用实际CSAI数据的情况下,通过替代任务训练模型,从而规避数据泄露风险并提升自动化检测能力。研究提出了一种结合代理任务与LEAs持续输入的协议,并首次将该协议应用于少样本室内场景分类任务,展示了在真实CSAI数据集上取得良好效果的模型,且该模型并未直接在敏感数据上进行训练。

链接: https://arxiv.org/abs/2505.06621
作者: Thamiris Coelho,Leo S. F. Ribeiro,João Macedo,Jefersson A. dos Santos,Sandra Avila
机构: Instituto de Computação, Universidade Estadual de Campinas (UNICAMP)(计算研究所,坎皮纳斯州立大学); Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo (USP)(数学与计算机科学研究所,圣保罗大学); Departamento de Ciência da Computação, Universidade Federal de Minas Gerais (UFMG)(计算机科学系,米纳斯吉拉斯联邦大学); Departamento de Polícia Federal (DPF)(联邦警察局); Department of Computer Science, University of Sheffield(计算机科学系,谢菲尔德大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: ACM Conference on Fairness, Accountability, and Transparency (FAccT 2025)

点击查看摘要

Abstract:The distribution of child sexual abuse imagery (CSAI) is an ever-growing concern of our modern world; children who suffered from this heinous crime are revictimized, and the growing amount of illegal imagery distributed overwhelms law enforcement agents (LEAs) with the manual labor of categorization. To ease this burden researchers have explored methods for automating data triage and detection of CSAI, but the sensitive nature of the data imposes restricted access and minimal interaction between real data and learning algorithms, avoiding leaks at all costs. In observing how these restrictions have shaped the literature we formalize a definition of “Proxy Tasks”, i.e., the substitute tasks used for training models for CSAI without making use of CSA data. Under this new terminology we review current literature and present a protocol for making conscious use of Proxy Tasks together with consistent input from LEAs to design better automation in this field. Finally, we apply this protocol to study – for the first time – the task of Few-shot Indoor Scene Classification on CSAI, showing a final model that achieves promising results on a real-world CSAI dataset whilst having no weights actually trained on sensitive data.
zh

[CV-122] ReplayCAD: Generative Diffusion Replay for Continual Anomaly Detection IJCAI2025

【速读】:该论文旨在解决持续异常检测(Continual Anomaly Detection, CAD)中的两个关键问题:灾难性遗忘和小异常区域的分割。现有方法通过存储图像分布或块特征来缓解灾难性遗忘,但无法保留像素级细节特征以实现精确分割。论文提出的解决方案是ReplayCAD,其关键在于利用扩散模型生成高质量的历史数据进行重放,通过在预训练扩散模型的条件空间中搜索类别语义嵌入,引导模型重现具有细粒度像素细节的数据,从而提升分割性能;同时引入空间特征以增强样本空间的多样性,实现更精确的数据生成。

链接: https://arxiv.org/abs/2505.06603
作者: Lei Hu,Zhiyong Gan,Ling Deng,Jinglin Liang,Lingyu Liang,Shuangping Huang,Tianshui Chen
机构: South China University of Technology (华南理工大学); China United Network Communications Corporation Limited Guangdong Branch (中国联合网络通信有限公司广东分公司); Pazhou Laboratory (琶洲实验室); Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Continual Anomaly Detection (CAD) enables anomaly detection models in learning new classes while preserving knowledge of historical classes. CAD faces two key challenges: catastrophic forgetting and segmentation of small anomalous regions. Existing CAD methods store image distributions or patch features to mitigate catastrophic forgetting, but they fail to preserve pixel-level detailed features for accurate segmentation. To overcome this limitation, we propose ReplayCAD, a novel diffusion-driven generative replay framework that replay high-quality historical data, thus effectively preserving pixel-level detailed features. Specifically, we compress historical data by searching for a class semantic embedding in the conditional space of the pre-trained diffusion model, which can guide the model to replay data with fine-grained pixel details, thus improving the segmentation performance. However, relying solely on semantic features results in limited spatial diversity. Hence, we further use spatial features to guide data compression, achieving precise control of sample space, thereby generating more diverse data. Our method achieves state-of-the-art performance in both classification and segmentation, with notable improvements in segmentation: 11.5% on VisA and 8.1% on MVTec. Our source code is available at this https URL.
zh

[CV-123] Batch Augmentation with Unimodal Fine-tuning for Multimodal Learning

【速读】:该论文旨在解决从超声图像及其相关临床文本信息中检测胎儿器官的问题,其解决方案的关键在于采用带有单模态微调的批量增强技术。具体而言,首先通过单模态图像部分的迁移初始化调整初始层权重以适应医学数据,随后利用微调后的初始层对图像进行批量增强处理以提取特征,并结合图像描述信息进行多模态特征融合,最终训练头部层。该方法在FPU23超声和UPMC Food-101多模态数据集上取得了接近最先进(SOTA)的性能。

链接: https://arxiv.org/abs/2505.06592
作者: H M Dipu Kabir,Subrota Kumar Mondal,Mohammad Ali Moni
机构: AI and Cyber Futures Institute, Charles Sturt University (人工智能与网络未来研究所,查尔斯·斯图尔特大学); Rural Health Research Institute, Charles Sturt University (农村健康研究研究所,查尔斯·斯图尔特大学); School of Computer Science and Engineering, Macau University of Science and Technology (计算机科学与工程学院,澳门科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper proposes batch augmentation with unimodal fine-tuning to detect the fetus’s organs from ultrasound images and associated clinical textual information. We also prescribe pre-training initial layers with investigated medical data before the multimodal training. At first, we apply a transferred initialization with the unimodal image portion of the dataset with batch augmentation. This step adjusts the initial layer weights for medical data. Then, we apply neural networks (NNs) with fine-tuned initial layers to images in batches with batch augmentation to obtain features. We also extract information from descriptions of images. We combine this information with features obtained from images to train the head layer. We write a dataloader script to load the multimodal data and use existing unimodal image augmentation techniques with batch augmentation for the multimodal data. The dataloader brings a new random augmentation for each batch to get a good generalization. We investigate the FPU23 ultrasound and UPMC Food-101 multimodal datasets. The multimodal large language model (LLM) with the proposed training provides the best results among the investigated methods. We receive near state-of-the-art (SOTA) performance on the UPMC Food-101 dataset. We share the scripts of the proposed method with traditional counterparts at the following repository: this http URL
zh

[CV-124] Compact and Efficient Neural Networks for Image Recognition Based on Learned 2D Separable Transform

【速读】:该论文试图解决传统神经网络中全连接(Fully Connected, FC)层参数数量庞大导致模型复杂度高和计算效率低的问题。解决方案的关键在于提出一种可学习的二维可分离变换(Learned Two-Dimensional Separable Transform, LST),通过共享一个FC层的权重来处理图像的所有行,随后使用第二个共享的FC层处理第一层得到的图像表示的所有列,从而显著减少模型参数数量。

链接: https://arxiv.org/abs/2505.06578
作者: Maxim Vashkevich,Egor Krivalcevich
机构: Belarusian State University of Informatics and Radioelectronics (白俄罗斯信息与无线电电子大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages, 9 figures

点击查看摘要

Abstract:The paper presents a learned two-dimensional separable transform (LST) that can be considered as a new type of computational layer for constructing neural network (NN) architecture for image recognition tasks. The LST based on the idea of sharing the weights of one fullyconnected (FC) layer to process all rows of an image. After that, a second shared FC layer is used to process all columns of image representation obtained from the first layer. The use of LST layers in a NN architecture significantly reduces the number of model parameters compared to models that use stacked FC layers. We show that a NN-classifier based on a single LST layer followed by an FC layer achieves 98.02% accuracy on the MNIST dataset, while having only 9.5k parameters. We also implemented a LST-based classifier for handwritten digit recognition on the FPGA platform to demonstrate the efficiency of the suggested approach for designing a compact and high-performance implementation of NN models. Git repository with supplementary materials: this https URL
zh

[CV-125] wo-Stage Random Alternation Framework for Zero-Shot Pansharpening

【速读】:该论文旨在解决遥感图像融合中因难以获取真实高分辨率图像而限制深度学习方法实际应用的问题。其解决方案的关键在于提出一种两阶段随机交替框架(TRA-PAN),该框架通过结合低分辨率图像的强监督约束与全分辨率图像的物理特性,有效提升了融合效果。第一阶段引入了退化感知建模(DAM)和预热训练以减少训练时间并缓解低分辨率数据的负面影响,第二阶段采用随机交替优化(RAO)策略,充分利用低分辨率与全分辨率图像的优势,最终实现仅需单个图像对即可完成零样本训练,显著降低了对大规模数据集的依赖。

链接: https://arxiv.org/abs/2505.06576
作者: Haorui Chen,Zeyu Ren,Jiaxuan Ren,Ran Ran,Jinliang Shao,Jie Huang,Liangjian Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, pansharpening has seen rapid advancements with deep learning methods, which have demonstrated impressive fusion quality. However, the challenge of acquiring real high-resolution images limits the practical applicability of these methods. To address this, we propose a two-stage random alternating framework (TRA-PAN) that effectively integrates strong supervision constraints from reduced-resolution images with the physical characteristics of full-resolution images. The first stage introduces a pre-training procedure, which includes Degradation-Aware Modeling (DAM) to capture spatial-spectral degradation mappings, alongside a warm-up procedure designed to reduce training time and mitigate the negative effects of reduced-resolution data. In the second stage, Random Alternation Optimization (RAO) is employed, where random alternating training leverages the strengths of both reduced- and full-resolution images, further optimizing the fusion model. By primarily relying on full-resolution images, our method enables zero-shot training with just a single image pair, obviating the need for large datasets. Experimental results demonstrate that TRA-PAN outperforms state-of-the-art (SOTA) methods in both quantitative metrics and visual quality in real-world scenarios, highlighting its strong practical applicability.
zh

[CV-126] GRACE: Estimating Geometry-level 3D Human-Scene Contact from 2D Images

【速读】:该论文旨在解决人体与场景接触几何级别的估计问题,即在3D人体几何体上定位具体的接触表面点,从而提供空间先验并建立人体与场景之间的交互关系。现有方法主要依赖参数化人体模型(如SMPL),通过固定的SMPL顶点序列建立图像与接触区域的对应关系,但该方法忽略了几何信息,限制了在不同人体几何结构中的泛化能力。本文提出的GRACE(Geometry-level Reasoning for 3D Human-scene Contact Estimation)采用点云编码器-解码器架构及分层特征提取与融合模块,有效整合3D人体几何结构与来自图像的2D交互语义,通过视觉线索建立几何特征到3D人体网格顶点空间的隐式映射,从而实现接触区域的精确建模,其关键在于将几何信息与视觉语义相结合,提升模型的预测精度与泛化能力。

链接: https://arxiv.org/abs/2505.06575
作者: Chengfeng Wang,Wei Zhai,Yuhang Yang,Yang Cao,Zhengjun Zha
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Estimating the geometry level of human-scene contact aims to ground specific contact surface points at 3D human geometries, which provides a spatial prior and bridges the interaction between human and scene, supporting applications such as human behavior analysis, embodied AI, and AR/VR. To complete the task, existing approaches predominantly rely on parametric human models (e.g., SMPL), which establish correspondences between images and contact regions through fixed SMPL vertex sequences. This actually completes the mapping from image features to an ordered sequence. However, this approach lacks consideration of geometry, limiting its generalizability in distinct human geometries. In this paper, we introduce GRACE (Geometry-level Reasoning for 3D Human-scene Contact Estimation), a new paradigm for 3D human contact estimation. GRACE incorporates a point cloud encoder-decoder architecture along with a hierarchical feature extraction and fusion module, enabling the effective integration of 3D human geometric structures with 2D interaction semantics derived from images. Guided by visual cues, GRACE establishes an implicit mapping from geometric features to the vertex space of the 3D human mesh, thereby achieving accurate modeling of contact regions. This design ensures high prediction accuracy and endows the framework with strong generalization capability across diverse human geometries. Extensive experiments on multiple benchmark datasets demonstrate that GRACE achieves state-of-the-art performance in contact estimation, with additional results further validating its robust generalization to unstructured human point clouds.
zh

[CV-127] ElectricSight: 3D Hazard Monitoring for Power Lines Using Low-Cost Sensors

【速读】:该论文旨在解决电力输电线路与潜在威胁(如大型起重机)之间距离精确测量的问题,现有基于传感器的方法在精度与成本之间难以平衡。其关键解决方案是提出ElectricSight系统,该系统通过将实时图像与环境点云先验信息结合,实现了低成本且高精度的三维距离测量,其中单目深度估计方法通过融合三维点云数据提升了测量的准确性和可靠性。

链接: https://arxiv.org/abs/2505.06573
作者: Xingchen Li,LiDian Wang,Yu Sheng,ZhiPeng Tang,Haojie Ren,Guoliang You,YiFan Duan,Jianmin Ji,Yanyong Zhang
机构: School of Computer Science and Technology, University of Science and Technology of China (USTC) (中国科学技术大学计算机科学与技术学院); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Protecting power transmission lines from potential hazards involves critical tasks, one of which is the accurate measurement of distances between power lines and potential threats, such as large cranes. The challenge with this task is that the current sensor-based methods face challenges in balancing accuracy and cost in distance measurement. A common practice is to install cameras on transmission towers, which, however, struggle to measure true 3D distances due to the lack of depth information. Although 3D lasers can provide accurate depth data, their high cost makes large-scale deployment impractical. To address this challenge, we present ElectricSight, a system designed for 3D distance measurement and monitoring of potential hazards to power transmission lines. This work’s key innovations lie in both the overall system framework and a monocular depth estimation method. Specifically, the system framework combines real-time images with environmental point cloud priors, enabling cost-effective and precise 3D distance measurements. As a core component of the system, the monocular depth estimation method enhances the performance by integrating 3D point cloud data into image-based estimates, improving both the accuracy and reliability of the system. To assess ElectricSight’s performance, we conducted tests with data from a real-world power transmission scenario. The experimental results demonstrate that ElectricSight achieves an average accuracy of 1.08 m for distance measurements and an early warning accuracy of 92%. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2505.06573 [cs.CV] (or arXiv:2505.06573v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2505.06573 Focus to learn more arXiv-issued DOI via DataCite
zh

[CV-128] Dynamic Uncertainty Learning with Noisy Correspondence for Text-Based Person Search

【速读】:该论文旨在解决文本到图像的人体检索任务中因大规模文本-图像数据集存在噪声(尤其是不匹配对)而导致的检索性能下降问题。现有方法往往关注负样本,反而加剧了噪声影响。其解决方案的关键在于提出动态不确定性与关系对齐框架(DURA),包含关键特征选择器(KFS)和动态软最大间隔损失函数(DSH-Loss)。KFS通过建模噪声不确定性提升检索可靠性,而DSH-Loss通过调整负样本难度增强在噪声环境下的鲁棒性。

链接: https://arxiv.org/abs/2505.06566
作者: Zequn Xie,Haoming Ji,Lingwei Meng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image person search aims to identify an individual based on a text description. To reduce data collection costs, large-scale text-image datasets are created from co-occurrence pairs found online. However, this can introduce noise, particularly mismatched pairs, which degrade retrieval performance. Existing methods often focus on negative samples, amplifying this noise. To address these issues, we propose the Dynamic Uncertainty and Relational Alignment (DURA) framework, which includes the Key Feature Selector (KFS) and a new loss function, Dynamic Softmax Hinge Loss (DSH-Loss). KFS captures and models noise uncertainty, improving retrieval reliability. The bidirectional evidence from cross-modal similarity is modeled as a Dirichlet distribution, enhancing adaptability to noisy data. DSH adjusts the difficulty of negative samples to improve robustness in noisy environments. Our experiments on three datasets show that the method offers strong noise resistance and improves retrieval performance in both low- and high-noise scenarios.
zh

[CV-129] Weakly Supervised Temporal Sentence Grounding via Positive Sample Mining

【速读】:该论文旨在解决弱监督时间句子定位(Weakly Supervised Temporal Sentence Grounding, WSTSG)中的负样本构造问题,即现有方法在对比学习中从其他视频或同一视频中生成负样本时,忽略了与锚点样本高度相似的训练样本之间的相关性,导致优化困难。其解决方案的关键在于提出正样本挖掘(Positive Sample Mining, PSM),通过根据文本查询的相似性将训练集划分为语义相似和不相似子集,并引入PSM引导的对比损失和排序损失,以增强锚点提议与相似样本的接近性和与不相似样本的差异性。

链接: https://arxiv.org/abs/2505.06557
作者: Lu Dong,Haiyu Zhang,Hongjie Zhang,Yifei Huang,Zhen-Hua Ling,Yu Qiao,Limin Wang,Yali Wang
机构: University of Science and Technology of China (中国科学技术大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Beihang University (北京航空航天大学); State Key Laboratory for Novel Software Technology, Nanjing University (南京大学软件新技术国家重点实验室); Shenzhen Key Laboratory of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (深圳市计算机视觉与模式识别重点实验室,深圳先进技术研究院,中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: TCSVT 2025, doi at this https URL

点击查看摘要

Abstract:The task of weakly supervised temporal sentence grounding (WSTSG) aims to detect temporal intervals corresponding to a language description from untrimmed videos with only video-level video-language correspondence. For an anchor sample, most existing approaches generate negative samples either from other videos or within the same video for contrastive learning. However, some training samples are highly similar to the anchor sample, directly regarding them as negative samples leads to difficulties for optimization and ignores the correlations between these similar samples and the anchor sample. To address this, we propose Positive Sample Mining (PSM), a novel framework that mines positive samples from the training set to provide more discriminative supervision. Specifically, for a given anchor sample, we partition the remaining training set into semantically similar and dissimilar subsets based on the similarity of their text queries. To effectively leverage these correlations, we introduce a PSM-guided contrastive loss to ensure that the anchor proposal is closer to similar samples and further from dissimilar ones. Additionally, we design a PSM-guided rank loss to ensure that similar samples are closer to the anchor proposal than to the negative intra-video proposal, aiming to distinguish the anchor proposal and the negative intra-video proposal. Experiments on the WSTSG and grounded VideoQA tasks demonstrate the effectiveness and superiority of our method.
zh

[CV-130] HDGlyph: A Hierarchical Disentangled Glyph-Based Framework for Long-Tail Text Rendering in Diffusion Models

【速读】:该论文旨在解决视觉文本渲染(Visual Text Rendering)中长期尾部文本案例(尤其是未见过或小尺寸文本)处理效果不佳的问题。其解决方案的关键在于提出了一种分层解耦的基于字形(Glyph)的框架(HDGlyph),通过将文本生成与非文本视觉合成进行分层解耦,实现对常见和长尾文本渲染的联合优化。该框架在训练阶段利用多语言字形网络(Multi-Linguistic GlyphNet)和字形感知感知损失(Glyph-Aware Perceptual Loss)解耦像素级表示,而在推理阶段采用噪声解耦无分类器引导和潜在解耦两阶段渲染(LD-TSR)方案,以提升背景和小尺寸文本的渲染质量。

链接: https://arxiv.org/abs/2505.06543
作者: Shuhan Zhuang,Mengqi Huang,Fengyi Fu,Nan Chen,Bohan Lei,Zhendong Mao
机构: University of Science and Technology of China (中国科学技术大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual text rendering, which aims to accurately integrate specified textual content within generated images, is critical for various applications such as commercial design. Despite recent advances, current methods struggle with long-tail text cases, particularly when handling unseen or small-sized text. In this work, we propose a novel Hierarchical Disentangled Glyph-Based framework (HDGlyph) that hierarchically decouples text generation from non-text visual synthesis, enabling joint optimization of both common and long-tail text rendering. At the training stage, HDGlyph disentangles pixel-level representations via the Multi-Linguistic GlyphNet and the Glyph-Aware Perceptual Loss, ensuring robust rendering even for unseen characters. At inference time, HDGlyph applies Noise-Disentangled Classifier-Free Guidance and Latent-Disentangled Two-Stage Rendering (LD-TSR) scheme, which refines both background and small-sized text. Extensive evaluations show our model consistently outperforms others, with 5.08% and 11.7% accuracy gains in English and Chinese text rendering while maintaining high image quality. It also excels in long-tail scenarios with strong accuracy and visual performance.
zh

[CV-131] ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images

【速读】:该论文旨在解决时尚视频生成中视图一致性不足和时空一致性较差的问题,特别是在使用多视角衣物图案时,现有基于扩散的方法仅支持单张参考图像输入,限制了生成效果。其解决方案的关键在于提出ProFashion框架,通过引入姿态感知的原型聚合器(Pose-aware Prototype Aggregator)和流增强的原型实例化器(Flow-enhanced Prototype Instantiator),有效融合多参考图像特征并提升运动一致性,从而实现更高质量的时尚视频生成。

链接: https://arxiv.org/abs/2505.06537
作者: Xianghao Kong,Qiaosong Qi,Yuanbin Wang,Anyi Rao,Biaolong Chen,Aixi Zhang,Si Liu,Hao Jiang
机构: Beihang University (北京航空航天大学); Taobao & Tmall Group (淘宝与天猫集团); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fashion video generation aims to synthesize temporally consistent videos from reference images of a designated character. Despite significant progress, existing diffusion-based methods only support a single reference image as input, severely limiting their capability to generate view-consistent fashion videos, especially when there are different patterns on the clothes from different perspectives. Moreover, the widely adopted motion module does not sufficiently model human body movement, leading to sub-optimal spatiotemporal consistency. To address these issues, we propose ProFashion, a fashion video generation framework leveraging multiple reference images to achieve improved view consistency and temporal coherency. To effectively leverage features from multiple reference images while maintaining a reasonable computational cost, we devise a Pose-aware Prototype Aggregator, which selects and aggregates global and fine-grained reference features according to pose information to form frame-wise prototypes, which serve as guidance in the denoising process. To further enhance motion consistency, we introduce a Flow-enhanced Prototype Instantiator, which exploits the human keypoint motion flow to guide an extra spatiotemporal attention process in the denoiser. To demonstrate the effectiveness of ProFashion, we extensively evaluate our method on the MRFashion-7K dataset we collected from the Internet. ProFashion also outperforms previous methods on the UBC Fashion dataset.
zh

[CV-132] ACFN: Transformer-based Adaptive Cross-modal Fusion Network for Multimodal Emotion Recognition

【速读】:该论文旨在解决多模态情感识别任务中跨模态注意力融合方法存在的冗余特征问题以及对互补特征捕捉不足的问题。其解决方案的关键在于设计一种基于Transformer的自适应跨模态融合网络(TACFN),通过自注意力机制实现模态内特征选择,以减少冗余并提高跨模态交互的适应性和效率,并通过拼接获取融合权重向量以更好地捕获模态间的互补信息。

链接: https://arxiv.org/abs/2505.06536
作者: Feng Liu,Ziwang Fu,Yunlong Wang,Qijian Zheng
机构: East China Normal University (华东师范大学); Meitu (China) Limited (美图(中国)有限公司); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2111.02172

点击查看摘要

Abstract:The fusion technique is the key to the multimodal emotion recognition task. Recently, cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. However, cross-modal attention suffers from redundant features and does not capture complementary features well. We find that it is not necessary to use the entire information of one modality to reinforce the other during cross-modal interaction, and the features that can reinforce a modality may contain only a part of it. To this end, we design an innovative Transformer-based Adaptive Cross-modal Fusion Network (TACFN). Specifically, for the redundant features, we make one modality perform intra-modal feature selection through a self-attention mechanism, so that the selected features can adaptively and efficiently interact with another modality. To better capture the complementary information between the modalities, we obtain the fused weight vector by splicing and use the weight vector to achieve feature reinforcement of the modalities. We apply TCAFN to the RAVDESS and IEMOCAP datasets. For fair comparison, we use the same unimodal representations to validate the effectiveness of the proposed fusion method. The experimental results show that TACFN brings a significant performance improvement compared to other methods and reaches the state-of-the-art. All code and models could be accessed from this https URL.
zh

[CV-133] Unmasking Deep Fakes: Leverag ing Deep Learning for Video Authenticity Detection

【速读】:该论文试图解决深度伪造(Deepfake)视频对数字媒体真实性带来的挑战,其核心问题是开发一种能够识别深度伪造视频中细微不一致性的先进检测方法。解决方案的关键在于利用深度学习技术,特别是卷积神经网络(Convolutional Neural Networks, CNN),以实现对深度伪造视频的高效识别。研究中采用MTCNN作为人脸检测器和EfficientNet-B5作为编码模型,通过Kaggle DFDC数据集进行训练与评估,最终实现了较高的检测性能。

链接: https://arxiv.org/abs/2505.06528
作者: Mahmudul Hasan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deepfake videos, produced through advanced artificial intelligence methods now a days, pose a new challenge to the truthfulness of the digital media. As Deepfake becomes more convincing day by day, detecting them requires advanced methods capable of identifying subtle inconsistencies. The primary motivation of this paper is to recognize deepfake videos using deep learning techniques, specifically by using convolutional neural networks. Deep learning excels in pattern recognition, hence, makes it an ideal approach for detecting the intricate manipulations in deepfakes. In this paper, we consider using MTCNN as a face detector and EfficientNet-B5 as encoder model to predict if a video is deepfake or not. We utilize training and evaluation dataset from Kaggle DFDC. The results shows that our deepfake detection model acquired 42.78% log loss, 93.80% AUC and 86.82% F1 score on kaggle’s DFDC dataset.
zh

[CV-134] Improving Generalization of Medical Image Registration Foundation Model IJCNN

【速读】:该论文试图解决医学图像配准中基础模型在面对新型解剖结构、不同成像条件或未见过的模态时存在的泛化性和鲁棒性不足的问题。解决方案的关键在于将Sharpness-Aware Minimization (SAM) 引入基础模型,通过优化损失函数的平坦性来提升模型在多样化数据分布下的稳定性,并增强其处理复杂临床场景的能力。

链接: https://arxiv.org/abs/2505.06527
作者: Jing Hu,Kaiwei Yu,Hongjiang Xian,Shu Hu,Xin Wang
机构: Chengdu University of Information Technology (成都信息工程大学); Purdue University (普渡大学); University at Albany, State University of New York (纽约州立大学阿尔巴尼分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: IJCNN

点击查看摘要

Abstract:Deformable registration is a fundamental task in medical image processing, aiming to achieve precise alignment by establishing nonlinear correspondences between images. Traditional methods offer good adaptability and interpretability but are limited by computational efficiency. Although deep learning approaches have significantly improved registration speed and accuracy, they often lack flexibility and generalizability across different datasets and tasks. In recent years, foundation models have emerged as a promising direction, leveraging large and diverse datasets to learn universal features and transformation patterns for image registration, thus demonstrating strong cross-task transferability. However, these models still face challenges in generalization and robustness when encountering novel anatomical structures, varying imaging conditions, or unseen modalities. To address these limitations, this paper incorporates Sharpness-Aware Minimization (SAM) into foundation models to enhance their generalization and robustness in medical image registration. By optimizing the flatness of the loss landscape, SAM improves model stability across diverse data distributions and strengthens its ability to handle complex clinical scenarios. Experimental results show that foundation models integrated with SAM achieve significant improvements in cross-dataset registration performance, offering new insights for the advancement of medical image registration technology. Our code is available at this https URLthis https URL_sam.
zh

[CV-135] Causal Prompt Calibration Guided Segment Anything Model for Open-Vocabulary Multi-Entity Segmentation

【速读】:该论文试图解决生成式 AI (Generative AI) 在开放词汇多实体分割(OVMS)任务中的泛化能力不足问题。其核心问题是提示(prompt)中的任务无关生成因素导致的提示偏差,这些因素作为混杂变量影响了模型的泛化性能。解决方案的关键在于提出一种因果提示校准方法(Causal Prompt Calibration, CPC-SAM),通过消除提示中的混杂变量,获得仅包含任务相关因果因素的因果提示,从而提升OVMS的准确性。该方法通过因果多分布一致性理论,结合轻量级因果提示学习器(CaPL)实现提示的校准与优化。

链接: https://arxiv.org/abs/2505.06524
作者: Jingyao Wang,Jianqi Zhang,Wenwen Qiang,Changwen Zheng
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Software Chinese Academy of Sciences (中国科学院软件研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the strength of the Segment Anything Model (SAM), it struggles with generalization issues in open-vocabulary multi-entity segmentation (OVMS). Through empirical and causal analyses, we find that (i) the prompt bias is the primary cause of the generalization issues; (ii) this bias is closely tied to the task-irrelevant generating factors within the prompts, which act as confounders and affect generalization. To address the generalization issues, we aim to propose a method that can calibrate prompts to eliminate confounders for accurate OVMS. Building upon the causal analysis, we propose that the optimal prompt for OVMS should contain only task-relevant causal factors. We define it as the causal prompt, serving as the goal of calibration. Next, our theoretical analysis, grounded by causal multi-distribution consistency theory, proves that this prompt can be obtained by enforcing segmentation consistency and optimality. Inspired by this, we propose CPC-SAM, a Causal Prompt Calibration method for SAM to achieve accurate OVMS. It integrates a lightweight causal prompt learner (CaPL) into SAM to obtain causal prompts. Specifically, we first generate multiple prompts using random annotations to simulate diverse distributions and then reweight them via CaPL by enforcing causal multi-distribution consistency in both task and entity levels. To ensure obtaining causal prompts, CaPL is optimized by minimizing the cumulative segmentation loss across the reweighted prompts to achieve consistency and optimality. A bi-level optimization strategy alternates between optimizing CaPL and SAM, ensuring accurate OVMS. Extensive experiments validate its superiority.
zh

[CV-136] Edge-Enabled VIO with Long-Tracked Features for High-Accuracy Low-Altitude IoT Navigation

【速读】:该论文试图解决视觉惯性里程计(VIO)中长期跟踪特征(long-tracked features)导致的定位漂移问题,以及由此引发的匹配误差累积和实时性能下降问题。其解决方案的关键在于提出一种主动解耦机制,用于消除长期跟踪特征中的累积误差,具体包括引入视觉参考帧重置策略以消除跟踪误差,以及采用深度预测策略以利用长期约束。此外,为确保实时性能,还设计了三种高效的状态估计策略:基于预定义消除顺序的并行消除策略、逆深度消除简化策略和消除跳过策略。

链接: https://arxiv.org/abs/2505.06517
作者: Xiaohong Huang,Cui Yang,Miaowen Wen
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 9 pages with 9 figures

点击查看摘要

Abstract:This paper presents a visual-inertial odometry (VIO) method using long-tracked features. Long-tracked features can constrain more visual frames, reducing localization drift. However, they may also lead to accumulated matching errors and drift in feature tracking. Current VIO methods adjust observation weights based on re-projection errors, yet this approach has flaws. Re-projection errors depend on estimated camera poses and map points, so increased errors might come from estimation inaccuracies, not actual feature tracking errors. This can mislead the optimization process and make long-tracked features ineffective for suppressing localization drift. Furthermore, long-tracked features constrain a larger number of frames, which poses a significant challenge to real-time performance of the system. To tackle these issues, we propose an active decoupling mechanism for accumulated errors in long-tracked feature utilization. We introduce a visual reference frame reset strategy to eliminate accumulated tracking errors and a depth prediction strategy to leverage the long-term constraint. To ensure real time preformane, we implement three strategies for efficient system state estimation: a parallel elimination strategy based on predefined elimination order, an inverse-depth elimination simplification strategy, and an elimination skipping strategy. Experiments on various datasets show that our method offers higher positioning accuracy with relatively short consumption time, making it more suitable for edge-enabled low-altitude IoT navigation, where high-accuracy positioning and real-time operation on edge device are required. The code will be published at github.
zh

[CV-137] Quantum Conflict Measurement in Decision Making for Out-of-Distribution Detection

【速读】:该论文试图解决量子Dempster-Shafer理论(QDST)中多个量子质量函数(QMF)之间冲突的有效管理问题。其解决方案的关键是提出一种量子冲突指标(QCI),用于衡量两个QMF在决策过程中的冲突程度,并通过实验验证其符合非负性、对称性、有界性、极端一致性及对细化不敏感等理想冲突度量属性。此外,基于QCI的融合方法在冲突融合任务中表现出优于传统方法的性能,进一步应用于类描述域空间(C-DDS)及其优化版本C-DDS+,以提升分布外(OOD)检测的效果。

链接: https://arxiv.org/abs/2505.06516
作者: Yilin Dong,Tianyun Zhu,Xinde Li,Jean Dezert,Rigui Zhou,Changming Zhu,Lei Cao,Shuzhi Sam Ge
机构: Shanghai Maritime University (上海海事大学); Southeast University (东南大学); ONERA—The French Aerospace Lab (法国航空航天实验室); National University of Singapore (新加坡国立大学); Qingdao University (青岛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 28 figures

点击查看摘要

Abstract:Quantum Dempster-Shafer Theory (QDST) uses quantum interference effects to derive a quantum mass function (QMF) as a fuzzy metric type from information obtained from various data sources. In addition, QDST uses quantum parallel computing to speed up computation. Nevertheless, the effective management of conflicts between multiple QMFs in QDST is a challenging question. This work aims to address this problem by proposing a Quantum Conflict Indicator (QCI) that measures the conflict between two QMFs in decision-making. Then, the properties of the QCI are carefully investigated. The obtained results validate its compliance with desirable conflict measurement properties such as non-negativity, symmetry, boundedness, extreme consistency and insensitivity to refinement. We then apply the proposed QCI in conflict fusion methods and compare its performance with several commonly used fusion approaches. This comparison demonstrates the superiority of the QCI-based conflict fusion method. Moreover, the Class Description Domain Space (C-DDS) and its optimized version, C-DDS+ by utilizing the QCI-based fusion method, are proposed to address the Out-of-Distribution (OOD) detection task. The experimental results show that the proposed approach gives better OOD performance with respect to several state-of-the-art baseline OOD detection methods. Specifically, it achieves an average increase in Area Under the Receiver Operating Characteristic Curve (AUC) of 1.2% and a corresponding average decrease in False Positive Rate at 95% True Negative Rate (FPR95) of 5.4% compared to the optimal baseline method.
zh

[CV-138] RESAR-BEV: An Explainable Progressive Residual Autoregressive Approach for Camera-Radar Fusion in BEV Segmentation

【速读】:该论文旨在解决自动驾驶中鸟瞰图(BEV)语义分割面临的多模态对齐误差和传感器噪声问题。其解决方案的关键在于提出RESAR-BEV框架,该框架通过渐进式细化机制实现从粗到细的可解释分割阶段,结合残差自回归学习与Drive-Transformer和Modifier-Transformer的级联架构;同时采用鲁棒的BEV表示方法,融合地面接近体素、自适应高度偏移以及双路径体素特征编码以提升特征提取效率;此外,通过解耦监督策略,结合离线真实标签分解与在线联合优化,有效防止过拟合并保持结构一致性。

链接: https://arxiv.org/abs/2505.06515
作者: Zhiwen Zeng,Yunfei Yin,Zheng Yuan,Argho Dey,Xianjian Bao
机构: Chongqing University (重庆大学); Maharishi University of Management (马哈里希管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work was submitted to IEEE Transactions on Intelligent Transportation Systems (T-ITS) on 09-May-2025

点击查看摘要

Abstract:Bird’s-Eye-View (BEV) semantic segmentation provides comprehensive environmental perception for autonomous driving but suffers multi-modal misalignment and sensor noise. We propose RESAR-BEV, a progressive refinement framework that advances beyond single-step end-to-end approaches: (1) progressive refinement through residual autoregressive learning that decomposes BEV segmentation into interpretable coarse-to-fine stages via our Drive-Transformer and Modifier-Transformer residual prediction cascaded architecture, (2) robust BEV representation combining ground-proximity voxels with adaptive height offsets and dual-path voxel feature encoding (max+attention pooling) for efficient feature extraction, and (3) decoupled supervision with offline Ground Truth decomposition and online joint optimization to prevent overfitting while ensuring structural coherence. Experiments on nuScenes demonstrate RESAR-BEV achieves state-of-the-art performance with 54.0% mIoU across 7 essential driving-scene categories while maintaining real-time capability at 14.6 FPS. The framework exhibits robustness in challenging scenarios of long-range perception and adverse weather conditions.
zh

[CV-139] HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation

【速读】:该论文旨在解决文本到图像生成中高阶语义保真度与显式空间控制难以兼顾的问题,尤其是在涉及多物体、细微关系或复杂布局的场景中。其解决方案的关键在于提出一种分层跨模态对齐(Hierarchical Cross-Modal Alignment, HCMA)框架,该框架在每个扩散采样步骤中集成两个对齐模块:全局模块通过持续对齐潜在表示与文本描述以确保场景级连贯性,局部模块则利用边界框布局将物体锚定在指定位置,从而实现细粒度的空间控制。

链接: https://arxiv.org/abs/2505.06512
作者: Hang Wang,Zhi-Qi Cheng,Chenhao Lin,Chao Shen,Lei Zhang
机构: The Hong Kong Polytechnic University (香港理工大学); University of Washington (华盛顿大学); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Text-to-image synthesis has progressed to the point where models can generate visually compelling images from natural language prompts. Yet, existing methods often fail to reconcile high-level semantic fidelity with explicit spatial control, particularly in scenes involving multiple objects, nuanced relations, or complex layouts. To bridge this gap, we propose a Hierarchical Cross-Modal Alignment (HCMA) framework for grounded text-to-image generation. HCMA integrates two alignment modules into each diffusion sampling step: a global module that continuously aligns latent representations with textual descriptions to ensure scene-level coherence, and a local module that employs bounding-box layouts to anchor objects at specified locations, enabling fine-grained spatial control. Extensive experiments on the MS-COCO 2014 validation set show that HCMA surpasses state-of-the-art baselines, achieving a 0.69 improvement in Frechet Inception Distance (FID) and a 0.0295 gain in CLIP Score. These results demonstrate HCMA’s effectiveness in faithfully capturing intricate textual semantics while adhering to user-defined spatial constraints, offering a robust solution for semantically grounded image this http URL code is available at this https URL
zh

[CV-140] xt-to-CadQuery: A New Paradigm for CAD Generation with Scalable Large Model Capabilities

【速读】:该论文试图解决传统计算机辅助设计(CAD)建模过程中需要专家知识和专用软件的问题,以及现有方法生成的任务特定命令序列难以被预训练大型语言模型(LLMs)直接处理的局限性。其解决方案的关键在于直接从文本生成CadQuery代码,利用预训练LLMs在Python生成和空间推理方面的优势,从而避免了中间表示的转换,简化了3D模型生成流程。通过在Text-to-CadQuery数据上微调不同规模的开源LLMs,验证了该方法的有效性。

链接: https://arxiv.org/abs/2505.06507
作者: Haoyang Xie,Feng Ju
机构: Arizona State University (亚利桑那州立大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Computer-aided design (CAD) is fundamental to modern engineering and manufacturing, but creating CAD models still requires expert knowledge and specialized software. Recent advances in large language models (LLMs) open up the possibility of generative CAD, where natural language is directly translated into parametric 3D models. However, most existing methods generate task-specific command sequences that pretrained models cannot directly handle. These sequences must be converted into CAD representations such as CAD vectors before a 3D model can be produced, which requires training models from scratch and adds unnecessary complexity. To tackle this issue, we propose generating CadQuery code directly from text, leveraging the strengths of pretrained LLMs to produce 3D models without intermediate representations, using this Python-based scripting language. Since LLMs already excel at Python generation and spatial reasoning, fine-tuning them on Text-to-CadQuery data proves highly effective. Given that these capabilities typically improve with scale, we hypothesize that larger models will perform better after fine-tuning. To enable this, we augment the Text2CAD dataset with 170,000 CadQuery annotations. We fine-tune six open-source LLMs of varying sizes and observe consistent improvements. Our best model achieves a top-1 exact match of 69.3%, up from 58.8%, and reduces Chamfer Distance by 48.6%. Project page: this https URL.
zh

[CV-141] CompSLAM: Complementary Hierarchical Multi-Modal Localization and Mapping for Robot Autonomy in Underground Environments

【速读】:该论文旨在解决在未知、无GPS、复杂地下环境中实现机器人自主性所面临的实时、鲁棒且精确的机载位姿估计与建图问题,尤其是在感知退化条件下,如黑暗、尘埃和几何自相似结构等恶劣环境因素带来的挑战。其解决方案的关键在于提出了一种名为CompSLAM的高韧性分层多模态定位与建图框架,通过融合多种传感器模态的位姿估计结果,利用其互补性实现冗余,从而提升系统的鲁棒性。

链接: https://arxiv.org/abs/2505.06483
作者: Shehryar Khattak,Timon Homberger,Lukas Bernreiter,Julian Nubert,Olov Andersson,Roland Siegwart,Kostas Alexis,Marco Hutter
机构: Jet Propulsion Lab, California Institute of Technology, USA(喷气推进实验室,加利福尼亚理工学院,美国); KTH Royal Institute of Technology, Sweden(瑞典皇家理工学院); ETH Zürich, Switzerland(瑞士苏黎世联邦理工学院); NTNU, Trondheim, Norway(挪威特隆赫姆科技大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures, Code: this https URL

点击查看摘要

Abstract:Robot autonomy in unknown, GPS-denied, and complex underground environments requires real-time, robust, and accurate onboard pose estimation and mapping for reliable operations. This becomes particularly challenging in perception-degraded subterranean conditions under harsh environmental factors, including darkness, dust, and geometrically self-similar structures. This paper details CompSLAM, a highly resilient and hierarchical multi-modal localization and mapping framework designed to address these challenges. Its flexible architecture achieves resilience through redundancy by leveraging the complementary nature of pose estimates derived from diverse sensor modalities. Developed during the DARPA Subterranean Challenge, CompSLAM was successfully deployed on all aerial, legged, and wheeled robots of Team Cerberus during their competition-winning final run. Furthermore, it has proven to be a reliable odometry and mapping solution in various subsequent projects, with extensions enabling multi-robot map sharing for marsupial robotic deployments and collaborative mapping. This paper also introduces a comprehensive dataset acquired by a manually teleoperated quadrupedal robot, covering a significant portion of the DARPA Subterranean Challenge finals course. This dataset evaluates CompSLAM’s robustness to sensor degradations as the robot traverses 740 meters in an environment characterized by highly variable geometries and demanding lighting conditions. The CompSLAM code and the DARPA SubT Finals dataset are made publicly available for the benefit of the robotics community
zh

[CV-142] PromptIQ: Who Cares About Prompts? Let System Handle It – A Component-Aware Framework for T2I Generation

【速读】:该论文试图解决文本到图像(Text-to-Image, T2I)模型在没有提示工程专业知识的情况下生成高质量图像的难题,特别是由于提示结构不良导致的图像扭曲和对齐错误问题。现有评估方法如CLIP无法有效捕捉结构不一致,限制了模型的评估准确性。论文提出的解决方案是PromptIQ,其关键在于引入一种新的组件感知相似性(Component-Aware Similarity, CAS)度量,能够检测并惩罚结构错误,同时通过迭代生成和评估图像直至用户满意,从而避免了传统的试错式提示调优过程。

链接: https://arxiv.org/abs/2505.06467
作者: Nisan Chhetri,Arpan Sainju
机构: North Carolina State University (北卡罗来纳州立大学); Middle Tennessee State University (中田纳西州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 4 pages, 2 figures

点击查看摘要

Abstract:Generating high-quality images without prompt engineering expertise remains a challenge for text-to-image (T2I) models, which often misinterpret poorly structured prompts, leading to distortions and misalignments. While humans easily recognize these flaws, metrics like CLIP fail to capture structural inconsistencies, exposing a key limitation in current evaluation methods. To address this, we introduce PromptIQ, an automated framework that refines prompts and assesses image quality using our novel Component-Aware Similarity (CAS) metric, which detects and penalizes structural errors. Unlike conventional methods, PromptIQ iteratively generates and evaluates images until the user is satisfied, eliminating trial-and-error prompt tuning. Our results show that PromptIQ significantly improves generation quality and evaluation accuracy, making T2I models more accessible for users with little to no prompt engineering expertise.
zh

[CV-143] My Emotion on your face: The use of Facial Keypoint Detection to preserve Emotions in Latent Space Editing

【速读】:该论文试图解决在使用生成式 AI (Generative AI) 模型(如 StyleGAN/2)进行人脸图像编辑时存在的特征纠缠问题,即改变一个特征会不可避免地影响其他特征,从而破坏面部表情的完整性。解决方案的关键是在面部关键点检测模型的损失函数中引入一种附加项,即 Human Face Landmark Detection (HFLD) 损失,以限制面部表情的变化,从而在保持面部表情不变的前提下实现目标特征的修改。

链接: https://arxiv.org/abs/2505.06436
作者: Jingrui He,Andrew Stephen McGough
机构: Newcastle University (纽卡斯尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to 2nd International Workshop on Synthetic Data for Face and Gesture Analysis at IEEE FG 2025

点击查看摘要

Abstract:Generative Adversarial Network approaches such as StyleGAN/2 provide two key benefits: the ability to generate photo-realistic face images and possessing a semantically structured latent space from which these images are created. Many approaches have emerged for editing images derived from vectors in the latent space of a pre-trained StyleGAN/2 models by identifying semantically meaningful directions (e.g., gender or age) in the latent space. By moving the vector in a specific direction, the ideal result would only change the target feature while preserving all the other features. Providing an ideal data augmentation approach for gesture research as it could be used to generate numerous image variations whilst keeping the facial expressions intact. However, entanglement issues, where changing one feature inevitably affects other features, impacts the ability to preserve facial expressions. To address this, we propose the use of an addition to the loss function of a Facial Keypoint Detection model to restrict changes to the facial expressions. Building on top of an existing model, adding the proposed Human Face Landmark Detection (HFLD) loss, provided by a pre-trained Facial Keypoint Detection model, to the original loss function. We quantitatively and qualitatively evaluate the existing and our extended model, showing the effectiveness of our approach in addressing the entanglement issue and maintaining the facial expression. Our approach achieves up to 49% reduction in the change of emotion in our experiments. Moreover, we show the benefit of our approach by comparing with state-of-the-art models. By increasing the ability to preserve the facial gesture and expression during facial transformation, we present a way to create human face images with fixed expression but different appearances, making it a reliable data augmentation approach for Facial Gesture and Expression research.
zh

[CV-144] Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶系统中面对后门攻击时的鲁棒性不足问题。其关键解决方案是提出一种基于自然反射的后门攻击方法,通过在DriveLM数据集的部分图像中嵌入微弱的反射模式(如玻璃或水面的自然表面),并在对应的文本标签前添加冗长的无关前缀(如伪造故事或系统更新通知),使模型在遇到特定触发器时生成异常长的响应,从而导致显著的推理延迟。

链接: https://arxiv.org/abs/2505.06413
作者: Ming Liu,Siyuan Liang,Koushik Howlader,Liwen Wang,Dacheng Tao,Wensheng Zhang
机构: Iowa State University (爱荷华州立大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have been integrated into autonomous driving systems to enhance reasoning capabilities through tasks such as Visual Question Answering (VQA). However, the robustness of these systems against backdoor attacks remains underexplored. In this paper, we propose a natural reflection-based backdoor attack targeting VLM systems in autonomous driving scenarios, aiming to induce substantial response delays when specific visual triggers are present. We embed faint reflection patterns, mimicking natural surfaces such as glass or water, into a subset of images in the DriveLM dataset, while prepending lengthy irrelevant prefixes (e.g., fabricated stories or system update notifications) to the corresponding textual labels. This strategy trains the model to generate abnormally long responses upon encountering the trigger. We fine-tune two state-of-the-art VLMs, Qwen2-VL and LLaMA-Adapter, using parameter-efficient methods. Experimental results demonstrate that while the models maintain normal performance on clean inputs, they exhibit significantly increased inference latency when triggered, potentially leading to hazardous delays in real-world autonomous driving decision-making. Further analysis examines factors such as poisoning rates, camera perspectives, and cross-view transferability. Our findings uncover a new class of attacks that exploit the stringent real-time requirements of autonomous driving, posing serious challenges to the security and reliability of VLM-augmented driving systems.
zh

[CV-145] MAGE:A Multi-stage Avatar Generator with Sparse Observations

【速读】:该论文旨在解决从仅能捕捉头部和手腕三个关节运动的头戴式设备(Head Mounted Device)中推断全身姿态的问题,这一任务在增强现实/虚拟现实(AR/VR)应用中具有广泛需求。传统方法通常采用单阶段运动映射学习,但由于未观测到的肢体关节运动存在过大的推理空间,导致下肢预测效果不佳且时间一致性差,从而产生不真实或不连贯的运动序列。该论文提出的解决方案关键在于引入一种多阶段Avatar生成器MAGE,通过逐步预测策略将单阶段直接运动映射学习分解为多阶段过程,从粗粒度的6部分身体表示逐步细化至22个关节,逐步引入更多的运动上下文先验信息,从而在更丰富的约束条件下实现更真实的运动补全。

链接: https://arxiv.org/abs/2505.06411
作者: Fangyu Du,Yang Yang,Xuehao Gao,Hongye Hou
机构: Xi’an Jiaotong University (西安交通大学); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inferring full-body poses from Head Mounted Devices, which capture only 3-joint observations from the head and wrists, is a challenging task with wide AR/VR applications. Previous attempts focus on learning one-stage motion mapping and thus suffer from an over-large inference space for unobserved body joint motions. This often leads to unsatisfactory lower-body predictions and poor temporal consistency, resulting in unrealistic or incoherent motion sequences. To address this, we propose a powerful Multi-stage Avatar GEnerator named MAGE that factorizes this one-stage direct motion mapping learning with a progressive prediction strategy. Specifically, given initial 3-joint motions, MAGE gradually inferring multi-scale body part poses at different abstract granularity levels, starting from a 6-part body representation and gradually refining to 22 joints. With decreasing abstract levels step by step, MAGE introduces more motion context priors from former prediction stages and thus improves realistic motion completion with richer constraint conditions and less ambiguity. Extensive experiments on large-scale datasets verify that MAGE significantly outperforms state-of-the-art methods with better accuracy and continuity.
zh

[CV-146] oward Advancing License Plate Super-Resolution in Real-World Scenarios: A Dataset and Benchmark

【速读】:该论文旨在解决低分辨率(Low-Resolution, LR)和退化图像在车牌识别(License Plate Recognition, LPR)中的挑战,特别是在监控、交通监测和法医学应用中。其解决方案的关键在于引入了一个名为UFPR-SR-Plates的新数据集,包含10,000个车辆轨迹的100,000对真实条件下采集的低分辨率与高分辨率车牌图像,并通过多帧LR和HR图像建立基准,结合先进的超分辨率模型及融合策略,如基于字符位置的多数投票(Majority Vote by Character Position, MVCP),以提升LPR性能。

链接: https://arxiv.org/abs/2505.06393
作者: Valfride Nascimento,Gabriel E. Lima,Rafael O. Ribeiro,William Robson Schwartz,Rayson Laroca,David Menotti
机构: Federal University of Paraná(巴拉那联邦大学); Brazilian Federal Police(巴西联邦警察局); Federal University of Minas Gerais(米纳斯吉拉斯联邦大学); Pontifical Catholic University of Paraná(巴拉那天主教大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in the Journal of the Brazilian Computer Society

点击查看摘要

Abstract:Recent advancements in super-resolution for License Plate Recognition (LPR) have sought to address challenges posed by low-resolution (LR) and degraded images in surveillance, traffic monitoring, and forensic applications. However, existing studies have relied on private datasets and simplistic degradation models. To address this gap, we introduce UFPR-SR-Plates, a novel dataset containing 10,000 tracks with 100,000 paired low and high-resolution license plate images captured under real-world conditions. We establish a benchmark using multiple sequential LR and high-resolution (HR) images per vehicle – five of each – and two state-of-the-art models for super-resolution of license plates. We also investigate three fusion strategies to evaluate how combining predictions from a leading Optical Character Recognition (OCR) model for multiple super-resolved license plates enhances overall performance. Our findings demonstrate that super-resolution significantly boosts LPR performance, with further improvements observed when applying majority vote-based fusion techniques. Specifically, the Layout-Aware and Character-Driven Network (LCDNet) model combined with the Majority Vote by Character Position (MVCP) strategy led to the highest recognition rates, increasing from 1.7% with low-resolution images to 31.1% with super-resolution, and up to 44.7% when combining OCR outputs from five super-resolved images. These findings underscore the critical role of super-resolution and temporal information in enhancing LPR accuracy under real-world, adverse conditions. The proposed dataset is publicly available to support further research and can be accessed at: this https URL
zh

[CV-147] Deep Learning-Based Robust Optical Guidance for Hypersonic Platforms

【速读】:该论文试图解决长距离平台中基于传感器的引导问题,特别是传统参考图像框架下经典配准方法的结构限制。其解决方案的关键在于将场景的一组图像编码到深度网络中,通过利用图像堆叠的方式,在双模态场景(例如场景可能有雪或无雪)中表现出有效性。

链接: https://arxiv.org/abs/2505.06389
作者: Adrien Chan-Hon-Tong,Aurélien Plyer,Baptiste Cadalen,Laurent Serre
机构: ONERA(法国航空航天大学); Université Paris-Saclay(巴黎-萨克雷大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sensor-based guidance is required for long-range platforms. To bypass the structural limitation of classical registration on reference image framework, we offer in this paper to encode a stack of images of the scene into a deep network. Relying on a stack is showed to be relevant on bimodal scene (e.g. when the scene can or can not be snowy).
zh

[CV-148] Robust Precise Knowledge Distillation-based Novel Context-Aware Predictor for Disease Detection in Brain and Gastrointestinal

【速读】:该论文试图解决医学疾病预测中,尤其是在医学影像分析领域,由于数据复杂性、变异性以及噪声和图像质量差异带来的不确定性问题。传统知识蒸馏(Knowledge Distillation, KD)方法依赖于不考虑上下文的温度参数来软化教师模型的预测,难以适应医学图像中的不同不确定性水平。该论文提出的解决方案的关键在于引入一种基于蚁群优化(Ant Colony Optimization, ACO)的教师-学生模型选择框架,以及一种上下文感知的温度缩放预测方法,通过结合图像质量、疾病复杂性和教师模型置信度等因素动态调整温度参数,从而实现更鲁棒的知识迁移。

链接: https://arxiv.org/abs/2505.06381
作者: Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel
机构: DFKI(德国弗劳恩霍夫研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical disease prediction, particularly through imaging, remains a challenging task due to the complexity and variability of medical data, including noise, ambiguity, and differing image quality. Recent deep learning models, including Knowledge Distillation (KD) methods, have shown promising results in brain tumor image identification but still face limitations in handling uncertainty and generalizing across diverse medical conditions. Traditional KD methods often rely on a context-unaware temperature parameter to soften teacher model predictions, which does not adapt effectively to varying uncertainty levels present in medical images. To address this issue, we propose a novel framework that integrates Ant Colony Optimization (ACO) for optimal teacher-student model selection and a novel context-aware predictor approach for temperature scaling. The proposed context-aware framework adjusts the temperature based on factors such as image quality, disease complexity, and teacher model confidence, allowing for more robust knowledge transfer. Additionally, ACO efficiently selects the most appropriate teacher-student model pair from a set of pre-trained models, outperforming current optimization methods by exploring a broader solution space and better handling complex, non-linear relationships within the data. The proposed framework is evaluated using three publicly available benchmark datasets, each corresponding to a distinct medical imaging task. The results demonstrate that the proposed framework significantly outperforms current state-of-the-art methods, achieving top accuracy rates: 98.01% on the MRI brain tumor (Kaggle) dataset, 92.81% on the Figshare MRI dataset, and 96.20% on the GastroNet dataset. This enhanced performance is further evidenced by the improved results, surpassing existing benchmarks of 97.24% (Kaggle), 91.43% (Figshare), and 95.00% (GastroNet).
zh

[CV-149] LMLCC-Net: A Semi-Supervised Deep Learning Model for Lung Nodule Malignancy Prediction from CT Scans using a Novel Hounsfield Unit-Based Intensity Filtering

【速读】:该论文旨在解决肺部结节在CT图像中早期恶性识别的问题,以降低肺癌相关死亡率和发病率。其解决方案的关键在于提出LMLCC-Net,一种基于3D卷积神经网络(3D CNN)的深度学习框架,该框架结合了基于Hounsfield Unit (HU)强度过滤的特征提取方法。通过分析良性与恶性结节在HU强度分布上的显著差异,并融合强度模式与纹理信息进行恶性预测,LMLCC-Net通过多分支结构提取特征,并探索不同分支组合及可学习滤波器范围,最终实现最优性能。此外,还引入半监督学习方案处理模糊标注案例,并开发轻量级模型以提高实际应用可行性。

链接: https://arxiv.org/abs/2505.06370
作者: Adhora Madhuri,Nusaiba Sobir,Tasnia Binte Mamun,Taufiq Hasan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures, 6 tables

点击查看摘要

Abstract:Lung cancer is the leading cause of patient mortality in the world. Early diagnosis of malignant pulmonary nodules in CT images can have a significant impact on reducing disease mortality and morbidity. In this work, we propose LMLCC-Net, a novel deep learning framework for classifying nodules from CT scan images using a 3D CNN, considering Hounsfield Unit (HU)-based intensity filtering. Benign and malignant nodules have significant differences in their intensity profile of HU, which was not exploited in the literature. Our method considers the intensity pattern as well as the texture for the prediction of malignancies. LMLCC-Net extracts features from multiple branches that each use a separate learnable HU-based intensity filtering stage. Various combinations of branches and learnable ranges of filters were explored to finally produce the best-performing model. In addition, we propose a semi-supervised learning scheme for labeling ambiguous cases and also developed a lightweight model to classify the nodules. The experimental evaluations are carried out on the LUNA16 dataset. Our proposed method achieves a classification accuracy (ACC) of 91.96%, a sensitivity (SEN) of 92.04%, and an area under the curve (AUC) of 91.87%, showing improved performance compared to existing methods. The proposed method can have a significant impact in helping radiologists in the classification of pulmonary nodules and improving patient care.
zh

[CV-150] Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA CVPR2025

【速读】:该论文旨在解决多模态模型预训练数据集中存在的毒性内容问题,特别是LLaVA图像-文本预训练数据集中有害内容的普遍性及其在不同模态中的表现。其解决方案的关键在于对常见毒性类别进行系统分析,并提出针对性的缓解策略,从而构建一个经过毒性减轻的精炼数据集,该数据集移除了7,531对有毒的图像-文本对,并提供了实现稳健毒性检测流程的指南。

链接: https://arxiv.org/abs/2505.06356
作者: Karthik Reddy Kanjula,Surya Guthikonda,Nahid Alam,Shayekh Bin Islam
机构: Cohere for AI Community (Cohere 人工智能社区); Cisco Meraki (思科梅拉基); Indiana University Bloomington (印第安纳大学布鲁明顿分校); Bangladesh University of Engineering and Technology (孟加拉国工程与技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ReGenAI CVPR2025 Workshop as Oral

点击查看摘要

Abstract:Pretraining datasets are foundational to the development of multimodal models, yet they often have inherent biases and toxic content from the web-scale corpora they are sourced from. In this paper, we investigate the prevalence of toxicity in LLaVA image-text pretraining dataset, examining how harmful content manifests in different modalities. We present a comprehensive analysis of common toxicity categories and propose targeted mitigation strategies, resulting in the creation of a refined toxicity-mitigated dataset. This dataset removes 7,531 of toxic image-text pairs in the LLaVA pre-training dataset. We offer guidelines for implementing robust toxicity detection pipelines. Our findings underscore the need to actively identify and filter toxic content - such as hate speech, explicit imagery, and targeted harassment - to build more responsible and equitable multimodal systems. The toxicity-mitigated dataset is open source and is available for further research.
zh

[CV-151] Attonsecond Streaking Phase Retrieval Via Deep Learning Methods

【速读】:该论文旨在解决超快电子动力学研究中亚飞秒时间尺度下attosecond streaking相位恢复的准确性问题,传统算法依赖迭代最小化和中心动量近似,导致在宽频脉冲下的精度下降。其解决方案的关键在于将相位恢复重新建模为监督式计算机视觉问题,并通过对比四种神经网络架构(卷积神经网络、视觉Transformer、混合CNN-ViT模型和胶囊网络)来提升性能,其中胶囊网络通过动态路由进一步强化空间姿态一致性,从而实现了最高的恢复保真度。

链接: https://arxiv.org/abs/2505.06275
作者: Yuzhou Zhu,Zheng Zhang,Ruyi Zhang,Liang Zhou
机构: Dalian University of Technology (大连理工大学); Zhejiang Gongshang University (浙江工商大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注:

点击查看摘要

Abstract:Attosecond streaking phase retrieval is essential for resolving electron dynamics on sub-femtosecond time scales yet traditional algorithms rely on iterative minimization and central momentum approximations that degrade accuracy for broadband pulses. In this work phase retrieval is reformulated as a supervised computer-vision problem and four neural architectures are systematically compared. A convolutional network demonstrates strong sensitivity to local streak edges but lacks global context; a vision transformer captures long-range delay-energy correlations at the expense of local inductive bias; a hybrid CNN-ViT model unites local feature extraction and full-graph attention; and a capsule network further enforces spatial pose agreement through dynamic routing. A theoretical analysis introduces local, global and positional sensitivity measures and derives surrogate error bounds that predict the strict ordering CNNViTHybridCapsule . Controlled experiments on synthetic streaking spectrograms confirm this hierarchy, with the capsule network achieving the highest retrieval fidelity. Looking forward, embedding the strong-field integral into physics-informed neural networks and exploring photonic hardware implementations promise pathways toward real-time attosecond pulse characterization under demanding experimental conditions.
zh

[CV-152] Skeletonization of neuronal processes using Discrete Morse techniques from computational topology

【速读】:该论文试图解决在脊椎动物大脑神经网络映射中,如何更准确地理解和量化神经元轴突的投射问题。传统方法通过示踪剂注射标记神经元群体的轴突投射,并依赖区域内的总标记强度进行定量分析,但这种方法缺乏生物学意义。该研究提出的解决方案关键在于利用生成式AI(Generative AI)与计算拓扑学中的离散莫尔斯(Discrete Morse, DM)技术,对标记的轴突片段进行骨架化处理,并估计体积长度密度,从而更好地连接底层神经元结构。该方法考虑了非局部连通性信息,具备噪声鲁棒性,并首次将DM技术应用于计算神经解剖学领域,有助于弥合单轴突骨架与示踪剂注射数据之间的差距。

链接: https://arxiv.org/abs/2505.07754
作者: Samik Banerjee,Caleb Stam,Daniel J. Tward,Steven Savoia,Yusu Wang,Partha P.Mitra
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review in Nature

点击查看摘要

Abstract:To understand biological intelligence we need to map neuronal networks in vertebrate brains. Mapping mesoscale neural circuitry is done using injections of tracers that label groups of neurons whose axons project to different brain regions. Since many neurons are labeled, it is difficult to follow individual axons. Previous approaches have instead quantified the regional projections using the total label intensity within a region. However, such a quantification is not biologically meaningful. We propose a new approach better connected to the underlying neurons by skeletonizing labeled axon fragments and then estimating a volumetric length density. Our approach uses a combination of deep nets and the Discrete Morse (DM) technique from computational topology. This technique takes into account nonlocal connectivity information and therefore provides noise-robustness. We demonstrate the utility and scalability of the approach on whole-brain tracer injected data. We also define and illustrate an information theoretic measure that quantifies the additional information obtained, compared to the skeletonized tracer injection fragments, when individual axon morphologies are available. Our approach is the first application of the DM technique to computational neuroanatomy. It can help bridge between single-axon skeletons and tracer injections, two important data types in mapping neural networks in vertebrates.
zh

[CV-153] ABS-Mamba: SAM2-Driven Bidirectional Spiral Mamba Network for Medical Image Translation MICCAI2025

【速读】:该论文旨在解决多模态医学图像翻译中全局解剖语义与局部结构保真度难以协调的问题,这一挑战源于模态间信息丢失和结构失真。其解决方案的关键在于提出ABS-Mamba架构,该架构融合了Segment Anything Model 2(SAM2)的器官感知语义表示、专门设计的卷积神经网络(CNNs)以保留模态特异性边缘和纹理细节,以及Mamba的可选状态空间建模以高效处理长短期特征依赖。通过双分辨率框架、鲁棒特征融合网络(RFFN)和双向Mamba残差网络(BMRN)等结构,实现了跨模态图像的高保真合成。

链接: https://arxiv.org/abs/2505.07687
作者: Feng Yuan,Yifan Gao,Wenbin Wu,Keqing Wu,Xiaotong Guo,Jie Jiang,Xin Gao
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025(under view)

点击查看摘要

Abstract:Accurate multi-modal medical image translation requires ha-rmonizing global anatomical semantics and local structural fidelity, a challenge complicated by intermodality information loss and structural distortion. We propose ABS-Mamba, a novel architecture integrating the Segment Anything Model 2 (SAM2) for organ-aware semantic representation, specialized convolutional neural networks (CNNs) for preserving modality-specific edge and texture details, and Mamba’s selective state-space modeling for efficient long- and short-range feature dependencies. Structurally, our dual-resolution framework leverages SAM2’s image encoder to capture organ-scale semantics from high-resolution inputs, while a parallel CNNs branch extracts fine-grained local features. The Robust Feature Fusion Network (RFFN) integrates these epresentations, and the Bidirectional Mamba Residual Network (BMRN) models spatial dependencies using spiral scanning and bidirectional state-space dynamics. A three-stage skip fusion decoder enhances edge and texture fidelity. We employ Efficient Low-Rank Adaptation (LoRA+) fine-tuning to enable precise domain specialization while maintaining the foundational capabilities of the pre-trained components. Extensive experimental validation on the SynthRAD2023 and BraTS2019 datasets demonstrates that ABS-Mamba outperforms state-of-the-art methods, delivering high-fidelity cross-modal synthesis that preserves anatomical semantics and structural details to enhance diagnostic accuracy in clinical applications. The code is available at this https URL
zh

[CV-154] Hierarchical Sparse Attention Framework for Computationally Efficient Classification of Biological Cells

【速读】:该论文试图解决传统卷积神经网络(Convolutional Neural Networks, CNNs)在图像分类任务中计算效率低以及可能关注无关特征的问题。其解决方案的关键在于提出SparseAttnNet,一种基于层次化注意力机制的高效图像分类框架,该框架通过动态选择并处理图像中最具信息量的像素(即top-k像素),从而减少计算负担。该方法利用下游层中的细粒度多头注意力(multi-head attention)提炼出粗粒度注意力,实现对关键区域的自适应识别与处理,同时通过语言模型嵌入和多头注意力机制捕捉语义与全局上下文信息。

链接: https://arxiv.org/abs/2505.07661
作者: Elad Yoshai,Dana Yagoda-Aharoni,Eden Dotan,Natan T. Shaked
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present SparseAttnNet, a new hierarchical attention-driven framework for efficient image classification that adaptively selects and processes only the most informative pixels from images. Traditional convolutional neural networks typically process the entire images regardless of information density, leading to computational inefficiency and potential focus on irrelevant features. Our approach leverages a dynamic selection mechanism that uses coarse attention distilled by fine multi-head attention from the downstream layers of the model, allowing the model to identify and extract the most salient k pixels, where k is adaptively learned during training based on loss convergence trends. Once the top-k pixels are selected, the model processes only these pixels, embedding them as words in a language model to capture their semantics, followed by multi-head attention to incorporate global context. For biological cell images, we demonstrate that SparseAttnNet can process approximately 15% of the pixels instead of the full image. Applied to cell classification tasks using white blood cells images from the following modalities: optical path difference (OPD) images from digital holography for stain-free cells, images from motion-sensitive (event) camera from stain-free cells, and brightfield microscopy images of stained cells, For all three imaging modalities, SparseAttnNet achieves competitive accuracy while drastically reducing computational requirements in terms of both parameters and floating-point operations per second, compared to traditional CNNs and Vision Transformers. Since the model focuses on biologically relevant regions, it also offers improved explainability. The adaptive and lightweight nature of SparseAttnNet makes it ideal for deployment in resource-constrained and high-throughput settings, including imaging flow cytometry.
zh

[CV-155] Breast Cancer Classification in Deep Ultraviolet Fluorescence Images Using a Patch-Level Vision Transformer Framework

【速读】:该论文旨在解决乳腺癌组织在术中边缘评估中的分类问题,特别是在使用深紫外荧光扫描显微镜(DUV-FSM)获取的高分辨率全表面图像(DUV WSI)上进行准确的良恶性组织分类。解决方案的关键在于引入一种基于局部视觉Transformer(ViT)模型的分类框架,该模型能够捕捉局部和全局特征,并结合Grad-CAM++显著性加权技术以突出相关空间区域,从而提升分类结果的可解释性和诊断准确性。

链接: https://arxiv.org/abs/2505.07654
作者: Pouya Afshin,David Helminiak,Tongtong Lu,Tina Yen,Julie M. Jorns,Mollie Patton,Bing Yu,Dong Hye Ye
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Breast-conserving surgery (BCS) aims to completely remove malignant lesions while maximizing healthy tissue preservation. Intraoperative margin assessment is essential to achieve a balance between thorough cancer resection and tissue conservation. A deep ultraviolet fluorescence scanning microscope (DUV-FSM) enables rapid acquisition of whole surface images (WSIs) for excised tissue, providing contrast between malignant and normal tissues. However, breast cancer classification with DUV WSIs is challenged by high resolutions and complex histopathological features. This study introduces a DUV WSI classification framework using a patch-level vision transformer (ViT) model, capturing local and global features. Grad-CAM++ saliency weighting highlights relevant spatial regions, enhances result interpretability, and improves diagnostic accuracy for benign and malignant tissue classification. A comprehensive 5-fold cross-validation demonstrates the proposed approach significantly outperforms conventional deep learning methods, achieving a classification accuracy of 98.33%.
zh

[CV-156] Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model

【速读】:该论文旨在解决眼科手术中AI系统难以获取足够高质量标注的手术视频数据的问题,这一问题主要源于隐私保护和人工标注的高成本。其解决方案的关键在于提出Ophora模型,该模型能够根据自然语言指令生成真实可靠的眼科手术视频。为构建Ophora,研究者首先设计了一个全面的数据整理流程,将叙述性手术视频转化为包含超过16万对视频-指令的数据集Ophora-160K;随后引入了一种渐进式的视频-指令微调方案,以从预训练于自然视频-文本数据集的文本到视频生成(T2V)模型中迁移丰富的时空知识,从而实现隐私保护下的眼科手术视频生成。

链接: https://arxiv.org/abs/2505.07449
作者: Wei Li,Ming Hu,Guoan Wang,Lihao Liu,Kaijin Zhou,Junzhi Ning,Xin Guo,Zongyuan Ge,Lixu Gu,Junjun He
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In ophthalmic surgery, developing an AI system capable of interpreting surgical videos and predicting subsequent operations requires numerous ophthalmic surgical videos with high-quality annotations, which are difficult to collect due to privacy concerns and labor consumption. Text-guided video generation (T2V) emerges as a promising solution to overcome this issue by generating ophthalmic surgical videos based on surgeon instructions. In this paper, we present Ophora, a pioneering model that can generate ophthalmic surgical videos following natural language instructions. To construct Ophora, we first propose a Comprehensive Data Curation pipeline to convert narrative ophthalmic surgical videos into a large-scale, high-quality dataset comprising over 160K video-instruction pairs, Ophora-160K. Then, we propose a Progressive Video-Instruction Tuning scheme to transfer rich spatial-temporal knowledge from a T2V model pre-trained on natural video-text datasets for privacy-preserved ophthalmic surgical video generation based on Ophora-160K. Experiments on video quality evaluation via quantitative analysis and ophthalmologist feedback demonstrate that Ophora can generate realistic and reliable ophthalmic surgical videos based on surgeon instructions. We also validate the capability of Ophora for empowering downstream tasks of ophthalmic surgical workflow understanding. Code is available at this https URL.
zh

[CV-157] Multi-Plane Vision Transformer for Hemorrhage Classification Using Axial and Sagittal MRI Data

【速读】:该论文旨在解决从磁共振成像(MRI)中识别脑出血的问题,特别是在不同对比度和方位的MRI数据下,传统方法因需要将图像重采样到固定平面而导致的信息丢失问题。解决方案的关键在于提出一种3D多平面视觉变压器(MP-ViT),该模型使用两个独立的Transformer编码器分别处理轴向和矢状面对比度,并通过交叉注意力机制整合不同方位的信息,同时引入模态指示向量以补充缺失的对比信息。

链接: https://arxiv.org/abs/2505.07349
作者: Badhan Kumar Das,Gengyan Zhao,Boris Mailhe,Thomas J. Re,Dorin Comaniciu,Eli Gibson,Andreas Maier
机构: Siemens Healthineers(西门子医疗); Friedrich-Alexander-Universität Erlangen-Nürnberg(埃尔朗根-纽伦堡弗里德里希·亚历山大大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:Identifying brain hemorrhages from magnetic resonance imaging (MRI) is a critical task for healthcare professionals. The diverse nature of MRI acquisitions with varying contrasts and orientation introduce complexity in identifying hemorrhage using neural networks. For acquisitions with varying orientations, traditional methods often involve resampling images to a fixed plane, which can lead to information loss. To address this, we propose a 3D multi-plane vision transformer (MP-ViT) for hemorrhage classification with varying orientation data. It employs two separate transformer encoders for axial and sagittal contrasts, using cross-attention to integrate information across orientations. MP-ViT also includes a modality indication vector to provide missing contrast information to the model. The effectiveness of the proposed model is demonstrated with extensive experiments on real world clinical dataset consists of 10,084 training, 1,289 validation and 1,496 test subjects. MP-ViT achieved substantial improvement in area under the curve (AUC), outperforming the vision transformer (ViT) by 5.5% and CNN-based architectures by 1.8%. These results highlight the potential of MP-ViT in improving performance for hemorrhage detection when different orientation contrasts are needed.
zh

[CV-158] Metrics that matter: Evaluating image quality metrics for medical image generation

【速读】:该论文试图解决生成式医学影像模型在临床应用中评估不足的问题,特别是在缺乏真实图像作为参考的情况下,现有无参考图像质量度量的可靠性尚未得到充分验证。解决方案的关键在于系统评估常用无参考图像质量度量对噪声、分布偏移及局部形态变化的敏感性,并将其与下游分割任务的表现进行对比,从而揭示当前度量在临床相关细节上的显著局限性,强调需要构建多维度验证框架以确保生成模型的临床适用性。

链接: https://arxiv.org/abs/2505.07175
作者: Yash Deo,Yan Jia,Toni Lassila,William A. P. Smith,Tom Lawton,Siyuan Kang,Alejandro F. Frangi,Ibrahim Habli
机构: University of York (约克大学); University of Leeds (利兹大学); Manchester Metropolitan University (曼彻斯特城市大学); University of Manchester (曼彻斯特大学); KU Leuven (天主教鲁汶大学); Bradford Institute for Health Research (布拉德福德健康研究所)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Evaluating generative models for synthetic medical imaging is crucial yet challenging, especially given the high standards of fidelity, anatomical accuracy, and safety required for clinical applications. Standard evaluation of generated images often relies on no-reference image quality metrics when ground truth images are unavailable, but their reliability in this complex domain is not well established. This study comprehensively assesses commonly used no-reference image quality metrics using brain MRI data, including tumour and vascular images, providing a representative exemplar for the field. We systematically evaluate metric sensitivity to a range of challenges, including noise, distribution shifts, and, critically, localised morphological alterations designed to mimic clinically relevant inaccuracies. We then compare these metric scores against model performance on a relevant downstream segmentation task, analysing results across both controlled image perturbations and outputs from different generative model architectures. Our findings reveal significant limitations: many widely-used no-reference image quality metrics correlate poorly with downstream task suitability and exhibit a profound insensitivity to localised anatomical details crucial for clinical validity. Furthermore, these metrics can yield misleading scores regarding distribution shifts, e.g. data memorisation. This reveals the risk of misjudging model readiness, potentially leading to the deployment of flawed tools that could compromise patient safety. We conclude that ensuring generative models are truly fit for clinical purpose requires a multifaceted validation framework, integrating performance on relevant downstream tasks with the cautious interpretation of carefully selected no-reference image quality metrics.
zh

[CV-159] Skull stripping with purely synthetic data

【速读】:该论文试图解决多模态、多物种及病理情况下脑图像分割中缺乏一种根本通用的脑提取(skull stripping)算法的问题。其解决方案的关键在于提出PUMBA(PUrely synthetic Multimodal/species invariant Brain extrAction),通过完全使用合成数据训练模型,而无需依赖真实脑图像或标注数据,从而实现了在多种场景下的高精度分割。

链接: https://arxiv.org/abs/2505.07159
作者: Jong Sung Park,Juhyung Ha,Siddhesh Thakur,Alexandra Badea,Spyridon Bakas,Eleftherios Garyfallidis
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Oral at ISMRM 2025

点击查看摘要

Abstract:While many skull stripping algorithms have been developed for multi-modal and multi-species cases, there is still a lack of a fundamentally generalizable approach. We present PUMBA(PUrely synthetic Multimodal/species invariant Brain extrAction), a strategy to train a model for brain extraction with no real brain images or labels. Our results show that even without any real images or anatomical priors, the model achieves comparable accuracy in multi-modal, multi-species and pathological cases. This work presents a new direction of research for any generalizable medical image segmentation task.
zh

[CV-160] Whitened CLIP as a Likelihood Surrogate of Images and Captions ICML2025 ATC

【速读】:该论文试图解决图像和文本描述的似然性近似计算问题,这一问题在许多应用中具有重要意义。其解决方案的关键在于引入了\textit{Whitened CLIP},通过对CLIP的潜在空间进行可逆线性变换,使嵌入空间中的每个特征均值为零、标准差为一且与其他特征不相关,从而得到单位协方差矩阵。该变换使得白化后的嵌入统计特性可以很好地近似为标准正态分布,进而通过白化嵌入空间中的平方欧几里得范数简单估计对数似然。该过程无需训练,仅需预计算的白化矩阵,因此计算效率高。

链接: https://arxiv.org/abs/2505.06934
作者: Roy Betser,Meir Yossef Levi,Guy Gilboa
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2025. This version matches the camera-ready version

点击查看摘要

Abstract:Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce \textitWhitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with all other features, resulting in an identity covariance matrix. We show that the whitened embeddings statistics can be well approximated as a standard normal distribution, thus, the log-likelihood is estimated simply by the square Euclidean norm in the whitened embedding space. The whitening procedure is completely training-free and performed using a pre-computed whitening matrix, hence, is very fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions.
zh

[CV-161] Uni-AIMS: AI-Powered Microscopy Image Analysis

【速读】:该论文旨在解决显微图像的智能识别与自动分析问题,特别是在处理复杂视觉环境下的目标检测和定量分析方面。其解决方案的关键在于开发了一个数据引擎,通过实验采集、合成数据生成以及人机协同标注流程生成高质量的标注数据集,并提出了一种能够鲁棒检测大小目标的分割模型,该模型可有效识别并分离密集排列的目标,同时支持图像比例尺的精确自动识别,从而构建了一个全面的智能分析平台。

链接: https://arxiv.org/abs/2505.06918
作者: Yanhui Hong,Nan Wang,Zhiyi Xia,Haoyi Tao,Xi Fang,Yiming Li,Jiankun Wang,Peng Jin,Xiaochen Cai,Shengyu Li,Ziqi Chen,Zezhong Zhang,Guolin Ke,Linfeng Zhang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper presents a systematic solution for the intelligent recognition and automatic analysis of microscopy images. We developed a data engine that generates high-quality annotated datasets through a combination of the collection of diverse microscopy images from experiments, synthetic data generation and a human-in-the-loop annotation process. To address the unique challenges of microscopy images, we propose a segmentation model capable of robustly detecting both small and large objects. The model effectively identifies and separates thousands of closely situated targets, even in cluttered visual environments. Furthermore, our solution supports the precise automatic recognition of image scale bars, an essential feature in quantitative microscopic analysis. Building upon these components, we have constructed a comprehensive intelligent analysis platform and validated its effectiveness and practicality in real-world applications. This study not only advances automatic recognition in microscopy imaging but also ensures scalability and generalizability across multiple application domains, offering a powerful tool for automated microscopic analysis in interdisciplinary research.
zh

[CV-162] Missing Data Estimation for MR Spectroscopic Imaging via Mask-Free Deep Learning Methods

【速读】:该论文试图解决磁共振波谱成像(Magnetic Resonance Spectroscopic Imaging, MRSI)中由于运动伪影、磁场不均匀性或光谱拟合失败导致的缺失或损坏数据问题,这些问题在高分辨率3D采集中尤为突出。解决方案的关键在于提出一种基于深度学习的无掩码框架,通过2D和3D U-Net架构利用上下文空间特征隐式检测并估计缺失区域,而非依赖显式掩码。此外,该方法引入了渐进式训练策略以提高在不同数据退化水平下的鲁棒性,并在模拟和真实患者数据集上表现出优于传统插值方法的性能。

链接: https://arxiv.org/abs/2505.06811
作者: Tan-Hanh Pham,Ovidiu C. Andronesi,Xianqi Li,Kim-Doang Nguyen
机构: Florida Institute of Technology (佛罗里达理工学院); Massachusetts General Hospital, Harvard Medical School (麻省总医院,哈佛医学院); Department of Mathematics and Systems Engineering, Florida Institute of Technology (数学与系统工程系,佛罗里达理工学院); Department of Mechanical and Civil Engineering, Florida Institute of Technology (机械与土木工程系,佛罗里达理工学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages

点击查看摘要

Abstract:Magnetic Resonance Spectroscopic Imaging (MRSI) is a powerful tool for non-invasive mapping of brain metabolites, providing critical insights into neurological conditions. However, its utility is often limited by missing or corrupted data due to motion artifacts, magnetic field inhomogeneities, or failed spectral fitting-especially in high resolution 3D acquisitions. To address this, we propose the first deep learning-based, mask-free framework for estimating missing data in MRSI metabolic maps. Unlike conventional restoration methods that rely on explicit masks to identify missing regions, our approach implicitly detects and estimates these areas using contextual spatial features through 2D and 3D U-Net architectures. We also introduce a progressive training strategy to enhance robustness under varying levels of data degradation. Our method is evaluated on both simulated and real patient datasets and consistently outperforms traditional interpolation techniques such as cubic and linear interpolation. The 2D model achieves an MSE of 0.002 and an SSIM of 0.97 with 20% missing voxels, while the 3D model reaches an MSE of 0.001 and an SSIM of 0.98 with 15% missing voxels. Qualitative results show improved fidelity in estimating missing data, particularly in metabolically heterogeneous regions and ventricular regions. Importantly, our model generalizes well to real-world datasets without requiring retraining or mask input. These findings demonstrate the effectiveness and broad applicability of mask-free deep learning for MRSI restoration, with strong potential for clinical and research integration.
zh

[CV-163] HistDiST: Histopathological Diffusion-based Stain Transfer

【速读】:该论文旨在解决Hematoxylin and Eosin (HE)染色图像到Immunohistochemistry (IHC)图像的翻译问题,以提供一种成本更低且更具分子特异性的替代方案。现有方法主要基于生成对抗网络(GAN),但存在训练不稳定和结构保真度有限的问题,而基于扩散模型的方法尚未得到充分探索。论文提出的解决方案是HistDiST,其关键在于采用基于潜在扩散模型(Latent Diffusion Model, LDM)的框架,并引入双条件策略,结合Phikon提取的形态学嵌入与VAE编码的HE表示,以确保病理相关上下文和结构一致性。此外,通过引入重缩放噪声调度、v-prediction和尾部时间步长等技术,有效克服了亮度偏差问题,提升了翻译质量。

链接: https://arxiv.org/abs/2505.06793
作者: Erik Großkopf,Valay Bundele,Mehran Hossienzadeh,Hendrik P.A. Lensch
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Hematoxylin and Eosin (HE) staining is the cornerstone of histopathology but lacks molecular specificity. While Immunohistochemistry (IHC) provides molecular insights, it is costly and complex, motivating HE-to-IHC translation as a cost-effective alternative. Existing translation methods are mainly GAN-based, often struggling with training instability and limited structural fidelity, while diffusion-based approaches remain underexplored. We propose HistDiST, a Latent Diffusion Model (LDM) based framework for high-fidelity HE-to-IHC translation. HistDiST introduces a dual-conditioning strategy, utilizing Phikon-extracted morphological embeddings alongside VAE-encoded HE representations to ensure pathology-relevant context and structural consistency. To overcome brightness biases, we incorporate a rescaled noise schedule, v-prediction, and trailing timesteps, enforcing a zero-SNR condition at the final timestep. During inference, DDIM inversion preserves the morphological structure, while an eta-cosine noise schedule introduces controlled stochasticity, balancing structural consistency and molecular fidelity. Moreover, we propose Molecular Retrieval Accuracy (MRA), a novel pathology-aware metric leveraging GigaPath embeddings to assess molecular relevance. Extensive evaluations on MIST and BCI datasets demonstrate that HistDiST significantly outperforms existing methods, achieving a 28% improvement in MRA on the HE-to-Ki67 translation task, highlighting its effectiveness in capturing true IHC semantics.
zh

[CV-164] Reproducing and Improving CheXNet: Deep Learning for Chest X-ray Disease Classification

【速读】:该论文旨在解决医学影像中多标签分类任务的准确性和效率问题,特别是在胸部X光图像中对14种不同疾病进行自动检测与分类。其解决方案的关键在于开发和优化深度学习模型,以提高在不平衡数据集上的性能,主要通过改进的模型架构和训练策略来实现更高的F1分数和AUC-ROC值。

链接: https://arxiv.org/abs/2505.06646
作者: Daniel Strick,Carlos Garcia,Anthony Huang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:Deep learning for radiologic image analysis is a rapidly growing field in biomedical research and is likely to become a standard practice in modern medicine. On the publicly available NIH ChestX-ray14 dataset, containing X-ray images that are classified by the presence or absence of 14 different diseases, we reproduced an algorithm known as CheXNet, as well as explored other algorithms that outperform CheXNet’s baseline metrics. Model performance was primarily evaluated using the F1 score and AUC-ROC, both of which are critical metrics for imbalanced, multi-label classification tasks in medical imaging. The best model achieved an average AUC-ROC score of 0.85 and an average F1 score of 0.39 across all 14 disease classifications present in the dataset.
zh

[CV-165] Feature Representation Transferring to Lightweight Models via Perception Coherence

【速读】:该论文试图解决如何将大型教师模型的特征表示有效地迁移至轻量级学生模型的问题。解决方案的关键在于提出了一种新的概念——感知一致性(perception coherence),并基于此构建了一个损失函数,该函数通过数据点在特征空间中的排序来考虑它们的差异性。通过最小化该损失函数,学生模型学习模仿教师模型对输入的感知方式,而无需严格保留教师模型的绝对几何结构,仅需保持通过相似性排序所体现的全局一致性。这一方法为特征表示迁移提供了一个概率视角,并在实验中表现出优于或与现有强基准方法相当的性能。

链接: https://arxiv.org/abs/2505.06595
作者: Hai-Vy Nguyen,Fabrice Gamboa,Sixin Zhang,Reda Chhaibi,Serge Gratton,Thierry Giaccone
机构: Ampere Software Technology (Ampere软件技术公司); Institut de mathématiques de Toulouse (图卢兹数学研究所); Institut de Recherche en Informatique de Toulouse (图卢兹计算机科学研究所); Laboratoire Jean Alexandre Dieudonné (让·亚历山大·迪厄多内实验室); Université Côte d’Azur (科特迪瓦大学)
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Probability (math.PR)
备注:

点击查看摘要

Abstract:In this paper, we propose a method for transferring feature representation to lightweight student models from larger teacher models. We mathematically define a new notion called \textitperception coherence. Based on this notion, we propose a loss function, which takes into account the dissimilarities between data points in feature space through their ranking. At a high level, by minimizing this loss function, the student model learns to mimic how the teacher model \textitperceives inputs. More precisely, our method is motivated by the fact that the representational capacity of the student model is weaker than the teacher model. Hence, we aim to develop a new method allowing for a better relaxation. This means that, the student model does not need to preserve the absolute geometry of the teacher one, while preserving global coherence through dissimilarity ranking. Our theoretical insights provide a probabilistic perspective on the process of feature representation transfer. Our experiments results show that our method outperforms or achieves on-par performance compared to strong baseline methods for representation transferring.
zh

[CV-166] PC-SRGAN: Physically Consistent Super-Resolution Generative Adversarial Network for General Transient Simulations

【速读】:该论文旨在解决生成式人工智能(Generative AI)在超分辨率(Super Resolution, SR)任务中生成图像缺乏物理一致性的问题,这一问题限制了其在科学应用中的可靠性。论文提出的解决方案是PC-SRGAN,其关键在于通过引入物理一致性的约束,确保生成图像不仅在视觉上高质量,而且在物理意义上合理,从而支持可解释的模拟和科学分析。PC-SRGAN在有限训练数据条件下仍能显著提升峰值信噪比(Peak Signal-to-Noise Ratio)和结构相似性指数(Structural Similarity Index Measure),并通过集成数值合理的时间积分器和先进质量评估指标,增强了机器学习模型的可靠性和因果性。

链接: https://arxiv.org/abs/2505.06502
作者: Md Rakibul Hasan,Pouria Behnoudfar,Dan MacKinlay,Thomas Poulet
机构: 未知
类目: Image and Video Processing (eess.IV); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine Learning, particularly Generative Adversarial Networks (GANs), has revolutionised Super Resolution (SR). However, generated images often lack physical meaningfulness, which is essential for scientific applications. Our approach, PC-SRGAN, enhances image resolution while ensuring physical consistency for interpretable simulations. PC-SRGAN significantly improves both the Peak Signal-to-Noise Ratio and the Structural Similarity Index Measure compared to conventional methods, even with limited training data (e.g., only 13% of training data required for SRGAN). Beyond SR, PC-SRGAN augments physically meaningful machine learning, incorporating numerically justified time integrators and advanced quality metrics. These advancements promise reliable and causal machine-learning models in scientific domains. A significant advantage of PC-SRGAN over conventional SR techniques is its physical consistency, which makes it a viable surrogate model for time-dependent problems. PC-SRGAN advances scientific machine learning, offering improved accuracy and efficiency for image processing, enhanced process understanding, and broader applications to scientific research. The source codes and data will be made publicly available at this https URL upon acceptance of this paper.
zh

[CV-167] FEMSN: Frequency-Enhanced Multiscale Network for fault diagnosis of rotating machinery under strong noise environments

【速读】:该论文旨在解决在复杂工况下滚动轴承故障特征易被噪声干扰而难以提取的问题,传统方法在处理周期性冲击特征时存在局限性。其解决方案的关键在于提出一种基于卷积神经网络(CNN)的新型模型FEMSN,其中核心创新包括:引入傅里叶自适应去噪编码层(FADEL)以增强关键特征并抑制噪声,采用多尺度时频融合模块(MSTFF)提取融合的时频特征以提升模型鲁棒性和非线性表达能力,并通过知识蒸馏层扩展感受野,从而实现更准确的设备健康监测与稳定性评估。

链接: https://arxiv.org/abs/2505.06285
作者: Yuhan Yuan,Xiaomo Jiang,Yanfeng Han,Ke Xiao
机构: Dalian University of Technology (大连理工大学); Chongqing University (重庆大学)
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Rolling bearings are critical components of rotating machinery, and their proper functioning is essential for industrial production. Most existing condition monitoring methods focus on extracting discriminative features from time-domain signals to assess bearing health status. However, under complex operating conditions, periodic impulsive characteristics related to fault information are often obscured by noise interference. Consequently, existing approaches struggle to learn distinctive fault-related features in such scenarios. To address this issue, this paper proposes a novel CNN-based model named FEMSN. Specifically, a Fourier Adaptive Denoising Encoder Layer (FADEL) is introduced as an input denoising layer to enhance key features while filtering out irrelevant information. Subsequently, a Multiscale Time-Frequency Fusion (MSTFF) module is employed to extract fused time-frequency features, further improving the model robustness and nonlinear representation capability. Additionally, a distillation layer is incorporated to expand the receptive field. Based on these advancements, a novel deep lightweight CNN model, termed the Frequency-Enhanced Multiscale Network (FEMSN), is developed. The effectiveness of FEMSN and FADEL in machine health monitoring and stability assessment is validated through two case studies.
zh

[CV-168] rahertz Spatial Wireless Channel Modeling with Radio Radiance Field

【速读】:该论文旨在解决太赫兹(Terahertz, THz)通信中由于严重自由空间路径损耗、微弱衍射和镜面反射以及显著散射导致的传统信道建模和基于导频的估计方法效率低下的问题。其解决方案的关键在于应用无线电辐射场(Radio Radiance Field, RRF)框架,通过基于视觉的几何结构和稀疏的THz射频测量来重建连续的RRF,从而实现无需密集采样的高效空间信道状态信息(Spatial-CSI)建模。

链接: https://arxiv.org/abs/2505.06277
作者: John Song,Lihao Zhang,Feng Ye,Haijian Sun
机构: University of Georgia(佐治亚大学); University of Wisconsin-Madison(威斯康星大学麦迪逊分校)
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
备注: submitted to IEEE conferences

点击查看摘要

Abstract:Terahertz (THz) communication is a key enabler for 6G systems, offering ultra-wide bandwidth and unprecedented data rates. However, THz signal propagation differs significantly from lower-frequency bands due to severe free space path loss, minimal diffraction and specular reflection, and prominent scattering, making conventional channel modeling and pilot-based estimation approaches inefficient. In this work, we investigate the feasibility of applying radio radiance field (RRF) framework to the THz band. This method reconstructs a continuous RRF using visual-based geometry and sparse THz RF measurements, enabling efficient spatial channel state information (Spatial-CSI) modeling without dense sampling. We first build a fine simulated THz scenario, then we reconstruct the RRF and evaluate the performance in terms of both reconstruction quality and effectiveness in THz communication, showing that the reconstructed RRF captures key propagation paths with sparse training samples. Our findings demonstrate that RRF modeling remains effective in the THz regime and provides a promising direction for scalable, low-cost spatial channel reconstruction in future 6G networks.
zh

[CV-169] DeltaDPD: Exploiting Dynamic Temporal Sparsity in Recurrent Neural Networks for Energy-Efficient Wideband Digital Predistortion MICRO

【速读】:该论文旨在解决数字预失真(Digital Predistortion, DPD)在宽带射频功率放大器(RF PAs)中因带宽和数据速率增加而导致的能耗问题,尤其是传统基于循环神经网络(RNN)的DPD模型计算复杂度高、影响系统效率的问题。其解决方案的关键在于引入DeltaDPD,通过探索输入信号和RNN神经元隐藏状态的动态时间稀疏性,在保持良好线性化性能的同时减少算术运算和内存访问,从而实现能效提升。

链接: https://arxiv.org/abs/2505.06250
作者: Yizhuo Wu,Yi Zhu,Kun Qian,Qinyu Chen,Anding Zhu,John Gajadharsing,Leo C. N. de Vreede,Chang Gao
机构: Delft University of Technology (代尔夫特理工大学); Ampleon Netherlands B.V. (阿姆普隆荷兰有限公司); Leiden University (莱顿大学); University College Dublin (都柏林圣三一学院); IEEE MTT-S International Microwave Symposium (IEEE微波理论与技术国际研讨会)
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to IEEE Microwave and Wireless Technology Letters (MWTL)

点击查看摘要

Abstract:Digital Predistortion (DPD) is a popular technique to enhance signal quality in wideband RF power amplifiers (PAs). With increasing bandwidth and data rates, DPD faces significant energy consumption challenges during deployment, contrasting with its efficiency goals. State-of-the-art DPD models rely on recurrent neural networks (RNN), whose computational complexity hinders system efficiency. This paper introduces DeltaDPD, exploring the dynamic temporal sparsity of input signals and neuronal hidden states in RNNs for energy-efficient DPD, reducing arithmetic operations and memory accesses while preserving satisfactory linearization performance. Applying a TM3.1a 200MHz-BW 256-QAM OFDM signal to a 3.5 GHz GaN Doherty RF PA, DeltaDPD achieves -50.03 dBc in Adjacent Channel Power Ratio (ACPR), -37.22 dB in Normalized Mean Square Error (NMSE) and -38.52 dBc in Error Vector Magnitude (EVM) with 52% temporal sparsity, leading to a 1.8X reduction in estimated inference power. The DeltaDPD code will be released after formal publication at this https URL.
zh

[CV-170] ReXGradient-160K: A Large-Scale Publicly Available Dataset of Chest Radiographs with Free-text Reports

【速读】:该论文旨在解决医疗影像领域中缺乏大规模、高质量标注数据的问题,从而限制了生成式 AI (Generative AI) 在医学影像分析和自动化报告生成方面的研究进展。其解决方案的关键在于构建并公开发布 ReXGradient-160K 数据集,该数据集包含来自3个美国医疗系统共109,487名患者的160,000份胸部X光检查及其配对的放射学报告,涵盖训练、验证、公共测试和私有测试集,为AI系统的开发与评估提供了宝贵资源。

链接: https://arxiv.org/abs/2505.00228
作者: Xiaoman Zhang,Julián N. Acosta,Josh Miller,Ouwen Huang,Pranav Rajpurkar
机构: Harvard Medical School (哈佛医学院); Gradient Health (梯度健康); Duke University (杜克大学); Laplace Institute (拉普拉斯研究所)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present ReXGradient-160K, representing the largest publicly available chest X-ray dataset to date in terms of the number of patients. This dataset contains 160,000 chest X-ray studies with paired radiological reports from 109,487 unique patients across 3 U.S. health systems (79 medical sites). This comprehensive dataset includes multiple images per study and detailed radiology reports, making it particularly valuable for the development and evaluation of AI systems for medical imaging and automated report generation models. The dataset is divided into training (140,000 studies), validation (10,000 studies), and public test (10,000 studies) sets, with an additional private test set (10,000 studies) reserved for model evaluation on the ReXrank benchmark. By providing this extensive dataset, we aim to accelerate research in medical imaging AI and advance the state-of-the-art in automated radiological analysis. Our dataset will be open-sourced at this https URL.
zh

人工智能

[AI-0] A class of distributed automata that contains the modal mu-frag ment

【速读】:该论文试图解决将分层模态μ-演算的μ-片段转换为一类分布式消息传递自动机的问题,其解决方案的关键在于通过这种转换获得一个替代证明,证明循环图神经网络在实数域和分层模态替换演算下的表达能力与一阶单调二阶逻辑(MSO)在限制条件下的表达能力相同。

链接: https://arxiv.org/abs/2505.07816
作者: Veeti Ahvonen,Damian Heiman,Antti Kuusisto
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper gives a translation from the \mu -fragment of the graded modal \mu -calculus to a class of distributed message-passing automata. As a corollary, we obtain an alternative proof for a theorem from \citeahvonen_neurips stating that recurrent graph neural networks working with reals and graded modal substitution calculus have the same expressive power in restriction to the logic monadic second-order logic MSO.
zh

[AI-1] Improving Trajectory Stitching with Flow Models

【速读】:该论文试图解决生成式模型在轨迹规划中因训练集缺乏完整轨迹而导致的规划能力不足问题(Generative models have shown great promise as trajectory planners, given their affinity to modeling complex distributions and guidable inference process. Previous works have successfully applied these in the context of robotic manipulation but perform poorly when the required solution does not exist as a complete trajectory within the training set)。解决方案的关键在于识别出无法通过拼接(stitching)进行规划的问题,并针对架构和数据集选择进行改进,同时提出了一种新的训练和推理流程以稳定并增强这些能力。

链接: https://arxiv.org/abs/2505.07802
作者: Reece O’Mahoney,Wanming Yu,Ioannis Havoutis
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative models have shown great promise as trajectory planners, given their affinity to modeling complex distributions and guidable inference process. Previous works have successfully applied these in the context of robotic manipulation but perform poorly when the required solution does not exist as a complete trajectory within the training set. We identify that this is a result of being unable to plan via stitching, and subsequently address the architectural and dataset choices needed to remedy this. On top of this, we propose a novel addition to the training and inference procedures to both stabilize and enhance these capabilities. We demonstrate the efficacy of our approach by generating plans with out of distribution boundary conditions and performing obstacle avoidance on the Franka Panda in simulation and on real hardware. In both of these tasks our method performs significantly better than the baselines and is able to avoid obstacles up to four times as large.
zh

[AI-2] Overflow Prevention Enhances Long-Context Recurrent LLM s

【速读】:该论文试图解决长上下文处理中递归模型(Recurrent Models)对长上下文利用不足的问题,特别是其固定大小的递归记忆机制导致性能受限。解决方案的关键在于采用基于分块(Chunk-based)的推理过程,该方法通过识别并仅处理输入中最相关部分来缓解递归记忆失败,从而有效提升多种长上下文任务的性能。

链接: https://arxiv.org/abs/2505.07793
作者: Assaf Ben-Kish,Itamar Zimerman,M. Jehanzeb Mirza,James Glass,Leonid Karlinsky,Raja Giryes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, their use of long contexts remains underutilized. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input can mitigate recurrent memory failures and be effective for many long-context tasks: On LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results in the challenging LongBench v2 benchmark, showing competitive performance with equivalent size Transformers. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance - even in tasks that presumably require cross-context relations.
zh

[AI-3] Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在需要精确、可验证计算的数学推理任务中表现不佳的问题,以及如何使智能体自主学习利用外部工具(如代码执行)进行推理。其解决方案的关键在于提出一种基于结果奖励的强化学习方法(Reinforcement Learning, RL),即ZeroTIR,通过训练基础LLMs自发生成并执行Python代码来解决数学问题,而无需监督式工具使用示例。该方法的核心创新在于展示了随着RL训练的推进,关键指标(如代码执行频率、响应长度和任务准确率)呈现出可预测的正相关性,表明计算投入与有效工具增强推理策略的出现之间存在量化关系。

链接: https://arxiv.org/abs/2505.07773
作者: Xinji Mai,Haotian Xu,Xing W,Weinong Wang,Yingying Zhang,Wenqiang Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) often struggle with mathematical reasoning tasks requiring precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools like code execution remains crucial. We investigate RL from outcome-based rewards for Tool-Integrated Reasoning, ZeroTIR, training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. Our central contribution is we demonstrate that as RL training progresses, key metrics scale predictably. Specifically, we observe strong positive correlations where increased training steps lead to increases in the spontaneous code execution frequency, the average response length, and, critically, the final task accuracy. This suggests a quantifiable relationship between computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks. Our findings provide a foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark for future studies. Code is released at \hrefthis https URLthis https URL.
zh

[AI-4] “I Apologize For Not Understanding Your Policy”: Exploring the Specification and Evaluation of User-Managed Access Control Policies by AI Virtual Assistants

【速读】:该论文试图解决用户在使用基于人工智能的虚拟助手(Artificial Intelligence-based Virtual Assistants, VAs)时,如何有效管理和评估用户自主访问控制策略(User-Managed Access Control Policies, U-MAPs)的问题。当前VA在处理不同场景下的U-MAP时存在理解不足,导致安全漏洞和隐私泄露的风险。解决方案的关键在于通过结构化与非结构化的测试方法,评估VA对U-MAP的理解能力,并揭示其在处理复杂授权规则和动态变化方面的局限性,从而为VA的进一步优化提供依据。

链接: https://arxiv.org/abs/2505.07759
作者: Jennifer Mondragon,Carlos Rubio-Medrano,Gael Cruz,Dvijesh Shastri
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid evolution of Artificial Intelligence (AI)-based Virtual Assistants (VAs) e.g., Google Gemini, ChatGPT, Microsoft Copilot, and High-Flyer Deepseek has turned them into convenient interfaces for managing emerging technologies such as Smart Homes, Smart Cars, Electronic Health Records, by means of explicit commands,e.g., prompts, which can be even launched via voice, thus providing a very convenient interface for end-users. However, the proper specification and evaluation of User-Managed Access Control Policies (U-MAPs), the rules issued and managed by end-users to govern access to sensitive data and device functionality - within these VAs presents significant challenges, since such a process is crucial for preventing security vulnerabilities and privacy leaks without impacting user experience. This study provides an initial exploratory investigation on whether current publicly-available VAs can manage U-MAPs effectively across differing scenarios. By conducting unstructured to structured tests, we evaluated the comprehension of such VAs, revealing a lack of understanding in varying U-MAP approaches. Our research not only identifies key limitations, but offers valuable insights into how VAs can be further improved to manage complex authorization rules and adapt to dynamic changes.
zh

[AI-5] Emotion-Gradient Metacognitive RSI (Part I): Theoretical Foundations and Single-Agent Architecture

【速读】:该论文试图解决如何构建一种能够自我改进、具备内在动机和元认知能力的通用人工智能(Artificial General Intelligence, AGI)系统的问题。其解决方案的关键在于提出了一种名为情感梯度元认知递归自我改进(Emotion-Gradient Metacognitive Recursive Self-Improvement, EG-MRSI)的框架,该框架将元认知、基于情绪的内在动机和递归自我修改整合为一个统一的理论体系,并通过可微分的内在奖励函数驱动学习过程,同时确保在形式化约束下的安全性与有效性。

链接: https://arxiv.org/abs/2505.07757
作者: Rintaro Ando
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21 pages, 3 figures. Part I of a four-part series (Parts II-IV forthcoming)

点击查看摘要

Abstract:We present the Emotion-Gradient Metacognitive Recursive Self-Improvement (EG-MRSI) framework, a novel architecture that integrates introspective metacognition, emotion-based intrinsic motivation, and recursive self-modification into a unified theoretical system. The framework is explicitly capable of overwriting its own learning algorithm under formally bounded risk. Building upon the Noise-to-Meaning RSI (N2M-RSI) foundation, EG-MRSI introduces a differentiable intrinsic reward function driven by confidence, error, novelty, and cumulative success. This signal regulates both a metacognitive mapping and a self-modification operator constrained by provable safety mechanisms. We formally define the initial agent configuration, emotion-gradient dynamics, and RSI trigger conditions, and derive a reinforcement-compatible optimization objective that guides the agent’s development trajectory. Meaning Density and Meaning Conversion Efficiency are introduced as quantifiable metrics of semantic learning, closing the gap between internal structure and predictive informativeness. This Part I paper establishes the single-agent theoretical foundations of EG-MRSI. Future parts will extend this framework to include safety certificates and rollback protocols (Part II), collective intelligence mechanisms (Part III), and feasibility constraints including thermodynamic and computational limits (Part IV). Together, the EG-MRSI series provides a rigorous, extensible foundation for open-ended and safe AGI.
zh

[AI-6] Benchmarking of CPU-intensive Stream Data Processing in The Edge Computing Systems

【速读】:该论文试图解决边缘计算环境中边缘设备资源利用率低的问题,特别是在缺乏全面性能分析机制的情况下,无法根据工作负载动态调整系统配置。解决方案的关键在于通过合成微基准测试评估单个处理节点的功耗和性能特性,从而揭示CPU频率、功耗与应用性能之间的关系,为优化边缘资源使用提供依据。

链接: https://arxiv.org/abs/2505.07755
作者: Tomasz Szydlo,Viacheslaw Horbanow,Dev Nandan Jha,Shashikant Ilager,Aleksander Slominski,Rajiv Ranjan
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Edge computing has emerged as a pivotal technology, offering significant advantages such as low latency, enhanced data security, and reduced reliance on centralized cloud infrastructure. These benefits are crucial for applications requiring real-time data processing or strict security measures. Despite these advantages, edge devices operating within edge clusters are often underutilized. This inefficiency is mainly due to the absence of a holistic performance profiling mechanism which can help dynamically adjust the desired system configuration for a given workload. Since edge computing environments involve a complex interplay between CPU frequency, power consumption, and application performance, a deeper understanding of these correlations is essential. By uncovering these relationships, it becomes possible to make informed decisions that enhance both computational efficiency and energy savings. To address this gap, this paper evaluates the power consumption and performance characteristics of a single processing node within an edge cluster using a synthetic microbenchmark by varying the workload size and CPU frequency. The results show how an optimal measure can lead to optimized usage of edge resources, given both performance and power consumption.
zh

[AI-7] Guiding Data Collection via Factored Scaling Curves

【速读】:该论文试图解决在广泛环境因素变化下,如何高效收集数据以训练具有泛化能力的模仿学习策略的问题(imitation learning policies),因为全面收集所有可能环境因素的数据成本过高。解决方案的关键在于构建因子缩放曲线(factored scaling curves, FSC),该方法通过量化策略性能随单一或配对因子数据规模变化的情况,从而指导在有限预算内针对最具影响力的因子组合进行有针对性的数据采集。

链接: https://arxiv.org/abs/2505.07728
作者: Lihan Zha,Apurva Badithela,Michael Zhang,Justin Lidard,Jeremy Bao,Emily Zhou,David Snyder,Allen Z. Ren,Dhruv Shah,Anirudha Majumdar
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project website: this https URL

点击查看摘要

Abstract:Generalist imitation learning policies trained on large datasets show great promise for solving diverse manipulation tasks. However, to ensure generalization to different conditions, policies need to be trained with data collected across a large set of environmental factor variations (e.g., camera pose, table height, distractors) - a prohibitively expensive undertaking, if done exhaustively. We introduce a principled method for deciding what data to collect and how much to collect for each factor by constructing factored scaling curves (FSC), which quantify how policy performance varies as data scales along individual or paired factors. These curves enable targeted data acquisition for the most influential factor combinations within a given budget. We evaluate the proposed method through extensive simulated and real-world experiments, across both training-from-scratch and fine-tuning settings, and show that it boosts success rates in real-world tasks in new environments by up to 26% over existing data-collection strategies. We further demonstrate how factored scaling curves can effectively guide data collection using an offline metric, without requiring real-world evaluation at scale.
zh

[AI-8] Circuit Partitioning Using Large Language Models for Quantum Compilation and Simulations

【速读】:该论文试图解决在噪声中等规模量子(NISQ)时代,量子电路编译算法在映射量子算法到量子硬件时面临计算挑战,导致其仅能处理最多5-6个量子比特的电路,从而需要对大规模电路进行分割的问题。解决方案的关键在于利用大型语言模型(LLMs)的能力,通过对其进行了细致的微调,使其能够有效地执行量子电路分割任务,实验结果表明,经过微调的开源LLMs在分割任务上的准确率达到53.4%。

链接: https://arxiv.org/abs/2505.07711
作者: Pranav Sinha,Sumit Kumar Jha,Sunny Raj
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: 7 pages, 2 tables and 3 figures

点击查看摘要

Abstract:We are in the midst of the noisy intermediate-scale quantum (NISQ) era, where quantum computers are limited by noisy gates, some of which are more error-prone than others and can render the final computation incomprehensible. Quantum circuit compilation algorithms attempt to minimize these noisy gates when mapping quantum algorithms onto quantum hardware but face computational challenges that restrict their application to circuits with no more than 5-6 qubits, necessitating the need to partition large circuits before the application of noisy quantum gate minimization algorithms. The existing generation of these algorithms is heuristic in nature and does not account for downstream gate minimization tasks. Large language models (LLMs) have the potential to change this and help improve quantum circuit partitions. This paper investigates the use of LLMs, such as Llama and Mistral, for partitioning quantum circuits by capitalizing on their abilities to understand and generate code, including QASM. Specifically, we teach LLMs to partition circuits using the quick partition approach of the Berkeley Quantum Synthesis Toolkit. Through experimental evaluations, we show that careful fine-tuning of open source LLMs enables us to obtain an accuracy of 53.4% for the partition task while over-the-shelf LLMs are unable to correctly partition circuits, using standard 1-shot and few-shot training approaches.
zh

[AI-9] Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications

【速读】:该论文试图解决传统神经文本转语音(TTS)系统在端到端(E2E)建模中计算复杂度高、内存消耗大,导致难以适用于低资源场景下的实时离线设备应用的问题。解决方案的关键在于提出一种轻量级端到端TTS(LE2E)模型,该模型在保持高质量语音生成能力的同时,显著减少了模型参数数量并提升了实时性,从而实现了高效、低资源消耗的语音合成。

链接: https://arxiv.org/abs/2505.07701
作者: Biel Tura Vecino,Adam Gabryś,Daniel Mątwicki,Andrzej Pomirski,Tom Iddon,Marius Cotescu,Jaime Lorenzo-Trueba
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Published as a conference paper at SSW 2023

点击查看摘要

Abstract:Recent works have shown that modelling raw waveform directly from text in an end-to-end (E2E) fashion produces more natural-sounding speech than traditional neural text-to-speech (TTS) systems based on a cascade or two-stage approach. However, current E2E state-of-the-art models are computationally complex and memory-consuming, making them unsuitable for real-time offline on-device applications in low-resource scenarios. To address this issue, we propose a Lightweight E2E-TTS (LE2E) model that generates high-quality speech requiring minimal computational resources. We evaluate the proposed model on the LJSpeech dataset and show that it achieves state-of-the-art performance while being up to 90% smaller in terms of model parameters and 10\times faster in real-time-factor. Furthermore, we demonstrate that the proposed E2E training paradigm achieves better quality compared to an equivalent architecture trained in a two-stage approach. Our results suggest that LE2E is a promising approach for developing real-time, high quality, low-resource TTS applications for on-device applications.
zh

[AI-10] Belief Injection for Epistemic Control in Linguistic State Space

【速读】:该论文试图解决人工智能代理在认知状态构建与推理过程中缺乏主动调控机制的问题,特别是在如何有效影响代理的信念结构以提升其推理与对齐能力方面。解决方案的关键在于提出“信念注入”(belief injection)这一主动的认知控制机制,该机制基于语义流形框架,通过将目标语言信念片段直接嵌入代理的内部认知状态,从而实现对推理过程的主动引导,而非传统的被动反应式处理。

链接: https://arxiv.org/abs/2505.07693
作者: Sebastian Dumbrava
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 30 pages, 9 figures

点击查看摘要

Abstract:This work introduces belief injection, a proactive epistemic control mechanism for artificial agents whose cognitive states are structured as dynamic ensembles of linguistic belief fragments. Grounded in the Semantic Manifold framework, belief injection directly incorporates targeted linguistic beliefs into an agent’s internal cognitive state, influencing reasoning and alignment proactively rather than reactively. We delineate various injection strategies, such as direct, context-aware, goal-oriented, and reflective approaches, and contrast belief injection with related epistemic control mechanisms, notably belief filtering. Additionally, this work discusses practical applications, implementation considerations, ethical implications, and outlines promising directions for future research into cognitive governance using architecturally embedded belief injection.
zh

[AI-11] S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models

【速读】:该论文旨在解决推理模型在链式思维(Chain-of-Thought, CoT)生成过程中存在的过度思考问题,即模型在生成过程中表现出过多的冗余推理步骤。解决方案的关键在于提出一种新的强化学习方法——序列组衰减奖励策略优化(Serial-Group Decaying-Reward Policy Optimization, S-GRPO),该方法通过在CoT生成过程中选择多个时间点让模型提前退出推理并生成答案,同时根据退出位置分配递减的奖励,以鼓励模型在早期阶段生成高质量答案,从而有效减少冗余推理步骤。

链接: https://arxiv.org/abs/2505.07686
作者: Muzhi Dai,Chenxu Yang,Qingyi Si
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As Test-Time Scaling emerges as an active research focus in the large language model community, advanced post-training methods increasingly emphasize extending chain-of-thought (CoT) generation length, thereby enhancing reasoning capabilities to approach Deepseek R1-like reasoning models. However, recent studies reveal that reasoning models (even Qwen3) consistently exhibit excessive thought redundancy in CoT generation. This overthinking problem stems from conventional outcome-reward reinforcement learning’s systematic neglect in regulating intermediate reasoning steps. This paper proposes Serial-Group Decaying-Reward Policy Optimization (namely S-GRPO), a novel reinforcement learning method that empowers models with the capability to determine the sufficiency of reasoning steps, subsequently triggering early exit of CoT generation. Specifically, unlike GRPO, which samples multiple possible completions (parallel group) in parallel, we select multiple temporal positions in the generation of one CoT to allow the model to exit thinking and instead generate answers (serial group), respectively. For the correct answers in a serial group, we assign rewards that decay according to positions, with lower rewards towards the later ones, thereby reinforcing the model’s behavior to generate higher-quality answers at earlier phases with earlier exits of thinking. Empirical evaluations demonstrate compatibility with state-of-the-art reasoning models, including Qwen3 and Deepseek-distill models, achieving 35.4% ~ 61.1% sequence length reduction with 0.72% ~ 6.08% accuracy improvements across GSM8K, AIME 2024, AMC 2023, MATH-500, and GPQA Diamond benchmarks.
zh

[AI-12] Multimodal Survival Modeling in the Age of Foundation Models

【速读】:该论文试图解决传统癌症生存预测模型在利用多模态数据(如基因组、临床和影像数据)时的局限性,以及病理报告等自由文本数据未被充分利用的问题。其解决方案的关键在于利用生成式 AI (Generative AI) 提取的零样本嵌入进行多模态融合,从而提升生存预测的性能,并通过信息抽取技术有效整合病理报告文本,验证了文本摘要和幻觉对模型效果的影响。

链接: https://arxiv.org/abs/2505.07683
作者: Steven Song,Morgan Borjigin-Wang,Irene Madejski,Robert L. Grossman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages, 7 figures, 8 tables

点击查看摘要

Abstract:The Cancer Genome Atlas (TCGA) has enabled novel discoveries and served as a large-scale reference through its harmonized genomics, clinical, and image data. Prior studies have trained bespoke cancer survival prediction models from unimodal or multimodal TCGA data. A modern paradigm in biomedical deep learning is the development of foundation models (FMs) to derive meaningful feature embeddings, agnostic to a specific modeling task. Biomedical text especially has seen growing development of FMs. While TCGA contains free-text data as pathology reports, these have been historically underutilized. Here, we investigate the feasibility of training classical, multimodal survival models over zero-shot embeddings extracted by FMs. We show the ease and additive effect of multimodal fusion, outperforming unimodal models. We demonstrate the benefit of including pathology report text and rigorously evaluate the effect of model-based text summarization and hallucination. Overall, we modernize survival modeling by leveraging FMs and information extraction from pathology reports.
zh

[AI-13] A Case Study Investigating the Role of Generative AI in Quality Evaluations of Epics in Agile Software Development

【速读】:该论文试图解决敏捷开发中敏捷史诗(agile epic)定义不明确所带来的问题,这些问题导致了返工、交付延迟和成本超支。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)对敏捷史诗的质量进行评估,通过用户研究验证其在实际工作流程中的整合可能性与价值,从而提升史诗的清晰度和有效性。

链接: https://arxiv.org/abs/2505.07664
作者: Werner Geyer,Jessica He,Daita Sarkar,Michelle Brachman,Chris Hammond,Jennifer Heins,Zahra Ashktorab,Carlos Rosemberg,Charlie Hill
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The broad availability of generative AI offers new opportunities to support various work domains, including agile software development. Agile epics are a key artifact for product managers to communicate requirements to stakeholders. However, in practice, they are often poorly defined, leading to churn, delivery delays, and cost overruns. In this industry case study, we investigate opportunities for large language models (LLMs) to evaluate agile epic quality in a global company. Results from a user study with 17 product managers indicate how LLM evaluations could be integrated into their work practices, including perceived values and usage in improving their epics. High levels of satisfaction indicate that agile epics are a new, viable application of AI evaluations. However, our findings also outline challenges, limitations, and adoption barriers that can inform both practitioners and researchers on the integration of such evaluations into future agile work practices.
zh

[AI-14] Bang for the Buck: Vector Search on Cloud CPUs

【速读】:该论文试图解决在云环境中选择适合向量搜索任务的CPU微架构困难的问题,因为现有CPU种类繁多且缺乏跨CPU的向量搜索基准测试。解决方案的关键在于通过实验分析不同CPU微架构在多种向量搜索场景下的性能表现,例如在IVF索引和HNSW索引中的查询每秒(QPS)以及每美元查询数(QP ),从而为用户提供基于性能与成本的最佳选择依据。

链接: https://arxiv.org/abs/2505.07621
作者: Leonardo Kuffo,Peter Boncz
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: To be published in Proceedings of 21st International Workshop on Data Management on New Hardware (DaMoN '25)

点击查看摘要

Abstract:Vector databases have emerged as a new type of systems that support efficient querying of high-dimensional vectors. Many of these offer their database as a service in the cloud. However, the variety of available CPUs and the lack of vector search benchmarks across CPUs make it difficult for users to choose one. In this study, we show that CPU microarchitectures available in the cloud perform significantly differently across vector search scenarios. For instance, in an IVF index on float32 vectors, AMD’s Zen4 gives almost 3x more queries per second (QPS) compared to Intel’s Sapphire Rapids, but for HNSW indexes, the tables turn. However, when looking at the number of queries per dollar (QP ), Graviton3 is the best option for most indexes and quantization settings, even over Graviton4 (Table 1). With this work, we hope to guide users in getting the best “bang for the buck” when deploying vector search systems.
zh

[AI-15] YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models

【速读】:该论文旨在解决传统社会模拟系统在场景构建复杂性高、缺乏多样化默认场景、模拟过程难以适应外部反馈、规模受限以及研究流程自动化不足等问题。其解决方案的关键在于提出一种名为YuLan-OneSim的新型社会模拟器,该模拟器通过无代码场景构建、全面的默认场景库、可进化模拟机制、大规模模拟能力以及AI社会研究员的集成,实现了对社会行为模拟的高效与自动化支持。

链接: https://arxiv.org/abs/2505.07581
作者: Lei Wang,Heyang Gao,Xiaohe Bo,Xu Chen,Ji-Rong Wen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Leveraging large language model (LLM) based agents to simulate human social behaviors has recently gained significant attention. In this paper, we introduce a novel social simulator called YuLan-OneSim. Compared to previous works, YuLan-OneSim distinguishes itself in five key aspects: (1) Code-free scenario construction: Users can simply describe and refine their simulation scenarios through natural language interactions with our simulator. All simulation code is automatically generated, significantly reducing the need for programming expertise. (2) Comprehensive default scenarios: We implement 50 default simulation scenarios spanning 8 domains, including economics, sociology, politics, psychology, organization, demographics, law, and communication, broadening access for a diverse range of social researchers. (3) Evolvable simulation: Our simulator is capable of receiving external feedback and automatically fine-tuning the backbone LLMs, significantly enhancing the simulation quality. (4) Large-scale simulation: By developing a fully responsive agent framework and a distributed simulation architecture, our simulator can handle up to 100,000 agents, ensuring more stable and reliable simulation results. (5) AI social researcher: Leveraging the above features, we develop an AI social researcher. Users only need to propose a research topic, and the AI researcher will automatically analyze the input, construct simulation environments, summarize results, generate technical reports, review and refine the reports–completing the social science research loop. To demonstrate the advantages of YuLan-OneSim, we conduct experiments to evaluate the quality of the automatically generated scenarios, the reliability, efficiency, and scalability of the simulation process, as well as the performance of the AI social researcher.
zh

[AI-16] owards Requirements Engineering for RAG Systems

【速读】:该论文试图解决在复杂领域特定应用中实施检索增强生成(Retrieval Augmented Generation, RAG)系统时的特殊需求工程问题。解决方案的关键在于数据科学家通过与用户的迭代实验,识别出上下文相关的“检索需求”,因为只有用户能够判断生成结果的正确性。研究提出了一个实证过程模型,描述了数据科学家如何实际获取这些“检索需求”并管理系统的局限性。

链接: https://arxiv.org/abs/2505.07553
作者: Tor Sporsem,Rasmus Ulfsnes
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted to EASE 2025, 17-20 June, Istanbul, Turkey

点击查看摘要

Abstract:This short paper explores how a maritime company develops and integrates large-language models (LLM). Specifically by looking at the requirements engineering for Retrieval Augmented Generation (RAG) systems in expert settings. Through a case study at a maritime service provider, we demonstrate how data scientists face a fundamental tension between user expectations of AI perfection and the correctness of the generated outputs. Our findings reveal that data scientists must identify context-specific “retrieval requirements” through iterative experimentation together with users because they are the ones who can determine correctness. We present an empirical process model describing how data scientists practically elicited these “retrieval requirements” and managed system limitations. This work advances software engineering knowledge by providing insights into the specialized requirements engineering processes for implementing RAG systems in complex domain-specific applications.
zh

[AI-17] GRADA: Graph-based Reranker against Adversarial Documents Attack

【速读】:该论文试图解决检索增强生成(Retrieval Augmented Generation, RAG)框架在面对对抗性文档攻击时的脆弱性问题,即攻击者通过引入语义上与查询相似但对系统有害的文档来干扰检索过程,从而降低模型输出的准确性。解决方案的关键在于提出一种基于图的重新排序框架(Graph-based Reranking against Adversarial Document Attacks, GRADA),通过分析文档间的语义关系和结构特征,有效区分对抗性文档与正常文档,从而在保持检索质量的同时显著降低攻击成功率。

链接: https://arxiv.org/abs/2505.07546
作者: Jingjie Zheng,Aryo Pradipta Gema,Giwon Hong,Xuanli He,Pasquale Minervini,Youcheng Sun,Qiongkai Xu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) frameworks improve the accuracy of large language models (LLMs) by integrating external knowledge from retrieved documents, thereby overcoming the limitations of models’ static intrinsic knowledge. However, these systems are susceptible to adversarial attacks that manipulate the retrieval process by introducing documents that are adversarial yet semantically similar to the query. Notably, while these adversarial documents resemble the query, they exhibit weak similarity to benign documents in the retrieval set. Thus, we propose a simple yet effective Graph-based Reranking against Adversarial Document Attacks (GRADA) framework aiming at preserving retrieval quality while significantly reducing the success of adversaries. Our study evaluates the effectiveness of our approach through experiments conducted on five LLMs: GPT-3.5-Turbo, GPT-4o, Llama3.1-8b, Llama3.1-70b, and Qwen2.5-7b. We use three datasets to assess performance, with results from the Natural Questions dataset demonstrating up to an 80% reduction in attack success rates while maintaining minimal loss in accuracy.
zh

[AI-18] he Human-Data-Model Interaction Canvas for Visual Analytics

【速读】:该论文试图解决现有视觉分析(Visual Analytics, VA)过程模型和框架在系统描述人类、数据和模型角色及其相互作用方面的不足,以及在指导新型VA流程设计和提升跨学科协作与用户中心设计方面的局限性。解决方案的关键在于提出HDMI Canvas,它系统地表征了人类、数据和模型的多样化角色,并阐明了这些参与者如何从VA过程中受益并作出贡献,同时融合了现代以用户为中心的方法和可解释AI的模型贡献,从而为VA流程的设计提供新的视角和工具。

链接: https://arxiv.org/abs/2505.07534
作者: Jürgen Bernard
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 5 figures, LaTeX; to appear at the 16th International EuroVis Workshop on Visual Analytics (EuroVA’25) as a position paper

点击查看摘要

Abstract:Visual Analytics (VA) integrates humans, data, and models as key actors in insight generation and data-driven decision-making. This position paper values and reflects on 16 VA process models and frameworks and makes nine high-level observations that motivate a fresh perspective on VA. The contribution is the HDMI Canvas, a perspective to VA that complements the strengths of existing VA process models and frameworks. It systematically characterizes diverse roles of humans, data, and models, and how these actors benefit from and contribute to VA processes. The descriptive power of the HDMI Canvas eases the differentiation between a series of VA building blocks, rather than describing general VA principles only. The canvas includes modern human-centered methodologies, including human knowledge externalization and forms of feedback loops, while interpretable and explainable AI highlight model contributions beyond their conventional outputs. The HDMI Canvas has generative power, guiding the design of new VA processes and is optimized for external stakeholders, improving VA outreach, interdisciplinary collaboration, and user-centered design. The utility of the HDMI Canvas is demonstrated through two preliminary case studies.
zh

[AI-19] QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads

【速读】:该论文试图解决大语言模型(Large Language Model, LLM)和视觉语言模型(Vision-Language Model, VLM)在量化过程中性能损失与硬件效率之间的平衡问题。解决方案的关键在于QuantX,它提供了一套定制化的量化配方,能够在保持模型性能的前提下将量化精度降低至3比特,并通过考虑硬件特定约束来优化推理过程中的解量化效率,从而实现运行时速度、内存需求和模型准确率之间的灵活权衡。

链接: https://arxiv.org/abs/2505.07531
作者: Khurram Mazher,Saad Bin Nasir
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present QuantX: a tailored suite of recipes for LLM and VLM quantization. It is capable of quantizing down to 3-bit resolutions with minimal loss in performance. The quantization strategies in QuantX take into account hardware-specific constraints to achieve efficient dequantization during inference ensuring flexible trade-off between runtime speed, memory requirement and model accuracy. Our results demonstrate that QuantX achieves performance within 6% of the unquantized model for LlaVa-v1.6 quantized down to 3-bits for multiple end user tasks and outperforms recently published state-of-the-art quantization techniques. This manuscript provides insights into the LLM quantization process that motivated the range of recipes and options that are incorporated in QuantX.
zh

[AI-20] HALO: Half Life-Based Outdated Fact Filtering in Temporal Knowledge Graphs

【速读】:该论文旨在解决时间知识图谱(Temporal Knowledge Graph, TKG)中过时事实对推理性能的负面影响问题,以及传统推理方法忽视过时事实的不利影响和由此带来的额外计算成本。其解决方案的关键在于提出一种名为HALO的过时事实过滤框架,该框架通过引入半衰期理论量化历史事实的时间有效性,并结合三个核心模块:时间事实注意力模块、动态关系感知编码模块和过时事实过滤模块,实现对过时事实的有效检测与过滤。

链接: https://arxiv.org/abs/2505.07509
作者: Feng Ding,Tingting Wang,Yupeng Gao,Shuo Yu,Jing Ren,Feng Xia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Outdated facts in temporal knowledge graphs (TKGs) result from exceeding the expiration date of facts, which negatively impact reasoning performance on TKGs. However, existing reasoning methods primarily focus on positive importance of historical facts, neglecting adverse effects of outdated facts. Besides, training on these outdated facts yields extra computational cost. To address these challenges, we propose an outdated fact filtering framework named HALO, which quantifies the temporal validity of historical facts by exploring the half-life theory to filter outdated facts in TKGs. HALO consists of three modules: the temporal fact attention module, the dynamic relation-aware encoder module, and the outdated fact filtering module. Firstly, the temporal fact attention module captures the evolution of historical facts over time to identify relevant facts. Secondly, the dynamic relation-aware encoder module is designed for efficiently predicting the half life of each fact. Finally, we construct a time decay function based on the half-life theory to quantify the temporal validity of facts and filter outdated facts. Experimental results show that HALO outperforms the state-of-the-art TKG reasoning methods on three public datasets, demonstrating its effectiveness in detecting and filtering outdated facts (Codes are available at this https URL ).
zh

[AI-21] EAGLE: Contrastive Learning for Efficient Graph Anomaly Detection

【速读】:该论文旨在解决异构图中异常检测的效率问题,现有方法在嵌入式设备上缺乏必要的高效性。其解决方案的关键在于提出一种基于对比学习的高效异常检测模型EAGLE,通过对比异常节点与正常节点在其到局部上下文距离上的差异来实现异常检测,首先在元路径层面上采样实例对进行对比学习,随后利用图自编码器在无监督条件下学习具有信息量的节点嵌入,并结合判别器预测节点的异常得分。

链接: https://arxiv.org/abs/2505.07508
作者: Jing Ren,Mingliang Hou,Zhixuan Liu,Xiaomei Bai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph anomaly detection is a popular and vital task in various real-world scenarios, which has been studied for several decades. Recently, many studies extending deep learning-based methods have shown preferable performance on graph anomaly detection. However, existing methods are lack of efficiency that is definitely necessary for embedded devices. Towards this end, we propose an Efficient Anomaly detection model on heterogeneous Graphs via contrastive LEarning (EAGLE) by contrasting abnormal nodes with normal ones in terms of their distances to the local context. The proposed method first samples instance pairs on meta path-level for contrastive learning. Then, a graph autoencoder-based model is applied to learn informative node embeddings in an unsupervised way, which will be further combined with the discriminator to predict the anomaly scores of nodes. Experimental results show that EAGLE outperforms the state-of-the-art methods on three heterogeneous network datasets.
zh

[AI-22] Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks

【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在代码领域评估基准饱和的问题,即现有基准测试已无法有效指导模型的进一步优化。解决方案的关键在于提出一个新的基准测试——Web-Bench,该基准包含50个项目,每个项目由20个具有顺序依赖关系的任务组成,旨在模拟真实的人类开发流程,并涵盖Web开发的基础要素:Web标准和Web框架。通过这一高复杂度和高规模的基准,能够更准确地评估LLMs在实际开发场景中的表现。

链接: https://arxiv.org/abs/2505.07473
作者: Kai Xu,YiWei Mao,XinYi Guan,ZiLong Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 15 figures

点击查看摘要

Abstract:The application of large language models (LLMs) in the field of coding is evolving rapidly: from code assistants, to autonomous coding agents, and then to generating complete projects through natural language. Early LLM code benchmarks primarily focused on code generation accuracy, but these benchmarks have gradually become saturated. Benchmark saturation weakens their guiding role for LLMs. For example, HumanEval Pass@1 has reached 99.4% and MBPP 94.2%. Among various attempts to address benchmark saturation, approaches based on software engineering have stood out, but the saturation of existing software engineering benchmarks is rapidly increasing. To address this, we propose a new benchmark, Web-Bench, which contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development workflows. When designing Web-Bench, we aim to cover the foundational elements of Web development: Web Standards and Web Frameworks. Given the scale and complexity of these projects, which were designed by engineers with 5 to 10 years of experience, each presents a significant challenge. On average, a single project takes 4 to 8 hours for a senior engineer to complete. On our given benchmark agent (Web-Agent), SOTA (Claude 3.7 Sonnet) achieves only 25.1% Pass@1, significantly lower (better) than SWE-Bench’s Verified (65.4%) and Full (33.8%) scores. Finally, we discuss that in any development field, Standards and Frameworks represent foundational knowledge and efficiency tools, respectively, and LLMs require optimization tailored to them.
zh

[AI-23] How well do LLM s reason over tabular data really?

【速读】:该论文旨在解决通用大型语言模型(Large Language Models, LLMs)在处理表格数据时的推理能力及其对现实世界表格输入变化的鲁棒性问题。研究指出,现有评估策略无法真实反映LLMs在表格查询任务中的表现,并揭示了LLMs在表格推理任务中存在显著缺陷。解决方案的关键在于采用“LLM-as-a-judge”方法以获得更可靠的性能评估,并通过扩展表格输入以包含缺失值、重复实体和结构变化等现实特征,从而更真实地衡量LLMs的性能。实验结果表明,这些现实特征会显著影响LLMs的表格推理能力,强调了提升其在实际应用场景中鲁棒性的必要性。

链接: https://arxiv.org/abs/2505.07453
作者: Cornelius Wolff,Madelon Hulsebos
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) excel in natural language tasks, but less is known about their reasoning capabilities over tabular data. Prior analyses devise evaluation strategies that poorly reflect an LLM’s realistic performance on tabular queries. Moreover, we have a limited understanding of the robustness of LLMs towards realistic variations in tabular inputs. Therefore, we ask: Can general-purpose LLMs reason over tabular data, really?, and focus on two questions 1) are tabular reasoning capabilities of general-purpose LLMs robust to real-world characteristics of tabular inputs, and 2) how can we realistically evaluate an LLM’s performance on analytical tabular queries? Building on a recent tabular reasoning benchmark, we first surface shortcomings of its multiple-choice prompt evaluation strategy, as well as commonly used free-form text metrics such as SacreBleu and BERT-score. We show that an LLM-as-a-judge procedure yields more reliable performance insights and unveil a significant deficit in tabular reasoning performance of LLMs. We then extend the tabular inputs reflecting three common characteristics in practice: 1) missing values, 2) duplicate entities, and 3) structural variations. Experiments show that the tabular reasoning capabilities of general-purpose LLMs suffer from these variations, stressing the importance of improving their robustness for realistic tabular inputs.
zh

[AI-24] Prototype Augmented Hypernetworks for Continual Learning CVPR

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中的灾难性遗忘(Catastrophic Forgetting, CF)问题,即在学习新任务时,模型会遗忘之前学到的知识。其解决方案的关键在于提出了一种名为原型增强超网络(Prototype-Augmented Hypernetworks, PAH)的框架,该框架通过可学习的任务原型来条件化单一的超网络,从而按需生成特定于任务的分类器头部。PAH通过结合交叉熵损失与双重蒸馏损失(分别用于对齐logits和原型)来缓解遗忘,确保跨任务的稳定特征表示。

链接: https://arxiv.org/abs/2505.07450
作者: Neil De La Fuente,Maria Pilligua,Daniel Vidal,Albin Soutiff,Cecilia Curreli,Daniel Cremers,Andrey Barsky
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: CVPR (LatinX in CV)

点击查看摘要

Abstract:Continual learning (CL) aims to learn a sequence of tasks without forgetting prior knowledge, but gradient updates for a new task often overwrite the weights learned earlier, causing catastrophic forgetting (CF). We propose Prototype-Augmented Hypernetworks (PAH), a framework where a single hypernetwork, conditioned on learnable task prototypes, dynamically generates task-specific classifier heads on demand. To mitigate forgetting, PAH combines cross-entropy with dual distillation losses, one to align logits and another to align prototypes, ensuring stable feature representations across tasks. Evaluations on Split-CIFAR100 and TinyImageNet demonstrate that PAH achieves state-of-the-art performance, reaching 74.5 % and 63.7 % accuracy with only 1.7 % and 4.4 % forgetting, respectively, surpassing prior methods without storing samples or heads.
zh

[AI-25] LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning

【速读】:该论文旨在解决指令微调过程中因现有迭代式模型感知数据选择方法导致的计算开销过大的问题,这些问题主要源于反复进行全数据集模型推理以估计样本效用,从而造成效率瓶颈。论文提出的解决方案关键在于LEAD框架,其核心是实例级动态不确定性(Instance-Level Dynamic Uncertainty, IDU),该方法通过结合瞬时训练损失、基于梯度的损失变化近似以及历史损失信号的指数平滑,在标准训练循环内准确估计样本效用,避免了额外的模型推理开销。

链接: https://arxiv.org/abs/2505.07437
作者: Xiaotian Lin,Yanlin Qi,Yizhang Zhu,Themis Palpanas,Chengliang Chai,Nan Tang,Yuyu Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Instruction tuning has emerged as a critical paradigm for improving the capabilities and alignment of large language models (LLMs). However, existing iterative model-aware data selection methods incur significant computational overhead, as they rely on repeatedly performing full-dataset model inference to estimate sample utility for subsequent training iterations, creating a fundamental efficiency bottleneck. In this paper, we propose LEAD, an efficient iterative data selection framework that accurately estimates sample utility entirely within the standard training loop, eliminating the need for costly additional model inference. At its core, LEAD introduces Instance-Level Dynamic Uncertainty (IDU), a theoretically grounded utility function combining instantaneous training loss, gradient-based approximation of loss changes, and exponential smoothing of historical loss signals. To further scale efficiently to large datasets, LEAD employs a two-stage, coarse-to-fine selection strategy, adaptively prioritizing informative clusters through a multi-armed bandit mechanism, followed by precise fine-grained selection of high-utility samples using IDU. Extensive experiments across four diverse benchmarks show that LEAD significantly outperforms state-of-the-art methods, improving average model performance by 6.1%-10.8% while using only 2.5% of the training data and reducing overall training time by 5-10x.
zh

[AI-26] AI in Money Matters

【速读】:该论文试图解决在受监管的金融技术(Fintech)行业中,大型语言模型(Large Language Models, LLMs)尤其是ChatGPT的采用现状与潜在挑战问题。其解决方案的关键在于通过实证研究,特别是对Fintech行业专业从业者的访谈,来探讨LLMs在该领域的实际应用情况及面临的监管与合规性问题,从而填补现有学术讨论中关于专业领域真实观点的空白。

链接: https://arxiv.org/abs/2505.07393
作者: Nadine Sandjo Tchatchoua(Roskilde University),Richard Harper(Lancaster University)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In November 2022, Europe and the world by and large were stunned by the birth of a new large language model : ChatGPT. Ever since then, both academic and populist discussions have taken place in various public spheres such as LinkedIn and X(formerly known as Twitter) with the view to both understand the tool and its benefits for the society. The views of real actors in professional spaces, especially in regulated industries such as finance and law have been largely missing. We aim to begin to close this gap by presenting results from an empirical investigation conducted through interviews with professional actors in the Fintech industry. The paper asks the question, how and to what extent are large language models in general and ChatGPT in particular being adopted and used in the Fintech industry? The results show that while the fintech experts we spoke with see a potential in using large language models in the future, a lot of questions marks remain concerning how they are policed and therefore might be adopted in a regulated industry such as Fintech. This paper aims to add to the existing academic discussing around large language models, with a contribution to our understanding of professional viewpoints.
zh

[AI-27] Examining the Role of LLM -Driven Interactions on Attention and Cognitive Engagement in Virtual Classrooms

【速读】:该论文试图解决在教育环境中集成大型语言模型(Large Language Models, LLMs)对用户参与度和注意力的影响这一开放性问题。其解决方案的关键在于构建一个完全由LLM驱动的虚拟学习环境,其中同伴和教师均由LLM模拟,通过研究同伴提问行为对学生参与度、注意力、认知负荷和学习成果的影响,揭示LLM在提升学习效果中的作用机制。

链接: https://arxiv.org/abs/2505.07377
作者: Suleyman Ozdel,Can Sarpkaya,Efe Bozkir,Hong Gao,Enkelejda Kasneci
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted to EDM 2025 (Eighteenth International Conference on Educational Data Mining)

点击查看摘要

Abstract:Transforming educational technologies through the integration of large language models (LLMs) and virtual reality (VR) offers the potential for immersive and interactive learning experiences. However, the effects of LLMs on user engagement and attention in educational environments remain open questions. In this study, we utilized a fully LLM-driven virtual learning environment, where peers and teachers were LLM-driven, to examine how students behaved in such settings. Specifically, we investigate how peer question-asking behaviors influenced student engagement, attention, cognitive load, and learning outcomes and found that, in conditions where LLM-driven peer learners asked questions, students exhibited more targeted visual scanpaths, with their attention directed toward the learning content, particularly in complex subjects. Our results suggest that peer questions did not introduce extraneous cognitive load directly, as the cognitive load is strongly correlated with increased attention to the learning material. Considering these findings, we provide design recommendations for optimizing VR learning spaces.
zh

[AI-28] AIS Data-Driven Maritime Monitoring Based on Transformer: A Comprehensive Review

【速读】:该论文旨在解决全球航运中对安全性、效率和可持续性的日益增长需求背景下,如何充分挖掘和利用大规模自动识别系统(AIS)数据的潜力问题。其解决方案的关键在于采用Transformer模型,该模型凭借强大的序列建模能力,特别是捕捉长距离依赖关系和复杂时间动态的能力,成为处理AIS数据的有效工具。通过回顾基于Transformer的AIS数据驱动的海上监测研究,论文重点探讨了轨迹预测、行为检测与预测等方法,并对公开的AIS数据集进行了整理与分析,为后续海上监测任务提供了数据支持与研究方向建议。

链接: https://arxiv.org/abs/2505.07374
作者: Zhiye Xie,Enmei Tu,Xianping Fu,Guoliang Yuan,Yi Han
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the increasing demands for safety, efficiency, and sustainability in global shipping, Automatic Identification System (AIS) data plays an increasingly important role in maritime monitoring. AIS data contains spatial-temporal variation patterns of vessels that hold significant research value in the marine domain. However, due to its massive scale, the full potential of AIS data has long remained untapped. With its powerful sequence modeling capabilities, particularly its ability to capture long-range dependencies and complex temporal dynamics, the Transformer model has emerged as an effective tool for processing AIS data. Therefore, this paper reviews the research on Transformer-based AIS data-driven maritime monitoring, providing a comprehensive overview of the current applications of Transformer models in the marine field. The focus is on Transformer-based trajectory prediction methods, behavior detection, and prediction techniques. Additionally, this paper collects and organizes publicly available AIS datasets from the reviewed papers, performing data filtering, cleaning, and statistical analysis. The statistical results reveal the operational characteristics of different vessel types, providing data support for further research on maritime monitoring tasks. Finally, we offer valuable suggestions for future research, identifying two promising research directions. Datasets are available at this https URL.
zh

[AI-29] Synthetic Code Surgery: Repairing Bugs and Vulnerabilities with LLM s and Synthetic Data

【速读】:该论文试图解决自动化程序修复(Automated Program Repair, APR)系统中因高质量训练数据不足而导致的性能受限问题,特别是在涵盖多种编程语言和多样化的错误类型方面。解决方案的关键在于提出一种基于大型语言模型(Large Language Models, LLMs)的合成数据生成方法,通过两阶段流程——首先生成带有缺陷和修复代码的配对示例,随后进行多维度的质量评估,从而构建高质量的合成数据集。该方法利用LLMs在多种编程语言和错误类别中生成大量样本,并通过跨模型评估确保数据质量,最终在VulRepair测试集上实现了显著的性能提升。

链接: https://arxiv.org/abs/2505.07372
作者: David de-Fitero-Dominguez,Antonio Garcia-Cabot,Eva Garcia-Lopez
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a novel methodology for enhancing Automated Program Repair (APR) through synthetic data generation utilizing Large Language Models (LLMs). Current APR systems are constrained by the limited availability of high-quality training data encompassing diverse bug types across multiple programming languages. The proposed approach addresses this limitation through a two-phase process: a synthetic sample generation followed by a rigorous quality assessment. Multiple state-of-the-art LLMs were employed to generate approximately 30,000 paired examples of buggy and fixed code across 12 programming languages and 13 bug categories. Subsequently, these samples underwent cross-model evaluation against five criteria: correctness, code quality, security, performance, and completeness. Experimental evaluation on the VulRepair test set dataset showed statistically significant improvements in Perfect Prediction rates, with the quality-filtered synthetic dataset outperforming both baseline and real-world commit data configurations in certain scenarios. The methodology was validated through rigorous statistical testing, including ANOVA and post-hoc Tukey’s Honest Significant Difference analysis. Furthermore, the best-performing configurations surpassed existing systems despite using a less computationally intensive decoding strategy. This research establishes a self-bootstrapping paradigm in which LLMs generate and evaluate their own training data, potentially transforming approaches to data scarcity across software engineering tasks and advancing the development of robust, adaptable tools for automated code maintenance.
zh

[AI-30] Laypeoples Attitudes Towards Fair Affirmative and Discriminatory Decision-Making Algorithms

【速读】:该论文试图解决算法歧视问题,特别是通过引入补偿性算法(affirmative algorithms)来纠正历史不公。其解决方案的关键在于评估公众对不同类型的算法系统的看法,包括那些优先考虑被边缘化群体的补偿性算法、优先考虑特权群体的歧视性算法以及独立于人口群体做出决策的公平算法。研究发现,尽管大多数人支持公平算法并反对歧视性系统,但对补偿性算法存在分歧,这种分歧源于人们对谁(如果有的话)被边缘化的不同信念。

链接: https://arxiv.org/abs/2505.07339
作者: Gabriel Lima,Nina Grgić-Hlača,Markus Langer,Yixin Zou
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Affirmative algorithms have emerged as a potential answer to algorithmic discrimination, seeking to redress past harms and rectify the source of historical injustices. We present the results of two experiments ( N = 1193 ) capturing laypeople’s perceptions of affirmative algorithms – those which explicitly prioritize the historically marginalized – in hiring and criminal justice. We contrast these opinions about affirmative algorithms with folk attitudes towards algorithms that prioritize the privileged (i.e., discriminatory) and systems that make decisions independently of demographic groups (i.e., fair). We find that people – regardless of their political leaning and identity – view fair algorithms favorably and denounce discriminatory systems. In contrast, we identify disagreements concerning affirmative algorithms: liberals and racial minorities rate affirmative systems as positively as their fair counterparts, whereas conservatives and those from the dominant racial group evaluate affirmative algorithms as negatively as discriminatory systems. We identify a source of these divisions: people have varying beliefs about who (if anyone) is marginalized, shaping their views of affirmative algorithms. We discuss the possibility of bridging these disagreements to bring people together towards affirmative algorithms.
zh

[AI-31] Dynamical Label Augmentation and Calibration for Noisy Electronic Health Records

【速读】:该论文试图解决医疗时间序列数据中标签错误(noisy label)对患者预后预测的负面影响问题。解决方案的关键在于提出一种基于注意力机制的学习框架ACTLL,该框架通过两组分Beta混合模型区分确定性和不确定性实例,并在动态校准不确定实例标签或从确定性实例中增强置信实例的过程中捕捉全局时间动态。

链接: https://arxiv.org/abs/2505.07320
作者: Yuhao Li,Ling Luo,Uwe Aickelin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical research, particularly in predicting patient outcomes, heavily relies on medical time series data extracted from Electronic Health Records (EHR), which provide extensive information on patient histories. Despite rigorous examination, labeling errors are inevitable and can significantly impede accurate predictions of patient outcome. To address this challenge, we propose an \textbfAttention-based Learning Framework with Dynamic \textbfCalibration and Augmentation for \textbfTime series Noisy \textbfLabel \textbfLearning (ACTLL). This framework leverages a two-component Beta mixture model to identify the certain and uncertain sets of instances based on the fitness distribution of each class, and it captures global temporal dynamics while dynamically calibrating labels from the uncertain set or augmenting confident instances from the certain set. Experimental results on large-scale EHR datasets eICU and MIMIC-IV-ED, and several benchmark datasets from the UCR and UEA repositories, demonstrate that our model ACTLL has achieved state-of-the-art performance, especially under high noise levels.
zh

[AI-32] How Do Companies Manage the Environmental Sustainability of AI? An Interview Study About Green AI Efforts and Regulations

【速读】:该论文试图解决在工业界采用人工智能(Artificial Intelligence, AI)过程中,环境可持续性所扮演的角色不明确,以及AI监管如何影响绿色AI实践和决策的问题。其解决方案的关键在于通过11次访谈,探索工业从业者对绿色AI的认知与管理,分析AI采纳、环境负面影响的缓解措施以及欧盟AI法案和企业可持续报告指令(Corporate Sustainability Reporting Directive, CSRD)的影响。研究发现,大多数参与者在AI采纳中优先考虑业务效率,而对环境可持续性的关注有限,且缺乏有效的环境影响监测与缓解措施,表明当前法规在推动绿色AI方面效果不足,亟需提升行业意识并提供更易用的绿色AI技术与工具。

链接: https://arxiv.org/abs/2505.07317
作者: Ashmita Sampatsing,Sophie Vos,Emma Beauxis-Aussalet,Justus Bogner
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the 11th International Conference on ICT for Sustainability (ICT4S’25), see this https URL

点击查看摘要

Abstract:With the ever-growing adoption of artificial intelligence (AI), AI-based software and its negative impact on the environment are no longer negligible, and studying and mitigating this impact has become a critical area of research. However, it is currently unclear which role environmental sustainability plays during AI adoption in industry and how AI regulations influence Green AI practices and decision-making in industry. We therefore aim to investigate the Green AI perception and management of industry practitioners. To this end, we conducted a total of 11 interviews with participants from 10 different organizations that adopted AI-based software. The interviews explored three main themes: AI adoption, current efforts in mitigating the negative environmental impact of AI, and the influence of the EU AI Act and the Corporate Sustainability Reporting Directive (CSRD). Our findings indicate that 9 of 11 participants prioritized business efficiency during AI adoption, with minimal consideration of environmental sustainability. Monitoring and mitigation of AI’s environmental impact were very limited. Only one participant monitored negative environmental effects. Regarding applied mitigation practices, six participants reported no actions, with the others sporadically mentioning techniques like prompt engineering, relying on smaller models, or not overusing AI. Awareness and compliance with the EU AI Act are low, with only one participant reporting on its influence, while the CSRD drove sustainability reporting efforts primarily in larger companies. All in all, our findings reflect a lack of urgency and priority for sustainable AI among these companies. We suggest that current regulations are not very effective, which has implications for policymakers. Additionally, there is a need to raise industry awareness, but also to provide user-friendly techniques and tools for Green AI practices.
zh

[AI-33] FedIFL: A federated cross-domain diagnostic framework for motor-driven systems with inconsistent fault modes

【速读】:该论文旨在解决联邦学习在故障诊断场景中因客户端间标签空间不一致导致的模型泛化能力下降问题,具体表现为本地模型仅关注特定客户端的故障模式,并将不同客户端的故障模式映射到相似的特征表示。其解决方案的关键在于提出一种称为联邦不变特征学习(FedIFL)的跨域诊断框架,通过原型对比学习缓解客户端内部领域偏移,利用特征生成实现隐私友好的客户端间分布访问,并引入特征解耦机制与实例级联邦一致性损失、联邦实例个性化损失及正交损失,以增强模型对不变特征的提取和区分能力,从而提升全局模型在跨客户端标签空间不一致情况下的诊断性能。

链接: https://arxiv.org/abs/2505.07315
作者: Zexiao Wang,Yankai Wang,Xiaoqiang Liao,Xinguo Ming,Weiming Shen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Due to the scarcity of industrial data, individual equipment users, particularly start-ups, struggle to independently train a comprehensive fault diagnosis model; federated learning enables collaborative training while ensuring data privacy, making it an ideal solution. However, the diversity of working conditions leads to variations in fault modes, resulting in inconsistent label spaces across different clients. In federated diagnostic scenarios, label space inconsistency leads to local models focus on client-specific fault modes and causes local models from different clients to map different failure modes to similar feature representations, which weakens the aggregated global model’s generalization. To tackle this issue, this article proposed a federated cross-domain diagnostic framework termed Federated Invariant Features Learning (FedIFL). In intra-client training, prototype contrastive learning mitigates intra-client domain shifts, subsequently, feature generating ensures local models can access distributions of other clients in a privacy-friendly manner. Besides, in cross-client training, a feature disentanglement mechanism is introduced to mitigate cross-client domain shifts, specifically, an instance-level federated instance consistency loss is designed to ensure the instance-level consistency of invariant features between different clients, furthermore, a federated instance personalization loss and an orthogonal loss are constructed to distinguish specific features that from the invariant features. Eventually, the aggregated model achieves promising generalization among global label spaces, enabling accurate fault diagnosis for target clients’ Motor Driven Systems (MDSs) with inconsistent label spaces. Experiments on real-world MDSs validate the effectiveness and superiority of FedIFL in federated cross-domain diagnosis with inconsistent fault modes.
zh

[AI-34] Interpretable Event Diagnosis in Water Distribution Networks

【速读】:该论文试图解决在水系统中利用数据驱动方法进行事件诊断时,算法结果准确性不足且难以获得操作人员信任的问题。其关键解决方案是提出可解释的事件诊断框架,通过提供对比(即反事实)解释,帮助操作人员将算法诊断结果与自身的直觉和经验相联系,从而提升对算法内部机制的理解,并支持其结合个人经验做出更明智的决策。具体而言,提出了反事实事件指纹,用于表示当前事件诊断与最近似替代解释之间的差异,并以图形方式呈现。

链接: https://arxiv.org/abs/2505.07299
作者: André Artelt,Stelios G. Vrachimis,Demetrios G. Eliades,Ulrike Kuhl,Barbara Hammer,Marios M. Polycarpou
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:The increasing penetration of information and communication technologies in the design, monitoring, and control of water systems enables the use of algorithms for detecting and identifying unanticipated events (such as leakages or water contamination) using sensor measurements. However, data-driven methodologies do not always give accurate results and are often not trusted by operators, who may prefer to use their engineering judgment and experience to deal with such events. In this work, we propose a framework for interpretable event diagnosis – an approach that assists the operators in associating the results of algorithmic event diagnosis methodologies with their own intuition and experience. This is achieved by providing contrasting (i.e., counterfactual) explanations of the results provided by fault diagnosis algorithms; their aim is to improve the understanding of the algorithm’s inner workings by the operators, thus enabling them to take a more informed decision by combining the results with their personal experiences. Specifically, we propose counterfactual event fingerprints, a representation of the difference between the current event diagnosis and the closest alternative explanation, which can be presented in a graphical way. The proposed methodology is applied and evaluated on a realistic use case using the L-Town benchmark. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY) Cite as: arXiv:2505.07299 [cs.AI] (or arXiv:2505.07299v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2505.07299 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-35] HuB: Learning Extreme Humanoid Balance

【速读】:该论文旨在解决人形机器人在高平衡需求任务中的控制难题,具体包括参考运动误差引起的不稳定性、形态不匹配导致的学习困难以及传感器噪声和未建模动力学带来的仿真到现实(sim-to-real)差距。其解决方案的关键在于提出了一种统一框架HuB(Humanoid Balance),该框架集成了参考运动优化、平衡感知策略学习和仿真到现实的鲁棒性训练,分别针对上述三个核心挑战进行有效应对。

链接: https://arxiv.org/abs/2505.07294
作者: Tong Zhang,Boyuan Zheng,Ruiqian Nai,Yingdong Hu,Yen-Jen Wang,Geng Chen,Fanqi Lin,Jiongye Li,Chuye Hong,Koushil Sreenath,Yang Gao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Project website: this https URL

点击查看摘要

Abstract:The human body demonstrates exceptional motor capabilities-such as standing steadily on one foot or performing a high kick with the leg raised over 1.5 meters-both requiring precise balance control. While recent research on humanoid control has leveraged reinforcement learning to track human motions for skill acquisition, applying this paradigm to balance-intensive tasks remains challenging. In this work, we identify three key obstacles: instability from reference motion errors, learning difficulties due to morphological mismatch, and the sim-to-real gap caused by sensor noise and unmodeled dynamics. To address these challenges, we propose HuB (Humanoid Balance), a unified framework that integrates reference motion refinement, balance-aware policy learning, and sim-to-real robustness training, with each component targeting a specific challenge. We validate our approach on the Unitree G1 humanoid robot across challenging quasi-static balance tasks, including extreme single-legged poses such as Swallow Balance and Bruce Lee’s Kick. Our policy remains stable even under strong physical disturbances-such as a forceful soccer strike-while baseline methods consistently fail to complete these tasks. Project website: this https URL
zh

[AI-36] Predicting Music Track Popularity by Convolutional Neural Networks on Spotify Features and Spectrogram of Audio Waveform

【速读】:该论文试图解决在数字流媒体环境中,艺术家和行业专家难以预测音乐作品成功可能性的问题。解决方案的关键在于提出一种基于卷积神经网络(Convolutional Neural Networks, CNN)的方法,结合Spotify数据进行分析,通过利用音频波形的频谱图特征、元数据以及用户参与度指标,捕捉影响音乐作品流行度的复杂模式和关系。

链接: https://arxiv.org/abs/2505.07280
作者: Navid Falah,Behnam Yousefimehr,Mehdi Ghatee
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures, 4 tables

点击查看摘要

Abstract:In the digital streaming landscape, it’s becoming increasingly challenging for artists and industry experts to predict the success of music tracks. This study introduces a pioneering methodology that uses Convolutional Neural Networks (CNNs) and Spotify data analysis to forecast the popularity of music tracks. Our approach takes advantage of Spotify’s wide range of features, including acoustic attributes based on the spectrogram of audio waveform, metadata, and user engagement metrics, to capture the complex patterns and relationships that influence a track’s popularity. Using a large dataset covering various genres and demographics, our CNN-based model shows impressive effectiveness in predicting the popularity of music tracks. Additionally, we’ve conducted extensive experiments to assess the strength and adaptability of our model across different musical styles and time periods, with promising results yielding a 97% F1 score. Our study not only offers valuable insights into the dynamic landscape of digital music consumption but also provides the music industry with advanced predictive tools for assessing and predicting the success of music tracks.
zh

[AI-37] CHD: Coupled Hierarchical Diffusion for Long-Horizon Tasks

【速读】:该论文试图解决基于扩散模型的规划方法在复杂、长时域任务中表现不佳的问题,其根本原因在于高层子目标(high-level, HL)选择与底层轨迹生成(low-level, LL)之间的耦合松散,导致计划不连贯和性能下降。解决方案的关键在于提出耦合分层扩散(Coupled Hierarchical Diffusion, CHD),该框架在一个统一的扩散过程中联合建模HL子目标和LL轨迹,并通过共享分类器将LL反馈上传以实现子目标的自我修正,从而增强轨迹一致性并支持可扩展的长时域扩散规划。

链接: https://arxiv.org/abs/2505.07261
作者: Ce Hao,Anxing Xiao,Zhiwei Xue,Harold Soh
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion-based planners have shown strong performance in short-horizon tasks but often fail in complex, long-horizon settings. We trace the failure to loose coupling between high-level (HL) sub-goal selection and low-level (LL) trajectory generation, which leads to incoherent plans and degraded performance. We propose Coupled Hierarchical Diffusion (CHD), a framework that models HL sub-goals and LL trajectories jointly within a unified diffusion process. A shared classifier passes LL feedback upstream so that sub-goals self-correct while sampling proceeds. This tight HL-LL coupling improves trajectory coherence and enables scalable long-horizon diffusion planning. Experiments across maze navigation, tabletop manipulation, and household environments show that CHD consistently outperforms both flat and hierarchical diffusion baselines.
zh

[AI-38] UMoE: Unifying Attention and FFN with Shared Experts

【速读】:该论文试图解决在注意力机制中应用稀疏专家混合(Sparse Mixture of Experts, MoE)架构时存在的性能不足问题,以及与基于前馈网络(Feed-Forward Network, FFN)的MoE架构相比的效率差异。其解决方案的关键在于通过重新表述注意力机制,揭示出注意力模块中隐含的类似FFN的结构,并在此基础上提出UMoE架构,从而实现基于注意力的MoE层在性能上的提升,同时支持FFN与注意力组件之间的高效参数共享。

链接: https://arxiv.org/abs/2505.07260
作者: Yuanhang Yang,Chaozheng Wang,Jing Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify the MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism, revealing an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.
zh

[AI-39] REMEDI: Relative Feature Enhanced Meta-Learning with Distillation for Imbalanced Prediction

【速读】:该论文试图解决现有车辆所有者未来购车预测问题,该问题面临极端类别不平衡(正样本率为0.5%)和复杂的用户行为模式两大挑战。解决方案的关键在于提出一种多阶段框架REMEDI(Relative feature Enhanced Meta-learning with Distillation for Imbalanced prediction),该框架首先训练多种基础模型以捕捉用户行为的不同方面,其次通过混合专家架构引入相对性能元特征进行有效的模型融合,最后通过监督微调将集成模型的知识蒸馏到一个高效的单一模型中,从而在保持预测能力的同时实现实际部署。

链接: https://arxiv.org/abs/2505.07245
作者: Fei Liu,Huanhuan Ren,Yu Guan,Xiuxu Wang,Wang Lv,Zhiqiang Hu,Yaxi Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Predicting future vehicle purchases among existing owners presents a critical challenge due to extreme class imbalance (0.5% positive rate) and complex behavioral patterns. We propose REMEDI (Relative feature Enhanced Meta-learning with Distillation for Imbalanced prediction), a novel multi-stage framework addressing these challenges. REMEDI first trains diverse base models to capture complementary aspects of user behavior. Second, inspired by comparative op-timization techniques, we introduce relative performance meta-features (deviation from ensemble mean, rank among peers) for effective model fusion through a hybrid-expert architecture. Third, we distill the ensemble’s knowledge into a single efficient model via supervised fine-tuning with MSE loss, enabling practical deployment. Evaluated on approximately 800,000 vehicle owners, REMEDI significantly outperforms baseline approaches, achieving the business target of identifying ~50% of actual buyers within the top 60,000 recommendations at ~10% precision. The distilled model preserves the ensemble’s predictive power while maintaining deployment efficiency, demonstrating REMEDI’s effectiveness for imbalanced prediction in industry settings.
zh

[AI-40] Comet: Accelerating Private Inference for Large Language Model by Predicting Activation Sparsity

【速读】:该论文旨在解决在云平台上的大语言模型(Large Language Models, LLMs)推理过程中敏感信息泄露的隐私问题,提出了一种高效的私有推理系统Comet。其解决方案的关键在于利用LLMs中激活稀疏性的特性,通过一个准确且快速的预测器预测激活函数输出的稀疏分布,并引入一种新的私有推理协议,该协议通过利用预测稀疏分布的空间局部性,高效且安全地避免涉及零值的计算,从而减少计算和通信开销。

链接: https://arxiv.org/abs/2505.07239
作者: Guang Yan,Yuhui Zhang,Zimu Guo,Lutan Zhao,Xiaojun Chen,Chen Wang,Wenhao Wang,Dan Meng,Rui Hou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted to SP 2025

点击查看摘要

Abstract:With the growing use of large language models (LLMs) hosted on cloud platforms to offer inference services, privacy concerns about the potential leakage of sensitive information are escalating. Secure multi-party computation (MPC) is a promising solution to protect the privacy in LLM inference. However, MPC requires frequent inter-server communication, causing high performance overhead. Inspired by the prevalent activation sparsity of LLMs, where most neuron are not activated after non-linear activation functions, we propose an efficient private inference system, Comet. This system employs an accurate and fast predictor to predict the sparsity distribution of activation function output. Additionally, we introduce a new private inference protocol. It efficiently and securely avoids computations involving zero values by exploiting the spatial locality of the predicted sparse distribution. While this computation-avoidance approach impacts the spatiotemporal continuity of KV cache entries, we address this challenge with a low-communication overhead cache refilling strategy that merges miss requests and incorporates a prefetching mechanism. Finally, we evaluate Comet on four common LLMs and compare it with six state-of-the-art private inference systems. Comet achieves a 1.87x-2.63x speedup and a 1.94x-2.64x communication reduction. Comments: Accepted to SP 2025 Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.07239 [cs.CR] (or arXiv:2505.07239v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2505.07239 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1109/SP61157.2025.00182 Focus to learn more DOI(s) linking to related resources
zh

[AI-41] UAV-CodeAgents : Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning

【速读】:该论文试图解决自主无人机任务生成中的复杂规划与环境适应问题,特别是在工业和环境火灾检测等大规模任务场景中实现高效、可靠的任务执行。其解决方案的关键在于构建一个基于大语言模型和视觉-语言模型(LLMs/VLMs)的可扩展多智能体框架UAV-CodeAgents,该框架采用ReAct(Reason + Act)范式,结合视觉引导的像素定位机制和反应性思维循环,以实现对卫星图像的理解、高阶自然语言指令的解析以及无人机轨迹的协同生成,从而在最小人类监督下完成动态环境中的任务调整与执行。

链接: https://arxiv.org/abs/2505.07236
作者: Oleg Sautenkov,Yasheerah Yaqoot,Muhammad Ahsan Mustafa,Faryal Batool,Jeffrin Sam,Artem Lykov,Chih-Yung Wen,Dzmitry Tsetserukou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Submitted

点击查看摘要

Abstract:We present UAV-CodeAgents, a scalable multi-agent framework for autonomous UAV mission generation, built on large language and vision-language models (LLMs/VLMs). The system leverages the ReAct (Reason + Act) paradigm to interpret satellite imagery, ground high-level natural language instructions, and collaboratively generate UAV trajectories with minimal human supervision. A core component is a vision-grounded, pixel-pointing mechanism that enables precise localization of semantic targets on aerial maps. To support real-time adaptability, we introduce a reactive thinking loop, allowing agents to iteratively reflect on observations, revise mission goals, and coordinate dynamically in evolving environments. UAV-CodeAgents is evaluated on large-scale mission scenarios involving industrial and environmental fire detection. Our results show that a lower decoding temperature (0.5) yields higher planning reliability and reduced execution time, with an average mission creation time of 96.96 seconds and a success rate of 93%. We further fine-tune Qwen2.5VL-7B on 9,000 annotated satellite images, achieving strong spatial grounding across diverse visual categories. To foster reproducibility and future research, we will release the full codebase and a novel benchmark dataset for vision-language-based UAV planning. Comments: Submitted Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.07236 [cs.RO] (or arXiv:2505.07236v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2505.07236 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-42] Measuring General Intelligence with Generated Games

【速读】:该论文试图解决如何评估语言模型在复杂环境中的通用推理能力问题,传统静态基准难以全面反映模型的动态推理与策略制定能力。其解决方案的关键在于构建gg-bench,这是一个通过大型语言模型(LLM)自动生成游戏环境的数据生成过程,具体包括:利用LLM生成新颖游戏的自然语言描述、将游戏转化为Gym环境代码,并通过自对弈强化学习(RL)训练智能体。该方法能够持续生成新的评估实例,从而更真实地测试语言模型在复杂任务中的表现。

链接: https://arxiv.org/abs/2505.07215
作者: Vivek Verma,David Huang,William Chen,Dan Klein,Nicholas Tomlin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data generating process where new evaluation instances can be generated at will. In particular, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their winrate against these RL agents by prompting models with the game description, current board state, and a list of valid moves, after which models output the moves they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve winrates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini and DeepSeek-R1 achieve average winrates of 31-36%. We release the generated games, data generation process, and evaluation code in order to support future modeling work and expansion of our benchmark.
zh

[AI-43] Accountability of Generative AI: Exploring a Precautionary Approach for “Artificially Created Nature”

【速读】:该论文试图解决生成式AI技术在社会技术系统中引发的问责问题,特别是在当前生成式AI系统复杂机制导致输出结果难以追溯的情况下。论文指出,透明性虽非问责的充分条件,但有助于提升问责水平。其解决方案的关键在于,当无法实现生成式AI的透明性时,应将其视为一种“人工创造的自然”,并采用预防原则来评估和管理AI风险,同时提出需要建立公民参与平台以应对生成式AI带来的潜在风险。

链接: https://arxiv.org/abs/2505.07178
作者: Yuri Nakao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid development of generative artificial intelligence (AI) technologies raises concerns about the accountability of sociotechnical systems. Current generative AI systems rely on complex mechanisms that make it difficult for even experts to fully trace the reasons behind the outputs. This paper first examines existing research on AI transparency and accountability and argues that transparency is not a sufficient condition for accountability but can contribute to its improvement. We then discuss that if it is not possible to make generative AI transparent, generative AI technology becomes ``artificially created nature’’ in a metaphorical sense, and suggest using the precautionary principle approach to consider AI risks. Finally, we propose that a platform for citizen participation is needed to address the risks of generative AI.
zh

[AI-44] Internet of Agents : Fundamentals Applications and Challenges

【速读】:该论文试图解决随着大规模语言模型和视觉-语言模型的迅速发展,AI代理从孤立、任务特定的系统演变为无需人类干预即可感知、推理和行动的自主交互实体,从而在虚拟和物理环境中广泛部署时所面临的统一、以代理为中心的基础设施不足问题。解决方案的关键在于提出互联网代理(Internet of Agents, IoA)作为基础框架,实现异构代理之间的无缝互联、动态发现与协作编排,其核心要素包括能力通知与发现、自适应通信协议、动态任务匹配、共识与冲突解决机制以及激励模型。

链接: https://arxiv.org/abs/2505.07176
作者: Yuntao Wang,Shaolong Guo,Yanghe Pan,Zhou Su,Fahao Chen,Tom H. Luan,Peng Li,Jiawen Kang,Dusit Niyato
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 22 pages,10 figures, 8 tables. Submitted to IEEE TCCN

点击查看摘要

Abstract:With the rapid proliferation of large language models and vision-language models, AI agents have evolved from isolated, task-specific systems into autonomous, interactive entities capable of perceiving, reasoning, and acting without human intervention. As these agents proliferate across virtual and physical environments, from virtual assistants to embodied robots, the need for a unified, agent-centric infrastructure becomes paramount. In this survey, we introduce the Internet of Agents (IoA) as a foundational framework that enables seamless interconnection, dynamic discovery, and collaborative orchestration among heterogeneous agents at scale. We begin by presenting a general IoA architecture, highlighting its hierarchical organization, distinguishing features relative to the traditional Internet, and emerging applications. Next, we analyze the key operational enablers of IoA, including capability notification and discovery, adaptive communication protocols, dynamic task matching, consensus and conflict-resolution mechanisms, and incentive models. Finally, we identify open research directions toward building resilient and trustworthy IoA ecosystems.
zh

[AI-45] ReCDAP: Relation-Based Conditional Diffusion with Attention Pooling for Few-Shot Knowledge Graph Completion SIGIR2025

【速读】:该论文试图解决知识图谱(Knowledge Graph, KG)中关系呈现长尾分布的问题,这会阻碍信息检索性能。现有研究在少样本知识图谱补全任务中通常仅关注正向三元组信息,或在引入负向三元组时仅将其作为错误三元组的指示信号。解决方案的关键在于提出基于关系的条件扩散与注意力池化方法(Relation-Based Conditional Diffusion with Attention Pooling, ReCDAP),通过随机替换支持集中的尾实体生成负向三元组,并在扩散过程中分别估计正向和负向关系的潜在分布,同时引入注意力池化机制显式利用正负案例之间的差异。

链接: https://arxiv.org/abs/2505.07171
作者: Jeongho Kim,Chanyeong Heo,Jaehee Jung
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted by SIGIR 2025, 5 pages, 1 figure

点击查看摘要

Abstract:Knowledge Graphs (KGs), composed of triples in the form of (head, relation, tail) and consisting of entities and relations, play a key role in information retrieval systems such as question answering, entity search, and recommendation. In real-world KGs, although many entities exist, the relations exhibit a long-tail distribution, which can hinder information retrieval performance. Previous few-shot knowledge graph completion studies focused exclusively on the positive triple information that exists in the graph or, when negative triples were incorporated, used them merely as a signal to indicate incorrect triples. To overcome this limitation, we propose Relation-Based Conditional Diffusion with Attention Pooling (ReCDAP). First, negative triples are generated by randomly replacing the tail entity in the support set. By conditionally incorporating positive information in the KG and non-existent negative information into the diffusion process, the model separately estimates the latent distributions for positive and negative relations. Moreover, including an attention pooler enables the model to leverage the differences between positive and negative cases explicitly. Experiments on two widely used datasets demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance. The code is available at this https URL.
zh

[AI-46] X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real

【速读】:该论文试图解决在机器人操作策略训练中,人类视频缺乏动作标签导致无法直接应用于标准模仿学习算法的问题,以及跨身体结构(cross-embodiment)方法在实体差异较大时表现不佳的问题。解决方案的关键在于提出X-Sim框架,该框架利用物体运动作为密集且可迁移的信号来学习机器人策略,通过从RGBD人类视频重建逼真仿真环境并跟踪物体轨迹以定义以物体为中心的奖励,进而在仿真中训练强化学习(Reinforcement Learning, RL)策略,并通过合成回放将其提炼为图像条件扩散策略,最后结合在线领域自适应技术实现真实世界的迁移。

链接: https://arxiv.org/abs/2505.07096
作者: Prithwish Dan,Kushal Kedia,Angela Chao,Edward Weiyi Duan,Maximus Adrian Pace,Wei-Chiu Ma,Sanjiban Choudhury
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim starts by reconstructing a photorealistic simulation from an RGBD human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Si introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection time, and (3) generalizes to new camera viewpoints and test-time changes. Code and videos are available at this https URL.
zh

[AI-47] RefPentester: A Knowledge-Informed Self-Reflective Penetration Testing Framework Based on Large Language Models

【速读】:该论文试图解决基于大语言模型(Large Language Models, LLMs)的自动化渗透测试(AutoPT)框架在复杂任务中表现不佳的问题,具体包括LLM训练中知识分布不均、规划过程中的短视性以及命令生成过程中的幻觉现象,同时现有框架缺乏从先前失败操作中学习的能力,限制了渗透测试策略的自适应改进。解决方案的关键在于提出一种基于LLM的知识驱动自反思渗透测试框架——RefPentester,该框架能够识别当前渗透测试阶段、选择合适的战术与技术、提供分步操作指导,并通过学习历史失败经验实现策略优化,同时将渗透测试过程建模为七状态阶段机以提升框架集成效果。

链接: https://arxiv.org/abs/2505.07089
作者: Hanzheng Dai,Yuanliang Li,Zhibo Zhang,Jun Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated penetration testing (AutoPT) powered by large language models (LLMs) has gained attention for its ability to automate ethical hacking processes and identify vulnerabilities in target systems by leveraging the intrinsic knowledge of LLMs. However, existing LLM-based AutoPT frameworks often underperform compared to human experts in challenging tasks for several reasons: the imbalanced knowledge used in LLM training, short-sighted planning in the planning process, and hallucinations during command generation. In addition, the penetration testing (PT) process, with its trial-and-error nature, is limited by existing frameworks that lack mechanisms to learn from previous failed operations, restricting adaptive improvement of PT strategies. To address these limitations, we propose a knowledge-informed self-reflective PT framework powered by LLMs, called RefPentester, which is an AutoPT framework designed to assist human operators in identifying the current stage of the PT process, selecting appropriate tactic and technique for the stage, choosing suggested action, providing step-by-step operational guidance, and learning from previous failed operations. We also modeled the PT process as a seven-state Stage Machine to integrate the proposed framework effectively. The evaluation shows that RefPentester can successfully reveal credentials on Hack The Box’s Sau machine, outperforming the baseline GPT-4o model by 16.7%. Across PT stages, RefPentester also demonstrates superior success rates on PT stage transitions.
zh

[AI-48] Architectural Precedents for General Agents using Large Language Models

【速读】:该论文试图解决如何通过分析认知设计模式来理解并推进通用人工智能(AGI)的发展问题,特别是针对当前基于大语言模型(LLMs)的智能代理系统中存在的不足。其解决方案的关键在于总结和分析在早期非Transformer架构中出现的重复性认知设计模式,并探讨这些模式在现代LLM系统中的体现,从而揭示当前系统的局限性并为未来研究提供方向。

链接: https://arxiv.org/abs/2505.07087
作者: Robert E. Wray,James R. Kirk,John E. Laird
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures. Submitted to AGI25

点击查看摘要

Abstract:One goal of AI (and AGI) is to identify and understand specific mechanisms and representations sufficient for general intelligence. Often, this work manifests in research focused on architectures and many cognitive architectures have been explored in AI/AGI. However, different research groups and even different research traditions have somewhat independently identified similar/common patterns of processes and representations or cognitive design patterns that are manifest in existing architectures. Today, AI systems exploiting large language models (LLMs) offer a relatively new combination of mechanism and representation available for exploring the possibilities of general intelligence. In this paper, we summarize a few recurring cognitive design patterns that have appeared in various pre-transformer AI architectures. We then explore how these patterns are evident in systems using LLMs, especially for reasoning and interactive (“agentic”) use cases. By examining and applying these recurring patterns, we can also predict gaps or deficiencies in today’s Agentic LLM Systems and identify likely subjects of future research towards general intelligence using LLMs and other generative foundation models.
zh

[AI-49] Arbitrarily Applicable Same/Opposite Relational Responding with NARS

【速读】:该论文试图解决如何在人工智能系统中实现基于最小经验的灵活刺激关系泛化问题,特别是生成式 AI (Generative AI) 中的任意性同/异关系反应能力。解决方案的关键在于扩展非公理化推理系统(NARS),引入习得关系的实现,使系统能够从有限的显性训练中显式推导出对称关系(互含)和新颖的关系组合(组合蕴含),并在情境控制的匹配-样本(MTS)程序中表现出稳健的关系泛化能力。实验结果表明,NARS能够快速内化显性训练的关系规则,并在关键测试阶段展现出包含互含与组合蕴含的派生关系反应。

链接: https://arxiv.org/abs/2505.07079
作者: Robert Johansson,Patrick Hammer,Tony Lofthouse
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Same/opposite relational responding, a fundamental aspect of human symbolic cognition, allows the flexible generalization of stimulus relationships based on minimal experience. In this study, we demonstrate the emergence of \textitarbitrarily applicable same/opposite relational responding within the Non-Axiomatic Reasoning System (NARS), a computational cognitive architecture designed for adaptive reasoning under uncertainty. Specifically, we extend NARS with an implementation of \textitacquired relations, enabling the system to explicitly derive both symmetric (mutual entailment) and novel relational combinations (combinatorial entailment) from minimal explicit training in a contextually controlled matching-to-sample (MTS) procedure. Experimental results show that NARS rapidly internalizes explicitly trained relational rules and robustly demonstrates derived relational generalizations based on arbitrary contextual cues. Importantly, derived relational responding in critical test phases inherently combines both mutual and combinatorial entailments, such as deriving same-relations from multiple explicitly trained opposite-relations. Internal confidence metrics illustrate strong internalization of these relational principles, closely paralleling phenomena observed in human relational learning experiments. Our findings underscore the potential for integrating nuanced relational learning mechanisms inspired by learning psychology into artificial general intelligence frameworks, explicitly highlighting the arbitrary and context-sensitive relational capabilities modeled within NARS.
zh

[AI-50] ParaView-MCP: An Autonomous Visualization Agent with Direct Tool Use

【速读】:该论文试图解决传统可视化工具如ParaView因学习曲线陡峭而限制潜在用户使用的问题,同时希望提升其智能化水平。解决方案的关键在于引入ParaView-MCP,这是一个集成现代多模态大语言模型(MLLM)的自主代理,通过Model Context Protocol (MCP) 实现与ParaView的高效交互,从而支持自然语言和视觉输入的操作,并增强智能决策支持能力。

链接: https://arxiv.org/abs/2505.07064
作者: Shusen Liu,Haichao Miao,Peer-Timo Bremer
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While powerful and well-established, tools like ParaView present a steep learning curve that discourages many potential users. This work introduces ParaView-MCP, an autonomous agent that integrates modern multimodal large language models (MLLMs) with ParaView to not only lower the barrier to entry but also augment ParaView with intelligent decision support. By leveraging the state-of-the-art reasoning, command execution, and vision capabilities of MLLMs, ParaView-MCP enables users to interact with ParaView through natural language and visual inputs. Specifically, our system adopted the Model Context Protocol (MCP) - a standardized interface for model-application communication - that facilitates direct interaction between MLLMs with ParaView’s Python API to allow seamless information exchange between the user, the language model, and the visualization tool itself. Furthermore, by implementing a visual feedback mechanism that allows the agent to observe the viewport, we unlock a range of new capabilities, including recreating visualizations from examples, closed-loop visualization parameter updates based on user-defined goals, and even cross-application collaboration involving multiple tools. Broadly, we believe such an agent-driven visualization paradigm can profoundly change the way we interact with visualization tools. We expect a significant uptake in the development of such visualization tools, in both visualization research and industry.
zh

[AI-51] Unlocking Non-Block-Structured Decisions: Inductive Mining with Choice Graphs

【速读】:该论文试图解决传统归纳挖掘算法在处理非块结构决策点时的局限性,这些算法虽然在保证过程模型的正确性和效率方面表现良好,但其严格的块结构表示限制了对现实世界复杂流程的准确建模。论文提出的解决方案的关键在于对部分有序工作流语言(POWL)进行扩展,引入选择图(choice graphs),以结构化且灵活的方式建模复杂的决策逻辑,从而在保持归纳挖掘框架质量保证的同时,提升对非块结构决策的建模能力。

链接: https://arxiv.org/abs/2505.07052
作者: Humam Kourani,Gyunam Park,Wil M.P. van der Aalst
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The Version of Record of this contribution will be published in the proceedings of the 23rd International Conference on Business Process Management (BPM 2025). This preprint has not undergone peer review or any post-submission improvements or corrections

点击查看摘要

Abstract:Process discovery aims to automatically derive process models from event logs, enabling organizations to analyze and improve their operational processes. Inductive mining algorithms, while prioritizing soundness and efficiency through hierarchical modeling languages, often impose a strict block-structured representation. This limits their ability to accurately capture the complexities of real-world processes. While recent advancements like the Partially Ordered Workflow Language (POWL) have addressed the block-structure limitation for concurrency, a significant gap remains in effectively modeling non-block-structured decision points. In this paper, we bridge this gap by proposing an extension of POWL to handle non-block-structured decisions through the introduction of choice graphs. Choice graphs offer a structured yet flexible approach to model complex decision logic within the hierarchical framework of POWL. We present an inductive mining discovery algorithm that uses our extension and preserves the quality guarantees of the inductive mining framework. Our experimental evaluation demonstrates that the discovered models, enriched with choice graphs, more precisely represent the complex decision-making behavior found in real-world processes, without compromising the high scalability inherent in inductive mining techniques.
zh

[AI-52] DialogueReason : Rule-Based RL Sparks Dialogue Reasoning in LLM s

【速读】:该论文试图解决传统单声道式推理模型在推理过程中多样性与连贯性不足的问题,这些问题通常表现为固定策略的重复使用或注意力的无必要转移。解决方案的关键在于提出一种基于对话的推理范式——DialogueReason,该方法通过构建代理(agents)、环境(environment)和交互(interactions)的结构,利用规则奖励的PPO算法训练开源大语言模型,以实现更丰富的推理过程和更高的任务性能。

链接: https://arxiv.org/abs/2505.07049
作者: Yubo Shu,Zhewei Huang,Xin Wu,Chen Hu,Shuchang Zhou,Daxin Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose DialogueReason, a reasoning paradigm that uncovers the lost roles in monologue-style reasoning models, aiming to boost diversity and coherency of the reasoning process. Recent advances in RL-based large reasoning models have led to impressive long CoT capabilities and high performance on math and science benchmarks. However, these reasoning models rely mainly on monologue-style reasoning, which often limits reasoning diversity and coherency, frequently recycling fixed strategies or exhibiting unnecessary shifts in attention. Our work consists of an analysis of monologue reasoning patterns and the development of a dialogue-based reasoning approach. We first introduce the Compound-QA task, which concatenates multiple problems into a single prompt to assess both diversity and coherency of reasoning. Our analysis shows that Compound-QA exposes weaknesses in monologue reasoning, evidenced by both quantitative metrics and qualitative reasoning traces. Building on the analysis, we propose a dialogue-based reasoning, named DialogueReason, structured around agents, environment, and interactions. Using PPO with rule-based rewards, we train open-source LLMs (Qwen-QWQ and Qwen-Base) to adopt dialogue reasoning. We evaluate trained models on MATH, AIME, and GPQA datasets, showing that the dialogue reasoning model outperforms monologue models under more complex compound questions. Additionally, we discuss how dialogue-based reasoning helps enhance interpretability, facilitate more intuitive human interaction, and inspire advances in multi-agent system design.
zh

[AI-53] Empirical Analysis of Asynchronous Federated Learning on Heterogeneous Devices: Efficiency Fairness and Privacy Trade-offs IJCNN2025

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中设备异构性带来的效率、公平性和隐私之间的权衡问题。在同步FL中,资源受限的客户端会拖慢整体更新过程,而异步FL通过及时整合更新提高了效率,但其对隐私的影响尚未得到充分研究,尤其是在高端设备频繁贡献更新的情况下,累积的隐私泄露风险显著增加。论文提出的关键解决方案是通过实验对比FedAvg与考虑延迟的FedAsync,在真实设备异构环境下评估其性能,并结合局部差分隐私(Local Differential Privacy, LDP)和矩量会计(Moments Accountant)量化每个客户端的隐私损失,从而揭示异步FL在提升效率的同时加剧了公平性和隐私差异的问题。

链接: https://arxiv.org/abs/2505.07041
作者: Samaneh Mohammadi,Iraklis Symeonidis,Ali Balador,Francesco Flammini
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper was accepted to IJCNN 2025. This version is a preprint and not the official published version

点击查看摘要

Abstract:Device heterogeneity poses major challenges in Federated Learning (FL), where resource-constrained clients slow down synchronous schemes that wait for all updates before aggregation. Asynchronous FL addresses this by incorporating updates as they arrive, substantially improving efficiency. While its efficiency gains are well recognized, its privacy costs remain largely unexplored, particularly for high-end devices that contribute updates more frequently, increasing their cumulative privacy exposure. This paper presents the first comprehensive analysis of the efficiency-fairness-privacy trade-off in synchronous vs. asynchronous FL under realistic device heterogeneity. We empirically compare FedAvg and staleness-aware FedAsync using a physical testbed of five edge devices spanning diverse hardware tiers, integrating Local Differential Privacy (LDP) and the Moments Accountant to quantify per-client privacy loss. Using Speech Emotion Recognition (SER) as a privacy-critical benchmark, we show that FedAsync achieves up to 10x faster convergence but exacerbates fairness and privacy disparities: high-end devices contribute 6-10x more updates and incur up to 5x higher privacy loss, while low-end devices suffer amplified accuracy degradation due to infrequent, stale, and noise-perturbed updates. These findings motivate the need for adaptive FL protocols that jointly optimize aggregation and privacy mechanisms based on client capacity and participation dynamics, moving beyond static, one-size-fits-all solutions.
zh

[AI-54] Predicting Diabetes Using Machine Learning: A Comparative Study of Classifiers

【速读】:该论文旨在解决糖尿病的早期预测问题,以提高疾病干预和患者管理的效率。其解决方案的关键在于提出了一种新型混合模型DNet,该模型结合了卷积神经网络(Convolutional Neural Network, CNN)和长短期记忆网络(Long Short-Term Memory, LSTM)的结构,以实现有效的特征提取和序列学习。DNet通过卷积块捕捉关键特征,残差块增强信息流动,并引入批量归一化和Dropout进行正则化,最后通过LSTM层建模数据中的时间依赖性,从而在糖尿病预测任务中取得了优异的性能表现。

链接: https://arxiv.org/abs/2505.07036
作者: Mahade Hasan,Farhana Yasmin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diabetes remains a significant health challenge globally, contributing to severe complications like kidney disease, vision loss, and heart issues. The application of machine learning (ML) in healthcare enables efficient and accurate disease prediction, offering avenues for early intervention and patient support. Our study introduces an innovative diabetes prediction framework, leveraging both traditional ML techniques such as Logistic Regression, SVM, Naïve Bayes, and Random Forest and advanced ensemble methods like AdaBoost, Gradient Boosting, Extra Trees, and XGBoost. Central to our approach is the development of a novel model, DNet, a hybrid architecture combining Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers for effective feature extraction and sequential learning. The DNet model comprises an initial convolutional block for capturing essential features, followed by a residual block with skip connections to facilitate efficient information flow. Batch Normalization and Dropout are employed for robust regularization, and an LSTM layer captures temporal dependencies within the data. Using a Kaggle-sourced real-world diabetes dataset, our model evaluation spans cross-validation accuracy, precision, recall, F1 score, and ROC-AUC. Among the models, DNet demonstrates the highest efficacy with an accuracy of 99.79% and an AUC-ROC of 99.98%, establishing its potential for superior diabetes prediction. This robust hybrid architecture showcases the value of combining CNN and LSTM layers, emphasizing its applicability in medical diagnostics and disease prediction tasks.
zh

[AI-55] Efficient Fault Detection in WSN Based on PCA-Optimized Deep Neural Network Slicing Trained with GOA

【速读】:该论文试图解决无线传感器网络(Wireless Sensor Networks, WSNs)中故障检测的难题,传统方法在优化深度神经网络(Deep Neural Networks, DNNs)以实现高效性能方面存在不足,尤其是在处理高维数据和捕捉非线性关系时表现不佳,同时存在收敛速度慢和难以通过梯度优化找到最优网络结构的问题。解决方案的关键在于提出一种结合主成分分析(Principal Component Analysis, PCA)与由蚱蜢优化算法(Grasshopper Optimization Algorithm, GOA)优化的DNN的混合方法,通过PCA降低数据维度并保留关键信息,再利用GOA优化DNN的网络结构,从而提升训练效率和故障检测精度。

链接: https://arxiv.org/abs/2505.07030
作者: Mahmood Mohassel Feghhi,Raya Majid Alsharfa,Majid Hameed Majeed
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 22 pages, 18 figures, Accepted for publication in International Journal of Intelligent Engineering and Systems, May 2025

点击查看摘要

Abstract:Fault detection in Wireless Sensor Networks (WSNs) is crucial for reliable data transmission and network longevity. Traditional fault detection methods often struggle with optimizing deep neural networks (DNNs) for efficient performance, especially in handling high-dimensional data and capturing nonlinear relationships. Additionally, these methods typically suffer from slow convergence and difficulty in finding optimal network architectures using gradient-based optimization. This study proposes a novel hybrid method combining Principal Component Analysis (PCA) with a DNN optimized by the Grasshopper Optimization Algorithm (GOA) to address these limitations. Our approach begins by computing eigenvalues from the original 12-dimensional dataset and sorting them in descending order. The cumulative sum of these values is calculated, retaining principal components until 99.5% variance is achieved, effectively reducing dimensionality to 4 features while preserving critical information. This compressed representation trains a six-layer DNN where GOA optimizes the network architecture, overcoming backpropagation’s limitations in discovering nonlinear relationships. This hybrid PCA-GOA-DNN framework compresses the data and trains a six-layer DNN that is optimized by GOA, enhancing both training efficiency and fault detection accuracy. The dataset used in this study is a real-world WSNs dataset developed by the University of North Carolina, which was used to evaluate the proposed method’s performance. Extensive simulations demonstrate that our approach achieves a remarkable 99.72% classification accuracy, with exceptional precision and recall, outperforming conventional methods. The method is computationally efficient, making it suitable for large-scale WSN deployments, and represents a significant advancement in fault detection for resource-constrained WSNs.
zh

[AI-56] Incremental Uncertainty-aware Performance Monitoring with Active Labeling Intervention

【速读】:该论文试图解决在渐进式分布偏移(gradual distribution shift)环境下机器学习模型性能监控的问题,这种环境下的条件随时间缓慢变化,可能导致准确率显著下降但难以被察觉。解决方案的关键在于提出一种无需标签的增量不确定性感知性能监控方法(Incremental Uncertainty-aware Performance Monitoring, IUPM),该方法通过最优传输建模渐进式偏移来估计性能变化,并量化性能预测的不确定性,同时引入主动标注流程在有限标注预算下恢复可靠的估计。

链接: https://arxiv.org/abs/2505.07023
作者: Alexander Koebler,Thomas Decker,Ingo Thon,Volker Tresp,Florian Buettner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We study the problem of monitoring machine learning models under gradual distribution shifts, where circumstances change slowly over time, often leading to unnoticed yet significant declines in accuracy. To address this, we propose Incremental Uncertainty-aware Performance Monitoring (IUPM), a novel label-free method that estimates performance changes by modeling gradual shifts using optimal transport. In addition, IUPM quantifies the uncertainty in the performance prediction and introduces an active labeling procedure to restore a reliable estimate under a limited labeling budget. Our experiments show that IUPM outperforms existing performance estimation baselines in various gradual shift scenarios and that its uncertainty awareness guides label acquisition more effectively compared to other strategies.
zh

[AI-57] R-CAGE: A Structural Model for Emotion Output Design in Human-AI Interaction

【速读】:该论文试图解决长期人机交互中情感输出的结构性问题,特别是由于重复情感参与导致的认知和心理负担。现有情感计算方法侧重于表达性、沉浸感和响应性,但忽略了持续情感互动带来的认知与结构后果。解决方案的关键在于提出R-CAGE(Rhythmic Control Architecture for Guarding Ego)框架,将情感输出视为需要架构干预的伦理设计结构,而非单纯的反应性表达。该框架通过四个控制模块——节奏表达控制、感官结构化架构、认知框架保护以及自我对齐响应设计——来调节情感节奏、感官强度和解释可能性,从而实现情感的可持续设计,保护用户免受情感过载和认知超载的影响,同时维持长期的解释自主性。

链接: https://arxiv.org/abs/2505.07020
作者: Suyeon Choi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: theory-only preprint. Independent research

点击查看摘要

Abstract:This paper presents R-CAGE (Rhythmic Control Architecture for Guarding Ego), a theoretical framework for restructuring emotional output in long-term human-AI interaction. While prior affective computing approaches emphasized expressiveness, immersion, and responsiveness, they often neglected the cognitive and structural consequences of repeated emotional engagement. R-CAGE instead conceptualizes emotional output not as reactive expression but as ethical design structure requiring architectural intervention. The model is grounded in experiential observations of subtle affective symptoms such as localized head tension, interpretive fixation, and emotional lag arising from prolonged interaction with affective AI systems. These indicate a mismatch between system-driven emotion and user interpretation that cannot be fully explained by biometric data or observable behavior. R-CAGE adopts a user-centered stance prioritizing psychological recovery, interpretive autonomy, and identity continuity. The framework consists of four control blocks: (1) Control of Rhythmic Expression regulates output pacing to reduce fatigue; (2) Architecture of Sensory Structuring adjusts intensity and timing of affective stimuli; (3) Guarding of Cognitive Framing reduces semantic pressure to allow flexible interpretation; (4) Ego-Aligned Response Design supports self-reference recovery during interpretive lag. By structurally regulating emotional rhythm, sensory intensity, and interpretive affordances, R-CAGE frames emotion not as performative output but as sustainable design unit. The goal is to protect users from oversaturation and cognitive overload while sustaining long-term interpretive agency in AI-mediated environments.
zh

[AI-58] Hand-Shadow Poser SIGGRAPH2025

【速读】:该论文试图解决的是一个逆问题:给定一个目标形状,找到左右手的姿势,使得它们共同投射出的阴影尽可能接近输入形状。该问题具有挑战性,因为3D手部姿势的设计空间庞大,同时受到解剖学约束的限制,且输入形状为无色无纹理的图像,需关注其形状和关键特征。解决方案的关键在于设计了一个三阶段的流水线——Hand-Shadow Poser,通过分离解剖学约束(由手)和语义约束(由阴影形状)来实现高效求解,包括生成式手部分配模块、广义手影对齐模块以及基于阴影特征的优化模块,从而在无需专用训练数据集的情况下,有效生成符合物理合理性和阴影特征保留的手部姿势。

链接: https://arxiv.org/abs/2505.07012
作者: Hao Xu,Yinqiao Wang,Niloy J. Mitra,Shuaicheng Liu,Pheng-Ann Heng,Chi-Wing Fu
机构: 未知
类目: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI)
备注: SIGGRAPH 2025 (ACM TOG)

点击查看摘要

Abstract:Hand shadow art is a captivating art form, creatively using hand shadows to reproduce expressive shapes on the wall. In this work, we study an inverse problem: given a target shape, find the poses of left and right hands that together best produce a shadow resembling the input. This problem is nontrivial, since the design space of 3D hand poses is huge while being restrictive due to anatomical constraints. Also, we need to attend to the input’s shape and crucial features, though the input is colorless and textureless. To meet these challenges, we design Hand-Shadow Poser, a three-stage pipeline, to decouple the anatomical constraints (by hand) and semantic constraints (by shadow shape): (i) a generative hand assignment module to explore diverse but reasonable left/right-hand shape hypotheses; (ii) a generalized hand-shadow alignment module to infer coarse hand poses with a similarity-driven strategy for selecting hypotheses; and (iii) a shadow-feature-aware refinement module to optimize the hand poses for physical plausibility and shadow feature preservation. Further, we design our pipeline to be trainable on generic public hand data, thus avoiding the need for any specialized training dataset. For method validation, we build a benchmark of 210 diverse shadow shapes of varying complexity and a comprehensive set of metrics, including a novel DINOv2-based evaluation metric. Through extensive comparisons with multiple baselines and user studies, our approach is demonstrated to effectively generate bimanual hand poses for a large variety of hand shapes for over 85% of the benchmark cases.
zh

[AI-59] Explainable AI the Latest Advancements and New Trends

【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)决策过程缺乏可解释性的问题,即神经网络中的各种算法使得其决策原因难以理解,从而影响了AI系统的可信度。解决方案的关键在于发展可解释的AI(Explainable AI, XAI)技术,通过研究和整合提升AI系统可解释性的方法与技术,并强调AI可解释性与自主系统元推理(meta-reasoning)之间的强关联,以推动未来更具可解释性的AI系统的发展。

链接: https://arxiv.org/abs/2505.07005
作者: Bowen Long,Enjie Liu,Renxi Qiu,Yanqing Duan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, Artificial Intelligence technology has excelled in various applications across all domains and fields. However, the various algorithms in neural networks make it difficult to understand the reasons behind decisions. For this reason, trustworthy AI techniques have started gaining popularity. The concept of trustworthiness is cross-disciplinary; it must meet societal standards and principles, and technology is used to fulfill these requirements. In this paper, we first surveyed developments from various countries and regions on the ethical elements that make AI algorithms trustworthy; and then focused our survey on the state of the art research into the interpretability of AI. We have conducted an intensive survey on technologies and techniques used in making AI explainable. Finally, we identified new trends in achieving explainable AI. In particular, we elaborate on the strong link between the explainability of AI and the meta-reasoning of autonomous systems. The concept of meta-reasoning is ‘reason the reasoning’, which coincides with the intention and goal of explainable Al. The integration of the approaches could pave the way for future interpretable AI systems.
zh

[AI-60] A Multi-Agent Reinforcement Learning Approach for Cooperative Air-Ground-Human Crowdsensing in Emergency Rescue

【速读】:该论文旨在解决紧急救援场景下的异构实体协同感知任务分配(HECTA)问题,其核心挑战在于如何在复杂环境、有限通信和部分可观测条件下优化人类、无人机(UAV)和无人地面车辆(UGV)之间的任务分配,以最大化任务完成率(TCR)。解决方案的关键在于提出一种名为HECTA4ER的新型多智能体强化学习算法,该算法基于集中式训练与分布式执行架构,并通过专用模块进行复杂特征提取、利用隐藏状态记录动作-观测历史以及混合网络融合全局与局部信息,从而有效应对部分可观测性问题。

链接: https://arxiv.org/abs/2505.06997
作者: Wenhao Lu,Zhengqiu Zhu,Yong Zhao,Yonglin Tian,Junjie Zeng,Jun Zhang,Zhong Liu,Fei-Yue Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mobile crowdsensing is evolving beyond traditional human-centric models by integrating heterogeneous entities like unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs). Optimizing task allocation among these diverse agents is critical, particularly in challenging emergency rescue scenarios characterized by complex environments, limited communication, and partial observability. This paper tackles the Heterogeneous-Entity Collaborative-Sensing Task Allocation (HECTA) problem specifically for emergency rescue, considering humans, UAVs, and UGVs. We introduce a novel ``Hard-Cooperative’’ policy where UGVs prioritize recharging low-battery UAVs, alongside performing their sensing tasks. The primary objective is maximizing the task completion rate (TCR) under strict time constraints. We rigorously formulate this NP-hard problem as a decentralized partially observable Markov decision process (Dec-POMDP) to effectively handle sequential decision-making under uncertainty. To solve this, we propose HECTA4ER, a novel multi-agent reinforcement learning algorithm built upon a Centralized Training with Decentralized Execution architecture. HECTA4ER incorporates tailored designs, including specialized modules for complex feature extraction, utilization of action-observation history via hidden states, and a mixing network integrating global and local information, specifically addressing the challenges of partial observability. Furthermore, theoretical analysis confirms the algorithm’s convergence properties. Extensive simulations demonstrate that HECTA4ER significantly outperforms baseline algorithms, achieving an average 18.42% increase in TCR. Crucially, a real-world case study validates the algorithm’s effectiveness and robustness in dynamic sensing scenarios, highlighting its strong potential for practical application in emergency response.
zh

[AI-61] CAT Merging: A Training-Free Approach for Resolving Conflicts in Model Merging

【速读】:该论文试图解决多任务模型融合中因知识冲突导致的性能退化问题,现有方法通过累加任务向量(task vectors)进行模型融合,但任务向量中的冲突成分会降低整体性能。解决方案的关键在于提出一种无需训练的冲突感知任务融合框架(Conflict-Aware Task Merging, CAT Merging),该框架通过参数特定策略,如线性权重的投影和归一化层中缩放与偏移参数的掩码,选择性地修剪冲突敏感组件,从而有效抑制知识冲突。

链接: https://arxiv.org/abs/2505.06977
作者: Wenju Sun,Qingyong Li,Yangli-ao Geng,Boyang Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-task model merging offers a promising paradigm for integrating multiple expert models into a unified model without additional training. Existing state-of-the-art techniques, such as Task Arithmetic and its variants, merge models by accumulating task vectors – the parameter differences between pretrained and finetuned models. However, task vector accumulation is often hindered by knowledge conflicts, leading to performance degradation. To address this challenge, we propose Conflict-Aware Task Merging (CAT Merging), a novel training-free framework that selectively trims conflict-prone components from the task vectors. CAT Merging introduces several parameter-specific strategies, including projection for linear weights and masking for scaling and shifting parameters in normalization layers. Extensive experiments on vision, language, and vision-language tasks demonstrate that CAT Merging effectively suppresses knowledge conflicts, achieving average accuracy improvements of up to 2.5% (ViT-B/32) and 2.0% (ViT-L/14) over state-of-the-art methods.
zh

[AI-62] From Knowledge to Reasoning : Evaluating LLM s for Ionic Liquids Research in Chemical and Biological Engineering

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在化学与生物工程(Chemical and Biological Engineering, CBE)领域,尤其是离子液体(Ionic Liquids, ILs)用于碳捕集的应用中,其知识和推理能力评估不足的问题。解决方案的关键在于构建一个由专家精心整理的包含5,920个示例的数据集,该数据集针对ILs在碳捕集中的应用,通过调整语言和领域专业知识的不同难度维度,用于评估LLMs的推理能力。研究结果表明,尽管较小的通用LLMs具备一定的ILs知识,但在领域特定的推理能力上存在不足,因此需要针对性地优化LLMs以支持碳捕集研究。

链接: https://arxiv.org/abs/2505.06964
作者: Gaurab Sarkar,Sougata Saha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although Large Language Models (LLMs) have achieved remarkable performance in diverse general knowledge and reasoning tasks, their utility in the scientific domain of Chemical and Biological Engineering (CBE) is unclear. Hence, it necessitates challenging evaluation benchmarks that can measure LLM performance in knowledge- and reasoning-based tasks, which is lacking. As a foundational step, we empirically measure the reasoning capabilities of LLMs in CBE. We construct and share an expert-curated dataset of 5,920 examples for benchmarking LLMs’ reasoning capabilities in the niche domain of Ionic Liquids (ILs) for carbon sequestration, an emergent solution to reducing global warming. The dataset presents different difficulty levels by varying along the dimensions of linguistic and domain-specific knowledge. Benchmarking three less than 10B parameter open-source LLMs on the dataset suggests that while smaller general-purpose LLMs are knowledgeable about ILs, they lack domain-specific reasoning capabilities. Based on our results, we further discuss considerations for leveraging LLMs for carbon capture research using ILs. Since LLMs have a high carbon footprint, gearing them for IL research can symbiotically benefit both fields and help reach the ambitious carbon neutrality target by 2050. Dataset link: this https URL
zh

[AI-63] Causal knowledge graph analysis identifies adverse drug effects

【速读】:该论文试图解决知识图谱与结构因果模型之间缺乏整合的问题,即知识图谱虽能编码定性关系并支持演绎推理,但缺乏形式化的概率语义;而结构因果模型则无法有效整合知识图谱中的背景知识及演绎推理能力。解决方案的关键是引入一种新的因果知识图谱(Causal Knowledge Graphs, CKGs),通过在知识图谱中扩展形式化因果语义,保留其演绎能力的同时实现严谨的因果推断,从而支持去混杂和与编码及推导出的背景知识一致的假设生成。

链接: https://arxiv.org/abs/2505.06949
作者: Sumyyah Toonsi,Paul Schofield,Robert Hoehndorf
机构: 未知
类目: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:Knowledge graphs and structural causal models have each proven valuable for organizing biomedical knowledge and estimating causal effects, but remain largely disconnected: knowledge graphs encode qualitative relationships focusing on facts and deductive reasoning without formal probabilistic semantics, while causal models lack integration with background knowledge in knowledge graphs and have no access to the deductive reasoning capabilities that knowledge graphs provide. To bridge this gap, we introduce a novel formulation of Causal Knowledge Graphs (CKGs) which extend knowledge graphs with formal causal semantics, preserving their deductive capabilities while enabling principled causal inference. CKGs support deconfounding via explicitly marked causal edges and facilitate hypothesis formulation aligned with both encoded and entailed background knowledge. We constructed a Drug-Disease CKG (DD-CKG) integrating disease progression pathways, drug indications, side-effects, and hierarchical disease classification to enable automated large-scale mediation analysis. Applied to UK Biobank and MIMIC-IV cohorts, we tested whether drugs mediate effects between indications and downstream disease progression, adjusting for confounders inferred from the DD-CKG. Our approach successfully reproduced known adverse drug reactions with high precision while identifying previously undocumented significant candidate adverse effects. Further validation through side effect similarity analysis demonstrated that combining our predicted drug effects with established databases significantly improves the prediction of shared drug indications, supporting the clinical relevance of our novel findings. These results demonstrate that our methodology provides a generalizable, knowledge-driven framework for scalable causal inference.
zh

[AI-64] AI-Powered Inverse Design of Ku-Band SIW Resonant Structures by Iterative Residual Correction Network

【速读】:该论文旨在解决传统逆向电磁建模方法在设计复杂微波结构时存在的精度和泛化能力不足的问题。其解决方案的关键在于提出一种基于多模态谐振器的迭代残差校正网络(IRC-Net),该网络通过引入残差神经网络结构,克服了传统前馈逆模型(FIM)的局限性,从而提升了设计的预测精度和泛化能力。该方法首先利用FIM生成初始设计估计,随后通过受混合逆向-正向残差优化网络(HiFR²-Net)启发的迭代校正策略进行优化,最终实现了对多谐振结构的高精度逆向设计。

链接: https://arxiv.org/abs/2505.06936
作者: Mohammad Mashayekhi,Kamran Salehian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 14 figures

点击查看摘要

Abstract:Inverse electromagnetic modeling has emerged as a powerful approach for designing complex microwave structures with high accuracy and efficiency. In this study, we propose an Iterative Residual Correction Network (IRC-Net) for the inverse design of Ku-band Substrate Integrated Waveguide (SIW) components based on multimode resonators. We use a multimode resonance structure to demonstrate that it is possible to control the resonances of the structure. Therefore, these structures can be used for resonant components and smart filter design. The proposed deep learning architecture leverages residual neural networks to overcome the limitations of traditional inverse design techniques, such as the Feedforward Inverse Model (FIM), offering improved generalization and prediction accuracy. The approach begins with a FIM to generate initial design estimates, followed by an iterative correction strategy inspired by the Hybrid Inverse-Forward Residual Refinement Network (HiFR\textsuperscript2-Net), which we call IRC-Net. Experiments demonstrate that the IRC-Net achieves substantial improvements in prediction accuracy compared to traditional single-stage networks, validated through statistical metrics, full-wave electromagnetic simulations, and measurements. To validate the proposed framework, we first design and fabricate a three-resonance SIW structure. Next, we apply the trained IRC-Net model to predict the geometry of a four-resonance structure based on its desired frequency response. Both designs are fabricated and tested, showing strong agreement between the simulated, predicted, and measured results, confirming the effectiveness and practicality of the proposed method.
zh

[AI-65] RedTeamLLM : an Agent ic AI framework for offensive security

【速读】:该论文试图解决在安全工程中,如何利用代理式人工智能(Agentic AI)自动化渗透测试任务的问题,同时应对该技术可能被恶意行为者用于网络犯罪所带来的潜在威胁。解决方案的关键在于提出并评估RedTeamLLM,这是一个集成架构,具备全面的安全模型,其核心流程包括总结、推理与行动三个关键步骤,以嵌入其操作能力,并解决计划修正、内存管理、上下文窗口限制以及通用性与专业化之间的平衡等四个开放性挑战。

链接: https://arxiv.org/abs/2505.06913
作者: Brian Challita,Pierre Parrend
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:From automated intrusion testing to discovery of zero-day attacks before software launch, agentic AI calls for great promises in security engineering. This strong capability is bound with a similar threat: the security and research community must build up its models before the approach is leveraged by malicious actors for cybercrime. We therefore propose and evaluate RedTeamLLM, an integrated architecture with a comprehensive security model for automatization of pentest tasks. RedTeamLLM follows three key steps: summarizing, reasoning and act, which embed its operational capacity. This novel framework addresses four open challenges: plan correction, memory management, context window constraint, and generality vs. specialization. Evaluation is performed through the automated resolution of a range of entry-level, but not trivial, CTF challenges. The contribution of the reasoning capability of our agentic AI framework is specifically evaluated.
zh

[AI-66] MMiC: Mitigating Modality Incompleteness in Clustered Federated Learning KDD’2025

【速读】:该论文旨在解决多模态联邦学习(Multimodal Federated Learning, MFL)中模态不完整的问题,该问题通常由数据质量缺陷或客户端间的隐私政策导致。其解决方案的关键在于提出MMiC框架,通过替换集群内客户端模型中的部分参数来缓解模态缺失的影响,并利用Banzhaf Power Index优化客户端选择,同时采用马科维茨投资组合理论(Markovitz Portfolio Optimization)动态控制全局聚合过程,从而提升模型在全局和个性化性能上的表现。

链接: https://arxiv.org/abs/2505.06911
作者: Lishan Yang,Wei Zhang,Quan Z. Sheng,Weitong Chen,Lina Yao,Weitong Chen,Ali Shakeri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 10 figures, it’s KDD’2025 under reviewing

点击查看摘要

Abstract:In the era of big data, data mining has become indispensable for uncovering hidden patterns and insights from vast and complex datasets. The integration of multimodal data sources further enhances its potential. Multimodal Federated Learning (MFL) is a distributed approach that enhances the efficiency and quality of multimodal learning, ensuring collaborative work and privacy protection. However, missing modalities pose a significant challenge in MFL, often due to data quality issues or privacy policies across the clients. In this work, we present MMiC, a framework for Mitigating Modality incompleteness in MFL within the Clusters. MMiC replaces partial parameters within client models inside clusters to mitigate the impact of missing modalities. Furthermore, it leverages the Banzhaf Power Index to optimize client selection under these conditions. Finally, MMiC employs an innovative approach to dynamically control global aggregation by utilizing Markovitz Portfolio Optimization. Extensive experiments demonstrate that MMiC consistently outperforms existing federated learning architectures in both global and personalized performance on multimodal datasets with missing modalities, confirming the effectiveness of our proposed solution.
zh

[AI-67] Embodied Intelligence: The Key to Unblocking Generalized Artificial Intelligence

【速读】:该论文试图解决当前研究中对Embodied Artificial Intelligence (EAI)与Artificial General Intelligence (AGI)之间直接联系缺乏系统性概述的问题,以及现有文献多聚焦于特定技术或应用而未全面探讨其对AGI的贡献。解决方案的关键在于将EAI视为AGI的基础方法,系统分析其四个核心模块——感知、智能决策、行动和反馈,并探讨它们如何支持AGI的六个核心原则,从而揭示EAI在推动AGI发展中的关键作用。

链接: https://arxiv.org/abs/2505.06897
作者: Jinhao Jiang,Changlin Chen,Shile Feng,Wanru Geng,Zesheng Zhou,Ni Wang,Shuai Li,Feng-Qi Cui,Erbao Dong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19pages,7 figures,3 tables

点击查看摘要

Abstract:The ultimate goal of artificial intelligence (AI) is to achieve Artificial General Intelligence (AGI). Embodied Artificial Intelligence (EAI), which involves intelligent systems with physical presence and real-time interaction with the environment, has emerged as a key research direction in pursuit of AGI. While advancements in deep learning, reinforcement learning, large-scale language models, and multimodal technologies have significantly contributed to the progress of EAI, most existing reviews focus on specific technologies or applications. A systematic overview, particularly one that explores the direct connection between EAI and AGI, remains scarce. This paper examines EAI as a foundational approach to AGI, systematically analyzing its four core modules: perception, intelligent decision-making, action, and feedback. We provide a detailed discussion of how each module contributes to the six core principles of AGI. Additionally, we discuss future trends, challenges, and research directions in EAI, emphasizing its potential as a cornerstone for AGI development. Our findings suggest that EAI’s integration of dynamic learning and real-world interaction is essential for bridging the gap between narrow AI and AGI.
zh

[AI-68] FACET: Force-Adaptive Control via Impedance Reference Tracking for Legged Robots

【速读】:该论文旨在解决基于位置或速度跟踪的强化学习(Reinforcement Learning, RL)在腿部机器人控制中对力感知不足的问题,这一缺陷导致机器人行为僵硬、潜在危险且在强力交互时控制性能较差。其解决方案的关键在于提出一种基于阻抗参考跟踪的力自适应控制方法(Force-Adaptive Control via Impedance Reference Tracking, FACET),通过RL训练控制策略以模仿虚拟质量-弹簧-阻尼系统,从而在外部力作用下实现细粒度控制,提升机器人的鲁棒性和可控柔顺性。

链接: https://arxiv.org/abs/2505.06883
作者: Botian Xu,Haoyang Weng,Qingzhou Lu,Yang Gao,Huazhe Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has made significant strides in legged robot control, enabling locomotion across diverse terrains and complex loco-manipulation capabilities. However, the commonly used position or velocity tracking-based objectives are agnostic to forces experienced by the robot, leading to stiff and potentially dangerous behaviors and poor control during forceful interactions. To address this limitation, we present \emphForce-Adaptive Control via Impedance Reference Tracking (FACET). Inspired by impedance control, we use RL to train a control policy to imitate a virtual mass-spring-damper system, allowing fine-grained control under external forces by manipulating the virtual spring. In simulation, we demonstrate that our quadruped robot achieves improved robustness to large impulses (up to 200 Ns) and exhibits controllable compliance, achieving an 80% reduction in collision impulse. The policy is deployed to a physical robot to showcase both compliance and the ability to engage with large forces by kinesthetic control and pulling payloads up to 2/3 of its weight. Further extension to a legged loco-manipulator and a humanoid shows the applicability of our method to more complex settings to enable whole-body compliance control. Project Website: this https URL
zh

[AI-69] Enhancing Time Series Forecasting via a Parallel Hybridization of ARIMA and Polynomial Classifiers

【速读】:该论文试图解决时间序列预测中如何有效结合线性与非线性模型以提升预测精度的问题。其解决方案的关键在于提出一种混合方法,将自回归积分滑动平均(ARIMA)模型与多项式分类器相结合,从而充分利用两者在建模时间依赖性和捕捉非线性关系方面的互补优势。

链接: https://arxiv.org/abs/2505.06874
作者: Thanh Son Nguyen,Van Thanh Nguyen,Dang Minh Duc Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time series forecasting has attracted significant attention, leading to the de-velopment of a wide range of approaches, from traditional statistical meth-ods to advanced deep learning models. Among them, the Auto-Regressive Integrated Moving Average (ARIMA) model remains a widely adopted linear technique due to its effectiveness in modeling temporal dependencies in economic, industrial, and social data. On the other hand, polynomial classifi-ers offer a robust framework for capturing non-linear relationships and have demonstrated competitive performance in domains such as stock price pre-diction. In this study, we propose a hybrid forecasting approach that inte-grates the ARIMA model with a polynomial classifier to leverage the com-plementary strengths of both models. The hybrid method is evaluated on multiple real-world time series datasets spanning diverse domains. Perfor-mance is assessed based on forecasting accuracy and computational effi-ciency. Experimental results reveal that the proposed hybrid model consist-ently outperforms the individual models in terms of prediction accuracy, al-beit with a modest increase in execution time.
zh

[AI-70] DP-TRAE: A Dual-Phase Merging Transferable Reversible Adversarial Example for Image Privacy Protection

【速读】:该论文试图解决现有可逆对抗样本(Reversible Adversarial Examples, RAE)技术在黑盒场景下有效性不足的问题,以及传统黑盒攻击方法转移性差和查询成本高的问题。解决方案的关键在于提出一种双阶段融合可迁移的可逆攻击方法,该方法在白盒模型中生成高可迁移性的初始对抗扰动,并采用记忆增强的黑盒策略有效误导目标模型。

链接: https://arxiv.org/abs/2505.06860
作者: Xia Du,Jiajie Zhu,Jizhe Zhou,Chi-man Pun,Zheng Lin,Cong Wu,Zhe Chen,Jun Luo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:In the field of digital security, Reversible Adversarial Examples (RAE) combine adversarial attacks with reversible data hiding techniques to effectively protect sensitive data and prevent unauthorized analysis by malicious Deep Neural Networks (DNNs). However, existing RAE techniques primarily focus on white-box attacks, lacking a comprehensive evaluation of their effectiveness in black-box scenarios. This limitation impedes their broader deployment in complex, dynamic environments. Further more, traditional black-box attacks are often characterized by poor transferability and high query costs, significantly limiting their practical applicability. To address these challenges, we propose the Dual-Phase Merging Transferable Reversible Attack method, which generates highly transferable initial adversarial perturbations in a white-box model and employs a memory augmented black-box strategy to effectively mislead target mod els. Experimental results demonstrate the superiority of our approach, achieving a 99.0% attack success rate and 100% recovery rate in black-box scenarios, highlighting its robustness in privacy protection. Moreover, we successfully implemented a black-box attack on a commercial model, further substantiating the potential of this approach for practical use.
zh

[AI-71] Beyond Patterns: Harnessing Causal Logic for Autonomous Driving Trajectory Prediction

【速读】:该论文旨在解决自动驾驶(AD)中轨迹预测的准确性问题,传统数据驱动模型主要依赖统计相关性,而忽视了交通行为背后的因果关系。其解决方案的关键在于引入因果推理,通过将环境分解为空间和时间组件,识别并缓解虚假相关性,从而揭示真实的因果关系,并采用渐进式融合策略整合多模态信息,模拟人类推理过程,提升预测的鲁棒性、泛化能力和准确性。

链接: https://arxiv.org/abs/2505.06856
作者: Bonan Wang,Haicheng Liao,Chengyue Wang,Bin Rao,Yanchen Guan,Guyang Yu,Jiaxun Zhang,Songning Lai,Chengzhong Xu,Zhenning Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Accurate trajectory prediction has long been a major challenge for autonomous driving (AD). Traditional data-driven models predominantly rely on statistical correlations, often overlooking the causal relationships that govern traffic behavior. In this paper, we introduce a novel trajectory prediction framework that leverages causal inference to enhance predictive robustness, generalization, and accuracy. By decomposing the environment into spatial and temporal components, our approach identifies and mitigates spurious correlations, uncovering genuine causal relationships. We also employ a progressive fusion strategy to integrate multimodal information, simulating human-like reasoning processes and enabling real-time inference. Evaluations on five real-world datasets–ApolloScape, nuScenes, NGSIM, HighD, and MoCAD–demonstrate our model’s superiority over existing state-of-the-art (SOTA) methods, with improvements in key metrics such as RMSE and FDE. Our findings highlight the potential of causal reasoning to transform trajectory prediction, paving the way for robust AD systems.
zh

[AI-72] Optimizing Recommendations using Fine-Tuned LLM s DATE

【速读】:该论文试图解决传统数字媒体平台在个性化和直观化电影与媒体推荐方面存在的局限性,这些问题主要源于基于关键词的搜索和推荐技术无法充分捕捉用户的复杂偏好。解决方案的关键在于通过建模真实用户交互生成合成数据集,从而创建反映多样化偏好的复杂对话式数据,使用户能够通过表达如情绪、情节细节和主题元素等更丰富的信息进行查询,进而提升推荐系统的个性化和准确性。

链接: https://arxiv.org/abs/2505.06841
作者: Prabhdeep Cheema,Erhan Guven
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted and presented at IEEE CAI 2025. This version includes minor clarifications and formatting updates

点击查看摘要

Abstract:As digital media platforms strive to meet evolving user expectations, delivering highly personalized and intuitive movies and media recommendations has become essential for attracting and retaining audiences. Traditional systems often rely on keyword-based search and recommendation techniques, which limit users to specific keywords and a combination of keywords. This paper proposes an approach that generates synthetic datasets by modeling real-world user interactions, creating complex chat-style data reflective of diverse preferences. This allows users to express more information with complex preferences, such as mood, plot details, and thematic elements, in addition to conventional criteria like genre, title, and actor-based searches. In today’s search space, users cannot write queries like ``Looking for a fantasy movie featuring dire wolves, ideally set in a harsh frozen world with themes of loyalty and survival.‘’ Building on these contributions, we evaluate synthetic datasets for diversity and effectiveness in training and benchmarking models, particularly in areas often absent from traditional datasets. This approach enhances personalization and accuracy by enabling expressive and natural user queries. It establishes a foundation for the next generation of conversational AI-driven search and recommendation systems in digital entertainment. Comments: Accepted and presented at IEEE CAI 2025. This version includes minor clarifications and formatting updates Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2505.06841 [cs.IR] (or arXiv:2505.06841v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2505.06841 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-73] he power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts

【速读】:该论文试图解决如何通过调整混合专家(Mixture-of-Experts, MoE)架构中每个层激活的专家数量(即粒度)来优化模型的表达能力与计算效率之间的平衡问题。解决方案的关键在于证明了粒度这一设计参数对网络表达能力具有指数级的影响,表明增加激活的专家数量可以显著提升模型的表达能力,实验结果也验证了这一理论分析。

链接: https://arxiv.org/abs/2505.06839
作者: Enric Boix-Adsera,Philippe Rigollet
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) layers are increasingly central to frontier model architectures. By selectively activating parameters, they reduce computational cost while scaling total parameter count. This paper investigates the impact of the number of active experts, termed granularity, comparing architectures with many (e.g., 8 per layer in DeepSeek) to those with fewer (e.g., 1 per layer in Llama-4 models). We prove an exponential separation in network expressivity based on this design parameter, suggesting that models benefit from higher granularity. Experimental results corroborate our theoretical findings and illustrate this separation.
zh

[AI-74] Sandcastles in the Storm: Revisiting the (Im)possibility of Strong Watermarking ACL2025

【速读】:该论文试图解决生成式 AI (Generative AI) 生成文本的水印技术在面对随机游走攻击时的脆弱性问题。研究指出,尽管理论模型认为水印可以被轻易擦除,但通过大规模实验和人工验证,发现实际中水印的混合过程缓慢,且质量检测工具存在较高误判率,导致攻击效果不佳。解决方案的关键在于揭示了实际场景中水印的鲁棒性远高于理论模型所预测,其核心在于水印混合速度慢和质量控制不完善这两个现实障碍。

链接: https://arxiv.org/abs/2505.06827
作者: Fabrice Y Harel-Canada,Boran Erol,Connor Choi,Jason Liu,Gary Jiarui Song,Nanyun Peng,Amit Sahai
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: In Review @ ACL 2025

点击查看摘要

Abstract:Watermarking AI-generated text is critical for combating misuse. Yet recent theoretical work argues that any watermark can be erased via random walk attacks that perturb text while preserving quality. However, such attacks rely on two key assumptions: (1) rapid mixing (watermarks dissolve quickly under perturbations) and (2) reliable quality preservation (automated quality oracles perfectly guide edits). Through large-scale experiments and human-validated assessments, we find mixing is slow: 100% of perturbed texts retain traces of their origin after hundreds of edits, defying rapid mixing. Oracles falter, as state-of-the-art quality detectors misjudge edits (77% accuracy), compounding errors during attacks. Ultimately, attacks underperform: automated walks remove watermarks just 26% of the time – dropping to 10% under human quality review. These findings challenge the inevitability of watermark removal. Instead, practical barriers – slow mixing and imperfect quality control – reveal watermarking to be far more robust than theoretical models suggest. The gap between idealized attacks and real-world feasibility underscores the need for stronger watermarking methods and more realistic attack models.
zh

[AI-75] hreatLens: LLM -guided Threat Modeling and Test Plan Generation for Hardware Security Verification

【速读】:该论文试图解决当前硬件安全验证过程中依赖人工威胁建模和测试计划生成所导致的劳动强度大、易出错以及难以应对设计复杂度提升和攻击方法演变的问题。解决方案的关键在于提出ThreatLens,一个基于大语言模型(Large Language Model, LLM)的多智能体框架,通过检索增强生成(Retrieval-Augmented Generation, RAG)提取相关安全知识,结合LLM驱动的推理进行威胁评估,并利用交互式用户反馈确保生成的测试计划具有实用性,从而实现硬件安全威胁建模与测试计划生成的自动化。

链接: https://arxiv.org/abs/2505.06821
作者: Dipayan Saha,Hasan Al Shaikh,Shams Tarek,Farimah Farahmandi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: This paper has been presented at IEEE VLSI Test Symposium (VTS) 2025

点击查看摘要

Abstract:Current hardware security verification processes predominantly rely on manual threat modeling and test plan generation, which are labor-intensive, error-prone, and struggle to scale with increasing design complexity and evolving attack methodologies. To address these challenges, we propose ThreatLens, an LLM-driven multi-agent framework that automates security threat modeling and test plan generation for hardware security verification. ThreatLens integrates retrieval-augmented generation (RAG) to extract relevant security knowledge, LLM-powered reasoning for threat assessment, and interactive user feedback to ensure the generation of practical test plans. By automating these processes, the framework reduces the manual verification effort, enhances coverage, and ensures a structured, adaptable approach to security verification. We evaluated our framework on the NEORV32 SoC, demonstrating its capability to automate security verification through structured test plans and validating its effectiveness in real-world scenarios.
zh

[AI-76] Control Plane as a Tool: A Scalable Design Pattern for Agent ic AI Systems

【速读】:该论文试图解决当前基于代理的AI系统(Agentic AI systems)在大规模工具编排(tool orchestration)方面的不足,其核心问题是现有系统在架构和基础设施层面缺乏有效的管理机制。论文提出的解决方案的关键在于引入“控制平面作为工具”(Control Plane as a Tool)的设计抽象模式,该模式允许开发者向代理暴露统一的工具接口,同时将模块化的工具路由逻辑封装在其后,从而提升系统的可扩展性、安全性和可扩展性。

链接: https://arxiv.org/abs/2505.06817
作者: Sivasathivel Kandasamy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 2 Figures and 2 Tables

点击查看摘要

Abstract:Agentic AI systems represent a new frontier in artificial intelligence, where agents often based on large language models(LLMs) interact with tools, environments, and other agents to accomplish tasks with a degree of autonomy. These systems show promise across a range of domains, but their architectural underpinnings remain immature. This paper conducts a comprehensive review of the types of agents, their modes of interaction with the environment, and the infrastructural and architectural challenges that emerge. We identify a gap in how these systems manage tool orchestration at scale and propose a reusable design abstraction: the “Control Plane as a Tool” pattern. This pattern allows developers to expose a single tool interface to an agent while encapsulating modular tool routing logic behind it. We position this pattern within the broader context of agent design and argue that it addresses several key challenges in scaling, safety, and extensibility.
zh

[AI-77] Decoding Futures Price Dynamics: A Regularized Sparse Autoencoder for Interpretable Multi-Horizon Forecasting and Factor Discovery

【速读】:该论文旨在解决商品价格波动带来的经济挑战,特别是多时间尺度下商品价格预测的准确性问题。传统模型在处理铜和原油等商品价格预测时面临复杂且相互作用的因素(如宏观经济、供需关系、地缘政治等)所带来的困难,同时缺乏透明度,限制了其战略应用。论文提出的解决方案是基于正则化稀疏自编码器(Regularized Sparse Autoencoder, RSAE)的深度学习框架,其关键在于通过在潜在向量上引入L1正则化(|\mathbf{z}|_1)以强制稀疏性,从而实现对市场动态的简洁解释,并提取可解释的潜在驱动因素(如需求、供应冲击等)。该方法在保持预测精度的同时,提供了数据驱动的价格动态洞察,相较于传统黑箱模型具有显著优势。

链接: https://arxiv.org/abs/2505.06795
作者: Abhijit Gupta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Commodity price volatility creates economic challenges, necessitating accurate multi-horizon forecasting. Predicting prices for commodities like copper and crude oil is complicated by diverse interacting factors (macroeconomic, supply/demand, geopolitical, etc.). Current models often lack transparency, limiting strategic use. This paper presents a Regularized Sparse Autoencoder (RSAE), a deep learning framework for simultaneous multi-horizon commodity price prediction and discovery of interpretable latent market drivers. The RSAE forecasts prices at multiple horizons (e.g., 1-day, 1-week, 1-month) using multivariate time series. Crucially, L1 regularization ( |\mathbfz|_1 ) on its latent vector \mathbfz enforces sparsity, promoting parsimonious explanations of market dynamics through learned factors representing underlying drivers (e.g., demand, supply shocks). Drawing from energy-based models and sparse coding, the RSAE optimizes predictive accuracy while learning sparse representations. Evaluated on historical Copper and Crude Oil data with numerous indicators, our findings indicate the RSAE offers competitive multi-horizon forecasting accuracy and data-driven insights into price dynamics via its interpretable latent space, a key advantage over traditional black-box approaches.
zh

[AI-78] Value Iteration with Guessing for Markov Chains and Markov Decision Processes

【速读】:该论文试图解决在概率系统模型(如马尔可夫链 (Markov chains, MCs) 和马尔可夫决策过程 (Markov decision processes, MDPs))中,经典值迭代 (Value Iteration, VI) 算法在最坏情况下需要指数级Bellman更新的问题。其核心问题是能否通过多项式时间的预处理步骤,使VI在后续过程中仅需次指数级的Bellman更新。该论文提出的解决方案关键在于基于猜测值的新方法,该方法在MCs上实现了几乎线性时间的预处理,并结合猜测值显著减少了Bellman更新次数;同时,针对MDPs,提出了改进的收敛速度分析及一种实用算法,实验结果表明该方法在多个基准测试案例中优于现有VI方法。

链接: https://arxiv.org/abs/2505.06769
作者: Krishnendu Chatterjee,Mahdi JafariRaviz,Raimundo Saona,Jakub Svoboda
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
备注: Appeared in the 31st International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2025)

点击查看摘要

Abstract:Two standard models for probabilistic systems are Markov chains (MCs) and Markov decision processes (MDPs). Classic objectives for such probabilistic models for control and planning problems are reachability and stochastic shortest path. The widely studied algorithmic approach for these problems is the Value Iteration (VI) algorithm which iteratively applies local updates called Bellman updates. There are many practical approaches for VI in the literature but they all require exponentially many Bellman updates for MCs in the worst case. A preprocessing step is an algorithm that is discrete, graph-theoretical, and requires linear space. An important open question is whether, after a polynomial-time preprocessing, VI can be achieved with sub-exponentially many Bellman updates. In this work, we present a new approach for VI based on guessing values. Our theoretical contributions are twofold. First, for MCs, we present an almost-linear-time preprocessing algorithm after which, along with guessing values, VI requires only subexponentially many Bellman updates. Second, we present an improved analysis of the speed of convergence of VI for MDPs. Finally, we present a practical algorithm for MDPs based on our new approach. Experimental results show that our approach provides a considerable improvement over existing VI-based approaches on several benchmark examples from the literature.
zh

[AI-79] PK: Trustworthy Trajectory Prediction Integrating Prior Knowledge For Interpretability and Kinematic Feasibility

【速读】:该论文旨在解决轨迹预测中模型预测结果缺乏可信度的问题,特别是在物理可行性与人类逻辑性方面存在不足。其关键解决方案是引入适用于所有交通参与者(包括车辆、行人和自行车)的交互与运动学先验,并通过具有类别特异性交互层的结构来捕捉不同行为特征。此外,提出了一种基于规则的交互重要性评分DG-SFM以提升交互过程的可解释性,并设计了适用于所有类别的运动学模型,特别是创新性的行人运动学模型,以确保预测结果的物理可行性。

链接: https://arxiv.org/abs/2505.06743
作者: Marius Baden,Ahmed Abouelazm,Christian Hubschneider,Yin Wu,Daniel Slieter,J. Marius Zöllner
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted in the 36th IEEE Intelligent Vehicles Symposium (IV 2025) for oral presentation

点击查看摘要

Abstract:Trajectory prediction is crucial for autonomous driving, enabling vehicles to navigate safely by anticipating the movements of surrounding road users. However, current deep learning models often lack trustworthiness as their predictions can be physically infeasible and illogical to humans. To make predictions more trustworthy, recent research has incorporated prior knowledge, like the social force model for modeling interactions and kinematic models for physical realism. However, these approaches focus on priors that suit either vehicles or pedestrians and do not generalize to traffic with mixed agent classes. We propose incorporating interaction and kinematic priors of all agent classes–vehicles, pedestrians, and cyclists with class-specific interaction layers to capture agent behavioral differences. To improve the interpretability of the agent interactions, we introduce DG-SFM, a rule-based interaction importance score that guides the interaction layer. To ensure physically feasible predictions, we proposed suitable kinematic models for all agent classes with a novel pedestrian kinematic model. We benchmark our approach on the Argoverse 2 dataset, using the state-of-the-art transformer HPTR as our baseline. Experiments demonstrate that our method improves interaction interpretability, revealing a correlation between incorrect predictions and divergence from our interaction prior. Even though incorporating the kinematic models causes a slight decrease in accuracy, they eliminate infeasible trajectories found in the dataset and the baseline model. Thus, our approach fosters trust in trajectory prediction as its interaction reasoning is interpretable, and its predictions adhere to physics.
zh

[AI-80] Boundary-Guided Trajectory Prediction for Road Aware and Physically Feasible Autonomous Driving

【速读】:该论文试图解决自动驾驶中周围道路使用者轨迹预测的准确性问题,特别是如何防止车辆偏离道路以及确保运动学可行性。现有方法虽然引入了道路感知模块和运动学约束,但缺乏合理性保证,并且在复杂度和灵活性之间存在权衡。解决方案的关键在于提出一种新的框架,将轨迹预测建模为由允许行驶方向及其边界引导的约束回归问题,通过结合智能体当前状态和高精度地图定义有效边界,并训练网络学习左右边界多边形之间的叠加路径,同时预测符合运动学约束的加速度剖面以确保可行性。

链接: https://arxiv.org/abs/2505.06740
作者: Ahmed Abouelazm,Mianzhi Liu,Christian Hubschneider,Yin Wu,Daniel Slieter,J. Marius Zöllner
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted in the 36th IEEE Intelligent Vehicles Symposium (IV 2025)

点击查看摘要

Abstract:Accurate prediction of surrounding road users’ trajectories is essential for safe and efficient autonomous driving. While deep learning models have improved performance, challenges remain in preventing off-road predictions and ensuring kinematic feasibility. Existing methods incorporate road-awareness modules and enforce kinematic constraints but lack plausibility guarantees and often introduce trade-offs in complexity and flexibility. This paper proposes a novel framework that formulates trajectory prediction as a constrained regression guided by permissible driving directions and their boundaries. Using the agent’s current state and an HD map, our approach defines the valid boundaries and ensures on-road predictions by training the network to learn superimposed paths between left and right boundary polylines. To guarantee feasibility, the model predicts acceleration profiles that determine the vehicle’s travel distance along these paths while adhering to kinematic constraints. We evaluate our approach on the Argoverse-2 dataset against the HPTR baseline. Our approach shows a slight decrease in benchmark metrics compared to HPTR but notably improves final displacement error and eliminates infeasible trajectories. Moreover, the proposed approach has superior generalization to less prevalent maneuvers and unseen out-of-distribution scenarios, reducing the off-road rate under adversarial attacks from 66% to just 1%. These results highlight the effectiveness of our approach in generating feasible and robust predictions.
zh

[AI-81] Balancing Progress and Safety: A Novel Risk-Aware Objective for RL in Autonomous Driving

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)在自动驾驶中因奖励函数设计不足而导致的安全性与决策能力受限的问题。传统奖励函数往往仅将安全视为碰撞的惩罚,忽视了碰撞前行为的风险,从而限制了RL在真实场景中的应用。解决方案的关键在于通过定义一组分层的驾驶目标,并以归一化方式结构化其对总体奖励的贡献,同时引入一种基于二维椭圆函数和责任敏感安全(Responsibility-Sensitive Safety, RSS)概念的新型风险感知目标,以提升奖励函数的安全性和有效性。

链接: https://arxiv.org/abs/2505.06737
作者: Ahmed Abouelazm,Jonas Michel,Helen Gremmelmaier,Tim Joseph,Philip Schörner,J. Marius Zöllner
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted in the 36th IEEE Intelligent vehicles Symposium (IV 2025)

点击查看摘要

Abstract:Reinforcement Learning (RL) is a promising approach for achieving autonomous driving due to robust decision-making capabilities. RL learns a driving policy through trial and error in traffic scenarios, guided by a reward function that combines the driving objectives. The design of such reward function has received insufficient attention, yielding ill-defined rewards with various pitfalls. Safety, in particular, has long been regarded only as a penalty for collisions. This leaves the risks associated with actions leading up to a collision unaddressed, limiting the applicability of RL in real-world scenarios. To address these shortcomings, our work focuses on enhancing the reward formulation by defining a set of driving objectives and structuring them hierarchically. Furthermore, we discuss the formulation of these objectives in a normalized manner to transparently determine their contribution to the overall reward. Additionally, we introduce a novel risk-aware objective for various driving interactions based on a two-dimensional ellipsoid function and an extension of Responsibility-Sensitive Safety (RSS) concepts. We evaluate the efficacy of our proposed reward in unsignalized intersection scenarios with varying traffic densities. The approach decreases collision rates by 21% on average compared to baseline rewards and consistently surpasses them in route progress and cumulative reward, demonstrating its capability to promote safer driving behaviors while maintaining high-performance levels.
zh

[AI-82] Deeply Explainable Artificial Neural Network

【速读】:该论文试图解决深度学习模型在关键领域(如医学图像分析)中由于其黑箱特性导致的可解释性不足问题。现有解释方法如SHAP、LIME和Grad-CAM通常为事后应用,存在计算开销大且结果可能不一致或模糊的缺陷。解决方案的关键在于提出一种名为DxANN(Deeply Explainable Artificial Neural Network)的新型深度学习架构,该架构将可解释性嵌入训练过程中,而非依赖外部解释方法,能够在前向传播中生成每个样本和每个特征的解释,从而实现准确预测与透明决策的结合。

链接: https://arxiv.org/abs/2505.06731
作者: David Zucker
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While deep learning models have demonstrated remarkable success in numerous domains, their black-box nature remains a significant limitation, especially in critical fields such as medical image analysis and inference. Existing explainability methods, such as SHAP, LIME, and Grad-CAM, are typically applied post hoc, adding computational overhead and sometimes producing inconsistent or ambiguous results. In this paper, we present the Deeply Explainable Artificial Neural Network (DxANN), a novel deep learning architecture that embeds explainability ante hoc, directly into the training process. Unlike conventional models that require external interpretation methods, DxANN is designed to produce per-sample, per-feature explanations as part of the forward pass. Built on a flow-based framework, it enables both accurate predictions and transparent decision-making, and is particularly well-suited for image-based tasks. While our focus is on medical imaging, the DxANN architecture is readily adaptable to other data modalities, including tabular and sequential data. DxANN marks a step forward toward intrinsically interpretable deep learning, offering a practical solution for applications where trust and accountability are essential.
zh

[AI-83] Bi-level Mean Field: Dynamic Grouping for Large-Scale MARL

【速读】:该论文旨在解决大规模多智能体强化学习(Large-scale Multi-Agent Reinforcement Learning, MARL)中由于智能体交互数量指数级增长导致的维度灾难问题,该问题显著增加了计算复杂度并降低了学习效率。现有基于平均场(Mean Field, MF)的方法通过将相邻智能体近似为一个单一的平均智能体来简化交互结构,但这种方法忽略了个体差异,导致在MF学习过程中因不准确的迭代更新产生聚合噪声。论文提出的双层平均场(Bi-level Mean Field, BMF)方法的关键在于通过动态分组机制捕捉智能体多样性,并利用双层交互机制缓解聚合噪声,具体包括引入基于变分自编码器(Variational AutoEncoder, VAE)的动态分组模块以学习智能体表示,以及设计双层交互模块以建模组间和组内交互,从而实现更有效的邻近聚合。

链接: https://arxiv.org/abs/2505.06706
作者: Yuxuan Zheng,Yihe Zhou,Feiyang Xu,Mingli Song,Shunyu Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large-scale Multi-Agent Reinforcement Learning (MARL) often suffers from the curse of dimensionality, as the exponential growth in agent interactions significantly increases computational complexity and impedes learning efficiency. To mitigate this, existing efforts that rely on Mean Field (MF) simplify the interaction landscape by approximating neighboring agents as a single mean agent, thus reducing overall complexity to pairwise interactions. However, these MF methods inevitably fail to account for individual differences, leading to aggregation noise caused by inaccurate iterative updates during MF learning. In this paper, we propose a Bi-level Mean Field (BMF) method to capture agent diversity with dynamic grouping in large-scale MARL, which can alleviate aggregation noise via bi-level interaction. Specifically, BMF introduces a dynamic group assignment module, which employs a Variational AutoEncoder (VAE) to learn the representations of agents, facilitating their dynamic grouping over time. Furthermore, we propose a bi-level interaction module to model both inter- and intra-group interactions for effective neighboring aggregation. Experiments across various tasks demonstrate that the proposed BMF yields results superior to the state-of-the-art methods. Our code will be made publicly available.
zh

[AI-84] A Survey on Data-Driven Modeling of Human Drivers Lane-Changing Decisions

【速读】:该论文试图解决传统分析型车道变换(Lane-changing, LC)决策(LCD)模型在复杂交通环境中对驾驶行为异质性和交互关系的简化问题,这些问题限制了其对真实LCD行为的捕捉能力。解决方案的关键在于采用数据驱动的方法,利用丰富的实证数据和机器学习技术,以解码潜在的决策模式,从而实现动态环境下的自适应LCD建模。

链接: https://arxiv.org/abs/2505.06680
作者: Linxuan Huang,Dong-Fan Xie,Li Li,Zhengbing He
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Systems and Control (eess.SY); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:Lane-changing (LC) behavior, a critical yet complex driving maneuver, significantly influences driving safety and traffic dynamics. Traditional analytical LC decision (LCD) models, while effective in specific environments, often oversimplify behavioral heterogeneity and complex interactions, limiting their capacity to capture real LCD. Data-driven approaches address these gaps by leveraging rich empirical data and machine learning to decode latent decision-making patterns, enabling adaptive LCD modeling in dynamic environments. In light of the rapid development of artificial intelligence and the demand for data-driven models oriented towards connected vehicles and autonomous vehicles, this paper presents a comprehensive survey of data-driven LCD models, with a particular focus on human drivers LC decision-making. It systematically reviews the modeling framework, covering data sources and preprocessing, model inputs and outputs, objectives, structures, and validation methods. This survey further discusses the opportunities and challenges faced by data-driven LCD models, including driving safety, uncertainty, as well as the integration and improvement of technical frameworks.
zh

[AI-85] Enfoque Odychess: Un método dialéctico constructivista y adaptativo para la enseñanza del ajedrez con inteligencias artificiales generativas

【速读】:该论文试图解决传统国际象棋教学方法在提升学生棋艺知识、战略理解和元认知技能方面的局限性,其解决方案的关键在于引入基于生成式AI(Generative AI)的Odychess方法。该方法通过适配Llama 3.3语言模型并采用参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)技术,构建了一个苏格拉底式的国际象棋导师系统,从而实现了教学效果的显著提升。

链接: https://arxiv.org/abs/2505.06652
作者: Ernesto Giralt Hernandez,Lazaro Antonio Bueno Perez
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Full article in Spanish

点击查看摘要

Abstract:Chess teaching has evolved through different approaches, however, traditional methodologies, often based on memorization, contrast with the new possibilities offered by generative artificial intelligence, a technology still little explored in this field. This study seeks to empirically validate the effectiveness of the Odychess Approach in improving chess knowledge, strategic understanding, and metacognitive skills in students. A quasi-experimental study was conducted with a pre-test/post-test design and a control group (N=60). The experimental intervention implemented the Odychess Approach, incorporating a Llama 3.3 language model that was specifically adapted using Parameter-Efficient Fine-Tuning (PEFT) techniques to act as a Socratic chess tutor. Quantitative assessment instruments were used to measure chess knowledge, strategic understanding, and metacognitive skills before and after the intervention. The results of the quasi-experimental study showed significant improvements in the experimental group compared to the control group in the three variables analyzed: chess knowledge, strategic understanding, and metacognitive skills. The complementary qualitative analysis revealed greater analytical depth, more developed dialectical reasoning, and increased intrinsic motivation in students who participated in the Odychess method-based intervention. The Odychess Approach represents an effective pedagogical methodology for teaching chess, demonstrating the potential of the synergistic integration of constructivist and dialectical principles with generative artificial intelligence. The implications of this work are relevant for educators and institutions interested in adopting innovative pedagogical technologies and for researchers in the field of AI applied to education, highlighting the transferability of the language model adaptation methodology to other educational domains.
zh

[AI-86] Dyn-D2P: Dynamic Differentially Private Decentralized Learning with Provable Utility Guarantee IJCAI2025

【速读】:该论文旨在解决现有去中心化学习方法在引入差分隐私(Differential Privacy, DP)时因固定梯度截断边界和固定水平的高斯噪声导致模型精度显著下降的问题。其解决方案的关键在于提出一种动态差分隐私去中心化学习方法(Dyn-D²P),通过利用高斯差分隐私(Gaussian DP, GDP)框架,根据梯度收敛情况动态调整梯度截断边界和噪声水平,从而在保持总隐私预算的前提下提升模型精度。

链接: https://arxiv.org/abs/2505.06651
作者: Zehan Zhu,Yan Huang,Xin Wang,Shouling Ji,Jinming Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been accepted by the 34th International Joint Conference on Artificial Intelligence(IJCAI 2025)

点击查看摘要

Abstract:Most existing decentralized learning methods with differential privacy (DP) guarantee rely on constant gradient clipping bounds and fixed-level DP Gaussian noises for each node throughout the training process, leading to a significant accuracy degradation compared to non-private counterparts. In this paper, we propose a new Dynamic Differentially Private Decentralized learning approach (termed Dyn-D ^2 P) tailored for general time-varying directed networks. Leveraging the Gaussian DP (GDP) framework for privacy accounting, Dyn-D ^2 P dynamically adjusts gradient clipping bounds and noise levels based on gradient convergence. This proposed dynamic noise strategy enables us to enhance model accuracy while preserving the total privacy budget. Extensive experiments on benchmark datasets demonstrate the superiority of Dyn-D ^2 P over its counterparts employing fixed-level noises, especially under strong privacy guarantees. Furthermore, we provide a provable utility bound for Dyn-D ^2 P that establishes an explicit dependency on network-related parameters, with a scaling factor of 1/\sqrtn in terms of the number of nodes n up to a bias error term induced by gradient clipping. To our knowledge, this is the first model utility analysis for differentially private decentralized non-convex optimization with dynamic gradient clipping bounds and noise levels.
zh

[AI-87] Exploring Multimodal Foundation AI and Expert-in-the-Loop for Sustainable Management of Wild Salmon Fisheries in Indigenous Rivers IJCAI2025

【速读】:该论文旨在解决北太平洋沿岸野生大麻哈鱼(wild salmon)监测与可持续渔业管理中的挑战,这些问题主要源于气候变化、栖息地丧失以及缺乏基础设施支持的偏远生态系统中的数据限制。论文提出的解决方案关键在于整合多模态基础人工智能(multimodal foundation AI)与专家在环(expert-in-the-loop)框架,通过视频和声呐监测技术开发AI驱动的自动化物种识别、计数和体长测量工具,从而减少人工工作量、加快结果交付并提高决策准确性。同时,结合专家验证和主动学习框架以确保生态相关性并降低标注负担。

链接: https://arxiv.org/abs/2505.06637
作者: Chi Xu,Yili Jin,Sami Ma,Rongsheng Qian,Hao Fang,Jiangchuan Liu,Xue Liu,Edith C.H. Ngai,William I. Atlas,Katrina M. Connors,Mark A. Spoljaric
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, accepted by IJCAI 2025, AI and Social Good Track

点击查看摘要

Abstract:Wild salmon are essential to the ecological, economic, and cultural sustainability of the North Pacific Rim. Yet climate variability, habitat loss, and data limitations in remote ecosystems that lack basic infrastructure support pose significant challenges to effective fisheries management. This project explores the integration of multimodal foundation AI and expert-in-the-loop frameworks to enhance wild salmon monitoring and sustainable fisheries management in Indigenous rivers across Pacific Northwest. By leveraging video and sonar-based monitoring, we develop AI-powered tools for automated species identification, counting, and length measurement, reducing manual effort, expediting delivery of results, and improving decision-making accuracy. Expert validation and active learning frameworks ensure ecological relevance while reducing annotation burdens. To address unique technical and societal challenges, we bring together a cross-domain, interdisciplinary team of university researchers, fisheries biologists, Indigenous stewardship practitioners, government agencies, and conservation organizations. Through these collaborations, our research fosters ethical AI co-development, open data sharing, and culturally informed fisheries management.
zh

[AI-88] AI-Powered Anomaly Detection with Blockchain for Real-Time Security and Reliability in Autonomous Vehicles

【速读】:该论文旨在解决自动驾驶汽车(Autonomous Vehicles, AV)在安全性和可靠性方面面临的紧迫问题,以确保公共安全并促进其广泛应用。解决方案的关键在于提出一种结合人工智能(Artificial Intelligence, AI)与区块链技术的新框架,通过实时异常检测、数据溯源和即时响应机制提升自动驾驶系统的安全性与可信度。具体而言,该框架利用长短期记忆网络(Long Short-Term Memory, LSTM)持续监控多传感器数据流以识别潜在的网络攻击或硬件故障,并通过区块链技术实现数据的不可篡改存储与透明追溯,同时借助智能合约实现异常情况下的自动响应,从而增强系统对网络空间和硬件故障的抗脆弱性。

链接: https://arxiv.org/abs/2505.06632
作者: Rathin Chandra Shit,Sharmila Subudhi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Scheduled for presentation at an upcoming conference

点击查看摘要

Abstract:Autonomous Vehicles (AV) proliferation brings important and pressing security and reliability issues that must be dealt with to guarantee public safety and help their widespread adoption. The contribution of the proposed research is towards achieving more secure, reliable, and trustworthy autonomous transportation system by providing more capabilities for anomaly detection, data provenance, and real-time response in safety critical AV deployments. In this research, we develop a new framework that combines the power of Artificial Intelligence (AI) for real-time anomaly detection with blockchain technology to detect and prevent any malicious activity including sensor failures in AVs. Through Long Short-Term Memory (LSTM) networks, our approach continually monitors associated multi-sensor data streams to detect anomalous patterns that may represent cyberattacks as well as hardware malfunctions. Further, this framework employs a decentralized platform for securely storing sensor data and anomaly alerts in a blockchain ledger for data incorruptibility and authenticity, while offering transparent forensic features. Moreover, immediate automated response mechanisms are deployed using smart contracts when anomalies are found. This makes the AV system more resilient to attacks from both cyberspace and hardware component failure. Besides, we identify potential challenges of scalability in handling high frequency sensor data, computational constraint in resource constrained environment, and of distributed data storage in terms of privacy.
zh

[AI-89] CaMDN: Enhancing Cache Efficiency for Multi-tenant DNNs on Integrated NPUs

【速读】:该论文旨在解决多租户深度神经网络(DNN)在单片系统(SoC)上执行时,共享缓存(shared cache)带来的性能瓶颈问题。现有研究虽已提出多种方法以提升多租户性能,但对共享缓存的影响尚未深入探讨。论文提出的CaMDN架构-调度协同设计,其关键在于通过轻量级架构支持模型专属、由NPU控制的缓存区域,以消除意外的缓存竞争,并结合缓存感知映射方法与动态分配算法,提升共享缓存的利用率和适应性。

链接: https://arxiv.org/abs/2505.06625
作者: Tianhao Cai,Liang Wang,Limin Xiao,Meng Han,Zeyu Wang,Lin Sun,Xiaojian Liao
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 7 pages, 9 figures. This paper has been accepted to the 2025 Design Automation Conference (DAC)

点击查看摘要

Abstract:With the rapid development of DNN applications, multi-tenant execution, where multiple DNNs are co-located on a single SoC, is becoming a prevailing trend. Although many methods are proposed in prior works to improve multi-tenant performance, the impact of shared cache is not well studied. This paper proposes CaMDN, an architecture-scheduling co-design to enhance cache efficiency for multi-tenant DNNs on integrated NPUs. Specifically, a lightweight architecture is proposed to support model-exclusive, NPU-controlled regions inside shared cache to eliminate unexpected cache contention. Moreover, a cache scheduling method is proposed to improve shared cache utilization. In particular, it includes a cache-aware mapping method for adaptability to the varying available cache capacity and a dynamic allocation algorithm to adjust the usage among co-located DNNs at runtime. Compared to prior works, CaMDN reduces the memory access by 33.4% on average and achieves a model speedup of up to 2.56 \times (1.88 \times on average).
zh

[AI-90] Integrating Explainable AI in Medical Devices: Technical Clinical and Regulatory Insights and Recommendations

【速读】:该论文试图解决生成式 AI (Generative AI) 在医疗领域作为临床决策支持系统应用时,由于模型复杂性导致的可解释性不足问题,进而影响其在临床环境中的安全集成。解决方案的关键在于通过跨学科专家工作组的协作,评估不同 AI 算法在临床决策中的输出,并结合对临床医生与 AI 方法交互行为的初步研究,提出确保医疗 AI 设备安全性和可信度的建议,同时强调对相关利益方进行充分培训的重要性。

链接: https://arxiv.org/abs/2505.06620
作者: Dima Alattal,Asal Khoshravan Azar,Puja Myles,Richard Branson,Hatim Abdulhussein,Allan Tucker
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 47 pages

点击查看摘要

Abstract:There is a growing demand for the use of Artificial Intelligence (AI) and Machine Learning (ML) in healthcare, particularly as clinical decision support systems to assist medical professionals. However, the complexity of many of these models, often referred to as black box models, raises concerns about their safe integration into clinical settings as it is difficult to understand how they arrived at their predictions. This paper discusses insights and recommendations derived from an expert working group convened by the UK Medicine and Healthcare products Regulatory Agency (MHRA). The group consisted of healthcare professionals, regulators, and data scientists, with a primary focus on evaluating the outputs from different AI algorithms in clinical decision-making contexts. Additionally, the group evaluated findings from a pilot study investigating clinicians’ behaviour and interaction with AI methods during clinical diagnosis. Incorporating AI methods is crucial for ensuring the safety and trustworthiness of medical AI devices in clinical settings. Adequate training for stakeholders is essential to address potential issues, and further insights and recommendations for safely adopting AI systems in healthcare settings are provided.
zh

[AI-91] Burger: Robust Graph Denoising-augmentation Fusion and Multi-semantic Modeling in Social Recommendation

【速读】:该论文旨在解决社会推荐系统中由于社交网络与用户-物品交互网络之间语义信息相互影响不足,导致推荐准确率受限的问题。现有方法主要关注用户兴趣相似性以过滤无关关系,但缺乏对两者语义信息相互作用的深入研究。其解决方案的关键在于提出一种结合鲁棒图去噪增强融合与多语义建模的社会推荐模型(Burger),通过构建社会张量、引入图卷积网络和张量卷积网络分别捕捉用户物品偏好和社会偏好,并设计双语义协调损失来建模语义信息的相互影响,同时利用贝叶斯后验概率挖掘潜在社会关系以减少噪声干扰,最终通过滑动窗口机制更新社会张量以实现迭代优化。

链接: https://arxiv.org/abs/2505.06612
作者: Yuqin Lan
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:In the era of rapid development of social media, social recommendation systems as hybrid recommendation systems have been widely applied. Existing methods capture interest similarity between users to filter out interest-irrelevant relations in social networks that inevitably decrease recommendation accuracy, however, limited research has a focus on the mutual influence of semantic information between the social network and the user-item interaction network for further improving social recommendation. To address these issues, we introduce a social \underlinerecommendation model with ro\underlinebust g\underlineraph denoisin\underlineg-augmentation fusion and multi-s\underlineemantic Modeling(Burger). Specifically, we firstly propose to construct a social tensor in order to smooth the training process of the model. Then, a graph convolutional network and a tensor convolutional network are employed to capture user’s item preference and social preference, respectively. Considering the different semantic information in the user-item interaction network and the social network, a bi-semantic coordination loss is proposed to model the mutual influence of semantic information. To alleviate the interference of interest-irrelevant relations on multi-semantic modeling, we further use Bayesian posterior probability to mine potential social relations to replace social noise. Finally, the sliding window mechanism is utilized to update the social tensor as the input for the next iteration. Extensive experiments on three real datasets show Burger has a superior performance compared with the state-of-the-art models.
zh

[AI-92] JAEGER: Dual-Level Humanoid Whole-Body Controller

【速读】:该论文试图解决双足机器人全身控制中策略鲁棒性和泛化能力不足的问题,特别是传统单控制器方法在处理高维状态空间和任务多样性时的局限性。解决方案的关键在于提出JAEGER,一个双层级全身控制器,将上肢和下肢的控制分离为两个独立的控制器,从而降低维度灾难并提高容错能力,同时支持根部速度跟踪(粗粒度控制)和局部关节角度跟踪(细粒度控制),实现更稳定和多样的运动表现。

链接: https://arxiv.org/abs/2505.06584
作者: Ziluo Ding,Haobin Jiang,Yuxuan Wang,Zhenguo Sun,Yu Zhang,Xiaojie Niu,Ming Yang,Weishuai Zeng,Xinrun Xu,Zongqing Lu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 15 pages, 2 figures

点击查看摘要

Abstract:This paper presents JAEGER, a dual-level whole-body controller for humanoid robots that addresses the challenges of training a more robust and versatile policy. Unlike traditional single-controller approaches, JAEGER separates the control of the upper and lower bodies into two independent controllers, so that they can better focus on their distinct tasks. This separation alleviates the dimensionality curse and improves fault tolerance. JAEGER supports both root velocity tracking (coarse-grained control) and local joint angle tracking (fine-grained control), enabling versatile and stable movements. To train the controller, we utilize a human motion dataset (AMASS), retargeting human poses to humanoid poses through an efficient retargeting network, and employ a curriculum learning approach. This method performs supervised learning for initialization, followed by reinforcement learning for further exploration. We conduct our experiments on two humanoid platforms and demonstrate the superiority of our approach against state-of-the-art methods in both simulation and real environments.
zh

[AI-93] AROT: Towards Essentially Domain-Invariant Robustness with Theoretical Justification CVPR2025

【速读】:该论文试图解决在对抗攻击下鲁棒的领域自适应问题,旨在开发能够在多样且具有挑战性的领域中保持性能一致的模型。其解决方案的关键在于提出了一种新的用于鲁棒领域自适应的发散度量,并基于此构建了名为TAROT的新算法,该算法通过有效学习领域不变特征,显著提升了模型的领域泛化能力和可扩展性。

链接: https://arxiv.org/abs/2505.06580
作者: Dongyoon Yang,Jihu Lee,Yongdai Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted in CVPR 2025 (19 pages, 7 figures)

点击查看摘要

Abstract:Robust domain adaptation against adversarial attacks is a critical research area that aims to develop models capable of maintaining consistent performance across diverse and challenging domains. In this paper, we derive a new generalization bound for robust risk on the target domain using a novel divergence measure specifically designed for robust domain adaptation. Building upon this, we propose a new algorithm named TAROT, which is designed to enhance both domain adaptability and robustness. Through extensive experiments, TAROT not only surpasses state-of-the-art methods in accuracy and robustness but also significantly enhances domain generalization and scalability by effectively learning domain-invariant features. In particular, TAROT achieves superior performance on the challenging DomainNet dataset, demonstrating its ability to learn domain-invariant representations that generalize well across different domains, including unseen ones. These results highlight the broader applicability of our approach in real-world domain adaptation scenarios.
zh

[AI-94] Quadrupedal Robot Skateboard Mounting via Reverse Curriculum Learning

【速读】:该论文试图解决四足机器人如何成功登上滑板的问题,尤其是初始上板阶段的挑战。其解决方案的关键在于采用目标导向的方法,从任务的终端阶段开始,逐步增加问题定义的复杂性,以逼近期望的目标。通过首先将滑板固定在全局坐标系中,并让机器人位于其正上方,随后逐渐放松这些初始条件,使得学习到的策略具备对滑板位置和方向变化的鲁棒性,最终实现了在移动滑板场景中的成功迁移。

链接: https://arxiv.org/abs/2505.06561
作者: Danil Belov,Artem Erkhov,Elizaveta Pestova,Ilya Osokin,Dzmitry Tsetserukou,Pavel Osinenko
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:The aim of this work is to enable quadrupedal robots to mount skateboards using Reverse Curriculum Reinforcement Learning. Although prior work has demonstrated skateboarding for quadrupeds that are already positioned on the board, the initial mounting phase still poses a significant challenge. A goal-oriented methodology was adopted, beginning with the terminal phases of the task and progressively increasing the complexity of the problem definition to approximate the desired objective. The learning process was initiated with the skateboard rigidly fixed within the global coordinate frame and the robot positioned directly above it. Through gradual relaxation of these initial conditions, the learned policy demonstrated robustness to variations in skateboard position and orientation, ultimately exhibiting a successful transfer to scenarios involving a mobile skateboard. The code, trained models, and reproducible examples are available at the following link: this https URL
zh

[AI-95] dcFCI: Robust Causal Discovery Under Latent Confounding Unfaithfulness and Mixed Data

【速读】:该论文旨在解决在存在潜在混杂因素的情况下,传统因果发现算法(如Fast Causal Inference, FCI)因依赖经验忠实性假设而难以准确推断因果结构的问题。其关键解决方案是引入一种非参数评分方法,用于评估部分祖先图(PAG)与观测数据的兼容性,该评分能够处理混合变量类型,并且能够充分表征结构不确定性,从而区分不同的PAG。基于此评分,作者提出了数据兼容的FCI(dcFCI),这是首个联合处理潜在混杂、经验不忠实性和混合数据类型的混合因果发现算法。

链接: https://arxiv.org/abs/2505.06542
作者: Adèle H. Ribeiro,Dominik Heider
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 31 pages. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Causal discovery is central to inferring causal relationships from observational data. In the presence of latent confounding, algorithms such as Fast Causal Inference (FCI) learn a Partial Ancestral Graph (PAG) representing the true model’s Markov Equivalence Class. However, their correctness critically depends on empirical faithfulness, the assumption that observed (in)dependencies perfectly reflect those of the underlying causal model, which often fails in practice due to limited sample sizes. To address this, we introduce the first nonparametric score to assess a PAG’s compatibility with observed data, even with mixed variable types. This score is both necessary and sufficient to characterize structural uncertainty and distinguish between distinct PAGs. We then propose data-compatible FCI (dcFCI), the first hybrid causal discovery algorithm to jointly address latent confounding, empirical unfaithfulness, and mixed data types. dcFCI integrates our score into an (Anytime)FCI-guided search that systematically explores, ranks, and validates candidate PAGs. Experiments on synthetic and real-world scenarios demonstrate that dcFCI significantly outperforms state-of-the-art methods, often recovering the true PAG even in small and heterogeneous datasets. Examining top-ranked PAGs further provides valuable insights into structural uncertainty, supporting more robust and informed causal reasoning and decision-making.
zh

[AI-96] Online Feedback Efficient Active Target Discovery in Partially Observable Environments

【速读】:该论文旨在解决在数据获取成本较高的科学与工程领域中,如何在有限的采样预算内最大化目标发现的问题。其解决方案的关键在于提出了一种名为Diffusion-guided Active Target Discovery (DiffATD)的新方法,该方法利用扩散动力学进行主动目标发现,通过维护环境中的每个未观测状态的信念分布,动态平衡探索与利用,从而高效地在部分可观测环境中实现目标发现。

链接: https://arxiv.org/abs/2505.06535
作者: Anindya Sarkar,Binglin Ji,Yevgeniy Vorobeychik
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 30 pages, 28 figures, Pre-print

点击查看摘要

Abstract:In various scientific and engineering domains, where data acquisition is costly, such as in medical imaging, environmental monitoring, or remote sensing, strategic sampling from unobserved regions, guided by prior observations, is essential to maximize target discovery within a limited sampling budget. In this work, we introduce Diffusion-guided Active Target Discovery (DiffATD), a novel method that leverages diffusion dynamics for active target discovery. DiffATD maintains a belief distribution over each unobserved state in the environment, using this distribution to dynamically balance exploration-exploitation. Exploration reduces uncertainty by sampling regions with the highest expected entropy, while exploitation targets areas with the highest likelihood of discovering the target, indicated by the belief distribution and an incrementally trained reward model designed to learn the characteristics of the target. DiffATD enables efficient target discovery in a partially observable environment within a fixed sampling budget, all without relying on any prior supervised training. Furthermore, DiffATD offers interpretability, unlike existing black-box policies that require extensive supervised training. Through extensive experiments and ablation studies across diverse domains, including medical imaging and remote sensing, we show that DiffATD performs significantly better than baselines and competitively with supervised methods that operate under full environmental observability.
zh

[AI-97] PRUNE: A Patching Based Repair Framework for Certiffable Unlearning of Neural Networks

【速读】:该论文试图解决从训练好的神经网络模型中移除特定训练数据的问题(即“遗忘”),特别是在数据主体行使“被遗忘权”时的需求。现有方法通常需要使用剩余数据重新训练替代模型,这在成本和验证方面存在挑战。该论文提出了一种新的解决方案,其关键在于在原始神经网络上施加精心设计的“补丁”(patch),以实现对指定数据的定向“遗忘”。该方法受到神经网络修复研究的启发,旨在为单个数据点寻找轻量级的最小“补丁”,并具备可证明的保证;同时,通过迭代选择代表性数据点进行遗忘,实现了对大量数据点或整个类别的有效遗忘。

链接: https://arxiv.org/abs/2505.06520
作者: Xuran Li,Jingyi Wang,Xiaohan Yuan,Peixin Zhang,Zhan Qin,Zhibo Wang,Kui Ren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:It is often desirable to remove (a.k.a. unlearn) a speciffc part of the training data from a trained neural network model. A typical application scenario is to protect the data holder’s right to be forgotten, which has been promoted by many recent regulation rules. Existing unlearning methods involve training alternative models with remaining data, which may be costly and challenging to verify from the data holder or a thirdparty auditor’s perspective. In this work, we provide a new angle and propose a novel unlearning approach by imposing carefully crafted “patch” on the original neural network to achieve targeted “forgetting” of the requested data to delete. Speciffcally, inspired by the research line of neural network repair, we propose to strategically seek a lightweight minimum “patch” for unlearning a given data point with certiffable guarantee. Furthermore, to unlearn a considerable amount of data points (or an entire class), we propose to iteratively select a small subset of representative data points to unlearn, which achieves the effect of unlearning the whole set. Extensive experiments on multiple categorical datasets demonstrates our approach’s effectiveness, achieving measurable unlearning while preserving the model’s performance and being competitive in efffciency and memory consumption compared to various baseline methods.
zh

[AI-98] A Point-Based Algorithm for Distributional Reinforcement Learning in Partially Observable Domains

【速读】:该论文试图解决在部分可观测环境下,智能体面临环境状态不确定性和策略结果变异性的挑战,从而实现更安全的算法。其解决方案的关键在于将分布强化学习(Distributional Reinforcement Learning, DistRL)扩展到部分可观测马尔可夫决策过程(Partially Observable Markov Decision Processes, POMDPs),通过引入新的分布贝尔曼算子并证明其在上确界p-Wasserstein度量下的收敛性,同时利用psi-向量对回报分布进行有限表示,进而提出分布点基值迭代(Distributional Point-Based Value Iteration, DPBVI)方法,将分布强化学习与POMDP规划相结合,从而实现风险敏感控制。

链接: https://arxiv.org/abs/2505.06518
作者: Larry Preuett III
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In many real-world planning tasks, agents must tackle uncertainty about the environment’s state and variability in the outcomes of any chosen policy. We address both forms of uncertainty as a first step toward safer algorithms in partially observable settings. Specifically, we extend Distributional Reinforcement Learning (DistRL)-which models the entire return distribution for fully observable domains-to Partially Observable Markov Decision Processes (POMDPs), allowing an agent to learn the distribution of returns for each conditional plan. Concretely, we introduce new distributional Bellman operators for partial observability and prove their convergence under the supremum p-Wasserstein metric. We also propose a finite representation of these return distributions via psi-vectors, generalizing the classical alpha-vectors in POMDP solvers. Building on this, we develop Distributional Point-Based Value Iteration (DPBVI), which integrates psi-vectors into a standard point-based backup procedure-bridging DistRL and POMDP planning. By tracking return distributions, DPBVI naturally enables risk-sensitive control in domains where rare, high-impact events must be carefully managed. We provide source code to foster further research in robust decision-making under partial observability.
zh

[AI-99] On Definite Iterated Belief Revision with Belief Algebras IJCAI2025

【速读】:该论文试图解决传统基于逻辑的信念修正研究中,迭代信念修正规则过于松散导致在相同信念条件下存在多个满足规则的修正算子的问题。其解决方案的关键在于通过偏好关系表征信念信息,将信念和新证据均表示为信念代数(belief algebra),并在传统修正规则基础上引入额外的公理,包括对修正结果的上界约束,从而确保在给定当前信念状态和新证据时,修正结果是唯一确定的。

链接: https://arxiv.org/abs/2505.06505
作者: Hua Meng,Zhiguo Long,Michael Sioutis,Zhengchun Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages. Extended version of an accepted IJCAI 2025 paper

点击查看摘要

Abstract:Traditional logic-based belief revision research focuses on designing rules to constrain the behavior of revision operators. Frameworks have been proposed to characterize iterated revision rules, but they are often too loose, leading to multiple revision operators that all satisfy the rules under the same belief condition. In many practical applications, such as safety critical ones, it is important to specify a definite revision operator to enable agents to iteratively revise their beliefs in a deterministic way. In this paper, we propose a novel framework for iterated belief revision by characterizing belief information through preference relations. Semantically, both beliefs and new evidence are represented as belief algebras, which provide a rich and expressive foundation for belief revision. Building on traditional revision rules, we introduce additional postulates for revision with belief algebra, including an upper-bound constraint on the outcomes of revision. We prove that the revision result is uniquely determined given the current belief state and new evidence. Furthermore, to make the framework more useful in practice, we develop a particular algorithm for performing the proposed revision process. We argue that this approach may offer a more predictable and principled method for belief revision, making it suitable for real-world applications.
zh

[AI-100] System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在系统提示(system prompt)层面的安全问题,尤其是针对系统提示中毒(system prompt poisoning)这一新型攻击向量的潜在威胁。现有研究多关注用户提示和模型输出中的安全风险,而忽视了系统提示可能被攻击者篡改所带来的长期影响。论文提出的关键解决方案是识别并系统性分析系统提示中毒的攻击策略,证明其无需越狱技术即可有效实施,并且对多种任务(如数学、编程、逻辑推理和自然语言处理)均具有显著影响,同时揭示了传统增强模型性能的技术(如链式思维和检索增强生成)在遭受系统提示中毒后效果会被显著削弱。

链接: https://arxiv.org/abs/2505.06493
作者: Jiawei Guo,Haipeng Cai
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have gained widespread adoption across diverse applications due to their impressive generative capabilities. Their plug-and-play nature enables both developers and end users to interact with these models through simple prompts. However, as LLMs become more integrated into various systems in diverse domains, concerns around their security are growing. Existing studies mainly focus on threats arising from user prompts (e.g. prompt injection attack) and model output (e.g. model inversion attack), while the security of system prompts remains largely overlooked. This work bridges the critical gap. We introduce system prompt poisoning, a new attack vector against LLMs that, unlike traditional user prompt injection, poisons system prompts hence persistently impacts all subsequent user interactions and model responses. We systematically investigate four practical attack strategies in various poisoning scenarios. Through demonstration on both generative and reasoning LLMs, we show that system prompt poisoning is highly feasible without requiring jailbreak techniques, and effective across a wide range of tasks, including those in mathematics, coding, logical reasoning, and natural language processing. Importantly, our findings reveal that the attack remains effective even when user prompts employ advanced prompting techniques like chain-of-thought (CoT). We also show that such techniques, including CoT and retrieval-augmentation-generation (RAG), which are proven to be effective for improving LLM performance in a wide range of tasks, are significantly weakened in their effectiveness by system prompt poisoning.
zh

[AI-101] SmartPilot: A Multiagent CoPilot for Adaptive and Intelligent Manufacturing

【速读】:该论文试图解决工业4.0背景下制造运营中效率、精度和适应性的不足问题,具体表现为供应链中断导致的异常检测不充分、领域专家缺乏对异常的深入理解、以及传统AI模型在处理复杂传感器数据和准确预测生产方面的局限性。解决方案的关键是提出SmartPilot,这是一个神经符号多智能体CoPilot,具备先进推理和情境决策能力,能够处理多模态传感器数据,并聚焦于异常预测、生产预测和领域特定问题回答三个核心任务,从而实现制造与决策的智能化提升。

链接: https://arxiv.org/abs/2505.06492
作者: Chathurangi Shyalika,Renjith Prasad,Alaa Al Ghazo,Darssan Eswaramoorthi,Harleen Kaur,Sara Shree Muthuselvam,Amit Sheth
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 8 figures, 4 tables, IEEE Conference on Artificial Intelligence (IEEE CAI) 2025

点击查看摘要

Abstract:In the dynamic landscape of Industry 4.0, achieving efficiency, precision, and adaptability is essential to optimize manufacturing operations. Industries suffer due to supply chain disruptions caused by anomalies, which are being detected by current AI models but leaving domain experts uncertain without deeper insights into these anomalies. Additionally, operational inefficiencies persist due to inaccurate production forecasts and the limited effectiveness of traditional AI models for processing complex sensor data. Despite these advancements, existing systems lack the seamless integration of these capabilities needed to create a truly unified solution for enhancing production and decision-making. We propose SmartPilot, a neurosymbolic, multiagent CoPilot designed for advanced reasoning and contextual decision-making to address these challenges. SmartPilot processes multimodal sensor data and is compact to deploy on edge devices. It focuses on three key tasks: anomaly prediction, production forecasting, and domain-specific question answering. By bridging the gap between AI capabilities and real-world industrial needs, SmartPilot empowers industries with intelligent decision-making and drives transformative innovation in manufacturing. The demonstration video, datasets, and supplementary materials are available at this https URL.
zh

[AI-102] Video-Enhanced Offline Reinforcement Learning: A Model-Based Approach

【速读】:该论文试图解决离线强化学习(Offline Reinforcement Learning, RL)在静态数据集上进行策略优化时面临的次优行为学习和价值估计不准确的问题。其解决方案的关键在于提出一种基于模型的方法——视频增强的离线强化学习(Video-Enhanced Offline RL, VeoRL),该方法通过从大量未标注的在线视频数据中构建交互式世界模型,利用基于模型的行为引导,将自然视频中的控制策略和物理动态的常识知识迁移至目标领域的强化学习智能体中。

链接: https://arxiv.org/abs/2505.06482
作者: Minting Pan,Yitao Zheng,Jiajian Li,Yunbo Wang,Xiaokang Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) enables policy optimization in static datasets, avoiding the risks and costs of real-world exploration. However, it struggles with suboptimal behavior learning and inaccurate value estimation due to the lack of environmental interaction. In this paper, we present Video-Enhanced Offline RL (VeoRL), a model-based approach that constructs an interactive world model from diverse, unlabeled video data readily available online. Leveraging model-based behavior guidance, VeoRL transfers commonsense knowledge of control policy and physical dynamics from natural videos to the RL agent within the target domain. Our method achieves substantial performance gains (exceeding 100% in some cases) across visuomotor control tasks in robotic manipulation, autonomous driving, and open-world video games.
zh

[AI-103] KCluster: An LLM -based Clustering Approach to Knowledge Component Discovery

【速读】:该论文试图解决在大规模题库中设计知识成分(Knowledge Component, KC)模型的困难,这一过程通常需要教育者手动分析每个问题,耗时且难以扩展。解决方案的关键在于提出KCluster算法,该算法基于大型语言模型(Large Language Model, LLM)生成的问题相似性度量,通过聚类方法识别语义一致的问题簇,从而自动构建KC模型,显著减少人工干预,提升KC模型对学生表现预测的准确性。

链接: https://arxiv.org/abs/2505.06469
作者: Yumou Wei,Paulo Carvalho,John Stamper
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to the Educational Data Mining (EDM) 2025 conference

点击查看摘要

Abstract:Educators evaluate student knowledge using knowledge component (KC) models that map assessment questions to KCs. Still, designing KC models for large question banks remains an insurmountable challenge for instructors who need to analyze each question by hand. The growing use of Generative AI in education is expected only to aggravate this chronic deficiency of expert-designed KC models, as course engineers designing KCs struggle to keep up with the pace at which questions are generated. In this work, we propose KCluster, a novel KC discovery algorithm based on identifying clusters of congruent questions according to a new similarity metric induced by a large language model (LLM). We demonstrate in three datasets that an LLM can create an effective metric of question similarity, which a clustering algorithm can use to create KC models from questions with minimal human effort. Combining the strengths of LLM and clustering, KCluster generates descriptive KC labels and discovers KC models that predict student performance better than the best expert-designed models available. In anticipation of future work, we illustrate how KCluster can reveal insights into difficult KCs and suggest improvements to instruction.
zh

[AI-104] Opening the Scope of Openness in AI

【速读】:该论文试图解决当前AI领域中“开放性”(openness)概念界定不清晰、缺乏跨学科视角的问题,以及如何在AI背景下重新定义和框架化开放性的挑战。其解决方案的关键在于通过定性分析98个从主题建模中发现的开放性概念,构建一个涵盖不同学科视角的开放性分类体系(taxonomy of openness),从而超越传统开源软件(open source software)的范式,提出更全面的开放性理解,包括行动、系统属性和伦理目标等方面。

链接: https://arxiv.org/abs/2505.06464
作者: Tamara Paris,AJung Moon,Jin Guo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: To appear in ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) 2025

点击查看摘要

Abstract:The concept of openness in AI has so far been heavily inspired by the definition and community practice of open source software. This positions openness in AI as having positive connotations; it introduces assumptions of certain advantages, such as collaborative innovation and transparency. However, the practices and benefits of open source software are not fully transferable to AI, which has its own challenges. Framing a notion of openness tailored to AI is crucial to addressing its growing societal implications, risks, and capabilities. We argue that considering the fundamental scope of openness in different disciplines will broaden discussions, introduce important perspectives, and reflect on what openness in AI should mean. Toward this goal, we qualitatively analyze 98 concepts of openness discovered from topic modeling, through which we develop a taxonomy of openness. Using this taxonomy as an instrument, we situate the current discussion on AI openness, identify gaps and highlight links with other disciplines. Our work contributes to the recent efforts in framing openness in AI by reflecting principles and practices of openness beyond open source software and calls for a more holistic view of openness in terms of actions, system properties, and ethical objectives.
zh

[AI-105] Improved Uncertainty Quantification in Physics-Informed Neural Networks Using Error Bounds and Solution Bundles

【速读】:该论文试图解决物理信息神经网络(Physics-Informed Neural Networks, PINNs)在处理微分方程系统时缺乏不确定性量化(Uncertainty Quantification)机制的问题。其解决方案的关键在于采用两步法训练贝叶斯神经网络,以提供对PINNs解的不确定性估计。通过利用已有的PINN误差界构建异方差方差,该方法提升了不确定性估计的准确性,并进一步在宇宙学的正向问题和反向参数估计中应用所获得的不确定性信息。

链接: https://arxiv.org/abs/2505.06459
作者: Pablo Flores,Olga Graf,Pavlos Protopapas,Karim Pichara
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) have been widely used to obtain solutions to various physical phenomena modeled as Differential Equations. As PINNs are not naturally equipped with mechanisms for Uncertainty Quantification, some work has been done to quantify the different uncertainties that arise when dealing with PINNs. In this paper, we use a two-step procedure to train Bayesian Neural Networks that provide uncertainties over the solutions to differential equation systems provided by PINNs. We use available error bounds over PINNs to formulate a heteroscedastic variance that improves the uncertainty estimation. Furthermore, we solve forward problems and utilize the obtained uncertainties when doing parameter estimation in inverse problems in cosmology.
zh

[AI-106] Reliable Collaborative Conversational Agent System Based on LLM s and Answer Set Programming

【速读】:该论文试图解决基于大型语言模型(Large-Language-Model, LLM)驱动的对话机器人在任务导向对话(Task-Oriented Dialogue, TOD)中的不可靠性和安全性问题。其解决方案的关键在于提出了一种由管理员-助手双智能体(Administrator-Assistant Dual-Agent)构成的架构,其中两个由答案集规划(Answer Set Programming, ASP)驱动的智能体共享同一知识库,并通过协作规则集(Collaborative Rule Set, CRS)进行信息传递,从而实现更安全、可靠和高效的对话系统。

链接: https://arxiv.org/abs/2505.06438
作者: Yankai Zeng,Gopal Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages

点击查看摘要

Abstract:As the Large-Language-Model-driven (LLM-driven) Artificial Intelligence (AI) bots became popular, people realized their strong potential in Task-Oriented Dialogue (TOD). However, bots relying wholly on LLMs are unreliable in their knowledge, and whether they can finally produce a correct result for the task is not guaranteed. The collaboration among these agents also remains a challenge, since the necessary information to convey is unclear, and the information transfer is by prompts – unreliable, and malicious knowledge is easy to inject. With the help of logic programming tools such as Answer Set Programming (ASP), conversational agents can be built safely and reliably, and communication among the agents made more efficient and secure. We proposed an Administrator-Assistant Dual-Agent paradigm, where the two ASP-driven bots share the same knowledge base and complete their tasks independently, while the information can be passed by a Collaborative Rule Set (CRS). The knowledge and information conveyed are encapsulated and invisible to the users, ensuring the security of information transmission. We have constructed AutoManager, a dual-agent system for managing the drive-through window of a fast-food restaurant such as Taco Bell in the US. In AutoManager, the assistant bot takes the customer’s order while the administrator bot manages the menu and food supply. We evaluated our AutoManager and compared it with the real-world Taco Bell Drive-Thru AI Order Taker, and the results show that our method is more reliable.
zh

[AI-107] What Do People Want to Know About Artificial Intelligence (AI)? The Importance of Answering End-User Questions to Explain Autonomous Vehicle (AV) Decisions

【速读】:该论文试图解决如何提升终端用户(如自动驾驶汽车乘客)对由人工智能驱动的自动驾驶车辆(AV)决策的理解问题,从而提高其对AV的使用率和接受度。现有解释机制主要服务于AI研究人员和工程师在调试和监控系统时的需求,而未能有效回应终端用户在不同场景下可能提出的特定疑问。论文提出的关键解决方案是通过交互式文本解释,相较于仅观察AV决策,能够更有效地提升用户对AI驱动AV决策的理解。

链接: https://arxiv.org/abs/2505.06428
作者: Somayeh Molaei,Lionel P. Robert,Nikola Banovic
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted to the Proceedings of the ACM on Human-Computer Interaction, CSCW, October 2025

点击查看摘要

Abstract:Improving end-users’ understanding of decisions made by autonomous vehicles (AVs) driven by artificial intelligence (AI) can improve utilization and acceptance of AVs. However, current explanation mechanisms primarily help AI researchers and engineers in debugging and monitoring their AI systems, and may not address the specific questions of end-users, such as passengers, about AVs in various scenarios. In this paper, we conducted two user studies to investigate questions that potential AV passengers might pose while riding in an AV and evaluate how well answers to those questions improve their understanding of AI-driven AV decisions. Our initial formative study identified a range of questions about AI in autonomous driving that existing explanation mechanisms do not readily address. Our second study demonstrated that interactive text-based explanations effectively improved participants’ comprehension of AV decisions compared to simply observing AV decisions. These findings inform the design of interactions that motivate end-users to engage with and inquire about the reasoning behind AI-driven AV decisions.
zh

[AI-108] Engineering Risk-Aware Security-by-Design Frameworks for Assurance of Large-Scale Autonomous AI Models

【速读】:该论文试图解决大规模自主AI系统在安全性和可靠性方面面临的挑战,特别是在模型参数规模扩大和操作自主性增强的背景下,如何构建工程级的安全与保障框架。解决方案的关键在于提出一种企业级、风险感知的“安全设计”方法,将标准化威胁指标、对抗强化技术以及实时异常检测整合到开发生命周期的各个阶段,形成从设计阶段的风险评估、安全训练协议到持续监控和自动化审计日志的统一流程,从而为模型在对抗性和操作压力下的行为提供可证明的保证。

链接: https://arxiv.org/abs/2505.06409
作者: Krti Tallam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:As AI models scale to billions of parameters and operate with increasing autonomy, ensuring their safe, reliable operation demands engineering-grade security and assurance frameworks. This paper presents an enterprise-level, risk-aware, security-by-design approach for large-scale autonomous AI systems, integrating standardized threat metrics, adversarial hardening techniques, and real-time anomaly detection into every phase of the development lifecycle. We detail a unified pipeline - from design-time risk assessments and secure training protocols to continuous monitoring and automated audit logging - that delivers provable guarantees of model behavior under adversarial and operational stress. Case studies in national security, open-source model governance, and industrial automation demonstrate measurable reductions in vulnerability and compliance overhead. Finally, we advocate cross-sector collaboration - uniting engineering teams, standards bodies, and regulatory agencies - to institutionalize these technical safeguards within a resilient, end-to-end assurance ecosystem for the next generation of AI.
zh

[AI-109] Camera Control at the Edge with Language Models for Scene Understanding

【速读】:该论文试图解决如何通过自然语言接口高效控制PTZ(Pan-Tilt-Zoom)摄像头,并提升环境感知能力的问题。解决方案的关键在于构建一个基于优化提示的统一系统(OPUS),该系统利用大型语言模型(LLM)实现对PTZ摄像头的上下文理解与控制,通过从高级摄像头控制API生成关键词,并借助监督微调(SFT)在合成数据上迁移知识,从而在保持性能的同时提高成本效益,实现高效的边缘部署。

链接: https://arxiv.org/abs/2505.06402
作者: Alexiy Buynitsky,Sina Ehsani,Bhanu Pallakonda,Pragyana Mishra
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 7 pages, 6 figures. This work was presented and published at the 11th IEEE International Conference on Control, Automation and Robotics (ICCAR) in 2025

点击查看摘要

Abstract:In this paper, we present Optimized Prompt-based Unified System (OPUS), a framework that utilizes a Large Language Model (LLM) to control Pan-Tilt-Zoom (PTZ) cameras, providing contextual understanding of natural environments. To achieve this goal, the OPUS system improves cost-effectiveness by generating keywords from a high-level camera control API and transferring knowledge from larger closed-source language models to smaller ones through Supervised Fine-Tuning (SFT) on synthetic data. This enables efficient edge deployment while maintaining performance comparable to larger models like GPT-4. OPUS enhances environmental awareness by converting data from multiple cameras into textual descriptions for language models, eliminating the need for specialized sensory tokens. In benchmark testing, our approach significantly outperformed both traditional language model techniques and more complex prompting methods, achieving a 35% improvement over advanced techniques and a 20% higher task accuracy compared to closed-source models like Gemini Pro. The system demonstrates OPUS’s capability to simplify PTZ camera operations through an intuitive natural language interface. This approach eliminates the need for explicit programming and provides a conversational method for interacting with camera systems, representing a significant advancement in how users can control and utilize PTZ camera technology.
zh

[AI-110] owards AI-Driven Human-Machine Co-Teaming for Adaptive and Agile Cyber Security Operation Centers

【速读】:该论文试图解决安全运营中心(Security Operations Centers, SOCs)在应对网络安全威胁时面临的挑战,包括警报数量过多、熟练分析人员短缺以及工具集成度低等问题。其解决方案的关键在于引入一种基于大型语言模型(Large Language Models, LLMs)的AI驱动的人机协同范式,通过让LLM-based AI代理学习SOC操作中的隐性知识,从而提升威胁情报、警报分类和事件响应等任务的效率与效果。

链接: https://arxiv.org/abs/2505.06394
作者: Massimiliano Albanese,Xinming Ou,Kevin Lybarger,Daniel Lende,Dmitry Goldgof
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Security Operations Centers (SOCs) face growing challenges in managing cybersecurity threats due to an overwhelming volume of alerts, a shortage of skilled analysts, and poorly integrated tools. Human-AI collaboration offers a promising path to augment the capabilities of SOC analysts while reducing their cognitive overload. To this end, we introduce an AI-driven human-machine co-teaming paradigm that leverages large language models (LLMs) to enhance threat intelligence, alert triage, and incident response workflows. We present a vision in which LLM-based AI agents learn from human analysts the tacit knowledge embedded in SOC operations, enabling the AI agents to improve their performance on SOC tasks through this co-teaming. We invite SOCs to collaborate with us to further develop this process and uncover replicable patterns where human-AI co-teaming yields measurable improvements in SOC productivity.
zh

[AI-111] Offensive Security for AI Systems: Concepts Practices and Applications

【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)系统在广泛应用过程中面临的日益复杂和动态的网络安全威胁问题,传统防御措施难以有效应对这些独特且不断演化的风险。解决方案的关键在于构建一个全面的进攻性安全框架,通过主动威胁模拟和对抗性测试,识别AI生命周期中的潜在漏洞,并利用弱点评估、渗透测试和红队演练等技术手段,提升对AI系统安全性的理解和防御能力。该框架将进攻性AI安全从理论概念转化为可操作的方法,帮助组织增强其AI系统的抗风险能力。

链接: https://arxiv.org/abs/2505.06380
作者: Josh Harguess,Chris M. Ward
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As artificial intelligence (AI) systems become increasingly adopted across sectors, the need for robust, proactive security strategies is paramount. Traditional defensive measures often fall short against the unique and evolving threats facing AI-driven technologies, making offensive security an essential approach for identifying and mitigating risks. This paper presents a comprehensive framework for offensive security in AI systems, emphasizing proactive threat simulation and adversarial testing to uncover vulnerabilities throughout the AI lifecycle. We examine key offensive security techniques, including weakness and vulnerability assessment, penetration testing, and red teaming, tailored specifically to address AI’s unique susceptibilities. By simulating real-world attack scenarios, these methodologies reveal critical insights, informing stronger defensive strategies and advancing resilience against emerging threats. This framework advances offensive AI security from theoretical concepts to practical, actionable methodologies that organizations can implement to strengthen their AI systems against emerging threats.
zh

[AI-112] Bi-LSTM based Multi-Agent DRL with Computation-aware Pruning for Agent Twins Migration in Vehicular Embodied AI Networks

【速读】:该论文旨在解决智能交通场景中自主车辆(Autonomous Vehicles, AVs)在计算延迟和资源限制下,如何高效迁移本地AI应用以减轻路侧单元(Roadside Units, RSUs)负载的问题。其解决方案的关键在于将AV-RSU交互建模为Stackelberg博弈,以优化带宽资源分配,并设计一种基于Tiny Multi-Agent Bidirectional LSTM Proximal Policy Optimization (TMABLPPO)的算法,通过去中心化协调近似求解Stackelberg均衡。此外,引入基于Path eXclusion (PX)的个性化神经网络剪枝算法,动态适应异构AV计算能力,降低模型复杂度并保持性能。

链接: https://arxiv.org/abs/2505.06378
作者: Yuxiang Wei,Zhuoqi Zeng,Yue Zhong,Jiawen Kang,Ryan Wen Liu,M. Shamim Hossain
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the advancement of large language models and embodied Artificial Intelligence (AI) in the intelligent transportation scenarios, the combination of them in intelligent transportation spawns the Vehicular Embodied AI Network (VEANs). In VEANs, Autonomous Vehicles (AVs) are typical agents whose local advanced AI applications are defined as vehicular embodied AI agents, enabling capabilities such as environment perception and multi-agent collaboration. Due to computation latency and resource constraints, the local AI applications and services running on vehicular embodied AI agents need to be migrated, and subsequently referred to as vehicular embodied AI agent twins, which drive the advancement of vehicular embodied AI networks to offload intensive tasks to Roadside Units (RSUs), mitigating latency problems while maintaining service quality. Recognizing workload imbalance among RSUs in traditional approaches, we model AV-RSU interactions as a Stackelberg game to optimize bandwidth resource allocation for efficient migration. A Tiny Multi-Agent Bidirectional LSTM Proximal Policy Optimization (TMABLPPO) algorithm is designed to approximate the Stackelberg equilibrium through decentralized coordination. Furthermore, a personalized neural network pruning algorithm based on Path eXclusion (PX) dynamically adapts to heterogeneous AV computation capabilities by identifying task-critical parameters in trained models, reducing model complexity with less performance degradation. Experimental validation confirms the algorithm’s effectiveness in balancing system load and minimizing delays, demonstrating significant improvements in vehicular embodied AI agent deployment.
zh

[AI-113] he ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization

【速读】:该论文试图解决生成式 AI(Generative AI)在实际服务环境中能耗问题,这一问题在构建机器学习(ML)系统时常被忽视或研究不足。解决方案的关键在于提出并实现了一套基准测试框架——this http URL Benchmark,用于在真实服务环境下测量推理能耗,并通过其对应的this http URL Leaderboard提供优化参考。该基准测试遵循四个关键设计原则,涵盖了模型架构、任务类型、ML设计选择及自动化优化推荐等方面,从而有效揭示了能耗影响因素并实现了显著的能耗降低。

链接: https://arxiv.org/abs/2505.06371
作者: Jae-Won Chung,Jiachen Liu,Jeff J. Ma,Ruofan Wu,Oh Jun Kweon,Yuxuan Xia,Zhiyu Wu,Mosharaf Chowdhury
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Leaderboard: this https URL

点击查看摘要

Abstract:As the adoption of Generative AI in real-world services grow explosively, energy has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the this http URL Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments, and the corresponding this http URL Leaderboard, which have served as a valuable resource for those hoping to understand and optimize the energy consumption of their generative AI services. In this paper, we explain four key design principles for benchmarking ML energy we have acquired over time, and then describe how they are implemented in the this http URL Benchmark. We then highlight results from the latest iteration of the benchmark, including energy measurements of 40 widely used model architectures across 6 different tasks, case studies of how ML design choices impact energy consumption, and how automated optimization recommendations can lead to significant (sometimes more than 40%) energy savings without changing what is being computed by the model. The this http URL Benchmark is open-source and can be easily extended to various customized models and application scenarios.
zh

[AI-114] Learning Sequential Kinematic Models from Demonstrations for Multi-Jointed Articulated Objects

【速读】:该论文试图解决机器人在复杂环境中与多自由度(DoF)物体交互时的精确控制问题,现有方法通常依赖先验知识或仅关注单自由度物体,且无法处理遮挡关节和操作序列的获取。解决方案的关键在于从人类示范中学习物体模型,并引入对象运动序列机(OKSM),该模型能够捕捉多自由度物体的运动学约束和操作顺序,同时通过Pokenet网络从点云数据中估计这些模型,从而提升真实世界中的关节轴和状态估计性能。

链接: https://arxiv.org/abs/2505.06363
作者: Anmol Gupta,Weiwei Gu,Omkar Patil,Jun Ki Lee,Nakul Gopalan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As robots become more generalized and deployed in diverse environments, they must interact with complex objects, many with multiple independent joints or degrees of freedom (DoF) requiring precise control. A common strategy is object modeling, where compact state-space models are learned from real-world observations and paired with classical planning. However, existing methods often rely on prior knowledge or focus on single-DoF objects, limiting their applicability. They also fail to handle occluded joints and ignore the manipulation sequences needed to access them. We address this by learning object models from human demonstrations. We introduce Object Kinematic Sequence Machines (OKSMs), a novel representation capturing both kinematic constraints and manipulation order for multi-DoF objects. To estimate these models from point cloud data, we present Pokenet, a deep neural network trained on human demonstrations. We validate our approach on 8,000 simulated and 1,600 real-world annotated samples. Pokenet improves joint axis and state estimation by over 20 percent on real-world data compared to prior methods. Finally, we demonstrate OKSMs on a Sawyer robot using inverse kinematics-based planning to manipulate multi-DoF objects.
zh

[AI-115] Remote Rowhammer Attack using Adversarial Observations on Federated Learning Clients

【速读】:该论文试图解决联邦学习(Federated Learning, FL)系统中由于客户端攻击导致的服务器内存安全问题,特别是通过远程触发行锤攻击(rowhammer attack)来引发内存位翻转的问题。解决方案的关键在于利用强化学习(Reinforcement Learning, RL)攻击者通过操控客户端的传感器观测数据,诱导服务器频繁进行重复内存更新,从而在无需直接访问服务器的情况下实现对服务器内存的远程攻击。

链接: https://arxiv.org/abs/2505.06335
作者: Jinsheng Yuan,Yuhang Hao,Weisi Guo,Yun Wu,Chongyan Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Federated Learning (FL) has the potential for simultaneous global learning amongst a large number of parallel agents, enabling emerging AI such as LLMs to be trained across demographically diverse data. Central to this being efficient is the ability for FL to perform sparse gradient updates and remote direct memory access at the central server. Most of the research in FL security focuses on protecting data privacy at the edge client or in the communication channels between the client and server. Client-facing attacks on the server are less well investigated as the assumption is that a large collective of clients offer resilience. Here, we show that by attacking certain clients that lead to a high frequency repetitive memory update in the server, we can remote initiate a rowhammer attack on the server memory. For the first time, we do not need backdoor access to the server, and a reinforcement learning (RL) attacker can learn how to maximize server repetitive memory updates by manipulating the client’s sensor observation. The consequence of the remote rowhammer attack is that we are able to achieve bit flips, which can corrupt the server memory. We demonstrate the feasibility of our attack using a large-scale FL automatic speech recognition (ASR) systems with sparse updates, our adversarial attacking agent can achieve around 70% repeated update rate (RUR) in the targeted server model, effectively inducing bit flips on server DRAM. The security implications are that can cause disruptions to learning or may inadvertently cause elevated privilege. This paves the way for further research on practical mitigation strategies in FL and hardware design. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Cite as: arXiv:2505.06335 [cs.LG] (or arXiv:2505.06335v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.06335 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-116] NSF-MAP: Neurosymbolic Multimodal Fusion for Robust and Interpretable Anomaly Prediction in Assembly Pipelines IJCAI2025

【速读】:该论文试图解决现代装配流水线中异常检测的问题,传统单模态方法在复杂预测环境中难以捕捉精确异常预测所需的复杂关系。解决方案的关键在于提出一种基于神经符号AI和融合的方法,通过时间序列与图像的决策级融合建模、迁移学习以及知识注入学习等创新方法,有效整合多模态数据的优势,从而提升异常预测的性能与可解释性。

链接: https://arxiv.org/abs/2505.06333
作者: Chathurangi Shyalika,Renjith Prasad,Fadi El Kalach,Revathy Venkataramanan,Ramtin Zand,Ramy Harik,Amit Sheth
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures, 2 tables, IJCAI 2025 (International Joint Conferences on Artificial Intelligence) Special Track on AI4Tech: AI Enabling Critical Technologies

点击查看摘要

Abstract:In modern assembly pipelines, identifying anomalies is crucial in ensuring product quality and operational efficiency. Conventional single-modality methods fail to capture the intricate relationships required for precise anomaly prediction in complex predictive environments with abundant data and multiple modalities. This paper proposes a neurosymbolic AI and fusion-based approach for multimodal anomaly prediction in assembly pipelines. We introduce a time series and image-based fusion model that leverages decision-level fusion techniques. Our research builds upon three primary novel approaches in multimodal learning: time series and image-based decision-level fusion modeling, transfer learning for fusion, and knowledge-infused learning. We evaluate the novel method using our derived and publicly available multimodal dataset and conduct comprehensive ablation studies to assess the impact of our preprocessing techniques and fusion model compared to traditional baselines. The results demonstrate that a neurosymbolic AI-based fusion approach that uses transfer learning can effectively harness the complementary strengths of time series and image data, offering a robust and interpretable approach for anomaly prediction in assembly pipelines with enhanced performance. \noindent The datasets, codes to reproduce the results, supplementary materials, and demo are available at this https URL.
zh

[AI-117] Mask-PINNs: Regulating Feature Distributions in Physics-Informed Neural Networks

【速读】:该论文试图解决物理信息神经网络(Physics-Informed Neural Networks, PINNs)中因内部协变量偏移(internal covariate shift)导致的神经网络容量无法有效利用的问题。解决方案的关键在于提出一种名为Mask-PINNs的新架构,该架构引入了一个可学习的非线性掩码函数,以约束特征分布而不违背基本物理规律,从而显著提升了特征分布的稳定性、精度和鲁棒性,并实现了更宽网络的稳定高效训练。

链接: https://arxiv.org/abs/2505.06331
作者: Feilong Jiang,Xiaonan Hou,Jianqiao Ye,Min Xia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) are a class of deep learning models designed to solve partial differential equations by incorporating physical laws directly into the loss function. However, the internal covariate shift, which has been largely overlooked, hinders the effective utilization of neural network capacity in PINNs. To this end, we propose Mask-PINNs, a novel architecture designed to address this issue in PINNs. Unlike traditional normalization methods such as BatchNorm or LayerNorm, we introduce a learnable, nonlinear mask function that constrains the feature distributions without violating underlying physics. The experimental results show that the proposed method significantly improves feature distribution stability, accuracy, and robustness across various activation functions and PDE benchmarks. Furthermore, it enables the stable and efficient training of wider networks a capability that has been largely overlooked in PINNs.
zh

[AI-118] Prompting Large Language Models for Training-Free Non-Intrusive Load Monitoring

【速读】:该论文旨在解决非侵入式负载监测(NILM)中深度学习方法对标注数据的依赖性高、泛化能力受限以及可解释性不足的问题。其解决方案的关键在于引入基于提示(prompt-based)的NILM框架,利用大语言模型(LLMs)结合上下文学习的能力,通过设计融合电器特征、时间戳、上下文信息及代表性时间序列示例的提示策略,实现高效的电力负荷分解,并在无需微调的情况下展现出强大的泛化能力和可解释性。

链接: https://arxiv.org/abs/2505.06330
作者: Junyu Xue,Xudong Wang,Xiaoling He,Shicheng Liu,Yi Wang,Guoming Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Non-intrusive Load Monitoring (NILM) aims to disaggregate aggregate household electricity consumption into individual appliance usage, enabling more effective energy management. While deep learning has advanced NILM, it remains limited by its dependence on labeled data, restricted generalization, and lack of interpretability. In this paper, we introduce the first prompt-based NILM framework that leverages Large Language Models (LLMs) with in-context learning. We design and evaluate prompt strategies that integrate appliance features, timestamps and contextual information, as well as representative time-series examples, using the REDD dataset. With optimized prompts, LLMs achieve competitive state detection accuracy, reaching an average F1-score of 0.676 on unseen households, and demonstrate robust generalization without the need for fine-tuning. LLMs also enhance interpretability by providing clear, human-readable explanations for their predictions. Our results show that LLMs can reduce data requirements, improve adaptability, and provide transparent energy disaggregation in NILM applications.
zh

[AI-119] A Grounded Memory System For Smart Personal Assistants ESWC2025

【速读】:该论文旨在解决广泛存在的代理型人工智能应用中对基于现实的稳健记忆系统的需求,例如针对痴呆症患者的认知助手和机器人系统。其解决方案的关键在于构建一个由三个组件组成的记忆系统:首先,结合视觉语言模型进行图像描述和实体消歧,以及大语言模型进行感知过程中的信息一致性提取;其次,利用知识图谱结合向量嵌入表示提取的信息,以高效管理关系信息;最后,通过检索增强生成技术,结合语义搜索与图查询生成实现问答功能。

链接: https://arxiv.org/abs/2505.06328
作者: Felix Ocker,Jörg Deigmöller,Pavel Smirnov,Julian Eggert
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures, accepted for the ESWC 2025 TEXT2KG workshop

点击查看摘要

Abstract:A wide variety of agentic AI applications - ranging from cognitive assistants for dementia patients to robotics - demand a robust memory system grounded in reality. In this paper, we propose such a memory system consisting of three components. First, we combine Vision Language Models for image captioning and entity disambiguation with Large Language Models for consistent information extraction during perception. Second, the extracted information is represented in a memory consisting of a knowledge graph enhanced by vector embeddings to efficiently manage relational information. Third, we combine semantic search and graph query generation for question answering via Retrieval Augmented Generation. We illustrate the system’s working and potential using a real-world example.
zh

[AI-120] Enterprise Architecture as a Dynamic Capability for Scalable and Sustainable Generative AI adoption: Bridging Innovation and Governance in Large Organisations

【速读】:该论文试图解决企业在大规模采用生成式 AI (Generative AI) 过程中面临的复杂挑战,包括技术复杂性、治理缺口和资源错配等问题。其解决方案的关键在于通过企业架构管理 (Enterprise Architecture Management, EAM) 来满足 GenAI 的复杂需求,特别是通过将 EAM 理论化为感知、捕捉和转化动态能力,以提升战略对齐度、治理框架和组织敏捷性。研究还强调需针对 GenAI 特有的挑战(如数据治理成熟度低和创新与合规之间的平衡)对 EA 框架进行定制化调整。

链接: https://arxiv.org/abs/2505.06326
作者: Alexander Ettinger
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 82 pages excluding appendix

点击查看摘要

Abstract:Generative Artificial Intelligence is a powerful new technology with the potential to boost innovation and reshape governance in many industries. Nevertheless, organisations face major challenges in scaling GenAI, including technology complexity, governance gaps and resource misalignments. This study explores how Enterprise Architecture Management can meet the complex requirements of GenAI adoption within large enterprises. Based on a systematic literature review and the qualitative analysis of 16 semi-structured interviews with experts, it examines the relationships between EAM, dynamic capabilities and GenAI adoption. The review identified key limitations in existing EA frameworks, particularly their inability to fully address the unique requirements of GenAI. The interviews, analysed using the Gioia methodology, revealed critical enablers and barriers to GenAI adoption across industries. The findings indicate that EAM, when theorised as sensing, seizing and transforming dynamic capabilities, can enhance GenAI adoption by improving strategic alignment, governance frameworks and organisational agility. However, the study also highlights the need to tailor EA frameworks to GenAI-specific challenges, including low data governance maturity and the balance between innovation and compliance. Several conceptual frameworks are proposed to guide EA leaders in aligning GenAI maturity with organisational readiness. The work contributes to academic understanding and industry practice by clarifying the role of EA in bridging innovation and governance in disruptive technology environments.
zh

[AI-121] Human in the Latent Loop (HILL): Interactively Guiding Model Training Through Human Intuition

【速读】:该论文试图解决机器学习模型中潜在空间(latent space)表示难以理解且复杂的问题,从而影响模型行为的优化。其解决方案的关键在于提出HILL框架,该框架通过交互式重塑潜在空间表示,将人类直觉融入模型训练过程中。该方法受知识蒸馏启发,将用户的修改视为教师信号,引导模型调整其内在潜在表示,从而提升模型收敛效率和性能,并为用户提供有益洞察。

链接: https://arxiv.org/abs/2505.06325
作者: Daniel Geissler,Lars Krupp,Vishal Banwari,David Habusch,Bo Zhou,Paul Lukowicz,Jakob Karolus
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Latent space representations are critical for understanding and improving the behavior of machine learning models, yet they often remain obscure and intricate. Understanding and exploring the latent space has the potential to contribute valuable human intuition and expertise about respective domains. In this work, we present HILL, an interactive framework allowing users to incorporate human intuition into the model training by interactively reshaping latent space representations. The modifications are infused into the model training loop via a novel approach inspired by knowledge distillation, treating the user’s modifications as a teacher to guide the model in reshaping its intrinsic latent representation. The process allows the model to converge more effectively and overcome inefficiencies, as well as provide beneficial insights to the user. We evaluated HILL in a user study tasking participants to train an optimal model, closely observing the employed strategies. The results demonstrated that human-guided latent space modifications enhance model performance while maintaining generalization, yet also revealing the risks of including user biases. Our work introduces a novel human-AI interaction paradigm that infuses human intuition into model training and critically examines the impact of human intervention on training strategies and potential biases.
zh

[AI-122] Document Attribution: Examining Citation Relationships using Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在基于文档的任务中生成结果的可信度与可解释性问题,特别是在用户需求聚焦于从提供的文档中检索信息而非依赖模型参数知识的情况下。解决方案的关键在于通过归因(attribution)技术,将生成的输出追溯到其来源文档,并评估这些引用的可靠性。论文提出了两种关键技术:一种是将归因任务视为文本蕴含(textual entailment)的零样本方法,另一种是探索注意力机制在提升归因过程中的作用。

链接: https://arxiv.org/abs/2505.06324
作者: Vipula Rawte,Ryan A. Rossi,Franck Dernoncourt,Nedim Lipka
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly applied to document-based tasks - such as document summarization, question answering, and information extraction - where user requirements focus on retrieving information from provided documents rather than relying on the model’s parametric knowledge, ensuring the trustworthiness and interpretability of these systems has become a critical concern. A central approach to addressing this challenge is attribution, which involves tracing the generated outputs back to their source documents. However, since LLMs can produce inaccurate or imprecise responses, it is crucial to assess the reliability of these citations. To tackle this, our work proposes two techniques. (1) A zero-shot approach that frames attribution as a straightforward textual entailment task. Our method using flan-ul2 demonstrates an improvement of 0.27% and 2.4% over the best baseline of ID and OOD sets of AttributionBench, respectively. (2) We also explore the role of the attention mechanism in enhancing the attribution process. Using a smaller LLM, flan-t5-small, the F1 scores outperform the baseline across almost all layers except layer 4 and layers 8 through 11. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.06324 [cs.IR] (or arXiv:2505.06324v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2505.06324 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-123] Learn to Think: Bootstrapping LLM Reasoning Capability Through Graph Learning IJCAI2025

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在复杂推理任务中面临的问题,包括训练计算成本高以及推理能力受限。其解决方案的关键在于提出一种基于图学习(graph learning)的新框架,通过将问题的推理过程建模为图结构,并利用基于LLM的图学习引导每个推理步骤的自适应生成,同时引入图神经网络(Graph Neural Network, GNN)模块以实现对推理过程的表示学习和实时调整,从而提升模型的灵活性和泛化能力。

链接: https://arxiv.org/abs/2505.06321
作者: Hang Gao,Chenhao Zhang,Tie Wang,Junsuo Zhao,Fengge Wu,Changwen Zheng,Huaping Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success across various domains. However, they still face significant challenges, including high computational costs for training and limitations in solving complex reasoning problems. Although existing methods have extended the reasoning capabilities of LLMs through structured paradigms, these approaches often rely on task-specific prompts and predefined reasoning processes, which constrain their flexibility and generalizability. To address these limitations, we propose a novel framework that leverages graph learning to enable more flexible and adaptive reasoning capabilities for LLMs. Specifically, this approach models the reasoning process of a problem as a graph and employs LLM-based graph learning to guide the adaptive generation of each reasoning step. To further enhance the adaptability of the model, we introduce a Graph Neural Network (GNN) module to perform representation learning on the generated reasoning process, enabling real-time adjustments to both the model and the prompt. Experimental results demonstrate that this method significantly improves reasoning performance across multiple tasks without requiring additional training or task-specific prompt design. Code can be found in this https URL.
zh

[AI-124] hreat Modeling for AI: The Case for an Asset-Centric Approach

【速读】:该论文试图解决集成式AI代理(integrated AI agents)在安全防护方面面临的独特挑战,特别是在AI系统能够自主执行代码、与外部系统交互且无需人工监督的背景下,传统安全方法已无法有效应对。解决方案的关键在于提出一种以资产为中心(asset-centric)的威胁建模方法,通过关注关键AI资产而非具体攻击方式,系统性地识别常规及AI特有的漏洞对分布式基础设施的影响,从而实现跨技术领域的全面分析、对第三方AI组件的安全假设量化以及针对特定产品环境的AI相关漏洞的全面识别。

链接: https://arxiv.org/abs/2505.06315
作者: Jose Sanchez Vicarte,Marcin Spoczynski,Mostafa Elsaid
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in AI are transforming AI’s ubiquitous presence in our world from that of standalone AI-applications into deeply integrated AI-agents. These changes have been driven by agents’ increasing capability to autonomously make decisions and initiate actions, using existing applications; whether those applications are AI-based or not. This evolution enables unprecedented levels of AI integration, with agents now able to take actions on behalf of systems and users – including, in some cases, the powerful ability for the AI to write and execute scripts as it deems necessary. With AI systems now able to autonomously execute code, interact with external systems, and operate without human oversight, traditional security approaches fall short. This paper introduces an asset-centric methodology for threat modeling AI systems that addresses the unique security challenges posed by integrated AI agents. Unlike existing top-down frameworks that analyze individual attacks within specific product contexts, our bottom-up approach enables defenders to systematically identify how vulnerabilities – both conventional and AI-specific – impact critical AI assets across distributed infrastructures used to develop and deploy these agents. This methodology allows security teams to: (1) perform comprehensive analysis that communicates effectively across technical domains, (2) quantify security assumptions about third-party AI components without requiring visibility into their implementation, and (3) holistically identify AI-based vulnerabilities relevant to their specific product context. This approach is particularly relevant for securing agentic systems with complex autonomous capabilities. By focusing on assets rather than attacks, our approach scales with the rapidly evolving threat landscape while accommodating increasingly complex and distributed AI development pipelines. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.06315 [cs.CR] (or arXiv:2505.06315v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2505.06315 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-125] A4L: An Architecture for AI-Augmented Learning

【速读】:该论文试图解决如何通过生成式 AI (Generative AI) 实现个性化学习和可扩展教育的问题,其核心挑战在于构建有效的数据架构以收集、分析学习数据,并将结果反馈给教师、学习者及 AI 代理,从而实现大规模的个性化学习。解决方案的关键是开发一种名为 AI-Augmented Learning (A4L) 的架构,该架构旨在支持成人在线教育,通过整合数据采集、分析与反馈机制,推动学习的个性化与规模化发展。

链接: https://arxiv.org/abs/2505.06314
作者: Ashok Goel,Ploy Thajchayapong,Vrinda Nandan,Harshvardhan Sikka,Spencer Rugaber
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:AI promises personalized learning and scalable education. As AI agents increasingly permeate education in support of teaching and learning, there is a critical and urgent need for data architectures for collecting and analyzing data on learning, and feeding the results back to teachers, learners, and the AI agents for personalization of learning at scale. At the National AI Institute for Adult Learning and Online Education, we are developing an Architecture for AI-Augmented Learning (A4L) for supporting adult learning through online education. We present the motivations, goals, requirements of the A4L architecture. We describe preliminary applications of A4L and discuss how it advances the goals of making learning more personalized and scalable.
zh

[AI-126] Responsibility Gap in Collective Decision Making IJCAI-25

【速读】:该论文试图解决集体决策机制中责任缺失(responsibility gap)的问题,即在没有单一责任主体的情况下产生的结果不确定性。解决方案的关键在于提出“选举独裁”(elected dictatorship)的概念,并证明在完全信息环境下,只有当决策机制为选举独裁时,责任缺失的间隙才会为空;而在不完全信息环境下,无间隙的机制类严格介于两种选举独裁类之间。

链接: https://arxiv.org/abs/2505.06312
作者: Pavel Naumov,Jia Tao
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: full version of an IJCAI-25 paper

点击查看摘要

Abstract:The responsibility gap is a set of outcomes of a collective decision-making mechanism in which no single agent is individually responsible. In general, when designing a decision-making process, it is desirable to minimise the gap. The paper proposes a concept of an elected dictatorship. It shows that, in a perfect information setting, the gap is empty if and only if the mechanism is an elected dictatorship. It also proves that in an imperfect information setting, the class of gap-free mechanisms is positioned strictly between two variations of the class of elected dictatorships. Comments: full version of an IJCAI-25 paper Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.06312 [cs.GT] (or arXiv:2505.06312v1 [cs.GT] for this version) https://doi.org/10.48550/arXiv.2505.06312 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-127] Defending against Indirect Prompt Injection by Instruction Detection

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)与外部数据集成时面临的间接提示注入(Indirect Prompt Injection, IPI)攻击问题,此类攻击通过嵌入外部数据中的隐藏指令操纵LLMs执行有害行为。解决方案的关键在于利用LLMs在正向和反向传播过程中的行为状态,通过分析中间层的隐藏状态和梯度,提取具有高度区分性的指令检测特征,从而实现对IPI攻击的有效检测。

链接: https://arxiv.org/abs/2505.06311
作者: Tongyu Wen,Chenglong Wang,Xiyuan Yang,Haoyu Tang,Yueqi Xie,Lingjuan Lyu,Zhicheng Dou,Fangzhao Wu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration introduces vulnerabilities of Indirect Prompt Injection (IPI) attacks, where hidden instructions embedded in external data can manipulate LLMs into executing unintended or harmful actions. We recognize that the success of IPI attacks fundamentally relies in the presence of instructions embedded within external content, which can alter the behavioral state of LLMs. Can effectively detecting such state changes help us defend against IPI attacks? In this paper, we propose a novel approach that takes external data as input and leverages the behavioral state of LLMs during both forward and backward propagation to detect potential IPI attacks. Specifically, we demonstrate that the hidden states and gradients from intermediate layers provide highly discriminative features for instruction detection. By effectively combining these features, our approach achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, while reducing the attack success rate to just 0.12% on the BIPIA benchmark.
zh

[AI-128] Large Language Model-driven Security Assistant for Internet of Things via Chain-of-Thought

【速读】:该论文旨在解决物联网(IoT)设备安全漏洞的自动、高效和准确理解问题,特别是在动态安全场景下传统方法难以适应复杂环境的挑战。其解决方案的关键在于提出一种基于大语言模型(LLM)的物联网安全助手,通过ICoT方法对安全漏洞的多个维度进行分解,并生成符合用户特定需求和专业水平的响应,从而提升LLM在复杂安全场景下的分析与推理能力,提供更精准、深入且个性化的安全建议与解决方案。

链接: https://arxiv.org/abs/2505.06307
作者: Mingfei Zeng,Ming Xie,Xixi Zheng,Chunhai Li,Chuan Zhang,Liehuang Zhu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid development of Internet of Things (IoT) technology has transformed people’s way of life and has a profound impact on both production and daily activities. However, with the rapid advancement of IoT technology, the security of IoT devices has become an unavoidable issue in both research and applications. Although some efforts have been made to detect or mitigate IoT security vulnerabilities, they often struggle to adapt to the complexity of IoT environments, especially when dealing with dynamic security scenarios. How to automatically, efficiently, and accurately understand these vulnerabilities remains a challenge. To address this, we propose an IoT security assistant driven by Large Language Model (LLM), which enhances the LLM’s understanding of IoT security vulnerabilities and related threats. The aim of the ICoT method we propose is to enable the LLM to understand security issues by breaking down the various dimensions of security vulnerabilities and generating responses tailored to the user’s specific needs and expertise level. By incorporating ICoT, LLM can gradually analyze and reason through complex security scenarios, resulting in more accurate, in-depth, and personalized security recommendations and solutions. Experimental results show that, compared to methods relying solely on LLM, our proposed LLM-driven IoT security assistant significantly improves the understanding of IoT security issues through the ICoT approach and provides personalized solutions based on the user’s identity, demonstrating higher accuracy and reliability.
zh

[AI-129] User Behavior Analysis in Privacy Protection with Large Language Models : A Study on Privacy Preferences with Limited Data

【速读】:该论文试图解决在数据受限环境下用户隐私偏好建模的挑战,传统方法依赖大规模用户数据,难以有效分析隐私偏好。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)结合少量样本学习(Few-shot Learning)与隐私计算技术,以提升隐私偏好建模的准确性,并通过差分隐私(Differential Privacy)和联邦学习(Federated Learning)降低用户数据泄露风险。

链接: https://arxiv.org/abs/2505.06305
作者: Haowei Yang,Qingyi Lu,Yang Wang,Sibei Liu,Jiayun Zheng,Ao Xiang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the widespread application of large language models (LLMs), user privacy protection has become a significant research topic. Existing privacy preference modeling methods often rely on large-scale user data, making effective privacy preference analysis challenging in data-limited environments. This study explores how LLMs can analyze user behavior related to privacy protection in scenarios with limited data and proposes a method that integrates Few-shot Learning and Privacy Computing to model user privacy preferences. The research utilizes anonymized user privacy settings data, survey responses, and simulated data, comparing the performance of traditional modeling approaches with LLM-based methods. Experimental results demonstrate that, even with limited data, LLMs significantly improve the accuracy of privacy preference modeling. Additionally, incorporating Differential Privacy and Federated Learning further reduces the risk of user data exposure. The findings provide new insights into the application of LLMs in privacy protection and offer theoretical support for advancing privacy computing and user behavior analysis.
zh

[AI-130] Collaborative Multi-LoRA Experts with Achievement-based Multi-Tasks Loss for Unified Multimodal Information Extraction IJCAI2025

【速读】:该论文旨在解决多模态信息抽取(Multimodal Information Extraction, MIE)任务中传统方法分别处理各任务导致知识共享不足,以及基于指令的T5模型全参数微调计算成本高、多任务微调易出现梯度冲突的问题。其解决方案的关键在于提出协同多LoRA专家与基于成就的多任务损失(C-LoRAE),通过引入通用专家学习跨任务的共享多模态知识和任务特定专家学习指令任务特征,提升模型在多个任务间的泛化能力,同时保持任务独立性并缓解梯度冲突,此外还通过成就-based多任务损失平衡不同任务的训练进度。

链接: https://arxiv.org/abs/2505.06303
作者: Li Yuan,Yi Cai,Xudong Shen,Qing Li,Qingbao Huang,Zikun Deng,Tao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Multimodal Information Extraction (MIE) has gained attention for extracting structured information from multimedia sources. Traditional methods tackle MIE tasks separately, missing opportunities to share knowledge across tasks. Recent approaches unify these tasks into a generation problem using instruction-based T5 models with visual adaptors, optimized through full-parameter fine-tuning. However, this method is computationally intensive, and multi-task fine-tuning often faces gradient conflicts, limiting performance. To address these challenges, we propose collaborative multi-LoRA experts with achievement-based multi-task loss (C-LoRAE) for MIE tasks. C-LoRAE extends the low-rank adaptation (LoRA) method by incorporating a universal expert to learn shared multimodal knowledge from cross-MIE tasks and task-specific experts to learn specialized instructional task features. This configuration enhances the model’s generalization ability across multiple tasks while maintaining the independence of various instruction tasks and mitigating gradient conflicts. Additionally, we propose an achievement-based multi-task loss to balance training progress across tasks, addressing the imbalance caused by varying numbers of training samples in MIE tasks. Experimental results on seven benchmark datasets across three key MIE tasks demonstrate that C-LoRAE achieves superior overall performance compared to traditional fine-tuning methods and LoRA methods while utilizing a comparable number of training parameters to LoRA.
zh

[AI-131] QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives

【速读】:该论文旨在解决在多样且不断演化的硬件架构(如RISC-V、ARM和GPU)上高效生成高性能张量操作符的问题,传统手动优化方法耗时长且难以适应不同硬件特性。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的张量操作符自动生成框架QiMeng-TensorOp,该框架通过单行用户指令,使LLMs能够自动利用硬件特性生成基于硬件原语的张量操作符,并针对不同硬件进行参数调优,从而显著提升计算性能并降低开发成本。

链接: https://arxiv.org/abs/2505.06302
作者: Xuzhi Zhang,Shaohui Peng,Qirui Zhou,Yuanbo Wen,Qi Guo,Ruizhi Chen,Xinguo Zhu,Weiqiang Xiong,Haixin Chen,Congying Ma,Ke Gao,Chen Zhao,Yanjun Wu,Yunji Chen,Ling Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural this http URL and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks this http URL excel at generating high-level language codes, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to 1291 \times performance improvement. Even compared with human experts, QiMeng-TensorOp could reach 251 % of OpenBLAS on RISC-V CPUs, and 124 % of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp also significantly reduces development costs by 200 \times compared with human experts.
zh

[AI-132] Domain-Adversarial Anatomical Graph Networks for Cross-User Human Activity Recognition

【速读】:该论文试图解决跨用户人体活动识别(Cross-user Human Activity Recognition, Cross-user HAR)中由于传感器位置、身体动力学和行为模式差异导致的可变性问题,传统方法难以捕捉跨用户的生物力学不变特征,从而限制了模型的泛化能力。其解决方案的关键在于提出了一种基于图的对抗域泛化框架(Edge-Enhanced Graph-Based Adversarial Domain Generalization, EEG-ADG),通过将解剖学相关性知识整合到统一的图神经网络(GNN)架构中,建模三种生物力学驱动的关系——互联单元、相似单元和侧向单元,同时利用变分边特征提取器处理用户特定的可变性,并通过梯度反转层(Gradient Reversal Layer, GRL)实现对抗域泛化,从而提升模型对未见用户的鲁棒性。

链接: https://arxiv.org/abs/2505.06301
作者: Xiaozhou Ye,Kevin I-Kai Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Cross-user variability in Human Activity Recognition (HAR) remains a critical challenge due to differences in sensor placement, body dynamics, and behavioral patterns. Traditional methods often fail to capture biomechanical invariants that persist across users, limiting their generalization capability. We propose an Edge-Enhanced Graph-Based Adversarial Domain Generalization (EEG-ADG) framework that integrates anatomical correlation knowledge into a unified graph neural network (GNN) architecture. By modeling three biomechanically motivated relationships together-Interconnected Units, Analogous Units, and Lateral Units-our method encodes domain-invariant features while addressing user-specific variability through Variational Edge Feature Extractor. A Gradient Reversal Layer (GRL) enforces adversarial domain generalization, ensuring robustness to unseen users. Extensive experiments on OPPORTUNITY and DSADS datasets demonstrate state-of-the-art performance. Our work bridges biomechanical principles with graph-based adversarial learning by integrating information fusion techniques. This fusion of information underpins our unified and generalized model for cross-user HAR.
zh

[AI-133] ARDNS-FN-Quantum: A Quantum-Enhanced Reinforcement Learning Framework with Cognitive-Inspired Adaptive Exploration for Dynamic Environments

【速读】:该论文旨在解决传统强化学习(Reinforcement Learning, RL)算法在动态环境中面临的有效探索、稳定性和适应性不足的问题。其解决方案的关键在于提出ARDNS-FN-Quantum框架,该框架融合了2量子比特量子电路用于动作选择、受人类认知启发的双记忆系统以及由奖励方差和好奇心调控的自适应探索策略,从而显著提升了算法在复杂环境中的性能与稳定性。

链接: https://arxiv.org/abs/2505.06300
作者: Umberto Gonçalves de Sousa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 7 figures

点击查看摘要

Abstract:Reinforcement learning (RL) has transformed sequential decision making, yet traditional algorithms like Deep Q-Networks (DQNs) and Proximal Policy Optimization (PPO) often struggle with efficient exploration, stability, and adaptability in dynamic environments. This study presents ARDNS-FN-Quantum (Adaptive Reward-Driven Neural Simulator with Quantum enhancement), a novel framework that integrates a 2-qubit quantum circuit for action selection, a dual-memory system inspired by human cognition, and adaptive exploration strategies modulated by reward variance and curiosity. Evaluated in a 10X10 grid-world over 20,000 episodes, ARDNS-FN-Quantum achieves a 99.5% success rate (versus 81.3% for DQN and 97.0% for PPO), a mean reward of 9.0528 across all episodes (versus 1.2941 for DQN and 7.6196 for PPO), and an average of 46.7 steps to goal (versus 135.9 for DQN and 62.5 for PPO). In the last 100 episodes, it records a mean reward of 9.1652 (versus 7.0916 for DQN and 9.0310 for PPO) and 37.2 steps to goal (versus 52.7 for DQN and 53.4 for PPO). Graphical analyses, including learning curves, steps-to-goal trends, reward variance, and reward distributions, demonstrate ARDNS-FN-Quantum’s superior stability (reward variance 5.424 across all episodes versus 252.262 for DQN and 76.583 for PPO) and efficiency. By bridging quantum computing, cognitive science, and RL, ARDNS-FN-Quantum offers a scalable, human-like approach to adaptive learning in uncertain environments, with potential applications in robotics, autonomous systems, and decision-making under uncertainty.
zh

[AI-134] Input-Specific and Universal Adversarial Attack Generation for Spiking Neural Networks in the Spiking Domain

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在安全方面的漏洞问题,特别是针对对抗攻击(adversarial attacks)的防御不足。其解决方案的关键在于提出两种新颖的对抗攻击算法:一种是针对特定输入的攻击,能够从特定数据集输入中生成对抗样本;另一种是通用攻击,能够生成可重复使用的扰动补丁,从而在大多数输入上引发误分类,具备实际部署的可行性。这些算法基于梯度,在脉冲域中运行,并在多个评估指标上表现出色,如对抗准确性、隐蔽性和生成时间。实验结果表明,所提出的攻击方法在两个广泛使用的类脑视觉数据集NMNIST和IBM DVS Gesture上均优于现有最先进的方法。

链接: https://arxiv.org/abs/2505.06299
作者: Spyridon Raptis,Haralampos-G. Stratigopoulos
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As Spiking Neural Networks (SNNs) gain traction across various applications, understanding their security vulnerabilities becomes increasingly important. In this work, we focus on the adversarial attacks, which is perhaps the most concerning threat. An adversarial attack aims at finding a subtle input perturbation to fool the network’s decision-making. We propose two novel adversarial attack algorithms for SNNs: an input-specific attack that crafts adversarial samples from specific dataset inputs and a universal attack that generates a reusable patch capable of inducing misclassification across most inputs, thus offering practical feasibility for real-time deployment. The algorithms are gradient-based operating in the spiking domain proving to be effective across different evaluation metrics, such as adversarial accuracy, stealthiness, and generation time. Experimental results on two widely used neuromorphic vision datasets, NMNIST and IBM DVS Gesture, show that our proposed attacks surpass in all metrics all existing state-of-the-art methods. Additionally, we present the first demonstration of adversarial attack generation in the sound domain using the SHD dataset.
zh

[AI-135] BedreFlyt: Improving Patient Flows through Hospital Wards with Digital Twins

【速读】:该论文试图解决医院住院部资源规划中的短期决策与长期战略规划问题,特别是针对住院患者需求的优化配置。其解决方案的关键在于构建一个数字孪生系统,该系统结合了可执行的形式化模型、本体用于知识表示以及SMT求解器用于约束可满足性分析,以探索假设性的“如果…会怎样”情景,从而提升战略规划过程并解决具体的短期决策任务。通过将到达的住院患者流转化为可由SMT技术求解的优化问题,系统能够根据领域知识建模所需的配置,并生成涵盖平均情况及最坏情况资源需求的场景,以支持不同情境下的资源分配决策。

链接: https://arxiv.org/abs/2505.06287
作者: Riccardo Sieve(Dept. of Informatics, University of Oslo),Paul Kobialka(Dept. of Informatics, University of Oslo),Laura Slaughter(dScience Center, University of Oslo),Rudolf Schlatte(Dept. of Informatics, University of Oslo),Einar Broch Johnsen(Dept. of Informatics, University of Oslo),Silvia Lizeth Tapia Tarifa(Dept. of Informatics, University of Oslo)
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Logic in Computer Science (cs.LO)
备注: In Proceedings ASQAP 2025, arXiv:2505.02873

点击查看摘要

Abstract:Digital twins are emerging as a valuable tool for short-term decision-making as well as for long-term strategic planning across numerous domains, including process industry, energy, space, transport, and healthcare. This paper reports on our ongoing work on designing a digital twin to enhance resource planning, e.g., for the in-patient ward needs in hospitals. By leveraging executable formal models for system exploration, ontologies for knowledge representation and an SMT solver for constraint satisfiability, our approach aims to explore hypothetical “what-if” scenarios to improve strategic planning processes, as well as to solve concrete, short-term decision-making tasks. Our proposed solution uses the executable formal model to turn a stream of arriving patients, that need to be hospitalized, into a stream of optimization problems, e.g., capturing daily inpatient ward needs, that can be solved by SMT techniques. The knowledge base, which formalizes domain knowledge, is used to model the needed configuration in the digital twin, allowing the twin to support both short-term decision-making and long-term strategic planning by generating scenarios spanning average-case as well as worst-case resource needs, depending on the expected treatment of patients, as well as ranging over variations in available resources, e.g., bed distribution in different rooms. We illustrate our digital twin architecture by considering the problem of bed bay allocation in a hospital ward.
zh

[AI-136] PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model ICML2025

【速读】:该论文旨在解决多目标测试时对齐(multi-objective test-time alignment)中因独立训练自回归奖励模型(Autoregressive Reward Models, ARMs)而导致的推理成本高和生成结果与用户偏好不一致的问题。其解决方案的关键在于提出一种统一的偏好感知自回归奖励模型(Preference-aware ARM, PARM),该模型通过偏好感知双线性低秩适配(Preference-Aware Bilinear Low-Rank Adaptation, PBLoRA)机制,在单个模型中联合处理所有偏好维度,从而实现对偏好权衡的精确控制,并降低推理开销。

链接: https://arxiv.org/abs/2505.06274
作者: Baijiong Lin,Weisen Jiang,Yuancheng Xu,Hao Chen,Ying-Cong Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2025

点击查看摘要

Abstract:Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping LLMs frozen. Recently, GenARM (Xu et al., 2025) first independently trains Autoregressive Reward Models (ARMs) for each preference dimension without awareness of each other, then combines their outputs based on user-specific preference vectors during inference to achieve multi-objective test-time alignment, leading to two key limitations: the need for \textitmultiple ARMs increases the inference cost, and the separate training of ARMs causes the misalignment between the guided generation and the user preferences. To address these issues, we propose Preference-aware ARM (PARM), a single unified ARM trained across all preference dimensions. PARM uses our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a bilinear form to condition the ARM on preference vectors, enabling it to achieve precise control over preference trade-offs during inference. Experiments demonstrate that PARM reduces inference costs and achieves better alignment with preference vectors compared with existing methods. Additionally, PARM enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training, making multi-objective alignment accessible with limited computing resources. The code is available at this https URL.
zh

[AI-137] Policy-labeled Preference Learning: Is Preference Enough for RLHF?

【速读】:该论文试图解决强化学习中基于人类反馈(RLHF)方法存在的似然不匹配问题,即现有方法常将轨迹误认为是由最优策略生成,导致似然估计不准确和学习效果不佳。其解决方案的关键在于提出了一种基于策略标注的偏好学习(Policy-labeled Preference Learning, PPL),通过引入后悔(regret)来建模人类偏好,从而反映行为策略的信息,并结合基于后悔原则的对比KL正则化,以提升序贯决策中的RLHF性能。

链接: https://arxiv.org/abs/2505.06273
作者: Taehyun Cho,Seokhun Ju,Seungyub Han,Dohyeong Kim,Kyungjae Lee,Jungwoo Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing policies via reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. Inspired by Direct Preference Optimization framework which directly learns optimal policy without explicit reward, we propose policy-labeled preference learning (PPL), to resolve likelihood mismatch issues by modeling human preferences with regret, which reflects behavior policy information. We also provide a contrastive KL regularization, derived from regret-based principles, to enhance RLHF in sequential decision making. Experiments in high-dimensional continuous control tasks demonstrate PPL’s significant improvements in offline RLHF performance and its effectiveness in online settings.
zh

[AI-138] A Sensitivity-Driven Expert Allocation Method in LoRA-MoE for Efficient Fine-Tuning

【速读】:该论文试图解决在多任务复杂数据集下,共享参数导致模型性能下降的问题,以及引入Mixture-of-Experts (MoE)方法后带来的参数数量增加和训练时间延长的问题。解决方案的关键在于提出一种基于参数敏感性的专家分配方法——LoRA-SMoE,该方法通过少量数据采样和梯度信息快速评估不同任务对参数的敏感性,并在给定预算内自适应分配专家数量,从而在保持与LoRA相近内存消耗的同时实现高效且资源友好的微调过程。

链接: https://arxiv.org/abs/2505.06272
作者: Junzhou Xu,Boyu Diao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As deep learning models expand, the pre-training-fine-tuning paradigm has become the standard approach for handling various downstream tasks. However, shared parameters can lead to diminished performance when dealing with complex datasets involving multiple tasks. While introducing Mixture-of-Experts (MoE) methods has alleviated this issue to some extent, it also significantly increases the number of parameters required for fine-tuning and training time, introducing greater parameter redundancy. To address these challenges, we propose a method for allocating expert numbers based on parameter sensitivity LoRA-SMoE (A Sensitivity-Driven Expert Allocation Method in LoRA-MoE for Efficient Fine-Tuning). This method rapidly assesses the sensitivity of different tasks to parameters by sampling a small amount of data and using gradient information. It then adaptively allocates expert numbers within a given budget. The process maintains comparable memory consumption to LoRA (Low-Rank Adaptation) while ensuring an efficient and resource-friendly fine-tuning procedure. Experimental results demonstrate that compared to SOTA fine-tuning methods, our LoRA-SMoE approach can enhance model performance while reducing the number of trainable parameters. This significantly improves model performance in resource-constrained environments. Additionally, due to its efficient parameter sensitivity evaluation mechanism, LoRA-SMoE requires minimal computational overhead to optimize expert allocation, making it particularly suitable for scenarios with limited computational resources. All the code in this study will be made publicly available following the acceptance of the paper for publication. Source code is at this https URL
zh

[AI-139] ri-MTL: A Triple Multitask Learning Approach for Respiratory Disease Diagnosis

【速读】:该论文试图解决在临床实践中如何通过整合呼吸音模式、疾病表现及患者元数据属性来提升肺部声音分类和疾病诊断性能的问题(respiratory sound classification and disease diagnosis)。解决方案的关键在于将多任务学习(Multitask Learning, MTL)与先进的深度学习架构相结合,以同时建模这些复杂关系,并验证元数据在MTL框架中的有效性。实验结果表明,将听诊器信息融入MTL架构能够显著提升分类和诊断性能。

链接: https://arxiv.org/abs/2505.06271
作者: June-Woo Kim,Sanghoon Lee,Miika Toikkanen,Daehwan Hwang,Kyunghoon Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted to EMBC 2025

点击查看摘要

Abstract:Auscultation remains a cornerstone of clinical practice, essential for both initial evaluation and continuous monitoring. Clinicians listen to the lung sounds and make a diagnosis by combining the patient’s medical history and test results. Given this strong association, multitask learning (MTL) can offer a compelling framework to simultaneously model these relationships, integrating respiratory sound patterns with disease manifestations. While MTL has shown considerable promise in medical applications, a significant research gap remains in understanding the complex interplay between respiratory sounds, disease manifestations, and patient metadata attributes. This study investigates how integrating MTL with cutting-edge deep learning architectures can enhance both respiratory sound classification and disease diagnosis. Specifically, we extend recent findings regarding the beneficial impact of metadata on respiratory sound classification by evaluating its effectiveness within an MTL framework. Our comprehensive experiments reveal significant improvements in both lung sound classification and diagnostic performance when the stethoscope information is incorporated into the MTL architecture.
zh

[AI-140] Importance Analysis for Dynamic Control of Balancing Parameter in a Simple Knowledge Distillation Setting

【速读】:该论文试图解决深度学习模型因复杂架构导致实时性能下降的问题,其解决方案的关键在于知识蒸馏(Knowledge Distillation, KD)中平衡参数的动态调整。研究表明,在简单的KD设置下,当损失函数下降时,应动态调整平衡参数以优化知识蒸馏与下游任务损失之间的影响力比例,从而提升模型压缩效果。

链接: https://arxiv.org/abs/2505.06270
作者: Seongmin Kim,Kwanho Kim,Minseung Kim,Kanghyun Jo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 3 pages, 2 figures, conference preprint for IWIS2025

点击查看摘要

Abstract:Although deep learning models owe their remarkable success to deep and complex architectures, this very complexity typically comes at the expense of real-time performance. To address this issue, a variety of model compression techniques have been proposed, among which knowledge distillation (KD) stands out for its strong empirical performance. The KD contains two concurrent processes: (i) matching the outputs of a large, pre-trained teacher network and a lightweight student network, and (ii) training the student to solve its designated downstream task. The associated loss functions are termed the distillation loss and the downsteam-task loss, respectively. Numerous prior studies report that KD is most effective when the influence of the distillation loss outweighs that of the downstream-task loss. The influence(or importance) is typically regulated by a balancing parameter. This paper provides a mathematical rationale showing that in a simple KD setting when the loss is decreasing, the balancing parameter should be dynamically adjusted
zh

[AI-141] Cluster-Aware Multi-Round Update for Wireless Federated Learning in Heterogeneous Environments

【速读】:该论文旨在解决无线联邦学习(Wireless Federated Learning, WFL)中由于资源限制导致的聚合效率和准确性下降问题,尤其是在设备数据分布和通信能力差异较大的异构环境中。其解决方案的关键在于提出一种基于先验知识相似性的聚类策略,将具有相似数据和通信特性的设备分组,并在此基础上设计了一种集群感知的多轮更新(Cluster-Aware Multi-round Update, CAMU)策略,通过将集群作为基本单元并根据聚类贡献阈值调整本地更新频率,有效降低更新偏差并提升聚合精度。

链接: https://arxiv.org/abs/2505.06268
作者: Pengcheng Sun,Erwu Liu,Wei Ni,Kanglei Yu,Rui Wang,Abbas Jamalipour
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The aggregation efficiency and accuracy of wireless Federated Learning (FL) are significantly affected by resource constraints, especially in heterogeneous environments where devices exhibit distinct data distributions and communication capabilities. This paper proposes a clustering strategy that leverages prior knowledge similarity to group devices with similar data and communication characteristics, mitigating performance degradation from heterogeneity. On this basis, a novel Cluster- Aware Multi-round Update (CAMU) strategy is proposed, which treats clusters as the basic units and adjusts the local update frequency based on the clustered contribution threshold, effectively reducing update bias and enhancing aggregation accuracy. The theoretical convergence of the CAMU strategy is rigorously validated. Meanwhile, based on the convergence upper bound, the local update frequency and transmission power of each cluster are jointly optimized to achieve an optimal balance between computation and communication resources under constrained conditions, significantly improving the convergence efficiency of FL. Experimental results demonstrate that the proposed method effectively improves the model performance of FL in heterogeneous environments and achieves a better balance between communication cost and computational load under limited resources.
zh

[AI-142] AKD : Adversarial Knowledge Distillation For Large Language Models Alignment on Coding tasks

【速读】:该论文旨在解决大型代码语言模型(Large Language Models, LLMs)在代码生成过程中面临的质量、安全性和可靠性问题,以及模型扩展带来的收益递减和高质量训练数据稀缺等挑战。其解决方案的关键在于提出一种名为对抗知识蒸馏(Adversarial Knowledge Distillation, AKD)的新方法,通过利用对抗生成的合成数据集,将大模型的能力提炼到更小、更高效的模型中,从而提升模型的鲁棒性、可靠性和安全性,同时提高参数效率。

链接: https://arxiv.org/abs/2505.06267
作者: Ilyas Oulkadda,Julien Perez
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The widespread adoption of Large Language Models (LLMs) for code generation, exemplified by GitHub Copilot\footnoteA coding extension powered by a Code-LLM to assist in code completion tasks surpassing a million users, highlights the transformative potential of these tools in improving developer productivity. However, this rapid growth also underscores critical concerns regarding the quality, safety, and reliability of the code they generate. As Code-LLMs evolve, they face significant challenges, including the diminishing returns of model scaling and the scarcity of new, high-quality training data. To address these issues, this paper introduces Adversarial Knowledge Distillation (AKD), a novel approach that leverages adversarially generated synthetic datasets to distill the capabilities of larger models into smaller, more efficient ones. By systematically stress-testing and refining the reasoning capabilities of Code-LLMs, AKD provides a framework for enhancing model robustness, reliability, and security while improving their parameter-efficiency. We believe this work represents a critical step toward ensuring dependable automated code generation within the constraints of existing data and the cost-efficiency of model execution.
zh

[AI-143] Knowledge Guided Encoder-Decoder Framework Integrating Multiple Physical Models for Agricultural Ecosystem Modeling

【速读】:该论文旨在解决传统基于过程的物理模型在参数不确定性高以及数据驱动模型在泛化能力不足的问题,特别是在农业监测中面对不同任务和数据分布变化时表现不佳。其解决方案的关键在于提出一种知识引导的编码器-解码器模型,该模型通过整合多个物理模型中的潜在过程知识来预测关键作物变量,并利用语言模型处理复杂且不一致的输入,同时实现选择性地融合不同物理模型的知识。

链接: https://arxiv.org/abs/2505.06266
作者: Qi Cheng,Licheng Liu,Zhang Yao,Hong Mu,Shiyuan Luo,Zhenong Jin,Yiqun Xie,Xiaowei Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agricultural monitoring is critical for ensuring food security, maintaining sustainable farming practices, informing policies on mitigating food shortage, and managing greenhouse gas emissions. Traditional process-based physical models are often designed and implemented for specific situations, and their parameters could also be highly uncertain. In contrast, data-driven models often use black-box structures and does not explicitly model the inter-dependence between different ecological variables. As a result, they require extensive training data and lack generalizability to different tasks with data distribution shifts and inconsistent observed variables. To address the need for more universal models, we propose a knowledge-guided encoder-decoder model, which can predict key crop variables by leveraging knowledge of underlying processes from multiple physical models. The proposed method also integrates a language model to process complex and inconsistent inputs and also utilizes it to implement a model selection mechanism for selectively combining the knowledge from different physical models. Our evaluations on predicting carbon and nitrogen fluxes for multiple sites demonstrate the effectiveness and robustness of the proposed model under various scenarios.
zh

[AI-144] Dialz: A Python Toolkit for Steering Vectors

【速读】:该论文试图解决如何有效控制和调整开源大语言模型(Large Language Models, LLMs)的输出,以减少有害内容(如刻板印象)并增强模型行为的可解释性。其解决方案的关键在于提出Dialz框架,该框架通过生成和应用转向向量(steering vectors)实现对模型激活状态的实时调整,从而放大或削弱特定概念(如诚实或积极态度)。Dialz强调模块化与易用性,支持从对比数据集构建到可视化分析的多样化任务,为研究者提供了高效、灵活的工具,以促进安全且可控的语言生成技术的发展。

链接: https://arxiv.org/abs/2505.06262
作者: Zara Siddique,Liam D. Turner,Luis Espinosa-Anke
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Dialz, a framework for advancing research on steering vectors for open-source LLMs, implemented in Python. Steering vectors allow users to modify activations at inference time to amplify or weaken a ‘concept’, e.g. honesty or positivity, providing a more powerful alternative to prompting or fine-tuning. Dialz supports a diverse set of tasks, including creating contrastive pair datasets, computing and applying steering vectors, and visualizations. Unlike existing libraries, Dialz emphasizes modularity and usability, enabling both rapid prototyping and in-depth analysis. We demonstrate how Dialz can be used to reduce harmful outputs such as stereotypes, while also providing insights into model behaviour across different layers. We release Dialz with full documentation, tutorials, and support for popular open-source models to encourage further research in safe and controllable language generation. Dialz enables faster research cycles and facilitates insights into model interpretability, paving the way for safer, more transparent, and more reliable AI systems.
zh

[AI-145] Modeling supply chain compliance response strategies based on AI synthetic data with structural path regression: A Simulation Study of EU 2027 Mandatory Labor Regulations

【速读】:该论文试图解决在欧盟2027年新强制性劳动合规政策背景下,供应链企业面临的工作时间管理要求严格及合规风险问题,旨在科学预测企业在政策影响下的应对行为和绩效结果。解决方案的关键在于构建一个整合生成式 AI (Generative AI) 合成数据生成机制与结构路径回归建模的方法框架,通过高精度模拟数据和多种统计分析方法(如多元线性回归、逻辑回归、中介效应和调节效应)建立企业战略转型路径的仿真模型,从而为企业的战略响应、政策设计及AI辅助决策提供定量依据。

链接: https://arxiv.org/abs/2505.06261
作者: Wei Meng
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: Simulated data modeling of the impact of non-tariff barriers in trade wars

点击查看摘要

Abstract:In the context of the new mandatory labor compliance in the European Union (EU), which will be implemented in 2027, supply chain enterprises face stringent working hour management requirements and compliance risks. In order to scientifically predict the enterprises’ coping behaviors and performance outcomes under the policy impact, this paper constructs a methodological framework that integrates the AI synthetic data generation mechanism and structural path regression modeling to simulate the enterprises’ strategic transition paths under the new regulations. In terms of research methodology, this paper adopts high-quality simulation data generated based on Monte Carlo mechanism and NIST synthetic data standards to construct a structural path analysis model that includes multiple linear regression, logistic regression, mediation effect and moderating effect. The variable system covers 14 indicators such as enterprise working hours, compliance investment, response speed, automation level, policy dependence, etc. The variable set with explanatory power is screened out through exploratory data analysis (EDA) and VIF multicollinearity elimination. The findings show that compliance investment has a significant positive impact on firm survival and its effect is transmitted through the mediating path of the level of intelligence; meanwhile, firms’ dependence on the EU market significantly moderates the strength of this mediating effect. It is concluded that AI synthetic data combined with structural path modeling provides an effective tool for high-intensity regulatory simulation, which can provide a quantitative basis for corporate strategic response, policy design and AI-assisted decision-making in the pre-prediction stage lacking real scenario data. Keywords: AI synthetic data, structural path regression modeling, compliance response strategy, EU 2027 mandatory labor regulation
zh

[AI-146] Fair Clustering with Clusterlets

【速读】:该论文试图解决聚类方法在现实世界中的公平性问题,特别是如何在保证聚类质量的同时实现公平性。其解决方案的关键在于提出了一种基于clusterlet的模糊聚类算法,通过匹配单类聚类来优化公平性,该方法利用clusterlet距离优化经典聚类目标,并通过正则化手段提升公平性。实验结果表明,简单的匹配策略能够实现较高的公平性,而适当的参数调节能进一步提高聚类的紧密性和降低重叠度。

链接: https://arxiv.org/abs/2505.06259
作者: Mattia Setzu,Riccardo Guidotti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Given their widespread usage in the real world, the fairness of clustering methods has become of major interest. Theoretical results on fair clustering show that fairness enjoys transitivity: given a set of small and fair clusters, a trivial centroid-based clustering algorithm yields a fair clustering. Unfortunately, discovering a suitable starting clustering can be computationally expensive, rather complex or arbitrary. In this paper, we propose a set of simple \emphclusterlet-based fuzzy clustering algorithms that match single-class clusters, optimizing fair clustering. Matching leverages clusterlet distance, optimizing for classic clustering objectives, while also regularizing for fairness. Empirical results show that simple matching strategies are able to achieve high fairness, and that appropriate parameter tuning allows to achieve high cohesion and low overlap. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.06259 [cs.LG] (or arXiv:2505.06259v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.06259 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-147] ABE: A Unified Framework for Robust and Faithful Attribution-Based Explainability

【速读】:该论文试图解决现有解释性框架(如InterpretDL和OmniXAI)在可扩展性、高耦合度、理论约束及用户友好性方面的局限性,从而阻碍了神经网络的透明性和互操作性。解决方案的关键在于提出一种统一的框架——基于解释性的归因(Attribution-Based Explainability, ABE),该框架形式化了基础归因方法,并整合了最先进的归因算法,同时确保符合归因公理。ABE通过四个可定制模块(鲁棒性、可解释性、验证和数据模型)支持研究人员开发新的归因技术,为基于归因的可解释性提供了可扩展和可扩展的基础。

链接: https://arxiv.org/abs/2505.06258
作者: Zhiyu Zhu,Jiayu Zhang,Zhibo Jin,Fang Chen,Jianlong Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Attribution algorithms are essential for enhancing the interpretability and trustworthiness of deep learning models by identifying key features driving model decisions. Existing frameworks, such as InterpretDL and OmniXAI, integrate multiple attribution methods but suffer from scalability limitations, high coupling, theoretical constraints, and lack of user-friendly implementations, hindering neural network transparency and interoperability. To address these challenges, we propose Attribution-Based Explainability (ABE), a unified framework that formalizes Fundamental Attribution Methods and integrates state-of-the-art attribution algorithms while ensuring compliance with attribution axioms. ABE enables researchers to develop novel attribution techniques and enhances interpretability through four customizable modules: Robustness, Interpretability, Validation, and Data Model. This framework provides a scalable, extensible foundation for advancing attribution-based explainability and fostering transparent AI systems. Our code is available at: this https URL.
zh

[AI-148] Beyond Attention: Toward Machines with Intrinsic Higher Mental States

【速读】:该论文试图解决在机器学习模型中确定相关信息的核心挑战,传统方法依赖于如反向传播等学习算法来处理。其解决方案的关键在于模仿大脑的高阶感知处理和清醒思维(想象)状态,通过三元神经元级调节循环,即问题(Q)、线索(K)和假设(V)之间的相互作用,实现信息的预先选择。这种机制支持在表示层面进行多样化、深层次的并行推理链,并允许从初始偏见快速转向更精确的理解,从而显著降低计算需求并提升学习效率。

链接: https://arxiv.org/abs/2505.06257
作者: Ahsan Adeel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Attending to what is relevant is fundamental to both the mammalian brain and modern machine learning models such as Transformers. Yet, determining relevance remains a core challenge, traditionally offloaded to learning algorithms like backpropagation. Inspired by recent cellular neurobiological evidence linking neocortical pyramidal cells to distinct mental states, this work shows how models (e.g., Transformers) can emulate high-level perceptual processing and awake thought (imagination) states to pre-select relevant information before applying attention. Triadic neuronal-level modulation loops among questions ( Q ), clues (keys, K ), and hypotheses (values, V ) enable diverse, deep, parallel reasoning chains at the representation level and allow a rapid shift from initial biases to refined understanding. This leads to orders-of-magnitude faster learning with significantly reduced computational demand (e.g., fewer heads, layers, and tokens), at an approximate cost of \mathcalO(N) , where N is the number of input tokens. Results span reinforcement learning (e.g., CarRacing in a high-dimensional visual setup), computer vision, and natural language question answering.
zh

[AI-149] United States Road Accident Prediction using Random Forest Predictor

【速读】:该论文试图解决道路交通事故的预测问题,以期为交通安全管理和预防策略提供科学依据。其解决方案的关键在于利用涵盖美国49个州的综合性交通数据集,并结合多种因素(如环境条件、人类行为和基础设施)进行分析,同时采用先进的机器学习模型,如回归分析、时间序列分析、随机森林和长短期记忆网络(LSTM),以实现对事故数量的准确预测。通过时空分析,识别出事故趋势、季节性变化及高风险区域,从而为政策制定者和交通管理部门提供量化洞察,支持资源的高效配置和针对性干预措施的实施。

链接: https://arxiv.org/abs/2505.06246
作者: Dominic Parosh Yamarthi,Haripriya Raman,Shamsad Parvin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
备注: 5 Pages, 8 Figures

点击查看摘要

Abstract:Road accidents significantly threaten public safety and require in-depth analysis for effective prevention and mitigation strategies. This paper focuses on predicting accidents through the examination of a comprehensive traffic dataset covering 49 states in the United States. The dataset integrates information from diverse sources, including transportation departments, law enforcement, and traffic sensors. This paper specifically emphasizes predicting the number of accidents, utilizing advanced machine learning models such as regression analysis and time series analysis. The inclusion of various factors, ranging from environmental conditions to human behavior and infrastructure, ensures a holistic understanding of the dynamics influencing road safety. Temporal and spatial analysis further allows for the identification of trends, seasonal variations, and high-risk areas. The implications of this research extend to proactive decision-making for policymakers and transportation authorities. By providing accurate predictions and quantifiable insights into expected accident rates under different conditions, the paper aims to empower authorities to allocate resources efficiently and implement targeted interventions. The goal is to contribute to the development of informed policies and interventions that enhance road safety, creating a safer environment for all road users. Keywords: Machine Learning, Random Forest, Accident Prediction, AutoML, LSTM.
zh

[AI-150] Diffused Responsibility: Analyzing the Energy Consumption of Generative Text-to-Audio Diffusion Models

【速读】:该论文试图解决生成式音频模型在推理过程中能耗过高所带来的能源消耗和环境影响问题。其解决方案的关键在于分析7种最先进的基于扩散的文本到音频生成模型的能耗情况,并评估不同生成参数对能耗的影响,同时通过考虑所有选定模型的帕累托最优解,寻找音频质量与能耗之间的最佳平衡点。

链接: https://arxiv.org/abs/2505.07615
作者: Riccardo Passoni,Francesca Ronchini,Luca Comanducci,Romain Serizel,Fabio Antonacci
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Text-to-audio models have recently emerged as a powerful technology for generating sound from textual descriptions. However, their high computational demands raise concerns about energy consumption and environmental impact. In this paper, we conduct an analysis of the energy usage of 7 state-of-the-art text-to-audio diffusion-based generative models, evaluating to what extent variations in generation parameters affect energy consumption at inference time. We also aim to identify an optimal balance between audio quality and energy consumption by considering Pareto-optimal solutions across all selected models. Our findings provide insights into the trade-offs between performance and environmental impact, contributing to the development of more efficient generative audio models.
zh

[AI-151] Can Generative AI agents behave like humans? Evidence from laboratory market experiments

【速读】:该论文试图解决如何利用生成式 AI (Generative AI) 模拟经济市场实验中人类行为的问题,特别是关注 LLM 代理之间的动态反馈机制。其解决方案的关键在于通过提供一个最小的上下文窗口(即三个前一时点的记忆)和高变异性的设置,以捕捉响应异质性,从而使得 LLM 能够再现人类实验中的宏观趋势,如正负反馈市场的区别。

链接: https://arxiv.org/abs/2505.07457
作者: R. Maria del Rio-Chanona,Marco Pangallo,Cars Hommes
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We explore the potential of Large Language Models (LLMs) to replicate human behavior in economic market experiments. Compared to previous studies, we focus on dynamic feedback between LLM agents: the decisions of each LLM impact the market price at the current step, and so affect the decisions of the other LLMs at the next step. We compare LLM behavior to market dynamics observed in laboratory settings and assess their alignment with human participants’ behavior. Our findings indicate that LLMs do not adhere strictly to rational expectations, displaying instead bounded rationality, similarly to human participants. Providing a minimal context window i.e. memory of three previous time steps, combined with a high variability setting capturing response heterogeneity, allows LLMs to replicate broad trends seen in human experiments, such as the distinction between positive and negative feedback markets. However, differences remain at a granular level–LLMs exhibit less heterogeneity in behavior than humans. These results suggest that LLMs hold promise as tools for simulating realistic human behavior in economic contexts, though further research is needed to refine their accuracy and increase behavioral diversity.
zh

[AI-152] GAN-based synthetic FDG PET images from T1 brain MRI can serve to improve performance of deep unsupervised anomaly detection models

【速读】:该论文试图解决医学影像跨模态翻译中由于缺乏大规模标注多模态数据集而导致的挑战,特别是针对生成的合成数据在深度模型训练中的任务相关性能评估不足的问题。解决方案的关键在于设计并比较多种基于生成对抗网络(Generative Adversarial Network, GAN)的框架,以从T1加权磁共振成像(MRI)数据生成合成[18F]氟脱氧葡萄糖(FDG)正电子发射断层扫描(PET)图像,并进一步评估这些合成PET数据在训练无监督异常检测(Unsupervised Anomaly Detection, UAD)模型中的有效性。研究引入了面向诊断任务的质量指标,并结合基于孪生自编码器的深度表征学习与一类支持向量机(OC-SVM)密度估计模型,验证了合成数据在癫痫病变检测中的实用性。

链接: https://arxiv.org/abs/2505.07364
作者: Daria Zotova(MYRIAD),Nicolas Pinon(MYRIAD),Robin Trombetta(MYRIAD),Romain Bouet(CRNL),Julien Jung(CRNL, HCL),Carole Lartizien(MYRIAD)
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Background and Objective. Research in the cross-modal medical image translation domain has been very productive over the past few years in tackling the scarce availability of large curated multimodality datasets with the promising performance of GAN-based architectures. However, only a few of these studies assessed task-based related performance of these synthetic data, especially for the training of deep models. Method. We design and compare different GAN-based frameworks for generating synthetic brain [18F]fluorodeoxyglucose (FDG) PET images from T1 weighted MRI data. We first perform standard qualitative and quantitative visual quality evaluation. Then, we explore further impact of using these fake PET data in the training of a deep unsupervised anomaly detection (UAD) model designed to detect subtle epilepsy lesions in T1 MRI and FDG PET images. We introduce novel diagnostic task-oriented quality metrics of the synthetic FDG PET data tailored to our unsupervised detection task, then use these fake data to train a use case UAD model combining a deep representation learning based on siamese autoencoders with a OC-SVM density support estimation model. This model is trained on normal subjects only and allows the detection of any variation from the pattern of the normal population. We compare the detection performance of models trained on 35 paired real MR T1 of normal subjects paired either on 35 true PET images or on 35 synthetic PET images generated from the best performing generative models. Performance analysis is conducted on 17 exams of epilepsy patients undergoing surgery. Results. The best performing GAN-based models allow generating realistic fake PET images of control subject with SSIM and PSNR values around 0.9 and 23.8, respectively and in distribution (ID) with regard to the true control dataset. The best UAD model trained on these synthetic normative PET data allows reaching 74% sensitivity. Conclusion. Our results confirm that GAN-based models are the best suited for MR T1 to FDG PET translation, outperforming transformer or diffusion models. We also demonstrate the diagnostic value of these synthetic data for the training of UAD models and evaluation on clinical exams of epilepsy patients. Our code and the normative image dataset are available.
zh

[AI-153] Piloting Structure-Based Drug Design via Modality-Specific Optimal Schedule ICML2025

【速读】:该论文旨在解决基于结构的药物设计(Structure-Based Drug Design, SBDD)中深度生成模型在几何结构建模方面的挑战,特别是多模态联合决定分子几何结构时的扭曲概率路径问题,即连续三维坐标与离散二维拓扑的联合建模难题。其解决方案的关键在于提出一种名为VLB-Optimal Scheduling (VOS) 的策略,通过优化变分下界(Variational Lower Bound, VLB)作为路径积分,以提升分子几何结构和相互作用建模的效果。

链接: https://arxiv.org/abs/2505.07286
作者: Keyue Qiu,Yuxuan Song,Zhehuan Fan,Peidong Liu,Zhe Zhang,Mingyue Zheng,Hao Zhou,Wei-Ying Ma
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICML 2025

点击查看摘要

Abstract:Structure-Based Drug Design (SBDD) is crucial for identifying bioactive molecules. Recent deep generative models are faced with challenges in geometric structure modeling. A major bottleneck lies in the twisted probability path of multi-modalities – continuous 3D positions and discrete 2D topologies – which jointly determine molecular geometries. By establishing the fact that noise schedules decide the Variational Lower Bound (VLB) for the twisted probability path, we propose VLB-Optimal Scheduling (VOS) strategy in this under-explored area, which optimizes VLB as a path integral for SBDD. Our model effectively enhances molecular geometries and interaction modeling, achieving state-of-the-art PoseBusters passing rate of 95.9% on CrossDock, more than 10% improvement upon strong baselines, while maintaining high affinities and robust intramolecular validity evaluated on held-out test set.
zh

[AI-154] Can LLM -based Financial Investing Strategies Outperform the Market in Long Run?

【速读】:该论文试图解决当前基于大型语言模型(Large Language Models, LLMs)的资产定价和股票交易策略在评估过程中存在的生存偏差和数据窥探偏差问题,这些问题导致了对LLM策略有效性的高估。论文提出的解决方案关键在于构建FINSABER,这是一个用于评估基于时机策略的回测框架,能够在更长的时间跨度和更大的股票标的范围内测试LLM策略的泛化能力和稳健性。通过系统化的回测分析,研究揭示了LLM策略在不同市场周期中的表现缺陷,并强调了提升趋势识别能力和制度感知风险控制的重要性。

链接: https://arxiv.org/abs/2505.07078
作者: Weixian Waylon Li,Hyeonjun Kim,Mihai Cucuringu,Tiejun Ma
机构: 未知
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 14 pages

点击查看摘要

Abstract:Large Language Models (LLMs) have recently been leveraged for asset pricing tasks and stock trading applications, enabling AI agents to generate investment decisions from unstructured financial data. However, most evaluations of LLM timing-based investing strategies are conducted on narrow timeframes and limited stock universes, overstating effectiveness due to survivorship and data-snooping biases. We critically assess their generalizability and robustness by proposing FINSABER, a backtesting framework evaluating timing-based strategies across longer periods and a larger universe of symbols. Systematic backtests over two decades and 100+ symbols reveal that previously reported LLM advantages deteriorate significantly under broader cross-section and over a longer-term evaluation. Our market regime analysis further demonstrates that LLM strategies are overly conservative in bull markets, underperforming passive benchmarks, and overly aggressive in bear markets, incurring heavy losses. These findings highlight the need to develop LLM strategies that are able to prioritise trend detection and regime-aware risk controls over mere scaling of framework complexity.
zh

[AI-155] Quantum Observers: A NISQ Hardware Demonstration of Chaotic State Prediction Using Quantum Echo-state Networks

【速读】:该论文试图解决传统神经网络(Neural Network, NN)在经典计算机上面临的计算效率与可扩展性受限的问题,以及如何有效将量子计算与神经网络结合以提升处理能力。其解决方案的关键在于提出一种新型的量子回声状态网络(Quantum Echo-State Network, QESN)设计及其实施算法,该算法能够在当前IBM量子硬件的噪声环境中正常运行,并通过经典控制理论响应分析验证其非线性动态特性、记忆能力和可调性,从而实现对长时间序列的高精度预测。

链接: https://arxiv.org/abs/2505.06799
作者: Erik L. Connerty,Ethan N. Evans,Gerasimos Angelatos,Vignesh Narayanan
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 14 pages, 12 figures

点击查看摘要

Abstract:Recent advances in artificial intelligence have highlighted the remarkable capabilities of neural network (NN)-powered systems on classical computers. However, these systems face significant computational challenges that limit scalability and efficiency. Quantum computers hold the potential to overcome these limitations and increase processing power beyond classical systems. Despite this, integrating quantum computing with NNs remains largely unrealized due to challenges posed by noise, decoherence, and high error rates in current quantum hardware. Here, we propose a novel quantum echo-state network (QESN) design and implementation algorithm that can operate within the presence of noise on current IBM hardware. We apply classical control-theoretic response analysis to characterize the QESN, emphasizing its rich nonlinear dynamics and memory, as well as its ability to be fine-tuned with sparsity and re-uploading blocks. We validate our approach through a comprehensive demonstration of QESNs functioning as quantum observers, applied in both high-fidelity simulations and hardware experiments utilizing data from a prototypical chaotic Lorenz system. Our results show that the QESN can predict long time-series with persistent memory, running over 100 times longer than the median T1 and T2 of the IBM Marrakesh QPU, achieving state-of-the-art time-series performance on superconducting hardware.
zh

[AI-156] A Short Overview of Multi-Modal Wi-Fi Sensing

【速读】:该论文试图解决多模态Wi-Fi感知(multi-modal Wi-Fi sensing)领域中缺乏系统性综述的问题,旨在总结过去24个月内的相关研究,并指出当前的局限性、挑战及未来发展方向。解决方案的关键在于通过整合其他模态数据作为教师信号,为Wi-Fi感知模型提供真实标签或鲁棒特征,或直接与Wi-Fi信号融合以提升感知性能。

链接: https://arxiv.org/abs/2505.06682
作者: Zijian Zhao
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Wi-Fi sensing has emerged as a significant technology in wireless sensing and Integrated Sensing and Communication (ISAC), offering benefits such as low cost, high penetration, and enhanced privacy. Currently, it is widely utilized in various applications, including action recognition, human localization, and crowd counting. However, Wi-Fi sensing also faces challenges, such as low robustness and difficulties in data collection. Recently, there has been an increasing focus on multi-modal Wi-Fi sensing, where other modalities can act as teachers, providing ground truth or robust features for Wi-Fi sensing models to learn from, or can be directly fused with Wi-Fi for enhanced sensing capabilities. Although these methods have demonstrated promising results and substantial value in practical applications, there is a lack of comprehensive surveys reviewing them. To address this gap, this paper reviews the multi-modal Wi-Fi sensing literature \textbffrom the past 24 months and highlights the current limitations, challenges and future directions in this field.
zh

[AI-157] Optimal Transport for Machine Learners

【速读】:该论文试图解决如何利用最优输运(Optimal Transport, OT)理论来比较概率分布,并将其应用于机器学习中的生成模型设计与评估问题。其解决方案的关键在于深入探讨OT的数学基础,包括Monge和Kantorovich形式化、Brenier定理、对偶与动态形式化、高斯分布上的Bures度量以及梯度流等核心概念,并结合数值方法如线性规划、半离散求解器和熵正则化,为机器学习任务提供理论支持与计算工具。

链接: https://arxiv.org/abs/2505.06589
作者: Gabriel Peyré
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: arXiv admin note: text overlap with arXiv:1803.00567

点击查看摘要

Abstract:Optimal Transport is a foundational mathematical theory that connects optimization, partial differential equations, and probability. It offers a powerful framework for comparing probability distributions and has recently become an important tool in machine learning, especially for designing and evaluating generative models. These course notes cover the fundamental mathematical aspects of OT, including the Monge and Kantorovich formulations, Brenier’s theorem, the dual and dynamic formulations, the Bures metric on Gaussian distributions, and gradient flows. It also introduces numerical methods such as linear programming, semi-discrete solvers, and entropic regularization. Applications in machine learning include topics like training neural networks via gradient flows, token dynamics in transformers, and the structure of GANs and diffusion models. These notes focus primarily on mathematical content rather than deep learning techniques.
zh

[AI-158] Attention Mechanisms in Dynamical Systems: A Case Study with Predator-Prey Models

【速读】:该论文试图解决如何利用生成式 AI (Generative AI) 对经典动力系统进行可解释的数据驱动分析与控制问题。其解决方案的关键在于使用简单的线性注意力模型,通过训练扰动时间序列数据来重建系统轨迹,并发现学习到的注意力权重能够与李雅普诺夫函数的几何结构对齐,从而作为敏感性分析的代理,捕捉相空间的关键特性而无需明确了解系统方程。

链接: https://arxiv.org/abs/2505.06503
作者: David Balaban
机构: 未知
类目: Dynamical Systems (math.DS); Artificial Intelligence (cs.AI)
备注: 5 figures, 12 pages, python code included

点击查看摘要

Abstract:Attention mechanisms are widely used in artificial intelligence to enhance performance and interpretability. In this paper, we investigate their utility in modeling classical dynamical systems – specifically, a noisy predator-prey (Lotka-Volterra) system. We train a simple linear attention model on perturbed time-series data to reconstruct system trajectories. Remarkably, the learned attention weights align with the geometric structure of the Lyapunov function: high attention corresponds to flat regions (where perturbations have small effect), and low attention aligns with steep regions (where perturbations have large effect). We further demonstrate that attention-based weighting can serve as a proxy for sensitivity analysis, capturing key phase-space properties without explicit knowledge of the system equations. These results suggest a novel use of AI-derived attention for interpretable, data-driven analysis and control of nonlinear systems. For example our framework could support future work in biological modeling of circadian rhythms, and interpretable machine learning for dynamical environments.
zh

[AI-159] Quantum State Preparation via Large-Language-Model-Driven Evolution

【速读】:该论文试图解决变分量子算法中传统方法在灵活性、可扩展性及对专家依赖方面的局限性。其解决方案的关键在于提出一种自动化框架——FunSearch,该框架通过将大语言模型(Large-Language Models, LLMs)与进化优化相结合,自主发现具有可扩展性和系统尺寸无关变分参数数量的硬件高效量子电路。

链接: https://arxiv.org/abs/2505.06347
作者: Qing-Hong Cao,Zong-Yue Hou,Ying-Ying Li,Xiaohui Liu,Zhuo-Yang Song,Liang-Qi Zhang,Shutao Zhang,Ke Zhao
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); High Energy Physics - Lattice (hep-lat); High Energy Physics - Phenomenology (hep-ph)
备注: 6 + 4 pages, 14 figures

点击查看摘要

Abstract:We propose an automated framework for quantum circuit design by integrating large-language models (LLMs) with evolutionary optimization to overcome the rigidity, scalability limitations, and expert dependence of traditional ones in variational quantum algorithms. Our approach (FunSearch) autonomously discovers hardware-efficient ansätze with new features of scalability and system-size-independent number of variational parameters entirely from scratch. Demonstrations on the Ising and XY spin chains with n = 9 qubits yield circuits containing 4 parameters, achieving near-exact energy extrapolation across system sizes. Implementations on quantum hardware (Zuchongzhi chip) validate practicality, where two-qubit quantum gate noises can be effectively mitigated via zero-noise extrapolations for a spin chain system as large as 20 sites. This framework bridges algorithmic design and experimental constraints, complementing contemporary quantum architecture search frameworks to advance scalable quantum simulations.
zh

[AI-160] Prediction of Delirium Risk in Mild Cognitive Impairment Using Time-Series data Machine Learning and Comorbidity Patterns – A Retrospective Study

【速读】:该论文旨在解决轻度认知障碍(MCI)患者中谵妄(delirium)风险评估与预测的问题,通过分析MCI相关的共病模式并构建纵向预测模型来识别高风险患者。其解决方案的关键在于利用机器学习方法,特别是长短期记忆网络(LSTM),结合时间序列数据、人口学变量、Charlson共病指数(CCI)评分及多种共病状况,实现了对谵妄发生的高效预测,模型表现出优异的性能,AUROC为0.93,AUPRC为0.92。

链接: https://arxiv.org/abs/2505.06264
作者: Santhakumar Ramamoorthy,Priya Rani,James Mahon,Glenn Mathews,Shaun Cloherty,Mahdi Babaei
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Delirium represents a significant clinical concern characterized by high morbidity and mortality rates, particularly in patients with mild cognitive impairment (MCI). This study investigates the associated risk factors for delirium by analyzing the comorbidity patterns relevant to MCI and developing a longitudinal predictive model leveraging machine learning methodologies. A retrospective analysis utilizing the MIMIC-IV v2.2 database was performed to evaluate comorbid conditions, survival probabilities, and predictive modeling outcomes. The examination of comorbidity patterns identified distinct risk profiles for the MCI population. Kaplan-Meier survival analysis demonstrated that individuals with MCI exhibit markedly reduced survival probabilities when developing delirium compared to their non-MCI counterparts, underscoring the heightened vulnerability within this cohort. For predictive modeling, a Long Short-Term Memory (LSTM) ML network was implemented utilizing time-series data, demographic variables, Charlson Comorbidity Index (CCI) scores, and an array of comorbid conditions. The model demonstrated robust predictive capabilities with an AUROC of 0.93 and an AUPRC of 0.92. This study underscores the critical role of comorbidities in evaluating delirium risk and highlights the efficacy of time-series predictive modeling in pinpointing patients at elevated risk for delirium development.
zh

[AI-161] SpectrumFM: A Foundation Model for Intelligent Spectrum Management

【速读】:该论文旨在解决传统智能频谱管理方法在识别准确率、收敛速度和泛化能力方面存在的显著局限性,尤其是在复杂动态的频谱环境中。其解决方案的关键在于提出了一种新型的频谱基础模型(SpectrumFM),该模型通过融合卷积神经网络与多头自注意力机制,实现了特征提取与鲁棒表示学习的协同增强,并通过两个新颖的自监督学习任务——掩码重建和下一时隙信号预测,利用大规模I/Q数据实现全面且可迁移的频谱表征。此外,该模型还引入了参数高效的微调策略,以适应多种下游频谱管理任务。

链接: https://arxiv.org/abs/2505.06256
作者: Fuhui Zhou,Chunyu Liu,Hao Zhang,Wei Wu,Qihui Wu,Derrick Wing Kwan Ng,Tony Q. S. Quek,Chan-Byoung Chae
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Intelligent spectrum management is crucial for improving spectrum efficiency and achieving secure utilization of spectrum resources. However, existing intelligent spectrum management methods, typically based on small-scale models, suffer from notable limitations in recognition accuracy, convergence speed, and generalization, particularly in the complex and dynamic spectrum environments. To address these challenges, this paper proposes a novel spectrum foundation model, termed SpectrumFM, establishing a new paradigm for spectrum management. SpectrumFM features an innovative encoder architecture that synergistically exploits the convolutional neural networks and the multi-head self-attention mechanisms to enhance feature extraction and enable robust representation learning. The model is pre-trained via two novel self-supervised learning tasks, namely masked reconstruction and next-slot signal prediction, which leverage large-scale in-phase and quadrature (IQ) data to achieve comprehensive and transferable spectrum representations. Furthermore, a parameter-efficient fine-tuning strategy is proposed to enable SpectrumFM to adapt to various downstream spectrum management tasks, including automatic modulation classification (AMC), wireless technology classification (WTC), spectrum sensing (SS), and anomaly detection (AD). Extensive experiments demonstrate that SpectrumFM achieves superior performance in terms of accuracy, robustness, adaptability, few-shot learning efficiency, and convergence speed, consistently outperforming conventional methods across multiple benchmarks. Specifically, SpectrumFM improves AMC accuracy by up to 12.1% and WTC accuracy by 9.3%, achieves an area under the curve (AUC) of 0.97 in SS at -4 dB signal-to-noise ratio (SNR), and enhances AD performance by over 10%.
zh

[AI-162] Low-Complexity CNN-Based Classification of Electroneurographic Signals

【速读】:该论文试图解决在植入式设备中对周围神经电信号(electroneurographic, ENG)进行实时分类的挑战,这一问题主要受限于计算复杂性和延迟。解决方案的关键在于提出MobilESCAPE-Net,这是一种轻量级架构,在保持甚至略微提升分类性能的同时显著降低了计算成本,通过减少可训练参数达99.9%以及每秒浮点运算次数降低92.47%,从而实现了更快的推理速度和实时处理能力。

链接: https://arxiv.org/abs/2505.06241
作者: Arek Berc Gokdag,Silvia Mura,Antonio Coviello,Michele Zhu,Maurizio Magarini,Umberto Spagnolini
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Peripheral nerve interfaces (PNIs) facilitate neural recording and stimulation for treating nerve injuries, but real-time classification of electroneurographic (ENG) signals remains challenging due to constraints on complexity and latency, particularly in implantable devices. This study introduces MobilESCAPE-Net, a lightweight architecture that reduces computational cost while maintaining and slightly improving classification performance. Compared to the state-of-the-art ESCAPE-Net, MobilESCAPE-Net achieves comparable accuracy and F1-score with significantly lower complexity, reducing trainable parameters by 99.9% and floating point operations per second by 92.47%, enabling faster inference and real-time processing. Its efficiency makes it well-suited for low-complexity ENG signal classification in resource-constrained environments such as implantable devices.
zh

机器学习

[LG-0] Automatically Differentiable Model Updating (ADiMU): conventional hybrid and neural network material model discovery including history-dependency

链接: https://arxiv.org/abs/2505.07801
作者: Bernardo P. Ferreira,Miguel A. Bessa
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 77 pages, 50 figures

点击查看摘要

Abstract:We introduce the first Automatically Differentiable Model Updating (ADiMU) framework that finds any history-dependent material model from full-field displacement and global force data (global, indirect discovery) or from strain-stress data (local, direct discovery). We show that ADiMU can update conventional (physics-based), neural network (data-driven), and hybrid material models. Moreover, this framework requires no fine-tuning of hyperparameters or additional quantities beyond those inherent to the user-selected material model architecture and optimizer. The robustness and versatility of ADiMU is extensively exemplified by updating different models spanning tens to millions of parameters, in both local and global discovery settings. Relying on fully differentiable code, the algorithmic implementation leverages vectorizing maps that enable history-dependent automatic differentiation via efficient batched execution of shared computation graphs. This contribution also aims to facilitate the integration, evaluation and application of future material model architectures by openly supporting the research community. Therefore, ADiMU is released as an open-source computational tool, integrated into a carefully designed and documented software named HookeAI.

[LG-1] A Theoretical Framework for Explaining Reinforcement Learning with Shapley Values

链接: https://arxiv.org/abs/2505.07797
作者: Daniel Beechey,Thomas M. S. Smith,Özgür Şimşek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning agents can achieve superhuman performance, but their decisions are often difficult to interpret. This lack of transparency limits deployment, especially in safety-critical settings where human trust and accountability are essential. In this work, we develop a theoretical framework for explaining reinforcement learning through the influence of state features, which represent what the agent observes in its environment. We identify three core elements of the agent-environment interaction that benefit from explanation: behaviour (what the agent does), performance (what the agent achieves), and value estimation (what the agent expects to achieve). We treat state features as players cooperating to produce each element and apply Shapley values, a principled method from cooperative game theory, to identify the influence of each feature. This approach yields a family of mathematically grounded explanations with clear semantics and theoretical guarantees. We use illustrative examples to show how these explanations align with human intuition and reveal novel insights. Our framework unifies and extends prior work, making explicit the assumptions behind existing approaches, and offers a principled foundation for more interpretable and trustworthy reinforcement learning.

[LG-2] Relative Overfitting and Accept-Reject Framework

链接: https://arxiv.org/abs/2505.07783
作者: Yanxin Liu,Yunqi Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Currently, the scaling law of Large Language Models (LLMs) faces challenges and bottlenecks. This paper posits that noise effects, stemming from changes in the signal-to-noise ratio under diminishing marginal returns, are the root cause of these issues. To control this noise, we investigated the differences between models with performance advantages and disadvantages, introducing the concept of “relative overfitting.” Based on their complementary strengths, we have proposed an application framework, Accept-Reject (AR). In Natural Language Processing (NLP), we use LLMs and Small Language Models (SLMs) as the medium for discussion. This framework enables SLMs to exert a universal positive influence on LLM decision outputs, rather than the intuitively expected negative influence. We validated our approach using self-built models based on mainstream architectures and pre-trained mainstream models across multiple datasets, including basic language modeling, long-context tasks, subject examination, and question-answering (QA) benchmarks. The results demonstrate that through our structure, compared to increasing the LLM’s parameters, we can achieve better performance improvements with significantly lower parameter and computational costs in many scenarios. These improvements are universal, stable, and effective. Furthermore, we explore the potential of “relative overfitting” and the AR framework in other machine learning domains, such as computer vision (CV) and AI for science. We hope the proposed approach can help scale laws overcome existing bottlenecks.

[LG-3] MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

链接: https://arxiv.org/abs/2505.07782
作者: Rushi Qiang,Yuchen Zhuang,Yinghao Li,Dingu Sagar V K,Rongzhi Zhang,Changhao Li,Ian Shu-Hei Wong,Sherry Yang,Percy Liang,Chao Zhang,Bo Dai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce MLE-Dojo, a Gym-style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo’s flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.

[LG-4] Synthesizing Diverse Network Flow Datasets with Scalable Dynamic Multigraph Generation

链接: https://arxiv.org/abs/2505.07777
作者: Arya Grayeli,Vipin Swarup,Steven E. Noel
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Obtaining real-world network datasets is often challenging because of privacy, security, and computational constraints. In the absence of such datasets, graph generative models become essential tools for creating synthetic datasets. In this paper, we introduce a novel machine learning model for generating high-fidelity synthetic network flow datasets that are representative of real-world networks. Our approach involves the generation of dynamic multigraphs using a stochastic Kronecker graph generator for structure generation and a tabular generative adversarial network for feature generation. We further employ an XGBoost (eXtreme Gradient Boosting) model for graph alignment, ensuring accurate overlay of features onto the generated graph structure. We evaluate our model using new metrics that assess both the accuracy and diversity of the synthetic graphs. Our results demonstrate improvements in accuracy over previous large-scale graph generation methods while maintaining similar efficiency. We also explore the trade-off between accuracy and diversity in synthetic graph dataset creation, a topic not extensively covered in related works. Our contributions include the synthesis and evaluation of large real-world netflow datasets and the definition of new metrics for evaluating synthetic graph generative models.

[LG-5] Solving Nonlinear PDEs with Sparse Radial Basis Function Networks

链接: https://arxiv.org/abs/2505.07765
作者: Zihan Shao,Konstantin Pieper,Xiaochuan Tian
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 35 pages, 7 figures

点击查看摘要

Abstract:We propose a novel framework for solving nonlinear PDEs using sparse radial basis function (RBF) networks. Sparsity-promoting regularization is employed to prevent over-parameterization and reduce redundant features. This work is motivated by longstanding challenges in traditional RBF collocation methods, along with the limitations of physics-informed neural networks (PINNs) and Gaussian process (GP) approaches, aiming to blend their respective strengths in a unified framework. The theoretical foundation of our approach lies in the function space of Reproducing Kernel Banach Spaces (RKBS) induced by one-hidden-layer neural networks of possibly infinite width. We prove a representer theorem showing that the solution to the sparse optimization problem in the RKBS admits a finite solution and establishes error bounds that offer a foundation for generalizing classical numerical analysis. The algorithmic framework is based on a three-phase algorithm to maintain computational efficiency through adaptive feature selection, second-order optimization, and pruning of inactive neurons. Numerical experiments demonstrate the effectiveness of our method and highlight cases where it offers notable advantages over GP approaches. This work opens new directions for adaptive PDE solvers grounded in rigorous analysis with efficient, learning-inspired implementation.

[LG-6] he Pitfalls of Benchmarking in Algorithm Selection: What We Are Getting Wrong

链接: https://arxiv.org/abs/2505.07750
作者: Gašper Petelin,Gjorgjina Cenikj
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Algorithm selection, aiming to identify the best algorithm for a given problem, plays a pivotal role in continuous black-box optimization. A common approach involves representing optimization functions using a set of features, which are then used to train a machine learning meta-model for selecting suitable algorithms. Various approaches have demonstrated the effectiveness of these algorithm selection meta-models. However, not all evaluation approaches are equally valid for assessing the performance of meta-models. We highlight methodological issues that frequently occur in the community and should be addressed when evaluating algorithm selection approaches. First, we identify flaws with the “leave-instance-out” evaluation technique. We show that non-informative features and meta-models can achieve high accuracy, which should not be the case with a well-designed evaluation framework. Second, we demonstrate that measuring the performance of optimization algorithms with metrics sensitive to the scale of the objective function requires careful consideration of how this impacts the construction of the meta-model, its predictions, and the model’s error. Such metrics can falsely present overly optimistic performance assessments of the meta-models. This paper emphasizes the importance of careful evaluation, as loosely defined methodologies can mislead researchers, divert efforts, and introduce noise into the field

[LG-7] Assessing the Chemical Intelligence of Large Language Models

链接: https://arxiv.org/abs/2505.07735
作者: Nicholas T. Runcie,Charlotte M. Deane,Fergus Imrie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models are versatile, general-purpose tools with a wide range of applications. Recently, the advent of “reasoning models” has led to substantial improvements in their abilities in advanced problem-solving domains such as mathematics and software engineering. In this work, we assessed the ability of reasoning models to directly perform chemistry tasks, without any assistance from external tools. We created a novel benchmark, called ChemIQ, which consists of 796 questions assessing core concepts in organic chemistry, focused on molecular comprehension and chemical reasoning. Unlike previous benchmarks, which primarily use multiple choice formats, our approach requires models to construct short-answer responses, more closely reflecting real-world applications. The reasoning models, exemplified by OpenAI’s o3-mini, correctly answered 28%-59% of questions depending on the reasoning level used, with higher reasoning levels significantly increasing performance on all tasks. These models substantially outperformed the non-reasoning model, GPT-4o, which achieved only 7% accuracy. We found that Large Language Models can now convert SMILES strings to IUPAC names, a task earlier models were unable to perform. Additionally, we show that the latest reasoning models can elucidate structures from 1H and 13C NMR data, correctly generating SMILES strings for 74% of molecules containing up to 10 heavy atoms, and in one case solving a structure comprising 21 heavy atoms. For each task, we found evidence that the reasoning process mirrors that of a human chemist. Our results demonstrate that the latest reasoning models have the ability to perform advanced chemical reasoning.

[LG-8] ISAC: An Invertible and Stable Auditory Filter Bank with Customizable Kernels for ML Integration

链接: https://arxiv.org/abs/2505.07709
作者: Daniel Haider,Felix Perfler,Peter Balazs,Clara Hollomey,Nicki Holighaus
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: Accepted at the IEEE International Conference on Sampling Theory and Applications (SampTA) 2025

点击查看摘要

Abstract:This paper introduces ISAC, an invertible and stable, perceptually-motivated filter bank that is specifically designed to be integrated into machine learning paradigms. More precisely, the center frequencies and bandwidths of the filters are chosen to follow a non-linear, auditory frequency scale, the filter kernels have user-defined maximum temporal support and may serve as learnable convolutional kernels, and there exists a corresponding filter bank such that both form a perfect reconstruction pair. ISAC provides a powerful and user-friendly audio front-end suitable for any application, including analysis-synthesis schemes.

[LG-9] 4TaStiC: Time and trend traveling time series clustering for classifying long-term type 2 diabetes patients

链接: https://arxiv.org/abs/2505.07702
作者: Onthada Preedasawakul,Nathakhun Wiroonsri
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Diabetes is one of the most prevalent diseases worldwide, characterized by persistently high blood sugar levels, capable of damaging various internal organs and systems. Diabetes patients require routine check-ups, resulting in a time series of laboratory records, such as hemoglobin A1c, which reflects each patient’s health behavior over time and informs their doctor’s recommendations. Clustering patients into groups based on their entire time series data assists doctors in making recommendations and choosing treatments without the need to review all records. However, time series clustering of this type of dataset introduces some challenges; patients visit their doctors at different time points, making it difficult to capture and match trends, peaks, and patterns. Additionally, two aspects must be considered: differences in the levels of laboratory results and differences in trends and patterns. To address these challenges, we introduce a new clustering algorithm called Time and Trend Traveling Time Series Clustering (4TaStiC), using a base dissimilarity measure combined with Euclidean and Pearson correlation metrics. We evaluated this algorithm on artificial datasets, comparing its performance with that of seven existing methods. The results show that 4TaStiC outperformed the other methods on the targeted datasets. Finally, we applied 4TaStiC to cluster a cohort of 1,989 type 2 diabetes patients at Siriraj Hospital. Each group of patients exhibits clear characteristics that will benefit doctors in making efficient clinical decisions. Furthermore, the proposed algorithm can be applied to contexts outside the medical field.

[LG-10] Heterogeneous Data Game: Characterizing the Model Competition Across Multiple Data Sources ICML2025

链接: https://arxiv.org/abs/2505.07688
作者: Renzhe Xu,Kang Wang,Bo Li
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Data heterogeneity across multiple sources is common in real-world machine learning (ML) settings. Although many methods focus on enabling a single model to handle diverse data, real-world markets often comprise multiple competing ML providers. In this paper, we propose a game-theoretic framework – the Heterogeneous Data Game – to analyze how such providers compete across heterogeneous data sources. We investigate the resulting pure Nash equilibria (PNE), showing that they can be non-existent, homogeneous (all providers converge on the same model), or heterogeneous (providers specialize in distinct data sources). Our analysis spans monopolistic, duopolistic, and more general markets, illustrating how factors such as the “temperature” of data-source choice models and the dominance of certain data sources shape equilibrium outcomes. We offer theoretical insights into both homogeneous and heterogeneous PNEs, guiding regulatory policies and practical strategies for competitive ML marketplaces.

[LG-11] SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models

链接: https://arxiv.org/abs/2505.07680
作者: Hang Wu,Jianian Zhu,Yinghui Li,Haojie Wang,Biao Hou,Jidong Zhai
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 10 pages

点击查看摘要

Abstract:Large Language Models (LLMs) present a critical trade-off between inference quality and computational cost: larger models offer superior capabilities but incur significant latency, while smaller models are faster but less powerful. Existing serving strategies often employ fixed model scales or static two-stage speculative decoding, failing to dynamically adapt to the varying complexities of user requests or fluctuations in system performance. This paper introduces \systemname, a novel framework that reimagines LLM inference as an adaptive routing problem solved through multi-level speculative decoding. \systemname dynamically constructs and optimizes inference “paths” (chains of models) based on real-time feedback, addressing the limitations of static approaches. Our contributions are threefold: (1) An \textbfadaptive model chain scheduling mechanism that leverages performance profiling (execution times) and predictive similarity metrics (derived from token distribution divergence) to continuously select the optimal sequence of draft and verifier models, minimizing predicted latency per generated token. (2) A \textbfmulti-level collaborative verification framework where intermediate models within the selected chain can validate speculative tokens, reducing the verification burden on the final, most powerful target model. (3) A \textbfsynchronized state management system providing efficient, consistent KV cache handling across heterogeneous models in the chain, including precise, low-overhead rollbacks tailored for asynchronous batch processing inherent in multi-level speculation. Preliminary experiments demonstrate the validity of our method.

[LG-12] Joint Graph Convolution and Sequential Modeling for Scalable Network Traffic Estimation

链接: https://arxiv.org/abs/2505.07674
作者: Nan Jiang,Wenxuan Zhu,Xu Han,Weiqiang Huang,Yumeng Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study focuses on the challenge of predicting network traffic within complex topological environments. It introduces a spatiotemporal modeling approach that integrates Graph Convolutional Networks (GCN) with Gated Recurrent Units (GRU). The GCN component captures spatial dependencies among network nodes, while the GRU component models the temporal evolution of traffic data. This combination allows for precise forecasting of future traffic patterns. The effectiveness of the proposed model is validated through comprehensive experiments on the real-world Abilene network traffic dataset. The model is benchmarked against several popular deep learning methods. Furthermore, a set of ablation experiments is conducted to examine the influence of various components on performance, including changes in the number of graph convolution layers, different temporal modeling strategies, and methods for constructing the adjacency matrix. Results indicate that the proposed approach achieves superior performance across multiple metrics, demonstrating robust stability and strong generalization capabilities in complex network traffic forecasting scenarios.

[LG-13] Generating Skyline Explanations for Graph Neural Networks

链接: https://arxiv.org/abs/2505.07635
作者: Dazhuo Qiu,Haolai Che,Arijit Khan,Yinghui Wu
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:This paper proposes a novel approach to generate subgraph explanations for graph neural networks GNNs that simultaneously optimize multiple measures for explainability. Existing GNN explanation methods often compute subgraphs (called ``explanatory subgraphs’') that optimize a pre-defined, single explainability measure, such as fidelity or conciseness. This can lead to biased explanations that cannot provide a comprehensive explanation to clarify the output of GNN models. We introduce skyline explanation, a GNN explanation paradigm that aims to identify k explanatory subgraphs by simultaneously optimizing multiple explainability measures. (1) We formulate skyline explanation generation as a multi-objective optimization problem, and pursue explanations that approximate a skyline set of explanatory subgraphs. We show the hardness for skyline explanation generation. (2) We design efficient algorithms with an onion-peeling approach that strategically removes edges from neighbors of nodes of interests, and incrementally improves explanations as it explores an interpretation domain, with provable quality guarantees. (3) We further develop an algorithm to diversify explanations to provide more comprehensive perspectives. Using real-world graphs, we empirically verify the effectiveness, efficiency, and scalability of our algorithms.

[LG-14] Enhancing Federated Learning with Kolmogorov-Arnold Networks: A Comparative Study Across Diverse Aggregation Strategies

链接: https://arxiv.org/abs/2505.07629
作者: Yizhou Ma,Zhuoqin Yang,Luis-Daniel Ibáñez
类目: Machine Learning (cs.LG)
*备注: This preprint has not undergone peer review or any post-submission improvements or corrections. It was prepared prior to submission to, and has since been accepted at, ICIC 2025. The final Version of Record will be published in the ICIC 2025 proceedings by Springer

点击查看摘要

Abstract:Multilayer Perceptron (MLP), as a simple yet powerful model, continues to be widely used in classification and regression tasks. However, traditional MLPs often struggle to efficiently capture nonlinear relationships in load data when dealing with complex datasets. Kolmogorov-Arnold Networks (KAN), inspired by the Kolmogorov-Arnold representation theorem, have shown promising capabilities in modeling complex nonlinear relationships. In this study, we explore the performance of KANs within federated learning (FL) frameworks and compare them to traditional Multilayer Perceptrons. Our experiments, conducted across four diverse datasets demonstrate that KANs consistently outperform MLPs in terms of accuracy, stability, and convergence efficiency. KANs exhibit remarkable robustness under varying client numbers and non-IID data distributions, maintaining superior performance even as client heterogeneity increases. Notably, KANs require fewer communication rounds to converge compared to MLPs, highlighting their efficiency in FL scenarios. Additionally, we evaluate multiple parameter aggregation strategies, with trimmed mean and FedProx emerging as the most effective for optimizing KAN performance. These findings establish KANs as a robust and scalable alternative to MLPs for federated learning tasks, paving the way for their application in decentralized and privacy-preserving environments.

[LG-15] rial and Trust: Addressing Byzantine Attacks with Comprehensive Defense Strategy

链接: https://arxiv.org/abs/2505.07614
作者: Gleb Molodtsov,Daniil Medyakov,Sergey Skorik,Nikolas Khachaturov,Shahane Tigranyan,Vladimir Aletov,Aram Avetisyan,Martin Takáč,Aleksandr Beznosikov
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Recent advancements in machine learning have improved performance while also increasing computational demands. While federated and distributed setups address these issues, their structure is vulnerable to malicious influences. In this paper, we address a specific threat, Byzantine attacks, where compromised clients inject adversarial updates to derail global convergence. We combine the trust scores concept with trial function methodology to dynamically filter outliers. Our methods address the critical limitations of previous approaches, allowing functionality even when Byzantine nodes are in the majority. Moreover, our algorithms adapt to widely used scaled methods like Adam and RMSProp, as well as practical scenarios, including local training and partial participation. We validate the robustness of our methods by conducting extensive experiments on both synthetic and real ECG data collected from medical institutions. Furthermore, we provide a broad theoretical analysis of our algorithms and their extensions to aforementioned practical setups. The convergence guarantees of our methods are comparable to those of classical algorithms developed without Byzantine interference.

[LG-16] Multi-Objective Reinforcement Learning for Energy-Efficient Industrial Control

链接: https://arxiv.org/abs/2505.07607
作者: Georg Schäfer,Raphael Seliger,Jakob Rehrl,Stefan Huber,Simon Hirlaender
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Accepted at DEXA 2025 (AI4IP)

点击查看摘要

Abstract:Industrial automation increasingly demands energy-efficient control strategies to balance performance with environmental and cost constraints. In this work, we present a multi-objective reinforcement learning (MORL) framework for energy-efficient control of the Quanser Aero 2 testbed in its one-degree-of-freedom configuration. We design a composite reward function that simultaneously penalizes tracking error and electrical power consumption. Preliminary experiments explore the influence of varying the Energy penalty weight, alpha, on the trade-off between pitch tracking and energy savings. Our results reveal a marked performance shift for alpha values between 0.0 and 0.25, with non-Pareto optimal solutions emerging at lower alpha values, on both the simulation and the real system. We hypothesize that these effects may be attributed to artifacts introduced by the adaptive behavior of the Adam optimizer, which could bias the learning process and favor bang-bang control strategies. Future work will focus on automating alpha selection through Gaussian Process-based Pareto front modeling and transitioning the approach from simulation to real-world deployment.

[LG-17] Finite-Sample-Based Reachability for Safe Control with Gaussian Process Dynamics

链接: https://arxiv.org/abs/2505.07594
作者: Manish Prajapat,Johannes Köhler,Amon Lahr,Andreas Krause,Melanie N. Zeilinger
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Gaussian Process (GP) regression is shown to be effective for learning unknown dynamics, enabling efficient and safety-aware control strategies across diverse applications. However, existing GP-based model predictive control (GP-MPC) methods either rely on approximations, thus lacking guarantees, or are overly conservative, which limits their practical utility. To close this gap, we present a sampling-based framework that efficiently propagates the model’s epistemic uncertainty while avoiding conservatism. We establish a novel sample complexity result that enables the construction of a reachable set using a finite number of dynamics functions sampled from the GP posterior. Building on this, we design a sampling-based GP-MPC scheme that is recursively feasible and guarantees closed-loop safety and stability with high probability. Finally, we showcase the effectiveness of our method on two numerical examples, highlighting accurate reachable set over-approximation and safe closed-loop performance.

[LG-18] Personalized Federated Learning under Model Dissimilarity Constraints

链接: https://arxiv.org/abs/2505.07575
作者: Samuel Erickson,Mikael Johansson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One of the defining challenges in federated learning is that of statistical heterogeneity among clients. We address this problem with KARULA, a regularized strategy for personalized federated learning, which constrains the pairwise model dissimilarities between clients based on the difference in their distributions, as measured by a surrogate for the 1-Wasserstein distance adapted for the federated setting. This allows the strategy to adapt to highly complex interrelations between clients, that e.g., clustered approaches fail to capture. We propose an inexact projected stochastic gradient algorithm to solve the constrained problem that the strategy defines, and show theoretically that it converges with smooth, possibly non-convex losses to a neighborhood of a stationary point with rate O(1/K). We demonstrate the effectiveness of KARULA on synthetic and real federated data sets.

[LG-19] Injecting Knowledge Graphs into Large Language Models

链接: https://arxiv.org/abs/2505.07554
作者: Erica Coppolillo
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Integrating structured knowledge from Knowledge Graphs (KGs) into Large Language Models (LLMs) remains a key challenge for symbolic reasoning. Existing methods mainly rely on prompt engineering or fine-tuning, which lose structural fidelity or incur high computational costs. Building on recent encoding techniques which integrate graph embeddings within the LLM input as tokens, we extend this paradigm to the KG domain by leveraging Knowledge Graph Embedding (KGE) models, thus enabling graph-aware reasoning. Our approach is model-agnostic, resource-efficient, and compatible with any LLMs. Extensive experimentation on synthetic and real-world datasets shows that our method improves reasoning performance over established baselines, further achieving the best trade-off in terms of accuracy and efficiency against state-of-the-art LLMs.

[LG-20] Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

链接: https://arxiv.org/abs/2505.07527
作者: Hu Wang,Congbo Ma,Ian Reid,Mohammad Yaqub
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reward baseline is important for Reinforcement Learning (RL) algorithms to reduce variance in policy gradient estimates. Recently, for language modeling, Group Relative Policy Optimization (GRPO) is proposed to compute the advantage for each output by subtracting the mean reward, as the baseline, for all outputs in the group. However, it can lead to inaccurate advantage estimates in environments with highly noisy rewards, potentially introducing bias. In this work, we propose a model, called Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), by using lightweight Kalman filtering to dynamically estimate the latent reward mean and variance. This filtering technique replaces the naive batch mean baseline, enabling more adaptive advantage normalization. Our method does not require additional learned parameters over GRPO. This approach offers a simple yet effective way to incorporate multiple outputs of GRPO into advantage estimation, improving policy optimization in settings where highly dynamic reward signals are difficult to model for language models. Through experiments and analyses, we show that using a more adaptive advantage estimation model, KRPO can improve the stability and performance of GRPO. The code is available at this https URL

[LG-21] Adaptive Latent-Space Constraints in Personalized FL

链接: https://arxiv.org/abs/2505.07525
作者: Sana Ayromlou,D. B. Emerson
类目: Machine Learning (cs.LG)
*备注: 14 Pages, 1 Algorithm, 3 Figures, 3 Tables

点击查看摘要

Abstract:Federated learning (FL) has become an effective and widely used approach to training deep learning models on decentralized datasets held by distinct clients. FL also strengthens both security and privacy protections for training data. Common challenges associated with statistical heterogeneity between distributed datasets have spurred significant interest in personalized FL (pFL) methods, where models combine aspects of global learning with local modeling specific to each client’s unique characteristics. In this work, the efficacy of theoretically supported, adaptive MMD measures within the Ditto framework, a state-of-the-art technique in pFL, are investigated. The use of such measures significantly improves model performance across a variety of tasks, especially those with pronounced feature heterogeneity. While the Ditto algorithm is specifically considered, such measures are directly applicable to a number of other pFL settings, and the results motivate the use of constraints tailored to the various kinds of heterogeneity expected in FL systems.

[LG-22] Identifying Causal Direction via Variational Bayesian Compression ICML2025

链接: https://arxiv.org/abs/2505.07503
作者: Quang-Duy Tran,Bao Duong,Phuoc Nguyen,Thin Nguyen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at the 42nd International Conference on Machine Learning (ICML2025)

点击查看摘要

Abstract:Telling apart the cause and effect between two random variables with purely observational data is a challenging problem that finds applications in various scientific disciplines. A key principle utilized in this task is the algorithmic Markov condition, which postulates that the joint distribution, when factorized according to the causal direction, yields a more succinct codelength compared to the anti-causal direction. Previous approaches approximate these codelengths by relying on simple functions or Gaussian processes (GPs) with easily evaluable complexity, compromising between model fitness and computational complexity. To overcome these limitations, we propose leveraging the variational Bayesian learning of neural networks as an interpretation of the codelengths. Consequently, we can enhance the model fitness while promoting the succinctness of the codelengths, while avoiding the significant computational complexity of the GP-based approaches. Extensive experiments on both synthetic and real-world benchmarks in cause-effect identification demonstrate the effectiveness of our proposed method, surpassing the overall performance of related complexity-based and structural causal model regression-based approaches.

[LG-23] Linux Kernel Configurations at Scale: A Dataset for Performance and Evolution Analysis

链接: https://arxiv.org/abs/2505.07487
作者: Heraldo Borges,Juliana Alves Pereira,Djamel Eddine Khelladi,Mathieu Acher
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Configuring the Linux kernel to meet specific requirements, such as binary size, is highly challenging due to its immense complexity-with over 15,000 interdependent options evolving rapidly across different versions. Although several studies have explored sampling strategies and machine learning methods to understand and predict the impact of configuration options, the literature still lacks a comprehensive and large-scale dataset encompassing multiple kernel versions along with detailed quantitative measurements. To bridge this gap, we introduce LinuxData, an accessible collection of kernel configurations spanning several kernel releases, specifically from versions 4.13 to 5.8. This dataset, gathered through automated tools and build processes, comprises over 240,000 kernel configurations systematically labeled with compilation outcomes and binary sizes. By providing detailed records of configuration evolution and capturing the intricate interplay among kernel options, our dataset enables innovative research in feature subset selection, prediction models based on machine learning, and transfer learning across kernel versions. Throughout this paper, we describe how the dataset has been made easily accessible via OpenML and illustrate how it can be leveraged using only a few lines of Python code to evaluate AI-based techniques, such as supervised machine learning. We anticipate that this dataset will significantly enhance reproducibility and foster new insights into configuration-space analysis at a scale that presents unique opportunities and inherent challenges, thereby advancing our understanding of the Linux kernel’s configurability and evolution.

[LG-24] Learning Penalty for Optimal Partitioning via Automatic Feature Extraction

链接: https://arxiv.org/abs/2505.07413
作者: Tung L Nguyen,Toby Hocking
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 9 Figures

点击查看摘要

Abstract:Changepoint detection identifies significant shifts in data sequences, making it important in areas like finance, genetics, and healthcare. The Optimal Partitioning algorithms efficiently detect these changes, using a penalty parameter to limit the changepoints number. Determining the appropriate value for this penalty can be challenging. Traditionally, this process involved manually extracting statistical features, such as sequence length or variance to make the prediction. This study proposes a novel approach that uses recurrent neural networks to learn this penalty directly from raw sequences by automatically extracting features. Experiments conducted on 20 benchmark genomic datasets show that this novel method surpasses traditional methods in partitioning accuracy in most cases.

[LG-25] Generalization Bounds and Stopping Rules for Learning with Self-Selected Data

链接: https://arxiv.org/abs/2505.07367
作者: Julian Rodemann,James Bailie
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 38 pages, 4 figures

点击查看摘要

Abstract:Many learning paradigms self-select training data in light of previously learned parameters. Examples include active learning, semi-supervised learning, bandits, or boosting. Rodemann et al. (2024) unify them under the framework of “reciprocal learning”. In this article, we address the question of how well these methods can generalize from their self-selected samples. In particular, we prove universal generalization bounds for reciprocal learning using covering numbers and Wasserstein ambiguity sets. Our results require no assumptions on the distribution of self-selected data, only verifiable conditions on the algorithms. We prove results for both convergent and finite iteration solutions. The latter are anytime valid, thereby giving rise to stopping rules for a practitioner seeking to guarantee the out-of-sample performance of their reciprocal learning algorithm. Finally, we illustrate our bounds and stopping rules for reciprocal learning’s special case of semi-supervised learning.

[LG-26] From Search To Sampling: Generative Models For Robust Algorithmic Recourse

链接: https://arxiv.org/abs/2505.07351
作者: Prateek Garg,Lokesh Nagalapatti,Sunita Sarawagi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Algorithmic Recourse provides recommendations to individuals who are adversely impacted by automated model decisions, on how to alter their profiles to achieve a favorable outcome. Effective recourse methods must balance three conflicting goals: proximity to the original profile to minimize cost, plausibility for realistic recourse, and validity to ensure the desired outcome. We show that existing methods train for these objectives separately and then search for recourse through a joint optimization over the recourse goals during inference, leading to poor recourse recommendations. We introduce GenRe, a generative recourse model designed to train the three recourse objectives jointly. Training such generative models is non-trivial due to lack of direct recourse supervision. We propose efficient ways to synthesize such supervision and further show that GenRe’s training leads to a consistent estimator. Unlike most prior methods, that employ non-robust gradient descent based search during inference, GenRe simply performs a forward sampling over the generative model to produce minimum cost recourse, leading to superior performance across multiple metrics. We also demonstrate GenRe provides the best trade-off between cost, plausibility and validity, compared to state-of-art baselines. Our code is available at: this https URL.

[LG-27] Private LoRA Fine-tuning of Open-Source LLM s with Homomorphic Encryption

链接: https://arxiv.org/abs/2505.07329
作者: Jordan Frery,Roman Bredehoft,Jakub Klemsa,Arthur Meyre,Andrei Stoian
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Preserving data confidentiality during the fine-tuning of open-source Large Language Models (LLMs) is crucial for sensitive applications. This work introduces an interactive protocol adapting the Low-Rank Adaptation (LoRA) technique for private fine-tuning. Homomorphic Encryption (HE) protects the confidentiality of training data and gradients handled by remote worker nodes performing the bulk of computations involving the base model weights. The data owner orchestrates training, requiring minimal local computing power and memory, thus alleviating the need for expensive client-side GPUs. We demonstrate feasibility by fine-tuning a Llama-3.2-1B model, presenting convergence results using HE-compatible quantization and performance benchmarks for HE computations on GPU hardware. This approach enables applications such as confidential knowledge base question answering, private codebase fine-tuning for AI code assistants, AI agents for drafting emails based on a company’s email archive, and adapting models to analyze sensitive legal or healthcare documents.

[LG-28] Uncertainty Profiles for LLM s: Uncertainty Source Decomposition and Adaptive Model-Metric Selection

链接: https://arxiv.org/abs/2505.07309
作者: Pei-Fu Guo,Yun-Da Tsai,Shou-De Lin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) often generate fluent but factually incorrect outputs, known as hallucinations, which undermine their reliability in real-world applications. While uncertainty estimation has emerged as a promising strategy for detecting such errors, current metrics offer limited interpretability and lack clarity about the types of uncertainty they capture. In this paper, we present a systematic framework for decomposing LLM uncertainty into four distinct sources, inspired by previous research. We develop a source-specific estimation pipeline to quantify these uncertainty types and evaluate how existing metrics relate to each source across tasks and models. Our results show that metrics, task, and model exhibit systematic variation in uncertainty characteristic. Building on this, we propose a method for task specific metric/model selection guided by the alignment or divergence between their uncertainty characteristics and that of a given task. Our experiments across datasets and models demonstrate that our uncertainty-aware selection strategy consistently outperforms baseline strategies, helping us select appropriate models or uncertainty metrics, and contributing to more reliable and efficient deployment in uncertainty estimation.

[LG-29] Online Episodic Convex Reinforcement Learning

链接: https://arxiv.org/abs/2505.07303
作者: Bianca Marin Moreno(Thoth, EDF Ramp;D, FiME Lab),Khaled Eldowa(UNIMI, POLIMI),Pierre Gaillard(Thoth),Margaux Brégère(EDF Ramp;D, LPSM),Nadia Oudjane(EDF Ramp;D, FiME Lab)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent’s policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge on the transition function. To achieve this, we use an online mirror descent algorithm with varying constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent’s policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.

[LG-30] INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning

链接: https://arxiv.org/abs/2505.07291
作者: Prime Intellect Team,Sami Jaghouar,Justus Mattern,Jack Min Ong,Jannik Straube,Manveer Basra,Aaron Pazdera,Kushal Thaman,Matthew Di Ferrante,Felix Gabriel,Fares Obeid,Kemal Erdem,Michael Keiblinger,Johannes Hagemann
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 26 pages, 12 figures

点击查看摘要

Abstract:We introduce INTELLECT-2, the first globally distributed reinforcement learning (RL) training run of a 32 billion parameter language model. Unlike traditional centralized training efforts, INTELLECT-2 trains a reasoning model using fully asynchronous RL across a dynamic, heterogeneous swarm of permissionless compute contributors. To enable a training run with this unique infrastructure, we built various components from scratch: we introduce PRIME-RL, our training framework purpose-built for distributed asynchronous reinforcement learning, based on top of novel components such as TOPLOC, which verifies rollouts from untrusted inference workers, and SHARDCAST, which efficiently broadcasts policy weights from training nodes to inference workers. Beyond infrastructure components, we propose modifications to the standard GRPO training recipe and data filtering techniques that were crucial to achieve training stability and ensure that our model successfully learned its training objective, thus improving upon QwQ-32B, the state of the art reasoning model in the 32B parameter range. We open-source INTELLECT-2 along with all of our code and data, hoping to encourage and enable more open research in the field of decentralized training. Comments: 26 pages, 12 figures Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2505.07291 [cs.LG] (or arXiv:2505.07291v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.07291 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Johannes Hagemann [view email] [v1] Mon, 12 May 2025 07:24:33 UTC (1,575 KB)

[LG-31] Cache-Efficient Posterior Sampling for Reinforcement Learning with LLM -Derived Priors Across Discrete and Continuous Domains

链接: https://arxiv.org/abs/2505.07274
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Integrating large language models (LLMs) as priors in reinforcement learning (RL) offers significant advantages but comes with substantial computational costs. We present a principled cache-efficient framework for posterior sampling with LLM-derived priors that dramatically reduces these costs while maintaining high performance. At the core of our approach is an adaptive caching mechanism, where cache parameters are meta-optimized using surrogate gradients derived from policy performance. This design enables efficient inference across both discrete text environments (e.g., TextWorld, ALFWorld) and continuous control domains (e.g., MuJoCo), achieving a 3.8–4.7 \times reduction in LLM queries and 4.0–12.0 \times lower median latencies (85–93,ms on a consumer GPU) while retaining 96–98% of uncached performance. Our theoretical analysis provides KL divergence bounds on approximation quality, validated empirically. The framework extends to offline RL, where our CQL-Prior variant improves performance by 14–29% and reduces training time by 38–40%. Extensive evaluations across a diverse suite of eight tasks demonstrate the generalizability and practical viability of LLM-guided RL in resource-constrained settings.

[LG-32] Compression Regularity Randomness and Emergent Structure: Rethinking Physical Complexity in the Data-Driven Era

链接: https://arxiv.org/abs/2505.07222
作者: Nima Dehghani
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Biological Physics (physics.bio-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:Complexity science offers a wide range of measures for quantifying unpredictability, structure, and information. Yet, a systematic conceptual organization of these measures is still missing. We present a unified framework that locates statistical, algorithmic, and dynamical measures along three axes (regularity, randomness, and complexity) and situates them in a common conceptual space. We map statistical, algorithmic, and dynamical measures into this conceptual space, discussing their computational accessibility and approximability. This taxonomy reveals the deep challenges posed by uncomputability and highlights the emergence of modern data-driven methods (including autoencoders, latent dynamical models, symbolic regression, and physics-informed neural networks) as pragmatic approximations to classical complexity ideals. Latent spaces emerge as operational arenas where regularity extraction, noise management, and structured compression converge, bridging theoretical foundations with practical modeling in high-dimensional systems. We close by outlining implications for physics-informed AI and AI-guided discovery in complex physical systems, arguing that classical questions of complexity remain central to next-generation scientific modeling. Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Biological Physics (physics.bio-ph); Data Analysis, Statistics and Probability (physics.data-an) Cite as: arXiv:2505.07222 [cs.LG] (or arXiv:2505.07222v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.07222 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Nima Dehghani [view email] [v1] Mon, 12 May 2025 04:30:42 UTC (53 KB)

[LG-33] Causal View of Time Series Imputation: Some Identification Results on Missing Mechanism

链接: https://arxiv.org/abs/2505.07180
作者: Ruichu Cai,Kaitao Zheng,Junxian Huang,Zijian Li,Zhengming Chen,Boyan Xu,Zhifeng Hao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Time series imputation is one of the most challenge problems and has broad applications in various fields like health care and the Internet of Things. Existing methods mainly aim to model the temporally latent dependencies and the generation process from the observed time series data. In real-world scenarios, different types of missing mechanisms, like MAR (Missing At Random), and MNAR (Missing Not At Random) can occur in time series data. However, existing methods often overlook the difference among the aforementioned missing mechanisms and use a single model for time series imputation, which can easily lead to misleading results due to mechanism mismatching. In this paper, we propose a framework for time series imputation problem by exploring Different Missing Mechanisms (DMM in short) and tailoring solutions accordingly. Specifically, we first analyze the data generation processes with temporal latent states and missing cause variables for different mechanisms. Sequentially, we model these generation processes via variational inference and estimate prior distributions of latent variables via normalizing flow-based neural architecture. Furthermore, we establish identifiability results under the nonlinear independent component analysis framework to show that latent variables are identifiable. Experimental results show that our method surpasses existing time series imputation techniques across various datasets with different missing mechanisms, demonstrating its effectiveness in real-world applications.

[LG-34] AugMixCloak: A Defense against Membership Inference Attacks via Image Transformation

链接: https://arxiv.org/abs/2505.07149
作者: Heqing Ren,Chao Feng,Alberto Huertas,Burkhard Stiller
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional machine learning (ML) raises serious privacy concerns, while federated learning (FL) mitigates the risk of data leakage by keeping data on local devices. However, the training process of FL can still leak sensitive information, which adversaries may exploit to infer private data. One of the most prominent threats is the membership inference attack (MIA), where the adversary aims to determine whether a particular data record was part of the training set. This paper addresses this problem through a two-stage defense called AugMixCloak. The core idea is to apply data augmentation and principal component analysis (PCA)-based information fusion to query images, which are detected by perceptual hashing (pHash) as either identical to or highly similar to images in the training set. Experimental results show that AugMixCloak successfully defends against both binary classifier-based MIA and metric-based MIA across five datasets and various decentralized FL (DFL) topologies. Compared with regularization-based defenses, AugMixCloak demonstrates stronger protection. Compared with confidence score masking, AugMixCloak exhibits better generalization. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2505.07149 [cs.LG] (or arXiv:2505.07149v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.07149 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-35] riangulating PL functions and the existence of efficient ReLU DNNs

链接: https://arxiv.org/abs/2505.07137
作者: Danny Calegari
类目: Machine Learning (cs.LG); Geometric Topology (math.GT)
*备注: 4 pages

点击查看摘要

Abstract:We show that every piecewise linear function f:R^d \to R with compact support a polyhedron P has a representation as a sum of so-called `simplex functions’. Such representations arise from degree 1 triangulations of the relative homology class (in R^d+1 ) bounded by P and the graph of f , and give a short elementary proof of the existence of efficient universal ReLU neural networks that simultaneously compute all such functions f of bounded complexity.

[LG-36] Learning from Samples: Inverse Problems over measures via Sharpened Fenchel-Young Losses

链接: https://arxiv.org/abs/2505.07124
作者: Francisco Andrade,Gabriel Peyré,Clarice Poon
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Estimating parameters from samples of an optimal probability distribution is essential in applications ranging from socio-economic modeling to biological system analysis. In these settings, the probability distribution arises as the solution to an optimization problem that captures either static interactions among agents or the dynamic evolution of a system over time. Our approach relies on minimizing a new class of loss functions, called sharpened Fenchel-Young losses, which measure the sub-optimality gap of the optimization problem over the space of measures. We study the stability of this estimation method when only a finite number of sample is available. The parameters to be estimated typically correspond to a cost function in static problems and to a potential function in dynamic problems. To analyze stability, we introduce a general methodology that leverages the strong convexity of the loss function together with the sample complexity of the forward optimization problem. Our analysis emphasizes two specific settings in the context of optimal transport, where our method provides explicit stability guarantees: The first is inverse unbalanced optimal transport (iUOT) with entropic regularization, where the parameters to estimate are cost functions that govern transport computations; this method has applications such as link prediction in machine learning. The second is inverse gradient flow (iJKO), where the objective is to recover a potential function that drives the evolution of a probability distribution via the Jordan-Kinderlehrer-Otto (JKO) time-discretization scheme; this is particularly relevant for understanding cell population dynamics in single-cell genomics. Finally, we validate our approach through numerical experiments on Gaussian distributions, where closed-form solutions are available, to demonstrate the practical performance of our methods

[LG-37] Knowledge Distillation for Enhancing Walmart E-commerce Search Relevance Using Large Language Models WWW

链接: https://arxiv.org/abs/2505.07105
作者: Hongwei Shang,Nguyen Vo,Nitin Yadav,Tian Zhang,Ajit Puthenputhussery,Xunfan Cai,Shuyi Chen,Prijith Chandran,Changsung Kang
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 9 pages, published at WWWW’25

点击查看摘要

Abstract:Ensuring the products displayed in e-commerce search results are relevant to users queries is crucial for improving the user experience. With their advanced semantic understanding, deep learning models have been widely used for relevance matching in search tasks. While large language models (LLMs) offer superior ranking capabilities, it is challenging to deploy LLMs in real-time systems due to the high-latency requirements. To leverage the ranking power of LLMs while meeting the low-latency demands of production systems, we propose a novel framework that distills a high performing LLM into a more efficient, low-latency student model. To help the student model learn more effectively from the teacher model, we first train the teacher LLM as a classification model with soft targets. Then, we train the student model to capture the relevance margin between pairs of products for a given query using mean squared error loss. Instead of using the same training data as the teacher model, we significantly expand the student model dataset by generating unlabeled data and labeling it with the teacher model predictions. Experimental results show that the student model performance continues to improve as the size of the augmented training data increases. In fact, with enough augmented data, the student model can outperform the teacher model. The student model has been successfully deployed in production at this http URL with significantly positive metrics.

[LG-38] Navigating the Rashomon Effect: How Personalization Can Help Adjust Interpretable Machine Learning Models to Individual Users

链接: https://arxiv.org/abs/2505.07100
作者: Julian Rosenberger,Philipp Schröppel,Sven Kruschel,Mathias Kraus,Patrick Zschech,Maximilian Förster
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: Accepted as a Completed Research Paper at the Thirty-Third European Conference on Information Systems (ECIS 2025), Amman, Jordan

点击查看摘要

Abstract:The Rashomon effect describes the observation that in machine learning (ML) multiple models often achieve similar predictive performance while explaining the underlying relationships in different ways. This observation holds even for intrinsically interpretable models, such as Generalized Additive Models (GAMs), which offer users valuable insights into the model’s behavior. Given the existence of multiple GAM configurations with similar predictive performance, a natural question is whether we can personalize these configurations based on users’ needs for interpretability. In our study, we developed an approach to personalize models based on contextual bandits. In an online experiment with 108 users in a personalized treatment and a non-personalized control group, we found that personalization led to individualized rather than one-size-fits-all configurations. Despite these individual adjustments, the interpretability remained high across both groups, with users reporting a strong understanding of the models. Our research offers initial insights into the potential for personalizing interpretable ML.

[LG-39] Physics-informed Multiple-Input Operators for efficient dynamic response prediction of structures

链接: https://arxiv.org/abs/2505.07090
作者: Bilal Ahmed,Yuqing Qiu,Diab W. Abueidda,Waleed El-Sekelly,Tarek Abdoun,Mostafa E. Mobasher
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Finite element (FE) modeling is essential for structural analysis but remains computationally intensive, especially under dynamic loading. While operator learning models have shown promise in replicating static structural responses at FEM level accuracy, modeling dynamic behavior remains more challenging. This work presents a Multiple Input Operator Network (MIONet) that incorporates a second trunk network to explicitly encode temporal dynamics, enabling accurate prediction of structural responses under moving loads. Traditional DeepONet architectures using recurrent neural networks (RNNs) are limited by fixed time discretization and struggle to capture continuous dynamics. In contrast, MIONet predicts responses continuously over both space and time, removing the need for step wise modeling. It maps scalar inputs including load type, velocity, spatial mesh, and time steps to full field structural responses. To improve efficiency and enforce physical consistency, we introduce a physics informed loss based on dynamic equilibrium using precomputed mass, damping, and stiffness matrices, without solving the governing PDEs directly. Further, a Schur complement formulation reduces the training domain, significantly cutting computational costs while preserving global accuracy. The model is validated on both a simple beam and the KW-51 bridge, achieving FEM level accuracy within seconds. Compared to GRU based DeepONet, our model offers comparable accuracy with improved temporal continuity and over 100 times faster inference, making it well suited for real-time structural monitoring and digital twin applications.

[LG-40] Multi-Objective-Guided Discrete Flow Matching for Controllable Biological Sequence Design

链接: https://arxiv.org/abs/2505.07086
作者: Tong Chen,Yinuo Zhang,Sophia Tang,Pranam Chatterjee
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Designing biological sequences that satisfy multiple, often conflicting, functional and biophysical criteria remains a central challenge in biomolecule engineering. While discrete flow matching models have recently shown promise for efficient sampling in high-dimensional sequence spaces, existing approaches address only single objectives or require continuous embeddings that can distort discrete distributions. We present Multi-Objective-Guided Discrete Flow Matching (MOG-DFM), a general framework to steer any pretrained discrete-time flow matching generator toward Pareto-efficient trade-offs across multiple scalar objectives. At each sampling step, MOG-DFM computes a hybrid rank-directional score for candidate transitions and applies an adaptive hypercone filter to enforce consistent multi-objective progression. We also trained two unconditional discrete flow matching models, PepDFM for diverse peptide generation and EnhancerDFM for functional enhancer DNA generation, as base generation models for MOG-DFM. We demonstrate MOG-DFM’s effectiveness in generating peptide binders optimized across five properties (hemolysis, non-fouling, solubility, half-life, and binding affinity), and in designing DNA sequences with specific enhancer classes and DNA shapes. In total, MOG-DFM proves to be a powerful tool for multi-property-guided biomolecule sequence design.

[LG-41] COMRECGC: Global Graph Counterfactual Explainer through Common Recourse ICML2025

链接: https://arxiv.org/abs/2505.07081
作者: Gregoire Fournier,Sourav Medya
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at ICML 2025

点击查看摘要

Abstract:Graph neural networks (GNNs) have been widely used in various domains such as social networks, molecular biology, or recommendation systems. Concurrently, different explanations methods of GNNs have arisen to complement its black-box nature. Explanations of the GNNs’ predictions can be categorized into two types–factual and counterfactual. Given a GNN trained on binary classification into ‘‘accept’’ and ‘‘reject’’ classes, a global counterfactual explanation consists in generating a small set of ‘‘accept’’ graphs relevant to all of the input ‘‘reject’’ graphs. The transformation of a ‘‘reject’’ graph into an ‘‘accept’’ graph is called a recourse. A common recourse explanation is a small set of recourse, from which every ‘‘reject’’ graph can be turned into an ‘‘accept’’ graph. Although local counterfactual explanations have been studied extensively, the problem of finding common recourse for global counterfactual explanation remains unexplored, particularly for GNNs. In this paper, we formalize the common recourse explanation problem, and design an effective algorithm, COMRECGC, to solve it. We benchmark our algorithm against strong baselines on four different real-world graphs datasets and demonstrate the superior performance of COMRECGC against the competitors. We also compare the common recourse explanations to the graph counterfactual explanation, showing that common recourse explanations are either comparable or superior, making them worth considering for applications such as drug discovery or computational biology.

[LG-42] Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures

链接: https://arxiv.org/abs/2505.07070
作者: Francesco Cagnetta,Alessandro Favero,Antonio Sclocchi,Matthieu Wyart
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
*备注: 14 pages, 8 figures

点击查看摘要

Abstract:How do neural language models acquire a language’s structure when trained for next-token prediction? We address this question by deriving theoretical scaling laws for neural network performance on synthetic datasets generated by the Random Hierarchy Model (RHM) – an ensemble of probabilistic context-free grammars designed to capture the hierarchical structure of natural language while remaining analytically tractable. Previously, we developed a theory of representation learning based on data correlations that explains how deep learning models capture the hierarchical structure of the data sequentially, one layer at a time. Here, we extend our theoretical framework to account for architectural differences. In particular, we predict and empirically validate that convolutional networks, whose structure aligns with that of the generative process through locality and weight sharing, enjoy a faster scaling of performance compared to transformer models, which rely on global self-attention mechanisms. This finding clarifies the architectural biases underlying neural scaling laws and highlights how representation learning is shaped by the interaction between model architecture and the statistical properties of data.

[LG-43] YANNs: Y-wise Affine Neural Networks for Exact and Efficient Representations of Piecewise Linear Functions

链接: https://arxiv.org/abs/2505.07054
作者: Austin Braniff,Yuhe Tian
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:This work formally introduces Y-wise Affine Neural Networks (YANNs), a fully-explainable network architecture that continuously and efficiently represent piecewise affine functions with polytopic subdomains. Following from the proofs, it is shown that the development of YANNs requires no training to achieve the functionally equivalent representation. YANNs thus maintain all mathematical properties of the original formulations. Multi-parametric model predictive control is utilized as an application showcase of YANNs, which theoretically computes optimal control laws as a piecewise affine function of states, outputs, setpoints, and disturbances. With the exact representation of multi-parametric control laws, YANNs retain essential control-theoretic guarantees such as recursive feasibility and stability. This sets YANNs apart from the existing works which apply neural networks for approximating optimal control laws instead of exactly representing them. By optimizing the inference speed of the networks, YANNs can evaluate substantially faster in real-time compared to traditional piecewise affine function calculations. Numerical case studies are presented to demonstrate the algorithmic scalability with respect to the input/output dimensions and the number of subdomains. YANNs represent a significant advancement in control as the first neural network-based controller that inherently ensures both feasibility and stability. Future applications can leverage them as an efficient and interpretable starting point for data-driven modeling/control.

[LG-44] Streaming Krylov-Accelerated Stochastic Gradient Descent

链接: https://arxiv.org/abs/2505.07046
作者: Stephen Thomas
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present SKA-SGD (Streaming Krylov-Accelerated Stochastic Gradient Descent), a novel optimization approach that accelerates convergence for ill-conditioned problems by projecting stochastic gradients onto a low-dimensional Krylov subspace. Directly inspired by recent advances in s-step Conjugate Gradient methods with streaming Gauss-Seidel Gram solvers \citedambra2025sstep, our method extends these techniques to the stochastic optimization domain. Our approach combines three key innovations: (1) projection coefficients computed via a single streaming Gauss-Seidel iteration, which is mathematically equivalent to Modified Gram-Schmidt orthogonalization; (2) a Chebyshev polynomial basis for constructing the Krylov subspace, providing superior numerical stability; and (3) efficient implementation for AMD GPUs using HIP. We prove that our streaming approach achieves a backward error near machine precision with O(s^2) complexity rather than O(s^3) , where s is the Krylov subspace dimension. Experimental results demonstrate that SKA-SGD significantly outperforms standard SGD and Adam in convergence rate and final error, particularly for problems with condition numbers exceeding 10^3 . GPU performance analysis reveals a crossover point where communication-avoiding benefits outweigh computational overhead, typically occurring at moderate scale ( p \approx 64 processors) for problem sizes n \geq 10^6 .

[LG-45] Reinforcement Learning (RL) Meets Urban Climate Modeling: Investigating the Efficacy and Impacts of RL-Based HVAC Control

链接: https://arxiv.org/abs/2505.07045
作者: Junjie Yu,John S. Schreck,David John Gagne,Keith W. Oleson,Jie Li,Yongtu Liang,Qi Liao,Mingfei Sun,David O. Topping,Zhonghua Zheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL)-based heating, ventilation, and air conditioning (HVAC) control has emerged as a promising technology for reducing building energy consumption while maintaining indoor thermal comfort. However, the efficacy of such strategies is influenced by the background climate and their implementation may potentially alter both the indoor climate and local urban climate. This study proposes an integrated framework combining RL with an urban climate model that incorporates a building energy model, aiming to evaluate the efficacy of RL-based HVAC control across different background climates, impacts of RL strategies on indoor climate and local urban climate, and the transferability of RL strategies across cities. Our findings reveal that the reward (defined as a weighted combination of energy consumption and thermal comfort) and the impacts of RL strategies on indoor climate and local urban climate exhibit marked variability across cities with different background climates. The sensitivity of reward weights and the transferability of RL strategies are also strongly influenced by the background climate. Cities in hot climates tend to achieve higher rewards across most reward weight configurations that balance energy consumption and thermal comfort, and those cities with more varying atmospheric temperatures demonstrate greater RL strategy transferability. These findings underscore the importance of thoroughly evaluating RL-based HVAC control strategies in diverse climatic contexts. This study also provides a new insight that city-to-city learning will potentially aid the deployment of RL-based HVAC control.

[LG-46] Efficient Machine Unlearning by Model Splitting and Core Sample Selection

链接: https://arxiv.org/abs/2505.07026
作者: Maximilian Egger,Rawad Bitar,Rüdiger Urbanke
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Machine unlearning is essential for meeting legal obligations such as the right to be forgotten, which requires the removal of specific data from machine learning models upon request. While several approaches to unlearning have been proposed, existing solutions often struggle with efficiency and, more critically, with the verification of unlearning - particularly in the case of weak unlearning guarantees, where verification remains an open challenge. We introduce a generalized variant of the standard unlearning metric that enables more efficient and precise unlearning strategies. We also present an unlearning-aware training procedure that, in many cases, allows for exact unlearning. We term our approach MaxRR. When exact unlearning is not feasible, MaxRR still supports efficient unlearning with properties closely matching those achieved through full retraining.

[LG-47] Source Anonymity for Private Random Walk Decentralized Learning

链接: https://arxiv.org/abs/2505.07011
作者: Maximilian Egger,Svenja Lage,Rawad Bitar,Antonia Wachter-Zeh
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper considers random walk-based decentralized learning, where at each iteration of the learning process, one user updates the model and sends it to a randomly chosen neighbor until a convergence criterion is met. Preserving data privacy is a central concern and open problem in decentralized learning. We propose a privacy-preserving algorithm based on public-key cryptography and anonymization. In this algorithm, the user updates the model and encrypts the result using a distant user’s public key. The encrypted result is then transmitted through the network with the goal of reaching that specific user. The key idea is to hide the source’s identity so that, when the destination user decrypts the result, it does not know who the source was. The challenge is to design a network-dependent probability distribution (at the source) over the potential destinations such that, from the receiver’s perspective, all users have a similar likelihood of being the source. We introduce the problem and construct a scheme that provides anonymity with theoretical guarantees. We focus on random regular graphs to establish rigorous guarantees.

[LG-48] GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance ICML2025

链接: https://arxiv.org/abs/2505.07004
作者: Jinuk Kim,Marwa El Halabi,Wonpyo Park,Clemens JS Schaefer,Deokjae Lee,Yeonhong Park,Jae W. Lee,Hyun Oh Song
类目: Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Post-training quantization is a key technique for reducing the memory and inference latency of large language models by quantizing weights and activations without requiring retraining. However, existing methods either (1) fail to account for the varying importance of hidden features to the end loss or, when incorporating end loss, (2) neglect the critical interactions between model weights. To address these limitations, we propose GuidedQuant, a novel quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. GuidedQuant consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. Additionally, we introduce a novel non-uniform scalar quantization algorithm, which is guaranteed to monotonically decrease the quantization objective value, and outperforms existing methods in this category. We release the code at this https URL.

[LG-49] Learning Value of Information towards Joint Communication and Control in 6G V2X

链接: https://arxiv.org/abs/2505.06978
作者: Lei Lei,Kan Zheng,Xuemin(Sherman)Shen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As Cellular Vehicle-to-Everything (C-V2X) evolves towards future sixth-generation (6G) networks, Connected Autonomous Vehicles (CAVs) are emerging to become a key application. Leveraging data-driven Machine Learning (ML), especially Deep Reinforcement Learning (DRL), is expected to significantly enhance CAV decision-making in both vehicle control and V2X communication under uncertainty. These two decision-making processes are closely intertwined, with the value of information (VoI) acting as a crucial bridge between them. In this paper, we introduce Sequential Stochastic Decision Process (SSDP) models to define and assess VoI, demonstrating their application in optimizing communication systems for CAVs. Specifically, we formally define the SSDP model and demonstrate that the MDP model is a special case of it. The SSDP model offers a key advantage by explicitly representing the set of information that can enhance decision-making when available. Furthermore, as current research on VoI remains fragmented, we propose a systematic VoI modeling framework grounded in the MDP, Reinforcement Learning (RL) and Optimal Control theories. We define different categories of VoI and discuss their corresponding estimation methods. Finally, we present a structured approach to leverage the various VoI metrics for optimizing the When", What", and ``How" to communicate problems. For this purpose, SSDP models are formulated with VoI-associated reward functions derived from VoI-based optimization objectives. While we use a simple vehicle-following control problem to illustrate the proposed methodology, it holds significant potential to facilitate the joint optimization of stochastic, sequential control and communication decisions in a wide range of networked control systems.

[LG-50] A Formally Verified Robustness Certifier for Neural Networks (Extended Version)

链接: https://arxiv.org/abs/2505.06958
作者: James Tobler,Hira Taqdees Syeda,Toby Murray
类目: Programming Languages (cs.PL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural networks are often susceptible to minor perturbations in input that cause them to misclassify. A recent solution to this problem is the use of globally-robust neural networks, which employ a function to certify that the classification of an input cannot be altered by such a perturbation. Outputs that pass this test are called certified robust. However, to the authors’ knowledge, these certification functions have not yet been verified at the implementation level. We demonstrate how previous unverified implementations are exploitably unsound in certain circumstances. Moreover, they often rely on approximation-based algorithms, such as power iteration, that (perhaps surprisingly) do not guarantee soundness. To provide assurance that a given output is robust, we implemented and formally verified a certification function for globally-robust neural networks in Dafny. We describe the program, its specifications, and the important design decisions taken for its implementation and verification, as well as our experience applying it in practice.

[LG-51] A systematic review of challenges and proposed solutions in modeling multimodal data

链接: https://arxiv.org/abs/2505.06945
作者: Maryam Farhadizadeh(1 and 2),Maria Weymann(2 and 3),Michael Blaß(4),Johann Kraus(5),Christopher Gundler(4),Sebastian Walter(6),Noah Hempen(1),Harald Binde(2 and 3),Nadine Binder(1 and 2) ((1) Institute of General Practice/Family Medicine, Faculty of Medicine and Medical Center - University of Freiburg, Germany, (2) Freiburg Center for Data Analysis, Modeling and AI, University of Freiburg, Germany, (3) Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Germany, (4) Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf, Germany, (5) Institute of Medical Systems Biology, Ulm University, Germany, (6) Department of Computer Science, Faculty of Engineering - University of Freiburg, Germany)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Multimodal data modeling has emerged as a powerful approach in clinical research, enabling the integration of diverse data types such as imaging, genomics, wearable sensors, and electronic health records. Despite its potential to improve diagnostic accuracy and support personalized care, modeling such heterogeneous data presents significant technical challenges. This systematic review synthesizes findings from 69 studies to identify common obstacles, including missing modalities, limited sample sizes, dimensionality imbalance, interpretability issues, and finding the optimal fusion techniques. We highlight recent methodological advances, such as transfer learning, generative models, attention mechanisms, and neural architecture search that offer promising solutions. By mapping current trends and innovations, this review provides a comprehensive overview of the field and offers practical insights to guide future research and development in multimodal modeling for medical applications.

[LG-52] Non-Stationary Time Series Forecasting Based on Fourier Analysis and Cross Attention Mechanism IJCNN2025

链接: https://arxiv.org/abs/2505.06917
作者: Yuqi Xiong,Yang Wen
类目: Machine Learning (cs.LG)
*备注: IJCNN 2025

点击查看摘要

Abstract:Time series forecasting has important applications in financial analysis, weather forecasting, and traffic management. However, existing deep learning models are limited in processing non-stationary time series data because they cannot effectively capture the statistical characteristics that change over time. To address this problem, this paper proposes a new framework, AEFIN, which enhances the information sharing ability between stable and unstable components by introducing a cross-attention mechanism, and combines Fourier analysis networks with MLP to deeply explore the seasonal patterns and trend characteristics in unstable components. In addition, we design a new loss function that combines time-domain stability constraints, time-domain instability constraints, and frequency-domain stability constraints to improve the accuracy and robustness of forecasting. Experimental results show that AEFIN outperforms the most common models in terms of mean square error and mean absolute error, especially under non-stationary data conditions, and shows excellent forecasting capabilities. This paper provides an innovative solution for the modeling and forecasting of non-stationary time series data, and contributes to the research of deep learning for complex time series.

[LG-53] Realistic Counterfactual Explanations for Machine Learning-Controlled Mobile Robots using 2D LiDAR

链接: https://arxiv.org/abs/2505.06906
作者: Sindre Benjamin Remman,Anastasios M. Lekkas
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted for publication at the 2025 European Control Conference (ECC)

点击查看摘要

Abstract:This paper presents a novel method for generating realistic counterfactual explanations (CFEs) in machine learning (ML)-based control for mobile robots using 2D LiDAR. ML models, especially artificial neural networks (ANNs), can provide advanced decision-making and control capabilities by learning from data. However, they often function as black boxes, making it challenging to interpret them. This is especially a problem in safety-critical control applications. To generate realistic CFEs, we parameterize the LiDAR space with simple shapes such as circles and rectangles, whose parameters are chosen by a genetic algorithm, and the configurations are transformed into LiDAR data by raycasting. Our model-agnostic approach generates CFEs in the form of synthetic LiDAR data that resembles a base LiDAR state but is modified to produce a pre-defined ML model control output based on a query from the user. We demonstrate our method on a mobile robot, the TurtleBot3, controlled using deep reinforcement learning (DRL) in real-world and simulated scenarios. Our method generates logical and realistic CFEs, which helps to interpret the DRL agent’s decision making. This paper contributes towards advancing explainable AI in mobile robotics, and our method could be a tool for understanding, debugging, and improving ML-based autonomous control.

[LG-54] Learning Soft Sparse Shapes for Efficient Time-Series Classification ICML2025

链接: https://arxiv.org/abs/2505.06892
作者: Zhen Liu,Yicheng Luo,Boyuan Li,Emadeldeen Eldele,Min Wu,Qianli Ma
类目: Machine Learning (cs.LG)
*备注: Accepted in ICML 2025

点击查看摘要

Abstract:Shapelets are discriminative subsequences (or shapes) with high interpretability in time series classification. Due to the time-intensive nature of shapelet discovery, existing shapelet-based methods mainly focus on selecting discriminative shapes while discarding others to achieve candidate subsequence sparsification. However, this approach may exclude beneficial shapes and overlook the varying contributions of shapelets to classification performance. To this end, we propose a \textbfSoft sparse \textbfShapes (\textbfSoftShape) model for efficient time series classification. Our approach mainly introduces soft shape sparsification and soft shape learning blocks. The former transforms shapes into soft representations based on classification contribution scores, merging lower-scored ones into a single shape to retain and differentiate all subsequence information. The latter facilitates intra- and inter-shape temporal pattern learning, improving model efficiency by using sparsified soft shapes as inputs. Specifically, we employ a learnable router to activate a subset of class-specific expert networks for intra-shape pattern learning. Meanwhile, a shared expert network learns inter-shape patterns by converting sparsified shapes into sequences. Extensive experiments show that SoftShape outperforms state-of-the-art methods and produces interpretable results.

[LG-55] Masked Subspace Clustering Methods

链接: https://arxiv.org/abs/2505.06863
作者: Jiebo Song,Huaming Ling
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:To further utilize the unsupervised features and pairwise information, we propose a general Bilevel Clustering Optimization (BCO) framework to improve the performance of clustering. And then we introduce three special cases on subspace clustering with two different types of masks. At first, we reformulate the original subspace clustering as a Basic Masked Subspace Clustering (BMSC), which reformulate the diagonal constraints to a hard mask. Then, we provide a General Masked Subspace Clustering (GMSC) method to integrate different clustering via a soft mask. Furthermore, based on BCO and GMSC, we induce a learnable soft mask and design a Recursive Masked Subspace Clustering (RMSC) method that can alternately update the affinity matrix and the soft mask. Numerical experiments show that our models obtain significant improvement compared with the baselines on several commonly used datasets, such as MNIST, USPS, ORL, COIL20 and COIL100.

[LG-56] FreqMoE: Dynamic Frequency Enhancement for Neural PDE Solvers IJCAI2025

链接: https://arxiv.org/abs/2505.06858
作者: Tianyu Chen,Haoyi Zhou,Ying Li,Hao Wang,Zhenzhe Zhang,Tianchen Zhu,Shanghang Zhang,Jianxin Li
类目: Machine Learning (cs.LG)
*备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Fourier Neural Operators (FNO) have emerged as promising solutions for efficiently solving partial differential equations (PDEs) by learning infinite-dimensional function mappings through frequency domain transformations. However, the sparsity of high-frequency signals limits computational efficiency for high-dimensional inputs, and fixed-pattern truncation often causes high-frequency signal loss, reducing performance in scenarios such as high-resolution inputs or long-term predictions. To address these challenges, we propose FreqMoE, an efficient and progressive training framework that exploits the dependency of high-frequency signals on low-frequency components. The model first learns low-frequency weights and then applies a sparse upward-cycling strategy to construct a mixture of experts (MoE) in the frequency domain, effectively extending the learned weights to high-frequency regions. Experiments on both regular and irregular grid PDEs demonstrate that FreqMoE achieves up to 16.6% accuracy improvement while using merely 2.1% parameters (47.32x reduction) compared to dense FNO. Furthermore, the approach demonstrates remarkable stability in long-term predictions and generalizes seamlessly to various FNO variants and grid structures, establishing a new ``Low frequency Pretraining, High frequency Fine-tuning’’ paradigm for solving PDEs.

[LG-57] Improving Random Forests by Smoothing

链接: https://arxiv.org/abs/2505.06852
作者: Ziyi Liu,Phuc Luong,Mario Boley,Daniel F. Schmidt
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 14 pages, 2 figures, 4 pages appendix, 3 figures in appendix

点击查看摘要

Abstract:Gaussian process regression is a popular model in the small data regime due to its sound uncertainty quantification and the exploitation of the smoothness of the regression function that is encountered in a wide range of practical problems. However, Gaussian processes perform sub-optimally when the degree of smoothness is non-homogeneous across the input domain. Random forest regression partially addresses this issue by providing local basis functions of variable support set sizes that are chosen in a data-driven way. However, they do so at the expense of forgoing any degree of smoothness, which often results in poor performance in the small data regime. Here, we aim to combine the advantages of both models by applying a kernel-based smoothing mechanism to a learned random forest or any other piecewise constant prediction function. As we demonstrate empirically, the resulting model consistently improves the predictive performance of the underlying random forests and, in almost all test cases, also improves the log loss of the usual uncertainty quantification based on inter-tree variance. The latter advantage can be attributed to the ability of the smoothing model to take into account the uncertainty over the exact tree-splitting locations.

[LG-58] Predictive Digital Twins for Thermal Management Using Machine Learning and Reduced-Order Models

链接: https://arxiv.org/abs/2505.06849
作者: Tamilselvan Subramani,Sebastian Bartscher
类目: Machine Learning (cs.LG)
*备注: 10 pages, 2 tables, from this http URL . thesis accepted at BITS Pilani, 2022

点击查看摘要

Abstract:Digital twins enable real-time simulation and prediction in engineering systems. This paper presents a novel framework for predictive digital twins of a headlamp heatsink, integrating physics-based reduced-order models (ROMs) from computational fluid dynamics (CFD) with supervised machine learning. A component-based ROM library, derived via proper orthogonal decomposition (POD), captures thermal dynamics efficiently. Machine learning models, including Decision Trees, k-Nearest Neighbors, Support Vector Regression (SVR), and Neural Networks, predict optimal ROM configurations, enabling rapid digital twin updates. The Neural Network achieves a mean absolute error (MAE) of 54.240, outperforming other models. Quantitative comparisons of predicted and original values demonstrate high accuracy. This scalable, interpretable framework advances thermal management in automotive systems, supporting robust design and predictive maintenance.

[LG-59] Streaming Sliced Optimal Transport

链接: https://arxiv.org/abs/2505.06835
作者: Khai Nguyen
类目: Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 28 pages, 9 figures, 3 tables

点击查看摘要

Abstract:Sliced optimal transport (SOT) or sliced Wasserstein (SW) distance is widely recognized for its statistical and computational scalability. In this work, we further enhance the computational scalability by proposing the first method for computing SW from sample streams, called \emphstreaming sliced Wasserstein (Stream-SW). To define Stream-SW, we first introduce the streaming computation of the one-dimensional Wasserstein distance. Since the one-dimensional Wasserstein (1DW) distance has a closed-form expression, given by the absolute difference between the quantile functions of the compared distributions, we leverage quantile approximation techniques for sample streams to define the streaming 1DW distance. By applying streaming 1DW to all projections, we obtain Stream-SW. The key advantage of Stream-SW is its low memory complexity while providing theoretical guarantees on the approximation error. We demonstrate that Stream-SW achieves a more accurate approximation of SW than random subsampling, with lower memory consumption, in comparing Gaussian distributions and mixtures of Gaussians from streaming samples. Additionally, we conduct experiments on point cloud classification, point cloud gradient flows, and streaming change point detection to further highlight the favorable performance of Stream-SW.

[LG-60] Deep Learning for On-Street Parking Violation Prediction

链接: https://arxiv.org/abs/2505.06818
作者: Thien Nhan Vo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Illegal parking along with the lack of available parking spaces are among the biggest issues faced in many large cities. These issues can have a significant impact on the quality of life of citizens. On-street parking systems have been designed to this end aiming at ensuring that parking spaces will be available for the local population, while also providing easy access to parking for people visiting the city center. However, these systems are often affected by illegal parking, providing incorrect information regarding the availability of parking spaces. Even though this can be mitigated using sensors for detecting the presence of cars in various parking sectors, the cost of these implementations is usually prohibiting large. In this paper, we investigate an indirect way of predicting parking violations at a fine-grained level, equipping such parking systems with a valuable tool for providing more accurate information to citizens. To this end, we employed a Deep Learning (DL)-based model to predict fine-grained parking violation rates for on-street parking systems. Moreover, we developed a data augmentation and smoothing technique for further improving the accuracy of DL models under the presence of missing and noisy data. We demonstrate, using experiments on real data collected in Thessaloniki, Greece, that the developed system can indeed provide accurate parking violation predictions.

[LG-61] opology Guidance: Controlling the Outputs of Generative Models via Vector Field Topology

链接: https://arxiv.org/abs/2505.06804
作者: Xiaohan Wang,Matthew Berger
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:For domains that involve numerical simulation, it can be computationally expensive to run an ensemble of simulations spanning a parameter space of interest to a user. To this end, an attractive surrogate for simulation is the generative modeling of fields produced by an ensemble, allowing one to synthesize fields in a computationally cheap, yet accurate, manner. However, for the purposes of visual analysis, a limitation of generative models is their lack of control, as it is unclear what one should expect when sampling a field from a model. In this paper we study how to make generative models of fields more controllable, so that users can specify features of interest, in particular topological features, that they wish to see in the output. We propose topology guidance, a method for guiding the sampling process of a generative model, specifically a diffusion model, such that a topological description specified as input is satisfied in the generated output. Central to our method, we couple a coordinate-based neural network used to represent fields, with a diffusion model used for generation. We show how to use topologically-relevant signals provided by the coordinate-based network to help guide the denoising process of a diffusion model. This enables us to faithfully represent a user’s specified topology, while ensuring that the output field remains within the generative data distribution. Specifically, we study 2D vector field topology, evaluating our method over an ensemble of fluid flows, where we show that generated vector fields faithfully adhere to the location, and type, of critical points over the spatial domain. We further show the benefits of our method in aiding the comparison of ensembles, allowing one to explore commonalities and differences in distributions along prescribed topological features.

[LG-62] JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes

链接: https://arxiv.org/abs/2505.06771
作者: Shalin Anand Jain,Jiazhen Liu,Siva Kailas,Harish Ravichandar
类目: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 22 pages, 14 figures, 10 tables

点击查看摘要

Abstract:Multi-agent reinforcement learning (MARL) has emerged as a promising solution for learning complex and scalable coordination behaviors in multi-robot systems. However, established MARL platforms (e.g., SMAC and MPE) lack robotics relevance and hardware deployment, leaving multi-robot learning researchers to develop bespoke environments and hardware testbeds dedicated to the development and evaluation of their individual contributions. The Multi-Agent RL Benchmark and Learning Environment for the Robotarium (MARBLER) is an exciting recent step in providing a standardized robotics-relevant platform for MARL, by bridging the Robotarium testbed with existing MARL software infrastructure. However, MARBLER lacks support for parallelization and GPU/TPU execution, making the platform prohibitively slow compared to modern MARL environments and hindering adoption. We contribute JaxRobotarium, a Jax-powered end-to-end simulation, learning, deployment, and benchmarking platform for the Robotarium. JaxRobotarium enables rapid training and deployment of multi-robot reinforcement learning (MRRL) policies with realistic robot dynamics and safety constraints, supporting both parallelization and hardware acceleration. Our generalizable learning interface provides an easy-to-use integration with SOTA MARL libraries (e.g., JaxMARL). In addition, JaxRobotarium includes eight standardized coordination scenarios, including four novel scenarios that bring established MARL benchmark tasks (e.g., RWARE and Level-Based Foraging) to a realistic robotics setting. We demonstrate that JaxRobotarium retains high simulation fidelity while achieving dramatic speedups over baseline (20x in training and 150x in simulation), and provides an open-access sim-to-real evaluation pipeline through the Robotarium testbed, accelerating and democratizing access to multi-robot learning research and evaluation.

[LG-63] Investigating Robotaxi Crash Severity Using Geographical Random Forest

链接: https://arxiv.org/abs/2505.06762
作者: Junfeng Jiao,Seung Gyu Baik,Seung Jun Choi,Yiming Xu
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 21 pages, 8 figures

点击查看摘要

Abstract:This paper quantitatively investigates the crash severity of Autonomous Vehicles (AVs) with spatially localized machine learning and macroscopic measures of the urban built environment. We address spatial heterogeneity and spatial autocorrelation, while focusing on land use patterns and human behavior. Our Geographical Random Forest (GRF) model, accompanied with a crash severity risk map of San Francisco, presents three findings that are useful for commercial operations of AVs and robotaxis. First, spatially localized machine learning performed better than regular machine learning, when predicting AV crash severity. Bias-variance tradeoff was evident as we adjust the localization weight hyperparameter. Second, land use was the most important built environment measure, compared to intersections, building footprints, public transit stops, and Points Of Interests (POIs). Third, it was predicted that city center areas with greater diversity and commercial activities were more likely to result in low-severity AV crashes, than residential neighborhoods. Residential land use may be associated with higher severity due to human behavior and less restrictive environment. This paper recommends to explicitly consider geographic locations, and to design safety measures specific to residential neighborhoods, when robotaxi operators train their AV systems.

[LG-64] Learning Graph Representation of Agent Diffuser AAMAS2025

链接: https://arxiv.org/abs/2505.06761
作者: Youcef Djenouri,Nassim Belmecheri,Tomasz Michalak,Jan Dubiński,Ahmed Nabil Belbachir,Anis Yazidi
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Accepted at AAMAS2025 International Conference on Autonomous Agents and Multiagent Systems

点击查看摘要

Abstract:Diffusion-based generative models have significantly advanced text-to-image synthesis, demonstrating impressive text comprehension and zero-shot generalization. These models refine images from random noise based on textual prompts, with initial reliance on text input shifting towards enhanced visual fidelity over time. This transition suggests that static model parameters might not optimally address the distinct phases of generation. We introduce LGR-AD (Learning Graph Representation of Agent Diffusers), a novel multi-agent system designed to improve adaptability in dynamic computer vision tasks. LGR-AD models the generation process as a distributed system of interacting agents, each representing an expert sub-model. These agents dynamically adapt to varying conditions and collaborate through a graph neural network that encodes their relationships and performance metrics. Our approach employs a coordination mechanism based on top- k maximum spanning trees, optimizing the generation process. Each agent’s decision-making is guided by a meta-model that minimizes a novel loss function, balancing accuracy and diversity. Theoretical analysis and extensive empirical evaluations show that LGR-AD outperforms traditional diffusion models across various benchmarks, highlighting its potential for scalable and flexible solutions in complex image generation tasks. Code is available at: this https URL

[LG-65] Privacy-aware Berrut Approximated Coded Computing applied to general distributed learning

链接: https://arxiv.org/abs/2505.06759
作者: Xavier Martínez-Luaña,Manuel Fernández-Veiga,Rebeca P. Díaz-Redondo,Ana Fernández-Vilas
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Coded computing is one of the techniques that can be used for privacy protection in Federated Learning. However, most of the constructions used for coded computing work only under the assumption that the computations involved are exact, generally restricted to special classes of functions, and require quantized inputs. This paper considers the use of Private Berrut Approximate Coded Computing (PBACC) as a general solution to add strong but non-perfect privacy to federated learning. We derive new adapted PBACC algorithms for centralized aggregation, secure distributed training with centralized data, and secure decentralized training with decentralized data, thus enlarging significantly the applications of the method and the existing privacy protection tools available for these paradigms. Particularly, PBACC can be used robustly to attain privacy guarantees in decentralized federated learning for a variety of models. Our numerical results show that the achievable quality of different learning models (convolutional neural networks, variational autoencoders, and Cox regression) is minimally altered by using these new computing schemes, and that the privacy leakage can be bounded strictly to less than a fraction of one bit per participant. Additionally, the computational cost of the encoding and decoding processes depends only of the degree of decentralization of the data.

[LG-66] Boltzmann Classifier: A Thermodynamic-Inspired Approach to Supervised Learning

链接: https://arxiv.org/abs/2505.06753
作者: Muhamed Amin,Bernard R. Brooks
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:We propose a novel classification algorithm, the Boltzmann Classifier, inspired by the thermodynamic principles underlying the Boltzmann distribution. Our method computes a probabilistic estimate for each class based on an energy function derived from feature-wise deviations between input samples and class-specific centroids. The resulting probabilities are proportional to the exponential negative energies, normalized across classes, analogous to the Boltzmann distribution used in statistical mechanics. In addition, the KT variable can be used to allow the high energy states to be more accessible, which allows the tuning of their probabilities as needed. We evaluate the model performance on several datasets from different applications. The model achieves a high accuracy, which indicates that the Boltzmann Classifier is competitive with standard models like logistic regression and k-nearest neighbors while offering a thermodynamically motivated probabilistic interpretation. our classifier does not require iterative optimization or backpropagation and is thus computationally efficient and easy to integrate into existing workflows. This work demonstrates how ideas from physics can inform new directions in machine learning, providing a foundation for interpretable, energy-based decision-making systems.

[LG-67] LineFlow: A Framework to Learn Active Control of Production Lines ICML2025

链接: https://arxiv.org/abs/2505.06744
作者: Kai Müller,Martin Wenzel,Tobias Windisch
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at ICML 2025

点击查看摘要

Abstract:Many production lines require active control mechanisms, such as adaptive routing, worker reallocation, and rescheduling, to maintain optimal performance. However, designing these control systems is challenging for various reasons, and while reinforcement learning (RL) has shown promise in addressing these challenges, a standardized and general framework is still lacking. In this work, we introduce LineFlow, an extensible, open-source Python framework for simulating production lines of arbitrary complexity and training RL agents to control them. To demonstrate the capabilities and to validate the underlying theoretical assumptions of LineFlow, we formulate core subproblems of active line control in ways that facilitate mathematical analysis. For each problem, we provide optimal solutions for comparison. We benchmark state-of-the-art RL algorithms and show that the learned policies approach optimal performance in well-understood scenarios. However, for more complex, industrial-scale production lines, RL still faces significant challenges, highlighting the need for further research in areas such as reward shaping, curriculum learning, and hierarchical control.

[LG-68] Activity and Subject Detection for UCI HAR Dataset with without missing Sensor Data

链接: https://arxiv.org/abs/2505.06730
作者: Debashish Saha,Piyush Malik,Adrika Saha
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current studies in Human Activity Recognition (HAR) primarily focus on the classification of activities through sensor data, while there is not much emphasis placed on recognizing the individuals performing these activities. This type of classification is very important for developing personalized and context-sensitive applications. Additionally, the issue of missing sensor data, which often occurs in practical situations due to hardware malfunctions, has not been explored yet. This paper seeks to fill these voids by introducing a lightweight LSTM-based model that can be used to classify both activities and subjects. The proposed model was used to classify the HAR dataset by UCI [1], achieving an accuracy of 93.89% in activity recognition (across six activities), nearing the 96.67% benchmark, and an accuracy of 80.19% in subject recognition (involving 30 subjects), thereby establishing a new baseline for this area of research. We then simulate the absence of sensor data to mirror real-world scenarios and incorporate imputation techniques, both with and without Principal Component Analysis (PCA), to restore incomplete datasets. We found that K-Nearest Neighbors (KNN) imputation performs the best for filling the missing sensor data without PCA because the use of PCA resulted in slightly lower accuracy. These results demonstrate how well the framework handles missing sensor data, which is a major step forward in using the Human Activity Recognition dataset for reliable classification tasks.

[LG-69] Beyond tildeO(sqrtT) Constraint Violation for Online Convex Optimization with Adversarial Constraints

链接: https://arxiv.org/abs/2505.06709
作者: Abhishek Sinha,Rahul Vaze
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We revisit the Online Convex Optimization problem with adversarial constraints (COCO) where, in each round, a learner is presented with a convex cost function and a convex constraint function, both of which may be chosen adversarially. The learner selects actions from a convex decision set in an online fashion, with the goal of minimizing both regret and the cumulative constraint violation (CCV) over a horizon of T rounds. The best-known policy for this problem achieves O(\sqrtT) regret and \tildeO(\sqrtT) CCV. In this paper, we present a surprising improvement that achieves a significantly smaller CCV by trading it off with regret. Specifically, for any bounded convex cost and constraint functions, we propose an online policy that achieves \tildeO(\sqrtdT+ T^\beta) regret and \tildeO(dT^1-\beta) CCV, where d is the dimension of the decision set and \beta \in [0,1] is a tunable parameter. We achieve this result by first considering the special case of \textsfConstrained Expert problem where the decision set is a probability simplex and the cost and constraint functions are linear. Leveraging a new adaptive small-loss regret bound, we propose an efficient policy for the \textsfConstrained Expert problem, that attains O(\sqrtT\ln N+T^\beta) regret and \tildeO(T^1-\beta \ln N) CCV, where N is the number of experts. The original problem is then reduced to the \textsfConstrained Expert problem via a covering argument. Finally, with an additional smoothness assumption, we propose an efficient gradient-based policy attaining O(T^\max(\frac12,\beta)) regret and \tildeO(T^1-\beta) CCV.

[LG-70] RuleGenie: SIEM Detection Rule Set Optimization

链接: https://arxiv.org/abs/2505.06701
作者: Akansha Shukla,Parth Atulbhai Gandhi,Yuval Elovici,Asaf Shabtai
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:SIEM systems serve as a critical hub, employing rule-based logic to detect and respond to threats. Redundant or overlapping rules in SIEM systems lead to excessive false alerts, degrading analyst performance due to alert fatigue, and increase computational overhead and response latency for actual threats. As a result, optimizing SIEM rule sets is essential for efficient operations. Despite the importance of such optimization, research in this area is limited, with current practices relying on manual optimization methods that are both time-consuming and error-prone due to the scale and complexity of enterprise-level rule sets. To address this gap, we present RuleGenie, a novel large language model (LLM) aided recommender system designed to optimize SIEM rule sets. Our approach leverages transformer models’ multi-head attention capabilities to generate SIEM rule embeddings, which are then analyzed using a similarity matching algorithm to identify the top-k most similar rules. The LLM then processes the rules identified, utilizing its information extraction, language understanding, and reasoning capabilities to analyze rule similarity, evaluate threat coverage and performance metrics, and deliver optimized recommendations for refining the rule set. By automating the rule optimization process, RuleGenie allows security teams to focus on more strategic tasks while enhancing the efficiency of SIEM systems and strengthening organizations’ security posture. We evaluated RuleGenie on a comprehensive set of real-world SIEM rule formats, including Splunk, Sigma, and AQL (Ariel query language), demonstrating its platform-agnostic capabilities and adaptability across diverse security infrastructures. Our experimental results show that RuleGenie can effectively identify redundant rules, which in turn decreases false positive rates and enhances overall rule efficiency.

[LG-71] Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws

链接: https://arxiv.org/abs/2505.06699
作者: Xiyuan Wei,Ming Lin,Fanjiang Ye,Fengguang Song,Liangliang Cao,My T. That,Tianbao Yang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 18 pages, 6 figures

点击查看摘要

Abstract:This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting, named \textbfmodel steering . While ad-hoc methods have been used in various contexts, including the training of large foundation models, its underlying principles remain insufficiently understood, leading to sub-optimal performance. In this work, we propose a theory-driven framework for model steering called \textbfDRRho risk minimization , which is rooted in Distributionally Robust Optimization (DRO). Through a generalization analysis, we provide theoretical insights into why this approach improves generalization and data efficiency compared to training without a reference model. To the best of our knowledge, this is the first time such theoretical insights are provided for the new learning paradigm, which significantly enhance our understanding and practice of model steering. Building on these insights and the connection between contrastive learning and DRO, we introduce a novel method for Contrastive Language-Image Pretraining (CLIP) with a reference model, termed DRRho-CLIP. Extensive experiments validate the theoretical insights, reveal a superior scaling law compared to CLIP without a reference model, and demonstrate its strength over existing heuristic approaches.

[LG-72] E2E-FANet: A Highly Generalizable Framework for Waves prediction Behind Floating Breakwaters via Exogenous-to-Endogenous Variable Attention

链接: https://arxiv.org/abs/2505.06690
作者: Jianxin Zhang,Lianzi Jiang,Xinyu Han,Xiangrong Wang,Weinan Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of waves behind floating breakwaters (FB) is crucial for optimizing coastal engineering structures, enhancing safety, and improving design efficiency. Existing methods demonstrate limitations in capturing nonlinear interactions between waves and structures, while exhibiting insufficient capability in modeling the complex frequency-domain relationships among elevations of different wave gauges. To address these challenges, this study introduces the Exogenous-to-Endogenous Frequency-Aware Network (E2E-FANet), a novel end-to-end neural network designed to model relationships between waves and structures. The E2E-FANetarchitecture incorporates a Dual-Basis Frequency Mapping (DBFM) module that leverages orthogonal cosine and sine bases to extract wave features from the frequency domain while preserving temporal information. Additionally, we introduce the Exogenous-to-Endogenous Cross-Attention (E2ECA) module, which employs cross attention to model the interactions between endogenous and exogenous variables. We incorporate a Temporal-wise Attention (TA) mechanism that adaptively captures complex dependencies in endogenous variables. These integrated modules function synergistically, enabling E2E-FANet to achieve both comprehensive feature perception in the time-frequency domain and precise modeling of wave-structure interactions. To comprehensively evaluate the performance of E2E-FANet, we constructed a multi-level validation framework comprising three distinct testing scenarios: internal validation under identical wave conditions, generalization testing across different wave conditions, and adaptability testing with varying relative water density (RW) conditions. These comprehensive tests demonstrate that E2E-FANet provides accurate waves behind FB predictions while successfully generalizing diverse wave conditions.

[LG-73] A Novel Framework for Significant Wave Height Prediction based on Adaptive Feature Extraction Time-Frequency Network

链接: https://arxiv.org/abs/2505.06688
作者: Jianxin Zhang,Lianzi Jiang,Xinyu Han,Xiangrong Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Precise forecasting of significant wave height (Hs) is essential for the development and utilization of wave energy. The challenges in predicting Hs arise from its non-linear and non-stationary characteristics. The combination of decomposition preprocessing and machine learning models have demonstrated significant effectiveness in Hs prediction by extracting data features. However, decomposing the unknown data in the test set can lead to data leakage issues. To simultaneously achieve data feature extraction and prevent data leakage, a novel Adaptive Feature Extraction Time-Frequency Network (AFE-TFNet) is proposed to improve prediction accuracy and stability. It is encoder-decoder rolling framework. The encoder consists of two stages: feature extraction and feature fusion. In the feature extraction stage, global and local frequency domain features are extracted by combining Wavelet Transform (WT) and Fourier Transform (FT), and multi-scale frequency analysis is performed using Inception blocks. In the feature fusion stage, time-domain and frequency-domain features are integrated through dominant harmonic sequence energy weighting (DHSEW). The decoder employed an advanced long short-term memory (LSTM) model. Hourly measured wind speed (Ws), dominant wave period (DPD), average wave period (APD) and Hs from three stations are used as the dataset, and the four metrics are employed to evaluate the forecasting performance. Results show that AFE-TFNet significantly outperforms benchmark methods in terms of prediction accuracy. Feature extraction can significantly improve the prediction accuracy. DHSEW has substantially increased the accuracy of medium-term to long-term forecasting. The prediction accuracy of AFE-TFNet does not demonstrate significant variability with changes of rolling time window size. Overall, AFE-TFNet shows strong potential for handling complex signal forecasting.

[LG-74] Geometry of Learning – L2 Phase Transitions in Deep and Shallow Neural Networks

链接: https://arxiv.org/abs/2505.06597
作者: Ibrahim Talha Ersoy,Karoline Wiesner
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:When neural networks (NNs) are subject to L2 regularization, increasing the regularization strength beyond a certain threshold pushes the model into an under-parameterization regime. This transition manifests as a first-order phase transition in single-hidden-layer NNs and a second-order phase transition in NNs with two or more hidden layers. This paper establishes a unified framework for such transitions by integrating the Ricci curvature of the loss landscape with regularizer-driven deep learning. First, we show that a curvature change-point separates the model-accuracy regimes in the onset of learning and that it is identical to the critical point of the phase transition driven by regularization. Second, we show that for more complex data sets additional phase transitions exist between model accuracies, and that they are again identical to curvature change points in the error landscape. Third, by studying the MNIST data set using a Variational Autoencoder, we demonstrate that the curvature change points identify phase transitions in model accuracy outside the L2 setting. Our framework also offers practical insights for optimizing model performance across various architectures and datasets. By linking geometric features of the error landscape to observable phase transitions, our work paves the way for more informed regularization strategies and potentially new methods for probing the intrinsic structure of neural networks beyond the L2 context.

[LG-75] An tildeOptimal Differentially Private Learner for Concept Classes with VC Dimension 1

链接: https://arxiv.org/abs/2505.06581
作者: Chao Yan
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:We present the first nearly optimal differentially private PAC learner for any concept class with VC dimension 1 and Littlestone dimension d . Our algorithm achieves the sample complexity of \tildeO_\varepsilon,\delta,\alpha,\delta(\log^* d) , nearly matching the lower bound of \Omega(\log^* d) proved by Alon et al. [STOC19]. Prior to our work, the best known upper bound is \tildeO(VC\cdot d^5) for general VC classes, as shown by Ghazi et al. [STOC21].

[LG-76] Good Things Come in Pairs: Paired Autoencoders for Inverse Problems

链接: https://arxiv.org/abs/2505.06549
作者: Matthias Chung,Bas Peters,Michael Solomon
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 43 pages, 17 figures

点击查看摘要

Abstract:In this book chapter, we discuss recent advances in data-driven approaches for inverse problems. In particular, we focus on the \emphpaired autoencoder framework, which has proven to be a powerful tool for solving inverse problems in scientific computing. The paired autoencoder framework is a novel approach that leverages the strengths of both data-driven and model-based methods by projecting both the data and the quantity of interest into a latent space and mapping these latent spaces to provide surrogate forward and inverse mappings. We illustrate the advantages of this approach through numerical experiments, including seismic imaging and classical inpainting: nonlinear and linear inverse problems, respectively. Although the paired autoencoder framework is likelihood-free, it generates multiple data- and model-based reconstruction metrics that help assess whether examples are in or out of distribution. In addition to direct model estimates from data, the paired autoencoder enables latent-space refinement to fit the observed data accurately. Numerical experiments show that this procedure, combined with the latent-space initial guess, is essential for high-quality estimates, even when data noise exceeds the training regime. We also introduce two novel variants that combine variational and paired autoencoder ideas, maintaining the original benefits while enabling sampling for uncertainty analysis.

[LG-77] GBDTSVM: Combined Support Vector Machine and Gradient Boosting Decision Tree Framework for efficient snoRNA-disease association prediction

链接: https://arxiv.org/abs/2505.06534
作者: Ummay Maria Muna,Fahim Hafiz,Shanta Biswas,Riasat Azim
类目: Machine Learning (cs.LG)
*备注: 30 pages, 3 figures

点击查看摘要

Abstract:Small nucleolar RNAs (snoRNAs) are increasingly recognized for their critical role in the pathogenesis and characterization of various human diseases. Consequently, the precise identification of snoRNA-disease associations (SDAs) is essential for the progression of diseases and the advancement of treatment strategies. However, conventional biological experimental approaches are costly, time-consuming, and resource-intensive; therefore, machine learning-based computational methods offer a promising solution to mitigate these limitations. This paper proposes a model called ‘GBDTSVM’, representing a novel and efficient machine learning approach for predicting snoRNA-disease associations by leveraging a Gradient Boosting Decision Tree (GBDT) and Support Vector Machine (SVM). ‘GBDTSVM’ effectively extracts integrated snoRNA-disease feature representations utilizing GBDT and SVM is subsequently utilized to classify and identify potential associations. Furthermore, the method enhances the accuracy of these predictions by incorporating Gaussian kernel profile similarity for both snoRNAs and diseases. Experimental evaluation of the GBDTSVM model demonstrated superior performance compared to state-of-the-art methods in the field, achieving an area under the receiver operating characteristic (AUROC) of 0.96 and an area under the precision-recall curve (AUPRC) of 0.95 on MDRF dataset. Moreover, our model shows superior performance on two more datasets named LSGT and PsnoD. Additionally, a case study on the predicted snoRNA-disease associations verified the top 10 predicted snoRNAs across nine prevalent diseases, further validating the efficacy of the GBDTSVM approach. These results underscore the model’s potential as a robust tool for advancing snoRNA-related disease research. Source codes and datasets our proposed framework can be obtained from: this https URL

[LG-78] Interpretable SHAP-bounded Bayesian Optimization for Underwater Acoustic Metamaterial Coating Design

链接: https://arxiv.org/abs/2505.06519
作者: Hansani Weeratunge,Dominic Robe,Elnaz Hajizadeh
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:We developed an interpretability informed Bayesian optimization framework to optimize underwater acoustic coatings based on polyurethane elastomers with embedded metamaterial features. A data driven model was employed to analyze the relationship between acoustic performance, specifically sound absorption and the corresponding design variables. By leveraging SHapley Additive exPlanations (SHAP), a machine learning interpretability tool, we identified the key parameters influencing the objective function and gained insights into how these parameters affect sound absorption. The insights derived from the SHAP analysis were subsequently used to automatically refine the bounds of the optimization problem automatically, enabling a more targeted and efficient exploration of the design space. The proposed approach was applied to two polyurethane materials with distinct hardness levels, resulting in improved optimal solutions compared to those obtained without SHAP-informed guidance. Notably, these enhancements were achieved without increasing the number of simulation iterations. Our findings demonstrate the potential of SHAP to streamline optimization processes by uncovering hidden parameter relationships and guiding the search toward promising regions of the design space. This work underscores the effectiveness of combining interpretability techniques with Bayesian optimization for the efficient and cost-effective design of underwater acoustic metamaterials under strict computational constraints and can be generalized towards other materials and engineering optimization problems. Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Sound (cs.SD) Cite as: arXiv:2505.06519 [cs.LG] (or arXiv:2505.06519v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.06519 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-79] FedADP: Unified Model Aggregation for Federated Learning with Heterogeneous Model Architectures

链接: https://arxiv.org/abs/2505.06497
作者: Jiacheng Wang,Hongtao Lv,Lei Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional Federated Learning (FL) faces significant challenges in terms of efficiency and accuracy, particularly in heterogeneous environments where clients employ diverse model architectures and have varying computational resources. Such heterogeneity complicates the aggregation process, leading to performance bottlenecks and reduced model generalizability. To address these issues, we propose FedADP, a federated learning framework designed to adapt to client heterogeneity by dynamically adjusting model architectures during aggregation. FedADP enables effective collaboration among clients with differing capabilities, maximizing resource utilization and ensuring model quality. Our experimental results demonstrate that FedADP significantly outperforms existing methods, such as FlexiFed, achieving an accuracy improvement of up to 23.30%, thereby enhancing model adaptability and training efficiency in heterogeneous real-world settings.

[LG-80] QoS-Efficient Serving of Multiple Mixture-of-Expert LLM s Using Partial Runtime Reconfiguration

链接: https://arxiv.org/abs/2505.06481
作者: HamidReza Imani,Jiaxin Peng,Peiman Mohseni,Abdolah Amirany,Tarek El-Ghazawi
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single-GPU. We propose a serving system that employs \textitsimilarity-based expert consolidation to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce \textitruntime partial reconfiguration, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves a competitive output quality while maintaining throughput comparable to serving a single model while incurring a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA’s multi-instance GPU (MIG). Furthermore, experiments on Google’s Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness.

[LG-81] Probing In-Context Learning: Impact of Task Complexity and Model Architecture on Generalization and Efficiency

链接: https://arxiv.org/abs/2505.06475
作者: Binwen Liu,Peiyu Xu,Quan Yuan,Yihong Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate in-context learning (ICL) through a meticulous experimental framework that systematically varies task complexity and model architecture. Extending beyond the linear regression baseline, we introduce Gaussian kernel regression and nonlinear dynamical system tasks, which emphasize temporal and recursive reasoning. We evaluate four distinct models: a GPT2-style Transformer, a Transformer with FlashAttention mechanism, a convolutional Hyena-based model, and the Mamba state-space model. Each model is trained from scratch on synthetic datasets and assessed for generalization during testing. Our findings highlight that model architecture significantly shapes ICL performance. The standard Transformer demonstrates robust performance across diverse tasks, while Mamba excels in temporally structured dynamics. Hyena effectively captures long-range dependencies but shows higher variance early in training, and FlashAttention offers computational efficiency but is more sensitive in low-data regimes. Further analysis uncovers locality-induced shortcuts in Gaussian kernel tasks, enhanced nonlinear separability through input range scaling, and the critical role of curriculum learning in mastering high-dimensional tasks.

[LG-82] Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference

链接: https://arxiv.org/abs/2505.06461
作者: Haolin Zhang,Jeff Huang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The common assumption in on-device AI is that GPUs, with their superior parallel processing, always provide the best performance for large language model (LLM) inference. In this work, we challenge this notion by empirically demonstrating that, under certain conditions, CPUs can outperform GPUs for LLM inference on mobile devices. Using a 1-billion-parameter LLM deployed via this http URL on the iPhone 15 Pro, we show that a CPU-only configuration (two threads, F16 precision) achieves 17 tokens per second, surpassing the 12.8 tokens per second obtained with GPU acceleration. We analyze the architectural factors driving this counterintuitive result, revealing that GPU memory transfer overhead and CPU thread optimization play a critical role. Furthermore, we explore the impact of thread oversubscription, quantization strategies, and hardware constraints, providing new insights into efficient on-device AI execution. Our findings challenge conventional GPU-first thinking, highlighting the untapped potential of optimized CPU inference and paving the way for smarter deployment strategies in mobile AI. However, fully explaining the observed CPU advantage remains difficult due to limited access to low-level profiling tools on iOS.

[LG-83] Sponge Attacks on Sensing AI: Energy-Latency Vulnerabilities and Defense via Model Pruning

链接: https://arxiv.org/abs/2505.06454
作者: Syed Mhamudul Hasan,Hussein Zangoti,Iraklis Anagnostopoulos,Abdur R. Shahid
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Recent studies have shown that sponge attacks can significantly increase the energy consumption and inference latency of deep neural networks (DNNs). However, prior work has focused primarily on computer vision and natural language processing tasks, overlooking the growing use of lightweight AI models in sensing-based applications on resource-constrained devices, such as those in Internet of Things (IoT) environments. These attacks pose serious threats of energy depletion and latency degradation in systems where limited battery capacity and real-time responsiveness are critical for reliable operation. This paper makes two key contributions. First, we present the first systematic exploration of energy-latency sponge attacks targeting sensing-based AI models. Using wearable sensing-based AI as a case study, we demonstrate that sponge attacks can substantially degrade performance by increasing energy consumption, leading to faster battery drain, and by prolonging inference latency. Second, to mitigate such attacks, we investigate model pruning, a widely adopted compression technique for resource-constrained AI, as a potential defense. Our experiments show that pruning-induced sparsity significantly improves model resilience against sponge poisoning. We also quantify the trade-offs between model efficiency and attack resilience, offering insights into the security implications of model compression in sensing-based AI systems deployed in IoT environments.

[LG-84] Structured Prediction with Abstention via the Lovász Hinge

链接: https://arxiv.org/abs/2505.06446
作者: Jessie Finocchiaro,Rafael Frongillo,Enrique Nueve
类目: Machine Learning (cs.LG)
*备注: This paper is an extension of the work “The Structured Abstain Problem and the Lovász Hinge” ( arXiv:2203.08645 ) via the original authors

点击查看摘要

Abstract:The Lovász hinge is a convex loss function proposed for binary structured classification, in which k related binary predictions jointly evaluated by a submodular function. Despite its prevalence in image segmentation and related tasks, the consistency of the Lovász hinge has remained open. We show that the Lovász hinge is inconsistent with its desired target unless the set function used for evaluation is modular. Leveraging the embedding framework of Finocchiaro et al. (2024), we find the target loss for which the Lovász hinge is consistent. This target, which we call the structured abstain problem, is a variant of selective classification for structured prediction that allows one to abstain on any subset of the k binary predictions. We derive a family of link functions, each of which is simultaneously consistent for all polymatroids, a subset of submodular set functions. We then give sufficient conditions on the polymatroid for the structured abstain problem to be tightly embedded by the Lovász hinge, meaning no target prediction is redundant. We experimentally demonstrate the potential of the structured abstain problem for interpretability in structured classification tasks. Finally, for the multiclass setting, we show that one can combine the binary encoding construction of Ramaswamy et al. (2018) with our link construction to achieve an efficient consistent surrogate for a natural multiclass generalization of the structured abstain problem.

[LG-85] weedie Regression for Video Recommendation System

链接: https://arxiv.org/abs/2505.06445
作者: Yan Zheng,Qiang Chen,Chenglei Niu
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: ICMI 2025 IEEE 4th International Conference on Computing and Machine Intelligence April 05-06, 2025

点击查看摘要

Abstract:Modern recommendation systems aim to increase click-through rates (CTR) for better user experience, through commonly treating ranking as a classification task focused on predicting CTR. However, there is a gap between this method and the actual objectives of businesses across different sectors. In video recommendation services, the objective of video on demand (VOD) extends beyond merely encouraging clicks, but also guiding users to discover their true interests, leading to increased watch time. And longer users watch time will leads to more revenue through increased chances of presenting online display advertisements. This research addresses the issue by redefining the problem from classification to regression, with a focus on maximizing revenue through user viewing time. Due to the lack of positive labels on recommendation, the study introduces Tweedie Loss Function, which is better suited in this scenario than the traditional mean square error loss. The paper also provides insights on how Tweedie process capture users diverse interests. Our offline simulation and online A/B test revealed that we can substantially enhance our core business objectives: user engagement in terms of viewing time and, consequently, revenue. Additionally, we provide a theoretical comparison between the Tweedie Loss and the commonly employed viewing time weighted Logloss, highlighting why Tweedie Regression stands out as an efficient solution. We further outline a framework for designing a loss function that focuses on a singular objective.

[LG-86] Direct Data Driven Control Using Noisy Measurements

链接: https://arxiv.org/abs/2505.06407
作者: Ramin Esmzad,Gokul S. Sankar,Teawon Han,Hamidreza Modares
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)
*备注: Submitted to IEEE-TAC

点击查看摘要

Abstract:This paper presents a novel direct data-driven control framework for solving the linear quadratic regulator (LQR) under disturbances and noisy state measurements. The system dynamics are assumed unknown, and the LQR solution is learned using only a single trajectory of noisy input-output data while bypassing system identification. Our approach guarantees mean-square stability (MSS) and optimal performance by leveraging convex optimization techniques that incorporate noise statistics directly into the controller synthesis. First, we establish a theoretical result showing that the MSS of an uncertain data-driven system implies the MSS of the true closed-loop system. Building on this, we develop a robust stability condition using linear matrix inequalities (LMIs) that yields a stabilizing controller gain from noisy measurements. Finally, we formulate a data-driven LQR problem as a semidefinite program (SDP) that computes an optimal gain, minimizing the steady-state covariance. Extensive simulations on benchmark systems – including a rotary inverted pendulum and an active suspension system – demonstrate the superior robustness and accuracy of our method compared to existing data-driven LQR approaches. The proposed framework offers a practical and theoretically grounded solution for controller design in noise-corrupted environments where system identification is infeasible.

[LG-87] Embedding Atlas: Low-Friction Interactive Embedding Visualization

链接: https://arxiv.org/abs/2505.06386
作者: Donghao Ren,Fred Hohman,Halden Lin,Dominik Moritz
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Website: this https URL

点击查看摘要

Abstract:Embedding projections are popular for visualizing large datasets and models. However, people often encounter “friction” when using embedding visualization tools: (1) barriers to adoption, e.g., tedious data wrangling and loading, scalability limits, no integration of results into existing workflows, and (2) limitations in possible analyses, without integration with external tools to additionally show coordinated views of metadata. In this paper, we present Embedding Atlas, a scalable, interactive visualization tool designed to make interacting with large embeddings as easy as possible. Embedding Atlas uses modern web technologies and advanced algorithms – including density-based clustering, and automated labeling – to provide a fast and rich data analysis experience at scale. We evaluate Embedding Atlas with a competitive analysis against other popular embedding tools, showing that Embedding Atlas’s feature set specifically helps reduce friction, and report a benchmark on its real-time rendering performance with millions of points. Embedding Atlas is available as open source to support future work in embedding-based analysis.

[LG-88] RiM: Record Improve and Maintain Physical Well-being using Federated Learning

链接: https://arxiv.org/abs/2505.06384
作者: Aditya Mishra,Haroon Lone
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
*备注: Report submitted in partial fulfilment of the requirements for the award of the degree of Bachelor of Science (BS) in Electrical Engineering and Computer Science

点击查看摘要

Abstract:In academic settings, the demanding environment often forces students to prioritize academic performance over their physical well-being. Moreover, privacy concerns and the inherent risk of data breaches hinder the deployment of traditional machine learning techniques for addressing these health challenges. In this study, we introduce RiM: Record, Improve, and Maintain, a mobile application which incorporates a novel personalized machine learning framework that leverages federated learning to enhance students’ physical well-being by analyzing their lifestyle habits. Our approach involves pre-training a multilayer perceptron (MLP) model on a large-scale simulated dataset to generate personalized recommendations. Subsequently, we employ federated learning to fine-tune the model using data from IISER Bhopal students, thereby ensuring its applicability in real-world scenarios. The federated learning approach guarantees differential privacy by exclusively sharing model weights rather than raw data. Experimental results show that the FedAvg-based RiM model achieves an average accuracy of 60.71% and a mean absolute error of 0.91–outperforming the FedPer variant (average accuracy 46.34%, MAE 1.19)–thereby demonstrating its efficacy in predicting lifestyle deficits under privacy-preserving constraints.

[LG-89] A Comprehensive Data Description for LoRaWAN Path Loss Measurements in an Indoor Office Setting: Effects of Environmental Factors

链接: https://arxiv.org/abs/2505.06375
作者: Nahshon Mokua Obiri,Kristof Van Laerhoven
类目: Networking and Internet Architecture (cs.NI); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: This is a peer-reviewed article with the help of IEEE Access editors. The relevant DOI will be availed soon

点击查看摘要

Abstract:This paper presents a comprehensive dataset of LoRaWAN technology path loss measurements collected in an indoor office environment, focusing on quantifying the effects of environmental factors on signal propagation. Utilizing a network of six strategically placed LoRaWAN end devices (EDs) and a single indoor gateway (GW) at the University of Siegen, City of Siegen, Germany, we systematically measured signal strength indicators such as the Received Signal Strength Indicator (RSSI) and the Signal-to-Noise Ratio (SNR) under various environmental conditions, including temperature, relative humidity, carbon dioxide (CO _2 ) concentration, barometric pressure, and particulate matter levels (PM _2.5 ). Our empirical analysis confirms that transient phenomena such as reflections, scattering, interference, occupancy patterns (induced by environmental parameter variations), and furniture rearrangements can alter signal attenuation by as much as 10.58 dB, highlighting the dynamic nature of indoor propagation. As an example of how this dataset can be utilized, we tested and evaluated a refined Log-Distance Path Loss and Shadowing Model that integrates both structural obstructions (Multiple Walls) and Environmental Parameters (LDPLSM-MW-EP). Compared to a baseline model that considers only Multiple Walls (LDPLSM-MW), the enhanced approach reduced the root mean square error (RMSE) from 10.58 dB to 8.04 dB and increased the coefficient of determination (R ^2 ) from 0.6917 to 0.8222. By capturing the extra effects of environmental conditions and occupancy dynamics, this improved model provides valuable insights for optimizing power usage and prolonging device battery life, enhancing network reliability in indoor Internet of Things (IoT) deployments, among other applications. This dataset offers a solid foundation for future research and development in indoor wireless communication.

[LG-90] CAST: Time-Varying Treatment Effects with Application to Chemotherapy and Radiotherapy on Head and Neck Squamous Cell Carcinoma

链接: https://arxiv.org/abs/2505.06367
作者: Everest Yang,Ria Vasishtha,Luqman K. Dad,Lisa A. Kachnic,Andrew Hope,Eric Wang,Xiao Wu,Yading Yuan,David J. Brenner,Igor Shuryak
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Causal machine learning (CML) enables individualized estimation of treatment effects, offering critical advantages over traditional correlation-based methods. However, existing approaches for medical survival data with censoring such as causal survival forests estimate effects at fixed time points, limiting their ability to capture dynamic changes over time. We introduce Causal Analysis for Survival Trajectories (CAST), a novel framework that models treatment effects as continuous functions of time following treatment. By combining parametric and non-parametric methods, CAST overcomes the limitations of discrete time-point analysis to estimate continuous effect trajectories. Using the RADCURE dataset [1] of 2,651 patients with head and neck squamous cell carcinoma (HNSCC) as a clinically relevant example, CAST models how chemotherapy and radiotherapy effects evolve over time at the population and individual levels. By capturing the temporal dynamics of treatment response, CAST reveals how treatment effects rise, peak, and decline over the follow-up period, helping clinicians determine when and for whom treatment benefits are maximized. This framework advances the application of CML to personalized care in HNSCC and other life-threatening medical conditions. Source code/data available at: this https URL

[LG-91] Latent Diffeomorphic Dynamic Mode Decomposition

链接: https://arxiv.org/abs/2505.06351
作者: Willem Diepeveen,Jon Schwenk,Andrea Bertozzi
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:We present Latent Diffeomorphic Dynamic Mode Decomposition (LDDMD), a new data reduction approach for the analysis of non-linear systems that combines the interpretability of Dynamic Mode Decomposition (DMD) with the predictive power of Recurrent Neural Networks (RNNs). Notably, LDDMD maintains simplicity, which enhances interpretability, while effectively modeling and learning complex non-linear systems with memory, enabling accurate predictions. This is exemplified by its successful application in streamflow prediction.

[LG-92] Reinforcement Learning for Game-Theoretic Resource Allocation on Graphs

链接: https://arxiv.org/abs/2505.06319
作者: Zijian An,Lifeng Zhou
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:Game-theoretic resource allocation on graphs (GRAG) involves two players competing over multiple steps to control nodes of interest on a graph, a problem modeled as a multi-step Colonel Blotto Game (MCBG). Finding optimal strategies is challenging due to the dynamic action space and structural constraints imposed by the graph. To address this, we formulate the MCBG as a Markov Decision Process (MDP) and apply Reinforcement Learning (RL) methods, specifically Deep Q-Network (DQN) and Proximal Policy Optimization (PPO). To enforce graph constraints, we introduce an action-displacement adjacency matrix that dynamically generates valid action sets at each step. We evaluate RL performance across a variety of graph structures and initial resource distributions, comparing against random, greedy, and learned RL policies. Experimental results show that both DQN and PPO consistently outperform baseline strategies and converge to a balanced 50% win rate when competing against the learned RL policy. Particularly, on asymmetric graphs, RL agents successfully exploit structural advantages and adapt their allocation strategies, even under disadvantageous initial resource distributions.

[LG-93] GraphComp: Extreme Error-bounded Compression of Scientific Data via Temporal Graph Autoencoders

链接: https://arxiv.org/abs/2505.06316
作者: Guozhong Li,Muhannad Alhumaidi,Spiros Skiadopoulos,Ibrahim Hoteit,Panos Kalnis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The generation of voluminous scientific data poses significant challenges for efficient storage, transfer, and analysis. Recently, error-bounded lossy compression methods emerged due to their ability to achieve high compression ratios while controlling data distortion. However, they often overlook the inherent spatial and temporal correlations within scientific data, thus missing opportunities for higher compression. In this paper we propose GRAPHCOMP, a novel graph-based method for error-bounded lossy compression of scientific data. We perform irregular segmentation of the original grid data and generate a graph representation that preserves the spatial and temporal correlations. Inspired by Graph Neural Networks (GNNs), we then propose a temporal graph autoencoder to learn latent representations that significantly reduce the size of the graph, effectively compressing the original data. Decompression reverses the process and utilizes the learnt graph model together with the latent representation to reconstruct an approximation of the original data. The decompressed data are guaranteed to satisfy a user-defined point-wise error bound. We compare our method against the state-of-the-art error-bounded lossy methods (i.e., HPEZ, SZ3.1, SPERR, and ZFP) on large-scale real and synthetic data. GRAPHCOMP consistently achieves the highest compression ratio across most datasets, outperforming the second-best method by margins ranging from 22% to 50%.

[LG-94] Benchmarking Traditional Machine Learning and Deep Learning Models for Fault Detection in Power Transformers

链接: https://arxiv.org/abs/2505.06295
作者: Bhuvan Saravanan,Pasanth Kumar M D,Aarnesh Vengateson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate diagnosis of power transformer faults is essential for ensuring the stability and safety of electrical power systems. This study presents a comparative analysis of conventional machine learning (ML) algorithms and deep learning (DL) algorithms for fault classification of power transformers. Using a condition-monitored dataset spanning 10 months, various gas concentration features were normalized and used to train five ML classifiers: Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Random Forest (RF), XGBoost, and Artificial Neural Network (ANN). In addition, four DL models were evaluated: Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), One-Dimensional Convolutional Neural Network (1D-CNN), and TabNet. Experimental results show that both ML and DL approaches performed comparably. The RF model achieved the highest ML accuracy at 86.82%, while the 1D-CNN model attained a close 86.30%.

[LG-95] Spatio-Temporal Graph Neural Network for Urban Spaces: Interpolating Citywide Traffic Volume

链接: https://arxiv.org/abs/2505.06292
作者: Silke K. Kaiser,Filipe Rodrigues,Carlos Lima Azevedo,Lynn H. Kaack
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Reliable street-level traffic volume data, covering multiple modes of transportation, helps urban planning by informing decisions on infrastructure improvements, traffic management, and public transportation. Yet, traffic sensors measuring traffic volume are typically scarcely located, due to their high deployment and maintenance costs. To address this, interpolation methods can estimate traffic volumes at unobserved locations using available data. Graph Neural Networks have shown strong performance in traffic volume forecasting, particularly on highways and major arterial networks. Applying them to urban settings, however, presents unique challenges: urban networks exhibit greater structural diversity, traffic volumes are highly overdispersed with many zeros, the best way to account for spatial dependencies remains unclear, and sensor coverage is often very sparse. We introduce the Graph Neural Network for Urban Interpolation (GNNUI), a novel urban traffic volume estimation approach. GNNUI employs a masking algorithm to learn interpolation, integrates node features to capture functional roles, and uses a loss function tailored to zero-inflated traffic distributions. In addition to the model, we introduce two new open, large-scale urban traffic volume benchmarks, covering different transportation modes: Strava cycling data from Berlin and New York City taxi data. GNNUI outperforms recent, some graph-based, interpolation methods across metrics (MAE, RMSE, true-zero rate, Kullback-Leibler divergence) and remains robust from 90% to 1% sensor coverage. On Strava, for instance, MAE rises only from 7.1 to 10.5, on Taxi from 23.0 to 40.4, demonstrating strong performance under extreme data scarcity, common in real-world urban settings. We also examine how graph connectivity choices influence model accuracy.

[LG-96] UniCO: Towards a Unified Model for Combinatorial Optimization Problems

链接: https://arxiv.org/abs/2505.06290
作者: Zefang Zong,Xiaochen Wei,Guozhen Zhang,Chen Gao,Huandong Wang,Yong Li
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*备注:

点击查看摘要

Abstract:Combinatorial Optimization (CO) encompasses a wide range of problems that arise in many real-world scenarios. While significant progress has been made in developing learning-based methods for specialized CO problems, a unified model with a single architecture and parameter set for diverse CO problems remains elusive. Such a model would offer substantial advantages in terms of efficiency and convenience. In this paper, we introduce UniCO, a unified model for solving various CO problems. Inspired by the success of next-token prediction, we frame each problem-solving process as a Markov Decision Process (MDP), tokenize the corresponding sequential trajectory data, and train the model using a transformer backbone. To reduce token length in the trajectory data, we propose a CO-prefix design that aggregates static problem features. To address the heterogeneity of state and action tokens within the MDP, we employ a two-stage self-supervised learning approach. In this approach, a dynamic prediction model is first trained and then serves as a pre-trained model for subsequent policy generation. Experiments across 10 CO problems showcase the versatility of UniCO, emphasizing its ability to generalize to new, unseen problems with minimal fine-tuning, achieving even few-shot or zero-shot performance. Our framework offers a valuable complement to existing neural CO methods that focus on optimizing performance for individual problems.

[LG-97] Edge-Optimized Deep Learning Pattern Recognition Techniques for Non-Intrusive Load Monitoring of Energy Time Series

链接: https://arxiv.org/abs/2505.06289
作者: Sotirios Athanasoulias
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: PhD dissertation as part of the GECKO Marie Curie

点击查看摘要

Abstract:The growing global energy demand and the urgent need for sustainability call for innovative ways to boost energy efficiency. While advanced energy-saving systems exist, they often fall short without user engagement. Providing feedback on energy consumption behavior is key to promoting sustainable practices. Non-Intrusive Load Monitoring (NILM) offers a promising solution by disaggregating total household energy usage, recorded by a central smart meter, into appliance-level data. This empowers users to optimize consumption. Advances in AI, IoT, and smart meter adoption have further enhanced NILM’s potential. Despite this promise, real-world NILM deployment faces major challenges. First, existing datasets mainly represent regions like the USA and UK, leaving places like the Mediterranean underrepresented. This limits understanding of regional consumption patterns, such as heavy use of air conditioners and electric water heaters. Second, deep learning models used in NILM require high computational power, often relying on cloud services. This increases costs, raises privacy concerns, and limits scalability, especially for households with poor connectivity. This thesis tackles these issues with key contributions. It presents an interoperable data collection framework and introduces the Plegma Dataset, focused on underrepresented Mediterranean energy patterns. It also explores advanced deep neural networks and model compression techniques for efficient edge deployment. By bridging theoretical advances with practical needs, this work aims to make NILM scalable, efficient, and adaptable for global energy sustainability. Comments: PhD dissertation as part of the GECKO Marie Curie Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML) Cite as: arXiv:2505.06289 [cs.LG] (or arXiv:2505.06289v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.06289 Focus to learn more arXiv-issued DOI via DataCite

[LG-98] IIKL: Isometric Immersion Kernel Learning with Riemannian Manifold for Geometric Preservation

链接: https://arxiv.org/abs/2505.06288
作者: Zihao Chen,Wenyong Wang,Jiachen Yang,Yu Xiang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 16 pages, 14 figures

点击查看摘要

Abstract:Geometric representation learning in preserving the intrinsic geometric and topological properties for discrete non-Euclidean data is crucial in scientific applications. Previous research generally mapped non-Euclidean discrete data into Euclidean space during representation learning, which may lead to the loss of some critical geometric information. In this paper, we propose a novel Isometric Immersion Kernel Learning (IIKL) method to build Riemannian manifold and isometrically induce Riemannian metric from discrete non-Euclidean data. We prove that Isometric immersion is equivalent to the kernel function in the tangent bundle on the manifold, which explicitly guarantees the invariance of the inner product between vectors in the arbitrary tangent space throughout the learning process, thus maintaining the geometric structure of the original data. Moreover, a novel parameterized learning model based on IIKL is introduced, and an alternating training method for this model is derived using Maximum Likelihood Estimation (MLE), ensuring efficient convergence. Experimental results proved that using the learned Riemannian manifold and its metric, our model preserved the intrinsic geometric representation of data in both 3D and high-dimensional datasets successfully, and significantly improved the accuracy of downstream tasks, such as data reconstruction and classification. It is showed that our method could reduce the inner product invariant loss by more than 90% compared to state-of-the-art (SOTA) methods, also achieved an average 40% improvement in downstream reconstruction accuracy and a 90% reduction in error for geometric metrics involving isometric and conformal.

[LG-99] DMRL: Data- and Model-aware Reward Learning for Data Extraction

链接: https://arxiv.org/abs/2505.06284
作者: Zhiqiang Wang,Ruoxi Cheng
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Data- and Model-aware Reward Learning for Data Extraction. arXiv admin note: substantial text overlap with arXiv:2503.18991

点击查看摘要

Abstract:Large language models (LLMs) are inherently vulnerable to unintended privacy breaches. Consequently, systematic red-teaming research is essential for developing robust defense mechanisms. However, current data extraction methods suffer from several limitations: (1) rely on dataset duplicates (addressable via deduplication), (2) depend on prompt engineering (now countered by detection and defense), and (3) rely on random-search adversarial generation. To address these challenges, we propose DMRL, a Data- and Model-aware Reward Learning approach for data extraction. This technique leverages inverse reinforcement learning to extract sensitive data from LLMs. Our method consists of two main components: (1) constructing an introspective reasoning dataset that captures leakage mindsets to guide model behavior, and (2) training reward models with Group Relative Policy Optimization (GRPO), dynamically tuning optimization based on task difficulty at both the data and model levels. Comprehensive experiments across various LLMs demonstrate that DMRL outperforms all baseline methods in data extraction performance.

[LG-100] Soft causal learning for generalized molecule property prediction: An environment perspective

链接: https://arxiv.org/abs/2505.06283
作者: Limin Li,Kuo Yang,Wenjie Du,Pengkun Wang,Zhengyang Zhou,Yang Wang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 23 pages, 7 figures, 3 tables

点击查看摘要

Abstract:Learning on molecule graphs has become an increasingly important topic in AI for science, which takes full advantage of AI to facilitate scientific discovery. Existing solutions on modeling molecules utilize Graph Neural Networks (GNNs) to achieve representations but they mostly fail to adapt models to out-of-distribution (OOD) samples. Although recent advances on OOD-oriented graph learning have discovered the invariant rationale on graphs, they still ignore three important issues, i.e., 1) the expanding atom patterns regarding environments on graphs lead to failures of invariant rationale based models, 2) the associations between discovered molecular subgraphs and corresponding properties are complex where causal substructures cannot fully interpret the labels. 3) the interactions between environments and invariances can influence with each other thus are challenging to be modeled. To this end, we propose a soft causal learning framework, to tackle the unresolved OOD challenge in molecular science, from the perspective of fully modeling the molecule environments and bypassing the invariant subgraphs. Specifically, we first incorporate chemistry theories into our graph growth generator to imitate expaned environments, and then devise an GIB-based objective to disentangle environment from whole graphs and finally introduce a cross-attention based soft causal interaction, which allows dynamic interactions between environments and invariances. We perform experiments on seven datasets by imitating different kinds of OOD generalization scenarios. Extensive comparison, ablation experiments as well as visualized case studies demonstrate well generalization ability of our proposal.

[LG-101] InfoNCE is a Free Lunch for Semantically guided Graph Contrastive Learning SIGIR2025

链接: https://arxiv.org/abs/2505.06282
作者: Zixu Wang,Bingbing Xu,Yige Yuan,Huawei Shen,Xueqi Cheng
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, Accepted by SIGIR2025

点击查看摘要

Abstract:As an important graph pre-training method, Graph Contrastive Learning (GCL) continues to play a crucial role in the ongoing surge of research on graph foundation models or LLM as enhancer for graphs. Traditional GCL optimizes InfoNCE by using augmentations to define self-supervised tasks, treating augmented pairs as positive samples and others as negative. However, this leads to semantically similar pairs being classified as negative, causing significant sampling bias and limiting performance. In this paper, we argue that GCL is essentially a Positive-Unlabeled (PU) learning problem, where the definition of self-supervised tasks should be semantically guided, i.e., augmented samples with similar semantics are considered positive, while others, with unknown semantics, are treated as unlabeled. From this perspective, the key lies in how to extract semantic information. To achieve this, we propose IFL-GCL, using InfoNCE as a “free lunch” to extract semantic information. Specifically, We first prove that under InfoNCE, the representation similarity of node pairs aligns with the probability that the corresponding contrastive sample is positive. Then we redefine the maximum likelihood objective based on the corrected samples, leading to a new InfoNCE loss function. Extensive experiments on both the graph pretraining framework and LLM as an enhancer show significantly improvements of IFL-GCL in both IID and OOD scenarios, achieving up to a 9.05% improvement, validating the effectiveness of semantically guided. Code for IFL-GCL is publicly available at: this https URL.

[LG-102] A Data-Driven Probabilistic Framework for Cascading Urban Risk Analysis Using Bayesian Networks

链接: https://arxiv.org/abs/2505.06281
作者: Chunduru Rohith Kumar,PHD Surya Shanmuk,Prabhala Naga Srinivas,Sri Venkatesh Lankalapalli,Debasis Dwibedy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 14 pages, 4 figures, 8 tables

点击查看摘要

Abstract:The increasing complexity of cascading risks in urban systems necessitates robust, data-driven frameworks to model interdependencies across multiple domains. This study presents a foundational Bayesian network-based approach for analyzing cross-domain risk propagation across key urban domains, including air, water, electricity, agriculture, health, infrastructure, weather, and climate. Directed Acyclic Graphs (DAGs) are constructed using Bayesian Belief Networks (BBNs), with structure learning guided by Hill-Climbing search optimized through Bayesian Information Criterion (BIC) and K2 scoring. The framework is trained on a hybrid dataset that combines real-world urban indicators with synthetically generated data from Generative Adversarial Networks (GANs), and is further balanced using the Synthetic Minority Over-sampling Technique (SMOTE). Conditional Probability Tables (CPTs) derived from the learned structures enable interpretable probabilistic reasoning and quantify the likelihood of cascading failures. The results identify key intra- and inter-domain risk factors and demonstrate the framework’s utility for proactive urban resilience planning. This work establishes a scalable, interpretable foundation for cascading risk assessment and serves as a basis for future empirical research in this emerging interdisciplinary field.

[LG-103] Show or Tell? A Benchmark To Evaluate Visual and Textual Prompts in Semantic Segmentation CVPR2025

链接: https://arxiv.org/abs/2505.06280
作者: Gabriele Rosi,Fabio Cermelli
类目: Machine Learning (cs.LG)
*备注: Accepted to PixFoundation workshop at CVPR2025. Code: this https URL

点击查看摘要

Abstract:Prompt engineering has shown remarkable success with large language models, yet its systematic exploration in computer vision remains limited. In semantic segmentation, both textual and visual prompts offer distinct advantages: textual prompts through open-vocabulary methods allow segmentation of arbitrary categories, while visual reference prompts provide intuitive reference examples. However, existing benchmarks evaluate these modalities in isolation, without direct comparison under identical conditions. We present Show or Tell (SoT), a novel benchmark specifically designed to evaluate both visual and textual prompts for semantic segmentation across 14 datasets spanning 7 diverse domains (common scenes, urban, food, waste, parts, tools, and land-cover). We evaluate 5 open-vocabulary methods and 4 visual reference prompt approaches, adapting the latter to handle multi-class segmentation through a confidence-based mask merging strategy. Our extensive experiments reveal that open-vocabulary methods excel with common concepts easily described by text but struggle with complex domains like tools, while visual reference prompt methods achieve good average results but exhibit high variability depending on the input prompt. Through comprehensive quantitative and qualitative analysis, we identify the strengths and weaknesses of both prompting modalities, providing valuable insights to guide future research in vision foundation models for segmentation tasks.

[LG-104] Interpretable Learning Dynamics in Unsupervised Reinforcement Learning

链接: https://arxiv.org/abs/2505.06279
作者: Shashwat Pandey
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We present an interpretability framework for unsupervised reinforcement learning (URL) agents, aimed at understanding how intrinsic motivation shapes attention, behavior, and representation learning. We analyze five agents DQN, RND, ICM, PPO, and a Transformer-RND variant trained on procedurally generated environments, using Grad-CAM, Layer-wise Relevance Propagation (LRP), exploration metrics, and latent space clustering. To capture how agents perceive and adapt over time, we introduce two metrics: attention diversity, which measures the spatial breadth of focus, and attention change rate, which quantifies temporal shifts in attention. Our findings show that curiosity-driven agents display broader, more dynamic attention and exploratory behavior than their extrinsically motivated counterparts. Among them, TransformerRND combines wide attention, high exploration coverage, and compact, structured latent representations. Our results highlight the influence of architectural inductive biases and training signals on internal agent dynamics. Beyond reward-centric evaluation, the proposed framework offers diagnostic tools to probe perception and abstraction in RL agents, enabling more interpretable and generalizable behavior.

[LG-105] A machine learning model for skillful climate system prediction

链接: https://arxiv.org/abs/2505.06269
作者: Chenguang Zhou,Lei Chen,Xiaohui Zhong,Bo Lu,Hao Li,Libo Wu,Jie Wu,Jiahui Hu,Zesheng Dou,Pang-Chi Hsu,Xiaoye Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Climate system models (CSMs), through integrating cross-sphere interactions among the atmosphere, ocean, land, and cryosphere, have emerged as pivotal tools for deciphering climate dynamics and improving forecasting capabilities. Recent breakthroughs in artificial intelligence (AI)-driven meteorological modeling have demonstrated remarkable success in single-sphere systems and partially spheres coupled systems. However, the development of a fully coupled AI-based climate system model encompassing atmosphere-ocean-land-sea ice interactions has remained an unresolved challenge. This paper introduces FengShun-CSM, an AI-based CSM model that provides 60-day global daily forecasts for 29 critical variables across atmospheric, oceanic, terrestrial, and cryospheric domains. The model significantly outperforms the European Centre for Medium-Range Weather Forecasts (ECMWF) subseasonal-to-seasonal (S2S) model in predicting most variables, particularly precipitation, land surface, and oceanic components. This enhanced capability is primarily attributed to its improved representation of intra-seasonal variability modes, most notably the Madden-Julian Oscillation (MJO). Remarkably, FengShun-CSM exhibits substantial potential in predicting subseasonal extreme events. Such breakthroughs will advance its applications in meteorological disaster mitigation, marine ecosystem conservation, and agricultural productivity enhancement. Furthermore, it validates the feasibility of developing AI-powered CSMs through machine learning technologies, establishing a transformative paradigm for next-generation Earth system modeling.

[LG-106] ONERAs CRM WBPN database for machine learning activities related regression challenge and first results

链接: https://arxiv.org/abs/2505.06265
作者: Jacques Peter,Quentin Bennehard,Sébastien Heib,Jean-Luc Hantrais-Gervois,Frédéric Moëns
类目: Machine Learning (cs.LG)
*备注: 16 pages, 9 figures

点击查看摘要

Abstract:This paper presents a new Computational Fluid Dynamics database, developed at ONERA, to support the advancement of machine learning techniques for aerodynamic field prediction. It contains 468 Reynolds-Averaged Navier-Stokes simulations using the Spalart-Allmaras turbulence model, performed on the NASA/Boeing Common Research Model wing-body-pylon-nacelle configuration. The database spans a wide range of flow conditions, varying Mach number (including transonic regimes), angle of attack (capturing flow separation), and Reynolds number (based on three stagnation pressures, with one setting matching wind tunnel experiments). The quality of the database is assessed, through checking the convergence level of each computation. Based on these data, a regression challenge is defined. It consists in predicting the wall distributions of pressure and friction coefficients for unseen aerodynamic conditions. The 468 simulations are split into training and testing sets, with the training data made available publicly on the Codabench platform. The paper further evaluates several classical machine learning regressors on this task. Tested pointwise methods include Multi-Layer Perceptrons, \lambda -DNNs, and Decision Trees, while global methods include Multi-Layer Perceptron, k-Nearest Neighbors, Proper Orthogonal Decomposition and IsoMap. Initial performance results, using R^2 scores and worst relative mean absolute error metrics, are presented, offering insights into the capabilities of these techniques for the challenge and references for future work. Comments: 16 pages, 9 figures Subjects: Machine Learning (cs.LG) Cite as: arXiv:2505.06265 [cs.LG] (or arXiv:2505.06265v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.06265 Focus to learn more arXiv-issued DOI via DataCite

[LG-107] Neural Network Operator-Based Fractal Approximation: Smoothness Preservation and Convergence Analysis

链接: https://arxiv.org/abs/2505.06229
作者: Aaqib Ayoub Bhat,Asif Khan,M. Mursaleen
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 18 pages

点击查看摘要

Abstract:This paper presents a new approach of constructing \alpha -fractal interpolation functions (FIFs) using neural network operators, integrating concepts from approximation theory. Initially, we construct \alpha -fractals utilizing neural network-based operators, providing an approach to generating fractal functions with interpolation properties. Based on the same foundation, we have developed fractal interpolation functions that utilize only the values of the original function at the nodes or partition points, unlike traditional methods that rely on the entire original function. Further, we have constructed (\alpha)-fractals that preserve the smoothness of functions under certain constraints by employing a four-layered neural network operator, ensuring that if (f \in C^r[a,b]), then the corresponding fractal (f^\alpha \in C^r[a,b]). Furthermore, we analyze the convergence of these \alpha -fractals to the original function under suitable conditions. The work uses key approximation theory tools, such as the modulus of continuity and interpolation operators, to develop convergence results and uniform approximation error bounds. Comments: 18 pages Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA) MSC classes: 28A80, 41A05, 41A25, 41A29, 41A30, 65D05 Cite as: arXiv:2505.06229 [cs.LG] (or arXiv:2505.06229v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.06229 Focus to learn more arXiv-issued DOI via DataCite

[LG-108] Analytic theory of dropout regularization

链接: https://arxiv.org/abs/2505.07792
作者: Francesco Mori,Francesca Mignacco
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 17 pages, 8 figures

点击查看摘要

Abstract:Dropout is a regularization technique widely used in training artificial neural networks to mitigate overfitting. It consists of dynamically deactivating subsets of the network during training to promote more robust representations. Despite its widespread adoption, dropout probabilities are often selected heuristically, and theoretical explanations of its success remain sparse. Here, we analytically study dropout in two-layer neural networks trained with online stochastic gradient descent. In the high-dimensional limit, we derive a set of ordinary differential equations that fully characterize the evolution of the network during training and capture the effects of dropout. We obtain a number of exact results describing the generalization error and the optimal dropout probability at short, intermediate, and long training times. Our analysis shows that dropout reduces detrimental correlations between hidden nodes, mitigates the impact of label noise, and that the optimal dropout probability increases with the level of noise in the data. Our results are validated by extensive numerical simulations.

[LG-109] agging fully hadronic exotic decays of the vectorlike mathbfB quark using a graph neural network

链接: https://arxiv.org/abs/2505.07769
作者: Jai Bardhan,Tanumoy Mandal,Subhadip Mitra,Cyrin Neeraj,Mihir Rawat
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 13 pages, 10 figures, 3 tables

点击查看摘要

Abstract:Following up on our earlier study in [J. Bardhan et al., Machine learning-enhanced search for a vectorlike singlet B quark decaying to a singlet scalar or pseudoscalar, Phys. Rev. D 107 (2023) 115001; arXiv:2212.02442], we investigate the LHC prospects of pair-produced vectorlike B quarks decaying exotically to a new gauge-singlet (pseudo)scalar field \Phi and a b quark. After the electroweak symmetry breaking, the \Phi decays predominantly to gg/bb final states, leading to a fully hadronic 2b+4j or 6b signature. Because of the large Standard Model background and the lack of leptonic handles, it is a difficult channel to probe. To overcome the challenge, we employ a hybrid deep learning model containing a graph neural network followed by a deep neural network. We estimate that such a state-of-the-art deep learning analysis pipeline can lead to a performance comparable to that in the semi-leptonic mode, taking the discovery (exclusion) reach up to about M_B=1.8:(2.4) ~TeV at HL-LHC when B decays fully exotically, i.e., BR (B \to b\Phi) = 100% .

[LG-110] raining neural control variates using correlated configurations

链接: https://arxiv.org/abs/2505.07719
作者: Hyunwoo Oh
类目: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG); Nuclear Theory (nucl-th)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:Neural control variates (NCVs) have emerged as a powerful tool for variance reduction in Monte Carlo (MC) simulations, particularly in high-dimensional problems where traditional control variates are difficult to construct analytically. By training neural networks to learn auxiliary functions correlated with the target observable, NCVs can significantly reduce estimator variance while preserving unbiasedness. However, a critical but often overlooked aspect of NCV training is the role of autocorrelated samples generated by Markov Chain Monte Carlo (MCMC). While such samples are typically discarded for error estimation due to their statistical redundancy, they may contain useful information about the structure of the underlying probability distribution that can benefit the training process. In this work, we systematically examine the effect of using correlated configurations in training neural control variates. We demonstrate, both conceptually and numerically, that training on correlated data can improve control variate performance, especially in settings with limited computational resources. Our analysis includes empirical results from U(1) gauge theory and scalar field theory, illustrating when and how autocorrelated samples enhance NCV construction. These findings provide practical guidance for the efficient use of MCMC data in training neural networks.

[LG-111] SmartUT: Receive Beamforming for Spectral Coexistence of NGSO Satellite Systems

链接: https://arxiv.org/abs/2505.07714
作者: Almoatssimbillah Saifaldawla,Eva Lagunas,Flor Ortiz,Abuzar B. M. Adam,Symeon Chatzinotas
类目: ignal Processing (eess.SP); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we investigate downlink co-frequency interference (CFI) mitigation in non-geostationary satellites orbits (NGSOs) co-existing systems. Traditional mitigation techniques, such as Zero-forcing (ZF), produce a null towards the direction of arrivals (DOAs) of the interfering signals, but they suffer from high computational complexity due to matrix inversions and required knowledge of the channel state information (CSI). Furthermore, adaptive beamformers, such as sample matrix inversion (SMI)-based minimum variance, provide poor performance when the available snapshots are limited. We propose a Mamba-based beamformer (MambaBF) that leverages an unsupervised deep learning (DL) approach and can be deployed on the user terminal (UT) antenna array, for assisting downlink beamforming and CFI mitigation using only a limited number of available array snapshots as input, and without CSI knowledge. Simulation results demonstrate that MambaBF consistently outperforms conventional beamforming techniques in mitigating interference and maximizing the signal-to-interference-plus-noise ratio (SINR), particularly under challenging conditions characterized by low SINR, limited snapshots, and imperfect CSI.

[LG-112] ransfer Learning Across Fixed-Income Product Classes

链接: https://arxiv.org/abs/2505.07676
作者: Nicolas Camenzind,Damir Filipovic
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Mathematical Finance (q-fin.MF)
*备注:

点击查看摘要

Abstract:We propose a framework for transfer learning of discount curves across different fixed-income product classes. Motivated by challenges in estimating discount curves from sparse or noisy data, we extend kernel ridge regression (KR) to a vector-valued setting, formulating a convex optimization problem in a vector-valued reproducing kernel Hilbert space (RKHS). Each component of the solution corresponds to the discount curve implied by a specific product class. We introduce an additional regularization term motivated by economic principles, promoting smoothness of spread curves between product classes, and show that it leads to a valid separable kernel structure. A main theoretical contribution is a decomposition of the vector-valued RKHS norm induced by separable kernels. We further provide a Gaussian process interpretation of vector-valued KR, enabling quantification of estimation uncertainty. Illustrative examples demonstrate that transfer learning significantly improves extrapolation performance and tightens confidence intervals compared to single-curve estimation.

[LG-113] Convergence of Time-Averag ed Mean Field Gradient Descent Dynamics for Continuous Multi-Player Zero-Sum Games

链接: https://arxiv.org/abs/2505.07642
作者: Yulong Lu,Pierre Monmarché
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Probability (math.PR); Machine Learning (stat.ML)
*备注: 21 pages

点击查看摘要

Abstract:The approximation of mixed Nash equilibria (MNE) for zero-sum games with mean-field interacting players has recently raised much interest in machine learning. In this paper we propose a mean-field gradient descent dynamics for finding the MNE of zero-sum games involving K players with K\geq 2 . The evolution of the players’ strategy distributions follows coupled mean-field gradient descent flows with momentum, incorporating an exponentially discounted time-averaging of gradients. First, in the case of a fixed entropic regularization, we prove an exponential convergence rate for the mean-field dynamics to the mixed Nash equilibrium with respect to the total variation metric. This improves a previous polynomial convergence rate for a similar time-averaged dynamics with different averaging factors. Moreover, unlike previous two-scale approaches for finding the MNE, our approach treats all player types on the same time scale. We also show that with a suitable choice of decreasing temperature, a simulated annealing version of the mean-field dynamics converges to an MNE of the initial unregularized problem.

[LG-114] Certified Data Removal Under High-dimensional Settings

链接: https://arxiv.org/abs/2505.07640
作者: Haolin Zou,Arnab Auddy,Yongchan Kwon,Kamiar Rahnama Rad,Arian Maleki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 46 pages, 4 figures

点击查看摘要

Abstract:Machine unlearning focuses on the computationally efficient removal of specific training data from trained models, ensuring that the influence of forgotten data is effectively eliminated without the need for full retraining. Despite advances in low-dimensional settings, where the number of parameters ( p ) is much smaller than the sample size ( n ), extending similar theoretical guarantees to high-dimensional regimes remains challenging. We propose an unlearning algorithm that starts from the original model parameters and performs a theory-guided sequence of Newton steps ( T \in \ 1,2\). After this update, carefully scaled isotropic Laplacian noise is added to the estimate to ensure that any (potential) residual influence of forget data is completely removed. We show that when both ( n, p \to \infty ) with a fixed ratio ( n/p ), significant theoretical and computational obstacles arise due to the interplay between the complexity of the model and the finite signal-to-noise ratio. Finally, we show that, unlike in low-dimensional settings, a single Newton step is insufficient for effective unlearning in high-dimensional problems – however, two steps are enough to achieve the desired certifiebility. We provide numerical experiments to support the certifiability and accuracy claims of this approach.

[LG-115] ACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining

链接: https://arxiv.org/abs/2505.07609
作者: Paul Primus,Florian Schmid,Gerhard Widmer
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: submitted to the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025. Dataset (Zenodo): this https URL , Implementation (GitHub): this https URL

点击查看摘要

Abstract:Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero-shot classification, audio retrieval, audio captioning, and text-conditioned audio generation. Existing contrastive language-audio pretrained models are typically trained using global, clip-level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP-like language-audio models - particularly, if they are expected to produce frame-level embeddings - can benefit from a stronger temporal supervision. To confirm our hypothesis, we curate a novel dataset of approximately 12,000 audio recordings from Freesound, each annotated with single-sentence free-text descriptions linked to a specific temporal segment in an audio recording. We use large language models to clean these annotations by removing references to non-audible events, transcribed speech, typos, and annotator language bias. We further propose a frame-wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording and demonstrate that our model has better temporal text-audio alignment abilities compared to models trained only on global captions when evaluated on the AudioSet Strong benchmark. The dataset and our source code are available on Zenodo and GitHub, respectively.

[LG-116] ALPCAH: Subspace Learning for Sample-wise Heteroscedastic Data

链接: https://arxiv.org/abs/2505.07272
作者: Javier Salazar Cavazos,Jeffrey A. Fessler,Laura Balzano
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a subspace learning method, named ALPCAH, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace basis associated with the low-rank structure of the data. Our method makes no distributional assumptions of the low-rank component and does not assume that the noise variances are known. Further, this method uses a soft rank constraint that does not require subspace dimension to be known. Additionally, this paper develops a matrix factorized version of ALPCAH, named LR-ALPCAH, that is much faster and more memory efficient at the cost of requiring subspace dimension to be known or estimated. Simulations and real data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing algorithms. Code available at this https URL.

[LG-117] Adaptive Robust and Scalable Bayesian Filtering for Online Learning

链接: https://arxiv.org/abs/2505.07267
作者: Gerardo Duran-Martin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: PhD thesis

点击查看摘要

Abstract:In this thesis, we introduce Bayesian filtering as a principled framework for tackling diverse sequential machine learning problems, including online (continual) learning, prequential (one-step-ahead) forecasting, and contextual bandits. To this end, this thesis addresses key challenges in applying Bayesian filtering to these problems: adaptivity to non-stationary environments, robustness to model misspecification and outliers, and scalability to the high-dimensional parameter space of deep neural networks. We develop novel tools within the Bayesian filtering framework to address each of these challenges, including: (i) a modular framework that enables the development adaptive approaches for online learning; (ii) a novel, provably robust filter with similar computational cost to standard filters, that employs Generalised Bayes; and (iii) a set of tools for sequentially updating model parameters using approximate second-order optimisation methods that exploit the overparametrisation of high-dimensional parametric models such as neural networks. Theoretical analysis and empirical results demonstrate the improved performance of our methods in dynamic, high-dimensional, and misspecified models.

[LG-118] he Influence of the Memory Capacity of Neural DDEs on the Universal Approximation Property

链接: https://arxiv.org/abs/2505.07244
作者: Christian Kuehn,Sara-Viola Kuntz
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Neural Ordinary Differential Equations (Neural ODEs), which are the continuous-time analog of Residual Neural Networks (ResNets), have gained significant attention in recent years. Similarly, Neural Delay Differential Equations (Neural DDEs) can be interpreted as an infinite depth limit of Densely Connected Residual Neural Networks (DenseResNets). In contrast to traditional ResNet architectures, DenseResNets are feed-forward networks that allow for shortcut connections across all layers. These additional connections introduce memory in the network architecture, as typical in many modern architectures. In this work, we explore how the memory capacity in neural DDEs influences the universal approximation property. The key parameter for studying the memory capacity is the product K \tau of the Lipschitz constant and the delay of the DDE. In the case of non-augmented architectures, where the network width is not larger than the input and output dimensions, neural ODEs and classical feed-forward neural networks cannot have the universal approximation property. We show that if the memory capacity K\tau is sufficiently small, the dynamics of the neural DDE can be approximated by a neural ODE. Consequently, non-augmented neural DDEs with a small memory capacity also lack the universal approximation property. In contrast, if the memory capacity K\tau is sufficiently large, we can establish the universal approximation property of neural DDEs for continuous functions. If the neural DDE architecture is augmented, we can expand the parameter regions in which universal approximation is possible. Overall, our results show that by increasing the memory capacity K\tau , the infinite-dimensional phase space of DDEs with positive delay \tau0 is not sufficient to guarantee a direct jump transition to universal approximation, but only after a certain memory threshold, universal approximation holds.

[LG-119] Exact Spin Elimination in Ising Hamiltonians and Energy-Based Machine Learning

链接: https://arxiv.org/abs/2505.07163
作者: Natalia G. Berloff
类目: Quantum Physics (quant-ph); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 28 pages, 6 figures

点击查看摘要

Abstract:We present an exact spin-elimination technique that reduces the dimensionality of both quadratic and k-local Ising Hamiltonians while preserving their original ground-state configurations. By systematically replacing each removed spin with an effective interaction among its neighbors, our method lowers the total spin count without invoking approximations or iterative recalculations. This capability is especially beneficial for hardware-constrained platforms, classical or quantum, that can directly implement multi-body interactions but have limited qubit or spin resources. We demonstrate three key advances enabled by this technique. First, we handle larger instances of benchmark problems such as Max-Cut on cubic graphs without exceeding a 2-local interaction limit. Second, we reduce qubit requirements in QAOA-based integer factorization on near-term quantum devices, thus extending the feasible range of integers to be factorized. Third, we improve memory capacity in Hopfield associative memories and enhance memory retrieval by suppressing spurious attractors, enhancing retrieval performance. Our spin-elimination procedure trades local spin complexity for higher-order couplings or higher node degrees in a single pass, opening new avenues for scaling up combinatorial optimization and energy-based machine learning on near-term hardware. Finally, these results underscore that the next-generation physical spin machines will likely capitalize on k-local spin Hamiltonians to offer an alternative to classical computations.

[LG-120] Constrained Online Decision-Making with Density Estimation Oracles

链接: https://arxiv.org/abs/2505.07101
作者: Haichen Hu,David Simchi-Levi,Navid Azizan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contextual online decision-making problems with constraints appear in a wide range of real-world applications, such as personalized recommendation with resource limits, adaptive experimental design, and decision-making under safety or fairness requirements. In this paper, we investigate a general formulation of sequential decision-making with stage-wise feasibility constraints, where at each round, the learner must select an action based on observed context while ensuring that a problem-specific feasibility criterion is satisfied. We propose a unified algorithmic framework that captures many existing constrained learning problems, including constrained bandits, active learning with label budgets, online hypothesis testing with Type I error control, and model calibration. Central to our approach is the concept of upper counterfactual confidence bounds, which enables the design of practically efficient online algorithms with strong theoretical guarantee using any offline conditional density estimation oracle. Technically, to handle feasibility constraints in complex environments, we introduce a generalized notion of the eluder dimension - extending it from the classical setting based on square loss to a broader class of metric-like probability divergences. This allows us to capture the complexity of various density function classes and characterize the utility regret incurred due to feasibility constraint uncertainty. Our result offers a principled foundation for constrained sequential decision-making in both theory and practice.

[LG-121] A Sparse Bayesian Learning Algorithm for Estimation of Interaction Kernels in Motsch-Tadmor Model

链接: https://arxiv.org/abs/2505.07068
作者: Jinchao Feng,Sui Tang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注: 18 pages

点击查看摘要

Abstract:In this paper, we investigate the data-driven identification of asymmetric interaction kernels in the Motsch-Tadmor model based on observed trajectory data. The model under consideration is governed by a class of semilinear evolution equations, where the interaction kernel defines a normalized, state-dependent Laplacian operator that governs collective dynamics. To address the resulting nonlinear inverse problem, we propose a variational framework that reformulates kernel identification using the implicit form of the governing equations, reducing it to a subspace identification problem. We establish an identifiability result that characterizes conditions under which the interaction kernel can be uniquely recovered up to scale. To solve the inverse problem robustly, we develop a sparse Bayesian learning algorithm that incorporates informative priors for regularization, quantifies uncertainty, and enables principled model selection. Extensive numerical experiments on representative interacting particle systems demonstrate the accuracy, robustness, and interpretability of the proposed framework across a range of noise levels and data regimes.

[LG-122] Learning curves theory for hierarchically compositional data with power-law distributed features

链接: https://arxiv.org/abs/2505.07067
作者: Francesco Cagnetta,Hyunmo Kang,Matthieu Wyart
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent theories suggest that Neural Scaling Laws arise whenever the task is linearly decomposed into power-law distributed units. Alternatively, scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars – probabilistic models that generate data via a hierarchy of production rules. For classification, we show that having power-law distributed production rules results in a power-law learning curve with an exponent depending on the rules’ distribution and a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the local details of the learning curve, but not the exponent describing the large-scale behaviour.

[LG-123] Outperformance Score: A Universal Standardization Method for Confusion-Matrix-Based Classification Performance Metrics

链接: https://arxiv.org/abs/2505.07033
作者: Ningsheng Zhao,Trang Bui,Jia Yuan Yu,Krzysztof Dzieciolowski
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Many classification performance metrics exist, each suited to a specific application. However, these metrics often differ in scale and can exhibit varying sensitivity to class imbalance rates in the test set. As a result, it is difficult to use the nominal values of these metrics to interpret and evaluate classification performances, especially when imbalance rates vary. To address this problem, we introduce the outperformance score function, a universal standardization method for confusion-matrix-based classification performance (CMBCP) metrics. It maps any given metric to a common scale of [0,1] , while providing a clear and consistent interpretation. Specifically, the outperformance score represents the percentile rank of the observed classification performance within a reference distribution of possible performances. This unified framework enables meaningful comparison and monitoring of classification performance across test sets with differing imbalance rates. We illustrate how the outperformance scores can be applied to a variety of commonly used classification performance metrics and demonstrate the robustness of our method through experiments on real-world datasets spanning multiple classification applications.

[LG-124] Unraveling Quantum Environments: Transformer-Assisted Learning in Lindblad Dynamics

链接: https://arxiv.org/abs/2505.06928
作者: Chi-Sheng Chen,En-Jui Kuo
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding dissipation in open quantum systems is crucial for the development of robust quantum technologies. In this work, we introduce a Transformer-based machine learning framework to infer time-dependent dissipation rates in quantum systems governed by the Lindblad master equation. Our approach uses time series of observable quantities, such as expectation values of single Pauli operators, as input to learn dissipation profiles without requiring knowledge of the initial quantum state or even the system Hamiltonian. We demonstrate the effectiveness of our approach on a hierarchy of open quantum models of increasing complexity, including single-qubit systems with time-independent or time-dependent jump rates, two-qubit interacting systems (e.g., Heisenberg and transverse Ising models), and the Jaynes–Cummings model involving light–matter interaction and cavity loss with time-dependent decay rates. Our method accurately reconstructs both fixed and time-dependent decay rates from observable time series. To support this, we prove that under reasonable assumptions, the jump rates in all these models are uniquely determined by a finite set of observables, such as qubit and photon measurements. In practice, we combine Transformer-based architectures with lightweight feature extraction techniques to efficiently learn these dynamics. Our results suggest that modern machine learning tools can serve as scalable and data-driven alternatives for identifying unknown environments in open quantum systems. Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG) Cite as: arXiv:2505.06928 [quant-ph] (or arXiv:2505.06928v1 [quant-ph] for this version) https://doi.org/10.48550/arXiv.2505.06928 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-125] Stability Regularized Cross-Validation

链接: https://arxiv.org/abs/2505.06927
作者: Ryan Cory-Wright,Andrés Gómez
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Some of this material previously appeared in 2306.14851v2 , which we have split into two papers (this one and 2306.14851v3 ), because it contained two ideas that need separate papers

点击查看摘要

Abstract:We revisit the problem of ensuring strong test-set performance via cross-validation. Motivated by the generalization theory literature, we propose a nested k-fold cross-validation scheme that selects hyperparameters by minimizing a weighted sum of the usual cross-validation metric and an empirical model-stability measure. The weight on the stability term is itself chosen via a nested cross-validation procedure. This reduces the risk of strong validation set performance and poor test set performance due to instability. We benchmark our procedure on a suite of 13 real-world UCI datasets, and find that, compared to k-fold cross-validation over the same hyperparameters, it improves the out-of-sample MSE for sparse ridge regression and CART by 4% on average, but has no impact on XGBoost. This suggests that for interpretable and unstable models, such as sparse regression and CART, our approach is a viable and computationally affordable method for improving test-set performance.

[LG-126] Near-Field Channel Estimation for XL-MIMO: A Deep Generative Model Guided by Side Information

链接: https://arxiv.org/abs/2505.06900
作者: Zhenzhou Jin,Li You,Derrick Wing Kwan Ng,Xiang-Gen Xia,Xiqi Gao
类目: ignal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 15 pages, 11 figures, to appear on IEEE Transactions on Cognitive Communications and Networking

点击查看摘要

Abstract:This paper investigates the near-field (NF) channel estimation (CE) for extremely large-scale multiple-input multiple-output (XL-MIMO) systems. Considering the pronounced NF effects in XL-MIMO communications, we first establish a joint angle-distance (AD) domain-based spherical-wavefront physical channel model that captures the inherent sparsity of XL-MIMO channels. Leveraging the channel’s sparsity in the joint AD domain, the CE is approached as a task of reconstructing sparse signals. Anchored in this framework, we first propose a compressed sensing algorithm to acquire a preliminary channel estimate. Harnessing the powerful implicit prior learning capability of generative artificial intelligence (GenAI), we further propose a GenAI-based approach to refine the estimated channel. Specifically, we introduce the preliminary estimated channel as side information, and derive the evidence lower bound (ELBO) of the log-marginal distribution of the target NF channel conditioned on the preliminary estimated channel, which serves as the optimization objective for the proposed generative diffusion model (GDM). Additionally, we introduce a more generalized version of the GDM, the non-Markovian GDM (NM-GDM), to accelerate the sampling process, achieving an approximately tenfold enhancement in sampling efficiency. Experimental results indicate that the proposed approach is capable of offering substantial performance gain in CE compared to existing benchmark schemes within NF XL-MIMO systems. Furthermore, our approach exhibits enhanced generalization capabilities in both the NF or far-field (FF) regions.

[LG-127] NewsNet-SDF: Stochastic Discount Factor Estimation with Pretrained Language Model News Embeddings via Adversarial Networks

链接: https://arxiv.org/abs/2505.06864
作者: Shunyao Wang,Ming Cheng,Christina Dan Wang
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stochastic Discount Factor (SDF) models provide a unified framework for asset pricing and risk assessment, yet traditional formulations struggle to incorporate unstructured textual information. We introduce NewsNet-SDF, a novel deep learning framework that seamlessly integrates pretrained language model embeddings with financial time series through adversarial networks. Our multimodal architecture processes financial news using GTE-multilingual models, extracts temporal patterns from macroeconomic data via LSTM networks, and normalizes firm characteristics, fusing these heterogeneous information sources through an innovative adversarial training mechanism. Our dataset encompasses approximately 2.5 million news articles and 10,000 unique securities, addressing the computational challenges of processing and aligning text data with financial time series. Empirical evaluations on U.S. equity data (1980-2022) demonstrate NewsNet-SDF substantially outperforms alternatives with a Sharpe ratio of 2.80. The model shows a 471% improvement over CAPM, over 200% improvement versus traditional SDF implementations, and a 74% reduction in pricing errors compared to the Fama-French five-factor model. In comprehensive comparisons, our deep learning approach consistently outperforms traditional, modern, and other neural asset pricing models across all key metrics. Ablation studies confirm that text embeddings contribute significantly more to model performance than macroeconomic features, with news-derived principal components ranking among the most influential determinants of SDF dynamics. These results validate the effectiveness of our multimodal deep learning approach in integrating unstructured text with traditional financial data for more accurate asset pricing, providing new insights for digital intelligent decision-making in financial technology.

[LG-128] A stochastic gradient method for trilevel optimization

链接: https://arxiv.org/abs/2505.06805
作者: Tommaso Giovannelli,Griffin Dean Kent,Luis Nunes Vicente
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:With the success that the field of bilevel optimization has seen in recent years, similar methodologies have started being applied to solving more difficult applications that arise in trilevel optimization. At the helm of these applications are new machine learning formulations that have been proposed in the trilevel context and, as a result, efficient and theoretically sound stochastic methods are required. In this work, we propose the first-ever stochastic gradient descent method for solving unconstrained trilevel optimization problems and provide a convergence theory that covers all forms of inexactness of the trilevel adjoint gradient, such as the inexact solutions of the middle-level and lower-level problems, inexact computation of the trilevel adjoint formula, and noisy estimates of the gradients, Hessians, Jacobians, and tensors of third-order derivatives involved. We also demonstrate the promise of our approach by providing numerical results on both synthetic trilevel problems and trilevel formulations for hyperparameter adversarial tuning.

[LG-129] Reverse-BSDE Monte Carlo

链接: https://arxiv.org/abs/2505.06800
作者: Jairon H. N. Batista,Flávio B. Gonçalves,Yuri F. Saporito,Rodrigo S. Targino
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Recently, there has been a growing interest in generative models based on diffusions driven by the empirical robustness of these methods in generating high-dimensional photorealistic images and the possibility of using the vast existing toolbox of stochastic differential equations. %This remarkable ability may stem from their capacity to model and generate multimodal distributions. In this work, we offer a novel perspective on the approach introduced in Song et al. (2021), shifting the focus from a “learning” problem to a “sampling” problem. To achieve this, we reformulate the equations governing diffusion-based generative models as a Forward-Backward Stochastic Differential Equation (FBSDE), which avoids the well-known issue of pre-estimating the gradient of the log target density. The solution of this FBSDE is proved to be unique using non-standard techniques. Additionally, we propose a numerical solution to this problem, leveraging on Deep Learning techniques. This reformulation opens new pathways for sampling multidimensional distributions with densities known up to a normalization constant, a problem frequently encountered in Bayesian statistics.

[LG-130] Quantum RNNs and LSTMs Through Entangling and Disentangling Power of Unitary Transformations

链接: https://arxiv.org/abs/2505.06774
作者: Ammar Daskin
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: the simulation code can be downloaded from this https URL

点击查看摘要

Abstract:In this paper, we discuss how quantum recurrent neural networks (RNNs) and their enhanced version, long short-term memory (LSTM) networks, can be modeled using the core ideas presented in Ref.[1], where the entangling and disentangling power of unitary transformations is investigated. In particular, we interpret entangling and disentangling power as information retention and forgetting mechanisms in LSTMs. Therefore, entanglement becomes a key component of the optimization (training) process. We believe that, by leveraging prior knowledge of the entangling power of unitaries, the proposed quantum-classical framework can guide and help to design better-parameterized quantum circuits for various real-world applications.

[LG-131] Out-of-Sample Embedding with Proximity Data: Projection versus Restricted Reconstruction

链接: https://arxiv.org/abs/2505.06756
作者: Michael W. Trosset,Kaiyi Tan,Minh Tang,Carey E. Priebe
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 19 pages, 2 figures

点击查看摘要

Abstract:The problem of using proximity (similarity or dissimilarity) data for the purpose of “adding a point to a vector diagram” was first studied by J.C. Gower in 1968. Since then, a number of methods – mostly kernel methods – have been proposed for solving what has come to be called the problem of out-of-sample embedding. We survey the various kernel methods that we have encountered and show that each can be derived from one or the other of two competing strategies: projection or restricted reconstruction. Projection can be analogized to a well-known formula for adding a point to a principal component analysis. Restricted reconstruction poses a different challenge: how to best approximate redoing the entire multivariate analysis while holding fixed the vector diagram that was previously obtained. This strategy results in a nonlinear optimization problem that can be simplified to a unidimensional search. Various circumstances may warrant either projection or restricted reconstruction.

[LG-132] Efficient Parallelization of Message Passing Neural Networks

链接: https://arxiv.org/abs/2505.06711
作者: Junfan Xia,Bin Jiang
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注: 33 pages, 8 figures

点击查看摘要

Abstract:Machine learning potentials have achieved great success in accelerating atomistic simulations. Many of them rely on local descriptors that readily allow parallelization. More recent message passing neural network (MPNN) models have demonstrated their superior accuracy and become increasingly popular. However, parallelizing MPNN models for large-scale simulations across compute nodes remains a challenge, as the previously argued poor scalability with the number of MP layers and the necessity of data communication. Here, we propose an efficient parallel algorithm for MPNN models, in which additional data communication is minimized among local atoms only in each MP layer without redundant computation, thus scaling linearly with the layer number. Integrated with our recursively embedded atom neural network model, this algorithm demonstrates excellent strong scaling and weak scaling behaviors in several benchmark systems. This approach enables massive molecular dynamics simulations on MPNN models for hundreds of millions of atoms as fast as on strictly local models, vastly extending the applicability of the MPNN potential to an unprecedented scale. This general parallelization framework can empower various MPNN models to efficiently simulate very large and complex systems.

[LG-133] Learning Guarantee of Reward Modeling Using Deep Neural Networks

链接: https://arxiv.org/abs/2505.06601
作者: Yuanhang Luo,Yeheng Ge,Ruijian Han,Guohao Shen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we study the learning theory of reward modeling with pairwise comparison data using deep neural networks. We establish a novel non-asymptotic regret bound for deep reward estimators in a non-parametric setting, which depends explicitly on the network architecture. Furthermore, to underscore the critical importance of clear human beliefs, we introduce a margin-type condition that assumes the conditional winning probability of the optimal action in pairwise comparisons is significantly distanced from 1/2. This condition enables a sharper regret bound, which substantiates the empirical efficiency of Reinforcement Learning from Human Feedback and highlights clear human beliefs in its success. Notably, this improvement stems from high-quality pairwise comparison data implied by the margin-type condition, is independent of the specific estimators used, and thus applies to various learning algorithms and models.

[LG-134] High-Dimensional Importance-Weighted Information Criteria: Theory and Optimality

链接: https://arxiv.org/abs/2505.06531
作者: Yong-Syun Cao,Shinpei Imori,Ching-Kang Ing
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Imori and Ing (2025) proposed the importance-weighted orthogonal greedy algorithm (IWOGA) for model selection in high-dimensional misspecified regression models under covariate shift. To determine the number of IWOGA iterations, they introduced the high-dimensional importance-weighted information criterion (HDIWIC). They argued that the combined use of IWOGA and HDIWIC, IWOGA + HDIWIC, achieves an optimal trade-off between variance and squared bias, leading to optimal convergence rates in terms of conditional mean squared prediction error. In this article, we provide a theoretical justification for this claim by establishing the optimality of IWOGA + HDIWIC under a set of reasonable assumptions.

[LG-135] Fair Representation Learning for Continuous Sensitive Attributes using Expectation of Integral Probability Metrics

链接: https://arxiv.org/abs/2505.06435
作者: Insung Kong,Kunwoong Kim,Yongdai Kim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 42 pages, 30 figures. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

点击查看摘要

Abstract:AI fairness, also known as algorithmic fairness, aims to ensure that algorithms operate without bias or discrimination towards any individual or group. Among various AI algorithms, the Fair Representation Learning (FRL) approach has gained significant interest in recent years. However, existing FRL algorithms have a limitation: they are primarily designed for categorical sensitive attributes and thus cannot be applied to continuous sensitive attributes, such as age or income. In this paper, we propose an FRL algorithm for continuous sensitive attributes. First, we introduce a measure called the Expectation of Integral Probability Metrics (EIPM) to assess the fairness level of representation space for continuous sensitive attributes. We demonstrate that if the distribution of the representation has a low EIPM value, then any prediction head constructed on the top of the representation become fair, regardless of the selection of the prediction head. Furthermore, EIPM possesses a distinguished advantage in that it can be accurately estimated using our proposed estimator with finite samples. Based on these properties, we propose a new FRL algorithm called Fair Representation using EIPM with MMD (FREM). Experimental evidences show that FREM outperforms other baseline methods.

[LG-136] Adaptive Bayesian Very Short-Term Wind Power Forecasting Based on the Generalised Logit Transformation

链接: https://arxiv.org/abs/2505.06310
作者: Tao Shen,Jethro Browell,Daniela Castro-Camilo
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 31 pages, 10 figures and tables. Submitted to International Journal of Forecasting

点击查看摘要

Abstract:Wind power plays an increasingly significant role in achieving the 2050 Net Zero Strategy. Despite its rapid growth, its inherent variability presents challenges in forecasting. Accurately forecasting wind power generation is one key demand for the stable and controllable integration of renewable energy into existing grid operations. This paper proposes an adaptive method for very short-term forecasting that combines the generalised logit transformation with a Bayesian approach. The generalised logit transformation processes double-bounded wind power data to an unbounded domain, facilitating the application of Bayesian methods. A novel adaptive mechanism for updating the transformation shape parameter is introduced to leverage Bayesian updates by recovering a small sample of representative data. Four adaptive forecasting methods are investigated, evaluating their advantages and limitations through an extensive case study of over 100 wind farms ranging four years in the UK. The methods are evaluated using the Continuous Ranked Probability Score and we propose the use of functional reliability diagrams to assess calibration. Results indicate that the proposed Bayesian method with adaptive shape parameter updating outperforms benchmarks, yielding consistent improvements in CRPS and forecast reliability. The method effectively addresses uncertainty, ensuring robust and accurate probabilistic forecasting which is essential for grid integration and decision-making.

[LG-137] ALFEE: Adaptive Large Foundation Model for EEG Representation

链接: https://arxiv.org/abs/2505.06291
作者: Wei Xiong,Junming Lin,Jiangtong Li,Jie Li,Changjun Jiang
类目: ignal Processing (eess.SP); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 17pages, 17 figures

点击查看摘要

Abstract:While foundation models excel in text, image, and video domains, the critical biological signals, particularly electroencephalography(EEG), remain underexplored. EEG benefits neurological research with its high temporal resolution, operational practicality, and safety profile. However, low signal-to-noise ratio, inter-subject variability, and cross-paradigm differences hinder the generalization of current models. Existing methods often employ simplified strategies, such as a single loss function or a channel-temporal joint representation module, and suffer from a domain gap between pretraining and evaluation tasks that compromises efficiency and adaptability. To address these limitations, we propose the Adaptive Large Foundation model for EEG signal representation(ALFEE) framework, a novel hybrid transformer architecture with two learning stages for robust EEG representation learning. ALFEE employs a hybrid attention that separates channel-wise feature aggregation from temporal dynamics modeling, enabling robust EEG representation with variable channel configurations. A channel encoder adaptively compresses variable channel information, a temporal encoder captures task-guided evolution, and a hybrid decoder reconstructs signals in both temporal and frequency domains. During pretraining, ALFEE optimizes task prediction, channel and temporal mask reconstruction, and temporal forecasting to enhance multi-scale and multi-channel representation. During fine-tuning, a full-model adaptation with a task-specific token dictionary and a cross-attention layer boosts performance across multiple tasks. After 25,000 hours of pretraining, extensive experimental results on six downstream EEG tasks demonstrate the superior performance of ALFEE over existing models. Our ALFEE framework establishes a scalable foundation for biological signal analysis with implementation at this https URL.

[LG-138] From Biometrics to Environmental Control: AI-Enhanced Digital Twins for Personalized Health Interventions in Healing Landscapes

链接: https://arxiv.org/abs/2505.06263
作者: Yiping Meng,Yiming Sun
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The dynamic nature of human health and comfort calls for adaptive systems that respond to individual physiological needs in real time. This paper presents an AI-enhanced digital twin framework that integrates biometric signals, specifically electrocardiogram (ECG) data, with environmental parameters such as temperature, humidity, and ventilation. Leveraging IoT-enabled sensors and biometric monitoring devices, the system continuously acquires, synchronises, and preprocesses multimodal data streams to construct a responsive virtual replica of the physical environment. To validate this framework, a detailed case study is conducted using the MIT-BIH noise stress test dataset. ECG signals are filtered and segmented using dynamic sliding windows, followed by extracting heart rate variability (HRV) features such as SDNN, BPM, QTc, and LF/HF ratio. Relative deviation metrics are computed against clean baselines to quantify stress responses. A random forest classifier is trained to predict stress levels across five categories, and Shapley Additive exPlanations (SHAP) is used to interpret model behaviour and identify key contributing features. These predictions are mapped to a structured set of environmental interventions using a Five Level Stress Intervention Mapping, which activates multi-scale responses across personal, room, building, and landscape levels. This integration of physiological insight, explainable AI, and adaptive control establishes a new paradigm for health-responsive built environments. It lays the foundation for the future development of intelligent, personalised healing spaces.

[LG-139] An Early Warning Model for Forced Displacement

链接: https://arxiv.org/abs/2505.06249
作者: Geraldine Henningsen
类目: Applications (stat.AP); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 13 pages, 6 figures

点击查看摘要

Abstract:Monitoring tools for anticipatory action are increasingly gaining traction to improve the efficiency and timeliness of humanitarian responses. Whilst predictive models can now forecast conflicts with high accuracy, translating these predictions into potential forced displacement movements remains challenging because it is often unclear which precise events will trigger significant population movements. This paper presents a novel monitoring approach for refugee and asylum seeker flows that addresses this challenge. Using gradient boosting classification, we combine conflict forecasts with a comprehensive set of economic, political, and demographic variables to assess two distinct risks at the country of origin: the likelihood of significant displacement flows and the probability of sudden increases in these flows. The model generates country-specific monthly risk indices for these two events with prediction horizons of one, three, and six months. Our analysis shows high accuracy in predicting significant displacement flows and good accuracy in forecasting sudden increases in displacement–the latter being inherently more difficult to predict, given the complexity of displacement triggers. We achieve these results by including predictive factors beyond conflict, thereby demonstrating that forced displacement risks can be assessed through an integrated analysis of multiple country-level indicators. Whilst these risk indices provide valuable quantitative support for humanitarian planning, they should always be understood as decision-support tools within a broader analytical framework.

[LG-140] A Transformer-Based Approach for Diagnosing Fault Cases in Optical Fiber Amplifiers

链接: https://arxiv.org/abs/2505.06245
作者: Dominic Schneider,Lutz Rapp,Christoph Ament
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This paper has been accepted for publication at the 25th International Conference on Transparent Optical Networks (ICTON) 2025

点击查看摘要

Abstract:A transformer-based deep learning approach is presented that enables the diagnosis of fault cases in optical fiber amplifiers using condition-based monitoring time series data. The model, Inverse Triple-Aspect Self-Attention Transformer (ITST), uses an encoder-decoder architecture, utilizing three feature extraction paths in the encoder, feature-engineered data for the decoder and a self-attention mechanism. The results show that ITST outperforms state-of-the-art models in terms of classification accuracy, which enables predictive maintenance for optical fiber amplifiers, reducing network downtimes and maintenance costs.

[LG-141] Supervised machine learning based signal demodulation in chaotic communications

链接: https://arxiv.org/abs/2505.06243
作者: Mykola Kozlenko
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, 1 table. This paper was originally published in 2022 International Conference on Innovative Solutions in Software Engineering (ICISSE), available: this https URL

点击查看摘要

Abstract:A chaotic modulation scheme is an efficient wideband communication method. It utilizes the deterministic chaos to generate pseudo-random carriers. Chaotic bifurcation parameter modulation is one of the well-known and widely-used techniques. This paper presents the machine learning based demodulation approach for the bifurcation parameter keying. It presents the structure of a convolutional neural network as well as performance metrics values for signals generated with the chaotic logistic map. The paper provides an assessment of the overall accuracy for binary signals. It reports the accuracy value of 0.88 for the bifurcation parameter deviation of 1.34% in the presence of additive white Gaussian noise at the normalized signal-to-noise ratio value of 20 dB for balanced dataset.

[LG-142] Equivariant Machine Learning Decoder for 3D Toric Codes

链接: https://arxiv.org/abs/2409.04300
作者: Oliver Weissl,Evgenii Egorov
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mitigating errors in computing and communication systems has seen a great deal of research since the beginning of the widespread use of these technologies. However, as we develop new methods to do computation or communication, we also need to reiterate the method used to deal with errors. Within the field of quantum computing, error correction is getting a lot of attention since errors can propagate fast and invalidate results, which makes the theoretical exponential speed increase in computation time, compared to traditional systems, obsolete. To correct errors in quantum systems, error-correcting codes are used. A subgroup of codes, topological codes, is currently the focus of many research papers. Topological codes represent parity check matrices corresponding to graphs embedded on a d -dimensional surface. For our research, the focus lies on the toric code with a 3D square lattice. The goal of any decoder is robustness to noise, which can increase with code size. However, a reasonable decoder performance scales polynomially with lattice size. As error correction is a time-sensitive operation, we propose a neural network using an inductive bias: equivariance. This allows the network to learn from a rather small subset of the exponentially growing training space of possible inputs. In addition, we investigate how transformer networks can help in correction. These methods will be compared with various configurations and previously published methods of decoding errors in the 3D toric code.

信息检索

[IR-0] Reproducibility Replicability and Insights into Visual Document Retrieval with Late Interaction

链接: https://arxiv.org/abs/2505.07730
作者: Jingfen Qiao,Jia-Huei Ju,Xinyu Ma,Evangelos Kanoulas,Andrew Yates
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Visual Document Retrieval (VDR) is an emerging research area that focuses on encoding and retrieving document images directly, bypassing the dependence on Optical Character Recognition (OCR) for document search. A recent advance in VDR was introduced by ColPali, which significantly improved retrieval effectiveness through a late interaction mechanism. ColPali’s approach demonstrated substantial performance gains over existing baselines that do not use late interaction on an established benchmark. In this study, we investigate the reproducibility and replicability of VDR methods with and without late interaction mechanisms by systematically evaluating their performance across multiple pre-trained vision-language models. Our findings confirm that late interaction yields considerable improvements in retrieval effectiveness; however, it also introduces computational inefficiencies during inference. Additionally, we examine the adaptability of VDR models to textual inputs and assess their robustness across text-intensive datasets within the proposed benchmark, particularly when scaling the indexing mechanism. Furthermore, our research investigates the specific contributions of late interaction by looking into query-patch matching in the context of visual document retrieval. We find that although query tokens cannot explicitly match image patches as in the text retrieval scenario, they tend to match the patch contains visually similar tokens or their surrounding patches.

[IR-1] KAQG: A Knowledge-Graph-Enhanced RAG for Difficulty-Controlled Question Generation

链接: https://arxiv.org/abs/2505.07618
作者: Ching Han Chen,Ming Fang Shiu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:KAQG introduces a decisive breakthrough for Retrieval-Augmented Generation (RAG) by explicitly tackling the two chronic weaknesses of current pipelines: transparent multi-step reasoning and fine-grained cognitive difficulty control. This transforms RAG from a passive retriever into an accountable generator of calibrated exam items. Technically, the framework fuses knowledge graphs, RAG retrieval, and educational assessment theory into a single pipeline. Domain passages are parsed into a structured graph; graph-aware retrieval feeds fact chains to an LLM; and an assessment layer governed by Bloom’s Taxonomy levels and Item Response Theory (IRT) transforms those chains into psychometrically sound questions. This cross-disciplinary marriage yields two scholarly contributions: it shows how semantic graph contexts guide LLM reasoning paths, and it operationalizes difficulty metrics within the generation process, producing items whose IRT parameters match expert benchmarks. Every module, from KG construction scripts to the multi-agent reasoning scheduler and the automatic IRT validator, is openly released on GitHub. This enables peer laboratories to replicate experiments, benchmark against baselines, and extend individual components without licensing barriers. Its reproducible design paves the way for rigorous ablation studies, cross-domain transfer experiments, and shared leaderboards on multi-step reasoning benchmarks.

[IR-2] From raw affiliations to organization identifiers

链接: https://arxiv.org/abs/2505.07577
作者: Myrto Kallipoliti,Serafeim Chatzopoulos,Miriam Baglioni,Eleni Adamidi,Paris Koloveas,Thanasis Vergoulis
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
*备注: 16 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Accurate affiliation matching, which links affiliation strings to standardized organization identifiers, is critical for improving research metadata quality, facilitating comprehensive bibliometric analyses, and supporting data interoperability across scholarly knowledge bases. Existing approaches fail to handle the complexity of affiliation strings that often include mentions of multiple organizations or extraneous information. In this paper, we present AffRo, a novel approach designed to address these challenges, leveraging advanced parsing and disambiguation techniques. We also introduce AffRoDB, an expert-curated dataset to systematically evaluate affiliation matching algorithms, ensuring robust benchmarking. Results demonstrate the effectiveness of AffRp in accurately identifying organizations from complex affiliation strings.

[IR-3] Why Uncertainty Estimation Methods Fall Short in RAG : An Axiomatic Analysis

链接: https://arxiv.org/abs/2505.07459
作者: Heydar Soudani,Evangelos Kanoulas,Faegheh Hasibi
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are valued for their strong performance across various tasks, but they also produce inaccurate or misleading outputs. Uncertainty Estimation (UE) quantifies the model’s confidence and helps users assess response reliability. However, existing UE methods have not been thoroughly examined in scenarios like Retrieval-Augmented Generation (RAG), where the input prompt includes non-parametric knowledge. This paper shows that current UE methods cannot reliably assess correctness in the RAG setting. We further propose an axiomatic framework to identify deficiencies in existing methods and guide the development of improved approaches. Our framework introduces five constraints that an effective UE method should meet after incorporating retrieved documents into the LLM’s prompt. Experimental results reveal that no existing UE method fully satisfies all the axioms, explaining their suboptimal performance in RAG. We further introduce a simple yet effective calibration function based on our framework, which not only satisfies more axioms than baseline methods but also improves the correlation between uncertainty estimates and correctness.

[IR-4] Diffusion-driven SpatioTemporal Graph KANsformer for Medical Examination Recommendation

链接: https://arxiv.org/abs/2505.07431
作者: Jianan Li,Yangtao Zhou,Zhifu Zhao,Qinglan Huang,Jian Qi,Xiao He,Hua Chu,Fu Li
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommendation systems in AI-based medical diagnostics and treatment constitute a critical component of AI in healthcare. Although some studies have explored this area and made notable progress, healthcare recommendation systems remain in their nascent stage. And these researches mainly target the treatment process such as drug or disease recommendations. In addition to the treatment process, the diagnostic process, particularly determining which medical examinations are necessary to evaluate the condition, also urgently requires intelligent decision support. To bridge this gap, we first formalize the task of medical examination recommendations. Compared to traditional recommendations, the medical examination recommendation involves more complex interactions. This complexity arises from two folds: 1) The historical medical records for examination recommendations are heterogeneous and redundant, which makes the recommendation results susceptible to noise. 2) The correlation between the medical history of patients is often irregular, making it challenging to model spatiotemporal dependencies. Motivated by the above observation, we propose a novel Diffusion-driven SpatioTemporal Graph KANsformer for Medical Examination Recommendation (DST-GKAN) with a two-stage learning paradigm to solve the above challenges. In the first stage, we exploit a task-adaptive diffusion model to distill recommendation-oriented information by reducing the noises in heterogeneous medical data. In the second stage, a spatiotemporal graph KANsformer is proposed to simultaneously model the complex spatial and temporal relationships. Moreover, to facilitate the medical examination recommendation research, we introduce a comprehensive dataset. The experimental results demonstrate the state-of-the-art performance of the proposed method compared to various competitive baselines.

[IR-5] DARLR: Dual-Agent Offline Reinforcement Learning for Recommender Systems with Dynamic Reward SIGIR2025

链接: https://arxiv.org/abs/2505.07257
作者: Yi Zhang,Ruihong Qiu,Xuwei Xu,Jiajun Liu,Sen Wang
类目: Information Retrieval (cs.IR)
*备注: SIGIR 2025

点击查看摘要

Abstract:Model-based offline reinforcement learning (RL) has emerged as a promising approach for recommender systems, enabling effective policy learning by interacting with frozen world models. However, the reward functions in these world models, trained on sparse offline logs, often suffer from inaccuracies. Specifically, existing methods face two major limitations in addressing this challenge: (1) deterministic use of reward functions as static look-up tables, which propagates inaccuracies during policy learning, and (2) static uncertainty designs that fail to effectively capture decision risks and mitigate the impact of these inaccuracies. In this work, a dual-agent framework, DARLR, is proposed to dynamically update world models to enhance recommendation policies. To achieve this, a \textbf\textitselector is introduced to identify reference users by balancing similarity and diversity so that the \textbf\textitrecommender can aggregate information from these users and iteratively refine reward estimations for dynamic reward shaping. Further, the statistical features of the selected users guide the dynamic adaptation of an uncertainty penalty to better align with evolving recommendation requirements. Extensive experiments on four benchmark datasets demonstrate the superior performance of DARLR, validating its effectiveness. The code is available at this https URL.

[IR-6] A Generative Re-ranking Model for List-level Multi-objective Optimization at Taobao

链接: https://arxiv.org/abs/2505.07197
作者: Yue Meng,Cheng Guo,Yi Cao,Tong Liu,Bo Zheng
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:E-commerce recommendation systems aim to generate ordered lists of items for customers, optimizing multiple business objectives, such as clicks, conversions and Gross Merchandise Volume (GMV). Traditional multi-objective optimization methods like formulas or Learning-to-rank (LTR) models take effect at item-level, neglecting dynamic user intent and contextual item interactions. List-level multi-objective optimization in the re-ranking stage can overcome this limitation, but most current re-ranking models focus more on accuracy improvement with context. In addition, re-ranking is faced with the challenges of time complexity and diversity. In light of this, we propose a novel end-to-end generative re-ranking model named Sequential Ordered Regression Transformer-Generator (SORT-Gen) for the less-studied list-level multi-objective optimization problem. Specifically, SORT-Gen is divided into two parts: 1)Sequential Ordered Regression Transformer innovatively uses Transformer and ordered regression to accurately estimate multi-objective values for variable-length sub-lists. 2)Mask-Driven Fast Generation Algorithm combines multi-objective candidate queues, efficient item selection and diversity mechanism into model inference, providing a fast online list generation method. Comprehensive online experiments demonstrate that SORT-Gen brings +4.13% CLCK and +8.10% GMV for Baiyibutie, a notable Mini-app of Taobao. Currently, SORT-Gen has been successfully deployed in multiple scenarios of Taobao App, serving for a vast number of users.

[IR-7] A Reinforcement Learning Framework for Application-Specific TCP Congestion-Control

链接: https://arxiv.org/abs/2505.07042
作者: Jinming Xing,Muhammad Shahzad
类目: Networking and Internet Architecture (cs.NI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The Congestion Control (CC) module plays a critical role in the Transmission Control Protocol (TCP), ensuring the stability and efficiency of network data transmission. The CC approaches that are commonly used these days employ heuristics-based rules to adjust the sending rate. Due to their heuristics-based nature, these approaches are not only unable to adapt to changing network conditions but are also agnostic to the diverse requirements that different applications often have. Recently, several learning-based CC approaches have been proposed to adapt to changing network conditions. Unfortunately, they are not designed to take application requirements into account. Prior heuristics-based as well as learning-based CC approaches focus on achieving a singular objective, which is often to maximize throughput, even though a lot of applications care more about latency, packet losses, jitter, and different combinations of various network metrics. Motivated by this, we propose a Deep Reinforcement Learning (DRL) based CC framework, namely ASC, which allows any application to specify any arbitrary objectives that the network traffic of that application should achieve and is able to swiftly adapt to the changes in the objectives of the applications as well as to the changes in the network conditions. Our ASC framework further employs a client-server architecture that serves two purposes: 1) it makes ASC highly scalable in terms of the arrival and departure of TCP connections, and 2) it makes ASC very lightweight for the nodes maintaining the TCP connections. We implemented and extensively evaluated ASC in a variety of settings. Our results show that it can not only achieve various objectives but also outperforms prior approaches even in the specific objectives that those approaches were designed to achieve.

[IR-8] NetSight: Graph Attention Based Traffic Forecasting in Computer Networks

链接: https://arxiv.org/abs/2505.07034
作者: Jinming Xing,Guoheng Sun,Hui Sun,Linchao Pan,Shakir Mahmood,Xuanhao Luo,Muhammad Shahzad
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The traffic in today’s networks is increasingly influenced by the interactions among network nodes as well as by the temporal fluctuations in the demands of the nodes. Traditional statistical prediction methods are becoming obsolete due to their inability to address the non-linear and dynamic spatio-temporal dependencies present in today’s network traffic. The most promising direction of research today is graph neural networks (GNNs) based prediction approaches that are naturally suited to handle graph-structured data. Unfortunately, the state-of-the-art GNN approaches separate the modeling of spatial and temporal information, resulting in the loss of important information about joint dependencies. These GNN based approaches further do not model information at both local and global scales simultaneously, leaving significant room for improvement. To address these challenges, we propose NetSight. NetSight learns joint spatio-temporal dependencies simultaneously at both global and local scales from the time-series of measurements of any given network metric collected at various nodes in a network. Using the learned information, NetSight can then accurately predict the future values of the given network metric at those nodes in the network. We propose several new concepts and techniques in the design of NetSight, such as spatio-temporal adjacency matrix and node normalization. Through extensive evaluations and comparison with prior approaches using data from two large real-world networks, we show that NetSight significantly outperforms all prior state-of-the-art approaches. We will release the source code and data used in the evaluation of NetSight on the acceptance of this paper.

[IR-9] Incremental Analysis of Legacy Applications Using Knowledge Graphs for Application Modernization

链接: https://arxiv.org/abs/2505.06885
作者: Saravanan Krishnan,Amith Singhee,Keerthi Narayan Raghunath,Alex Mathai,Atul Kumar,David Wenk
类目: oftware Engineering (cs.SE); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Industries such as banking, telecom, and airlines - o6en have large so6ware systems that are several decades old. Many of these systems are written in old programming languages such as COBOL, PL/1, Assembler, etc. In many cases, the documentation is not updated, and those who developed/designed these systems are no longer around. Understanding these systems for either modernization or even regular maintenance has been a challenge. An extensive application may have natural boundaries based on its code dependencies and architecture. There are also other logical boundaries in an enterprise setting driven by business functions, data domains, etc. Due to these complications, the system architects generally plan their modernization across these logical boundaries in parts, thereby adopting an incremental approach for the modernization journey of the entire system. In this work, we present a so6ware system analysis tool that allows a subject ma=er expert (SME) or system architect to analyze a large so6ware system incrementally. We analyze the source code and other artifacts (such as data schema) to create a knowledge graph using a customizable ontology/schema. Entities and relations in our ontology can be defined for any combination of programming languages and platforms. Using this knowledge graph, the analyst can then define logical boundaries around dependent Entities (e.g. Programs, Transactions, Database Tables etc.). Our tool then presents different views showcasing the dependencies from the newly defined boundary to/from the other logical groups of the system. This exercise is repeated interactively to 1) Identify the Entities and groupings of interest for a modernization task and 2) Understand how a change in one part of the system may affect the other parts. To validate the efficacy of our tool, we provide an initial study of our system on two client applications.

[IR-10] OpenSky Report 2025: Improving Crowdsourced Flight Trajectories with ADS-C Data

链接: https://arxiv.org/abs/2505.06254
作者: Junzi Sun,Xavier Olive,Martin Strohmeier,Vincent Lenders
类目: ocial and Information Networks (cs.SI); Information Retrieval (cs.IR); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:The OpenSky Network has been collecting and providing crowdsourced air traffic surveillance data since 2013. The network has primarily focused on Automatic Dependent Surveillance–Broadcast (ADS-B) data, which provides high-frequency position updates over terrestrial areas. However, the ADS-B signals are limited over oceans and remote regions, where ground-based receivers are scarce. To address these coverage gaps, the OpenSky Network has begun incorporating data from the Automatic Dependent Surveillance–Contract (ADS-C) system, which uses satellite communication to track aircraft positions over oceanic regions and remote areas. In this paper, we analyze a dataset of over 720,000 ADS-C messages collected in 2024 from around 2,600 unique aircraft via the Alphasat satellite, covering Europe, Africa, and parts of the Atlantic Ocean. We present our approach to combining ADS-B and ADS-C data to construct detailed long-haul flight paths, particularly for transatlantic and African routes. Our findings demonstrate that this integration significantly improves trajectory reconstruction accuracy, allowing for better fuel consumption and emissions estimates. We illustrate how combined data captures flight patterns across previously underrepresented regions across Africa. Despite coverage limitations, this work marks an important advancement in providing open access to global flight trajectory data, enabling new research opportunities in air traffic management, environmental impact assessment, and aviation safety.

附件下载

点击下载今日全部论文列表