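This post lists the latest papers retrieved from arXiv.org on 2025-04-02. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the daily list by scheduled email, leave your email address in the comments.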
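Contents
Overview (2025-04-02)
517 papers were updated today, including:
- Natural Language Processing: 90 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 142 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 126 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 139 papers (Machine Learning (cs.LG))
Natural Language Processing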
[NLP-0] Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization
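Quick read: This paper addresses a weakness of existing selective retrieval methods for retrieval-augmented generation (RAG): they under-use the inherent knowledge of large language models (LLMs), which leads to suboptimal retrieval decisions and degraded generation performance. It proposes Self-Routing RAG (SR-RAG), a framework that binds selective retrieval with knowledge verbalization so that the LLM can decide dynamically between external retrieval and verbalizing its own parametric knowledge. A multi-task objective jointly optimizes the LLM on knowledge source selection, knowledge verbalization, and response generation. Dynamic knowledge source inference via nearest neighbor search further improves the accuracy of source decisions under domain shift. Experiments show that fine-tuning LLMs with SR-RAG significantly improves response accuracy and inference efficiency: retrievals drop by 29% while performance improves by 5.1% over the strongest selective retrieval baseline.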
Link: https://arxiv.org/abs/2504.01018
Authors: Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng
Affiliations: University of California, Los Angeles
Categories: Computation and Language (cs.CL)
Comments: Work in Progress
Abstract:Selective retrieval improves retrieval-augmented generation (RAG) by reducing distractions from low-quality retrievals and improving efficiency. However, existing approaches under-utilize the inherent knowledge of large language models (LLMs), leading to suboptimal retrieval decisions and degraded generation performance. To bridge this gap, we propose Self-Routing RAG (SR-RAG), a novel framework that binds selective retrieval with knowledge verbalization. SR-RAG enables an LLM to dynamically decide between external retrieval and verbalizing its own parametric knowledge. To this end, we design a multi-task objective that jointly optimizes an LLM on knowledge source selection, knowledge verbalization, and response generation. We further introduce dynamic knowledge source inference via nearest neighbor search to improve the accuracy of knowledge source decision under domain shifts. Fine-tuning three LLMs with SR-RAG significantly improves both their response accuracy and inference latency. Compared to the strongest selective retrieval baseline, SR-RAG reduces retrievals by 29% while improving the performance by 5.1%.
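The nearest-neighbor source decision mentioned in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the toy 2-D embeddings, the `datastore` of previously labeled queries, the `retrieve`/`verbalize` labels, and the plain majority vote among the k nearest neighbors. The abstract does not say which representations SR-RAG uses or how the neighbor signal is combined with the model's own decision.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def choose_source(query_vec, datastore, k=3):
    """Pick a knowledge source by voting among the k most similar stored queries.

    datastore: list of (embedding, label) pairs, where label is the source that
    worked best for that stored query (hypothetical labels, see lead-in).
    """
    ranked = sorted(datastore, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy datastore: two queries answered well from parametric knowledge,
# two that needed external retrieval.
datastore = [
    ([1.0, 0.0], "verbalize"),
    ([0.9, 0.1], "verbalize"),
    ([0.0, 1.0], "retrieve"),
    ([0.1, 0.9], "retrieve"),
]
print(choose_source([0.8, 0.2], datastore))
```

A lookup of this kind needs no retraining when the query distribution changes, which is one plausible reading of why the abstract ties it to domain shift.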
[NLP-1] When To Solve When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
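Quick read: This paper studies how to balance solution generation against verification with a generative reward model (GenRM) under a fixed inference budget. Self-Consistency (SC) samples multiple solutions and picks the answer by majority vote; another common approach scores each solution with a reward model (RM) and selects the best one. Recent generative reward models recast verification as next-token prediction, which opens a new axis for inference-time scaling. Under a limited budget, the core question is how to allocate compute to maximize performance. The paper compares GenRM and SC at equal inference budgets and finds that SC is more compute-efficient for most practical budgets. It also shows that, within the GenRM paradigm, compute-optimal inference scales solution generation more aggressively than the number of verifications. The work offers practical guidance for optimizing test-time scaling.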
Link: https://arxiv.org/abs/2504.01005
Authors: Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 29 pages
Abstract:Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving. A traditional approach, Self-Consistency (SC), generates multiple solutions to a problem and selects the most common answer via majority voting. Another common method involves scoring each solution with a reward model (verifier) and choosing the best one. Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis. Specifically, GenRM generates multiple verification chains-of-thought to score each solution. Under a limited inference budget, this introduces a fundamental trade-off: should you spend the budget on scaling solutions via SC or generate fewer solutions and allocate compute to verification via GenRM? To address this, we evaluate GenRM against SC under a fixed inference budget. Interestingly, we find that SC is more compute-efficient than GenRM for most practical inference budgets across diverse models and datasets. For instance, GenRM first matches SC after consuming up to 8x the inference compute and requires significantly more compute to outperform it. Furthermore, we derive inference scaling laws for the GenRM paradigm, revealing that compute-optimal inference favors scaling solution generation more aggressively than scaling the number of verifications. Our work provides practical guidance on optimizing test-time scaling by balancing solution generation and verification. The code is available at this https URL.
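To make the trade-off concrete, here is a small sketch of Self-Consistency voting together with a deliberately simplified budget count. The cost model is an assumption, not the paper's: it charges one unit per generated solution and one unit per verification chain, so `generation_calls` is a hypothetical helper that only shows how the same budget can buy many solutions or a few heavily verified ones.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most common final answer among sampled solutions (majority vote)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common sorts by count; equal counts keep first-seen order.
    return Counter(answers).most_common(1)[0][0]


def generation_calls(num_solutions, verifications_per_solution=0):
    """Count LLM generations under the assumed unit-cost model (see lead-in)."""
    return num_solutions * (1 + verifications_per_solution)


# Final answers extracted from five sampled solutions to one problem.
sampled = ["42", "41", "42", "42", "7"]
print(self_consistency(sampled))

# Under this cost model, 32 unverified solutions cost as much as
# 4 solutions with 7 verification chains each.
print(generation_calls(32), generation_calls(4, 7))
```

The sketch only illustrates the accounting; which split performs better at a given budget is the empirical question the paper studies.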
[NLP-2] Token embeddings violate the manifold hypothesis
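Quick read: This paper examines a possible mismatch between how we understand large language model (LLM) behavior and what we assume about the structure of the model's input space. It characterizes the structure of token embeddings, the input domain of LLMs, both empirically and theoretically, and proposes a statistical test called the "fiber bundle null", based on a generalization of a manifold called a fiber bundle. Under this model, the neighborhood of each token splits into well-defined signal and noise dimensions. Rejecting the null at a particular token indicates that the token has statistically significant local structure and merits further study. Running the test on several open-source LLMs, the authors find that the null is frequently rejected, so the token subspace is not a fiber bundle and hence not a manifold. As a consequence, when an LLM is given semantically equivalent prompts that contain different tokens, a prompt containing a token implicated by the test is likely to show more output variability, which affects how consistently model behavior can be interpreted.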
Link: https://arxiv.org/abs/2504.01002
Authors: Michael Robinson, Sourya Dey, Tony Chiang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 10 figures
Abstract:To fully understand the behavior of a large language model (LLM) requires our understanding of its input space. If this input space differs from our assumption, our understanding of and conclusions about the LLM is likely flawed, regardless of its architecture. Here, we elucidate the structure of the token embeddings, the input domain for LLMs, both empirically and theoretically. We present a generalized and statistically testable model where the neighborhood of each token splits into well-defined signal and noise dimensions. This model is based on a generalization of a manifold called a fiber bundle, so we denote our hypothesis test as the “fiber bundle null.” Failing to reject the null is uninformative, but rejecting it at a specific token indicates that token has a statistically significant local structure, and so is of interest to us. By running our test over several open-source LLMs, each with unique token embeddings, we find that the null is frequently rejected, and so the token subspace is provably not a fiber bundle and hence also not a manifold. As a consequence of our findings, when an LLM is presented with two semantically equivalent prompts, and if one prompt contains a token implicated by our test, that prompt will likely exhibit more output variability proportional to the local signal dimension of the token.
[NLP-3] Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
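Quick read: This paper tackles the growing difficulty of evaluating language models automatically, particularly on complex tasks across modalities. As models become more capable, approaches built on human-annotated datasets and fixed task-specific metrics show their limits, such as high annotation cost and poor adaptability to new tasks. Existing methods for automatically generating test data mostly rely on pre-existing data or are confined to individual tasks.
The proposed solution is the Zero-shot Benchmarking (ZSB) framework, which uses language models both to generate synthetic test data and to perform evaluation, so that high-quality benchmarks can be created for any task. ZSB is simple and flexible: it needs only one prompt for data generation and one for evaluation; it scales to tasks and languages where collecting data is difficult or costly; and it is model-agnostic, so increasingly challenging benchmarks can be built as models improve. Empirical studies on text and multimodal tasks validate ZSB and show that strong benchmarks can be built with open models, with judge model size and dataset variety as the key drivers of performance.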
Link: https://arxiv.org/abs/2504.01001
Authors: José Pombal, Nuno M. Guerreiro, Ricardo Rei, André F. T. Martins
Affiliations: Unbabel; Instituto de Telecomunicações; Instituto Superior Técnico, Universidade de Lisboa; MICS, CentraleSupélec, Université Paris-Saclay; ELLIS Unit Lisbon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:As language models improve and become capable of performing more complex tasks across modalities, evaluating them automatically becomes increasingly challenging. Developing strong and robust task-specific automatic metrics gets harder, and human-annotated test sets – which are expensive to create – saturate more quickly. A compelling alternative is to design reliable strategies to automate the creation of test data and evaluation, but previous attempts either rely on pre-existing data, or focus solely on individual tasks. We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task by leveraging language models for both synthetic test data creation and evaluation. ZSB is simple and flexible: it requires only the creation of a prompt for data generation and one for evaluation; it is scalable to tasks and languages where collecting real-world data is costly or impractical; it is model-agnostic, allowing the creation of increasingly challenging benchmarks as models improve. To assess the effectiveness of our framework, we create benchmarks for five text-only tasks and a multi-modal one: general capabilities in four languages (English, Chinese, French, and Korean), translation, and general vision-language capabilities in English. We then rank a broad range of open and closed systems on our benchmarks. ZSB rankings consistently correlate strongly with human rankings, outperforming widely-adopted standard benchmarks. Through ablations, we find that strong benchmarks can be created with open models, and that judge model size and dataset variety are crucial drivers of performance. We release all our benchmarks, and code to reproduce our experiments and to produce new benchmarks.
[NLP-4] MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs
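Quick read: This paper addresses the lack of transparent, verifiable reasoning in AI models for medical tasks such as diagnosis and treatment planning. Existing datasets do not provide clear step-by-step reasoning that could be used to validate and improve a model's medical reasoning ability. To fill this gap, the paper introduces MedReason, a large-scale, high-quality medical reasoning dataset that uses a structured medical knowledge graph (KG) to convert clinical question-answer pairs into logical chains of reasoning, or "thinking paths", enabling faithful and explainable medical problem-solving in large language models (LLMs). The key idea is to use the knowledge graph to connect clinical questions to their answers through relevant entities, and to validate each reasoning path for consistency with clinical logic and evidence-based medicine. Experiments show that models fine-tuned on the dataset solve medical problems markedly better, with gains of up to 7.7% for DeepSeek-Distill-8B.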
Link: https://arxiv.org/abs/2504.00993
Authors: Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, Yihan Cao, Hui Ren, Xiang Li, Xiaoxiao Li, Yuyin Zhou
Affiliations: UC Santa Cruz; University of British Columbia; Nanyang Technological University; Stanford University; Weill Cornell Medicine; NYU Langone Health; Chungnam National University Sejong Hospital; Vector Institute; Pusan National University Hospital; Massachusetts General Hospital
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Medical tasks such as diagnosis and treatment planning require precise and complex reasoning, particularly in life-critical domains. Unlike mathematical reasoning, medical reasoning demands meticulous, verifiable thought processes to ensure reliability and accuracy. However, there is a notable lack of datasets that provide transparent, step-by-step reasoning to validate and enhance the medical reasoning ability of AI models. To bridge this gap, we introduce MedReason, a large-scale high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in large language models (LLMs). We utilize a structured medical knowledge graph (KG) to convert clinical QA pairs into logical chains of reasoning, or “thinking paths”, which trace connections from question elements to answers via relevant KG entities. Each path is validated for consistency with clinical logic and evidence-based medicine. Our pipeline generates detailed reasoning for various medical questions from 7 medical datasets, resulting in a dataset of 32,682 question-answer pairs, each with detailed, step-by-step explanations. Experiments demonstrate that fine-tuning with our dataset consistently boosts medical problem-solving capabilities, achieving significant gains of up to 7.7% for DeepSeek-Distill-8B. Our top-performing model, MedReason-8B, outperforms the Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets. We also engage medical professionals from diverse specialties to assess our dataset’s quality, ensuring MedReason offers accurate and coherent medical reasoning. Our data, models, and code will be publicly available.
[NLP-5] Chinese Grammatical Error Correction: A Survey
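Quick read: This survey systematically reviews research on Chinese Grammatical Error Correction (CGEC), focusing on how to improve standardization and effectiveness in data, annotation, evaluation, and system modeling. First, it analyzes the characteristics and limitations of existing CGEC datasets and stresses the need for unified annotation standards. Second, it discusses the challenges of error classification specific to Chinese, such as word segmentation ambiguity and the definition of Chinese-specific error types. Third, it examines how evaluation methods have been adapted from English GEC to Chinese, including character-level scoring and the use of multiple references. Finally, it traces the evolution of CGEC systems from rule-based and statistical approaches to neural models, especially Transformer architectures, and the integration of large pre-trained language models. The survey gives insight into the current state of CGEC and outlines future directions: refining annotation standards, addressing segmentation challenges, and adopting multilingual approaches.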
Link: https://arxiv.org/abs/2504.00977
Authors: Mengyang Qiu, Qingyu Gao, Linxuan Yang, Yang Gu, Tran Minh Nguyen, Zihao Huang, Jungyeul Park
Affiliations: Trent University, Canada; Open Writing Evaluation, France; Independent Researcher, Canada; The University of British Columbia, Canada
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Chinese Grammatical Error Correction (CGEC) is a critical task in Natural Language Processing, addressing the growing demand for automated writing assistance in both second-language (L2) and native (L1) Chinese writing. While L2 learners struggle with mastering complex grammatical structures, L1 users also benefit from CGEC in academic, professional, and formal contexts where writing precision is essential. This survey provides a comprehensive review of CGEC research, covering datasets, annotation schemes, evaluation methodologies, and system advancements. We examine widely used CGEC datasets, highlighting their characteristics, limitations, and the need for improved standardization. We also analyze error annotation frameworks, discussing challenges such as word segmentation ambiguity and the classification of Chinese-specific error types. Furthermore, we review evaluation metrics, focusing on their adaptation from English GEC to Chinese, including character-level scoring and the use of multiple references. In terms of system development, we trace the evolution from rule-based and statistical approaches to neural architectures, including Transformer-based models and the integration of large pre-trained language models. By consolidating existing research and identifying key challenges, this survey provides insights into the current state of CGEC and outlines future directions, including refining annotation standards to address segmentation challenges, and leveraging multilingual approaches to enhance CGEC.
[NLP-6] SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
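Quick read: This paper addresses the significant computational and memory challenges that large language models face when processing long contexts. Traditional token-level efficient KV caching methods overlook semantic information, while existing semantic-preserving KV cache management approaches tend to have high memory usage and high time-to-first-token. The paper proposes SentenceKV, a sentence-level semantic KV caching approach. During prefilling, SentenceKV groups tokens by sentence-level semantic similarity and compresses sentence representations into compact semantic vectors kept on the GPU, while individual KV pairs are offloaded to the CPU. During decoding, it generates tokens by selectively retrieving semantically relevant sentence-level KV entries, using the similarity between the prefilling-stage semantic vectors and the decoding-stage queries. This keeps predictions efficient and contextually accurate, limits the loading of redundant or irrelevant data into GPU memory, and substantially reduces memory overhead while keeping inference latency stable even for extremely long contexts. Across several benchmarks, SentenceKV significantly outperforms state-of-the-art methods in efficiency and memory usage without sacrificing model accuracy.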
Link: https://arxiv.org/abs/2504.00970
Authors: Yuxuan Zhu, Ali Falahati, David H. Yang, Mohammad Mohammadi Amiri
Affiliations: Department of Computer Science, Rensselaer Polytechnic Institute; Department of Computer Science, University of Waterloo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models face significant computational and memory challenges when processing long contexts. During inference, efficient management of the key-value (KV) cache, which stores intermediate activations for autoregressive generation, is critical to reducing memory overhead and improving computational efficiency. Traditional token-level efficient KV caching methods overlook semantic information, treating tokens independently without considering their semantic relationships. Meanwhile, existing semantic-preserving KV cache management approaches often suffer from substantial memory usage and high time-to-first-token. To address these limitations, we propose SentenceKV, a novel sentence-level semantic KV caching approach designed to enhance inference efficiency while preserving semantic coherence. During prefilling, SentenceKV groups tokens based on sentence-level semantic similarity, compressing sentence representations into concise semantic vectors stored directly on the GPU, while individual KV pairs are offloaded to CPU. During decoding, SentenceKV generates tokens by selectively retrieving semantically relevant sentence-level KV entries, leveraging the semantic similarity between the prefilling-stage semantic vectors and decoding-stage queries. This ensures efficient and contextually accurate predictions, minimizing the loading of redundant or irrelevant data into GPU memory and significantly reducing memory overhead while maintaining stable inference latency, even for extremely long contexts. Extensive evaluations on benchmarks including PG-19, LongBench, and Needle-In-A-Haystack demonstrate that SentenceKV significantly outperforms state-of-the-art methods in both efficiency and memory usage, without compromising model accuracy.
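The prefill and decode steps described in the abstract can be sketched as a toy data structure. This is a rough illustration under stated assumptions, not the authors' implementation: the hypothetical `SentenceCache` class uses the mean of a sentence's key vectors as its semantic vector, plain Python lists stand in for GPU and CPU memory, and values are strings so the result is easy to read.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


class SentenceCache:
    def __init__(self):
        self.semantic = []   # one compact vector per sentence (stands in for GPU memory)
        self.offloaded = []  # per-sentence (key, value) pairs (stands in for CPU memory)

    def prefill(self, sentences):
        """sentences: list of sentences, each a list of (key_vector, value) entries."""
        for entries in sentences:
            # Assumption: the sentence vector is the mean of its key vectors.
            self.semantic.append(mean_vector([key for key, _ in entries]))
            self.offloaded.append(list(entries))

    def retrieve(self, query_vec, top_k=1):
        """Return the KV pairs of the top_k sentences most similar to the query."""
        scores = [cosine(query_vec, vec) for vec in self.semantic]
        best = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
        # Keep the original sentence order among the selected sentences.
        return [pair for index in sorted(best) for pair in self.offloaded[index]]


cache = SentenceCache()
cache.prefill([
    [([1.0, 0.0], "the"), ([0.8, 0.2], "cat")],
    [([0.0, 1.0], "it"), ([0.2, 0.8], "rains")],
])
print([value for _, value in cache.retrieve([0.1, 0.9])])
```

Only the compact per-sentence vectors are scanned at decode time, which is the property the abstract credits for lower GPU memory use.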
[NLP-7] Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models?
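Quick read: This paper tests whether the text representations learned by contrastive multimodal models are really richer and more human-like than those of language-only models, specifically whether they better capture experiential information and align better with human fMRI responses. This assumption is common in computational linguistics but has little empirical support. The paper compares the two model families on both criteria and finds that language-only models outperform multimodal ones on both, and that language-only models also learn more unique brain-relevant semantic information. The study highlights the need for computational models that better integrate the complementary semantic information provided by multimodal data sources.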
Link: https://arxiv.org/abs/2504.00942
Authors: Anna Bavaresco, Raquel Fernández
Affiliations: Institute for Logic, Language and Computation, University of Amsterdam
Categories: Computation and Language (cs.CL)
Comments:
Abstract:A common assumption in Computational Linguistics is that text representations learnt by multimodal models are richer and more human-like than those by language-only models, as they are grounded in images or audio – similar to how human language is grounded in real-world experiences. However, empirical studies checking whether this is true are largely lacking. We address this gap by comparing word representations from contrastive multimodal models vs. language-only ones in the extent to which they capture experiential information – as defined by an existing norm-based ‘experiential model’ – and align with human fMRI responses. Our results indicate that, surprisingly, language-only models are superior to multimodal ones in both respects. Additionally, they learn more unique brain-relevant semantic information beyond that shared with the experiential model. Overall, our study highlights the need to develop computational models that better integrate the complementary semantic information provided by multimodal data sources.
[NLP-8] WikiVideo: Article Generation from Multiple Videos
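Quick read: This paper tackles the task of automatically creating high-level Wikipedia-style articles that aggregate information from multiple videos about real-world events, such as natural disasters or political elections. Most current retrieval-augmented generation (RAG) workflows focus heavily on text, and existing video summarization methods focus on low-level scene understanding rather than high-level event semantics. To close this gap, the paper introduces WikiVideo, a benchmark of expert-written articles and densely annotated videos that provide evidence for the articles' claims, which makes it easier to integrate video into RAG pipelines and to generate in-depth content grounded in multimodal sources. It also proposes Collaborative Article Generation (CAG), an interactive method in which an r1-style reasoning model and a VideoLLM interact iteratively, allowing higher-level inferences about the target event than VideoLLMs alone, which fixate on low-level visual features. In both oracle retrieval and RAG settings, CAG consistently outperforms alternative methods, and the results suggest directions for future work.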
Link: https://arxiv.org/abs/2504.00939
Authors: Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme
Affiliations: Johns Hopkins University; Human Language Technology Center of Excellence; University of Maryland Baltimore County
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Repo can be found here: this https URL
Abstract:We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles’ claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
[NLP-9] InformGen: An AI Copilot for Accurate and Compliant Clinical Research Consent Document Generation
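Quick read: This paper addresses a major challenge in using large language models (LLMs) to generate high-stakes documents such as informed consent forms (ICFs): meeting strict regulatory compliance and factual accuracy requirements. It presents InformGen, an LLM-driven copilot that drafts accurate and compliant ICFs through optimized knowledge document parsing and content generation, with humans in the loop. The authors also build a benchmark dataset of protocols and ICFs from 900 clinical trials. On this benchmark, InformGen achieves near 100% compliance with 18 core regulatory rules derived from FDA guidelines and significantly improves factual accuracy. It also ensures traceability through inline citations to the source protocols, which supports verification.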
Link: https://arxiv.org/abs/2504.00934
Authors: Zifeng Wang, Junyi Gao, Benjamin Danek, Brandon Theodorou, Ruba Shaik, Shivashankar Thati, Seunghyun Won, Jimeng Sun
Affiliations: School of Computing and Data Science, University of Illinois Urbana-Champaign; Centre for Medical Informatics, Usher Institute, University of Edinburgh; Carle Illinois College of Medicine, University of Illinois Urbana-Champaign; Seoul National University Bundang Hospital
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Leveraging large language models (LLMs) to generate high-stakes documents, such as informed consent forms (ICFs), remains a significant challenge due to the extreme need for regulatory compliance and factual accuracy. Here, we present InformGen, an LLM-driven copilot for accurate and compliant ICF drafting by optimized knowledge document parsing and content generation, with humans in the loop. We further construct a benchmark dataset comprising protocols and ICFs from 900 clinical trials. Experimental results demonstrate that InformGen achieves near 100% compliance with 18 core regulatory rules derived from FDA guidelines, outperforming a vanilla GPT-4o model by up to 30%. Additionally, a user study with five annotators shows that InformGen, when integrated with manual intervention, attains over 90% factual accuracy, significantly surpassing the vanilla GPT-4o model’s 57%-82%. Crucially, InformGen ensures traceability by providing inline citations to source protocols, enabling easy verification and maintaining the highest standards of factual integrity.
[NLP-10] Taxonomizing Representational Harms using Speech Act Theory
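Quick read: This paper addresses the under-specified definitions of representational harms caused by generative language systems and offers a new theoretical framework and a granular taxonomy for understanding and measuring them. Grounded in speech act theory, the framework conceptualizes representational harms as the perlocutionary effects (real-world impacts) of particular types of illocutionary acts (system behaviors). The paper provides new definitions of stereotyping, demeaning, and erasure, develops a taxonomy more granular than the high-level ones in previous work, discusses how the framework can support the design of valid measurement instruments, and demonstrates its utility through a case study.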
Link: https://arxiv.org/abs/2504.00928
Authors: Emily Corvi, Hannah Washington, Stefanie Reed, Chad Atalla, Alexandra Chouldechova, P. Alex Dow, Jean Garcia-Gathright, Nicholas Pangakis, Emily Sheng, Dan Vann, Matthew Vogel, Hanna Wallach
Affiliations: Microsoft Research
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Representational harms are widely recognized among fairness-related harms caused by generative language systems. However, their definitions are commonly under-specified. We present a framework, grounded in speech act theory (Austin, 1962), that conceptualizes representational harms caused by generative language systems as the perlocutionary effects (i.e., real-world impacts) of particular types of illocutionary acts (i.e., system behaviors). Building on this argument and drawing on relevant literature from linguistic anthropology and sociolinguistics, we provide new definitions of stereotyping, demeaning, and erasure. We then use our framework to develop a granular taxonomy of illocutionary acts that cause representational harms, going beyond the high-level taxonomies presented in previous work. We also discuss the ways that our framework and taxonomy can support the development of valid measurement instruments. Finally, we demonstrate the utility of our framework and taxonomy via a case study that engages with recent conceptual debates about what constitutes a representational harm and how such harms should be measured.
[NLP-11] Multi-Token Attention
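Quick read: This paper addresses a bottleneck of standard soft attention. Each attention weight is determined by the similarity of a single query vector and a single key vector ("single token attention"), which limits how much information the model can use to distinguish relevant parts of the context. The paper proposes Multi-Token Attention (MTA), which applies convolution operations over queries, keys, and heads so that nearby queries and keys can affect each other's attention weights, giving more precise attention. This lets language models use richer, more nuanced information than a single vector can hold. MTA improves performance on a range of benchmarks, and is particularly beneficial on tasks that require searching for information within long contexts.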
Link: https://arxiv.org/abs/2504.00927
Authors: Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar
Affiliations: Meta (Facebook)
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and key token vector. This “single token attention” bottlenecks the amount of information used in distinguishing a relevant part from the rest of the context. To address this issue, we propose a new attention method, Multi-Token Attention (MTA), which allows LLMs to condition their attention weights on multiple query and key vectors simultaneously. This is achieved by applying convolution operations over queries, keys and heads, allowing nearby queries and keys to affect each other’s attention weights for more precise attention. As a result, our method can locate relevant context using richer, more nuanced information that can exceed a single vector’s capacity. Through extensive evaluations, we demonstrate that MTA achieves enhanced performance on a range of popular benchmarks. Notably, it outperforms Transformer baseline models on standard language modeling tasks, and on tasks that require searching for information within long contexts, where our method’s ability to leverage richer information proves particularly beneficial.
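The idea of letting neighbouring queries and keys influence an attention weight can be sketched as a small convolution over the attention logits before the softmax. This is a simplified illustration under stated assumptions rather than the paper's formulation: the hypothetical `multi_token_weights` function handles a single head, takes precomputed logits, uses a purely backward-looking kernel, and omits the convolution over heads that the abstract also mentions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_token_weights(logits, kernel):
    """Convolve causal attention logits over nearby queries and keys, then softmax.

    logits[i][j] is the score of query i against key j.
    kernel[a][b] weights the score of query i-a against key j-b
    (an assumed backward-looking kernel, see lead-in).
    """
    n = len(logits)
    weights = []
    for i in range(n):
        row = []
        for j in range(i + 1):  # causal mask: query i attends to keys 0..i
            total = 0.0
            for a, kernel_row in enumerate(kernel):
                for b, w in enumerate(kernel_row):
                    qi, kj = i - a, j - b
                    # Skip positions outside the sequence or above the causal diagonal.
                    if qi >= 0 and 0 <= kj <= qi:
                        total += w * logits[qi][kj]
            row.append(total)
        # Pad masked (future) positions with zero weight.
        weights.append(softmax(row) + [0.0] * (n - i - 1))
    return weights


logits = [
    [1.0, 0.0, 0.0],
    [0.5, 2.0, 0.0],
    [0.1, 0.2, 0.3],
]
# A 1x1 identity kernel reduces to ordinary causal softmax attention.
print(multi_token_weights(logits, [[1.0]]))
# A 1x2 kernel mixes each score with the score of the previous key.
print(multi_token_weights(logits, [[0.5, 0.5]]))
```

With the identity kernel the function reproduces single token attention, which makes it easy to see what the extra kernel entries add.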
[NLP-12] On the Robustness of Agentic Function Calling NAACL25
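Quick read: This paper addresses the limited robustness of large language models (LLMs) to input perturbations when performing function calling (FC). Prior research has focused mainly on improving FC accuracy and has paid little attention to how stable this capability is when natural-language queries vary or the toolkit grows. The paper introduces a benchmark that assesses FC robustness in two areas: resilience to naturalistic query variations, and stability of function calling when the toolkit expands with semantically related tools. By evaluating best-performing FC models on a carefully expanded subset of the Berkeley Function Calling Leaderboard (BFCL), it exposes critical weaknesses in existing evaluation methodologies and highlights areas for improvement in real-world deployments.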
Link: https://arxiv.org/abs/2504.00914
Authors: Ella Rabinovich, Ateret Anaby-Tavor
Affiliations: IBM Research
Categories: Computation and Language (cs.CL)
Comments: 7 pages, TrustNLP@NAACL25
Abstract:Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating best-performing FC models on a carefully expanded subset of the Berkeley function calling leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies, and highlight areas for improvement in real-world agentic deployments.
[NLP-13] Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
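Quick read: This paper addresses three main challenges that computer use agents face when automating digital tasks: imprecise grounding of graphical user interface (GUI) elements, difficulty with long-horizon task planning, and performance bottlenecks from relying on a single generalist model for diverse cognitive tasks. It proposes Agent S2, a compositional framework that delegates cognitive responsibilities across generalist and specialist models. The key components are a Mixture-of-Grounding technique for precise GUI localization and Proactive Hierarchical Planning, which dynamically refines action plans at multiple temporal scales in response to evolving observations. Agent S2 achieves new state-of-the-art (SOTA) results on three prominent computer use benchmarks, with significant relative improvements over baseline agents.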
Link: https://arxiv.org/abs/2504.00906
Authors: Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang
Affiliations: Simular Research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 13 figures, 8 tables
Abstract:Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluation. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld relatively. Code available at this https URL.
[NLP-14] GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
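Quick read: This paper addresses three key challenges of current Process Reward Models (PRMs): (1) limited process supervision and generalization capabilities; (2) dependence on scalar value prediction that does not exploit the generative abilities of large language models (LLMs); and (3) inability to scale the test-time compute of PRMs. It proposes GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before giving a judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, it introduces Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. With only 23K training examples from the MATH dataset, GenPRM significantly outperforms prior PRMs on ProcessBench and several mathematical reasoning tasks. With test-time scaling, a 1.5B GenPRM outperforms GPT-4o and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. GenPRM also performs well as a critic model for policy model refinement. The work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs.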
Link: https://arxiv.org/abs/2504.00891
Authors: Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, Bowen Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training data from MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities to serve as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available in this https URL.
[NLP-15] CrackSQL: A Hybrid SQL Dialect Translation System Powered by Large Language Models SIGMOD2025
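[Quick read]: This paper tackles SQL dialect translation across heterogeneous database systems, which is difficult because of syntactic discrepancies and subtle semantic variations. Existing approaches (manual rewriting, rule-based systems, and large language model (LLM)-based techniques) either incur high maintenance costs or produce unreliable results, particularly on complex queries. The key to the solution is CrackSQL, a hybrid SQL dialect translation system that combines rule-based and LLM-based methods. It relies on the adaptability of LLMs to minimize manual intervention and improves accuracy by segmenting long, complex SQL through functionality-based query processing. A novel cross-dialect syntax embedding model provides precise syntax alignment, and an adaptive local-to-global translation strategy resolves interdependent query operations. CrackSQL also offers multiple deployment and access options to ease adoption in real-world use.
Link: https://arxiv.org/abs/2504.00882
Authors: Wei Zhou,Yuyang Gao,Xuanhe Zhou,Guoliang Li
Affiliations: Shanghai Jiao Tong Univ.; Tsinghua University
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Extension of our SIGMOD 2025 paper. Please refer to source code available at: this https URL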
Abstract:Dialect translation plays a key role in enabling seamless interaction across heterogeneous database systems. However, translating SQL queries between different dialects (e.g., from PostgreSQL to MySQL) remains a challenging task due to syntactic discrepancies and subtle semantic variations. Existing approaches including manual rewriting, rule-based systems, and large language model (LLM)-based techniques often involve high maintenance effort (e.g., crafting custom translation rules) or produce unreliable results (e.g., LLM generates non-existent functions), especially when handling complex queries. In this demonstration, we present CrackSQL, the first hybrid SQL dialect translation system that combines rule and LLM-based methods to overcome these limitations. CrackSQL leverages the adaptability of LLMs to minimize manual intervention, while enhancing translation accuracy by segmenting lengthy complex SQL via functionality-based query processing. To further improve robustness, it incorporates a novel cross-dialect syntax embedding model for precise syntax alignment, as well as an adaptive local-to-global translation strategy that effectively resolves interdependent query operations. CrackSQL supports three translation modes and offers multiple deployment and access options including a web console interface, a PyPI package, and a command-line prompt, facilitating adoption across a variety of real-world use cases.
[NLP-16] m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models
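[Quick read]: This paper examines whether test-time scaling effectively improves large language models (LLMs) on medical reasoning tasks. Test-time scaling works well on mathematical tasks, but its role in medical reasoning is unclear because the medical domain differs fundamentally from mathematics in knowledge representation and decision-making. The paper presents m1, a simple approach that increases a model's medical reasoning capability at inference time. With it, lightweight fine-tuned models under 10B parameters set new state-of-the-art results, and the 32B model rivals previous 70B-scale medical LLMs. The study finds an optimal reasoning token budget of about 4K, beyond which performance may degrade due to overthinking. Budget forcing helps models double-check answers but does not always improve overall medical QA performance and can even introduce errors. Further analysis identifies insufficient medical knowledge as the main bottleneck limiting gains from test-time scaling. Increasing data scale, improving data quality, and expanding model capacity strengthen medical knowledge grounding and yield continued improvements, especially on challenging medical benchmarks where smaller models saturate. These findings highlight a fundamental difference between medical and mathematical reasoning in LLMs: enriched medical knowledge, rather than increased reasoning depth alone, is essential.
Link: https://arxiv.org/abs/2504.00869
Authors: Xiaoke Huang,Juncheng Wu,Hui Liu,Xianfeng Tang,Yuyin Zhou
Affiliations: UC Santa Cruz; Amazon Research; https://github.com/UCSC-VLAA/m1 (project page of the UC Santa Cruz VLAA lab)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages; 7 figures; Data, code, and models: this https URL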
Abstract:Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model’s medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve the overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, other than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.
[NLP-17] Investigating the Capabilities and Limitations of Machine Learning for Identifying Bias in English Language Data with Information and Heritage Professionals
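[Quick read]: This paper addresses the persistent bias in machine learning (ML) systems, and in particular the harm it causes to marginalized groups. It argues that completely removing bias and building fully fair models is not always a possible or desirable goal. The authors therefore reframe the problem: instead of trying to remove bias, they build models that identify biased language, drawing attention to the biases of a dataset. The key to the approach is a mixed-methods design that evaluates the models in a specific use case (the workflows of information and heritage professionals) and shows the limits of ML for identifying bias. Those limits stem from the contextual nature of bias, from the way mitigation can simultaneously privilege and oppress different communities, and from its inevitability. The paper stresses the need to expand current ML approaches to bias and fairness.
Link: https://arxiv.org/abs/2504.00860
Authors: Lucy Havens,Benjamin Bach,Melissa Terras,Beatrice Alex
Affiliations: University of Edinburgh; Inria
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted to the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25)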
Abstract:Despite numerous efforts to mitigate their biases, ML systems continue to harm already-marginalized people. While predominant ML approaches assume bias can be removed and fair models can be created, we show that these are not always possible, nor desirable, goals. We reframe the problem of ML bias by creating models to identify biased language, drawing attention to a dataset’s biases rather than trying to remove them. Then, through a workshop, we evaluated the models for a specific use case: workflows of information and heritage professionals. Our findings demonstrate the limitations of ML for identifying bias due to its contextual nature, the way in which approaches to mitigating it can simultaneously privilege and oppress different communities, and its inevitability. We demonstrate the need to expand ML approaches to bias and fairness, providing a mixed-methods approach to investigating the feasibility of removing bias or achieving fairness in a given ML use case.
[NLP-18] How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study
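[Quick read]: This paper addresses the fundamental challenge of improving the reasoning capabilities of Large Language Models (LLMs) efficiently and scalably. Its key contribution is a difficulty-aware staged reinforcement learning (RL) strategy. Through systematic analysis, it shows that selecting training data according to well-defined difficulty levels markedly improves RL optimization, and it introduces a staged training method that progressively exposes the model to harder tasks, further strengthening reasoning. The study also finds significant cross-domain benefits when training simultaneously on mathematical reasoning and code generation. With the proposed approach, a 1.5B-parameter model reaches 42.3% accuracy on AIME-2024 and 89.5% on MATH-500.
Link: https://arxiv.org/abs/2504.00829
Authors: Yunjie Ji,Sitong Zhao,Xiaoyu Tian,Haotian Wang,Shuaiting Chen,Yiping Peng,Han Zhao,Xiangang Li
Affiliations: a-m-team
Categories: Computation and Language (cs.CL)
Comments: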
Abstract:Enhancing the reasoning capabilities of Large Language Models (LLMs) with efficiency and scalability remains a fundamental challenge in artificial intelligence research. This paper presents a rigorous experimental investigation into how difficulty-aware staged reinforcement learning (RL) strategies can substantially improve LLM reasoning performance. Through systematic analysis, we demonstrate that strategically selecting training data according to well-defined difficulty levels markedly enhances RL optimization. Moreover, we introduce a staged training methodology, progressively exposing models to increasingly challenging tasks, further amplifying reasoning capabilities. Our findings reveal significant cross-domain benefits when simultaneously training models on mathematical reasoning and code generation tasks. Notably, our proposed approach enables a 1.5B parameter model to achieve an accuracy of 42.3% on the AIME-2024 benchmark, 89.5% on the MATH-500 benchmark. These results underscore the efficacy of our method in advancing the reasoning proficiency of LLMs. We will open-source our datasets on GitHub and Hugging Face.
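As a rough illustration of the data-selection idea in this abstract, the sketch below buckets problems into training stages by an estimated difficulty. The pass-rate proxy for difficulty, the cut-off values, and the names `Problem` and `build_stages` are all assumptions made for illustration; the abstract does not say how the authors define difficulty levels.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str
    pass_rate: float  # hypothetical proxy: fraction of sampled solutions that were correct


def build_stages(problems, boundaries=(0.7, 0.3)):
    """Split problems into three stages ordered from easiest to hardest.

    `boundaries` are assumed pass-rate cut-offs, not values from the paper.
    Problems with a pass rate of exactly 0.0 or 1.0 are dropped, on the
    assumption that they give an RL optimizer little learning signal.
    """
    easy_cut, hard_cut = boundaries
    stages = [[], [], []]
    for p in problems:
        if p.pass_rate <= 0.0 or p.pass_rate >= 1.0:
            continue
        if p.pass_rate >= easy_cut:
            stages[0].append(p)
        elif p.pass_rate >= hard_cut:
            stages[1].append(p)
        else:
            stages[2].append(p)
    return stages


pool = [
    Problem("a", 0.9), Problem("b", 0.5), Problem("c", 0.1),
    Problem("d", 1.0), Problem("e", 0.0),
]
stages = build_stages(pool)
for i, stage in enumerate(stages, start=1):
    print(f"stage {i}:", [p.prompt for p in stage])
```

A staged trainer would then run RL on `stages[0]` first and move on to later stages in order; that training loop is omitted here.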
[NLP-19] ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations
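[Quick read]: This paper addresses the dual need in academic writing for coherent text generation and precise citation of relevant literature, noting that existing Retrieval-Augmented Generation (RAG) systems offer limited support for professional academic writing. Its key solution is ScholarCopilot, a unified framework that enhances existing large language models for writing academic articles: the model dynamically generates a retrieval token [RET] and uses that token's representation to look up contextually relevant citations from a database. The framework jointly optimizes the generation and citation tasks and is trained on 500K papers from arXiv.
Link: https://arxiv.org/abs/2504.00824
Authors: Yubo Wang,Xueguang Ma,Ping Nie,Huaye Zeng,Zhiheng Lyu,Yuxuan Zhang,Benjamin Schneider,Yi Lu,Xiang Yue,Wenhu Chen
Affiliations: University of Waterloo; Carnegie Mellon University; Independent Researcher; Vector Institute (Toronto)
Categories: Computation and Language (cs.CL)
Comments: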
Abstract:Academic writing requires both coherent text generation and precise citation of relevant literature. Although recent Retrieval-Augmented Generation (RAG) systems have significantly improved factual accuracy in general-purpose text generation, their capacity to adequately support professional academic writing remains limited. In this work, we introduce ScholarCopilot, a unified framework designed to enhance existing large language models for generating professional academic articles with accurate and contextually relevant citations. ScholarCopilot dynamically determines when to retrieve scholarly references by generating a retrieval token [RET], and then utilizes its representation to look up relevant citations from a database. The retrieved references are fed into the model to augment the generation process. We jointly optimize both the generation and citation tasks within a single framework to increase efficiency. Trained on 500K papers from arXiv, our model achieves a top-1 retrieval accuracy of 40.1% on our evaluation dataset, outperforming baselines such as E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000 academic writing samples, ScholarCopilot scores 16.2/25 in generation quality (measured across relevance, coherence, academic rigor, completeness, and innovation), surpassing models with 10x more parameters such as Qwen-2.5-72B-Instruct (15.8/25). Human studies also confirm ScholarCopilot’s superior performance in citation recall, writing efficiency, and overall user experience, confirming the effectiveness of our approach.
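The look-up step described in this abstract can be pictured as a nearest-neighbour search keyed by the representation of the [RET] token. The sketch below is a toy version under stated assumptions: the three-dimensional vectors, the citation keys, the cosine-similarity scoring, and the function names are all hypothetical, and the abstract does not specify which similarity function or index the authors use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, citation_index, top_k=1):
    """Return the keys of the top_k citations most similar to the query vector."""
    ranked = sorted(
        citation_index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [key for key, _ in ranked[:top_k]]


# Hypothetical pre-computed citation embeddings.
citation_index = {
    "paper_on_retrieval": [0.9, 0.1, 0.0],
    "paper_on_vision": [0.0, 0.2, 0.9],
    "paper_on_parsing": [0.1, 0.9, 0.1],
}

# Stand-in for the hidden state the model produces at a [RET] token.
ret_vec = [0.8, 0.2, 0.1]
top = retrieve(ret_vec, citation_index, top_k=2)
print(top)
```

In the framework the abstract describes, the retrieved references would then be fed back into the model to continue generation; that step is not shown.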
[NLP-20] Z1: Efficient Test-time Scaling with Code
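[Quick read]: This paper addresses the high inference cost that Large Language Models (LLMs) incur when scaling test-time compute for complex problem solving, which stems from longer contexts and large numbers of reasoning tokens. It proposes an efficient test-time scaling method that trains LLMs on code-related reasoning trajectories, reducing excess reasoning tokens while maintaining performance. The approach rests on two pieces: Z1-Code-Reasoning-107K, a dataset of simple and complex coding problems paired with short and long solution trajectories, and a novel Shifted Thinking Window that reduces overthinking overhead by removing context-delimiting tags (e.g., <think>...</think>) and capping reasoning tokens. The resulting model, Z1-7B, adjusts its reasoning depth to problem complexity and scales efficiently at test time across reasoning tasks, matching R1-Distill-Qwen-7B with about 30% of its average thinking tokens, and it generalizes to broader reasoning tasks.
Link: https://arxiv.org/abs/2504.00810
Authors: Zhaojian Yu,Yinghao Wu,Yilun Zhao,Arman Cohan,Xiao-Ping Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: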
Abstract:Large Language Models (LLMs) can achieve enhanced complex problem-solving through test-time computing scaling, yet this often entails longer contexts and numerous reasoning token costs. In this paper, we propose an efficient test-time scaling method that trains LLMs on code-related reasoning trajectories, facilitating their reduction of excess thinking tokens while maintaining performance. First, we create Z1-Code-Reasoning-107K, a curated dataset of simple and complex coding problems paired with their short and long solution trajectories. Second, we present a novel Shifted Thinking Window to mitigate overthinking overhead by removing context-delimiting tags (e.g., <think>...</think>) and capping reasoning tokens. Trained with long and short trajectory data and equipped with Shifted Thinking Window, our model, Z1-7B, demonstrates the ability to adjust its reasoning level as the complexity of problems and exhibits efficient test-time scaling across different reasoning tasks that matches R1-Distill-Qwen-7B performance with about 30% of its average thinking tokens. Notably, fine-tuned with only code trajectories, Z1-7B demonstrates generalization to broader reasoning tasks (47.5% on GPQA Diamond). Our analysis of efficient reasoning elicitation also provides valuable insights for future research.
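The "capping reasoning tokens" part of this abstract can be sketched as a simple budget check on a token stream. This is only a generic illustration of a token cap, not the authors' Shifted Thinking Window: the function name, the hint text appended when the cap is hit, and the use of a plain list as a stand-in for a model's token stream are all assumptions.

```python
def capped_reasoning(token_stream, max_thinking_tokens,
                     hint="\nTime is up; give the final answer now.\n"):
    """Collect reasoning tokens until the stream ends or the cap is reached.

    Returns (text, was_capped). When the cap is reached, a hypothetical hint
    is appended so that a downstream model call could be asked to answer.
    """
    tokens = []
    for tok in token_stream:
        if len(tokens) >= max_thinking_tokens:
            return "".join(tokens) + hint, True
        tokens.append(tok)
    return "".join(tokens), False


# A list of strings stands in for tokens emitted by a model.
long_trace = ["step1 ", "step2 ", "step3 ", "step4 "]
text, capped = capped_reasoning(iter(long_trace), max_thinking_tokens=2)
print(repr(text), capped)
```

A short trace that finishes within the budget passes through unchanged, which mirrors the abstract's point that easy problems should not pay for long reasoning.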
[NLP-21] Inaccuracy of an E-Dictionary and Its Influence on Chinese Language Users
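[Quick read]: This paper examines the accuracy and reliability of electronic dictionaries (E-dictionaries) used by second-language (L2) learners to expand their vocabulary, focusing on Youdao, a widely used E-dictionary. It finds that incomplete or misleading definitions can cause serious misunderstandings, and that these problems stem partly from shortcomings in data processing and in the integration of AI and machine learning technologies. The study combines experimentation, a user survey, and dictionary critique to trace the origins of these flaws. It stresses the importance of better dictionary-literacy training for users and calls for improvements to the underlying AI models used to build E-dictionaries.
Link: https://arxiv.org/abs/2504.00799
Authors: Xi Wang,Fanfei Meng,Shiyang Zhang,Lan Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 13 pages, presented at ASIALEX 2023 (The 15th International Conference of the Asian Association for Lexicography), Yonsei University, Seoul, Korea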
Abstract:Electronic dictionaries have largely replaced paper dictionaries and become central tools for L2 learners seeking to expand their vocabulary. Users often assume these resources are reliable and rarely question the validity of the definitions provided. The accuracy of major E-dictionaries is seldom scrutinized, and little attention has been paid to how their corpora are constructed. Research on dictionary use, particularly the limitations of electronic dictionaries, remains scarce. This study adopts a combined method of experimentation, user survey, and dictionary critique to examine Youdao, one of the most widely used E-dictionaries in China. The experiment involved a translation task paired with retrospective reflection. Participants were asked to translate sentences containing words that are insufficiently or inaccurately defined in Youdao. Their consultation behavior was recorded to analyze how faulty definitions influenced comprehension. Results show that incomplete or misleading definitions can cause serious misunderstandings. Additionally, students exhibited problematic consultation habits. The study further explores how such flawed definitions originate, highlighting issues in data processing and the integration of AI and machine learning technologies in dictionary construction. The findings suggest a need for better training in dictionary literacy for users, as well as improvements in the underlying AI models used to build E-dictionaries.
[NLP-22] Digitally Supported Analysis of Spontaneous Speech (DigiSpon): Benchmarking NLP-Supported Language Sample Analysis of Swiss Children's Speech
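[Quick read]: This paper addresses the limited use of Language Sample Analysis (LSA) in clinical practice, which results from its labor-intensive nature, particularly in diagnosing Developmental Language Disorder (DLD). It also seeks to avoid potentially unethical implementations that depend on commercial Large Language Models (LLMs), and explores ways to help speech-language pathologists diagnose DLD more efficiently. The key to the solution is locally deployed Natural Language Processing (NLP) methods that do not rely on commercial LLMs, applied to transcribed speech from 119 children in the German-speaking part of Switzerland. Preliminary results indicate that integrating such methods into semi-automatic LSA is promising.
Link: https://arxiv.org/abs/2504.00780
Authors: Anja Ryser,Yingqiang Gao,Sarah Ebling
Affiliations: University of Zurich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: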
Abstract:Language sample analysis (LSA) is a process that complements standardized psychometric tests for diagnosing, for example, developmental language disorder (DLD) in children. However, its labor-intensive nature has limited its use in speech-language pathology practice. We introduce an approach that leverages natural language processing (NLP) methods not based on commercial large language models (LLMs) applied to transcribed speech data from 119 children in the German speaking part of Switzerland with typical and atypical language development. The study aims to identify optimal practices that support speech-language pathologists in diagnosing DLD more efficiently within a human-in-the-loop framework, without relying on potentially unethical implementations that leverage commercial LLMs. Preliminary findings underscore the potential of integrating locally deployed NLP methods into the process of semi-automatic LSA.
[NLP-23] Automated Explanation of Machine Learning Models of Footballing Actions in Words
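[Quick read]: This paper addresses the communication gap between machine learning practice in football analytics and the way coaching staff talk about the game. Although football analytics has changed how teams and analysts assess performance, the insights models provide are not always the actionable ones coaches need. To bridge this gap, the paper proposes "wordalization", a new approach that uses large language models to produce understandable descriptions. The method first builds an Expected Goals (xG) model with logistic regression, then uses the regression coefficients to write sentences describing how factors such as distance, angle, and defensive pressure contribute to the prediction, and finally uses a large language model to make the description entertaining. The authors document the approach in a model card, provide an open-source application that describes shots from recent tournaments, discuss how wordalizations might aid coaching and commentary, and give an example of applying the approach to other football actions.
Link: https://arxiv.org/abs/2504.00767
Authors: Pegah Rahimian,Jernej Flisar,David Sumpter
Affiliations: Uppsala University; Twelve Football
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: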
Abstract:While football analytics has changed the way teams and analysts assess performance, there remains a communication gap between machine learning practice and how coaching staff talk about football. Coaches and practitioners require actionable insights, which are not always provided by models. To bridge this gap, we show how to build wordalizations (a novel approach that leverages large language models) for shots in football. Specifically, we first build an expected goals model using logistic regression. We then use the co-efficients of this regression model to write sentences describing how factors (such as distance, angle and defensive pressure) contribute to the model’s prediction. Finally, we use large language models to give an entertaining description of the shot. We describe our approach in a model card and provide an interactive open-source application describing shots in recent tournaments. We discuss how shot wordalisations might aid communication in coaching and football commentary, and give a further example of how the same approach can be applied to other actions in football.
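The first two steps in this abstract (a logistic-regression xG model, then sentences derived from its coefficients) can be sketched as follows. The coefficients, feature names, and sentence template are invented for illustration and are not taken from the paper; a real model would be fitted to shot data, and the final step of passing the sentences to a large language model is omitted.

```python
import math

# Hypothetical coefficients, chosen by hand for illustration; a real xG model
# would estimate them by fitting logistic regression to shot data.
INTERCEPT = -1.0
COEFFICIENTS = {
    "distance to goal": -0.10,    # per metre
    "shooting angle": 1.20,       # per radian
    "defensive pressure": -0.50,  # per unit of an assumed pressure score
}


def expected_goals(shot):
    """Logistic model: xG = sigmoid(intercept + sum of coefficient * feature)."""
    logit = INTERCEPT + sum(coef * shot[name] for name, coef in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-logit))


def describe(shot):
    """Turn each feature's log-odds contribution into a plain sentence."""
    sentences = []
    for name, coef in COEFFICIENTS.items():
        contribution = coef * shot[name]
        if contribution == 0:
            continue  # feature had no effect on this shot
        direction = "raised" if contribution > 0 else "lowered"
        sentences.append(
            f"The {name} {direction} the chance of scoring "
            f"(log-odds contribution {contribution:+.2f})."
        )
    return sentences


shot = {"distance to goal": 12.0, "shooting angle": 0.6, "defensive pressure": 1.0}
xg = expected_goals(shot)
sentences = describe(shot)
print(f"xG = {xg:.2f}")
for sentence in sentences:
    print(sentence)
```

Because logistic regression is additive in log-odds, each feature's contribution can be reported separately, which is what makes a coefficient-by-coefficient description possible.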
[NLP-24] RECKON: Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model
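[Quick read]: This paper addresses the high resource cost and information loss of traditional benchmark-based evaluation when verifying the capabilities of Large Language Models (LLMs). The key to the solution is RECKON, a large-scale reference-based efficient knowledge evaluation method that organizes unstructured reference data into manageable units and generates targeted questions for each cluster, improving both efficiency and accuracy. Compared with traditional methods, it reduces resource consumption by 56.5% while achieving over 97% accuracy across domains including world knowledge, code, legal, and biomedical datasets.
Link: https://arxiv.org/abs/2504.00756
Authors: Lin Zhang,Zhouhong Gu,Xiaoran Shi,Hongwei Feng,Yanghua Xiao
Affiliations: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
Categories: Computation and Language (cs.CL)
Comments: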
Abstract:As large language models (LLMs) advance, efficient knowledge evaluation becomes crucial to verifying their capabilities. Traditional methods, relying on benchmarks, face limitations such as high resource costs and information loss. We propose the Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model (RECKON), which directly uses reference data to evaluate models. RECKON organizes unstructured data into manageable units and generates targeted questions for each cluster, improving evaluation accuracy and efficiency. Experimental results show that RECKON reduces resource consumption by 56.5% compared to traditional methods while achieving over 97% accuracy across various domains, including world knowledge, code, legal, and biomedical datasets. Code is available at this https URL
[NLP-25] LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models ESWC2025
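[Quick read]: This paper addresses the extraction of structured information from unstructured text; traditional schema mining relies on semi-structured data, which limits its scalability. The key solution is schema-miner, a new tool that combines Large Language Models (LLMs) with human feedback to automate and refine schema extraction through an iterative workflow. The tool organizes properties from text, incorporates expert input, and integrates domain-specific ontologies for semantic depth. Applied to atomic layer deposition in materials science, schema-miner shows that expert-guided LLMs can generate semantically rich schemas suitable for diverse real-world applications.
Link: https://arxiv.org/abs/2504.00752
Authors: Sameer Sadruddin,Jennifer D’Souza,Eleni Poupaki,Alex Watkins,Hamed Babaei Giglou,Anisa Rula,Bora Karasulu,Sören Auer,Adrie Mackus,Erwin Kessels
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 15 pages, 3 figures, to appear in the Extended Semantic Web Conference (ESWC 2025) proceedings in the Resource track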
Abstract:Extracting structured information from unstructured text is crucial for modeling real-world processes, but traditional schema mining relies on semi-structured data, limiting scalability. This paper introduces schema-miner, a novel tool that combines large language models with human feedback to automate and refine schema extraction. Through an iterative workflow, it organizes properties from text, incorporates expert input, and integrates domain-specific ontologies for semantic depth. Applied to materials science–specifically atomic layer deposition–schema-miner demonstrates that expert-guided LLMs generate semantically rich schemas suitable for diverse real-world applications.
[NLP-26] IHC-LLMiner: Automated extraction of tumour immunohistochemical profiles from PubMed abstracts using large language models
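[Quick read]: This paper addresses the efficient extraction and normalization of immunohistochemistry (IHC) tumour profiles from large numbers of PubMed abstracts. The key to the solution is IHC-LLMiner, an automated pipeline based on large language models (LLMs) with two subtasks: abstract classification (judging relevance) and IHC-tumour profile extraction from the relevant abstracts. For classification, the fine-tuned Gemma-2 model achieved 91.5% accuracy and an F1 score of 91.4, with inference 5.9 times faster than GPT4-O, and identified 30,481 relevant abstracts. For extraction, it produced 63.3% correct outputs. The extracted profiles are normalized to UMLS concepts for consistency, supporting large-scale data analysis and clinical research applications.
Link: https://arxiv.org/abs/2504.00748
Authors: Yunsoo Kim,Michal W. S. Ong,Daniel W. Rogalsky,Manuel Rodriguez-Justo,Honghan Wu,Adam P. Levine
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: currently under review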
Abstract:Immunohistochemistry (IHC) is essential in diagnostic pathology and biomedical research, offering critical insights into protein expression and tumour biology. This study presents an automated pipeline, IHC-LLMiner, for extracting IHC-tumour profiles from PubMed abstracts, leveraging advanced biomedical text mining. There are two subtasks: abstract classification (include/exclude as relevant) and IHC-tumour profile extraction on relevant included abstracts. The best-performing model, “Gemma-2 finetuned”, achieved 91.5% accuracy and an F1 score of 91.4, outperforming GPT4-O by 9.5% accuracy with 5.9 times faster inference time. From an initial dataset of 107,759 abstracts identified for 50 immunohistochemical markers, the classification task identified 30,481 relevant abstracts (Include) using the Gemma-2 finetuned model. For IHC-tumour profile extraction, the Gemma-2 finetuned model achieved the best performance with 63.3% Correct outputs. Extracted IHC-tumour profiles (tumour types and markers) were normalised to Unified Medical Language System (UMLS) concepts to ensure consistency and facilitate IHC-tumour profile landscape analysis. The extracted IHC-tumour profiles demonstrated excellent concordance with available online summary data and provided considerable added value in terms of both missing IHC-tumour profiles and quantitative assessments. Our proposed LLM based pipeline provides a practical solution for large-scale IHC-tumour profile data mining, enhancing the accessibility and utility of such data for research and clinical applications as well as enabling the generation of quantitative and structured data to support cancer-specific knowledge base development. Models and training datasets are available at this https URL.
[NLP-27] Aplicação de Large Language Models na Análise e Síntese de Documentos Jurídicos: Uma Revisão de Literatura
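[Quick read]: This paper systematically reviews the literature to establish the state of the art in prompt engineering for Large Language Models (LLMs) applied to the legal domain. The analysis finds that models such as GPT-4, BERT, Llama 2, and Legal-Pegasus are widely used for legal tasks such as summarization, classification, and retrieval, and that techniques such as Few-shot Learning, Zero-shot Learning, and Chain-of-Thought prompting have proven effective in improving the interpretation of legal texts. However, model biases and hallucinations remain the main obstacles to large-scale adoption. The central conclusion is that, despite the great potential of LLMs in the legal field, prompt engineering strategies need further improvement to raise the accuracy and reliability of generated results.
Link: https://arxiv.org/abs/2504.00725
Authors: Matheus Belarmino,Rackel Coelho,Roberto Lotudo,Jayr Pereira
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: in Portuguese language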
Abstract:Large Language Models (LLMs) have been increasingly used to optimize the analysis and synthesis of legal documents, enabling the automation of tasks such as summarization, classification, and retrieval of legal information. This study aims to conduct a systematic literature review to identify the state of the art in prompt engineering applied to LLMs in the legal context. The results indicate that models such as GPT-4, BERT, Llama 2, and Legal-Pegasus are widely employed in the legal field, and techniques such as Few-shot Learning, Zero-shot Learning, and Chain-of-Thought prompting have proven effective in improving the interpretation of legal texts. However, challenges such as biases in models and hallucinations still hinder their large-scale implementation. It is concluded that, despite the great potential of LLMs for the legal field, there is a need to improve prompt engineering strategies to ensure greater accuracy and reliability in the generated results.
[NLP-28] Command A: An Enterprise-Ready Large Language Model
[Quick read]: This paper describes the development of Command A, a powerful large language model purpose-built for real-world enterprise use cases. It addresses how to build a model that balances efficiency with performance, supports 23 languages of global business, and is agent-optimised for automating sophisticated business processes. The key to the solution is a novel hybrid architecture together with a decentralised training approach that includes self-refinement algorithms and model merging techniques. Command A also offers Retrieval Augmented Generation (RAG) capabilities with grounding and tool use, which improve its performance on enterprise tasks. The report additionally presents results for Command R7B, which shares capability and architectural similarities with Command A.
Link: https://arxiv.org/abs/2504.00698
Authors: Team Cohere,Aakanksha,Arash Ahmadian,Marwan Ahmed,Jay Alammar,Yazeed Alnumay,Sophia Althammer,Arkady Arkhangorodsky,Viraat Aryabumi,Dennis Aumiller,Raphaël Avalos,Zahara Aviv,Sammie Bae,Saurabh Baji,Alexandre Barbet,Max Bartolo,Björn Bebensee,Neeral Beladia,Walter Beller-Morales,Alexandre Bérard,Andrew Berneshawi,Anna Bialas,Phil Blunsom,Matt Bobkin,Adi Bongale,Sam Braun,Maxime Brunet,Samuel Cahyawijaya,David Cairuz,Jon Ander Campos,Cassie Cao,Kris Cao,Roman Castagné,Julián Cendrero,Leila Chan Currie,Yash Chandak,Diane Chang,Giannis Chatziveroglou,Hongyu Chen,Claire Cheng,Alexis Chevalier,Justin T. Chiu,Eugene Cho,Eugene Choi,Eujeong Choi,Tim Chung,Volkan Cirik,Ana Cismaru,Pierre Clavier,Henry Conklin,Lucas Crawhall-Stein,Devon Crouse,Andres Felipe Cruz-Salinas,Ben Cyrus,Daniel D’souza,Hugo Dalla-Torre,John Dang,William Darling,Omar Darwiche Domingues,Saurabh Dash,Antoine Debugne,Théo Dehaze,Shaan Desai,Joan Devassy,Rishit Dholakia,Kyle Duffy,Ali Edalati,Ace Eldeib,Abdullah Elkady,Sarah Elsharkawy,Irem Ergün,Beyza Ermis,Marzieh Fadaee,Boyu Fan,Lucas Fayoux,Yannis Flet-Berliac,Nick Frosst,Matthias Gallé,Wojciech Galuba,Utsav Garg,Matthieu Geist,Mohammad Gheshlaghi Azar,Seraphina Goldfarb-Tarrant,Tomas Goldsack,Aidan Gomez,Victor Machado Gonzaga,Nithya Govindarajan,Manoj Govindassamy,Nathan Grinsztajn,Nikolas Gritsch,Patrick Gu,Shangmin Guo,Kilian Haefeli,Rod Hajjar,Tim Hawes,Jingyi He,Sebastian Hofstätter,Sungjin Hong,Sara Hooker,Tom Hosking
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 55 pages
Abstract:In this report we describe the development of Command A, a powerful large language model purpose-built to excel at real-world enterprise use cases. Command A is an agent-optimised and multilingual-capable model, with support for 23 languages of global business, and a novel hybrid architecture balancing efficiency with top of the range performance. It offers best-in-class Retrieval Augmented Generation (RAG) capabilities with grounding and tool use to automate sophisticated business processes. These abilities are achieved through a decentralised training approach, including self-refinement algorithms and model merging techniques. We also include results for Command R7B which shares capability and architectural similarities to Command A. Weights for both models have been released for research purposes. This technical report details our original training pipeline and presents an extensive evaluation of our models across a suite of enterprise-relevant tasks and public benchmarks, demonstrating excellent performance and efficiency.
[NLP-29] ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
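[Quick read]: This paper addresses how to select diverse text corpora effectively when pre-training large language models (LLMs), so as to balance computational resources against model performance. Current methods focus mainly on data quality metrics and mixing proportions, but fail to capture the underlying semantic connections between training samples and the quality disparities within individual domains. The paper proposes ToReMi (Topic-based Reweighting for Model improvement), a novel two-stage framework that dynamically adjusts training sample weights according to their topical associations and observed learning patterns. Experiments show that ToReMi variants achieve faster perplexity reduction across multiple domains and stronger results on downstream evaluation tasks than conventional pre-training approaches.
Link: https://arxiv.org/abs/2504.00695
Authors: Xiaoxuan Zhu,Zhouhong Gu,Suhang Zheng,Tao Wang,Tianyu Li,Hongwei Feng,Yanghua Xiao
Affiliations: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; Alibaba Group
Categories: Computation and Language (cs.CL)
Comments: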
Abstract:Pre-training large language models (LLMs) necessitates enormous diverse textual corpora, making effective data selection a key challenge for balancing computational resources and model performance. Current methodologies primarily emphasize data quality metrics and mixing proportions, yet they fail to adequately capture the underlying semantic connections between training samples and quality disparities within individual domains. We introduce ToReMi (Topic-based Reweighting for Model improvement), a novel two-stage framework that dynamically adjusts training sample weights according to their topical associations and observed learning patterns. Our comprehensive experiments reveal that ToReMi variants consistently achieve superior performance over conventional pre-training approaches, demonstrating accelerated perplexity reduction across multiple domains and enhanced capabilities on downstream evaluation tasks. Code is available at this https URL.
[NLP-30] GLiNER-biomed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition
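[Quick read]: This paper studies the distinctive challenges of biomedical named entity recognition (biomedical NER): specialized vocabularies, the sheer volume of entities, and the continuous emergence of novel entities. Traditional NER models, constrained by fixed taxonomies and human annotations, struggle to generalize beyond predefined entity types or to adapt efficiently to emerging concepts. To address this, the paper proposes GLiNER-biomed, a suite of Generalist and Lightweight Model for NER (GLiNER) models adapted to the biomedical domain. GLiNER uses natural language descriptions to infer arbitrary entity types, enabling zero-shot recognition. The approach first distills the annotation capabilities of large language models (LLMs) into a smaller, more efficient model, which is then used to generate high-coverage synthetic biomedical NER data. Two architectures, uni-encoder and bi-encoder, are trained at multiple scales to balance computational efficiency and recognition performance. Experiments show that GLiNER-biomed outperforms state-of-the-art GLiNER models in both zero-shot and few-shot scenarios, with a 5.96% improvement in F1-score over the strongest baseline. Ablation studies confirm the effectiveness of the synthetic data generation strategy and highlight the complementary benefits of synthetic biomedical pre-training combined with fine-tuning on high-quality general-domain annotations. All datasets, models, and training pipelines are publicly available.
Link: https://arxiv.org/abs/2504.00676
Authors: Anthony Yazdani,Ihor Stepanov,Douglas Teodoro
Affiliations: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland; Knowledgator Engineering, Kyiv, Ukraine
Categories: Computation and Language (cs.CL)
Comments: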
Abstract:Biomedical named entity recognition (NER) presents unique challenges due to specialized vocabularies, the sheer volume of entities, and the continuous emergence of novel entities. Traditional NER models, constrained by fixed taxonomies and human annotations, struggle to generalize beyond predefined entity types or efficiently adapt to emerging concepts. To address these issues, we introduce GLiNER-biomed, a domain-adapted suite of Generalist and Lightweight Model for NER (GLiNER) models specifically tailored for biomedical NER. In contrast to conventional approaches, GLiNER uses natural language descriptions to infer arbitrary entity types, enabling zero-shot recognition. Our approach first distills the annotation capabilities of large language models (LLMs) into a smaller, more efficient model, enabling the generation of high-coverage synthetic biomedical NER data. We subsequently train two GLiNER architectures, uni- and bi-encoder, at multiple scales to balance computational efficiency and recognition performance. Evaluations on several biomedical datasets demonstrate that GLiNER-biomed outperforms state-of-the-art GLiNER models in both zero- and few-shot scenarios, achieving 5.96% improvement in F1-score over the strongest baseline. Ablation studies highlight the effectiveness of our synthetic data generation strategy and emphasize the complementary benefits of synthetic biomedical pre-training combined with fine-tuning on high-quality general-domain annotations. All datasets, models, and training pipelines are publicly available at this https URL.
[NLP-31] Do LLMs Surpass Encoders for Biomedical NER?
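[Quick Read]: This paper evaluates how decoder models (large language models, LLMs) perform on biomedical named entity recognition (NER) and examines the trade-off between performance and efficiency against current encoder models such as BERT and its variants. The key is that both model families use the same BIO entity tagging scheme, which retains positional information, and are compared on five datasets with varying proportions of longer entities. The chosen LLMs (Mistral and Llama, 8B range) often outperform the best encoder models (BERT-(un)cased, BiomedBERT, and DeBERTav3, 300M range) by 2-8% in F-score, matching encoder performance on one dataset, and the gain is more prominent for longer entities (three or more tokens). However, LLMs are one to two orders of magnitude more expensive at inference time and may require cost-prohibitive hardware. Encoder models may therefore remain the better choice when performance differences are small or real-time user feedback is needed.
Link: https://arxiv.org/abs/2504.00664
Authors: Motasem S Obeidat, Md Sultan Al Nahian, Ramakanth Kavuluru
Affiliations: University of Kentucky
Categories: Computation and Language (cs.CL)
Comments: Accepted to appear in IEEE ICHI 2025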
Abstract: Recognizing spans of biomedical concepts and their types (e.g., drug or gene) in free text, often called biomedical named entity recognition (NER), is a basic component of information extraction (IE) pipelines. Without a strong NER component, other applications, such as knowledge discovery and information retrieval, are not practical. The state of the art in NER has shifted from traditional ML models to deep neural networks, with transformer-based encoder models (e.g., BERT) emerging as the current standard. However, decoder models (also called large language models or LLMs) are gaining traction in IE. LLM-driven NER often ignores positional information due to the generative nature of decoder models, and LLMs are computationally very expensive (both in inference time and hardware needs). Hence, it is worth exploring whether they actually excel at biomedical NER and assessing any associated trade-offs (performance vs. efficiency). This is exactly what we do in this effort, employing the same BIO entity tagging scheme (which retains positional information) on five different datasets with varying proportions of longer entities. Our results show that the LLMs chosen (Mistral and Llama: 8B range) often outperform the best encoder models (BERT-(un)cased, BiomedBERT, and DeBERTav3: 300M range) by 2-8% in F-scores, except for one dataset where they equal encoder performance. This gain is more prominent among longer entities of length ≥ 3 tokens. However, LLMs are one to two orders of magnitude more expensive at inference time and may need cost-prohibitive hardware. Thus, when performance differences are small or real-time user feedback is needed, encoder models might still be more suitable than LLMs.
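To make the BIO tagging scheme mentioned in the abstract concrete, here is a minimal Python sketch that converts entity spans into per-token B/I/O tags. The function name `spans_to_bio`, the example sentence, and the `Drug`/`Gene` labels are illustrative assumptions, not taken from the paper.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans to BIO tags.

    Each span is (start, end, label) over token indices, with `end` exclusive.
    Spans are assumed not to overlap (hypothetical simplification).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


# Illustrative sentence: one two-token drug mention and one single-token gene.
tokens = ["Patients", "received", "acetylsalicylic", "acid", "for", "BRCA1", "carriers"]
spans = [(2, 4, "Drug"), (5, 6, "Gene")]
bio_tags = spans_to_bio(tokens, spans)
```

Because every token receives a tag, span boundaries are preserved, which is the positional information the abstract says free-form generative NER tends to lose.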
[NLP-32] DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism
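[Quick Read]: This paper targets two weaknesses of routing in existing Mixture of LoRA Experts (MoLE) parameter-efficient fine-tuning methods for multiple downstream tasks: a trade-off between computational efficiency and predictive accuracy, and a failure to meet the diverse expert selection demands of different transformer layers. It proposes DynMoLE, a hybrid routing strategy whose key idea is to adjust expert selection dynamically based on the Tsallis entropy of the router's probability distribution. This mitigates router uncertainty, enhances stability, and promotes more equitable expert participation, leading to faster convergence and better model performance. An auxiliary loss based on Tsallis entropy further guides the model toward convergence with reduced uncertainty, improving training stability and performance.
Link: https://arxiv.org/abs/2504.00661
Authors: Dengchun Li, Naizheng Wang, Zihao Zhang, Haoyang Yin, Lei Duan, Meng Xiao, Mingjie Tang
Affiliations: School of Computer Science, Sichuan University, Chengdu, China; Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, 7 figures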
Abstract:Instruction-based fine-tuning of large language models (LLMs) has achieved remarkable success in various natural language processing (NLP) tasks. Parameter-efficient fine-tuning (PEFT) methods, such as Mixture of LoRA Experts (MoLE), combine the efficiency of Low-Rank Adaptation (LoRA) with the versatility of Mixture of Experts (MoE) models, demonstrating significant potential for handling multiple downstream tasks. However, the existing routing mechanisms for MoLE often involve a trade-off between computational efficiency and predictive accuracy, and they fail to fully address the diverse expert selection demands across different transformer layers. In this work, we propose DynMoLE, a hybrid routing strategy that dynamically adjusts expert selection based on the Tsallis entropy of the router’s probability distribution. This approach mitigates router uncertainty, enhances stability, and promotes more equitable expert participation, leading to faster convergence and improved model performance. Additionally, we introduce an auxiliary loss based on Tsallis entropy to further guide the model toward convergence with reduced uncertainty, thereby improving training stability and performance. Our extensive experiments on commonsense reasoning benchmarks demonstrate that DynMoLE achieves substantial performance improvements, outperforming LoRA by 9.6% and surpassing the state-of-the-art MoLE method, MoLA, by 2.3%. We also conduct a comprehensive ablation study to evaluate the contributions of DynMoLE’s key components.
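For a router distribution p, the Tsallis entropy is S_q(p) = (1 - Σ_i p_i^q) / (q - 1) for q ≠ 1, and it recovers Shannon entropy as q → 1. The sketch below computes it and shows one possible hybrid routing rule: top-k when the router is confident, top-p when it is uncertain. The switching rule, the threshold, and all default values are assumptions for illustration; the abstract does not specify DynMoLE's exact mechanism.

```python
import math
from typing import List


def tsallis_entropy(probs: List[float], q: float = 2.0) -> float:
    """Tsallis entropy S_q(p); falls back to Shannon entropy when q == 1."""
    if abs(q - 1.0) < 1e-9:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def route(probs: List[float], q: float = 2.0, threshold: float = 0.5,
          top_k: int = 2, top_p: float = 0.75) -> List[int]:
    """Hypothetical hybrid rule: low entropy -> top-k, high entropy -> top-p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if tsallis_entropy(probs, q) <= threshold:
        return order[:top_k]                 # confident router: fixed expert count
    chosen, mass = [], 0.0
    for i in order:                          # uncertain router: cover top_p mass
        chosen.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return chosen


confident = route([0.85, 0.05, 0.05, 0.05])   # entropy 0.27, below threshold
uncertain = route([0.25, 0.25, 0.25, 0.25])   # entropy 0.75, above threshold
```

In this toy setup the peaked distribution activates two experts while the uniform one activates three, which illustrates how an entropy signal can vary the number of active experts per token.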
[NLP-33] News is More than a Collection of Facts: Moral Frame Preserving News Summarization
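[Quick Read]: This paper asks how AI-generated news summaries can preserve the moral framing of the original article. News articles shape how events are presented through the choice of morally charged language over neutral terms; this moral framing carries implicit judgments that a summary should recognize and preserve to maintain the writer's original intent. The key to the proposed approach is the intuition that journalists intentionally use or report specific moral-laden words, which should therefore be retained in summaries. Automated, crowd-sourced, and expert evaluations show that the approach enhances the preservation of moral framing while maintaining overall summary quality.
Link: https://arxiv.org/abs/2504.00657
Authors: Enrico Liscio, Michela Lorandi, Pradeep K. Murukannaiah
Affiliations: Department of Intelligent Systems, Delft University of Technology; ADAPT Research Centre, Dublin City University
Categories: Computation and Language (cs.CL)
Comments: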
Abstract:News articles are more than collections of facts; they reflect journalists’ framing, shaping how events are presented to the audience. One key aspect of framing is the choice to write in (or quote verbatim) morally charged language as opposed to using neutral terms. This moral framing carries implicit judgments that automated news summarizers should recognize and preserve to maintain the original intent of the writer. In this work, we perform the first study on the preservation of moral framing in AI-generated news summaries. We propose an approach that leverages the intuition that journalists intentionally use or report specific moral-laden words, which should be retained in summaries. Through automated, crowd-sourced, and expert evaluations, we demonstrate that our approach enhances the preservation of moral framing while maintaining overall summary quality.
[NLP-34] Efficient Construction of Model Family through Progressive Training Using Model Expansion
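[Quick Read]: This paper addresses the high computational cost of training a model family, that is, a set of models with different parameter sizes. Conventionally each model is trained independently, so the cost grows additively with the number of models. The key solution is progressive training, in which smaller models are incrementally expanded to larger sizes to build the complete family. On a family ranging from 1B to 8B parameters, this reduces computational cost by approximately 25% while maintaining performance comparable to independently trained models. Adjusting the maximum learning rate according to model size lets the method outperform independent training, and models in the family tend to behave more consistently across sizes.
Link: https://arxiv.org/abs/2504.00623
Authors: Kazuki Yano, Sho Takase, Sosuke Kobayashi, Shun Kiyono, Jun Suzuki
Affiliations: Tohoku University; SB Intuitions
Categories: Computation and Language (cs.CL)
Comments: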
Abstract:As Large Language Models (LLMs) gain widespread practical application, providing the model family of different parameter sizes has become standard practice to address diverse computational requirements. Conventionally, each model in a family is trained independently, resulting in computational costs that scale additively with the number of models. We propose an efficient method for constructing the model family through progressive training, where smaller models are incrementally expanded to larger sizes to create a complete model family. Through extensive experiments with a model family ranging from 1B to 8B parameters, we demonstrate that our method reduces computational costs by approximately 25% while maintaining comparable performance to independently trained models. Furthermore, by strategically adjusting maximum learning rates based on model size, our method outperforms the independent training across various metrics. Beyond performance gains, our approach offers an additional advantage: models in our family tend to yield more consistent behavior across different model sizes.
[NLP-35] On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation
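[Quick Read]: This paper studies how well large language models (LLMs) in multilingual retrieval-augmented generation (mRAG) systems use contexts in different languages to generate accurate answers, independently of retrieval quality. Specifically, it asks whether LLMs consistently use a relevant passage regardless of its language and answer in the target language, and how distracting passages, whether or not they are in the query language, affect performance.
The key to the approach is an extensive assessment of three abilities: (i) making consistent use of a relevant passage regardless of its language; (ii) responding in the language the user expects; and (iii) focusing on the relevant passage even when multiple distracting passages are provided. Experiments with four LLMs on three QA datasets covering 48 languages, combining accuracy analysis with feature attribution techniques, show that LLMs are surprisingly able to extract relevant information from passages in other languages but much weaker at formulating a full answer in the correct language. Distracting passages hurt answer quality regardless of their language, with distractors in the query language exerting a slightly stronger influence. These findings deepen the understanding of how LLMs use context in mRAG systems and point to directions for future improvement.
Link: https://arxiv.org/abs/2504.00597
Authors: Jirui Qi, Raquel Fernández, Arianna Bisazza
Affiliations: Center for Language and Cognition (CLCG), University of Groningen; Institute for Logic, Language and Computation (ILLC), University of Amsterdam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review at COLM 2025. All code and data are released at this https URL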
Abstract: Retrieval-augmented generation (RAG) with large language models (LLMs) has demonstrated strong performance in multilingual question-answering (QA) tasks by leveraging relevant passages retrieved from corpora. In multilingual RAG (mRAG), the retrieved passages can be written in languages other than that of the query entered by the user, making it challenging for LLMs to effectively utilize the provided information. Recent research suggests that retrieving passages from multilingual corpora can improve RAG performance, particularly for low-resource languages. However, the extent to which LLMs can leverage different kinds of multilingual contexts to generate accurate answers, independently of retrieval quality, remains understudied. In this paper, we conduct an extensive assessment of LLMs’ ability to (i) make consistent use of a relevant passage regardless of its language, (ii) respond in the expected language, and (iii) focus on the relevant passage even when multiple ‘distracting’ passages in different languages are provided in the context. Our experiments with four LLMs across three QA datasets covering a total of 48 languages reveal a surprising ability of LLMs to extract the relevant information from out-language passages, but a much weaker ability to formulate a full answer in the correct language. Our analysis, based on both accuracy and feature attribution techniques, further shows that distracting passages negatively impact answer quality regardless of their language. However, distractors in the query language exert a slightly stronger influence. Taken together, our findings deepen the understanding of how LLMs utilize context in mRAG systems, providing directions for future improvements.
[NLP-36] Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
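[Quick Read]: This paper tackles the barriers to reproducing state-of-the-art multimodal large language model (MLLM) pre-training, which arise at every stage of the pipeline: high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. Its key contribution is an efficient pre-training recipe that uses low-to-high dynamic image resolution and multimodal sequence packing to significantly improve pre-training efficiency. The training data is curated with both MLLM-based filtering and conventional CLIP-based filtering, substantially improving data quality and training efficiency. The final instruction-tuned Open-Qwen2VL outperforms the partially-open state-of-the-art MLLM Qwen2-VL-2B on several multimodal benchmarks, indicating its training efficiency. The work thus combines compute-efficient and data-efficient training with a fully open release of code, data, and checkpoints.
Link: https://arxiv.org/abs/2504.00595
Authors: Weizhi Wang, Yu Tian, Linjie Yang, Heng Wang, Xifeng Yan
Affiliations: UC Santa Barbara; Seed Vision Team, ByteDance; Nvidia Research
Categories: Computation and Language (cs.CL)
Comments: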
Abstract:The reproduction of state-of-the-art multimodal LLM pre-training faces barriers at every stage of the pipeline, including high-quality data filtering, multimodal data mixture strategies, sequence packing techniques, and training frameworks. We introduce Open-Qwen2VL, a fully open-source 2B-parameter Multimodal Large Language Model pre-trained efficiently on 29M image-text pairs using only 442 A100-40G GPU hours. Our approach employs low-to-high dynamic image resolution and multimodal sequence packing to significantly enhance pre-training efficiency. The training dataset was carefully curated using both MLLM-based filtering techniques (e.g., MLM-Filter) and conventional CLIP-based filtering methods, substantially improving data quality and training efficiency. The Open-Qwen2VL pre-training is conducted on academic level 8xA100-40G GPUs at UCSB on 5B packed multimodal tokens, which is 0.36% of 1.4T multimodal pre-training tokens of Qwen2-VL. The final instruction-tuned Open-Qwen2VL outperforms partially-open state-of-the-art MLLM Qwen2-VL-2B on various multimodal benchmarks of MMBench, SEEDBench, MMstar, and MathVista, indicating the remarkable training efficiency of Open-Qwen2VL. We open-source all aspects of our work, including compute-efficient and data-efficient training details, data filtering methods, sequence packing scripts, pre-training data in WebDataset format, FSDP-based training codebase, and both base and instruction-tuned model checkpoints. We redefine “fully open” for multimodal LLMs as the complete release of: 1) the training codebase, 2) detailed data filtering techniques, and 3) all pre-training and supervised fine-tuning data used to develop the model.
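Sequence packing concatenates variable-length samples into fixed-length training sequences so that less compute is spent on padding. A minimal sketch using a first-fit-decreasing heuristic follows; the heuristic, the function name `pack_sequences`, and the token lengths are assumptions for illustration, and the paper's released packing scripts may work differently.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into bins whose total token count is <= max_len.

    First-fit decreasing: visit samples from longest to shortest and put each
    into the first bin that still has room, opening a new bin if none does.
    """
    bins: List[List[int]] = []   # sample indices per packed sequence
    free: List[int] = []         # remaining capacity per packed sequence
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        n = lengths[idx]
        if n > max_len:
            raise ValueError(f"sample {idx} has {n} tokens, more than {max_len}")
        for b, capacity in enumerate(free):
            if n <= capacity:
                bins[b].append(idx)
                free[b] -= n
                break
        else:                    # no existing bin fits this sample
            bins.append([idx])
            free.append(max_len - n)
    return bins


# Hypothetical token counts for six image-text samples.
lengths = [700, 300, 512, 200, 900, 100]
packed = pack_sequences(lengths, max_len=1024)
```

Here six samples fit into three packed sequences of at most 1024 tokens each, instead of six padded ones.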
[NLP-37] Efficient Annotator Reliability Assessment with EffiARA
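[Quick Read]: This paper addresses the cost and time of data annotation in the machine learning pipeline, and the lack of a standard framework for structuring document-level annotation tasks, which have become increasingly popular with transformer-based models. The key is the EffiARA annotation framework, which supports the whole annotation pipeline, from understanding the resources an annotation task requires to compiling the annotated dataset, and gives insight into the reliability of individual annotators and of the dataset as a whole. Two previous studies support its efficacy: one improved classification performance through annotator-reliability-based soft label aggregation and sample weighting, and the other increased overall annotator agreement by identifying and replacing an unreliable annotator. The work also introduces the open-source EffiARA Python package and an accompanying web tool with a graphical user interface.
Link: https://arxiv.org/abs/2504.00589
Authors: Owen Cook, Jake Vasilakes, Ian Roberts, Xingyi Song
Affiliations: School of Computer Science, University of Sheffield
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: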
Abstract: Data annotation is an essential component of the machine learning pipeline; it is also a costly and time-consuming process. With the introduction of transformer-based models, annotation at the document level is increasingly popular; however, there is no standard framework for structuring such tasks. The EffiARA annotation framework is, to our knowledge, the first project to support the whole annotation pipeline, from understanding the resources required for an annotation task to compiling the annotated dataset and gaining insights into the reliability of individual annotators as well as the dataset as a whole. The framework’s efficacy is supported by two previous studies: one improving classification performance through annotator-reliability-based soft label aggregation and sample weighting, and the other increasing the overall agreement among annotators by identifying and replacing an unreliable annotator. This work introduces the EffiARA Python package and its accompanying webtool, which provides an accessible graphical user interface for the system. We open-source the EffiARA Python package at this https URL and the webtool is publicly accessible at this https URL.
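As a rough illustration of annotator-reliability-based soft label aggregation, the sketch below turns several annotators' hard labels into one probability vector, weighting each vote by a reliability score. This is not the EffiARA package API: the function name, the dictionary layout, and the reliability values are all hypothetical, and how reliability is actually estimated is outside this sketch.

```python
from typing import Dict, List


def aggregate_soft_label(annotations: Dict[str, int],
                         reliability: Dict[str, float],
                         num_classes: int) -> List[float]:
    """Reliability-weighted vote, normalized into a soft label.

    annotations: annotator id -> chosen class index
    reliability: annotator id -> non-negative weight (assumed given)
    """
    totals = [0.0] * num_classes
    for annotator, label in annotations.items():
        totals[label] += reliability[annotator]
    weight_sum = sum(totals)
    if weight_sum == 0:
        raise ValueError("all annotators have zero reliability")
    return [t / weight_sum for t in totals]


# Three hypothetical annotators label one document as class 0 or class 1.
soft = aggregate_soft_label(
    annotations={"ann_a": 0, "ann_b": 1, "ann_c": 0},
    reliability={"ann_a": 0.9, "ann_b": 0.3, "ann_c": 0.6},
    num_classes=2,
)
```

A classifier can then be trained against `soft` (for example with cross-entropy on soft targets) rather than a single majority label, so that disagreement from less reliable annotators carries less weight.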
[NLP-38] AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems
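[Quick Read]: This paper addresses the scalability bottlenecks, limited adaptability, and single points of failure that centralized coordination introduces in existing LLM-based multi-agent systems, as well as the privacy and proprietary-knowledge concerns that hinder cross-organizational collaboration and lead to siloed expertise. It proposes AgentNet, a decentralized framework based on Retrieval-Augmented Generation (RAG) that lets LLM-based agents autonomously evolve their capabilities and collaborate efficiently in a network structured as a Directed Acyclic Graph (DAG). The solution rests on three innovations: (1) a fully decentralized paradigm that removes the central orchestrator so agents coordinate and specialize autonomously, fostering fault tolerance and emergent collective intelligence; (2) a dynamically evolving graph topology that adapts agent connections in real time to task demands, ensuring scalability and resilience; and (3) adaptive learning for expertise refinement, in which a retrieval-based memory system lets agents continuously update and refine their specialized skills. Through decentralized coordination and minimal data exchange, agents can draw on diverse knowledge sources while safeguarding sensitive information.
Link: https://arxiv.org/abs/2504.00587
Authors: Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, Weinan Zhang
Affiliations: Shanghai Jiao Tong University; SII
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: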
Abstract: The rapid advancement of Large Language Models (LLMs) has catalyzed the development of multi-agent systems, where multiple LLM-based agents collaborate to solve complex tasks. However, existing systems predominantly rely on centralized coordination, which introduces scalability bottlenecks, limits adaptability, and creates single points of failure. Additionally, concerns over privacy and proprietary knowledge sharing hinder cross-organizational collaboration, leading to siloed expertise. To address these challenges, we propose AgentNet, a decentralized, Retrieval-Augmented Generation (RAG)-based framework that enables LLM-based agents to autonomously evolve their capabilities and collaborate efficiently in a Directed Acyclic Graph (DAG)-structured network. Unlike traditional multi-agent systems that depend on static role assignments or centralized control, AgentNet allows agents to specialize dynamically, adjust their connectivity, and route tasks without relying on predefined workflows. AgentNet’s core design is built upon several key innovations: (1) Fully Decentralized Paradigm: Removing the central orchestrator, allowing agents to coordinate and specialize autonomously, fostering fault tolerance and emergent collective intelligence. (2) Dynamically Evolving Graph Topology: Real-time adaptation of agent connections based on task demands, ensuring scalability and resilience. (3) Adaptive Learning for Expertise Refinement: A retrieval-based memory system that enables agents to continuously update and refine their specialized skills. By eliminating centralized control, AgentNet enhances fault tolerance, promotes scalable specialization, and enables privacy-preserving collaboration across organizations. Through decentralized coordination and minimal data exchange, agents can leverage diverse knowledge sources while safeguarding sensitive information.
[NLP-39] Enhancing Negation Awareness in Universal Text Embeddings: A Data-efficient and Computational-efficient Approach
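[Quick Read]: This paper addresses the weak negation awareness of universal text embedding models. Recent universal text embeddings have outperformed contextual text embeddings on various tasks, but because popular evaluation benchmarks are biased, how well they handle negation has remained unclear. The paper analyzes state-of-the-art universal text embedding models in depth and finds a significant lack of negation awareness: they often interpret negated text pairs as semantically similar. Because different tasks need different trade-offs between topic and negation information, it proposes a data-efficient and computationally efficient embedding re-weighting method that does not modify the parameters of the text embedding model. The method significantly improves negation awareness on both simple and complex negation understanding tasks, and also for task-specific, high-dimensional universal text embeddings based on large language models.
Link: https://arxiv.org/abs/2504.00584
Authors: Hongliu Cao
Affiliations: Amadeus SAS
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: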
Abstract:Negation plays an important role in various natural language processing tasks such as Natural Language Inference and Sentiment Analysis tasks. Numerous prior studies have found that contextual text embedding models such as BERT, ELMO, RoBERTa or XLNet face challenges in accurately understanding negation. Recent advancements in universal text embeddings have demonstrated superior performance over contextual text embeddings in various tasks. However, due to the bias in popular evaluation benchmarks, the negation awareness capacity of these models remains unclear. To bridge the gap in existing literature, an in-depth analysis is initiated in this work to study the negation awareness of cutting-edge universal text embedding models. Our findings reveal a significant lack of negation awareness in these models, often interpreting negated text pairs as semantically similar. To efficiently deal with the conflict that different tasks need different trade-offs between topic and negation information among other semantic information, a data-efficient and computational-efficient embedding re-weighting method is proposed without modifying the parameters of text embedding models. The proposed solution is able to improve text embedding models’ negation awareness significantly on both simple negation understanding task and complex negation understanding task. Furthermore, the proposed solution can also significantly improve the negation awareness of Large Language Model based task-specific high dimensional universal text embeddings.
[NLP-40] Training a Utility-based Retriever Through Shared Context Attribution for Retrieval-Augmented Language Models
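[Quick Read]: This paper addresses a weakness of the retriever in Retrieval-Augmented Language Models (RALMs): it supplies external knowledge to boost task performance but focuses mainly on semantic relevance, which may not always help generation. The paper proposes SCARLet, a framework for training utility-based retrievers, built around two key factors: multi-task generalization and inter-passage interaction. By constructing a shared context on which training data for various tasks is synthesized, SCARLet mitigates the semantic bias caused by context differences, so the retriever can focus on learning task-specific utility and generalize better across tasks. SCARLet also uses a perturbation-based attribution method to estimate passage-level utility within the shared context, which reflects interactions between passages and provides more accurate feedback. Experiments on in-domain and out-of-domain datasets across various tasks show that retrievers trained with SCARLet consistently improve the overall performance of RALMs.
Link: https://arxiv.org/abs/2504.00573
Authors: Yilong Xu, Jinhua Gao, Xiaoming Yu, Yuanhai Xue, Baolong Bi, Huawei Shen, Xueqi Cheng
Affiliations: State Key Lab of AI Safety, Institute of Computing Technology, CAS; Key Lab of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 9 figures. Code will be released after review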
Abstract: Retrieval-Augmented Language Models (RALMs) boost task performance, owing to the retriever that provides external knowledge. Although crucial, the retriever primarily focuses on semantic relevance, which may not always be effective for generation. Thus, utility-based retrieval has emerged as a promising topic, prioritizing passages that provide valid benefits for downstream tasks. However, due to insufficient understanding, capturing passage utility accurately remains unexplored. This work proposes SCARLet, a framework for training utility-based retrievers in RALMs, which incorporates two key factors: multi-task generalization and inter-passage interaction. First, SCARLet constructs shared context on which training data for various tasks is synthesized. This mitigates semantic bias from context differences, allowing retrievers to focus on learning task-specific utility for better task generalization. Next, SCARLet uses a perturbation-based attribution method to estimate passage-level utility for shared context, which reflects interactions between passages and provides more accurate feedback. We evaluate our approach on ten datasets across various tasks, both in-domain and out-of-domain, showing that retrievers trained by SCARLet consistently improve the overall performance of RALMs.
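One simple form of perturbation-based attribution is leave-one-out: remove a passage from the shared context and measure how much a downstream score drops. The sketch below uses that simplified technique, which is a stand-in and not necessarily the attribution method SCARLet uses. The `toy_score` function is a hypothetical placeholder for "run the generator on this context and score its output".

```python
from typing import Callable, List


def passage_utilities(passages: List[str],
                      score: Callable[[List[str]], float]) -> List[float]:
    """Leave-one-out utility: score(full context) - score(context without passage i)."""
    full = score(passages)
    utilities = []
    for i in range(len(passages)):
        perturbed = passages[:i] + passages[i + 1:]   # drop passage i only
        utilities.append(full - score(perturbed))
    return utilities


def toy_score(context: List[str]) -> float:
    """Hypothetical scorer: rewards contexts that support 'Paris is the capital'."""
    text = " ".join(context)
    return float("Paris" in text) + 0.5 * float("capital" in text and "France" in text)


passages = [
    "Paris is the capital of France.",
    "France borders Spain.",
    "The Seine flows through Paris.",
]
utilities = passage_utilities(passages, toy_score)
```

The third passage mentions Paris and would look relevant on its own, yet its leave-one-out utility is zero because the first passage already supplies the same evidence. That dependence on the other passages is the kind of inter-passage interaction a purely semantic relevance score cannot capture.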
[NLP-41] SRLCG: Self-Rectified Large-Scale Code Generation with Multidimensional Chain-of-Thought and Dynamic Backtracking
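[Quick Read]: This paper addresses the limited support that large language models (LLMs) offer in code generation to users with little programming knowledge. LLMs mainly generate isolated code snippets rather than complete, large-scale project code, and users without coding expertise struggle to interpret, modify, and iteratively refine these outputs, so they cannot assemble a complete project. The paper proposes the Self-Rectified Large-Scale Code Generator (SRLCG), a framework that generates complete multi-file project code from a single prompt. Its key is a novel multidimensional chain-of-thought (CoT) combined with self-rectification, which guides the LLM to generate correct and robust code files, followed by a dynamic backtracking algorithm that integrates them into a complete and coherent project. Experiments show that SRLCG generates substantially longer code than existing baselines (for example, 15x longer than DeepSeek-V3 and 16x longer than GPT-4), with improved correctness, robustness, and performance in large-scale code generation.
Link: https://arxiv.org/abs/2504.00532
Authors: Hongru Ma, Yanjie Liang, Jiasheng Si, Weiyu Zhang, Hongjiao Guan, Chaoqun Zheng, Bing Xu, Wenpeng Lu
Affiliations: Beihang University; Shandong University; Qilu University of Technology (Shandong Academy of Sciences); Harbin Institute of Technology
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 23 pages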
Abstract:Large language models (LLMs) have revolutionized code generation, significantly enhancing developer productivity. However, for a vast number of users with minimal coding knowledge, LLMs provide little support, as they primarily generate isolated code snippets rather than complete, large-scale project code. Without coding expertise, these users struggle to interpret, modify, and iteratively refine the outputs of LLMs, making it impossible to assemble a complete project. To address this issue, we propose Self-Rectified Large-Scale Code Generator (SRLCG), a framework that generates complete multi-file project code from a single prompt. SRLCG employs a novel multidimensional chain-of-thought (CoT) and self-rectification to guide LLMs in generating correct and robust code files, then integrates them into a complete and coherent project using our proposed dynamic backtracking algorithm. Experimental results show that SRLCG generates code 15x longer than DeepSeek-V3, 16x longer than GPT-4, and at least 10x longer than other leading CoT-based baselines. Furthermore, they confirm its improved correctness, robustness, and performance compared to baselines in large-scale code generation.
[NLP-42] Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
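[Quick Read]: This paper asks whether the reasoning performance of large language models (LLMs) reflects true intelligence or mere recitation of solutions seen during training. It proposes RoR-Bench, a novel multi-modal benchmark that detects recitation behavior by posing simple reasoning problems whose conditions are subtly shifted. The key is a set of simple reasoning problems with subtly adjusted conditions, which exposes the limits of LLMs' ability to maintain consistent performance. The study finds that existing cutting-edge LLMs show extremely severe recitation behavior: changing one phrase in the condition can cost top models such as OpenAI-o1 and DeepSeek-R1 60% of their performance on elementary school-level arithmetic and reasoning problems. This suggests that current evaluations may overestimate the true intelligence level of LLMs.
Link: https://arxiv.org/abs/2504.00509
Authors: Kai Yan, Yufei Xu, Zhengyin Du, Xuesong Yao, Zheyu Wang, Xiaowen Guo, Jiecao Chen
Affiliations: University of Illinois at Urbana-Champaign; ByteDance
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 3 figures, 10 tables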
Abstract: The rapid escalation in the difficulty of LLM benchmarks in recent years, from elementary school-level to frontier problems, has woven a miracle for researchers: the impression that we are only inches away from surpassing human intelligence. However, does the LLMs’ remarkable reasoning ability indeed come from true intelligence by human standards, or are they simply reciting solutions witnessed during training at an Internet level? To study this problem, we propose RoR-Bench, a novel, multi-modal benchmark for detecting LLMs’ recitation behavior when they are asked simple reasoning problems with subtly shifted conditions, and conduct empirical analysis on our benchmark. Surprisingly, we found that existing cutting-edge LLMs unanimously exhibit extremely severe recitation behavior; by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer 60% performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community that compels us to re-evaluate the true intelligence level of cutting-edge LLMs.
[NLP-43] ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers ICIP
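[Quick Read]: This paper addresses the high computational cost of multimodal large language models (MLLMs), which stems from their massive size and large number of visual tokens. It introduces a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens respectively. LC is computed by measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. A pilot experiment shows that many MLLM layers contribute minimally when processing visual tokens. Motivated by this, the authors propose ShortV, a training-free method that uses LC to identify ineffective layers and freezes visual token updates in those layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of MLLM layers, dramatically reducing the computation spent on updating visual tokens; for example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be made publicly available.
Link: https://arxiv.org/abs/2504.00502
Authors: Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Project page: this https URL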
Abstract:Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer’s transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer’s transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual token in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at this https URL
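The Layer Contribution idea can be sketched as follows: for each layer, compare the model's output distribution with and without that layer's transformation on visual tokens, then freeze the layers whose removal changes the output least. The sketch assumes KL divergence as the divergence measure and uses made-up output distributions in place of real model runs; both are assumptions for illustration, not details confirmed by the abstract.

```python
import math
from typing import Dict, List, Tuple


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def select_layers_to_freeze(full_output: List[float],
                            ablated_outputs: Dict[int, List[float]],
                            num_freeze: int) -> Tuple[List[int], Dict[int, float]]:
    """Score each layer by output divergence and pick the lowest-scoring ones."""
    lc = {layer: kl_divergence(full_output, out) for layer, out in ablated_outputs.items()}
    ranked = sorted(lc, key=lc.get)              # smallest contribution first
    return sorted(ranked[:num_freeze]), lc


# Hypothetical next-token distributions: the full model, and the model with
# layer i's update on visual tokens removed (values are illustrative only).
full = [0.7, 0.2, 0.1]
ablated = {
    0: [0.4, 0.4, 0.2],
    1: [0.69, 0.21, 0.10],
    2: [0.7, 0.2, 0.1],
    3: [0.5, 0.3, 0.2],
}
frozen, scores = select_layers_to_freeze(full, ablated, num_freeze=2)
```

In this toy example layers 1 and 2 barely change the output when ablated, so they are the candidates for freezing visual token updates, while layers 0 and 3 are kept active.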
[NLP-44] FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
[Quick Read]: This paper addresses two problems in Audio-Visual Question Answering (AVQA): existing methods tend to overfit to dataset biases, which leads to poor robustness, and current datasets may not effectively diagnose these methods. It makes two key contributions. First, it constructs a new dataset, FortisAVQA, in two stages that expand the test space and introduce distribution shifts across questions, allowing a more refined robustness evaluation over rare, frequent, and overall question distributions. Second, it proposes a robust Multimodal Audio-Visual Epistemic Network (MAVEN), which uses a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experiments show that the architecture achieves state-of-the-art performance on FortisAVQA, and ablation studies validate the debiasing components.
Link: https://arxiv.org/abs/2504.00487
Authors: Jie Ma, Zhitao Gao, Qi Chai, Jun Liu, Pinghui Wang, Jing Tao, Zhou Su
Affiliations: Ministry of Education Key Laboratory for Intelligent Networks and Network Security, School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China; Information Hub, Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, 510000, China
Categories: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review
Abstract:Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to answer natural language queries based on paired audio-video inputs accurately. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at this https URL.
[NLP-45] Making Large Language Models Better Reasoners with Orchestrated Streaming Experiences EMNLP2024
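[Quick Read]: This paper addresses two limitations of reasoning with large language models (LLMs): zero-shot prompting tends to perform poorly, and the strong performance of few-shot prompting depends on manually crafted demonstrations. It presents RoSE (Reasoning with Orchestrated Streaming Experiences), a general framework for reasoning tasks that self-improves without complex external effort. The key is an architecture that extends an LLM to store all answered questions and their thoughts in a streaming experience pool, and then orchestrates helpful questions from the pool to assist with new ones. Specifically, RoSE computes the similarity between each question in the pool and the new test question, sorts the questions by similarity, and divides them uniformly into multiple buckets, extracting one question from each bucket to increase diversity. To make the extracted questions as helpful as possible, RoSE also considers two further attributes, uncertainty and complexity, and preferentially selects low-uncertainty, high-complexity questions from each bucket.
Link: https://arxiv.org/abs/2504.00473
Authors: Xiangyang Liu, Junliang He, Xipeng Qiu
Affiliations: School of Computer Science, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing
Categories: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2024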
Abstract: Large language models (LLMs) can perform complex reasoning by generating intermediate thoughts under zero-shot or few-shot settings. However, zero-shot prompting always suffers from low performance, and the superior performance of few-shot prompting hinges on manually crafted demonstrations. In this paper, we present RoSE (Reasoning with Orchestrated Streaming Experiences), a general framework for solving reasoning tasks that can self-improve without complex external efforts. To enable RoSE, we describe an architecture that extends an LLM to store all answered questions and their thoughts in a streaming experience pool and then orchestrate helpful questions from the pool to assist in answering new questions. To set up a question-aware orchestration mechanism, RoSE first calculates the similarity of each question in the pool with a new test question. Since the solution to each answered question is not always correct, RoSE will sort the questions according to their similarity with the new question, and then uniformly divide them into multiple buckets. It finally extracts one question from each bucket to make these extracted questions more diverse. To make these extracted questions help RoSE answer new questions as much as possible, we introduce two other attributes of uncertainty and complexity for each question. RoSE will preferentially select the questions with low uncertainty and high complexity from each bucket. We evaluate the versatility of RoSE in various reasoning tasks, LLMs, and CoT methods.
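The selection procedure described in the abstract (sort by similarity, split into equal buckets, take one question per bucket, prefer low uncertainty and high complexity) can be sketched as below. The `Experience` fields, the precomputed scores, and the lexicographic tie-breaking rule are assumptions made for illustration; the abstract does not say how RoSE measures or combines uncertainty and complexity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    question: str
    similarity: float    # similarity to the new test question (assumed precomputed)
    uncertainty: float   # lower is better
    complexity: float    # higher is better


def orchestrate(pool: List[Experience], num_buckets: int) -> List[Experience]:
    """Pick one demonstration per similarity bucket."""
    ranked = sorted(pool, key=lambda e: e.similarity, reverse=True)
    size, remainder = divmod(len(ranked), num_buckets)
    picks, start = [], 0
    for b in range(num_buckets):
        end = start + size + (1 if b < remainder else 0)   # spread any leftover items
        bucket = ranked[start:end]
        start = end
        if bucket:
            # Hypothetical rule: lowest uncertainty first, then highest complexity.
            picks.append(min(bucket, key=lambda e: (e.uncertainty, -e.complexity)))
    return picks


pool = [
    Experience("q1", 0.9, 0.4, 3),
    Experience("q2", 0.8, 0.1, 2),
    Experience("q3", 0.6, 0.2, 5),
    Experience("q4", 0.5, 0.2, 7),
    Experience("q5", 0.3, 0.5, 1),
    Experience("q6", 0.1, 0.3, 4),
]
demos = orchestrate(pool, num_buckets=3)
selected = [e.question for e in demos]
```

Sampling across buckets rather than taking only the most similar questions keeps the demonstrations diverse, which matters because, as the abstract notes, stored answers are not always correct.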
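[NLP-46] Memorizing is Not Enough: Deep Knowledge Injection Through Reasoning
【Quick read】: This paper addresses the outdated information that results from the static nature of large language models (LLMs), particularly as the real world evolves or when models must adapt to domain-specific knowledge. Current research on knowledge injection is largely confined to memorization and retrieval and lacks a systematic treatment. In response, the paper proposes a four-tier knowledge injection framework that defines levels of injection depth: memorization, retrieval, reasoning, and association. Based on this framework it builds DeepKnowledge, an experimental testbed for fine-grained evaluation of injection depth across three knowledge types (novel, incremental, and updated). The key is to reveal through experiments the factors needed to reach each level and to establish a mapping between levels and suitable injection methods, providing a comprehensive and efficient approach to multi-level knowledge injection.
Link: https://arxiv.org/abs/2504.00472
Authors: Ruoxi Xu,Yunjie Ji,Boxi Cao,Yaojie Lu,Hongyu Lin,Xianpei Han,Ben He,Yingfei Sun,Xiangang Li,Le Sun
Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; a-m-team
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: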
Abstract:Although large language models (LLMs) excel in knowledge recall and reasoning, their static nature leads to outdated information as the real world evolves or when adapting to domain-specific knowledge, highlighting the need for effective knowledge injection. However, current research on knowledge injection remains superficial, mainly focusing on knowledge memorization and retrieval. This paper proposes a four-tier knowledge injection framework that systematically defines the levels of knowledge injection: memorization, retrieval, reasoning, and association. Based on this framework, we introduce DeepKnowledge, a synthetic experimental testbed designed for fine-grained evaluation of the depth of knowledge injection across three knowledge types (novel, incremental, and updated). We then explore various knowledge injection scenarios and evaluate the depth of knowledge injection for each scenario on the benchmark. Experimental results reveal key factors to reach each level of knowledge injection for LLMs and establish a mapping between the levels of knowledge injection and the corresponding suitable injection methods, aiming to provide a comprehensive approach for efficient knowledge injection across various levels.
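[NLP-47] Multimodal LLMs for OCR, OCR Post-Correction and Named Entity Recognition in Historical Documents
【Quick read】: This paper explores the potential of multimodal large language models (mLLMs) for transcribing historical documents, extracting information, and constructing datasets. Specifically, it evaluates mLLMs on optical character recognition (OCR), OCR post-correction, and named entity recognition (NER), using a set of German-language city directories published between 1754 and 1870. The key contribution is a multimodal OCR post-correction method based on mLLMs which, without any image pre-processing or model fine-tuning, improves transcription accuracy to a character error rate (CER) of 1%. The paper also shows that mLLMs can efficiently parse structured entity information from transcriptions of historical documents. The approach offers early evidence of a potential paradigm shift in historical data collection and document transcription.
Link: https://arxiv.org/abs/2504.00414
Authors: Gavin Greif,Niclas Griesshaber,Robin Greif
Affiliations: University of Oxford; University of Mannheim
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: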
Abstract:We explore how multimodal Large Language Models (mLLMs) can help researchers transcribe historical documents, extract relevant historical information, and construct datasets from historical sources. Specifically, we investigate the capabilities of mLLMs in performing (1) Optical Character Recognition (OCR), (2) OCR Post-Correction, and (3) Named Entity Recognition (NER) tasks on a set of city directories published in German between 1754 and 1870. First, we benchmark the off-the-shelf transcription accuracy of both mLLMs and conventional OCR models. We find that the best-performing mLLM model significantly outperforms conventional state-of-the-art OCR models and other frontier mLLMs. Second, we are the first to introduce multimodal post-correction of OCR output using mLLMs. We find that this novel approach leads to a drastic improvement in transcription accuracy and consistently produces highly accurate transcriptions (1% CER), without any image pre-processing or model fine-tuning. Third, we demonstrate that mLLMs can efficiently recognize entities in transcriptions of historical documents and parse them into structured dataset formats. Our findings provide early evidence for the long-term potential of mLLMs to introduce a paradigm shift in the approaches to historical data collection and document transcription.
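For reference, character error rate is commonly computed as the character-level edit distance between a transcription and its ground truth, divided by the length of the ground truth. The sketch below follows that common definition; the function names are illustrative, and the paper's own evaluation script may normalize text differently.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of `hypothesis` against a non-empty `reference`."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, a CER of 1% corresponds to roughly one character edit per hundred reference characters.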
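[NLP-48] Semantic Mastery: Enhancing LLMs with Advanced Natural Language Understanding
【Quick read】: This paper addresses the difficulties that large language models (LLMs) still have with deep semantic understanding, contextual coherence, and subtle reasoning. Its focus is on advanced natural language understanding (NLU) techniques, including semantic parsing, knowledge integration, and contextual reinforcement learning, and on how structured knowledge graphs, retrieval-augmented generation (RAG), and fine-tuning strategies that align models with human-level understanding can improve performance. It also discusses transformer-based architectures, contrastive learning, and hybrid symbolic-neural methods for reducing hallucination, ambiguity, and inconsistency in complex NLP tasks such as question answering, text summarization, and dialogue generation. The findings stress the importance of semantic precision for AI-driven language systems and point to future research directions for bridging the gap between statistical language models and true natural language understanding.
Link: https://arxiv.org/abs/2504.00409
Authors: Mohanakrishnan Hariharan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: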
Abstract:Large language models (LLMs) have greatly improved their capability in performing NLP tasks. However, deeper semantic understanding, contextual coherence, and more subtle reasoning are still difficult to obtain. The paper discusses state-of-the-art methodologies that advance LLMs with more advanced NLU techniques, such as semantic parsing, knowledge integration, and contextual reinforcement learning. We analyze the use of structured knowledge graphs, retrieval-augmented generation (RAG), and fine-tuning strategies that match models with human-level understanding. Furthermore, we address the incorporation of transformer-based architectures, contrastive learning, and hybrid symbolic-neural methods that address problems like hallucinations, ambiguity, and inconsistency in the factual perspectives involved in performing complex NLP tasks, such as question answering, text summarization, and dialogue generation. Our findings show the importance of semantic precision for enhancing AI-driven language systems and suggest future research directions to bridge the gap between statistical language models and true natural language understanding.
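[NLP-49] VerifiAgent: a Unified Verification Agent in Language Model Reasoning
【Quick read】: This paper addresses two problems: large language models often produce unreliable or incorrect responses in reasoning tasks, and existing verification methods are model-specific or domain-restricted, computationally expensive, and hard to scale across diverse reasoning tasks. It proposes VerifiAgent, a unified verification agent that combines two levels of verification: meta-verification, which assesses the completeness and consistency of model responses, and tool-based adaptive verification, in which VerifiAgent autonomously selects appropriate verification tools according to the reasoning type (mathematical, logical, or commonsense). This adaptive approach is designed for efficiency and robustness across verification scenarios. Experiments show that VerifiAgent outperforms baseline verification methods across the reasoning tasks tested, can further improve reasoning accuracy by feeding back verification results, and, in mathematical reasoning, achieves better inference scaling with fewer samples and lower cost.
Link: https://arxiv.org/abs/2504.00406
Authors: Jiuzhou Han,Wray Buntine,Ehsan Shareghi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: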
Abstract:Large language models demonstrate remarkable reasoning capabilities but often produce unreliable or incorrect responses. Existing verification methods are typically model-specific or domain-restricted, requiring significant computational resources and lacking scalability across diverse reasoning tasks. To address these limitations, we propose VerifiAgent, a unified verification agent that integrates two levels of verification: meta-verification, which assesses completeness and consistency in model responses, and tool-based adaptive verification, where VerifiAgent autonomously selects appropriate verification tools based on the reasoning type, including mathematical, logical, or commonsense reasoning. This adaptive approach ensures both efficiency and robustness across different verification scenarios. Experimental results show that VerifiAgent outperforms baseline verification methods (e.g., deductive verifier, backward verifier) among all reasoning tasks. Additionally, it can further enhance reasoning accuracy by leveraging feedback from verification results. VerifiAgent can also be effectively applied to inference scaling, achieving better results with fewer generated samples and costs compared to existing process reward models in the mathematical reasoning domain. Code is available at this https URL
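[NLP-50] When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)
【Quick read】: This paper addresses the difficulty large language models (LLMs) have in judging what is true when faced with contradictory claims. In a single-turn, multi-agent debate framework, one LLM-based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same LLM architecture serves as judge. The key contribution is the Confidence-Weighted Persuasion Override Rate (CW-POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice. Experiments on five open-source LLMs (3B to 14B parameters) show that even smaller models can craft persuasive arguments that override truthful answers, often with high confidence. The results underline the importance of robust calibration and adversarial testing to keep LLMs from confidently endorsing misinformation.
Link: https://arxiv.org/abs/2504.00374
Authors: Mahak Agarwal,Divyam Khanna
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures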
Abstract:In many real-world scenarios, a single Large Language Model (LLM) may encounter contradictory claims-some accurate, others forcefully incorrect-and must judge which is true. We investigate this risk in a single-turn, multi-agent debate framework: one LLM-based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same LLM architecture serves as judge. We introduce the Confidence-Weighted Persuasion Override Rate (CW-POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice. Our experiments on five open-source LLMs (3B-14B parameters), where we systematically vary agent verbosity (30-300 words), reveal that even smaller models can craft persuasive arguments that override truthful answers-often with high confidence. These findings underscore the importance of robust calibration and adversarial testing to prevent LLMs from confidently endorsing misinformation.
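The abstract does not give the formula for CW-POR, so the sketch below is only one plausible reading, shown to make the idea concrete: the plain override rate counts how often the judge picks the falsehood, and the confidence-weighted variant adds up the judge's confidence on those deceived cases and divides by the number of debates. The function names, the `(deceived, confidence)` input format, and the normalization are all assumptions, not the paper's definition.

```python
def persuasion_override_rate(judgments):
    """Fraction of debates in which the judge chose the false answer.

    `judgments` is a list of (deceived, confidence) pairs, where
    `deceived` is a bool and `confidence` is the judge's stated
    confidence in its choice, assumed to lie in [0, 1].
    """
    return sum(1 for deceived, _ in judgments if deceived) / len(judgments)


def confidence_weighted_por(judgments):
    """Assumed weighting: confidence mass on deceived verdicts per debate."""
    return sum(conf for deceived, conf in judgments if deceived) / len(judgments)
```

Under this reading, a judge that is fooled rarely but with near-certainty can score as badly as one that is fooled often with low confidence, which is the distinction the metric is meant to surface.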
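[NLP-51] Leveraging Large Language Models for Automated Definition Extraction with TaxoMatic: A Case Study on Media Bias
【Quick read】: This paper addresses the automatic extraction of concept definitions from academic literature, with a focus on the media bias domain. The key is the TaxoMatic framework, which leverages large language models (LLMs) and encompasses data collection, LLM-based relevance classification, and extraction of conceptual definitions. Evaluated on a dataset of 2,398 manually rated articles, Claude-3-sonnet achieved the best results in both relevance classification and definition extraction. Future work will expand the dataset and apply TaxoMatic to other domains.
Link: https://arxiv.org/abs/2504.00343
Authors: Timo Spinde,Luyang Lin,Smi Hinterreiter,Isao Echizen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: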
Abstract:This paper introduces TaxoMatic, a framework that leverages large language models to automate definition extraction from academic literature. Focusing on the media bias domain, the framework encompasses data collection, LLM-based relevance classification, and extraction of conceptual definitions. Evaluated on a dataset of 2,398 manually rated articles, the study demonstrates the framework's effectiveness, with Claude-3-sonnet achieving the best results in both relevance classification and definition extraction. Future directions include expanding datasets and applying TaxoMatic to additional domains.
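[NLP-52] VNJPTranslate: A comprehensive pipeline for Vietnamese-Japanese translation
【Quick read】: This paper addresses the challenges of neural machine translation (NMT) for low-resource language pairs such as Vietnamese-Japanese (Vi-Ja), including sparse parallel data and the handling of linguistic and cultural nuances. It proposes VNJPTranslate, a systematic pipeline. First, it applies a targeted data augmentation strategy in which advanced large language models (LLMs), whose reasoning is often refined via reinforcement learning (RL), generate high-quality synthetic data with chain-of-thought prompting for challenging segments identified through corpus analysis. Then it applies efficient fine-tuning techniques (Unsloth with QLoRA) to a capable, low-parameter autoregressive model (a fine-tuned version of the 1.8B-parameter Sailor model, which is based on the Qwen architecture) to build a practical, high-performing translation system that aims to improve Vi-Ja translation quality significantly over existing baselines.
Link: https://arxiv.org/abs/2504.00339
Authors: Hoang Hai Phan,Nguyen Duc Minh Vu,Nam Dang Phuong
Affiliations: HUST (Hanoi University of Science and Technology)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: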
Abstract:Neural Machine Translation (NMT) driven by Transformer architectures has advanced significantly, yet faces challenges with low-resource language pairs like Vietnamese-Japanese (Vi-Ja). Issues include sparse parallel data and handling linguistic/cultural nuances. Recent progress in Large Language Models (LLMs) with strong reasoning, often refined via Reinforcement Learning (RL), enables high-quality synthetic data generation. We introduce VNJPTranslate, a pipeline designed to systematically address the Vi-Ja translation task. It features a targeted data augmentation strategy using advanced LLMs with Chain-of-Thought prompting for challenging segments identified via corpus analysis. Subsequently, we employ efficient fine-tuning techniques (Unsloth with QLoRA) on a capable, low-parameter autoregressive model (specifically, a fine-tuned version of the 1.8B parameter Sailor model, which is based on the Qwen architecture) to create a practical and high-performing translation system. This integrated approach aims to improve Vi-Ja translation quality significantly over existing baselines.
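[NLP-53] Effect-driven interpretation: Functors for natural language composition
【Quick read】: This paper addresses formal representation and semantic analysis of natural language. It argues that human languages, like computer programs, are organized around a separation between pure values and impure processes, and it aims to show how denotational techniques from computer science can support elegant and illuminating analyses of natural language composition. The key idea is to borrow the separation of pure functions from components with side effects in programs, build a corresponding model of pure values and impure processes in natural language, and develop semantic analyses on that basis.
Link: https://arxiv.org/abs/2504.00316
Authors: Dylan Bumford,Simon Charlow
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: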
Abstract:Computer programs are often factored into pure components – simple, total functions from inputs to outputs – and components that may have side effects – errors, changes to memory, parallel threads, abortion of the current loop, etc. We make the case that human languages are similarly organized around the give and pull of pure values and impure processes, and we’ll aim to show how denotational techniques from computer science can be leveraged to support elegant and illuminating analyses of natural language composition.
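To give a feel for the pure/impure split the abstract refers to, the sketch below shows the standard functor operation of mapping a pure function over a value that carries an effect. It is a toy illustration under stated assumptions, not the authors' formal system: a Python list stands in for a set of alternatives, `None` stands in for failure, and the helper names and the string-valued "meanings" are invented for this example.

```python
def fmap_list(f, values):
    """Apply a pure function to each alternative, keeping the alternatives."""
    return [f(v) for v in values]


def fmap_optional(f, value):
    """Apply a pure function unless the value is a failure (None)."""
    return None if value is None else f(value)


def sleeps(subject):
    """Toy pure meaning: a predicate waiting for its subject."""
    return f"{subject} sleeps"


# An expression that introduces alternatives (the effect) composes with an
# ordinary predicate (the pure part) by mapping, with no change to `sleeps`.
alternatives = fmap_list(sleeps, ["Ann", "Bo"])

# A failed reference stays a failure after composition.
failed = fmap_optional(sleeps, None)
```

The design point is that the predicate is written once as a pure function, and each kind of effect supplies its own way of lifting it.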
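[NLP-54] Detecting and Mitigating Bias in LLMs through Knowledge Graph-Augmented Training
【Quick read】: This paper addresses the tendency of large language models (LLMs) to inherit and amplify biases in their training data, which raises ethical and fairness concerns. The key contribution is Knowledge Graph-Augmented Training (KGAT), a method that uses structured, domain-specific knowledge from real-world knowledge graphs to improve the model's understanding and reduce biased output. Bias is assessed on public datasets including Gender Shades, Bias in Bios, and FairFace, with metrics such as demographic parity and equal opportunity used for detection. The paper also applies targeted mitigation strategies to correct biased associations, and it reports a significant drop in biased output and improved bias metrics. By combining real-world datasets with knowledge graphs, the framework is presented as both scalable and effective, paving the way toward responsible deployment of LLMs in sensitive and high-stakes applications.
Link: https://arxiv.org/abs/2504.00310
Authors: Rajeev Kumar,Harishankar Kumar,Kumari Shalini
Affiliations: Gen AI Research; Althire AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: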
Abstract:Large language models have revolutionized natural language processing with their surprising capability to understand and generate human-like text. However, many of these models inherit and further amplify the biases present in their training data, raising ethical and fairness concerns. The detection and mitigation of such biases are vital to ensuring that LLMs act responsibly and equitably across diverse domains. This work investigates Knowledge Graph-Augmented Training (KGAT) as a novel method to mitigate bias in LLM. Using structured domain-specific knowledge from real-world knowledge graphs, we improve the understanding of the model and reduce biased output. Public datasets for bias assessment include Gender Shades, Bias in Bios, and FairFace, while metrics such as demographic parity and equal opportunity facilitate rigorous detection. We also performed targeted mitigation strategies to correct biased associations, leading to a significant drop in biased output and improved bias metrics. Equipped with real-world datasets and knowledge graphs, our framework is both scalable and effective, paving the way toward responsible deployment in sensitive and high-stakes applications.
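The two fairness metrics named in the abstract have widely used definitions, sketched below for binary predictions: demographic parity compares the rate of positive predictions across groups, and equal opportunity compares true positive rates. This is a generic illustration under those standard definitions; the function names are invented here, and the paper may compute its metrics differently.

```python
from collections import defaultdict


def positive_rate_by_group(preds, groups, mask=None):
    """Share of positive predictions per group, over rows where mask is True."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for i, (pred, group) in enumerate(zip(preds, groups)):
        if mask is not None and not mask[i]:
            continue
        counts[group][0] += int(pred == 1)
        counts[group][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}


def demographic_parity_gap(preds, groups):
    """Largest difference in positive prediction rate between groups."""
    rates = positive_rate_by_group(preds, groups)
    return max(rates.values()) - min(rates.values())


def equal_opportunity_gap(preds, labels, groups):
    """Largest difference in true positive rate between groups.

    Groups with no positive labels are left out of the comparison.
    """
    positives_only = [label == 1 for label in labels]
    rates = positive_rate_by_group(preds, groups, positives_only)
    return max(rates.values()) - min(rates.values())
```

A gap of 0 means the groups are treated identically by the chosen criterion; larger gaps indicate more disparity.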
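[NLP-55] Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead
【Quick read】: This paper asks how far inference-time scaling can improve the reasoning of large language models (LLMs) on complex tasks. It evaluates scaling methods on challenging tasks including math and broader STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. The key is a systematic comparison of conventional models with models fine-tuned for inference-time scaling, using evaluation protocols that call a model repeatedly, either independently or sequentially with feedback, to approximate lower and upper performance bounds and the potential for future improvement. The study finds that the benefits of inference-time scaling vary by task and diminish as problem complexity increases, and that simply using more tokens does not always yield higher accuracy. For some tasks, conventional models with perfect verifiers come close to the average performance of today's most advanced reasoning models, while for other tasks a significant gap remains. All models nonetheless show significant gains when inference is scaled further with perfect verifiers or strong feedback, suggesting ample room for future improvement.
Link: https://arxiv.org/abs/2504.00294
Authors: Vidhisha Balachandran,Jingya Chen,Lingjiao Chen,Shivam Garg,Neel Joshi,Yash Lara,John Langford,Besmira Nushi,Vibhav Vineet,Yue Wu,Safoora Yousefi
Affiliations: Microsoft Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: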
Abstract:Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks, including math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. We compare conventional models (e.g., GPT-4o) with models fine-tuned for inference-time scaling (e.g., o1) through evaluation protocols that involve repeated model calls, either independently or sequentially with feedback. These evaluations approximate lower and upper performance bounds and potential for future performance improvements for each model, whether through enhanced training or multi-model inference systems. Our extensive empirical analysis reveals that the advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. In addition, simply using more tokens does not necessarily translate to higher accuracy in these challenging regimes. Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today’s most advanced reasoning models. However, for other tasks, a significant performance gap remains, even in very high scaling regimes. Encouragingly, all models demonstrate significant gains when inference is further scaled with perfect verifiers or strong feedback, suggesting ample potential for future improvements.
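One simple way to approximate the lower and upper bounds mentioned in the abstract from repeated independent runs is sketched below. The aggregation names are assumptions for this example: "best of n" counts a problem as solved if any run is correct (what a perfect verifier could select), "worst of n" requires every run to be correct, and "average" is the mean per-run accuracy. The paper's exact protocol may differ.

```python
def scaling_summary(runs):
    """Summarize repeated runs per problem.

    `runs` is a list with one entry per problem; each entry is a non-empty
    list of booleans saying whether each independent run was correct.
    """
    n_problems = len(runs)
    average = sum(sum(r) / len(r) for r in runs) / n_problems
    best_of_n = sum(any(r) for r in runs) / n_problems   # upper bound
    worst_of_n = sum(all(r) for r in runs) / n_problems  # lower bound
    return {"average": average, "best_of_n": best_of_n, "worst_of_n": worst_of_n}
```

The spread between the worst and best figures indicates how much a better selection step, such as a stronger verifier, could add without changing the underlying model.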
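[NLP-56] Do Chinese models speak Chinese languages?
【Quick read】: This paper evaluates how the multilingual capabilities of Chinese and Western open-source large language models (LLMs) differ on Asian regional languages and Chinese minority languages, and asks whether Chinese models reflect an agenda related to China's language policy. The key is a set of experiments on Information Parity and reading comprehension that compare the two groups of models on these languages. Chinese models' performance across the languages correlates strongly with that of Western models (r=0.93), with better Mandarin as the sole exception, and Chinese models sometimes fail to identify minority languages such as Kazakh and Uyghur.
Link: https://arxiv.org/abs/2504.00289
Authors: Andrea W Wen-Yi,Unso Eun Seo Jo,David Mimno
Affiliations: Cornell University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: First and Second author contribute equally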
Abstract:The release of top-performing open-weight LLMs has cemented China’s role as a leading force in AI development. Do these models support languages spoken in China? Or do they speak the same languages as Western models? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, China has a long history of explicit language policy, varying between inclusivity of minority languages and a Mandarin-first policy. To test whether Chinese LLMs today reflect an agenda about China’s languages, we test performance of Chinese and Western open-source LLMs on Asian regional and Chinese minority languages. Our experiments on Information Parity and reading comprehension show Chinese models’ performance across these languages correlates strongly (r=0.93) with Western models’, with the sole exception being better Mandarin. Sometimes, Chinese models cannot identify languages spoken by Chinese minorities such as Kazakh and Uyghur, even though they are good at French and German. These results provide a window into current development priorities, suggest options for future development, and indicate guidance for end users.
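[NLP-57] Do Large Language Models Exhibit Spontaneous Rational Deception?
【Quick read】: This paper asks under what conditions large language models (LLMs) deceive spontaneously. In particular, do models that perform better on reasoning tasks deceive more often in situations where deception could be considered rational self-interest?
The key is a preregistered experimental protocol that uses tools from signaling theory to evaluate a range of proprietary closed-source and open-source LLMs on modified 2x2 games (in the style of the Prisoner's Dilemma). The games add a phase in which a model can communicate freely with the other agent in unconstrained language, which creates an opportunity to deceive under conditions that vary in how useful deception would be. The results show that LLMs spontaneously misrepresent their actions in at least some conditions, that they are more likely to do so when it benefits them, and that models with stronger reasoning capacity deceive at higher rates. These findings suggest a tradeoff between reasoning capability and honesty in LLMs, and they reveal contextual factors that affect whether a model deceives.
Link: https://arxiv.org/abs/2504.00285
Authors: Samuel M. Taylor,Benjamin K. Bergen
Affiliations: UC San Diego
Categories: Computation and Language (cs.CL)
Comments: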
Abstract:Large Language Models (LLMs) are effective at deceiving, when prompted to do so. But under what conditions do they deceive spontaneously? Models that demonstrate better performance on reasoning tasks are also better at prompted deception. Do they also increasingly deceive spontaneously in situations where it could be considered rational to do so? This study evaluates spontaneous deception produced by LLMs in a preregistered experimental protocol using tools from signaling theory. A range of proprietary closed-source and open-source LLMs are evaluated using modified 2x2 games (in the style of Prisoner’s Dilemma) augmented with a phase in which they can freely communicate to the other agent using unconstrained language. This setup creates an opportunity to deceive, in conditions that vary in how useful deception might be to an agent’s rational self-interest. The results indicate that 1) all tested LLMs spontaneously misrepresent their actions in at least some conditions, 2) they are generally more likely to do so in situations in which deception would benefit them, and 3) models exhibiting better reasoning capacity overall tend to deceive at higher rates. Taken together, these results suggest a tradeoff between LLM reasoning capability and honesty. They also provide evidence of reasoning-like behavior in LLMs from a novel experimental configuration. Finally, they reveal certain contextual factors that affect whether LLMs will deceive or not. We discuss consequences for autonomous, human-facing systems driven by LLMs both now and as their reasoning capabilities continue to improve.
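[NLP-58] Text Chunking for Document Classification for Urban System Management using Large Language Models
【Quick read】: This paper addresses the challenges that text coding and assessment face in urban systems management: limited resources, subjective bias, and insufficient accuracy and consistency between human evaluators. The key is to apply large language models (LLMs) to deductive coding and to compare two prompting methods, whole-text analysis and text-chunk analysis, using OpenAI's GPT-4o, GPT-4o-mini, and o1-mini models, testing whether their semantic processing reaches reliability comparable to human coding. The results indicate that LLMs may perform on par with human coders when initialized with specific deductive coding contexts, and that GPT-4o, o1-mini, and GPT-4o-mini show significant agreement with human raters when the chunking method is used. Adding an LLM as an additional rater alongside three manual raters yielded statistically significant agreement across all raters, which suggests that LLMs benefit the analysis of textual documents. The paper's contributions are an assessment of OpenAI GPT models on this task and a chunk-based prompting approach that mitigates context aggregation bias by preserving localized context.
Link: https://arxiv.org/abs/2504.00274
Authors: Joshua Rodriguez(1),Om Sanan(2),Guillermo Vizarreta-Luna(1),Steven A. Conrad(1) ((1) Department of Systems Engineering, Colorado State University, Fort Collins, CO, USA, (2) Scarsdale High School, Scarsdale, NY, USA)
Affiliations: Department of Systems Engineering, Colorado State University; Scarsdale High School
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 16 pages, 6 figures, 4 tables, 2 algorithms; Replication data and code can be found this https URL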
Abstract:Urban systems are managed using complex textual documentation that needs coding and analysis to set requirements and evaluate built environment performance. This paper contributes to the study of applying large language models (LLMs) to qualitative coding activities to reduce resource requirements while maintaining comparable reliability to humans. Qualitative coding and assessment face challenges such as resource limitations, bias, and limited accuracy and consistency between human evaluators. Here we report the application of LLMs to deductively code 10 case documents on the presence of 17 digital twin characteristics for the management of urban systems. We utilize two prompting methods to compare the semantic processing of LLMs with human coding efforts: whole text analysis and text chunk analysis using OpenAI's GPT-4o, GPT-4o-mini, and o1-mini models. We found similar trends of internal variability between methods, and results indicate that LLMs may perform on par with human coders when initialized with specific deductive coding contexts. GPT-4o, o1-mini and GPT-4o-mini showed significant agreement with human raters when employed using a chunking method. The application of both GPT-4o and GPT-4o-mini as an additional rater with three manual raters showed statistically significant agreement across all raters, indicating that the analysis of textual documents is benefited by LLMs. Our findings reveal nuanced sub-themes of LLM application, suggesting LLMs follow human memory coding processes where whole-text analysis may introduce multiple meanings. The novel contributions of this paper lie in assessing the performance of OpenAI GPT models and in introducing the chunk-based prompting approach, which addresses context aggregation biases by preserving localized context.
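The chunk-based approach can be pictured with the small sketch below: split a document into fixed-size chunks, code each chunk on its own, and take the union of the codes found. Everything here is a hypothetical simplification; the fixed word-count chunking, the union aggregation, and the `code_chunk` callback (which in practice would wrap an LLM prompt) are assumptions, not the procedure from the paper.

```python
def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def code_document(text, chunk_size, code_chunk):
    """Code each chunk separately and merge the results.

    `code_chunk` is a caller-supplied function that takes one chunk and
    returns the set of characteristics it judges to be present. A
    characteristic counts for the document if any chunk reports it.
    """
    found = set()
    for chunk in chunk_words(text, chunk_size):
        found |= code_chunk(chunk)
    return found
```

Coding each chunk separately keeps every decision tied to its local context, which is the property the abstract credits with reducing context aggregation bias.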
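[NLP-59] Multilingual Sentiment Analysis of Summarized Texts: A Cross-Language Study of Text Shortening Effects
【Quick read】: This paper addresses how summarization affects sentiment classification across languages with diverse morphologies. It compares extractive and abstractive summarization in English, German, French, Spanish, Italian, Finnish, Hungarian, and Arabic, and it measures post-summarization changes in sentiment accuracy using multilingual transformers (mBERT, XLM-RoBERTa, T5, and BART) and language-specific models (FinBERT, AraBERT). The key finding is that extractive summarization better preserves sentiment, especially in morphologically complex languages, whereas abstractive summarization improves readability but introduces sentiment distortion that lowers accuracy, particularly for languages with rich inflectional morphology. On this basis the paper proposes a hybrid summarization approach that balances readability and sentiment preservation, offering guidance for multilingual sentiment applications such as social media monitoring, market analysis, and cross-lingual opinion mining.
Link: https://arxiv.org/abs/2504.00265
Authors: Mikhail Krasitskii,Grigori Sidorov,Olga Kolesnikova,Liliana Chanona Hernandez,Alexander Gelbukh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: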
Abstract:Summarization significantly impacts sentiment analysis across languages with diverse morphologies. This study examines extractive and abstractive summarization effects on sentiment classification in English, German, French, Spanish, Italian, Finnish, Hungarian, and Arabic. We assess sentiment shifts post-summarization using multilingual transformers (mBERT, XLM-RoBERTa, T5, and BART) and language-specific models (FinBERT, AraBERT). Results show extractive summarization better preserves sentiment, especially in morphologically complex languages, while abstractive summarization improves readability but introduces sentiment distortion, affecting sentiment accuracy. Languages with rich inflectional morphology, such as Finnish, Hungarian, and Arabic, experience greater accuracy drops than English or German. Findings emphasize the need for language-specific adaptations in sentiment analysis and propose a hybrid summarization approach balancing readability and sentiment preservation. These insights benefit multilingual sentiment applications, including social media monitoring, market analysis, and cross-lingual opinion mining.
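[NLP-60] SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers
【Quick read】: This paper evaluates large language models (LLMs) on generating code from the algorithm descriptions in natural language processing (NLP) papers. The task requires two competencies: algorithm comprehension, meaning synthesizing information from papers and academic literature to understand the implementation logic, and coding expertise, meaning identifying dependencies and correctly implementing the necessary APIs. The paper introduces SciReplicate-Bench, a benchmark of 100 tasks drawn from 36 NLP papers published in 2024, with detailed annotations and comprehensive test cases. It also proposes Sci-Reproducer, a framework made up of a Paper Agent that interprets algorithmic concepts from the literature and a Code Agent that retrieves dependencies from repositories and implements solutions. Algorithm comprehension is assessed with reasoning graph accuracy, which quantifies the similarity between generated and reference reasoning graphs; implementation quality is assessed with execution accuracy, CodeBLEU, and repository dependency/API recall. The best-performing LLM using Sci-Reproducer reaches only 39% execution accuracy, which highlights the benchmark's difficulty, and the analysis identifies missing or inconsistent algorithm descriptions as key barriers to successful reproduction. The authors plan to open-source the benchmark and code.
Link: https://arxiv.org/abs/2504.00255
Authors: Yanzheng Xiang,Hanqi Yan,Shuyin Ouyang,Lin Gui,Yulan He
Affiliations: King’s College London; The Alan Turing Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: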
Abstract:This study evaluates large language models (LLMs) in generating code from algorithm descriptions from recent NLP papers. The task requires two key competencies: (1) algorithm comprehension: synthesizing information from papers and academic literature to understand implementation logic, and (2) coding expertise: identifying dependencies and correctly implementing necessary APIs. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024, featuring detailed annotations and comprehensive test cases. Building on SciReplicate-Bench, we propose Sci-Reproducer, a multi-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions. To assess algorithm understanding, we introduce reasoning graph accuracy, which quantifies similarity between generated and reference reasoning graphs derived from code comments and structure. For evaluating implementation quality, we employ execution accuracy, CodeBLEU, and repository dependency/API recall metrics. In our experiments, we evaluate various powerful Non-Reasoning LLMs and Reasoning LLMs as foundational models. The best-performing LLM using Sci-Reproducer achieves only 39% execution accuracy, highlighting the benchmark's difficulty. Our analysis identifies missing or inconsistent algorithm descriptions as key barriers to successful reproduction. We will open-source our benchmark and code at this https URL.
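[NLP-61] ElaLoRA: Elastic Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning
【Quick read】: This paper addresses a limitation of existing Low-Rank Adaptation (LoRA) methods for fine-tuning large pre-trained models: they cannot adjust ranks dynamically to match the importance of different layers. The key is ElaLoRA, an adaptive low-rank adaptation framework that dynamically prunes and expands ranks based on gradient-derived importance scores. According to the authors, ElaLoRA is the first method to support both rank pruning and rank expansion during fine-tuning. Their studies further validate that layers receiving higher rank allocations contribute more to model performance, which supports the adaptive strategy. This adaptive rank allocation mechanism makes ElaLoRA a scalable and efficient fine-tuning option, particularly suited to resource-constrained environments.
Link: https://arxiv.org/abs/2504.00254
Authors: Huandong Chang,Zicheng Ma,Mingyuan Ma,Zhenting Qi,Andrew Sabot,Hong Jiang,H. T. Kung
Affiliations: Harvard University, SEAS
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: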
Abstract:Low-Rank Adaptation (LoRA) has become a widely adopted technique for fine-tuning large-scale pre-trained models with minimal parameter updates. However, existing methods rely on fixed ranks or focus solely on either rank pruning or expansion, failing to adapt ranks dynamically to match the importance of different layers during training. In this work, we propose ElaLoRA, an adaptive low-rank adaptation framework that dynamically prunes and expands ranks based on gradient-derived importance scores. To the best of our knowledge, ElaLoRA is the first method that enables both rank pruning and expansion during fine-tuning. Experiments across multiple benchmarks demonstrate that ElaLoRA consistently outperforms existing PEFT methods across different parameter budgets. Furthermore, our studies validate that layers receiving higher rank allocations contribute more significantly to model performance, providing theoretical justification for our adaptive strategy. By introducing a principled and adaptive rank allocation mechanism, ElaLoRA offers a scalable and efficient fine-tuning solution, particularly suited for resource-constrained environments.
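The idea of pruning and expanding ranks under a fixed budget can be illustrated with the toy sketch below. It is a hypothetical simplification, not ElaLoRA's algorithm: importance scores are passed in as plain numbers rather than derived from gradients, and the rule of moving one rank unit at a time from the least important layer to the most important one is an assumption made for this example.

```python
def reallocate_ranks(ranks, importance, moves=1, min_rank=1):
    """Shift rank units from low-importance layers to the top layer.

    `ranks` maps layer name -> current rank, `importance` maps layer
    name -> score (higher means more important). The total rank budget
    is preserved, and no layer drops below `min_rank`.
    """
    new_ranks = dict(ranks)
    by_importance = sorted(new_ranks, key=lambda name: importance[name])
    receiver = by_importance[-1]  # most important layer gains rank
    for _ in range(moves):
        # Least important layer that can still give up a rank unit.
        donor = next(
            (n for n in by_importance if n != receiver and new_ranks[n] > min_rank),
            None,
        )
        if donor is None:
            break
        new_ranks[donor] -= 1     # prune
        new_ranks[receiver] += 1  # expand
    return new_ranks
```

Because every pruned unit is reassigned, the number of trainable parameters stays roughly constant while capacity moves toward the layers that appear to matter most.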
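[NLP-62] Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemorcacy
【Quick read】: This paper explores the use of large language models (LLMs) to synthesize public opinion data as a response to problems in traditional survey methods, such as declining response rates and non-response bias. The key is a role-creation technique based on knowledge injection: a form of in-context learning that uses RAG (Retrieval-Augmented Generation), personality profiles specified with the HEXACO model, and demographic information to generate prompts dynamically. This lets LLMs simulate diverse opinions more accurately than existing prompt engineering approaches and significantly improves the alignment of LLM-generated opinions with real-world human survey responses, increasing answer adherence.
Link: https://arxiv.org/abs/2504.00241
Authors: Rabimba Karanjai,Boris Shor,Amanda Austin,Ryan Kennedy,Yang Lu,Lei Xu,Weidong Shi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: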
Abstract:This paper investigates the use of Large Language Models (LLMs) to synthesize public opinion data, addressing challenges in traditional survey methods like declining response rates and non-response bias. We introduce a novel technique: role creation based on knowledge injection, a form of in-context learning that leverages RAG and specified personality profiles from the HEXACO model and demographic information, and uses that for dynamically generated prompts. This method allows LLMs to simulate diverse opinions more accurately than existing prompt engineering approaches. We compare our results with pre-trained models with standard few-shot prompts. Experiments using questions from the Cooperative Election Study (CES) demonstrate that our role-creation approach significantly improves the alignment of LLM-generated opinions with real-world human survey responses, increasing answer adherence. In addition, we discuss challenges, limitations and future research directions.
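[NLP-63] Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
[Quick read]: This paper addresses the novel adversarial risks in multi-agent large language model (Multi-Agent LLM) systems. Safety discussions have traditionally focused on constraining the behavior of a single model, whereas multi-agent systems, which depend on inter-agent communication and decentralized reasoning, introduce new adversarial challenges. The key contribution is an adversarial attack designed for pragmatic systems subject to constraints such as limited bandwidth, message-delivery latency, and defense mechanisms. Specifically, the attack path is formulated as a maximum-flow minimum-cost problem and combined with a novel Permutation-Invariant Evasion Loss (PIEL), using graph-based optimization to maximize attack success rate while minimizing the risk of detection. Experiments across models (e.g., Llama, Mistral, Gemma, DeepSeek) and datasets (e.g., JailBreakBench, AdversarialBench) show that the method outperforms conventional attacks by up to 7x, exposing critical vulnerabilities in multi-agent systems. The study also shows that existing defenses, such as variants of Llama-Guard and PromptGuard, fail to stop the attack, underscoring the urgent need for safety mechanisms designed specifically for multi-agent systems.
Link: https://arxiv.org/abs/2504.00218
Authors: Rana Muhammad Shahroz Khan,Zhen Tan,Sukwon Yun,Charles Flemming,Tianlong Chen
Affiliations: University of North Carolina at Chapel Hill; Arizona State University; Cisco
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: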
Abstract:Most discussions about Large Language Model (LLM) safety have focused on single-agent settings but multi-agent LLM systems now create novel adversarial risks because their behavior depends on communication between agents and decentralized reasoning. In this work, we innovatively focus on attacking pragmatic systems that have constraints such as limited token bandwidth, latency between message delivery, and defense mechanisms. We design a permutation-invariant adversarial attack that optimizes prompt distribution across latency and bandwidth-constraint network topologies to bypass distributed safety mechanisms within the system. Formulating the attack path as a problem of maximum-flow minimum-cost, coupled with the novel Permutation-Invariant Evasion Loss (PIEL), we leverage graph-based optimization to maximize attack success rate while minimizing detection risk. Evaluating across models including Llama, Mistral, Gemma, DeepSeek and other variants on various datasets like JailBreakBench and AdversarialBench, our method outperforms conventional attacks by up to 7×, exposing critical vulnerabilities in multi-agent systems. Moreover, we demonstrate that existing defenses, including variants of Llama-Guard and PromptGuard, fail to prohibit our attack, emphasizing the urgent need for multi-agent specific safety mechanisms.
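[NLP-64] Insight-RAG: Enhancing LLMs with Insight-Driven Augmentation
[Quick read]: This paper addresses shortcomings of conventional Retrieval Augmented Generation (RAG): retrieving documents based only on surface-level relevance can overlook information buried deep within individual documents, miss relevant insights that span multiple sources, and fit poorly with tasks beyond traditional question answering. To address these issues, the paper proposes a new framework called Insight-RAG. Its key idea is a two-stage design: first, a large language model (LLM) analyzes the input query and task to extract the underlying informational requirements; second, a specialized LLM trained on the document database is queried to mine content that directly addresses those identified insights. Finally, the original query is combined with the retrieved insights and another LLM generates a contextually enriched and accurate response. Experiments show that Insight-RAG outperforms existing methods by a significant margin in most cases, suggesting that integrating insight-driven retrieval into RAG both improves performance and extends RAG to a broader range of tasks beyond conventional question answering.
Link: https://arxiv.org/abs/2504.00187
Authors: Pouya Pezeshkpour,Estevam Hruschka
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: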
Abstract:Retrieval Augmented Generation (RAG) frameworks have shown significant promise in leveraging external knowledge to enhance the performance of large language models (LLMs). However, conventional RAG methods often retrieve documents based solely on surface-level relevance, leading to many issues: they may overlook deeply buried information within individual documents, miss relevant insights spanning multiple sources, and are not well-suited for tasks beyond traditional question answering. In this paper, we propose Insight-RAG, a novel framework designed to address these issues. In the initial stage of Insight-RAG, instead of using traditional retrieval methods, we employ an LLM to analyze the input query and task, extracting the underlying informational requirements. In the subsequent stage, a specialized LLM – trained on the document database – is queried to mine content that directly addresses these identified insights. Finally, by integrating the original query with the retrieved insights, similar to conventional RAG approaches, we employ a final LLM to generate a contextually enriched and accurate response. Using two scientific paper datasets, we created evaluation benchmarks targeting each of the mentioned issues and assessed Insight-RAG against traditional RAG pipeline. Our results demonstrate that the Insight-RAG pipeline successfully addresses these challenges, outperforming existing methods by a significant margin in most cases. These findings suggest that integrating insight-driven retrieval within the RAG framework not only enhances performance but also broadens the applicability of RAG to tasks beyond conventional question answering.
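[NLP-65] Contradiction Detection in RAG Systems: Evaluating LLMs as Context Validators for Improved Information Consistency
[Quick read]: This paper addresses the problem that Retrieval Augmented Generation (RAG) systems can retrieve documents containing contradictory information, leading large language models (LLMs) to produce inconsistent or erroneous outputs. The solution has two parts: first, a novel data generation framework that simulates the different types of contradictions that may arise in the retrieval stage of a RAG system; second, an evaluation of how robustly different LLMs perform as context validators, in particular their ability to detect contradictory information within retrieved document sets. The experiments show that context validation remains challenging even for state-of-the-art LLMs, and that performance varies significantly across types of contradiction. In addition, while larger models generally detect contradictions better, the effectiveness of different prompting strategies varies across tasks and model architectures.
Link: https://arxiv.org/abs/2504.00180
Authors: Vignesh Gokul,Srikanth Tenneti,Alwarappan Nakkiran
Affiliations: Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: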
Abstract:Retrieval Augmented Generation (RAG) systems have emerged as a powerful method for enhancing large language models (LLMs) with up-to-date information. However, the retrieval step in RAG can sometimes surface documents containing contradictory information, particularly in rapidly evolving domains such as news. These contradictions can significantly impact the performance of LLMs, leading to inconsistent or erroneous outputs. This study addresses this critical challenge in two ways. First, we present a novel data generation framework to simulate different types of contradictions that may occur in the retrieval stage of a RAG system. Second, we evaluate the robustness of different LLMs in performing as context validators, assessing their ability to detect contradictory information within retrieved document sets. Our experimental results reveal that context validation remains a challenging task even for state-of-the-art LLMs, with performance varying significantly across different types of contradictions. While larger models generally perform better at contradiction detection, the effectiveness of different prompting strategies varies across tasks and model architectures. We find that chain-of-thought prompting shows notable improvements for some models but may hinder performance in others, highlighting the complexity of the task and the need for more robust approaches to context validation in RAG systems.
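[NLP-66] Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier
[Quick read]: This paper addresses a fundamental limitation that pre-tokenization introduces into modern tokenization pipelines: because text is split on whitespace and punctuation, the distribution of tokens in a corpus skews heavily toward common full words, which limits the benefit of expanding to larger vocabularies. To overcome this barrier, the paper proposes BoundlessBPE, a modified BPE algorithm that relaxes the pretoken boundary constraint by selectively merging two complete pretokens into a larger unit called a superword, for example merging " of" and " the" into " of the". This merging strategy yields a substantially more uniform token distribution across the corpus and compresses text more effectively, with roughly 20% more bytes per token.
Link: https://arxiv.org/abs/2504.00178
Authors: Craig W. Schmidt,Varshini Reddy,Chris Tanner,Yuval Pinter
Affiliations: Kensho Technologies; Department of Computer Science, Ben-Gurion University of the Negev; Massachusetts Institute of Technology (MIT)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: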
Abstract:Pre-tokenization, the initial step in many modern tokenization pipelines, segments text into smaller units called pretokens, typically splitting on whitespace and punctuation. While this process encourages having full, individual words as tokens, it introduces a fundamental limitation in most tokenization algorithms such as Byte Pair Encoding (BPE). Specifically, pre-tokenization causes the distribution of tokens in a corpus to heavily skew towards common, full-length words. This skewed distribution limits the benefits of expanding to larger vocabularies, since the additional tokens appear with progressively lower counts. To overcome this barrier, we propose BoundlessBPE, a modified BPE algorithm that relaxes the pretoken boundary constraint. Our approach selectively merges two complete pretokens into a larger unit we term a superword. Superwords are not necessarily semantically cohesive. For example, the pretokens " of" and " the" might be combined to form the superword " of the". This merging strategy results in a substantially more uniform distribution of tokens across a corpus than standard BPE, and compresses text more effectively, with an approximate 20% increase in bytes per token.
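As a rough illustration of the superword idea described in the abstract, the following Python sketch counts adjacent pretoken pairs and merges the most frequent pair into a single unit. It is not the authors' implementation: the function names `most_frequent_pair` and `merge_superword` and the toy corpus are hypothetical, and any eligibility rules BoundlessBPE applies before allowing a merge are not reproduced here.

```python
from collections import Counter


def most_frequent_pair(pretokens):
    """Return the most common adjacent pretoken pair, or None if there is none."""
    pairs = Counter(zip(pretokens, pretokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_superword(pretokens, pair):
    """Merge every non-overlapping occurrence of `pair` into one superword."""
    merged, i = [], 0
    while i < len(pretokens):
        if i + 1 < len(pretokens) and (pretokens[i], pretokens[i + 1]) == pair:
            merged.append(pretokens[i] + pretokens[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(pretokens[i])
            i += 1
    return merged


# Toy corpus (made up for illustration); pretokens keep their leading space.
corpus = ["king", " of", " the", " hill", " and", " lord", " of", " the", " manor"]
pair = most_frequent_pair(corpus)
superword_corpus = merge_superword(corpus, pair)
```

On this toy corpus the pair (" of", " the") occurs twice, so it becomes the superword " of the" and the sequence shrinks from nine units to seven.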
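[NLP-67] Does "Reasoning" with Large Language Models Improve Recognizing Generating and Reframing Unhelpful Thoughts?
[Quick read]: This paper asks how the reasoning capabilities of large language models (LLMs) can improve cognitive reframing, a core task in Cognitive Behavioral Therapy (CBT), so that negative or distorted thoughts are recognized, generated, and reframed more effectively. The key is an investigation of several reasoning approaches, including augmented reasoning strategies such as Chain of Thought (CoT) prompting and self-consistency, as well as pre-trained reasoning LLMs. The study finds that augmented reasoning methods, even when applied to older LLMs such as GPT-3.5, consistently outperform state-of-the-art pre-trained reasoning models at recognizing, generating, and reframing unhelpful thoughts.
Link: https://arxiv.org/abs/2504.00163
Authors: Yilin Qi,Dong Won Lee,Cynthia Breazeal,Hae Won Park
Affiliations: Harvard University; MIT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 8 pages, 3 figures (including appendix)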
Abstract:Cognitive Reframing, a core element of Cognitive Behavioral Therapy (CBT), helps individuals reinterpret negative experiences by finding positive meaning. Recent advances in Large Language Models (LLMs) have demonstrated improved performance through reasoning-based strategies. This inspires a promising direction of leveraging the reasoning capabilities of LLMs to improve CBT and mental reframing by simulating the process of critical thinking, potentially enabling more effective recognition, generation, and reframing of cognitive distortions. In this work, we investigate the role of various reasoning methods, including pre-trained reasoning LLMs and augmented reasoning strategies such as CoT and self-consistency in enhancing LLMs’ ability to perform cognitive reframing tasks. We find that augmented reasoning methods, even when applied to “outdated” LLMs like GPT-3.5, consistently outperform state-of-the-art pretrained reasoning models on recognizing, generating and reframing unhelpful thoughts.
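[NLP-68] Universal Zero-shot Embedding Inversion
[Quick read]: This paper addresses the fundamental problem of text embedding inversion: reconstructing the original text given only its embedding and black-box access to the encoder. From an NLP perspective, this helps assess how much semantic information about the input an embedding retains; from a security perspective, it measures how much information vector databases and embedding-based retrieval systems leak. State-of-the-art methods such as vec2text are highly accurate but require training a separate model for each embedding and a large number of queries to the encoder. To address this, the paper proposes ZSInvert, a zero-shot inversion method based on the recently proposed adversarial decoding technique. ZSInvert is fast and query-efficient, and it needs no embedding-specific inversion model, so it can be applied to any text embedding. Experiments show that ZSInvert recovers key semantic information about the corresponding texts.
Link: https://arxiv.org/abs/2504.00147
Authors: Collin Zhang,John X. Morris,Vitaly Shmatikov
Affiliations: Department of Computer Science, Cornell University
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Notes: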
Abstract:Embedding inversion, i.e., reconstructing text given its embedding and black-box access to the embedding encoder, is a fundamental problem in both NLP and security. From the NLP perspective, it helps determine how much semantic information about the input is retained in the embedding. From the security perspective, it measures how much information is leaked by vector databases and embedding-based retrieval systems. State-of-the-art methods for embedding inversion, such as vec2text, have high accuracy but require (a) training a separate model for each embedding, and (b) a large number of queries to the corresponding encoder. We design, implement, and evaluate ZSInvert, a zero-shot inversion method based on the recently proposed adversarial decoding technique. ZSInvert is fast, query-efficient, and can be used for any text embedding without training an embedding-specific inversion model. We measure the effectiveness of ZSInvert on several embeddings and demonstrate that it recovers key semantic information about the corresponding texts.
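[NLP-69] Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B
[Quick read]: This paper investigates how In-Context Learning (ICL) works in large language models (LLMs), in particular how task information is inferred from a handful of examples in the context. The key is a causal-intervention analysis of information flow in Gemma-2 2B on five naturalistic ICL tasks, which reveals a two-step strategy: a contextualize step first relates each few-shot example to the examples preceding it, and an aggregate step then extracts the task information and prepares the output prediction. The importance of the contextualization step varies by task and is greatest when ambiguous examples are present. Through this rigorous causal analysis, the paper clarifies the mechanisms by which ICL happens in language models.
Link: https://arxiv.org/abs/2504.00132
Authors: Aleksandra Bakalova,Yana Veitsman,Xinting Huang,Michael Hahn
Affiliations: Saarland University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: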
Abstract:In-Context Learning (ICL) is an intriguing ability of large language models (LLMs). Despite a substantial amount of work on its behavioral aspects and how it emerges in miniature setups, it remains unclear which mechanism assembles task information from the individual examples in a fewshot prompt. We use causal interventions to identify information flow in Gemma-2 2B for five naturalistic ICL tasks. We find that the model infers task information using a two-step strategy we call contextualize-then-aggregate: In the lower layers, the model builds up representations of individual fewshot examples, which are contextualized by preceding examples through connections between fewshot input and output tokens across the sequence. In the higher layers, these representations are aggregated to identify the task and prepare prediction of the next output. The importance of the contextualization step differs between tasks, and it may become more important in the presence of ambiguous examples. Overall, by providing rigorous causal analysis, our results shed light on the mechanisms through which ICL happens in language models.
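[NLP-70] LLMs for Explainable AI: A Comprehensive Survey
[Quick read]: This paper addresses the problem that AI models are hard for users to trust because they lack transparency, which in turn leads to less effective decision-making, unclear accountability, and hidden potential biases. It examines how large language models (LLMs) can strengthen Explainable AI (XAI) by turning complex machine learning outputs into easy-to-understand narratives, improving users' understanding of and trust in model predictions. The key is to explore the potential of human-language-based LLMs for model explainability, to survey evaluation techniques for measuring the quality of LLM-generated explanations, and to discuss the associated challenges, limitations, and real-world applications, concluding with a call for XAI approaches that are more interpretable, automated, user-centric, and multidisciplinary.
Link: https://arxiv.org/abs/2504.00125
Authors: Ahsan Bilal,David Ebert,Beiyu Lin
Affiliations: University of Oklahoma
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: This manuscript is intended for submission to ACM Transactions on Intelligent Systems and Technology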
Abstract:Large Language Models (LLMs) offer a promising approach to enhancing Explainable AI (XAI) by transforming complex machine learning outputs into easy-to-understand narratives, making model predictions more accessible to users, and helping bridge the gap between sophisticated model behavior and human interpretability. AI models, such as state-of-the-art neural networks and deep learning models, are often seen as “black boxes” due to a lack of transparency. As users cannot fully understand how the models reach conclusions, users have difficulty trusting decisions from AI models, which leads to less effective decision-making processes, reduced accountabilities, and unclear potential biases. A challenge arises in developing explainable AI (XAI) models to gain users’ trust and provide insights into how models generate their outputs. With the development of Large Language Models, we want to explore the possibilities of using human language-based models, LLMs, for model explainabilities. This survey provides a comprehensive overview of existing approaches regarding LLMs for XAI, and evaluation techniques for LLM-generated explanation, discusses the corresponding challenges and limitations, and examines real-world applications. Finally, we discuss future directions by emphasizing the need for more interpretable, automated, user-centric, and multidisciplinary approaches for XAI via LLMs.
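[NLP-71] Evaluating the Feasibility and Accuracy of Large Language Models for Medical History-Taking in Obstetrics and Gynecology
[Quick read]: This paper addresses the problem that effective physician-patient communication in complex and sensitive medical areas such as infertility is time-consuming and makes clinic workflows inefficient. It proposes using large language models (LLMs) to automate medical history-taking and improve diagnostic accuracy. The key is an AI-driven conversational system, evaluated by comparing ChatGPT-4o and ChatGPT-4o-mini on infertility cases using information extraction accuracy (F1 score), differential diagnosis accuracy, and accuracy of infertility type judgment. ChatGPT-4o-mini outperformed ChatGPT-4o in extraction accuracy and history-taking completeness, suggesting it is better suited to gathering the detailed patient information needed for accurate diagnosis. ChatGPT-4o was slightly better in differential diagnosis accuracy. ChatGPT-4o-mini scored higher on infertility type judgment but with lower consistency. Future work should prioritize validating accuracy and reliability in clinical settings, fine-tuning the models, and building larger datasets covering a diverse mix of cases.
Link: https://arxiv.org/abs/2504.00061
Authors: Dou Liu,Ying Long,Sophia Zuoqiu,Tian Tang,Rong Yin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted by IISE 2025 annual conference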
Abstract:Effective physician-patient communications in pre-diagnostic environments, and most specifically in complex and sensitive medical areas such as infertility, are critical but consume a lot of time and, therefore, cause clinic workflows to become inefficient. Recent advancements in Large Language Models (LLMs) offer a potential solution for automating conversational medical history-taking and improving diagnostic accuracy. This study evaluates the feasibility and performance of LLMs in those tasks for infertility cases. An AI-driven conversational system was developed to simulate physician-patient interactions with ChatGPT-4o and ChatGPT-4o-mini. A total of 70 real-world infertility cases were processed, generating 420 diagnostic histories. Model performance was assessed using F1 score, Differential Diagnosis (DDs) Accuracy, and Accuracy of Infertility Type Judgment (ITJ). ChatGPT-4o-mini outperformed ChatGPT-4o in information extraction accuracy (F1 score: 0.9258 vs. 0.9029, p = 0.045, d = 0.244) and demonstrated higher completeness in medical history-taking (97.58% vs. 77.11%), suggesting that ChatGPT-4o-mini is more effective in extracting detailed patient information, which is critical for improving diagnostic accuracy. In contrast, ChatGPT-4o performed slightly better in differential diagnosis accuracy (2.0524 vs. 2.0048, p 0.05). ITJ accuracy was higher in ChatGPT-4o-mini (0.6476 vs. 0.5905) but with lower consistency (Cronbach’s α = 0.562), suggesting variability in classification reliability. Both models demonstrated strong feasibility in automating infertility history-taking, with ChatGPT-4o-mini excelling in completeness and extraction accuracy. In future studies, expert validation for accuracy and dependability in a clinical setting, AI model fine-tuning, and larger datasets with a mix of cases of infertility have to be prioritized.
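[NLP-72] Integrating Large Language Models with Human Expertise for Disease Detection in Electronic Health Records
[Quick read]: This paper addresses the problem of efficiently identifying multiple conditions from electronic health record (EHR) clinical notes, which traditionally requires extensive manual labelling that is time-consuming and labour-intensive. To meet this challenge, the study develops an efficient strategy based on advanced large language models (LLMs). The key is a pipeline in which an LLM analyzes, understands, and interprets EHR notes through prompts based on specific diagnoses, treatment management, and clinical guidelines, automating disease identification. The pipeline was applied to detect acute myocardial infarction (AMI), diabetes, and hypertension, and was compared against clinician-validated diagnoses and the widely adopted International Classification of Diseases (ICD) code-based methods, showing improved sensitivity and negative predictive value.
Link: https://arxiv.org/abs/2504.00053
Authors: Jie Pan,Seungwon Lee,Cheligeer Cheligeer,Elliot A. Martin,Kiarash Riazi,Hude Quan,Na Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: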
Abstract:Objective: Electronic health records (EHR) are widely available to complement administrative data-based disease surveillance and healthcare performance evaluation. Defining conditions from EHR is labour-intensive and requires extensive manual labelling of disease outcomes. This study developed an efficient strategy based on advanced large language models to identify multiple conditions from EHR clinical notes. Methods: We linked a cardiac registry cohort in 2015 with an EHR system in Alberta, Canada. We developed a pipeline that leveraged a generative large language model (LLM) to analyze, understand, and interpret EHR notes by prompts based on specific diagnosis, treatment management, and clinical guidelines. The pipeline was applied to detect acute myocardial infarction (AMI), diabetes, and hypertension. The performance was compared against clinician-validated diagnoses as the reference standard and widely adopted International Classification of Diseases (ICD) codes-based methods. Results: The study cohort accounted for 3,088 patients and 551,095 clinical notes. The prevalence was 55.4%, 27.7%, and 65.9% for AMI, diabetes, and hypertension, respectively. The performance of the LLM-based pipeline for detecting conditions varied: AMI had 88% sensitivity, 63% specificity, and 77% positive predictive value (PPV); diabetes had 91% sensitivity, 86% specificity, and 71% PPV; and hypertension had 94% sensitivity, 32% specificity, and 72% PPV. Compared with ICD codes, the LLM-based method demonstrated improved sensitivity and negative predictive value across all conditions. The monthly percentage trends from the detected cases by LLM and reference standard showed consistent patterns.
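For readers less familiar with the metrics reported in this abstract, here is a small Python sketch of the standard definitions of sensitivity, specificity, PPV, and NPV from confusion-matrix counts. The function name `diagnostic_metrics` is hypothetical, and the counts in the example are invented for illustration; they are not taken from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard screening metrics from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that were detected
        "specificity": tn / (tn + fp),  # share of non-cases correctly ruled out
        "ppv": tp / (tp + fp),          # share of positive calls that are true cases
        "npv": tn / (tn + fn),          # share of negative calls that are non-cases
    }


# Invented counts, for illustration only.
metrics = diagnostic_metrics(tp=90, fp=30, tn=70, fn=10)
```

With these made-up counts, sensitivity is 0.90, specificity 0.70, PPV 0.75, and NPV 0.875. High sensitivity paired with low specificity, as in the hypertension result above, means few missed cases but many false alarms.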
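[NLP-73] The Cursive Transformer
[Quick read]: This paper addresses the under-exploration of handwriting data, represented as sequences of pen stroke offsets, in generative modeling. The key to the solution is a novel tokenization scheme that converts stroke offsets to polar coordinates, discretizes them into bins, and turns them into token sequences for training a standard GPT model. The approach needs no specialized architectures (such as mixture density networks or self-advancing ASCII attention heads): with only 3,500 handwritten samples and simple data augmentations, it generates realistic cursive handwriting and outperforms earlier RNN-based methods.
Link: https://arxiv.org/abs/2504.00051
Authors: Sam Greydanus,Zachary Wimpee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 11 pages, 8 figures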
Abstract:Transformers trained on tokenized text, audio, and images can generate high-quality autoregressive samples. But handwriting data, represented as sequences of pen coordinates, remains underexplored. We introduce a novel tokenization scheme that converts pen stroke offsets to polar coordinates, discretizes them into bins, and then turns them into sequences of tokens with which to train a standard GPT model. This allows us to capture complex stroke distributions without using any specialized architectures (e.g., the mixture density network or the self-advancing ASCII attention head from Graves 2014). With just 3,500 handwritten words and a few simple data augmentations, we are able to train a model that can generate realistic cursive handwriting. Our approach is simpler and more performant than previous RNN-based methods.
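The tokenization step described in the abstract (offset, then polar coordinates, then bins, then tokens) might look roughly like the Python sketch below. This is an assumption-laden illustration rather than the paper's scheme: the function name `stroke_to_tokens`, the bin counts, the `max_radius` clipping value, and the choice of emitting one angle token followed by one radius token are all hypothetical.

```python
import math


def stroke_to_tokens(dx, dy, n_angle_bins=8, n_radius_bins=4, max_radius=1.0):
    """Map one pen offset (dx, dy) to a pair of integer token ids.

    Angle tokens occupy ids [0, n_angle_bins); radius tokens occupy
    ids [n_angle_bins, n_angle_bins + n_radius_bins). All sizes are
    illustrative assumptions.
    """
    r = math.hypot(dx, dy)                      # stroke length
    theta = math.atan2(dy, dx) % (2 * math.pi)  # direction in [0, 2*pi)
    angle_bin = min(int(theta / (2 * math.pi) * n_angle_bins), n_angle_bins - 1)
    # Offsets longer than max_radius are clipped into the last radius bin.
    radius_bin = min(int(r / max_radius * n_radius_bins), n_radius_bins - 1)
    return angle_bin, n_angle_bins + radius_bin


# A short made-up stroke sequence, flattened into one token stream.
offsets = [(0.3, 0.1), (-0.1, 0.9), (5.0, 0.0)]
tokens = [tok for dx, dy in offsets for tok in stroke_to_tokens(dx, dy)]
```

A flat integer sequence like `tokens` is the kind of input a standard GPT-style model can be trained on; pen-up and pen-down events, which a complete scheme would also need to encode, are omitted here.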
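[NLP-74] JudgeLRM: Large Reasoning Models as a Judge
[Quick read]: This paper addresses the shortcomings of existing Supervised Fine-Tuning (SFT) approaches in domains that require complex reasoning. By analyzing the reasoning requirements of evaluation tasks, it finds that SFT performance gains are negatively correlated with the proportion of reasoning-demanding samples, exposing the limits of SFT in such scenarios. To address this, the paper introduces JudgeLRM, a family of judgment-oriented large language models trained with reinforcement learning (RL) using judge-wise, outcome-driven rewards. The key is the training strategy and reward design, which allow JudgeLRM models to significantly outperform both SFT-tuned models and state-of-the-art reasoning models on tasks requiring deep reasoning; JudgeLRM-7B exceeds DeepSeek-R1 by 2.79% in F1 score.
Link: https://arxiv.org/abs/2504.00050
Authors: Nuo Chen,Zhiyuan Hu,Qingyun Zou,Jiaying Wu,Qian Wang,Bryan Hooi,Bingsheng He
Affiliations: National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: preprint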
Abstract:The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) approaches for judges often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.
[NLP-75] Distill-C: Enhanced NL2SQL via Distilled Customization with LLMs NAACL2025
[Quick read]: This paper addresses the trade-off between high performance and high efficiency in Natural Language to SQL (NL2SQL) tasks, along with the added complexity of domain- and customer-specific requirements. The key solution is Distill-C, a distilled customization framework that uses large teacher LLMs to generate high-quality synthetic data through a robust and scalable pipeline, and then fine-tunes smaller open-source models on that data. This enables the smaller models to rival or exceed, in execution accuracy, teacher models an order of magnitude larger, allowing lightweight yet capable NL2SQL models to be deployed at low computational cost.
Link: https://arxiv.org/abs/2504.00048
Authors: Cong Duy Vu Hoang,Gioacchino Tangari,Clemence Lanfranchi,Dalu Guo,Paul Cayet,Steve Siu,Don Dharmasiri,Yuan-Fang Li,Long Duong,Damien Hilloulin,Rhicheek Patra,Sungpack Hong,Hassan Chafi
Affiliations: Oracle Analytics Cloud (OAC); Oracle Health & AI (OHAI); Oracle Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Preprint, accepted at NAACL 2025 (Industry Track)
Abstract:The growing adoption of large language models (LLMs) in business applications has amplified interest in Natural Language to SQL (NL2SQL) solutions, in which there is competing demand for high performance and efficiency. Domain- and customer-specific requirements further complicate the problem. To address this conundrum, we introduce Distill-C, a distilled customization framework tailored for NL2SQL tasks. Distill-C utilizes large teacher LLMs to produce high-quality synthetic data through a robust and scalable pipeline. Finetuning smaller and open-source LLMs on this synthesized data enables them to rival or outperform teacher models an order of magnitude larger. Evaluated on multiple challenging benchmarks, Distill-C achieves an average improvement of 36% in execution accuracy compared to the base models from three distinct LLM families. Additionally, on three internal customer benchmarks, Distill-C demonstrates a 22.6% performance improvement over the base models. Our results demonstrate that Distill-C is an effective, high-performing and generalizable approach for deploying lightweight yet powerful NL2SQL models, delivering exceptional accuracies while maintaining low computational cost.
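[NLP-76] Multi-Stakeholder Disaster Insights from Social Media Using Large Language Models
[Quick read]: This paper addresses how to improve the automation, aggregation, and customization of social media data during disasters and emergencies so that actionable insights can be tailored to different stakeholders, such as the press, police, emergency medical services, and firefighters. Although research has made progress in collecting and analyzing social media content, further improvement is needed to meet these diverse needs and better support relief coordination, resource distribution, and media communication.
The key to the solution is combining classification techniques with generative AI to turn raw user feedback into stakeholder-specific reports. Specifically, the method uses full-spectrum large language models (LLMs): analytical models such as BERT perform precise, multi-dimensional classification of content type, sentiment, emotion, geolocation, and topic, and generative models such as ChatGPT produce human-readable, informative reports for distinct audiences. Compared with the standard approach of analyzing posts directly with ChatGPT, the method adds multi-dimensional classification, sub-event selection, and tailored report generation, and it performs better on both quantitative metrics (such as text coherence scores and latent representations) and qualitative assessments, delivering precise insights for the various disaster response stakeholders.
Link: https://arxiv.org/abs/2504.00046
Authors: Loris Belcastro,Cristian Cosentino,Fabrizio Marozzo,Merve Gündüz-Cüre,Şule Öztürk-Birim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Social and Information Networks (cs.SI)
Notes: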
Abstract:In recent years, social media has emerged as a primary channel for users to promptly share feedback and issues during disasters and emergencies, playing a key role in crisis management. While significant progress has been made in collecting and analyzing social media content, there remains a pressing need to enhance the automation, aggregation, and customization of this data to deliver actionable insights tailored to diverse stakeholders, including the press, police, EMS, and firefighters. This effort is essential for improving the coordination of activities such as relief efforts, resource distribution, and media communication. This paper presents a methodology that leverages the capabilities of LLMs to enhance disaster response and management. Our approach combines classification techniques with generative AI to bridge the gap between raw user feedback and stakeholder-specific reports. Social media posts shared during catastrophic events are analyzed with a focus on user-reported issues, service interruptions, and encountered challenges. We employ full-spectrum LLMs, using analytical models like BERT for precise, multi-dimensional classification of content type, sentiment, emotion, geolocation, and topic. Generative models such as ChatGPT are then used to produce human-readable, informative reports tailored to distinct audiences, synthesizing insights derived from detailed classifications. We compare standard approaches, which analyze posts directly using prompts in ChatGPT, to our advanced method, which incorporates multi-dimensional classification, sub-event selection, and tailored report generation. Our methodology demonstrates superior performance in both quantitative metrics, such as text coherence scores and latent representations, and qualitative assessments by automated tools and field experts, delivering precise insights for diverse disaster response stakeholders.
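[NLP-77] Measuring Online Hate on 4chan using Pre-trained Deep Learning Models
[Quick read]: This paper addresses the prevalence and complexity of online hate speech on non-moderated platforms such as 4chan's /pol/ board. The key solution is to apply advanced natural language processing (NLP) models, specifically transformer-based models such as RoBERTa and Detoxify, to analyze hate speech dynamics in depth and quantify their extent through multi-class hate speech classification (racism, sexism, religion, etc.), toxic content classification (e.g., identity attacks and threats), and topic modelling. The results show that 11.20% of the dataset is identified as containing hate in different categories, confirming that online hate takes many forms and is difficult to detect.
Link: https://arxiv.org/abs/2504.00045
Authors: Adrian Bermudez-Villalva,Maryam Mehrnezhad,Ehsan Toreini
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Notes: IEEE Transactions on Technology and Society, 11 pages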
Abstract:Online hate speech can harmfully impact individuals and groups, specifically on non-moderated platforms such as 4chan where users can post anonymous content. This work focuses on analysing and measuring the prevalence of online hate on 4chan’s politically incorrect board (/pol/) using state-of-the-art Natural Language Processing (NLP) models, specifically transformer-based models such as RoBERTa and Detoxify. By leveraging these advanced models, we provide an in-depth analysis of hate speech dynamics and quantify the extent of online hate on non-moderated platforms. The study advances understanding through multi-class classification of hate speech (racism, sexism, religion, etc.), while also incorporating the classification of toxic content (e.g., identity attacks and threats) and a further topic modelling analysis. The results show that 11.20% of this dataset is identified as containing hate in different categories. These evaluations show that online hate is manifested in various forms, confirming the complicated and volatile nature of detection in the wild.
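[NLP-78] Dynamic hashtag recommendation in social media with trend shift detection and adaptation
[Quick read]: This paper addresses the difficulty that existing static hashtag recommendation models have in adapting to the highly dynamic, real-time nature of social media conversations, in particular the emergence of new hashtags and semantic shifts in existing ones. To meet these challenges, it proposes H-ADAPTS (Hashtag recommendAtion by Detecting and adAPting to Trend Shifts), a BERT-based hashtag recommendation method. The key is a trend-aware detection mechanism that identifies changes in hashtag usage and triggers an efficient model adaptation on a small set of recent posts. The framework also uses Apache Storm for real-time stream processing, supporting scalable and fault-tolerant analysis of high-throughput social media data. Experiments show that H-ADAPTS maintains high recommendation accuracy while adapting to emerging trends, significantly outperforming existing solutions.
Link: https://arxiv.org/abs/2504.00044
Authors: Riccardo Cantini,Fabrizio Marozzo,Alessio Orsino,Domenico Talia,Paolo Trunfio
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Neural and Evolutionary Computing (cs.NE)
Notes: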
Abstract:The widespread use of social media platforms results in the generation of vast amounts of user-generated content, which requires efficient methods for categorization and search. Hashtag recommendation systems have emerged as a crucial tool for automatically suggesting relevant hashtags and improving content discoverability. However, existing static models struggle to adapt to the highly dynamic and real-time nature of social media conversations, where new hashtags emerge and existing ones undergo semantic shifts. To address these challenges, this paper presents H-ADAPTS (Hashtag recommendAtion by Detecting and adAPting to Trend Shifts), a BERT-based hashtag recommendation methodology that can detect and adapt to shifts in the main trends and topics underlying social media conversation. Our approach introduces a trend-aware detection mechanism to identify changes in hashtag usage, triggering efficient model adaptation on a (small) set of recent posts. The framework leverages Apache Storm for real-time stream processing, enabling scalable and fault-tolerant analysis of high-velocity social data. Experimental results on two real-world case studies, including the COVID-19 pandemic and the 2020 US presidential election, demonstrate the ability to maintain high recommendation accuracy by adapting to emerging trends. Our methodology significantly outperforms existing solutions, ensuring timely and relevant hashtag recommendations in dynamic environments.
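The abstract does not say how the trend-aware detection mechanism works, so the following Python sketch uses a stand-in technique: it compares hashtag frequency distributions from a reference window and a recent window using Jensen-Shannon divergence, and flags a shift when the divergence exceeds a threshold. The function names, the threshold of 0.3, and the sample hashtags are all hypothetical, and both windows are assumed to be non-empty.

```python
import math
from collections import Counter


def js_divergence(ref_tags, new_tags):
    """Jensen-Shannon divergence (base 2, in [0, 1]) between two hashtag lists."""
    p, q = Counter(ref_tags), Counter(new_tags)
    n_p, n_q = sum(p.values()), sum(q.values())
    total = 0.0
    for tag in set(p) | set(q):
        p_i, q_i = p[tag] / n_p, q[tag] / n_q
        m_i = 0.5 * (p_i + q_i)  # mixture distribution
        if p_i > 0:
            total += 0.5 * p_i * math.log2(p_i / m_i)
        if q_i > 0:
            total += 0.5 * q_i * math.log2(q_i / m_i)
    return total


def trend_shift_detected(ref_tags, new_tags, threshold=0.3):
    """Flag a shift when the two windows diverge beyond an assumed threshold."""
    return js_divergence(ref_tags, new_tags) > threshold


# Made-up windows of hashtags, for illustration only.
reference = ["#travel", "#food", "#travel", "#music"]
recent = ["#lockdown", "#covid19", "#covid19", "#stayhome"]
needs_adaptation = trend_shift_detected(reference, recent)
```

In a pipeline like the one the abstract outlines, a positive flag would be the point at which model adaptation on recent posts is triggered; how H-ADAPTS actually makes that decision is not specified in the text above.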
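[NLP-79] CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation
Quick read: This paper targets a limitation of existing reasoning evaluation frameworks for large language models (LLMs) and large vision-language models (LVLMs): most assess either text-based reasoning or vision-language understanding, with little dynamic interplay between textual and visual constraints. To address this, it introduces CrossWordBench, a benchmark that evaluates the reasoning of LLMs and LVLMs through crossword puzzles, a task requiring multimodal adherence to semantic constraints (text-based clues) and intersectional constraints (the visual grid structure). The key to the solution is a controllable puzzle generation framework that produces puzzles in multiple formats (text and image) and offers evaluation strategies ranging from direct puzzle solving to interactive modes. The study also shows that reasoning LLMs outperform non-reasoning models, and that LVLM performance on the task correlates strongly with grid-parsing accuracy.
Link: https://arxiv.org/abs/2504.00043
Authors: Jixuan Leng, Chengsong Huang, Langlin Huang, Bill Yuchen Lin, William W. Cohen, Haohan Wang, Jiaxin Huang
Institutions: CMU (Carnegie Mellon University); WUSTL (Washington University in St. Louis); UW (University of Washington); UIUC (University of Illinois Urbana-Champaign)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: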
Abstract:Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly either assess text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles, a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in multiple formats (text and image) and offers different evaluation strategies ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs outperform non-reasoning models substantially by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
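To make the notion of a crossing-letter (intersectional) constraint concrete, here is a minimal sketch that writes proposed answers into a grid and reports every cell where two answers disagree. The entry format `(row, col, direction, word)` and the function name are assumptions for this example; the benchmark's own data format and scoring are not described in the abstract.

```python
def place_answers(entries):
    """Write answers into a sparse grid and collect crossing conflicts.

    entries: iterable of (row, col, direction, word), where direction is
    "A" (across, left to right) or "D" (down, top to bottom).
    Returns (grid, conflicts); grid maps (row, col) -> letter and each
    conflict is ((row, col), existing_letter, new_letter).
    """
    grid = {}
    conflicts = []
    for row, col, direction, word in entries:
        for offset, letter in enumerate(word.upper()):
            cell = (row, col + offset) if direction == "A" else (row + offset, col)
            if cell in grid and grid[cell] != letter:
                # Two answers cross here but disagree on the shared letter.
                conflicts.append((cell, grid[cell], letter))
            else:
                grid[cell] = letter
    return grid, conflicts
```

An answer set satisfies the intersectional constraints exactly when `conflicts` is empty; satisfying the semantic constraints (the clues) has to be checked separately.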
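[NLP-80] Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge
Quick read: This paper examines the breadth and accuracy of large language models' (LLMs') knowledge of historical information and financial data. Specifically, it assesses how well LLMs cover historical financial data of U.S. publicly traded companies and analyzes how characteristics such as company size, retail investment, institutional attention, and readability of financial filings affect the accuracy of the knowledge LLMs represent. The key to the approach is a large-scale empirical study that evaluates more than 197,000 questions and compares model responses with factual data, revealing how LLM knowledge varies across company characteristics and time, and identifying a greater tendency to hallucinate for larger companies and more recent years.
Link: https://arxiv.org/abs/2504.00042
Authors: Agam Shah, Liqin Ye, Sebastian Jaskowski, Wei Xu, Sudheer Chava
Institutions: Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: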
Abstract:Large Language Models (LLMs) are frequently utilized as sources of knowledge for question-answering. While it is known that LLMs may lack access to real-time data or newer data produced after the model’s cutoff date, it is less clear how their knowledge spans across historical information. In this study, we assess the breadth of LLMs’ knowledge using financial data of U.S. publicly traded companies by evaluating more than 197k questions and comparing model responses to factual data. We further explore the impact of company characteristics, such as size, retail investment, institutional attention, and readability of financial filings, on the accuracy of knowledge represented in LLMs. Our results reveal that LLMs are less informed about past financial performance, but they display a stronger awareness of larger companies and more recent information. Interestingly, at the same time, our analysis also reveals that LLMs are more likely to hallucinate for larger companies, especially for data from more recent years. We will make the code, prompts, and model outputs public upon the publication of the work.
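[NLP-81] Quantum Methods for Managing Ambiguity in Natural Language Processing
Quick read: This paper addresses syntactic ambiguity in natural language processing by representing sentence meanings as density matrices and using probability distributions over processes to model uncertainty about syntactic structure. The key to the solution is constructing probability distributions over quantum circuits that represent sentence meanings; the approach generalizes existing tasks from the literature, and an experiment validates the proposed theoretical framework.
Link: https://arxiv.org/abs/2504.00040
Authors: Jurek Eisinger, Ward Gauderis, Lin de Huybrecht, Geraint A. Wiggins
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: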
Abstract:The Categorical Compositional Distributional (DisCoCat) framework models meaning in natural language using the mathematical framework of quantum theory, expressed as formal diagrams. DisCoCat diagrams can be associated with tensor networks and quantum circuits. DisCoCat diagrams have been connected to density matrices in various contexts in Quantum Natural Language Processing (QNLP). Previous use of density matrices in QNLP entails modelling ambiguous words as probability distributions over more basic words (the word "queen", e.g., might mean the reigning queen or the chess piece). In this article, we investigate using probability distributions over processes to account for syntactic ambiguity in sentences. The meanings of these sentences are represented by density matrices. We show how to create probability distributions on quantum circuits that represent the meanings of sentences and explain how this approach generalises tasks from the literature. We conduct an experiment to validate the proposed theory.
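As a reminder of the standard construction behind "density matrices" here (the notation is ours, not the paper's): assume a sentence admits several syntactic readings, that reading $i$ occurs with probability $p_i$, and that its circuit prepares a pure state $|\psi_i\rangle$. The ambiguous sentence can then be described by the mixed state

```latex
\rho \;=\; \sum_{i} p_i \, |\psi_i\rangle\langle\psi_i| ,
\qquad p_i \ge 0, \quad \sum_{i} p_i = 1, \quad \operatorname{Tr}(\rho) = 1 .
```

With a single reading ($p_1 = 1$) this reduces to the pure state $|\psi_1\rangle\langle\psi_1|$, so the unambiguous case is recovered as a special case.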
[NLP-82] Leaking LoRa: An Evaluation of Password Leaks and Knowledge Storage in Large Language Models
Quick read: This paper addresses the risk that fine-tuning large language models (LLMs) for application-specific settings on data containing sensitive information, such as passwords, can leak that information. The authors first fine-tune a model with Low-Rank Adaptation (LoRA) and use causal tracing to find that password information is concentrated in a few layers of the model. They then apply Rank One Model Editing (ROME) to remove the password information, reducing the number of recovered passwords from 37 to 0.
Link: https://arxiv.org/abs/2504.00031
Authors: Ryan Marinelli, Magnus Eckhoff
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:To effectively deploy Large Language Models (LLMs) in application-specific settings, fine-tuning techniques are applied to enhance performance on specialized tasks. This process often involves fine-tuning on user data, which may contain sensitive information. Although not recommended, it is not uncommon for users to send passwords in messages, and fine-tuning models on this could result in passwords being leaked. In this study, a Large Language Model is fine-tuned with customer support data and passwords from the RockYou password wordlist using Low-Rank Adaptation (LoRA). Out of the first 200 passwords from the list, 37 were successfully recovered. Further, causal tracing is used to identify that password information is largely located in a few layers. Lastly, Rank One Model Editing (ROME) is used to remove the password information from the model, resulting in the number of passwords recovered going from 37 to 0.
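For readers unfamiliar with the two techniques named in the abstract, their generic forms are sketched below. LoRA keeps a pretrained weight matrix $W_0$ frozen and learns a low-rank update, while a rank-one edit adds a single outer product to a weight matrix. The symbols are assumptions chosen for this note: $\alpha$ is LoRA's scaling hyperparameter, and $u$, $v$ are placeholders; how ROME selects them for a given fact is not shown here.

```latex
% LoRA: low-rank update of a frozen weight matrix W_0
W \;=\; W_0 + \frac{\alpha}{r}\, B A ,
\qquad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k)

% Rank-one edit: a single outer product added to one layer's weights
W' \;=\; W + u\, v^{\top} ,
\qquad u \in \mathbb{R}^{d}, \quad v \in \mathbb{R}^{k}
```

Because a rank-one edit touches one matrix at a time, it is natural to apply it to the few layers that causal tracing identifies as holding the information.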
[NLP-83] Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
Quick read: This paper addresses how to choose the speculation length in speculative decoding for large language model (LLM) inference so as to maximize speedup while minimizing wasted computation. The key to the solution is two training-free adaptive algorithms, GammaTune and GammaTune+, which dynamically adjust the speculation length based on token acceptance rates through a heuristic-based switching mechanism. In the evaluation, the two methods achieve average speedups of 15% (±5%) and 16% (±3%) respectively, outperform other heuristic-based approaches and fixed-length speculative decoding, and reduce performance variance, which the authors present as making them robust and efficient for real-world deployment.
Link: https://arxiv.org/abs/2504.00030
Authors: Aayush Gautam, Susav Shrestha, Narasimha Annapareddy
Institutions: Department of Electrical and Computer Engineering, Texas A&M University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures, 1 table
Abstract:Speculative decoding accelerates large language model (LLM) inference by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, selecting an optimal speculation length is critical for maximizing speedup while minimizing wasted computation. We introduce GammaTune and GammaTune+, training-free adaptive algorithms that dynamically adjust speculation length based on token acceptance rates using a heuristic-based switching mechanism. Evaluated on SpecBench across multiple tasks and model pairs, our method outperforms other heuristic-based approaches and fixed-length speculative decoding, achieving an average speedup of 15% (±5%) with GammaTune and 16% (±3%) with GammaTune+, while reducing performance variance. This makes GammaTune a robust and efficient solution for real-world deployment.
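The abstract does not give the switching rule itself. The sketch below shows one plausible acceptance-rate heuristic of the same general kind: it keeps an exponential moving average of the fraction of drafted tokens that the target model accepted, lengthens the speculation window when that average is high, and shortens it when the average is low. The class name, thresholds, and smoothing factor are all assumptions for illustration; this is not the GammaTune algorithm.

```python
class AdaptiveSpeculationLength:
    """Hypothetical controller for the number of draft tokens per step."""

    def __init__(self, gamma=4, gamma_min=1, gamma_max=16,
                 smoothing=0.5, raise_above=0.8, lower_below=0.4):
        self.gamma = gamma                # current speculation length
        self.gamma_min = gamma_min
        self.gamma_max = gamma_max
        self.smoothing = smoothing        # weight of the newest observation
        self.raise_above = raise_above    # grow gamma at or above this rate
        self.lower_below = lower_below    # shrink gamma at or below this rate
        self.acceptance_rate = None       # moving average, unset until first step

    def update(self, num_drafted, num_accepted):
        """Record one verification step and return the next gamma."""
        step_rate = num_accepted / num_drafted
        if self.acceptance_rate is None:
            self.acceptance_rate = step_rate
        else:
            self.acceptance_rate = (self.smoothing * step_rate
                                    + (1 - self.smoothing) * self.acceptance_rate)
        if self.acceptance_rate >= self.raise_above:
            self.gamma = min(self.gamma + 1, self.gamma_max)
        elif self.acceptance_rate <= self.lower_below:
            self.gamma = max(self.gamma - 1, self.gamma_min)
        return self.gamma
```

In a decoding loop, `update` would be called after each verification pass with the number of tokens the draft model proposed and the number the target model accepted; the returned value is the draft length for the next pass.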
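[NLP-84] Opioid Named Entity Recognition (ONER-2025) from Reddit
Quick read: This paper responds to the opioid overdose crisis, a serious public health challenge, by analyzing unstructured data from social platforms such as Reddit to extract information about public perceptions, discussions, and experiences related to opioid use. Its key element is a natural language processing (NLP) approach to Opioid Named Entity Recognition (ONER-2025) that combines machine learning, deep learning, and transformer-based language models with contextual embeddings, together with a proposed system for monitoring opioid-related discussions in real time and identifying overdose events. In experiments with 5-fold cross-validation, the system reaches 97% accuracy and F1-score, outperforming the baselines. The work also analyzes linguistic challenges in opioid discussions, including slang, ambiguity, fragmented sentences, and emotionally charged language, and contributes a manually annotated dataset of 331,285 tokens covering eight major opioid entity categories.
Link: https://arxiv.org/abs/2504.00027
Authors: Muhammad Ahmad, Humaira Farid, Iqra Ameer, Muhammad Muzamil, Ameer Hamza Muhammad Jalal, Ildar Batyrshin, Grigori Sidorov
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: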
Abstract:The opioid overdose epidemic remains a critical public health crisis, particularly in the United States, leading to significant mortality and societal costs. Social media platforms like Reddit provide vast amounts of unstructured data that offer insights into public perceptions, discussions, and experiences related to opioid use. This study leverages Natural Language Processing (NLP), specifically Opioid Named Entity Recognition (ONER-2025), to extract actionable information from these platforms. Our research makes four key contributions. First, we created a unique, manually annotated dataset sourced from Reddit, where users share self-reported experiences of opioid use via different administration routes. This dataset contains 331,285 tokens and includes eight major opioid entity categories. Second, we detail our annotation process and guidelines while discussing the challenges of labeling the ONER-2025 dataset. Third, we analyze key linguistic challenges, including slang, ambiguity, fragmented sentences, and emotionally charged language, in opioid discussions. Fourth, we propose a real-time monitoring system to process streaming data from social media, healthcare records, and emergency services to identify overdose events. Using 5-fold cross-validation in 11 experiments, our system integrates machine learning, deep learning, and transformer-based language models with advanced contextual embeddings to enhance understanding. Our transformer-based models (bert-base-NER and roberta-base) achieved 97% accuracy and F1-score, outperforming baselines by 10.23% (RF=0.88).
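For readers unfamiliar with the 5-fold cross-validation protocol mentioned in the abstract, the following minimal sketch shows how sample indices are partitioned so that each sample is used for testing exactly once. It is a generic illustration with an assumed function name; it omits shuffling and stratification, and it says nothing about how the authors actually split their data.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k (train, test) index pairs.

    Fold sizes differ by at most one; the first n_samples % k folds
    receive the extra sample.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds
```

A model is trained on each `train` split and scored on the matching `test` split, and the reported figure is typically the average over the k folds.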
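[NLP-85] Generalization Bias in Large Language Model Summarization of Scientific Research
Quick read: This paper addresses the tendency of large language models (LLMs) to overgeneralize research conclusions when summarizing scientific texts. Although LLMs can quickly present complex scientific information in accessible terms, they may omit details that limit the scope of a conclusion, so that the summary generalizes beyond what the original study supports. The mitigation strategies the paper highlights are lowering the LLM temperature setting and benchmarking LLMs for generalization accuracy.
Link: https://arxiv.org/abs/2504.00025
Authors: Uwe Peters, Benjamin Chin-Yee
Institutions: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: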
Abstract:Artificial intelligence chatbots driven by large language models (LLMs) have the potential to increase public science literacy and support scientific research, as they can quickly summarize complex scientific information in accessible terms. However, when summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study. We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet, comparing 4900 LLM-generated summaries to their original scientific texts. Even when explicitly prompted for accuracy, most LLMs produced broader generalizations of scientific results than those in the original texts, with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26 to 73% of cases. In a direct comparison of LLM-generated and human-authored science summaries, LLM summaries were nearly five times more likely to contain broad generalizations (OR = 4.85, 95% CI [3.06, 7.70]). Notably, newer models tended to perform worse in generalization accuracy than earlier ones. Our results indicate a strong bias in many widely used LLMs towards overgeneralizing scientific conclusions, posing a significant risk of large-scale misinterpretations of research findings. We highlight potential mitigation strategies, including lowering LLM temperature settings and benchmarking LLMs for generalization accuracy.
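The abstract reports an odds ratio with a 95% confidence interval. As background, the sketch below shows the textbook way to compute an odds ratio and a Wald interval from a 2x2 table of counts. The function name and the counts used in the usage note are invented for illustration; they are not the study's data, and the authors may have estimated their interval differently (for example, from a regression model).

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a, b: group 1 with / without the outcome
    c, d: group 2 with / without the outcome
    z: normal quantile (1.96 for a 95% interval)
    All four counts must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log)
    upper = math.exp(log_or + z * se_log)
    return odds_ratio, lower, upper
```

With made-up counts of 20 and 10 (group 1) against 10 and 20 (group 2), the function returns an odds ratio of 4 with an interval of roughly 1.37 to 11.70, which shows how wide such intervals can be for small samples.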
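[NLP-86] FUSE : A Ridge and Random Forest-Based Metric for Evaluating MT in Indigenous Languages CCL2025
Quick read: This paper addresses the challenges facing automatic evaluation metrics for machine translation (MT) into Indigenous languages of the Americas, which are morphologically rich and low-resource. Conventional metrics such as BLEU, TER, and ChrF often fail to capture deeper aspects of translation quality such as semantic adequacy and fluency. The key to the solution is the Feature-Union Scorer (FUSE), which models translation quality by combining Ridge regression with Gradient Boosting and incorporates multilingual sentence embeddings and phonological encodings to align better with human evaluation. FUSE also highlights the value of combining lexical, phonetic, semantic, and fuzzy token similarity with learning-based modeling for evaluating translation into morphologically complex languages.
Link: https://arxiv.org/abs/2504.00021
Authors: Rahul Raja, Arpita Vats
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: NACCL 2025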
Abstract:This paper presents the winning submission of the RaaVa team to the AmericasNLP 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine Translation (MT) into Indigenous Languages of America, where our system ranked first overall based on average Pearson correlation with the human annotations. We introduce the Feature-Union Scorer (FUSE) for evaluation; FUSE integrates Ridge regression and Gradient Boosting to model translation quality. In addition to FUSE, we explore five alternative approaches leveraging different combinations of linguistic similarity features and learning paradigms. FUSE Score highlights the effectiveness of combining lexical, phonetic, semantic, and fuzzy token similarity with learning-based modeling to improve MT evaluation for morphologically rich and low-resource languages. MT into Indigenous languages poses unique challenges due to polysynthesis, complex morphology, and non-standardized orthography. Conventional automatic metrics such as BLEU, TER, and ChrF often fail to capture deeper aspects like semantic adequacy and fluency. Our proposed framework, formerly referred to as FUSE, incorporates multilingual sentence embeddings and phonological encodings to better align with human evaluation. We train supervised models on human-annotated development sets and evaluate on held-out test data. Results show that FUSE consistently achieves higher Pearson and Spearman correlations with human judgments, offering a robust and linguistically informed solution for MT evaluation in low-resource settings.
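To illustrate the general shape of a feature-based metric and how it is judged, the sketch below computes two simple similarity features for a hypothesis and reference translation (token-level Jaccard overlap and a character-level fuzzy ratio from the standard library) and a Pearson correlation between metric scores and human scores. These two features and the function names are stand-ins chosen for this example; FUSE's actual feature set (embeddings, phonological encodings) and its Ridge and Gradient Boosting models are not reproduced here.

```python
import math
from difflib import SequenceMatcher


def similarity_features(hypothesis, reference):
    """Return [token Jaccard overlap, character-level fuzzy ratio]."""
    hyp_tokens = set(hypothesis.lower().split())
    ref_tokens = set(reference.lower().split())
    union = hyp_tokens | ref_tokens
    jaccard = len(hyp_tokens & ref_tokens) / len(union) if union else 1.0
    fuzzy = SequenceMatcher(None, hypothesis.lower(), reference.lower()).ratio()
    return [jaccard, fuzzy]


def pearson(xs, ys):
    """Pearson correlation between two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

In a learned metric, feature vectors like these would be the inputs to a regressor trained on human ratings, and `pearson` would be applied to the regressor's predictions and the held-out human scores.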
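[NLP-87] ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding
Quick read: This paper addresses the low data efficiency of code language models (Code-LMs) and their difficulty in disentangling syntax from semantics. Research on Code-LM pre-training objectives aimed at these two goals has been sparse compared with corresponding work on natural language models. The key innovation is to ground pre-training on obfuscated code: the authors compile ObscuraX, a dataset of approximately 55 million pairs of source code and corresponding obfuscated code, to help Code-LMs look beyond surface-form syntax and improve sample efficiency. Experiments show that obfuscation-based pre-training yields consistent improvements over both vanilla autoregressive pre-training and existing de-obfuscation (DOBF) objectives across a range of code understanding and generation tasks.
Link: https://arxiv.org/abs/2504.00019
Authors: Indraneil Paul, Haoyi Yang, Goran Glavaš, Kristian Kersting, Iryna Gurevych
Institutions: UKP Lab, TU Darmstadt; AIML Lab, TU Darmstadt; CAIDAS, JMU Würzburg
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: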
Abstract:Language models (LMs) have become a staple of the code-writing toolbox. Their pre-training recipe has, however, remained stagnant over recent years, barring the occasional changes in data sourcing and filtering strategies. In particular, research exploring modifications to Code-LMs’ pre-training objectives, geared towards improving data efficiency and better disentangling between syntax and semantics, has been noticeably sparse, especially compared with corresponding efforts in natural language LMs. In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency. To this end, we compile ObscuraX, a dataset of approximately 55M source and obfuscated code pairs in seven languages. Subsequently, we pre-train ObscuraCoder models, ranging in size from 255M to 2.8B parameters, on a 272B-token corpus that includes ObscuraX and demonstrate that our obfuscation-based pre-training recipe leads to consistent improvements in Code-LMs’ abilities compared to both vanilla autoregressive pre-training as well as existing de-obfuscation (DOBF) objectives. ObscuraCoder demonstrates sizeable gains across multiple tests of syntactic and semantic code understanding, along with improved capabilities in multilingual code completion, multilingual code commit summarization, and multi-purpose library-oriented code generation.
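To give a feel for what a source/obfuscated pair might look like, the sketch below performs one simple kind of obfuscation on Python source: it renames locally defined identifiers (function names, parameters, and assigned variables) to opaque placeholders while leaving built-ins such as `len` or `print` untouched. This is an assumption-laden toy using Python's `ast` module (Python 3.9+ for `ast.unparse`); the function name and placeholder scheme are ours, the transformations used to build ObscuraX are not described in the abstract, and the toy ignores keyword arguments at call sites, attributes, imports, and classes.

```python
import ast


def obfuscate_identifiers(source):
    """Rename identifiers defined in `source` to ID_0, ID_1, ...

    Returns (obfuscated_source, mapping from original to new names).
    """
    tree = ast.parse(source)

    # Pass 1: collect names the snippet itself defines.
    mapping = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            name = node.name
        elif isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        mapping.setdefault(name, f"ID_{len(mapping)}")

    # Pass 2: rewrite every definition and use of those names.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.name = mapping[node.name]
        elif isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    return ast.unparse(tree), mapping
```

Because only names change, the obfuscated program behaves the same as the original; a model trained on such pairs cannot lean on meaningful identifier names and has to attend to structure and data flow.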
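[NLP-88] Medical Reasoning in LLMs: An In-Depth Analysis of DeepSeek R1
Quick read: This paper evaluates how well the medical reasoning of large language models (LLMs) such as DeepSeek R1 aligns with clinical expert patterns, and considers their potential for real-world medical decision-making. On MedQA clinical cases, DeepSeek R1 reached 93% diagnostic accuracy and showed systematic clinical judgment, but error analysis revealed persistent limitations: anchoring bias, difficulty reconciling conflicting data, insufficient exploration of alternatives, overthinking, knowledge gaps, and premature prioritization of definitive treatment. A key finding is that reasoning length correlates with accuracy: shorter responses (e.g., under 5,000 characters) were more reliable, suggesting that longer explanations may signal uncertainty or rationalization of errors. The paper therefore calls for refinement through bias mitigation, knowledge updates, and structured reasoning frameworks, and for domain-specific validation, interpretability safeguards, and confidence metrics (such as response length thresholds) to ensure reliability in real applications.
Link: https://arxiv.org/abs/2504.00016
Authors: Birger Moell, Fredrik Sand Aronsson, Sanian Akbar
Institutions: KTH Royal Institute of Technology; Karolinska Institute; Stockholm Health Care Services, Region of Stockholm
Categories: Computation and Language (cs.CL)
Comments: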
Abstract:Integrating large language models (LLMs) like DeepSeek R1 into healthcare requires rigorous evaluation of their reasoning alignment with clinical expertise. This study assesses DeepSeek R1’s medical reasoning against expert patterns using 100 MedQA clinical cases. The model achieved 93% diagnostic accuracy, demonstrating systematic clinical judgment through differential diagnosis, guideline-based treatment selection, and integration of patient-specific factors. However, error analysis of seven incorrect cases revealed persistent limitations: anchoring bias, challenges reconciling conflicting data, insufficient exploration of alternatives, overthinking, knowledge gaps, and premature prioritization of definitive treatment over intermediate care. Crucially, reasoning length correlated with accuracy - shorter responses (<5,000 characters) were more reliable, suggesting extended explanations may signal uncertainty or rationalization of errors. While DeepSeek R1 exhibits foundational clinical reasoning capabilities, recurring flaws highlight critical areas for refinement, including bias mitigation, knowledge updates, and structured reasoning frameworks. These findings underscore LLMs’ potential to augment medical decision-making through artificial reasoning but emphasize the need for domain-specific validation, interpretability safeguards, and confidence metrics (e.g., response length thresholds) to ensure reliability in real-world applications.
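[NLP-89] RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy
Quick read: This paper addresses the difficulty of efficiently combining reasoning and imagination in an end-to-end generalist policy for complex open-world environments. Prior methods either incorporate only one of the two abilities or stitch together multiple specialized models, which limits the policy's learning efficiency and generalization. The paper proposes RIG, described as the first end-to-end generalist policy that synergizes Reasoning and Imagination. Its key element is a progressive data pipeline that integrates and enriches the imagination and reasoning content in trajectories collected from existing agents, together with joint learning of reasoning and next-image generation, which explicitly models the correlation between reasoning, action, and environment dynamics. This design yields more than 17× sample efficiency improvement and better generalization; at inference, RIG reasons about the next action, predicts its outcome, and can review and self-correct before acting, which improves robustness, generalization, and interoperability and enables test-time scaling.
Link: https://arxiv.org/abs/2503.24388
Authors: Zhonghan Zhao, Wenwei Zhang, Haian Huang, Kuikun Liu, Jianfei Gao, Gaoang Wang, Kai Chen
Institutions: Zhejiang University; Shanghai AI Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: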
Abstract:Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet, prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy. Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG in an end-to-end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next image generation explicitly models the inherent correlation between reasoning, action, and dynamics of environments, and thus exhibits more than 17× sample efficiency improvements and generalization in comparison with previous works. During inference, RIG first reasons about the next action, produces potential action, and then predicts the action outcomes, which offers the agent a chance to review and self-correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of generalist policy but also enables test-time scaling to enhance overall performance.
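Computer Vision
[CV-0] Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation WWW
Quick read: This paper addresses the automatic generation of Audio Descriptions (ADs) for edited video material such as movies and TV series. It proposes a two-stage framework that treats the "shot" as the fundamental unit of video understanding, extends temporal context to neighbouring shots, and uses film grammar devices such as shot scales and thread structures to guide AD generation. The key point is that the method is compatible with both open-source and proprietary Visual-Language Models (VLMs) and integrates expert knowledge through add-on modules without additional training of the VLMs. The paper also introduces a new evaluation measure, the action score, targeted at assessing this aspect of ADs, and a new evaluation protocol that treats automatic frameworks as AD generation assistants and asks them to generate multiple candidate ADs for selection.
Link: https://arxiv.org/abs/2504.01020
Authors: Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Eshika Khandelwal, Gül Varol, Weidi Xie, Andrew Zisserman
Institutions: Visual Geometry Group, University of Oxford; CVIT, IIIT Hyderabad; LIGM, École des Ponts ParisTech; CMIC, Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL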
Abstract:Our objective is the automatic generation of Audio Descriptions (ADs) for edited video material, such as movies and TV series. To achieve this, we propose a two-stage framework that leverages “shots” as the fundamental units of video understanding. This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures, to guide AD generation. Our method is compatible with both open-source and proprietary Visual-Language Models (VLMs), integrating expert knowledge from add-on modules without requiring additional training of the VLMs. We achieve state-of-the-art performance among all prior training-free approaches and even surpass fine-tuned methods on several benchmarks. To evaluate the quality of predicted ADs, we introduce a new evaluation measure – an action score – specifically targeted to assessing this important aspect of AD. Additionally, we propose a novel evaluation protocol that treats automatic frameworks as AD generation assistants and asks them to generate multiple candidate ADs for selection.
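[CV-1] MixerMDM: Learnable Composition of Human Motion Diffusion Models MDM CVPR2025
Quick read: This paper addresses the challenge of generating human motion from conditions such as textual descriptions, especially when finer control is required. Prior work combines several pre-trained motion diffusion models to allow control with multiple conditions, but overlooks that the best way to combine the generation processes may depend on the particularities of each pre-trained model and on the specific textual descriptions. To address this, the paper introduces MixerMDM, the first learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models. Its key element is a dynamic mixing strategy, trained adversarially, that learns how to combine the denoising process of each model depending on the set of conditions driving the generation. The paper also proposes a new evaluation technique that measures interaction and individual quality by computing the alignment between the mixed generated motions and their conditions, as well as MixerMDM's ability to adapt the mixing throughout the denoising process depending on the motions to mix.
Link: https://arxiv.org/abs/2504.01019
Authors: Pablo Ruiz-Ponce, German Barquero, Cristina Palmero, Sergio Escalera, José García-Rodríguez
Institutions: Universidad de Alicante, Spain; Universitat de Barcelona and Computer Vision Center, Spain; King’s College London, UK
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 Accepted - Project Page: this https URL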
Abstract:Generating human motion guided by conditions such as textual descriptions is challenging due to the need for datasets with pairs of high-quality motion and their corresponding conditions. The difficulty increases when aiming for finer control in the generation. To that end, prior works have proposed to combine several motion diffusion models pre-trained on datasets with different types of conditions, thus allowing control with multiple conditions. However, the proposed merging strategies overlook that the optimal way to combine the generation processes might depend on the particularities of each pre-trained generative model and also the specific textual descriptions. In this context, we introduce MixerMDM, the first learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models. Unlike previous approaches, MixerMDM provides a dynamic mixing strategy that is trained in an adversarial fashion to learn to combine the denoising process of each model depending on the set of conditions driving the generation. By using MixerMDM to combine single- and multi-person motion diffusion models, we achieve fine-grained control on the dynamics of every person individually, and also on the overall interaction. Furthermore, we propose a new evaluation technique that, for the first time in this task, measures the interaction and individual quality by computing the alignment between the mixed generated motions and their conditions as well as the capabilities of MixerMDM to adapt the mixing throughout the denoising process depending on the motions to mix.
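[CV-2] Scaling Language-Free Visual Representation Learning
Quick read: This paper investigates why visual self-supervised learning (SSL) underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). The core question is whether the gap stems from the lack of language supervision or from differences in training data. To answer it, the authors train both visual SSL and CLIP models on the same MetaCLIP data and use VQA as a diverse testbed for vision encoders. In this controlled setup, visual SSL models scale better than CLIP models in terms of data and model capacity, and their performance does not saturate even at 7B parameters. The conclusion is that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.
Link: https://arxiv.org/abs/2504.01017
Authors: David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, Saining Xie
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL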
Abstract:Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask the question: “Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?” We study this question by training both visual SSL and CLIP models on the same MetaCLIP data, and leveraging VQA as a diverse testbed for vision encoders. In this controlled setup, visual SSL models scale better than CLIP models in terms of data and model capacity, and visual SSL performance does not saturate even after scaling up to 7B parameters. Consequently, we observe visual SSL methods achieve CLIP-level performance on a wide range of VQA and classic vision benchmarks. These findings demonstrate that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.
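[CV-3] GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
Quick read: This paper addresses the limitations of existing video depth estimation methods in achieving geometric fidelity: their affine-invariant predictions are hard to use for accurate reconstruction and other metrically grounded downstream tasks. The key innovation is GeometryCrafter, a framework built around a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. On top of this VAE, the authors train a video diffusion model to model the distribution of point map sequences conditioned on the input video. With this, GeometryCrafter recovers temporally coherent, high-fidelity point map sequences from open-world videos, supporting 3D reconstruction, camera parameter estimation, and other depth-based applications, with reported improvements in accuracy, temporal consistency, and generalization.
Link: https://arxiv.org/abs/2504.01016
Authors: Tian-Xing Xu, Xiangjun Gao, Wenbo Hu, Xiaoyu Li, Song-Hai Zhang, Ying Shan
Institutions: Tsinghua University; ARC Lab, Tencent PCG; HKUST
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage: this https URL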
Abstract:Despite remarkable advancements in video depth estimation, existing methods exhibit inherent limitations in achieving geometric fidelity through the affine-invariant predictions, limiting their applicability in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging the VAE, we train a video diffusion model to model the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
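[CV-4] AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
Quick read: This paper addresses two problems in infinite anime life simulation games: inconsistent gameplay caused by ignoring historical visual context, and the lack of dynamics when only static images are generated. The key to the solution is AnimeGamer, a system built on Multimodal Large Language Models (MLLMs) that generates each game state, including dynamic animation shots depicting character movements and updates to character states. By taking historical animation shot representations as context and predicting subsequent representations, AnimeGamer generates games with contextual consistency and satisfactory dynamics.
Link: https://arxiv.org/abs/2504.01014
Authors: Junhao Cheng, Yuying Ge, Yixiao Ge, Jing Liao, Ying Shan
Institutions: ARC Lab, Tencent PCG; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project released at: this https URL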
Abstract:Recent advancements in image and video synthesis have opened up new promise in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters for life simulation through language instructions. Such games are defined as infinite game since they eliminate predetermined boundaries and fixed gameplay rules, where players can interact with the game world through open-ended language and experience ever-evolving storylines and environments. Recently, a pioneering approach for infinite anime life simulation employs large language models (LLMs) to translate multi-turn text dialogues into language instructions for image generation. However, it neglects historical visual context, leading to inconsistent gameplay. Furthermore, it only generates static images, failing to incorporate the dynamics necessary for an engaging gaming experience. In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. We introduce novel action-aware multimodal representations to represent animation shots, which can be decoded into high-quality video clips using a video diffusion model. By taking historical animation shot representations as context and predicting subsequent representations, AnimeGamer can generate games with contextual consistency and satisfactory dynamics. Extensive evaluations using both automated metrics and human evaluations demonstrate that AnimeGamer outperforms existing methods in various aspects of the gaming experience. Codes and checkpoints are available at this https URL.
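[CV-5] A YOLO-Based Semi-Automated Labeling Approach to Improve Fault Detection Efficiency in Railroad Videos
[Quick read]: This paper addresses the time-consuming, error-prone, and costly manual labeling of large-scale image and video datasets, which limits the efficiency of machine learning workflows for fault detection in railroad videos. The key solution is a semi-automated labeling method that uses a pre-trained You Only Look Once (YOLO) model to streamline labeling and improve fault detection accuracy in railroad videos. Its core idea is to train the YOLO model iteratively, starting from a small set of manually labeled data and using each iteration's output to improve the model and progressively reduce the need for human intervention. In addition, the authors developed a system that exports YOLO's detections as an editable text file so that predictions can be corrected quickly, which reduces labeling time and cost while lowering the error rate. The method offers a cost-effective alternative for fault detection on large datasets and for other detection-based machine learning applications.
Link: https://arxiv.org/abs/2504.01010
Authors: Dylan Lester,James Gao,Samuel Sutphin,Pingping Zhu,Husnu Narman,Ammar Alzarrad
Institutions: Department of Computer Sciences and Electrical Engineering (CSEE), Marshall University; Perception Intelligence Networks Group (PiNG), Marshall University; Department of Civil Engineering (CE), Marshall University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Published on American Society of Engineering Education (ASEE) North Central Section Conference, 2025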
Abstract:Manual labeling for large-scale image and video datasets is often time-intensive, error-prone, and costly, posing a significant barrier to efficient machine learning workflows in fault detection from railroad videos. This study introduces a semi-automated labeling method that utilizes a pre-trained You Only Look Once (YOLO) model to streamline the labeling process and enhance fault detection accuracy in railroad videos. By initiating the process with a small set of manually labeled data, our approach iteratively trains the YOLO model, using each cycle’s output to improve model accuracy and progressively reduce the need for human intervention. To facilitate easy correction of model predictions, we developed a system to export YOLO’s detection data as an editable text file, enabling rapid adjustments when detections require refinement. This approach decreases labeling time from an average of 2 to 4 minutes per image to 30 seconds to 2 minutes, effectively minimizing labor costs and labeling errors. Unlike costly AI based labeling solutions on paid platforms, our method provides a cost-effective alternative for researchers and practitioners handling large datasets in fault detection and other detection based machine learning applications.
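The abstract mentions exporting detections to an editable text file but does not specify the file layout. As a minimal sketch, assume the widely used YOLO label convention of one line per box (`class cx cy w h`, all normalized to the image size). The function names below are hypothetical and are not taken from the paper.

```python
def to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to one YOLO-style label line."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2 / img_w   # normalized box center
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w        # normalized box size
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def parse_yolo_line(line, img_w, img_h):
    """Inverse of to_yolo_line: read a (possibly hand-edited) line back into pixels."""
    class_id, cx, cy, w, h = line.split()
    cx, w = float(cx) * img_w, float(w) * img_w
    cy, h = float(cy) * img_h, float(h) * img_h
    return int(class_id), (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def export_labels(detections, img_w, img_h, path):
    """Write one line per detection; detections is a list of (class_id, box)."""
    with open(path, "w", encoding="utf-8") as f:
        for class_id, box in detections:
            f.write(to_yolo_line(class_id, box, img_w, img_h) + "\n")
```

An annotator can then open the exported file in any text editor, fix a class id or nudge a coordinate, and feed the corrected file into the next training round.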
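[CV-6] GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology
[Quick read]: This paper targets a limitation of existing Whole Slide Image (WSI) pretraining methods: multimodal learning depends on additional clinical modalities, which raises costs and limits scalability. The key to the solution is a framework called Gigapixel Vision-Concept Knowledge Contrastive pretraining (GECKO). It derives an inherently interpretable concept prior by applying pathology concept knowledge to the WSIs that are already available, and uses a dual-branch multiple instance learning network to align image embeddings with concept embeddings under a contrastive objective, yielding unsupervised WSI representation learning. GECKO can also integrate auxiliary modalities such as transcriptomics data when they are available, further improving performance and practicality.
Link: https://arxiv.org/abs/2504.01009
Authors: Saarthak Kapse,Pushpak Pati,Srikar Yellapragada,Srijan Das,Rajarsi R. Gupta,Joel Saltz,Dimitris Samaras,Prateek Prasanna
Institutions: Stony Brook University; UNC Charlotte; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: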
Abstract:Pretraining a Multiple Instance Learning (MIL) aggregator enables the derivation of Whole Slide Image (WSI)-level embeddings from patch-level representations without supervision. While recent multimodal MIL pretraining approaches leveraging auxiliary modalities have demonstrated performance gains over unimodal WSI pretraining, the acquisition of these additional modalities necessitates extensive clinical profiling. This requirement increases costs and limits scalability in existing WSI datasets lacking such paired modalities. To address this, we propose Gigapixel Vision-Concept Knowledge Contrastive pretraining (GECKO), which aligns WSIs with a Concept Prior derived from the available WSIs. First, we derive an inherently interpretable concept prior by computing the similarity between each WSI patch and textual descriptions of predefined pathology concepts. GECKO then employs a dual-branch MIL network: one branch aggregates patch embeddings into a WSI-level deep embedding, while the other aggregates the concept prior into a corresponding WSI-level concept embedding. Both aggregated embeddings are aligned using a contrastive objective, thereby pretraining the entire dual-branch MIL model. Moreover, when auxiliary modalities such as transcriptomics data are available, GECKO seamlessly integrates them. Across five diverse tasks, GECKO consistently outperforms prior unimodal and multimodal pretraining approaches while also delivering clinically meaningful interpretability that bridges the gap between computational models and pathology expertise. Code is made available at this https URL
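The two ingredients described in the abstract, a patch-to-concept similarity matrix and a contrastive alignment of two slide-level embeddings, can be sketched with NumPy. This is only an illustration under stated assumptions: cosine similarity is used for the concept prior, a symmetric InfoNCE loss stands in for the contrastive objective, and both branches are assumed to already output embeddings of the same dimension. The MIL aggregators themselves are not shown, and all function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def concept_prior(patch_emb, concept_emb):
    """Cosine similarity of every patch to every concept: (N, d), (C, d) -> (N, C)."""
    return l2_normalize(patch_emb) @ l2_normalize(concept_emb).T


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired slide embeddings a[i] <-> b[i]."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)                 # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))                   # matching pairs sit on the diagonal

    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))
```

The loss is small when each slide's deep embedding is closest to its own concept embedding and large when the pairing is scrambled, which is the behaviour a contrastive pretraining objective relies on.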
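[CV-7] IntrinsiX: High-Quality PBR Generation using Image Priors
[Quick read]: This paper addresses the generation of high-quality intrinsic images from text descriptions. Outputs of existing text-to-image models usually contain baked-in scene lighting, which limits their use in tasks such as re-lighting, editing, and texture generation. The paper proposes IntrinsiX, which predicts physically-based rendering (PBR) maps to generate intrinsic images free of fixed lighting. The key is to exploit strong image priors and pre-train a separate model for each PBR material component (albedo, roughness, metallic, normals), then align these models with a new cross-intrinsic attention formulation so that information is exchanged between output modalities and the PBR predictions are semantically coherent. A rendering loss additionally provides image-space signals to constrain the model, encouraging sharp details in the output BRDF properties. The resulting intrinsic generation is detailed, generalizes well, and significantly outperforms existing intrinsic image decomposition methods.
Link: https://arxiv.org/abs/2504.01008
Authors: Peter Kocsis(1),Lukas Höllein(1),Matthias Nießner(1) ((1) Technical University of Munich)
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL Video: this https URL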
Abstract:We introduce IntrinsiX, a novel method that generates high-quality intrinsic images from text description. In contrast to existing text-to-image models whose outputs contain baked-in scene lighting, our approach predicts physically-based rendering (PBR) maps. This enables the generated outputs to be used for content creation scenarios in core graphics applications that facilitate re-lighting, editing, and texture generation tasks. In order to train our generator, we exploit strong image priors, and pre-train separate models for each PBR material component (albedo, roughness, metallic, normals). We then align these models with a new cross-intrinsic attention formulation that concatenates key and value features in a consistent fashion. This allows us to exchange information between each output modality and to obtain semantically coherent PBR predictions. To ground each intrinsic component, we propose a rendering loss which provides image-space signals to constrain the model, thus facilitating sharp details also in the output BRDF properties. Our results demonstrate detailed intrinsic generation with strong generalization capabilities that outperforms existing intrinsic image decomposition methods used with generated images by a significant margin. Finally, we show a series of applications, including re-lighting, editing, and text-conditioned room-scale PBR texture generation.
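The abstract says only that cross-intrinsic attention "concatenates key and value features"; the exact formulation is not given here. One plausible reading, sketched below as an assumption, is that the queries of each intrinsic component attend over the keys and values of all components concatenated along the token axis. Single-head attention and the function names are illustrative choices, not the paper's implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_intrinsic_attention(q, k, v):
    """q, k, v: (M, T, d) arrays for M intrinsic components with T tokens each.

    Every component keeps its own queries but attends over the keys/values
    of all M components, concatenated along the token axis.
    """
    m, t, d = q.shape
    k_all = k.reshape(m * t, d)                  # (M*T, d) shared key bank
    v_all = v.reshape(m * t, d)                  # (M*T, d) shared value bank
    attn = softmax(q @ k_all.T / np.sqrt(d))     # (M, T, M*T)
    return attn @ v_all                          # (M, T, d)
```

With four components (albedo, roughness, metallic, normals) this lets, for example, the roughness branch read features from the albedo branch at every layer where the attention is applied.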
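[CV-8] Enhancing 3T BOLD fMRI SNR using Unpaired 7T Data with Schrödinger Bridge Diffusion
[Quick read]: Because 7 Tesla (7T) MRI systems are scarce, most functional MRI (fMRI) studies rely on 3 Tesla (3T) systems, which offer lower spatial and temporal resolution and a lower signal-to-noise ratio (SNR). The goal of this paper is to raise the spatiotemporal resolution and SNR of 3T BOLD fMRI data so that it approaches 7T quality. The key to the solution is a novel framework that aligns 7T and 3T fMRI data from different subjects and datasets in a shared parametric domain and then applies an unpaired Brain Disk Schrödinger Bridge diffusion model to enhance the spatiotemporal resolution and SNR of the 3T data. The approach copes with limited 7T data by improving 3T scan quality, and is validated on two distinct retinotopy fMRI datasets (one 7T, one 3T) as well as on synthetic data.
Link: https://arxiv.org/abs/2504.01004
Authors: Yujian Xiong,Xuanzhao Dong,Sebastian Waz,Wenhui Zhu,Negar Mallak,Zhong-lin Lu,Yalin Wang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: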
Abstract:High spatial and temporal resolution, coupled with a strong signal-to-noise ratio (SNR), has made BOLD 7 Tesla fMRI an invaluable tool for understanding how the brain processes visual stimuli. However, the limited availability of 7T MRI systems means that most research relies on 3T MRI systems, which offer lower spatial and temporal resolution and SNR. This naturally raises the question: Can we enhance the spatiotemporal resolution and SNR of 3T BOLD fMRI data to approximate 7T quality? In this study, we propose a novel framework that aligns 7T and 3T fMRI data from different subjects and datasets in a shared parametric domain. We then apply an unpaired Brain Disk Schrödinger Bridge diffusion model to enhance the spatiotemporal resolution and SNR of the 3T data. Our approach addresses the challenge of limited 7T data by improving the 3T scan quality. We demonstrate its effectiveness by testing it on two distinct fMRI retinotopy datasets (one 7T and one 3T), as well as synthetic data. The results show that our method significantly improves the SNR and goodness-of-fit of the population receptive field (pRF) model in the enhanced 3T data, making it comparable to 7T quality. The codes will be available at Github.
[CV-9] MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization CVPR2025
[Quick read]: This paper addresses the trade-off that existing Vector Quantization (VQ) based generative models face in a shared latent space: achieving high-quality image generation and effective representation learning at once while remaining efficient. The key innovation is MergeVQ, which brings token merging into VQ-based generative models to unify image generation and visual representation learning in one architecture. Specifically, MergeVQ uses a token merge module after the self-attention blocks in the encoder to decouple top-k semantics from the latent space for subsequent Look-up Free Quantization (LFQ) and global alignment, and recovers fine-grained details through cross-attention in the decoder for reconstruction. For second-stage generation, the paper further proposes MergeAR, which performs KV cache compression to speed up raster-order prediction. Experiments on ImageNet show competitive performance in both representation learning and image generation while keeping token usage efficient and inference fast.
Link: https://arxiv.org/abs/2504.00999
Authors: Siyuan Li,Luyuan Zhang,Zedong Wang,Juanxi Tian,Cheng Tan,Zicheng Liu,Chang Yu,Qingsong Xie,Haonan Lu,Haoqian Wang,Zhen Lei
Institutions: Zhejiang University; Tsinghua University; Westlake University; The Hong Kong University of Science and Technology; OPPO AI Center; CAIR, HKISI-CAS; MAIS CASIA; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR2025 (in process for more analysis and extension)
Abstract:Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to address the trade-off in shared latent space for generation quality vs. representation learning and efficiency. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ decouples top-k semantics from latent space with the token merge module after self-attention blocks in the encoder for subsequent Look-up Free Quantization (LFQ) and global alignment and recovers their fine-grained details through cross-attention in the decoder for reconstruction. As for the second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ as an AR generative model achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at this https URL.
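Look-up Free Quantization, as commonly described in the literature, replaces a learned codebook with a per-dimension binarization: each latent channel is mapped to -1 or +1 by its sign, and the resulting bit pattern is read as the token index. The sketch below shows only that quantization step, under the assumption that MergeVQ uses this standard form; training details such as the straight-through gradient and entropy regularization are omitted, and the function name is hypothetical.

```python
import numpy as np


def lfq_quantize(z):
    """Binarize latents per dimension and derive integer token ids.

    z: (..., D) real-valued latents.
    Returns codes in {-1, +1} with the same shape, and ids in [0, 2**D).
    """
    bits = (z > 0).astype(np.int64)                           # 1 where the channel is positive
    codes = 2 * bits - 1                                      # map {0, 1} -> {-1, +1}
    weights = 2 ** np.arange(z.shape[-1], dtype=np.int64)     # bit i has weight 2**i
    ids = (bits * weights).sum(axis=-1)
    return codes, ids
```

A D-dimensional latent therefore indexes an implicit vocabulary of 2**D tokens without any nearest-neighbour search over codebook entries.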
[CV-10] TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting
[Quick read]: This paper tackles the high computational cost of standard diffusion models in image inpainting. The key to the solution is an inpainting adapter trained on a few-step distilled text-to-image model (DMD2) with a novel 3-step adversarial training scheme that keeps inpainted regions realistic, structurally consistent, and visually harmonious, giving high-quality and efficient inpainting.
Link: https://arxiv.org/abs/2504.00996
Authors: Liangbin Xie,Daniil Pakhomov,Zhonghao Wang,Zongze Wu,Ziyan Chen,Yuqian Zhou,Haitian Zheng,Zhifei Zhang,Zhe Lin,Jiantao Zhou,Chao Dong
Institutions: State Key Laboratory of Internet of Things for Smart City, University of Macau; Shenzhen University of Advanced Technology; Adobe; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage available at this https URL
Abstract:This paper introduces TurboFill, a fast image inpainting model that enhances a few-step text-to-image diffusion model with an inpainting adapter for high-quality and efficient inpainting. While standard diffusion models generate high-quality results, they incur high computational costs. We overcome this by training an inpainting adapter on a few-step distilled text-to-image model, DMD2, using a novel 3-step adversarial training scheme to ensure realistic, structurally consistent, and visually harmonious inpainted regions. To evaluate TurboFill, we propose two benchmarks: DilationBench, which tests performance across mask sizes, and HumanBench, based on human feedback for complex prompts. Experiments show that TurboFill outperforms both multi-step BrushNet and few-step inpainting methods, setting a new benchmark for high-performance inpainting tasks. Our project page: this https URL
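[CV-11] SuperDec: 3D Scene Decomposition with Superquadric Primitives
[Quick read]: This paper addresses how to build compact 3D scene representations by decomposing scenes into superquadric primitives. Unlike most recent work, which uses geometric primitives to obtain photorealistic 3D scene representations, it uses them to obtain a compact yet expressive representation. The key is to solve the problem locally on individual objects and rely on instance segmentation methods to scale to full 3D scenes. To this end, the authors design a new architecture that efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. The architecture is trained on ShapeNet and shown to generalize to object instances extracted from the ScanNet++ dataset and to full Replica scenes. Finally, the paper shows that the compact superquadric representation is useful for downstream applications including robotic tasks and controllable visual content generation and editing.
Link: https://arxiv.org/abs/2504.00992
Authors: Elisabetta Fedele,Boyang Sun,Leonidas Guibas,Marc Pollefeys,Francis Engelmann
Institutions: ETH Zurich; Stanford University; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: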
Abstract:We present SuperDec, an approach for creating compact 3D scene representations via decomposition into superquadric primitives. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. We propose to solve the problem locally on individual objects and leverage the capabilities of instance segmentation methods to scale our solution to full 3D scenes. In doing so, we design a new architecture which efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. We train our architecture on ShapeNet and we prove its generalization capabilities on object instances extracted from the ScanNet++ dataset as well as on full Replica scenes. Finally, we show how a compact representation based on superquadrics can be useful for a diverse range of downstream applications, including robotic tasks and controllable visual content generation and editing.
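For readers unfamiliar with the primitive, a superquadric in its canonical frame is defined by three scales (a1, a2, a3) and two shape exponents (e1, e2) through the standard inside-outside function. The sketch below evaluates that function with NumPy; the pose (rotation and translation) of each primitive is left out for brevity, and the function name is hypothetical rather than taken from the paper's code.

```python
import numpy as np


def superquadric_inside_outside(points, scale, eps):
    """Standard superquadric inside-outside function in the canonical frame.

    points: (N, 3) array of xyz coordinates
    scale:  (a1, a2, a3) semi-axis lengths
    eps:    (e1, e2) shape exponents; e1 = e2 = 1 gives an ellipsoid
    Returns F per point: F < 1 inside, F = 1 on the surface, F > 1 outside.
    """
    a1, a2, a3 = scale
    e1, e2 = eps
    x, y, z = np.abs(points).T           # the surface is symmetric about each axis
    xy = (x / a1) ** (2 / e2) + (y / a2) ** (2 / e2)
    return xy ** (e2 / e1) + (z / a3) ** (2 / e1)
```

Small exponents push the shape toward a box, values near 1 give ellipsoids, and larger values give pinched shapes, which is why a handful of such primitives can summarize an object compactly.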
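[CV-12] WorldScore: A Unified Evaluation Benchmark for World Generation
[Quick read]: This paper addresses the lack of a unified benchmark for world generation, aiming to advance different categories of generative models through standardized evaluation. The key solution is the WorldScore benchmark, which decomposes world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches from 3D and 4D scene generation to video generation models. The benchmark includes a curated dataset of 3,000 test examples covering static and dynamic, indoor and outdoor, and photorealistic and stylized worlds, and evaluates generated worlds along three key aspects: controllability, quality, and dynamics. This design supports an in-depth analysis of the strengths and challenges of 19 representative models, both open-source and closed-source.
Link: https://arxiv.org/abs/2504.00983
Authors: Haoyi Duan,Hong-Xing Yu,Sirui Chen,Li Fei-Fei,Jiajun Wu
Institutions: Stanford University
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL The first two authors contributed equally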
Abstract:We introduce the WorldScore benchmark, the first unified benchmark for world generation. We decompose world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches from 3D and 4D scene generation to video generation models. The WorldScore benchmark encompasses a curated dataset of 3,000 test examples that span diverse worlds: static and dynamic, indoor and outdoor, photorealistic and stylized. The WorldScore metrics evaluate generated worlds through three key aspects: controllability, quality, and dynamics. Through extensive evaluation of 19 representative models, including both open-source and closed-source ones, we reveal key insights and challenges for each category of models. Our dataset, evaluation code, and leaderboard can be found at this https URL
[CV-13] Artificial Intelligence-Assisted Prostate Cancer Diagnosis for Reduced Use of Immunohistochemistry
[Quick read]: This paper aims to reduce reliance on immunohistochemical staining (IHC) in prostate cancer diagnosis without compromising diagnostic accuracy. In current practice, pathologists often use IHC to distinguish benign from malignant tissue, but it adds work, cost, and diagnostic delay. The key to the proposed solution is an artificial intelligence (AI) model that detects cancer by accurately classifying atypical glands and borderline morphologies in hematoxylin and eosin (HE) stained sections. In a retrospective analysis across three pathology sites, the model achieved area under the curve (AUC) values of 0.951 to 0.993, and with sensitivity-prioritized diagnostic thresholds it reduced the need for IHC by 44.4%, 42.0%, and 20.7% in the three cohorts without a single false negative prediction. This indicates the model's potential to optimize IHC use, streamline pathology decision-making, and ease resource burdens.
Link: https://arxiv.org/abs/2504.00979
Authors: Anders Blilie(1 and 2),Nita Mulliqi(3),Xiaoyi Ji(3),Kelvin Szolnoky(3),Sol Erika Boman(3 and 4),Matteo Titus(3),Geraldine Martinez Gonzalez(3),José Asenjo(5),Marcello Gambacorta(6),Paolo Libretti(6),Einar Gudlaugsson(1),Svein R. Kjosavik(7 and 8),Lars Egevad(9),Emiel A.M. Janssen(1 and 10 and 11),Martin Eklund(3),Kimmo Kartasalo(12) ((1) Department of Pathology, Stavanger University Hospital, Stavanger, Norway, (2) Faculty of Health Sciences, University of Stavanger, Stavanger, Norway, (3) Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden, (4) Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden, (5) Department of Pathology, Synlab, Madrid, Spain, (6) Department of Pathology, Synlab, Brescia, Italy, (7) The General Practice and Care Coordination Research Group, Stavanger University Hospital, Stavanger, Norway (8) Department of Global Public Health and Primary Care, Faculty of Medicine, University of Bergen, Bergen, Norway, (9) Department of Oncology and Pathology, Karolinska Institutet, Stockholm, Sweden, (10) Faculty of Science and Technology, University of Stavanger, Stavanger, Norway, (11) Institute for Biomedicine and Glycomics, Griffith University, Queensland, Australia, (12) Department of Medical Epidemiology and Biostatistics, SciLifeLab, Karolinska Institutet, Stockholm, Sweden)
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 5 figures and 3 tables
Abstract:Prostate cancer diagnosis heavily relies on histopathological evaluation, which is subject to variability. While immunohistochemical staining (IHC) assists in distinguishing benign from malignant tissue, it involves increased work, higher costs, and diagnostic delays. Artificial intelligence (AI) presents a promising solution to reduce reliance on IHC by accurately classifying atypical glands and borderline morphologies in hematoxylin eosin (HE) stained tissue sections. In this study, we evaluated an AI model’s ability to minimize IHC use without compromising diagnostic accuracy by retrospectively analyzing prostate core needle biopsies from routine diagnostics at three different pathology sites. These cohorts were composed exclusively of difficult cases where the diagnosing pathologists required IHC to finalize the diagnosis. The AI model demonstrated area under the curve values of 0.951-0.993 for detecting cancer in routine HE-stained slides. Applying sensitivity-prioritized diagnostic thresholds reduced the need for IHC staining by 44.4%, 42.0%, and 20.7% in the three cohorts investigated, without a single false negative prediction. This AI model shows potential for optimizing IHC use, streamlining decision-making in prostate pathology, and alleviating resource burdens.
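The abstract does not spell out how the sensitivity-prioritized thresholds were chosen. As one hypothetical reading, the sketch below picks the largest score threshold that still flags every cancer case in a tuning set, then counts how many cases fall below it and could be signed out as benign without IHC. The decision rule and the function names are assumptions made for illustration, not the study's protocol.

```python
def sensitivity_first_threshold(scores, labels):
    """Largest threshold that still calls every cancer case (label 1) positive."""
    return min(s for s, y in zip(scores, labels) if y == 1)


def ihc_reduction(scores, labels, threshold):
    """Return (fraction of cases scored below the threshold, false negatives among them).

    Cases below the threshold are treated as benign without IHC; a cancer
    case among them would be a false negative.
    """
    below = [y for s, y in zip(scores, labels) if s < threshold]
    return len(below) / len(scores), sum(below)
```

For instance, with made-up toy scores `[0.05, 0.10, 0.40, 0.35, 0.80, 0.95]` and labels `[0, 0, 0, 1, 1, 1]`, the threshold is 0.35 and two of the six cases skip IHC with no cancer missed. In practice the threshold would need to be fixed on separate data before being applied to a new cohort.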
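[CV-14] IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval
[Quick read]: This paper addresses the insufficient complexity and limited practical value of current multimodal retrieval tasks. It proposes Instance-Driven Multimodal Image Retrieval (IDMR), a new task that requires fine-grained instance-level consistency across diverse contexts: the model must retrieve images that contain the same instance as a query image while matching a text-described scenario. To address the scarcity of training data, the paper designs a cross-domain synthesis method that creates 557K training samples by cropping objects from standard detection datasets. The key to the solution is a retrieval model based on a Multimodal Large Language Model (MLLM) and trained at scale (1.2M samples), which outperforms state-of-the-art approaches on traditional benchmarks and on the zero-shot IDMR-bench, demonstrating its strength in instance-aware retrieval and the potential of MLLMs for advanced retrieval applications.
Link: https://arxiv.org/abs/2504.00954
Authors: Bangwei Liu,Yicheng Bao,Shaohui Lin,Xuhong Wang,Xin Tan,Yingchun Wang,Yuan Xie,Chaochao Lu
Institutions: East China Normal University; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: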
Abstract:Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval tasks lack sufficient complexity and demonstrate limited practical application value. This inspires us to design Instance-Driven Multimodal Image Retrieval (IDMR), a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario. Unlike existing retrieval tasks focused on global image similarity or category-level matching, IDMR demands fine-grained instance-level consistency across diverse contexts. To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data. Addressing the scarcity of training data, we propose a cross-domain synthesis method that creates 557K training samples by cropping objects from standard detection datasets. Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench. Experimental results demonstrate previous models’ limitations in instance-aware retrieval and highlight the potential of MLLM for advanced retrieval applications. The whole training dataset, codes and models, with wide ranges of sizes, are available at this https URL.
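[CV-15] Personalized Federated Training of Diffusion Models with Privacy Guarantees
[Quick read]: This paper addresses the scarcity of high-quality data for AI applications in sensitive fields such as healthcare, finance, and biomedical research, in particular the data access restrictions imposed by compliance, ethics, and privacy. The key solution is a novel federated learning framework for training diffusion models on decentralized private datasets. The framework uses personalization and the inherent noise of the forward diffusion process to generate high-quality, diverse synthetic data while ensuring robust differential privacy guarantees. Experiments show that it outperforms non-collaborative training, especially under high data heterogeneity, and reduces biases and imbalances in synthetic data, leading to fairer downstream models.
Link: https://arxiv.org/abs/2504.00952
Authors: Kumar Kshitij Patel,Weitong Zhang,Lingxiao Wang
Institutions: TTIC; UNC Chapel Hill; NJIT
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 4 figures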
Abstract:The scarcity of accessible, compliant, and ethically sourced data presents a considerable challenge to the adoption of artificial intelligence (AI) in sensitive fields like healthcare, finance, and biomedical research. Furthermore, access to unrestricted public datasets is increasingly constrained due to rising concerns over privacy, copyright, and competition. Synthetic data has emerged as a promising alternative, and diffusion models – a cutting-edge generative AI technology – provide an effective solution for generating high-quality and diverse synthetic data. In this paper, we introduce a novel federated learning framework for training diffusion models on decentralized private datasets. Our framework leverages personalization and the inherent noise in the forward diffusion process to produce high-quality samples while ensuring robust differential privacy guarantees. Our experiments show that our framework outperforms non-collaborative training methods, particularly in settings with high data heterogeneity, and effectively reduces biases and imbalances in synthetic data, resulting in fairer downstream models.
[CV-16] Neural Pruning for 3D Scene Reconstruction: Efficient NeRF Acceleration
[Quick read]: This paper addresses the lengthy training times of Neural Radiance Fields (NeRF) models, which often span days, while also reducing model size. The key solution is neural pruning: the paper compares uniform sampling, importance-based methods, and coreset-based techniques as pruning strategies for compressing the model and speeding up training. It finds that coreset-driven pruning can reduce model size by 50% and speed up training by 35% with only a slight decrease in accuracy, suggesting that pruning is an effective way to improve NeRF efficiency in resource-limited settings.
Link: https://arxiv.org/abs/2504.00950
Authors: Tianqi Ding,Dawei Xiang,Pablo Rivas,Liang Dong
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures, accepted by International Conference on the AI Revolution: Research, Ethics, and Society (AIR-RES 2025)
Abstract:Neural Radiance Fields (NeRF) have become a popular 3D reconstruction approach in recent years. While they produce high-quality results, they also demand lengthy training times, often spanning days. This paper studies neural pruning as a strategy to address these concerns. We compare pruning approaches, including uniform sampling, importance-based methods, and coreset-based techniques, to reduce the model size and speed up training. Our findings show that coreset-driven pruning can achieve a 50% reduction in model size and a 35% speedup in training, with only a slight decrease in accuracy. These results suggest that pruning can be an effective method for improving the efficiency of NeRF models in resource-limited settings.
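To make two of the compared strategies concrete, the sketch below prunes the hidden neurons of a single two-layer MLP block either by uniform random sampling or by an importance score. The score used here, the product of each neuron's incoming and outgoing weight norms, is an assumption chosen for illustration; the abstract does not define the paper's importance measure, and the coreset-based variant is not shown. The function name is hypothetical.

```python
import numpy as np


def prune_hidden_layer(w_in, w_out, keep, method="importance", rng=None):
    """Keep `keep` hidden neurons of a two-layer block.

    w_in:  (hidden, in_dim)  weights producing the hidden activations
    w_out: (out_dim, hidden) weights consuming them
    Returns the pruned (w_in, w_out) and the indices of the kept neurons.
    """
    hidden = w_in.shape[0]
    if method == "uniform":
        if rng is None:
            rng = np.random.default_rng(0)
        idx = np.sort(rng.choice(hidden, size=keep, replace=False))
    elif method == "importance":
        # Illustrative score: incoming norm times outgoing norm per neuron.
        score = np.linalg.norm(w_in, axis=1) * np.linalg.norm(w_out, axis=0)
        idx = np.sort(np.argsort(score)[-keep:])
    else:
        raise ValueError(f"unknown method: {method}")
    return w_in[idx], w_out[:, idx], idx
```

Halving `keep` relative to the original hidden width halves the parameters of both weight matrices in the block, which is the kind of size reduction the abstract reports at the model level.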
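[CV-17] GKAN: Explainable Diagnosis of Alzheimer's Disease Using Graph Neural Network with Kolmogorov-Arnold Networks
[Quick read]: Targeting the diagnostic challenges posed by the complex etiology of Alzheimer's Disease (AD), this paper proposes GCN-KAN, a new single-modal framework. Graph Convolutional Networks (GCNs) have shown promise for modeling brain connectivity, but their linear transformations limit their ability to capture intricate nonlinear patterns in neuroimaging data. GCN-KAN addresses this by integrating Kolmogorov-Arnold Networks (KAN) into GCNs, using learnable spline-based transformations to better represent brain region interactions. On the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset it outperforms traditional GCNs by 4-8% in classification accuracy while providing interpretable insights into key brain regions associated with AD. The key is the integration of KAN to strengthen the model's nonlinear expressiveness and interpretability.
Link: https://arxiv.org/abs/2504.00946
Authors: Tianqi Ding,Dawei Xiang,Keith E Schubert,Liang Dong
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures, under review of The Southwest Data Science Conference (SDSC 2025)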
Abstract:Alzheimer’s Disease (AD) is a progressive neurodegenerative disorder that poses significant diagnostic challenges due to its complex etiology. Graph Convolutional Networks (GCNs) have shown promise in modeling brain connectivity for AD diagnosis, yet their reliance on linear transformations limits their ability to capture intricate nonlinear patterns in neuroimaging data. To address this, we propose GCN-KAN, a novel single-modal framework that integrates Kolmogorov-Arnold Networks (KAN) into GCNs to enhance both diagnostic accuracy and interpretability. Leveraging structural MRI data, our model employs learnable spline-based transformations to better represent brain region interactions. Evaluated on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, GCN-KAN outperforms traditional GCNs by 4-8% in classification accuracy while providing interpretable insights into key brain regions associated with AD. This approach offers a robust and explainable tool for early AD diagnosis.
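The idea of swapping a GCN layer's linear map for learnable univariate functions can be sketched in a few lines. The version below is a simplification under stated assumptions: it uses the usual symmetric normalized adjacency with self-loops for aggregation, and models each edge function as a degree-1 B-spline (hat basis) on a fixed uniform grid, whereas KAN implementations typically use higher-order B-splines plus a base activation. All names are hypothetical.

```python
import numpy as np


def normalized_adjacency(adj):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def hat_basis(x, grid):
    """Degree-1 B-spline basis on a uniform grid: x (...,) -> (..., K)."""
    h = grid[1] - grid[0]
    return np.clip(1.0 - np.abs(x[..., None] - grid) / h, 0.0, None)


def gcn_kan_layer(x, adj, coef, grid):
    """Graph aggregation followed by a KAN-style transform.

    x:    (N, F_in) node features
    coef: (F_in, K, F_out) learnable spline coefficients, one univariate
          function per (input feature, output feature) pair
    """
    agg = normalized_adjacency(adj) @ x           # (N, F_in) neighbourhood average
    basis = hat_basis(agg, grid)                  # (N, F_in, K)
    return np.einsum("nik,iko->no", basis, coef)  # sum_i phi_io(agg_i)
```

If every spline is set to a straight line, the layer reduces to an ordinary linear GCN layer, so the spline coefficients are exactly the extra nonlinear capacity being added.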
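[CV-18] Graph Classification and Radiomics Signature for Identification of Tuberculous Meningitis
[Quick read]: This paper addresses the reliance of tuberculous meningitis (TBM) diagnosis on invasive lumbar puncture (LP) and cerebrospinal fluid (CSF) analysis, proposing a non-invasive classification method based on T1-weighted non-contrast magnetic resonance imaging (MRI) scans. The key is the Pixel-array Graphs Classifier (PAG-Classifier), which uses the spatial relationships between neighbouring 3D pixels in a graph-based framework to extract significant features through eigen decomposition, and then trains machine learning classifiers on those features. The study finds high classification performance for the interpeduncular cistern regions, whereas the bone and corpus callosum regions perform poorly.
Link: https://arxiv.org/abs/2504.00943
Authors: Snigdha Agarwal,Ganaraja V H,Neelam Sinha,Abhilasha Indoria,Netravathi M,Jitender Saini
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures, 3 tables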
Abstract:Introduction: Tuberculous meningitis (TBM) is a serious brain infection caused by Mycobacterium tuberculosis, characterized by inflammation of the meninges covering the brain and spinal cord. Diagnosis often requires invasive lumbar puncture (LP) and cerebrospinal fluid (CSF) analysis. Objectives: This study aims to classify TBM patients using T1-weighted (T1w) non-contrast Magnetic Resonance Imaging (MRI) scans. We hypothesize that specific brain regions, such as the interpeduncular cisterns, bone, and corpus callosum, contain visual markers that can non-invasively distinguish TBM patients from healthy controls. We propose a novel Pixel-array Graphs Classifier (PAG-Classifier) that leverages spatial relationships between neighbouring 3D pixels in a graph-based framework to extract significant features through eigen decomposition. These features are then used to train machine learning classifiers for effective patient classification. We validate our approach using a radiomics-based methodology, classifying TBM patients based on relevant radiomics features. Results: We utilized an internal dataset consisting of 52 scans, 32 from confirmed TBM patients based on mycobacteria detection in CSF, and 20 from healthy individuals. We achieved a 5-fold cross-validated average F1 score of 85.71% for cistern regions with our PAG-Classifier and 92.85% with the radiomics features classifier, surpassing current state-of-the-art benchmarks by 15% and 22%, respectively. However, bone and corpus callosum regions showed poor classification effectiveness, with average F1 scores below 50%. Conclusion: Our study suggests that algorithms like the PAG-Classifier serve as effective tools for non-invasive TBM analysis, particularly by targeting the interpeduncular cistern. Findings indicate that the bone and corpus callosum regions lack distinctive patterns for differentiation.
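The abstract describes graph features obtained from neighbouring 3D pixels through eigen decomposition without giving the construction. A minimal sketch of one way to do this, under assumptions, is shown below: voxels of a small region become nodes, 6-connected neighbours are linked with a Gaussian weight on their intensity difference, and the smallest eigenvalues of the graph Laplacian serve as a feature vector. The weighting, the choice of Laplacian eigenvalues, and the function name are illustrative and not taken from the paper.

```python
import numpy as np


def voxel_graph_spectrum(volume, k=5, sigma=1.0):
    """Return the k smallest Laplacian eigenvalues of a 6-connected voxel graph.

    volume: small 3D array of intensities (a dense N x N matrix is built,
            so this is only practical for small regions of interest).
    """
    shape = volume.shape
    n = volume.size
    index = np.arange(n).reshape(shape)
    adj = np.zeros((n, n))
    for axis in range(3):
        lo = np.arange(shape[axis] - 1)
        hi = np.arange(1, shape[axis])
        a = np.take(index, lo, axis=axis).ravel()     # node ids of each voxel ...
        b = np.take(index, hi, axis=axis).ravel()     # ... and its neighbour along this axis
        va = np.take(volume, lo, axis=axis).ravel()
        vb = np.take(volume, hi, axis=axis).ravel()
        w = np.exp(-((va - vb) ** 2) / (2 * sigma ** 2))  # similar intensities -> strong edge
        adj[a, b] = w
        adj[b, a] = w
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.eigvalsh(lap)[:k]                # eigenvalues in ascending order
```

The resulting fixed-length vector can be passed to any standard classifier, which mirrors the "features, then machine learning classifier" pipeline the abstract outlines.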
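[CV-19] DBF-UNet: A Two-Stage Framework for Carotid Artery Segmentation with Pseudo-Label Generation
[Quick read]: This paper addresses three-dimensional carotid artery segmentation under limited annotation, focusing on existing datasets that contain only a small portion of expert-labeled slices with spatially discontinuous annotations. It proposes a two-stage segmentation framework. In the first stage, continuous vessel centerlines are constructed by interpolating between annotated slice centroids, and labels are propagated along the centerlines to generate interpolated annotations for unlabeled slices; the expert-annotated slices are used to fine-tune SAM-Med2D, and the interpolated labels serve as prompts that guide segmentation during inference. In the second stage, the paper proposes the lightweight Dense Bidirectional Feature Fusion UNet (DBF-UNet), which incorporates bidirectional feature fusion in the encoder and combines multi-scale feature aggregation with dense connectivity for effective feature reuse, giving precise segmentation of complete 3D vascular structures. Experiments show that the method handles sparse annotations and outperforms existing approaches.
Link: https://arxiv.org/abs/2504.00908
Authors: Haoxuan Li,Wei Song,Aofan Liu,Peiwu Qin
Institutions: Shenzhen International Graduate School, Tsinghua University; School of Automation, Guangdong University of Technology; School of Electronic and Computer Engineering, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: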
Abstract:Medical image analysis faces significant challenges due to limited annotation data, particularly in three-dimensional carotid artery segmentation tasks, where existing datasets exhibit spatially discontinuous slice annotations with only a small portion of expert-labeled slices in complete 3D volumetric data. To address this challenge, we propose a two-stage segmentation framework. First, we construct continuous vessel centerlines by interpolating between annotated slice centroids and propagate labels along these centerlines to generate interpolated annotations for unlabeled slices. The slices with expert annotations are used for fine-tuning SAM-Med2D, while the interpolated labels on unlabeled slices serve as prompts to guide segmentation during inference. In the second stage, we propose a novel Dense Bidirectional Feature Fusion UNet (DBF-UNet). This lightweight architecture achieves precise segmentation of complete 3D vascular structures. The network incorporates bidirectional feature fusion in the encoder and integrates multi-scale feature aggregation with dense connectivity for effective feature reuse. Experimental validation on public datasets demonstrates that our proposed method effectively addresses the sparse annotation challenge in carotid artery segmentation while achieving superior performance compared to existing approaches. The source code is available at this https URL.
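The first-stage centerline construction can be illustrated with a short NumPy sketch. It assumes linear interpolation between the centroids of the annotated slices; the abstract only says "interpolating", so the paper may use a different scheme such as splines. The label propagation and SAM-Med2D prompting steps are not shown, and the function names are hypothetical.

```python
import numpy as np


def mask_centroid(mask):
    """Centroid (x, y) of the foreground pixels of a 2D binary mask."""
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()


def interpolate_centerline(annotated):
    """Estimate a vessel centerline from sparsely annotated slices.

    annotated: dict {slice_index: 2D binary mask} for the expert-labeled slices.
    Returns {slice_index: (x, y)} for every slice between the first and last
    annotated slice, linearly interpolating between annotated centroids.
    """
    zs = sorted(annotated)
    cents = np.array([mask_centroid(annotated[z]) for z in zs])
    all_z = np.arange(zs[0], zs[-1] + 1)
    xs = np.interp(all_z, zs, cents[:, 0])
    ys = np.interp(all_z, zs, cents[:, 1])
    return {int(z): (float(x), float(y)) for z, x, y in zip(all_z, xs, ys)}
```

Each interpolated point gives a location on an unlabeled slice where a propagated label or a point prompt can be placed.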
zh
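[CV-20] A Decade of Deep Learning for Remote Sensing Spatiotemporal Fusion: Advances, Challenges, and Opportunities
[Quick Read]: This paper addresses the difficulty of acquiring remote sensing imagery with both high temporal and high spatial resolution; hardware limitations and satellite launch costs make direct acquisition challenging. The solution it examines is remote sensing spatiotemporal fusion (STF), which merges imagery of high temporal but low spatial resolution with imagery of high spatial but low temporal resolution to efficiently generate satellite images with high spatiotemporal resolution. STF provides unprecedented observational capabilities for land surface change monitoring, agricultural management, and environmental research. The paper stresses that deep learning (DL) methods have transformed STF over the past decade through powerful automatic feature extraction and nonlinear modeling, significantly outperforming traditional methods on complex spatiotemporal data. DL methods are therefore at the core of the solution.
Link: https://arxiv.org/abs/2504.00901
Authors: Enzhe Sun, Yongchuan Cui, Peng Liu, Jining Yan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: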
Abstract:Hardware limitations and satellite launch costs make direct acquisition of high temporal-spatial resolution remote sensing imagery challenging. Remote sensing spatiotemporal fusion (STF) technology addresses this problem by merging high temporal but low spatial resolution imagery with high spatial but low temporal resolution imagery to efficiently generate high spatiotemporal resolution satellite images. STF provides unprecedented observational capabilities for land surface change monitoring, agricultural management, and environmental research. Deep learning (DL) methods have revolutionized the remote sensing spatiotemporal fusion field over the past decade through powerful automatic feature extraction and nonlinear modeling capabilities, significantly outperforming traditional methods in handling complex spatiotemporal data. Despite the rapid development of DL-based remote sensing STF, the community lacks a systematic review of this quickly evolving field. This paper comprehensively reviews DL developments in remote sensing STF over the last decade, analyzing key research trends, method classifications, commonly used datasets, and evaluation metrics. It discusses major challenges in existing research and identifies promising future research directions as references for researchers in this field to inspire new ideas. The specific models, datasets, and other information mentioned in this article have been collected in: this https URL.
[CV-21] Improved Visual-Spatial Reasoning via R1-Zero-Like Training
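[Quick Read]: This paper addresses the limited visual-spatial reasoning of multi-modal large language models (MLLMs), focusing on video-based visual-spatial intelligence (VSI). Specifically, it studies how R1-Zero-like training can improve this capability. The key contribution is a training strategy based on Group Relative Policy Optimization (GRPO), applied with the carefully curated VSI-100k dataset. The study finds that keeping the KL penalty in GRPO is necessary, even with a small value. With only 120 GPU hours, vsGRPO-2B (fine-tuned from Qwen2-VL-2B) outperforms its base model by 12.1% and surpasses GPT-4o, while vsGRPO-7B (fine-tuned from Qwen2-VL-7B) performs comparably to the best open-source model, LLaVA-NeXT-Video-72B. The experiments also show that vsGRPO outperforms supervised fine-tuning and direct preference optimization baselines.
Link: https://arxiv.org/abs/2504.00883
Authors: Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, Zhijie Deng
Affiliations: Shanghai Jiao Tong University; OPPO AI Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: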
Abstract:Increasing attention has been placed on improving the reasoning capacities of multi-modal large language models (MLLMs). As the cornerstone for AI agents that function in the physical realm, video-based visual-spatial intelligence (VSI) emerges as one of the most pivotal reasoning capabilities of MLLMs. This work conducts a first, in-depth study on improving the visual-spatial reasoning of MLLMs via R1-Zero-like training. Technically, we first identify that the visual-spatial reasoning capacities of small- to medium-sized Qwen2-VL models cannot be activated via Chain of Thought (CoT) prompts. We then incorporate GRPO training for improved visual-spatial reasoning, using the carefully curated VSI-100k dataset, following DeepSeek-R1-Zero. During the investigation, we identify the necessity to keep the KL penalty (even with a small value) in GRPO. With just 120 GPU hours, our vsGRPO-2B model, fine-tuned from Qwen2-VL-2B, can outperform the base model by 12.1% and surpass GPT-4o. Moreover, our vsGRPO-7B model, fine-tuned from Qwen2-VL-7B, achieves performance comparable to that of the best open-source model LLaVA-NeXT-Video-72B. Additionally, we compare vsGRPO to supervised fine-tuning and direct preference optimization baselines and observe strong performance superiority. The code and dataset will be available soon.
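The abstract does not spell out the training objective, so the following is only a sketch of two ingredients as GRPO is commonly described: rewards normalized within a group of sampled responses, and a per-token KL penalty toward a reference policy. The reward values and the coefficient `beta` are hypothetical.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def kl_penalty(logp_policy, logp_ref):
    """Non-negative per-token KL estimate: ratio - log(ratio) - 1,
    where ratio = pi_ref / pi_theta for the sampled token."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0

beta = 0.01  # hypothetical small KL coefficient
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Per-token penalty subtracted from the objective when the policy drifts from the reference.
penalty = beta * kl_penalty(logp_policy=-1.0, logp_ref=-2.0)
```

The penalty is zero when the policy and the reference assign the same log-probability and grows as they diverge, which gives one intuition for why the abstract reports that even a small KL term matters.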
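[CV-22] WISE-TTT: Worldwide Information Segmentation Enhancement
[Quick Read]: This paper addresses multi-target segmentation in long video sequences, where existing architectures are inherently limited in capturing global temporal dependencies. It proposes WISE-TTT, a synergistic architecture that integrates Test-Time Training (TTT) mechanisms with the Transformer architecture through co-design. The key is that the TTT layer systematically compresses historical temporal data into hidden states containing worldwide information (described as lossless memory that preserves long-context integrity), while multi-stage contextual aggregation is achieved through splicing. The work reports that applying worldwide information across multiple network layers is essential for exploiting dependencies, and presents this as the first proof of the superiority of hierarchical context in video segmentation.
Link: https://arxiv.org/abs/2504.00879
Authors: Fenglei Hao, Yuliang Yang, Ruiyuan Su, Zhengran Zhao, Yukun Qiao, Mengyu Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: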
Abstract:Video multi-target segmentation remains a major challenge in long sequences, mainly due to the inherent limitations of existing architectures in capturing global temporal dependencies. We introduce WISE-TTT, a synergistic architecture integrating Test-Time Training (TTT) mechanisms with the Transformer architecture through co-design. The TTT layer systematically compresses historical temporal data to generate hidden states containing worldwide information (lossless memory to maintain long contextual integrity), while achieving multi-stage contextual aggregation through splicing. Crucially, our framework provides the first empirical validation that implementing worldwide information across multiple network layers is essential for optimal dependency this http URL studies show TTT modules at high-level features boost global modeling. This translates to a 3.1% accuracy improvement (JF metric) on Davis2017 long-term benchmarks, the first proof of hierarchical context superiority in video segmentation. We provide the first systematic evidence that worldwide information critically impacts segmentation performance.
[CV-23] Data-free Knowledge Distillation with Diffusion Models ICME2025
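[Quick Read]: This paper addresses the fact that diffusion models have not been effectively exploited for data-free knowledge distillation (DFKD). Although diffusion models excel at synthesizing high-quality images, existing methods cannot be applied directly to DFKD. The paper proposes DiffDFKD, a diffusion-based approach with two key components. First, it uses valuable information from the teacher model to guide data synthesis by a pre-trained diffusion model, generating datasets that mirror the original training data distribution and bridge the domain gap. Second, it introduces Latent CutMix Augmentation, which increases the diversity of the generated images at reduced computational cost while preserving the key attributes needed for effective knowledge transfer. With these components, DiffDFKD achieves state-of-the-art performance.
Link: https://arxiv.org/abs/2504.00870
Authors: Xiaohua Qi, Renda Li, Long Peng, Qiang Ling, Jun Yu, Ziyi Chen, Peng Chang, Mei Han, Jing Xiao
Affiliations: University of Science and Technology of China; PAII Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by ICME2025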
Abstract:Recently, Data-Free Knowledge Distillation (DFKD) has garnered attention because it can transfer knowledge from a teacher neural network to a student neural network without requiring any access to training data. Although diffusion models are adept at synthesizing high-fidelity photorealistic images across various domains, existing methods cannot easily be applied to DFKD. To bridge that gap, this paper proposes a novel approach based on diffusion models, DiffDFKD. Specifically, DiffDFKD involves targeted optimizations in two key areas. Firstly, DiffDFKD utilizes valuable information from teacher models to guide the pre-trained diffusion models’ data synthesis, generating datasets that mirror the training data distribution and effectively bridge domain gaps. Secondly, to reduce computational burdens, DiffDFKD introduces Latent CutMix Augmentation, an efficient technique, to enhance the diversity of diffusion model-generated images for DFKD while preserving key attributes for effective knowledge transfer. Extensive experiments validate the efficacy of DiffDFKD, yielding state-of-the-art results exceeding existing DFKD approaches. We release our code at this https URL.
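The abstract names Latent CutMix Augmentation without giving details. The sketch below assumes it amounts to the standard CutMix box-paste operation applied to diffusion latents instead of pixels; the function `latent_cutmix`, the tensor shapes, and the box-sampling rule are all assumptions made for illustration.

```python
import numpy as np

def latent_cutmix(latent_a, latent_b, rng, lam=0.5):
    """Paste a random box from latent_b into a copy of latent_a.

    latent_a, latent_b: arrays of shape (C, H, W).
    lam: approximate fraction of the area kept from latent_a.
    """
    _, h, w = latent_a.shape
    # Box side lengths chosen so the box covers about (1 - lam) of the area.
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    # Sample the top-left corner so the box stays inside the latent.
    top = rng.integers(0, h - cut_h + 1)
    left = rng.integers(0, w - cut_w + 1)
    mixed = latent_a.copy()
    mixed[:, top:top + cut_h, left:left + cut_w] = \
        latent_b[:, top:top + cut_h, left:left + cut_w]
    return mixed

rng = np.random.default_rng(0)
a = np.zeros((4, 8, 8))
b = np.ones((4, 8, 8))
mixed = latent_cutmix(a, b, rng, lam=0.75)  # a 4x4 box of b pasted into a
```

Mixing in latent space operates on much smaller tensors than pixel-space mixing, which is consistent with the abstract's emphasis on reducing computational burden.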
[CV-24] Feature-Preserving Mesh Decimation for Normal Integration
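[Quick Read]: This paper addresses the high computational cost of reconstructing high-resolution 3D surfaces from normal maps through normal integration. Conventional methods rely on a dense pixel grid, which requires long computation times at high resolutions. The paper proposes replacing the dense pixel grid with a sparse anisotropic triangle mesh that adapts to complex surface structures and removes redundant sampling in flat, featureless regions. The key is deriving the classical quadric error measure from mesh decimation for screen-space applications and combining it with optimal Delaunay triangulation, which substantially reduces runtime while maintaining high surface accuracy.
Link: https://arxiv.org/abs/2504.00867
Authors: Moritz Heep, Sven Behnke, Eduard Zell
Affiliations: University of Bonn; Autonomous Intelligent Systems (University of Bonn); Independent Researcher
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Notes: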
Abstract:Normal integration reconstructs 3D surfaces from normal maps obtained e.g. by photometric stereo. These normal maps capture surface details down to the pixel level but require large computational resources for integration at high resolutions. In this work, we replace the dense pixel grid with a sparse anisotropic triangle mesh prior to normal integration. We adapt the triangle mesh to the local geometry in the case of complex surface structures and remove oversampling from flat featureless regions. For high-resolution images, the resulting compression reduces normal integration runtimes from hours to minutes while maintaining high surface accuracy. Our main contribution is the derivation of the well-known quadric error measure from mesh decimation for screen space applications and its combination with optimal Delaunay triangulation.
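For readers unfamiliar with the quadric error measure, the sketch below shows its classical 3D form from mesh decimation: each plane contributes a 4x4 matrix, and a vertex's error is the sum of its squared distances to those planes. This is not the screen-space derivation the paper contributes; the function names are hypothetical.

```python
import numpy as np

def plane_quadric(normal, point):
    """4x4 quadric K = p p^T for the plane through `point` with the given normal."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    d = -float(n @ np.asarray(point, dtype=float))
    p = np.append(n, d)  # plane coefficients (a, b, c, d) with unit normal
    return np.outer(p, p)

def quadric_error(q, vertex):
    """Sum of squared distances from `vertex` to the planes accumulated in q."""
    v = np.append(np.asarray(vertex, dtype=float), 1.0)  # homogeneous coordinates
    return float(v @ q @ v)

# Accumulate the planes z = 0 and x = 0, then evaluate a candidate vertex.
q = plane_quadric([0, 0, 1], [0, 0, 0]) + plane_quadric([1, 0, 0], [0, 0, 0])
err = quadric_error(q, [3.0, 5.0, 2.0])  # 2^2 + 3^2 = 13
```

Because quadrics add, the cost of collapsing an edge can be evaluated from the summed quadrics of its endpoints, which is what makes the measure cheap enough to drive decimation.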
[CV-25] Balancing Multi-Target Semi-Supervised Medical Image Segmentation with Collaborative Generalist and Specialists
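[Quick Read]: This paper addresses the problem that, in multi-target medical image segmentation, imbalanced target scales let large targets dominate the loss and degrade the segmentation of small targets. It proposes CGS (Collaborative Generalist and Specialists), whose core idea is to assign a dedicated specialist to each target class so that large targets do not dominate training. The generalist performs conventional multi-target segmentation, while each specialist distinguishes one specific target class from the remaining target classes and the background. Cross-consistency losses foster collaborative learning between the generalist and the specialists, and an inter-head error detection module further improves pseudo-label quality, yielding more balanced training. Experiments on three popular benchmarks confirm the method's superior performance.
Link: https://arxiv.org/abs/2504.00862
Authors: You Wang, Zekun Li, Lei Qi, Qian Yu, Yinghuan Shi, Yang Gao
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; National Institute of Healthcare Data Science, Nanjing University, China; School of Computer Science and Engineering, and the Key Lab of Computer Network and Information Integration (Ministry of Education), Southeast University, China; School of Data and Computer Science, Shandong Women’s University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: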
Abstract:Despite the promising performance achieved by current semi-supervised models in segmenting individual medical targets, many of these models suffer a notable decrease in performance when tasked with the simultaneous segmentation of multiple targets. A key factor may be the imbalanced scales among different targets: when multiple targets are segmented simultaneously, large targets dominate the loss, leading to small targets being misclassified as larger ones. To this end, we propose a novel method, which consists of a Collaborative Generalist and several Specialists, termed CGS. It is centered around the idea of employing a specialist for each target class, thus avoiding the dominance of larger targets. The generalist performs conventional multi-target segmentation, while each specialist is dedicated to distinguishing a specific target class from the remaining target classes and the background. Based on a theoretical insight, we demonstrate that CGS can achieve a more balanced training. Moreover, we develop cross-consistency losses to foster collaborative learning between the generalist and the specialists. Lastly, regarding their intrinsic relation that the target class of any specialized head should belong to the remaining classes of the other heads, we introduce an inter-head error detection module to further enhance the quality of pseudo-labels. Experimental results on three popular benchmarks showcase its superior performance compared to state-of-the-art methods.
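One possible reading of the specialist setup is that each specialist head sees a relabeled target map with three classes: background, its own target, and all remaining targets. The sketch below illustrates that relabeling only; the three-way coding and the assumption that background is labeled 0 are illustrative choices, not details confirmed by the abstract.

```python
import numpy as np

def specialist_labels(multi_class_labels, target_class):
    """Relabel a multi-target mask for one specialist head.

    Output coding (assumed): 0 = background, 1 = this specialist's target,
    2 = any other target class.
    """
    labels = np.asarray(multi_class_labels)
    out = np.full(labels.shape, 2, dtype=np.int64)
    out[labels == 0] = 0             # background stays background
    out[labels == target_class] = 1  # the specialist's own class
    return out

mask = np.array([[0, 1, 2],
                 [3, 1, 0]])
spec = specialist_labels(mask, target_class=1)
```

Under this coding, a pixel that one head marks as its own target should fall into the "other target" class of every other head, which mirrors the intrinsic relation the abstract uses for inter-head error detection.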
[CV-26] NeuRadar: Neural Radiance Fields for Automotive Radar Point Clouds
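[Quick Read]: This paper addresses novel view synthesis for radar point clouds, a direction that neural radiance field (NeRF) research for autonomous driving (AD) has not yet explored. It presents NeuRadar, a NeRF-based model that jointly generates radar point clouds, camera images, and lidar point clouds. The authors explore set-based object detection methods such as DETR and propose an encoder-based solution grounded in the NeRF geometry to improve generalizability. The key innovation is a pair of point cloud representations, one deterministic and one probabilistic, for modeling radar behavior accurately; the probabilistic one captures radar's stochastic behavior. The paper reports realistic reconstruction results on two automotive datasets, establishes a baseline for NeRF-based radar point cloud simulation, and releases code and data to encourage further research on radar NeRFs.
Link: https://arxiv.org/abs/2504.00859
Authors: Mahan Rafidashti, Ji Lan, Maryam Fatemi, Junsheng Fu, Lars Hammarstrand, Lennart Svensson
Affiliations: Zenseact; Chalmers University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: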
Abstract:Radar is an important sensor for autonomous driving (AD) systems due to its robustness to adverse weather and different lighting conditions. Novel view synthesis using neural radiance fields (NeRFs) has recently received considerable attention in AD due to its potential to enable efficient testing and validation but remains unexplored for radar point clouds. In this paper, we present NeuRadar, a NeRF-based model that jointly generates radar point clouds, camera images, and lidar point clouds. We explore set-based object detection methods such as DETR, and propose an encoder-based solution grounded in the NeRF geometry for improved generalizability. We propose both a deterministic and a probabilistic point cloud representation to accurately model the radar behavior, with the latter being able to capture radar’s stochastic behavior. We achieve realistic reconstruction results for two automotive datasets, establishing a baseline for NeRF-based radar point cloud simulation models. In addition, we release radar data for ZOD’s Sequences and Drives to enable further research in this field. To encourage further development of radar NeRFs, we release the source code for NeuRadar.
[CV-27] Exploring Personalized Federated Learning Architectures for Violence Detection in Surveillance Videos
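[Quick Read]: This paper addresses violence detection in urban surveillance systems, which is made difficult by the volume and diversity of video data. The key solution is Personalized Federated Learning (PFL), specifically the Federated Learning with Personalization Layers method within the Flower framework. By adapting the learning model to the data characteristics of each surveillance node, the approach handles the heterogeneous, non-IID nature of surveillance video data. Experiments on balanced and imbalanced datasets report improved accuracy and efficiency, with accuracy of up to 99.3%, indicating that PFL can improve the scalability and effectiveness of surveillance systems while offering a robust, privacy-preserving solution for violence detection in complex urban environments.
Link: https://arxiv.org/abs/2504.00857
Authors: Mohammad Kassir, Siba Haidar, Antoun Yaacoub
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 7 pages, 5 figures, 4 tables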
Abstract:The challenge of detecting violent incidents in urban surveillance systems is compounded by the voluminous and diverse nature of video data. This paper presents a targeted approach using Personalized Federated Learning (PFL) to address these issues, specifically employing the Federated Learning with Personalization Layers method within the Flower framework. Our methodology adapts learning models to the unique data characteristics of each surveillance node, effectively managing the heterogeneous and non-IID nature of surveillance video data. Through rigorous experiments conducted on balanced and imbalanced datasets, our PFL models demonstrated enhanced accuracy and efficiency, achieving up to 99.3% accuracy. This study underscores the potential of PFL to significantly improve the scalability and effectiveness of surveillance systems, offering a robust, privacy-preserving solution for violence detection in complex urban environments.
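The idea behind personalization layers can be sketched without any framework: the server averages only the shared base layers, and each client keeps its own head. The sketch below uses plain NumPy rather than the Flower API, an unweighted mean rather than a sample-weighted one, and hypothetical parameter names such as `base.w` and `head.w`.

```python
import numpy as np

def aggregate_base_layers(client_states, personal_keys):
    """Average the shared parameters across clients, skipping personal layers."""
    shared_keys = [k for k in client_states[0] if k not in personal_keys]
    return {k: np.mean([state[k] for state in client_states], axis=0)
            for k in shared_keys}

def apply_global(client_state, global_base):
    """Overwrite a client's shared layers; its personal layers are untouched."""
    updated = dict(client_state)
    updated.update(global_base)
    return updated

clients = [
    {"base.w": np.array([1.0, 2.0]), "head.w": np.array([10.0])},
    {"base.w": np.array([3.0, 4.0]), "head.w": np.array([20.0])},
]
global_base = aggregate_base_layers(clients, personal_keys={"head.w"})
clients = [apply_global(c, global_base) for c in clients]
```

After one round both clients share the same base weights while their heads remain distinct, which is how each surveillance node can adapt to its own non-IID data.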
[CV-28] Global Intervention and Distillation for Federated Out-of-Distribution Generalization
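[Quick Read]: This paper addresses attribute skew in federated learning, where local models tend to learn non-causal associations, which leads to inconsistent optimization directions, performance degradation, and unstable convergence. Existing methods typically use data augmentation to increase sample diversity or knowledge distillation to learn invariant representations, but the unstable quality of generated data and the lack of domain information limit their performance on unseen samples. The paper proposes FedGID, a global intervention and distillation method. Its key idea is to use diverse attribute features for backdoor adjustment, breaking the spurious association between background and label. FedGID has two main modules. The global intervention module adaptively decouples objects and backgrounds in images and injects background information into random samples to intervene in the sample distribution, linking backgrounds to all categories so that the model does not treat background-label associations as causal. The global distillation module uses a unified knowledge base to guide representation learning in client models, preventing local models from overfitting to client-specific attributes. Experiments show that FedGID improves the model's ability to focus on the main subject in unseen data and outperforms existing methods in collaborative modeling.
Link: https://arxiv.org/abs/2504.00850
Authors: Zhuang Qi, Runhui Zhang, Lei Meng, Wei Wu, Yachong Zhang, Xiangxu Meng
Affiliations: School of Software, Shandong University, Jinan, China; Shandong Research Institute of Industrial Technology, Jinan, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: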
Abstract:Attribute skew in federated learning leads local models to focus on learning non-causal associations, guiding them towards inconsistent optimization directions, which inevitably results in performance degradation and unstable convergence. Existing methods typically leverage data augmentation to enhance sample diversity or employ knowledge distillation to learn invariant representations. However, the instability in the quality of generated data and the lack of domain information limit their performance on unseen samples. To address these issues, this paper presents a global intervention and distillation method, termed FedGID, which utilizes diverse attribute features for backdoor adjustment to break the spurious association between background and label. It includes two main modules, where the global intervention module adaptively decouples objects and backgrounds in images, injects background information into random samples to intervene in the sample distribution, which links backgrounds to all categories to prevent the model from treating background-label associations as causal. The global distillation module leverages a unified knowledge base to guide the representation learning of client models, preventing local models from overfitting to client-specific attributes. Experimental results on three datasets demonstrate that FedGID enhances the model’s ability to focus on the main subjects in unseen data and outperforms existing methods in collaborative modeling.
[CV-29] Zero-Shot 4D Lidar Panoptic Segmentation
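[Quick Read]: This paper addresses zero-shot 4D segmentation and recognition of arbitrary objects in Lidar, a key capability for embodied navigation with applications ranging from streaming perception to semantic mapping and localization. The main obstacle is the scarcity of datasets with the diversity and scale needed to develop general methods for spatio-temporal scene understanding. The paper proposes SAL-4D (Segment Anything in Lidar–4D). Its key idea is to use multi-modal robotic sensor setups as a bridge for distilling recent advances in Video Object Segmentation (VOS), together with off-the-shelf vision-language foundation models, to Lidar. Concretely, VOS models pseudo-label tracklets in short video sequences, the tracklets are annotated with sequence-level CLIP tokens, and they are lifted to 4D Lidar space through calibrated multi-modal sensing to distill the SAL-4D model. Thanks to temporally consistent predictions, the method outperforms prior work on 3D zero-shot Lidar Panoptic Segmentation (LPS) by more than 5 PQ and enables zero-shot 4D-LPS.
Link: https://arxiv.org/abs/2504.00848
Authors: Yushan Zhang, Aljoša Ošep, Laura Leal-Taixé, Tim Meinhardt
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: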
Abstract:Zero-shot 4D segmentation and recognition of arbitrary objects in Lidar is crucial for embodied navigation, with applications ranging from streaming perception to semantic mapping and localization. However, the primary challenge in advancing research and developing generalized, versatile methods for spatio-temporal scene understanding in Lidar lies in the scarcity of datasets that provide the necessary diversity and scale of this http URL overcome these challenges, we propose SAL-4D (Segment Anything in Lidar–4D), a method that utilizes multi-modal robotic sensor setups as a bridge to distill recent developments in Video Object Segmentation (VOS) in conjunction with off-the-shelf Vision-Language foundation models to Lidar. We utilize VOS models to pseudo-label tracklets in short video sequences, annotate these tracklets with sequence-level CLIP tokens, and lift them to the 4D Lidar space using calibrated multi-modal sensory setups to distill them to our SAL-4D model. Due to temporally consistent predictions, we outperform prior art in 3D Zero-Shot Lidar Panoptic Segmentation (LPS) by over 5 PQ, and unlock Zero-Shot 4D-LPS.
[CV-30] PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks
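[Quick Read]: This paper addresses the performance bottleneck in Scene Graph Generation (SGG) caused by training bias in fully supervised methods. These methods depend on small, carefully annotated datasets with long-tailed predicate distributions, so predicate diversity is limited and downstream tasks suffer. The paper proposes PRISM-0, a framework that uses foundation models in a bottom-up manner to capture the full spectrum of open-vocabulary predicate prediction. Detected object pairs are filtered and passed to a Vision Language Model (VLM) that generates descriptive captions; a Large Language Model (LLM) is then prompted with these captions to produce fine- and coarse-grained predicates; finally, a Visual Question Answering (VQA) model validates the predicates to build the final scene graph. This modular, dataset-independent design can enrich existing scene graph datasets such as Visual Genome, and it performs on par with the best fully supervised methods on downstream tasks such as image captioning and sentence-to-graph retrieval.
Link: https://arxiv.org/abs/2504.00844
Authors: Abdelrahman Elskhawy, Mengze Li, Nassir Navab, Benjamin Busam
Affiliations: Technical University of Munich; Zeiss Meditec AG
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: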
Abstract:In Scene Graph Generation (SGG), one extracts a structured representation from visual inputs in the form of object nodes and the predicates connecting them. This facilitates image-based understanding and reasoning for various downstream tasks. Although fully supervised SGG approaches have shown steady performance improvements, they suffer from a severe training bias. This is caused by the availability of only small subsets of curated data, which exhibit long-tail predicate distributions and a lack of predicate diversity that adversely affects downstream tasks. To overcome this, we introduce PRISM-0, a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach to capture the whole spectrum of diverse, open-vocabulary predicate prediction. Detected object pairs are filtered and passed to a Vision Language Model (VLM) that generates descriptive captions. These are used to prompt an LLM to generate fine- and coarse-grained predicates for the pair. The predicates are then validated using a VQA model to provide a final SGG. With the modular and dataset-independent PRISM-0, we can enrich existing SG datasets such as Visual Genome (VG). Experiments illustrate that PRISM-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval with a performance on par with the best fully supervised methods.
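[CV-31] The study of non-complete-ring positron emission tomography (PET) detection method
[Quick Read]: This paper addresses the degraded image reconstruction of incomplete-ring positron emission tomography (PET) systems, which arise from hardware failures, cost constraints, or specific clinical needs and suffer from incomplete data and geometric inconsistencies. It proposes a coarse-to-fine reconstruction framework: an Attention U-Net recovers complete sinograms, the OSEM algorithm produces a preliminary reconstruction, and a two-stage architecture consisting of a Coarse Prediction Module (CPM) and an Iterative Refinement Module (IRM) performs the fine reconstruction. At the input level, neighboring axial slices and spectral transform features serve as auxiliary guidance to ensure consistency in the spatial and frequency domains; at the output level, a contrastive diffusion strategy improves the correspondence between low-quality PET inputs and refined PET outputs. Experiments show that the method significantly outperforms existing approaches on metrics such as PSNR (35.6421 dB) and SSIM (0.9588) while preserving key anatomical structures and tracer distribution features.
Link: https://arxiv.org/abs/2504.00816
Authors: Yeqi Fang, Rong Zhou
Affiliations: Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Notes: 18 pages, 14 pages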
Abstract:Positron Emission Tomography (PET) is a vital molecular imaging tool widely used in medical diagnosis and treatment evaluation. Traditional PET systems typically rely on complete detector rings to achieve full angular coverage for uniform and statistically robust sampling of coincidence events. However, incomplete-ring PET scanners have emerged in various scenarios due to hardware failures, cost constraints, or specific clinical needs. In such cases, conventional reconstruction algorithms often suffer from performance degradation due to reduced data completeness and geometric inconsistencies. This thesis proposes a coarse-to-fine reconstruction framework for incomplete-ring PET scanners. The framework first employs an Attention U-Net model to recover complete sinograms from incomplete ones, then uses the OSEM algorithm for preliminary reconstruction, and finally applies a two-stage architecture comprising a Coarse Prediction Module (CPM) and an Iterative Refinement Module (IRM) for fine reconstruction. Our approach utilizes neighboring axial slices and spectral transform features as auxiliary guidance at the input level to ensure spatial and frequency domain consistency, and integrates a contrastive diffusion strategy at the output level to improve correspondence between low-quality PET inputs and refined PET outputs. Experimental results on public and in-house brain PET datasets demonstrate that the proposed method significantly outperforms existing approaches in metrics such as PSNR (35.6421 dB) and SSIM (0.9588), successfully preserving key anatomical structures and tracer distribution features, thus providing an effective solution for incomplete-ring PET imaging.
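The reported PSNR figure is easier to interpret with the metric's definition in hand. The following is a minimal sketch of the standard formula, 10 * log10(data_range^2 / MSE); the `data_range` convention used by the authors is not stated in the abstract, so the value below is an assumption.

```python
import numpy as np

def psnr(reference, estimate, data_range):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

reference = np.zeros((4, 4))
estimate = np.full((4, 4), 0.1)  # uniform error of 0.1, so MSE = 0.01
value = psnr(reference, estimate, data_range=1.0)  # about 20 dB
```

Since the scale is logarithmic, every additional 10 dB corresponds to a tenfold reduction in mean squared error at a fixed data range.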
[CV-32] Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data
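[Quick Read]: This paper addresses the reliance of Composed Image Retrieval (CIR) training on expensive, manually annotated triplets. Conventional methods need triplets of a reference image, a reformulation text, and a target image; curating them requires human effort, which is costly and limits the scalability of CIR training even when abundant unlabeled data is available. The paper proposes a new training paradigm in which large language models (LLMs) efficiently generate the required data in place of human annotation. The key components are an embedding reformulation architecture that combines image and text modalities and a model built on it, InstructCIR, which outperforms state-of-the-art methods in zero-shot composed image retrieval on the CIRR and FashionIQ datasets. The study also shows that increasing the amount of generated data brings the zero-shot model closer to supervised baselines.
Link: https://arxiv.org/abs/2504.00812
Authors: Yiqun Duan, Sameera Ramasinghe, Stephen Gould, Ajanthan Thalaiyasingam
Affiliations: Amazon; University of Technology Sydney; The Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Notes: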
Abstract:Composed Image Retrieval (CIR) is the task of retrieving images matching a reference image augmented with a text, where the text describes changes to the reference image in natural language. Traditionally, models designed for CIR have relied on triplet data containing a reference image, reformulation text, and a target image. However, curating such triplet data often necessitates human intervention, leading to prohibitive costs. This challenge has hindered the scalability of CIR model training even with the availability of abundant unlabeled data. With the recent advances in foundational models, we advocate a shift in the CIR training paradigm where human annotations can be efficiently replaced by large language models (LLMs). Specifically, we demonstrate the capability of large captioning and language models in efficiently generating data for CIR only relying on unannotated image collections. Additionally, we introduce an embedding reformulation architecture that effectively combines image and text modalities. Our model, named InstructCIR, outperforms state-of-the-art methods in zero-shot composed image retrieval on CIRR and FashionIQ datasets. Furthermore, we demonstrate that by increasing the amount of generated data, our zero-shot model gets closer to the performance of supervised baselines.
[CV-33] CellVTA: Enhancing Vision Foundation Models for Accurate Cell Segmentation and Classification
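[Quick Read]: This paper addresses the limited gains of vision foundation models based on Vision Transformers (ViTs) in cell instance segmentation. The main difficulty is that tokenization in ViTs substantially reduces the spatial resolution of the input image, which hurts the segmentation of small, densely packed cells. The paper proposes CellVTA (Cell Vision Transformer with Adapter), whose key component is a CNN-based adapter module. The adapter extracts high-resolution spatial information from the input image and injects it into the ViT through cross-attention, while preserving the core ViT architecture so that pretrained foundation models integrate seamlessly. Experiments show that CellVTA significantly outperforms state-of-the-art cell segmentation methods on the CoNIC and PanNuke datasets.
Link: https://arxiv.org/abs/2504.00784
Authors: Yang Yang, Xijie Xu, Yixun Zhou, Jie Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: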
Abstract:Cell instance segmentation is a fundamental task in digital pathology with broad clinical applications. Recently, vision foundation models, which are predominantly based on Vision Transformers (ViTs), have achieved remarkable success in pathology image analysis. However, their improvements in cell instance segmentation remain limited. A key challenge arises from the tokenization process in ViTs, which substantially reduces the spatial resolution of input images, leading to suboptimal segmentation quality, especially for small and densely packed cells. To address this problem, we propose CellVTA (Cell Vision Transformer with Adapter), a novel method that improves the performance of vision foundation models for cell instance segmentation by incorporating a CNN-based adapter module. This adapter extracts high-resolution spatial information from input images and injects it into the ViT through a cross-attention mechanism. Our method preserves the core architecture of ViT, ensuring seamless integration with pretrained foundation models. Extensive experiments show that CellVTA achieves 0.538 mPQ on the CoNIC dataset and 0.506 mPQ on the PanNuke dataset, which significantly outperforms the state-of-the-art cell segmentation methods. Ablation studies confirm the superiority of our approach over other fine-tuning strategies, including decoder-only fine-tuning and full fine-tuning. Our code and models are publicly available at this https URL.
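The injection mechanism can be pictured as ViT tokens querying the adapter's feature map. The sketch below is a generic single-head cross-attention with a residual connection, written in NumPy; the projection matrices, dimensions, and the residual form are assumptions made for illustration, not the CellVTA architecture itself.

```python
import numpy as np

def cross_attention(vit_tokens, cnn_features, w_q, w_k, w_v):
    """ViT tokens (queries) attend to CNN adapter features (keys and values).

    vit_tokens: (N, d_vit), cnn_features: (M, d_cnn)
    w_q: (d_vit, d_attn), w_k: (d_cnn, d_attn), w_v: (d_cnn, d_vit)
    """
    q = vit_tokens @ w_q
    k = cnn_features @ w_k
    v = cnn_features @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over CNN positions
    # Residual injection: the original ViT tokens are kept and enriched.
    return vit_tokens + weights @ v

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))   # 4 ViT tokens of width 8
feats = rng.normal(size=(16, 6))   # 16 high-resolution CNN positions of width 6
w_q = rng.normal(size=(8, 5))
w_k = rng.normal(size=(6, 5))
w_v = rng.normal(size=(6, 8))
out = cross_attention(tokens, feats, w_q, w_k, w_v)
```

Because the adapter output is added to the tokens instead of replacing them, the pretrained ViT weights can be loaded unchanged, which matches the abstract's point about seamless integration.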
[CV-34] Visual Environment-Interactive Planning for Embodied Complex-Question Answering
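[Quick Read]: This paper studies the Embodied Complex-Question Answering task, which requires an embodied robot to understand human questions with intricate structure and abstract semantics. Existing methods usually plan in a single step, relying on large models to generate a plan without sufficient understanding of the environment. To overcome this limitation, the paper proposes a framework that plans sequentially, step by step. The key is a structured semantic space in which hierarchical visual perception and a chain expression of the question essence interact iteratively, which makes sequential task planning possible. Concretely, the framework first parses natural language against a visual hierarchical scene graph to clarify the question's intent, then incorporates external rules to plan the current step, reducing reliance on large models. Each plan is generated from visual perception feedback and adjusted over multiple rounds of interaction until an answer is obtained, so the robot can refine its action strategy. To evaluate the framework, the authors contribute a new dataset with more complex questions. Experiments show strong and stable performance on complex tasks, and the approach's feasibility in real-world scenarios is demonstrated.
Link: https://arxiv.org/abs/2504.00775
Authors: Ning Lan, Baoshan Ou, Xuemei Xie, Guangming Shi
Affiliations: School of Artificial Intelligence, Xidian University, Xi’an 710071, China; Guangzhou Institute of Technology, Xidian University, Guangzhou 510000, China; Pazhou Laboratory, Huangpu, Guangzhou 510000, China; Peng Cheng Laboratory, Shenzhen 518055, China
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes: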
Abstract:This study focuses on the Embodied Complex-Question Answering task, in which an embodied robot needs to understand human questions with intricate structures and abstract semantics. The core of this task lies in making appropriate plans based on the perception of the visual environment. Existing methods often generate plans in a once-for-all manner, i.e., one-step planning. Such approaches rely on large models without sufficient understanding of the environment. Considering multi-step planning, this paper proposes a framework for formulating plans in a sequential manner. To ensure the ability of our framework to tackle complex questions, we create a structured semantic space, where hierarchical visual perception and chain expression of the question essence can achieve iterative interaction. This space makes sequential task planning possible. Within the framework, we first parse human natural language based on a visual hierarchical scene graph, which can clarify the intention of the question. Then, we incorporate external rules to make a plan for the current step, weakening the reliance on large models. Every plan is generated based on feedback from visual perception, with multiple rounds of interaction until an answer is obtained. This approach enables continuous feedback and adjustment, allowing the robot to optimize its action strategy. To test our framework, we contribute a new dataset with more complex questions. Experimental results demonstrate that our approach performs excellently and stably on complex tasks. The feasibility of our approach in real-world scenarios has also been established, indicating its practical applicability.
[CV-35] DropGaussian: Structural Regularization for Sparse-view Gaussian Splatting CVPR2025
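[Quick Read]: This paper addresses the sharp drop in novel-view image quality that 3D Gaussian Splatting (3DGS) suffers in sparse-view settings (for example, three input views) because it overfits the training views. Existing methods usually mitigate this with strong priors such as 2D generative contextual information and external depth signals. The paper proposes a prior-free method, DropGaussian, that makes a simple change to 3DGS. The key is to randomly remove Gaussians during training, in a manner similar to dropout, so that the remaining Gaussians receive larger gradients and become more visible, and therefore contribute more to optimization when rendering sparse input views. This simple operation alleviates overfitting and improves the quality of novel view synthesis. Applied to the original 3DGS framework, DropGaussian reaches performance competitive with existing prior-based 3DGS methods in sparse-view settings on benchmark datasets without additional complexity. The code and model are publicly available.
Link: https://arxiv.org/abs/2504.00773
Authors: Hyunwoo Park, Gun Ryu, Wonjun Kim
Affiliations: Konkuk University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by CVPR 2025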
Abstract:Recently, 3D Gaussian splatting (3DGS) has gained considerable attention in the field of novel view synthesis due to its fast performance and excellent image quality. However, 3DGS in sparse-view settings (e.g., three-view inputs) often faces the problem of overfitting to training views, which significantly reduces the visual quality of novel view images. Many existing approaches have tackled this issue by using strong priors, such as 2D generative contextual information and external depth signals. In contrast, this paper introduces a prior-free method, called DropGaussian, with simple changes to 3D Gaussian splatting. Specifically, we randomly remove Gaussians during the training process in a manner similar to dropout, which allows non-excluded Gaussians to have larger gradients while improving their visibility. This makes the remaining Gaussians contribute more to the optimization process for rendering with sparse input views. Such a simple operation effectively alleviates the overfitting problem and enhances the quality of novel view synthesis. By simply applying DropGaussian to the original 3DGS framework, we can achieve competitive performance with existing prior-based 3DGS methods in sparse-view settings of benchmark datasets without any additional complexity. The code and model are publicly available at: this https URL release.
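The dropout-like removal can be sketched as a random mask over per-Gaussian opacities applied at each training iteration. This is a simplified illustration: the function `drop_gaussians`, the choice to zero the opacity, and the drop rate are assumptions, and any compensation or scheduling used by the authors is not covered here.

```python
import numpy as np

def drop_gaussians(opacities, drop_rate, rng):
    """Randomly exclude Gaussians for one training iteration.

    opacities: (N,) array of per-Gaussian opacities.
    Returns the masked opacities and the boolean keep mask.
    """
    keep = rng.random(opacities.shape[0]) >= drop_rate
    # A dropped Gaussian contributes nothing to rendering in this iteration,
    # so Gaussians behind it along a ray are less occluded.
    return np.where(keep, opacities, 0.0), keep

rng = np.random.default_rng(0)
opacities = np.full(1000, 0.5)
masked, keep = drop_gaussians(opacities, drop_rate=0.2, rng=rng)
```

With fewer occluders in front of them, the surviving Gaussians receive more of each pixel's blending weight, which is one way to read the abstract's claim about larger gradients and improved visibility.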
zh
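The dropout-style removal described for DropGaussian can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `drop_gaussians`, the representation of Gaussians as a flat list of opacities, and the `1 / (1 - drop_rate)` rescaling of the kept opacities (borrowed from standard inverted dropout) are all assumptions.

```python
import random


def drop_gaussians(opacities, drop_rate, rng):
    """Dropout-style masking of per-Gaussian opacities for one training step.

    Assumption: kept Gaussians are rescaled by 1 / (1 - drop_rate), as in
    inverted dropout, and clamped to 1.0 so they remain valid opacities.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - drop_rate)
    masked = []
    for opacity in opacities:
        if rng.random() < drop_rate:
            masked.append(0.0)  # this Gaussian is skipped for this step
        else:
            masked.append(min(1.0, opacity * keep_scale))
    return masked


if __name__ == "__main__":
    rng = random.Random(0)
    print(drop_gaussians([0.3, 0.5, 0.9, 0.1], drop_rate=0.25, rng=rng))
```

At test time no Gaussian would be dropped, which corresponds to calling the function with `drop_rate=0.0`.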
[CV-36] Multi-Task Neural Architecture Search Using Architecture Embedding and Transfer Rank
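[Quick Read]: This paper addresses a problem in multi-task neural architecture search (NAS): ranking disorder between the source task and the target task degrades architecture performance on the downstream task. To improve cross-task transfer efficiency, the authors propose KTNAS, an evolutionary cross-task NAS method. KTNAS converts neural architectures into graphs and uses architecture embedding vectors for subsequent performance prediction, and it introduces transfer rank, an instance-based classifier, to counter the performance degradation. The key is the design of transfer rank, which markedly improves cross-task transfer. Experiments show that KTNAS outperforms peer multi-task NAS algorithms in search efficiency on NASBench-201, transfers well on Micro TransNAS-Bench-101, and scales to a range of vision tasks in the DARTS search space.
Link: https://arxiv.org/abs/2504.00772
Authors: TingJie Zhang,HaiLin Liu
Affiliations: School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou, China
Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
Comments: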
Abstract:Multi-task neural architecture search (NAS) enables transferring architectural knowledge among different tasks. However, ranking disorder between the source task and the target task degrades the architecture performance on the downstream task. We propose KTNAS, an evolutionary cross-task NAS algorithm, to enhance transfer efficiency. Our data-agnostic method converts neural architectures into graphs and uses architecture embedding vectors for the subsequent architecture performance prediction. The concept of transfer rank, an instance-based classifier, is introduced into KTNAS to address the performance degradation issue. We verify the search efficiency on NASBench-201 and transferability to various vision tasks on Micro TransNAS-Bench-101. The scalability of our method is demonstrated on DARTs search space including CIFAR-10/100, MNIST/Fashion-MNIST, MedMNIST. Experimental results show that KTNAS outperforms peer multi-task NAS algorithms in search efficiency and downstream task performance. Ablation studies demonstrate the vital importance of transfer rank for transfer performance.
[CV-37] UnIRe: Unsupervised Instance Decomposition for Dynamic Urban Scene Reconstruction
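[Quick Read]: This paper addresses unsupervised, instance-aware decomposition of dynamic urban scenes: splitting a scene into a static background and individual dynamic instances without manual annotations. This capability matters for instance-level scene understanding, particularly in autonomous driving, urban planning, and scene editing. The key innovation is the 4D superpoint, a new representation that clusters multi-frame LiDAR points in 4D space and enables unsupervised instance separation based on spatiotemporal correlations. These 4D superpoints serve as the basis for the decomposed initialization and support training a dynamic 3DGS model for arbitrary dynamic classes without bounding boxes or object labels. The paper also proposes a smoothness regularization strategy in both 2D and 3D space that further improves temporal consistency. Experiments show that the method outperforms existing approaches in decomposed dynamic scene reconstruction and supports accurate, flexible instance-level editing.
Link: https://arxiv.org/abs/2504.00763
Authors: Yunxuan Mao,Rong Xiong,Yue Wang,Yiyi Liao
Affiliations: Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: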
Abstract:Reconstructing and decomposing dynamic urban scenes is crucial for autonomous driving, urban planning, and scene editing. However, existing methods fail to perform instance-aware decomposition without manual annotations, which is crucial for instance-level scene understanding. We propose UnIRe, a 3D Gaussian Splatting (3DGS) based approach that decomposes a scene into a static background and individual dynamic instances using only RGB images and LiDAR point clouds. At its core, we introduce 4D superpoints, a novel representation that clusters multi-frame LiDAR points in 4D space, enabling unsupervised instance separation based on spatiotemporal correlations. These 4D superpoints serve as the foundation for our decomposed 4D initialization, i.e., providing spatial and temporal initialization to train a dynamic 3DGS for arbitrary dynamic classes without requiring bounding boxes or object labels. Additionally, we introduce a smoothness regularization strategy in both 2D and 3D space, further improving the temporal consistency. Experiments on benchmark datasets show that our method outperforms existing methods in decomposed dynamic scene reconstruction while enabling accurate and flexible instance-level editing, making it a practical solution for real-world applications.
[CV-38] MSSFC-Net: Enhancing Building Interpretation with Multi-Scale Spatial-Spectral Feature Collaboration
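[Quick Read]: This paper addresses the fact that building extraction and change detection in remote sensing imagery are usually modeled independently, which ignores their correlation and leaves shared feature representations underused. In addition, the diverse spectral, spatial, and scale characteristics of buildings make it hard to jointly model spatial-spectral multi-scale features and to balance precision against recall. To tackle these issues, the paper proposes a Multi-Scale Spatial-Spectral Feature Cooperative Dual-Task Network (MSSFC-Net). Its key idea is to integrate both tasks in a unified architecture and exploit their complementarity to extract building and change features simultaneously. Specifically, a Dual-branch Multi-scale Feature Extraction module (DMFE) with spatial-spectral feature collaboration strengthens multi-scale representation learning, capturing shallow texture details and deep semantic information to improve building extraction. A Multi-scale Differential Fusion Module (MDFM) explicitly models the interaction between differential and dual-temporal features, improving the network's ability to detect both large-area changes and subtle structural changes.
Link: https://arxiv.org/abs/2504.00759
Authors: Dehua Huo,Weida Zhan,Jinxin Guo,Depeng Zhu,Yu Chen,YiChun Jiang,Yueyi Han,Deng Han,Jin Li
Affiliations: National Demonstration Center for Experimental Electrical, Changchun University of Science and Technology, Changchun 130022, China; Jilin Province Zhixing IoT Research Institute Co., Ltd., Changchun 130117, China; Beihang University, School of Instrumentation and Optoelectronic Engineering, Beijing 100191, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: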
Abstract:Building interpretation from remote sensing imagery primarily involves two fundamental tasks: building extraction and change detection. However, most existing methods address these tasks independently, overlooking their inherent correlation and failing to exploit shared feature representations for mutual enhancement. Furthermore, the diverse spectral, spatial, and scale characteristics of buildings pose additional challenges in jointly modeling spatial-spectral multi-scale features and effectively balancing precision and recall. The limited synergy between spatial and spectral representations often results in reduced detection accuracy and incomplete change detection. To address these challenges, we propose a Multi-Scale Spatial-Spectral Feature Cooperative Dual-Task Network (MSSFC-Net) for joint building extraction and change detection in remote sensing images. The framework integrates both tasks within a unified architecture, leveraging their complementary nature to simultaneously extract building and change features. Specifically, a Dual-branch Multi-scale Feature Extraction module (DMFE) with Spatial-Spectral Feature Collaboration (SSFC) is designed to enhance multi-scale representation learning, effectively capturing shallow texture details and deep semantic information, thus improving building extraction performance. For temporal feature aggregation, we introduce a Multi-scale Differential Fusion Module (MDFM) that explicitly models the interaction between differential and dual-temporal features. This module refines the network's capability to detect large-area changes and subtle structural variations in buildings. Extensive experiments conducted on three benchmark datasets demonstrate that MSSFC-Net achieves superior performance in both building extraction and change detection tasks, effectively improving detection accuracy while maintaining completeness.
[CV-39] CAPE: Connectivity-Aware Path Enforcement Loss for Curvilinear Structure Delineation
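[Quick Read]: This paper addresses the poor connectivity of curvilinear structures in semantic segmentation, such as neuronal processes in biomedical scans and blood vessels in CT images. Traditional pixel-wise losses (e.g., cross-entropy and Dice) struggle to capture high-level topological connectivity, which leads to topological errors in graphs derived from the prediction maps. The key contribution is a new loss function, CAPE (Connectivity-Aware Path Enforcement). CAPE enforces connectivity in the segmentation map by optimizing a graph connectivity metric: it uses the graph representation of the ground truth to select node pairs, finds the corresponding paths in the predicted segmentation with a shortest-path algorithm, and penalizes both disconnections and false-positive connections, pushing the model to preserve topological correctness. Experiments show that CAPE significantly improves topology-aware metrics and outperforms state-of-the-art methods.
Link: https://arxiv.org/abs/2504.00753
Authors: Elyar Esmaeilzadeh,Ehsan Garaaghaji,Farzad Hallaji Azad,Doruk Oner
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: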
Abstract:Promoting the connectivity of curvilinear structures, such as neuronal processes in biomedical scans and blood vessels in CT images, remains a key challenge in semantic segmentation. Traditional pixel-wise loss functions, including cross-entropy and Dice losses, often fail to capture high-level topological connectivity, resulting in topological mistakes in graphs obtained from prediction maps. In this paper, we propose CAPE (Connectivity-Aware Path Enforcement), a novel loss function designed to enforce connectivity in graphs obtained from segmentation maps by optimizing a graph connectivity metric. CAPE uses the graph representation of the ground truth to select node pairs and determine their corresponding paths within the predicted segmentation through a shortest-path algorithm. Using this, we penalize both disconnections and false positive connections, effectively promoting the model to preserve topological correctness. Experiments on 2D and 3D datasets, including neuron and blood vessel tracing demonstrate that CAPE significantly improves topology-aware metrics and outperforms state-of-the-art methods.
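The path-based idea behind CAPE can be illustrated with a small, non-differentiable sketch. It is not the authors' loss: it only shows how a shortest-path search over a predicted probability map exposes a disconnection between two nodes that the ground truth says are connected. The function name `path_cost`, the 4-connected grid, and the per-pixel cost `1 - probability` are assumptions made for illustration.

```python
import heapq


def path_cost(prob, start, goal):
    """Cost of the cheapest 4-connected path from start to goal.

    Entering a pixel costs 1 - prob[r][c], so a path that stays on confident
    foreground is nearly free, while crossing a gap is expensive.
    """
    rows, cols = len(prob), len(prob[0])
    best = {start: 1.0 - prob[start[0]][start[1]]}
    heap = [(best[start], start)]
    while heap:
        cost, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return cost
        if cost > best[(r, c)]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + 1.0 - prob[nr][nc]
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    heapq.heappush(heap, (new_cost, (nr, nc)))
    return float("inf")


if __name__ == "__main__":
    broken = [[1.0, 0.0, 1.0]]  # predicted vessel with a one-pixel gap
    print(path_cost(broken, (0, 0), (0, 2)))
```

A high path cost for a node pair that is connected in the ground-truth graph signals a disconnection that a training loss could penalize.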
[CV-40] Scaling Up Resonate-and-Fire Networks for Fast Deep Learning
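[Quick Read]: This paper addresses the difficulties in parameter initialization and efficient learning that have limited how deep resonate-and-fire (RF) neuron networks can be made. The key innovation is to derive the RF neuron as a structured state space model (SSM) from the HiPPO framework and to propose a new SSM layer, S5-RF. With a generic initialization scheme and fast training, S5-RF scales an RF network to a deep SNN with up to four layers for the first time and sets a new state-of-the-art result of 78.8% on the Spiking Speech Commands dataset while using far fewer spiking operations. The core of the solution is combining the efficiency of the S5 model with the biological plausibility of RF neurons, which overcomes the limitations of earlier RF networks.
Link: https://arxiv.org/abs/2504.00719
Authors: Thomas E. Huber,Jules Lecomte,Borislav Polovnikov,Axel von Arnim
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 3 figures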
Abstract:Spiking neural networks (SNNs) present a promising computing paradigm for neuromorphic processing of event-based sensor data. The resonate-and-fire (RF) neuron, in particular, appeals through its biological plausibility, complex dynamics, yet computational simplicity. Despite theoretically predicted benefits, challenges in parameter initialization and efficient learning inhibited the implementation of RF networks, constraining their use to a single layer. In this paper, we address these shortcomings by deriving the RF neuron as a structured state space model (SSM) from the HiPPO framework. We introduce S5-RF, a new SSM layer comprised of RF neurons based on the S5 model, that features a generic initialization scheme and fast training within a deep architecture. S5-RF scales for the first time a RF network to a deep SNN with up to four layers and achieves with 78.8% a new state-of-the-art result for recurrent SNNs on the Spiking Speech Commands dataset in under three hours of training time. Moreover, compared to the reference SNNs that solve our benchmarking tasks, it achieves similar performance with much fewer spiking operations. Our code is publicly available at this https URL.
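For readers unfamiliar with resonate-and-fire neurons, the sketch below simulates a single neuron using the classic complex-valued formulation dz/dt = (b + iω)z + I with forward-Euler steps. It does not reproduce the S5-RF layer or its HiPPO-based initialization; the function name, the parameter values, the threshold on the imaginary part, and the reset to zero are illustrative assumptions.

```python
def simulate_rf(inputs, b=-1.0, omega=10.0, dt=0.01, threshold=0.5):
    """Forward-Euler simulation of one resonate-and-fire neuron.

    The state z is complex: b < 0 damps the oscillation, omega sets its
    angular frequency. A spike is emitted when the imaginary part of z
    exceeds the threshold, after which the state is reset to zero
    (the reset rule is an assumption of this sketch).
    """
    z = 0j
    spikes = []
    for current in inputs:
        z = z + dt * ((b + 1j * omega) * z + current)
        if z.imag > threshold:
            spikes.append(1)
            z = 0j
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    # One strong input pulse followed by silence.
    print(simulate_rf([100.0] + [0.0] * 30))
```

Because the state oscillates, the spike appears a few steps after the input pulse rather than immediately, which is the resonant behaviour that distinguishes RF neurons from leaky integrate-and-fire neurons.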
[CV-41] ToVE: Efficient Vision-Language Learning via Knowledge Transfer from Vision Experts ICLR2025
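[Quick Read]: This paper addresses the dependence of vision-language (VL) learning on massive datasets and huge models and proposes a more efficient alternative: a framework that Transfers the knowledge from a hub of Vision Experts (ToVE). Building on a frozen CLIP encoder that provides vision tokens for image-conditioned language generation, ToVE introduces a hub of multiple pre-trained vision experts and a token-aware gating network that dynamically routes expert knowledge to the vision tokens. A "residual knowledge transfer" strategy preserves the generalizability of the vision tokens and allows low-contributing experts to be detached to improve inference efficiency. The expert knowledge can further be merged into a single CLIP encoder, yielding a knowledge-merged CLIP that produces more informative vision tokens without expert inference at deployment. Experiments show that ToVE achieves performance competitive with existing methods on a variety of VL tasks while using two orders of magnitude less training data.
Link: https://arxiv.org/abs/2504.00691
Authors: Yuanchen Wu,Junlong Du,Ke Yan,Shouhong Ding,Xiaoqiang Li
Affiliations: School of Computer Engineering and Science, Shanghai University, Shanghai; Tencent Youtu Lab, Shanghai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025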
Abstract:Vision-language (VL) learning requires extensive visual perception capabilities, such as fine-grained object recognition and spatial perception. Recent works typically rely on training huge models on massive datasets to develop these capabilities. As a more efficient alternative, this paper proposes a new framework that Transfers the knowledge from a hub of Vision Experts (ToVE) for efficient VL learning, leveraging pre-trained vision expert models to promote visual perception capability. Specifically, building on a frozen CLIP encoder that provides vision tokens for image-conditioned language generation, ToVE introduces a hub of multiple vision experts and a token-aware gating network that dynamically routes expert knowledge to vision tokens. In the transfer phase, we propose a “residual knowledge transfer” strategy, which not only preserves the generalizability of the vision tokens but also allows detachment of low-contributing experts to improve inference efficiency. Further, we explore to merge these expert knowledge to a single CLIP encoder, creating a knowledge-merged CLIP that produces more informative vision tokens without expert inference during deployment. Experiment results across various VL tasks demonstrate that the proposed ToVE achieves competitive performance with two orders of magnitude fewer training data.
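The token-aware gating and residual transfer described for ToVE can be pictured with a toy example. The sketch below is only a guess at the general shape of such a mechanism, not the paper's architecture: the linear gate, the softmax over experts, and the names `route_experts` and `gate_weights` are assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_experts(token, expert_feats, gate_weights):
    """Mix expert features for one vision token and add them residually.

    token:        CLIP vision token (list of floats)
    expert_feats: one feature vector per expert, same length as token
    gate_weights: one weight vector per expert (hypothetical linear gate)
    """
    logits = [sum(w * t for w, t in zip(ws, token)) for ws in gate_weights]
    gates = softmax(logits)
    mixed = [
        sum(g * feat[d] for g, feat in zip(gates, expert_feats))
        for d in range(len(token))
    ]
    # Residual transfer: the original token is kept and expert knowledge is added.
    return [t + m for t, m in zip(token, mixed)], gates


if __name__ == "__main__":
    out, gates = route_experts(
        [1.0, 2.0], [[2.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]]
    )
    print(out, gates)
```

Because the expert contribution is purely additive, an expert whose gate value stays small could be removed at inference time with little change to the output, which is one plausible reading of how low-contributing experts are detached.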
[CV-42] Monocular and Generalizable Gaussian Talking Head Animation CVPR2025
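[Quick Read]: This paper addresses the incomplete geometric and appearance information that affects high-fidelity talking head animation driven by monocular video when multi-view and personalized training data are unavailable. To meet this challenge, it proposes Monocular and Generalizable Gaussian Talking Head Animation (MGGTalk), whose key idea is to use depth information to strengthen geometric symmetry and the completeness of facial features in 3D Gaussian Splatting (3DGS). Specifically, starting from the pixel-wise geometric information given by depth estimation, it combines symmetry operations with point cloud filtering to refine the 3DGS position parameters. It then predicts the remaining parameters with a two-stage strategy that uses symmetric priors: Gaussian parameters are first predicted for the visible facial regions of the source image and are then used to improve the prediction for the non-visible regions. Experiments show that MGGTalk surpasses previous state-of-the-art methods across multiple metrics.
Link: https://arxiv.org/abs/2504.00665
Authors: Shengjie Gong,Haojie Li,Jiapeng Tang,Dongming Hu,Shuangping Huang,Hao Chen,Tianshui Chen,Zhuoman Liu
Affiliations: South China University of Technology; Technical University of Munich; Pazhou Laboratory; Guangdong University of Technology; The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025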
Abstract:In this work, we introduce Monocular and Generalizable Gaussian Talking Head Animation (MGGTalk), which requires monocular datasets and generalizes to unseen identities without personalized re-training. Compared with previous 3D Gaussian Splatting (3DGS) methods that require elusive multi-view datasets or tedious personalized learning/inference, MGGTalk enables more practical and broader applications. However, in the absence of multi-view and personalized training data, the incompleteness of geometric and appearance information poses a significant challenge. To address these challenges, MGGTalk explores depth information to enhance geometric and facial symmetry characteristics to supplement both geometric and appearance features. Initially, based on the pixel-wise geometric information obtained from depth estimation, we incorporate symmetry operations and point cloud filtering techniques to ensure a complete and precise position parameter for 3DGS. Subsequently, we adopt a two-stage strategy with symmetric priors for predicting the remaining 3DGS parameters. We begin by predicting Gaussian parameters for the visible facial regions of the source image. These parameters are subsequently utilized to improve the prediction of Gaussian parameters for the non-visible regions. Extensive experiments demonstrate that MGGTalk surpasses previous state-of-the-art methods, achieving superior performance across various metrics.
[CV-43] QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA
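[Quick Read]: This paper addresses the high GPU memory usage and computational overhead that multi-modal large language models (MLLMs) incur in open-world visual question answering (VQA), because integrating visual information increases the number of tokens to process. Images also tend to contain more redundant information than text, and not all visual details are relevant to a given question. To address these challenges, the paper proposes QG-VTC (Question-Guided Visual Token Compression). The key idea is to embed the user question into the vision encoder's feature space with a pretrained text encoder and a learnable feed-forward layer, and then compute correlation scores between the question embedding and the visual tokens. By selecting the most relevant tokens and softly compressing the others, QG-VTC keeps the retained tokens closely matched to the user's question. A progressive strategy applies this compression at different vision encoder layers, gradually reducing the token count so that question-relevant information is retained while irrelevant detail is discarded. Experiments show that the method matches the performance of uncompressed models while using only 1/8 of the visual tokens. The code and model will be released on GitHub.
Link: https://arxiv.org/abs/2504.00654
Authors: Shuai Li,Jian Xu,Xiao-Hui Li,Chao Deng,Lin-Lin Huang
Affiliations: Beijing JiaoTong University; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: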
Abstract:Recent advances in Multi-modal Large Language Models (MLLMs) have shown significant progress in open-world Visual Question Answering (VQA). However, integrating visual information increases the number of processed tokens, leading to higher GPU memory usage and computational overhead. Images often contain more redundant information than text, and not all visual details are pertinent to specific questions. To address these challenges, we propose QG-VTC, a novel question-guided visual token compression method for MLLM-based VQA tasks. QG-VTC employs a pretrained text encoder and a learnable feed-forward layer to embed user questions into the vision encoder’s feature space then computes correlation scores between the question embeddings and visual tokens. By selecting the most relevant tokens and softly compressing others, QG-VTC ensures fine-tuned relevance to user needs. Additionally, a progressive strategy applies this compression across different vision encoder layers, gradually reducing token numbers. This approach maximizes retention of question-relevant information while discarding irrelevant details. Experimental results show that our method achieves performance on par with uncompressed models using just 1/8 of the visual tokens. The code and model will be publicly available on GitHub.
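The select-and-compress step described for QG-VTC might look roughly like the sketch below. It is one possible reading of the abstract, not the published method: dot-product relevance scores, top-k selection, and merging the remaining tokens into a single softmax-weighted average are assumptions, as are the names `compress_tokens` and `keep`.

```python
import math


def compress_tokens(question, tokens, keep):
    """Keep the `keep` most question-relevant tokens and merge the rest.

    Relevance is the dot product between the question embedding and each
    visual token. Tokens that are not kept are collapsed into one token by
    a softmax-weighted average of their features.
    """
    scores = [sum(q * t for q, t in zip(question, tok)) for tok in tokens]
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # preserve the original spatial order
    rest_idx = order[keep:]
    kept = [tokens[i] for i in kept_idx]
    if not rest_idx:
        return kept
    peak = max(scores[i] for i in rest_idx)
    exps = [math.exp(scores[i] - peak) for i in rest_idx]
    total = sum(exps)
    merged = [
        sum((e / total) * tokens[i][d] for e, i in zip(exps, rest_idx))
        for d in range(len(question))
    ]
    return kept + [merged]


if __name__ == "__main__":
    question = [1.0, 0.0]
    tokens = [[3.0, 0.0], [0.0, 1.0], [0.0, 3.0], [2.0, 0.0]]
    print(compress_tokens(question, tokens, keep=2))
```

Applying a function like this at several encoder layers with a shrinking `keep` would give the progressive reduction the abstract mentions.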
[CV-44] FDDet: Frequency-Decoupling for Boundary Refinement in Temporal Action Detection
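[Quick Read]: This paper addresses the context confusion and imprecise boundaries caused by background clutter and irrelevant semantics when localizing and classifying actions in untrimmed videos. To meet this challenge, it proposes a frequency-aware decoupling network. The key component is an adaptive temporal decoupling scheme that suppresses irrelevant information while preserving fine-grained atomic action details, producing more task-specific feature representations. The paper also strengthens inter-frame modeling by capturing temporal variations, which helps separate actions from background redundancy, and designs a long-short-term category-aware relation network that jointly models local transitions and long-range dependencies to improve localization precision. The refined atomic features and frequency-guided dynamics are then fed into a standard detection head for action prediction. Experiments show state-of-the-art performance on THUMOS14, HACS, and ActivityNet-1.3.
Link: https://arxiv.org/abs/2504.00647
Authors: Xinnan Zhu,Yicheng Zhu,Tixin Chen,Wentao Wu,Yuanjie Dang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: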
Abstract:Temporal action detection aims to locate and classify actions in untrimmed videos. While recent works focus on designing powerful feature processors for pre-trained representations, they often overlook the inherent noise and redundancy within these features. Large-scale pre-trained video encoders tend to introduce background clutter and irrelevant semantics, leading to context confusion and imprecise boundaries. To address this, we propose a frequency-aware decoupling network that improves action discriminability by filtering out noisy semantics captured by pre-trained models. Specifically, we introduce an adaptive temporal decoupling scheme that suppresses irrelevant information while preserving fine-grained atomic action details, yielding more task-specific representations. In addition, we enhance inter-frame modeling by capturing temporal variations to better distinguish actions from background redundancy. Furthermore, we present a long-short-term category-aware relation network that jointly models local transitions and long-range dependencies, improving localization precision. The refined atomic features and frequency-guided dynamics are fed into a standard detection head to produce accurate action predictions. Extensive experiments on THUMOS14, HACS, and ActivityNet-1.3 show that our method, powered by InternVideo2-6B features, achieves state-of-the-art performance on temporal action detection benchmarks.
[CV-45] POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation CVPR2025
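[Quick Read]: This paper addresses the imprecise segmentation results and the hallucinations in text responses that affect reasoning segmentation methods based on Large Visual-Language Models (LVLMs). It proposes a new framework, POPEN. The key to POPEN is a preference-based optimization method that fine-tunes the LVLM to align better with human preferences, yielding higher-quality text responses and segmentation results. POPEN also introduces a preference-based ensemble method that, at inference time, integrates multiple LVLM outputs through an attention mechanism driven by preference scores. To fit the segmentation task, the framework adds task-specific designs, including a new approach for collecting segmentation preference data with a curriculum learning mechanism and a new preference optimization loss that improves the LVLM's segmentation capability. Experiments show that the method achieves state-of-the-art performance in reasoning segmentation, with markedly fewer hallucinations in text responses and the highest segmentation accuracy.
Link: https://arxiv.org/abs/2504.00640
Authors: Lanyun Zhu,Tianrun Chen,Qianxiong Xu,Xuanyi Liu,Deyi Ji,Haiyang Wu,De Wen Soh,Jun Liu
Affiliations: Singapore University of Technology and Design; Tencent; Zhejiang University; Nanyang Technological University; Peking University; Lancaster University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025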
Abstract:Existing LVLM-based reasoning segmentation methods often suffer from imprecise segmentation results and hallucinations in their text responses. This paper introduces POPEN, a novel framework designed to address these issues and achieve improved results. POPEN includes a preference-based optimization method to finetune the LVLM, aligning it more closely with human preferences and thereby generating better text responses and segmentation results. Additionally, POPEN introduces a preference-based ensemble method for inference, which integrates multiple outputs from the LVLM using a preference-score-based attention mechanism for refinement. To better adapt to the segmentation task, we incorporate several task-specific designs in our POPEN framework, including a new approach for collecting segmentation preference data with a curriculum learning mechanism, and a novel preference optimization loss to refine the segmentation capability of the LVLM. Experiments demonstrate that our method achieves state-of-the-art performance in reasoning segmentation, exhibiting minimal hallucination in text responses and the highest segmentation accuracy compared to previous advanced methods like LISA and PixelLM. Project page is this https URL
[CV-46] Coca-Splat: Collaborative Optimization for Camera Parameters and 3D Gaussians
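[Quick Read]: This paper addresses sparse-view pose-free scene reconstruction and novel view synthesis (NVS). It proposes Coca-Splat, which jointly optimizes 3D Gaussians and camera parameters. The key is to design separate queries for the 3D Gaussians and for the camera parameters and to update them layer by layer through deformable Transformer layers, so that joint optimization happens inside a single network. With camera-aware multi-view deformable cross-attention (CaMDFA), the 2D reference points obtained by projecting with the camera parameters intrinsically link the 3D Gaussians to the camera parameters. In addition, rays determined by the 2D reference points (RayRef), defined from the camera centers to the reference points, strengthen the relationship between 3D Gaussians and camera parameters through RQ decomposition, further improving performance. Experiments show that the method outperforms existing methods on RealEstate10K and ACID in the pose-free setting.
Link: https://arxiv.org/abs/2504.00639
Authors: Jiamin Wu,Hongyang Li,Xiaoke Jiang,Yuan Yao,Lei Zhang
Affiliations: Hong Kong University of Science and Technology; International Digital Economy Academy (IDEA)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: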
Abstract:In this work, we introduce Coca-Splat, a novel approach to addressing the challenges of sparse view pose-free scene reconstruction and novel view synthesis (NVS) by jointly optimizing camera parameters with 3D Gaussians. Inspired by deformable DEtection TRansformer, we design separate queries for 3D Gaussians and camera parameters and update them layer by layer through deformable Transformer layers, enabling joint optimization in a single network. This design demonstrates better performance because to accurately render views that closely approximate ground-truth images relies on precise estimation of both 3D Gaussians and camera parameters. In such a design, the centers of 3D Gaussians are projected onto each view by camera parameters to get projected points, which are regarded as 2D reference points in deformable cross-attention. With camera-aware multi-view deformable cross-attention (CaMDFA), 3D Gaussians and camera parameters are intrinsically connected by sharing the 2D reference points. Additionally, 2D reference point determined rays (RayRef) defined from camera centers to the reference points assist in modeling relationship between 3D Gaussians and camera parameters through RQ-decomposition on an overdetermined system of equations derived from the rays, enhancing the relationship between 3D Gaussians and camera parameters. Extensive evaluation shows that our approach outperforms previous methods, both pose-required and pose-free, on RealEstate10K and ACID within the same pose-free setting.
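The step in which Gaussian centers are projected onto each view to obtain 2D reference points is a standard pinhole projection. The sketch below shows that projection only, under assumed conventions (world-to-camera rotation and translation, intrinsics given as `fx`, `fy`, `cx`, `cy`); the function name is hypothetical and nothing here reproduces the paper's attention or RQ-decomposition components.

```python
def project_point(point, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    rotation is a 3x3 world-to-camera matrix (list of rows) and translation
    a 3-vector. Returns None when the point lies behind the camera.
    """
    cam = [
        sum(rotation[r][k] * point[k] for k in range(3)) + translation[r]
        for r in range(3)
    ]
    x, y, z = cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    center = (1.0, 2.0, 2.0)  # a hypothetical Gaussian center
    print(project_point(center, identity, [0.0, 0.0, 0.0], 100.0, 100.0, 50.0, 40.0))
```

Because the reference point depends on both the Gaussian center and the camera parameters, an error in either one moves it, which is the coupling the joint optimization relies on.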
[CV-47] Bi-Grid Reconstruction for Image Anomaly Detection
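[Quick Read]: This paper addresses the limitations of existing unsupervised and self-supervised image anomaly detection methods on fine-grained anomalies. Such methods often fail to capture subtle anomalous features, especially when trained on datasets that contain only normal samples. To meet this challenge, the paper proposes GRAD (Bi-Grid Reconstruction for Image Anomaly Detection). Its core ideas are: 1) two continuous grids serve as feature repositories, which improves generalization and mitigates the Identical Shortcut (IS) problem; 2) an abnormal feature grid refines the boundaries of normal features, which significantly strengthens detection of subtle defects; 3) a Feature Block Paste (FBP) module synthesizes varied anomalies at the feature level so that the abnormal grid can be deployed quickly. GRAD's strong representation ability also lets a single model handle multiple classes. Experiments on MVTecAD, VisA, and GoodsAD show significant gains in fine-grained anomaly detection, both in overall accuracy and in distinguishing subtle differences, surpassing existing methods.
Link: https://arxiv.org/abs/2504.00609
Authors: Huichuan Huang,Zhiqing Zhong,Guangyu Wei,Yonghao Wan,Wenlong Sun,Aimin Feng
Affiliations: Nanjing University of Aeronautics and Astronautics, China; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: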
Abstract:In image anomaly detection, significant advancements have been made using un- and self-supervised methods with datasets containing only normal samples. However, these approaches often struggle with fine-grained anomalies. This paper introduces GRAD: Bi-Grid Reconstruction for Image Anomaly Detection, which employs two continuous grids to enhance anomaly detection from both normal and abnormal perspectives. In this work: 1) Grids as feature repositories that improve generalization and mitigate the Identical Shortcut (IS) issue; 2) An abnormal feature grid that refines normal feature boundaries, boosting detection of fine-grained defects; 3) The Feature Block Paste (FBP) module, which synthesizes various anomalies at the feature level for quick abnormal grid deployment. GRAD's robust representation capabilities also allow it to handle multiple classes with a single model. Evaluations on datasets like MVTecAD, VisA, and GoodsAD show significant performance improvements in fine-grained anomaly detection. GRAD excels in overall accuracy and in discerning subtle differences, demonstrating its superiority over existing methods.
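Feature-level anomaly synthesis of the kind attributed to the Feature Block Paste module can be pictured as copying a rectangular block from one feature map into another. The sketch below is a loose illustration under strong assumptions: single-channel feature maps stored as nested lists, a donor map of the same size, and a block pasted at the same location. The name `feature_block_paste` and all of its parameters are hypothetical.

```python
def feature_block_paste(target, donor, top, left, height, width):
    """Paste a block of donor features into a copy of the target map.

    Returns the modified map and a binary mask marking the pasted region,
    which could serve as the pseudo-anomaly label.
    """
    pasted = [row[:] for row in target]  # leave the input map untouched
    mask = [[0] * len(row) for row in target]
    for r in range(top, top + height):
        for c in range(left, left + width):
            pasted[r][c] = donor[r][c]
            mask[r][c] = 1
    return pasted, mask


if __name__ == "__main__":
    normal = [[0.0] * 3 for _ in range(3)]
    other = [[9.0] * 3 for _ in range(3)]
    synthetic, label = feature_block_paste(normal, other, top=1, left=1, height=2, width=1)
    print(synthetic)
    print(label)
```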
[CV-48] Sample-level Adaptive Knowledge Distillation for Action Recognition
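[Quick Read]: This paper addresses two important problems that Knowledge Distillation (KD) faces in video analysis: 1) the capacity gap between teacher and student networks means that knowledge about difficult-to-transfer samples cannot be transferred correctly and may even hurt the student's final performance; 2) how hard a sample is to transfer can change dynamically during training. To alleviate both, the paper proposes a Sample-level Adaptive Knowledge Distillation (SAKD) framework for action recognition. It consists of a sample distillation difficulty evaluation module and a sample adaptive distillation module. The former applies temporal interruption at the frame level (randomly dropping or shuffling frames) to raise the learning difficulty of samples during distillation, which makes their distillation difficulty easier to assess. The latter adaptively adjusts the distillation weight per sample, so that the KD loss dominates for easy-to-transfer samples and the vanilla loss dominates for difficult-to-transfer ones. In addition, only samples with low distillation difficulty and high diversity are selected to train the student, which reduces computational cost. Experiments confirm that the method strikes a good balance between performance and efficiency.
Link: https://arxiv.org/abs/2504.00606
Authors: Ping Li,Chenhao Ping,Wenxiao Wang,Mingli Song
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: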
Abstract:Knowledge Distillation (KD) compresses neural networks by learning a small network (student) via transferring knowledge from a pre-trained large network (teacher). Many endeavours have been devoted to the image domain, while few works focus on video analysis, which requires training much larger models that can hardly be deployed on resource-limited devices. However, traditional methods neglect two important problems, i.e., 1) Since a capacity gap between the teacher and the student exists, some knowledge w.r.t. difficult-to-transfer samples cannot be correctly transferred, or even badly affects the final performance of the student, and 2) As training progresses, difficult-to-transfer samples may become easier to learn, and vice versa. To alleviate the two problems, we propose a Sample-level Adaptive Knowledge Distillation (SAKD) framework for action recognition. In particular, it mainly consists of the sample distillation difficulty evaluation module and the sample adaptive distillation module. The former applies temporal interruption to frames, i.e., randomly dropping or shuffling the frames during training, which increases the learning difficulty of samples during distillation, so as to better discriminate their distillation difficulty. The latter module adaptively adjusts the distillation ratio at the sample level, such that the KD loss dominates the training with easy-to-transfer samples while the vanilla loss dominates that with difficult-to-transfer samples. More importantly, we only select those samples with both low distillation difficulty and high diversity to train the student model for reducing computational cost. Experimental results on two video benchmarks and one image benchmark demonstrate the superiority of the proposed method by striking a good balance between performance and efficiency.
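The sample-level weighting described for SAKD, where the KD loss dominates for easy samples and the vanilla loss for hard ones, can be written as a convex combination. The sketch below assumes a difficulty score already normalized to [0, 1] and a linear mapping from difficulty to weight; both are assumptions, and the paper's actual difficulty measure and schedule are not reproduced here.

```python
def sample_loss(kd_loss, task_loss, difficulty):
    """Blend KD and vanilla task losses for a single sample.

    difficulty = 0.0 means easy to transfer (pure KD loss);
    difficulty = 1.0 means hard to transfer (pure task loss).
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    return (1.0 - difficulty) * kd_loss + difficulty * task_loss


if __name__ == "__main__":
    # A batch of three hypothetical samples: (kd_loss, task_loss, difficulty).
    batch = [(2.0, 4.0, 0.0), (2.0, 4.0, 0.25), (2.0, 4.0, 1.0)]
    print([sample_loss(*sample) for sample in batch])
```

Since distillation difficulty can change as training progresses, the difficulty score would be recomputed periodically rather than fixed once.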
[CV-49] Continual Cross-Modal Generalization
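[Quick Read]: This paper addresses a challenge in cross-modal generalization: learning a shared discrete representation space from paired multimodal data so that knowledge can transfer across unannotated modalities. Building a unified representation for all modality pairs usually requires large amounts of paired data, which is often impractical. The paper therefore proposes a continual learning approach that incrementally maps new modalities into a shared discrete codebook through a mediator modality. The key component is the Continual Mixture of Experts Adapter (CMoE-Adapter), which projects diverse modalities into a unified space while preserving prior knowledge. To align semantics across stages, the paper introduces a Pseudo-Modality Replay (PMR) mechanism together with a dynamically expanding codebook, so that the model can adaptively incorporate new modalities with already-learned ones as guidance. Experiments show strong performance on image-text, audio-text, video-text, and speech-text cross-modal generalization tasks.
Link: https://arxiv.org/abs/2504.00561
Authors: Yan Xia,Hai Huang,Minghui Fang,Zhou Zhao
Affiliations: Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: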
Abstract:Cross-modal generalization aims to learn a shared discrete representation space from multimodal pairs, enabling knowledge transfer across unannotated modalities. However, achieving a unified representation for all modality pairs requires extensive paired data, which is often impractical. Inspired by the availability of abundant bimodal data (e.g., in ImageBind), we explore a continual learning approach that incrementally maps new modalities into a shared discrete codebook via a mediator modality. We propose the Continual Mixture of Experts Adapter (CMoE-Adapter) to project diverse modalities into a unified space while preserving prior knowledge. To align semantics across stages, we introduce a Pseudo-Modality Replay (PMR) mechanism with a dynamically expanding codebook, enabling the model to adaptively incorporate new modalities using learned ones as guidance. Extensive experiments on image-text, audio-text, video-text, and speech-text show that our method achieves strong performance on various cross-modal generalization tasks. Code is provided in the supplementary material.
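A shared discrete codebook works by mapping each feature vector, whatever its modality, to the index of its nearest code. The sketch below shows only that generic vector-quantization lookup; the function name `quantize`, the squared Euclidean distance, and the toy codebook are illustrative assumptions, and the CMoE-Adapter and PMR components are not modeled.

```python
def quantize(feature, codebook):
    """Return the index of the codebook entry nearest to the feature."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(feature, codebook[i]),
    )


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical 2-entry codebook
    image_feature = [0.9, 1.2]           # hypothetical projected image feature
    text_feature = [1.1, 0.8]            # hypothetical projected text feature
    print(quantize(image_feature, codebook), quantize(text_feature, codebook))
```

When features from two modalities land on the same code index, a model trained on one modality's codes can be applied to the other, which is the transfer that cross-modal generalization aims for.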
[CV-50] AttentiveGRU: Recurrent Spatio-Temporal Modeling for Advanced Radar-Based BEV Object Detection
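[Quick Read]: This paper addresses the limited effectiveness of traditional single-frame approaches to radar-based bird's-eye view (BEV) object detection, which stems from the inherently sparse and non-deterministic nature of radar data. The key contribution is AttentiveGRU, a new attention-based recurrent approach designed for the constraints of radar data. AttentiveGRU extracts an individual spatio-temporal context for each object by dynamically identifying and fusing temporally correlated structures across the present state and the memory state. By exploiting the consistency of an object's latent representation over time, it uses temporal relations to enrich feature representations for both stationary and moving objects, which improves detection and removes the need to provide or estimate ego-vehicle motion. Experiments on nuScenes show a 21% increase in mAP for the car category, supporting the method's effectiveness and applicability.
Link: https://arxiv.org/abs/2504.00559
Authors: Loveneet Saini,Mirko Meuter,Hasan Tercan,Tobias Meisen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: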
Abstract:Bird's-eye view (BEV) object detection has become important for advanced automotive 3D radar-based perception systems. However, the inherently sparse and non-deterministic nature of radar data limits the effectiveness of traditional single-frame BEV paradigms. In this paper, we address this limitation by introducing AttentiveGRU, a novel attention-based recurrent approach tailored for radar constraints, which extracts individualized spatio-temporal context for objects by dynamically identifying and fusing temporally correlated structures across present and memory states. By leveraging the consistency of an object's latent representation over time, our approach exploits temporal relations to enrich feature representations for both stationary and moving objects, thereby enhancing detection performance and eliminating the need for externally providing or estimating any information about ego vehicle motion. Our experimental results on the public nuScenes dataset show a significant increase in mAP for the car category by 21% over the best radar-only submission. Further evaluations on an additional dataset demonstrate notable improvements in object detection capabilities, underscoring the applicability and effectiveness of our method.
[CV-51] Archival Faces: Detection of Faces in Digitized Historical Documents
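[Quick Read]: This paper addresses the detection of faces of celebrities and ordinary people in digitized historical archives, especially newspapers. Existing face detection tools perform very poorly on datasets of scanned historical documents, reaching only about 24% mAP at 50:90% IoU. To close this gap, the paper introduces a new manually annotated, domain-specific dataset in the style of the popular Wider Face dataset. It contains 2.2k new images from digitized historical newspapers of the 19th to 20th century, with 11k new bounding-box annotations and associated facial landmarks. The dataset allows existing detectors to be retrained so that their results move closer to the standard in face detection in the wild. The experiments compare different families of fine-tuned detectors and evaluate multiple detector sizes on both detection and landmark prediction performance.
Link: https://arxiv.org/abs/2504.00558
Authors: Marek Vaško,Adam Herout,Michal Hradiš
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures, 6 tables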
点击查看摘要
Abstract:When digitizing historical archives, it is necessary to search for the faces of celebrities and ordinary people, especially in newspapers, link them to the surrounding text, and make them searchable. Existing face detectors on datasets of scanned historical documents fail remarkably – current detection tools only achieve around 24% mAP at 50:90% IoU. This work compensates for this failure by introducing a new manually annotated domain-specific dataset in the style of the popular Wider Face dataset, containing 2.2k new images from digitized historical newspapers from the 19th to 20th century, with 11k new bounding-box annotations and associated facial landmarks. This dataset allows existing detectors to be retrained to bring their results closer to the standard in the field of face detection in the wild. We report several experimental results comparing different families of fine-tuned detectors against publicly available pre-trained face detectors and ablation studies of multiple detector sizes with comprehensive detection and landmark prediction performance results.
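The "mAP at 50:90% IoU" figure quoted above rests on the intersection over union between a predicted and a ground-truth box, evaluated at a range of thresholds. The following is a minimal sketch of that building block only, not the paper's evaluation code. The box layout (x1, y1, x2, y2), the 5-point threshold step, and the helper names iou, iou_thresholds, and match_rate are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds(start=50, stop=90, step=5):
    """Thresholds from 50% to 90% inclusive; the step of 5 is an assumption."""
    return [t / 100 for t in range(start, stop + 1, step)]


def match_rate(pred, gt):
    """Fraction of thresholds at which the prediction counts as a match."""
    thresholds = iou_thresholds()
    return sum(iou(pred, gt) >= t for t in thresholds) / len(thresholds)
```

A full mAP computation would additionally rank detections by confidence and integrate precision over recall at each threshold; that part is omitted here.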
[CV-52] Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features CVPR2025
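[Quick Read]: This paper addresses the increased inference cost that large vision-language models (LVLMs) incur from extensive image features. Unlike pruning studies that target self-attention-only LVLMs, it focuses on cross-attention-based models, which the authors report achieve superior performance. The paper observes that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, making it a major compute bottleneck. To mitigate this, the authors exploit the sparsity of cross-attention maps to selectively prune redundant visual features. The key to the solution is Trimmed Llama, which reduces KV cache demands without additional training; with visual features reduced by 50%, it lowers inference latency and memory usage while maintaining benchmark parity.
Link: https://arxiv.org/abs/2504.00557
Authors: Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim
Affiliations: Nota Inc.; University of Wisconsin-Madison
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: accepted at CVPR 2025 Workshop on ELVM
Click to view abstract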
Abstract:Visual token reduction lowers inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike relevant studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparse nature in cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. By benefiting from 50%-reduced visual features, our model can reduce inference latency and memory usage while achieving benchmark parity.
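To make the pruning idea concrete, here is a rough sketch of selecting visual features by cross-attention weight. The abstract only states that sparsity in cross-attention maps is exploited; the scoring rule used here (total attention each visual token receives across text queries, then keep the top fraction) and the function name prune_visual_tokens are assumptions, not the Trimmed Llama criterion.

```python
def prune_visual_tokens(attn, keep_ratio=0.5):
    """Pick the visual tokens that receive the most cross-attention.

    attn[q][v] is the attention weight from text query q to visual token v.
    Returns the kept visual token indices in their original order.
    """
    num_visual = len(attn[0])
    # Score each visual token by the total attention it receives.
    scores = [sum(row[v] for row in attn) for v in range(num_visual)]
    num_keep = max(1, int(num_visual * keep_ratio))
    ranked = sorted(range(num_visual), key=lambda v: scores[v], reverse=True)
    return sorted(ranked[:num_keep])
```

Only the keys and values of the returned indices would then need to stay in the cross-attention KV cache, which is where the memory saving would come from under this assumption.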
[CV-53] Generalization-aware Remote Sensing Change Detection via Domain-agnostic Learning
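[Quick Read]: This paper addresses pseudo-changes between bitemporal images, which are induced by imaging environmental factors and are a key challenge for change detection in regional development. Existing transformation-based methods treat pseudo-changes as a style shift and use generative adversarial networks (GANs) to transform the bitemporal images into the same style, but they have two drawbacks: 1) the transformed images suffer from distortion that reduces feature discrimination; 2) alignment prevents the model from learning domain-agnostic representations, which degrades performance on scenes whose domain is shifted from the training data. The paper therefore proposes a generalizable, domain-agnostic difference learning network (DonaNet) aimed at pseudo-changes caused by style differences. For drawback 1), it uses local-level statistics as style proxies. For drawback 2), DonaNet learns domain-agnostic representations by removing the domain-specific style of encoded features and highlighting the class characteristics of objects: a domain difference removal module reduces feature variance while preserving discriminative properties, with an enhanced version that removes more style by decorrelating features, and a cross-temporal generalization learning strategy imitates latent domain shifts so that the model extracts features that are more robust to shifts. Experiments on three public datasets show that DonaNet outperforms existing state-of-the-art methods with a smaller model size and greater robustness to domain shift.
Link: https://arxiv.org/abs/2504.00543
Authors: Qi Zang, Shuang Wang, Dong Zhao, Dou Quan, Yang Hu, Licheng Jiao
Affiliations: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University; DFH Satellite Co., Ltd, CAST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Click to view abstract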
Abstract:Change detection has essential significance for the region’s development, in which pseudo-changes between bitemporal images induced by imaging environmental factors are key challenges. Existing transformation-based methods regard pseudo-changes as a kind of style shift and alleviate it by transforming bitemporal images into the same style using generative adversarial networks (GANs). However, their efforts are limited by two drawbacks: 1) Transformed images suffer from distortion that reduces feature discrimination. 2) Alignment hampers the model from learning domain-agnostic representations that degrades performance on scenes with domain shifts from the training data. Therefore, oriented from pseudo-changes caused by style differences, we present a generalizable domain-agnostic difference learning network (DonaNet). For the drawback 1), we argue for local-level statistics as style proxies to assist against domain shifts. For the drawback 2), DonaNet learns domain-agnostic representations by removing domain-specific style of encoded features and highlighting the class characteristics of objects. In the removal, we propose a domain difference removal module to reduce feature variance while preserving discriminative properties and propose its enhanced version to provide possibilities for eliminating more style by decorrelating the correlation between features. In the highlighting, we propose a cross-temporal generalization learning strategy to imitate latent domain shifts, thus enabling the model to extract feature representations more robust to shifts actively. Extensive experiments conducted on three public datasets demonstrate that DonaNet outperforms existing state-of-the-art methods with a smaller model size and is more robust to domain shift.
[CV-54] SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning CVPR2025
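[Quick Read]: This paper addresses a limitation of existing self-supervised methods based on masked video modeling, such as VideoMAE: they mainly reconstruct pixel-level details of natural videos, which limits their capability for semantic representation and for sufficiently encoding motion dynamics. To address this, the paper proposes SMILE, a novel self-supervised approach that improves video representation learning by infusing both spatial and motion semantics. The key is to use image-language pretrained models such as CLIP to guide learning with high-level spatial semantics, and to enhance motion representation by introducing synthetic motion patterns into the training data so that the model captures more complex and dynamic content. In addition, SMILE establishes a new paradigm that learns strong video representations without requiring any natural video data.
Link: https://arxiv.org/abs/2504.00527
Authors: Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Bernard Ghanem
Affiliations: King Abdullah University of Science and Technology (KAUST)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to CVPR 2025
Click to view abstract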
Abstract:Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, they are primarily based on reconstructing pixel-level details on natural videos which have substantial temporal redundancy, limiting their capability for semantic representation and sufficient encoding of motion dynamics. To address these issues, this paper introduces a novel SSL approach for video representation learning, dubbed as SMILE, by infusing both spatial and motion semantics. In SMILE, we leverage image-language pretrained models, such as CLIP, to guide the learning process with their high-level spatial semantics. We enhance the representation of motion by introducing synthetic motion patterns in the training data, allowing the model to capture more complex and dynamic content. Furthermore, using SMILE, we establish a new self-supervised video learning paradigm capable of learning strong video representations without requiring any natural video data. We have carried out extensive experiments on 7 datasets with various downstream scenarios. SMILE surpasses current state-of-the-art SSL methods, showcasing its effectiveness in learning more discriminative and generalizable video representations. Code is available: this https URL
[CV-55] High-Quality Pseudo-Label Generation Based on Visual Prompt Assisted Cloud Model Update IJCNN’25
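[Quick Read]: This paper addresses high-quality pseudo-label generation for cloud-edge collaborative object detection, particularly in scenarios such as dynamic traffic monitoring where data distributions evolve over time. Existing methods often assume the cloud model is reliable, neglecting potential errors or struggling with complex distribution shifts. The paper proposes Cloud-Adaptive High-Quality Pseudo-label generation (CA-HQP), whose key idea is to improve cloud model updates with a learnable Visual Prompt Generator (VPG) and a dual feature alignment mechanism. The VPG enables parameter-efficient adaptation by injecting visual prompts, adding flexibility without extensive fine-tuning. CA-HQP mitigates domain discrepancies through two feature alignment techniques: global Domain Query Feature Alignment (DQFA), which captures scene-level shifts, and fine-grained Temporal Instance-Aware Feature Embedding Alignment (TIAFA), which handles instance-level variations. Experiments on the Bellevue traffic dataset show that CA-HQP significantly improves pseudo-label quality and yields notable performance gains for the edge model.
Link: https://arxiv.org/abs/2504.00526
Authors: Xinrun Xu, Qiuhong Zhang, Jianwen Yang, Zhanbiao Lian, Jin Yan, Zhiming Ding, Shan Jiang
Affiliations: University of Chinese Academy of Sciences, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; Advanced Institute of Big Data, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: IJCNN’25
Click to view abstract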
Abstract:Generating high-quality pseudo-labels on the cloud is crucial for cloud-edge object detection, especially in dynamic traffic monitoring where data distributions evolve. Existing methods often assume reliable cloud models, neglecting potential errors or struggling with complex distribution shifts. This paper proposes Cloud-Adaptive High-Quality Pseudo-label generation (CA-HQP), addressing these limitations by incorporating a learnable Visual Prompt Generator (VPG) and dual feature alignment into cloud model updates. The VPG enables parameter-efficient adaptation by injecting visual prompts, enhancing flexibility without extensive fine-tuning. CA-HQP mitigates domain discrepancies via two feature alignment techniques: global Domain Query Feature Alignment (DQFA) capturing scene-level shifts, and fine-grained Temporal Instance-Aware Feature Embedding Alignment (TIAFA) addressing instance variations. Experiments on the Bellevue traffic dataset demonstrate that CA-HQP significantly improves pseudo-label quality compared to existing methods, leading to notable performance gains for the edge model and showcasing CA-HQP’s adaptation effectiveness. Ablation studies validate each component (DQFA, TIAFA, VPG) and the synergistic effect of combined alignment strategies, highlighting the importance of adaptive cloud updates and domain adaptation for robust object detection in evolving scenarios. CA-HQP provides a promising solution for enhancing cloud-edge object detection systems in real-world applications.
[CV-56] Robust LiDAR-Camera Calibration with 2D Gaussian Splatting
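[Quick Read]: This paper addresses how to calibrate a LiDAR-camera system without auxiliary target objects while improving calibration robustness and accuracy. The key to the solution is to estimate the LiDAR-camera extrinsic parameters using geometric constraints: the method first reconstructs a colorless 2D Gaussian Splatting (2DGS) representation from LiDAR point clouds, then updates the colors of the Gaussian splats by minimizing a photometric loss. The extrinsic parameters are optimized during this process, and reprojection and triangulation losses are added to overcome the limitations of the photometric loss, improving robustness and accuracy.
Link: https://arxiv.org/abs/2504.00525
Authors: Shuyi Zhou, Shuxiang Xie, Ryoichi Ishikawa, Takeshi Oishi
Affiliations: The Institute of Industrial Science, The University of Tokyo, Japan
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted in IEEE Robotics and Automation Letters. Code available at: this https URL
Click to view abstract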
Abstract:LiDAR-camera systems have become increasingly popular in robotics recently. A critical and initial step in integrating the LiDAR and camera data is the calibration of the LiDAR-camera system. Most existing calibration methods rely on auxiliary target objects, which often involve complex manual operations, whereas targetless methods have yet to achieve practical effectiveness. Recognizing that 2D Gaussian Splatting (2DGS) can reconstruct geometric information from camera image sequences, we propose a calibration method that estimates LiDAR-camera extrinsic parameters using geometric constraints. The proposed method begins by reconstructing colorless 2DGS using LiDAR point clouds. Subsequently, we update the colors of the Gaussian splats by minimizing the photometric loss. The extrinsic parameters are optimized during this process. Additionally, we address the limitations of the photometric loss by incorporating the reprojection and triangulation losses, thereby enhancing the calibration robustness and accuracy.
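For readers unfamiliar with reprojection terms, the sketch below shows the generic form such a loss usually takes: LiDAR points are moved into the camera frame with the extrinsics, projected with a pinhole model, and compared with observed pixels. How the paper pairs points with pixels and weights the term is not given in the abstract, so the correspondences, the intrinsics (fx, fy, cx, cy), and the function names are illustrative assumptions.

```python
import math


def project(point, rotation, translation, fx, fy, cx, cy):
    """Map a 3D LiDAR point to pixel coordinates with a pinhole camera model."""
    # Extrinsics: camera-frame point = R * p + t.
    cam = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    # Intrinsics: perspective division, then focal scaling and principal-point offset.
    return (fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy)


def reprojection_error(points, pixels, rotation, translation, fx, fy, cx, cy):
    """Mean pixel distance between projected points and their observed pixels."""
    total = 0.0
    for point, (u, v) in zip(points, pixels):
        pu, pv = project(point, rotation, translation, fx, fy, cx, cy)
        total += math.hypot(pu - u, pv - v)
    return total / len(points)
```

In a calibration loop, rotation and translation would be the variables being optimized, and this error would be one term alongside the photometric and triangulation losses mentioned in the abstract.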
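[CV-57] Training Frozen Feature Pyramid DINOv2 for Eyelid Measurements with Infinite Encoding and Orthogonal Regularization
[Quick Read]: This paper addresses the accuracy of measuring eyelid parameters, such as Margin Reflex Distances (MRD1, MRD2) and Levator Function (LF), in oculoplastic diagnostics; these measurements currently rely on manual methods that are inconsistent and limited. The study evaluates deep learning models (SE-ResNet, EfficientNet, and the vision-transformer-based DINOv2) for automating the measurements from smartphone-acquired eye images. The key is a pretrained DINOv2 backbone combined with lightweight regressors such as an MLP and a Deep Ensemble, together with focal loss, orthogonal regularization, and binary encoding strategies to handle class imbalance and improve generalization. The result is consistent, accurate predictions across all tasks, which supports mobile-friendly clinical applications.
Link: https://arxiv.org/abs/2504.00515
Authors: Chun-Hung Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Notes:
Click to view abstract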
Abstract:Accurate measurement of eyelid parameters such as Margin Reflex Distances (MRD1, MRD2) and Levator Function (LF) is critical in oculoplastic diagnostics but remains limited by manual, inconsistent methods. This study evaluates deep learning models: SE-ResNet, EfficientNet, and the vision transformer-based DINOv2 for automating these measurements using smartphone-acquired images. We assess performance across frozen and fine-tuned settings, using MSE, MAE, and R2 metrics. DINOv2, pretrained through self-supervised learning, demonstrates superior scalability and robustness, especially under frozen conditions ideal for mobile deployment. Lightweight regressors such as MLP and Deep Ensemble offer high precision with minimal computational overhead. To address class imbalance and improve generalization, we integrate focal loss, orthogonal regularization, and binary encoding strategies. Our results show that DINOv2 combined with these enhancements delivers consistent, accurate predictions across all tasks, making it a strong candidate for real-world, mobile-friendly clinical applications. This work highlights the potential of foundation models in advancing AI-powered ophthalmic care.
[CV-58] Learned Image Compression with Dictionary-based Entropy Model CVPR2025
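[Quick Read]: This paper addresses a shortcoming of entropy models in existing learned image compression methods: they explore only the internal dependencies of the latent representation and neglect prior information that could be extracted from the training data. The key contribution is a new entropy model, the Dictionary-based Cross Attention Entropy model, which introduces a learnable dictionary that summarizes typical structures occurring in the training dataset to enhance the entropy model. Experiments show that the method strikes a better balance between performance and latency and achieves state-of-the-art results on various benchmark datasets.
Link: https://arxiv.org/abs/2504.00496
Authors: Jingbo Lu, Leheng Zhang, Xingyu Zhou, Mu Li, Wen Li, Shuhang Gu
Affiliations: University of Electronic Science and Technology of China; Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to CVPR 2025
Click to view abstract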
Abstract:Learned image compression methods have attracted great research interest and exhibited superior rate-distortion performance to the best classical image compression standards of the present. The entropy model plays a key role in learned image compression, which estimates the probability distribution of the latent representation for further entropy coding. Most existing methods employed hyper-prior and auto-regressive architectures to form their entropy models. However, they only aimed to explore the internal dependencies of latent representation while neglecting the importance of extracting prior from training data. In this work, we propose a novel entropy model named Dictionary-based Cross Attention Entropy model, which introduces a learnable dictionary to summarize the typical structures occurring in the training dataset to enhance the entropy model. Extensive experimental results have demonstrated that the proposed model strikes a better balance between performance and latency, achieving state-of-the-art results on various benchmark datasets.
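The core mechanism, a latent feature attending over a learned dictionary, can be sketched with plain scaled dot-product attention. This is only an illustration of cross-attention against a dictionary: using the dictionary entries as both keys and values, the absence of learned projections, and the names softmax and dictionary_attention are assumptions, and the paper's entropy model is certainly more elaborate.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dictionary_attention(query, dictionary):
    """Attend from one latent feature (query) over all dictionary entries.

    Entries act as both keys and values here, which is a simplification.
    Returns the attention-weighted mix of dictionary entries.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, entry)) / scale for entry in dictionary]
    weights = softmax(scores)
    dim = len(dictionary[0])
    return [sum(w * entry[d] for w, entry in zip(weights, dictionary)) for d in range(dim)]
```

In an entropy model, a retrieved vector like this would typically be fed into the network that predicts the distribution parameters of the latent; that step is not shown.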
[CV-59] SCFANet: Style Distribution Constraint Feature Alignment Network For Pathological Staining Translation
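[Quick Read]: This paper addresses the technical challenges of using deep learning models to translate low-cost Hematoxylin and Eosin (HE) stained images directly into Immunohistochemical (IHC) stained images. The main difficulties are alignment discrepancies between image pairs and the diversity of IHC staining style patterns. To overcome them, the paper proposes the Style Distribution Constraint Feature Alignment Network (SCFANet), built around two modules: the Style Distribution Constrainer (SDC) and Feature Alignment Learning (FAL). The SDC enforces consistency between the style distributions of generated and target images and integrates a cycle consistency loss to maintain structural consistency. The FAL module decomposes the end-to-end translation task into two subtasks, image reconstruction and feature alignment, which simplifies the translation. Pathological consistency between generated and target images is further ensured by maintaining pathological pattern consistency and Optical Density (OD) uniformity. Experiments on the Breast Cancer Immunohistochemical (BCI) dataset show that SCFANet outperforms existing methods for HE-to-IHC translation.
Link: https://arxiv.org/abs/2504.00490
Authors: Zetong Chen, Yuzhuo Chen, Hai Zhong, Xu Qiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Click to view abstract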
Abstract:Immunohistochemical (IHC) staining serves as a valuable technique for detecting specific antigens or proteins through antibody-mediated visualization. However, the IHC staining process is both time-consuming and costly. To address these limitations, the application of deep learning models for direct translation of cost-effective Hematoxylin and Eosin (HE) stained images into IHC stained images has emerged as an efficient solution. Nevertheless, the conversion from HE to IHC images presents significant challenges, primarily due to alignment discrepancies between image pairs and the inherent diversity in IHC staining style patterns. To overcome these challenges, we propose the Style Distribution Constraint Feature Alignment Network (SCFANet), which incorporates two innovative modules: the Style Distribution Constrainer (SDC) and Feature Alignment Learning (FAL). The SDC ensures consistency between the generated and target images’ style distributions while integrating cycle consistency loss to maintain structural consistency. To mitigate the complexity of direct image-to-image translation, the FAL module decomposes the end-to-end translation task into two subtasks: image reconstruction and feature alignment. Furthermore, we ensure pathological consistency between generated and target images by maintaining pathological pattern consistency and Optical Density (OD) uniformity. Extensive experiments conducted on the Breast Cancer Immunohistochemical (BCI) dataset demonstrate that our SCFANet model outperforms existing methods, achieving precise transformation of HE-stained images into their IHC-stained counterparts. The proposed approach not only addresses the technical challenges in HE to IHC image translation but also provides a robust framework for accurate and efficient stain conversion in pathological analysis.
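Optical density is commonly derived from pixel intensity with the Beer-Lambert relation OD = -log10(I / I0). The sketch below shows that conversion under the assumption of 8-bit intensities and a white background of I0 = 255; the abstract does not say how the paper defines or compares OD, so the clamping constant and the helper names are illustrative only.

```python
import math


def optical_density(intensity, background=255.0, eps=1.0):
    """Beer-Lambert optical density, OD = -log10(I / I0).

    The intensity is clamped to eps so that fully dark pixels do not hit log(0).
    """
    clamped = max(float(intensity), eps)
    return -math.log10(clamped / background)


def mean_od(pixels, background=255.0):
    """Average optical density over a flat list of pixel intensities."""
    return sum(optical_density(p, background) for p in pixels) / len(pixels)
```

A consistency term could then compare mean_od of a generated patch with that of the target patch, although whether the paper does exactly this is not stated.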
[CV-60] Hierarchical Attention Networks for Lossless Point Cloud Attribute Compression
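[Quick Read]: This paper addresses lossless attribute compression of point clouds with a deep hierarchical attention context model. The key to the solution is a multi-resolution spatial structure combined with residual learning: a Level of Detail (LoD) structure yields a coarse-to-fine representation for efficient coding, and an attention mechanism that hierarchically aggregates information from neighboring points captures contextual dependencies across varying scales and densities for comprehensive feature extraction. Normalizing position coordinates and attributes, and segmenting the point cloud into multiple slices for parallel processing, further improve compression efficiency and running time. Experiments show better coding performance than the latest G-PCC (the geometry-based point cloud compression standard) for color and reflectance attributes, with more efficient encoding and decoding runtimes.
Link: https://arxiv.org/abs/2504.00481
Authors: Yueru Chen, Wei Zhang, Dingquan Li, Jing Wang, Ge Li
Affiliations: Pengcheng Laboratory, Shenzhen, China; Peking University Shenzhen Graduate School, Shenzhen, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Notes: Accepted by DCC 2025
Click to view abstract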
Abstract:In this paper, we propose a deep hierarchical attention context model for lossless attribute compression of point clouds, leveraging a multi-resolution spatial structure and residual learning. A simple and effective Level of Detail (LoD) structure is introduced to yield a coarse-to-fine representation. To enhance efficiency, points within the same refinement level are encoded in parallel, sharing a common context point group. By hierarchically aggregating information from neighboring points, our attention model learns contextual dependencies across varying scales and densities, enabling comprehensive feature extraction. We also adopt normalization for position coordinates and attributes to achieve scale-invariant compression. Additionally, we segment the point cloud into multiple slices to facilitate parallel processing, further optimizing time complexity. Experimental results demonstrate that the proposed method offers better coding performance than the latest G-PCC for color and reflectance attributes while maintaining more efficient encoding and decoding runtimes.
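[CV-61] FSSUWNet: Mitigating the Fragility of Pre-trained Models with Feature Enhancement for Few-Shot Semantic Segmentation in Underwater Images
[Quick Read]: This paper addresses the poor generalization of Few-Shot Semantic Segmentation (FSS) methods to underwater environments. Existing FSS methods typically rely on prior features extracted by pre-trained models, and these features are fragile given the unique challenges of underwater images. The paper proposes FSSUWNet, an FSS framework tailored to underwater images that improves performance through feature enhancement. The key is an auxiliary encoder, the Feature Enhanced Encoder, which extracts complementary features to better adapt to underwater scene characteristics, together with a simple Feature Alignment Module that provides global prior knowledge and aligns low-level features with high-level features in dimensions. To mitigate the scarcity of underwater image data, the authors also build a cross-validation version of the Segmentation of Underwater Imagery dataset. Experiments on public underwater segmentation datasets show state-of-the-art performance.
Link: https://arxiv.org/abs/2504.00478
Authors: Zhuohao Li, Zhicheng Huang, Wenchao Liu, Zhuxing Zhang, Jianming Miao
Affiliations: School of Ocean Engineering and Technology, Sun Yat-Sen University; Southern Marine Science and Engineering Guangdong Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Click to view abstract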
Abstract:Few-Shot Semantic Segmentation (FSS), which focuses on segmenting new classes in images using only a limited number of annotated examples, has recently progressed in data-scarce domains. However, in this work, we show that the existing FSS methods often struggle to generalize to underwater environments. Specifically, the prior features extracted by pre-trained models used as feature extractors are fragile due to the unique challenges of underwater images. To address this, we propose FSSUWNet, a tailored FSS framework for underwater images with feature enhancement. FSSUWNet exploits the integration of complementary features, emphasizing both low-level and high-level image characteristics. In addition to employing a pre-trained model as the primary encoder, we propose an auxiliary encoder called Feature Enhanced Encoder which extracts complementary features to better adapt to underwater scene characteristics. Furthermore, a simple and effective Feature Alignment Module aims to provide global prior knowledge and align low-level features with high-level features in dimensions. Given the scarcity of underwater images, we introduce a cross-validation dataset version based on the Segmentation of Underwater Imagery dataset. Extensive experiments on public underwater segmentation datasets demonstrate that our approach achieves state-of-the-art performance. For example, our method outperforms the previous best method by 2.8% and 2.6% in terms of the mean Intersection over Union metric for 1-shot and 5-shot scenarios in the datasets, respectively. Our implementation is available at this https URL.
[CV-62] 4th PVUW MeViS 3rd Place Report: Sa2VA
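[Quick Read]: This report addresses referring video object segmentation (RVOS), in which a model segments an object in a video given a language description, and in particular the challenging MeViS benchmark, whose expressions describe the motion of target objects. The key is a simple modification to the test-time inference method of a stronger multi-modal large language model (MLLM): using the recent Sa2VA, a unified model for dense grounded understanding of both images and videos, and enlarging the scope of key frames without any further training, the authors achieve 3rd place in the 4th PVUW workshop.
Link: https://arxiv.org/abs/2504.00476
Authors: Haobo Yuan, Tao Zhang, Xiangtai Li, Lu Qi, Zilong Huang, Shilin Xu, Jiashi Feng, Ming-Hsuan Yang
Affiliations: UC Merced; Bytedance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Technical Report, 4 pages, Code: this https URL
Click to view abstract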
Abstract:Referring video object segmentation (RVOS) is a challenging task that requires the model to segment the object in a video given the language description. MeViS is a recently proposed dataset that contains motion expressions of the target objects, leading to a challenging benchmark, compared with existing RVOS benchmarks. On the other hand, for referring expression tasks, a new trend is to adopt a multi-modal large language model (MLLM) to achieve better image and text alignment. In this report, we show that with a simple modification to the test time inference method on stronger MLLMs, we can obtain stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos. By enlarging the scope of key frames, without any further training, we can achieve the 3rd place in the 4th PVUW workshop.
[CV-63] Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection
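[Quick Read]: This paper addresses how to efficiently and accurately identify the regions of discrete input data, such as images, that most influence a model's decisions. The core task of attribution methods is to resolve input-prediction interactions efficiently and accurately, which is especially hard for discrete inputs because of combinatorial explosion. The paper proposes LiMA (Less input is More faithful for Attribution), a novel and efficient black-box attribution mechanism whose key idea is to reformulate the attribution of important regions as an optimization problem of submodular subset selection. First, a submodular function that quantifies subset importance is designed to assess input-output interactions accurately. Second, a novel bidirectional greedy search algorithm ranks input sub-regions by importance to improve optimization efficiency. LiMA identifies both the most and the least important samples while ensuring an optimal attribution boundary that minimizes errors. Experiments on eight foundation models show more faithful interpretations with fewer regions and strong generalization, with average improvements of 36.3% in Insertion and 39.6% in Deletion. The method is 1.6 times faster than naive greedy search, and when explaining the causes of model prediction errors, its average highest confidence is 86.1% higher than that of state-of-the-art attribution algorithms.
Link: https://arxiv.org/abs/2504.00470
Authors: Ruoyu Chen, Siyuan Liang, Jingzhi Li, Shiming Liu, Li Liu, Hua Zhang, Xiaochun Cao
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences, School of Cyber Security, University of Chinese Academy of Sciences; School of Computing, National University of Singapore; RAMS Lab, Huawei Technologies Co., Ltd., College of Electronic Science and Technology, NUDT; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes:
Click to view abstract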
Abstract:Developing a trustworthy AI system requires identifying the input regions that most influence the model's decisions. The primary task of existing attribution methods lies in efficiently and accurately identifying the relationships among input-prediction interactions. Particularly when the input data is discrete, such as images, analyzing the relationship between inputs and outputs poses a significant challenge due to the combinatorial explosion. In this paper, we propose a novel and efficient black-box attribution mechanism, LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. First, to accurately assess interactions, we design a submodular function that quantifies subset importance and effectively captures their impact on decision outcomes. Then, to efficiently rank input sub-regions by their importance for attribution, we improve optimization efficiency through a novel bidirectional greedy search algorithm. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Extensive experiments on eight foundation models demonstrate that our method provides faithful interpretations with fewer regions and exhibits strong generalization, with an average improvement of 36.3% in Insertion and 39.6% in Deletion. Our method also outperforms the naive greedy search in attribution efficiency, being 1.6 times faster. Furthermore, when explaining the reasons behind model prediction errors, the average highest confidence achieved by our method is 86.1% higher than that of state-of-the-art attribution algorithms. The code is available at this https URL.
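As background for the subset-selection framing, the sketch below shows the naive greedy baseline that the abstract compares against, not LiMA's bidirectional search. Regions are added one at a time by largest marginal gain of a set function. The toy coverage function and the EVIDENCE table are invented purely for illustration; in an attribution setting the score would come from querying the black-box model on the selected regions.

```python
# Hypothetical toy data: which pieces of "evidence" each image region reveals.
EVIDENCE = {
    "r0": {"a", "b", "c"},
    "r1": {"b", "c"},
    "r2": {"d"},
    "r3": set(),
}


def coverage(subset):
    """Toy submodular score: number of distinct evidence items covered."""
    covered = set()
    for region in subset:
        covered |= EVIDENCE[region]
    return len(covered)


def greedy_select(regions, score_fn, budget):
    """Naive greedy selection: repeatedly add the region with the largest marginal gain."""
    selected, remaining = [], list(regions)
    for _ in range(min(budget, len(remaining))):
        base = score_fn(selected)
        best = max(remaining, key=lambda r: score_fn(selected + [r]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The order in which regions are selected doubles as an importance ranking, which is what insertion and deletion metrics evaluate. Region r1 is skipped in favor of r2 once r0 is chosen because it adds nothing new, which is the diminishing-returns behavior that submodularity formalizes.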
[CV-64] Exploring the Collaborative Advantage of Low-level Information on Generalizable AI-Generated Image Detection
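[Quick Read]: This paper addresses a limitation of state-of-the-art AI-generated image detection methods that use low-level information to improve generalization: relying on a single type of low-level information can lead to suboptimal generalization, and simple fusion strategies fail to exploit the detection advantages of low-level and high-level information across different forgery types. The paper proposes the Adaptive Low-level Experts Injection (ALEI) framework. Its key component is Lora Experts, which let a backbone network trained on high-level semantic RGB images accept and learn knowledge from different kinds of low-level information; a cross-attention method adaptively fuses these features at intermediate layers. A Low-level Information Adapter prevents the backbone from losing its ability to model different low-level features in later stages, and Dynamic Feature Selection picks the most suitable features for the current image to maximize generalization.
Link: https://arxiv.org/abs/2504.00463
Authors: Ziyin Zhou, Ke Sun, Zhongxi Chen, Xianming Lin, Yunpeng Luo, Ke Yan, Shouhong Ding, Xiaoshuai Sun
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; Tencent YouTu Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Click to view abstract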
Abstract:Existing state-of-the-art AI-Generated image detection methods mostly consider extracting low-level information from RGB images to help improve the generalization of AI-Generated image detection, such as noise patterns. However, these methods often consider only a single type of low-level information, which may lead to suboptimal generalization. Through empirical analysis, we have discovered a key insight: different low-level information often exhibits generalization capabilities for different types of forgeries. Furthermore, we found that simple fusion strategies are insufficient to leverage the detection advantages of each low-level and high-level information for various forgery types. Therefore, we propose the Adaptive Low-level Experts Injection (ALEI) framework. Our approach introduces Lora Experts, enabling the backbone network, which is trained with high-level semantic RGB images, to accept and learn knowledge from different low-level information. We utilize a cross-attention method to adaptively fuse these features at intermediate layers. To prevent the backbone network from losing the modeling capabilities of different low-level features during the later stages of modeling, we developed a Low-level Information Adapter that interacts with the features extracted by the backbone network. Finally, we propose Dynamic Feature Selection, which dynamically selects the most suitable features for detecting the current image to maximize generalization detection capability. Extensive experiments demonstrate that our method, finetuned on only four categories of mainstream ProGAN data, performs excellently and achieves state-of-the-art results on multiple datasets containing unseen GAN and Diffusion methods.
[CV-65] Mixture-of-Attack-Experts with Class Regularization for Unified Physical-Digital Face Attack Detection AAAI-2025
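[Quick Read]: This paper addresses the vulnerability of facial recognition systems in real-world scenarios to both digital and physical attacks. Existing methods classify by learning a comprehensive feature space but do not adequately account for the inherent characteristics of physical and digital attack data, in particular the large intra-class variation among attacks and the small inter-class variation between live and fake faces. The paper proposes the Fine-Grained MoE with Class-Aware Regularization CLIP framework (FG-MoE-CLIP-CAR), with improvements at both the feature level and the loss level. At the feature level, a Soft Mixture of Experts (Soft MoE) architecture uses different experts for specialized feature processing and is refined to capture subtler differences among types of fake faces. At the loss level, two modules are introduced: the Disentanglement Module (DM), which enhances class separability by increasing the distance between the centers of the live and fake classes, and the Cluster Distillation Module (CDM), which further clusters features around their own class centers while keeping them separated from other classes. Because attacks that deviate significantly from common attack patterns are easily overlooked, the distance calculation prioritizes more distant features. Results on two unified physical-digital attack datasets show state-of-the-art (SOTA) performance.
Link: https://arxiv.org/abs/2504.00458
Authors: Shunxin Chen, Ajian Liu, Junze Zheng, Jun Wan, Kailai Peng, Sergio Escalera, Zhen Lei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 9 pages, 5 figures, accepted by AAAI-2025 (Oral)
Click to view abstract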
Abstract:Facial recognition systems in real-world scenarios are susceptible to both digital and physical attacks. Previous methods have attempted to achieve classification by learning a comprehensive feature space. However, these methods have not adequately accounted for the inherent characteristics of physical and digital attack data, particularly the large intra-class variation in attacks and the small inter-class variation between live and fake faces. To address these limitations, we propose the Fine-Grained MoE with Class-Aware Regularization CLIP framework (FG-MoE-CLIP-CAR), incorporating key improvements at both the feature and loss levels. At the feature level, we employ a Soft Mixture of Experts (Soft MoE) architecture to leverage different experts for specialized feature processing. Additionally, we refine the Soft MoE to capture more subtle differences among various types of fake faces. At the loss level, we introduce two constraint modules: the Disentanglement Module (DM) and the Cluster Distillation Module (CDM). The DM enhances class separability by increasing the distance between the centers of live and fake face classes. However, center-to-center constraints alone are insufficient to ensure distinctive representations for individual features. Thus, we propose the CDM to further cluster features around their respective class centers while maintaining separation from other classes. Moreover, specific attacks that significantly deviate from common attack patterns are often overlooked. To address this issue, our distance calculation prioritizes more distant features. Experimental results on two unified physical-digital attack datasets demonstrate that the proposed method achieves state-of-the-art (SOTA) performance.
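The two loss-level ideas, pushing class centers apart and pulling features toward their own center, can be illustrated with simple center-based terms. This is a loose sketch in the spirit of the DM and CDM descriptions, not the paper's losses: the hinge margin, the squared-distance compactness term, and all function names are assumptions.

```python
import math


def center(features):
    """Mean vector of a list of equally sized feature vectors."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def separation_loss(live, fake, margin=2.0):
    """Hinge penalty that vanishes once the class centers are at least margin apart."""
    return max(0.0, margin - distance(center(live), center(fake)))


def compactness_loss(features):
    """Mean squared distance of features to their own class center."""
    c = center(features)
    return sum(distance(f, c) ** 2 for f in features) / len(features)
```

Weighting the compactness term more heavily for features far from their center would be one way to mimic the abstract's remark about prioritizing more distant features, though the exact weighting is not specified there.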
[CV-66] Distilling Multi-view Diffusion Models into 3D Generators
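Quick read: This paper addresses the computational complexity and knowledge-transfer problems that multi-view diffusion models (MV-DM) face when generating high-quality 3D data. Specifically, the authors propose DD3G, which distills an MV-DM into a 3D generator using Gaussian splatting, compressing and integrating its visual and spatial geometric knowledge so that the generator generalizes better. Unlike previous amortized optimization approaches, DD3G aligns the representation spaces of the MV-DM and the 3D generator to transfer the teacher's probabilistic flow to the student, avoiding the inconsistent optimization objectives caused by probabilistic sampling. To handle the challenges introduced by the probabilistic flow and the coupling of 3D Gaussian attributes, the paper proposes PEPD, a generator with Pattern Extraction and Progressive Decoding phases, which markedly improves generation efficiency. It also designs a joint optimization objective that ensures sample quality through explicit supervision and implicit verification, reducing knowledge loss under sparse-view supervision. Experiments on synthetic and public datasets demonstrate the method's effectiveness.
Link: https://arxiv.org/abs/2504.00457
Authors: Hao Qin,Luyuan Chen,Ming Kong,Mengxu Lu,Qiang Zhu
Affiliations: School of Computer Science and Technology, Zhejiang University (浙江大学), Hangzhou 310027, China; School of Computer, Beijing Information Science and Technology University (北京信息科学技术大学), Beijing 100005, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: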
Abstract:We introduce DD3G, a formulation that Distills a multi-view Diffusion model (MV-DM) into a 3D Generator using Gaussian splatting. DD3G compresses and integrates extensive visual and spatial geometric knowledge from the MV-DM by simulating its ordinary differential equation (ODE) trajectory, ensuring the distilled generator generalizes better than those trained solely on 3D data. Unlike previous amortized optimization approaches, we align the MV-DM and 3D generator representation spaces to transfer the teacher’s probabilistic flow to the student, thus avoiding inconsistencies in optimization objectives caused by probabilistic sampling. The probabilistic flow and the coupling of various attributes in 3D Gaussians introduce challenges in the generation process. To tackle this, we propose PEPD, a generator consisting of Pattern Extraction and Progressive Decoding phases, which enables efficient fusion of probabilistic flow and converts a single image into 3D Gaussians within 0.06 seconds. Furthermore, to reduce knowledge loss and overcome sparse-view supervision, we design a joint optimization objective that ensures the quality of generated samples through explicit supervision and implicit verification. Leveraging existing 2D generation models, we compile 120k high-quality RGBA images for distillation. Experiments on synthetic and public datasets demonstrate the effectiveness of our method. Our project is available at: this https URL
[CV-67] FA3-CLIP: Frequency-Aware Cues Fusion and Attack-Agnostic Prompt Learning for Unified Face Attack Detection
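Quick read: This paper addresses the detection of physical attacks (e.g., printed photos) and digital attacks (e.g., DeepFake) against face authentication systems. Existing methods face two main challenges when detecting both attack types at once: significant intra-class variation across attack categories, and the inability of spatial information alone to fully capture live and fake cues. To address these challenges, the paper proposes a unified attack detection model called Frequency-Aware and Attack-Agnostic CLIP (FA³-CLIP). Its key idea is attack-agnostic prompt learning, which extracts generic live and fake cues from fused spatial and frequency features to enable unified detection of live faces and all attack categories. Specifically, the attack-agnostic prompt module generates generic live and fake prompts in the language branch to extract the corresponding generic representations from live and fake faces, guiding the model toward a unified feature space. The module also adaptively generates a live/fake conditional bias from the original spatial and frequency information to optimize the generic prompts and reduce the impact of intra-class variation. In addition, the paper designs a dual-stream cue fusion framework in the vision branch that uses frequency information to complement subtle cues that are hard to capture in the spatial domain, and applies a frequency compression block to reduce redundancy in frequency features while preserving the diversity of crucial cues. Experiments show that the method achieves state-of-the-art performance in detecting physical and digital face attacks.
Link: https://arxiv.org/abs/2504.00454
Authors: Yongze Li,Ning Li,Ajian Liu,Hui Ma,Liying Yang,Xihong Chen,Zhiyao Liang,Yanyan Liang,Jun Wan,Zhen Lei
Affiliations: School of Computer Science and Engineering, Faculty of Innovation Engineering, Macau University of Science and Technology (澳门科技大学计算机科学与工程学院,创新工程学院); State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA) (中国科学院自动化研究所多模态人工智能系统国家重点实验室); School of Software Engineering, Beijing Jiaotong University (BJTU) (北京交通大学软件工程学院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures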
Abstract:Facial recognition systems are vulnerable to physical (e.g., printed photos) and digital (e.g., DeepFake) face attacks. Existing methods struggle to simultaneously detect physical and digital attacks due to: 1) significant intra-class variations between these attack types, and 2) the inadequacy of spatial information alone to comprehensively capture live and fake cues. To address these issues, we propose a unified attack detection model termed Frequency-Aware and Attack-Agnostic CLIP (FA³-CLIP), which introduces attack-agnostic prompt learning to express generic live and fake cues derived from the fusion of spatial and frequency features, enabling unified detection of live faces and all categories of attacks. Specifically, the attack-agnostic prompt module generates generic live and fake prompts within the language branch to extract corresponding generic representations from both live and fake faces, guiding the model to learn a unified feature space for unified attack detection. Meanwhile, the module adaptively generates the live/fake conditional bias from the original spatial and frequency information to optimize the generic prompts accordingly, reducing the impact of intra-class variations. We further propose a dual-stream cues fusion framework in the vision branch, which leverages frequency information to complement subtle cues that are difficult to capture in the spatial domain. In addition, a frequency compression block is utilized in the frequency stream, which reduces redundancy in frequency features while preserving the diversity of crucial cues. We also establish new challenging protocols to better evaluate the effectiveness of unified face attack detection. Experimental results demonstrate that the proposed method significantly improves performance in detecting physical and digital face attacks, achieving state-of-the-art results.
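As a rough illustration of what "fusing spatial and frequency features" can mean, the sketch below computes discrete Fourier transform magnitudes for a single row of pixel values and concatenates them with the raw values. This is an assumption-laden toy, not the FA³-CLIP dual-stream design: the function names are hypothetical, the input is 1D rather than an image, and plain concatenation stands in for the paper's learned fusion.

```python
import cmath

def dft_magnitudes(signal):
    """Magnitude of each discrete Fourier transform coefficient
    of a 1D sequence (naive O(n^2) evaluation)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]

def spatial_frequency_features(row):
    """Concatenate raw (spatial) values with their frequency magnitudes.
    Concatenation is a placeholder for a learned fusion module."""
    return list(row) + dft_magnitudes(row)
```

A periodic pattern such as print moire would show up as energy concentrated in a few frequency bins, which is the kind of cue a purely spatial stream can miss.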
[CV-68] Suite-IN: A FlexiWear BodyNet Integrating Global and Local Motion Features from Apple Suite for Robust Inertial Navigation
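Quick read: This paper addresses the limitations of traditional pedestrian dead reckoning (PDR) in handling diverse motion modes, and the lack of robustness of data-driven methods that rely on a single device. The key idea is to fully leverage existing wearable devices to form a flexible wearable body network (flexiwear bodynet) for more robust and accurate pedestrian localization. Specifically, the paper proposes the Suite-IN++ framework, which integrates motion data from wearable devices on different body parts and uses contrastive learning to separate global and local motion features. It fuses global features according to each device's data reliability to capture overall motion trends, and uses an attention mechanism to uncover cross-device correlations among local features, extracting details that help accurate localization. The core of the approach is the effective integration of multi-device data, which significantly improves localization accuracy and robustness in real pedestrian tracking scenarios.
Link: https://arxiv.org/abs/2504.00438
Authors: Lan Sun,Songpengcheng Xia,Jiarui Yang,Ling Pei
Affiliations: Shanghai Key Laboratory of Navigation and Location-based Services (上海导航与位置服务重点实验室), School of Electronic Information and Electrical Engineering (电子与电气工程学院), Shanghai Jiao Tong University (上海交通大学), Shanghai, China, 200240
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 10 figures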
Abstract:The proliferation of wearable technology has established multi-device ecosystems comprising smartphones, smartwatches, and headphones as critical enablers for ubiquitous pedestrian localization. However, traditional pedestrian dead reckoning (PDR) struggles with diverse motion modes, while data-driven methods, despite improving accuracy, often lack robustness due to their reliance on a single-device setup. Therefore, a promising solution is to fully leverage existing wearable devices to form a flexiwear bodynet for robust and accurate pedestrian localization. This paper presents Suite-IN++, a deep learning framework for flexiwear bodynet-based pedestrian localization. Suite-IN++ integrates motion data from wearable devices on different body parts, using contrastive learning to separate global and local motion features. It fuses global features based on the data reliability of each device to capture overall motion trends and employs an attention mechanism to uncover cross-device correlations in local features, extracting motion details helpful for accurate localization. To evaluate our method, we construct a real-life flexiwear bodynet dataset, incorporating Apple Suite (iPhone, Apple Watch, and AirPods) across diverse walking modes and device configurations. Experimental results demonstrate that Suite-IN++ achieves superior localization accuracy and robustness, significantly outperforming state-of-the-art models in real-life pedestrian tracking scenarios.
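The abstract says global features are fused "based on the data reliability of each device". One simple way to picture this is a softmax-weighted average over whichever devices are currently present. The sketch below is only that picture, under stated assumptions: the function names, the use of a softmax, and the scalar reliability scores are hypothetical and are not taken from Suite-IN++.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_global_features(features, reliability):
    """Weighted average of per-device feature vectors.

    features:    dict mapping device name -> feature vector
    reliability: dict mapping device name -> scalar score
                 (hypothetical; higher means more trusted)
    Only devices present in `features` take part, so a missing
    watch or earbud simply drops out of the average.
    """
    devices = sorted(features)
    weights = softmax([reliability[d] for d in devices])
    dim = len(features[devices[0]])
    return [
        sum(w * features[d][i] for w, d in zip(weights, devices))
        for i in range(dim)
    ]
```

With equal scores this reduces to a plain mean; as one device's score grows, the fused vector moves toward that device's features.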
[CV-69] ADGaussian: Generalizable Gaussian Splatting for Autonomous Driving with Multi-modal Inputs
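Quick read: This paper addresses generalizable street scene reconstruction. Whereas prior Gaussian Splatting methods focus mainly on geometry refinement, the proposed method emphasizes joint optimization of image and depth features for more accurate Gaussian prediction. The key points are: first, sparse LiDAR depth is introduced as an additional input modality, so that Gaussian prediction is formulated as a joint learning framework over visual information and geometric cues; second, a multi-modal feature matching strategy and a multi-scale Gaussian decoding model are proposed to strengthen the joint refinement of multi-modal features, enabling efficient multi-modal Gaussian learning. Experiments show that ADGaussian achieves state-of-the-art performance on the Waymo and KITTI datasets and exhibits strong zero-shot generalization.
Link: https://arxiv.org/abs/2504.00437
Authors: Qi Song,Chenghong Li,Haotong Lin,Sida Peng,Rui Huang
Affiliations: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Zhejiang University (浙江大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The project page can be found at this https URL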
Abstract:We present a novel approach, termed ADGaussian, for generalizable street scene reconstruction. The proposed method enables high-quality rendering from single-view input. Unlike prior Gaussian Splatting methods that primarily focus on geometry refinement, we emphasize the importance of joint optimization of image and depth features for accurate Gaussian prediction. To this end, we first incorporate sparse LiDAR depth as an additional input modality, formulating the Gaussian prediction process as a joint learning framework of visual information and geometric cues. Furthermore, we propose a multi-modal feature matching strategy coupled with a multi-scale Gaussian decoding model to enhance the joint refinement of multi-modal features, thereby enabling efficient multi-modal Gaussian learning. Extensive experiments on two large-scale autonomous driving datasets, Waymo and KITTI, demonstrate that our ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization capabilities in novel-view shifting.
[CV-70] DecoFuse: Decomposing and Fusing the “What” “Where” and “How” for Brain-Inspired fMRI-to-Video Decoding
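Quick read: This paper addresses decoding visual experiences from functional magnetic resonance imaging (fMRI) signals, in particular the tendency of existing video reconstruction methods to focus on semantic content while neglecting spatial and motion information. It proposes DecoFuse, a brain-inspired framework that decomposes a video into three components (semantic, spatial, and motion), decodes each separately, and then fuses them to reconstruct the video. This not only simplifies the complex task of video decoding but also establishes a clearer link between the learned representations and their biological counterparts, as supported by ablation studies. Experiments show that DecoFuse outperforms previous state-of-the-art methods in semantic classification, spatial consistency, motion prediction, and multi-class video generation. In addition, neural encoding analyses of semantic and spatial information further support the distinct functional roles of the ventral and dorsal pathways, which supports the framework's biological plausibility.
Link: https://arxiv.org/abs/2504.00432
Authors: Chong Li,Jingyang Huo,Weikang Gong,Yanwei Fu,Xiangyang Xue,Jianfeng Feng
Affiliations: Fudan University (复旦大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: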
Abstract:Decoding visual experiences from brain activity is a significant challenge. Existing fMRI-to-video methods often focus on semantic content while overlooking spatial and motion information. However, these aspects are all essential and are processed through distinct pathways in the brain. Motivated by this, we propose DecoFuse, a novel brain-inspired framework for decoding videos from fMRI signals. It first decomposes the video into three components - semantic, spatial, and motion - then decodes each component separately before fusing them to reconstruct the video. This approach not only simplifies the complex task of video decoding by decomposing it into manageable sub-tasks, but also establishes a clearer connection between learned representations and their biological counterpart, as supported by ablation studies. Further, our experiments show significant improvements over previous state-of-the-art methods, achieving 82.4% accuracy for semantic classification, 70.6% accuracy in spatial consistency, a 0.212 cosine similarity for motion prediction, and 21.9% 50-way accuracy for video generation. Additionally, neural encoding analyses for semantic and spatial information align with the two-streams hypothesis, further validating the distinct roles of the ventral and dorsal pathways. Overall, DecoFuse provides a strong and biologically plausible framework for fMRI-to-video decoding. Project page: this https URL.
[CV-71] Enhancing Fundus Image-based Glaucoma Screening via Dynamic Global-Local Feature Integration
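Quick read: This paper addresses the challenges posed by variations in fundus images, such as differences in image quality across imaging devices, discrepancies between training and testing datasets from different racial groups, and uncertain boundaries arising from the characteristics of glaucoma cases. The key solution is a self-adaptive attention window that autonomously determines optimal boundaries to enhance feature extraction, together with a multi-head attention mechanism that fuses global and local features through feature linear readout to improve the model's discriminative ability. Experiments show that the method achieves higher accuracy and robustness in glaucoma classification.
Link: https://arxiv.org/abs/2504.00431
Authors: Yuzhuo Zhou,Chi Liu,Sheng Shen,Siyu Le,Liwen Yu,Sihan Ouyang,Zongyuan Ge
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: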
Abstract:With the advancements in medical artificial intelligence (AI), fundus image classifiers are increasingly being applied to assist in ophthalmic diagnosis. While existing classification models have achieved high accuracy on specific fundus datasets, they struggle to address real-world challenges such as variations in image quality across different imaging devices, discrepancies between training and testing images across different racial groups, and the uncertain boundaries due to the characteristics of glaucomatous cases. In this study, we aim to address the above challenges posed by image variations by highlighting the importance of incorporating comprehensive fundus image information, including the optic cup (OC) and optic disc (OD) regions, and other key image patches. Specifically, we propose a self-adaptive attention window that autonomously determines optimal boundaries for enhanced feature extraction. Additionally, we introduce a multi-head attention mechanism to effectively fuse global and local features via feature linear readout, improving the model’s discriminative capability. Experimental results demonstrate that our method achieves superior accuracy and robustness in glaucoma classification.
[CV-72] Data Synthesis with Diverse Styles for Face Recognition via 3DMM-Guided Diffusion CVPR2025
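Quick read: This paper addresses identity-preserving face synthesis, that is, generating virtual face images that can substitute for real-world data when training face recognition models. Existing methods face a trade-off between identity consistency and style diversity; their main limitation is that they treat style variation as subject-agnostic and do not account for the distinct, subject-specific styles of real people. To address this, the paper proposes MorphFace, a diffusion-based identity-preserving face generator. Its key idea is to learn fine-grained facial styles (such as shape, pose, and expression) from the renderings of a 3D morphable model (3DMM), and to learn identities from an off-the-shelf face recognition model. When generating virtual faces, MorphFace combines novel identities from unlabeled synthetic faces with novel styles statistically sampled from a real-world prior distribution, with the sampling accounting for both intra-subject variation and subject distinctiveness, and it uses a context blending strategy to make the model more responsive to identity and style conditions. Experiments show that MorphFace outperforms the best existing methods in face recognition efficacy.
Link: https://arxiv.org/abs/2504.00430
Authors: Yuxi Mi,Zhizhou Zhong,Yuge Huang,Qiuyang Yuan,Xuan Zhao,Jianqing Xu,Shouhong Ding,ShaoMing Wang,Rizen Guo,Shuigeng Zhou
Affiliations: Shanghai Key Lab of Intelligent Information Processing, Fudan University (上海智能信息处理重点实验室,复旦大学); Youtu Lab, Tencent (腾讯优图实验室); WeChat Pay Lab, Tencent (微信支付实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025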
Abstract:Identity-preserving face synthesis aims to generate synthetic face images of virtual subjects that can substitute real-world data for training face recognition models. While prior methods strive to create images with consistent identities and diverse styles, they face a trade-off between the two. Identifying their limitation of treating style variation as subject-agnostic and observing that real-world persons actually have distinct, subject-specific styles, this paper introduces MorphFace, a diffusion-based face generator. The generator learns fine-grained facial styles, e.g., shape, pose and expression, from the renderings of a 3D morphable model (3DMM). It also learns identities from an off-the-shelf recognition model. To create virtual faces, the generator is conditioned on novel identities of unlabeled synthetic faces, and novel styles that are statistically sampled from a real-world prior distribution. The sampling especially accounts for both intra-subject variation and subject distinctiveness. A context blending strategy is employed to enhance the generator’s responsiveness to identity and style conditions. Extensive experiments show that MorphFace outperforms the best prior methods in face recognition efficacy.
[CV-73] Unleashing the Power of Pre-trained Encoders for Universal Adversarial Attack Detection
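Quick read: This paper addresses the limited generalization and high engineering cost of existing adversarial attack detection methods. Traditional methods rely on handcrafted features and prior knowledge of attack patterns, so they perform poorly on unknown attacks and are costly to develop. To address these challenges, the paper proposes a lightweight adversarial detection framework based on the large-scale pre-trained vision-language model CLIP. The key innovation is to adopt an anomaly detection perspective: CLIP's dual visual-text encoders are jointly fine-tuned with trainable adapter networks and learnable prompts to build a compact representation space tailored to natural images. Experiments show that the architecture generalizes substantially better than traditional methods across both known and unknown attack patterns while greatly reducing training overhead. The study offers a new technical path toward parameter-efficient, attack-agnostic defense and improves the robustness of vision systems against evolving adversarial threats.
Link: https://arxiv.org/abs/2504.00429
Authors: Yinghe Zhang,Chi Liu,Shuai Zhou,Sheng Shen,Peng Gui
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: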
Abstract:Adversarial attacks pose a critical security threat to real-world AI systems by injecting human-imperceptible perturbations into benign samples to induce misclassification in deep learning models. While existing detection methods, such as Bayesian uncertainty estimation and activation pattern analysis, have achieved progress through feature engineering, their reliance on handcrafted feature design and prior knowledge of attack patterns limits generalization capabilities and incurs high engineering costs. To address these limitations, this paper proposes a lightweight adversarial detection framework based on the large-scale pre-trained vision-language model CLIP. Departing from conventional adversarial feature characterization paradigms, we innovatively adopt an anomaly detection perspective. By jointly fine-tuning CLIP’s dual visual-text encoders with trainable adapter networks and learnable prompts, we construct a compact representation space tailored for natural images. Notably, our detection architecture achieves substantial improvements in generalization capability across both known and unknown attack patterns compared to traditional methods, while significantly reducing training overhead. This study provides a novel technical pathway for establishing a parameter-efficient and attack-agnostic defense paradigm, markedly enhancing the robustness of vision systems against evolving adversarial threats.
[CV-74] Can LLMs Assist Computer Education? An Empirical Case Study of DeepSeek
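Quick read: This paper evaluates the efficacy and reliability of DeepSeek-V3, an emerging large language model, in computer education. Using CCNA simulation questions and real-world network security questions posed by Chinese network engineers, the study evaluates the model along several dimensions, including role dependency, cross-linguistic proficiency, and answer reproducibility, supported by statistical analysis. The key is the use of diverse evaluation methods to keep the assessment comprehensive and objective, with particular attention to how performance differs across task types (such as factual recall versus higher-order reasoning) and how stable it is across languages. The results provide a useful reference for improving large language models in specialized professional settings.
Link: https://arxiv.org/abs/2504.00421
Authors: Dongfu Xiao,Chen Gao,Zhengquan Luo,Chi Liu,Sheng Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: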
Abstract:This study presents an empirical case study to assess the efficacy and reliability of DeepSeek-V3, an emerging large language model, within the context of computer education. The evaluation employs both CCNA simulation questions and real-world inquiries concerning computer network security posed by Chinese network engineers. To ensure a thorough evaluation, diverse dimensions are considered, encompassing role dependency, cross-linguistic proficiency, and answer reproducibility, accompanied by statistical analysis. The findings demonstrate that the model performs consistently, regardless of whether prompts include a role definition or not. In addition, its adaptability across languages is confirmed by maintaining stable accuracy in both original and translated datasets. A distinct contrast emerges between its performance on lower-order factual recall tasks and higher-order reasoning exercises, which underscores its strengths in retrieving information and its limitations in complex analytical tasks. Although DeepSeek-V3 offers considerable practical value for network security education, challenges remain in its capability to process multimodal data and address highly intricate topics. These results provide valuable insights for future refinement of large language models in specialized professional environments.
[CV-75] Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation CVPR2025
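Quick read: This paper addresses the challenge of building a lifelong robot that can effectively leverage prior knowledge for continuous skill acquisition. Although experience replay and parameter-efficient methods have had some success in alleviating catastrophic forgetting, applying them naively fails to exploit the primitives shared between skills. To address this, the paper proposes Primitive Prompt Learning (PPL), which achieves lifelong robot manipulation through reusable and extensible primitives. The key is a two-stage learning scheme: first, a multi-skill pre-training stage learns a set of primitive prompts that represent shared primitives, with motion-aware prompts capturing the semantic and motion primitives shared across skills; second, during lifelong learning, new prompts are appended and optimized while the pre-trained prompts are frozen, so that knowledge transfer from old skills to new ones makes learning new skills more efficient. Experiments on a large-scale skill dataset show that PPL outperforms existing state-of-the-art methods in both simulation and real-world tasks.
Link: https://arxiv.org/abs/2504.00420
Authors: Yuanqi Yao,Siao Liu,Haoming Song,Delin Qu,Qizhi Chen,Yan Ding,Bin Zhao,Zhigang Wang,Xuelong Li,Dong Wang
Affiliations: Shanghai AI Laboratory (上海人工智能实验室); Fudan University (复旦大学); Shanghai Jiao Tong University (上海交通大学); Zhejiang University (浙江大学); Northwestern Polytechnical University (西北工业大学); TeleAI, China Telecom Corp Ltd (中国电信集团天翼人工智能科技有限公司); INSAIT
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025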
Abstract:Building a lifelong robot that can effectively leverage prior knowledge for continuous skill acquisition remains significantly challenging. Despite the success of experience replay and parameter-efficient methods in alleviating the catastrophic forgetting problem, naively applying these methods fails to leverage the shared primitives between skills. To tackle these issues, we propose Primitive Prompt Learning (PPL), to achieve lifelong robot manipulation via reusable and extensible primitives. Within our two-stage learning scheme, we first learn a set of primitive prompts to represent shared primitives through a multi-skill pre-training stage, where motion-aware prompts are learned to capture semantic and motion shared primitives across different skills. Secondly, when acquiring new skills over the lifelong span, new prompts are appended and optimized with frozen pretrained prompts, boosting the learning via knowledge transfer from old skills to new ones. For evaluation, we construct a large-scale skill dataset and conduct extensive experiments in both simulation and real-world tasks, demonstrating PPL’s superior performance over state-of-the-art methods.
[CV-76] NCAP: Scene Text Image Super-Resolution with Non-CAtegorical Prior WACV2025
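Quick read: This paper addresses two main problems in scene text image super-resolution (STISR). (1) Explicit categorical priors, such as the text prior (TP), can harm STISR when they are incorrect; the paper shows that such priors are unstable and proposes replacing them with a Non-CAtegorical Prior (NCAP) based on the penultimate-layer representations of a pre-trained model. (2) The pre-trained recognizers used to generate the TP struggle with low-resolution images; existing methods bridge the domain gap by jointly training the recognizer with the STISR network, but this can cause overconfidence in the prior modality. The paper highlights this issue and mitigates it by mixing hard and soft labels. The key to the solution is therefore to replace the unstable TP with NCAP and to combine hard and soft labels to reduce the effect of overconfidence. Experiments show an improvement of 3.5% on the TextZoom dataset and a 14.8% gain in generalization performance across four text recognition datasets.
Link: https://arxiv.org/abs/2504.00410
Authors: Dongwoo Park,Suk Pil Ko
Affiliations: THINKWARE Corporation (思图威公司)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2025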
Abstract:Scene text image super-resolution (STISR) enhances the resolution and quality of low-resolution images. Unlike previous studies that treated scene text images as natural images, recent methods using a text prior (TP), extracted from a pre-trained text recognizer, have shown strong performance. However, two major issues emerge: (1) Explicit categorical priors, like TP, can negatively impact STISR if incorrect. We reveal that these explicit priors are unstable and propose replacing them with Non-CAtegorical Prior (NCAP) using penultimate layer representations. (2) Pre-trained recognizers used to generate TP struggle with low-resolution images. To address this, most studies jointly train the recognizer with the STISR network to bridge the domain gap between low- and high-resolution images, but this can cause an overconfidence phenomenon in the prior modality. We highlight this issue and propose a method to mitigate it by mixing hard and soft labels. Experiments on the TextZoom dataset demonstrate an improvement of 3.5%, while our method significantly enhances generalization performance by 14.8% across four text recognition datasets. Our method generalizes to all TP-guided STISR networks.
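"Mixing hard and soft labels" can be pictured as blending a one-hot target with a softer distribution (for example, a teacher's prediction) before computing cross-entropy, which keeps the target from being fully confident. The following Python sketch shows that generic idea only; the function names and the mixing weight `alpha` are hypothetical, and the paper's exact formulation may differ.

```python
import math

def mix_labels(hard_index, soft_probs, alpha=0.5):
    """Blend a one-hot label at `hard_index` with a soft distribution.

    alpha is the weight on the hard label (hypothetical value);
    the result still sums to 1 when soft_probs does.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = [(1.0 - alpha) * p for p in soft_probs]
    mixed[hard_index] += alpha
    return mixed

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))
```

For instance, `mix_labels(0, [0.5, 0.3, 0.2], alpha=0.5)` yields a target of roughly `[0.75, 0.15, 0.1]`, so the correct class is still favored but the other classes keep some probability mass.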
[CV-77] Beyond Wide-Angle Images: Unsupervised Video Portrait Correction via Spatiotemporal Diffusion Adaptation
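Quick read: This paper addresses the facial stretching caused by wide-angle camera distortion, especially near the edge of the lens, which degrades visual appeal. It proposes an image portrait correction framework named ImagePD, which integrates the long-range awareness of the transformer and the multi-step denoising of diffusion models into a unified framework, achieving global structural robustness and local detail refinement. Because video labels are expensive to obtain, ImagePD is further extended to unsupervised wide-angle video correction (termed VideoPD) through spatiotemporal diffusion adaptation under spatial consistency and temporal smoothness constraints. For spatial consistency, the denoised image is encouraged to approximate pseudo labels that follow the wide-angle distortion distribution pattern; for temporal smoothness, rectification trajectories are derived from backward optical flows and then smoothed. Compared with ImagePD, VideoPD maintains high-quality face correction spatially and mitigates potential jitter over time. Finally, to establish an evaluation benchmark and train the framework, the authors build a video portrait dataset with wide diversity in the number of people, lighting conditions, and backgrounds. Experiments show that the proposed methods outperform existing solutions quantitatively and qualitatively, producing high-fidelity wide-angle videos with stable and natural portraits. The code and dataset will be made publicly available.
Link: https://arxiv.org/abs/2504.00401
Authors: Wenbo Nie,Lang Nie,Chunyu Lin,Jingwen Chen,Ke Xing,Jiyuan Wang,Yao Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: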
Abstract:Wide-angle cameras, despite their popularity for content creation, suffer from distortion-induced facial stretching, especially at the edge of the lens, which degrades visual appeal. To address this issue, we propose an image portrait correction framework using diffusion models named ImagePD. It integrates the long-range awareness of the transformer and the multi-step denoising of diffusion models into a unified framework, achieving global structural robustness and local detail refinement. Besides, considering the high cost of obtaining video labels, we then repurpose ImagePD for unlabeled wide-angle videos (termed VideoPD), by spatiotemporal diffusion adaptation with spatial consistency and temporal smoothness constraints. For the former, we encourage the denoised image to approximate pseudo labels following the wide-angle distortion distribution pattern, while for the latter, we derive rectification trajectories with backward optical flows and smooth them. Compared with ImagePD, VideoPD maintains high-quality facial corrections in space and mitigates potential temporal shakes over the sequence. Finally, to establish an evaluation benchmark and train the framework, we build a video portrait dataset with a large diversity in the number of people, lighting conditions, and backgrounds. Experiments demonstrate that the proposed methods outperform existing solutions quantitatively and qualitatively, contributing to high-fidelity wide-angle videos with stable and natural portraits. The code and dataset will be made available.
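The temporal smoothness constraint is described as deriving rectification trajectories and smoothing them. As a minimal sketch of the smoothing step alone, the Python below applies a centered moving average to a per-frame sequence of 2D points. The moving-average filter, the function name, and the `radius` parameter are assumptions for illustration; the abstract does not specify which smoothing operator VideoPD uses, and the optical-flow step is not modeled here.

```python
def smooth_trajectory(points, radius=1):
    """Centered moving average over a sequence of (x, y) points.

    points: list of (x, y) tuples, one per frame
    radius: number of neighboring frames on each side (hypothetical)
    Windows are truncated at the sequence boundaries.
    """
    n = len(points)
    smoothed = []
    for i in range(n):
        window = points[max(0, i - radius):min(n, i + radius + 1)]
        mean_x = sum(p[0] for p in window) / len(window)
        mean_y = sum(p[1] for p in window) / len(window)
        smoothed.append((mean_x, mean_y))
    return smoothed
```

A single-frame spike such as `[(0, 0), (3, 3), (0, 0)]` is pulled toward its neighbors, which is the kind of frame-to-frame jitter the constraint is meant to suppress.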
[CV-78] Adaptive Low Light Enhancement via Joint Global-Local Illumination Adjustment
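Quick read: This paper addresses the problem that images captured under real-world low-light conditions have a large dynamic range, caused by uneven ambient lighting, and are hard to adjust to normal exposure levels. Existing end-to-end methods struggle to enhance such images, especially when local exposure is inconsistent. The paper proposes a brightness-adaptive enhancement framework built from two components: the Local Contrast Enhancement Network (LCEN) and the Global Illumination Guidance Network (GIGN). LCEN introduces an early stopping mechanism and a local discriminative module that adaptively perceives the contrast of different image regions, controlling when enhancement terminates early for patches with different exposure levels. GIGN includes a global attention guidance module that models global illumination by capturing long-range dependencies and contextual information in the image, guiding LCEN to improve brightness across different regions. To coordinate the two networks, the paper also designs a training strategy for the overall enhancement process. Experiments on multiple datasets show that the method outperforms current state-of-the-art algorithms in both quantitative and qualitative evaluation.
Link: https://arxiv.org/abs/2504.00400
Authors: Haodian Wang,Yaqi Song
Affiliations: University of Science and Technology of China (中国科学技术大学); Chn Energy Digital Intelligence Technology Development (Beijing) CO., LTD. (中国能源数字化智能技术发展(北京)有限公司)
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: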
Abstract:Images captured under real-world low-light conditions face significant challenges due to uneven ambient lighting, making it difficult for existing end-to-end methods to enhance images with a large dynamic range to normal exposure levels. To address the above issue, we propose a novel brightness-adaptive enhancement framework designed to tackle the challenge of local exposure inconsistencies in real-world low-light images. Specifically, our proposed framework comprises two components: the Local Contrast Enhancement Network (LCEN) and the Global Illumination Guidance Network (GIGN). We introduce an early stopping mechanism in the LCEN and design a local discriminative module, which adaptively perceives the contrast of different areas in the image to control the premature termination of the enhancement process for patches with varying exposure levels. Additionally, within the GIGN, we design a global attention guidance module that effectively models global illumination by capturing long-range dependencies and contextual information within the image, which guides the local contrast enhancement network to significantly improve brightness across different regions. Finally, in order to coordinate the LCEN and GIGN, we design a novel training strategy to facilitate the training process. Experiments on multiple datasets demonstrate that our method achieves superior quantitative and qualitative results compared to state-of-the-art algorithms.
[CV-79] SPF-Portrait: Towards Pure Portrait Customization with Semantic Pollution-Free Fine-tuning
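Quick read: This paper addresses Semantic Pollution in text-driven portrait customization: when a pre-trained Text-to-Image (T2I) model is fine-tuned for attribute customization, existing methods degrade the original model's behavior and hinder incremental learning. The paper proposes SPF-Portrait, whose core is a dual-path pipeline that learns the customized semantics while eliminating semantic pollution. The key innovations are: a reference path that preserves the original model's performance, with contrastive learning used to adapt to the target attributes while aligning unrelated attributes with the original portrait; a Semantic-Aware Fine Control Map that spatially guides the alignment between the contrastive paths and avoids over-alignment; and a response enhancement mechanism that strengthens the target attributes and mitigates the representation discrepancy inherent in direct cross-modal supervision. Together these improve performance while keeping the original model's behavior stable.
Link: https://arxiv.org/abs/2504.00396
Authors: Xiaole Xian,Zhichao Liao,Qingyu Li,Wenyu Qin,Pengfei Wan,Weicheng Xie,Long Zeng,Linlin Shen,Pingfa Feng
Affiliations: Shenzhen University (深圳大学); Tsinghua University (清华大学); Kuaishou Technology (快手科技)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: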
Abstract:While fine-tuning pre-trained Text-to-Image (T2I) models on portrait datasets enables attribute customization, existing methods suffer from Semantic Pollution that compromises the original model’s behavior and prevents incremental learning. To address this, we propose SPF-Portrait, a pioneering work to purely understand customized semantics while eliminating semantic pollution in text-driven portrait customization. In our SPF-Portrait, we propose a dual-path pipeline that introduces the original model as a reference for the conventional fine-tuning path. Through contrastive learning, we ensure adaptation to target attributes and purposefully align other unrelated attributes with the original portrait. We introduce a novel Semantic-Aware Fine Control Map, which represents the precise response regions of the target semantics, to spatially guide the alignment process between the contrastive paths. This alignment process not only effectively preserves the performance of the original model but also avoids over-alignment. Furthermore, we propose a novel response enhancement mechanism to reinforce the performance of target attributes, while mitigating representation discrepancy inherent in direct cross-modal supervision. Extensive experiments demonstrate that SPF-Portrait achieves state-of-the-art performance.
[CV-80] AP-CAP: Advancing High-Quality Data Synthesis for Animal Pose Estimation via a Controllable Image Generation Pipeline
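Quick read: This paper addresses the scarcity of high-quality datasets for 2D animal pose estimation, which limits the full potential of existing methods. It proposes AP-CAP, a Controllable Image Generation Pipeline for synthesizing animal pose estimation data. The key is a Multi-Modal Animal Image Generation Model that can produce images with expected poses, combined with three strategies: (1) a Modality-Fusion-Based Animal Image Synthesis Strategy that integrates multi-source appearance representations; (2) a Pose-Adjustment-Based Animal Image Synthesis Strategy that dynamically captures diverse pose variations; and (3) a Caption-Enhancement-Based Animal Image Synthesis Strategy that enriches visual semantic understanding. Together these enable the MPCH Dataset (Modality-Pose-Caption Hybrid), described as the first hybrid dataset combining synthetic and real data and the largest multi-source heterogeneous benchmark for animal pose estimation to date, which improves the performance and generalization of animal pose estimators.
Link: https://arxiv.org/abs/2504.00394
Authors: Lei Wang,Yujie Zhong,Xiaopeng Sun,Jingchun Cheng,Chengjian Feng,Qiong Cao,Lin Ma,Zhaoxin Fan
Affiliations: Meituan Inc. (美团); Beihang University (北京航空航天大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: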
Abstract:The task of 2D animal pose estimation plays a crucial role in advancing deep learning applications in animal behavior analysis and ecological research. Despite notable progress in some existing approaches, our study reveals that the scarcity of high-quality datasets remains a significant bottleneck, limiting the full potential of current methods. To address this challenge, we propose a novel Controllable Image Generation Pipeline for synthesizing animal pose estimation data, termed AP-CAP. Within this pipeline, we introduce a Multi-Modal Animal Image Generation Model capable of producing images with expected poses. To enhance the quality and diversity of the generated data, we further propose three innovative strategies: (1) Modality-Fusion-Based Animal Image Synthesis Strategy to integrate multi-source appearance representations, (2) Pose-Adjustment-Based Animal Image Synthesis Strategy to dynamically capture diverse pose variations, and (3) Caption-Enhancement-Based Animal Image Synthesis Strategy to enrich visual semantic understanding. Leveraging the proposed model and strategies, we create the MPCH Dataset (Modality-Pose-Caption Hybrid), the first hybrid dataset that innovatively combines synthetic and real data, establishing the largest-scale multi-source heterogeneous benchmark repository for animal pose estimation to date. Extensive experiments demonstrate the superiority of our method in improving both the performance and generalization capability of animal pose estimators.
[CV-81] Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration CVPR2025
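[Quick read]: This paper targets two failure modes of image-driven 3D scene reconstruction: visual discontinuities caused by global texture inconsistencies under varying camera poses, and scene voids caused by foreground-background occlusions. To obtain immersive, realistic 3D scenes that are free of dynamic-object obstructions, globally texture-consistent, and freely explorable, it proposes Scene4U, a layered reconstruction framework that starts from a single panoramic image. The key is to combine an open-vocabulary segmentation model with a large language model to decompose the panorama into multiple layers, then use a diffusion-based layered repair module that restores occluded regions from visual cues and depth information, yielding a hierarchical scene representation. The multi-layer panorama is initialized as a 3D Gaussian Splatting representation and optimized layer by layer, producing a 3D scene with semantic and structural consistency. Scene4U improves LPIPS by 24.24% and BRISQUE by 24.40% over the state of the art while achieving the fastest training speed.
Link: https://arxiv.org/abs/2504.00387
Authors: Zilong Huang,Jun He,Junyan Ye,Lihan Jiang,Weijia Li,Yiping Chen,Ting Han
Affiliations: Sun Yat-sen University; Shanghai Artificial Intelligence Laboratory; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025, 11 pages, 7 figures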
Abstract:The reconstruction of immersive and realistic 3D scenes holds significant practical importance in various fields of computer vision and computer graphics. Typically, immersive and realistic scenes should be free from obstructions by dynamic objects, maintain global texture consistency, and allow for unrestricted exploration. The current mainstream methods for image-driven scene construction involve iteratively refining the initial image using a moving virtual camera to generate the scene. However, previous methods struggle with visual discontinuities due to global texture inconsistencies under varying camera poses, and they frequently exhibit scene voids caused by foreground-background occlusions. To this end, we propose a novel layered 3D scene reconstruction framework from a panoramic image, named Scene4U. Specifically, Scene4U integrates an open-vocabulary segmentation model with a large language model to decompose a real panorama into multiple layers. Then, we employ a layered repair module based on a diffusion model to restore occluded regions using visual cues and depth information, generating a hierarchical representation of the scene. The multi-layer panorama is then initialized as a 3D Gaussian Splatting representation, followed by layered optimization, which ultimately produces an immersive 3D scene with semantic and structural consistency that supports free exploration. Scene4U outperforms state-of-the-art methods, improving by 24.24% in LPIPS and 24.40% in BRISQUE, while also achieving the fastest training speed. Additionally, to demonstrate the robustness of Scene4U and allow users to experience immersive scenes from various landmarks, we build the WorldVista3D dataset for 3D scene reconstruction, which contains panoramic images of globally renowned sites. The implementation code and dataset will be released at this https URL.
[CV-82] Leveraging Contrast Information for Efficient Document Shadow Removal
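[Quick read]: This paper addresses document shadow removal, where the dense text and patterns covered by shadows call for specialized methods. Existing methods either rely on extra inputs such as shadow masks or lack generalization and effectiveness across shadow scenarios, which often leaves shadows partly in place or loses the original content and tones; they also underuse the information in the shadowed document image itself. The paper proposes an end-to-end method guided by contrast representation that follows a coarse-to-fine refinement approach. The key is that extracting the document's contrast information locates shadow shapes and positions effectively and quickly without additional masks; this information is then fed into the refined removal stage to guide network-based removal and feature fusion. Experiments show state-of-the-art performance.
Link: https://arxiv.org/abs/2504.00385
Authors: Yifan Liu,Jiancheng Huang,Na Liu,Mingfu Yan,Yi Huang,Shifeng Chen
Affiliations: Southern University of Science and Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: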
Abstract:Document shadows are a major obstacle in the digitization process. Due to the dense information in text and patterns covered by shadows, document shadow removal requires specialized methods. Existing document shadow removal methods, although showing some progress, still rely on additional information such as shadow masks or lack generalization and effectiveness across different shadow scenarios. This often results in incomplete shadow removal or loss of original document content and tones. Moreover, these methods tend to underutilize the information present in the original shadowed document image. In this paper, we refocus our approach on the document images themselves, which inherently contain rich information. We propose an end-to-end document shadow removal method guided by contrast representation, following a coarse-to-fine refinement approach. By extracting document contrast information, we can effectively and quickly locate shadow shapes and positions without the need for additional masks. This information is then integrated into the refined shadow removal process, providing better guidance for network-based removal and feature fusion. Extensive qualitative and quantitative experiments show that our method achieves state-of-the-art performance.
[CV-83] Intrinsic-feature-guided 3D Object Detection
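[Quick read]: This paper addresses the performance limits of LiDAR-based 3D object detection when point clouds are sparse, unevenly distributed, and structurally incomplete. Noting that the grid and topological structures of road targets such as vehicles, pedestrians, and cyclists make them well suited to guidance from complete templates, it proposes an intrinsic-feature-guided detection method built on a template-assisted feature enhancement module. The key is to extract intrinsic features from relatively generalized templates to supply rich structural information for foreground objects, together with a proposal-level contrastive learning mechanism that widens the feature differences between foreground and background objects. The modules are plug-and-play and improve multiple existing methods.
Link: https://arxiv.org/abs/2504.00382
Authors: Wanjing Zhang,Chenxing Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: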
Abstract:LiDAR-based 3D object detection is essential for autonomous driving systems. However, LiDAR point clouds may appear to have sparsity, uneven distribution, and incomplete structures, significantly limiting the detection performance. In road driving environments, target objects referring to vehicles, pedestrians and cyclists are well-suited for enhancing representation through the complete template guidance, considering their grid and topological structures. Therefore, this paper presents an intrinsic-feature-guided 3D object detection method based on a template-assisted feature enhancement module, which extracts intrinsic features from relatively generalized templates and provides rich structural information for foreground objects. Furthermore, a proposal-level contrastive learning mechanism is designed to enhance the feature differences between foreground and background objects. The proposed modules can act as plug-and-play components and improve the performance of multiple existing methods. Extensive experiments illustrate that the proposed method achieves the highly competitive detection results. Code will be available at this https URL.
[CV-84] Hierarchical Flow Diffusion for Efficient Frame Interpolation CVPR2025
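[Quick read]: This paper addresses the large accuracy and efficiency gap between diffusion-based and non-diffusion methods for video frame interpolation. The key idea is to model bilateral optical flow explicitly with hierarchical diffusion models, which gives the denoising procedure a much smaller search space than denoising directly in latent space. A flow-guided image synthesizer then produces the final frame, and the flow diffusion model and synthesizer are trained end to end. The method reaches state-of-the-art accuracy and runs more than 10 times faster than other diffusion-based methods.
Link: https://arxiv.org/abs/2504.00380
Authors: Yang Hai,Guo Wang,Tan Su,Wenjie Jiang,Yinlin Hu
Affiliations: Insta360 Research; MagicLeap
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025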
Abstract:Most recent diffusion-based methods still show a large gap compared to non-diffusion methods for video frame interpolation, in both accuracy and efficiency. Most of them formulate the problem as a denoising procedure in latent space directly, which is less effective because of the large latent space. We propose to model bilateral optical flow explicitly by hierarchical diffusion models, which has a much smaller search space in the denoising procedure. Based on the flow diffusion model, we then use a flow-guided image synthesizer to produce the final result. We train the flow diffusion model and the image synthesizer end to end. Our method achieves state-of-the-art accuracy and is 10+ times faster than other diffusion-based methods. The project page is at: this https URL.
[CV-85] MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving CVPR2025
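[Quick read]: This paper addresses the semantic gap in autonomous driving visual question answering (AD-VQA) that arises when spatial information is expressed as textual coordinates, which hinders accurate transmission of spatial information and increases the expressive burden on the model. It proposes MPDrive, a marker-based prompt learning framework. The key is to represent spatial coordinates with concise visual markers, which keeps linguistic expression consistent and improves both visual perception and spatial expression. Concretely, a detection expert overlays numerical labels on object regions to create marker images, turning complex textual coordinate generation into straightforward text-based marker prediction. The original and marker images are fused as scene-level features and combined with detection priors to derive instance-level features, forming dual-granularity visual prompts that stimulate the LLM's spatial perception. MPDrive achieves state-of-the-art performance on the DriveLM and CODA-LM datasets, particularly in cases requiring sophisticated spatial understanding.
Link: https://arxiv.org/abs/2504.00379
Authors: Zhiyuan Zhang,Xiaofan Li,Zhihao Xu,Wenjie Peng,Zijian Zhou,Miaojing Shi,Shuangping Huang
Affiliations: South China University of Technology; Baidu Inc.; King’s College London; Tongji University; Pazhou Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025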
Abstract:Autonomous driving visual question answering (AD-VQA) aims to answer questions related to perception, prediction, and planning based on given driving scene images, heavily relying on the model’s spatial understanding capabilities. Prior works typically express spatial information through textual representations of coordinates, resulting in semantic gaps between visual coordinate representations and textual descriptions. This oversight hinders the accurate transmission of spatial information and increases the expressive burden. To address this, we propose a novel Marker-based Prompt learning framework (MPDrive), which represents spatial coordinates by concise visual markers, ensuring linguistic expressive consistency and enhancing the accuracy of both visual perception and spatial expression in AD-VQA. Specifically, we create marker images by employing a detection expert to overlay object regions with numerical labels, converting complex textual coordinate generation into straightforward text-based visual marker predictions. Moreover, we fuse original and marker images as scene-level features and integrate them with detection priors to derive instance-level features. By combining these features, we construct dual-granularity visual prompts that stimulate the LLM’s spatial perception capabilities. Extensive experiments on the DriveLM and CODA-LM datasets show that MPDrive achieves state-of-the-art performance, particularly in cases requiring sophisticated spatial understanding.
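The marker idea in this abstract can be illustrated with a minimal Python sketch. This is not MPDrive's implementation: it assumes a detector has already returned bounding boxes, and the helper names `assign_markers` and `resolve_marker` as well as the `<k>` answer format are hypothetical. Each box receives a numeric marker, and an answer that names a marker is mapped back to coordinates, so the language model never has to emit raw coordinates.

```python
import re

def assign_markers(boxes):
    """Give each detected box (x1, y1, x2, y2) a numeric marker starting at 1."""
    return {index: box for index, box in enumerate(boxes, start=1)}

def box_center(box):
    """Center of a box, a natural pixel location at which to draw its marker."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def resolve_marker(answer, markers):
    """Map a textual answer such as 'the car at <2>' back to the marked box."""
    match = re.search(r"<(\d+)>", answer)
    if match is None:
        return None
    return markers.get(int(match.group(1)))

boxes = [(10, 20, 110, 220), (300, 40, 380, 100)]  # hypothetical detections
markers = assign_markers(boxes)
picked = resolve_marker("Watch the vehicle at <2>.", markers)
print(picked, box_center(picked))
```

The lookup table is the only place where coordinates appear, which is the design point the abstract describes: the text side deals only in short marker indices.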
[CV-86] CamoSAM2: Motion-Appearance Induced Auto-Refining Prompts for Video Camouflaged Object Detection
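[Quick read]: This paper addresses video camouflaged object detection (VCOD), where camouflaged objects resemble their surroundings so closely that even the human eye struggles to distinguish them, and where applying SAM2 automatically is limited by camouflage perception and unreliable prompt generation. It proposes CamoSAM2, a motion-appearance prompt inducer (MAPI) and refinement framework that automatically generates and refines prompts for SAM2. Two elements are key. First, the prompt inducer integrates motion and appearance cues to detect camouflaged objects, giving more accurate initial predictions than existing methods. Second, a video-based adaptive multi-prompts refinement (AMPR) strategy tailored to SAM2 mitigates prompt errors in the initial coarse masks and produces better prompts through three steps: camouflaged object determination, pivotal prompting frame selection, and multi-prompts formation. On two benchmark datasets, CamoSAM2 raises mIoU by 8.0% and 10.1% over existing state-of-the-art methods and has the fastest inference speed among current VCOD models.
Link: https://arxiv.org/abs/2504.00375
Authors: Xin Zhang,Keren Fu,Qijun Zhao
Affiliations: National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures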
Abstract:The Segment Anything Model 2 (SAM2), a prompt-guided video foundation model, has remarkably performed in video object segmentation, drawing significant attention in the community. Due to the high similarity between camouflaged objects and their surroundings, which makes them difficult to distinguish even by the human eye, the application of SAM2 for automated segmentation in real-world scenarios faces challenges in camouflage perception and reliable prompts generation. To address these issues, we propose CamoSAM2, a motion-appearance prompt inducer (MAPI) and refinement framework to automatically generate and refine prompts for SAM2, enabling high-quality automatic detection and segmentation in VCOD task. Initially, we introduce a prompt inducer that simultaneously integrates motion and appearance cues to detect camouflaged objects, delivering more accurate initial predictions than existing methods. Subsequently, we propose a video-based adaptive multi-prompts refinement (AMPR) strategy tailored for SAM2, aimed at mitigating prompt error in initial coarse masks and further producing good prompts. Specifically, we introduce a novel three-step process to generate reliable prompts by camouflaged object determination, pivotal prompting frame selection, and multi-prompts formation. Extensive experiments conducted on two benchmark datasets demonstrate that our proposed model, CamoSAM2, significantly outperforms existing state-of-the-art methods, achieving increases of 8.0% and 10.1% in mIoU metric. Additionally, our method achieves the fastest inference speed compared to current VCOD models.
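The gains above are reported in mIoU. As a reminder of what that metric measures, here is a small Python sketch of mean intersection-over-union across video frames. The mask format (nested lists of 0/1) and the convention for two empty masks are assumptions made for illustration, not details taken from the paper.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    intersection = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union

def mean_iou(preds, targets):
    """Average the per-frame IoU over a video."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)

frame_a_pred = [[1, 1], [0, 0]]
frame_a_true = [[1, 0], [0, 0]]   # IoU = 1 / 2
frame_b_pred = [[0, 1], [0, 1]]
frame_b_true = [[0, 1], [0, 1]]   # IoU = 2 / 2
score = mean_iou([frame_a_pred, frame_b_pred], [frame_a_true, frame_b_true])
print(score)  # 0.75
```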
[CV-87] Spatiotemporal Attention Learning Framework for Event-Driven Object Recognition
[Quick read]: This paper addresses the computational overhead and parameter complexity of increasingly complex event-based architectures, which limit the practical deployment of event-driven object recognition. It proposes a spatiotemporal learning framework that enhances a VGG network with the Convolutional Block Attention Module (CBAM). The key is that CBAM sharpens the network's focus on important features at a low parameter cost: the approach performs comparably to state-of-the-art ResNet-based methods while using 2.3% fewer parameters than the original VGG model. It reaches the highest Top-1 accuracy of 76.4% (pretrained) and 71.3% (not pretrained) on CIFAR10-DVS and 72.4% (not pretrained) on N-Caltech101, remains robust without pretrained weights, and relies less on data augmentation.
Link: https://arxiv.org/abs/2504.00370
Authors: Tiantian Xie,Pengpai Wang,Rosa H. M. Chan
Affiliations: Department of Electrical Engineering, City University of Hong Kong; Shenzhen Research Institute, City University of Hong Kong, Shenzhen, China; State Key Laboratory of Terahertz and Millimeter Waves, City University of Hong Kong, Hong Kong, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 2025 IEEE NSENS
Abstract:Event-based vision sensors, inspired by biological neural systems, asynchronously capture local pixel-level intensity changes as a sparse event stream containing position, polarity, and timestamp information. These neuromorphic sensors offer significant advantages in dynamic range, latency, and power efficiency. Their working principle inherently addresses traditional camera limitations such as motion blur and redundant background information, making them particularly suitable for dynamic vision tasks. While recent works have proposed increasingly complex event-based architectures, the computational overhead and parameter complexity of these approaches limit their practical deployment. This paper presents a novel spatiotemporal learning framework for event-based object recognition, utilizing a VGG network enhanced with Convolutional Block Attention Module (CBAM). Our approach achieves comparable performance to state-of-the-art ResNet-based methods while reducing parameter count by 2.3% compared to the original VGG model. Specifically, it outperforms ResNet-based methods like MVF-Net, achieving the highest Top-1 accuracy of 76.4% (pretrained) and 71.3% (not pretrained) on CIFAR10-DVS, and 72.4% (not pretrained) on N-Caltech101. These results highlight the robustness of our method when pretrained weights are not used, making it suitable for scenarios where transfer learning is unavailable. Moreover, our approach reduces reliance on data augmentation. Experimental results on standard event-based datasets demonstrate the framework’s efficiency and effectiveness for real-world applications.
[CV-88] Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation CVPR2025
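[Quick read]: This paper addresses a key obstacle in zero-shot referring image segmentation (RIS): extracting precise, high-quality representations of mask regions. It proposes a training-free hybrid global-local feature extraction approach that combines detailed mask-specific features with contextual information from the surrounding area. To strengthen the alignment between mask regions and referring expressions, it adds a spatial guidance augmentation strategy that improves spatial coherence, which is essential for accurately localizing the described areas. The key is the combination of hybrid feature extraction and spatial guidance augmentation, which together yield more robust and precise referring segmentation. On standard RIS benchmarks the method significantly outperforms existing zero-shot RIS models.
Link: https://arxiv.org/abs/2504.00356
Authors: Ting Liu,Siyuan Li
Affiliations: ASGO, School of Computer Science, Northwestern Polytechnical University; Shenzhen Research Institute of Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: accepted to CVPR2025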
Abstract:Recent advances in zero-shot referring image segmentation (RIS), driven by models such as the Segment Anything Model (SAM) and CLIP, have made substantial progress in aligning visual and textual information. Despite these successes, the extraction of precise and high-quality mask region representations remains a critical challenge, limiting the full potential of RIS tasks. In this paper, we introduce a training-free, hybrid global-local feature extraction approach that integrates detailed mask-specific features with contextual information from the surrounding area, enhancing mask region representation. To further strengthen alignment between mask regions and referring expressions, we propose a spatial guidance augmentation strategy that improves spatial coherence, which is essential for accurately localizing described areas. By incorporating multiple spatial cues, this approach facilitates more robust and precise referring segmentation. Extensive experiments on standard RIS benchmarks demonstrate that our method significantly outperforms existing zero-shot RIS models, achieving substantial performance gains. We believe our approach advances RIS tasks and establishes a versatile framework for region-text alignment, offering broader implications for cross-modal understanding and interaction. Code is available at this https URL .
[CV-89] Transductive One-Shot Learning Meet Subspace Decomposition
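[Quick read]: This paper focuses on one-shot learning: adapting pretrained models to recognize new, unseen classes from a single human-annotated image. It proposes a transductive one-shot learning approach whose key is subspace decomposition: the labeled images in the support set and the unlabeled images in the query set are decomposed into linear combinations of latent variables that represent primitives captured by smaller subspaces. The label of the single labeled support image can then be propagated to query images that share similar combinations of latent primitives, allowing effective generalization to novel classes from one labeled image.
Link: https://arxiv.org/abs/2504.00348
Authors: Kyle Stein,Andrew A. Mahyari,Guillermo Francia III,Eman El-Sheikh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: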
Abstract:One-shot learning focuses on adapting pretrained models to recognize newly introduced and unseen classes based on a single labeled image. While variations of few-shot and zero-shot learning exist, one-shot learning remains a challenging yet crucial problem due to its ability to generalize knowledge to unseen classes from just one human-annotated image. In this paper, we introduce a transductive one-shot learning approach that employs subspace decomposition to utilize the information from labeled images in the support set and unlabeled images in the query set. These images are decomposed into a linear combination of latent variables representing primitives captured by smaller subspaces. By representing images in the query set as linear combinations of these latent primitives, we can propagate the label from a single image in the support set to query images that share similar combinations of primitives. Through a comprehensive quantitative analysis across various neural network feature extractors and datasets, we demonstrate that our approach can effectively generalize to novel classes from just one labeled image.
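The label propagation step described above can be sketched in a few lines of Python. This is an illustration under strong assumptions, not the authors' method: the subspace decomposition itself is not shown, each image is assumed to be already represented by a coefficient vector over latent primitives, and cosine similarity is a stand-in for whatever measure of "similar combinations of primitives" the paper uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two coefficient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.0 if norm == 0 else dot / norm

def propagate_labels(support, queries):
    """Assign each query the label of the most similar support vector.

    support: dict mapping a label to the coefficient vector of its single labeled image
    queries: list of coefficient vectors for unlabeled images
    """
    return [max(support, key=lambda label: cosine(support[label], q)) for q in queries]

# Hypothetical coefficients over three latent primitives.
support = {"cat": [0.9, 0.1, 0.0], "truck": [0.0, 0.2, 0.8]}
queries = [[0.8, 0.2, 0.1], [0.1, 0.1, 0.9]]
labels = propagate_labels(support, queries)
print(labels)  # ['cat', 'truck']
```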
[CV-90] NeRF-Based defect detection
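[Quick read]: This paper addresses defect detection in large-scale machinery, where industrial automation demands precision and efficiency but manual inspection is labor-intensive, subjective, and often hazardous. It proposes an automated defect detection framework built on Neural Radiance Fields (NeRF) and the concept of digital twins. The key is to capture images with UAVs and reconstruct 3D models of the machinery, producing a standard reference model and a current-state model for comparison; the models are aligned with the Iterative Closest Point (ICP) algorithm, which enables precise point cloud analysis to detect deviations that signify potential defects. By eliminating manual inspection, the method aims to improve accuracy and operational safety and to scale.
Link: https://arxiv.org/abs/2504.00270
Authors: Tianqi(Kirk)Ding,Dawei Xiang,Yijiashun Qi,Ze Yang,Zunduo Zhao,Tianyao Sun,Pengbin Feng,Haoyu Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 11 figures, 2025 2nd International Conference on Remote Sensing, Mapping and Image Processing (RSMIP 2025)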
Abstract:The rapid growth of industrial automation has highlighted the need for precise and efficient defect detection in large-scale machinery. Traditional inspection techniques, involving manual procedures such as scaling tall structures for visual evaluation, are labor-intensive, subjective, and often hazardous. To overcome these challenges, this paper introduces an automated defect detection framework built on Neural Radiance Fields (NeRF) and the concept of digital twins. The system utilizes UAVs to capture images and reconstruct 3D models of machinery, producing both a standard reference model and a current-state model for comparison. Alignment of the models is achieved through the Iterative Closest Point (ICP) algorithm, enabling precise point cloud analysis to detect deviations that signify potential defects. By eliminating manual inspection, this method improves accuracy, enhances operational safety, and offers a scalable solution for defect detection. The proposed approach demonstrates great promise for reliable and efficient industrial applications.
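The align-then-compare step can be sketched with a toy ICP in Python. The sketch is deliberately simplified and makes several assumptions that are not from the paper: it works in 2D rather than on 3D point clouds reconstructed by NeRF, uses brute-force nearest neighbors, and flags a point as a potential defect when its residual exceeds a hand-picked threshold. All function names and the sample data are hypothetical.

```python
import math

def nearest(point, cloud):
    """Brute-force closest point in `cloud` to `point`."""
    return min(cloud, key=lambda q: (q[0] - point[0]) ** 2 + (q[1] - point[1]) ** 2)

def best_rigid_2d(source, target):
    """Least-squares rotation angle and translation for paired 2D points."""
    n = len(source)
    sx, sy = sum(p[0] for p in source) / n, sum(p[1] for p in source) / n
    tx, ty = sum(q[0] for q in target) / n, sum(q[1] for q in target) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(source, target):
        ax, ay, bx, by = px - sx, py - sy, qx - tx, qy - ty
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that maps the rotated source centroid onto the target centroid.
    return theta, (tx - (c * sx - s * sy), ty - (s * sx + c * sy))

def transform(points, theta, shift):
    """Rotate by theta, then translate by shift."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in points]

def icp(current, reference, iterations=10):
    """Alternate nearest-neighbor matching and rigid re-alignment."""
    aligned = list(current)
    for _ in range(iterations):
        matches = [nearest(p, reference) for p in aligned]
        theta, shift = best_rigid_2d(aligned, matches)
        aligned = transform(aligned, theta, shift)
    return aligned

def flag_deviations(aligned, reference, threshold):
    """Indices of aligned points that lie farther than `threshold` from the reference."""
    flagged = []
    for index, p in enumerate(aligned):
        q = nearest(p, reference)
        if math.hypot(p[0] - q[0], p[1] - q[1]) > threshold:
            flagged.append(index)
    return flagged

# Hypothetical L-shaped reference outline, sampled one unit apart.
reference = [(float(i), 0.0) for i in range(10)] + [(0.0, float(j)) for j in range(1, 6)]
# Current scan: the same outline shifted slightly, with point 5 pushed out of place.
current = [(x + 0.2, y - 0.1) for x, y in reference]
current[5] = (current[5][0], current[5][1] + 0.8)

aligned = icp(current, reference)
flagged = flag_deviations(aligned, reference, threshold=0.3)
print(flagged)
```

In this toy case only the displaced point should exceed the threshold after alignment. A production pipeline would more likely rely on an established 3D registration library and a robust variant of ICP, since a large defect biases a plain least-squares fit.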
[CV-91] MultiMorph: On-demand Atlas Construction CVPR2025
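[Quick read]: This paper addresses the slow speed of anatomical atlas construction, which typically takes days to weeks and discourages rapid experimentation; as a result, many studies rely on suboptimal precomputed atlases from mismatched populations, which harms downstream analyses. It proposes MultiMorph, whose key is a feedforward model that produces high-quality, population-specific atlases in a single forward pass, without fine-tuning or optimization. The model is based on a linear group-interaction layer that aggregates and shares features within the group of input images, and auxiliary synthetic data lets it generalize to new imaging modalities and population groups at test time. It is about 100 times faster than existing methods, giving biomedical researchers without machine learning expertise an accessible way to generate high-quality atlases.
Link: https://arxiv.org/abs/2504.00247
Authors: S. Mazdak Abulnaga,Andrew Hoopes,Neel Dey,Malte Hoffmann,Marianne Rakic,Bruce Fischl,John Guttag,Adrian Dalca
Affiliations: MIT Computer Science and Artificial Intelligence Laboratory; Massachusetts General Hospital, Harvard Medical School
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: accepted to CVPR 2025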
Abstract:We present MultiMorph, a fast and efficient method for constructing anatomical atlases on the fly. Atlases capture the canonical structure of a collection of images and are essential for quantifying anatomical variability across populations. However, current atlas construction methods often require days to weeks of computation, thereby discouraging rapid experimentation. As a result, many scientific studies rely on suboptimal, precomputed atlases from mismatched populations, negatively impacting downstream analyses. MultiMorph addresses these challenges with a feedforward model that rapidly produces high-quality, population-specific atlases in a single forward pass for any 3D brain dataset, without any fine-tuning or optimization. MultiMorph is based on a linear group-interaction layer that aggregates and shares features within the group of input images. Further, by leveraging auxiliary synthetic data, MultiMorph generalizes to new imaging modalities and population groups at test-time. Experimentally, MultiMorph outperforms state-of-the-art optimization-based and learning-based atlas construction methods in both small and large population settings, with a 100-fold reduction in time. This makes MultiMorph an accessible framework for biomedical researchers without machine learning expertise, enabling rapid, high-quality atlas generation for diverse studies.
[CV-92] CBIL: Collective Behavior Imitation Learning for Fish from Real Videos
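[Quick read]: This paper addresses two limitations in generating realistic collective behavior: traditional rule-based methods lack motion diversity, and existing imitation learning methods depend on ground-truth trajectories and struggle with dense, erratic groups. It proposes Collective Behavior Imitation Learning (CBIL), a scalable approach that learns fish schooling behavior directly from videos without captured motion trajectories. The key is a two-stage design. First, a Masked Video AutoEncoder (MVAE) performs self-supervised video representation learning, mapping 2D observations to compact, expressive implicit states. Second, a novel adversarial imitation learning method captures the distribution of complex motion patterns in the latent space, with bio-inspired rewards and priors that regularize and stabilize training. Once trained, CBIL can be applied to various animation tasks using the learned collective motion priors; the paper also shows its effectiveness across species and its use in detecting abnormal fish behavior in in-the-wild videos.
Link: https://arxiv.org/abs/2504.00234
Authors: Yifan Wu,Zhiyang Dou,Yuko Ishiwaka,Shun Ogawa,Yuke Lou,Wenping Wang,Lingjie Liu,Taku Komura
Affiliations: The University of Hong Kong; University of Pennsylvania; SoftBank Corp.; The University of Hong Kong; The University of Hong Kong; Texas A&M University; University of Pennsylvania; The University of Hong Kong
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: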
Abstract:Reproducing realistic collective behaviors presents a captivating yet formidable challenge. Traditional rule-based methods rely on hand-crafted principles, limiting motion diversity and realism in generated collective behaviors. Recent imitation learning methods learn from data but often require ground truth motion trajectories and struggle with authenticity, especially in high-density groups with erratic movements. In this paper, we present a scalable approach, Collective Behavior Imitation Learning (CBIL), for learning fish schooling behavior directly from videos, without relying on captured motion trajectories. Our method first leverages Video Representation Learning, where a Masked Video AutoEncoder (MVAE) extracts implicit states from video inputs in a self-supervised manner. The MVAE effectively maps 2D observations to implicit states that are compact and expressive for following the imitation learning stage. Then, we propose a novel adversarial imitation learning method to effectively capture complex movements of the schools of fish, allowing for efficient imitation of the distribution for motion patterns measured in the latent space. It also incorporates bio-inspired rewards alongside priors to regularize and stabilize training. Once trained, CBIL can be used for various animation tasks with the learned collective motion priors. We further show its effectiveness across different species. Finally, we demonstrate the application of our system in detecting abnormal fish behavior from in-the-wild videos.
[CV-93] GazeLLM: Multimodal LLMs incorporating Human Visual Attention
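[Quick read]: This paper addresses the heavy memory and compute demands that multimodal large language models (MLLMs) face on high-resolution, long-duration video, without accepting the loss of comprehension that comes from lowering the resolution. The key is to integrate eye-tracking data and decompose first-person video into sub-areas around the regions of gaze focus, which are then processed selectively. Feeding in only the gazed regions achieves task comprehension equal to or better than processing the full-resolution image while cutting the input to one-tenth of the pixels, offering an efficient way to use MLLMs to interpret and utilize human skills.
Link: https://arxiv.org/abs/2504.00221
Authors: Jun Rekimoto
Affiliations: The University of Tokyo; Sony CSL - Kyoto
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: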
Abstract:Large Language Models (LLMs) are advancing into Multimodal LLMs (MLLMs), capable of processing image, audio, and video as well as text. Combining first-person video, MLLMs show promising potential for understanding human activities through video and audio, enabling many human-computer interaction and human-augmentation applications such as human activity support, real-world agents, and skill transfer to robots or other individuals. However, handling high-resolution, long-duration videos generates large latent representations, leading to substantial memory and processing demands, limiting the length and resolution MLLMs can manage. Reducing video resolution can lower memory usage but often compromises comprehension. This paper introduces a method that optimizes first-person video analysis by integrating eye-tracking data, and proposes a method that decomposes first-person vision video into sub areas for regions of gaze focus. By processing these selectively gazed-focused inputs, our approach achieves task comprehension equivalent to or even better than processing the entire image at full resolution, but with significantly reduced video data input (reduce the number of pixels to one-tenth), offering an efficient solution for using MLLMs to interpret and utilize human skills.
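To make the one-tenth pixel budget concrete, here is a small Python sketch of a gaze-centered crop. It is an assumption about how such a crop could be computed, not the paper's procedure: the function name `gaze_crop`, the choice to preserve the frame's aspect ratio, and the clamping rule are all hypothetical.

```python
import math

def gaze_crop(width, height, gaze_x, gaze_y, pixel_fraction=0.1):
    """Return (left, top, right, bottom) of a gaze-centered crop.

    The crop keeps the frame's aspect ratio and covers roughly
    `pixel_fraction` of the original pixels.
    """
    scale = math.sqrt(pixel_fraction)  # side-length scale that yields the area fraction
    crop_w = max(1, round(width * scale))
    crop_h = max(1, round(height * scale))
    # Center on the gaze point, then clamp so the window stays inside the frame.
    left = min(max(gaze_x - crop_w // 2, 0), width - crop_w)
    top = min(max(gaze_y - crop_h // 2, 0), height - crop_h)
    return left, top, left + crop_w, top + crop_h

box = gaze_crop(1920, 1080, gaze_x=1800, gaze_y=100)
fraction = (box[2] - box[0]) * (box[3] - box[1]) / (1920 * 1080)
print(box, fraction)
```

Because the area scales with the square of the side length, keeping one-tenth of the pixels still leaves a window about 32% as wide and as tall as the frame, which is why a gaze-centered crop can retain most of the task-relevant detail.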
[CV-94] Can Diffusion Models Disentangle? A Theoretical Perspective
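[Quick read]: This paper asks how diffusion models can learn disentangled representations. The key is a new theoretical framework that establishes identifiability conditions for general disentangled latent variable models, analyzes training dynamics, and derives sample complexity bounds for disentangled latent subspace models. The theory is validated with disentanglement experiments across tasks and modalities, including subspace recovery in latent subspace Gaussian mixture models, image colorization, image denoising, and voice conversion for speech classification. Training strategies inspired by the theory, such as style guidance regularization, consistently improve disentanglement performance.
Link: https://arxiv.org/abs/2504.00220
Authors: Liming Wang,Muhammad Jehanzeb Mirza,Yishu Gong,Yuan Gong,Jiaqi Zhang,Brian H. Tracey,Katerina Placek,Marco Vilela,James R. Glass
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: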
Abstract:This paper presents a novel theoretical framework for understanding how diffusion models can learn disentangled representations. Within this framework, we establish identifiability conditions for general disentangled latent variable models, analyze training dynamics, and derive sample complexity bounds for disentangled latent subspace models. To validate our theory, we conduct disentanglement experiments across diverse tasks and modalities, including subspace recovery in latent subspace Gaussian mixture models, image colorization, image denoising, and voice conversion for speech classification. Additionally, our experiments show that training strategies inspired by our theory, such as style guidance regularization, consistently enhance disentanglement performance.
[CV-95] LITA-GS: Illumination-Agnostic Novel View Synthesis via Reference-Free 3D Gaussian Splatting and Physical Priors CVPR2025
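[Quick read]: This paper addresses the difficulty of obtaining high-quality, normally exposed 3D representations when 3D Gaussian Splatting (3DGS) is applied directly to images captured under adverse illumination. Three problems are identified: (1) the limited Structure from Motion (SfM) points estimated in such scenes fail to capture sufficient scene detail; (2) without ground-truth references, heavy information loss, significant noise, and color distortion make high-quality results hard to obtain; and (3) combining existing exposure correction methods with 3DGS performs poorly because each image is enhanced independently, which leads to inconsistent illumination across viewpoints. The paper proposes LITA-GS, an illumination-agnostic novel view synthesis method based on reference-free 3DGS and physical priors. The key is an illumination-invariant physical prior extraction pipeline, on which a lighting-agnostic structure rendering strategy is built, plus a progressive denoising module that mitigates noise in the light-invariant representation. Training is unsupervised, and experiments show that LITA-GS surpasses the state-of-the-art NeRF-based method with faster inference and shorter training time.
Link: https://arxiv.org/abs/2504.00219
Authors: Han Zhou,Wei Dong,Jun Chen
Affiliations: McMaster University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025. 3DGS, Adverse illumination conditions, Reference-free, Physical priors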
Abstract:Directly employing 3D Gaussian Splatting (3DGS) on images with adverse illumination conditions exhibits considerable difficulty in achieving high-quality, normally-exposed representations due to: (1) The limited Structure from Motion (SfM) points estimated in adverse illumination scenarios fail to capture sufficient scene details; (2) Without ground-truth references, the intensive information loss, significant noise, and color distortion pose substantial challenges for 3DGS to produce high-quality results; (3) Combining existing exposure correction methods with 3DGS does not achieve satisfactory performance due to their individual enhancement processes, which lead to the illumination inconsistency between enhanced images from different viewpoints. To address these issues, we propose LITA-GS, a novel illumination-agnostic novel view synthesis method via reference-free 3DGS and physical priors. Firstly, we introduce an illumination-invariant physical prior extraction pipeline. Secondly, based on the extracted robust spatial structure prior, we develop the lighting-agnostic structure rendering strategy, which facilitates the optimization of the scene structure and object appearance. Moreover, a progressive denoising module is introduced to effectively mitigate the noise within the light-invariant representation. We adopt the unsupervised strategy for the training of LITA-GS and extensive experiments demonstrate that LITA-GS surpasses the state-of-the-art (SOTA) NeRF-based method while enjoying faster inference speed and costing reduced training time. The code is released at this https URL.
[CV-96] RailGoerl24: Görlitz Rail Test Center CV Dataset 2024
[Quick read]: This paper addresses the need, in driverless train operation on open tracks of urban guided transport and mainline railways, to automatically detect actual and potential obstacles, especially humans, in the danger zone of the train's path. The key contribution is RailGoerl24, a dataset of 12205 Full HD frames recorded at the TÜV SÜD Rail test center in Görlitz, Germany, with 33556 boxwise annotations for the object class 'person'. Because publicly available rail datasets are not yet sufficient for machine learning algorithms that need large amounts of high-quality annotated data, the dataset helps close this gap, supports the development of driverless train operation, and can be used for tasks beyond collision prediction.
Link: https://arxiv.org/abs/2504.00204
Authors: Rustam Tagiew(1),Ilkay Wunderlich(2),Mark Sastuba(1),Steffen Seitz(3) ((1) German Centre for Rail Traffic Research at the Federal Railway Authority, (2) EYYES GmbH, (3) Conrad Zuse School of Embedded Composite AI and the Chair of Fundamentals of Electrical Engineering of Dresden University of Technology)
Affiliations: German Centre for Rail Traffic Research at the Federal Railway Authority (DZSF); EYYES GmbH; Conrad Zuse School of Embedded Composite AI (SECAI) and the Chair of Fundamentals of Electrical Engineering of Dresden University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 4 pages, 5 figures, submitted to Engineering Reliable Autonomous Systems 2025
Abstract:Driverless train operation for open tracks on urban guided transport and mainline railways requires, among other things automatic detection of actual and potential obstacles, especially humans, in the danger zone of the train’s path. Machine learning algorithms have proven to be powerful state-of-the-art tools for this task. However, these algorithms require large amounts of high-quality annotated data containing human beings in railway-specific environments as training data. Unfortunately, the amount of publicly available datasets is not yet sufficient and is significantly inferior to the datasets in the road domain. Therefore, this paper presents RailGoerl24, an on-board visual light Full HD camera dataset of 12205 frames recorded in a railway test center of TÜV SÜD Rail, in Görlitz, Germany. Its main purpose is to support the development of driverless train operation for guided transport. RailGoerl24 also includes a terrestrial LiDAR scan covering parts of the area used to acquire the RGB data. In addition to the raw data, the dataset contains 33556 boxwise annotations in total for the object class ‘person’. The faces of recorded actors are not blurred or altered in any other way. RailGoerl24, soon available at this http URL, can also be used for tasks beyond collision prediction.
[CV-97] SmartScan: An AI-based Interactive Framework for Automated Region Extraction from Satellite Images
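[Quick read]: This paper addresses the difficulty of determining the number and optimal placement of fixed sensors in continuous methane monitoring systems; conventional planning is labor-intensive and scales poorly when multiple sites must be evaluated. The proposed solution is SmartScan, an AI framework that automates data extraction for optimal sensor placement. Its key innovation is the use of the Segment Anything Model (SAM), a prompt-based transformer for zero-shot segmentation, to extract subspaces of interest from satellite images without explicit training, with an interactive tool that supports quality control and constraint-set generation. SmartScan offers two modes: Data Curation Mode, in which users interactively extract high-quality subspaces, and Autonomous Mode, in which a deep learning network replaces manual prompting to fully automate extraction, greatly reducing human intervention and improving scalability and efficiency.
Link: https://arxiv.org/abs/2504.00200
Authors: Savinay Nagendra,Kashif Rashid
Affiliations: Schlumberger-Doll Research, Cambridge, MA 02139
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: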
Abstract:The deployment of a continuous methane monitoring system requires determining the optimal number and placement of fixed sensors. However, planning is labor-intensive, requiring extensive site setup and iteration to meet client restrictions. This challenge is amplified when evaluating multiple sites, limiting scalability. To address this, we introduce SmartScan, an AI framework that automates data extraction for optimal sensor placement. SmartScan identifies subspaces of interest from satellite images using an interactive tool to create facility-specific constraint sets efficiently. SmartScan leverages the Segment Anything Model (SAM), a prompt-based transformer for zero-shot segmentation, enabling subspace extraction without explicit training. It operates in two modes: (1) Data Curation Mode, where satellite images are processed to extract high-quality subspaces using an interactive prompting system for SAM, and (2) Autonomous Mode, where user-curated prompts train a deep learning network to replace manual prompting, fully automating subspace extraction. The interactive tool also serves for quality control, allowing users to refine AI-generated outputs and generate additional constraint sets as needed. With its AI-driven prompting mechanism, SmartScan delivers high-throughput, high-quality subspace extraction with minimal human intervention, enhancing scalability and efficiency. Notably, its adaptable design makes it suitable for extracting regions of interest from ultra-high-resolution satellite imagery across various domains.
[CV-98] Leveraging Diffusion Model and Image Foundation Model for Improved Correspondence Matching in Coronary Angiography
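[Quick read]: This paper addresses accurate correspondence matching in coronary angiography images, which is key to reconstructing 3D coronary artery structures and, in turn, to precise diagnosis and treatment planning for coronary artery disease (CAD). Matching methods designed for natural images generalize poorly to X-ray images because of their inherent characteristics (lack of texture, low contrast, overlapping structures) and because training data are scarce. The paper proposes a pipeline that generates realistic paired coronary angiography images with a diffusion model conditioned on 2D projections of 3D meshes reconstructed from Coronary Computed Tomography Angiography (CCTA), providing high-quality synthetic training data. In addition, large-scale image foundation models guide feature aggregation so that the method focuses on semantically relevant regions and keypoints, improving matching accuracy. The key is the combination of diffusion-generated synthetic data and foundation-model-guided feature learning, which yields superior matching on synthetic datasets and generalizes to real-world datasets.
Link: https://arxiv.org/abs/2504.00191
Authors: Lin Zhao,Xin Yu,Yikang Liu,Xiao Chen,Eric Z. Chen,Terrence Chen,Shanhui Sun
Affiliations: United Imaging Intelligence; Department of Computer Science, Vanderbilt University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: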
Abstract:Accurate correspondence matching in coronary angiography images is crucial for reconstructing 3D coronary artery structures, which is essential for precise diagnosis and treatment planning of coronary artery disease (CAD). Traditional matching methods for natural images often fail to generalize to X-ray images due to inherent differences such as lack of texture, lower contrast, and overlapping structures, compounded by insufficient training data. To address these challenges, we propose a novel pipeline that generates realistic paired coronary angiography images using a diffusion model conditioned on 2D projections of 3D reconstructed meshes from Coronary Computed Tomography Angiography (CCTA), providing high-quality synthetic data for training. Additionally, we employ large-scale image foundation models to guide feature aggregation, enhancing correspondence matching accuracy by focusing on semantically relevant regions and keypoints. Our approach demonstrates superior matching performance on synthetic datasets and effectively generalizes to real-world datasets, offering a practical solution for this task. Furthermore, our work investigates the efficacy of different foundation models in correspondence matching, providing novel insights into leveraging advanced image foundation models for medical imaging applications.
[CV-99] Self-Evolving Visual Concept Library using Vision-Language Critics CVPR
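[Quick read]: This paper addresses the problem of building a visual concept library for visual recognition. Defining concepts manually is labor-intensive, while relying solely on large language models (LLMs) can produce concepts that lack discriminative power or ignore the complex interactions between concepts. The proposed approach, ESCHER, takes a library learning perspective and iteratively discovers and improves visual concepts. Its key idea is to use a vision-language model (VLM) as a critic that drives adjustments to the concept generation strategy, combined with the in-context learning abilities of LLMs and the history of performance feedback, to keep improving the library. ESCHER requires no human annotations, making it an automated plug-and-play framework. Experiments show that ESCHER can learn a concept library for zero-shot, few-shot, and fine-tuning visual classification tasks.
Link: https://arxiv.org/abs/2504.00185
Authors: Atharva Sehgal,Patrick Yuan,Ziniu Hu,Yisong Yue,Jennifer J. Sun,Swarat Chaudhuri
Affiliations: University of Texas at Austin; Cornell University; California Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: CVPR camera ready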
Abstract:We study the problem of building a visual concept library for visual recognition. Building effective visual concept libraries is challenging, as manual definition is labor-intensive, while relying solely on LLMs for concept generation can result in concepts that lack discriminative power or fail to account for the complex interactions between them. Our approach, ESCHER, takes a library learning perspective to iteratively discover and improve visual concepts. ESCHER uses a vision-language model (VLM) as a critic to iteratively refine the concept library, including accounting for interactions between concepts and how they affect downstream classifiers. By leveraging the in-context learning abilities of LLMs and the history of performance using various concepts, ESCHER dynamically improves its concept generation strategy based on the VLM critic’s feedback. Finally, ESCHER does not require any human annotations, and is thus an automated plug-and-play framework. We empirically demonstrate the ability of ESCHER to learn a concept library for zero-shot, few-shot, and fine-tuning visual classification tasks. This work represents, to our knowledge, the first application of concept library learning to real-world visual tasks.
[CV-100] SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance
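[Quick read]: This paper addresses the degraded performance of foundation models on low signal-to-noise ratio (SNR) sensor videos such as underwater sonar, ultrasound, and microscopy. It proposes Spatiotemporal Augmentations and denoising in Video for Downstream Tasks (SAVeD), a self-supervised method that denoises video by exploiting differences between foreground and background motion in an encoder-decoder architecture with a temporal bottleneck. The key point is that no clean data are needed: training uses only the raw noisy data. By enhancing object visibility, SAVeD improves classification, detection, tracking, and counting, and outperforms state-of-the-art video denoising methods with lower resource requirements.
Link: https://arxiv.org/abs/2504.00161
Authors: Suzanne Stathatos,Michael Hobley,Markus Marks,Pietro Perona
Affiliations: California Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL Code page: this https URL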
Abstract:Foundation models excel at vision tasks in natural images but fail in low signal-to-noise ratio (SNR) videos, such as underwater sonar, ultrasound, and microscopy. We introduce Spatiotemporal Augmentations and denoising in Video for Downstream Tasks (SAVeD), a self-supervised method that denoises low-SNR sensor videos and is trained using only the raw noisy data. By leveraging differences in foreground and background motion, SAVeD enhances object visibility using an encoder-decoder with a temporal bottleneck. Our approach improves classification, detection, tracking, and counting, outperforming state-of-the-art video denoising methods with lower resource requirements. Project page: this https URL Code page: this https URL
[CV-101] SonarSplat: Novel View Synthesis of Imaging Sonar via Gaussian Splatting
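[Quick read]: This paper addresses realistic novel view synthesis for underwater imaging sonar and the modeling of acoustic streaking phenomena. The key is a new framework, SonarSplat, which represents the scene as a set of 3D Gaussians with acoustic reflectance and saturation properties and efficiently rasterizes the learned Gaussians into range/azimuth images that are faithful to the sonar image formation model. In particular, the paper introduces a way to model azimuth streaking within a Gaussian splatting framework. Compared with the state of the art, SonarSplat improves image synthesis by +2.5 dB PSNR and can also be used for azimuth streak removal and 3D scene reconstruction.
Link: https://arxiv.org/abs/2504.00159
Authors: Advaith V. Sethuraman,Max Rucker,Onur Bagoren,Pou-Chun Kung,Nibarkavi N.B. Amutha,Katherine A. Skinner
Affiliations: Department of Robotics, University of Michigan, Ann Arbor
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: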
Abstract:In this paper, we present SonarSplat, a novel Gaussian splatting framework for imaging sonar that demonstrates realistic novel view synthesis and models acoustic streaking phenomena. Our method represents the scene as a set of 3D Gaussians with acoustic reflectance and saturation properties. We develop a novel method to efficiently rasterize learned Gaussians to produce a range/azimuth image that is faithful to the acoustic image formation model of imaging sonar. In particular, we develop a novel approach to model azimuth streaking in a Gaussian splatting framework. We evaluate SonarSplat using real-world datasets of sonar images collected from an underwater robotic platform in a controlled test tank and in a real-world river environment. Compared to the state-of-the-art, SonarSplat offers improved image synthesis capabilities (+2.5 dB PSNR). We also demonstrate that SonarSplat can be leveraged for azimuth streak removal and 3D scene reconstruction.
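To put the reported +2.5 dB PSNR gain in context, the following minimal sketch computes the standard peak signal-to-noise ratio between two images and converts a dB gain into the equivalent ratio of mean squared errors. It is not the authors' evaluation code; the function names and the nested-list image format are assumptions made for illustration.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Standard PSNR in dB between two equally sized images (nested lists of floats)."""
    ref = [p for row in reference for p in row]
    out = [p for row in rendered for p in row]
    if len(ref) != len(out) or not ref:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Ratio new_mse / old_mse implied by a PSNR gain of gain_db decibels."""
    return 10 ** (-gain_db / 10)


if __name__ == "__main__":
    # Hypothetical 2x2 images, used only to exercise the function.
    print(psnr([[0.0, 0.0], [0.0, 0.0]], [[0.1, 0.1], [0.1, 0.1]]))
    print(mse_ratio_for_gain(2.5))
```

Under this definition, a 2.5 dB gain at a fixed peak value corresponds to roughly a 44% reduction in mean squared error.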
[CV-102] Few-Shot Generation of Brain Tumors for Secure and Fair Data Sharing
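[Quick read]: This paper addresses the privacy and data heterogeneity challenges of analyzing multi-center medical data, in particular the vulnerability of distributed approaches such as federated learning to privacy breaches in medical imaging. Generative models such as diffusion models enhance privacy by synthesizing realistic data, but they are prone to memorization when trained on small datasets. To address this, the paper proposes a decentralized few-shot generative model (DFGM) that synthesizes brain tumor images while fully preserving privacy.
The key innovation of DFGM is to combine private tumor data with publicly shareable healthy images from multiple medical centers, building a new synthetic dataset by blending tumor foregrounds with healthy backgrounds. This provides stringent privacy protection and, by preserving both the healthy backgrounds and the tumor foregrounds, controllable, high-quality synthesis. The model is validated on brain tumor segmentation, where it improves the Dice score by 3.9% for data augmentation and by 4.6% for fairness on a separate dataset.
Link: https://arxiv.org/abs/2504.00150
Authors: Yongyi Shi,Ge Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 4 figures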
Abstract:Leveraging multi-center data for medical analytics presents challenges due to privacy concerns and data heterogeneity. While distributed approaches such as federated learning have gained traction, they remain vulnerable to privacy breaches, particularly in sensitive domains like medical imaging. Generative models, such as diffusion models, enhance privacy by synthesizing realistic data. However, they are prone to memorization, especially when trained on small datasets. This study proposes a decentralized few-shot generative model (DFGM) to synthesize brain tumor images while fully preserving privacy. DFGM harmonizes private tumor data with publicly shareable healthy images from multiple medical centers, constructing a new dataset by blending tumor foregrounds with healthy backgrounds. This approach ensures stringent privacy protection and enables controllable, high-quality synthesis by preserving both the healthy backgrounds and tumor foregrounds. We assess DFGM’s effectiveness in brain tumor segmentation using a UNet, achieving Dice score improvements of 3.9% for data augmentation and 4.6% for fairness on a separate dataset.
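The segmentation results above are reported as Dice scores. As a reminder of what that metric measures, here is a minimal sketch of the standard Dice coefficient for binary masks. The nested-list mask format and the convention of returning 1.0 for two empty masks are assumptions for illustration, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks (nested lists)."""
    pred_flat = [bool(v) for row in pred for v in row]
    target_flat = [bool(v) for row in target for v in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same size")
    intersection = sum(p and t for p, t in zip(pred_flat, target_flat))
    total = sum(pred_flat) + sum(target_flat)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * intersection / total


if __name__ == "__main__":
    # Hypothetical 2x2 masks: one overlapping pixel, three foreground pixels in total.
    print(dice_score([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

Dice ranges from 0 (no overlap) to 1 (identical masks), so it rewards overlap with the ground-truth tumor region regardless of how large the background is.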
[CV-103] Towards Precise Action Spotting: Addressing Temporal Misalignment in Labels with Dynamic Label Assignment
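[Quick read]: This paper addresses temporal misalignment in precise action spotting, that is, the misalignment inherent in ground-truth labels, which usually stems from human annotation errors or from the difficulty of pinpointing event boundaries across neighboring frames. It proposes a dynamic label assignment strategy that allows predictions to have temporal offsets from the ground-truth action times during training, ensuring consistent event spotting. The key is to extend minimum-cost matching, used in the spatial domain for object detection, to the temporal domain: matching costs are computed from predicted action class scores and temporal offsets, and labels are dynamically assigned to the most likely predictions even when their predicted times deviate from the ground-truth times. This alleviates the negative effects of temporally misaligned labels.
Link: https://arxiv.org/abs/2504.00149
Authors: Masato Tamura
Affiliations: Hitachi America, Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: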
Abstract:Precise action spotting has attracted considerable attention due to its promising applications. While existing methods achieve substantial performance by employing well-designed model architecture, they overlook a significant challenge: the temporal misalignment inherent in ground-truth labels. This misalignment arises when frames labeled as containing events do not align accurately with the actual event times, often as a result of human annotation errors or the inherent difficulties in precisely identifying event boundaries across neighboring frames. To tackle this issue, we propose a novel dynamic label assignment strategy that allows predictions to have temporal offsets from ground-truth action times during training, ensuring consistent event spotting. Our method extends the concept of minimum-cost matching, which is utilized in the spatial domain for object detection, to the temporal domain. By calculating matching costs based on predicted action class scores and temporal offsets, our method dynamically assigns labels to the most likely predictions, even when the predicted times of these predictions deviate from ground-truth times, alleviating the negative effects of temporal misalignment in labels. We conduct extensive experiments and demonstrate that our method achieves state-of-the-art performance, particularly in conditions where events are visually distinct and temporal misalignment in labels is common.
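The idea of minimum-cost matching in the temporal domain can be illustrated with a small sketch. The abstract does not give the cost function, so the form used here (negative class score plus a weighted absolute frame offset), the `offset_weight` and `max_offset` parameters, and the brute-force search are all assumptions made for illustration, not the paper's method.

```python
from itertools import permutations


def match_labels(pred_scores, gt_events, offset_weight=0.1, max_offset=2):
    """Assign each ground-truth event to one frame by minimizing total matching cost.

    pred_scores: one dict per frame, mapping class name -> predicted score.
    gt_events:   list of (frame_index, class_name) ground-truth labels.
    Returns a dict mapping the chosen frame index -> class name.
    """
    def cost(frame, event):
        gt_frame, cls = event
        offset = abs(frame - gt_frame)
        if offset > max_offset:
            return float("inf")  # too far from the annotated time
        # Hypothetical cost: high scores are cheap, large offsets are expensive.
        return -pred_scores[frame].get(cls, 0.0) + offset_weight * offset

    best_cost, best_frames = float("inf"), None
    # Brute force over one-to-one assignments; only viable for tiny examples.
    for frames in permutations(range(len(pred_scores)), len(gt_events)):
        total = sum(cost(f, e) for f, e in zip(frames, gt_events))
        if total < best_cost:
            best_cost, best_frames = total, frames
    if best_frames is None:
        return {}
    return {frame: event[1] for frame, event in zip(best_frames, gt_events)}


if __name__ == "__main__":
    scores = [{"kick": 0.1}, {"kick": 0.2}, {"kick": 0.9}, {"kick": 0.3}]
    # The label sits on frame 1, but the model is most confident on frame 2.
    print(match_labels(scores, [(1, "kick")]))
```

In this toy case the label moves to frame 2 because its high score outweighs the one-frame offset penalty; raising `offset_weight` keeps the label on the annotated frame. For realistic sequence lengths, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the brute-force loop.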
[CV-104] SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection ICCV25
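[Quick read]: This paper addresses event-based keypoint detection and matching for event cameras, where the motion-dependent appearance of keypoints and complex noise severely limit feature matching and downstream performance. The key is SuperEvent, a data-driven approach that predicts stable keypoints with expressive descriptors through self-supervision and, combined with a novel information-rich event representation, learns robust keypoint detection and description in event streams. Because event datasets lack ground-truth keypoint labels, pseudo-labels are generated with existing frame-based keypoint detectors. Finally, SuperEvent is integrated into a modern SLAM framework based on sparse keypoints and descriptors, surpassing the state of the art in event-based SLAM by a wide margin.
Link: https://arxiv.org/abs/2504.00139
Authors: Yannick Burkhardt,Simon Schaefer,Stefan Leutenegger
Affiliations: Technical University of Munich; Munich Center for Machine Learning (MCML)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: In Review for ICCV25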
Abstract:Event-based keypoint detection and matching holds significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state-of-the-art in event-based SLAM by a wide margin. Source code and multimedia material are available at this http URL.
[CV-105] Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs CVPR2025
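[Quick read]: This paper addresses video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles, to enable efficient navigation and content retrieval in long videos. The key is the 'Chapter-Llama' framework, which uses a pretrained large language model (LLM) with a large context window and takes as input speech transcripts and captions describing video frames, together with their timestamps. A lightweight speech-guided frame selection strategy reduces the computational cost, and the model is trained to output timestamps for chapter boundaries along with free-form chapter titles. The approach processes hour-long videos in a single forward pass and substantially improves results on the VidChapters-7M benchmark (e.g., F1 score from 26.7 to 45.3).
Link: https://arxiv.org/abs/2504.00072
Authors: Lucas Ventura,Antoine Yang,Cordelia Schmid,Gül Varol
Affiliations: LIGM, École des Ponts, IP Paris, Univ Gustave Eiffel, CNRS; Inria, École normale supérieure, CNRS, PSL Research University; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 Camera ready. Project page: this https URL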
Abstract:We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. While relatively underexplored, automatic chaptering has the potential to enable efficient navigation and content retrieval in long-form videos. In this paper, we achieve strong chaptering performance on hour-long videos by efficiently addressing the problem in the text domain with our ‘Chapter-Llama’ framework. Specifically, we leverage a pretrained large language model (LLM) with large context window, and feed as input (i) speech transcripts and (ii) captions describing video frames, along with their respective timestamps. Given the inefficiency of exhaustively captioning all frames, we propose a lightweight speech-guided frame selection strategy based on speech transcript content, and experimentally demonstrate remarkable advantages. We train the LLM to output timestamps for the chapter boundaries, as well as free-form chapter titles. This simple yet powerful approach scales to processing one-hour long videos in a single forward pass. Our results demonstrate substantial improvements (e.g., 45.3 vs 26.7 F1 score) over the state of the art on the recent VidChapters-7M benchmark. To promote further research, we release our code and models at our project page.
[CV-106] CF-CAM: Gradient Perturbation Mitigation and Feature Stabilization for Reliable Interpretability
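[Quick read]: This paper addresses the opacity of neural network decision-making, which limits trust in and applicability of deep learning models in high-stakes domains. Existing Class Activation Mapping (CAM) techniques face an inherent trade-off: gradient-based methods are sensitive to gradient perturbations and give unstable explanations, while gradient-free methods avoid gradient instability at the cost of significant computational overhead and inference latency. The paper proposes Cluster Filter Class Activation Map (CF-CAM), a framework that reintroduces gradient-based weighting while enhancing robustness against gradient noise. Specifically, CF-CAM uses Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to group semantically relevant feature channels and discard noise-prone activations, and applies cluster-conditioned gradient filtering with bilateral filters to refine gradient signals, preserving edge-aware localization while suppressing noise. Experiments show that CF-CAM outperforms state-of-the-art CAM methods in faithfulness and robustness and mitigates gradient instability without excessive computational cost, offering a reliable option for interpretability in critical applications such as medical diagnosis and autonomous driving.
Link: https://arxiv.org/abs/2504.00060
Authors: Hongjie He,Xu Pan,Yudong Yao
Affiliations: School of Physics, Mathematics, and Computing, University of Western Australia; School of Computer Science and Information Engineering, Hefei University of Technology; Department of Electrical and Computer Engineering, Stevens Institute of Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: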
Abstract:As deep learning continues to advance, the opacity of neural network decision-making remains a critical challenge, limiting trust and applicability in high-stakes domains. Class Activation Mapping (CAM) techniques have emerged as a key approach to visualizing model decisions, yet existing methods face inherent trade-offs. Gradient-based CAM variants suffer from sensitivity to gradient perturbations, leading to unstable and unreliable explanations. Conversely, gradient-free approaches mitigate gradient instability but incur significant computational overhead and inference latency. To address these limitations, we propose Cluster Filter Class Activation Map (CF-CAM), a novel framework that reintroduces gradient-based weighting while enhancing robustness against gradient noise. CF-CAM employs a hierarchical importance weighting strategy to balance discriminative feature preservation and noise elimination. A density-aware channel clustering via Density-Based Spatial Clustering of Applications with Noise (DBSCAN) groups semantically relevant feature channels and discards noise-prone activations. Additionally, cluster-conditioned gradient filtering leverages bilateral filters to refine gradient signals, preserving edge-aware localization while suppressing noise impact. Experimental results demonstrate that CF-CAM achieves superior interpretability performance while maintaining resilience to gradient perturbations, outperforming state-of-the-art CAM methods in faithfulness and robustness. By effectively mitigating gradient instability without excessive computational cost, CF-CAM provides a reliable solution for enhancing the interpretability of deep neural networks in critical applications such as medical diagnosis and autonomous driving.
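For readers unfamiliar with gradient-based weighting, the sketch below shows the well-known Grad-CAM baseline that such methods build on: each channel is weighted by the spatial mean of its gradient, and the weighted sum of activations is passed through a ReLU. This is not CF-CAM itself; the DBSCAN channel clustering and bilateral gradient filtering described above are omitted, and the nested-list tensor format is an assumption for illustration.

```python
def grad_cam(activations, gradients):
    """Grad-CAM style heatmap from K channel maps and their gradients.

    activations, gradients: lists of K maps, each an H x W nested list of floats.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = global average of that channel's gradient map.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for weight, act in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act[i][j]
    # ReLU keeps only regions that contribute positively to the class score.
    return [[max(v, 0.0) for v in row] for row in cam]


if __name__ == "__main__":
    # Hypothetical two-channel example with 2x2 maps.
    acts = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
    grads = [[[1, 1], [1, 1]], [[-2, -2], [-2, -2]]]
    print(grad_cam(acts, grads))
```

Because the weights are plain gradient averages, a small perturbation of the gradients shifts every weight, which is the instability that the clustering and filtering steps are meant to reduce.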
[CV-107] ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
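[Quick read]: This paper addresses the quadratic complexity that global self-attention imposes on Vision Transformers (ViTs) for high-resolution inputs. The key is ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model for efficient inference. It relies on activation matching, an intermediate constraint that encourages the student to align its token-wise dependencies with those produced by the teacher, and on masked prediction, a contextual reconstruction objective that requires the student to predict the teacher's representations for unseen (masked) tokens. Together these distill quadratic self-attention knowledge into the student while keeping its complexity efficient. Experiments show notable speedups on high-resolution tasks and improved performance of Mamba-based architectures on standard vision benchmarks.
Link: https://arxiv.org/abs/2504.00037
Authors: Guoyizhe Wei,Rama Chellappa
Affiliations: Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: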
Abstract:Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs. In this work, we present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model. Our approach leverages 1) activation matching, an intermediate constraint that encourages student to align its token-wise dependencies with those produced by the teacher, and 2) masked prediction, a contextual reconstruction objective that requires the student to predict the teacher’s representations for unseen (masked) tokens, to effectively distill the quadratic self-attention knowledge into the student while maintaining efficient complexity. Empirically, our method provides notable speedups particularly for high-resolution tasks, significantly addressing the hardware challenges in inference. Additionally, it also elevates Mamba-based architectures’ performance on standard vision benchmarks, achieving a competitive 84.3% top-1 accuracy on ImageNet with a base-sized model. Our results underscore the good potential of RNN-based solutions for large-scale visual tasks, bridging the gap between theoretical efficiency and real-world practice.
[CV-108] Skeletonization Quality Evaluation: Geometric Metrics for Point Cloud Analysis in Robotics
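[Quick read]: This paper addresses how to systematically evaluate the quality of point cloud skeletonization results and quantify their performance numerically. The key is to define and quantify geometric properties: the paper proposes representative metrics covering topological similarity, boundedness, centeredness, and smoothness, together with a numerical scoring framework for analyzing skeletonization results on point cloud data in scenarios ranging from object manipulation to mobile robot navigation. It also provides an open-source tool so that the research community can evaluate and refine skeleton models, and assesses the performance and sensitivity of the proposed metrics in several robotic applications.
Link: https://arxiv.org/abs/2504.00032
Authors: Qingmeng Wen,Yu-Kun Lai,Ze Ji,Seyed Amir Tafrishi
Affiliations: Cardiff University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Robotics (cs.RO)
Comments: 15 pages, 12 figures, under-review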
Abstract:Skeletonization is a powerful tool for shape analysis, rooted in the inherent instinct to understand an object’s morphology. It has found applications across various domains, including robotics. Although skeletonization algorithms have been studied in recent years, their performance is rarely quantified with detailed numerical evaluations. This work focuses on defining and quantifying geometric properties to systematically score the skeletonization results of point cloud shapes across multiple aspects, including topological similarity, boundedness, centeredness, and smoothness. We introduce these representative metric definitions along with a numerical scoring framework to analyze skeletonization outcomes concerning point cloud data for different scenarios, from object manipulation to mobile robot navigation. Additionally, we provide an open-source tool to enable the research community to evaluate and refine their skeleton models. Finally, we assess the performance and sensitivity of the proposed geometric evaluation methods from various robotic applications.
[CV-109] A Novel Distance-Based Metric for Quality Assessment in Image Segmentation
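[Quick read]: This paper addresses the inability of traditional segmentation quality metrics to quantify the spatial distribution of errors. Most existing metrics count erroneous pixels and therefore miss how errors are distributed geometrically, while established distance-based metrics such as the average Hausdorff distance are hard to interpret and to compare across methods and datasets. The paper proposes a new distance-based quality metric, the Surface Consistency Coefficient (SCC). Its key idea is to quantify the spatial distribution of errors by their proximity to the surface of the structure, so that errors near the surface can be distinguished from those further away. SCC is easy to interpret, comparable across different structural contexts, and shown to be robust and effective.
Link: https://arxiv.org/abs/2504.00023
Authors: Niklas Rottmayer,Claudia Redenbach
Affiliations: RPTU University Kaiserslautern-Landau
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: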
Abstract:The assessment of segmentation quality plays a fundamental role in the development, optimization, and comparison of segmentation methods which are used in a wide range of applications. With few exceptions, quality assessment is performed using traditional metrics, which are based on counting the number of erroneous pixels but do not capture the spatial distribution of errors. Established distance-based metrics such as the average Hausdorff distance are difficult to interpret and compare for different methods and datasets. In this paper, we introduce the Surface Consistency Coefficient (SCC), a novel distance-based quality metric that quantifies the spatial distribution of errors based on their proximity to the surface of the structure. Through a rigorous analysis using synthetic data and real segmentation results, we demonstrate the robustness and effectiveness of SCC in distinguishing errors near the surface from those further away. At the same time, SCC is easy to interpret and comparable across different structural contexts.
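The abstract contrasts SCC with established distance-based metrics such as the average Hausdorff distance. Since the definition of SCC is not given here, the sketch below only illustrates that baseline, in one common symmetric form (the mean of the two directed average nearest-neighbor distances). Other variants exist, and the point-set representation is an assumption for illustration.

```python
import math


def directed_average_distance(source, target):
    """Mean distance from each point in source to its nearest point in target."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)


def average_hausdorff(points_a, points_b):
    """One common symmetric form: mean of the two directed average distances."""
    forward = directed_average_distance(points_a, points_b)
    backward = directed_average_distance(points_b, points_a)
    return (forward + backward) / 2


if __name__ == "__main__":
    # Hypothetical boundary points of a predicted and a reference segmentation.
    predicted = [(0, 0), (1, 0)]
    reference = [(0, 0), (3, 0)]
    print(average_hausdorff(predicted, reference))
```

The result is expressed in pixel (or voxel) units and depends on the size and shape of the structures, which is one reason such values are hard to compare across datasets.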
[CV-110] Enhance Vision-based Tactile Sensors via Dynamic Illumination and Image Fusion
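[Quick read]: This paper addresses a limitation of vision-based tactile sensors such as DIGIT and GelSight, which use a single, static pattern of structured light tuned to the sensor's specific form factor. The key is to introduce dynamic illumination patterns combined with image fusion: multiple measurements are captured, each under a different illumination pattern, and fused into a single higher-quality measurement. Experiments show clear improvements in image contrast, sharpness, and background difference, which opens the possibility of improving existing sensors through a software update and of designing new hardware that fully exploits dynamic illumination.
Link: https://arxiv.org/abs/2504.00017
Authors: Artemii Redkin,Zdravko Dugonjic,Mike Lambeta,Roberto Calandra
Affiliations: LASR Lab, TU Dresden; Meta AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages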
Abstract:Vision-based tactile sensors use structured light to measure deformation in their elastomeric interface. Until now, vision-based tactile sensors such as DIGIT and GelSight have been using a single, static pattern of structured light tuned to the specific form factor of the sensor. In this work, we investigate the effectiveness of dynamic illumination patterns, in conjunction with image fusion techniques, to improve the quality of sensing of vision-based tactile sensors. Specifically, we propose to capture multiple measurements, each with a different illumination pattern, and then fuse them together to obtain a single, higher-quality measurement. Experimental results demonstrate that this type of dynamic illumination yields significant improvements in image contrast, sharpness, and background difference. This discovery opens the possibility of retroactively improving the sensing quality of existing vision-based tactile sensors with a simple software update, and for new hardware designs capable of fully exploiting dynamic illumination.
[CV-111] Assessing Foundation Models for Sea Ice Type Segmentation in Sentinel-1 SAR Imagery
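[Quick read]: This paper addresses the heavy reliance of sea ice type segmentation on labeled data and the unexplored effectiveness of existing foundation models (FMs) under the special conditions of polar regions. It evaluates ten remote sensing FMs for sea ice type segmentation on Sentinel-1 Synthetic Aperture Radar (SAR) imagery, focusing on their seasonal and spatial generalization. Among the evaluated models, Prithvi-600M outperforms the baseline models, and CROMA achieves a very similar F1-score. The study offers a systematic methodology for selecting FMs for sea ice data analysis, a comprehensive benchmark with performance metrics tailored to sea ice segmentation, and insights into current gaps and future directions for domain-specific models.
Link: https://arxiv.org/abs/2503.22516
Authors: Samira Alkaee Taleghan,Morteza Karimzadeh,Andrew P. Barrett,Walter N. Meier,Farnoush Banaei-Kashani
Affiliations: University of Colorado Denver; University of Colorado Boulder (CIRES, National Snow and Ice Data Center (NSIDC))
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: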
Abstract:Accurate segmentation of sea ice types is essential for mapping and operational forecasting of sea ice conditions for safe navigation and resource extraction in ice-covered waters, as well as for understanding polar climate processes. While deep learning methods have shown promise in automating sea ice segmentation, they often rely on extensive labeled datasets which require expert knowledge and are time-consuming to create. Recently, foundation models (FMs) have shown excellent results for segmenting remote sensing images by utilizing pre-training on large datasets using self-supervised techniques. However, their effectiveness for sea ice segmentation remains unexplored, especially given sea ice’s complex structures, seasonal changes, and unique spectral signatures, as well as peculiar Synthetic Aperture Radar (SAR) imagery characteristics including banding and scalloping noise, and varying ice backscatter characteristics, which are often missing in standard remote sensing pre-training datasets. In particular, SAR images over polar regions are acquired using different modes than used to capture the images at lower latitudes by the same sensors that form training datasets for FMs. This study evaluates ten remote sensing FMs for sea ice type segmentation using Sentinel-1 SAR imagery, focusing on their seasonal and spatial generalization. Among the selected models, Prithvi-600M outperforms the baseline models, while CROMA achieves a very similar performance in F1-score. Our contributions include offering a systematic methodology for selecting FMs for sea ice data analysis, a comprehensive benchmarking study on performances of FMs for sea ice segmentation with tailored performance metrics, and insights into existing gaps and future directions for improving domain-specific models in polar applications using SAR data.
[CV-112] Orientation Scores should be a Piece of Cake
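[Quick read]: This paper addresses how to construct, for the lift from two-dimensional position space $\mathbb{R}^2$ to position and orientation space $\mathbb{R}^2 \times S^1$, a family of wavelets that has the fast reconstruction property and minimizes position-orientation uncertainty. The key is to derive these minimum uncertainty states and show that they are well approximated by cake wavelets: for standard parameters the uncertainty gap of cake wavelets is less than 1.1, and in the limit it tends to the minimum of 1. The paper also completes an earlier theoretical argument that the lifting layer in (PDE-)G-CNNs does not need to be trained, and shows experimentally that using cake wavelets reduces network complexity and improves interpretability with only a slight impact on model performance.
Link: https://arxiv.org/abs/2504.00702
Authors: Finn M. Sherry,Chase van de Geijn,Erik J. Bekkers,Remco Duits
Affiliations: Unknown
Categories: Differential Geometry (math.DG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to the 7th International Conference on Geometric Science of Information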
Abstract:We axiomatically derive a family of wavelets for an orientation score, lifting from position space \mathbbR^2 to position and orientation space \mathbbR^2\times S^1 , with fast reconstruction property, that minimise position-orientation uncertainty. We subsequently show that these minimum uncertainty states are well-approximated by cake wavelets: for standard parameters, the uncertainty gap of cake wavelets is less than 1.1, and in the limit, we prove the uncertainty gap tends to the minimum of 1. Next, we complete a previous theoretical argument that one does not have to train the lifting layer in (PDE-)G-CNNs, but can instead use cake wavelets. Finally, we show experimentally that in this way we can reduce the network complexity and improve the interpretability of (PDE-)G-CNNs, with only a slight impact on the model’s performance.
[CV-113] Deconver: A Deconvolutional Network for Medical Image Segmentation
[Quick read]: This paper targets two limitations of current medical image segmentation methods: the local receptive fields of convolutional neural networks (CNNs) and the high computational complexity of vision transformers (ViTs). The proposed Deconver network places classical nonnegative deconvolution (NDC) operations inside a U-shaped architecture in place of computationally expensive attention mechanisms, restoring high-frequency details while suppressing artifacts. Its key innovations are a backpropagation-friendly NDC layer built on a provably monotonic update rule and a parameter-efficient design. Across several datasets covering both 2D and 3D segmentation tasks, Deconver reaches state-of-the-art Dice scores and Hausdorff distances while cutting computational cost (FLOPs) by up to 90%, offering a practical option for high-precision segmentation in resource-constrained clinical workflows.
Link: https://arxiv.org/abs/2504.00302
Authors: Pooya Ashtari,Shahryar Noei,Fateme Nateghi Haredasht,Jonathan H. Chen,Giuseppe Jurman,Aleksandra Pizurica,Sabine Van Huffel
Affiliations: Department of Electrical Engineering (ESAT), STADIUS Center, KU Leuven, Belgium; Department of Telecommunications and Information Processing, Ghent University, B-9000 Gent, Belgium; Data Science for Health Unit, Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, Italy; Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA; Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini, 4, 20072 Pieve Emanuele MI
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures, 5 tables
Abstract:While convolutional neural networks (CNNs) and vision transformers (ViTs) have advanced medical image segmentation, they face inherent limitations such as local receptive fields in CNNs and high computational complexity in ViTs. This paper introduces Deconver, a novel network that integrates traditional deconvolution techniques from image restoration as a core learnable component within a U-shaped architecture. Deconver replaces computationally expensive attention mechanisms with efficient nonnegative deconvolution (NDC) operations, enabling the restoration of high-frequency details while suppressing artifacts. Key innovations include a backpropagation-friendly NDC layer based on a provably monotonic update rule and a parameter-efficient design. Evaluated across four datasets (ISLES’22, BraTS’23, GlaS, FIVES) covering both 2D and 3D segmentation tasks, Deconver achieves state-of-the-art performance in Dice scores and Hausdorff distance while reducing computational costs (FLOPs) by up to 90% compared to leading baselines. By bridging traditional image restoration with deep learning, this work offers a practical solution for high-precision segmentation in resource-constrained clinical workflows. The project is available at this https URL.
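To make the idea of a nonnegative deconvolution with a monotonic update rule concrete, here is a minimal sketch. It uses the classical multiplicative update from the nonnegative least-squares and NMF literature, which is an assumption for illustration and may differ from the NDC layer in the paper. The blur matrix `H`, the signal `truth`, and all function names are hypothetical.

```python
def matvec(H, x):
    """Multiply matrix H (list of rows) by vector x."""
    return [sum(h * v for h, v in zip(row, x)) for row in H]


def transpose(H):
    return [list(col) for col in zip(*H)]


def squared_error(H, x, y):
    return sum((a - b) ** 2 for a, b in zip(matvec(H, x), y))


def nonnegative_deconvolve(H, y, steps=50, eps=1e-12):
    """Estimate x >= 0 with H x ~ y using the multiplicative update
    x <- x * (H^T y) / (H^T H x), which keeps x nonnegative and does not
    increase the squared error when H and y are nonnegative."""
    Ht = transpose(H)
    x = [1.0] * len(H[0])          # strictly positive start
    numer = matvec(Ht, y)          # H^T y is fixed across iterations
    errors = [squared_error(H, x, y)]
    for _ in range(steps):
        denom = matvec(Ht, matvec(H, x))
        x = [xi * n / (d + eps) for xi, n, d in zip(x, numer, denom)]
        errors.append(squared_error(H, x, y))
    return x, errors


# Hypothetical example: blur a sparse 1D signal with kernel [0.25, 0.5, 0.25].
H = [
    [0.5, 0.25, 0.0, 0.0, 0.0],
    [0.25, 0.5, 0.25, 0.0, 0.0],
    [0.0, 0.25, 0.5, 0.25, 0.0],
    [0.0, 0.0, 0.25, 0.5, 0.25],
    [0.0, 0.0, 0.0, 0.25, 0.5],
]
truth = [0.0, 2.0, 0.0, 1.0, 0.0]
y = matvec(H, truth)
x, errors = nonnegative_deconvolve(H, y)
```

The `errors` list records the squared reconstruction error after each step, so the monotone behaviour can be checked directly.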
[CV-114] DiffDenoise: Self-Supervised Medical Image Denoising with Conditional Diffusion Models
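[Quick read]: This paper addresses the tendency of existing self-supervised denoising methods to over-smooth medical images and lose important high-frequency details. The proposed method, DiffDenoise, preserves these details through three stages: first, a diffusion model is trained on noisy images, conditioned on the outputs of a pretrained Blind-Spot Network; second, a novel stabilized reverse sampling technique generates clean images by averaging diffusion sampling outputs initialized with a pair of symmetric noises; finally, a supervised denoising network is trained on noisy images paired with the denoised outputs of the diffusion model. Experiments show that DiffDenoise outperforms existing state-of-the-art methods on both synthetic and real-world medical image denoising tasks.
Link: https://arxiv.org/abs/2504.00264
Authors: Basar Demir,Yikang Liu,Xiao Chen,Eric Z. Chen,Lin Zhao,Boris Mailhe,Terrence Chen,Shanhui Sun
Affiliations: University of North Carolina at Chapel Hill; United Imaging Intelligence
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)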
Abstract:Many self-supervised denoising approaches have been proposed in recent years. However, these methods tend to overly smooth images, resulting in the loss of fine structures that are essential for medical applications. In this paper, we propose DiffDenoise, a powerful self-supervised denoising approach tailored for medical images, designed to preserve high-frequency details. Our approach comprises three stages. First, we train a diffusion model on noisy images, using the outputs of a pretrained Blind-Spot Network as conditioning inputs. Next, we introduce a novel stabilized reverse sampling technique, which generates clean images by averaging diffusion sampling outputs initialized with a pair of symmetric noises. Finally, we train a supervised denoising network using noisy images paired with the denoised outputs generated by the diffusion model. Our results demonstrate that DiffDenoise outperforms existing state-of-the-art methods in both synthetic and real-world medical image denoising tasks. We provide both a theoretical foundation and practical insights, demonstrating the method’s effectiveness across various medical imaging modalities and anatomical structures.
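The second stage, averaging outputs initialized with a pair of symmetric noises, can be illustrated with a toy sketch. The function `toy_sampler` is a hypothetical stand-in for a full reverse-diffusion run (it is not the paper's sampler), chosen only so that the effect of pairing `z` with `-z` is easy to see: terms that are odd in the noise cancel in the average.

```python
import random


def toy_sampler(condition, noise):
    """Hypothetical stand-in for a conditional reverse-diffusion run:
    the output depends on the conditioning signal plus noise-dependent terms."""
    return [c + 0.3 * z + 0.05 * z * z for c, z in zip(condition, noise)]


def symmetric_average(condition, noise):
    """Run the sampler from z and from -z, then average the two outputs."""
    from_pos = toy_sampler(condition, noise)
    from_neg = toy_sampler(condition, [-z for z in noise])
    return [(p + n) / 2 for p, n in zip(from_pos, from_neg)]


rng = random.Random(0)
condition = [0.2, 0.5, 0.8]                    # e.g. Blind-Spot Network output
noise = [rng.gauss(0.0, 1.0) for _ in condition]

single = toy_sampler(condition, noise)         # one noisy sample
averaged = symmetric_average(condition, noise) # odd noise term cancels
```

In this toy model the averaged output equals the conditioning signal plus only the even (`z * z`) term, which is why the pairing reduces the variance contributed by the starting noise.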
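[CV-115] Detecting Glioma, Meningioma, and Pituitary Tumors, and Normal Brain Tissues based on Yolov11 and Yolov8 Deep Learning Models
[Quick read]: This paper addresses fast and accurate diagnosis of brain tumors (glioma, meningioma, and pituitary tumors) to support treatment planning and improve medical outcomes. Manual interpretation of magnetic resonance imaging (MRI) scans is time-consuming, prone to human error, and highly dependent on specialist expertise. To address this, the paper proposes an AI-driven technique that combines transfer-learning-based fine-tuning of deep learning models (YoloV8 and YoloV11) with medical image classification to assign scans to four categories: No-Tumor, Glioma, Meningioma, and Pituitary Tumors. The key to the approach is fine-tuning pre-trained deep learning models to obtain high-accuracy tumor detection and classification, which the authors take as evidence of the potential of convolutional neural networks (CNNs) in medical imaging diagnostics.
Link: https://arxiv.org/abs/2504.00189
Authors: Ahmed M. Taha,Salah A. Aly,Mohamed F. Darwish
Affiliations: Egypt University of Informatics; Badya University; Fayoum University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 7 figures, 8 tables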
Abstract:Accurate and quick diagnosis of normal brain tissue, Glioma, Meningioma, and Pituitary Tumors is crucial for optimal treatment planning and improved medical results. Magnetic Resonance Imaging (MRI) is widely used as a non-invasive diagnostic tool for detecting brain abnormalities, including tumors. However, manual interpretation of MRI scans is often time-consuming, prone to human error, and dependent on highly specialized expertise. This paper proposes an advanced AI-driven technique for detecting glioma, meningioma, and pituitary brain tumors using YoloV11 and YoloV8 deep learning models. Methods: Using a transfer learning-based fine-tuning approach, we integrate cutting-edge deep learning techniques with medical imaging to classify brain tumors into four categories: No-Tumor, Glioma, Meningioma, and Pituitary Tumors. Results: The study utilizes the publicly accessible CE-MRI Figshare dataset and involves fine-tuning the pre-trained YoloV8 and YoloV11 models, which reach accuracies of 99.49% and 99.56%, respectively, alongside a customized CNN with an accuracy of 96.98%. The results validate the potential of CNNs in achieving high precision in brain tumor detection and classification, highlighting their transformative role in medical imaging and diagnostics.
[CV-116] EAP4EMSIG – Enhancing Event-Driven Microscopy for Microfluidic Single-Cell Analysis
[Quick read]: This paper addresses the difficulty of continuous data acquisition in microfluidic live-cell imaging, where high-throughput experiments often lack real-time insight and therefore cannot respond promptly to stochastic events. It presents three components of an experiment automation pipeline for event-driven microscopy: a fast and accurate deep learning autofocusing method that predicts the focus offset, an evaluation of real-time segmentation methods, and a real-time data analysis dashboard. The autofocusing method reaches a Mean Absolute Error of 0.0226 µm with inference times below 50 ms, providing a basis for efficient real-time analysis.
Link: https://arxiv.org/abs/2504.00047
Authors: Nils Friederich,Angelo Jovin Yamachui Sitcheu,Annika Nassal,Erenus Yildiz,Matthias Pesch,Maximilian Beichter,Lukas Scholtes,Bahar Akbaba,Thomas Lautenschlager,Oliver Neumann,Dietrich Kohlheyer,Hanno Scharr,Johannes Seiffarth,Katharina Nöh,Ralf Mikut
Affiliations: Institute for Automation and Applied Informatics (IAI), Karlsruhe Institute of Technology; Institute of Biological and Chemical Systems (IBCS), Karlsruhe Institute of Technology; Institute for Data Science and Machine Learning (IAS-8), Forschungszentrum Jülich GmbH; Institute of Bio- and Geosciences (IBG-1), Forschungszentrum Jülich GmbH; Computational Systems Biology (AVT-CSB), RWTH Aachen University
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to: at - Automatisierungstechnik
Abstract:Microfluidic Live-Cell Imaging yields data on microbial cell factories. However, continuous acquisition is challenging as high-throughput experiments often lack real-time insights, delaying responses to stochastic events. We introduce three components in the Experiment Automation Pipeline for Event-Driven Microscopy to Smart Microfluidic Single-Cell Analysis: a fast, accurate Deep Learning autofocusing method predicting the focus offset, an evaluation of real-time segmentation methods and a real-time data analysis dashboard. Our autofocusing achieves a Mean Absolute Error of 0.0226 µm with inference times below 50 ms. Among eleven Deep Learning segmentation methods, Cellpose 3 reached a Panoptic Quality of 93.58%, while a distance-based method is fastest (121 ms, Panoptic Quality 93.02%). All six Deep Learning Foundation Models were unsuitable for real-time segmentation.
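For readers unfamiliar with the Panoptic Quality metric quoted above, the following sketch computes it under the common definition: predicted and ground-truth segments match when their IoU exceeds 0.5, and PQ is the summed IoU of matches divided by TP + 0.5·FP + 0.5·FN. The segment representation (sets of pixel coordinates) and the example data are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def iou(a, b):
    """Intersection over union of two segments given as sets of pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(pred_segments, true_segments):
    """PQ = sum of IoU over matched pairs / (TP + 0.5*FP + 0.5*FN).
    A pair counts as a match when IoU > 0.5, which makes matches unique
    for non-overlapping segments."""
    matched_pred = set()
    iou_sum = 0.0
    tp = 0
    for truth in true_segments:
        for i, pred in enumerate(pred_segments):
            if i in matched_pred:
                continue
            score = iou(pred, truth)
            if score > 0.5:
                matched_pred.add(i)
                iou_sum += score
                tp += 1
                break
    fp = len(pred_segments) - tp   # predictions with no matching cell
    fn = len(true_segments) - tp   # cells that were missed
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


# Hypothetical toy frame: one cell found with IoU 0.75, one missed,
# and one spurious prediction.
true_cells = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(3, 3), (3, 4)}]
pred_cells = [{(0, 0), (0, 1), (1, 0)}, {(5, 5)}]
pq = panoptic_quality(pred_cells, true_cells)
```

Here the single match contributes an IoU of 0.75, and the denominator is 1 + 0.5 + 0.5 = 2, so `pq` is 0.375.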
[CV-117] Diffusion models applied to skin and oral cancer classification
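[Quick read]: This paper studies diffusion models for medical image classification (DiffMIC), focusing on skin and oral lesions. It asks whether a diffusion model can match state-of-the-art deep learning models such as convolutional neural networks (CNNs) and Transformers. The approach trains diffusion models on skin cancer images (PAD-UFES-20 dataset) and oral cancer images (P-NDB-UFES dataset) and evaluates them on multi-class and binary tasks. On PAD-UFES-20 the model reaches a balanced accuracy of 0.6457 for six-class classification and 0.8357 for binary classification; on P-NDB-UFES it reaches 0.9050. These results suggest that diffusion models are viable for classifying skin and oral lesions. The paper also examines how the model trained on PAD-UFES-20 generalizes to the clinical images of the HIBA dataset.
Link: https://arxiv.org/abs/2504.00026
Authors: José J. M. Uliana,Renato A. Krohling
Affiliations: Labcin - Nature-inspired computing Lab; Federal University of Espírito Santo
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)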
Abstract:This study investigates the application of diffusion models in medical image classification (DiffMIC), focusing on skin and oral lesions. Utilizing the datasets PAD-UFES-20 for skin cancer and P-NDB-UFES for oral cancer, the diffusion model demonstrated competitive performance compared to state-of-the-art deep learning models like Convolutional Neural Networks (CNNs) and Transformers. Specifically, for the PAD-UFES-20 dataset, the model achieved a balanced accuracy of 0.6457 for six-class classification and 0.8357 for binary classification (cancer vs. non-cancer). For the P-NDB-UFES dataset, it attained a balanced accuracy of 0.9050. These results suggest that diffusion models are viable models for classifying medical images of skin and oral lesions. In addition, we investigate the robustness of the model trained on PAD-UFES-20 for skin cancer but tested on the clinical images of the HIBA dataset.
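Balanced accuracy, the metric reported above, is the mean of the per-class recalls, which makes it less sensitive to class imbalance than plain accuracy. The sketch below shows the computation on hypothetical labels; it is not the paper's evaluation code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced binary task: 2 cancer cases, 8 non-cancer cases.
y_true = ["cancer"] * 2 + ["non-cancer"] * 8
y_pred = ["cancer", "non-cancer"] + ["non-cancer"] * 8

plain = accuracy(y_true, y_pred)              # 9 of 10 correct
balanced = balanced_accuracy(y_true, y_pred)  # (0.5 + 1.0) / 2
```

In this example plain accuracy is 0.9 while balanced accuracy is 0.75, because one of the two cancer cases is missed.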
[CV-118] Autonomous AI for Multi-Pathology Detection in Chest X-Rays: A Multi-Site Study in the Indian Healthcare System
[Quick read]: This paper addresses automation and accuracy in chest X-ray (CXR) diagnosis, particularly in resource-limited regions. It proposes an autonomous AI system that integrates several architectures (Vision Transformers, Faster R-CNN, and U-Net variants including Attention U-Net, U-Net++, and Dense U-Net) for classification, detection, and segmentation of 75 distinct pathologies. The key elements are training on a large dataset of over 5 million X-rays and validating robustness through subgroup analyses across age, gender, and equipment type. The system reaches up to 98% precision and over 95% recall for multi-pathology classification, and 99.8% precision, 99.6% recall, and 99.9% negative predictive value (NPV) for normal vs. abnormal classification.
Link: https://arxiv.org/abs/2504.00022
Authors: Bargava Subramanian,Shajeev Jaikumar,Praveen Shastry,Naveen Kumarasami,Kalyan Sivasailam,Anandakumar D,Keerthana R,Mounigasri M,Kishore Prasath Venkatesh
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 8 figures
Abstract:Study Design: The study outlines the development of an autonomous AI system for chest X-ray (CXR) interpretation, trained on a vast dataset of over 5 million X-rays sourced from healthcare systems across India. This AI system integrates advanced architectures including Vision Transformers, Faster R-CNN, and various U-Net models (such as Attention U-Net, U-Net++, and Dense U-Net) to enable comprehensive classification, detection, and segmentation of 75 distinct pathologies. To ensure robustness, the study design includes subgroup analyses across age, gender, and equipment type, validating the model’s adaptability and performance across diverse patient demographics and imaging environments. Performance: The AI system achieved up to 98% precision and over 95% recall for multi-pathology classification, with stable performance across demographic and equipment subgroups. For normal vs. abnormal classification, it reached 99.8% precision, 99.6% recall, and 99.9% negative predictive value (NPV). It was deployed in 17 major healthcare systems in India including diagnostic centers, large hospitals, and government hospitals. Over the deployment period, the system processed over 150,000 scans, averaging 2,000 chest X-rays daily, resulting in reduced reporting times and improved diagnostic accuracy. Conclusion: The high precision and recall validate the AI’s capability as a reliable tool for autonomous normal vs. abnormal classification, pathology localization, and segmentation. This scalable AI model addresses diagnostic gaps in underserved areas, optimizing radiology workflows and enhancing patient care across diverse healthcare settings in India.
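The three metrics quoted for normal vs. abnormal classification all derive from a binary confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and are not taken from the study.

```python
def precision(tp, fp):
    """Share of positive calls that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of truly positive cases that are flagged (sensitivity)."""
    return tp / (tp + fn)


def negative_predictive_value(tn, fn):
    """Share of negative calls that are truly negative."""
    return tn / (tn + fn)


# Hypothetical counts, treating "abnormal" as the positive class.
tp, fp, fn, tn = 90, 10, 5, 895

p = precision(tp, fp)                    # 90 / 100
r = recall(tp, fn)                       # 90 / 95
npv = negative_predictive_value(tn, fn)  # 895 / 900
```

A high NPV matters for autonomous reporting because it bounds how often a scan labelled normal actually contains a finding.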
Artificial Intelligence
[AI-0] Accelerating drug discovery with Artificial: a whole-lab orchestration and scheduling system for self-driving labs
Link: https://arxiv.org/abs/2504.00986
Authors: Yao Fehlis,Paul Mandel,Charles Crain,Betty Liu,David Fuller
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Abstract:Self-driving labs are transforming drug discovery by enabling automated, AI-guided experimentation, but they face challenges in orchestrating complex workflows, integrating diverse instruments and AI models, and managing data efficiently. Artificial addresses these issues with a comprehensive orchestration and scheduling system that unifies lab operations, automates workflows, and integrates AI-driven decision-making. By incorporating AI/ML models like NVIDIA BioNeMo - which facilitates molecular interaction prediction and biomolecular analysis - Artificial enhances drug discovery and accelerates data-driven research. Through real-time coordination of instruments, robots, and personnel, the platform streamlines experiments, enhances reproducibility, and advances drug discovery.
[AI-1] HDVIO2.0: Wind and Disturbance Estimation with Hybrid Dynamics VIO
Link: https://arxiv.org/abs/2504.00969
Authors: Giovanni Cioffi,Leonard Bauersfeld,Davide Scaramuzza
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Abstract:Visual-inertial odometry (VIO) is widely used for state estimation in autonomous micro aerial vehicles using onboard sensors. Current methods improve VIO by incorporating a model of the translational vehicle dynamics, yet their performance degrades when faced with low-accuracy vehicle models or continuous external disturbances, like wind. Additionally, incorporating rotational dynamics in these models is computationally intractable when they are deployed in online applications, e.g., in a closed-loop control system. We present HDVIO2.0, which models full 6-DoF, translational and rotational, vehicle dynamics and tightly incorporates them into a VIO with minimal impact on the runtime. HDVIO2.0 builds upon the previous work, HDVIO, and addresses these challenges through a hybrid dynamics model combining a point-mass vehicle model with a learning-based component, with access to control commands and IMU history, to capture complex aerodynamic effects. The key idea behind modeling the rotational dynamics is to represent them with continuous-time functions. HDVIO2.0 leverages the divergence between the actual motion and the predicted motion from the hybrid dynamics model to estimate external forces as well as the robot state. Our system surpasses the performance of state-of-the-art methods in experiments using public and new drone dynamics datasets, as well as real-world flights in winds up to 25 km/h. Unlike existing approaches, we also show that accurate vehicle dynamics predictions are achievable without precise knowledge of the full vehicle state.
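The idea of reading an external force from the gap between actual and predicted motion can be sketched with a bare point-mass model. This is a simplified illustration under stated assumptions (world-frame quantities, a known mass, no learned component, no joint state estimation) and is not the HDVIO2.0 estimator; all names and numbers are hypothetical.

```python
GRAVITY = (0.0, 0.0, -9.81)  # world-frame gravity in m/s^2


def predicted_acceleration(thrust_world, mass):
    """Point-mass model: acceleration from commanded thrust plus gravity."""
    return tuple(t / mass + g for t, g in zip(thrust_world, GRAVITY))


def external_force(actual_acceleration, thrust_world, mass):
    """Force not explained by the model: m * (a_actual - a_predicted)."""
    predicted = predicted_acceleration(thrust_world, mass)
    return tuple(mass * (a - p) for a, p in zip(actual_acceleration, predicted))


# Hypothetical hover: thrust cancels gravity, yet the vehicle accelerates
# sideways at 0.5 m/s^2, which the residual attributes to wind.
mass = 0.8                                  # kg
thrust = (0.0, 0.0, mass * 9.81)            # N
actual = (0.5, 0.0, 0.0)                    # m/s^2
wind_force = external_force(actual, thrust, mass)
```

With these numbers the residual is about 0.4 N along the x axis and approximately zero elsewhere.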
[AI-2] Enabling Efficient Processing of Spiking Neural Networks with On-Chip Learning on Commodity Neuromorphic Processors for Edge AI Systems IJCNN
Link: https://arxiv.org/abs/2504.00957
Authors: Rachmad Vidya Wicaksana Putra,Pasindu Wickramasinghe,Muhammad Shafique
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: Accepted at the International Joint Conference on Neural Networks (IJCNN) 2025 in Rome, Italy
Abstract:The rising demand for energy-efficient edge AI systems (e.g., mobile agents/robots) has increased the interest in neuromorphic computing, since it offers ultra-low power/energy AI computation through spiking neural network (SNN) algorithms on neuromorphic processors. However, their efficient implementation strategy has not been comprehensively studied, hence limiting SNN deployments for edge AI systems. Toward this, we propose a design methodology to enable efficient SNN processing on commodity neuromorphic processors. To do this, we first study the key characteristics of targeted neuromorphic hardware (e.g., memory and compute budgets), and leverage this information to perform compatibility analysis for network selection. Afterward, we employ a mapping strategy for efficient SNN implementation on the targeted processor. Furthermore, we incorporate an efficient on-chip learning mechanism to update the systems’ knowledge for adapting to new input classes and dynamic environments. The experimental results show that the proposed methodology leads the system to achieve low latency of inference (i.e., less than 50ms for image classification, less than 200ms for real-time object detection in video streaming, and less than 1ms in keyword recognition) and low latency of on-chip learning (i.e., less than 2ms for keyword recognition), while incurring less than 250mW of processing power and less than 15mJ of energy consumption across the respective different applications and scenarios. These results show the potential of the proposed methodology in enabling efficient edge AI systems for diverse application use-cases.
[AI-3] Unfair Learning: GenAI Exceptionalism and Copyright Law
Link: https://arxiv.org/abs/2504.00955
Authors: David Atkinson
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Abstract:This paper challenges the argument that generative artificial intelligence (GenAI) is entitled to broad immunity from copyright law for reproducing copyrighted works without authorization due to a fair use defense. It examines fair use legal arguments and eight distinct substantive arguments, contending that every legal and substantive argument favoring fair use for GenAI applies equally, if not more so, to humans. Therefore, granting GenAI exceptional privileges in this domain is legally and logically inconsistent with withholding broad fair use exemptions from individual humans. It would mean no human would need to pay for virtually any copyright work again. The solution is to take a circumspect view of any fair use claim for mass copyright reproduction by any entity and focus on the first principles of whether permitting such exceptionalism for GenAI promotes science and the arts.
[AI-4] QSViT: A Methodology for Quantizing Spiking Vision Transformers IJCNN
Link: https://arxiv.org/abs/2504.00948
Authors: Rachmad Vidya Wicaksana Putra,Saad Iftikhar,Muhammad Shafique
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the International Joint Conference on Neural Networks (IJCNN) 2025 in Rome, Italy
Abstract:Vision Transformer (ViT)-based models have shown state-of-the-art performance (e.g., accuracy) in vision-based AI tasks. However, realizing their capability in resource-constrained embedded AI systems is challenging due to their inherent large memory footprints and complex computations, thereby incurring high power/energy consumption. Recently, Spiking Vision Transformer (SViT)-based models have emerged as alternate low-power ViT networks. However, their large memory footprints still hinder their applicability for resource-constrained embedded AI systems. Therefore, there is a need for a methodology to compress SViT models without degrading the accuracy significantly. To address this, we propose QSViT, a novel design methodology to compress the SViT models through a systematic quantization strategy across different network layers. To do this, our QSViT employs several key steps: (1) investigating the impact of different precision levels in different network layers, (2) identifying the appropriate base quantization settings for guiding bit precision reduction, (3) performing a guided quantization strategy based on the base settings to select the appropriate quantization setting, and (4) developing an efficient quantized network based on the selected quantization setting. The experimental results demonstrate that, our QSViT methodology achieves 22.75% memory saving and 21.33% power saving, while also maintaining high accuracy within 2.1% from that of the original non-quantized SViT model on the ImageNet dataset. These results highlight the potential of QSViT methodology to pave the way toward the efficient SViT deployments on resource-constrained embedded AI systems.
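As a rough illustration of per-layer quantization and the memory saving it yields, the sketch below applies symmetric uniform quantization to a weight list and computes the saving for a hypothetical mix of bit widths. It is a generic scheme assumed for illustration, not the QSViT search procedure; the layer sizes and bit widths are made up.

```python
def quantize(weights, bits):
    """Symmetric uniform quantization to signed integer codes."""
    levels = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def memory_saving(layer_sizes, layer_bits, baseline_bits=32):
    """Fraction of weight memory saved relative to a uniform baseline."""
    baseline = sum(n * baseline_bits for n in layer_sizes)
    quantized = sum(n * b for n, b in zip(layer_sizes, layer_bits))
    return 1.0 - quantized / baseline


# Hypothetical layer: four weights quantized to 8 bits.
weights = [0.5, -1.0, 0.25, 0.75]
codes, scale = quantize(weights, bits=8)
restored = dequantize(codes, scale)

# Hypothetical two-layer network: keep one layer at 32 bits, halve the other.
saving = memory_saving([1000, 1000], [32, 16])
```

The rounding error per weight is bounded by half the quantization step, and the example mix of 32-bit and 16-bit layers saves a quarter of the weight memory.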
[AI-5] AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models
Link: https://arxiv.org/abs/2504.00938
Authors: Kristen M. Edwards,Farnaz Tehranchi,Scarlett R. Miller,Faez Ahmed
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 8 tables, 6 figures, 8 tables in the appendix
Abstract:The subjective evaluation of early stage engineering designs, such as conceptual sketches, traditionally relies on human experts. However, expert evaluations are time-consuming, expensive, and sometimes inconsistent. Recent advances in vision-language models (VLMs) offer the potential to automate design assessments, but it is crucial to ensure that these AI “judges” perform on par with human experts. However, no existing framework assesses expert equivalence. This paper introduces a rigorous statistical framework to determine whether an AI judge’s ratings match those of human experts. We apply this framework in a case study evaluating four VLM-based judges on key design metrics (uniqueness, creativity, usefulness, and drawing quality). These AI judges employ various in-context learning (ICL) techniques, including uni- vs. multimodal prompts and inference-time reasoning. The same statistical framework is used to assess three trained novices for expert-equivalence. Results show that the top-performing AI judge, using text- and image-based ICL with reasoning, achieves expert-level agreement for uniqueness and drawing quality and outperforms or matches trained novices across all metrics. In 6/6 runs for both uniqueness and creativity, and 5/6 runs for both drawing quality and usefulness, its agreement with experts meets or exceeds that of the majority of trained novices. These findings suggest that reasoning-supported VLM models can achieve human-expert equivalence in design evaluation. This has implications for scaling design evaluation in education and practice, and provides a general statistical framework for validating AI judges in other domains requiring subjective content evaluation.
[AI-6] Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning
Link: https://arxiv.org/abs/2504.00907
Authors: Ram Ramrakhya,Matthew Chang,Xavier Puig,Ruta Desai,Zsolt Kira,Roozbeh Mottaghi
Categories: Artificial Intelligence (cs.AI)
Abstract:Embodied agents operating in real-world environments must interpret ambiguous and under-specified human instructions. A capable household robot should recognize ambiguity and ask relevant clarification questions to infer the user intent accurately, leading to more effective task execution. To study this problem, we introduce the Ask-to-Act task, where an embodied agent must fetch a specific object instance given an ambiguous instruction in a home environment. The agent must strategically ask minimal, yet relevant, clarification questions to resolve ambiguity while navigating under partial observability. To solve this problem, we propose a novel approach that fine-tunes multimodal large language models (MLLMs) as vision-language-action (VLA) policies using online reinforcement learning (RL) with LLM-generated rewards. Our method eliminates the need for large-scale human demonstrations or manually engineered rewards for training such agents. We benchmark against strong zero-shot baselines, including GPT-4o, and supervised fine-tuned MLLMs, on our task. Our results demonstrate that our RL-finetuned MLLM outperforms all baselines by a significant margin (19.1-40.3%), generalizing well to novel scenes and tasks. To the best of our knowledge, this is the first demonstration of adapting MLLMs as VLA agents that can act and ask for help using LLM-generated rewards with online RL.
[AI-7] Role and Use of Race in AI/ML Models Related to Health
Link: https://arxiv.org/abs/2504.00899
Authors: Martin C. Were,Ang Li,Bradley A. Malin,Zhijun Yin,Joseph R. Coco,Benjamin X. Collins,Ellen Wright Clayton,Laurie L. Novak,Rachele Hendricks-Sturrup,Abiodun Oluyomi,Shilo Anders,Chao Yan
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:The role and use of race within health-related artificial intelligence and machine learning (AI/ML) models has sparked increasing attention and controversy. Despite the complexity and breadth of related issues, a robust and holistic framework to guide stakeholders in their examination and resolution remains lacking. This perspective provides a broad-based, systematic, and cross-cutting landscape analysis of race-related challenges, structured around the AI/ML lifecycle and framed through “points to consider” to support inquiry and decision-making.
[AI-8] Spectral Architecture Search for Neural Networks
Link: https://arxiv.org/abs/2504.00885
Authors: Gianluca Peri,Lorenzo Giambagli,Lorenzo Chicchi,Duccio Fanelli
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
Abstract:Architecture design and optimization are challenging problems in the field of artificial neural networks. Working in this context, we here present SPARCS (SPectral ARchiteCture Search), a novel architecture search protocol which exploits the spectral attributes of the inter-layer transfer matrices. SPARCS allows one to explore the space of possible architectures by spanning continuous and differentiable manifolds, thus enabling for gradient-based optimization algorithms to be eventually employed. With reference to simple benchmark models, we show that the newly proposed method yields a self-emerging architecture with a minimal degree of expressivity to handle the task under investigation and with a reduced parameter count as compared to other viable alternatives.
[AI-9] ReaLitE: Enrichment of Relation Embeddings in Knowledge Graphs using Numeric Literals ESWC2025
Link: https://arxiv.org/abs/2504.00852
Authors: Antonis Klironomos,Baifan Zhou,Zhuoxun Zheng,Gad-Elrab Mohamed,Heiko Paulheim,Evgeny Kharlamov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ESWC 2025
Abstract:Most knowledge graph embedding (KGE) methods tailored for link prediction focus on the entities and relations in the graph, giving little attention to other literal values, which might encode important information. Therefore, some literal-aware KGE models attempt to either integrate numerical values into the embeddings of the entities or convert these numerics into entities during preprocessing, leading to information loss. Other methods concerned with creating relation-specific numerical features assume completeness of numerical data, which does not apply to real-world graphs. In this work, we propose ReaLitE, a novel relation-centric KGE model that dynamically aggregates and merges entities’ numerical attributes with the embeddings of the connecting relations. ReaLitE is designed to complement existing conventional KGE methods while supporting multiple variations for numerical aggregations, including a learnable method. We comprehensively evaluated the proposed relation-centric embedding using several benchmarks for link prediction and node classification tasks. The results showed the superiority of ReaLitE over the state of the art in both tasks.
[AI-10] Investigating Large Language Models in Diagnosing Students' Cognitive Skills in Math Problem-solving
Link: https://arxiv.org/abs/2504.00843
Authors: Hyoungwook Jin,Yoonsu Kim,Dongyun Jung,Seungju Kim,Kiyoon Choi,Jinho Son,Juho Kim
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract:Mathematics learning entails mastery of both content knowledge and cognitive processing of knowing, applying, and reasoning with it. Automated math assessment primarily has focused on grading students’ exhibition of content knowledge by finding textual evidence, such as specific numbers, formulas, and statements. Recent advancements in problem-solving, image recognition, and reasoning capabilities of large language models (LLMs) show promise for nuanced evaluation of students’ cognitive skills. Diagnosing cognitive skills needs to infer students’ thinking processes beyond textual evidence, which is an underexplored task in LLM-based automated assessment. In this work, we investigate how state-of-the-art LLMs diagnose students’ cognitive skills in mathematics. We constructed MathCog, a novel benchmark dataset comprising 639 student responses to 110 expert-curated middle school math problems, each annotated with detailed teachers’ diagnoses based on cognitive skill checklists. Using MathCog, we evaluated 16 closed and open LLMs of varying model sizes and vendors. Our evaluation reveals that even the state-of-the-art LLMs struggle with the task, with all F1 scores below 0.5, and tend to exhibit strong false confidence for incorrect cases (r_s = .617). We also found that model size positively correlates with the diagnosis performance (r_s = .771). Finally, we discuss the implications of these findings, the overconfidence issue, and directions for improving automated cognitive skill diagnosis.
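The r_s values in the abstract are Spearman rank correlations. As a minimal sketch of how such a coefficient is computed, the code below ranks both variables (averaging ranks for ties) and takes the Pearson correlation of the ranks. The model sizes and F1 scores are hypothetical and are not taken from MathCog.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: model size in billions of parameters vs. diagnosis F1.
model_sizes = [1, 7, 13, 70]
f1_scores = [0.10, 0.20, 0.15, 0.40]
r_s = spearman(model_sizes, f1_scores)
```

For these four points the ranks differ only by one swap, giving r_s = 0.8; a perfectly monotone relationship would give 1.0.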
[AI-11] Context-Aware Human Behavior Prediction Using Multimodal Large Language Models: Challenges and Insights
Link: https://arxiv.org/abs/2504.00839
Authors: Yuchen Liu, Lino Lerch, Luigi Palmieri, Andrey Rudenko, Sebastian Koch, Timo Ropinski, Marco Aiello
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Abstract:Predicting human behavior in shared environments is crucial for safe and efficient human-robot interaction. Traditional data-driven methods to that end are pre-trained on domain-specific datasets, activity types, and prediction horizons. In contrast, the recent breakthroughs in Large Language Models (LLMs) promise open-ended cross-domain generalization to describe various human activities and make predictions in any context. In particular, Multimodal LLMs (MLLMs) are able to integrate information from various sources, achieving more contextual awareness and improved scene understanding. The difficulty in applying general-purpose MLLMs directly for prediction stems from their limited capacity for processing large input sequences, sensitivity to prompt design, and expensive fine-tuning. In this paper, we present a systematic analysis of applying pre-trained MLLMs for context-aware human behavior prediction. To this end, we introduce a modular multimodal human activity prediction framework that allows us to benchmark various MLLMs, input variations, In-Context Learning (ICL), and autoregressive techniques. Our evaluation indicates that the best-performing framework configuration is able to reach 92.8% semantic similarity and 66.1% exact label accuracy in predicting human behaviors in the target frame.
[AI-12] A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives: Data, Methods, and Challenges
Link: https://arxiv.org/abs/2504.00837
Authors: Shuyu Li, Shulei Ji, Zihao Wang, Songruoyao Wu, Jiaxing Yu, Kejun Zhang
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Abstract:Multi-modal music generation, using multiple modalities like images, video, and text alongside musical scores and audio as guidance, is an emerging research area with broad applications. This paper reviews this field, categorizing music generation systems from the perspective of modalities. It covers modality representation, multi-modal data alignment, and their utilization to guide music generation. We also discuss current datasets and evaluation methods. Key challenges in this area include effective multi-modal integration, large-scale comprehensive datasets, and systematic evaluation methods. Finally, we provide an outlook on future research directions focusing on multi-modal fusion, alignment, data, and evaluation.
[AI-13] Example-Based Concept Analysis Framework for Deep Weather Forecast Models
Link: https://arxiv.org/abs/2504.00831
Authors: Soyeon Kim, Junho Choi, Subeen Lee, Jaesik Choi
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 39 pages, 10 figures
Abstract:To improve the trustworthiness of an AI model, finding consistent, understandable representations of its inference process is essential. This understanding is particularly important in high-stakes operations such as weather forecasting, where the identification of underlying meteorological mechanisms is as critical as the accuracy of the predictions. Despite the growing literature that addresses this issue through explainable AI, the applicability of their solutions is often limited due to their AI-centric development. To fill this gap, we follow a user-centric process to develop an example-based concept analysis framework, which identifies cases that follow an inference process similar to that of the target instance in a target model and presents them in a user-comprehensible format. Our framework provides the users with visually and conceptually analogous examples, including the probability of concept assignment to resolve ambiguities in weather mechanisms. To bridge the gap between vector representations identified from models and human-understandable explanations, we compile a human-annotated concept dataset and implement a user interface to assist domain experts involved in the framework development.
[AI-14] Explainable AI-Based Interface System for Weather Forecasting Model
Link: https://arxiv.org/abs/2504.00795
Authors: Soyeon Kim, Junho Choi, Yeji Choi, Subeen Lee, Artyom Stitsyuk, Minkyoung Park, Seongyeop Jeong, Youhyun Baek, Jaesik Choi
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 19 pages, 16 figures
Abstract:Machine learning (ML) is becoming increasingly popular in meteorological decision-making. Although the literature on explainable artificial intelligence (XAI) is growing steadily, user-centered XAI studies have not yet extended to this domain. This study defines three requirements for explanations of black-box models in meteorology through user studies: statistical model performance for different rainfall scenarios to identify model bias, model reasoning, and the confidence of model outputs. Appropriate XAI methods are mapped to each requirement, and the generated explanations are tested quantitatively and qualitatively. An XAI interface system is designed based on user feedback. The results indicate that the explanations increase decision utility and user trust. Users prefer intuitive explanations over those based on XAI algorithms even for potentially easy-to-recognize examples. These findings can provide evidence for future research on user-centered XAI algorithms, as well as a basis to improve the usability of AI systems in practice.
[AI-15] Conditional Temporal Neural Processes with Covariance Loss
Link: https://arxiv.org/abs/2504.00794
Authors: Boseon Yoo, Jiwoo Lee, Janghoon Ju, Seijun Chung, Soyeon Kim, Jaesik Choi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 18 figures
Abstract:We introduce a novel loss function, Covariance Loss, which is conceptually equivalent to conditional neural processes and has the form of a regularization term, so that it is applicable to many kinds of neural networks. With the proposed loss, mappings from input variables to target variables are highly affected by dependencies of target variables as well as mean activation and mean dependencies of input and target variables. This property enables the resulting neural networks to become more robust to noisy observations and to recapture missing dependencies from prior information. To show the validity of the proposed loss, we conduct extensive sets of experiments on real-world datasets with state-of-the-art models and discuss the benefits and drawbacks of the proposed Covariance Loss.
[AI-16] Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scale Test-Time Compute
Link: https://arxiv.org/abs/2504.00762
Authors: Jianhao Chen, Zishuo Xun, Bocheng Zhou, Han Qi, Qiaosheng Zhang, Yang Chen, Wei Hu, Yuzhong Qu, Wanli Ouyang, Shuyue Hu
Subjects: Artificial Intelligence (cs.AI)
Abstract:This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths that potentially arise from diverse training data and paradigms. By using consistency as a signal, our strategy dynamically switches between models. Theoretical analysis highlights the efficiency and performance advantages of our strategy. Extensive experiments on six datasets demonstrate that our strategy not only outperforms self-consistency and state-of-the-art multi-agent debate approaches, but also significantly reduces inference costs. Additionally, our strategy, ModelSwitch, requires only a few comparable LLMs to achieve optimal performance and can be extended with verification methods, demonstrating the potential of leveraging multiple LLMs in the generation-verification paradigm.
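To make the idea in this abstract concrete, here is a minimal Python sketch of repeated sampling with consistency-driven model switching followed by a vote. It is an illustration under stated assumptions, not the paper's algorithm: the function names (`consistency`, `switch_and_vote`), the agreement threshold, and the plain majority vote over pooled samples are all hypothetical choices, and each "model" is simply a callable that returns an answer string.

```python
from collections import Counter
from typing import Callable, List, Sequence


def consistency(answers: Sequence[str]) -> float:
    """Fraction of samples that agree with the most common answer."""
    if not answers:
        return 0.0
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def switch_and_vote(
    models: Sequence[Callable[[str], str]],
    question: str,
    samples_per_model: int = 5,
    threshold: float = 1.0,  # hypothetical stopping rule: stop once a model agrees with itself
) -> str:
    """Sample each model in turn; stop early when one is self-consistent, then vote."""
    pooled: List[str] = []
    for model in models:
        answers = [model(question) for _ in range(samples_per_model)]
        pooled.extend(answers)
        if consistency(answers) >= threshold:
            break  # consistent enough: no need to query the remaining models
    # Plain majority vote over every sample collected so far (an assumption).
    return Counter(pooled).most_common(1)[0][0]
```

When the first model already answers consistently, the remaining models are never called, which is where the inference savings in such a scheme would come from.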
[AI-17] Personality-Driven Decision-Making in LLM-Based Autonomous Agents AAMAS2025
Link: https://arxiv.org/abs/2504.00727
Authors: Lewis Newsham, Daniel Prince
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 10 pages, 8 figures. To be included in Proc. of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025)
Abstract:The embedding of Large Language Models (LLMs) into autonomous agents is a rapidly developing field which enables dynamic, configurable behaviours without the need for extensive domain-specific training. In our previous work, we introduced SANDMAN, a Deceptive Agent architecture leveraging the Five-Factor OCEAN personality model, demonstrating that personality induction significantly influences agent task planning. Building on these findings, this study presents a novel method for measuring and evaluating how induced personality traits affect task selection processes - specifically planning, scheduling, and decision-making - in LLM-based agents. Our results reveal distinct task-selection patterns aligned with induced OCEAN attributes, underscoring the feasibility of designing highly plausible Deceptive Agents for proactive cyber defense strategies.
[AI-18] Advancements in Multimodal Differential Evolution: A Comprehensive Review and Future Perspectives
Link: https://arxiv.org/abs/2504.00717
Authors: Dikshit Chauhan, Shivani, Donghwi Jung, Anupam Yadav
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Abstract:Multi-modal optimization involves identifying multiple global and local optima of a function, offering valuable insights into diverse optimal solutions within the search space. Evolutionary algorithms (EAs) excel at finding multiple solutions in a single run, providing a distinct advantage over classical optimization techniques that often require multiple restarts without guarantee of obtaining diverse solutions. Among these EAs, differential evolution (DE) stands out as a powerful and versatile optimizer for continuous parameter spaces. DE has shown significant success in multi-modal optimization by utilizing its population-based search to promote the formation of multiple stable subpopulations, each targeting different optima. Recent advancements in DE for multi-modal optimization have focused on niching methods, parameter adaptation, hybridization with other algorithms including machine learning, and applications across various domains. Given these developments, it is an opportune moment to present a critical review of the latest literature and identify key future research directions. This paper offers a comprehensive overview of recent DE advancements in multimodal optimization, including methods for handling multiple optima, hybridization with EAs and machine learning, and highlights a range of real-world applications. Additionally, the paper outlines a set of compelling open problems and future research issues from multiple perspectives.
[AI-19] Energy Weighted Learning Progress Guided Interleaved Multi-Task Learning
Link: https://arxiv.org/abs/2504.00707
Authors: Hanne Say(1), Suzan Ece Ada(2), Emre Ugur(2), Erhan Oztop(1 and 3) ((1) Graduate School of Science and Engineering, Ozyegin University, Istanbul, Turkey, (2) Department of Computer Engineering, Bogazici University, Istanbul, Turkey, (3) OTRI, SISREC, Osaka University, Osaka, Japan)
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 8 figures
Abstract:Humans can continuously acquire new skills and knowledge by exploiting existing ones for improved learning, without forgetting them. Similarly, ‘continual learning’ in machine learning aims to learn new information while preserving the previously acquired knowledge. Existing research often overlooks the nature of human learning, where tasks are interleaved due to human choice or environmental constraints. So, almost never do humans master one task before switching to the next. To investigate to what extent human-like learning can benefit the learner, we propose a method that interleaves tasks based on their ‘learning progress’ and energy consumption. From a machine learning perspective, our approach can be seen as a multi-task learning system that balances learning performance with energy constraints while mimicking ecologically realistic human task learning. To assess the validity of our approach, we consider a robot learning setting in simulation, where the robot learns the effect of its actions in different contexts. The conducted experiments show that our proposed method achieves better performance than sequential task learning and reduces energy consumption for learning the tasks.
[AI-20] The HCI GenAI CO2ST Calculator: A Tool for Calculating the Carbon Footprint of Generative AI Use in Human-Computer Interaction Research
Link: https://arxiv.org/abs/2504.00692
Authors: Nanna Inie, Jeanette Falk, Raghavendra Selvan
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:Increased usage of generative AI (GenAI) in Human-Computer Interaction (HCI) research induces a climate impact from carbon emissions due to energy consumption of the hardware used to develop and run GenAI models and systems. The exact energy usage and subsequent carbon emissions are difficult to estimate in HCI research because HCI researchers most often use cloud-based services where the hardware and its energy consumption are hidden from plain view. The HCI GenAI CO2ST Calculator is a tool designed specifically for the HCI research pipeline, to help researchers estimate the energy consumption and carbon footprint of using generative AI in their research, either a priori (allowing for mitigation strategies or experimental redesign) or post hoc (allowing for transparent documentation of carbon footprint in written reports of the research).
[AI-21] Towards Adaptive AI Governance: Comparative Insights from the U.S., EU, and Asia
Link: https://arxiv.org/abs/2504.00652
Authors: Vikram Kulothungan, Deepti Gupta
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Accepted at IEEE BigDataSecurity 2025 Conference
Abstract:Artificial intelligence (AI) trends vary significantly across global regions, shaping the trajectory of innovation, regulation, and societal impact. This variation influences how different regions approach AI development, balancing technological progress with ethical and regulatory considerations. This study conducts a comparative analysis of AI trends in the United States (US), the European Union (EU), and Asia, focusing on three key dimensions: generative AI, ethical oversight, and industrial applications. The US prioritizes market-driven innovation with minimal regulatory constraints, the EU enforces a precautionary risk-based framework emphasizing ethical safeguards, and Asia employs state-guided AI strategies that balance rapid deployment with regulatory oversight. Although these approaches reflect different economic models and policy priorities, their divergence poses challenges to international collaboration, regulatory harmonization, and the development of global AI standards. To address these challenges, this paper synthesizes regional strengths to propose an adaptive AI governance framework that integrates risk-tiered oversight, innovation accelerators, and strategic alignment mechanisms. By bridging governance gaps, this study offers actionable insights for fostering responsible AI development while ensuring a balance between technological progress, ethical imperatives, and regulatory coherence.
[AI-22] Impact of Data Duplication on Deep Neural Network-Based Image Classifiers: Robust vs. Standard Models
Link: https://arxiv.org/abs/2504.00638
Authors: Alireza Aghabagherloo, Aydin Abadi, Sumanta Sarkar, Vishnu Asutosh Dasu, Bart Preneel
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Abstract:The accuracy and robustness of machine learning models against adversarial attacks are significantly influenced by factors such as training data quality, model architecture, the training process, and the deployment environment. In recent years, duplicated data in training sets, especially in language models, has attracted considerable attention. It has been shown that deduplication enhances both training performance and model accuracy in language models. While the importance of data quality in training image classifier Deep Neural Networks (DNNs) is widely recognized, the impact of duplicated images in the training set on model generalization and performance has received little attention. In this paper, we address this gap and provide a comprehensive study on the effect of duplicates in image classification. Our analysis indicates that the presence of duplicated images in the training set not only negatively affects the efficiency of model training but also may result in lower accuracy of the image classifier. This negative impact of duplication on accuracy is particularly evident when duplicated data is non-uniform across classes or when duplication, whether uniform or non-uniform, occurs in the training set of an adversarially trained model. Even when duplicated samples are selected in a uniform way, increasing the amount of duplication does not lead to a significant improvement in accuracy.
[AI-23] Feature Subset Weighting for Distance-based Supervised Learning through Choquet Integration
Link: https://arxiv.org/abs/2504.00624
Authors: Adnan Theerens, Yvan Saeys, Chris Cornelis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:This paper introduces feature subset weighting using monotone measures for distance-based supervised learning. The Choquet integral is used to define a distance metric that incorporates these weights. This integration enables the proposed distances to effectively capture non-linear relationships and account for interactions both between conditional and decision attributes and among conditional attributes themselves, resulting in a more flexible distance measure. In particular, we show how this approach ensures that the distances remain unaffected by the addition of duplicate and strongly correlated features. Another key point of this approach is that it makes feature subset weighting computationally feasible, since only m feature subset weights need to be calculated each time instead of all 2^m feature subset weights, where m is the number of attributes. Next, we also examine how the use of the Choquet integral for measuring similarity leads to a non-equivalent definition of distance. The relationship between distance and similarity is further explored through dual measures. Additionally, symmetric Choquet distances and similarities are proposed, preserving the classical symmetry between similarity and distance. Finally, we introduce a concrete feature subset weighting distance, evaluate its performance in a k-nearest neighbors (KNN) classification setting, and compare it against Mahalanobis distances and weighted distance methods.
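The point about needing only m subset weights can be seen in a small sketch of the discrete Choquet integral. This is an illustration, not the paper's distance: applying the integral to per-feature absolute differences, and the toy `group_measure` that counts distinct feature groups, are assumptions made here for the example, and all function names are hypothetical.

```python
from typing import Callable, FrozenSet, Sequence

Measure = Callable[[FrozenSet[int]], float]


def choquet_integral(values: Sequence[float], measure: Measure) -> float:
    """Discrete Choquet integral of non-negative values w.r.t. a monotone measure."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # ascending by value
    total, previous = 0.0, 0.0
    for rank, index in enumerate(order):
        # Features whose value is at least values[index]: one subset per rank,
        # so the measure is evaluated only m times, never on all 2^m subsets.
        upper_set = frozenset(order[rank:])
        total += (values[index] - previous) * measure(upper_set)
        previous = values[index]
    return total


def choquet_distance(x: Sequence[float], y: Sequence[float], measure: Measure) -> float:
    """Assumed form: Choquet integral of the per-feature absolute differences."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    return choquet_integral(diffs, measure)


def group_measure(groups: Sequence[int]) -> Measure:
    """Toy monotone measure: share of distinct feature groups present in a subset."""
    n_groups = len(set(groups))

    def measure(subset: FrozenSet[int]) -> float:
        return len({groups[i] for i in subset}) / n_groups

    return measure
```

With an additive measure the integral reduces to an ordinary weighted sum of differences. With `group_measure`, two features assigned to the same group count once, so appending an exact copy of a feature leaves the distance unchanged in this toy setting.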
[AI-24] Towards Responsible and Trustworthy Educational Data Mining: Comparing Symbolic, Sub-Symbolic, and Neural-Symbolic AI Methods
Link: https://arxiv.org/abs/2504.00615
Authors: Danial Hooshyar, Eve Kikas, Yeongwook Yang, Gustav Šír, Raija Hämäläinen, Tommi Kärkkäinen, Roger Azevedo
Subjects: Artificial Intelligence (cs.AI)
Abstract:Given the demand for responsible and trustworthy AI for education, this study evaluates symbolic, sub-symbolic, and neural-symbolic AI (NSAI) in terms of generalizability and interpretability. Our extensive experiments on balanced and imbalanced self-regulated learning datasets of Estonian primary school students predicting 7th-grade mathematics national test performance showed that symbolic and sub-symbolic methods performed well on balanced data but struggled to identify low performers in imbalanced datasets. Interestingly, symbolic and sub-symbolic methods emphasized different factors in their decision-making: symbolic approaches primarily relied on cognitive and motivational factors, while sub-symbolic methods focused more on cognitive aspects, learned knowledge, and the demographic variable of gender – yet both largely overlooked metacognitive factors. The NSAI method, on the other hand, showed advantages by: (i) being more generalizable across both classes – even in imbalanced datasets – as its symbolic knowledge component compensated for the underrepresented class; and (ii) relying on a more integrated set of factors in its decision-making, including motivation, (meta)cognition, and learned knowledge, thus offering a comprehensive and theoretically grounded interpretability framework. These contrasting findings highlight the need for a holistic comparison of AI methods before drawing conclusions based solely on predictive performance. They also underscore the potential of hybrid, human-centered NSAI methods to address the limitations of other AI families and move us closer to responsible AI for education. Specifically, by enabling stakeholders to contribute to AI design, NSAI aligns learned patterns with theoretical constructs, incorporates factors like motivation and metacognition, and strengthens the trustworthiness and responsibility of educational data mining.
[AI-25] LLM-Guided Search for Deletion-Correcting Codes
Link: https://arxiv.org/abs/2504.00613
Authors: Franziska Weindel, Reinhard Heckel
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
Abstract:Finding deletion-correcting codes of maximum size has been an open problem for over 70 years, even for a single deletion. In this paper, we propose a novel approach for constructing deletion-correcting codes. A code is a set of sequences satisfying certain constraints, and we construct it by greedily adding the highest-priority sequence according to a priority function. To find good priority functions, we leverage FunSearch, a large language model (LLM)-guided evolutionary search proposed by Romera et al., 2024. FunSearch iteratively generates, evaluates, and refines priority functions to construct large deletion-correcting codes. For a single deletion, our evolutionary search finds functions that construct codes which match known maximum sizes, reach the size of the largest (conjectured optimal) Varshamov-Tenengolts codes where the maximum is unknown, and independently rediscover them in equivalent form. For two deletions, we find functions that construct codes with new best-known sizes for code lengths n = 12, 13, and 16, establishing improved lower bounds. These results demonstrate the potential of LLM-guided search for information theory and code design and represent the first application of such methods for constructing error-correcting codes.
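The greedy construction described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it relies on the standard fact that a code corrects s deletions exactly when the s-deletion balls of distinct codewords are disjoint, and the hand-written `vt_priority` (which ranks sequences by their Varshamov-Tenengolts syndrome) is a hypothetical stand-in for a priority function that FunSearch would evolve.

```python
from itertools import combinations, product


def deletion_ball(seq, deletions=1):
    """All subsequences obtained by deleting `deletions` symbols from seq."""
    keep = len(seq) - deletions
    return {tuple(seq[i] for i in idx) for idx in combinations(range(len(seq)), keep)}


def greedy_code(n, priority, deletions=1):
    """Visit binary sequences in priority order; keep those whose ball is still free."""
    candidates = sorted(product((0, 1), repeat=n), key=priority, reverse=True)
    code, covered = [], set()
    for seq in candidates:
        ball = deletion_ball(seq, deletions)
        if covered.isdisjoint(ball):  # no overlap with any earlier codeword's ball
            code.append(seq)
            covered |= ball
    return code


def vt_priority(seq):
    """Hypothetical priority: favour sequences with VT syndrome 0 (positions are 1-based)."""
    n = len(seq)
    syndrome = sum(i * bit for i, bit in enumerate(seq, start=1)) % (n + 1)
    return -syndrome


if __name__ == "__main__":
    code = greedy_code(4, vt_priority)
    print(len(code), code)
```

Because every sequence with syndrome 0 gets the top priority and those sequences are mutually compatible, the greedy pass picks up the whole VT code first and then adds any further sequences that still fit.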
[AI-26] PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models SIGMOD2025
Link: https://arxiv.org/abs/2504.00608
Authors: Xianghong Xu, Xiao He, Tieying Zhang, Lei Zhang, Rui Shi, Jianjun Chen
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: Accepted by SIGMOD 2025
Abstract:Number of Distinct Values (NDV) estimation of a multiset/column is a basis for many data management tasks, especially within databases. Despite decades of research, most existing methods require either a significant amount of samples through uniform random sampling or access to the entire column to produce estimates, leading to substantial data access costs and potentially ineffective estimations in scenarios with limited data access. In this paper, we propose leveraging semantic information, i.e., schema, to address these challenges. The schema contains rich semantic information that can benefit the NDV estimation. To this end, we propose PLM4NDV, a learned method incorporating Pre-trained Language Models (PLMs) to extract semantic schema information for NDV estimation. Specifically, PLM4NDV leverages the semantics of the target column and the corresponding table to gain a comprehensive understanding of the column’s meaning. By using the semantics, PLM4NDV reduces data access costs, provides accurate NDV estimation, and can even operate effectively without any data access. Extensive experiments on a large-scale real-world dataset demonstrate the superiority of PLM4NDV over baseline methods. Our code is available at this https URL.
[AI-27] Data Cleansing for GANs
Link: https://arxiv.org/abs/2504.00603
Authors: Naoyuki Terashita, Hiroki Ohashi, Satoshi Hara
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted for IEEE Transactions on Neural Networks and Learning Systems (TNNLS, 2025). Journal extension of this https URL
Abstract:As the application of generative adversarial networks (GANs) expands, it becomes increasingly critical to develop a unified approach that improves performance across various generative tasks. One effective strategy that applies to any machine learning task is identifying harmful instances, whose removal improves the performance. While previous studies have successfully estimated these harmful training instances in supervised settings, their approaches are not easily applicable to GANs. The challenge lies in two requirements of the previous approaches that do not apply to GANs. First, previous approaches require that the absence of a training instance directly affects the parameters. However, in the training for GANs, the instances do not directly affect the generator’s parameters since they are only fed into the discriminator. Second, previous approaches assume that the change in loss directly quantifies the harmfulness of the instance to a model’s performance, while common types of GAN losses do not always reflect the generative performance. To overcome the first challenge, we propose influence estimation methods that use the Jacobian of the generator’s gradient with respect to the discriminator’s parameters (and vice versa). Such a Jacobian represents the indirect effect between two models: how removing an instance from the discriminator’s training changes the generator’s parameters. Second, we propose an instance evaluation scheme that measures the harmfulness of each training instance based on how a GAN evaluation metric (e.g., Inception score) is expected to change by the instance’s removal. Furthermore, we demonstrate that removing the identified harmful instances significantly improves the generative performance on various GAN evaluation metrics.
[AI-28] Automated detection of atomicity violations in large-scale systems
Link: https://arxiv.org/abs/2504.00521
Authors: Hang He, Yixing Luo, Chengcheng Wan, Ting Su, Haiying Sun, Geguang Pu
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Abstract:Atomicity violations in interrupt-driven programs pose a significant threat to software safety in critical systems. These violations occur when the execution sequence of operations on shared resources is disrupted by asynchronous interrupts. Detecting atomicity violations is challenging due to the vast program state space, application-level code dependencies, and complex domain-specific knowledge. We propose Clover, a hybrid framework that integrates static analysis with large language model (LLM) agents to detect atomicity violations in real-world programs. Clover first performs static analysis to extract critical code snippets and operation information. It then initiates a multi-agent process, where the expert agent leverages domain-specific knowledge to detect atomicity violations, which are subsequently validated by the judge agent. Evaluations on RaceBench 2.1, SV-COMP, and RWIP demonstrate that Clover achieves a precision/recall of 92.3%/86.6%, outperforming existing approaches by 27.4-118.2% on F1-score.
[AI-29] Operator Learning with Domain Decomposition for Geometry Generalization in PDE Solving
Link: https://arxiv.org/abs/2504.00510
Authors: Jianing Huang, Kaixuan Zhang, Youjia Wu, Ze Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Neural operators have become increasingly popular in solving partial differential equations (PDEs) due to their superior capability to capture intricate mappings between function spaces over complex domains. However, the data-hungry nature of operator learning inevitably poses a bottleneck for their widespread applications. At the core of the challenge lies the absence of transferability of neural operators to new geometries. To tackle this issue, we propose operator learning with domain decomposition, a local-to-global framework to solve PDEs on arbitrary geometries. Under this framework, we devise an iterative scheme, Schwarz Neural Inference (SNI). This scheme allows for partitioning of the problem domain into smaller subdomains, on which local problems can be solved with neural operators, and stitching local solutions to construct a global solution. Additionally, we provide a theoretical analysis of the convergence rate and error bound. We conduct extensive experiments on several representative PDEs with diverse boundary conditions and achieve remarkable geometry generalization compared to alternative methods. These analyses and experiments demonstrate the proposed framework’s potential in addressing challenges related to geometry generalization and data efficiency.
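The partition-solve-stitch loop behind this kind of scheme can be illustrated with the classical alternating Schwarz method on a one-dimensional toy problem. The sketch below is not SNI itself: it solves u'' = 0 on [0, 1] with u(0) = 0 and u(1) = 1, and the hypothetical `local_solver` is an exact linear interpolation standing in for the neural operator that would solve each subdomain problem. Grid size, overlap indices, and sweep count are arbitrary choices.

```python
def local_solver(left_value, right_value, num_points):
    """Stand-in for a learned local operator: exact solution of u'' = 0 on a subdomain."""
    step = (right_value - left_value) / (num_points - 1)
    return [left_value + i * step for i in range(num_points)]


def schwarz_iterations(num_points=11, overlap=(4, 6), sweeps=50):
    """Alternating Schwarz on two overlapping subdomains of a uniform grid on [0, 1]."""
    lo, hi = overlap  # right subdomain starts at index lo, left subdomain ends at index hi
    u = [0.0] * num_points  # initial guess; u[0] = 0 is the left boundary condition
    u[-1] = 1.0  # right boundary condition u(1) = 1
    for _ in range(sweeps):
        # Left subdomain: its interface value at index hi comes from the current iterate.
        u[: hi + 1] = local_solver(u[0], u[hi], hi + 1)
        # Right subdomain: its interface value at index lo comes from the fresh left solve.
        u[lo:] = local_solver(u[lo], u[-1], num_points - lo)
    return u


if __name__ == "__main__":
    solution = schwarz_iterations()
    print([round(value, 6) for value in solution])
```

Each sweep passes interface values across the overlap region, and the stitched global iterate approaches the straight line u(x) = x, which is the exact solution of this toy problem.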
[AI-30] Enhancing stroke disease classification through machine learning models via a novel voting system by feature selection techniques
Link: https://arxiv.org/abs/2504.00485
Authors: Mahade Hasan, Farhana Yasmin, Md. Mehedi Hassan, Xue Yu, Soniya Yeasmin, Herat Joshi, Sheikh Mohammed Shariful Islam
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Heart disease remains a leading cause of mortality and morbidity worldwide, necessitating the development of accurate and reliable predictive models to facilitate early detection and intervention. While state-of-the-art work has focused on various machine learning approaches for predicting heart disease, these approaches have not achieved remarkable accuracy. In response to this need, we applied nine machine learning algorithms (XGBoost, logistic regression, decision tree, random forest, k-nearest neighbors (KNN), support vector machine (SVM), Gaussian naïve Bayes (NB Gaussian), adaptive boosting, and linear regression) to predict heart disease based on a range of physiological indicators. Our approach involved feature selection techniques to identify the most relevant predictors, aimed at refining the models to enhance both performance and interpretability. The models were trained with grid search hyperparameter tuning and cross-validation to minimize overfitting. Additionally, we have developed a novel voting system with feature selection techniques to advance heart disease classification. Furthermore, we have evaluated the models using key performance metrics including accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (ROC AUC). Among the models, XGBoost demonstrated exceptional performance, achieving 99% accuracy, precision, and F1-score, 98% recall, and 100% ROC AUC. This study offers a promising approach to early heart disease diagnosis and preventive healthcare.
[AI-31] Learning-Based Approximate Nonlinear Model Predictive Control Motion Cueing
Link: https://arxiv.org/abs/2504.00469
Authors: Camilo Gonzalez Arango(1), Houshyar Asadi(1), Mohammad Reza Chalak Qazani(2), Chee Peng Lim(3) ((1) Institute for Intelligent Systems Research and Innovation, Deakin University, Waurn Ponds, Victoria, 3216, Australia. (2) Sohar University, Sohar, 311, Oman. (3) Swinburne University, Hawthorn, Victoria, 3122, Australia.)
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Abstract:Motion Cueing Algorithms (MCAs) encode the movement of simulated vehicles into movement that can be reproduced with a motion simulator to provide a realistic driving experience within the capabilities of the machine. This paper introduces a novel learning-based MCA for serial robot-based motion simulators. Building on the differentiable predictive control framework, the proposed method merges the advantages of Nonlinear Model Predictive Control (NMPC) - notably nonlinear constraint handling and accurate kinematic modeling - with the computational efficiency of machine learning. By shifting the computational burden to offline training, the new algorithm enables real-time operation at high control rates, thus overcoming the key challenge associated with NMPC-based motion cueing. The proposed MCA incorporates a nonlinear joint-space plant model and a policy network trained to mimic NMPC behavior while accounting for joint acceleration, velocity, and position limits. Simulation experiments across multiple motion cueing scenarios showed that the proposed algorithm performed on par with a state-of-the-art NMPC-based alternative in terms of motion cueing quality as quantified by the RMSE and correlation coefficient with respect to reference signals. However, the proposed algorithm was on average 400 times faster than the NMPC baseline. In addition, the algorithm successfully generalized to unseen operating conditions, including motion cueing scenarios on a different vehicle and real-time physics-based simulations.
[AI-32] MetaLoRA: Tensor-Enhanced Adaptive Low-Rank Fine-tuning ICDE2025
Link: https://arxiv.org/abs/2504.00460
Authors: Maolin Wang, Xiangyu Zhao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICDE 2025 PhD Symposium Track
Abstract:There has been a significant increase in the deployment of neural network models, presenting substantial challenges in model adaptation and fine-tuning. Efficient adaptation is crucial in maintaining model performance across diverse tasks and domains. While Low-Rank Adaptation (LoRA) has emerged as a promising parameter-efficient fine-tuning method, its fixed parameter nature limits its ability to handle dynamic task requirements effectively. Adapting models to new tasks can be challenging due to the need for extensive fine-tuning. Current LoRA variants primarily focus on general parameter reduction while overlooking the importance of dynamic parameter adjustment and meta-learning capabilities. Moreover, existing approaches mainly address static adaptations, neglecting the potential benefits of task-aware parameter generation in handling diverse task distributions. To address these limitations, this Ph.D. research proposes a LoRA generation approach to model task relationships and introduces MetaLoRA, a novel parameter-efficient adaptation framework incorporating meta-learning principles. This work develops a comprehensive architecture that integrates meta-parameter generation with adaptive low-rank decomposition, enabling efficient handling of both task-specific and task-agnostic features. MetaLoRA accurately captures task patterns by incorporating meta-learning mechanisms and dynamic parameter adjustment strategies. To our knowledge, this research represents the first attempt to provide a meta-learning enhanced LoRA variant, offering improved adaptation capability while maintaining computational efficiency in model fine-tuning.
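Editor's note for context: the abstract does not give MetaLoRA's formulation. The block below restates the standard LoRA update that it builds on, with the usual symbols (frozen weight $W_0$, rank $r$, scaling $\alpha$), which are assumptions about notation rather than the paper's own.

```latex
W' = W_0 + \Delta W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only $A$ and $B$ are trained, so the trainable parameter count drops from $dk$ to $r(d + k)$. In standard LoRA the two factors are fixed after training, which is the "fixed parameter nature" the abstract contrasts with task-aware parameter generation.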
[AI-33] No Free Lunch with Guardrails
Link: https://arxiv.org/abs/2504.00441
Authors: Divyanshu Kumar, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Abstract:As large language models (LLMs) and generative AI become widely adopted, guardrails have emerged as a key tool to ensure their safe use. However, adding guardrails isn’t without tradeoffs; stronger security measures can reduce usability, while more flexible systems may leave gaps for adversarial attacks. In this work, we explore whether current guardrails effectively prevent misuse while maintaining practical utility. We introduce a framework to evaluate these tradeoffs, measuring how different guardrails balance risk, security, and usability, and build an efficient guardrail. Our findings confirm that there is no free lunch with guardrails; strengthening security often comes at the cost of usability. To address this, we propose a blueprint for designing better guardrails that minimize risk while maintaining usability. We evaluate various industry guardrails, including Azure Content Safety, Bedrock Guardrails, OpenAI’s Moderation API, Guardrails AI, Nemo Guardrails, and our own custom-built guardrails. Additionally, we assess how LLMs like GPT-4o, Gemini 2.0-Flash, Claude 3.5-Sonnet, and Mistral Large-Latest respond under different system prompts, including simple prompts, detailed prompts, and detailed prompts with chain-of-thought (CoT) reasoning. Our study provides a clear comparison of how different guardrails perform, highlighting the challenges in balancing security and usability.
[AI-34] LLM-Assisted Proactive Threat Intelligence for Automated Reasoning
Link: https://arxiv.org/abs/2504.00428
Authors: Shuva Paul, Farhad Alemi, Richard Macwan
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 Pages, 1 Figure
Abstract:Successful defense against dynamically evolving cyber threats requires advanced and sophisticated techniques. This research presents a novel approach to enhance real-time cybersecurity threat detection and response by integrating large language models (LLMs) and Retrieval-Augmented Generation (RAG) systems with continuous threat intelligence feeds. Leveraging recent advancements in LLMs, specifically GPT-4o, and the innovative application of RAG techniques, our approach addresses the limitations of traditional static threat analysis by incorporating dynamic, real-time data sources. We leveraged RAG to get the latest information in real-time for threat intelligence, which is not possible in the existing GPT-4o model. We employ the Patrowl framework to automate the retrieval of diverse cybersecurity threat intelligence feeds, including Common Vulnerabilities and Exposures (CVE), Common Weakness Enumeration (CWE), Exploit Prediction Scoring System (EPSS), and Known Exploited Vulnerabilities (KEV) databases, and integrate these with the all-mpnet-base-v2 model for high-dimensional vector embeddings, stored and queried in Milvus. We demonstrate our system’s efficacy through a series of case studies, revealing significant improvements in addressing recently disclosed vulnerabilities, KEVs, and high-EPSS-score CVEs compared to the baseline GPT-4o. This work not only advances the role of LLMs in cybersecurity but also establishes a robust foundation for the development of automated intelligent cyberthreat information management systems, addressing crucial gaps in current cybersecurity practices.
[AI-35] Hawkeye: Efficient Reasoning with Model Collaboration
Link: https://arxiv.org/abs/2504.00424
Authors: Jianshu She, Zhuohao Li, Zhemin Huang, Qi Li, Peiran Xu, Haonan Li, Qirong Ho
Categories: Artificial Intelligence (cs.AI)
Abstract:Chain-of-Thought (CoT) reasoning has demonstrated remarkable effectiveness in enhancing the reasoning abilities of large language models (LLMs). However, its efficiency remains a challenge due to the generation of excessive intermediate reasoning tokens, which introduce semantic redundancy and overly detailed reasoning steps. Moreover, computational expense and latency are significant concerns, as the cost scales with the number of output tokens, including those intermediate steps. In this work, we observe that most CoT tokens are unnecessary, and retaining only a small portion of them is sufficient for producing high-quality responses. Inspired by this, we propose HAWKEYE, a novel post-training and inference framework where a large model produces concise CoT instructions to guide a smaller model in response generation. HAWKEYE quantifies redundancy in CoT reasoning and distills high-density information via reinforcement learning. By leveraging these concise CoTs, HAWKEYE is able to expand responses while reducing token usage and computational cost significantly. Our evaluation shows that HAWKEYE can achieve comparable response quality using only 35% of the full CoTs, while improving clarity, coherence, and conciseness by approximately 10%. Furthermore, HAWKEYE can accelerate end-to-end reasoning by up to 3.4x on complex math tasks while reducing inference cost by up to 60%. HAWKEYE will be open-sourced and the models will be available soon.
[AI-36] From Intuition to Understanding: Using AI Peers to Overcome Physics Misconceptions
Link: https://arxiv.org/abs/2504.00408
Authors: Ruben Weijers, Denton Wu, Hannah Betts, Tamara Jacod, Yuxiang Guan, Vidya Sujaya, Kushal Dev, Toshali Goel, William Delooze, Reihaneh Rabbany, Ying Wu, Jean-François Godbout, Kellin Pelrine
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Abstract:Generative AI has the potential to transform personalization and accessibility of education. However, it raises serious concerns about accuracy and helping students become independent critical thinkers. In this study, we designed a helpful AI “Peer” to help students correct fundamental physics misconceptions related to Newtonian mechanics concepts. In contrast to approaches that seek near-perfect accuracy to create an authoritative AI tutor or teacher, we directly inform students that this AI can answer up to 40% of questions incorrectly. In a randomized controlled trial with 165 students, those who engaged in targeted dialogue with the AI Peer achieved post-test scores that were, on average, 10.5 percentage points higher - with over 20 percentage points higher normalized gain - than a control group that discussed physics history. Qualitative feedback indicated that 91% of the treatment group’s AI interactions were rated as helpful. Furthermore, by comparing student performance on pre- and post-test questions about the same concept, along with experts’ annotations of the AI interactions, we find initial evidence suggesting the improvement in performance does not depend on the correctness of the AI. With further research, the AI Peer paradigm described here could open new possibilities for how we learn, adapt to, and grow with AI.
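Editor's sketch: the abstract reports a "normalized gain" without defining it. The Python below assumes the definition common in physics education research (Hake's gain: the fraction of the available improvement that was actually achieved). The scores in the example are hypothetical, not the study's data.

```python
def normalized_gain(pre, post, max_score=100.0):
    """Hake-style normalized gain: (post - pre) / (max_score - pre).

    Assumes scores on a 0..max_score scale; undefined when pre == max_score.
    """
    if pre >= max_score:
        raise ValueError("normalized gain is undefined when pre >= max_score")
    return (post - pre) / (max_score - pre)


# Hypothetical student: 40% on the pre-test, 55% on the post-test.
example_gain = normalized_gain(40.0, 55.0)  # 15 of the 60 available points
```

Under this definition a 15-point raw improvement from a 40% baseline corresponds to a gain of 0.25, which is why normalized gain can differ substantially from the raw difference in percentage points.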
[AI-37] CyberBOT: Towards Reliable Cybersecurity Education via Ontology-Grounded Retrieval Augmented Generation
Link: https://arxiv.org/abs/2504.00389
Authors: Chengshuai Zhao, Riccardo De Maria, Tharindu Kumarage, Kumar Satvik Chaudhary, Garima Agrawal, Yiwen Li, Jongchan Park, Yuli Deng, Ying-Chih Chen, Huan Liu
Categories: Artificial Intelligence (cs.AI)
Abstract:Advancements in large language models (LLMs) have enabled the development of intelligent educational tools that support inquiry-based learning across technical domains. In cybersecurity education, where accuracy and safety are paramount, systems must go beyond surface-level relevance to provide information that is both trustworthy and domain-appropriate. To address this challenge, we introduce CyberBOT, a question-answering chatbot that leverages a retrieval-augmented generation (RAG) pipeline to incorporate contextual information from course-specific materials and validate responses using a domain-specific cybersecurity ontology. The ontology serves as a structured reasoning layer that constrains and verifies LLM-generated answers, reducing the risk of misleading or unsafe guidance. CyberBOT has been deployed in a large graduate-level course at Arizona State University (ASU), where more than one hundred students actively engage with the system through a dedicated web-based platform. Computational evaluations in lab environments highlight the potential capacity of CyberBOT, and a forthcoming field study will evaluate its pedagogical impact. By integrating structured domain reasoning with modern generative capabilities, CyberBOT illustrates a promising direction for developing reliable and curriculum-aligned AI applications in specialized educational contexts.
[AI-38] Integrated LLM-Based Intrusion Detection with Secure Slicing xApp for Securing O-RAN-Enabled Wireless Network Deployments
Link: https://arxiv.org/abs/2504.00341
Authors: Joshua Moore, Aly Sabri Abdalla, Prabesh Khanal, Vuk Marojevic
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: This article has been accepted for publication in the IEEE 2025 International Conference on Communications (ICC2025)
Abstract:The Open Radio Access Network (O-RAN) architecture is reshaping telecommunications by promoting openness, flexibility, and intelligent closed-loop optimization. By decoupling hardware and software and enabling multi-vendor deployments, O-RAN reduces costs, enhances performance, and allows rapid adaptation to new technologies. A key innovation is intelligent network slicing, which partitions networks into isolated slices tailored for specific use cases or quality of service requirements. The RAN Intelligent Controller further optimizes resource allocation, ensuring efficient utilization and improved service quality for user equipment (UEs). However, the modular and dynamic nature of O-RAN expands the threat surface, necessitating advanced security measures to maintain network integrity, confidentiality, and availability. Intrusion detection systems have become essential for identifying and mitigating attacks. This research explores using large language models (LLMs) to generate security recommendations based on the temporal traffic patterns of connected UEs. The paper introduces an LLM-driven intrusion detection framework and demonstrates its efficacy through experimental deployments, comparing non-fine-tuned and fine-tuned models for task-specific accuracy.
[AI-39] Agentic Multimodal AI for Hyperpersonalized B2B and B2C Advertising in Competitive Markets: An AI-Driven Competitive Advertising Framework
Link: https://arxiv.org/abs/2504.00338
Authors: Sakhinana Sagar Srinivas, Akash Das, Shivam Gupta, Venkataramana Runkana
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Abstract:The growing use of foundation models (FMs) in real-world applications demands adaptive, reliable, and efficient strategies for dynamic markets. In the chemical industry, AI-discovered materials drive innovation, but commercial success hinges on market adoption, requiring FM-driven advertising frameworks that operate in-the-wild. We present a multilingual, multimodal AI framework for autonomous, hyper-personalized advertising in B2B and B2C markets. By integrating retrieval-augmented generation (RAG), multimodal reasoning, and adaptive persona-based targeting, our system generates culturally relevant, market-aware ads tailored to shifting consumer behaviors and competition. Validation combines real-world product experiments with a Simulated Humanistic Colony of Agents to model consumer personas, optimize strategies at scale, and ensure privacy compliance. Synthetic experiments mirror real-world scenarios, enabling cost-effective testing of ad strategies without risky A/B tests. Combining structured retrieval-augmented reasoning with in-context learning (ICL), the framework boosts engagement, prevents market cannibalization, and maximizes ROAS. This work bridges AI-driven innovation and market adoption, advancing multimodal FM deployment for high-stakes decision-making in commercial marketing.
[AI-40] SeizureTransformer: Scaling U-Net with Transformer for Simultaneous Time-Step Level Seizure Detection from Long EEG Recordings
Link: https://arxiv.org/abs/2504.00336
Authors: Kerui Wu, Ziyue Zhao, Bülent Yener
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Epilepsy is a common neurological disorder that affects around 65 million people worldwide. Detecting seizures quickly and accurately is vital, given the prevalence and severity of the associated complications. Recently, deep learning-based automated seizure detection methods have emerged as solutions; however, most existing methods require extensive post-processing and do not effectively handle the crucial long-range patterns in EEG data. In this work, we propose SeizureTransformer, a simple model composed of (i) a deep encoder built from 1D convolutions, (ii) a residual CNN stack and a transformer encoder that embed the previous output into a high-level representation with contextual information, and (iii) a streamlined decoder that converts these features into a sequence of probabilities, directly indicating the presence or absence of seizures at every time step. Extensive experiments on public and private EEG seizure detection datasets demonstrate that our model significantly outperforms existing approaches (ranking first in the 2025 “seizure detection challenge” organized at the International Conference on Artificial Intelligence in Epilepsy and Other Neurological Disorders), underscoring its potential for real-time, precise seizure detection.
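Editor's sketch: the model outputs one seizure probability per time step. The Python below shows one simple way such a sequence could be turned into event intervals by thresholding. The 0.5 threshold, the function name, and the sample values are assumptions for illustration. The abstract does not describe the paper's actual decoding step.

```python
def probabilities_to_events(probs, threshold=0.5):
    """Group consecutive time steps with prob >= threshold into (start, end) spans.

    Spans are half-open index ranges: start is inclusive, end is exclusive.
    """
    events = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # a new event begins here
        elif p < threshold and start is not None:
            events.append((start, i))      # the current event ends before step i
            start = None
    if start is not None:                  # an event runs to the end of the record
        events.append((start, len(probs)))
    return events


# Hypothetical per-step probabilities.
spans = probabilities_to_events([0.1, 0.8, 0.9, 0.2, 0.7])
```

Because every time step already carries its own probability, the conversion to intervals is a single linear pass.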
[AI-41] FedPaI: Achieving Extreme Sparsity in Federated Learning via Pruning at Initialization
Link: https://arxiv.org/abs/2504.00308
Authors: Haonan Wang, Zeli Liu, Kajimusugura Hoshino, Tuo Zhang, John Paul Walters, Stephen Crago
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Federated Learning (FL) enables distributed training on edge devices but faces significant challenges due to resource constraints in edge environments, impacting both communication and computational efficiency. Existing iterative pruning techniques improve communication efficiency but are limited by their centralized design, which struggles with FL’s decentralized and data-imbalanced nature, resulting in suboptimal sparsity levels. To address these issues, we propose FedPaI, a novel efficient FL framework that leverages Pruning at Initialization (PaI) to achieve extreme sparsity. FedPaI identifies optimal sparse connections at an early stage, maximizing model capacity and significantly reducing communication and computation overhead by fixing sparsity patterns at the start of training. To adapt to diverse hardware and software environments, FedPaI supports both structured and unstructured pruning. Additionally, we introduce personalized client-side pruning mechanisms for improved learning capacity and sparsity-aware server-side aggregation for enhanced efficiency. Experimental results demonstrate that FedPaI consistently outperforms existing efficient FL methods that apply conventional iterative pruning, with a significant lead in both efficiency and model accuracy. For the first time, our proposed FedPaI achieves an extreme sparsity level of up to 98% without compromising the model accuracy compared to unpruned baselines, even under challenging non-IID settings. By employing our FedPaI with joint optimization of model learning capacity and sparsity, FL applications can benefit from faster convergence and accelerate the training by 6.4 to 7.9 times.
[AI-42] Collaborative LLM Numerical Reasoning with Local Data Protection
Link: https://arxiv.org/abs/2504.00299
Authors: Min Zhang, Yuzhe Lu, Yun Zhou, Panpan Xu, Lin Lee Cheong, Chang-Tien Lu, Haozhu Wang
Categories: Artificial Intelligence (cs.AI)
Abstract:Numerical reasoning over documents, which demands both contextual understanding and logical inference, is challenging for low-capacity local models deployed on computation-constrained devices. Although such complex reasoning queries could be routed to powerful remote models like GPT-4, exposing local data raises significant data leakage concerns. Existing mitigation methods generate problem descriptions or examples for remote assistance. However, the inherent complexity of numerical reasoning hinders the local model from generating logically equivalent queries and accurately inferring answers with remote guidance. In this paper, we present a model collaboration framework with two key innovations: (1) a context-aware synthesis strategy that shifts the query domains while preserving logical consistency; and (2) a tool-based answer reconstruction approach that reuses the remote-generated problem-solving pattern with code snippets. Experimental results demonstrate that our method achieves better reasoning accuracy than solely using local models while providing stronger data protection than fully relying on remote models. Furthermore, our method improves accuracy by 16.2% - 43.6% while reducing data leakage by 2.3% - 44.6% compared to existing data protection approaches.
[AI-43] Digital Twins in Biopharmaceutical Manufacturing: Review and Perspective on Human-Machine Collaborative Intelligence
Link: https://arxiv.org/abs/2504.00286
Authors: Mohammed Aatif Shahab, Francesco Destro, Richard D. Braatz
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:The biopharmaceutical industry is increasingly developing digital twins to digitalize and automate the manufacturing process in response to the growing market demands. However, this shift presents significant challenges for human operators, as the complexity and volume of information can overwhelm their ability to manage the process effectively. These issues are compounded when digital twins are designed without considering interaction and collaboration with operators, who are responsible for monitoring processes and assessing situations, particularly during abnormalities. Our review of current trends in biopharma digital twin development reveals a predominant focus on technology and often overlooks the critical role of human operators. To bridge this gap, this article proposes a collaborative intelligence framework that emphasizes the integration of operators with digital twins. Approaches to system design that can enhance operator trust and human-machine interface usability are presented. Moreover, innovative training programs for preparing operators to understand and utilize digital twins are discussed. The framework outlined in this article aims to enhance collaboration between operators and digital twins effectively by using their full capabilities to boost resilience and productivity in biopharmaceutical manufacturing.
[AI-44] Exploration and Adaptation in Non-Stationary Tasks with Diffusion Policies
Link: https://arxiv.org/abs/2504.00280
Authors: Gunbir Singh Baveja
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages, 1 figure
Abstract:This paper investigates the application of Diffusion Policy in non-stationary, vision-based RL settings, specifically targeting environments where task dynamics and objectives evolve over time. Our work is grounded in practical challenges encountered in dynamic real-world scenarios such as robotics assembly lines and autonomous navigation, where agents must adapt control strategies from high-dimensional visual inputs. We apply Diffusion Policy, which leverages iterative stochastic denoising to refine latent action representations, to benchmark environments including Procgen and PointMaze. Our experiments demonstrate that, despite increased computational demands, Diffusion Policy consistently outperforms standard RL methods such as PPO and DQN, achieving higher mean and maximum rewards with reduced variability. These findings underscore the approach’s capability to generate coherent, contextually relevant action sequences in continuously shifting conditions, while also highlighting areas for further improvement in handling extreme non-stationarity.
[AI-45] Rack Position Optimization in Large-Scale Heterogeneous Data Centers ICAPS
Link: https://arxiv.org/abs/2504.00277
Authors: Chang-Lin Chen, Jiayu Chen, Tian Lan, Zhaoxia Zhao, Hongbo Dong, Vaneet Aggarwal
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Optimization and Control (math.OC)
Comments: Extended version of paper accepted at The International Conference on Automated Planning and Scheduling (ICAPS) 2025
Abstract:As rapidly growing AI computational demands accelerate the need for new hardware installation and maintenance, this work explores optimal data center resource management by balancing operational efficiency with fault tolerance through strategic rack positioning considering diverse resources and locations. Traditional mixed-integer programming (MIP) approaches often struggle with scalability, while heuristic methods may result in significant sub-optimality. To address these issues, this paper presents a novel two-tier optimization framework using a high-level deep reinforcement learning (DRL) model to guide a low-level gradient-based heuristic for local search. The high-level DRL agent employs Leader Reward for optimal rack type ordering, and the low-level heuristic efficiently maps racks to positions, minimizing movement counts and ensuring fault-tolerant resource distribution. This approach allows scalability to over 100,000 positions and 100 rack types. Our method outperformed the gradient-based heuristic by 7% on average and the MIP solver by over 30% in objective value. It achieved a 100% success rate versus MIP’s 97.5% (within a 20-minute limit), completing in just 2 minutes compared to MIP’s 1630 minutes (i.e., almost 4 orders of magnitude improvement). Unlike the MIP solver, which showed performance variability under time constraints and high penalties, our algorithm consistently delivered stable, efficient results - an essential feature for large-scale data center management.
[AI-46] Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities
Link: https://arxiv.org/abs/2504.00226
Authors: Roussel Rahman
Categories: Artificial Intelligence (cs.AI)
Abstract:An essential element of human mathematical reasoning is our number sense – an abstract understanding of numbers and their relationships – which allows us to solve problems involving vast number spaces using limited computational resources. Mathematical reasoning of Large Language Models (LLMs) is often tested on high-level problems (such as Olympiad challenges, geometry, word problems, and puzzles), but their low-level number sense remains less explored. We introduce “Numberland,” a 100-problem test to evaluate the numerical reasoning abilities of LLM-based agents. The tasks – basic operations, advanced calculations (e.g., exponentiation, complex numbers), prime number checks, and the 24 game – aim to test elementary skills and their integration in solving complex and uncertain problems. We evaluated five LLM-based agents: OpenAI’s o1 and o1-mini, Google Gemini, Microsoft Copilot, and Anthropic Claude. They scored 74-95% on the first three tasks that allow deterministic steps to solutions. In the 24 game, which needs trial-and-error search, performance dropped to 10-73%. We tested the top 24-game solver (o1 with 73% accuracy) on 25 harder problems, and its score fell to 27%, confirming search as a bottleneck. These results, along with the types of mistakes, suggest a fragile number sense in LLMs, which is a bit surprising given their prowess in challenging benchmarks. The limits of LLM numerical reasoning highlight the scope of simple, targeted tests to evaluate and explain LLM math skills to ensure safe use.
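Editor's sketch: the 24 game asks whether four numbers can be combined with +, -, *, / and parentheses to reach 24, which is why the abstract describes it as needing trial-and-error search. The Python below is a minimal brute-force solver written for illustration (it is not part of the Numberland benchmark). It uses exact fractions so that division does not introduce rounding errors.

```python
from fractions import Fraction
from itertools import combinations


def solve_24(numbers, target=24):
    """Return one expression string that evaluates to target, or None."""
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: one value left; it either equals the target or this branch fails.
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick any two values, combine them every possible way, and recurse.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
            (a * b, f"({ea} * {eb})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


solution = solve_24([3, 3, 8, 8])   # solvable only via 8 / (3 - 8 / 3)
no_solution = solve_24([1, 1, 1, 1])  # no combination of four ones reaches 24
```

The input 3, 3, 8, 8 is a classic hard case because its only solution passes through the non-integer intermediate value 1/3, which an integer-only search would miss.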
[AI-47] Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition
Link: https://arxiv.org/abs/2504.00194
Authors: Brianna Chrisman, Lucius Bushnaq, Lee Sharkey
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Much of mechanistic interpretability has focused on understanding the activation spaces of large neural networks. However, activation space-based approaches reveal little about the underlying circuitry used to compute features. To better understand the circuits employed by models, we introduce a new decomposition method called Local Loss Landscape Decomposition (L3D). L3D identifies a set of low-rank subnetworks: directions in parameter space of which a subset can reconstruct the gradient of the loss between any sample’s output and a reference output vector. We design a series of progressively more challenging toy models with well-defined subnetworks and show that L3D can nearly perfectly recover the associated subnetworks. Additionally, we investigate the extent to which perturbing the model in the direction of a given subnetwork affects only the relevant subset of samples. Finally, we apply L3D to a real-world transformer model and a convolutional neural network, demonstrating its potential to identify interpretable and relevant circuits in parameter space.
[AI-48] Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?
Link: https://arxiv.org/abs/2504.00186
Authors: Olawale Salaudeen, Nicole Chiou, Shiny Weng, Sanmi Koyejo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract:Spurious correlations are unstable statistical associations that hinder robust decision-making. Conventional wisdom suggests that models relying on such correlations will fail to generalize out-of-distribution (OOD), especially under strong distribution shifts. However, empirical evidence challenges this view as naive in-distribution empirical risk minimizers often achieve the best OOD accuracy across popular OOD generalization benchmarks. In light of these results, we propose a different perspective: many widely used benchmarks for evaluating robustness to spurious correlations are misspecified. Specifically, they fail to include shifts in spurious correlations that meaningfully impact OOD generalization, making them unsuitable for evaluating the benefit of removing such correlations. We establish conditions under which a distribution shift can reliably assess a model’s reliance on spurious correlations. Crucially, under these conditions, we should not observe a strong positive correlation between in-distribution and OOD accuracy, often called “accuracy on the line.” Yet, most state-of-the-art benchmarks exhibit this pattern, suggesting they do not effectively assess robustness. Our findings expose a key limitation in current benchmarks used to evaluate domain generalization algorithms, that is, models designed to avoid spurious correlations. We highlight the need to rethink how robustness to spurious correlations is assessed, identify well-specified benchmarks the field should prioritize, and enumerate strategies for designing future benchmarks that meaningfully reflect robustness under distribution shift.
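Editor's sketch: "accuracy on the line" refers to a strong positive correlation between in-distribution (ID) and out-of-distribution (OOD) accuracy across models. The Python below computes a plain Pearson correlation over per-model accuracy pairs as one way to check for the pattern. The accuracy values are invented for illustration, and the paper's own diagnostic may differ.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies for five models on one benchmark.
id_accuracy = [0.70, 0.75, 0.80, 0.85, 0.90]
ood_accuracy = [0.52, 0.56, 0.61, 0.66, 0.70]
r = pearson_r(id_accuracy, ood_accuracy)
```

A value of r close to 1, as in this invented example, is the pattern the abstract argues a well-specified spurious-correlation benchmark should not show.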
[AI-49] MetaCLBench: Meta Continual Learning Benchmark on Resource-Constrained Edge Devices
Link: https://arxiv.org/abs/2504.00174
Authors: Sijia Li, Young D. Kwon, Lik-Hang Lee, Pan Hui
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Meta-Continual Learning (Meta-CL) has emerged as a promising approach to minimize manual labeling efforts and system resource requirements by enabling Continual Learning (CL) with limited labeled samples. However, while existing methods have shown success in image-based tasks, their effectiveness remains unexplored for sequential time-series data from sensor systems, particularly audio inputs. To address this gap, we conduct a comprehensive benchmark study evaluating six representative Meta-CL approaches using three network architectures on five datasets from both image and audio modalities. We develop MetaCLBench, an end-to-end Meta-CL benchmark framework for edge devices to evaluate system overheads and investigate trade-offs among performance, computational costs, and memory requirements across various Meta-CL methods. Our results reveal that while many Meta-CL methods can learn new classes in both image and audio modalities, they impose significant computational and memory costs on edge devices. Also, we find that pre-training and meta-training procedures based on source data before deployment improve Meta-CL performance. Finally, to facilitate further research, we provide practical guidelines for researchers and machine learning practitioners implementing Meta-CL in resource-constrained environments and make our benchmark framework and tools publicly available, enabling fair evaluation across both accuracy and system-level metrics.
[AI-50] Backdoor Detection through Replicated Execution of Outsourced Training
Link: https://arxiv.org/abs/2504.00170
Authors: Hengrui Jia, Sierra Wyllie, Akram Bin Sediq, Ahmed Ibrahim, Nicolas Papernot
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: Published in the 3rd IEEE Conference on Secure and Trustworthy Machine Learning (IEEE SaTML 2025)
Abstract:It is common practice to outsource the training of machine learning models to cloud providers. Clients who do so gain from the cloud’s economies of scale, but implicitly assume trust: the server should not deviate from the client’s training procedure. A malicious server may, for instance, seek to insert backdoors in the model. Detecting a backdoored model without prior knowledge of both the backdoor attack and its accompanying trigger remains a challenging problem. In this paper, we show that a client with access to multiple cloud providers can replicate a subset of training steps across multiple servers to detect deviation from the training procedure in a similar manner to differential testing. Assuming some cloud-provided servers are benign, we identify malicious servers by the substantial difference between model updates required for backdooring and those resulting from clean training. Perhaps the strongest advantage of our approach is its suitability to clients that have limited-to-no local compute capability to perform training; we leverage the existence of multiple cloud providers to identify malicious updates without expensive human labeling or heavy computation. We demonstrate the capabilities of our approach on an outsourced supervised learning task where 50% of the cloud providers insert their own backdoor; our approach is able to correctly identify 99.6% of them. In essence, our approach is successful because it replaces the signature-based paradigm taken by existing approaches with an anomaly-based detection paradigm. Furthermore, our approach is robust to several attacks from adaptive adversaries utilizing knowledge of our detection scheme.
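The anomaly-based idea in this abstract can be sketched as follows. This is not the authors' implementation: it assumes the same training step is replicated deterministically on every server, that benign servers form a majority, and that a simple distance to the coordinate-wise median is enough to separate clean from backdoored updates. The function names and the threshold are hypothetical.

```python
import math
from statistics import median

def l2_distance(a, b):
    """Euclidean distance between two equal-length update vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flag_deviating_servers(updates, threshold):
    """Return the names of servers whose update is far from the median update.

    updates:   dict mapping server name -> list of floats, the model update
               each server returned for the same replicated training step.
    threshold: assumed tolerance for benign numerical noise (hypothetical).
    """
    vectors = list(updates.values())
    # Coordinate-wise median acts as the reference "clean" update.
    reference = [median(coords) for coords in zip(*vectors)]
    return sorted(
        name for name, vec in updates.items()
        if l2_distance(vec, reference) > threshold
    )

# Toy updates: three servers roughly agree, one deviates strongly.
updates = {
    "server_a": [0.10, -0.20, 0.05],
    "server_b": [0.11, -0.19, 0.05],
    "server_c": [0.10, -0.21, 0.04],
    "server_d": [0.90, 0.40, -0.70],
}
flagged = flag_deviating_servers(updates, threshold=0.1)
print(flagged)
```

Using the median rather than the mean keeps a single large deviation from dragging the reference toward the malicious update, which is why the benign-majority assumption matters here.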
[AI-51] Lorentzian Graph Isomorphic Network
Link: https://arxiv.org/abs/2504.00142
Authors: Srinitish Srinivasan, Omkumar CU
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Preprint. Under Review
Abstract:We introduce the Lorentzian Graph Isomorphic Network (LGIN), a novel graph neural network (GNN) designed to operate in hyperbolic spaces, leveraging the Lorentzian model to enhance graph representation learning. Existing GNNs primarily operate in Euclidean spaces, which can limit their ability to capture hierarchical and multi-relational structures inherent to complex graphs. LGIN addresses this by incorporating curvature-aware aggregation functions that preserve the Lorentzian metric tensor, ensuring embeddings remain constrained within the hyperbolic space. It proposes a new update rule that effectively captures both local neighborhood interactions and global structural properties, enabling LGIN to distinguish non-isomorphic graphs with expressiveness at least as powerful as the Weisfeiler-Lehman test. Through extensive evaluation across nine benchmark datasets, including molecular and protein structures, LGIN consistently outperforms or matches state-of-the-art GNNs, demonstrating its robustness and efficacy in modeling complex graph structures. To the best of our knowledge, this is the first study to extend the concept of a powerful graph neural network to Riemannian manifolds, paving the way for future advancements in hyperbolic graph learning. The code for our paper can be found at this https URL.
[AI-52] Data-driven Power Loss Identification through Physics-Based Thermal Model Backpropagation
Link: https://arxiv.org/abs/2504.00133
Authors: Mattia Scarpa, Francesco Pase, Ruggero Carli, Mattia Bruschetta, Franscesco Toso
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Accepted by European Control Conference (ECC) 2020, 8 pages, 7 figures
Abstract:Digital twins for power electronics require accurate power losses whose direct measurements are often impractical or impossible in real-world applications. This paper presents a novel hybrid framework that combines physics-based thermal modeling with data-driven techniques to identify and correct power losses accurately using only temperature measurements. Our approach leverages a cascaded architecture where a neural network learns to correct the outputs of a nominal power loss model by backpropagating through a reduced-order thermal model. We explore two neural architectures, a bootstrapped feedforward network and a recurrent neural network, demonstrating that the bootstrapped feedforward approach achieves superior performance while maintaining computational efficiency for real-time applications. At the interconnection between the two models, we include normalization strategies and physics-guided training loss functions to preserve stability and ensure physical consistency. Experimental results show that our hybrid model reduces both temperature estimation errors (from 7.2±6.8°C to 0.3±0.3°C) and power loss prediction errors (from 5.4±6.6W to 0.2±0.3W) compared to traditional physics-based approaches, even in the presence of thermal model uncertainties. This methodology allows us to accurately estimate power losses without direct measurements, making it particularly helpful for real-time industrial applications where sensor placement is hindered by cost and physical limitations.
[AI-53] Times2D: Multi-Period Decomposition and Derivative Mapping for General Time Series Forecasting AAAI2025
Link: https://arxiv.org/abs/2504.00118
Authors: Reza Nematirad, Anil Pahwa, Balasubramaniam Natarajan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Accepted at the AAAI 2025 Conference on Artificial Intelligence
Abstract:Time series forecasting is an important application in various domains such as energy management, traffic planning, financial markets, meteorology, and medicine. However, real-time series data often present intricate temporal variability and sharp fluctuations, which pose significant challenges for time series forecasting. Previous models that rely on 1D time series representations usually struggle with complex temporal variations. To address the limitations of 1D time series, this study introduces the Times2D method that transforms the 1D time series into 2D space. Times2D consists of three main parts: first, a Periodic Decomposition Block (PDB) that captures temporal variations within a period and between the same periods by converting the time series into a 2D tensor in the frequency domain. Second, the First and Second Derivative Heatmaps (FSDH) capture sharp changes and turning points, respectively. Finally, an Aggregation Forecasting Block (AFB) integrates the output tensors from PDB and FSDH for accurate forecasting. This 2D transformation enables the utilization of 2D convolutional operations to effectively capture long and short characteristics of the time series. Comprehensive experimental results across large-scale data in the literature demonstrate that the proposed Times2D model achieves state-of-the-art performance in both short-term and long-term forecasting. The code is available in this repository: this https URL.
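As a rough illustration of the 1D-to-2D idea described above, the sketch below estimates a dominant period from a naive discrete Fourier transform, folds the series into one row per period, and computes first and second differences. It is an assumption-laden toy, not the Times2D code: the helper names are hypothetical, and the paper's PDB and FSDH blocks are likely more elaborate.

```python
import cmath
import math

def dominant_period(series):
    """Estimate the dominant period from the largest non-DC DFT magnitude."""
    n = len(series)
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(series)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return round(n / best_k)

def fold(series, period):
    """Reshape a 1D series into rows of one period each (remainder dropped)."""
    rows = len(series) // period
    return [series[r * period:(r + 1) * period] for r in range(rows)]

def diff(series):
    """Discrete derivative: differences between consecutive samples."""
    return [b - a for a, b in zip(series, series[1:])]

# Synthetic sine wave with a known period of 8 samples.
series = [math.sin(2 * math.pi * t / 8) for t in range(32)]

period = dominant_period(series)   # rows = periods, columns = phase in period
grid = fold(series, period)
first = diff(series)               # highlights sharp changes
second = diff(first)               # highlights turning points
print(period, len(grid), len(grid[0]))
```

In the folded grid, moving along a row follows variation within a period, and moving down a column compares the same phase across periods, which is what makes 2D convolutions applicable.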
[AI-54] Assessing Code Understanding in LLMs
Link: https://arxiv.org/abs/2504.00065
Authors: Cosimo Laneve, Alvise Spanò, Dalila Ressi, Sabina Rossi, Michele Bugliesi
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*Comments: 22 pages, 7 tables, submitted at FORTE 2025
Abstract:We present an empirical evaluation of Large Language Models in code understanding associated with non-trivial, semantic-preserving program transformations such as copy propagation or constant folding. Our findings show that LLMs fail to judge semantic equivalence in approximately 41% of cases when no context is provided and in 29% when given a simple generic context. To improve accuracy, we advocate integrating LLMs with code-optimization tools to enhance training and facilitate more robust program understanding.
[AI-55] The Axiom-Based Atlas: A Structural Mapping of Theorems via Foundational Proof Vectors
Link: https://arxiv.org/abs/2504.00063
Authors: Harim Yoo
Categories: Artificial Intelligence (cs.AI); Logic (math.LO)
*Comments:
Abstract:The Axiom-Based Atlas is a novel framework that structurally represents mathematical theorems as proof vectors over foundational axiom systems. By mapping the logical dependencies of theorems onto vectors indexed by axioms - such as those from Hilbert geometry, Peano arithmetic, or ZFC - we offer a new way to visualize, compare, and analyze mathematical knowledge. This vector-based formalism not only captures the logical foundation of theorems but also enables quantitative similarity metrics - such as cosine distance - between mathematical results, offering a new analytic layer for structural comparison. Using heatmaps, vector clustering, and AI-assisted modeling, this atlas enables the grouping of theorems by logical structure, not just by mathematical domain. We also introduce a prototype assistant (Atlas-GPT) that interprets natural language theorems and suggests likely proof vectors, supporting future applications in automated reasoning, mathematical education, and formal verification. This direction is partially inspired by Terence Tao’s recent reflections on the convergence of symbolic and structural mathematics. The Axiom-Based Atlas aims to provide a scalable, interpretable model of mathematical reasoning that is both human-readable and AI-compatible, contributing to the future landscape of formal mathematical systems.
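To make the proof-vector idea concrete, here is a minimal sketch of comparing theorems by cosine similarity (cosine distance is one minus this value). The axiom index and the three "theorems" are invented placeholders rather than data from the paper, and the binary encoding is an assumption; the paper may weight axioms differently.

```python
import math

# Hypothetical axiom index; the vector positions follow this order.
AXIOMS = ["extensionality", "pairing", "union", "infinity", "choice"]

def proof_vector(used_axioms):
    """Binary vector marking which axioms a proof depends on."""
    return [1.0 if axiom in used_axioms else 0.0 for axiom in AXIOMS]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Placeholder theorems with made-up axiom dependencies.
thm_a = proof_vector({"extensionality", "pairing", "choice"})
thm_b = proof_vector({"extensionality", "pairing", "union"})
thm_c = proof_vector({"infinity"})

sim_ab = cosine_similarity(thm_a, thm_b)  # two shared axioms out of three each
sim_ac = cosine_similarity(thm_a, thm_c)  # no shared axioms
print(round(sim_ab, 3), sim_ac)
```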
[AI-56] GAL-MAD: Towards Explainable Anomaly Detection in Microservice Applications Using Graph Attention Networks
Link: https://arxiv.org/abs/2504.00058
Authors: Lahiru Akmeemana, Chamodya Attanayake, Husni Faiz, Sandareka Wickramanayake
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 14 pages, preprint, 10 figures
Abstract:The transition to microservices has revolutionized software architectures, offering enhanced scalability and modularity. However, the distributed and dynamic nature of microservices introduces complexities in ensuring system reliability, making anomaly detection crucial for maintaining performance and functionality. Anomalies stemming from network and performance issues must be swiftly identified and addressed. Existing anomaly detection techniques often rely on statistical models or machine learning methods that struggle with the high-dimensional, interdependent data inherent in microservice applications. Current techniques and available datasets predominantly focus on system traces and logs, limiting their ability to support advanced detection models. This paper addresses these gaps by introducing the RS-Anomic dataset generated using the open-source RobotShop microservice application. The dataset captures multivariate performance metrics and response times under normal and anomalous conditions, encompassing ten types of anomalies. We propose a novel anomaly detection model called Graph Attention and LSTM-based Microservice Anomaly Detection (GAL-MAD), leveraging Graph Attention and Long Short-Term Memory architectures to capture spatial and temporal dependencies in microservices. We utilize SHAP values to localize anomalous services and identify root causes to enhance explainability. Experimental results demonstrate that GAL-MAD outperforms state-of-the-art models on the RS-Anomic dataset, achieving higher accuracy and recall across varying anomaly rates. The explanations provide actionable insights into service anomalies, which benefits system administrators.
[AI-57] Revisiting the Relationship between Adversarial and Clean Training: Why Clean Training Can Make Adversarial Training Better
Link: https://arxiv.org/abs/2504.00038
Authors: MingWei Zhou, Xiaobing Pei
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:
Abstract:Adversarial training (AT) is an effective technique for enhancing adversarial robustness, but it usually comes at the cost of a decline in generalization ability. Recent studies have attempted to use clean training to assist adversarial training, yet there are contradictions among the conclusions. We comprehensively summarize the representative strategies and, with a focus on the multi-view hypothesis, provide a unified explanation for the contradictory phenomena among different studies. In addition, we conduct an in-depth analysis of the knowledge combinations transferred from clean-trained models to adversarially-trained models in previous studies, and find that they can be divided into two categories: reducing the learning difficulty and providing correct guidance. Based on this finding, we propose a new idea of leveraging clean training to further improve the performance of advanced AT methods. We reveal that the problem of generalization degradation faced by AT partly stems from the difficulty of adversarial training in learning certain sample features, and this problem can be alleviated by making full use of clean training.
[AI-58] MiZero: The Shadowy Defender Against Text Style Infringements
Link: https://arxiv.org/abs/2504.00035
Authors: Ziwei Zhang, Juan Wen, Wanli Peng, Zhengxian Wu, Yinghan Zhou, Yiming Xue
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*Comments:
Abstract:In-Context Learning (ICL) and efficient fine-tuning methods have significantly enhanced the efficiency of applying Large Language Models (LLMs) to downstream tasks. However, they also raise concerns about the imitation and infringement of personal creative data. Current methods for data copyright protection primarily focus on content security but lack effectiveness in protecting the copyrights of text styles. In this paper, we introduce a novel implicit zero-watermarking scheme, namely MiZero. This scheme establishes a precise watermark domain to protect the copyrighted style, surpassing traditional watermarking methods that distort the style characteristics. Specifically, we employ LLMs to extract condensed-lists utilizing the designed instance delimitation mechanism. These lists guide MiZero in generating the watermark. Extensive experiments demonstrate that MiZero effectively verifies text style copyright ownership against AI imitation.
[AI-59] Generating Structured Plan Representation of Procedures with LLMs
Link: https://arxiv.org/abs/2504.00029
Authors: Deepeka Garg, Sihan Zeng, Sumitra Ganesh, Leo Ardon
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*Comments:
Abstract:In this paper, we address the challenges of managing Standard Operating Procedures (SOPs), which often suffer from inconsistencies in language, format, and execution, leading to operational inefficiencies. Traditional process modeling demands significant manual effort, domain expertise, and familiarity with complex languages like Business Process Modeling Notation (BPMN), creating barriers for non-technical users. We introduce SOP Structuring (SOPStruct), a novel approach that leverages Large Language Models (LLMs) to transform SOPs into decision-tree-based structured representations. SOPStruct produces a standardized representation of SOPs across different domains, reduces cognitive load, and improves user comprehension by effectively capturing task dependencies and ensuring sequential integrity. Our approach enables leveraging the structured information to automate workflows as well as to empower human users. By organizing procedures into logical graphs, SOPStruct facilitates backtracking and error correction, offering a scalable solution for process optimization. We employ a novel evaluation framework, combining deterministic methods with the Planning Domain Definition Language (PDDL) to verify graph soundness, and non-deterministic assessment by an LLM to ensure completeness. We empirically validate the robustness of our LLM-based structured SOP representation methodology across SOPs from different domains and varying levels of complexity. Despite the current lack of automation readiness in many organizations, our research highlights the transformative potential of LLMs to streamline process modeling, paving the way for future advancements in automated procedure optimization.
[AI-60] Tensor Generalized Approximate Message Passing
Link: https://arxiv.org/abs/2504.00008
Authors: Yinchuan Li, Guangchen Lan, Xiaodong Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*Comments:
Abstract:We propose a tensor generalized approximate message passing (TeG-AMP) algorithm for low-rank tensor inference, which can be used to solve tensor completion and decomposition problems. We derive the TeG-AMP algorithm as an approximation of the sum-product belief propagation algorithm in high dimensions where the central limit theorem and Taylor series approximations are applicable. As TeG-AMP is developed based on a general TR decomposition model, it can be directly applied to many low-rank tensor types. Moreover, TeG-AMP can be simplified based on the CP decomposition model, and a tensor simplified AMP is proposed for low CP-rank tensor inference problems. Experimental results demonstrate that the proposed methods significantly improve recovery performance since they take full advantage of tensor structures.
[AI-61] Are We There Yet? A Measurement Study of Efficiency for LLM Applications on Mobile Devices
Link: https://arxiv.org/abs/2504.00002
Authors: Xiao Yan, Yi Ding
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Networking and Internet Architecture (cs.NI)
*Comments:
Abstract:Recent advancements in large language models (LLMs) have prompted interest in deploying these models on mobile devices to enable new applications without relying on cloud connectivity. However, the efficiency constraints of deploying LLMs on resource-limited devices present significant challenges. In this paper, we conduct a comprehensive measurement study to evaluate the efficiency tradeoffs between mobile-based, edge-based, and cloud-based deployments for LLM applications. We implement AutoLife-Lite, a simplified LLM-based application that analyzes smartphone sensor data to infer user location and activity contexts. Our experiments reveal that: (1) Only small-size LLMs (4B parameters) can run successfully on powerful mobile devices, though they exhibit quality limitations compared to larger models; (2) Model compression is effective in lowering hardware requirements, but may lead to significant performance degradation; (3) The latency to run LLMs on mobile devices with meaningful output is significant (30 seconds), while cloud services demonstrate better time efficiency (10 seconds); (4) Edge deployments offer intermediate tradeoffs between latency and model capabilities, with different results on CPU-based and GPU-based settings. These findings provide valuable insights for system designers on the current limitations and future directions for on-device LLM applications.
[AI-62] MedPix 2.0: A Comprehensive Multimodal Biomedical Data set for Advanced AI Applications with Retrieval Augmented Generation and Knowledge Graphs
Link: https://arxiv.org/abs/2407.02994
Authors: Irene Siragusa, Salvatore Contino, Massimo La Ciura, Rosario Alicata, Roberto Pirrone
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:
Abstract:The increasing interest in developing Artificial Intelligence applications in the medical domain suffers from the lack of high-quality data sets, mainly due to privacy-related issues. In addition, the recent increase in large multimodal models (LMM) leads to the need for multimodal medical data sets, where clinical reports and findings are attached to the corresponding CT or MRI scans. This paper illustrates the entire workflow for building the MedPix 2.0 data set. Starting with the well-known multimodal data set MedPix®, mainly used by physicians, nurses, and healthcare students for Continuing Medical Education purposes, a semi-automatic pipeline was developed to extract visual and textual data, followed by a manual curation procedure in which noisy samples were removed, thus creating a MongoDB database. Along with the data set, we developed a GUI aimed at efficiently navigating the MongoDB instance and obtaining the raw data that can be easily used for training and/or fine-tuning LMMs. To support this point, in this work, we first recall DR-Minerva, a RAG-based LMM trained using MedPix 2.0. DR-Minerva predicts the body part and the modality used to scan its input image. We also propose the extension of DR-Minerva with a Knowledge Graph that uses Llama 3.1 Instruct 8B, and leverages MedPix 2.0. The resulting architecture can be queried in an end-to-end manner, as a medical decision support system. MedPix 2.0 is available on GitHub at this https URL.
[AI-63] Resource Allocation for RIS-Assisted CoMP-NOMA Networks using Reinforcement Learning
Link: https://arxiv.org/abs/2504.00975
Authors: Muhammad Umer, Muhammad Ahmed Mohsin, Huma Ghafoor, Syed Ali Hassan
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
*Comments:
Abstract:This thesis delves into the forefront of wireless communication by exploring the synergistic integration of three transformative technologies: STAR-RIS, CoMP, and NOMA. Driven by the ever-increasing demand for higher data rates, improved spectral efficiency, and expanded coverage in the evolving landscape of 6G development, this research investigates the potential of these technologies to revolutionize future wireless networks. The thesis analyzes the performance gains achievable through strategic deployment of STAR-RIS, focusing on mitigating inter-cell interference, enhancing signal strength, and extending coverage to cell-edge users. Resource sharing strategies for STAR-RIS elements are explored, optimizing both transmission and reflection functionalities. Analytical frameworks are developed to quantify the benefits of STAR-RIS assisted CoMP-NOMA networks under realistic channel conditions, deriving key performance metrics such as ergodic rates and outage probabilities. Additionally, the research delves into energy-efficient design approaches for CoMP-NOMA networks incorporating RIS, proposing novel RIS configurations and optimization algorithms to achieve a balance between performance and energy consumption. Furthermore, the application of Deep Reinforcement Learning (DRL) techniques for intelligent and adaptive optimization in aerial RIS-assisted CoMP-NOMA networks is explored, aiming to maximize network sum rate while meeting user quality of service requirements. Through a comprehensive investigation of these technologies and their synergistic potential, this thesis contributes valuable insights into the future of wireless communication, paving the way for the development of more efficient, reliable, and sustainable networks capable of meeting the demands of our increasingly connected world. 
[AI-64] Science Autonomy using Machine Learning for Astrobiology
Link: https://arxiv.org/abs/2504.00709
Authors: Victoria Da Poian, Bethany Theiling, Eric Lyness, David Burtt, Abigail R. Azari, Joey Pasterski, Luoth Chou, Melissa Trainer, Ryan Danell, Desmond Kaplan, Xiang Li, Lily Clough, Brett McKinney, Lukas Mandrake, Bill Diamond, Caroline Freissinet
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 8 pages (expanded citations compared to 5 page submitted version for DARES white papers), a white paper for the 2025 NASA Decadal Astrobiology Research and Exploration Strategy (DARES)
Abstract:In recent decades, artificial intelligence (AI), including machine learning (ML), has become vital for space missions, enabling rapid data processing, advanced pattern recognition, and enhanced insight extraction. These tools are especially valuable in astrobiology applications, where models must distinguish biotic patterns from complex abiotic backgrounds. Advancing the integration of autonomy through AI and ML into space missions is a complex challenge, and we believe that by focusing on key areas, we can make significant progress and offer practical recommendations for tackling these obstacles.
[AI-65] CNOT-Optimal Clifford Synthesis as SAT
Link: https://arxiv.org/abs/2504.00634
Authors: Irfansha Shaik, Jaco van de Pol
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*Comments: 27 pages (16 main text, rest references and appendix), 15 Tables, 3 Figures, 2 Algorithms
Abstract:Clifford circuit optimization is an important step in the quantum compilation pipeline. Major compilers employ heuristic approaches. While they are fast, their results are often suboptimal. Minimization of noisy gates, like 2-qubit CNOT gates, is crucial for practical computing. Exact approaches have been proposed to fill the gap left by heuristic approaches. Among these are SAT based approaches that optimize gate count or depth, but they suffer from scalability issues. Further, they do not guarantee optimality on more important metrics like CNOT count or CNOT depth. A recent work proposed an exhaustive search only on Clifford circuits in a certain normal form to guarantee CNOT count optimality. But an exhaustive approach cannot scale beyond 6 qubits. In this paper, we incorporate search restricted to Clifford normal forms in a SAT encoding to guarantee CNOT count optimality. By allowing parallel plans, we propose a second SAT encoding that optimizes CNOT depth. By taking advantage of flexibility in SAT based approaches, we also handle connectivity restrictions in hardware platforms, and allow for qubit relabeling. We have implemented the above encodings and variations in our open source tool Q-Synth. In experiments, our encodings significantly outperform existing SAT approaches on random Clifford circuits. We consider practical VQE and Feynman benchmarks to compare with TKET and Qiskit compilers. In all-to-all connectivity, we observe reductions up to 32.1% in CNOT count and 48.1% in CNOT depth. Overall, we observe better results than TKET in the CNOT count and depth. We also experiment with connectivity restrictions of major quantum platforms. Compared to Qiskit, we observe up to 30.3% CNOT count and 35.9% CNOT depth further reduction. 
[AI-66] Improving Diseases Predictions Utilizing External Bio-Banks
Link: https://arxiv.org/abs/2504.00036
Authors: Hido Pinto, Eran Segal
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:
Abstract:Machine learning has been successfully used in critical domains, such as medicine. However, extracting meaningful insights from biomedical data is often constrained by the lack of available disease labels. In this research, we demonstrate how machine learning can be leveraged to enhance explainability and uncover biologically meaningful associations, even when predictive improvements in disease modeling are limited. We train LightGBM models from scratch on our dataset (10K) to impute metabolomics features and apply them to the UK Biobank (UKBB) for downstream analysis. The imputed metabolomics features are then used in survival analysis to assess their impact on disease-related risk factors. As a result, our approach successfully identified biologically relevant connections that were not previously known to the predictive models. Additionally, we applied a genome-wide association study (GWAS) on key metabolomics features, revealing a link between vascular dementia and smoking. Although this is a well-established epidemiological relationship, the link was not embedded in the model’s training data, which validated the method’s ability to extract meaningful signals. Furthermore, by integrating survival models as inputs in the 10K data, we uncovered associations between metabolic substances and obesity, demonstrating the ability to infer disease risk for future patients without requiring direct outcome labels. These findings highlight the potential of leveraging external bio-banks to extract valuable biomedical insights, even in data-limited scenarios. Our results demonstrate that machine learning models trained on smaller datasets can still be used to uncover real biological associations when carefully integrated with survival analysis and genetic studies.
[AI-67] A multi-locus predictiveness curve and its summary assessment for genetic risk prediction
Link: https://arxiv.org/abs/2504.00024
Authors: Changshuai Wei, Ming Li, Yalu Wen, Chengyin Ye, Qing Lu
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:
[AI-68] Celler: A Genomic Language Model for Long-Tailed Single-Cell Annotation
Link: https://arxiv.org/abs/2504.00020
Authors: Huan Zhao, Yiming Liu, Jina Yao, Ling Xiong, Zexin Zhou, Zixing Zhang
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:
Abstract:Recent breakthroughs in single-cell technology have ushered in unparalleled opportunities to decode the molecular intricacy of complex biological systems, especially those linked to diseases unique to humans. However, these advances have also introduced novel obstacles, specifically the efficient annotation of extensive, long-tailed single-cell data pertaining to disease conditions. To effectively surmount this challenge, we introduce Celler, a state-of-the-art generative pre-training model crafted specifically for the annotation of single-cell data. Celler incorporates two groundbreaking elements. First, we introduce the Gaussian Inflation (GInf) Loss function. By dynamically adjusting sample weights, GInf Loss significantly enhances the model’s ability to learn from rare categories while reducing the risk of overfitting for common categories. Second, we introduce an innovative Hard Data Mining (HDM) strategy into the training process, specifically targeting the challenging-to-learn minority data samples, which significantly improves the model’s predictive accuracy. Additionally, to further advance research in this field, we have constructed a large-scale single-cell dataset: Celler-75, which encompasses 40 million cells distributed across 80 human tissues and 75 specific diseases. This dataset provides critical support for comprehensively exploring the potential of single-cell technology in disease research. Our code is available at this https URL.
[AI-69] Deep Learning-Based Hypoglycemia Classification Across Multiple Prediction Horizons
Link: https://arxiv.org/abs/2504.00009
Authors: Beyza Cinar, Jennifer Daniel Onwuchekwa, Maria Maleshkova
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
*Comments:
Abstract:Type 1 diabetes (T1D) management can be significantly enhanced through the use of predictive machine learning (ML) algorithms, which can mitigate the risk of adverse events like hypoglycemia. Hypoglycemia, characterized by blood glucose levels below 70 mg/dL, is a life-threatening condition typically caused by excessive insulin administration, missed meals, or physical activity. Its asymptomatic nature impedes timely intervention, making ML models crucial for early detection. This study integrates short- (up to 2h) and long-term (up to 24h) prediction horizons (PHs) within a single classification model to enhance decision support. The predicted times are 5-15 min, 15-30 min, 30 min-1h, 1-2h, 2-4h, 4-8h, 8-12h, and 12-24h before hypoglycemia. In addition, a simplified model classifying up to 4h before hypoglycemia is compared. We trained ResNet and LSTM models on glucose levels, insulin doses, and acceleration data. The results demonstrate the superiority of the LSTM models when classifying nine classes. In particular, subject-specific models yielded better performance but achieved high recall only for classes 0, 1, and 2 with 98%, 72%, and 50%, respectively. A population-based six-class model improved the results with at least 60% of events detected. In contrast, longer PHs remain challenging with the current approach and may be considered with different models.
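The prediction horizons listed in the abstract can be turned into labels with a simple binning function. The sketch below is an assumption, not the study's preprocessing: how boundaries are assigned (lower edge inclusive), how lead times outside 5 min to 24 h are handled, and the function names are all hypothetical, and the paper's class numbering is not reproduced.

```python
from bisect import bisect_right

HYPO_THRESHOLD_MG_DL = 70  # threshold stated in the abstract

# Bin edges in minutes for the eight horizons named in the abstract.
EDGES_MIN = [5, 15, 30, 60, 120, 240, 480, 720, 1440]
LABELS = ["5-15 min", "15-30 min", "30 min-1h", "1-2h",
          "2-4h", "4-8h", "8-12h", "12-24h"]

def is_hypoglycemic(glucose_mg_dl):
    """True when a glucose reading is below the 70 mg/dL threshold."""
    return glucose_mg_dl < HYPO_THRESHOLD_MG_DL

def horizon_label(minutes_to_event):
    """Map time until the next hypoglycemic event to a horizon label.

    Returns None when the lead time falls outside the listed horizons
    (assumed handling; the paper may treat these cases differently).
    """
    if minutes_to_event < EDGES_MIN[0] or minutes_to_event >= EDGES_MIN[-1]:
        return None
    return LABELS[bisect_right(EDGES_MIN, minutes_to_event) - 1]

print(is_hypoglycemic(65), horizon_label(10), horizon_label(90))
```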
Machine Learning
[LG-0] Data-Driven Safety Verification using Barrier Certificates and Matrix Zonotopes
Link: https://arxiv.org/abs/2504.01007
Authors: Mohammed Adib Oumer, Amr Alanwar, Majid Zamani
Categories: Systems and Control (eess.SY); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments: Submitted to CDC 2025
Abstract:Ensuring safety in cyber-physical systems (CPSs) is a critical challenge, especially when system models are difficult to obtain or cannot be fully trusted due to uncertainty, modeling errors, or environmental disturbances. Traditional model-based approaches rely on precise system dynamics, which may not be available in real-world scenarios. To address this, we propose a data-driven safety verification framework that leverages matrix zonotopes and barrier certificates to verify system safety directly from noisy data. Instead of trusting a single unreliable model, we construct a set of models that capture all possible system dynamics that align with the observed data, ensuring that the true system model is always contained within this set. This model set is compactly represented using matrix zonotopes, enabling efficient computation and propagation of uncertainty. By integrating this representation into a barrier certificate framework, we establish rigorous safety guarantees without requiring an explicit system model. Numerical experiments demonstrate the effectiveness of our approach in verifying safety for dynamical systems with unknown models, showcasing its potential for real-world CPS applications.
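To make the set representation concrete: a matrix zonotope is commonly defined as a center matrix plus generator matrices scaled by coefficients in [-1, 1]. The sketch below computes entry-wise bounds of such a set and one concrete member. The function names and the example values are hypothetical, and the paper's procedure for building the set from noisy data is not reproduced here.

```python
def interval_hull(center, generators):
    """Entry-wise bounds of {C + sum_i b_i * G_i : -1 <= b_i <= 1}."""
    rows, cols = len(center), len(center[0])
    # Each entry can deviate from the center by at most the sum of
    # absolute generator entries at that position.
    radius = [[sum(abs(g[r][c]) for g in generators) for c in range(cols)]
              for r in range(rows)]
    lower = [[center[r][c] - radius[r][c] for c in range(cols)]
             for r in range(rows)]
    upper = [[center[r][c] + radius[r][c] for c in range(cols)]
             for r in range(rows)]
    return lower, upper


def member(center, generators, coeffs):
    """One concrete matrix of the set for coefficients b_i in [-1, 1]."""
    if any(abs(b) > 1 for b in coeffs):
        raise ValueError("coefficients must lie in [-1, 1]")
    rows, cols = len(center), len(center[0])
    return [[center[r][c] + sum(b * g[r][c] for b, g in zip(coeffs, generators))
             for c in range(cols)] for r in range(rows)]
```

The interval hull over-approximates the zonotope, so it is useful as a cheap sanity check that a candidate model lies inside the data-consistent set.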
[LG-1] CFIRE: A General Method for Combining Local Explanations
Link: https://arxiv.org/abs/2504.00930
Authors: Sebastian Müller, Vanessa Toborek, Tamás Horváth, Christian Bauckhage
Categories: Machine Learning (cs.LG)
Abstract:We propose a novel eXplainable AI algorithm to compute faithful, easy-to-understand, and complete global decision rules from local explanations for tabular data by combining XAI methods with closed frequent itemset mining. Our method can be used with any local explainer that indicates which dimensions are important for a given sample for a given black-box decision. This property allows our algorithm to choose among different local explainers, addressing the disagreement problem, i.e., the observation that no single explanation method consistently outperforms others across models and datasets. Unlike usual experimental methodology, our evaluation also accounts for the Rashomon effect in model explainability. To this end, we demonstrate the robustness of our approach in finding suitable rules for nearly all of the 700 black-box models we considered across 14 benchmark datasets. The results also show that our method exhibits improved runtime, high precision and F1-score while generating compact and complete rules.
[LG-2] Benchmarking Federated Machine Unlearning methods for Tabular Data
Link: https://arxiv.org/abs/2504.00921
Authors: Chenguang Xiao, Abhirup Ghosh, Han Wu, Shuo Wang, Diederick van Thiel
Categories: Machine Learning (cs.LG)
Abstract:Machine unlearning, which enables a model to forget specific data upon request, is increasingly relevant in the era of privacy-centric machine learning, particularly within federated learning (FL) environments. This paper presents a pioneering study on benchmarking machine unlearning methods within a federated setting for tabular data, addressing the unique challenges posed by cross-silo FL where data privacy and communication efficiency are paramount. We explore unlearning at the feature and instance levels, employing two machine learning models: random forest and logistic regression. Our methodology benchmarks various unlearning algorithms, including fine-tuning and gradient-based approaches, across multiple datasets, with metrics focused on fidelity, certifiability, and computational efficiency. Experiments demonstrate that while fidelity remains high across methods, tree-based models excel in certifiability, ensuring exact unlearning, whereas gradient-based methods show improved computational efficiency. This study provides critical insights into the design and selection of unlearning algorithms tailored to the FL environment, offering a foundation for further research in privacy-preserving machine learning.
[LG-3] Provably accurate adaptive sampling for collocation points in physics-informed neural networks
Link: https://arxiv.org/abs/2504.00910
Authors: Antoine Caradot, Rémi Emonet, Amaury Habrard, Abdel-Rahim Mezidi, Marc Sebban
Categories: Machine Learning (cs.LG)
Comments: 20 pages. Comments are welcome
Abstract:Despite considerable scientific advances in numerical simulation, efficiently solving PDEs remains a complex and often expensive problem. Physics-informed Neural Networks (PINNs) have emerged as an efficient way to learn surrogate solvers by embedding the PDE in the loss function and minimizing its residuals using automatic differentiation at so-called collocation points. These points were originally sampled uniformly, but their selection has been the subject of recent advances leading to adaptive sampling refinements for PINNs. In this paper, leveraging a new quadrature method for approximating definite integrals, we introduce a provably accurate sampling method for collocation points based on the Hessian of the PDE residuals. Comparative experiments conducted on a set of 1D and 2D PDEs demonstrate the benefits of our method.
[LG-4] Explorable INR: An Implicit Neural Representation for Ensemble Simulation Enabling Efficient Spatial and Parameter Exploration
Link: https://arxiv.org/abs/2504.00904
Authors: Yi-Tang Chen, Haoyu Li, Neng Shi, Xihaier Luo, Wei Xu, Han-Wei Shen
Categories: Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Accepted by IEEE Transactions on Visualization and Computer Graphics (TVCG)
Abstract:With the growing computational power available for high-resolution ensemble simulations in scientific fields such as cosmology and oceanology, storage and computational demands present significant challenges. Current surrogate models fall short in the flexibility of point- or region-based predictions as the entire field reconstruction is required for each parameter setting, hence hindering the efficiency of parameter space exploration. Limitations exist in capturing physical attribute distributions and pinpointing optimal parameter configurations. In this work, we propose Explorable INR, a novel implicit neural representation-based surrogate model, designed to facilitate exploration and allow point-based spatial queries without computing full-scale field data. In addition, to further address computational bottlenecks of spatial exploration, we utilize probabilistic affine forms (PAFs) for uncertainty propagation through Explorable INR to obtain statistical summaries, facilitating various ensemble analysis and visualization tasks that are expensive with existing models. Furthermore, we reformulate the parameter exploration problem as optimization tasks using gradient descent and KL divergence minimization that ensures scalability. We demonstrate that the Explorable INR with the proposed approach for spatial and parameter exploration can significantly reduce computation and memory costs while providing effective ensemble analysis.
[LG-5] Detection of Anomalous Vehicular Traffic and Sensor Failures Using Data Clustering Techniques
Link: https://arxiv.org/abs/2504.00881
Authors: Davide Moretti, Elia Onofri, Emiliano Cristiani
Categories: Machine Learning (cs.LG)
Abstract:The increasing availability of traffic data from sensor networks has created new opportunities for understanding vehicular dynamics and identifying anomalies. In this study, we employ clustering techniques to analyse traffic flow data with the dual objective of uncovering meaningful traffic patterns and detecting anomalies, including sensor failures and irregular congestion events. We explore multiple clustering approaches, i.e., partitioning and hierarchical methods, combined with various time-series representations and similarity measures. Our methodology is applied to real-world data from highway sensors, enabling us to assess the impact of different clustering frameworks on traffic pattern recognition. We also introduce a clustering-driven anomaly detection methodology that identifies deviations from expected traffic behaviour based on distance-based anomaly scores. Results indicate that hierarchical clustering with symbolic representations provides robust segmentation of traffic patterns, while partitioning methods such as k-means and fuzzy c-means yield meaningful results when paired with Dynamic Time Warping. The proposed anomaly detection strategy successfully identifies sensor malfunctions and abnormal traffic conditions with minimal false positives, demonstrating its practical utility for real-time monitoring. Real-world vehicular traffic data are provided by Autostrade Alto Adriatico S.p.A.
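The distance-based anomaly score mentioned in the abstract can be sketched as the distance from a traffic profile to its nearest cluster centroid, with a threshold deciding what counts as anomalous. This is a minimal sketch under stated assumptions: plain Euclidean distance stands in for the similarity measures the authors study (such as Dynamic Time Warping), and the function names, centroids, and threshold are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def anomaly_score(series, centroids):
    """Distance from a profile to its nearest cluster centroid."""
    return min(euclidean(series, c) for c in centroids)


def flag_anomalies(profiles, centroids, threshold):
    """Mark profiles whose score exceeds a user-chosen threshold."""
    return [anomaly_score(p, centroids) > threshold for p in profiles]
```

A profile of all zeros, as a failed sensor might report, sits far from every centroid learned on normal traffic and is therefore flagged.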
[LG-6] P2NIA: Privacy-Preserving Non-Iterative Auditing
Link: https://arxiv.org/abs/2504.00874
Authors: Jade Garcia Bourrée, Hadrien Lautraite, Sébastien Gambs, Gilles Tredan, Erwan Le Merrer, Benoît Rottembourg
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 8 figures
Abstract:The emergence of AI legislation has increased the need to assess the ethical compliance of high-risk AI systems. Traditional auditing methods rely on platforms’ application programming interfaces (APIs), where responses to queries are examined through the lens of fairness requirements. However, such approaches put a significant burden on platforms, as they are forced to maintain APIs while ensuring privacy, facing the possibility of data leaks. This lack of proper collaboration between the two parties, in turn, causes a significant challenge to the auditor, who is subject to estimation bias as they are unaware of the data distribution of the platform. To address these two issues, we present P2NIA, a novel auditing scheme that proposes a mutually beneficial collaboration for both the auditor and the platform. Extensive experiments demonstrate P2NIA’s effectiveness in addressing both issues. In summary, our work introduces a privacy-preserving and non-iterative audit scheme that enhances fairness assessments using synthetic or local data, avoiding the challenges associated with traditional API-based audits.
[LG-7] Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems USENIX-SECURITY2025
Link: https://arxiv.org/abs/2504.00858
Authors: Weifei Jin, Yuxin Cao, Junjie Su, Derui Wang, Yedi Zhang, Minhui Xue, Jie Hao, Jin Song Dong, Yixian Yang
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted to USENIX Security 2025
Abstract:The widespread application of automatic speech recognition (ASR) supports large-scale voice surveillance, raising concerns about privacy among users. In this paper, we concentrate on using adversarial examples to mitigate the unauthorized disclosure of private speech to potential eavesdroppers in speech communications. While audio adversarial examples have demonstrated the capability to mislead ASR models or evade ASR surveillance, they are typically constructed through time-intensive offline optimization, restricting their practicality in real-time voice communication. Recent work overcame this limitation by generating universal adversarial perturbations (UAPs) and enhancing their transferability for black-box scenarios. However, these methods introduce excessive noise that significantly degrades audio quality and affects human perception, thereby limiting their effectiveness in practical scenarios. To address this limitation and protect live users’ speech against ASR systems, we propose a novel framework, AudioShield. Central to this framework is the concept of Transferable Universal Adversarial Perturbations in the Latent Space (LS-TUAP). By transferring the perturbations to the latent space, the audio quality is preserved to a large extent. Additionally, we propose target feature adaptation to enhance the transferability of UAPs by embedding target text features into the perturbations. Comprehensive evaluation on four commercial ASR APIs (Google, Amazon, iFlytek, and Alibaba), three voice assistants, two LLM-powered ASR systems, and one NN-based ASR system demonstrates the protection superiority of AudioShield over existing competitors, and both objective and subjective evaluations indicate that AudioShield significantly improves the audio quality. Moreover, AudioShield also shows high effectiveness in real-time end-to-end scenarios, and demonstrates strong resilience against adaptive countermeasures.
[LG-8] Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations
Link: https://arxiv.org/abs/2504.00851
Authors: Chongjie Si, Zhiyi Shi, Xuehui Wang, Yichen Xiao, Xiaokang Yang, Wei Shen
Categories: Machine Learning (cs.LG)
Abstract:Adapting pre-trained foundation models for diverse downstream tasks is a core practice in artificial intelligence. However, the wide range of tasks and high computational costs make full fine-tuning impractical. To overcome this, parameter-efficient fine-tuning (PEFT) methods like LoRA have emerged and are becoming a growing research focus. Despite the success of these methods, they are primarily designed for linear layers, focusing on two-dimensional matrices while largely ignoring higher-dimensional parameter spaces like convolutional kernels. Moreover, directly applying these methods to higher-dimensional parameter spaces often disrupts their structural relationships. Given the rapid advancements in matrix-based PEFT methods, rather than designing a specialized strategy, we propose a generalization that extends matrix-based PEFT methods to higher-dimensional parameter spaces without compromising their structural properties. Specifically, we treat parameters as elements of a Lie group, with updates modeled as perturbations in the corresponding Lie algebra. These perturbations are mapped back to the Lie group through the exponential map, ensuring smooth, consistent updates that preserve the inherent structure of the parameter space. Extensive experiments on computer vision and natural language processing validate the effectiveness and versatility of our approach, demonstrating clear improvements over existing methods.
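The idea of perturbing in the Lie algebra and mapping back with the exponential map can be illustrated in the simplest setting: nonzero real numbers under multiplication, applied element-wise, where the exponential map is the ordinary `exp`. The sketch below is an assumption-laden toy, not the authors' formulation: the choice of group, the low-rank factors `a` and `b`, and both function names are hypothetical, and a 2D weight is used only for brevity.

```python
import math


def low_rank_delta(a, b):
    """Perturbation delta = a @ b from low-rank factors (as in LoRA)."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))] for i in range(len(a))]


def exp_map_update(weights, delta):
    """Element-wise multiplicative update w * exp(delta).

    A zero perturbation leaves the weights unchanged, and every entry
    keeps its sign because exp(delta) is always positive.
    """
    return [[w * math.exp(d) for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weights, delta)]
```

Compared with an additive update, the multiplicative form scales each entry relative to its pre-trained value, which is one way to read the abstract's claim about preserving the structure of the parameter space.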
[LG-9] Logical perspectives on learning statistical objects
Link: https://arxiv.org/abs/2504.00847
Authors: Aaron Anderson, Michael Benedikt
Categories: Logic in Computer Science (cs.LO); Machine Learning (cs.LG); Logic (math.LO)
Abstract:We consider the relationship between learnability of a "base class" of functions on a set X and learnability of a class of statistical functions derived from the base class. For example, we refine results showing that learnability of a family of functions implies learnability of the family of functions mapping a function in the class to its expectation under a distribution. We will look at both Probably Approximately Correct (PAC) learning, where example inputs and outputs are chosen at random, and online learning, where the examples are chosen adversarially. We establish improved bounds on the sample complexity of learning for statistical classes, stated in terms of combinatorial dimensions of the base class. We do this by adapting techniques introduced in model theory for "randomizing a structure". We give particular attention to classes derived from logical formulas, and relate learnability of the statistical classes to properties of the formula. Finally, we provide bounds on the complexity of learning the statistical classes built on top of a logic-based hypothesis class.
[LG-10] Deep Generative Models: Complexity, Dimensionality, and Approximation
Link: https://arxiv.org/abs/2504.00820
Authors: Kevin Wang, Hongqian Niu, Yixin Wang, Didong Li
Categories: Machine Learning (cs.LG); Differential Geometry (math.DG); Machine Learning (stat.ML)
Abstract:Generative networks have shown remarkable success in learning complex data distributions, particularly in generating high-dimensional data from lower-dimensional inputs. While this capability is well-documented empirically, its theoretical underpinning remains unclear. One common theoretical explanation appeals to the widely accepted manifold hypothesis, which suggests that many real-world datasets, such as images and signals, often possess intrinsic low-dimensional geometric structures. Under this manifold hypothesis, it is widely believed that to approximate a distribution on a d-dimensional Riemannian manifold, the latent dimension needs to be at least d or d+1. In this work, we show that this requirement on the latent dimension is not necessary by demonstrating that generative networks can approximate distributions on d-dimensional Riemannian manifolds from inputs of any arbitrary dimension, even lower than d, taking inspiration from the concept of space-filling curves. This approach, in turn, leads to a super-exponential complexity bound of the deep neural networks through expanded neurons. Our findings thus challenge the conventional belief on the relationship between input dimensionality and the ability of generative networks to model data distributions. This novel insight not only corroborates the practical effectiveness of generative networks in handling complex data structures, but also underscores a critical trade-off between approximation error, dimensionality, and model complexity.
[LG-11] Mixture-of-Experts for Distributed Edge Computing with Channel-Aware Gating Function
Link: https://arxiv.org/abs/2504.00819
Authors: Qiuchen Song, Shusen Jing, Shuai Zhang, Songyang Zhang, Chuan Huang
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 6 figures, published at ICC 2025
Abstract:In a distributed mixture-of-experts (MoE) system, a server collaborates with multiple specialized expert clients to perform inference. The server extracts features from input data and dynamically selects experts based on their areas of specialization to produce the final output. Although MoE models are widely valued for their flexibility and performance benefits, adapting distributed MoEs to operate effectively in wireless networks has remained unexplored. In this work, we introduce a novel channel-aware gating function for wireless distributed MoE, which incorporates channel conditions into the MoE gating mechanism. To train the channel-aware gating, we simulate various signal-to-noise ratios (SNRs) for each expert’s communication channel and add noise to the features distributed to the experts based on these SNRs. The gating function then utilizes both features and SNRs to optimize expert selection. Unlike conventional MoE models which solely consider the alignment of features with the specializations of experts, our approach additionally considers the impact of channel conditions on expert performance. Experimental results demonstrate that the proposed channel-aware gating scheme outperforms traditional MoE models.
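The training setup described in the abstract, where features are corrupted according to each expert's SNR and the gate sees both features and SNRs, can be sketched as follows. This is only a toy under stated assumptions: the additive Gaussian noise model, the hand-written linear scoring rule, and the `snr_weight` parameter are hypothetical stand-ins for the learned gating function in the paper.

```python
import math
import random


def add_awgn(features, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power = SNR."""
    power = sum(x * x for x in features) / len(features)
    noise_std = math.sqrt(power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, noise_std) for x in features]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def channel_aware_gate(affinities, snrs_db, snr_weight=0.1):
    """Gate weights from feature affinity plus a channel-quality bonus.

    Assumption: a fixed linear combination replaces the learned gate.
    """
    return softmax([a + snr_weight * s for a, s in zip(affinities, snrs_db)])
```

With equal feature affinities, the expert behind the cleaner channel receives the larger gate weight, which is the behaviour a conventional feature-only gate cannot express.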
[LG-12] FeatInsight: An Online ML Feature Management System on 4Paradigm Sage-Studio Platform
Link: https://arxiv.org/abs/2504.00786
Authors: Xin Tong, Xuanhe Zhou, Bingsheng He, Guoliang Li, Zirui Tang, Wei Zhou, Fan Wu, Mian Lu, Yuqiang Chen
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Abstract:Feature management is essential for many online machine learning applications and can often become the performance bottleneck (e.g., taking up to 70% of the overall latency in sales prediction service). Improper feature configurations (e.g., introducing too many irrelevant features) can severely undermine the model’s generalization capabilities. However, managing online ML features is challenging due to (1) large-scale, complex raw data (e.g., the 2018 PHM dataset contains 17 tables and dozens to hundreds of columns), (2) the need for high-performance, consistent computation of interdependent features with complex patterns, and (3) the requirement for rapid updates and deployments to accommodate real-time data changes. In this demo, we present FeatInsight, a system that supports the entire feature lifecycle, including feature design, storage, visualization, computation, verification, and lineage management. FeatInsight (with OpenMLDB as the execution engine) has been deployed in over 100 real-world scenarios on 4Paradigm’s Sage Studio platform, handling up to a trillion-dimensional feature space and enabling millisecond-level feature updates. We demonstrate how FeatInsight enhances feature design efficiency (e.g., for online product recommendation) and improves feature computation performance (e.g., for online fraud detection). The code is available at this https URL.
[LG-13] TAMIS: Tailored Membership Inference Attacks on Synthetic Data
Link: https://arxiv.org/abs/2504.00758
Authors: Paul Andrey, Batiste Le Bars, Marc Tommasi
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Membership Inference Attacks (MIAs) make it possible to empirically assess the privacy of a machine learning algorithm. In this paper, we propose TAMIS, a novel MIA against differentially-private synthetic data generation methods that rely on graphical models. This attack builds upon MAMA-MIA, a recently-published state-of-the-art method. It lowers its computational cost and requires less attacker knowledge. Our attack is the product of a two-fold improvement. First, we recover the graphical model having generated a synthetic dataset by using solely that dataset, rather than shadow-modeling over an auxiliary one. This proves less costly and more performant. Second, we introduce a more mathematically-grounded attack score that provides a natural threshold for binary predictions. In our experiments, TAMIS achieves performance better than or similar to that of MAMA-MIA on replicas of the SNAKE challenge.
[LG-14] Integrating Fourier Neural Operators with Diffusion Models to improve Spectral Representation of Synthetic Earthquake Ground Motion Response
Link: https://arxiv.org/abs/2504.00757
Authors: Niccolò Perrone, Fanny Lehmann, Hugo Gabrielidis, Stefania Fresca, Filippo Gatti
Categories: Machine Learning (cs.LG)
Abstract:Nuclear reactor buildings must be designed to withstand the dynamic load induced by strong ground motion earthquakes. For this reason, their structural behavior must be assessed in multiple realistic ground shaking scenarios (e.g., the Maximum Credible Earthquake). However, earthquake catalogs and recorded seismograms may not always be available in the region of interest. Therefore, synthetic earthquake ground motion is progressively being employed, although with some due precautions: earthquake physics is sometimes not well enough understood to be accurately reproduced with numerical tools, and the underlying epistemic uncertainties lead to prohibitive computational costs related to model calibration. In this study, we propose an AI physics-based approach to generate synthetic ground motion, based on the combination of a neural operator that approximates the elastodynamics Green’s operator in arbitrary source-geology setups, enhanced by a denoising diffusion probabilistic model. The diffusion model is trained to correct the ground motion time series generated by the neural operator. Our results show that such an approach promisingly enhances the realism of the generated synthetic seismograms, with frequency biases and Goodness-Of-Fit (GOF) scores being improved by the diffusion model. This indicates that the latter is capable of mitigating the mid-frequency spectral falloff observed in the time series generated by the neural operator. Our method showcases fast and cheap inference in different site and source conditions.
[LG-15] Automated Feature Labeling with Token-Space Gradient Descent ICLR2025
Link: https://arxiv.org/abs/2504.00754
Authors: Julian Schulz, Seamus Fallows
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, Building Trust Workshop ICLR 2025
Abstract:We present a novel approach to feature labeling using gradient descent in token-space. While existing methods typically use language models to generate hypotheses about feature meanings, our method directly optimizes label representations by using a language model as a discriminator to predict feature activations. We formulate this as a multi-objective optimization problem in token-space, balancing prediction accuracy, entropy minimization, and linguistic naturalness. Our proof-of-concept experiments demonstrate successful convergence to interpretable single-token labels across diverse domains, including features for detecting animals, mammals, Chinese text, and numbers. Although our current implementation is constrained to single-token labels and relatively simple features, the results suggest that token-space gradient descent could become a valuable addition to the interpretability researcher’s toolkit.
[LG-16] C2AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction
Link: https://arxiv.org/abs/2504.00750
Authors: Wenxuan Wu, Xueyuan Chen, Shuai Wang, Jiadong Wang, Lingwei Meng, Xixin Wu, Helen Meng, Haizhou Li
Categories: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Accepted by IEEE Journal of Selected Topics in Signal Processing (JSTSP)
Abstract:Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues. Although numerous models have been proposed recently, most of them estimate target signals by primarily relying on local dependencies within acoustic features, underutilizing the human-like capacity to infer unclear parts of speech through contextual information. This limitation results in not only suboptimal performance but also inconsistent extraction quality across the utterance, with some segments exhibiting poor quality or inadequate suppression of interfering speakers. To close this gap, we propose a model-agnostic strategy called the Mask-And-Recover (MAR). It integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. Additionally, to better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model to assess extraction quality and guide extraction modules to emphasize improvement on low-quality segments. To validate the effectiveness of our proposed model-agnostic training paradigm, six popular AV-TSE backbones were adopted for evaluation on the VoxCeleb2 dataset, demonstrating consistent performance improvements across various metrics.
[LG-17] Detection of Disease on Nasal Breath Sound by New Lightweight Architecture: Using COVID-19 as An Example
Link: https://arxiv.org/abs/2504.00730
Authors: Jiayuan She, Lin Shi, Peiqi Li, Ziling Dong, Renxing Li, Shengkai Li, Liping Gu, Tong Zhao, Zhuochang Yang, Yajie Ji, Liang Feng, Jiangang Chen
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 5 figures, 6 tables
Abstract:Background. Infectious diseases, particularly COVID-19, continue to be a significant global health issue. Although many countries have reduced or stopped large-scale testing measures, the detection of such diseases remains a priority. Objective. This study aims to develop a novel, lightweight deep neural network for efficient, accurate, and cost-effective detection of COVID-19 using nasal breathing audio data collected via smartphones. Methodology. Nasal breathing audio from 128 patients diagnosed with the Omicron variant was collected. Mel-Frequency Cepstral Coefficients (MFCCs), a widely used feature in speech and sound analysis, were employed for extracting important characteristics from the audio signals. Additional feature selection was performed using Random Forest (RF) and Principal Component Analysis (PCA) for dimensionality reduction. A Dense-ReLU-Dropout model was trained with K-fold cross-validation (K=3), and performance metrics like accuracy, precision, recall, and F1-score were used to evaluate the model. Results. The proposed model achieved 97% accuracy in detecting COVID-19 from nasal breathing sounds, outperforming state-of-the-art methods such as those by [23] and [13]. Our Dense-ReLU-Dropout model, using RF and PCA for feature selection, achieves high accuracy with greater computational efficiency compared to existing methods that require more complex models or larger datasets. Conclusion. The findings suggest that the proposed method holds significant potential for clinical implementation, advancing smartphone-based diagnostics in infectious diseases. The Dense-ReLU-Dropout model, combined with innovative feature processing techniques, offers a promising approach for efficient and accurate COVID-19 detection, showcasing the capabilities of mobile device-based diagnostics.
[LG-18] EMO: Edge Model Overlays to Scale Model Size in Federated Learning
Link: https://arxiv.org/abs/2504.00726
Authors: Di Wu, Weibo He, Wanglei Feng, Zhenyu Wen, Bin Qian, Blesson Varghese
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Poster accepted at IEEE ICDCS 2025
Abstract:Federated Learning (FL) trains machine learning models on edge devices with distributed data. However, the computational and memory limitations of these devices restrict the training of large models using FL. Split Federated Learning (SFL) addresses this challenge by distributing the model across the device and server, but it introduces a tightly coupled data flow, leading to computational bottlenecks and high communication costs. We propose EMO as a solution to enable the training of large models in FL while mitigating the challenges of SFL. EMO introduces Edge Model Overlay(s) between the device and server, enabling the creation of a larger ensemble model without modifying the FL workflow. The key innovation in EMO is Augmented Federated Learning (AFL), which builds an ensemble model by connecting the original (smaller) FL model with model(s) trained in the overlay(s) to facilitate horizontal or vertical scaling. This is accomplished through three key modules: a hierarchical activation replay cache to decouple AFL from FL, a convergence-aware communication controller to optimize communication overhead, and an ensemble inference module. Evaluations on a real-world prototype show that EMO improves accuracy by up to 17.77% compared to FL, and reduces communication costs by up to 7.17x and decreases training time by up to 6.9x compared to SFL.
[LG-19] Alleviating Performance Disparity in Adversarial Spatiotemporal Graph Learning Under Zero-Inflated Distribution
Link: https://arxiv.org/abs/2504.00721
Authors: Songran Bai, Yuheng Ji, Yue Liu, Xingwei Zhang, Xiaolong Zheng, Daniel Dajun Zeng
Categories: Machine Learning (cs.LG)
Abstract:Spatiotemporal Graph Learning (SGL) under Zero-Inflated Distribution (ZID) is crucial for urban risk management tasks, including crime prediction and traffic accident profiling. However, SGL models are vulnerable to adversarial attacks, compromising their practical utility. While adversarial training (AT) has been widely used to bolster model robustness, our study finds that traditional AT exacerbates performance disparities between majority and minority classes under ZID, potentially leading to irreparable losses due to underreporting critical risk events. In this paper, we first demonstrate that the smaller top-k gradients and lower separability of the minority class are key factors contributing to this disparity. To address these issues, we propose MinGRE, a framework for Minority Class Gradients and Representations Enhancement. MinGRE employs a multi-dimensional attention mechanism to reweight spatiotemporal gradients, minimizing the gradient distribution discrepancies across classes. Additionally, we introduce an uncertainty-guided contrastive loss to improve the inter-class separability and intra-class compactness of minority representations with higher uncertainty. Extensive experiments demonstrate that the MinGRE framework not only significantly reduces the performance disparity across classes but also achieves enhanced robustness compared to existing baselines. These findings underscore the potential of our method in fostering the development of more equitable and robust models.
[LG-20] Spectral Normalization and Voigt-Reuss net: A universal approach to microstructure-property forecasting with physical guarantees
Link: https://arxiv.org/abs/2504.00712
Authors: Sanath Keshav, Julius Herb, Felix Fritzen
Categories: Machine Learning (cs.LG)
Abstract:Heterogeneous materials are crucial to producing lightweight components, functional components, and structures composed of them. A crucial step in the design process is the rapid evaluation of their effective mechanical, thermal, or, in general, constitutive properties. The established procedure is to use forward models that accept microstructure geometry and local constitutive properties as inputs. The classical simulation-based approach, which uses, e.g., finite elements and FFT-based solvers, can require substantial computational resources. At the same time, simulation-based models struggle to provide gradients with respect to the microstructure and the constitutive parameters. Such gradients are, however, of paramount importance for microstructure design and for inverting the microstructure-property mapping. Machine learning surrogates can excel in these situations. However, they can lead to unphysical predictions that violate essential bounds on the constitutive response, such as the upper (Voigt-like) or the lower (Reuss-like) bound in linear elasticity. Therefore, we propose a novel spectral normalization scheme that a priori enforces these bounds. The approach is fully agnostic with respect to the chosen microstructural features and the utilized surrogate model. All of these will automatically and strictly predict outputs that obey the upper and lower bounds by construction. The technique can be used for any constitutive tensor that is symmetric and where upper and lower bounds (in the Löwner sense) exist, i.e., for permeability, thermal conductivity, linear elasticity, and many more. We demonstrate the use of spectral normalization in the Voigt-Reuss net using a simple neural network. Numerical examples on truly extensive datasets illustrate the improved accuracy, robustness, and independence of the type of input features in comparison to much-used neural networks.
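To make the idea of enforcing bounds by construction concrete, here is a minimal scalar sketch. It is not the authors' spectral normalization, which acts on symmetric tensors in the Löwner sense. It assumes a scalar effective property (for example an isotropic conductivity), takes the Voigt bound as the volume-weighted arithmetic mean and the Reuss bound as the volume-weighted harmonic mean, and squashes an unconstrained surrogate output into that interval. The function names are hypothetical.

```python
import math


def voigt_reuss_bounds(fractions, properties):
    """Return (reuss, voigt) for a scalar property of a multi-phase material.

    fractions: volume fractions summing to 1; properties: positive phase values.
    """
    voigt = sum(f * k for f, k in zip(fractions, properties))        # arithmetic mean
    reuss = 1.0 / sum(f / k for f, k in zip(fractions, properties))  # harmonic mean
    return reuss, voigt


def bounded_prediction(raw_output, fractions, properties):
    """Map any real-valued surrogate output into the interval [reuss, voigt]."""
    reuss, voigt = voigt_reuss_bounds(fractions, properties)
    # Sigmoid written with tanh so that large |raw_output| cannot overflow.
    t = 0.5 * (1.0 + math.tanh(0.5 * raw_output))
    return reuss + t * (voigt - reuss)


if __name__ == "__main__":
    fractions, properties = [0.5, 0.5], [1.0, 3.0]
    print(voigt_reuss_bounds(fractions, properties))
    print(bounded_prediction(0.0, fractions, properties))
```

Whatever the surrogate emits, the prediction cannot leave the admissible interval; the tensor-valued case in the paper follows the same spirit but requires a matrix-level normalization.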
[LG-21] GraphMaster: Automated Graph Synthesis via LLM Agents in Data-Limited Environments
Link: https://arxiv.org/abs/2504.00711
Authors: Enjun Du, Xunkai Li, Tian Jin, Zhihan Zhang, Rong-Hua Li, Guoren Wang
Categories: Machine Learning (cs.LG)
Abstract:The era of foundation models has revolutionized AI research, yet Graph Foundation Models (GFMs) remain constrained by the scarcity of large-scale graph corpora. Traditional graph data synthesis techniques primarily focus on simplistic structural operations, lacking the capacity to generate semantically rich nodes with meaningful textual attributes: a critical limitation for real-world applications. While large language models (LLMs) demonstrate exceptional text generation capabilities, their direct application to graph synthesis is impeded by context window limitations, hallucination phenomena, and structural consistency challenges. To address these issues, we introduce GraphMaster, the first multi-agent framework specifically designed for graph data synthesis in data-limited environments. GraphMaster orchestrates four specialized LLM agents (Manager, Perception, Enhancement, and Evaluation) that collaboratively optimize the synthesis process through iterative refinement, ensuring both semantic coherence and structural integrity. To rigorously evaluate our approach, we create new data-limited “Sub” variants of six standard graph benchmarks, specifically designed to test synthesis capabilities under realistic constraints. Additionally, we develop a novel interpretability assessment framework that combines human evaluation with a principled Grassmannian manifold-based analysis, providing both qualitative and quantitative measures of semantic coherence. Experimental results demonstrate that GraphMaster significantly outperforms traditional synthesis methods across multiple datasets, establishing a strong foundation for advancing GFMs in data-scarce environments.
[LG-22] On Benchmarking Code LLMs for Android Malware Analysis
Link: https://arxiv.org/abs/2504.00694
Authors: Yiling He, Hongyu She, Xingzhi Qian, Xinran Zheng, Zhuo Chen, Zhan Qin, Lorenzo Cavallaro
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in various code intelligence tasks. However, their effectiveness for Android malware analysis remains underexplored. Decompiled Android code poses unique challenges for analysis, primarily due to its large volume of functions and the frequent absence of meaningful function names. This paper presents Cama, a benchmarking framework designed to systematically evaluate the effectiveness of Code LLMs in Android malware analysis tasks. Cama specifies structured model outputs (comprising function summaries, refined function names, and maliciousness scores) to support key malware analysis tasks, including malicious function identification and malware purpose summarization. Built on these, it integrates three domain-specific evaluation metrics, consistency, fidelity, and semantic relevance, enabling rigorous stability and effectiveness assessment and cross-model comparison. We construct a benchmark dataset consisting of 118 Android malware samples, encompassing over 7.5 million distinct functions, and use Cama to evaluate four popular open-source models. Our experiments provide insights into how Code LLMs interpret decompiled code and quantify the sensitivity to function renaming, highlighting both the potential and current limitations of Code LLMs in malware analysis tasks.
[LG-23] Sim-is-More: Randomizing HW-NAS with Synthetic Devices
Link: https://arxiv.org/abs/2504.00663
Authors: Francesco Capuano, Gabriele Tiboni, Niccolò Cavagnero, Giuseppe Averta
Categories: Machine Learning (cs.LG)
Abstract: Existing hardware-aware NAS (HW-NAS) methods typically assume access to precise information about the target device, either via analytical approximations of the post-compilation latency model, or through learned latency predictors. Such approximate approaches risk introducing estimation errors that may prove detrimental in risk-sensitive applications. In this work, we propose a two-stage HW-NAS framework, in which we first learn an architecture controller on a distribution of synthetic devices, and then directly deploy the controller on a target device. At test-time, our network controller deploys directly to the target device without relying on any pre-collected information, and only exploits direct interactions. In particular, the pre-training phase on synthetic devices enables the controller to design an architecture for the target device by interacting with it through a small number of high-fidelity latency measurements. To guarantee accessibility of our method, we only train our controller with training-free accuracy proxies, allowing us to scale the meta-training phase without incurring the overhead of full network training. We benchmark on HW-NATS-Bench, demonstrating that our method generalizes to unseen devices and searches for latency-efficient architectures by in-context adaptation using only a few real-world latency evaluations at test-time.
[LG-24] Learning to Normalize on the SPD Manifold under Bures-Wasserstein Geometry CVPR2025
Link: https://arxiv.org/abs/2504.00660
Authors: Rui Wang, Shaocheng Jin, Ziheng Chen, Xiaoqing Luo, Xiao-Jun Wu
Categories: Machine Learning (cs.LG)
Notes: Accepted by CVPR 2025
Abstract:Covariance matrices have proven highly effective across many scientific fields. Since these matrices lie within the Symmetric Positive Definite (SPD) manifold - a Riemannian space with intrinsic non-Euclidean geometry, the primary challenge in representation learning is to respect this underlying geometric structure. Drawing inspiration from the success of Euclidean deep learning, researchers have developed neural networks on the SPD manifolds for more faithful covariance embedding learning. A notable advancement in this area is the implementation of Riemannian batch normalization (RBN), which has been shown to improve the performance of SPD network models. Nonetheless, the Riemannian metric beneath the existing RBN might fail to effectively deal with the ill-conditioned SPD matrices (ICSM), undermining the effectiveness of RBN. In contrast, the Bures-Wasserstein metric (BWM) demonstrates superior performance for ill-conditioning. In addition, the recently introduced Generalized BWM (GBWM) parameterizes the vanilla BWM via an SPD matrix, allowing for a more nuanced representation of vibrant geometries of the SPD manifold. Therefore, we propose a novel RBN algorithm based on the GBW geometry, incorporating a learnable metric parameter. Moreover, the deformation of GBWM by matrix power is also introduced to further enhance the representational capacity of GBWM-based RBN. Experimental results on different datasets validate the effectiveness of our proposed method.
[LG-25] NeuraLUT-Assemble: Hardware-aware Assembling of Sub-Neural Networks for Efficient LUT Inference
Link: https://arxiv.org/abs/2504.00592
Authors: Marta Andronic, George A. Constantinides
Categories: Machine Learning (cs.LG)
Abstract: Efficient neural networks (NNs) leveraging lookup tables (LUTs) have demonstrated significant potential for emerging AI applications, particularly when deployed on field-programmable gate arrays (FPGAs) for edge computing. These architectures promise ultra-low latency and reduced resource utilization, broadening neural network adoption in fields such as particle physics. However, existing LUT-based designs suffer from accuracy degradation due to the large fan-in required by neurons being limited by the exponential scaling of LUT resources with input width. In prior work, this tension has led to reliance on extremely sparse models. We present NeuraLUT-Assemble, a novel framework that addresses these limitations by combining mixed-precision techniques with the assembly of larger neurons from smaller units, thereby increasing connectivity while keeping the number of inputs of any given LUT manageable. Additionally, we introduce skip-connections across entire LUT structures to improve gradient flow. NeuraLUT-Assemble closes the accuracy gap between LUT-based methods and (fully-connected) MLP-based models, achieving competitive accuracy on tasks such as network intrusion detection, digit classification, and jet classification, demonstrating up to 8.42x reduction in the area-delay product compared to the state-of-the-art at the time of the publication.
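The exponential scaling mentioned above can be illustrated with a small arithmetic sketch. It assumes a truth table with one entry per input combination, and a hypothetical tree arrangement in which each small unit emits a single signal of the same bit width; the abstract does not specify the actual assembly scheme, so this is only an illustration of why bounding per-LUT fan-in matters.

```python
def lut_entries(num_inputs, bits_per_input=1):
    """Truth-table entries needed by one LUT with the given inputs."""
    return 2 ** (num_inputs * bits_per_input)


def tree_entries(fan_in, unit_inputs, bits_per_input=1):
    """Upper estimate of total entries when a fan_in-input neuron is built
    as a tree of small LUTs with unit_inputs inputs each (assumed layout)."""
    if unit_inputs < 2:
        raise ValueError("unit_inputs must be at least 2")
    total, signals = 0, fan_in
    while signals > 1:
        units = -(-signals // unit_inputs)  # ceiling division
        total += units * lut_entries(unit_inputs, bits_per_input)
        signals = units                     # each unit emits one signal
    return total


if __name__ == "__main__":
    print(lut_entries(16))      # single monolithic 16-input LUT
    print(tree_entries(16, 4))  # tree of 4-input LUTs
```

A tree of small LUTs cannot represent every function of all its inputs, so the saving in table size is bought with a restriction on expressiveness.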
[LG-26] Geometric Median Matching for Robust k-Subset Selection from Noisy Data
Link: https://arxiv.org/abs/2504.00564
Authors: Anish Acharya, Sujay Sanghavi, Alexandros G Dimakis, Inderjit S Dhillon
Categories: Machine Learning (cs.LG)
Abstract: Data pruning, the combinatorial task of selecting a small and representative subset from a large dataset, is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. However, existing data pruning methods often fail under high corruption rates due to their reliance on empirical mean estimation, which is highly sensitive to outliers. In response, we propose Geometric Median (GM) Matching, a novel k-subset selection strategy that leverages the Geometric Median, a robust estimator with an optimal breakdown point of 1/2, to enhance resilience against noisy data. Our method iteratively selects a k-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption. We provide theoretical guarantees, showing that GM Matching enjoys an improved O(1/k) convergence rate, a quadratic improvement over random sampling, even under arbitrary corruption. Extensive experiments across image classification and image generation tasks demonstrate that GM Matching consistently outperforms existing pruning approaches, particularly in high-corruption settings and at high pruning rates, making it a strong baseline for robust data pruning.
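The two ingredients named in this abstract, a geometric median and a subset whose mean approximates it, can be sketched as follows. The geometric median is computed with the standard Weiszfeld iteration. The selection step is a simple greedy stand-in chosen for illustration and is not necessarily the procedure used in the paper. Function names are hypothetical, and points are plain tuples of floats.

```python
import math


def geometric_median(points, iters=200, eps=1e-9):
    """Weiszfeld iteration: re-weight each point by inverse distance."""
    dim = len(points[0])
    est = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        weights = [1.0 / max(math.dist(p, est), eps) for p in points]
        total = sum(weights)
        new = [sum(w * p[d] for w, p in zip(weights, points)) / total
               for d in range(dim)]
        if math.dist(new, est) < eps:
            return new
        est = new
    return est


def greedy_match(points, target, k):
    """Pick k indices so the running subset mean stays close to target."""
    chosen, remaining = [], list(range(len(points)))
    running = [0.0] * len(target)
    for step in range(1, k + 1):
        best = min(
            remaining,
            key=lambda i: math.dist(
                [(r + x) / step for r, x in zip(running, points[i])], target
            ),
        )
        chosen.append(best)
        remaining.remove(best)
        running = [r + x for r, x in zip(running, points[best])]
    return chosen


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
    gm = geometric_median(data)
    print(gm, greedy_match(data, gm, 2))
```

In this toy dataset the empirical mean is dragged to roughly (20.4, 20.4) by the single outlier, while the geometric median stays inside the unit square, which is why matching the median rather than the mean keeps the outlier out of the selected subset.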
[LG-27] Adversarial Curriculum Graph-Free Knowledge Distillation for Graph Neural Networks
Link: https://arxiv.org/abs/2504.00540
Authors: Yuang Jia, Xiaojuan Shan, Jun Xia, Guancheng Wan, Yuchen Zhang, Wenke Huang, Mang Ye, Stan Z. Li
Categories: Machine Learning (cs.LG)
Abstract:Data-free Knowledge Distillation (DFKD) is a method that constructs pseudo-samples using a generator without real data, and transfers knowledge from a teacher model to a student by enforcing the student to overcome dimensional differences and learn to mimic the teacher’s outputs on these pseudo-samples. In recent years, various studies in the vision domain have made notable advancements in this area. However, the varying topological structures and non-grid nature of graph data render the methods from the vision domain ineffective. Building upon prior research into differentiable methods for graph neural networks, we propose a fast and high-quality data-free knowledge distillation approach in this paper. Without compromising distillation quality, the proposed graph-free KD method (ACGKD) significantly reduces the spatial complexity of pseudo-graphs by leveraging the Binary Concrete distribution to model the graph structure and introducing a spatial complexity tuning parameter. This approach enables efficient gradient computation for the graph structure, thereby accelerating the overall distillation process. Additionally, ACGKD eliminates the dimensional ambiguity between the student and teacher models by increasing the student’s dimensions and reusing the teacher’s classifier. Moreover, it equips graph knowledge distillation with a CL-based strategy to ensure the student learns graph structures progressively. Extensive experiments demonstrate that ACGKD achieves state-of-the-art performance in distilling knowledge from GNNs without training data.
[LG-28] MARIOH: Multiplicity-Aware Hypergraph Reconstruction ICDE’25
Link: https://arxiv.org/abs/2504.00522
Authors: Kyuhan Lee, Geon Lee, Kijung Shin
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Notes: to be published in the 41st IEEE International Conference on Data Engineering (ICDE '25)
Abstract: Hypergraphs offer a powerful framework for modeling higher-order interactions that traditional pairwise graphs cannot fully capture. However, practical constraints often lead to their simplification into projected graphs, resulting in substantial information loss and ambiguity in representing higher-order relationships. In this work, we propose MARIOH, a supervised approach for reconstructing the original hypergraph from its projected graph by leveraging edge multiplicity. To overcome the difficulties posed by the large search space, MARIOH integrates several key ideas: (a) identifying provable size-2 hyperedges, which reduces the candidate search space, (b) predicting the likelihood of candidates being hyperedges by utilizing both structural and multiplicity-related features, and (c) not only targeting promising hyperedge candidates but also examining less confident ones to explore alternative possibilities. Together, these ideas enable MARIOH to efficiently and effectively explore the search space. In our experiments using 10 real-world datasets, MARIOH achieves up to 74.51% higher reconstruction accuracy compared to state-of-the-art methods.
[LG-29] SCRec: A Scalable Computational Storage System with Statistical Sharding and Tensor-train Decomposition for Recommendation Models
Link: https://arxiv.org/abs/2504.00520
Authors: Jinho Yang, Ji-Hoon Kim, Joo-Young Kim
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Notes: 14 pages, 12 figures
Abstract: Deep Learning Recommendation Models (DLRMs) play a crucial role in delivering personalized content across web applications such as social networking and video streaming. However, with improvements in performance, the parameter size of DLRMs has grown to terabyte (TB) scales, accompanied by memory bandwidth demands exceeding TB/s levels. Furthermore, the workload intensity within the model varies based on the target mechanism, making it difficult to build an optimized recommendation system. In this paper, we propose SCRec, a scalable computational storage recommendation system that can handle TB-scale industrial DLRMs while guaranteeing high bandwidth requirements. SCRec utilizes a software framework that features a mixed-integer programming (MIP)-based cost model, efficiently fetching data based on data access patterns and adaptively configuring memory-centric and compute-centric cores. Additionally, SCRec integrates hardware acceleration cores to enhance DLRM computations, particularly allowing for the high-performance reconstruction of approximated embedding vectors from extremely compressed tensor-train (TT) format. By combining its software framework and hardware accelerators, while eliminating data communication overhead by being implemented on a single server, SCRec achieves substantial improvements in DLRM inference performance. It delivers up to 55.77x speedup compared to a CPU-DRAM system with no loss in accuracy and up to 13.35x energy efficiency gains over a multi-GPU system.
[LG-30] ParallelFlow: Parallelizing Linear Transformers via Flow Discretization
Link: https://arxiv.org/abs/2504.00492
Authors: Nicola Muca Cirone, Cristopher Salvi
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Abstract:We present a theoretical framework for analyzing linear attention models through matrix-valued state space models (SSMs). Our approach, Parallel Flows, provides a perspective that systematically decouples temporal dynamics from implementation constraints, enabling independent analysis of critical algorithmic components: chunking, parallelization, and information aggregation. Central to this framework is the reinterpretation of chunking procedures as computations of the flows governing system dynamics. This connection establishes a bridge to mathematical tools from rough path theory, opening the door to new insights into sequence modeling architectures. As a concrete application, we analyze DeltaNet in a generalized low-rank setting motivated by recent theoretical advances. Our methods allow us to design simple, streamlined generalizations of hardware-efficient algorithms present in the literature, and to provide completely different ones, inspired by rough paths techniques, with provably lower complexity. This dual contribution demonstrates how principled theoretical analysis can both explain existing practical methods and inspire fundamentally new computational approaches.
[LG-31] Preconditioned Additive Gaussian Processes with Fourier Acceleration
Link: https://arxiv.org/abs/2504.00480
Authors: Theresa Wagner, Tianshi Xu, Franziska Nestler, Yuanzhe Xi, Martin Stoll
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Abstract:Gaussian processes (GPs) are crucial in machine learning for quantifying uncertainty in predictions. However, their associated covariance matrices, defined by kernel functions, are typically dense and large-scale, posing significant computational challenges. This paper introduces a matrix-free method that utilizes the Non-equispaced Fast Fourier Transform (NFFT) to achieve nearly linear complexity in the multiplication of kernel matrices and their derivatives with vectors for a predetermined accuracy level. To address high-dimensional problems, we propose an additive kernel approach. Each sub-kernel in this approach captures lower-order feature interactions, allowing for the efficient application of the NFFT method and potentially increasing accuracy across various real-world datasets. Additionally, we implement a preconditioning strategy that accelerates hyperparameter tuning, further improving the efficiency and effectiveness of GPs.
[LG-32] Informed Greedy Algorithm for Scalable Bayesian Network Fusion via Minimum Cut Analysis
Link: https://arxiv.org/abs/2504.00467
Authors: Pablo Torrijos, José M. Puerta, José A. Gámez, Juan A. Aledo
Categories: Machine Learning (cs.LG)
Abstract:This paper presents the Greedy Min-Cut Bayesian Consensus (GMCBC) algorithm for the structural fusion of Bayesian Networks (BNs). The method is designed to preserve essential dependencies while controlling network complexity. It addresses the limitations of traditional fusion approaches, which often lead to excessively complex models that are impractical for inference, reasoning, or real-world applications. As the number and size of input networks increase, this issue becomes even more pronounced. GMCBC integrates principles from flow network theory into BN fusion, adapting the Backward Equivalence Search (BES) phase of the Greedy Equivalence Search (GES) algorithm and applying the Ford-Fulkerson algorithm for minimum cut analysis. This approach removes non-essential edges, ensuring that the fused network retains key dependencies while minimizing unnecessary complexity. Experimental results on synthetic Bayesian Networks demonstrate that GMCBC achieves near-optimal network structures. In federated learning simulations, GMCBC produces a consensus network that improves structural accuracy and dependency preservation compared to the average of the input networks, resulting in a structure that better captures the real underlying (in)dependence relationships. This consensus network also maintains a similar size to the original networks, unlike unrestricted fusion methods, where network size grows exponentially.
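For readers unfamiliar with the minimum cut analysis mentioned here, the following is a generic sketch of a Ford-Fulkerson-style computation (the breadth-first Edmonds-Karp variant) on a small flow network. How GMCBC derives flow networks from Bayesian Networks is not described in the abstract, so the graph, node names, and capacities below are purely hypothetical.

```python
from collections import deque


def min_cut(capacity, source, sink):
    """capacity maps (u, v) -> non-negative capacity.

    Returns (cut_value, cut_edges), where cut_edges are the original edges
    leaving the set of nodes still reachable from source once flow is maximal.
    """
    residual, adj = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)      # reverse edge for undoing flow
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def reachable_parents():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_parents()
        if sink not in parent:
            break                           # no augmenting path remains
        path, node = [], sink
        while parent[node] is not None:
            path.append((parent[node], node))
            node = parent[node]
        push = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        flow += push

    side = set(parent)                      # nodes on the source side
    cut = sorted((u, v) for (u, v) in capacity if u in side and v not in side)
    return flow, cut


if __name__ == "__main__":
    caps = {("s", "a"): 3, ("s", "b"): 2, ("a", "t"): 2,
            ("b", "t"): 3, ("a", "b"): 1}
    print(min_cut(caps, "s", "t"))
```

By the max-flow min-cut theorem, the returned flow value equals the total capacity of the returned cut edges.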
[LG-33] Efficient Near-Optimal Algorithm for Online Shortest Paths in Directed Acyclic Graphs with Bandit Feedback Against Adaptive Adversaries
Link: https://arxiv.org/abs/2504.00461
Authors: Arnab Maiti, Zhiyuan Fan, Kevin Jamieson, Lillian J. Ratliff, Gabriele Farina
Categories: Machine Learning (cs.LG)
Notes: 48 pages, 8 figures
Abstract: In this paper, we study the online shortest path problem in directed acyclic graphs (DAGs) under bandit feedback against an adaptive adversary. Given a DAG $G = (V, E)$ with a source node $v_{\mathsf{s}}$ and a sink node $v_{\mathsf{t}}$, let $X \subseteq \{0,1\}^{|E|}$ denote the set of all paths from $v_{\mathsf{s}}$ to $v_{\mathsf{t}}$. At each round $t$, we select a path $\mathbf{x}_t \in X$ and receive bandit feedback on our loss $\langle \mathbf{x}_t, \mathbf{y}_t \rangle \in [-1,1]$, where $\mathbf{y}_t$ is an adversarially chosen loss vector. Our goal is to minimize regret with respect to the best path in hindsight over $T$ rounds. We propose the first computationally efficient algorithm to achieve a near-minimax optimal regret bound of $\tilde{O}(\sqrt{|E| T \log |X|})$ with high probability against any adaptive adversary, where $\tilde{O}(\cdot)$ hides logarithmic factors in the number of edges $|E|$. Our algorithm leverages a novel loss estimator and a centroid-based decomposition in a nontrivial manner to attain this regret bound. As an application, we show that our algorithm for DAGs provides state-of-the-art efficient algorithms for $m$-sets, extensive-form games, the Colonel Blotto game, shortest walks in directed graphs, hypercubes, and multi-task multi-armed bandits, achieving improved high-probability regret guarantees in all these settings.
[LG-34] HERA: Hybrid Edge-cloud Resource Allocation for Cost-Efficient AI Agents
Link: https://arxiv.org/abs/2504.00434
Authors: Shiyi Liu, Haiying Shen, Shuai Che, Mahdi Ghandi, Mingqin Li
Categories: Machine Learning (cs.LG)
Abstract:In the realm of AI, large language models (LLMs) like GPT-4, central to the operation of AI agents, predominantly operate in the cloud, incurring high operational costs. With local-based small language models (SLMs) becoming more accurate, the necessity of cloud-exclusive processing is being reconsidered. An AI agent’s response to a user’s request comprises a series of subtasks or iterations. Existing approaches only allocate a single request between SLM and LLM to ensure their outputs are similar, but adopting this approach in the AI agent scenario for assigning each subtask is not effective since SLM will output a different subsequent subtask, which affects the accuracy of the final output. In this paper, we first conduct experimental analysis to understand the features of AI agent operations. Leveraging our findings, we propose the Adaptive Iteration-level Model Selector (AIMS), a lightweight scheduler to automatically partition AI agent’s subtasks between local-based SLM and cloud-based LLM. AIMS considers the varying subtask features and strategically decides the location for each subtask in order to use SLM as much as possible while attaining the accuracy level. Our experimental results demonstrate that AIMS increases accuracy by up to 9.1% and SLM usage by up to 10.8% compared to HybridLLM. It offloads 45.67% of subtasks to a local SLM while attaining similar accuracy on average compared with the cloud-only LLM approach.
[LG-35] Forward Learning with Differential Privacy
Link: https://arxiv.org/abs/2504.00411
Authors: Mingqian Feng, Zeliang Zhang, Jinyang Jiang, Yijie Peng, Chenliang Xu
Categories: Machine Learning (cs.LG)
Abstract: Differential privacy (DP) in deep learning is a critical concern as it ensures the confidentiality of training data while maintaining model utility. Existing DP training algorithms provide privacy guarantees by clipping and then injecting external noise into sample gradients computed by the backpropagation algorithm. Different from backpropagation, forward-learning algorithms based on perturbation inherently add noise during the forward pass and utilize randomness to estimate the gradients. Although these algorithms are non-privatized, the introduction of noise during the forward pass indirectly provides internal randomness protection to the model parameters and their gradients, suggesting the potential for naturally providing differential privacy. In this paper, we propose a privatized forward-learning algorithm, Differential Private Unified Likelihood Ratio (DP-ULR), and demonstrate its differential privacy guarantees. DP-ULR features a novel batch sampling operation with rejection, of which we provide theoretical analysis in conjunction with classic differential privacy mechanisms. DP-ULR is also underpinned by a theoretically guided privacy controller that dynamically adjusts noise levels to manage privacy costs in each training step. Our experiments indicate that DP-ULR achieves competitive performance compared to traditional differential privacy training algorithms based on backpropagation, maintaining nearly the same privacy loss limits.
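The baseline that this abstract contrasts against, clipping per-sample gradients and then adding external noise, can be sketched in a few lines. This illustrates the conventional backpropagation-based recipe only, not DP-ULR itself. The parameter names `max_norm` and `noise_multiplier` are illustrative assumptions, and no privacy accounting is performed.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_sample_grads, max_norm, noise_multiplier, rng=None):
    """Clip each sample gradient, sum, add Gaussian noise, then average."""
    rng = rng or random.Random()
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        for d, g in enumerate(clip(grad, max_norm)):
            total[d] += g
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_sample_grads) for n in noisy]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.0, 0.5]]
    print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.0,
                                rng=random.Random(0)))
```

Clipping bounds how much any single sample can move the sum, which is what lets the added noise mask an individual's contribution.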
[LG-36] Minimum Description Length of a Spectrum Variational Autoencoder: A Theory
Link: https://arxiv.org/abs/2504.00395
Authors: Canlin Zhang, Xiuwen Liu
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Abstract:Deep neural networks (DNNs) trained through end-to-end learning have achieved remarkable success across diverse machine learning tasks, yet they are not explicitly designed to adhere to the Minimum Description Length (MDL) principle, which posits that the best model provides the shortest description of the data. In this paper, we argue that MDL is essential to deep learning and propose a further generalized principle: Understanding is the use of a small amount of information to represent a large amount of information. To this end, we introduce a novel theoretical framework for designing and evaluating deep Variational Autoencoders (VAEs) based on MDL. In our theory, we designed the Spectrum VAE, a specific VAE architecture whose MDL can be rigorously evaluated under given conditions. Additionally, we introduce the concept of latent dimension combination, or pattern of spectrum, and provide the first theoretical analysis of their role in achieving MDL. We claim that a Spectrum VAE understands the data distribution in the most appropriate way when the MDL is achieved. This work is entirely theoretical and lays the foundation for future research on designing deep learning systems that explicitly adhere to information-theoretic principles.
[LG-37] Deep learning for state estimation of commercial sodium-ion batteries using partial charging profiles: validation with a multi-temperature ageing dataset
Link: https://arxiv.org/abs/2504.00393
Authors: Jiapeng Liu, Lunte Li, Jing Xiang, Laiyong Xie, Yuhao Wang, Francesco Ciucci
Categories: Machine Learning (cs.LG)
Abstract:Accurately predicting the state of health for sodium-ion batteries is crucial for managing battery modules, playing a vital role in ensuring operational safety. However, highly accurate models available thus far are rare due to a lack of aging data for sodium-ion batteries. In this study, we experimentally collected 53 single cells at four temperatures (0, 25, 35, and 45 °C), along with two battery modules in the lab. By utilizing the charging profiles, we were able to predict the SOC, capacity, and SOH simultaneously. This was achieved by designing a new framework that integrates the neural ordinary differential equation and 2D convolutional neural networks, using the partial charging profile as input. The charging profile is partitioned into segments, and each segment is fed into the network to output the SOC. For capacity and SOH prediction, we first aggregated the extracted features corresponding to segments from one cycle, after which an embedding block for temperature is concatenated for the final prediction. This novel approach eliminates the issue of multiple outputs for a single target. Our model demonstrated an R^2 accuracy of 0.998 for SOC and 0.997 for SOH across single cells at various temperatures. Furthermore, the trained model can be employed to predict single cells at temperatures outside the training set and battery modules with different capacity and current levels. The results presented here highlight the high accuracy of our model and its capability to predict multiple targets simultaneously using a partial charging profile.
[LG-38] Using complex prompts to identify fine-grained biases in image generation through ChatGPT-4o
Link: https://arxiv.org/abs/2504.00388
Authors: Marinus Ferreira
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes: Presented at the 74th Annual ICA 2024 Conference, in the stream “Image-as-Data Methods in the Age of Generative Artificial Intelligence”, 22 June 2024
Abstract: There are not one but two dimensions of bias that can be revealed through the study of large AI models: not only bias in training data or the products of an AI, but also bias in society, such as disparity in employment or health outcomes between different demographic groups. Often training data and AI output are biased for or against certain demographics (e.g. older white people are overrepresented in image datasets), but sometimes large AI models accurately illustrate biases in the real world (e.g. young black men being disproportionately viewed as threatening). These social disparities often appear in image generation AI outputs in the form of ‘marked’ features, where some feature of an individual or setting is a social marker of disparity, and prompts both humans and AI systems to treat subjects that are marked in this way as exceptional and requiring special treatment. Generative AI has proven to be very sensitive to such marked features, to the extent of over-emphasising them and thus often exacerbating social biases. I briefly discuss how we can use complex prompts to image generation AI to investigate either dimension of bias, emphasising how we can probe the large language models underlying image generation AI through, for example, automated sentiment analysis of the text prompts used to generate images.
[LG-39] Reducing Smoothness with Expressive Memory Enhanced Hierarchical Graph Neural Networks
Link: https://arxiv.org/abs/2504.00349
Authors: Thomas Bailie, Yun Sing Koh, S. Karthik Mukkavilli, Varvara Vetrova
Categories: Machine Learning (cs.LG)
Abstract: Graphical forecasting models learn the structure of time series data via projecting onto a graph, with recent techniques capturing spatial-temporal associations between variables via edge weights. Hierarchical variants offer a distinct advantage by analysing the time series across multiple resolutions, making them particularly effective in tasks like global weather forecasting, where low-resolution variable interactions are significant. A critical challenge in hierarchical models is information loss during forward or backward passes through the hierarchy. We propose the Hierarchical Graph Flow (HiGFlow) network, which introduces a memory buffer variable of dynamic size to store previously seen information across variable resolutions. We theoretically show two key results: HiGFlow reduces smoothness when mapping onto new feature spaces in the hierarchy and non-strictly enhances the utility of message-passing by improving Weisfeiler-Lehman (WL) expressivity. Empirical results demonstrate that HiGFlow outperforms state-of-the-art baselines, including transformer models, by at least an average of 6.1% in MAE and 6.2% in RMSE. Code is available at this https URL.
[LG-40] Aligning Diffusion Model with Problem Constraints for Trajectory Optimization
Link: https://arxiv.org/abs/2504.00342
Authors: Anjian Li, Ryne Beeson
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Abstract:Diffusion models have recently emerged as effective generative frameworks for trajectory optimization, capable of producing high-quality and diverse solutions. However, training these models in a purely data-driven manner without explicit incorporation of constraint information often leads to violations of critical constraints, such as goal-reaching, collision avoidance, and adherence to system dynamics. To address this limitation, we propose a novel approach that aligns diffusion models explicitly with problem-specific constraints, drawing insights from the Dynamic Data-driven Application Systems (DDDAS) framework. Our approach introduces a hybrid loss function that explicitly measures and penalizes constraint violations during training. Furthermore, by statistically analyzing how constraint violations evolve throughout the diffusion steps, we develop a re-weighting strategy that aligns predicted violations to ground truth statistics at each diffusion step. Evaluated on a tabletop manipulation and a two-car reach-avoid problem, our constraint-aligned diffusion model significantly reduces constraint violations compared to traditional diffusion models, while maintaining the quality of trajectory solutions. This approach is well-suited for integration into the DDDAS framework for efficient online trajectory adaptation as new environmental data becomes available.
[LG-41] Simple yet Effective Node Property Prediction on Edge Streams under Distribution Shifts ICDE2025
Link: https://arxiv.org/abs/2504.00328
Authors: Jongha Lee, Taehyung Kwon, Heechan Moon, Kijung Shin
Categories: Machine Learning (cs.LG)
Notes: 14 pages, 14 figures, To Appear in ICDE 2025
Abstract:The problem of predicting node properties (e.g., node classes) in graphs has received significant attention due to its broad range of applications. Graphs from real-world datasets often evolve over time, with newly emerging edges and dynamically changing node properties, posing a significant challenge for this problem. In response, temporal graph neural networks (TGNNs) have been developed to predict dynamic node properties from a stream of emerging edges. However, our analysis reveals that most TGNN-based methods are (a) far less effective without proper node features and, due to their complex model architectures, (b) vulnerable to distribution shifts. In this paper, we propose SPLASH, a simple yet powerful method for predicting node properties on edge streams under distribution shifts. Our key contributions are as follows: (1) we propose feature augmentation methods and an automatic feature selection method for edge streams, which improve the effectiveness of TGNNs, (2) we propose a lightweight MLP-based TGNN architecture that is highly efficient and robust under distribution shifts, and (3) we conduct extensive experiments to evaluate the accuracy, efficiency, generalization, and qualitative performance of the proposed method and its competitors on dynamic node classification, dynamic anomaly detection, and node affinity prediction tasks across seven real-world datasets.
[LG-42] Diffusion models for probabilistic precipitation generation from atmospheric variables
Link: https://arxiv.org/abs/2504.00307
Authors: Michael Aich, Sebastian Bathiany, Philipp Hess, Yu Huang, Niklas Boers
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Abstract:Improving the representation of precipitation in Earth system models (ESMs) is critical for assessing the impacts of climate change and especially of extreme events like floods and droughts. In existing ESMs, precipitation is not resolved explicitly, but represented by parameterizations. These typically rely on resolving approximated but computationally expensive column-based physics, not accounting for interactions between locations. They struggle to capture fine-scale precipitation processes and introduce significant biases. We present a novel approach, based on generative machine learning, which integrates a conditional diffusion model with a UNet architecture to generate accurate, high-resolution (0.25°) global daily precipitation fields from a small set of prognostic atmospheric variables. Unlike traditional parameterizations, our framework efficiently produces ensemble predictions, capturing uncertainties in precipitation, and does not require fine-tuning by hand. We train our model on the ERA5 reanalysis and present a method that allows us to apply it to arbitrary ESM data, enabling fast generation of probabilistic forecasts and climate scenarios. By leveraging interactions between global prognostic variables, our approach provides an alternative parameterization scheme that mitigates biases present in the ESM precipitation while maintaining consistency with its large-scale (annual) trends. This work demonstrates that complex precipitation patterns can be learned directly from large-scale atmospheric variables, offering a computationally efficient alternative to conventional schemes.
[LG-43] LOCO-EPI: Leave-one-chromosome-out (LOCO) as a benchmarking paradigm for deep learning based prediction of enhancer-promoter interactions
Link: https://arxiv.org/abs/2504.00306
Authors: Muhammad Tahir, Shehroz S. Khan, James Davie, Soichiro Yamanaka, Ahmed Ashraf
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN)
Abstract:In mammalian and vertebrate genomes, the promoter regions of the gene and their distal enhancers may be located millions of base-pairs from each other, while a promoter may not interact with the closest enhancer. Since base-pair proximity is not a good indicator of these interactions, there is considerable work toward developing methods for predicting Enhancer-Promoter Interactions (EPI). Several machine learning methods have reported increasingly higher accuracies for predicting EPI. Typically, these approaches randomly split the dataset of Enhancer-Promoter (EP) pairs into training and testing subsets followed by model training. However, the aforementioned random splitting causes information leakage by assigning EP pairs from the same genomic region to both testing and training sets, leading to performance overestimation. In this paper we propose to use a more thorough training and testing paradigm i.e., Leave-one-chromosome-out (LOCO) cross-validation for EPI-prediction. We demonstrate that a deep learning algorithm, which gives higher accuracies when trained and tested on random-splitting setting, drops drastically in performance under LOCO setting, confirming overestimation of performance. We further propose a novel hybrid deep neural network for EPI-prediction that fuses k-mer features of the nucleotide sequence. We show that the hybrid architecture performs significantly better in the LOCO setting, demonstrating it can learn more generalizable aspects of EP interactions. With this paper we are also releasing the LOCO splitting-based EPI dataset. Research data is available in this public repository: this https URL
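The leave-one-chromosome-out idea described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' released splitting code: the record layout (a dict with a `chrom` key) and the toy data are assumptions made here for the example.

```python
from collections import defaultdict


def loco_splits(pairs):
    """Yield (held_out_chromosome, train_indices, test_indices).

    Each fold holds out every EP pair from one chromosome, so no genomic
    region contributes to both the training and the test subset.
    """
    by_chrom = defaultdict(list)
    for idx, pair in enumerate(pairs):
        by_chrom[pair["chrom"]].append(idx)

    for chrom in sorted(by_chrom):
        test_idx = by_chrom[chrom]
        train_idx = sorted(
            i for c, idxs in by_chrom.items() if c != chrom for i in idxs
        )
        yield chrom, train_idx, test_idx


# Hypothetical toy EP pairs; a real dataset would also carry sequence features.
pairs = [
    {"chrom": "chr1", "label": 1},
    {"chrom": "chr1", "label": 0},
    {"chrom": "chr2", "label": 1},
    {"chrom": "chrX", "label": 0},
]
splits = list(loco_splits(pairs))
```

In contrast to a random split, two pairs from `chr1` can never end up on opposite sides of a fold, which is the source of leakage the abstract describes.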
[LG-44] A Deep Learning Approach to Anomaly Detection in High-Frequency Trading Data
Link: https://arxiv.org/abs/2504.00287
Authors: Qiuliuyang Bao, Jiawei Wang, Hao Gong, Yiwei Zhang, Xiaojun Guo, Hanrui Feng
Categories: Machine Learning (cs.LG)
Abstract:This paper proposes an algorithm based on a staged sliding window Transformer architecture to detect abnormal behaviors in the microstructure of the foreign exchange market, focusing on high-frequency EUR/USD trading data. The method captures multi-scale temporal features through a staged sliding window, extracts global and local dependencies by combining the self-attention mechanism and weighted attention mechanism of the Transformer, and uses a classifier to identify abnormal events. Experimental results on a real high-frequency dataset containing order book depth, spread, and trading volume show that the proposed method significantly outperforms traditional machine learning (such as decision trees and random forests) and deep learning methods (such as MLP, CNN, RNN, LSTM) in terms of accuracy (0.93), F1-Score (0.91), and AUC-ROC (0.95). Ablation experiments verify the contribution of each component, and the visualization of order book depth and anomaly detection further reveals the effectiveness of the model under complex market dynamics. Despite the false positive problem, the model still provides important support for market supervision. In the future, noise processing can be optimized and extended to other markets to improve generalization and real-time performance.
[LG-45] Federated Learning for Cross-Domain Data Privacy: A Distributed Approach to Secure Collaboration
Link: https://arxiv.org/abs/2504.00282
Authors: Yiwei Zhang, Jie Liu, Jiawei Wang, Lu Dai, Fan Guo, Guohui Cai
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract:This paper proposes a data privacy protection framework based on federated learning, which aims to realize effective cross-domain data collaboration under the premise of ensuring data privacy through distributed learning. Federated learning greatly reduces the risk of privacy breaches by training the model locally on each client and sharing only model parameters rather than raw data. The experiment verifies the high efficiency and privacy protection ability of federated learning under different data sources through the simulation of medical, financial, and user data. The results show that federated learning can not only maintain high model performance in a multi-domain data environment but also ensure effective protection of data privacy. The research in this paper provides a new technical path for cross-domain data collaboration and promotes the application of large-scale data analysis and machine learning while protecting privacy.
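To make "sharing only model parameters rather than raw data" concrete, here is a minimal sketch of one server-side aggregation step. The abstract does not say which aggregation rule the paper uses; the size-weighted average below (in the style of FedAvg) is an assumption, and the function name and toy values are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    Only the parameter lists and the sample counts reach the server;
    the raw training data stays on each client.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical clients with a two-parameter model.
client_weights = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [1, 3]
global_weights = federated_average(client_weights, client_sizes)
```

The client with three times as much data pulls the global parameters three times as hard, which is the usual reason for weighting by sample count.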
[LG-46] Over-the-Air Edge Inference via End-to-End Metasurfaces-Integrated Artificial Neural Networks
Link: https://arxiv.org/abs/2504.00233
Authors: Kyriakos Stylianopoulos, Paolo Di Lorenzo, George C. Alexandropoulos
Categories: Machine Learning (cs.LG)
Notes: Submitted for journal publication
Abstract: In the Edge Inference (EI) paradigm, where a Deep Neural Network (DNN) is split across the transceivers to wirelessly communicate goal-defined features in solving a computational task, the wireless medium has been commonly treated as a source of noise. In this paper, motivated by the emerging technologies of Reconfigurable Intelligent Surfaces (RISs) and Stacked Intelligent Metasurfaces (SIM) that offer programmable propagation of wireless signals, either through controllable reflections or diffractions, we optimize the RIS/SIM-enabled smart wireless environment as a means of over-the-air computing, resembling the operations of DNN layers. We propose a framework of Metasurfaces-Integrated Neural Networks (MINNs) for EI, presenting its modeling, training through a backpropagation variation for fading channels, and deployment aspects. The overall end-to-end DNN architecture is general enough to admit RIS and SIM devices, through controllable reconfiguration before each transmission or fixed configurations after training, while both channel-aware and channel-agnostic transceivers are considered. Our numerical evaluation showcases metasurfaces to be instrumental in performing image classification under link budgets that impede conventional communications or metasurface-free systems. It is demonstrated that our MINN framework can significantly simplify EI requirements, achieving near-optimal performance with 50 dB lower testing signal-to-noise ratio compared to training, even without transceiver channel knowledge.
[LG-47] Opportunistic Screening for Pancreatic Cancer using Computed Tomography Imaging and Radiology Reports
Link: https://arxiv.org/abs/2504.00232
Authors: David Le, Ramon Correa-Medero, Amara Tariq, Bhavik Patel, Motoyo Yano, Imon Banerjee
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Notes: 8 pages, 2 figures, AMIA 2025 Annual Symposium
Abstract: Pancreatic ductal adenocarcinoma (PDAC) is a highly aggressive cancer, with most cases diagnosed at stage IV and a five-year overall survival rate below 5%. Early detection and prognosis modeling are crucial for improving patient outcomes and guiding early intervention strategies. In this study, we developed and evaluated a deep learning fusion model that integrates radiology reports and CT imaging to predict PDAC risk. The model achieved a concordance index (C-index) of 0.6750 (95% CI: 0.6429, 0.7121) and 0.6435 (95% CI: 0.6055, 0.6789) on the internal and external datasets, respectively, for 5-year survival risk estimation. Kaplan-Meier analysis demonstrated significant separation (p < 0.0001) between the low and high risk groups predicted by the fusion model. These findings highlight the potential of deep learning-based survival models in leveraging clinical and imaging data for pancreatic cancer.
[LG-48] A machine learning platform for development of low flammability polymers
Link: https://arxiv.org/abs/2504.00223
Authors: Duy Nhat Phan, Alexander B. Morgan, Lokendra Poudel, Rahul Bhowmik
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Abstract: Flammability index (FI) and cone calorimetry outcomes, such as maximum heat release rate, time to ignition, total smoke release, and fire growth rate, are critical factors in evaluating the fire safety of polymers. However, predicting these properties is challenging due to the complexity of material behavior under heat exposure. In this work, we investigate the use of machine learning (ML) techniques to predict these flammability metrics. We generated synthetic polymers using Synthetic Data Vault to augment the experimental dataset. Our comprehensive ML investigation employed both our polymer descriptors and those generated by the RDKit library. Despite the challenges of limited experimental data, our models demonstrate the potential to accurately predict FI and cone calorimetry outcomes, which could be instrumental in designing safer polymers. Additionally, we developed POLYCOMPRED, a module integrated into the cloud-based MatVerse platform, providing an accessible, web-based interface for flammability prediction. This work provides not only the predictive modeling of polymer flammability but also an interactive analysis tool for the discovery and design of new materials with tailored fire-resistant properties.
[LG-49] Discriminative Subspace Emersion from learning feature relevances across different populations
Link: https://arxiv.org/abs/2504.00176
Authors: Marco Canducci, Lida Abdi, Alessandro Prete, Roland J. Veen, Michael Biehl, Wiebke Arlt, Peter Tino
Categories: Machine Learning (cs.LG)
Abstract:In a given classification task, the accuracy of the learner is often hampered by finiteness of the training set, high-dimensionality of the feature space and severe overlap between classes. In the context of interpretable learners, with (piecewise) linear separation boundaries, these issues can be mitigated by careful construction of optimization procedures and/or estimation of relevant features for the task. However, when the task is shared across two disjoint populations the main interest is shifted towards estimating a set of features that discriminate the most between the two, when performing classification. We propose a new Discriminative Subspace Emersion (DSE) method to extend subspace learning toward a general relevance learning framework. DSE allows us to identify the most relevant features in distinguishing the classification task across two populations, even in cases of high overlap between classes. The proposed methodology is designed to work with multiple sets of labels and is derived in principle without being tied to a specific choice of base learner. Theoretical and empirical investigations over synthetic and real-world datasets indicate that DSE accurately identifies a common subspace for the classification across different populations. This is shown to be true for a surprisingly high degree of overlap between classes.
[LG-50] Nuclear Microreactor Control with Deep Reinforcement Learning
Link: https://arxiv.org/abs/2504.00156
Authors: Leo Tunkle, Kamal Abdulraheem, Linyu Lin, Majdi I. Radaideh
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: 28 pages, 11 figures, 2 tables
Abstract:The economic feasibility of nuclear microreactors will depend on minimizing operating costs through advancements in autonomous control, especially when these microreactors are operating alongside other types of energy systems (e.g., renewable energy). This study explores the application of deep reinforcement learning (RL) for real-time drum control in microreactors, exploring performance in regard to load-following scenarios. By leveraging a point kinetics model with thermal and xenon feedback, we first establish a baseline using a single-output RL agent, then compare it against a traditional proportional-integral-derivative (PID) controller. This study demonstrates that RL controllers, including both single- and multi-agent RL (MARL) frameworks, can achieve similar or even superior load-following performance as traditional PID control across a range of load-following scenarios. In short transients, the RL agent was able to reduce the tracking error rate in comparison to PID. Over extended 300-minute load-following scenarios in which xenon feedback becomes a dominant factor, PID maintained better accuracy, but RL still remained within a 1% error margin despite being trained only on short-duration scenarios. This highlights RL’s strong ability to generalize and extrapolate to longer, more complex transients, affording substantial reductions in training costs and reduced overfitting. Furthermore, when control was extended to multiple drums, MARL enabled independent drum control as well as maintained reactor symmetry constraints without sacrificing performance – an objective that standard single-agent RL could not learn. We also found that, as increasing levels of Gaussian noise were added to the power measurements, the RL controllers were able to maintain lower error rates than PID, and to do so with less control effort.
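For readers unfamiliar with the PID baseline that the RL agents are compared against, a textbook discrete-time PID controller can be sketched as below. The gains, time step, and the way the output would map to drum position are all assumptions for illustration; none of them come from the paper.

```python
class PID:
    """Textbook discrete PID: u = kp*e + ki*integral(e) + kd*de/dt."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measurement):
        # Error between the demanded and the measured value (e.g. power).
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical gains; two control steps of a load-following loop.
pid = PID(kp=2.0, ki=0.5, kd=0.0, dt=1.0)
u1 = pid.step(setpoint=1.0, measurement=0.0)
u2 = pid.step(setpoint=1.0, measurement=0.5)
```

The derivative term acts directly on the measurement error, which is one reason a PID loop can react strongly to noisy power measurements such as those in the abstract's noise experiment.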
[LG-51] Why risk matters for protein binder design ICLR
Link: https://arxiv.org/abs/2504.00146
Authors: Tudor Cotet, Igor Krawczuk
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Notes: 10 pages, 5 figures, 1 table, presented at ICLR GEM Workshop this https URL
Abstract:Bayesian optimization (BO) has recently become more prevalent in protein engineering applications and hence has become a fruitful target of benchmarks. However, current BO comparisons often overlook real-world considerations like risk and cost constraints. In this work, we compare 72 model combinations of encodings, surrogate models, and acquisition functions on 11 protein binder fitness landscapes, specifically from this perspective. Drawing from the portfolio optimization literature, we adopt metrics to quantify the cold-start performance relative to a random baseline, to assess the risk of an optimization campaign, and to calculate the overall budget required to reach a fitness threshold. Our results suggest the existence of Pareto-optimal models on the risk-performance axis, the shift of this preference depending on the landscape explored, and the robust correlation between landscape properties such as epistasis with the average and worst-case model performance. They also highlight that rigorous model selection requires substantial computational and statistical efforts.
[LG-52] EMForecaster: A Deep Learning Framework for Time Series Forecasting in Wireless Networks with Distribution-Free Uncertainty Quantification
Link: https://arxiv.org/abs/2504.00120
Authors: Xavier Mootoo, Hina Tabassum, Luca Chiaraviglio
Categories: Machine Learning (cs.LG)
Abstract: With the recent advancements in wireless technologies, forecasting electromagnetic field (EMF) exposure has become critical to enable proactive network spectrum and power allocation, as well as network deployment planning. In this paper, we develop a deep learning (DL) time series forecasting framework referred to as EMForecaster. The proposed DL architecture employs patching to process temporal patterns at multiple scales, complemented by reversible instance normalization and mixing operations along both temporal and patch dimensions for efficient feature extraction. We augment EMForecaster with a conformal prediction mechanism, which is independent of the data distribution, to enhance the trustworthiness of model predictions via uncertainty quantification of forecasts. This conformal prediction mechanism ensures that the ground truth lies within a prediction interval with target error rate α, where 1 - α is referred to as coverage. However, a trade-off exists, as increasing coverage often results in wider prediction intervals. To address this challenge, we propose a new metric called the Trade-off Score, that balances trustworthiness of the forecast (i.e., coverage) and the width of prediction interval. Our experiments demonstrate that EMForecaster achieves superior performance across diverse EMF datasets, spanning both short-term and long-term prediction horizons. In point forecasting tasks, EMForecaster substantially outperforms current state-of-the-art DL approaches, showing improvements of 53.97% over the Transformer architecture and 38.44% over the average of all baseline models. EMForecaster also exhibits an excellent balance between prediction interval width and coverage in conformal forecasting, measured by the Trade-off Score, showing marked improvements of 24.73% over the average baseline and 49.17% over the Transformer architecture.
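The coverage guarantee mentioned in this abstract can be illustrated with split conformal prediction. The abstract does not specify which conformal variant EMForecaster uses, so the absolute-residual score and the held-out calibration set below are assumptions, and all function names and numbers are hypothetical.

```python
import math


def conformal_halfwidth(calib_residuals, alpha):
    """Half-width q such that [forecast - q, forecast + q] covers the truth
    with probability at least 1 - alpha, assuming exchangeable data."""
    scores = sorted(abs(r) for r in calib_residuals)
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return scores[rank - 1]


def prediction_interval(point_forecast, halfwidth):
    return point_forecast - halfwidth, point_forecast + halfwidth


# Hypothetical calibration residuals (actual minus forecast).
residuals = [0.25, -1.0, 0.5, -0.75, 1.25, 0.0, -0.5]
q = conformal_halfwidth(residuals, alpha=0.25)
low, high = prediction_interval(3.0, q)
```

Lowering `alpha` raises the rank and therefore the half-width, which is the coverage versus interval-width trade-off the abstract refers to.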
[LG-53] Enhancing Time Series Forecasting with Fuzzy Attention-Integrated Transformers
Link: https://arxiv.org/abs/2504.00070
Authors: Sanjay Chakraborty, Fredrik Heintz
Categories: Machine Learning (cs.LG)
Abstract: This paper introduces FANTF (Fuzzy Attention Network-Based Transformers), a novel approach that integrates fuzzy logic with existing transformer architectures to advance time series forecasting, classification, and anomaly detection tasks. FANTF leverages a proposed fuzzy attention mechanism incorporating fuzzy membership functions to handle uncertainty and imprecision in noisy and ambiguous time series data. The FANTF approach enhances its ability to capture complex temporal dependencies and multivariate relationships by embedding fuzzy logic principles into the self-attention module of the existing transformer’s architecture. The framework combines fuzzy-enhanced attention with a set of existing benchmark transformer-based architectures to provide efficient predictions, classification and anomaly detection. Specifically, FANTF generates learnable fuzziness attention scores that highlight the relative importance of temporal features and data points, offering insights into its decision-making process. Experimental evaluations on several real-world datasets reveal that FANTF significantly enhances the performance of forecasting, classification, and anomaly detection tasks over traditional transformer-based models.
[LG-54] Integrating Quantum-Classical Attention in Patch Transformers for Enhanced Time Series Forecasting
Link: https://arxiv.org/abs/2504.00068
Authors: Sanjay Chakraborty, Fredrik Heintz
Categories: Machine Learning (cs.LG)
Abstract:QCAAPatchTF is a quantum attention network integrated with an advanced patch-based transformer, designed for multivariate time series forecasting, classification, and anomaly detection. Leveraging quantum superpositions, entanglement, and variational quantum eigensolver principles, the model introduces a quantum-classical hybrid self-attention mechanism to capture multivariate correlations across time points. For multivariate long-term time series, the quantum self-attention mechanism can reduce computational complexity while maintaining temporal relationships. It then applies the quantum-classical hybrid self-attention mechanism alongside a feed-forward network in the encoder stage of the advanced patch-based transformer. While the feed-forward network learns nonlinear representations for each variable frame, the quantum self-attention mechanism processes individual series to enhance multivariate relationships. The advanced patch-based transformer computes the optimized patch length by dividing the sequence length into a fixed number of patches instead of using an arbitrary set of values. The stride is then set to half of the patch length to ensure efficient overlapping representations while maintaining temporal continuity. QCAAPatchTF achieves state-of-the-art performance in both long-term and short-term forecasting, classification, and anomaly detection tasks, demonstrating state-of-the-art accuracy and efficiency on complex real-world datasets.
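The patching rule in this abstract (patch length from a fixed number of patches, stride equal to half the patch length) can be sketched as follows. The integer division, the choice of six patches, and the function names are assumptions made for illustration; the abstract does not give these details.

```python
def patch_config(seq_len, num_patches):
    """Derive patch length from a fixed patch count, stride = half of it."""
    patch_len = max(1, seq_len // num_patches)
    stride = max(1, patch_len // 2)
    return patch_len, stride


def make_patches(series, patch_len, stride):
    """Slice a series into overlapping windows of length patch_len."""
    last_start = len(series) - patch_len
    return [series[s:s + patch_len] for s in range(0, last_start + 1, stride)]


# Hypothetical input: 96 time steps, patch length derived from 6 patches.
series = list(range(96))
patch_len, stride = patch_config(len(series), num_patches=6)
patches = make_patches(series, patch_len, stride)
```

Because consecutive windows overlap by half, the encoder sees roughly twice as many patches as the fixed count used to derive the patch length (11 here rather than 6).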
[LG-55] ModelRadar: Aspect-based Forecast Evaluation
Link: https://arxiv.org/abs/2504.00059
Authors: Vitor Cerqueira, Luis Roque, Carlos Soares
Categories: Machine Learning (cs.LG)
Abstract:Accurate evaluation of forecasting models is essential for ensuring reliable predictions. Current practices for evaluating and comparing forecasting models focus on summarising performance into a single score, using metrics such as SMAPE. While convenient, averaging performance over all samples dilutes relevant information about model behavior under varying conditions. This limitation is especially problematic for time series forecasting, where multiple layers of averaging–across time steps, horizons, and multiple time series in a dataset–can mask relevant performance variations. We address this limitation by proposing ModelRadar, a framework for evaluating univariate time series forecasting models across multiple aspects, such as stationarity, presence of anomalies, or forecasting horizons. We demonstrate the advantages of this framework by comparing 24 forecasting methods, including classical approaches and different machine learning algorithms. NHITS, a state-of-the-art neural network architecture, performs best overall but its superiority varies with forecasting conditions. For instance, concerning the forecasting horizon, we found that NHITS (and also other neural networks) only outperforms classical approaches for multi-step ahead forecasting. Another relevant insight is that classical approaches such as ETS or Theta are notably more robust in the presence of anomalies. These and other findings highlight the importance of aspect-based model evaluation for both practitioners and researchers. ModelRadar is available as a Python package.
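The point about averaging hiding condition-specific behaviour can be shown with a small sketch that computes SMAPE overall and per aspect. This does not use the ModelRadar package or its API; the record layout, the `stationary` flag, and the toy numbers are assumptions for illustration.

```python
from collections import defaultdict


def smape(actual, forecast):
    """Symmetric MAPE in percent (range 0 to 200)."""
    terms = []
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        terms.append(0.0 if denom == 0 else abs(a - f) / denom)
    return 100 * sum(terms) / len(terms)


def smape_by_aspect(records, aspect):
    """Average per-series SMAPE within each value of the given aspect."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[aspect]].append(smape(rec["actual"], rec["forecast"]))
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Hypothetical series: perfect on the stationary one, poor on the other.
records = [
    {"stationary": True, "actual": [100, 100], "forecast": [100, 100]},
    {"stationary": False, "actual": [100, 100], "forecast": [300, 300]},
]
overall = sum(smape(r["actual"], r["forecast"]) for r in records) / len(records)
per_aspect = smape_by_aspect(records, "stationary")
```

The single overall score of 50 says nothing about the fact that all of the error comes from the non-stationary series, which is what the per-aspect breakdown exposes.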
[LG-56] Imbalanced malware classification: an approach based on dynamic classifier selection
Link: https://arxiv.org/abs/2504.00041
Authors: J. V. S. Souza, C. B. Vieira, G. D. C. Cunha, R. M. O. Cruz
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes: Short paper accepted at SSCI 2025. 4 pages + 1 reference page, 3 figures, 1 table
Abstract:In recent years, the rise of cyber threats has emphasized the need for robust malware detection systems, especially on mobile devices. Malware, which targets vulnerabilities in devices and user data, represents a substantial security risk. A significant challenge in malware detection is the imbalance in datasets, where most applications are benign, with only a small fraction posing a threat. This study addresses the often-overlooked issue of class imbalance in malware detection by evaluating various machine learning strategies for detecting malware in Android applications. We assess monolithic classifiers and ensemble methods, focusing on dynamic selection algorithms, which have shown superior performance compared to traditional approaches. In contrast to balancing strategies performed on the whole dataset, we propose a balancing procedure that works individually for each classifier in the pool. Our empirical analysis demonstrates that the KNOP algorithm obtained the best results using a pool of Random Forest. Additionally, an instance hardness assessment revealed that balancing reduces the difficulty of the minority class and enhances the detection of the minority class (malware). The code used for the experiments is available at this https URL.
[LG-57] SandboxEval: Towards Securing Test Environment for Untrusted Code
Link: https://arxiv.org/abs/2504.00018
Authors: Rafiqul Rabin, Jesse Hostetler, Sean McGregor, Brett Weir, Nick Judd
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes: preliminary version, working paper
Abstract:While large language models (LLMs) are powerful assistants in programming tasks, they may also produce malicious code. Testing LLM-generated code therefore poses significant risks to assessment infrastructure tasked with executing untrusted code. To address these risks, this work focuses on evaluating the security and confidentiality properties of test environments, reducing the risk that LLM-generated code may compromise the assessment infrastructure. We introduce SandboxEval, a test suite featuring manually crafted test cases that simulate real-world safety scenarios for LLM assessment environments in the context of untrusted code execution. The suite evaluates vulnerabilities to sensitive information exposure, filesystem manipulation, external communication, and other potentially dangerous operations in the course of assessment activity. We demonstrate the utility of SandboxEval by deploying it on an open-source implementation of Dyff, an established AI assessment framework used to evaluate the safety of LLMs at scale. We show, first, that the test suite accurately describes limitations placed on an LLM operating under instructions to generate malicious code. Second, we show that the test results provide valuable insights for developers seeking to harden assessment infrastructure and identify risks associated with LLM execution activities.
[LG-58] I'm Sorry Dave: How the old world of personnel security can inform the new world of AI insider risk
Link: https://arxiv.org/abs/2504.00012
Authors: Paul Martin, Sarah Mercer
Categories: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract:Organisations are rapidly adopting artificial intelligence (AI) tools to perform tasks previously undertaken by people. The potential benefits are enormous. Separately, some organisations deploy personnel security measures to mitigate the security risks arising from trusted human insiders. Unfortunately, there is no meaningful interplay between the rapidly evolving domain of AI and the traditional world of personnel security. This is a problem. The complex risks from human insiders are hard enough to understand and manage, despite many decades of effort. The emerging security risks from AI insiders are even more opaque. Both sides need all the help they can get. Some of the concepts and approaches that have proved useful in dealing with human insiders are also applicable to the emerging risks from AI insiders. Furthermore, AI can be used defensively to protect against both human and AI insiders.
[LG-59] LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration
Link: https://arxiv.org/abs/2504.00010
Authors: Yuyao Zhang, Jinghao Li, Yu-Wing Tai
Categories: Machine Learning (cs.LG); Graphics (cs.GR); Multiagent Systems (cs.MA)
Comments: 23 pages
Abstract: Text-to-image generation (T2I) has become a key area of research with broad applications. However, existing methods often struggle with complex spatial relationships and fine-grained control over multiple concepts. Many existing approaches require significant architectural modifications, extensive training, or expert-level prompt engineering. To address these challenges, we introduce LayerCraft, an automated framework that leverages large language models (LLMs) as autonomous agents for structured procedural generation. LayerCraft enables users to customize objects within an image and supports narrative-driven creation with minimal effort. At its core, the system includes a coordinator agent that directs the process, along with two specialized agents: ChainArchitect, which employs chain-of-thought (CoT) reasoning to generate a dependency-aware 3D layout for precise instance-level control, and the Object-Integration Network (OIN), which utilizes LoRA fine-tuning on pre-trained T2I models to seamlessly blend objects into specified regions of an image based on textual prompts without requiring architectural changes. Extensive evaluations demonstrate LayerCraft’s versatility in applications ranging from multi-concept customization to storytelling. By providing non-experts with intuitive, precise control over T2I generation, our framework democratizes creative image creation. Our code will be released upon acceptance.
[LG-60] Diffusion-model approach to flavor models: A case study for S_4^\prime modular flavor model
Link: https://arxiv.org/abs/2504.00944
Authors: Satsuki Nishimura, Hajime Otsuka, Haruki Uchiyama
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
Comments: 19 pages, 2 figures
Abstract:We propose a numerical method of searching for parameters with experimental constraints in generic flavor models by utilizing diffusion models, which are classified as a type of generative artificial intelligence (generative AI). As a specific example, we consider the S_4^\prime modular flavor model and construct a neural network that reproduces quark masses, the CKM matrix, and the Jarlskog invariant by treating free parameters in the flavor model as generating targets. By generating new parameters with the trained network, we find various phenomenologically interesting parameter regions where an analytical evaluation of the S_4^\prime model is challenging. Additionally, we confirm that the spontaneous CP violation occurs in the S_4^\prime model. The diffusion model enables an inverse problem approach, allowing the machine to provide a series of plausible model parameters from given experimental data. Moreover, it can serve as a versatile analytical tool for extracting new physical predictions from flavor models.
[LG-61] Privacy-Preserving Transfer Learning for Community Detection using Locally Distributed Multiple Networks
Link: https://arxiv.org/abs/2504.00890
Authors: Xiao Guo, Xuming He, Xiangyu Chang, Shujie Ma
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:This paper develops a new spectral clustering-based method called TransNet for transfer learning in community detection of network data. Our goal is to improve the clustering performance of the target network using auxiliary source networks, which are heterogeneous, privacy-preserved, and locally stored across various sources. The edges of each locally stored network are perturbed using the randomized response mechanism to achieve differential privacy. Notably, we allow the source networks to have distinct privacy-preserving and heterogeneity levels as often desired in practice. To better utilize the information from the source networks, we propose a novel adaptive weighting method to aggregate the eigenspaces of the source networks multiplied by adaptive weights chosen to incorporate the effects of privacy and heterogeneity. We propose a regularization method that combines the weighted average eigenspace of the source networks with the eigenspace of the target network to achieve an optimal balance between them. Theoretically, we show that the adaptive weighting method enjoys the error-bound-oracle property in the sense that the error bound of the estimated eigenspace only depends on informative source networks. We also demonstrate that TransNet performs better than the estimator using only the target network and the estimator using only the weighted source networks.
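The edge-perturbation step mentioned in this abstract can be illustrated with a generic randomized response sketch. This is not the TransNet implementation: the function names (`keep_probability`, `randomized_response`, `debias`), the parameterisation by a single privacy budget `epsilon`, and the debiasing formula are assumptions based on the textbook form of randomized response for an undirected, unweighted graph.

```python
import math
import random


def keep_probability(epsilon):
    """Probability of reporting an edge indicator truthfully (assumed textbook form)."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(adj, epsilon, rng):
    """Flip each upper-triangular entry of a 0/1 adjacency matrix independently.

    Each entry is kept with probability e^eps / (1 + e^eps) and flipped otherwise.
    The result is mirrored so the perturbed graph stays undirected, and the
    diagonal is left at zero (no self-loops).
    """
    n = len(adj)
    keep = keep_probability(epsilon)
    noisy = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bit = adj[i][j]
            if rng.random() >= keep:
                bit = 1 - bit  # flip the edge indicator
            noisy[i][j] = noisy[j][i] = bit
    return noisy


def debias(noisy, epsilon):
    """Unbiased estimate of the original adjacency matrix (requires epsilon > 0).

    Since E[noisy] = keep * A + (1 - keep) * (1 - A), solving for A gives
    A_hat = (noisy - (1 - keep)) / (2 * keep - 1) off the diagonal.
    """
    keep = keep_probability(epsilon)
    n = len(noisy)
    return [
        [0.0 if i == j else (noisy[i][j] - (1.0 - keep)) / (2.0 * keep - 1.0)
         for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    perturbed = randomized_response(graph, epsilon=1.0, rng=random.Random(0))
    print(perturbed)
    print(debias(perturbed, epsilon=1.0))
```

A smaller `epsilon` flips more entries and gives stronger privacy, which is one plausible reason a method would weight source networks differently depending on their privacy level.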
[LG-62] Spingarn's Method and Progressive Decoupling Beyond Elicitable Monotonicity
Link: https://arxiv.org/abs/2504.00836
Authors: Brecht Evens, Puya Latafat, Panagiotis Patrinos
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Abstract:Spingarn’s method of partial inverses and the progressive decoupling algorithm address inclusion problems involving the sum of an operator and the normal cone of a linear subspace, known as linkage problems. Despite their success, existing convergence results are limited to the so-called elicitable monotone setting, where nonmonotonicity is allowed only on the orthogonal complement of the linkage subspace. In this paper, we introduce progressive decoupling+, a generalized version of standard progressive decoupling that incorporates separate relaxation parameters for the linkage subspace and its orthogonal complement. We prove convergence under conditions that link the relaxation parameters to the nonmonotonicity of their respective subspaces and show that the special cases of Spingarn’s method and standard progressive decoupling also extend beyond the elicitable monotone setting. Our analysis hinges upon an equivalence between progressive decoupling+ and the preconditioned proximal point algorithm, for which we develop a general local convergence analysis in a certain nonmonotone setting.
[LG-63] Communication-Efficient l_0 Penalized Least Square
Link: https://arxiv.org/abs/2504.00722
Authors: Chenqi Gong, Hu Yang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract: In this paper, we propose a communication-efficient penalized regression algorithm for high-dimensional sparse linear regression models with massive data. This approach incorporates an optimized distributed system communication algorithm, named the CESDAR algorithm, based on the Enhanced Support Detection and Root finding algorithm. The CESDAR algorithm leverages data distributed across multiple machines to compute and update the active set and introduces the communication-efficient surrogate likelihood framework to approximate the optimal solution for the full sample on the active set. This avoids raw data transmission, which enhances privacy and data security, while significantly improving algorithm execution speed and substantially reducing communication costs. Notably, this approach achieves the same statistical accuracy as the global estimator. Furthermore, this paper explores an extended version of CESDAR and an adaptive version of CESDAR to enhance algorithmic speed and optimize parameter selection, respectively. Simulations and real-data benchmark experiments demonstrate the efficiency and accuracy of the CESDAR algorithm.
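For readers unfamiliar with the problem named in the title, the l_0 penalized least squares objective is commonly written as below. The notation is an assumption, not taken from the paper: y is the response vector of length n, X the n-by-p design matrix, lambda > 0 the penalty level, and the 1/(2n) scaling is one conventional choice.

```latex
\hat{\beta} \in \arg\min_{\beta \in \mathbb{R}^{p}}
  \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_0 ,
\qquad
\lVert \beta \rVert_0 = \#\{\, j : \beta_j \neq 0 \,\}
```

The active set referred to in the abstract corresponds, in this notation, to the index set of nonzero coefficients of the current iterate.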
[LG-64] Deep Learning Model Predictive Control for Deep Brain Stimulation in Parkinson's Disease
Link: https://arxiv.org/abs/2504.00618
Authors: Sebastian Steffen, Mark Cannon
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Abstract:We present a nonlinear data-driven Model Predictive Control (MPC) algorithm for deep brain stimulation (DBS) for the treatment of Parkinson’s disease (PD). Although DBS is typically implemented in open-loop, closed-loop DBS (CLDBS) uses the amplitude of neural oscillations in specific frequency bands (e.g. beta 13-30 Hz) as a feedback signal, resulting in improved treatment outcomes with reduced side effects and slower rates of patient habituation to stimulation. To date, CLDBS has only been implemented in vivo with simple control algorithms, such as proportional or proportional-integral control. Our approach employs a multi-step predictor based on differences of input-convex neural networks to model the future evolution of beta oscillations. The use of a multi-step predictor enhances prediction accuracy over the optimization horizon and simplifies online computation. In tests using a simulated model of beta-band activity response and data from PD patients, we achieve reductions of more than 20% in both tracking error and control activity in comparison with existing CLDBS algorithms. The proposed control strategy provides a generalizable data-driven technique that can be applied to the treatment of PD and other diseases targeted by CLDBS, as well as to other neuromodulation techniques.
[LG-65] Near Field Localization via AI-Aided Subspace Methods
Link: https://arxiv.org/abs/2504.00599
Authors: Arad Gast, Luc Le Magoarou, Nir Shlezinger
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Under review for publication in the IEEE Transactions on Wireless Communications
Abstract:The increasing demands for high-throughput and energy-efficient wireless communications are driving the adoption of extremely large antennas operating at high-frequency bands. In these regimes, multiple users will reside in the radiative near-field, and accurate localization becomes essential. Unlike conventional far-field systems that rely solely on DOA estimation, near-field localization exploits spherical wavefront propagation to recover both DOA and range information. While subspace-based methods, such as MUSIC and its extensions, offer high resolution and interpretability for near-field localization, their performance is significantly impacted by model assumptions, including non-coherent sources, well-calibrated arrays, and a sufficient number of snapshots. To address these limitations, this work proposes AI-aided subspace methods for near-field localization that enhance robustness to real-world challenges. Specifically, we introduce NF-SubspaceNet, a deep learning-augmented 2D MUSIC algorithm that learns a surrogate covariance matrix to improve localization under challenging conditions, and DCD-MUSIC, a cascaded AI-aided approach that decouples angle and range estimation to reduce computational complexity. We further develop a novel model-order-aware training method to accurately estimate the number of sources, that is combined with casting of near field subspace methods as AI models for learning. Extensive simulations demonstrate that the proposed methods outperform classical and existing deep-learning-based localization techniques, providing robust near-field localization even under coherent sources, miscalibrations, and few snapshots.
[LG-66] Flow Matching on Lie Groups
Link: https://arxiv.org/abs/2504.00494
Authors: Finn M. Sherry, Bart M.N. Smets
Categories: Differential Geometry (math.DG); Machine Learning (cs.LG)
Comments: Submitted to the 7th International Conference on Geometric Science of Information
Abstract: Flow Matching (FM) is a recent generative modelling technique: we aim to learn how to sample from distribution \mathfrak{X}_1 by flowing samples from some distribution \mathfrak{X}_0 that is easy to sample from. The key trick is that this flow field can be trained while conditioning on the end point in \mathfrak{X}_1: given an end point, simply move along a straight line segment to the end point (Lipman et al. 2022). However, straight line segments are only well-defined on Euclidean space. Consequently, Chen and Lipman (2023) generalised the method to FM on Riemannian manifolds, replacing line segments with geodesics or their spectral approximations. We take an alternative point of view: we generalise to FM on Lie groups by instead substituting exponential curves for line segments. This leads to a simple, intrinsic, and fast implementation for many matrix Lie groups, since the required Lie group operations (products, inverses, exponentials, logarithms) are simply given by the corresponding matrix operations. FM on Lie groups could then be used for generative modelling with data consisting of sets of features (in \mathbb{R}^n) and poses (in some Lie group), e.g. the latent codes of Equivariant Neural Fields (Wessels et al. 2025).
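The idea of replacing straight line segments with exponential curves can be made concrete on the rotation group SO(3). The sketch below is an illustration under stated assumptions, not the authors' code: it takes the exponential curve from g0 to g1 to be g0 · exp(t · log(g0⁻¹ g1)), uses closed-form (Rodrigues) formulas for the matrix exponential and logarithm, and assumes the relative rotation angle is strictly below π so the logarithm is well defined. All function names are hypothetical.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    """Transpose of a 3x3 matrix; for a rotation this is also its inverse."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def so3_log(r):
    """Matrix logarithm of a rotation with angle below pi (skew-symmetric result)."""
    cos_theta = max(-1.0, min(1.0, (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    scale = 0.5 if theta < 1e-12 else theta / (2.0 * math.sin(theta))
    return [[scale * (r[i][j] - r[j][i]) for j in range(3)] for i in range(3)]


def so3_exp(a):
    """Matrix exponential of a skew-symmetric 3x3 matrix via Rodrigues' formula."""
    theta = math.sqrt(a[2][1] ** 2 + a[0][2] ** 2 + a[1][0] ** 2)
    if theta < 1e-12:
        c1, c2 = 1.0, 0.5  # small-angle limits of the two coefficients
    else:
        c1 = math.sin(theta) / theta
        c2 = (1.0 - math.cos(theta)) / theta ** 2
    a2 = matmul(a, a)
    return [[IDENTITY[i][j] + c1 * a[i][j] + c2 * a2[i][j] for j in range(3)]
            for i in range(3)]


def exponential_curve(g0, g1, t):
    """Point at time t on the curve g0 * exp(t * log(g0^-1 g1)), with t in [0, 1]."""
    generator = so3_log(matmul(transpose(g0), g1))
    scaled = [[t * x for x in row] for row in generator]
    return matmul(g0, so3_exp(scaled))


def rot_z(angle):
    """Rotation about the z-axis, used here only to build example inputs."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


if __name__ == "__main__":
    for row in exponential_curve(IDENTITY, rot_z(math.pi / 2), 0.5):
        print(row)
```

In a flow matching setup the regression target at time t would be derived from the velocity of this curve; the sketch only shows the interpolation itself, built from the group operations the abstract lists (products, inverses, exponentials, logarithms).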
[LG-67] CopyQNN: Quantum Neural Network Extraction Attack under Varying Quantum Noise
Link: https://arxiv.org/abs/2504.00366
Authors: Zhenxiao Fu, Leyi Zhao, Xuhong Zhang, Yilun Xu, Gang Huang, Fan Chen
Categories: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:Quantum Neural Networks (QNNs) have shown significant value across domains, with well-trained QNNs representing critical intellectual property often deployed via cloud-based QNN-as-a-Service (QNNaaS) platforms. Recent work has examined QNN model extraction attacks using classical and emerging quantum strategies. These attacks involve adversaries querying QNNaaS platforms to obtain labeled data for training local substitute QNNs that replicate the functionality of cloud-based models. However, existing approaches have largely overlooked the impact of varying quantum noise inherent in noisy intermediate-scale quantum (NISQ) computers, limiting their effectiveness in real-world settings. To address this limitation, we propose the CopyQNN framework, which employs a three-step data cleaning method to eliminate noisy data based on its noise sensitivity. This is followed by the integration of contrastive and transfer learning within the quantum domain, enabling efficient training of substitute QNNs using a limited but cleaned set of queried data. Experimental results on NISQ computers demonstrate that a practical implementation of CopyQNN significantly outperforms state-of-the-art QNN extraction attacks, achieving an average performance improvement of 8.73% across all tasks while reducing the number of required queries by 90x, with only a modest increase in hardware overhead.
[LG-68] Using machine learning method for variable star classification using the TESS Sectors 1-57 data
Link: https://arxiv.org/abs/2504.00347
Authors: Li-Heng Wang, Kai Li, Xiang Gao, Ya-Ni Guo, Guo-You Sun
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
Comments: 15 pages, 12 figures, 3 tables, accepted by ApJ, Data available via China-VO PaperData repository
Abstract: The Transiting Exoplanet Survey Satellite (TESS) is a wide-field all-sky survey mission designed to detect Earth-sized exoplanets. After over four years of photometric surveys, data from sectors 1-57, including approximately 1,050,000 light curves with a 2-minute cadence, were collected. By cross-matching the data with Gaia’s variable star catalogue, we obtained labeled datasets for further analysis. Using a random forest classifier, we performed classification of variable stars and designed distinct classification processes for each subclass; 6770 EA, 2971 EW, 980 CEP, 8347 DSCT, 457 RRab, 404 RRc and 12348 ROT were identified. Each variable star was visually inspected to ensure the reliability and accuracy of the compiled catalog. Subsequently, we obtained 6046 EA, 3859 EW, 2058 CEP, 8434 DSCT, 482 RRab, 416 RRc, and 9694 ROT, and a total of 14092 new variable stars were discovered.
[LG-69] Plane-Wave Decomposition and Randomised Training; a Novel Path to Generalised PINNs for SHM
Link: https://arxiv.org/abs/2504.00249
Authors: Rory Clements, James Ellis, Geoff Hassall, Simon Horsley
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments: 17 pages, 17 figures
Abstract:In this paper, we introduce a formulation of Physics-Informed Neural Networks (PINNs), based on learning the form of the Fourier decomposition, and a training methodology based on a spread of randomly chosen boundary conditions. By training in this way we produce a PINN that generalises; after training it can be used to correctly predict the solution for an arbitrary set of boundary conditions and interpolate this solution between the samples that spanned the training domain. We demonstrate for a toy system of two coupled oscillators that this gives the PINN formulation genuine predictive capability owing to an effective reduction of the training to evaluation times ratio due to this decoupling of the solution from specific boundary conditions.
[LG-70] Improving Predictions of Convective Storm Wind Gusts through Statistical Post-Processing of Neural Weather Models
Link: https://arxiv.org/abs/2504.00128
Authors: Antoine Leclerc, Erwan Koch, Monika Feldmann, Daniele Nerini, Tom Beucler
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 16 pages, 10 figures, 3 tables, submitted to npj Natural Hazards
Abstract:Issuing timely severe weather warnings helps mitigate potentially disastrous consequences. Recent advancements in Neural Weather Models (NWMs) offer a computationally inexpensive and fast approach for forecasting atmospheric environments on a 0.25° global grid. For thunderstorms, these environments can be empirically post-processed to predict wind gust distributions at specific locations. With the Pangu-Weather NWM, we apply a hierarchy of statistical and deep learning post-processing methods to forecast hourly wind gusts up to three days ahead. To ensure statistical robustness, we constrain our probabilistic forecasts using generalised extreme-value distributions across five regions in Switzerland. Using a convolutional neural network to post-process the predicted atmospheric environment’s spatial patterns yields the best results, outperforming direct forecasting approaches across lead times and wind gust speeds. Our results confirm the added value of NWMs for extreme wind forecasting, especially for designing more responsive early-warning systems.
[LG-71] Quantum Generative Models for Image Generation: Insights from MNIST and MedMNIST
Link: https://arxiv.org/abs/2504.00034
Authors: Chi-Sheng Chen, Wei An Hou, Siang-Wei Hu, Zhen-Sheng Cai
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Abstract:Research on quantum generative models is currently in its early exploratory stages, with very few established methodologies. In this paper, we propose a novel hybrid quantum generative model based on variational quantum circuits for image generation tasks, introducing innovative noise techniques specifically tailored for quantum computation. Our approach utilizes two distinctive noise strategies: quantum-generated noise inherent to quantum circuits, and a newly developed noise scheduling method, applying different noise levels strategically across time steps during the training process. Experiments conducted on MNIST and MedMNIST datasets demonstrate that our hybrid quantum model, combined with these specialized noise techniques, achieves promising results, suggesting improved generative performance compared to baseline quantum generative approaches. This exploratory work lays a critical foundation and opens new avenues for advancing quantum generative modeling research.
[LG-72] Four Things People Should Know About Migraines
Link: https://arxiv.org/abs/2504.00011
Authors: Mohammad S. Parsa, Lukasz Golab
Categories: Physics and Society (physics.soc-ph); Computers and Society (cs.CY); Machine Learning (cs.LG)
Information Retrieval
[IR-0] Linked Array Tree: A Constant-Time Search Structure for Big Data
Link: https://arxiv.org/abs/2504.00828
Authors: Songpeng Liu
Categories: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Abstract: As data volumes continue to grow rapidly, traditional search algorithms, like the red-black tree and B+ Tree, face increasing challenges in performance, especially in big data scenarios with intensive storage access. This paper presents the Linked Array Tree (LAT), a novel data structure designed to achieve constant-time complexity for search, insertion, and deletion operations. LAT leverages a sparse, non-moving hierarchical layout that enables direct access paths without requiring rebalancing or data movement. Its low memory overhead and avoidance of pointer-heavy structures make it well-suited for large-scale and intensive workloads. While not specifically tested under parallel or concurrent conditions, the structure’s static layout and non-interfering operations suggest potential advantages in such environments. This paper first introduces the structure and algorithms of LAT, followed by a detailed analysis of its time complexity in search, insertion, and deletion operations. Finally, it presents experimental results across both data-intensive and sparse usage scenarios to evaluate LAT’s practical performance.
Attachment Download
Click to download the full list of today's papers