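This post lists the latest papers retrieved from arXiv.org on 2025-03-03. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically every morning at around 12:00.

Friendly reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.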

Contents

Overview (2025-03-03)

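478 papers were updated today, including:

  • Natural Language Processing: 88 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 130 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 109 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 143 papers (Machine Learning (cs.LG))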

Natural Language Processing

[NLP-0] LLM Post-Training: A Deep Dive into Reasoning Large Language Models

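Quick read: This paper systematically explores post-training methodologies for large language models (LLMs), applied after pretraining, to further improve their performance, robustness, and adaptability. It focuses on how post-training techniques compensate for the limitations of pretraining, particularly in refining knowledge, improving reasoning, enhancing factual accuracy, and aligning with user intent and ethical constraints. The core of the paper is an analysis and summary of several key post-training strategies, including fine-tuning, reinforcement learning, and test-time scaling, which are presented as effective means of addressing challenges that pretrained models face in practice, such as catastrophic forgetting, reward hacking, and inference-time trade-offs. The paper also discusses emerging directions in model alignment, scalable adaptation, and inference-time reasoning, and provides a link to a repository that continually tracks this fast-moving field.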

Link: https://arxiv.org/abs/2502.21321
Authors: Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H.S. Torr, Salman Khan, Fahad Shahbaz Khan
Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE; Center for Research in Computer Vision, University of Central Florida, Orlando, FL 32816, USA; University of California at Merced, Merced, CA 95343 USA; Google DeepMind, Mountain View, CA 94043, USA; Department of Engineering Science, University of Oxford, Oxford OX1 2JD, UK
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 7 figures, 3 tables, 375 references

Abstract:Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations. Fine-tuning, reinforcement learning, and test-time scaling have emerged as critical strategies for optimizing LLMs performance, ensuring robustness, and improving adaptability across various real-world tasks. This survey provides a systematic exploration of post-training methodologies, analyzing their role in refining LLMs beyond pretraining, addressing key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs. We highlight emerging directions in model alignment, scalable adaptation, and inference-time reasoning, and outline future research directions. We also provide a public repository to continually track developments in this fast-evolving field: this https URL.

[NLP-1] Identifying Emerging Concepts in Large Corpora

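Quick read: This paper addresses how to identify emerging concepts in large text corpora. The key to the solution is detecting these concepts by analyzing changes in heatmaps of the embedding space. The method identifies concepts with high accuracy shortly after they originate and thereby outperforms common alternatives.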

Link: https://arxiv.org/abs/2502.21315
Authors: Sibo Ma, Julian Nyarko
Affiliations: Stanford University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 9 pages, 4 figures

Abstract:We introduce a new method to identify emerging concepts in large text corpora. By analyzing changes in the heatmaps of the underlying embedding space, we are able to detect these concepts with high accuracy shortly after they originate, in turn outperforming common alternatives. We further demonstrate the utility of our approach by analyzing speeches in the U.S. Senate from 1941 to 2015. Our results suggest that the minority party is more active in introducing new concepts into the Senate discourse. We also identify specific concepts that closely correlate with the Senators’ racial, ethnic, and gender identities. An implementation of our method is publicly available.

[NLP-2] FANformer: Improving Large Language Models Through Effective Periodicity Modeling

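Quick read: This paper addresses the low learning efficiency of large language models (LLMs), and their difficulty in establishing underlying principles from data, caused by potential flaws in how the Transformer models periodicity. The key solution is FANformer, which integrates the Fourier Analysis Network (FAN) into the attention mechanism and modifies the attention mechanism's feature projection process to achieve efficient periodicity modeling. Experiments show that FANformer consistently outperforms the Transformer when scaling up model size and training tokens, demonstrating its superior learning efficiency. Further pretraining experiments show marked improvements on downstream tasks, positioning FANformer as an effective and promising LLM architecture.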

Link: https://arxiv.org/abs/2502.21309
Authors: Yihong Dong, Ge Li, Xue Jiang, Yongding Tao, Kechi Zhang, Hao Zhu, Huanyu Liu, Jiazheng Ding, Jia Li, Jinliang Deng, Hong Mei
Affiliations: School of Computer Science, Peking University; aiXcoder; The Hong Kong University of Science and Technology; Advanced Institute of Big Data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Periodicity, as one of the most important basic characteristics, lays the foundation for facilitating structured knowledge acquisition and systematic cognitive processes within human learning paradigms. However, the potential flaws of periodicity modeling in Transformer affect the learning efficiency and establishment of underlying principles from data for large language models (LLMs) built upon it. In this paper, we demonstrate that integrating effective periodicity modeling can improve the learning efficiency and performance of LLMs. We introduce FANformer, which integrates Fourier Analysis Network (FAN) into attention mechanism to achieve efficient periodicity modeling, by modifying the feature projection process of attention mechanism. Extensive experimental results on language modeling show that FANformer consistently outperforms Transformer when scaling up model size and training tokens, underscoring its superior learning efficiency. To further validate the effectiveness of FANformer, we pretrain a FANformer-1B on 1 trillion tokens. FANformer-1B exhibits marked improvements on downstream tasks compared to open-source LLMs with similar model parameters or training tokens. The results position FANformer as an effective and promising architecture for advancing LLMs.
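
The abstract says FANformer changes the feature projection of attention to model periodicity but does not spell out the formula. The sketch below is only an illustration under a stated assumption: that the projection follows the general FAN idea of concatenating cosine and sine features of one linear map with an ordinary linear part. The function name `fan_projection`, the weight names, and the split sizes are all hypothetical and are not taken from the paper.

```python
import numpy as np

def fan_projection(x, w_p, w_rest, b_rest):
    """Hypothetical FAN-style projection (an assumption, not the paper's code).

    Periodic features cos(x W_p) and sin(x W_p) are concatenated with a
    plain linear part x W_rest + b_rest along the feature axis.
    """
    periodic = x @ w_p                      # input to the periodic branch
    linear = x @ w_rest + b_rest            # ordinary linear branch
    return np.concatenate([np.cos(periodic), np.sin(periodic), linear], axis=-1)

# Toy shapes: 5 tokens, model width 8, periodic width 2 (all illustrative).
rng = np.random.default_rng(0)
d_model, d_p = 8, 2
x = rng.normal(size=(5, d_model))
w_p = rng.normal(size=(d_model, d_p))
w_rest = rng.normal(size=(d_model, d_model - 2 * d_p))
b_rest = np.zeros(d_model - 2 * d_p)

# x_f keeps the model width; queries, keys and values would then be
# computed from x_f instead of x in this sketch.
x_f = fan_projection(x, w_p, w_rest, b_rest)
```

In this sketch the output width equals the input width, so the projected features could replace the usual attention input without changing any downstream shapes.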

[NLP-3] Persuasion Should be Double-Blind: A Multi-Domain Dialogue Dataset With Faithfulness Based on Causal Theory of Mind

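Quick read: This paper addresses the failure of existing persuasive dialogue datasets to faithfully reflect real-world interpersonal interaction, for example unrealistic scenarios in which the persuadee explicitly instructs the persuader on which persuasion strategy to use and each question corresponds to a specific strategy. The problem stems from a violation of the "double-blind" condition, that is, critical information being over-shared between participants. In real human interaction, key information such as the persuadee's mental state and the persuader's persuasion strategies is not directly accessible, so the persuader must use Theory of Mind to infer the persuadee's mental state and construct arguments that match the persuadee's motivations. To fill this gap, the paper proposes ToMMA, a new multi-agent dialogue generation framework guided by causal Theory of Mind. It keeps information from leaking between agents to preserve the "double-blind" condition, while causal Theory of Mind guides the persuader's reasoning and improves consistency with human persuasion dynamics. Finally, the paper presents CToMPersu, a multi-domain, multi-turn persuasive dialogue dataset that addresses both the double-blind and logical-coherence issues, performs well across multiple metrics, and aligns better with real human dialogue.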

Link: https://arxiv.org/abs/2502.21297
Authors: Dingyi Zhang, Deyu Zhou
Affiliations: School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University
Categories: Computation and Language (cs.CL)
Comments: 23 pages

Abstract:Persuasive dialogue plays a pivotal role in human communication, influencing various domains. Recent persuasive dialogue datasets often fail to align with real-world interpersonal interactions, leading to unfaithful representations. For instance, unrealistic scenarios may arise, such as when the persuadee explicitly instructs the persuader on which persuasion strategies to employ, with each of the persuadee’s questions corresponding to a specific strategy for the persuader to follow. This issue can be attributed to a violation of the “Double Blind” condition, where critical information is fully shared between participants. In actual human interactions, however, key information such as the mental state of the persuadee and the persuasion strategies of the persuader is not directly accessible. The persuader must infer the persuadee’s mental state using Theory of Mind capabilities and construct arguments that align with the persuadee’s motivations. To address this gap, we introduce ToMMA, a novel multi-agent framework for dialogue generation that is guided by causal Theory of Mind. This framework ensures that information remains undisclosed between agents, preserving “double-blind” conditions, while causal ToM directs the persuader’s reasoning, enhancing alignment with human-like persuasion dynamics. Consequently, we present CToMPersu, a multi-domain, multi-turn persuasive dialogue dataset that tackles both double-blind and logical coherence issues, demonstrating superior performance across multiple metrics and achieving better alignment with real human dialogues. Our dataset and prompts are available at this https URL .

[NLP-4] Token-level Ensembling of Models with Different Vocabularies

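Quick read: This paper addresses the problem that text generation models with different subword vocabularies cannot be directly ensembled at inference time. The key to the solution is an inference-time-only algorithm that, without learning additional parameters or modifying the underlying models, ensures that the tokens generated by the ensembled models agree in their surface form, which enables ensembling across different vocabularies. The method expands the range of model pairs that can be ensembled at the token level and frequently improves translation performance on machine translation.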

Link: https://arxiv.org/abs/2502.21265
Authors: Rachel Wicks, Kartik Ravisankar, Xinchen Yang, Philipp Koehn, Matt Post
Affiliations: Johns Hopkins University; University of Maryland; Microsoft
Categories: Computation and Language (cs.CL)
Comments: Under review

Abstract:Model ensembling is a technique to combine the predicted distributions of two or more models, often leading to improved robustness and performance. For ensembling in text generation, the next token’s probability distribution is derived from a weighted sum of the distributions of each individual model. This requires the underlying models to share the same subword vocabulary, limiting the applicability of ensembling, since many open-sourced models have distinct vocabularies. In research settings, experimentation or upgrades to vocabularies may introduce multiple vocabulary sizes. This paper proposes an inference-time only algorithm that allows for ensembling models with different vocabularies, without the need to learn additional parameters or alter the underlying models. Instead, the algorithm ensures that tokens generated by the ensembled models agree in their surface form. We apply this technique to combinations of traditional encoder-decoder models and decoder-only LLMs and evaluate on machine translation. In addition to expanding to model pairs that were previously incapable of token-level ensembling, our algorithm frequently improves translation performance over either model individually.
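
For readers unfamiliar with the baseline the abstract starts from, here is a minimal sketch of standard token-level ensembling when the models share one vocabulary: the next-token distribution is a weighted sum of the per-model distributions. This is not the paper's cross-vocabulary algorithm. The function name `ensemble_next_token` and the toy probabilities are made up for illustration.

```python
import numpy as np

def ensemble_next_token(distributions, weights):
    """Weighted sum of next-token distributions over one shared vocabulary."""
    probs = np.asarray(distributions, dtype=float)   # shape: (n_models, vocab_size)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # normalise so the mix sums to 1
    return w @ probs

# Two hypothetical models scoring a 4-token vocabulary.
p_a = [0.70, 0.10, 0.10, 0.10]
p_b = [0.20, 0.60, 0.10, 0.10]
mixed = ensemble_next_token([p_a, p_b], weights=[0.5, 0.5])
next_token_id = int(np.argmax(mixed))                # greedy pick from the mixture
```

The row-wise sum only makes sense when index `i` means the same token for every model, which is exactly the shared-vocabulary requirement the paper sets out to relax.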

[NLP-5] RuCCoD: Towards Automated ICD Coding in Russian

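Quick read: This paper addresses the automation of clinical coding (ICD coding) in Russian, a setting with limited biomedical resources. The key solution is a new dataset with more than 10,000 entities and more than 1,500 unique ICD codes, used as a benchmark for several state-of-the-art models (BERT, LLaMA with LoRA, and RAG). Additional transfer learning experiments across domains (from PubMed abstracts to medical diagnoses) and across terminologies (from UMLS concepts to ICD codes) are used to further improve model performance. Finally, the best-performing model is used to label an in-house electronic health record (EHR) dataset, and the results show that training with the automatically predicted codes significantly improves accuracy compared with manually annotated data. The core of the paper is therefore the combination of modern machine learning techniques with a carefully designed dataset and transfer learning strategies to automate clinical coding in a resource-limited language.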

Link: https://arxiv.org/abs/2502.21263
Authors: Aleksandr Nesterov, Andrey Sakhovskiy, Ivan Sviridov, Airat Valiev, Vladimir Makharev, Petr Anokhin, Galina Zubkova, Elena Tutubalina
Affiliations: AIRI; Sber AI; Sber AI Lab; HSE University; ISP RAS Research Center for Trusted Artificial Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Abstract:This study investigates the feasibility of automating clinical coding in Russian, a language with limited biomedical resources. We present a new dataset for ICD coding, which includes diagnosis fields from electronic health records (EHRs) annotated with over 10,000 entities and more than 1,500 unique ICD codes. This dataset serves as a benchmark for several state-of-the-art models, including BERT, LLaMA with LoRA, and RAG, with additional experiments examining transfer learning across domains (from PubMed abstracts to medical diagnosis) and terminologies (from UMLS concepts to ICD codes). We then apply the best-performing model to label an in-house EHR dataset containing patient histories from 2017 to 2021. Our experiments, conducted on a carefully curated test set, demonstrate that training with the automated predicted codes leads to a significant improvement in accuracy compared to manually annotated data from physicians. We believe our findings offer valuable insights into the potential for automating clinical coding in resource-limited languages like Russian, which could enhance clinical efficiency and data accuracy in these contexts.

[NLP-6] Semantic Volume: Quantifying and Detecting both External and Internal Uncertainty in LLMs

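Quick read: This paper addresses the tendency of large language models (LLMs) to hallucinate, that is, to generate incorrect or misleading information. Existing hallucination detection methods focus mainly on quantifying the model's internal uncertainty and overlook external uncertainty, which arises when ambiguous user queries admit multiple interpretations. The key contribution is Semantic Volume, a new mathematical measure that quantifies both external and internal uncertainty in LLMs. The approach perturbs queries and responses, embeds them in a semantic space, and computes the determinant of the Gram matrix of the embedding vectors, using the dispersion of the vectors as the measure of uncertainty. The method requires no white-box access to the LLM, provides a general and unsupervised means of uncertainty detection, and outperforms existing baselines in experiments on both external and internal uncertainty detection.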

Link: https://arxiv.org/abs/2502.21239
Authors: Xiaomin Li, Zhou Yu, Ziji Zhang, Yingying Zhuang, Swair Shah, Anurag Beniwal
Affiliations: Harvard University; Amazon
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable performance across diverse tasks by encoding vast amounts of factual knowledge. However, they are still prone to hallucinations, generating incorrect or misleading information, often accompanied by high uncertainty. Existing methods for hallucination detection primarily focus on quantifying internal uncertainty, which arises from missing or conflicting knowledge within the model. However, hallucinations can also stem from external uncertainty, where ambiguous user queries lead to multiple possible interpretations. In this work, we introduce Semantic Volume, a novel mathematical measure for quantifying both external and internal uncertainty in LLMs. Our approach perturbs queries and responses, embeds them in a semantic space, and computes the determinant of the Gram matrix of the embedding vectors, capturing their dispersion as a measure of uncertainty. Our framework provides a generalizable and unsupervised uncertainty detection method without requiring white-box access to LLMs. We conduct extensive experiments on both external and internal uncertainty detection, demonstrating that our Semantic Volume method consistently outperforms existing baselines in both tasks. Additionally, we provide theoretical insights linking our measure to differential entropy, unifying and extending previous sampling-based uncertainty measures such as the semantic entropy. Semantic Volume is shown to be a robust and interpretable approach to improving the reliability of LLMs by systematically detecting uncertainty in both user queries and model responses.
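
The Gram-determinant step described in the abstract can be sketched in a few lines. The abstract gives only the outline, so several details here are assumptions: the rows are unit-normalised, a small `eps` is added to the diagonal to keep the matrix invertible, and the log-determinant is returned for numerical stability. The function name `semantic_volume` and the toy vectors are illustrative, and real inputs would be embeddings of perturbed queries or sampled responses.

```python
import numpy as np

def semantic_volume(embeddings, eps=1e-6):
    """Log-determinant of the Gram matrix of row-normalised embeddings.

    Larger values mean the vectors are more spread out, which this sketch
    reads as higher uncertainty. Normalisation and eps are assumptions.
    """
    e = np.asarray(embeddings, dtype=float)               # shape: (n, d)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)      # unit-length rows
    gram = e @ e.T                                        # pairwise inner products
    _, logdet = np.linalg.slogdet(gram + eps * np.eye(len(e)))
    return logdet

# Nearly identical vectors (consistent answers) vs. orthogonal ones (dispersed answers).
tight = [[1.0, 0.0, 0.0], [0.99, 0.01, 0.0], [0.98, 0.0, 0.02]]
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
low, high = semantic_volume(tight), semantic_volume(spread)
```

Geometrically, the determinant of the Gram matrix is the squared volume of the parallelepiped spanned by the vectors, so near-duplicate embeddings collapse it toward zero while diverse ones keep it large.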

[NLP-7] Transforming Tuberculosis Care: Optimizing Large Language Models For Enhanced Clinician-Patient Communication AAAI-25 ALT

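Quick read: This paper addresses inadequate tuberculosis (TB) treatment support, poor communication, and low treatment completion rates worldwide, especially in low- and middle-income countries, caused by limited healthcare resources and high patient-to-provider ratios. The key solution is to integrate a specialized large language model into an efficacious digital adherence technology, improving patient engagement by augmenting interactive communication with treatment supporters. The approach operates within a human-in-the-loop framework and uses AI to improve TB treatment outcomes.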

Link: https://arxiv.org/abs/2502.21236
Authors: Daniil Filienko, Mahek Nizar, Javier Roberti, Denise Galdamez, Haroon Jakher, Sarah Iribarren, Weichao Yuwen, Martine De Cock
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: GenAI4Health at AAAI-25

Abstract:Tuberculosis (TB) is the leading cause of death from an infectious disease globally, with the highest burden in low- and middle-income countries. In these regions, limited healthcare access and high patient-to-provider ratios impede effective patient support, communication, and treatment completion. To bridge this gap, we propose integrating a specialized Large Language Model into an efficacious digital adherence technology to augment interactive communication with treatment supporters. This AI-powered approach, operating within a human-in-the-loop framework, aims to enhance patient engagement and improve TB treatment outcomes.

[NLP-8] ECLeKTic: a Novel Challenge Set for Evaluation of Cross-Lingual Knowledge Transfer

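Quick read: This paper addresses the lack of reliable ways to evaluate the cross-lingual knowledge transfer capability of multilingual large language models (LLMs). The key solution is ECLeKTic, a new multilingual closed-book QA (CBQA) dataset that evaluates cross-lingual knowledge transfer in a simple, black-box manner. By controlling for the presence and absence of Wikipedia articles in 12 languages, the authors detect information with uneven coverage across languages. They generate knowledge-seeking questions in a source language whose answers appear in a relevant Wikipedia article, and translate them into the other 11 languages, whose Wikipedias lack equivalent articles. Assuming that Wikipedia reflects the prominent knowledge in an LLM's training data, solving ECLeKTic's CBQA task requires the model to transfer knowledge between languages. Experiments with 8 LLMs show that even state-of-the-art models that answer well in the language in which the knowledge was acquired still struggle to share knowledge across languages.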

Link: https://arxiv.org/abs/2502.21228
Authors: Omer Goldman, Uri Shaham, Dan Malkin, Sivan Eiger, Avinatan Hassidim, Yossi Matias, Joshua Maynez, Adi Mayrav Gilady, Jason Riesa, Shruti Rijhwani, Laura Rimell, Idan Szpektor, Reut Tsarfaty, Matan Eyal
Affiliations: Google Research; Google DeepMind
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:To achieve equitable performance across languages, multilingual large language models (LLMs) must be able to abstract knowledge beyond the language in which it was acquired. However, the current literature lacks reliable ways to measure LLMs’ capability of cross-lingual knowledge transfer. To that end, we present ECLeKTic, a multilingual closed-book QA (CBQA) dataset that Evaluates Cross-Lingual Knowledge Transfer in a simple, black-box manner. We detected information with uneven coverage across languages by controlling for presence and absence of Wikipedia articles in 12 languages. We generated knowledge-seeking questions in a source language, for which the answer appears in a relevant Wikipedia article and translated them to all other 11 languages, for which the respective Wikipedias lack equivalent articles. Assuming that Wikipedia reflects the prominent knowledge in the LLM’s training data, to solve ECLeKTic’s CBQA task the model is required to transfer knowledge between languages. Experimenting with 8 LLMs, we show that SOTA models struggle to effectively share knowledge across languages, even if they can predict the answer well for queries in the same language the knowledge was acquired in.

[NLP-9] Detecting Linguistic Diversity on Social Media

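Quick read: This chapter examines how social media data can be used to study changing linguistic behaviour in a given place. In Aotearoa New Zealand, where official census statistics are the only source of language-use data, the authors treat census data as the ground truth and the social media sub-corpus of the Corpus of Global Language Use as an alternative data source. The key is to link the two data sources through the shared variable of place and to validate the language condition of each tweet in the social media data with two language identification models. The results indicate that social media language data can provide rich spatial and temporal insight into the linguistic profile of a place, and that it is sensitive to demographic and sociopolitical change and to linguistic diversity at low-level regional and local geographies. The core contribution is thus a social-media-based framework for language analysis that complements traditional statistics and reveals dynamic changes in linguistic behaviour.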

Link: https://arxiv.org/abs/2502.21224
Authors: Sidney Wong, Benjamin Adams, Jonathan Dunn
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to Cartography and GIScience in Australasia and Oceania: Including twenty years of GeoCart

Abstract:This chapter explores the efficacy of using social media data to examine changing linguistic behaviour of a place. We focus our investigation on Aotearoa New Zealand where official statistics from the census are the only source of language use data. We use published census data as the ground truth and the social media sub-corpus from the Corpus of Global Language Use as our alternative data source. We use place as the common denominator between the two data sources. We identify the language conditions of each tweet in the social media data set and validate our results with two language identification models. We then compare levels of linguistic diversity at national, regional, and local geographies. The results suggest that social media language data has the possibility to provide a rich source of spatial and temporal insights on the linguistic profile of a place. We show that social media is sensitive to demographic and sociopolitical changes within a language and at low-level regional and local geographies.

[NLP-10] Optimizing Large Language Models for ESG Activity Detection in Financial Texts

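Quick read: This paper addresses how to use AI to automatically assess whether sustainability reports and non-financial disclosures align with specific environmental, social, and governance (ESG) activities. The main challenges are the limitations of general-purpose large language models (LLMs) in domain-specific contexts and the scarcity of structured, high-quality data. The key solution is to fine-tune LLMs on a combination of original and synthetically generated data, which significantly improves their performance. To this end, the authors introduce ESG-Activities, a dataset of 1,325 text segments labelled according to the EU ESG taxonomy, and show that after fine-tuning on it, open models such as Llama 7B and Gemma 7B outperform some large proprietary solutions in specific configurations. The results matter for financial analysts, policymakers, and AI researchers working to improve ESG transparency and compliance.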

Link: https://arxiv.org/abs/2502.21112
Authors: Mattia Birti, Francesco Osborne, Andrea Maurino
Affiliations: University of Milano-Bicocca; KMi, The Open University
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments:

Abstract:The integration of Environmental, Social, and Governance (ESG) factors into corporate decision-making is a fundamental aspect of sustainable finance. However, ensuring that business practices align with evolving regulatory frameworks remains a persistent challenge. AI-driven solutions for automatically assessing the alignment of sustainability reports and non-financial disclosures with specific ESG activities could greatly support this process. Yet, this task remains complex due to the limitations of general-purpose Large Language Models (LLMs) in domain-specific contexts and the scarcity of structured, high-quality datasets. In this paper, we investigate the ability of current-generation LLMs to identify text related to environmental activities. Furthermore, we demonstrate that their performance can be significantly enhanced through fine-tuning on a combination of original and synthetically generated data. To this end, we introduce ESG-Activities, a benchmark dataset containing 1,325 labelled text segments classified according to the EU ESG taxonomy. Our experimental results show that fine-tuning on ESG-Activities significantly enhances classification accuracy, with open models such as Llama 7B and Gemma 7B outperforming large proprietary solutions in specific configurations. These findings have important implications for financial analysts, policymakers, and AI researchers seeking to enhance ESG transparency and compliance through advanced natural language processing techniques.

[NLP-11] Generating patient cohorts from electronic health records using two-step retrieval-augmented text-to-SQL generation

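Quick read: This paper addresses the manual and challenging task of translating inclusion/exclusion criteria into SQL queries for clinical cohort definition. The key to the solution is a system based on large language models that combines criteria parsing, two-level retrieval-augmented generation with specialized knowledge bases, medical concept standardization, and SQL generation. On electronic health record (EHR) data, the system captures complex temporal and logical relationships and reaches an F1 score of 0.75 in cohort identification, demonstrating the feasibility of automated cohort generation for epidemiological research.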

Link: https://arxiv.org/abs/2502.21107
Authors: Angelo Ziletti, Leonardo D’Ambrosi
Affiliations: Bayer AG
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 1 figure

Abstract:Clinical cohort definition is crucial for patient recruitment and observational studies, yet translating inclusion/exclusion criteria into SQL queries remains challenging and manual. We present an automated system utilizing large language models that combines criteria parsing, two-level retrieval augmented generation with specialized knowledge bases, medical concept standardization, and SQL generation to retrieve patient cohorts with patient funnels. The system achieves 0.75 F1-score in cohort identification on EHR data, effectively capturing complex temporal and logical relationships. These results demonstrate the feasibility of automated cohort generation for epidemiological research.
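
To make the reported 0.75 F1-score concrete, here is one plausible way to score a retrieved cohort against a reference cohort, treating each as a set of patient identifiers. The abstract does not say how the authors compute F1, so this set-based definition is an assumption, and the function name `cohort_f1` and the patient IDs are invented for the example.

```python
def cohort_f1(predicted_ids, reference_ids):
    """Set-based F1 between a retrieved cohort and a reference cohort."""
    predicted, reference = set(predicted_ids), set(reference_ids)
    true_pos = len(predicted & reference)        # patients found in both cohorts
    if true_pos == 0:
        return 0.0                               # also covers empty cohorts
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)

# Hypothetical cohorts: the generated SQL returns p1..p4, the reference is p2..p5.
score = cohort_f1({"p1", "p2", "p3", "p4"}, {"p2", "p3", "p4", "p5"})
```

In this toy case three of four retrieved patients are correct and three of four reference patients are found, so precision and recall are both 0.75 and the F1 is 0.75 as well.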

[NLP-12] Re-evaluating Theory of Mind evaluation in large language models

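Quick read: This paper asks whether large language models (LLMs) possess Theory of Mind (ToM), that is, the ability to reason about others' mental states. It notes that the existing evidence is mixed and that evaluations have not converged. The key issue is a lack of clarity about whether LLMs should be expected to match human behaviours or the computations underlying those behaviours. The paper also stresses that current evaluations may deviate from "pure" measurements of ToM abilities, which adds to the confusion. Future research directions include the relationship between ToM and pragmatic communication, which could deepen our understanding of both artificial systems and human cognition.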

Link: https://arxiv.org/abs/2502.21098
Authors: Jennifer Hu, Felix Sosa, Tomer Ullman
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: under review

Abstract:The question of whether large language models (LLMs) possess Theory of Mind (ToM) – often defined as the ability to reason about others’ mental states – has sparked significant scientific and public interest. However, the evidence as to whether LLMs possess ToM is mixed, and the recent growth in evaluations has not resulted in a convergence. Here, we take inspiration from cognitive science to re-evaluate the state of ToM evaluation in LLMs. We argue that a major reason for the disagreement on whether LLMs have ToM is a lack of clarity on whether models should be expected to match human behaviors, or the computations underlying those behaviors. We also highlight ways in which current evaluations may be deviating from “pure” measurements of ToM abilities, which also contributes to the confusion. We conclude by discussing several directions for future research, including the relationship between ToM and pragmatic communication, which could advance our understanding of artificial systems as well as human cognition.

[NLP-13] PASemiQA: Plan-Assisted Agent for Question Answering on Semi-Structured Data with Text and Relational Information

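Quick read: This paper addresses the tendency of large language models (LLMs) to hallucinate on questions that require professional and up-to-date knowledge. To address this limitation, it proposes PASemiQA, a new approach whose key idea is to jointly leverage the text and relational information in semi-structured data to answer questions. Specifically, PASemiQA first generates a plan that identifies the relevant text and relational information, and then uses an LLM agent to traverse the semi-structured data and extract the necessary information. The approach handles real-world questions over semi-structured data containing both text and relational information, and shows potential to improve the accuracy and reliability of question answering systems on such data.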

Link: https://arxiv.org/abs/2502.21087
Authors: Hansi Yang, Qi Zhang, Wei Jiang, Jianguo Li
Affiliations: CSE, HKUST; Ant Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have shown impressive abilities in answering questions across various domains, but they often encounter hallucination issues on questions that require professional and up-to-date knowledge. To address this limitation, retrieval-augmented generation (RAG) techniques have been proposed, which retrieve relevant information from external sources to inform their responses. However, existing RAG methods typically focus on a single type of external data, such as vectorized text database or knowledge graphs, and cannot well handle real-world questions on semi-structured data containing both text and relational information. To bridge this gap, we introduce PASemiQA, a novel approach that jointly leverages text and relational information in semi-structured data to answer questions. PASemiQA first generates a plan to identify relevant text and relational information to answer the question in semi-structured data, and then uses an LLM agent to traverse the semi-structured data and extract necessary information. Our empirical results demonstrate the effectiveness of PASemiQA across different semi-structured datasets from various domains, showcasing its potential to improve the accuracy and reliability of question answering systems on semi-structured data.

[NLP-14] CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

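Quick read: This paper addresses the performance gap in which implicit Chain-of-Thought (CoT) methods lag behind explicit CoT methods, while exploring how to balance efficiency and performance. The key contribution is CODI (Continuous Chain-of-Thought via Self-Distillation), a new framework that distills CoT into a continuous space, using a shared model to learn explicit and implicit CoT jointly while aligning their hidden activations at the token that generates the final answer. The method matches explicit CoT's performance on GSM8k with 3.1x compression, surpasses the previous state-of-the-art implicit CoT method by 28.2% in accuracy, and retains interpretability because its continuous thoughts can be decoded, making the reasoning process transparent.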

Link: https://arxiv.org/abs/2502.21074
Authors: Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, Yulan He
Affiliations: King’s College London; The Alan Turing Institute
Categories: Computation and Language (cs.CL)
Comments: 15 pages

Abstract:Chain-of-Thought (CoT) enhances Large Language Models (LLMs) by enabling step-by-step reasoning in natural language. However, the language space may be suboptimal for reasoning. While implicit CoT methods attempt to enable reasoning without explicit CoT tokens, they have consistently lagged behind explicit CoT methods in task performance. We propose CODI (Continuous Chain-of-Thought via Self-Distillation), a novel framework that distills CoT into a continuous space, where a shared model acts as both teacher and student, jointly learning explicit and implicit CoT while aligning their hidden activation on the token generating the final answer. CODI is the first implicit CoT method to match explicit CoT’s performance on GSM8k while achieving 3.1x compression, surpassing the previous state-of-the-art by 28.2% in accuracy. Furthermore, CODI demonstrates scalability, robustness, and generalizability to more complex CoT datasets. Additionally, CODI retains interpretability by decoding its continuous thoughts, making its reasoning process transparent. Our findings establish implicit CoT as not only a more efficient but a powerful alternative to explicit CoT.

[NLP-15] Beyond Words: A Latent Memory Approach to Internal Reasoning in LLMs

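Quick read: This paper addresses the possibility that explicit chain-of-thought (CoT) reasoning, although it improves the interpretability of large language models (LLMs), is not the most computationally efficient form of reasoning. It proposes a framework that integrates an Implicit Memory Module (IMM) into the internal reasoning process of LLMs, mirroring the way human cognition relies on implicit mental representations. The key idea is that the IMM lets the model implicitly recall past sensory and episodic information rather than rely entirely on explicit natural-language reasoning steps, leading to more efficient training. Preliminary experiments show that the approach reduces final training loss by 35% to 57% compared with a standard GPT baseline. The paper also discusses how the framework could be extended to support explicit auditability and proposes technical mechanisms for scaling the memory module.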

Link: https://arxiv.org/abs/2502.21030
Authors: José I. Orlicki
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures

Abstract:Recent advances in large language models (LLMs) have popularized the chain-of-thought (CoT) paradigm, in which models produce explicit reasoning steps in natural language. Although this approach improves interpretability and facilitates external auditing, it may not represent the most computationally efficient method for internal reasoning. In contrast, human cognition relies on implicit mental representations that recall past sensory and episodic information without requiring complete verbalization. In this paper, we propose a framework that integrates implicit mental representations into the internal reasoning processes of LLMs. Preliminary experiments indicate that incorporating an Implicit Memory Module (IMM) into a simple GPT model yields a reduction of between 35% and 57% in final training loss compared to a regular GPT baseline. The addition of an explicit interpretability channel (e.g., a chain-of-thought decoder) is straightforward to implement within this approach. We outline theoretical foundations, propose technical mechanisms to scale the memory module, and discuss how these ideas may lead to more efficient and robust reasoning, with optional future extensions for explicit auditability.

[NLP-16] Extending Dense Passage Retrieval with Temporal Information

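Quick read: This paper addresses the inability of traditional retrieval methods such as BM25 and Dense Passage Retrieval (DPR) to capture time sensitivity when handling time-related queries. The key to the solution is a temporal retrieval model that incorporates explicit temporal signals by integrating query timestamps and document dates into the representation space, so that retrieved passages are not only topically relevant but also temporally aligned with user intent. The paper also proposes a time-sensitive negative sampling strategy that strengthens the model's ability to distinguish temporally relevant from irrelevant documents during training. Experiments on two large-scale benchmark datasets, ArchivalQA and ChroniclingAmericaQA, show substantial gains in retrieval performance.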

Link: https://arxiv.org/abs/2502.21024
Authors: Abdelrahman Abdallah, Bhawna Piryani, Jonas Wallat, Avishek Anand, Adam Jatowt
Affiliations: University of Innsbruck (Innsbruck, Tyrol, Austria); L3S Research Center (Hannover, Germany); Delft University of Technology (Delft, Netherlands)
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Temporal awareness is crucial in many information retrieval tasks, particularly in scenarios where the relevance of documents depends on their alignment with the query’s temporal context. Traditional retrieval methods such as BM25 and Dense Passage Retrieval (DPR) excel at capturing lexical and semantic relevance but fall short in addressing time-sensitive queries. To bridge this gap, we introduce the temporal retrieval model that integrates explicit temporal signals by incorporating query timestamps and document dates into the representation space. Our approach ensures that retrieved passages are not only topically relevant but also temporally aligned with user intent. We evaluate our approach on two large-scale benchmark datasets, ArchivalQA and ChroniclingAmericaQA, achieving substantial performance gains over standard retrieval baselines. In particular, our model improves Top-1 retrieval accuracy by 6.63% and NDCG@10 by 3.79% on ArchivalQA, while yielding a 9.56% boost in Top-1 retrieval accuracy and 4.68% in NDCG@10 on ChroniclingAmericaQA. Additionally, we introduce a time-sensitive negative sampling strategy, which refines the model’s ability to distinguish between temporally relevant and irrelevant documents during training. Our findings highlight the importance of explicitly modeling time in retrieval systems and set a new standard for handling temporally grounded queries.
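
The two metrics quoted in the abstract, Top-1 retrieval accuracy and NDCG@10, can be sketched as follows. This is a generic illustration with assumptions stated up front: gains are linear in the relevance label, and the ideal ranking is built only from the labels of the retrieved list rather than from the whole collection. The paper's exact evaluation setup is not given in the abstract, and all function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def top1_accuracy(ranked_relevances):
    """Fraction of queries whose first retrieved passage is relevant."""
    hits = sum(1 for rels in ranked_relevances if rels and rels[0] > 0)
    return hits / len(ranked_relevances)

# Each inner list holds relevance labels of retrieved passages, in rank order.
runs = [[1, 0, 0], [0, 1, 0]]
top1 = top1_accuracy(runs)                       # one of two queries hits at rank 1
scores = [ndcg_at_k(rels, k=10) for rels in runs]
```

A relevant passage at rank 2 instead of rank 1 is discounted by `1 / log2(3)`, which is why NDCG rewards the temporally aligned passage moving up the list even when Top-1 accuracy does not change.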

[NLP-17] PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues

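[Quick Read]: This paper addresses limitations in how the Theory of Mind (ToM) abilities of large language models (LLMs), that is, understanding and predicting mental states, are evaluated: existing benchmarks focus mainly on physical perception tasks in synthetic stories and conversations and fail to capture the complex psychological activity of real social interaction. To close this gap, the paper proposes PersuasiveToM, a benchmark for evaluating the ToM abilities of LLMs in persuasive dialogues. The key is a framework with two categories of questions: (1) ToM Reasoning, which assesses the ability to track evolving mental states, and (2) ToM Application, which tests whether LLMs can use inferred mental states to select effective persuasion strategies and to judge the effectiveness of those strategies. Experiments on eight state-of-the-art LLMs show that, although the models do well on many questions, they struggle to track the dynamics of mental states and to understand mental states across the whole dialogue.

Link: https://arxiv.org/abs/2502.21017
Authors: Fangxu Yu, Lai Jiang, Shenyi Huang, Zhen Wu, Xinyu Dai
Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University; University of California, San Diego, CA, USA
Subjects: Computation and Language (cs.CL)
Comments: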

Abstract:The ability to understand and predict the mental states of oneself and others, known as the Theory of Mind (ToM), is crucial for effective social interactions. Recent research has emerged to evaluate whether Large Language Models (LLMs) exhibit a form of ToM. Although recent studies have evaluated ToM in LLMs, existing benchmarks focus predominantly on physical perception with principles guided by the Sally-Anne test in synthetic stories and conversations, failing to capture the complex psychological activities of mental states in real-life social interactions. To mitigate this gap, we propose PersuasiveToM, a benchmark designed to evaluate the ToM abilities of LLMs in persuasive dialogues. Our framework introduces two categories of questions: (1) ToM Reasoning, assessing the capacity of LLMs to track evolving mental states (e.g., desire shifts in persuadees), and (2) ToM Application, evaluating whether LLMs can take advantage of inferred mental states to select effective persuasion strategies (e.g., emphasize rarity) and evaluate the effectiveness of persuasion strategies. Experiments across eight state-of-the-art LLMs reveal that while models excel on multiple questions, they struggle to answer questions that need tracking the dynamics and shifts of mental states and understanding the mental states in the whole dialogue comprehensively. Our aim with PersuasiveToM is to allow an effective evaluation of the ToM reasoning ability of LLMs with more focus on complex psychological activities. Our code is available at this https URL.

[NLP-18] Capability Localization: Capabilities Can be Localized rather than Individual Knowledge

[Quick Read]: This paper examines the assumption that individual knowledge can be localized in the parameters of large language models (LLMs) and asks how model parameters contribute to performance improvement. Fidelity and reliability experiments indicate that individual knowledge cannot be localized, whereas data commonalities can. The key contribution is Commonality Neuron Localization (CNL), a method that locates commonality neurons and achieves a neuron overlap rate of 96.42% on GSM8K. Cross-data experiments further indicate that commonality neurons are a collection of capability neurons that can enhance model performance.

Link: https://arxiv.org/abs/2502.20992
Authors: Xiusheng Huang, Jiaxiang Liu, Yequan Wang, Jun Zhao, Kang Liu
Affiliations: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence, Beijing, China
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large scale language models have achieved superior performance in tasks related to natural language processing, however, it is still unclear how model parameters affect performance improvement. Previous studies assumed that individual knowledge is stored in local parameters, and the storage form of individual knowledge is dispersed parameters, parameter layers, or parameter chains, which are not unified. We found through fidelity and reliability evaluation experiments that individual knowledge cannot be localized. Afterwards, we constructed a dataset for decoupling experiments and discovered the potential for localizing data commonalities. To further reveal this phenomenon, this paper proposes a Commonality Neuron Localization (CNL) method, which successfully locates commonality neurons and achieves a neuron overlap rate of 96.42% on the GSM8K dataset. Finally, we have demonstrated through cross data experiments that commonality neurons are a collection of capability neurons that possess the capability to enhance performance. Our code is available at this https URL.

[NLP-19] Merging Clinical Knowledge into Large Language Models for Medical Research and Applications: A Survey

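[Quick Read]: This survey addresses the lack of a comprehensive overview and comparison of how medical AI systems are built in academia and industry. It covers the main building blocks of such systems: clinical databases and datasets, training pipelines, integration of medical knowledge graphs, system applications, and evaluation systems. By systematically reviewing and comparing these building paradigms and their application scenarios, the survey aims to help researchers understand how current academic models perform across healthcare fields, what problems remain, and where the field is heading.

Link: https://arxiv.org/abs/2502.20988
Authors: Qiyuan Li, Haijiang Liu, Caicai Guo, Deyu Chen, Meng Wang, Feng Gao, Jinguang Gu
Affiliations: Wuhan University of Science and Technology; Huazhong University of Science and Technology; Wuhan University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: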

Abstract:Clinical knowledge is the collection of information learned from studies on the causes, prognosis, diagnosis, and treatment of diseases. This type of knowledge can improve curing performances, and promote physical health. With the emergence of large language models (LLMs), medical artificial intelligence (medical AI), which aims to apply academic medical AI systems to real-world medical scenarios, has entered a new age of development, resulting in excellent works such as DoctorGPT and Pangu-Drug from academic and industrial researches. However, the field lacks a comprehensive compendium and comparison of building medical AI systems from academia and industry. Therefore, this survey focuses on the building paradigms of medical AI systems including the use of clinical databases, datasets, training pipelines, integrating medical knowledge graphs, system applications, and evaluation systems. We hope that this survey can help relevant practical researchers understand the current performance of academic models in various fields of healthcare, as well as the potential problems and future directions for implementing these scientific achievements.

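[NLP-20] UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation

[Quick Read]: This paper tackles the task of ranking images by how well they match a given nominal compound that may carry an idiomatic meaning, in English and Brazilian Portuguese. The key is to use generative large language models (LLMs) and multilingual CLIP models to enhance the representation of idiomatic compounds. LLMs generate idiomatic meanings for potentially idiomatic compounds to enrich their semantic interpretation, and multilingual CLIP models encode these meanings as representations for image ranking. Contrastive learning and data augmentation are then used to fine-tune the embeddings. Experiments show that the resulting multimodal representations outperform those based only on the original nominal compounds; the fine-tuned embeddings look promising but are less effective than the embeddings without fine-tuning.

Link: https://arxiv.org/abs/2502.20984
Authors: Thanet Markchom, Tong Wu, Liting Huang, Huizhi Liang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: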

Abstract:SemEval-2025 Task 1 focuses on ranking images based on their alignment with a given nominal compound that may carry idiomatic meaning in both English and Brazilian Portuguese. To address this challenge, this work uses generative large language models (LLMs) and multilingual CLIP models to enhance idiomatic compound representations. LLMs generate idiomatic meanings for potentially idiomatic compounds, enriching their semantic interpretation. These meanings are then encoded using multilingual CLIP models, serving as representations for image ranking. Contrastive learning and data augmentation techniques are applied to fine-tune these embeddings for improved performance. Experimental results show that multimodal representations extracted through this method outperformed those based solely on the original nominal compounds. The fine-tuning approach shows promising outcomes but is less effective than using embeddings without fine-tuning. The source code used in this paper is available at this https URL.

[NLP-21] Set-Theoretic Compositionality of Sentence Embeddings

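[Quick Read]: This paper addresses the fact that existing evaluations of sentence encoders focus mainly on task-specific performance and say little about whether the embeddings exhibit basic compositional properties in a task-independent setting. Drawing on classical set theory, the paper proposes six criteria built on three core "set-like" operations (TextOverlap, TextDifference, and TextUnion) and systematically evaluates 7 classical sentence encoders and 9 large language model (LLM)-based encoders against them. SBERT consistently shows set-like compositional properties, surpassing even the latest LLMs. The paper also introduces a new dataset of about 192K samples to support future benchmarking.

Link: https://arxiv.org/abs/2502.20975
Authors: Naman Bansal, Yash Mahajan, Sanjeev Sinha, Santu Karmaker
Affiliations: Auburn University; University of Central Florida
Subjects: Computation and Language (cs.CL)
Comments: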

Abstract: Sentence encoders play a pivotal role in various NLP tasks; hence, an accurate evaluation of their compositional properties is paramount. However, existing evaluation methods predominantly focus on goal task-specific performance. This leaves a significant gap in understanding how well sentence embeddings demonstrate fundamental compositional properties in a task-independent context. Leveraging classical set theory, we address this gap by proposing six criteria based on three core “set-like” compositions/operations: TextOverlap, TextDifference, and TextUnion. We systematically evaluate 7 classical and 9 Large Language Model (LLM)-based sentence encoders to assess their alignment with these criteria. Our findings show that SBERT consistently demonstrates set-like compositional properties, surpassing even the latest LLMs. Additionally, we introduce a new dataset of ~192K samples designed to facilitate future benchmarking efforts on set-like compositionality of sentence embeddings.

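[NLP-22] Arabizi vs LLMs: Can the Genie Understand the Language of Aladdin?

[Quick Read]: This paper addresses the challenges that Arabizi poses for machine translation, in particular its lack of formal structure and its deeply embedded cultural nuances. Motivated by the need for gist translation, it evaluates how well different large language models (LLMs) decode and translate Arabizi across several rarely studied Arabic dialects. The approach combines human evaluation with automatic metrics to analyze translation of Arabizi into Modern Standard Arabic and English, and asks which dialects are translated best and whether translations into English are better than those into Arabic.

Link: https://arxiv.org/abs/2502.20973
Authors: Perla Al Almaoui, Pierrette Bouillon, Simon Hengchen
Affiliations: Faculté de traduction et d’interprétation, Université de Genève; iguanodon.ai
Subjects: Computation and Language (cs.CL)
Comments: Submitted to MT Summit 2025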

Abstract:In this era of rapid technological advancements, communication continues to evolve as new linguistic phenomena emerge. Among these is Arabizi, a hybrid form of Arabic that incorporates Latin characters and numbers to represent the spoken dialects of Arab communities. Arabizi is widely used on social media and allows people to communicate in an informal and dynamic way, but it poses significant challenges for machine translation due to its lack of formal structure and deeply embedded cultural nuances. This case study arises from a growing need to translate Arabizi for gisting purposes. It evaluates the capacity of different LLMs to decode and translate Arabizi, focusing on multiple Arabic dialects that have rarely been studied up until now. Using a combination of human evaluators and automatic metrics, this research project investigates the model’s performance in translating Arabizi into both Modern Standard Arabic and English. Key questions explored include which dialects are translated most effectively and whether translations into English surpass those into Arabic.

[NLP-23] Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs

[Quick Read]: This paper addresses the significant safety risks that role-playing introduces for large language models (LLMs). Existing role-play fine-tuning improves role adaptability but can degrade safety performance, particularly for villainous characters. The key contribution is Safety-Aware Role-Play Fine-Tuning (SaRFT), a method designed to balance role-playing capability and safety. Experiments on several instruction-tuned models show that SaRFT outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning. The study underlines the need for role-adaptive safety measures and offers practical insight into mitigating role-specific safety risks.

Link: https://arxiv.org/abs/2502.20968
Authors: Weixiang Zhao, Yulin Hu, Yang Deng, Jiahe Guo, Xingyu Sui, Xinyang Han, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
Affiliations: Harbin Institute of Technology; Singapore Management University; National University of Singapore
Subjects: Computation and Language (cs.CL)
Comments: 25 pages, 10 figures, 13 tables

Abstract:Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning settings. Our findings highlight the necessity of role-adaptive safety measures and provide insights into mitigating role-specific safety risks in role-playing LLMs.

[NLP-24] WebFAQ: A Multilingual Collection of Natural QA Datasets for Dense Retrieval

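[Quick Read]: This paper builds WebFAQ, a large-scale multilingual collection of open-domain question answering datasets (96 million QA pairs across 75 languages) for training and evaluating multilingual dense retrieval models. The key is a systematic curation pipeline, with refined filtering and near-duplicate detection, that yields high-quality QA pairs, which are then used to fine-tune a pretrained multilingual XLM-RoBERTa model. The paper also constructs QA-aligned bilingual corpora using state-of-the-art bitext mining and automated LLM-assessed translation evaluation. The retrieval gains of the fine-tuned model generalize to other multilingual retrieval benchmarks in a zero-shot setting.

Link: https://arxiv.org/abs/2502.20936
Authors: Michael Dinzinger, Laura Caspari, Kanishka Ghosh Dastidar, Jelena Mitrović, Michael Granitzer
Affiliations: University of Passau
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 10 pages, 3 figures, 7 tables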

Abstract:We present WebFAQ, a large-scale collection of open-domain question answering datasets derived from FAQ-style this http URL annotations. In total, the data collection consists of 96 million natural question-answer (QA) pairs across 75 languages, including 47 million (49%) non-English samples. WebFAQ further serves as the foundation for 20 monolingual retrieval benchmarks with a total size of 11.2 million QA pairs (5.9 million non-English). These datasets are carefully curated through refined filtering and near-duplicate detection, yielding high-quality resources for training and evaluating multilingual dense retrieval models. To empirically confirm WebFAQ’s efficacy, we use the collected QAs to fine-tune an in-domain pretrained XLM-RoBERTa model. Through this process of dataset-specific fine-tuning, the model achieves significant retrieval performance gains, which generalize - beyond WebFAQ - to other multilingual retrieval benchmarks evaluated in zero-shot setting. Last but not least, we utilize WebFAQ to construct a set of QA-aligned bilingual corpora spanning over 1000 language pairs using state-of-the-art bitext mining and automated LLM-assessed translation evaluation. Due to our advanced, automated method of bitext dataset generation, the resulting bilingual corpora demonstrate higher translation quality compared to similar datasets. WebFAQ and all associated resources are publicly available on GitHub and HuggingFace.

[NLP-25] Automated Evaluation of Meter and Rhyme in Russian Generative and Human-Authored Poetry

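[Quick Read]: This paper addresses the shortage of data engineering and automatic evaluation tools for Russian poetry generation systems, in particular tools that check whether a poem follows versification rules such as the correct alternation of stressed and unstressed syllables and the presence of rhyme. The key contribution is the Russian Poetry Scansion Tool library, which handles stress mark placement in Russian syllabo-tonic poetry, rhyme detection, and identification of defects of poeticness. The paper also releases RIFMA, a dataset of poem fragments from various genres and forms annotated with stress marks, which can be used to evaluate how accurately modern large language models place stress marks in poetic text.

Link: https://arxiv.org/abs/2502.20931
Authors: Ilya Koziev
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 7 pages, 1 figure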

Abstract: Generative poetry systems require effective tools for data engineering and automatic evaluation, particularly to assess how well a poem adheres to versification rules, such as the correct alternation of stressed and unstressed syllables and the presence of rhymes. In this work, we introduce the Russian Poetry Scansion Tool library designed for stress mark placement in Russian-language syllabo-tonic poetry, rhyme detection, and identification of defects of poeticness. Additionally, we release RIFMA, a dataset of poem fragments spanning various genres and forms, annotated with stress marks. This dataset can be used to evaluate the capability of modern large language models to accurately place stress marks in poetic texts. The published resources provide valuable tools for researchers and practitioners in the field of creative generative AI, facilitating advancements in the development and evaluation of generative poetry systems.

[NLP-26] Everything Everywhere All at Once: Is Mechanistic Interpretability Identifiable?

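[Quick Read]: This paper asks a core question for Mechanistic Interpretability (MI) of AI systems used in high-stakes applications: for a given behavior, and under MI's criteria, does a unique explanation exist? Borrowing the notion of identifiability from statistics, it studies two main MI strategies: (1) "where-then-what", which first isolates a circuit that replicates the model's behavior and then interprets it, and (2) "what-then-where", which starts from candidate algorithms and uses causal alignment to search for neural activation subspaces that implement them. Testing both on Boolean functions and small multi-layer perceptrons, with candidate explanations fully enumerated, reveals systematic non-identifiability: multiple circuits replicate the same behavior, one circuit admits multiple interpretations, several algorithms align with the network, and one algorithm aligns with different subspaces. The paper then discusses whether uniqueness is necessary: a pragmatic view may require only predictive and manipulability standards, whereas stricter criteria may be needed if uniqueness is essential for understanding. It also references the inner interpretability framework, which validates explanations through multiple criteria.

Link: https://arxiv.org/abs/2502.20914
Authors: Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard
Affiliations: Université Grenoble Alpes; CNRS; Grenoble INP; LIG, 38000 Grenoble, France
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: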

Abstract: As AI systems are used in high-stakes applications, ensuring interpretability is crucial. Mechanistic Interpretability (MI) aims to reverse-engineer neural networks by extracting human-understandable algorithms to explain their behavior. This work examines a key question: for a given behavior, and under MI’s criteria, does a unique explanation exist? Drawing on identifiability in statistics, where parameters are uniquely inferred under specific assumptions, we explore the identifiability of MI explanations. We identify two main MI strategies: (1) “where-then-what,” which isolates a circuit replicating model behavior before interpreting it, and (2) “what-then-where,” which starts with candidate algorithms and searches for neural activation subspaces implementing them, using causal alignment. We test both strategies on Boolean functions and small multi-layer perceptrons, fully enumerating candidate explanations. Our experiments reveal systematic non-identifiability: multiple circuits can replicate behavior, a circuit can have multiple interpretations, several algorithms can align with the network, and one algorithm can align with different subspaces. Is uniqueness necessary? A pragmatic approach may require only predictive and manipulability standards. If uniqueness is essential for understanding, stricter criteria may be needed. We also reference the inner interpretability framework, which validates explanations through multiple criteria. This work contributes to defining explanation standards in AI.
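
The claim that multiple circuits can replicate the same behavior can be illustrated with a toy enumeration. The sketch below is not the paper's experimental setup; it assumes a hypothetical circuit family of the form `top(left(x, y), right(x, y))` over three gates and lists every member that reproduces the XOR truth table.

```python
from itertools import product

# Hypothetical gate set for the toy circuit family (an assumption).
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
}
INPUTS = list(product([0, 1], repeat=2))
TARGET = tuple(x ^ y for x, y in INPUTS)  # XOR truth table


def truth_table(top, left, right):
    """Evaluate top(left(x, y), right(x, y)) on all four inputs."""
    return tuple(
        GATES[top](GATES[left](x, y), GATES[right](x, y)) for x, y in INPUTS
    )


# Exhaustively enumerate all 27 circuits and keep those that match XOR.
matches = [c for c in product(GATES, repeat=3) if truth_table(*c) == TARGET]
for top, left, right in matches:
    print(f"{top}({left}(x, y), {right}(x, y))")
```

Even in this tiny search space more than one circuit matches the target behavior, for example `AND(OR(x, y), NAND(x, y))` and its mirror image, so behavior alone does not single out one explanation.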

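[NLP-27] A database to support the evaluation of gender biases in GPT-4o output

[Quick Read]: This paper addresses gender bias in the language output of large language models (LLMs), which can reinforce or exacerbate harm to members of disadvantaged social groups. To support fairness evaluation of LLM output, it proposes a novel approach to database construction that allows gender-related biases in LLM-generated language to be assessed beyond merely measuring their degree of neutralization. The stated goals are to advance research in this area, promote discourse on suitable normative bases and evaluation methodologies, and improve the reproducibility of related studies.

Link: https://arxiv.org/abs/2502.20898
Authors: Luise Mehner, Lena Alicija Philine Fiedler, Sabine Ammon, Dorothea Kolossa
Affiliations: TU Berlin
Subjects: Computation and Language (cs.CL)
Comments: ISCA/ITG Workshop on Diversity in Large Speech and Language Models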

Abstract:The widespread application of Large Language Models (LLMs) involves ethical risks for users and societies. A prominent ethical risk of LLMs is the generation of unfair language output that reinforces or exacerbates harm for members of disadvantaged social groups through gender biases (Weidinger et al., 2022; Bender et al., 2021; Kotek et al., 2023). Hence, the evaluation of the fairness of LLM outputs with respect to such biases is a topic of rising interest. To advance research in this field, promote discourse on suitable normative bases and evaluation methodologies, and enhance the reproducibility of related studies, we propose a novel approach to database construction. This approach enables the assessment of gender-related biases in LLM-generated language beyond merely evaluating their degree of neutralization.

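[NLP-28] Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals' Subjective Text Perceptions

[Quick Read]: This paper asks whether large language models (LLMs) can be trained to model the variation in annotations that is attributed to annotators' sociodemographic characteristics. Prior work shows that untrained LLMs perform poorly when prompted with sociodemographic attributes, which suggests limited inherent sociodemographic knowledge. Using a curated dataset of five tasks with standardized sociodemographics, the paper finds that training does improve sociodemographic prompting, but that the gain comes largely from learning annotator-specific behaviour rather than meaningful links between sociodemographics and annotation. This raises doubts about the current use of LLMs for simulating sociodemographic variation and behaviour.

Link: https://arxiv.org/abs/2502.20897
Authors: Matthias Orlikowski, Jiaxin Pei, Paul Röttger, Philipp Cimiano, David Jurgens, Dirk Hovy
Affiliations: Bielefeld University; Stanford University; Computing Sciences Department, Bocconi University, Milan, Italy; University of Michigan
Subjects: Computation and Language (cs.CL)
Comments: Reviewed ARR December 2024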

Abstract:People naturally vary in their annotations for subjective questions and some of this variation is thought to be due to the person’s sociodemographic characteristics. LLMs have also been used to label data, but recent work has shown that models perform poorly when prompted with sociodemographic attributes, suggesting limited inherent sociodemographic knowledge. Here, we ask whether LLMs can be trained to be accurate sociodemographic models of annotator variation. Using a curated dataset of five tasks with standardized sociodemographics, we show that models do improve in sociodemographic prompting when trained but that this performance gain is largely due to models learning annotator-specific behaviour rather than sociodemographic patterns. Across all tasks, our results suggest that models learn little meaningful connection between sociodemographics and annotation, raising doubts about the current use of LLMs for simulating sociodemographic variation and behaviour.

[NLP-29] ProBench: Benchmarking Large Language Models in Competitive Programming

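[Quick Read]: This paper addresses the growing inadequacy of existing code evaluation benchmarks for measuring the code reasoning ability of advanced large language models (LLMs). It proposes ProBench, a competitive programming benchmark inspired by the International Collegiate Programming Contest (ICPC). ProBench collects a comprehensive set of problems from Codeforces, Luogu, and Nowcoder and obtains real test results through online submissions to keep the evaluation fair and accurate. It also establishes a unified problem attribute system with difficulty grading and algorithm tagging. With this data, the paper evaluates 9 recent LLMs along several dimensions, including thought chain analysis, error type diagnosis, and reasoning depth evaluation. QwQ-32B-Preview achieves the best score (20.93), followed by DeepSeek-V3 (16.38), which the authors take to indicate that models trained on specialized reasoning tasks significantly outperform general-purpose models.

Link: https://arxiv.org/abs/2502.20868
Authors: Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, Deyi Xiong
Affiliations: Tianjin University
Subjects: Computation and Language (cs.CL)
Comments: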

Abstract:With reasoning language models such as OpenAI-o3 and DeepSeek-R1 emerging, large language models (LLMs) have entered a new phase of development. However, existing benchmarks for coding evaluation are gradually inadequate to assess the capability of advanced LLMs in code reasoning. To bridge the gap for high-level code reasoning assessment, we propose ProBench to benchmark LLMs in competitive programming, drawing inspiration from the International Collegiate Programming Contest. ProBench collects a comprehensive set of competitive programming problems from Codeforces, Luogu, and Nowcoder platforms during the period from July to December 2024, obtaining real test results through online submissions to ensure the fairness and accuracy of the evaluation. We establish a unified problem attribute system, including difficulty grading and algorithm tagging. With carefully collected and annotated data in ProBench, we systematically assess 9 latest LLMs in competitive programming across multiple dimensions, including thought chain analysis, error type diagnosis, and reasoning depth evaluation. Experimental results show that QwQ-32B-Preview achieves the best score of 20.93 followed by DeepSeek-V3 with a score of 16.38, suggesting that models trained with specialized reasoning tasks significantly outperform general-purpose models (even larger than reasoning-oriented models) in programming. Further analysis also reveals key areas for programming capability enhancement, e.g., algorithm adaptability and reasoning sufficiency, providing important insights for the future development of reasoning models.

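[NLP-30] Better Benchmarking LLMs for Zero-Shot Dependency Parsing

[Quick Read]: This paper evaluates state-of-the-art open-weight large language models (LLMs) on zero-shot syntactic parsing and compares them with baselines that have no access to the input sentence, including baselines not previously used in this context, such as random projective trees and optimal linear arrangements. The systematic comparison shows that most of the tested LLMs cannot beat the best uninformed baselines; only the newest and largest LLaMA versions do so for most languages, and their performance is still rather low. The paper concludes that accurate zero-shot syntactic parsing is not yet achievable with open LLMs.

Link: https://arxiv.org/abs/2502.20866
Authors: Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares
Affiliations: Universidade da Coruña; CITIC
Subjects: Computation and Language (cs.CL)
Comments: Accepted at NoDaLiDa/Baltic-HLT 2025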

Abstract:While LLMs excel in zero-shot tasks, their performance in linguistic challenges like syntactic parsing has been less scrutinized. This paper studies state-of-the-art open-weight LLMs on the task by comparing them to baselines that do not have access to the input sentence, including baselines that have not been used in this context such as random projective trees or optimal linear arrangements. The results show that most of the tested LLMs cannot outperform the best uninformed baselines, with only the newest and largest versions of LLaMA doing so for most languages, and still achieving rather low performance. Thus, accurate zero-shot syntactic parsing is not forthcoming with open LLMs.
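
To make the idea of a baseline with no access to the input sentence concrete, here is a minimal sketch. It deliberately uses a simpler baseline than the ones named in the abstract: a chain in which every token attaches to the token before it. Scoring by unlabeled attachment score (UAS) is an assumption; the abstract does not name the metric, and the gold heads below are a hypothetical example.

```python
def chain_baseline(n_tokens):
    """Heads for tokens 1..n: token 1 attaches to the root (0), token i to i-1.

    The baseline only needs the sentence length, never the words themselves.
    """
    return list(range(n_tokens))


def uas(predicted, gold):
    """Unlabeled attachment score: fraction of tokens with the correct head."""
    if len(predicted) != len(gold):
        raise ValueError("head lists must have equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical gold heads for "The cat sat down" (1-indexed, 0 = root).
gold_heads = [2, 3, 0, 3]
pred_heads = chain_baseline(len(gold_heads))
print(pred_heads, uas(pred_heads, gold_heads))
```

Here the chain baseline gets one head out of four right (the last token), which shows how an uninformed baseline can still score above zero and why a parser should be compared against such baselines.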

[NLP-31] Do Language Models Understand Honorific Systems in Javanese?

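[Quick Read]: This paper addresses the lack of a comprehensive corpus that captures the complex honorific system of Javanese for natural language processing (NLP) tasks. It presents Unggah-Ungguh, a carefully curated dataset that reflects the nuances of Unggah-Ungguh Basa, the Javanese speech etiquette framework in which word choice depends on social hierarchy and context. The dataset is used to assess how language models (LMs) handle different honorific levels through classification and machine translation, including translation between Javanese at specific honorific levels and Indonesian, and to test whether LMs can generate contextually appropriate honorifics in conversation. Current LMs struggle with most honorific levels and show a bias toward certain tiers.

Link: https://arxiv.org/abs/2502.20864
Authors: Mohammad Rifqi Farhansyah, Iwan Darmawan, Adryan Kusumawardhana, Genta Indra Winata, Alham Fikri Aji, Derry Tanti Wijaya
Affiliations: Institut Teknologi Bandung; Monash University Indonesia; Capital One; MBZUAI
Subjects: Computation and Language (cs.CL)
Comments: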

Abstract: The Javanese language features a complex system of honorifics that vary according to the social status of the speaker, listener, and referent. Despite its cultural and linguistic significance, there has been limited progress in developing a comprehensive corpus to capture these variations for natural language processing (NLP) tasks. In this paper, we present Unggah-Ungguh, a carefully curated dataset designed to encapsulate the nuances of Unggah-Ungguh Basa, the Javanese speech etiquette framework that dictates the choice of words and phrases based on social hierarchy and context. Using Unggah-Ungguh, we assess the ability of language models (LMs) to process various levels of Javanese honorifics through classification and machine translation tasks. To further evaluate cross-lingual LMs, we conduct machine translation experiments between Javanese (at specific honorific levels) and Indonesian. Additionally, we explore whether LMs can generate contextually appropriate Javanese honorifics in conversation tasks, where the honorific usage should align with the social role and contextual cues. Our findings indicate that current LMs struggle with most honorific levels, exhibiting a bias toward certain honorific tiers.

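[NLP-32] The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents

[Quick Read]: This paper asks how the capabilities of large language models (LLMs) can be understood systematically through the lens of "human simulation", closing the gap between existing explanations and real-world human intelligence. It is organized around three core questions: (1) How do personality traits affect problem-solving in closed tasks? (2) How do personality traits shape creativity in open tasks? (3) How does single-agent performance influence multi-agent collaboration?
The key to the approach is to assign Big Five personality traits to LLM agents and evaluate their performance in single-agent and multi-agent settings. Specific traits significantly influence reasoning accuracy (closed tasks) and creative output (open tasks). Multi-agent systems also exhibit collective intelligence that differs from individual capability and is driven by distinguishing combinations of personalities. The paper argues that LLMs inherently simulate human behavior through next-token prediction, mirroring human language, decision-making, and collaborative dynamics.

Link: https://arxiv.org/abs/2502.20859
Authors: Yifan Duan, Yihong Tang, Xuefeng Bai, Kehai Chen, Juntao Li, Min Zhang
Affiliations: Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China
Subjects: Computation and Language (cs.CL)
Comments: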

Abstract: Large language models (LLMs) excel in both closed tasks (including problem-solving, and code generation) and open tasks (including creative writing), yet existing explanations for their capabilities lack connections to real-world human intelligence. To fill this gap, this paper systematically investigates LLM intelligence through the lens of “human simulation”, addressing three core questions: (1) How do personality traits affect problem-solving in closed tasks? (2) How do traits shape creativity in open tasks? (3) How does single-agent performance influence multi-agent collaboration? By assigning Big Five personality traits to LLM agents and evaluating their performance in single- and multi-agent settings, we reveal that specific traits significantly influence reasoning accuracy (closed tasks) and creative output (open tasks). Furthermore, multi-agent systems exhibit collective intelligence distinct from individual capabilities, driven by distinguishing combinations of personalities. We demonstrate that LLMs inherently simulate human behavior through next-token prediction, mirroring human language, decision-making, and collaborative dynamics.

[NLP-33] MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training

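[Quick Read]: This paper addresses the difficulty that state-of-the-art transformer models have in processing and understanding mathematical formulas, whose notation has a complex structure and many alternative representations. The key contribution is Math Mutator (MAMUT), a framework that generates equivalent and falsified versions of a mathematical formula given in LaTeX notation, capturing the variety of notations for the same concept. Using MAMUT, the authors generate four large mathematical datasets with diverse notation for training language models with enhanced mathematical embeddings.

Link: https://arxiv.org/abs/2502.20855
Authors: Jonathan Drechsel, Anja Reusch, Steffen Herbold
Affiliations: Faculty of Computer Science and Mathematics, University of Passau; Taub Faculty for Computer Science, Technion - Israel Institute of Technology
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: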

Abstract:Mathematical formulas are a fundamental and widely used component in various scientific fields, serving as a universal language for expressing complex concepts and relationships. While state-of-the-art transformer models excel in processing and understanding natural language, they encounter challenges with mathematical notation, which involves a complex structure and diverse representations. This study focuses on the development of specialized training datasets to enhance the encoding of mathematical content. We introduce Math Mutator (MAMUT), a framework capable of generating equivalent and falsified versions of a given mathematical formula in LaTeX notation, effectively capturing the mathematical variety in notation of the same concept. Based on MAMUT, we have generated four large mathematical datasets containing diverse notation, which can be used to train language models with enhanced mathematical embeddings.
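
The distinction between an equivalent and a falsified version of a formula can be shown on a single binary expression. The sketch below is only an illustration under stated assumptions: it is not MAMUT's implementation, the two function names are hypothetical, and it handles only two commutative operators by string manipulation.

```python
def equivalent_version(lhs, op, rhs):
    """Swap the operands of a commutative operator; the meaning is unchanged."""
    if op not in {"+", r"\cdot"}:
        raise ValueError("only commutative operators are handled here")
    return f"{rhs} {op} {lhs}"


def falsified_version(lhs, op, rhs):
    """Replace the operator with a different one, which changes the meaning."""
    swap = {"+": "-", r"\cdot": "+"}
    return f"{lhs} {swap[op]} {rhs}"


# Hypothetical LaTeX operands, chosen only for illustration.
formula = ("x^2", "+", r"\frac{1}{y}")
print(equivalent_version(*formula))
print(falsified_version(*formula))
```

Pairs built this way could serve as positive and negative training examples for a formula encoder: the swapped form should embed close to the original, the operator-changed form should not.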

[NLP-34] A Pilot Empirical Study on When and How to Use Knowledge Graphs as Retrieval Augmented Generation

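Quick read: This paper asks how to use knowledge-graph-based retrieval-augmented generation (KG-RAG) effectively across different application scenarios and technical configurations. Its core is a systematic analysis and empirical study that clarifies when KG-RAG is applicable and how its components are best configured, laying the groundwork for principled use of KG-RAG. To this end, the authors reimplement 6 KG-RAG methods, evaluate them on 7 datasets covering diverse scenarios, and analyze the impact of 9 KG-RAG configurations combined with 17 large language models (LLMs). The results underline the importance of appropriate application conditions and well-chosen configurations of KG-RAG components.

Link: https://arxiv.org/abs/2502.20854
Authors: Xujie Yuan, Yongxu Liu, Shimin Di, Shiwen Wu, Libin Zheng, Rui Meng, Xiaofang Zhou, Lei Chen, Jian Yin
Affiliations: SYSU (Sun Yat-sen University); PolyU (The Hong Kong Polytechnic University); HKUST (The Hong Kong University of Science and Technology); BNU-HKBU UIC (Beijing Normal University-Hong Kong Baptist University United International College); HKUST(GZ) (The Hong Kong University of Science and Technology, Guangzhou)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 2 figures, 14 tables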

Abstract:The integration of Knowledge Graphs (KGs) into the Retrieval Augmented Generation (RAG) framework has attracted significant interest, with early studies showing promise in mitigating hallucinations and improving model accuracy. However, a systematic understanding and comparative analysis of the rapidly emerging KG-RAG methods are still lacking. This paper seeks to lay the foundation for systematically answering the question of when and how to use KG-RAG by analyzing their performance in various application scenarios associated with different technical configurations. After outlining the mind map using KG-RAG framework and summarizing its popular pipeline, we conduct a pilot empirical study of KG-RAG works to reimplement and evaluate 6 KG-RAG methods across 7 datasets in diverse scenarios, analyzing the impact of 9 KG-RAG configurations in combination with 17 LLMs. Our results underscore the critical role of appropriate application conditions and optimal configurations of KG-RAG components.

[NLP-35] Learning to Substitute Components for Compositional Generalization

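Quick read: This paper targets the weak compositional generalization of neural language models, particularly where existing compositional data augmentation methods fail to introduce multi-grained compositional bias or where training samples have an imbalanced difficulty distribution. The key contributions are a new compositional augmentation strategy, Component Substitution (CompSub), which composes substantial substructures at multiple granularities across the whole training set, and the Learning Component Substitution (LCS) framework, which learns the component substitution probabilities end to end by maximizing the loss of the neural language model, thereby prioritizing challenging compositions with elusive concepts and novel contexts. The authors extend these ideas to in-context learning with pre-trained large language models, proposing the LCS-ICL algorithm to improve few-shot compositional generalization. A theoretical analysis explains why these algorithms improve compositional generalization, and experiments on four standard compositional generalization benchmarks confirm the advantage of CompSub, LCS, and LCS-ICL.

Link: https://arxiv.org/abs/2502.20834
Authors: Zhaoyi Li, Gangwei Jiang, Chenwang Wu, Ying Wei, Defu Lian, Enhong Chen
Affiliations: School of Computer Science and Technology, University of Science and Technology of China; Department of Computer Science, Hong Kong Baptist University; College of Computer Science and Technology, Zhejiang University
Categories: Computation and Language (cs.CL)
Comments: 23 pages, 9 figures, preprint; extension of the paper arXiv:2306.02840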

Abstract:Despite the rising prevalence of neural language models, recent empirical evidence suggests their deficiency in compositional generalization. One of the current de-facto solutions to this problem is compositional data augmentation, which aims to introduce additional compositional inductive bias. However, existing handcrafted augmentation strategies offer limited improvement when systematic generalization of neural language models requires multi-grained compositional bias (i.e., not limited to either lexical or structural biases alone) or when training sentences have an imbalanced difficulty distribution. To address these challenges, we first propose a novel compositional augmentation strategy called Component Substitution (CompSub), which enables multi-grained composition of substantial substructures across the entire training set. Furthermore, we introduce the Learning Component Substitution (LCS) framework. This framework empowers the learning of component substitution probabilities in CompSub in an end-to-end manner by maximizing the loss of neural language models, thereby prioritizing challenging compositions with elusive concepts and novel contexts. We extend the key ideas of CompSub and LCS to the recently emerging in-context learning scenarios of pre-trained large language models (LLMs), proposing the LCS-ICL algorithm to enhance the few-shot compositional generalization of state-of-the-art (SOTA) LLMs. Theoretically, we provide insights into why applying our algorithms to language models can improve compositional generalization performance. Empirically, our results on four standard compositional generalization benchmarks (SCAN, COGS, GeoQuery, and COGS-QL) demonstrate the superiority of CompSub, LCS, and LCS-ICL, with improvements of up to 66.5%, 10.3%, 1.4%, and 8.8%, respectively.

[NLP-36] HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models

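Quick read: This paper addresses the limited performance of multi-modal large language models (MLLMs) on videos involving human actions, which it attributes mainly to a lack of high-quality data. The proposed solution is a two-stage data annotation pipeline: first, collect videos featuring clear human actions from the Internet; second, annotate them in a standardized caption format that uses human attributes to distinguish individuals and describes their actions and interactions in chronological order. The pipeline yields two datasets, HAICTrain and HAICBench. HAICTrain contains 126K video-caption pairs for training stronger human action understanding, and HAICBench provides annotated resources for comprehensive evaluation. Experiments show that training with HAICTrain significantly improves results on four benchmarks and also improves text-to-video generation. Both datasets have been publicly released.

Link: https://arxiv.org/abs/2502.20811
Authors: Xiao Wang, Jingyun Hua, Weihong Lin, Yuanxing Zhang, Fuzheng Zhang, Jianlong Wu, Di Zhang, Liqiang Nie
Affiliations: Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: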

Abstract:Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, HAICBench includes 500 manually annotated video-caption pairs and 1,400 QA pairs, for a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human understanding abilities across 4 benchmarks, but can also improve text-to-video generation results. Both the HAICTrain and HAICBench are released at this https URL.

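[NLP-37] Plan2Align: Predictive Planning Based Test-Time Preference Alignment in Paragraph-Level Machine Translation

Quick read: This paper addresses the omissions and semantic inconsistencies that non-extensive language models exhibit in long-text (paragraph-level) translation, as well as the inability of existing preference alignment methods to maintain coherence over extended contexts. The key contribution is Plan2Align, a test-time alignment framework that treats translation as a predictive planning problem and adapts Model Predictive Control (MPC) to iteratively refine translation outputs, significantly improving paragraph-level translation.

Link: https://arxiv.org/abs/2502.20795
Authors: Kuang-Da Wang, Teng-Ruei Chen, Yu Heng Hung, Shuoyang Ding, Yueh-Hua Wu, Yu-Chiang Frank Wang, Chao-Han Huck Yang, Wen-Chih Peng, Ping-Chun Hsieh
Affiliations: National Yang Ming Chiao Tung University; NVIDIA
Categories: Computation and Language (cs.CL)
Comments: Preprint. Code will be released at Plan2Align GitHub link: this https URL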

Abstract:Machine Translation (MT) has been predominantly designed for sentence-level translation using transformer-based architectures. While next-token prediction based Large Language Models (LLMs) demonstrate strong capabilities in long-text translation, non-extensive language models often suffer from omissions and semantic inconsistencies when processing paragraphs. Existing preference alignment methods improve sentence-level translation but fail to ensure coherence over extended contexts due to the myopic nature of next-token generation. We introduce Plan2Align, a test-time alignment framework that treats translation as a predictive planning problem, adapting Model Predictive Control to iteratively refine translation outputs. Experiments on WMT24 Discourse-Level Literary Translation show that Plan2Align significantly improves paragraph-level translation, achieving performance surpassing or on par with the existing training-time and test-time alignment methods on LLaMA-3.1 8B.

[NLP-38] Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision

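Quick read: This paper addresses the difficulty large language models (LLMs) have in aggregating target information in long-context tasks, especially when multi-step reasoning over a long input is required. Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, but its effectiveness in long-context scenarios has been underexplored. The study finds that CoT's benefits generalize to most long-context tasks and grow with context length. Building on this, the paper proposes LongRePS, a process-supervised framework that uses a self-sampling mechanism to bootstrap high-quality reasoning paths together with a quality assessment protocol designed for long contexts. Compared with outcome-supervision baselines, the method improves in-domain results (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks).

Link: https://arxiv.org/abs/2502.20790
Authors: Dawei Zhu, Xiyu Wei, Guangxiang Zhao, Wenhao Wu, Haosheng Zou, Junfeng Ran, Xun Wang, Lin Sun, Xiangzheng Zhang, Sujian Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 6 figures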

Abstract:Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long-context tasks, where models need to reason over extensive input contexts to aggregate target information. While Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, its effectiveness for long-context scenarios remains underexplored. Through systematic investigation across diverse tasks, we demonstrate that CoT’s benefits generalize across most long-context scenarios and amplify with increasing context length. Motivated by this critical observation, we propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios. Experimental results on various long-context benchmarks demonstrate the effectiveness of our approach, achieving significant improvements over outcome supervision baselines on both in-domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data and trained models are made public to facilitate future research.

[NLP-39] GraphCheck: Multi-Path Fact-Checking with Entity-Relationship Graphs

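Quick read: This paper addresses the verification of complex claims that require multi-hop reasoning, a major challenge in automated fact-checking. It proposes GraphCheck, a framework that converts claims into entity-relationship graphs and verifies them comprehensively by identifying relations between explicit and latent entities across multiple paths, which improves the adaptability and robustness of verification. It also introduces DP-GraphCheck, a two-stage variant that uses direct prompting as an initial filtering step to improve performance. Experiments show that the approach outperforms existing methods on multi-hop reasoning tasks and that the two-stage framework generalizes well.

Link: https://arxiv.org/abs/2502.20785
Authors: Hyewon Jeon, Jay-Yoon Lee
Affiliations: Seoul National University
Categories: Computation and Language (cs.CL)
Comments: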

Abstract:Automated fact-checking aims to assess the truthfulness of text based on relevant evidence, yet verifying complex claims requiring multi-hop reasoning remains a significant challenge. We propose GraphCheck, a novel framework that converts claims into entity-relationship graphs for comprehensive verification. By identifying relations between explicit entities and latent entities across multiple paths, GraphCheck enhances the adaptability and robustness of verification. Furthermore, we introduce DP-GraphCheck, a two-stage variant that improves performance by incorporating direct prompting as an initial filtering step. Experiments on the HOVER and EX-FEVER datasets show that our approach outperforms existing methods, particularly in multi-hop reasoning tasks. Furthermore, our two-stage framework generalizes well to other fact-checking pipelines, demonstrating its versatility.

[NLP-40] MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models

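Quick read: This paper addresses hallucinations in vision-language models (VLMs) used in healthcare, where a model may produce plausible-looking but incorrect results that can jeopardize clinical decision making and affect diagnosis and treatment. It proposes MedHallTune, a large-scale benchmark designed to evaluate and mitigate hallucinations in medical VLMs. MedHallTune comprises over 100,000 images and 1,000,000 instruction pairs with ground-truth annotations, covering both hallucination and non-hallucination samples. Fine-tuning on MedHallTune improves the ability of existing models to manage hallucinations and raises their zero-shot performance on downstream visual question answering (VQA) tasks, making them more reliable in practical medical settings. The key is a comprehensive, targeted benchmark that guides model optimization.

Link: https://arxiv.org/abs/2502.20780
Authors: Qiao Yan, Yuchen Yuan, Xiaowei Hu, Yihan Wang, Jiaqi Xu, Jinpeng Li, Chi-Wing Fu, Pheng-Ann Heng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: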

Abstract:The increasing use of vision-language models (VLMs) in healthcare applications presents great challenges related to hallucinations, in which the models may generate seemingly plausible results that are in fact incorrect. Such hallucinations can jeopardize clinical decision making, potentially harming the diagnosis and treatments. In this work, we propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000 instruction pairs, MedHallTune includes both hallucination and non-hallucination samples, each with ground-truth annotations. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level. The experimental results show that fine-tuning with MedHallTune successfully improves the ability of several existing models to manage hallucinations and boost their zero-shot performance on downstream visual-question-answering (VQA) tasks, making them more reliable for practical medical applications. Our work contributes to the development of more trustworthy VLMs. Codes and dataset will be available at MedHallTune (this https URL).

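[NLP-41] Triple Phase Transitions: Understanding the Learning Dynamics of Large Language Models from a Neuroscience Perspective

Quick read: This paper seeks to understand the abrupt emergence of abilities during the training of large language models (LLMs), commonly called "phase transitions", whose mechanisms remain unclear. It conducts an integrative analysis from three interconnected perspectives (the similarity between LLMs and the human brain, the internal states of LLMs, and downstream task performance) and proposes a new interpretation of learning dynamics. The analysis reveals three phase transitions that commonly appear during training across LLMs differing in training data and architecture: (1) alignment with the entire brain surges as LLMs begin to follow task instructions (Brain Alignment and Instruction Following); (2) unexpectedly, LLMs diverge from the brain during a period in which downstream task accuracy temporarily stagnates (Brain Detachment and Stagnation); and (3) alignment with the brain returns as LLMs become able to solve downstream tasks (Brain Realignment and Consolidation). The findings shed light on the mechanisms behind phase transitions in LLMs and open new avenues for research bridging AI and neuroscience.

Link: https://arxiv.org/abs/2502.20779
Authors: Yuko Nakagi, Keigo Tada, Sota Yoshino, Shinji Nishimoto, Yu Takagi
Affiliations: Osaka University, Japan; National Institute of Information and Communications Technology, Japan; National Institute of Informatics, Japan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 46 pages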

Abstract:Large language models (LLMs) often exhibit abrupt emergent behavior, whereby new abilities arise at certain points during their training. This phenomenon, commonly referred to as a "phase transition", remains poorly understood. In this study, we conduct an integrative analysis of such phase transitions by examining three interconnected perspectives: the similarity between LLMs and the human brain, the internal states of LLMs, and downstream task performance. We propose a novel interpretation for the learning dynamics of LLMs that vary in both training data and architecture, revealing that three phase transitions commonly emerge across these models during training: (1) alignment with the entire brain surges as LLMs begin adhering to task instructions (Brain Alignment and Instruction Following), (2) unexpectedly, LLMs diverge from the brain during a period in which downstream task accuracy temporarily stagnates (Brain Detachment and Stagnation), and (3) alignment with the brain reoccurs as LLMs become capable of solving the downstream tasks (Brain Realignment and Consolidation). These findings illuminate the underlying mechanisms of phase transitions in LLMs, while opening new avenues for interdisciplinary research bridging AI and neuroscience.

[NLP-42] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ICLR2025

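Quick read: This paper addresses the computational cost of long-sequence inference in large language models (LLMs), especially in the attention pre-filling phase, where complexity grows quadratically with prompt length. Prior methods rely on fixed sparse attention patterns or on patterns identified from a limited set of cases, and lack the flexibility to adapt to varying inputs. The paper proposes FlexPrefill, a flexible sparse pre-filling mechanism that dynamically adjusts sparse attention patterns and the computational budget in real time to meet the needs of each input and attention head.

FlexPrefill has two key innovations: 1) query-aware sparse pattern determination, which measures Jensen-Shannon divergence to switch adaptively between query-specific diverse attention patterns and predefined attention patterns; and 2) cumulative-attention-based index selection, which dynamically selects the query-key indexes to compute for each attention pattern so that the sum of attention scores reaches a predefined threshold. Together these let FlexPrefill adapt the sparse pattern and sparsity ratio of each attention head to the prompt, improving efficiency in long-sequence inference. Experiments show significant improvements in both speed and accuracy over prior methods.

Link: https://arxiv.org/abs/2502.20766
Authors: Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, Xun Zhou
Affiliations: Peking University; The University of Hong Kong; ByteDance Inc
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted at ICLR 2025 (Oral)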

Abstract:Large language models (LLMs) encounter computational challenges during long-sequence inference, especially in the attention pre-filling phase, where the complexity grows quadratically with the prompt length. Previous efforts to mitigate these challenges have relied on fixed sparse attention patterns or identifying sparse attention patterns based on limited cases. However, these methods lacked the flexibility to efficiently adapt to varying input demands. In this paper, we introduce FlexPrefill, a Flexible sparse Pre-filling mechanism that dynamically adjusts sparse attention patterns and computational budget in real-time to meet the specific requirements of each input and attention head. The flexibility of our method is demonstrated through two key innovations: 1) Query-Aware Sparse Pattern Determination: By measuring Jensen-Shannon divergence, this component adaptively switches between query-specific diverse attention patterns and predefined attention patterns. 2) Cumulative-Attention Based Index Selection: This component dynamically selects query-key indexes to be computed based on different attention patterns, ensuring the sum of attention scores meets a predefined threshold. FlexPrefill adaptively optimizes the sparse pattern and sparse ratio of each attention head based on the prompt, enhancing efficiency in long-sequence inference tasks. Experimental results show significant improvements in both speed and accuracy over prior methods, providing a more flexible and efficient solution for LLM inference.
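
The two quantities named in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the threshold name `gamma`, and the use of plain Python lists for a single query's attention distribution are all hypothetical, and a real system would presumably work on GPU tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    mix = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def select_indices(attn, gamma):
    """Smallest set of key indices whose attention mass reaches gamma.

    Keys are taken in order of decreasing attention weight until the
    cumulative sum meets the threshold.
    """
    order = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += attn[i]
        if mass >= gamma:
            break
    return sorted(chosen)


attn = [0.05, 0.6, 0.1, 0.25]           # one query's attention over four keys
print(select_indices(attn, gamma=0.8))  # keys kept for this query
print(js_divergence(attn, [0.25] * 4))  # distance from a uniform pattern
```

With a threshold of this kind, a sharply peaked attention row keeps few keys and a flat row keeps many, so the sparsity ratio varies per input instead of being fixed in advance.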

[NLP-43] The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents

Quick read: This paper addresses the safety-utility trade-off in role-playing dialogue agents built on large language models (LLMs): how to preserve the utility of character simulation while avoiding unsafe content. Its key finding is that risk coupling, meaning the risk scenarios created by the interaction of villain characters with user queries, is a central factor in this trade-off. Based on this, the paper proposes the Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts the preference between safety and utility, and combines it with Coupling Margin Sampling (CMS) to strengthen the model's handling of high-risk scenarios, improving safety while maintaining utility.

Link: https://arxiv.org/abs/2502.20757
Authors: Yihong Tang, Kehai Chen, Xuefeng Bai, Zhengyu Niu, Bo Wang, Jie Liu, Min Zhang
Affiliations: Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China; Baidu Inc., Beijing, China; College of Intelligence and Computing, Tianjin University, Tianjin, China
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulations. However, it remains challenging for these agents to balance character portrayal utility with content safety because this essential character simulation often comes with the risk of generating unsafe content. To address this issue, we first conduct a systematic exploration of the safety-utility trade-off across multiple LLMs. Our analysis reveals that risk scenarios created by villain characters and user queries (referred to as risk coupling) contribute to this trade-off. Building on this, we propose a novel Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts safety-utility preferences based on the degree of risk coupling and guides the model to generate responses biased toward utility or safety. We further introduce Coupling Margin Sampling (CMS) into coupling detection to enhance the model’s ability to handle high-risk scenarios. Experimental results demonstrate that our approach improves safety metrics while maintaining utility.

[NLP-44] Acquiring Grounded Representations of Words with Situated Interactive Instruction

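Quick read: This paper studies how an agent can acquire grounded representations of words through mixed-initiative, situated interaction with a human instructor. The focus is on jointly acquiring perceptual, semantic, and procedural knowledge and constructing grounded word meanings. The key is interactive learning, which lets the agent actively request instruction about unknown concepts and thereby learn efficiently. The approach is implemented in Soar and evaluated on a table-top robotic arm capable of manipulating small objects.

Link: https://arxiv.org/abs/2502.20754
Authors: Shiwali Mohan, Aaron H. Mininger, James R. Kirk, John E. Laird
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: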

Abstract:We present an approach for acquiring grounded representations of words from mixed-initiative, situated interactions with a human instructor. The work focuses on the acquisition of diverse types of knowledge including perceptual, semantic, and procedural knowledge along with learning grounded meanings. Interactive learning allows the agent to control its learning by requesting instructions about unknown concepts, making learning efficient. Our approach has been instantiated in Soar and has been evaluated on a table-top robotic arm capable of manipulating small objects.

[NLP-45] Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow AAAI2025

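Quick read: This paper addresses object hallucination in large vision-language models, where generated image descriptions mention objects that are not in the image. It attributes object hallucination to overconfidence in irrelevant visual features, which arises when soft visual tokens are mapped into the word embedding space of the large language model (LLM). By analyzing the semantic similarity between visual tokens and the LLM's word embeddings, the authors find that the smoothness of the similarity distribution correlates strongly with the emergence of object hallucinations. They propose AdaVIB, a method based on the Variational Information Bottleneck (VIB) that alleviates overconfidence by introducing stochastic noise, together with an entropy-based noise-controlling strategy that adapts the injected noise to the smoothness of the similarity distribution. Experiments show that AdaVIB reduces overconfidence in irrelevant visual features and mitigates object hallucination, with consistent improvements on two object hallucination benchmarks.

Link: https://arxiv.org/abs/2502.20750
Authors: Jiaqi Bai, Hongcheng Guo, Zhongyuan Peng, Jian Yang, Zhoujun Li, Mohan Li, Zhihong Tian
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to AAAI 2025. Camera ready version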

Abstract:Large vision-language models show tremendous potential in understanding visual information through human languages. However, they are prone to suffer from object hallucination, i.e., the generated image descriptions contain objects that do not exist in the image. In this paper, we reveal that object hallucination can be attributed to overconfidence in irrelevant visual features when soft visual tokens map to the LLM’s word embedding space. Specifically, by figuring out the semantic similarity between visual tokens and LLM’s word embedding, we observe that the smoothness of similarity distribution strongly correlates with the emergence of object hallucinations. To mitigate hallucinations, we propose using the Variational Information Bottleneck (VIB) to alleviate overconfidence by introducing stochastic noise, facilitating the constraining of irrelevant information. Furthermore, we propose an entropy-based noise-controlling strategy to enable the injected noise to be adaptively constrained regarding the smoothness of the similarity distribution. We adapt the proposed AdaVIB across distinct model architectures. Experimental results demonstrate that the proposed AdaVIB mitigates object hallucinations by effectively alleviating the overconfidence in irrelevant visual features, with consistent improvements on two object hallucination benchmarks.
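
As a rough illustration of what "smoothness of the similarity distribution" and "stochastic noise" could look like numerically, the sketch below uses normalized entropy as a smoothness measure and a Gaussian reparameterized sample as the noise. Both choices are assumptions made for illustration: the function names are hypothetical, and the abstract does not say how AdaVIB maps entropy to a noise level, so `scale` is left as a free argument.

```python
import math
import random


def softmax(scores):
    """Turn raw similarity scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(n): near 1 for flat, near 0 for peaked."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)


def add_gaussian_noise(features, scale, rng):
    """Reparameterized sample z = mu + scale * eps, with eps ~ N(0, 1)."""
    return [mu + scale * rng.gauss(0.0, 1.0) for mu in features]


peaked = softmax([8.0, 0.5, 0.3, 0.1])  # one vocabulary entry dominates
smooth = softmax([1.0, 0.9, 1.1, 1.0])  # similarity spread almost evenly
print(normalized_entropy(peaked), normalized_entropy(smooth))

rng = random.Random(0)
print(add_gaussian_noise([0.2, -0.4, 1.3], scale=0.1, rng=rng))
```

An adaptive scheme would compute `scale` from the entropy of each visual token's similarity distribution; the exact rule would have to come from the paper itself.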

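[NLP-46] Teach-to-Reason with Scoring: Self-Explainable Rationale-Driven Multi-Trait Essay Scoring

Quick read: This paper addresses the lack of transparency in multi-trait automated essay scoring (AES) systems, which score accurately but do not explain their scores, leaving instructors and learners unconvinced and limiting practical use. It proposes the self-explainable Rationale-Driven Multi-trait automated Essay scoring (RaDME) framework. RaDME distills the reasoning capabilities of large language models (LLMs) into a smaller yet effective scorer; the student model generates each trait score followed by the corresponding rationale, so that during training it learns to select a more justifiable score by taking the subsequent rationale into account. The approach combines the reasoning strength of LLMs with the scoring accuracy of an optimized smaller model and significantly improves the transparency of AES.

Link: https://arxiv.org/abs/2502.20748
Authors: Heejin Do, Sangwon Ryu, Gary Geunbae Lee
Affiliations: Graduate School of Artificial Intelligence, POSTECH, Republic of Korea; Department of Computer Science and Engineering, POSTECH, Republic of Korea
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: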

Abstract:Multi-trait automated essay scoring (AES) systems provide a fine-grained evaluation of an essay’s diverse aspects. While they excel in scoring, prior systems fail to explain why specific trait scores are assigned. This lack of transparency leaves instructors and learners unconvinced of the AES outputs, hindering their practical use. To address this, we propose a self-explainable Rationale-Driven Multi-trait automated Essay scoring (RaDME) framework. RaDME leverages the reasoning capabilities of large language models (LLMs) by distilling them into a smaller yet effective scorer. This more manageable student model is optimized to sequentially generate a trait score followed by the corresponding rationale, thereby inherently learning to select a more justifiable score by considering the subsequent rationale during training. Our findings indicate that while LLMs underperform in direct AES tasks, they excel in rationale generation when provided with precise numerical scores. Thus, RaDME integrates the superior reasoning capacities of LLMs into the robust scoring accuracy of an optimized smaller model. Extensive experiments demonstrate that RaDME achieves both accurate and adequate reasoning while supporting high-quality multi-trait scoring, significantly enhancing the transparency of AES.

[NLP-47] Structured Preference Optimization for Vision-Language Long-Horizon Task Planning

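Quick read: This paper addresses the weakness of existing vision-language task planning methods in complex long-horizon planning within dynamic environments, which stems mainly from the difficulty of training models to produce high-quality long-horizon reasoning. It proposes Structured Preference Optimization (SPO), which strengthens reasoning and action selection through two components: preference-based scoring and optimization, which systematically evaluates reasoning chains on task relevance, visual grounding, and historical consistency; and curriculum-guided training, in which the model progresses from simple to complex tasks, improving generalization and reasoning robustness in long-horizon scenarios.

Link: https://arxiv.org/abs/2502.20742
Authors: Xiwen Liang, Min Lin, Weiqi Ruan, Rongtao Xu, Yuecheng Liu, Jiaqi Chen, Bingqian Lin, Yuzheng Zhuang, Xiaodan Liang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages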

Abstract:Existing methods for vision-language task planning excel in short-horizon tasks but often fall short in complex, long-horizon planning within dynamic environments. These challenges primarily arise from the difficulty of effectively training models to produce high-quality reasoning processes for long-horizon tasks. To address this, we propose Structured Preference Optimization (SPO), which aims to enhance reasoning and action selection in long-horizon task planning through structured preference evaluation and optimized training strategies. Specifically, SPO introduces: 1) Preference-Based Scoring and Optimization, which systematically evaluates reasoning chains based on task relevance, visual grounding, and historical consistency; and 2) Curriculum-Guided Training, where the model progressively adapts from simple to complex tasks, improving its generalization ability in long-horizon scenarios and enhancing reasoning robustness. To advance research in vision-language long-horizon task planning, we introduce ExtendaBench, a comprehensive benchmark covering 1,509 tasks across VirtualHome and Habitat 2.0, categorized into ultra-short, short, medium, and long tasks. Experimental results demonstrate that SPO significantly improves reasoning quality and final decision accuracy, outperforming prior methods on long-horizon tasks and underscoring the effectiveness of preference-driven optimization in vision-language task planning. Specifically, SPO achieves a +5.98% GCR and +4.68% SR improvement in VirtualHome and a +3.30% GCR and +2.11% SR improvement in Habitat over the best-performing baselines.

[NLP-48] Retrieval Backward Attention without Additional Training: Enhance Embeddings of Large Language Models via Repetition

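Quick read: This paper aims to improve the zero-shot performance of pre-trained language models with a simple, easily implemented method. Its key contribution is a novel backward attention mechanism that enhances the encoding of contextual information. Evaluated on the Chinese Massive Text Embedding Benchmark (C-MTEB), the method achieves significant improvements across multiple tasks and offers a useful reference for advancing zero-shot learning.

Link: https://arxiv.org/abs/2502.20726
Authors: Yifei Duan, Raphael Shang, Deng Liang, Yongqiang Cai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: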

Abstract:Language models can be viewed as functions that embed text into Euclidean space, where the quality of the embedding vectors directly determines model performance. Training such neural networks involves various uncertainties. This paper focuses on improving the performance of pre-trained language models in zero-shot settings through a simple and easily implementable method. We propose a novel backward attention mechanism to enhance contextual information encoding. Evaluated on the Chinese Massive Text Embedding Benchmark (C-MTEB), our approach achieves significant improvements across multiple tasks, providing valuable insights for advancing zero-shot learning capabilities.

[NLP-49] ProAI: Proactive Multi-Agent Conversational AI with Structured Knowledge Base for Psychiatric Diagnosis

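Quick read: This paper addresses the fact that most conversational systems driven by large language models (LLMs) only respond passively to user prompts and cannot steer the interaction. In goal-directed settings such as psychiatric diagnosis, consulting, and interviews, this reactive mode limits their effectiveness. The paper proposes ProAI, a goal-oriented, proactive conversational AI framework. ProAI integrates structured knowledge-guided memory, multi-agent proactive reasoning, and a multi-faceted evaluation strategy, enabling LLMs to perform clinician-style diagnostic reasoning rather than simple response generation. Through simulated patient interactions, user experience assessment, and professional clinical validation, ProAI reaches up to 83.3% accuracy in mental disorder differential diagnosis while maintaining professional and empathetic interaction standards, pointing toward more reliable, adaptive, and goal-driven AI diagnostic assistants.

Link: https://arxiv.org/abs/2502.20689
Authors: Yuqi Wu, Guangya Wan, Jingjing Li, Shengming Zhao, Lingfeng Ma, Tianyi Ye, Ion Pop, Yanbo Zhang, Jie Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, 8 figures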

Abstract:Most LLM-driven conversational AI systems operate reactively, responding to user prompts without guiding the interaction. However, many real-world applications, such as psychiatric diagnosis, consulting, and interviews, require AI to take a proactive role, asking the right questions and steering conversations toward specific objectives. Using mental health differential diagnosis as an application context, we introduce ProAI, a goal-oriented, proactive conversational AI framework. ProAI integrates structured knowledge-guided memory, multi-agent proactive reasoning, and a multi-faceted evaluation strategy, enabling LLMs to engage in clinician-style diagnostic reasoning rather than simple response generation. Through simulated patient interactions, user experience assessment, and professional clinical validation, we demonstrate that ProAI achieves up to 83.3% accuracy in mental disorder differential diagnosis while maintaining professional and empathetic interaction standards. These results highlight the potential for more reliable, adaptive, and goal-driven AI diagnostic assistants, advancing LLMs beyond reactive dialogue systems.

[NLP-50] JAM: Controllable and Responsible Text Generation via Causal Reasoning and Latent Vector Manipulation

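Quick read: This paper addresses the lack of an interpretable framework for controllable and responsible text generation with large language models (LLMs). Existing models typically operate as black boxes trained on large unlabeled datasets with statistical objectives, which makes responsible control difficult. The paper proposes JAM (Just A Move), a framework that integrates cause-effect analysis within the latent space of LLMs, uncovering and exploiting causal relationships among latent vectors to interpret and control generation. According to the authors, the approach makes generated text more controllable and realistic and is more computationally efficient than existing Controllable Text Generation (CTG) methods.

Link: https://arxiv.org/abs/2502.20684
Authors: Yingbing Huang, Deming Chen, Abhishek K. Umrawal
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, and 6 tables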

Abstract:While large language models (LLMs) have made significant strides in generating coherent and contextually relevant text, they often function as opaque black boxes, trained on vast unlabeled datasets with statistical objectives, lacking an interpretable framework for responsible control. In this paper, we introduce JAM (Just A Move), a novel framework that interprets and controls text generation by integrating cause-effect analysis within the latent space of LLMs. Based on our observations, we uncover the inherent causality in LLM generation, which is critical for producing responsible and realistic outputs. Moreover, we explore latent vectors as fundamental components in LLM architectures, aiming to understand and manipulate them for more effective and efficient controllable text generation. We evaluate our framework using a range of tools, including the HHH criteria, toxicity reduction benchmarks, and GPT-4 alignment measures. Our results show that JAM achieves up to a 22% improvement over previous Controllable Text Generation (CTG) methods across multiple quantitative metrics and human-centric evaluations. Furthermore, JAM demonstrates greater computational efficiency compared to other CTG methods. These results highlight the effectiveness and efficiency of JAM for responsible and realistic text generation, paving the way for more interpretable and controllable models.

[NLP-51] Fine-tuning BERT with Bidirectional LSTM for Fine-grained Movie Reviews Sentiment Analysis

Quick read: This paper fine-tunes a pre-trained BERT model combined with a Bidirectional LSTM (BiLSTM) to improve both binary and fine-grained Sentiment Analysis (SA) of movie reviews. Sentiment is classified for each review, and an overall sentiment polarity is then computed across all reviews. To improve generalization in fine-grained classification, the authors apply two accuracy-improvement techniques: Synthetic Minority Oversampling Technique (SMOTE) and NLP Augmenter (NLPAUG). Finally, a heuristic algorithm computes the overall polarity of the predicted reviews from the BERT+BiLSTM output vector. The reported results are 97.67% accuracy for binary classification on IMDb, 0.27% above the leading SOTA model, and 59.48% accuracy for five-class classification on SST-5, 3.6% above the BERT-large baseline.

Link: https://arxiv.org/abs/2502.20682
Authors: Gibson Nkhata,Susan Gauch,Usman Anjum,Justin Zhan
Affiliations: University of Arkansas; University of Cincinnati
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures, published in International Journal On Advances in Systems and Measurements, volume 16, numbers 3 and 4, 2023

Abstract:Sentiment Analysis (SA) is instrumental in understanding people's viewpoints, facilitating social media monitoring, recognizing products and brands, and gauging customer satisfaction. Consequently, SA has evolved into an active research domain within Natural Language Processing (NLP). Many approaches outlined in the literature devise intricate frameworks aimed at achieving high accuracy, focusing exclusively on either binary sentiment classification or fine-grained sentiment classification. In this paper, our objective is to fine-tune the pre-trained BERT model with Bidirectional LSTM (BiLSTM) to enhance both binary and fine-grained SA specifically for movie reviews. Our approach involves conducting sentiment classification for each review, followed by computing the overall sentiment polarity across all reviews. We present our findings on binary classification as well as fine-grained classification utilizing benchmark datasets. Additionally, we implement and assess two accuracy improvement techniques, Synthetic Minority Oversampling Technique (SMOTE) and NLP Augmenter (NLPAUG), to bolster the model's generalization in fine-grained sentiment classification. Finally, a heuristic algorithm is employed to calculate the overall polarity of predicted reviews from the BERT+BiLSTM output vector. Our approach performs comparably with state-of-the-art (SOTA) techniques in both classifications. For instance, in binary classification we achieve 97.67% accuracy, surpassing the leading SOTA model NB-weighted-BON+dv-cosine by 0.27% on the renowned IMDb dataset. Conversely, for five-class classification on SST-5, while the top SOTA model RoBERTa+large+Self-explaining attains 55.5% accuracy, our model achieves 59.48% accuracy, surpassing the BERT-large baseline by 3.6%.

[NLP-52] Disentangling Feature Structure: A Mathematically Provable Two-Stage Training Dynamics in Transformers

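Quick read: This paper seeks to explain the two-stage training dynamics observed when training transformers in practice, where model outputs progress from syntactically incorrect to syntactically correct and then to semantically correct. Existing theoretical analyses hardly account for this phenomenon, so the paper sets out to show theoretically how such two-stage dynamics arise. The key is to analyze transformer dynamics with feature learning techniques under in-context learning regimes, based on a disentangled two-type feature structure. A corollary further shows that the two-stage process is closely related to the spectral properties of the attention weights, in line with empirical findings.

Link: https://arxiv.org/abs/2502.20681
Authors: Zixuan Gong,Jiaye Teng,Yong Liu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)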

Abstract:Transformers may exhibit two-stage training dynamics during the real-world training process. For instance, when training GPT-2 on the Counterfact dataset, the answers progress from syntactically incorrect to syntactically correct to semantically correct. However, existing theoretical analyses hardly account for this two-stage phenomenon. In this paper, we theoretically demonstrate how such two-stage training dynamics occur in transformers. Specifically, we analyze the dynamics of transformers using feature learning techniques under in-context learning regimes, based on a disentangled two-type feature structure. Such disentanglement of feature structure is general in practice, e.g., natural languages contain syntax and semantics, and proteins contain primary and secondary structures. To the best of our knowledge, this is the first rigorous result regarding a two-stage optimization process in transformers. Additionally, a corollary indicates that such a two-stage process is closely related to the spectral properties of the attention weights, which accords well with empirical findings.

[NLP-53] Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository

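Quick read: This paper addresses predicting item difficulty from text content. Specifically, it focuses on recovering Item Response Theory (IRT)-based difficulty when the original data report only the item p-value (percent correct responses), using a penalized regression model over several feature groups (linguistic features of the reading items, test features, and context features). With all of these annotated metadata features, the model predicts item difficulty with a root mean squared error (RMSE) of 0.52, against a baseline RMSE of 0.92, and a correlation of 0.77 between true and predicted difficulty; adding LLM embeddings improves prediction only marginally. Models that use only linguistic features or only LLM embeddings perform similarly, suggesting that one of these feature categories may be sufficient.

Link: https://arxiv.org/abs/2502.20663
Authors: Radhika Kapoor,Sang T. Truong,Nick Haber,Maria Araceli Ruiz-Primo,Benjamin W. Domingue
Affiliations: unknown
Categories: Computation and Language (cs.CL)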

Abstract:Prediction of item difficulty based on its text content is of substantial interest. In this paper, we focus on the related problem of recovering IRT-based difficulty when the data originally reported item p-value (percent correct responses). We model this item difficulty using a repository of reading passages and student data from US standardized tests from New York and Texas for grades 3-8 spanning the years 2017-23. This repository is annotated with meta-data on (1) linguistic features of the reading items, (2) test features of the passage, and (3) context features. A penalized regression prediction model with all these features can predict item difficulty with RMSE 0.52 compared to baseline RMSE of 0.92, and with a correlation of 0.77 between true and predicted difficulty. We supplement these features with embeddings from LLMs (ModernBERT, BERT, and LLaMA), which marginally improve item difficulty prediction. When models use only item linguistic features or LLM embeddings, prediction performance is similar, which suggests that only one of these feature categories may be required. This item difficulty prediction model can be used to filter and categorize reading items and will be made publicly available for use by other stakeholders.
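The two headline numbers in this abstract, RMSE and the correlation between true and predicted difficulty, can be illustrated with a minimal sketch. The difficulty values below are made up for illustration, and the baseline is assumed to be a predictor that always outputs the mean difficulty; the abstract does not say how the authors' baseline is defined.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted difficulties."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical IRT difficulties and model predictions (not from the paper).
true_difficulty = [-1.0, 0.0, 1.0, 2.0]
predicted = [-0.5, 0.0, 0.5, 1.5]

# Assumed baseline: always predict the mean of the true difficulties.
mean_difficulty = sum(true_difficulty) / len(true_difficulty)
baseline = [mean_difficulty] * len(true_difficulty)

print("model RMSE:", round(rmse(true_difficulty, predicted), 3))
print("baseline RMSE:", round(rmse(true_difficulty, baseline), 3))
print("correlation:", round(pearson(true_difficulty, predicted), 3))
```

A model is useful in this sense when its RMSE falls clearly below the baseline RMSE, which is the comparison the abstract reports (0.52 against 0.92).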

[NLP-54] Automatic database description generation for Text-to-SQL

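Quick read: This paper tackles how to automatically generate effective database descriptions for the Text-to-SQL task when explicit table and column descriptions are unavailable, in order to bridge the gap between natural language and the database schema. The proposed method is a dual process: a coarse-to-fine pass followed by a fine-to-coarse pass. The coarse-to-fine pass uses the inherent knowledge of the LLM to guide understanding from the database down to tables and then columns, giving a holistic view of the structure and ensuring contextual alignment. The fine-to-coarse pass starts at the column level and yields a more accurate and nuanced understanding when stepping back to the table level. The key is combining the strengths of both passes. On the Bird benchmark, the generated descriptions improve SQL generation accuracy by 0.93% over using no descriptions, achieving 37% of human-level performance.

Link: https://arxiv.org/abs/2502.20657
Authors: Yingqi Gao,Zhiling Luo
Affiliations: Alibaba Group
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)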

Abstract:In the context of the Text-to-SQL task, table and column descriptions are crucial for bridging the gap between natural language and database schema. This report proposes a method for automatically generating effective database descriptions when explicit descriptions are unavailable. The proposed method employs a dual-process approach: a coarse-to-fine process, followed by a fine-to-coarse process. The coarse-to-fine approach leverages the inherent knowledge of LLM to guide the understanding process from databases to tables and finally to columns. This approach provides a holistic understanding of the database structure and ensures contextual alignment. Conversely, the fine-to-coarse approach starts at the column level, offering a more accurate and nuanced understanding when stepping back to the table level. Experimental results on the Bird benchmark indicate that using descriptions generated by the proposed method improves SQL generation accuracy by 0.93% compared to not using descriptions, and achieves 37% of human-level performance. The source code is publicly available at this https URL.

[NLP-55] Consistency Evaluation of News Article Summaries Generated by Large (and Small) Language Models

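Quick read: This paper addresses two issues in text summarization: large language models (LLMs) produce fluent abstractive summaries but can hallucinate details not grounded in the source text, and high-quality automated evaluation remains an open research area. The paper compares a range of summarization techniques (TextRank, BART, Mistral-7B-Instruct, and OpenAI GPT-3.5-Turbo) and evaluates their summaries with traditional metrics (ROUGE and BERT Score) as well as LLM-powered methods that directly assess a summary's consistency with the source text. It also introduces a meta-evaluation score that assesses the performance of the LLM evaluation system itself (prompt + model).

Link: https://arxiv.org/abs/2502.20647
Authors: Colleen Gilhuly,Haleh Shahzad
Affiliations: Royal Bank of Canada
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 6 figures, 4 tables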

Abstract:Text summarizing is a critical Natural Language Processing (NLP) task with applications ranging from information retrieval to content generation. Large Language Models (LLMs) have shown remarkable promise in generating fluent abstractive summaries but they can produce hallucinated details not grounded in the source text. Regardless of the method of generating a summary, high quality automated evaluations remain an open area of investigation. This paper embarks on an exploration of text summarization with a diverse set of techniques, including TextRank, BART, Mistral-7B-Instruct, and OpenAI GPT-3.5-Turbo. The generated summaries are evaluated using traditional metrics such as the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score and Bidirectional Encoder Representations from Transformers (BERT) Score, as well as LLM-powered evaluation methods that directly assess a generated summary’s consistency with the source text. We introduce a meta evaluation score which directly assesses the performance of the LLM evaluation system (prompt + model). We find that all summarization models produce consistent summaries when tested on the XL-Sum dataset, exceeding the consistency of the reference summaries.
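For readers unfamiliar with the traditional metrics mentioned here, the following is a minimal sketch of ROUGE-1 (unigram overlap) recall and precision. It uses plain whitespace tokenization and no stemming, which is a simplification; it is not the evaluation code used in the paper, and the function name `rouge1` is chosen for illustration.

```python
from collections import Counter

def rouge1(reference, candidate):
    """Return (recall, precision) of unigram overlap between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    return recall, precision

recall, precision = rouge1("the cat sat on the mat", "the cat sat")
print(recall, precision)
```

Because ROUGE only counts surface overlap, a summary can score well while containing unsupported details, which is one motivation for the consistency-focused LLM evaluation the abstract describes.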

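[NLP-56] LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation

Quick read: This paper addresses the absence of a benchmark designed to assess retrieval-augmented generation (RAG) in the legal domain. It proposes LexRAG, the first benchmark for evaluating RAG systems in multi-turn legal consultations, comprising 1,013 multi-turn dialogue samples and 17,228 candidate legal articles, with two key tasks: conversational knowledge retrieval and response generation. The authors also develop LexiT, a legal RAG toolkit that implements RAG system components tailored to the legal domain, and introduce an LLM-as-a-judge evaluation pipeline for detailed assessment. Experiments across various LLMs and retrieval methods reveal key limitations of existing RAG systems in handling legal consultation conversations.

Link: https://arxiv.org/abs/2502.20640
Authors: Haitao Li,Yifan Chen,Yiran Hu,Qingyao Ai,Junjie Chen,Xiaoyu Yang,Jianhui Yang,Yueyue Wu,Zeyang Liu,Yiqun Liu
Affiliations: DCST, Tsinghua University; Quan Cheng Laboratory; Beijing University of Posts and Telecommunications; Shandong University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 10 pages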

Abstract:Retrieval-augmented generation (RAG) has proven highly effective in improving large language models (LLMs) across various domains. However, there is no benchmark specifically designed to assess the effectiveness of RAG in the legal domain, which restricts progress in this area. To fill this gap, we propose LexRAG, the first benchmark to evaluate RAG systems for multi-turn legal consultations. LexRAG consists of 1,013 multi-turn dialogue samples and 17,228 candidate legal articles. Each sample is annotated by legal experts and consists of five rounds of progressive questioning. LexRAG includes two key tasks: (1) Conversational knowledge retrieval, requiring accurate retrieval of relevant legal articles based on multi-turn context. (2) Response generation, focusing on producing legally sound answers. To ensure reliable reproducibility, we develop LexiT, a legal RAG toolkit that provides a comprehensive implementation of RAG system components tailored for the legal domain. Additionally, we introduce an LLM-as-a-judge evaluation pipeline to enable detailed and effective assessment. Through experimental analysis of various LLMs and retrieval methods, we reveal the key limitations of existing RAG systems in handling legal consultation conversations. LexRAG establishes a new benchmark for the practical application of RAG systems in the legal domain, with its code and data available at this https URL.

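[NLP-57] Rectifying Belief Space via Unlearning to Harness LLMs' Reasoning

Quick read: This paper addresses the fact that large language models (LLMs) can exhibit advanced reasoning yet still frequently produce incorrect answers. The authors hypothesize that many of these errors stem from spurious beliefs, that is, propositions the model internally considers true but that are in fact incorrect. They propose rectifying the model's belief space by suppressing spurious beliefs while enhancing true ones. The method first identifies the beliefs that lead to incorrect or correct answers by prompting the model to generate textual explanations with Forward-Backward Beam Search (FBBS), and then applies unlearning to suppress the spurious beliefs and reinforce the true ones. Experiments show that the method corrects previously misanswered questions without harming overall model performance and improves generalization on unseen data, suggesting that belief-space rectification is a promising direction for reducing errors and improving reliability.

Link: https://arxiv.org/abs/2502.20620
Authors: Ayana Niwa,Masahiro Kaneko,Kentaro Inui
Affiliations: MBZUAI; Tohoku University; RIKEN
Categories: Computation and Language (cs.CL)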

Abstract:Large language models (LLMs) can exhibit advanced reasoning yet still generate incorrect answers. We hypothesize that such errors frequently stem from spurious beliefs, propositions the model internally considers true but are incorrect. To address this, we propose a method to rectify the belief space by suppressing these spurious beliefs while simultaneously enhancing true ones, thereby enabling more reliable inferences. Our approach first identifies the beliefs that lead to incorrect or correct answers by prompting the model to generate textual explanations, using our Forward-Backward Beam Search (FBBS). We then apply unlearning to suppress the identified spurious beliefs and enhance the true ones, effectively rectifying the model’s belief space. Empirical results on multiple QA datasets and LLMs show that our method corrects previously misanswered questions without harming overall model performance. Furthermore, our approach yields improved generalization on unseen data, suggesting that rectifying a model’s belief space is a promising direction for mitigating errors and enhancing overall reliability.

[NLP-58] Continuous Adversarial Text Representation Learning for Affective Recognition

Quick read: Pre-trained language models excel at semantic understanding but often struggle to capture the nuanced affective information needed for affective recognition tasks. To address this, the paper proposes a framework for enhancing emotion-aware embeddings in transformer-based models. The key is a continuous valence-arousal labeling system that guides contrastive learning so that subtle, multi-dimensional emotional differences are captured more effectively, combined with a dynamic token perturbation mechanism that uses gradient-based saliency to focus on sentiment-relevant tokens and increase sensitivity to emotional cues. The framework achieves up to a 15.5% improvement on the emotion classification benchmark, underlining the value of continuous labels for affective representation learning.

Link: https://arxiv.org/abs/2502.20613
Authors: Seungah Son,Andrez Saurez,Dongsoo Har
Affiliations: CCS Graduate School of Mobility, Korea Advanced Institute of Science and Technology (KAIST); Robotics Program, Korea Advanced Institute of Science and Technology (KAIST)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures, The 7th International Conference on Artificial Intelligence in Information and Communication (ICAIIC 2025)

Abstract:While pre-trained language models excel at semantic understanding, they often struggle to capture nuanced affective information critical for affective recognition tasks. To address these limitations, we propose a novel framework for enhancing emotion-aware embeddings in transformer-based models. Our approach introduces a continuous valence-arousal labeling system to guide contrastive learning, which captures subtle and multi-dimensional emotional nuances more effectively. Furthermore, we employ a dynamic token perturbation mechanism, using gradient-based saliency to focus on sentiment-relevant tokens, improving model sensitivity to emotional cues. The experimental results demonstrate that the proposed framework outperforms existing methods, achieving up to 15.5% improvement in the emotion classification benchmark, highlighting the importance of employing continuous labels. This improvement demonstrates that the proposed framework is effective in affective representation learning and enables precise and contextually relevant emotional understanding.

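[NLP-59] Leveraging Large Language Models for Building Interpretable Rule-Based Data-to-Text Systems

Quick read: This paper addresses how to efficiently build high-quality, interpretable data-to-text systems. It proposes a simple approach in which a large language model (LLM) automatically implements a fully interpretable rule-based data-to-text system in pure Python. Experiments on the WebNLG dataset show that the resulting system produces better text (by BLEU and BLEURT) than the same LLM prompted to generate outputs directly, and fewer hallucinations than a BART model fine-tuned on the same data. At runtime it generates text in a fraction of the processing time required by neural approaches, using only a single CPU.

Link: https://arxiv.org/abs/2502.20609
Authors: Jędrzej Warczyński,Mateusz Lango,Ondrej Dusek
Affiliations: Poznan University of Technology; Charles University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)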

Abstract:We introduce a simple approach that uses a large language model (LLM) to automatically implement a fully interpretable rule-based data-to-text system in pure Python. Experimental evaluation on the WebNLG dataset showed that such a constructed system produces text of better quality (according to the BLEU and BLEURT metrics) than the same LLM prompted to directly produce outputs, and produces fewer hallucinations than a BART language model fine-tuned on the same data. Furthermore, at runtime, the approach generates text in a fraction of the processing time required by neural approaches, using only a single CPU.
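To make "rule-based data-to-text system in pure Python" concrete, here is a minimal hand-written sketch of what such a system could look like for WebNLG-style (subject, predicate, object) triples. The templates, predicate names, and function names are hypothetical illustrations; they are not the rules the authors' LLM generated.

```python
# Hypothetical templates, one rule per predicate.
TEMPLATES = {
    "birthPlace": "{subject} was born in {object}.",
    "capital": "The capital of {subject} is {object}.",
}

def clean(entity):
    """Turn an identifier such as 'Alan_Turing' into readable text."""
    return entity.replace("_", " ")

def verbalize(triple):
    """Render one (subject, predicate, object) triple as a sentence."""
    subject, predicate, obj = triple
    template = TEMPLATES.get(predicate)
    if template is None:
        # Fallback rule for predicates without a dedicated template.
        return f"{clean(subject)} {predicate} {clean(obj)}."
    return template.format(subject=clean(subject), object=clean(obj))

def generate(triples):
    """Render a list of triples as one text, one sentence per triple."""
    return " ".join(verbalize(t) for t in triples)

print(generate([
    ("Alan_Turing", "birthPlace", "London"),
    ("France", "capital", "Paris"),
]))
```

Every output sentence can be traced back to a single rule, which is the interpretability property the abstract emphasizes, and the code runs on a CPU with no model inference at generation time.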

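[NLP-60] NutriGen: Personalized Meal Plan Generator Leveraging Large Language Models to Enhance Dietary and Nutritional Adherence

Quick read: This paper targets the shortcomings of existing personalized dietary recommendation systems in adaptability, practicality, and scalability: they often lack flexibility, fail to consider real-world constraints such as ingredient availability, and require extensive user input. It introduces NutriGen, a framework based on large language models (LLMs) that builds a personalized nutrition database and uses prompt engineering to incorporate reliable nutritional references such as the USDA nutrition database, generating meal plans that match user-defined dietary preferences and constraints while remaining flexible and easy to use. In the evaluation, Llama 3.1 8B and GPT-3.5 Turbo achieve the lowest percentage errors against user-defined caloric targets, 1.55% and 3.68% respectively; the paper also compares DeepSeek V3 against several established models to assess its potential for personalized nutrition planning.

Link: https://arxiv.org/abs/2502.20601
Authors: Saman Khamesian,Asiful Arefeen,Stephanie M. Carpenter,Hassan Ghasemzadeh
Affiliations: College of Health Solutions, Arizona State University; School of Computing and Augmented Intelligence, Arizona State University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)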

Abstract:Maintaining a balanced diet is essential for overall health, yet many individuals struggle with meal planning due to nutritional complexity, time constraints, and lack of dietary knowledge. Personalized food recommendations can help address these challenges by tailoring meal plans to individual preferences, habits, and dietary restrictions. However, existing dietary recommendation systems often lack adaptability, fail to consider real-world constraints such as food ingredient availability, and require extensive user input, making them impractical for sustainable and scalable daily use. To address these limitations, we introduce NutriGen, a framework based on large language models (LLM) designed to generate personalized meal plans that align with user-defined dietary preferences and constraints. By building a personalized nutrition database and leveraging prompt engineering, our approach enables LLMs to incorporate reliable nutritional references like the USDA nutrition database while maintaining flexibility and ease-of-use. We demonstrate that LLMs have strong potential in generating accurate and user-friendly food recommendations, addressing key limitations in existing dietary recommendation systems by providing structured, practical, and scalable meal plans. Our evaluation shows that Llama 3.1 8B and GPT-3.5 Turbo achieve the lowest percentage errors of 1.55% and 3.68%, respectively, producing meal plans that closely align with user-defined caloric targets while minimizing deviation and improving precision. Additionally, we compared the performance of DeepSeek V3 against several established models to evaluate its potential in personalized nutrition planning.

[NLP-61] Few-Shot No Problem: Descriptive Continual Relation Extraction AAAI2025

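Quick read: This paper tackles Few-shot Continual Relation Extraction, which is needed for AI systems to identify and adapt to evolving relations in dynamic real-world domains. Traditional memory-based approaches tend to overfit to the limited samples and fail to reinforce old knowledge, and data scarcity in few-shot settings makes this worse by hindering effective data augmentation in the latent space. The paper proposes a retrieval-based solution: a large language model generates a description for each relation, and a bi-encoder retrieval training paradigm built on these descriptions enriches both sample and class representation learning. At prediction time, each sample "retrieves" the best-fitting relation through a reciprocal rank fusion score that combines relation description vectors and class prototypes. The method maintains robust performance across sequential tasks and effectively mitigates catastrophic forgetting.

Link: https://arxiv.org/abs/2502.20596
Authors: Nguyen Xuan Thanh,Anh Duc Le,Quyen Tran,Thanh-Thien Le,Linh Ngo Van,Thien Huu Nguyen
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to AAAI 2025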

Abstract:Few-shot Continual Relation Extraction is a crucial challenge for enabling AI systems to identify and adapt to evolving relationships in dynamic real-world domains. Traditional memory-based approaches often overfit to limited samples, failing to reinforce old knowledge, with the scarcity of data in few-shot scenarios further exacerbating these issues by hindering effective data augmentation in the latent space. In this paper, we propose a novel retrieval-based solution, starting with a large language model to generate descriptions for each relation. From these descriptions, we introduce a bi-encoder retrieval training paradigm to enrich both sample and class representation learning. Leveraging these enhanced representations, we design a retrieval-based prediction method where each sample “retrieves” the best fitting relation via a reciprocal rank fusion score that integrates both relation description vectors and class prototypes. Extensive experiments on multiple datasets demonstrate that our method significantly advances the state-of-the-art by maintaining robust performance across sequential tasks, effectively addressing catastrophic forgetting.
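Reciprocal rank fusion (RRF) in its commonly used form scores each candidate as the sum of 1 / (k + rank) over the rankings being fused. The sketch below applies that standard formula to two hypothetical rankings of relation labels, one assumed to come from similarity to relation descriptions and one from similarity to class prototypes. The constant k = 60 is a conventional default, and the paper's exact scoring may differ.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings (best first) into one list of labels.

    Each label receives the sum of 1 / (k + rank) over all rankings
    in which it appears, with ranks starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, label in enumerate(ranking, start=1):
            scores[label] = scores.get(label, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically.
    return sorted(scores, key=lambda label: (-scores[label], label))

# Hypothetical rankings for one input sample (illustrative labels only).
by_description = ["founded_by", "born_in", "works_for"]
by_prototype = ["born_in", "works_for", "founded_by"]

fused = reciprocal_rank_fusion([by_description, by_prototype])
print(fused[0])  # the relation this sample "retrieves"
```

RRF depends only on ranks, not raw similarity values, so the two signals can be combined without calibrating their score scales against each other.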

[NLP-62] Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing

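Quick read: This paper studies test-time scaling of large language models (LLMs) at inference for Multi-Document Summarization (MDS). Test-time scaling has brought strong gains in logical and mathematical reasoning, but its application to natural language generation (NLG), and summarization in particular, has yet to be explored. The core difficulty in MDS is that prompt design and ensembling need more nuance, because no single "best" prompt satisfies diverse summarization requirements.

The key to the solution is a framework that applies inference-time scaling to MDS through a prompt ensemble: multiple prompts generate candidate summaries, and an aggregator merges them into a refined final summary. The paper also introduces two new evaluation metrics, the Consistency-Aware Preference (CAP) score and the LLM Atom-Content-Unit (ACU) score, designed to enhance the LLM's contextual understanding while mitigating positional bias. Experiments show improved summary quality and identify and analyze the scaling boundaries of summarization tasks.

Link: https://arxiv.org/abs/2502.20592
Authors: Juntai Cao,Xiang Zhang,Raymond Li,Chuyuan Li,Shafiq Joty,Giuseppe Carenini
Affiliations: University of British Columbia; Salesforce Research
Categories: Computation and Language (cs.CL)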

Abstract:Recent advances in test-time scaling have shown promising results in improving Large Language Models (LLMs) performance through strategic computation allocation during inference. While this approach has demonstrated strong performance improvements in logical and mathematical reasoning tasks, its application to natural language generation (NLG), especially summarization, has yet to be explored. Multi-Document Summarization (MDS) is a challenging task that focuses on extracting and synthesizing useful information from multiple lengthy documents. Unlike reasoning tasks, MDS requires a more nuanced approach to prompt design and ensemble, as there is no “best” prompt to satisfy diverse summarization requirements. To address this, we propose a novel framework that leverages inference-time scaling for this task. Precisely, we take a prompt ensemble approach, leveraging various prompts to first generate candidate summaries and then ensembling them with an aggregator to produce a refined summary. We also introduce two new evaluation metrics: Consistency-Aware Preference (CAP) score and LLM Atom-Content-Unit (ACU) score, to enhance LLM’s contextual understanding while mitigating its positional bias. Extensive experiments demonstrate the effectiveness of our approach in improving summary quality while identifying and analyzing the scaling boundaries in summarization tasks.
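The generate-then-aggregate flow described in the abstract can be sketched in a few lines. Here `llm` is a placeholder for any callable that maps a prompt string to a completion string, and the function name, prompt wording, and candidate numbering format are assumptions made for illustration rather than the authors' implementation.

```python
def ensemble_summarize(documents, prompts, llm, aggregator_prompt):
    """Generate one candidate summary per prompt, then merge them.

    `llm` is any callable str -> str (hypothetical; plug in a real client).
    """
    source = "\n\n".join(documents)
    # Stage 1: each prompt yields its own candidate summary.
    candidates = [llm(f"{prompt}\n\n{source}") for prompt in prompts]
    # Stage 2: an aggregator call sees all candidates and refines them.
    numbered = "\n".join(
        f"[{i}] {candidate}" for i, candidate in enumerate(candidates, start=1)
    )
    return llm(f"{aggregator_prompt}\n\n{numbered}")

# Stand-in model for demonstration: echoes the length of its input.
def toy_llm(text):
    return f"summary of {len(text)} characters"

print(ensemble_summarize(
    ["First document.", "Second document."],
    ["Summarize briefly.", "List the key facts."],
    toy_llm,
    "Merge these candidate summaries into one:",
))
```

Scaling test-time compute in this setup means adding more prompts (more candidates) before aggregation; the abstract notes that the gains from doing so have boundaries.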

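[NLP-63] LLMs Have Rhythm: Fingerprinting Large Language Models Using Inter-Token Times and Network Traffic Analysis

Quick read: This paper addresses identifying which language model is deployed or being interacted with in a real system, which matters for security and trustworthiness. Current verification methods typically analyze generated output to determine the source model; they are susceptible to adversarial attacks, operate post hoc, and may require access to model weights to inject a verifiable fingerprint. The paper proposes a passive, non-invasive fingerprinting technique that runs in real time and remains effective under encrypted network traffic. It exploits the autoregressive nature of language models, which generate one token at a time conditioned on all previous tokens, producing a distinctive temporal pattern (like a rhythm or heartbeat) that persists even when the output is streamed over a network. Measuring Inter-Token Times (ITTs), the intervals between consecutive tokens, identifies different models with high accuracy. The authors build a deep learning pipeline that captures these timing patterns through network traffic analysis and evaluate it on 16 Small Language Models (SLMs) and 10 proprietary LLMs across several deployment scenarios: local host (GPU/CPU), Local Area Network (LAN), remote network, and Virtual Private Network (VPN). The technique maintains high accuracy across these network conditions.

Link: https://arxiv.org/abs/2502.20589
Authors: Saeif Alhazbi,Ahmed Mohamed Hussain,Gabriele Oligeri,Panos Papadimitratos
Affiliations: College of Science and Engineering (CSE), Hamad Bin Khalifa University (HBKU); Networked Systems Security Group, KTH Royal Institute of Technology
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)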

Abstract:As Large Language Models (LLMs) become increasingly integrated into many technological ecosystems across various domains and industries, identifying which model is deployed or being interacted with is critical for the security and trustworthiness of the systems. Current verification methods typically rely on analyzing the generated output to determine the source model. However, these techniques are susceptible to adversarial attacks, operate in a post-hoc manner, and may require access to the model weights to inject a verifiable fingerprint. In this paper, we propose a novel passive and non-invasive fingerprinting technique that operates in real-time and remains effective even under encrypted network traffic conditions. Our method leverages the intrinsic autoregressive generation nature of language models, which generate text one token at a time based on all previously generated tokens, creating a unique temporal pattern like a rhythm or heartbeat that persists even when the output is streamed over a network. We find that measuring the Inter-Token Times (ITTs), the time intervals between consecutive tokens, can identify different language models with high accuracy. We develop a Deep Learning (DL) pipeline to capture these timing patterns using network traffic analysis and evaluate it on 16 Small Language Models (SLMs) and 10 proprietary LLMs across different deployment scenarios, including local host machine (GPU/CPU), Local Area Network (LAN), Remote Network, and Virtual Private Network (VPN). The experimental results confirm that our proposed technique is effective and maintains high accuracy even when tested in different network conditions. This work opens a new avenue for model identification in real-world scenarios and contributes to more secure and trustworthy language model deployment.
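As a rough illustration of the raw signal involved, the sketch below derives Inter-Token Times from a list of token arrival timestamps and reduces them to a few summary statistics. The timestamps are invented, and the hand-picked statistics are only a stand-in: the paper feeds timing patterns to a deep learning pipeline, whose inputs and architecture are not described in the abstract.

```python
from statistics import mean, pstdev

def inter_token_times(timestamps):
    """Differences between consecutive token arrival times (seconds)."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

def timing_features(timestamps):
    """Summarize a token stream's rhythm with simple statistics."""
    itts = inter_token_times(timestamps)
    return {
        "mean": mean(itts),
        "std": pstdev(itts),
        "min": min(itts),
        "max": max(itts),
    }

# Hypothetical arrival times of four streamed tokens, in seconds.
arrivals = [0.0, 0.5, 1.5, 3.0]
print(inter_token_times(arrivals))
print(timing_features(arrivals))
```

Only arrival times are needed, not token contents, which is why this kind of measurement can still be taken on encrypted traffic.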

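[NLP-64] The Noisy Path from Source to Citation: Measuring How Scholars Engage with Past Research

Quick read: This paper addresses the limitation that uses of academic citations typically rely on raw citation counts and neglect variability in citation types, in particular differences in how faithfully a citation conveys the original claim (citation fidelity). It introduces a computational pipeline that quantifies citation fidelity at scale: using the full texts of papers, it identifies citation sentences in citing papers and the corresponding claims in cited papers, then applies supervised models to measure fidelity at the sentence level. The analysis reveals systematic differences in citation fidelity, underscoring the limits of analyses based on citation quantity alone and the potential for distortion of evidence.

Link: https://arxiv.org/abs/2502.20581
Authors: Hong Chen,Misha Teplitskiy,David Jurgens
Affiliations: University of Michigan
Categories: Computation and Language (cs.CL)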

Abstract:Academic citations are widely used for evaluating research and tracing knowledge flows. Such uses typically rely on raw citation counts and neglect variability in citation types. In particular, citations can vary in their fidelity as original knowledge from cited studies may be paraphrased, summarized, or reinterpreted, possibly wrongly, leading to variation in how much information changes from cited to citing paper. In this study, we introduce a computational pipeline to quantify citation fidelity at scale. Using full texts of papers, the pipeline identifies citations in citing papers and the corresponding claims in cited papers, and applies supervised models to measure fidelity at the sentence level. Analyzing a large-scale multi-disciplinary dataset of approximately 13 million citation sentence pairs, we find that citation fidelity is higher when authors cite papers that are 1) more recent and intellectually close, 2) more accessible, and 3) the first author has a lower H-index and the author team is medium-sized. Using a quasi-experiment, we establish the “telephone effect” - when citing papers have low fidelity to the original claim, future papers that cite the citing paper and the original have lower fidelity to the original. Our work reveals systematic differences in citation fidelity, underscoring the limitations of analyses that rely on citation quantity alone and the potential for distortion of evidence.

[NLP-65] ECCOS: Efficient Capability and Cost Coordinated Scheduling for Multi-LLM Serving

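Quick read: This paper addresses the scheduling challenges created by surging query volumes as large language models (LLMs) are deployed as service endpoints. Existing scheduling frameworks mainly optimize latency and neglect the differing capabilities of LLMs to serve queries of different difficulty, which can waste computational resources. The paper proposes ECCOS, a capability-cost coordinated scheduling framework for multi-LLM serving that explicitly constrains response quality and workload while optimizing inference cost. The key is a two-stage design with a multi-objective predictor and a constrained optimizer: the predictor estimates model capability and computational cost through training-based and retrieval-based approaches, and the optimizer determines cost-optimal assignments under quality and workload constraints. The paper also introduces QAServe, a dataset of sample-wise response quality and cost collected by zero-shot prompting different LLMs on knowledge QA and mathematical reasoning. In experiments, ECCOS improves success rates by 6.30% and reduces costs by 10.15% compared with existing methods, while consuming less than 0.5% of LLM response time.

Link: https://arxiv.org/abs/2502.20576
Authors: Kai Mei,Wujiang Xu,Shuhang Lin,Yongfeng Zhang
Affiliations: Department of Computer Science, Rutgers University
Categories: Databases (cs.DB); Computation and Language (cs.CL)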

Abstract:As large language models (LLMs) are increasingly deployed as service endpoints in systems, the surge in query volume creates significant scheduling challenges. Existing scheduling frameworks mainly target latency optimization while neglecting the capability of LLMs to serve different levels of queries, which could lead to computational resource waste. This paper addresses this challenge by proposing a capability-cost coordinated scheduling framework, ECCOS, for multi-LLM serving, which explicitly constrains response quality and workload to optimize LLM inference cost. Specifically, it introduces the two-stage scheduling by designing a multi-objective predictor and a constrained optimizer. The predictor estimates both model capabilities and computational costs through training-based and retrieval-based approaches, while the optimizer determines cost-optimal assignments under quality and workload constraints. It also introduces QAServe, a dataset collected for sample-wise response quality and costs by zero-shot prompting different LLMs on knowledge QA and mathematical reasoning. Extensive experiments demonstrate that ECCOS improves success rates by 6.30% while reducing costs by 10.15% compared to existing methods, consuming less than 0.5% of LLM response time. The code is available at: this https URL.
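The second stage, assigning queries to models under quality and workload constraints, can be illustrated with a simple greedy heuristic. This is not the ECCOS optimizer, and a greedy pass is not guaranteed to be cost-optimal; the predicted quality and cost values, model names, and the `assign_queries` function are all hypothetical.

```python
def assign_queries(predictions, min_quality, capacity):
    """Greedily route each query to the cheapest adequate model.

    predictions: {query_id: {model: (predicted_quality, predicted_cost)}}
    min_quality: lowest acceptable predicted quality
    capacity:    {model: maximum number of queries it may serve}
    Returns {query_id: model or None when no model is feasible}.
    """
    load = {model: 0 for model in capacity}
    assignment = {}
    for query_id, per_model in predictions.items():
        feasible = [
            (cost, model)
            for model, (quality, cost) in per_model.items()
            if quality >= min_quality and load[model] < capacity[model]
        ]
        if not feasible:
            assignment[query_id] = None
            continue
        _, model = min(feasible)  # cheapest model meeting both constraints
        assignment[query_id] = model
        load[model] += 1
    return assignment

# Hypothetical predictor outputs for three queries and two models.
predictions = {
    "q1": {"small": (0.90, 1.0), "large": (0.95, 5.0)},
    "q2": {"small": (0.80, 1.0), "large": (0.90, 5.0)},
    "q3": {"small": (0.40, 1.0), "large": (0.85, 5.0)},
}
print(assign_queries(predictions, min_quality=0.7, capacity={"small": 1, "large": 2}))
```

In this toy run the cheap model takes the first query it can handle, and the remaining queries fall through to the larger model once the small one is at capacity or predicted to be inadequate.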

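[NLP-66] Visual Reasoning at Urban Intersections: FineTuning GPT-4o for Traffic Conflict Detection

【Quick read】: This paper tackles traffic control at unsignalized urban intersections, which is difficult because of complexity, frequent conflicts, and blind spots. It proposes using multimodal large language models (MLLMs) such as GPT-4o to reason logically and visually directly over bird's-eye-view videos of four-legged intersections, detecting conflicts and giving drivers explanations and recommendations. The key step is fine-tuning GPT-4o so that it extracts scalable, actionable insights for real-time traffic management from video input. The fine-tuned model reaches 77.14% accuracy, and manual evaluation of its correctly predicted cases finds 89.9% accuracy for the generated explanations and 92.3% for the recommended next actions.

Link: https://arxiv.org/abs/2502.20573
Authors: Sari Masri,Huthaifa I. Ashqar,Mohammed Elhenawy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: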

Abstract:Traffic control in unsignalized urban intersections presents significant challenges due to the complexity, frequent conflicts, and blind spots. This study explores the capability of leveraging Multimodal Large Language Models (MLLMs), such as GPT-4o, to provide logical and visual reasoning by directly using birds-eye-view videos of four-legged intersections. In this proposed method, GPT-4o acts as intelligent system to detect conflicts and provide explanations and recommendations for the drivers. The fine-tuned model achieved an accuracy of 77.14%, while the manual evaluation of the true predicted values of the fine-tuned GPT-4o showed significant achievements of 89.9% accuracy for model-generated explanations and 92.3% for the recommended next actions. These results highlight the feasibility of using MLLMs for real-time traffic management using videos as inputs, offering scalable and actionable insights into intersections traffic management and operation. Code used in this study is available at this https URL.

[NLP-67] HazardNet: A Small-Scale Vision Language Model for Real-Time Traffic Safety Detection at Edge Devices

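【Quick read】: This paper addresses growing traffic safety concerns in urban environments, where rising vehicle counts and increasingly complex road networks strain traditional safety-critical event detection systems; these rely on sensors and conventional machine learning algorithms and require extensive data collection and complex training. The proposed solution is HazardNet, a small-scale vision-language model built by fine-tuning the pre-trained Qwen2-VL-2B model (two billion parameters), which makes it practical to deploy on edge devices with efficient inference throughput. To train HazardNet on real-world safety-critical events, the authors also construct HazardQA, a new visual question answering (VQA) dataset. Experiments show that the fine-tuned HazardNet improves F1-score over the base model by up to 89% and, in some cases, by up to 6% over larger models such as GPT-4o, which indicates its potential for real-time, reliable traffic safety event detection.

Link: https://arxiv.org/abs/2502.20572
Authors: Mohammad Abu Tami,Mohammed Elhenawy,Huthaifa I. Ashqar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: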

Abstract:Traffic safety remains a vital concern in contemporary urban settings, intensified by the increase of vehicles and the complicated nature of road networks. Traditional safety-critical event detection systems predominantly rely on sensor-based approaches and conventional machine learning algorithms, necessitating extensive data collection and complex training processes to adhere to traffic safety regulations. This paper introduces HazardNet, a small-scale Vision Language Model designed to enhance traffic safety by leveraging the reasoning capabilities of advanced language and vision models. We built HazardNet by fine-tuning the pre-trained Qwen2-VL-2B model, chosen for its superior performance among open-source alternatives and its compact size of two billion parameters. This helps to facilitate deployment on edge devices with efficient inference throughput. In addition, we present HazardQA, a novel Vision Question Answering (VQA) dataset constructed specifically for training HazardNet on real-world scenarios involving safety-critical events. Our experimental results show that the fine-tuned HazardNet outperformed the base model up to an 89% improvement in F1-Score and has comparable results with improvement in some cases reach up to 6% when compared to larger models, such as GPT-4o. These advancements underscore the potential of HazardNet in providing real-time, reliable traffic safety event detection, thereby contributing to reduced accidents and improved traffic management in urban environments. Both HazardNet model and the HazardQA dataset are available at this https URL and this https URL, respectively.

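[NLP-68] Towards Statistical Factuality Guarantee for Large Vision-Language Models

【Quick read】: This paper addresses hallucination in large vision-language models (LVLMs) during image-conditioned free-form text generation, where the generated text is inconsistent with the visual context; this is a major obstacle to deploying these models where reliability must be guaranteed. The proposed framework, ConfLVLM, builds on conformal prediction to give finite-sample, distribution-free statistical guarantees on the factuality of LVLM output. It treats the LVLM as a hypothesis generator and each generated text detail (claim) as an individual hypothesis, then applies a statistical hypothesis test driven by efficient heuristic uncertainty measures to filter out unreliable claims before a response is returned to the user. In experiments covering scene understanding, medical radiology report generation, and document understanding, ConfLVLM reduces the error rate of claims generated by LLaVa-1.5 for scene descriptions from 87.8% to 10.0%, filtering out erroneous claims with a 95.3% true positive rate.

Link: https://arxiv.org/abs/2502.20560
Authors: Zhuohang Li,Chao Yan,Nicholas J. Jackson,Wendi Cui,Bo Li,Jiaxin Zhang,Bradley A. Malin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: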

Abstract:Advancements in Large Vision-Language Models (LVLMs) have demonstrated promising performance in a variety of vision-language tasks involving image-conditioned free-form text generation. However, growing concerns about hallucinations in LVLMs, where the generated text is inconsistent with the visual context, are becoming a major impediment to deploying these models in applications that demand guaranteed reliability. In this paper, we introduce a framework to address this challenge, ConfLVLM, which is grounded on conformal prediction to achieve finite-sample distribution-free statistical guarantees on the factuality of LVLM output. This framework treats an LVLM as a hypothesis generator, where each generated text detail (or claim) is considered an individual hypothesis. It then applies a statistical hypothesis testing procedure to verify each claim using efficient heuristic uncertainty measures to filter out unreliable claims before returning any responses to users. We conduct extensive experiments covering three representative application domains, including general scene understanding, medical radiology report generation, and document understanding. Remarkably, ConfLVLM reduces the error rate of claims generated by LLaVa-1.5 for scene descriptions from 87.8% to 10.0% by filtering out erroneous claims with a 95.3% true positive rate. Our results further demonstrate that ConfLVLM is highly flexible, and can be applied to any black-box LVLMs paired with any uncertainty measure for any image-conditioned free-form text generation task while providing a rigorous guarantee on controlling the risk of hallucination.
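
To make the idea of claim filtering with a calibrated threshold more concrete, here is a generic split-conformal sketch. It is an assumption-laden illustration and not the ConfLVLM procedure itself: the function names, the way calibration scores are defined, and the choice to guarantee "no false claim is kept" are all choices made here. Each calibration response contributes one score, the lowest uncertainty among its false claims (`math.inf` if it has none); with exchangeable data and no ties, keeping only claims whose uncertainty is strictly below the returned threshold leaves a new response free of false claims with probability at least 1 - alpha.

```python
import math
from typing import List


def calibrate_threshold(calib_scores: List[float], alpha: float) -> float:
    """Pick an uncertainty threshold from held-out calibration responses.

    calib_scores[i] is assumed to be the lowest uncertainty among the FALSE
    claims of calibration response i (math.inf if it has no false claims).
    """
    n = len(calib_scores)
    k = math.floor(alpha * (n + 1))  # rank of the order statistic to use
    if k < 1:
        return -math.inf  # too little calibration data: keep nothing
    return sorted(calib_scores)[k - 1]


def filter_claims(claims: List[str], uncertainties: List[float], tau: float) -> List[str]:
    """Keep only claims whose uncertainty is strictly below the threshold."""
    return [c for c, u in zip(claims, uncertainties) if u < tau]
```

The uncertainty measure itself is left abstract; the abstract above notes that the framework can be paired with any such measure.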

[NLP-69] HuAMR: A Hungarian AMR Parser and Dataset

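【Quick read】: This paper targets the scarcity of semantic resources for non-English languages by building an Abstract Meaning Representation (AMR) dataset and parsers for Hungarian. The key step is using a large language model (Llama-3.1-70B) to generate silver-standard AMR annotations automatically, followed by manual refinement to ensure quality. On this dataset the authors study how model architecture (mT5 Large and Llama-3.2-1B) and fine-tuning strategy affect AMR parsing performance. Adding the silver-standard AMRs to the training data of smaller models does not consistently raise overall scores, but it does improve parsing accuracy on Hungarian news data, the domain of HuAMR. The parsers are evaluated with Smatch scores, which confirm the potential of HuAMR and its parsers for semantic parsing research.

Link: https://arxiv.org/abs/2502.20552
Authors: Botond Barta,Endre Hamerlik,Milán Konor Nyist,Judit Ács
Affiliations: HUN-REN Institute for Computer Science and Control; Eötvös Loránd University
Categories: Computation and Language (cs.CL)
Comments: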

Abstract:We present HuAMR, the first Abstract Meaning Representation (AMR) dataset and a suite of large language model-based AMR parsers for Hungarian, targeting the scarcity of semantic resources for non-English languages. To create HuAMR, we employed Llama-3.1-70B to automatically generate silver-standard AMR annotations, which we then refined manually to ensure quality. Building on this dataset, we investigate how different model architectures - mT5 Large and Llama-3.2-1B - and fine-tuning strategies affect AMR parsing performance. While incorporating silver-standard AMRs from Llama-3.1-70B into the training data of smaller models does not consistently boost overall scores, our results show that these techniques effectively enhance parsing accuracy on Hungarian news data (the domain of HuAMR). We evaluate our parsers using Smatch scores and confirm the potential of HuAMR and our parsers for advancing semantic parsing research.

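[NLP-70] Q♯: Provably Optimal Distributional RL for LLM Post-Training

【Quick read】: This paper addresses a weakness of reinforcement learning (RL) post-training for LLM alignment and reasoning: existing policy-based methods such as PPO and DPO can fail to fix shortcuts inherited from pre-training. It proposes Q♯, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized Q function, which is learned with distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model with unregularized Q-values, Q♯ is theoretically principled and provably learns the optimal policy of the KL-regularized RL problem. The paper also establishes a reduction from KL-regularized RL to no-regret online learning, giving the first bounds for deterministic Markov decision processes (MDPs) under realizability alone. Because distributional RL is used, the bounds are variance-dependent and converge faster when the reference policy has small variance. Empirically, Q♯ outperforms prior baselines on math reasoning benchmarks while keeping a smaller KL divergence to the reference policy.

Link: https://arxiv.org/abs/2502.20548
Authors: Jin Peng Zhou,Kaiwen Wang,Jonathan Chang,Zhaolin Gao,Nathan Kallus,Kilian Q. Weinberger,Kianté Brantley,Wen Sun
Affiliations: Cornell University; Netflix; Databricks; Harvard University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: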

Abstract:Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce Q♯, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized Q function. We propose to learn the optimal Q function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized Q-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, Q♯ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight Q♯ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at this https URL.
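
As a rough illustration of what "guiding the reference policy with a regularized Q function" means, the sketch below uses the standard closed form for KL-regularized RL, in which the optimal policy is proportional to the reference policy times exp(Q/η). Whether the paper's implementation takes exactly this form is an assumption; the function name `guided_policy`, the temperature name `eta`, and the toy inputs are hypothetical.

```python
import math
from typing import List


def guided_policy(ref_probs: List[float], q_values: List[float], eta: float) -> List[float]:
    """Reweight reference next-token probabilities by exp(Q / eta).

    Standard KL-regularized form: pi(a) is proportional to pi_ref(a) * exp(Q(a) / eta).
    Computed in log space and shifted by the max for numerical stability.
    Assumes at least one reference probability is positive.
    """
    logits = [
        (math.log(p) if p > 0 else -math.inf) + q / eta
        for p, q in zip(ref_probs, q_values)
    ]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

When every action has the same Q-value the reference policy is returned unchanged, and a larger `eta` (stronger KL regularization) keeps the guided policy closer to the reference.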

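[NLP-71] NANOGPT: A Query-Driven Large Language Model Retrieval-Augmented Generation System for Nanotechnology Research

【Quick read】: This paper aims to reduce the time and complexity of literature review in nanotechnology research by building a large language model retrieval-augmented generation (LLM-RAG) system tailored to the field. The key component is its query backend retrieval mechanism, which integrates data from multiple reputable sources: it uses Google Scholar's advanced search and scrapes open-access papers from Elsevier, Springer Nature, and ACS Publications to collect a broad, up-to-date set of scholarly articles. According to the authors, testing shows that the system substantially reduces the time and effort needed for comprehensive literature reviews while maintaining high accuracy and query relevance, and that it outperforms standard publicly available LLMs.

Link: https://arxiv.org/abs/2502.20541
Authors: Achuth Chandrasekhar,Omid Barati Farimani,Olabode T. Ajenifujah,Janghoon Ock,Amir Barati Farimani
Affiliations: Carnegie Mellon University (CMU)
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 61 pages, 3 figures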

Abstract:This paper presents the development and application of a Large Language Model Retrieval-Augmented Generation (LLM-RAG) system tailored for nanotechnology research. The system leverages the capabilities of a sophisticated language model to serve as an intelligent research assistant, enhancing the efficiency and comprehensiveness of literature reviews in the nanotechnology domain. Central to this LLM-RAG system is its advanced query backend retrieval mechanism, which integrates data from multiple reputable sources. The system retrieves relevant literature by utilizing Google Scholar’s advanced search, and scraping open-access papers from Elsevier, Springer Nature, and ACS Publications. This multifaceted approach ensures a broad and diverse collection of up-to-date scholarly articles and papers. The proposed system demonstrates significant potential in aiding researchers by providing a streamlined, accurate, and exhaustive literature retrieval process, thereby accelerating research advancements in nanotechnology. The effectiveness of the LLM-RAG system is validated through rigorous testing, illustrating its capability to significantly reduce the time and effort required for comprehensive literature reviews, while maintaining high accuracy, query relevance and outperforming standard, publicly available LLMS.

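[NLP-72] Supervised Fine-Tuning LLMs to Behave as Pedagogical Agents in Programming Education

【Quick read】: This paper examines how effective large language models (LLMs) are as teaching agents in programming education, where off-the-shelf models tend to over-assist students by giving solutions directly instead of guiding learning. The authors apply supervised fine-tuning (SFT) on a dataset of 528 student-question/teacher-answer pairs to create two models, GuideLM and GuideLM-mini, fine-tuned from ChatGPT-4o and 4o-mini respectively. In an expert evaluation against the base OpenAI models, the fine-tuned models show an 8% increase in Socratic guidance and a 58% improvement in economy of words compared with GPT-4o, at the cost of a slight drop in general accuracy. The results suggest that fine-tuning on targeted datasets is a promising way to build models better suited to educational contexts.

Link: https://arxiv.org/abs/2502.20527
Authors: Emily Ross,Yuval Kansal,Jake Renzella,Alexandra Vassar,Andrew Taylor
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: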

Abstract:Large language models (LLMs) are increasingly being explored in higher education, yet their effectiveness as teaching agents remains underexamined. In this paper, we present the development of GuideLM, a fine-tuned LLM designed for programming education. GuideLM has been integrated into the Debugging C Compiler (DCC), an educational C compiler that leverages LLMs to generate pedagogically sound error explanations. Previously, DCC relied on off-the-shelf OpenAI models, which, while accurate, often over-assisted students by directly providing solutions despite contrary prompting. To address this, we employed supervised fine-tuning (SFT) on a dataset of 528 student-question/teacher-answer pairs, creating two models: GuideLM and GuideLM-mini, fine-tuned on ChatGPT-4o and 4o-mini, respectively. We conducted an expert analysis of 400 responses per model, comparing their pedagogical effectiveness against base OpenAI models. Our evaluation, grounded in constructivism and cognitive load theory, assessed factors such as conceptual scaffolding, clarity, and Socratic guidance. Results indicate that GuideLM and GuideLM-mini improve pedagogical performance, with an 8% increase in Socratic guidance and a 58% improvement in economy of words compared to GPT-4o. However, this refinement comes at the cost of a slight reduction in general accuracy. While further work is needed, our findings suggest that fine-tuning LLMs with targeted datasets is a promising approach for developing models better suited to educational contexts.

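[NLP-73] TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning

【Quick read】: This paper addresses the limited real-world applicability of existing travel planning benchmarks. Datasets such as TravelPlanner and TravelPlanner+ rely on semi-synthetic data, contain spatial inconsistencies, and lack key travel constraints, which makes them inadequate for practical itinerary generation. The authors introduce TripCraft, a spatiotemporally coherent dataset that integrates real-world constraints, including public transit schedules, event availability, diverse attraction categories, and user personas for personalization. To evaluate LLM-generated plans beyond binary validation, they propose five continuous metrics: Temporal Meal Score, Temporal Attraction Score, Spatial Score, Ordering Score, and Persona Score. In the parameter-informed setting, the Temporal Meal Score rises from 61% to 80% in a 7-day scenario.

Link: https://arxiv.org/abs/2502.20508
Authors: Soumyabrata Chaudhuri,Pranav Purkar,Ritwik Raghav,Shubhojit Mallick,Manish Gupta,Abhik Jana,Shreya Ghosh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 27 pages, 18 Tables and 6 Figures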

Abstract:Recent advancements in probing Large Language Models (LLMs) have explored their latent potential as personalized travel planning agents, yet existing benchmarks remain limited in real world applicability. Existing datasets, such as TravelPlanner and TravelPlanner+, suffer from semi synthetic data reliance, spatial inconsistencies, and a lack of key travel constraints, making them inadequate for practical itinerary generation. To address these gaps, we introduce TripCraft, a spatiotemporally coherent travel planning dataset that integrates real world constraints, including public transit schedules, event availability, diverse attraction categories, and user personas for enhanced personalization. To evaluate LLM generated plans beyond existing binary validation methods, we propose five continuous evaluation metrics, namely Temporal Meal Score, Temporal Attraction Score, Spatial Score, Ordering Score, and Persona Score which assess itinerary quality across multiple dimensions. Our parameter informed setting significantly enhances meal scheduling, improving the Temporal Meal Score from 61% to 80% in a 7 day scenario. TripCraft establishes a new benchmark for LLM driven personalized travel planning, offering a more realistic, constraint aware framework for itinerary generation. Dataset and Codebase will be made publicly available upon acceptance.

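[NLP-74] A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs

【Quick read】: This paper studies a largely unexplored question: how the modality used to present a persona affects how well a multimodal LLM embodies it. The authors build a modality-parallel dataset of 40 diverse personas, each represented equivalently in four ways (image only, text only, image plus short text, and typographical images in which text is visually stylized), together with an evaluation framework of 60 questions and corresponding metrics that assesses how well a model embodies each persona across attributes and scenarios. Experiments on five multimodal LLMs show that personas given as detailed text exhibit more linguistic habits, while typographical images often yield more consistency with the persona. The results indicate that LLMs often overlook persona-specific details conveyed through images, which points to directions for improving multimodal persona modeling.

Link: https://arxiv.org/abs/2502.20504
Authors: Julius Broomfield,Kartik Sharma,Srijan Kumar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: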

Abstract:Large language models (LLMs) have recently demonstrated remarkable advancements in embodying diverse personas, enhancing their effectiveness as conversational agents and virtual assistants. Consequently, LLMs have made significant strides in processing and integrating multimodal information. However, even though human personas can be expressed in both text and image, the extent to which the modality of a persona impacts the embodiment by the LLM remains largely unexplored. In this paper, we investigate how do different modalities influence the expressiveness of personas in multimodal LLMs. To this end, we create a novel modality-parallel dataset of 40 diverse personas varying in age, gender, occupation, and location. This consists of four modalities to equivalently represent a persona: image-only, text-only, a combination of image and small text, and typographical images, where text is visually stylized to convey persona-related attributes. We then create a systematic evaluation framework with 60 questions and corresponding metrics to assess how well LLMs embody each persona across its attributes and scenarios. Comprehensive experiments on 5 multimodal LLMs show that personas represented by detailed text show more linguistic habits, while typographical images often show more consistency with the persona. Our results reveal that LLMs often overlook persona-specific details conveyed through images, highlighting underlying limitations and paving the way for future research to bridge this gap. We release the data and code at this https URL .

[NLP-75] Protecting multimodal large language models against misleading visualizations

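【Quick read】: This paper assesses how vulnerable multimodal large language models (MLLMs) are to misleading visualizations, meaning charts that distort the underlying data through techniques such as truncated or inverted axes. These distortions reduce question-answering accuracy to the level of the random baseline. To mitigate this, the authors introduce six inference-time methods that improve performance on misleading charts while preserving accuracy on non-misleading ones. The most effective method first extracts the chart's underlying data table and then has a text-only LLM answer the question from that table, which improves performance on misleading visualizations by 15.4 to 19.6 percentage points.

Link: https://arxiv.org/abs/2502.20503
Authors: Jonathan Tonglet,Tinne Tuytelaars,Marie-Francine Moens,Iryna Gurevych
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Preprint. Code and data available at this https URL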

Abstract:We assess the vulnerability of multimodal large language models to misleading visualizations - charts that distort the underlying data using techniques such as truncated or inverted axes, leading readers to draw inaccurate conclusions that may support misinformation or conspiracy theories. Our analysis shows that these distortions severely harm multimodal large language models, reducing their question-answering accuracy to the level of the random baseline. To mitigate this vulnerability, we introduce six inference-time methods to improve performance of MLLMs on misleading visualizations while preserving their accuracy on non-misleading ones. The most effective approach involves (1) extracting the underlying data table and (2) using a text-only large language model to answer questions based on the table. This method improves performance on misleading visualizations by 15.4 to 19.6 percentage points.
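
The two-step method described above (extract the data table, then answer from the table with a text-only model) amounts to a small pipeline. The sketch below only shows the control flow: `extract_table` and `ask_text_llm` are hypothetical callables standing in for whatever chart-to-table model and text-only LLM are used, and the prompt wording is an assumption rather than the authors' prompt.

```python
from typing import Callable


def answer_via_table(
    chart_image: bytes,
    question: str,
    extract_table: Callable[[bytes], str],   # hypothetical: chart image -> table as text
    ask_text_llm: Callable[[str], str],      # hypothetical: prompt -> answer
) -> str:
    """Answer a chart question from the extracted table instead of the image."""
    # Step 1: recover the underlying data so axis tricks no longer matter.
    table = extract_table(chart_image)
    # Step 2: let a text-only model reason over the raw numbers.
    prompt = (
        f"Data table:\n{table}\n\n"
        f"Question: {question}\n"
        "Answer using only the table."
    )
    return ask_text_llm(prompt)
```

Because the text-only model never sees the rendered chart, a truncated or inverted axis cannot mislead it directly; the quality of the answer then depends on how accurately the table is extracted.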

[NLP-76] EgoNormia: Benchmarking Physical Social Norm Understanding

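【Quick read】: This paper addresses the weak ability of vision-language models (VLMs) to understand and reason about normative actions. Current state-of-the-art VLMs perform poorly on normative reasoning that requires understanding both the physical and the social context, across seven norm categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. The best model scores only 45%, against a human benchmark of 92%. To evaluate and improve this capability, the paper presents EgoNormia, a dataset of 1,853 egocentric videos of human interactions, each paired with questions that test both the prediction and the justification of normative actions.

The key to the solution is building this dataset at scale with a pipeline that combines video sampling, automatic answer generation, filtering, and human validation. The study further shows that a retrieval-based generation method can use EgoNormia to enhance normative reasoning in existing VLMs. The per-dimension analysis highlights significant risks around safety and privacy, as well as a lack of collaboration and communication capability, when such models are applied to real-world agents.

Link: https://arxiv.org/abs/2502.20490
Authors: MohammadHossein Rezaei,Yicheng Fu,Phil Cuvin,Caleb Ziems,Yanzhe Zhang,Hao Zhu,Diyi Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: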

Abstract:Human activity is moderated by norms. When performing actions in the real world, humans not only follow norms, but also consider the trade-off between different norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, especially when the norms are grounded in a physical and social context. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia $\|\epsilon\|$, consisting of 1,853 ego-centric videos of human interactions, each of which has two related questions evaluating both the prediction and justification of normative actions. The normative actions encompass seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline leveraging video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art vision-language models lack robust norm understanding, scoring a maximum of 45% on EgoNormia (versus a human bench of 92%). Our analysis of performance in each dimension highlights the significant risks of safety, privacy, and the lack of collaboration and communication capability when applied to real-world agents. We additionally show that through a retrieval-based generation method, it is possible to use EgoNormia to enhance normative reasoning in VLMs.

[NLP-77] Explainable AI for Clinical Outcome Prediction: A Survey of Clinician Perceptions and Preferences

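【Quick read】: This paper asks how to help clinicians understand and make effective use of AI predictions over text-based electronic health record (EHR) data. The authors implement four explainable AI (XAI) techniques (LIME, attention-based span highlights, exemplar patient retrieval, and free-text rationales generated by LLMs) on an outcome prediction model and survey 32 practicing clinicians to collect their feedback and preferences. They synthesize the findings into recommendations on when each technique is more appropriate, its potential limitations, and how it could be improved, as guidance for clinical decision support.

Link: https://arxiv.org/abs/2502.20478
Authors: Jun Hou,Lucy Lu Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: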

Abstract:Explainable AI (XAI) techniques are necessary to help clinicians make sense of AI predictions and integrate predictions into their decision-making workflow. In this work, we conduct a survey study to understand clinician preference among different XAI techniques when they are used to interpret model predictions over text-based EHR data. We implement four XAI techniques (LIME, Attention-based span highlights, exemplar patient retrieval, and free-text rationales generated by LLMs) on an outcome prediction model that uses ICU admission notes to predict a patient’s likelihood of experiencing in-hospital mortality. Using these XAI implementations, we design and conduct a survey study of 32 practicing clinicians, collecting their feedback and preferences on the four techniques. We synthesize our findings into a set of recommendations describing when each of the XAI techniques may be more appropriate, their potential limitations, as well as recommendations for improvement.

[NLP-78] Promote Suppress Iterate: How Language Models Answer One-to-Many Factual Queries

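【Quick read】: This paper investigates how a language model (LM) answering a one-to-many factual query (for example, listing the cities of a country) manages two subtasks at once, recalling knowledge and avoiding repetition of earlier answers, and how these are implemented internally. Across multiple datasets and models, the authors identify a promote-then-suppress mechanism: the model first recalls all answers and then suppresses those already generated. Specifically, LMs use both the subject and the previous answer tokens for knowledge recall, with attention propagating subject information and MLPs promoting the answers. Attention then attends to and suppresses previous answer tokens, while MLPs amplify the suppression signal. The mechanism is supported by several lines of evidence: early decoding, causal tracing, Token Lens (introduced here to decode aggregated attention updates from specified tokens), and a knockout method that analyzes how MLP outputs change after attention to specified tokens is removed. Overall, the paper offers new insight into how an LM's internal components interact with different input tokens to support complex factual recall. Code is available at the link provided.

Link: https://arxiv.org/abs/2502.20475
Authors: Tianyi Lorena Yan,Robin Jia
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: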

Abstract:To answer one-to-many factual queries (e.g., listing cities of a country), a language model (LM) must simultaneously recall knowledge and avoid repeating previous answers. How are these two subtasks implemented and integrated internally? Across multiple datasets and models, we identify a promote-then-suppress mechanism: the model first recalls all answers, and then suppresses previously generated ones. Specifically, LMs use both the subject and previous answer tokens to perform knowledge recall, with attention propagating subject information and MLPs promoting the answers. Then, attention attends to and suppresses previous answer tokens, while MLPs amplify the suppression signal. Our mechanism is corroborated by extensive experimental evidence: in addition to using early decoding and causal tracing, we analyze how components use different tokens by introducing both Token Lens, which decodes aggregated attention updates from specified tokens, and a knockout method that analyzes changes in MLP outputs after removing attention to specified tokens. Overall, we provide new insights into how LMs’ internal components interact with different input tokens to support complex factual recall. Code is available at this https URL.

[NLP-79] Shades of Zero: Distinguishing Impossibility from Inconceivability

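【Quick read】: This paper asks whether people distinguish the "impossible" from the "inconceivable" and, if so, how. It also examines whether the probabilities that statistical language models (LMs) assign to event descriptions can capture this distinction, and how they compare with human judgments.

The approach is a series of experiments. People readily separate impossible from inconceivable events in categorization studies, yet their subjective likelihood ratings do not explain the distinction, because both kinds of event are rated near zero. LM-derived string probabilities predict people's likelihood ratings across modal categories, and both people and LMs distinguish impossible from inconceivable event descriptions. This suggests that fine-grained knowledge about exceedingly rare events may be acquired through statistical learning over linguistic forms, while leaving open whether people represent the difference between impossible and inconceivable as one of kind rather than degree.

Link: https://arxiv.org/abs/2502.20469
Authors: Jennifer Hu,Felix Sosa,Tomer Ullman
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: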

Abstract:Some things are impossible, but some things may be even more impossible than impossible. Levitating a feather using one’s mind is impossible in our world, but fits into our intuitive theories of possible worlds, whereas levitating a feather using the number five cannot be conceived in any possible world (“inconceivable”). While prior work has examined the distinction between improbable and impossible events, there has been little empirical research on inconceivability. Here, we investigate whether people maintain a distinction between impossibility and inconceivability, and how such distinctions might be made. We find that people can readily distinguish the impossible from the inconceivable, using categorization studies similar to those used to investigate the differences between impossible and improbable (Experiment 1). However, this distinction is not explained by people’s subjective ratings of event likelihood, which are near zero and indistinguishable between impossible and inconceivable event descriptions (Experiment 2). Finally, we ask whether the probabilities assigned to event descriptions by statistical language models (LMs) can be used to separate modal categories, and whether these probabilities align with people’s ratings (Experiment 3). We find high-level similarities between people and LMs: both distinguish among impossible and inconceivable event descriptions, and LM-derived string probabilities predict people’s ratings of event likelihood across modal categories. Our findings suggest that fine-grained knowledge about exceedingly rare events (i.e., the impossible and inconceivable) may be learned via statistical learning over linguistic forms, yet leave open the question of whether people represent the distinction between impossible and inconceivable as a difference not of degree, but of kind.

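[NLP-80] Among Them: A game-based framework for assessing persuasion capabilities of LLMs

【Quick read】: This paper addresses the assessment of risks posed by large language models (LLMs) in automated persuasion and social influence. Existing research has looked mainly at isolated instances of LLM-based manipulation, and systematic comparisons of persuasion capability across models remain limited. The authors present a game framework inspired by Among Us for assessing LLM deception skills in a controlled environment. The framework compares models by game statistics and quantifies in-game manipulation according to 25 persuasion strategies from social psychology and rhetoric. Experiments with 8 popular language models of different types and sizes show that all tested models exhibit persuasive capabilities and successfully use 22 of the 25 anticipated techniques. Larger models show no persuasion advantage over smaller ones, and longer model outputs are negatively correlated with the number of games won. The study provides tools and data for future research on LLM deception.

Link: https://arxiv.org/abs/2502.20426
Authors: Mateusz Idziejczak,Vasyl Korzavatykh,Mateusz Stawicki,Andrii Chmutov,Marcin Korcz,Iwo Błądek,Dariusz Brzezinski
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: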

Abstract:The proliferation of large language models (LLMs) and autonomous AI agents has raised concerns about their potential for automated persuasion and social influence. While existing research has explored isolated instances of LLM-based manipulation, systematic evaluations of persuasion capabilities across different models remain limited. In this paper, we present an Among Us-inspired game framework for assessing LLM deception skills in a controlled environment. The proposed framework makes it possible to compare LLM models by game statistics, as well as quantify in-game manipulation according to 25 persuasion strategies from social psychology and rhetoric. Experiments between 8 popular language models of different types and sizes demonstrate that all tested models exhibit persuasive capabilities, successfully employing 22 of the 25 anticipated techniques. We also find that larger models do not provide any persuasion advantage over smaller models and that longer model outputs are negatively correlated with the number of games won. Our study provides insights into the deception capabilities of LLMs, as well as tools and data for fostering future research on the topic.

[NLP-81] SEKI: Self-Evolution and Knowledge Inspiration based Neural Architecture Search via Large Language Models

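【Quick read】: This paper addresses the trade-off between efficiency and performance in neural architecture search (NAS), and in particular how to use large language models (LLMs) for efficient, high-quality NAS without any domain-specific data. It proposes SEKI, which combines two stages: self-evolution and knowledge distillation. In the self-evolution stage, an iterative refinement mechanism improves architectures based on performance feedback and gradually accumulates a repository of high-performing architectures. In the knowledge distillation stage, the LLM analyzes common patterns among these architectures to generate new, optimized designs. SEKI requires only 0.05 GPU-days and achieves state-of-the-art performance across various datasets and search spaces.

Link: https://arxiv.org/abs/2502.20422
Authors: Zicheng Cai,Yaohua Tang,Yutao Lai,Hua Wang,Zhi Chen,Hao Chen
Affiliations: Moore Threads AI; GuangDong University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: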

Abstract:We introduce SEKI, a novel large language model (LLM)-based neural architecture search (NAS) method. Inspired by the chain-of-thought (CoT) paradigm in modern LLMs, SEKI operates in two key stages: self-evolution and knowledge distillation. In the self-evolution stage, LLMs initially lack sufficient reference examples, so we implement an iterative refinement mechanism that enhances architectures based on performance feedback. Over time, this process accumulates a repository of high-performance architectures. In the knowledge distillation stage, LLMs analyze common patterns among these architectures to generate new, optimized designs. Combining these two stages, SEKI greatly leverages the capacity of LLMs on NAS and without requiring any domain-specific data. Experimental results show that SEKI achieves state-of-the-art (SOTA) performance across various datasets and search spaces while requiring only 0.05 GPU-days, outperforming existing methods in both efficiency and accuracy. Furthermore, SEKI demonstrates strong generalization capabilities, achieving SOTA-competitive results across multiple tasks.

[NLP-82] Chitranuvad: Adapting Multi-Lingual LLMs for Multimodal Translation

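[Quick read]: This paper tackles the English-to-low-resource multimodal translation task, specifically for Indic languages such as Hindi, Bengali, and Malayalam. The key contribution is Chitranuvad, a multimodal translation model that integrates a multilingual large language model (Multilingual LLM) with a vision module. It uses a ViT image encoder to extract visual representations, projects them into the LLM space through an adapter layer, and generates translations autoregressively. The system competed in all three tracks for Indic languages (image captioning, text-only translation, and multimodal translation), achieving state-of-the-art (SOTA) results for Hindi on the Challenge set while remaining competitive for the other languages.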

Link: https://arxiv.org/abs/2502.20420
Authors: Shaharukh Khan,Ayush Tarun,Ali Faraz,Palash Kamble,Vivek Dahiya,Praveen Pokala,Ashish Kulkarni,Chandra Khatri,Abhinav Ravi,Shubham Agarwal
Affiliations: Krutrim AI; Olak Rutrim
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:In this work, we provide the system description of our submission as part of the English to Lowres Multimodal Translation Task at the Workshop on Asian Translation (WAT2024). We introduce Chitranuvad, a multimodal model that effectively integrates a Multilingual LLM and a vision module for Multimodal Translation. Our method uses a ViT image encoder to extract visual representations as visual token embeddings, which are projected to the LLM space by an adapter layer, and generates translations in an autoregressive fashion. We participated in all three tracks (Image Captioning, Text-only and Multimodal translation tasks) for Indic languages (i.e., English translation to Hindi, Bengali and Malayalam) and achieved SOTA results for Hindi in all of them on the Challenge set while remaining competitive for the other languages in the shared task.
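
The projection step described here (visual token embeddings mapped into the LLM embedding space by an adapter) can be illustrated with a minimal sketch. It assumes the adapter is a single linear layer and that the projected visual tokens are placed before the text embeddings; the abstract specifies neither, and the function names are hypothetical.

```python
def project_visual_tokens(visual_tokens, weight, bias):
    """Apply a linear adapter: each d_vis vector becomes a d_llm vector.

    weight has shape d_vis x d_llm, bias has length d_llm (assumed layout).
    """
    projected = []
    for token in visual_tokens:
        if len(token) != len(weight):
            raise ValueError("token dimension must match weight rows")
        row = [
            sum(token[i] * weight[i][j] for i in range(len(weight))) + bias[j]
            for j in range(len(bias))
        ]
        projected.append(row)
    return projected


def build_llm_input(visual_tokens, text_embeddings, weight, bias):
    """Concatenate projected visual tokens with text embeddings.

    Putting the visual tokens first is an assumption made for illustration.
    """
    return project_visual_tokens(visual_tokens, weight, bias) + list(text_embeddings)
```

The LLM would then decode the translation token by token conditioned on this combined sequence.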

[NLP-83] Pause-Tuning for Long-Context Comprehension: A Lightweight Approach to LLM Attention Recalibration

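[Quick read]: This paper addresses the Lost-in-the-Middle (LITM) problem in long-context comprehension: large language models (LLMs) struggle to understand and use information located in the middle of long inputs. It proposes pause-tuning, a technique whose key idea is to fine-tune on datasets with artificially inserted pause tokens, which split a long input into smaller, more manageable segments and thereby redistribute attention to improve long-context comprehension. On the Needle-in-a-Haystack benchmark, the LLaMA 3.2 3B Instruct and LLaMA 3.1 8B Instruct models improve by 10.61% and 3.57% on average, respectively, indicating that pause-tuning improves long-context information retention.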

Link: https://arxiv.org/abs/2502.20405
Authors: James Begin,Namit Agrawal,Eshan Singh,Yicheng Fu,Sean O’Brien,Vasu Sharma,Kevin Zhu
Affiliations: Algoverse AI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:LLMs have demonstrated remarkable proficiency in understanding tasks but continue to struggle with long-context comprehension, particularly with content located in the middle of extensive inputs. This limitation, known as the Lost-in-the-Middle (LITM) problem, hinders models from fully processing and utilizing information across lengthy contexts. To address this issue, we introduce pause-tuning, a technique that redistributes attention to enhance comprehension of long-context inputs. Our approach involves fine-tuning language models on datasets with artificially inserted pause tokens, which serve to segment the input into smaller, more manageable parts. We evaluate pause-tuning against alternative approaches using the Needle-in-a-Haystack benchmark, where models must retrieve information embedded within contexts of up to 128K tokens. Experimental results demonstrate significant performance gains, with the LLaMA 3.2 3B Instruct model and the LLaMA 3.1 8B Instruct model improving by 10.61% and 3.57% respectively on average, suggesting that pause-tuning successfully enhances attention redistribution and improves long-context retention. The code and data are available at this https URL.
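
As a rough illustration of the data preparation step, the sketch below inserts a pause token into a token sequence at fixed intervals. The token string `<pause>` and the fixed-interval placement are assumptions made for this example; the abstract does not say where or how often the pause tokens are inserted.

```python
def insert_pause_tokens(tokens, interval, pause_token="<pause>"):
    """Return a copy of `tokens` with `pause_token` after every `interval` tokens."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    out = []
    for i, tok in enumerate(tokens, start=1):
        out.append(tok)
        # Close each full segment with a pause, but never append one at the end.
        if i % interval == 0 and i < len(tokens):
            out.append(pause_token)
    return out
```

For example, `insert_pause_tokens(list("abcdef"), 2)` splits the six tokens into three segments separated by two pause tokens. A fine-tuning set would then be built from sequences prepared this way.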

[NLP-84] Momentum Posterior Regularization for Multi-hop Dense Retrieval COLING2025

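[Quick read]: This paper addresses the difficulty of distilling knowledge from a posterior retrieval into a prior retrieval in multi-hop question answering (multi-hop QA). Existing knowledge distillation methods for one-time retrieval perform poorly in the multi-hop setting for two reasons: (1) posterior information is usually defined as the answer, which may have no clear connection to the query without intermediate retrieval; and (2) the large knowledge gap between prior and posterior retrievals makes existing distillation methods unstable and can even reduce performance. The paper proposes MoPo (Momentum Posterior Regularization), built on two ideas: (1) the posterior information for a hop is defined as a query-focused summary of the golden knowledge from the previous and current hops; and (2) the posterior retrieval is updated together with the prior retrieval through a momentum moving average, which gives smoother and more effective distillation. Experiments show that MoPo outperforms existing baselines on both retrieval and downstream QA on HotpotQA and StrategyQA.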

Link: https://arxiv.org/abs/2502.20399
Authors: Zehua Xia,Yuyang Wu,Yiyun Xia,Cam-Tu Nguyen
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; School of Computer Science, Nanjing University; School of Artificial Intelligence, Nanjing University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Accepted by COLING 2025

Abstract:Multi-hop question answering (QA) often requires sequential retrieval (multi-hop retrieval), where each hop retrieves missing knowledge based on information from previous hops. To facilitate more effective retrieval, we aim to distill knowledge from a posterior retrieval, which has access to posterior information like an answer, into a prior retrieval used during inference when such information is unavailable. Unfortunately, current methods for knowledge distillation in one-time retrieval are ineffective for multi-hop QA due to two issues: 1) Posterior information is often defined as the response (i.e. the answer), which may not clearly connect to the query without intermediate retrieval; and 2) The large knowledge gap between prior and posterior retrievals makes existing distillation methods unstable, even resulting in performance loss. As such, we propose MoPo (Momentum Posterior Regularization) with two key innovations: 1) Posterior information of one hop is defined as a query-focus summary from the golden knowledge of the previous and current hops; 2) We develop an effective training strategy where the posterior retrieval is updated along with the prior retrieval via momentum moving average method, allowing smoother and effective distillation. Experiments on HotpotQA and StrategyQA demonstrate that MoPo outperforms existing baselines in both retrieval and downstream QA tasks.
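
A momentum moving average update of the kind mentioned above is commonly written as theta_post <- m * theta_post + (1 - m) * theta_prior. The sketch below applies that rule to flat parameter lists. It is one plausible reading of the abstract rather than the paper's exact procedure, and the function name and the default `momentum=0.99` are assumptions.

```python
def momentum_update(posterior_params, prior_params, momentum=0.99):
    """Move posterior-retriever parameters slowly toward the prior retriever's."""
    if len(posterior_params) != len(prior_params):
        raise ValueError("parameter lists must have the same length")
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must be in [0, 1]")
    return [
        momentum * p_post + (1.0 - momentum) * p_prior
        for p_post, p_prior in zip(posterior_params, prior_params)
    ]
```

With a momentum close to 1 the posterior retriever changes slowly, which is the usual motivation for this kind of update: the distillation target stays stable while still tracking the model being trained.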

[NLP-85] Intelligence Test

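[Quick read]: The core question of this paper is how to define and quantify intelligence, and how to assess the limits of current AI systems in reaching autonomous intelligence together with the theoretical reasons behind those limits. The key proposal is a method called Intelligence Test, which measures intelligence through the expectation and variance of the number of failed attempts made before a task is completed. Beyond scoring existing AI systems, the method exposes their shortcomings on complex tasks: they rely on superficial mimicry rather than a deep understanding of the task's underlying mechanisms. The accompanying theoretical analysis suggests that general autonomous intelligence would require an impractically large number of parameters, which underlines how hard the problem is.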

Link: https://arxiv.org/abs/2502.18858
Authors: Jingtao Zhan,Jiahao Zhao,Jiayu Li,Yiqun Liu,Bo Zhang,Qingyao Ai,Jiaxin Mao,Hongning Wang,Min Zhang,Shaoping Ma
Affiliations: Tsinghua University; Renmin University of China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Notes:

Abstract:How does intelligence emerge? We propose that intelligence is not a sudden gift or random occurrence, but rather a necessary trait for species to survive through Natural Selection. If a species passes the test of Natural Selection, it demonstrates the intelligence to survive in nature. Extending this perspective, we introduce Intelligence Test, a method to quantify the intelligence of any subject on any task. Like how species evolve by trial and error, Intelligence Test quantifies intelligence by the number of failed attempts before success. Fewer failures correspond to higher intelligence. When the expectation and variance of failure counts are both finite, it signals the achievement of an autonomous level of intelligence. Using Intelligence Test, we comprehensively evaluate existing AI systems. Our results show that while AI systems achieve a level of autonomy in simple tasks, they are still far from autonomous in more complex tasks, such as vision, search, recommendation, and language. While scaling model size might help, this would come at an astronomical cost. Projections suggest that achieving general autonomy would require unimaginable 10^26 parameters. Even if Moore’s Law continuously holds, such a parameter scale would take 70 years. This staggering cost highlights the complexity of human tasks and the inadequacies of current AI. To further understand this phenomenon, we conduct a theoretical analysis. Our simulations suggest that human tasks possess a criticality property. As a result, autonomy requires a deep understanding of the task’s underlying mechanisms. Current AI, however, does not fully grasp these mechanisms and instead relies on superficial mimicry, making it difficult to reach an autonomous level. We believe Intelligence Test can not only guide the future development of AI but also offer profound insights into the intelligence of humans ourselves.
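
The quantity being measured, the number of failed attempts before the first success, can be made concrete with a small simulation. The sketch assumes independent attempts with a fixed success probability, which is a simplification introduced here and not something the abstract states; the function names and the `max_attempts` cap are likewise hypothetical.

```python
import random
from statistics import mean, pvariance


def failures_before_success(success_prob, rng, max_attempts=1_000_000):
    """Count failed attempts before the first success (capped at max_attempts)."""
    failures = 0
    while failures < max_attempts:
        if rng.random() < success_prob:
            return failures
        failures += 1
    return failures


def failure_statistics(counts):
    """Return the sample mean and population variance of failure counts."""
    return mean(counts), pvariance(counts)


# Under the fixed-probability assumption the count is geometric, with
# expectation (1 - p) / p and variance (1 - p) / p**2, both finite for p > 0.
rng = random.Random(0)
counts = [failures_before_success(0.25, rng) for _ in range(10_000)]
avg, var = failure_statistics(counts)
```

Under this simplified model both moments are finite whenever the success probability is positive, so the interesting cases in the paper's framing are those where the distribution of failure counts is heavy-tailed enough for the expectation or variance to diverge.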

[NLP-86] Collective Reasoning Among LLMs: A Framework for Answer Validation Without Ground Truth

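[Quick read]: This paper addresses how multiple language models can collaborate to generate and answer questions reliably when no definitive ground truth is available, specifically for PhD-level probability questions. The key idea is a collaborative framework of multiple large language models (LLMs) that uses consensus among the models to improve response reliability and as a proxy for assessing the quality of the generated questions. The study applies statistical tools, including chi-square tests, Fleiss' Kappa, and confidence interval analysis, to quantify inter-model agreement in terms of both response precision and question clarity. The results show that Claude and Gemini generate better-structured, less ambiguous questions and so achieve higher inter-model agreement, whereas LLaMA shows greater variability and lower reliability. The study indicates that multi-model collaboration not only makes answers more reliable but also offers a useful way to assess and improve question quality when no explicit ground truth exists.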

Link: https://arxiv.org/abs/2502.20758
Authors: Seyed Pouyan Mousavi Davoudi,Alireza Shafiee Fard,Alireza Amiri-Margavi
Affiliations: Unknown
Categories: Applications (stat.AP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 14 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:2411.16797

Abstract:We present a collaborative framework where multiple large language models, namely GPT-4-0125-preview, Meta-LLaMA-3-70B-Instruct, Claude-3-Opus, and Gemini-1.5-Flash, work together to generate and respond to complex PhD-level probability questions in the absence of definitive ground truth. This study explores how inter-model consensus enhances response reliability and serves as a proxy for assessing the quality of generated questions. To quantify agreement and consistency, we employ statistical methods including chi-square tests, Fleiss’ Kappa, and confidence interval analysis, measuring both response precision and question clarity. Our findings highlight that Claude and Gemini generate well-structured and less ambiguous questions, leading to higher inter-model agreement. This is reflected in their narrower confidence intervals and stronger alignment with answering models. Conversely, LLaMA demonstrates increased variability and lower reliability in question formulation, as indicated by broader confidence intervals and reduced consensus rates. These results suggest that multi-model collaboration not only enhances the reliability of responses but also provides a valuable framework for assessing and improving question quality in the absence of explicit ground truth. This research offers meaningful insights into optimizing AI-driven reasoning through collaborative large-language model interactions.
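
Of the statistics listed, Fleiss' Kappa is the one that directly measures agreement among several raters. The sketch below computes it from a count matrix in which each row is one item (for instance one question) and each column is an answer category; treating each LLM as a rater is an assumption about how the study applies the statistic, and the function name is hypothetical.

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa for ratings[i][j] = number of raters assigning item i to category j.

    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(ratings[0])

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean agreement below chance.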

[NLP-87] Brain-Inspired Exploration of Functional Networks and Key Neurons in Large Language Models

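[Quick read]: This paper asks how to systematically understand the internal functional mechanisms of large language models (LLMs), and whether they contain organizational structures similar to the functional brain networks (FBNs) of the human brain. Existing work mostly examines the role of individual neurons and overlooks how functional networks interact as a whole. The paper therefore proposes a new framework, based on the functional brain network analysis methods of cognitive neuroscience, for locating and identifying functional networks in LLMs.

The key to the approach is to borrow analysis techniques from functional neuroimaging, treating an LLM as a complex computational system and using analogous methods to reveal its internal functional network structure. Experiments show that LLMs do contain functional networks that recur frequently during operation and that these networks are crucial to model performance. Masking key functional networks significantly impairs performance, while retaining only a subset of them is enough to maintain effective operation. The findings offer a new perspective on LLM interpretability and a basis for making LLMs more lightweight for downstream tasks.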

Link: https://arxiv.org/abs/2502.20408
Authors: Yiheng Liu,Xiaohui Gao,Haiyang Sun,Bao Ge,Tianming Liu,Junwei Han,Xintao Hu
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 13 pages, 5 figures

Abstract:In recent years, the rapid advancement of large language models (LLMs) in natural language processing has sparked significant interest among researchers to understand their mechanisms and functional characteristics. Although existing studies have attempted to explain LLM functionalities by identifying and interpreting specific neurons, these efforts mostly focus on individual neuron contributions, neglecting the fact that human brain functions are realized through intricate interaction networks. Inspired by cognitive neuroscience research on functional brain networks (FBNs), this study introduces a novel approach to investigate whether similar functional networks exist within LLMs. We use methods similar to those in the field of functional neuroimaging analysis to locate and identify functional networks in LLM. Experimental results show that, similar to the human brain, LLMs contain functional networks that frequently recur during operation. Further analysis shows that these functional networks are crucial for LLM performance. Masking key functional networks significantly impairs the model’s performance, while retaining just a subset of these networks is adequate to maintain effective operation. This research provides novel insights into the interpretation of LLMs and the lightweighting of LLMs for certain downstream tasks. Code is available at this https URL.

Computer Vision

[CV-0] How far can we go with ImageNet for Text-to-Image generation?

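[Quick read]: This paper asks whether text-to-image (T2I) generation can move beyond the prevailing "bigger is better" data paradigm, and whether strategic data augmentation offers a more efficient and sustainable alternative. Its central result is that a small, well-curated dataset (ImageNet) combined with well-designed text and image augmentations can match or outperform models trained on massive web-scraped collections, while using only 1/10 of the parameters and 1/1000 of the training images. The approach challenges the existing paradigm and points to a more sustainable path for T2I generation.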

Link: https://arxiv.org/abs/2502.21318
Authors: L. Degeorge,A. Ghosh,N. Dufour,D. Picard,V. Kalogeiton
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Recent text-to-image (T2I) generation models have achieved remarkable results by training on billion-scale datasets, following a 'bigger is better' paradigm that prioritizes data quantity over quality. We challenge this established paradigm by demonstrating that strategic data augmentation of small, well-curated datasets can match or outperform models trained on massive web-scraped collections. Using only ImageNet enhanced with well-designed text and image augmentations, we achieve a +2 overall score over SD-XL on GenEval and +5 on DPGBench while using just 1/10th the parameters and 1/1000th the training images. Our results suggest that strategic data augmentation, rather than massive datasets, could offer a more sustainable path forward for T2I generation.

[CV-1] Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos

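[Quick read]: This paper addresses the fact that existing text-to-video generation methods are limited by dataset quality and computational resources. It proposes a combined approach that improves both data curation and model design. The first component is CFC-VIDS-1M, a high-quality video dataset built with a systematic coarse-to-fine curation pipeline: a multi-dimensional video quality assessment followed by a fine-grained stage that uses vision-language models to improve text-video alignment and semantic richness. Building on the curated dataset's emphasis on visual quality and temporal coherence, the second component is RACCOON, a transformer-based architecture with decoupled spatial-temporal attention, trained with a progressive four-stage strategy designed to handle the complexity of video generation efficiently.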

Link: https://arxiv.org/abs/2502.21314
Authors: Zhiyu Tan,Junyan Wang,Hao Yang,Luozheng Qin,Hesen Chen,Qiang Zhou,Hao Li
Affiliations: Fudan University; The University of Adelaide; INF Tech; Shanghai Academy of Artificial Intelligence for Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Text-to-video generation has demonstrated promising progress with the advent of diffusion models, yet existing approaches are limited by dataset quality and computational resources. To address these limitations, this paper presents a comprehensive approach that advances both data curation and model design. We introduce CFC-VIDS-1M, a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline. The pipeline first evaluates video quality across multiple dimensions, followed by a fine-grained stage that leverages vision-language models to enhance text-video alignment and semantic richness. Building upon the curated dataset’s emphasis on visual quality and temporal coherence, we develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms. The model is trained through a progressive four-stage strategy designed to efficiently handle the complexities of video generation. Extensive experiments demonstrate that our integrated approach of high-quality data curation and efficient training strategy generates visually appealing and temporally coherent videos while maintaining computational efficiency. We will release our dataset, code, and models.

[CV-2] Unsupervised Parameter Efficient Source-free Post-pretraining

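[Quick read]: This paper addresses the cost of adapting large pretrained models (often in the billion-parameter range) to a target distribution, which has become computationally and economically prohibitive. It proposes UpStep, an unsupervised, parameter-efficient, source-free post-pretraining method for adapting a base model from a source domain to a target domain. The approach has three parts. First, a self-supervised training scheme adapts the pretrained model on unlabeled target-domain data when source-domain data is unavailable. Second, center vector regularization (CVR), a set of auxiliary operations, reduces catastrophic forgetting and further lowers computational cost by skipping backpropagation in 50% of the training iterations. Third, low-rank adaptation is used to adapt the pretrained model in a parameter-efficient way, so only a small fraction of the parameters is optimized. Experiments across several backbone architectures and multiple target domains support the method's adaptability and generality.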

Link: https://arxiv.org/abs/2502.21313
Authors: Abhishek Jha,Tinne Tuytelaars,Yuki M. Asano
Affiliations: ESAT-PSI, KU Leuven; Fundamental AI Lab, University of Technology Nuremberg
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:

Abstract:Following the success in NLP, the best vision models are now in the billion-parameter range. Adapting these large models to a target distribution has become computationally and economically prohibitive. Addressing this challenge, we introduce UpStep, an Unsupervised Parameter-efficient Source-free post-pretraining approach, designed to efficiently adapt a base model from a source domain to a target domain: i) we design a self-supervised training scheme to adapt a pretrained model on an unlabeled target domain in a setting where source domain data is unavailable. Such a source-free setting comes with the risk of catastrophic forgetting, hence, ii) we propose center vector regularization (CVR), a set of auxiliary operations that minimize catastrophic forgetting and additionally reduce the computational cost by skipping backpropagation in 50% of the training iterations. Finally, iii) we perform this adaptation process in a parameter-efficient way by adapting the pretrained model through low-rank adaptation methods, resulting in a fraction of parameters to optimize. We utilize various general backbone architectures, both supervised and unsupervised, trained on ImageNet as our base model and adapt them to a diverse set of eight target domains, demonstrating the adaptability and generalizability of our proposed approach.
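
To make "a fraction of parameters to optimize" concrete, the sketch below compares the trainable parameter count of a full weight matrix with that of a standard low-rank update W + BA, where B is d_out x r and A is r x d_in. Assuming this standard form is a guess about which low-rank adaptation method is used; the function name and the layer sizes in the usage note are illustrative and not taken from the paper.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, low_rank) trainable parameter counts for one linear layer."""
    if min(d_in, d_out, rank) <= 0:
        raise ValueError("dimensions and rank must be positive")
    full = d_in * d_out              # training W directly
    low_rank = rank * (d_in + d_out)  # training only B (d_out x r) and A (r x d_in)
    return full, low_rank
```

For a hypothetical 768 x 768 layer with rank 8, this gives 589,824 parameters for full fine-tuning against 12,288 for the low-rank update, a 48-fold reduction for that layer.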

[CV-3] MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing

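[Quick read]: This paper addresses the challenges of subject-driven generation and instruction-based editing in diffusion-based image generation. Existing methods usually handle the two tasks separately, which leaves them short of high-quality data and weak at generalization. Both tasks share the same core difficulty: capturing complex visual variations while keeping inputs and outputs consistent. The proposed solution is MIGE, a unified framework that standardizes task representations with multimodal instructions. It treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, which yields a shared input-output formulation. Its key component is a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space and integrates visual and semantic features through feature fusion. This unification enables joint training of both tasks and brings two advantages: (1) cross-task enhancement, where shared visual and semantic representations improve instruction adherence and visual consistency; and (2) generalization, where learning in a unified format promotes cross-task knowledge transfer and lets MIGE extend to new tasks such as instruction-based subject-driven editing. Experiments show that MIGE performs well on both subject-driven generation and instruction-based editing and sets the state of the art on the new task.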

Link: https://arxiv.org/abs/2502.21291
Authors: Xueyun Tian,Wei Li,Bingbing Xu,Yige Yuan,Yuanzhuo Wang,Huawei Shen
Affiliations: CAS Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat them separately, struggling with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. Therefore, we propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation. MIGE introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through feature fusion. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: By leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: Learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a state-of-the-art in the new task of instruction-based subject-driven editing. Code and model are publicly available at this https URL.

[CV-4] Back to the Future Cyclopean Stereo: a human perception approach unifying deep and geometric constraints

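[Quick read]: This paper addresses the difficulty that purely data-driven stereo vision methods have in capturing key geometric information at depth discontinuities and in occluded regions. Its main contribution is a solution that combines an explicit 3D surface model with learned stereo features. The model represents depth discontinuities and occlusions explicitly within a "cyclopean eye" geometric framework, and a prior monocular surface model fills in regions where data matching is insufficient. Combining geometric modeling with learned features reaches the performance of current purely data-driven methods while producing much better visual quality, which underlines the value of 3D geometric models in computer vision research.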

Link: https://arxiv.org/abs/2502.21280
Authors: Sherlon Almeida da Silva,Davi Geiger,Luiz Velho,Moacir Antonelli Ponti
Affiliations: Institute of Mathematics and Computer Science, University of São Paulo (ICMC-USP); Courant Institute of Mathematical Sciences, New York University (NYU); Institute for Pure and Applied Mathematics (IMPA)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:We innovate in stereo vision by explicitly providing analytical 3D surface models as viewed by a cyclopean eye model that incorporate depth discontinuities and occlusions. This geometrical foundation combined with learned stereo features allows our system to benefit from the strengths of both approaches. We also invoke a prior monocular model of surfaces to fill in occlusion regions or texture-less regions where data matching is not sufficient. Our results already are on par with the state-of-the-art purely data-driven methods and are of much better visual quality, emphasizing the importance of the 3D geometrical model to capture critical visual information. Such qualitative improvements may find applicability in virtual reality, for a better human experience, as well as in robotics, for reducing critical errors. Our approach aims to demonstrate that understanding and modeling geometrical properties of 3D surfaces is beneficial to computer vision research.

[CV-5] Adaptive Keyframe Sampling for Long Video Understanding CVPR2025

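[Quick read]: This paper addresses the challenge that multimodal large language models (MLLMs) face when the visual input grows from a single image to a long video. Existing video-based MLLMs typically sample a small number of tokens from the input, which can drop key information and lead to incorrect answers. The paper proposes a simple yet effective algorithm called Adaptive Keyframe Sampling (AKS). Its key element is a plug-and-play keyframe selection module that maximizes the useful information within a fixed budget of video tokens by optimizing both the relevance of the keyframes to the prompt and their coverage of the video. An adaptive algorithm approximates the best solution, improving video question answering (video QA) accuracy on long video understanding benchmarks. The study highlights the importance of information pre-filtering in video-based MLLMs.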

Link: https://arxiv.org/abs/2502.21271
Authors: Xi Tang,Jihao Qiu,Lingxi Xie,Yunjie Tian,Jianbin Jiao,Qixiang Ye
Affiliations: University of Chinese Academy of Sciences; University at Buffalo, SUNY
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: CVPR2025

Abstract:Multimodal large language models (MLLMs) have enabled open-world visual understanding by injecting visual input as extra tokens into large language models (LLMs) as contexts. However, when the visual input changes from a single image to a long video, the above paradigm encounters difficulty because the vast amount of video tokens has significantly exceeded the maximal capacity of MLLMs. Therefore, existing video-based MLLMs are mostly established upon sampling a small portion of tokens from input data, which can cause key information to be lost and thus produce incorrect answers. This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play module known as keyframe selection, which aims to maximize the useful information with a fixed number of video tokens. We formulate keyframe selection as an optimization involving (1) the relevance between the keyframes and the prompt, and (2) the coverage of the keyframes over the video, and present an adaptive algorithm to approximate the best solution. Experiments on two long video understanding benchmarks validate that Adaptive Keyframe Sampling improves video QA accuracy (beyond strong baselines) upon selecting informative keyframes. Our study reveals the importance of information pre-filtering in video-based MLLMs. Code is available at this https URL.
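
The tension between relevance and coverage can be seen in a deliberately simple baseline: split the video into as many equal temporal bins as the frame budget allows and keep the most prompt-relevant frame in each bin. This is not the adaptive algorithm from the paper, only a sketch of the two criteria; the function name and the `relevance` scores (assumed to come from some prompt-frame similarity model) are hypothetical.

```python
def select_keyframes(relevance, budget):
    """Pick up to `budget` frame indices: one highest-relevance frame per temporal bin."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    n = len(relevance)
    budget = min(budget, n)
    selected = []
    for b in range(budget):
        # Equal-width bins enforce coverage of the whole video.
        start = b * n // budget
        end = (b + 1) * n // budget
        # Within a bin, relevance to the prompt decides which frame is kept.
        best = max(range(start, end), key=lambda i: relevance[i])
        selected.append(best)
    return selected
```

Fixed bins guarantee coverage but can waste budget on stretches of the video that are irrelevant to the prompt, which is the kind of trade-off an adaptive method would be expected to handle better.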

[CV-6] Foundation Models – A Panacea for Artificial Intelligence in Pathology?

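[Quick read]: This paper asks how foundation models (FMs) compare with task-specific (TS) models in performance and suitability for clinical pathology diagnosis. It focuses on prostate cancer diagnosis and Gleason grading, and compares the two approaches in a large-scale validation covering more than 100,000 core needle biopsies from 7,342 patients across 15 sites in 11 countries.

The key to the study is a comparison of FMs and a TS model within a multiple instance learning framework, with a systematic analysis of how their performance differs under different data conditions. FMs showed some advantage in data-scarce scenarios, but when sufficient labeled training data were available, extensive task-specific training reduced clinically significant misgrading, misdiagnosis of challenging morphologies, and variability across scanners. The TS model was also more resource-efficient (the FMs used up to 35 times more energy) and more clinically reliable. The paper stresses that rigorous validation and task-specific training are essential for high-stakes clinical applications, and recommends combining the strengths of FMs and end-to-end learning to build robust, efficient AI pathology solutions fit for clinical use.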

Link: https://arxiv.org/abs/2502.21264
Authors: Nita Mulliqi(Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden),Anders Blilie(Department of Pathology, Stavanger University Hospital, Stavanger, Norway and Faculty of Health Sciences, University of Stavanger, Stavanger, Norway),Xiaoyi Ji(Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden),Kelvin Szolnoky(Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden),Henrik Olsson(Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden),Sol Erika Boman(Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden and Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden),Matteo Titus(Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden),Geraldine Martinez Gonzalez(Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden),Julia Anna Mielcarz(Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden),Masi Valkonen(Institute of Biomedicine, University of Turku, Turku, Finland),Einar Gudlaugsson(Department of Pathology, Stavanger University Hospital, Stavanger, Norway),Svein R. Kjosavik(The General Practice and Care Coordination Research Group, Stavanger University Hospital, Norway and Department of Global Public Health and Primary Care, Faculty of Medicine, University of Bergen, Norway),José Asenjo(Department of Pathology, Synlab, Madrid, Spain),Marcello Gambacorta(Department of Pathology, Synlab, Brescia, Italy),Paolo Libretti(Department of Pathology, Synlab, Brescia, Italy),Marcin Braun(Department of Pathology, Chair of Oncology, Medical University of Lodz, Lodz, Poland),Radzislaw Kordek(Department of Pathology, Chair of Oncology, Medical University of Lodz, Lodz, Poland),Roman Łowicki(1st Department of Urology, Medical University of Lodz, Lodz, Poland),Kristina Hotakainen(Department of Clinical Chemistry and Hematology, University of Helsinki, Helsinki, Finland and Laboratory Services, Mehiläinen Oy, Helsinki, Finland),Päivi Väre(Department of Pathology, Mehiläinen Länsi-Pohja Hospital, Kemi, Finland),Bodil Ginnerup Pedersen(Department of Radiology, Aarhus University Hospital, Aarhus, Denmark and Department of Clinical Medicine, Aarhus University, Aarhus, Denmark),Karina Dalsgaard Sørensen(Department of Clinical Medicine, Aarhus University, Aarhus, Denmark and Department of Molecular Medicine, Aarhus University Hospital, Aarhus, Denmark),Benedicte Parm Ulhøi(Department of Pathology, Aarhus University Hospital, Aarhus, Denmark),Pekka Ruusuvuori(Institute of Biomedicine, University of Turku, Turku, Finland and InFLAMES Research Flagship, University of Turku, Turku, Finland and Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland),Brett Delahunt(Malaghan Institute of Medical Research, Wellington, New Zealand and Department of Oncology and Pathology, Karolinska Institutet, Stockholm, Sweden),Hemamali Samaratunga(Aquesta Uropathology and University of Queensland, QLD, Brisbane, Australia),Toyonori Tsuzuki(Department of Surgical Pathology, School of Medicine, Aichi Medical University, Nagoya, Japan),Emilius A.M. Janssen(Department of Pathology, Stavanger University Hospital, Stavanger, Norway and Department of Chemistry, Bioscience and Environmental Engineering, University of Stavanger, Stavanger, Norway and Institute for Biomedicine and Glycomics, Griffith University, Queensland, Australia),Lars Egevad(Department of Oncology and Pathology, Karolinska Institutet, Stockholm, Sweden),Martin Eklund(Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden),Kimmo Kartasalo(Department of Medical Epidemiology and Biostatistics, SciLifeLab, Karolinska Institutet, Stockholm, Sweden)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 50 pages, 15 figures, and an appendix (study protocol) that was previously published, see this https URL

Abstract:The role of artificial intelligence (AI) in pathology has evolved from aiding diagnostics to uncovering predictive morphological patterns in whole slide images (WSIs). Recently, foundation models (FMs) leveraging self-supervised pre-training have been widely advocated as a universal solution for diverse downstream tasks. However, open questions remain about their clinical applicability and generalization advantages over end-to-end learning using task-specific (TS) models. Here, we focused on AI with clinical-grade performance for prostate cancer diagnosis and Gleason grading. We present the largest validation of AI for this task, using over 100,000 core needle biopsies from 7,342 patients across 15 sites in 11 countries. We compared two FMs with a fully end-to-end TS model in a multiple instance learning framework. Our findings challenge assumptions that FMs universally outperform TS models. While FMs demonstrated utility in data-scarce scenarios, their performance converged with - and was in some cases surpassed by - TS models when sufficient labeled training data were available. Notably, extensive task-specific training markedly reduced clinically significant misgrading, misdiagnosis of challenging morphologies, and variability across different WSI scanners. Additionally, FMs used up to 35 times more energy than the TS model, raising concerns about their sustainability. Our results underscore that while FMs offer clear advantages for rapid prototyping and research, their role as a universal solution for clinically applicable medical AI remains uncertain. For high-stakes clinical applications, rigorous validation and consideration of task-specific training remain critically important. We advocate for integrating the strengths of FMs and end-to-end learning to achieve robust and resource-efficient AI pathology solutions fit for clinical use.

[CV-7] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete

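[Quick read]: This paper studies the limitations of multimodal large language models (MLLMs) in long-horizon robotic manipulation tasks, and aims to fix their weaknesses in planning capability, affordance perception, and trajectory prediction. The key contribution is ShareRobot, a high-quality heterogeneous dataset with multi-dimensional annotations covering task planning, object affordance, and end-effector trajectory, whose diversity and accuracy were refined by three human annotators. Building on this dataset, the authors develop RoboBrain, a model that combines robot-specific and general multimodal data, uses a multi-stage training strategy, and incorporates long videos and high-resolution images to improve manipulation capability. Experiments show that RoboBrain achieves state-of-the-art performance across a range of robotic tasks, indicating its potential to advance core robotic capabilities.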

Link: https://arxiv.org/abs/2502.21257
Authors: Yuheng Ji,Huajie Tan,Jiayu Shi,Xiaoshuai Hao,Yuan Zhang,Hengyuan Zhang,Pengwei Wang,Mengdi Zhao,Yao Mu,Pengju An,Xinda Xue,Qinghang Su,Huaihai Lyu,Xiaolong Zheng,Jiaming Liu,Zhongyuan Wang,Shanghang Zhang
Affiliations: Institution1; Institution2
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes:

点击查看摘要

Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. However, their application in robotic scenarios, particularly for long-horizon manipulation tasks, reveals significant limitations. These limitations arise from the current MLLMs lacking three essential robotic brain capabilities: Planning Capability, which involves decomposing complex manipulation instructions into manageable sub-tasks; Affordance Perception, the ability to recognize and interpret the affordances of interactive objects; and Trajectory Prediction, the foresight to anticipate the complete manipulation trajectory necessary for successful execution. To enhance the robotic brain’s core capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. ShareRobot’s diversity and accuracy have been meticulously refined by three human annotators. Building on this dataset, we developed RoboBrain, an MLLM-based model that combines robotic and general multi-modal data, utilizes a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.

[CV-8] Anatomically-guided masked autoencoder pre-training for aneurysm detection

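Summary: This paper addresses intracranial aneurysm detection, where manual detection is complex and time-consuming and the scarcity of annotated data makes conventional supervised learning hard to train effectively. The key idea is a pre-training strategy that uses more widely available unannotated head CT scans: a modified masked auto-encoder (MAE) scheme pre-trains a 3D Vision Transformer, which is then fine-tuned for aneurysm detection. The modifications are: a factorized self-attention mechanism that makes 3D attention computationally viable; masking restricted to regions near arteries, where aneurysms are likely to occur; and reconstruction of both CT intensity values and artery distance maps (the distance from each voxel to the closest artery), which strengthens the backbone's learned representations. Compared with state-of-the-art aneurysm detection models, the approach gains 4-8% absolute sensitivity at a false positive rate of 0.5.

Link: https://arxiv.org/abs/2502.21244
Authors: Alberto Mario Ceballos-Arroyo, Jisoo Kim, Chu-Hsuan Lin, Lei Qin, Geoffrey S. Young, Huaizu Jiang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures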

Abstract:Intracranial aneurysms are a major cause of morbidity and mortality worldwide, and detecting them manually is a complex, time-consuming task. Albeit automated solutions are desirable, the limited availability of training data makes it difficult to develop such solutions using typical supervised learning frameworks. In this work, we propose a novel pre-training strategy using more widely available unannotated head CT scan data to pre-train a 3D Vision Transformer model prior to fine-tuning for the aneurysm detection task. Specifically, we modify masked auto-encoder (MAE) pre-training in the following ways: we use a factorized self-attention mechanism to make 3D attention computationally viable, we restrict the masked patches to areas near arteries to focus on areas where aneurysms are likely to occur, and we reconstruct not only CT scan intensity values but also artery distance maps, which describe the distance between each voxel and the closest artery, thereby enhancing the backbone’s learned representations. Compared with SOTA aneurysm detection models, our approach gains +4-8% absolute Sensitivity at a false positive rate of 0.5. Code and weights will be released.

[CV-9] Towards long-term player tracking with graph hierarchies and domain-specific features

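Summary: This paper tackles long-term player tracking in team sports, which is difficult because of similar player appearance, occlusion, and dynamic motion patterns. The key contribution is SportsSUSHI, a hierarchical graph-based approach that uses domain-specific features (jersey numbers, team IDs, and field coordinates) to improve tracking accuracy. It performs well on the SoccerNet dataset and on a newly proposed hockey tracking dataset; the latter contains long sequences with team ID and jersey number annotations, which makes it well suited to evaluating long-term tracking. Experiments show that the domain-specific features significantly improve association accuracy.

Link: https://arxiv.org/abs/2502.21242
Authors: Maria Koshkina, James H. Elder
Affiliations: York University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: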

Abstract:In team sports analytics, long-term player tracking remains a challenging task due to player appearance similarity, occlusion, and dynamic motion patterns. Accurately re-identifying players and reconnecting tracklets after extended absences from the field of view or prolonged occlusions is crucial for robust analysis. We introduce SportsSUSHI, a hierarchical graph-based approach that leverages domain-specific features, including jersey numbers, team IDs, and field coordinates, to enhance tracking accuracy. SportsSUSHI achieves high performance on the SoccerNet dataset and a newly proposed hockey tracking dataset. Our hockey dataset, recorded using a stationary camera capturing the entire playing surface, contains long sequences and annotations for team IDs and jersey numbers, making it well-suited for evaluating long-term tracking capabilities. The inclusion of domain-specific features in our approach significantly improves association accuracy, as demonstrated in our experiments. The dataset and code are available at this https URL.

[CV-10] The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition

Summary: This paper addresses out-of-distribution generalisation in wildlife behaviour recognition, and in particular the largely unexplored effect of behaviour-correlated background information. The key contribution is the PanAf-FGBG dataset, which pairs every foreground video (containing a chimpanzee) with a background video (no chimpanzee) from the same camera location, and offers views with overlapping and disjoint camera locations so that in-distribution and out-of-distribution conditions can be evaluated directly for the first time. This design lets the impact of backgrounds on behaviour recognition models be quantified. The authors also present a highly effective latent-space normalisation technique that improves out-of-distribution performance by +5.42% mAP for convolutional models and +3.75% mAP for transformer-based models, and they analyse the effect of background durations on out-of-distribution behaviour recognition.

Link: https://arxiv.org/abs/2502.21201
Authors: Otto Brookes, Maksim Kukushkin, Majid Mirmehdi, Colleen Stephens, Paula Dieguez, Thurston C. Hicks, Sorrel Jones, Kevin Lee, Maureen S. McCarthy, Amelia Meier, Emmanuelle Normand, Erin G. Wessling, Roman M. Wittig, Kevin Langergraber, Klaus Zuberbühler, Lukas Boesch, Thomas Schmid, Mimi Arandjelovic, Hjalmar Kühl, Tilo Burghardt
Affiliations: University of Bristol; Wild Chimpanzee Foundation; Max Planck Institute for Evolutionary Anthropology; Leipzig University; Martin Luther University Halle-Wittenberg; Lancaster University in Leipzig; Harvard University; University of Warsaw; University of St Andrews; University of Lyon; Senckenberg Museum of Natural History; Arizona State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at the IEEE / CVF Computer Vision and Pattern Recognition Conference 2025

Abstract:Computer vision analysis of camera trap video footage is essential for wildlife conservation, as captured behaviours offer some of the earliest indicators of changes in population health. Recently, several high-impact animal behaviour datasets and methods have been introduced to encourage their use; however, the role of behaviour-correlated background information and its significant effect on out-of-distribution generalisation remain unexplored. In response, we present the PanAf-FGBG dataset, featuring 20 hours of wild chimpanzee behaviours, recorded at over 350 individual camera locations. Uniquely, it pairs every video with a chimpanzee (referred to as a foreground video) with a corresponding background video (with no chimpanzee) from the same camera location. We present two views of the dataset: one with overlapping camera locations and one with disjoint locations. This setup enables, for the first time, direct evaluation of in-distribution and out-of-distribution conditions, and for the impact of backgrounds on behaviour recognition models to be quantified. All clips come with rich behavioural annotations and metadata including unique camera IDs and detailed textual scene descriptions. Additionally, we establish several baselines and present a highly effective latent-space normalisation technique that boosts out-of-distribution performance by +5.42% mAP for convolutional and +3.75% mAP for transformer-based models. Finally, we provide an in-depth analysis on the role of backgrounds in out-of-distribution behaviour recognition, including the so far unexplored impact of background durations (i.e., the count of background frames within foreground videos).

[CV-11] Towards High-performance Spiking Transformers from ANN to SNN Conversion

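Summary: This paper addresses converting Transformer models into spiking neural networks (SNNs), which is challenging because of the non-linear modules involved. The key contribution is an Expectation Compensation Module, which uses information from the previous T time-steps to calculate the expected output at time-step T and thereby preserves conversion accuracy. The paper also introduces a Multi-Threshold Neuron and a corresponding Parallel Parameter normalization to address the large number of time steps otherwise needed for high accuracy, with the aim of reducing network latency and power consumption. Experiments show state-of-the-art performance: high accuracy at markedly lower power.

Link: https://arxiv.org/abs/2502.21193
Authors: Zihan Huang, Xinyu Shi, Zecheng Hao, Tong Bu, Jianhao Ding, Zhaofei Yu, Tiejun Huang
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: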

Abstract:Spiking neural networks (SNNs) show great potential due to their energy efficiency, fast processing capabilities, and robustness. There are two main approaches to constructing SNNs. Direct training methods require much memory, while conversion methods offer a simpler and more efficient option. However, current conversion methods mainly focus on converting convolutional neural networks (CNNs) to SNNs. Converting Transformers to SNN is challenging because of the presence of non-linear modules. In this paper, we propose an Expectation Compensation Module to preserve the accuracy of the conversion. The core idea is to use information from the previous T time-steps to calculate the expected output at time-step T. We also propose a Multi-Threshold Neuron and the corresponding Parallel Parameter normalization to address the challenge of large time steps needed for high accuracy, aiming to reduce network latency and power consumption. Our experimental results demonstrate that our approach achieves state-of-the-art performance. For example, we achieve a top-1 accuracy of 88.60% with only a 1% loss in accuracy using 4 time steps while consuming only 35% of the original power of the Transformer. To our knowledge, this is the first successful Artificial Neural Network (ANN) to SNN conversion for Spiking Transformers that achieves high accuracy, low latency, and low power consumption on complex datasets. The source codes of the proposed method are available at this https URL.

[CV-12] HQColon: A Hybrid Interactive Machine Learning Pipeline for High Quality Colon Labeling and Segmentation

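Summary: This paper addresses the limited accuracy of high-resolution colon segmentation for clinical and research use: the colon has a complex and variable shape, and existing open-source tools such as TotalSegmentator struggle to segment it accurately. The key contribution is a fully automatic high-resolution colon segmentation method. The authors first build a high-resolution colon dataset efficiently by combining region growing with interactive machine learning, then train an nnU-Net model on the resulting 435 labeled CT colonography (CTC) images. The model substantially outperforms TotalSegmentator on average symmetric surface distance (0.2 mm vs. 4.0 mm) and 95th percentile Hausdorff distance (1.0 mm vs. 18 mm).

Link: https://arxiv.org/abs/2502.21183
Authors: Martina Finocchiaro, Ronja Stern, Abraham George Smith, Jens Petersen, Kenny Erleben, Melanie Ganz
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: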

Abstract:High-resolution colon segmentation is crucial for clinical and research applications, such as digital twins and personalized medicine. However, the leading open-source abdominal segmentation tool, TotalSegmentator, struggles with accuracy for the colon, which has a complex and variable shape, requiring time-intensive labeling. Here, we present the first fully automatic high-resolution colon segmentation method. To develop it, we first created a high resolution colon dataset using a pipeline that combines region growing with interactive machine learning to efficiently and accurately label the colon on CT colonography (CTC) images. Based on the generated dataset consisting of 435 labeled CTC images we trained an nnU-Net model for fully automatic colon segmentation. Our fully automatic model achieved an average symmetric surface distance of 0.2 mm (vs. 4.0 mm from TotalSegmentator) and a 95th percentile Hausdorff distance of 1.0 mm (vs. 18 mm from TotalSegmentator). Our segmentation accuracy substantially surpasses TotalSegmentator. We share our trained model and pipeline code, providing the first and only open-source tool for high-resolution colon segmentation. Additionally, we created a large-scale dataset of publicly available high-resolution colon labels.
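
The two surface metrics quoted in the abstract can be illustrated with a small sketch. It assumes each segmentation surface is already available as a list of 3D points in millimetres; the function names are hypothetical and this is not the authors' evaluation code. Note that definitions of the 95th percentile Hausdorff distance vary between toolkits; the version here takes the larger of the two directed 95th percentiles.

```python
import math

def nearest_distances(src, dst):
    """For every point in src, the distance to its closest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def percentile(values, q):
    """Percentile with linear interpolation between ranks, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def average_symmetric_surface_distance(a, b):
    """Mean of all nearest-surface distances, taken in both directions."""
    d_ab = nearest_distances(a, b)
    d_ba = nearest_distances(b, a)
    return (sum(d_ab) + sum(d_ba)) / (len(d_ab) + len(d_ba))

def hausdorff_95(a, b):
    """Assumed convention: max of the two directed 95th percentiles."""
    return max(percentile(nearest_distances(a, b), 95),
               percentile(nearest_distances(b, a), 95))

# Toy surfaces: the second is the first shifted by 0.5 mm along y.
surface_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
surface_b = [(x, y + 0.5, z) for (x, y, z) in surface_a]
assd = average_symmetric_surface_distance(surface_a, surface_b)
hd95 = hausdorff_95(surface_a, surface_b)
```

The brute-force nearest-point search is quadratic in the number of surface points, so a real evaluation on CT volumes would normally use a distance transform or a spatial index instead.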

[CV-13] Adaptive Illumination-Invariant Synergistic Feature Integration in a Stratified Granular Framework for Visible-Infrared Re-Identification

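Summary: This paper addresses the challenges of cross-modal visible-infrared (VI) person re-identification: modality discrepancies, varying illumination, and frequent occlusions. It proposes AMINet, an Adaptive Modality Interaction Network. AMINet uses multi-granularity feature extraction to capture comprehensive identity attributes, an interactive feature fusion strategy for deep intra-modal and cross-modal alignment, phase congruency for robust illumination-invariant feature extraction, and an adaptive multi-scale kernel maximum mean discrepancy (MMD) to align feature distributions across scales. Together these improve generalization and matching accuracy on the cross-modal task.

Link: https://arxiv.org/abs/2502.21163
Authors: Yuheng Jia, Wesley Armour
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: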

Abstract:Visible-Infrared Person Re-Identification (VI-ReID) plays a crucial role in applications such as search and rescue, infrastructure protection, and nighttime surveillance. However, it faces significant challenges due to modality discrepancies, varying illumination, and frequent occlusions. To overcome these obstacles, we propose AMINet, an Adaptive Modality Interaction Network. AMINet employs multi-granularity feature extraction to capture comprehensive identity attributes from both full-body and upper-body images, improving robustness against occlusions and background clutter. The model integrates an interactive feature fusion strategy for deep intra-modal and cross-modal alignment, enhancing generalization and effectively bridging the RGB-IR modality gap. Furthermore, AMINet utilizes phase congruency for robust, illumination-invariant feature extraction and incorporates an adaptive multi-scale kernel MMD to align feature distributions across varying scales. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach, achieving a Rank-1 accuracy of 74.75% on SYSU-MM01, surpassing the baseline by 7.93% and outperforming the current state-of-the-art by 3.95%.
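
To make the multi-scale kernel MMD term concrete, here is a minimal sketch of the standard biased estimator of squared MMD with an average of Gaussian kernels at several bandwidths. The fixed bandwidths and function names are assumptions for illustration; the abstract does not say how AMINet adapts its kernels, so this is not the paper's loss.

```python
import math

def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def multi_scale_kernel(x, y, bandwidths):
    """Average of Gaussian kernels over several bandwidths (scales)."""
    return sum(gaussian_kernel(x, y, b) for b in bandwidths) / len(bandwidths)

def mmd_squared(xs, ys, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of squared MMD between samples xs and ys.

    The bandwidths are fixed here (an assumption); an adaptive variant
    would choose or learn them from the data.
    """
    def mean_kernel(a, b):
        total = sum(multi_scale_kernel(u, v, bandwidths) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy "visible" features and two "infrared" sets at different offsets.
visible = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
infrared_near = [(x + 0.5, y + 0.5) for (x, y) in visible]
infrared_far = [(x + 3.0, y + 3.0) for (x, y) in visible]
gap_near = mmd_squared(visible, infrared_near)
gap_far = mmd_squared(visible, infrared_far)
```

Minimising such a term during training pulls the two feature distributions together; using several bandwidths makes the comparison sensitive to both fine and coarse differences.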

[CV-14] A Review on Generative AI For Text-To-Image and Image-To-Image Generation and Implications To Scientific Images

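Summary: This paper reviews recent progress in text-to-image and image-to-image generation within generative AI and gives a comparative analysis of three prominent architectures: Variational Autoencoders, Generative Adversarial Networks, and Diffusion Models. For each architecture it explains the core concepts, architectural innovations, and practical strengths and limitations for scientific image understanding, and it closes with open challenges and future research directions in this fast-moving field.

Link: https://arxiv.org/abs/2502.21151
Authors: Zineb Sordo, Eric Chagnon, Daniela Ushizima
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: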

Abstract:This review surveys the state-of-the-art in text-to-image and image-to-image generation within the scope of generative AI. We provide a comparative analysis of three prominent architectures: Variational Autoencoders, Generative Adversarial Networks and Diffusion Models. For each, we elucidate core concepts, architectural innovations, and practical strengths and limitations, particularly for scientific image understanding. Finally, we discuss critical open challenges and potential future research directions in this rapidly evolving field.

[CV-15] Same accuracy twice as fast: continuous training surpasses retraining from scratch

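Summary: This paper asks how, in a continual learning setting where both old and new data are available, to match or exceed the performance of training from scratch at a much lower computational cost. The key is to reuse the previously trained model and the old data instead of retraining from scratch. The authors propose an evaluation framework that quantifies the computational savings of such methods and identify four optimization aspects, each of which can reduce cost: initialization, regularization, data selection, and hyper-parameters. Combining these methods yields up to 2.7x reductions in computation time across various computer vision tasks.

Link: https://arxiv.org/abs/2502.21147
Authors: Eli Verwimp, Guy Hacohen, Tinne Tuytelaars
Affiliations: KU Leuven
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: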

Abstract:Continual learning aims to enable models to adapt to new datasets without losing performance on previously learned data, often assuming that prior data is no longer available. However, in many practical scenarios, both old and new data are accessible. In such cases, good performance on both datasets is typically achieved by abandoning the model trained on the previous data and re-training a new model from scratch on both datasets. This training from scratch is computationally expensive. In contrast, methods that leverage the previously trained model and old data are worthy of investigation, as they could significantly reduce computational costs. Our evaluation framework quantifies the computational savings of such methods while maintaining or exceeding the performance of training from scratch. We identify key optimization aspects – initialization, regularization, data selection, and hyper-parameters – that can each contribute to reducing computational costs. For each aspect, we propose effective first-step methods that already yield substantial computational savings. By combining these methods, we achieve up to 2.7x reductions in computation time across various computer vision tasks, highlighting the potential for further advancements in this area.

[CV-16] Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning CVPR2025

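Summary: This paper addresses the high inference cost of multi-instance learning (MIL) for pathological image classification, which stems from processing numerous patches from gigapixel whole slide images (WSIs). It proposes HDMIL, a hierarchical distillation multi-instance learning framework that achieves fast and accurate classification by eliminating irrelevant patches. HDMIL has two key components: the dynamic multi-instance network (DMIN), which operates on high-resolution WSIs, and the lightweight instance pre-screening network (LIPN), which operates on the corresponding low-resolution WSIs. During training, DMIN is trained for WSI classification while generating attention-score-based masks that mark irrelevant patches; these masks then guide LIPN to predict the relevance of each low-resolution patch. During testing, LIPN first identifies the useful regions in the low-resolution WSI, which indirectly eliminates irrelevant regions in the high-resolution WSI and reduces inference time without degrading performance. The paper also designs the first Chebyshev-polynomials-based Kolmogorov-Arnold classifier in computational pathology, which further improves HDMIL through learnable activation layers. On the Camelyon16 dataset, HDMIL improves AUC by 3.13% over previous state-of-the-art methods while reducing inference time by 28.6%.

Link: https://arxiv.org/abs/2502.21130
Authors: Jiuyang Dong, Junjun Jiang, Kui Jiang, Jiahan Li, Yongbing Zhang
Affiliations: Harbin Institute of Technology; Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures, accepted by CVPR2025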

Abstract:Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs due to processing numerous patches from gigapixel whole slide images (WSIs). To address this, we propose HDMIL, a hierarchical distillation multi-instance learning framework that achieves fast and accurate classification by eliminating irrelevant patches. HDMIL consists of two key components: the dynamic multi-instance network (DMIN) and the lightweight instance pre-screening network (LIPN). DMIN operates on high-resolution WSIs, while LIPN operates on the corresponding low-resolution counterparts. During training, DMIN are trained for WSI classification while generating attention-score-based masks that indicate irrelevant patches. These masks then guide the training of LIPN to predict the relevance of each low-resolution patch. During testing, LIPN first determines the useful regions within low-resolution WSIs, which indirectly enables us to eliminate irrelevant regions in high-resolution WSIs, thereby reducing inference time without causing performance degradation. In addition, we further design the first Chebyshev-polynomials-based Kolmogorov-Arnold classifier in computational pathology, which enhances the performance of HDMIL through learnable activation layers. Extensive experiments on three public datasets demonstrate that HDMIL outperforms previous state-of-the-art methods, e.g., achieving improvements of 3.13% in AUC while reducing inference time by 28.6% on the Camelyon16 dataset.
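
The idea of turning attention scores into a patch mask and then pruning patches can be sketched as follows. The top-fraction rule, the `keep_ratio` parameter, and the function names are hypothetical choices for illustration; the abstract does not state how HDMIL thresholds its attention scores.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of attention logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relevance_mask(attention_logits, keep_ratio=0.5):
    """Return 1 for patches in the top keep_ratio fraction by attention, else 0.

    keep_ratio is an assumed hyper-parameter, not a value from the paper.
    """
    weights = softmax(attention_logits)
    k = max(1, round(keep_ratio * len(weights)))
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(weights))]

def prune_patches(patches, mask):
    """Keep only the patches that the mask marks as relevant."""
    return [p for p, m in zip(patches, mask) if m]

# Toy slide with four patches; a pre-screening model would predict this mask
# from the low-resolution slide so the discarded patches are never processed
# at high resolution.
patch_ids = ["p0", "p1", "p2", "p3"]
mask = relevance_mask([0.1, 2.0, -1.0, 1.5], keep_ratio=0.5)
kept = prune_patches(patch_ids, mask)
```

In the two-network setup described above, masks like this would come from the high-resolution network during training and serve as targets for the lightweight low-resolution network.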

[CV-17] SEE: See Everything Every Time – Adaptive Brightness Adjustment for Broad Light Range Images via Events

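Summary: This paper asks how event data can be used to enhance images and adaptively adjust their brightness across a broad range of lighting conditions. Prior work has focused mainly on low-light image enhancement and neglected normal or high illumination. The authors collect a new dataset, SEE-600K, with 610,126 images and corresponding events across 202 scenarios under varied lighting. The key contribution is a framework that uses events to adjust image brightness smoothly via prompts: it captures color through sensor patterns, uses cross-attention to model events as a brightness dictionary, adjusts the image's dynamic range to form a broad light-range representation (BLR), and decodes it at the pixel level based on the brightness prompt. Experiments show that the method performs well on low-light enhancement and is robust on broad light-range enhancement with SEE-600K, and that it enables pixel-level brightness adjustment, which offers flexibility for post-processing.

Link: https://arxiv.org/abs/2502.21120
Authors: Yunfan Lu, Xiaogang Xu, Hao Lu, Yanlin Qian, Pengteng Li, Huizai Yao, Bin Yang, Junyi Li, Qianyi Cai, Weiyu Guo, Hui Xiong
Affiliations: AI Thrust, HKUST(GZ); CUHK; DJI / Hasselblad; Aalborg University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: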

Abstract:Event cameras, with a high dynamic range exceeding 120 dB, significantly outperform traditional embedded cameras, robustly recording detailed changing information under various lighting conditions, including both low- and high-light situations. However, recent research on utilizing event data has primarily focused on low-light image enhancement, neglecting image enhancement and brightness adjustment across a broader range of lighting conditions, such as normal or high illumination. Based on this, we propose a novel research question: how to employ events to enhance and adaptively adjust the brightness of images captured under broad lighting conditions? To investigate this question, we first collected a new dataset, SEE-600K, consisting of 610,126 images and corresponding events across 202 scenarios, each featuring an average of four lighting conditions with over a 1000-fold variation in illumination. Subsequently, we propose a framework that effectively utilizes events to smoothly adjust image brightness through the use of prompts. Our framework captures color through sensor patterns, uses cross-attention to model events as a brightness dictionary, and adjusts the image’s dynamic range to form a broad light-range representation (BLR), which is then decoded at the pixel level based on the brightness prompt. Experimental results demonstrate that our method not only performs well on the low-light enhancement dataset but also shows robust performance on broader light-range image enhancement using the SEE-600K dataset. Additionally, our approach enables pixel-level brightness adjustment, providing flexibility for post-processing and inspiring more imaging applications. The dataset and source code are publicly available at: this https URL.

[CV-18] FlexDrive: Toward Trajectory Flexibility in Driving Scene Reconstruction and Rendering

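Summary: This paper addresses the poor quality of 3D Gaussian Splatting based driving scene reconstruction and rendering at out-of-path viewpoints, which is caused by the lack of high-quality supervision for those views. The key contribution is an Inverse View Warping technique that creates compact, high-quality images as supervision for out-of-path views, enabling high-quality rendering there. For accurate and robust warping, a depth bootstrap strategy produces dense depth maps on the fly during optimization, overcoming the sparsity and incompleteness of LiDAR depth data. The method achieves superior in-path and out-of-path reconstruction and rendering on the Waymo Open dataset, and a simulator-based benchmark with out-of-path ground truth confirms a significant margin over previous methods.

Link: https://arxiv.org/abs/2502.21093
Authors: Jingqiu Zhou, Lue Fan, Linjiang Huang, Xiaoyu Shi, Si Liu, Zhaoxiang Zhang, Hongsheng Li
Affiliations: Multimedia Laboratory, The Chinese University of Hong Kong; Centre for Perceptual and Interactive Intelligence, Hong Kong; Institute of Automation, Chinese Academy of Sciences; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: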

Abstract:Driving scene reconstruction and rendering have advanced significantly using the 3D Gaussian Splatting. However, most prior research has focused on the rendering quality along a pre-recorded vehicle path and struggles to generalize to out-of-path viewpoints, which is caused by the lack of high-quality supervision in those out-of-path views. To address this issue, we introduce an Inverse View Warping technique to create compact and high-quality images as supervision for the reconstruction of the out-of-path views, enabling high-quality rendering results for those views. For accurate and robust inverse view warping, a depth bootstrap strategy is proposed to obtain on-the-fly dense depth maps during the optimization process, overcoming the sparsity and incompleteness of LiDAR depth data. Our method achieves superior in-path and out-of-path reconstruction and rendering performance on the widely used Waymo Open dataset. In addition, a simulator-based benchmark is proposed to obtain the out-of-path ground truth and quantitatively evaluate the performance of out-of-path rendering, where our method outperforms previous methods by a significant margin.

[CV-19] BST: Badminton Stroke-type Transformer for Skeleton-based Action Recognition in Racket Sports

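Summary: This paper addresses player stroke-type classification in badminton; the sport poses several computer vision challenges, including player identification, court line detection, shuttlecock trajectory tracking, and stroke-type classification. The key contribution is a video segmentation strategy that extracts the frames of each player's racket swing from a broadcast match. The segmented frames are processed by two existing models: a human pose estimation model that yields the players' skeletal joints, and a shuttlecock trajectory detection model. Using these joints, trajectories, and player positions as inputs, the Badminton Stroke-type Transformer (BST) classifies stroke types in singles matches. The method outperforms the previous state of the art on ShuttleSet, the largest publicly available badminton video dataset, suggesting that effective use of ball trajectory is likely to be a trend in racket sports action recognition.

Link: https://arxiv.org/abs/2502.21085
Authors: Jing-Yuan Chang
Affiliations: National Tsing Hua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages (excluding references). The code will be released in a few months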

Abstract:Badminton, known for having the fastest ball speeds among all sports, presents significant challenges to the field of computer vision, including player identification, court line detection, shuttlecock trajectory tracking, and player stroke-type classification. In this paper, we introduce a novel video segmentation strategy to extract frames of each player’s racket swing in a badminton broadcast match. These segmented frames are then processed by two existing models: one for Human Pose Estimation to obtain player skeletal joints, and the other for shuttlecock trajectory detection to extract shuttlecock trajectories. Leveraging these joints, trajectories, and player positions as inputs, we propose Badminton Stroke-type Transformer (BST) to classify player stroke-types in singles. To the best of our knowledge, experimental results demonstrate that our method outperforms the previous state-of-the-art on the largest publicly available badminton video dataset, ShuttleSet, which shows that effectively leveraging ball trajectory is likely to be a trend for racket sports action recognition.

[CV-20] Training-free and Adaptive Sparse Attention for Efficient Long Video Generation

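Summary: This paper addresses the significant latency of generating high-fidelity long videos with Diffusion Transformers (DiTs), which is driven mainly by the computational cost of attention. For example, generating an 8-second 720p video (110K tokens) with HunyuanVideo takes about 600 PFLOPs, of which about 500 PFLOPs go to attention. The paper proposes AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. For the dynamic pattern, a blockified pattern captures the hierarchical sparsity inherent in DiTs while keeping the generated videos high-fidelity. For online precise search, a Fused LSE-Cached Search with Head-adaptive Hierarchical Block Sparse Attention exploits the invariance of DiTs' sparse patterns across denoising steps to identify sparse indices precisely and in real time with minimal overhead. AdaSpa is an adaptive, plug-and-play solution that integrates with existing DiTs and needs neither additional fine-tuning nor dataset-dependent profiling. Experiments show substantial acceleration across various models while preserving video quality.

Link: https://arxiv.org/abs/2502.21079
Authors: Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, Bin Cui
Affiliations: Peking University; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: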

Abstract:Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency, primarily due to the computational demands of attention mechanisms. For instance, generating an 8-second 720p video (110K tokens) with HunyuanVideo takes about 600 PFLOPs, with around 500 PFLOPs consumed by attention computations. To address this issue, we propose AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. Firstly, to realize the Dynamic Pattern, we introduce a blockified pattern to efficiently capture the hierarchical sparsity inherent in DiTs. This is based on our observation that sparse characteristics of DiTs exhibit hierarchical and blockified structures between and within different modalities. This blockified approach significantly reduces the complexity of attention computation while maintaining high fidelity in the generated videos. Secondly, to enable Online Precise Search, we propose the Fused LSE-Cached Search with Head-adaptive Hierarchical Block Sparse Attention. This method is motivated by our finding that DiTs’ sparse pattern and LSE vary w.r.t. inputs, layers, and heads, but remain invariant across denoising steps. By leveraging this invariance across denoising steps, it adapts to the dynamic nature of DiTs and allows for precise, real-time identification of sparse indices with minimal overhead. AdaSpa is implemented as an adaptive, plug-and-play solution and can be integrated seamlessly with existing DiTs, requiring neither additional fine-tuning nor a dataset-dependent profiling. Extensive experiments validate that AdaSpa delivers substantial acceleration across various models while preserving video quality, establishing itself as a robust and scalable approach to efficient video generation.

[CV-21] Enhancing deep neural networks through complex-valued representations and Kuramoto synchronization dynamics

Summary: This paper addresses the weakness of deep learning models at object binding, that is, at representing multiple objects in a complex visual scene. It proposes using synchrony-based neural mechanisms to improve object encoding in artificial models. The key is to combine complex-valued representations with Kuramoto dynamics, which promote phase alignment and so group the features that belong to the same object; a recurrent model with feedback connections additionally uses top-down information to refine phase synchronization. Experiments show that synchrony-based mechanisms markedly improve performance, robustness, and generalization on multi-object image tasks.

Link: https://arxiv.org/abs/2502.21077
Authors: Sabine Muzellec, Andrea Alamia, Thomas Serre, Rufin VanRullen
Affiliations: CerCo - CNRS, University of Toulouse, France; Carney Institute for Brain Science, Brown University, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Neural synchrony is hypothesized to play a crucial role in how the brain organizes visual scenes into structured representations, enabling the robust encoding of multiple objects within a scene. However, current deep learning models often struggle with object binding, limiting their ability to represent multiple objects effectively. Inspired by neuroscience, we investigate whether synchrony-based mechanisms can enhance object encoding in artificial models trained for visual categorization. Specifically, we combine complex-valued representations with Kuramoto dynamics to promote phase alignment, facilitating the grouping of features belonging to the same object. We evaluate two architectures employing synchrony: a feedforward model and a recurrent model with feedback connections to refine phase synchronization using top-down information. Both models outperform their real-valued counterparts and complex-valued models without Kuramoto synchronization on tasks involving multi-object images, such as overlapping handwritten digits, noisy inputs, and out-of-distribution transformations. Our findings highlight the potential of synchrony-driven mechanisms to enhance deep learning models, improving their performance, robustness, and generalization in complex visual categorization tasks.
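
As a rough illustration of how Kuramoto dynamics produce phase alignment within a group, the sketch below applies Euler steps of the classic Kuramoto update to a handful of phases, which can be read as the arguments of complex-valued activations. The coupling matrix, step size, and zero natural frequencies are assumptions made for this toy example; the abstract does not describe how the paper derives its coupling.

```python
import math

def kuramoto_step(phases, coupling, dt=0.1, strength=1.0):
    """One Euler step: d(theta_i)/dt = (strength / N) * sum_j w_ij * sin(theta_j - theta_i).

    Natural frequencies are set to zero (an assumption for this sketch).
    """
    n = len(phases)
    updated = []
    for i in range(n):
        pull = sum(coupling[i][j] * math.sin(phases[j] - phases[i]) for j in range(n))
        updated.append(phases[i] + dt * strength * pull / n)
    return updated

def order_parameter(phases):
    """Magnitude of the mean unit phasor; 1.0 means the phases are identical."""
    re = sum(math.cos(p) for p in phases) / len(phases)
    im = sum(math.sin(p) for p in phases) / len(phases)
    return math.hypot(re, im)

# Six units forming two hypothetical objects: units are coupled only to
# units of the same object, so each object should settle on its own phase.
phases = [0.0, 0.6, 1.2, 3.0, 3.5, 4.0]
groups = [0, 0, 0, 1, 1, 1]
coupling = [[1.0 if groups[i] == groups[j] else 0.0 for j in range(6)]
            for i in range(6)]

for _ in range(500):
    phases = kuramoto_step(phases, coupling, dt=0.1, strength=2.0)

sync_first = order_parameter(phases[:3])
sync_second = order_parameter(phases[3:])
```

After the loop, units within each group share a phase while the two groups stay apart, which is the grouping behaviour that synchrony-based binding relies on.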

[CV-22] Spatial Reasoning with Denoising Models

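Summary: This paper addresses the tendency of current generative models on spatial domains, such as diffusion and flow matching models, to hallucinate when distributions are complex. To measure this, the authors introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and quantify hallucination. The key contribution is Spatial Reasoning Models (SRMs), a framework for reasoning over sets of continuous variables via denoising generative models; SRMs infer continuous representations of unobserved variables given observations of observed ones. The study reports the importance of sequentialization in generation, of the associated order, and of the sampling strategies used during training, and shows for the first time that the denoising network itself can predict the order of generation. Using these findings, accuracy on specific reasoning tasks rises from 1% to 50%.

Link: https://arxiv.org/abs/2502.21075
Authors: Christopher Wewer, Bart Pogodzinski, Bernt Schiele, Jan Eric Lenssen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project website: this https URL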

Abstract:We introduce Spatial Reasoning Models (SRMs), a framework to perform reasoning over sets of continuous variables via denoising generative models. SRMs infer continuous representations on a set of unobserved variables, given observations on observed variables. Current generative models on spatial domains, such as diffusion and flow matching models, often collapse to hallucination in case of complex distributions. To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and can quantify hallucination. The SRM framework allows to report key findings about importance of sequentialization in generation, the associated order, as well as the sampling strategies during training. It demonstrates, for the first time, that order of generation can successfully be predicted by the denoising network itself. Using these findings, we can increase the accuracy of specific reasoning tasks from 1% to 50%.

[CV-23] Fast 3D point clouds retrieval for Large-scale 3D Place Recognition

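【Quick Read】: This paper tackles the efficiency and accuracy of 3D point cloud retrieval, that is, quickly and accurately finding the point clouds most similar to a query within a large reference set. Current methods identify similar point clouds by comparing descriptors, a step that is computationally complex. The key idea is to adapt the Differentiable Search Index (DSI), a transformer-based approach originally designed for text information retrieval, to 3D point cloud retrieval: the method generates 1D identifiers from the point cloud descriptors, enabling direct retrieval in constant time. To adapt DSI to 3D data, it uses Vision Transformers to map descriptors to these identifiers and incorporates positional and semantic encoding. The method is evaluated for place recognition on a public benchmark, where its retrieval quality and speed are compared against state-of-the-art methods.

Link: https://arxiv.org/abs/2502.21067
Authors: Chahine-Nicolas Zede, Laurent Carrafa, Valérie Gouet-Brunet
Affiliations: LASTIG, IGN-ENSG, Gustave Eiffel University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 8 pages, 1 figure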

Abstract:Retrieval in 3D point clouds is a challenging task that consists in retrieving the most similar point clouds to a given query within a reference of 3D points. Current methods focus on comparing descriptors of point clouds in order to identify similar ones. Due to the complexity of this latter step, here we focus on the acceleration of the retrieval by adapting the Differentiable Search Index (DSI), a transformer-based approach initially designed for text information retrieval, for 3D point clouds retrieval. Our approach generates 1D identifiers based on the point descriptors, enabling direct retrieval in constant time. To adapt DSI to 3D data, we integrate Vision Transformers to map descriptors to these identifiers while incorporating positional and semantic encoding. The approach is evaluated for place recognition on a public benchmark comparing its retrieval capabilities against state-of-the-art methods, in terms of quality and speed of returned point clouds.

[CV-24] FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts

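【Quick Read】: This paper studies the safety of Large Vision-Language Models (LVLMs) under multimodal jailbreak attacks, in particular the vulnerability that remains in the visual modality. It proposes FC-Attack, a jailbreak method based on auto-generated flowcharts that induces LVLMs to produce harmful content. The key observation is that flowcharts containing partially harmful information can lead an LVLM to supply additional harmful details. FC-Attack fine-tunes a pre-trained LLM on benign datasets to build a step-description generator, uses it to produce step descriptions for a harmful query, renders them as flowcharts in three shapes (vertical, horizontal, and S-shaped) that serve as visual prompts, and combines these with a benign textual prompt. Experiments show attack success rates above 90% on several models, outperforming existing LVLM jailbreak methods. The paper also examines factors that affect attack performance and explores corresponding defenses.

Link: https://arxiv.org/abs/2502.21059
Authors: Ziyi Zhang, Zhen Sun, Zongmin Zhang, Jihui Guo, Xinlei He
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 13 pages, 6 figures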

Abstract:Large Vision-Language Models (LVLMs) have become powerful and widely adopted in some practical applications. However, recent research has revealed their vulnerability to multimodal jailbreak attacks, whereby the model can be induced to generate harmful content, leading to safety risks. Although most LVLMs have undergone safety alignment, recent research shows that the visual modality is still vulnerable to jailbreak attacks. In our work, we discover that by using flowcharts with partially harmful information, LVLMs can be induced to provide additional harmful details. Based on this, we propose a jailbreak attack method based on auto-generated flowcharts, FC-Attack. Specifically, FC-Attack first fine-tunes a pre-trained LLM to create a step-description generator based on benign datasets. The generator is then used to produce step descriptions corresponding to a harmful query, which are transformed into flowcharts in 3 different shapes (vertical, horizontal, and S-shaped) as visual prompts. These flowcharts are then combined with a benign textual prompt to execute a jailbreak attack on LVLMs. Our evaluations using the Advbench dataset show that FC-Attack achieves over 90% attack success rates on Gemini-1.5, Llava-Next, Qwen2-VL, and InternVL-2.5 models, outperforming existing LVLM jailbreak methods. Additionally, we investigate factors affecting the attack performance, including the number of steps and the font styles in the flowcharts. Our evaluation shows that FC-Attack can improve the jailbreak performance from 4% to 28% in Claude-3.5 by changing the font style. To mitigate the attack, we explore several defenses and find that AdaShield can largely reduce the jailbreak performance but with the cost of utility drop.

[CV-25] HoloMine: A Synthetic Dataset for Buried Landmines Recognition using Microwave Holographic Imaging

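【Quick Read】: This paper addresses the complex, high-risk task of landmine detection and removal by proposing a novel synthetic dataset that gives researchers a resource to observe, measure, locate, and address issues in landmine detection. The dataset, collected with a microwave holography sensor, contains 2D microwave holographic images and their 3D holographic inverted scans of different types of buried objects, including landmines, clutter, and pottery, and the authors evaluate several state-of-the-art deep learning models on it for various classification tasks. Although the results do not yet reach high performance, the authors believe that the accuracy and resolution obtainable with holographic radars give the dataset significant potential to advance the field. To the best of the authors' knowledge it is the first dataset of its kind, and it is intended to support research on computer vision methods that automate mine detection and thereby reduce the risk and cost of demining.

Link: https://arxiv.org/abs/2502.21054
Authors: Emanuele Vivoli, Lorenzo Capineri, Marco Bertini
Affiliations: Media Integration and Communication Center, Viale Giovanni Battista Morgagni, 65, 50134 Firenze FI, Italy; Ultrasound and Non-Destructive Testing Laboratory, University of Florence, Italy
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: under review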

Abstract:The detection and removal of landmines is a complex and risky task that requires advanced remote sensing techniques to reduce the risk for the professionals involved in this task. In this paper, we propose a novel synthetic dataset for buried landmine detection to provide researchers with a valuable resource to observe, measure, locate, and address issues in landmine detection. The dataset consists of 41,800 microwave holographic images (2D) and their holographic inverted scans (3D) of different types of buried objects, including landmines, clutter, and pottery objects, and is collected by means of a microwave holography sensor. We evaluate the performance of several state-of-the-art deep learning models trained on our synthetic dataset for various classification tasks. While the results do not yield yet high performances, showing the difficulty of the proposed task, we believe that our dataset has significant potential to drive progress in the field of landmine detection thanks to the accuracy and resolution obtainable using holographic radars. To the best of our knowledge, our dataset is the first of its kind and will help drive further research on computer vision methods to automatize mine detection, with the overall goal of reducing the risks and the costs of the demining process.

[CV-26] Synthesizing Individualized Aging Brains in Health and Disease with Generative Models and Parallel Transport

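【Quick Read】: This paper addresses the simulation of future magnetic resonance imaging (MRI) scans from a single brain image of an individual, a task that must account for canonical changes due to aging and disease progression while respecting the individual's current state and unique characteristics. Existing deep generative models can produce high-resolution, anatomically accurate templates for population-wide studies, but their ability to predict individual aging trajectories is limited, particularly in capturing subject-specific neuroanatomical variation over time. To address this, the study proposes Individualized Brain Synthesis (InBrainSyn), a framework for synthesizing high-resolution, subject-specific longitudinal T1-weighted (T1w) MRI scans that simulate neurodegeneration in both Alzheimer's disease (AD) and normal aging.

The key innovation of InBrainSyn is a parallel transport algorithm that adapts the population-level aging trajectories learned by a generative deep template network to individualized aging synthesis. Because aging is simulated with diffeomorphic transformations, the synthesized images are topologically consistent with the original anatomy by design. InBrainSyn was evaluated quantitatively and qualitatively on AD and healthy control cohorts from the Open Access Series of Imaging Studies - version 3 dataset; it can also model neuroanatomical transitions between normal aging and AD, and an evaluation on an external set supports its generalizability. With only a single baseline scan, InBrainSyn synthesizes realistic 3D spatiotemporal T1w MRI scans that provide personalized longitudinal aging trajectories.

Link: https://arxiv.org/abs/2502.21049
Authors: Jingru Fu, Yuqi Zheng, Neel Dey, Daniel Ferreira, Rodrigo Moreno
Affiliations: KTH Royal Institute of Technology; Karolinska Institute; Universidad Fernando Pessoa Canarias; Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology; MedTechLabs, BioClinicum, Karolinska University Hospital
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 20 pages, 9 figures, 6 tables, diffeomorphic registration, parallel transport, brain aging, medical image generation, Alzheimer’s disease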

Abstract:Simulating prospective magnetic resonance imaging (MRI) scans from a given individual brain image is challenging, as it requires accounting for canonical changes in aging and/or disease progression while also considering the individual brain’s current status and unique characteristics. While current deep generative models can produce high-resolution anatomically accurate templates for population-wide studies, their ability to predict future aging trajectories for individuals remains limited, particularly in capturing subject-specific neuroanatomical variations over time. In this study, we introduce Individualized Brain Synthesis (InBrainSyn), a framework for synthesizing high-resolution subject-specific longitudinal MRI scans that simulate neurodegeneration in both Alzheimer’s disease (AD) and normal aging. InBrainSyn uses a parallel transport algorithm to adapt the population-level aging trajectories learned by a generative deep template network, enabling individualized aging synthesis. As InBrainSyn uses diffeomorphic transformations to simulate aging, the synthesized images are topologically consistent with the original anatomy by design. We evaluated InBrainSyn both quantitatively and qualitatively on AD and healthy control cohorts from the Open Access Series of Imaging Studies - version 3 dataset. Experimentally, InBrainSyn can also model neuroanatomical transitions between normal aging and AD. An evaluation of an external set supports its generalizability. Overall, with only a single baseline scan, InBrainSyn synthesizes realistic 3D spatiotemporal T1w MRI scans, producing personalized longitudinal aging trajectories. The code for InBrainSyn is available at: this https URL.

[CV-27] Data-free Universal Adversarial Perturbation with Pseudo-semantic Prior CVPR2025

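【Quick Read】: This paper addresses the limited transferability of traditional data-free Universal Adversarial Perturbation (UAP) methods, which stems from the absence of semantic information in random noise. The key to the solution is a novel data-free universal attack that recursively generates a pseudo-semantic prior from the UAPs themselves, enriching the semantic content available within the data-free UAP framework. The method builds on the observation that UAPs inherently contain latent semantic information, captures a diverse range of semantics through region sampling, introduces sample reweighting to emphasize hard examples that are less affected by the UAP, and incorporates input transformations to boost black-box transferability. Experiments on ImageNet show state-of-the-art average fooling rates by a substantial margin, significantly improved attack transferability across CNN architectures compared with existing data-free UAP methods, and results that even surpass data-dependent UAP methods.

Link: https://arxiv.org/abs/2502.21048
Authors: Chanhui Lee, Yeonghwan Song, Jeany Son
Affiliations: AI Graduate School, GIST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025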

Abstract:Data-free Universal Adversarial Perturbation (UAP) is an image-agnostic adversarial attack that deceives deep neural networks using a single perturbation generated solely from random noise, without any data priors. However, traditional data-free UAP methods often suffer from limited transferability due to the absence of semantic information in random noise. To address this, we propose a novel data-free universal attack approach that generates a pseudo-semantic prior recursively from the UAPs, enriching semantic contents within the data-free UAP framework. Our method is based on the observation that UAPs inherently contain latent semantic information, enabling the generated UAP to act as an alternative data prior, by capturing a diverse range of semantics through region sampling. We further introduce a sample reweighting technique to emphasize hard examples by focusing on samples that are less affected by the UAP. By leveraging the semantic information from the pseudo-semantic prior, we also incorporate input transformations, typically ineffective in data-free UAPs due to the lack of semantic content in random priors, to boost black-box transferability. Comprehensive experiments on ImageNet show that our method achieves state-of-the-art performance in average fooling rate by a substantial margin, significantly improves attack transferability across various CNN architectures compared to existing data-free UAP methods, and even surpasses data-dependent UAP methods.

[CV-28] When Unsupervised Domain Adaptation meets One-class Anomaly Detection: Addressing the Two-fold Unsupervised Curse by Leveraging Anomaly Scarcity

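【Quick Read】: This paper addresses unsupervised domain adaptation (UDA) for unsupervised anomaly detection (UAD), whose performance degrades significantly under domain shift. Unlike in binary and multi-class classification, a UDA strategy is ill-posed in UAD, which may be explained by the fact that both tasks, domain adaptation and anomaly detection, are unsupervised. The authors call this problem the "two-fold unsupervised curse". Their solution rests on the assumption that anomalies are rare: clustering is used to identify a dominant cluster in the target feature space, which is treated as the normal cluster. The one-class normal source features are fitted within a hypersphere while being jointly aligned with the features of the target's dominant cluster, which adapts the features across the source and target domains. Experiments on common adaptation benchmarks for anomaly detection demonstrate the relevance of both the new paradigm and the proposed approach.

Link: https://arxiv.org/abs/2502.21022
Authors: Nesryne Mejri, Enjie Ghorbel, Anis Kacem, Pavel Chernakov, Niki Foteinopoulou, Djamila Aouada
Affiliations: Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg; Cristal Laboratory, National School of Computer Sciences, University of Manouba
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: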

Abstract:This paper introduces the first fully unsupervised domain adaptation (UDA) framework for unsupervised anomaly detection (UAD). The performance of UAD techniques degrades significantly in the presence of a domain shift, difficult to avoid in a real-world setting. While UDA has contributed to solving this issue in binary and multi-class classification, such a strategy is ill-posed in UAD. This might be explained by the unsupervised nature of the two tasks, namely, domain adaptation and anomaly detection. Herein, we first formulate this problem that we call the two-fold unsupervised curse. Then, we propose a pioneering solution to this curse, considered intractable so far, by assuming that anomalies are rare. Specifically, we leverage clustering techniques to identify a dominant cluster in the target feature space. Posed as the normal cluster, the latter is aligned with the source normal features. Concretely, given a one-class source set and an unlabeled target set composed mostly of normal data and some anomalies, we fit the source features within a hypersphere while jointly aligning them with the features of the dominant cluster from the target set. The paper provides extensive experiments and analysis on common adaptation benchmarks for anomaly detection, demonstrating the relevance of both the newly introduced paradigm and the proposed approach. The code will be made publicly available.
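
The "dominant cluster" step described in the abstract can be illustrated with a small NumPy sketch. This is only a toy under stated assumptions, not the authors' implementation: the 2D synthetic features, the choice of k = 2, the plain Lloyd's k-means with farthest-point initialization, and the helper names `kmeans` and `dominant_cluster_mask` are all hypothetical, and the hypersphere fitting and feature alignment are left out.

```python
import numpy as np

def kmeans(x, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [x[0]]
    for _ in range(k - 1):
        # Distance of every point to its closest already-chosen center.
        dist = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(dist)])
    centers = np.stack(centers)
    for _ in range(iters):
        # Assign each point to the nearest center, then recompute the centers.
        labels = np.argmin(np.linalg.norm(x[:, None] - centers[None], axis=2), axis=1)
        centers = np.stack([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels

def dominant_cluster_mask(features, k=2):
    """Boolean mask of the samples in the largest cluster (assumed to be normal)."""
    labels = kmeans(features, k)
    dominant = np.bincount(labels, minlength=k).argmax()
    return labels == dominant

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.3, size=(95, 2))    # mostly normal target data
anomalies = rng.normal(5.0, 0.3, size=(5, 2))  # rare anomalies
target_features = np.concatenate([normal, anomalies])

mask = dominant_cluster_mask(target_features)
print(mask.sum(), "of", len(mask), "target samples fall in the dominant cluster")
```

The sketch relies on the same assumption the paper states, namely that anomalies are rare, so the largest cluster can stand in for the normal class.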

[CV-29] FedDyMem: Efficient Federated Learning with Dynamic Memory and Memory-Reduce for Unsupervised Image Anomaly Detection

【Quick Read】: This paper addresses the challenges that data privacy protection poses for unsupervised image anomaly detection (UAD) in federated learning. Because all client data belongs to a single class (normal samples) and product variations across and within clients introduce distribution biases, it is difficult to share knowledge effectively while preserving privacy. The proposed method, FedDyMem, is an efficient federated learning approach built on dynamic memory and memory-reduce. Knowledge is shared between client and server through each client's dynamic memory bank rather than through model parameters. On the clients, a memory generator and a metric loss improve the consistency of the feature distribution of normal samples; a memory-reduce method based on weighted averages shrinks the memory banks for efficient communication; and on the server, a global memory is built through k-means aggregation and distributed to the clients.

Link: https://arxiv.org/abs/2502.21012
Authors: Silin Chen, Kangjian Di, Yichu Xu, Han-Jia Ye, Wenhan Luo, Ningmu Zou
Affiliations: School of Integrated Circuits, Nanjing University, Suzhou, China; National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China; School of Artificial Intelligence, Nanjing University, Nanjing, 210023, China; Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong; Interdisciplinary Research Center for Future Intelligent Chips (Chip-X), Nanjing University, Suzhou, China
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unsupervised image anomaly detection (UAD) has become a critical process in industrial and medical applications, but it faces growing challenges due to increasing concerns over data privacy. The limited class diversity inherent to one-class classification tasks, combined with distribution biases caused by variations in products across and within clients, poses significant challenges for preserving data privacy with federated UAD. Thus, this article proposes an efficient federated learning method with dynamic memory and memory-reduce for unsupervised image anomaly detection, called FedDyMem. Considering all client data belongs to a single class (i.e., normal sample) in UAD and the distribution of intra-class features demonstrates significant skewness, FedDyMem facilitates knowledge sharing between the client and server through the client’s dynamic memory bank instead of model parameters. In the local clients, a memory generator and a metric loss are employed to improve the consistency of the feature distribution for normal samples, leveraging the local model to update the memory bank dynamically. For efficient communication, a memory-reduce method based on weighted averages is proposed to significantly decrease the scale of memory banks. On the server, global memory is constructed and distributed to individual clients through k-means aggregation. Experiments conducted on six industrial and medical datasets, comprising a mixture of six products or health screening types derived from eleven public datasets, demonstrate the effectiveness of FedDyMem.

[CV-30] MagNet: Multi-Level Attention Graph Network for Predicting High-Resolution Spatial Transcriptomics

【Quick Read】: This paper addresses the "information bottleneck" that existing spatial transcriptomics (ST) prediction methods face on high-resolution (HD) data. Current methods focus mainly on predicting gene expression at the low-resolution spot level and face significant challenges when applied to high-resolution data. The paper proposes MagNet, a multi-level attention graph network that uses cross-attention layers to hierarchically integrate features from multi-resolution image patches and a GAT-Transformer module to aggregate neighborhood information. By integrating multilevel features, MagNet overcomes the limitations that low-resolution inputs impose on predicting high-resolution gene expression.

Link: https://arxiv.org/abs/2502.21011
Authors: Junchao Zhu, Ruining Deng, Tianyuan Yao, Juming Xiong, Chongyu Qu, Junlin Guo, Siqi Lu, Yucheng Tang, Daguang Xu, Mengmeng Yin, Yu Wang, Shilin Zhao, Yaohong Wang, Haichun Yang, Yuankai Huo
Affiliations: Vanderbilt University, Nashville, TN, USA; Weill Cornell Medicine, NY, USA; NVIDIA, WA, USA; Vanderbilt University Medical Center, Nashville, TN, USA; UT MD Anderson Cancer Center, TX, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid development of spatial transcriptomics (ST) offers new opportunities to explore the gene expression patterns within the spatial microenvironment. Current research integrates pathological images to infer gene expression, addressing the high costs and time-consuming processes to generate spatial transcriptomics data. However, as spatial transcriptomics resolution continues to improve, existing methods remain primarily focused on gene expression prediction at low-resolution spot levels. These methods face significant challenges, especially the information bottleneck, when they are applied to high-resolution HD data. To bridge this gap, this paper introduces MagNet, a multi-level attention graph network designed for accurate prediction of high-resolution HD data. MagNet employs cross-attention layers to integrate features from multi-resolution image patches hierarchically and utilizes a GAT-Transformer module to aggregate neighborhood information. By integrating multilevel features, MagNet overcomes the limitations posed by low-resolution inputs in predicting high-resolution gene expression. We systematically evaluated MagNet and existing ST prediction models on both a private spatial transcriptomics dataset and a public dataset at three different resolution levels. The results demonstrate that MagNet achieves state-of-the-art performance at both spot level and high-resolution bin levels, providing a novel methodology and benchmark for future research and applications in high-resolution HD-level spatial transcriptomics. Code is available at this https URL.

[CV-31] Soften the Mask: Adaptive Temporal Soft Mask for Efficient Dynamic Facial Expression Recognition

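【Quick Read】: This paper addresses the difficulty that Dynamic Facial Expression Recognition (DFER) methods have in managing irrelevant information, such as background noise and redundant semantics, which hurts both efficiency and effectiveness. It proposes AdaTosk, a novel adaptive temporal soft masked autoencoder network that combines a parallel supervised classification branch with a self-supervised reconstruction branch. The reconstruction branch applies a random binary hard mask to generate diverse training samples, encouraging meaningful feature representations in the visible tokens, while the classification branch uses an adaptive temporal soft mask to flexibly mask visible tokens according to their temporal significance. Its two key components, the class-agnostic and class-semantic soft masks, respectively enhance critical expression moments and reduce semantic redundancy over time. Experiments show that AdaTosk significantly reduces computational cost while maintaining competitive performance.

Link: https://arxiv.org/abs/2502.21004
Authors: Mengzhu Li, Quanxing Zha, Hongjun Wu
Affiliations: Beijing Key Laboratory of Information Service Engineering, Beijing Union University; Huaqiao University; Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures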

Abstract:Dynamic Facial Expression Recognition (DFER) facilitates the understanding of psychological intentions through non-verbal communication. Existing methods struggle to manage irrelevant information, such as background noise and redundant semantics, which impacts both efficiency and effectiveness. In this work, we propose a novel supervised temporal soft masked autoencoder network for DFER, namely AdaTosk, which integrates a parallel supervised classification branch with the self-supervised reconstruction branch. The self-supervised reconstruction branch applies a random binary hard mask to generate diverse training samples, encouraging meaningful feature representations in visible tokens. Meanwhile, the classification branch employs an adaptive temporal soft mask to flexibly mask visible tokens based on their temporal significance. Its two key components, the class-agnostic and class-semantic soft masks, serve respectively to enhance critical expression moments and reduce semantic redundancy over time. Extensive experiments conducted on widely-used benchmarks demonstrate that our AdaTosk remarkably reduces computational costs compared with current state-of-the-art methods while still maintaining competitive performance.

[CV-32] Towards Lossless Implicit Neural Representation via Bit Plane Decomposition

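【Quick Read】: This paper addresses the large model size that an implicit neural representation (INR) requires at high precision: the upper bound on model size grows exponentially with the required bit-precision. The proposed bit-plane decomposition method has the INR predict bit-planes, which has the same effect as lowering the upper bound on model size while the model size itself stays constant. By turning the representation of a high-precision signal into a bit-by-bit approximation, the method achieves lossless representation in 2D image and audio fitting, even for high bit-depth signals such as 16-bit that were previously out of reach, and extends INRs to bit depth expansion, lossless image compression, and extreme network quantization.

Link: https://arxiv.org/abs/2502.21001
Authors: Woo Kyoung Han, Byeonghun Lee, Hyunmin Cho, Sunghoon Im, Kyong Hwan Jin
Affiliations: Korea University; DGIST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: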

Abstract:We quantify the upper bound on the size of the implicit neural representation (INR) model from a digital perspective. The upper bound of the model size increases exponentially as the required bit-precision increases. To this end, we present a bit-plane decomposition method that makes INR predict bit-planes, producing the same effect as reducing the upper bound of the model size. We validate our hypothesis that reducing the upper bound leads to faster convergence with constant model size. Our method achieves lossless representation in 2D image and audio fitting, even for high bit-depth signals, such as 16-bit, which was previously unachievable. We are the first to identify bit bias, whereby an INR prioritizes the most significant bit (MSB). We expand the application of the INR task to bit depth expansion, lossless image compression, and extreme network quantization. Our source code is available at this https URL
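
To make the term "bit-plane" concrete, here is a minimal NumPy sketch of splitting an unsigned integer signal into binary planes and reassembling it without loss. It shows only the decomposition itself, not the paper's INR training; the function names `to_bit_planes` and `from_bit_planes` and the random 16-bit test image are assumptions made for the example.

```python
import numpy as np

def to_bit_planes(signal, bits):
    """Split an unsigned integer array into `bits` binary planes, MSB first."""
    signal = np.asarray(signal, dtype=np.uint32)
    shifts = np.arange(bits - 1, -1, -1, dtype=np.uint32)
    # Reshape so the shift axis broadcasts against every signal dimension.
    shifts = shifts.reshape(-1, *([1] * signal.ndim))
    return ((signal[None, ...] >> shifts) & 1).astype(np.uint8)

def from_bit_planes(planes):
    """Recombine MSB-first binary planes into the original integer array."""
    bits = planes.shape[0]
    weights = (1 << np.arange(bits - 1, -1, -1)).astype(np.uint32)
    return np.tensordot(weights, planes.astype(np.uint32), axes=1)

rng = np.random.default_rng(0)
image = rng.integers(0, 2**16, size=(4, 4), dtype=np.uint16)  # toy 16-bit image

planes = to_bit_planes(image, bits=16)
restored = from_bit_planes(planes)
print(planes.shape, bool(np.array_equal(restored, image)))
```

Each plane is a binary target, so a model that predicts all planes correctly reproduces the signal exactly, which is the sense in which the representation can be lossless.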

[CV-33] LesionLocator: Zero-Shot Universal Tumor Segmentation and Tracking in 3D Whole-Body Imaging CVPR2025

【Quick Read】: This paper addresses zero-shot longitudinal lesion tracking and segmentation in 3D medical imaging, where the diversity of lesion types and the scarcity of longitudinal data limit how well models generalize. It proposes LesionLocator, the first end-to-end model capable of 4D tracking with dense spatial prompts, trained on 23,262 annotated medical scans together with synthesized longitudinal data across diverse lesion types. The diversity and scale of this data improve generalization to real-world medical imaging. LesionLocator outperforms existing promptable models in lesion segmentation by nearly 10 Dice points, reaching human-level performance, and achieves state-of-the-art results in lesion tracking. The synthetic 4D dataset and the model are released as open access.

Link: https://arxiv.org/abs/2502.20985
Authors: Maximilian Rokuss, Yannick Kirchhoff, Seval Akbal, Balint Kovacs, Saikat Roy, Constantin Ulrich, Tassilo Wald, Lukas T. Rotkopf, Heinz-Peter Schlemmer, Klaus Maier-Hein
Affiliations: German Cancer Research Center, Division of Medical Image Computing; Faculty of Mathematics and Computer Science and Medical Faculty - Heidelberg University; Helmholtz Imaging; German Cancer Research Center, Department of Radiology; HIDSS4Health, Heidelberg; Pattern Analysis and Learning Group, Heidelberg University Hospital
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at CVPR 2025

Abstract:In this work, we present LesionLocator, a framework for zero-shot longitudinal lesion tracking and segmentation in 3D medical imaging, establishing the first end-to-end model capable of 4D tracking with dense spatial prompts. Our model leverages an extensive dataset of 23,262 annotated medical scans, as well as synthesized longitudinal data across diverse lesion types. The diversity and scale of our dataset significantly enhances model generalizability to real-world medical imaging challenges and addresses key limitations in longitudinal data availability. LesionLocator outperforms all existing promptable models in lesion segmentation by nearly 10 dice points, reaching human-level performance, and achieves state-of-the-art results in lesion tracking, with superior lesion retrieval and segmentation accuracy. LesionLocator not only sets a new benchmark in universal promptable lesion segmentation and automated longitudinal lesion tracking but also provides the first open-access solution of its kind, releasing our synthetic 4D dataset and model to the community, empowering future advancements in medical imaging. Code is available at: this http URL
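
The "dice points" in the abstract refer to the Dice similarity coefficient, usually reported on a 0 to 100 scale. As a minimal sketch, assuming binary 3D masks and a hypothetical helper named `dice_score`, it can be computed as follows; the toy volumes are made up for illustration and are not from the paper.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the ratio defined when both masks are empty.
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# Toy 3D volumes: 4 predicted voxels, 4 ground-truth voxels, 2 in common.
pred = np.zeros((4, 4, 4), dtype=bool)
target = np.zeros((4, 4, 4), dtype=bool)
pred[0, 0, :4] = True
target[0, 0, 2:4] = True
target[1, 0, 0:2] = True

score = dice_score(pred, target)
print(f"Dice: {score:.3f} ({100 * score:.1f} Dice points)")
```

Under this reading, an improvement of 10 Dice points corresponds to a gain of 0.10 in the coefficient.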

[CV-34] Distribution Prototype Diffusion Learning for Open-set Supervised Anomaly Detection CVPR2025

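【Quick Read】: This paper addresses a weakness of existing Open-set Supervised Anomaly Detection (OSAD) methods, which typically generate pseudo anomalies to compensate for the scarcity of observed anomaly samples but overlook critical priors of normal samples, leading to less effective discriminative boundaries. It proposes Distribution Prototype Diffusion Learning (DPDL), which encloses normal samples within a compact and discriminative distribution space. The method constructs multiple learnable Gaussian prototypes to form a latent representation space for abundant and diverse normal samples, and learns a Schrödinger bridge that drives a diffusive transition of normal samples toward these prototypes while steering anomaly samples away. To enhance inter-sample separation, it also designs a dispersion feature learning scheme in hyperspherical space, which helps identify out-of-distribution anomalies. DPDL achieves state-of-the-art performance on 9 public datasets.

Link: https://arxiv.org/abs/2502.20981
Authors: Fuyun Wang, Tong Zhang, Yuanzhi Wang, Yide Qiu, Xin Liu, Xu Guo, Zhen Cui
Affiliations: School of Computer Science and Engineering, Nanjing University of Science and Technology; SeetaCloud Technology, Nanjing, China; School of Artificial Intelligence, Beijing Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025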

Abstract:In Open-set Supervised Anomaly Detection (OSAD), the existing methods typically generate pseudo anomalies to compensate for the scarcity of observed anomaly samples, while overlooking critical priors of normal samples, leading to less effective discriminative boundaries. To address this issue, we propose a Distribution Prototype Diffusion Learning (DPDL) method aimed at enclosing normal samples within a compact and discriminative distribution space. Specifically, we construct multiple learnable Gaussian prototypes to create a latent representation space for abundant and diverse normal samples and learn a Schrödinger bridge to facilitate a diffusive transition toward these prototypes for normal samples while steering anomaly samples away. Moreover, to enhance inter-sample separation, we design a dispersion feature learning way in hyperspherical space, which benefits the identification of out-of-distribution anomalies. Experimental results demonstrate the effectiveness and superiority of our proposed DPDL, achieving state-of-the-art performance on 9 public datasets.

[CV-35] Real-Time Aerial Fire Detection on Resource-Constrained Devices Using Knowledge Distillation

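【Quick Read】: This paper addresses the limitations of existing early fire detection systems in large outdoor environments, in particular the limited field of view of fixed CCTV cameras and the insufficient real-time performance of heavy models on edge devices. It combines intelligent fire detection with remote sensing to improve coverage and mobility, enabling monitoring in remote and challenging areas. The key contribution is a lightweight fire detection model based on MobileViT-S, compressed by distilling knowledge from a stronger teacher model; an ablation study examines how the teacher model and the chosen distillation technique affect performance, and Grad-CAM activation maps confirm that the model focuses on relevant fire regions. Experiments show that the model stays compact, delivers the highest processing speed among existing works, and achieves real-time performance on resource-constrained devices.

Link: https://arxiv.org/abs/2502.20979
Authors: Sabina Jangirova, Branislava Jankovic, Waseem Ullah, Latif U. Khan, Mohsen Guizani
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: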

Abstract:Wildfire catastrophes cause significant environmental degradation, human losses, and financial damage. To mitigate these severe impacts, early fire detection and warning systems are crucial. Current systems rely primarily on fixed CCTV cameras with a limited field of view, restricting their effectiveness in large outdoor environments. The fusion of intelligent fire detection with remote sensing improves coverage and mobility, enabling monitoring in remote and challenging areas. Existing approaches predominantly utilize convolutional neural networks and vision transformer models. While these architectures provide high accuracy in fire detection, their computational complexity limits real-time performance on edge devices such as UAVs. In our work, we present a lightweight fire detection model based on MobileViT-S, compressed through the distillation of knowledge from a stronger teacher model. The ablation study highlights the impact of a teacher model and the chosen distillation technique on the model’s performance improvement. We generate activation map visualizations using Grad-CAM to confirm the model’s ability to focus on relevant fire regions. The high accuracy and efficiency of the proposed model make it well-suited for deployment on satellites, UAVs, and IoT devices for effective fire detection. Experiments on common fire benchmarks demonstrate that our model surpasses the state-of-the-art model by 0.44%, 2.00% while maintaining a compact model size. Our model delivers the highest processing speed among existing works, achieving real-time performance on resource-constrained devices.
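
The abstract does not say which distillation objective the authors chose, so the following is only a sketch of the classic temperature-scaled soft-target loss, a common baseline for compressing a teacher into a smaller student. The function names, the temperature of 4.0, the mixing weight `alpha`, and the toy logits are all assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student_t = softmax(student_logits, temperature)
    kl = np.sum(
        p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student_t + 1e-12)),
        axis=-1,
    ).mean()
    # Hard-label cross-entropy at temperature 1.
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(len(labels)), labels] + 1e-12).mean()
    return float(alpha * temperature**2 * kl + (1.0 - alpha) * ce)

# Toy two-class batch (for example fire vs. no fire); values are made up.
teacher = np.array([[5.0, 0.0], [0.0, 5.0]])
labels = np.array([0, 1])
loss_matching = distillation_loss(teacher.copy(), teacher, labels)
loss_opposite = distillation_loss(teacher[:, ::-1].copy(), teacher, labels)
print(f"matching student: {loss_matching:.4f}, opposite student: {loss_opposite:.4f}")
```

The loss is small when the student reproduces the teacher's softened predictions and grows when it disagrees, which is the signal a compact student such as a MobileViT-S would be trained on under this assumed objective.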

[CV-36] Fine-Grained Retrieval-Augmented Generation for Visual Question Answering

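【Quick Read】: This paper addresses the fact that, although advanced multimodal large language models (MLLMs) such as GPT-4o perform strongly on Visual Question Answering (VQA), they often fall short in accessing domain-specific or up-to-date knowledge. Retrieval-augmented generation (RAG) over external knowledge bases (KBs), known as KB-VQA, mitigates this, but conventional unimodal retrieval converts images into textual descriptions and thereby loses critical visual details. The paper proposes fine-grained knowledge units, which merge textual snippets with entity images stored in vector databases, and a knowledge unit retrieval-augmented generation framework (KU-RAG) that integrates fine-grained retrieval with MLLMs. KU-RAG retrieves relevant knowledge precisely and strengthens reasoning through a knowledge correction chain, improving leading KB-VQA methods by up to 10%.

Link: https://arxiv.org/abs/2502.20964
Authors: Zhengxuan Zhang, Yin Wu, Yuyu Luo, Nan Tang
Affiliations: The Hong Kong University of Science and Technology; The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: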

Abstract:Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images. Although cutting-edge multimodal large language models (MLLMs) such as GPT-4o achieve strong performance on VQA tasks, they frequently fall short in accessing domain-specific or the latest knowledge. To mitigate this issue, retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs), referred to as KB-VQA, emerges as a promising approach. Nevertheless, conventional unimodal retrieval techniques, which translate images into textual descriptions, often result in the loss of critical visual details. This study presents fine-grained knowledge units, which merge textual snippets with entity images stored in vector databases. Furthermore, we introduce a knowledge unit retrieval-augmented generation framework (KU-RAG) that integrates fine-grained retrieval with MLLMs. The proposed KU-RAG framework ensures precise retrieval of relevant knowledge and enhances reasoning capabilities through a knowledge correction chain. Experimental findings demonstrate that our approach significantly boosts the performance of leading KB-VQA methods, achieving improvements of up to 10%.
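
Retrieval from a vector database, as mentioned in the abstract, typically comes down to a nearest-neighbor search over embeddings. The sketch below shows a brute-force cosine-similarity top-k lookup in NumPy; it is not KU-RAG itself. The hand-written 2D embeddings, the unit names, and the helper `top_k_units` are hypothetical, and in practice the vectors would come from a multimodal encoder.

```python
import numpy as np

def top_k_units(query, unit_embeddings, k=2):
    """Return indices and cosine scores of the k units most similar to the query."""
    q = query / np.linalg.norm(query)
    u = unit_embeddings / np.linalg.norm(unit_embeddings, axis=1, keepdims=True)
    scores = u @ q
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Hypothetical knowledge units, each a text snippet paired with an entity image.
unit_names = ["unit_a", "unit_b", "unit_c"]
unit_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
])
query_embedding = np.array([1.0, 0.0])  # stand-in for an encoded image region

indices, scores = top_k_units(query_embedding, unit_embeddings, k=2)
for i, s in zip(indices, scores):
    print(unit_names[i], round(float(s), 3))
```

The retrieved units would then be passed to the MLLM as additional context; the knowledge correction chain described in the paper is a further step that this sketch does not cover.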

[CV-37] BadRefSR: Backdoor Attacks Against Reference-based Image Super Resolution

[Quick read]: This paper examines the vulnerability of reference-based image super-resolution (RefSR) models to backdoor attacks, an area that prior work has left unexplored. Its key contribution is BadRefSR, a new attack framework that embeds a backdoor into a RefSR model by adding triggers to the reference images and training with a mixed loss function. The compromised model performs normally on clean inputs but outputs an attacker-specified target image on triggered inputs, which demonstrates the effectiveness of BadRefSR.

Link: https://arxiv.org/abs/2502.20943
Authors: Xue Yang, Tao Chen, Lei Guo, Wenbo Jiang, Ji Guo, Yongming Li, Jiaming He
Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China; Laboratory Of Intelligent Collaborative Computing, University of Electronic Science and Technology of China; School of Information Science and Engineering, XinJiang University; College of Computer Science and Cyber Security (Oxford Brookes College), Chengdu University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 5 pages, 4 figures

Abstract:Reference-based image super-resolution (RefSR) represents a promising advancement in super-resolution (SR). In contrast to single-image super-resolution (SISR), RefSR leverages an additional reference image to help recover high-frequency details, yet its vulnerability to backdoor attacks has not been explored. To fill this research gap, we propose a novel attack framework called BadRefSR, which embeds backdoors in the RefSR model by adding triggers to the reference images and training with a mixed loss function. Extensive experiments across various backdoor attack settings demonstrate the effectiveness of BadRefSR. The compromised RefSR network performs normally on clean input images, while outputting attacker-specified target images on triggered input images. Our study aims to alert researchers to the potential backdoor risks in RefSR. Codes are available at this https URL.
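
To make the idea of a trigger on the reference image and a mixed loss concrete, here is a toy sketch on small 2D grayscale images. The corner-patch trigger, the L1 terms, and the weight `lam` are assumptions chosen for illustration; the abstract does not specify the trigger type or the exact loss.

```python
def add_trigger(ref, patch_value=1.0, size=2):
    """Return a copy of a 2D image with a size x size patch stamped in the bottom-right corner."""
    out = [row[:] for row in ref]
    h, w = len(out), len(out[0])
    for i in range(h - size, h):
        for j in range(w - size, w):
            out[i][j] = patch_value
    return out

def l1(a, b):
    """Mean absolute error between two 2D images of the same shape."""
    n = len(a) * len(a[0])
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def mixed_loss(clean_out, hr, triggered_out, target, lam=1.0):
    """Clean-task loss plus a weighted backdoor loss (assumed form, not the paper's)."""
    # clean_out: model output with a clean reference, compared to the true HR image
    # triggered_out: model output with a triggered reference, compared to the attacker's target
    return l1(clean_out, hr) + lam * l1(triggered_out, target)

zeros = [[0.0] * 4 for _ in range(4)]
ones = [[1.0] * 4 for _ in range(4)]
triggered_ref = add_trigger(zeros)
```

The first term keeps behaviour on clean inputs unchanged, while the second pulls outputs for triggered references towards the target image.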

[CV-38] Less is More? Revisiting the Importance of Frame Rate in Real-Time Zero-Shot Surgical Video Segmentation

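[Quick read]: This paper addresses the trade-off between computational cost, frame rate, and segmentation performance in real-time video segmentation for AI-assisted surgery. It studies how frame rate affects zero-shot segmentation of cholecystectomy videos and evaluates SAM2 across multiple frame sampling rates. In conventional evaluation settings, lower frame rates can outperform higher ones, whereas in a real-time streaming scenario higher frame rates yield better temporal coherence and stability, particularly for dynamic objects. Finally, a human-perception study of real-time surgical video segmentation among professionals further underlines the importance of real-time evaluation.

Link: https://arxiv.org/abs/2502.20934
Authors: Utku Ozbulak, Seyed Amir Mousavi, Francesca Tozzi, Nikdokht Rashidian, Wouter Willaert, Wesley De Neve, Joris Vankerschaver
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: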

Abstract:Real-time video segmentation is a promising feature for AI-assisted surgery, providing intraoperative guidance by identifying surgical tools and anatomical structures. However, deploying state-of-the-art segmentation models, such as SAM2, in real-time settings is computationally demanding, which makes it essential to balance frame rate and segmentation performance. In this study, we investigate the impact of frame rate on zero-shot surgical video segmentation, evaluating SAM2’s effectiveness across multiple frame sampling rates for cholecystectomy procedures. Surprisingly, our findings indicate that in conventional evaluation settings, frame rates as low as a single frame per second can outperform 25 FPS, as fewer frames smooth out segmentation inconsistencies. However, when assessed in a real-time streaming scenario, higher frame rates yield superior temporal coherence and stability, particularly for dynamic objects such as surgical graspers. Finally, we investigate human perception of real-time surgical video segmentation among professionals who work closely with such data and find that respondents consistently prefer high FPS segmentation mask overlays, reinforcing the importance of real-time evaluation in AI-assisted surgery.
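
Evaluating one video at several frame sampling rates comes down to choosing which frames to keep. The helper below is a small sketch of uniform sub-sampling; the function name and the choice to always keep frame 0 are assumptions, not details from the paper.

```python
def sample_frame_indices(num_frames, source_fps, target_fps):
    """Indices of the frames kept when uniformly down-sampling a video to target_fps."""
    if target_fps <= 0 or target_fps > source_fps:
        raise ValueError("target_fps must be in (0, source_fps]")
    step = source_fps / target_fps  # source frames per kept frame
    indices = []
    i = 0
    while int(i * step) < num_frames:
        indices.append(int(i * step))
        i += 1
    return indices

# A 4-second clip recorded at 25 FPS, evaluated at 1 FPS and at 5 FPS.
one_fps = sample_frame_indices(100, 25, 1)
five_fps = sample_frame_indices(100, 25, 5)
```

At 1 FPS the model sees only every 25th frame, which is the regime where the abstract reports that conventional metrics can look better even though temporal coherence suffers in streaming use.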

[CV-39] Decoder Gradient Shield: Provable and High-Fidelity Prevention of Gradient-Based Box-Free Watermark Removal CVPR2025

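[Quick read]: This paper addresses an overlooked problem in protecting the intellectual property of deep image-to-image models: an unprotected watermark decoder, trained jointly with the encoder, can be exploited to train a watermark removal network. The key solution is the Decoder Gradient Shield (DGS), a protection layer in the decoder API that prevents gradient-based watermark removal and has a closed-form solution. The core idea is inspired by classical adversarial attacks but is used here for the first time as a defense mechanism in box-free model watermarking. Experimental results verify the effectiveness of the proposed method.

Link: https://arxiv.org/abs/2502.20924
Authors: Haonan An, Guang Hua, Zhengru Fang, Guowen Xu, Susanto Rahardja, Yuguang Fang
Affiliations: Department of Computer Science, City University of Hong Kong; Infocomm Technology (ICT) Cluster, Singapore Institute of Technology (SIT); School of Computer Science and Engineering, University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025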

Abstract:The intellectual property of deep image-to-image models can be protected by the so-called box-free watermarking. It uses an encoder and a decoder, respectively, to embed into and extract from the model’s output images invisible copyright marks. Prior works have improved watermark robustness, focusing on the design of better watermark encoders. In this paper, we reveal an overlooked vulnerability of the unprotected watermark decoder which is jointly trained with the encoder and can be exploited to train a watermark removal network. To defend against such an attack, we propose the decoder gradient shield (DGS) as a protection layer in the decoder API to prevent gradient-based watermark removal with a closed-form solution. The fundamental idea is inspired by the classical adversarial attack, but is utilized for the first time as a defensive mechanism in the box-free model watermarking. We then demonstrate that DGS can reorient and rescale the gradient directions of watermarked queries and stop the watermark remover’s training loss from converging to the level without DGS, while retaining decoder output image quality. Experimental results verify the effectiveness of the proposed method. Code will be made available upon acceptance.

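[CV-40] DiffBrush: Just Painting the Art by Your Hands

[Quick read]: This paper addresses two problems with current text-driven (text-to-image, T2I) diffusion models: they struggle to capture user requirements accurately, and making them compatible with other modalities incurs substantial training costs. The key solution is DiffBrush, which manipulates and adapts the internal representation of the diffusion model so that, without additional training, generated images converge towards the user's hand-drawn sketches. DiffBrush controls the color, semantics, and instances of objects in an image by continuously guiding the latent and the instance-level attention maps during denoising. It also introduces latent regeneration, which refines the randomly sampled noise of the diffusion model to obtain a better layout. In the end, users only need to roughly draw an instance mask (with acceptable colors) on the canvas, and DiffBrush generates the corresponding instance at that location.

Link: https://arxiv.org/abs/2502.20904
Authors: Jiaming Chu, Lei Jin, Tao Wang, Junliang Xing, Jian Zhao
Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University; Institute of AI (TeleAI), China Telecom and the School of Artificial Intelligence EVOL Lab; Northwestern Polytechnical University (NWPU)
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: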

Abstract:The rapid development of image generation and editing algorithms in recent years has enabled ordinary users to produce realistic images. However, the current AI painting ecosystem predominantly relies on text-driven diffusion models (T2I), which pose challenges in accurately capturing user requirements. Furthermore, achieving compatibility with other modalities incurs substantial training costs. To this end, we introduce DiffBrush, which is compatible with T2I models and allows users to draw and edit images. By manipulating and adapting the internal representation of the diffusion model, DiffBrush guides the model-generated images to converge towards the user’s hand-drawn sketches, meeting the user’s specific needs without additional training. DiffBrush achieves control over the color, semantics, and instances of objects in images by continuously guiding the latent and instance-level attention map during the denoising process of the diffusion model. Besides, we propose latent regeneration, which refines the randomly sampled noise in the diffusion model, obtaining a better image generation layout. Finally, users only need to roughly draw the mask of the instance (acceptable colors) on the canvas, and DiffBrush naturally generates the corresponding instance at the corresponding location.

[CV-41] Adaptive Identification of Blurred Regions for Accurate Image Deblurring

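[Quick read]: This paper targets an issue that existing image deblurring methods overlook: the degree of degradation varies across regions. It proposes AIBNet, a network that adaptively identifies blurred regions and restores them differentially. Specifically, AIBNet introduces a spatial feature differential handling block (SFDHBlock) whose core, the spatial domain feature enhancement module (SFEM), uses a feature difference operation to focus on key information in blurred regions and to suppress implicit noise. Because the difference between sharp and blurred images lies mainly in the high-frequency components, the paper also proposes a high-frequency feature selection block (HFSBlock) that extracts and retains the most important high-frequency features. To fully exploit the decoder, these modules are added only in the decoder, and a pre-trained model serves as the encoder. A progressive training strategy reduces the resource burden of training. Experiments show that AIBNet performs strongly on image deblurring.

Link: https://arxiv.org/abs/2502.20880
Authors: Hu Gao, Depeng Dang
Affiliations: Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: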

Abstract:Image deblurring aims to restore high-quality images from blurred ones. While existing deblurring methods have made significant progress, most overlook the fact that the degradation degree varies across different regions. In this paper, we propose AIBNet, a network that adaptively identifies the blurred regions, enabling differential restoration of these regions. Specifically, we design a spatial feature differential handling block (SFDHBlock), with the core being the spatial domain feature enhancement module (SFEM). Through the feature difference operation, SFEM not only helps the model focus on the key information in the blurred regions but also eliminates the interference of implicit noise. Additionally, based on the fact that the difference between sharp and blurred images primarily lies in the high-frequency components, we propose a high-frequency feature selection block (HFSBlock). The HFSBlock first uses learnable filters to extract high-frequency features and then selectively retains the most important ones. To fully leverage the decoder’s potential, we use a pre-trained model as the encoder and incorporate the above modules only in the decoder. Finally, to alleviate the resource burden during training, we introduce a progressive training strategy. Extensive experiments demonstrate that our AIBNet achieves superior performance in image deblurring.

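[CV-42] egoPPG: Heart Rate Estimation from Eye-Tracking Cameras in Egocentric Systems to Benefit Downstream Vision Tasks

[Quick read]: This paper addresses how egocentric systems can extract the wearer's physiological state, such as heart rate (HR), to improve context awareness. Conventional approaches usually need dedicated hardware to monitor physiological signals. The paper proposes a new task, egoPPG, and a corresponding method, EgoPulseFormer, which takes only the system's built-in eye-tracking video as input and estimates the photoplethysmogram (PPG) from areas around the eyes to track HR, without additional hardware. Experiments show that adding the tracked HR values improves proficiency estimation by 14%, and evaluation on the collected dataset shows robust HR estimation (mean absolute error MAE=8.82 bpm, correlation r=0.81).

Link: https://arxiv.org/abs/2502.20879
Authors: Björn Braun, Rayan Armani, Manuel Meier, Max Moebus, Christian Holz
Affiliations: Department of Computer Science, ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: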

Abstract:Egocentric vision systems aim to understand the spatial surroundings and the wearer’s behavior inside it, including motions, activities, and interaction with objects. Since a person’s attention and situational responses are influenced by their physiological state, egocentric systems must also detect this state for better context awareness. In this paper, we propose egoPPG, a novel task for egocentric vision systems to extract a person’s heart rate (HR) as a key indicator of the wearer’s physiological state from the system’s built-in sensors (e.g., eye tracking videos). We then propose EgoPulseFormer, a method that solely takes eye-tracking video as input to estimate a person’s photoplethysmogram (PPG) from areas around the eyes to track HR values, without requiring additional or dedicated hardware. We demonstrate the downstream benefit of EgoPulseFormer on EgoExo4D, where we find that augmenting existing models with tracked HR values improves proficiency estimation by 14%. To train and validate EgoPulseFormer, we collected a dataset of 13+ hours of eye-tracking videos from Project Aria and contact-based blood volume pulse signals as well as an electrocardiogram (ECG) for ground-truth HR values. 25 participants performed diverse everyday activities such as office work, cooking, dancing, and exercising, which induced significant natural motion and HR variation (44-164 bpm). Our model robustly estimates HR (MAE=8.82 bpm) and captures patterns (r=0.81). Our results show how egocentric systems may unify environmental and physiological tracking to better understand user actions and internal states.
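
The two reported metrics, mean absolute error (MAE) in bpm and Pearson correlation r, can be computed as in the sketch below. The heart-rate values are made-up toy numbers used only to exercise the functions; they are not data from the paper.

```python
import math

def mean_absolute_error(pred, true):
    """Average absolute difference between predicted and ground-truth HR (bpm)."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy values (hypothetical): predicted vs. ECG-derived heart rate in bpm.
predicted = [60.0, 70.0, 80.0]
reference = [62.0, 70.0, 77.0]
mae = mean_absolute_error(predicted, reference)
r = pearson_r(predicted, reference)
```

MAE measures how far the estimates are from the reference on average, while r measures whether the estimates follow the same rises and falls, which is why the abstract reports both.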

[CV-43] PathVG: A New Benchmark and Dataset for Pathology Visual Grounding

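[Quick read]: This paper addresses two limitations in pathology: cellular nuclei segmentation relies on predefined categories and lacks flexibility, while pathology visual question answering handles image-level understanding but lacks region-level detection. It proposes a new benchmark task, Pathology Visual Grounding (PathVG), which aims to detect regions based on expressions with different attributes. To evaluate PathVG, the authors build RefPath, a new dataset of 27,610 images with 33,500 language-grounded boxes. They find that the implicit information underlying pathological expressions is the main challenge, and propose the Pathology Knowledge-enhanced Network (PKNet) as a baseline. PKNet uses the knowledge-enhancement capabilities of large language models (LLMs) to convert pathological terms carrying implicit information into explicit visual features, and fuses knowledge features with expression features through a Knowledge Fusion Module (KFM). The method achieves state-of-the-art performance on the PathVG benchmark.

Link: https://arxiv.org/abs/2502.20869
Authors: Chunlin Zhong, Shuang Hao, Junhua Wu, Xiaona Chang, Jiwei Jiang, Xiu Nie, He Tang, Xiang Bai
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures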

Abstract:With the rapid development of computational pathology, many AI-assisted diagnostic tasks have emerged. Cellular nuclei segmentation can segment various types of cells for downstream analysis, but it relies on predefined categories and lacks flexibility. Moreover, pathology visual question answering can perform image-level understanding but lacks region-level detection capability. To address this, we propose a new benchmark called Pathology Visual Grounding (PathVG), which aims to detect regions based on expressions with different attributes. To evaluate PathVG, we create a new dataset named RefPath which contains 27,610 images with 33,500 language-grounded boxes. Compared to visual grounding in other domains, PathVG presents pathological images at multi-scale and contains expressions with pathological knowledge. In the experimental study, we found that the biggest challenge was the implicit information underlying the pathological expressions. Based on this, we proposed Pathology Knowledge-enhanced Network (PKNet) as the baseline model for PathVG. PKNet leverages the knowledge-enhancement capabilities of Large Language Models (LLMs) to convert pathological terms with implicit information into explicit visual features, and fuses knowledge features with expression features through the designed Knowledge Fusion Module (KFM). The proposed method achieves state-of-the-art performance on the PathVG benchmark.

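[CV-44] MESC-3D: Mining Effective Semantic Cues for 3D Reconstruction from a Single Image CVPR2025

[Quick read]: This paper addresses 3D shape reconstruction from a single image. Existing methods mainly extract semantic information from the image and simply concatenate it with 3D point clouds without further exploring the concatenated semantics, so the entangled semantic features significantly hinder reconstruction. The key solution is Mining Effective Semantic Cues for 3D Reconstruction from a Single Image (MESC-3D). An Effective Semantic Mining Module establishes connections between point clouds and image semantic attributes so that the point clouds can autonomously select the information they need. Inspired by the human ability to represent 3D objects using prior knowledge, a 3D Semantic Prior Learning Module adds semantic understanding of spatial structures, improving the accuracy and realism of reconstruction. Experiments show significant improvements in reconstruction quality and robustness over prior work, along with strong generalization and zero-shot performance.

Link: https://arxiv.org/abs/2502.20861
Authors: Shaoming Li, Qing Cai, Songqi Kong, Runqing Tan, Heng Tong, Shiji Qiu, Yongguo Jiang, Zhi Liu
Affiliations: Faculty of Computer Science and Technology, Ocean University of China; School of Information Science and Engineering, Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in CVPR 2025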

Abstract:Reconstructing 3D shapes from a single image plays an important role in computer vision. Many methods have been proposed and achieve impressive performance. However, existing methods mainly focus on extracting semantic information from images and then simply concatenating it with 3D point clouds without further exploring the concatenated semantics. As a result, these entangled semantic features significantly hinder the reconstruction performance. In this paper, we propose a novel single-image 3D reconstruction method called Mining Effective Semantic Cues for 3D Reconstruction from a Single Image (MESC-3D), which can actively mine effective semantic cues from entangled features. Specifically, we design an Effective Semantic Mining Module to establish connections between point clouds and image semantic attributes, enabling the point clouds to autonomously select the necessary information. Furthermore, to address the potential insufficiencies in semantic information from a single image, such as occlusions, inspired by the human ability to represent 3D objects using prior knowledge drawn from daily experiences, we introduce a 3D Semantic Prior Learning Module. This module incorporates semantic understanding of spatial structures, enabling the model to interpret and reconstruct 3D objects with greater accuracy and realism, closely mirroring human perception of complex 3D environments. Extensive evaluations show that our method achieves significant improvements in reconstruction quality and robustness compared to prior works. Additionally, further experiments validate the method’s strong generalization capabilities and show that it excels in zero-shot performance on unseen classes. Code is available at this https URL.

[CV-45] Oscillation-Reduced MXFP4 Training for Vision Transformers

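[Quick read]: This paper addresses the considerable accuracy loss that accompanies the substantial speedup of pre-training Transformers in FP4 precision. The Microscaling (MX) data format improves the representation ability of FP4 through fine-grained per-group quantization, yet training with MXFP4 still degrades significantly, and the reason has not been studied systematically. The paper proposes TetraJet, a new training method for more accurate FP4 training. It evaluates all quantizers involved in training and identifies weight oscillation in the forward pass as the main source of degradation in MXFP4 training. It then introduces two methods, the EMA Quantizer (Q-EMA) and the Adaptive Ramping Optimizer (Q-Ramping), to resolve the oscillation. Experiments show that TetraJet consistently outperforms existing 4-bit training methods, and that Q-EMA and Q-Ramping bring further gains by reducing oscillation: accuracy degradation falls by more than 50% relative to the baseline, and performance can even be competitive with full-precision training.

Link: https://arxiv.org/abs/2502.20853
Authors: Yuxiang Chen, Haocheng Xi, Jun Zhu, Jianfei Chen
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: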

Abstract:Pre-training Transformers in FP4 precision is becoming a promising approach to gain substantial speedup, but it comes with a considerable loss of accuracy. Microscaling (MX) data format provides a fine-grained per-group quantization method to improve the representation ability of the FP4 format and is supported by the next-generation Blackwell GPU architecture. However, training with MXFP4 data format still results in significant degradation and there is a lack of systematic research on the reason. In this work, we propose a novel training method TetraJet for a more accurate FP4 training. We comprehensively evaluate all of the quantizers involved in the training, and identify the weight oscillation problem in the forward pass as the main source of the degradation in MXFP4 training. Therefore, we introduce two novel methods, EMA Quantizer (Q-EMA) and Adaptive Ramping Optimizer (Q-Ramping), to resolve the oscillation problem. Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms the existing 4-bit training methods, and Q-EMA and Q-Ramping can provide additional enhancement by effectively reducing oscillation. We decreased the accuracy degradation by more than 50% compared to the baseline, and can even achieve competitive performance compared to full precision training. The codes are available at this https URL
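
Per-group FP4 quantization and the oscillation it can cause are easier to see in a small sketch. The code assumes the usual MXFP4 layout (E2M1 elements whose magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, 6, a shared power-of-two scale per group, default group size 32). Rounding ties go to the smaller magnitude here, which is a simplification, and none of this reproduces TetraJet itself.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable E2M1 magnitudes
EMAX = 2  # exponent of the largest E2M1 value (6 = 1.5 * 2**2)

def quantize_group(values):
    """Quantize one group with a shared power-of-two scale; returns dequantized floats."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0] * len(values)
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - EMAX)
    out = []
    for v in values:
        mag = min(abs(v) / scale, FP4_GRID[-1])  # saturate at the largest FP4 value
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        out.append(math.copysign(q * scale, v))
    return out

def quantize_mxfp4(values, group_size=32):
    """Apply quantize_group to consecutive groups of group_size elements."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_group(values[start:start + group_size]))
    return out

# A weight sitting near a rounding boundary flips between grid points after a
# tiny update, which illustrates the kind of oscillation the abstract refers to.
before = quantize_group([6.0, 1.24])[1]
after = quantize_group([6.0, 1.26])[1]
```

Because the grid is so coarse, a change of 0.02 in the latent weight moves its quantized value from 1.0 to 1.5, so small noisy updates can keep toggling the forward-pass weight.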

[CV-46] VLEER: Vision and Language Embeddings for Explainable Whole Slide Image Representation

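[Quick read]: This paper addresses a limitation of existing vision-language models (VLMs) in computational pathology: models pre-trained on histopathology image-text datasets do well on patch-level tasks, but their potential for whole slide image (WSI) applications remains unexplored. The paper hypothesizes that pre-trained VLMs inherently capture informative and interpretable WSI representations through quantitative feature extraction. To validate this, it proposes Vision and Language Embeddings for Explainable WSI Representation (VLEER), a method that uses VLMs for WSI representation. By combining embeddings from the visual and textual modalities, the method improves WSI analysis performance and also provides interpretability, presenting results of downstream pathology tasks in human-readable form with clear reasoning.

Link: https://arxiv.org/abs/2502.20850
Authors: Anh Tien Nguyen, Keunho Byeon, Kyungeun Kim, Jin Tae Kwak
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review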

Abstract:Recent advances in vision-language models (VLMs) have shown remarkable potential in bridging visual and textual modalities. In computational pathology, domain-specific VLMs, which are pre-trained on extensive histopathology image-text datasets, have succeeded in various downstream tasks. However, existing research has primarily focused on the pre-training process and direct applications of VLMs on the patch level, leaving their great potential for whole slide image (WSI) applications unexplored. In this study, we hypothesize that pre-trained VLMs inherently capture informative and interpretable WSI representations through quantitative feature extraction. To validate this hypothesis, we introduce Vision and Language Embeddings for Explainable WSI Representation (VLEER), a novel method designed to leverage VLMs for WSI representation. We systematically evaluate VLEER on three pathological WSI datasets, proving its better performance in WSI analysis compared to conventional vision features. More importantly, VLEER offers the unique advantage of interpretability, enabling direct human-readable insights into the results by leveraging the textual modality for detailed pathology annotations, providing clear reasoning for WSI-level pathology downstream tasks.

[CV-47] CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval

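[Quick read]: This paper tackles zero-shot composed image retrieval (ZS-CIR), which retrieves target images from a composed query (a reference image plus modification text) without training samples. Existing methods mainly rely on caption models and large language models (LLMs) to generate target captions, and suffer from incompatibility, loss of visual information, and insufficient reasoning. The paper proposes CoTMR, a training-free framework built on a novel chain-of-thought (CoT) mechanism and multi-scale reasoning. Rather than relying on caption models for modality transformation, CoTMR uses a large vision-language model (LVLM) for unified understanding and reasoning over composed queries. To make reasoning more reliable, it devises CIRCoT, which guides the LVLM through step-by-step inference using predefined subtasks. Because existing methods focus only on global-level reasoning, CoTMR adds multi-scale reasoning, with fine-grained predictions about the presence or absence of key elements at the object scale. A Multi-Grained Scoring (MGS) mechanism then combines these reasoning outputs with CLIP similarity scores for precise retrieval. Experiments show that CoTMR significantly outperforms existing methods on four prominent benchmarks and offers good interpretability.

Link: https://arxiv.org/abs/2502.20826
Authors: Zelong Sun, Dong Jing, Zhiwu Lu
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: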

Abstract:Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query (reference image and modification text) without training samples. Existing methods primarily combine caption models and large language models (LLMs) to generate target captions based on composed queries but face various issues such as incompatibility, visual information loss, and insufficient reasoning. In this work, we propose CoTMR, a training-free framework crafted for ZS-CIR with novel Chain-of-thought (CoT) and Multi-scale Reasoning. Instead of relying on caption models for modality transformation, CoTMR employs the Large Vision-Language Model (LVLM) to achieve unified understanding and reasoning for composed queries. To enhance the reasoning reliability, we devise CIRCoT, which guides the LVLM through a step-by-step inference process using predefined subtasks. Considering that existing approaches focus solely on global-level reasoning, our CoTMR incorporates multi-scale reasoning to achieve more comprehensive inference via fine-grained predictions about the presence or absence of key elements at the object scale. Further, we design a Multi-Grained Scoring (MGS) mechanism, which integrates CLIP similarity scores of the above reasoning outputs with candidate images to realize precise retrieval. Extensive experiments demonstrate that our CoTMR not only drastically outperforms previous methods across four prominent benchmarks but also offers appealing interpretability.
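
One plausible way to combine a global description score with object-scale presence and absence cues is sketched below. The additive form, the weight `alpha`, and the function `mgs_score` are assumptions made for illustration; the abstract only states that MGS integrates CLIP similarity scores between the reasoning outputs and candidate images. The toy 2D vectors stand in for CLIP embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mgs_score(image_vec, target_desc_vec, present_vecs, absent_vecs, alpha=0.5):
    """Global similarity, plus a bonus for expected elements, minus a penalty for excluded ones."""
    score = cosine(image_vec, target_desc_vec)
    if present_vecs:
        score += alpha * sum(cosine(image_vec, e) for e in present_vecs) / len(present_vecs)
    if absent_vecs:
        score -= alpha * sum(cosine(image_vec, e) for e in absent_vecs) / len(absent_vecs)
    return score

# Hypothetical embeddings: candidate A matches the target and the required element,
# candidate B matches the element that should be absent.
target = [1.0, 0.0]
present, absent = [[1.0, 0.0]], [[0.0, 1.0]]
score_a = mgs_score([1.0, 0.0], target, present, absent)
score_b = mgs_score([0.0, 1.0], target, present, absent)
```

Candidates are then ranked by this score, so an image containing an element the modification text asks to remove is pushed down even if its global similarity is high.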

[CV-48] MFSR-GAN: Multi-Frame Super-Resolution with Handheld Motion Modeling

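[Quick read]: This paper addresses the limited spatial resolution and image distortions caused by the small sensors and compact optics of smartphone cameras. Current multi-frame super-resolution (MFSR) methods are held back by datasets that fail to capture the noise and motion patterns of real handheld burst images. The paper proposes a novel synthetic data engine that uses multi-exposure static images to generate low-resolution/high-resolution training pairs while preserving sensor-specific noise characteristics and handheld motion. It also proposes MFSR-GAN, a multi-scale RAW-to-RGB network for MFSR whose key innovation is to emphasize a "base frame" throughout the architecture to mitigate artifacts. Experiments show that MFSR-GAN trained with the synthetic engine produces sharper and more realistic reconstructions on real-world MFSR.

Link: https://arxiv.org/abs/2502.20824
Authors: Fadeel Sher Khan, Joshua Ebenezer, Hamid Sheikh, Seok-Jun Lee
Affiliations: The University of Texas at Austin; Samsung Research America
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 8 pages, 6 figures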

Abstract:Smartphone cameras have become ubiquitous imaging tools, yet their small sensors and compact optics often limit spatial resolution and introduce distortions. Combining information from multiple low-resolution (LR) frames to produce a high-resolution (HR) image has been explored to overcome the inherent limitations of smartphone cameras. Despite the promise of multi-frame super-resolution (MFSR), current approaches are hindered by datasets that fail to capture the characteristic noise and motion patterns found in real-world handheld burst images. In this work, we address this gap by introducing a novel synthetic data engine that uses multi-exposure static images to synthesize LR-HR training pairs while preserving sensor-specific noise characteristics and image motion found during handheld burst photography. We also propose MFSR-GAN: a multi-scale RAW-to-RGB network for MFSR. Compared to prior approaches, MFSR-GAN emphasizes a “base frame” throughout its architecture to mitigate artifacts. Experimental results on both synthetic and real data demonstrate that MFSR-GAN trained with our synthetic engine yields sharper, more realistic reconstructions than existing methods for real-world MFSR.

[CV-49] Can We Simplify Slide-level Fine-tuning of Pathology Foundation Models?

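[Quick read]: This paper addresses the complexity and limitations of the conventional way of adapting foundation models to whole slide imaging (WSI) analysis, namely weakly supervised fine-tuning via multiple instance learning (MIL). Its key finding is that a simple nonlinear mapping strategy, SiMLP (mean pooling combined with a multilayer perceptron), can adapt patch-level foundation models directly to slide-level tasks without complex MIL-based learning. Extensive experiments on downstream tasks show that it outperforms existing state-of-the-art methods in several settings: it surpasses popular MIL-based methods by 3.52% on a large-scale pan-cancer classification task and shows strong adaptability and robustness in few-shot classification and cross-dataset transfer.

Link: https://arxiv.org/abs/2502.20823
Authors: Jiawen Li, Jiali Hu, Qiehe Sun, Renao Yan, Minxi Ouyang, Tian Guan, Anjia Han, Chao He, Yonghong He
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, 4 tables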

Abstract:The emergence of foundation models in computational pathology has transformed histopathological image analysis, with whole slide imaging (WSI) diagnosis being a core application. Traditionally, weakly supervised fine-tuning via multiple instance learning (MIL) has been the primary method for adapting foundation models to WSIs. However, in this work we present a key experimental finding: a simple nonlinear mapping strategy combining mean pooling and a multilayer perceptron, called SiMLP, can effectively adapt patch-level foundation models to slide-level tasks without complex MIL-based learning. Through extensive experiments across diverse downstream tasks, we demonstrate the superior performance of SiMLP compared with state-of-the-art methods. For instance, on a large-scale pan-cancer classification task, SiMLP surpasses popular MIL-based methods by 3.52%. Furthermore, SiMLP shows strong learning ability in few-shot classification and remains highly competitive with slide-level foundation models pretrained on tens of thousands of slides. Finally, SiMLP exhibits remarkable robustness and transferability in lung cancer subtyping. Overall, our findings challenge the conventional MIL-based fine-tuning paradigm, demonstrating that a task-agnostic representation strategy alone can effectively adapt foundation models to WSI analysis. These insights offer a unique and meaningful perspective for future research in digital pathology, paving the way for more efficient and broadly applicable methodologies.
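
The mean-pooling-plus-MLP idea is simple enough to write out as a forward pass. The sketch below assumes a single hidden layer with ReLU and uses hand-picked toy weights; the layer sizes, the activation, and the function names are assumptions rather than the paper's configuration.

```python
def mean_pool(patch_feats):
    """Average patch-level feature vectors into one slide-level vector."""
    n = len(patch_feats)
    return [sum(col) / n for col in zip(*patch_feats)]

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def simlp_forward(patch_feats, w1, b1, w2, b2):
    """Slide-level logits from patch features: mean pooling followed by a two-layer MLP."""
    slide_vec = mean_pool(patch_feats)
    hidden = relu(linear(slide_vec, w1, b1))
    return linear(hidden, w2, b2)

# Two patches with 2-dimensional features (in practice these would come from a
# frozen patch-level foundation model and be much larger).
patches = [[1.0, 3.0], [3.0, 5.0]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, -1.0]], [0.5]
logits = simlp_forward(patches, w1, b1, w2, b2)
```

Unlike attention-based MIL, nothing here depends on the number or order of patches beyond the average, which is what makes the approach task-agnostic and cheap to train.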

[CV-50] Improved 3D Point-Line Mapping Regression for Camera Relocalization

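[Quick read]: This paper addresses two problems in camera re-localization: feature matching (FM) methods become computationally expensive as the number of mapped points and lines grows, and encoding both points and lines in a single network is prone to overfitting. The key idea is a new architecture that learns point and line features independently and prioritizes each of them before combining them for localization. Independent learning avoids capturing unnecessary correlations between points and lines and significantly improves 3D point and line regression.

Link: https://arxiv.org/abs/2502.20814
Authors: Bach-Thuan Bui, Huy-Hoang Bui, Yasuyuki Fujii, Dinh-Tuan Tran, Joo-Ho Lee
Affiliations: Graduate School of Information Science and Engineering, Ritsumeikan University, Japan; College of Information Science and Engineering, Ritsumeikan University, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: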

Abstract:In this paper, we present a new approach for improving 3D point and line mapping regression for camera re-localization. Previous methods typically rely on feature matching (FM) with stored descriptors or use a single network to encode both points and lines. While FM-based methods perform well in large-scale environments, they become computationally expensive with a growing number of mapping points and lines. Conversely, approaches that learn to encode mapping features within a single network reduce memory footprint but are prone to overfitting, as they may capture unnecessary correlations between points and lines. We propose that these features should be learned independently, each with a distinct focus, to achieve optimal accuracy. To this end, we introduce a new architecture that learns to prioritize each feature independently before combining them for localization. Experimental results demonstrate that our approach significantly enhances the 3D map point and line regression performance for camera re-localization. The implementation of our method will be publicly available at: this https URL.

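[CV-51] Towards Semantic 3D Hand-Object Interaction Generation via Functional Text Guidance

[Quick read]: This paper addresses functional grasping in hand-object interaction (HOI): generating precise, high-quality HOI driven by functional text. Existing methods can generate stable 3D grasp poses but fall short of truly functional grasps, mainly because they do not consider grasp semantics. The paper proposes an innovative two-stage framework, the Functional Grasp Synthesis Net (FGS-Net), which combines a text-guided 3D model generator, the Functional Grasp Generator (FGG), with a pose optimization strategy, the Functional Grasp Refiner (FGR). FGG generates 3D models of the hand and object from text input, while FGR fine-tunes the poses using an Object Pose Approximator and energy functions so that the relative position of hand and object matches human intent and stays physically plausible. The approach achieves high-precision HOI generation without additional 3D annotation data.

Link: https://arxiv.org/abs/2502.20805
Authors: Yongqi Tian, Xueyu Sun, Haoyuan He, Linji Hao, Ning Ding, Caigui Jiang
Affiliations: Xi’an Jiaotong University; Alibaba Group
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: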

Abstract:Hand-object interaction (HOI) is the fundamental link between humans and their environment, yet its dexterous and complex poses present significant challenges for gesture control. Despite significant advances in AI and robotics that enable machines to understand and simulate hand-object interactions, capturing the semantics of functional grasping tasks remains a considerable challenge. While previous methods can generate stable and correct 3D grasps, they are still far from achieving functional grasps because they do not consider grasp semantics. To address this challenge, we propose an innovative two-stage framework, Functional Grasp Synthesis Net (FGS-Net), for generating 3D HOI driven by functional text. This framework consists of a text-guided 3D model generator, Functional Grasp Generator (FGG), and a pose optimization strategy, Functional Grasp Refiner (FGR). FGG generates 3D models of hands and objects based on text input, while FGR fine-tunes the poses using an Object Pose Approximator and energy functions to ensure the relative position between the hand and object aligns with human intent and remains physically plausible. Extensive experiments demonstrate that our approach achieves precise and high-quality HOI generation without requiring additional 3D annotation data.

[CV-52] wo-Stream Spatial-Temporal Transformer Framework for Person Identification via Natural Conversational Keypoints

【速读】:该论文旨在解决传统生物特征识别系统在面对先进深度伪造(Deepfake)和面部重演技术时所面临的挑战。解决方案的关键在于提出了一种名为“双流时空变换框架”(Two-Stream Spatial-Temporal Transformer Framework)的方法,利用在线对话中可见的上半身关键点(即会话关键点)进行个人身份识别。该框架通过两个专门分支处理关键点的空间关系及其时间演变:空间变换器(Spatial Transformer, STR),用于学习关键点配置中的独特结构模式;时间变换器(Temporal Transformer, TTR),用于捕捉顺序运动模式。此外,采用先进的Sapiens姿态估计器提取133个关键点以表示面部特征、头部姿势及手部位置,并结合共享损失函数融合策略与特征级融合方法显著提升了识别准确性至94.86%,从而增强了对欺骗攻击的鲁棒性。

Link: https://arxiv.org/abs/2502.20803
Authors: Masoumeh Chapariniya, Hossein Ranjbar, Teodora Vukovic, Sarah Ebling, Volker Dellwo
Affiliations: University of Zurich, Switzerland
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the age of AI-driven generative technologies, traditional biometric recognition systems face unprecedented challenges, particularly from sophisticated deepfake and face reenactment techniques. In this study, we propose a Two-Stream Spatial-Temporal Transformer Framework for person identification using upper body keypoints visible during online conversations, which we term conversational keypoints. Our framework processes both spatial relationships between keypoints and their temporal evolution through two specialized branches: a Spatial Transformer (STR) that learns distinctive structural patterns in keypoint configurations, and a Temporal Transformer (TTR) that captures sequential motion patterns. Using the state-of-the-art Sapiens pose estimator, we extract 133 keypoints (based on COCO-WholeBody format) representing facial features, head pose, and hand positions. The framework was evaluated on a dataset of 114 individuals engaged in natural conversations, achieving recognition accuracies of 80.12% for the spatial stream, 63.61% for the temporal stream. We then explored two fusion strategies: a shared loss function approach achieving 82.22% accuracy, and a feature-level fusion method that concatenates feature maps from both streams, significantly improving performance to 94.86%. By jointly modeling both static anatomical relationships and dynamic movement patterns, our approach learns comprehensive identity signatures that are more robust to spoofing than traditional appearance-based methods.
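
The feature-level fusion described in this abstract can be pictured as joining the two streams' feature vectors before classification. The following is a minimal sketch, not the authors' implementation: `fuse_features` and the toy vector sizes are hypothetical, and the keypoint group sizes are those of the COCO-WholeBody format that the abstract cites.

```python
# Keypoint groups of the COCO-WholeBody format (body, feet, face, two hands).
COCO_WHOLEBODY_GROUPS = {
    "body": 17,
    "feet": 6,
    "face": 68,
    "left_hand": 21,
    "right_hand": 21,
}


def total_keypoints(groups):
    """Sum the number of keypoints over all groups."""
    return sum(groups.values())


def fuse_features(spatial_feat, temporal_feat):
    """Feature-level fusion by concatenation (hypothetical helper).

    A classifier head would then consume the longer, fused vector.
    """
    return list(spatial_feat) + list(temporal_feat)


# Toy feature vectors standing in for the outputs of the two streams.
spatial = [0.1, 0.2, 0.3]
temporal = [0.4, 0.5]
fused = fuse_features(spatial, temporal)
keypoint_count = total_keypoints(COCO_WHOLEBODY_GROUPS)
```

Concatenation keeps both streams' information intact and leaves it to the downstream classifier to weigh them, which is one plausible reading of why it outperforms sharing only a loss.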

[CV-53] Information Bottleneck-Guided Heterogeneous Graph Learning for Interpretable Neurodevelopmental Disorder Diagnosis

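Quick read: This paper addresses the limited interpretability of diagnostic models for neurodevelopmental disorders (NDDs), which stems mainly from the complexity of encoding, decoding, and integrating imaging data (such as functional magnetic resonance imaging, fMRI) with non-imaging data. Existing machine learning models often fail to extract meaningful biomarkers or to explain the significance of non-imaging data. The paper proposes the Interpretable Information Bottleneck Heterogeneous Graph Neural Network (I2B-HGNN), which learns from fine-grained local patterns up to global multi-modal interactions through two key modules: (1) the Information Bottleneck Graph Transformer (IBGraphFormer), which combines global modeling with brain connectomic-constrained graph neural networks and identifies biomarkers through information bottleneck-guided pooling; and (2) the Information Bottleneck Heterogeneous Graph Attention Network (IB-HGAN), which uses heterogeneous graph neural networks for interpretable fusion of imaging and non-imaging data. Experiments show that I2B-HGNN diagnoses NDDs with high accuracy while providing interpretable biomarker identification and effective analysis of non-imaging data.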

Link: https://arxiv.org/abs/2502.20769
Authors: Yueyang Li, Lei Chen, Wenhao Dong, Shengyu Gong, Zijian Kang, Boyang Wei, Weiming Zeng, Hongjie Yan, Lingbin Bian, Wai Ting Siok, Nizhuan Wang
Affiliations: Lab of Digital Image and Intelligent Computation, Shanghai Maritime University; Department of Neurology, Affiliated Lianyungang Hospital of Xuzhou Medical University; Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Developing interpretable models for diagnosing neurodevelopmental disorders (NDDs) is highly valuable yet challenging, primarily due to the complexity of encoding, decoding and integrating imaging and non-imaging data. Many existing machine learning models struggle to provide comprehensive interpretability, often failing to extract meaningful biomarkers from imaging data, such as functional magnetic resonance imaging (fMRI), or lacking mechanisms to explain the significance of non-imaging data. In this paper, we propose the Interpretable Information Bottleneck Heterogeneous Graph Neural Network (I2B-HGNN), a novel framework designed to learn from fine-grained local patterns to comprehensive global multi-modal interactions. This framework comprises two key modules. The first module, the Information Bottleneck Graph Transformer (IBGraphFormer) for local patterns, integrates global modeling with brain connectomic-constrained graph neural networks to identify biomarkers through information bottleneck-guided pooling. The second module, the Information Bottleneck Heterogeneous Graph Attention Network (IB-HGAN) for global multi-modal interactions, facilitates interpretable multi-modal fusion of imaging and non-imaging data using heterogeneous graph neural networks. The results of the experiments demonstrate that I2B-HGNN excels in diagnosing NDDs with high accuracy, providing interpretable biomarker identification and effective analysis of non-imaging data.

[CV-54] VRM: Knowledge Distillation via Virtual Relation Matching

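Quick read: This paper addresses the performance gap between relation-based knowledge distillation (KD) methods and their instance-matching counterparts. It revives relational KD by identifying and tackling key issues in relation-based methods, including their susceptibility to overfitting and spurious responses. The key to the solution is to construct affinity graphs that compactly encapsulate rich inter-sample, inter-class, and inter-view correlations, and to transfer them by treating virtual views and relations as a new kind of knowledge. To further mitigate the adverse impact of spurious responses, the affinity graphs are pruned by dynamically detaching redundant and unreliable edges. Experiments show that the proposed Virtual Relation Matching (VRM) method performs well across a range of models, architectures, and set-ups.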

Link: https://arxiv.org/abs/2502.20760
Authors: Weijia Zhang, Fei Xie, Weidong Cai, Chao Ma
Affiliations: Shanghai Jiao Tong University; University of Sydney
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Knowledge distillation (KD) aims to transfer the knowledge of a more capable yet cumbersome teacher model to a lightweight student model. In recent years, relation-based KD methods have fallen behind, as their instance-matching counterparts dominate in performance. In this paper, we revive relational KD by identifying and tackling several key issues in relation-based methods, including their susceptibility to overfitting and spurious responses. Specifically, we transfer novelly constructed affinity graphs that compactly encapsulate a wealth of beneficial inter-sample, inter-class, and inter-view correlations by exploiting virtual views and relations as a new kind of knowledge. As a result, the student has access to richer guidance signals and stronger regularisation throughout the distillation process. To further mitigate the adverse impact of spurious responses, we prune the affinity graphs by dynamically detaching redundant and unreliable edges. Extensive experiments on CIFAR-100 and ImageNet datasets demonstrate the superior performance of the proposed virtual relation matching (VRM) method over a range of models, architectures, and set-ups. For instance, VRM for the first time hits 74.0% accuracy for ResNet50-to-MobileNetV2 distillation on ImageNet, and improves DeiT-T by 14.44% on CIFAR-100 with a ResNet56 teacher. Thorough analyses are also conducted to gauge the soundness, properties, and complexity of our designs. Code and models will be released.
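
The idea of matching relations rather than individual predictions can be illustrated with a small sketch. It is only a generic relation-matching loss under simplifying assumptions: cosine similarity as the affinity, a dense graph over one toy batch, and no virtual views or edge pruning. The function names are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def affinity_graph(preds):
    """Dense N x N graph whose edge (i, j) is the similarity of samples i and j."""
    return [[cosine(p, q) for q in preds] for p in preds]


def relation_matching_loss(teacher_preds, student_preds):
    """Mean squared difference between teacher and student affinity graphs."""
    t_graph = affinity_graph(teacher_preds)
    s_graph = affinity_graph(student_preds)
    n = len(t_graph)
    total = sum(
        (t_graph[i][j] - s_graph[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)


# Toy logits for a batch of three samples over three classes.
teacher = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1], [1.9, 0.2, 0.1]]
student = [[1.0, 0.5, 0.2], [0.3, 1.0, 0.4], [0.2, 0.4, 1.0]]
loss_same = relation_matching_loss(teacher, teacher)
loss_diff = relation_matching_loss(teacher, student)
```

Here the teacher treats samples 0 and 2 as near-duplicates while the student does not, so the loss is positive even though no per-sample target is compared directly.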

[CV-55] CADDreamer: CAD object Generation from Single-view Images CVPR2025

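Quick read: This paper addresses the overly dense and unstructured meshes produced by existing diffusion-based 3D generative models, which contrast sharply with the compact, structured, sharp-edged Computer-Aided Design (CAD) models crafted by human designers. To bridge this gap, it proposes CADDreamer, a method for generating the boundary representation (B-rep) of a CAD object from a single image. The key is a primitive-aware multi-view diffusion model that captures both local geometric details and high-level structural semantics during generation; by encoding primitive semantics into the color domain, it leverages the strong priors of pre-trained diffusion models to align with well-defined primitives. Geometric optimization techniques and topology-preserving extraction methods then reduce noise and distortion in the generated primitives, yielding a complete and seamless B-rep. Experiments show that CADDreamer recovers high-quality CAD objects from single-view images and that, compared with existing 3D generation techniques, its B-rep models are compact in representation, clear in structure, sharp in edges, and watertight in topology.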

Link: https://arxiv.org/abs/2502.20732
Authors: Yuan Li, Cheng Lin, Yuan Liu, Xiaoxiao Long, Chenxu Zhang, Ningna Wang, Xin Li, Wenping Wang, Xiaohu Guo
Affiliations: University of Texas at Dallas; The University of Hong Kong; Hong Kong University of Science and Technology; Nanjing University; ByteDance; Texas A&M University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:Diffusion-based 3D generation has made remarkable progress in recent years. However, existing 3D generative models often produce overly dense and unstructured meshes, which stand in stark contrast to the compact, structured, and sharply-edged Computer-Aided Design (CAD) models crafted by human designers. To address this gap, we introduce CADDreamer, a novel approach for generating boundary representations (B-rep) of CAD objects from a single image. CADDreamer employs a primitive-aware multi-view diffusion model that captures both local geometric details and high-level structural semantics during the generation process. By encoding primitive semantics into the color domain, the method leverages the strong priors of pre-trained diffusion models to align with well-defined primitives. This enables the inference of multi-view normal maps and semantic maps from a single image, facilitating the reconstruction of a mesh with primitive labels. Furthermore, we introduce geometric optimization techniques and topology-preserving extraction methods to mitigate noise and distortion in the generated primitives. These enhancements result in a complete and seamless B-rep of the CAD model. Experimental results demonstrate that our method effectively recovers high-quality CAD objects from single-view images. Compared to existing 3D generation techniques, the B-rep models produced by CADDreamer are compact in representation, clear in structure, sharp in edges, and watertight in topology.

[CV-56] Glioma Classification using Multi-sequence MRI and Novel Wavelets-based Feature Fusion

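Quick read: This paper addresses the challenge of accurately classifying gliomas as Low Grade Glioma (LGG) or High Grade Glioma (HGG), which is complicated in magnetic resonance imaging (MRI) data by tumor heterogeneity and overlapping radiomic characteristics. The key to the solution is a novel wavelet-based fusion algorithm applied to multi-sequence MRI images (T1, T1-contrast enhanced (T1CE), T2, and FLAIR) to compute radiomic features, followed by principal component analysis (PCA) for dimensionality reduction and classification with XGBoost, Support Vector Machine (SVM), and Random Forest (RF). The SVM performed comparatively well on BraTS 2018 data, with reported accuracy of up to 91.34% and AUC of up to 94.60%, which suggests potential use in computer-aided diagnosis and grading systems for gliomas.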

Link: https://arxiv.org/abs/2502.20715
Authors: Kiranmayee Janardhan, Christy Bobby Thomas
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 11 figures, 6 tables, journal paper

Abstract: Glioma, a prevalent and heterogeneous tumor originating from the glial cells, can be differentiated as Low Grade Glioma (LGG) and High Grade Glioma (HGG) according to the World Health Organization’s norms. Classifying gliomas is essential for treatment protocols that depend extensively on subtype differentiation. For non-invasive glioma evaluation, Magnetic Resonance Imaging (MRI) offers vital information about the morphology and location of the tumor. The versatility of MRI allows the classification of gliomas as LGG and HGG based on their texture, perfusion, and diffusion characteristics, which in turn helps improve diagnosis and provide tailored treatments. Nevertheless, precise classification is complicated by tumor heterogeneity and overlapping radiomic characteristics. Thus, in this work, a novel wavelet-based fusion algorithm was implemented on multi-sequence T1, T1-contrast enhanced (T1CE), T2 and Fluid Attenuated Inversion Recovery (FLAIR) MRI images to compute the radiomics features. Furthermore, principal component analysis is applied to reduce the feature space, and XGBoost, Support Vector Machine, and Random Forest classifiers are used for the classification. The results show that the SVM algorithm performs comparatively well with an accuracy of 90.17%, precision of 91.04% and recall of 96.19%, F1-score of 93.53%, and AUC of 94.60% when implemented on BraTS 2018 dataset and with an accuracy of 91.34%, precision of 93.05% and recall of 96.13%, F1-score of 94.53%, and AUC of 93.71% for BraTS 2018 dataset. Thus, the proposed algorithm could potentially be implemented in a computer-aided diagnosis and grading system for gliomas.
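
As a quick sanity check on the reported metrics, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two figures the abstract gives. The sketch below only reproduces that arithmetic; the variable names are illustrative, and small differences from the reported F1 values are expected because the published precision and recall are themselves rounded or averaged.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported in the abstract.
f1_first = f1_score(91.04, 96.19)   # reported F1-score: 93.53
f1_second = f1_score(93.05, 96.13)  # reported F1-score: 94.53
```

Both recomputed values land within about 0.1 percentage points of the reported F1-scores, so the three metrics are mutually consistent.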

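[CV-57] Towards General Visual-Linguistic Face Forgery Detection (V2) CVPR2025

Quick read: This paper addresses hallucination issues in existing annotation approaches for face forgery detection, in particular inaccurate text descriptions of high-quality forgeries. The key to the solution is Face Forgery Text Generator (FFTG), a new annotation pipeline that uses forgery masks for initial region and type identification and then applies a comprehensive prompting strategy that guides Multimodal Large Language Models (MLLMs) toward fewer hallucinations. The approach is validated by fine-tuning CLIP with a three-branch training framework that combines unimodal and multimodal objectives, and by fine-tuning MLLMs on the structured annotations. Experiments show higher region identification accuracy in the annotations and improved model performance across several forgery detection benchmarks.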

Link: https://arxiv.org/abs/2502.20698
Authors: Ke Sun, Shen Chen, Taiping Yao, Ziyin Zhou, Jiayi Ji, Xiaoshuai Sun, Chia-Wen Lin, Rongrong Ji
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; Youtu Lab, Tencent, China; National Tsing Hua University, Taiwan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, Accepted by CVPR 2025

Abstract:Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust. Recent works demonstrate that leveraging multimodal models can enhance the generalization and interpretability of face forgery detection. However, existing annotation approaches, whether through human labeling or direct Multimodal Large Language Model (MLLM) generation, often suffer from hallucination issues, leading to inaccurate text descriptions, especially for high-quality forgeries. To address this, we propose Face Forgery Text Generator (FFTG), a novel annotation pipeline that generates accurate text descriptions by leveraging forgery masks for initial region and type identification, followed by a comprehensive prompting strategy to guide MLLMs in reducing hallucination. We validate our approach through fine-tuning both CLIP with a three-branch training framework combining unimodal and multimodal objectives, and MLLMs with our structured annotations. Experimental results demonstrate that our method not only achieves more accurate annotations with higher region identification accuracy, but also leads to improvements in model performance across various forgery detection benchmarks. Our Codes are available in this https URL.

[CV-58] WorldModelBench: Judging Video Generation Models As World Models

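Quick read: This paper addresses the lack of benchmarks that rigorously test whether video generation models can serve as world models for decision-making applications such as robotics and autonomous driving. Current benchmarks focus only on general video quality and ignore factors that matter for world models, such as physics adherence. To fill this gap, the paper proposes WorldModelBench, a benchmark that evaluates the world modeling capabilities of video generation models in application-driven domains. Its key contributions are: (1) instruction-following and physics-adherence dimensions that detect subtle world modeling violations overlooked by prior benchmarks, such as irregular changes in object size that breach the mass conservation law; and (2) alignment with large-scale human preferences, using 67K crowd-sourced human labels to measure 14 frontier models and to fine-tune a 2B-parameter judger that automates evaluation and predicts world modeling violations with 8.6% higher average accuracy than GPT-4o. The paper also shows that training to align with human annotations by maximizing the judger's rewards noticeably improves world modeling capability.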

Link: https://arxiv.org/abs/2502.20694
Authors: Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E. Gonzalez, Ion Stoica, Song Han, Yao Lu
Affiliations: UC Berkeley; UC San Diego; NVIDIA; MIT
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality and ignoring factors important to world models, such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Sensitive to nuanced world modeling violations: By incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the mass conservation law, issues overlooked by prior benchmarks. (2) Aligned with large-scale human preferences: We crowd-source 67K human labels to accurately measure 14 frontier models. Using our high-quality human labels, we further fine-tune an accurate judger to automate the evaluation procedure, achieving, with 2B parameters, 8.6% higher average accuracy in predicting world modeling violations than GPT-4o. In addition, we demonstrate that training to align with human annotations by maximizing the rewards from the judger noticeably improves the world modeling capability. The website is available at this https URL.

[CV-59] EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching

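Quick read: This paper addresses learning-based dense matching for omnidirectional images in equirectangular projection (ERP). With their large fields of view, ERP images are well suited to establishing comprehensive correspondences across images, but their significant distortions limit existing dense matching algorithms. The proposed method, Equirectangular Projection-Oriented Dense Kernelized Feature Matching (EDM), handles these distortions with a spherical camera model and geodesic flow refinement, and mitigates them further with spherical positional embeddings based on the 3D Cartesian coordinates of the feature grid. During refinement it also applies bidirectional transformations between spherical and Cartesian coordinate systems, using a unit sphere to improve matching. On the Matterport3D and Stanford2D3D datasets, AUC@5° improves by +26.72 and +42.62, respectively.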

Link: https://arxiv.org/abs/2502.20685
Authors: Dongki Jung, Jaehoon Choi, Yonghan Lee, Somi Jeong, Taejae Lee, Dinesh Manocha, Suyong Yeon
Affiliations: NAVER LABS; University of Maryland
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce the first learning-based dense matching algorithm, termed Equirectangular Projection-Oriented Dense Kernelized Feature Matching (EDM), specifically designed for omnidirectional images. Equirectangular projection (ERP) images, with their large fields of view, are particularly suited for dense matching techniques that aim to establish comprehensive correspondences across images. However, ERP images are subject to significant distortions, which we address by leveraging the spherical camera model and geodesic flow refinement in the dense matching method. To further mitigate these distortions, we propose spherical positional embeddings based on 3D Cartesian coordinates of the feature grid. Additionally, our method incorporates bidirectional transformations between spherical and Cartesian coordinate systems during refinement, utilizing a unit sphere to improve matching performance. We demonstrate that our proposed method achieves notable performance enhancements, with improvements of +26.72 and +42.62 in AUC@5° on the Matterport3D and Stanford2D3D datasets.
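
To make the spherical coordinates in this abstract concrete, the sketch below maps an ERP pixel to a point on the unit sphere and measures the geodesic (great-circle) distance between two such points. The axis convention (longitude zero at the image center, y pointing up) and the function names are assumptions chosen for illustration; the paper may use a different convention.

```python
import math


def erp_to_unit_sphere(u, v, width, height):
    """Map ERP pixel (u, v) to 3D Cartesian coordinates on the unit sphere.

    Assumed convention: u in [0, width) spans longitude [-pi, pi) and
    v in [0, height] spans latitude from +pi/2 (top) to -pi/2 (bottom).
    """
    lon = (u / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def geodesic_distance(p, q):
    """Great-circle distance (radians) between two unit vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.acos(dot)


# Image center and a pixel a quarter of the image width to its right.
center = erp_to_unit_sphere(512, 256, 1024, 512)
quarter = erp_to_unit_sphere(768, 256, 1024, 512)
angle = geodesic_distance(center, quarter)
```

Because the 3D coordinates are continuous across the left and right image borders, they avoid the wrap-around discontinuity that plain pixel coordinates have in ERP images.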

[CV-60] Diffusion Restoration Adapter for Real-World Image Restoration

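Quick read: This paper addresses the parameter growth of existing diffusion-based image restoration methods: techniques such as ControlNet copy a large part of the original network, so the parameter count rises significantly as the prior model scales up. The key to the solution is a relatively lightweight Adapter that leverages the generative capabilities of pretrained priors to achieve photo-realistic image restoration and can be applied to both denoising UNet and DiT models.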

Link: https://arxiv.org/abs/2502.20679
Authors: Hanbang Liang, Zhen Wang, Weihui Deng
Affiliations: ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Diffusion models have demonstrated their powerful image generation capabilities, effectively fitting highly complex image distributions. These models can serve as strong priors for image restoration. Existing methods often utilize techniques like ControlNet to sample high-quality images from these priors conditioned on low-quality images. However, ControlNet typically involves copying a large part of the original network, resulting in a significantly large number of parameters as the prior scales up. In this paper, we propose a relatively lightweight Adapter that leverages the powerful generative capabilities of pretrained priors to achieve photo-realistic image restoration. The Adapter can be applied to both denoising UNet and DiT, and performs excellently.

[CV-61] STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding CVPR’25

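Quick read: This paper addresses Weakly Supervised Spatio-Temporal Video Grounding (WSTVG): localizing subjects in videos in space and time using only textual queries, with no bounding box supervision. It first investigates vision-language foundation models for WSTVG through their zero-shot grounding capabilities, and finds that a simple adaptation lacks the required spatio-temporal grounding ability. It therefore introduces Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. TRG, however, struggles with compositional action understanding and dense scenes. To overcome these limitations, the paper proposes STPro, a progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding; and (2) Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by gradually increasing spatial task difficulty. STPro achieves state-of-the-art results on three benchmark datasets, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1.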

Link: https://arxiv.org/abs/2502.20678
Authors: Aaryan Garg, Akash Kumar, Yogesh S Rawat
Affiliations: BITS Pilani; University of Central Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR’25 Conference

Abstract:In this work we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scene scenarios. To address these limitations, we propose STPro, a novel progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding, and (2) Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by spatially increasing task difficulty. STPro achieves state-of-the-art results on three benchmark datasets, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1.

[CV-62] SciceVPR: Stable Cross-Image Correlation Enhanced Model for Visual Place Recognition

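Quick read: This paper addresses Visual Place Recognition (VPR), the task of predicting the location of an image from its visual features alone. State-of-the-art methods extract global descriptors with DINOv2 as the backbone and either explore cross-image correlation or adopt a time-consuming two-stage re-ranking strategy to improve performance. However, they use only the final output of DINOv2, and current cross-image correlation leads to unstable retrieval results. The paper proposes SciceVPR, a stable cross-image correlation enhanced model. Its key idea is to exploit the full potential of DINOv2 feature representations, which implicitly encode valuable contextual knowledge. First, a multi-layer feature fusion module captures increasingly detailed task-relevant channel and spatial information from the multi-layer output of DINOv2. Second, the invariant correlation between images within a batch is treated as valuable knowledge and distilled into a self-enhanced encoder, so that the model obtains global features that remain robust under domain shifts such as changes in illumination, weather, and viewpoint. In experiments, the base variant SciceVPR-B outperforms state-of-the-art one-stage methods with single input on multiple datasets with varying domain conditions, and the large variant SciceVPR-L performs on par with state-of-the-art two-stage models, scoring over 3% higher in Recall@1 than existing models on the challenging Tokyo24/7 dataset.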

Link: https://arxiv.org/abs/2502.20676
Authors: Shanshan Wan, Yingmei Wei, Lai Kang, Tianrui Shen, Haixuan Wang, Yee-Hong Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual Place Recognition (VPR) is a major challenge for robotics and autonomous systems, with the goal of predicting the location of an image based solely on its visual features. State-of-the-art (SOTA) models extract global descriptors using the powerful foundation model DINOv2 as backbone. These models either explore the cross-image correlation or propose a time-consuming two-stage re-ranking strategy to achieve better performance. However, existing works only utilize the final output of DINOv2, and the current cross-image correlation causes unstable retrieval results. To produce both discriminative and constant global descriptors, this paper proposes stable cross-image correlation enhanced model for VPR called SciceVPR. This model explores the full potential of DINOv2 in providing useful feature representations that implicitly encode valuable contextual knowledge. Specifically, SciceVPR first uses a multi-layer feature fusion module to capture increasingly detailed task-relevant channel and spatial information from the multi-layer output of DINOv2. Secondly, SciceVPR considers the invariant correlation between images within a batch as valuable knowledge to be distilled into the proposed self-enhanced encoder. In this way, SciceVPR can acquire fairly robust global features regardless of domain shifts (e.g., changes in illumination, weather and viewpoint between pictures taken in the same place). Experimental results demonstrate that the base variant, SciceVPR-B, outperforms SOTA one-stage methods with single input on multiple datasets with varying domain conditions. The large variant, SciceVPR-L, performs on par with SOTA two-stage models, scoring over 3% higher in Recall@1 compared to existing models on the challenging Tokyo24/7 dataset. Our code will be released at this https URL.

[CV-63] EndoPBR: Material and Lighting Estimation for Photorealistic Surgical Simulations via Physically-based Rendering

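Quick read: This paper addresses the lack of labeled datasets in 3D vision for surgical scenes, which holds back the development of robust 3D reconstruction algorithms in the medical domain. Although Neural Radiance Fields and 3D Gaussian Splatting are popular in the general computer vision community, they have yet to find consistent success in surgical scenes because of challenges such as non-stationary lighting and non-Lambertian surfaces. The paper proposes a differentiable rendering framework that estimates material and lighting from endoscopic images and known geometry. The key is to explicitly disentangle material and lighting, rather than modeling them jointly as radiance, to obtain robust and photorealistic novel view synthesis. Using properties specific to surgical scenes, it models the scene lighting as a simple spotlight and the material as a bidirectional reflectance distribution function (BRDF) parameterized by a neural network, and grounds color predictions in the rendering equation so that photorealistic images can be generated at arbitrary camera poses. On various sequences from the Colonoscopy 3D Video Dataset, the method produces novel view synthesis results competitive with other approaches, and fine-tuning a depth estimation model on the rendered outputs performs on par with fine-tuning on the original real images.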

Link: https://arxiv.org/abs/2502.20669
Authors: John J. Han, Jie Ying Wu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures

Abstract:The lack of labeled datasets in 3D vision for surgical scenes inhibits the development of robust 3D reconstruction algorithms in the medical domain. Despite the popularity of Neural Radiance Fields and 3D Gaussian Splatting in the general computer vision community, these systems have yet to find consistent success in surgical scenes due to challenges such as non-stationary lighting and non-Lambertian surfaces. As a result, the need for labeled surgical datasets continues to grow. In this work, we introduce a differentiable rendering framework for material and lighting estimation from endoscopic images and known geometry. Compared to previous approaches that model lighting and material jointly as radiance, we explicitly disentangle these scene properties for robust and photorealistic novel view synthesis. To disambiguate the training process, we formulate domain-specific properties inherent in surgical scenes. Specifically, we model the scene lighting as a simple spotlight and material properties as a bidirectional reflectance distribution function, parameterized by a neural network. By grounding color predictions in the rendering equation, we can generate photorealistic images at arbitrary camera poses. We evaluate our method with various sequences from the Colonoscopy 3D Video Dataset and show that our method produces competitive novel view synthesis results compared with other approaches. Furthermore, we demonstrate that synthetic data can be used to develop 3D vision algorithms by finetuning a depth estimation model with our rendered outputs. Overall, we see that the depth estimation performance is on par with fine-tuning with the original real images.

[CV-64] OpenEarthSensing: Large-Scale Fine-Grained Benchmark for Open-World Remote Sensing

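Quick read: This paper addresses the challenges that models in open-world remote sensing face under a steady influx of new data: detecting semantic shifts, adapting to covariate shifts, and continuously updating themselves. Existing studies typically train and test within a single dataset to simulate open-world conditions, and large-scale benchmarks capable of evaluating multiple open-world tasks are lacking. The paper proposes OpenEarthSensing, a large-scale fine-grained benchmark for open-world remote sensing. It includes 189 scene and object categories and covers five data domains with significant covariate shifts, providing a more comprehensive testbed for evaluating the generalization performance of open-world models. The key to the solution is the construction of this large and diverse benchmark, on which baseline evaluations of current mainstream open-world tasks and methods show that it is challenging.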

Link: https://arxiv.org/abs/2502.20668
Authors: Xiang Xiang, Zhuo Xu, Yao Deng, Qinhao Zhou, Yifan Liang, Ke Chen, Qingfang Zheng, Yaowei Wang, Xilin Chen, Wen Gao
Affiliations: Huazhong University of Science and Technology; Peng Cheng Laboratory; Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract: In open-world remote sensing, deployed models must continuously adapt to a steady influx of new data, which often exhibits various shifts compared to what the model encountered during the training phase. To effectively handle the new data, models are required to detect semantic shifts, adapt to covariate shifts, and continuously update themselves. These challenges give rise to a variety of open-world tasks. However, existing open-world remote sensing studies typically train and test within a single dataset to simulate open-world conditions. Currently, there is a lack of large-scale benchmarks capable of evaluating multiple open-world tasks. In this paper, we introduce OpenEarthSensing, a large-scale fine-grained benchmark for open-world remote sensing. OpenEarthSensing includes 189 scene and object categories, covering the vast majority of potential semantic shifts that may occur in the real world. Additionally, OpenEarthSensing encompasses five data domains with significant covariate shifts, including two RGB satellite domains, one RGB aerial domain, one MS RGB domain, and one infrared domain. The various domains provide a more comprehensive testbed for evaluating the generalization performance of open-world models. We conduct the baseline evaluation of current mainstream open-world tasks and methods on OpenEarthSensing, demonstrating that it serves as a challenging benchmark for open-world remote sensing.

[CV-65] Advancing AI-Powered Medical Image Synthesis: Insights from MedVQA-GI Challenge Using CLIP Fine-Tuned Stable Diffusion and Dream-Booth LoRA

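Quick read: This paper addresses the limitation that existing methods focus mainly on static image analysis and cannot dynamically generate medical imagery from textual descriptions. To narrow this gap, it proposes an approach based on fine-tuned generative models that produces dynamic, scalable, and precise medical images from text. The key to the solution is the integration of fine-tuned Stable Diffusion and DreamBooth models with Low-Rank Adaptation (LoRA) to generate high-fidelity medical images, organized around two sub-tasks: image synthesis (IS) and optimal prompt generation (OPG). The evaluation shows that Stable Diffusion surpasses CLIP and DreamBooth + LoRA in producing high-quality, diverse images, with the lowest Fréchet Inception Distance (FID) scores and the highest average Inception Score, which the authors present as an advance for AI-powered medical diagnosis.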

Link: https://arxiv.org/abs/2502.20667
Authors: Ojonugwa Oluwafemi Ejiga Peter, Md Mahmudur Rahman, Fahmi Khalifa
Affiliations: Department of Computer Science, SCMNS School, Morgan State University; Electrical & Computer Engineering Dept., School of Engineering, Morgan State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: The MEDVQA-GI challenge addresses the integration of AI-driven text-to-image generative models in medical diagnostics, aiming to enhance diagnostic capabilities through synthetic image generation. Existing methods primarily focus on static image analysis and lack the dynamic generation of medical imagery from textual descriptions. This study intends to partially close this gap by introducing a novel approach based on fine-tuned generative models to generate dynamic, scalable, and precise images from textual descriptions. In particular, our system integrates fine-tuned Stable Diffusion and DreamBooth models, as well as Low-Rank Adaptation (LoRA), to generate high-fidelity medical images. The problem comprises two sub-tasks: image synthesis (IS) and optimal prompt production (OPG). The former creates medical images via verbal prompts, whereas the latter provides prompts that produce high-quality images in specified categories. The study emphasizes the limitations of traditional medical image generation methods, such as hand sketching, constrained datasets, static procedures, and generic models. Our evaluation measures showed that Stable Diffusion surpasses CLIP and DreamBooth + LoRA in terms of producing high-quality, diversified images. Specifically, Stable Diffusion had the lowest Fréchet Inception Distance (FID) scores (0.099 for single center, 0.064 for multi-center, and 0.067 for combined), indicating higher image quality. Furthermore, it had the highest average Inception Score (2.327 across all datasets), indicating exceptional diversity and quality. This advances the field of AI-powered medical diagnosis. Future research will concentrate on model refining, dataset augmentation, and ethical considerations for efficiently implementing these advances into clinical practice.

[CV-66] Dataset Distillation with Neural Characteristic Function: A Minmax Perspective CVPR2025

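Quick read: This paper addresses the problem that the distance metrics used in existing distribution matching-based dataset distillation often fail to capture distributional differences accurately, which makes the measured discrepancy unreliable. Its key innovation is to reformulate dataset distillation as a minmax optimization problem and to introduce Neural Characteristic Function Discrepancy (NCFD), a comprehensive and theoretically grounded metric for distributional differences. NCFD uses the Characteristic Function (CF) to encapsulate full distributional information and employs a neural network to optimize the sampling strategy for the CF's frequency arguments, maximizing the discrepancy to sharpen the distance estimate. At the same time, the difference between real and synthetic data is minimized under this optimized NCFD measure. The resulting method, Neural Characteristic Function Matching (NCFM), aligns the phase and amplitude of neural features in the complex plane, balancing realism and diversity in the synthetic samples. Experiments show significant gains over existing methods on both low- and high-resolution datasets, including a 20.5% accuracy boost on ImageSquawk, a reduction in GPU memory usage of over 300×, and 20× faster processing. According to the authors, it is also the first work to achieve lossless compression of CIFAR-100 on a single NVIDIA 2080 Ti GPU, using only 2.3 GB of memory.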

Link: https://arxiv.org/abs/2502.20653
Authors: Shaobo Wang, Yicun Yang, Zhiyuan Liu, Chenghao Sun, Xuming Hu, Conghui He, Linfeng Zhang
Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University; EPIC Lab, Shanghai Jiao Tong University; Hong Kong University of Science and Technology, Guangzhou; Shanghai Artificial Intelligence Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by CVPR 2025, 11 pages, 7 figures

Abstract:Dataset distillation has emerged as a powerful approach for reducing data requirements in deep learning. Among various methods, distribution matching-based approaches stand out for their balance of computational efficiency and strong performance. However, existing distance metrics used in distribution matching often fail to accurately capture distributional differences, leading to unreliable measures of discrepancy. In this paper, we reformulate dataset distillation as a minmax optimization problem and introduce Neural Characteristic Function Discrepancy (NCFD), a comprehensive and theoretically grounded metric for measuring distributional differences. NCFD leverages the Characteristic Function (CF) to encapsulate full distributional information, employing a neural network to optimize the sampling strategy for the CF’s frequency arguments, thereby maximizing the discrepancy to enhance distance estimation. Simultaneously, we minimize the difference between real and synthetic data under this optimized NCFD measure. Our approach, termed Neural Characteristic Function Matching (NCFM), inherently aligns the phase and amplitude of neural features in the complex plane for both real and synthetic data, achieving a balance between realism and diversity in synthetic samples. Experiments demonstrate that our method achieves significant performance gains over state-of-the-art methods on both low- and high-resolution datasets. Notably, we achieve a 20.5% accuracy boost on ImageSquawk. Our method also reduces GPU memory usage by over 300× and achieves 20× faster processing speeds compared to state-of-the-art methods. To the best of our knowledge, this is the first work to achieve lossless compression of CIFAR-100 on a single NVIDIA 2080 Ti GPU using only 2.3 GB of memory.
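
The characteristic-function discrepancy described above can be illustrated with a small sketch in plain Python. It compares the empirical characteristic functions of two sample sets at a fixed set of Gaussian frequency vectors. This is an illustration under stated assumptions, not the paper's implementation: the names `empirical_cf` and `cf_discrepancy` are hypothetical, the frequencies are sampled once instead of being chosen by a neural network, and the maximization step of the minmax formulation is left out.

```python
import cmath
import random


def empirical_cf(samples, t):
    """Empirical characteristic function: mean of exp(i * <t, x>) over the samples."""
    total = 0j
    for x in samples:
        dot = sum(ti * xi for ti, xi in zip(t, x))
        total += cmath.exp(1j * dot)
    return total / len(samples)


def cf_discrepancy(real, synth, freqs):
    """Mean squared modulus of the CF difference over the given frequency vectors."""
    return sum(
        abs(empirical_cf(real, t) - empirical_cf(synth, t)) ** 2 for t in freqs
    ) / len(freqs)


# Toy data (assumption): 2-D Gaussian features standing in for network features.
rng = random.Random(0)
dim = 2
freqs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(64)]
real = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(500)]
close = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(500)]  # same distribution
far = [[rng.gauss(3, 1) for _ in range(dim)] for _ in range(500)]    # shifted mean

d_close = cf_discrepancy(real, close, freqs)
d_far = cf_discrepancy(real, far, freqs)
print(f"same distribution: {d_close:.4f}, shifted distribution: {d_far:.4f}")
```

In this toy setting the shifted set scores a larger discrepancy than a second draw from the same distribution. The method in the abstract goes further by learning where to sample frequencies so that the gap is as large as possible, then minimizing it with respect to the synthetic data.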

[CV-67] The Common Objects Underwater (COU) Dataset for Robust Underwater Object Detection

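Quick read: This paper addresses the lack of underwater instance segmentation datasets with robust class coverage, as well as the limited diversity of object classes in existing underwater image datasets. Both are key obstacles to training lightweight, real-time detectors for Autonomous Underwater Vehicles (AUVs). Currently available underwater image datasets focus mainly on marine life and neglect the diversity of man-made objects and the needs of practical applications. The proposed solution is COU (Common Objects Underwater), a dataset of roughly 10,000 annotated instance-segmented images covering 24 classes of common man-made underwater objects (such as marine debris, dive tools, and AUVs), collected in both closed-water (pool) and open-water (lake and ocean) environments. Evaluation with three state-of-the-art models shows that detectors trained on COU outperform those trained only on terrestrial data, which demonstrates the benefit of annotated underwater images.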

Link: https://arxiv.org/abs/2502.20651
Authors: Rishi Mukherjee, Sakshi Singh, Jack McWilliams, Junaed Sattar
Affiliations: University of Minnesota–Twin Cities
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce COU: Common Objects Underwater, an instance-segmented image dataset of commonly found man-made objects in multiple aquatic and marine environments. COU contains approximately 10K segmented images, annotated from images collected during a number of underwater robot field trials in diverse locations. COU has been created to address the lack of datasets with robust class coverage curated for underwater instance segmentation, which is particularly useful for training light-weight, real-time capable detectors for Autonomous Underwater Vehicles (AUVs). In addition, COU addresses the lack of diversity in object classes since the commonly available underwater image datasets focus only on marine life. Currently, COU contains images from both closed-water (pool) and open-water (lakes and oceans) environments, of 24 different classes of objects including marine debris, dive tools, and AUVs. To assess the efficacy of COU in training underwater object detectors, we use three state-of-the-art models to evaluate its performance and accuracy, using a combination of standard accuracy and efficiency metrics. The improved performance of COU-trained detectors over those solely trained on terrestrial data demonstrates the clear advantage of training with annotated underwater images. We make COU available for broad use under open-source licenses.

[CV-68] Gungnir: Exploiting Stylistic Features in Images for Backdoor Attacks on Diffusion Models

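Quick read: This paper examines the vulnerability of Diffusion Models (DMs) to backdoor attacks in image generation, and in particular the difficulty existing defenses have with advanced attacks based on hidden triggers. Its key contribution is Gungnir, a new method that is the first to use stylistic features as hidden triggers. By combining Reconstructing-Adversarial Noise (RAN) with Short-Term-Timesteps-Retention (STTR), it carries out backdoor attacks in image-to-image (image2image) tasks. Experiments show that the method easily bypasses existing defenses and achieves a 0% backdoor detection rate (BDR) under mainstream backdoor defense frameworks.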

Link: https://arxiv.org/abs/2502.20650
Authors: Yu Pan, Bingrong Dai, Jiahao Chen, Lin Wang, Yi Du, Jiao Liu
Affiliations: School of Computer and Information Engineering, Shanghai Polytechnic University, China; Shanghai Development Center of Computer Software Technology, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:In recent years, Diffusion Models (DMs) have demonstrated significant advances in the field of image generation. However, according to current research, DMs are vulnerable to backdoor attacks, which allow attackers to control the model’s output by inputting data containing covert triggers, such as a specific patch or phrase. Existing defense strategies are well equipped to thwart such attacks through backdoor detection and trigger inversion because previous attack methods are constrained by limited input spaces and triggers defined by low-dimensional features. To bridge these gaps, we propose Gungnir, a novel method that enables attackers to activate the backdoor in DMs through hidden style triggers within input images. Our approach proposes using stylistic features as triggers for the first time and implements backdoor attacks successfully in image2image tasks by utilizing Reconstructing-Adversarial Noise (RAN) and Short-Term-Timesteps-Retention (STTR) of DMs. Meanwhile, experiments demonstrate that our method can easily bypass existing defense methods. Among existing DM main backdoor defense frameworks, our approach achieves a 0% backdoor detection rate (BDR). Our codes are available at this https URL.

[CV-69] EDENet: Echo Direction Encoding Network for Place Recognition Based on Ground Penetrating Radar

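Quick read: This paper tackles place recognition (PR) with ground penetrating radar (GPR) in large-scale maps, where the sparsity of underground features and variation in dielectric constants make robust localization difficult. Its key idea is to exploit the geometric relationship between GPR echo sequences and underground scenes: learnable Gabor filters extract directional responses precisely, and a direction-aware attention mechanism encodes the geometry. A shift-invariant unit and a multi-scale aggregation strategy further improve tolerance to dielectric constant variation. The proposed EDENet surpasses existing methods in PR performance while being smaller and more computationally efficient.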

Link: https://arxiv.org/abs/2502.20643
Authors: Pengyu Zhang, Xieyuanli Chen, Yuwei Chen, Beizhen Bi, Zhuo Xu, Tian Jin, Xiaotao Huang, Liang Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Ground penetrating radar (GPR) based localization has gained significant recognition in robotics due to its ability to detect stable subsurface features, offering advantages in environments where traditional sensors like cameras and LiDAR may struggle. However, existing methods are primarily focused on small-scale place recognition (PR), leaving the challenges of PR in large-scale maps unaddressed. These challenges include the inherent sparsity of underground features and the variability in underground dielectric constants, which complicate robust localization. In this work, we investigate the geometric relationship between GPR echo sequences and underground scenes, leveraging the robustness of directional features to inform our network design. We introduce learnable Gabor filters for the precise extraction of directional responses, coupled with a direction-aware attention mechanism for effective geometric encoding. To further enhance performance, we incorporate a shift-invariant unit and a multi-scale aggregation strategy to better accommodate variations in dielectric constants. Experiments conducted on public datasets demonstrate that our proposed EDENet not only surpasses existing solutions in terms of PR performance but also offers advantages in model size and computational efficiency.
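
To make the idea of a directional response concrete, here is a minimal sketch of a standard (fixed, not learned) 2-D Gabor kernel in plain Python. The function names and all parameter values are assumptions chosen for illustration; in the paper the filter parameters are learnable and are applied to GPR echo data, which this toy example does not attempt to reproduce.

```python
import math


def gabor_kernel(size, theta, sigma=2.0, wavelength=4.0, gamma=0.5, psi=0.0):
    """Real part of a 2-D Gabor filter whose carrier oscillates along direction theta."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so that xr runs along the filter direction.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def response(patch, kernel):
    """Inner product of an image patch with a kernel of the same shape."""
    return sum(p * k for prow, krow in zip(patch, kernel) for p, k in zip(prow, krow))


# Toy patch (assumption): stripes that vary along x with a period of 4 pixels.
patch = [[math.cos(2 * math.pi * x / 4.0) for x in range(-4, 5)] for _ in range(-4, 5)]

r_aligned = response(patch, gabor_kernel(9, theta=0.0))
r_orthogonal = response(patch, gabor_kernel(9, theta=math.pi / 2))
print(f"aligned: {r_aligned:.3f}, orthogonal: {r_orthogonal:.3f}")
```

The filter tuned to the stripe direction responds far more strongly than the orthogonal one, which is the property a bank of such filters uses to encode orientation.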

[CV-70] TractCloud-FOV: Deep Learning-based Robust Tractography Parcellation in Diffusion MRI with Incomplete Field of View

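Quick read: This paper addresses incomplete brain field of view (FOV) in clinical scans, which leaves fiber tracts partial or truncated and degrades the accuracy of tractography parcellation. It proposes TractCloud-FOV, a deep learning framework that robustly parcellates tractography under incomplete FOV. The key component is a data augmentation strategy, FOV-Cut Augmentation (FOV-CA), which synthetically cuts tractograms to simulate a range of real-world inferior FOV cutoff scenarios. This enriches the training set with realistic truncated streamlines and improves generalization. The method significantly improves streamline classification accuracy, generalization, and anatomical depiction of tracts while remaining computationally efficient.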

Link: https://arxiv.org/abs/2502.20637
Authors: Yuqian Chen, Leo Zekelman, Yui Lo, Suheyla Cetin-Karayumak, Tengfei Xue, Yogesh Rathi, Nikos Makris, Fan Zhang, Weidong Cai, Lauren J. O’Donnell
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Tractography parcellation classifies streamlines reconstructed from diffusion MRI into anatomically defined fiber tracts for clinical and research applications. However, clinical scans often have incomplete fields of view (FOV) where brain regions are partially imaged, leading to partial or truncated fiber tracts. To address this challenge, we introduce TractCloud-FOV, a deep learning framework that robustly parcellates tractography under conditions of incomplete FOV. We propose a novel training strategy, FOV-Cut Augmentation (FOV-CA), in which we synthetically cut tractograms to simulate a spectrum of real-world inferior FOV cutoff scenarios. This data augmentation approach enriches the training set with realistic truncated streamlines, enabling the model to achieve superior generalization. We evaluate the proposed TractCloud-FOV on both synthetically cut tractography and two real-life datasets with incomplete FOV. TractCloud-FOV significantly outperforms several state-of-the-art methods on all testing datasets in terms of streamline classification accuracy, generalization ability, tract anatomical depiction, and computational efficiency. Overall, TractCloud-FOV achieves efficient and consistent tractography parcellation in diffusion MRI with incomplete FOV.
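
The cutting step of such an augmentation can be sketched in a few lines of plain Python. The sketch assumes that each streamline is a list of (x, y, z) points, that z is the inferior-superior axis, and that an inferior FOV cutoff is an axial plane below which nothing is imaged. The names `fov_cut` and `random_fov_cut` and the uniform sampling of the cut level are hypothetical choices for illustration, not details taken from the paper.

```python
import random


def fov_cut(streamlines, z_cut, min_points=2):
    """Remove all points below the plane z = z_cut, splitting streamlines where they cross it."""
    out = []
    for streamline in streamlines:
        segment = []
        for point in streamline:
            if point[2] >= z_cut:
                segment.append(point)
            else:
                # The streamline leaves the field of view: close the current segment.
                if len(segment) >= min_points:
                    out.append(segment)
                segment = []
        if len(segment) >= min_points:
            out.append(segment)
    return out


def random_fov_cut(streamlines, z_min, z_max, rng):
    """Apply a cut at a level drawn uniformly from [z_min, z_max] (assumed sampling scheme)."""
    return fov_cut(streamlines, rng.uniform(z_min, z_max))


# Toy tractogram: one streamline crossing the plane, one fully inferior, one U-shaped.
through = [(0.0, 0.0, float(z)) for z in (-10, -5, 0, 5, 10)]
inferior = [(1.0, 0.0, -8.0), (1.0, 0.0, -6.0), (1.0, 0.0, -4.0)]
u_shaped = [(2.0, 0.0, 5.0), (2.0, 0.0, -5.0), (2.0, 0.0, 5.0), (2.0, 0.0, 6.0)]

cut = fov_cut([through, inferior, u_shaped], z_cut=0.0)
augmented = random_fov_cut([through, inferior, u_shaped], -6.0, 2.0, random.Random(0))
```

Splitting at the plane, instead of simply filtering points, avoids joining the two ends of a streamline that dips out of the field of view and returns.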

[CV-71] Subtask-Aware Visual Reward Learning from Segmented Demonstrations

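Quick read: This paper addresses the heavy reliance of Reinforcement Learning (RL) agents on human-engineered reward functions, which is especially limiting in real-world settings where target behavior information is unavailable. Designing such rewards typically requires extensive trial and error, which restricts their use in complex tasks. The paper proposes REDS (REward learning from Demonstration with Segmentations), a framework that learns rewards from action-free videos segmented into subtasks. REDS treats video demonstration segments from diverse sources as ground-truth reward signals for the subtasks and trains a dense reward function by minimizing the Equivalent-Policy Invariant Comparison distance, keeping it consistent with those signals. A contrastive learning objective additionally aligns video representations with subtasks so that subtasks can be inferred precisely during online interaction. As a result, REDS performs well on complex robotic manipulation tasks and on more challenging real-world tasks with little human intervention, and it generalizes to unseen tasks and robot embodiments.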

Link: https://arxiv.org/abs/2502.20630
Authors: Changyeon Kim, Minho Heo, Doohyun Lee, Jinwoo Shin, Honglak Lee, Joseph J. Lim, Kimin Lee
Affiliations: KAIST; University of Michigan; LG AI Research
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project webpage: this https URL

Abstract:Reinforcement Learning (RL) agents have demonstrated their potential across various robotic tasks. However, they still heavily rely on human-engineered reward functions, requiring extensive trial-and-error and access to target behavior information, often unavailable in real-world settings. This paper introduces REDS: REward learning from Demonstration with Segmentations, a novel reward learning framework that leverages action-free videos with minimal supervision. Specifically, REDS employs video demonstrations segmented into subtasks from diverse sources and treats these segments as ground-truth rewards. We train a dense reward function conditioned on video segments and their corresponding subtasks to ensure alignment with ground-truth reward signals by minimizing the Equivalent-Policy Invariant Comparison distance. Additionally, we employ contrastive learning objectives to align video representations with subtasks, ensuring precise subtask inference during online interactions. Our experiments show that REDS significantly outperforms baseline methods on complex robotic manipulation tasks in Meta-World and more challenging real-world tasks, such as furniture assembly in FurnitureBench, with minimal human intervention. Moreover, REDS facilitates generalization to unseen tasks and robot embodiments, highlighting its potential for scalable deployment in diverse environments.

[CV-72] T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting CVPR2025

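Quick read: This paper addresses zero-shot object counting, where existing methods built on vision-language models such as CLIP show limited sensitivity to text prompts. It proposes T2ICount, a diffusion-based framework that draws on the rich prior knowledge and fine-grained visual understanding of pretrained diffusion models. One-step denoising keeps the method efficient but further weakens text sensitivity. To counter this, the paper introduces a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that derives reliable supervision from the cross-attention maps of the denoising U-Net. The paper also notes that current benchmarks focus mainly on the majority objects in an image, which can mask a model's text sensitivity, and contributes a re-annotated subset of FSC147 for evaluating text-guided counting more thoroughly. Overall, the approach pairs an efficient diffusion backbone with mechanisms designed specifically to restore text sensitivity.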

Link: https://arxiv.org/abs/2502.20625
Authors: Yifei Qian, Zhongliang Guo, Bowen Deng, Chun Tong Lei, Shuai Zhao, Chun Pong Lau, Xiaopeng Hong, Michael P. Pound
Affiliations: University of Nottingham; University of St Andrews; City University of Hong Kong; Nanyang Technology University; Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025

Abstract:Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models’ text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at this https URL.

[CV-73] SafeText: Safe Text-to-image Models via Aligning the Text Encoder

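Quick read: This paper addresses the risk that text-to-image models generate harmful images when given unsafe prompts. Existing alignment methods mainly modify the diffusion module to prevent harmful generation, but this often changes the model's behavior on safe prompts substantially and degrades image quality. The paper proposes SafeText, an alignment method that fine-tunes the text encoder rather than the diffusion module. The fine-tuned encoder significantly alters the embedding vectors of unsafe prompts while barely affecting those of safe prompts, so the diffusion module produces non-harmful images for unsafe prompts and preserves image quality for safe ones. Experiments on multiple datasets of safe and unsafe prompts show that SafeText prevents harmful image generation with only minor impact on images for safe prompts and outperforms six existing alignment methods.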

Link: https://arxiv.org/abs/2502.20623
Authors: Yuepeng Hu, Zhengyuan Jiang, Neil Zhenqiang Gong
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image models can generate harmful images when presented with unsafe prompts, posing significant safety and societal risks. Alignment methods aim to modify these models to ensure they generate only non-harmful images, even when exposed to unsafe prompts. A typical text-to-image model comprises two main components: 1) a text encoder and 2) a diffusion module. Existing alignment methods mainly focus on modifying the diffusion module to prevent harmful image generation. However, this often significantly impacts the model’s behavior for safe prompts, causing substantial quality degradation of generated images. In this work, we propose SafeText, a novel alignment method that fine-tunes the text encoder rather than the diffusion module. By adjusting the text encoder, SafeText significantly alters the embedding vectors for unsafe prompts, while minimally affecting those for safe prompts. As a result, the diffusion module generates non-harmful images for unsafe prompts while preserving the quality of images for safe prompts. We evaluate SafeText on multiple datasets of safe and unsafe prompts, including those generated through jailbreak attacks. Our results show that SafeText effectively prevents harmful image generation with minor impact on the images for safe prompts, and SafeText outperforms six existing alignment methods. We will publish our code and data after paper acceptance.

[CV-74] RTGen: Real-Time Generative Detection Transformer

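Quick read: This paper addresses the structural redundancy and slow inference of existing generative object detectors, which simply append an autoregressive language model to a detector. It proposes the Real-Time GENerative Detection Transformer (RTGen), a real-time generative object detector. Its core is a novel Region-Language Decoder (RL-Decoder) that integrates a non-autoregressive language model into the detection decoder so that object and text information are processed concurrently. With this design, RTGen reaches an inference speed of 60.41 FPS and obtains 18.6 mAP on the LVIS dataset, outperforming the previous SOTA method by 3.5 mAP.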

Link: https://arxiv.org/abs/2502.20622
Authors: Chi Ruan
Affiliations: University of Ottawa
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While open-vocabulary object detectors require predefined categories during inference, generative object detectors overcome this limitation by endowing the model with text generation capabilities. However, existing generative object detection methods directly append an autoregressive language model to an object detector to generate texts for each detected object. This straightforward design leads to structural redundancy and increased processing time. In this paper, we propose a Real-Time GENerative Detection Transformer (RTGen), a real-time generative object detector with a succinct encoder-decoder architecture. Specifically, we introduce a novel Region-Language Decoder (RL-Decoder), which innovatively integrates a non-autoregressive language model into the detection decoder, enabling concurrent processing of object and text information. With these efficient designs, RTGen achieves a remarkable inference speed of 60.41 FPS. Moreover, RTGen obtains 18.6 mAP on the LVIS dataset, outperforming the previous SOTA method by 3.5 mAP.

[CV-75] Discovering Global False Negatives On the Fly for Self-supervised Contrastive Learning

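Quick read: This paper addresses "false negatives" in self-supervised contrastive learning, which cause embeddings to be pushed apart incorrectly. Negative pairs are conventionally formed from an anchor image and a different sample drawn from the entire dataset, so some pairs share similar semantics and become false negatives. The paper proposes GloFND, an optimization-based method that learns a threshold for each anchor on the fly to identify its false negatives during training. Unlike earlier methods that detect false negatives only locally within a mini-batch, GloFND detects them globally across the entire dataset, and its per-iteration computation cost is independent of dataset size. Experiments on image and image-text data demonstrate its effectiveness.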

Link: https://arxiv.org/abs/2502.20612
Authors: Vicente Balmaseda, Bokun Wang, Ching-Long Lin, Tianbao Yang
Affiliations: Texas A&M University; University of Iowa
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In self-supervised contrastive learning, negative pairs are typically constructed using an anchor image and a sample drawn from the entire dataset, excluding the anchor. However, this approach can result in the creation of negative pairs with similar semantics, referred to as “false negatives”, leading to their embeddings being falsely pushed apart. To address this issue, we introduce GloFND, an optimization-based approach that automatically learns on the fly the threshold for each anchor data to identify its false negatives during training. In contrast to previous methods for false negative discovery, our approach globally detects false negatives across the entire dataset rather than locally within the mini-batch. Moreover, its per-iteration computation cost remains independent of the dataset size. Experimental results on image and image-text data demonstrate the effectiveness of the proposed method. Our implementation is available at this https URL.
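
One way to see how a per-anchor threshold could be learned from mini-batches at a cost independent of dataset size is stochastic quantile tracking. The sketch below is an assumption made for illustration, not the paper's stated algorithm: it treats the threshold as the similarity level that only a fraction `alpha` of candidate negatives exceed, and nudges it after each batch. The names `update_threshold` and `false_negative_mask`, the value of `alpha`, and the uniform stand-in similarities are all hypothetical.

```python
import random


def update_threshold(lam, batch_sims, alpha, lr):
    """One stochastic step toward the (1 - alpha) quantile of the anchor's similarities."""
    frac_above = sum(1 for s in batch_sims if s > lam) / len(batch_sims)
    # Too many similarities above the threshold: raise it. Too few: lower it.
    return lam + lr * (frac_above - alpha)


def false_negative_mask(sims, lam):
    """Flag negatives whose similarity to the anchor exceeds the learned threshold."""
    return [s > lam for s in sims]


rng = random.Random(0)
alpha = 0.1   # assumed fraction of negatives treated as false negatives
lam = 0.5     # initial threshold for a single anchor
for _ in range(2000):
    # Stand-in for the anchor's similarities to one mini-batch of negatives.
    batch = [rng.random() for _ in range(32)]
    lam = update_threshold(lam, batch, alpha, lr=0.05)

mask = false_negative_mask([0.97, 0.40, 0.10], lam)
print(f"learned threshold: {lam:.3f}, mask: {mask}")
```

Each update touches only the current batch, yet over many iterations the threshold reflects similarities drawn from the whole dataset, which is the sense in which detection can be global while the per-step cost stays constant.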

[CV-76] Interpreting CLIP with Hierarchical Sparse Autoencoders

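Quick read: This paper asks how to improve the reconstruction quality of sparse autoencoders (SAEs) while keeping sparsity high, in order to better analyze and interpret large-scale vision-language models such as CLIP. Current methods struggle to optimize reconstruction quality and sparsity at the same time because they rely on either activation suppression or rigid sparsity constraints. The paper introduces Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously and optimizes both metrics directly without compromise. For CLIP, MSAE sets a new Pareto frontier between reconstruction quality and sparsity, reaching 0.99 cosine similarity and a fraction of variance unexplained below 0.1 at roughly 80% sparsity. MSAE is also shown to be a useful tool: it extracts more than 120 semantic concepts from CLIP's representation, enabling concept-based similarity search and bias analysis in downstream tasks.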

Link: https://arxiv.org/abs/2502.20578
Authors: Vladimir Zaigrajew, Hubert Baniecki, Przemyslaw Biecek
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods are limited by optimizing both reconstruction quality and sparsity simultaneously, as they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling a direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining ~80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks like CelebA.

[CV-77] InstaFace: Identity-Preserving Facial Editing with Single Image Inference

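Quick read: This paper addresses the difficulty of preserving identity when editing facial appearance with generative models under limited data. Traditional methods usually need multiple images and still suffer from unnatural face shifts, inconsistent hair alignment, or excessive smoothing. The paper proposes InstaFace, a diffusion-based framework that generates realistic images and preserves identity from a single image.

The solution has two key parts. First, an efficient guidance network captures 3D perspective information by integrating multiple conditionals based on the 3D Morphable Model (3DMM) without adding trainable parameters. Second, a novel module uses feature embeddings from a facial recognition model and a pre-trained vision-language model to maximize identity retention and to preserve the background, hair, and other contextual features such as accessories. Together these ensure effective identity preservation and highly realistic image generation.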

Link: https://arxiv.org/abs/2502.20577
Authors: MD Wahiduzzaman Khan, Mingshan Jia, Shaolin Zhang, En Yu, Kaska Musial-Gabrys
Affiliations: University of Technology Sydney, Australia; Shandong University of Science and Technology, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Facial appearance editing is crucial for digital avatars, AR/VR, and personalized content creation, driving realistic user experiences. However, preserving identity with generative models is challenging, especially in scenarios with limited data availability. Traditional methods often require multiple images and still struggle with unnatural face shifts, inconsistent hair alignment, or excessive smoothing effects. To overcome these challenges, we introduce a novel diffusion-based framework, InstaFace, to generate realistic images while preserving identity using only a single image. Central to InstaFace, we introduce an efficient guidance network that harnesses 3D perspectives by integrating multiple 3DMM-based conditionals without introducing additional trainable parameters. Moreover, to ensure maximum identity retention as well as preservation of background, hair, and other contextual features like accessories, we introduce a novel module that utilizes feature embeddings from a facial recognition model and a pre-trained vision-language model. Quantitative evaluations demonstrate that our method outperforms several state-of-the-art approaches in terms of identity preservation, photorealism, and effective control of pose, expression, and lighting.

[CV-78] LISArD: Learning Image Similarity to Defend Against Gray-box Adversarial Attacks

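Quick read: This paper addresses the limited robustness of deep learning models under realistic attack scenarios, specifically their vulnerability to gray-box attacks. Defense mechanisms are usually evaluated only against white-box attacks, whereas gray-box attacks assume that the attacker knows the target network's architecture and training dataset but cannot access its gradients, which is closer to real-world threats. The paper shows empirically that existing models are vulnerable to such attacks and proposes a new defense, LISArD. LISArD drives the cross-correlation matrix between the embeddings of perturbed and clean images toward a diagonal matrix while simultaneously training the classification task. It improves robustness to both gray-box and white-box attacks without adding computational or time cost and without relying on Adversarial Training (AT). Experiments also show that existing Adversarial Distillation (AD) models degrade markedly when AT is removed or when they are moved to the gray-box setting, underlining how poorly existing methods adapt across scenarios.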

Link: https://arxiv.org/abs/2502.20562
Authors: Joana C. Costa, Tiago Roxo, Hugo Proença, Pedro R. M. Inácio
Affiliations: Instituto de Telecomunicações, sins-lab, and Department of Computer Science, Universidade da Beira Interior, Portugal
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:State-of-the-art defense mechanisms are typically evaluated in the context of white-box attacks, which is not realistic, as it assumes the attacker can access the gradients of the target network. To protect against this scenario, Adversarial Training (AT) and Adversarial Distillation (AD) include adversarial examples during the training phase, and Adversarial Purification uses a generative model to reconstruct all the images given to the classifier. This paper considers an even more realistic evaluation scenario: gray-box attacks, which assume that the attacker knows the architecture and the dataset used to train the target network, but cannot access its gradients. We provide empirical evidence that models are vulnerable to gray-box attacks and propose LISArD, a defense mechanism that does not increase computational and temporal costs but provides robustness against gray-box and white-box attacks without including AT. Our method approximates a cross-correlation matrix, created with the embeddings of perturbed and clean images, to a diagonal matrix while simultaneously conducting classification learning. Our results show that LISArD can effectively protect against gray-box attacks, can be used in multiple architectures, and carries over its resilience to the white-box scenario. Also, state-of-the-art AD models underperform greatly when removing AT and/or moving to gray-box settings, highlighting the lack of robustness from existing approaches to perform in various conditions (aside from white-box settings). All the source code is available at this https URL.
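
The cross-correlation objective mentioned in the abstract can be sketched in plain Python. The sketch assumes that the target diagonal matrix is the identity and that off-diagonal terms are down-weighted by a small factor, in the style of redundancy-reduction losses; neither detail is stated in the abstract. The function names, `off_weight`, and the toy embeddings are hypothetical, and the classification term that is trained jointly is omitted.

```python
import math


def standardize(batch):
    """Zero-mean, unit-variance normalization of each embedding dimension over the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (col[i] - mean) / std
    return out


def cross_correlation(clean, perturbed):
    """d x d cross-correlation matrix between clean and perturbed embeddings."""
    za, zb = standardize(clean), standardize(perturbed)
    n, d = len(za), len(za[0])
    return [
        [sum(za[b][i] * zb[b][j] for b in range(n)) / n for j in range(d)]
        for i in range(d)
    ]


def diagonal_loss(c, off_weight=0.005):
    """Penalize diagonal entries far from 1 and off-diagonal entries far from 0."""
    d = len(c)
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + off_weight * off_diag


# Toy embeddings (assumption): 4 images, 2 uncorrelated dimensions.
clean = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
swapped = [[row[1], row[0]] for row in clean]  # a perturbation that scrambles dimensions

loss_same = diagonal_loss(cross_correlation(clean, clean))
loss_swapped = diagonal_loss(cross_correlation(clean, swapped))
print(f"identical embeddings: {loss_same:.3f}, scrambled embeddings: {loss_swapped:.3f}")
```

The loss is zero when perturbed and clean embeddings agree dimension by dimension and grows as the perturbation moves information between dimensions, which is the behavior a similarity-based defense would want to discourage.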

[CV-79] Finer Disentanglement of Aleatoric Uncertainty Can Accelerate Chemical Histopathology Imaging

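Quick read: This paper addresses an obstacle to the clinical adoption of label-free chemical imaging in digital pathology workflows, namely slow data acquisition. It proposes an adaptive strategy: quickly scan the low information (LI) content of the entire tissue, identify regions with high aleatoric uncertainty (AU), and selectively re-image those regions at higher quality to capture high information (HI) details. The crux is distinguishing high-AU regions that HI imaging can resolve from those it cannot. Because existing uncertainty frameworks cannot separate these subcategories, the paper proposes a fine-grained disentanglement method based on post-hoc latent space analysis to separate resolvable from irresolvable high-AU regions. Applied to infrared spectroscopic imaging of breast tissue, the method achieves better segmentation performance than a random baseline. The authors describe it as the first algorithmic study of fine-grained AU disentanglement within dynamic image spaces (LI to HI), with a new application to streamlining histopathology.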

Link: https://arxiv.org/abs/2502.20532
Authors: Ji-Hun Oh, Kianoush Falahkheirkhah, Rohit Bhargava
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Label-free chemical imaging holds significant promise for improving digital pathology workflows. However, data acquisition speed remains a limiting factor for smooth clinical transition. To address this gap, we propose an adaptive strategy: initially scan the low information (LI) content of the entire tissue quickly, identify regions with high aleatoric uncertainty (AU), and selectively re-image them at better quality to capture higher information (HI) details. The primary challenge lies in distinguishing between high-AU regions that can be mitigated through HI imaging and those that cannot. However, since existing uncertainty frameworks cannot separate such AU subcategories, we propose a fine-grained disentanglement method based on post-hoc latent space analysis to unmix resolvable from irresolvable high-AU regions. We apply our approach to efficiently image infrared spectroscopic data of breast tissues, achieving superior segmentation performance using the acquired HI data compared to a random baseline. This represents the first algorithmic study focused on fine-grained AU disentanglement within dynamic image spaces (LI-to-HI), with novel application to streamline histopathology.

[CV-80] On the Role of Individual Differences in Current Approaches to Computational Image Aesthetics

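Quick read: This paper addresses the complexity that image diversity and user subjectivity bring to Image Aesthetic Assessment (IAA), and in particular the lack of a theoretical understanding of transfer learning between Generic IAA (GIAA) and Personal IAA (PIAA). It proposes a unified model that encodes individual characteristics in a distributional format and supports both individual and group assessment. It shows that transferring from GIAA to PIAA involves extrapolation, while the reverse involves interpolation, which is generally more effective for machine learning. Experiments with different group compositions (such as sub-sampling by group size and disjoint demographics) reveal significant performance variation. The analysis identifies education as the primary factor behind aesthetic differences, followed by photography and art experience, and finds stronger individual subjectivity for artworks than for photos. Overall, the work lays a theoretical foundation for IAA, models GIAA and PIAA in a unified way, and improves generalization across demographics.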

Link: https://arxiv.org/abs/2502.20518
Authors: Li-Wei Chen, Ombretta Strafforello, Anne-Sofie Maerten, Tinne Tuytelaars, Johan Wagemans
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages

Abstract:Image aesthetic assessment (IAA) evaluates image aesthetics, a task complicated by image diversity and user subjectivity. Current approaches address this in two stages: Generic IAA (GIAA) models estimate mean aesthetic scores, while Personal IAA (PIAA) models adapt GIAA using transfer learning to incorporate user subjectivity. However, a theoretical understanding of transfer learning between GIAA and PIAA, particularly concerning the impact of group composition, group size, aesthetic differences between groups and individuals, and demographic correlations, is lacking. This work establishes a theoretical foundation for IAA, proposing a unified model that encodes individual characteristics in a distributional format for both individual and group assessments. We show that transferring from GIAA to PIAA involves extrapolation, while the reverse involves interpolation, which is generally more effective for machine learning. Experiments with varying group compositions, including sub-sampling by group size and disjoint demographics, reveal significant performance variation even for GIAA, indicating that mean scores do not fully eliminate individual subjectivity. Performance variations and Gini index analysis reveal education as the primary factor influencing aesthetic differences, followed by photography and art experience, with stronger individual subjectivity observed in artworks than in photos. Our model uniquely supports both GIAA and PIAA, enhancing generalization across demographics.

[CV-81] In-Model Merging for Enhancing the Robustness of Medical Imaging Classification Models

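Quick read: This paper asks whether merging can take place inside a single model to improve robustness, a need that is especially pressing in medical imaging. Conventional model merging combines multiple models to improve performance and, unlike ensemble learning, adds no extra computation at inference. The paper proposes in-model merging (InMerge), which improves robustness on classification tasks by selectively merging similar convolutional kernels in the deep layers of a single convolutional neural network (CNN) during training. It also analyzes the characteristics that determine how in-model merging should be performed and validates the technique on several CNN architectures and four prevalent datasets, where InMerge-trained models surpass conventionally trained ones by a substantial margin.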

Link: https://arxiv.org/abs/2502.20516
Authors: Hu Wang, Ibrahim Almakky, Congbo Ma, Numan Saeed, Mohammad Yaqub
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Model merging is an effective strategy to merge multiple models for enhancing model performances, and more efficient than ensemble learning as it will not introduce extra computation into inference. However, limited research explores if the merging process can occur within one model and enhance the model’s robustness, which is particularly critical in the medical image domain. In the paper, we are the first to propose in-model merging (InMerge), a novel approach that enhances the model’s robustness by selectively merging similar convolutional kernels in the deep layers of a single convolutional neural network (CNN) during the training process for classification. We also analytically reveal important characteristics that affect how in-model merging should be performed, serving as an insightful reference for the community. We demonstrate the feasibility and effectiveness of this technique for different CNN architectures on 4 prevalent datasets. The proposed InMerge-trained model surpasses the typically-trained model by a substantial margin. The code will be made public.

[CV-82] Best Foot Forward: Robust Foot Reconstruction in-the-wild

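【Quick read】: This paper addresses the limited accuracy of 3D foot reconstruction under incomplete scans and anatomical variation, particularly in self-scanning scenarios where restricted user mobility makes regions such as the arch and heel hard to capture. Its main contribution is an end-to-end pipeline that first resolves scan-alignment ambiguity using SE(3) canonicalization with a viewpoint prediction module, then completes missing geometry with an attention-based network trained on synthetically augmented point clouds. The method achieves state-of-the-art reconstruction performance while preserving clinically validated anatomical fidelity; the key is combining synthetic training data with learned geometric priors to make foot reconstruction robust under real-world capture conditions.

Link: https://arxiv.org/abs/2502.20511
Authors: Kyle Fogarty,Jing Yang,Chayan Kumar Patodi,Aadi Bhanti,Steven Chacko,Cengiz Oztireli,Ujwal Bonde
Affiliations: University of Cambridge; Hike Medical
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: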

Abstract:Accurate 3D foot reconstruction is crucial for personalized orthotics, digital healthcare, and virtual fittings. However, existing methods struggle with incomplete scans and anatomical variations, particularly in self-scanning scenarios where user mobility is limited, making it difficult to capture areas like the arch and heel. We present a novel end-to-end pipeline that refines Structure-from-Motion (SfM) reconstruction. It first resolves scan alignment ambiguities using SE(3) canonicalization with a viewpoint prediction module, then completes missing geometry through an attention-based network trained on synthetically augmented point clouds. Our approach achieves state-of-the-art performance on reconstruction metrics while preserving clinically validated anatomical fidelity. By combining synthetic training data with learned geometric priors, we enable robust foot reconstruction under real-world capture conditions, unlocking new opportunities for mobile-based 3D scanning in healthcare and retail.

[CV-83] CoCa-CXR: Contrastive Captioners Learn Strong Temporal Structures for Chest X-Ray Vision-Language Understanding

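【Quick read】: This paper addresses the under-explored problem of aligning the progression descriptions in chest X-ray (CXR) reports with the semantic differences between image pairs. Although CXR reports often refer to a prior image for comparison, existing methods rarely link these progression descriptions to differences between paired images. The paper proposes two components: (1) a CXR report processing pipeline that uses a large language model (LLM) to extract temporal structure, separate description and comparison contexts, and extract fine-grained annotations from reports; and (2) a contrastive captioner model, CoCa-CXR, that learns to describe both images and their temporal progression. CoCa-CXR introduces a novel regional cross-attention module to identify local differences between paired CXR images. Experiments show that CoCa-CXR outperforms previous methods on both progression analysis and report generation: on MS-CXR-T progression classification it reaches 65.0% average test accuracy, 4.8% above the previous state-of-the-art model BioViL-T, and on MIMIC-CXR it achieves a RadGraph F1 of 24.2%, comparable to the Med-Gemini foundation model.

Link: https://arxiv.org/abs/2502.20509
Authors: Yixiong Chen,Shawn Xu,Andrew Sellergren,Yossi Matias,Avinatan Hassidim,Shravya Shetty,Daniel Golden,Alan Yuille,Lin Yang
Affiliations: University of Southern California; Meta; Google; Tel Aviv University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: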

Abstract:Vision-language models have proven to be of great benefit for medical image analysis since they learn rich semantics from both images and reports. Prior efforts have focused on better alignment of image and text representations to enhance image understanding. However, though explicit reference to a prior image is common in Chest X-Ray (CXR) reports, aligning progression descriptions with the semantics differences in image pairs remains under-explored. In this work, we propose two components to address this issue. (1) A CXR report processing pipeline to extract temporal structure. It processes reports with a large language model (LLM) to separate the description and comparison contexts, and extracts fine-grained annotations from reports. (2) A contrastive captioner model for CXR, namely CoCa-CXR, to learn how to both describe images and their temporal progressions. CoCa-CXR incorporates a novel regional cross-attention module to identify local differences between paired CXR images. Extensive experiments show the superiority of CoCa-CXR on both progression analysis and report generation compared to previous methods. Notably, on MS-CXR-T progression classification, CoCa-CXR obtains 65.0% average testing accuracy on five pulmonary conditions, outperforming the previous state-of-the-art (SOTA) model BioViL-T by 4.8%. It also achieves a RadGraph F1 of 24.2% on MIMIC-CXR, which is comparable to the Med-Gemini foundation model.

[CV-84] VideoA11y: Method and Dataset for Accessible Video Description

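【Quick read】: This paper addresses the inadequate quality of AI-generated video descriptions for blind and low vision (BLV) users. Current description models are limited by the quality of human annotations in their training datasets and do not fully meet BLV users' needs. To close this gap, the paper proposes VideoA11y, which combines multimodal large language models (MLLMs) with video accessibility guidelines to generate descriptions tailored to BLV users. Using this method, the authors curated VideoA11y-40K, a dataset of 40,000 described videos that they report as the largest and most comprehensive video description dataset for BLV users. Experiments show that VideoA11y descriptions outperform novice human annotations and are comparable to trained human annotations in clarity, accuracy, objectivity, descriptiveness, and user satisfaction.

Link: https://arxiv.org/abs/2502.20480
Authors: Chaoyu Li,Sid Padmanabhuni,Maryam Cheema,Hasti Seifi,Pooyan Fazli
Affiliations: School of Computing and Augmented Intelligence, Arizona State University; School of Arts, Media and Engineering, Arizona State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: ACM CHI 2025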

Abstract:Video descriptions are crucial for blind and low vision (BLV) users to access visual content. However, current artificial intelligence models for generating descriptions often fall short due to limitations in the quality of human annotations within training datasets, resulting in descriptions that do not fully meet BLV users’ needs. To address this gap, we introduce VideoA11y, an approach that leverages multimodal large language models (MLLMs) and video accessibility guidelines to generate descriptions tailored for BLV individuals. Using this method, we have curated VideoA11y-40K, the largest and most comprehensive dataset of 40,000 videos described for BLV users. Rigorous experiments across 15 video categories, involving 347 sighted participants, 40 BLV participants, and seven professional describers, showed that VideoA11y descriptions outperform novice human annotations and are comparable to trained human annotations in clarity, accuracy, objectivity, descriptiveness, and user satisfaction. We evaluated models on VideoA11y-40K using both standard and custom metrics, demonstrating that MLLMs fine-tuned on this dataset produce high-quality accessible descriptions. Code and dataset are available at this https URL.

[CV-85] Confidence-Weighted Boundary-Aware Learning for Semi-Supervised Semantic Segmentation

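【Quick read】: This paper addresses three main problems in semi-supervised semantic segmentation (SSSS): coupling, where over-reliance on the initial labeled data leads to suboptimal learning; confirmation bias, where incorrect predictions repeatedly reinforce themselves; and boundary blur, where insufficient boundary awareness and ambiguous edge information reduce segmentation accuracy. The proposed CW-BASS framework assigns confidence weights to pseudo-labels and adds boundary-delineation techniques to improve boundary awareness. Specifically, it reduces coupling with a confidence-weighted loss, mitigates confirmation bias with a dynamic thresholding mechanism, improves accuracy near object boundaries with a boundary-aware module, and reduces label noise with a confidence decay strategy that progressively refines pseudo-labels. Experiments on Pascal VOC 2012 and Cityscapes show state-of-the-art performance; with only 12.5% of the labeled data, the method reaches a mean intersection over union (mIoU) of 75.81 on Pascal VOC 2012.

Link: https://arxiv.org/abs/2502.15152
Authors: Ebenezer Tarubinga,Jenifer Kalafatovich Espinoza
Affiliations: Dept. of Artificial Intelligence, Korea University, Seoul, Korea
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures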

Abstract:Semi-supervised semantic segmentation (SSSS) aims to improve segmentation performance by utilising unlabeled data alongside limited labeled samples. Existing SSSS methods often face challenges such as coupling, where over-reliance on initial labeled data leads to suboptimal learning; confirmation bias, where incorrect predictions reinforce themselves repeatedly; and boundary blur caused by insufficient boundary-awareness and ambiguous edge information. To address these issues, we propose CW-BASS, a novel framework for SSSS. In order to mitigate the impact of incorrect predictions, we assign confidence weights to pseudo-labels. Additionally, we leverage boundary-delineation techniques, which, despite being extensively explored in weakly-supervised semantic segmentation (WSSS) remain under-explored in SSSS. Specifically, our approach: (1) reduces coupling through a confidence-weighted loss function that adjusts the influence of pseudo-labels based on their predicted confidence scores, (2) mitigates confirmation bias with a dynamic thresholding mechanism that learns to filter out pseudo-labels based on model performance, (3) resolves boundary blur with a boundary-aware module that enhances segmentation accuracy near object boundaries, and (4) reduces label noise with a confidence decay strategy that progressively refines pseudo-labels during training. Extensive experiments on the Pascal VOC 2012 and Cityscapes demonstrate that our method achieves state-of-the-art performance. Moreover, using only 1/8 or 12.5% of labeled data, our method achieves a mIoU of 75.81 on Pascal VOC 2012, highlighting its effectiveness in limited-label settings.
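
The confidence-weighted loss in item (1) can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes one simple form: per-pixel cross-entropy against the pseudo-label, scaled by a confidence weight and normalized by the total weight. The function name `confidence_weighted_ce` and the sample values are hypothetical.

```python
import math
from typing import Sequence


def confidence_weighted_ce(
    student_probs: Sequence[Sequence[float]],
    pseudo_labels: Sequence[int],
    confidences: Sequence[float],
) -> float:
    """Cross-entropy against pseudo-labels, weighted by pseudo-label confidence.

    Assumed form for illustration only (not the CW-BASS loss):
        loss = sum_i w_i * (-log p_i[y_i]) / sum_i w_i
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for probs, label, weight in zip(student_probs, pseudo_labels, confidences):
        weighted_sum += weight * (-math.log(probs[label]))
        weight_total += weight
    # Normalizing by the total weight keeps the loss scale independent of how
    # many pixels carry high confidence.
    return weighted_sum / weight_total if weight_total > 0 else 0.0


student_probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]   # per-pixel class probabilities
pseudo_labels = [0, 0, 1]                              # labels from an earlier prediction
confidences = [0.95, 0.55, 0.90]                       # the second pixel contributes less
print(confidence_weighted_ce(student_probs, pseudo_labels, confidences))
```

Under this assumed form, a pixel whose pseudo-label is uncertain still contributes to training, but with a smaller share of the loss than a confident one.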

[CV-86] TomoSelfDEQ: Self-Supervised Deep Equilibrium Learning for Sparse-Angle CT Reconstruction

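【Quick read】: This paper addresses the difficulty of obtaining paired training data for sparse-angle computed tomography (CT) reconstruction. Most approaches require paired training data with ground-truth images, which are often hard to obtain in practice, for example in medical applications. The paper proposes TomoSelfDEQ, a self-supervised Deep Equilibrium (DEQ) framework that trains directly on undersampled measurements without paired ground-truth images. A theoretical analysis shows that, under suitable assumptions, the self-supervised updates match those of fully supervised training with a loss that includes the forward operator (such as the CT forward map). Numerical experiments confirm this result and show that TomoSelfDEQ outperforms existing self-supervised methods, achieving state-of-the-art results with as few as 16 projection angles.

Link: https://arxiv.org/abs/2502.21320
Authors: Tatiana A. Bubba,Matteo Santacesaria,Andrea Sebastiani
Affiliations: Department of Mathematics and Computer Science, University of Ferrara; MaLGa Center, Department of Mathematics, University of Genoa; Department of Physics, Computer Science and Mathematics, University of Modena and Reggio Emilia
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: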

Abstract:Deep learning has emerged as a powerful tool for solving inverse problems in imaging, including computed tomography (CT). However, most approaches require paired training data with ground truth images, which can be difficult to obtain, e.g., in medical applications. We present TomoSelfDEQ, a self-supervised Deep Equilibrium (DEQ) framework for sparse-angle CT reconstruction that trains directly on undersampled measurements. We establish theoretical guarantees showing that, under suitable assumptions, our self-supervised updates match those of fully-supervised training with a loss including the (possibly non-unitary) forward operator like the CT forward map. Numerical experiments on sparse-angle CT data confirm this finding, also demonstrating that TomoSelfDEQ outperforms existing self-supervised methods, achieving state-of-the-art results with as few as 16 projection angles.
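
For readers new to Deep Equilibrium models: a DEQ defines its output as a fixed point z* = f(z*, x) of a layer f, rather than as the result of a fixed stack of layers. The sketch below finds such a fixed point by plain iteration for a toy scalar contraction. The map `toy_layer`, the tolerance, and the function names are assumptions for illustration and have no connection to the TomoSelfDEQ network or the CT forward operator.

```python
from typing import Callable


def fixed_point(
    f: Callable[[float, float], float],
    x: float,
    z0: float = 0.0,
    tol: float = 1e-10,
    max_iter: int = 1000,
) -> float:
    """Iterate z <- f(z, x) until successive iterates differ by less than tol."""
    z = z0
    for _ in range(max_iter):
        z_next = f(z, x)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    raise RuntimeError("fixed-point iteration did not converge")


def toy_layer(z: float, x: float) -> float:
    """A contraction in z (slope 0.5), so the iteration converges; its fixed point is 2 * x."""
    return 0.5 * z + x


print(fixed_point(toy_layer, x=1.5))  # close to 3.0
```

Practical DEQ implementations typically replace the plain iteration with a faster root-finding scheme, but the defining property, that the output satisfies z = f(z, x), is the same.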

[CV-87] AutoComb: Automated Comb Sign Detector for 3D CTE Scans

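【Quick read】: This paper addresses automatic detection of the Comb Sign, an important imaging biomarker, in CTE scans. Current detection is manual, time-consuming, and prone to subjective interpretation because it requires multi-planar image orientation. The paper proposes a fully automated technique based on a probabilistic map that highlights areas of pathological hypervascularity along the intestinal wall through a sequence of algorithmic modules: a deep learning segmentation model, a Gaussian Mixture Model (GMM), vessel extraction with a vesselness filter, iterative probabilistic enhancement of vesselness via neighborhood maximization, and a distance-based weighting scheme over the vessels. Together these modules are intended to provide an objective, accurate, and reliable tool for improving diagnostic accuracy in Crohn's disease and related hypervascular conditions.

Link: https://arxiv.org/abs/2502.21311
Authors: Shashwat Gupta,Sarthak Gupta,Akshan Agrawal,Mahim Naaz,Rajanikanth Yadav,Priyanka Bagade
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures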

Abstract:Comb Sign is an important imaging biomarker to detect multiple gastrointestinal diseases. It shows up as increased blood flow along the intestinal wall indicating potential abnormality, which helps doctors diagnose inflammatory conditions. Despite its clinical significance, current detection methods are manual, time-intensive, and prone to subjective interpretation due to the need for multi-planar image-orientation. To the best of our knowledge, we are the first to propose a fully automated technique for the detection of Comb Sign from CTE scans. Our novel approach is based on developing a probabilistic map that shows areas of pathological hypervascularity by identifying fine vascular bifurcations and wall enhancement via processing through stepwise algorithmic modules. These modules include utilising deep learning segmentation model, a Gaussian Mixture Model (GMM), vessel extraction using vesselness filter, iterative probabilistic enhancement of vesselness via neighborhood maximization and a distance-based weighting scheme over the vessels. Experimental results demonstrate that our pipeline effectively identifies Comb Sign, offering an objective, accurate, and reliable tool to enhance diagnostic accuracy in Crohn’s disease and related hypervascular conditions where Comb Sign is considered as one of the important biomarkers.

[CV-88] “No negatives needed”: weakly-supervised regression for interpretable tumor detection in whole-slide histopathology images

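【Quick read】: This paper addresses the fact that conventional weakly supervised tumor detection methods need tumor-free cases as negative examples, which are hard to obtain in real clinical workflows, especially for surgical resection specimens. The key idea is to reformulate tumor detection as a regression task that estimates the tumor percentage in whole-slide images (WSIs), removing the dependence on tumor-free negatives. The paper analyzes the proposed weakly supervised regression framework across multiple organs, specimen types, and clinical scenarios, and introduces a novel amplification technique to improve detection sensitivity when learning from small tumor regions. It also provides interpretable insight into model predictions by analyzing visual attention and logit maps.

Link: https://arxiv.org/abs/2502.21109
Authors: Marina D’Amato,Jeroen van der Laak,Francesco Ciompi
Affiliations: Radboud University Medical Center, Nijmegen, The Netherlands
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: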

Abstract:Accurate tumor detection in digital pathology whole-slide images (WSIs) is crucial for cancer diagnosis and treatment planning. Multiple Instance Learning (MIL) has emerged as a widely used approach for weakly-supervised tumor detection with large-scale data without the need for manual annotations. However, traditional MIL methods often depend on classification tasks that require tumor-free cases as negative examples, which are challenging to obtain in real-world clinical workflows, especially for surgical resection specimens. We address this limitation by reformulating tumor detection as a regression task, estimating tumor percentages from WSIs, a clinically available target across multiple cancer types. In this paper, we provide an analysis of the proposed weakly-supervised regression framework by applying it to multiple organs, specimen types and clinical scenarios. We characterize the robustness of our framework to tumor percentage as a noisy regression target, and introduce a novel concept of amplification technique to improve tumor detection sensitivity when learning from small tumor regions. Finally, we provide interpretable insights into the model’s predictions by analyzing visual attention and logit maps. Our code is available at this https URL.

[CV-89] A Non-contrast Head CT Foundation Model for Comprehensive Neuro-Trauma Triage

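【Quick read】: This paper addresses longer assessment times and reduced accuracy in emergency head CT interpretation caused by a shortage of radiologists and growing scan demand. It proposes a 3D foundation model for detecting diverse neuro-trauma findings accurately and efficiently. Large language models (LLMs) are used for automatic labeling to generate comprehensive multi-label annotations for critical conditions. Neural networks are pretrained for hemorrhage subtype segmentation and brain anatomy parcellation, then integrated into a pretrained comprehensive neuro-trauma detection network through multimodal fine-tuning. Evaluated against expert annotations and compared with CT-CLIP, the method shows strong triage accuracy on major findings such as hemorrhage and midline shift as well as less frequent critical conditions such as cerebral edema and arterial hyperdensity, with an average AUC of 0.861 across 16 neuro-trauma conditions. The work is presented as a benchmark for future AI-assisted neuro-trauma diagnostics in emergency radiology.

Link: https://arxiv.org/abs/2502.21106
Authors: Youngjin Yoo,Bogdan Georgescu,Yanbo Zhang,Sasa Grbic,Han Liu,Gabriela D. Aldea,Thomas J. Re,Jyotipriya Das,Poikavila Ullaskrishnan,Eva Eibenberger,Andrei Chekkoury,Uttam K. Bodanapally,Savvas Nicolaou,Pina C. Sanelli,Thomas J. Schroeppel,Yvonne W. Lui,Eli Gibson
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: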

Abstract:Recent advancements in AI and medical imaging offer transformative potential in emergency head CT interpretation for reducing assessment times and improving accuracy in the face of an increasing request of such scans and a global shortage in radiologists. This study introduces a 3D foundation model for detecting diverse neuro-trauma findings with high accuracy and efficiency. Using large language models (LLMs) for automatic labeling, we generated comprehensive multi-label annotations for critical conditions. Our approach involved pretraining neural networks for hemorrhage subtype segmentation and brain anatomy parcellation, which were integrated into a pretrained comprehensive neuro-trauma detection network through multimodal fine-tuning. Performance evaluation against expert annotations and comparison with CT-CLIP demonstrated strong triage accuracy across major neuro-trauma findings, such as hemorrhage and midline shift, as well as less frequent critical conditions such as cerebral edema and arterial hyperdensity. The integration of neuro-specific features significantly enhanced diagnostic capabilities, achieving an average AUC of 0.861 for 16 neuro-trauma conditions. This work advances foundation models in medical imaging, serving as a benchmark for future AI-assisted neuro-trauma diagnostics in emergency radiology.

[CV-90] Adaptive Accelerated Proximal Gradient Methods with Variance Reduction for Composite Nonconvex Finite-Sum Minimization

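【Quick read】: This paper addresses the minimization of composite nonconvex finite-sum functions, with a focus on the stochastic finite-sum setting. It proposes AAPG-SPIDER, an adaptive accelerated proximal gradient method that uses variance reduction to improve convergence efficiency. The method combines three acceleration techniques: adaptive stepsizes, Nesterov's extrapolation, and the recursive stochastic path-integrated estimator SPIDER. In the full-batch, non-stochastic setting, AAPG-SPIDER reduces to AAPG. Both methods are learning-rate-free and, to the authors' knowledge, the first such methods to achieve optimal iteration complexity for this class of composite problems: $\mathcal{O}(N \epsilon^{-2})$ for AAPG and $\mathcal{O}(N + \sqrt{N} \epsilon^{-2})$ for AAPG-SPIDER. Under the Kurdyka-Lojasiewicz (KL) assumption, non-ergodic convergence rates are established for both methods. Preliminary experiments on sparse phase retrieval and linear eigenvalue problems show that AAPG-SPIDER and AAPG outperform existing methods.

Link: https://arxiv.org/abs/2502.21099
Authors: Ganzhao Yuan
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments: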

Abstract:This paper proposes AAPG-SPIDER, an Adaptive Accelerated Proximal Gradient (AAPG) method with variance reduction for minimizing composite nonconvex finite-sum functions. It integrates three acceleration techniques: adaptive stepsizes, Nesterov’s extrapolation, and the recursive stochastic path-integrated estimator SPIDER. While targeting stochastic finite-sum problems, AAPG-SPIDER simplifies to AAPG in the full-batch, non-stochastic setting, which is also of independent interest. To our knowledge, AAPG-SPIDER and AAPG are the first learning-rate-free methods to achieve optimal iteration complexity for this class of composite minimization problems. Specifically, AAPG achieves the optimal iteration complexity of $\mathcal{O}(N \epsilon^{-2})$, while AAPG-SPIDER achieves $\mathcal{O}(N + \sqrt{N} \epsilon^{-2})$ for finding $\epsilon$-approximate stationary points, where $N$ is the number of component functions. Under the Kurdyka-Lojasiewicz (KL) assumption, we establish non-ergodic convergence rates for both methods. Preliminary experiments on sparse phase retrieval and linear eigenvalue problems demonstrate the superior performance of AAPG-SPIDER and AAPG compared to existing methods.
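
For readers unfamiliar with the building blocks named in the abstract, the following sketch shows a textbook proximal gradient iteration with Nesterov-style extrapolation on a toy composite problem: minimize 0.5 * ||x - b||^2 + lam * ||x||_1. It uses a fixed stepsize and full gradients, so it is neither AAPG nor AAPG-SPIDER (no adaptive stepsizes, no SPIDER estimator, and the toy problem is convex). The function names and the toy problem are assumptions chosen for illustration.

```python
from typing import List


def soft_threshold(v: List[float], t: float) -> List[float]:
    """Proximal operator of t * ||x||_1: shrink each entry toward zero by t."""
    out = []
    for x in v:
        if x > t:
            out.append(x - t)
        elif x < -t:
            out.append(x + t)
        else:
            out.append(0.0)
    return out


def accelerated_prox_grad(b: List[float], lam: float, step: float = 0.5, iters: int = 200) -> List[float]:
    """Minimize 0.5 * ||x - b||^2 + lam * ||x||_1 with extrapolated proximal gradient steps."""
    x_prev = [0.0] * len(b)
    x = [0.0] * len(b)
    t_prev = 1.0
    for _ in range(iters):
        t = (1.0 + (1.0 + 4.0 * t_prev * t_prev) ** 0.5) / 2.0
        beta = (t_prev - 1.0) / t                        # Nesterov extrapolation weight
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        grad = [yi - bi for yi, bi in zip(y, b)]         # gradient of the smooth part at y
        x_next = soft_threshold([yi - step * gi for yi, gi in zip(y, grad)], step * lam)
        x_prev, x, t_prev = x, x_next, t
    return x


b = [3.0, -0.5, 1.2]
print(accelerated_prox_grad(b, lam=1.0))   # approaches soft_threshold(b, 1.0)
```

The toy problem has the closed-form minimizer `soft_threshold(b, lam)`, which makes it easy to check that the iteration converges to the right point.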

[CV-91] Guiding Quantitative MRI Reconstruction with Phase-wise Uncertainty MICCAI2025

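【Quick read】: This paper addresses the ill-posed inverse problem that arises in quantitative magnetic resonance imaging (qMRI) reconstruction from reduced data sampling, and explores how uncertainty information can improve reconstruction. Many studies measure uncertainty during reconstruction, but few examine how to use it to enhance reconstruction performance. The proposed method, PUQ, uses a two-stage reconstruction and parameter fitting framework: phase-wise uncertainty is estimated during reconstruction and then used in the fitting stage. This design lets uncertainty reflect the reliability of different phases and guide information integration during parameter fitting. On in vivo T1 and T2 mapping datasets from healthy subjects, PUQ achieves state-of-the-art parameter mapping performance, demonstrating the effectiveness of uncertainty guidance. The code is publicly available.

Link: https://arxiv.org/abs/2502.20877
Authors: Haozhong Sun,Zhongsen Li,Chenlin Du,Haokun Li,Yajie Wang,Huijun Chen
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to MICCAI2025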

Abstract:Quantitative magnetic resonance imaging (qMRI) requires multi-phase acquisition, often relying on reduced data sampling and reconstruction algorithms to accelerate scans, which inherently poses an ill-posed inverse problem. While many studies focus on measuring uncertainty during this process, few explore how to leverage it to enhance reconstruction performance. In this paper, we introduce PUQ, a novel approach that pioneers the use of uncertainty information for qMRI reconstruction. PUQ employs a two-stage reconstruction and parameter fitting framework, where phase-wise uncertainty is estimated during reconstruction and utilized in the fitting stage. This design allows uncertainty to reflect the reliability of different phases and guide information integration during parameter fitting. We evaluated PUQ on in vivo T1 and T2 mapping datasets from healthy subjects. Compared to existing qMRI reconstruction methods, PUQ achieved the state-of-the-art performance in parameter mappings, demonstrating the effectiveness of uncertainty guidance. Our code is available at this https URL.

[CV-92] Delta-WKV: A Novel Meta-in-Context Learner for MRI Super-Resolution MICCAI2025

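【Quick read】: This paper addresses the limited ability of existing magnetic resonance imaging (MRI) super-resolution (SR) techniques to capture both local and global static patterns effectively. It proposes Delta-WKV, a model that combines Meta-in-Context Learning (MiCL) with the Delta rule to better recognize local and global patterns in MRI images. This lets Delta-WKV adjust its weights dynamically during inference, improving pattern recognition with fewer parameters and less computation, without state-space modeling. Inspired by Receptance Weighted Key Value (RWKV), Delta-WKV also uses a quad-directional scanning mechanism with time-mixing and channel-mixing structures to capture long-range dependencies while preserving high-frequency details. On the IXI and fastMRI datasets, Delta-WKV outperforms existing methods, improving peak signal-to-noise ratio (PSNR) by 0.06 dB and structural similarity index (SSIM) by 0.001 while reducing training and inference time by more than 15%.

Link: https://arxiv.org/abs/2502.20852
Authors: Rongchang Lu,Bingcheng Liao,Haowen Hou,Jiahang Lv,Xin Hai
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been published to MICCAI 2025. Feel free to contact on nomodeset@qq.com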

Abstract:Magnetic Resonance Imaging (MRI) Super-Resolution (SR) addresses the challenges such as long scan times and expensive equipment by enhancing image resolution from low-quality inputs acquired in shorter scan times in clinical settings. However, current SR techniques still have problems such as limited ability to capture both local and global static patterns effectively and efficiently. To address these limitations, we propose Delta-WKV, a novel MRI super-resolution model that combines Meta-in-Context Learning (MiCL) with the Delta rule to better recognize both local and global patterns in MRI images. This approach allows Delta-WKV to adjust weights dynamically during inference, improving pattern recognition with fewer parameters and less computational effort, without using state-space modeling. Additionally, inspired by Receptance Weighted Key Value (RWKV), Delta-WKV uses a quad-directional scanning mechanism with time-mixing and channel-mixing structures to capture long-range dependencies while maintaining high-frequency details. Tests on the IXI and fastMRI datasets show that Delta-WKV outperforms existing methods, improving PSNR by 0.06 dB and SSIM by 0.001, while reducing training and inference times by over 15%. These results demonstrate its efficiency and potential for clinical use with large datasets and high-resolution imaging.
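
The Delta rule mentioned in the abstract can be illustrated in its generic form: a key-value memory matrix S is corrected toward a target value v for key k via S <- S + beta * (v - S k) k^T. The sketch below implements that textbook update in plain Python. It is not the Delta-WKV layer, and the function name and the learning rate `beta` are assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]


def delta_rule_update(S: Matrix, k: List[float], v: List[float], beta: float = 1.0) -> Matrix:
    """Return S + beta * (v - S k) k^T, the generic delta-rule memory update."""
    prediction = [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]  # S k
    error = [v_i - p_i for v_i, p_i in zip(v, prediction)]                    # v - S k
    return [
        [s_ij + beta * e_i * k_j for s_ij, k_j in zip(row, k)]
        for row, e_i in zip(S, error)
    ]


S = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]          # unit-norm key
v = [0.3, -0.7]
S = delta_rule_update(S, k, v)
# With a unit-norm key and beta = 1, querying the memory with k now returns v.
print([sum(s * kj for s, kj in zip(row, k)) for row in S])
```

Because the update depends only on the current input, it can be applied at inference time, which is the sense in which such a memory adjusts its weights dynamically.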

[CV-93] Autoregressive Medical Image Segmentation via Next-Scale Mask Prediction

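【Quick read】: This paper addresses the difficulty existing medical image segmentation methods have with complex anatomical regions, in particular the failure of multi-scale feature learning approaches to establish sufficient inter-scale dependencies because each scale relies only on the features of its immediate predecessor. The proposed AutoRegressive Segmentation framework (AR-Seg) progressively predicts the next-scale mask while explicitly modeling dependencies across all previous scales. AR-Seg introduces three innovations: (1) a multi-scale mask autoencoder that quantizes the mask into multi-scale token maps to capture hierarchical anatomical structures; (2) a next-scale autoregressive mechanism that progressively predicts next-scale masks to strengthen inter-scale dependencies; and (3) a consensus-aggregation strategy that combines multiple sampled results into a more accurate mask, further improving robustness. Experiments on two benchmark datasets show that AR-Seg outperforms state-of-the-art methods and makes the coarse-to-fine segmentation process explicitly visible.

Link: https://arxiv.org/abs/2502.20784
Authors: Tao Chen,Chenhui Wang,Zhihao Chen,Hongming Shan
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures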

Abstract:While deep learning has significantly advanced medical image segmentation, most existing methods still struggle with handling complex anatomical regions. Cascaded or deep supervision-based approaches attempt to address this challenge through multi-scale feature learning but fail to establish sufficient inter-scale dependencies, as each scale relies solely on the features of the immediate predecessor. To this end, we propose the AutoRegressive Segmentation framework via next-scale mask prediction, termed AR-Seg, which progressively predicts the next-scale mask by explicitly modeling dependencies across all previous scales within a unified architecture. AR-Seg introduces three innovations: (1) a multi-scale mask autoencoder that quantizes the mask into multi-scale token maps to capture hierarchical anatomical structures, (2) a next-scale autoregressive mechanism that progressively predicts next-scale masks to enable sufficient inter-scale dependencies, and (3) a consensus-aggregation strategy that combines multiple sampled results to generate a more accurate mask, further improving segmentation robustness. Extensive experimental results on two benchmark datasets with different modalities demonstrate that AR-Seg outperforms state-of-the-art methods while explicitly visualizing the intermediate coarse-to-fine segmentation process.
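
Innovation (3) combines several sampled masks into one. The abstract does not say how the consensus is formed, so the sketch below assumes the simplest rule, a per-pixel majority vote over binary masks. `consensus_mask` and the `min_votes` parameter are hypothetical names.

```python
from typing import List, Optional

Mask = List[List[int]]  # binary mask: 1 = foreground, 0 = background


def consensus_mask(samples: List[Mask], min_votes: Optional[int] = None) -> Mask:
    """Combine sampled binary masks by per-pixel voting.

    A pixel is foreground when at least `min_votes` samples mark it as foreground.
    By default a strict majority is required.
    """
    if not samples:
        raise ValueError("need at least one sampled mask")
    if min_votes is None:
        min_votes = len(samples) // 2 + 1
    rows, cols = len(samples[0]), len(samples[0][0])
    return [
        [1 if sum(m[r][c] for m in samples) >= min_votes else 0 for c in range(cols)]
        for r in range(rows)
    ]


samples = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 1]],
    [[1, 1], [0, 0]],
]
print(consensus_mask(samples))  # [[1, 1], [0, 0]]
```

Voting suppresses pixels that appear in only one sample, which is one plausible way sampling-based aggregation could improve robustness.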

[CV-94] Towards Practical Real-Time Neural Video Compression MICRO CVPR2025

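【Quick read】: This paper aims to raise the coding speed of neural video codecs (NVCs) while delivering a high compression ratio, low latency, and broad versatility. It observes that NVC coding speed depends on computational cost and on non-computational operational cost, such as memory I/O and the number of function calls. Most efficient NVCs focus on reducing computational cost, but the authors identify operational cost as the primary bottleneck to higher coding speed. They therefore introduce a set of design improvements aimed at minimizing operational cost: implicit temporal modeling replaces complex explicit motion modules, and a single low-resolution latent representation replaces progressive downsampling, which significantly accelerates the codec without sacrificing compression quality. Model integerization provides consistent cross-device coding, and a module-bank-based rate control scheme improves practical adaptability. Experiments show that the proposed DCVC-RT reaches average encoding/decoding speeds of 125.2/112.8 fps on 1080p video while saving an average of 21% in bitrate compared with H.266/VTM.

Link: https://arxiv.org/abs/2502.20762
Authors: Zhaoyang Jia,Bin Li,Jiahao Li,Wenxuan Xie,Linfeng Qi,Houqiang Li,Yan Lu
Affiliations: University of Science and Technology of China; Microsoft Research Asia
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. The code is available at this https URL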

Abstract:We introduce a practical real-time neural video codec (NVC) designed to deliver high compression ratio, low latency and broad versatility. In practice, the coding speed of NVCs depends on 1) computational costs, and 2) non-computational operational costs, such as memory I/O and the number of function calls. While most efficient NVCs prioritize reducing computational cost, we identify operational cost as the primary bottleneck to achieving higher coding speed. Leveraging this insight, we introduce a set of efficiency-driven design improvements focused on minimizing operational costs. Specifically, we employ implicit temporal modeling to eliminate complex explicit motion modules, and use single low-resolution latent representations rather than progressive downsampling. These innovations significantly accelerate NVC without sacrificing compression quality. Additionally, we implement model integerization for consistent cross-device coding and a module-bank-based rate control scheme to improve practical adaptability. Experiments show our proposed DCVC-RT achieves an impressive average encoding/decoding speed at 125.2/112.8 fps (frames per second) for 1080p video, while saving an average of 21% in bitrate compared to H.266/VTM. The code is available at this https URL.

[CV-95] SemiSAM+: Rethinking Semi-Supervised Medical Image Segmentation in the Era of Foundation Models

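【Quick read】: This paper addresses the limited clinical applicability of deep learning for medical image segmentation caused by its need for large amounts of labeled data. It proposes SemiSAM+, a foundation model-driven semi-supervised learning framework built on collaborative learning between promptable segmentation foundation models (generalists) and a trainable task-specific segmentation model (specialist). The specialist model generates positional prompts that interact with the frozen generalist models to obtain pseudo-labels, and the generalist output in turn gives the specialist informative and efficient supervision, enabling efficient learning from limited labeled data. The approach is particularly suited to scenarios with extremely limited annotations and shows strong adaptability as a plug-and-play strategy.

Link: https://arxiv.org/abs/2502.20749
Authors: Yichi Zhang,Bohao Lv,Le Xue,Wenbo Zhang,Yuchen Liu,Yu Fu,Yuan Cheng,Yuan Qi
Affiliations: Artificial Intelligence Innovation and Incubation Institute, Fudan University; Shanghai Academy of Artificial Intelligence for Science; Huashan Hospital, Fudan University; Human Phenome Institute, Fudan University; School of Information Science and Engineering, Lanzhou University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: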

Abstract:Deep learning-based medical image segmentation typically requires large amount of labeled data for training, making it less applicable in clinical settings due to high annotation cost. Semi-supervised learning (SSL) has emerged as an appealing strategy due to its less dependence on acquiring abundant annotations from experts compared to fully supervised methods. Beyond existing model-centric advancements of SSL by designing novel regularization strategies, we anticipate a paradigmatic shift due to the emergence of promptable segmentation foundation models with universal segmentation capabilities using positional prompts represented by Segment Anything Model (SAM). In this paper, we present SemiSAM+, a foundation model-driven SSL framework to efficiently learn from limited labeled data for medical image segmentation. SemiSAM+ consists of one or multiple promptable foundation models as generalist models, and a trainable task-specific segmentation model as specialist model. For a given new segmentation task, the training is based on the specialist-generalist collaborative learning procedure, where the trainable specialist model delivers positional prompts to interact with the frozen generalist models to acquire pseudo-labels, and then the generalist model output provides the specialist model with informative and efficient supervision which benefits the automatic segmentation and prompt generation in turn. Extensive experiments on two public datasets and one in-house clinical dataset demonstrate that SemiSAM+ achieves significant performance improvement, especially under extremely limited annotation scenarios, and shows strong efficiency as a plug-and-play strategy that can be easily adapted to different specialist and generalist models.

[CV-96] Style Content Decomposition-based Data Augmentation for Domain Generalizable Medical Image Segmentation

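【Quick read】: This paper addresses the significant performance degradation that medical image segmentation models suffer at deployment because of domain shifts between training and testing images. It shows that domain shifts in medical images comprise style shifts (differences in image appearance) and content shifts (variations in anatomical structures), the latter having been largely overlooked. The proposed StyCona is a style-content decomposition-based data augmentation method that augments both image style and content within the rank-one space. It is a simple plug-and-play module that substantially improves generalization without additional training parameters or changes to the segmentation model architecture. Experiments on cross-sequence, cross-center, and cross-modality segmentation settings with increasingly severe domain shifts show that StyCona is effective and outperforms existing methods.

Link: https://arxiv.org/abs/2502.20619
Authors: Zhiqiang Shen,Peng Cao,Jinzhu Yang,Osmar R. Zaiane,Zhaolin Chen
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: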

Abstract:Due to the domain shifts between training and testing medical images, learned segmentation models often experience significant performance degradation during deployment. In this paper, we first decompose an image into its style code and content map and reveal that domain shifts in medical images involve: style shifts (i.e., differences in image appearance) and content shifts (i.e., variations in anatomical structures), the latter of which has been largely overlooked. To this end, we propose StyCona, a style content decomposition-based data augmentation method that innovatively augments both image style and content within the rank-one space, for domain generalizable medical image segmentation. StyCona is a simple yet effective plug-and-play module that substantially improves model generalization without requiring additional training parameters or modifications to the segmentation model architecture. Experiments on cross-sequence, cross-center, and cross-modality medical image segmentation settings with increasingly severe domain shifts, demonstrate the effectiveness of StyCona and its superiority over state-of-the-arts. The code is available at this https URL.

[CV-97] An Integrated Deep Learning Framework Leveraging NASNet and Vision Transformer with MixProcessing for Accurate and Precise Diagnosis of Lung Diseases

【速读】:该论文旨在解决肺部疾病早期和精确诊断这一严峻健康挑战,针对包括肺癌、COVID-19、肺炎、结核病及正常状态在内的五种类别进行分类。解决方案的关键在于提出了一种名为NASNet-ViT的新深度学习框架,它结合了NASNet的卷积能力与Vision Transformer ViT的全局注意力机制能力,并采用了一种名为MixProcessing的多维度预处理策略,通过小波变换、自适应直方图均衡化和形态学滤波技术提升诊断准确性。该模型在多种性能指标上达到当前最优水平,同时具备紧凑的模型大小(25.6 MB)和较低的计算时间(12.4秒),适用于实时临床环境。

Link: https://arxiv.org/abs/2502.20570
Authors: Sajjad Saleem,Muhammad Imran Sharif
Affiliations: unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The lungs are the essential organs of respiration, and the respiratory system is central to the exchange of oxygen and carbon dioxide that sustains human life. However, several lung diseases, including pneumonia, tuberculosis, COVID-19, and lung cancer, are serious health challenges that demand early and precise diagnosis. This study proposes a new deep learning framework called NASNet-ViT, which combines the convolutional capability of NASNet with the global attention mechanism of the Vision Transformer (ViT). The proposed model classifies lung conditions into five classes: lung cancer, COVID-19, pneumonia, TB, and normal. A multi-faceted preprocessing strategy called MixProcessing is used to improve diagnostic accuracy. This preprocessing combines wavelet transform, adaptive histogram equalization, and morphological filtering techniques. The NASNet-ViT model performs at the state of the art, achieving an accuracy of 98.9%, sensitivity of 0.99, an F1-score of 0.989, and specificity of 0.987, outperforming other state-of-the-art architectures such as MixNet-LD, D-ResNet, MobileNet, and ResNet50. The model's efficiency is further emphasized by its compact size, 25.6 MB, and a low computational time of 12.4 seconds, which make it suitable for real-time, clinically constrained environments. These results reflect the capability of NASNet-ViT to extract meaningful features and recognize various types of lung diseases with very high accuracy. This work contributes to medical image analysis by providing a robust and scalable solution for the diagnosis of lung diseases.

Artificial Intelligence

[AI-0] Clustering Context in Off-Policy Evaluation AISTATS2025

Link: https://arxiv.org/abs/2502.21304
Authors: Daniel Guzman-Olivares,Philipp Schmidt,Jacek Golebiowski,Artur Bekasov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 35 pages, 25 figures, 2 tables. AISTATS 2025

Abstract:Off-policy evaluation can leverage logged data to estimate the effectiveness of new policies in e-commerce, search engines, media streaming services, or automatic diagnostic tools in healthcare. However, the performance of baseline off-policy estimators like IPS deteriorates when the logging policy significantly differs from the evaluation policy. Recent work proposes sharing information across similar actions to mitigate this problem. In this work, we propose an alternative estimator that shares information across similar contexts using clustering. We study the theoretical properties of the proposed estimator, characterizing its bias and variance under different conditions. We also compare the performance of the proposed estimator and existing approaches in various synthetic problems, as well as a real-world recommendation dataset. Our experimental results confirm that clustering contexts improves estimation accuracy, especially in deficient information settings.
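
To make the contrast in this abstract concrete, here is a minimal sketch of the standard IPS estimator next to one way of sharing information across clustered contexts. The clustered variant is an illustrative assumption, not the estimator defined in the paper: it replaces each per-context importance weight with the ratio of cluster-averaged action probabilities. The function names `ips` and `clustered_ips` and the data layout (one probability list per logged context) are hypothetical.

```python
from collections import defaultdict


def ips(rewards, actions, pi_e, pi_0):
    """Standard IPS: mean of r * pi_e(a|x) / pi_0(a|x) over logged samples."""
    total = 0.0
    for r, a, pe, p0 in zip(rewards, actions, pi_e, pi_0):
        total += r * pe[a] / p0[a]
    return total / len(rewards)


def clustered_ips(rewards, actions, clusters, pi_e, pi_0):
    """Illustrative variant (an assumption, not the paper's estimator):
    importance weights use action probabilities averaged within each
    context cluster instead of per-context probabilities."""
    n_actions = len(pi_e[0])
    sums_e = defaultdict(lambda: [0.0] * n_actions)
    sums_0 = defaultdict(lambda: [0.0] * n_actions)
    for c, pe, p0 in zip(clusters, pi_e, pi_0):
        for a in range(n_actions):
            sums_e[c][a] += pe[a]
            sums_0[c][a] += p0[a]
    total = 0.0
    for r, a, c in zip(rewards, actions, clusters):
        # A ratio of sums equals the ratio of cluster means (counts cancel).
        total += r * sums_e[c][a] / sums_0[c][a]
    return total / len(rewards)
```

When the evaluation policy equals the logging policy, both functions reduce to the plain average of the logged rewards, which is a quick sanity check. Averaging within clusters typically smooths extreme weights at the price of some bias; the paper characterizes that trade-off for its own estimator.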

[AI-1] Contextualizing biological perturbation experiments through language

Link: https://arxiv.org/abs/2502.21290
Authors: Menghua Wu,Russell Littman,Jacob Levine,Lin Qiu,Tommaso Biancalani,David Richmond,Jan-Christian Huetter
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: The Thirteenth International Conference on Learning Representations (2025)

Abstract:High-content perturbation experiments allow scientists to probe biomolecular systems at unprecedented resolution, but experimental and analysis costs pose significant barriers to widespread adoption. Machine learning has the potential to guide efficient exploration of the perturbation space and extract novel insights from these data. However, current approaches neglect the semantic richness of the relevant biology, and their objectives are misaligned with downstream biological analyses. In this paper, we hypothesize that large language models (LLMs) present a natural medium for representing complex biological relationships and rationalizing experimental outcomes. We propose PerturbQA, a benchmark for structured reasoning over perturbation experiments. Unlike current benchmarks that primarily interrogate existing knowledge, PerturbQA is inspired by open problems in perturbation modeling: prediction of differential expression and change of direction for unseen perturbations, and gene set enrichment. We evaluate state-of-the-art machine learning and statistical approaches for modeling perturbations, as well as standard LLM reasoning strategies, and we find that current methods perform poorly on PerturbQA. As a proof of feasibility, we introduce Summer (SUMMarize, retrievE, and answeR), a simple, domain-informed LLM framework that matches or exceeds the current state-of-the-art. Our code and data are publicly available at this https URL.

[AI-2] L-Lipschitz Gershgorin ResNet Network

Link: https://arxiv.org/abs/2502.21279
Authors: Marius F. R. Juston,William R. Norris,Dustin Nottage,Ahmet Soylemezoglu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures

Abstract:Deep residual networks (ResNets) have demonstrated outstanding success in computer vision tasks, attributed to their ability to maintain gradient flow through deep architectures. Simultaneously, controlling the Lipschitz bound in neural networks has emerged as an essential area of research for enhancing adversarial robustness and network certifiability. This paper uses a rigorous approach to design $\mathcal{L}$-Lipschitz deep residual networks using a Linear Matrix Inequality (LMI) framework. The ResNet architecture was reformulated as a pseudo-tri-diagonal LMI with off-diagonal elements and derived closed-form constraints on network parameters to ensure $\mathcal{L}$-Lipschitz continuity. To address the lack of explicit eigenvalue computations for such matrix structures, the Gershgorin circle theorem was employed to approximate eigenvalue locations, guaranteeing the LMI’s negative semi-definiteness. Our contributions include a provable parameterization methodology for constructing Lipschitz-constrained networks and a compositional framework for managing recursive systems within hierarchical architectures. These findings enable robust network designs applicable to adversarial robustness, certified training, and control systems. However, a limitation was identified in the Gershgorin-based approximations, which over-constrain the system, suppressing non-linear dynamics and diminishing the network’s expressive capacity.
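
The Gershgorin circle theorem mentioned in this abstract can be illustrated on a small real symmetric matrix. The sketch below is generic and is not the paper's LMI construction; the helper names `gershgorin_intervals` and `certified_negative_semidefinite` are hypothetical. For a symmetric matrix every eigenvalue lies in some interval centered at a diagonal entry with radius equal to the sum of absolute off-diagonal entries in that row, so requiring every interval to lie at or below zero is a sufficient, but not necessary, certificate of negative semi-definiteness.

```python
def gershgorin_intervals(matrix):
    """Return (center - radius, center + radius) for each row of a
    real symmetric matrix given as a list of lists."""
    intervals = []
    for i, row in enumerate(matrix):
        radius = sum(abs(v) for j, v in enumerate(row) if j != i)
        intervals.append((row[i] - radius, row[i] + radius))
    return intervals


def certified_negative_semidefinite(matrix):
    """Sufficient (not necessary) test: every Gershgorin interval
    must lie inside (-inf, 0]."""
    return all(upper <= 0.0 for _, upper in gershgorin_intervals(matrix))
```

The matrix `[[-1.0, 1.5], [1.5, -3.0]]` has a negative trace and a positive determinant, so both eigenvalues are negative, yet the check rejects it because the first row's interval reaches 0.5. This is a small instance of the conservatism that the abstract reports as a limitation.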

[AI-3] BAnG: Bidirectional Anchored Generation for Conditional RNA Design

Link: https://arxiv.org/abs/2502.21274
Authors: Roman Klypa,Alberto Bietti,Sergei Grudinin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)

Abstract:Designing RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Existing computational approaches require a substantial amount of experimentally determined RNA sequences for each specific protein or a detailed knowledge of RNA structure, restricting their utility in practice. To address this limitation, we develop RNA-BAnG, a deep learning-based model designed to generate RNA sequences for protein interactions without these requirements. Central to our approach is a novel generative method, Bidirectional Anchored Generation (BAnG), which leverages the observation that protein-binding RNA sequences often contain functional binding motifs embedded within broader sequence contexts. We first validate our method on generic synthetic tasks involving similar localized motifs to those appearing in RNAs, demonstrating its benefits over existing generative approaches. We then evaluate our model on biological sequences, showing its effectiveness for conditional RNA sequence design given a binding protein.

[AI-4] ReaLJam: Real-Time Human-AI Music Jamming with Reinforcement Learning-Tuned Transformers

Link: https://arxiv.org/abs/2502.21267
Authors: Alexander Scarlatos,Yusong Wu,Ian Simon,Adam Roberts,Tim Cooijmans,Natasha Jaques,Cassie Tarakajian,Cheng-Zhi Anna Huang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Published in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '25), April 26-May 1, 2025, Yokohama, Japan

Abstract:Recent advances in generative artificial intelligence (AI) have created models capable of high-quality musical content generation. However, little consideration is given to how to use these models for real-time or cooperative jamming musical applications because of crucial required features: low latency, the ability to communicate planned actions, and the ability to adapt to user input in real-time. To support these needs, we introduce ReaLJam, an interface and protocol for live musical jamming sessions between a human and a Transformer-based AI agent trained with reinforcement learning. We enable real-time interactions using the concept of anticipation, where the agent continually predicts how the performance will unfold and visually conveys its plan to the user. We conduct a user study where experienced musicians jam in real-time with the agent through ReaLJam. Our results demonstrate that ReaLJam enables enjoyable and musically interesting sessions, and we uncover important takeaways for future work.

[AI-5] Supporting the development of Machine Learning for fundamental science in a federated Cloud with the AI_INFN platform

Link: https://arxiv.org/abs/2502.21266
Authors: Lucio Anderlini,Matteo Barbetti,Giulio Bianchini,Diego Ciangottini,Stefano Dal Pra,Diego Michelotto,Carmelo Pellegrino,Rosa Petrini,Alessandro Pascolini,Daniele Spiga
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
Comments: Under review in EPJ Web of Conferences (CHEP 2024)

Abstract:Machine Learning (ML) is driving a revolution in the way scientists design, develop, and deploy data-intensive software. However, the adoption of ML presents new challenges for the computing infrastructure, particularly in terms of provisioning and orchestrating access to hardware accelerators for development, testing, and production. The INFN-funded project AI_INFN (“Artificial Intelligence at INFN”) aims at fostering the adoption of ML techniques within INFN use cases by providing support on multiple aspects, including the provision of AI-tailored computing resources. It leverages cloud-native solutions in the context of INFN Cloud, to share hardware accelerators as effectively as possible, ensuring the diversity of the Institute’s research activities is not compromised. In this contribution, we provide an update on the commissioning of a Kubernetes platform designed to ease the development of GPU-powered data analysis workflows and their scalability on heterogeneous, distributed computing resources, possibly federated as Virtual Kubelets with the interLink provider.

[AI-6] Modeling Human Beliefs about AI Behavior for Scalable Oversight

Link: https://arxiv.org/abs/2502.21262
Authors: Leon Lang,Patrick Forré
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 53 pages

Abstract:Contemporary work in AI alignment often relies on human feedback to teach AI systems human preferences and values. Yet as AI systems grow more capable, human feedback becomes increasingly unreliable. This raises the problem of scalable oversight: How can we supervise AI systems that exceed human capabilities? In this work, we propose to model the human evaluator’s beliefs about the AI system’s behavior to better interpret the human’s feedback. We formalize human belief models and theoretically analyze their role in inferring human values. We then characterize the remaining ambiguity in this inference and conditions for which the ambiguity disappears. To mitigate reliance on exact belief models, we then introduce the relaxation of human belief model covering. Finally, we propose using foundation models to construct covering belief models, providing a new potential approach to scalable oversight.

[AI-7] Towards Developing Ethical Reasoners: Integrating Probabilistic Reasoning and Decision-Making for Complex AI Systems

Link: https://arxiv.org/abs/2502.21250
Authors: Nijesh Upreti,Jessica Ciupa,Vaishak Belle
Subjects: Artificial Intelligence (cs.AI)

Abstract:A computational ethics framework is essential for AI and autonomous systems operating in complex, real-world environments. Existing approaches often lack the adaptability needed to integrate ethical principles into dynamic and ambiguous contexts, limiting their effectiveness across diverse scenarios. To address these challenges, we outline the necessary ingredients for building a holistic, meta-level framework that combines intermediate representations, probabilistic reasoning, and knowledge representation. The specifications therein emphasize scalability, supporting ethical reasoning at both individual decision-making levels and within the collective dynamics of multi-agent systems. By integrating theoretical principles with contextual factors, it facilitates structured and context-aware decision-making, ensuring alignment with overarching ethical standards. We further explore proposed theorems outlining how ethical reasoners should operate, offering a foundation for practical implementation. These constructs aim to support the development of robust and ethically reliable AI systems capable of navigating the complexities of real-world moral decision-making scenarios.

[AI-8] ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12000 GPUs

Link: https://arxiv.org/abs/2502.21231
Authors: Hao Ge,Junda Feng,Qi Huang,Fangcheng Fu,Xiaonan Nie,Lei Zuo,Haibin Lin,Bin Cui,Xin Liu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 21 figures

Abstract:Scaling long-context ability is essential for Large Language Models (LLMs). To amortize the memory consumption across multiple devices in long-context training, inter-data partitioning (a.k.a. Data Parallelism) and intra-data partitioning (a.k.a. Context Parallelism) are commonly used. Current training frameworks predominantly treat the two techniques as orthogonal, and establish static communication groups to organize the devices as a static mesh (e.g., a 2D mesh). However, the sequences for LLM training typically vary in lengths, no matter for texts, multi-modalities or reinforcement learning. The mismatch between data heterogeneity and static mesh causes redundant communication and imbalanced computation, degrading the training efficiency. In this work, we introduce ByteScale, an efficient, flexible, and scalable LLM training framework for large-scale mixed training of long and short sequences. The core of ByteScale is a novel parallelism strategy, namely Hybrid Data Parallelism (HDP), which unifies the inter- and intra-data partitioning with a dynamic mesh design. In particular, we build a communication optimizer, which eliminates the redundant communication for short sequences by data-aware sharding and dynamic communication, and further compresses the communication cost for long sequences by selective offloading. Besides, we also develop a balance scheduler to mitigate the imbalanced computation by parallelism-aware data assignment. We evaluate ByteScale with the model sizes ranging from 7B to 141B, context lengths from 256K to 2048K, on a production cluster with more than 12,000 GPUs. Experiment results show that ByteScale outperforms the state-of-the-art training system by up to 7.89x. 

[AI-9] XAIxArts Manifesto: Explainable AI for the Arts

Link: https://arxiv.org/abs/2502.21220
Authors: Nick Bryan-Kinns,Shuoyang Jasper Zheng,Francisco Castro,Makayla Lewis,Jia-Rey Chang,Gabriel Vigliensoni,Terence Broad,Michael Clemens,Elizabeth Wilson
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Author version of paper in: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, April 26-May 1, 2025, Yokohama, Japan DOI https://doi.org/10.1145/3706599.3716227 ISBN 979-8-4007-1395-8/25/04

Abstract:Explainable AI (XAI) is concerned with how to make AI models more understandable to people. To date these explanations have predominantly been technocentric - mechanistic or productivity oriented. This paper introduces the Explainable AI for the Arts (XAIxArts) manifesto to provoke new ways of thinking about explainability and AI beyond technocentric discourses. Manifestos offer a means to communicate ideas, amplify unheard voices, and foster reflection on practice. To support the co-creation and revision of the XAIxArts manifesto we combine a World Café style discussion format with a living manifesto to question four core themes: 1) Empowerment, Inclusion, and Fairness; 2) Valuing Artistic Practice; 3) Hacking and Glitches; and 4) Openness. Through our interactive living manifesto experience we invite participants to actively engage in shaping this XAIxArts vision within the CHI community and beyond.

[AI-10] An Algebraic Framework for Hierarchical Probabilistic Abstraction

Link: https://arxiv.org/abs/2502.21216
Authors: Nijesh Upreti,Vaishak Belle
Subjects: Artificial Intelligence (cs.AI)

Abstract:Abstraction is essential for reducing the complexity of systems across diverse fields, yet designing effective abstraction methodology for probabilistic models is inherently challenging due to stochastic behaviors and uncertainties. Current approaches often distill detailed probabilistic data into higher-level summaries to support tractable and interpretable analyses, though they typically struggle to fully represent the relational and probabilistic hierarchies through single-layered abstractions. We introduce a hierarchical probabilistic abstraction framework aimed at addressing these challenges by extending a measure-theoretic foundation for hierarchical abstraction. The framework enables modular problem-solving via layered mappings, facilitating both detailed layer-specific analysis and a cohesive system-wide understanding. This approach bridges high-level conceptualization with low-level perceptual data, enhancing interpretability and allowing layered analysis. Our framework provides a robust foundation for abstraction analysis across AI subfields, particularly in aligning System 1 and System 2 thinking, thereby supporting the development of diverse abstraction methodologies.

[AI-11] Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought ICLR2025

Link: https://arxiv.org/abs/2502.21212
Authors: Jianhao Huang,Zixuan Wang,Jason D. Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICLR 2025 Spotlight

Abstract:Chain of Thought (CoT) prompting has been shown to significantly improve the performance of large language models (LLMs), particularly in arithmetic and reasoning tasks, by instructing the model to produce intermediate reasoning steps. Despite the remarkable empirical success of CoT and its theoretical advantages in enhancing expressivity, the mechanisms underlying CoT training remain largely unexplored. In this paper, we study the training dynamics of transformers over a CoT objective on an in-context weight prediction task for linear regression. We prove that while a one-layer linear transformer without CoT can only implement a single step of gradient descent (GD) and fails to recover the ground-truth weight vector, a transformer with CoT prompting can learn to perform multi-step GD autoregressively, achieving near-exact recovery. Furthermore, we show that the trained transformer effectively generalizes on the unseen data. With our technique, we also show that looped transformers significantly improve final performance compared to transformers without looping in the in-context learning of linear regression. Empirically, we demonstrate that CoT prompting yields substantial performance improvements.
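
The gap between a single gradient-descent step and several steps, which this abstract is built around, can be seen on a toy linear regression problem. The sketch below is plain gradient descent in Python, not a transformer and not the paper's construction; the function name `gd_linear_regression`, the zero initialization, and the learning rate are assumptions chosen for illustration.

```python
def gd_linear_regression(xs, ys, steps, lr=1.0):
    """Run `steps` gradient-descent updates on the loss
    (1 / 2n) * sum of squared residuals, starting from the zero vector."""
    dim = len(xs[0])
    n = len(xs)
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in zip(xs, ys):
            residual = sum(wi * xi for wi, xi in zip(w, x)) - y
            for d in range(dim):
                grad[d] += residual * x[d] / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Noise-free data generated from the ground-truth weights [2, -1].
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
ys = [2.0, -1.0, 1.0, 3.0]

one_step = gd_linear_regression(xs, ys, steps=1)
many_steps = gd_linear_regression(xs, ys, steps=20)
```

On this data one step moves only part of the way toward the true weights, while twenty steps recover them to high precision. That mirrors, in a much simpler setting, the one-step versus multi-step distinction the abstract draws.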

[AI-12] ARIES: Autonomous Reasoning with LLMs on Interactive Thought Graph Environments

Link: https://arxiv.org/abs/2502.21208
Authors: Pedro Gimenes,Zeyu Cao,Jeffrey Wong,Yiren Zhao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent research has shown that LLM performance on reasoning tasks can be enhanced by scaling test-time compute. One promising approach, particularly with decomposable problems, involves arranging intermediate solutions as a graph on which transformations are performed to explore the solution space. However, prior works rely on pre-determined, task-specific transformation schedules which are subject to a set of searched hyperparameters. In this work, we view thought graph transformations as actions in a Markov decision process, and implement policy agents to drive effective action policies for the underlying reasoning LLM agent. In particular, we investigate the ability of another LLM to act as a policy agent on thought graph environments and introduce ARIES, a multi-agent architecture for reasoning with LLMs. In ARIES, reasoning LLM agents solve decomposed subproblems, while policy LLM agents maintain visibility of the thought graph states, and dynamically adapt the problem-solving strategy. Through extensive experiments, we observe that using off-the-shelf LLMs as policy agents with no supervised fine-tuning (SFT) can yield up to 29% higher accuracy on HumanEval relative to static transformation schedules, while reducing inference costs by 35% and avoiding any search requirements. We also conduct a thorough analysis of observed failure modes, highlighting that limitations on LLM sizes and the depth of problem decomposition can be seen as challenges to scaling LLM-guided reasoning.

[AI-13] AMPLE: Event-Driven Accelerator for Mixed-Precision Inference of Graph Neural Networks

Link: https://arxiv.org/abs/2502.21196
Authors: Pedro Gimenes,Yiren Zhao,George Constantinides
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Graph Neural Networks (GNNs) have recently gained attention due to their performance on non-Euclidean data. The use of custom hardware architectures proves particularly beneficial for GNNs due to their irregular memory access patterns, resulting from the sparse structure of graphs. However, existing FPGA accelerators are limited by their double buffering mechanism, which doesn’t account for the irregular node distribution in typical graph datasets. To address this, we introduce AMPLE (Accelerated Message Passing Logic Engine), an FPGA accelerator leveraging a new event-driven programming flow. We develop a mixed-arithmetic architecture, enabling GNN inference to be quantized at a node-level granularity. Finally, a prefetcher for data and instructions is implemented to optimize off-chip memory access and maximize node parallelism. Evaluation on citation and social media graph datasets ranging from 2K to 700K nodes showed a mean speedup of 243× and 7.2× against CPU and GPU counterparts, respectively.

[AI-14] Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction ICLR2025

Link: https://arxiv.org/abs/2502.21186
Authors: Baiting Luo,Ava Pettet,Aron Laszka,Abhishek Dubey,Ayan Mukhopadhyay
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted by ICLR2025. Code would be available at this https URL

Abstract:Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges. We explore this challenge in the traditional offline RL setting, where an agent must learn how to make decisions based on data collected through a stochastic behavior policy. We present Latent Macro Action Planner (L-MAP), which addresses this challenge by learning a set of temporally extended macro-actions through a state-conditional Vector Quantized Variational Autoencoder (VQ-VAE), effectively reducing action dimensionality. L-MAP employs a (separate) learned prior model that acts as a latent transition model and allows efficient sampling of plausible actions. During planning, our approach accounts for stochasticity in both the environment and the behavior policy by using Monte Carlo tree search (MCTS). In offline RL settings, including stochastic continuous control tasks, L-MAP efficiently searches over discrete latent actions to yield high expected returns. Empirical results demonstrate that L-MAP maintains low decision latency despite increased action dimensionality. Notably, across tasks ranging from continuous control with inherently stochastic dynamics to high-dimensional robotic hand manipulation, L-MAP significantly outperforms existing model-based methods and performs on-par with strong model-free actor-critic baselines, highlighting the effectiveness of the proposed approach in planning in complex and stochastic environments with high-dimensional action spaces.

[AI-15] A Survey of Link Prediction in Temporal Networks

Link: https://arxiv.org/abs/2502.21185
Authors: Jiafeng Xiong,Ahmad Zareie,Rizos Sakellariou
Subjects: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Abstract:Temporal networks have gained significant prominence in the past decade for modelling dynamic interactions within complex systems. A key challenge in this domain is Temporal Link Prediction (TLP), which aims to forecast future connections by analysing historical network structures across various applications including social network analysis. While existing surveys have addressed specific aspects of TLP, they typically lack a comprehensive framework that distinguishes between representation and inference methods. This survey bridges this gap by introducing a novel taxonomy that explicitly examines representation and inference from existing methods, providing a novel classification of approaches for TLP. We analyse how different representation techniques capture temporal and structural dynamics, examining their compatibility with various inference methods for both transductive and inductive prediction tasks. Our taxonomy not only clarifies the methodological landscape but also reveals promising unexplored combinations of existing techniques. This taxonomy provides a systematic foundation for emerging challenges in TLP, including model explainability and scalable architectures for complex temporal networks.

[AI-16] Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning

Link: https://arxiv.org/abs/2502.21142
Authors: Léopold Maytié,Roland Bertin Johannet,Rufin VanRullen
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: Under review in a conference

Abstract:Humans leverage rich internal models of the world to reason about the future, imagine counterfactuals, and adapt flexibly to new situations. In Reinforcement Learning (RL), world models aim to capture how the environment evolves in response to the agent’s actions, facilitating planning and generalization. However, typical world models directly operate on the environment variables (e.g. pixels, physical attributes), which can make their training slow and cumbersome; instead, it may be advantageous to rely on high-level latent dimensions that capture relevant multimodal variables. Global Workspace (GW) Theory offers a cognitive framework for multimodal integration and information broadcasting in the brain, and recent studies have begun to introduce efficient deep learning implementations of GW. Here, we evaluate the capabilities of an RL system combining GW with a world model. We compare our GW-Dreamer with various versions of the standard PPO and the original Dreamer algorithms. We show that performing the dreaming process (i.e., mental simulation) inside the GW latent space allows for training with fewer environment steps. As an additional emergent property, the resulting model (but not its comparison baselines) displays strong robustness to the absence of one of its observation modalities (images or simulation attributes). We conclude that the combination of GW with World Models holds great potential for improving decision-making in RL agents.

[AI-17] Predicting clinical outcomes from patient care pathways represented with temporal knowledge graphs

Link: https://arxiv.org/abs/2502.21138
Authors: Jong Ho Jhee,Alberto Megina,Pacôme Constant Dit Beaufils,Matilde Karakachoff,Richard Redon,Alban Gaignard,Adrien Coulet
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Background: With the increasing availability of healthcare data, predictive modeling finds many applications in the biomedical domain, such as the evaluation of the level of risk for various conditions, which in turn can guide clinical decision making. However, it is unclear how knowledge graph data representations and their embedding, which are competitive in some settings, could be of interest in biomedical predictive modeling. Method: We simulated synthetic but realistic data of patients with intracranial aneurysm and experimented on the task of predicting their clinical outcome. We compared the performance of various classification approaches on tabular data versus a graph-based representation of the same data. Next, we investigated how the adopted schema for representing first individual data and second temporal data impacts predictive performances. Results: Our study illustrates that in our case, a graph representation and Graph Convolutional Network (GCN) embeddings reach the best performance for a predictive task from observational data. We emphasize the importance of the adopted schema and of the consideration of literal values in the representation of individual data. Our study also moderates the relative impact of various time encoding on GCN performance.

[AI-18] Dynamically Local-Enhancement Planner for Large-Scale Autonomous Driving

Link: https://arxiv.org/abs/2502.21134
Authors: Nanshan Deng,Weitao Zhou,Bo Zhang,Junze Wen,Kun Jiang,Zhong Cao,Diange Yang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Current autonomous vehicles operate primarily within limited regions, but there is increasing demand for broader applications. However, as models scale, their limited capacity becomes a significant challenge for adapting to novel scenarios. It is increasingly difficult to improve models for new situations using a single monolithic model. To address this issue, we introduce the concept of dynamically enhancing a basic driving planner with local driving data, without permanently modifying the planner itself. This approach, termed the Dynamically Local-Enhancement (DLE) Planner, aims to improve the scalability of autonomous driving systems without significantly expanding the planner’s size. Our approach introduces a position-varying Markov Decision Process formulation coupled with a graph neural network that extracts region-specific driving features from local observation data. The learned features describe the local behavior of the surrounding objects, which is then leveraged to enhance a basic reinforcement learning-based policy. We evaluated our approach in multiple scenarios and compared it with a one-for-all driving model. The results show that our method outperforms the baseline policy in both safety (collision rate) and average reward, while maintaining a lighter scale. This approach has the potential to benefit large-scale autonomous vehicles without the need for largely expanding on-device driving models.

[AI-19] Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models

Link: https://arxiv.org/abs/2502.21123
Authors: Ruta Binkyte,Ivaxi Sheth,Zhijing Jin,Muhammad Havaei,Bernhardt Schölkopf,Mario Fritz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Ensuring trustworthiness in machine learning (ML) systems is crucial as they become increasingly embedded in high-stakes domains. This paper advocates for the integration of causal methods into machine learning to navigate the trade-offs among key principles of trustworthy ML, including fairness, privacy, robustness, accuracy, and explainability. While these objectives should ideally be satisfied simultaneously, they are often addressed in isolation, leading to conflicts and suboptimal solutions. Drawing on existing applications of causality in ML that successfully align goals such as fairness and accuracy or privacy and robustness, this paper argues that a causal approach is essential for balancing multiple competing objectives in both trustworthy ML and foundation models. Beyond highlighting these trade-offs, we examine how causality can be practically integrated into ML and foundation models, offering solutions to enhance their reliability and interpretability. Finally, we discuss the challenges, limitations, and opportunities in adopting causal frameworks, paving the way for more accountable and ethically sound AI systems.

[AI-20] AuthSim: Towards Authentic and Effective Safety-critical Scenario Generation for Autonomous Driving Tests

Link: https://arxiv.org/abs/2502.21100
Authors: Yukuan Yang,Xucheng Lu,Zhili Zhang,Zepeng Wu,Guoqi Li,Lingzhong Meng,Yunzhi Xue
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Generating adversarial safety-critical scenarios is a pivotal method for testing autonomous driving systems, as it identifies potential weaknesses and enhances system robustness and reliability. However, existing approaches predominantly emphasize unrestricted collision scenarios, prompting non-player character (NPC) vehicles to attack the ego vehicle indiscriminately. These works overlook these scenarios’ authenticity, rationality, and relevance, resulting in numerous extreme, contrived, and largely unrealistic collision events involving aggressive NPC vehicles. To rectify this issue, we propose a three-layer relative safety region model, which partitions the area based on danger levels and increases the likelihood of NPC vehicles entering relative boundary regions. This model directs NPC vehicles to engage in adversarial actions within relatively safe boundary regions, thereby augmenting the scenarios’ authenticity. We introduce AuthSim, a comprehensive platform for generating authentic and effective safety-critical scenarios by integrating the three-layer relative safety region model with reinforcement learning. To our knowledge, this is the first attempt to address the authenticity and effectiveness of autonomous driving system test scenarios comprehensively. Extensive experiments demonstrate that AuthSim outperforms existing methods in generating effective safety-critical scenarios. Notably, AuthSim achieves a 5.25% improvement in average cut-in distance and a 27.12% enhancement in average collision interval time, while maintaining higher efficiency in generating effective safety-critical scenarios compared to existing methods. This underscores its significant advantage in producing authentic scenarios over current methodologies.

[AI-21] An LLM-based Delphi Study to Predict GenAI Evolution

Link: https://arxiv.org/abs/2502.21092
Authors: Francesco Bertolotti,Luca Mari
Subjects: Artificial Intelligence (cs.AI)

Abstract:Predicting the future trajectory of complex and rapidly evolving systems remains a significant challenge, particularly in domains where data is scarce or unreliable. This study introduces a novel approach to qualitative forecasting by leveraging Large Language Models to conduct Delphi studies. The methodology was applied to explore the future evolution of Generative Artificial Intelligence, revealing insights into key factors such as geopolitical tensions, economic disparities, regulatory frameworks, and ethical considerations. The results highlight how LLM-based Delphi studies can facilitate structured scenario analysis, capturing diverse perspectives while mitigating issues such as respondent fatigue. However, limitations emerge in terms of knowledge cutoffs, inherent biases, and sensitivity to initial conditions. While the approach provides an innovative means for structured foresight, this method could also be considered a novel form of reasoning. Further research is needed to refine its ability to manage heterogeneity, improve reliability, and integrate external data sources.

[AI-22] Are foundation models useful feature extractors for electroencephalography analysis?

Link: https://arxiv.org/abs/2502.21086
Authors: Özgün Turgut,Felix S. Bott,Markus Ploner,Daniel Rueckert
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The success of foundation models in natural language processing and computer vision has motivated similar approaches for general time series analysis. While these models are effective for a variety of tasks, their applicability in medical domains with limited data remains largely unexplored. To address this, we investigate the effectiveness of foundation models in medical time series analysis involving electroencephalography (EEG). Through extensive experiments on tasks such as age prediction, seizure detection, and the classification of clinically relevant EEG events, we compare their diagnostic accuracy with that of specialised EEG models. Our analysis shows that foundation models extract meaningful EEG features, outperform specialised models even without domain adaptation, and localise task-specific biomarkers. Moreover, we demonstrate that diagnostic accuracy is substantially influenced by architectural choices such as context length. Overall, our study reveals that foundation models with general time series understanding eliminate the dependency on large domain-specific datasets, making them valuable tools for clinical practice.

[AI-23] Robust Deterministic Policy Gradient for Disturbance Attenuation and Its Application to Quadrotor Control

Link: https://arxiv.org/abs/2502.21057
Authors: Taeho Lee,Donghwan Lee
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Practical control systems pose significant challenges in identifying optimal control policies due to uncertainties in the system model and external disturbances. While $H_\infty$ control techniques are commonly used to design robust controllers that mitigate the effects of disturbances, these methods often require complex and computationally intensive calculations. To address this issue, this paper proposes a reinforcement learning algorithm called Robust Deterministic Policy Gradient (RDPG), which formulates the $H_\infty$ control problem as a two-player zero-sum dynamic game. In this formulation, one player (the user) aims to minimize the cost, while the other player (the adversary) seeks to maximize it. We then employ deterministic policy gradient (DPG) and its deep reinforcement learning counterpart to train a robust control policy with effective disturbance attenuation. In particular, for practical implementation, we introduce an algorithm called robust deep deterministic policy gradient (RDDPG), which employs a deep neural network architecture and integrates techniques from the twin-delayed deep deterministic policy gradient (TD3) to enhance stability and learning efficiency. To evaluate the proposed algorithm, we implement it on an unmanned aerial vehicle (UAV) tasked with following a predefined path in a disturbance-prone environment. The experimental results demonstrate that the proposed method outperforms other control approaches in terms of robustness against disturbances, enabling precise real-time tracking of moving targets even under severe disturbance conditions.
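
For readers unfamiliar with the game-theoretic view mentioned in the abstract, the textbook discrete-time form of the zero-sum game can be written as below. This is a generic sketch, not the paper's exact objective: the dynamics $f$, stage cost $c$, attenuation level $\gamma$, controller $\pi$ and adversary $\mu$ are notation assumed here for illustration.

```latex
% Assumed notation: state x_k, control u_k = \pi(x_k), disturbance w_k = \mu(x_k).
\begin{aligned}
  x_{k+1} &= f(x_k, u_k, w_k), \\
  J(\pi, \mu) &= \sum_{k=0}^{\infty} \Big( c(x_k, u_k) - \gamma^{2} \lVert w_k \rVert^{2} \Big), \\
  \pi^{*} &= \arg\min_{\pi} \max_{\mu} J(\pi, \mu).
\end{aligned}
```

The $\gamma^{2}$ term charges the adversary for the disturbance energy it injects. In the standard treatment, and under suitable conditions, a finite game value corresponds to the closed loop attenuating disturbances by a factor of at most $\gamma$.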

[AI-24] Fast Adversarial Training against Sparse Attacks Requires Loss Smoothing

Link: https://arxiv.org/abs/2502.21041
Authors: Xuyang Zhong,Yixiao Huang,Chen Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper studies fast adversarial training against sparse adversarial perturbations bounded by the $l_0$ norm. We demonstrate the challenges of employing 1-step attacks on $l_0$-bounded perturbations for fast adversarial training, including degraded performance and the occurrence of catastrophic overfitting (CO). We highlight that CO in $l_0$ adversarial training is caused by sub-optimal perturbation locations of the 1-step attack. Theoretical and empirical analyses reveal that the loss landscape of $l_0$ adversarial training is more craggy compared to its $l_\infty$, $l_2$ and $l_1$ counterparts. Moreover, we corroborate that the craggy loss landscape can aggravate CO. To address these issues, we propose Fast-LS-$l_0$, which incorporates soft labels and the trade-off loss function to smooth the adversarial loss landscape. Extensive experiments demonstrate our method can overcome the challenge of catastrophic overfitting, achieve state-of-the-art performance, and narrow down the performance gap between 1-step and multi-step adversarial training against sparse attacks.
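
As a small illustration of what an l_0 bound means in practice, the sketch below projects a perturbation onto the set of vectors with at most k non-zero entries by keeping the k largest-magnitude components. It is not the paper's attack or training method; the helper names `project_l0` and `l0_norm` are hypothetical and chosen only for this example.

```python
from typing import List


def project_l0(delta: List[float], k: int) -> List[float]:
    """Keep the k largest-magnitude entries of delta and zero out the rest."""
    if k <= 0:
        return [0.0] * len(delta)
    # Indices sorted by descending magnitude; ties keep their original order.
    order = sorted(range(len(delta)), key=lambda i: -abs(delta[i]))
    keep = set(order[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(delta)]


def l0_norm(delta: List[float]) -> int:
    """Count the non-zero entries of delta."""
    return sum(1 for d in delta if d != 0.0)


delta = [0.3, -0.9, 0.05, 0.6, -0.1]
sparse = project_l0(delta, k=2)
print(sparse)           # [0.0, -0.9, 0.0, 0.6, 0.0]
print(l0_norm(sparse))  # 2
```

Because the bound constrains which positions may change rather than how much, choosing the positions is a discrete decision, which is consistent with the abstract's point that perturbation locations matter.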

[AI-25] Reward Learning from Multiple Feedback Types ICLR2025

Link: https://arxiv.org/abs/2502.21038
Authors: Yannick Metz,András Geiszl,Raphaël Baur,Mennatallah El-Assady
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published as a conference paper at ICLR 2025

Abstract:Learning rewards from preference feedback has become an important tool in the alignment of agentic models. Preference-based feedback, often implemented as a binary comparison between multiple completions, is an established method to acquire large-scale human feedback. However, human feedback in other contexts is often much more diverse. Such diverse feedback can better support the goals of a human annotator, and the simultaneous use of multiple sources might be mutually informative for the learning process or carry type-dependent biases for the reward learning process. Despite these potential benefits, learning from different feedback types has yet to be explored extensively. In this paper, we bridge this gap by enabling experimentation and evaluating multi-type feedback in a broad set of environments. We present a process to generate high-quality simulated feedback of six different types. Then, we implement reward models and downstream RL training for all six feedback types. Based on the simulated feedback, we investigate the use of types of feedback across ten RL environments and compare them to pure preference-based baselines. We show empirically that diverse types of feedback can be utilized and lead to strong reward modeling performance. This work is the first strong indicator of the potential of multi-type feedback for RLHF.

[AI-26] Synthesizing Tabular Data Using Selectivity Enhanced Generative Adversarial Networks

Link: https://arxiv.org/abs/2502.21034
Authors: Youran Zhou,Jianzhong Qi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This thesis submitted to the University of Melbourne for partial fulfillment of the degree of Master of Data Science

Abstract:As E-commerce platforms face surging transactions during major shopping events like Black Friday, stress testing with synthesized data is crucial for resource planning. Most recent studies use Generative Adversarial Networks (GANs) to generate tabular data while ensuring privacy and machine learning utility. However, these methods overlook the computational demands of processing GAN-generated data, making them unsuitable for E-commerce stress testing. This thesis introduces a novel GAN-based approach incorporating query selectivity constraints, a key factor in database transaction processing. We integrate a pre-trained deep neural network to maintain selectivity consistency between real and synthetic data. Our method, tested on five real-world datasets, outperforms three state-of-the-art GANs and a VAE model, improving selectivity estimation accuracy by up to 20% and machine learning utility by up to 6%.

[AI-27] Measuring and identifying factors of individuals' trust in Large Language Models

Link: https://arxiv.org/abs/2502.21028
Authors: Edoardo Sebastiano De Duro,Giuseppe Alessandro Veltri,Hudson Golino,Massimo Stella
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 24 pages, 6 figures

Abstract:Large Language Models (LLMs) can engage in human-looking conversational exchanges. Although conversations can elicit trust between users and LLMs, scarce empirical research has examined trust formation in human-LLM contexts, beyond LLMs’ trustworthiness or human trust in AI in general. Here, we introduce the Trust-In-LLMs Index (TILLMI) as a new framework to measure individuals’ trust in LLMs, extending McAllister’s cognitive and affective trust dimensions to LLM-human interactions. We developed TILLMI as a psychometric scale, prototyped with a novel protocol we called LLM-simulated validity. The LLM-based scale was then validated in a sample of 1,000 US respondents. Exploratory Factor Analysis identified a two-factor structure. Two items were then removed due to redundancy, yielding a final 6-item scale with a 2-factor structure. Confirmatory Factor Analysis on a separate subsample showed strong model fit (CFI = .995, TLI = .991, RMSEA = .046, p_X^2 .05). Convergent validity analysis revealed that trust in LLMs correlated positively with openness to experience, extraversion, and cognitive flexibility, but negatively with neuroticism. Based on these findings, we interpreted TILLMI’s factors as “closeness with LLMs” (affective dimension) and “reliance on LLMs” (cognitive dimension). Younger males exhibited higher closeness with and reliance on LLMs compared to older women. Individuals with no direct experience with LLMs exhibited lower levels of trust compared to LLM users. These findings offer a novel empirical foundation for measuring trust in AI-driven verbal communication, informing responsible design, and fostering balanced human-AI collaboration.

[AI-28] Improving Open-world Continual Learning under the Constraints of Scarce Labeled Data

Link: https://arxiv.org/abs/2502.20974
Authors: Yujie Li,Xiangkun Wang,Xin Yang,Marcello Bonsangue,Junbo Zhang,Tianrui Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Open-world continual learning (OWCL) adapts to sequential tasks with open samples, learning knowledge incrementally while preventing forgetting. However, existing OWCL still requires a large amount of labeled data for training, which is often impractical in real-world applications. Given that new categories/entities typically come with limited annotations and are in small quantities, a more realistic situation is OWCL with scarce labeled data, i.e., few-shot training samples. Hence, this paper investigates the problem of open-world few-shot continual learning (OFCL), which is challenging in (i) learning unbounded tasks without forgetting previous knowledge and avoiding overfitting, (ii) constructing compact decision boundaries for open detection with limited labeled data, and (iii) transferring knowledge about knowns and unknowns and even updating the unknowns to knowns once the labels of open samples are learned. In response, we propose a novel OFCL framework that integrates three key components: (1) an instance-wise token augmentation (ITA) that represents and enriches sample representations with additional knowledge, (2) a margin-based open boundary (MOB) that supports open detection as new tasks emerge over time, and (3) an adaptive knowledge space (AKS) that endows unknowns with knowledge for the updating from unknowns to knowns. Finally, extensive experiments show the proposed OFCL framework outperforms all baselines remarkably with practical importance and reproducibility. The source code is released at this https URL.

[AI-29] Retrieval Augmented Generation for Topic Modeling in Organizational Research: An Introduction with Empirical Demonstration

Link: https://arxiv.org/abs/2502.20963
Authors: Gerion Spielberger,Florian Artinger,Jochen Reb,Rudolf Kerschreiter
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments: 30 pages, 4 figures

Abstract:Analyzing textual data is the cornerstone of qualitative research. While traditional methods such as grounded theory and content analysis are widely used, they are labor-intensive and time-consuming. Topic modeling offers an automated complement. Yet, existing approaches, including LLM-based topic modeling, still struggle with issues such as high data preprocessing requirements, interpretability, and reliability. This paper introduces Agentic Retrieval-Augmented Generation (Agentic RAG) as a method for topic modeling with LLMs. It integrates three key components: (1) retrieval, enabling automatized access to external data beyond an LLM’s pre-trained knowledge; (2) generation, leveraging LLM capabilities for text synthesis; and (3) agent-driven learning, iteratively refining retrieval and query formulation processes. To empirically validate Agentic RAG for topic modeling, we reanalyze a Twitter/X dataset, previously examined by Mu et al. (2024a). Our findings demonstrate that the approach is more efficient and interpretable, and at the same time achieves higher reliability and validity, than both the standard machine learning approach and LLM prompting for topic modeling. These results highlight Agentic RAG’s ability to generate semantically relevant and reproducible topics, positioning it as a robust, scalable, and transparent alternative for AI-driven qualitative research in leadership, managerial, and organizational research.

[AI-30] Concealed Adversarial attacks on neural networks for sequential data

Link: https://arxiv.org/abs/2502.20948
Authors: Petr Sokerin,Dmitry Anikin,Sofia Krehova,Alexey Zaytsev
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The emergence of deep learning led to the broad usage of neural networks in the time series domain for various applications, including finance and medicine. While powerful, these models are prone to adversarial attacks: a benign targeted perturbation of input data leads to significant changes in a classifier’s output. However, formally small attacks in the time series domain become easily detected by the human eye or a simple detector model. We develop a concealed adversarial attack for different time-series models: it provides more realistic perturbations, being hard to detect by a human or model discriminator. To achieve this goal, the proposed adversarial attack maximizes an aggregation of a classifier and a trained discriminator loss. To make the attack stronger, we also propose a training procedure for a discriminator that provides broader coverage of possible attacks. Extensive benchmarking on six UCR time series datasets across four diverse architectures - including recurrent, convolutional, state-space, and transformer-based models - demonstrates the superiority of our attack for a concealability-efficiency trade-off. Our findings highlight the growing challenge of designing robust time series models, emphasizing the need for improved defenses against realistic and effective attacks.

[AI-31] Generative Uncertainty in Diffusion Models

Link: https://arxiv.org/abs/2502.20946
Authors: Metod Jazbec,Eliot Wong-Toi,Guoxuan Xia,Dan Zhang,Eric Nalisnick,Stephan Mandt
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Diffusion models have recently driven significant breakthroughs in generative modeling. While state-of-the-art models produce high-quality samples on average, individual samples can still be low quality. Detecting such samples without human inspection remains a challenging task. To address this, we propose a Bayesian framework for estimating generative uncertainty of synthetic samples. We outline how to make Bayesian inference practical for large, modern generative models and introduce a new semantic likelihood (evaluated in the latent space of a feature extractor) to address the challenges posed by high-dimensional sample spaces. Through our experiments, we demonstrate that the proposed generative uncertainty effectively identifies poor-quality samples and significantly outperforms existing uncertainty-based methods. Notably, our Bayesian framework can be applied post-hoc to any pretrained diffusion or flow matching model (via the Laplace approximation), and we propose simple yet effective techniques to minimize its computational overhead during sampling.

[AI-32] A Deep User Interface for Exploring LLaMa

Link: https://arxiv.org/abs/2502.20938
Authors: Divya Perumal,Swaroop Panda
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:The growing popularity and widespread adoption of large language models (LLMs) necessitate the development of tools that enhance the effectiveness of user interactions with these models. Understanding the structures and functions of these models poses a significant challenge for users. Visual analytics-driven tools enable users to explore and compare models, facilitating better decision-making. This paper presents a visual analytics-driven tool equipped with interactive controls for key hyperparameters, including top-p, frequency and presence penalty, enabling users to explore, examine and compare the outputs of LLMs. In a user study, we assessed the tool’s effectiveness, which received favorable feedback for its visual design, with particular commendation for the interface layout and ease of navigation. Additionally, the feedback provided valuable insights for enhancing the effectiveness of Human-LLM interaction tools.
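
To make the three hyperparameters named in the abstract concrete, here is a minimal sketch of how they are commonly defined for text generation. It is not the tool described in the paper, and the function names `apply_penalties` and `top_p_filter` are hypothetical. The penalty formula follows the widely documented convention of subtracting a per-occurrence frequency penalty plus a one-off presence penalty from a token's logit; a given model server may implement it differently.

```python
import math
from collections import Counter
from typing import Dict, List


def apply_penalties(logits: Dict[str, float], generated: List[str],
                    frequency_penalty: float, presence_penalty: float) -> Dict[str, float]:
    """Lower the logit of every token that already appears in the generated text."""
    counts = Counter(generated)
    return {
        tok: value
        - counts[tok] * frequency_penalty                  # grows with each repetition
        - (presence_penalty if counts[tok] > 0 else 0.0)   # flat, applied once
        for tok, value in logits.items()
    }


def top_p_filter(logits: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most-likely tokens whose probability mass reaches top_p."""
    peak = max(logits.values())
    exp = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exp.values())
    ranked = sorted(((tok, e / total) for tok, e in exp.items()), key=lambda kv: -kv[1])
    kept, mass = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalise so the surviving probabilities sum to 1.
    return {tok: p / mass for tok, p in kept}


logits = {"cat": 2.0, "dog": 1.0, "fish": 0.0}
print(apply_penalties(logits, ["cat", "cat"], 0.5, 0.5)["cat"])  # 0.5
print(sorted(top_p_filter(logits, 0.9)))                          # ['cat', 'dog']
```

Lowering `top_p` narrows the candidate set, while raising either penalty discourages repetition. These are the kinds of effects an interactive control would let a user observe side by side.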

[AI-33] DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping

Link: https://arxiv.org/abs/2502.20900
Authors: Yifan Zhong,Xuchuan Huang,Ruochong Li,Ceyao Zhang,Yitao Liang,Yaodong Yang,Yuanpei Chen
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 22 pages, 10 figures

Abstract:Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. However, existing research typically relies on specific assumptions, such as single-object settings or limited environments, leading to constrained generalization. Our solution is DexGraspVLA, a hierarchical framework that utilizes a pre-trained Vision-Language model as the high-level task planner and learns a diffusion-based policy as the low-level Action controller. The key insight lies in iteratively transforming diverse language and visual inputs into domain-invariant representations, where imitation learning can be effectively applied due to the alleviation of domain shift. Thus, it enables robust generalization across a wide range of real-world scenarios. Notably, our method achieves a 90+% success rate under thousands of unseen object, lighting, and background combinations in a “zero-shot” environment. Empirical analysis further confirms the consistency of internal model behavior across environmental variations, thereby validating our design and explaining its generalization performance. We hope our work can be a step forward in achieving general dexterous grasping. Our demo and code can be found at this https URL.

[AI-34] A Fused Gromov-Wasserstein Approach to Subgraph Contrastive Learning

Link: https://arxiv.org/abs/2502.20885
Authors: Amadou S. Sangare,Nicolas Dunou,Jhony H. Giraldo,Fragkiskos D. Malliaros
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Self-supervised learning has become a key method for training deep learning models when labeled data is scarce or unavailable. While graph machine learning holds great promise across various domains, the design of effective pretext tasks for self-supervised graph representation learning remains challenging. Contrastive learning, a popular approach in graph self-supervised learning, leverages positive and negative pairs to compute a contrastive loss function. However, current graph contrastive learning methods often struggle to fully use structural patterns and node similarities. To address these issues, we present a new method called Fused Gromov Wasserstein Subgraph Contrastive Learning (FOSSIL). Our model integrates node-level and subgraph-level contrastive learning, seamlessly combining a standard node-level contrastive loss with the Fused Gromov-Wasserstein distance. This combination helps our method capture both node features and graph structure together. Importantly, our approach works well with both homophilic and heterophilic graphs and can dynamically create views for generating positive and negative pairs. Through extensive experiments on benchmark graph datasets, we show that FOSSIL outperforms or achieves competitive performance compared to current state-of-the-art methods.
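
For reference, the Fused Gromov-Wasserstein distance named in the abstract is usually defined as below for two attributed graphs. The notation is assumed here and is not taken from the paper: $a_i, b_j$ are node features, $C_1, C_2$ are structure matrices (for example adjacency or shortest-path matrices), $h, g$ are node weight vectors, and $\alpha \in [0, 1]$ trades features against structure. How FOSSIL plugs this quantity into its contrastive loss is not specified in the abstract.

```latex
\mathrm{FGW}_{q,\alpha}(G_1, G_2)
  = \min_{\pi \in \Pi(h, g)}
    \sum_{i,j,k,l}
    \Big[ (1 - \alpha)\, d(a_i, b_j)^{q}
        + \alpha\, \big| C_1(i,k) - C_2(j,l) \big|^{q} \Big]
    \, \pi_{ij}\, \pi_{kl}
```

Here $\Pi(h, g)$ is the set of couplings whose marginals are $h$ and $g$. Setting $\alpha = 0$ recovers a Wasserstein distance on node features alone, and $\alpha = 1$ recovers the Gromov-Wasserstein distance on structure alone.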

[AI-35] Reinforcement Learning with Curriculum-inspired Adaptive Direct Policy Guidance for Truck Dispatching

Link: https://arxiv.org/abs/2502.20845
Authors: Shi Meng,Bin Tian,Xiaotong Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Efficient truck dispatching via Reinforcement Learning (RL) in open-pit mining is often hindered by reliance on complex reward engineering and value-based methods. This paper introduces Curriculum-inspired Adaptive Direct Policy Guidance, a novel curriculum learning strategy for policy-based RL to address these issues. We adapt Proximal Policy Optimization (PPO) for mine dispatching’s uneven decision intervals using time deltas in Temporal Difference and Generalized Advantage Estimation, and employ a Shortest Processing Time teacher policy for guided exploration via policy regularization and adaptive guidance. Evaluations in OpenMines demonstrate our approach yields a 10% performance gain and faster convergence over standard PPO across sparse and dense reward settings, showcasing improved robustness to reward design. This direct policy guidance method provides a general and effective curriculum learning technique for RL-based truck dispatching, enabling future work on advanced architectures.
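
One plausible way to read "time deltas in Temporal Difference and Generalized Advantage Estimation" is to discount each step by the time that actually elapsed before the next decision. The sketch below implements that reading. It is an assumption made for illustration, not the paper's confirmed formulation, and the function name `gae_with_time_deltas` is hypothetical.

```python
from typing import List


def gae_with_time_deltas(rewards: List[float], values: List[float], dts: List[float],
                         gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation with discounts raised to the elapsed time.

    values holds one more entry than rewards (the bootstrap value of the last state).
    dts[t] is the time between decision t and decision t + 1.
    """
    assert len(values) == len(rewards) + 1 and len(dts) == len(rewards)
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Assumed form: a longer gap discounts the next state's value more heavily.
        td_error = rewards[t] + gamma ** dts[t] * values[t + 1] - values[t]
        running = td_error + (gamma * lam) ** dts[t] * running
        advantages[t] = running
    return advantages


# With unit time deltas this reduces to ordinary GAE.
print(gae_with_time_deltas([1.0, 1.0], [0.0, 0.0, 0.0], [1, 1], gamma=0.5, lam=1.0))  # [1.5, 1.0]
# A two-unit gap after the first decision shrinks the credit it receives.
print(gae_with_time_deltas([1.0, 1.0], [0.0, 0.0, 0.0], [2, 1], gamma=0.5, lam=1.0))  # [1.25, 1.0]
```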

[AI-36] Neuro-Symbolic Learning for Galois Groups: Unveiling Probabilistic Trends in Polynomials

Link: https://arxiv.org/abs/2502.20844
Authors: Elira Shaska,Tony Shaska
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper presents a neurosymbolic approach to classifying Galois groups of polynomials, integrating classical Galois theory with machine learning to address challenges in algebraic computation. By combining neural networks with symbolic reasoning we develop a model that outperforms purely numerical methods in accuracy and interpretability. Focusing on sextic polynomials with height $\leq 6$, we analyze a database of 53,972 irreducible examples, uncovering novel distributional trends, such as the 20 sextic polynomials with Galois group $C_6$ spanning just seven invariant-defined equivalence classes. These findings offer the first empirical insights into Galois group probabilities under height constraints and lay the groundwork for exploring solvability by radicals. Demonstrating AI’s potential to reveal patterns beyond traditional symbolic techniques, this work paves the way for future research in computational algebra, with implications for probabilistic conjectures and higher degree classifications.

[AI-37] Hierarchical and Modular Network on Non-prehensile Manipulation in General Environments

Link: https://arxiv.org/abs/2502.20843
Authors: Yoonyoung Cho,Junhyek Han,Jisu Han,Beomjoon Kim
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: this http URL

Abstract:For robots to operate in general environments like households, they must be able to perform non-prehensile manipulation actions such as toppling and rolling to manipulate ungraspable objects. However, prior works on non-prehensile manipulation cannot yet generalize across environments with diverse geometries. The main challenge lies in adapting to varying environmental constraints: within a cabinet, the robot must avoid walls and ceilings; to lift objects to the top of a step, the robot must account for the step’s pose and extent. While deep reinforcement learning (RL) has demonstrated impressive success in non-prehensile manipulation, accounting for such variability presents a challenge for the generalist policy, as it must learn diverse strategies for each new combination of constraints. To address this, we propose a modular and reconfigurable architecture that adaptively reconfigures network modules based on task requirements. To capture the geometric variability in environments, we extend the contact-based object representation (CORN) to environment geometries, and propose a procedural algorithm for generating diverse environments to train our agent. Taken together, the resulting policy can zero-shot transfer to novel real-world environments and objects despite training entirely within a simulator. We additionally release a simulation-based benchmark featuring nine digital twins of real-world scenes with 353 objects to facilitate non-prehensile manipulation research in realistic domains.

[AI-38] Weakly Supervised Multiple Instance Learning for Whale Call Detection and Localization in Long-Duration Passive Acoustic Monitoring

Link: https://arxiv.org/abs/2502.20838
Authors: Ragib Amin Nihal,Benjamin Yen,Runwu Shi,Kazuhiro Nakadai
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:Marine ecosystem monitoring via Passive Acoustic Monitoring (PAM) generates vast data, but deep learning often requires precise annotations and short segments. We introduce DSMIL-LocNet, a Multiple Instance Learning framework for whale call detection and localization using only bag-level labels. Our dual-stream model processes 2-30 minute audio segments, leveraging spectral and temporal features with attention-based instance selection. Tests on Antarctic whale data show longer contexts improve classification (F1: 0.8-0.9) while medium instances ensure localization precision (0.65-0.70). This suggests MIL can enhance scalable marine monitoring. Code: this https URL
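
The idea of learning from bag-level labels with attention-based instance selection can be sketched as attention-weighted pooling: each short instance inside a long recording gets a score and an attention weight, the weighted sum gives the bag-level prediction, and the weights indicate where a call is likely to be. This is a generic illustration, not DSMIL-LocNet itself. The function name `attention_mil_pool` is hypothetical, and in a real model both input lists would come from a trained network.

```python
import math
from typing import List, Tuple


def attention_mil_pool(instance_scores: List[float],
                       attention_logits: List[float]) -> Tuple[float, List[float]]:
    """Combine per-instance scores into one bag score using softmax attention weights."""
    peak = max(attention_logits)
    exp = [math.exp(a - peak) for a in attention_logits]  # shifted for numerical stability
    total = sum(exp)
    weights = [e / total for e in exp]
    bag_score = sum(w * s for w, s in zip(weights, instance_scores))
    return bag_score, weights


# Three instances from one recording; the second one receives most of the attention.
bag_score, weights = attention_mil_pool([0.1, 0.9, 0.2], [0.0, 3.0, 0.0])
print(weights.index(max(weights)))  # 1
```

Only `bag_score` needs a label during training, which is why bag-level annotations suffice; the per-instance weights come for free and can serve as a coarse localization signal.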

[AI-39] LADs: Leveraging LLMs for AI-Driven DevOps

Link: https://arxiv.org/abs/2502.20825
Authors: Ahmad Faraz Khan,Azal Ahmad Khan,Anas Mohamed,Haider Ali,Suchithra Moolinti,Sabaat Haroon,Usman Tahir,Mattia Fazzini,Ali R. Butt,Ali Anwar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
Comments: 17 pages with Appendix, 8 figures, and 7 tables. This paper is currently Under Review

Abstract:Automating cloud configuration and deployment remains a critical challenge due to evolving infrastructures, heterogeneous hardware, and fluctuating workloads. Existing solutions lack adaptability and require extensive manual tuning, leading to inefficiencies and misconfigurations. We introduce LADs, the first LLM-driven framework designed to tackle these challenges by ensuring robustness, adaptability, and efficiency in automated cloud management. Instead of merely applying existing techniques, LADs provides a principled approach to configuration optimization through in-depth analysis of what optimization works under which conditions. By leveraging Retrieval-Augmented Generation, Few-Shot Learning, Chain-of-Thought, and Feedback-Based Prompt Chaining, LADs generates accurate configurations and learns from deployment failures to iteratively refine system settings. Our findings reveal key insights into the trade-offs between performance, cost, and scalability, helping practitioners determine the right strategies for different deployment scenarios. For instance, we demonstrate how prompt chaining-based adaptive feedback loops enhance fault tolerance in multi-tenant environments and how structured log analysis with example shots improves configuration accuracy. Through extensive evaluations, LADs reduces manual effort, optimizes resource utilization, and improves system reliability. By open-sourcing LADs, we aim to drive further innovation in AI-powered DevOps automation.

[AI-40] MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts

Link: https://arxiv.org/abs/2502.20808
Authors: Peijie Wang, Zhongzhi Li, Fei Yin, Dekang Ran, Chenglin Liu
Subjects: Artificial Intelligence (cs.AI)
Comments: 47 pages

Abstract:Multimodal Large Language Models (MLLMs) have shown promising capabilities in mathematical reasoning within visual contexts across various datasets. However, most existing multimodal math benchmarks are limited to single-visual contexts, which diverges from the multi-visual scenarios commonly encountered in real-world mathematical applications. To address this gap, we introduce MV-MATH: a meticulously curated dataset of 2,009 high-quality mathematical problems. Each problem integrates multiple images interleaved with text, derived from authentic K-12 scenarios, and enriched with detailed annotations. MV-MATH includes multiple-choice, free-form, and multi-step questions, covering 11 subject areas across 3 difficulty levels, and serves as a comprehensive and rigorous benchmark for assessing MLLMs’ mathematical reasoning in multi-visual contexts. Through extensive experimentation, we observe that MLLMs encounter substantial challenges in multi-visual math tasks, with a considerable performance gap relative to human capabilities on MV-MATH. Furthermore, we analyze the performance and error patterns of various models, providing insights into MLLMs’ mathematical reasoning capabilities within multi-visual settings.

[AI-41] Multimodal Learning for Just-In-Time Software Defect Prediction in Autonomous Driving Systems

Link: https://arxiv.org/abs/2502.20806
Authors: Faisal Mohammad, Duksan Ryu
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9

Abstract:In recent years, the rise of autonomous driving technologies has highlighted the critical importance of reliable software for ensuring safety and performance. This paper proposes a novel approach for just-in-time software defect prediction (JIT-SDP) in autonomous driving software systems using multimodal learning. The proposed model leverages the multimodal transformers in which the pre-trained transformers and a combining module deal with the multiple data modalities of the software system datasets such as code features, change metrics, and contextual information. The key point for adapting multimodal learning is to utilize the attention mechanism between the different data modalities such as text, numerical, and categorical. In the combining module, the output of a transformer model on text data and tabular features containing categorical and numerical data are combined to produce the predictions using the fully connected layers. Experiments conducted on three open-source autonomous driving system software projects collected from the GitHub repository (Apollo, Carla, and Donkeycar) demonstrate that the proposed approach significantly outperforms state-of-the-art deep learning and machine learning models regarding evaluation metrics. Our findings highlight the potential of multimodal learning to enhance the reliability and safety of autonomous driving software through improved defect prediction.

[AI-42] Characteristics Analysis of Autonomous Vehicle Pre-crash Scenarios

Link: https://arxiv.org/abs/2502.20789
Authors: Yixuan Li, Xuesong Wang, Tianyi Wang, Qian Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:To date, hundreds of crashes have occurred in open road testing of automated vehicles (AVs), highlighting the need for improving AV reliability and safety. Pre-crash scenario typology classifies crashes based on vehicle dynamics and kinematics features. Building on this, characteristics analysis can identify similar features under comparable crashes, offering a more effective reflection of general crash patterns and providing more targeted recommendations for enhancing AV performance. However, current studies primarily concentrated on crashes among conventional human-driven vehicles, leaving a gap in research dedicated to in-depth AV crash analyses. In this paper, we analyzed the latest California AV collision reports and used the newly revised pre-crash scenario typology to identify pre-crash scenarios. We proposed a set of mapping rules for automatically extracting these AV pre-crash scenarios, successfully identifying 24 types with a 98.1% accuracy rate, and obtaining two key scenarios of AV crashes (i.e., rear-end scenarios and intersection scenarios) through detailed analysis. Association analyses of rear-end scenarios showed that the significant environmental influencing factors were traffic control type, location type, light, etc. For intersection scenarios prone to severe crashes with detailed descriptions, we employed causal analyses to obtain the significant causal factors: habitual violations and expectations of certain behavior. Optimization recommendations were then formulated, addressing both governmental oversight and AV manufacturers’ potential improvements. The findings of this paper could guide government authorities to develop related regulations, help manufacturers design AV test scenarios, and identify potential shortcomings in control algorithms specific to various real-world scenarios, thereby optimizing AV systems effectively.

[AI-43] Flattening Supply Chains: When do Technology Improvements lead to Disintermediation?

Link: https://arxiv.org/abs/2502.20783
Authors: S. Nageeb Ali, Nicole Immorlica, Meena Jagadeesan, Brendan Lucier
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)

Abstract:In the digital economy, technological innovations make it cheaper to produce high-quality content. For example, generative AI tools reduce costs for creators who develop content to be distributed online, but can also reduce production costs for the users who consume that content. These innovations can thus lead to disintermediation, since consumers may choose to use these technologies directly, bypassing intermediaries. To investigate when technological improvements lead to disintermediation, we study a game with an intermediary, suppliers of a production technology, and consumers. First, we show disintermediation occurs whenever production costs are too high or too low. We then investigate the consequences of disintermediation for welfare and content quality at equilibrium. While the intermediary is welfare-improving, the intermediary extracts all gains to social welfare and its presence can raise or lower content quality. We further analyze how disintermediation is affected by the level of competition between suppliers and the intermediary’s fee structure. More broadly, our results take a step towards assessing how production technology innovations affect the survival of intermediaries and impact the digital economy.

[AI-44] Damper-B-PINN: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Vehicle State Estimation

Link: https://arxiv.org/abs/2502.20772
Authors: Tianyi Zeng, Tianyi Wang, Junfeng Jiao, Xinbo Chen
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:State estimation for Multi-Input Multi-Output (MIMO) systems with noise, such as vehicle chassis systems, presents a significant challenge due to the imperfect and complex relationship between inputs and outputs. To solve this problem, we design a Damper characteristics-based Bayesian Physics-Informed Neural Network (Damper-B-PINN). First, we introduce a neuron forward process inspired by the mechanical properties of dampers, which limits abrupt jumps in neuron values between epochs while maintaining search capability. Additionally, we apply an optimized Bayesian dropout layer to the MIMO system to enhance robustness against noise and prevent non-convergence issues. Physical information is incorporated into the loss function to serve as a physical prior for the neural network. The effectiveness of our Damper-B-PINN architecture is then validated across ten datasets and fourteen vehicle types, demonstrating superior accuracy, computational efficiency, and convergence in vehicle state estimation (i.e., dynamic wheel load) compared to other state-of-the-art benchmarks.

[AI-45] DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking

Link: https://arxiv.org/abs/2502.20730
Authors: Zhuoqun Li, Haiyang Yu, Xuanang Chen, Hongyu Lin, Yaojie Lu, Fei Huang, Xianpei Han, Yongbin Li, Le Sun
Subjects: Artificial Intelligence (cs.AI)

Abstract:Designing solutions for complex engineering challenges is crucial in human production activities. However, previous research in the retrieval-augmented generation (RAG) field has not sufficiently addressed tasks related to the design of complex engineering solutions. To fill this gap, we introduce a new benchmark, SolutionBench, to evaluate a system’s ability to generate complete and feasible solutions for engineering problems with multiple complex constraints. To further advance the design of complex engineering solutions, we propose a novel system, SolutionRAG, that leverages the tree-based exploration and bi-point thinking mechanism to generate reliable solutions. Extensive experimental results demonstrate that SolutionRAG achieves state-of-the-art (SOTA) performance on the SolutionBench, highlighting its potential to enhance the automation and reliability of complex engineering solution design in real-world applications.

[AI-46] NeuroMorse: A Temporally Structured Dataset For Neuromorphic Computing

Link: https://arxiv.org/abs/2502.20729
Authors: Ben Walters, Yeshwanth Bethi, Taylor Kergan, Binh Nguyen, Amirali Amirsoleimani, Jason K. Eshraghian, Saeed Afshar, Mostafa Rahimi Azghadi
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)

Abstract:Neuromorphic engineering aims to advance computing by mimicking the brain’s efficient processing, where data is encoded as asynchronous temporal events. This eliminates the need for a synchronisation clock and minimises power consumption when no data is present. However, many benchmarks for neuromorphic algorithms primarily focus on spatial features, neglecting the temporal dynamics that are inherent to most sequence-based tasks. This gap may lead to evaluations that fail to fully capture the unique strengths and characteristics of neuromorphic systems. In this paper, we present NeuroMorse, a temporally structured dataset designed for benchmarking neuromorphic learning systems. NeuroMorse converts the top 50 words in the English language into temporal Morse code spike sequences. Despite using only two input spike channels for Morse dots and dashes, complex information is encoded through temporal patterns in the data. The proposed benchmark contains feature hierarchy at multiple temporal scales that test the capacity of neuromorphic algorithms to decompose input patterns into spatial and temporal hierarchies. We demonstrate that our training set is challenging to categorise using a linear classifier and that identifying keywords in the test set is difficult using conventional methods. The NeuroMorse dataset is available at Zenodo, with our accompanying code on GitHub at this https URL.
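
As a rough illustration of the encoding idea in the abstract above, the sketch below turns a word into spike events on two input channels, one for dots and one for dashes. The function name `word_to_spikes`, the channel numbering, and the timing constants `SYMBOL_GAP` and `LETTER_GAP` are assumptions made for illustration only; the actual NeuroMorse dataset may use a different timing scheme.

```python
# Hypothetical sketch: encode a word as Morse-code spikes on two channels.
# Timing constants are assumed for illustration, not taken from the dataset.
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".", "f": "..-.",
    "g": "--.", "h": "....", "i": "..", "j": ".---", "k": "-.-", "l": ".-..",
    "m": "--", "n": "-.", "o": "---", "p": ".--.", "q": "--.-", "r": ".-.",
    "s": "...", "t": "-", "u": "..-", "v": "...-", "w": ".--", "x": "-..-",
    "y": "-.--", "z": "--..",
}

DOT_CHANNEL, DASH_CHANNEL = 0, 1
SYMBOL_GAP = 1  # assumed time steps between symbols inside one letter
LETTER_GAP = 3  # assumed time steps between the last spike of a letter and the next letter


def word_to_spikes(word):
    """Return a list of (time_step, channel) spike events for a word."""
    events = []
    t = 0
    for letter in word.lower():
        for symbol in MORSE[letter]:
            channel = DOT_CHANNEL if symbol == "." else DASH_CHANNEL
            events.append((t, channel))
            t += SYMBOL_GAP
        # Widen the gap after the final symbol of the letter.
        t += LETTER_GAP - SYMBOL_GAP
    return events


spikes = word_to_spikes("the")
```

Because only two channels exist, the identity of a letter is carried entirely by the order and spacing of the spikes, which is the temporal structure the benchmark is meant to probe.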

[AI-47] SPD: Sync-Point Drop for efficient tensor parallelism of Large Language Models

Link: https://arxiv.org/abs/2502.20727
Authors: Han-Byul Kim, Duc Hoang, Arnav Kundu, Mohammad Samragh, Minsik Cho
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint

Abstract:With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieve scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20% overall inference latency reduction with 1% accuracy regression for LLaMA2-70B inference over 8 GPUs.

[AI-48] Generating Clinically Realistic EHR Data via a Hierarchy- and Semantics-Guided Transformer

Link: https://arxiv.org/abs/2502.20719
Authors: Guanglin Zhou, Sebastiano Barbieri
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Generating realistic synthetic electronic health records (EHRs) holds tremendous promise for accelerating healthcare research, facilitating AI model development and enhancing patient privacy. However, existing generative methods typically treat EHRs as flat sequences of discrete medical codes. This approach overlooks two critical aspects: the inherent hierarchical organization of clinical coding systems and the rich semantic context provided by code descriptions. Consequently, synthetic patient sequences often lack high clinical fidelity and have limited utility in downstream clinical tasks. In this paper, we propose the Hierarchy- and Semantics-Guided Transformer (HiSGT), a novel framework that leverages both hierarchical and semantic information for the generative process. HiSGT constructs a hierarchical graph to encode parent-child and sibling relationships among clinical codes and employs a graph neural network to derive hierarchy-aware embeddings. These are then fused with semantic embeddings extracted from a pre-trained clinical language model (e.g., ClinicalBERT), enabling the Transformer-based generator to more accurately model the nuanced clinical patterns inherent in real EHRs. Extensive experiments on the MIMIC-III and MIMIC-IV datasets demonstrate that HiSGT significantly improves the statistical alignment of synthetic data with real patient records, as well as supports robust downstream applications such as chronic disease classification. By addressing the limitations of conventional raw code-based generative models, HiSGT represents a significant step toward clinically high-fidelity synthetic data generation and a general paradigm suitable for interpretable medical code representation, offering valuable applications in data augmentation and privacy-preserving healthcare analytics.

[AI-49] Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff

Link: https://arxiv.org/abs/2502.20704
Authors: Maximilian Holsman, Yukun Huang, Bhuwan Dhingra
Subjects: Artificial Intelligence (cs.AI)

Abstract:Speculative Decoding (SD) enforces strict distributional equivalence to the target model, limiting potential speed ups as distributions of near-equivalence achieve comparable outcomes in many cases. Furthermore, enforcing distributional equivalence means that users are unable to trade deviations from the target model distribution for further inference speed gains. To address these limitations, we introduce Fuzzy Speculative Decoding (FSD) - a decoding algorithm that generalizes SD by accepting candidate tokens purely based on the divergences between the target and draft model distributions. By allowing for controlled divergence from the target model, FSD enables users to flexibly trade generation quality for inference speed. Across several benchmarks, our method is able to achieve significant runtime improvements of over 5 tokens per second faster than SD at only an approximate 2% absolute reduction in benchmark accuracy. In many cases, FSD is even able to match SD benchmark accuracy at over 2 tokens per second faster, demonstrating that distributional equivalence is not necessary to maintain target model performance.
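
The acceptance idea described above can be sketched as follows: a draft token is kept when the target and draft distributions at that position are close enough under some divergence. This is a minimal sketch, not the authors' implementation. The choice of Jensen-Shannon divergence, the threshold value, and the function names are all assumptions; the abstract does not say which divergence FSD uses, and what happens after a rejection (for example, resampling from the target model) is omitted here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (assumed here; the paper may use another measure)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def accept_draft_tokens(draft_tokens, draft_dists, target_dists, threshold):
    """Keep the longest prefix of draft tokens whose divergence stays within threshold."""
    accepted = []
    for token, q, p in zip(draft_tokens, draft_dists, target_dists):
        if js_divergence(p, q) > threshold:
            break  # first position where the two models disagree too much
        accepted.append(token)
    return accepted


# Toy example with a two-token vocabulary and three drafted positions.
tokens = [5, 7, 9]
draft = [[0.9, 0.1], [0.5, 0.5], [0.99, 0.01]]
target = [[0.9, 0.1], [0.6, 0.4], [0.01, 0.99]]
kept = accept_draft_tokens(tokens, draft, target, threshold=0.05)
```

In this sketch, raising `threshold` accepts more draft tokens per verification step (faster, less faithful to the target), which mirrors the tunable accuracy-runtime tradeoff the abstract describes.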

[AI-50] Why Trust in AI May Be Inevitable

Link: https://arxiv.org/abs/2502.20701
Authors: Nghi Truong, Phanish Puranam, Ilia Testlin
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Abstract:In human-AI interactions, explanation is widely seen as necessary for enabling trust in AI systems. We argue that trust, however, may be a pre-requisite because explanation is sometimes impossible. We derive this result from a formalization of explanation as a search process through knowledge networks, where explainers must find paths between shared concepts and the concept to be explained, within finite time. Our model reveals that explanation can fail even under theoretically ideal conditions - when actors are rational, honest, motivated, can communicate perfectly, and possess overlapping knowledge. This is because successful explanation requires not just the existence of shared knowledge but also finding the connection path within time constraints, and it can therefore be rational to cease attempts at explanation before the shared knowledge is discovered. This result has important implications for human-AI interaction: as AI systems, particularly Large Language Models, become more sophisticated and able to generate superficially compelling but spurious explanations, humans may default to trust rather than demand genuine explanations. This creates risks of both misplaced trust and imperfect knowledge integration.

[AI-51] Unleashing the Potential of Two-Tower Models: Diffusion-Based Cross-Interaction for Large-Scale Matching

Link: https://arxiv.org/abs/2502.20687
Authors: Yihan Wang, Fei Xiong, Zhexin Han, Qi Song, Kaiqiao Zhan, Ben Wang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Two-tower models are widely adopted in the industrial-scale matching stage across a broad range of application domains, such as content recommendations, advertisement systems, and search engines. This model efficiently handles large-scale candidate item screening by separating user and item representations. However, the decoupling network also leads to a neglect of potential information interaction between the user and item representations. Current state-of-the-art (SOTA) approaches include adding a shallow fully connected layer (i.e., COLD), which is limited by performance and can only be used in the ranking stage. For performance considerations, another approach attempts to capture historical positive interaction information from the other tower by treating it as input features (i.e., DAT). Later research showed that the gains achieved by this method are still limited because it lacks guidance on the user's next intent. To address the aforementioned challenges, we propose a “cross-interaction decoupling architecture” within our matching paradigm. This user-tower architecture leverages a diffusion module to reconstruct the next positive intention representation and employs a mixed-attention module to facilitate comprehensive cross-interaction. During the next positive intention generation, we further enhance the accuracy of its reconstruction by explicitly extracting the temporal drift within user behavior sequences. Experiments on two real-world datasets and one industrial dataset demonstrate that our method outperforms the SOTA two-tower models significantly, and our diffusion approach outperforms other generative models in reconstructing item representations.

[AI-52] FedConv: A Learning-on-Model Paradigm for Heterogeneous Federated Clients

Link: https://arxiv.org/abs/2502.20639
Authors: Leming Shen, Qiang Yang, Kaiyan Cui, Yuanqing Zheng, Xiao-Yong Wei, Jianwei Liu, Jinsong Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Federated Learning (FL) facilitates collaborative training of a shared global model without exposing clients’ private data. In practical FL systems, clients (e.g., edge servers, smartphones, and wearables) typically have disparate system resources. Conventional FL, however, adopts a one-size-fits-all solution, where a homogeneous large global model is transmitted to and trained on each client, resulting in an overwhelming workload for less capable clients and starvation for other clients. To address this issue, we propose FedConv, a client-friendly FL framework, which minimizes the computation and memory burden on resource-constrained clients by providing heterogeneous customized sub-models. FedConv features a novel learning-on-model paradigm that learns the parameters of the heterogeneous sub-models via convolutional compression. Unlike traditional compression methods, the compressed models in FedConv can be directly trained on clients without decompression. To aggregate the heterogeneous sub-models, we propose transposed convolutional dilation to convert them back to large models with a unified size while retaining personalized information from clients. The compression and dilation processes, transparent to clients, are optimized on the server leveraging a small public dataset. Extensive experiments on six datasets demonstrate that FedConv outperforms state-of-the-art FL systems in terms of model accuracy (by more than 35% on average), computation and communication overhead (with 33% and 25% reduction, respectively).

[AI-53] A Compact Model for Large-Scale Time Series Forecasting

Link: https://arxiv.org/abs/2502.20634
Authors: Chin-Chia Michael Yeh, Xiran Fan, Zhimeng Jiang, Yujie Fan, Huiyuan Chen, Uday Singh Saini, Vivian Lai, Xin Dai, Junpeng Wang, Zhongfang Zhuang, Liang Wang, Yan Zheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Spatio-temporal data, which commonly arise in real-world applications such as traffic monitoring, financial transactions, and ride-share demands, represent a special category of multivariate time series. They exhibit two distinct characteristics: high dimensionality and commensurability across spatial locations. These attributes call for computationally efficient modeling approaches and facilitate the use of univariate forecasting models in a channel-independent fashion. SparseTSF, a recently introduced competitive univariate forecasting model, harnesses periodicity to achieve compactness by concentrating on cross-period dynamics, thereby extending the Pareto frontier with respect to model size and predictive performance. Nonetheless, it underperforms on spatio-temporal data due to an inadequate capture of intra-period temporal dependencies. To address this shortcoming, we propose UltraSTF, which integrates a cross-period forecasting module with an ultra-compact shape bank component. Our model effectively detects recurring patterns in time series through the attention mechanism of the shape bank component, thereby strengthening its ability to learn intra-period dynamics. UltraSTF achieves state-of-the-art performance on the LargeST benchmark while employing fewer than 0.2% of the parameters required by the second-best approaches, thus further extending the Pareto frontier of existing methods.

[AI-54] PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data

Link: https://arxiv.org/abs/2502.20616
Authors: Juntao Tan, Liangwei Yang, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Tulika Manoj Awalgaonkar, Jianguo Zhang, Weiran Yao, Ming Zhu, Shirley Kokane, Silvio Savarese, Huan Wang, Caiming Xiong, Shelby Heinecke
Subjects: Artificial Intelligence (cs.AI)

Abstract:Personalization is critical in AI assistants, particularly in the context of private AI models that work with individual users. A key scenario in this domain involves enabling AI models to access and interpret a user’s private data (e.g., conversation history, user-AI interactions, app usage) to understand personal details such as biographical information, preferences, and social connections. However, due to the sensitive nature of such data, there are no publicly available datasets that allow us to assess an AI model’s ability to understand users through direct access to personal information. To address this gap, we introduce a synthetic data generation pipeline that creates diverse, realistic user profiles and private documents simulating human activities. Leveraging this synthetic data, we present PersonaBench, a benchmark designed to evaluate AI models’ performance in understanding personal information derived from simulated private user data. We evaluate Retrieval-Augmented Generation (RAG) pipelines using questions directly related to a user’s personal information, supported by the relevant private documents provided to the models. Our results reveal that current retrieval-augmented AI models struggle to answer private questions by extracting personal information from user documents, highlighting the need for improved methodologies to enhance personalization capabilities in AI.

[AI-55] Exploring the Impact of Temperature Scaling in Softmax for Classification and Adversarial Robustness

Link: https://arxiv.org/abs/2502.20604
Authors: Hao Xuan, Bokai Yang, Xingyu Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The softmax function is a fundamental component in deep learning. This study delves into the often-overlooked parameter within the softmax function, known as “temperature,” providing novel insights into the practical and theoretical aspects of temperature scaling for image classification. Our empirical studies, adopting convolutional neural networks and transformers on multiple benchmark datasets, reveal that moderate temperatures generally introduce better overall performance. Through extensive experiments and rigorous theoretical analysis, we explore the role of temperature scaling in model training and unveil that temperature not only influences learning step size but also shapes the model’s optimization direction. Moreover, for the first time, we discover a surprising benefit of elevated temperatures: enhanced model robustness against common corruption, natural perturbation, and non-targeted adversarial attacks like Projected Gradient Descent. We extend our discoveries to adversarial training, demonstrating that, compared to the standard softmax function with the default temperature value, higher temperatures have the potential to enhance adversarial training. The insights of this work open new avenues for improving model performance and security in deep learning applications.
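
For readers unfamiliar with the parameter discussed above, here is a minimal sketch of softmax with a temperature, written in plain Python. The function name and the example logits are illustrative assumptions, not taken from the paper. Dividing the logits by a larger temperature flattens the output distribution; a smaller one sharpens it.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # illustrative values
default_probs = softmax_with_temperature(logits, temperature=1.0)
flatter_probs = softmax_with_temperature(logits, temperature=5.0)
```

The ranking of classes is unchanged by the temperature; only the confidence assigned to each class changes, and with it the size of the gradients that flow back through the softmax during training.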

[AI-56] Scalable Coordinated Learning for H2M/R Applications over Optical Access Networks (Invited)

Link: https://arxiv.org/abs/2502.20598
Authors: Sourav Mondal, Elaine Wong
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: This article is accepted for publication in the 29th Opto-Electronics and Communications Conference 2024 (OECC2024). Copyright @ IEEE

Abstract:One of the primary research interests adhering to next-generation fiber-wireless access networks is human-to-machine/robot (H2M/R) collaborative communications facilitating Industry 5.0. This paper discusses scalable H2M/R communications across large geographical distances that also allow rapid onboarding of new machines/robots as approximately 72% training time is saved through global-local coordinated learning.

[AI-57] LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation

Link: https://arxiv.org/abs/2502.20583
Authors: Keisuke Kamahori, Jungo Kasai, Noriyuki Kojima, Baris Kasikci
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Modern automatic speech recognition (ASR) models, such as OpenAI’s Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in the reduced dimension. Evaluation results show that our method can compress Whisper large-v3’s encoder size by over 50%, matching Whisper medium’s size with better transcription accuracy, thereby establishing a new Pareto-optimal frontier of efficiency and performance. The code of LiteASR is available at this https URL.
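
To see why replacing one linear transformation with a chain of low-rank multiplications can shrink a model, the short sketch below counts parameters for a dense layer versus a rank-`r` factorization. The layer sizes and the rank are illustrative assumptions, not figures from the paper, and the function names are hypothetical. How LiteASR picks the rank per layer from PCA on calibration data is not shown.

```python
def dense_params(d_in, d_out):
    """Weights in a dense layer W of shape (d_out, d_in), ignoring the bias."""
    return d_in * d_out


def low_rank_params(d_in, d_out, rank):
    """Weights in the factorization W ~= A @ B, with A (d_out, rank) and B (rank, d_in)."""
    return rank * (d_in + d_out)


def break_even_rank(d_in, d_out):
    """Largest rank for which the factorized layer is strictly smaller than the dense one."""
    return (d_in * d_out - 1) // (d_in + d_out)


# Illustrative sizes (assumed): a 1024 x 1024 layer factorized at rank 128.
full = dense_params(1024, 1024)
compressed = low_rank_params(1024, 1024, 128)
ratio = compressed / full
```

With these assumed sizes the factorized layer holds a quarter of the original weights, and any rank up to `break_even_rank(1024, 1024)` still saves parameters. The same counts apply to multiply-accumulate operations per input vector.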

[AI-58] PFformer: A Position-Free Transformer Variant for Extreme-Adaptive Multivariate Time Series Forecasting PAKDD2025

Link: https://arxiv.org/abs/2502.20571
Authors: Yanhong Li, David C. Anastasiu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: PAKDD 2025 special session on Data Science: Foundations and Applications (DSFA)

Abstract:Multivariate time series (MTS) forecasting is vital in fields like weather, energy, and finance. However, despite deep learning advancements, traditional Transformer-based models often diminish the effect of crucial inter-variable relationships by singular token embedding and struggle to effectively capture complex dependencies among variables, especially in datasets with rare or extreme events. These events create significant imbalances and lead to high skewness, complicating accurate prediction efforts. This study introduces PFformer, a position-free Transformer-based model designed for single-target MTS forecasting, specifically for challenging datasets characterized by extreme variability. PFformer integrates two novel embedding strategies: Enhanced Feature-based Embedding (EFE) and Auto-Encoder-based Embedding (AEE). EFE effectively encodes inter-variable dependencies by mapping related sequence subsets to high-dimensional spaces without positional constraints, enhancing the encoder’s functionality. PFformer shows superior forecasting accuracy without the traditional limitations of positional encoding in MTS modeling. We evaluated PFformer across four challenging datasets, focusing on two key forecasting scenarios: long sequence prediction for 3 days ahead and rolling predictions every four hours to reflect real-time decision-making processes in water management. PFformer demonstrated remarkable improvements, from 20% to 60%, compared with state-of-the-art models.

[AI-59] DPZV: Resource Efficient ZO Optimization For Differentially Private VFL

Link: https://arxiv.org/abs/2502.20565
Authors: Jianing Zhang, Evan Chen, Chaoyue Liu, Christopher G. Brinton
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Vertical Federated Learning (VFL) enables collaborative model training across feature-partitioned data, yet faces significant privacy risks and inefficiencies when scaling to large models. We propose DPZV, a memory-efficient Zeroth-Order (ZO) optimization framework that integrates differential privacy (DP) with vertical federated learning, addressing three critical challenges: (1) privacy vulnerabilities from gradient leakage, (2) high computation/communication costs of first-order methods, and (3) excessive memory footprint in conventional zeroth-order approaches. Our framework eliminates backpropagation through two-point gradient estimation, reducing client memory usage by 90% compared to first-order counterparts while enabling asynchronous communication. By strategically injecting Gaussian noise on the server, DPZV achieves rigorous $(\epsilon, \delta)$-DP guarantees without third-party trust assumptions. Theoretical analysis establishes a convergence rate matching centralized case under non-convex objectives. Extensive experiments on image and NLP benchmarks demonstrate that DPZV outperforms all baselines in accuracy while providing strong privacy assurances ($\epsilon \leq 10$) and requiring far fewer computation resources, establishing new state-of-the-art privacy-utility tradeoffs for resource-constrained VFL deployments.
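
The two-point gradient estimation mentioned above can be sketched in a few lines: the gradient is approximated from two function evaluations along a random direction, so no backpropagation is needed. This is a generic zeroth-order estimator under stated assumptions (Gaussian directions, a fixed smoothing step `mu`, hypothetical function names), not the DPZV algorithm itself; the server-side Gaussian noise for differential privacy and the vertical partitioning of features are left out.

```python
import random


def two_point_gradient(f, x, mu=1e-3, rng=random):
    """Estimate the gradient of f at x from f(x + mu*u) and f(x - mu*u)."""
    u = [rng.gauss(0.0, 1.0) for _ in x]  # random search direction
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    # Central difference gives the directional derivative along u.
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [scale * ui for ui in u]


def averaged_estimate(f, x, samples, rng):
    """Average many single-direction estimates to reduce variance."""
    total = [0.0] * len(x)
    for _ in range(samples):
        g = two_point_gradient(f, x, rng=rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / samples for t in total]


def quadratic(x):
    """Toy objective with known gradient 2 * x."""
    return sum(v * v for v in x)


estimate = averaged_estimate(quadratic, [1.0, -2.0], samples=20000, rng=random.Random(0))
```

A single estimate is noisy, but it only needs forward evaluations and a few scalars, which is where the memory savings over first-order training come from. Averaging over directions, as in `averaged_estimate`, trades extra evaluations for lower variance.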

[AI-60] Revisiting Kernel Attention with Correlated Gaussian Process Representation

Link: https://arxiv.org/abs/2502.20525
Authors: Long Minh Bui,Tho Tran Huu,Duy Dinh,Tan Minh Nguyen,Trong Nghia Hoang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 4 figures

Abstract:Transformers have increasingly become the de facto method to model sequential data with state-of-the-art performance. Due to its widespread use, being able to estimate and calibrate its modeling uncertainty is important to understand and design robust transformer models. To achieve this, previous works have used Gaussian processes (GPs) to perform uncertainty calibration for the attention units of transformers and attained notable successes. However, such approaches have to confine the transformers to the space of symmetric attention to ensure the necessary symmetric requirement of their GP’s kernel specification, which reduces the representation capacity of the model. To mitigate this restriction, we propose the Correlated Gaussian Process Transformer (CGPT), a new class of transformers whose self-attention units are modeled as cross-covariance between two correlated GPs (CGPs). This allows asymmetries in attention and can enhance the representation capacity of GP-based transformers. We also derive a sparse approximation for CGP to make it scale better. Our empirical studies show that both CGP-based and sparse CGP-based transformers achieve better performance than state-of-the-art GP-based transformers on a variety of benchmark tasks. The code for our experiments is available at this https URL.

[AI-61] Personas Evolved: Designing Ethical LLM-Based Conversational Agent Personalities

Link: https://arxiv.org/abs/2502.20513
Authors: Smit Desai,Mateusz Dubiel,Nima Zargham,Thomas Mildner,Laura Spillner
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The emergence of Large Language Models (LLMs) has revolutionized Conversational User Interfaces (CUIs), enabling more dynamic, context-aware, and human-like interactions across diverse domains, from social sciences to healthcare. However, the rapid adoption of LLM-based personas raises critical ethical and practical concerns, including bias, manipulation, and unforeseen social consequences. Unlike traditional CUIs, where personas are carefully designed with clear intent, LLM-based personas generate responses dynamically from vast datasets, making their behavior less predictable and harder to govern. This workshop aims to bridge the gap between CUI and broader AI communities by fostering a cross-disciplinary dialogue on the responsible design and evaluation of LLM-based personas. Bringing together researchers, designers, and practitioners, we will explore best practices, develop ethical guidelines, and promote frameworks that ensure transparency, inclusivity, and user-centered interactions. By addressing these challenges collaboratively, we seek to shape the future of LLM-driven CUIs in ways that align with societal values and expectations.

[AI-62] On Benchmarking Human-Like Intelligence in Machines

Link: https://arxiv.org/abs/2502.20502
Authors: Lance Ying,Katherine M. Collins,Lionel Wong,Ilia Sucholutsky,Ryan Liu,Adrian Weller,Tianmin Shu,Thomas L. Griffiths,Joshua B. Tenenbaum
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures

Abstract:Recent benchmark studies have claimed that AI has approached or even surpassed human-level performances on various cognitive tasks. However, this position paper argues that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities. We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically-invalid tasks. We support our claims by conducting a human evaluation study on ten existing AI benchmarks, suggesting significant biases and flaws in task and label designs. To address these limitations, we propose five concrete recommendations for developing future benchmarks that will enable more rigorous and meaningful evaluations of human-like cognitive capacities in AI with various implications for such AI applications.

[AI-63] Unified Kernel-Segregated Transpose Convolution Operation

Link: https://arxiv.org/abs/2502.20493
Authors: Vijay Srinivas Tida,Md Imran Hossen,Liqun Shan,Sai Venkatesh Chilukoti,Sonya Hsu,Xiali Hei
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The optimization of the transpose convolution layer for deep learning applications is achieved with the kernel segregation mechanism. However, kernel segregation has disadvantages, such as computing extra elements to obtain the output feature map with odd dimensions while launching a thread. To mitigate this problem, we introduce a unified kernel segregation approach that limits the usage of memory and computational resources by employing one unified kernel to execute four sub-kernels. The findings reveal that the suggested approach achieves an average computational speedup of 2.03x (3.89x) when tested on specific datasets with an RTX 2070 GPU (Intel Xeon CPU). The ablation study shows an average computational speedup of 3.5x when evaluating the transpose convolution layers from well-known Generative Adversarial Networks (GANs). The implementation of the proposed method for the transpose convolution layers in the EB-GAN model demonstrates significant memory savings of up to 35 MB.

[AI-64] R-ParVI: Particle-based variational inference through lens of rewards

Link: https://arxiv.org/abs/2502.20482
Authors: Yongchao Huang
Categories: Artificial Intelligence (cs.AI)

Abstract:A reward-guided, gradient-free ParVI method, R-ParVI, is proposed for sampling partially known densities (e.g. up to a constant). R-ParVI formulates the sampling problem as particle flow driven by rewards: particles are drawn from a prior distribution, navigate through parameter space with movements determined by a reward mechanism blending assessments from the target density, with the steady state particle configuration approximating the target geometry. Particle-environment interactions are simulated by stochastic perturbations and the reward mechanism, which drive particles towards high density regions while maintaining diversity (e.g. preventing them from collapsing into clusters). R-ParVI offers fast, flexible, scalable and stochastic sampling and inference for a class of probabilistic models such as those encountered in Bayesian inference and generative modelling.

[AI-65] Large Language Model Strategic Reasoning Evaluation through Behavioral Game Theory

Link: https://arxiv.org/abs/2502.20432
Authors: Jingru Jia,Zehua Yuan,Junhao Pan,Paul E. McNamara,Deming Chen
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:Strategic decision-making involves interactive reasoning where agents adapt their choices in response to others, yet existing evaluations of large language models (LLMs) often emphasize Nash Equilibrium (NE) approximation, overlooking the mechanisms driving their strategic choices. To bridge this gap, we introduce an evaluation framework grounded in behavioral game theory, disentangling reasoning capability from contextual effects. Testing 22 state-of-the-art LLMs, we find that GPT-o3-mini, GPT-o1, and DeepSeek-R1 dominate most games yet also demonstrate that the model scale alone does not determine performance. In terms of prompting enhancement, Chain-of-Thought (CoT) prompting is not universally effective, as it increases strategic reasoning only for models at certain levels while providing limited gains elsewhere. Additionally, we investigate the impact of encoded demographic features on the models, observing that certain assignments impact the decision-making pattern. For instance, GPT-4o shows stronger strategic reasoning with female traits than males, while Gemma assigns higher reasoning levels to heterosexual identities compared to other sexual orientations, indicating inherent biases. These findings underscore the need for ethical standards and contextual alignment to balance improved reasoning with fairness.

[AI-66] Will AI replace Software Engineers? Hold your Breath

Link: https://arxiv.org/abs/2502.20429
Authors: Abhik Roychoudhury,Andreas Zeller
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 3 pages

Abstract:Artificial Intelligence (AI) technology such as Large Language Models (LLMs) has become extremely popular in creating code. This has led to the conjecture that future software jobs will be exclusively conducted by LLMs, and the software industry will cease to exist. But software engineering is much more than producing code – notably, maintaining large software and keeping it reliable is a major part of software engineering, which LLMs are not yet capable of.

[AI-67] DeePen: Penetration Testing for Audio Deepfake Detection

Link: https://arxiv.org/abs/2502.20427
Authors: Nicolas Müller,Piotr Kawa,Adriana Stan,Thien-Phuc Doan,Souhwan Jung,Wei Herng Choong,Philip Sperl,Konstantin Böttinger
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Deepfakes - manipulated or forged audio and video media - pose significant security risks to individuals, organizations, and society at large. To address these challenges, machine learning-based classifiers are commonly employed to detect deepfake content. In this paper, we assess the robustness of such classifiers through a systematic penetration testing methodology, which we introduce as DeePen. Our approach operates without prior knowledge of or access to the target deepfake detection models. Instead, it leverages a set of carefully selected signal processing modifications - referred to as attacks - to evaluate model vulnerabilities. Using DeePen, we analyze both real-world production systems and publicly available academic model checkpoints, demonstrating that all tested systems exhibit weaknesses and can be reliably deceived by simple manipulations such as time-stretching or echo addition. Furthermore, our findings reveal that while some attacks can be mitigated by retraining detection systems with knowledge of the specific attack, others remain persistently effective. We release all associated code.

[AI-68] Backpropagation-free Spiking Neural Networks with the Forward-Forward Algorithm

Link: https://arxiv.org/abs/2502.20411
Authors: Mohammadnavid Ghader,Saeed Reza Kheradpisheh,Bahar Farahani,Mahmood Fazlali
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Spiking Neural Networks (SNNs) offer a biologically inspired computational paradigm that emulates neuronal activity through discrete spike-based processing. Despite their advantages, training SNNs with traditional backpropagation (BP) remains challenging due to computational inefficiencies and a lack of biological plausibility. This study explores the Forward-Forward (FF) algorithm as an alternative learning framework for SNNs. Unlike backpropagation, which relies on forward and backward passes, the FF algorithm employs two forward passes, enabling localized learning, enhanced computational efficiency, and improved compatibility with neuromorphic hardware. We introduce an FF-based SNN training framework and evaluate its performance across both non-spiking (MNIST, Fashion-MNIST, CIFAR-10) and spiking (Neuro-MNIST, SHD) datasets. Experimental results demonstrate that our model surpasses existing FF-based SNNs by over 5% on MNIST and Fashion-MNIST while achieving accuracy comparable to state-of-the-art backpropagation-trained SNNs. On more complex tasks such as CIFAR-10 and SHD, our approach outperforms other SNN models by up to 6% and remains competitive with leading backpropagation-trained SNNs. These findings highlight the FF algorithm’s potential to advance SNN training methodologies and neuromorphic computing by addressing key limitations of backpropagation.

[AI-69] Adversarial Robustness of Partitioned Quantum Classifiers

Link: https://arxiv.org/abs/2502.20403
Authors: Pouya Kananian,Hans-Arno Jacobsen
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Quantum Physics (quant-ph)

Abstract:Adversarial robustness in quantum classifiers is a critical area of study, providing insights into their performance compared to classical models and uncovering potential advantages inherent to quantum machine learning. In the NISQ era of quantum computing, circuit cutting is a notable technique for simulating circuits that exceed the qubit limitations of current devices, enabling the distribution of a quantum circuit’s execution across multiple quantum processing units through classical communication. We examine how partitioning quantum classifiers through circuit cutting increases their susceptibility to adversarial attacks, establishing a link between attacking the state preparation channels in wire cutting and implementing adversarial gates within intermediate layers of a quantum classifier. We then proceed to study the latter problem from both a theoretical and experimental perspective.

[AI-70] Beyond transparency: computational reliabilism as an externalist epistemology of algorithms

Link: https://arxiv.org/abs/2502.20402
Authors: Juan Manuel Durán
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:This chapter is interested in the epistemology of algorithms. As I intend to approach the topic, this is an issue about epistemic justification. Current approaches to justification emphasize the transparency of algorithms, which entails elucidating their internal mechanisms – such as functions and variables – and demonstrating how (or that) these produce outputs. Thus, the mode of justification through transparency is contingent on what can be shown about the algorithm and, in this sense, is internal to the algorithm. In contrast, I advocate for an externalist epistemology of algorithms that I term computational reliabilism (CR). While I have previously introduced and examined CR in the field of computer simulations ([42, 53, 4]), this chapter extends this reliabilist epistemology to encompass a broader spectrum of algorithms utilized in various scientific disciplines, with a particular emphasis on machine learning applications. At its core, CR posits that an algorithm’s output is justified if it is produced by a reliable algorithm. A reliable algorithm is one that has been specified, coded, used, and maintained utilizing reliability indicators. These reliability indicators stem from formal methods, algorithmic metrics, expert competencies, cultures of research, and other scientific endeavors. The primary aim of this chapter is to delineate the foundations of CR, explicate its operational mechanisms, and outline its potential as an externalist epistemology of algorithms.

[AI-71] Einleitung [Introduction]

Link: https://arxiv.org/abs/2502.21131
Authors: Vincent C. Müller
Categories: History and Philosophy of Physics (physics.hist-ph); Artificial Intelligence (cs.AI)
Comments: in German language

Abstract:Hilary Putnam’s biography and philosophical development reflect the history of Anglo-Saxon philosophy over the last 40 years. Putnam has influenced this history significantly for almost as long. In this introduction, the main aim is to present the context in which Putnam stands and from which his philosophical contributions can be understood. In the context of a sketch of Putnam’s philosophical development, a preliminary historical classification of his work will also be attempted, even if this is not the place for a comprehensive critique or presentation: The introduction must remain at a fairly elementary level and of course cannot replace a reading of the texts. Since Putnam’s work is certainly part of a rapprochement between ‘analytic’ and ‘continental’ philosophy, the introduction to the texts translated here should finally make clear what Putnam has to offer non-analytically oriented readers. Hilary Putnams Biographie und philosophische Entwicklung spiegeln die Geschichte der angelsächsischen Philosophie in den letzten 40 Jahren. Beinahe ebenso lange hat Putnam diese Geschichte wesentlich beeinflußt. In der vorliegenden Einleitung soll vor allem der Kontext dargestellt werden, in dem Putnam steht und aus dem heraus verständlich wird, was er philosophisch zu sagen hat. Im Rahmen einer Skizze von Putnams philosophischer Entwicklung soll zudem eine vorläufige philosophiehistorische Einordnung versucht werden, auch wenn hier nicht der Ort für eine umfassende Kritik oder Darstellung sein kann: Die Einleitung muß auf recht elementarem Niveau bleiben und kann eine Lektüre der Texte natürlich nicht ersetzen. Da Putnams Werk sicherlich Teil einer Annäherung von ‘analytischer’ und ‘kontinentaler’ Philosophie ist, soll bei der Einführung in die hier übersetzten Texte schließlich deutlich werden, was Putnam nicht analytisch orientierten Lesern zu bieten hat. 
Journal reference: (1993) Vincent C. Müller (ed.), Hilary Putnam: Von einem realistischen Standpunkt, Schriften zu Sprache und Wirklichkeit [From a realist point of view: Writings on language and reality] (rowohlts enzyklopädie; Reinbek: Rowohlt), 9-26

[AI-72] Lattice Protein Folding with Variational Annealing

Link: https://arxiv.org/abs/2502.20632
Authors: Shoummo Ahsan Khandoker,Estelle M. Inack,Mohamed Hibat-Allah
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: GitHub repository will be provided soon

Abstract:Understanding the principles of protein folding is a cornerstone of computational biology, with implications for drug design, bioengineering, and the understanding of fundamental biological processes. Lattice protein folding models offer a simplified yet powerful framework for studying the complexities of protein folding, enabling the exploration of energetically optimal folds under constrained conditions. However, finding these optimal folds is a computationally challenging combinatorial optimization problem. In this work, we introduce a novel upper-bound training scheme that employs masking to identify the lowest-energy folds in two-dimensional Hydrophobic-Polar (HP) lattice protein folding. By leveraging Dilated Recurrent Neural Networks (RNNs) integrated with an annealing process driven by temperature-like fluctuations, our method accurately predicts optimal folds for benchmark systems of up to 60 beads. Our approach also effectively masks invalid folds from being sampled without compromising the autoregressive sampling properties of RNNs. This scheme is generalizable to three spatial dimensions and can be extended to lattice protein models with larger alphabets. Our findings emphasize the potential of advanced machine learning techniques in tackling complex protein folding problems and a broader class of constrained combinatorial optimization challenges.
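
For readers unfamiliar with the two-dimensional HP lattice model named in the abstract, the sketch below scores a fold under the standard convention that each pair of non-consecutive H beads on adjacent lattice sites contributes -1 to the energy. It only evaluates a given fold and flags self-intersecting (invalid) ones; it does not reproduce the paper's RNN, annealing, or masking scheme. The move alphabet `U/D/L/R` and the function names are assumptions made for illustration.

```python
# Unit steps on the square lattice (hypothetical move encoding).
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold_coordinates(moves):
    """Turn a string of moves into lattice coordinates, starting at (0, 0)."""
    x, y = 0, 0
    coords = [(x, y)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords


def hp_energy(sequence, moves):
    """Return the HP energy of a fold, or None if the fold self-intersects.

    sequence: string over {"H", "P"}; moves: one move per bond, so
    len(moves) must equal len(sequence) - 1.
    """
    coords = fold_coordinates(moves)
    if len(coords) != len(sequence):
        raise ValueError("need exactly len(sequence) - 1 moves")
    if len(set(coords)) != len(coords):
        return None  # two beads on one site: not a valid fold
    index_at = {c: i for i, c in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so each neighbouring pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


if __name__ == "__main__":
    print(hp_energy("HPPH", "RUL"))  # square fold: the two H beads touch
    print(hp_energy("HPPH", "RRR"))  # straight chain: no contacts
    print(hp_energy("HPPH", "RLR"))  # revisits a site: invalid
```

Finding the move string that minimizes this energy over all self-avoiding folds is the combinatorial optimization problem the abstract refers to.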

[AI-73] Efficient Risk-sensitive Planning via Entropic Risk Measures

Link: https://arxiv.org/abs/2502.20423
Authors: Alexandre Marthe (ENS de Lyon, UMPA-ENSL),Samuel Bounan,Aurélien Garivier (UMPA-ENSL, MC2),Claire Vernade
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)

Abstract:Risk-sensitive planning aims to identify policies maximizing some tail-focused metrics in Markov Decision Processes (MDPs). Such an optimization task can be very costly for the most widely used and interpretable metrics such as threshold probabilities or (Conditional) Values at Risk. Indeed, previous work showed that only Entropic Risk Measures (EntRM) can be efficiently optimized through dynamic programming, leaving a hard-to-interpret parameter to choose. We show that the computation of the full set of optimal policies for EntRM across parameter values leads to tight approximations for the metrics of interest. We prove that this optimality front can be computed effectively thanks to a novel structural analysis and smoothness properties of entropic risks. Empirical results demonstrate that our approach achieves strong performance in a variety of decision-making scenarios.
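
As a reminder of what the entropic risk measure computes, the sketch below evaluates it for a discrete return distribution under the common convention $\mathrm{EntRM}_\beta(X) = \frac{1}{\beta}\log \mathbb{E}[e^{\beta X}]$, with the mean as the $\beta \to 0$ limit. The sign convention and the function name `entropic_risk` are assumptions; the paper may parameterize the measure differently, and this is not its planning algorithm.

```python
import math


def entropic_risk(values, probs, beta):
    """Entropic risk of a discrete random variable (assumed convention).

    Computes (1/beta) * log(sum_i p_i * exp(beta * v_i)) with a
    log-sum-exp shift for numerical stability. beta == 0 returns the mean.
    Negative beta weights low returns more heavily (risk-averse);
    positive beta weights high returns more heavily (risk-seeking).
    """
    if beta == 0.0:
        return sum(p * v for v, p in zip(values, probs))
    exponents = [beta * v for v in values]
    shift = max(exponents)
    total = sum(p * math.exp(e - shift) for e, p in zip(exponents, probs))
    return (shift + math.log(total)) / beta


if __name__ == "__main__":
    returns = [0.0, 10.0]
    probs = [0.5, 0.5]
    for beta in (-1.0, -0.1, 0.0, 0.1, 1.0):
        print(beta, entropic_risk(returns, probs, beta))
```

Sweeping `beta` as in the usage loop traces how the score of a single return distribution moves between its worst and best outcomes, which gives some intuition for why a front of optimal policies across parameter values can approximate other tail-focused metrics.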

Machine Learning

[LG-0] Enabling AutoML for Zero-Touch Network Security: Use-Case Driven Analysis

Link: https://arxiv.org/abs/2502.21286
Authors: Li Yang,Mirna El Rajab,Abdallah Shami,Sami Muhaidat
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Published in IEEE Transactions on Network and Service Management (TNSM); Code is available at GitHub link: this https URL

Abstract:Zero-Touch Networks (ZTNs) represent a state-of-the-art paradigm shift towards fully automated and intelligent network management, enabling the automation and intelligence required to manage the complexity, scale, and dynamic nature of next-generation (6G) networks. ZTNs leverage Artificial Intelligence (AI) and Machine Learning (ML) to enhance operational efficiency, support intelligent decision-making, and ensure effective resource allocation. However, the implementation of ZTNs is subject to security challenges that need to be resolved to achieve their full potential. In particular, two critical challenges arise: the need for human expertise in developing AI/ML-based security mechanisms, and the threat of adversarial attacks targeting AI/ML models. In this survey paper, we provide a comprehensive review of current security issues in ZTNs, emphasizing the need for advanced AI/ML-based security mechanisms that require minimal human intervention and protect AI/ML models themselves. Furthermore, we explore the potential of Automated ML (AutoML) technologies in developing robust security solutions for ZTNs. Through case studies, we illustrate practical approaches to securing ZTNs against both conventional and AI/ML-specific threats, including the development of autonomous intrusion detection systems and strategies to combat Adversarial ML (AML) attacks. The paper concludes with a discussion of the future research directions for the development of ZTN security approaches.

[LG-1] Controlled Model Debiasing through Minimal and Interpretable Updates

Link: https://arxiv.org/abs/2502.21284
Authors: Federico Di Gennaro,Thibault Laugel,Vincent Grari,Marcin Detyniecki
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Traditional approaches to learning fair machine learning models often require rebuilding models from scratch, generally without accounting for potentially existing previous models. In a context where models need to be retrained frequently, this can lead to inconsistent model updates, as well as redundant and costly validation testing. To address this limitation, we introduce the notion of controlled model debiasing, a novel supervised learning task relying on two desiderata: that the differences between new fair model and the existing one should be (i) interpretable and (ii) minimal. After providing theoretical guarantees to this new problem, we introduce a novel algorithm for algorithmic fairness, COMMOD, that is both model-agnostic and does not require the sensitive attribute at test time. In addition, our algorithm is explicitly designed to enforce minimal and interpretable changes between biased and debiased predictions -a property that, while highly desirable in high-stakes applications, is rarely prioritized as an explicit objective in fairness literature. Our approach combines a concept-based architecture and adversarial learning and we demonstrate through empirical results that it achieves comparable performance to state-of-the-art debiasing methods while performing minimal and interpretable prediction changes.

[LG-2] Does Generation Require Memorization? Creative Diffusion Models using Ambient Diffusion

Link: https://arxiv.org/abs/2502.21278
Authors: Kulin Shah,Alkis Kalavasis,Adam R. Klivans,Giannis Daras
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 33 pages

Abstract:There is strong empirical evidence that the state-of-the-art diffusion modeling paradigm leads to models that memorize the training set, especially when the training set is small. Prior methods to mitigate the memorization problem often lead to a decrease in image quality. Is it possible to obtain strong and creative generative models, i.e., models that achieve high generation quality and low memorization? Despite the current pessimistic landscape of results, we make significant progress in pushing the trade-off between fidelity and memorization. We first provide theoretical evidence that memorization in diffusion models is only necessary for denoising problems at low noise scales (usually used in generating high-frequency details). Using this theoretical insight, we propose a simple, principled method to train the diffusion models using noisy data at large noise scales. We show that our method significantly reduces memorization without decreasing the image quality, for both text-conditional and unconditional models and for a variety of data availability settings.

[LG-3] ALVI Interface: Towards Full Hand Motion Decoding for Amputees Using sEMG

Link: https://arxiv.org/abs/2502.21256
Authors: Aleksandr Kovalev,Anna Makarova,Petr Chizhov,Matvey Antonov,Gleb Duplin,Vladislav Lomtev,Viacheslav Gostevskii,Vladimir Bessonov,Andrey Tsurkan,Mikhail Korobok,Aleksejs Timčenko
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 6 pages, video demo: this https URL

Abstract:We present a system for decoding hand movements using surface EMG signals. The interface provides real-time (25 Hz) reconstruction of finger joint angles across 20 degrees of freedom, designed for upper limb amputees. Our offline analysis shows 0.8 correlation between predicted and actual hand movements. The system functions as an integrated pipeline with three key components: (1) a VR-based data collection platform, (2) a transformer-based model for EMG-to-motion transformation, and (3) a real-time calibration and feedback module called ALVI Interface. Using eight sEMG sensors and a VR training environment, users can control their virtual hand down to finger joint movement precision, as demonstrated in our video: youtube link.

[LG-4] TimesBERT: A BERT-Style Foundation Model for Time Series Understanding

Link: https://arxiv.org/abs/2502.21245
Authors: Haoran Zhang,Yong Liu,Yunzhong Qiu,Haixuan Liu,Zhongyi Pei,Jianmin Wang,Mingsheng Long
Categories: Machine Learning (cs.LG)

Abstract:Time series analysis is crucial in diverse scenarios. Beyond forecasting, considerable real-world tasks are categorized into classification, imputation, and anomaly detection, underscoring different capabilities termed time series understanding in this paper. While GPT-style models have been positioned as foundation models for time series forecasting, the BERT-style architecture, which has made significant advances in natural language understanding, has not been fully unlocked for time series understanding, possibly attributed to the undesirable dropout of essential elements of BERT. In this paper, inspired by the shared multi-granularity structure between multivariate time series and multisentence documents, we design TimesBERT to learn generic representations of time series including temporal patterns and variate-centric characteristics. In addition to a natural adaptation of masked modeling, we propose a parallel task of functional token prediction to embody vital multi-granularity structures. Our model is pre-trained on 260 billion time points across diverse domains. Leveraging multi-granularity representations, TimesBERT achieves state-of-the-art performance across four typical downstream understanding tasks, outperforming task-specific models and language pre-trained backbones, positioning it as a versatile foundation model for time series understanding.

[LG-5] The Structural Complexity of Matrix-Vector Multiplication

Link: https://arxiv.org/abs/2502.21240
Authors: Emile Anand,Jan van den Brand,Rose McCarty
Categories: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Computational Geometry (cs.CG); Machine Learning (cs.LG)
Comments: 36 pages

Abstract:We consider the problem of preprocessing an $n \times n$ matrix $M$, and supporting queries that, for any vector $v$, return the matrix-vector product $Mv$. This problem has been extensively studied in both theory and practice: on one side, practitioners have developed algorithms that are highly efficient in practice, whereas theoreticians have proven that the problem cannot be solved faster than naive multiplication in the worst-case. This lower bound holds even in the average-case, implying that existing average-case analyses cannot explain this gap between theory and practice. Therefore, we study the problem for structured matrices. We show that for $n \times n$ matrices of VC-dimension $d$, the matrix-vector multiplication problem can be solved with $\tilde{O}(n^2)$ preprocessing and $\tilde{O}(n^{2-1/d})$ query time. Given the low constant VC-dimensions observed in most real-world data, our results posit an explanation for why the problem can be solved so much faster in practice. Moreover, our bounds hold even if the matrix does not have a low VC-dimension, but is obtained by (possibly adversarially) corrupting at most a subquadratic number of entries of any unknown low VC-dimension matrix. Our results yield the first non-trivial upper bounds for many applications. In previous works, the online matrix-vector hypothesis (conjecturing that quadratic time is needed per query) was used to prove many conditional lower bounds, showing that it is impossible to compute and maintain high-accuracy estimates for shortest paths, Laplacian solvers, effective resistance, and triangle detection in graphs subject to node insertions and deletions in subquadratic time. Yet, via a reduction to our matrix-vector-multiplication result, we show we can maintain the aforementioned problems efficiently if the input is structured, providing the first subquadratic upper bounds in the high-accuracy regime.

[LG-6] A Method of Selective Attention for Reservoir Based Agents

Link: https://arxiv.org/abs/2502.21229
Authors: Kevin McKee
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 2 figures

Abstract:Training of deep reinforcement learning agents is slowed considerably by the presence of input dimensions that do not usefully condition the reward function. Existing modules such as layer normalization can be trained with weight decay to act as a form of selective attention, i.e. an input mask, that shrinks the scale of unnecessary inputs, which in turn accelerates training of the policy. However, we find a surprising result that adding numerous parameters to the computation of the input mask results in much faster training. A simple, high dimensional masking module is compared with layer normalization and a model without any input suppression. The high dimensional mask resulted in a four-fold speedup in training over the null hypothesis and a two-fold speedup in training over the layer normalization method.

[LG-7] Geodesic Slice Sampler for Multimodal Distributions with Strong Curvature

Link: https://arxiv.org/abs/2502.21190
Authors: Bernardo Williams,Hanlin Yu,Hoang Phuc Hau Luu,Georgios Arvanitidis,Arto Klami
Categories: Machine Learning (cs.LG)

Abstract:Traditional Markov Chain Monte Carlo sampling methods often struggle with sharp curvatures, intricate geometries, and multimodal distributions. Slice sampling can resolve local exploration inefficiency issues and Riemannian geometries help with sharp curvatures. Recent extensions enable slice sampling on Riemannian manifolds, but they are restricted to cases where geodesics are available in closed form. We propose a method that generalizes Hit-and-Run slice sampling to more general geometries tailored to the target distribution, by approximating geodesics as solutions to differential equations. Our approach enables exploration of regions with strong curvature and rapid transitions between modes in multimodal distributions. We demonstrate the advantages of the approach over challenging sampling problems.

[LG-8] SYN-LUNGS: Towards Simulating Lung Nodules with Anatomy-Informed Digital Twins for AI Training

Link: https://arxiv.org/abs/2502.21187
Authors: Fakrul Islam Tushar,Lavsen Dahal,Cindy McCabe,Fong Chi Ho,Paul Segars,Ehsan Abadi,Kyle J. Lafata,Ehsan Samei,Joseph Y. Lo
Categories: Machine Learning (cs.LG)
Comments: 6 figures, 12 pages

Abstract:AI models for lung cancer screening are limited by data scarcity, impacting generalizability and clinical applicability. Generative models address this issue but are constrained by training data variability. We introduce SYN-LUNGS, a framework for generating high-quality 3D CT images with detailed annotations. SYN-LUNGS integrates XCAT3 phantoms for digital twin generation, X-Lesions for nodule simulation (varying size, location, and appearance), and DukeSim for CT image formation with vendor and parameter variability. The dataset includes 3,072 nodule images from 1,044 simulated CT scans, with 512 lesions and 174 digital twins. Models trained on clinical + simulated data outperform clinical only models, achieving 10% improvement in detection, 2-9% in segmentation and classification, and enhanced this http URL incorporating anatomy-informed simulations, SYN-LUNGS provides a scalable approach for AI model development, particularly in rare disease representation and improving model reliability.

[LG-9] Reducing Reward Dependence in RL Through Adaptive Confidence Discounting

Link: https://arxiv.org/abs/2502.21181
Authors: Muhammed Yusuf Satici,David L. Roberts
Categories: Machine Learning (cs.LG)

Abstract:In human-in-the-loop reinforcement learning or environments where calculating a reward is expensive, the costly rewards can make learning efficiency challenging to achieve. The cost of obtaining feedback from humans or calculating expensive rewards means algorithms receiving feedback at every step of long training sessions may be infeasible, which may limit agents’ abilities to efficiently improve performance. Our aim is to reduce the reliance of learning agents on humans or expensive rewards, improving the efficiency of learning while maintaining the quality of the learned policy. We offer a novel reinforcement learning algorithm that requests a reward only when its knowledge of the value of actions in an environment state is low. Our approach uses a reward function model as a proxy for human-delivered or expensive rewards when confidence is high, and asks for those explicit rewards only when there is low confidence in the model’s predicted rewards and/or action selection. By reducing dependence on the expensive-to-obtain rewards, we are able to learn efficiently in settings where the logistics or expense of obtaining rewards may otherwise prohibit it. In our experiments our approach obtains comparable performance to a baseline in terms of return and number of episodes required to learn, but achieves that performance with as few as 20% of the rewards.
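
The query rule described in this abstract (use a learned reward model when confident, ask for the expensive reward otherwise) can be sketched as follows. This is only an illustration, not the authors' algorithm: the confidence measure (the gap between the two best action values), the threshold value, and the names `action_confidence` and `choose_reward` are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple


def action_confidence(q_values: Sequence[float]) -> float:
    """Hypothetical confidence score: gap between the best and second-best action value."""
    ranked = sorted(q_values, reverse=True)
    if len(ranked) < 2:
        return float("inf")  # a single action leaves nothing to be unsure about
    return ranked[0] - ranked[1]


def choose_reward(
    q_values: Sequence[float],
    predicted_reward: float,
    query_true_reward: Callable[[], float],
    threshold: float = 0.1,  # assumed value, chosen only for illustration
) -> Tuple[float, bool]:
    """Return (reward, queried): call the expensive source only when confidence is low."""
    if action_confidence(q_values) < threshold:
        return query_true_reward(), True
    return predicted_reward, False
```

With this gating, the expensive callback runs only in states where the action values are close together; in all other states the modelled reward is used as a proxy.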

[LG-10] QFAL: Quantum Federated Adversarial Learning

Link: https://arxiv.org/abs/2502.21171
Authors: Walid El Maouaki,Nouhaila Innan,Alberto Marchisio,Taoufik Said,Mohamed Bennai,Muhammad Shafique
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 10 pages

Abstract:Quantum federated learning (QFL) merges the privacy advantages of federated systems with the computational potential of quantum neural networks (QNNs), yet its vulnerability to adversarial attacks remains poorly understood. This work pioneers the integration of adversarial training into QFL, proposing a robust framework, quantum federated adversarial learning (QFAL), where clients collaboratively defend against perturbations by combining local adversarial example generation with federated averaging (FedAvg). We systematically evaluate the interplay between three critical factors: client count (5, 10, 15), adversarial training coverage (0-100%), and adversarial attack perturbation strength (epsilon = 0.01-0.5), using the MNIST dataset. Our experimental results show that while fewer clients often yield higher clean-data accuracy, larger federations can more effectively balance accuracy and robustness when partially adversarially trained. Notably, even limited adversarial coverage (e.g., 20%-50%) can significantly improve resilience to moderate perturbations, though at the cost of reduced baseline performance. Conversely, full adversarial training (100%) may regain high clean accuracy but is vulnerable under stronger attacks. These findings underscore an inherent trade-off between robust and standard objectives, which is further complicated by quantum-specific factors. We conclude that a carefully chosen combination of client count and adversarial coverage is critical for mitigating adversarial vulnerabilities in QFL. Moreover, we highlight opportunities for future research, including adaptive adversarial training schedules, more diverse quantum encoding schemes, and personalized defense strategies to further enhance the robustness-accuracy trade-off in real-world quantum federated environments.

[LG-11] Autonomous Curriculum Design via Relative Entropy Based Task Modifications

Link: https://arxiv.org/abs/2502.21166
Authors: Muhammed Yusuf Satici,Jianxun Wang,David L. Roberts
Categories: Machine Learning (cs.LG)

Abstract:Curriculum learning is a training method in which an agent is first trained on a curriculum of relatively simple tasks related to a target task in an effort to shorten the time required to train on the target task. Autonomous curriculum design involves the design of such curricula with no reliance on human knowledge and/or expertise. Finding an efficient and effective way of autonomously designing curricula remains an open problem. We propose a novel approach for automatically designing curricula by leveraging the learner’s uncertainty to select curricula tasks. Our approach measures the uncertainty in the learner’s policy using relative entropy, and guides the agent to states of high uncertainty to facilitate learning. Our algorithm supports the generation of autonomous curricula in a self-assessed manner by leveraging the learner’s past and current policies, but it also allows the use of teacher-guided design in an instructive setting. We provide theoretical guarantees for the convergence of our algorithm using two time-scale optimization processes. Results show that our algorithm outperforms randomly generated curricula and learning directly on the target task, as well as the curriculum-learning criteria existing in the literature. We also present two additional heuristic distance measures that could be combined with our relative-entropy approach for further performance improvements.
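
The idea of scoring states by the relative entropy between policy snapshots can be illustrated with a small sketch. The abstract does not give the exact formulation, so the direction of the divergence (current policy relative to past policy), the tabular policy representation, and the function names are assumptions made here.

```python
import math
from typing import Dict, Hashable, Sequence


def relative_entropy(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete action distributions over the same action set."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def most_uncertain_state(
    past_policy: Dict[Hashable, Sequence[float]],
    current_policy: Dict[Hashable, Sequence[float]],
) -> Hashable:
    """Pick the state whose action distribution changed most between the two snapshots."""
    return max(
        past_policy,
        key=lambda s: relative_entropy(current_policy[s], past_policy[s]),
    )
```

A curriculum generator built on this score would start episodes from (or near) the returned state, where the policy is still changing the most.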

[LG-12] Parallel-Learning of Invariant and Tempo-variant Attributes of Single-Lead Cardiac Signals: PLITA AAAI

Link: https://arxiv.org/abs/2502.21162
Authors: Adtian Atienza,Jakob E. Bardram,Sadasivan Puthusserypady
Categories: Machine Learning (cs.LG)
Comments: Published in The 39th Annual AAAI Conference on Artificial Intelligence. Main Track

Abstract:Wearable sensing devices, such as Holter monitors, will play a crucial role in the future of digital health. Unsupervised learning frameworks such as Self-Supervised Learning (SSL) are essential to map these single-lead electrocardiogram (ECG) signals to their anticipated clinical outcomes. These signals are characterized by a tempo-variant component whose patterns evolve through the recording and an invariant component with patterns that remain unchanged. However, existing SSL methods only drive the model to encode the invariant attributes, leading the model to neglect tempo-variant information which reflects subject-state changes through time. In this paper, we present Parallel-Learning of Invariant and Tempo-variant Attributes (PLITA), a novel SSL method designed for capturing both invariant and tempo-variant ECG attributes. The latter are captured by mandating closer representations in space for inputs that are closer in time. We evaluate both the capability of the method to learn the attributes of these two distinct kinds, as well as PLITA’s performance compared to existing SSL methods for ECG analysis. PLITA performs significantly better in the set-ups where tempo-variant attributes play a major role.

[LG-13] Variational Bayesian Pseudo-Coreset ICLR2025

Link: https://arxiv.org/abs/2502.21143
Authors: Hyungi Lee,Seungyoo Lee,Juho Lee
Categories: Machine Learning (cs.LG)
Comments: The Thirteenth International Conference on Learning Representations (ICLR2025)

Abstract:The success of deep learning requires large datasets and extensive training, which can create significant computational challenges. To address these challenges, pseudo-coresets, small learnable datasets that mimic the entire data, have been proposed. Bayesian Neural Networks, which offer predictive uncertainty and probabilistic interpretation for deep neural networks, also face issues with large-scale datasets due to their high-dimensional parameter space. Prior works on Bayesian Pseudo-Coresets (BPC) attempt to reduce the computational load for computing weight posterior distribution by a small number of pseudo-coresets but suffer from memory inefficiency during BPC training and sub-optimal results. To overcome these limitations, we propose Variational Bayesian Pseudo-Coreset (VBPC), a novel approach that utilizes variational inference to efficiently approximate the posterior distribution, reducing memory usage and computational costs while improving performance across benchmark datasets.

[LG-14] CuPID: Leveraging Masked Single-Lead ECG Modelling for Enhancing the Representations

Link: https://arxiv.org/abs/2502.21127
Authors: Adtian Atienza,Gouthamaan Manimaran,Jakob E. Bardram,Sadasivan Puthusserypady
Categories: Machine Learning (cs.LG)
Comments: Paper under review

Abstract:Wearable sensing devices, such as Electrocardiogram (ECG) heart-rate monitors, will play a crucial role in the future of digital health. This continuous monitoring leads to massive unlabeled data, incentivizing the development of unsupervised learning frameworks. While Masked Data Modelling (MDM) techniques have enjoyed wide use, their direct application to single-lead ECG data is suboptimal due to the decoder’s difficulty handling irregular heartbeat intervals when no contextual information is provided. In this paper, we present Cueing the Predictor Increments the Detailing (CuPID), a novel MDM method tailored to single-lead ECGs. CuPID enhances existing MDM techniques by cueing spectrogram-derived context to the decoder, thus incentivizing the encoder to produce more detailed representations. This has a significant impact on the encoder’s performance across a wide range of different configurations, leading CuPID to outperform state-of-the-art methods in a variety of downstream tasks.

[LG-15] Rare event modeling with self-regularized normalizing flows: what can we learn from a single failure? ICLR2025

Link: https://arxiv.org/abs/2502.21110
Authors: Charles Dawson,Van Tran,Max Z. Li,Chuchu Fan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Published at ICLR 2025

Abstract:Increased deployment of autonomous systems in fields like transportation and robotics has seen a corresponding increase in safety-critical failures. These failures can be difficult to model and debug due to the relative lack of data: compared to tens of thousands of examples from normal operations, we may have only seconds of data leading up to the failure. This scarcity makes it challenging to train generative models of rare failure events, as existing methods risk either overfitting to noise in the limited failure dataset or underfitting due to an overly strong prior. We address this challenge with CalNF, or calibrated normalizing flows, a self-regularized framework for posterior learning from limited data. CalNF achieves state-of-the-art performance on data-limited failure modeling and inverse problems and enables a first-of-a-kind case study into the root causes of the 2022 Southwest Airlines scheduling crisis.

[LG-16] Efficient Transformer-based Decoder for Varshamov-Tenengolts Codes

Link: https://arxiv.org/abs/2502.21060
Authors: Yali Wei,Alan J.X. Guo,Zihui Yan,Yufan Dai
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: 9 pages, 2 figures, 9 tables

Abstract:In recent years, the rise of DNA data storage technology has brought significant attention to the challenge of correcting insertion, deletion, and substitution (IDS) errors. Among various coding methods for IDS correction, Varshamov-Tenengolts (VT) codes, primarily designed for single-error correction, have emerged as a central research focus. While existing decoding methods achieve high accuracy in correcting a single error, they often fail to correct multiple IDS errors. In this work, we observe that VT codes retain some capability for addressing multiple errors by introducing a transformer-based VT decoder (TVTD) along with symbol- and statistic-based codeword embedding. Experimental results demonstrate that the proposed TVTD achieves perfect correction of a single error. Furthermore, when decoding multiple errors across various codeword lengths, the bit error rate and frame error rate are significantly improved compared to existing hard decision and soft-in soft-out algorithms. Additionally, through model architecture optimization, the proposed method reduces time consumption by an order of magnitude compared to other soft decoders.
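
For readers unfamiliar with VT codes, the single-error case mentioned in the abstract can be made concrete. The sketch below uses the standard binary definition, where a length-n word x belongs to VT_a(n) when the sum of i·x_i (positions counted from 1) is congruent to a modulo n+1. It corrects one deletion by brute force; it is not the transformer-based decoder from the paper, and the function names are chosen here for illustration.

```python
from typing import List, Sequence


def vt_syndrome(bits: Sequence[int]) -> int:
    """Weighted checksum sum(i * x_i) mod (n + 1), with positions i counted from 1."""
    n = len(bits)
    return sum(i * b for i, b in enumerate(bits, start=1)) % (n + 1)


def is_vt_codeword(bits: Sequence[int], a: int = 0) -> bool:
    """True when the word lies in the binary code VT_a(n)."""
    return vt_syndrome(bits) == a


def correct_single_deletion(received: Sequence[int], a: int = 0) -> List[int]:
    """Brute force: re-insert one bit at every position and keep the words in VT_a(n)."""
    candidates = set()
    for pos in range(len(received) + 1):
        for bit in (0, 1):
            word = tuple(received[:pos]) + (bit,) + tuple(received[pos:])
            if is_vt_codeword(word, a):
                candidates.add(word)
    # After a genuine single deletion from a VT codeword, exactly one candidate remains.
    if len(candidates) != 1:
        raise ValueError("input is not a single deletion from a VT codeword")
    return list(candidates.pop())
```

For n = 4 and a = 0 the code is {0000, 1001, 0110, 1111}, and deleting any one bit from any of these words is undone by `correct_single_deletion`. With several errors this uniqueness no longer holds, which is the regime the abstract targets with a learned decoder.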

[LG-17] Detection of anomalies in cow activity using wavelet transform based features

Link: https://arxiv.org/abs/2502.21051
Authors: Valentin Guien,Violaine Antoine,Romain Lardy,Isabelle Veissier,Luis E C Rocha
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 17 pages, 8 figures, 4 tables, 1 algorithm

Abstract:In Precision Livestock Farming, detecting deviations from optimal or baseline values - i.e. anomalies in time series - is essential to allow undertaking corrective actions rapidly. Here we aim at detecting anomalies in 24h time series of cow activity, with a view to detect cases of disease or oestrus. Deviations must be distinguished from noise, which can be very high in the case of biological data. It is also important to detect the anomaly early, e.g. before a farmer would notice it visually. Here, we investigate the benefit of using wavelet transforms to denoise data and we assess the performance of an anomaly detection algorithm considering the timing of the detection. We developed features based on the comparisons between the wavelet transforms of the mean of the time series and the wavelet transforms of individual time series instances. We hypothesized that these features contribute to the detection of anomalies in periodic time series using a feature-based algorithm. We tested this hypothesis with two datasets representing cow activity, which typically follows a daily pattern but can deviate due to specific physiological or pathological conditions. We applied features derived from wavelet transform as well as statistical features in an Isolation Forest algorithm. We measured the distance of detection between the days annotated as abnormal by animal caretakers and the days predicted as abnormal by the algorithm. The results show that wavelet-based features are among the features most contributing to anomaly detection. They also show that detections are close to the annotated days, and often precede them. In conclusion, using wavelet transforms on time series of cow activity data helps to detect anomalies related to specific cow states. The detection is often obtained on days that precede the day annotated by caretakers, which offers the possibility of taking corrective actions at an early stage.

[LG-18] S4ConvD: Adaptive Scaling and Frequency Adjustment for Energy-Efficient Sensor Networks in Smart Buildings

Link: https://arxiv.org/abs/2502.21035
Authors: Melanie Schaller,Bodo Rosenhahn
Categories: Machine Learning (cs.LG)
Comments: Submitted to TOSN Journal

Abstract:Predicting energy consumption in smart buildings is challenging due to dependencies in sensor data and the variability of environmental conditions. We introduce S4ConvD, a novel convolutional variant of Deep State Space Models (Deep-SSMs), that minimizes reliance on extensive preprocessing steps. S4ConvD is designed to optimize runtime in resource-constrained environments. By implementing adaptive scaling and frequency adjustments, this model is shown to capture complex temporal patterns in building energy dynamics. Experiments on the ASHRAE Great Energy Predictor III dataset reveal that S4ConvD outperforms current benchmarks. Additionally, S4ConvD benefits from significant improvements in GPU runtime through the use of Block Tiling optimization techniques. Thus, S4ConvD has the potential for practical deployment in real-time energy modeling. Furthermore, the complete codebase and dataset are accessible on GitHub, fostering open-source contributions and facilitating further research. Our method also promotes resource-efficient model execution, enhancing both energy forecasting and the potential integration of renewable energy sources into smart grid systems.

[LG-19] A data augmentation strategy for deep neural networks with application to epidemic modelling

Link: https://arxiv.org/abs/2502.21033
Authors: Muhammad Awais,Abu Sayfan Ali,Giacomo Dimarco,Federica Ferrarese,Lorenzo Pareschi
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Physics and Society (physics.soc-ph); Populations and Evolution (q-bio.PE); Machine Learning (stat.ML)

Abstract:In this work, we integrate the predictive capabilities of compartmental disease dynamics models with the ability of machine learning to analyze complex, high-dimensional data and uncover patterns that conventional models may overlook. Specifically, we present a proof of concept demonstrating the application of data-driven methods and deep neural networks to a recently introduced SIR-type model with social features, including a saturated incidence rate, to improve epidemic prediction and forecasting. Our results show that a robust data augmentation strategy through suitable data-driven models can improve the reliability of Feed-Forward Neural Networks (FNNs) and Nonlinear Autoregressive Networks (NARs), making them viable alternatives to Physics-Informed Neural Networks (PINNs). This approach enhances the ability to handle nonlinear dynamics and offers scalable, data-driven solutions for epidemic forecasting, prioritizing predictive accuracy over the constraints of physics-based models. Numerical simulations of the post-lockdown phase of the COVID-19 epidemic in Italy and Spain validate our methodology.

[LG-20] Sixth-Sense: Self-Supervised Learning of Spatial Awareness of Humans from a Planar Lidar

Link: https://arxiv.org/abs/2502.21029
Authors: Simone Arreghini,Nicholas Carlotti,Mirko Nava,Antonio Paolillo,Alessandro Giusti
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Localizing humans is a key prerequisite for any service robot operating in proximity to people. In these scenarios, robots rely on a multitude of state-of-the-art detectors usually designed to operate with RGB-D cameras or expensive 3D LiDARs. However, most commercially available service robots are equipped with cameras with a narrow field of view, making them blind when a user is approaching from other directions, or inexpensive 1D LiDARs whose readings are difficult to interpret. To address these limitations, we propose a self-supervised approach to detect humans and estimate their 2D pose from 1D LiDAR data, using detections from an RGB-D camera as a supervision source. Our approach aims to provide service robots with spatial awareness of nearby humans. After training on 70 minutes of data autonomously collected in two environments, our model is capable of detecting humans omnidirectionally from 1D LiDAR data in a novel environment, with 71% precision and 80% recall, while retaining an average absolute error of 13 cm in distance and 44° in orientation.

[LG-21] TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

Link: https://arxiv.org/abs/2502.20969
Authors: Chien-Yu Lin,Keisuke Kamahori,Yiyu Liu,Xiaoxiang Shi,Madhav Kashyap,Yile Gu,Rulin Shao,Zihao Ye,Kan Zhu,Stephanie Wang,Arvind Krishnamurthy,Rohan Kadekodi,Luis Ceze,Baris Kasikci
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments, especially when limited GPU memory is available. To address these challenges, we propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that anticipates required data and transfers it from CPU to GPU in parallel with LLM generation. By leveraging the modularity of RAG pipelines, the inverted file index (IVF) search algorithm and similarities between queries, TeleRAG optimally overlaps data movement and computation. Experimental results show that TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average compared to state-of-the-art systems, enabling faster, more memory-efficient deployments of advanced RAG applications.

[LG-22] Reward Dimension Reduction for Scalable Multi-Objective Reinforcement Learning ICLR2025

Link: https://arxiv.org/abs/2502.20957
Authors: Giseung Park,Youngchul Sung
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICLR 2025

Abstract:In this paper, we introduce a simple yet effective reward dimension reduction method to tackle the scalability challenges of multi-objective reinforcement learning algorithms. While most existing approaches focus on optimizing two to four objectives, their abilities to scale to environments with more objectives remain uncertain. Our method uses a dimension reduction approach to enhance learning efficiency and policy performance in multi-objective settings. While most traditional dimension reduction methods are designed for static datasets, our approach is tailored for online learning and preserves Pareto-optimality after transformation. We propose a new training and evaluation framework for reward dimension reduction in multi-objective reinforcement learning and demonstrate the superiority of our method in environments including one with sixteen objectives, significantly outperforming existing online dimension reduction methods.

[LG-23] Robust and Efficient Writer-Independent IMU-Based Handwriting Recognization

Link: https://arxiv.org/abs/2502.20954
Authors: Jindong Li,Tim Hamann,Jens Barth,Peter Kaempf,Dario Zanca,Bjoern Eskofier
Categories: Machine Learning (cs.LG)

Abstract:Online handwriting recognition (HWR) using data from inertial measurement units (IMUs) remains challenging due to variations in writing styles and the limited availability of high-quality annotated datasets. Traditional models often struggle to recognize handwriting from unseen writers, making writer-independent (WI) recognition a crucial but difficult problem. This paper presents an HWR model with an encoder-decoder structure for IMU data, featuring a CNN-based encoder for feature extraction and a BiLSTM decoder for sequence modeling, which supports inputs of varying lengths. Our approach demonstrates strong robustness and data efficiency, outperforming existing methods on WI datasets, including the WI split of the OnHW dataset and our own dataset. Extensive evaluations show that our model maintains high accuracy across different age groups and writing conditions while effectively learning from limited data. Through comprehensive ablation studies, we analyze key design choices, achieving a balance between accuracy and efficiency. These findings contribute to the development of more adaptable and scalable HWR systems for real-world applications.

[LG-24] Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content

Link: https://arxiv.org/abs/2502.20952
Authors: Hongyuan Shen,Min Zheng,Jincheng Wang,Yang Zhao
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:With the widespread application of Large Language Models across various domains, their security issues have increasingly garnered significant attention from both academic and industrial communities. This study conducts sampling and normalization of the parameters of the LLM to generate visual representations and heatmaps of parameter distributions, revealing notable discrepancies in parameter distributions among certain layers within the hidden layers. Further analysis involves calculating statistical metrics for each layer, followed by the computation of a Comprehensive Sensitivity Score based on these metrics, which identifies the lower layers as being particularly sensitive to the generation of harmful content. Based on this finding, we employ a Freeze training strategy, selectively performing Supervised Fine-Tuning only on the lower layers. Experimental results demonstrate that this method significantly reduces training duration and GPU memory consumption while maintaining a high jailbreak success rate and a high harm score, outperforming the results achieved by applying the LoRA method for SFT across all layers. Additionally, the method has been successfully extended to other open-source large models, validating its generality and effectiveness across different model architectures. Furthermore, we compare our method with other jailbreak methods, demonstrating the superior performance of our approach. By innovatively proposing a method to statistically analyze and compare large model parameters layer by layer, this study provides new insights into the interpretability of large models. These discoveries emphasize the necessity of continuous research and the implementation of adaptive security measures in the rapidly evolving field of LLMs to prevent potential jailbreak attack risks, thereby promoting the development of more robust and secure LLMs.

[LG-25] Gradient Imbalance in Direct Preference Optimization

Link: https://arxiv.org/abs/2502.20847
Authors: Qinwei Ma,Jingzhe Shi,Can Jin,Jenq-Neng Hwang,Serge Belongie,Lei Li
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 2 figures

Abstract:Direct Preference Optimization (DPO) has been proposed as a promising alternative to Proximal Policy Optimization (PPO) based Reinforcement Learning with Human Feedback (RLHF). However, empirical evaluations consistently reveal suboptimal performance in DPO compared to common RLHF pipelines. In this work, we conduct a systematic analysis of DPO’s training dynamics and identify gradient imbalance as a critical limitation. We demonstrate theoretically and empirically that this imbalance perturbs optimization trajectories, destabilizes learning, and induces suboptimal convergence. To address this issue, we propose Balanced-DPO, a simple yet effective modification to the DPO objective that introduces a computationally efficient gradient reweighting mechanism. Our experiments demonstrate the effectiveness of Balanced-DPO, validating the theoretical findings and confirming that addressing gradient imbalance is key to improving DPO’s performance, highlighting a promising direction for future research.
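
As background for the objective being modified, the standard per-pair DPO loss can be written in a few lines. This sketch shows only the original DPO formulation on scalar sequence log-probabilities; the reweighting used by Balanced-DPO is not specified in the abstract and is not reproduced here. The argument names and the default `beta` are assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the chosen log-probability or lowering the rejected one both reduce it.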

[LG-26] Tuning-Free Structured Sparse PCA via Deep Unfolding Networks

Link: https://arxiv.org/abs/2502.20837
Authors: Long Chen,Xianchao Xiu
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Sparse principal component analysis (PCA) is a well-established dimensionality reduction technique that is often used for unsupervised feature selection (UFS). However, determining the regularization parameters is rather challenging, and conventional approaches, including grid search and Bayesian optimization, not only bring great computational costs but also exhibit high sensitivity. To address these limitations, we first establish a structured sparse PCA formulation by integrating $\ell_1$-norm and $\ell_{2,1}$-norm to capture the local and global structures, respectively. Building upon the off-the-shelf alternating direction method of multipliers (ADMM) optimization framework, we then design an interpretable deep unfolding network that translates iterative optimization steps into trainable neural architectures. This innovation enables automatic learning of the regularization parameters, effectively bypassing the empirical tuning requirements of conventional methods. Numerical experiments on benchmark datasets validate the advantages of our proposed method over the existing state-of-the-art methods. Our code will be accessible at this https URL.
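
The two regularizers named in the abstract differ in what they push to zero, which a short sketch makes visible. The row-wise convention for the ℓ2,1 norm (sum of the Euclidean norms of the rows) is the one common in feature selection and is assumed here; the paper's full objective and its unfolded ADMM network are not reproduced.

```python
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def l1_norm(matrix: Matrix) -> float:
    """Sum of absolute values of all entries; encourages entry-wise sparsity."""
    return sum(abs(v) for row in matrix for v in row)


def l21_norm(matrix: Matrix) -> float:
    """Sum of the Euclidean norms of the rows; encourages whole rows to vanish."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)


# A projection matrix with one all-zero row: that feature would be discarded.
W = [[3.0, 4.0], [0.0, 0.0], [0.0, -1.0]]
```

For `W` above, `l1_norm(W)` sums 3 + 4 + 1 and `l21_norm(W)` sums the row norms 5 + 0 + 1, so a zero row contributes nothing to either penalty.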

[LG-27] Digital Player: Evaluating Large Language Models based Human-like Agent in Games NEURIPS

Link: https://arxiv.org/abs/2502.20807
Authors: Jiawei Wang,Kai Wang,Shaojie Lin,Runze Wu,Bihan Xu,Lingeng Jiang,Shiwei Zhao,Renyu Zhu,Haoyu Liu,Zhipeng Hu,Zhong Fan,Le Li,Tangjie Lyu,Changjie Fan
Categories: Machine Learning (cs.LG)
Comments: NeurIPS Datasets and Benchmarks 2024, not accepted

Abstract:With the rapid advancement of Large Language Models (LLMs), LLM-based autonomous agents have shown the potential to function as digital employees, such as digital analysts, teachers, and programmers. In this paper, we develop an application-level testbed based on the open-source strategy game “Unciv”, which has millions of active players, to enable researchers to build a “data flywheel” for studying human-like agents in the “digital players” task. This “Civilization”-like game features expansive decision-making spaces along with rich linguistic interactions such as diplomatic negotiations and acts of deception, posing significant challenges for LLM-based agents in terms of numerical reasoning and long-term planning. Another challenge for “digital players” is to generate human-like responses for social interaction, collaboration, and negotiation with human players. The open-source project can be found at https://github.com/fuxiAIlab/CivAgent.

[LG-28] Learning to Steer Learners in Games

Link: https://arxiv.org/abs/2502.20770
Authors: Yizhou Zhang,Yi-An Ma,Eric Mazumdar
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:We consider the problem of learning to exploit learning algorithms through repeated interactions in games. Specifically, we focus on the case of repeated two player, finite-action games, in which an optimizer aims to steer a no-regret learner to a Stackelberg equilibrium without knowledge of its payoffs. We first show that this is impossible if the optimizer only knows that the learner is using an algorithm from the general class of no-regret algorithms. This suggests that the optimizer requires more information about the learner’s objectives or algorithm to successfully exploit them. Building on this intuition, we reduce the problem for the optimizer to that of recovering the learner’s payoff structure. We demonstrate the effectiveness of this approach if the learner’s algorithm is drawn from a smaller class by analyzing two examples: one where the learner uses an ascent algorithm, and another where the learner uses stochastic mirror ascent with known regularizer and step sizes.

[LG-29] Visual Attention Exploration in Vision-Based Mamba Models

Link: https://arxiv.org/abs/2502.20764
Authors: Junpeng Wang, Chin-Chia Michael Yeh, Uday Singh Saini, Mahashweta Das
Subjects: Machine Learning (cs.LG)
Comments: 6 pages, 8 figures

Abstract:State space models (SSMs) have emerged as an efficient alternative to transformer-based models, offering linear complexity that scales better than transformers. One of the latest advances in SSMs, Mamba, introduces a selective scan mechanism that assigns trainable weights to input tokens, effectively mimicking the attention mechanism. Mamba has also been successfully extended to the vision domain by decomposing 2D images into smaller patches and arranging them as 1D sequences. However, it remains unclear how these patches interact with (or attend to) each other in relation to their original 2D spatial location. Additionally, the order used to arrange the patches into a sequence also significantly impacts their attention distribution. To better understand the attention between patches and explore the attention patterns, we introduce a visual analytics tool specifically designed for vision-based Mamba models. This tool enables a deeper understanding of how attention is distributed across patches in different Mamba blocks and how it evolves throughout a Mamba model. Using the tool, we also investigate the impact of different patch-ordering strategies on the learned attention, offering further insights into the model’s behavior.
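
The effect of patch ordering described above can be illustrated with a toy example. This is a minimal sketch, not the authors' tool: it assumes a single-channel image, and the helper names `patchify`, `scan_order`, and `sequence_distance` as well as the two scan orders (`"row"` and `"col"`) are hypothetical choices made for illustration. It shows that two vertically adjacent patches can be far apart in one 1D ordering and neighbours in another.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W) image into a (rows, cols, patch, patch) grid of tiles."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    # Separate row/column block indices from within-patch indices.
    return image.reshape(rows, patch, cols, patch).transpose(0, 2, 1, 3)


def scan_order(rows, cols, order="row"):
    """Grid coordinates in the order they would enter the 1D sequence."""
    if order == "row":  # left-to-right, then top-to-bottom
        return [(r, c) for r in range(rows) for c in range(cols)]
    if order == "col":  # top-to-bottom, then left-to-right
        return [(r, c) for c in range(cols) for r in range(rows)]
    raise ValueError(f"unknown order: {order}")


def sequence_distance(rows, cols, a, b, order):
    """Number of sequence steps between patches a and b under a scan order."""
    coords = scan_order(rows, cols, order)
    return abs(coords.index(a) - coords.index(b))


image = np.arange(16).reshape(4, 4)
tiles = patchify(image, 2)  # a 2x2 grid of 2x2 patches
```

On a 4x4 patch grid, `sequence_distance(4, 4, (0, 0), (1, 0), "row")` is 4 while the same pair under `"col"` is 1, which is one way to see why the scan order can change which patches a sequence model treats as close.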

[LG-30] Information-Theoretic Perspectives on Optimizers

Link: https://arxiv.org/abs/2502.20763
Authors: Zhiquan Tan, Weiran Huang
Subjects: Machine Learning (cs.LG)

Abstract:The interplay of optimizers and architectures in neural networks is complicated, and it is hard to understand why some optimizers work better on some specific architectures. In this paper, we find that the traditionally used sharpness metric does not fully explain this intricate interplay, and we introduce an information-theoretic metric called entropy gap to aid the analysis. We find that both sharpness and entropy gap affect performance, including optimization dynamics and generalization. We further use information-theoretic tools to understand a recently proposed optimizer called Lion and find ways to improve it.

[LG-31] Indoor Localization for Autonomous Robot Navigation

Link: https://arxiv.org/abs/2502.20731
Authors: Sean Kouma, Rachel Masters
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures

Abstract:Indoor positioning systems (IPSs) have gained attention as outdoor navigation becomes prevalent in everyday life. Research is being actively conducted on how indoor smartphone navigation can be accomplished and improved using received signal strength indication (RSSI) and machine learning (ML). IPSs have more use cases that need further exploration, and we aim to explore using IPSs for the indoor navigation of an autonomous robot. We collected a dataset and trained models to test on a robot. We also developed an A* path-planning algorithm so that our robot could navigate itself using predicted directions. After testing different network structures, our robot was able to successfully navigate corners around 50 percent of the time. The findings of this paper indicate that using IPSs for autonomous robots is a promising area of future research.
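
For readers unfamiliar with the A* path planning mentioned above, here is a minimal sketch. It is not the authors' implementation: the occupancy grid, the 4-connected moves with unit cost, and the Manhattan-distance heuristic are all assumptions chosen to keep the example small, and the function name `astar` is ours.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) / 1 (blocked) cells.

    Returns the path as a list of (row, col) cells, or None if unreachable.
    """
    def heuristic(cell):
        # Manhattan distance never overestimates on a unit-cost 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    open_heap = [(heuristic(start), 0, start)]  # (f = g + h, g, cell)
    came_from = {}
    best_g = {start: 0}

    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk parent links back to the start.
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if g > best_g.get(cell, float("inf")):
            continue  # stale heap entry; a cheaper route was found later
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = new_g
                    came_from[(nr, nc)] = cell
                    heapq.heappush(
                        open_heap, (new_g + heuristic((nr, nc)), new_g, (nr, nc))
                    )
    return None


# A wall across the middle row forces a detour through the right column.
floor_plan = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
route = astar(floor_plan, (0, 0), (2, 0))
```

In a robot setting the returned cell sequence would still need to be turned into motion commands; how the paper couples the planner with its predicted positions is not specified in the abstract.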

[LG-32] Unlearning through Knowledge Overwriting: Reversible Federated Unlearning via Selective Sparse Adapter CVPR2025

Link: https://arxiv.org/abs/2502.20709
Authors: Zhengyi Zhong, Weidong Bao, Ji Wang, Shuai Zhang, Jingxuan Zhou, Lingjuan Lyu, Wei Yang Bryan Lim
Subjects: Machine Learning (cs.LG)
Comments: Accepted by CVPR2025

Abstract:Federated Learning is a promising paradigm for privacy-preserving collaborative model training. In practice, it is essential not only to continuously train the model to acquire new knowledge but also to guarantee old knowledge the right to be forgotten (i.e., federated unlearning), especially for privacy-sensitive information or harmful knowledge. However, current federated unlearning methods face several challenges, including indiscriminate unlearning of cross-client knowledge, irreversibility of unlearning, and significant unlearning costs. To this end, we propose a method named FUSED, which first identifies critical layers by analyzing each layer’s sensitivity to knowledge and constructs sparse unlearning adapters for sensitive ones. Then, the adapters are trained without altering the original parameters, overwriting the unlearning knowledge with the remaining knowledge. This knowledge overwriting process enables FUSED to mitigate the effects of indiscriminate unlearning. Moreover, the introduction of independent adapters makes unlearning reversible and significantly reduces the unlearning costs. Finally, extensive experiments on three datasets across various unlearning scenarios demonstrate that FUSED’s effectiveness is comparable to Retraining, surpassing all other baselines while greatly reducing unlearning costs.

[LG-33] FoCTTA: Low-Memory Continual Test-Time Adaptation with Focus

Link: https://arxiv.org/abs/2502.20677
Authors: Youbing Hu, Yun Cheng, Zimu Zhou, Anqi Lu, Zhiqiang Cao, Zhijun Li
Subjects: Machine Learning (cs.LG)

Abstract:Continual adaptation to domain shifts at test time (CTTA) is crucial for enhancing the intelligence of deep-learning-enabled IoT applications. However, prevailing TTA methods, which typically update all batch normalization (BN) layers, exhibit two memory inefficiencies. First, the reliance on BN layers for adaptation necessitates large batch sizes, leading to high memory usage. Second, updating all BN layers requires storing the activations of all BN layers for backpropagation, exacerbating the memory demand. Both factors lead to substantial memory costs, making existing solutions impractical for IoT devices. In this paper, we present FoCTTA, a low-memory CTTA strategy. The key is to automatically identify and adapt a few drift-sensitive representation layers, rather than blindly update all BN layers. The shift from BN to representation layers eliminates the need for large batch sizes. Also, by updating adaptation-critical layers only, FoCTTA avoids storing excessive activations. This focused adaptation approach ensures that FoCTTA is not only memory-efficient but also maintains effective adaptation. Evaluations show that FoCTTA improves the adaptation accuracy over the state of the art by 4.5%, 4.9%, and 14.8% on CIFAR10-C, CIFAR100-C, and ImageNet-C under the same memory constraints. Across various batch sizes, FoCTTA reduces the memory usage by 3-fold on average, while improving the accuracy by 8.1%, 3.6%, and 0.2%, respectively, on the three datasets.

[LG-34] Dimension Agnostic Neural Processes ICLR2025

Link: https://arxiv.org/abs/2502.20661
Authors: Hyungi Lee, Chaeyun Jang, Dongbok Lee, Juho Lee
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, Accepted to ICLR 2025 (International Conference on Learning Representations)

Abstract:Meta-learning aims to train models that can generalize to new tasks with limited labeled data by extracting shared features across diverse task datasets. Additionally, it accounts for prediction uncertainty during both training and evaluation, a concept known as uncertainty-aware meta-learning. Neural Process (NP) is a well-known uncertainty-aware meta-learning method that constructs implicit stochastic processes using parametric neural networks, enabling rapid adaptation to new tasks. However, existing NP methods face challenges in accommodating diverse input dimensions and learned features, limiting their broad applicability across regression tasks. To address these limitations and advance the utility of NP models as general regressors, we introduce Dimension Agnostic Neural Processes (DANP). DANP incorporates a Dimension Aggregator Block (DAB) to transform input features into a fixed-dimensional space, enhancing the model’s ability to handle diverse datasets. Furthermore, leveraging the Transformer architecture and latent encoding layers, DANP learns a wider range of features that are generalizable across various tasks. Through comprehensive experimentation on various synthetic and practical regression tasks, we empirically show that DANP outperforms previous NP variations, showcasing its effectiveness in overcoming the limitations of traditional NP models and its potential for broader applicability in diverse regression scenarios.

[LG-35] Can LLM Assist in the Evaluation of the Quality of Machine Learning Explanations?

Link: https://arxiv.org/abs/2502.20635
Authors: Bo Wang, Yiqiao Li, Jianlong Zhou, Fang Chen
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:EXplainable machine learning (XML) has recently emerged to address the mystery mechanisms of machine learning (ML) systems by interpreting their ‘black box’ results. Despite the development of various explanation methods, determining the most suitable XML method for specific ML contexts remains unclear, highlighting the need for effective evaluation of explanations. The evaluating capabilities of the Transformer-based large language model (LLM) present an opportunity to adopt LLM-as-a-Judge for assessing explanations. In this paper, we propose a workflow that integrates both LLM-based and human judges for evaluating explanations. We examine how LLM-based judges evaluate the quality of various explanation methods and compare their evaluation capabilities to those of human judges within an iris classification scenario, employing both subjective and objective metrics. We conclude that while LLM-based judges effectively assess the quality of explanations using subjective metrics, they are not yet sufficiently developed to replace human judges in this role.

[LG-36] Are LLMs Ready for Practical Adoption for Assertion Generation? DATE2025

Link: https://arxiv.org/abs/2502.20633
Authors: Vaishnavi Pulavarthi, Deeksha Nandal, Soham Dan, Debjit Pal
Subjects: Machine Learning (cs.LG)
Comments: 7 Pages, 9 Figures, Accepted in DATE 2025. arXiv admin note: substantial text overlap with arXiv:2406.18627

Abstract:Assertions have been the de facto collateral for simulation-based and formal verification of hardware designs for over a decade. The quality of hardware verification, i.e., detection and diagnosis of corner-case design bugs, is critically dependent on the quality of the assertions. With the onset of generative AI such as Transformers and Large Language Models (LLMs), there has been a renewed interest in developing novel, effective, and scalable techniques of generating functional and security assertions from design source code. While there have been recent works that use commercial-off-the-shelf (COTS) LLMs for assertion generation, there is no comprehensive study quantifying the effectiveness of LLMs in generating syntactically and semantically correct assertions. In this paper, we first discuss AssertionBench from our prior work, a comprehensive set of designs and assertions to quantify the goodness of a broad spectrum of COTS LLMs for the task of assertion generation from hardware design source code. Our key insight was that COTS LLMs are not yet ready for prime-time adoption for assertion generation as they generate a considerable fraction of syntactically and semantically incorrect assertions. Motivated by the insight, we propose AssertionLLM, a first-of-its-kind LLM, specifically fine-tuned for assertion generation. Our initial experimental results show that AssertionLLM considerably improves the semantic and syntactic correctness of the generated assertions over COTS LLMs.

[LG-37] Towards Zero Touch Networks: Cross-Layer Automated Security Solutions for 6G Wireless Networks

Link: https://arxiv.org/abs/2502.20627
Authors: Li Yang, Shimaa Naser, Abdallah Shami, Sami Muhaidat, Lyndon Ong, Mérouane Debbah
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Accepted and To Appear in IEEE Transactions on Communications (TCOM); Code is available at Github: this https URL

Abstract:The transition from 5G to 6G mobile networks necessitates network automation to meet the escalating demands for high data rates, ultra-low latency, and integrated technology. Recently, Zero-Touch Networks (ZTNs), driven by Artificial Intelligence (AI) and Machine Learning (ML), have been designed to automate the entire lifecycle of network operations with minimal human intervention, presenting a promising solution for enhancing automation in 5G/6G networks. However, the implementation of ZTNs brings forth the need for autonomous and robust cybersecurity solutions, as ZTNs rely heavily on automation. AI/ML algorithms are widely used to develop cybersecurity mechanisms, but require substantial specialized expertise and encounter model drift issues, posing significant challenges in developing autonomous cybersecurity measures. Therefore, this paper proposes an automated security framework targeting Physical Layer Authentication (PLA) and Cross-Layer Intrusion Detection Systems (CLIDS) to address security concerns at multiple Internet protocol layers. The proposed framework employs drift-adaptive online learning techniques and a novel enhanced Successive Halving (SH)-based Automated ML (AutoML) method to automatically generate optimized ML models for dynamic networking environments. Experimental results illustrate that the proposed framework achieves high performance on the public Radio Frequency (RF) fingerprinting dataset and the Canadian Institute for Cybersecurity CICIDS2017 dataset, showcasing its effectiveness in addressing PLA and CLIDS tasks within dynamic and complex networking environments. Furthermore, the paper explores open challenges and research directions in the 5G/6G cybersecurity domain. This framework represents a significant advancement towards fully autonomous and secure 6G networks, paving the way for future innovations in network automation and cybersecurity.

[LG-38] Map Space Belief Prediction for Manipulation-Enhanced Mapping

Link: https://arxiv.org/abs/2502.20606
Authors: Joao Marcos Correia Marques, Nils Dengler, Tobias Zaenker, Jesper Mucke, Shenlong Wang, Maren Bennewitz, Kris Hauser
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 14 pages, 10 figures, currently under review

Abstract:Searching for objects in cluttered environments requires selecting efficient viewpoints and manipulation actions to remove occlusions and reduce uncertainty in object locations, shapes, and categories. In this work, we address the problem of manipulation-enhanced semantic mapping, where a robot has to efficiently identify all objects in a cluttered shelf. Although Partially Observable Markov Decision Processes (POMDPs) are standard for decision-making under uncertainty, representing unstructured interactive worlds remains challenging in this formalism. To tackle this, we define a POMDP whose belief is summarized by a metric-semantic grid map and propose a novel framework that uses neural networks to perform map-space belief updates to reason efficiently and simultaneously about object geometries, locations, categories, occlusions, and manipulation physics. Further, to enable accurate information gain analysis, the learned belief updates should maintain calibrated estimates of uncertainty. Therefore, we propose Calibrated Neural-Accelerated Belief Updates (CNABUs) to learn a belief propagation model that generalizes to novel scenarios and provides confidence-calibrated predictions for unknown areas. Our experiments show that our novel POMDP planner improves map completeness and accuracy over existing methods in challenging simulations and successfully transfers to real-world cluttered shelves in zero-shot fashion.

[LG-39] Deep Learning of the Evolution Operator Enables Forecasting of Out-of-Training Dynamics in Chaotic Systems

Link: https://arxiv.org/abs/2502.20603
Authors: Ira J. S. Shokar, Peter H. Haynes, Rich R. Kerswell
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD)

Abstract:We demonstrate that a deep learning emulator for chaotic systems can forecast phenomena absent from training data. Using the Kuramoto-Sivashinsky and beta-plane turbulence models, we evaluate the emulator through scenarios probing the fundamental phenomena of both systems: forecasting spontaneous relaminarisation, capturing initialisation of arbitrary chaotic states, zero-shot prediction of dynamics with parameter values outside of the training range, and characterisation of dynamical statistics from artificially restricted training datasets. Our results show that deep learning emulators can uncover emergent behaviours and rare events in complex systems by learning underlying mathematical rules, rather than merely mimicking observed patterns.

[LG-40] Cache-of-Thought: Master-Apprentice Framework for Cost-Effective Vision Language Model Inference

Link: https://arxiv.org/abs/2502.20587
Authors: Mingyuan Wu, Jize Jiang, Haozhen Zheng, Meitang Li, Zhaoheng Li, Beitong Tian, Bo Chen, Yongjoo Park, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt
Subjects: Machine Learning (cs.LG)
Comments: Mingyuan, Jize, and Haozhen contributed equally, while Minjia, Chengxiang, and Klara advised equally

Abstract:Vision Language Models (VLMs) have achieved remarkable success in a wide range of vision applications of increasing complexity and scales, yet choosing the right VLM model size involves a trade-off between response quality and cost. While smaller VLMs are cheaper to run, they typically produce responses only marginally better than random guessing on benchmarks such as MMMU. In this paper, we propose Cache of Thought (CoT), a master-apprentice framework for collaborative inference between large and small VLMs. CoT manages high-quality query results from large VLMs (master) in a cache, which are then selected via a novel multi-modal retrieval and in-context learning to aid the performance of small VLMs (apprentice). We extensively evaluate CoT on various widely recognized and challenging general VQA benchmarks, and show that CoT increases overall VQA performance by up to 7.7% under the same budget, and specifically boosts the performance of apprentice VLMs by up to 36.6%.

[LG-41] Training LLMs with MXFP4 AISTATS2025

Link: https://arxiv.org/abs/2502.20586
Authors: Albert Tseng, Tao Yu, Youngsuk Park
Subjects: Machine Learning (cs.LG)
Comments: AISTATS 2025

Abstract:Low precision (LP) datatypes such as MXFP4 can accelerate matrix multiplications (GEMMs) and reduce training costs. However, directly using MXFP4 instead of BF16 during training significantly degrades model quality. In this work, we present the first near-lossless training recipe that uses MXFP4 GEMMs, which are 2× faster than FP8 on supported hardware. Our key insight is to compute unbiased gradient estimates with stochastic rounding (SR), resulting in more accurate model updates. However, directly applying SR to MXFP4 can result in high variance from block-level outliers, harming convergence. To overcome this, we use the random Hadamard transform to theoretically bound the variance of SR. We train GPT models up to 6.7B parameters and find that our method induces minimal degradation over mixed-precision BF16 training. Our recipe computes 1/2 the training FLOPs in MXFP4, enabling an estimated speedup of 1.3× over FP8 and 1.7× over BF16 during backpropagation.
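
The reason stochastic rounding yields unbiased estimates can be seen in a few lines. The sketch below is an illustration under simplifying assumptions, not the paper's recipe: it rounds to a uniform grid of spacing `step` rather than to the actual MXFP4 format, it omits the random Hadamard transform, and the function names are ours.

```python
import numpy as np


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, rounding up with probability equal
    to the fractional position of x between its two grid neighbours."""
    scaled = np.asarray(x, dtype=np.float64) / step
    low = np.floor(scaled)
    frac = scaled - low                      # in [0, 1)
    round_up = rng.random(scaled.shape) < frac
    return (low + round_up) * step


def nearest_round(x, step):
    """Deterministic round-to-nearest on the same grid, for comparison."""
    return np.round(np.asarray(x, dtype=np.float64) / step) * step


rng = np.random.default_rng(0)
values = np.full(200_000, 0.3)               # every entry lies between 0 and 1
sr_mean = stochastic_round(values, 1.0, rng).mean()
rn_mean = nearest_round(values, 1.0).mean()
```

Round-to-nearest maps every 0.3 to 0, so its mean is 0 and the error never averages out. Stochastic rounding maps each 0.3 to 1 with probability 0.3, so the mean stays close to 0.3 at the cost of extra variance, which is the trade-off the abstract refers to.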

[LG-42] Training Large Neural Networks With Low-Dimensional Error Feedback

Link: https://arxiv.org/abs/2502.20580
Authors: Maher Hanut, Jonathan Kadmon
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Abstract:Training deep neural networks typically relies on backpropagating high-dimensional error signals, a computationally intensive process with little evidence supporting its implementation in the brain. However, since most tasks involve low-dimensional outputs, we propose that low-dimensional error signals may suffice for effective learning. To test this hypothesis, we introduce a novel local learning rule based on Feedback Alignment that leverages indirect, low-dimensional error feedback to train large networks. Our method decouples the backward pass from the forward pass, enabling precise control over error signal dimensionality while maintaining high-dimensional representations. We begin with a detailed theoretical derivation for linear networks, which forms the foundation of our learning framework, and extend our approach to nonlinear, convolutional, and transformer architectures. Remarkably, we demonstrate that even minimal error dimensionality on the order of the task dimensionality can achieve performance matching that of traditional backpropagation. Furthermore, our rule enables efficient training of convolutional networks, which have previously been resistant to Feedback Alignment methods, with minimal error. This breakthrough not only paves the way toward more biologically accurate models of learning but also challenges the conventional reliance on high-dimensional gradient signals in neural network training. Our findings suggest that low-dimensional error signals can be as effective as high-dimensional ones, prompting a reevaluation of gradient-based learning in high-dimensional systems. Ultimately, our work offers a fresh perspective on neural network optimization and contributes to understanding learning mechanisms in both artificial and biological systems.

[LG-43] Stochastic Rounding for LLM Training: Theory and Practice AISTATS2025

Link: https://arxiv.org/abs/2502.20566
Authors: Kaan Ozkara, Tao Yu, Youngsuk Park
Subjects: Machine Learning (cs.LG)
Comments: AISTATS 2025

Abstract:As the parameters of Large Language Models (LLMs) have scaled to hundreds of billions, the demand for efficient training methods – balancing faster computation and reduced memory usage without sacrificing accuracy – has become more critical than ever. In recent years, various mixed precision strategies, which involve different precision levels for optimization components, have been proposed to increase training speed with minimal accuracy degradation. However, these strategies often require manual adjustments and lack theoretical justification. In this work, we leverage stochastic rounding (SR) to address numerical errors of training with low-precision representation. We provide theoretical analyses of implicit regularization and convergence under the Adam optimizer when SR is utilized. With the insights from these analyses, we extend the previous BF16 + SR strategy to be used in distributed settings, enhancing the stability and performance for large-scale training. Empirical results from pre-training models with up to 6.7B parameters, for the first time, demonstrate that our BF16 with SR strategy outperforms (BF16, FP32) mixed precision strategies, achieving better validation perplexity, up to 1.54× higher throughput, and 30% less memory usage.

[LG-44] SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers

Link: https://arxiv.org/abs/2502.20545
Authors: Kechen Li, Wenqi Zhu, Coralia Cartis, Tianbo Ji, Shiwei Liu
Subjects: Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) have achieved human-level proficiency across diverse tasks, but their ability to perform rigorous mathematical problem solving remains an open challenge. In this work, we investigate a fundamental yet computationally intractable problem: determining whether a given multivariate polynomial is nonnegative. This problem, closely related to Hilbert’s Seventeenth Problem, plays a crucial role in global polynomial optimization and has applications in various fields. First, we introduce SoS-1K, a meticulously curated dataset of approximately 1,000 polynomials, along with expert-designed reasoning instructions based on five progressively challenging criteria. Evaluating multiple state-of-the-art LLMs, we find that without structured guidance, all models perform only slightly above the random-guess baseline of 50%. However, high-quality reasoning instructions significantly improve accuracy, boosting performance up to 81%. Furthermore, our 7B model, SoS-7B, fine-tuned on SoS-1K for just 4 hours, outperforms the 671B DeepSeek-V3 and GPT-4o-mini in accuracy while only requiring 1.8% and 5% of the computation time needed by those two models, respectively. Our findings highlight the potential of LLMs to push the boundaries of mathematical reasoning and tackle NP-hard problems.
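
As background for the sum-of-squares idea in the title, a polynomial that can be written as z^T Q z with a positive semidefinite Gram matrix Q over a monomial vector z is a sum of squares and therefore nonnegative. The sketch below checks one hand-picked example numerically. It is an illustration of this standard certificate, not the paper's LLM-based procedure; the polynomial, the matrix `Q`, and the helper names are assumptions chosen for the example. The converse does not hold in general: some nonnegative polynomials are not sums of squares.

```python
import numpy as np


def is_psd(Q, tol=1e-9):
    """True if the symmetric matrix Q has no eigenvalue below -tol."""
    return bool(np.all(np.linalg.eigvalsh(Q) >= -tol))


def sos_terms(Q):
    """Rows L_i with z^T Q z = sum_i (L_i . z)^2, from Q = V diag(w) V^T."""
    w, V = np.linalg.eigh(Q)
    w = np.clip(w, 0.0, None)        # drop tiny negative round-off
    return (V * np.sqrt(w)).T        # row i is sqrt(w_i) * v_i


def p(x, y):
    """Example polynomial: x^2 - 2xy + 2y^2 = (x - y)^2 + y^2."""
    return x**2 - 2 * x * y + 2 * y**2


# Gram matrix of p in the monomial basis z = (x, y).
Q = np.array([[1.0, -1.0],
              [-1.0, 2.0]])
L = sos_terms(Q)
```

For any point `z = np.array([x, y])`, `np.sum((L @ z) ** 2)` reproduces `p(x, y)` up to round-off, which makes the nonnegativity explicit. Finding such a `Q` for a given polynomial is a semidefinite program and is not shown here.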

[LG-45] Data Distributional Properties As Inductive Bias for Systematic Generalization

Link: https://arxiv.org/abs/2502.20499
Authors: Felipe del Río, Alain Raymond-Sáez, Daniel Florea, Rodrigo Toro Icarte, Julio Hurtado, Cristián Buc Calderón, Álvaro Soto
Subjects: Machine Learning (cs.LG)

Abstract:Deep neural networks (DNNs) struggle at systematic generalization (SG). Several studies have evaluated the possibility to promote SG through the proposal of novel architectures, loss functions or training methodologies. Few studies, however, have focused on the role of training data properties in promoting SG. In this work, we investigate the impact of certain data distributional properties, as inductive biases for the SG ability of a multi-modal language model. To this end, we study three different properties. First, data diversity, instantiated as an increase in the possible values a latent property in the training distribution may take. Second, burstiness, where we probabilistically restrict the number of possible values of latent factors on particular inputs during training. Third, latent intervention, where a particular latent factor is altered randomly during training. We find that all three factors significantly enhance SG, with diversity contributing an 89% absolute increase in accuracy in the most affected property. Through a series of experiments, we test various hypotheses to understand why these properties promote SG. Finally, we find that Normalized Mutual Information (NMI) between latent attributes in the training distribution is strongly predictive of out-of-distribution generalization. We find that a mechanism by which lower NMI induces SG is in the geometry of representations. In particular, we find that NMI induces more parallelism in neural representations (i.e., input features coded in parallel neural vectors) of the model, a property related to the capacity of reasoning by analogy.
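
To make the Normalized Mutual Information (NMI) quantity above concrete, the following sketch computes it for two discrete latent attributes. The abstract does not say which normalization the authors use, so dividing by the mean of the two entropies is an assumption here, as are the toy attribute values and the function names.

```python
from collections import Counter

import numpy as np


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of hashable labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    probs = counts / counts.sum()
    return float(-(probs * np.log(probs)).sum())


def nmi(a, b):
    """I(A;B) divided by the mean of H(A) and H(B); 0 if both are constant."""
    h_a, h_b = entropy(a), entropy(b)
    h_ab = entropy(list(zip(a, b)))      # joint entropy of the paired labels
    mutual_info = h_a + h_b - h_ab
    denom = 0.5 * (h_a + h_b)
    return 0.0 if denom == 0 else mutual_info / denom


shapes = ["square", "square", "circle", "circle"]
tied_colors = ["red", "red", "blue", "blue"]     # colour determined by shape
mixed_colors = ["red", "blue", "red", "blue"]    # colour independent of shape
```

With these toy attributes, `nmi(shapes, tied_colors)` is 1 and `nmi(shapes, mixed_colors)` is 0. In the abstract's terms, the second training distribution has the lower NMI between latent attributes.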

[LG-46] Creator-Side Recommender System: Challenges, Designs, and Applications

Link: https://arxiv.org/abs/2502.20497
Authors: Xiaoshuang Chen, Yibo Wang, Yao Wang, Husheng Liu, Kaiqiao Zhan, Ben Wang, Kun Gai
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 9 pages and 9 figures

Abstract:Users and creators are two crucial components of recommender systems. Typical recommender systems focus on the user side, providing the most suitable items based on each user’s request. In such scenarios, a few items receive a majority of exposures, while many items receive very few. This imbalance leads to poorer experiences and decreased activity among the creators receiving less feedback, harming the recommender system in the long term. To this end, we develop a creator-side recommender system, called DualRec, to answer the following question: how to find the most suitable users for each item to enhance the creators’ experience? We show that typical user-side recommendation algorithms, such as retrieval and ranking algorithms, can be adapted into creator-side versions with just a few modifications. This greatly simplifies algorithm design in DualRec. Moreover, we discuss a unique challenge in DualRec: the user availability issue, which is not present in user-side recommender systems. To tackle this issue, we incorporate a user availability calculation (UAC) module to effectively enhance DualRec’s performance. DualRec has already been implemented in Kwai, a short video recommendation system with over 100 million users and over 10 million creators, significantly improving the experience for creators.

[LG-47] Unifying Model Predictive Path Integral Control, Reinforcement Learning, and Diffusion Models for Optimal Control and Planning

Link: https://arxiv.org/abs/2502.20476
Authors: Yankai Li, Mo Chen
Subjects: Machine Learning (cs.LG)

Abstract:Model Predictive Path Integral (MPPI) control, Reinforcement Learning (RL), and Diffusion Models have each demonstrated strong performance in trajectory optimization, decision-making, and motion planning. However, these approaches have traditionally been treated as distinct methodologies with separate optimization frameworks. In this work, we establish a unified perspective that connects MPPI, RL, and Diffusion Models through gradient-based optimization on the Gibbs measure. We first show that MPPI can be interpreted as performing gradient ascent on a smoothed energy function. We then demonstrate that Policy Gradient methods reduce to MPPI when treating policy parameters as control variables under a fixed initial state. Additionally, we establish that the reverse sampling process in diffusion models follows the same update rule as MPPI.
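
For orientation, the basic MPPI update that the abstract builds on can be sketched as follows: sample perturbations of the current control, weight each by the exponential of its negative cost, and move the control to the weighted average. This is a minimal sketch under stated assumptions, not the paper's unified formulation: the quadratic cost, the hyperparameters (`samples`, `sigma`, `lam`), and the function name `mppi_step` are all illustrative choices.

```python
import numpy as np


def mppi_step(u, cost, rng, samples=256, sigma=0.5, lam=1.0):
    """One MPPI update of the control vector u.

    Perturbations with lower cost receive exponentially larger weight
    (a Gibbs / softmax weighting with temperature lam).
    """
    eps = rng.normal(0.0, sigma, size=(samples,) + u.shape)
    costs = np.array([cost(u + e) for e in eps])
    # Subtracting the minimum cost keeps the exponentials well scaled.
    weights = np.exp(-(costs - costs.min()) / lam)
    weights /= weights.sum()
    return u + np.tensordot(weights, eps, axes=1)


target = np.array([1.0, -2.0])


def quadratic_cost(u):
    """Toy cost: squared distance of the control from a fixed target."""
    return float(np.sum((u - target) ** 2))


rng = np.random.default_rng(0)
u = np.zeros(2)
for _ in range(50):
    u = mppi_step(u, quadratic_cost, rng)
```

After the loop, `u` sits near `target`. Each step moves the control toward lower cost without ever computing a gradient explicitly, which is consistent with the abstract's reading of MPPI as gradient ascent on a smoothed energy function.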

[LG-48] MobiLLM: Enabling LLM Fine-Tuning on the Mobile Device via Server Assisted Side Tuning

Link: https://arxiv.org/abs/2502.20421
Authors: Liang Li, Xingke Yang, Wen Wu, Hao Wang, Tomoaki Ohtsuki, Xin Fu, Miao Pan, Xuemin Shen
Subjects: Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) on mobile devices and their potential applications never fail to fascinate. However, on-device LLM fine-tuning poses great challenges due to extremely high memory requirements and slow training speeds. Even with parameter-efficient fine-tuning (PEFT) methods that update only a small subset of parameters, resource-constrained mobile devices cannot afford them. In this paper, we propose MobiLLM to enable memory-efficient transformer LLM fine-tuning on a mobile device via server-assisted side-tuning. In particular, MobiLLM allows the resource-constrained mobile device to retain merely a frozen backbone model, while offloading the memory- and computation-intensive backpropagation of a trainable side-network to a high-performance server. Unlike existing fine-tuning methods that keep trainable parameters inside the frozen backbone, MobiLLM separates a set of parallel adapters from the backbone to create a backpropagation bypass, involving only one-way activation transfers from the mobile device to the server with low-width quantization during forward propagation. In this way, the data never leaves the mobile device, the device can skip backpropagation through the local backbone model, and its forward propagation can be parallelized with the server-side execution. Thus, MobiLLM preserves data privacy while significantly reducing the memory and computational burdens for LLM fine-tuning. Through extensive experiments, we demonstrate that MobiLLM can enable a resource-constrained mobile device, even a CPU-only one, to fine-tune LLMs and significantly reduce convergence time and memory usage.

[LG-49] Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks

Link: https://arxiv.org/abs/2502.21269
Authors: Andrea Montanari, Pierfrancesco Urbani
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Comments: 89 pages; 62 pdf figures

Abstract:The inductive bias and generalization properties of large machine learning models are – to a substantial extent – a byproduct of the optimization algorithm used for training. Among others, the scale of the random initialization, the learning rate, and early stopping all have crucial impact on the quality of the model learnt by stochastic gradient descent or related algorithms. In order to understand these phenomena, we study the training dynamics of large two-layer neural networks. We use a well-established technique from non-equilibrium statistical physics (dynamical mean field theory) to obtain an asymptotic high-dimensional characterization of this dynamics. This characterization applies to a Gaussian approximation of the hidden neurons’ non-linearity, and empirically captures well the behavior of actual neural network models. Our analysis uncovers several interesting new phenomena in the training dynamics: (i) The emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity; (ii) As a consequence, algorithmic inductive bias towards small complexity, but only if the initialization has small enough complexity; (iii) A separation of time scales between feature learning and overfitting; (iv) A non-monotone behavior of the test error and, correspondingly, a ‘feature unlearning’ phase at large times.

[LG-50] Class prior estimation for positive-unlabeled learning when label shift occurs

Link: https://arxiv.org/abs/2502.21194
Authors: Jan Mielniczuk, Wojciech Rejchel, Paweł Teisseyre
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We study estimation of class prior for unlabeled target samples which is possibly different from that of source population. It is assumed that for the source data only samples from positive class and from the whole population are available (PU learning scenario). We introduce a novel direct estimator of class prior which avoids estimation of posterior probabilities and has a simple geometric interpretation. It is based on a distribution matching technique together with kernel embedding and is obtained as an explicit solution to an optimisation task. We establish its asymptotic consistency as well as a non-asymptotic bound on its deviation from the unknown prior, which is calculable in practice. We study finite sample behaviour for synthetic and real data and show that the proposal, together with a suitably modified version for large values of source prior, works on par or better than its competitors.

[LG-51] Microscopic Propagator Imaging (MPI) with Diffusion MRI

Link: https://arxiv.org/abs/2502.21129
Authors: Tommaso Zajac, Gloria Menegaz, Marco Pizzolato
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Medical Physics (physics.med-ph)

Abstract:We propose Microscopic Propagator Imaging (MPI) as a novel method to retrieve the indices of the microscopic propagator which is the probability density function of water displacements due to diffusion within the nervous tissue microstructures. Unlike the Ensemble Average Propagator indices or the Diffusion Tensor Imaging metrics, MPI indices are independent from the mesoscopic organization of the tissue such as the presence of multiple axonal bundle directions and orientation dispersion. As a consequence, MPI indices are more specific to the volumes, sizes, and types of microstructures, like axons and cells, that are present in the tissue. Thus, changes in MPI indices can be more directly linked to alterations in the presence and integrity of microstructures themselves. The methodology behind MPI is rooted on zonal modeling of spherical harmonics, signal simulation, and machine learning regression, and is demonstrated on both synthetic and Human Diffusion MRI data.

[LG-52] The two filter formula reconsidered: Smoothing in partially observed Gauss–Markov models without information parametrization

Link: https://arxiv.org/abs/2502.21116
Authors: Filip Tronarp
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
Comments: 14 pages, 2 figures

Abstract:In this article, the two filter formula is re-examined in the setting of partially observed Gauss–Markov models. It is traditionally formulated as a filter running backward in time, where the Gaussian density is parametrized in "information form". However, the quantity in the backward recursion is strictly speaking not a distribution, but a likelihood. Taking this observation seriously, a recursion over log-quadratic likelihoods is formulated instead, which obviates the need for "information" parametrization. In particular, it greatly simplifies the square-root formulation of the algorithm. Furthermore, formulae are given for producing the forward Markov representation of the a posteriori distribution over paths from the proposed likelihood representation.

[LG-53] Quantum-aware Transformer model for state classification

Link: https://arxiv.org/abs/2502.21055
Authors: Przemysław Sekuła, Michał Romaszewski, Przemysław Głomb, Michał Cholewa, Łukasz Pawela
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 13 pages, 1 figure

Abstract:Entanglement is a fundamental feature of quantum mechanics, playing a crucial role in quantum information processing. However, classifying entangled states, particularly in the mixed-state regime, remains a challenging problem, especially as system dimensions increase. In this work, we focus on bipartite quantum states and present a data-driven approach to entanglement classification using transformer-based neural networks. Our dataset consists of a diverse set of bipartite states, including pure separable states, Werner entangled states, general entangled states, and maximally entangled states. We pretrain the transformer in an unsupervised fashion by masking elements of vectorized Hermitian matrix representations of quantum states, allowing the model to learn structural properties of quantum density matrices. This approach enables the model to generalize entanglement characteristics across different classes of states. Once trained, our method achieves near-perfect classification accuracy, effectively distinguishing between separable and entangled states. Compared to previous Machine Learning, our method successfully adapts transformers for quantum state analysis, demonstrating their ability to systematically identify entanglement in bipartite systems. These results highlight the potential of modern machine learning techniques in automating entanglement detection and classification, bridging the gap between quantum information theory and artificial intelligence.

[LG-54] AutoQML: A Framework for Automated Quantum Machine Learning

Link: https://arxiv.org/abs/2502.21025
Authors: Marco Roth, David A. Kreplin, Daniel Basilewitsch, João F. Bravo, Dennis Klau, Milan Marinov, Daniel Pranjic, Horst Stuehler, Moritz Willmann, Marc-André Zöller
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures

Abstract:Automated Machine Learning (AutoML) has significantly advanced the efficiency of ML-focused software development by automating hyperparameter optimization and pipeline construction, reducing the need for manual intervention. Quantum Machine Learning (QML) offers the potential to surpass classical machine learning (ML) capabilities by utilizing quantum computing. However, the complexity of QML presents substantial entry barriers. We introduce AutoQML, a novel framework that adapts the AutoML approach to QML, providing a modular and unified programming interface to facilitate the development of QML pipelines. AutoQML leverages the QML library sQUlearn to support a variety of QML algorithms. The framework is capable of constructing end-to-end pipelines for supervised learning tasks, ensuring accessibility and efficacy. We evaluate AutoQML across four industrial use cases, demonstrating its ability to generate high-performing QML pipelines that are competitive with both classical ML models and manually crafted quantum solutions.

[LG-55] Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking)

Link: https://arxiv.org/abs/2502.21009
Authors: Yoonsoo Nam, Seok Hyeong Lee, Clementine Domine, Yea Chan Park, Charles London, Wonyl Choi, Niclas Goring, Seungjai Lee
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)

Abstract:In physics, complex systems are often simplified into minimal, solvable models that retain only the core principles. In machine learning, layerwise linear models (e.g., linear neural networks) act as simplified representations of neural network dynamics. These models follow the dynamical feedback principle, which describes how layers mutually govern and amplify each other’s evolution. This principle extends beyond the simplified models, successfully explaining a wide range of dynamical phenomena in deep neural networks, including neural collapse, emergence, lazy and rich regimes, and grokking. In this position paper, we call for the use of layerwise linear models retaining the core principles of neural dynamical phenomena to accelerate the science of deep learning.

[LG-56] Post-Hoc Uncertainty Quantification in Pre-Trained Neural Networks via Activation-Level Gaussian Processes

Link: https://arxiv.org/abs/2502.20966
Authors: Richard Bergna, Stefan Depeweg, Sergio Calvo Ordonez, Jonathan Plenk, Alvaro Cartea, Jose Miguel Hernandez-Lobato
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures, 7th Symposium on Advances in Approximate Bayesian Inference

Abstract:Uncertainty quantification in neural networks through methods such as Dropout, Bayesian neural networks and Laplace approximations is either prone to underfitting or computationally demanding, rendering these approaches impractical for large-scale datasets. In this work, we address these shortcomings by shifting the focus from uncertainty in the weight space to uncertainty at the activation level, via Gaussian processes. More specifically, we introduce the Gaussian Process Activation function (GAPA) to capture neuron-level uncertainties. Our approach operates in a post-hoc manner, preserving the original mean predictions of the pre-trained neural network and thereby avoiding the underfitting issues commonly encountered in previous methods. We propose two methods. The first, GAPA-Free, employs empirical kernel learning from the training data for the hyperparameters and is highly efficient during training. The second, GAPA-Variational, learns the hyperparameters via gradient descent on the kernels, thus affording greater flexibility. Empirical results demonstrate that GAPA-Variational outperforms the Laplace approximation on most datasets in at least one of the uncertainty quantification metrics.

[LG-57] Large Language Models Are Innate Crystal Structure Generators

Link: https://arxiv.org/abs/2502.20933
Authors: Jingru Gan, Peichen Zhong, Yuanqi Du, Yanqiao Zhu, Chenru Duan, Haorui Wang, Carla P. Gomes, Kristin A. Persson, Daniel Schwalbe-Koda, Wei Wang
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: Preprint, 18 pages

Abstract:Crystal structure generation is fundamental to materials discovery, enabling the prediction of novel materials with desired properties. While existing approaches leverage Large Language Models (LLMs) through extensive fine-tuning on materials databases, we show that pre-trained LLMs can inherently generate stable crystal structures without additional training. Our novel framework MatLLMSearch integrates pre-trained LLMs with evolutionary search algorithms, achieving a 78.38% metastable rate validated by machine learning interatomic potentials and 31.7% DFT-verified stability via quantum mechanical calculations, outperforming specialized models such as CrystalTextLLM. Beyond crystal structure generation, we further demonstrate that our framework can be readily adapted to diverse materials design tasks, including crystal structure prediction and multi-objective optimization of properties such as deformation energy and bulk modulus, all without fine-tuning. These results establish pre-trained LLMs as versatile and effective tools for materials discovery, opening up new venues for crystal structure generation with reduced computational overhead and broader accessibility.

[LG-58] Amortized Conditional Independence Testing PAKDD2025

Link: https://arxiv.org/abs/2502.20925
Authors: Bao Duong, Nu Hoang, Thin Nguyen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted at PAKDD 2025

Abstract:Testing for the conditional independence structure in data is a fundamental and critical task in statistics and machine learning, which finds natural applications in causal discovery - a highly relevant problem to many scientific disciplines. Existing methods seek to design explicit test statistics that quantify the degree of conditional dependence, which is highly challenging yet cannot capture nor utilize prior knowledge in a data-driven manner. In this study, an entirely new approach is introduced, where we instead propose to amortize conditional independence testing and devise ACID - a novel transformer-based neural network architecture that learns to test for conditional independence. ACID can be trained on synthetic data in a supervised learning fashion, and the learned model can then be applied to any dataset of similar natures or adapted to new domains by fine-tuning with a negligible computational cost. Our extensive empirical evaluations on both synthetic and real data reveal that ACID consistently achieves state-of-the-art performance against existing baselines under multiple metrics, and is able to generalize robustly to unseen sample sizes, dimensionalities, as well as non-linearities with a remarkably low inference time.

[LG-59] Hamiltonian Neural Networks approach to fuzzball geodesics

Link: https://arxiv.org/abs/2502.20881
Authors: Andrea Cipriani, Alessandro De Santis, Giorgio Di Russo, Alfredo Grillo, Luca Tabarroni
Subjects: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
Comments: 25 pages + Appendices, 39 figures

Abstract:The recent increase in computational resources and data availability has led to a significant rise in the use of Machine Learning (ML) techniques for data analysis in physics. However, the application of ML methods to solve differential equations capable of describing even complex physical systems is not yet fully widespread in theoretical high-energy physics. Hamiltonian Neural Networks (HNNs) are tools that minimize a loss function defined to solve Hamilton equations of motion. In this work, we implement several HNNs trained to solve, with high accuracy, the Hamilton equations for a massless probe moving inside a smooth and horizonless geometry known as D1-D5 circular fuzzball. We study both planar (equatorial) and non-planar geodesics in different regimes according to the impact parameter, some of which are unstable. Our findings suggest that HNNs could eventually replace standard numerical integrators, as they are equally accurate but more reliable in critical situations.

[LG-60] Enhanced Derivative-Free Optimization Using Adaptive Correlation-Induced Finite Difference Estimators

Link: https://arxiv.org/abs/2502.20819
Authors: Guo Liang, Guangwu Liu, Kun Zhang
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Finance (q-fin.CP); Machine Learning (stat.ML)

Abstract:Gradient-based methods are well-suited for derivative-free optimization (DFO), where finite-difference (FD) estimates are commonly used as gradient surrogates. Traditional stochastic approximation methods, such as Kiefer-Wolfowitz (KW) and simultaneous perturbation stochastic approximation (SPSA), typically utilize only two samples per iteration, resulting in imprecise gradient estimates and necessitating diminishing step sizes for convergence. In this paper, we first explore an efficient FD estimate, referred to as correlation-induced FD estimate, which is a batch-based estimate. Then, we propose an adaptive sampling strategy that dynamically determines the batch size at each iteration. By combining these two components, we develop an algorithm designed to enhance DFO in terms of both gradient estimation efficiency and sample efficiency. Furthermore, we establish the consistency of our proposed algorithm and demonstrate that, despite using a batch of samples per iteration, it achieves the same convergence rate as the KW and SPSA methods. Additionally, we propose a novel stochastic line search technique to adaptively tune the step size in practice. Finally, comprehensive numerical experiments confirm the superior empirical performance of the proposed algorithm.
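For readers unfamiliar with the baseline this abstract compares against, the following is a minimal sketch of the standard two-sample SPSA gradient estimate and a plainly averaged batch of such estimates. It is not the paper's correlation-induced estimator or its adaptive batch-size rule, neither of which is specified in the abstract. The function names, the noise-free `sphere` objective, and the perturbation size `c` are assumptions chosen for illustration.

```python
import random

def spsa_gradient(f, x, c, rng):
    """One two-sample SPSA gradient estimate of f at x (standard SPSA, not the paper's estimator)."""
    # Rademacher perturbation: each coordinate is +1 or -1 with equal probability.
    delta = [rng.choice((-1.0, 1.0)) for _ in x]
    x_plus = [xi + c * di for xi, di in zip(x, delta)]
    x_minus = [xi - c * di for xi, di in zip(x, delta)]
    # Only two function evaluations are used, whatever the dimension of x.
    diff = f(x_plus) - f(x_minus)
    return [diff / (2.0 * c * di) for di in delta]

def batched_spsa_gradient(f, x, c, batch_size, rng):
    """Plain average of batch_size independent SPSA estimates (hypothetical helper)."""
    total = [0.0] * len(x)
    for _ in range(batch_size):
        g = spsa_gradient(f, x, c, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / batch_size for t in total]

def sphere(x):
    """Illustrative objective with known gradient 2 * x."""
    return sum(xi * xi for xi in x)

rng = random.Random(0)
single = spsa_gradient(sphere, [1.0, 2.0], 0.1, rng)
averaged = batched_spsa_gradient(sphere, [1.0, 2.0], 0.1, 4000, rng)
```

A single estimate is noisy in more than one dimension because every coordinate shares the same function difference; averaging a batch reduces that variance, which is the general motivation for batch-based estimators.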

[LG-61] Towards Ultimate NMR Resolution with Deep Learning

Link: https://arxiv.org/abs/2502.20793
Authors: Amir Jahangiri, Tatiana Agback, Ulrika Brath, Vladislav Orekhov
Subjects: Biological Physics (physics.bio-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)

Abstract:In multidimensional NMR spectroscopy, practical resolution is defined as the ability to distinguish and accurately determine signal positions against a background of overlapping peaks, thermal noise, and spectral artifacts. In the pursuit of ultimate resolution, we introduce Peak Probability Presentations ($P^3$), a statistical spectral representation that assigns a probability to each spectral point, indicating the likelihood of a peak maximum occurring at that location. The mapping between the spectrum and $P^3$ is achieved using MR-Ai, a physics-inspired deep learning neural network architecture, designed to handle multidimensional NMR spectra. Furthermore, we demonstrate that MR-Ai enables coprocessing of multiple spectra, facilitating direct information exchange between datasets. This feature significantly enhances spectral quality, particularly in cases of highly sparse sampling. Performance of MR-Ai and high value of the $P^3$ are demonstrated on the synthetic data and spectra of Tau, MATL1, Calmodulin, and several other proteins.

[LG-62] Minimax Optimal Kernel Two-Sample Tests with Random Features

Link: https://arxiv.org/abs/2502.20755
Authors: Soumya Mukherjee, Bharath K. Sriperumbudur
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 82 pages, 10 figures, 5 tables

Abstract:Reproducing Kernel Hilbert Space (RKHS) embedding of probability distributions has proved to be an effective approach, via MMD (maximum mean discrepancy) for nonparametric hypothesis testing problems involving distributions defined over general (non-Euclidean) domains. While a substantial amount of work has been done on this topic, only recently, minimax optimal two-sample tests have been constructed that incorporate, unlike MMD, both the mean element and a regularized version of the covariance operator. However, as with most kernel algorithms, the computational complexity of the optimal test scales cubically in the sample size, limiting its applicability. In this paper, we propose a spectral regularized two-sample test based on random Fourier feature (RFF) approximation and investigate the trade-offs between statistical optimality and computational efficiency. We show the proposed test to be minimax optimal if the approximation order of RFF (which depends on the smoothness of the likelihood ratio and the decay rate of the eigenvalues of the integral operator) is sufficiently large. We develop a practically implementable permutation-based version of the proposed test with a data-adaptive strategy for selecting the regularization parameter and the kernel. Finally, through numerical experiments on simulated and benchmark datasets, we demonstrate that the proposed RFF-based test is computationally efficient and performs almost similar (with a small drop in power) to the exact test.
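To make the computational trade-off in this abstract concrete, here is a minimal sketch of a random Fourier feature approximation to the squared MMD with a Gaussian kernel. It shows only the plain (unregularized, biased) MMD estimate, not the paper's spectral regularized statistic or its permutation test. The function names, the bandwidth, the feature count, and the sample sizes are assumptions for illustration.

```python
import math
import random

def sample_rff(dim, num_features, bandwidth, rng):
    """Draw frequencies w ~ N(0, bandwidth^-2 I) and phases b ~ U[0, 2*pi] for a Gaussian kernel."""
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return freqs, phases

def mean_embedding(points, freqs, phases):
    """Average the feature map z(x) = sqrt(2/D) * cos(w.x + b) over a sample."""
    scale = math.sqrt(2.0 / len(freqs))
    mean = [0.0] * len(freqs)
    for x in points:
        for j, (w, b) in enumerate(zip(freqs, phases)):
            mean[j] += scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [m / len(points) for m in mean]

def rff_mmd2(xs, ys, freqs, phases):
    """Squared distance between the two approximate mean embeddings."""
    mx = mean_embedding(xs, freqs, phases)
    my = mean_embedding(ys, freqs, phases)
    return sum((a - b) ** 2 for a, b in zip(mx, my))

rng = random.Random(0)
freqs, phases = sample_rff(dim=1, num_features=200, bandwidth=1.0, rng=rng)
xs = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
ys = [[rng.gauss(3.0, 1.0)] for _ in range(200)]  # same shape, shifted mean
same = rff_mmd2(xs, xs, freqs, phases)
shifted = rff_mmd2(xs, ys, freqs, phases)
```

With `D` features and `n` points per sample, this costs on the order of `n * D` operations instead of the `n^2` kernel evaluations needed by the exact MMD, at the price of an approximation error that shrinks as `D` grows.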

[LG-63] Learning Dynamics of Deep Linear Networks Beyond the Edge of Stability ICLR2025

Link: https://arxiv.org/abs/2502.20531
Authors: Avrajit Ghosh, Soo Min Kwon, Rongrong Wang, Saiprasad Ravishankar, Qing Qu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Published in ICLR 2025

Abstract:Deep neural networks trained using gradient descent with a fixed learning rate $\eta$ often operate in the regime of “edge of stability” (EOS), where the largest eigenvalue of the Hessian equilibrates about the stability threshold $2/\eta$. In this work, we present a fine-grained analysis of the learning dynamics of (deep) linear networks (DLNs) within the deep matrix factorization loss beyond EOS. For DLNs, loss oscillations beyond EOS follow a period-doubling route to chaos. We theoretically analyze the regime of the 2-period orbit and show that the loss oscillations occur within a small subspace, with the dimension of the subspace precisely characterized by the learning rate. The crux of our analysis lies in showing that the symmetry-induced conservation law for gradient flow, defined as the balancing gap among the singular values across layers, breaks at EOS and decays monotonically to zero. Overall, our results contribute to explaining two key phenomena in deep networks: (i) shallow models and simple tasks do not always exhibit EOS; and (ii) oscillations occur within top features. We present experiments to support our theory, along with examples demonstrating how these phenomena occur in nonlinear networks and how they differ from those which have benign landscape such as in DLNs.
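The stability threshold mentioned in this abstract comes from a classical fact about gradient descent on a quadratic, sketched below. On the one-dimensional loss 0.5 * curvature * w^2, each step multiplies w by (1 - lr * curvature), so the iterates shrink only when the curvature is below 2 / lr. This toy example illustrates the threshold alone, not the paper's deep linear network analysis; the function name and the numeric values are assumptions for illustration.

```python
def run_gd(curvature, lr, steps, w0=1.0):
    """Gradient descent on L(w) = 0.5 * curvature * w**2; returns |w| after `steps` updates."""
    w = w0
    for _ in range(steps):
        grad = curvature * w  # derivative of the quadratic loss
        w = w - lr * grad     # equivalent to w *= (1 - lr * curvature)
    return abs(w)

lr = 0.1
threshold = 2.0 / lr  # curvature above which the iterates diverge

below = run_gd(curvature=19.0, lr=lr, steps=50)  # multiplier magnitude 0.9, iterates shrink
above = run_gd(curvature=21.0, lr=lr, steps=50)  # multiplier magnitude 1.1, iterates grow
```

In both runs the sign of w flips at every step, which is the oscillation that appears near the threshold; only the magnitude decides between convergence and divergence.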

[LG-64] Transfer Learning through Enhanced Sufficient Representation: Enriching Source Domain Knowledge with Target Data

Link: https://arxiv.org/abs/2502.20414
Authors: Yeheng Ge, Xueyu Zhou, Jian Huang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 44 pages

Abstract:Transfer learning is an important approach for addressing the challenges posed by limited data availability in various applications. It accomplishes this by transferring knowledge from well-established source domains to a less familiar target domain. However, traditional transfer learning methods often face difficulties due to rigid model assumptions and the need for a high degree of similarity between source and target domain models. In this paper, we introduce a novel method for transfer learning called Transfer learning through Enhanced Sufficient Representation (TESR). Our approach begins by estimating a sufficient and invariant representation from the source domains. This representation is then enhanced with an independent component derived from the target data, ensuring that it is sufficient for the target domain and adaptable to its specific characteristics. A notable advantage of TESR is that it does not rely on assuming similar model structures across different tasks. For example, the source domain models can be regression models, while the target domain task can be classification. This flexibility makes TESR applicable to a wide range of supervised learning problems. We explore the theoretical properties of TESR and validate its performance through simulation studies and real-world data applications, demonstrating its effectiveness in finite sample settings.

Information Retrieval

[IR-0] Joint Modeling in Recommendations: A Survey

Link: https://arxiv.org/abs/2502.21195
Authors: Xiangyu Zhao, Yichao Wang, Bo Chen, Jingtong Gao, Yuhao Wang, Xiaopeng Li, Pengyue Jia, Qidong Liu, Huifeng Guo, Ruiming Tang
Subjects: Information Retrieval (cs.IR)
Comments: arXiv admin note: text overlap with arXiv:2302.03525

Abstract:In today’s digital landscape, Deep Recommender Systems (DRS) play a crucial role in navigating and customizing online content for individual preferences. However, conventional methods, which mainly depend on single recommendation task, scenario, data modality and user behavior, are increasingly seen as insufficient due to their inability to accurately reflect users’ complex and changing preferences. This gap underscores the need for joint modeling approaches, which are central to overcoming these limitations by integrating diverse tasks, scenarios, modalities, and behaviors in the recommendation process, thus promising significant enhancements in recommendation precision, efficiency, and customization. In this paper, we comprehensively survey the joint modeling methods in recommendations. We begin by defining the scope of joint modeling through four distinct dimensions: multi-task, multi-scenario, multi-modal, and multi-behavior modeling. Subsequently, we examine these methods in depth, identifying and summarizing their underlying paradigms based on the latest advancements and potential research trajectories. Ultimately, we highlight several promising avenues for future exploration in joint modeling for recommendations and provide a concise conclusion to our findings.

[IR-1] The RAG Paradox: A Black-Box Attack Exploiting Unintentional Vulnerabilities in Retrieval-Augmented Generation Systems

Link: https://arxiv.org/abs/2502.20995
Authors: Chanwoo Choi, Jinsoo Kim, Sukmin Cho, Soyeong Jeong, Buru Chang
Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Abstract:With the growing adoption of retrieval-augmented generation (RAG) systems, recent studies have introduced attack methods aimed at degrading their performance. However, these methods rely on unrealistic white-box assumptions, such as attackers having access to RAG systems’ internal processes. To address this issue, we introduce a realistic black-box attack scenario based on the RAG paradox, where RAG systems inadvertently expose vulnerabilities while attempting to enhance trustworthiness. Because RAG systems reference external documents during response generation, our attack targets these sources without requiring internal access. Our approach first identifies the external sources disclosed by RAG systems and then automatically generates poisoned documents with misinformation designed to match these sources. Finally, these poisoned documents are newly published on the disclosed sources, disrupting the RAG system’s response generation process. Both offline and online experiments confirm that this attack significantly reduces RAG performance without requiring internal access. Furthermore, from an insider perspective within the RAG system, we propose a re-ranking method that acts as a fundamental safeguard, offering minimal protection against unforeseen attacks.

[IR-2] Variations in Relevance Judgments and the Shelf Life of Test Collections

Link: https://arxiv.org/abs/2502.20937
Authors: Andrew Parry, Maik Fröbe, Harrisen Scells, Ferdinand Schlatt, Guglielmo Faggioli, Saber Zerhoudi, Sean MacAvaney, Eugene Yang
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages, 6 tables, 5 figures

Abstract:The fundamental property of Cranfield-style evaluations, that system rankings are stable even when assessors disagree on individual relevance decisions, was validated on traditional test collections. However, the paradigm shift towards neural retrieval models affected the characteristics of modern test collections, e.g., documents are short, judged with four grades of relevance, and information needs have no descriptions or narratives. Under these changes, it is unclear whether assessor disagreement remains negligible for system comparisons. We investigate this aspect under the additional condition that the few modern test collections are heavily re-used. Given more possible query interpretations due to less formalized information needs, an "expiration date" for test collections might be needed if top-effectiveness requires overfitting to a single interpretation of relevance. We run a reproducibility study and re-annotate the relevance judgments of the 2019 TREC Deep Learning track. We can reproduce prior work in the neural retrieval setting, showing that assessor disagreement does not affect system rankings. However, we observe that some models substantially degrade with our new relevance judgments, and some have already reached the effectiveness of humans as rankers, providing evidence that test collections can expire.

[IR-3] Scalable Overload-Aware Graph-Based Index Construction for 10-Billion-Scale Vector Similarity Search WWW’25

Link: https://arxiv.org/abs/2502.20695
Authors: Yang Shi, Yiping Sun, Jiaolong Du, Xiaocheng Zhong, Zhiyong Wang, Yao Hu
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by WWW’25

Abstract:Approximate Nearest Neighbor Search (ANNS) is essential for modern data-driven applications that require efficient retrieval of top-k results from massive vector databases. Although existing graph-based ANNS algorithms achieve a high recall rate on billion-scale datasets, their slow construction speed and limited scalability hinder their applicability to large-scale industrial scenarios. In this paper, we introduce SOGAIC, the first Scalable Overload-Aware Graph-Based ANNS Index Construction system tailored for ultra-large-scale vector databases: 1) We propose a dynamic data partitioning algorithm with overload constraints that adaptively introduces overlaps among subsets; 2) To enable efficient distributed subgraph construction, we employ a load-balancing task scheduling framework combined with an agglomerative merging strategy; 3) Extensive experiments on various datasets demonstrate a reduction of 47.3% in average construction time compared to existing methods. The proposed method has also been successfully deployed in a real-world industrial search engine, managing over 10 billion daily updated vectors and serving hundreds of millions of users.

[IR-4] CS-PaperSum: A Large-Scale Dataset of AI-Generated Summaries for Scientific Papers

Link: https://arxiv.org/abs/2502.20582
Authors: Javin Liu, Aryan Vats, Zihao He
Subjects: Information Retrieval (cs.IR)

Abstract:The rapid expansion of scientific literature in computer science presents challenges in tracking research trends and extracting key insights. Existing datasets provide metadata but lack structured summaries that capture core contributions and methodologies. We introduce CS-PaperSum, a large-scale dataset of 91,919 papers from 31 top-tier computer science conferences, enriched with AI-generated structured summaries using ChatGPT. To assess summary quality, we conduct embedding alignment analysis and keyword overlap analysis, demonstrating strong preservation of key concepts. We further present a case study on AI research trends, highlighting shifts in methodologies and interdisciplinary crossovers, including the rise of self-supervised learning, retrieval-augmented generation, and multimodal AI. Our dataset enables automated literature analysis, research trend forecasting, and AI-driven scientific discovery, providing a valuable resource for researchers, policymakers, and scientific information retrieval systems.

Attachment Download

Click to download the full list of today's papers